Unmapped page cache pages can be demoted to low-tier memory, but
they can presently only be promoted under two conditions:
1) The page is fully swapped out and re-faulted
2) The page becomes mapped (and exposed to NUMA hint faults)
This RFC proposes promoting unmapped page cache pages by using
folio_mark_accessed as a hotness hint for unmapped pages.
Patches 1-3
allow NULL as valid input to migration prep interfaces
for vmf/vma - which are not present for unmapped folios.
Patch 4
adds NUMA_HINT_PAGE_CACHE to vmstat
Patch 5
adds the promotion mechanism, along with a sysfs
extension which defaults the behavior to off.
/sys/kernel/mm/numa/pagecache_promotion_enabled
Functional test showed that we are able to reclaim some performance
in canned scenarios (a file gets demoted and becomes hot with
relatively little contention). See test/overhead section below.
v2
- clean up the first commit to be accurate and incorporate Ying's feedback
- clean up NUMA_HINT_ define usage
- add NUMA_HINT_ type selection macro to keep code clean
- mild comment updates
Open Questions:
======
1) Should we also add a limit to how much can be forced onto
a single task's promotion list at any one time? This might
piggy-back on the existing TPP promotion limit (256MB?) and
would simply add something like task->promo_count.
Technically we are limited by the batch read-rate before a
TASK_RESUME occurs.
2) Should we exempt certain forms of folios, or add additional
knobs/levers to deal with things like large folios?
3) We added NUMA_HINT_PAGE_CACHE to differentiate hint faults
so we could validate the behavior works as intended. Should
we just call this a NUMA_HINT_FAULT and not add a new hint?
4) Benchmark suggestions that can pressure 1TB memory. This is
not my typical wheelhouse, so if folks know of a useful
benchmark that can pressure my 1TB (768 DRAM / 256 CXL) setup,
I'd like to add additional measurements here.
Development Notes
=================
During development, we explored the following proposals:
1) directly promoting within folio_mark_accessed (FMA)
Originally suggested by Johannes Weiner
https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/
This caused deadlocks because the PTL was held in a variety of
cases - in particular during task exit. It is also incredibly
inflexible and causes promotion-on-fault.
It was discussed that a deferral mechanism was preferred.
2) promoting in filemap.c locations (calls of FMA)
Originally proposed by Feng Tang and Ying Huang
https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
First, we saw this as less problematic than directly hooking FMA,
but we realized this has the potential to miss data in a variety of
locations: swap.c, memory.c, gup.c, ksm.c, paddr.c, etc.
Second, we discovered that the lock state of pages is very subtle,
and that these locations in filemap.c can be called in an atomic
context. Prototypes led to a variety of stalls and lockups.
3) a new LRU - originally proposed by Keith Busch
https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7
There are two issues with this approach: PG_promotable and reclaim.
First - PG_promotable has generally been discouraged.
Second - Attaching this mechanism to an LRU is both backwards and
counter-intuitive. A promotable list is better served by a MOST
recently used list, and since LRUs are generally only shrunk when
exposed to pressure, it would require implementing a new promotion
list shrinker that runs separately from the existing reclaim logic.
4) Adding a separate kthread - suggested by many
This is - to an extent - a more general version of the LRU proposal.
We still have to track the folios - which likely requires the
addition of a page flag. Additionally, this method would actually
contend pretty heavily with LRU behavior - i.e. we'd want to
throttle addition to the promotion candidate list in some scenarios.
5) Doing it in task work
This seemed to be the most realistic after considering the above.
We observe the following:
- FMA is an ideal hook for this and isolation is safe here
- the new promotion_candidate function is an ideal hook for new
filter logic (throttling, fairness, etc).
- isolated folios are either promoted or put back on task resume;
there are no additional concurrency mechanics to worry about
- the mechanism can be made optional via a sysfs hook to avoid
overhead in degenerate scenarios (thrashing).
We also piggy-backed on the numa_hint_fault_latency timestamp to
further throttle promotions, to avoid promoting pages that see only
one or two accesses.
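For reference, the intended flow looks roughly like this. This is a
sketch, not the literal patches: folio_should_promote() and the
promo_list/promo_work task fields are illustrative stand-ins, while
promotion_candidate() and migrate_misplaced_folio() are the real
entry points used by the series.

	/* Nomination from the folio_mark_accessed() path. */
	void promotion_candidate(struct folio *folio)
	{
		/*
		 * Filter/throttle hook: hotness via the
		 * numa_hint_fault_latency timestamp, rate limits, etc.
		 */
		if (!folio_should_promote(folio))
			return;
		if (!folio_isolate_lru(folio))	/* pin it off the LRU */
			return;
		/* As posted; see the list_add_tail discussion below. */
		list_add(&folio->lru, &current->promo_list);
		/* Queueing details elided - only queue once per resume. */
		task_work_add(current, &current->promo_work, TWA_RESUME);
	}

	/* Task work: runs as the task returns to userspace. */
	void promo_task_work(struct callback_head *head)
	{
		struct folio *folio, *tmp;
		int nid = numa_node_id();

		list_for_each_entry_safe(folio, tmp, &current->promo_list, lru) {
			list_del_init(&folio->lru);
			/* Promotes the folio or puts it back on the LRU. */
			migrate_misplaced_folio(folio, NULL, nid);
		}
	}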
Test:
======
Environment:
1.5-3.7GHz CPU, ~4000 BogoMIPS,
1TB Machine with 768GB DRAM and 256GB CXL
A 64GB file being linearly read by 6-7 Python processes
Goal:
Generate promotions. Demonstrate stability and measure overhead.
System Settings:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
echo 2 > /proc/sys/kernel/numa_balancing
Each process took up ~128GB, with anonymous memory growing and
shrinking as Python filled and released buffers with the 64GB of data.
This causes DRAM pressure to generate demotions, and file pages to
"become hot" - and therefore be selected for promotion.
First we ran with promotion disabled to show the consistent overhead
that results from forcing a file out to CXL memory. We first ran a
single reader to see uncontended performance, launched many readers
to force demotions, then dropped back to a single reader to observe.
Single-reader DRAM: ~16.0-16.4s
Single-reader CXL (after demotion): ~16.8-17s
Next we turned promotion on with only a single reader running.
Before promotions:
Node 0 MemFree: 636478112 kB
Node 0 FilePages: 59009156 kB
Node 1 MemFree: 250336004 kB
Node 1 FilePages: 14979628 kB
After promotions:
Node 0 MemFree: 632267268 kB
Node 0 FilePages: 72204968 kB
Node 1 MemFree: 262567056 kB
Node 1 FilePages: 2918768 kB
Single-reader (after_promotion): ~16.5s
Turning the promotion mechanism on when nothing had been demoted
produced no appreciable overhead (memory allocation noise overpowers it).
Read time did not change when promotion was turned off after promotions
had occurred, which implies that the additional overhead is not coming
from the promotion system itself - but likely from other pages still
trapped on the low tier. Either way, this at least demonstrates the
mechanism is not particularly harmful when there are no pages to
promote - and that it is valuable when a file actually is quite hot.
Notably, it takes some time for the average read loop to come back
down, and there still remain unpromoted file pages trapped in the
pagecache. This isn't entirely unexpected; there are many files which
may have been demoted, and they may not be very hot.
Overhead
======
When promotion was turned on, we saw a temporary loop-runtime increase:
before: 16.8s
during:
17.606216192245483
17.375206470489502
17.722095489501953
18.230552434921265
18.20712447166443
18.008254528045654
17.008427381515503
16.851454257965088
16.715774059295654
stable: ~16.5s
We measured overhead with a separate patch that simply measured the
rdtsc value before/after calls in promotion_candidate and task work
(the first number below is the average cycles per call):
e.g.:
+	u64 start = rdtsc();
+	long count = 0;
	list_for_each_entry_safe(folio, tmp, promo_list, lru) {
		list_del_init(&folio->lru);
		migrate_misplaced_folio(folio, NULL, nid);
+		count++;
	}
+	atomic_long_add(rdtsc() - start, &promo_time);
+	atomic_long_add(count, &promo_count);
numa_migrate_prep: 93 - time(3969867917) count(42576860)
migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
Thoughts on a good throttling heuristic would be appreciated here.
Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Keith Busch <kbusch@meta.com>
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
Gregory Price (5):
migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
memory: move conditionally defined enums use inside ifdef tags
memory: allow non-fault migration in numa_migrate_check path
vmstat: add page-cache numa hints
migrate,sysfs: add pagecache promotion
.../ABI/testing/sysfs-kernel-mm-numa | 20 ++++++
include/linux/memory-tiers.h | 2 +
include/linux/migrate.h | 2 +
include/linux/sched.h | 3 +
include/linux/sched/numa_balancing.h | 5 ++
include/linux/vm_event_item.h | 8 +++
init/init_task.c | 1 +
kernel/sched/fair.c | 26 +++++++-
mm/memory-tiers.c | 27 ++++++++
mm/memory.c | 32 +++++-----
mm/mempolicy.c | 25 +++++---
mm/migrate.c | 61 ++++++++++++++++++-
mm/swap.c | 3 +
mm/vmstat.c | 2 +
14 files changed, 193 insertions(+), 24 deletions(-)
--
2.43.0
Hi, Gregory,
Thanks for working on this!
Gregory Price <gourry@gourry.net> writes:
> [... full cover letter quoted; snipped ...]
>
> Single-reader DRAM: ~16.0-16.4s
> Single-reader CXL (after demotion): ~16.8-17s
The difference is trivial. This makes me wonder why we need this
patchset.
> Next we turned promotion on with only a single reader running.
>
> Before promotions:
> Node 0 MemFree: 636478112 kB
> Node 0 FilePages: 59009156 kB
> Node 1 MemFree: 250336004 kB
> Node 1 FilePages: 14979628 kB
Why are there so many file pages on node 1 even though there are a lot
of free pages on node 0? You moved some file pages from node 0 to node 1?
> After promotions:
> Node 0 MemFree: 632267268 kB
> Node 0 FilePages: 72204968 kB
> Node 1 MemFree: 262567056 kB
> Node 1 FilePages: 2918768 kB
>
> Single-reader (after_promotion): ~16.5s
>
> Turning the promotion mechanism on when nothing had been demoted
> produced no appreciable overhead (memory allocation noise overpowers it)
>
> Read time did not change after turning promotion off after promotion
> occurred, which implies that the additional overhead is not coming from
> the promotion system itself - but likely other pages still trapped on
> the low tier. Either way, this at least demonstrates the mechanism is
> not particularly harmful when there are no pages to promote - and the
> mechanism is valuable when a file actually is quite hot.
>
> Notability, it takes some time for the average read loop to come back
> down, and there still remains unpromoted file pages trapped in pagecache.
> This isn't entirely unexpected, there are many files which may have been
> demoted, and they may not be very hot.
>
>
> Overhead
> ======
> When promotion was tured on we saw a loop-runtime increate temporarily
>
> before: 16.8s
> during:
> 17.606216192245483
> 17.375206470489502
> 17.722095489501953
> 18.230552434921265
> 18.20712447166443
> 18.008254528045654
> 17.008427381515503
> 16.851454257965088
> 16.715774059295654
> stable: ~16.5s
>
> We measured overhead with a separate patch that simply measured the
> rdtsc value before/after calls in promotion_candidate and task work.
>
> e.g.:
> + start = rdtsc();
> list_for_each_entry_safe(folio, tmp, promo_list, lru) {
> list_del_init(&folio->lru);
> migrate_misplaced_folio(folio, NULL, nid);
> + count++;
> }
> + atomic_long_add(rdtsc()-start, &promo_time);
> + atomic_long_add(count, &promo_count);
>
> numa_migrate_prep: 93 - time(3969867917) count(42576860)
> migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
> migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
>
> Thoughts on a good throttling heuristic would be appreciated here.
We already have a throttle mechanism; for example, you can use
$ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
to rate-limit the promotion throughput to under 100 MB/s for each DRAM
node.
---
Best Regards,
Huang, Ying
On Sat, Dec 21, 2024 at 01:18:04PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
>
> >
> > Single-reader DRAM: ~16.0-16.4s
> > Single-reader CXL (after demotion): ~16.8-17s
>
> The difference is trivial. This makes me wonder why we need this
> patchset.
>

That's 3-6% performance in this contrived case.

We're working on testing a real workload we know suffers from this
problem as it is long-running. Should be early in the new year
hopefully.

> > Next we turned promotion on with only a single reader running.
> >
> > Before promotions:
> > Node 0 MemFree:  636478112 kB
> > Node 0 FilePages: 59009156 kB
> > Node 1 MemFree:  250336004 kB
> > Node 1 FilePages: 14979628 kB
>
> Why are there so many file pages on node 1 even though there are a lot
> of free pages on node 0? You moved some file pages from node 0 to node 1?
>

This was explicit and explained in the test notes:

  First we ran with promotion disabled to show the consistent overhead
  that results from forcing a file out to CXL memory. We first ran a
  single reader to see uncontended performance, launched many readers
  to force demotions, then dropped back to a single reader to observe.

The goal here was to simply demonstrate functionality and stability.

> > After promotions:
> > Node 0 MemFree:  632267268 kB
> > Node 0 FilePages: 72204968 kB
> > Node 1 MemFree:  262567056 kB
> > Node 1 FilePages:  2918768 kB
> >
> > Single-reader (after_promotion): ~16.5s

This represents a 2.5-6% speedup depending on the spread.

> > numa_migrate_prep: 93 - time(3969867917) count(42576860)
> > migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
> > migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
> >
> > Thoughts on a good throttling heuristic would be appreciated here.
>
> We already have a throttle mechanism; for example, you can use
>
> $ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
>
> to rate-limit the promotion throughput to under 100 MB/s for each DRAM
> node.
>

Can easily piggyback on that, just wasn't sure if overloading it was
an acceptable idea. Although since that promotion rate limit is also
per-task (as far as I know, will need to read into it a bit more) this
is probably fine.

~Gregory
Gregory Price <gourry@gourry.net> writes:
> On Sat, Dec 21, 2024 at 01:18:04PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>>
>> >
>> > Single-reader DRAM: ~16.0-16.4s
>> > Single-reader CXL (after demotion): ~16.8-17s
>>
>> The difference is trivial. This makes me wonder why we need this
>> patchset.
>>
>
> That's 3-6% performance in this contrived case.
This is small too.
> We're working on testing a real workload we know suffers from this
> problem as it is long-running. Should be early in the new year hopefully.
Good!
To demonstrate the maximum possible performance gain, we can use a pure
file read/write benchmark such as fio and run it on pure DRAM and pure
CXL. The difference is then the maximum possible gain we can get.
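Something along these lines would do (illustrative fio invocations;
the file path and sizes are placeholders):

  # page cache fills follow the reader's memory policy
  numactl --membind=0 fio --name=dram-read --ioengine=psync \
          --rw=read --bs=1M --size=64G --filename=/mnt/test/file
  numactl --membind=1 fio --name=cxl-read --ioengine=psync \
          --rw=read --bs=1M --size=64G --filename=/mnt/test/file

The delta between the two runs bounds what promotion can ever recover.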
>> > Next we turned promotion on with only a single reader running.
>> >
>> > Before promotions:
>> > Node 0 MemFree: 636478112 kB
>> > Node 0 FilePages: 59009156 kB
>> > Node 1 MemFree: 250336004 kB
>> > Node 1 FilePages: 14979628 kB
>>
>> Why are there so many file pages on node 1 even though there are a lot
>> of free pages on node 0? You moved some file pages from node 0 to node 1?
>>
>
> This was explicit and explained in the test notes:
>
> First we ran with promotion disabled to show consistent overhead as
> a result of forcing a file out to CXL memory. We first ran a single
> reader to see uncontended performance, launched many readers to force
> demotions, then dropped back to a single reader to observe.
>
> The goal here was to simply demonstrate functionality and stability.
Got it.
>> > After promotions:
>> > Node 0 MemFree: 632267268 kB
>> > Node 0 FilePages: 72204968 kB
>> > Node 1 MemFree: 262567056 kB
>> > Node 1 FilePages: 2918768 kB
>> >
>> > Single-reader (after_promotion): ~16.5s
>
> This represents a 2.5-6% speedup depending on the spread.
>
>> >
>> > numa_migrate_prep: 93 - time(3969867917) count(42576860)
>> > migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
>> > migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
>> >
>> > Thoughts on a good throttling heuristic would be appreciated here.
>>
>> We already have a throttle mechanism; for example, you can use
>>
>> $ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps
>>
>> to rate-limit the promotion throughput to under 100 MB/s for each DRAM
>> node.
>>
>
> Can easily piggyback on that, just wasn't sure if overloading it was
> an acceptable idea.
It's the recommended setup in the original PMEM promotion
implementation. Please check commit c959924b0dc5 ("memory tiering:
adjust hot threshold automatically").
> Although since that promotion rate limit is also
> per-task (as far as I know, will need to read into it a bit more) this
> is probably fine.
It's not per-task. Please read the code, especially
should_numa_migrate_memory().
---
Best Regards,
Huang, Ying
On Sun, Dec 22, 2024 at 03:09:44PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> > That's 3-6% performance in this contrived case.
>
> This is small too.
>
Small is relative. A 3-6% performance increase across millions of servers
across a year is a non-trivial speedup for such a common operation.
> > Can easily piggyback on that, just wasn't sure if overloading it was
> > an acceptable idea.
>
> It's the recommended setup in the original PMEM promotion
> implementation. Please check commit c959924b0dc5 ("memory tiering:
> adjust hot threshold automatically").
>
> > Although since that promotion rate limit is also
> > per-task (as far as I know, will need to read into it a bit more) this
> > is probably fine.
>
> It's not per-task. Please read the code, especially
> should_numa_migrate_memory().
Oh, then this is already throttled. We call mpol_misplaced which calls
should_numa_migrate_memory.
There's some duplication of candidate selection logic between
promotion_candidate and should_numa_migrate_memory, but it may be
beneficial to keep it that way. I'll have to look.
~Gregory
Gregory Price <gourry@gourry.net> writes:
> On Sun, Dec 22, 2024 at 03:09:44PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>> > That's 3-6% performance in this contrived case.
>>
>> This is small too.
>>
>
> Small is relative. A 3-6% performance increase across millions of servers
> across a year is a non-trivial speedup for such a common operation.
If we can only get a 3-6% performance increase in a micro-benchmark,
how much can we get from real-life workloads?
Anyway, we need to prove the usefulness of the change via data. 3-6%
isn't strong data.
Can we measure the largest improvement? For example, run the benchmark
with all file pages in DRAM and CXL.mem via numa binding, and compare.
>> > Can easily piggyback on that, just wasn't sure if overloading it was
>> > an acceptable idea.
>>
>> It's the recommended setup in the original PMEM promotion
>> implementation. Please check commit c959924b0dc5 ("memory tiering:
>> adjust hot threshold automatically").
>>
>> > Although since that promotion rate limit is also
>> > per-task (as far as I know, will need to read into it a bit more) this
>> > is probably fine.
>>
>> It's not per-task. Please read the code, especially
>> should_numa_migrate_memory().
>
> Oh, then this is already throttled. We call mpol_misplaced which calls
> should_numa_migrate_memory.
>
> There's some duplication of candidate selection logic between
> promotion_candidate and should_numa_migrate_memory, but it may be
> beneficial to keep it that way. I'll have to look.
---
Best Regards,
Huang, Ying
On Fri, Dec 27, 2024 at 10:16:42AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
>
> > On Sun, Dec 22, 2024 at 03:09:44PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry@gourry.net> writes:
> >> > That's 3-6% performance in this contrived case.
> >>
> >> This is small too.
> >>
> >
> > Small is relative. A 3-6% performance increase across millions of servers
> > across a year is a non-trivial speedup for such a common operation.
>
> If we can only get a 3-6% performance increase in a micro-benchmark,
> how much can we get from real-life workloads?
>
> Anyway, we need to prove the usefulness of the change via data. 3-6%
> isn't strong data.
>
> Can we measure the largest improvement? For example, run the benchmark
> with all file pages in DRAM and CXL.mem via numa binding, and compare.
>

I can probably come up with something, will rework some stuff.

~Gregory
On Fri, Dec 27, 2024 at 10:40:36AM -0500, Gregory Price wrote:
> > Can we measure the largest improvement? For example, run the benchmark
> > with all file pages in DRAM and CXL.mem via numa binding, and compare.
>
> I can probably come up with something, will rework some stuff.
>

So I did as you suggested: I made a program that allocates a 16GB
buffer, initializes it, then membinds itself to node1 before accessing
the file to force it into pagecache, then I ran a bunch of tests.

Completely unexpected result: ~25% overhead from an inexplicable source.

baseline - no membind()
./test
Read loop took 0.93 seconds

drop caches
./test - w/ membind(1) just before file open
Read loop took 1.16 seconds
node 1 size: 262144 MB
node 1 free: 245756 MB   <- file confirmed in cache

kill and relaunch without membind to avoid any funny business
./test
Read loop took 1.16 seconds

enable promotion
Read loop took 3.37 seconds  <- migration overhead
... snip ...
Read loop took 1.17 seconds  <- stabilizes here
node 1 size: 262144 MB
node 1 free: 262144 MB   <- pagecache promoted

Absolutely bizarre result: there is 0% CXL usage occurring, but the
overhead we originally measured is still present. This overhead
persists even if I do the following:
- disable pagecache promotion
- disable numa_balancing
- offline CXL memory entirely

This is actually pretty wild. I presume this must imply the folio flags
are mucked up after migration and we're incurring a bunch of overhead
on access for no reason. At the very least it doesn't appear to be an
isolated folio issue:

nr_isolated_anon 0
nr_isolated_file 0

I'll have to dig into this further, I wonder if this happens with
mapped memory as well.

~Gregory
On Fri, Dec 27, 2024 at 02:09:50PM -0500, Gregory Price wrote:
> On Fri, Dec 27, 2024 at 10:40:36AM -0500, Gregory Price wrote:
just adding some follow-up data
test is essentially
membind(1) - node1 is cxl
read() - filecache is initialized on cxl
set_mempolicy(MPOL_DEFAULT) - allow migrations
while true:
start = time()
read()
print(time()-start)
// external events cause migration/drop cache while running
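A minimal C version of that test might look like this (my sketch of the
steps above - the file path, sizes, and error handling are placeholders):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <numaif.h>	/* set_mempolicy(); link with -lnuma */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		static char buf[1 << 20];
		unsigned long nodemask = 1UL << 1;	/* node1 = CXL */
		size_t len = 16UL << 30;		/* 16GB buffer */
		char *mem = malloc(len);
		int fd;

		memset(mem, 1, len);		/* init buffer on DRAM */

		/* membind(1): pagecache fills now land on node1 */
		set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask));
		fd = open("/mnt/test/file", O_RDONLY);
		while (read(fd, buf, sizeof(buf)) > 0)
			;			/* pull file into CXL pagecache */

		/* back to default policy so promotion can migrate pages */
		set_mempolicy(MPOL_DEFAULT, NULL, 0);

		for (;;) {
			struct timespec a, b;

			clock_gettime(CLOCK_MONOTONIC, &a);
			lseek(fd, 0, SEEK_SET);
			while (read(fd, buf, sizeof(buf)) > 0)
				;
			clock_gettime(CLOCK_MONOTONIC, &b);
			printf("Read loop took %.2f seconds\n",
			       (b.tv_sec - a.tv_sec) +
			       (b.tv_nsec - a.tv_nsec) / 1e9);
		}
	}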
baseline: .93-1s/read()
from cxl: ~1.15-1.2s/read()
So we are seeing anywhere from 20-25% overhead from the filecache living
on CXL right out of the box. At least we have good clear signal, right?
tests:
echo 3 > drop_cache - filecache refills into node 1
result => ~.95-1s/read()
we return back to the baseline, which is expected
enable promotion - numactl shows promotion occurs
result => ~1.15-1.2s/read()
No effect?! Even offlining the dax devices does nothing.
enable promotion, wait for it to complete, drop cache
after promotion => 1.15-1.2s/read
after drop cache => .95-1s/read()
Back to baseline!
This seems to imply that the overhead we're seeing from read() - even
when the filecache is on the remote node - isn't actually related to
the memory speed, but is instead likely related to some kind of stale
metadata in the filesystem or filecache layers.
This is going to take me a bit to figure out. I need to isolate the
filesystem influence (we are using btrfs; I want to make sure this
behavior is consistent on other file systems).
~Gregory
On Fri, Dec 27, 2024 at 10:38:45PM -0500, Gregory Price wrote:
> On Fri, Dec 27, 2024 at 02:09:50PM -0500, Gregory Price wrote:
>
> This seems to imply that the overhead we're seeing from read() even
> when filecache is on the remote node isn't actually related to the
> memory speed, but instead likely related to some kind of stale
> metadata in the filesystem or filecache layers.
>
> ~Gregory
Mystery solved
> +void promotion_candidate(struct folio *folio)
> +{
... snip ...
> + list_add(&folio->lru, promo_list);
> +}
read(file, length) will do a linear read, and promotion_candidate will
add those pages to the promotion list head, resulting in a reversed
promotion order:
so if you read folios [1,2,3,4], you'll promote in [4,3,2,1] order.
The result of this, on an unloaded system, is essentially that pages end
up in the worst possible layout for the prefetcher, and therefore for
TLB hits. I figured this out because I was seeing the additional ~30%
overhead show up purely in `copy_page_to_iter()` (i.e. copy_to_user).
Swapping this for list_add_tail results in the following test result:
initializing
Read loop took 9.41 seconds <- reading from CXL
Read loop took 31.74 seconds <- migration enabled
Read loop took 10.31 seconds
Read loop took 7.71 seconds <- migration finished
Read loop took 7.71 seconds
Read loop took 7.70 seconds
Read loop took 7.75 seconds
Read loop took 19.34 seconds <- dropped caches
Read loop took 13.68 seconds <- cache refilling to DRAM
Read loop took 7.37 seconds
Read loop took 7.68 seconds
Read loop took 7.65 seconds <- back to DRAM baseline
On our CXL devices, we're seeing a 22-27% performance penalty for a file
being hosted entirely out of CXL. When we promote this file out of CXL,
we see a 22-27% performance boost.
list_add_tail is probably right here, but since files *tend to* be read
linearly with `read()`, this should *tend toward* optimal ordering. That
said, we can probably make this more reliable by adding a batch migration
function `mpol_migrate_misplaced_batch()` which also tries to do bulk
allocation of destination folios. This will also probably save us a
bunch of invalidation overhead.
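A hypothetical shape for that helper, purely illustrative (the name,
flags, and structure here are mine, not part of the series):

	/* Promote a whole list of isolated folios to @nid in one call. */
	static void mpol_migrate_misplaced_batch(struct list_head *promo_list,
						 int nid)
	{
		struct migration_target_control mtc = {
			.nid = nid,
			.gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE,
		};
		unsigned int nr_succeeded;

		/*
		 * One migrate_pages() call moves the batch, letting the
		 * core batch unmaps and TLB flushes across folios.
		 */
		migrate_pages(promo_list, alloc_migration_target, NULL,
			      (unsigned long)&mtc, MIGRATE_ASYNC,
			      MR_NUMA_MISPLACED, &nr_succeeded);
	}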
I'm also noticing that the migration rate limit (256MB/s) is not being
respected, probably because we're doing 1 folio at a time instead of a
batch. I will probably look at changing promotion_candidate to limit the
number of pages selected for promotion per read() call.
---
diff --git a/mm/migrate.c b/mm/migrate.c
index f965814b7d40..99b584f22bcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2675,7 +2675,7 @@ void promotion_candidate(struct folio *folio)
folio_putback_lru(folio);
return;
}
- list_add(&folio->lru, promo_list);
+ list_add_tail(&folio->lru, promo_list);
return;
}
Gregory Price <gourry@gourry.net> writes:
> On Fri, Dec 27, 2024 at 10:38:45PM -0500, Gregory Price wrote:
>> On Fri, Dec 27, 2024 at 02:09:50PM -0500, Gregory Price wrote:
>>
>> This seems to imply that the overhead we're seeing from read() even
>> when filecache is on the remote node isn't actually related to the
>> memory speed, but instead likely related to some kind of stale
>> metadata in the filesystem or filecache layers.
>>
>> ~Gregory
>
> Mystery solved
>
>> +void promotion_candidate(struct folio *folio)
>> +{
> ... snip ...
>> + list_add(&folio->lru, promo_list);
>> +}
>
> read(file, length) will do a linear read, and promotion_candidate will
> add those pages to the promotion list head, resulting in a reversed
> promotion order:
>
> so if you read folios [1,2,3,4], you'll promote in [4,3,2,1] order.
>
> The result of this, on an unloaded system, is essentially that pages end
> up in the worst possible layout for the prefetcher, and therefore for
> TLB hits. I figured this out because I was seeing the additional ~30%
> overhead show up purely in `copy_page_to_iter()` (i.e. copy_to_user).
>
> Swapping this for list_add_tail results in the following test result:
>
> initializing
> Read loop took 9.41 seconds <- reading from CXL
> Read loop took 31.74 seconds <- migration enabled
> Read loop took 10.31 seconds
This shows that migration causes a large disturbance to the workload.
This may not be acceptable in real life. Can you check whether the
promotion rate limit can improve the situation?
> Read loop took 7.71 seconds <- migration finished
> Read loop took 7.71 seconds
> Read loop took 7.70 seconds
> Read loop took 7.75 seconds
> Read loop took 19.34 seconds <- dropped caches
> Read loop took 13.68 seconds <- cache refilling to DRAM
> Read loop took 7.37 seconds
> Read loop took 7.68 seconds
> Read loop took 7.65 seconds <- back to DRAM baseline
>
> On our CXL devices, we're seeing a 22-27% performance penalty for a file
> being hosted entirely out of CXL. When we promote this file out of CXL,
> we see a 22-27% performance boost.
This is a good number! Thanks!
> list_add_tail is probably right here, but since files *tend to* be read
> linearly with `read()`, this should *tend toward* optimal ordering. That
> said, we can probably make this more reliable by adding a batch migration
> function `mpol_migrate_misplaced_batch()` which also tries to do bulk
> allocation of destination folios. This will also probably save us a
> bunch of invalidation overhead.
>
> I'm also noticing that the migration rate limit (256MB/s) is not being
> respected, probably because we're doing 1 folio at a time instead of a
> batch. I will probably look at changing promotion_candidate to limit the
> number of pages selected for promotion per read() call.
The migration limit is checked in should_numa_migrate_memory(). You may
take a look at that function.
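Roughly, the memory-tiering branch of that function does the following
(paraphrased, with helper names simplified - see the actual source in
kernel/sched/fair.c; hot_threshold() and promotion_rate_limited() are
stand-ins for the real helpers):

	if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
	    !node_is_toptier(src_nid)) {
		struct pglist_data *pgdat = NODE_DATA(dst_nid);

		/* promote eagerly while the fast tier has free space */
		if (pgdat_free_space_enough(pgdat))
			return true;

		/* only pages hotter than the auto-tuned threshold */
		if (numa_hint_fault_latency(folio) >= hot_threshold(pgdat))
			return false;

		/*
		 * Per-node budget fed by
		 * numa_balancing_promote_rate_limit_MBps - note that it
		 * is per node, not per task.
		 */
		return !promotion_rate_limited(pgdat, folio_nr_pages(folio));
	}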
> ---
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f965814b7d40..99b584f22bcb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2675,7 +2675,7 @@ void promotion_candidate(struct folio *folio)
> folio_putback_lru(folio);
> return;
> }
> - list_add(&folio->lru, promo_list);
> + list_add_tail(&folio->lru, promo_list);
>
> return;
> }
[snip]
---
Best Regards,
Huang, Ying