Hi,

This patchset introduces a new subsystem for hot page tracking and
promotion (pghot) that consolidates memory access information from
various sources and enables centralized promotion of hot pages across
memory tiers.

Currently, multiple kernel subsystems detect page accesses
independently, e.g.

- NUMA Balancing via hint faults
- MGLRU via page table scanning for the PTE A bit

This patchset consolidates the accesses from these mechanisms by
providing a common API for reporting page accesses, a shared
infrastructure for tracking hotness at PFN granularity, and per-node
kernel threads for promoting pages.

Here is a brief summary of how this subsystem works:

- Tracks frequency, last access time and accessing node as part of
  each access record.
- Maintains per-PFN access records in hash lists.
- Classifies pages as hot based on configurable thresholds.
- Uses per-toptier-node max-heaps to prioritize hot pages for
  promotion.
- Launches per-toptier-node kpromoted threads to perform batched
  migrations.

When different subsystems report page accesses via the API introduced
by this new subsystem, a record for each such page is stored in hash
lists (hashed by PFN value). In addition to the PFN and target_nid,
the hotness record includes parameters like frequency and time of
access, from which the hotness is derived. Repeated reporting of
accesses to the same PFN updates the existing hotness information.

When the hotness of a record (as updated during reporting of an
access) crosses a threshold, the record is added to a max-heap data
structure. Records in the max-heap are ordered by hotness, so the top
elements of the heap correspond to the hottest pages. There is one
such heap for each toptier node, so that each per-toptier-node
kpromoted thread can easily extract the top N records from its own
heap and perform batched migration.
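As a rough illustration of the kind of per-PFN record described above
(the structure layout, field widths and helper below are assumptions
made for this sketch, not necessarily what the patches implement):

/*
 * Sketch only: one possible way to pack the three hotness parameters
 * (nid, frequency, timestamp) into a single u32, as mentioned in the
 * v2 changelog below. Field widths are illustrative assumptions.
 */
#define PGHOT_NID_BITS		10	/* up to 1024 nodes */
#define PGHOT_FREQ_BITS		6	/* saturating access count */
#define PGHOT_TIME_BITS		16	/* coarse last-access timestamp */

struct pghot_record {
	unsigned long		pfn;
	u32			packed;		/* nid | freq | time */
	struct hlist_node	hnode;		/* hash list keyed by pfn */
};

static inline u32 pghot_pack(unsigned int nid, u32 freq, u32 time)
{
	u32 n = nid  & ((1 << PGHOT_NID_BITS)  - 1);
	u32 f = freq & ((1 << PGHOT_FREQ_BITS) - 1);
	u32 t = time & ((1 << PGHOT_TIME_BITS) - 1);

	return (n << (PGHOT_FREQ_BITS + PGHOT_TIME_BITS)) |
	       (f << PGHOT_TIME_BITS) | t;
}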
Three page hotness sources have been integrated with the pghot
subsystem on an experimental basis:

1. IBS
2. klruscand (based on MGLRU page table walks)
3. NUMA Balancing (mode 2)

Changes in v2
=============
- Moved migration rate-limiting and dynamic threshold logic from the
  NUMA Balancing subsystem to pghot. With this, the logic to classify
  a page as hot more closely resembles the existing mechanism.
- Converted NUMA Balancing mode 2 to just detect accesses through NUMA
  hint faults and delegate the rest of the processing (hot page
  classification and promotion) to pghot.
- Packed the three parameters required for hot page tracking (nid,
  frequency and timestamp) into a single u32 for space efficiency.
- Misc cleanups and refactoring.

This v2 patchset applies on top of upstream commit 8742b2d8935f and
can be fetched from:

https://github.com/AMDESE/linux-mm/tree/bharata/kpromoted-rfcv2

v1: https://lore.kernel.org/linux-mm/20250814134826.154003-1-bharata@amd.com/
v0: https://lore.kernel.org/linux-mm/20250306054532.221138-1-bharata@amd.com/

TODOs
=====
- Memory allocation: The high volume of allocations and frees
  (millions) from atomic context needs evaluation.
- Memory overhead: The amount of data needed for tracking hotness is
  also a concern.
- Integrate Kscand[1], the PTE A bit based approach that Raghavendra
  KT is working on, so that Kscand acts as a temperature source and
  uses pghot for hot page heuristics and promotion.
- Heap pruning: Consider adding a heap pruning mechanism for periodic
  cleaning of cold records.
- Address Ying Huang's comment about merging migrate_misplaced_folio()
  and migrate_misplaced_folios_batch() and correctly handling memcg
  stats counting in the latter.
- Testing: Light functional testing done; performance benchmarking and
  stress testing will follow in the next iterations.

Any feedback is welcome!

Bharata B Rao (5):
  mm: migrate: Allow misplaced migration without VMA too
  mm: Hot page tracking and promotion
  x86: ibs: In-kernel IBS driver for memory access profiling
  x86: ibs: Enable IBS profiling for memory accesses
  mm: sched: Move hot page promotion from NUMAB=2 to kpromoted

Gregory Price (1):
  migrate: implement migrate_misplaced_folios_batch

Kinsey Ho (2):
  mm: mglru: generalize page table walk
  mm: klruscand: use mglru scanning for page promotion

 arch/x86/events/amd/ibs.c           |  11 +
 arch/x86/include/asm/entry-common.h |   3 +
 arch/x86/include/asm/hardirq.h      |   2 +
 arch/x86/include/asm/ibs.h          |   9 +
 arch/x86/include/asm/msr-index.h    |  16 +
 arch/x86/mm/Makefile                |   3 +-
 arch/x86/mm/ibs.c                   | 343 +++++++++++++++
 include/linux/migrate.h             |   6 +
 include/linux/mmzone.h              |  16 +
 include/linux/pghot.h               |  98 +++++
 include/linux/vm_event_item.h       |  26 ++
 kernel/sched/fair.c                 | 149 +------
 mm/Kconfig                          |  19 +
 mm/Makefile                         |   2 +
 mm/internal.h                       |   4 +
 mm/klruscand.c                      | 118 +++++
 mm/memory.c                         |  32 +-
 mm/migrate.c                        |  36 +-
 mm/mm_init.c                        |  10 +
 mm/pghot.c                          | 648 ++++++++++++++++++++++++++++
 mm/vmscan.c                         | 176 ++++++--
 mm/vmstat.c                         |  26 ++
 22 files changed, 1535 insertions(+), 218 deletions(-)
 create mode 100644 arch/x86/include/asm/ibs.h
 create mode 100644 arch/x86/mm/ibs.c
 create mode 100644 include/linux/pghot.h
 create mode 100644 mm/klruscand.c
 create mode 100644 mm/pghot.c

[1] Kscand - https://lore.kernel.org/linux-mm/20250814153307.1553061-1-raghavendra.kt@amd.com/

--
2.34.1
On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote: > This patchset introduces a new subsystem for hot page tracking > and promotion (pghot) that consolidates memory access information > from various sources and enables centralized promotion of hot > pages across memory tiers. Just to be clear, I continue to believe this is a terrible idea and we should not do this. If systems will be built with CXL (and given the horrendous performance, I cannot see why they would be), the kernel should not be migrating memory around like this.
On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote: > On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote: > > This patchset introduces a new subsystem for hot page tracking > > and promotion (pghot) that consolidates memory access information > > from various sources and enables centralized promotion of hot > > pages across memory tiers. > > Just to be clear, I continue to believe this is a terrible idea and we > should not do this. If systems will be built with CXL (and given the > horrendous performance, I cannot see why they would be), the kernel > should not be migrating memory around like this. I've been considering this problem from the opposite approach since LSFMM. Rather than decide how to move stuff around, what if instead we just decide not to ever put certain classes of memory on CXL? Right now, so long as CXL is in the page allocator, it's the wild west - any page can end up anywhere. I have enough data now from ZONE_MOVABLE-only CXL deployments on real workloads to show local CXL expansion is valuable and performant enough to be worth deploying - but the key piece for me is that ZONE_MOVABLE disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of CXL, but allows any given user-driven page allocation (including page cache, file, and anon mappings) to land there. I'm hoping to share some of this data in the coming months. I've yet to see any strong indication that a complex hotness/movement system is warranted (yet) - but that may simply be because we have local cards with no switching involved. So far LRU-based promotion and demotion have been sufficient. It seems the closer to random-access the access pattern, the less valuable ANY movement is. Which should be intuitive. But, having CXL beats touching disk every day of the week. So I've become conflicted on this work - but only because I haven't seen the data to suggest such complexity is warranted. ~Gregory
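For reference, the mechanism behind the ZONE_MOVABLE point above: only
allocations whose gfp mask includes __GFP_MOVABLE together with
__GFP_HIGHMEM (e.g. GFP_HIGHUSER_MOVABLE, used for anon and page-cache
folios) can be satisfied from ZONE_MOVABLE, while GFP_KERNEL maps to
ZONE_NORMAL at most. A minimal illustration (not code from this
patchset):

	struct folio *folio = folio_alloc(GFP_HIGHUSER_MOVABLE, 0);	/* may land on a movable-only CXL node */
	void *obj = kmalloc(64, GFP_KERNEL);				/* never placed in ZONE_MOVABLE */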
On Wed, 10 Sep 2025, Gregory Price wrote: > On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote: > > On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote: > > > This patchset introduces a new subsystem for hot page tracking > > > and promotion (pghot) that consolidates memory access information > > > from various sources and enables centralized promotion of hot > > > pages across memory tiers. > > > > Just to be clear, I continue to believe this is a terrible idea and we > > should not do this. If systems will be built with CXL (and given the > > horrendous performance, I cannot see why they would be), the kernel > > should not be migrating memory around like this. > > I've been considered this problem from the opposite approach since LSFMM. > > Rather than decide how to move stuff around, what if instead we just > decide not to ever put certain classes of memory on CXL. Right now, so > long as CXL is in the page allocator, it's the wild west - any page can > end up anywhere. > > I have enough data now from ZONE_MOVABLE-only CXL deployments on real > workloads to show local CXL expansion is valuable and performant enough > to be worth deploying - but the key piece for me is that ZONE_MOVABLE > disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of > CXL, but allows any given user-driven page allocation (including page > cache, file, and anon mappings) to land there. > This is similar to our use case, although the direct allocation can be controlled by cpusets or mempolicies as needed depending on the memory access latency required for the workload; nothing new there, though, it's the same argument as NUMA in general and the abstraction of these far memory nodes as separate NUMA nodes makes this very straightforward. > I'm hoping to share some of this data in the coming months. > > I've yet to see any strong indication that a complex hotness/movement > system is warranted (yet) - but that may simply be because we have > local cards with no switching involved. So far LRU-based promotion and > demotion has been sufficient. > To me, this is a key point. As we've discussed in meetings, we're in the early days here. The CHMU does provide a lot of flexibility, both to create very good and very bad hotness trackers. But I think the key point is that we have multiple sources of hotness information depending on the platform and some of these sources only make sense for the kernel (or a BPF offload) to maintain as the source of truth. Some of these sources will be clear-on-read, so only one entity can act as the source of truth for page hotness. I've been pretty focused on the promotion story here rather than demotion because of how responsive it needs to be. Harvesting the page table accessed bits or waiting on a sliding window through NUMA Balancing (even NUMAB=2) is not as responsive as needed for very fast promotion to top tier memory, hence things like the CHMU (or PEBS or IBS etc). 
A few things that I think we need to discuss and align on:
- the kernel as the source of truth for all memory hotness information, which can then be abstracted and used for multiple downstream purposes, memory tiering only being one of them
- the long-term plan for NUMAB=2 and memory tiering support in the kernel in general: are we planning on supporting this through NUMA hint faults forever despite their drawbacks (too slow, too much overhead for KVM)?
- the role of the kernel vs userspace in driving the memory migration; lots of discussion on hardware assists that can be leveraged for memory migration, but today the balancing is driven in process context. The kthread as the driver of migration is not yet a settled argument, but it is where a number of companies are currently looking.
There's also some feature support that is possible with these CXL memory expansion devices that have started to pop up in labs that can also drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to chime in as well. This topic seems due for an alignment session as well, so will look to get that scheduled in the coming weeks if people are up for it. > It seems the closer to random-access the access pattern, the less > valuable ANY movement is. Which should be intuitive. But, having > CXL beats touching disk every day of the week. > > So I've become conflicted on this work - but only because I haven't seen > the data to suggest such complexity is warranted. > > ~Gregory >
Hi David, On Tue, Sep 16, 2025 at 1:28 PM David Rientjes <rientjes@google.com> wrote: > > I've been pretty focused on the promotion story here rather than demotion > because of how responsive it needs to be. Harvesting the page table > accessed bits or waiting on a sliding window through NUMA Balancing (even > NUMAB=2) is not as responsive as needed for very fast promotion to top > tier memory, hence things like the CHMU (or PEBS or IBS etc). First, thanks for sharing your thoughts on the promotion responsiveness challenges, definitely a critical aspect for tiering strategies. We recently put together a preliminary report using our experimental HW that I believe could be relevant to the ongoing discussions: A Limits Study of Memory-side Tiering Telemetry: https://arxiv.org/abs/2508.09351 It's essentially an initial step toward quantifying the benefits of HMU on the memory side, aiming to compare promotion quality (e.g., hotness coverage and accuracy) across HMU, PEBS-based promotion, and NUMA balancing (promotion path). Hopefully, this kind of work can help us better understand some of the trade-offs being discussed, support more data-driven comparisons, and spark more fruitful discussions... Best, Vinicius
On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote: > > On Wed, 10 Sep 2025, Gregory Price wrote: > > > On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote: > > > On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote: > > > > This patchset introduces a new subsystem for hot page tracking > > > > and promotion (pghot) that consolidates memory access information > > > > from various sources and enables centralized promotion of hot > > > > pages across memory tiers. > > > > > > Just to be clear, I continue to believe this is a terrible idea and we > > > should not do this. If systems will be built with CXL (and given the > > > horrendous performance, I cannot see why they would be), the kernel > > > should not be migrating memory around like this. > > > > I've been considered this problem from the opposite approach since LSFMM. > > > > Rather than decide how to move stuff around, what if instead we just > > decide not to ever put certain classes of memory on CXL. Right now, so > > long as CXL is in the page allocator, it's the wild west - any page can > > end up anywhere. > > > > I have enough data now from ZONE_MOVABLE-only CXL deployments on real > > workloads to show local CXL expansion is valuable and performant enough > > to be worth deploying - but the key piece for me is that ZONE_MOVABLE > > disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of > > CXL, but allows any given user-driven page allocation (including page > > cache, file, and anon mappings) to land there. > > > > This is similar to our use case, although the direct allocation can be > controlled by cpusets or mempolicies as needed depending on the memory > access latency required for the workload; nothing new there, though, it's > the same argument as NUMA in general and the abstraction of these far > memory nodes as separate NUMA nodes makes this very straightforward. > > > I'm hoping to share some of this data in the coming months. > > > > I've yet to see any strong indication that a complex hotness/movement > > system is warranted (yet) - but that may simply be because we have > > local cards with no switching involved. So far LRU-based promotion and > > demotion has been sufficient. > > > > To me, this is a key point. As we've discussed in meetings, we're in the > early days here. The CHMU does provide a lot of flexibility, both to > create very good and very bad hotness trackers. But I think the key point > is that we have multiple sources of hotness information depending on the > platform and some of these sources only make sense for the kernel (or a > BPF offload) to maintain as the source of truth. Some of these sources > will be clear-on-read so only one entity would be possible to have as the > source of truth of page hotness. > > I've been pretty focused on the promotion story here rather than demotion > because of how responsive it needs to be. Harvesting the page table > accessed bits or waiting on a sliding window through NUMA Balancing (even > NUMAB=2) is not as responsive as needed for very fast promotion to top > tier memory, hence things like the CHMU (or PEBS or IBS etc). 
> > A few things that I think we need to discuss and align on: > > - the kernel as the source of truth for all memory hotness information, > which can then be abstracted and used for multiple downstream purposes, > memory tiering only being one of them > > - the long-term plan for NUMAB=2 and memory tiering support in the kernel > in general, are we planning on supporting this through NUMA hint faults > forever despite their drawbacks (too slow, too much overhead for KVM) > > - the role of the kernel vs userspace in driving the memory migration; > lots of discussion on hardware assists that can be leveraged for memory > migration but today the balancing is driven in process context. The > kthread as the driver of migration is yet to be a sold argument, but > are where a number of companies are currently looking > > There's also some feature support that is possible with these CXL memory > expansion devices that have started to pop up in labs that can also > drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to > chime in as well. > > This topic seems due for an alignment session as well, so will look to get > that scheduled in the coming weeks if people are up for it. Our experience is that workloads in hyperscale data centers such as Google often have significant cold memory. Offloading this to CXL memory devices, backed by cheaper, lower-performance media (e.g. DRAM with hardware compression), can be a practical approach to reduce overall TCO. Page promotion and demotion are then critical for such a tiered memory system. A kernel thread to drive hot page collection and promotion seems logical, especially since hot page data from new sources (e.g. CHMU) are collected outside the process execution context and in the form of physical addresses. I do agree that we need to balance the complexity and benefits of any new data structures for hotness tracking. > > It seems the closer to random-access the access pattern, the less > > valuable ANY movement is. Which should be intuitive. But, having > > CXL beats touching disk every day of the week. > > > > So I've become conflicted on this work - but only because I haven't seen > > the data to suggest such complexity is warranted. > > > > ~Gregory > >
On Tue, 16 Sep 2025 17:30:46 -0700 Wei Xu <weixugc@google.com> wrote: > On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote: > > > > On Wed, 10 Sep 2025, Gregory Price wrote: > > > > > On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote: > > > > On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote: > > > > > This patchset introduces a new subsystem for hot page tracking > > > > > and promotion (pghot) that consolidates memory access information > > > > > from various sources and enables centralized promotion of hot > > > > > pages across memory tiers. > > > > > > > > Just to be clear, I continue to believe this is a terrible idea and we > > > > should not do this. If systems will be built with CXL (and given the > > > > horrendous performance, I cannot see why they would be), the kernel > > > > should not be migrating memory around like this. > > > > > > I've been considered this problem from the opposite approach since LSFMM. > > > > > > Rather than decide how to move stuff around, what if instead we just > > > decide not to ever put certain classes of memory on CXL. Right now, so > > > long as CXL is in the page allocator, it's the wild west - any page can > > > end up anywhere. > > > > > > I have enough data now from ZONE_MOVABLE-only CXL deployments on real > > > workloads to show local CXL expansion is valuable and performant enough > > > to be worth deploying - but the key piece for me is that ZONE_MOVABLE > > > disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of > > > CXL, but allows any given user-driven page allocation (including page > > > cache, file, and anon mappings) to land there. > > > > > > > This is similar to our use case, although the direct allocation can be > > controlled by cpusets or mempolicies as needed depending on the memory > > access latency required for the workload; nothing new there, though, it's > > the same argument as NUMA in general and the abstraction of these far > > memory nodes as separate NUMA nodes makes this very straightforward. > > > > > I'm hoping to share some of this data in the coming months. > > > > > > I've yet to see any strong indication that a complex hotness/movement > > > system is warranted (yet) - but that may simply be because we have > > > local cards with no switching involved. So far LRU-based promotion and > > > demotion has been sufficient. > > > > > > > To me, this is a key point. As we've discussed in meetings, we're in the > > early days here. The CHMU does provide a lot of flexibility, both to > > create very good and very bad hotness trackers. But I think the key point > > is that we have multiple sources of hotness information depending on the > > platform and some of these sources only make sense for the kernel (or a > > BPF offload) to maintain as the source of truth. Some of these sources > > will be clear-on-read so only one entity would be possible to have as the > > source of truth of page hotness. > > > > I've been pretty focused on the promotion story here rather than demotion > > because of how responsive it needs to be. Harvesting the page table > > accessed bits or waiting on a sliding window through NUMA Balancing (even > > NUMAB=2) is not as responsive as needed for very fast promotion to top > > tier memory, hence things like the CHMU (or PEBS or IBS etc). 
> > A few things that I think we need to discuss and align on: > > > > - the kernel as the source of truth for all memory hotness information, > > which can then be abstracted and used for multiple downstream purposes, > > memory tiering only being one of them > > > > - the long-term plan for NUMAB=2 and memory tiering support in the kernel > > in general, are we planning on supporting this through NUMA hint faults > > forever despite their drawbacks (too slow, too much overhead for KVM) > > > > - the role of the kernel vs userspace in driving the memory migration; > > lots of discussion on hardware assists that can be leveraged for memory > > migration but today the balancing is driven in process context. The > > kthread as the driver of migration is yet to be a sold argument, but > > are where a number of companies are currently looking > > > > There's also some feature support that is possible with these CXL memory > > expansion devices that have started to pop up in labs that can also > > drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to > > chime in as well. > > > > This topic seems due for an alignment session as well, so will look to get > > that scheduled in the coming weeks if people are up for it. > > Our experience is that workloads in hyper-scalar data centers such as > Google often have significant cold memory. Offloading this to CXL memory > devices, backed by cheaper, lower-performance media (e.g. DRAM with > hardware compression), can be a practical approach to reduce overall > TCO. Page promotion and demotion are then critical for such a tiered > memory system. For the hardware compression devices, how are you dealing with capacity variation / overcommit? There have been some discussions on that, but without a backing store of flash or similar it seems challenging to use compressed memory in a tiering system (i.e. as 'normalish' memory) unless you don't mind occasionally and unexpectedly running out of memory (in nasty async ways as dirty cache lines get written back). Or do you mean zswap-type use with a hardware offload of the actual compression? > > A kernel thread to drive hot page collection and promotion seems > logical, especially since hot page data from new sources (e.g. CHMU) > are collected outside the process execution context and in the form of > physical addresses. > > I do agree that we need to balance the complexity and benefits of any > new data structures for hotness tracking. > > > > It seems the closer to random-access the access pattern, the less > > > valuable ANY movement is. Which should be intuitive. But, having > > > CXL beats touching disk every day of the week. > > > > > > So I've become conflicted on this work - but only because I haven't seen > > > the data to suggest such complexity is warranted. > > > > > > ~Gregory > > > >
> On 17 Sep 2025, at 18:49, Jonathan Cameron <jonathan.cameron@huawei.com> wrote: > > On Tue, 16 Sep 2025 17:30:46 -0700 > Wei Xu <weixugc@google.com> wrote: > >> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote: >>> >>> On Wed, 10 Sep 2025, Gregory Price wrote: >>> >>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote: >>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote: >>>>>> This patchset introduces a new subsystem for hot page tracking >>>>>> and promotion (pghot) that consolidates memory access information >>>>>> from various sources and enables centralized promotion of hot >>>>>> pages across memory tiers. >>>>> >>>>> Just to be clear, I continue to believe this is a terrible idea and we >>>>> should not do this. If systems will be built with CXL (and given the >>>>> horrendous performance, I cannot see why they would be), the kernel >>>>> should not be migrating memory around like this. >>>> >>>> I've been considered this problem from the opposite approach since LSFMM. >>>> >>>> Rather than decide how to move stuff around, what if instead we just >>>> decide not to ever put certain classes of memory on CXL. Right now, so >>>> long as CXL is in the page allocator, it's the wild west - any page can >>>> end up anywhere. >>>> >>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real >>>> workloads to show local CXL expansion is valuable and performant enough >>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE >>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of >>>> CXL, but allows any given user-driven page allocation (including page >>>> cache, file, and anon mappings) to land there. >>>> >>> [snip] >>> There's also some feature support that is possible with these CXL memory >>> expansion devices that have started to pop up in labs that can also >>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to >>> chime in as well. >>> >>> This topic seems due for an alignment session as well, so will look to get >>> that scheduled in the coming weeks if people are up for it. >> >> Our experience is that workloads in hyper-scalar data centers such as >> Google often have significant cold memory. Offloading this to CXL memory >> devices, backed by cheaper, lower-performance media (e.g. DRAM with >> hardware compression), can be a practical approach to reduce overall >> TCO. Page promotion and demotion are then critical for such a tiered >> memory system. > > For the hardware compression devices how are you dealing with capacity variation > / overcommit? I understand that this is indeed one of the key questions from the upstream kernel’s perspective. So, I am jumping in to answer w.r.t. what we do in ZeroPoint; obviously I can not speak of other solutions/deployments. However, our HW interface follows existing open specifications from OCP [1], so what I am describing below is more widely applicable. At a very high level, the way our HW works is that the DPA is indeed overcommitted. Then, there is a control plane over CXL.io (PCIe) which exposes the real remaining capacity, as well as some configurable MSI-X interrupts that raise warnings when the capacity crosses over certain configurable thresholds. Last year I presented this interface in LSF/MM [2]. Based on the feedback I got there, we have an early prototype that acts as the *last* memory tier before reclaim (kind of "compressed tier in lieu of discard" as was suggested to me by Dan). 
What is different from standard tiering is that the control plane is checked on demotion to make sure there is still capacity left. If not, the demotion fails. While this seems stable so far, a missing piece is to ensure that this tier is mainly written by demotions and not arbitrary kernel allocations (at least as a starting point). I want to explore how mempolicies can help there, or something of the sort that Gregory described. This early prototype still needs quite some work in order to find the right abstractions. Hopefully, I will be able to push an RFC in the near future (a couple of months). > Whilst there have been some discussions on that but without a > backing store of flash or similar it seems to be challenging to use > compressed memory in a tiering system (so as 'normalish' memory) unless you > don't mind occasionally and unexpectedly running out of memory (in nasty > async ways as dirty cache lines get written back). There are several things that may be done on the device side. For now, I think the kernel should be unaware of these. But with what I described above, the goal is to have the capacity thresholds configured in a way that we can absorb the occasional dirty cache lines that are written back. > > Or do you mean zswap type use with a hardware offload of the actual > compression? I would categorize this as a completely different discussion (and product line for us). [1] https://www.opencompute.org/documents/hyperscale-tiered-memory-expander-specification-for-compute-express-link-cxl-1-pdf [2] https://www.youtube.com/watch?v=tXWEbaJmZ_s Thanks, Yiannis PS: Sending from a personal email address to avoid issues with confidentiality footers of the corporate domain.
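As a rough sketch of the capacity-checked demotion described above
(struct cxl_comp_dev, cxl_comp_read_free_capacity() and
migrate_folio_to_device() are hypothetical names used only for this
illustration, not an existing kernel or device API):

/*
 * Hypothetical sketch: before demoting a folio to the compressed tier,
 * ask the device control plane (over CXL.io) how much real capacity is
 * left; if there is not enough headroom, fail the demotion so the
 * folio stays where it is.
 */
static bool compressed_tier_can_accept(struct cxl_comp_dev *dev, size_t bytes)
{
	/* free capacity as reported by the device control plane */
	return cxl_comp_read_free_capacity(dev) >= bytes + dev->reserve_margin;
}

static int demote_folio_to_compressed(struct folio *folio,
				      struct cxl_comp_dev *dev)
{
	if (!compressed_tier_can_accept(dev, folio_size(folio)))
		return -ENOSPC;	/* demotion fails, caller keeps the folio */

	return migrate_folio_to_device(folio, dev);	/* hypothetical helper */
}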
On Thu, 25 Sep 2025 16:03:46 +0200 Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote: Hi Yiannis, > > On 17 Sep 2025, at 18:49, Jonathan Cameron <jonathan.cameron@huawei.com> wrote: > > > > On Tue, 16 Sep 2025 17:30:46 -0700 > > Wei Xu <weixugc@google.com> wrote: > > > >> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote: > >>> > >>> On Wed, 10 Sep 2025, Gregory Price wrote: > >>> > >>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote: > >>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote: > >>>>>> This patchset introduces a new subsystem for hot page tracking > >>>>>> and promotion (pghot) that consolidates memory access information > >>>>>> from various sources and enables centralized promotion of hot > >>>>>> pages across memory tiers. > >>>>> > >>>>> Just to be clear, I continue to believe this is a terrible idea and we > >>>>> should not do this. If systems will be built with CXL (and given the > >>>>> horrendous performance, I cannot see why they would be), the kernel > >>>>> should not be migrating memory around like this. > >>>> > >>>> I've been considered this problem from the opposite approach since LSFMM. > >>>> > >>>> Rather than decide how to move stuff around, what if instead we just > >>>> decide not to ever put certain classes of memory on CXL. Right now, so > >>>> long as CXL is in the page allocator, it's the wild west - any page can > >>>> end up anywhere. > >>>> > >>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real > >>>> workloads to show local CXL expansion is valuable and performant enough > >>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE > >>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of > >>>> CXL, but allows any given user-driven page allocation (including page > >>>> cache, file, and anon mappings) to land there. > >>>> > >>> > [snip] > >>> There's also some feature support that is possible with these CXL memory > >>> expansion devices that have started to pop up in labs that can also > >>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to > >>> chime in as well. > >>> > >>> This topic seems due for an alignment session as well, so will look to get > >>> that scheduled in the coming weeks if people are up for it. > >> > >> Our experience is that workloads in hyper-scalar data centers such as > >> Google often have significant cold memory. Offloading this to CXL memory > >> devices, backed by cheaper, lower-performance media (e.g. DRAM with > >> hardware compression), can be a practical approach to reduce overall > >> TCO. Page promotion and demotion are then critical for such a tiered > >> memory system. > > > > For the hardware compression devices how are you dealing with capacity variation > > / overcommit? > I understand that this is indeed one of the key questions from the upstream > kernel’s perspective. > So, I am jumping in to answer w.r.t. what we do in ZeroPoint; obviously I can > not speak of other solutions/deployments. However, our HW interface follows > existing open specifications from OCP [1], so what I am describing below is > more widely applicable. > > At a very high level, the way our HW works is that the DPA is indeed > overcommitted. Then, there is a control plane over CXL.io (PCIe) which > exposes the real remaining capacity, as well as some configurable > MSI-X interrupts that raise warnings when the capacity crosses over > certain configurable thresholds. 
> > Last year I presented this interface in LSF/MM [2]. Based on the feedback I > got there, we have an early prototype that acts as the *last* memory tier > before reclaim (kind of "compressed tier in lieu of discard" as was > suggested to me by Dan). > > What is different from standard tiering is that the control plane is > checked on demotion to make sure there is still capacity left. If not, the > demotion fails. While this seems stable so far, a missing piece is to > ensure that this tier is mainly written by demotions and not arbitrary kernel > allocations (at least as a starting point). I want to explore how mempolicies > can help there, or something of the sort that Gregory described. > > This early prototype still needs quite some work in order to find the right > abstractions. Hopefully, I will be able to push an RFC in the near future > (a couple of months). > > > Whilst there have been some discussions on that but without a > > backing store of flash or similar it seems to be challenging to use > > compressed memory in a tiering system (so as 'normalish' memory) unless you > > don't mind occasionally and unexpectedly running out of memory (in nasty > > async ways as dirty cache lines get written back). > There are several things that may be done on the device side. For now, I > think the kernel should be unaware of these. But with what I described > above, the goal is to have the capacity thresholds configured in a way > that we can absorb the occasional dirty cache lines that are written back. In the worst case they are far from occasional. It's not hard to imagine a malicious program that ensures that all L3 in a system (say 256MiB+) is full of cache lines from the far compressed memory, all of which are changed in a fashion that makes the allocation much less compressible. If you are doing compression at cache line granularity that's not so bad because it would only be a 256MiB margin needed. If the system in question is doing large block size compression, say 4KiB, then we have a 64x write amplification multiplier. If the virus is streaming over memory, the evictions we see as a result of new lines being fetched are made much less compressible. Add an accelerator (say DPDK or other zero copy into userspace buffers) into the mix and you have a mess. You'll need to be extremely careful with what goes in this compressed memory or hold enormous buffer capacity against fast changes in compressibility. The key is that all software is potentially malicious (sometimes accidentally so ;) Now, if we can put this into a special pool where it is acceptable to drop the writes and return poison (so the application crashes) then that may be fine. Or block writes. Running compressed memory as read-only CoW is one way to avoid this problem. > > > > Or do you mean zswap type use with a hardware offload of the actual > > compression? > I would categorize this as a completely different discussion (and product > line for us). > > [1] https://www.opencompute.org/documents/hyperscale-tiered-memory-expander-specification-for-compute-express-link-cxl-1-pdf > [2] https://www.youtube.com/watch?v=tXWEbaJmZ_s > > Thanks, > Yiannis > > PS: Sending from a personal email address to avoid issues with > confidentiality footers of the corporate domain.
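To put rough numbers on the worst case described above, using the
figures from this mail (256 MiB of L3, 64 B cache lines, 4 KiB
compression blocks):

	256 MiB / 64 B per line              = 4M dirty cache lines
	4 KiB rewritten per 64 B dirty line  = 64x write amplification
	4M lines x 4 KiB                     = 16 GiB of blocks to recompress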
On Thu, Sep 25, 2025 at 5:01 PM Jonathan Cameron <jonathan.cameron@huawei.com> wrote: > > On Thu, 25 Sep 2025 16:03:46 +0200 > Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote: > > Hi Yiannis, Hi Jonathan! Thanks for your response! [snip] > > > There are several things that may be done on the device side. For now, I > > > think the kernel should be unaware of these. But with what I described > > > above, the goal is to have the capacity thresholds configured in a way > > > that we can absorb the occasional dirty cache lines that are written back. > > > > In worst case they are far from occasional. It's not hard to imagine a malicious This is correct. Any simplification on my end is mainly based on the empirical evidence of the use cases we are testing for (tiering). But I fully respect that we need to be proactive and assume the worst case scenario. > program that ensures that all L3 in a system (say 256MiB+) is full of cache lines > from the far compressed memory all of which are changed in a fashion that makes > the allocation much less compressible. If you are doing compression at cache line > granularity that's not so bad because it would only be 256MiB margin needed. > If the system in question is doing large block side compression, say 4KiB. > Then we have a 64x write amplification multiplier. If the virus is streaming over This is insightful indeed :). However, even in the case of the 64x amplification, you implicitly assume that each of the cachelines in the L3 belongs to a different page. But then one cache-line would not deteriorate the compressed size of the entire page that much (the bandwidth amplification on the device is a different -performance- story). So even in the 4K case the two ends of the spectrum are to either have big amplification with low compression ratio impact, or small amplification with higher compression ratio impact. Another practical assumption here is that the different HMU mechanisms would help promote the contended pages before this becomes a big issue. Which of course might still not be enough on the malicious streaming writes workload. Overall, I understand these are heuristics and I do see your point that this needs to be robust even for the maliciously behaving programs. > memory the evictions we are seeing at the result of new lines being fetched > to be made much less compressible. > > Add a accelerator (say DPDK or other zero copy into userspace buffers) into the > mix and you have a mess. You'll need to be extremely careful with what goes Good point about the zero copy stuff. > in this compressed memory or hold enormous buffer capacity against fast > changes in compressability. In my experience, the buffer capacity factor would be closer to the benefit that you get from the compression (e.g. 2x the cache size in your example). But I understand the burden of proof is on our end. As we move further with this I will try to provide data as well. > > Key is that all software is potentially malicious (sometimes accidentally so ;) > > Now, if we can put this into a special pool where it is acceptable to drop the writes > and return poison (so the application crashes) then that may be fine. > > Or block writes. Running compressed memory as read only CoW is one way to > avoid this problem. These could be good starting points, as I see in the rest of the thread. Thanks, Yiannis
On Thu, 16 Oct 2025 18:16:31 +0200 Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote: > On Thu, Sep 25, 2025 at 5:01 PM Jonathan Cameron > <jonathan.cameron@huawei.com> wrote: > > > > On Thu, 25 Sep 2025 16:03:46 +0200 > > Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote: > > > > Hi Yiannis, > Hi Jonathan! Thanks for your response! > Hi Yiannis, This is way more fun than doing real work ;) > [snip] > > > There are several things that may be done on the device side. For now, I > > > think the kernel should be unaware of these. But with what I described > > > above, the goal is to have the capacity thresholds configured in a way > > > that we can absorb the occasional dirty cache lines that are written back. > > > > In worst case they are far from occasional. It's not hard to imagine a malicious > This is correct. Any simplification on my end is mainly based on the > empirical evidence of the use cases we are testing for (tiering). But > I fully respect that we need to be proactive and assume the worst case > scenario. > > program that ensures that all L3 in a system (say 256MiB+) is full of cache lines > > from the far compressed memory all of which are changed in a fashion that makes > > the allocation much less compressible. If you are doing compression at cache line > > granularity that's not so bad because it would only be 256MiB margin needed. > > If the system in question is doing large block side compression, say 4KiB. > > Then we have a 64x write amplification multiplier. If the virus is streaming over > This is insightful indeed :). However, even in the case of the 64x > amplification, you implicitly assume that each of the cachelines in > the L3 belongs to a different page. But then one cache-line would not > deteriorate the compressed size of the entire page that much (the > bandwidth amplification on the device is a different -performance- > story). This is putting limits on what compression algorithm is used. We could do that but then we'd have to never support anything different. Maybe if the device itself provided the worst-case amplification numbers that would do. Any device that gets this wrong is buggy - but it might be hard to detect that if people don't publish their compression algs and the proofs of worst-case blow-up of compression blocks. I guess we could do the maths on what the device manufacturer says and if we don't believe them or they haven't provided enough info to check, double it :) > So even in the 4K case the two ends of the spectrum are to > either have big amplification with low compression ratio impact, or > small amplification with higher compression ratio impact. > Another practical assumption here, is that the different HMU > mechanisms would help promote the contended pages before this becomes > a big issue. Which of course might still not be enough on the > malicious streaming writes workload. Using promotion to get you out of this is a non-starter unless you have a backstop because we'll have annoying things like pinning going on or bandwidth bottlenecks at the promotion target. Promotion might massively reduce the performance impact of course under normal conditions. > Overall, I understand these are heuristics and I do see your point > that this needs to be robust even for the maliciously behaving > programs. > > memory the evictions we are seeing at the result of new lines being fetched > > to be made much less compressible. 
> > > > Add a accelerator (say DPDK or other zero copy into userspace buffers) into the > > mix and you have a mess. You'll need to be extremely careful with what goes > Good point about the zero copy stuff. > > in this compressed memory or hold enormous buffer capacity against fast > > changes in compressability. > To my experience the factor of buffer capacity would be closer to the > benefit that you get from the compression (e.g. 2x the cache size in > your example). > But I understand the burden of proof is on our end. As we move further > with this I will try to provide data as well. If we are aiming for generality the nasty problem is that either we have to write rules on what Linux will cope with, or design it to cope with the worst possible implementation :( I can think of lots of plausible-sounding cases that have horrendous multiplication factors if done in a naive fashion.
* De-duplication
* Metadata flag for all 0s
* Some general purpose compression algs are very vulnerable to the tails of the probability distributions. Some will flip between multiple modes with very different characteristics, perhaps to meet latency guarantees.
Would be fun to ask an information theorist / compression expert to lay out an algorithm with the worst possible tail performance but with good average. > > > > Key is that all software is potentially malicious (sometimes accidentally so ;) > > > > Now, if we can put this into a special pool where it is acceptable to drop the writes > > and return poison (so the application crashes) then that may be fine. > > > > Or block writes. Running compressed memory as read only CoW is one way to > > avoid this problem. > These could be good starting points, as I see in the rest of the thread. > Fun problems. Maybe we start with very conservative handling and then argue for relaxations later. Jonathan > Thanks, > Yiannis
On Mon, Oct 20, 2025 at 03:23:45PM +0100, Jonathan Cameron wrote: > On Thu, 16 Oct 2025 18:16:31 +0200 > Yiannis Nikolakopoulos <yiannis.nikolakop@gmail.com> wrote: > > > These could be good starting points, as I see in the rest of the thread. > > > Fun problems. Maybe we start with very conservative handling and then > argue for relaxations later. > Not to pile on, but if we can't even manage the conservative handling due to other design issues - then it doesn't bode well for the rest. So getting that right should be the priority - not a maybe. ~Gregory
On Thu, Sep 25, 2025 at 04:00:58PM +0100, Jonathan Cameron wrote: > Now, if we can put this into a special pool where it is acceptable to drop the writes > and return poison (so the application crashes) then that may be fine. > > Or block writes. Running compressed memory as read only CoW is one way to > avoid this problem. > This is an interesting thought. If you drop a write and return poison, can you instead handle the poison message as a fault and promote on fault? Then you might just be able to turn this whole thing into a zswap backend that promotes on write. Then you don't particularly care about stronger isolation controls (except maybe keeping kernel memory out of those regions). ~Gregory
On Thu, 25 Sep 2025 11:08:59 -0400 Gregory Price <gourry@gourry.net> wrote: > On Thu, Sep 25, 2025 at 04:00:58PM +0100, Jonathan Cameron wrote: > > Now, if we can put this into a special pool where it is acceptable to drop the writes > > and return poison (so the application crashes) then that may be fine. > > > > Or block writes. Running compressed memory as read only CoW is one way to > > avoid this problem. > > > > This is an interesting thought. If you drop a write and return poison, > can you instead handle the poison message as a fault and promote on > fault? Then you might just be able to turn this whole thing into a > zswap backend that promotes on write. Poison only comes on a subsequent read, so you don't see anything at write time (writes are inherently asynchronous due to cache write-back). There are only a few ways to do writes that are allowed to fail (the 64 byte atomic deferrable write stuff) and I think on all architectures where they can even be pointed at main memory, they only defer if on uncacheable memory. Seeing poison on subsequent read is far too late to promote the page, you've lost the data. The poison only works as an ultimate safety gate. Also once you've tripped it the device probably needs to drop all writes and return poison on all reads, not just the problem one (otherwise things might fail much later). The CoW thing only works because it's a permissions fault at the point of asking for permission to write (so way before it goes into the cache). Then you can check margins to make sure you can still sink all outstanding writes if they become uncompressible and only let the write through if safe - if not, promote some stuff before letting it proceed. Or you just promote on write and rely on the demotion path performing those careful checks later. Jonathan > > Then you don't particular care about stronger isolation controls > (except maybe keeping kernel memory out of those regions). > > ~Gregory
On Thu, Sep 25, 2025 at 04:24:26PM +0100, Jonathan Cameron wrote: > The CoW thing only works because it's a permissions fault at point of > asking for permission to write (so way before it goes into the cache). > Then you can check margins to make sure you can still sink all outstanding > writes if they become uncompressible and only let the write through if safe > - if not promote some stuff before letting it proceed. > Or you just promote on write and rely on the demotion path performing those > careful checks later. > Agreed. The question is now whether you can actually enforce page table bits not changing. I think you'd need your own fault handling infrastructure / driver for these pages. This does smell a lot like a kernel-internal dax allocation interface. There was a bunch of talk about virtualizing zswap backends, so that might be a nice place to look to insert this kind of hook. Then the device driver (which it will definitely need) would have to field page faults accordingly. It feels much more natural to put this as a zswap/zram backend. ~Gregory
On Thu, 25 Sep 2025 12:06:28 -0400 Gregory Price <gourry@gourry.net> wrote: > On Thu, Sep 25, 2025 at 04:24:26PM +0100, Jonathan Cameron wrote: > > The CoW thing only works because it's a permissions fault at point of > > asking for permission to write (so way before it goes into the cache). > > Then you can check margins to make sure you can still sink all outstanding > > writes if they become uncompressible and only let the write through if safe > > - if not promote some stuff before letting it proceed. > > Or you just promote on write and rely on the demotion path performing those > > careful checks later. > > > > Agreed. The question is now whether you can actually enforce page table > bits not changing. I think you'd need your own fault handling > infrastructure / driver for these pages. > > This does smell a lot like a kernel-internal dax allocation interface. > There was a bunch of talk about virtualizing zswap backends, so that > might be a nice place to look to insert this kind of hook. > > Then the device driver (which it will definitely need) would have to > field page faults accordingly. > > It feels much more natural to put this as a zswap/zram backend. > Agreed. I currently see two paths that are generic (ish).
1. zswap route - faulting as you describe on writes.
2. Fail-safe route - Map compressible memory into a VM (or application) you don't mind killing if we lose that promotion race due to a pathological application. The attacker only disturbs memory allocated to that application / VM so the blast radius is contained.
Jonathan > ~Gregory
On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote: > On Thu, 25 Sep 2025 12:06:28 -0400 > Gregory Price <gourry@gourry.net> wrote: > > > It feels much more natural to put this as a zswap/zram backend. > > > Agreed. I currently see two paths that are generic (ish). > > 1. zswap route - faulting as you describe on writes. aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub The interposition point for zswap/zram is the PTE present bit being hacked off to generate access faults. If you want any random VMA to be eligible for demotion into the tier, then you need to override that VMA's fault/protect hooks in its vm_area_struct. This idea is a non-starter. What you'd have to do is have those particular vm_area_struct's be provided by some special allocator that says the memory is eligible for demotion to compressed memory, and to route all faults through it. That looks a lot like hacking up mm internals to support a single hardware case. Hard to justify. This may quite literally only be possible to do for unmapped pages, which would limit the application to things like mm/filemap.c and making IO (read/write) calls faster. Which - hey - maybe that's the best use-case anyway. Have all the read-only compressible filecache you want. At least you avoid touching disk. ~Gregory
On Thu, Sep 25, 2025 at 03:02:16PM -0400, Gregory Price wrote:
> On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote:
> > On Thu, 25 Sep 2025 12:06:28 -0400
> > Gregory Price <gourry@gourry.net> wrote:
> >
> > > It feels much more natural to put this as a zswap/zram backend.
> > >
> > Agreed. I currently see two paths that are generic (ish).
> >
> > 1. zswap route - faulting as you describe on writes.
>
> aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub
>
> The interposition point for zswap/zram is the PTE present bit being
> hacked off to generate access faults.
>
I went digging around a bit.
Not only this, but the PTE is used to store the swap entry ID, so you
can't just use a swap backend and keep the mapping. It's just not a
compatible abstraction - so as a zswap-backend this is DOA.
Even if you could figure out a way to re-use the abstraction and just
take a hard-fault to fault it back in as read-only, you lose the swap
entry on fault. That just gets nasty trying to reconcile the
differences between this interface and swap at that point.
So here's a fun proposal. I'm not sure of how NUMA nodes for devices
get determined -
1. Carve out an explicit proximity domain (NUMA node) for the compressed
region via SRAT.
https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
2. Make sure this proximity domain (NUMA node) has separate data in the
HMAT so it can be an explicit demotion target for higher tiers
https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html
3. Create a node-to-zone-allocator registration and retrieval function
device_folio_alloc = nid_to_alloc(nid)
4. Create a DAX extension that registers the above allocator interface
5. in `alloc_migration_target()` mm/migrate.c
Since nid is not a valid buddy-allocator target, everything here
will fail. So we can simply append the following to the bottom
device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
if (device_folio_alloc)
folio = device_folio_alloc(...)
return folio;
6. in `struct migration_target_control` add a new .no_writable value
- This will say the new mapping replacements should have the
writable bit chopped off.
7. On write-fault, extend mm/memory.c:do_numa_page to detect this
and simply promote the page to allow writes. Write faults will
be expensive, but you'll have pretty strong guarantees around
not unexpectedly running out of space.
You can then loosen the .no_writable restriction with settings if
you have high confidence that your system will outrun your ability
to promote/evict/whatever if device memory becomes hot.
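To make steps 3-5 concrete, a minimal sketch of one way the
registration and lookup could look (nid_to_alloc/device_folio_alloc
are the names from the steps above; the registration side, the
MAX_NUMNODES-sized table and the single-argument form are assumptions
of this sketch, not a settled interface):

/*
 * Sketch only: devices (e.g. via a DAX extension, step 4) register a
 * folio allocator for the NUMA node they back; alloc_migration_target()
 * falls back to it when the target nid is not buddy-allocator backed.
 */
typedef struct folio *(*device_folio_alloc_t)(gfp_t gfp, unsigned int order);

static device_folio_alloc_t device_folio_allocs[MAX_NUMNODES];

int register_device_folio_alloc(int nid, device_folio_alloc_t alloc)
{
	if (nid < 0 || nid >= MAX_NUMNODES || device_folio_allocs[nid])
		return -EBUSY;
	device_folio_allocs[nid] = alloc;
	return 0;
}

static device_folio_alloc_t nid_to_alloc(int nid)
{
	return device_folio_allocs[nid];
}

/* Appended at the bottom of alloc_migration_target(), as in step 5: */
	device_folio_alloc = nid_to_alloc(nid);
	if (device_folio_alloc)
		folio = device_folio_alloc(gfp_mask, order);
	return folio;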
The only thing I don't know off hand is how shared pages will work in
this setup. For VMAs with a mapping that exist at demotion time, this
all works wonderfully - less so if the mapping doesn't exist or a new
VMA is created after a demotion has occurred.
I don't know what will happen there.
I think this would also sate the desire for a "separate CXL allocator"
for integration into other paths as well.
~Gregory
On Wed, Oct 1, 2025 at 9:22 AM Gregory Price <gourry@gourry.net> wrote: > > On Thu, Sep 25, 2025 at 03:02:16PM -0400, Gregory Price wrote: > > On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote: > > > On Thu, 25 Sep 2025 12:06:28 -0400 > > > Gregory Price <gourry@gourry.net> wrote: > > > > > > > It feels much more natural to put this as a zswap/zram backend. > > > > > > > Agreed. I currently see two paths that are generic (ish). > > > > > > 1. zswap route - faulting as you describe on writes. > > > > aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub > > > > The interposition point for zswap/zram is the PTE present bit being > > hacked off to generate access faults. > > > > I went digging around a bit. > > Not only this, but the PTE is used to store the swap entry ID, so you > can't just use a swap backend and keep the mapping. It's just not a > compatible abstraction - so as a zswap-backend this is DOA. > > Even if you could figure out a way to re-use the abstraction and just > take a hard-fault to fault it back in as read-only, you lose the swap > entry on fault. That just gets nasty trying to reconcile the > differences between this interface and swap at that point. > > So here's a fun proposal. I'm not sure of how NUMA nodes for devices > get determined - > > 1. Carve out an explicit proximity domain (NUMA node) for the compressed > region via SRAT. > https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html > > 2. Make sure this proximity domain (NUMA node) has separate data in the > HMAT so it can be an explicit demotion target for higher tiers > https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html This makes sense. I've done a dirty hardcoding trick in my prototype so that my node is always the last target. I'll have a look at how to make this right. > > 3. Create a node-to-zone-allocator registration and retrieval function > device_folio_alloc = nid_to_alloc(nid) > > 4. Create a DAX extension that registers the above allocator interface > > 5. in `alloc_migration_target()` mm/migrate.c > Since nid is not a valid buddy-allocator target, everything here > will fail. So we can simply append the following to the bottom > > device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC); > if (device_folio_alloc) > folio = device_folio_alloc(...) > return folio; In my current prototype alloc_migration_target was working (naively). Steps 3, 4 and 5 seem like an interesting thing to try after all this discussion. > > 6. in `struct migration_target_control` add a new .no_writable value > - This will say the new mapping replacements should have the > writable bit chopped off. > > 7. On write-fault, extent mm/memory.c:do_numa_page to detect this > and simply promote the page to allow writes. Write faults will > be expensive, but you'll have pretty strong guarantees around > not unexpectedly running out of space. > > You can then loosen the .no_writable restriction with settings if > you have high confidence that your system will outrun your ability > to promote/evict/whatever if device memory becomes hot. That looks modular enough to allow me to test both writable and no_writable and compare. > > The only thing I don't know off hand is how shared pages will work in > this setup. For VMAs with a mapping that exist at demotion time, this > all works wonderfully - less so if the mapping doesn't exist or a new > VMA is created after a demotion has occurred. I'll keep that in mind. > > I don't know what will happen there. 
> > I think this would also sate the desire for a "separate CXL allocator" > for integration into other paths as well. > > ~Gregory Thanks a lot for all the discussion and the input. I can move my prototype towards this direction and will get back with what I 've learned and an RFC if it makes sense. Please keep me in the loop in any related discussions. Best, /Yiannis
On Fri, Oct 17, 2025 at 11:53:31AM +0200, Yiannis Nikolakopoulos wrote:
> On Wed, Oct 1, 2025 at 9:22 AM Gregory Price <gourry@gourry.net> wrote:
> > 1. Carve out an explicit proximity domain (NUMA node) for the compressed
> > region via SRAT.
> > https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
> >
> > 2. Make sure this proximity domain (NUMA node) has separate data in the
> > HMAT so it can be an explicit demotion target for higher tiers
> > https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html
> This makes sense. I've done a dirty hardcoding trick in my prototype
> so that my node is always the last target. I'll have a look on how to
> make this right.
I think it's probably a CEDT/CDAT/HMAT/SRAT/etc negotiation.
Essentially the platform needs to allow a single device to expose
multiple NUMA nodes, based on different expected performance, from
those ranges. Then software needs to program the HDM decoders
appropriately.
> > 5. in `alloc_migration_target()` mm/migrate.c
> > Since nid is not a valid buddy-allocator target, everything here
> > will fail. So we can simply append the following to the bottom
> >
> > device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
> > if (device_folio_alloc)
> > folio = device_folio_alloc(...)
> > return folio;
> In my current prototype alloc_migration_target was working (naively).
> Steps 3, 4 and 5 seem like an interesting thing to try after all this
> discussion.
> >
Right, because the memory is directly accessible to the buddy allocator.
What I'm proposing would remove this memory from the buddy allocator and
force more explicit integration (in this case, with this function).
More explicitly: in this design __folio_alloc can never access this
memory.
~Gregory
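
[To make the shape of steps 3-5 above a bit more concrete, here is a
minimal sketch of what a per-node allocator registration and the
alloc_migration_target() fallback could look like. Only the names
nid_to_alloc()/device_folio_alloc/DEVICE_FOLIO_ALLOC come from the
proposal itself; the lookup table, register_device_folio_alloc() and
the single-argument lookup are assumptions made purely for
illustration, not code from any posted patch.]

#include <linux/gfp.h>
#include <linux/mm_types.h>
#include <linux/numa.h>

typedef struct folio *(*device_folio_alloc_t)(int nid, gfp_t gfp,
					      unsigned int order);

/* One slot per node; only device-backed nodes ever register anything. */
static device_folio_alloc_t device_alloc_table[MAX_NUMNODES];

/* Called by a DAX-style driver that owns the device-backed node (step 4). */
int register_device_folio_alloc(int nid, device_folio_alloc_t fn)
{
	if (nid < 0 || nid >= MAX_NUMNODES)
		return -EINVAL;
	device_alloc_table[nid] = fn;
	return 0;
}

/* Step 3: retrieval side, simplified to a single argument here. */
static device_folio_alloc_t nid_to_alloc(int nid)
{
	return device_alloc_table[nid];
}

/*
 * Step 5: appended at the bottom of alloc_migration_target() in
 * mm/migrate.c.  The buddy allocator cannot satisfy this nid, so fall
 * back to the device's own allocator if one is registered.
 */
static struct folio *device_migration_fallback(int nid, gfp_t gfp,
					       unsigned int order)
{
	device_folio_alloc_t device_folio_alloc = nid_to_alloc(nid);

	if (device_folio_alloc)
		return device_folio_alloc(nid, gfp, order);
	return NULL;
}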
On Fri, 17 Oct 2025 10:15:57 -0400 Gregory Price <gourry@gourry.net> wrote: > On Fri, Oct 17, 2025 at 11:53:31AM +0200, Yiannis Nikolakopoulos wrote: > > On Wed, Oct 1, 2025 at 9:22 AM Gregory Price <gourry@gourry.net> wrote: > > > 1. Carve out an explicit proximity domain (NUMA node) for the compressed > > > region via SRAT. > > > https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html > > > > > > 2. Make sure this proximity domain (NUMA node) has separate data in the > > > HMAT so it can be an explicit demotion target for higher tiers > > > https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html > > This makes sense. I've done a dirty hardcoding trick in my prototype > > so that my node is always the last target. I'll have a look on how to > > make this right. > > I think it's probably a CEDT/CDAT/HMAT/SRAT/etc negotiation. > > Essentially the platform needs to allow a single device to expose > multiple numa nodes based on different expected performance. From > those ranges. Then software needs to program the HDM decoders > appropriately. It's a bit 'fuzzy' to justify but maybe (for CXL) a CFWMS flag (so CEDT as you mention) to say this host memory region may be backed by compressed memory? Might be able to justify it from spec point of view by arguing that compression is a QoS related characteristic. Always possible host hardware will want to handle it differently before it even hits the bus even if it's just a case throttling writing differently. That then ends up in it's own NUMA node. Whether we take on the splitting CFMWS entries into multiple NUMA nodes depending on what backing devices end up in them is something we kicked into the long grass originally, but that can definitely be revisited. That doesn't matter for initial support of compressed memory though if we can do it via a seperate CXL Fixed Memory Window Structure (CFMWS) in CEDT. > > > > 5. in `alloc_migration_target()` mm/migrate.c > > > Since nid is not a valid buddy-allocator target, everything here > > > will fail. So we can simply append the following to the bottom > > > > > > device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC); > > > if (device_folio_alloc) > > > folio = device_folio_alloc(...) > > > return folio; > > In my current prototype alloc_migration_target was working (naively). > > Steps 3, 4 and 5 seem like an interesting thing to try after all this > > discussion. > > > > > Right because the memory is directly accessible to the buddy allocator. > What i'm proposing would remove this memory from the buddy allocator and > force more explicit integration (in this case with this function). > > more explicitly: in this design __folio_alloc can never access this > memory. > > ~Gregory
On Fri, Oct 17, 2025 at 03:36:13PM +0100, Jonathan Cameron wrote: > On Fri, 17 Oct 2025 10:15:57 -0400 > Gregory Price <gourry@gourry.net> wrote: > > > > Essentially the platform needs to allow a single device to expose > > multiple numa nodes based on different expected performance. From > > those ranges. Then software needs to program the HDM decoders > > appropriately. > > It's a bit 'fuzzy' to justify but maybe (for CXL) a CFWMS flag (so CEDT > as you mention) to say this host memory region may be backed by > compressed memory? > > Might be able to justify it from spec point of view by arguing that > compression is a QoS related characteristic. Always possible host > hardware will want to handle it differently before it even hits the > bus even if it's just a case throttling writing differently. > That's a Consortium discussion to have (and I am not of the consortium :P), but yeah you could do it that way. More generally could have a "Not-for-general-consumption bit" instead of specifically a compressed bit. Maybe both a "No-Consume" and a "Special Node" bit would be useful separately. Of course then platforms need to be made to understand all these: "No-Consume" -> force EFI_MEMORY_SP or leave it reserved "Special Node" -> allocate its own PXM / Provide discrete CFMWS Naming obviously non-instructive here, may as well call them Nancy and Bob bits. > That then ends up in it's own NUMA node. Whether we take on the > splitting CFMWS entries into multiple NUMA nodes depending on what > backing devices end up in them is something we kicked into the long > grass originally, but that can definitely be revisited. That > doesn't matter for initial support of compressed memory though if > we can do it via a seperate CXL Fixed Memory Window Structure (CFMWS) > in CEDT. > This is the way I would initially approach it tbh - but i'm also not a hardware/firmware person, so i don't know exactly what bits a device would set to tell BIOS/EFI "Hey, give this chunk its own CFMWS", or if that lies solely with BIOS/EFI. ~Gregory
On Fri, 17 Oct 2025 10:59:01 -0400 Gregory Price <gourry@gourry.net> wrote: > On Fri, Oct 17, 2025 at 03:36:13PM +0100, Jonathan Cameron wrote: > > On Fri, 17 Oct 2025 10:15:57 -0400 > > Gregory Price <gourry@gourry.net> wrote: > > > > > > Essentially the platform needs to allow a single device to expose > > > multiple numa nodes based on different expected performance. From > > > those ranges. Then software needs to program the HDM decoders > > > appropriately. > > > > It's a bit 'fuzzy' to justify but maybe (for CXL) a CFWMS flag (so CEDT > > as you mention) to say this host memory region may be backed by > > compressed memory? > > > > Might be able to justify it from spec point of view by arguing that > > compression is a QoS related characteristic. Always possible host > > hardware will want to handle it differently before it even hits the > > bus even if it's just a case throttling writing differently. > > > > That's a Consortium discussion to have (and I am not of the > consortium :P), but yeah you could do it that way. The moment I know it's raised there I (and others involved in consortium) can't talk about it in public. (I love standards org IP rules!) So it's useful to have a pre discussion before that happens. We've done this before for other topics and it can be very productive. > > More generally could have a "Not-for-general-consumption bit" instead > of specifically a compressed bit. Maybe both a "No-Consume" and a > "Special Node" bit would be useful separately. > > Of course then platforms need to be made to understand all these: > > "No-Consume" -> force EFI_MEMORY_SP or leave it reserved > "Special Node" -> allocate its own PXM / Provide discrete CFMWS > > Naming obviously non-instructive here, may as well call them Nancy and > Bob bits. For compression specifically I think there is value in making it explicitly compression because the host hardware might handle that differently. The other bits might be worth having as well though. SPM was all about 'you could' use it as normal memory but someone put it there for something else. This more a case of SPOM. Specific Purpose Only Memory - eats babies if you don't know the extra rules for each instance of that. > > > That then ends up in it's own NUMA node. Whether we take on the > > splitting CFMWS entries into multiple NUMA nodes depending on what > > backing devices end up in them is something we kicked into the long > > grass originally, but that can definitely be revisited. That > > doesn't matter for initial support of compressed memory though if > > we can do it via a seperate CXL Fixed Memory Window Structure (CFMWS) > > in CEDT. > > > > This is the way I would initially approach it tbh - but i'm also not a > hardware/firmware person, so i don't know exactly what bits a device > would set to tell BIOS/EFI "Hey, give this chunk its own CFMWS", or if > that lies solely with BIOS/EFI. It's not a device thing wrt to nodes today (and there are good reasons why it should not be at that granularity e.g. node explosion has costs). The BIOS might pre setup the decoders and even lock them, but I'd expect we'll move away from that to fully OS managed over time (to get flexibility) - exception to that being when confidential compute is making its usual mess of things. Maybe the BIOS would have a look at devices and decide to enable a compressed memory CFMWS if it finds devices that need it and not do so otherwise, though not doing so breaks hotplug of compressed memory devices. 
So my guess is either we need to fix Linux to allow splitting a fixed
memory window up into multiple NUMA nodes, or platforms have to spin
extra fixed memory windows (host side PA ranges with a NUMA node for
each).

Which option depends a bit on whether we expect host hardware to either
handle compressed differently from normal ram, or at least separate it
for QoS reasons.

What fun.

J

>
> ~Gregory
On Mon, Oct 20, 2025 at 03:05:26PM +0100, Jonathan Cameron wrote: > > More generally could have a "Not-for-general-consumption bit" instead > > of specifically a compressed bit. Maybe both a "No-Consume" and a > > "Special Node" bit would be useful separately. > > > > Of course then platforms need to be made to understand all these: > > > > "No-Consume" -> force EFI_MEMORY_SP or leave it reserved > > "Special Node" -> allocate its own PXM / Provide discrete CFMWS > > > > Naming obviously non-instructive here, may as well call them Nancy and > > Bob bits. > > For compression specifically I think there is value in making it > explicitly compression because the host hardware might handle that > differently. The other bits might be worth having as well > though. SPM was all about 'you could' use it as normal memory but > someone put it there for something else. This more a case of > SPOM. Specific Purpose Only Memory - eats babies if you don't know > the extra rules for each instance of that. > This is a fair point. Something like a SPOM bit that says some other bit-field is valid and you get to add new extensions about how the memory should be used? :shrug: probably sufficiently extensible but maybe never used for anything more than compression. > Maybe the BIOS would have a look at devices and decide to enable a > compressed memory CFMWS if it finds devices that need it and not do > so otherwise, though not doing so breaks hotplug of compressed memory devices. > > So my guess is either we need to fix Linux to allow splitting a fixed > memory window up into multiple NUMA nodes, or platforms have to spin > extra fixed memory windows (host side PA ranges with a NUMA node for each). > I don't think splitting a CFMW into multiple nodes is feasible, but I also haven't looked at that region of ACPI code since i finished the docs. I can look into that. I would prefer the former, since this is already what's done for hostbridge interleave vs non-interleave setups, where the host may expose multiple CFMW for the same devices depending on how the OS. > Which option depends a bit on whether we expect host hardware to either > handle compressed differently from normal ram, or at least separate it > for QoS reasons. > There's only a handful of folks discussing this at the moment, but so far we've all be consistent in our gut telling us it should be handled differently for reliability reasons. But also, more opinions always welcome :] ~Gregory
On Tue, Oct 21, 2025 at 02:52:40PM -0400, Gregory Price wrote:
> I would prefer the former, since this is already what's done for
> hostbridge interleave vs non-interleave setups, where the host may
> expose multiple CFMW for the same devices depending on how the OS.

bah, got distracted

"Depending on how the OS may choose to configure things at some unknown
point in the future"

~Gregory
On Tue, 21 Oct 2025 14:57:26 -0400 Gregory Price <gourry@gourry.net> wrote: > On Tue, Oct 21, 2025 at 02:52:40PM -0400, Gregory Price wrote: > > I would prefer the former, since this is already what's done for > > hostbridge interleave vs non-interleave setups, where the host may > > expose multiple CFMW for the same devices depending on how the OS. > > bah, got distracted > > "Depending on how the OS may choose to configure things at some unknown > point in the future" My gut feeling is the need to do dynamic NUMA nodes will not be driven but this but more by large scale fabrics (if that ever happens) and trade offs of host PA space vs QoS in the hardware. Those trade offs might put memory with very different performance characteristics behind one window. Maybe it'll become a thing that can be used for compression. Otherwise compression from host hardware point of view might be like the question of share or separate fixed memory windows for persistent / volatile. Ideally they'd be separate but if Host PA space is limited, someone might build a system where a single fixed memory window is used to support both. Possible virtualization of some of this stuff will make it more complex again. Any crazy mess can share a fake fixed memory window as the QoS is all behind some page tables. Meh. Let's suggest people burn host PA space for now. If anyone hits that limit they can solve it (crosses fingers it's not my lot :) Jonathan > > > ~Gregory
On Wed, Oct 22, 2025 at 10:09:50AM +0100, Jonathan Cameron wrote: > > My gut feeling is the need to do dynamic NUMA nodes will not be driven > but this but more by large scale fabrics (if that ever happens) > and trade offs of host PA space vs QoS in the hardware. Those > trade offs might put memory with very different performance > characteristics behind one window. > I can't believe we live in a world where "We have to think about the scenario where we actually need all 256 TB of 48-bit phys-addressing" is not a tongue in cheek joke o_o That's a paltry 2048 128GB DIMMs... and whatever monstrosity you have to build to host it all but that's at least a fun engineering problem :V Bring on the 128-bit CPUs! What do we name those x86 registers though? Slap the E back on for ERAX? > Meh. Let's suggest people burn host PA space for now. If anyone hits > that limit they can solve it (crosses fingers it's not my lot :) > +1 ~Gregory
On Wed, 22 Oct 2025 11:05:16 -0400 Gregory Price <gourry@gourry.net> wrote: > On Wed, Oct 22, 2025 at 10:09:50AM +0100, Jonathan Cameron wrote: > > > > My gut feeling is the need to do dynamic NUMA nodes will not be driven > > but this but more by large scale fabrics (if that ever happens) > > and trade offs of host PA space vs QoS in the hardware. Those > > trade offs might put memory with very different performance > > characteristics behind one window. > > > > I can't believe we live in a world where "We have to think about the > scenario where we actually need all 256 TB of 48-bit phys-addressing" > is not a tongue in cheek joke o_o You think everyone wires all the 48 bits? Certainly not everyone does. > > That's a paltry 2048 128GB DIMMs... and whatever monstrosity you have to > build to host it all but that's at least a fun engineering problem :V > > Bring on the 128-bit CPUs! > > What do we name those x86 registers though? Slap the E back on for ERAX? > > > Meh. Let's suggest people burn host PA space for now. If anyone hits > > that limit they can solve it (crosses fingers it's not my lot :) > > > > +1 > > ~Gregory
On Thu, Sep 25, 2025 at 11:08:59AM -0400, Gregory Price wrote:
> On Thu, Sep 25, 2025 at 04:00:58PM +0100, Jonathan Cameron wrote:
> > Now, if we can put this into a special pool where it is acceptable to drop the writes
> > and return poison (so the application crashes) then that may be fine.
> >
> > Or block writes. Running compressed memory as read only CoW is one way to
> > avoid this problem.
> >
> This is an interesting thought. If you drop a write and return poison,
> can you instead handle the poison message as a fault and promote on
> fault? Then you might just be able to turn this whole thing into a
> zswap backend that promotes on write.
>

I just realized this would require some mechanism to re-issue the
write. So yeah, you'd have to do this via some heroic page table
enforcement.

The key observation here is that zswap hacks off all the page table
entries - rather than leave them present and readable.

In this design, you want to leave them present and readable, and
therefore need some way to prevent entries from changing out from
under you.

> Then you don't particular care about stronger isolation controls
> (except maybe keeping kernel memory out of those regions).
>
> ~Gregory
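
[A purely illustrative sketch of the "present and readable, promote on
the first write" idea discussed above. This is not code from the
patchset or the prototype; promote_folio_to_toptier() is a made-up
helper, and only the vm_fault_t return codes are real kernel
definitions.]

#include <linux/mm.h>
#include <linux/mm_types.h>

/* Hypothetical helper: migrate @folio to a top-tier node, true on success. */
extern bool promote_folio_to_toptier(struct folio *folio);

static vm_fault_t promote_on_write_fault(struct folio *folio)
{
	/*
	 * Under this scheme the PTE stays present and readable, so reads
	 * hit the compressed device directly and only the first write to
	 * a demoted folio ends up here.  Promote the folio to a top-tier
	 * node and let the access retry against the writable copy.
	 */
	if (promote_folio_to_toptier(folio))
		return VM_FAULT_RETRY;

	/* Could not promote (e.g. top tier is full): fail the write. */
	return VM_FAULT_SIGBUS;
}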
On Thu, Sep 25, 2025 at 04:03:46PM +0200, Yiannis Nikolakopoulos wrote: > > > > For the hardware compression devices how are you dealing with capacity variation > > / overcommit? ... > What is different from standard tiering is that the control plane is > checked on demotion to make sure there is still capacity left. If not, the > demotion fails. While this seems stable so far, a missing piece is to > ensure that this tier is mainly written by demotions and not arbitrary kernel > allocations (at least as a starting point). I want to explore how mempolicies > can help there, or something of the sort that Gregory described. > Writing back the description as i understand it: 1) The intent is to only have this memory allocable via demotion (i.e. no fault or direct allocation from userland possible) 2) The intent is to still have this memory accessible directly (DMA), while compressed, not trigger a fault/promotion on access (i.e. no zswap faults) 3) The intent is to have an external monitoring software handle outrunning run-away decompression/hotness by promoting that data. So basically we want a zswap-like interface for allocation, but to retain the `struct page` in page tables such that no faults are incurred on access. Then if the page becomes hot, depend on some kind of HMU tiering system to get it off the device. I think we all understand there's some bear we have to outrun to deal with problem #3 - and many of us are skeptical that the bear won't catch up with our pants down. Let's ignore this for the moment. If such a device's memory is added to the default page allocator, then the question becomes one of *isolation* - such that the kernel will provide some "Capital-G Guarantee" that certain NUMA nodes will NEVER be used except under very explicit scenarios. There are only 3 mechanisms with which to restrict this (presently): 1) ZONE membership (to disallow GFP_KERNEL) 2) cgroups->cpusets->mems_allowed 3) task/vma mempolicy (obvious #4: Don't put it in the default page allocator) cpusets and mempolicy are not sufficient to provide full isolation - cgroups have the opposite hierarchical relationship than desired. The parent cgroup will lock out all children cgroups from using nodes not present in the parent mems_allowed. e.g. if you lock out access from the root cgroup, no cgroup on the entire system is eligible to allocate the memory. If you don't lock out the root cgroup - any root cgroup task is eligible. This isn't tractible. - task/vma mempolicy gets ignored in many cases and is closer to a suggestion than enforcible. It's also subject to rebinding as a task's cgroups.cpuset.mems_allowed changes. I haven't read up enough on ZONE_DEVICE to understand the implications of membership there, but have you explored this as an option? I don't see the work i'm doing intersecting well with your efforts - except maybe on the vmscan.c work around allocation on demotion. The work i'm doing is more aligned with - hey, filesystems are a global resource, why are we using cgroup/task/vma policies to dictate whether a filesystem's cache is eligible to land in remote nodes? i.e. drawing better boundaries and controls around what can land in some set of remote nodes "by default". You're looking for *strong isolation* controls, which implies a different kind of allocator interface. ~Gregory
Hi Gregory, Thanks for all the feedback. I am finally getting some time to come back to this. On Thu, Sep 25, 2025 at 4:41 PM Gregory Price <gourry@gourry.net> wrote: > > On Thu, Sep 25, 2025 at 04:03:46PM +0200, Yiannis Nikolakopoulos wrote: > > > > > > For the hardware compression devices how are you dealing with capacity variation > > > / overcommit? > ... > > What is different from standard tiering is that the control plane is > > checked on demotion to make sure there is still capacity left. If not, the > > demotion fails. While this seems stable so far, a missing piece is to > > ensure that this tier is mainly written by demotions and not arbitrary kernel > > allocations (at least as a starting point). I want to explore how mempolicies > > can help there, or something of the sort that Gregory described. > > > > Writing back the description as i understand it: > > 1) The intent is to only have this memory allocable via demotion > (i.e. no fault or direct allocation from userland possible) Yes that is what looks to me like the "safe" way to begin with. In theory you could have userland apps/middleware that is aware of this memory and its quirks and are ok to use it but I guess we can leave that for later and it feels like it could be provided by a separate driver. > > 2) The intent is to still have this memory accessible directly (DMA), > while compressed, not trigger a fault/promotion on access > (i.e. no zswap faults) Correct. One of the big advantages of CXL.mem is the cache-line access granularity and our customers don't want to lose that. > > 3) The intent is to have an external monitoring software handle > outrunning run-away decompression/hotness by promoting that data. External is not strictly necessary. E.g. it could be an additional source of input to the kpromote/kmigrate solution. > > So basically we want a zswap-like interface for allocation, but to If by "zswap-like interface" you mean something that can reject the demote (or store according to the zswap semantics) then yes. I just want to be careful when comparing with zswap. > retain the `struct page` in page tables such that no faults are incurred > on access. Then if the page becomes hot, depend on some kind of HMU > tiering system to get it off the device. Correct. > > I think we all understand there's some bear we have to outrun to deal > with problem #3 - and many of us are skeptical that the bear won't catch > up with our pants down. Let's ignore this for the moment. Agreed. > > If such a device's memory is added to the default page allocator, then > the question becomes one of *isolation* - such that the kernel will > provide some "Capital-G Guarantee" that certain NUMA nodes will NEVER > be used except under very explicit scenarios. > > There are only 3 mechanisms with which to restrict this (presently): > > 1) ZONE membership (to disallow GFP_KERNEL) > 2) cgroups->cpusets->mems_allowed > 3) task/vma mempolicy > (obvious #4: Don't put it in the default page allocator) > > cpusets and mempolicy are not sufficient to provide full isolation > - cgroups have the opposite hierarchical relationship than desired. > The parent cgroup will lock out all children cgroups from using nodes > not present in the parent mems_allowed. e.g. if you lock out access > from the root cgroup, no cgroup on the entire system is eligible to > allocate the memory. If you don't lock out the root cgroup - any root > cgroup task is eligible. This isn't tractible. 
>
> - task/vma mempolicy gets ignored in many cases and is closer to a
> suggestion than enforcible. It's also subject to rebinding as a
> task's cgroups.cpuset.mems_allowed changes.
>
> I haven't read up enough on ZONE_DEVICE to understand the implications
> of membership there, but have you explored this as an option? I don't
> see the work i'm doing intersecting well with your efforts - except
> maybe on the vmscan.c work around allocation on demotion.

Thanks for the very helpful breakdown. Your take on #2 & #3 seems
reasonable. About #1, I've skimmed through the rest of the thread and
I'll continue addressing your responses there.

Yiannis

>
> The work i'm doing is more aligned with - hey, filesystems are a global
> resource, why are we using cgroup/task/vma policies to dictate whether a
> filesystem's cache is eligible to land in remote nodes? i.e. drawing
> better boundaries and controls around what can land in some set of
> remote nodes "by default". You're looking for *strong isolation*
> controls, which implies a different kind of allocator interface.
>
> ~Gregory
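
[On mechanism #1 above (ZONE membership): a small editorial illustration
of why onlining device memory movable-only keeps kernel allocations out,
while cpusets and mempolicies only steer user allocations.  gfp_zone()
and ZONE_MOVABLE are existing kernel interfaces; the wrapper function is
just for illustration.]

#include <linux/gfp.h>

/*
 * gfp_zone() only selects ZONE_MOVABLE when both __GFP_HIGHMEM and
 * __GFP_MOVABLE are set, so GFP_KERNEL can never be satisfied from a
 * movable-only node, while GFP_HIGHUSER_MOVABLE (page cache, anon) can.
 */
static bool gfp_may_use_movable_only_node(gfp_t gfp)
{
	return gfp_zone(gfp) == ZONE_MOVABLE;
}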
On 9/17/25 10:30, Wei Xu wrote: > On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote: >> >> On Wed, 10 Sep 2025, Gregory Price wrote: >> >>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote: >>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote: >>>>> This patchset introduces a new subsystem for hot page tracking >>>>> and promotion (pghot) that consolidates memory access information >>>>> from various sources and enables centralized promotion of hot >>>>> pages across memory tiers. >>>> >>>> Just to be clear, I continue to believe this is a terrible idea and we >>>> should not do this. If systems will be built with CXL (and given the >>>> horrendous performance, I cannot see why they would be), the kernel >>>> should not be migrating memory around like this. >>> >>> I've been considered this problem from the opposite approach since LSFMM. >>> >>> Rather than decide how to move stuff around, what if instead we just >>> decide not to ever put certain classes of memory on CXL. Right now, so >>> long as CXL is in the page allocator, it's the wild west - any page can >>> end up anywhere. >>> >>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real >>> workloads to show local CXL expansion is valuable and performant enough >>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE >>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of >>> CXL, but allows any given user-driven page allocation (including page >>> cache, file, and anon mappings) to land there. >>> >> >> This is similar to our use case, although the direct allocation can be >> controlled by cpusets or mempolicies as needed depending on the memory >> access latency required for the workload; nothing new there, though, it's >> the same argument as NUMA in general and the abstraction of these far >> memory nodes as separate NUMA nodes makes this very straightforward. >> >>> I'm hoping to share some of this data in the coming months. >>> >>> I've yet to see any strong indication that a complex hotness/movement >>> system is warranted (yet) - but that may simply be because we have >>> local cards with no switching involved. So far LRU-based promotion and >>> demotion has been sufficient. >>> >> >> To me, this is a key point. As we've discussed in meetings, we're in the >> early days here. The CHMU does provide a lot of flexibility, both to >> create very good and very bad hotness trackers. But I think the key point >> is that we have multiple sources of hotness information depending on the >> platform and some of these sources only make sense for the kernel (or a >> BPF offload) to maintain as the source of truth. Some of these sources >> will be clear-on-read so only one entity would be possible to have as the >> source of truth of page hotness. >> >> I've been pretty focused on the promotion story here rather than demotion >> because of how responsive it needs to be. Harvesting the page table >> accessed bits or waiting on a sliding window through NUMA Balancing (even >> NUMAB=2) is not as responsive as needed for very fast promotion to top >> tier memory, hence things like the CHMU (or PEBS or IBS etc). 
>> >> A few things that I think we need to discuss and align on: >> >> - the kernel as the source of truth for all memory hotness information, >> which can then be abstracted and used for multiple downstream purposes, >> memory tiering only being one of them >> >> - the long-term plan for NUMAB=2 and memory tiering support in the kernel >> in general, are we planning on supporting this through NUMA hint faults >> forever despite their drawbacks (too slow, too much overhead for KVM) >> >> - the role of the kernel vs userspace in driving the memory migration; >> lots of discussion on hardware assists that can be leveraged for memory >> migration but today the balancing is driven in process context. The >> kthread as the driver of migration is yet to be a sold argument, but >> are where a number of companies are currently looking >> >> There's also some feature support that is possible with these CXL memory >> expansion devices that have started to pop up in labs that can also >> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to >> chime in as well. >> >> This topic seems due for an alignment session as well, so will look to get >> that scheduled in the coming weeks if people are up for it. > > Our experience is that workloads in hyper-scalar data centers such as > Google often have significant cold memory. Offloading this to CXL memory > devices, backed by cheaper, lower-performance media (e.g. DRAM with > hardware compression), can be a practical approach to reduce overall > TCO. Page promotion and demotion are then critical for such a tiered > memory system. > > A kernel thread to drive hot page collection and promotion seems > logical, especially since hot page data from new sources (e.g. CHMU) > are collected outside the process execution context and in the form of > physical addresses. > > I do agree that we need to balance the complexity and benefits of any > new data structures for hotness tracking. I think there is a mismatch in the tiering structure and the patches. If you see the example in memory tiering /* * ... * Example 3: * * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node. * * node distances: * node 0 1 2 * 0 10 20 30 * 1 20 10 40 * 2 30 40 10 * * memory_tiers0 = 1 * memory_tiers1 = 0 * memory_tiers2 = 2 *.. */ The topmost tier need not be DRAM, patch 3 states " [..] * kpromoted is a kernel thread that runs on each toptier node and * promotes pages from max_heap. " Also, there is no data in the cover letter to indicate what workloads benefit from migration to top-tier and by how much? Balbir
On 17-Sep-25 8:50 AM, Balbir Singh wrote: > On 9/17/25 10:30, Wei Xu wrote: >> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@google.com> wrote: >>> >>> On Wed, 10 Sep 2025, Gregory Price wrote: >>> >>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote: >>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote: >>>>>> This patchset introduces a new subsystem for hot page tracking >>>>>> and promotion (pghot) that consolidates memory access information >>>>>> from various sources and enables centralized promotion of hot >>>>>> pages across memory tiers. >>>>> >>>>> Just to be clear, I continue to believe this is a terrible idea and we >>>>> should not do this. If systems will be built with CXL (and given the >>>>> horrendous performance, I cannot see why they would be), the kernel >>>>> should not be migrating memory around like this. >>>> >>>> I've been considered this problem from the opposite approach since LSFMM. >>>> >>>> Rather than decide how to move stuff around, what if instead we just >>>> decide not to ever put certain classes of memory on CXL. Right now, so >>>> long as CXL is in the page allocator, it's the wild west - any page can >>>> end up anywhere. >>>> >>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real >>>> workloads to show local CXL expansion is valuable and performant enough >>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE >>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of >>>> CXL, but allows any given user-driven page allocation (including page >>>> cache, file, and anon mappings) to land there. >>>> >>> >>> This is similar to our use case, although the direct allocation can be >>> controlled by cpusets or mempolicies as needed depending on the memory >>> access latency required for the workload; nothing new there, though, it's >>> the same argument as NUMA in general and the abstraction of these far >>> memory nodes as separate NUMA nodes makes this very straightforward. >>> >>>> I'm hoping to share some of this data in the coming months. >>>> >>>> I've yet to see any strong indication that a complex hotness/movement >>>> system is warranted (yet) - but that may simply be because we have >>>> local cards with no switching involved. So far LRU-based promotion and >>>> demotion has been sufficient. >>>> >>> >>> To me, this is a key point. As we've discussed in meetings, we're in the >>> early days here. The CHMU does provide a lot of flexibility, both to >>> create very good and very bad hotness trackers. But I think the key point >>> is that we have multiple sources of hotness information depending on the >>> platform and some of these sources only make sense for the kernel (or a >>> BPF offload) to maintain as the source of truth. Some of these sources >>> will be clear-on-read so only one entity would be possible to have as the >>> source of truth of page hotness. >>> >>> I've been pretty focused on the promotion story here rather than demotion >>> because of how responsive it needs to be. Harvesting the page table >>> accessed bits or waiting on a sliding window through NUMA Balancing (even >>> NUMAB=2) is not as responsive as needed for very fast promotion to top >>> tier memory, hence things like the CHMU (or PEBS or IBS etc). 
>>> >>> A few things that I think we need to discuss and align on: >>> >>> - the kernel as the source of truth for all memory hotness information, >>> which can then be abstracted and used for multiple downstream purposes, >>> memory tiering only being one of them >>> >>> - the long-term plan for NUMAB=2 and memory tiering support in the kernel >>> in general, are we planning on supporting this through NUMA hint faults >>> forever despite their drawbacks (too slow, too much overhead for KVM) >>> >>> - the role of the kernel vs userspace in driving the memory migration; >>> lots of discussion on hardware assists that can be leveraged for memory >>> migration but today the balancing is driven in process context. The >>> kthread as the driver of migration is yet to be a sold argument, but >>> are where a number of companies are currently looking >>> >>> There's also some feature support that is possible with these CXL memory >>> expansion devices that have started to pop up in labs that can also >>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to >>> chime in as well. >>> >>> This topic seems due for an alignment session as well, so will look to get >>> that scheduled in the coming weeks if people are up for it. >> >> Our experience is that workloads in hyper-scalar data centers such as >> Google often have significant cold memory. Offloading this to CXL memory >> devices, backed by cheaper, lower-performance media (e.g. DRAM with >> hardware compression), can be a practical approach to reduce overall >> TCO. Page promotion and demotion are then critical for such a tiered >> memory system. >> >> A kernel thread to drive hot page collection and promotion seems >> logical, especially since hot page data from new sources (e.g. CHMU) >> are collected outside the process execution context and in the form of >> physical addresses. >> >> I do agree that we need to balance the complexity and benefits of any >> new data structures for hotness tracking. > > > I think there is a mismatch in the tiering structure and > the patches. If you see the example in memory tiering > > /* > * ... > * Example 3: > * > * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node. > * > * node distances: > * node 0 1 2 > * 0 10 20 30 > * 1 20 10 40 > * 2 30 40 10 > * > * memory_tiers0 = 1 > * memory_tiers1 = 0 > * memory_tiers2 = 2 > *.. > */ > > The topmost tier need not be DRAM, patch 3 states > > " > [..] > * kpromoted is a kernel thread that runs on each toptier node and > * promotes pages from max_heap. That comment is not accurate, will reword it next time. Currently I am using kthread_create_on_node() to create one kernel thread for each toptier node. I haven't tried this patchset with HBM but it should end up creating a kthread for HBM node too. However unlike for regular DRAM nodes, the kthread for HBM node can't be bound to any CPU. > > Also, there is no data in the cover letter to indicate what workloads benefit from > migration to top-tier and by how much? I have been trying to get the tracking infrastructure up and hoping to get some review on that. I will start including numbers from the next iteration. Regards, Bharata.
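
[For reference, a minimal sketch of the per-toptier-node thread creation
described above. kpromoted_fn() and the loop structure are illustrative
stand-ins, not the patchset's code; kthread_create_on_node(),
node_is_toptier() and for_each_node_state() are existing kernel
interfaces.]

#include <linux/kthread.h>
#include <linux/memory-tiers.h>
#include <linux/nodemask.h>

/* Per-node promotion loop, defined elsewhere in this sketch. */
static int kpromoted_fn(void *arg);

static void start_kpromoted_threads(void)
{
	int nid;

	for_each_node_state(nid, N_MEMORY) {
		struct task_struct *t;

		/* Only top-tier nodes are promotion targets. */
		if (!node_is_toptier(nid))
			continue;

		t = kthread_create_on_node(kpromoted_fn, (void *)(long)nid,
					   nid, "kpromoted%d", nid);
		if (!IS_ERR(t))
			wake_up_process(t);
	}
}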
On Tue, Sep 16, 2025 at 12:45:52PM -0700, David Rientjes wrote:
> > I'm hoping to share some of this data in the coming months.
> >
> > I've yet to see any strong indication that a complex hotness/movement
> > system is warranted (yet) - but that may simply be because we have
> > local cards with no switching involved. So far LRU-based promotion and
> > demotion has been sufficient.
> >
...
>
> I've been pretty focused on the promotion story here rather than demotion
> because of how responsive it needs to be. Harvesting the page table
> accessed bits or waiting on a sliding window through NUMA Balancing (even
> NUMAB=2) is not as responsive as needed for very fast promotion to top
> tier memory, hence things like the CHMU (or PEBS or IBS etc).
>
I feel the need to throw out there that we need to set some kind of
baseline for comparison that isn't simply comparing new hotness tracking
stuff against "Doing Nothing".
For example, if we assume MGLRU is the default, we probably want to
compare against some kind of simplistic system that is essentially:
if tier0 has bandwidth room, and
if tier1 is bandwidth pressured, then
promote a chunk from tier1 youngest generation LRU
("hottest") and demote a chunk from tier0 older LRU
("coldest") [if there's no space available].
Active bandwidth utilization numbers are still a little hard to come
by, but a system like the above could be implemented largely in userland
with a few tweaks to reclaim.
~Gregory
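
[An editorial sketch of the baseline policy described above, written as
plain C pseudologic.  read_node_bw_utilization(), node_has_free_chunk(),
promote_chunk() and demote_chunk() are hypothetical helpers (e.g. a
userland daemon reading bandwidth counters and driving a migration
interface), and the thresholds are arbitrary.]

#include <stdbool.h>

#define TIER0_BW_HEADROOM	70	/* % utilization below which tier0 has room */
#define TIER1_BW_PRESSURE	80	/* % utilization above which tier1 is pressured */
#define CHUNK_PAGES		512	/* migrate in 2MB chunks of 4K pages */

extern unsigned int read_node_bw_utilization(int nid);		/* hypothetical */
extern bool node_has_free_chunk(int nid);			/* hypothetical */
extern void promote_chunk(int from, int to, unsigned int pages); /* hypothetical */
extern void demote_chunk(int from, int to, unsigned int pages);  /* hypothetical */

static void baseline_tick(int tier0_nid, int tier1_nid)
{
	unsigned int t0 = read_node_bw_utilization(tier0_nid);
	unsigned int t1 = read_node_bw_utilization(tier1_nid);

	/* Act only when tier0 has bandwidth room and tier1 is pressured. */
	if (t0 >= TIER0_BW_HEADROOM || t1 < TIER1_BW_PRESSURE)
		return;

	/* Make room first by demoting a "coldest" chunk if tier0 is full. */
	if (!node_has_free_chunk(tier0_nid))
		demote_chunk(tier0_nid, tier1_nid, CHUNK_PAGES);

	/* Promote a "hottest" chunk: tier1's youngest MGLRU generation. */
	promote_chunk(tier1_nid, tier0_nid, CHUNK_PAGES);
}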