 .../admin-guide/kernel-parameters.txt |  10 +
 drivers/dax/pmem.c                    |   2 +-
 drivers/nvdimm/dax_devs.c             |   5 +-
 drivers/nvdimm/e820.c                 | 211 +++++++++++++++++-
 drivers/nvdimm/nd.h                   |   6 +
 drivers/nvdimm/pfn_devs.c             | 158 +++++++++----
 include/linux/libnvdimm.h             |   3 +
 7 files changed, 346 insertions(+), 49 deletions(-)
This includes:
1. Splitting one e820 entry into many regions.
2. Conversion to devdax during boot.

This change is needed for the hypervisor live update. VMs' memory will
be backed by those emulated pmem devices. To support various VM shapes
I want to create devdax devices at 1GB granularity similar to hugetlb.
Also detecting those devices as devdax during boot speeds up the whole
process. Conversion in userspace would be much slower which is
unacceptable while trying to minimize

v3:
- Added a second commit.
- Reworked string parsing.
- I was asked to rename the parameter to 'split' but I'm not sure it
  fits anymore with the conversion functionality, so I didn't do that
  yet. LMK.

v2: Fixed a crash when pmem parameter is omitted.

Michal Clapinski (2):
  libnvdimm/e820: Add a new parameter to split e820 entry into many
    regions
  libnvdimm: add nd_e820.pmem automatic devdax conversion

 .../admin-guide/kernel-parameters.txt |  10 +
 drivers/dax/pmem.c                    |   2 +-
 drivers/nvdimm/dax_devs.c             |   5 +-
 drivers/nvdimm/e820.c                 | 211 +++++++++++++++++-
 drivers/nvdimm/nd.h                   |   6 +
 drivers/nvdimm/pfn_devs.c             | 158 +++++++++----
 include/linux/libnvdimm.h             |   3 +
 7 files changed, 346 insertions(+), 49 deletions(-)

-- 
2.50.0.rc1.591.g9c95f17f64-goog
Michal Clapinski wrote:
> This includes:
> 1. Splitting one e820 entry into many regions.
> 2. Conversion to devdax during boot.
>
> This change is needed for the hypervisor live update. VMs' memory will
> be backed by those emulated pmem devices. To support various VM shapes
> I want to create devdax devices at 1GB granularity similar to hugetlb.
> Also detecting those devices as devdax during boot speeds up the whole
> process. Conversion in userspace would be much slower which is
> unacceptable while trying to minimize

Did you explore the NFIT injection strategy which Dan suggested?[1]

[1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@dwillia2-xfh.jf.intel.com.notmuch/

If so why did it not work?

Ira

[snip]
On Wed, Jun 25, 2025 at 11:16 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> Michal Clapinski wrote:
> > This includes:
> > 1. Splitting one e820 entry into many regions.
> > 2. Conversion to devdax during boot.
> >
> > This change is needed for the hypervisor live update. VMs' memory will
> > be backed by those emulated pmem devices. To support various VM shapes
> > I want to create devdax devices at 1GB granularity similar to hugetlb.
> > Also detecting those devices as devdax during boot speeds up the whole
> > process. Conversion in userspace would be much slower which is
> > unacceptable while trying to minimize
>
> Did you explore the NFIT injection strategy which Dan suggested?[1]
>
> [1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@dwillia2-xfh.jf.intel.com.notmuch/
>
> If so why did it not work?

I'm new to all this so I might be off on some/all of the things.

My issues with NFIT:
1. I can either go with custom bios or acpi nfit injection. Custom
bios sounds rather aggressive to me and I'd prefer to avoid this. The
NFIT injection is done via initramfs, right? If a system doesn't use
initramfs at the moment, that would introduce another step in the boot
process. One of the requirements of the hypervisor live update project
is that the boot process has to be blazing fast and I'm worried
introducing initramfs would go against this requirement.
2. If I were to create an NFIT, it would have to contain thousands of
entries. That would have to be parsed on every boot. Again, I'm
worried about the performance.

Do you think an NFIT solution could be as fast as the simple command
line solution?

[snip]
On Tue, Jul 1, 2025 at 2:05 PM Michał Cłapiński <mclapinski@google.com> wrote:
>
> On Wed, Jun 25, 2025 at 11:16 PM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > Michal Clapinski wrote:
> > > This includes:
> > > 1. Splitting one e820 entry into many regions.
> > > 2. Conversion to devdax during boot.
> > >
> > > This change is needed for the hypervisor live update. VMs' memory will
> > > be backed by those emulated pmem devices. To support various VM shapes
> > > I want to create devdax devices at 1GB granularity similar to hugetlb.
> > > Also detecting those devices as devdax during boot speeds up the whole
> > > process. Conversion in userspace would be much slower which is
> > > unacceptable while trying to minimize
> >
> > Did you explore the NFIT injection strategy which Dan suggested?[1]
> >
> > [1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@dwillia2-xfh.jf.intel.com.notmuch/
> >
> > If so why did it not work?
>
> I'm new to all this so I might be off on some/all of the things.
>
> My issues with NFIT:
> 1. I can either go with custom bios or acpi nfit injection. Custom
> bios sounds rather aggressive to me and I'd prefer to avoid this. The
> NFIT injection is done via initramfs, right? If a system doesn't use
> initramfs at the moment, that would introduce another step in the boot
> process. One of the requirements of the hypervisor live update project
> is that the boot process has to be blazing fast and I'm worried
> introducing initramfs would go against this requirement.
> 2. If I were to create an NFIT, it would have to contain thousands of
> entries. That would have to be parsed on every boot. Again, I'm
> worried about the performance.
>
> Do you think an NFIT solution could be as fast as the simple command
> line solution?

Hello,
just a follow up email. I'd like to receive some feedback on this.
Michał Cłapiński wrote:
> On Tue, Jul 1, 2025 at 2:05 PM Michał Cłapiński <mclapinski@google.com> wrote:
> >
> > On Wed, Jun 25, 2025 at 11:16 PM Ira Weiny <ira.weiny@intel.com> wrote:
> > >
> > > Michal Clapinski wrote:
> > > > This includes:
> > > > 1. Splitting one e820 entry into many regions.
> > > > 2. Conversion to devdax during boot.
> > > >
> > > > This change is needed for the hypervisor live update. VMs' memory will
> > > > be backed by those emulated pmem devices. To support various VM shapes
> > > > I want to create devdax devices at 1GB granularity similar to hugetlb.
> > > > Also detecting those devices as devdax during boot speeds up the whole
> > > > process. Conversion in userspace would be much slower which is
> > > > unacceptable while trying to minimize
> > >
> > > Did you explore the NFIT injection strategy which Dan suggested?[1]
> > >
> > > [1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@dwillia2-xfh.jf.intel.com.notmuch/
> > >
> > > If so why did it not work?
> >
> > I'm new to all this so I might be off on some/all of the things.
> >
> > My issues with NFIT:
> > 1. I can either go with custom bios or acpi nfit injection. Custom
> > bios sounds rather aggressive to me and I'd prefer to avoid this. The
> > NFIT injection is done via initramfs, right? If a system doesn't use
> > initramfs at the moment, that would introduce another step in the boot
> > process. One of the requirements of the hypervisor live update project
> > is that the boot process has to be blazing fast and I'm worried
> > introducing initramfs would go against this requirement.
> > 2. If I were to create an NFIT, it would have to contain thousands of
> > entries. That would have to be parsed on every boot. Again, I'm
> > worried about the performance.
> >
> > Do you think an NFIT solution could be as fast as the simple command
> > line solution?
>
> Hello,
> just a follow up email. I'd like to receive some feedback on this.

Apologies. I'm not keen on adding kernel parameters so I'm curious what
you think about Mike's new driver?[1]

[1] https://lore.kernel.org/all/68b0f8a31a2b8_293b3294ae@iweiny-mobl.notmuch/

Ira
On Thu, Aug 28, 2025 at 8:48 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> Michał Cłapiński wrote:
> > On Tue, Jul 1, 2025 at 2:05 PM Michał Cłapiński <mclapinski@google.com> wrote:
> > >
> > > On Wed, Jun 25, 2025 at 11:16 PM Ira Weiny <ira.weiny@intel.com> wrote:
> > > >
> > > > Michal Clapinski wrote:
> > > > > This includes:
> > > > > 1. Splitting one e820 entry into many regions.
> > > > > 2. Conversion to devdax during boot.
> > > > >
> > > > > This change is needed for the hypervisor live update. VMs' memory will
> > > > > be backed by those emulated pmem devices. To support various VM shapes
> > > > > I want to create devdax devices at 1GB granularity similar to hugetlb.
> > > > > Also detecting those devices as devdax during boot speeds up the whole
> > > > > process. Conversion in userspace would be much slower which is
> > > > > unacceptable while trying to minimize
> > > >
> > > > Did you explore the NFIT injection strategy which Dan suggested?[1]
> > > >
> > > > [1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@dwillia2-xfh.jf.intel.com.notmuch/
> > > >
> > > > If so why did it not work?
> > >
> > > I'm new to all this so I might be off on some/all of the things.
> > >
> > > My issues with NFIT:
> > > 1. I can either go with custom bios or acpi nfit injection. Custom
> > > bios sounds rather aggressive to me and I'd prefer to avoid this. The
> > > NFIT injection is done via initramfs, right? If a system doesn't use
> > > initramfs at the moment, that would introduce another step in the boot
> > > process. One of the requirements of the hypervisor live update project
> > > is that the boot process has to be blazing fast and I'm worried
> > > introducing initramfs would go against this requirement.
> > > 2. If I were to create an NFIT, it would have to contain thousands of
> > > entries. That would have to be parsed on every boot. Again, I'm
> > > worried about the performance.
> > >
> > > Do you think an NFIT solution could be as fast as the simple command
> > > line solution?
> >
> > Hello,
> > just a follow up email. I'd like to receive some feedback on this.
>
> Apologies. I'm not keen on adding kernel parameters so I'm curious what
> you think about Mike's new driver?[1]

Hi Ira,

Mike's proposal and our use case are different. What we're proposing
is a way to automatically convert emulated PMEM into DAX/FSDAX during
boot and subdivide it into page-aligned chunks (e.g., 1G/2M). We have
a userspace agent that then manages these devdax devices, similar to
how HugeTLB pages are handled, allowing the chunks to be used in a
cloud environment to support guest memory for live updates.

To be clear, we are not trying to make the carved-out PMEM region
scalable. The hypervisor's memory allocation stays the same, and these
PMEM/DAX devices are used exclusively for running VMs. This approach
isn't intended for the general-purpose, scalable persistent memory use
case that Mike's driver addresses.

Pasha
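
For readers following the thread, the subdivision into aligned chunks described above can be illustrated with a small userspace sketch. This is not code from the patch series: `struct chunk`, `split_range()`, and the `CHUNK_1G`/`CHUNK_2M` constants are hypothetical names invented here, and dropping an unaligned head or tail is an assumption about one plausible policy, not a statement of what the driver does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk sizes matching the 1G/2M granularities mentioned. */
#define CHUNK_1G (1ULL << 30)
#define CHUNK_2M (1ULL << 21)

/* Hypothetical descriptor for one carved-out chunk. */
struct chunk {
	uint64_t start;
	uint64_t size;
};

/* Round x up to a multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Round x down to a multiple of align (align must be a power of two). */
static uint64_t align_down(uint64_t x, uint64_t align)
{
	return x & ~(align - 1);
}

/*
 * Split [start, start + size) into chunks of chunk_size bytes, each
 * aligned to chunk_size. Any unaligned head or tail is skipped (an
 * assumption of this sketch). Writes at most max entries to out and
 * returns the number written.
 */
static size_t split_range(uint64_t start, uint64_t size, uint64_t chunk_size,
			  struct chunk *out, size_t max)
{
	uint64_t first, end, addr;
	size_t n = 0;

	assert(chunk_size && (chunk_size & (chunk_size - 1)) == 0);

	first = align_up(start, chunk_size);
	end = align_down(start + size, chunk_size);

	for (addr = first; addr < end && n < max; addr += chunk_size) {
		out[n].start = addr;
		out[n].size = chunk_size;
		n++;
	}
	return n;
}
```

Under these assumptions, a 4 GiB range starting 4 KiB past the 4 GiB boundary yields three 1 GiB chunks, because the partial first and last gigabytes are skipped. The sketch does not guard against address overflow near the top of the 64-bit range.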