libnvdimm/e820: Add a new parameter to configure many regions per e820 entry

[PATCH v3 0/2] libnvdimm/e820: Add a new parameter to configure many regions per e820 entry

Posted by Michal Clapinski 4 months ago

This includes:
1. Splitting one e820 entry into many regions.
2. Conversion to devdax during boot.

This change is needed for the hypervisor live update. VMs' memory will
be backed by those emulated pmem devices. To support various VM shapes
I want to create devdax devices at 1GB granularity similar to hugetlb.
Also detecting those devices as devdax during boot speeds up the whole
process. Conversion in userspace would be much slower which is
unacceptable while trying to minimize

v3:
- Added a second commit.
- Reworked string parsing.
- I was asked to rename the parameter to 'split' but I'm not sure it
  fits anymore with the conversion functionality, so I didn't do that
  yet. LMK.
v2: Fixed a crash when the pmem parameter is omitted.

Michal Clapinski (2):
  libnvdimm/e820: Add a new parameter to split e820 entry into many
    regions
  libnvdimm: add nd_e820.pmem automatic devdax conversion

 .../admin-guide/kernel-parameters.txt         |  10 +
 drivers/dax/pmem.c                            |   2 +-
 drivers/nvdimm/dax_devs.c                     |   5 +-
 drivers/nvdimm/e820.c                         | 211 +++++++++++++++++-
 drivers/nvdimm/nd.h                           |   6 +
 drivers/nvdimm/pfn_devs.c                     | 158 +++++++++----
 include/linux/libnvdimm.h                     |   3 +
 7 files changed, 346 insertions(+), 49 deletions(-)

-- 
2.50.0.rc1.591.g9c95f17f64-goog

Re: [PATCH v3 0/2] libnvdimm/e820: Add a new parameter to configure many regions per e820 entry

Posted by Ira Weiny 3 months, 2 weeks ago

Michal Clapinski wrote:
> This includes:
> 1. Splitting one e820 entry into many regions.
> 2. Conversion to devdax during boot.
> 
> This change is needed for the hypervisor live update. VMs' memory will
> be backed by those emulated pmem devices. To support various VM shapes
> I want to create devdax devices at 1GB granularity similar to hugetlb.
> Also detecting those devices as devdax during boot speeds up the whole
> process. Conversion in userspace would be much slower which is
> unacceptable while trying to minimize

Did you explore the NFIT injection strategy which Dan suggested?[1]

[1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@dwillia2-xfh.jf.intel.com.notmuch/

If so why did it not work?

Ira

[snip]

Re: [PATCH v3 0/2] libnvdimm/e820: Add a new parameter to configure many regions per e820 entry

Posted by Michał Cłapiński 3 months, 1 week ago

On Wed, Jun 25, 2025 at 11:16 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> Michal Clapinski wrote:
> > This includes:
> > 1. Splitting one e820 entry into many regions.
> > 2. Conversion to devdax during boot.
> >
> > This change is needed for the hypervisor live update. VMs' memory will
> > be backed by those emulated pmem devices. To support various VM shapes
> > I want to create devdax devices at 1GB granularity similar to hugetlb.
> > Also detecting those devices as devdax during boot speeds up the whole
> > process. Conversion in userspace would be much slower which is
> > unacceptable while trying to minimize
>
> Did you explore the NFIT injection strategy which Dan suggested?[1]
>
> [1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@dwillia2-xfh.jf.intel.com.notmuch/
>
> If so why did it not work?

I'm new to all this so I might be off on some/all of the things.

My issues with NFIT:
1. I can either go with a custom BIOS or ACPI NFIT injection. A custom
BIOS sounds rather aggressive to me and I'd prefer to avoid it. The
NFIT injection is done via initramfs, right? If a system doesn't use
initramfs at the moment, that would introduce another step in the boot
process. One of the requirements of the hypervisor live update project
is that the boot process has to be blazing fast and I'm worried
introducing initramfs would go against this requirement.
2. If I were to create an NFIT, it would have to contain thousands of
entries. That would have to be parsed on every boot. Again, I'm
worried about the performance.

Do you think an NFIT solution could be as fast as the simple command
line solution?

[snip]

Re: [PATCH v3 0/2] libnvdimm/e820: Add a new parameter to configure many regions per e820 entry

Posted by Michał Cłapiński 1 month, 3 weeks ago

On Tue, Jul 1, 2025 at 2:05 PM Michał Cłapiński <mclapinski@google.com> wrote:
>
> On Wed, Jun 25, 2025 at 11:16 PM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > Michal Clapinski wrote:
> > > This includes:
> > > 1. Splitting one e820 entry into many regions.
> > > 2. Conversion to devdax during boot.
> > >
> > > This change is needed for the hypervisor live update. VMs' memory will
> > > be backed by those emulated pmem devices. To support various VM shapes
> > > I want to create devdax devices at 1GB granularity similar to hugetlb.
> > > Also detecting those devices as devdax during boot speeds up the whole
> > > process. Conversion in userspace would be much slower which is
> > > unacceptable while trying to minimize
> >
> > Did you explore the NFIT injection strategy which Dan suggested?[1]
> >
> > [1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@dwillia2-xfh.jf.intel.com.notmuch/
> >
> > If so why did it not work?
>
> I'm new to all this so I might be off on some/all of the things.
>
> My issues with NFIT:
> 1. I can either go with custom bios or acpi nfit injection. Custom
> bios sounds rather aggressive to me and I'd prefer to avoid this. The
> NFIT injection is done via initramfs, right? If a system doesn't use
> initramfs at the moment, that would introduce another step in the boot
> process. One of the requirements of the hypervisor live update project
> is that the boot process has to be blazing fast and I'm worried
> introducing initramfs would go against this requirement.
> 2. If I were to create an NFIT, it would have to contain thousands of
> entries. That would have to be parsed on every boot. Again, I'm
> worried about the performance.
>
> Do you think an NFIT solution could be as fast as the simple command
> line solution?

Hello,
Just a follow-up email: I'd like to receive some feedback on this.

Re: [PATCH v3 0/2] libnvdimm/e820: Add a new parameter to configure many regions per e820 entry

Posted by Ira Weiny 1 month, 1 week ago

Michał Cłapiński wrote:
> On Tue, Jul 1, 2025 at 2:05 PM Michał Cłapiński <mclapinski@google.com> wrote:
> >
> > On Wed, Jun 25, 2025 at 11:16 PM Ira Weiny <ira.weiny@intel.com> wrote:
> > >
> > > Michal Clapinski wrote:
> > > > This includes:
> > > > 1. Splitting one e820 entry into many regions.
> > > > 2. Conversion to devdax during boot.
> > > >
> > > > This change is needed for the hypervisor live update. VMs' memory will
> > > > be backed by those emulated pmem devices. To support various VM shapes
> > > > I want to create devdax devices at 1GB granularity similar to hugetlb.
> > > > Also detecting those devices as devdax during boot speeds up the whole
> > > > process. Conversion in userspace would be much slower which is
> > > > unacceptable while trying to minimize
> > >
> > > Did you explore the NFIT injection strategy which Dan suggested?[1]
> > >
> > > [1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@dwillia2-xfh.jf.intel.com.notmuch/
> > >
> > > If so why did it not work?
> >
> > I'm new to all this so I might be off on some/all of the things.
> >
> > My issues with NFIT:
> > 1. I can either go with custom bios or acpi nfit injection. Custom
> > bios sounds rather aggressive to me and I'd prefer to avoid this. The
> > NFIT injection is done via initramfs, right? If a system doesn't use
> > initramfs at the moment, that would introduce another step in the boot
> > process. One of the requirements of the hypervisor live update project
> > is that the boot process has to be blazing fast and I'm worried
> > introducing initramfs would go against this requirement.
> > 2. If I were to create an NFIT, it would have to contain thousands of
> > entries. That would have to be parsed on every boot. Again, I'm
> > worried about the performance.
> >
> > Do you think an NFIT solution could be as fast as the simple command
> > line solution?
> 
> Hello,
> just a follow up email. I'd like to receive some feedback on this.

Apologies.  I'm not keen on adding kernel parameters so I'm curious what
you think about Mike's new driver?[1]

[1] https://lore.kernel.org/all/68b0f8a31a2b8_293b3294ae@iweiny-mobl.notmuch/

Ira

Re: [PATCH v3 0/2] libnvdimm/e820: Add a new parameter to configure many regions per e820 entry

Posted by Pasha Tatashin 1 month, 1 week ago

On Thu, Aug 28, 2025 at 8:48 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> Michał Cłapiński wrote:
> > On Tue, Jul 1, 2025 at 2:05 PM Michał Cłapiński <mclapinski@google.com> wrote:
> > >
> > > On Wed, Jun 25, 2025 at 11:16 PM Ira Weiny <ira.weiny@intel.com> wrote:
> > > >
> > > > Michal Clapinski wrote:
> > > > > This includes:
> > > > > 1. Splitting one e820 entry into many regions.
> > > > > 2. Conversion to devdax during boot.
> > > > >
> > > > > This change is needed for the hypervisor live update. VMs' memory will
> > > > > be backed by those emulated pmem devices. To support various VM shapes
> > > > > I want to create devdax devices at 1GB granularity similar to hugetlb.
> > > > > Also detecting those devices as devdax during boot speeds up the whole
> > > > > process. Conversion in userspace would be much slower which is
> > > > > unacceptable while trying to minimize
> > > >
> > > > Did you explore the NFIT injection strategy which Dan suggested?[1]
> > > >
> > > > [1] https://lore.kernel.org/all/6807f0bfbe589_71fe2944d@dwillia2-xfh.jf.intel.com.notmuch/
> > > >
> > > > If so why did it not work?
> > >
> > > I'm new to all this so I might be off on some/all of the things.
> > >
> > > My issues with NFIT:
> > > 1. I can either go with custom bios or acpi nfit injection. Custom
> > > bios sounds rather aggressive to me and I'd prefer to avoid this. The
> > > NFIT injection is done via initramfs, right? If a system doesn't use
> > > initramfs at the moment, that would introduce another step in the boot
> > > process. One of the requirements of the hypervisor live update project
> > > is that the boot process has to be blazing fast and I'm worried
> > > introducing initramfs would go against this requirement.
> > > 2. If I were to create an NFIT, it would have to contain thousands of
> > > entries. That would have to be parsed on every boot. Again, I'm
> > > worried about the performance.
> > >
> > > Do you think an NFIT solution could be as fast as the simple command
> > > line solution?
> >
> > Hello,
> > just a follow up email. I'd like to receive some feedback on this.
>
> Apologies.  I'm not keen on adding kernel parameters so I'm curious what
> you think about Mike's new driver?[1]

Hi Ira,

Mike's proposal and our use case are different.

What we're proposing is a way to automatically convert emulated PMEM
into DAX/FSDAX during boot and subdivide it into page-aligned chunks
(e.g., 1G/2M). We have a userspace agent that then manages these
devdax devices, similar to how HugeTLB pages are handled, allowing the
chunks to be used in a cloud environment to support guest memory for
live updates.
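
The subdivision described above can be pictured with a small userspace
sketch. This is only an illustration of the arithmetic, not the code
from the patches: the function name split_range is hypothetical, and
dropping any unaligned head or tail bytes is an assumption of the
sketch rather than something stated in this thread.

```python
GIB = 1 << 30  # 1 GiB, the granularity mentioned in the cover letter


def split_range(start, size, chunk=GIB):
    """Split [start, start + size) into chunk-aligned regions of `chunk` bytes.

    Hypothetical helper: bytes before the first aligned boundary and after
    the last one are dropped, which is an assumption of this sketch.
    """
    if chunk <= 0 or chunk & (chunk - 1):
        raise ValueError("chunk must be a power of two")
    end = start + size
    first = (start + chunk - 1) & ~(chunk - 1)  # round start up to alignment
    last = end & ~(chunk - 1)                   # round end down to alignment
    return [(base, chunk) for base in range(first, last, chunk)]


# Example: a 4 GiB range starting at the 4 GiB physical address boundary.
regions = split_range(0x1_0000_0000, 4 * GIB)
for base, length in regions:
    print(f"region: base={base:#x} size={length:#x}")
```

With a 2M chunk the same helper would yield many more, smaller regions,
which is where the per-region setup cost at boot starts to matter.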

To be clear, we are not trying to make the carved-out PMEM region
scalable. The hypervisor's memory allocation stays the same, and these
PMEM/DAX devices are used exclusively for running VMs. This approach
isn't intended for the general-purpose, scalable persistent memory use
case that Mike's driver addresses.

Pasha