[libvirt RFC] add API for parallel Saves (not for committing)

Claudio Fontana posted 1 patch 2 years ago
git fetch https://github.com/patchew-project/libvirt tags/patchew/20220414075416.6587-1-cfontana@suse.de
include/libvirt/libvirt-domain.h | 5 +++++
src/driver-hypervisor.h          | 7 +++++++
src/libvirt_public.syms          | 5 +++++
src/qemu/qemu_driver.c           | 1 +
tools/virsh-domain.c             | 8 ++++++++
5 files changed, 26 insertions(+)
[libvirt RFC] add API for parallel Saves (not for committing)
Posted by Claudio Fontana 2 years ago
RFC, starting point for discussion.

Sketch of the API changes needed to allow parallel saves, opening up
an implementation for QEMU to leverage multifd migration to files,
with optional multifd compression.

This allows save times for huge VMs to be improved.

The idea is to issue commands like:

virsh save domain /path/savevm --parallel --parallel-connections 2

and have libvirt start a multifd migration to:

/path/savevm   : main migration connection
/path/savevm.1 : multifd channel 1
/path/savevm.2 : multifd channel 2
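
For a client using the C API directly, the call could look roughly like
the sketch below. The typed parameter names ("file" and
"parallel.connections") are purely illustrative placeholders; this RFC
does not define them yet.

#include <libvirt/libvirt.h>

static int
saveParallel(virDomainPtr dom, const char *path, int nconn)
{
    virTypedParameterPtr params = NULL;
    int nparams = 0;
    int maxparams = 0;
    int ret = -1;

    /* "file" and "parallel.connections" are placeholder names */
    if (virTypedParamsAddString(&params, &nparams, &maxparams,
                                "file", path) < 0 ||
        virTypedParamsAddInt(&params, &nparams, &maxparams,
                             "parallel.connections", nconn) < 0)
        goto cleanup;

    ret = virDomainSaveParametersFlags(dom, params, nparams,
                                       VIR_DOMAIN_SAVE_PARALLEL);
 cleanup:
    virTypedParamsFree(params, nparams);
    return ret;
}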

Signed-off-by: Claudio Fontana <cfontana@suse.de>
---
 include/libvirt/libvirt-domain.h | 5 +++++
 src/driver-hypervisor.h          | 7 +++++++
 src/libvirt_public.syms          | 5 +++++
 src/qemu/qemu_driver.c           | 1 +
 tools/virsh-domain.c             | 8 ++++++++
 5 files changed, 26 insertions(+)

diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
index 2d5718301e..a7b9c4132d 100644
--- a/include/libvirt/libvirt-domain.h
+++ b/include/libvirt/libvirt-domain.h
@@ -1270,6 +1270,7 @@ typedef enum {
     VIR_DOMAIN_SAVE_RUNNING      = 1 << 1, /* Favor running over paused */
     VIR_DOMAIN_SAVE_PAUSED       = 1 << 2, /* Favor paused over running */
     VIR_DOMAIN_SAVE_RESET_NVRAM  = 1 << 3, /* Re-initialize NVRAM from template */
+    VIR_DOMAIN_SAVE_PARALLEL     = 1 << 4, /* Parallel Save/Restore to multiple files */
 } virDomainSaveRestoreFlags;
 
 int                     virDomainSave           (virDomainPtr domain,
@@ -1278,6 +1279,10 @@ int                     virDomainSaveFlags      (virDomainPtr domain,
                                                  const char *to,
                                                  const char *dxml,
                                                  unsigned int flags);
+int                     virDomainSaveParametersFlags (virDomainPtr domain,
+                                                      virTypedParameterPtr params,
+                                                      int nparams,
+                                                      unsigned int flags);
 int                     virDomainRestore        (virConnectPtr conn,
                                                  const char *from);
 int                     virDomainRestoreFlags   (virConnectPtr conn,
diff --git a/src/driver-hypervisor.h b/src/driver-hypervisor.h
index 4423eb0885..a4e1d21e76 100644
--- a/src/driver-hypervisor.h
+++ b/src/driver-hypervisor.h
@@ -240,6 +240,12 @@ typedef int
                          const char *dxml,
                          unsigned int flags);
 
+typedef int
+(*virDrvDomainSaveParametersFlags)(virDomainPtr domain,
+                                   virTypedParameterPtr params,
+                                   int nparams,
+                                   unsigned int flags);
+
 typedef int
 (*virDrvDomainRestore)(virConnectPtr conn,
                        const char *from);
@@ -1489,6 +1495,7 @@ struct _virHypervisorDriver {
     virDrvDomainGetControlInfo domainGetControlInfo;
     virDrvDomainSave domainSave;
     virDrvDomainSaveFlags domainSaveFlags;
+    virDrvDomainSaveParametersFlags domainSaveParametersFlags;
     virDrvDomainRestore domainRestore;
     virDrvDomainRestoreFlags domainRestoreFlags;
     virDrvDomainSaveImageGetXMLDesc domainSaveImageGetXMLDesc;
diff --git a/src/libvirt_public.syms b/src/libvirt_public.syms
index f93692c427..eb3a7afb75 100644
--- a/src/libvirt_public.syms
+++ b/src/libvirt_public.syms
@@ -916,4 +916,9 @@ LIBVIRT_8.0.0 {
         virDomainSetLaunchSecurityState;
 } LIBVIRT_7.8.0;
 
+LIBVIRT_8.3.0 {
+    global:
+        virDomainSaveParametersFlags;
+} LIBVIRT_8.0.0;
+
 # .... define new API here using predicted next version number ....
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 77012eb527..249105356c 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -20826,6 +20826,7 @@ static virHypervisorDriver qemuHypervisorDriver = {
     .domainGetControlInfo = qemuDomainGetControlInfo, /* 0.9.3 */
     .domainSave = qemuDomainSave, /* 0.2.0 */
     .domainSaveFlags = qemuDomainSaveFlags, /* 0.9.4 */
+    .domainSaveParametersFlags = qemuDomainSaveParametersFlags, /* 8.3.0 */
     .domainRestore = qemuDomainRestore, /* 0.2.0 */
     .domainRestoreFlags = qemuDomainRestoreFlags, /* 0.9.4 */
     .domainSaveImageGetXMLDesc = qemuDomainSaveImageGetXMLDesc, /* 0.9.4 */
diff --git a/tools/virsh-domain.c b/tools/virsh-domain.c
index d5fd8be7c3..ccded6d265 100644
--- a/tools/virsh-domain.c
+++ b/tools/virsh-domain.c
@@ -4164,6 +4164,14 @@ static const vshCmdOptDef opts_save[] = {
      .type = VSH_OT_BOOL,
      .help = N_("avoid file system cache when saving")
     },
+    {.name = "parallel",
+     .type = VSH_OT_BOOL,
+     .help = N_("enable parallel save to files")
+    },
+    {.name = "parallel-connections",
+     .type = VSH_OT_INT,
+     .help = N_("number of connections/files for parallel save")
+    },
     {.name = "xml",
      .type = VSH_OT_STRING,
      .completer = virshCompletePathLocalExisting,
-- 
2.34.1
Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Daniel P. Berrangé 2 years ago
On Thu, Apr 14, 2022 at 09:54:16AM +0200, Claudio Fontana wrote:
> RFC, starting point for discussion.
> 
> Sketch API changes to allow parallel Saves, and open up
> and implementation for QEMU to leverage multifd migration to files,
> with optional multifd compression.
> 
> This allows to improve save times for huge VMs.
> 
> The idea is to issue commands like:
> 
> virsh save domain /path/savevm --parallel --parallel-connections 2
> 
> and have libvirt start a multifd migration to:
> 
> /path/savevm   : main migration connection
> /path/savevm.1 : multifd channel 1
> /path/savevm.2 : multifd channel 2

At a conceptual level the idea would be to still have a single file,
but have threads writing to different regions of it. I don't think
that's possible with multifd though, as it doesn't partition RAM
up between threads, it just hands out pages on demand. So if one
thread happens to be quicker it'll send more RAM than another
thread. Also we're basically capturing the migration RAM, and the
multifd channels carry control info in addition to the RAM pages.

That makes me wonder, actually: are the multifd streams unidirectional
or bidirectional? Our saving-to-a-file logic relies on the streams
being unidirectional.

You've got me thinking, however, whether we can take QEMU out of
the loop entirely for saving RAM.

IIUC, with the 'x-ignore-shared' migration capability QEMU will skip
saving of the RAM region entirely (well, technically any region marked
as 'shared', which I guess can cover more things).

If the QEMU process is configured with a file backed shared
memory, or memfd, I wonder if we can take advantage of this.
eg

  1. pause the VM
  2. write the libvirt header to save.img
  3. sendfile(qemus-memfd, save.img-fd) to copy the entire
     RAM after the header
  4. QMP migrate with x-ignore-shared to copy the device
     state after the RAM

Probably can do the same on restore too.
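
(A rough sketch of the sendfile step above, just to illustrate the idea;
"memfd" and "save_fd" are assumed to be descriptors we already hold, with
save_fd positioned right after the libvirt header. Note the real signature
is sendfile(out_fd, in_fd, ...), destination first.)

  #include <errno.h>
  #include <sys/sendfile.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Copy the guest RAM backing file (tmpfs/memfd) into the save image
   * without the data ever passing through userspace. */
  static int copy_ram_into_save_image(int memfd, int save_fd)
  {
      struct stat st;
      off_t off = 0;

      if (fstat(memfd, &st) < 0)
          return -1;

      while (off < st.st_size) {
          ssize_t n = sendfile(save_fd, memfd, &off, st.st_size - off);
          if (n < 0) {
              if (errno == EINTR)
                  continue;
              return -1;
          }
          if (n == 0)
              break;  /* should not happen before st_size is reached */
      }
      return off == st.st_size ? 0 : -1;
  }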

Now, this would only work for a 'save' and 'restore', not
for snapshots, as it would rely on the VCPUs being paused
to stop RAM being modified.

> 
> Signed-off-by: Claudio Fontana <cfontana@suse.de>
> ---
>  include/libvirt/libvirt-domain.h | 5 +++++
>  src/driver-hypervisor.h          | 7 +++++++
>  src/libvirt_public.syms          | 5 +++++
>  src/qemu/qemu_driver.c           | 1 +
>  tools/virsh-domain.c             | 8 ++++++++
>  5 files changed, 26 insertions(+)
> 
> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
> index 2d5718301e..a7b9c4132d 100644
> --- a/include/libvirt/libvirt-domain.h
> +++ b/include/libvirt/libvirt-domain.h
> @@ -1270,6 +1270,7 @@ typedef enum {
>      VIR_DOMAIN_SAVE_RUNNING      = 1 << 1, /* Favor running over paused */
>      VIR_DOMAIN_SAVE_PAUSED       = 1 << 2, /* Favor paused over running */
>      VIR_DOMAIN_SAVE_RESET_NVRAM  = 1 << 3, /* Re-initialize NVRAM from template */
> +    VIR_DOMAIN_SAVE_PARALLEL     = 1 << 4, /* Parallel Save/Restore to multiple files */
>  } virDomainSaveRestoreFlags;
>  
>  int                     virDomainSave           (virDomainPtr domain,
> @@ -1278,6 +1279,10 @@ int                     virDomainSaveFlags      (virDomainPtr domain,
>                                                   const char *to,
>                                                   const char *dxml,
>                                                   unsigned int flags);
> +int                     virDomainSaveParametersFlags (virDomainPtr domain,
> +                                                      virTypedParameterPtr params,
> +                                                      int nparams,
> +                                                      unsigned int flags);
>  int                     virDomainRestore        (virConnectPtr conn,
>                                                   const char *from);
>  int                     virDomainRestoreFlags   (virConnectPtr conn,
> diff --git a/src/driver-hypervisor.h b/src/driver-hypervisor.h
> index 4423eb0885..a4e1d21e76 100644
> --- a/src/driver-hypervisor.h
> +++ b/src/driver-hypervisor.h
> @@ -240,6 +240,12 @@ typedef int
>                           const char *dxml,
>                           unsigned int flags);
>  
> +typedef int
> +(*virDrvDomainSaveParametersFlags)(virDomainPtr domain,
> +                                   virTypedParameterPtr params,
> +                                   int nparams,
> +                                   unsigned int flags);
> +
>  typedef int
>  (*virDrvDomainRestore)(virConnectPtr conn,
>                         const char *from);
> @@ -1489,6 +1495,7 @@ struct _virHypervisorDriver {
>      virDrvDomainGetControlInfo domainGetControlInfo;
>      virDrvDomainSave domainSave;
>      virDrvDomainSaveFlags domainSaveFlags;
> +    virDrvDomainSaveParametersFlags domainSaveParametersFlags;
>      virDrvDomainRestore domainRestore;
>      virDrvDomainRestoreFlags domainRestoreFlags;
>      virDrvDomainSaveImageGetXMLDesc domainSaveImageGetXMLDesc;
> diff --git a/src/libvirt_public.syms b/src/libvirt_public.syms
> index f93692c427..eb3a7afb75 100644
> --- a/src/libvirt_public.syms
> +++ b/src/libvirt_public.syms
> @@ -916,4 +916,9 @@ LIBVIRT_8.0.0 {
>          virDomainSetLaunchSecurityState;
>  } LIBVIRT_7.8.0;
>  
> +LIBVIRT_8.3.0 {
> +    global:
> +        virDomainSaveParametersFlags;
> +} LIBVIRT_8.0.0;
> +
>  # .... define new API here using predicted next version number ....
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index 77012eb527..249105356c 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -20826,6 +20826,7 @@ static virHypervisorDriver qemuHypervisorDriver = {
>      .domainGetControlInfo = qemuDomainGetControlInfo, /* 0.9.3 */
>      .domainSave = qemuDomainSave, /* 0.2.0 */
>      .domainSaveFlags = qemuDomainSaveFlags, /* 0.9.4 */
> +    .domainSaveParametersFlags = qemuDomainSaveParametersFlags, /* 8.3.0 */
>      .domainRestore = qemuDomainRestore, /* 0.2.0 */
>      .domainRestoreFlags = qemuDomainRestoreFlags, /* 0.9.4 */
>      .domainSaveImageGetXMLDesc = qemuDomainSaveImageGetXMLDesc, /* 0.9.4 */
> diff --git a/tools/virsh-domain.c b/tools/virsh-domain.c
> index d5fd8be7c3..ccded6d265 100644
> --- a/tools/virsh-domain.c
> +++ b/tools/virsh-domain.c
> @@ -4164,6 +4164,14 @@ static const vshCmdOptDef opts_save[] = {
>       .type = VSH_OT_BOOL,
>       .help = N_("avoid file system cache when saving")
>      },
> +    {.name = "parallel",
> +     .type = VSH_OT_BOOL,
> +     .help = N_("enable parallel save to files")
> +    },
> +    {.name = "parallel-connections",
> +     .type = VSH_OT_INT,
> +     .help = N_("number of connections/files for parallel save")
> +    },
>      {.name = "xml",
>       .type = VSH_OT_STRING,
>       .completer = virshCompletePathLocalExisting,
> -- 
> 2.34.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Peter Krempa 2 years ago
On Thu, Apr 21, 2022 at 18:08:36 +0100, Daniel P. Berrangé wrote:
> On Thu, Apr 14, 2022 at 09:54:16AM +0200, Claudio Fontana wrote:
> > RFC, starting point for discussion.
> > 
> > Sketch API changes to allow parallel Saves, and open up
> > and implementation for QEMU to leverage multifd migration to files,
> > with optional multifd compression.
> > 
> > This allows to improve save times for huge VMs.
> > 
> > The idea is to issue commands like:
> > 
> > virsh save domain /path/savevm --parallel --parallel-connections 2
> > 
> > and have libvirt start a multifd migration to:
> > 
> > /path/savevm   : main migration connection
> > /path/savevm.1 : multifd channel 1
> > /path/savevm.2 : multifd channel 2
> 
> At a conceptual level the idea would to still have a single file,
> but have threads writing to different regions of it. I don't think

Note that the folks from Virtuozzo planned enhancements to the
migration code which would allow post-copy style migration into a file.

For this they need a memory image with "random access", especially for
the loading part. Now the idea was to use a different image format,
something more like a qcow2 container (or actually a qcow2 image) to
store the memory pages but allow random access.

With that, a parallel output could also theoretically be possible,
IIRC.

Unfortunately I have no idea how their work is progressing; it has
already been a while since we discussed it here.
Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Claudio Fontana 2 years ago
On 4/21/22 7:08 PM, Daniel P. Berrangé wrote:
> On Thu, Apr 14, 2022 at 09:54:16AM +0200, Claudio Fontana wrote:
>> RFC, starting point for discussion.
>>
>> Sketch API changes to allow parallel Saves, and open up
>> and implementation for QEMU to leverage multifd migration to files,
>> with optional multifd compression.
>>
>> This allows to improve save times for huge VMs.
>>
>> The idea is to issue commands like:
>>
>> virsh save domain /path/savevm --parallel --parallel-connections 2
>>
>> and have libvirt start a multifd migration to:
>>
>> /path/savevm   : main migration connection
>> /path/savevm.1 : multifd channel 1
>> /path/savevm.2 : multifd channel 2
> 
> At a conceptual level the idea would to still have a single file,
> but have threads writing to different regions of it. I don't think
> that's possible with multifd though, as it doesn't partition RAM
> up between threads, its just hands out pages on demand. So if one
> thread happens to be quicker it'll send more RAM than another
> thread. Also we're basically capturing the migration RAM, and the
> multifd channels have control info, in addition to the RAM pages.
> 
> That makes me wonder actually, are the multifd streams unidirectional
> or bidirectional ?  Our saving to a file logic, relies on the streams
> being unidirectional.


Unidirectional. In the meantime I completed an actual libvirt prototype that works (only did the save part, not the restore yet).


> 
> You've got me thinking, however, whether we can take QEMU out of
> the loop entirely for saving RAM.
> 
> IIUC with 'x-ignore-shared' migration capability QEMU will skip
> saving of RAM region entirely (well technically any region marked
> as 'shared', which I guess can cover more things). 

Heh I have no idea about this.

> 
> If the QEMU process is configured with a file backed shared
> memory, or memfd, I wonder if we can take advantage of this.
> eg
> 
>   1. pause the VM
>   1. write the libvirt header to save.img
>   2. sendfile(qemus-memfd, save.img-fd)  to copy the entire
>      RAM after header

I don't understand this point very much... if the RAM is already backed by a file, why are we sending it again?

>   3. QMP migrate with x-ignore-shared to copy device
>      state after RAM
> 
> Probably can do the same on restore too.


Do I understand correctly that you suggest constantly updating the RAM to the file at runtime?
Given the compute nature of the workload, I'd think this would slow things down.

We need to evict the memory to disk rarely, but when that happens it should be as fast as possible.

The advantage of the multifd idea was that we have CPUs reserved for running the VM sitting there doing nothing;
we may as well use them to reduce the size of the problem substantially by compressing each stream separately.

> 
> Now, this would only work for a 'save' and 'restore', not
> for snapshots, as it would rely on the VCPUs being paused
> to stop RAM being modified.
> 
>>
>> Signed-off-by: Claudio Fontana <cfontana@suse.de>
>> ---
>>  include/libvirt/libvirt-domain.h | 5 +++++
>>  src/driver-hypervisor.h          | 7 +++++++
>>  src/libvirt_public.syms          | 5 +++++
>>  src/qemu/qemu_driver.c           | 1 +
>>  tools/virsh-domain.c             | 8 ++++++++
>>  5 files changed, 26 insertions(+)
>>
>> diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
>> index 2d5718301e..a7b9c4132d 100644
>> --- a/include/libvirt/libvirt-domain.h
>> +++ b/include/libvirt/libvirt-domain.h
>> @@ -1270,6 +1270,7 @@ typedef enum {
>>      VIR_DOMAIN_SAVE_RUNNING      = 1 << 1, /* Favor running over paused */
>>      VIR_DOMAIN_SAVE_PAUSED       = 1 << 2, /* Favor paused over running */
>>      VIR_DOMAIN_SAVE_RESET_NVRAM  = 1 << 3, /* Re-initialize NVRAM from template */
>> +    VIR_DOMAIN_SAVE_PARALLEL     = 1 << 4, /* Parallel Save/Restore to multiple files */
>>  } virDomainSaveRestoreFlags;
>>  
>>  int                     virDomainSave           (virDomainPtr domain,
>> @@ -1278,6 +1279,10 @@ int                     virDomainSaveFlags      (virDomainPtr domain,
>>                                                   const char *to,
>>                                                   const char *dxml,
>>                                                   unsigned int flags);
>> +int                     virDomainSaveParametersFlags (virDomainPtr domain,
>> +                                                      virTypedParameterPtr params,
>> +                                                      int nparams,
>> +                                                      unsigned int flags);
>>  int                     virDomainRestore        (virConnectPtr conn,
>>                                                   const char *from);
>>  int                     virDomainRestoreFlags   (virConnectPtr conn,
>> diff --git a/src/driver-hypervisor.h b/src/driver-hypervisor.h
>> index 4423eb0885..a4e1d21e76 100644
>> --- a/src/driver-hypervisor.h
>> +++ b/src/driver-hypervisor.h
>> @@ -240,6 +240,12 @@ typedef int
>>                           const char *dxml,
>>                           unsigned int flags);
>>  
>> +typedef int
>> +(*virDrvDomainSaveParametersFlags)(virDomainPtr domain,
>> +                                   virTypedParameterPtr params,
>> +                                   int nparams,
>> +                                   unsigned int flags);
>> +
>>  typedef int
>>  (*virDrvDomainRestore)(virConnectPtr conn,
>>                         const char *from);
>> @@ -1489,6 +1495,7 @@ struct _virHypervisorDriver {
>>      virDrvDomainGetControlInfo domainGetControlInfo;
>>      virDrvDomainSave domainSave;
>>      virDrvDomainSaveFlags domainSaveFlags;
>> +    virDrvDomainSaveParametersFlags domainSaveParametersFlags;
>>      virDrvDomainRestore domainRestore;
>>      virDrvDomainRestoreFlags domainRestoreFlags;
>>      virDrvDomainSaveImageGetXMLDesc domainSaveImageGetXMLDesc;
>> diff --git a/src/libvirt_public.syms b/src/libvirt_public.syms
>> index f93692c427..eb3a7afb75 100644
>> --- a/src/libvirt_public.syms
>> +++ b/src/libvirt_public.syms
>> @@ -916,4 +916,9 @@ LIBVIRT_8.0.0 {
>>          virDomainSetLaunchSecurityState;
>>  } LIBVIRT_7.8.0;
>>  
>> +LIBVIRT_8.3.0 {
>> +    global:
>> +        virDomainSaveParametersFlags;
>> +} LIBVIRT_8.0.0;
>> +
>>  # .... define new API here using predicted next version number ....
>> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
>> index 77012eb527..249105356c 100644
>> --- a/src/qemu/qemu_driver.c
>> +++ b/src/qemu/qemu_driver.c
>> @@ -20826,6 +20826,7 @@ static virHypervisorDriver qemuHypervisorDriver = {
>>      .domainGetControlInfo = qemuDomainGetControlInfo, /* 0.9.3 */
>>      .domainSave = qemuDomainSave, /* 0.2.0 */
>>      .domainSaveFlags = qemuDomainSaveFlags, /* 0.9.4 */
>> +    .domainSaveParametersFlags = qemuDomainSaveParametersFlags, /* 8.3.0 */
>>      .domainRestore = qemuDomainRestore, /* 0.2.0 */
>>      .domainRestoreFlags = qemuDomainRestoreFlags, /* 0.9.4 */
>>      .domainSaveImageGetXMLDesc = qemuDomainSaveImageGetXMLDesc, /* 0.9.4 */
>> diff --git a/tools/virsh-domain.c b/tools/virsh-domain.c
>> index d5fd8be7c3..ccded6d265 100644
>> --- a/tools/virsh-domain.c
>> +++ b/tools/virsh-domain.c
>> @@ -4164,6 +4164,14 @@ static const vshCmdOptDef opts_save[] = {
>>       .type = VSH_OT_BOOL,
>>       .help = N_("avoid file system cache when saving")
>>      },
>> +    {.name = "parallel",
>> +     .type = VSH_OT_BOOL,
>> +     .help = N_("enable parallel save to files")
>> +    },
>> +    {.name = "parallel-connections",
>> +     .type = VSH_OT_INT,
>> +     .help = N_("number of connections/files for parallel save")
>> +    },
>>      {.name = "xml",
>>       .type = VSH_OT_STRING,
>>       .completer = virshCompletePathLocalExisting,
>> -- 
>> 2.34.1
>>
> 
> With regards,
> Daniel
> 
Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Daniel P. Berrangé 2 years ago
On Thu, Apr 21, 2022 at 08:06:40PM +0200, Claudio Fontana wrote:
> On 4/21/22 7:08 PM, Daniel P. Berrangé wrote:
> > On Thu, Apr 14, 2022 at 09:54:16AM +0200, Claudio Fontana wrote:
> >> RFC, starting point for discussion.
> >>
> >> Sketch API changes to allow parallel Saves, and open up
> >> and implementation for QEMU to leverage multifd migration to files,
> >> with optional multifd compression.
> >>
> >> This allows to improve save times for huge VMs.
> >>
> >> The idea is to issue commands like:
> >>
> >> virsh save domain /path/savevm --parallel --parallel-connections 2
> >>
> >> and have libvirt start a multifd migration to:
> >>
> >> /path/savevm   : main migration connection
> >> /path/savevm.1 : multifd channel 1
> >> /path/savevm.2 : multifd channel 2
> > 
> > At a conceptual level the idea would to still have a single file,
> > but have threads writing to different regions of it. I don't think
> > that's possible with multifd though, as it doesn't partition RAM
> > up between threads, its just hands out pages on demand. So if one
> > thread happens to be quicker it'll send more RAM than another
> > thread. Also we're basically capturing the migration RAM, and the
> > multifd channels have control info, in addition to the RAM pages.
> > 
> > That makes me wonder actually, are the multifd streams unidirectional
> > or bidirectional ?  Our saving to a file logic, relies on the streams
> > being unidirectional.
> 
> 
> Unidirectional. In the meantime I completed an actual libvirt prototype that works (only did the save part, not the restore yet).
> 
> 
> > 
> > You've got me thinking, however, whether we can take QEMU out of
> > the loop entirely for saving RAM.
> > 
> > IIUC with 'x-ignore-shared' migration capability QEMU will skip
> > saving of RAM region entirely (well technically any region marked
> > as 'shared', which I guess can cover more things). 
> 
> Heh I have no idea about this.
> 
> > 
> > If the QEMU process is configured with a file backed shared
> > memory, or memfd, I wonder if we can take advantage of this.
> > eg
> > 
> >   1. pause the VM
> >   1. write the libvirt header to save.img
> >   2. sendfile(qemus-memfd, save.img-fd)  to copy the entire
> >      RAM after header
> 
> I don't understand this point very much... if the ram is already
> backed by file why are we sending this again..?

It is a file pointing to hugepagefs or tmpfs. It is still actually
RAM, but we exposed it to QEMU via a file, which QEMU then mmap'd.

We don't do this by default, but anyone with large (many GB) VMs
is increasingly likely to be relying on huge pages to optimize
their VM performance.

In our current save scheme we have (at least) 2 copies going
on. QEMU copies from RAM into the FD it uses for migrate.
libvirt IO helper copies from the FD into the file. This involves
multiple threads and multiple userspace/kernel switches and data
copies.  You've been trying to eliminate the 2nd copy in userspace.

If we take advantage of scenario where QEMU RAM is backed by a
tmpfs/hugepagefs file, we can potentially eliminate both copies
in userspace. The kernel can be told to copy direct from the
hugepagefs file into the disk file.

> >   3. QMP migrate with x-ignore-shared to copy device
> >      state after RAM
> > 
> > Probably can do the same on restore too.
> 
> 
> Do I understand correctly that you suggest to constantly update the RAM to file at runtime?
> Given the compute nature of the workload, I'd think this would slow things down.

No, no different to what we do today. I'm just saying we let
the kernel copy straight from QEMU's RAM backing file into
the dest file at save time, so we do *nothing* in userspace
in either libvirt or QEMU.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Claudio Fontana 2 years ago
On 4/22/22 10:19 AM, Daniel P. Berrangé wrote:
> On Thu, Apr 21, 2022 at 08:06:40PM +0200, Claudio Fontana wrote:
>> On 4/21/22 7:08 PM, Daniel P. Berrangé wrote:
>>> On Thu, Apr 14, 2022 at 09:54:16AM +0200, Claudio Fontana wrote:
>>>> RFC, starting point for discussion.
>>>>
>>>> Sketch API changes to allow parallel Saves, and open up
>>>> and implementation for QEMU to leverage multifd migration to files,
>>>> with optional multifd compression.
>>>>
>>>> This allows to improve save times for huge VMs.
>>>>
>>>> The idea is to issue commands like:
>>>>
>>>> virsh save domain /path/savevm --parallel --parallel-connections 2
>>>>
>>>> and have libvirt start a multifd migration to:
>>>>
>>>> /path/savevm   : main migration connection
>>>> /path/savevm.1 : multifd channel 1
>>>> /path/savevm.2 : multifd channel 2
>>>
>>> At a conceptual level the idea would to still have a single file,
>>> but have threads writing to different regions of it. I don't think
>>> that's possible with multifd though, as it doesn't partition RAM
>>> up between threads, its just hands out pages on demand. So if one
>>> thread happens to be quicker it'll send more RAM than another
>>> thread. Also we're basically capturing the migration RAM, and the
>>> multifd channels have control info, in addition to the RAM pages.
>>>
>>> That makes me wonder actually, are the multifd streams unidirectional
>>> or bidirectional ?  Our saving to a file logic, relies on the streams
>>> being unidirectional.
>>
>>
>> Unidirectional. In the meantime I completed an actual libvirt prototype that works (only did the save part, not the restore yet).
>>
>>
>>>
>>> You've got me thinking, however, whether we can take QEMU out of
>>> the loop entirely for saving RAM.
>>>
>>> IIUC with 'x-ignore-shared' migration capability QEMU will skip
>>> saving of RAM region entirely (well technically any region marked
>>> as 'shared', which I guess can cover more things). 
>>
>> Heh I have no idea about this.
>>
>>>
>>> If the QEMU process is configured with a file backed shared
>>> memory, or memfd, I wonder if we can take advantage of this.
>>> eg
>>>
>>>   1. pause the VM
>>>   1. write the libvirt header to save.img
>>>   2. sendfile(qemus-memfd, save.img-fd)  to copy the entire
>>>      RAM after header
>>
>> I don't understand this point very much... if the ram is already
>> backed by file why are we sending this again..?
> 
> It is a file pointing to hugepagefs or tmpfs. It is still actually
> RAM, but we exposed it to QEMU via a file, which QEMU then mmap'd.
> 
> We don't do this by default, but anyone with large (many GB) VMs
> is increasingly likel to be relying on huge pages to optimize
> their VM performance.

From what I could observe, I'd say it depends on the specific scenario:
how much memory we have to work with, and the general compromise between CPU, memory, disk, ... all of which is subject to cost optimization.

> 
> In our current save scheme we have (at least) 2 copies going
> on. QEMU copies from RAM into the FD it uses for migrate.
> libvirt IO helper copies from the FD into the file. This involves
> multiple threads and multiple userspace/kernel switches and data
> copies.  You've been trying to eliminate the 2nd copy in userspace.

I've been trying to eliminate the 2nd copy in userspace, but this is just aspect 1) of what I have in mind;
it is good but gives only so much, and for huge VMs things fall apart when we hit the file cache thrashing problem.

Aspect 2) in my mind is the file cache thrashing that the kernel gets into, which I think is the reason we need O_DIRECT at all with huge VMs.
This creates a lot of complications (i.e. we are more or less forced to have a helper anyway to ensure block-aligned source and destination addresses and lengths),
and suboptimal performance.

This is, in my understanding, what was attempted to be solved by:

https://lwn.net/Articles/806980/

which seemed more promising to me, but unfortunately the implementation apparently went to /dev/null.

There was also posix_fadvise POSIX_FADV_NOREUSE, which I think is a very clunky API in practice, and which also got lost.

Aspect 3) is a practical solution that I have already prototyped and that yields very good results in practice:
make better use of the resources we have. We have a certain number of CPUs assigned to run VMs,
and the save/restore operations we need happen with a suspended guest, so we can put those CPUs to good use
and reduce the problem size by leveraging multifd and compression, which come for free from QEMU.

I think that as long as the file cache issue remains unsolved, we are stuck with O_DIRECT, so we are stuck with a helper,
and at that point we can easily have a

multifd-helper

that reuses the code from the iohelper and performs O_DIRECT writes of the compressed streams to multiple files in parallel.
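
As a rough sketch (not the actual prototype), each writer in such a
multifd-helper could look something like the code below, draining one
multifd channel (e.g. a pipe from QEMU) into its own file with O_DIRECT.
The 4 KiB alignment and the handling of the final partial block are
simplifications:

  #define _GNU_SOURCE
  #include <errno.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define BLOCK 4096   /* assumed O_DIRECT alignment */

  /* Read exactly len bytes unless the stream ends first. */
  static ssize_t read_full(int fd, void *buf, size_t len)
  {
      size_t done = 0;
      while (done < len) {
          ssize_t n = read(fd, (char *)buf + done, len - done);
          if (n < 0) { if (errno == EINTR) continue; return -1; }
          if (n == 0) break;
          done += n;
      }
      return done;
  }

  static int drain_channel(int channel_fd, const char *path)
  {
      void *buf = NULL;
      int out = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);

      if (out < 0 || posix_memalign(&buf, BLOCK, BLOCK) != 0)
          return -1;

      for (;;) {
          ssize_t got = read_full(channel_fd, buf, BLOCK);
          if (got < 0)
              return -1;
          if (got == 0)
              break;
          if (got < BLOCK)  /* pad the tail so O_DIRECT stays aligned;
                             * a real helper would record the true size */
              memset((char *)buf + got, 0, BLOCK - got);
          if (write(out, buf, BLOCK) != BLOCK)
              return -1;
          if (got < BLOCK)
              break;
      }
      free(buf);
      return close(out);
  }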


> 
> If we take advantage of scenario where QEMU RAM is backed by a
> tmpfs/hugepagefs file, we can potentially eliminate both copies
> in userspace. The kernel can be told to copy direct from the
> hugepagefs file into the disk file.

Interesting; still, we incur the file cache thrashing as we write, though, right?

> 
>>>   3. QMP migrate with x-ignore-shared to copy device
>>>      state after RAM
>>>
>>> Probably can do the same on restore too.
>>
>>
>> Do I understand correctly that you suggest to constantly update the RAM to file at runtime?
>> Given the compute nature of the workload, I'd think this would slow things down.
> 
> No, no different to what we do today. I'm just saying we let
> the kernl copy straight from  QEMU's RAM backing file into
> the dest file, at time of save, so we do *nothing* in userpsace
> in either libvirt or QEMU.
> 
> With regards,
> Daniel
> 
Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Daniel P. Berrangé 2 years ago
On Fri, Apr 22, 2022 at 01:40:20PM +0200, Claudio Fontana wrote:
> On 4/22/22 10:19 AM, Daniel P. Berrangé wrote:
> > On Thu, Apr 21, 2022 at 08:06:40PM +0200, Claudio Fontana wrote:
> >> On 4/21/22 7:08 PM, Daniel P. Berrangé wrote:
> >>> On Thu, Apr 14, 2022 at 09:54:16AM +0200, Claudio Fontana wrote:
> >>>> RFC, starting point for discussion.
> >>>>
> >>>> Sketch API changes to allow parallel Saves, and open up
> >>>> and implementation for QEMU to leverage multifd migration to files,
> >>>> with optional multifd compression.
> >>>>
> >>>> This allows to improve save times for huge VMs.
> >>>>
> >>>> The idea is to issue commands like:
> >>>>
> >>>> virsh save domain /path/savevm --parallel --parallel-connections 2
> >>>>
> >>>> and have libvirt start a multifd migration to:
> >>>>
> >>>> /path/savevm   : main migration connection
> >>>> /path/savevm.1 : multifd channel 1
> >>>> /path/savevm.2 : multifd channel 2
> >>>
> >>> At a conceptual level the idea would to still have a single file,
> >>> but have threads writing to different regions of it. I don't think
> >>> that's possible with multifd though, as it doesn't partition RAM
> >>> up between threads, its just hands out pages on demand. So if one
> >>> thread happens to be quicker it'll send more RAM than another
> >>> thread. Also we're basically capturing the migration RAM, and the
> >>> multifd channels have control info, in addition to the RAM pages.
> >>>
> >>> That makes me wonder actually, are the multifd streams unidirectional
> >>> or bidirectional ?  Our saving to a file logic, relies on the streams
> >>> being unidirectional.
> >>
> >>
> >> Unidirectional. In the meantime I completed an actual libvirt prototype that works (only did the save part, not the restore yet).
> >>
> >>
> >>>
> >>> You've got me thinking, however, whether we can take QEMU out of
> >>> the loop entirely for saving RAM.
> >>>
> >>> IIUC with 'x-ignore-shared' migration capability QEMU will skip
> >>> saving of RAM region entirely (well technically any region marked
> >>> as 'shared', which I guess can cover more things). 
> >>
> >> Heh I have no idea about this.
> >>
> >>>
> >>> If the QEMU process is configured with a file backed shared
> >>> memory, or memfd, I wonder if we can take advantage of this.
> >>> eg
> >>>
> >>>   1. pause the VM
> >>>   1. write the libvirt header to save.img
> >>>   2. sendfile(qemus-memfd, save.img-fd)  to copy the entire
> >>>      RAM after header
> >>
> >> I don't understand this point very much... if the ram is already
> >> backed by file why are we sending this again..?
> > 
> > It is a file pointing to hugepagefs or tmpfs. It is still actually
> > RAM, but we exposed it to QEMU via a file, which QEMU then mmap'd.
> > 
> > We don't do this by default, but anyone with large (many GB) VMs
> > is increasingly likel to be relying on huge pages to optimize
> > their VM performance.
> 
> For what I could observe I'd say it depends on the specific scenario,
> how much memory we have to work with, the general compromise between cpu, memory, disk, ... all of which is subject to cost optimization.
> 
> > 
> > In our current save scheme we have (at least) 2 copies going
> > on. QEMU copies from RAM into the FD it uses for migrate.
> > libvirt IO helper copies from the FD into the file. This involves
> > multiple threads and multiple userspace/kernel switches and data
> > copies.  You've been trying to eliminate the 2nd copy in userspace.
> 
> I've been trying to eliminate the 2nd copy in userspace, but this is just aspect 1) I have in mind,
> it is good but gives only so much, and for huge VMs things fall apart when reaching the file cache trashing problem.

Agreed.

> Aspect 2) in my mind is the file cache trashing that the kernel gets into, is the reason that we need O_DIRECT at all with huge VMs I think,
> which creates a lot of complications (ie we are kinda forced to have a helper anyway to ensure block aligned source, destination addresses and length),
> and suboptimal performance.

Right, we can eliminate the second copy, or we can eliminate cache
thrashing, but not both.

> Aspect 3) is a practical solution that I already prototyped and yields very good results in practice,
> which is to make better use of the resources we have, since we have a certain number of cpus assigned to run VMs,
> and the save/restore operations we need happen with a suspended guest, so we can exploit this to get those cpus to good use,
> and reduce the problem size by leveraging multifd and compression which comes for free from qemu.
> 
> I think that until the file cache issue remains unsolved, we are stuck with O_DIRECT, so we are stuck with a helper,
> and at that point we can easily have a
> 
> multifd-helper
> 
> that reuses the code from iohelper, and performs O_DIRECT writes of the compressed streams to multiple files in parallel.

I'm worried that we could be taking ourselves down a dead end by
trying to optimize on the libvirt side, because we've got a
mismatch between the QMP APIs we're using and the intent of
QEMU.

The QEMU migration APIs were designed around streaming to a
remote instance, and we're essentially playing games to use
them as a way to write to local storage.

The RAM pages we're saving are of course page-aligned in QEMU
because they are mapped RAM. We lose/throw away the page
alignment because we're sending them over an FD, potentially
adding metadata headers to each to identify which location in
the RAM block they came from.

QEMU has APIs for doing async I/O to local storage using
O_DIRECT, via the BlockDev layer. QEMU can even use this
for saving state via the loadvm/savevm monitor commands
for internal snapshots. This is not accessible via the
normal migration QMP command though.


I feel that to give ourselves the best chance of optimizing the
save/restore, we need QEMU to have full knowledge of
what is going on, and to get libvirt out of the picture almost
entirely.

If QEMU knows that the migration source/target is a random
access file, rather than a stream, then it will not have
to attach any headers to identify RAM pages. It can just
read/write them directly at a fixed offset in the file.
It can even do this while the CPU is running, just overwriting
the previously written page on disk if the contents changed.

This would mean the save image is a fixed size exactly
matching the RAM size, plus libvirt header and vmstate.
Right now if we save a live snapshot, the save image can
be almost arbitrarily large, since we'll save the same
RAM page over & over again if the VM is modifying the
content.

I think we need to introduce an explicit 'file:' protocol
for the migrate command, backed by the blockdev APIs
so it can do O_DIRECT and non-blocking AIO.  For the 'fd:'
protocol, we need to be able to tell QEMU whether the 'fd'
is a stream or a regular file, so it can choose between the
regular send/recv APIs vs the blockdev APIs (maybe we can
auto-detect with fstat()).  If we do this, then multifd
doesn't end up needing multiple save files on disk; all
the threads can write directly to the same file, just
at the relevant offsets on disk to match the RAM page
location.
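
(Conceptually, each multifd thread would then just do something like
the sketch below, with no per-page header; the page index, page size
and header length are placeholders for whatever layout we settle on.)

  #include <stdint.h>
  #include <unistd.h>

  /* Place a guest page at its natural offset in the shared save file.
   * Rewriting a dirty page simply overwrites the previous copy. */
  static int write_page_at_offset(int fd, const void *page, size_t page_size,
                                  uint64_t page_index, uint64_t header_len)
  {
      off_t off = header_len + page_index * page_size;

      return pwrite(fd, page, page_size, off) == (ssize_t)page_size ? 0 : -1;
  }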

> > If we take advantage of scenario where QEMU RAM is backed by a
> > tmpfs/hugepagefs file, we can potentially eliminate both copies
> > in userspace. The kernel can be told to copy direct from the
> > hugepagefs file into the disk file.
> 
> Interesting, still we incur in the file cache trashing as we write though right?

I'm not sure to be honest. I struggle to find docs about whether
sendfile is compatible with an FD opened with O_DIRECT.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Dr. David Alan Gilbert 2 years ago
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Fri, Apr 22, 2022 at 01:40:20PM +0200, Claudio Fontana wrote:
> > On 4/22/22 10:19 AM, Daniel P. Berrangé wrote:
> > > On Thu, Apr 21, 2022 at 08:06:40PM +0200, Claudio Fontana wrote:
> > >> On 4/21/22 7:08 PM, Daniel P. Berrangé wrote:
> > >>> On Thu, Apr 14, 2022 at 09:54:16AM +0200, Claudio Fontana wrote:
> > >>>> RFC, starting point for discussion.
> > >>>>
> > >>>> Sketch API changes to allow parallel Saves, and open up
> > >>>> and implementation for QEMU to leverage multifd migration to files,
> > >>>> with optional multifd compression.
> > >>>>
> > >>>> This allows to improve save times for huge VMs.
> > >>>>
> > >>>> The idea is to issue commands like:
> > >>>>
> > >>>> virsh save domain /path/savevm --parallel --parallel-connections 2
> > >>>>
> > >>>> and have libvirt start a multifd migration to:
> > >>>>
> > >>>> /path/savevm   : main migration connection
> > >>>> /path/savevm.1 : multifd channel 1
> > >>>> /path/savevm.2 : multifd channel 2
> > >>>
> > >>> At a conceptual level the idea would to still have a single file,
> > >>> but have threads writing to different regions of it. I don't think
> > >>> that's possible with multifd though, as it doesn't partition RAM
> > >>> up between threads, its just hands out pages on demand. So if one
> > >>> thread happens to be quicker it'll send more RAM than another
> > >>> thread. Also we're basically capturing the migration RAM, and the
> > >>> multifd channels have control info, in addition to the RAM pages.
> > >>>
> > >>> That makes me wonder actually, are the multifd streams unidirectional
> > >>> or bidirectional ?  Our saving to a file logic, relies on the streams
> > >>> being unidirectional.
> > >>
> > >>
> > >> Unidirectional. In the meantime I completed an actual libvirt prototype that works (only did the save part, not the restore yet).
> > >>
> > >>
> > >>>
> > >>> You've got me thinking, however, whether we can take QEMU out of
> > >>> the loop entirely for saving RAM.
> > >>>
> > >>> IIUC with 'x-ignore-shared' migration capability QEMU will skip
> > >>> saving of RAM region entirely (well technically any region marked
> > >>> as 'shared', which I guess can cover more things). 
> > >>
> > >> Heh I have no idea about this.
> > >>
> > >>>
> > >>> If the QEMU process is configured with a file backed shared
> > >>> memory, or memfd, I wonder if we can take advantage of this.
> > >>> eg
> > >>>
> > >>>   1. pause the VM
> > >>>   1. write the libvirt header to save.img
> > >>>   2. sendfile(qemus-memfd, save.img-fd)  to copy the entire
> > >>>      RAM after header
> > >>
> > >> I don't understand this point very much... if the ram is already
> > >> backed by file why are we sending this again..?
> > > 
> > > It is a file pointing to hugepagefs or tmpfs. It is still actually
> > > RAM, but we exposed it to QEMU via a file, which QEMU then mmap'd.
> > > 
> > > We don't do this by default, but anyone with large (many GB) VMs
> > > is increasingly likel to be relying on huge pages to optimize
> > > their VM performance.
> > 
> > For what I could observe I'd say it depends on the specific scenario,
> > how much memory we have to work with, the general compromise between cpu, memory, disk, ... all of which is subject to cost optimization.
> > 
> > > 
> > > In our current save scheme we have (at least) 2 copies going
> > > on. QEMU copies from RAM into the FD it uses for migrate.
> > > libvirt IO helper copies from the FD into the file. This involves
> > > multiple threads and multiple userspace/kernel switches and data
> > > copies.  You've been trying to eliminate the 2nd copy in userspace.
> > 
> > I've been trying to eliminate the 2nd copy in userspace, but this is just aspect 1) I have in mind,
> > it is good but gives only so much, and for huge VMs things fall apart when reaching the file cache trashing problem.
> 
> Agreed.
> 
> > Aspect 2) in my mind is the file cache trashing that the kernel gets into, is the reason that we need O_DIRECT at all with huge VMs I think,
> > which creates a lot of complications (ie we are kinda forced to have a helper anyway to ensure block aligned source, destination addresses and length),
> > and suboptimal performance.
> 
> Right, we can eliminate the second copy, or we can eliminate cache
> trashing, but not both.
> 
> > Aspect 3) is a practical solution that I already prototyped and yields very good results in practice,
> > which is to make better use of the resources we have, since we have a certain number of cpus assigned to run VMs,
> > and the save/restore operations we need happen with a suspended guest, so we can exploit this to get those cpus to good use,
> > and reduce the problem size by leveraging multifd and compression which comes for free from qemu.
> > 
> > I think that until the file cache issue remains unsolved, we are stuck with O_DIRECT, so we are stuck with a helper,
> > and at that point we can easily have a
> > 
> > multifd-helper
> > 
> > that reuses the code from iohelper, and performs O_DIRECT writes of the compressed streams to multiple files in parallel.
> 
> I'm worried that we could be taking ourselves down a dead-end by
> trying to optimize on the libvirt side, because we've got a
> mismatch  between the QMP APIs we're using and the intent of
> QEMU.
> 
> The QEMU migration APIs were designed around streaming to a
> remote instance, and we're essentially playing games to use
> them as a way to write to local storage.

Yes.

> The RAM pages we're saving are of course page aligned in QEMU
> because they are mapped RAM. We loose/throwaway the page
> alignment because we're sending them over a FD, potentially
> adding in each metadata headers to identify which location
> the RAM block came from. 
> 
> QEMU has APIs for doing async I/O to local storage using
> O_DIRECT, via the BlockDev layer. QEMU can even use this
> for saving state via the loadvm/savevm monitor commands
> for internal snapshots. This is not accessible via the
> normal migration QMP command though.
> 
> 
> I feel to give ourselves the best chance of optimizing the
> save/restore, we need to get QEMU to have full knowledge of
> what is going on, and get libvirt out of the picture almost
> entirely.
> 
> If QEMU knows that the migration source/target is a random
> access file, rather than a stream, then it will not have
> to attach any headers to identify RAM pages. It can just
> read/write them directly at a fixed offset in the file.
> It can even do this while the CPU is running, just overwriting
> the previously written page on disk if the contents changed.
> 
> This would mean the save image is a fixed size exactly
> matching the RAM size, plus libvirt header and vmstate.
> Right now if we save a live snapshot, the save image can
> be almost arbitrarily large, since we'll save the same
> RAM page over & over again if the VM is modifying the
> content.
> 
> I think we need to introduce an explicit 'file:' protocol
> for the migrate command, that is backed by the blockdev APIs
> so it can do O_DIRECT and non-blocking AIO.  For the 'fd:'
> protocol, we need to be able to tell QEMU whether the 'fd'
> is a stream or a regular file, so it can choose between the
> regular send/recv APIs, vs the Blockdev APIs (maybe we can
> auto-detect with fstat()).  If we do this, then multifd
> doesn't end up needing multiple save files on disk, all
> the threads can be directly writing to the same file, just
> as the relevant offsets on disk to match the RAM page
> location.

Hmm so what I'm not sure of is whether it makes sense to use the normal
migration flow/code for this or not; and you're suggesting a few
possibly contradictory things.

Adding a file: protocol would be pretty easy (whether it went via
the blockdev layer or not); getting it to be more efficient is the
tricky part, because we've got loads of levels of stream abstraction in
the RAM save code:
    QEMUFile->channel->OS
but then if you want to enforce alignment you somehow have to make that
go all the way down.

If you weren't doing it live then you could come up with a mode
that just did one big fat write(2) for each RAM block, and frankly just
sidestepped the entire rest of the RAM migration code.
But then you're suggesting being able to do it live, writing into a
fixed place on disk, which means you have to change the (already
complicated) RAM migration code rather than sidestepping it.
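
(For the non-live case, the "one big write(2) per RAM block" mode could
be as dumb as the sketch below; the block descriptor is a stand-in, not
QEMU's actual RAMBlock.)

  #include <errno.h>
  #include <stddef.h>
  #include <unistd.h>

  struct ram_block_desc {    /* stand-in for the real RAMBlock */
      void *host;            /* mapped guest RAM */
      size_t used_length;    /* bytes in use */
  };

  static int write_block(int fd, const struct ram_block_desc *b)
  {
      const char *p = b->host;
      size_t left = b->used_length;

      while (left > 0) {
          ssize_t n = write(fd, p, left);
          if (n < 0) {
              if (errno == EINTR)
                  continue;
              return -1;
          }
          p += n;
          left -= n;
      }
      return 0;
  }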

Dave

> > > If we take advantage of scenario where QEMU RAM is backed by a
> > > tmpfs/hugepagefs file, we can potentially eliminate both copies
> > > in userspace. The kernel can be told to copy direct from the
> > > hugepagefs file into the disk file.
> > 
> > Interesting, still we incur in the file cache trashing as we write though right?
> 
> I'm not sure to be honest. I struggle to find docs about whether
> sendfile is compatible with an FD opened with O_DIRECT.
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Daniel P. Berrangé 2 years ago
On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > I'm worried that we could be taking ourselves down a dead-end by
> > trying to optimize on the libvirt side, because we've got a
> > mismatch  between the QMP APIs we're using and the intent of
> > QEMU.
> > 
> > The QEMU migration APIs were designed around streaming to a
> > remote instance, and we're essentially playing games to use
> > them as a way to write to local storage.
> 
> Yes.
> 
> > The RAM pages we're saving are of course page aligned in QEMU
> > because they are mapped RAM. We loose/throwaway the page
> > alignment because we're sending them over a FD, potentially
> > adding in each metadata headers to identify which location
> > the RAM block came from. 
> > 
> > QEMU has APIs for doing async I/O to local storage using
> > O_DIRECT, via the BlockDev layer. QEMU can even use this
> > for saving state via the loadvm/savevm monitor commands
> > for internal snapshots. This is not accessible via the
> > normal migration QMP command though.
> > 
> > 
> > I feel to give ourselves the best chance of optimizing the
> > save/restore, we need to get QEMU to have full knowledge of
> > what is going on, and get libvirt out of the picture almost
> > entirely.
> > 
> > If QEMU knows that the migration source/target is a random
> > access file, rather than a stream, then it will not have
> > to attach any headers to identify RAM pages. It can just
> > read/write them directly at a fixed offset in the file.
> > It can even do this while the CPU is running, just overwriting
> > the previously written page on disk if the contents changed.
> > 
> > This would mean the save image is a fixed size exactly
> > matching the RAM size, plus libvirt header and vmstate.
> > Right now if we save a live snapshot, the save image can
> > be almost arbitrarily large, since we'll save the same
> > RAM page over & over again if the VM is modifying the
> > content.
> > 
> > I think we need to introduce an explicit 'file:' protocol
> > for the migrate command, that is backed by the blockdev APIs
> > so it can do O_DIRECT and non-blocking AIO.  For the 'fd:'
> > protocol, we need to be able to tell QEMU whether the 'fd'
> > is a stream or a regular file, so it can choose between the
> > regular send/recv APIs, vs the Blockdev APIs (maybe we can
> > auto-detect with fstat()).  If we do this, then multifd
> > doesn't end up needing multiple save files on disk, all
> > the threads can be directly writing to the same file, just
> > as the relevant offsets on disk to match the RAM page
> > location.
> 
> Hmm so what I'm not sure of is whether it makes sense to use the normal
> migration flow/code for this or not; and you're suggesting a few
> possibly contradictory things.
> 
> Adding a file: protocol would be pretty easy (whether it went via
> the blockdev layer or not); getting it to be more efficient is the
> tricky part, because we've got loads of levels of stream abstraction in
> the RAM save code:
>     QEMUFile->channel->OS
> but then if you want to enforce alignment you somehow have to make that
> go all the way down.

The QIOChannel stuff doesn't add buffering, so I wasn't worried
about alignment there.

QEMUFile has optional buffering which would mess with alignment,
but we could potentially turn that off for the RAM transfer if
using multifd.

I'm confident, though, that the performance on the QEMU side could
exceed what's viable with libvirt's iohelper today, as we
would definitely be eliminating one copy and many context switches.

> If you weren't doing it live then you could come up with a mode
> that just did one big fat write(2) for each RAM Block; and frankly just
> sidestepped the entire rest of the RAM migration code.
> But then you're suggesting being able to do it live writing it into a
> fixed place on disk; which says that you have to change the (already
> complicated) RAM migration code rather than sidestepping it.

Yeah, we need "live" for the live snapshot - which fits in with
the previously discussed goal of turning the 'savevm/snapshot-save'
HMP/QMP impls into a facade around 'migrate' + 'block-copy' QMP
commands.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Dr. David Alan Gilbert 2 years ago
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > I'm worried that we could be taking ourselves down a dead-end by
> > > trying to optimize on the libvirt side, because we've got a
> > > mismatch  between the QMP APIs we're using and the intent of
> > > QEMU.
> > > 
> > > The QEMU migration APIs were designed around streaming to a
> > > remote instance, and we're essentially playing games to use
> > > them as a way to write to local storage.
> > 
> > Yes.
> > 
> > > The RAM pages we're saving are of course page aligned in QEMU
> > > because they are mapped RAM. We loose/throwaway the page
> > > alignment because we're sending them over a FD, potentially
> > > adding in each metadata headers to identify which location
> > > the RAM block came from. 
> > > 
> > > QEMU has APIs for doing async I/O to local storage using
> > > O_DIRECT, via the BlockDev layer. QEMU can even use this
> > > for saving state via the loadvm/savevm monitor commands
> > > for internal snapshots. This is not accessible via the
> > > normal migration QMP command though.
> > > 
> > > 
> > > I feel to give ourselves the best chance of optimizing the
> > > save/restore, we need to get QEMU to have full knowledge of
> > > what is going on, and get libvirt out of the picture almost
> > > entirely.
> > > 
> > > If QEMU knows that the migration source/target is a random
> > > access file, rather than a stream, then it will not have
> > > to attach any headers to identify RAM pages. It can just
> > > read/write them directly at a fixed offset in the file.
> > > It can even do this while the CPU is running, just overwriting
> > > the previously written page on disk if the contents changed.
> > > 
> > > This would mean the save image is a fixed size exactly
> > > matching the RAM size, plus libvirt header and vmstate.
> > > Right now if we save a live snapshot, the save image can
> > > be almost arbitrarily large, since we'll save the same
> > > RAM page over & over again if the VM is modifying the
> > > content.
> > > 
> > > I think we need to introduce an explicit 'file:' protocol
> > > for the migrate command, that is backed by the blockdev APIs
> > > so it can do O_DIRECT and non-blocking AIO.  For the 'fd:'
> > > protocol, we need to be able to tell QEMU whether the 'fd'
> > > is a stream or a regular file, so it can choose between the
> > > regular send/recv APIs, vs the Blockdev APIs (maybe we can
> > > auto-detect with fstat()).  If we do this, then multifd
> > > doesn't end up needing multiple save files on disk, all
> > > the threads can be directly writing to the same file, just
> > > as the relevant offsets on disk to match the RAM page
> > > location.
> > 
> > Hmm so what I'm not sure of is whether it makes sense to use the normal
> > migration flow/code for this or not; and you're suggesting a few
> > possibly contradictory things.
> > 
> > Adding a file: protocol would be pretty easy (whether it went via
> > the blockdev layer or not); getting it to be more efficient is the
> > tricky part, because we've got loads of levels of stream abstraction in
> > the RAM save code:
> >     QEMUFile->channel->OS
> > but then if you want to enforce alignment you somehow have to make that
> > go all the way down.
> 
> The QIOChannel stuff doesn't add buffering, so I wasn't worried
> about alignment there.
> 
> QEMUFile has optional buffering which would mess with alignment,
> but we could turn that off potentially for the RAM transfer, if
> using multifd.

The problem isn't whether they add buffering or not; the problem is that
you now need a mechanism to ask for alignment.

> I'm confident the performance on the QMEU side though could
> exceed what's viable with libvirt's iohelper  today, as we
> would definitely be eliminating 1 copy and many context switches.

Yes but you get that just from adding a simple file: (or fd:) mode
without trying to do anything clever with alignment or rewriting the
same offset.

> > If you weren't doing it live then you could come up with a mode
> > that just did one big fat write(2) for each RAM Block; and frankly just
> > sidestepped the entire rest of the RAM migration code.
> > But then you're suggesting being able to do it live writing it into a
> > fixed place on disk; which says that you have to change the (already
> > complicated) RAM migration code rather than sidestepping it.
> 
> Yeah, we need "live" for the live snapshot - which fits in with
> the previously discussed goal of turning the 'savevm/snapshot-save'
> HMP/QMP impls into a facade around 'migrate' + 'block-copy' QMP
> commands.

Dave

> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Daniel P. Berrangé 2 years ago
On Mon, Apr 25, 2022 at 01:33:41PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > I'm worried that we could be taking ourselves down a dead-end by
> > > > trying to optimize on the libvirt side, because we've got a
> > > > mismatch  between the QMP APIs we're using and the intent of
> > > > QEMU.
> > > > 
> > > > The QEMU migration APIs were designed around streaming to a
> > > > remote instance, and we're essentially playing games to use
> > > > them as a way to write to local storage.
> > > 
> > > Yes.
> > > 
> > > > The RAM pages we're saving are of course page aligned in QEMU
> > > > because they are mapped RAM. We loose/throwaway the page
> > > > alignment because we're sending them over a FD, potentially
> > > > adding in each metadata headers to identify which location
> > > > the RAM block came from. 
> > > > 
> > > > QEMU has APIs for doing async I/O to local storage using
> > > > O_DIRECT, via the BlockDev layer. QEMU can even use this
> > > > for saving state via the loadvm/savevm monitor commands
> > > > for internal snapshots. This is not accessible via the
> > > > normal migration QMP command though.
> > > > 
> > > > 
> > > > I feel to give ourselves the best chance of optimizing the
> > > > save/restore, we need to get QEMU to have full knowledge of
> > > > what is going on, and get libvirt out of the picture almost
> > > > entirely.
> > > > 
> > > > If QEMU knows that the migration source/target is a random
> > > > access file, rather than a stream, then it will not have
> > > > to attach any headers to identify RAM pages. It can just
> > > > read/write them directly at a fixed offset in the file.
> > > > It can even do this while the CPU is running, just overwriting
> > > > the previously written page on disk if the contents changed.
> > > > 
> > > > This would mean the save image is a fixed size exactly
> > > > matching the RAM size, plus libvirt header and vmstate.
> > > > Right now if we save a live snapshot, the save image can
> > > > be almost arbitrarily large, since we'll save the same
> > > > RAM page over & over again if the VM is modifying the
> > > > content.
> > > > 
> > > > I think we need to introduce an explicit 'file:' protocol
> > > > for the migrate command, that is backed by the blockdev APIs
> > > > so it can do O_DIRECT and non-blocking AIO.  For the 'fd:'
> > > > protocol, we need to be able to tell QEMU whether the 'fd'
> > > > is a stream or a regular file, so it can choose between the
> > > > regular send/recv APIs, vs the Blockdev APIs (maybe we can
> > > > auto-detect with fstat()).  If we do this, then multifd
> > > > doesn't end up needing multiple save files on disk, all
> > > > the threads can be directly writing to the same file, just
> > > > as the relevant offsets on disk to match the RAM page
> > > > location.
> > > 
> > > Hmm so what I'm not sure of is whether it makes sense to use the normal
> > > migration flow/code for this or not; and you're suggesting a few
> > > possibly contradictory things.
> > > 
> > > Adding a file: protocol would be pretty easy (whether it went via
> > > the blockdev layer or not); getting it to be more efficient is the
> > > tricky part, because we've got loads of levels of stream abstraction in
> > > the RAM save code:
> > >     QEMUFile->channel->OS
> > > but then if you want to enforce alignment you somehow have to make that
> > > go all the way down.
> > 
> > The QIOChannel stuff doesn't add buffering, so I wasn't worried
> > about alignment there.
> > 
> > QEMUFile has optional buffering which would mess with alignment,
> > but we could turn that off potentially for the RAM transfer, if
> > using multifd.
> 
> The problem isn't whether they add buffering or not; the problem is you
> now need a way to add a mechanism to ask for alignment.
> 
> > I'm confident the performance on the QMEU side though could
> > exceed what's viable with libvirt's iohelper  today, as we
> > would definitely be eliminating 1 copy and many context switches.
> 
> Yes but you get that just from adding a simple file: (or fd:) mode
> without trying to do anything clever with alignment or rewriting the
> same offset.

I don't think so, as libvirt supports O_DIRECT today to avoid
trashing the host cache when saving VMs. So to be able to
offload libvirt's work to QEMU, O_DIRECT is a prerequisite.
So we do need the alignment support at the very least. Rewriting
at the same offset isn't mandatory, but I think it'd make multifd
saner when trying to have all threads work on the same file.
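
For context, a minimal standalone sketch of the constraint being discussed
(not libvirt or QEMU code; the path and the 4 KiB block size are made up):
with O_DIRECT the buffer address, the file offset and the transfer length
all have to be multiples of the device block size, which is why alignment
has to be plumbed through the save path rather than bolted on at the end.

#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN 4096           /* assumed logical block size */

int main(void)
{
    void *buf = NULL;
    int fd = open("/tmp/o_direct_demo.img",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The buffer must be block aligned; a plain malloc()ed buffer
     * would typically make pwrite() fail with EINVAL. */
    if (posix_memalign(&buf, ALIGN, ALIGN) != 0) {
        perror("posix_memalign");
        close(fd);
        return 1;
    }
    memset(buf, 0xab, ALIGN);

    /* The offset and length must be block aligned too. */
    if (pwrite(fd, buf, ALIGN, 0) != ALIGN) {
        perror("pwrite");
    }

    free(buf);
    close(fd);
    return 0;
}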


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Dr. David Alan Gilbert 2 years ago
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Apr 25, 2022 at 01:33:41PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > > I'm worried that we could be taking ourselves down a dead-end by
> > > > > trying to optimize on the libvirt side, because we've got a
> > > > > mismatch  between the QMP APIs we're using and the intent of
> > > > > QEMU.
> > > > > 
> > > > > The QEMU migration APIs were designed around streaming to a
> > > > > remote instance, and we're essentially playing games to use
> > > > > them as a way to write to local storage.
> > > > 
> > > > Yes.
> > > > 
> > > > > The RAM pages we're saving are of course page aligned in QEMU
> > > > > because they are mapped RAM. We loose/throwaway the page
> > > > > alignment because we're sending them over a FD, potentially
> > > > > adding in each metadata headers to identify which location
> > > > > the RAM block came from. 
> > > > > 
> > > > > QEMU has APIs for doing async I/O to local storage using
> > > > > O_DIRECT, via the BlockDev layer. QEMU can even use this
> > > > > for saving state via the loadvm/savevm monitor commands
> > > > > for internal snapshots. This is not accessible via the
> > > > > normal migration QMP command though.
> > > > > 
> > > > > 
> > > > > I feel to give ourselves the best chance of optimizing the
> > > > > save/restore, we need to get QEMU to have full knowledge of
> > > > > what is going on, and get libvirt out of the picture almost
> > > > > entirely.
> > > > > 
> > > > > If QEMU knows that the migration source/target is a random
> > > > > access file, rather than a stream, then it will not have
> > > > > to attach any headers to identify RAM pages. It can just
> > > > > read/write them directly at a fixed offset in the file.
> > > > > It can even do this while the CPU is running, just overwriting
> > > > > the previously written page on disk if the contents changed.
> > > > > 
> > > > > This would mean the save image is a fixed size exactly
> > > > > matching the RAM size, plus libvirt header and vmstate.
> > > > > Right now if we save a live snapshot, the save image can
> > > > > be almost arbitrarily large, since we'll save the same
> > > > > RAM page over & over again if the VM is modifying the
> > > > > content.
> > > > > 
> > > > > I think we need to introduce an explicit 'file:' protocol
> > > > > for the migrate command, that is backed by the blockdev APIs
> > > > > so it can do O_DIRECT and non-blocking AIO.  For the 'fd:'
> > > > > protocol, we need to be able to tell QEMU whether the 'fd'
> > > > > is a stream or a regular file, so it can choose between the
> > > > > regular send/recv APIs, vs the Blockdev APIs (maybe we can
> > > > > auto-detect with fstat()).  If we do this, then multifd
> > > > > doesn't end up needing multiple save files on disk, all
> > > > > the threads can be directly writing to the same file, just
> > > > > as the relevant offsets on disk to match the RAM page
> > > > > location.
> > > > 
> > > > Hmm so what I'm not sure of is whether it makes sense to use the normal
> > > > migration flow/code for this or not; and you're suggesting a few
> > > > possibly contradictory things.
> > > > 
> > > > Adding a file: protocol would be pretty easy (whether it went via
> > > > the blockdev layer or not); getting it to be more efficient is the
> > > > tricky part, because we've got loads of levels of stream abstraction in
> > > > the RAM save code:
> > > >     QEMUFile->channel->OS
> > > > but then if you want to enforce alignment you somehow have to make that
> > > > go all the way down.
> > > 
> > > The QIOChannel stuff doesn't add buffering, so I wasn't worried
> > > about alignment there.
> > > 
> > > QEMUFile has optional buffering which would mess with alignment,
> > > but we could turn that off potentially for the RAM transfer, if
> > > using multifd.
> > 
> > The problem isn't whether they add buffering or not; the problem is you
> > now need a way to add a mechanism to ask for alignment.
> > 
> > > I'm confident the performance on the QMEU side though could
> > > exceed what's viable with libvirt's iohelper  today, as we
> > > would definitely be eliminating 1 copy and many context switches.
> > 
> > Yes but you get that just from adding a simple file: (or fd:) mode
> > without trying to do anything clever with alignment or rewriting the
> > same offset.
> 
> I don't think so, as libvirt supports O_DIRECT today to avoid
> trashing the host cache when saving VMs. So to be able to
> offload libvirt's work to QEMU, O_DIRECT is a pre-requisite.

I guess you could O_DIRECT it from a buffer in QEMUFile or the channel.

> So we do need the alignment support at the very least. Rewriting
> at the same offset isnt mandatory, but I think it'd make multifd
> saner if trying to have all threads work on the same file.

Thinking on the fly, you'd need some non-trivial changes:

  a) A section entry in the format to say 'align to ... n bytes'
    (easyish; sketched below)
  b) A way to allocate a location in the file to a RAMBlock
    [ We already have a bitmap address, so that might do, but
    you need to make it interact with the existing file, so it might
    be easier to do the allocation and record it ]
  c) A way to say to the layers below it while writing RAM that it
    needs to go in a given location.
  d) A clean way for a..c only to happen in this case.
  e) Hmm ram size changes/hotplug/virtio-mem
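
Item (a) essentially asks for a way to pad the output up to the next
n-byte boundary before aligned data follows. A rough standalone sketch of
that idea (a hypothetical helper, not the real migration stream format):

#include <stdint.h>
#include <stdio.h>

/* Emit zero padding from stream position 'pos' up to the next 'align'-byte
 * boundary, so whatever is written next starts aligned.  Returns the number
 * of padding bytes written, or -1 on a write failure. */
static long stream_align_to(FILE *out, uint64_t pos, uint64_t align)
{
    static const uint8_t zeros[4096];
    uint64_t pad = (align - (pos % align)) % align;

    for (uint64_t left = pad; left > 0; ) {
        size_t chunk = left < sizeof(zeros) ? (size_t)left : sizeof(zeros);
        if (fwrite(zeros, 1, chunk, out) != chunk) {
            return -1;
        }
        left -= chunk;
    }
    return (long)pad;
}

int main(void)
{
    /* Example: after 10 bytes of header, pad so the next write starts
     * 4096-byte aligned (4086 padding bytes here). */
    fwrite("RAM HEADER", 1, 10, stdout);
    long pad = stream_align_to(stdout, 10, 4096);
    fprintf(stderr, "padded %ld bytes\n", pad);
    return 0;
}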

Dave

> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Re: [libvirt RFC] add API for parallel Saves (not for committing)
Posted by Daniel P. Berrangé 2 years ago
On Mon, Apr 25, 2022 at 03:25:43PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Mon, Apr 25, 2022 at 01:33:41PM +0100, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > On Mon, Apr 25, 2022 at 12:04:37PM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > > > > I think we need to introduce an explicit 'file:' protocol
> > > > > > for the migrate command, that is backed by the blockdev APIs
> > > > > > so it can do O_DIRECT and non-blocking AIO.  For the 'fd:'
> > > > > > protocol, we need to be able to tell QEMU whether the 'fd'
> > > > > > is a stream or a regular file, so it can choose between the
> > > > > > regular send/recv APIs, vs the Blockdev APIs (maybe we can
> > > > > > auto-detect with fstat()).  If we do this, then multifd
> > > > > > doesn't end up needing multiple save files on disk, all
> > > > > > the threads can be directly writing to the same file, just
> > > > > > as the relevant offsets on disk to match the RAM page
> > > > > > location.
> > > > > 
> > > > > Hmm so what I'm not sure of is whether it makes sense to use the normal
> > > > > migration flow/code for this or not; and you're suggesting a few
> > > > > possibly contradictory things.
> > > > > 
> > > > > Adding a file: protocol would be pretty easy (whether it went via
> > > > > the blockdev layer or not); getting it to be more efficient is the
> > > > > tricky part, because we've got loads of levels of stream abstraction in
> > > > > the RAM save code:
> > > > >     QEMUFile->channel->OS
> > > > > but then if you want to enforce alignment you somehow have to make that
> > > > > go all the way down.
> > > > 
> > > > The QIOChannel stuff doesn't add buffering, so I wasn't worried
> > > > about alignment there.
> > > > 
> > > > QEMUFile has optional buffering which would mess with alignment,
> > > > but we could turn that off potentially for the RAM transfer, if
> > > > using multifd.
> > > 
> > > The problem isn't whether they add buffering or not; the problem is you
> > > now need a way to add a mechanism to ask for alignment.
> > > 
> > > > I'm confident the performance on the QMEU side though could
> > > > exceed what's viable with libvirt's iohelper  today, as we
> > > > would definitely be eliminating 1 copy and many context switches.
> > > 
> > > Yes but you get that just from adding a simple file: (or fd:) mode
> > > without trying to do anything clever with alignment or rewriting the
> > > same offset.
> > 
> > I don't think so, as libvirt supports O_DIRECT today to avoid
> > trashing the host cache when saving VMs. So to be able to
> > offload libvirt's work to QEMU, O_DIRECT is a pre-requisite.
> 
> I guess you could O_DIRECT it from a buffer in QemuFile or the channel.
> 
> > So we do need the alignment support at the very least. Rewriting
> > at the same offset isnt mandatory, but I think it'd make multifd
> > saner if trying to have all threads work on the same file.
> 
> Thinking on the fly, you'd need some non trivial changes:
> 
>   a) A section entry in the format to say 'align to ... n bytes'
>     (easyish)

Yep

>   b) A way to allocate a location in the file to a RAMBlock
>     [ We already have a bitmap address, so that might do,  but
>     you need to make it interact with the existing file, so it might
>     be easier to do the allocate and record it ]

IIUC, the migration protocol first serializes all RAM, and then serializes
the VMstate for devices.  When libvirt creates a save image for a VM it
has its own header + XML dump and then appends the migrate stream.

So we get a layout of

   +------------------+
   | libvirt header   |
   +------------------+
   | libvirt XML      |
   | ...              |
   +------------------+
   | migration stream |
   | ...              |
   +------------------+

The 'migration stream' is opaque as far as libvirt is concerned,
but we happen to know that from QEMU POV it expands to

   +------------------+
   | RAM stream       |
   | ...              |
   +------------------+
   | vmstate          |
   | ...              |
   +------------------+

Where 'RAM stream' is a stream of the RAM block header and
RAM block contents, for every page.

In the suggestion above, my desire is to achieve this layout


   +------------------+
   | libvirt header   |
   +------------------+
   | libvirt XML      |
   | ...              |
   +------------------+
   | RAM              |
   | ...              |
   +------------------+
   | vmstate          |
   | ...              |
   +------------------+

The key difference being 'RAM' instead of 'RAM stream'. Libvirt
would have to tell QEMU what offset in the file it is permitted
to start at - say 16 MB.  'RAM' would be a 1:1 mapping of the
guest RAM, simply offset by that 16 MB.

IOW, I'm assuming that a 4 GB RAM VM would have its RAM written
starting from offset 16 MB and ending at 4 GB + 16 MB.

I'm thinking stuff like virtio-mem / RAM hotplug makes life
harder though, as there can be many distinct blocks of RAM
contributing to the QEMU memory map, and we need to be able to
declare an ordering of them for mapping to the file.
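
To make the arithmetic concrete, a small standalone sketch (illustrative
only, not QEMU code; the block names, sizes and 4 KiB alignment are made
up) of laying RAM blocks out 1:1 in the save file, starting from a base
offset reserved for the libvirt header and XML:

#include <inttypes.h>
#include <stdio.h>

#define BASE_OFFSET  (16ULL << 20)   /* space reserved for libvirt header + XML */
#define BLOCK_ALIGN  4096ULL         /* alignment needed for O_DIRECT */

typedef struct {
    const char *name;   /* RAM block id, e.g. "pc.ram" */
    uint64_t size;      /* block size in bytes */
    uint64_t offset;    /* assigned location in the save file */
} RamBlockLayout;

/* Walk the blocks in their declared order and give each one an aligned,
 * non-overlapping range in the file. */
static void assign_offsets(RamBlockLayout *blocks, size_t n)
{
    uint64_t pos = BASE_OFFSET;

    for (size_t i = 0; i < n; i++) {
        pos = (pos + BLOCK_ALIGN - 1) & ~(BLOCK_ALIGN - 1);
        blocks[i].offset = pos;
        pos += blocks[i].size;
    }
}

int main(void)
{
    /* Hypothetical guest: 4 GiB of main RAM plus one hot-pluggable block. */
    RamBlockLayout blocks[] = {
        { "pc.ram",      4ULL << 30, 0 },
        { "virtio-mem0", 1ULL << 30, 0 },
    };
    size_t n = sizeof(blocks) / sizeof(blocks[0]);

    assign_offsets(blocks, n);
    for (size_t i = 0; i < n; i++) {
        printf("%-12s size=%-11" PRIu64 " offset=%" PRIu64 "\n",
               blocks[i].name, blocks[i].size, blocks[i].offset);
    }
    return 0;
}

With just the 4 GiB "pc.ram" block this reproduces the 16 MB .. 4 GB + 16 MB
range above; once several blocks contribute, the declared ordering is exactly
what has to be recorded so a restore can find each block again.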



>   c) A way to say to the layers below it while writing RAM that it
>     needs to go in a given location.

Yes, QEMUFile would need some new APIs accepting offsets, and we
would need a way to report whether a given impl supports random
access, vs streaming only.
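
As a sketch of the shape such an interface could take (hypothetical names,
nothing here exists in QEMU today): detect via fstat() whether the output
fd is a regular file, expose that as a capability, and offer a positioned
write that streaming-only outputs simply cannot honour.

#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct SaveChannel {
    int fd;
    bool random_access;   /* true when fd refers to a regular file */
} SaveChannel;

static void save_channel_init(SaveChannel *c, int fd)
{
    struct stat st;

    c->fd = fd;
    c->random_access = fstat(fd, &st) == 0 && S_ISREG(st.st_mode);
}

/* Report whether writes at arbitrary offsets are possible on this output. */
static bool save_channel_supports_random_access(const SaveChannel *c)
{
    return c->random_access;
}

/* Write 'size' bytes at absolute file offset 'offset' when random access
 * is available, otherwise fall back to appending to the stream. */
static ssize_t save_channel_put_buffer_at(SaveChannel *c, const uint8_t *buf,
                                          size_t size, off_t offset)
{
    if (c->random_access) {
        return pwrite(c->fd, buf, size, offset);
    }
    return write(c->fd, buf, size);
}

int main(void)
{
    SaveChannel c;
    const uint8_t page[8] = { 0 };

    save_channel_init(&c, STDOUT_FILENO);   /* a pipe or tty: streaming only */
    if (save_channel_supports_random_access(&c)) {
        return save_channel_put_buffer_at(&c, page, sizeof(page), 4096) < 0;
    }
    return save_channel_put_buffer_at(&c, page, sizeof(page), 0) < 0;
}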

>   d) A clean way for a..c only to happen in this case.



>   e) Hmm ram size changes/hotplug/virtio-mem

This is "easy" from libvirt POV, since none of those things are possible
to be invoked during save/restore/migrate.

Hard(er) for QEMU since it doesn't place that restriction on users in the
general case.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|