[libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Eric Blake 5 years, 1 month ago
While looking at my work on incremental backups, Nir raised a good
point: if we want to recreate a set of known checkpoints on one
machine that will be taking over a domain from another machine,
my initial proposal required making multiple API calls to list the
XML for each checkpoint on the source, and then more API calls to
redefine each checkpoint on the destination; it also had the drawback
that the list has to be presented in topological order (the API won't
let you define a child checkpoint if the parent is not defined first).
He asked if I could instead add bulk APIs, for getting the XML for
all checkpoints at once, and then for redefining all checkpoints at
once.

Since my checkpoint code borrows heavily from concepts in the snapshot
code, I chose to tackle the problem by starting with this series, which
does the same thing for snapshots as what I plan to do for checkpoints.
That is, since this patch series adds 2 new APIs, the checkpoint series
will do likewise with very similar code.

I'm also toying with the idea of followup patches that store all
snapshots alongside the persistent <domain> XML in a single file,
rather than in one snapshot per <domainsnapshot> file (we'd still
support reading from the old method at libvirtd startup, but would
not need to output in that manner any more).

Based-on: <20190308060512.17733-1-eblake@redhat.com>
[0/5 snapshots: topological sorting]

Changes from v3:
- introduce 2 new API instead of flags to 2 existing API [Jan]
- rebase on top of minor changes that already landed in the
non-controversial first half of the v3 series [John]

git backport-diff gets confused by patch renames, but this is quite
obviously more or less a rewrite:

001/8:[down] 'snapshot: Add new API for bulk dumpxml/redefine'
002/8:[down] 'snapshot: Support topological virDomainSnapshotForEach()'
003/8:[down] 'snapshot: Tweaks to support new bulk dumpxml/import API'
004/8:[down] 'remote: Wire up snapshot bulk dumpxml/import'
005/8:[down] 'virsh: Expose bulk snapshot dumpxml/import'
006/8:[0087] [FC] 'test: Implement bulk snapshot operations'
007/8:[0040] [FC] 'qemu: Factor out qemuDomainSnapshotValidate() helper'
008/8:[0152] [FC] 'qemu: Implement bulk snapshot operations'

Eric Blake (8):
  snapshot: Add new API for bulk dumpxml/redefine
  snapshot: Support topological virDomainSnapshotForEach()
  snapshot: Tweaks to support new bulk dumpxml/import API
  remote: Wire up snapshot bulk dumpxml/import
  virsh: Expose bulk snapshot dumpxml/import
  test: Implement bulk snapshot operations
  qemu: Factor out qemuDomainSnapshotValidate() helper
  qemu: Implement bulk snapshot operations

 include/libvirt/libvirt-domain-snapshot.h |  16 +-
 src/conf/snapshot_conf.h                  |   1 +
 src/driver-hypervisor.h                   |  13 +-
 src/conf/snapshot_conf.c                  |  34 ++--
 src/libvirt-domain-snapshot.c             | 122 ++++++++++-
 src/libvirt_public.syms                   |   2 +
 src/qemu/qemu_domain.c                    |   2 +-
 src/qemu/qemu_driver.c                    | 236 +++++++++++++++++-----
 src/remote/remote_driver.c                |   6 +-
 src/remote/remote_protocol.x              |  38 +++-
 src/remote_protocol-structs               |  17 ++
 src/test/test_driver.c                    |  62 +++++-
 src/vz/vz_driver.c                        |   3 +-
 tools/virsh-snapshot.c                    | 111 ++++++++--
 tools/virsh.pod                           |  15 +-
 15 files changed, 584 insertions(+), 94 deletions(-)

-- 
2.20.1

Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Daniel P. Berrangé 5 years, 1 month ago
On Mon, Mar 11, 2019 at 09:38:31PM -0500, Eric Blake wrote:
> While looking at my work on incremental backups, Nir raised a good
> point: if we want to recreate a set of known checkpoints on one
> machine that will be taking over a domain from another machine,
> my initial proposal required making multiple API calls to list the
> XML for each checkpoint on the source, and then more API calls to
> redefine each checkpoint on the destination; it also had the drawback
> that the list has to be presented in topological order (the API won't
> let you define a child checkpoint if the parent is not defined first).
> He asked if I could instead add bulk APIs, for getting the XML for
> all checkpoints at once, and then for redefining all checkpoints at
> once.

To me the key blocking problem here is the topological sorting one
as it makes it incredibly painful to run individual APIs. Your other
series on list addresses this nicely by adding the topological flag.

Using that flag, in pseudo-Python-like code, the process of exporting
and importing snapshots from one host to another should look approximately
like this

  snapshots = srcdom.ListAllSnapshots(libvirt.VIR_DOMAIN_SNAPSHOT_LIST_TOPOLOGICAL)
  for snapshot in snapshots:
      xml = snapshot.XMLDesc()
      dstdom.SnapshotCreateXML(xml)
  
I don't really see a reason why this is so hard that it justifies
adding these bulk snapshot list/define  APIs.

At the same time there is a very real downside to providing these bulk
snapshot APIs in terms of scalability. The <domain> XML for a guest
can get very large. The <domainsnapshot> XML contains a copy of the
<domain> XML from the point in time of the snapshot. Therefore a
<snapshots> XML document is going to contain *many* <domain> XML
documents concatenated.

For a domain with many devices and many snapshots, the bulk list
APIs have non-negligible risk of hitting the RPC protocol limit
on string size.

We've seen this before with the bulk domain stats APIs. We nonetheless
added them because it was a compelling performance benefit for something
that is run very frequently. For apps to be scalable though, they have to
limit how many guests they request bulk stats for at any time.
This in fact makes the code using the bulk stats APIs more complex
than if it just called the individual APIs. The performance benefit
is still a win in this case, though.
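
For illustration, a rough Python sketch of that batching pattern (untested;
the batch size is an arbitrary example value, not a libvirt constant):

  import libvirt

  BATCH = 20  # keep each bulk call well under the RPC message limits

  conn = libvirt.openReadOnly('qemu:///system')
  doms = conn.listAllDomains()
  stats = []
  for i in range(0, len(doms), BATCH):
      # one bulk RPC per batch instead of one call per domain
      stats.extend(conn.domainListGetStats(doms[i:i + BATCH],
                                           libvirt.VIR_DOMAIN_STATS_STATE))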

In the bulk snapshot import/export case though, I'm not seeing a
compelling performance reason for their usage. Making the APIs
scalable would require limiting how many snapshots are returned
in any single call. There would then need to be repeated calls
to fetch the rest. This makes it more complicated than just
fetching/defining each individual snapshot.

IOW, overall I'm not seeing a big enough justification to add these
new APIs, that would outweigh the downsides they bring in terms of
scalability.

The topological sorting flag solves the really big pain point well
enough.

> I'm also toying with the idea of followup patches that store all
> snapshots alongside the persistent <domain> XML in a single file,
> rather than in one snapshot per <domainsnapshot> file (we'd still
> support reading from the old method at libvirtd startup, but would
> not need to output in that manner any more).

What would be the benefit of having it all in one file? During
initial libvirtd startup we could read the info from one file,
but we still need code to read the info from many files, so we're
not saving code in that respect. Most of the snapshot APIs that
save XML to disk only care about a single <domainsnapshot>
document so wouldn't benefit from having all snapshots in one
place.

Re-writing the entire <domain> XML means that a host failure during
a snapshot save can put the entire domain definition at risk of loss.
Though we do try to mitigate that by writing to a temp file and doing
an atomic rename, that doesn't eliminate the risk entirely.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Eric Blake 5 years, 1 month ago
On 3/12/19 1:10 PM, Daniel P. Berrangé wrote:
> On Mon, Mar 11, 2019 at 09:38:31PM -0500, Eric Blake wrote:
>> While looking at my work on incremental backups, Nir raised a good
>> point: if we want to recreate a set of known checkpoints on one
>> machine that will be taking over a domain from another machine,
>> my initial proposal required making multiple API calls to list the
>> XML for each checkpoint on the source, and then more API calls to
>> redefine each checkpoint on the destination; it also had the drawback
>> that the list has to be presented in topological order (the API won't
>> let you define a child checkpoint if the parent is not defined first).
>> He asked if I could instead add bulk APIs, for getting the XML for
>> all checkpoints at once, and then for redefining all checkpoints at
>> once.
> 
> To me the key blocking problem here is the topological sorting one
> as it makes it incredibly painful to run individual APIs. Your other
> series on list addresses this nicely by adding the topological flag.
> 
> Using that flag, in psuedo-python like code the process of exporting
> and importing snapshots from one host to another should look approx
> like this
> 
>   snapshots = srcdom.ListAllSnapshots(libvirt.VIR_DOMAIN_SNAPSHOT_LIST_TOPOLOGICAL)
>   for snapshot in snapshts:
>       xml = snapshot.XMLDesc()
>       dstdom.SnapshotCreateXML(xml)
>   
> I don't really see a reason why this is so hard that it justifies
> adding these bulk snapshot list/define  APIs.

Nir, do you want to chime in here?

> 
> At the same time there is a very real downside to providing these bulk
> snapshot APIs in terms of scalability. The <domain> XML for a guest
> can get very large. The <domainsnapshot> XML contains a copy of the
> <domain> XML from the point in time of the snapsht. Therefore a
> <snapshots> XML document is going to contain *many* <domain> XML
> documents concatenated.
> 
> For a domain with many devices and many snapshots, the bulk list
> APIs have non-negligible risk of hitting the RPC protocol limit
> on string size.

I did not consider that, but you are definitely correct about it being a
real problem (I didn't trip over it in my small domain testing, but
combined XML is definitely a repetition of a <domain> per <domainsnapshot>).

One idea I had on the checkpoint series was adding a NO_DOMAIN flag
which outputs much more compact XML by omitting the <domain> element.
With checkpoints, the <domain> at the time of the checkpoint is less
important (useful to know if we need it, but we can't revert to that
state); but with snapshots, the <domain> at the time of the snapshot is
very important (to the point that you have to request a flag for an
unsafe revert in virDomainSnapshotRevert() if the <domain> information
was incomplete).  So using a NO_DOMAIN flag to minimize XML so that the
bulk output doesn't exceed RPC limits is a possibility, but not a very
nice one (as you'd STILL have to do a one-API-per-snapshot call to
reinstate the <domain> of any snapshot that you plan to revert to).

> 
> We've seen this before with the bulk domain stats APIs we added.
> We none the less still added the bulk domain stats APIs because
> it was a compelling performance benefit for something that is run
> very frequently. For apps to be scalable though, they have to
> limit how many guests they request bulk stats for at any time.
> This in fact makes the code using the bulk stats APIs more complex
> than if it just called the individual APIs. The performance benefit
> is still a win though in this case.
> 
> In the bulk snapshot import/export case though, I'm not seeing a
> compelling performance reason for their usage. Making the APIs
> scalable would require to limit how many snapshots are returned
> in any single call. There would then need to be repeated calls
> to fetch the rest. This makes it more complicated than just
> fetching/defining each individual snapshot

And the frequency of migrating snapshots is less than the frequency of
querying bulk stats. It makes the most sense to speed up APIs called
frequently.

> 
> IOW, overall I'm not seeing a big enough justification to add these
> new APIs, that would outweigh the downsides they bring in terms of
> scalability.

You raise some good points.

> 
> The topological sorting flag solves the really big pain point well
> enough.

The topological sorting thing kind of fell into my lap as I was working on
the bulk code, but now that I've written it, it definitely solves a
problem that we previously had, even if I had not properly identified
the problem (randomized list ordering is painful to work with).  Either
redefine has to accept randomized ordering (which it currently doesn't),
or listing should output in a sane ordering (which topological adds) - and
once we have topological output, subsequent input gets a LOT easier,
even if split across multiple API calls.
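
For example, an untested Python sketch of that one-at-a-time flow, reusing
Dan's srcdom/dstdom handles and assuming the topological list flag from the
other series is available:

  import libvirt

  cur = srcdom.snapshotCurrent().getName() if srcdom.hasCurrentSnapshot() else None
  for snap in srcdom.listAllSnapshots(
          libvirt.VIR_DOMAIN_SNAPSHOT_LIST_TOPOLOGICAL):
      flags = libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE
      if snap.getName() == cur:
          flags |= libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_CURRENT
      # parents are guaranteed to be defined before their children
      dstdom.snapshotCreateXML(snap.getXMLDesc(), flags)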

> 
>> I'm also toying with the idea of followup patches that store all
>> snapshots alongside the persistent <domain> XML in a single file,
>> rather than in one snapshot per <domainsnapshot> file (we'd still
>> support reading from the old method at libvirtd startup, but would
>> not need to output in that manner any more).
> 
> What would be the benefit to having it all in one file ? During
> initial libvirtd startup we could read the info from one files,
> but we still need code to read the info from many files so we're
> not saving code in that respect. Most of the snapshot APIs that
> save XML to disk only care about a single <domainsnapshot>
> document so wouldn't benefit from having all snapshots in one
> place.

One possible benefit: in the current
virDomainSnapshotCreateXML(REDEFINE) code, we potentially end up
changing three separate snapshots in one call: if we are redefining a
replacement snapshot (maybe supplying a <domain> subelement that was
previously missing), AND marking the redefinition as the new current
element, it requires rewriting the XML file of the previous current
snapshot (to drop an internal-only <active>1</active> marker), deleting
the XML file of the previous state of the snapshot being modified, and
writing the new XML file of the redefined snapshot.  If any one of those
steps dies, it's very hard to roll back to the earlier state.  But if we
store all snapshots in a single file using the <snapshots
current='name'> (rather than a per-snapshot <active>1</active> marker),
we only have to modify a single file when making any changes to the
snapshot hierarchy.  If we fail at any step along the way, we have
not partially corrupted the snapshot hierarchy in a manner that is hard
to roll back.

Another (weak) benefit - right now, qemu snapshots cannot be named
"../blah", because that would create a filename that escapes the proper
subdirectory.  With a single <snapshots>, you can allow more flexible
snapshot names, as those names no longer influence host file names.

> 
> Re-writing the entire <domain> XML means that a host failure during
> a snapshot save can put the entire domain definition at risk of loss.
> Though we do try to mitigate that by writing to a temp file and doing
> an atomic rename, that doesn't eliminate the risk entirely.

Okay, so it's probably better to store the <snapshots> XML in a file
separate from the <domain> itself (failure to write the snapshots file
may lose the snapshots, but does not lose the domain), but still, I
can't help but think that two files is potentially nicer than a
subdirectory full of one file per snapshot.

Unless Nir speaks up with strong reasons in favor of new bulk APIs
(rather than living with the new topological sort flags), I'm okay
dropping this series.  The patches that are already in (the head of the
v3 posting of this series, for generating/parsing <snapshots>) may still
prove to be useful if you agree with me that a single file for ALL
snapshots is more robust than one file per snapshot, particularly when
we have existing APIs that already have a hard time rolling back to a
sane state on partial failure when manipulating more than one file. But
as I have not yet posted any series showing how that would work, that
doesn't change that these v4 patches are probably not worth pursuing
further.

And, since it is more important to get new APIs in by 5.2 (new APIs are
difficult to backport) than it is to do internal-only changes (a
subdirectory of multiple files vs. one <snapshots> file is not
user-visible, and can be backported without as much hassle), I'll focus
more on the checkpoint APIs rather than any refactoring of internals. So
that may mean that 5.2 is released with my <snapshots> format/parse code
with no consumers of that code, but I don't think it hurts enough to
revert them (as we may end up with a consumer for that code, even if not
in 5.2).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Nir Soffer 5 years, 1 month ago
On Tue, Mar 12, 2019 at 9:49 PM Eric Blake <eblake@redhat.com> wrote:

> On 3/12/19 1:10 PM, Daniel P. Berrangé wrote:
> > On Mon, Mar 11, 2019 at 09:38:31PM -0500, Eric Blake wrote:
> >> While looking at my work on incremental backups, Nir raised a good
> >> point: if we want to recreate a set of known checkpoints on one
> >> machine that will be taking over a domain from another machine,
> >> my initial proposal required making multiple API calls to list the
> >> XML for each checkpoint on the source, and then more API calls to
> >> redefine each checkpoint on the destination; it also had the drawback
> >> that the list has to be presented in topological order (the API won't
> >> let you define a child checkpoint if the parent is not defined first).
> >> He asked if I could instead add bulk APIs, for getting the XML for
> >> all checkpoints at once, and then for redefining all checkpoints at
> >> once.
> >
> > To me the key blocking problem here is the topological sorting one
> > as it makes it incredibly painful to run individual APIs. Your other
> > series on list addresses this nicely by adding the topological flag.
> >
> > Using that flag, in psuedo-python like code the process of exporting
> > and importing snapshots from one host to another should look approx
> > like this
> >
> >   snapshots =
> srcdom.ListAllSnapshots(libvirt.VIR_DOMAIN_SNAPSHOT_LIST_TOPOLOGICAL)
> >   for snapshot in snapshts:
> >       xml = snapshot.XMLDesc()
> >       dstdom.SnapshotCreateXML(xml)
> >
> > I don't really see a reason why this is so hard that it justifies
> > adding these bulk snapshot list/define  APIs.
>
> Nir, do you want to chime in here?
>

We don't have a need to list or define snapshots since we manage snapshots
on the oVirt side.
We want an API to list and redefine checkpoints.

Our use case is:

1. Start VM on some host
2. When oVirt starts a backup, it adds a checkpoint in the oVirt database
3. When libvirt starts a backup, libvirt adds the checkpoint oVirt created
4. Steps 2 and 3 are repeated while the VM is running...
5. Stop VM - at this point we undefine the VM, so libvirt loses all info
   about checkpoints
6. Start VM on some host
7. oVirt uses the bulk checkpoint API to redefine all checkpoints in the
   oVirt database
8. If there is an unknown bitmap on a disk, or a bitmap is missing, it means
   that someone modified the disk behind oVirt's back, and oVirt will delete
   the affected checkpoints and use the libvirt checkpoint API to delete the
   checkpoints and bitmaps. This forces a full backup for the next backup of
   the affected disks.

We would like libvirt to fail to redefine the checkpoints if the info in
the checkpoints does not match the bitmaps on the disks mentioned in the
checkpoints. To do this, libvirt must have the complete checkpoint state.
It cannot have that if we define checkpoints one by one.

If libvirt cannot fail, we need a way to list all checkpoints and bitmaps,
so we can sync the oVirt database with reality.

If we don't validate the bitmaps against the expected bitmaps, the backup
may fail when starting (good), or succeed while including wrong data in the
backup.

It also does not make sense to redefine checkpoint state in an incremental
way; this is like building a VM XML with many devices using one call per
device.

The same checkpoint management will have to be duplicated in any libvirt
client that manages more than one host (e.g. KubeVirt, OpenStack), since
libvirt does not handle more than one host.

For more info see:
- https://ovirt.org/develop/release-management/features/storage/incremental-backup.html
- https://github.com/oVirt/vdsm/blob/master/lib/vdsm/api/vdsm-api.yml#L11525
- https://github.com/oVirt/vdsm/blob/0874b89a19af2d7726ed2f0b9d674cf0bb48a67e/lib/vdsm/api/vdsm-api.yml#L8129

>
> > At the same time there is a very real downside to providing these bulk
> > snapshot APIs in terms of scalability. The <domain> XML for a guest
> > can get very large. The <domainsnapshot> XML contains a copy of the
> > <domain> XML from the point in time of the snapsht. Therefore a
> > <snapshots> XML document is going to contain *many* <domain> XML
> > documents concatenated.
> >
> > For a domain with many devices and many snapshots, the bulk list
> > APIs have non-negligible risk of hitting the RPC protocol limit
> > on string size.
>
> I did not consider that, but you are definitely correct about it being a
> real problem (I didn't trip over it in my small domain testing, but
> combined XML is definitely a repetition of a <domain> per
> <domainsnapshot>).
>
> One idea I had on the checkpoint series was adding a NO_DOMAIN flag
> which outputs much more compact XML by omitting the <domain> element.
> With checkpoints, the <domain> at the time of the checkpoint is less
> important (useful to know if we need it, but we can't revert to that
> state); but with snapshots, the <domain> at the time of the snapshot is
> very important (to the point that you have to request a flag for an
> unsafe revert in virDomainSnapshotRevert() if the <domain> information
> was incomplete).  So using a NO_DOMAIN flag to minimize XML so that the
> bulk output doesn't exceed RPC limits is a possibility, but not a very
> nice one (as you'd STILL have to do a one-API-per-snapshot call to
> reinstate the <domain> of any snapshot that you plan to revert to).
>
> >
> > We've seen this before with the bulk domain stats APIs we added.
> > We none the less still added the bulk domain stats APIs because
> > it was a compelling performance benefit for something that is run
> > very frequently. For apps to be scalable though, they have to
> > limit how many guests they request bulk stats for at any time.
> > This in fact makes the code using the bulk stats APIs more complex
> > than if it just called the individual APIs. The performance benefit
> > is still a win though in this case.
> >
> > In the bulk snapshot import/export case though, I'm not seeing a
> > compelling performance reason for their usage. Making the APIs
> > scalable would require to limit how many snapshots are returned
> > in any single call. There would then need to be repeated calls
> > to fetch the rest. This makes it more complicated than just
> > fetching/defining each individual snapshot
>
> And the frequency of migrating snapshots is less than the frequency of
> querying bulk stats. It makes the most sense to speed up APIs called
> frequently.
>
> >
> > IOW, overall I'm not seeing a big enough justification to add these
> > new APIs, that would outweigh the downsides they bring in terms of
> > scalability.
>
> You raise some good points.
>
> >
> > The topological sorting flag solves the really big pain point well
> > enough.
>
> The topological sorting thing kind of fell on my lap as I was working on
> the bulk code, but now that I've written it, it definitely solves a
> problem that we previously had, even if I had not properly identified
> the problem (randomized list ordering is painful to work with).  Either
> redefine has to accept randomized ordering (which it currently doesn't),
> or listing should output in sane ordering (which topological adds) - and
> once we have topological output, subsequent input gets a LOT easier,
> even if split across multiple API calls.
>
> >
> >> I'm also toying with the idea of followup patches that store all
> >> snapshots alongside the persistent <domain> XML in a single file,
> >> rather than in one snapshot per <domainsnapshot> file (we'd still
> >> support reading from the old method at libvirtd startup, but would
> >> not need to output in that manner any more).
> >
> > What would be the benefit to having it all in one file ? During
> > initial libvirtd startup we could read the info from one files,
> > but we still need code to read the info from many files so we're
> > not saving code in that respect. Most of the snapshot APIs that
> > save XML to disk only care about a single <domainsnapshot>
> > document so wouldn't benefit from having all snapshots in one
> > place.
>
> One possible benefit: in the current
> virDomainSnapshotCreateXML(REDEFINE) code, we potentially end up
> changing three separate snapshots in one call: if we are redefining a
> replacement snapshot (maybe supplying a <domain> subelement that was
> previously missing), AND marking the redefinition as the new current
> element, it requires rewriting the XML file of the previous current
> snapshot (to drop an internal-only <active>1</active> marker), deleting
> the XML file of the previous state of the snapshot being modified, and
> writing the new XML file of the redefined snapshot.  If any one of those
> steps die, it's very hard to roll back to the earlier state.  But if we
> store all snapshots in a single file using the <snapshots
> current='name'> (rather than a per-snapshot <active>1</active> marker),
> we only have to modify a single file when making any changes to the
> snapshot hierarchy.  If we fail at any earlier step of the way, we have
> not partially corrupted the snapshot hierarchy in a manner that is hard
> to roll back.
>
> Another (weak) benefit - right now, qemu snapshots cannot be named
> "../blah", because that would create a filename that escapes the proper
> subdirectory.  With a single <snapshots>, you can allow more flexible
> snapshot names, as those names no longer influence host file names.
>
> >
> > Re-writing the entire <domain> XML means that a host failure during
> > a snapshot save can put the entire domain definition at risk of loss.
> > Though we do try to mitigate that by writing to a temp file and doing
> > an atomic rename, that doesn't eliminate the risk entirely.
>
> Okay, so it's probably better to store the <snapshots> XML in a file
> separate from the <domain> itself (failure to write the snapshots file
> may lose the snapshots, but does not lose the domain), but still, I
> can't help but think that two files is potentially nicer than a
> subdirectory full of one file per snapshot.
>
> Unless Nir speaks up with strong reasons in favor of new bulk APIs
> (rather than living with the new topological sort flags), I'm okay
> dropping this series.


I'm ok with dropping bulk snapshots, but I think libvirt should provide
bulk APIs for checkpoints.

Nir


> The patches that are already in (the head of the
> v3 posting of this series, for generating/parsing <snapshots>) may still
> prove to be useful if you agree with me that a single file for ALL
> snapshots is more robust than one file per snapshot, particularly when
> we have exiting APIs that already have a hard time rolling back to a
> sane state on partial failure when manipulating more than one file. But
> as I have not yet posted any series showing how that would work, that
> doesn't change that these v4 patches are probably not worth pursuing
> further.
>
> And, since it is more important to get new API in by 5.2 (since rebasing
> APIs is difficult to backport) than it is to do internal-only changes (a
> subdirectory of multiple files vs. one <snapshots> file is not
> user-visible, and can be backported without as much hassle), I'll focus
> more on the checkpoint APIs rather than any refactoring of internals. So
> that may mean that 5.2 is released with my <snapshots> format/parse code
> with no consumers of that code, but I don't think it hurts enough to
> revert them (as we may end up with a consumer for that code, even if not
> in 5.2).
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3226
> Virtualization:  qemu.org | libvirt.org
>
>
Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Eric Blake 5 years, 1 month ago
On 3/12/19 3:29 PM, Nir Soffer wrote:

>>>   snapshots =
>> srcdom.ListAllSnapshots(libvirt.VIR_DOMAIN_SNAPSHOT_LIST_TOPOLOGICAL)
>>>   for snapshot in snapshts:
>>>       xml = snapshot.XMLDesc()
>>>       dstdom.SnapshotCreateXML(xml)
>>>
>>> I don't really see a reason why this is so hard that it justifies
>>> adding these bulk snapshot list/define  APIs.
>>
>> Nir, do you want to chime in here?
>>
> 
> We don't have a need to list or define snapshots since we managed snapshots
> on oVirt side.
> We want an API to list and redefine checkpoints.

But the proposed <domaincheckpoint> also has a <domain> subelement, so
it has the same problem (the XML for a bulk operation can become
prohibitively large).

> 
> Our use case is:
> 
> 1. Start VM on some host
> 2. When oVirt starts a backup, it adds a checkpoint in oVirt database

And does this database preserve topological ordering? If not, why not?

> 3. When libvirt starts a backup, libvirt add the checkpoint oVirt created
> 4. Steps 2 and 3 are repeated while the VM is running...
> 5. Stop VM - at this point we undefine the VM, so libivrt lost all info
> about checkopints
> 6. Start VM on some host
> 7. oVirt use the bulk checkpoint API to redefine all checkpoints in oVirt
> database
> 8. If there is unknown bitmap on a disk, or a bitmap is missing, it means
> that someone
>    modified the disk behind oVirt back, and oVirt will delete the affected
> checkpoints
>    and use libvirt checkpoints API to delete the checkpoints and bitmaps.
> This forces
>   full backup for the next backup of the affected disks.
> 
> We would like that libvirt will fail to redefine the checkpoints if the
> info in the checkpoints
> does not match the the bitmap on the disks mentioned in the checkpoints.

That's a different feature request - you're asking for
virDomainCheckpointCreateXML(REDEFINE) to have some additional flag
where the metadata re-creation fails if the underlying bitmap is
corrupt/missing. Which may still be a good flag to add, but is something
you can do with one-at-a-time checkpoint redefinitions, and doesn't, in
itself, require bulk redefinition.

> To
> do this,
> libvirt must have the complete checkpoint state. It cannot if we define
> checkpoints one
> by one.

As long as you redefine checkpoints in topological order, I don't see
how defining one at a time or bulk changes whether libvirt can validate
that the checkpoints are correct.  And even then, redefining in
topological order is only necessary so that libvirt can ensure that any
<parent/> checkpoint exists before attempting to define the metadata for
a child checkpoint.
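
As a hypothetical sketch (untested; these are the names from the proposed
checkpoint series and could still change), the per-checkpoint redefine loop
is just:

  import libvirt

  # checkpoint_xmls: <domaincheckpoint> documents saved earlier, already in
  # topological (parent-first) order
  for xml in checkpoint_xmls:
      dom.checkpointCreateXML(xml,
                              libvirt.VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE)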

> 
> If libivirt cannot fail, we need to a way to list all checkpoints and
> bitmaps, so we can
> sync oVirt database with reality.
> 
> If we don't validate bitmaps with the expected bitmaps, the backup may fail
> when starting
> a backup (good), or succeed, including wrong data in the backup.
> 
> It also does not make sense to redefine checkpoint state in incremental
> way, this like
> building a VM XML with many devices with one call per device.

The argument here is that it will be the same amount of XML whether it
is one call per checkpoint or one call for all checkpoints (note - there
are multiple devices in one checkpoint, so it is NOT one call per
'device * checkpoint')

> 
> The same checkpoint management will have to be duplicated in any libvirt
> client that
> manage more then one host (e.g. KubeVirt, OpenStack), since libvirt does
> not handle
> more than one host.

But again, if you are moving checkpoint metadata from one host to
another, doing so one-at-a-time in topological order shouldn't be much
harder than doing so in one single call, but without the drawbacks of
exceeding XML size over RPC call limits.

> 
> For more info see
> -
> https://ovirt.org/develop/release-management/features/storage/incremental-backup.html
> - https://github.com/oVirt/vdsm/blob/master/lib/vdsm/api/vdsm-api.yml#L11525
> -
> https://github.com/oVirt/vdsm/blob/0874b89a19af2d7726ed2f0b9d674cf0bb48a67e/lib/vdsm/api/vdsm-api.yml#L8129
> 

>> Unless Nir speaks up with strong reasons in favor of new bulk APIs
>> (rather than living with the new topological sort flags), I'm okay
>> dropping this series.
> 
> 
> I'm ok with dropping bulk snapshots, but I think libvirt should provide
> bulk APIs for checkpoints.
> 
> Nir
>

Given that the same amount of XML is involved across RPC, I'm
hard-pressed to include bulk operations on one object without providing
it on both.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Nir Soffer 5 years, 1 month ago
On Tue, Mar 12, 2019 at 11:02 PM Eric Blake <eblake@redhat.com> wrote:

> On 3/12/19 3:29 PM, Nir Soffer wrote:
>
> >>>   snapshots =
> >> srcdom.ListAllSnapshots(libvirt.VIR_DOMAIN_SNAPSHOT_LIST_TOPOLOGICAL)
> >>>   for snapshot in snapshts:
> >>>       xml = snapshot.XMLDesc()
> >>>       dstdom.SnapshotCreateXML(xml)
> >>>
> >>> I don't really see a reason why this is so hard that it justifies
> >>> adding these bulk snapshot list/define  APIs.
> >>
> >> Nir, do you want to chime in here?
> >>
> >
> > We don't have a need to list or define snapshots since we managed
> snapshots
> > on oVirt side.
> > We want an API to list and redefine checkpoints.
>
> But the proposed <domaincheckpoint> also has a <domain> subelement, so
> it has the same problem (the XML for a bulk operation can become
> prohibitively large).
>

Why do we need <domain> in a checkpoint?

You can see here what we store in a checkpoint (based on your patches v2 or
v3):
https://ovirt.org/develop/release-management/features/storage/incremental-backup.html#engine-database-changes

Which is:

vm_checkpoints table

- checkpoint_id: UUID
- parent: UUID
- vm_id: UUID
- creation_date: TIMESTAMP

vm_checkpoint_disk_map table
- disk_id: UUID
- checkpoint_id: UUID

>
> > Our use case is:
> >
> > 1. Start VM on some host
> > 2. When oVirt starts a backup, it adds a checkpoint in oVirt database
>
> And does this database preserve topological ordering? If not, why not?
>

Using the parent UUID we can create the correct chain in the right order,
and this is what we send to vdsm. Based on this, vdsm will generate the XML
for libvirt.

There is no domain info related to these checkpoints, and I hope we are not
adding such requirements at this stage.


> > 3. When libvirt starts a backup, libvirt add the checkpoint oVirt created
> > 4. Steps 2 and 3 are repeated while the VM is running...
> > 5. Stop VM - at this point we undefine the VM, so libivrt lost all info
> > about checkopints
> > 6. Start VM on some host
> > 7. oVirt use the bulk checkpoint API to redefine all checkpoints in oVirt
> > database
> > 8. If there is unknown bitmap on a disk, or a bitmap is missing, it means
> > that someone
> >    modified the disk behind oVirt back, and oVirt will delete the
> affected
> > checkpoints
> >    and use libvirt checkpoints API to delete the checkpoints and bitmaps.
> > This forces
> >   full backup for the next backup of the affected disks.
> >
> > We would like that libvirt will fail to redefine the checkpoints if the
> > info in the checkpoints
> > does not match the the bitmap on the disks mentioned in the checkpoints.
>
> That's a different feature request - you're asking for
> virDomainCheckpointCreateXML(REDEFINE) to have some additional flag
> where the metadata re-creation fails if the underlying bitmap is
> corrupt/missing. Which may still be a good flag to add, but is something
> you can do with one-at-a-time checkpoint redefinitions, and doesn't, in
> itself, require bulk redefinition.
>

Checking for an unknown bitmap on disk is not possible unless you have the
complete list of checkpoints. Unless you add something like:

    virDomainCheckpointValidate()

which will fail if the checkpoints known to libvirt do not match the bitmaps
in the underlying qcow2 images.

But even if we add this, the user will have to do:

    try:
        for checkpoint in ordered_checkpoints:
            generate xml...
            redefine checkpoint...
    except ...:
        remove all checkpoints...

    validate checkpoints

Unless the XML for a single checkpoint is big, I don't see why all libvirt
users need to redefine checkpoints one by one and handle all the possible
errors (e.g. some checkpoints redefined, some not).

Note that vdsm may be killed in the middle of the redefine loop, and in this
case we leave libvirt with partial info about checkpoints, and we need to
redefine the checkpoints again, handling this partial state.

"We" means all libvirt clients that need to manage more than one host.

> To
> > do this,
> > libvirt must have the complete checkpoint state. It cannot if we define
> > checkpoints one
> > by one.
>
> As long as you redefine checkpoints in topological order, I don't see
> how defining one at a time or bulk changes whether libvirt can validate
> that the checkpoints are correct.  And even then, redefining in
> topological order is only necessary so that libvirt can ensure that any
> <parent/> checkpoint exists before attempting to define the metadata for
> a child checkpoint.
>
> >
> > If libivirt cannot fail, we need to a way to list all checkpoints and
> > bitmaps, so we can
> > sync oVirt database with reality.
> >
> > If we don't validate bitmaps with the expected bitmaps, the backup may
> fail
> > when starting
> > a backup (good), or succeed, including wrong data in the backup.
> >
> > It also does not make sense to redefine checkpoint state in incremental
> > way, this like
> > building a VM XML with many devices with one call per device.
>
> The argument here is that it will be the same amount of XML whether it
> is one call per checkpoint or one call for all checkpoints (note - there
> are multiple devices in one checkpoint, so it is NOT one call per
> 'device * checkpoint')
>
> >
> > The same checkpoint management will have to be duplicated in any libvirt
> > client that
> > manage more then one host (e.g. KubeVirt, OpenStack), since libvirt does
> > not handle
> > more than one host.
>
> But again, if you are moving checkpoint metadata from one host to
> another, doing so one-at-a-time in topological order shouldn't be much
> harder than doing so in one single call, but without the drawbacks of
> exceeding XML size over RPC call limits.
>
> >
> > For more info see
> > -
> >
> https://ovirt.org/develop/release-management/features/storage/incremental-backup.html
> > -
> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/api/vdsm-api.yml#L11525
> > -
> >
> https://github.com/oVirt/vdsm/blob/0874b89a19af2d7726ed2f0b9d674cf0bb48a67e/lib/vdsm/api/vdsm-api.yml#L8129
> >
>
> >> Unless Nir speaks up with strong reasons in favor of new bulk APIs
> >> (rather than living with the new topological sort flags), I'm okay
> >> dropping this series.
> >
> >
> > I'm ok with dropping bulk snapshots, but I think libvirt should provide
> > bulk APIs for checkpoints.
> >
> > Nir
> >
>
> Given that the same amount of XML is involved across RPC, I'm
> hard-pressed to include bulk operations on one object without providing
> it on both.
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.           +1-919-301-3226
> Virtualization:  qemu.org | libvirt.org
>
>
Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Eric Blake 5 years, 1 month ago
On 3/12/19 4:35 PM, Nir Soffer wrote:

>>> We don't have a need to list or define snapshots since we managed
>> snapshots
>>> on oVirt side.
>>> We want an API to list and redefine checkpoints.
>>
>> But the proposed <domaincheckpoint> also has a <domain> subelement, so
>> it has the same problem (the XML for a bulk operation can become
>> prohibitively large).
>>
> 
> Why do we need <domain> in a checkpoint?

The initial design for snapshots did not have <domain>, and it bit us
hard; you cannot safely revert to a snapshot if you don't know the state
of the domain at that time. The initial design for checkpoints has thus
mandated that <domain> be present (my v5 series fails to redefine a
snapshot if you omit <domain>, even though it has a NO_DOMAIN flag
during dumpxml for reduced output).  If we are convinced that defining a
snapshot without <domain> is safe enough, then I can relax the
checkpoint code to allow redefined metadata without <domain> the way
snapshots already have to do it, even though I was hoping that
checkpoints could start life with fewer back-compats than snapshots have
had to carry along.  But I'd rather start strict and relax later
(require <domain> and then remove it when proven safe), and not start
loose (leave <domain> optional, and then wish we had made it mandatory).

> 
> You can see here what we store in a checkpoint (based on your patches v2 or
> v3):
> https://ovirt.org/develop/release-management/features/storage/incremental-backup.html#engine-database-changes
> 
> Which is:
> 
> vm_checkpoints table
> 
> - checkpoint_id: UUID
> - parent: UUID
> - vm_id: UUID
> - creation_date: TIMESTAMP
> 
> vm_checkpoint_disk_map table
> - disk_id: UUID
> - checkpoint_id: UUID
> 

And no state of the <domain> at the time the checkpoint was created?

>>
>>> Our use case is:
>>>
>>> 1. Start VM on some host
>>> 2. When oVirt starts a backup, it adds a checkpoint in oVirt database
>>
>> And does this database preserve topological ordering? If not, why not?
>>
> 
> Using the parent UUID we can create the correct chain in the right order,
> and this is what we send to vdsm. Based on this, vdsm will generate the XML
> for libvirt.
> 
> There is no domain info related to these checkpoints, and I hope we are not
> adding such requirements at this stage.

Rather, that requirement has been there in all of my drafts, and it
sounds like you are trying to get me to relax that requirement to not
need a <domain> during checkpoint metadata redefine.


>> That's a different feature request - you're asking for
>> virDomainCheckpointCreateXML(REDEFINE) to have some additional flag
>> where the metadata re-creation fails if the underlying bitmap is
>> corrupt/missing. Which may still be a good flag to add, but is something
>> you can do with one-at-a-time checkpoint redefinitions, and doesn't, in
>> itself, require bulk redefinition.
>>
> 
> Checking for unknown bitmap on disk is not possible unless you have the
> complete
> list of checkpoints. Unless you add something like:
> 
>     virDomainCheckpointValidate()
> 
> Which will fail if the checkpoints known to libvirt do not match the
> bitmaps in the
> underlying qcow2 images.

Yes, an API like that may be better than trying to validate on a
per-redefine basis whether libvirt metadata matches qcow2 metadata.

> 
> But even if we add this, the user will have to do:
> 
>     try:
>         for checkpoint in orderdered_checkpoints:
>            generate xml...
>            redefine checkpoint...
>     except ...:
>         remove all checkpoints...
> 
>     validate checkpoints

Yes, but that's not hard. And it's no different whether the 'try/except'
code is a single libvirt API (with bulk redefine) or multiple libvirt
APIs (no bulk redefine, but easy enough to one-at-a-time redefine in
topological order).  Why you have to generate the XML instead of reusing
the XML that libvirt produces is something I'm not sure of (that is,
what's wrong with using libvirt's dumpxml output as your input, rather
than trying to reconstruct it on the fly?).
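
In other words, a hypothetical sketch (again using the proposed checkpoint
API names, which could still change): save libvirt's own XML when the
checkpoint is created, and replay it verbatim when redefining later:

  import libvirt

  cp = dom.checkpointCreateXML(checkpoint_xml)
  saved_xml = cp.getXMLDesc()   # store this alongside the oVirt record
  # ... later, on the new host ...
  dom.checkpointCreateXML(saved_xml,
                          libvirt.VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE)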

> 
> Unless the xml for single checkpoint is big, I don't see why all libvirt
> users
> need to redefine checkopoints one by one and handle all the possible errors
> (e.g. some checkpoints redefined, some not).

As currently proposed, <domaincheckpoint> includes a mandatory <domain>
sub-element on redefinition, which means that the XML _is_ big for a
series of checkpoints.  It also means that you're better off storing the
XML produced by libvirt than trying to regenerate it by hand, when it
comes to metadata redefinition.

> 
> Note that vdsm may be killed in the middle of the redefine loop, and in
> this case
> we leave livbirt with partial info about checkpoints, and we need to
> redefine
> the checkpoints again handling this partial sate.

But that's relatively easy - if you don't know whether libvirt might
have partial data, then wipe the data and start the redefine loop from
scratch.
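
Roughly, as an untested sketch with the proposed checkpoint API names
(ordered_checkpoint_xmls stands for whatever parent-first list the client
keeps in its own database):

  import libvirt

  # wipe whatever partial metadata libvirt holds, then redefine everything
  for cp in dom.listAllCheckpoints():
      cp.delete(libvirt.VIR_DOMAIN_CHECKPOINT_DELETE_METADATA_ONLY)
  for xml in ordered_checkpoint_xmls:
      dom.checkpointCreateXML(xml,
                              libvirt.VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE)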

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Nir Soffer 5 years, 1 month ago
On Tue, Mar 12, 2019 at 11:52 PM Eric Blake <eblake@redhat.com> wrote:

> On 3/12/19 4:35 PM, Nir Soffer wrote:
>
> >>> We don't have a need to list or define snapshots since we managed
> >> snapshots
> >>> on oVirt side.
> >>> We want an API to list and redefine checkpoints.
> >>
> >> But the proposed <domaincheckpoint> also has a <domain> subelement, so
> >> it has the same problem (the XML for a bulk operation can become
> >> prohibitively large).
> >>
> >
> > Why do we need <domain> in a checkpoint?
>
> The initial design for snapshots did not have <domain>, and it bit us
> hard; you cannot safely revert to a snapshot if you don't know the state
> of the domain at that time. The initial design for checkpoints has thus
> mandated that <domain> be present (my v5 series fails to redefine a
> snapshot if you omit <domain>, even though it has a NO_DOMAIN flag
> during dumpxml for reduced output).  If we are convinced that defining a
> snapshot without <domain> is safe enough, then I can relax the
> checkpoint code to allow redefined metadata without <domain> the way
> snapshots already have to do it, even though I was hoping that
> checkpoints could start life with fewer back-compats that snapshot has
> had to carry along.  But I'd rather start strict and relax later
> (require <domain> and then remove it when proven safe), and not start
> loose (leave <domain> optional, and then wish we had made it mandatory).
>

You keep mentioning snapshots, but we are not going to redefine snapshots,
only checkpoints.

We need to do this only because qcow2 does not store a chain of bitmaps, but
unordered bitmaps without any metadata, so we need to remind libvirt which
bitmaps we have on every disk, and what the order of the bitmaps is. We
would be happy if we could just specify the bitmaps and disks in the backup
API, and avoid rebuilding the checkpoint history, just like we avoid
rebuilding the snapshot history.

> You can see here what we store in a checkpoint (based on your patches v2
> or
> > v3):
> >
> https://ovirt.org/develop/release-management/features/storage/incremental-backup.html#engine-database-changes
> >
> > Which is:
> >
> > vm_checkpoints table
> >
> > - checkpoint_id: UUID
> > - parent: UUID
> > - vm_id: UUID
> > - creation_date: TIMESTAMP
> >
> > vm_checkpoint_disk_map table
> > - disk_id: UUID
> > - checkpoint_id: UUID
> >
>
> And no state of the <domain> at the time the checkpoint was created?
>

No, but the configuration of the VM is stored in oVirt, so the backup
application can get this configuration from oVirt at the time of the backup.

>>
> >>> Our use case is:
> >>>
> >>> 1. Start VM on some host
> >>> 2. When oVirt starts a backup, it adds a checkpoint in oVirt database
> >>
> >> And does this database preserve topological ordering? If not, why not?
> >>
> >
> > Using the parent UUID we can create the correct chain in the right order,
> > and this is what we send to vdsm. Based on this, vdsm will generate the
> XML
> > for libvirt.
> >
> > There is no domain info related to these checkpoints, and I hope we are
> not
> > adding such requirements at this stage.
>
> Rather, that requirement has been there in all of my drafts, and it
> sounds like you are trying to get me to relax that requirement to not
> need a <domain> during checkpoint metadata redefine.
>

We never understood this requirement from the HTML documentation built
from your patches.

Looking again in v3 - I see:

    domain
    The inactive domain configuration at the time the checkpoint was
    created. Readonly.

And we have an example:

<domaincheckpoint>
  <description>Completion of updates after OS install</description>
  <disks>
    <disk name='vda' checkpoint='bitmap'/>
    <disk name='vdb' checkpoint='no'/>
  </disks>
</domaincheckpoint>

Which does not have a domain, since it is read-only.

There is an example of the xml we get from libvirt:

<domaincheckpoint>
  <name>1525889631</name>
  <description>Completion of updates after OS install</description>
  <creationTime>1525889631</creationTime>
  <parent>
    <name>1525111885</name>
  </parent>
  <disks>
    <disk name='vda' checkpoint='bitmap' bitmap='1525889631'/>
    <disk name='vdb' checkpoint='no'/>
  </disks>
  <domain type='qemu'>
    <name>fedora</name>
    <uuid>93a5c045-6457-2c09-e56c-927cdf34e178</uuid>
    <memory>1048576</memory>
    ...
    <devices>
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2'/>
        <source file='/path/to/file1'/>
        <target dev='vda' bus='virtio'/>
      </disk>
      <disk type='file' device='disk' snapshot='external'>
        <driver name='qemu' type='raw'/>
        <source file='/path/to/file2'/>
        <target dev='vdb' bus='virtio'/>
      </disk>
      ...
    </devices>
  </domain>
</domaincheckpoint>

But as the domain is read-only, we assumed that we don't need to pass it in
the domaincheckpoint XML. We also don't need it since we keep the disk UUIDs
included in every checkpoint, and from this you can find the information
needed to generate the tiny checkpoint XML above.

Can you explain why the historic domain XML is needed for performing a
backup?

The caller of the API is responsible for giving you the list of bitmaps that
should be included for every disk. Libvirt needs to create a new active
bitmap and start the NBD server exposing the bitmaps specified by the
caller. How is having the XML for every checkpoint/bitmap needed to perform
these steps?

Note also that libvirt does not have snapshot history on an oVirt host,
since we manage snapshots externally. When libvirt starts a VM, the only
thing it knows about snapshots is the chain of disks. Even if we restore the
domain XML for every checkpoint, libvirt cannot validate it against the
non-existing snapshots.

While we have used persistent VMs since 4.2, our VMs are actually
transient. We started to use persistent VMs because they are better tested,
and we can use the VM <metadata> to persist stuff, instead of maintaining
our own temporary metadata persistence solution.

>> That's a different feature request - you're asking for
> >> virDomainCheckpointCreateXML(REDEFINE) to have some additional flag
> >> where the metadata re-creation fails if the underlying bitmap is
> >> corrupt/missing. Which may still be a good flag to add, but is something
> >> you can do with one-at-a-time checkpoint redefinitions, and doesn't, in
> >> itself, require bulk redefinition.
> >>
> >
> > Checking for unknown bitmap on disk is not possible unless you have the
> > complete
> > list of checkpoints. Unless you add something like:
> >
> >     virDomainCheckpointValidate()
> >
> > Which will fail if the checkpoints known to libvirt do not match the
> > bitmaps in the
> > underlying qcow2 images.
>
> Yes, an API like that may be better than trying to validate on a
> per-redefine basis whether libvirt metadata matches qcow2 metadata.
>
> >
> > But even if we add this, the user will have to do:
> >
> >     try:
> >         for checkpoint in orderdered_checkpoints:
> >            generate xml...
> >            redefine checkpoint...
> >     except ...:
> >         remove all checkpoints...
> >
> >     validate checkpoints
>
> Yes, but that's not hard. And it's no different whether the 'try/except'
> code is a single libvirt API (with bulk redefine) or multiple libvirt
> APIs (no bulk redefine, but easy enough to one-at-a-time redefine in
> topological order).  Why you have to generate the XML instead of reusing
> the XML that libvirt produces is something I'm not sure of (that is,
> what's wrong with using libvirt's dumpxml output as your input, rather
> than trying to reconstruct it on the fly?).
>

The XML libvirt produces may not be stable. For example, the disk with UUID
"22e5a7d6-5409-482d-bda4-f5939b27aad4" may be called "sda" now, but when the
VM was running on another host 2 weeks ago maybe it was called "sdb", since
the user added or removed some devices.

When we create the XML using the disk UUID, we will always refer to the
current name of the disk, regardless of the history of the VM.

> > Unless the xml for single checkpoint is big, I don't see why all libvirt
> > users
> > need to redefine checkopoints one by one and handle all the possible
> errors
> > (e.g. some checkpoints redefined, some not).
>
> As currently proposed, <domaincheckpoint> includes a mandatory <domain>
> sub-element on redefinition, which means that the XML _is_ big for a
> series of checkpoints.  It also means that you're better off storing the
> XML produced by libvirt than trying to regenerate it by hand, when it
> comes to metadata redefinition.
>
> >
> > Note that vdsm may be killed in the middle of the redefine loop, and in
> > this case
> > we leave livbirt with partial info about checkpoints, and we need to
> > redefine
> > the checkpoints again handling this partial sate.
>
> But that's relatively easy - if you don't know whether libvirt might
> have partial data, then wipe the data and start the redefine loop from
> scratch.
>

But this easy boilerplate will be duplicated in all libvirt clients,
instead of having a single API to restore the state.

Of course, if we must keep the historic domain configuration for every
checkpoint, a bulk API does not make sense, but I don't understand this
requirement.

Nir
Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Eric Blake 5 years, 1 month ago
On 3/12/19 6:29 PM, Nir Soffer wrote:
> On Tue, Mar 12, 2019 at 11:52 PM Eric Blake <eblake@redhat.com> wrote:
> 
>> On 3/12/19 4:35 PM, Nir Soffer wrote:
>>
>>>>> We don't have a need to list or define snapshots since we manage
>>>>> snapshots on the oVirt side.
>>>>> We want an API to list and redefine checkpoints.
>>>>
>>>> But the proposed <domaincheckpoint> also has a <domain> subelement, so
>>>> it has the same problem (the XML for a bulk operation can become
>>>> prohibitively large).
>>>>
>>>
>>> Why do we need <domain> in a checkpoint?
>>
>> The initial design for snapshots did not have <domain>, and it bit us
>> hard; you cannot safely revert to a snapshot if you don't know the state
>> of the domain at that time. The initial design for checkpoints has thus
>> mandated that <domain> be present (my v5 series fails to redefine a
>> snapshot if you omit <domain>, even though it has a NO_DOMAIN flag
>> during dumpxml for reduced output).  If we are convinced that defining a
>> snapshot without <domain> is safe enough, then I can relax the

s/snapshot/checkpoint/ in that line

>> checkpoint code to allow redefined metadata without <domain> the way
>> snapshots already have to do it, even though I was hoping that
>> checkpoints could start life with fewer back-compats that snapshot has
>> had to carry along.  But I'd rather start strict and relax later
>> (require <domain> and then remove it when proven safe), and not start
>> loose (leave <domain> optional, and then wish we had made it mandatory).
>>
> 
> You keep mentioning snapshots, but we are not going to redefine snapshots,
> only checkpoints.
> 
> We need to do this only because qcow2 does not store a chain of bitmaps, but

just as qcow2 does not store a chain of snapshots either.

> unordered bitmaps without any metadata, so we need to remind libvirt which
> bitmaps we have on every disk, and what the order of the bitmaps is. We
> would be happy if we could just specify the bitmaps and disks in the backup
> API, and avoid rebuilding the checkpoint history, just like we avoid
> rebuilding the snapshot history.

For persistent domains, it would be better if libvirt migrated snapshots
(and therefore checkpoints) automatically, instead of being a coward and
refusing the migration because a snapshot exists. The workaround is that
you dumpxml then undefine snapshots on the source, then migrate, then
createXML(REDEFINE) on the destination using the dumped XML.  Since
checkpoints share a lot of code with snapshots, it was easier to get the
code working by having the same rules (and if we fix it for one, we
should fix it for both, again because they share a lot of code).

For transient domains, libvirt has nowhere to store the information about
the snapshots or checkpoints, so it is relying on you to REDEFINE the SAME
state as what it had last time libvirt was running.

>>> There is no domain info related to these checkpoints, and I hope we are
>>> not adding such requirements at this stage.
>>
>> Rather, that requirement has been there in all of my drafts, and it
>> sounds like you are trying to get me to relax that requirement to not
>> need a <domain> during checkpoint metadata redefine.
>>
> 
> We never understood this requirement in the html documentation built
> from your patches.
> 
> Looking again in v3 - I see:
> 
>     domain
>     The inactive domain configuration at the time the checkpoint was
> created. Readonly.

Copied heavily from snapshots which has the same wording, so I'm going
to improve that wording in the next round of patches.  But the key point
is that:

CreateXML(, 0)

takes abbreviated (or full) XML, ignores readonly fields, _and_
populates remaining fields with information learned AT THE TIME the
object was created.  Thereafter, GetXMLDesc() outputs the full XML, and:

CreateXML(, _REDEFINE)

says to take the FULL XML from a previous dump (and NOT the abbreviated
XML at the initial creation) in order to restore ALL state as libvirt
had (if you omit fields, libvirt has to guess at what state to recreate
for those fields - for some fields, guessing might not be too bad, such
as what timestamp to use. But for other fields, guessing what to use for
<domain> by using the current state of the domain, which may have had
hotplug events in the meantime, is WRONG compared to being told the
exact <domain> state that libvirt previously had in memory and which was
included in the previous dumpxml).

In other words, _REDEFINE has never been about inventing state from
scratch, but about restoring state that libvirt previously had, as
dumped by libvirt.
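
The same two-step pattern is easiest to see with the existing snapshot
bindings (checkpoints are proposed to behave the same way; the connection
URI and domain name below are only illustrative):

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("fedora")

# CreateXML(, 0): abbreviated input; libvirt fills in the name,
# timestamp, <domain>, ... as of *now*.
snap = dom.snapshotCreateXML("<domainsnapshot/>", 0)

# GetXMLDesc(): the full XML, including the read-only fields.
full_xml = snap.getXMLDesc()

# CreateXML(, _REDEFINE): feed that full dump back verbatim so libvirt
# restores exactly the state it previously had, rather than guessing.
dom.snapshotCreateXML(full_xml,
                      libvirt.VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE)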

> 
> And we have an example:
> 
> <domaincheckpoint>
>   <description>Completion of updates after OS install</description>
>   <disks>
>     <disk name='vda' checkpoint='bitmap'/>
>     <disk name='vdb' checkpoint='no'/>
>   </disks>
> </domaincheckpoint>
> 
> Which does not have a domain, since it is read-only.

Yes, that's an example of CreateXML(, 0). But NOT an example of
CreateXML(, _REDEFINE).

> 
> There is an example of the xml we get from libvirt:
> 
> <domaincheckpoint>
>   <name>1525889631</name>
>   <description>Completion of updates after OS install</description>
>   <creationTime>1525889631</creationTime>
>   <parent>
>     <name>1525111885</name>
>   </parent>
>   <disks>
>     <disk name='vda' checkpoint='bitmap' bitmap='1525889631'/>
>     <disk name='vdb' checkpoint='no'/>
>   </disks>
>   <domain type='qemu'>
>     <name>fedora</name>
>     <uuid>93a5c045-6457-2c09-e56c-927cdf34e178</uuid>
>     <memory>1048576</memory>
>     ...
>     <devices>
>       <disk type='file' device='disk'>
>         <driver name='qemu' type='qcow2'/>
>         <source file='/path/to/file1'/>
>         <target dev='vda' bus='virtio'/>
>       </disk>
>       <disk type='file' device='disk' snapshot='external'>
>         <driver name='qemu' type='raw'/>
>         <source file='/path/to/file2'/>
>         <target dev='vdb' bus='virtio'/>
>       </disk>
>       ...
>     </devices>
>   </domain>
> </domaincheckpoint>

And _that's_ the XML that the _REDEFINE flag expects.

> 
> But as the domain is read-only, we assumed that we don't need to pass it
> in the domaincheckpoint xml. We also don't need it since we keep the disk
> uuids included in every checkpoint, and from this you can find the
> information needed to generate the tiny checkpoint xml above.

But the tiny original input XML is no longer sufficient during
_REDEFINE, because you no longer have the point in time where libvirt
can populate missing/readonly fields with sane defaults.

> 
> Can you explain why historic domain xml is needed for performing a backup?

In the current implementation for qemu, the full <domain> is needed only
for validation that the disks requested in the virBackupBegin()
operation have a clean chain of bitmaps from the checkpoint to the
present; and you may be right that the bulk of the information about the
<domain> is spurious for qemu's needs.  But it is not obvious whether
throwing away information about the <domain> would be valid for other
hypervisor implementations, nor whether some future tweak to qemu may
actually require more of the <domain> state as it was at the time the
snapshot was created.

> 
> The caller of the API is responsible for giving you the list of bitmaps
> that should be included for every disk. Libvirt needs to create a new
> active bitmap and start the NBD server exposing the bitmaps specified by
> the caller. How is having the xml for every checkpoint/bitmap needed to
> perform these steps?

Because having the <domain> lets libvirt correlate that disk 'vda' at
the time of the snapshot maps to whatever filename with UUID in the
<domain> definition, to ensure that it is operating on the same device
(and that 'vda' was not hot-unplugged and then hot-plugged to some other
disk in the meantime).
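
Purely as a client-side illustration of that correlation: the <domain>
stored in the checkpoint XML is what lets you recover which file 'vda'
pointed at when the checkpoint was taken, no matter what has been
hotplugged since (the helper below is hypothetical):

import xml.etree.ElementTree as ET

def disk_sources_at_checkpoint(checkpoint_xml):
    # Map target dev -> source file as recorded in the checkpoint's
    # embedded <domain>, e.g. {'vda': '/path/to/file1'}.
    root = ET.fromstring(checkpoint_xml)
    mapping = {}
    for disk in root.findall("./domain/devices/disk"):
        target = disk.find("target")
        source = disk.find("source")
        if target is not None and source is not None:
            mapping[target.get("dev")] = source.get("file")
    return mapping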


>>
> 
> The XML libvirt produces may not be stable. For example, the disk with UUID
> "22e5a7d6-5409-482d-bda4-f5939b27aad4" may be called "sda" now, but when the
> VM was running on another host 2 weeks ago maybe it was called "sdb", since
> the user added or removed some devices.

But at the time a <domainsnapshot> or <domaincheckpoint> was created,
the <domain> subelement of that XML _is_ stable - that particular disk
name/UUID WAS disk 'sda' at the time the object was created, no matter
what name it has later.


>>> Note that vdsm may be killed in the middle of the redefine loop, and in
>>> this case we leave libvirt with partial info about checkpoints, and we
>>> need to redefine the checkpoints again, handling this partial state.
>>
>> But that's relatively easy - if you don't know whether libvirt might
>> have partial data, then wipe the data and start the redefine loop from
>> scratch.
>>
> 
> But this easy boilerplate will be duplicated in all libvirt clients,
> instead of having a single API to restore the state.

If the single API didn't have to include a <domain> element for every
checkpoint, and thus risk exceeding RPC bounds, then we'd be okay. But
at the present moment, bulk operations would require a LOT of XML,
regardless of snapshot vs. checkpoint.

> 
> Of course, if we must keep the historic domain configuration for every
> checkpoint, a bulk API does not make sense; but I don't understand this
> requirement.

The requirement is stronger for snapshot (because you can't revert
without either knowing the <domain> setup to revert to or using the
UNSAFE flag) than for checkpoint; but my plan was to make <domain>
mandatory for checkpoint because experience with snapshot has shown
that making <domain> optional on REDEFINE causes more grief than
desired, and because I wanted to share as much good code between the two
(while leaving the back-compats to just snapshots) as possible.  If it
turns out that we want <domain> to be optional on REDEFINE for
checkpoints, we can do that as a followup patch, but first I want to
focus on getting the API in (we can relax API restrictions later, but
only if the API even exists to make it into backports).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Daniel P. Berrangé 5 years, 1 month ago
On Tue, Mar 12, 2019 at 04:52:24PM -0500, Eric Blake wrote:
> On 3/12/19 4:35 PM, Nir Soffer wrote:
> 
> >>> We don't have a need to list or define snapshots since we manage
> >>> snapshots on the oVirt side.
> >>> We want an API to list and redefine checkpoints.
> >>
> >> But the proposed <domaincheckpoint> also has a <domain> subelement, so
> >> it has the same problem (the XML for a bulk operation can become
> >> prohibitively large).
> >>
> > 
> > Why do we need <domain> in a checkpoint?
> 
> The initial design for snapshots did not have <domain>, and it bit us
> hard; you cannot safely revert to a snapshot if you don't know the state
> of the domain at that time. The initial design for checkpoints has thus
> mandated that <domain> be present (my v5 series fails to redefine a
> snapshot if you omit <domain>, even though it has a NO_DOMAIN flag
> during dumpxml for reduced output).  If we are convinced that defining a
> snapshot without <domain> is safe enough, then I can relax the
> checkpoint code to allow redefined metadata without <domain> the way
> snapshots already have to do it, even though I was hoping that
> checkpoints could start life with fewer back-compats that snapshot has
> had to carry along.  But I'd rather start strict and relax later
> (require <domain> and then remove it when proven safe), and not start
> loose (leave <domain> optional, and then wish we had made it mandatory).

Given that a guest domain's XML can change at runtime at any point,
I don't see how omitting <domain> from the checkpoint XML is safe
in general. Even if apps think it is safe now and omit it, a future
version of the app might change in a way that makes omitting the
<domain> unsafe. If we didn't historically record the <domain> in
the checkpoint in the first place, then the new version of the app
is potentially in trouble. So I think it is good that we are strict
and mandate the <domain> XML even if it is not technically required
in some use cases.


> > Note that vdsm may be killed in the middle of the redefine loop, and in
> > this case we leave libvirt with partial info about checkpoints, and we
> > need to redefine the checkpoints again, handling this partial state.
> 
> But that's relatively easy - if you don't know whether libvirt might
> have partial data, then wipe the data and start the redefine loop from
> scratch.

Of course the same failure scenario applies if libvirt is doing it via
a bulk operation. The redefine loop still exists, just inside libvirt
instead, which might be killed or die part way through. So you're not
really fixing a failure scenario, just moving the failure to a different
piece. That's no net win.



Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Eric Blake 5 years, 1 month ago
On 3/12/19 4:52 PM, Eric Blake wrote:

>> vm_checkpoints table
>>
>> - checkpoint_id: UUID
>> - parent: UUID
>> - vm_id: UUID
>> - creation_date: TIMESTAMP
>>
>> vm_checkpoint_disk_map table
>> - disk_id: UUID
>> - checkpoint_id: UUID
>>
> 
> And no state of the <domain> at the time the checkpoint was created?

I meant to add:

once we start playing with checkpoints vs. external snapshots vs.
hotplug in anger, then knowing WHICH disks existed at the time of a
given checkpoint may prove invaluable to even figure out WHICH disks
need to participate in a given backup operation.  The initial
implementation uses XML (so that adding features can hopefully reuse the
same API, rather than needing yet more one-off API additions), even if
it is currently limited to not playing nicely with snapshots.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Nir Soffer 5 years, 1 month ago
On Wed, Mar 13, 2019 at 12:01 AM Eric Blake <eblake@redhat.com> wrote:

> On 3/12/19 4:52 PM, Eric Blake wrote:
>
> >> vm_checkpoints table
> >>
> >> - checkpoint_id: UUID
> >> - parent: UUID
> >> - vm_id: UUID
> >> - creation_date: TIMESTAMP
> >>
> >> vm_checkpoint_disk_map table
> >> - disk_id: UUID
> >> - checkpoint_id: UUID
> >>
> >
> > And no state of the <domain> at the time the checkpoint was created?
>
> I meant to add:
>
> once we start playing with checkpoints vs. external snapshots vs.
> hotplug in anger, then knowing WHICH disks existed at the time of a
> given checkpoint may prove invaluable to even figure out WHICH disks
> need to participate in a given backup operation.  The initial
> implementation uses XML (so that adding features can hopefully reuse the
> same API, rather than needing yet more one-off API additions), even if
> it is currently limited to not playing nicely with snapshots.


When you plug a disk, you cannot use any of the bitmaps on the disk because
you don't have any guarantee that the user did not attach the disk to another
machine that modified the data in some way without updating the bitmaps. So
we are going to delete a disk from checkpoints once you unplug it, and start
from scratch when you plug a disk.

I think what we keep is enough to tell libvirt which bitmaps are needed for
a backup.

We probably need to keep the VM configuration per checkpoint, or at least
ensure consistency, so that once you start a backup the configuration cannot
change until the backup ends.
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v4 0/8] bulk snapshot list/redefine (incremental backup saga)
Posted by Eric Blake 5 years, 1 month ago
On 3/12/19 6:45 PM, Nir Soffer wrote:

> 
> When you plug a disk, you cannot use any of the bitmaps on the disk because
> you don't have any guarantee that the user did not attach the disk to another
> machine that modified the data in some way without updating the bitmaps. So
> we are going to delete a disk from checkpoints once you unplug it, and start
> from scratch when you plug a disk.

Correct - from libvirt's point of view, when a disk is hotplugged, the
next backup needs to perform a full backup rather than an incremental
backup on that disk, because that disk was not present in the machine on
the last checkpoint and therefore does not have bitmap coverage from the
last checkpoint to the present moment. But you don't necessarily have to
delete bitmaps from the qcow2 file to get this effect.  The fact that
libvirt stored a full <domain> at the time of the checkpoint is enough
for libvirt to tell that you are requesting a backup of a disk that did
not exist at the starting checkpoint, and therefore that disk needs a
full rather than incremental backup.


> 
> I think what we keep is enough to tell libvirt which bitmaps are needed for
> a backup.

The current design for virDomainBackup() is that you state which
checkpoint to start from, and then libvirt computes all bitmaps from
that checkpoint to the present in order to perform the incremental
backup, or complains that there is no sequence of bitmaps that covers
that length in time so you have to do a full backup instead.  You do not
tell libvirt 'perform a backup of the clusters represented in bitmaps B,
C, and D', but 'perform a backup of the clusters modified since time
T2'.  Libvirt then crawls the chain of checkpoint objects to find T2 and
verifies which bitmaps between T2 and the present (be that bitmaps B, C,
and D) it needs to use to perform that differential backup.
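
Concretely, the intended usage is something like the sketch below (heavily
hedged: the <domainbackup> schema, the <incremental> element naming the
starting checkpoint, and the backupBegin() binding are all taken from the
in-progress backup proposal and may change before they land; the domain
name and file path are illustrative):

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("fedora")

backup_xml = """
<domainbackup mode='push'>
  <incremental>1525111885</incremental>
  <disks>
    <disk name='vda' type='file'>
      <target file='/backup/vda.since-1525111885.qcow2'/>
    </disk>
  </disks>
</domainbackup>
"""

# Libvirt itself walks the checkpoint chain from '1525111885' to the
# present and merges whatever bitmaps cover that range; the client
# never names individual bitmaps.
dom.backupBegin(backup_xml, None)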

> 
> We probably need to keep the VM configuration per checkpoint, or at least
> ensure consistency, so that once you start a backup the configuration
> cannot change until the backup ends.

And that configuration per checkpoint IS the <domain> subelement.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list