Currently post_save hook is called without checking its return. If post_save
fails, we need to set an error and propagate it to the caller.
Since post_save hook is called regardless of whether there is a preceeding error,
it is possible that we have 2 distict errors, one from the preceeding function
call, and the other from the post_save call.
Return the latest error to the caller.
Signed-off-by: Arun Menon <armenon@redhat.com>
---
migration/vmstate.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/migration/vmstate.c b/migration/vmstate.c
index b725202bfcf69c3c81338f1f5479aa2ddc5db86f..25a819da069b982d4043f287b4562ea402d9eb0e 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -418,6 +418,7 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
void *opaque, JSONWriter *vmdesc, int version_id, Error **errp)
{
int ret = 0;
+ int ps_ret = 0;
const VMStateField *field = vmsd->fields;
trace_vmstate_save_state_top(vmsd->name);
@@ -533,7 +534,14 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
error_setg(errp, "Save of field %s/%s failed",
vmsd->name, field->name);
if (vmsd->post_save) {
- vmsd->post_save(opaque);
+ ps_ret = vmsd->post_save(opaque);
+ if (ps_ret) {
+ ret = ps_ret;
+ error_free_or_abort(errp);
+ error_setg(errp,
+ "post-save for %s failed, ret: '%d'",
+ vmsd->name, ret);
+ }
}
return ret;
}
@@ -561,10 +569,12 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
ret = vmstate_subsection_save(f, vmsd, opaque, vmdesc, errp);
if (vmsd->post_save) {
- int ps_ret = vmsd->post_save(opaque);
- if (!ret && ps_ret) {
+ ps_ret = vmsd->post_save(opaque);
+ if (ps_ret) {
ret = ps_ret;
- error_setg(errp, "post-save failed: %s", vmsd->name);
+ error_free_or_abort(errp);
+ error_setg(errp, "post-save for %s failed, ret: '%d'",
+ vmsd->name, ret);
}
}
return ret;
--
2.50.0
On 2025/07/31 22:21, Arun Menon wrote:
> Currently post_save hook is called without checking its return. If post_save
> fails, we need to set an error and propagate it to the caller.
>
> Since post_save hook is called regardless of whether there is a preceeding error,
> it is possible that we have 2 distict errors, one from the preceeding function
> call, and the other from the post_save call.
>
> Return the latest error to the caller.
This needs to be explained better. This patch makes two changes on the
behavior when there are two errors:
1) Proceeding errors were propagated before, but they are now
dismissed.
2) post_error() errors were dismissed before, but they are now
propagated.
This message doesn't mention 1) at all. It does say 2) is necessary, but
does not explain why.
>
> Signed-off-by: Arun Menon <armenon@redhat.com>
> ---
> migration/vmstate.c | 18 ++++++++++++++----
> 1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/migration/vmstate.c b/migration/vmstate.c
> index b725202bfcf69c3c81338f1f5479aa2ddc5db86f..25a819da069b982d4043f287b4562ea402d9eb0e 100644
> --- a/migration/vmstate.c
> +++ b/migration/vmstate.c
> @@ -418,6 +418,7 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
> void *opaque, JSONWriter *vmdesc, int version_id, Error **errp)
> {
> int ret = 0;
> + int ps_ret = 0;
> const VMStateField *field = vmsd->fields;
>
> trace_vmstate_save_state_top(vmsd->name);
> @@ -533,7 +534,14 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
> error_setg(errp, "Save of field %s/%s failed",
> vmsd->name, field->name);
> if (vmsd->post_save) {
> - vmsd->post_save(opaque);
> + ps_ret = vmsd->post_save(opaque);
> + if (ps_ret) {
> + ret = ps_ret;
> + error_free_or_abort(errp);
> + error_setg(errp,
> + "post-save for %s failed, ret: '%d'",
> + vmsd->name, ret);
> + }
> }
> return ret;
> }
> @@ -561,10 +569,12 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
> ret = vmstate_subsection_save(f, vmsd, opaque, vmdesc, errp);
>
> if (vmsd->post_save) {
> - int ps_ret = vmsd->post_save(opaque);
> - if (!ret && ps_ret) {
> + ps_ret = vmsd->post_save(opaque);
> + if (ps_ret) {
> ret = ps_ret;
> - error_setg(errp, "post-save failed: %s", vmsd->name);
> + error_free_or_abort(errp);
> + error_setg(errp, "post-save for %s failed, ret: '%d'",
> + vmsd->name, ret);
> }
> }
> return ret;
>
Hi Akihiko,
Thanks for the review.
On Fri, Aug 01, 2025 at 04:35:34PM +0900, Akihiko Odaki wrote:
> On 2025/07/31 22:21, Arun Menon wrote:
> > Currently post_save hook is called without checking its return. If post_save
> > fails, we need to set an error and propagate it to the caller.
> >
> > Since post_save hook is called regardless of whether there is a preceeding error,
> > it is possible that we have 2 distict errors, one from the preceeding function
> > call, and the other from the post_save call.
> >
> > Return the latest error to the caller.
>
> This needs to be explained better. This patch makes two changes on the
> behavior when there are two errors:
>
> 1) Proceeding errors were propagated before, but they are now
> dismissed.
> 2) post_error() errors were dismissed before, but they are now
> propagated.
>
> This message doesn't mention 1) at all. It does say 2) is necessary, but
> does not explain why.
Please correct me if I am wrong, does the following look ok?
Currently post_save() hook is called without checking its return.
post_save() hooks, generally does the cleanup, and that is the reason why
we have been dismissing its failure.
We want to consider
- the saving of the device state (save or subsection save) and
- running the cleanup after
as an atomic operation. And that is why, post_save is called regardless
of whether there is a preceeding error.
This means that, it is possible that we have 2 distict errors, one from
the preceeding function and the other from the post_save() function.
This patch makes two changes on the behavior when there are two errors:
1) Preceeding errors were propagated before, but they are now
dismissed if there is a new post_save() error.
2) post_save() errors were dismissed before, but they are now
propagated.
We intend to extend the error propagation feature (saving erros in
Error object and return to the caller) to the post/pre
save/load hooks. Therefore, it is important for the user to know
the exact reason of failure in case any of these hooks fail.
====
I do not know much about the operations we do, or intend to do
using the post_save() call. But my guess is that we are going to
revert things or cleanup stuff, and if reverting/cleanup fails,
then the user should be notified about that.
>
> >
> > Signed-off-by: Arun Menon <armenon@redhat.com>
> > ---
> > migration/vmstate.c | 18 ++++++++++++++----
> > 1 file changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/migration/vmstate.c b/migration/vmstate.c
> > index b725202bfcf69c3c81338f1f5479aa2ddc5db86f..25a819da069b982d4043f287b4562ea402d9eb0e 100644
> > --- a/migration/vmstate.c
> > +++ b/migration/vmstate.c
> > @@ -418,6 +418,7 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
> > void *opaque, JSONWriter *vmdesc, int version_id, Error **errp)
> > {
> > int ret = 0;
> > + int ps_ret = 0;
> > const VMStateField *field = vmsd->fields;
> > trace_vmstate_save_state_top(vmsd->name);
> > @@ -533,7 +534,14 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
> > error_setg(errp, "Save of field %s/%s failed",
> > vmsd->name, field->name);
> > if (vmsd->post_save) {
> > - vmsd->post_save(opaque);
> > + ps_ret = vmsd->post_save(opaque);
> > + if (ps_ret) {
> > + ret = ps_ret;
> > + error_free_or_abort(errp);
> > + error_setg(errp,
> > + "post-save for %s failed, ret: '%d'",
> > + vmsd->name, ret);
> > + }
> > }
> > return ret;
> > }
> > @@ -561,10 +569,12 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
> > ret = vmstate_subsection_save(f, vmsd, opaque, vmdesc, errp);
> > if (vmsd->post_save) {
> > - int ps_ret = vmsd->post_save(opaque);
> > - if (!ret && ps_ret) {
> > + ps_ret = vmsd->post_save(opaque);
> > + if (ps_ret) {
> > ret = ps_ret;
> > - error_setg(errp, "post-save failed: %s", vmsd->name);
> > + error_free_or_abort(errp);
> > + error_setg(errp, "post-save for %s failed, ret: '%d'",
> > + vmsd->name, ret);
> > }
> > }
> > return ret;
> >
>
Regards,
Arun
On 2025/08/02 2:27, Arun Menon wrote: > Hi Akihiko, > Thanks for the review. > > On Fri, Aug 01, 2025 at 04:35:34PM +0900, Akihiko Odaki wrote: >> On 2025/07/31 22:21, Arun Menon wrote: >>> Currently post_save hook is called without checking its return. If post_save >>> fails, we need to set an error and propagate it to the caller. >>> >>> Since post_save hook is called regardless of whether there is a preceeding error, >>> it is possible that we have 2 distict errors, one from the preceeding function >>> call, and the other from the post_save call. >>> >>> Return the latest error to the caller. >> >> This needs to be explained better. This patch makes two changes on the >> behavior when there are two errors: >> >> 1) Proceeding errors were propagated before, but they are now >> dismissed. >> 2) post_error() errors were dismissed before, but they are now >> propagated. >> >> This message doesn't mention 1) at all. It does say 2) is necessary, but >> does not explain why. > > Please correct me if I am wrong, does the following look ok? > > Currently post_save() hook is called without checking its return. > post_save() hooks, generally does the cleanup, and that is the reason why > we have been dismissing its failure. > > We want to consider > - the saving of the device state (save or subsection save) and > - running the cleanup after > as an atomic operation. And that is why, post_save is called regardless > of whether there is a preceeding error. > This means that, it is possible that we have 2 distict errors, one from > the preceeding function and the other from the post_save() function. > > This patch makes two changes on the behavior when there are two errors: > > 1) Preceeding errors were propagated before, but they are now > dismissed if there is a new post_save() error. > 2) post_save() errors were dismissed before, but they are now > propagated. > > We intend to extend the error propagation feature (saving erros in > Error object and return to the caller) to the post/pre > save/load hooks. Therefore, it is important for the user to know > the exact reason of failure in case any of these hooks fail. > > ==== > I do not know much about the operations we do, or intend to do > using the post_save() call. But my guess is that we are going to > revert things or cleanup stuff, and if reverting/cleanup fails, > then the user should be notified about that. All what you stated sounds correct to me and explains 2). But there is still no reasoning for 1). Previously I listed three possible design options: > - propagate whatever error easier to do so > - somehow combine errors > (languages that support exceptions often do this) > - make sure that no error will happen when there is a proceeding error > by having two callbacks: > - One callback that is called only when there is no proceeding error > and can raise an error > - Another that is always called but cannot raise an error https://lore.kernel.org/qemu-devel/1ff92294-5c8c-4b33-89c1-91d37d6accb0@rsg.ci.i.u-tokyo.ac.jp/ It will be a good reasoning if we can show propagating post_save() errors is easier (or more important) than propagating proceeding errors. There are alternative options. "Making sure that no error will happen when there is a proceeding error by having two callbacks" is one of them. You wrote: > We want to consider > - the saving of the device state (save or subsection save) and > - running the cleanup after > as an atomic operation. And that is why, post_save is called > regardless of whether there is a preceeding error. > This means that, it is possible that we have 2 distict errors, one > from the preceeding function and the other from the post_save() > function. But it may not be mandatory. In fact, I searched "post_save" in the code base and found all implementations are for cleanup and always succeeds, which makes sense; most (if not all) cleanup operations like g_free() always succeeds in general. So perhaps "making sure that no error will happen when there is a proceeding error" may be feasible, and it may not even require two callbacks because post_save() always succeeds. Regards, Akihiko Odaki
Hi Akihiko, On Sun, Aug 03, 2025 at 04:15:16PM +0900, Akihiko Odaki wrote: > On 2025/08/02 2:27, Arun Menon wrote: > > Hi Akihiko, > > Thanks for the review. > > > > On Fri, Aug 01, 2025 at 04:35:34PM +0900, Akihiko Odaki wrote: > > > On 2025/07/31 22:21, Arun Menon wrote: > > > > Currently post_save hook is called without checking its return. If post_save > > > > fails, we need to set an error and propagate it to the caller. > > > > > > > > Since post_save hook is called regardless of whether there is a preceeding error, > > > > it is possible that we have 2 distict errors, one from the preceeding function > > > > call, and the other from the post_save call. > > > > > > > > Return the latest error to the caller. > > > > > > This needs to be explained better. This patch makes two changes on the > > > behavior when there are two errors: > > > > > > 1) Proceeding errors were propagated before, but they are now > > > dismissed. > > > 2) post_error() errors were dismissed before, but they are now > > > propagated. > > > > > > This message doesn't mention 1) at all. It does say 2) is necessary, but > > > does not explain why. > > > > Please correct me if I am wrong, does the following look ok? > > > > Currently post_save() hook is called without checking its return. > > post_save() hooks, generally does the cleanup, and that is the reason why > > we have been dismissing its failure. > > > > We want to consider > > - the saving of the device state (save or subsection save) and > > - running the cleanup after > > as an atomic operation. And that is why, post_save is called regardless > > of whether there is a preceeding error. > > This means that, it is possible that we have 2 distict errors, one from > > the preceeding function and the other from the post_save() function. > > > > This patch makes two changes on the behavior when there are two errors: > > > > 1) Preceeding errors were propagated before, but they are now > > dismissed if there is a new post_save() error. > > 2) post_save() errors were dismissed before, but they are now > > propagated. > > > > We intend to extend the error propagation feature (saving erros in > > Error object and return to the caller) to the post/pre > > save/load hooks. Therefore, it is important for the user to know > > the exact reason of failure in case any of these hooks fail. > > > > ==== > > I do not know much about the operations we do, or intend to do > > using the post_save() call. But my guess is that we are going to > > revert things or cleanup stuff, and if reverting/cleanup fails, > > then the user should be notified about that. > > All what you stated sounds correct to me and explains 2). But there is still > no reasoning for 1). > > Previously I listed three possible design options: > > - propagate whatever error easier to do so > > - somehow combine errors > > (languages that support exceptions often do this) > > - make sure that no error will happen when there is a proceeding error > > by having two callbacks: > > - One callback that is called only when there is no proceeding error > > and can raise an error > > - Another that is always called but cannot raise an error > > https://lore.kernel.org/qemu-devel/1ff92294-5c8c-4b33-89c1-91d37d6accb0@rsg.ci.i.u-tokyo.ac.jp/ > > It will be a good reasoning if we can show propagating post_save() errors is > easier (or more important) than propagating proceeding errors. > > There are alternative options. "Making sure that no error will happen when > there is a proceeding error by having two callbacks" is one of them. You > wrote: > > We want to consider > > - the saving of the device state (save or subsection save) and > > - running the cleanup after > > as an atomic operation. And that is why, post_save is called > > regardless of whether there is a preceeding error. > > This means that, it is possible that we have 2 distict errors, one > > from the preceeding function and the other from the post_save() > > function. > > But it may not be mandatory. In fact, I searched "post_save" in the code > base and found all implementations are for cleanup and always succeeds, > which makes sense; most (if not all) cleanup operations like g_free() always > succeeds in general. post_save calls do succeed in general. Now, we are introducing the post_save_errp() function which is an alternative to the post_save() call. We are adding the feature to catch errors in Error object, thereby clearly considering the case where post_save_erp can go wrong. So we have already deviated from the old design that all post_save operations are infalliable cleaup routines. The only catch is, these alternate functions are introduced a couple of commits later. If that is the case, then we need to handle both the errors, and propagate only the most significant (which in my opinion is the post_save) one. In other words, the reasoning for 1, 1) Proceeding errors were propagated before, but they are now dismissed. is that, a failure during the preceeding operation means that we failed in saving the state but a failure during the post-save call, means that we failed in cleanup or restoring back the system to a stable state, which is a more serious one, because it may lead to the system being in an inconsistent state. Can I write something on these lines, to justify the change? > > So perhaps "making sure that no error will happen when there is a proceeding > error" may be feasible, and it may not even require two callbacks because > post_save() always succeeds. I am sorry, I could not completely understand the part, where you mentioned that we need to have 2 callbacks; I am assuming you are mentioning something like try..catch, finally blocks from the other languages. post_save (that can not raise an error) being called in the goto block. So if the main operation fails, we say goto cleanup, and then execute post_save which always succeeds. If the main operation succeeds then we call post_save that can raise an error. Please correct me if I am wrong. > > Regards, > Akihiko Odaki > Regards, Arun
On 2025/08/04 16:48, Arun Menon wrote:
> Hi Akihiko,
>
> On Sun, Aug 03, 2025 at 04:15:16PM +0900, Akihiko Odaki wrote:
>> On 2025/08/02 2:27, Arun Menon wrote:
>>> Hi Akihiko,
>>> Thanks for the review.
>>>
>>> On Fri, Aug 01, 2025 at 04:35:34PM +0900, Akihiko Odaki wrote:
>>>> On 2025/07/31 22:21, Arun Menon wrote:
>>>>> Currently post_save hook is called without checking its return. If post_save
>>>>> fails, we need to set an error and propagate it to the caller.
>>>>>
>>>>> Since post_save hook is called regardless of whether there is a preceeding error,
>>>>> it is possible that we have 2 distict errors, one from the preceeding function
>>>>> call, and the other from the post_save call.
>>>>>
>>>>> Return the latest error to the caller.
>>>>
>>>> This needs to be explained better. This patch makes two changes on the
>>>> behavior when there are two errors:
>>>>
>>>> 1) Proceeding errors were propagated before, but they are now
>>>> dismissed.
>>>> 2) post_error() errors were dismissed before, but they are now
>>>> propagated.
>>>>
>>>> This message doesn't mention 1) at all. It does say 2) is necessary, but
>>>> does not explain why.
>>>
>>> Please correct me if I am wrong, does the following look ok?
>>>
>>> Currently post_save() hook is called without checking its return.
>>> post_save() hooks, generally does the cleanup, and that is the reason why
>>> we have been dismissing its failure.
>>>
>>> We want to consider
>>> - the saving of the device state (save or subsection save) and
>>> - running the cleanup after
>>> as an atomic operation. And that is why, post_save is called regardless
>>> of whether there is a preceeding error.
>>> This means that, it is possible that we have 2 distict errors, one from
>>> the preceeding function and the other from the post_save() function.
>>>
>>> This patch makes two changes on the behavior when there are two errors:
>>>
>>> 1) Preceeding errors were propagated before, but they are now
>>> dismissed if there is a new post_save() error.
>>> 2) post_save() errors were dismissed before, but they are now
>>> propagated.
>>>
>>> We intend to extend the error propagation feature (saving erros in
>>> Error object and return to the caller) to the post/pre
>>> save/load hooks. Therefore, it is important for the user to know
>>> the exact reason of failure in case any of these hooks fail.
>>>
>>> ====
>>> I do not know much about the operations we do, or intend to do
>>> using the post_save() call. But my guess is that we are going to
>>> revert things or cleanup stuff, and if reverting/cleanup fails,
>>> then the user should be notified about that.
>>
>> All what you stated sounds correct to me and explains 2). But there is still
>> no reasoning for 1).
>>
>> Previously I listed three possible design options:
>>> - propagate whatever error easier to do so
>>> - somehow combine errors
>>> (languages that support exceptions often do this)
>>> - make sure that no error will happen when there is a proceeding error
>>> by having two callbacks:
>>> - One callback that is called only when there is no proceeding error
>>> and can raise an error
>>> - Another that is always called but cannot raise an error
>>
>> https://lore.kernel.org/qemu-devel/1ff92294-5c8c-4b33-89c1-91d37d6accb0@rsg.ci.i.u-tokyo.ac.jp/
>>
>> It will be a good reasoning if we can show propagating post_save() errors is
>> easier (or more important) than propagating proceeding errors.
>>
>> There are alternative options. "Making sure that no error will happen when
>> there is a proceeding error by having two callbacks" is one of them. You
>> wrote:
>>> We want to consider
>>> - the saving of the device state (save or subsection save) and
>>> - running the cleanup after
>>> as an atomic operation. And that is why, post_save is called
>>> regardless of whether there is a preceeding error.
>>> This means that, it is possible that we have 2 distict errors, one
>>> from the preceeding function and the other from the post_save()
>>> function.
>>
>> But it may not be mandatory. In fact, I searched "post_save" in the code
>> base and found all implementations are for cleanup and always succeeds,
>> which makes sense; most (if not all) cleanup operations like g_free() always
>> succeeds in general.
>
> post_save calls do succeed in general.
> Now, we are introducing the post_save_errp() function which is an alternative to the
> post_save() call. We are adding the feature to catch errors in Error object,
> thereby clearly considering the case where post_save_erp can go wrong.
> So we have already deviated from the old design that all post_save operations are
> infalliable cleaup routines. The only catch is, these alternate functions are introduced
> a couple of commits later.
>
> If that is the case, then we need to handle both the errors, and propagate only
> the most significant (which in my opinion is the post_save) one.
>
> In other words, the reasoning for 1,
> 1) Proceeding errors were propagated before, but they are now dismissed.
>
> is that,
> a failure during the preceeding operation means that we failed in saving the state
> but a failure during the post-save call, means that we failed in cleanup or restoring
> back the system to a stable state, which is a more serious one, because it may lead to
> the system being in an inconsistent state.
> Can I write something on these lines, to justify the change?
That is better, but also gives another question: when will "cleanup or
restoring back the system to a stable state" fail? We already know that
the all post_save() always succeeds, so there should be some anticipated
use case that requires us to "deviate from the old design that all
post_save operations are infalliable cleanup routines".
>
>>
>> So perhaps "making sure that no error will happen when there is a proceeding
>> error" may be feasible, and it may not even require two callbacks because
>> post_save() always succeeds.
>
> I am sorry, I could not completely understand the part, where you mentioned that
> we need to have 2 callbacks; I am assuming you are mentioning something like try..catch,
> finally blocks from the other languages.
> post_save (that can not raise an error) being called in the goto block.
Such two callbacks would handle two scenarios you previously wrote,
respectively:
> We want to consider
> - the saving of the device state (save or subsection save) and
> - running the cleanup after
> as an atomic operation.
An analogy with try/catch/finally will indeed clarify the idea, though
"catch" is not relevant here. It will look like as follows:
try {
/* A pseudo function that represents the save operation. */
save();
/*
* A new callback that can raise an error.
* It will not be called if save() fails.
*/
vmsd->commit_save();
} finally {
/*
* The old callback. Returning non-zero value is deprecated.
* Any implementations that return a non-zero value should be
* gradually converted to the new commit_save() callback.
*/
vmsd->post_save();
}
>
> So if the main operation fails, we say goto cleanup, and then execute post_save which always
> succeeds. If the main operation succeeds then we call post_save that can raise an error. Please correct
> me if I am wrong.
Yes, that's what I meant.
Regards,
Akihiko Odaki
© 2016 - 2025 Red Hat, Inc.