[v9] migration: propagate vTPM errors using Error objects

[PATCH v9 24/27] migration: Propagate last encountered error in vmstate_save_state_v() function

Posted by Arun Menon 6 months, 1 week ago

We consider,
  - the saving of the device state (save or subsection save) and
  - running the cleanup after (post_save)
as an atomic operation. And that is why, post_save() is called regardless
of whether there is a preceeding error. This means that, it is possible
that we have 2 distict errors, one from the preceeding function and the
other from the post_save() function.

This commit changes the error handling behavior when two errors occur during
a save operation.

1) Preceding errors were propagated before, but they are now dismissed, if there
   is a new post_save() error.
    - A failure during the main save operation, means the system failed to
      reach a new desired state. A failure during the post-save cleanup,
      however, is a more critical issue as it can leave the system in an
      inconsistent or corrupted state. At present, all post_save() calls
      succeed. However, the introduction of error handling variants of these
      functions (post_save_errp) in the following commit, imply that we need
      to handle and propagate these errors. Therefore, we prioritize reporting
      the more severe error.

2) post_save() errors were dismissed before, but they are now propagated.
    - The original design implicitly assumed post-save hooks were infallible
      cleanup routines.
      This will not be the case after introducing the post/pre save/load errp
      variant functions. Dismissing these errors prevents users from
      identifying the specific operation that failed.
      The entire save-and-cleanup process is treated as a single
      logical operation, and all potential failures are communicated.

Signed-off-by: Arun Menon <armenon@redhat.com>
---
 migration/vmstate.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/migration/vmstate.c b/migration/vmstate.c
index fbc59caadbbcc75fe6de27b459aa9aa25e76aa0a..ef78a1e62ad92e9608de72d125da80ea496c8dd1 100644
--- a/migration/vmstate.c
+++ b/migration/vmstate.c
@@ -554,6 +554,12 @@ static int vmstate_save_dispatch(QEMUFile *f,
                     error_setg(errp, "Save of field %s/%s failed",
                                 vmsd->name, field->name);
                     ps_ret = post_save_dispatch(vmsd, opaque, &local_err);
+                    if (ps_ret) {
+                        ret = ps_ret;
+                        error_free_or_abort(errp);
+                        error_setg(errp, "post-save for %s failed, ret: %d",
+                                   vmsd->name, ret);
+                    }
                     return ret;
                 }
 
@@ -603,10 +609,14 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
     }
 
     ret = vmstate_subsection_save(f, vmsd, opaque, vmdesc, errp);
+
     ps_ret = post_save_dispatch(vmsd, opaque, &local_err);
-    if (!ret && ps_ret) {
+    if (ps_ret) {
+        if (ret) {
+            error_free_or_abort(errp);
+        }
         ret = ps_ret;
-        error_setg(errp, "post-save failed: %s", vmsd->name);
+        error_propagate(errp, local_err);
     }
 
     return ret;

-- 
2.50.1

Re: [PATCH v9 24/27] migration: Propagate last encountered error in vmstate_save_state_v() function

Posted by Akihiko Odaki 6 months, 1 week ago

On 2025/08/06 3:25, Arun Menon wrote:
> We consider,
>    - the saving of the device state (save or subsection save) and
>    - running the cleanup after (post_save)
> as an atomic operation. And that is why, post_save() is called regardless
> of whether there is a preceeding error. This means that, it is possible
> that we have 2 distict errors, one from the preceeding function and the
> other from the post_save() function.
> 
> This commit changes the error handling behavior when two errors occur during
> a save operation.
> 
> 1) Preceding errors were propagated before, but they are now dismissed, if there
>     is a new post_save() error.
>      - A failure during the main save operation, means the system failed to
>        reach a new desired state. A failure during the post-save cleanup,
>        however, is a more critical issue as it can leave the system in an
>        inconsistent or corrupted state. At present, all post_save() calls
>        succeed. However, the introduction of error handling variants of these
>        functions (post_save_errp) in the following commit, imply that we need
>        to handle and propagate these errors. Therefore, we prioritize reporting
>        the more severe error.

This explains why the post_save() error is propagated instead of 
propagating the preceding error. But the preceding errors still do not 
have to be dismissed if we report them with error_report_err() instead.

> 
> 2) post_save() errors were dismissed before, but they are now propagated.
>      - The original design implicitly assumed post-save hooks were infallible
>        cleanup routines.
>        This will not be the case after introducing the post/pre save/load errp
>        variant functions. Dismissing these errors prevents users from
>        identifying the specific operation that failed.

Re-iterating previous comments, if introducing post save errp variant 
functions break the assumption, we just don't have to introduce one. The 
reason to introduce one needs to be explained.

>        The entire save-and-cleanup process is treated as a single
>        logical operation, and all potential failures are communicated.
> 
> Signed-off-by: Arun Menon <armenon@redhat.com>
> ---
>   migration/vmstate.c | 14 ++++++++++++--
>   1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/vmstate.c b/migration/vmstate.c
> index fbc59caadbbcc75fe6de27b459aa9aa25e76aa0a..ef78a1e62ad92e9608de72d125da80ea496c8dd1 100644
> --- a/migration/vmstate.c
> +++ b/migration/vmstate.c
> @@ -554,6 +554,12 @@ static int vmstate_save_dispatch(QEMUFile *f,
>                       error_setg(errp, "Save of field %s/%s failed",
>                                   vmsd->name, field->name);
>                       ps_ret = post_save_dispatch(vmsd, opaque, &local_err);
> +                    if (ps_ret) {
> +                        ret = ps_ret;
> +                        error_free_or_abort(errp);
> +                        error_setg(errp, "post-save for %s failed, ret: %d",
> +                                   vmsd->name, ret);
> +                    }
>                       return ret;
>                   }
>   
> @@ -603,10 +609,14 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
>       }
>   
>       ret = vmstate_subsection_save(f, vmsd, opaque, vmdesc, errp);
> +
>       ps_ret = post_save_dispatch(vmsd, opaque, &local_err);
> -    if (!ret && ps_ret) {
> +    if (ps_ret) {
> +        if (ret) {
> +            error_free_or_abort(errp);
> +        }
>           ret = ps_ret;
> -        error_setg(errp, "post-save failed: %s", vmsd->name);
> +        error_propagate(errp, local_err);
>       }
>   
>       return ret;
>

Re: [PATCH v9 24/27] migration: Propagate last encountered error in vmstate_save_state_v() function

Posted by Arun Menon 6 months, 1 week ago

Hi Akihiko,

Thanks for the review.

On Wed, Aug 06, 2025 at 02:24:37PM +0900, Akihiko Odaki wrote:
> On 2025/08/06 3:25, Arun Menon wrote:
> > We consider,
> >    - the saving of the device state (save or subsection save) and
> >    - running the cleanup after (post_save)
> > as an atomic operation. And that is why, post_save() is called regardless
> > of whether there is a preceeding error. This means that, it is possible
> > that we have 2 distict errors, one from the preceeding function and the
> > other from the post_save() function.
> > 
> > This commit changes the error handling behavior when two errors occur during
> > a save operation.
> > 
> > 1) Preceding errors were propagated before, but they are now dismissed, if there
> >     is a new post_save() error.
> >      - A failure during the main save operation, means the system failed to
> >        reach a new desired state. A failure during the post-save cleanup,
> >        however, is a more critical issue as it can leave the system in an
> >        inconsistent or corrupted state. At present, all post_save() calls
> >        succeed. However, the introduction of error handling variants of these
> >        functions (post_save_errp) in the following commit, imply that we need
> >        to handle and propagate these errors. Therefore, we prioritize reporting
> >        the more severe error.
> 
> This explains why the post_save() error is propagated instead of propagating
> the preceding error. But the preceding errors still do not have to be
> dismissed if we report them with error_report_err() instead.

Yes, both can also be reported.

> 
> > 
> > 2) post_save() errors were dismissed before, but they are now propagated.
> >      - The original design implicitly assumed post-save hooks were infallible
> >        cleanup routines.
> >        This will not be the case after introducing the post/pre save/load errp
> >        variant functions. Dismissing these errors prevents users from
> >        identifying the specific operation that failed.
> 
> Re-iterating previous comments, if introducing post save errp variant
> functions break the assumption, we just don't have to introduce one. The
> reason to introduce one needs to be explained.

Sure, let's keep the original design and try the approach where we rename
post_save to cleanup_save and return void. This should make things clear.

That way, we can avoid introducing post_save_errp() and also we can discard
patches [PATCH v9 23/27] Refactor vmstate_save_state_v() function and this
one [PATCH v9 24/27] Propagate last encountered error in vmstate_save_state_v()
function.

I will post a new version with that patch.

> 
> >        The entire save-and-cleanup process is treated as a single
> >        logical operation, and all potential failures are communicated.
> > 
> > Signed-off-by: Arun Menon <armenon@redhat.com>
> > ---
> >   migration/vmstate.c | 14 ++++++++++++--
> >   1 file changed, 12 insertions(+), 2 deletions(-)
> > 
> > diff --git a/migration/vmstate.c b/migration/vmstate.c
> > index fbc59caadbbcc75fe6de27b459aa9aa25e76aa0a..ef78a1e62ad92e9608de72d125da80ea496c8dd1 100644
> > --- a/migration/vmstate.c
> > +++ b/migration/vmstate.c
> > @@ -554,6 +554,12 @@ static int vmstate_save_dispatch(QEMUFile *f,
> >                       error_setg(errp, "Save of field %s/%s failed",
> >                                   vmsd->name, field->name);
> >                       ps_ret = post_save_dispatch(vmsd, opaque, &local_err);
> > +                    if (ps_ret) {
> > +                        ret = ps_ret;
> > +                        error_free_or_abort(errp);
> > +                        error_setg(errp, "post-save for %s failed, ret: %d",
> > +                                   vmsd->name, ret);
> > +                    }
> >                       return ret;
> >                   }
> > @@ -603,10 +609,14 @@ int vmstate_save_state_v(QEMUFile *f, const VMStateDescription *vmsd,
> >       }
> >       ret = vmstate_subsection_save(f, vmsd, opaque, vmdesc, errp);
> > +
> >       ps_ret = post_save_dispatch(vmsd, opaque, &local_err);
> > -    if (!ret && ps_ret) {
> > +    if (ps_ret) {
> > +        if (ret) {
> > +            error_free_or_abort(errp);
> > +        }
> >           ret = ps_ret;
> > -        error_setg(errp, "post-save failed: %s", vmsd->name);
> > +        error_propagate(errp, local_err);
> >       }
> >       return ret;
> > 
> 
Regards,
Arun