On Thu, Sep 18, 2025 at 08:53:17PM +0530, Arun Menon wrote:
> Hello,
>
> Currently, when a migration of a VM with an encrypted vTPM
> fails on the destination host (e.g., due to a mismatch in secret values),
> the error message displayed on the source host is generic and unhelpful.
>
> For example, a typical error looks like this:
> "operation failed: job 'migration out' failed: Sibling indicated error 1.
> operation failed: job 'migration in' failed: load of migration failed:
> Input/output error"
>
> This message does not provide any specific indication of a vTPM failure.
> Such generic errors are logged using error_report(), which prints to
> the console/monitor but does not make the detailed error accessible via
> the QMP query-migrate command.
>
> This series addresses the issue, by ensuring that specific TPM error
> messages are propagated via the QEMU Error object.
> To make this possible,
> - A set of functions in the call stack is changed
> to incorporate an Error object as an additional parameter.
> - Also, the TPM backend makes use of a new hook called post_load_errp()
> that explicitly passes an Error object.
>
> It is organized as follows,
> - Patches 1-23 focuses on pushing Error object into the functions
> that are important in the call stack where TPM errors are observed.
> We still need to make changes in rest of the functions in savevm.c
> such that they also incorporate the errp object for propagating errors.
> - Patches 12, 13, 20, are minor refactoring changes.
> - Patch 24 removes error variant of vmstate_save_state() function.
> - Patch 25 renames post_save() to cleanup_save()
> - Patch 26 introduces the new variants of the hooks in VMStateDescription
> structure. These hooks should be used in future implementations.
> - Patch 27 focuses on changing the TPM backend such that the errors are
> set in the Error object.
>
> While this series focuses specifically on TPM error reporting during
> live migration, it lays the groundwork for broader improvements.
> A lot of methods in savevm.c that previously returned an integer now capture
> errors in the Error object, enabling other modules to adopt the
> post_load_errp hook in the future.
>
> One such change previously attempted:
> https://lists.gnu.org/archive/html/qemu-devel/2021-02/msg01727.html
>
> Resolves: https://issues.redhat.com/browse/RHEL-82826
>
> Signed-off-by: Arun Menon <armenon@redhat.com>
> ---
> Changes in v14:
> - Addressed two regressions introduced in
> [PATCH 07/27] migration: push Error **errp into qemu_loadvm_state()
> thanks to Akihiko's suggestions.
> a) duplicate error reporting: The previous patch introduced a duplicate
> error message when 'exit_on_error' is false and the migration failed.
> This is not trivial to fix because there is one such scenario in postcopy
> migration where querying the error message using the QMP client or libvirt
> is not possible as the destination abruptly exits.
> However, the logic to print error is put into a separate specific path and
> a comment is added in this path, which can be improved in future to catch the
> specific case.
> b) Ignoring the error in the migration state and reporting local_err:
> Solved this by restoring the patch 07.
> - Link to v13: https://lore.kernel.org/qemu-devel/20250830-propagate_tpm_error-v13-0-a4e777b7eb2c@redhat.com
>
> Changes in v13:
> - Akihiko suggested to use error_report_err() instead of warn_report_err().
> We must not prefix error messages with "warning:" unless the error is a non-critical
> failure that can be logged while the program continues to function.
> - Removed error_prepend() with ERRP_GUARD() where it is not required within
> the series.
> - Link to v12: https://lore.kernel.org/qemu-devel/20250821-propagate_tpm_error-v12-0-72b803e707dc@redhat.com
>
> Changes in v12:
> - Remove error_prepend() calls where no additional information is appended to
> the error string. This also allows us to remove unnecessary ERRP_GUARD().
> - Avoid ambiguity by propagating clear messages in errp.
> - Add clarity to commit messages throughout the series.
> - Link to v11: https://lore.kernel.org/qemu-devel/20250813-propagate_tpm_error-v11-0-b470a374b42d@redhat.com
>
> Changes in v11:
> - Remove unnecessary NULL check in postcopy_ram_listen_thread.
> - Change error_warn to error_fatal or pass local_err wherever appropriate, because,
> https://lore.kernel.org/qemu-devel/20250808080823.2638861-13-armbru@redhat.com/
> Most changes are in patches 2,24.
> - Link to v10: https://lore.kernel.org/qemu-devel/20250808-propagate_tpm_error-v10-0-3e81a1d419b2@redhat.com
>
> Changes in v10:
> - Remove the patch to propagate most recent error and the patch of refactoring
> vmstate_save_state_v(): 23,24. They are not required because we intend to keep
> the design as is.
> - Added 2 new patches
> - patch 25: Rename post_save() to cleanup_save() and make it void
> - patch 20: Return -1 on memory allocation failure in ram.c
> - Pass &error_warn or &error_fatal to capture error or exit on error.
> - Link to v9: https://lore.kernel.org/qemu-devel/20250805-propagate_tpm_error-v9-0-123450810db7@redhat.com
>
> Changes in v9:
> - Re ordering patches such that error is reported in each one of them.
> - format specifier enclosed in '' changed i.e. '%d' changed to %d
> - Reporting errors where they were missed before. Setting errp to NULL
> in case of retry.
> - Link to v8: https://lore.kernel.org/qemu-devel/20250731-propagate_tpm_error-v8-0-28fd82fdfdb2@redhat.com
>
> Changes in v8:
> - 3 new patches added:
> - patch 23:
> - Changes the error propagation by returning the most recent error
> to the caller when both save device state and post_save fails.
> - patch 24:
> - Refactors the vmstate_save_state_v() function by adding wrapper
> functions to separate concerns.
> - patch 25:
> - Removes the error variant of the vmstate_save_state()
> function introduced in commit 969298f9d7.
> - Use ERRP_GUARD() where there is an errp dereference or an error_prepend call.
> - Pass &error_warn in place of NULL, in vmstate_load_state() calls so
> that the caller knows about the error.
> - Remove unnecessary null check before setting errp. Dereferencing it is not required.
> - Documentation for the new variants of post/pre save/load hooks added.
> - Some patches, although they received a 'Reviewed-by' tag, have undergone few minor changes,
> Patch 1 : removed extra space
> Patch 2 : Commit message changed, refactoring the function to
> always set errp and return.
> Patch 8 : Commit message changed.
> Patch 9 : use error_setg_errno instead of error_setg.
> Patch 27 : use error_setg_errno instead of error_setg.
> - Link to v7: https://lore.kernel.org/qemu-devel/20250725-propagate_tpm_error-v7-0-d52704443975@redhat.com
>
> Changes in v7:
> - Fix propagating errors in post_save_errp. The latest error encountered is
> propagated.
> - user-strings in error_prepend() calls now end with a ': ' so that the print is pretty.
> - Change the order of one of the patches.
> - Link to v6: https://lore.kernel.org/qemu-devel/20250721-propagate_tpm_error-v6-0-fef740e15e17@redhat.com
>
> Changes in v6:
> - Incorporated review comments from Daniel and Akihiko, related to few
> semantic errors and improve error logging.
> - Add one more patch that removes NULL checks after calling
> qemu_file_get_return_path() because it does not fail.
> - Link to v5: https://lore.kernel.org/qemu-devel/20250717-propagate_tpm_error-v5-0-1f406f88ee65@redhat.com
>
> Changes in v5:
> - Solve a bug that set errp even though it was not NULL, pointed out by Fabiano in v4.
> - Link to v4: https://lore.kernel.org/qemu-devel/20250716-propagate_tpm_error-v4-0-7141902077c0@redhat.com
>
> Changes in v4:
> - Split the patches into smaller ones based on functions. Pass NULL in the
> caller until errp is made available. Every function that has an
> Error **errp object passed to it, ensures that it sets the errp object
> in case of failure.
> - A few more functions within loadvm_process_command() now handle errors using
> the errp object. I've converted these for consistency, taking Daniel's
> patches (link above) as a reference.
> - Along with the post_load_errp() hook, other duplicate hooks are also introduced.
> This will enable us to migrate to the newer versions eventually.
> - Fix some semantic errors, like using error_propagate_prepend() in places where
> we need to preserve existing behaviour of accumulating the error in local_err
> and then propagating it to errp. This can be refactored in a later commit.
> - Add more information in commit messages explaining the changes.
> - Link to v3: https://lore.kernel.org/qemu-devel/20250702-propagate_tpm_error-v3-0-986d94540528@redhat.com
>
> Changes in v3:
> - Split the 2nd patch into 2. Introducing post_load_with_error() hook
> has been separated from using it in the backends TPM module. This is
> so that it can be acknowledged.
> - Link to v2: https://lore.kernel.org/qemu-devel/20250627-propagate_tpm_error-v2-0-85990c89da29@redhat.com
>
> Changes in v2:
> - Combine the first two changes into one, focusing on passing the
> Error object (errp) consistently through functions involved in
> loading the VM's state. Other functions are not yet changed.
> - As suggested in the review comment, add null checks for errp
> before adding error messages, preventing crashes.
> We also now correctly set errors when post-copy migration fails.
> - In process_incoming_migration_co(), switch to error_prepend
> instead of error_setg. This means we now null-check local_err in
> the "fail" section before using it, preventing dereferencing issues.
> - Link to v1: https://lore.kernel.org/qemu-devel/20250624-propagate_tpm_error-v1-0-2171487a593d@redhat.com
>
> ---
> Arun Menon (27):
> migration: push Error **errp into vmstate_subsection_load()
> migration: push Error **errp into vmstate_load_state()
> migration: push Error **errp into qemu_loadvm_state_header()
> migration: push Error **errp into vmstate_load()
> migration: push Error **errp into loadvm_process_command()
> migration: push Error **errp into loadvm_handle_cmd_packaged()
> migration: push Error **errp into qemu_loadvm_state()
> migration: push Error **errp into qemu_load_device_state()
> migration: push Error **errp into qemu_loadvm_state_main()
> migration: push Error **errp into qemu_loadvm_section_start_full()
> migration: push Error **errp into qemu_loadvm_section_part_end()
> migration: Update qemu_file_get_return_path() docs and remove dead checks
> migration: make loadvm_postcopy_handle_resume() void
> migration: push Error **errp into ram_postcopy_incoming_init()
> migration: push Error **errp into loadvm_postcopy_handle_advise()
> migration: push Error **errp into loadvm_postcopy_handle_listen()
> migration: push Error **errp into loadvm_postcopy_handle_run()
> migration: push Error **errp into loadvm_postcopy_ram_handle_discard()
> migration: push Error **errp into loadvm_handle_recv_bitmap()
> migration: Return -1 on memory allocation failure in ram.c
> migration: push Error **errp into loadvm_process_enable_colo()
> migration: push Error **errp into loadvm_postcopy_handle_switchover_start()
> migration: Capture error in postcopy_ram_listen_thread()
> migration: Remove error variant of vmstate_save_state() function
> migration: Rename post_save() to cleanup_save() and make it void
> migration: Add error-parameterized function variants in VMSD struct
> backends/tpm: Propagate vTPM error on migration failure
>
> backends/tpm/tpm_emulator.c | 40 ++---
> docs/devel/migration/main.rst | 21 ++-
> hw/display/virtio-gpu.c | 5 +-
> hw/pci/pci.c | 5 +-
> hw/ppc/spapr_pci.c | 5 +-
> hw/s390x/virtio-ccw.c | 4 +-
> hw/scsi/spapr_vscsi.c | 6 +-
> hw/vfio/pci.c | 9 +-
> hw/virtio/virtio-mmio.c | 5 +-
> hw/virtio/virtio-pci.c | 4 +-
> hw/virtio/virtio.c | 13 +-
> include/migration/colo.h | 2 +-
> include/migration/vmstate.h | 20 ++-
> migration/colo.c | 10 +-
> migration/cpr.c | 6 +-
> migration/migration.c | 38 ++---
> migration/postcopy-ram.c | 9 +-
> migration/postcopy-ram.h | 2 +-
> migration/qemu-file.c | 1 -
> migration/ram.c | 16 +-
> migration/ram.h | 4 +-
> migration/savevm.c | 334 ++++++++++++++++++++++++------------------
> migration/savevm.h | 7 +-
> migration/vmstate-types.c | 53 ++++---
> migration/vmstate.c | 115 ++++++++++-----
> target/arm/machine.c | 6 +-
> tests/unit/test-vmstate.c | 83 +++++++++--
> ui/vdagent.c | 8 +-
> 28 files changed, 523 insertions(+), 308 deletions(-)
> ---
> base-commit: f0007b7f03e2d7fc33e71c3a582f2364c51a226b
> change-id: 20250624-propagate_tpm_error-bf4ae6c23d30
>
> Best regards,
> --
> Arun Menon <armenon@redhat.com>
>
Hi,
Gentle ping for the series.
Is there something more to be done to improve this before queueing it?
TIA.
Regards,
Arun Menon