In case of error, close_return_path_on_source() can perform a shutdown
to exit the return-path thread. However, in migrate_fd_cleanup(),
'to_dst_file' is closed before calling close_return_path_on_source()
and the shutdown fails, leaving the source and destination waiting for
an event to occur.

Close the file after calling close_return_path_on_source() so that the
shutdown succeeds and the return-path thread exits.

Signed-off-by: Cédric Le Goater <clg@redhat.com>
---
 migration/migration.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 2c3362235c7651c11d581f3c3639571f1f9636ef..1e0b6acaedc272e8ce26ad40be2c42177f5fd14e 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1314,6 +1314,7 @@ void migrate_set_state(int *state, int old_state, int new_state)
 static void migrate_fd_cleanup(MigrationState *s)
 {
     int file_error = 0;
+    QEMUFile *tmp = NULL;
 
     g_free(s->hostname);
     s->hostname = NULL;
@@ -1323,8 +1324,6 @@ static void migrate_fd_cleanup(MigrationState *s)
     qemu_savevm_state_cleanup();
 
     if (s->to_dst_file) {
-        QEMUFile *tmp;
-
         trace_migrate_fd_cleanup();
         bql_unlock();
         if (s->migration_thread_running) {
@@ -1344,15 +1343,14 @@ static void migrate_fd_cleanup(MigrationState *s)
          * critical section won't block for long.
          */
         migration_ioc_unregister_yank_from_file(tmp);
-        qemu_fclose(tmp);
     }
 
-    /*
-     * We already cleaned up to_dst_file, so errors from the return
-     * path might be due to that, ignore them.
-     */
     close_return_path_on_source(s, file_error);
 
+    if (tmp) {
+        qemu_fclose(tmp);
+    }
+
     assert(!migration_is_active(s));
 
     if (s->state == MIGRATION_STATUS_CANCELLING) {
--
2.43.0
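
The ordering problem described in the commit message can be sketched with a toy model. This is only an illustration under stated assumptions, not QEMU code: it assumes both files wrap one shared channel and that closing either file closes that channel. ToyChannel, ToyFile and the toy_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a channel shared by both directions (assumption). */
typedef struct ToyChannel {
    bool open;
    bool shut_down;
} ToyChannel;

/* Toy stand-in for a file object wrapping the shared channel. */
typedef struct ToyFile {
    ToyChannel *ioc;
} ToyFile;

/* Returns 0 on success, -1 if the underlying channel is already closed. */
static int toy_file_shutdown(ToyFile *f)
{
    assert(f != NULL && f->ioc != NULL);
    if (!f->ioc->open) {
        return -1;
    }
    f->ioc->shut_down = true;
    return 0;
}

/* Closing a file closes the shared channel in this toy model. */
static void toy_file_close(ToyFile *f)
{
    assert(f != NULL && f->ioc != NULL);
    f->ioc->open = false;
}

/* Old ordering: close the outgoing file, then try to shut down. */
static int toy_close_then_shutdown(void)
{
    ToyChannel ioc = { .open = true, .shut_down = false };
    ToyFile to_dst = { &ioc }, from_dst = { &ioc };

    toy_file_close(&to_dst);
    return toy_file_shutdown(&from_dst);
}

/* Ordering proposed by the patch: shut down first, close afterwards. */
static int toy_shutdown_then_close(void)
{
    ToyChannel ioc = { .open = true, .shut_down = false };
    ToyFile to_dst = { &ioc }, from_dst = { &ioc };
    int ret = toy_file_shutdown(&from_dst);

    toy_file_close(&to_dst);
    return ret;
}
```

In this model toy_close_then_shutdown() reports a failed shutdown while toy_shutdown_then_close() succeeds, which is the effect the patch aims for by moving qemu_fclose() after close_return_path_on_source().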

Cédric Le Goater <clg@redhat.com> writes:

> In case of error, close_return_path_on_source() can perform a shutdown
> to exit the return-path thread. However, in migrate_fd_cleanup(),
> 'to_dst_file' is closed before calling close_return_path_on_source()
> and the shutdown fails, leaving the source and destination waiting for
> an event to occur.

At close_return_path_on_source, qemu_file_shutdown() and checking
ms->to_dst_file are done under the qemu_file_lock, so how could
migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
check have passed?

>
> Close the file after calling close_return_path_on_source() so that the
> shutdown succeeds and the return-path thread exits.
>
> Signed-off-by: Cédric Le Goater <clg@redhat.com>
> ---
>  migration/migration.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 2c3362235c7651c11d581f3c3639571f1f9636ef..1e0b6acaedc272e8ce26ad40be2c42177f5fd14e 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1314,6 +1314,7 @@ void migrate_set_state(int *state, int old_state, int new_state)
>  static void migrate_fd_cleanup(MigrationState *s)
>  {
>      int file_error = 0;
> +    QEMUFile *tmp = NULL;
>
>      g_free(s->hostname);
>      s->hostname = NULL;
> @@ -1323,8 +1324,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>      qemu_savevm_state_cleanup();
>
>      if (s->to_dst_file) {
> -        QEMUFile *tmp;
> -
>          trace_migrate_fd_cleanup();
>          bql_unlock();
>          if (s->migration_thread_running) {
> @@ -1344,15 +1343,14 @@ static void migrate_fd_cleanup(MigrationState *s)
>           * critical section won't block for long.
>           */
>          migration_ioc_unregister_yank_from_file(tmp);
> -        qemu_fclose(tmp);
>      }
>
> -    /*
> -     * We already cleaned up to_dst_file, so errors from the return
> -     * path might be due to that, ignore them.
> -     */
>      close_return_path_on_source(s, file_error);
>
> +    if (tmp) {
> +        qemu_fclose(tmp);
> +    }
> +
>      assert(!migration_is_active(s));
>
>      if (s->state == MIGRATION_STATUS_CANCELLING) {

On 2/2/24 15:42, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
>
>> In case of error, close_return_path_on_source() can perform a shutdown
>> to exit the return-path thread. However, in migrate_fd_cleanup(),
>> 'to_dst_file' is closed before calling close_return_path_on_source()
>> and the shutdown fails, leaving the source and destination waiting for
>> an event to occur.
>
> At close_return_path_on_source, qemu_file_shutdown() and checking
> ms->to_dst_file are done under the qemu_file_lock, so how could
> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
> check have passed?

This is not a locking issue, it's much simpler. migrate_fd_cleanup()
clears the ms->to_dst_file pointer and closes the QEMUFile and then
calls close_return_path_on_source() which then tries to use resources
which are not available anymore.

Thanks,

C.

>
>>
>> Close the file after calling close_return_path_on_source() so that the
>> shutdown succeeds and the return-path thread exits.
>>
>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>> ---
>>  migration/migration.c | 12 +++++-------
>>  1 file changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 2c3362235c7651c11d581f3c3639571f1f9636ef..1e0b6acaedc272e8ce26ad40be2c42177f5fd14e 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1314,6 +1314,7 @@ void migrate_set_state(int *state, int old_state, int new_state)
>>  static void migrate_fd_cleanup(MigrationState *s)
>>  {
>>      int file_error = 0;
>> +    QEMUFile *tmp = NULL;
>>
>>      g_free(s->hostname);
>>      s->hostname = NULL;
>> @@ -1323,8 +1324,6 @@ static void migrate_fd_cleanup(MigrationState *s)
>>      qemu_savevm_state_cleanup();
>>
>>      if (s->to_dst_file) {
>> -        QEMUFile *tmp;
>> -
>>          trace_migrate_fd_cleanup();
>>          bql_unlock();
>>          if (s->migration_thread_running) {
>> @@ -1344,15 +1343,14 @@ static void migrate_fd_cleanup(MigrationState *s)
>>           * critical section won't block for long.
>>           */
>>          migration_ioc_unregister_yank_from_file(tmp);
>> -        qemu_fclose(tmp);
>>      }
>>
>> -    /*
>> -     * We already cleaned up to_dst_file, so errors from the return
>> -     * path might be due to that, ignore them.
>> -     */
>>      close_return_path_on_source(s, file_error);
>>
>> +    if (tmp) {
>> +        qemu_fclose(tmp);
>> +    }
>> +
>>      assert(!migration_is_active(s));
>>
>>      if (s->state == MIGRATION_STATUS_CANCELLING) {
>

Cédric Le Goater <clg@redhat.com> writes:

> On 2/2/24 15:42, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@redhat.com> writes:
>>
>>> In case of error, close_return_path_on_source() can perform a shutdown
>>> to exit the return-path thread. However, in migrate_fd_cleanup(),
>>> 'to_dst_file' is closed before calling close_return_path_on_source()
>>> and the shutdown fails, leaving the source and destination waiting for
>>> an event to occur.
>>
>> At close_return_path_on_source, qemu_file_shutdown() and checking
>> ms->to_dst_file are done under the qemu_file_lock, so how could
>> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
>> check have passed?
>
> This is not a locking issue, it's much simpler. migrate_fd_cleanup()
> clears the ms->to_dst_file pointer and closes the QEMUFile and then
> calls close_return_path_on_source() which then tries to use resources
> which are not available anymore.

I'm missing something here. Which resources? I assume you're talking
about this:

    WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
            qemu_file_get_error(ms->to_dst_file)) {
            qemu_file_shutdown(ms->rp_state.from_dst_file);
        }
    }

How do we get past the 'if (ms->to_dst_file)'?

On Fri, Feb 02, 2024 at 12:11:09PM -0300, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
>
> > On 2/2/24 15:42, Fabiano Rosas wrote:
> >> Cédric Le Goater <clg@redhat.com> writes:
> >>
> >>> In case of error, close_return_path_on_source() can perform a shutdown
> >>> to exit the return-path thread. However, in migrate_fd_cleanup(),
> >>> 'to_dst_file' is closed before calling close_return_path_on_source()
> >>> and the shutdown fails, leaving the source and destination waiting for
> >>> an event to occur.
> >>
> >> At close_return_path_on_source, qemu_file_shutdown() and checking
> >> ms->to_dst_file are done under the qemu_file_lock, so how could
> >> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
> >> check have passed?
> >
> > This is not a locking issue, it's much simpler. migrate_fd_cleanup()
> > clears the ms->to_dst_file pointer and closes the QEMUFile and then
> > calls close_return_path_on_source() which then tries to use resources
> > which are not available anymore.
>
> I'm missing something here. Which resources? I assume you're talking
> about this:
>
>     WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>         if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>             qemu_file_get_error(ms->to_dst_file)) {
>             qemu_file_shutdown(ms->rp_state.from_dst_file);
>         }
>     }
>
> How do we get past the 'if (ms->to_dst_file)'?

We don't; migrate_fd_cleanup() will release ms->to_dst_file, then call
close_return_path_on_source(), found that to_dst_file==NULL and then skip
the shutdown().

One other option might be that we do close_return_path_on_source() before
the chunk of releasing to_dst_file.

This "two qemufiles share the same ioc" issue had bitten us before IIRC,
and the only concern of that workaround is we keep postponing resolution of
the real issue, then we keep getting bitten by it..

Maybe we can wait a few days to see if Dan can join the conversation and if
we can reach a consensus on a complete solution. Otherwise I think we can
still work this around, but maybe that'll require a comment block
explaining the bits after such movement.

Thanks,

--
Peter Xu
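
The skipped shutdown and the reordering option discussed above can be sketched in a small toy model. It mirrors only the shape of the guard quoted in the thread; locking is omitted, and ToyQEMUFile, ToyMigrationState and the toy_* functions are hypothetical names, not QEMU's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a file: an error flag and a "was shut down" flag. */
typedef struct ToyQEMUFile {
    bool has_error;
    bool shut_down;
} ToyQEMUFile;

/* Toy stand-in for the migration state holding both directions. */
typedef struct ToyMigrationState {
    ToyQEMUFile *to_dst_file;
    ToyQEMUFile *from_dst_file;
} ToyMigrationState;

/* Same shape as the quoted guard: shutdown needs to_dst_file to be set. */
static void toy_close_return_path(ToyMigrationState *ms)
{
    assert(ms != NULL);
    if (ms->to_dst_file && ms->from_dst_file &&
        ms->to_dst_file->has_error) {
        ms->from_dst_file->shut_down = true;
    }
}

/* Release to_dst_file first: the guard sees NULL and skips the shutdown. */
static bool toy_release_first(void)
{
    ToyQEMUFile to = { .has_error = true }, from = { 0 };
    ToyMigrationState ms = { &to, &from };

    ms.to_dst_file = NULL;
    toy_close_return_path(&ms);
    return from.shut_down;
}

/* Close the return path first, release to_dst_file afterwards. */
static bool toy_return_path_first(void)
{
    ToyQEMUFile to = { .has_error = true }, from = { 0 };
    ToyMigrationState ms = { &to, &from };

    toy_close_return_path(&ms);
    ms.to_dst_file = NULL;
    return from.shut_down;
}
```

In this model toy_release_first() leaves the return-path file untouched, while toy_return_path_first() shuts it down, which is why the order of the two steps matters.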

On 2/5/24 04:37, Peter Xu wrote:
> On Fri, Feb 02, 2024 at 12:11:09PM -0300, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@redhat.com> writes:
>>
>>> On 2/2/24 15:42, Fabiano Rosas wrote:
>>>> Cédric Le Goater <clg@redhat.com> writes:
>>>>
>>>>> In case of error, close_return_path_on_source() can perform a shutdown
>>>>> to exit the return-path thread. However, in migrate_fd_cleanup(),
>>>>> 'to_dst_file' is closed before calling close_return_path_on_source()
>>>>> and the shutdown fails, leaving the source and destination waiting for
>>>>> an event to occur.
>>>>
>>>> At close_return_path_on_source, qemu_file_shutdown() and checking
>>>> ms->to_dst_file are done under the qemu_file_lock, so how could
>>>> migrate_fd_cleanup() have cleared the pointer but the ms->to_dst_file
>>>> check have passed?
>>>
>>> This is not a locking issue, it's much simpler. migrate_fd_cleanup()
>>> clears the ms->to_dst_file pointer and closes the QEMUFile and then
>>> calls close_return_path_on_source() which then tries to use resources
>>> which are not available anymore.
>>
>> I'm missing something here. Which resources? I assume you're talking
>> about this:
>>
>>     WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>>         if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>>             qemu_file_get_error(ms->to_dst_file)) {
>>             qemu_file_shutdown(ms->rp_state.from_dst_file);
>>         }
>>     }
>>
>> How do we get past the 'if (ms->to_dst_file)'?
>
> We don't; migrate_fd_cleanup() will release ms->to_dst_file, then call
> close_return_path_on_source(), found that to_dst_file==NULL and then skip
> the shutdown().
>
> One other option might be that we do close_return_path_on_source() before
> the chunk of releasing to_dst_file.
>
> This "two qemufiles share the same ioc" issue had bitten us before IIRC,
> and the only concern of that workaround is we keep postponing resolution of
> the real issue, then we keep getting bitten by it..
>
> Maybe we can wait a few days to see if Dan can join the conversation and if
> we can reach a consensus on a complete solution. Otherwise I think we can
> still work this around, but maybe that'll require a comment block
> explaining the bits after such movement.

Yes. The series should have been sent with an RFC.

I changed PATCH 1 to use migrate_has_error() instead of
qemu_file_get_error(ms->to_dst_file). I will keep PATCH 2 as it is
for the time being and wait for more feedback.

The prereq series adds an Error** argument to the .save_setup() and
.log_global*() handlers. I should send it this week.

Thanks,

C.

>
> Thanks,
>
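
The revised PATCH 1 is not shown in this thread, so the following is only a toy illustration of what swapping the error source in the guard could look like. It assumes a scenario where an error has been recorded on the migration state but not on the outgoing file. Whether the real guard keeps the to_dst_file pointer check is not known from the thread; the sketch keeps it. ToyStream, ToyMigration and the toy_* functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a file with its own error flag. */
typedef struct ToyStream {
    bool file_error;
    bool shut_down;
} ToyStream;

/* Toy stand-in for the migration state. */
typedef struct ToyMigration {
    bool error_recorded;   /* models a state-level "has error" query */
    ToyStream *to_dst_file;
    ToyStream *from_dst_file;
} ToyMigration;

/* Guard keyed on the outgoing file's own error flag. */
static bool toy_guard_file_error(const ToyMigration *ms)
{
    assert(ms != NULL);
    return ms->to_dst_file && ms->from_dst_file &&
           ms->to_dst_file->file_error;
}

/* Guard keyed on the error recorded in the migration state. */
static bool toy_guard_state_error(const ToyMigration *ms)
{
    assert(ms != NULL);
    return ms->to_dst_file && ms->from_dst_file && ms->error_recorded;
}

/* Assumed scenario: state-level error set, file-level error not set. */
static bool toy_would_shutdown(bool use_state_error)
{
    ToyStream to = { .file_error = false }, from = { 0 };
    ToyMigration ms = {
        .error_recorded = true,
        .to_dst_file = &to,
        .from_dst_file = &from,
    };

    return use_state_error ? toy_guard_state_error(&ms)
                           : toy_guard_file_error(&ms);
}
```

Under that assumed scenario, only the state-level guard would request the shutdown; the file-level guard would not.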