Invalidate cache before source start in case of failed migration.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
migration/migration.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/migration/migration.c b/migration/migration.c
index 1e99ec9b7e..8f39e0dc02 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2806,7 +2806,14 @@ static void migration_iteration_finish(MigrationState *s)
case MIGRATION_STATUS_FAILED:
case MIGRATION_STATUS_CANCELLED:
if (s->vm_was_running) {
- vm_start();
+ Error *local_err = NULL;
+ bdrv_invalidate_cache_all(&local_err);
+ if (local_err) {
+ error_reportf_err(local_err, "Can't invalidate disks before "
+ "source vm start");
+ } else {
+ vm_start();
+ }
} else {
if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
runstate_set(RUN_STATE_POSTMIGRATE);
--
2.11.1
* Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> Invalidate cache before source start in case of failed migration.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Why doesn't the code at the bottom of migration_completion,
fail_invalidate: and the code in migrate_fd_cancel handle this?

What case did you see it in that those didn't handle?
(Also I guess it probably should set s->block_inactive = false)

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
15.06.2018 15:06, Dr. David Alan Gilbert wrote:
> Why doesn't the code at the bottom of migration_completion,
> fail_invalidate: and the code in migrate_fd_cancel handle this?
>
> What case did you see it in that those didn't handle?
> (Also I guess it probably should set s->block_inactive = false)

On the source I see:

81392@1529065750.766289:migrate_set_state new state 7
81392@1529065750.766330:migration_thread_file_err
81392@1529065750.766332:migration_thread_after_loop

so we are leaving the loop on

    if (qemu_file_get_error(s->to_dst_file)) {
        migrate_set_state(&s->state, current_active_state,
                          MIGRATION_STATUS_FAILED);
        trace_migration_thread_file_err();
        break;
    }

and skipping migration_completion().

--
Best regards,
Vladimir
On 06/15/2018 08:33 AM, Vladimir Sementsov-Ogievskiy wrote:
> so we are leaving the loop on
>     if (qemu_file_get_error(s->to_dst_file)) {
>         migrate_set_state(&s->state, current_active_state,
>                           MIGRATION_STATUS_FAILED);
>         trace_migration_thread_file_err();
>         break;
>     }
>
> and skip migration_completion()

OK, so you're saying that in migration_thread:

    thr_error = migration_detect_error(s);
    if (thr_error == MIG_THR_ERR_FATAL) {
        /* Stop migration */
        break;
    }

is triggering, so then we jump right away to

    trace_migration_thread_after_loop();
    migration_iteration_finish(s);

and in so doing, we fail before we get a chance for
migration_iteration_run to call migration_completion, which was
otherwise the mechanism that invalidated the images.

Seems like a legitimate concern that early failures versus late
failures trigger different error pathways. David?

--js
* Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> so, we are leaving loop on
>     if (qemu_file_get_error(s->to_dst_file)) {
>         migrate_set_state(&s->state, current_active_state,
>                           MIGRATION_STATUS_FAILED);
>         trace_migration_thread_file_err();
>         break;
>     }
>
> and skip migration_completion()

Yeh, OK; I'd seen something else a few days ago, where a cancellation
test that had previously ended with a 'cancelled' state has now ended
up in 'failed' (which is the state 7 you have above).

I suspect there's something else going on as well; I think what is
supposed to happen in the case of 'cancel' is that it spins in
'cancelling' for a while in migrate_fd_cancel and then, at the bottom
of migrate_fd_cancel, it does the recovery, but because it's going to
'failed' instead, it's jumping over that recovery.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> I suspect there's something else going on as well; I think what is
> supposed to happen in the case of 'cancel' is that it spins in
> 'cancelling' for a while in migrate_fd_cancel and then at the bottom
> of migrate_fd_cancel it does the recovery, but because it's going to
> failed instead, then it's jumping over that recovery.

Going back and actually looking at the patch again, can I ask for one
small change: can you set s->block_inactive = false in the case where
you don't get the local_err (like we do at the bottom of
migrate_fd_cancel)?

Does that make sense?

Thanks,

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 06/25/2018 01:50 PM, Dr. David Alan Gilbert wrote:
> Going back and actually looking at the patch again;
> can I ask for 1 small change;
> Can you set s->block_inactive = false in the case where you
> don't get the local_err (Like we do at the bottom of migrate_fd_cancel)
>
> Does that make sense?

Vladimir, one more question for you, because I'm not as familiar with
this code:

In the normal case we need to invalidate the qcow2 cache as a way to
re-engage the disk (yes?) when we have failed during the late-migration
steps.

In this case, we seem to be observing a failure during the bulk
transfer loop. Why is it important to invalidate the cache at this
step -- would the disk have been inactivated yet? It shouldn't, because
it's in the bulk transfer phase -- or am I missing something?

I feel like this code is behaving in a way that's fairly surprising for
a casual reader, so I was hoping you could elaborate for me.

--js
25.06.2018 21:03, John Snow wrote:
> In this case, we seem to be observing a failure during the bulk
> transfer loop. Why is it important to invalidate the cache at this
> step -- would the disk have been inactivated yet? It shouldn't, because
> it's in the bulk transfer phase -- or am I missing something?

In my case, the source is already in the postcopy state when the error
occurred, so it is inactivated.

--
Best regards,
Vladimir
* Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> In my case, source is already in postcopy state, when error occured,
> so it is inactivated.

Ah, that explains why I couldn't understand the path that got you
there; I never think about restarting the source once we're in
postcopy, because once the destination is running all is lost. But you
might be in the gap before management has actually started the
destination, so it does need fixing.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
25.06.2018 20:50, Dr. David Alan Gilbert wrote:
> Yeh, OK; I'd seen something else a few days ago, where a cancellation
> test that had previously ended with a 'cancelled' state has now ended
> up in 'failed' (which is the state 7 you have above).

John is right, this is an unrelated log; here we fail before
inactivation and there are no problems.

The actual problem is when we fail in postcopy_start, at the end. And
the source log looks like:

84297@1530001796.287344:migrate_set_state new state 1
84297@1530001796.287374:migration_fd_outgoing fd=101
84297@1530001796.287383:migration_set_outgoing_channel ioc=0x56363454d630 ioctype=qio-channel-socket hostname=(null)
84297@1530001796.294032:migration_bitmap_sync_start
84297@1530001796.300483:migration_bitmap_sync_end dirty_pages 932
84297@1530001796.300561:migrate_set_state new state 4
84297@1530001796.300588:migration_thread_setup_complete
84297@1530001796.300593:migrate_pending pending size 1107976192 max 0 (pre = 0 compat=1107976192 post=0)
84297@1530001796.300595:migrate_set_state new state 5
Tap fd 33 disable, ret 0
84297@1530001796.426477:migration_bitmap_sync_start
84297@1530001796.433227:migration_bitmap_sync_end dirty_pages 1091
84297@1530001796.439077:migrate_global_state_pre_save saved state: running
2018-06-26T08:29:56.439134Z qemu-kvm: postcopy_start: Migration stream errored -5
84297@1530001796.439141:migrate_set_state new state 7
84297@1530001796.439181:migration_thread_after_loop
Tap fd 33 enable, ret 0
84297@1530001796.453639:migrate_fd_cleanup
qemu-kvm: block/io.c:1655: bdrv_co_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed.
2018-06-26 08:29:56.605+0000: shutting down, reason=crashed

> Going back and actually looking at the patch again;
> can I ask for 1 small change;
> Can you set s->block_inactive = false in the case where you
> don't get the local_err (Like we do at the bottom of migrate_fd_cancel)
>
> Does that make sense?

Ok, I'll resend.

Hm, it looks like I'm fixing an outdated version (based on v2.9.0), and
my reproducer isn't appropriate for upstream. But it looks like the
current code has the possibility of the same failure:

    postcopy_start()
    ....
        ret = qemu_file_get_error(ms->to_dst_file);
        if (ret) {
            error_report("postcopy_start: Migration stream errored");

leads to "return MIG_ITERATE_SKIP;" in migration_iteration_run; then
the loop should finish, as the state should be MIGRATION_STATUS_FAILED,
so we will not call migration_completion.

Hm, I have questions now:

1. Should we check s->block_inactive, and if it is false, not
invalidate? This is done in migrate_fd_cancel(), but not done in
migration_completion().
2. Should we call qemu_mutex_lock_iothread() like in
migration_completion()? Why is it needed in migration_completion(),
when the vm is not running?

--
Best regards,
Vladimir
26.06.2018 11:44, Vladimir Sementsov-Ogievskiy wrote:
> Hm, I have questions now:
>
> 1. Should we check s->block_inactive, and if it is false, not
> invalidate? This is done in migrate_fd_cancel(), but not done in
> migration_completion().
> 2. Should we call qemu_mutex_lock_iothread() like in
> migration_completion()? Why is it needed in migration_completion(),
> when the vm is not running?

Hm, forgotten thread. I should resend, but what do you think about
these questions?

--
Best regards,
Vladimir
On 10/08/2018 11:36 AM, Vladimir Sementsov-Ogievskiy wrote:
> 26.06.2018 11:44, Vladimir Sementsov-Ogievskiy wrote:
>> 25.06.2018 20:50, Dr. David Alan Gilbert wrote:
>>> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
>>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>>> 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
>>>>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>>>>> Invalidate cache before source start in case of failed migration.
>>>>>>>
>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy
>>>>>>> <vsementsov@virtuozzo.com>
>>>>>> Why doesn't the code at the bottom of migration_completion,
>>>>>> fail_invalidate: and the code in migrate_fd_cancel handle this?
>>>>>>
>>>>>> What case did you see it in that those didn't handle?
>>>>>> (Also I guess it probably should set s->block_inactive = false)
>>>>> on source I see:
>>>>>
>>>>> 81392@1529065750.766289:migrate_set_state new state 7
>>>>> 81392@1529065750.766330:migration_thread_file_err
>>>>> 81392@1529065750.766332:migration_thread_after_loop
>>>>>
>>>>> so, we are leaving the loop on
>>>>>     if (qemu_file_get_error(s->to_dst_file)) {
>>>>>         migrate_set_state(&s->state, current_active_state,
>>>>>                           MIGRATION_STATUS_FAILED);
>>>>>         trace_migration_thread_file_err();
>>>>>         break;
>>>>>     }
>>>>>
>>>>> and skip migration_completion()
>>
>> John is right, this is an unrelated log; here we fail before
>> inactivation and there are no problems.
>>
>> The actual problem is when we fail in postcopy_start, at the end.
>> And the source log looks like:
>>
>> 84297@1530001796.287344:migrate_set_state new state 1
>> 84297@1530001796.287374:migration_fd_outgoing fd=101
>> 84297@1530001796.287383:migration_set_outgoing_channel
>> ioc=0x56363454d630 ioctype=qio-channel-socket hostname=(null)
>> 84297@1530001796.294032:migration_bitmap_sync_start
>> 84297@1530001796.300483:migration_bitmap_sync_end dirty_pages 932
>> 84297@1530001796.300561:migrate_set_state new state 4
>> 84297@1530001796.300588:migration_thread_setup_complete
>> 84297@1530001796.300593:migrate_pending pending size 1107976192 max 0
>> (pre = 0 compat=1107976192 post=0)
>> 84297@1530001796.300595:migrate_set_state new state 5
>> Tap fd 33 disable, ret 0
>> 84297@1530001796.426477:migration_bitmap_sync_start
>> 84297@1530001796.433227:migration_bitmap_sync_end dirty_pages 1091
>> 84297@1530001796.439077:migrate_global_state_pre_save saved state: running
>> 2018-06-26T08:29:56.439134Z qemu-kvm: postcopy_start: Migration stream
>> errored -5
>> 84297@1530001796.439141:migrate_set_state new state 7
>> 84297@1530001796.439181:migration_thread_after_loop
>> Tap fd 33 enable, ret 0
>> 84297@1530001796.453639:migrate_fd_cleanup
>> qemu-kvm: block/io.c:1655: bdrv_co_pwritev: Assertion
>> `!(bs->open_flags & 0x0800)' failed.
>> 2018-06-26 08:29:56.605+0000: shutting down, reason=crashed
>>
>>>> Yeh, OK; I'd seen something else a few days ago, where a cancellation
>>>> test that had previously ended with a 'cancelled' state has now
>>>> ended up in 'failed' (which is the state 7 you have above).
>>>> I suspect there's something else going on as well; I think what is
>>>> supposed to happen in the case of 'cancel' is that it spins in
>>>> 'cancelling' for a while in migrate_fd_cancel and then at the bottom
>>>> of migrate_fd_cancel it does the recovery, but because it's going to
>>>> 'failed' instead, it's jumping over that recovery.
>>> Going back and actually looking at the patch again;
>>> can I ask for 1 small change;
>>> Can you set s->block_inactive = false in the case where you
>>> don't get the local_err (like we do at the bottom of migrate_fd_cancel)?
>>>
>>> Does that make sense?
>>
>> Ok, I'll resend.
>>
>> Hm, it looks like I'm fixing an outdated version (based on v2.9.0) and my
>> reproducer isn't appropriate for upstream.
>> But it looks like the current code has a possibility of the same failure:
>>
>> postcopy_start()
>>     ....
>>     ret = qemu_file_get_error(ms->to_dst_file);
>>     if (ret) {
>>         error_report("postcopy_start: Migration stream errored");
>>
>> leads to "return MIG_ITERATE_SKIP;" in migration_iteration_run,
>>
>> then the loop should finish, as the state should be
>> MIGRATION_STATUS_FAILED, so we will not call migration_completion.
>>
>> Hm, I have questions now:
>>
>> 1. should we check s->block_inactive, and if it is false, not
>> invalidate? It is done in migrate_fd_cancel(), but not done in
>> migration_completion().
>> 2. should we call qemu_mutex_lock_iothread() like in
>> migration_completion()? Why is it needed in migration_completion(),
>> when the vm is not running?
>
> Hm, forgotten thread; I should resend, but what do you think about these
> questions?
>

Personally, I can't remember where we stand on this thread at all. If you
have changes that you think are still important (because you ran into the
issue again), it's probably best to send a respin (with a new cover letter
and updated justification) and we'll take it from there.

(I seem to recall we were working off of different versions which handled
migration differently, which was causing confusion...)

--js