[Qemu-devel] [PATCH] migration: invalidate cache before source start

Vladimir Sementsov-Ogievskiy posted 1 patch 5 years, 10 months ago
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20180609141457.6283-1-vsementsov@virtuozzo.com
Test checkpatch passed
Test docker-mingw@fedora passed
Test docker-quick@centos7 passed
Test s390x passed
[Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by Vladimir Sementsov-Ogievskiy 5 years, 10 months ago
Invalidate cache before source start in case of failed migration.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

 migration/migration.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 1e99ec9b7e..8f39e0dc02 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2806,7 +2806,14 @@ static void migration_iteration_finish(MigrationState *s)
     case MIGRATION_STATUS_FAILED:
     case MIGRATION_STATUS_CANCELLED:
         if (s->vm_was_running) {
-            vm_start();
+            Error *local_err = NULL;
+            bdrv_invalidate_cache_all(&local_err);
+            if (local_err) {
+                error_reportf_err(local_err, "Can't invalidate disks before "
+                                  "source vm start");
+            } else {
+                vm_start();
+            }
         } else {
             if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
                 runstate_set(RUN_STATE_POSTMIGRATE);
-- 
2.11.1
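
[Editor's note: background for readers new to this code. On successful
completion, the migration source inactivates its block devices (setting
BDRV_O_INACTIVE) so the destination can take them over safely;
bdrv_invalidate_cache_all() is the reverse operation, re-activating the
devices so the source may write to them again. The point of the patch is
that vm_start() after a failed migration is only safe once that
re-activation has happened.]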


Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by Dr. David Alan Gilbert 5 years, 10 months ago
* Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> Invalidate cache before source start in case of failed migration.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Why doesn't the code at the bottom of migration_completion,
fail_invalidate:   and the code in migrate_fd_cancel   handle this?

What case did you see it in that those didn't handle?
(Also I guess it probably should set s->block_inactive = false)

Dave
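
[Editor's note: the recovery path Dave refers to sits at the end of
migration_completion(). From memory of migration/migration.c of that era
it looks approximately like this; treat it as a sketch, not an exact
quote:

fail_invalidate:
    /* If not doing postcopy, vm_start() will be called: let's regain
     * control on images */
    if (s->state == MIGRATION_STATUS_ACTIVE) {
        Error *local_err = NULL;

        qemu_mutex_lock_iothread();
        bdrv_invalidate_cache_all(&local_err);
        if (local_err) {
            error_report_err(local_err);
        } else {
            s->block_inactive = false;
        }
        qemu_mutex_unlock_iothread();
    }

Crucially, it only runs if migration_completion() is reached at all,
which is the crux of the discussion below.]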

> ---
> 
>  migration/migration.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 1e99ec9b7e..8f39e0dc02 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2806,7 +2806,14 @@ static void migration_iteration_finish(MigrationState *s)
>      case MIGRATION_STATUS_FAILED:
>      case MIGRATION_STATUS_CANCELLED:
>          if (s->vm_was_running) {
> -            vm_start();
> +            Error *local_err = NULL;
> +            bdrv_invalidate_cache_all(&local_err);
> +            if (local_err) {
> +                error_reportf_err(local_err, "Can't invalidate disks before "
> +                                  "source vm start");
> +            } else {
> +                vm_start();
> +            }
>          } else {
>              if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
>                  runstate_set(RUN_STATE_POSTMIGRATE);
> -- 
> 2.11.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by Vladimir Sementsov-Ogievskiy 5 years, 10 months ago
15.06.2018 15:06, Dr. David Alan Gilbert wrote:
> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>> Invalidate cache before source start in case of failed migration.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Why doesn't the code at the bottom of migration_completion,
> fail_invalidate:   and the code in migrate_fd_cancel   handle this?
>
> What case did you see it in that those didn't handle?
> (Also I guess it probably should set s->block_inactive = false)

on source I see:

81392@1529065750.766289:migrate_set_state new state 7
81392@1529065750.766330:migration_thread_file_err
81392@1529065750.766332:migration_thread_after_loop

so, we are leaving loop on
         if (qemu_file_get_error(s->to_dst_file)) {
             migrate_set_state(&s->state, current_active_state,
                               MIGRATION_STATUS_FAILED);
             trace_migration_thread_file_err();
             break;
         }

and skip migration_completion()


>
> Dave
>
>> ---
>>
>>   migration/migration.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 1e99ec9b7e..8f39e0dc02 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2806,7 +2806,14 @@ static void migration_iteration_finish(MigrationState *s)
>>       case MIGRATION_STATUS_FAILED:
>>       case MIGRATION_STATUS_CANCELLED:
>>           if (s->vm_was_running) {
>> -            vm_start();
>> +            Error *local_err = NULL;
>> +            bdrv_invalidate_cache_all(&local_err);
>> +            if (local_err) {
>> +                error_reportf_err(local_err, "Can't invalidate disks before "
>> +                                  "source vm start");
>> +            } else {
>> +                vm_start();
>> +            }
>>           } else {
>>               if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
>>                   runstate_set(RUN_STATE_POSTMIGRATE);
>> -- 
>> 2.11.1
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


-- 
Best regards,
Vladimir


Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by John Snow 5 years, 10 months ago

On 06/15/2018 08:33 AM, Vladimir Sementsov-Ogievskiy wrote:
> 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>> Invalidate cache before source start in case of failed migration.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> Why doesn't the code at the bottom of migration_completion,
>> fail_invalidate:   and the code in migrate_fd_cancel   handle this?
>>
>> What case did you see it in that those didn't handle?
>> (Also I guess it probably should set s->block_inactive = false)
> 
> on source I see:
> 
> 81392@1529065750.766289:migrate_set_state new state 7
> 81392@1529065750.766330:migration_thread_file_err
> 81392@1529065750.766332:migration_thread_after_loop
> 
> so, we are leaving loop on
>         if (qemu_file_get_error(s->to_dst_file)) {
>             migrate_set_state(&s->state, current_active_state,
>                               MIGRATION_STATUS_FAILED);
>             trace_migration_thread_file_err();
>             break;
>         }
> 
> and skip migration_completion()
> 

OK, so you're saying that in migration_thread:

`
        thr_error = migration_detect_error(s);
        if (thr_error == MIG_THR_ERR_FATAL) {
            /* Stop migration */
            break;
        }
`

is triggering, so then we jump right away to

`
    trace_migration_thread_after_loop();
    migration_iteration_finish(s);
`

and in so doing, we fail before we get a chance for
migration_iteration_run to call migration_completion, which was
otherwise the mechanism that invalidated the images.

Seems like a legitimate concern that early failures versus late failures
trigger different error pathways.

David?

--js

> 
>>
>> Dave
>>
>>> ---
>>>
>>>   migration/migration.c | 9 ++++++++-
>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 1e99ec9b7e..8f39e0dc02 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -2806,7 +2806,14 @@ static void
>>> migration_iteration_finish(MigrationState *s)
>>>       case MIGRATION_STATUS_FAILED:
>>>       case MIGRATION_STATUS_CANCELLED:
>>>           if (s->vm_was_running) {
>>> -            vm_start();
>>> +            Error *local_err = NULL;
>>> +            bdrv_invalidate_cache_all(&local_err);
>>> +            if (local_err) {
>>> +                error_reportf_err(local_err, "Can't invalidate disks
>>> before "
>>> +                                  "source vm start");
>>> +            } else {
>>> +                vm_start();
>>> +            }
>>>           } else {
>>>               if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
>>>                   runstate_set(RUN_STATE_POSTMIGRATE);
>>> -- 
>>> 2.11.1
>>>
>> -- 
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> 


Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by Dr. David Alan Gilbert 5 years, 10 months ago
* Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
> > * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> > > Invalidate cache before source start in case of failed migration.
> > > 
> > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > Why doesn't the code at the bottom of migration_completion,
> > fail_invalidate:   and the code in migrate_fd_cancel   handle this?
> > 
> > What case did you see it in that those didn't handle?
> > (Also I guess it probably should set s->block_inactive = false)
> 
> on source I see:
> 
> 81392@1529065750.766289:migrate_set_state new state 7
> 81392@1529065750.766330:migration_thread_file_err
> 81392@1529065750.766332:migration_thread_after_loop
> 
> so, we are leaving loop on
>         if (qemu_file_get_error(s->to_dst_file)) {
>             migrate_set_state(&s->state, current_active_state,
>                               MIGRATION_STATUS_FAILED);
>             trace_migration_thread_file_err();
>             break;
>         }
> 
> and skip migration_completion()

Yeh, OK; I'd seen something else a few days ago, where a cancellation
test that had previously ended with a 'cancelled' state has now ended up
in 'failed' (which is the state 7 you have above).
I suspect there's something else going on as well; I think what is
supposed to happen in the case of 'cancel' is that it spins in 'cancelling' for
a while in migrate_fd_cancel and then at the bottom of migrate_fd_cancel
it does the recovery, but because it's going to failed instead, then
it's jumping over that recovery.

Dave

> 
> > 
> > Dave
> > 
> > > ---
> > > 
> > >   migration/migration.c | 9 ++++++++-
> > >   1 file changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index 1e99ec9b7e..8f39e0dc02 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -2806,7 +2806,14 @@ static void migration_iteration_finish(MigrationState *s)
> > >       case MIGRATION_STATUS_FAILED:
> > >       case MIGRATION_STATUS_CANCELLED:
> > >           if (s->vm_was_running) {
> > > -            vm_start();
> > > +            Error *local_err = NULL;
> > > +            bdrv_invalidate_cache_all(&local_err);
> > > +            if (local_err) {
> > > +                error_reportf_err(local_err, "Can't invalidate disks before "
> > > +                                  "source vm start");
> > > +            } else {
> > > +                vm_start();
> > > +            }
> > >           } else {
> > >               if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
> > >                   runstate_set(RUN_STATE_POSTMIGRATE);
> > > -- 
> > > 2.11.1
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> 
> -- 
> Best regards,
> Vladimir
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by Dr. David Alan Gilbert 5 years, 10 months ago
* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> > 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
> > > * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> > > > Invalidate cache before source start in case of failed migration.
> > > > 
> > > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > Why doesn't the code at the bottom of migration_completion,
> > > fail_invalidate:   and the code in migrate_fd_cancel   handle this?
> > > 
> > > What case did you see it in that those didn't handle?
> > > (Also I guess it probably should set s->block_inactive = false)
> > 
> > on source I see:
> > 
> > 81392@1529065750.766289:migrate_set_state new state 7
> > 81392@1529065750.766330:migration_thread_file_err
> > 81392@1529065750.766332:migration_thread_after_loop
> > 
> > so, we are leaving loop on
> >         if (qemu_file_get_error(s->to_dst_file)) {
> >             migrate_set_state(&s->state, current_active_state,
> >                               MIGRATION_STATUS_FAILED);
> >             trace_migration_thread_file_err();
> >             break;
> >         }
> > 
> > and skip migration_completion()
> 
> Yeh, OK; I'd seen something else a few days ago, where a cancellation
> test that had previously ended with a 'cancelled' state has now ended up
> in 'failed' (which is the state 7 you have above).
> I suspect there's something else going on as well; I think what is
> supposed to happen in the case of 'cancel' is that it spins in 'cancelling' for
> a while in migrate_fd_cancel and then at the bottom of migrate_fd_cancel
> it does the recovery, but because it's going to failed instead, then
> it's jumping over that recovery.

Going back and actually looking at the patch again;
can I ask for 1 small change;
   Can you set s->block_inactive = false   in the case where you
don't get the local_err (Like we do at the bottom of migrate_fd_cancel)


Does that make sense?

Thanks,

Dave
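
[Editor's note: with Dave's request applied, the hunk would presumably
become something like the following; a sketch of the unsent v2, not a
posted patch:

            Error *local_err = NULL;
            bdrv_invalidate_cache_all(&local_err);
            if (local_err) {
                error_reportf_err(local_err, "Can't invalidate disks before "
                                  "source vm start");
            } else {
                s->block_inactive = false;
                vm_start();
            }
]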

> Dave
> 
> > 
> > > 
> > > Dave
> > > 
> > > > ---
> > > > 
> > > >   migration/migration.c | 9 ++++++++-
> > > >   1 file changed, 8 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index 1e99ec9b7e..8f39e0dc02 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -2806,7 +2806,14 @@ static void migration_iteration_finish(MigrationState *s)
> > > >       case MIGRATION_STATUS_FAILED:
> > > >       case MIGRATION_STATUS_CANCELLED:
> > > >           if (s->vm_was_running) {
> > > > -            vm_start();
> > > > +            Error *local_err = NULL;
> > > > +            bdrv_invalidate_cache_all(&local_err);
> > > > +            if (local_err) {
> > > > +                error_reportf_err(local_err, "Can't invalidate disks before "
> > > > +                                  "source vm start");
> > > > +            } else {
> > > > +                vm_start();
> > > > +            }
> > > >           } else {
> > > >               if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
> > > >                   runstate_set(RUN_STATE_POSTMIGRATE);
> > > > -- 
> > > > 2.11.1
> > > > 
> > > --
> > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> > 
> > -- 
> > Best regards,
> > Vladimir
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by John Snow 5 years, 10 months ago

On 06/25/2018 01:50 PM, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>> 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
>>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>>> Invalidate cache before source start in case of failed migration.
>>>>>
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> Why doesn't the code at the bottom of migration_completion,
>>>> fail_invalidate:   and the code in migrate_fd_cancel   handle this?
>>>>
>>>> What case did you see it in that those didn't handle?
>>>> (Also I guess it probably should set s->block_inactive = false)
>>>
>>> on source I see:
>>>
>>> 81392@1529065750.766289:migrate_set_state new state 7
>>> 81392@1529065750.766330:migration_thread_file_err
>>> 81392@1529065750.766332:migration_thread_after_loop
>>>
>>> so, we are leaving loop on
>>>         if (qemu_file_get_error(s->to_dst_file)) {
>>>             migrate_set_state(&s->state, current_active_state,
>>>                               MIGRATION_STATUS_FAILED);
>>>             trace_migration_thread_file_err();
>>>             break;
>>>         }
>>>
>>> and skip migration_completion()
>>
>> Yeh, OK; I'd seen something else a few days ago, where a cancellation
>> test that had previously ended with a 'cancelled' state has now ended up
>> in 'failed' (which is the state 7 you have above).
>> I suspect there's something else going on as well; I think what is
>> supposed to happen in the case of 'cancel' is that it spins in 'cancelling' for
>> a while in migrate_fd_cancel and then at the bottom of migrate_fd_cancel
>> it does the recovery, but because it's going to failed instead, then
>> it's jumping over that recovery.
> 
> Going back and actually looking at the patch again;
> can I ask for 1 small change;
>    Can you set s->block_inactive = false   in the case where you
> don't get the local_err (Like we do at the bottom of migrate_fd_cancel)
> 
> 
> Does that make sense?
> 
> Thanks,
> 
> Dave
> 

Vladimir, one more question for you because I'm not as familiar with
this code:

In the normal case we need to invalidate the qcow2 cache as a way to
re-engage the disk (yes?) when we have failed during the late-migration
steps.

In this case, we seem to be observing a failure during the bulk transfer
loop. Why is it important to invalidate the cache at this step -- would
the disk have been inactivated yet? It shouldn't, because it's in the
bulk transfer phase -- or am I missing something?

I feel like this code is behaving in a way that's fairly surprising for
a casual reader so I was hoping you could elaborate for me.

--js

Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by Vladimir Sementsov-Ogievskiy 5 years, 10 months ago
25.06.2018 21:03, John Snow wrote:
>
> On 06/25/2018 01:50 PM, Dr. David Alan Gilbert wrote:
>> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>> 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
>>>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>>>> Invalidate cache before source start in case of failed migration.
>>>>>>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>> Why doesn't the code at the bottom of migration_completion,
>>>>> fail_invalidate:   and the code in migrate_fd_cancel   handle this?
>>>>>
>>>>> What case did you see it in that those didn't handle?
>>>>> (Also I guess it probably should set s->block_inactive = false)
>>>> on source I see:
>>>>
>>>> 81392@1529065750.766289:migrate_set_state new state 7
>>>> 81392@1529065750.766330:migration_thread_file_err
>>>> 81392@1529065750.766332:migration_thread_after_loop
>>>>
>>>> so, we are leaving loop on
>>>>          if (qemu_file_get_error(s->to_dst_file)) {
>>>>              migrate_set_state(&s->state, current_active_state,
>>>>                                MIGRATION_STATUS_FAILED);
>>>>              trace_migration_thread_file_err();
>>>>              break;
>>>>          }
>>>>
>>>> and skip migration_completion()
>>> Yeh, OK; I'd seen something else a few days ago, where a cancellation
>>> test that had previously ended with a 'cancelled' state has now ended up
>>> in 'failed' (which is the state 7 you have above).
>>> I suspect there's something else going on as well; I think what is
>>> supposed to happen in the case of 'cancel' is that it spins in 'cancelling' for
>>> a while in migrate_fd_cancel and then at the bottom of migrate_fd_cancel
>>> it does the recovery, but because it's going to failed instead, then
>>> it's jumping over that recovery.
>> Going back and actually looking at the patch again;
>> can I ask for 1 small change;
>>     Can you set s->block_inactive = false   in the case where you
>> don't get the local_err (Like we do at the bottom of migrate_fd_cancel)
>>
>>
>> Does that make sense?
>>
>> Thanks,
>>
>> Dave
>>
> Vladimir, one more question for you because I'm not as familiar with
> this code:
>
> In the normal case we need to invalidate the qcow2 cache as a way to
> re-engage the disk (yes?) when we have failed during the late-migration
> steps.
>
> In this case, we seem to be observing a failure during the bulk transfer
> loop. Why is it important to invalidate the cache at this step -- would
> the disk have been inactivated yet? It shouldn't, because it's in the
> bulk transfer phase -- or am I missing something?
>
> I feel like this code is behaving in a way that's fairly surprising for
> a casual reader so I was hoping you could elaborate for me.
>
> --js

In my case, the source is already in postcopy state when the error occurred,
so it is inactivated.

-- 
Best regards,
Vladimir


Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by Dr. David Alan Gilbert 5 years, 10 months ago
* Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> 25.06.2018 21:03, John Snow wrote:
> > 
> > On 06/25/2018 01:50 PM, Dr. David Alan Gilbert wrote:
> > > * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > > > * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> > > > > 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
> > > > > > * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
> > > > > > > Invalidate cache before source start in case of failed migration.
> > > > > > > 
> > > > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > > > > Why doesn't the code at the bottom of migration_completion,
> > > > > > fail_invalidate:   and the code in migrate_fd_cancel   handle this?
> > > > > > 
> > > > > > What case did you see it in that those didn't handle?
> > > > > > (Also I guess it probably should set s->block_inactive = false)
> > > > > on source I see:
> > > > > 
> > > > > 81392@1529065750.766289:migrate_set_state new state 7
> > > > > 81392@1529065750.766330:migration_thread_file_err
> > > > > 81392@1529065750.766332:migration_thread_after_loop
> > > > > 
> > > > > so, we are leaving loop on
> > > > >          if (qemu_file_get_error(s->to_dst_file)) {
> > > > >              migrate_set_state(&s->state, current_active_state,
> > > > >                                MIGRATION_STATUS_FAILED);
> > > > >              trace_migration_thread_file_err();
> > > > >              break;
> > > > >          }
> > > > > 
> > > > > and skip migration_completion()
> > > > Yeh, OK; I'd seen something else a few days ago, where a cancellation
> > > > test that had previously ended with a 'cancelled' state has now ended up
> > > > in 'failed' (which is the state 7 you have above).
> > > > I suspect there's something else going on as well; I think what is
> > > > supposed to happen in the case of 'cancel' is that it spins in 'cancelling' for
> > > > a while in migrate_fd_cancel and then at the bottom of migrate_fd_cancel
> > > > it does the recovery, but because it's going to failed instead, then
> > > > it's jumping over that recovery.
> > > Going back and actually looking at the patch again;
> > > can I ask for 1 small change;
> > >     Can you set s->block_inactive = false   in the case where you
> > > don't get the local_err (Like we do at the bottom of migrate_fd_cancel)
> > > 
> > > 
> > > Does that make sense?
> > > 
> > > Thanks,
> > > 
> > > Dave
> > > 
> > Vladimir, one more question for you because I'm not as familiar with
> > this code:
> > 
> > In the normal case we need to invalidate the qcow2 cache as a way to
> > re-engage the disk (yes?) when we have failed during the late-migration
> > steps.
> > 
> > In this case, we seem to be observing a failure during the bulk transfer
> > loop. Why is it important to invalidate the cache at this step -- would
> > the disk have been inactivated yet? It shouldn't, because it's in the
> > bulk transfer phase -- or am I missing something?
> > 
> > I feel like this code is behaving in a way that's fairly surprising for
> > a casual reader so I was hoping you could elaborate for me.
> > 
> > --js
> 
> In my case, the source is already in postcopy state when the error
> occurred, so it is inactivated.

Ah, that explains why I couldn't understand the path that got you there;
I never think about restarting the source once we're in postcopy -
because once the destination is running all is lost.
But, you might be in the gap before management has actually started
the destination so it does need fixing.

Dave

> -- 
> Best regards,
> Vladimir
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by Vladimir Sementsov-Ogievskiy 5 years, 10 months ago
25.06.2018 20:50, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>> 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
>>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>>> Invalidate cache before source start in case of failed migration.
>>>>>
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> Why doesn't the code at the bottom of migration_completion,
>>>> fail_invalidate:   and the code in migrate_fd_cancel   handle this?
>>>>
>>>> What case did you see it in that those didn't handle?
>>>> (Also I guess it probably should set s->block_inactive = false)
>>> on source I see:
>>>
>>> 81392@1529065750.766289:migrate_set_state new state 7
>>> 81392@1529065750.766330:migration_thread_file_err
>>> 81392@1529065750.766332:migration_thread_after_loop
>>>
>>> so, we are leaving loop on
>>>          if (qemu_file_get_error(s->to_dst_file)) {
>>>              migrate_set_state(&s->state, current_active_state,
>>>                                MIGRATION_STATUS_FAILED);
>>>              trace_migration_thread_file_err();
>>>              break;
>>>          }
>>>
>>> and skip migration_completion()


John is right, this is an unrelated log; here we fail before
inactivation and there are no problems.

The actual problem is when we fail in postcopy_start, at the end. The
source log looks like:

84297@1530001796.287344:migrate_set_state new state 1
84297@1530001796.287374:migration_fd_outgoing fd=101
84297@1530001796.287383:migration_set_outgoing_channel 
ioc=0x56363454d630 ioctype=qio-channel-socket hostname=(null)
84297@1530001796.294032:migration_bitmap_sync_start
84297@1530001796.300483:migration_bitmap_sync_end dirty_pages 932
84297@1530001796.300561:migrate_set_state new state 4
84297@1530001796.300588:migration_thread_setup_complete
84297@1530001796.300593:migrate_pending pending size 1107976192 max 0 
(pre = 0 compat=1107976192 post=0)
84297@1530001796.300595:migrate_set_state new state 5
Tap fd 33 disable, ret 0
84297@1530001796.426477:migration_bitmap_sync_start
84297@1530001796.433227:migration_bitmap_sync_end dirty_pages 1091
84297@1530001796.439077:migrate_global_state_pre_save saved state: running
2018-06-26T08:29:56.439134Z qemu-kvm: postcopy_start: Migration stream 
errored -5
84297@1530001796.439141:migrate_set_state new state 7
84297@1530001796.439181:migration_thread_after_loop
Tap fd 33 enable, ret 0
84297@1530001796.453639:migrate_fd_cleanup
qemu-kvm: block/io.c:1655: bdrv_co_pwritev: Assertion `!(bs->open_flags 
& 0x0800)' failed.
2018-06-26 08:29:56.605+0000: shutting down, reason=crashed
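
[Editor's note: 0x0800 is BDRV_O_INACTIVE, so the assertion means a write
was submitted to a block device that is still inactivated: the guest was
restarted by vm_start() without bdrv_invalidate_cache_all() being called
first, which is exactly the failure mode the patch addresses.]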


>> Yeh, OK; I'd seen something else a few days ago, where a cancellation
>> test that had previously ended with a 'cancelled' state has now ended up
>> in 'failed' (which is the state 7 you have above).
>> I suspect there's something else going on as well; I think what is
>> supposed to happen in the case of 'cancel' is that it spins in 'cancelling' for
>> a while in migrate_fd_cancel and then at the bottom of migrate_fd_cancel
>> it does the recovery, but because it's going to failed instead, then
>> it's jumping over that recovery.
> Going back and actually looking at the patch again;
> can I ask for 1 small change;
>     Can you set s->block_inactive = false   in the case where you
> don't get the local_err (Like we do at the bottom of migrate_fd_cancel)
>
>
> Does that make sense?

Ok, I'll resend.

Hm, looks like I'm fixing an outdated version (based on v2.9.0), and my
reproducer isn't appropriate for upstream.
But it looks like the current code has a possibility of the same failure:

postcopy_start()
     ....
     ret = qemu_file_get_error(ms->to_dst_file);
     if (ret) {
         error_report("postcopy_start: Migration stream errored");

leads to "return MIG_ITERATE_SKIP;" in migration_iteration_run

then the loop should finish, as the state should be MIGRATION_STATUS_FAILED,
so we will not call migration_completion().
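
[Editor's note: the control flow Vladimir describes is approximately the
following, again from memory of the code of that era, so treat it as a
sketch:

    /* in migration_iteration_run() */
    if (postcopy_start(s)) {
        error_report("%s: postcopy failed to start", __func__);
    }
    return MIG_ITERATE_SKIP;

postcopy_start() has already moved the state to MIGRATION_STATUS_FAILED
on its error path, so the next pass of the migration_thread loop breaks
out and migration_completion(), with its fail_invalidate recovery, is
never reached; the disks stay inactivated.]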

Hm, I have questions now:

1. Should we check s->block_inactive, and if it is false, not invalidate?
That check is done in migrate_fd_cancel() (see the sketch below), but not
in migration_completion().
2. Should we call qemu_mutex_lock_iothread() like in
migration_completion()? Why is it needed in migration_completion(), when
the vm is not running?
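
[Editor's note: for reference on question 1, the migrate_fd_cancel()
recovery being compared against looks approximately like this (again a
sketch from memory of the code of that era):

    if (s->state == MIGRATION_STATUS_CANCELLING && s->block_inactive) {
        Error *local_err = NULL;

        bdrv_invalidate_cache_all(&local_err);
        if (local_err) {
            error_report_err(local_err);
        } else {
            s->block_inactive = false;
        }
    }

It both checks s->block_inactive before invalidating and clears it on
success, which is the behaviour the two questions probe.]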

>
> Thanks,
>
> Dave
>
>> Dave
>>
>>>> Dave
>>>>
>>>>> ---
>>>>>
>>>>>    migration/migration.c | 9 ++++++++-
>>>>>    1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>> index 1e99ec9b7e..8f39e0dc02 100644
>>>>> --- a/migration/migration.c
>>>>> +++ b/migration/migration.c
>>>>> @@ -2806,7 +2806,14 @@ static void migration_iteration_finish(MigrationState *s)
>>>>>        case MIGRATION_STATUS_FAILED:
>>>>>        case MIGRATION_STATUS_CANCELLED:
>>>>>            if (s->vm_was_running) {
>>>>> -            vm_start();
>>>>> +            Error *local_err = NULL;
>>>>> +            bdrv_invalidate_cache_all(&local_err);
>>>>> +            if (local_err) {
>>>>> +                error_reportf_err(local_err, "Can't invalidate disks before "
>>>>> +                                  "source vm start");
>>>>> +            } else {
>>>>> +                vm_start();
>>>>> +            }
>>>>>            } else {
>>>>>                if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
>>>>>                    runstate_set(RUN_STATE_POSTMIGRATE);
>>>>> -- 
>>>>> 2.11.1
>>>>>
>>>> --
>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>
>>> -- 
>>> Best regards,
>>> Vladimir
>>>
>> --
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


-- 
Best regards,
Vladimir


Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by Vladimir Sementsov-Ogievskiy 5 years, 6 months ago
26.06.2018 11:44, Vladimir Sementsov-Ogievskiy wrote:
> 25.06.2018 20:50, Dr. David Alan Gilbert wrote:
>> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>> 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
>>>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>>>> Invalidate cache before source start in case of failed migration.
>>>>>>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>>>>>> <vsementsov@virtuozzo.com>
>>>>> Why doesn't the code at the bottom of migration_completion,
>>>>> fail_invalidate:   and the code in migrate_fd_cancel handle this?
>>>>>
>>>>> What case did you see it in that those didn't handle?
>>>>> (Also I guess it probably should set s->block_inactive = false)
>>>> on source I see:
>>>>
>>>> 81392@1529065750.766289:migrate_set_state new state 7
>>>> 81392@1529065750.766330:migration_thread_file_err
>>>> 81392@1529065750.766332:migration_thread_after_loop
>>>>
>>>> so, we are leaving loop on
>>>>          if (qemu_file_get_error(s->to_dst_file)) {
>>>>              migrate_set_state(&s->state, current_active_state,
>>>>                                MIGRATION_STATUS_FAILED);
>>>>              trace_migration_thread_file_err();
>>>>              break;
>>>>          }
>>>>
>>>> and skip migration_completion()
>
>
> John is right, this is an unrelated log; here we fail before
> inactivation and there are no problems.
>
> The actual problem is when we fail in postcopy_start, at the end. The
> source log looks like:
>
> 84297@1530001796.287344:migrate_set_state new state 1
> 84297@1530001796.287374:migration_fd_outgoing fd=101
> 84297@1530001796.287383:migration_set_outgoing_channel 
> ioc=0x56363454d630 ioctype=qio-channel-socket hostname=(null)
> 84297@1530001796.294032:migration_bitmap_sync_start
> 84297@1530001796.300483:migration_bitmap_sync_end dirty_pages 932
> 84297@1530001796.300561:migrate_set_state new state 4
> 84297@1530001796.300588:migration_thread_setup_complete
> 84297@1530001796.300593:migrate_pending pending size 1107976192 max 0 
> (pre = 0 compat=1107976192 post=0)
> 84297@1530001796.300595:migrate_set_state new state 5
> Tap fd 33 disable, ret 0
> 84297@1530001796.426477:migration_bitmap_sync_start
> 84297@1530001796.433227:migration_bitmap_sync_end dirty_pages 1091
> 84297@1530001796.439077:migrate_global_state_pre_save saved state: 
> running
> 2018-06-26T08:29:56.439134Z qemu-kvm: postcopy_start: Migration stream 
> errored -5
> 84297@1530001796.439141:migrate_set_state new state 7
> 84297@1530001796.439181:migration_thread_after_loop
> Tap fd 33 enable, ret 0
> 84297@1530001796.453639:migrate_fd_cleanup
> qemu-kvm: block/io.c:1655: bdrv_co_pwritev: Assertion 
> `!(bs->open_flags & 0x0800)' failed.
> 2018-06-26 08:29:56.605+0000: shutting down, reason=crashed
>
>
>>> Yeh, OK; I'd seen something else a few days ago, where a cancellation
>>> test that had previously ended with a 'cancelled' state has now 
>>> ended up
>>> in 'failed' (which is the state 7 you have above).
>>> I suspect there's something else going on as well; I think what is
>>> supposed to happen in the case of 'cancel' is that it spins in 
>>> 'cancelling' for
>>> a while in migrate_fd_cancel and then at the bottom of 
>>> migrate_fd_cancel
>>> it does the recovery, but because it's going to failed instead, then
>>> it's jumping over that recovery.
>> Going back and actually looking at the patch again;
>> can I ask for 1 small change;
>>     Can you set s->block_inactive = false   in the case where you
>> don't get the local_err (Like we do at the bottom of migrate_fd_cancel)
>>
>>
>> Does that make sense?
>
> Ok, I'll resend.
>
> Hm, looks like I'm fixing an outdated version (based on v2.9.0), and my
> reproducer isn't appropriate for upstream.
> But it looks like the current code has a possibility of the same failure:
>
> postcopy_start()
>     ....
>     ret = qemu_file_get_error(ms->to_dst_file);
>     if (ret) {
>         error_report("postcopy_start: Migration stream errored");
>
> leads to "return MIG_ITERATE_SKIP;" in migration_iteration_run
>
> then the loop should finish, as the state should be
> MIGRATION_STATUS_FAILED, so we will not call migration_completion().
>
> Hm, I have questions now:
>
> 1. Should we check s->block_inactive, and if it is false, not
> invalidate? That check is done in migrate_fd_cancel(), but not in
> migration_completion().
> 2. Should we call qemu_mutex_lock_iothread() like in
> migration_completion()? Why is it needed in migration_completion(),
> when the vm is not running?


Hm, a forgotten thread; I should resend, but what do you think about these
questions?

>
>>
>> Thanks,
>>
>> Dave
>>
>>> Dave
>>>
>>>>> Dave
>>>>>
>>>>>> ---
>>>>>>
>>>>>>    migration/migration.c | 9 ++++++++-
>>>>>>    1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>>> index 1e99ec9b7e..8f39e0dc02 100644
>>>>>> --- a/migration/migration.c
>>>>>> +++ b/migration/migration.c
>>>>>> @@ -2806,7 +2806,14 @@ static void 
>>>>>> migration_iteration_finish(MigrationState *s)
>>>>>>        case MIGRATION_STATUS_FAILED:
>>>>>>        case MIGRATION_STATUS_CANCELLED:
>>>>>>            if (s->vm_was_running) {
>>>>>> -            vm_start();
>>>>>> +            Error *local_err = NULL;
>>>>>> +            bdrv_invalidate_cache_all(&local_err);
>>>>>> +            if (local_err) {
>>>>>> +                error_reportf_err(local_err, "Can't invalidate 
>>>>>> disks before "
>>>>>> +                                  "source vm start");
>>>>>> +            } else {
>>>>>> +                vm_start();
>>>>>> +            }
>>>>>>            } else {
>>>>>>                if (runstate_check(RUN_STATE_FINISH_MIGRATE)) {
>>>>>>                    runstate_set(RUN_STATE_POSTMIGRATE);
>>>>>> -- 
>>>>>> 2.11.1
>>>>>>
>>>>> -- 
>>>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>>>
>>>> -- 
>>>> Best regards,
>>>> Vladimir
>>>>
>>> -- 
>>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>> -- 
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>


-- 
Best regards,
Vladimir

Re: [Qemu-devel] [PATCH] migration: invalidate cache before source start
Posted by John Snow 5 years, 6 months ago

On 10/08/2018 11:36 AM, Vladimir Sementsov-Ogievskiy wrote:
> 26.06.2018 11:44, Vladimir Sementsov-Ogievskiy wrote:
>> 25.06.2018 20:50, Dr. David Alan Gilbert wrote:
>>> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
>>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>>> 15.06.2018 15:06, Dr. David Alan Gilbert wrote:
>>>>>> * Vladimir Sementsov-Ogievskiy (vsementsov@virtuozzo.com) wrote:
>>>>>>> Invalidate cache before source start in case of failed migration.
>>>>>>>
>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>>>>>>> <vsementsov@virtuozzo.com>
>>>>>> Why doesn't the code at the bottom of migration_completion,
>>>>>> fail_invalidate:   and the code in migrate_fd_cancel handle this?
>>>>>>
>>>>>> What case did you see it in that those didn't handle?
>>>>>> (Also I guess it probably should set s->block_inactive = false)
>>>>> on source I see:
>>>>>
>>>>> 81392@1529065750.766289:migrate_set_state new state 7
>>>>> 81392@1529065750.766330:migration_thread_file_err
>>>>> 81392@1529065750.766332:migration_thread_after_loop
>>>>>
>>>>> so, we are leaving loop on
>>>>>          if (qemu_file_get_error(s->to_dst_file)) {
>>>>>              migrate_set_state(&s->state, current_active_state,
>>>>>                                MIGRATION_STATUS_FAILED);
>>>>>              trace_migration_thread_file_err();
>>>>>              break;
>>>>>          }
>>>>>
>>>>> and skip migration_completion()
>>
>>
>> John is right, this is an unrelated log; here we fail before
>> inactivation and there are no problems.
>>
>> The actual problem is when we fail in postcopy_start, at the end. The
>> source log looks like:
>>
>> 84297@1530001796.287344:migrate_set_state new state 1
>> 84297@1530001796.287374:migration_fd_outgoing fd=101
>> 84297@1530001796.287383:migration_set_outgoing_channel 
>> ioc=0x56363454d630 ioctype=qio-channel-socket hostname=(null)
>> 84297@1530001796.294032:migration_bitmap_sync_start
>> 84297@1530001796.300483:migration_bitmap_sync_end dirty_pages 932
>> 84297@1530001796.300561:migrate_set_state new state 4
>> 84297@1530001796.300588:migration_thread_setup_complete
>> 84297@1530001796.300593:migrate_pending pending size 1107976192 max 0 
>> (pre = 0 compat=1107976192 post=0)
>> 84297@1530001796.300595:migrate_set_state new state 5
>> Tap fd 33 disable, ret 0
>> 84297@1530001796.426477:migration_bitmap_sync_start
>> 84297@1530001796.433227:migration_bitmap_sync_end dirty_pages 1091
>> 84297@1530001796.439077:migrate_global_state_pre_save saved state: 
>> running
>> 2018-06-26T08:29:56.439134Z qemu-kvm: postcopy_start: Migration stream 
>> errored -5
>> 84297@1530001796.439141:migrate_set_state new state 7
>> 84297@1530001796.439181:migration_thread_after_loop
>> Tap fd 33 enable, ret 0
>> 84297@1530001796.453639:migrate_fd_cleanup
>> qemu-kvm: block/io.c:1655: bdrv_co_pwritev: Assertion 
>> `!(bs->open_flags & 0x0800)' failed.
>> 2018-06-26 08:29:56.605+0000: shutting down, reason=crashed
>>
>>
>>>> Yeh, OK; I'd seen something else a few days ago, where a cancellation
>>>> test that had previously ended with a 'cancelled' state has now 
>>>> ended up
>>>> in 'failed' (which is the state 7 you have above).
>>>> I suspect there's something else going on as well; I think what is
>>>> supposed to happen in the case of 'cancel' is that it spins in 
>>>> 'cancelling' for
>>>> a while in migrate_fd_cancel and then at the bottom of 
>>>> migrate_fd_cancel
>>>> it does the recovery, but because it's going to failed instead, then
>>>> it's jumping over that recovery.
>>> Going back and actually looking at the patch again;
>>> can I ask for 1 small change;
>>>     Can you set s->block_inactive = false   in the case where you
>>> don't get the local_err (Like we do at the bottom of migrate_fd_cancel)
>>>
>>>
>>> Does that make sense?
>>
>> Ok, I'll resend.
>>
>> Hm, looks like I'm fixing an outdated version (based on v2.9.0), and my
>> reproducer isn't appropriate for upstream.
>> But it looks like the current code has a possibility of the same failure:
>>
>> postcopy_start()
>>     ....
>>     ret = qemu_file_get_error(ms->to_dst_file);
>>     if (ret) {
>>         error_report("postcopy_start: Migration stream errored");
>>
>> leads to "return MIG_ITERATE_SKIP;" in migration_iteration_run
>>
>> then the loop should finish, as the state should be
>> MIGRATION_STATUS_FAILED, so we will not call migration_completion().
>>
>> Hm, I have questions now:
>>
>> 1. Should we check s->block_inactive, and if it is false, not
>> invalidate? That check is done in migrate_fd_cancel(), but not in
>> migration_completion().
>> 2. Should we call qemu_mutex_lock_iothread() like in
>> migration_completion()? Why is it needed in migration_completion(),
>> when the vm is not running?
> 
> 
> Hm, a forgotten thread; I should resend, but what do you think about
> these questions?
> 

Personally, I can't remember where we stand on this thread at all. If
you have changes that you think are still important (because you ran
into the issue again), probably best to send a respin (with a new cover
letter and updated justification) and we'll take it from there.

(I seem to recall we were working off of different versions which
handled migration differently, which was causing confusion...)

--js