[Qemu-devel] [PATCH V2 RESEND] block/replication.c: Fix crash issue after failover

Zhang Chen posted 1 patch 4 years, 10 months ago
Test s390x passed
Test checkpatch passed
Test asan passed
Test docker-mingw@fedora passed
Test docker-clang@ubuntu passed
Test FreeBSD passed
Patches applied successfully
git fetch https://github.com/patchew-project/qemu tags/patchew/20190621062843.1605-1-chen.zhang@intel.com
Maintainers: Wen Congyang <wencongyang2@huawei.com>, Xie Changlong <xiechanglong.d@gmail.com>, Kevin Wolf <kwolf@redhat.com>, Max Reitz <mreitz@redhat.com>
block/replication.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
Posted by Zhang Chen 4 years, 10 months ago
From: Zhang Chen <chen.zhang@intel.com>

If we try to close replication after failover, QEMU crashes here.
So we need to check that the block job on the active disk exists before
cancelling it.

Signed-off-by: Zhang Chen <chen.zhang@intel.com>
---
 block/replication.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/replication.c b/block/replication.c
index b41bc507c0..a68bc7e986 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -149,7 +149,9 @@ static void replication_close(BlockDriverState *bs)
         replication_stop(s->rs, false, NULL);
     }
     if (s->stage == BLOCK_REPLICATION_FAILOVER) {
-        job_cancel_sync(&s->commit_job->job);
+        if (s->commit_job) {
+            job_cancel_sync(&s->commit_job->job);
+        }
     }
 
     if (s->mode == REPLICATION_MODE_SECONDARY) {
-- 
2.17.GIT


Re: [Qemu-devel] [Qemu-block] [PATCH V2 RESEND] block/replication.c: Fix crash issue after failover
Posted by John Snow 4 years, 10 months ago

On 6/21/19 2:28 AM, Zhang Chen wrote:
> From: Zhang Chen <chen.zhang@intel.com>
> 
> If we try to close replication after failover, QEMU crashes here.
> So we need to check that the block job on the active disk exists before
> cancelling it.
> 
> Signed-off-by: Zhang Chen <chen.zhang@intel.com>
> ---
>  block/replication.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/replication.c b/block/replication.c
> index b41bc507c0..a68bc7e986 100644
> --- a/block/replication.c
> +++ b/block/replication.c
> @@ -149,7 +149,9 @@ static void replication_close(BlockDriverState *bs)
>          replication_stop(s->rs, false, NULL);
>      }
>      if (s->stage == BLOCK_REPLICATION_FAILOVER) {
> -        job_cancel_sync(&s->commit_job->job);
> +        if (s->commit_job) {
> +            job_cancel_sync(&s->commit_job->job);
> +        }
>      }
>  
>      if (s->mode == REPLICATION_MODE_SECONDARY) {
> 

I actually don't understand this right away.

The only place I see that sets commit_job is replication_stop, which
sets it immediately after s->stage = BLOCK_REPLICATION_FAILOVER.

So if we're here in replication_close, shouldn't we have a valid job object?

...unless we never succeeded in launching this commit job, but then
don't we have worse problems?

...Or perhaps the job actually finished and we never cleared the job
variable in replication_done, but in that case I don't see why this if
statement would actually help us.

Can you share some details of the crash (a backtrace, for example) to help
me understand it, and why this patch helps?

--js
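
For readers following the thread, here is a minimal standalone sketch of what
the added guard changes. It is not QEMU code: Job, BlockJob, ReplicationState,
model_job_cancel_sync() and model_replication_close() are hypothetical
stand-ins written for this illustration, and the sketch assumes only what the
hunk itself shows, namely that commit_job is a pointer that may be NULL when
the stage is FAILOVER. It does not answer the question above of how that state
is reached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for QEMU's job types; not the real definitions. */
typedef struct Job {
    bool cancelled;
} Job;

typedef struct BlockJob {
    Job job;
} BlockJob;

enum ReplicationStage { STAGE_NONE, STAGE_RUNNING, STAGE_FAILOVER };

typedef struct ReplicationState {
    enum ReplicationStage stage;
    BlockJob *commit_job;   /* assumption: NULL when no commit job exists */
} ReplicationState;

/*
 * Model of a cancel routine that uses its argument unconditionally.
 * The assert stands in for the crash a NULL job would cause.
 */
static void model_job_cancel_sync(Job *job)
{
    assert(job != NULL);
    job->cancelled = true;
}

/*
 * Guarded close path with the same shape as the patched hunk.
 * Returns true if a job was cancelled, false if there was nothing to cancel.
 */
static bool model_replication_close(ReplicationState *s)
{
    if (s->stage == STAGE_FAILOVER) {
        if (s->commit_job) {
            model_job_cancel_sync(&s->commit_job->job);
            return true;
        }
    }
    return false;
}
```

With the guard in place, a state whose stage is FAILOVER and whose commit_job
is NULL falls through without touching the job; without it, the same state
would reach the cancel routine with an invalid pointer.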