[PATCH 0/3] block/nbd: fix crashers in reconnect while migrating

Roman Kagan posted 3 patches 3 years, 3 months ago
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20210128201418.607640-1-rvkagan@yandex-team.ru
Maintainers: Eric Blake <eblake@redhat.com>, Kevin Wolf <kwolf@redhat.com>, Max Reitz <mreitz@redhat.com>
There is a newer version of this series
include/block/nbd.h |  7 ++++---
block/nbd.c         | 25 +++++++++++++++++--------
2 files changed, 21 insertions(+), 11 deletions(-)
[PATCH 0/3] block/nbd: fix crashers in reconnect while migrating
Posted by Roman Kagan 3 years, 3 months ago
During the final phase of migration the NBD reconnection logic may
encounter situations it doesn't expect during regular operation.

This series addresses some of them that make qemu crash.  They are
reproducible when a vm with a secondary drive attached via nbd with
non-zero "reconnect-delay" runs a stress load (fio with big queue depth)
in the guest on that drive and is migrated (e.g. to a file), while the
nbd server is SIGKILL-ed and restarted every second.

See the individual patches for specific crash conditions and more
detailed explanations.

Roman Kagan (3):
  block/nbd: only detach existing iochannel from aio_context
  block/nbd: only enter connection coroutine if it's present
  nbd: make nbd_read* return -EIO on error

 include/block/nbd.h |  7 ++++---
 block/nbd.c         | 25 +++++++++++++++++--------
 2 files changed, 21 insertions(+), 11 deletions(-)

-- 
2.29.2


Re: [PATCH 0/3] block/nbd: fix crashers in reconnect while migrating
Posted by Vladimir Sementsov-Ogievskiy 3 years, 3 months ago
28.01.2021 23:14, Roman Kagan wrote:
> During the final phase of migration the NBD reconnection logic may
> encounter situations it doesn't expect during regular operation.
> 
> This series addresses some of them that make qemu crash.  They are
> reproducible when a vm with a secondary drive attached via nbd with
> non-zero "reconnect-delay" runs a stress load (fio with big queue depth)
> in the guest on that drive and is migrated (e.g. to a file), while the
> nbd server is SIGKILL-ed and restarted every second.
> 
> See the individual patches for specific crash conditions and more
> detailed explanations.
> 
> Roman Kagan (3):
>    block/nbd: only detach existing iochannel from aio_context
>    block/nbd: only enter connection coroutine if it's present
>    nbd: make nbd_read* return -EIO on error
> 
>   include/block/nbd.h |  7 ++++---
>   block/nbd.c         | 25 +++++++++++++++++--------
>   2 files changed, 21 insertions(+), 11 deletions(-)
> 

Thanks a lot for fixing!

Do you have some reproducer scripts? Could you post them or may be add an iotest?

-- 
Best regards,
Vladimir

Re: [PATCH 0/3] block/nbd: fix crashers in reconnect while migrating
Posted by Roman Kagan 3 years, 3 months ago
On Fri, Jan 29, 2021 at 08:51:39AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 28.01.2021 23:14, Roman Kagan wrote:
> > During the final phase of migration the NBD reconnection logic may
> > encounter situations it doesn't expect during regular operation.
> > 
> > This series addresses some of them that make qemu crash.  They are
> > reproducible when a vm with a secondary drive attached via nbd with
> > non-zero "reconnect-delay" runs a stress load (fio with big queue depth)
> > in the guest on that drive and is migrated (e.g. to a file), while the
> > nbd server is SIGKILL-ed and restarted every second.
> > 
> > See the individual patches for specific crash conditions and more
> > detailed explanations.
> > 
> > Roman Kagan (3):
> >    block/nbd: only detach existing iochannel from aio_context
> >    block/nbd: only enter connection coroutine if it's present
> >    nbd: make nbd_read* return -EIO on error
> > 
> >   include/block/nbd.h |  7 ++++---
> >   block/nbd.c         | 25 +++++++++++++++++--------
> >   2 files changed, 21 insertions(+), 11 deletions(-)
> > 
> 
> Thanks a lot for fixing!
> 
> Do you have some reproducer scripts? Could you post them or may be add
> an iotest?

I don't have it scripted, just ad hoc command lines.  I'll look into
making up a test.  Can you perhaps suggest what existing test to base
on?

Thanks,
Roman.

Re: [PATCH 0/3] block/nbd: fix crashers in reconnect while migrating
Posted by Vladimir Sementsov-Ogievskiy 3 years, 3 months ago
29.01.2021 10:35, Roman Kagan wrote:
> On Fri, Jan 29, 2021 at 08:51:39AM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> 28.01.2021 23:14, Roman Kagan wrote:
>>> During the final phase of migration the NBD reconnection logic may
>>> encounter situations it doesn't expect during regular operation.
>>>
>>> This series addresses some of them that make qemu crash.  They are
>>> reproducible when a vm with a secondary drive attached via nbd with
>>> non-zero "reconnect-delay" runs a stress load (fio with big queue depth)
>>> in the guest on that drive and is migrated (e.g. to a file), while the
>>> nbd server is SIGKILL-ed and restarted every second.
>>>
>>> See the individual patches for specific crash conditions and more
>>> detailed explanations.
>>>
>>> Roman Kagan (3):
>>>     block/nbd: only detach existing iochannel from aio_context
>>>     block/nbd: only enter connection coroutine if it's present
>>>     nbd: make nbd_read* return -EIO on error
>>>
>>>    include/block/nbd.h |  7 ++++---
>>>    block/nbd.c         | 25 +++++++++++++++++--------
>>>    2 files changed, 21 insertions(+), 11 deletions(-)
>>>
>>
>> Thanks a lot for fixing!
>>
>> Do you have some reproducer scripts? Could you post them or may be add
>> an iotest?
> 
> I don't have it scripted, just ad hoc command lines.  I'll look into
> making up a test.  Can you perhaps suggest what existing test to base
> on?
> 

For now reconnect feature is covered only by two tests tests/qemu-iotests/264 and tests/qemu-iotests/277.

Also note, that since "f203080bbd iotests: rewrite check into python" you should add new iotests with human-readable file names into tests/qemu-iotests/tests subdirectory. Also you don't need to update tests/qemu-iotests/group file (it's absent now), test groups are defined in tests themselves in a comment, like "# group: rw quick".

-- 
Best regards,
Vladimir