[v4] block: Add retry for werror=/rerror= mechanism

[PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism

Posted by Jiahui Cen 5 years, 1 month ago

A VM in a cloud environment may use a virtual disk as its backend storage,
and there are usually filesystems on the virtual block device. When the
backend storage is temporarily down, any I/O issued to the virtual block
device fails with an error. For example, an I/O error seen by an ext4
filesystem can make the filesystem read-only. In a production environment,
cloud backend storage can often be recovered quickly. For example, an IP-SAN
may go down due to a network failure and come back online soon after the
network recovers. However, the error in the filesystem may not clear without
a device reattach or a system restart. Thus an I/O retry mechanism is needed
to implement a self-healing system.

This patch series proposes to extend the werror=/rerror= mechanism with
a 'retry' action. It automatically retries failed I/O requests without
reporting the error to the guest, so the guest can resume running smoothly
once I/O has recovered.
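
To make the intended behaviour concrete, here is a minimal, standalone sketch
of "retry on error with an interval and a timeout". It is not code from this
series: RetryPolicy, FakeBackend, fake_submit() and submit_with_retry() are
hypothetical names invented for illustration, and the sketch advances a
virtual clock in a loop, whereas the series drives retries from a timer in the
block backend.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical policy knobs; not the option names used by the series. */
typedef struct RetryPolicy {
    int64_t interval_ms;   /* delay between two attempts */
    int64_t timeout_ms;    /* stop retrying after this long; 0 = never */
} RetryPolicy;

/* Simulated backend that fails until a given point in virtual time. */
typedef struct FakeBackend {
    int64_t recovers_at_ms;
    int attempts;
} FakeBackend;

/* Submit one request; returns 0 on success or -EIO while the backend is down. */
int fake_submit(FakeBackend *be, int64_t now_ms)
{
    be->attempts++;
    return now_ms >= be->recovers_at_ms ? 0 : -EIO;
}

/*
 * Keep resubmitting a failed request instead of reporting the error.
 * Returns 0 if the request eventually succeeds, or the last error once
 * the next attempt would start after timeout_ms.
 */
int submit_with_retry(FakeBackend *be, const RetryPolicy *p)
{
    int64_t now_ms = 0;   /* virtual clock, so the sketch never sleeps */

    assert(p->interval_ms > 0);

    for (;;) {
        int ret = fake_submit(be, now_ms);
        if (ret == 0) {
            return 0;     /* recovered: the caller never sees an error */
        }
        if (p->timeout_ms > 0 && now_ms + p->interval_ms > p->timeout_ms) {
            return ret;   /* timed out: fall back to reporting the error */
        }
        now_ms += p->interval_ms;
    }
}
```

With a 1000 ms interval, a backend that recovers at 2500 ms succeeds on the
attempt made at 3000 ms. With a 3000 ms timeout and a backend that stays down,
the last attempt is made at 3000 ms and the error is then returned, which is
the point where a real device would report the failure to the guest.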

v3->v4:
* Adapt to werror=/rerror= mechanism.

v2->v3:
* Add a doc to describe I/O hang.

v1->v2:
* Rebase to fix compile problems.
* Fix incorrect removal of the rehandle list.
* Provide rehandle pause interface.

REF: https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg06560.html

Signed-off-by: Jiahui Cen <cenjiahui@huawei.com>
Signed-off-by: Ying Fang <fangying1@huawei.com>

Jiahui Cen (7):
  qapi/block-core: Add retry option for error action
  block-backend: Introduce retry timer
  block-backend: Add device specific retry callback
  block-backend: Enable retry action on errors
  block-backend: Add timeout support for retry
  block: Add error retry param setting
  virtio_blk: Add support for retry on errors

 block/block-backend.c          | 66 ++++++++++++++++++++
 blockdev.c                     | 52 +++++++++++++++
 hw/block/block.c               | 10 +++
 hw/block/virtio-blk.c          | 19 +++++-
 include/hw/block/block.h       |  7 ++-
 include/sysemu/block-backend.h | 10 +++
 qapi/block-core.json           |  4 +-
 7 files changed, 162 insertions(+), 6 deletions(-)

-- 
2.28.0

Re: [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism

Posted by Jiahui Cen 5 years, 1 month ago

Kindly ping...

On 2020/12/15 20:30, Jiahui Cen wrote:
> [...]

Re: [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism

Posted by Ying Fang 5 years ago

Kindly ping for it.

Thanks to Stefan's suggestion, we have re-implemented the concept by
introducing the 'retry' feature based on the werror=/rerror= mechanism.

We hope this thread won't be missed. Any comments and reviews are welcome.

Thanks.
Ying Fang.

On 12/15/2020 8:30 PM, Jiahui Cen wrote:
> [...]

Ping: [PATCH v4 0/7] block: Add retry for werror=/rerror= mechanism

Posted by Jiahui Cen 5 years, 1 month ago

Hi Kevin,

What do you think of these patches?

Thanks,
Jiahui

On 2020/12/15 20:30, Jiahui Cen wrote:
> [...]