POC: Generating realistic block errors

[RFC 0/4] POC: Generating realistic block errors

Posted by Tony Asleson 6 years, 4 months ago

For a long time I thought that VMs could be a great way to improve OS
code quality by modifying the simulated hardware to return errors to
exercise error paths in the OSs, specifically in block devices for right now.
A number of different approaches are available within the Linux kernel, eg.
scsi-debug, dm-flakey, and others.  However, I always though it would be best to
simulate it from the hardware.  To fully exercise the entire stack.  As a
bonus it's OS agnostic for those projects that cross OSs and it's available
before the OS even boots.

This POC needs a lot more work, but it's what I have so far.  Learning QEMU
internals, plus some of the different bus types has been interesting to say
the least.  I'm most familiar with SCSI, but the others are foreign to me.
AHCI/SATA/ATA is very interesting with it's history and the associated code to
handle it's evolution.

Eventually I think it would be useful to add functionality for errors on
write paths, timeouts, and different error behaviors.  Like media error(s) that
are corrected by a re-write (simulate failed write on power loss), bit rot
injection etc.  I know a number of these can be solved different ways,
but embracing them from the VM environment seems ideal to me.  Expanding
to gather statistics on IO patterns, histograms etc. could be very
useful too.  Having the ability to start/stop information collection and
then have access to what happened and in what order could allow for
better error injection of key FS structures and software RAID solutions too.

I've recently been informed that blkdebug exists.  From a cursory investigation
it does appear have overlap, but I haven't given it a try yet.  From looking
at the code to insert my changes it appears that some of the errors I'm
generating are different than what for example an EIO on a read_aio does, but
I'm not sure.  Perhaps some of the other features that I've outlined above
already exist too in some other capacity of QEMU?

Instead of working on this more in a vacuum I'm presenting what I have.  To
gauge interest, to see if others think it's as interesting as I do.  Or perhaps,
to find out that I've been re-inventing the wheel.

I'm interested in learning what thoughts people have on this.

Thanks,
Tony

Tony Asleson (4):
  Add qapi for block error injection
  SCSI media error reporting
  NVMe media error reporting
  ahci media error reporting

 block/Makefile.objs  |   2 +-
 block/error_inject.c | 179 +++++++++++++++++++++++++++++++++++++++++++
 block/error_inject.h |  43 +++++++++++
 block/qapi.c         |  18 +++++
 hw/block/nvme.c      |   8 ++
 hw/ide/ahci.c        |  27 +++++++
 hw/scsi/scsi-disk.c  |  33 ++++++++
 include/scsi/utils.h |   4 +
 qapi/block.json      |  36 +++++++++
 scsi/utils.c         |  31 ++++++++
 10 files changed, 380 insertions(+), 1 deletion(-)
 create mode 100644 block/error_inject.c
 create mode 100644 block/error_inject.h

-- 
2.21.0

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Stefan Hajnoczi 6 years, 4 months ago

On Thu, Sep 19, 2019 at 02:48:43PM -0500, Tony Asleson wrote:
> For a long time I thought that VMs could be a great way to improve OS
> code quality by modifying the simulated hardware to return errors to
> exercise error paths in the OSs, specifically in block devices for right now.
> A number of different approaches are available within the Linux kernel, eg.
> scsi-debug, dm-flakey, and others.  However, I always though it would be best to
> simulate it from the hardware.  To fully exercise the entire stack.  As a
> bonus it's OS agnostic for those projects that cross OSs and it's available
> before the OS even boots.
> 
> This POC needs a lot more work, but it's what I have so far.  Learning QEMU
> internals, plus some of the different bus types has been interesting to say
> the least.  I'm most familiar with SCSI, but the others are foreign to me.
> AHCI/SATA/ATA is very interesting with it's history and the associated code to
> handle it's evolution.
> 
> Eventually I think it would be useful to add functionality for errors on
> write paths, timeouts, and different error behaviors.  Like media error(s) that
> are corrected by a re-write (simulate failed write on power loss), bit rot
> injection etc.  I know a number of these can be solved different ways,
> but embracing them from the VM environment seems ideal to me.  Expanding
> to gather statistics on IO patterns, histograms etc. could be very
> useful too.  Having the ability to start/stop information collection and
> then have access to what happened and in what order could allow for
> better error injection of key FS structures and software RAID solutions too.
> 
> I've recently been informed that blkdebug exists.  From a cursory investigation
> it does appear have overlap, but I haven't given it a try yet.  From looking
> at the code to insert my changes it appears that some of the errors I'm
> generating are different than what for example an EIO on a read_aio does, but
> I'm not sure.  Perhaps some of the other features that I've outlined above
> already exist too in some other capacity of QEMU?

blkdebug is purely at the QEMU block layer level.  It is not aware of
storage controller-specific error information or features.  If you want
to inject NVMe- or SCSI-specific errors that make no sense in QEMU's
block layer, then trying to do it in blkdebug becomes a layering
violation.  This justifies adding a new error injection feature directly
into AHCI, virtio-scsi, NVMe, etc devices.

What I like about blkdebug is the state machine and relatively powerful
tests that this enables.  It makes sense to reuse those for storage
controller-specific error injection too.  In the future we may wish to
reuse it for network cards and other emulated devices too!

Perhaps the error injection "engine" (the core blkdebug code) can be
extracted and reused?

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Tony Asleson 6 years, 4 months ago

On 9/20/19 4:22 AM, Stefan Hajnoczi wrote:
> blkdebug is purely at the QEMU block layer level.  It is not aware of
> storage controller-specific error information or features.  If you want
> to inject NVMe- or SCSI-specific errors that make no sense in QEMU's
> block layer, then trying to do it in blkdebug becomes a layering
> violation.  This justifies adding a new error injection feature directly
> into AHCI, virtio-scsi, NVMe, etc devices.

Good discussion point...

In my opening use case for this POC I'm generically trying to create an
unrecoverable media error for a specific sector.  For each of the
different device types it's different on how that error is conveyed and
the associated data in transfer.

If we return EIO on a read_aio, that must be translated into transport
specific error right?  I really need to try out blkdebug and see what is
seen in the guest.  Maybe I'm upside down on the layering too :-)

> What I like about blkdebug is the state machine and relatively powerful
> tests that this enables.  It makes sense to reuse those for storage
> controller-specific error injection too.  In the future we may wish to
> reuse it for network cards and other emulated devices too!
> 
> Perhaps the error injection "engine" (the core blkdebug code) can be
> extracted and reused?

I think VMs are a great way to test all different error paths, just not
those specific to storage so if we could make this generic that would be
great.

I also elaborated about this a bit in one of my other responses, but
I'll reiterate here a bit too.  If we could design a suitable callback
interface utilizing qapi I think we could move the state machine/logic
out of QEMU all together.  So that the test code could contain this
logic which would allow all types of behavior that we haven't even
thought of to exist outside of QEMU and not require changes to QEMU to
exploit.

-Tony

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Tony Asleson 6 years, 2 months ago

On 9/20/19 12:28 PM, Tony Asleson wrote:
> On 9/20/19 4:22 AM, Stefan Hajnoczi wrote:
>> blkdebug is purely at the QEMU block layer level.  It is not aware of
>> storage controller-specific error information or features.  If you want
>> to inject NVMe- or SCSI-specific errors that make no sense in QEMU's
>> block layer, then trying to do it in blkdebug becomes a layering
>> violation.  This justifies adding a new error injection feature directly
>> into AHCI, virtio-scsi, NVMe, etc devices.
> 
> Good discussion point...
> 
> In my opening use case for this POC I'm generically trying to create an
> unrecoverable media error for a specific sector.  For each of the
> different device types it's different on how that error is conveyed and
> the associated data in transfer.
> 

I would like to get some additional clarification on this point.  Should
I be investing more time integrating my proposed functionality into
blkdebug or other?

Sorry for the long response time, got sidetracked with other stuff.

Thanks,
Tony

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Stefan Hajnoczi 6 years, 2 months ago

On Thu, Nov 14, 2019 at 09:47:48AM -0600, Tony Asleson wrote:
> On 9/20/19 12:28 PM, Tony Asleson wrote:
> > On 9/20/19 4:22 AM, Stefan Hajnoczi wrote:
> >> blkdebug is purely at the QEMU block layer level.  It is not aware of
> >> storage controller-specific error information or features.  If you want
> >> to inject NVMe- or SCSI-specific errors that make no sense in QEMU's
> >> block layer, then trying to do it in blkdebug becomes a layering
> >> violation.  This justifies adding a new error injection feature directly
> >> into AHCI, virtio-scsi, NVMe, etc devices.
> > 
> > Good discussion point...
> > 
> > In my opening use case for this POC I'm generically trying to create an
> > unrecoverable media error for a specific sector.  For each of the
> > different device types it's different on how that error is conveyed and
> > the associated data in transfer.
> > 
> 
> I would like to get some additional clarification on this point.  Should
> I be investing more time integrating my proposed functionality into
> blkdebug or other?
> 
> Sorry for the long response time, got sidetracked with other stuff.

blkdebug can inject EIO when a specific LBA is accessed.  Is that
enough for what you want to do?  Then you can reuse and maybe extend
blkdebug.

Stefan

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Kevin Wolf 6 years, 2 months ago

Am 21.11.2019 um 11:30 hat Stefan Hajnoczi geschrieben:
> On Thu, Nov 14, 2019 at 09:47:48AM -0600, Tony Asleson wrote:
> > On 9/20/19 12:28 PM, Tony Asleson wrote:
> > > On 9/20/19 4:22 AM, Stefan Hajnoczi wrote:
> > >> blkdebug is purely at the QEMU block layer level.  It is not aware of
> > >> storage controller-specific error information or features.  If you want
> > >> to inject NVMe- or SCSI-specific errors that make no sense in QEMU's
> > >> block layer, then trying to do it in blkdebug becomes a layering
> > >> violation.  This justifies adding a new error injection feature directly
> > >> into AHCI, virtio-scsi, NVMe, etc devices.
> > > 
> > > Good discussion point...
> > > 
> > > In my opening use case for this POC I'm generically trying to create an
> > > unrecoverable media error for a specific sector.  For each of the
> > > different device types it's different on how that error is conveyed and
> > > the associated data in transfer.
> > > 
> > 
> > I would like to get some additional clarification on this point.  Should
> > I be investing more time integrating my proposed functionality into
> > blkdebug or other?
> > 
> > Sorry for the long response time, got sidetracked with other stuff.
> 
> blkdebug can inject EIO when a specific LBA is accessed.  Is that
> enough for what you want to do?  Then you can reuse and maybe extend
> blkdebug.

And if we need something more device specific, maybe a common core can
be extracted that can be used by both blkdebug and the devices. All of
the logic of managing error injection rules and checking whether
something needs to be done at the current event should be common between
both.

Kevin

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Tony Asleson 6 years, 2 months ago

On 11/21/19 4:30 AM, Stefan Hajnoczi wrote:
> blkdebug can inject EIO when a specific LBA is accessed.  Is that
> enough for what you want to do?  Then you can reuse and maybe extend
> blkdebug.

Not exactly.  For SCSI, I would like to be able to return different
types of device errors on reads eg. 03/1101, 03/1600 and writes.  The
SCSI sense data needs to include the first block in error for the
transfer.  It would be good to also have the ability to include things
like SCSI check conditions with recoverable errors too.

I've been experimenting with blkdebug, to learn more and to see how it
would need to be extended.  One thing that I was trying to understand is
how an EIO from blkdebug gets translated into a bus/device specific
error.  At the moment I'm not sure.  I've been trying to figure out the
layering.  I think that blkdebug sits between the device specific model
and the underlying block representation on disk.  Thus it injects error
return values when accessing the underlying data, but that could be
incorrect.  If it is correct I should see some code that translates the
EIO to something transport/device specific.  Although I don't understand
how returning an ENOSPC from read_aio in blkdebug would get translated
for a SCSI disk as it doesn't make sense to me (one of the examples in
the documentation).  Actually I don't know how getting ENOSPC on a read
could happen?

During my blkdebug experimentation, I've been using lsi53c895a  with
scsi-disk and thus far I've not been able to generate a read error back
to the guest kernel.  I've managed to abort qemu with an assert and hang
qemu without being able to get an error back to the guest kernel.  I
wrote up one of them: https://bugs.launchpad.net/qemu/+bug/1853898 .
Specifying a specific sector hasn't worked for me yet.  I'm still trying
to figure out how to enable tracing/debugging etc. to see what I'm going
incorrectly.

If someone can point me to any relevant docs, diagrams, talks etc. that
would be greatly appreciated.  I've been looking in the source tree docs
directory and the source code itself and things I've found from web
searches.

Thanks,
Tony

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Kevin Wolf 6 years, 2 months ago

Am 26.11.2019 um 19:19 hat Tony Asleson geschrieben:
> On 11/21/19 4:30 AM, Stefan Hajnoczi wrote:
> > blkdebug can inject EIO when a specific LBA is accessed.  Is that
> > enough for what you want to do?  Then you can reuse and maybe extend
> > blkdebug.
> 
> Not exactly.  For SCSI, I would like to be able to return different
> types of device errors on reads eg. 03/1101, 03/1600 and writes.  The
> SCSI sense data needs to include the first block in error for the
> transfer.  It would be good to also have the ability to include things
> like SCSI check conditions with recoverable errors too.
> 
> I've been experimenting with blkdebug, to learn more and to see how it
> would need to be extended.  One thing that I was trying to understand is
> how an EIO from blkdebug gets translated into a bus/device specific
> error.  At the moment I'm not sure.  I've been trying to figure out the
> layering.  I think that blkdebug sits between the device specific model
> and the underlying block representation on disk.  Thus it injects error
> return values when accessing the underlying data, but that could be
> incorrect.  If it is correct I should see some code that translates the
> EIO to something transport/device specific.

The point where the device calls into the generic block layer is where
the functions that start with blk_ are called (blk_aio_pwritev() and
blk_aio_preadv() are probably the most interesting ones).

The callback path in scsi-disk is not that easy to follow, but in the
end, error returns should result in scsi_handle_rw_error() being called
where error codes are translated into SCSI sense codes.

> Although I don't understand how returning an ENOSPC from read_aio in
> blkdebug would get translated for a SCSI disk as it doesn't make sense
> to me (one of the examples in the documentation).  Actually I don't
> know how getting ENOSPC on a read could happen?

That scenario doesn't make a lot of sense to me either, but blkdebug can
just inject any error code, even nonsensical ones.

> During my blkdebug experimentation, I've been using lsi53c895a  with
> scsi-disk and thus far I've not been able to generate a read error back
> to the guest kernel.  I've managed to abort qemu with an assert and hang
> qemu without being able to get an error back to the guest kernel.  I
> wrote up one of them: https://bugs.launchpad.net/qemu/+bug/1853898 .
> Specifying a specific sector hasn't worked for me yet.  I'm still trying
> to figure out how to enable tracing/debugging etc. to see what I'm going
> incorrectly.

Note that depending on the rerror/werror options, QEMU may not deliver
errors to the guest, but stop VMs instead. If the monitor is still
responsive, it's likely that you just got a stopped VM rather than a
hanging QEMU.

The default is that the VM is stopped for ENOSPC and other errors are
delivered to the guest.

Kevin

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by no-reply@patchew.org 6 years, 4 months ago

Patchew URL: https://patchew.org/QEMU/20190919194847.18518-1-tasleson@redhat.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

  CC      hw/misc/arm_sysctl.o
  CC      hw/misc/cbus.o
/tmp/qemu-test/src/hw/ide/ahci.c: In function 'execute_ncq_command':
/tmp/qemu-test/src/hw/ide/ahci.c:1094:31: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'uint64_t' {aka 'long long unsigned int'} [-Werror=format=]
         sprintf(device_id, "%lu", ide_state->wwn);
                             ~~^   ~~~~~~~~~~~~~~
                             %llu


The full log is available at
http://patchew.org/logs/20190919194847.18518-1-tasleson@redhat.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Stefan Hajnoczi 6 years, 4 months ago

On Thu, Sep 19, 2019 at 02:48:43PM -0500, Tony Asleson wrote:
> I've recently been informed that blkdebug exists.

By the way, if error injection only needs to report media errors, then
the existing blkdebug feature should be sufficient to do this.

I think adding storage controller-specific error injection is only
really necessary for injection SCSI-, AHCI-, or NVMe-specific errors
that the QEMU block layer has no concept of.

So the question is: can blkdebug already do what you need? :)

Stefan

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Kevin Wolf 6 years, 4 months ago

Am 19.09.2019 um 21:48 hat Tony Asleson geschrieben:
> For a long time I thought that VMs could be a great way to improve OS
> code quality by modifying the simulated hardware to return errors to
> exercise error paths in the OSs, specifically in block devices for right now.
> A number of different approaches are available within the Linux kernel, eg.
> scsi-debug, dm-flakey, and others.  However, I always though it would be best to
> simulate it from the hardware.  To fully exercise the entire stack.  As a
> bonus it's OS agnostic for those projects that cross OSs and it's available
> before the OS even boots.
> 
> This POC needs a lot more work, but it's what I have so far.  Learning QEMU
> internals, plus some of the different bus types has been interesting to say
> the least.  I'm most familiar with SCSI, but the others are foreign to me.
> AHCI/SATA/ATA is very interesting with it's history and the associated code to
> handle it's evolution.
> 
> Eventually I think it would be useful to add functionality for errors on
> write paths, timeouts, and different error behaviors.  Like media error(s) that
> are corrected by a re-write (simulate failed write on power loss), bit rot
> injection etc.  I know a number of these can be solved different ways,
> but embracing them from the VM environment seems ideal to me.

Yes, this makes sense to me. The question to answer seems to be where
this functionality would fit in - first of all, frontend (inside the
device emulation) or backend. And it's not an easy quesiton, but see
below.

> Expanding to gather statistics on IO patterns, histograms etc. could
> be very useful too.  Having the ability to start/stop information
> collection and then have access to what happened and in what order
> could allow for better error injection of key FS structures and
> software RAID solutions too.

We do already gather some basic statistics, and it would probably make
sense to add more things there. This is pretty clearly a job for the
backend, I think. If the information is easy to collect, we can add it
in the normal I/O path, otherwise an optional filter block driver could
do this.

> I've recently been informed that blkdebug exists.  From a cursory
> investigation it does appear have overlap, but I haven't given it a
> try yet.  From looking at the code to insert my changes it appears
> that some of the errors I'm generating are different than what for
> example an EIO on a read_aio does, but I'm not sure.  Perhaps some of
> the other features that I've outlined above already exist too in some
> other capacity of QEMU?

The overlap seems big enough that we can't ignore it. blkdebug was
primarily meant to debug image format block drivers like qcow2, but
because this means that it can inject I/O errors if specific sectors are
accessed, it makes sense to just use the same functionality for guest
devices, too.

I/O error inserted by blkdebug can be one-off or permanent, but since it
also supports using a small state machine, I think you should already be
able to configure your errors that are corrected by a rewrite, too, even
if there is no explicit support for this yet (I guess we could add it if
it turned out to be much easier to use).

The one thing I see in your series that we can't currently provide this
way is the exact sector number where the error happened. If you read
from sector 32 to 64 and there is an error configured for sector 50, you
just see that the whole request is failing.


I also wonder why you had to write low-level error handling code instead
of calling the existing error functions. If the existing functions don't
give the right result in error cases, shouldn't they be fixed anyway?

And then, as John already hinted, adding code for debugging scenarios to
hot paths that are important for high-performance VMs that don't use any
debugging is less than optimal.


So bringing everything together, what would you think of this plan:

1. Extend blkdebug with whatever ways you need to trigger I/O errors
   (only if the existing modes aren't sufficient at least for the start;
   we can still always extend it later)

2. Add a new BlockDriver callback that can return detailed information
   about an error (such as the exact sector number), and wire it up
   through BlockBackend (some blk_* function). Implement it in blkdebug.

3. In the guest devices, only call the function to get detailed error
   information in the existing error path. You can then update some
   device state according to the details if the block driver returned
   anything (probably only blkdebug will return something).

This way, we have no changes at all in the hot I/O path if you didn't
configure your VM with a blkdebug filter. And we avoid duplication of
code both in the error handler in devices and in the error injection
mechanisms.

Kevin

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Tony Asleson 6 years, 4 months ago

On 9/20/19 3:36 AM, Kevin Wolf wrote:
> I/O error inserted by blkdebug can be one-off or permanent, but since it
> also supports using a small state machine, I think you should already be
> able to configure your errors that are corrected by a rewrite, too, even
> if there is no explicit support for this yet (I guess we could add it if
> it turned out to be much easier to use).

One thing I thought about is the feasibility of having a callback for
these errors across qapi.  For example you could register a sector for a
read/write/both and when that operation occurs you would block IO, send
the sector number and associated data across qapi for test code to do
something with it and respond allowing the operation to continue
successfully or by returning an error determined by the external test
code to be propagated to guest.

This would allow the logic to be outside of QEMU.  So for example in the
re-write case the test code could remove the error when it gets the
write, instead of having that logic embedded in QEMU itself.

Thoughts?

> The one thing I see in your series that we can't currently provide this
> way is the exact sector number where the error happened. If you read
> from sector 32 to 64 and there is an error configured for sector 50, you
> just see that the whole request is failing.

Also depending on the device type the data behavior can be different
too.  For SCSI devices I believe the specification states that the data
leading up to the sector in error is transferred to the initiator.  For
ATA I believe this is not true.  My code doesn't model this correctly.
I generated the error before any data was transferred.

I'm thinking changes in blkdebug will need to be done to handle this too?

> I also wonder why you had to write low-level error handling code instead
> of calling the existing error functions. If the existing functions don't
> give the right result in error cases, shouldn't they be fixed anyway?

I would think so too.  I'm using error constants that already exist, but
apparently are not being used anywhere else.

> And then, as John already hinted, adding code for debugging scenarios to
> hot paths that are important for high-performance VMs that don't use any
> debugging is less than optimal.

I agree, the POC code was experimental, but I should have done more
effort in minimizing the run-time costs.

Additionally I think it would be good if QEMU could standardize the
device wwn format to be consistent throughout all block device types,
eg. uint64_t, but maybe not possible.  I also think it would be good to
allow the wwn passed on the command line correlate with what the guest
sees for /sys/block/<device>/device/wwid.

However, I'm assuming that QEMU has the same stance as the linux kernel
with no visible user space breakage?

> 
> So bringing everything together, what would you think of this plan:
> 
> 1. Extend blkdebug with whatever ways you need to trigger I/O errors
>    (only if the existing modes aren't sufficient at least for the start;
>    we can still always extend it later)
> 
> 2. Add a new BlockDriver callback that can return detailed information
>    about an error (such as the exact sector number), and wire it up
>    through BlockBackend (some blk_* function). Implement it in blkdebug.
> 
> 3. In the guest devices, only call the function to get detailed error
>    information in the existing error path. You can then update some
>    device state according to the details if the block driver returned
>    anything (probably only blkdebug will return something).
> 
> This way, we have no changes at all in the hot I/O path if you didn't
> configure your VM with a blkdebug filter. And we avoid duplication of
> code both in the error handler in devices and in the error injection
> mechanisms.

This all sounds good to me.  Although I'm not 100% sure of all the
specific details you are describing at the moment as I'm not that
familiar with the code base.

Thanks!

-Tony

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Eric Blake 6 years, 4 months ago

On 9/20/19 11:41 AM, Tony Asleson wrote:
> On 9/20/19 3:36 AM, Kevin Wolf wrote:
>> I/O error inserted by blkdebug can be one-off or permanent, but since it
>> also supports using a small state machine, I think you should already be
>> able to configure your errors that are corrected by a rewrite, too, even
>> if there is no explicit support for this yet (I guess we could add it if
>> it turned out to be much easier to use).
> 
> One thing I thought about is the feasibility of having a callback for
> these errors across qapi.  For example you could register a sector for a
> read/write/both and when that operation occurs you would block IO, send
> the sector number and associated data across qapi for test code to do
> something with it and respond allowing the operation to continue
> successfully or by returning an error determined by the external test
> code to be propagated to guest.
> 
> This would allow the logic to be outside of QEMU.  So for example in the
> re-write case the test code could remove the error when it gets the
> write, instead of having that logic embedded in QEMU itself.
> 
> Thoughts?

To some extent, this sounds similar to what you can accomplish with an
NBD disk.  You can write an nbdkit plugin which exposes whatever error
handling you want (such as "the first read to this sector fails with
EIO, but a second read succeeds"), but only insofar as it fits in the
bounds of what the NBD protocol exposes over the wire (so qemu would see
EIO errors, and could narrow in on which portion of the disk provides or
avoids those errors, but would not have any additional insights that
would resemble a hardware-specific query without extensions to the NBD
protocol).

I am worried, however, that making data transactions have to go through
QMP to get an answer on how to handle a specific guest request will slow
things down; QMP is not built to be an efficient dataplane interface.
If you truly want isolation (where another process receives all guest
transactions, and makes decisions on how to handle them), it seems like
writing up a remote server (as in 'nbdkit' for the NBD protocol, or a
custom provider for the iscsi protocol) is the way to go.

[I have no idea if there is an iscsi counterpart for nbdkit; the iscsi
protocol is notoriously more complex than the NBD protocol, so it's not
something I'm likely to write]

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Tony Asleson 6 years, 4 months ago

On 9/20/19 12:08 PM, Eric Blake wrote:
> I am worried, however, that making data transactions have to go through
> QMP to get an answer on how to handle a specific guest request will slow
> things down; QMP is not built to be an efficient dataplane interface.

IMHO we only need to make the code path efficient up to the point we
know we have an IO operation that contains sector(s) that we want to do
something with.  Once that happens I don't think it will be an issue for
those to be slower.  A real spinning disk drive takes longer to handle a
request when it encounters an error too.

> If you truly want isolation (where another process receives all guest
> transactions, and makes decisions on how to handle them)

It would only handle specific sector(s) of interest, not all.

-Tony

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Kevin Wolf 6 years, 4 months ago

Am 20.09.2019 um 18:41 hat Tony Asleson geschrieben:
> On 9/20/19 3:36 AM, Kevin Wolf wrote:
> > I/O error inserted by blkdebug can be one-off or permanent, but since it
> > also supports using a small state machine, I think you should already be
> > able to configure your errors that are corrected by a rewrite, too, even
> > if there is no explicit support for this yet (I guess we could add it if
> > it turned out to be much easier to use).
> 
> One thing I thought about is the feasibility of having a callback for
> these errors across qapi.  For example you could register a sector for a
> read/write/both and when that operation occurs you would block IO, send
> the sector number and associated data across qapi for test code to do
> something with it and respond allowing the operation to continue
> successfully or by returning an error determined by the external test
> code to be propagated to guest.
> 
> This would allow the logic to be outside of QEMU.  So for example in the
> re-write case the test code could remove the error when it gets the
> write, instead of having that logic embedded in QEMU itself.
> 
> Thoughts?

Emitting a QMP event when blkdebug injects an error makes sense to me.

I wouldn't use it for this case, though, because this would become racy.
It could happen that the guest writes to the image, which sends a QMP
event, and then reads before the external program has removed the error.

> > The one thing I see in your series that we can't currently provide this
> > way is the exact sector number where the error happened. If you read
> > from sector 32 to 64 and there is an error configured for sector 50, you
> > just see that the whole request is failing.
> 
> Also depending on the device type the data behavior can be different
> too.  For SCSI devices I believe the specification states that the data
> leading up to the sector in error is transferred to the initiator.  For
> ATA I believe this is not true.  My code doesn't model this correctly.
> I generated the error before any data was transferred.
> 
> I'm thinking changes in blkdebug will need to be done to handle this too?

Hm... The SCSI case might get a bit tricky (if the spec does indeed say
that this is what must happen).

blkdebug can only return an I/O error for the whole request. Then the
device would ask for more information about the error and get the sector
number of the error. And then it would have to retry up to immediately
before the problematic sector.

By the way, this is an example where fixing this in the context of
blkdebug will also fix the behaviour for real I/O errors.

> > I also wonder why you had to write low-level error handling code instead
> > of calling the existing error functions. If the existing functions don't
> > give the right result in error cases, shouldn't they be fixed anyway?
> 
> I would think so too.  I'm using error constants that already exist, but
> apparently are not being used anywhere else.
> 
> 
> > And then, as John already hinted, adding code for debugging scenarios to
> > hot paths that are important for high-performance VMs that don't use any
> > debugging is less than optimal.
> 
> I agree, the POC code was experimental, but I should have done more
> effort in minimizing the run-time costs.
> 
> Additionally I think it would be good if QEMU could standardize the
> device wwn format to be consistent throughout all block device types,
> eg. uint64_t, but maybe not possible.  I also think it would be good to
> allow the wwn passed on the command line correlate with what the guest
> sees for /sys/block/<device>/device/wwid.
> 
> However, I'm assuming that QEMU has the same stance as the linux kernel
> with no visible user space breakage?

Changes that are visible to the guest break live migration, so they are
only allowed with a new option that defaults to off for existing machine
type. The default can change to on for new machine types.

> > So bringing everything together, what would you think of this plan:
> > 
> > 1. Extend blkdebug with whatever ways you need to trigger I/O errors
> >    (only if the existing modes aren't sufficient at least for the start;
> >    we can still always extend it later)
> > 
> > 2. Add a new BlockDriver callback that can return detailed information
> >    about an error (such as the exact sector number), and wire it up
> >    through BlockBackend (some blk_* function). Implement it in blkdebug.
> > 
> > 3. In the guest devices, only call the function to get detailed error
> >    information in the existing error path. You can then update some
> >    device state according to the details if the block driver returned
> >    anything (probably only blkdebug will return something).
> > 
> > This way, we have no changes at all in the hot I/O path if you didn't
> > configure your VM with a blkdebug filter. And we avoid duplication of
> > code both in the error handler in devices and in the error injection
> > mechanisms.
> 
> This all sounds good to me.  Although I'm not 100% sure of all the
> specific details you are describing at the moment as I'm not that
> familiar with the code base.

Yes, but I hope it does give you some pointers what to look at. Feel
free to ask more questions once you're ready. (Though I won't be
available next week, but there are other people, too, and I'll read it
later anyway.)

Kevin

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Tony Asleson 6 years, 4 months ago

On 9/20/19 1:11 PM, Kevin Wolf wrote:
> Emitting a QMP event when blkdebug injects an error makes sense to me.
> 
> I wouldn't use it for this case, though, because this would become racy.
> It could happen that the guest writes to the image, which sends a QMP
> event, and then reads before the external program has removed the error.

My POC had a single lock protecting it's shared state.  I'm kind of
surprised no one jumped on that because it's a big point of lock
contention.  One could argue that the state data and associated lock(s)
should be on each device which leads me to the next point.  I think with
careful locking and sequencing we could address this race condition so
that the error was removed before the write completed.  In fact it would
need to work that way to allow the external test code the ability to
perturb the data before it's written if that's what the test wanted to do.

-Tony

Re: [RFC 0/4] POC: Generating realistic block errors

Posted by Kevin Wolf 6 years, 4 months ago

Am 20.09.2019 um 20:55 hat Tony Asleson geschrieben:
> On 9/20/19 1:11 PM, Kevin Wolf wrote:
> > Emitting a QMP event when blkdebug injects an error makes sense to me.
> > 
> > I wouldn't use it for this case, though, because this would become racy.
> > It could happen that the guest writes to the image, which sends a QMP
> > event, and then reads before the external program has removed the error.
> 
> My POC had a single lock protecting it's shared state.  I'm kind of
> surprised no one jumped on that because it's a big point of lock
> contention.

I think people didn't review the code in detail because we're still
discussing very high-level design questions.

Anyway, I did mention that I'd like to get your code out of the way for
the fast path when the feature isn't used. If the user explicitly
enabled the feature and we're basically in a debugging setup, the lock
contention should be acceptable. In fact, the mutex might not even be
necessary because the code should be covered by the AioContext lock.

However, I don't see how this locking could fix the race I mention. It's
not a race between two QEMU components, but between the guest and a QMP
client. A mutex in QEMU certainly feels like the wrong way to address
it.

If you really wanted an external process to control this, you would have
to fully stop the VM whenever an error is injected and only continue it
via QMP after the QMP client has decided whether or not to disable the
error. Because you'd need a custom QMP client then, you wouldn't be able
to use things like libvirt for such QEMU instances.

Kevin