[RFC 4/4] ahci media error reporting

Tony Asleson posted 4 patches 6 years, 4 months ago
Maintainers: Markus Armbruster <armbru@redhat.com>, John Snow <jsnow@redhat.com>, Max Reitz <mreitz@redhat.com>, Eric Blake <eblake@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Keith Busch <keith.busch@intel.com>, Kevin Wolf <kwolf@redhat.com>, Fam Zheng <fam@euphon.net>
[RFC 4/4] ahci media error reporting
Posted by Tony Asleson 6 years, 4 months ago
Initial attempt at returning a media error for ahci.  This is certainly
wrong and needs serious improvement.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
---
 hw/ide/ahci.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index d45393c019..f487764106 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -36,6 +36,7 @@
 #include "hw/ide/internal.h"
 #include "hw/ide/pci.h"
 #include "ahci_internal.h"
+#include "block/error_inject.h"
 
 #include "trace.h"
 
@@ -999,6 +1000,22 @@ static void ncq_err(NCQTransferState *ncq_tfs)
     ncq_tfs->used = 0;
 }
 
+/*
+ * Figure out correct way to report media error, this is at best a guess
+ * and based on the output of linux kernel, not even remotely close.
+ */
+static void ncq_media_err(NCQTransferState *ncq_tfs, uint64_t err_sector)
+{
+    IDEState *ide_state = &ncq_tfs->drive->port.ifs[0];
+
+    ide_state->error = ECC_ERR;
+    ide_state->status = READY_STAT | ERR_STAT;
+    ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
+    ncq_tfs->lba = err_sector;
+    qemu_sglist_destroy(&ncq_tfs->sglist);
+    ncq_tfs->used = 0;
+}
+
 static void ncq_finish(NCQTransferState *ncq_tfs)
 {
     /* If we didn't error out, set our finished bit. Errored commands
@@ -1065,6 +1082,8 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
 {
     AHCIDevice *ad = ncq_tfs->drive;
     IDEState *ide_state = &ad->port.ifs[0];
+    uint64_t error_sector = 0;
+    char device_id[32];
     int port = ad->port_no;
 
     g_assert(is_ncq(ncq_tfs->cmd));
@@ -1072,6 +1091,14 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
 
     switch (ncq_tfs->cmd) {
     case READ_FPDMA_QUEUED:
+        sprintf(device_id, "%lu", ide_state->wwn);
+
+        if (error_in_read(device_id, ncq_tfs->lba,
+                ncq_tfs->sector_count, &error_sector)) {
+            ncq_media_err(ncq_tfs, error_sector);
+            return;
+        }
+
         trace_execute_ncq_command_read(ad->hba, port, ncq_tfs->tag,
                                        ncq_tfs->sector_count, ncq_tfs->lba);
         dma_acct_start(ide_state->blk, &ncq_tfs->acct,
-- 
2.21.0


Re: [RFC 4/4] ahci media error reporting
Posted by John Snow 6 years, 4 months ago

On 9/19/19 3:48 PM, Tony Asleson wrote:
> Initial attempt at returning a media error for ahci.  This is certainly
> wrong and needs serious improvement.
> 

Hi; I have the unfortunate distinction of being the AHCI maintainer.
Please CC me on future revisions and discussion if you are interacting
with my problem child.

Also remember to CC qemu-block.

> Signed-off-by: Tony Asleson <tasleson@redhat.com>
> ---
>  hw/ide/ahci.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
> index d45393c019..f487764106 100644
> --- a/hw/ide/ahci.c
> +++ b/hw/ide/ahci.c
> @@ -36,6 +36,7 @@
>  #include "hw/ide/internal.h"
>  #include "hw/ide/pci.h"
>  #include "ahci_internal.h"
> +#include "block/error_inject.h"
>  
>  #include "trace.h"
>  
> @@ -999,6 +1000,22 @@ static void ncq_err(NCQTransferState *ncq_tfs)
>      ncq_tfs->used = 0;
>  }
>  
> +/*
> + * Figure out correct way to report media error, this is at best a guess
> + * and based on the output of linux kernel, not even remotely close.
> + */

Depends on what kind of error, exactly, you're trying to report, and at
what level. (AHCI, NCQ, SATA, ATA?)

Keep in mind that you've inserted an error path for SATA drives using
NCQ, but seemingly haven't touched SATA drives not using NCQ, or ATAPI
devices using either PATA/SATA, or ATA drives on PATA.

> +static void ncq_media_err(NCQTransferState *ncq_tfs, uint64_t err_sector)
> +{
> +    IDEState *ide_state = &ncq_tfs->drive->port.ifs[0];
> +
> +    ide_state->error = ECC_ERR;
> +    ide_state->status = READY_STAT | ERR_STAT;
> +    ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
> +    ncq_tfs->lba = err_sector;
> +    qemu_sglist_destroy(&ncq_tfs->sglist);
> +    ncq_tfs->used = 0;
> +}
> +

If you are definitely very sure you only want an ide_state error
difference, you could just as well call ncq_err and then patch
ide_state->error.

(I am not sure that's what you want, but... maybe it is?)

I'd have to check -- because I can't say the AHCI emulator was designed
so much as congealed -- but you might need calls to ncq_finish.

usually, ncq_cb handles the return from any NCQ command and will call
ncq_err and ncq_finish as appropriate to tidy up the command.

It might be a mistake that execute_ncq_command issues ncq_err in the
`default` arm of the switch statement without a call to finish.

If we do call ncq_finish from this context I'm not sure if we want
block_acct_done here unconditionally. We may not have started a block
accounting operation if we never started a backend operation. Everything
else looks about right to me.


>  static void ncq_finish(NCQTransferState *ncq_tfs)
>  {
>      /* If we didn't error out, set our finished bit. Errored commands
> @@ -1065,6 +1082,8 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>  {
>      AHCIDevice *ad = ncq_tfs->drive;
>      IDEState *ide_state = &ad->port.ifs[0];
> +    uint64_t error_sector = 0;
> +    char device_id[32];
>      int port = ad->port_no;
>  
>      g_assert(is_ncq(ncq_tfs->cmd));
> @@ -1072,6 +1091,14 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>  
>      switch (ncq_tfs->cmd) {
>      case READ_FPDMA_QUEUED:
> +        sprintf(device_id, "%lu", ide_state->wwn);

This seems suspicious for your design in general, but I'd like to not
run sprintf to a buffer in the hotpath for NCQ.

If you need this, I'd do it when wwn is set and just grab it from the
state structure.

> +
> +        if (error_in_read(device_id, ncq_tfs->lba,
> +                ncq_tfs->sector_count, &error_sector)) {
> +            ncq_media_err(ncq_tfs, error_sector);
> +            return;
> +        }
> +

One of the downsides to trying to trigger read error injections
per-device instead of per-node is that now you have to goof around with
device-specific code everywhere.

I suppose from your cover letter you *WANT* device-specific error
exercising, which would necessitate a different design from blkdebug,
but it means you have to add support for the framework per-device and it
might be very tricky to get right.

>          trace_execute_ncq_command_read(ad->hba, port, ncq_tfs->tag,
>                                         ncq_tfs->sector_count, ncq_tfs->lba);
>          dma_acct_start(ide_state->blk, &ncq_tfs->acct,
> 

Re: [RFC 4/4] ahci media error reporting
Posted by Tony Asleson 6 years, 4 months ago
On 9/19/19 3:43 PM, John Snow wrote:
> 
> 
> On 9/19/19 3:48 PM, Tony Asleson wrote:
>> Initial attempt at returning a media error for ahci.  This is certainly
>> wrong and needs serious improvement.
>>
> 
> Hi; I have the unfortunate distinction of being the AHCI maintainer.
> Please CC me on future revisions and discussion if you are interacting
> with my problem child.

Will do and thank you for taking a look at this!

> Also remember to CC qemu-block.
> 
>> Signed-off-by: Tony Asleson <tasleson@redhat.com>
>> ---
>>  hw/ide/ahci.c | 27 +++++++++++++++++++++++++++
>>  1 file changed, 27 insertions(+)
>>
>> diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
>> index d45393c019..f487764106 100644
>> --- a/hw/ide/ahci.c
>> +++ b/hw/ide/ahci.c
>> @@ -36,6 +36,7 @@
>>  #include "hw/ide/internal.h"
>>  #include "hw/ide/pci.h"
>>  #include "ahci_internal.h"
>> +#include "block/error_inject.h"
>>  
>>  #include "trace.h"
>>  
>> @@ -999,6 +1000,22 @@ static void ncq_err(NCQTransferState *ncq_tfs)
>>      ncq_tfs->used = 0;
>>  }
>>  
>> +/*
>> + * Figure out correct way to report media error, this is at best a guess
>> + * and based on the output of linux kernel, not even remotely close.
>> + */
> 
> Depends on what kind of error, exactly, you're trying to report, and at
> what level. (AHCI, NCQ, SATA, ATA?)

I was trying to return a media error, like a 3/1101 for SCSI device.  I
think that is at the ATA level?


> Keep in mind that you've inserted an error path for SATA drives using
> NCQ, but seemingly haven't touched SATA drives not using NCQ, or ATAPI
> devices using either PATA/SATA, or ATA drives on PATA.

Correct, but for trying out a simple read on a SATA drive for Linux I
end up here first :-)  Well, until the kernel retries a number of times
and finally gives up and returns an error while also disabling NCQ for
the device.


>> +static void ncq_media_err(NCQTransferState *ncq_tfs, uint64_t err_sector)
>> +{
>> +    IDEState *ide_state = &ncq_tfs->drive->port.ifs[0];
>> +
>> +    ide_state->error = ECC_ERR;
>> +    ide_state->status = READY_STAT | ERR_STAT;
>> +    ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
>> +    ncq_tfs->lba = err_sector;
>> +    qemu_sglist_destroy(&ncq_tfs->sglist);
>> +    ncq_tfs->used = 0;
>> +}
>> +
> 
> If you are definitely very sure you only want an ide_state error
> difference, you could just as well call ncq_err and then patch
> ide_state->error.
> 
> (I am not sure that's what you want, but... maybe it is?)

As I mentioned above, return an unrecoverable media error.

SCSI sense data can report the first sector in error and I thought I
could do the same for SATA too?

> I'd have to check -- because I can't say the AHCI emulator was designed
> so much as congealed -- but you might need calls to ncq_finish.
> 
> usually, ncq_cb handles the return from any NCQ command and will call
> ncq_err and ncq_finish as appropriate to tidy up the command.
> 
> It might be a mistake that execute_ncq_command issues ncq_err in the
> `default` arm of the switch statement without a call to finish.
> 
> If we do call ncq_finish from this context I'm not sure if we want
> block_acct_done here unconditionally. We may not have started a block
> accounting operation if we never started a backend operation. Everything
> else looks about right to me.
> 
> 
>>  static void ncq_finish(NCQTransferState *ncq_tfs)
>>  {
>>      /* If we didn't error out, set our finished bit. Errored commands
>> @@ -1065,6 +1082,8 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>>  {
>>      AHCIDevice *ad = ncq_tfs->drive;
>>      IDEState *ide_state = &ad->port.ifs[0];
>> +    uint64_t error_sector = 0;
>> +    char device_id[32];
>>      int port = ad->port_no;
>>  
>>      g_assert(is_ncq(ncq_tfs->cmd));
>> @@ -1072,6 +1091,14 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>>  
>>      switch (ncq_tfs->cmd) {
>>      case READ_FPDMA_QUEUED:
>> +        sprintf(device_id, "%lu", ide_state->wwn);
> 
> This seems suspicious for your design in general, but I'd like to not
> run sprintf to a buffer in the hotpath for NCQ.

I totally agree.

I started out using integers in the call for error_in_read, as that is
what SCSI uses internally for wwn.  Then I did NVMe and it's using a
string that doesn't apparently need to be an integer for the wwn? so I
changed it to being a string to accommodate.

I also find it interesting that when a SATA device wwid is dumped out
within the guest it doesn't appear to have any correlation with the wwn
that was passed in on the command line, eg.

-device ide-drive,drive=satadisk,bus=ahci.0,wwn=8675309

$cat /sys/block/sda/device/wwid
t10.ATA     QEMU HARDDISK                           QM00005


This does correlate for SCSI

-device scsi-hd,drive=hd,wwn=12345678

$ cat /sys/block/sdc/device/wwid
naa.0000000000bc614e


as 0xbc614e = 12345678


For NVMe, the wwn is in the wwid, but it's not immediately obvious.

Being able to correlate between the command line and what you find in
the guest would be good.


> If you need this, I'd do it when wwn is set and just grab it from the
> state structure.
> 
>> +
>> +        if (error_in_read(device_id, ncq_tfs->lba,
>> +                ncq_tfs->sector_count, &error_sector)) {
>> +            ncq_media_err(ncq_tfs, error_sector);
>> +            return;
>> +        }
>> +
> 
> One of the downsides to trying to trigger read error injections
> per-device instead of per-node is that now you have to goof around with
> device-specific code everywhere>
> I suppose from your cover letter you *WANT* device-specific error
> exercising, which would necessitate a different design from blkdebug,
> but it means you have to add support for the framework per-device and it
> might be very tricky to get right.

Yes, goal was to be able to selectively pick one or more specific block
devices and then create one or more block errors on each device with
potentially different error behavior for each block in error.


>>          trace_execute_ncq_command_read(ad->hba, port, ncq_tfs->tag,
>>                                         ncq_tfs->sector_count, ncq_tfs->lba);
>>          dma_acct_start(ide_state->blk, &ncq_tfs->acct,
>>


Re: [RFC 4/4] ahci media error reporting
Posted by John Snow 6 years, 4 months ago

On 9/19/19 5:49 PM, Tony Asleson wrote:
> On 9/19/19 3:43 PM, John Snow wrote:
>>
>>
>> On 9/19/19 3:48 PM, Tony Asleson wrote:
>>> Initial attempt at returning a media error for ahci.  This is certainly
>>> wrong and needs serious improvement.
>>>
>>
>> Hi; I have the unfortunate distinction of being the AHCI maintainer.
>> Please CC me on future revisions and discussion if you are interacting
>> with my problem child.
> 
> Will do and thank you for taking a look at this!
> 
>> Also remember to CC qemu-block.
>>
>>> Signed-off-by: Tony Asleson <tasleson@redhat.com>
>>> ---
>>>  hw/ide/ahci.c | 27 +++++++++++++++++++++++++++
>>>  1 file changed, 27 insertions(+)
>>>
>>> diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
>>> index d45393c019..f487764106 100644
>>> --- a/hw/ide/ahci.c
>>> +++ b/hw/ide/ahci.c
>>> @@ -36,6 +36,7 @@
>>>  #include "hw/ide/internal.h"
>>>  #include "hw/ide/pci.h"
>>>  #include "ahci_internal.h"
>>> +#include "block/error_inject.h"
>>>  
>>>  #include "trace.h"
>>>  
>>> @@ -999,6 +1000,22 @@ static void ncq_err(NCQTransferState *ncq_tfs)
>>>      ncq_tfs->used = 0;
>>>  }
>>>  
>>> +/*
>>> + * Figure out correct way to report media error, this is at best a guess
>>> + * and based on the output of linux kernel, not even remotely close.
>>> + */
>>
>> Depends on what kind of error, exactly, you're trying to report, and at
>> what level. (AHCI, NCQ, SATA, ATA?)
> 
> I was trying to return a media error, like a 3/1101 for SCSI device.  I
> think that is at the ATA level?
> 
> 
>> Keep in mind that you've inserted an error path for SATA drives using
>> NCQ, but seemingly haven't touched SATA drives not using NCQ, or ATAPI
>> devices using either PATA/SATA, or ATA drives on PATA.
> 
> Correct, but for trying out a simple read on a SATA drive for Linux I
> end up here first :-)  Well, until the kernel retries a number of times
> and finally gives up and returns an error while also disabling NCQ for
> the device.
> 
> 
>>> +static void ncq_media_err(NCQTransferState *ncq_tfs, uint64_t err_sector)
>>> +{
>>> +    IDEState *ide_state = &ncq_tfs->drive->port.ifs[0];
>>> +
>>> +    ide_state->error = ECC_ERR;
>>> +    ide_state->status = READY_STAT | ERR_STAT;
>>> +    ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
>>> +    ncq_tfs->lba = err_sector;
>>> +    qemu_sglist_destroy(&ncq_tfs->sglist);
>>> +    ncq_tfs->used = 0;
>>> +}
>>> +
>>
>> If you are definitely very sure you only want an ide_state error
>> difference, you could just as well call ncq_err and then patch
>> ide_state->error.
>>
>> (I am not sure that's what you want, but... maybe it is?)
> 
> As I mentioned above, return an unrecoverable media error.
> 
> SCSI sense data can report the first sector in error and I thought I
> could do the same for SATA too?
> 
>> I'd have to check -- because I can't say the AHCI emulator was designed
>> so much as congealed -- but you might need calls to ncq_finish.
>>
>> usually, ncq_cb handles the return from any NCQ command and will call
>> ncq_err and ncq_finish as appropriate to tidy up the command.
>>
>> It might be a mistake that execute_ncq_command issues ncq_err in the
>> `default` arm of the switch statement without a call to finish.
>>
>> If we do call ncq_finish from this context I'm not sure if we want
>> block_acct_done here unconditionally. We may not have started a block
>> accounting operation if we never started a backend operation. Everything
>> else looks about right to me.
>>
>>
>>>  static void ncq_finish(NCQTransferState *ncq_tfs)
>>>  {
>>>      /* If we didn't error out, set our finished bit. Errored commands
>>> @@ -1065,6 +1082,8 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>>>  {
>>>      AHCIDevice *ad = ncq_tfs->drive;
>>>      IDEState *ide_state = &ad->port.ifs[0];
>>> +    uint64_t error_sector = 0;
>>> +    char device_id[32];
>>>      int port = ad->port_no;
>>>  
>>>      g_assert(is_ncq(ncq_tfs->cmd));
>>> @@ -1072,6 +1091,14 @@ static void execute_ncq_command(NCQTransferState *ncq_tfs)
>>>  
>>>      switch (ncq_tfs->cmd) {
>>>      case READ_FPDMA_QUEUED:
>>> +        sprintf(device_id, "%lu", ide_state->wwn);
>>
>> This seems suspicious for your design in general, but I'd like to not
>> run sprintf to a buffer in the hotpath for NCQ.
> 
> I totally agree.
> 
> I started out using integers in the call for error_in_read, as that is
> what SCSI uses internally for wwn.  Then I did NVMe and it's using a
> string that doesn't apparently need to be an integer for the wwn? so I
> changed it to being a string to accommodate.
> 
> I also find it interesting that when a SATA device wwid is dumped out
> within the guest it doesn't appear to have any correlation with the wwn
> that was passed in on the command line, eg.
> 
> -device ide-drive,drive=satadisk,bus=ahci.0,wwn=8675309
> 
> $cat /sys/block/sda/device/wwid
> t10.ATA     QEMU HARDDISK                           QM00005
> 
> 
> This does correlate for SCSI
> 
> -device scsi-hd,drive=hd,wwn=12345678
> 
> $ cat /sys/block/sdc/device/wwid
> naa.0000000000bc614e
> 
> 
> as 0xbc614e = 12345678
> 
> 
> For NVMe, the wwn is in the wwid, but it's not immediately obvious.
> 
> Being able to correlate between the command line and what you find in
> the guest would be good.
> 
> 
>> If you need this, I'd do it when wwn is set and just grab it from the
>> state structure.
>>
>>> +
>>> +        if (error_in_read(device_id, ncq_tfs->lba,
>>> +                ncq_tfs->sector_count, &error_sector)) {
>>> +            ncq_media_err(ncq_tfs, error_sector);
>>> +            return;
>>> +        }
>>> +
>>
>> One of the downsides to trying to trigger read error injections
>> per-device instead of per-node is that now you have to goof around with
>> device-specific code everywhere>
>> I suppose from your cover letter you *WANT* device-specific error
>> exercising, which would necessitate a different design from blkdebug,
>> but it means you have to add support for the framework per-device and it
>> might be very tricky to get right.
> 
> Yes, goal was to be able to selectively pick one or more specific block
> devices and then create one or more block errors on each device with
> potentially different error behavior for each block in error.
> 

Yeah -- I imagine you want to see how the kernel will respond to various
error situations posed by certain block devices themselves.

Well... In general, we support rerror=stop and werror=stop in all of our
block emulators that we care about, and you can find the choke points
that implement this by looking for e.g. 'BLOCK_ERROR_ACTION_STOP' --
these error-handling points should already be kind to the various
emulator state machines; the emulators know how to manage the error
actions correctly from this point.

I think whatever solution you end up on should look to expand these
choke points instead of trying to add new ones.

... maybe. maybe it's still too hard to weave together in that way, I'm
not sure. I only care about IDE/ATA devices, so it's just a
recommendation to try to use the existing error handling points as a
basis instead of trying to add new ones.

I think adding new ones will be hard to verify for correctness and hard
to test.

--js

> 
>>>          trace_execute_ncq_command_read(ad->hba, port, ncq_tfs->tag,
>>>                                         ncq_tfs->sector_count, ncq_tfs->lba);
>>>          dma_acct_start(ide_state->blk, &ncq_tfs->acct,
>>>
> 

Re: [RFC 4/4] ahci media error reporting
Posted by Kevin Wolf 6 years, 4 months ago
Am 19.09.2019 um 22:43 hat John Snow geschrieben:
> I'd have to check -- because I can't say the AHCI emulator was designed
> so much as congealed -- but you might need calls to ncq_finish.
> 
> usually, ncq_cb handles the return from any NCQ command and will call
> ncq_err and ncq_finish as appropriate to tidy up the command.
> 
> It might be a mistake that execute_ncq_command issues ncq_err in the
> `default` arm of the switch statement without a call to finish.
> 
> If we do call ncq_finish from this context I'm not sure if we want
> block_acct_done here unconditionally. We may not have started a block
> accounting operation if we never started a backend operation. Everything
> else looks about right to me.

With that much uncertainty, the one thing I'm pretty certain of is that
someone (TM) should write some qtests - if only to figure out what
really happens.

Kevin

Re: [RFC 4/4] ahci media error reporting
Posted by John Snow 6 years, 4 months ago

On 9/20/19 4:43 AM, Kevin Wolf wrote:
> Am 19.09.2019 um 22:43 hat John Snow geschrieben:
>> I'd have to check -- because I can't say the AHCI emulator was designed
>> so much as congealed -- but you might need calls to ncq_finish.
>>
>> usually, ncq_cb handles the return from any NCQ command and will call
>> ncq_err and ncq_finish as appropriate to tidy up the command.
>>
>> It might be a mistake that execute_ncq_command issues ncq_err in the
>> `default` arm of the switch statement without a call to finish.
>>
>> If we do call ncq_finish from this context I'm not sure if we want
>> block_acct_done here unconditionally. We may not have started a block
>> accounting operation if we never started a backend operation. Everything
>> else looks about right to me.
> 
> With that much uncertainty, the one thing I'm pretty certain of is that
> someone (TM) should write some qtests - if only to figure out what
> really happens.
> 

For sure -- I handle the normative cases, but I don't test what happens
if you issue an unsupported NCQ command. (I don't know what real
hardware does right now, either. I'm sure I could read the spec and find
out, but don't have a testing setup that lets me analyze real hardware
anymore.)

I will have to defer this to someone (TM), but I suspect the code (that
I suspect was used as a basis for inserting a new error pathway in this
patch) is wrong and is missing a call to ncq_finish -- but that call
needs to not call the block accounting cleanup, because we didn't start
an operation in this case.

That's my Official Hunch.

--js

Re: [RFC 4/4] ahci media error reporting
Posted by Tony Asleson 6 years, 4 months ago
On 9/20/19 11:18 AM, John Snow wrote:
> For sure -- I handle the normative cases, but I don't test what happens
> if you issue an unsupported NCQ command. (I don't know what real
> hardware does right now, either. I'm sure I could read the spec and find
> out, but don't have a testing setup that lets me analyze real hardware
> anymore.)

Regardless of what actual hardware does, it's always good to see what
the spec says as hardware folks get it wrong too sometimes :-)

-Tony

Re: [RFC 4/4] ahci media error reporting
Posted by John Snow 6 years, 4 months ago

On 9/20/19 3:25 PM, Tony Asleson wrote:
> On 9/20/19 11:18 AM, John Snow wrote:
>> For sure -- I handle the normative cases, but I don't test what happens
>> if you issue an unsupported NCQ command. (I don't know what real
>> hardware does right now, either. I'm sure I could read the spec and find
>> out, but don't have a testing setup that lets me analyze real hardware
>> anymore.)
> 
> Regardless of what actual hardware does, it's always good to see what
> the spec says as hardware folks get it wrong too sometimes :-)
> 
> -Tony
> 

That depends. If we're emulating an "AHCI device", we should follow the
spec, but the one we instantiate in QEMU is the ICH9 version, so we try
to follow the hardware.

I do wind up looking at both, but because the relevant specs are split
across SATA, AHCI, ATA, ATAPI and into SCSI sometimes, (plus sub-specs)
it can be hard to piece together a holistic picture of what should
happen sometimes.

Looking at hardware is sometimes quicker and more definitive :)

Anyway, none of this helps you out much, but I will at least stay tuned
to review AHCI code and can try to help you make better incisions.

Enjoy your weekend!

--js