[PATCH v2] vfio/pci: migration: Skip config space check for vendor specific capability during restore/load

Vinayak Kale posted 1 patch 8 months, 2 weeks ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20240311121519.1481732-1-vkale@nvidia.com
Maintainers: Alex Williamson <alex.williamson@redhat.com>, "Cédric Le Goater" <clg@redhat.com>
hw/vfio/pci.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
[PATCH v2] vfio/pci: migration: Skip config space check for vendor specific capability during restore/load
Posted by Vinayak Kale 8 months, 2 weeks ago
During migration, the restore operation compares the PCI device's config space
against the config space captured in the migration stream during save. If the
two do not match, the restore fails.

The comparison is done in get_pci_config_device(). By default, the VSC
(vendor-specific capability) in config space is included in the comparison.

Ideally, QEMU should not check the VSC of a VFIO-PCI device during restore/load,
because QEMU is not aware of the VSC ABI.

This patch skips the check for VFIO-PCI devices by clearing pdev->cmask[] at the
VSC offsets. QEMU skips the config space check for any offset whose cmask[]
bits are clear.

Signed-off-by: Vinayak Kale <vkale@nvidia.com>
---
Version History
v1->v2:
    - Limited scope of change to vfio-pci devices instead of all pci devices.

 hw/vfio/pci.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index d7fe06715c..9edaff4b37 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2132,6 +2132,22 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
     }
 }
 
+static int vfio_add_vendor_specific_cap(VFIOPCIDevice *vdev, int pos,
+                                        uint8_t size, Error **errp)
+{
+    PCIDevice *pdev = &vdev->pdev;
+
+    pos = pci_add_capability(pdev, PCI_CAP_ID_VNDR, pos, size, errp);
+    if (pos < 0) {
+        return pos;
+    }
+
+    /* Exempt config space check for VSC during restore/load  */
+    memset(pdev->cmask + pos, 0, size);
+
+    return pos;
+}
+
 static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
 {
     PCIDevice *pdev = &vdev->pdev;
@@ -2199,6 +2215,9 @@ static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos, Error **errp)
         vfio_check_af_flr(vdev, pos);
         ret = pci_add_capability(pdev, cap_id, pos, size, errp);
         break;
+    case PCI_CAP_ID_VNDR:
+        ret = vfio_add_vendor_specific_cap(vdev, pos, size, errp);
+        break;
     default:
         ret = pci_add_capability(pdev, cap_id, pos, size, errp);
         break;
-- 
2.34.1
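
The cmask[] mechanism described in the commit message can be illustrated with a small standalone model. This is only a sketch under stated assumptions: the function name and parameters are hypothetical, and the real check in get_pci_config_device() also takes the write masks into account, which is left out here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified, hypothetical model of a masked config-space comparison.
 * QEMU's real check also consults the write masks; this sketch only
 * shows the role of cmask[].
 *
 * Returns the offset of the first byte that differs in a bit selected
 * by cmask[], or -1 if every checked bit matches.
 */
static int first_checked_mismatch(const uint8_t *local,
                                  const uint8_t *incoming,
                                  const uint8_t *cmask, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* XOR exposes differing bits; cmask keeps only the checked ones. */
        if ((local[i] ^ incoming[i]) & cmask[i]) {
            return (int)i;
        }
    }
    return -1;
}
```

With cmask[i] set to 0xff, any difference at offset i is reported as a mismatch; once cmask[i] is cleared to 0, the same difference is ignored, which is the effect the patch relies on for the VSC offsets.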
Re: [PATCH v2] vfio/pci: migration: Skip config space check for vendor specific capability during restore/load
Posted by Alex Williamson 8 months, 2 weeks ago
On Mon, 11 Mar 2024 17:45:19 +0530
Vinayak Kale <vkale@nvidia.com> wrote:

> In case of migration, during restore operation, qemu checks config space of the
> pci device with the config space in the migration stream captured during save
> operation. In case of config space data mismatch, restore operation is failed.
> 
> config space check is done in function get_pci_config_device(). By default VSC
> (vendor-specific-capability) in config space is checked.
> 
> Ideally qemu should not check VSC for VFIO-PCI device during restore/load as
> qemu is not aware of VSC ABI.

It's disappointing that we can't seem to have a discussion about why
it's not the responsibility of the underlying migration support in the
vfio-pci variant driver to make the vendor specific capability
consistent across migration.

Also, for future maintenance, specifically what device is currently
broken by this and under what conditions?

> 
> This patch skips the check for VFIO-PCI device by clearing pdev->cmask[] for VSC
> offsets. If cmask[] is not set for an offset, then qemu skips config space check
> for that offset.
> 
> Signed-off-by: Vinayak Kale <vkale@nvidia.com>
> ---
> Version History
> v1->v2:
>     - Limited scope of change to vfio-pci devices instead of all pci devices.
> 
>  hw/vfio/pci.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index d7fe06715c..9edaff4b37 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2132,6 +2132,22 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
>      }
>  }
>  
> +static int vfio_add_vendor_specific_cap(VFIOPCIDevice *vdev, int pos,
> +                                        uint8_t size, Error **errp)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    pos = pci_add_capability(pdev, PCI_CAP_ID_VNDR, pos, size, errp);
> +    if (pos < 0) {
> +        return pos;
> +    }
> +
> +    /* Exempt config space check for VSC during restore/load  */
> +    memset(pdev->cmask + pos, 0, size);

This excludes the entire capability from comparison, including the
capability ID, next pointer, and capability length.  Even if the
contents of the capability are considered volatile vendor information,
the header is spec defined ABI which must be consistent.  Thanks,

Alex
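
One way to keep the header under comparison while exempting only the vendor-defined payload is sketched below as a standalone model. It assumes the header consists of the three one-byte fields named above (capability ID, next pointer, capability length); the constant and function names are hypothetical, and this is not presented as the form any later revision of the patch takes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constant: ID, next pointer and length bytes of the header. */
#define VSC_HEADER_LEN 3

/*
 * Stop comparing only the vendor-defined payload of a capability that
 * starts at `pos` and spans `size` bytes. The header bytes keep their
 * cmask bits, so they are still compared on restore.
 */
static void exempt_vsc_payload(uint8_t *cmask, size_t pos, size_t size)
{
    /* Nothing to exempt if the capability is header-only. */
    if (size > VSC_HEADER_LEN) {
        memset(cmask + pos + VSC_HEADER_LEN, 0, size - VSC_HEADER_LEN);
    }
}
```

For a capability at offset 4 with size 8, this leaves cmask[4..6] untouched and clears cmask[7..11], so a mismatch in the ID, next pointer, or length would still fail the restore.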

Re: [PATCH v2] vfio/pci: migration: Skip config space check for vendor specific capability during restore/load
Posted by Vinayak Kale 8 months, 2 weeks ago

On 11/03/24 8:32 pm, Alex Williamson wrote:
> On Mon, 11 Mar 2024 17:45:19 +0530
> Vinayak Kale <vkale@nvidia.com> wrote:
> 
>> In case of migration, during restore operation, qemu checks config space of the
>> pci device with the config space in the migration stream captured during save
>> operation. In case of config space data mismatch, restore operation is failed.
>>
>> config space check is done in function get_pci_config_device(). By default VSC
>> (vendor-specific-capability) in config space is checked.
>>
>> Ideally qemu should not check VSC for VFIO-PCI device during restore/load as
>> qemu is not aware of VSC ABI.
> 
> It's disappointing that we can't seem to have a discussion about why
> it's not the responsibility of the underlying migration support in the
> vfio-pci variant driver to make the vendor specific capability
> consistent across migration.

I think it is the device vendor driver's responsibility to ensure that the VSC
is consistent across migration. Here, consistency could mean that the VSC
format is the same on source and destination, even though the actual VSC
contents may not be byte-for-byte identical.

If a vfio-pci device is migration capable and its vendor driver is OK with
volatile VSC contents as long as the VSC format stays consistent, then QEMU
should exempt the VSC contents from the config space check.

> 
> Also, for future maintenance, specifically what device is currently
> broken by this and under what conditions?

Under certain conditions, the VSC contents of NVIDIA vGPU devices vary across
live migration. Because of QEMU's current config space check for the VSC, live
migration between NVIDIA vGPU devices is broken.

> 
>>
>> This patch skips the check for VFIO-PCI device by clearing pdev->cmask[] for VSC
>> offsets. If cmask[] is not set for an offset, then qemu skips config space check
>> for that offset.
>>
>> Signed-off-by: Vinayak Kale <vkale@nvidia.com>
>> ---
>> Version History
>> v1->v2:
>>      - Limited scope of change to vfio-pci devices instead of all pci devices.
>>
>>   hw/vfio/pci.c | 19 +++++++++++++++++++
>>   1 file changed, 19 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index d7fe06715c..9edaff4b37 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -2132,6 +2132,22 @@ static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos)
>>       }
>>   }
>>
>> +static int vfio_add_vendor_specific_cap(VFIOPCIDevice *vdev, int pos,
>> +                                        uint8_t size, Error **errp)
>> +{
>> +    PCIDevice *pdev = &vdev->pdev;
>> +
>> +    pos = pci_add_capability(pdev, PCI_CAP_ID_VNDR, pos, size, errp);
>> +    if (pos < 0) {
>> +        return pos;
>> +    }
>> +
>> +    /* Exempt config space check for VSC during restore/load  */
>> +    memset(pdev->cmask + pos, 0, size);
> 
> This excludes the entire capability from comparison, including the
> capability ID, next pointer, and capability length.  Even if the
> contents of the capability are considered volatile vendor information,
> the header is spec defined ABI which must be consistent.  Thanks,

This makes sense, I'll address this in V3. Thanks.

Re: [PATCH v2] vfio/pci: migration: Skip config space check for vendor specific capability during restore/load
Posted by Alex Williamson 8 months, 1 week ago
On Fri, 15 Mar 2024 23:22:22 +0530
Vinayak Kale <vkale@nvidia.com> wrote:

> On 11/03/24 8:32 pm, Alex Williamson wrote:
> > On Mon, 11 Mar 2024 17:45:19 +0530
> > Vinayak Kale <vkale@nvidia.com> wrote:
> >   
> >> In case of migration, during restore operation, qemu checks config space of the
> >> pci device with the config space in the migration stream captured during save
> >> operation. In case of config space data mismatch, restore operation is failed.
> >>
> >> config space check is done in function get_pci_config_device(). By default VSC
> >> (vendor-specific-capability) in config space is checked.
> >>
> >> Ideally qemu should not check VSC for VFIO-PCI device during restore/load as
> >> qemu is not aware of VSC ABI.  
> > 
> > It's disappointing that we can't seem to have a discussion about why
> > it's not the responsibility of the underlying migration support in the
> > vfio-pci variant driver to make the vendor specific capability
> > consistent across migration.  
> 
> I think it is device vendor driver's responsibility to ensure that VSC 
> is consistent across migration. Here consistency could mean that VSC 
> format should be same on source and destination, however actual VSC 
> contents may not be byte-to-byte identical.
> 
> If a vfio-pci device is migration capable and if vfio-pci vendor driver 
> is OK with volatile VSC contents as long as consistency is maintained 
> for VSC format then QEMU should exempt config space check for VSC contents.

I tend to agree that ultimately the variant driver is responsible for
making the device consistent during migration and QEMU's policy that
even vendor defined ABI needs to be byte for byte identical is somewhat
arbitrary.

> > Also, for future maintenance, specifically what device is currently
> > broken by this and under what conditions?  
> 
> Under certain conditions VSC contents vary for NVIDIA vGPU devices in 
> case of live migration. Due to QEMU's current config space check for 
> VSC, live migration is broken across NVIDIA vGPU devices.

This is incredibly vague.  We've been testing NVIDIA vGPU migration and
have not experienced a migration failure due to VSC mismatch.  Does this
require a specific device?  A specific workload?  What specific
conditions trigger this problem?

While as above, I agree in theory that the responsibility lies on the
migration support in the variant driver, there are risks involved,
particularly if new dependencies on the VSC contents are developed in
the guest.  For future maintenance and development in this space, the
commit log should describe exactly the scenario that requires this
policy change.  Thanks,

Alex

Re: [PATCH v2] vfio/pci: migration: Skip config space check for vendor specific capability during restore/load
Posted by Vinayak Kale 8 months, 1 week ago

On 18/03/24 8:28 pm, Alex Williamson wrote:
> On Fri, 15 Mar 2024 23:22:22 +0530
> Vinayak Kale <vkale@nvidia.com> wrote:
> 
>> On 11/03/24 8:32 pm, Alex Williamson wrote:
>>> On Mon, 11 Mar 2024 17:45:19 +0530
>>> Vinayak Kale <vkale@nvidia.com> wrote:
>>>
>>>> In case of migration, during restore operation, qemu checks config space of the
>>>> pci device with the config space in the migration stream captured during save
>>>> operation. In case of config space data mismatch, restore operation is failed.
>>>>
>>>> config space check is done in function get_pci_config_device(). By default VSC
>>>> (vendor-specific-capability) in config space is checked.
>>>>
>>>> Ideally qemu should not check VSC for VFIO-PCI device during restore/load as
>>>> qemu is not aware of VSC ABI.
>>>
>>> It's disappointing that we can't seem to have a discussion about why
>>> it's not the responsibility of the underlying migration support in the
>>> vfio-pci variant driver to make the vendor specific capability
>>> consistent across migration.
>>
>> I think it is device vendor driver's responsibility to ensure that VSC
>> is consistent across migration. Here consistency could mean that VSC
>> format should be same on source and destination, however actual VSC
>> contents may not be byte-to-byte identical.
>>
>> If a vfio-pci device is migration capable and if vfio-pci vendor driver
>> is OK with volatile VSC contents as long as consistency is maintained
>> for VSC format then QEMU should exempt config space check for VSC contents.
> 
> I tend to agree that ultimately the variant driver is responsible for
> making the device consistent during migration and QEMU's policy that
> even vendor defined ABI needs to be byte for byte identical is somewhat
> arbitrary.
> 
>>> Also, for future maintenance, specifically what device is currently
>>> broken by this and under what conditions?
>>
>> Under certain conditions VSC contents vary for NVIDIA vGPU devices in
>> case of live migration. Due to QEMU's current config space check for
>> VSC, live migration is broken across NVIDIA vGPU devices.
> 
> This is incredibly vague.  We've been testing NVIDIA vGPU migration and
> have not experienced a migration failure due to VSC mismatch.  Does this
> require a specific device?  A specific workload?  What specific
> conditions trigger this problem?

During live migration, when the source and destination host drivers differ,
the Vendor Specific Information in the VSC varies on the destination so that
the vGPU feature capabilities exposed to the guest driver are compatible with
the destination host. This applies to all NVIDIA vGPU devices.

> 
> While as above, I agree in theory that the responsibility lies on the
> migration support in the variant driver, there are risks involved,
> particularly if new dependencies on the VSC contents are developed in
> the guest.  For future maintenance and development in this space, the
> commit log should describe exactly the scenario that requires this
> policy change.  Thanks,

I'll add the aforementioned scenario (the situation in which live migration is
broken for NVIDIA vGPU devices) to the commit description. Thanks.
