[v7] PCI devices passthrough on Arm, part 3

[PATCH V7 09/11] vpci: add initial support for virtual PCI bus topology

Posted by Oleksandr Tyshchenko 3 years, 6 months ago

From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Assign SBDF to the PCI devices being passed through with bus 0.
The resulting topology is where PCIe devices reside on the bus 0 of the
root complex itself (embedded endpoints).
This implementation is limited to 32 devices which are allowed on
a single PCI bus.

Please note, that at the moment only function 0 of a multifunction
device can be passed through.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v6:
- re-work wrt new locking scheme
- OT: add ASSERT(pcidevs_write_locked()); to add_virtual_device()
Since v5:
- s/vpci_add_virtual_device/add_virtual_device and make it static
- call add_virtual_device from vpci_assign_device and do not use
  REGISTER_VPCI_INIT machinery
- add pcidevs_locked ASSERT
- use DECLARE_BITMAP for vpci_dev_assigned_map
Since v4:
- moved and re-worked guest sbdf initializers
- s/set_bit/__set_bit
- s/clear_bit/__clear_bit
- minor comment fix s/Virtual/Guest/
- added VPCI_MAX_VIRT_DEV constant (PCI_SLOT(~0) + 1) which will be used
  later for counting the number of MMIO handlers required for a guest
  (Julien)
Since v3:
 - make use of VPCI_INIT
 - moved all new code to vpci.c which belongs to it
 - changed open-coded 31 to PCI_SLOT(~0)
 - added comments and code to reject multifunction devices with
   functions other than 0
 - updated comment about vpci_dev_next and made it unsigned int
 - implement roll back in case of error while assigning/deassigning devices
 - s/dom%pd/%pd
Since v2:
 - remove casts that are (a) malformed and (b) unnecessary
 - add new line for better readability
 - remove CONFIG_HAS_VPCI_GUEST_SUPPORT ifdef's as the relevant vPCI
    functions are now completely gated with this config
 - gate common code with CONFIG_HAS_VPCI_GUEST_SUPPORT
New in v2
---
 xen/drivers/vpci/vpci.c | 70 ++++++++++++++++++++++++++++++++++++++++-
 xen/include/xen/sched.h |  8 +++++
 xen/include/xen/vpci.h  | 11 +++++++
 3 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index f683346285..d4601ecf9b 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -84,6 +84,9 @@ int vpci_add_handlers(struct pci_dev *pdev)
 
     INIT_LIST_HEAD(&pdev->vpci->handlers);
     spin_lock_init(&pdev->vpci->lock);
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    pdev->vpci->guest_sbdf.sbdf = ~0;
+#endif
 
     for ( i = 0; i < NUM_VPCI_INIT; i++ )
     {
@@ -99,6 +102,62 @@ int vpci_add_handlers(struct pci_dev *pdev)
 }
 
 #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+static int add_virtual_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    pci_sbdf_t sbdf = { 0 };
+    unsigned long new_dev_number;
+
+    if ( is_hardware_domain(d) )
+        return 0;
+
+    ASSERT(pcidevs_write_locked());
+
+    /*
+     * Each PCI bus supports 32 devices/slots at max or up to 256 when
+     * there are multi-function ones which are not yet supported.
+     */
+    if ( pdev->info.is_extfn )
+    {
+        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
+                 &pdev->sbdf);
+        return -EOPNOTSUPP;
+    }
+
+    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
+                                         VPCI_MAX_VIRT_DEV);
+    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
+        return -ENOSPC;
+
+    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
+
+    /*
+     * Both segment and bus number are 0:
+     *  - we emulate a single host bridge for the guest, e.g. segment 0
+     *  - with bus 0 the virtual devices are seen as embedded
+     *    endpoints behind the root complex
+     *
+     * TODO: add support for multi-function devices.
+     */
+    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
+    pdev->vpci->guest_sbdf = sbdf;
+
+    return 0;
+
+}
+
+static void vpci_remove_virtual_device(const struct pci_dev *pdev)
+{
+    ASSERT(pcidevs_write_locked());
+
+    if ( pdev->vpci )
+    {
+        __clear_bit(pdev->vpci->guest_sbdf.dev,
+                    &pdev->domain->vpci_dev_assigned_map);
+        pdev->vpci->guest_sbdf.sbdf = ~0;
+    }
+}
+
 /* Notify vPCI that device is assigned to guest. */
 int vpci_assign_device(struct pci_dev *pdev)
 {
@@ -111,8 +170,16 @@ int vpci_assign_device(struct pci_dev *pdev)
 
     rc = vpci_add_handlers(pdev);
     if ( rc )
-        vpci_deassign_device(pdev);
+        goto fail;
+
+    rc = add_virtual_device(pdev);
+    if ( rc )
+        goto fail;
+
+    return 0;
 
+ fail:
+    vpci_deassign_device(pdev);
     return rc;
 }
 
@@ -124,6 +191,7 @@ void vpci_deassign_device(struct pci_dev *pdev)
     if ( !has_vpci(pdev->domain) )
         return;
 
+    vpci_remove_virtual_device(pdev);
     vpci_remove_device(pdev);
 }
 #endif /* CONFIG_HAS_VPCI_GUEST_SUPPORT */
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index b9515eb497..a2848a5740 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -457,6 +457,14 @@ struct domain
 
 #ifdef CONFIG_HAS_PCI
     struct list_head pdev_list;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /*
+     * The bitmap which shows which device numbers are already used by the
+     * virtual PCI bus topology and is used to assign a unique SBDF to the
+     * next passed through virtual PCI device.
+     */
+    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
+#endif
 #endif
 
 #ifdef CONFIG_HAS_PASSTHROUGH
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index 1010f68c28..cc14b0086d 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -21,6 +21,13 @@ typedef int vpci_register_init_t(struct pci_dev *dev);
 
 #define VPCI_ECAM_BDF(addr)     (((addr) & 0x0ffff000) >> 12)
 
+/*
+ * Maximum number of devices supported by the virtual bus topology:
+ * each PCI bus supports 32 devices/slots at max or up to 256 when
+ * there are multi-function ones which are not yet supported.
+ */
+#define VPCI_MAX_VIRT_DEV       (PCI_SLOT(~0) + 1)
+
 #define REGISTER_VPCI_INIT(x, p)                \
   static vpci_register_init_t *const x##_entry  \
                __used_section(".data.vpci." p) = x
@@ -145,6 +152,10 @@ struct vpci {
             struct vpci_arch_msix_entry arch;
         } entries[];
     } *msix;
+#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
+    /* Guest SBDF of the device. */
+    pci_sbdf_t guest_sbdf;
+#endif
 #endif
 };
 
-- 
2.25.1

Re: [PATCH V7 09/11] vpci: add initial support for virtual PCI bus topology

Posted by Jan Beulich 3 years, 6 months ago

On 19.07.2022 19:42, Oleksandr Tyshchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Assign SBDF to the PCI devices being passed through with bus 0.
> The resulting topology is where PCIe devices reside on the bus 0 of the
> root complex itself (embedded endpoints).
> This implementation is limited to 32 devices which are allowed on
> a single PCI bus.
> 
> Please note, that at the moment only function 0 of a multifunction
> device can be passed through.

I've not been able to spot where this restriction is being enforced -
can you please point me at the respective code?

> @@ -99,6 +102,62 @@ int vpci_add_handlers(struct pci_dev *pdev)
>  }
>  
>  #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +static int add_virtual_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    pci_sbdf_t sbdf = { 0 };
> +    unsigned long new_dev_number;
> +
> +    if ( is_hardware_domain(d) )
> +        return 0;
> +
> +    ASSERT(pcidevs_write_locked());
> +
> +    /*
> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
> +     * there are multi-function ones which are not yet supported.
> +     */
> +    if ( pdev->info.is_extfn )
> +    {
> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
> +                 &pdev->sbdf);
> +        return -EOPNOTSUPP;
> +    }
> +
> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
> +                                         VPCI_MAX_VIRT_DEV);
> +    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
> +        return -ENOSPC;
> +
> +    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
> +
> +    /*
> +     * Both segment and bus number are 0:
> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
> +     *  - with bus 0 the virtual devices are seen as embedded
> +     *    endpoints behind the root complex
> +     *
> +     * TODO: add support for multi-function devices.
> +     */
> +    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
> +    pdev->vpci->guest_sbdf = sbdf;
> +
> +    return 0;
> +
> +}
> +
> +static void vpci_remove_virtual_device(const struct pci_dev *pdev)
> +{
> +    ASSERT(pcidevs_write_locked());
> +
> +    if ( pdev->vpci )
> +    {
> +        __clear_bit(pdev->vpci->guest_sbdf.dev,
> +                    &pdev->domain->vpci_dev_assigned_map);
> +        pdev->vpci->guest_sbdf.sbdf = ~0;
> +    }
> +}

Feels like I did comment on this before: When ...

> @@ -111,8 +170,16 @@ int vpci_assign_device(struct pci_dev *pdev)
>  
>      rc = vpci_add_handlers(pdev);
>      if ( rc )
> -        vpci_deassign_device(pdev);
> +        goto fail;

... this path is taken and hence ...

> +    rc = add_virtual_device(pdev);

... this is bypassed, ...

> +    if ( rc )
> +        goto fail;
> +
> +    return 0;
>  
> + fail:
> +    vpci_deassign_device(pdev);

... the function here will see guest_sbdf still as ~0, while pdev->vpci
is non-NULL. Therefore mistakenly bit 31 of vpci_dev_assigned_map will
be cleared.

> @@ -124,6 +191,7 @@ void vpci_deassign_device(struct pci_dev *pdev)
>      if ( !has_vpci(pdev->domain) )
>          return;
>  
> +    vpci_remove_virtual_device(pdev);
>      vpci_remove_device(pdev);
>  }

And other call sites of vpci_remove_device() do not have a need of
cleaning up guest_sbdf / vpci_dev_assigned_map? IOW I wonder if it
wouldn't be better to have vpci_remove_device() do this as well
(retaining - see my comment on the earlier patch) the simple aliasing
of vpci_deassign_device() to vpci_remove_device()).

> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -457,6 +457,14 @@ struct domain
>  
>  #ifdef CONFIG_HAS_PCI
>      struct list_head pdev_list;
> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
> +    /*
> +     * The bitmap which shows which device numbers are already used by the
> +     * virtual PCI bus topology and is used to assign a unique SBDF to the
> +     * next passed through virtual PCI device.
> +     */
> +    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
> +#endif
>  #endif

Hmm, yet another reason to keep sched.h including vpci.h, which
imo would better be dropped - sched.h already has way too many
dependencies. (Just a remark, not strictly a request to change
anything.)

Jan

Re: [PATCH V7 09/11] vpci: add initial support for virtual PCI bus topology

Posted by Oleksandr 3 years, 6 months ago

On 27.07.22 13:32, Jan Beulich wrote:


Hello Jan


> On 19.07.2022 19:42, Oleksandr Tyshchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Assign SBDF to the PCI devices being passed through with bus 0.
>> The resulting topology is where PCIe devices reside on the bus 0 of the
>> root complex itself (embedded endpoints).
>> This implementation is limited to 32 devices which are allowed on
>> a single PCI bus.
>>
>> Please note, that at the moment only function 0 of a multifunction
>> device can be passed through.
> I've not been able to spot where this restriction is being enforced -
> can you please point me at the respective code?

Nor have I found the respective code.

Could you please suggest a place where to put such enforcement (I guess, 
this should be present in the toolstack)?



>
>> @@ -99,6 +102,62 @@ int vpci_add_handlers(struct pci_dev *pdev)
>>   }
>>   
>>   #ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +static int add_virtual_device(struct pci_dev *pdev)
>> +{
>> +    struct domain *d = pdev->domain;
>> +    pci_sbdf_t sbdf = { 0 };
>> +    unsigned long new_dev_number;
>> +
>> +    if ( is_hardware_domain(d) )
>> +        return 0;
>> +
>> +    ASSERT(pcidevs_write_locked());
>> +
>> +    /*
>> +     * Each PCI bus supports 32 devices/slots at max or up to 256 when
>> +     * there are multi-function ones which are not yet supported.
>> +     */
>> +    if ( pdev->info.is_extfn )
>> +    {
>> +        gdprintk(XENLOG_ERR, "%pp: only function 0 passthrough supported\n",
>> +                 &pdev->sbdf);
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    new_dev_number = find_first_zero_bit(d->vpci_dev_assigned_map,
>> +                                         VPCI_MAX_VIRT_DEV);
>> +    if ( new_dev_number >= VPCI_MAX_VIRT_DEV )
>> +        return -ENOSPC;
>> +
>> +    __set_bit(new_dev_number, &d->vpci_dev_assigned_map);
>> +
>> +    /*
>> +     * Both segment and bus number are 0:
>> +     *  - we emulate a single host bridge for the guest, e.g. segment 0
>> +     *  - with bus 0 the virtual devices are seen as embedded
>> +     *    endpoints behind the root complex
>> +     *
>> +     * TODO: add support for multi-function devices.
>> +     */
>> +    sbdf.devfn = PCI_DEVFN(new_dev_number, 0);
>> +    pdev->vpci->guest_sbdf = sbdf;
>> +
>> +    return 0;
>> +
>> +}
>> +
>> +static void vpci_remove_virtual_device(const struct pci_dev *pdev)
>> +{
>> +    ASSERT(pcidevs_write_locked());
>> +
>> +    if ( pdev->vpci )
>> +    {
>> +        __clear_bit(pdev->vpci->guest_sbdf.dev,
>> +                    &pdev->domain->vpci_dev_assigned_map);
>> +        pdev->vpci->guest_sbdf.sbdf = ~0;
>> +    }
>> +}
> Feels like I did comment on this before: When ...
>
>> @@ -111,8 +170,16 @@ int vpci_assign_device(struct pci_dev *pdev)
>>   
>>       rc = vpci_add_handlers(pdev);
>>       if ( rc )
>> -        vpci_deassign_device(pdev);
>> +        goto fail;
> ... this path is taken and hence ...
>
>> +    rc = add_virtual_device(pdev);
> ... this is bypassed, ...
>
>> +    if ( rc )
>> +        goto fail;
>> +
>> +    return 0;
>>   
>> + fail:
>> +    vpci_deassign_device(pdev);
> ... the function here will see guest_sbdf still as ~0, while pdev->vpci
> is non-NULL. Therefore mistakenly bit 31 of vpci_dev_assigned_map will
> be cleared.


Indeed, good catch, thanks! I assume this can be just fixed by extending 
a check in vpci_remove_virtual_device():

if ( pdev->vpci && (pdev->vpci->guest_sbdf.sbdf != ~0) )




>
>> @@ -124,6 +191,7 @@ void vpci_deassign_device(struct pci_dev *pdev)
>>       if ( !has_vpci(pdev->domain) )
>>           return;
>>   
>> +    vpci_remove_virtual_device(pdev);
>>       vpci_remove_device(pdev);
>>   }
> And other call sites of vpci_remove_device() do not have a need of
> cleaning up guest_sbdf / vpci_dev_assigned_map?

I am not 100% sure, but it looks like they don't need. On the other 
hand, even if they don't need that, doing the cleaning won't be an issue 
at all,

there is a check before cleaning (which will be extended as I proposed 
above), so ...


> IOW I wonder if it
> wouldn't be better to have vpci_remove_device() do this as well
> (retaining - see my comment on the earlier patch) the simple aliasing
> of vpci_deassign_device() to vpci_remove_device()).


    ... maybe yes. Shall I do that change?



>
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -457,6 +457,14 @@ struct domain
>>   
>>   #ifdef CONFIG_HAS_PCI
>>       struct list_head pdev_list;
>> +#ifdef CONFIG_HAS_VPCI_GUEST_SUPPORT
>> +    /*
>> +     * The bitmap which shows which device numbers are already used by the
>> +     * virtual PCI bus topology and is used to assign a unique SBDF to the
>> +     * next passed through virtual PCI device.
>> +     */
>> +    DECLARE_BITMAP(vpci_dev_assigned_map, VPCI_MAX_VIRT_DEV);
>> +#endif
>>   #endif
> Hmm, yet another reason to keep sched.h including vpci.h, which
> imo would better be dropped - sched.h already has way too many
> dependencies. (Just a remark, not strictly a request to change
> anything.)


I see.


>
> Jan

-- 
Regards,

Oleksandr Tyshchenko

Re: [PATCH V7 09/11] vpci: add initial support for virtual PCI bus topology

Posted by Jan Beulich 3 years, 6 months ago

On 28.07.2022 16:16, Oleksandr wrote:
> On 27.07.22 13:32, Jan Beulich wrote:
>> On 19.07.2022 19:42, Oleksandr Tyshchenko wrote:
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> Assign SBDF to the PCI devices being passed through with bus 0.
>>> The resulting topology is where PCIe devices reside on the bus 0 of the
>>> root complex itself (embedded endpoints).
>>> This implementation is limited to 32 devices which are allowed on
>>> a single PCI bus.
>>>
>>> Please note, that at the moment only function 0 of a multifunction
>>> device can be passed through.
>> I've not been able to spot where this restriction is being enforced -
>> can you please point me at the respective code?
> 
> Nor have I found the respective code.
> 
> Could you please suggest a place where to put such enforcement (I guess, 
> this should be present in the toolstack)?

Such check should be in the tool stack primarily to give a sensible
error message to the user. Yet the hypervisor needs to check itself
nevertheless. You know the code you're adding much better than I do,
so I guess I'm a little puzzled by you asking me to suggest a place.
(And for the tool stack I guess asking tool stack folks would get
you better mileage.)

>>> @@ -124,6 +191,7 @@ void vpci_deassign_device(struct pci_dev *pdev)
>>>       if ( !has_vpci(pdev->domain) )
>>>           return;
>>>   
>>> +    vpci_remove_virtual_device(pdev);
>>>       vpci_remove_device(pdev);
>>>   }
>> And other call sites of vpci_remove_device() do not have a need of
>> cleaning up guest_sbdf / vpci_dev_assigned_map?
> 
> I am not 100% sure, but it looks like they don't need. On the other 
> hand, even if they don't need that, doing the cleaning won't be an issue 
> at all,
> 
> there is a check before cleaning (which will be extended as I proposed 
> above), so ...
> 
> 
>> IOW I wonder if it
>> wouldn't be better to have vpci_remove_device() do this as well
>> (retaining - see my comment on the earlier patch) the simple aliasing
>> of vpci_deassign_device() to vpci_remove_device()).
> 
> 
>     ... maybe yes. Shall I do that change?

Well - yes please, afaic.

Jan

Re: [PATCH V7 09/11] vpci: add initial support for virtual PCI bus topology

Posted by Oleksandr 3 years, 6 months ago

On 28.07.22 17:26, Jan Beulich wrote:

Hello Jan

> On 28.07.2022 16:16, Oleksandr wrote:
>> On 27.07.22 13:32, Jan Beulich wrote:
>>> On 19.07.2022 19:42, Oleksandr Tyshchenko wrote:
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> Assign SBDF to the PCI devices being passed through with bus 0.
>>>> The resulting topology is where PCIe devices reside on the bus 0 of the
>>>> root complex itself (embedded endpoints).
>>>> This implementation is limited to 32 devices which are allowed on
>>>> a single PCI bus.
>>>>
>>>> Please note, that at the moment only function 0 of a multifunction
>>>> device can be passed through.
>>> I've not been able to spot where this restriction is being enforced -
>>> can you please point me at the respective code?
>> Nor have I found the respective code.
>>
>> Could you please suggest a place where to put such enforcement (I guess,
>> this should be present in the toolstack)?
> Such check should be in the tool stack primarily to give a sensible
> error message to the user. Yet the hypervisor needs to check itself
> nevertheless. You know the code you're adding much better than I do,
> so I guess I'm a little puzzled by you asking me to suggest a place.
> (And for the tool stack I guess asking tool stack folks would get
> you better mileage.)

Thanks for the clarification.


I am still getting used to the changes which that patch series makes (I 
didn't write that code).

Asking for suggestion I didn't mean to point an exact place in the code, 
but rather a subsystem/software layer,

sorry if I was unclear.


>
>>>> @@ -124,6 +191,7 @@ void vpci_deassign_device(struct pci_dev *pdev)
>>>>        if ( !has_vpci(pdev->domain) )
>>>>            return;
>>>>    
>>>> +    vpci_remove_virtual_device(pdev);
>>>>        vpci_remove_device(pdev);
>>>>    }
>>> And other call sites of vpci_remove_device() do not have a need of
>>> cleaning up guest_sbdf / vpci_dev_assigned_map?
>> I am not 100% sure, but it looks like they don't need. On the other
>> hand, even if they don't need that, doing the cleaning won't be an issue
>> at all,
>>
>> there is a check before cleaning (which will be extended as I proposed
>> above), so ...
>>
>>
>>> IOW I wonder if it
>>> wouldn't be better to have vpci_remove_device() do this as well
>>> (retaining - see my comment on the earlier patch) the simple aliasing
>>> of vpci_deassign_device() to vpci_remove_device()).
>>
>>      ... maybe yes. Shall I do that change?
> Well - yes please, afaic.


ok, will do


>
> Jan

-- 
Regards,

Oleksandr Tyshchenko