[PATCH 0/1] pcie: Allow atomic completion on PCIE root port

robin@streamhpc.com posted 1 patch 1 year ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20230420153839.167418-1-robin@streamhpc.com
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
hw/pci-bridge/gen_pcie_root_port.c | 2 ++
hw/pci/pcie.c                      | 6 ++++++
include/hw/pci/pcie_port.h         | 3 +++
3 files changed, 11 insertions(+)
[PATCH 0/1] pcie: Allow atomic completion on PCIE root port
Posted by robin@streamhpc.com 1 year ago
From: Robin Voetter <robin@streamhpc.com>

The ROCm driver for Linux uses PCIe atomics to schedule work and
generally communicate between the host and the device.  This does not
currently work in QEMU with regular vfio-pci passthrough, because the
pcie-root-port does not advertise the PCIe atomic completer
capabilities.  When initializing the GPU from the Linux driver, it
queries whether the PCIe connection from the CPU to GPU supports the
required capabilities[1] in the pci_enable_atomic_ops_to_root
function[2].  Currently the only part where this fails is checking the
atomic completer capabilities (32 and 64 bits) on the root port[3].  In
this case, the driver determines that PCIe atomics are not supported at
all, and this causes ROCm programs to misbehave.  (While AMD advertises
that there is some support for ROCm without PCIe atomics, I have never
actually gotten that working...)

This patch allows ROCm to properly function by introducing an
additional experimental property to the pcie-root-port,
x-atomic-completion.  Setting this option makes the port report
support for the PCI_EXP_DEVCAP2_ATOMIC_COMP32 and COMP64
capabilities.  This then makes the check from [3] pass, and
everything seems to work appropriately after that.

To verify that the capabilities are reported correctly, one can use
lspci to check the capabilities of the root port: lspci -vvv -s <root
port id> should show 32bit+ and 64bit+ capabilities in DevCap2 when
x-atomic-completion is enabled.  For example:

    -device pcie-root-port,x-atomic-completion=true,id.pcie.1

The output of lspci should include the following for the pcie root port:

    AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-

To verify that ROCm works, the following HIP program should be
sufficient.  The work is scheduled to the GPU by signaling a semaphore
using atomic operations from the CPU side, which is completed on the
GPU, and the GPU-side printf works by signaling a semaphore from the GPU
that is completed on the CPU.  It can be compiled using hipcc with
'hipcc -otest test.hip':

    #include <hip/hip_runtime.h>
    __global__ void test() {
        printf("hello, world\n");
    }
    int main() {
        test<<<dim3(1), dim3(1)>>>();
        hipDeviceSynchronize();
    }

Previously, or when x-atomic-completion is set to false, this program
would simply hang.  Additionally, a message along the lines of the
following is printed to dmesg during boot if the GPU driver determines
that atomics are not supported:

    amdgpu 0000:01:00.0: amdgpu: PCIE atomic ops is not supported

When atomics are properly supported, the above program works as
intended, and the previous dmesg message is of course not printed. For
this I am using a simple machine setup using the following device
options, with the GPU that im testing with of course on 03:00.0.

     -device pcie-root-port,x-atomic-completion=true,id=pcie.1
     -device vfio-pci,host=03:00.0,bus=pcie.1

This patch does not include any automatic detection whether the root
port of the host supports the atomic completer capabilities, nor if any
of the physical PCIe bridges between the CPU and GPU support atomic
routing.  The intention here is that the user should make sure that the
host does support atomic completion on the root complex. See also some
prior discussion[4].  I have run the full test suite of some ROCm
libraries: rocPRIM, rocRAND, hipRAND, hipCUB and rocThrust.  All of the
tests pass now, with some minor unrelated changes.

Kind regards,

Robin Voetter, Stream HPC

[1] https://github.com/torvalds/linux/blob/v6.2/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L3716
[2] https://github.com/torvalds/linux/blob/v6.2/drivers/pci/pci.c#L3781
[3] https://github.com/torvalds/linux/blob/v6.2/drivers/pci/pci.c#L3829
[4] https://lists.nongnu.org/archive/html/qemu-devel/2023-04/msg01815.html
---

Robin Voetter (1):
  pcie: Allow generic PCIE root port to enable atomic completion

 hw/pci-bridge/gen_pcie_root_port.c | 2 ++
 hw/pci/pcie.c                      | 6 ++++++
 include/hw/pci/pcie_port.h         | 3 +++
 3 files changed, 11 insertions(+)

--
2.39.2
Re: [PATCH 0/1] pcie: Allow atomic completion on PCIE root port
Posted by Michael S. Tsirkin 1 year ago
On Thu, Apr 20, 2023 at 05:38:39PM +0200, robin@streamhpc.com wrote:
> From: Robin Voetter <robin@streamhpc.com>
> 
> The ROCm driver for Linux uses PCIe atomics to schedule work and
> generally communicate between the host and the device.  This does not
> currently work in QEMU with regular vfio-pci passthrough, because the
> pcie-root-port does not advertise the PCIe atomic completer
> capabilities.  When initializing the GPU from the Linux driver, it
> queries whether the PCIe connection from the CPU to GPU supports the
> required capabilities[1] in the pci_enable_atomic_ops_to_root
> function[2].  Currently the only part where this fails is checking the
> atomic completer capabilities (32 and 64 bits) on the root port[3].  In
> this case, the driver determines that PCIe atomics are not supported at
> all, and this causes ROCm programs to misbehave.  (While AMD advertises
> that there is some support for ROCm without PCIe atomics, I have never
> actually gotten that working...)
> 
> This patch allows ROCm to properly function by introducing an
> additional experimental property to the pcie-root-port,
> x-atomic-completion.

so what exactly makes it experimental? from this description
it looks like it actually has to be enabled for things to work?
Also pls CC alex on whether this is a correct way to do it.

>  Setting this option makes the port report
> support for the PCI_EXP_DEVCAP2_ATOMIC_COMP32 and COMP64
> capabilities.  This then makes the check from [3] pass, and
> everything seems to work appropriately after that.
> 
> To verify that the capabilities are reported correctly, one can use
> lspci to check the capabilities of the root port: lspci -vvv -s <root
> port id> should show 32bit+ and 64bit+ capabilities in DevCap2 when
> x-atomic-completion is enabled.  For example:
> 
>     -device pcie-root-port,x-atomic-completion=true,id.pcie.1
> 
> The output of lspci should include the following for the pcie root port:
> 
>     AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
> 
> To verify that ROCm works, the following HIP program should be
> sufficient.  The work is scheduled to the GPU by signaling a semaphore
> using atomic operations from the CPU side, which is completed on the
> GPU, and the GPU-side printf works by signaling a semaphore from the GPU
> that is completed on the CPU.  It can be compiled using hipcc with
> 'hipcc -otest test.hip':
> 
>     #include <hip/hip_runtime.h>
>     __global__ void test() {
>         printf("hello, world\n");
>     }
>     int main() {
>         test<<<dim3(1), dim3(1)>>>();
>         hipDeviceSynchronize();
>     }
> 
> Previously, or when x-atomic-completion is set to false, this program
> would simply hang.  Additionally, a message along the lines of the
> following is printed to dmesg during boot if the GPU driver determines
> that atomics are not supported:
> 
>     amdgpu 0000:01:00.0: amdgpu: PCIE atomic ops is not supported
> 
> When atomics are properly supported, the above program works as
> intended, and the previous dmesg message is of course not printed. For
> this I am using a simple machine setup using the following device
> options, with the GPU that im testing with of course on 03:00.0.
> 
>      -device pcie-root-port,x-atomic-completion=true,id=pcie.1
>      -device vfio-pci,host=03:00.0,bus=pcie.1
> 
> This patch does not include any automatic detection whether the root
> port of the host supports the atomic completer capabilities, nor if any
> of the physical PCIe bridges between the CPU and GPU support atomic
> routing.  The intention here is that the user should make sure that the
> host does support atomic completion on the root complex. See also some
> prior discussion[4].  I have run the full test suite of some ROCm
> libraries: rocPRIM, rocRAND, hipRAND, hipCUB and rocThrust.  All of the
> tests pass now, with some minor unrelated changes.
> 
> Kind regards,
> 
> Robin Voetter, Stream HPC
> 
> [1] https://github.com/torvalds/linux/blob/v6.2/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L3716
> [2] https://github.com/torvalds/linux/blob/v6.2/drivers/pci/pci.c#L3781
> [3] https://github.com/torvalds/linux/blob/v6.2/drivers/pci/pci.c#L3829
> [4] https://lists.nongnu.org/archive/html/qemu-devel/2023-04/msg01815.html
> ---
> 
> Robin Voetter (1):
>   pcie: Allow generic PCIE root port to enable atomic completion
> 
>  hw/pci-bridge/gen_pcie_root_port.c | 2 ++
>  hw/pci/pcie.c                      | 6 ++++++
>  include/hw/pci/pcie_port.h         | 3 +++
>  3 files changed, 11 insertions(+)
> 
> --
> 2.39.2
Re: [PATCH 0/1] pcie: Allow atomic completion on PCIE root port
Posted by Robin Voetter 1 year ago

On 4/21/23 10:22, Michael S. Tsirkin wrote:
> On Thu, Apr 20, 2023 at 05:38:39PM +0200, robin@streamhpc.com wrote:
>> From: Robin Voetter <robin@streamhpc.com>
>>
>> The ROCm driver for Linux uses PCIe atomics to schedule work and
>> generally communicate between the host and the device.  This does not
>> currently work in QEMU with regular vfio-pci passthrough, because the
>> pcie-root-port does not advertise the PCIe atomic completer
>> capabilities.  When initializing the GPU from the Linux driver, it
>> queries whether the PCIe connection from the CPU to GPU supports the
>> required capabilities[1] in the pci_enable_atomic_ops_to_root
>> function[2].  Currently the only part where this fails is checking the
>> atomic completer capabilities (32 and 64 bits) on the root port[3].  In
>> this case, the driver determines that PCIe atomics are not supported at
>> all, and this causes ROCm programs to misbehave.  (While AMD advertises
>> that there is some support for ROCm without PCIe atomics, I have never
>> actually gotten that working...)
>>
>> This patch allows ROCm to properly function by introducing an
>> additional experimental property to the pcie-root-port,
>> x-atomic-completion.
> 
> so what exactly makes it experimental? from this description
> it looks like it actually has to be enabled for things to work?

I was not sure which would be appropriate, but I'm fine with making it a
non-experimental option.
Re: [PATCH 0/1] pcie: Allow atomic completion on PCIE root port
Posted by Michael S. Tsirkin 11 months, 2 weeks ago
On Fri, Apr 21, 2023 at 06:06:49PM +0200, Robin Voetter wrote:
> 
> 
> On 4/21/23 10:22, Michael S. Tsirkin wrote:
> > On Thu, Apr 20, 2023 at 05:38:39PM +0200, robin@streamhpc.com wrote:
> >> From: Robin Voetter <robin@streamhpc.com>
> >>
> >> The ROCm driver for Linux uses PCIe atomics to schedule work and
> >> generally communicate between the host and the device.  This does not
> >> currently work in QEMU with regular vfio-pci passthrough, because the
> >> pcie-root-port does not advertise the PCIe atomic completer
> >> capabilities.  When initializing the GPU from the Linux driver, it
> >> queries whether the PCIe connection from the CPU to GPU supports the
> >> required capabilities[1] in the pci_enable_atomic_ops_to_root
> >> function[2].  Currently the only part where this fails is checking the
> >> atomic completer capabilities (32 and 64 bits) on the root port[3].  In
> >> this case, the driver determines that PCIe atomics are not supported at
> >> all, and this causes ROCm programs to misbehave.  (While AMD advertises
> >> that there is some support for ROCm without PCIe atomics, I have never
> >> actually gotten that working...)
> >>
> >> This patch allows ROCm to properly function by introducing an
> >> additional experimental property to the pcie-root-port,
> >> x-atomic-completion.
> > 
> > so what exactly makes it experimental? from this description
> > it looks like it actually has to be enabled for things to work?
> 
> I was not sure which would be appropriate, but I'm fine with making it a
> non-experimental option.

So I guess the real thing to do is to query this from vfio right?
Unfortunately we don't have access to vfio when we
are creating the root port, but I think the thing to do would
be to check at the time when vfio is attached, and if
atomic is set but not supported, fail attaching vfio.

Right?

-- 
MST
Re: [PATCH 0/1] pcie: Allow atomic completion on PCIE root port
Posted by Alex Williamson 11 months, 2 weeks ago
On Thu, 18 May 2023 16:03:07 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Fri, Apr 21, 2023 at 06:06:49PM +0200, Robin Voetter wrote:
> > 
> > 
> > On 4/21/23 10:22, Michael S. Tsirkin wrote:  
> > > On Thu, Apr 20, 2023 at 05:38:39PM +0200, robin@streamhpc.com wrote:  
> > >> From: Robin Voetter <robin@streamhpc.com>
> > >>
> > >> The ROCm driver for Linux uses PCIe atomics to schedule work and
> > >> generally communicate between the host and the device.  This does not
> > >> currently work in QEMU with regular vfio-pci passthrough, because the
> > >> pcie-root-port does not advertise the PCIe atomic completer
> > >> capabilities.  When initializing the GPU from the Linux driver, it
> > >> queries whether the PCIe connection from the CPU to GPU supports the
> > >> required capabilities[1] in the pci_enable_atomic_ops_to_root
> > >> function[2].  Currently the only part where this fails is checking the
> > >> atomic completer capabilities (32 and 64 bits) on the root port[3].  In
> > >> this case, the driver determines that PCIe atomics are not supported at
> > >> all, and this causes ROCm programs to misbehave.  (While AMD advertises
> > >> that there is some support for ROCm without PCIe atomics, I have never
> > >> actually gotten that working...)
> > >>
> > >> This patch allows ROCm to properly function by introducing an
> > >> additional experimental property to the pcie-root-port,
> > >> x-atomic-completion.  
> > > 
> > > so what exactly makes it experimental? from this description
> > > it looks like it actually has to be enabled for things to work?  
> > 
> > I was not sure which would be appropriate, but I'm fine with making it a
> > non-experimental option.  
> 
> So I guess the real thing to do is to query this from vfio right?
> Unfortunately we don't have access to vfio when we
> are creating the root port, but I think the thing to do would
> be to check at the time when vfio is attached, and if
> atomic is set but not supported, fail attaching vfio.
> 
> Right?

We don't currently provide a way to query this in vfio, but I imagine
we could call pci_enable_atomic_ops_to_root() in the host kernel
ourselves with various sizes and expose which are supported via a
capability on the vfio-device.  I'm not sure what we do for VFs though
since that function is invalid for them (maybe worry about them later).
I also see that one of the in-kernel drivers (mlx5) tries to enable
128-bit support, so I wonder if we want separate options for 32/64-bit
and 128-bit.

QEMU device options are clearly the most straightforward path to enable
this, but would it actually make sense, perhaps in addition, to
implement the above in the kernel and then have the QEMU vfio-pci
driver enable the available completer bits in the root port during
realize?  We could probably get away with it on hotplug, but if
necessary it could be something we only do for cold-plug devices (we
also have a no-hotplug vfio-pci variant if we're concerned what happens
after the device is removed in the VM - again, be could probably get
away with clearing the bits on unplug).

I'm not entirely sure where we stand in QEMU on whether options that
can cause poor behavior should always be experimental or we allow users
to shoot themselves in the foot as they please.  Obviously it makes it
more difficult for libvirt to support such configurations, but maybe
they'd rely on the above automatic enabling rather than try to guess
themselves.  Thanks,

Alex