[RFC net-next 1/4] net: Allow non parent devices to be used for ZC DMA

Dragos Tatulea posted 4 patches 3 months, 1 week ago
[RFC net-next 1/4] net: Allow non parent devices to be used for ZC DMA
Posted by Dragos Tatulea 3 months, 1 week ago
For zerocopy (io_uring, devmem), there is an assumption that the
parent device can do DMA. However that is not always the case:
for example mlx5 SF devices have an auxiliary device as a parent.

This patch introduces the possibility for the driver to specify
another DMA device to be used via the new dma_dev field. The field
should be set before register_netdev().

A new helper function is added to get the DMA device or return NULL.
The callers can check for NULL and fail early if the device is
not capable of DMA.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 include/linux/netdevice.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5847c20994d3..83faa2314c30 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2550,6 +2550,9 @@ struct net_device {
 
 	struct hwtstamp_provider __rcu	*hwprov;
 
+	/* To be set by devices that can do DMA but not via parent. */
+	struct device		*dma_dev;
+
 	u8			priv[] ____cacheline_aligned
 				       __counted_by(priv_len);
 } ____cacheline_aligned;
@@ -5560,4 +5563,14 @@ extern struct net_device *blackhole_netdev;
 		atomic_long_add((VAL), &(DEV)->stats.__##FIELD)
 #define DEV_STATS_READ(DEV, FIELD) atomic_long_read(&(DEV)->stats.__##FIELD)
 
+static inline struct device *netdev_get_dma_dev(const struct net_device *dev)
+{
+	struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;
+
+	if (!dma_dev->dma_mask)
+		dma_dev = NULL;
+
+	return dma_dev;
+}
+
 #endif	/* _LINUX_NETDEVICE_H */
-- 
2.50.0
Re: [RFC net-next 1/4] net: Allow non parent devices to be used for ZC DMA
Posted by Pavel Begunkov 3 months ago
On 7/2/25 18:24, Dragos Tatulea wrote:
> For zerocopy (io_uring, devmem), there is an assumption that the
> parent device can do DMA. However that is not always the case:
> for example mlx5 SF devices have an auxiliary device as a parent.
> 
> This patch introduces the possibility for the driver to specify
> another DMA device to be used via the new dma_dev field. The field
> should be set before register_netdev().
> 
> A new helper function is added to get the DMA device or return NULL.
> The callers can check for NULL and fail early if the device is
> not capable of DMA.
> 
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> ---
>   include/linux/netdevice.h | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5847c20994d3..83faa2314c30 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2550,6 +2550,9 @@ struct net_device {
>   
>   	struct hwtstamp_provider __rcu	*hwprov;
>   
> +	/* To be set by devices that can do DMA but not via parent. */
> +	struct device		*dma_dev;
> +
>   	u8			priv[] ____cacheline_aligned
>   				       __counted_by(priv_len);
>   } ____cacheline_aligned;
> @@ -5560,4 +5563,14 @@ extern struct net_device *blackhole_netdev;
>   		atomic_long_add((VAL), &(DEV)->stats.__##FIELD)
>   #define DEV_STATS_READ(DEV, FIELD) atomic_long_read(&(DEV)->stats.__##FIELD)
>   
> +static inline struct device *netdev_get_dma_dev(const struct net_device *dev)
> +{
> +	struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;
> +
> +	if (!dma_dev->dma_mask)

dev->dev.parent is NULL for veth and I assume other virtual devices as well.

Mina, can you verify that devmem checks that? Seems like veth is rejected
by netdev_need_ops_lock() in netdev_nl_bind_rx_doit(), but IIRC per netdev
locking came after devmem got merged, and there are other virt devices that
might already be converted.

> +		dma_dev = NULL;
> +
> +	return dma_dev;
> +}
> +
>   #endif	/* _LINUX_NETDEVICE_H */

-- 
Pavel Begunkov
Re: [RFC net-next 1/4] net: Allow non parent devices to be used for ZC DMA
Posted by Mina Almasry 3 months ago
On Tue, Jul 8, 2025 at 4:05 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 7/2/25 18:24, Dragos Tatulea wrote:
> > For zerocopy (io_uring, devmem), there is an assumption that the
> > parent device can do DMA. However that is not always the case:
> > for example mlx5 SF devices have an auxiliary device as a parent.
> >
> > This patch introduces the possibility for the driver to specify
> > another DMA device to be used via the new dma_dev field. The field
> > should be set before register_netdev().
> >
> > A new helper function is added to get the DMA device or return NULL.
> > The callers can check for NULL and fail early if the device is
> > not capable of DMA.
> >
> > Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> > ---
> >   include/linux/netdevice.h | 13 +++++++++++++
> >   1 file changed, 13 insertions(+)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 5847c20994d3..83faa2314c30 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -2550,6 +2550,9 @@ struct net_device {
> >
> >       struct hwtstamp_provider __rcu  *hwprov;
> >
> > +     /* To be set by devices that can do DMA but not via parent. */
> > +     struct device           *dma_dev;
> > +
> >       u8                      priv[] ____cacheline_aligned
> >                                      __counted_by(priv_len);
> >   } ____cacheline_aligned;
> > @@ -5560,4 +5563,14 @@ extern struct net_device *blackhole_netdev;
> >               atomic_long_add((VAL), &(DEV)->stats.__##FIELD)
> >   #define DEV_STATS_READ(DEV, FIELD) atomic_long_read(&(DEV)->stats.__##FIELD)
> >
> > +static inline struct device *netdev_get_dma_dev(const struct net_device *dev)
> > +{
> > +     struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;
> > +
> > +     if (!dma_dev->dma_mask)
>
> dev->dev.parent is NULL for veth and I assume other virtual devices as well.
>
> Mina, can you verify that devmem checks that? Seems like veth is rejected
> by netdev_need_ops_lock() in netdev_nl_bind_rx_doit(), but IIRC per netdev
> locking came after devmem got merged, and there are other virt devices that
> might already be converted.
>

We never attempt devmem binding on any devices that don't support the
queue API, even before the per netdev locking was merged (there was an
explicit ops check).

Even then, dev->dev.parent == NULL isn't disastrous, as far as I
could surmise from a quick look. It seems to be used only with
dma_buf_attach(), which NULL-checks it.

-- 
Thanks,
Mina
Re: [RFC net-next 1/4] net: Allow non parent devices to be used for ZC DMA
Posted by Pavel Begunkov 3 months ago
On 7/8/25 15:10, Mina Almasry wrote:
> On Tue, Jul 8, 2025 at 4:05 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 7/2/25 18:24, Dragos Tatulea wrote:
>>> For zerocopy (io_uring, devmem), there is an assumption that the
>>> parent device can do DMA. However that is not always the case:
>>> for example mlx5 SF devices have an auxiliary device as a parent.
>>>
>>> This patch introduces the possibility for the driver to specify
>>> another DMA device to be used via the new dma_dev field. The field
>>> should be set before register_netdev().
>>>
>>> A new helper function is added to get the DMA device or return NULL.
>>> The callers can check for NULL and fail early if the device is
>>> not capable of DMA.
>>>
>>> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
>>> ---
>>>    include/linux/netdevice.h | 13 +++++++++++++
>>>    1 file changed, 13 insertions(+)
>>>
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index 5847c20994d3..83faa2314c30 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -2550,6 +2550,9 @@ struct net_device {
>>>
>>>        struct hwtstamp_provider __rcu  *hwprov;
>>>
>>> +     /* To be set by devices that can do DMA but not via parent. */
>>> +     struct device           *dma_dev;
>>> +
>>>        u8                      priv[] ____cacheline_aligned
>>>                                       __counted_by(priv_len);
>>>    } ____cacheline_aligned;
>>> @@ -5560,4 +5563,14 @@ extern struct net_device *blackhole_netdev;
>>>                atomic_long_add((VAL), &(DEV)->stats.__##FIELD)
>>>    #define DEV_STATS_READ(DEV, FIELD) atomic_long_read(&(DEV)->stats.__##FIELD)
>>>
>>> +static inline struct device *netdev_get_dma_dev(const struct net_device *dev)
>>> +{
>>> +     struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;
>>> +
>>> +     if (!dma_dev->dma_mask)
>>
>> dev->dev.parent is NULL for veth and I assume other virtual devices as well.
>>
>> Mina, can you verify that devmem checks that? Seems like veth is rejected
>> by netdev_need_ops_lock() in netdev_nl_bind_rx_doit(), but IIRC per netdev
>> locking came after devmem got merged, and there are other virt devices that
>> might already be converted.
>>
> 
> We never attempt devmem binding on any devices that don't support the
> queue API, even before the per netdev locking was merged (there was an
> explicit ops check).

great!

io_uring doesn't look at ->queue_mgmt_ops, so the helper from this
patch needs to handle it one way or another.
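
One way the helper could cover the NULL-parent case is to test the
pointer before reading dma_mask. The sketch below is a userspace model,
not kernel code: the struct definitions are stripped-down stand-ins that
contain only the fields the helper touches, and the _safe suffix is a
hypothetical name used here for illustration.

```c
#include <stddef.h>

typedef unsigned long long u64;

/* Stand-ins for the kernel structs, reduced to the fields used here. */
struct device {
	struct device	*parent;
	u64		*dma_mask;	/* NULL when the device cannot do DMA */
};

struct net_device {
	struct device	dev;
	struct device	*dma_dev;	/* optional override set by the driver */
};

/* Hypothetical variant of netdev_get_dma_dev() that tolerates a NULL parent. */
static inline struct device *
netdev_get_dma_dev_safe(const struct net_device *dev)
{
	struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;

	/* Virtual devices such as veth have no parent at all. */
	if (!dma_dev || !dma_dev->dma_mask)
		return NULL;

	return dma_dev;
}
```

With this shape a caller gets NULL both for a parentless device and for
a parent without a DMA mask, so a single check covers both cases.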

-- 
Pavel Begunkov

Re: [RFC net-next 1/4] net: Allow non parent devices to be used for ZC DMA
Posted by Jakub Kicinski 3 months, 1 week ago
On Wed, 2 Jul 2025 20:24:23 +0300 Dragos Tatulea wrote:
> For zerocopy (io_uring, devmem), there is an assumption that the
> parent device can do DMA. However that is not always the case:
> for example mlx5 SF devices have an auxiliary device as a parent.

Noob question -- I thought that the point of SFs was that you can pass
them thru to a VM. How do they not have DMA support? Is it added on
demand by the mediated driver or some such?
Re: [RFC net-next 1/4] net: Allow non parent devices to be used for ZC DMA
Posted by Dragos Tatulea 3 months, 1 week ago
On Wed, Jul 02, 2025 at 11:32:08AM -0700, Jakub Kicinski wrote:
> On Wed, 2 Jul 2025 20:24:23 +0300 Dragos Tatulea wrote:
> > For zerocopy (io_uring, devmem), there is an assumption that the
> > parent device can do DMA. However that is not always the case:
> > for example mlx5 SF devices have an auxiliary device as a parent.
> 
> Noob question -- I thought that the point of SFs was that you can pass
> them thru to a VM. How do they not have DMA support? Is it added on
> demand by the mediated driver or some such?
They do have DMA support. Maybe I didn't state it properly in the commit
message. It is just that the parent device
(sf_netdev->dev.parent.device) is not a DMA device. The grandparent
device is a DMA device though (PCI dev of parent PFs). But I wanted to
keep it generic. Maybe it doesn't need to be so generic?

Regarding SFs and VM passthrough: my understanding is that SFs are more
for passing them to a container.

Thanks,
Dragos
Re: [RFC net-next 1/4] net: Allow non parent devices to be used for ZC DMA
Posted by Jakub Kicinski 3 months, 1 week ago
On Wed, 2 Jul 2025 20:01:48 +0000 Dragos Tatulea wrote:
> On Wed, Jul 02, 2025 at 11:32:08AM -0700, Jakub Kicinski wrote:
> > On Wed, 2 Jul 2025 20:24:23 +0300 Dragos Tatulea wrote:  
> > > For zerocopy (io_uring, devmem), there is an assumption that the
> > > parent device can do DMA. However that is not always the case:
> > > for example mlx5 SF devices have an auxiliary device as a parent.  
> > 
> > Noob question -- I thought that the point of SFs was that you can pass
> > them thru to a VM. How do they not have DMA support? Is it added on
> > demand by the mediated driver or some such?  
> They do have DMA support. Maybe I didn't state it properly in the commit
> message. It is just that the parent device
> (sf_netdev->dev.parent.device) is not a DMA device. The grandparent
> device is a DMA device though (PCI dev of parent PFs). But I wanted to
> keep it generic. Maybe it doesn't need to be so generic?
> 
> Regarding SFs and VM passthrough: my understanding is that SFs are more
> for passing them to a container.

Mm. We had macvlan offload for over a decade, there's no need for
a fake struct device, auxbus and all them layers to delegate a
"subdevice" to a container in netdev world.
In my head subfunctions are a way of configuring a PCIe PASID ergo
they _only_ make sense in context of DMA.
Maybe someone with closer understanding can chime in. If the kind
of subfunctions you describe are expected, and there's a generic 
way of recognizing them -- automatically going to parent of parent
would indeed be cleaner and less error prone, as you suggest.
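
For illustration, the "parent of parent" lookup could be modelled as a
walk up the device hierarchy until a device with a DMA mask is found.
This is a userspace sketch under assumptions: the structs are minimal
stand-ins for the kernel ones, netdev_find_dma_dev() is a hypothetical
name, and whether an unbounded walk is the right policy is the open
question in this thread.

```c
#include <stddef.h>

typedef unsigned long long u64;

/* Stand-ins for the kernel structs, reduced to the fields used here. */
struct device {
	struct device	*parent;
	u64		*dma_mask;	/* NULL when the device cannot do DMA */
};

struct net_device {
	struct device	dev;
};

/* Hypothetical helper: return the nearest ancestor that can do DMA, or NULL. */
static inline struct device *
netdev_find_dma_dev(const struct net_device *dev)
{
	struct device *d;

	/* Stops at the first DMA-capable ancestor, or at the top of the tree. */
	for (d = dev->dev.parent; d; d = d->parent)
		if (d->dma_mask)
			return d;

	return NULL;
}
```

In this model an SF netdev whose parent is an auxiliary device resolves
to the PCI grandparent, while a device with no parent resolves to NULL.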