For zerocopy (io_uring, devmem), there is an assumption that the
parent device can do DMA. However that is not always the case:
for example mlx5 SF devices have an auxiliary device as a parent.

This patch introduces the possibility for the driver to specify
another DMA device to be used via the new dma_dev field. The field
should be set before register_netdev().

A new helper function is added to get the DMA device or return NULL.
The callers can check for NULL and fail early if the device is
not capable of DMA.

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 include/linux/netdevice.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5847c20994d3..83faa2314c30 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2550,6 +2550,9 @@ struct net_device {

 	struct hwtstamp_provider __rcu *hwprov;

+	/* To be set by devices that can do DMA but not via parent. */
+	struct device		*dma_dev;
+
 	u8			priv[] ____cacheline_aligned
 				       __counted_by(priv_len);
 } ____cacheline_aligned;
@@ -5560,4 +5563,14 @@ extern struct net_device *blackhole_netdev;
 		atomic_long_add((VAL), &(DEV)->stats.__##FIELD)
 #define DEV_STATS_READ(DEV, FIELD) atomic_long_read(&(DEV)->stats.__##FIELD)

+static inline struct device *netdev_get_dma_dev(const struct net_device *dev)
+{
+	struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;
+
+	if (!dma_dev->dma_mask)
+		dma_dev = NULL;
+
+	return dma_dev;
+}
+
 #endif /* _LINUX_NETDEVICE_H */
--
2.50.0
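
The intended usage described in the commit message can be sketched in userspace as follows. This is only an illustration under stated assumptions: `struct device` and `struct net_device` are reduced stand-ins containing just the fields the helper touches, and `example_sf_probe()` and `example_zc_bind()` are hypothetical names that do not appear in the patch. The helper body is copied from the patch as posted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct device {
	struct device *parent;
	uint64_t *dma_mask;	/* NULL when the device cannot do DMA */
};

struct net_device {
	struct device dev;
	struct device *dma_dev;	/* field added by the patch */
};

/* Same logic as the helper in the patch. */
static inline struct device *netdev_get_dma_dev(const struct net_device *dev)
{
	struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;

	if (!dma_dev->dma_mask)
		dma_dev = NULL;

	return dma_dev;
}

/*
 * Hypothetical driver probe step: the parent is an auxiliary device that
 * cannot do DMA, so dma_dev is pointed at the DMA-capable device before
 * the netdev would be registered.
 */
static void example_sf_probe(struct net_device *netdev,
			     struct device *aux, struct device *pci)
{
	netdev->dev.parent = aux;
	netdev->dma_dev = pci;
	/* register_netdev(netdev) would follow here in a real driver. */
}

/* Hypothetical zerocopy setup path: fail early without a DMA device. */
static int example_zc_bind(const struct net_device *netdev)
{
	struct device *dma_dev = netdev_get_dma_dev(netdev);

	if (!dma_dev)
		return -EOPNOTSUPP;

	/* DMA mapping against dma_dev would happen here. */
	return 0;
}
```

With an auxiliary parent and no `dma_dev` set, `example_zc_bind()` returns `-EOPNOTSUPP`; once the probe step has set `dma_dev`, it returns 0.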

On 7/2/25 18:24, Dragos Tatulea wrote:
> For zerocopy (io_uring, devmem), there is an assumption that the
> parent device can do DMA. However that is not always the case:
> for example mlx5 SF devices have an auxiliary device as a parent.
>
> This patch introduces the possibility for the driver to specify
> another DMA device to be used via the new dma_dev field. The field
> should be set before register_netdev().
>
> A new helper function is added to get the DMA device or return NULL.
> The callers can check for NULL and fail early if the device is
> not capable of DMA.
>
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> ---
>  include/linux/netdevice.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5847c20994d3..83faa2314c30 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2550,6 +2550,9 @@ struct net_device {
>
>  	struct hwtstamp_provider __rcu *hwprov;
>
> +	/* To be set by devices that can do DMA but not via parent. */
> +	struct device		*dma_dev;
> +
>  	u8			priv[] ____cacheline_aligned
>  				       __counted_by(priv_len);
>  } ____cacheline_aligned;
> @@ -5560,4 +5563,14 @@ extern struct net_device *blackhole_netdev;
>  		atomic_long_add((VAL), &(DEV)->stats.__##FIELD)
>  #define DEV_STATS_READ(DEV, FIELD) atomic_long_read(&(DEV)->stats.__##FIELD)
>
> +static inline struct device *netdev_get_dma_dev(const struct net_device *dev)
> +{
> +	struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;
> +
> +	if (!dma_dev->dma_mask)

dev->dev.parent is NULL for veth and I assume other virtual devices as well.

Mina, can you verify that devmem checks that? Seems like veth is rejected
by netdev_need_ops_lock() in netdev_nl_bind_rx_doit(), but IIRC per netdev
locking came after devmem got merged, and there are other virt devices that
might already be converted.

> +		dma_dev = NULL;
> +
> +	return dma_dev;
> +}
> +
>  #endif /* _LINUX_NETDEVICE_H */

--
Pavel Begunkov

On Tue, Jul 8, 2025 at 4:05 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 7/2/25 18:24, Dragos Tatulea wrote:
> > For zerocopy (io_uring, devmem), there is an assumption that the
> > parent device can do DMA. However that is not always the case:
> > for example mlx5 SF devices have an auxiliary device as a parent.
> >
> > This patch introduces the possibility for the driver to specify
> > another DMA device to be used via the new dma_dev field. The field
> > should be set before register_netdev().
> >
> > A new helper function is added to get the DMA device or return NULL.
> > The callers can check for NULL and fail early if the device is
> > not capable of DMA.
> >
> > Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> > ---
> >  include/linux/netdevice.h | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 5847c20994d3..83faa2314c30 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -2550,6 +2550,9 @@ struct net_device {
> >
> >  	struct hwtstamp_provider __rcu *hwprov;
> >
> > +	/* To be set by devices that can do DMA but not via parent. */
> > +	struct device		*dma_dev;
> > +
> >  	u8			priv[] ____cacheline_aligned
> >  				       __counted_by(priv_len);
> >  } ____cacheline_aligned;
> > @@ -5560,4 +5563,14 @@ extern struct net_device *blackhole_netdev;
> >  		atomic_long_add((VAL), &(DEV)->stats.__##FIELD)
> >  #define DEV_STATS_READ(DEV, FIELD) atomic_long_read(&(DEV)->stats.__##FIELD)
> >
> > +static inline struct device *netdev_get_dma_dev(const struct net_device *dev)
> > +{
> > +	struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;
> > +
> > +	if (!dma_dev->dma_mask)
>
> dev->dev.parent is NULL for veth and I assume other virtual devices as well.
>
> Mina, can you verify that devmem checks that? Seems like veth is rejected
> by netdev_need_ops_lock() in netdev_nl_bind_rx_doit(), but IIRC per netdev
> locking came after devmem got merged, and there are other virt devices that
> might already be converted.
>

We never attempt devmem binding on any devices that don't support the
queue API, even before the per netdev locking was merged (there was an
explicit ops check).

Even then, dev->dev.parent == NULL isn't disastrous, as far as I could
surmise from a quick look. It seems to be used only with dma_buf_attach(),
which NULL-checks it.

--
Thanks,
Mina

On 7/8/25 15:10, Mina Almasry wrote:
> On Tue, Jul 8, 2025 at 4:05 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 7/2/25 18:24, Dragos Tatulea wrote:
>>> For zerocopy (io_uring, devmem), there is an assumption that the
>>> parent device can do DMA. However that is not always the case:
>>> for example mlx5 SF devices have an auxiliary device as a parent.
>>>
>>> This patch introduces the possibility for the driver to specify
>>> another DMA device to be used via the new dma_dev field. The field
>>> should be set before register_netdev().
>>>
>>> A new helper function is added to get the DMA device or return NULL.
>>> The callers can check for NULL and fail early if the device is
>>> not capable of DMA.
>>>
>>> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
>>> ---
>>>  include/linux/netdevice.h | 13 +++++++++++++
>>>  1 file changed, 13 insertions(+)
>>>
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index 5847c20994d3..83faa2314c30 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -2550,6 +2550,9 @@ struct net_device {
>>>
>>>  	struct hwtstamp_provider __rcu *hwprov;
>>>
>>> +	/* To be set by devices that can do DMA but not via parent. */
>>> +	struct device		*dma_dev;
>>> +
>>>  	u8			priv[] ____cacheline_aligned
>>>  				       __counted_by(priv_len);
>>>  } ____cacheline_aligned;
>>> @@ -5560,4 +5563,14 @@ extern struct net_device *blackhole_netdev;
>>>  		atomic_long_add((VAL), &(DEV)->stats.__##FIELD)
>>>  #define DEV_STATS_READ(DEV, FIELD) atomic_long_read(&(DEV)->stats.__##FIELD)
>>>
>>> +static inline struct device *netdev_get_dma_dev(const struct net_device *dev)
>>> +{
>>> +	struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;
>>> +
>>> +	if (!dma_dev->dma_mask)
>>
>> dev->dev.parent is NULL for veth and I assume other virtual devices as well.
>>
>> Mina, can you verify that devmem checks that? Seems like veth is rejected
>> by netdev_need_ops_lock() in netdev_nl_bind_rx_doit(), but IIRC per netdev
>> locking came after devmem got merged, and there are other virt devices that
>> might already be converted.
>>
>
> We never attempt devmem binding on any devices that don't support the
> queue API, even before the per netdev locking was merged (there was an
> explicit ops check).

Great! io_uring doesn't look at ->queue_mgmt_ops, so the helper from
this patch needs to handle it one way or another.

--
Pavel Begunkov

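
One way the helper could handle a NULL parent itself is sketched below. It is not taken from the thread or from any posted revision: the name `netdev_get_dma_dev_safe()` is hypothetical, and the two structs are reduced userspace stand-ins for the kernel types. The only change from the posted helper is the extra NULL test before `dma_mask` is read.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct device {
	struct device *parent;
	uint64_t *dma_mask;	/* NULL when the device cannot do DMA */
};

struct net_device {
	struct device dev;
	struct device *dma_dev;
};

/*
 * Hypothetical NULL-safe variant of the helper: virtual devices such as
 * veth have no parent, so dma_dev may be NULL here and must not be
 * dereferenced.
 */
static inline struct device *
netdev_get_dma_dev_safe(const struct net_device *dev)
{
	struct device *dma_dev = dev->dma_dev ? dev->dma_dev : dev->dev.parent;

	if (!dma_dev || !dma_dev->dma_mask)
		return NULL;

	return dma_dev;
}
```

Callers that do not consult ->queue_mgmt_ops would then get NULL for a parentless device and could fail early, the same as for a parent without a DMA mask.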
On Wed, 2 Jul 2025 20:24:23 +0300 Dragos Tatulea wrote:
> For zerocopy (io_uring, devmem), there is an assumption that the
> parent device can do DMA. However that is not always the case:
> for example mlx5 SF devices have an auxiliary device as a parent.

Noob question -- I thought that the point of SFs was that you can pass
them thru to a VM. How do they not have DMA support? Is it added on
demand by the mediated driver or some such?

On Wed, Jul 02, 2025 at 11:32:08AM -0700, Jakub Kicinski wrote:
> On Wed, 2 Jul 2025 20:24:23 +0300 Dragos Tatulea wrote:
> > For zerocopy (io_uring, devmem), there is an assumption that the
> > parent device can do DMA. However that is not always the case:
> > for example mlx5 SF devices have an auxiliary device as a parent.
>
> Noob question -- I thought that the point of SFs was that you can pass
> them thru to a VM. How do they not have DMA support? Is it added on
> demand by the mediated driver or some such?

They do have DMA support. Maybe I didn't state it properly in the commit
message. It is just that the parent device
(sf_netdev->dev.parent.device) is not a DMA device. The grandparent
device is a DMA device though (PCI dev of parent PFs). But I wanted to
keep it generic. Maybe it doesn't need to be so generic?

Regarding SFs and VM passthrough: my understanding is that SFs are more
for passing them to a container.

Thanks,
Dragos

On Wed, 2 Jul 2025 20:01:48 +0000 Dragos Tatulea wrote:
> On Wed, Jul 02, 2025 at 11:32:08AM -0700, Jakub Kicinski wrote:
> > On Wed, 2 Jul 2025 20:24:23 +0300 Dragos Tatulea wrote:
> > > For zerocopy (io_uring, devmem), there is an assumption that the
> > > parent device can do DMA. However that is not always the case:
> > > for example mlx5 SF devices have an auxiliary device as a parent.
> >
> > Noob question -- I thought that the point of SFs was that you can pass
> > them thru to a VM. How do they not have DMA support? Is it added on
> > demand by the mediated driver or some such?
> They do have DMA support. Maybe I didn't state it properly in the commit
> message. It is just that the parent device
> (sf_netdev->dev.parent.device) is not a DMA device. The grandparent
> device is a DMA device though (PCI dev of parent PFs). But I wanted to
> keep it generic. Maybe it doesn't need to be so generic?
>
> Regarding SFs and VM passthrough: my understanding is that SFs are more
> for passing them to a container.

Mm. We've had macvlan offload for over a decade; there's no need for
a fake struct device, auxbus and all them layers to delegate
a "subdevice" to a container in netdev world.
In my head subfunctions are a way of configuring a PCIe PASID, ergo
they _only_ make sense in the context of DMA.
Maybe someone with closer understanding can chime in. If the kind of
subfunctions you describe are expected, and there's a generic way of
recognizing them -- automatically going to parent of parent would
indeed be cleaner and less error prone, as you suggest.
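
The "parent of parent" idea could look roughly like the sketch below. It is an illustration only: `example_find_dma_ancestor()` and its `max_depth` bound are hypothetical, the structs are reduced userspace stand-ins for the kernel types, and the thread leaves open whether there is a generic way to recognize such subfunctions that would make an automatic walk acceptable.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct device {
	struct device *parent;
	uint64_t *dma_mask;	/* NULL when the device cannot do DMA */
};

struct net_device {
	struct device dev;
};

/*
 * Hypothetical lookup: starting at the netdev's parent, walk up at most
 * max_depth ancestors and return the first one with a DMA mask. A depth
 * of 2 covers the parent (auxiliary device) and grandparent (PCI device)
 * case described above. Returns NULL if nothing suitable is found.
 */
static struct device *
example_find_dma_ancestor(const struct net_device *netdev,
			  unsigned int max_depth)
{
	struct device *dev = netdev->dev.parent;
	unsigned int i;

	for (i = 0; dev && i < max_depth; i++, dev = dev->parent) {
		if (dev->dma_mask)
			return dev;
	}

	return NULL;
}
```

Compared with the explicit dma_dev field, this needs no driver changes, but it silently picks whichever ancestor happens to have a DMA mask, which is why the depth is bounded here.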