From: Saeed Mahameed <saeedm@nvidia.com>
Recap from V3 discussion:
=========================
LWN has published an article on this series aptly summarizing the debate.
LINK: https://lwn.net/Articles/955001/
We continue to think that mlx5ctl is reasonable and aligned with the
greater kernel community values. People have pointed to the HW RAID
miscdevices as a good analog. The MD developers did not get to block HW
RAID configuration on the basis that it undermines their work on the
software RAID stack. Further, while there is a superficial similarity to
the DRM/accel debate, that was grounded in a real concern that DRM values
on open source would be bypassed. That argument does not hold up here as
this does come with open source userspace and the functionality mlx5ctl
enables on lockdown has always been available to ConnectX users through
the non-lockdown PCI sysfs. netdev has been doing just fine despite the
long standing presence of this tooling and we have continued to work with
Jakub on building common APIs when appropriate. mlx5 already implements
a wide range of the netdev common interfaces, many of which were pushed
forward by our staff - the DPLL configuration netlink being a recent
example.
Version history:
================
V3: https://lore.kernel.org/all/20231121070619.9836-1-saeed@kernel.org/#r
V2: https://lore.kernel.org/all/20231119092450.164996-1-saeed@kernel.org/#r
V1: https://lore.kernel.org/all/20231018081941.475277-1-saeed@kernel.org/#r
V3->V4:
- Document locking scheme for device lifecycle
- Document reserved bits will always be checked for 0 by driver
- Use GFP_KERNEL instead of GFP_KERNEL_ACCOUNT for short lived buffers
- Create sysfs link to parent device under the misc device's sysfs
- Remove unnecessary device name from info ioctl output
- Remove reserved and future flags fields from ioctls
- Precise size checking for ioctl user input.
V2->V3:
- Fix bad Sign-off line.
- Fix kernel robot warnings, define a user ptr arg for umem_unreg ioctl
instead of plain integer to simplify compat_ioctl usage.
V1->V2:
- Provide legal statement and sign-off for dual license use
- Fix License clause to use: BSD-3-Clause OR GPL-2.0
- Fix kernel robot warnings
- Use dev_dbg directly instead of umem_dbg() local wrapper
- Implement .compat_ioctl for 32bit compatibility
- Fix mlx5ctl_info ABI structure size and alignment
- Local pointer to correct type instead of in-place cast
- Check unused fields and flags are 0 on ioctl path
- Use correct macro to declare scalar arg ioctl command
#define MLX5CTL_IOCTL_UMEM_UNREG \
_IO(MLX5CTL_IOCTL_MAGIC, 0x3)
mlx5 ConnectX control misc driver:
==================================
The ConnectX HW family supported by the mlx5 drivers uses an architecture
where a FW component executes "mailbox RPCs" issued by the driver to make
changes to the device. This results in a complex debugging environment
where the FW component has information and low level configuration that
needs to be exposed to userspace for debugging purposes.
Historically a userspace program was used that accessed the PCI register
and config space directly through /sys/bus/pci/.../XXX and could operate
these debugging interfaces in parallel with the running driver.
This approach is incompatible with secure boot and kernel lockdown so this
driver provides a secure and restricted interface to that.
Patch breakdown:
================
1) The first patch in the series introduces the main driver file with the
implementation of a new mlx5 auxiliary device driver that runs on top of
mlx5_core device instances. On probe it creates a new misc device, and in
this patch we implement the open and release fops. On open the driver
allocates a special FW UID (user context ID) restricted to debug RPCs
only; all user debug RPCs will be executed under this UID, and on release
the UID is freed.
2) The second patch adds an info ioctl that shows the allocated UID,
the available capability masks of the device and the current UID, and
some other useful device information such as the underlying ConnectX
device.
Example:
$ sudo ./mlx5ctlu mlx5_core.ctl.0
mlx5dev: 0000:00:04.0
UCTX UID: 1
UCTX CAP: 0x3
DEV UCTX CAP: 0x3
USER CAP: 0x1d
3) The third patch adds the capability to execute debug RPCs under the
special UID.
In the mlx5 architecture the FW RPC commands are of the format of
inbox and outbox buffers. The inbox buffer contains the command
rpc layout as described in the ConnectX Programmers Reference Manual
(PRM) document and as defined in linux/include/mlx5/mlx5_ifc.h.
On success the user outbox buffer will be filled with the device's rpc
response.
For example to query device capabilities:
a user fills out an inbox buffer with the inbox layout:
struct mlx5_ifc_query_hca_cap_in_bits
and expects an outbox buffer with the layout:
struct mlx5_ifc_cmd_hca_cap_bits
4) The fourth patch adds the ability to register user memory into the
ConnectX device and create a umem object that points to that memory.
Command rpc outbox buffer is limited in size, which can be very
annoying when trying to pull large traces out of the device.
Many rpcs offer the ability to scatter output traces, contexts
and logs directly into user space buffers in a single shot.
The registered memory is described by a device UMEM object which has
a unique umem_id; this umem_id can later be used in an rpc inbox to
tell the device where to populate the response output,
e.g. HW traces and other debug object queries.
Example use case: a ConnectX device coredump can be as large as 2MB.
Using inline rpcs it would take thousands of rpcs to pull the full
coredump, which can take multiple seconds.
With UMEM, it can be done in a single rpc, using a 2MB umem user buffer.
Other usecases with umem:
- dynamic HW and FW trace monitoring
- high frequency diagnostic counters sampling
- batched objects and resource dumps
See links below for information about user space tools that use this
interface:
[1] https://github.com/saeedtx/mlx5ctl
[2] https://github.com/Mellanox/mstflint
see, for example, these tools:
d) mstregdump utility
This utility dumps hardware registers from Mellanox hardware
for later analysis by Mellanox.
g) mstconfig
This tool sets or queries non-volatile configurable options
for Mellanox HCAs.
h) mstfwmanager
Mellanox firmware update and query utility which scans the system
for available Mellanox devices (only mst PCI devices) and performs
the necessary firmware updates.
i) mstreg
The mlxreg utility allows users to obtain information regarding
supported access registers, such as their fields
License: BSD-3-Clause OR GPL-2.0
================================
After a review of this thread [3], and a conversation with the LF,
Mellanox and NVIDIA legal continue to approve the use of a Dual GPL &
Permissive License for mlx5 related driver contributions. This makes it
clear to future contributors that this file may be adapted and reused
under BSD-3-Clause terms on other operating systems. Contributions will
be handled in the normal way and the dual license will apply
automatically. If people wish to contribute significantly and opt out of
a dual license they may separate their GPL only contributions in dedicated
files.
Jason has signing authority for NVIDIA and has gone through our internal
process to get approval.
[3] https://lore.kernel.org/all/20231018081941.475277-3-saeed@kernel.org/#r
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> # for legal
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Nacked-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed (5):
mlx5: Add aux dev for ctl interface
misc: mlx5ctl: Add mlx5ctl misc driver
misc: mlx5ctl: Add info ioctl
misc: mlx5ctl: Add command rpc ioctl
misc: mlx5ctl: Add umem reg/unreg ioctl
.../userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 8 +
drivers/misc/Kconfig | 1 +
drivers/misc/Makefile | 1 +
drivers/misc/mlx5ctl/Kconfig | 14 +
drivers/misc/mlx5ctl/Makefile | 5 +
drivers/misc/mlx5ctl/main.c | 597 ++++++++++++++++++
drivers/misc/mlx5ctl/umem.c | 322 ++++++++++
drivers/misc/mlx5ctl/umem.h | 17 +
drivers/net/ethernet/mellanox/mlx5/core/dev.c | 8 +
include/uapi/misc/mlx5ctl.h | 50 ++
11 files changed, 1024 insertions(+)
create mode 100644 drivers/misc/mlx5ctl/Kconfig
create mode 100644 drivers/misc/mlx5ctl/Makefile
create mode 100644 drivers/misc/mlx5ctl/main.c
create mode 100644 drivers/misc/mlx5ctl/umem.c
create mode 100644 drivers/misc/mlx5ctl/umem.h
create mode 100644 include/uapi/misc/mlx5ctl.h
--
2.43.0
On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:
> From: Saeed Mahameed <saeedm@nvidia.com>
>
> Recap from V3 discussion:
> =========================
> [snip]
> long standing presence of this tooling and we have continued to work
> with Jakub on building common APIs when appropriate. mlx5 already
> implements a wide range of the netdev common interfaces, many of which
> were pushed forward by our staff - the DPLL configuration netlink
> being a recent example.

I appreciate Jiri's contributions (and you hired Maciej off of Intel at
some point) but don't make it sound like nVidia led DPLL, or pushed for
a common interface :| Intel posted SyncE support. I asked them to make
it a standalone API family:

https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/

Vadim from Meta joined in and helped a lot based on the OCP time card.
Then after some delaying and weird noises y'all started participating.

> mlx5 ConnectX control misc driver:
> ==================================
>
> The ConnectX HW family supported by the mlx5 drivers uses an
> architecture where a FW component executes "mailbox RPCs" issued by
> the driver to make changes to the device. This results in a complex
> debugging environment where the FW component has information and low
> level configuration that needs to be exposed to userspace for
> debugging purposes.

You don't explain anywhere why addressing the challenges of using
debugfs in secure environments isn't the way to go.

Also you keep saying debugging purposes but the driver is called
"control misc driver", you need to iron out your narrative just
a bit more.

> Historically a userspace program was used that accessed the PCI
> register and config space directly through /sys/bus/pci/.../XXX and
> could operate these debugging interfaces in parallel with the running
> driver. This approach is incompatible with secure boot and kernel
> lockdown so this driver provides a secure and restricted interface to
> that.

[snip]

> i) mstreg
> The mlxreg utility allows users to obtain information regarding
> supported access registers, such as their fields

So the access mstreg gives over this interface is read only? That's
what your description sounds like, but given your complaints about
"inability to add knobs" and "control" in the name of the driver that
must be false.

> Other usecases with umem:
> - dynamic HW and FW trace monitoring
> - high frequency diagnostic counters sampling

Ah yes, the high frequency counters. Something that is definitely
impossible to implement in a generic way. You were literally in the
room at netconf when David Ahern described his proposal for this.

Anyway, I don't want to waste any more time arguing with you.
My opinion is that the kernel community is under no obligation to carry
your proprietary gateway interfaces. I may be wrong, but I'm entitled
to my opinion.

Please do me the basic courtesy of carrying my nack on these patches:

Nacked-by: Jakub Kicinski <kuba@kernel.org>

and CC netdev, so I don't have to respond again :|
On 07 Feb 07:03, Jakub Kicinski wrote:
>On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:
>> [snip]
>> a wide range of the netdev common interfaces, many of which were
>> pushed forward by our staff - the DPLL configuration netlink being
>> a recent example.
>
>I appreciate Jiri's contributions (and you hired Maciej off of Intel at
>some point) but don't make it sound like nVidia led DPLL, or pushed for
>a common interface :| Intel posted SyncE support. I asked them to make
>it a standalone API family:
>

I will let the stats speak for themselves.

$ git shortlog -sne --no-merges net/devlink

and prior to commit f05bd8ebeb69 ("devlink: move code to a dedicated
directory"):

$ git shortlog -sne --no-merges net/core/devlink.c

More than 70% of the commits are authored by more than 10 different
individuals from Mellanox/NVIDIA.

Ok you don't like DPLL, here is a list of some central devlink features
we did push to the devlink standard API:

- subfunction API and devlink infrastructure
- Shared buffer API
- port function and rate API
- shared buffer
- health

>https://lore.kernel.org/netdev/20210830162909.110753ec@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
>
>Vadim from Meta joined in and helped a lot based on the OCP time card.
>Then after some delaying and weird noises y'all started participating.
>

I remember those discussions, and I agree it is very weird when it takes
3 vendors and 2 years to get a simple devlink API for a single bit flip
accepted.

>> mlx5 ConnectX control misc driver:
>> ==================================
>> [snip]
>
>You don't explain anywhere why addressing the challenges of using
>debugfs in secure environments isn't the way to go.
>

I already answered this question at length in v3 here:
https://lore.kernel.org/all/ZWZFm2qqhV1wKKCV@x130/

>[snip]
>
>> i) mstreg
>> The mlxreg utility allows users to obtain information regarding
>> supported access registers, such as their fields
>
>So the access mstreg gives over this interface is read only? That's
>what your description sounds like, but given your complaints about
>"inability to add knobs" and "control" in the name of the driver that
>must be false.
>

Yes, this is enforced by the mlx5ctl driver and FW using the special
debug uid.

>> Other usecases with umem:
>> - dynamic HW and FW trace monitoring
>> - high frequency diagnostic counters sampling
>
>Ah yes, the high frequency counters. Something that is definitely
>impossible to implement in a generic way. You were literally in the
>room at netconf when David Ahern described his proposal for this.
>

I was in the room and I am in support of David's idea, I like it a lot,
but I don't believe we have any concrete proposal, and we don't have any
use case for it in netdev for now; our use case for this is currently
RDMA and HPC specific.

Also, similar to devlink, we will be the first to jump in and implement
the new API once defined, but this doesn't mean I need to throw away the
whole driver just because one single use case will be implemented in
netdev one day, and I am sure the netdev implementation won't satisfy
all the use cases of high frequency counters.

Also keep in mind high frequency counters are a very small part of the
debug and access capabilities the mlx5ctl interface offers.

>Anyway, I don't want to waste any more time arguing with you.
>My opinion is that the kernel community is under no obligation to carry
>your proprietary gateway interfaces. I may be wrong, but I'm entitled
>to my opinion.
>

Thanks, I appreciate your honesty, but I must disagree with your Nack;
we provided enough argument for why we believe this approach is the
right way to go, and it is clear from the responses on V3 and from the
LWN article that we have community support for this open source project.

>Please do me the basic courtesy of carrying my nack on these patches:
>
>Nacked-by: Jakub Kicinski <kuba@kernel.org>
>
>and CC netdev, so I don't have to respond again :|

Ack.

Thanks,
Saeed.
On Wed, 7 Feb 2024 21:03:35 -0800 Saeed Mahameed wrote:
> On 07 Feb 07:03, Jakub Kicinski wrote:
> >On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote:
> >[snip]
>
> I will let the stats speak for themselves.
>
> $ git shortlog -sne --no-merges net/devlink
>
> and prior to commit f05bd8ebeb69 ("devlink: move code to a dedicated
> directory"):
>
> $ git shortlog -sne --no-merges net/core/devlink.c
>
> More than 70% of the commits are authored by more than 10 different
> individuals from Mellanox/NVIDIA.

I'm not questioning that there are folks at nVidia who prefer to go the
generic infrastructure route. Jiri and the mlxsw team especially. I do
worry that adding a pass-thru interface will undermine them, but that
wasn't my main point.

> Ok you don't like DPLL,

I didn't say I dislike DPLL. I think it's a very odd example for you to
pick for nVidia's contribution. My recollection is:

- Maciej from Intel started developing an upstream API for SyncE support
- I asked him to generalize it to DPLL, he started working on it
- nVidia expressed interest in creating a common interface, we thought
  it'd be great to align vendors
- nVidia hired Maciej from Intel, shutting down Intel's progress for
  a while
- nVidia went AWOL, long response times, we held meetings to nudge you
  along, no commitments
- then after months and months Jiri started helping Arkadiusz and Vadim

I remember thinking at the time that it must have been a terrible
experience for Intel, definitely not how cooperation upstream should
look :| IDK how disconnected from upstream netdev you have to be to put
that on your banner.

> here is a list of some central devlink features we did
> push to the devlink standard API:
>
> - subfunction API and devlink infrastructure
> - Shared buffer API
> - port function and rate API
> - shared buffer
> - health

I guess shared buffer was important enough to mention twice? :)

> I remember those discussions, and I agree it is very weird when it
> takes 3 vendors and 2 years to get a simple devlink API for a single
> bit flip accepted.

In hindsight it was naive of us to try to wait for nVidia. It's better
to let one vendor blaze the trail and others follow.

> >> i) mstreg
> >> The mlxreg utility allows users to obtain information regarding
> >> supported access registers, such as their fields
> >
> >So the access mstreg gives over this interface is read only? That's
> >what your description sounds like, but given your complaints about
> >"inability to add knobs" and "control" in the name of the driver that
> >must be false.
>
> Yes, this is enforced by the mlx5ctl driver and FW using the special
> debug uid.

So we trust the proprietary FW not to let the proprietary user space do
something out of scope. Got it.

> I was in the room and I am in support of David's idea, I like it a
> lot, but I don't believe we have any concrete proposal, and we don't
> have any use case for it in netdev for now; our use case for this is
> currently RDMA and HPC specific.
>
> Also, similar to devlink, we will be the first to jump in and
> implement the new API once defined, but this doesn't mean I need to
> throw away the whole driver [snip]

I'm not asking to throw it away. The question is only whether you get
to push it upstream and skirt subsystem rules by posting a "misc"
driver without even CCing the maintainers on v1 :|

> Thanks, I appreciate your honesty, but I must disagree with your Nack;
> we provided enough argument for why we believe this approach is the
> right way to go, and it is clear from the responses on V3 and from the
> LWN article that we have community support for this open source
> project.

Why don't you repost it to netdev and see how many acks you get?
I'm not the only netdev maintainer.
On 2/8/24 7:15 PM, Jakub Kicinski wrote:
>>> Ah yes, the high frequency counters. Something that is definitely
>>> impossible to implement in a generic way. You were literally in the
>>> room at netconf when David Ahern described his proposal for this.

The key point of that proposal is host memory mapped to userspace where
H/W counters land (either via direct DMA by a H/W push or a
kthread/timer pulling in updates). That is similar to what is proposed
here.

>> I was in the room and I am in support of David's idea, I like it a
>> lot, but I don't believe we have any concrete proposal, and we don't
>> have any use case for it in netdev for now; our use case for this is
>> currently RDMA and HPC specific.
>>
>> Also, similar to devlink, we will be the first to jump in and
>> implement the new API once defined, but this doesn't mean I need to
>> throw away the whole driver [snip]
>
> I'm not asking to throw it away. The question is only whether you get
> to push it upstream and skirt subsystem rules by posting a "misc"
> driver without even CCing the maintainers on v1 :|

Can you define what you mean by 'skirt subsystem rules'? That's a new
one to me.

BTW, there is already a Broadcom driver under drivers/misc that seems to
have a lot of overlap capability-wise with this driver. Perhaps a
Broadcom person could chime in.

>>> Anyway, I don't want to waste any more time arguing with you.
>>> My opinion is that the kernel community is under no obligation to
>>> carry your proprietary gateway interfaces. I may be wrong, but I'm
>>> entitled to my opinion.
>>
>> Thanks, I appreciate your honesty, but I must disagree with your Nack
>> [snip]
>
> Why don't you repost it to netdev and see how many acks you get?
> I'm not the only netdev maintainer.

I'll go out on that limb and say I would have no problem ACK'ing the
driver. It's been proven time and time again that these kinds of
debugging facilities are needed for these kinds of complex,
multifunction devices.
On Fri, Feb 09, 2024 at 03:42:16PM -0700, David Ahern wrote:
> On 2/8/24 7:15 PM, Jakub Kicinski wrote:
> >>> Ah yes, the high frequency counters. Something that is definitely
> >>> impossible to implement in a generic way. You were literally in
> >>> the room at netconf when David Ahern described his proposal for
> >>> this.
>
> The key point of that proposal is host memory mapped to userspace
> where H/W counters land (either via direct DMA by a H/W push or a
> kthread/timer pulling in updates). That is similar to what is proposed
> here.

The counter experiment that inspired Saeed to write about it here was
done using mlx5ctl interfaces and some other POC stuff on an RDMA
network, monitoring RDMA workloads and inspecting RDMA objects.

So if your proposal also considers how to select RDMA object counters,
control the detailed sampling hardware with RDMA stuff, and works on a
netdev-free InfiniBand network, then it might be interesting.

It was actually interesting research, I hope some of the information
will be made public.

> BTW, there is already a Broadcom driver under drivers/misc that seems
> to have a lot of overlap capability-wise with this driver. Perhaps a
> Broadcom person could chime in.

Yeah, there are lots of examples of drivers that use this kind of FW
API direct to userspace. It is a common design pattern across the
kernel in many subsystems.

At its core it is following the general philosophy of pushing things to
userspace that don't need to be in the kernel. It is more secure, more
hackable and easier to deploy. It becomes a userspace decision what
kind of tooling the community will develop and what the ultimate user
experience will be.

> > Why don't you repost it to netdev and see how many acks you get?
> > I'm not the only netdev maintainer.
>
> I'll go out on that limb and say I would have no problem ACK'ing the
> driver. It's been proven time and time again that these kinds of
> debugging facilities are needed for these kinds of complex,
> multifunction devices.

Agree as well. Ack for the RDMA community. This is perfectly consistent
with the subsystem's existing design of directly exposing the device to
userspace. It is essential, as we can't piggyback on any "generic"
netdev stuff on InfiniBand HW.

Further, I anticipate most of the mlx5ctl users would actually be
running primarily RDMA related workloads anyhow. There are not that
many people who buy these expensive cards and don't use them to their
full capability.

Recently at USENIX Microsoft shared some details of their production
network in the paper "Empowering Azure Storage with RDMA". Notably they
shared: "Traffic statistics of all Azure public regions between
January 18 and February 16, 2023. Traffic was measured by collecting
switch counters of server-facing ports on all Top of Rack (ToR)
switches. Around 70% of traffic was RDMA."

It is a rare public insight into what is going on in the industry at
large, and why RDMA is a significant and important subsystem.

Jason
On 2/9/24 6:01 PM, Jason Gunthorpe wrote:
> On Fri, Feb 09, 2024 at 03:42:16PM -0700, David Ahern wrote:
> > The key point of that proposal is host memory mapped to userspace
> > where H/W counters land (either via direct DMA by a H/W push or a
> > kthread/timer pulling in updates). That is similar to what is
> > proposed here.
>
> The counter experiment that inspired Saeed to write about it here was
> done using mlx5ctl interfaces and some other POC stuff on an RDMA
> network, monitoring RDMA workloads and inspecting RDMA objects.
>
> So if your proposal also considers how to select RDMA object counters,
> control the detailed sampling hardware with RDMA stuff, and works on a
> netdev-free InfiniBand network, then it might be interesting.

Response at netconf in September was mixed. As I recall Jakub for
example was shaking his head at 'yet another stats proposal', but since
he has referenced it a couple of times now maybe it is worth moving
beyond slides to a POC. The uapi discussed was netlink (genl) based;
driver hooks were not discussed. Perhaps I can get a working POC for
both stacks by netdevconf in July.
On Fri, 9 Feb 2024 15:42:16 -0700 David Ahern wrote: > On 2/8/24 7:15 PM, Jakub Kicinski wrote: > >> I was in the room and I am in support of David's idea, I like it a lot, > >> but I don't believe we have any concrete proposal, and we don't have any > >> use case for it in netdev for now, our use case for this is currently RDMA > >> and HPC specific. > >> > >> Also siimilar to devlink we will be the first to jump in and implement > >> the new API once defined, but this doesn't mean I need to throw away the > > > > I'm not asking to throw it away. The question is only whether you get > > to push it upstream and skirt subsystem rules by posting a "misc" driver > > without even CCing the maintainers on v1 :| > > Can you define what you mean by 'skirt subsystem rules'? That's a new > one to me. I mean that Saeed is well aware that direct FW <> user space interfaces are not allowed in netdev, so he posted this "misc" driver without CCing us, knowing we'd nack it. Maybe the baseline question is whether major subsystems are allowed to set their own rules. I think they should as historically we have a very broad range of, eh, openness in different fields. Networking is pretty open because of the necessary interoperability. > BTW, there is already a broadcom driver under drivers/misc that seems to > have a lot of overlap capability wise to this driver. Perhaps a Broadcom > person could chime in. I'm not aware. Or do you mean bcm-vk? That's a video encoder. > >> Thanks, I appreciate your honesty, but I must disagree with your Nack, we > >> provided enough argument for why we believe this approach is the right > >> way to go, it is clear from the responses on V3 and from the LWN article > >> that we have the community support for this open source project. > > > > Why don't you repost it to netdev and see how many acks you get? > > I'm not the only netdev maintainer. > > I'll go out on that limb and say I would have no problem ACK'ing the > driver. 
It's been proven time and time again that these kinds of > debugging facilities are needed for these kinds of complex, > multifunction devices. Can we have a bare minimum of honesty and stop pretending that this is primarily about debug :|
On 2/9/24 3:58 PM, Jakub Kicinski wrote: > On Fri, 9 Feb 2024 15:42:16 -0700 David Ahern wrote: >> On 2/8/24 7:15 PM, Jakub Kicinski wrote: >>>> I was in the room and I am in support of David's idea, I like it a lot, >>>> but I don't believe we have any concrete proposal, and we don't have any >>>> use case for it in netdev for now, our use case for this is currently RDMA >>>> and HPC specific. >>>> >>>> Also siimilar to devlink we will be the first to jump in and implement >>>> the new API once defined, but this doesn't mean I need to throw away the >>> >>> I'm not asking to throw it away. The question is only whether you get >>> to push it upstream and skirt subsystem rules by posting a "misc" driver >>> without even CCing the maintainers on v1 :| >> >> Can you define what you mean by 'skirt subsystem rules'? That's a new >> one to me. > > I mean that Saeed is well aware that direct FW <> user space interfaces > are not allowed in netdev, so he posted this "misc" driver without > CCing us, knowing we'd nack it. The argument you are making here is that if a device has an ethernet port, all patches must flow through netdev. Or am I misunderstanding? There are a lot of devices that toggle between IB and ethernet with only one (ignore roce here) active at a moment. MLX5 (like many) is split between drivers/net and drivers/infiniband. If the debugging capabilities went through drivers/infiniband would you have the same stance? Maybe my bigger question is should drivers that can do different physical layers, or can operate without a netdev, should they be split into different drivers? drivers/misc for the PCI interface, drivers/net for ethernet interfaces and its APIs and drivers/infiniband for IB and its APIs? Generic capabilities like this debugging interface belong to the PCI device since it is generic to the device hence drivers/misc. > > Maybe the baseline question is whether major subsystems are allowed to > set their own rules.
I think they should as historically we have a very > broad range of, eh, openness in different fields. Networking is pretty > open because of the necessary interoperability. Interoperability applies solely to networking protocols, not how a device is designed and certainly not to how it can be debugged. There is a clear difference between standard networking protocols (packets on a wire and its headers) and device designs. > >> BTW, there is already a broadcom driver under drivers/misc that seems to >> have a lot of overlap capability wise to this driver. Perhaps a Broadcom >> person could chime in. > > I'm not aware. Or do you mean bcm-vk? That's a video encoder. If that specific piece of S/W is a video encoder, why isn't it under drivers/video? Scanning the code it seems to me to fall under the open channel between userspace and F/W which is a common paradigm. But I do not want this to distract from this patch set; really I was just browsing existing drivers for any overlap.
On 2/9/24 10:01 PM, David Ahern wrote: > Maybe my bigger question is should drivers that can do different > physical layers, or can operate without a netdev, should they be split > into different drivers? drivers/misc for the PCI interface, drivers/net > for ethernet interfaces and its APIs and drivers/infiniband for IB and > its APIs? Generic capabilities like this debugging interface belong to > the PCI device since it is generic to the device hence drivers/misc. > I do not recall seeing a response to this line of questions. We are considering splitting our driver along these lines to upstream it given the independent nature of the capabilities and need to interact with different S/W APIs and hence different maintainers. Any reason new drivers should not be split along these lines? If the answer is no, then where is the best home for the PCI device driver piece of it? drivers/misc or drivers/accel or create a new drivers/multifcn/?
On Wed, Feb 14, 2024 at 01:31:29PM -0700, David Ahern wrote: > On 2/9/24 10:01 PM, David Ahern wrote: > > Maybe my bigger question is should drivers that can do different > > physical layers, or can operate without a netdev, should they be split > > into different drivers? drivers/misc for the PCI interface, drivers/net > > for ethernet interfaces and its APIs and drivers/infiniband for IB and > > its APIs? Generic capabilities like this debugging interface belong to > > the PCI device since it is generic to the device hence drivers/misc. > > > > I do not recall seeing a response to this line of questions. We are > considering splitting our driver along these lines to upstream it given > the independent nature of the capabilities and need to interact with > different S/W APIs and hence different maintainers. Any reason new > drivers should not be split along these lines? I think it is essential to split them up like this, it is why auxbus was created in the first place. The subsystem specific logic should live in the subsystem space. I've had some conversations where we have disagreements about which functions end up in which directories in these split up drivers, but the general principle is good. I like mlx5's split where the core is about booting and communicating with the FW and some common helpers then every subsystem has its own FW calls in its own functions. Other people liked the idea that as much "low level" code was in the core and the subsystems would call helper functions in the core that invoke the FW functions. > If the answer is no, then where is the best home for the PCI device > driver piece of it? drivers/misc or drivers/accel or create a new > drivers/multifcn/? I've had conversations about this too, I would vote for a new directory going forward. drivers/auxbus or something specifically for these core components of an auxbus based multi-subsystem driver. 
We likely need some reasonable rules so that the core driver remains an in-kernel component and doesn't start growing weird uapi. If you want uapi it needs to come through an appropriate subsystem specific thing, even if that is misc. Jason
On Fri, Feb 09, 2024 at 10:01:33PM -0700, David Ahern wrote: > >> BTW, there is already a broadcom driver under drivers/misc that seems to > >> have a lot of overlap capability wise to this driver. Perhaps a Broadcom > >> person could chime in. > > > > I'm not aware. Or do you mean bcm-vk? That's a video encoder. > > If that specific piece of S/W is a video encoder, why isn't it under > drivers/video? Scanning the code it seems to me to fall under the open > channel between userspace and F/W which is a common paradigm. But I do > not want this to distract from this patch set; really I was just > browsing existing drivers for any overlap. It is an "offload-engine" type of thing that was added before we had the specific subsystem for it. It should be moved to drivers/accel/ one of these days, want to volunteer to help out with that? :) thanks, greg k-h
On 2/11/24 4:03 AM, Greg Kroah-Hartman wrote: > On Fri, Feb 09, 2024 at 10:01:33PM -0700, David Ahern wrote: >>>> BTW, there is already a broadcom driver under drivers/misc that seems to >>>> have a lot of overlap capability wise to this driver. Perhaps a Broadcom >>>> person could chime in. >>> >>> I'm not aware. Or do you mean bcm-vk? That's a video encoder. >> >> If that specific piece of S/W is a video encoder, why isn't it under >> drivers/video? Scanning the code it seems to me to fall under the open >> channel between userspace and F/W which is a common paradigm. But I do >> not want this to distract from this patch set; really I was just >> browsing existing drivers for any overlap. > > It is an "offload-engine" type of thing that was added before we had the > specific subsystem for it. It should be moved to drivers/accel/ one of > these days, want to volunteer to help out with that? :) > Thanks for the background. As for volunteering to move it, I believe Broadcom has more kernel engineers than Enfabrica. :-)
Fri, Feb 09, 2024 at 03:15:55AM CET, kuba@kernel.org wrote: >On Wed, 7 Feb 2024 21:03:35 -0800 Saeed Mahameed wrote: >> On 07 Feb 07:03, Jakub Kicinski wrote: >> >On Tue, 6 Feb 2024 23:24:30 -0800 Saeed Mahameed wrote: >> >> From: Saeed Mahameed <saeedm@nvidia.com> [...] > >> Ok you don't like DPLL, > >I didn't say I dislike DPLL. I think it's a very odd example for >you to pick for nVidia's contribution. My recollection is: > > - Maciej from Intel started developing upstream API for SyncE support > - I asked him to generalize it to DPLL, he started working on it > - nVidia expressed interest in creating a common interface, we thought > it'd be great to align vendors > - nVidia hired Maciej from Intel, shutting down Intel's progress for a while > - nVidia went AWoL, long response times, we held meetings to nudge > you along, no commitments > - then after months and months Jiri started helping Arkadiusz and Vadim > >I remember thinking at the time that it must have been a terrible >experience for Intel, definitely not how cooperation upstream should >look :| For the record, I spent a huge amount of time reviewing the patchset and ended up redesigning significant chunks of it, steering Arkadiusz and Vadim the way I felt was correct. Oftentimes I said to myself it would be much quicker to take the patchset over and do it myself :) Anyway, at the end, I think that the result is very good. Solid and well defined uapi, nice kernel implementation, 3 drivers implementing it, each with a slightly different usecase, and it all clicks. If this is not a good example of upstream cooperation, I'm not sure what else is... But, I don't think this is related to the misc driver discussion, I just wanted to express my pov on the dpll process, when I see people talking about it :) > >IDK how disconnected from upstream netdev you have to be to put that on >your banner. [...]
Chiming in late after a vacation last week, but I really do not understand the discussion this caused. Complex devices need diagnostics, and mlx5ctl and the (free software) userspace tool provide that. Given that it is a complex multi-subsystem driver that exports at least RDMA, ethernet, nvme and virtio_blk interfaces, this functionality by definition can't fit into a single subsystem. So with my nvme co-maintainer hat on: Acked-by: Christoph Hellwig <hch@lst.de> to the concept (which is not a full code review!). With my busy kernel contributor hat on I have to voice my dissatisfaction with the subsystem maintainer overreach that's causing the troubles here. I think all maintainers can and should voice their opinions, be those technical or political, but trying to block a useful feature without lots of precedent because it is vaguely related to the subsystem is not helpful. Note that this is absolutely not intended to shut the discussion down - if we can find valid arguments why some of this functionality should also be reported through a netdev API we should continue that. E.g. just because we have SCSI, NVMe or product specific passthrough interfaces we still try to provide useful common interfaces generically.
On Wed, Feb 14, 2024 at 12:29:16AM -0800, Christoph Hellwig wrote: > Chiming in late after a vacation last week, but I really do not > understand the discussion this caused. > > Complex devices need diagnostics, and mlx5ctl and the (free software) > userspace tool provide that. Given how it is a complex multi-subsystem > driver that exports at least RDMA, ethernet, nvme and virtio_blk > interfaces this functionality by definition can't fit into a single > subsystem. > > So with my nvme co-maintainer hat on: > > Acked-by: Christoph Hellwig <hch@lst.de> > > to the concept (which is not a full code review!). I've been trying to figure out how to respond correctly to this over the last few days, but I share your sentiment. It's probably time to have something like this upstream for more devices. My initial concerns with something that allows direct access to hardware to send messages to FW or read/write registers over BARs were: 1. How someone working at a distro would be able to help/understand if a tool like this was run and may have programmed their hardware differently than a default driver or FW. There now exists a chance for identical systems running identical drivers and FW to behave differently due to out-of-band configuration. One thought I had was some sort of journal to note that config happened from outside, but I'm not sure there is much value there. With the ability to dump regs with devlink health it's possible to know that values may have changed, so I'm not concerned about this since that infra exists. 2. If one can make configuration changes to hardware without kernel APIs (devlink et al), will people still develop new kernel APIs? I think the answer to this is 'yes' as realistically using default tools is much better than using vendor tools for regular configuration. Even if vendors provide shortcuts to program hardware for eval/testing/debug my experience is that these are not acceptable long-term. 
Requests are always made to include this type of change in future releases. So I'm not too concerned about the ossification of kernel APIs due to this being included. So if there is general agreement that this is acceptable (especially compared to other out-of-tree drivers), I think a few who find this useful should sync on the best way forward; I'm not sure a separate driver for each vendor is the right approach. If upstream (and therefore distros) are going to accept this we probably owe it to them to not have misc drivers for every different flavor of hardware out there when it might be possible to add a generic driver that can connect to a PCI device via new (auxiliary bus?) API.
On Wed, Feb 14, 2024 at 11:17:58AM -0500, Andy Gospodarek wrote: > 1. How someone working at a distro would be able to help/understand if > a tool like this was run and may have programmed their hardware > differently than a default driver or FW. This is stepping a bit further ahead of the debug focused interfaces mlx5ctl currently has and into configuration. Obviously a generic FW RPC can do configuration too. We do have a lot of industry experience with configuration already. Realistically every complex device already has on-device FLASH that has on-device FW configuration. Formalizing what already exists and is widely used isn't going to make the world worse. I'm also very certain there are actual problems with FW configuration incompatibility. E.g. there are PCI related configurables on mlx5 you can set that will break everything if you are not The Special User that the configuration was created for. Today everyone already deals with FW version specific behavioral differences and bugs. FW version + FW version & config is also the current state of affairs, and is dealt with in about the same way, IMHO. > due to out-of-band configuration. One thought I had was some sort of > journal to note that config happened from outside, but I'm not sure > there is much value there. I think devices using this kind of approach need to be well behaved - like no wild (production) access to random bits of HW. Things need to be structured and reportable. There is a clear split in my mind between: - inspection debugging - invasive mutating debugging - configuration And maybe "invasive mutating debugging" taints the kernel or something like that. > With the ability to dump regs with devlink health it's possible to > know that values may have changed, so I'm not concerned about this > since that infra exists. Distros need good tooling to report the state of HW in distro problem reports. If something is lacking we should improve on it. 
TBH I'd pick "report current status" over "journaling" because devices have FLASH and configuration changes can typically be permanent across reboots. Also, I would generally frown on designs changing OS visible behavior without running the device through a FLR. > 2. If one can make configuration changes to hardware without kernel > APIs (devlink et al), will people still develop new kernel APIs? Sure? Why not? Every vendor has existing tooling to do this configuration stuff. mlx5's existing tooling runs in userspace and uses /sys/../resource. *Everyone* has been using this for the last 15 years. So whatever impact it has is long since baked in. > I think the answer to this is 'yes' as realistically using default tools > is much better than using vendor tools for regular configuration. Personally I think the answer is more nuanced. Users have problems they want to solve. If your problem is "provision my HW from the factory" you may be perfectly happy with a shared userspace program that can do that for all the HW vendors at the site. No artificial kernel enforced commonality required. IMHO as kernel people we often look at the hammer we have and fall into these patterns where "only the kernel can do abstractions!" but it isn't true, we can very effectively make abstractions in userspace too. > if vendors provide shortcuts to program hardware for eval/testing/debug > my experience is that these are not acceptable long-term. Requests are always > made to include this type of changes in future releases. So I'm not too > concerned about the ossification of kernel APIs due to this being included. Nor am I. Users will ask for the things that work best for them and vendors have been historically good at responding. I also look at RDMA, which is deeply down this path of not doing common interfaces in the kernel. We have a lot of robust open source now! 
The common interfaces that the userspace community developed are wildly different and much more useful than what the prior effort to build kernel common interfaces was creating. In fact there are now competing ideas on what those interfaces should be and a lot of interesting development. The kernel common APIs that were developed before turned out to be such a straitjacket that it held back so much stuff and forced people into out-of-tree and badly harmed community forming on the userspace side. In other words, allowing userspace to have freedom has really pushed things forward in good ways. > So if there is general agreement that this is acceptable (especially > compared to other out-of-tree drivers), I think a few who find this > useful should sync on the best way forward; I'm not sure a separate > driver for each vendor is the right approach. I also like this, I don't want the outcome of this discussion to be that only mlx5ctl gets merged. I want all the HW that has this problem to have support in the mainline kernel. I want to see a robust multi-vendor userspace community that houses a lot of stuff and is working to solve user problems. To me this point is the big win for the community. If we need a formally named kernel subsystem to pull that community together then let's do that. We can probably have some common ioctls and shared uABI marshaling/discovery/etc like nvme does. At the end it can't really be abstracted too much, the user facing API is really going to be "send a FW RPC" like mlx5/nvme-vendor does. Honestly it will be easier to discuss and document the overall design and what devices have to do to be compliant within a tidy subsystem with a name that people can identify as a thing. Naming things and having spaces is really important to build community around them. Certainly, I am up for this. If you have similar usage flows I'm quite happy to work with everyone to launch a new mini-subsystem. 
If other people are reading this and think they want to take part too please speak up. > If upstream (and therefore distros) are going to accept this we probably > owe it to them to not have misc drivers for every different flavor of > hardware out there when it might be possible to add a generic driver > that can connect to a PCI device via new (auxiliary bus?) API. Distros often prefer if they have less packaging, versioning and community to deal with :) I'm inspired by things like nvme-cli that show a nice way forward where there can be vendor plugins and a shared github with a common shared umbrella of CLI, documentation and packaging. With some co-operation from the distros we can push vendors to participate in the shared userspace, and we can push vendors from the kernel by denying kernel driver support without basic integration to the userspace (like DRM does with mesa). Thanks, Jason
On Wed, Feb 14, 2024 at 01:57:35PM -0400, Jason Gunthorpe wrote:
> I also like this, I don't want the outcome of this discussion to be
> that only mlx5ctl gets merged. I want all the HW that has this problem
> to have support in the mainline kernel.
To this end here is my proposal to move forward with a new
mini-subsystem to provide rules for this common approach. Get the
existing tools out of hacky unstructured direct hardware access via
/sys/XX and into a formalized lockdown compatible system. I've talked
to enough people now to think we have a critical mass.
===============
fwctl subsystem
===============
Overview
--------
Modern devices contain extensive amounts of FW, and in many cases, are largely
software defined pieces of hardware. The evolution of this approach is largely a
reaction to Moore's Law where a chip tape out is now highly expensive, and the
chip design is extremely large. Replacing fixed HW logic with a flexible and
tightly coupled FW/HW combination is an effective risk mitigation against chip
respin. Problems in the HW design can be counteracted in device FW. This is
especially true for devices which present a stable and backwards compatible
interface to the operating system driver (such as NVMe).
The FW layer in devices has grown to incredible sizes and devices frequently
integrate clusters of fast processors to run it. For example, mlx5 devices have
over 30MB of FW code, and big configurations operate with over 1GB of FW managed
runtime state.
The availability of such a flexible layer has created quite a variety in the
industry where single pieces of silicon are now configurable software defined
devices and can operate in quite different ways depending on the need. Further
we often see cases where specific sites wish to operate devices in ways that are
highly specialized and can only run an application that has been tailored to
their unique configuration.
Further, devices have become multi-functional and integrated; they no longer
fit neatly into the kernel's division of subsystems. Modern multi-functional
devices have drivers, such as bnxt/ice/mlx5/pds, that span many subsystems
while sharing the underlying hardware using the auxiliary device system.
All together this creates a challenge for the operating system, where devices
have an expansive FW environment that needs robust vendor-specific debugging
support, and FW driven functionality that is not well suited to “generic”
interfaces. fwctl seeks to allow access to the full device functionality from
user space in the areas of debuggability, management, and first-boot/nth-boot
provisioning.
fwctl is aimed at the common device design pattern where the OS and FW
communicate via an RPC message layer constructed with a queue or mailbox scheme.
In this case the driver will typically have some layer to deliver RPC messages
and collect RPC responses from device FW. The in-kernel subsystem drivers that
operate the device for its primary purposes will use these RPCs to build their
drivers, but devices also usually have a set of ancillary RPCs that don't really
fit into any specific subsystem. For example, a HW RAID controller is primarily
operated by the block layer but also comes with a set of RPCs to administer the
construction of drives within the HW RAID.
In the past, when devices were more single-function, individual subsystems would
grow different approaches to solving some of these common problems, for instance
monitoring device health, manipulating its FLASH, debugging the FW,
provisioning, all have various unique interfaces across the kernel.
fwctl's purpose is to define a common set of limited rules, described below,
that allow user space to securely construct and execute RPCs inside device FW.
The rules serve as an agreement between the operating system and FW on how to
correctly design the RPC interface. As a uAPI the subsystem provides a thin
layer of discovery and a generic uAPI to deliver the RPCs and collect the
response. It supports a system of user space libraries and tools which will
use this interface to control the device.
Scope of Action
---------------
fwctl drivers are strictly restricted to being a way to operate the device FW.
It is not an avenue to access random kernel internals, or other operating system
SW states.
fwctl instances must operate on a well-defined device function, and the device
should have a well-defined security model for what scope within the device the
function is permitted to access. For instance, the most complex PCIe device
today may broadly have several function level scopes:
1. A privileged function with full access to the on-device global state and
configuration
2. Multiple hypervisor functions with control over themselves and the child
   functions used with VMs
3. Multiple VM functions tightly scoped within the VM
The device may create a logical parent/child relationship between these scopes,
for instance a child VM's FW may be within the scope of the hypervisor FW. It is
quite common in the VFIO world that the hypervisor environment has a complex
provisioning/profiling/configuration responsibility for the function VFIO
assigns to the VM.
Further, within the function, devices often have RPC commands that fall within
some general scopes of action:
1. Access to function & child configuration, flash, etc that becomes live at a
function reset.
2. Access to function & child runtime configuration that kernel drivers can
discover at runtime.
3. Read only access to function debug information that may report on FW objects
in the function & child, including FW objects owned by other kernel
subsystems.
4. Write access to function & child debug information strictly compatible with
the principles of kernel lockdown and kernel integrity protection. Triggers
a kernel Taint.
5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.
Limitation to the appropriate scope must be performed by the kernel.
There are many things this interface must not allow user space to do (without a
Taint or CAP), broadly derived from the principles of kernel lockdown. Some
examples:
1. DMA to/from arbitrary memory, hang the system, run code in the device, or
otherwise compromise device or system security and integrity.
2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
objects owned by kernel drivers.
3. Directly configure or otherwise control kernel drivers. The kernel driver
can react to the device configuration at function reset/driver load time,
but otherwise should not be coupled to fwctl.
4. Operate the HW in a way that overlaps with the core purpose of another
primary kernel subsystem, such as read/write to LBAs, send/receive of
Network packets, or operate an accelerator's data plane.
fwctl is not a replacement for device direct access subsystems like uacce or
VFIO.
Security Response
-----------------
The kernel remains the gatekeeper for this interface. If violations of the
scopes, security or isolation principles are found, we have options to let
devices fix them with a FW update, push a kernel patch to parse and block RPC
commands, or push a kernel patch to block entire firmware versions or devices.
While the kernel can always directly parse and restrict RPCs, it is expected
that the existing kernel pattern of allowing drivers to delegate validation to
FW will remain a useful design.
fwctl Driver design
-------------------
In many cases a fwctl driver is going to be part of a larger cross-subsystem
device possibly using the auxiliary_device mechanism. In that case several
subsystems are going to be sharing the same device and FW interface layer so the
device design must already provide for isolation and co-operation between kernel
subsystems. fwctl should fit into that same model.
Part of the driver should include a description of how its scope restrictions
and security model work. The driver and FW together must ensure that RPCs
provided by user space are mapped to the appropriate scope and that the user
space has rights to work on that scope. Broadly there are two approaches drivers
can take:
1. Have the kernel provide a scope label to the FW along with the user space
   RPC, and have FW parse and enforce the scope
2. Have the kernel parse the RPC and enforce the scope
The driver and FW must co-operate to ensure that either fwctl cannot allocate
any FW resources, or any resources it does allocate are freed on FD closure. A
driver primarily constructed around FW RPCs may find that its core PCI function
and RPC layer belongs under fwctl with auxiliary devices connecting to other
subsystems.
User space Operation
--------------------
To be written in detail, broadly:
- A “fwctl” class
- Each fwctl device driver makes a fwctl instance with a struct device and
sysfs presence in its class and a char device /dev/fwctl/XX
- Some sysfs files per-instance describing what the device is and bootstrap
details needed to operate and format RPCs.
- fds opened from /dev/fwctl/XX support an IOCTL to specify the scope, execute
an RPC and return the response.
- A security layer ensuring that operations from user space are restricted and
contained within the defined scopes.
User space Community
--------------------
Drawing inspiration from nvme-cli, participating in the kernel side must come
with a user space in a common TBD git tree, at a minimum to usefully operate the
kernel driver. Providing such an implementation is a pre-condition to merging a
kernel driver.
The goal is to build user space community around some of the shared problems
we all have, and ideally develop some common user space programs with some
starting themes of:
- Device in-field debugging
- HW provisioning
- VFIO child device profiling before VM boot
- Confidential Compute topics (attestation, secure provisioning)
These themes stretch across all subsystems in the kernel. fwupd is a great
example of how an excellent user space experience can emerge out of
kernel-side diversity.
Existing Similar Examples
-------------------------
The approach described in this document is not a new idea. Direct or
near-direct device access has been offered by the kernel in different areas
for decades. With more devices wanting to follow this design pattern, it is
becoming clear that the pattern is not entirely well understood and, more
importantly, that the security considerations are not well defined or agreed
upon.
Some examples:
- HW RAID controllers. This includes RPCs to do things like compose drives
  into a RAID volume, configure RAID parameters, monitor the HW, and more.
- Baseboard managers. RPCs for configuring settings in the device, and more.
- NVMe vendor command capsules. nvme-cli provides access to some monitoring
  functions that vendors have defined, but more exists.
- DRM allows user space drivers to send commands to the device via kernel
  mediation.
- RDMA allows user space drivers to directly push commands to the device
  without kernel involvement.
- Various “raw” APIs: raw HID (SDL2), raw USB, NVMe Generic Interface, etc.
The first three are examples of areas that fwctl intends to cover.
A key lesson from these past efforts is the importance of requiring a common
user space project as a pre-condition for obtaining a kernel driver.
Developing good community around useful software in user space is key to
getting vendor companies to fund participation.
On 3/4/24 9:02 AM, Jason Gunthorpe wrote:
> On Wed, Feb 14, 2024 at 01:57:35PM -0400, Jason Gunthorpe wrote:
>
>> I also like this, I don't want the outcome of this discussion to be
>> that only mlx5ctl gets merged. I want all the HW that has this problem
>> to have support in the mainline kernel.
>
> To this end here is my proposal to move forward with a new
> mini-subsystem to provide rules for this common approach. Get the
> existing tools out of hacky unstructured direct hardware access via
> /sys/XX and into a formalized lockdown compatible system. I've talked
> to enough people now to think we have a critical mass.
>
> ===============
> fwctl subsystem
> ===============

The aux bus infrastructure was created specifically for multifunction
devices -- it allows a core PCI device driver with smaller, subsystem
focused drivers for vendor specific implementations of various S/W stack
APIs (IB, netdev, etc). One aspect not addressed in that design is where
to park the various drivers and extensions that are not solely tied to
any one subsystem.

Given that, how about moving the existing auxbus code into a new
directory, drivers/auxbus. From there, create a subdirectory for this
new fwctl subsystem, which is most likely to be realized as a new auxbus
device, and then subdirectories for vendor specific drivers for the aux
bus device. New drivers being developed in this auxbus world can then
put the core PCI device handling code under drivers/auxbus/core.

In short:
- drivers/auxbus/auxiliary.c
- drivers/auxbus/core/<vendor>/ - per h/w device driver for managing
  the PCI device which is shared across multiple auxbus devices
- drivers/auxbus/fwctl/fwctl.c - this FW interface
- drivers/auxbus/fwctl/<vendor>/ - vendor specific driver for a fwctl
  auxbus device
On Mon, Apr 29, 2024 at 07:36:41PM -0600, David Ahern wrote:
> [... aux bus directory layout proposal, quoted in full above ...]
>
> In short:
> - drivers/auxbus/auxiliary.c

Ah, the ever-frequent "where do we put the files" discussion :)

Is it "per bus" or "per functionality"? In the end, it's a bit of both.
HOWEVER, no, auxiliary.c belongs in drivers/base/ for now, why move it?

And really, stuff on the auxbus has nothing to do with the auxbus, it
has everything to do with the common device that auxbus really is
emulating (i.e. a pci device with multi-functions, or whatever.)

> - drivers/auxbus/core/<vendor>/ - per h/w device driver for managing
>   the PCI device which is shared across multiple auxbus devices

Why "vendor"? What happens when vendor names change? :)

> - drivers/auxbus/fwctl/fwctl.c - this FW interface

fwctl has nothing to do with auxbus other than it might be a user of it,
right? So again, no need for it to be under auxbus/ as that is not
needed.

> - drivers/auxbus/fwctl/<vendor>/ - vendor specific driver for a fwctl
>   auxbus device

If you want drivers/fwctl/<VENDOR>/, and <VENDOR> numbers get huge,
sure, make subdirs, but really, why create a subdir like that if you
don't need it? Start simple please, you can always move files later if
it gets too complex.

I think that discussing where to place non-existent files at this point
in time is kind of funny, given we have yet to see any actual code...

thanks,

greg k-h
On 3/4/24 9:02 AM, Jason Gunthorpe wrote:
> [...]
>
> To this end here is my proposal to move forward with a new
> mini-subsystem to provide rules for this common approach. Get the
> existing tools out of hacky unstructured direct hardware access via
> /sys/XX and into a formalized lockdown compatible system. I've talked
> to enough people now to think we have a critical mass.

It has been almost 3 weeks and no response (to this and other proposals
in this thread). I have been around Linux long enough to know that
silence is not acceptance.

To me, this seems like a sane start for a new subsystem addressing
challenges with complex, modern devices. To that end:

Acked-by: David Ahern <dsahern@kernel.org>
On Thu, Mar 21, 2024 at 09:23:28PM -0600, David Ahern wrote:
> On 3/4/24 9:02 AM, Jason Gunthorpe wrote:
> > [... fwctl subsystem proposal quoted in full earlier in the thread ...]
>
> It has been almost 3 weeks and no response (to this and other proposals
> in this thread). I have been around Linux long enough to know that
> silence is not acceptance.
>
> To me, this seems like a sane start for a new subsystem addressing
> challenges with complex, modern devices. To that end:
>
> Acked-by: David Ahern <dsahern@kernel.org>

+1. We look forward to helping with the development and testing in any
way that we can.

Jason, thank you for creating this proposal, and apologies for not
responding on-list sooner.

Aron
On Thu, Mar 21, 2024 at 09:23:28PM -0600, David Ahern wrote:
> On 3/4/24 9:02 AM, Jason Gunthorpe wrote:
> > [... fwctl subsystem proposal quoted in full earlier in the thread ...]
>
> It has been almost 3 weeks and no response (to this and other proposals
> in this thread). I have been around Linux long enough to know that
> silence is not acceptance.
>
> To me, this seems like a sane start for a new subsystem addressing
> challenges with complex, modern devices. To that end:
>
> Acked-by: David Ahern <dsahern@kernel.org>

It's the middle of the merge window, not much we can actually do, and
this patch series itself couldn't be applied as-is, so it's hard to see
what could have happened on my end...

thanks,

greg k-h
On 3/22/24 1:32 AM, Greg Kroah-Hartman wrote:
>
> It's the middle of the merge window, not much we can actually do and
> this patch series itself couldn't be applied as-is, so it's hard to see
> what could have happened on my end...

The proposal was sent a week before the end of the last development
cycle, and I believe the intent was to motivate discussion around a
concrete proposal to converge on an acceptable solution before sending
patches.

On your end, what would be helpful is either a simple yes this seems
reasonable or no you don't like it for reasons X, Y, and Z.
On Fri, Mar 22, 2024 at 09:24:29AM -0600, David Ahern wrote:
> On 3/22/24 1:32 AM, Greg Kroah-Hartman wrote:
> > It's the middle of the merge window, not much we can actually do and
> > this patch series itself couldn't be applied as-is, so it's hard to
> > see what could have happened on my end...
>
> The proposal was sent a week before the end of the last development
> cycle, and I believe the intent was to motivate discussion around a
> concrete proposal to converge on an acceptable solution before sending
> patches.
>
> On your end, what would be helpful is either a simple yes this seems
> reasonable or no you don't like it for reasons X, Y, and Z.

Well said, David.

I would totally support doing something like this in a fairly generic
way that could be leveraged/instantiated by drivers that will allow
communication/inspection of hardware blocks in the datapath. There are
lots of different ways this could go, so feedback on this would help get
us all moving in the right direction.
On Fri, 22 Mar 2024 11:46:27 -0400 Andy Gospodarek wrote:
> > > It's the middle of the merge window, not much we can actually do
> > > and this patch series itself couldn't be applied as-is, so it's
> > > hard to see what could have happened on my end...
> >
> > The proposal was sent a week before the end of the last development
> > cycle, and I believe the intent was to motivate discussion around a
> > concrete proposal to converge on an acceptable solution before
> > sending patches.
> >
> > On your end, what would be helpful is either a simple yes this seems
> > reasonable or no you don't like it for reasons X, Y, and Z.
>
> Well said, David.
>
> I would totally support doing something like this in a fairly generic
> way that could be leveraged/instantiated by drivers that will allow
> communication/inspection of hardware blocks in the datapath. There are
> lots of different ways this could go, so feedback on this would help
> get us all moving in the right direction.

The more I learn, the more I am convinced that the technical
justifications here are just smoke and mirrors. The main motivation for
nVidia, Broadcom, (and Enfabrica?) being to hide as much as possible of
what you consider your proprietary advantage in the "AI gold rush".

RDMA is what it is, but I really hate how you're trying to pretend that
it is somehow an inherent need of advanced technology and that we need
to lower the openness standards for all of the kernel.
On Fri, Mar 22, 2024 at 01:58:26PM -0700, Jakub Kicinski wrote:
> On Fri, 22 Mar 2024 11:46:27 -0400 Andy Gospodarek wrote:
> > [...]
> > I would totally support doing something like this in a fairly generic
> > way that could be leveraged/instantiated by drivers that will allow
> > communication/inspection of hardware blocks in the datapath. There
> > are lots of different ways this could go, so feedback on this would
> > help get us all moving in the right direction.
>
> The more I learn, the more I am convinced that the technical
> justifications here are just smoke and mirrors.

Let's see some evidence of this then: point to some silicon devices in
the multibillion gate space that don't have complex FW built into their
design.

> The main motivation for nVidia, Broadcom, (and Enfabrica?) being to
> hide as much as possible of what you consider your proprietary
> advantage in the "AI gold rush".

Despite all of those having built devices like this well before the
"AI gold rush", and it being a general overall design principle for the
industry because, yes, the silicon technology available actually demands
it.

It is not to say you couldn't do otherwise, it is just simply too
expensive.

> RDMA is what it is but I really hate how you're trying to pretend
> that it is somehow an inherent need of advanced technology and
> we need to lower the openness standards for all of the kernel.

Open hardware has never been an "openness standard" for the kernel.

Jason
On Fri, 22 Mar 2024 18:44:23 -0300 Jason Gunthorpe wrote:
> On Fri, Mar 22, 2024 at 01:58:26PM -0700, Jakub Kicinski wrote:
> > The more I learn, the more I am convinced that the technical
> > justifications here are just smoke and mirrors.
>
> Let's see some evidence of this then: point to some silicon devices
> in the multibillion gate space that don't have complex FW built into
> their design.

Existence of complex FW does not imply that production systems must have
a backdoor to talk to that FW in kernel-unmitigated fashion.

As an existence proof I give you NICs we use at Meta.
Or old Netronome NICs, you can pick.

> > The main motivation for nVidia, Broadcom, (and Enfabrica?) being to
> > hide as much as possible of what you consider your proprietary
> > advantage in the "AI gold rush".
>
> Despite all of those having built devices like this well before the
> "AI gold rush", and it being a general overall design principle for
> the industry because, yes, the silicon technology available actually
> demands it.
>
> It is not to say you couldn't do otherwise, it is just simply too
> expensive.

I do agree that it is expensive, not sure if it's "too" expensive.
But Linux never promised that our way of doing SW development would
always be the most cost effective option, right? Especially short term.
Or that we'll be competitive time to market.

> > RDMA is what it is but I really hate how you're trying to pretend
> > that it is somehow an inherent need of advanced technology and
> > we need to lower the openness standards for all of the kernel.
>
> Open hardware has never been an "openness standard" for the kernel.

I was in a meeting with a vendor this morning, and when explicitly asked
by an SRE (not from my org nor in any way "primed" by me) whether
configuration of some run of the mill PCI thing could be exposed via
devlink params instead of whatever proprietary thing the vendor was
pitching, the vendor's answer was silence and then a pitch of another
proprietary mechanism.

So no, "open hardware" is certainly not a requirement for the kernel.
But users can't get vendors to implement standard Linux configuration
interfaces, and your proposal will make it a lot worse.
On Fri, Mar 22, 2024 at 03:29:24PM -0700, Jakub Kicinski wrote:
> Existence of complex FW does not imply that production systems must
> have a backdoor to talk to that FW in kernel-unmitigated fashion.

I think we've been over this endlessly: production systems require a
solid debugging story for all software in the system, including the
device FW.

> As an existence proof I give you NICs we use at Meta.
> Or old Netronome NICs, you can pick.

I wouldn't pick those, they don't meet the multi-billion gate criteria,
ie expensive chips built on the most cutting edge and expensive process.
Smaller chips on a non-cutting-edge process have different economics.
Startups have different economics.

Look at something like an Intel/AMD GPU or a Habana Labs device. Lack of
FW creates a very opaque driver that is twiddling megabytes of register
mnemonics, incomprehensible without HW documentation. A win for code
availability, but it hasn't actually crossed into being usefully open.

> > It is not to say you couldn't do otherwise, it is just simply too
> > expensive.
>
> I do agree that it is expensive, not sure if it's "too" expensive.
> But Linux never promised that our way of doing SW development would
> always be the most cost effective option, right? Especially short
> term. Or that we'll be competitive time to market.

By "too expensive" I mean the vendor cannot produce a chip at a price
that is salable. Or a startup goes out of business because it can't
afford to respin.

Linux promises community collaboration where the community members
broadly should be driving the priorities. When TTM matters and is agreed
upon, it is done. See for instance the various massive security fixes.

> I was in a meeting with a vendor this morning and when explicitly
> asked by an SRE (not from my org nor in any way "primed" by me)
> whether configuration of some run of the mill PCI thing could be
> exposed

"Run of the mill PCI thing"? Does this thing already have a devlink
knob? Usually "run of the mill PCI things" are configured through the
PCI subsystem, not devlink.

> via devlink params instead of whatever proprietary thing the vendor
> was pitching, the vendor's answer was silence and then a pitch of
> another proprietary mechanism.

Our team was very excited about devlink when it first came about. But
now that we have so many devlink parameters that have been kept out of
mainline, I see that the excitement has died.

> So no, "open hardware" is certainly not a requirement for the
> kernel. But users can't get vendors to implement standard Linux
> configuration interfaces, and your proposal will make it a lot worse.

I don't agree. If you can't get your vendor to implement the thing you
want on devlink right now today, this fwctl isn't going to change that
one bit. You already have the vendor tool and the vendor telling you to
use it. It makes no difference at all how the existing vendor tools
reach the device.

Indeed, counting on lockdown to break all the existing vendor tools and
render them permanently unusable seems to me to be straining one of the
few hard full-project rules Linux actually has: don't break existing
userspace.

Think bigger: maybe your SRE will be happier if, as part of this, we can
get the vendors to agree on some common userspace tooling for device
configuration! Wouldn't that be a great big ecosystem improvement!

Frankly, there are many forms of common interfaces and many paths to get
there. I don't view your very restrictive approach as helpful to growing
this space. It hasn't delivered a working ecosystem, and I don't think
it ever will. Instead, as this thread is showing, we have a bunch of
unhappy community members and poor upstream support for devices.

Jason
On 22 Mar 15:29, Jakub Kicinski wrote:
> Existence of complex FW does not imply that production systems must
> have a backdoor to talk to that FW in kernel-unmitigated fashion.
>
> As an existence proof I give you NICs we use at Meta.
> Or old Netronome NICs, you can pick.

This is not true at all, at least for our NICs. Our NICs do need
non-netdev interfaces, at least for debug and monitoring of non-netdev
functionality and use cases at Meta. We can talk about this offline.
Also, below you mention another one of your vendors using a proprietary
mechanism for configuration, so you can't have it both ways.

It is obvious to everyone that in the AI era everyone needs
customization. This interface is the proposal for standardization; if
you care to look at Jason's proposal, you will see how he goes to great
length describing how abstraction can happen in user space.

> I was in a meeting with a vendor this morning and when explicitly
> asked by an SRE (not from my org nor in any way "primed" by me)
> whether configuration of some run of the mill PCI thing could be
> exposed via devlink params instead of whatever proprietary thing the
> vendor was pitching, the vendor's answer was silence and then a pitch
> of another proprietary mechanism.

Well, this is why we came up with the fwctl interface, so nobody needs
to sit in silence and all vendors can agree on one interface. We both
know devlink params can't scale well enough to accommodate all vendors,
and don't forget it's netdev specific!

> So no, "open hardware" is certainly not a requirement for the kernel.
> But users can't get vendors to implement standard Linux configuration
> interfaces, and your proposal will make it a lot worse.

Vendors are already using proprietary configuration interfaces, via
direct PCI access through sysfs. So, contrary to what you say, this
proposal came to unify vendors and improve the user's experience.

With fwctl, and the proper user space shared tooling as Jason suggested,
you can force other vendors to follow the herd and implement the new
standard interfaces that we already have 3 vendors agreeing to.
On 3/22/24 2:58 PM, Jakub Kicinski wrote:
> The more I learn, the more I am convinced that the technical
> justifications here are just smoke and mirrors. The main motivation
> for nVidia, Broadcom, (and Enfabrica?) being to hide as much as
> possible of what you consider your proprietary advantage in the "AI
> gold rush". RDMA is what it is but I really hate how you're trying to
> pretend that it is somehow an inherent need of advanced technology and
> we need to lower the openness standards for all of the kernel.

Kuba: can you respond to Jason's email with the proposal for the new
fwctl subsystem and identify places you do not agree? That would provide
more concrete discussion points. Thanks,
On Fri, 22 Mar 2024 15:18:09 -0600 David Ahern wrote:
> can you respond to Jason's email with the proposal for the new fwctl
> subsystem and identify places you do not agree? That would provide
> more concrete discussion points. Thanks,

Respond in what way, David? Comment on technical aspects of whether a
common class device with a discovery mechanism and a sprinkling of
semantically meaningless fields can be implemented? Some trivial object
hierarchy?

On whether someone can actually enforce any of the 4 "don't"s, and
whether this interface is basically encouraging and giving a leg up to
those willing to be dishonest?

Or should we go for another loop of me talking about openness and
building common abstractions, and vendors saying how their way of doing
basic configuration is so very special, and this is just for debug and
security and because others.

There's absolutely no willingness to try and build a common interface
here.
On 3/22/24 4:40 PM, Jakub Kicinski wrote:
> On Fri, 22 Mar 2024 15:18:09 -0600 David Ahern wrote:
>> can you respond to Jason's email with the proposal for the new fwctl
>> subsystem and identify places you do not agree? That would provide
>> more concrete discussion points. Thanks,
>
> Respond in what way, David? Comment on technical aspects of whether
> a common class device with a discovery mechanism and a sprinkling of
> semantically meaningless fields can be implemented? Some trivial
> object hierarchy?
>
> On whether someone can actually enforce any of the 4 "don't"s, and
> whether this interface is basically encouraging and giving a leg up
> to those willing to be dishonest?
>
> Or should we go for another loop of me talking about openness and
> building common abstractions, and vendors saying how their way of
> doing basic configuration is so very special, and this is just for
> debug and security and because others.
>
> There's absolutely no willingness to try and build a common interface
> here.

The proposal is an attempt at a common interface, and common tooling to
a degree, but independent of any specific subsystem, of which many are
supported by the device.

Your responses continue to align with the notion that because the device
can spit out ethernet frames, all diagnostics, debugging, configuration,
etc. MUST go through networking APIs. You seem unwilling to acknowledge
that devices can work for various use cases without a netdev driver, and
thus aspects of managing that device should be done outside of a netdev
driver.
On 26/03/2024 14:57, David Ahern wrote:
> The proposal is an attempt at a common interface and common tooling to
> a degree but independent of any specific subsystem of which many are
> supported by the device.

[ Let me prefix this by noting that I'm speaking personally here, and
not representing the position of my employer. ]

You can't have a "common interface" and yet be "independent" of anything
that could give semantics to that interface. What you have is a common
*transport*, used to connect to a *vendor* interface. If you can't even
bring yourself to be honest about that, it's no wonder you're getting
maintainer pushback.

Do we need to go all the way back to operating systems 101 and point out
that one of the fundamental jobs of a kernel is to *abstract* the
hardware, and provide *services* to userspace rather than mere devices?

Frankly, this whole thread reads to me like certain vendors whining that
they weren't expecting to have to get their new features *reviewed* by
upstream — possibly they thought devlink params would just get
rubber-stamped — and now they're finding that the kernel's quality
standards still apply.

Complaining that devlink params "don't scale" is disingenuous. Patches
aren't languishing for want of reviewer resources; it's just that it
takes *submitter* time and effort to bring them up to the quality level
that's required, and occasionally the vendor has to (shock! horror!)
tell the world what one of their magic knobs actually *does*.

If all the configuration of these Complex Devices™ goes through fwctl
backdoors, where exactly is anyone going to discover the commonalities
to underlie the generic interfaces of the next generation? What would
configuring plain vanilla netdevs be like today if, instead of a set of
well-defined cross-vendor APIs, ethtool (say) had been a mechanism to
write arbitrary values to hardware registers on the NIC?

These commonalities are key to allowing a product category to mature. I
realise vendors in many cases don't want that to happen, because mature
products are largely commoditised and thus don't command huge margins;
but Linux isn't here to serve vendors' interests at the expense of
users.

On 23/03/2024 01:27, Saeed Mahameed wrote:
> It is obvious to everyone that in the AI era, everyone needs
> customization

It's always possible to argue that the New Thing is qualitatively
different from anything that went before, that these "multibillion gate
devices" need to be able to break the rules. But the truth is, you
aren't that special.

-e
On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
> On 26/03/2024 14:57, David Ahern wrote:
> > The proposal is an attempt at a common interface and common tooling to a
> > degree but independent of any specific subsystem of which many are
> > supported by the device.
>
> [ Let me prefix this by noting that I'm speaking personally here, and
> not representing the position of my employer. ]

<...>

> you're getting maintainer pushback.

May I suggest you take a short break, collect the names of the people
who participated in this discussion, and check their contributions to
the Linux kernel in the git history/MAINTAINERS file?

After you do that, try to ask yourself if your response is still
appropriate.

Thanks.

> Do we need to go all the way back to operating systems 101 and point out
> that one of the fundamental jobs of a kernel is to *abstract* the
> hardware, and provide *services* to userspace rather than mere devices?
>
> Frankly, this whole thread reads to me like certain vendors whining that
> they weren't expecting to have to get their new features *reviewed* by
> upstream — possibly they thought devlink params would just get
> rubber-stamped — and now they're finding that the kernel's quality
> standards still apply.
> Complaining that devlink params "don't scale" is disingenuous. Patches
> aren't languishing for want of reviewer resources; it's just that it
> takes *submitter* time and effort to bring them up to the quality level
> that's required, and occasionally the vendor has to (shock! horror!)
> tell the world what one of their magic knobs actually *does*.
>
> If all the configuration of these Complex Devices™ goes through fwctl
> backdoors, where exactly is anyone going to discover the commonalities
> to underlie the generic interfaces of the next generation? What would
> configuring plain vanilla netdevs be like today if, instead of a set of
> well-defined cross-vendor APIs, ethtool (say) had been a mechanism to
> write arbitrary values to hardware registers on the NIC?
> These commonalities are key to allowing a product category to mature.
> I realise vendors in many cases don't want that to happen, because
> mature products are largely commoditised and thus don't command huge
> margins; but Linux isn't here to serve vendors' interests at the
> expense of users.
>
> On 23/03/2024 01:27, Saeed Mahameed wrote:
> > It is obvious to everyone that in the AI era, everyone needs
> > customization
>
> It's always possible to argue that the New Thing is qualitatively
> different from anything that went before, that these "multibillion
> gate devices" need to be able to break the rules.
> But the truth is, you aren't that special.
>
> -e
On 02/04/2024 19:48, Leon Romanovsky wrote:
> On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
> > you're getting maintainer pushback.
>
> May I suggest you take a short break, collect the names of the people
> who participated in this discussion, and check their contributions to
> the Linux kernel in the git history/MAINTAINERS file?

Whether you like it or not, Kuba is a kernel maintainer.
And thus, semantically, a Nack from him is "maintainer pushback".
That remains true regardless of who else in the discussion is also
a kernel maintainer.

If you had an actual point, feel free to explain to me, without the
veiled language, what was so 'inappropriate' about my posting.
On Wed, Apr 03, 2024 at 01:26:50PM +0100, Edward Cree wrote:
> On 02/04/2024 19:48, Leon Romanovsky wrote:
> > On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
> >> you're getting maintainer pushback.
> >
> > May I suggest you take a short break, collect the names of the people
> > who participated in this discussion, and check their contributions to
> > the Linux kernel in the git history/MAINTAINERS file?
>
> Whether you like it or not, Kuba is a kernel maintainer.
> And thus, semantically, a Nack from him is "maintainer pushback".
> That remains true regardless of who else in the discussion is also
> a kernel maintainer.
>
> If you had an actual point, feel free to explain to me, without the
> veiled language, what was so 'inappropriate' about my posting.

Language, tone, and content of your email were inappropriate:

https://lore.kernel.org/all/0ea32dd4-f408-5870-77eb-f18899f1ad44@gmail.com/

...certain vendors whining...

^^^^ Language

... possibly they thought devlink params would just get
rubber-stamped — and now they're finding that the kernel's quality
standards still apply. ...

^^^^ Tone
EVERYONE who participated in this discussion knows about the kernel's
quality standards.

... Patches aren't languishing for want of reviewer resources; it's
just that it takes *submitter* time and effort to bring them up to the
quality level that's required, and occasionally the vendor has to
(shock! horror!) tell the world what one of their magic knobs actually
*does*. ...

^^^^ Content
This paragraph alone shows that you completely didn't understand the
discussion here.

Thanks
On 4/3/24 1:00 PM, Leon Romanovsky wrote:
> On Wed, Apr 03, 2024 at 01:26:50PM +0100, Edward Cree wrote:
>> On 02/04/2024 19:48, Leon Romanovsky wrote:
>>> On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
>>>> you're getting maintainer pushback.
>>>
>>> May I suggest you take a short break, collect the names of the people
>>> who participated in this discussion, and check their contributions to
>>> the Linux kernel in the git history/MAINTAINERS file?
>> Whether you like it or not, Kuba is a kernel maintainer.
>> And thus, semantically, a Nack from him is "maintainer pushback".
>> That remains true regardless of who else in the discussion is also
>> a kernel maintainer.
>>
>> If you had an actual point, feel free to explain to me, without the
>> veiled language, what was so 'inappropriate' about my posting.
>
> Language, tone, and content of your email were inappropriate:
>
> https://lore.kernel.org/all/0ea32dd4-f408-5870-77eb-f18899f1ad44@gmail.com/
> ...certain vendors whining...
>
> ^^^^ Language
>
> ... possibly they thought devlink params would just get
> rubber-stamped — and now they're finding that the kernel's quality
> standards still apply. ...
>
> ^^^^ Tone
> EVERYONE who participated in this discussion knows about the kernel's
> quality standards.
>
> ... Patches aren't languishing for want of reviewer resources; it's
> just that it takes *submitter* time and effort to bring them up to the
> quality level that's required, and occasionally the vendor has to
> (shock! horror!) tell the world what one of their magic knobs actually
> *does*. ...
>
> ^^^^ Content
> This paragraph alone shows that you completely didn't understand the
> discussion here.
>
> Thanks

+1
On Wed, 3 Apr 2024 13:31:46 -0600 David Ahern wrote:
> > ... Patches aren't languishing for want of reviewer resources; it's
> > just that it takes *submitter* time and effort to bring them up to
> > the quality level that's required, and occasionally the vendor has
> > to (shock! horror!) tell the world what one of their magic knobs
> > actually *does*. ...
> >
> > ^^^^ Content
> > This paragraph alone shows that you completely didn't understand the
> > discussion here.
>
> +1

"didn't understand the discussion" is an ironic thing for you to +1,
David. After all my emails about HNS3 RDMA you somehow concluded today
that I want to make rules for the entire kernel:
https://lore.kernel.org/all/6faa47b0-27c3-47f9-94be-1ec671d9543c@kernel.org/

And I second what Ed said. I have asked multiple vendors preaching
impossibilism in this thread to start posting those knobs. I offered
to do a quick off-list review of the list of knobs they have to give
a quick yay / nay, so they don't waste time implementing things that
would get nacked. None of the vendors bothered taking me up on that
offer.
On Wed, Apr 03, 2024 at 05:01:49PM -0700, Jakub Kicinski wrote:
> On Wed, 3 Apr 2024 13:31:46 -0600 David Ahern wrote:
> > > ... Patches aren't languishing for want of reviewer resources; it's
> > > just that it takes *submitter* time and effort to bring them up to
> > > the quality level that's required, and occasionally the vendor has
> > > to (shock! horror!) tell the world what one of their magic knobs
> > > actually *does*. ...
> > >
> > > ^^^^ Content
> > > This paragraph alone shows that you completely didn't understand the
> > > discussion here.
> >
> > +1
>
> "didn't understand the discussion" is an ironic thing for you to +1,
> David. After all my emails about HNS3 RDMA you somehow concluded today
> that I want to make rules for the entire kernel:
> https://lore.kernel.org/all/6faa47b0-27c3-47f9-94be-1ec671d9543c@kernel.org/

What if (hypothetically) I told you that the congestion control
settings in the device FW impacted netdev-sourced ethernet traffic as
well? Would you be so sanguine that RDMA should have those settings?

> And I second what Ed said. I have asked multiple vendors preaching
> impossibilism in this thread to start posting those knobs. I offered
> to do a quick off-list review of the list of knobs they have to give
> a quick yay / nay, so they don't waste time implementing things that
> would get nacked. None of the vendors bothered taking me up on that
> offer.

As far as configuration/provisioning goes, it is really all or
nothing.

If a specific site can configure only 90% of the stuff required
because you will NAK the missing 10%, then it is still not usable and
is a wasted effort for everyone.

You have never shown that there is a path to 100% with your approach
to devlink. In fact I believe you've said flat out that 100% is not
achievable. Right here you illustrate the fundamental problem again:
there are configurables that already exist in the device that you will
NAK for devlink.

This is fundamentally why no one is taking you up on these generous
offers to pre-NAK devices' designs. You made it explicit that you will
NAK something and then it is not 100%.

Saeed has said repeatedly he wants 100% of the endless configurables
in mlx5. You have the manual and know what they are, tell him how to
get to 100% in a few months of work and I will believe you that it is
not impossible.

Then we only have fwctl's support for debugging and other topics to
argue about :P

Jason
On Thu, 4 Apr 2024 09:23:38 -0300 Jason Gunthorpe wrote:
> > "didn't understand the discussion" is an ironic thing for you to +1,
> > David. After all my emails about HNS3 RDMA you somehow concluded today
> > that I want to make rules for the entire kernel:
> > https://lore.kernel.org/all/6faa47b0-27c3-47f9-94be-1ec671d9543c@kernel.org/
>
> What if (hypothetically) I told you that the congestion control
> settings in the device FW impacted netdev-sourced ethernet traffic as
> well? Would you be so sanguine that RDMA should have those settings?
We can lawyer the words until the cows come home.
The team I work on takes care of both RoCE/IB/pick your fav proto
and TCP/IP NICs. It's fairly obvious what is RoCE and what is TCP
or user UDP when there are no incentives to act otherwise :|
> > And I second what Ed said. I have asked multiple vendors preaching
> > impossibilism in this thread to start posting those knobs. I offered
> > to do a quick off-list review of the list of knobs they have to give
> > a quick yay / nay, so they don't waste time implementing things that
> > would get nacked. None of the vendors bothered taking me up on that
> > offer.
>
> As far as configuration/provisioning goes, it is really all or
> nothing.
>
> If a specific site can configure only 90% of the stuff required
> because you will NAK the missing 10%, then it is still not usable
> and is a wasted effort for everyone.
(a) are you saying that the device needs 100% of the knobs to be used?
oof, you better warn your prospective customers :S
(b) as Ed pointed out some of the "knobs" are just hacks and lazy
workarounds so we rejected them for quality reasons; the remaining
rejects are because the knobs aren't really device specific, but
vendors don't want to extend existing APIs, as it is easier to
ship "features" without having a core kernel dependency...
> You have never shown that there is a path to 100% with your approach
> to devlink. In fact I believe you've said flat out that 100% is not
> achievable. Right here you illustrate the fundamental problem again:
> there are configurables that already exist in the device that you will
> NAK for devlink.
>
> This is fundamentally why no one is taking you up on these generous
> offers to pre-NAK devices' designs. You made it explicit that you
> will NAK something and then it is not 100%.
>
> Saeed has said repeatedly he wants 100% of the endless configurables
> in mlx5. You have the manual and know what they are, tell him how to
> get to 100% in a few months of work and I will believe you that it is
> not impossible.
Sorry, are you saying that I'm responsible for providing a solution
to allow arbitrary vendor tools to work and proprietary user space to
communicate directly with the proprietary firmware?
> Then we only have fwctl's support for debugging and other topics to
> argue about :P
On Thu, Apr 04, 2024 at 07:48:50AM -0700, Jakub Kicinski wrote:
> On Thu, 4 Apr 2024 09:23:38 -0300 Jason Gunthorpe wrote:
> > > "didn't understand the discussion" is an ironic thing for you to +1,
> > > David. After all my emails about HNS3 RDMA you somehow concluded today
> > > that I want to make rules for the entire kernel:
> > > https://lore.kernel.org/all/6faa47b0-27c3-47f9-94be-1ec671d9543c@kernel.org/
> >
> > What if (hypothetically) I told you that the congestion control
> > settings in the device FW impacted netdev-sourced ethernet traffic as
> > well? Would you be so sanguine that RDMA should have those settings?
>
> We can lawyer the words until the cows come home.
> The team I work on takes care of both RoCE/IB/pick your fav proto
> and TCP/IP NICs. It's fairly obvious what is RoCE and what is TCP
> or user UDP when there are no incentives to act otherwise :|

Sure, you can tell the difference, but that isn't my hypothetical. I'm
asking: what if the people who designed the device chose not to tell
the difference?

> > > And I second what Ed said. I have asked multiple vendors preaching
> > > impossibilism in this thread to start posting those knobs. I offered
> > > to do a quick off-list review of the list of knobs they have to give
> > > a quick yay / nay, so they don't waste time implementing things that
> > > would get nacked. None of the vendors bothered taking me up on that
> > > offer.
> >
> > As far as configuration/provisioning goes, it is really all or
> > nothing.
> >
> > If a specific site can configure only 90% of the stuff required
> > because you will NAK the missing 10%, then it is still not usable
> > and is a wasted effort for everyone.
>
> (a) are you saying that the device needs 100% of the knobs to be used?
>     oof, you better warn your prospective customers :S

No, I'm saying I have 100 customers, 600 configurables, and every
customer needs a partially overlapping set of 30 of them to be
different than the COTS manufacturing default.

I can implement 30 and support one customer, but I can't support all
100 customers without all 600 knobs.

> (b) as Ed pointed out some of the "knobs" are just hacks and lazy
>     workarounds so we rejected them for quality reasons; the remaining
>     rejects are because the knobs aren't really device specific, but
>     vendors don't want to extend existing APIs, as it is easier to
>     ship "features" without having a core kernel dependency...

Which is back to my point. You are picking and choosing what gets to
be supported, and the end result is that none of the 100 customers get
to actually work.

It is overreach to demand that the devices be re-designed as a
condition of being part of mainline. The configurables exist as they
are and need to be supported, in one way or another, by the kernel.

> > You have never shown that there is a path to 100% with your approach
> > to devlink. In fact I believe you've said flat out that 100% is not
> > achievable. Right here you illustrate the fundamental problem again:
> > there are configurables that already exist in the device that you will
> > NAK for devlink.
> >
> > This is fundamentally why no one is taking you up on these generous
> > offers to pre-NAK devices' designs. You made it explicit that you
> > will NAK something and then it is not 100%.
> >
> > Saeed has said repeatedly he wants 100% of the endless configurables
> > in mlx5. You have the manual and know what they are, tell him how to
> > get to 100% in a few months of work and I will believe you that it is
> > not impossible.
>
> Sorry, are you saying that I'm responsible for providing a solution
> to allow arbitrary vendor tools to work and proprietary user space to
> communicate directly with the proprietary firmware?

I am responding to your remark about "vendors preaching
impossibilism". If you want me to agree with you that it is possible,
then yes, you need to show a way where we get to a point that users
are actually able to solve their problems.

Otherwise all I hear is how you are going to NAK some unknowable
subset of the needed configurables. Sure sounds impossible to me.

Jason
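Jason's scaling argument above lends itself to a quick back-of-the-envelope check. This is only an illustrative sketch: the 100-customer / 600-knob / 30-per-customer figures are his rhetorical numbers, and the independent-overlap model below is an added assumption, not anything stated in the thread:

```python
from math import comb

TOTAL_KNOBS = 600    # configurables across mlx5 (Jason's figure)
PER_CUSTOMER = 30    # knobs each customer needs off-default (his figure)
CUSTOMERS = 100

# The number of distinct 30-knob subsets a customer might need is
# astronomical, so no small curated "blessed" subset can cover every
# deployment in advance.
print(comb(TOTAL_KNOBS, PER_CUSTOMER) > 10**40)   # True

# If each customer's 30 knobs were drawn independently, the expected
# number of distinct knobs needed across 100 customers covers nearly
# the whole set -- supporting "90% is not enough", not "30 suffice".
expected = TOTAL_KNOBS * (1 - (1 - PER_CUSTOMER / TOTAL_KNOBS) ** CUSTOMERS)
print(round(expected))   # 596
```

Under this toy model, supporting any one customer needs only 30 knobs, but supporting the whole customer base needs essentially all 600, which is the "all or nothing" point being made.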
#include <disclaimer.h>
On 04/04/2024 18:47, Jason Gunthorpe wrote:
> The configurables exist as they are
> and need to be supported, in one way or another, by the kernel.
Why? What does the kernel get out of it?
Maybe *you* need them to be supported, but maybe you should have
thought of that earlier in the design process. ("A failure on
your part to foresee the eminently foreseeable does not
constitute an emergency on mine.")
If we let folks bypass our standards with a _fait accompli_, we
don't really have standards in the first place.
On Thu, Apr 04, 2024 at 07:06:53PM +0100, Edward Cree wrote:
> #include <disclaimer.h>
>
> On 04/04/2024 18:47, Jason Gunthorpe wrote:
> > The configurables exist as they are
> > and need to be supported, in one way or another, by the kernel.
>
> Why? What does the kernel get out of it?
>
> Maybe *you* need them to be supported, but maybe you should have
> thought of that earlier in the design process. ("A failure on
> your part to foresee the eminently foreseeable does not
> constitute an emergency on mine.")
> If we let folks bypass our standards with a _fait accompli_, we
> don't really have standards in the first place.
Sorry, who are "we" and what are "our standards"?
And, please, please reread the thread, it is not about devlink at all.
Thanks
On 04/04/2024 19:35, Leon Romanovsky wrote:
> On Thu, Apr 04, 2024 at 07:06:53PM +0100, Edward Cree wrote:
>> Why? What does the kernel get out of it?
>>
>> Maybe *you* need them to be supported, but maybe you should have
>> thought of that earlier in the design process. ("A failure on
>> your part to foresee the eminently foreseeable does not
>> constitute an emergency on mine.")
>> If we let folks bypass our standards with a _fait accompli_, we
>> don't really have standards in the first place.
>
> Sorry, who are "we" and what are "our standards"?
As should be obvious from context, "we" in that sentence referred to
the mainline kernel. And while participants in this thread currently
disagree on what "our standards" are, I hope it is not contentious
that the kernel community *does* have standards as to what code and
design is acceptable for inclusion.
Necessarily, for those standards to be meaningful, there cannot be an
exception for "oh, but we already built our product, it's too late
now", otherwise everyone would just ignore the standards and then
demand to be merged anyway.
That is true regardless of the specific standards in question.
> And, please, please reread the thread, it is not about devlink at all.
Please, please, reread the email to which you are replying; it is
not about devlink at all.
-ed
On Thu, Apr 04, 2024 at 08:46:41PM +0100, Edward Cree wrote:
> On 04/04/2024 19:35, Leon Romanovsky wrote:
> > On Thu, Apr 04, 2024 at 07:06:53PM +0100, Edward Cree wrote:
> >> Why? What does the kernel get out of it?
> >>
> >> Maybe *you* need them to be supported, but maybe you should have
> >> thought of that earlier in the design process. ("A failure on
> >> your part to foresee the eminently foreseeable does not
> >> constitute an emergency on mine.")
> >> If we let folks bypass our standards with a _fait accompli_, we
> >> don't really have standards in the first place.
> >
> > Sorry, who are "we" and what are "our standards"?
>
> As should be obvious from context, "we" in that sentence referred to
> the mainline kernel. And while participants in this thread currently
> disagree on what "our standards" are, I hope it is not contentious
> that the kernel community *does* have standards as to what code and
> design is acceptable for inclusion.
You didn't answer my question. What are "our standards"?
Thanks
On 4/5/24 12:41, Leon Romanovsky wrote:
> On Thu, Apr 04, 2024 at 08:46:41PM +0100, Edward Cree wrote:
>> On 04/04/2024 19:35, Leon Romanovsky wrote:
>>> On Thu, Apr 04, 2024 at 07:06:53PM +0100, Edward Cree wrote:
>>>> Why? What does the kernel get out of it?
>>>>
>>>> Maybe *you* need them to be supported, but maybe you should have
>>>> thought of that earlier in the design process. ("A failure on
>>>> your part to foresee the eminently foreseeable does not
>>>> constitute an emergency on mine.")
>>>> If we let folks bypass our standards with a _fait accompli_, we
>>>> don't really have standards in the first place.
>>>
>>> Sorry, who are "we" and what are "our standards"?
>>
>> As should be obvious from context, "we" in that sentence referred to
>> the mainline kernel. And while participants in this thread currently
>> disagree on what "our standards" are, I hope it is not contentious
>> that the kernel community *does* have standards as to what code and
>> design is acceptable for inclusion.
>
> You didn't answer my question. What are "our standards"?
>
> Thanks
>
Our standards include the use of a welcoming tone, and, in this
thread, you have been aggressive instead.
With that said, I appreciate the review feedback that I have received
from you.
On 4/3/24 6:01 PM, Jakub Kicinski wrote:
> On Wed, 3 Apr 2024 13:31:46 -0600 David Ahern wrote:
>>> ... Patches aren't languishing for want of reviewer resources; it's
>>> just that it takes *submitter* time and effort to bring them up to
>>> the quality level that's required, and occasionally the vendor has
>>> to (shock! horror!) tell the world what one of their magic knobs
>>> actually *does*. ...
>>>
>>> ^^^^ Content
>>> This paragraph alone shows that you completely didn't understand the
>>> discussion here.
>>
>> +1
>
> "didn't understand the discussion" is an ironic thing for you to +1,

Come on, Jakub. The +1 was regarding Leon's entire response to Ed, not
just the last part you so conveniently quoted here. I agree 100% with
Leon that Ed's "Language, tone, and content of your email were
inappropriate". That is why I left the entire quote in the response
and not just the paragraph above.

> David. After all my emails about HNS3 RDMA you somehow concluded today
> that I want to make rules for the entire kernel:
> https://lore.kernel.org/all/6faa47b0-27c3-47f9-94be-1ec671d9543c@kernel.org/

And as for that response (the URL you listed there), what else should
one conclude? That has to be a fair summation of your stance because
this entire thread is on a driver completely unrelated to netdev that
you jumped in on with a NACK. I have asked multiple times in this
thread and the one before where you believe your boundary ends as a
maintainer.

> And I second what Ed said. I have asked multiple vendors preaching
> impossibilism in this thread to start posting those knobs. I offered
> to do a quick off-list review of the list of knobs they have to give
> a quick yay / nay, so they don't waste time implementing things that
> would get nacked. None of the vendors bothered taking me up on that
> offer.

Again, entirely missing the point. This is not about configuration
knobs that have the potential to be consistent across devices.

Sadly, this thread is not a spiral, ever so slowly converging on an
agreement, but a merry-go-round just going in circles.
On Tue, Apr 02, 2024 at 05:32:44PM +0100, Edward Cree wrote:
> On 26/03/2024 14:57, David Ahern wrote:
> > The proposal is an attempt at a common interface and common tooling to a
> > degree but independent of any specific subsystem of which many are
> > supported by the device.
>
> [ Let me prefix this by noting that I'm speaking personally here, and
> not representing the position of my employer. ]
>
> You can't have a "common interface" and yet be "independent" of anything
> that could give semantics to that interface. What you have is a common
> *transport*, used to connect to a *vendor* interface.
> If you can't even bring yourself to be honest about that, it's no wonder
> you're getting maintainer pushback.

I think this was covered in the doc I posted, it is unapologetically a
common transport. I have some ideas for different "common interfaces"
built in userspace. I think others will have good ideas too.

It is 'independent [of subsystems]' because it is manipulating *the
device*, not the subsystem software.

> Do we need to go all the way back to operating systems 101 and point out
> that one of the fundamental jobs of a kernel is to *abstract* the
> hardware, and provide *services* to userspace rather than mere devices?

Except that isn't even true! That is a very naive and simplistic view
of what an OS should do. If we took that view, tonnes of interfaces in
Linux would just have to be purged.

A fundamental thing the OS exists to do is to arbitrate, secure and
multiplex access to hardware. There are many levels where people can
put a stick in the ground and say 'common interface', and it is not
some 'wrong architecture' for the OS to delegate the 'common
interface' job to userspace, or even to say direct access is fine.

This is not an exclusive situation, fwctl doesn't mean future OS-side
common interfaces are somehow blocked.

> Frankly, this whole thread reads to me like certain vendors whining that
> they weren't expecting to have to get their new features *reviewed* by
> upstream — possibly they thought devlink params would just get
> rubber-stamped — and now they're finding that the kernel's quality
> standards still apply.

Oh please. "Quality standards" is not the issue here. This is a
philosophical disagreement on OS design and, as Jakub pointed out, a
second argument about who gets to have power in our community.

> Complaining that devlink params "don't scale" is disingenuous.

Saeed's remarks mean the review process doesn't scale. There are
something like 600-800 configurables in mlx5, and I assume other
devices are similar. You can't fight over commonality bit by bit that
many times and ever get anywhere. Everyone will get exhausted well
before it is done.

> If all the configuration of these Complex Devices™ goes through fwctl
> backdoors, where exactly is anyone going to discover the commonalities
> to underlie the generic interfaces of the next generation? What would
> configuring plain vanilla netdevs be like today if, instead of a set of

It would be *exactly the same as today*, because today everyone with
these devices uses the vendor tooling to configure them. Where is the
screaming? Where have the concrete demands for common interfaces on
some of the knobs been all these years? Where has keeping blessed
support out of the kernel got us?

If there was a real industry consensus to make commonality here it
would be done. In my view there is no industry will because it is not
actually an important problem that needs solving.

When you buy this kind of complex HW you need to ensure the flash is
configured for your site. You will work with the vendor and either get
devices with the flash preconfigured, or you will work with the vendor
to get a suitable FW version and configurable list for your site.

Then the job is simple: userspace needs to confirm and fix the device
to have the target flash state. The actual benefit of common names for
the individual configuration values is pretty tiny. Unless 100% of the
configuration is covered by common names it is not going to be
meaningful in practice.

In my view there is a lot of benefit to having a common tool to take
descriptions of devices' target flash states and then enforce them.
Something that is common to many devices and could work for
everything, not just some useless subset of argued-to-death blessed
items.

> These commonalities are key to allowing a product category to
> mature.

Uh, no. A lot of this configuration stuff is to serve niche customer
interests where the customers have no interest in publishing and
commonizing their method of operating, because it would give away
their own operator secrets. The vendors created it because their big
customers demanded it.

eg there are configurables in mlx5 that exist *solely* to accommodate
certain customers' pre-existing SW.

Stuff is like this because the *industry* doesn't want to fix it;
don't dump everything on the vendors like they are the source of the
problem.

And, as before, this message is talking about configuration and
devlink, but there are many more problems fwctl will solve with these
devices, including debugging/etc, that inherently can't be solved in
an abstract way.

> > It is obvious to everyone that in the AI era, everyone needs
> > customization
>
> It's always possible to argue that the New Thing is qualitatively
> different from anything that went before, that these "multibillion
> gate devices" need to be able to break the rules.

What supposed "rules"? Read my post again, this is well-trodden ground
in the kernel. Delegation is fairly normal Linux design - trying to
minimize the kernel footprint is also a well-accepted design axiom.

Jason
[ Again, I had better disassociate my employer from the vitriol below,
for which I alone am responsible. ]

On 02/04/2024 19:40, Jason Gunthorpe wrote:
> Uh, no. A lot of this configuration stuff is to serve niche customer
> interests where the customers have no interest in publishing and
> commonizing their method of operating, because it would give away
> their own operator secrets. The vendors created it because their big
> customers demanded it.
>
> eg there are configurables in mlx5 that exist *solely* to accommodate
> certain customers' pre-existing SW.

So it's a single-user hack; why do you want support for it in
upstream?
Oh right, because you want to be able to ship single-user hacks to all
your customers while still getting the _cachet_ of being An Intree
Driver with the implied engineering benefits of open source — despite
the actual feature implementations being obscured in firmware that's
not subject to the review process and thus doesn't actually carry
those benefits.

> There are something like 600-800 configurables in mlx5

So because your engineers can't design clean, orthogonal APIs for
toffee, you should be allowed to bypass review? Sorry, but no.
The right approach is to find generic mechanisms that cover multiple
customer configurations by allowing each customer to specify a policy
— which is better not just for the kernel but also for your customers
(they can experiment more easily with new policies) and even for you
(you don't have to spend engineer time on implementing hundreds of
single-purpose configuration switches, and can instead focus on making
better products). And of course the customer's policy, with their
'valuable' operator secrets, stays in-house.
This is not some revolutionary new idea or blue-sky architecture
astronaut thinking; this is the way the Unix engineering tradition has
worked for half a century.

And it's not like we don't give you the tools to do it!
BPF demonstrates that the kernel is perfectly willing to expose highly
complex configuration primitives to userspace, *as long as* the
interface is well-defined and cross-vendor.
Of course, without knowing what your several hundred knobs are for, I
can't tell you even in the broadest sense what shape a clean config
system to replace them would look like. But 800 magic flags isn't it.

> Where is the screaming? Where has keeping blessed support out of
> the kernel got us?

Well, clearly *someone* wants you to supply them an in-tree driver,
else you wouldn't be trying to upstream this.
Now maybe they really do just have a psychological aversion to
TAINT_OOT_MODULE, but it's more likely that the underlying reason is
the improved reliability, maintainability, and portability that come
from the upstreaming process. And that in turn only happens because
the kernel does not "bless" crap.

> The actual benefit of common names for the individual configuration
> values is pretty tiny.

In case I still need to make it clearer: the purpose of requiring your
configurables to go through review is not so you can have "common
names" for 800 magic flags. It's so that you are forced to come up
with configurables that actually have a sensible design and meaningful
semantics that you can define in a way that gives you something to
*put* in the name and the commit message that's not just "magic
behaviour tweak #42 for $BigCustomer".

> a second argument about who gets to have power in our community.

As I understand it, power in Linux is entirely social and informal.
If you think Kuba doesn't have standing to object, there's nothing
*technical* stopping you from applying the patches with his Nacked-by
tag included in the commit messages, then sending the resulting PR to
Linus.
And if you think that Linus would reject the PR in that case, then
you're implicitly conceding that Kuba *does* have standing.
> > There are something like 600-800 configurables in mlx5

I keep seeing that 600-800, and thinking that 2 to the power 600 is a
very big number, assuming they are simple on/off configurations.

Does any user actually have the 'optimal' configuration? How can they
even know?

	Andrew
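[ Editor's aside: Andrew's back-of-the-envelope objection can be made
concrete. A quick sketch, in Python purely for the arbitrary-precision
arithmetic; the independent-boolean assumption is his, and Jason's
reply qualifies it: ]

```python
import math

# Size of the configuration space if the 600 knobs were independent
# on/off switches (an oversimplification: the 600-800 figure spans
# 20 years of devices, and many knobs are not simple booleans).
space = 2 ** 600
print(len(str(space)))  # 181 decimal digits, i.e. ~4e180 combinations

# Trying one configuration per nanosecond since the Big Bang would
# cover only ~4e26 of them -- short by over 150 orders of magnitude,
# so nobody can have empirically found the 'optimal' configuration.
tested = int(13.8e9 * 365.25 * 24 * 3600 * 1e9)
print(round(math.log10(space / tested)))
```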
On Thu, Apr 04, 2024 at 08:44:56PM +0200, Andrew Lunn wrote:
> > > There are something like 600-800 configurables in mlx5
>
> I keep seeing that 600-800, and thinking that 2 to the power 600 is a
> very big number, assuming they are simple on/off configurations.
>
> Does any user actually have the 'optimal' configuration? How can they
> even know?

First, the big number is the union of all configurables on all devices
released over the past 20 years that work with the mlx5 driver. Looking
at a single device will see a much smaller number of configurables.

Second, much of it is not performance tuning. The site has a functional
workload they need to run and configurations need to be adjusted to
match. For instance there is a bunch that configure the UEFI option
rom. Various devices can PXE boot, iSCSI boot and so on.

There is other stuff to turn on/off functional features, which can
happen quite often because stuff just needs to be known at FLR time.

I'm sure there are things in a more classical tunable space, but it is
going to be a more workably small number than 600..

Jason
On 04/04/2024 21:25, Jason Gunthorpe wrote:
> For instance there is a bunch that configure the UEFI option
> rom. Various devices can PXE boot, iSCSI boot and so on.

Sounds like a perfect example of something that should have a
standardised cross-vendor configuration mechanism, rather than every
vendor having its own special tool as is sadly the case today.
On Thu, Apr 04, 2024 at 09:53:14PM +0100, Edward Cree wrote:
> On 04/04/2024 21:25, Jason Gunthorpe wrote:
> > For instance there is a bunch that configure the UEFI option
> > rom. Various devices can PXE boot, iSCSI boot and so on.
>
> Sounds like a perfect example of something that should have a
> standardised cross-vendor configuration mechanism, rather than
> every vendor having its own special tool as is sadly the case
> today.

Sure, I imagine exploring this kind of commonality in userspace with a
fwctl based tool. The kernel is the wrong layer to do it.

Jason
On Thu, Apr 04, 2024 at 06:35:04PM +0100, Edward Cree wrote:
> [ Again, I had better disassociate my employer from the vitriol below,
> for which I alone am responsible. ]
>
> On 02/04/2024 19:40, Jason Gunthorpe wrote:
> > Uh no. A lot of this configuration stuff is to serve niche customer
> > interests where the customers have no interest in publishing and
> > commonizing their method of operating because it would give away their
> > own operator secrets. The vendors created it because their big
> > customers demanded it.
> >
> > eg there are configurables in mlx5 that exist *solely* to accommodate
> > certain customer's pre-existing SW.
>
> So it's a single-user hack; why do you want support for it in upstream?
> Oh right, because you want to be able to ship single-user hacks to all
> your customers while still getting the _cachet_ of being An Intree
> Driver with the implied engineering benefits of open source — despite
> the actual feature implementations being obscured in firmware that's
> not subject to the review process and thus doesn't actually carry
> those benefits.

Uh no, mlx5 already has an excellent in-tree driver, thank you very
much. The configuration is not changing the driver, it is changing the
device.

Consider, I can ship unique devices pre-configured for each site's
special needs. They still work with the same inbox driver.

In fact that happens already for certain large customers. It is why
Jakub recently pointed out that Meta doesn't need any
provisioning/mlx5ctl/misc driver. They just buy different devices than
everyone else.

So, it is really some kind of extremism to say that allowing users to
configure the device in their own system in a booted Linux OS, instead
of in the factory, loses the "implied engineering benefits of open
source".

Further it is really rude and inappropriate to say that some customers
should not be able to enjoy in-tree drivers because of your aesthetic
opinion of hacks in a device's design.

> > There are something like 600-800 configurables in mlx5
>
> So because your engineers can't design clean, orthogonal APIs for
> toffee, you should be allowed to bypass review? Sorry, but no.

Overreach. The job of the kernel maintainer is to review the driver
software, not the device design.

> > Where is the screaming? Where has keeping blessed support out of
> > the kernel got us?
>
> Well, clearly *someone* wants you to supply them an in-tree driver,
> else you wouldn't be trying to upstream this.

We already have an in-tree driver for this access, it is built into
sysfs. Tooling exists, everything is deployed, people are happy.

If you had read the thread to understand the issue, you'd know this is
because the distros have turned on module signing, secure boot and
kernel lockdown. This disables the ability to have an OOT driver, it
disables the existing sysfs scheme, and it forces everything to be in
Linus's tree or it does not exist. It effectively breaks the long
existing ecosystem in this space.

Arguing that keeping things working as they have been working for the
last 10 years is somehow going to wreck open source or whatever is
absurd.

Jason
On Thu, 4 Apr 2024 15:33:05 -0300 Jason Gunthorpe wrote:
> Uh no, mlx5 already has an excellent in-tree driver, thank you very
> much. The configuration is not changing the driver, it is changing the
> device.
>
> Consider, I can ship unique devices pre-configured for each site's
> special needs. They still work with the same inbox driver.
>
> In fact that happens already for certain large customers. It is why
> Jakub recently pointed out that Meta doesn't need any
> provisioning/mlx5ctl/misc driver. They just buy different devices than
> everyone else.

> Further it is really rude and inappropriate to say that some customers
> should not be able to enjoy in-tree drivers because of your aesthetic
> opinion of hacks in a device's design.

To my knowledge the "customizations" are mostly around fitting into OCP
servers. Those unfamiliar with how hyperscalers operate can mentally
replace $hyperscaler with HP or Dell in your message. Minus all the
proprietary OOB management stuff those guys also provide.

> Overreach. The job of the kernel maintainer is to review the driver
> software, not the device design.

Agreed. I rarely if ever comment on what I think about device design.
Discussion is about mlx5ctl, not "the device", as much as you'd like
to equate the two.
On Thu, Apr 04, 2024 at 12:53:39PM -0700, Jakub Kicinski wrote:
> On Thu, 4 Apr 2024 15:33:05 -0300 Jason Gunthorpe wrote:
> > Uh no, mlx5 already has an excellent in-tree driver, thank you very
> > much. The configuration is not changing the driver, it is changing the
> > device.
> >
> > Consider, I can ship unique devices pre-configured for each site's
> > special needs. They still work with the same inbox driver.
> >
> > In fact that happens already for certain large customers. It is why
> > Jakub recently pointed out that Meta doesn't need any
> > provisioning/mlx5ctl/misc driver. They just buy different devices than
> > everyone else.
>
> > Further it is really rude and inappropriate to say that some customers
> > should not be able to enjoy in-tree drivers because of your aesthetic
> > opinion of hacks in a device's design.
>
> To my knowledge the "customizations" are mostly around fitting into OCP
> servers.

Nope. I understand it is significant. If Meta had to work with a COTS
environment like a HP/Dell customer then Meta would have a list of
flash configurables to set. I think you greatly underestimate the
privilege of being at a hyperscaler and having vendors create custom
products just for you..

> Those unfamiliar with how hyperscalers operate can mentally
> replace $hyperscaler with HP or Dell in your message. Minus all the
> proprietary OOB management stuff those guys also provide.

A significant Dell customer will get a server pre-populated with a NIC
with some generic Dell configuration. In most cases the customer will
have to then program the flash to match their needs. Set a specific FW
version, set site specific configurables, etc.

Similar to how a Dell customer will have to change the BIOS settings
in the Dell to match their needs.

> Discussion is about mlx5ctl, not "the device", as much as you'd like
> to equate the two.

I view mlx5ctl/fwctl as a window into the device.

Jason
On Thu, 4 Apr 2024 17:44:54 -0300 Jason Gunthorpe wrote:
> > To my knowledge the "customizations" are mostly around fitting into OCP
> > servers.
>
> Nope. I understand it is significant. If Meta had to work with a COTS
> environment like a HP/Dell customer then Meta would have a list of
> flash configurables to set. I think you greatly underestimate the
> privilege of being at a hyperscaler and having vendors create custom
> products just for you..
>
> > Those unfamiliar with how hyperscalers operate can mentally
> > replace $hyperscaler with HP or Dell in your message. Minus all the
> > proprietary OOB management stuff those guys also provide.
>
> A significant Dell customer will get a server pre-populated with a NIC
> with some generic Dell configuration. In most cases the customer will
> have to then program the flash to match their needs. Set a specific FW
> version, set site specific configurables, etc.
>
> Similar to how a Dell customer will have to change the BIOS settings
> in the Dell to match their needs.

I can only guess that you are again thinking about RDMA/HPC.
Flashing tunables is not a workable solution for extremely varied
and ephemeral TCP/IP workloads :|
On Thu, Apr 04, 2024 at 02:34:07PM -0700, Jakub Kicinski wrote:
> On Thu, 4 Apr 2024 17:44:54 -0300 Jason Gunthorpe wrote:
> > > To my knowledge the "customizations" are mostly around fitting into OCP
> > > servers.
> >
> > Nope. I understand it is significant. If Meta had to work with a COTS
> > environment like a HP/Dell customer then Meta would have a list of
> > flash configurables to set. I think you greatly underestimate the
> > privilege of being at a hyperscaler and having vendors create custom
> > products just for you..
> >
> > > Those unfamiliar with how hyperscalers operate can mentally
> > > replace $hyperscaler with HP or Dell in your message. Minus all the
> > > proprietary OOB management stuff those guys also provide.
> >
> > A significant Dell customer will get a server pre-populated with a NIC
> > with some generic Dell configuration. In most cases the customer will
> > have to then program the flash to match their needs. Set a specific FW
> > version, set site specific configurables, etc.
> >
> > Similar to how a Dell customer will have to change the BIOS settings
> > in the Dell to match their needs.
>
> I can only guess that you are again thinking about RDMA/HPC.
> Flashing tunables is not a workable solution for extremely varied
> and ephemeral TCP/IP workloads :|

As I answered to Andrew, a lot is functional behavior, not so much
"tunables". The same way many BIOS settings are not all tunables but
have functional impacts to the machine. Like enabling SRIOV, for
instance.

Even for dataplane tunables - you know there are micro-architectural
performance tunables set in the special Meta NICs that are wired just
for Meta's special use case? Apparently that is actually perfectly
workable.

It is really strange to hear you act like "Meta doesn't need
provisioning or tuning" when the NIC Meta uses is *highly* customized
specifically for Meta to the point it is an entirely different
product. Of course you don't need provisioning; a lot of other people
did a lot of hard work to make it that way.

So please don't use that as a justification to pull up the ladder so
nobody else can enjoy even a semi-customized device.

Jason
On Fri, 5 Apr 2024 08:13:06 -0300 Jason Gunthorpe wrote:
> As I answered to Andrew, a lot is functional behavior, not so much
> "tunables". The same way many BIOS settings are not all tunables but
> have functional impacts to the machine. Like enabling SRIOV, for
> instance.

Thanks, SRIOV is a great example:
https://docs.kernel.org/next/networking/devlink/devlink-params.html#id2
Literally the first devlink param on the list.
"We will flash it for you" seems to be what vendors like to offer.

> Even for dataplane tunables - you know there are micro-architectural
> performance tunables set in the special Meta NICs that are wired just
> for Meta's special use case? Apparently that is actually perfectly
> workable.

The only "tunables" I'm aware of were for the OCP Yosemite platform,
which is an interesting beast with 4 hosts plugged into one NIC,
and constrained PCIe BW. Which is why I said the "tunables" are really
about the server platform not being off the shelf. Updating NIC FW
to fix server compatibility is hardly unusual.

> It is really strange to hear you act like "Meta doesn't need
> provisioning or tuning" when the NIC Meta uses is *highly* customized
> specifically for Meta to the point it is an entirely different
> product. Of course you don't need provisioning; a lot of other people
> did a lot of hard work to make it that way.

:) When you say *highly* I think I know what you mean :)
It'd be unprofessional for us to discuss here, and I really doubt
you actually want to air that laundry publicly :) :)

> So please don't use that as a justification to pull up the ladder so
> nobody else can enjoy even a semi-customized device.

So in this thread I'm pulling up the ladder and in the fbnic one I'm
not (as I hope you'd agree)? One could hopefully be forgiven for
wondering to what extent your assessment of my intentions is colored
by whether they align with your particular goals :(
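[ Editor's aside: the generic devlink parameter Jakub points to is
driven from the standard iproute2 `devlink` tool. A sketch of what the
cross-vendor path looks like; the PCI address is illustrative, and
which cmodes (`runtime`/`driverinit`/`permanent`) are supported varies
by driver: ]

```shell
# Show the generic enable_sriov parameter via the cross-vendor
# devlink API (requires a devlink-capable NIC and iproute2).
devlink dev param show pci/0000:03:00.0 name enable_sriov

# Set it. 'driverinit' cmode takes effect on the next devlink reload;
# 'permanent' writes the device's NVRAM/flash configuration.
devlink dev param set pci/0000:03:00.0 name enable_sriov \
        value true cmode driverinit
devlink dev reload pci/0000:03:00.0
```

This is a configuration fragment, not a runnable script: it needs real
hardware, and the contrast being debated is between this narrow generic
interface and the hundreds of device-specific knobs it does not cover.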
On Fri, Apr 05, 2024 at 08:38:27AM -0700, Jakub Kicinski wrote:
> On Fri, 5 Apr 2024 08:13:06 -0300 Jason Gunthorpe wrote:
> > As I answered to Andrew, a lot is functional behavior, not so much
> > "tunables". The same way many BIOS settings are not all tunables but
> > have functional impacts to the machine. Like enabling SRIOV, for
> > instance.
>
> Thanks, SRIOV is a great example:
> https://docs.kernel.org/next/networking/devlink/devlink-params.html#id2
> Literally the first devlink param on the list.

It is too basic to be really usable, unfortunately - recall my earlier
remarks that a site needs to configure everything, not just the 2
SRIOV related values in that table. Those are nice generic ones, but
others are not.

> The only "tunables" I'm aware of were for the OCP Yosemite platform,
> which is an interesting beast with 4 hosts plugged into one NIC,
> and constrained PCIe BW. Which is why I said the "tunables" are really
> about the server platform not being off the shelf. Updating NIC FW
> to fix server compatibility is hardly unusual.

I don't think it is appropriate to go into details of all the stuff
that happens in the commercial relationship between Meta and NVIDIA..
There is lots of history there.

My main point is, for others reading these threads, that taking a COTS
device like mlx5 and essentially forking it for a single user's
special requirements is pretty much standard operating procedure now.
Some people enjoy this with custom devices and custom FW, some people
have to run standard FW and customize it at provisioning time.

Edward accused this all of being "hacks", but I strongly disagree.
Having device specific customization and parameters that make sense
for the device architecture is not a hack. Functional changes are not
the same as weird performance tunables.

> > So please don't use that as a justification to pull up the ladder so
> > nobody else can enjoy even a semi-customized device.
>
> So in this thread I'm pulling up the ladder and in the fbnic one I'm
> not (as I hope you'd agree)? One could hopefully be forgiven for
> wondering to what extent your assessment of my intentions is colored
> by whether they align with your particular goals :(

I'm not sure what your position is on fbnic TBH. I said mine, I think
it would be fine to merge that series - but I'm pretty moderate in my
views of what should be accepted to Linux for ideological reasons.

Jason
On Fri, 5 Apr 2024 08:38:27 -0700 Jakub Kicinski wrote:
> > It is really strange to hear you act like "Meta doesn't need
> > provisioning or tuning" when the NIC Meta uses is *highly* customized
> > specifically for Meta to the point it is an entirely different
> > product. Of course you don't need provisioning; a lot of other people
> > did a lot of hard work to make it that way.
>
> :) When you say *highly* I think I know what you mean :)
> It'd be unprofessional for us to discuss here, and I really doubt
> you actually want to air that laundry publicly :) :)

Maybe that's an unnecessary air of mystery. Long ago there was a
concern about the impact of the rapidly developing eswitch offload
market(?) on FW stability, so a requirement was put forward to
*compile out* major unused FW features. That requirement is no longer
in place (or fulfilled), largely due to my support.

I wish I could back that up by referring to the OCP NIC SW spec, but
it's been stuck in "legal review" at one of the vendors for months.
I'd like to ask that vendor not to pull up the ladder and let everyone
else enjoy access to NIC requirements and recommendations from Meta
and Google.
On Fri, Apr 05, 2024 at 10:48:29AM -0700, Jakub Kicinski wrote:
> I wish I could back that up by referring to the OCP NIC SW spec, but
> it's been stuck in "legal review" at one of the vendors for months.
> I'd like to ask that vendor not to pull up the ladder and let everyone
> else enjoy access to NIC requirements and recommendations from Meta
> and Google.

Heh. Building standards is hard. It is a delicate game of politics to
convince everyone that they are getting some kind of win-win-win,
especially when IP and competitive concerns become involved. Many
attempts fail while still immature; it is why there is so much value
in the successful ones.

Jason
. disclaimer.sh

On 04/04/2024 19:33, Jason Gunthorpe wrote:
> Uh no, mlx5 already has an excellent in-tree driver, thank you very
> much.

I was referring to *mlx5ctl*, which is not currently in-tree, which is
why this thread trying to add it exists in the first place.

> So, it is really some kind of extremism to say that allowing users to
> configure the device in their own system in a booted Linux OS instead

Um, nothing upstream does is stopping them installing an OOT mlx5ctl
driver, *if* that's what they want to do. Clearly some of them don't
like that solution, otherwise we wouldn't be here.

> of in the factory loses the "implied engineering benefits of open
> source".

You're looking at the wrong point on the causal graph.
Whether you apply the hacks in the factory or at runtime, their hacky
design is what prevents them from having the benefits of open source
(because those benefits consist of a development process that weeds
out hacks and bad design). It is just that the latter case, if done
through an in-tree driver, would appear to be (and might be marketed
as) an open-source developed product, which users would naturally
expect to have those benefits, when in fact it doesn't.

>> So because your engineers can't design clean, orthogonal APIs for
>> toffee, you should be allowed to bypass review? Sorry, but no.
>
> Overreach. The job of the kernel maintainer is to review the driver
> software, not the device design.

Really? [1]
The kernel always has the choice to not support a given device, or a
given feature of a device; and kernel maintainers are precisely the
people with both the right and the duty to make that determination.

> If you had read the thread to understand the issue, you'd know this is
> because the distros have turned on module signing, secure boot and
> kernel lockdown.

Funnily enough, I am aware of that. And if your customers didn't want
those things, they'd be quite capable of forking the distro to undo
it. Several hyperscalers already have their own in-house distros
anyway. They could add their own signing key to the kernel, and sign
your ctl driver with it. They could disable lockdown, or patch the
kernel to allow your (hopefully signed and IMA-ed) userspace tool to
do its PCI-over-sysfs crap even when lockdown blocks it for everything
else.

But strangely, there are people out there who trust the upstream
process to ensure quality/security/etc. more than they trust vendors
and their "oh don't worry, the device will enforce security scopes on
these raw commands from userspace" magic firmware blobs.
What you are asking for here is a special exemption to all those
requirements and security measures because "just trust me bro".

-ed

[1]: https://wiki.linuxfoundation.org/networking/toe
On Thu, Apr 04, 2024 at 08:31:24PM +0100, Edward Cree wrote:
> > of in the factory loses the "implied engineering benefits of open
> > source".
>
> You're looking at the wrong point on the causal graph.
> Whether you apply the hacks in the factory or at runtime, their hacky
> design is what prevents them from having the benefits of open source
> (because those benefits consist of a development process that weeds
> out hacks and bad design).

Oh please. They are not hacks; you are inventing this idea entirely
out of thin air based on your own lack of detailed knowledge or
personal aesthetic preferences.

> > Overreach. The job of the kernel maintainer is to review the driver
> > software, not the device design.
>
> Really? [1]

Yep. TOEs are supported in the RDMA stack. The only thing the kernel
community decided was that a TOE cannot be accessed using the socket
API; the HW is still supported.

> But strangely, there are people out there who trust the upstream process
> to ensure quality/security/etc. more than they trust vendors and their
> "oh don't worry, the device will enforce security scopes on these raw
> commands from userspace" magic firmware blobs.

The upstream Linux community has already decided trusting firmware is
acceptable. I gave many examples in the fwctl document if you'd care
to understand it.

And of course we get the usual garbage solution "oh just fork linux",
which *completely misses the point* of what Linux really is or what
the community is fundamentally seeking to achieve.

> What you are asking for here is a special exemption to all those
> requirements and security measures because "just trust me bro".

Actually no, fwctl is following accepted kernel design principles with
many prior examples.

Jason
On 4/2/24 12:40 PM, Jason Gunthorpe wrote:
> and, as Jakub pointed out, a second argument about who gets to have
> power in our community.

I think that is the key summary of this disagreement: this is about
power over what vendors are allowed to do with in-tree drivers.

Jakub is taking the stance that because the devices have use cases
with the networking stack, he (as a maintainer of netdev) gets a say
on how those devices are accessed across the entire Linux code base
and its various use cases. That the device can be used without a
netdev seems to be irrelevant.
On Tue, Mar 26, 2024 at 08:57:29AM -0600, David Ahern wrote:
> On 3/22/24 4:40 PM, Jakub Kicinski wrote:
> > On Fri, 22 Mar 2024 15:18:09 -0600 David Ahern wrote:
> > > can you respond to Jason's email with the proposal for the new fwctl
> > > subsystem and identify places you do not agree? That would provide more
> > > concrete discussion points. Thanks,
> >
> > Respond in what way, David? Comment on technical aspects of whether
> > a common class device with a discovery mechanism and a sprinkling of
> > semantically meaningless fields can be implemented? Some trivial object
> > hierarchy?
> >
> > On whether someone can actually enforce any of the 4 "don't"s, and
> > whether this interface is basically encouraging and giving a leg up
> > to those willing to be dishonest?
> >
> > Or should we go for another loop of me talking about openness and
> > building common abstractions, and vendors saying how their way of
> > doing basic configuration is so very special, and this is just for
> > debug and security and because others.
> >
> > There's absolutely no willingness to try and build a common interface
> > here.
>
> The proposal is an attempt at a common interface and common tooling to a
> degree but independent of any specific subsystem of which many are
> supported by the device.
>
> Your responses continue to align with the notion that because the device
> can spit out ethernet frames, all diagnostics, debugging, configuration,
> etc. MUST go through networking APIs.
>
> You seem unwilling to acknowledge that devices can work for various use
> cases without a netdev driver, and thus aspects of managing that device
> should be done outside of a netdev driver.

The HNS driver is a good example of such a device. It has nothing to
do with netdev and needs a common and reliable way to configure FW.
https://lore.kernel.org/all/20240308105443.1130283-1-huangjunxian6@hisilicon.com/
https://lore.kernel.org/all/20240322163201.GF66976@ziepe.ca/

Thanks
On Mon, 1 Apr 2024 15:30:03 +0300 Leon Romanovsky wrote:
> > The proposal is an attempt at a common interface and common tooling to a
> > degree but independent of any specific subsystem of which many are
> > supported by the device.
> >
> > Your responses continue to align with the notion that because the device
> > can spit out ethernet frames, all diagnostics, debugging, configuration,
> > etc. MUST go through networking APIs.
> >
> > You seem unwilling to acknowledge that devices can work for various use
> > cases without a netdev driver, and thus aspects of managing that device
> > should be done outside of a netdev driver.
>
> The HNS driver is a good example of such a device. It has nothing to
> do with netdev and needs a common and reliable way to configure FW.

Sorry, I have a completely different reading of that thread.
Thanks for bringing it up, tho.

As I said multiple times I agree that configuring custom parameters
in RDMA is a necessity. Junxian's approach of putting such code in
the RDMA driver / subsystem is more than reasonable. Even better,
it looks like the API is fairly narrowly defined.
On Mon, Apr 01, 2024 at 07:50:03AM -0700, Jakub Kicinski wrote:
> On Mon, 1 Apr 2024 15:30:03 +0300 Leon Romanovsky wrote:
> > > The proposal is an attempt at a common interface and common tooling to a
> > > degree but independent of any specific subsystem of which many are
> > > supported by the device.
> > >
> > > Your responses continue to align with the notion that because the device
> > > can spit out ethernet frames, all diagnostics, debugging, configuration,
> > > etc. MUST go through networking APIs.
> > >
> > > You seem unwilling to acknowledge that devices can work for various use
> > > cases without a netdev driver, and thus aspects of managing that device
> > > should be done outside of a netdev driver.
> >
> > The HNS driver is a good example of such a device. It has nothing to
> > do with netdev and needs a common and reliable way to configure FW.
>
> Sorry, I have a completely different reading of that thread.
> Thanks for bringing it up, tho.
>
> As I said multiple times I agree that configuring custom parameters
> in RDMA is a necessity. Junxian's approach of putting such code in
> the RDMA driver / subsystem is more than reasonable. Even better,
> it looks like the API is fairly narrowly defined.

Uh, if I understand netdev rules, aren't read/write sysfs files
created from drivers banned? So reasonable for RDMA but unacceptable
to netdev? My brain hurts.

FWIW, I've been trying to push RDMA away from driver-created sysfs for
a while now. Aside from the API complexity, implementations have
messed up using the sysfs APIs and resulted in some significant
problems :(

Jason
On Tue, 2 Apr 2024 15:45:54 -0300 Jason Gunthorpe wrote:
> On Mon, Apr 01, 2024 at 07:50:03AM -0700, Jakub Kicinski wrote:
> > On Mon, 1 Apr 2024 15:30:03 +0300 Leon Romanovsky wrote:
> > > The HNS driver is a good example of such a device. It has nothing to
> > > do with netdev and needs a common and reliable way to configure FW.
> >
> > Sorry, I have a completely different reading of that thread.
> > Thanks for bringing it up, tho.
> >
> > As I said multiple times I agree that configuring custom parameters
> > in RDMA is a necessity. Junxian's approach of putting such code in
> > the RDMA driver / subsystem is more than reasonable. Even better,
> > it looks like the API is fairly narrowly defined.
>
> Uh, if I understand netdev rules, aren't read/write sysfs files
> created from drivers banned?

Neither is that true as an absolute "netdev rule" nor relevant to the
discussion.

> So reasonable for RDMA but unacceptable to netdev?

I don't know or care what interface guidance you provide. What I
called reasonable is putting that code in the RDMA driver / subsystem.

> My brain hurts.

Maybe brains are better suited for understanding what other people
say, rather than twisting and misinterpreting..

> FWIW, I've been trying to push RDMA away from driver-created sysfs for
> a while now. Aside from the API complexity, implementations have
> messed up using the sysfs APIs and resulted in some significant
> problems :(

Sure, agreed, but off-topic.
On Tue, Apr 02, 2024 at 02:36:07PM -0700, Jakub Kicinski wrote:
> > FWIW, I've been trying to push RDMA away from driver-created sysfs for
> > a while now. Aside from the API complexity, implementations have
> > messed up using the sysfs APIs and resulted in some significant
> > problems :(
>
> Sure, agreed, but off-topic.

It is not - I don't want a huge amount of sysfs in drivers to replace
what fwctl will do, for pretty solid technical reasons.

I do object, as snarky as I was, to you saying RDMA should take on a
whole bunch of read/write sysfs in drivers that netdev would not
accept as a "reasonable" direction.

Jason
On Tue, 2 Apr 2024 19:46:32 -0300 Jason Gunthorpe wrote:
> On Tue, Apr 02, 2024 at 02:36:07PM -0700, Jakub Kicinski wrote:
> > > FWIW, I've been trying to push RDMA away from driver-created sysfs for
> > > a while now. Aside from the API complexity, implementations have
> > > messed up using the sysfs APIs and resulted in some significant
> > > problems :(
> >
> > Sure, agreed, but off-topic.
>
> It is not - I don't want a huge amount of sysfs in drivers to replace
> what fwctl will do, for pretty solid technical reasons.
>
> I do object, as snarky as I was, to you saying RDMA should take on a
> whole bunch of read/write sysfs in drivers that netdev would not
> accept as a "reasonable" direction.

Repeating myself a bit - I'm not saying you should take the sysfs
patches. Just that RDMA configuration belongs in the RDMA subsystem.

I don't want to suggest another solution, because frankly, given our
"no direct FW interface exposure", I don't have much experience
maintaining such APIs upstream. It'd be vain of me to make
suggestions.

Also I don't want to sound like I'm giving you my "blessing" to do
whatever, since my personal beliefs(?) remain unchanged. But they
carry little weight when netdevs or traffic which ends up in netdev
are not involved.
On Tue, 2 Apr 2024 16:21:06 -0700 Jakub Kicinski wrote:
> they carry little weight when netdevs or traffic which ends up in
> netdev are not involved.

Clarification before this goes into "Leon's quote book":
netdev == code base, traffic == network "packets",
as opposed to the mailing list and emails.
On Tue, Apr 02, 2024 at 05:15:35PM -0700, Jakub Kicinski wrote:
> On Tue, 2 Apr 2024 16:21:06 -0700 Jakub Kicinski wrote:
> > they carry little weight when netdevs or traffic which ends up in
> > netdev are not involved.
>
> Clarification before this goes into "Leon's quote book"

I don't have one; should I start writing it?
And sorry for listening to and remembering what people say.

Thanks
On Mon, Apr 01, 2024 at 07:50:03AM -0700, Jakub Kicinski wrote:
> On Mon, 1 Apr 2024 15:30:03 +0300 Leon Romanovsky wrote:
> > > The proposal is an attempt at a common interface and common tooling to a
> > > degree but independent of any specific subsystem of which many are
> > > supported by the device.
> > >
> > > Your responses continue to align with the notion that because the device
> > > can spit out ethernet frames, all diagnostics, debugging, configuration,
> > > etc. MUST go through networking APIs.
> > >
> > > You seem unwilling to acknowledge that devices can work for various use
> > > cases without a netdev driver, and thus aspects of managing that device
> > > should be done outside of a netdev driver.
> >
> > The HNS driver is a good example of such a device. It has nothing to
> > do with netdev and needs a common and reliable way to configure FW.
>
> Sorry, I have a completely different reading of that thread.
> Thanks for bringing it up, tho.

Different people have different opinions, and it is fine.

> As I said multiple times I agree that configuring custom parameters
> in RDMA is a necessity. Junxian's approach of putting such code in
> the RDMA driver / subsystem is more than reasonable. Even better,
> it looks like the API is fairly narrowly defined.

It was a tiny example, which emphasizes the need for a common way.

If we were to listen to the average RDMA driver author, we would find
ourselves with a gazillion different sysfs knobs which do nothing
except send raw data to FW. As a subsystem, we don't want to waste
our time on code that is not beneficial to the subsystem.

Thanks
On Mon, 1 Apr 2024 21:10:33 +0300 Leon Romanovsky wrote: > > Sorry, I have a completely different reading of that thread. > > Thanks for bringing it up, tho. > > Different people have different opinions, and it is fine. Agreed! I think the core of our disagreement is whether and when one set of people get to force their set of choices on other people. Far be it from me to try to force my opinions in RDMA or the kernel as a whole. > > As I said multiple times I agree that configuring custom parameters > > in RDMA is a necessity. Junxian's approach of putting such code in > > the RDMA driver / subsystem is more than reasonable. Even better, > > it looks like the API is fairly narrowly defined. > > It was a tiny example, which emphasizes the need for a common way. > > If we were listen to average RDMA driver author, we would find ourselves > with gazillion different sysfs knobs which do nothing except sending > raw data to FW. As a subsystem, we don't want to waste our time in > not-beneficial to the subsystem code. No disagreement here either, there's a trade-off here between what you call waste of time and what I'd call fostering organic collaboration. Even without taking your priorities into account - whether reviewing device APIs is beneficial is (a) subjective, (b) subsystem dependent, so you should be allowed to make your choice (within the bounds of Linus's trust). But what I keep arguing is that so should we :|
On Mon, Apr 01, 2024 at 12:04:20PM -0700, Jakub Kicinski wrote: > On Mon, 1 Apr 2024 21:10:33 +0300 Leon Romanovsky wrote: > > > Sorry, I have a completely different reading of that thread. > > > Thanks for bringing it up, tho. > > > > Different people have different opinions, and it is fine. > > Agreed! I think the core of our disagreement is whether and when > one set of people get to force their set of choices on other people. > Far be it from me to try to force my opinions in RDMA or the kernel > as a whole. > > > > As I said multiple times I agree that configuring custom parameters > > > in RDMA is a necessity. Junxian's approach of putting such code in > > > the RDMA driver / subsystem is more than reasonable. Even better, > > > it looks like the API is fairly narrowly defined. > > > > It was a tiny example, which emphasizes the need for a common way. > > > > If we were listen to average RDMA driver author, we would find ourselves > > with gazillion different sysfs knobs which do nothing except sending > > raw data to FW. As a subsystem, we don't want to waste our time in > > not-beneficial to the subsystem code. > > No disagreement here either, there's a trade-off here between what > you call waste of time and what I'd call fostering organic > collaboration. Even without taking your priorities into account - > whether reviewing device APIs is beneficial is (a) subjective, > (b) subsystem dependent, so you should be allowed to make your choice > (within the bounds of Linus's trust). But what I keep arguing is that > so should we :| Internal device API is a different story, and I assure you that we are very close in our views here. Probably the main difference is that I'm very lax for the first user and very strict for the second one, while you are stricter than me even for the first one. I gave an example from HNS of an API to configure FW without any subsystem involvement, and there is no real way to push for collaboration here without standardization. Thanks
On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote: > There is a clear split in my mind between: > - inspection debugging > - invasive mutating debugging > - configuration Yes there's a clear split, and how are you going to enforce it on an opaque interface? Put an "evil" bit in the common header? > And maybe "invasive mutating debugging" taints the kernel or something > like that.
On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote: > On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote: > > There is a clear split in my mind between: > > - inspection debugging > > - invasive mutating debugging > > - configuration > > Yes there's a clear split, and how are you going to enforce it on > an opaque interface? Put an "evil" bit in the common header? The interface is opaque through a subsystem; that doesn't mean it is completely opaque to every driver layer in the kernel. There is still a HW specific kernel driver that delivers the FW command to the actual HW. In the mlx5 model the kernel driver stamps the command with "uid" which is effectively a security scope label. This cannot be avoided by userspace and is fundamental to why mlx5ctl is secure in a lockdown kernel. For example mlx5's FW interface has the concept of security scopes. We have several defined today: - Kernel - Userspace rdma - Userspace rdma with CAP_NET_RAW - Userspace rdma with CAP_SYS_RAWIO So we trivially add three more for the scopes I listed above. The mlx5ctl driver as posted already introduced a new scope, for example. Userspace will ask the kernel for an appropriate security scope after opening the char-device. If userspace asks for invasive then you get a taint. Issuing an invasive command without a kernel-applied invasive security label will be rejected by the FW. We trust the kernel to apply the security label for the origin of the command. We trust the device FW to implement security scopes, because these are RDMA devices and all of RDMA and all of SRIOV virtualization are totally broken if the device FW cannot be trusted to maintain security separation between scopes. Jason
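The scope model Jason describes above can be sketched in a few lines of C. Everything here is illustrative: the enum values, the struct layout, and the function names are stand-ins, not the actual mlx5 FW interface (which encodes the label in a device-specific "uid" field), and real scopes are not a simple total order — this sketch treats them as one for simplicity.

```c
#include <stdbool.h>

/* Hypothetical security scopes, ordered by increasing privilege.
 * The real mlx5 FW labels are not a linear hierarchy. */
enum scope {
    SCOPE_RDMA_USER,     /* plain userspace rdma */
    SCOPE_DEBUG_READ,    /* inspection debugging (mlx5ctl-style) */
    SCOPE_DEBUG_WRITE,   /* invasive mutating debugging */
    SCOPE_KERNEL,        /* commands issued by the kernel driver itself */
};

struct fw_cmd {
    int opcode;
    enum scope label;    /* stamped by the kernel; userspace cannot set it */
};

/* Kernel side: every command is stamped with the scope granted to the
 * file handle at open() time.  Userspace never writes 'label' directly,
 * which is the whole point of the model. */
static void kernel_stamp(struct fw_cmd *cmd, enum scope granted)
{
    cmd->label = granted;
}

/* FW side (modeled): each opcode declares the minimum scope it needs;
 * the FW rejects commands whose kernel-applied label is insufficient. */
static bool fw_check_cmd(const struct fw_cmd *cmd, enum scope required)
{
    return cmd->label >= required;
}
```

The key property is that the check happens against a label the kernel attached, so a locked-down userspace cannot escalate by crafting the command buffer itself.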
On Wed, 14 Feb 2024 14:37:55 -0400 Jason Gunthorpe wrote: > On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote: > > On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote: > > > There is a clear split in my mind between: > > > - inspection debugging > > > - invasive mutating debugging > > > - configuration > > > > Yes there's a clear split, and how are you going to enforce it on > > an opaque interface? Put an "evil" bit in the common header? > > The interface is opaque through a subsystem, it doesn't mean it is > completely opaque to every driver layer in the kernel. There is still a > HW specific kernel driver that delivers the FW command to the actual > HW. > > In the mlx5 model the kernel driver stamps the command with "uid" > which is effectively a security scope label. This cannot be avoided by > userspace and is fundamental to why mlx5ctl is secure in a lockdown > kernel. > > For example mlx5's FW interface has the concept of security scopes. We > have several defined today: > - Kernel > - Userspace rdma > - Userspace rdma with CAP_NET_RAW > - Userspace rdma with CAP_SYS_RAWIO > > So we trivally add three more for the scopes I listed above. The > mlx5ctl driver as posted already introduced a new scope, for example. > > Userspace will ask the kernel for an appropriate security scope after > opening the char-device. If userspace asks for invasive then you get a > taint. Issuing an invasive command without a kernel applied invasive > security label will be rejected by the FW. > > We trust the kernel to apply the security label for the origin of the > command. We trust the the device FW to implement security scopes, > because these are RDMA devices and all of RDMA and all of SRIOV > virtualization are totally broken if the device FW cannot be trusted > to maintain security separation between scopes. You have changed the argument. 
The problem Andy was raising is that users having access to low-level configuration will make it impossible for a distro's support team to tell what the device configuration is. There won't be any trace of activity at the OS level. To which you replied that you can differentiate between debugging and configuration on an opaque interface, _in the kernel_. Which I disagree with, obviously. And now you're saying that you can maintain security if you trust the firmware to enforce some rules. I'm not talking about security here; the evil bit is just an example of an unsound design.
On Thu, Feb 15, 2024 at 05:40:34PM -0800, Jakub Kicinski wrote: > On Wed, 14 Feb 2024 14:37:55 -0400 Jason Gunthorpe wrote: > > On Wed, Feb 14, 2024 at 10:11:26AM -0800, Jakub Kicinski wrote: > > > On Wed, 14 Feb 2024 13:57:35 -0400 Jason Gunthorpe wrote: > > > > There is a clear split in my mind between: > > > > - inspection debugging > > > > - invasive mutating debugging > > > > - configuration > > > > > > Yes there's a clear split, and how are you going to enforce it on > > > an opaque interface? Put an "evil" bit in the common header? > > > > The interface is opaque through a subsystem, it doesn't mean it is > > completely opaque to every driver layer in the kernel. There is still a > > HW specific kernel driver that delivers the FW command to the actual > > HW. > > > > In the mlx5 model the kernel driver stamps the command with "uid" > > which is effectively a security scope label. This cannot be avoided by > > userspace and is fundamental to why mlx5ctl is secure in a lockdown > > kernel. > > > > For example mlx5's FW interface has the concept of security scopes. We > > have several defined today: > > - Kernel > > - Userspace rdma > > - Userspace rdma with CAP_NET_RAW > > - Userspace rdma with CAP_SYS_RAWIO > > > > So we trivally add three more for the scopes I listed above. The > > mlx5ctl driver as posted already introduced a new scope, for example. > > > > Userspace will ask the kernel for an appropriate security scope after > > opening the char-device. If userspace asks for invasive then you get a > > taint. Issuing an invasive command without a kernel applied invasive > > security label will be rejected by the FW. > > > > We trust the kernel to apply the security label for the origin of the > > command. We trust the the device FW to implement security scopes, > > because these are RDMA devices and all of RDMA and all of SRIOV > > virtualization are totally broken if the device FW cannot be trusted > > to maintain security separation between scopes. 
> > You have changed the argument. I explained how the technical bits of a part work; you clipped out my answer to Andy's concern. > The problem Andy was raising is that users having access to low level > configuration will make it impossible for distro's support to tell > device configuration. There won't be any trace of activity at the OS > level. I responded to that by saying the answer is to have robust dumping of the device configuration, and suggested a taint bit if changes are made to the device outside that support envelope that can't be captured in the dumps. The first part is what everyone already does. There is some supported configuration in flash and there are tools to dump and inspect this. The field teams understand they need to look at that, and existing data collection tools already capture this stuff. I don't think we have a real problem here. The step beyond what Andy was talking about is the hypothetical "what if you touch random unsafe registers directly or something", which probably shouldn't be allowed on a lockdown kernel, but assuming a device could do so safely, my answer was to trigger a taint. Then you asked how do you trigger a taint if the kernel doesn't parse the commands. > To which you replied that you can differentiate between debugging and > configuration on an opaque interface, _in the kernel_. I did not say "_in the kernel_" meaning the kernel would do it, I meant the kernel would ensure it is done. The kernel delegates the differentiation to the FW and it trusts the FW to do that work for it. > Which I disagree with, obviously. I don't know why. How is it obvious? > And now you're saying that you can maintain security if you trust > the firmware to enforce some rules. Right. The userspace sends a command. The kernel tags it with the permission the userspace has. The FW parses the command and checks the kernel-supplied permission against the command content and permits it. 
We trust the FW to do the restriction on behalf of the kernel. The restriction we are talking about is containing the userspace to only operate within the distro's support envelope. The kernel can ask the FW to enforce that rule as a matter of security and trust the FW to do so. Jason
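The flow argued over in this exchange — userspace requests a scope, the kernel records a taint for invasive access and labels every subsequent command, and the FW enforces the label — could be modeled roughly as below. The names (`ctl_dev`, `request_scope`, `fw_permits`) and the boolean taint flag are hypothetical stand-ins for illustration; a real driver would call `add_taint()` and the real check lives in device firmware, not kernel C.

```c
#include <stdbool.h>

enum scope { SCOPE_NONE, SCOPE_INSPECT, SCOPE_INVASIVE };

struct ctl_dev {
    enum scope granted;   /* set once per open file, kernel-controlled */
    bool tainted;         /* stands in for add_taint() in a real driver */
};

/* Userspace asks for a scope after opening the char-device.  The kernel
 * grants it and records a taint when invasive debugging is requested,
 * so the OS keeps a trace even though the FW parses the commands. */
static int request_scope(struct ctl_dev *dev, enum scope requested)
{
    if (requested == SCOPE_INVASIVE)
        dev->tainted = true;   /* kernel-visible record of mutation */
    dev->granted = requested;
    return 0;
}

/* FW side (modeled): every command issued through this file carries
 * dev->granted; invasive opcodes are rejected unless the kernel-applied
 * label says SCOPE_INVASIVE. */
static bool fw_permits(const struct ctl_dev *dev, bool invasive_opcode)
{
    return !invasive_opcode || dev->granted == SCOPE_INVASIVE;
}
```

This splits the responsibilities as Jason describes them: the kernel owns the label and the taint record, while the content-level differentiation between inspection and mutation is delegated to the FW.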
On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote: > With my busy kernel contributor head on I have to voice my > dissatisfaction with the subsystem maintainer overreach that's causing > the troubles here. Overreach is unfortunate, I'd love to say "please do merge it as part of RDMA". You probably don't trust my opinion but Jason admitted himself this is primarily for RDMA. RDMA is what it is in terms of openness, with all vendors trying to sell their secret magic sauce. The problem is that some RDMA stuff is built really closely on TCP, and given Jason's and co. inability to understand that good fences make good neighbors it will soon start getting into the netdev stack :| Ah, and I presume they may also want it for their DOCA products. So 80% RDMA, 15% DOCA, 5% the rest is my guess. > I think all maintainers can and should voice the > opinions, be those technical or political, but trying to block a useful > feature without lots of precedence because it is vaguely related to the > subsystem is not helpful. Not sure what you mean by "without lots of precedence" but you can ask around netdev. We have nacked such interfaces multiple times. The best proof that the rule exists and is well established is that Saeed himself has asked us a number of times to lift it. What should be expected of us is fairness and not engaging in politics. We have a clear rule against opaque userspace-to-FW interfaces, and I don't see how we could enforce that fairly for pure Ethernet devices if big vendors get to do whatever they want. > Note that this is absolutely not intended to > shut the discussion down - if we can find valid arguments why some of > this functionality should also be reported through a netdev API we > should continue that. Once again, the netdev requirements for debug are: you can do whatever you want, but other than knobs clearly defined in the code, the interface must be read only.
On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: > On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote: > > With my busy kernel contributor head on I have to voice my > > dissatisfaction with the subsystem maintainer overreach that's causing > > the troubles here. > > Overreach is unfortunate, I'd love to say "please do merge it as part > of RDMA". You probably don't trust my opinion but Jason admitted himself > this is primarily for RDMA. "admitted"? You make it sound like a crime. I've been very clear on this need from the RDMA community since the first posting. > The problem is that some RDMA stuff is built really closely on TCP, Huh? Since when? Are you talking about soft-iwarp? That is a research project, and Bernard is very responsive; if you have issues ask him and he will help. Otherwise the actual HW devices are not entangled with netdev TCP; the few iWarp devices have their own TCP implementation, in accordance with what the IETF standardized. > and given Jason's and co. inability to understand that good fences > make good neighbors it will soon start getting into the netdev stack :| I seem to recall you saying RDMA shouldn't call any netdev APIs at all. We were unable to agree on where to build the fence for this reason. > Ah, and I presume they may also want it for their DOCA products. > So 80% RDMA, 15% DOCA, 5% the rest is my guess. I don't know all the details about DOCA, but what I know about runs over RDMA. > Not sure what you mean by "without lots of precedence" but you can ask > around netdev. We have nacked such interfaces multiple times. > The best proof the rule exists and is well established it is that Saeed > has himself asked us a number of times to lift it. > > What should be expected of us is fairness and not engaging in politics. > We have a clear rule against opaque user space to FW interfaces, > and I don't see how we could enforce that fairly for pure Ethernet > devices if big vendors get to do whatever they want. 
If your community is telling you that your rules are not working for them anymore, it is not nice to tell them that the rules exist and cannot be questioned. Try working together toward a reasonable consensus solution. The world has changed a lot; the use cases are different, the users are different, the devices are different. When Dave made that prohibition long ago it was not in a world of multi-billion-transistor NICs being deployed in uniform clusters of unimaginable size. Jason
On Thu, 15 Feb 2024 09:21:38 -0400 Jason Gunthorpe wrote: > On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: > > On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote: > > > With my busy kernel contributor head on I have to voice my > > > dissatisfaction with the subsystem maintainer overreach that's causing > > > the troubles here. > > > > Overreach is unfortunate, I'd love to say "please do merge it as part > > of RDMA". You probably don't trust my opinion but Jason admitted himself > > this is primarily for RDMA. > > "admitted"? You make it sound like a crime. I've been very clear on > this need from the RDMA community since the first posting. Sorry, unintentional :) Maybe it's a misunderstanding but my impression was that at least Saeed was trying hard to make this driver a common one, not just for RDMA. > > The problem is that some RDMA stuff is built really closely on TCP, > > Huh? Since when? Are you talking about soft-iwarp? That is a reasearch > project and Bernard is very responsive, if you have issues ask him and > he will help. > > Otherwise the actual HW devices are not entangled with netdev TCP, the > few iWarp devices have their own TCP implementation, in accordance > with what the IETF standardized. There are some things I know either from work or that were told to me in confidence, which I can't share. This is very frustrating for me, and probably for you, sorry :( > > and given Jason's and co. inability to understand that good fences > > make good neighbors it will soon start getting into the netdev stack :| > > I seem to recall you saying RDMA shouldn't call any netdev APIs at > all. We were unable to agree on where to build the fence for this > reason. Last time you were calling into the IPsec stack, right? It's not just a basic API. IDK how to draw a line, definitely open to suggestions! > > Ah, and I presume they may also want it for their DOCA products. > > So 80% RDMA, 15% DOCA, 5% the rest is my guess. 
> > I don't know all details about DOCA, but what I know about runs over > RDMA. Well, since you're an RDMA person, that's not really saying much about the existence of other parts.
On Thu, Feb 15, 2024 at 05:10:13PM -0800, Jakub Kicinski wrote: > On Thu, 15 Feb 2024 09:21:38 -0400 Jason Gunthorpe wrote: > > On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: > > > On Wed, 14 Feb 2024 00:29:16 -0800 Christoph Hellwig wrote: > > > > With my busy kernel contributor head on I have to voice my > > > > dissatisfaction with the subsystem maintainer overreach that's causing > > > > the troubles here. > > > > > > Overreach is unfortunate, I'd love to say "please do merge it as part > > > of RDMA". You probably don't trust my opinion but Jason admitted himself > > > this is primarily for RDMA. > > > > "admitted"? You make it sound like a crime. I've been very clear on > > this need from the RDMA community since the first posting. > > Sorry, unintentional :) Maybe it's a misunderstanding but my impression > was that at least Saeed was trying hard to make this driver a common > one, not just for RDMA. The hardware is common, this is a driver to talk to the shared FW. I don't know how you'd do just one when netdev is effectively an RDMA application running in the kernel, from the perspective of the FW. There is no real line between these things beyond the artificial one we have created in the kernel. > > > The problem is that some RDMA stuff is built really closely on TCP, > > > > Huh? Since when? Are you talking about soft-iwarp? That is a reasearch > > project and Bernard is very responsive, if you have issues ask him and > > he will help. > > > > Otherwise the actual HW devices are not entangled with netdev TCP, the > > few iWarp devices have their own TCP implementation, in accordance > > with what the IETF standardized. > > There are some things I know either from work or otherwise told me > in confidence which I can't share. This is very frustrating for > me, and probably for you, sorry :( Well, all I can say is I know of no forthcoming RDMA things with any different relationship to TCP. 
I think if someone wants to do more TCP they will have a hard time, and I'm not inclined to seriously help anyone get more TCP into RDMA. iWarp is trouble enough already. > > I seem to recall you saying RDMA shouldn't call any netdev APIs at > > all. We were unable to agree on where to build the fence for this > > reason. > > Last time you were calling into the IPsec stack right? It's not just > a basic API. IDK how to draw a line, definitely open to suggestions! I thought the two halves of the mlx5 driver were co-ordinating their usage of the shared HW around the ipsec configuration pushed into the device by netdev xfrm. > > > Ah, and I presume they may also want it for their DOCA products. > > > So 80% RDMA, 15% DOCA, 5% the rest is my guess. > > > > I don't know all details about DOCA, but what I know about runs over > > RDMA. > > Well, since you're an RDMA person that's not really saying much > about existence of other parts. <shrug> why does DOCA matter? Should we have not done io_uring because Oracle might use it? Besides, from what I know about DOCA it is almost all data plane stuff and RDMA fully covers that already. Regards, Jason
On 2/15/24 6:10 PM, Jakub Kicinski wrote: >>> and given Jason's and co. inability to understand that good fences >>> make good neighbors it will soon start getting into the netdev stack :| >> >> I seem to recall you saying RDMA shouldn't call any netdev APIs at >> all. We were unable to agree on where to build the fence for this >> reason. > > Last time you were calling into the IPsec stack right? It's not just > a basic API. IDK how to draw a line, definitely open to suggestions! > In general, each subsystem should focus on proper APIs that other subsystems can leverage and not worry about how people wire them up. That NVME, RDMA/IB and io_uring want to re-use the Linux networking stack is a good thing.
On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: > Overreach is unfortunate, I'd love to say "please do merge it as part > of RDMA". You probably don't trust my opinion but Jason admitted himself > this is primarily for RDMA. RDMA is what it is in terms of openness and > all vendors trying to sell their secret magic sauce. Come on. RDMA has two important open standards, one of them even done in the IETF, the most open of all standards organizations. > > I think all maintainers can and should voice the > > opinions, be those technical or political, but trying to block a useful > > feature without lots of precedence because it is vaguely related to the > > subsystem is not helpful. > > Not sure what you mean by "without lots of precedence" but you can ask Should have been "with". Just about every subsystem with complex devices has this kind of direct interface for observability and co in at least some drivers.
On Wed, 14 Feb 2024 23:00:40 -0800 Christoph Hellwig wrote: > On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: > > Overreach is unfortunate, I'd love to say "please do merge it as part > > of RDMA". You probably don't trust my opinion but Jason admitted himself > > this is primarily for RDMA. RDMA is what it is in terms of openness and > > all vendors trying to sell their secret magic sauce. > > Common. RDMA has two important open standards, one of them even done > in IETF that most open of all standards organizations. While I don't dispute that there are standards which can be read, the practical interoperability of RDMA devices is extremely low. By practical I mean having two devices from different vendors achieve any reasonable performance talking to each other. Even two devices from _the same_ vendor but different generations are unlikely to perform. Given how RDMA is deployed (uniform, greenfield/full replacement) this is entirely reasonable from the engineering perspective. But this is a bit of a vicious cycle, vendors have little incentive to interoperate, and primarily focus on adding secret sauce outside of the standard. In fact you're lucky if the vendor didn't bake some extension which requires custom switches into the NICs :( Compare that to WiFi, which is a level of standardization netdev folks are more accustomed to. You can connect a new device from vendor X to a 10 year old AP from vendor Y and it will run with high perf. Unfortunately because of the AI craze I have some experience with RDMA deployments now. Perhaps you have more, perhaps your experience differs.
On Thu, Feb 15, 2024 at 05:00:46PM -0800, Jakub Kicinski wrote: > But this is a bit of a vicious cycle, vendors have little incentive > to interoperate, and primarily focus on adding secret sauce outside of > the standard. In fact you're lucky if the vendor didn't bake some > extension which requires custom switches into the NICs :( This may all seem shocking if you come from the netdev world, but this has been normal for HPC networking for the last 30 years at least. My counter-perspective would be that we are currently in a pretty good moment for the HPC industry because we actually have open source implementations for most of it. In fact most actual deployments are running something quite close to the mainline open source stack. The main holdout right now is Cray/HPE's Slingshot networking family (based on ethernet, apparently), which is less open source. I would say the HPC community has a very different goal post than netdev land. Make your thing, whatever it is. Come with an open kernel driver, an open rdma-core, an open libfabric/ucx, plug into the open dpdk/nccl/ucx/libfabric layer, and demonstrate your thing works with openmpi/etc. applications. Supporting that open stack is broadly my north star from the kernel perspective, as Mesa is to DRM. Several points of this chain are open industry standards driven by technical working group communities. This is what standardization and interoperability look like here. It is probably totally foreign from a netdev viewpoint: far less focus on the wire protocol, devices and kernel. Here the focus is on application and software interoperability. Still, it is open in a pretty solid way. Jason
Thu, Feb 15, 2024 at 08:00:40AM CET, hch@infradead.org wrote: >On Wed, Feb 14, 2024 at 07:48:32AM -0800, Jakub Kicinski wrote: [...] >> > I think all maintainers can and should voice the >> > opinions, be those technical or political, but trying to block a useful >> > feature without lots of precedence because it is vaguely related to the >> > subsystem is not helpful. >> >> Not sure what you mean by "without lots of precedence" but you can ask > >Should have been with. Just about every subsystem with complex devices >has this kind of direct interface for observability and co in at least >some drivers. What about configuration? How do you ensure the direct FW/HW access is used only for observability/debug purposes? I mean, if you can't, I think it is incorrect to name it that way, isn't it?