[PATCH v2] net: add initial support for AF_XDP network backend

Ilya Maximets posted 1 patch 10 months, 1 week ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20230705205914.3457560-1-i.maximets@ovn.org
Maintainers: "Dr. David Alan Gilbert" <dave@treblig.org>, Paolo Bonzini <pbonzini@redhat.com>, "Marc-André Lureau" <marcandre.lureau@redhat.com>, "Daniel P. Berrangé" <berrange@redhat.com>, Thomas Huth <thuth@redhat.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Ilya Maximets <i.maximets@ovn.org>, Jason Wang <jasowang@redhat.com>, Eric Blake <eblake@redhat.com>, Markus Armbruster <armbru@redhat.com>, "Alex Bennée" <alex.bennee@linaro.org>, Wainer dos Santos Moschetta <wainersm@redhat.com>, Beraldo Leal <bleal@redhat.com>
There is a newer version of this series
AF_XDP is a network socket family that allows communication directly
with the network device driver in the kernel, bypassing most or all
of the kernel networking stack.  In essence, the technology is
pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
and works with any network interface without driver modifications.
Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
require access to character devices or unix sockets.  Only access to
the network interface itself is necessary.

This patch implements a network backend that communicates with the
kernel by creating an AF_XDP socket.  A chunk of userspace memory
is shared between QEMU and the host kernel.  Four ring buffers (Tx, Rx,
Fill and Completion) are placed in that memory along with a pool of
memory buffers for the packet data.  Data transmission is done by
allocating one of the buffers, copying packet data into it and
placing the pointer into the Tx ring.  After transmission, the device
will return the buffer via the Completion ring.  On Rx, the device will
take a buffer from a pre-populated Fill ring, write the packet data
into it and place the buffer into the Rx ring.
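
The buffer life cycle described above can be sketched in plain C.  This
is only an illustration under simplifying assumptions: the real backend
uses libxdp rings shared with the kernel, while here a small array
stands in for the Tx and Completion rings.  All sketch_* names, the
pool size of 8 frames and the in_flight array are hypothetical and do
not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_FRAME_SIZE 4096u /* one 4K buffer per packet */
#define SKETCH_N_FRAMES   8u    /* tiny pool, for illustration only */

/* Hypothetical stand-in for the shared memory area and its buffer pool. */
struct sketch_umem {
    uint8_t  data[SKETCH_N_FRAMES * SKETCH_FRAME_SIZE]; /* packet buffers */
    uint64_t pool[SKETCH_N_FRAMES];      /* LIFO stack of free offsets */
    uint32_t n_pool;                     /* number of free buffers */
    uint64_t in_flight[SKETCH_N_FRAMES]; /* stands in for Tx + Completion */
    uint32_t n_in_flight;
};

static void sketch_init(struct sketch_umem *u)
{
    memset(u, 0, sizeof(*u));
    for (uint32_t i = 0; i < SKETCH_N_FRAMES; i++) {
        u->pool[i] = (uint64_t)i * SKETCH_FRAME_SIZE;
    }
    u->n_pool = SKETCH_N_FRAMES;
}

/*
 * Tx path: allocate a buffer, copy the packet into it and hand the
 * offset over to the "ring".  Returns 0 if no buffer is free, in which
 * case the caller has to retry after completions are processed.
 */
static size_t sketch_tx(struct sketch_umem *u, const void *pkt, size_t len)
{
    uint64_t addr;

    assert(len <= SKETCH_FRAME_SIZE);
    if (!u->n_pool) {
        return 0;
    }
    addr = u->pool[--u->n_pool];
    memcpy(u->data + addr, pkt, len);
    u->in_flight[u->n_in_flight++] = addr;
    return len;
}

/* Completion path: the "device" returns buffers, they go back to the pool. */
static uint32_t sketch_complete(struct sketch_umem *u)
{
    uint32_t done = u->n_in_flight;

    while (u->n_in_flight) {
        u->pool[u->n_pool++] = u->in_flight[--u->n_in_flight];
    }
    return done;
}
```

The pool is a LIFO stack, so a recently returned buffer is the next one
to be reused.  A caller that gets 0 from sketch_tx() would normally
wait for completions before retrying.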

The AF_XDP network backend takes on the communication with the host
kernel and the network interface and forwards packets to/from the
peer device in QEMU.

Usage example:

  -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
  -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1

An XDP program bridges the socket with a network interface.  It can be
attached to the interface in 2 different modes:

1. skb - this mode should work for any interface and doesn't require
         driver support, at the cost of lower performance.

2. native - this mode requires support from the driver and allows
            bypassing skb allocation in the kernel and potentially
            using zero-copy while getting packets in/out of userspace.

By default, QEMU will try to use native mode and fall back to skb.
The mode can be forced via the 'mode' option.  To force copy mode even
in native mode, use the 'force-copy=on' option.  This might be useful
if there is some issue with the driver.
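
The mode selection described above (native first, then skb, unless a
mode is forced) can be sketched as follows.  The attach callback is a
hypothetical placeholder for the real socket creation call, and all
sketch_* names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_mode { SKETCH_NATIVE, SKETCH_SKB };

/* Hypothetical attach callback: returns 0 on success, non-zero on failure. */
typedef int (*sketch_attach_fn)(enum sketch_mode mode, void *ctx);

/*
 * 'forced' is NULL when no 'mode' option was given.  On success,
 * returns 0 and stores the mode that was actually used in '*used'.
 */
static int sketch_attach(const enum sketch_mode *forced,
                         sketch_attach_fn attach, void *ctx,
                         enum sketch_mode *used)
{
    assert(attach && used);

    if (forced) {
        /* A specific mode was requested: no fallback. */
        *used = *forced;
        return attach(*forced, ctx);
    }

    /* No mode requested: try native first, then fall back to skb. */
    *used = SKETCH_NATIVE;
    if (attach(SKETCH_NATIVE, ctx) == 0) {
        return 0;
    }
    *used = SKETCH_SKB;
    return attach(SKETCH_SKB, ctx);
}

/* Example callback: a driver without native XDP support. */
static int sketch_skb_only(enum sketch_mode mode, void *ctx)
{
    int *attempts = ctx;

    (*attempts)++;
    return mode == SKETCH_SKB ? 0 : -1;
}
```

With sketch_skb_only() as the callback, the default path makes two
attempts and ends up in skb mode, while forcing native mode fails
after a single attempt.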

Option 'queues=N' specifies how many device queues should be
opened.  Note that all the queues that are not open are still
functional and can receive traffic, but it will not be delivered to
QEMU.  So, the number of device queues should generally match the
QEMU configuration, unless the device is shared with something
else and the traffic re-direction to appropriate queues is correctly
configured on a device level (e.g. with ethtool -N).
'start-queue=M' option can be used to specify from which queue id
QEMU should start configuring 'N' queues.  It might also be necessary
to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
for examples.
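
As a small illustration of how 'queues=N' and 'start-queue=M' relate to
device queue ids: the i-th backend queue binds to device queue M + i,
and traffic arriving on any other device queue is not delivered to
QEMU.  The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device queue id that the backend queue 'queue_index' binds to. */
static int64_t sketch_queue_id(int64_t start_queue, int64_t n_queues,
                               int64_t queue_index)
{
    assert(start_queue >= 0 && n_queues >= 1);
    assert(queue_index >= 0 && queue_index < n_queues);
    return start_queue + queue_index;
}

/* True if traffic arriving on 'dev_queue' would be delivered to QEMU. */
static bool sketch_queue_is_open(int64_t start_queue, int64_t n_queues,
                                 int64_t dev_queue)
{
    assert(start_queue >= 0 && n_queues >= 1);
    return dev_queue >= start_queue && dev_queue < start_queue + n_queues;
}
```

For example, with 'queues=2,start-queue=4' the backend opens device
queues 4 and 5.  Packets steered to queue 0 would not reach the guest
unless the device is configured to redirect them.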

In the general case, QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
capabilities in order to load default XSK/XDP programs to the
network interface and configure BPF maps.  It is possible, however,
to run with no capabilities.  For that to work, an external process
with admin capabilities will need to pre-load the default XSK program,
create AF_XDP sockets and pass their file descriptors to the QEMU
process on startup via the 'sock-fds' option.  The network backend will
need to be configured with 'inhibit=on' to avoid loading of the program.
QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
or CAP_IPC_LOCK.
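
The 32 MB per queue figure follows from the buffer area size computed
in af_xdp_umem_create(): enough frames to fill all four rings.  The
sketch below repeats that arithmetic.  The constants are assumptions
that mirror the libxdp defaults (2048 descriptors per ring, 4096-byte
frames); the SKETCH_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror XSK_RING_PROD__DEFAULT_NUM_DESCS and friends. */
#define SKETCH_PROD_DESCS 2048u
#define SKETCH_CONS_DESCS 2048u
#define SKETCH_FRAME_SIZE 4096u

/* Bytes of buffer memory needed for 'n_queues' queues. */
static uint64_t sketch_umem_bytes(uint32_t n_queues)
{
    /* Number of descriptors if all 4 rings (rx, tx, cq, fq) are full. */
    uint64_t n_descs = (uint64_t)(SKETCH_PROD_DESCS + SKETCH_CONS_DESCS) * 2;

    assert(n_queues >= 1);
    return n_descs * SKETCH_FRAME_SIZE * n_queues;
}
```

That is 8192 frames of 4 KiB each, i.e. 32 MiB for one queue, so a
4-queue setup without CAP_IPC_LOCK would need an RLIMIT_MEMLOCK of at
least 128 MiB.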

Alternatively, the file descriptor for 'xsks_map' can be passed via
'xsks-map-fd=N' option instead of passing socket file descriptors.
That will additionally require CAP_NET_RAW on QEMU side.  This is
useful, because 'sock-fds' may not be available with older libxdp.
'sock-fds' requires libxdp >= 1.4.0.
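
The 'sock-fds' value is a colon-separated list with one descriptor per
queue.  A minimal sketch of validating such a list is shown below.  It
is not the code from this patch: the patch uses g_strsplit() and
monitor_fd_param(), while this simplified version accepts only plain
numeric descriptors and uses a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a list like "3:4:5" into out[0 .. n_expected - 1].
 * Returns 0 on success, -1 if the list is malformed or does not
 * contain exactly 'n_expected' descriptors.
 */
static int sketch_parse_fds(const char *str, int *out, size_t n_expected)
{
    const char *p = str;
    size_t n = 0;

    assert(str && out);

    for (;;) {
        char *end;
        long v;

        errno = 0;
        v = strtol(p, &end, 10);
        if (end == p || errno || v < 0 || v > INT_MAX) {
            return -1; /* Not a valid non-negative descriptor number. */
        }
        if (n == n_expected) {
            return -1; /* More descriptors than queues. */
        }
        out[n++] = (int)v;

        if (*end == '\0') {
            break;
        }
        if (*end != ':') {
            return -1; /* Unexpected separator. */
        }
        p = end + 1;
    }

    return n == n_expected ? 0 : -1;
}
```

With 'queues=3', the string "3:4:5" would be accepted, while "3:4" or
"3:4:5:6" would be rejected because the count does not match the
number of queues.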

There are a few performance challenges with the current network backends.

First is that they do not support IO threads.  This means that data
path is handled by the main thread in QEMU and may slow down other
work or may be slowed down by some other work.  This also means that
taking advantage of multi-queue is generally not possible today.

Another thing is that data path is going through the device emulation
code, which is not really optimized for performance.  The fastest
"frontend" device is virtio-net.  But it's not optimized for heavy
traffic either, because it expects such use-cases to be handled via
some implementation of vhost (user, kernel, vdpa).  In practice, we
have virtio notifications and rcu lock/unlock on a per-packet basis
and not very efficient accesses to the guest memory.  Communication
channels between backend and frontend devices do not allow passing
more than one packet at a time as well.

Some of these challenges can be avoided in the future by adding better
batching into device emulation or by implementing vhost-af-xdp variant.

There are also a few kernel limitations.  AF_XDP sockets do not
support any kinds of checksum or segmentation offloading.  Buffers
are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
support implementation for AF_XDP is in progress, but not ready yet.
Also, transmission in all non-zero-copy modes is synchronous, i.e.
done in a syscall.  That doesn't allow high packet rates on virtual
interfaces.

However, keeping in mind all of these challenges, the current
implementation of the AF_XDP backend shows decent performance while
running on top of a physical NIC with zero-copy support.

Test setup:

2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
Network backend is configured to open the NIC directly in native mode.
The driver supports zero-copy.  NIC is configured to use 1 queue.

Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
for PPS testing.

iperf3 result:
 TCP stream      : 19.1 Gbps

dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
 Tx only         : 3.4 Mpps
 Rx only         : 2.0 Mpps
 L2 FWD Loopback : 1.5 Mpps

In skb mode the same setup shows much lower performance, similar to
the setup where the pair of physical NICs is replaced with a veth pair:

iperf3 result:
  TCP stream      : 9 Gbps

dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
  Tx only         : 1.2 Mpps
  Rx only         : 1.0 Mpps
  L2 FWD Loopback : 0.7 Mpps

Results in skb mode or over the veth are close to results of a tap
backend with vhost=on and disabled segmentation offloading bridged
with a NIC.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---

Version 2:

  - Added support for running with no capabilities by passing
    pre-created AF_XDP socket file descriptors via 'sock-fds' option.
    Conditionally compiled because it requires unreleased libxdp 1.4.0.
    The last restriction is having 32 MB of RLIMIT_MEMLOCK per queue.

  - Refined and extended documentation.


 MAINTAINERS                                   |   4 +
 hmp-commands.hx                               |   2 +-
 meson.build                                   |  19 +
 meson_options.txt                             |   2 +
 net/af-xdp.c                                  | 570 ++++++++++++++++++
 net/clients.h                                 |   5 +
 net/meson.build                               |   3 +
 net/net.c                                     |   6 +
 qapi/net.json                                 |  60 +-
 qemu-options.hx                               |  83 ++-
 .../ci/org.centos/stream/8/x86_64/configure   |   1 +
 scripts/meson-buildoptions.sh                 |   3 +
 tests/docker/dockerfiles/debian-amd64.docker  |   1 +
 13 files changed, 756 insertions(+), 3 deletions(-)
 create mode 100644 net/af-xdp.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7164cf55a1..80d4ba4004 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2929,6 +2929,10 @@ W: http://info.iet.unipi.it/~luigi/netmap/
 S: Maintained
 F: net/netmap.c
 
+AF_XDP network backend
+R: Ilya Maximets <i.maximets@ovn.org>
+F: net/af-xdp.c
+
 Host Memory Backends
 M: David Hildenbrand <david@redhat.com>
 M: Igor Mammedov <imammedo@redhat.com>
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 2cbd0f77a0..af9ffe4681 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1295,7 +1295,7 @@ ERST
     {
         .name       = "netdev_add",
         .args_type  = "netdev:O",
-        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user"
+        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user"
 #ifdef CONFIG_VMNET
                       "|vmnet-host|vmnet-shared|vmnet-bridged"
 #endif
diff --git a/meson.build b/meson.build
index a9ba0bfab3..1f8772ea5d 100644
--- a/meson.build
+++ b/meson.build
@@ -1891,6 +1891,18 @@ if libbpf.found() and not cc.links('''
   endif
 endif
 
+# libxdp
+libxdp = dependency('libxdp', required: get_option('af_xdp'), method: 'pkg-config')
+if libxdp.found() and \
+      not (libbpf.found() and libbpf.version().version_compare('>=0.7'))
+  libxdp = not_found
+  if get_option('af_xdp').enabled()
+    error('af-xdp support requires libbpf version >= 0.7')
+  else
+    warning('af-xdp support requires libbpf version >= 0.7, disabling')
+  endif
+endif
+
 # libdw
 libdw = not_found
 if not get_option('libdw').auto() or \
@@ -2112,6 +2124,12 @@ config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', get_option('hexagon_idef_pars
 config_host_data.set('CONFIG_LIBATTR', have_old_libattr)
 config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found())
 config_host_data.set('CONFIG_EBPF', libbpf.found())
+config_host_data.set('CONFIG_AF_XDP', libxdp.found())
+if libxdp.found()
+  config_host_data.set('HAVE_XSK_UMEM__CREATE_WITH_FD',
+                       cc.has_function('xsk_umem__create_with_fd',
+                                       dependencies: libxdp))
+endif
 config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found())
 config_host_data.set('CONFIG_LIBISCSI', libiscsi.found())
 config_host_data.set('CONFIG_LIBNFS', libnfs.found())
@@ -4285,6 +4303,7 @@ summary_info += {'PVRDMA support':    have_pvrdma}
 summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
 summary_info += {'libcap-ng support': libcap_ng}
 summary_info += {'bpf support':       libbpf}
+summary_info += {'AF_XDP support':    libxdp}
 summary_info += {'rbd support':       rbd}
 summary_info += {'smartcard support': cacard}
 summary_info += {'U2F support':       u2f}
diff --git a/meson_options.txt b/meson_options.txt
index bbb5c7e886..f4e950ce6a 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -120,6 +120,8 @@ option('avx512bw', type: 'feature', value: 'auto',
 option('keyring', type: 'feature', value: 'auto',
        description: 'Linux keyring support')
 
+option('af_xdp', type : 'feature', value : 'auto',
+       description: 'AF_XDP network backend support')
 option('attr', type : 'feature', value : 'auto',
        description: 'attr/xattr support')
 option('auth_pam', type : 'feature', value : 'auto',
diff --git a/net/af-xdp.c b/net/af-xdp.c
new file mode 100644
index 0000000000..265ba6b12e
--- /dev/null
+++ b/net/af-xdp.c
@@ -0,0 +1,570 @@
+/*
+ * AF_XDP network backend.
+ *
+ * Copyright (c) 2023 Red Hat, Inc.
+ *
+ * Authors:
+ *  Ilya Maximets <i.maximets@ovn.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+
+#include "qemu/osdep.h"
+#include <bpf/bpf.h>
+#include <inttypes.h>
+#include <linux/if_link.h>
+#include <linux/if_xdp.h>
+#include <net/if.h>
+#include <xdp/xsk.h>
+
+#include "clients.h"
+#include "monitor/monitor.h"
+#include "net/net.h"
+#include "qapi/error.h"
+#include "qemu/cutils.h"
+#include "qemu/error-report.h"
+#include "qemu/iov.h"
+#include "qemu/main-loop.h"
+#include "qemu/memalign.h"
+
+
+typedef struct AFXDPState {
+    NetClientState       nc;
+
+    struct xsk_socket    *xsk;
+    struct xsk_ring_cons rx;
+    struct xsk_ring_prod tx;
+    struct xsk_ring_cons cq;
+    struct xsk_ring_prod fq;
+
+    char                 ifname[IFNAMSIZ];
+    int                  ifindex;
+    bool                 read_poll;
+    bool                 write_poll;
+    uint32_t             outstanding_tx;
+
+    uint64_t             *pool;
+    uint32_t             n_pool;
+    char                 *buffer;
+    struct xsk_umem      *umem;
+
+    uint32_t             n_queues;
+    uint32_t             xdp_flags;
+    bool                 inhibit;
+} AFXDPState;
+
+#define AF_XDP_BATCH_SIZE 64
+
+static void af_xdp_send(void *opaque);
+static void af_xdp_writable(void *opaque);
+
+/* Set the event-loop handlers for the af-xdp backend. */
+static void af_xdp_update_fd_handler(AFXDPState *s)
+{
+    qemu_set_fd_handler(xsk_socket__fd(s->xsk),
+                        s->read_poll ? af_xdp_send : NULL,
+                        s->write_poll ? af_xdp_writable : NULL,
+                        s);
+}
+
+/* Update the read handler. */
+static void af_xdp_read_poll(AFXDPState *s, bool enable)
+{
+    if (s->read_poll != enable) {
+        s->read_poll = enable;
+        af_xdp_update_fd_handler(s);
+    }
+}
+
+/* Update the write handler. */
+static void af_xdp_write_poll(AFXDPState *s, bool enable)
+{
+    if (s->write_poll != enable) {
+        s->write_poll = enable;
+        af_xdp_update_fd_handler(s);
+    }
+}
+
+static void af_xdp_poll(NetClientState *nc, bool enable)
+{
+    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
+
+    if (s->read_poll != enable || s->write_poll != enable) {
+        s->write_poll = enable;
+        s->read_poll  = enable;
+        af_xdp_update_fd_handler(s);
+    }
+}
+
+static void af_xdp_complete_tx(AFXDPState *s)
+{
+    uint32_t idx = 0;
+    uint32_t done, i;
+    uint64_t *addr;
+
+    done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, &idx);
+
+    for (i = 0; i < done; i++) {
+        addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++);
+        s->pool[s->n_pool++] = *addr;
+        s->outstanding_tx--;
+    }
+
+    if (done) {
+        xsk_ring_cons__release(&s->cq, done);
+    }
+}
+
+/*
+ * The fd_write() callback, invoked if the fd is marked as writable
+ * after a poll.
+ */
+static void af_xdp_writable(void *opaque)
+{
+    AFXDPState *s = opaque;
+
+    /* Try to recover buffers that are already sent. */
+    af_xdp_complete_tx(s);
+
+    /*
+     * Unregister the handler, unless we still have packets to transmit
+     * and kernel needs a wake up.
+     */
+    if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) {
+        af_xdp_write_poll(s, false);
+    }
+
+    /* Flush any buffered packets. */
+    qemu_flush_queued_packets(&s->nc);
+}
+
+static ssize_t af_xdp_receive(NetClientState *nc,
+                              const uint8_t *buf, size_t size)
+{
+    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
+    struct xdp_desc *desc;
+    uint32_t idx;
+    void *data;
+
+    /* Try to recover buffers that are already sent. */
+    af_xdp_complete_tx(s);
+
+    if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) {
+        /* We can't transmit packet this size... */
+        return size;
+    }
+
+    if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) {
+        /*
+         * Out of buffers or space in tx ring.  Poll until we can write.
+         * This will also kick the Tx, if it was waiting on CQ.
+         */
+        af_xdp_write_poll(s, true);
+        return 0;
+    }
+
+    desc = xsk_ring_prod__tx_desc(&s->tx, idx);
+    desc->addr = s->pool[--s->n_pool];
+    desc->len = size;
+
+    data = xsk_umem__get_data(s->buffer, desc->addr);
+    memcpy(data, buf, size);
+
+    xsk_ring_prod__submit(&s->tx, 1);
+    s->outstanding_tx++;
+
+    if (xsk_ring_prod__needs_wakeup(&s->tx)) {
+        af_xdp_write_poll(s, true);
+    }
+
+    return size;
+}
+
+/*
+ * Complete a previous send (backend --> guest) and enable the
+ * fd_read callback.
+ */
+static void af_xdp_send_completed(NetClientState *nc, ssize_t len)
+{
+    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
+
+    af_xdp_read_poll(s, true);
+}
+
+static void af_xdp_fq_refill(AFXDPState *s, uint32_t n)
+{
+    uint32_t i, idx = 0;
+
+    /* Leave one packet for Tx, just in case. */
+    if (s->n_pool < n + 1) {
+        n = s->n_pool;
+    }
+
+    if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) {
+        return;
+    }
+
+    for (i = 0; i < n; i++) {
+        *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool];
+    }
+    xsk_ring_prod__submit(&s->fq, n);
+
+    if (xsk_ring_prod__needs_wakeup(&s->fq)) {
+        /* Receive was blocked by not having enough buffers.  Wake it up. */
+        af_xdp_read_poll(s, true);
+    }
+}
+
+static void af_xdp_send(void *opaque)
+{
+    uint32_t i, n_rx, idx = 0;
+    AFXDPState *s = opaque;
+
+    n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx);
+    if (!n_rx) {
+        return;
+    }
+
+    for (i = 0; i < n_rx; i++) {
+        const struct xdp_desc *desc;
+        struct iovec iov;
+
+        desc = xsk_ring_cons__rx_desc(&s->rx, idx++);
+
+        iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr);
+        iov.iov_len = desc->len;
+
+        s->pool[s->n_pool++] = desc->addr;
+
+        if (!qemu_sendv_packet_async(&s->nc, &iov, 1,
+                                     af_xdp_send_completed)) {
+            /*
+             * The peer does not receive anymore.  Packet is queued, stop
+             * reading from the backend until af_xdp_send_completed().
+             */
+            af_xdp_read_poll(s, false);
+
+            /* Re-peek the descriptors to not break the ring cache. */
+            xsk_ring_cons__cancel(&s->rx, n_rx);
+            n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx);
+            g_assert(n_rx == i + 1);
+            break;
+        }
+    }
+
+    /* Release actually sent descriptors and try to re-fill. */
+    xsk_ring_cons__release(&s->rx, n_rx);
+    af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE);
+}
+
+/* Flush and close. */
+static void af_xdp_cleanup(NetClientState *nc)
+{
+    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
+
+    qemu_purge_queued_packets(nc);
+
+    af_xdp_poll(nc, false);
+
+    xsk_socket__delete(s->xsk);
+    s->xsk = NULL;
+    g_free(s->pool);
+    s->pool = NULL;
+    xsk_umem__delete(s->umem);
+    s->umem = NULL;
+    qemu_vfree(s->buffer);
+    s->buffer = NULL;
+
+    /* Remove the program if it's the last open queue. */
+    if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags
+        && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) {
+        fprintf(stderr,
+                "af-xdp: unable to remove XDP program from '%s', ifindex: %d\n",
+                s->ifname, s->ifindex);
+    }
+}
+
+static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp)
+{
+    struct xsk_umem_config config = {
+        .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+        .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+        .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
+        .frame_headroom = 0,
+    };
+    uint64_t n_descs;
+    uint64_t size;
+    int64_t i;
+    int ret;
+
+    /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. */
+    n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS
+               + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2;
+    size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE;
+
+    s->buffer = qemu_memalign(qemu_real_host_page_size(), size);
+    memset(s->buffer, 0, size);
+
+    if (sock_fd < 0) {
+        ret = xsk_umem__create(&s->umem, s->buffer, size,
+                               &s->fq, &s->cq, &config);
+    } else {
+#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
+        ret = xsk_umem__create_with_fd(&s->umem, sock_fd, s->buffer, size,
+                                       &s->fq, &s->cq, &config);
+#else
+        ret = -1;
+        errno = EINVAL;
+#endif
+    }
+
+    if (ret) {
+        qemu_vfree(s->buffer);
+        error_setg_errno(errp, errno,
+                         "failed to create umem for %s queue_index: %d",
+                         s->ifname, s->nc.queue_index);
+        return -1;
+    }
+
+    s->pool = g_new(uint64_t, n_descs);
+    /* Fill the pool in the opposite order, because it's a LIFO queue. */
+    for (i = n_descs - 1; i >= 0; i--) {
+        s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE;
+    }
+    s->n_pool = n_descs;
+
+    af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS);
+
+    return 0;
+}
+
+static int af_xdp_socket_create(AFXDPState *s,
+                                const NetdevAFXDPOptions *opts,
+                                int xsks_map_fd, Error **errp)
+{
+    struct xsk_socket_config cfg = {
+        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+        .libxdp_flags = 0,
+        .bind_flags = XDP_USE_NEED_WAKEUP,
+        .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
+    };
+    int queue_id, error = 0;
+
+    s->inhibit = opts->has_inhibit && opts->inhibit;
+    if (s->inhibit) {
+        cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD;
+    }
+
+    if (opts->has_force_copy && opts->force_copy) {
+        cfg.bind_flags |= XDP_COPY;
+    }
+
+    queue_id = s->nc.queue_index;
+    if (opts->has_start_queue && opts->start_queue > 0) {
+        queue_id += opts->start_queue;
+    }
+
+    if (opts->has_mode) {
+        /* Specific mode requested. */
+        cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE)
+                         ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;
+        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
+                               s->umem, &s->rx, &s->tx, &cfg)) {
+            error = errno;
+        }
+    } else {
+        /* No mode requested, try native first. */
+        cfg.xdp_flags |= XDP_FLAGS_DRV_MODE;
+
+        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
+                               s->umem, &s->rx, &s->tx, &cfg)) {
+            /* Can't use native mode, try skb. */
+            cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE;
+            cfg.xdp_flags |= XDP_FLAGS_SKB_MODE;
+
+            if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
+                                   s->umem, &s->rx, &s->tx, &cfg)) {
+                error = errno;
+            }
+        }
+    }
+
+    if (error) {
+        error_setg_errno(errp, error,
+                         "failed to create AF_XDP socket for %s queue_id: %d",
+                         s->ifname, queue_id);
+        return -1;
+    }
+
+    if (s->inhibit && xsks_map_fd >= 0) {
+        int xsk_fd = xsk_socket__fd(s->xsk);
+
+        /* Need to update the map manually, libxdp skipped that step. */
+        error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0);
+        if (error) {
+            error_setg_errno(errp, errno,
+                             "failed to update xsks map for %s queue_id: %d",
+                             s->ifname, queue_id);
+            return -1;
+        }
+    }
+
+    s->xdp_flags = cfg.xdp_flags;
+
+    return 0;
+}
+
+/* NetClientInfo methods. */
+static NetClientInfo net_af_xdp_info = {
+    .type = NET_CLIENT_DRIVER_AF_XDP,
+    .size = sizeof(AFXDPState),
+    .receive = af_xdp_receive,
+    .poll = af_xdp_poll,
+    .cleanup = af_xdp_cleanup,
+};
+
+#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
+static int *parse_socket_fds(const char *sock_fds_str,
+                             int64_t n_expected, Error **errp)
+{
+    gchar **substrings = g_strsplit(sock_fds_str, ":", -1);
+    int64_t i, n_sock_fds = g_strv_length(substrings);
+    int *sock_fds = NULL;
+
+    if (n_sock_fds != n_expected) {
+        error_setg(errp, "expected %"PRIi64" socket fds, got %"PRIi64,
+                   n_expected, n_sock_fds);
+        goto exit;
+    }
+
+    sock_fds = g_new(int, n_sock_fds);
+
+    for (i = 0; i < n_sock_fds; i++) {
+        sock_fds[i] = monitor_fd_param(monitor_cur(), substrings[i], errp);
+        if (sock_fds[i] < 0) {
+            g_free(sock_fds);
+            sock_fds = NULL;
+            goto exit;
+        }
+    }
+
+exit:
+    g_strfreev(substrings);
+    return sock_fds;
+}
+#endif
+
+/*
+ * The exported init function.
+ *
+ * ... -netdev af-xdp,ifname="..."
+ */
+int net_init_af_xdp(const Netdev *netdev,
+                    const char *name, NetClientState *peer, Error **errp)
+{
+    const NetdevAFXDPOptions *opts = &netdev->u.af_xdp;
+    NetClientState *nc, *nc0 = NULL;
+    unsigned int ifindex;
+    uint32_t prog_id = 0;
+    int *sock_fds = NULL;
+    int xsks_map_fd = -1;
+    int64_t i, queues;
+    Error *err = NULL;
+    AFXDPState *s;
+
+    ifindex = if_nametoindex(opts->ifname);
+    if (!ifindex) {
+        error_setg_errno(errp, errno, "failed to get ifindex for '%s'",
+                         opts->ifname);
+        return -1;
+    }
+
+    queues = opts->has_queues ? opts->queues : 1;
+    if (queues < 1) {
+        error_setg(errp, "invalid number of queues (%" PRIi64 ") for '%s'",
+                   queues, opts->ifname);
+        return -1;
+    }
+
+#ifndef HAVE_XSK_UMEM__CREATE_WITH_FD
+    if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) {
+        error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' together");
+        return -1;
+    }
+#else
+    if ((opts->has_inhibit && opts->inhibit)
+        != (opts->xsks_map_fd || opts->sock_fds)) {
+        error_setg(errp, "'inhibit=on' should be used together with "
+                         "'sock-fds' or 'xsks-map-fd'");
+        return -1;
+    }
+
+    if (opts->xsks_map_fd && opts->sock_fds) {
+        error_setg(errp, "'sock-fds' and 'xsks-map-fd' are mutually exclusive");
+        return -1;
+    }
+
+    if (opts->sock_fds) {
+        sock_fds = parse_socket_fds(opts->sock_fds, queues, errp);
+        if (!sock_fds) {
+            return -1;
+        }
+    }
+#endif
+
+    if (opts->xsks_map_fd) {
+        xsks_map_fd = monitor_fd_param(monitor_cur(), opts->xsks_map_fd, errp);
+        if (xsks_map_fd < 0) {
+            goto err;
+        }
+    }
+
+    for (i = 0; i < queues; i++) {
+        nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name);
+        qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname);
+        nc->queue_index = i;
+
+        if (!nc0) {
+            nc0 = nc;
+        }
+
+        s = DO_UPCAST(AFXDPState, nc, nc);
+
+        pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname);
+        s->ifindex = ifindex;
+        s->n_queues = queues;
+
+        if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp)
+            || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) {
+            /* Make sure the XDP program will be removed. */
+            s->n_queues = i;
+            error_propagate(errp, err);
+            goto err;
+        }
+    }
+
+    if (nc0) {
+        s = DO_UPCAST(AFXDPState, nc, nc0);
+        if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || !prog_id) {
+            error_setg_errno(errp, errno,
+                             "no XDP program loaded on '%s', ifindex: %d",
+                             s->ifname, s->ifindex);
+            goto err;
+        }
+    }
+
+    af_xdp_read_poll(s, true); /* Initially only poll for reads. */
+
+    return 0;
+
+err:
+    g_free(sock_fds);
+    if (nc0) {
+        qemu_del_net_client(nc0);
+    }
+
+    return -1;
+}
diff --git a/net/clients.h b/net/clients.h
index ed8bdfff1e..be53794582 100644
--- a/net/clients.h
+++ b/net/clients.h
@@ -64,6 +64,11 @@ int net_init_netmap(const Netdev *netdev, const char *name,
                     NetClientState *peer, Error **errp);
 #endif
 
+#ifdef CONFIG_AF_XDP
+int net_init_af_xdp(const Netdev *netdev, const char *name,
+                    NetClientState *peer, Error **errp);
+#endif
+
 int net_init_vhost_user(const Netdev *netdev, const char *name,
                         NetClientState *peer, Error **errp);
 
diff --git a/net/meson.build b/net/meson.build
index bdf564a57b..61628d4684 100644
--- a/net/meson.build
+++ b/net/meson.build
@@ -36,6 +36,9 @@ system_ss.add(when: vde, if_true: files('vde.c'))
 if have_netmap
   system_ss.add(files('netmap.c'))
 endif
+
+system_ss.add(when: libxdp, if_true: files('af-xdp.c'))
+
 if have_vhost_net_user
   system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('vhost-user.c'), if_false: files('vhost-user-stub.c'))
   system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c'))
diff --git a/net/net.c b/net/net.c
index 6492ad530e..127f70932b 100644
--- a/net/net.c
+++ b/net/net.c
@@ -1082,6 +1082,9 @@ static int (* const net_client_init_fun[NET_CLIENT_DRIVER__MAX])(
 #ifdef CONFIG_NETMAP
         [NET_CLIENT_DRIVER_NETMAP]    = net_init_netmap,
 #endif
+#ifdef CONFIG_AF_XDP
+        [NET_CLIENT_DRIVER_AF_XDP]    = net_init_af_xdp,
+#endif
 #ifdef CONFIG_NET_BRIDGE
         [NET_CLIENT_DRIVER_BRIDGE]    = net_init_bridge,
 #endif
@@ -1186,6 +1189,9 @@ void show_netdevs(void)
 #ifdef CONFIG_NETMAP
         "netmap",
 #endif
+#ifdef CONFIG_AF_XDP
+        "af-xdp",
+#endif
 #ifdef CONFIG_POSIX
         "vhost-user",
 #endif
diff --git a/qapi/net.json b/qapi/net.json
index db67501308..88f2c982c2 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -408,6 +408,62 @@
     'ifname':     'str',
     '*devname':    'str' } }
 
+##
+# @AFXDPMode:
+#
+# Attach mode for a default XDP program
+#
+# @skb: generic mode, no driver support necessary
+#
+# @native: DRV mode, program is attached to a driver, packets are passed to
+#     the socket without allocation of skb.
+#
+# Since: 8.1
+##
+{ 'enum': 'AFXDPMode',
+  'data': [ 'native', 'skb' ] }
+
+##
+# @NetdevAFXDPOptions:
+#
+# AF_XDP network backend
+#
+# @ifname: The name of an existing network interface.
+#
+# @mode: Attach mode for a default XDP program.  If not specified, then
+#     'native' will be tried first, then 'skb'.
+#
+# @force-copy: Force XDP copy mode even if device supports zero-copy.
+#     (default: false)
+#
+# @queues: number of queues to be used for multiqueue interfaces (default: 1).
+#
+# @start-queue: Use @queues starting from this queue number (default: 0).
+#
+# @inhibit: Don't load a default XDP program, use one already loaded to
+#     the interface (default: false).  Requires @sock-fds or @xsks-map-fd.
+#
+# @sock-fds: A colon (:) separated list of file descriptors for already open
+#     but not bound AF_XDP sockets in the queue order.  One fd per queue.
+#     These descriptors should already be added into XDP socket map for
+#     corresponding queues.  Requires @inhibit.
+#
+# @xsks-map-fd: A file descriptor for an already open XDP socket map in
+#     the already loaded XDP program.  Requires @inhibit.
+#
+# Since: 8.1
+##
+{ 'struct': 'NetdevAFXDPOptions',
+  'data': {
+    'ifname':       'str',
+    '*mode':        'AFXDPMode',
+    '*force-copy':  'bool',
+    '*queues':      'int',
+    '*start-queue': 'int',
+    '*inhibit':     'bool',
+    '*sock-fds':    { 'type': 'str', 'if': 'HAVE_XSK_UMEM__CREATE_WITH_FD' },
+    '*xsks-map-fd': 'str' } }
+
 ##
 # @NetdevVhostUserOptions:
 #
@@ -642,13 +698,14 @@
 # @vmnet-bridged: since 7.1
 # @stream: since 7.2
 # @dgram: since 7.2
+# @af-xdp: since 8.1
 #
 # Since: 2.7
 ##
 { 'enum': 'NetClientDriver',
   'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream',
             'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user',
-            'vhost-vdpa',
+            'vhost-vdpa', 'af-xdp',
             { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' },
             { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' },
             { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] }
@@ -680,6 +737,7 @@
     'bridge':   'NetdevBridgeOptions',
     'hubport':  'NetdevHubPortOptions',
     'netmap':   'NetdevNetmapOptions',
+    'af-xdp':   'NetdevAFXDPOptions',
     'vhost-user': 'NetdevVhostUserOptions',
     'vhost-vdpa': 'NetdevVhostVDPAOptions',
     'vmnet-host': { 'type': 'NetdevVmnetHostOptions',
diff --git a/qemu-options.hx b/qemu-options.hx
index b57489d7ca..d91610701c 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2856,6 +2856,25 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
     "                VALE port (created on the fly) called 'name' ('nmname' is name of the \n"
     "                netmap device, defaults to '/dev/netmap')\n"
 #endif
+#ifdef CONFIG_AF_XDP
+    "-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n"
+    "         [,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k]\n"
+#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
+    "         [,sock-fds=x:y:...:z]\n"
+#endif
+    "                attach to the existing network interface 'name' with AF_XDP socket\n"
+    "                use 'mode=MODE' to specify an XDP program attach mode\n"
+    "                use 'force-copy=on|off' to force XDP copy mode even if device supports zero-copy (default: off)\n"
+    "                use 'inhibit=on|off' to inhibit loading of a default XDP program (default: off)\n"
+    "                with inhibit=on,\n"
+#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
+    "                  use 'sock-fds' to provide file descriptors for already open XDP sockets\n"
+    "                  added to a socket map in XDP program.  One socket per queue.  Or\n"
+#endif
+    "                  use 'xsks-map-fd=k' to provide a file descriptor for xsks map\n"
+    "                use 'queues=n' to specify how many queues of a multiqueue interface should be used\n"
+    "                use 'start-queue=m' to specify the first queue that should be used\n"
+#endif
 #ifdef CONFIG_POSIX
     "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n"
     "                configure a vhost-user network, backed by a chardev 'dev'\n"
@@ -2901,6 +2920,9 @@ DEF("nic", HAS_ARG, QEMU_OPTION_nic,
 #ifdef CONFIG_NETMAP
     "netmap|"
 #endif
+#ifdef CONFIG_AF_XDP
+    "af-xdp|"
+#endif
 #ifdef CONFIG_POSIX
     "vhost-user|"
 #endif
@@ -2929,6 +2951,9 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
 #ifdef CONFIG_NETMAP
     "netmap|"
 #endif
+#ifdef CONFIG_AF_XDP
+    "af-xdp|"
+#endif
 #ifdef CONFIG_VMNET
     "vmnet-host|vmnet-shared|vmnet-bridged|"
 #endif
@@ -2936,7 +2961,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
     "                old way to initialize a host network interface\n"
     "                (use the -netdev option if possible instead)\n", QEMU_ARCH_ALL)
 SRST
-``-nic [tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
+``-nic [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
     This option is a shortcut for configuring both the on-board
     (default) guest NIC hardware and the host network backend in one go.
     The host backend options are the same as with the corresponding
@@ -3350,6 +3375,62 @@ SRST
         # launch QEMU instance
         |qemu_system| linux.img -nic vde,sock=/tmp/myswitch
 
+``-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k][,sock-fds=x:y:...:z]``
+    Configure AF_XDP backend to connect to a network interface 'name'
+    using AF_XDP socket.  A specific program attach mode for a default
+    XDP program can be forced with 'mode', defaults to best-effort,
+    where the likely most performant mode will be in use.  Number of queues
+    'n' should generally match the number of queues in the interface,
+    defaults to 1.  Traffic arriving on non-configured device queues will
+    not be delivered to the network backend.
+
+    .. parsed-literal::
+
+        # set number of queues to 4
+        ethtool -L eth0 combined 4
+        # launch QEMU instance
+        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
+            -netdev af-xdp,id=n1,ifname=eth0,queues=4
+
+    'start-queue' option can be specified if a particular range of queues
+    [m, m + n - 1] should be in use.  For example, this is necessary in order
+    to use MLX NICs in native mode.  The driver will create a separate set
+    of queues on top of regular ones, and only these queues can be used
+    for AF_XDP sockets.  MLX NICs will also require an additional traffic
+    redirection with ethtool to these queues.  E.g.:
+
+    .. parsed-literal::
+
+        # set number of queues to 1
+        ethtool -L eth0 combined 1
+        # redirect all the traffic to the second queue (id: 1)
+        # note: mlx5 driver requires non-empty key/mask pair.
+        ethtool -N eth0 flow-type ether \\
+            dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
+        ethtool -N eth0 flow-type ether \\
+            dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
+        # launch QEMU instance
+        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
+            -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
+
+    XDP program can also be loaded externally.  In this case 'inhibit' option
+    should be set to 'on' and 'xsks-map-fd' provided with a file descriptor
+    for an open XDP socket map of that program, or 'sock-fds' with file
+    descriptors for already open but not bound XDP sockets already added to a
+    socket map for corresponding queues.  One socket per queue.
+
+    .. parsed-literal::
+
+        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
+            -netdev af-xdp,id=n1,ifname=eth0,queues=3,inhibit=on,sock-fds=15:16:17
+
+    With 'inhibit=on' and 'sock-fds', QEMU process will only require 32 MB
+    of locked memory (RLIMIT_MEMLOCK) per queue or CAP_IPC_LOCK capability.
+    With 'inhibit=on' and 'xsks-map-fd' it will additionally require
+    CAP_NET_RAW capability.  With 'inhibit=off', CAP_SYS_ADMIN and
+    CAP_NET_ADMIN should be added as well.
+
+
 ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]``
     Establish a vhost-user netdev, backed by a chardev id. The chardev
     should be a unix domain socket backed one. The vhost-user uses a
diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
index d02b09a4b9..7585c4c4ed 100755
--- a/scripts/ci/org.centos/stream/8/x86_64/configure
+++ b/scripts/ci/org.centos/stream/8/x86_64/configure
@@ -35,6 +35,7 @@
 --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
 --with-coroutine=ucontext \
 --tls-priority=@QEMU,SYSTEM \
+--disable-af-xdp \
 --disable-attr \
 --disable-auth-pam \
 --disable-avx2 \
diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
index 7dd5709ef4..162e455309 100644
--- a/scripts/meson-buildoptions.sh
+++ b/scripts/meson-buildoptions.sh
@@ -74,6 +74,7 @@ meson_options_help() {
   printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if available'
   printf "%s\n" '(unless built with --without-default-features):'
   printf "%s\n" ''
+  printf "%s\n" '  af-xdp          AF_XDP network backend support'
   printf "%s\n" '  alsa            ALSA sound support'
   printf "%s\n" '  attr            attr/xattr support'
   printf "%s\n" '  auth-pam        PAM access control'
@@ -207,6 +208,8 @@ meson_options_help() {
 }
 _meson_option_parse() {
   case $1 in
+    --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;;
+    --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;;
     --enable-alsa) printf "%s" -Dalsa=enabled ;;
     --disable-alsa) printf "%s" -Dalsa=disabled ;;
     --enable-attr) printf "%s" -Dattr=enabled ;;
diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
index e39871c7bb..207f7adfb9 100644
--- a/tests/docker/dockerfiles/debian-amd64.docker
+++ b/tests/docker/dockerfiles/debian-amd64.docker
@@ -97,6 +97,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
                       libvirglrenderer-dev \
                       libvte-2.91-dev \
                       libxen-dev \
+                      libxdp-dev \
                       libzstd-dev \
                       llvm \
                       locales \
-- 
2.40.1
Re: [PATCH v2] net: add initial support for AF_XDP network backend
Posted by Jason Wang 9 months, 3 weeks ago
On Thu, Jul 6, 2023 at 4:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> AF_XDP is a network socket family that allows communication directly
> with the network device driver in the kernel, bypassing most or all
> of the kernel networking stack.  In essence, the technology is
> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
> and works with any network interfaces without driver modifications.
> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> require access to character devices or unix sockets.  Only access to
> the network interface itself is necessary.
>
> This patch implements a network backend that communicates with the
> kernel by creating an AF_XDP socket.  A chunk of userspace memory
> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
> Fill and Completion) are placed in that memory along with a pool of
> memory buffers for the packet data.  Data transmission is done by
> allocating one of the buffers, copying packet data into it and
> placing the pointer into Tx ring.  After transmission, device will
> return the buffer via Completion ring.  On Rx, device will take
> a buffer from a pre-populated Fill ring, write the packet data into
> it and place the buffer into Rx ring.
>
> AF_XDP network backend takes on the communication with the host
> kernel and the network interface and forwards packets to/from the
> peer device in QEMU.
>
> Usage example:
>
>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>
> XDP program bridges the socket with a network interface.  It can be
> attached to the interface in 2 different modes:
>
> 1. skb - this mode should work for any interface and doesn't require
>          driver support.  With a caveat of lower performance.
>
> 2. native - this does require support from the driver and allows to
>             bypass skb allocation in the kernel and potentially use
>             zero-copy while getting packets in/out userspace.
>
> By default, QEMU will try to use native mode and fall back to skb.
> Mode can be forced via 'mode' option.  To force 'copy' even in native
> mode, use 'force-copy=on' option.  This might be useful if there is
> some issue with the driver.
>
> Option 'queues=N' allows to specify how many device queues should
> be open.  Note that all the queues that are not open are still
> functional and can receive traffic, but it will not be delivered to
> QEMU.  So, the number of device queues should generally match the
> QEMU configuration, unless the device is shared with something
> else and the traffic re-direction to appropriate queues is correctly
> configured on a device level (e.g. with ethtool -N).
> 'start-queue=M' option can be used to specify from which queue id
> QEMU should start configuring 'N' queues.  It might also be necessary
> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
> for examples.
>
> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> capabilities in order to load default XSK/XDP programs to the
> network interface and configure BPF maps.  It is possible, however,
> to run with no capabilities.  For that to work, an external process
> with admin capabilities will need to pre-load default XSK program,
> create AF_XDP sockets and pass their file descriptors to QEMU process
> on startup via 'sock-fds' option.  Network backend will need to be
> configured with 'inhibit=on' to avoid loading of the program.
> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
> or CAP_IPC_LOCK.
>
> Alternatively, the file descriptor for 'xsks_map' can be passed via
> 'xsks-map-fd=N' option instead of passing socket file descriptors.
> That will additionally require CAP_NET_RAW on QEMU side.  This is
> useful, because 'sock-fds' may not be available with older libxdp.
> 'sock-fds' requires libxdp >= 1.4.0.
>
> There are a few performance challenges with the current network backends.
>
> First is that they do not support IO threads.  This means that data
> path is handled by the main thread in QEMU and may slow down other
> work or may be slowed down by some other work.  This also means that
> taking advantage of multi-queue is generally not possible today.
>
> Another thing is that data path is going through the device emulation
> code, which is not really optimized for performance.  The fastest
> "frontend" device is virtio-net.  But it's not optimized for heavy
> traffic either, because it expects such use-cases to be handled via
> some implementation of vhost (user, kernel, vdpa).  In practice, we
> have virtio notifications and rcu lock/unlock on a per-packet basis
> and not very efficient accesses to the guest memory.  Communication
> channels between backend and frontend devices do not allow passing
> more than one packet at a time as well.
>
> Some of these challenges can be avoided in the future by adding better
> batching into device emulation or by implementing vhost-af-xdp variant.
>
> There are also a few kernel limitations.  AF_XDP sockets do not
> support any kinds of checksum or segmentation offloading.  Buffers
> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
> support implementation for AF_XDP is in progress, but not ready yet.
> Also, transmission in all non-zero-copy modes is synchronous, i.e.
> done in a syscall.  That doesn't allow high packet rates on virtual
> interfaces.
>
> However, keeping in mind all of these challenges, current implementation
> of the AF_XDP backend shows a decent performance while running on top
> of a physical NIC with zero-copy support.
>
> Test setup:
>
> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
> Network backend is configured to open the NIC directly in native mode.
> The driver supports zero-copy.  NIC is configured to use 1 queue.
>
> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> for PPS testing.
>
> iperf3 result:
>  TCP stream      : 19.1 Gbps
>
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>  Tx only         : 3.4 Mpps
>  Rx only         : 2.0 Mpps
>  L2 FWD Loopback : 1.5 Mpps
>
> In skb mode the same setup shows much lower performance, similar to
> the setup where pair of physical NICs is replaced with veth pair:
>
> iperf3 result:
>   TCP stream      : 9 Gbps
>
> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>   Tx only         : 1.2 Mpps
>   Rx only         : 1.0 Mpps
>   L2 FWD Loopback : 0.7 Mpps
>
> Results in skb mode or over the veth are close to results of a tap
> backend with vhost=on and disabled segmentation offloading bridged
> with a NIC.
>
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>

Looks good overall, see a few comments inline.

> ---
>
> Version 2:
>
>   - Added support for running with no capabilities by passing
>     pre-created AF_XDP socket file descriptors via 'sock-fds' option.
>     Conditionally compiled because it requires unreleased libxdp 1.4.0.
>     The last restriction is having 32 MB of RLIMIT_MEMLOCK per queue.
>
>   - Refined and extended documentation.
>
>
>  MAINTAINERS                                   |   4 +
>  hmp-commands.hx                               |   2 +-
>  meson.build                                   |  19 +
>  meson_options.txt                             |   2 +
>  net/af-xdp.c                                  | 570 ++++++++++++++++++
>  net/clients.h                                 |   5 +
>  net/meson.build                               |   3 +
>  net/net.c                                     |   6 +
>  qapi/net.json                                 |  60 +-
>  qemu-options.hx                               |  83 ++-
>  .../ci/org.centos/stream/8/x86_64/configure   |   1 +
>  scripts/meson-buildoptions.sh                 |   3 +
>  tests/docker/dockerfiles/debian-amd64.docker  |   1 +
>  13 files changed, 756 insertions(+), 3 deletions(-)
>  create mode 100644 net/af-xdp.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7164cf55a1..80d4ba4004 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2929,6 +2929,10 @@ W: http://info.iet.unipi.it/~luigi/netmap/
>  S: Maintained
>  F: net/netmap.c
>
> +AF_XDP network backend
> +R: Ilya Maximets <i.maximets@ovn.org>
> +F: net/af-xdp.c
> +
>  Host Memory Backends
>  M: David Hildenbrand <david@redhat.com>
>  M: Igor Mammedov <imammedo@redhat.com>
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 2cbd0f77a0..af9ffe4681 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -1295,7 +1295,7 @@ ERST
>      {
>          .name       = "netdev_add",
>          .args_type  = "netdev:O",
> -        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user"
> +        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user"
>  #ifdef CONFIG_VMNET
>                        "|vmnet-host|vmnet-shared|vmnet-bridged"
>  #endif
> diff --git a/meson.build b/meson.build
> index a9ba0bfab3..1f8772ea5d 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1891,6 +1891,18 @@ if libbpf.found() and not cc.links('''
>    endif
>  endif
>
> +# libxdp
> +libxdp = dependency('libxdp', required: get_option('af_xdp'), method: 'pkg-config')
> +if libxdp.found() and \
> +      not (libbpf.found() and libbpf.version().version_compare('>=0.7'))
> +  libxdp = not_found
> +  if get_option('af_xdp').enabled()
> +    error('af-xdp support requires libbpf version >= 0.7')

Can we simply limit this to 1.4?


> +  else
> +    warning('af-xdp support requires libbpf version >= 0.7, disabling')
> +  endif
> +endif
> +
>  # libdw
>  libdw = not_found
>  if not get_option('libdw').auto() or \
> @@ -2112,6 +2124,12 @@ config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', get_option('hexagon_idef_pars
>  config_host_data.set('CONFIG_LIBATTR', have_old_libattr)
>  config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found())
>  config_host_data.set('CONFIG_EBPF', libbpf.found())
> +config_host_data.set('CONFIG_AF_XDP', libxdp.found())
> +if libxdp.found()
> +  config_host_data.set('HAVE_XSK_UMEM__CREATE_WITH_FD',
> +                       cc.has_function('xsk_umem__create_with_fd',
> +                                       dependencies: libxdp))
> +endif
>  config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found())
>  config_host_data.set('CONFIG_LIBISCSI', libiscsi.found())
>  config_host_data.set('CONFIG_LIBNFS', libnfs.found())
> @@ -4285,6 +4303,7 @@ summary_info += {'PVRDMA support':    have_pvrdma}
>  summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
>  summary_info += {'libcap-ng support': libcap_ng}
>  summary_info += {'bpf support':       libbpf}
> +summary_info += {'AF_XDP support':    libxdp}
>  summary_info += {'rbd support':       rbd}
>  summary_info += {'smartcard support': cacard}
>  summary_info += {'U2F support':       u2f}
> diff --git a/meson_options.txt b/meson_options.txt
> index bbb5c7e886..f4e950ce6a 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -120,6 +120,8 @@ option('avx512bw', type: 'feature', value: 'auto',
>  option('keyring', type: 'feature', value: 'auto',
>         description: 'Linux keyring support')
>
> +option('af_xdp', type : 'feature', value : 'auto',
> +       description: 'AF_XDP network backend support')
>  option('attr', type : 'feature', value : 'auto',
>         description: 'attr/xattr support')
>  option('auth_pam', type : 'feature', value : 'auto',
> diff --git a/net/af-xdp.c b/net/af-xdp.c
> new file mode 100644
> index 0000000000..265ba6b12e
> --- /dev/null
> +++ b/net/af-xdp.c
> @@ -0,0 +1,570 @@
> +/*
> + * AF_XDP network backend.
> + *
> + * Copyright (c) 2023 Red Hat, Inc.
> + *
> + * Authors:
> + *  Ilya Maximets <i.maximets@ovn.org>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +
> +#include "qemu/osdep.h"
> +#include <bpf/bpf.h>
> +#include <inttypes.h>
> +#include <linux/if_link.h>
> +#include <linux/if_xdp.h>
> +#include <net/if.h>
> +#include <xdp/xsk.h>
> +
> +#include "clients.h"
> +#include "monitor/monitor.h"
> +#include "net/net.h"
> +#include "qapi/error.h"
> +#include "qemu/cutils.h"
> +#include "qemu/error-report.h"
> +#include "qemu/iov.h"
> +#include "qemu/main-loop.h"
> +#include "qemu/memalign.h"
> +
> +
> +typedef struct AFXDPState {
> +    NetClientState       nc;
> +
> +    struct xsk_socket    *xsk;
> +    struct xsk_ring_cons rx;
> +    struct xsk_ring_prod tx;
> +    struct xsk_ring_cons cq;
> +    struct xsk_ring_prod fq;
> +
> +    char                 ifname[IFNAMSIZ];
> +    int                  ifindex;
> +    bool                 read_poll;
> +    bool                 write_poll;
> +    uint32_t             outstanding_tx;
> +
> +    uint64_t             *pool;
> +    uint32_t             n_pool;
> +    char                 *buffer;
> +    struct xsk_umem      *umem;
> +
> +    uint32_t             n_queues;
> +    uint32_t             xdp_flags;
> +    bool                 inhibit;
> +} AFXDPState;
> +
> +#define AF_XDP_BATCH_SIZE 64
> +
> +static void af_xdp_send(void *opaque);
> +static void af_xdp_writable(void *opaque);
> +
> +/* Set the event-loop handlers for the af-xdp backend. */
> +static void af_xdp_update_fd_handler(AFXDPState *s)
> +{
> +    qemu_set_fd_handler(xsk_socket__fd(s->xsk),
> +                        s->read_poll ? af_xdp_send : NULL,
> +                        s->write_poll ? af_xdp_writable : NULL,
> +                        s);
> +}
> +
> +/* Update the read handler. */
> +static void af_xdp_read_poll(AFXDPState *s, bool enable)
> +{
> +    if (s->read_poll != enable) {
> +        s->read_poll = enable;
> +        af_xdp_update_fd_handler(s);
> +    }
> +}
> +
> +/* Update the write handler. */
> +static void af_xdp_write_poll(AFXDPState *s, bool enable)
> +{
> +    if (s->write_poll != enable) {
> +        s->write_poll = enable;
> +        af_xdp_update_fd_handler(s);
> +    }
> +}
> +
> +static void af_xdp_poll(NetClientState *nc, bool enable)
> +{
> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> +
> +    if (s->read_poll != enable || s->write_poll != enable) {
> +        s->write_poll = enable;
> +        s->read_poll  = enable;
> +        af_xdp_update_fd_handler(s);
> +    }
> +}
> +
> +static void af_xdp_complete_tx(AFXDPState *s)
> +{
> +    uint32_t idx = 0;
> +    uint32_t done, i;
> +    uint64_t *addr;
> +
> +    done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, &idx);
> +
> +    for (i = 0; i < done; i++) {
> +        addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++);
> +        s->pool[s->n_pool++] = *addr;
> +        s->outstanding_tx--;
> +    }
> +
> +    if (done) {
> +        xsk_ring_cons__release(&s->cq, done);
> +    }
> +}
> +
> +/*
> + * The fd_write() callback, invoked if the fd is marked as writable
> + * after a poll.
> + */
> +static void af_xdp_writable(void *opaque)
> +{
> +    AFXDPState *s = opaque;
> +
> +    /* Try to recover buffers that are already sent. */
> +    af_xdp_complete_tx(s);
> +
> +    /*
> +     * Unregister the handler, unless we still have packets to transmit
> +     * and kernel needs a wake up.
> +     */
> +    if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) {
> +        af_xdp_write_poll(s, false);
> +    }
> +
> +    /* Flush any buffered packets. */
> +    qemu_flush_queued_packets(&s->nc);
> +}
> +
> +static ssize_t af_xdp_receive(NetClientState *nc,
> +                              const uint8_t *buf, size_t size)
> +{
> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> +    struct xdp_desc *desc;
> +    uint32_t idx;
> +    void *data;
> +
> +    /* Try to recover buffers that are already sent. */
> +    af_xdp_complete_tx(s);
> +
> +    if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) {
> +        /* We can't transmit packet this size... */
> +        return size;
> +    }
> +
> +    if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) {
> +        /*
> +         * Out of buffers or space in tx ring.  Poll until we can write.
> +         * This will also kick the Tx, if it was waiting on CQ.
> +         */
> +        af_xdp_write_poll(s, true);
> +        return 0;
> +    }
> +
> +    desc = xsk_ring_prod__tx_desc(&s->tx, idx);
> +    desc->addr = s->pool[--s->n_pool];
> +    desc->len = size;
> +
> +    data = xsk_umem__get_data(s->buffer, desc->addr);
> +    memcpy(data, buf, size);
> +
> +    xsk_ring_prod__submit(&s->tx, 1);
> +    s->outstanding_tx++;
> +
> +    if (xsk_ring_prod__needs_wakeup(&s->tx)) {
> +        af_xdp_write_poll(s, true);
> +    }
> +
> +    return size;
> +}
> +
> +/*
> + * Complete a previous send (backend --> guest) and enable the
> + * fd_read callback.
> + */
> +static void af_xdp_send_completed(NetClientState *nc, ssize_t len)
> +{
> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> +
> +    af_xdp_read_poll(s, true);
> +}
> +
> +static void af_xdp_fq_refill(AFXDPState *s, uint32_t n)
> +{
> +    uint32_t i, idx = 0;
> +
> +    /* Leave one packet for Tx, just in case. */
> +    if (s->n_pool < n + 1) {
> +        n = s->n_pool;
> +    }
> +
> +    if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) {
> +        return;
> +    }
> +
> +    for (i = 0; i < n; i++) {
> +        *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool];
> +    }
> +    xsk_ring_prod__submit(&s->fq, n);
> +
> +    if (xsk_ring_prod__needs_wakeup(&s->fq)) {
> +        /* Receive was blocked by not having enough buffers.  Wake it up. */
> +        af_xdp_read_poll(s, true);
> +    }
> +}
> +
> +static void af_xdp_send(void *opaque)
> +{
> +    uint32_t i, n_rx, idx = 0;
> +    AFXDPState *s = opaque;
> +
> +    n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx);
> +    if (!n_rx) {
> +        return;
> +    }
> +
> +    for (i = 0; i < n_rx; i++) {
> +        const struct xdp_desc *desc;
> +        struct iovec iov;
> +
> +        desc = xsk_ring_cons__rx_desc(&s->rx, idx++);
> +
> +        iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr);
> +        iov.iov_len = desc->len;
> +
> +        s->pool[s->n_pool++] = desc->addr;
> +
> +        if (!qemu_sendv_packet_async(&s->nc, &iov, 1,
> +                                     af_xdp_send_completed)) {
> +            /*
> +             * The peer does not receive anymore.  Packet is queued, stop
> +             * reading from the backend until af_xdp_send_completed().
> +             */
> +            af_xdp_read_poll(s, false);
> +
> +            /* Re-peek the descriptors to not break the ring cache. */
> +            xsk_ring_cons__cancel(&s->rx, n_rx);
> +            n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx);

The code turns out to be hard to read here.

1) This seems to undo the peek. Usually a peek doesn't touch the
producer/consumer indices, but that is not what xsk_ring_cons__peek()
does:

static inline __u32 xsk_ring_cons__peek(struct xsk_ring_cons *cons,
                                        __u32 nb, __u32 *idx)
{
        __u32 entries = xsk_cons_nb_avail(cons, nb);

        if (entries > 0) {
                *idx = cons->cached_cons;
                cons->cached_cons += entries;
        }

        return entries;
}

2) It looks to me like a partial rollback would be sufficient?

xsk_ring_cons__cancel(&s->rx, n_rx - (i + 1))?
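
To make the partial rollback concrete, here is a rough sketch of the
cursor arithmetic. It assumes "partial rollback" means giving back only
the descriptors after the queued one, i.e. n_rx - (i + 1) entries, and
that cancel simply moves cached_cons back. The ring_model struct and the
model_* helpers are made up for illustration: they only mimic the cursors
and leave out the shared-memory producer refresh that the real helpers do.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct xsk_ring_cons: cursors only. */
struct ring_model {
    uint32_t cached_prod;  /* entries the producer has made available */
    uint32_t cached_cons;  /* local cursor, advanced by peek */
    uint32_t consumer;     /* cursor published to the producer by release */
};

/* Same cursor arithmetic as the xsk_ring_cons__peek() quoted above. */
static uint32_t model_peek(struct ring_model *r, uint32_t nb, uint32_t *idx)
{
    uint32_t entries = r->cached_prod - r->cached_cons;

    if (entries > nb) {
        entries = nb;
    }
    if (entries > 0) {
        *idx = r->cached_cons;
        r->cached_cons += entries;
    }
    return entries;
}

/* Give back nb peeked but unprocessed entries (assumed cancel semantics). */
static void model_cancel(struct ring_model *r, uint32_t nb)
{
    assert(nb <= r->cached_cons - r->consumer);
    r->cached_cons -= nb;
}

/* Publish nb processed entries to the producer side. */
static void model_release(struct ring_model *r, uint32_t nb)
{
    r->consumer += nb;
}

/*
 * Consume up to 'batch' entries.  The peer accepts entries
 * 0..stall_at-1 and queues entry 'stall_at'; nothing after that may be
 * handed over.  Returns the number of entries released.
 */
static uint32_t model_consume(struct ring_model *r, uint32_t batch,
                              uint32_t stall_at)
{
    uint32_t idx = 0, i;
    uint32_t n_rx = model_peek(r, batch, &idx);

    for (i = 0; i < n_rx; i++) {
        if (i == stall_at) {
            /* Entries 0..i are owned by the peer; roll back the rest. */
            model_cancel(r, n_rx - (i + 1));
            n_rx = i + 1;
            break;
        }
    }
    model_release(r, n_rx);
    return n_rx;
}
```

With 10 entries available, a batch of 8 and a stall on the third entry,
this releases 3 entries and leaves cached_cons equal to consumer. That is
the same end state as the cancel-everything-then-re-peek sequence in the
patch, without the second peek.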

> +            g_assert(n_rx == i + 1);
> +            break;
> +        }
> +    }
> +
> +    /* Release actually sent descriptors and try to re-fill. */
> +    xsk_ring_cons__release(&s->rx, n_rx);
> +    af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE);
> +}
> +
> +/* Flush and close. */
> +static void af_xdp_cleanup(NetClientState *nc)
> +{
> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> +
> +    qemu_purge_queued_packets(nc);
> +
> +    af_xdp_poll(nc, false);
> +
> +    xsk_socket__delete(s->xsk);
> +    s->xsk = NULL;
> +    g_free(s->pool);
> +    s->pool = NULL;
> +    xsk_umem__delete(s->umem);
> +    s->umem = NULL;
> +    qemu_vfree(s->buffer);
> +    s->buffer = NULL;
> +
> +    /* Remove the program if it's the last open queue. */
> +    if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags
> +        && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) {
> +        fprintf(stderr,
> +                "af-xdp: unable to remove XDP program from '%s', ifindex: %d\n",
> +                s->ifname, s->ifindex);
> +    }
> +}
> +
> +static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp)
> +{
> +    struct xsk_umem_config config = {
> +        .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
> +        .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
> +        .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
> +        .frame_headroom = 0,
> +    };
> +    uint64_t n_descs;
> +    uint64_t size;
> +    int64_t i;
> +    int ret;
> +
> +    /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. */
> +    n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS
> +               + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2;
> +    size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE;
> +
> +    s->buffer = qemu_memalign(qemu_real_host_page_size(), size);
> +    memset(s->buffer, 0, size);
> +
> +    if (sock_fd < 0) {
> +        ret = xsk_umem__create(&s->umem, s->buffer, size,
> +                               &s->fq, &s->cq, &config);
> +    } else {
> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> +        ret = xsk_umem__create_with_fd(&s->umem, sock_fd, s->buffer, size,
> +                                       &s->fq, &s->cq, &config);
> +#else

So sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD won't work. We'd better

1) disable sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD

or

2) disable af_xdp without HAVE_XSK_UMEM__CREATE_WITH_FD

> +        ret = -1;
> +        errno = EINVAL;
> +#endif
> +    }
> +
> +    if (ret) {
> +        qemu_vfree(s->buffer);
> +        error_setg_errno(errp, errno,
> +                         "failed to create umem for %s queue_index: %d",
> +                         s->ifname, s->nc.queue_index);
> +        return -1;
> +    }
> +
> +    s->pool = g_new(uint64_t, n_descs);
> +    /* Fill the pool in the opposite order, because it's a LIFO queue. */
> +    for (i = n_descs - 1; i >= 0; i--) {
> +        s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE;
> +    }
> +    s->n_pool = n_descs;
> +
> +    af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS);
> +
> +    return 0;
> +}
> +
> +static int af_xdp_socket_create(AFXDPState *s,
> +                                const NetdevAFXDPOptions *opts,
> +                                int xsks_map_fd, Error **errp)
> +{
> +    struct xsk_socket_config cfg = {
> +        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
> +        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
> +        .libxdp_flags = 0,
> +        .bind_flags = XDP_USE_NEED_WAKEUP,
> +        .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
> +    };
> +    int queue_id, error = 0;
> +
> +    s->inhibit = opts->has_inhibit && opts->inhibit;
> +    if (s->inhibit) {
> +        cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD;
> +    }
> +
> +    if (opts->has_force_copy && opts->force_copy) {
> +        cfg.bind_flags |= XDP_COPY;
> +    }
> +
> +    queue_id = s->nc.queue_index;
> +    if (opts->has_start_queue && opts->start_queue > 0) {
> +        queue_id += opts->start_queue;
> +    }
> +
> +    if (opts->has_mode) {
> +        /* Specific mode requested. */
> +        cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE)
> +                         ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;
> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
> +                               s->umem, &s->rx, &s->tx, &cfg)) {
> +            error = errno;
> +        }
> +    } else {
> +        /* No mode requested, try native first. */
> +        cfg.xdp_flags |= XDP_FLAGS_DRV_MODE;
> +
> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
> +                               s->umem, &s->rx, &s->tx, &cfg)) {
> +            /* Can't use native mode, try skb. */
> +            cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE;
> +            cfg.xdp_flags |= XDP_FLAGS_SKB_MODE;
> +
> +            if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
> +                                   s->umem, &s->rx, &s->tx, &cfg)) {
> +                error = errno;
> +            }
> +        }
> +    }
> +
> +    if (error) {
> +        error_setg_errno(errp, error,
> +                         "failed to create AF_XDP socket for %s queue_id: %d",
> +                         s->ifname, queue_id);
> +        return -1;
> +    }
> +
> +    if (s->inhibit && xsks_map_fd >= 0) {
> +        int xsk_fd = xsk_socket__fd(s->xsk);
> +
> +        /* Need to update the map manually, libxdp skipped that step. */
> +        error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0);
> +        if (error) {
> +            error_setg_errno(errp, error,
> +                             "failed to update xsks map for %s queue_id: %d",
> +                             s->ifname, queue_id);
> +            return -1;
> +        }
> +    }
> +
> +    s->xdp_flags = cfg.xdp_flags;
> +
> +    return 0;
> +}
> +
> +/* NetClientInfo methods. */
> +static NetClientInfo net_af_xdp_info = {
> +    .type = NET_CLIENT_DRIVER_AF_XDP,
> +    .size = sizeof(AFXDPState),
> +    .receive = af_xdp_receive,
> +    .poll = af_xdp_poll,
> +    .cleanup = af_xdp_cleanup,
> +};
> +
> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> +static int *parse_socket_fds(const char *sock_fds_str,
> +                             int64_t n_expected, Error **errp)
> +{
> +    gchar **substrings = g_strsplit(sock_fds_str, ":", -1);
> +    int64_t i, n_sock_fds = g_strv_length(substrings);
> +    int *sock_fds = NULL;
> +
> +    if (n_sock_fds != n_expected) {
> +        error_setg(errp, "expected %"PRIi64" socket fds, got %"PRIi64,
> +                   n_expected, n_sock_fds);
> +        goto exit;
> +    }
> +
> +    sock_fds = g_new(int, n_sock_fds);
> +
> +    for (i = 0; i < n_sock_fds; i++) {
> +        sock_fds[i] = monitor_fd_param(monitor_cur(), substrings[i], errp);
> +        if (sock_fds[i] < 0) {
> +            g_free(sock_fds);
> +            sock_fds = NULL;
> +            goto exit;
> +        }
> +    }
> +
> +exit:
> +    g_strfreev(substrings);
> +    return sock_fds;
> +}
> +#endif
> +
> +/*
> + * The exported init function.
> + *
> + * ... -net af-xdp,ifname="..."

This is the legacy command-line syntax; let's say -netdev af-xdp,... here.

> + */
> +int net_init_af_xdp(const Netdev *netdev,
> +                    const char *name, NetClientState *peer, Error **errp)
> +{
> +    const NetdevAFXDPOptions *opts = &netdev->u.af_xdp;
> +    NetClientState *nc, *nc0 = NULL;
> +    unsigned int ifindex;
> +    uint32_t prog_id = 0;
> +    int *sock_fds = NULL;
> +    int xsks_map_fd = -1;
> +    int64_t i, queues;
> +    Error *err = NULL;
> +    AFXDPState *s;
> +
> +    ifindex = if_nametoindex(opts->ifname);
> +    if (!ifindex) {
> +        error_setg_errno(errp, errno, "failed to get ifindex for '%s'",
> +                         opts->ifname);
> +        return -1;
> +    }
> +
> +    queues = opts->has_queues ? opts->queues : 1;
> +    if (queues < 1) {
> +        error_setg(errp, "invalid number of queues (%" PRIi64 ") for '%s'",
> +                   queues, opts->ifname);
> +        return -1;
> +    }
> +
> +#ifndef HAVE_XSK_UMEM__CREATE_WITH_FD
> +    if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) {
> +        error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' together");
> +        return -1;
> +    }
> +#else
> +    if ((opts->has_inhibit && opts->inhibit)
> +        != (opts->xsks_map_fd || opts->sock_fds)) {
> +        error_setg(errp, "'inhibit=on' should be used together with "
> +                         "'sock-fds' or 'xsks-map-fd'");
> +        return -1;
> +    }
> +
> +    if (opts->xsks_map_fd && opts->sock_fds) {
> +        error_setg(errp, "'sock-fds' and 'xsks-map-fd' are mutually exclusive");
> +        return -1;
> +    }
> +
> +    if (opts->sock_fds) {
> +        sock_fds = parse_socket_fds(opts->sock_fds, queues, errp);
> +        if (!sock_fds) {
> +            return -1;
> +        }
> +    }
> +#endif
> +
> +    if (opts->xsks_map_fd) {
> +        xsks_map_fd = monitor_fd_param(monitor_cur(), opts->xsks_map_fd, errp);
> +        if (xsks_map_fd < 0) {
> +            goto err;
> +        }
> +    }
> +
> +    for (i = 0; i < queues; i++) {
> +        nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name);
> +        qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname);
> +        nc->queue_index = i;
> +
> +        if (!nc0) {
> +            nc0 = nc;
> +        }
> +
> +        s = DO_UPCAST(AFXDPState, nc, nc);
> +
> +        pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname);
> +        s->ifindex = ifindex;
> +        s->n_queues = queues;
> +
> +        if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp)
> +            || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) {
> +            /* Make sure the XDP program will be removed. */
> +            s->n_queues = i;
> +            error_propagate(errp, err);
> +            goto err;
> +        }
> +    }
> +
> +    if (nc0) {
> +        s = DO_UPCAST(AFXDPState, nc, nc0);
> +        if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || !prog_id) {
> +            error_setg_errno(errp, errno,
> +                             "no XDP program loaded on '%s', ifindex: %d",
> +                             s->ifname, s->ifindex);
> +            goto err;
> +        }
> +    }
> +
> +    af_xdp_read_poll(s, true); /* Initially only poll for reads. */
> +
> +    return 0;
> +
> +err:
> +    g_free(sock_fds);
> +    if (nc0) {
> +        qemu_del_net_client(nc0);
> +    }
> +
> +    return -1;
> +}
> diff --git a/net/clients.h b/net/clients.h
> index ed8bdfff1e..be53794582 100644
> --- a/net/clients.h
> +++ b/net/clients.h
> @@ -64,6 +64,11 @@ int net_init_netmap(const Netdev *netdev, const char *name,
>                      NetClientState *peer, Error **errp);
>  #endif
>
> +#ifdef CONFIG_AF_XDP
> +int net_init_af_xdp(const Netdev *netdev, const char *name,
> +                    NetClientState *peer, Error **errp);
> +#endif
> +
>  int net_init_vhost_user(const Netdev *netdev, const char *name,
>                          NetClientState *peer, Error **errp);
>
> diff --git a/net/meson.build b/net/meson.build
> index bdf564a57b..61628d4684 100644
> --- a/net/meson.build
> +++ b/net/meson.build
> @@ -36,6 +36,9 @@ system_ss.add(when: vde, if_true: files('vde.c'))
>  if have_netmap
>    system_ss.add(files('netmap.c'))
>  endif
> +
> +system_ss.add(when: libxdp, if_true: files('af-xdp.c'))
> +
>  if have_vhost_net_user
>    system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('vhost-user.c'), if_false: files('vhost-user-stub.c'))
>    system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c'))
> diff --git a/net/net.c b/net/net.c
> index 6492ad530e..127f70932b 100644
> --- a/net/net.c
> +++ b/net/net.c
> @@ -1082,6 +1082,9 @@ static int (* const net_client_init_fun[NET_CLIENT_DRIVER__MAX])(
>  #ifdef CONFIG_NETMAP
>          [NET_CLIENT_DRIVER_NETMAP]    = net_init_netmap,
>  #endif
> +#ifdef CONFIG_AF_XDP
> +        [NET_CLIENT_DRIVER_AF_XDP]    = net_init_af_xdp,
> +#endif
>  #ifdef CONFIG_NET_BRIDGE
>          [NET_CLIENT_DRIVER_BRIDGE]    = net_init_bridge,
>  #endif
> @@ -1186,6 +1189,9 @@ void show_netdevs(void)
>  #ifdef CONFIG_NETMAP
>          "netmap",
>  #endif
> +#ifdef CONFIG_AF_XDP
> +        "af-xdp",
> +#endif
>  #ifdef CONFIG_POSIX
>          "vhost-user",
>  #endif
> diff --git a/qapi/net.json b/qapi/net.json
> index db67501308..88f2c982c2 100644
> --- a/qapi/net.json
> +++ b/qapi/net.json
> @@ -408,6 +408,62 @@
>      'ifname':     'str',
>      '*devname':    'str' } }
>
> +##
> +# @AFXDPMode:
> +#
> +# Attach mode for a default XDP program
> +#
> +# @skb: generic mode, no driver support necessary
> +#
> +# @native: DRV mode, program is attached to a driver, packets are passed to
> +#     the socket without allocation of skb.
> +#
> +# Since: 8.1

I'd make it 8.2.

> +##
> +{ 'enum': 'AFXDPMode',
> +  'data': [ 'native', 'skb' ] }
> +
> +##
> +# @NetdevAFXDPOptions:
> +#
> +# AF_XDP network backend
> +#
> +# @ifname: The name of an existing network interface.
> +#
> +# @mode: Attach mode for a default XDP program.  If not specified, then
> +#     'native' will be tried first, then 'skb'.
> +#
> +# @force-copy: Force XDP copy mode even if device supports zero-copy.
> +#     (default: false)
> +#
> +# @queues: number of queues to be used for multiqueue interfaces (default: 1).
> +#
> +# @start-queue: Use @queues starting from this queue number (default: 0).
> +#
> +# @inhibit: Don't load a default XDP program, use one already loaded to
> +#     the interface (default: false).  Requires @sock-fds or @xsks-map-fd.
> +#
> +# @sock-fds: A colon (:) separated list of file descriptors for already open
> +#     but not bound AF_XDP sockets in the queue order.  One fd per queue.
> +#     These descriptors should already be added into XDP socket map for
> +#     corresponding queues.  Requires @inhibit.
> +#
> +# @xsks-map-fd: A file descriptor for an already open XDP socket map in
> +#     the already loaded XDP program.  Requires @inhibit.
> +#
> +# Since: 8.1
> +##
> +{ 'struct': 'NetdevAFXDPOptions',
> +  'data': {
> +    'ifname':       'str',
> +    '*mode':        'AFXDPMode',
> +    '*force-copy':  'bool',
> +    '*queues':      'int',
> +    '*start-queue': 'int',
> +    '*inhibit':     'bool',
> +    '*sock-fds':    { 'type': 'str', 'if': 'HAVE_XSK_UMEM__CREATE_WITH_FD' },
> +    '*xsks-map-fd': 'str' } }
> +
>  ##
>  # @NetdevVhostUserOptions:
>  #
> @@ -642,13 +698,14 @@
>  # @vmnet-bridged: since 7.1
>  # @stream: since 7.2
>  # @dgram: since 7.2
> +# @af-xdp: since 8.1
>  #
>  # Since: 2.7
>  ##
>  { 'enum': 'NetClientDriver',
>    'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream',
>              'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user',
> -            'vhost-vdpa',
> +            'vhost-vdpa', 'af-xdp',
>              { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' },
>              { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' },
>              { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] }
> @@ -680,6 +737,7 @@
>      'bridge':   'NetdevBridgeOptions',
>      'hubport':  'NetdevHubPortOptions',
>      'netmap':   'NetdevNetmapOptions',
> +    'af-xdp':   'NetdevAFXDPOptions',
>      'vhost-user': 'NetdevVhostUserOptions',
>      'vhost-vdpa': 'NetdevVhostVDPAOptions',
>      'vmnet-host': { 'type': 'NetdevVmnetHostOptions',
> diff --git a/qemu-options.hx b/qemu-options.hx
> index b57489d7ca..d91610701c 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -2856,6 +2856,25 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
>      "                VALE port (created on the fly) called 'name' ('nmname' is name of the \n"
>      "                netmap device, defaults to '/dev/netmap')\n"
>  #endif
> +#ifdef CONFIG_AF_XDP
> +    "-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n"
> +    "         [,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k]\n"
> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> +    "         [,sock-fds=x:y:...:z]\n"
> +#endif
> +    "                attach to the existing network interface 'name' with AF_XDP socket\n"
> +    "                use 'mode=MODE' to specify an XDP program attach mode\n"
> +    "                use 'force-copy=on|off' to force XDP copy mode even if device supports zero-copy (default: off)\n"
> +    "                use 'inhibit=on|off' to inhibit loading of a default XDP program (default: off)\n"
> +    "                with inhibit=on,\n"
> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> +    "                  use 'sock-fds' to provide file descriptors for already open XDP sockets\n"
> +    "                  added to a socket map in XDP program.  One socket per queue.  Or\n"
> +#endif
> +    "                  use 'xsks-map-fd=k' to provide a file descriptor for xsks map\n"
> +    "                use 'queues=n' to specify how many queues of a multiqueue interface should be used\n"
> +    "                use 'start-queue=m' to specify the first queue that should be used\n"
> +#endif
>  #ifdef CONFIG_POSIX
>      "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n"
>      "                configure a vhost-user network, backed by a chardev 'dev'\n"
> @@ -2901,6 +2920,9 @@ DEF("nic", HAS_ARG, QEMU_OPTION_nic,
>  #ifdef CONFIG_NETMAP
>      "netmap|"
>  #endif
> +#ifdef CONFIG_AF_XDP
> +    "af-xdp|"
> +#endif
>  #ifdef CONFIG_POSIX
>      "vhost-user|"
>  #endif
> @@ -2929,6 +2951,9 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>  #ifdef CONFIG_NETMAP
>      "netmap|"
>  #endif
> +#ifdef CONFIG_AF_XDP
> +    "af-xdp|"
> +#endif
>  #ifdef CONFIG_VMNET
>      "vmnet-host|vmnet-shared|vmnet-bridged|"
>  #endif
> @@ -2936,7 +2961,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>      "                old way to initialize a host network interface\n"
>      "                (use the -netdev option if possible instead)\n", QEMU_ARCH_ALL)
>  SRST
> -``-nic [tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
> +``-nic [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
>      This option is a shortcut for configuring both the on-board
>      (default) guest NIC hardware and the host network backend in one go.
>      The host backend options are the same as with the corresponding
> @@ -3350,6 +3375,62 @@ SRST
>          # launch QEMU instance
>          |qemu_system| linux.img -nic vde,sock=/tmp/myswitch
>
> +``-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k][,sock-fds=x:y:...:z]``
> +    Configure AF_XDP backend to connect to a network interface 'name'
> +    using AF_XDP socket.  A specific program attach mode for a default
> +    XDP program can be forced with 'mode', defaults to best-effort,
> +    where the likely most performant mode will be in use.  Number of queues
> +    'n' should generally match the number of queues in the interface,
> +    defaults to 1.  Traffic arriving on non-configured device queues will
> +    not be delivered to the network backend.
> +
> +    .. parsed-literal::
> +
> +        # set number of queues to 4
> +        ethtool -L eth0 combined 4
> +        # launch QEMU instance
> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
> +            -netdev af-xdp,id=n1,ifname=eth0,queues=4
> +
> +    'start-queue' option can be specified if a particular range of queues
> +    [m, m + n) should be in use.  For example, this is necessary in order
> +    to use MLX NICs in native mode.  The driver will create a separate set
> +    of queues on top of regular ones, and only these queues can be used
> +    for AF_XDP sockets.  MLX NICs will also require an additional traffic
> +    redirection with ethtool to these queues.

Let's avoid mentioning any vendor name unless the emulation is done
for a specific one.

> + E.g.:
> +
> +    .. parsed-literal::
> +
> +        # set number of queues to 1
> +        ethtool -L eth0 combined 1
> +        # redirect all the traffic to the second queue (id: 1)
> +        # note: mlx5 driver requires non-empty key/mask pair.
> +        ethtool -N eth0 flow-type ether \\
> +            dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
> +        ethtool -N eth0 flow-type ether \\
> +            dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
> +        # launch QEMU instance
> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
> +            -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
> +
> +    XDP program can also be loaded externally.  In this case 'inhibit' option
> +    should be set to 'on' and 'xsks-map-fd' provided with a file descriptor
> +    for an open XDP socket map of that program, or 'sock-fds' with file
> +    descriptors for already open but not bound XDP sockets already added to a
> +    socket map for corresponding queues.  One socket per queue.
> +
> +    .. parsed-literal::
> +
> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
> +            -netdev af-xdp,id=n1,ifname=eth0,queues=3,inhibit=on,sock-fds=15:16:17
> +
> +    With 'inhibit=on' and 'sock-fds', QEMU process will only require 32 MB
> +    of locked memory (RLIMIT_MEMLOCK) per queue or CAP_IPC_LOCK capability.
> +    With 'inhibit=on' and 'xsks-map-fd' it will additionally require
> +    CAP_NET_RAW capability.  With 'inhibit=off', CAP_SYS/NET_ADMIN should be
> +    added as well.

This would have to be kept in sync with code changes, so I suggest not
describing:

1) the actual memory requirement numbers
2) the capabilities required for each mode; we don't do that for other
netdevs like tap.
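
For reference, the 32 MB figure follows directly from the constants used
in af_xdp_umem_create(), which is why it would drift if those change.
A minimal sketch of the arithmetic, assuming the libxdp defaults of 2048
descriptors per ring and 4096-byte frames (the macro and function names
below are illustrative, not part of the patch):

```c
#include <stdint.h>

/* Assumed to match the libxdp <xdp/xsk.h> defaults. */
#define NUM_DESCS   2048u   /* XSK_RING_{PROD,CONS}__DEFAULT_NUM_DESCS */
#define FRAME_SIZE  4096u   /* XSK_UMEM__DEFAULT_FRAME_SIZE */

/* Bytes of umem allocated per queue, mirroring af_xdp_umem_create(). */
static uint64_t umem_bytes_per_queue(void)
{
    /* Enough frames to fill all four rings (rx, tx, cq, fq) at once. */
    uint64_t n_descs = ((uint64_t)NUM_DESCS + NUM_DESCS) * 2;

    return n_descs * FRAME_SIZE;
}
```

That is 8192 frames of 4 KiB each, i.e. 32 MiB of locked memory per queue.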

Thanks




> +
> +
>  ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]``
>      Establish a vhost-user netdev, backed by a chardev id. The chardev
>      should be a unix domain socket backed one. The vhost-user uses a
> diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
> index d02b09a4b9..7585c4c4ed 100755
> --- a/scripts/ci/org.centos/stream/8/x86_64/configure
> +++ b/scripts/ci/org.centos/stream/8/x86_64/configure
> @@ -35,6 +35,7 @@
>  --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
>  --with-coroutine=ucontext \
>  --tls-priority=@QEMU,SYSTEM \
> +--disable-af-xdp \
>  --disable-attr \
>  --disable-auth-pam \
>  --disable-avx2 \
> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
> index 7dd5709ef4..162e455309 100644
> --- a/scripts/meson-buildoptions.sh
> +++ b/scripts/meson-buildoptions.sh
> @@ -74,6 +74,7 @@ meson_options_help() {
>    printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if available'
>    printf "%s\n" '(unless built with --without-default-features):'
>    printf "%s\n" ''
> +  printf "%s\n" '  af-xdp          AF_XDP network backend support'
>    printf "%s\n" '  alsa            ALSA sound support'
>    printf "%s\n" '  attr            attr/xattr support'
>    printf "%s\n" '  auth-pam        PAM access control'
> @@ -207,6 +208,8 @@ meson_options_help() {
>  }
>  _meson_option_parse() {
>    case $1 in
> +    --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;;
> +    --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;;
>      --enable-alsa) printf "%s" -Dalsa=enabled ;;
>      --disable-alsa) printf "%s" -Dalsa=disabled ;;
>      --enable-attr) printf "%s" -Dattr=enabled ;;
> diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
> index e39871c7bb..207f7adfb9 100644
> --- a/tests/docker/dockerfiles/debian-amd64.docker
> +++ b/tests/docker/dockerfiles/debian-amd64.docker
> @@ -97,6 +97,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
>                        libvirglrenderer-dev \
>                        libvte-2.91-dev \
>                        libxen-dev \
> +                      libxdp-dev \
>                        libzstd-dev \
>                        llvm \
>                        locales \
> --
> 2.40.1
>
Re: [PATCH v2] net: add initial support for AF_XDP network backend
Posted by Ilya Maximets 9 months, 3 weeks ago
On 7/20/23 09:37, Jason Wang wrote:
> On Thu, Jul 6, 2023 at 4:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> AF_XDP is a network socket family that allows communication directly
>> with the network device driver in the kernel, bypassing most or all
>> of the kernel networking stack.  In essence, the technology is
>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>> and works with any network interfaces without driver modifications.
>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>> require access to character devices or unix sockets.  Only access to
>> the network interface itself is necessary.
>>
>> This patch implements a network backend that communicates with the
>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
>> Fill and Completion) are placed in that memory along with a pool of
>> memory buffers for the packet data.  Data transmission is done by
>> allocating one of the buffers, copying packet data into it and
>> placing the pointer into Tx ring.  After transmission, device will
>> return the buffer via Completion ring.  On Rx, device will take
>> a buffer from a pre-populated Fill ring, write the packet data into
>> it and place the buffer into Rx ring.
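
The buffer pool described above can be pictured as a LIFO stack of frame
offsets into the shared memory area.  This is only a simplified sketch
with made-up names and a tiny pool size, not the code from the patch:
frames are taken from the pool for Tx or for refilling the Fill ring, and
put back once they return via the Completion or Rx ring.

```c
#include <stddef.h>
#include <stdint.h>

#define FRAME_SIZE 4096u   /* assumed frame size */
#define N_FRAMES   8u      /* tiny pool, for illustration only */

struct frame_pool {
    uint64_t addr[N_FRAMES];   /* offsets of free frames in the umem */
    size_t n;                  /* number of free frames on the stack */
};

/* Initially every frame is free. */
static void pool_init(struct frame_pool *p)
{
    for (size_t i = 0; i < N_FRAMES; i++) {
        p->addr[i] = (uint64_t)i * FRAME_SIZE;
    }
    p->n = N_FRAMES;
}

/* Take a free frame; returns 0 on success, -1 if the pool is empty. */
static int pool_get(struct frame_pool *p, uint64_t *addr)
{
    if (p->n == 0) {
        return -1;
    }
    *addr = p->addr[--p->n];
    return 0;
}

/* Give a frame back; returns -1 if the pool is already full. */
static int pool_put(struct frame_pool *p, uint64_t addr)
{
    if (p->n == N_FRAMES) {
        return -1;
    }
    p->addr[p->n++] = addr;
    return 0;
}
```

A LIFO order means the most recently returned frame is reused first,
which tends to keep the working set of frames small.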
>>
>> AF_XDP network backend takes on the communication with the host
>> kernel and the network interface and forwards packets to/from the
>> peer device in QEMU.
>>
>> Usage example:
>>
>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>
>> XDP program bridges the socket with a network interface.  It can be
>> attached to the interface in 2 different modes:
>>
>> 1. skb - this mode should work for any interface and doesn't require
>>          driver support.  With a caveat of lower performance.
>>
>> 2. native - this does require support from the driver and allows
>>             bypassing skb allocation in the kernel and potentially using
>>             zero-copy while getting packets in/out of userspace.
>>
>> By default, QEMU will try to use native mode and fall back to skb.
>> Mode can be forced via 'mode' option.  To force 'copy' even in native
>> mode, use 'force-copy=on' option.  This might be useful if there is
>> some issue with the driver.
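
The mode selection above boils down to the following control flow.  This
is a sketch only: 'create_fn' is a hypothetical stand-in for the
xsk_socket__create() call, and 'skb_only_driver' is an example stub that
pretends the driver lacks native XDP support.

```c
#include <stdbool.h>

enum attach_mode { MODE_NATIVE, MODE_SKB };

/* Hypothetical socket-creation callback; returns 0 on success. */
typedef int (*create_fn)(enum attach_mode mode);

/*
 * If a mode was requested, try only that mode.  Otherwise try native
 * first and fall back to skb.  '*used' reports the last mode attempted.
 */
static int create_with_fallback(create_fn create, bool has_mode,
                                enum attach_mode requested,
                                enum attach_mode *used)
{
    if (has_mode) {
        *used = requested;
        return create(requested);
    }

    *used = MODE_NATIVE;
    if (create(MODE_NATIVE) == 0) {
        return 0;
    }

    *used = MODE_SKB;
    return create(MODE_SKB);
}

/* Example stub: a driver that can only do skb (generic) mode. */
static int skb_only_driver(enum attach_mode mode)
{
    return mode == MODE_SKB ? 0 : -1;
}
```

With the stub above, an unspecified mode ends up in skb mode, while an
explicit mode=native fails instead of silently falling back.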
>>
>> Option 'queues=N' allows specifying how many device queues should
>> be open.  Note that all the queues that are not open are still
>> functional and can receive traffic, but it will not be delivered to
>> QEMU.  So, the number of device queues should generally match the
>> QEMU configuration, unless the device is shared with something
>> else and the traffic re-direction to appropriate queues is correctly
>> configured on a device level (e.g. with ethtool -N).
>> 'start-queue=M' option can be used to specify from which queue id
>> QEMU should start configuring 'N' queues.  It might also be necessary
>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
>> for examples.
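
To make the queue numbering concrete: with 'queues=N,start-queue=M' the
backend opens device queues M .. M+N-1, and traffic landing on any other
device queue is not seen by QEMU.  A small sketch with illustrative
helper names (not functions from the patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* Device queue id used by backend queue 'queue_index' (0 .. N-1). */
static int64_t device_queue_id(int64_t start_queue, int64_t queue_index)
{
    return start_queue + queue_index;
}

/* True if device queue 'q' is one of the N queues opened from M. */
static bool queue_is_open(int64_t start_queue, int64_t n_queues, int64_t q)
{
    return q >= start_queue && q < start_queue + n_queues;
}
```

For 'queues=1,start-queue=1' only device queue 1 is open, so the NIC has
to steer the relevant traffic there (e.g. with ethtool -N).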
>>
>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>> capabilities in order to load default XSK/XDP programs to the
>> network interface and configure BPF maps.  It is possible, however,
>> to run with no capabilities.  For that to work, an external process
>> with admin capabilities will need to pre-load default XSK program,
>> create AF_XDP sockets and pass their file descriptors to QEMU process
>> on startup via 'sock-fds' option.  Network backend will need to be
>> configured with 'inhibit=on' to avoid loading of the program.
>> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
>> or CAP_IPC_LOCK.
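
The 'sock-fds' value is a colon-separated list with exactly one fd per
queue.  A standalone sketch of that parsing is below.  Note the
assumptions: it accepts numeric descriptors only and uses plain libc,
whereas the patch goes through g_strsplit() and monitor_fd_param();
'parse_fd_list' is a made-up name.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse "x:y:...:z" into exactly 'n_expected' non-negative fds.
 * Returns a malloc()ed array (caller frees) or NULL on any error.
 */
static int *parse_fd_list(const char *str, size_t n_expected)
{
    const char *p = str;
    int *fds;
    size_t i;

    if (n_expected == 0) {
        return NULL;
    }

    fds = malloc(n_expected * sizeof(*fds));
    if (!fds) {
        return NULL;
    }

    for (i = 0; i < n_expected; i++) {
        char *end;
        long v;

        errno = 0;
        v = strtol(p, &end, 10);
        if (end == p || errno || v < 0 || v > INT_MAX) {
            goto fail;
        }
        fds[i] = (int)v;

        /* Expect ':' between items and end of string after the last. */
        if (*end != (i + 1 < n_expected ? ':' : '\0')) {
            goto fail;
        }
        p = end + 1;
    }

    return fds;

fail:
    free(fds);
    return NULL;
}
```

E.g. "15:16:17" is accepted for queues=3, while "15:16" or "15:16:17:18"
is rejected because the count does not match.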
>>
>> Alternatively, the file descriptor for 'xsks_map' can be passed via
>> 'xsks-map-fd=N' option instead of passing socket file descriptors.
>> That will additionally require CAP_NET_RAW on QEMU side.  This is
>> useful, because 'sock-fds' may not be available with older libxdp.
>> 'sock-fds' requires libxdp >= 1.4.0.
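
The relationship between the three options can be summarized as a small
consistency check.  This is a sketch with a hypothetical helper name; the
error strings are the ones used in net_init_af_xdp() when 'sock-fds' is
available.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns NULL if the combination is acceptable, else an error string. */
static const char *check_inhibit_opts(bool inhibit, bool has_xsks_map_fd,
                                      bool has_sock_fds)
{
    /* inhibit=on needs one of the fd options, and vice versa. */
    if (inhibit != (has_xsks_map_fd || has_sock_fds)) {
        return "'inhibit=on' should be used together with "
               "'sock-fds' or 'xsks-map-fd'";
    }

    /* Only one way of providing the sockets may be used at a time. */
    if (has_xsks_map_fd && has_sock_fds) {
        return "'sock-fds' and 'xsks-map-fd' are mutually exclusive";
    }

    return NULL;
}
```

So the valid combinations are: none of the three, inhibit=on with
'sock-fds', or inhibit=on with 'xsks-map-fd'.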
>>
>> There are a few performance challenges with the current network backends.
>>
>> First is that they do not support IO threads.  This means that data
>> path is handled by the main thread in QEMU and may slow down other
>> work or may be slowed down by some other work.  This also means that
>> taking advantage of multi-queue is generally not possible today.
>>
>> Another thing is that data path is going through the device emulation
>> code, which is not really optimized for performance.  The fastest
>> "frontend" device is virtio-net.  But it's not optimized for heavy
>> traffic either, because it expects such use-cases to be handled via
>> some implementation of vhost (user, kernel, vdpa).  In practice, we
>> have virtio notifications and rcu lock/unlock on a per-packet basis
>> and not very efficient accesses to the guest memory.  Communication
>> channels between backend and frontend devices do not allow passing
>> more than one packet at a time as well.
>>
>> Some of these challenges can be avoided in the future by adding better
>> batching into device emulation or by implementing vhost-af-xdp variant.
>>
>> There are also a few kernel limitations.  AF_XDP sockets do not
>> support any kinds of checksum or segmentation offloading.  Buffers
>> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
>> support implementation for AF_XDP is in progress, but not ready yet.
>> Also, transmission in all non-zero-copy modes is synchronous, i.e.
>> done in a syscall.  That doesn't allow high packet rates on virtual
>> interfaces.
>>
>> However, keeping in mind all of these challenges, current implementation
>> of the AF_XDP backend shows a decent performance while running on top
>> of a physical NIC with zero-copy support.
>>
>> Test setup:
>>
>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
>> Network backend is configured to open the NIC directly in native mode.
>> The driver supports zero-copy.  NIC is configured to use 1 queue.
>>
>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
>> for PPS testing.
>>
>> iperf3 result:
>>  TCP stream      : 19.1 Gbps
>>
>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>  Tx only         : 3.4 Mpps
>>  Rx only         : 2.0 Mpps
>>  L2 FWD Loopback : 1.5 Mpps
>>
>> In skb mode the same setup shows much lower performance, similar to
>> the setup where pair of physical NICs is replaced with veth pair:
>>
>> iperf3 result:
>>   TCP stream      : 9 Gbps
>>
>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>   Tx only         : 1.2 Mpps
>>   Rx only         : 1.0 Mpps
>>   L2 FWD Loopback : 0.7 Mpps
>>
>> Results in skb mode or over the veth are close to results of a tap
>> backend with vhost=on and disabled segmentation offloading bridged
>> with a NIC.
>>
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> 
> Looks good overall, see a few comments inline.

Thanks for the review!

> 
>> ---
>>
>> Version 2:
>>
>>   - Added support for running with no capabilities by passing
>>     pre-created AF_XDP socket file descriptors via 'sock-fds' option.
>>     Conditionally compiled because it requires unreleased libxdp 1.4.0.
>>     The last restriction is having 32 MB of RLIMIT_MEMLOCK per queue.
>>
>>   - Refined and extended documentation.
>>
>>
>>  MAINTAINERS                                   |   4 +
>>  hmp-commands.hx                               |   2 +-
>>  meson.build                                   |  19 +
>>  meson_options.txt                             |   2 +
>>  net/af-xdp.c                                  | 570 ++++++++++++++++++
>>  net/clients.h                                 |   5 +
>>  net/meson.build                               |   3 +
>>  net/net.c                                     |   6 +
>>  qapi/net.json                                 |  60 +-
>>  qemu-options.hx                               |  83 ++-
>>  .../ci/org.centos/stream/8/x86_64/configure   |   1 +
>>  scripts/meson-buildoptions.sh                 |   3 +
>>  tests/docker/dockerfiles/debian-amd64.docker  |   1 +
>>  13 files changed, 756 insertions(+), 3 deletions(-)
>>  create mode 100644 net/af-xdp.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 7164cf55a1..80d4ba4004 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2929,6 +2929,10 @@ W: http://info.iet.unipi.it/~luigi/netmap/
>>  S: Maintained
>>  F: net/netmap.c
>>
>> +AF_XDP network backend
>> +R: Ilya Maximets <i.maximets@ovn.org>
>> +F: net/af-xdp.c
>> +
>>  Host Memory Backends
>>  M: David Hildenbrand <david@redhat.com>
>>  M: Igor Mammedov <imammedo@redhat.com>
>> diff --git a/hmp-commands.hx b/hmp-commands.hx
>> index 2cbd0f77a0..af9ffe4681 100644
>> --- a/hmp-commands.hx
>> +++ b/hmp-commands.hx
>> @@ -1295,7 +1295,7 @@ ERST
>>      {
>>          .name       = "netdev_add",
>>          .args_type  = "netdev:O",
>> -        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user"
>> +        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user"
>>  #ifdef CONFIG_VMNET
>>                        "|vmnet-host|vmnet-shared|vmnet-bridged"
>>  #endif
>> diff --git a/meson.build b/meson.build
>> index a9ba0bfab3..1f8772ea5d 100644
>> --- a/meson.build
>> +++ b/meson.build
>> @@ -1891,6 +1891,18 @@ if libbpf.found() and not cc.links('''
>>    endif
>>  endif
>>
>> +# libxdp
>> +libxdp = dependency('libxdp', required: get_option('af_xdp'), method: 'pkg-config')
>> +if libxdp.found() and \
>> +      not (libbpf.found() and libbpf.version().version_compare('>=0.7'))
>> +  libxdp = not_found
>> +  if get_option('af_xdp').enabled()
>> +    error('af-xdp support requires libbpf version >= 0.7')
> 
> Can we simply limit this to 1.4?

This is a check for libbpf, not libxdp.  Or do you think there is no need
to check the libbpf version if we require a high enough libxdp version?
Users may still break the build by installing an old libbpf manually, even if
distributions ship more modern versions.

Or do you mean limiting the libxdp version to 1.4 in order to avoid the
conditional on HAVE_XSK_UMEM__CREATE_WITH_FD?
The problem with that is that libxdp 1.4 is only a week or so old, so it is
not available in any distribution, AFAIK.  Not sure how big of a problem that
is, though.

> 
> 
>> +  else
>> +    warning('af-xdp support requires libbpf version >= 0.7, disabling')
>> +  endif
>> +endif
>> +
>>  # libdw
>>  libdw = not_found
>>  if not get_option('libdw').auto() or \
>> @@ -2112,6 +2124,12 @@ config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', get_option('hexagon_idef_pars
>>  config_host_data.set('CONFIG_LIBATTR', have_old_libattr)
>>  config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found())
>>  config_host_data.set('CONFIG_EBPF', libbpf.found())
>> +config_host_data.set('CONFIG_AF_XDP', libxdp.found())
>> +if libxdp.found()
>> +  config_host_data.set('HAVE_XSK_UMEM__CREATE_WITH_FD',
>> +                       cc.has_function('xsk_umem__create_with_fd',
>> +                                       dependencies: libxdp))
>> +endif
>>  config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found())
>>  config_host_data.set('CONFIG_LIBISCSI', libiscsi.found())
>>  config_host_data.set('CONFIG_LIBNFS', libnfs.found())
>> @@ -4285,6 +4303,7 @@ summary_info += {'PVRDMA support':    have_pvrdma}
>>  summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
>>  summary_info += {'libcap-ng support': libcap_ng}
>>  summary_info += {'bpf support':       libbpf}
>> +summary_info += {'AF_XDP support':    libxdp}
>>  summary_info += {'rbd support':       rbd}
>>  summary_info += {'smartcard support': cacard}
>>  summary_info += {'U2F support':       u2f}
>> diff --git a/meson_options.txt b/meson_options.txt
>> index bbb5c7e886..f4e950ce6a 100644
>> --- a/meson_options.txt
>> +++ b/meson_options.txt
>> @@ -120,6 +120,8 @@ option('avx512bw', type: 'feature', value: 'auto',
>>  option('keyring', type: 'feature', value: 'auto',
>>         description: 'Linux keyring support')
>>
>> +option('af_xdp', type : 'feature', value : 'auto',
>> +       description: 'AF_XDP network backend support')
>>  option('attr', type : 'feature', value : 'auto',
>>         description: 'attr/xattr support')
>>  option('auth_pam', type : 'feature', value : 'auto',
>> diff --git a/net/af-xdp.c b/net/af-xdp.c
>> new file mode 100644
>> index 0000000000..265ba6b12e
>> --- /dev/null
>> +++ b/net/af-xdp.c
>> @@ -0,0 +1,570 @@
>> +/*
>> + * AF_XDP network backend.
>> + *
>> + * Copyright (c) 2023 Red Hat, Inc.
>> + *
>> + * Authors:
>> + *  Ilya Maximets <i.maximets@ovn.org>
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +
>> +#include "qemu/osdep.h"
>> +#include <bpf/bpf.h>
>> +#include <inttypes.h>
>> +#include <linux/if_link.h>
>> +#include <linux/if_xdp.h>
>> +#include <net/if.h>
>> +#include <xdp/xsk.h>
>> +
>> +#include "clients.h"
>> +#include "monitor/monitor.h"
>> +#include "net/net.h"
>> +#include "qapi/error.h"
>> +#include "qemu/cutils.h"
>> +#include "qemu/error-report.h"
>> +#include "qemu/iov.h"
>> +#include "qemu/main-loop.h"
>> +#include "qemu/memalign.h"
>> +
>> +
>> +typedef struct AFXDPState {
>> +    NetClientState       nc;
>> +
>> +    struct xsk_socket    *xsk;
>> +    struct xsk_ring_cons rx;
>> +    struct xsk_ring_prod tx;
>> +    struct xsk_ring_cons cq;
>> +    struct xsk_ring_prod fq;
>> +
>> +    char                 ifname[IFNAMSIZ];
>> +    int                  ifindex;
>> +    bool                 read_poll;
>> +    bool                 write_poll;
>> +    uint32_t             outstanding_tx;
>> +
>> +    uint64_t             *pool;
>> +    uint32_t             n_pool;
>> +    char                 *buffer;
>> +    struct xsk_umem      *umem;
>> +
>> +    uint32_t             n_queues;
>> +    uint32_t             xdp_flags;
>> +    bool                 inhibit;
>> +} AFXDPState;
>> +
>> +#define AF_XDP_BATCH_SIZE 64
>> +
>> +static void af_xdp_send(void *opaque);
>> +static void af_xdp_writable(void *opaque);
>> +
>> +/* Set the event-loop handlers for the af-xdp backend. */
>> +static void af_xdp_update_fd_handler(AFXDPState *s)
>> +{
>> +    qemu_set_fd_handler(xsk_socket__fd(s->xsk),
>> +                        s->read_poll ? af_xdp_send : NULL,
>> +                        s->write_poll ? af_xdp_writable : NULL,
>> +                        s);
>> +}
>> +
>> +/* Update the read handler. */
>> +static void af_xdp_read_poll(AFXDPState *s, bool enable)
>> +{
>> +    if (s->read_poll != enable) {
>> +        s->read_poll = enable;
>> +        af_xdp_update_fd_handler(s);
>> +    }
>> +}
>> +
>> +/* Update the write handler. */
>> +static void af_xdp_write_poll(AFXDPState *s, bool enable)
>> +{
>> +    if (s->write_poll != enable) {
>> +        s->write_poll = enable;
>> +        af_xdp_update_fd_handler(s);
>> +    }
>> +}
>> +
>> +static void af_xdp_poll(NetClientState *nc, bool enable)
>> +{
>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>> +
>> +    if (s->read_poll != enable || s->write_poll != enable) {
>> +        s->write_poll = enable;
>> +        s->read_poll  = enable;
>> +        af_xdp_update_fd_handler(s);
>> +    }
>> +}
>> +
>> +static void af_xdp_complete_tx(AFXDPState *s)
>> +{
>> +    uint32_t idx = 0;
>> +    uint32_t done, i;
>> +    uint64_t *addr;
>> +
>> +    done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, &idx);
>> +
>> +    for (i = 0; i < done; i++) {
>> +        addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++);
>> +        s->pool[s->n_pool++] = *addr;
>> +        s->outstanding_tx--;
>> +    }
>> +
>> +    if (done) {
>> +        xsk_ring_cons__release(&s->cq, done);
>> +    }
>> +}
>> +
>> +/*
>> + * The fd_write() callback, invoked if the fd is marked as writable
>> + * after a poll.
>> + */
>> +static void af_xdp_writable(void *opaque)
>> +{
>> +    AFXDPState *s = opaque;
>> +
>> +    /* Try to recover buffers that are already sent. */
>> +    af_xdp_complete_tx(s);
>> +
>> +    /*
>> +     * Unregister the handler, unless we still have packets to transmit
>> +     * and kernel needs a wake up.
>> +     */
>> +    if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) {
>> +        af_xdp_write_poll(s, false);
>> +    }
>> +
>> +    /* Flush any buffered packets. */
>> +    qemu_flush_queued_packets(&s->nc);
>> +}
>> +
>> +static ssize_t af_xdp_receive(NetClientState *nc,
>> +                              const uint8_t *buf, size_t size)
>> +{
>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>> +    struct xdp_desc *desc;
>> +    uint32_t idx;
>> +    void *data;
>> +
>> +    /* Try to recover buffers that are already sent. */
>> +    af_xdp_complete_tx(s);
>> +
>> +    if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) {
>> +        /* We can't transmit packet this size... */
>> +        return size;
>> +    }
>> +
>> +    if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) {
>> +        /*
>> +         * Out of buffers or space in tx ring.  Poll until we can write.
>> +         * This will also kick the Tx, if it was waiting on CQ.
>> +         */
>> +        af_xdp_write_poll(s, true);
>> +        return 0;
>> +    }
>> +
>> +    desc = xsk_ring_prod__tx_desc(&s->tx, idx);
>> +    desc->addr = s->pool[--s->n_pool];
>> +    desc->len = size;
>> +
>> +    data = xsk_umem__get_data(s->buffer, desc->addr);
>> +    memcpy(data, buf, size);
>> +
>> +    xsk_ring_prod__submit(&s->tx, 1);
>> +    s->outstanding_tx++;
>> +
>> +    if (xsk_ring_prod__needs_wakeup(&s->tx)) {
>> +        af_xdp_write_poll(s, true);
>> +    }
>> +
>> +    return size;
>> +}
>> +
>> +/*
>> + * Complete a previous send (backend --> guest) and enable the
>> + * fd_read callback.
>> + */
>> +static void af_xdp_send_completed(NetClientState *nc, ssize_t len)
>> +{
>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>> +
>> +    af_xdp_read_poll(s, true);
>> +}
>> +
>> +static void af_xdp_fq_refill(AFXDPState *s, uint32_t n)
>> +{
>> +    uint32_t i, idx = 0;
>> +
>> +    /* Leave one packet for Tx, just in case. */
>> +    if (s->n_pool < n + 1) {
>> +        n = s->n_pool;
>> +    }
>> +
>> +    if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) {
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < n; i++) {
>> +        *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool];
>> +    }
>> +    xsk_ring_prod__submit(&s->fq, n);
>> +
>> +    if (xsk_ring_prod__needs_wakeup(&s->fq)) {
>> +        /* Receive was blocked by not having enough buffers.  Wake it up. */
>> +        af_xdp_read_poll(s, true);
>> +    }
>> +}
>> +
>> +static void af_xdp_send(void *opaque)
>> +{
>> +    uint32_t i, n_rx, idx = 0;
>> +    AFXDPState *s = opaque;
>> +
>> +    n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx);
>> +    if (!n_rx) {
>> +        return;
>> +    }
>> +
>> +    for (i = 0; i < n_rx; i++) {
>> +        const struct xdp_desc *desc;
>> +        struct iovec iov;
>> +
>> +        desc = xsk_ring_cons__rx_desc(&s->rx, idx++);
>> +
>> +        iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr);
>> +        iov.iov_len = desc->len;
>> +
>> +        s->pool[s->n_pool++] = desc->addr;
>> +
>> +        if (!qemu_sendv_packet_async(&s->nc, &iov, 1,
>> +                                     af_xdp_send_completed)) {
>> +            /*
>> +             * The peer does not receive anymore.  Packet is queued, stop
>> +             * reading from the backend until af_xdp_send_completed().
>> +             */
>> +            af_xdp_read_poll(s, false);
>> +
>> +            /* Re-peek the descriptors to not break the ring cache. */
>> +            xsk_ring_cons__cancel(&s->rx, n_rx);
>> +            n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx);
> 
> The code turns out to be hard to read here.
> 
> 1) This seems to undo the peek (usually peek doesn't touch the
> prod/consumer but it seems not what xsk_ring_cons__peek()) did:

Yeah, it's unfortunate that the peek() function changes the internal
state, but that is what we have...

> 
> static inline __u32 xsk_ring_cons__peek(struct xsk_ring_cons *cons,
> __u32 nb, __u32 *idx)
> {
>   __u32 entries = xsk_cons_nb_avail(cons, nb);
> 
> if (entries > 0) {
>                 *idx = cons->cached_cons;
>                 cons->cached_cons += entries;
>         }
> 
>         return entries;
> }
> 
> 2) It looks to me a partial rollback is sufficient?
> 
> xsk_ring_cons__cancel(n_rx - i + 1)?

Good point.  Should work.  It should be n_rx - i - 1 though, if I'm not
mistaken.  So:

    xsk_ring_cons__cancel(&s->rx, n_rx - i - 1);
    n_rx = i + 1;

I'm not sure if that is much easier to read, but that's OK.  Should be
a touch faster as well.  What do you think?

> 
>> +            g_assert(n_rx == i + 1);
>> +            break;
>> +        }
>> +    }
>> +
>> +    /* Release actually sent descriptors and try to re-fill. */
>> +    xsk_ring_cons__release(&s->rx, n_rx);
>> +    af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE);
>> +}
>> +
>> +/* Flush and close. */
>> +static void af_xdp_cleanup(NetClientState *nc)
>> +{
>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>> +
>> +    qemu_purge_queued_packets(nc);
>> +
>> +    af_xdp_poll(nc, false);
>> +
>> +    xsk_socket__delete(s->xsk);
>> +    s->xsk = NULL;
>> +    g_free(s->pool);
>> +    s->pool = NULL;
>> +    xsk_umem__delete(s->umem);
>> +    s->umem = NULL;
>> +    qemu_vfree(s->buffer);
>> +    s->buffer = NULL;
>> +
>> +    /* Remove the program if it's the last open queue. */
>> +    if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags
>> +        && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) {
>> +        fprintf(stderr,
>> +                "af-xdp: unable to remove XDP program from '%s', ifindex: %d\n",
>> +                s->ifname, s->ifindex);
>> +    }
>> +}
>> +
>> +static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp)
>> +{
>> +    struct xsk_umem_config config = {
>> +        .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
>> +        .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
>> +        .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
>> +        .frame_headroom = 0,
>> +    };
>> +    uint64_t n_descs;
>> +    uint64_t size;
>> +    int64_t i;
>> +    int ret;
>> +
>> +    /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. */
>> +    n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS
>> +               + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2;
>> +    size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE;
>> +
>> +    s->buffer = qemu_memalign(qemu_real_host_page_size(), size);
>> +    memset(s->buffer, 0, size);
>> +
>> +    if (sock_fd < 0) {
>> +        ret = xsk_umem__create(&s->umem, s->buffer, size,
>> +                               &s->fq, &s->cq, &config);
>> +    } else {
>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>> +        ret = xsk_umem__create_with_fd(&s->umem, sock_fd, s->buffer, size,
>> +                                       &s->fq, &s->cq, &config);
>> +#else
> 
> So sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD won't work. We'd better
> 
> 1) disable sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD

The qapi property is conditionally defined, so users will not be able to
set sock-fds if not supported.  And qemu will complain about unknown
property.  That should be enough?

> 
> or
> 
> 2) disable af_xdp without HAVE_XSK_UMEM__CREATE_WITH_FD

If we require libxdp 1.4 that will be the case.

> 
>> +        ret = -1;
>> +        errno = EINVAL;
>> +#endif
>> +    }
>> +
>> +    if (ret) {
>> +        qemu_vfree(s->buffer);
>> +        error_setg_errno(errp, errno,
>> +                         "failed to create umem for %s queue_index: %d",
>> +                         s->ifname, s->nc.queue_index);
>> +        return -1;
>> +    }
>> +
>> +    s->pool = g_new(uint64_t, n_descs);
>> +    /* Fill the pool in the opposite order, because it's a LIFO queue. */
>> +    for (i = n_descs - 1; i >= 0; i--) {
>> +        s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE;
>> +    }
>> +    s->n_pool = n_descs;
>> +
>> +    af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS);
>> +
>> +    return 0;
>> +}
>> +
>> +static int af_xdp_socket_create(AFXDPState *s,
>> +                                const NetdevAFXDPOptions *opts,
>> +                                int xsks_map_fd, Error **errp)
>> +{
>> +    struct xsk_socket_config cfg = {
>> +        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
>> +        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
>> +        .libxdp_flags = 0,
>> +        .bind_flags = XDP_USE_NEED_WAKEUP,
>> +        .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
>> +    };
>> +    int queue_id, error = 0;
>> +
>> +    s->inhibit = opts->has_inhibit && opts->inhibit;
>> +    if (s->inhibit) {
>> +        cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD;
>> +    }
>> +
>> +    if (opts->has_force_copy && opts->force_copy) {
>> +        cfg.bind_flags |= XDP_COPY;
>> +    }
>> +
>> +    queue_id = s->nc.queue_index;
>> +    if (opts->has_start_queue && opts->start_queue > 0) {
>> +        queue_id += opts->start_queue;
>> +    }
>> +
>> +    if (opts->has_mode) {
>> +        /* Specific mode requested. */
>> +        cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE)
>> +                         ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;
>> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
>> +                               s->umem, &s->rx, &s->tx, &cfg)) {
>> +            error = errno;
>> +        }
>> +    } else {
>> +        /* No mode requested, try native first. */
>> +        cfg.xdp_flags |= XDP_FLAGS_DRV_MODE;
>> +
>> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
>> +                               s->umem, &s->rx, &s->tx, &cfg)) {
>> +            /* Can't use native mode, try skb. */
>> +            cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE;
>> +            cfg.xdp_flags |= XDP_FLAGS_SKB_MODE;
>> +
>> +            if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
>> +                                   s->umem, &s->rx, &s->tx, &cfg)) {
>> +                error = errno;
>> +            }
>> +        }
>> +    }
>> +
>> +    if (error) {
>> +        error_setg_errno(errp, error,
>> +                         "failed to create AF_XDP socket for %s queue_id: %d",
>> +                         s->ifname, queue_id);
>> +        return -1;
>> +    }
>> +
>> +    if (s->inhibit && xsks_map_fd >= 0) {
>> +        int xsk_fd = xsk_socket__fd(s->xsk);
>> +
>> +        /* Need to update the map manually, libxdp skipped that step. */
>> +        error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0);
>> +        if (error) {
>> +            error_setg_errno(errp, error,
>> +                             "failed to update xsks map for %s queue_id: %d",
>> +                             s->ifname, queue_id);
>> +            return -1;
>> +        }
>> +    }
>> +
>> +    s->xdp_flags = cfg.xdp_flags;
>> +
>> +    return 0;
>> +}
>> +
>> +/* NetClientInfo methods. */
>> +static NetClientInfo net_af_xdp_info = {
>> +    .type = NET_CLIENT_DRIVER_AF_XDP,
>> +    .size = sizeof(AFXDPState),
>> +    .receive = af_xdp_receive,
>> +    .poll = af_xdp_poll,
>> +    .cleanup = af_xdp_cleanup,
>> +};
>> +
>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>> +static int *parse_socket_fds(const char *sock_fds_str,
>> +                             int64_t n_expected, Error **errp)
>> +{
>> +    gchar **substrings = g_strsplit(sock_fds_str, ":", -1);
>> +    int64_t i, n_sock_fds = g_strv_length(substrings);
>> +    int *sock_fds = NULL;
>> +
>> +    if (n_sock_fds != n_expected) {
>> +        error_setg(errp, "expected %"PRIi64" socket fds, got %"PRIi64,
>> +                   n_expected, n_sock_fds);
>> +        goto exit;
>> +    }
>> +
>> +    sock_fds = g_new(int, n_sock_fds);
>> +
>> +    for (i = 0; i < n_sock_fds; i++) {
>> +        sock_fds[i] = monitor_fd_param(monitor_cur(), substrings[i], errp);
>> +        if (sock_fds[i] < 0) {
>> +            g_free(sock_fds);
>> +            sock_fds = NULL;
>> +            goto exit;
>> +        }
>> +    }
>> +
>> +exit:
>> +    g_strfreev(substrings);
>> +    return sock_fds;
>> +}
>> +#endif
>> +
>> +/*
>> + * The exported init function.
>> + *
>> + * ... -net af-xdp,ifname="..."
> 
> This is the legacy command line, let's say -netdev af-xdp,...

Sure.

> 
>> + */
>> +int net_init_af_xdp(const Netdev *netdev,
>> +                    const char *name, NetClientState *peer, Error **errp)
>> +{
>> +    const NetdevAFXDPOptions *opts = &netdev->u.af_xdp;
>> +    NetClientState *nc, *nc0 = NULL;
>> +    unsigned int ifindex;
>> +    uint32_t prog_id = 0;
>> +    int *sock_fds = NULL;
>> +    int xsks_map_fd = -1;
>> +    int64_t i, queues;
>> +    Error *err = NULL;
>> +    AFXDPState *s;
>> +
>> +    ifindex = if_nametoindex(opts->ifname);
>> +    if (!ifindex) {
>> +        error_setg_errno(errp, errno, "failed to get ifindex for '%s'",
>> +                         opts->ifname);
>> +        return -1;
>> +    }
>> +
>> +    queues = opts->has_queues ? opts->queues : 1;
>> +    if (queues < 1) {
>> +        error_setg(errp, "invalid number of queues (%" PRIi64 ") for '%s'",
>> +                   queues, opts->ifname);
>> +        return -1;
>> +    }
>> +
>> +#ifndef HAVE_XSK_UMEM__CREATE_WITH_FD
>> +    if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) {
>> +        error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' together");
>> +        return -1;
>> +    }
>> +#else
>> +    if ((opts->has_inhibit && opts->inhibit)
>> +        != (opts->xsks_map_fd || opts->sock_fds)) {
>> +        error_setg(errp, "'inhibit=on' should be used together with "
>> +                         "'sock-fds' or 'xsks-map-fd'");
>> +        return -1;
>> +    }
>> +
>> +    if (opts->xsks_map_fd && opts->sock_fds) {
>> +        error_setg(errp, "'sock-fds' and 'xsks-map-fd' are mutually exclusive");
>> +        return -1;
>> +    }
>> +
>> +    if (opts->sock_fds) {
>> +        sock_fds = parse_socket_fds(opts->sock_fds, queues, errp);
>> +        if (!sock_fds) {
>> +            return -1;
>> +        }
>> +    }
>> +#endif
>> +
>> +    if (opts->xsks_map_fd) {
>> +        xsks_map_fd = monitor_fd_param(monitor_cur(), opts->xsks_map_fd, errp);
>> +        if (xsks_map_fd < 0) {
>> +            goto err;
>> +        }
>> +    }
>> +
>> +    for (i = 0; i < queues; i++) {
>> +        nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name);
>> +        qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname);
>> +        nc->queue_index = i;
>> +
>> +        if (!nc0) {
>> +            nc0 = nc;
>> +        }
>> +
>> +        s = DO_UPCAST(AFXDPState, nc, nc);
>> +
>> +        pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname);
>> +        s->ifindex = ifindex;
>> +        s->n_queues = queues;
>> +
>> +        if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp)
>> +            || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) {
>> +            /* Make sure the XDP program will be removed. */
>> +            s->n_queues = i;
>> +            error_propagate(errp, err);
>> +            goto err;
>> +        }
>> +    }
>> +
>> +    if (nc0) {
>> +        s = DO_UPCAST(AFXDPState, nc, nc0);
>> +        if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || !prog_id) {
>> +            error_setg_errno(errp, errno,
>> +                             "no XDP program loaded on '%s', ifindex: %d",
>> +                             s->ifname, s->ifindex);
>> +            goto err;
>> +        }
>> +    }
>> +
>> +    af_xdp_read_poll(s, true); /* Initially only poll for reads. */
>> +
>> +    return 0;
>> +
>> +err:
>> +    g_free(sock_fds);
>> +    if (nc0) {
>> +        qemu_del_net_client(nc0);
>> +    }
>> +
>> +    return -1;
>> +}
>> diff --git a/net/clients.h b/net/clients.h
>> index ed8bdfff1e..be53794582 100644
>> --- a/net/clients.h
>> +++ b/net/clients.h
>> @@ -64,6 +64,11 @@ int net_init_netmap(const Netdev *netdev, const char *name,
>>                      NetClientState *peer, Error **errp);
>>  #endif
>>
>> +#ifdef CONFIG_AF_XDP
>> +int net_init_af_xdp(const Netdev *netdev, const char *name,
>> +                    NetClientState *peer, Error **errp);
>> +#endif
>> +
>>  int net_init_vhost_user(const Netdev *netdev, const char *name,
>>                          NetClientState *peer, Error **errp);
>>
>> diff --git a/net/meson.build b/net/meson.build
>> index bdf564a57b..61628d4684 100644
>> --- a/net/meson.build
>> +++ b/net/meson.build
>> @@ -36,6 +36,9 @@ system_ss.add(when: vde, if_true: files('vde.c'))
>>  if have_netmap
>>    system_ss.add(files('netmap.c'))
>>  endif
>> +
>> +system_ss.add(when: libxdp, if_true: files('af-xdp.c'))
>> +
>>  if have_vhost_net_user
>>    system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('vhost-user.c'), if_false: files('vhost-user-stub.c'))
>>    system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c'))
>> diff --git a/net/net.c b/net/net.c
>> index 6492ad530e..127f70932b 100644
>> --- a/net/net.c
>> +++ b/net/net.c
>> @@ -1082,6 +1082,9 @@ static int (* const net_client_init_fun[NET_CLIENT_DRIVER__MAX])(
>>  #ifdef CONFIG_NETMAP
>>          [NET_CLIENT_DRIVER_NETMAP]    = net_init_netmap,
>>  #endif
>> +#ifdef CONFIG_AF_XDP
>> +        [NET_CLIENT_DRIVER_AF_XDP]    = net_init_af_xdp,
>> +#endif
>>  #ifdef CONFIG_NET_BRIDGE
>>          [NET_CLIENT_DRIVER_BRIDGE]    = net_init_bridge,
>>  #endif
>> @@ -1186,6 +1189,9 @@ void show_netdevs(void)
>>  #ifdef CONFIG_NETMAP
>>          "netmap",
>>  #endif
>> +#ifdef CONFIG_AF_XDP
>> +        "af-xdp",
>> +#endif
>>  #ifdef CONFIG_POSIX
>>          "vhost-user",
>>  #endif
>> diff --git a/qapi/net.json b/qapi/net.json
>> index db67501308..88f2c982c2 100644
>> --- a/qapi/net.json
>> +++ b/qapi/net.json
>> @@ -408,6 +408,62 @@
>>      'ifname':     'str',
>>      '*devname':    'str' } }
>>
>> +##
>> +# @AFXDPMode:
>> +#
>> +# Attach mode for a default XDP program
>> +#
>> +# @skb: generic mode, no driver support necessary
>> +#
>> +# @native: DRV mode, program is attached to a driver, packets are passed to
>> +#     the socket without allocation of skb.
>> +#
>> +# Since: 8.1
> 
> I'd make it for 8.2.

OK.

> 
>> +##
>> +{ 'enum': 'AFXDPMode',
>> +  'data': [ 'native', 'skb' ] }
>> +
>> +##
>> +# @NetdevAFXDPOptions:
>> +#
>> +# AF_XDP network backend
>> +#
>> +# @ifname: The name of an existing network interface.
>> +#
>> +# @mode: Attach mode for a default XDP program.  If not specified, then
>> +#     'native' will be tried first, then 'skb'.
>> +#
>> +# @force-copy: Force XDP copy mode even if device supports zero-copy.
>> +#     (default: false)
>> +#
>> +# @queues: number of queues to be used for multiqueue interfaces (default: 1).
>> +#
>> +# @start-queue: Use @queues starting from this queue number (default: 0).
>> +#
>> +# @inhibit: Don't load a default XDP program, use one already loaded to
>> +#     the interface (default: false).  Requires @sock-fds or @xsks-map-fd.
>> +#
>> +# @sock-fds: A colon (:) separated list of file descriptors for already open
>> +#     but not bound AF_XDP sockets in the queue order.  One fd per queue.
>> +#     These descriptors should already be added into XDP socket map for
>> +#     corresponding queues.  Requires @inhibit.
>> +#
>> +# @xsks-map-fd: A file descriptor for an already open XDP socket map in
>> +#     the already loaded XDP program.  Requires @inhibit.
>> +#
>> +# Since: 8.1
>> +##
>> +{ 'struct': 'NetdevAFXDPOptions',
>> +  'data': {
>> +    'ifname':       'str',
>> +    '*mode':        'AFXDPMode',
>> +    '*force-copy':  'bool',
>> +    '*queues':      'int',
>> +    '*start-queue': 'int',
>> +    '*inhibit':     'bool',
>> +    '*sock-fds':    { 'type': 'str', 'if': 'HAVE_XSK_UMEM__CREATE_WITH_FD' },

The parameter is defined conditionally here, and it will not be
compiled in if HAVE_XSK_UMEM__CREATE_WITH_FD is not defined.

>> +    '*xsks-map-fd': 'str' } }
>> +
>>  ##
>>  # @NetdevVhostUserOptions:
>>  #
>> @@ -642,13 +698,14 @@
>>  # @vmnet-bridged: since 7.1
>>  # @stream: since 7.2
>>  # @dgram: since 7.2
>> +# @af-xdp: since 8.1
>>  #
>>  # Since: 2.7
>>  ##
>>  { 'enum': 'NetClientDriver',
>>    'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream',
>>              'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user',
>> -            'vhost-vdpa',
>> +            'vhost-vdpa', 'af-xdp',
>>              { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' },
>>              { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' },
>>              { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] }
>> @@ -680,6 +737,7 @@
>>      'bridge':   'NetdevBridgeOptions',
>>      'hubport':  'NetdevHubPortOptions',
>>      'netmap':   'NetdevNetmapOptions',
>> +    'af-xdp':   'NetdevAFXDPOptions',
>>      'vhost-user': 'NetdevVhostUserOptions',
>>      'vhost-vdpa': 'NetdevVhostVDPAOptions',
>>      'vmnet-host': { 'type': 'NetdevVmnetHostOptions',
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index b57489d7ca..d91610701c 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -2856,6 +2856,25 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
>>      "                VALE port (created on the fly) called 'name' ('nmname' is name of the \n"
>>      "                netmap device, defaults to '/dev/netmap')\n"
>>  #endif
>> +#ifdef CONFIG_AF_XDP
>> +    "-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n"
>> +    "         [,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k]\n"
>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>> +    "         [,sock-fds=x:y:...:z]\n"
>> +#endif
>> +    "                attach to the existing network interface 'name' with AF_XDP socket\n"
>> +    "                use 'mode=MODE' to specify an XDP program attach mode\n"
>> +    "                use 'force-copy=on|off' to force XDP copy mode even if device supports zero-copy (default: off)\n"
>> +    "                use 'inhibit=on|off' to inhibit loading of a default XDP program (default: off)\n"
>> +    "                with inhibit=on,\n"
>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>> +    "                  use 'sock-fds' to provide file descriptors for already open XDP sockets\n"
>> +    "                  added to a socket map in XDP program.  One socket per queue.  Or\n"
>> +#endif
>> +    "                  use 'xsks-map-fd=k' to provide a file descriptor for xsks map\n"
>> +    "                use 'queues=n' to specify how many queues of a multiqueue interface should be used\n"
>> +    "                use 'start-queue=m' to specify the first queue that should be used\n"
>> +#endif
>>  #ifdef CONFIG_POSIX
>>      "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n"
>>      "                configure a vhost-user network, backed by a chardev 'dev'\n"
>> @@ -2901,6 +2920,9 @@ DEF("nic", HAS_ARG, QEMU_OPTION_nic,
>>  #ifdef CONFIG_NETMAP
>>      "netmap|"
>>  #endif
>> +#ifdef CONFIG_AF_XDP
>> +    "af-xdp|"
>> +#endif
>>  #ifdef CONFIG_POSIX
>>      "vhost-user|"
>>  #endif
>> @@ -2929,6 +2951,9 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>>  #ifdef CONFIG_NETMAP
>>      "netmap|"
>>  #endif
>> +#ifdef CONFIG_AF_XDP
>> +    "af-xdp|"
>> +#endif
>>  #ifdef CONFIG_VMNET
>>      "vmnet-host|vmnet-shared|vmnet-bridged|"
>>  #endif
>> @@ -2936,7 +2961,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>>      "                old way to initialize a host network interface\n"
>>      "                (use the -netdev option if possible instead)\n", QEMU_ARCH_ALL)
>>  SRST
>> -``-nic [tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
>> +``-nic [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
>>      This option is a shortcut for configuring both the on-board
>>      (default) guest NIC hardware and the host network backend in one go.
>>      The host backend options are the same as with the corresponding
>> @@ -3350,6 +3375,62 @@ SRST
>>          # launch QEMU instance
>>          |qemu_system| linux.img -nic vde,sock=/tmp/myswitch
>>
>> +``-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k][,sock-fds=x:y:...:z]``
>> +    Configure AF_XDP backend to connect to a network interface 'name'
>> +    using AF_XDP socket.  A specific program attach mode for a default
>> +    XDP program can be forced with 'mode', defaults to best-effort,
>> +    where the likely most performant mode will be in use.  Number of queues
>> +    'n' should generally match the number of queues in the interface,
>> +    defaults to 1.  Traffic arriving on non-configured device queues will
>> +    not be delivered to the network backend.
>> +
>> +    .. parsed-literal::
>> +
>> +        # set number of queues to 4
>> +        ethtool -L eth0 combined 4
>> +        # launch QEMU instance
>> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>> +            -netdev af-xdp,id=n1,ifname=eth0,queues=4
>> +
>> +    'start-queue' option can be specified if a particular range of queues
>> +    [m, m + n] should be in use.  For example, this is necessary in order
>> +    to use MLX NICs in native mode.  The driver will create a separate set
>> +    of queues on top of regular ones, and only these queues can be used
>> +    for AF_XDP sockets.  MLX NICs will also require an additional traffic
>> +    redirection with ethtool to these queues.
> 
> Let's avoid mentioning any vendor name unless the emulation is done
> for a specific one.

AFAIK, MLX/NVIDIA is the only vendor that implements XDP queues in this
strange fashion, so these are the only NICs that will not work without
start-queue configuration and the extra traffic re-direction.  Unfortunately,
this is also not documented anywhere, so you just have to know that it works
this way...

So, on one hand, I understand that it's not the place for vendor-specific
docs.  On the other, it gives qemu users a chance to not waste a lot of time
trying to figure out why everything is configured correctly, but the traffic
doesn't flow.

Maybe something like this:

    'start-queue' option can be specified if a particular range of queues
    [m, m + n] should be in use.  For example, this may be necessary in
    order to use certain NICs in native mode.  Kernel allows the driver to
    create a separate set of XDP queues on top of regular ones, and only
    these queues can be used for AF_XDP sockets.  NICs that work this way
    may also require an additional traffic redirection with ethtool to these
    special queues.

    .. parsed-literal::

        # set number of queues to 1
        ethtool -L eth0 combined 1
        # redirect all the traffic to the second queue (id: 1)
        # note: drivers may require non-empty key/mask pair.
        ethtool -N eth0 flow-type ether \\
            dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
        ethtool -N eth0 flow-type ether \\
            dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
        # launch QEMU instance
        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
            -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1

What do you think?

> 
>> + E.g.:
>> +
>> +    .. parsed-literal::
>> +
>> +        # set number of queues to 1
>> +        ethtool -L eth0 combined 1
>> +        # redirect all the traffic to the second queue (id: 1)
>> +        # note: mlx5 driver requires non-empty key/mask pair.
>> +        ethtool -N eth0 flow-type ether \\
>> +            dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
>> +        ethtool -N eth0 flow-type ether \\
>> +            dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
>> +        # launch QEMU instance
>> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>> +            -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
>> +
>> +    XDP program can also be loaded externally.  In this case 'inhibit' option
>> +    should be set to 'on' and 'xsks-map-fd' provided with a file descriptor
>> +    for an open XDP socket map of that program, or 'sock-fds' with file
>> +    descriptors for already open but not bound XDP sockets already added to a
>> +    socket map for corresponding queues.  One socket per queue.
>> +
>> +    .. parsed-literal::
>> +
>> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>> +            -netdev af-xdp,id=n1,ifname=eth0,queues=3,inhibit=on,sock-fds=15:16:17
>> +
>> +    With 'inhibit=on' and 'sock-fds', QEMU process will only require 32 MB
>> +    of locked memory (RLIMIT_MEMLOCK) per queue or CAP_IPC_LOCK capability.
>> +    With 'inhibit=on' and 'xsks-map-fd' it will additionally require
>> +    CAP_NET_RAW capability.  With 'inhibit=off', CAP_SYS/NET_ADMIN should be
>> +    added as well.
> 
> This requires the synchronization with the code changes, I suggest not
> to describe:
> 
> 1) actual numbers of memory requirement
> 2) capabilities required for each mode, we don't do that for other
> netdev like tap.

Makes sense.  So, should we just strike the paragraph above entirely?

> 
> Thanks
> 
> 
> 
> 
>> +
>> +
>>  ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]``
>>      Establish a vhost-user netdev, backed by a chardev id. The chardev
>>      should be a unix domain socket backed one. The vhost-user uses a
>> diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
>> index d02b09a4b9..7585c4c4ed 100755
>> --- a/scripts/ci/org.centos/stream/8/x86_64/configure
>> +++ b/scripts/ci/org.centos/stream/8/x86_64/configure
>> @@ -35,6 +35,7 @@
>>  --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
>>  --with-coroutine=ucontext \
>>  --tls-priority=@QEMU,SYSTEM \
>> +--disable-af-xdp \
>>  --disable-attr \
>>  --disable-auth-pam \
>>  --disable-avx2 \
>> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
>> index 7dd5709ef4..162e455309 100644
>> --- a/scripts/meson-buildoptions.sh
>> +++ b/scripts/meson-buildoptions.sh
>> @@ -74,6 +74,7 @@ meson_options_help() {
>>    printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if available'
>>    printf "%s\n" '(unless built with --without-default-features):'
>>    printf "%s\n" ''
>> +  printf "%s\n" '  af-xdp          AF_XDP network backend support'
>>    printf "%s\n" '  alsa            ALSA sound support'
>>    printf "%s\n" '  attr            attr/xattr support'
>>    printf "%s\n" '  auth-pam        PAM access control'
>> @@ -207,6 +208,8 @@ meson_options_help() {
>>  }
>>  _meson_option_parse() {
>>    case $1 in
>> +    --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;;
>> +    --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;;
>>      --enable-alsa) printf "%s" -Dalsa=enabled ;;
>>      --disable-alsa) printf "%s" -Dalsa=disabled ;;
>>      --enable-attr) printf "%s" -Dattr=enabled ;;
>> diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
>> index e39871c7bb..207f7adfb9 100644
>> --- a/tests/docker/dockerfiles/debian-amd64.docker
>> +++ b/tests/docker/dockerfiles/debian-amd64.docker
>> @@ -97,6 +97,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
>>                        libvirglrenderer-dev \
>>                        libvte-2.91-dev \
>>                        libxen-dev \
>> +                      libxdp-dev \
>>                        libzstd-dev \
>>                        llvm \
>>                        locales \
>> --
>> 2.40.1
>>
> 


Re: [PATCH v2] net: add initial support for AF_XDP network backend
Posted by Jason Wang 9 months, 2 weeks ago
On Thu, Jul 20, 2023 at 9:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 7/20/23 09:37, Jason Wang wrote:
> > On Thu, Jul 6, 2023 at 4:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> AF_XDP is a network socket family that allows communication directly
> >> with the network device driver in the kernel, bypassing most or all
> >> of the kernel networking stack.  In the essence, the technology is
> >> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
> >> and works with any network interfaces without driver modifications.
> >> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> >> require access to character devices or unix sockets.  Only access to
> >> the network interface itself is necessary.
> >>
> >> This patch implements a network backend that communicates with the
> >> kernel by creating an AF_XDP socket.  A chunk of userspace memory
> >> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
> >> Fill and Completion) are placed in that memory along with a pool of
> >> memory buffers for the packet data.  Data transmission is done by
> >> allocating one of the buffers, copying packet data into it and
> >> placing the pointer into Tx ring.  After transmission, device will
> >> return the buffer via Completion ring.  On Rx, device will take
> >> a buffer from a pre-populated Fill ring, write the packet data into
> >> it and place the buffer into Rx ring.
> >>
> >> AF_XDP network backend takes on the communication with the host
> >> kernel and the network interface and forwards packets to/from the
> >> peer device in QEMU.
> >>
> >> Usage example:
> >>
> >>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> >>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> >>
> >> XDP program bridges the socket with a network interface.  It can be
> >> attached to the interface in 2 different modes:
> >>
> >> 1. skb - this mode should work for any interface and doesn't require
> >>          driver support.  With a caveat of lower performance.
> >>
> >> 2. native - this does require support from the driver and allows to
> >>             bypass skb allocation in the kernel and potentially use
> >>             zero-copy while getting packets in/out userspace.
> >>
> >> By default, QEMU will try to use native mode and fall back to skb.
> >> Mode can be forced via 'mode' option.  To force 'copy' even in native
> >> mode, use 'force-copy=on' option.  This might be useful if there is
> >> some issue with the driver.
> >>
> >> Option 'queues=N' allows to specify how many device queues should
> >> be open.  Note that all the queues that are not open are still
> >> functional and can receive traffic, but it will not be delivered to
> >> QEMU.  So, the number of device queues should generally match the
> >> QEMU configuration, unless the device is shared with something
> >> else and the traffic re-direction to appropriate queues is correctly
> >> configured on a device level (e.g. with ethtool -N).
> >> 'start-queue=M' option can be used to specify from which queue id
> >> QEMU should start configuring 'N' queues.  It might also be necessary
> >> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
> >> for examples.
> >>
> >> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> >> capabilities in order to load default XSK/XDP programs to the
> >> network interface and configure BPF maps.  It is possible, however,
> >> to run with no capabilities.  For that to work, an external process
> >> with admin capabilities will need to pre-load default XSK program,
> >> create AF_XDP sockets and pass their file descriptors to QEMU process
> >> on startup via 'sock-fds' option.  Network backend will need to be
> >> configured with 'inhibit=on' to avoid loading of the program.
> >> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
> >> or CAP_IPC_LOCK.
> >>
> >> Alternatively, the file descriptor for 'xsks_map' can be passed via
> >> 'xsks-map-fd=N' option instead of passing socket file descriptors.
> >> That will additionally require CAP_NET_RAW on QEMU side.  This is
> >> useful, because 'sock-fds' may not be available with older libxdp.
> >> 'sock-fds' requires libxdp >= 1.4.0.
> >>
> >> There are few performance challenges with the current network backends.
> >>
> >> First is that they do not support IO threads.  This means that data
> >> path is handled by the main thread in QEMU and may slow down other
> >> work or may be slowed down by some other work.  This also means that
> >> taking advantage of multi-queue is generally not possible today.
> >>
> >> Another thing is that data path is going through the device emulation
> >> code, which is not really optimized for performance.  The fastest
> >> "frontend" device is virtio-net.  But it's not optimized for heavy
> >> traffic either, because it expects such use-cases to be handled via
> >> some implementation of vhost (user, kernel, vdpa).  In practice, we
> >> have virtio notifications and rcu lock/unlock on a per-packet basis
> >> and not very efficient accesses to the guest memory.  Communication
> >> channels between backend and frontend devices do not allow passing
> >> more than one packet at a time as well.
> >>
> >> Some of these challenges can be avoided in the future by adding better
> >> batching into device emulation or by implementing vhost-af-xdp variant.
> >>
> >> There are also a few kernel limitations.  AF_XDP sockets do not
> >> support any kinds of checksum or segmentation offloading.  Buffers
> >> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
> >> support implementation for AF_XDP is in progress, but not ready yet.
> >> Also, transmission in all non-zero-copy modes is synchronous, i.e.
> >> done in a syscall.  That doesn't allow high packet rates on virtual
> >> interfaces.
> >>
> >> However, keeping in mind all of these challenges, current implementation
> >> of the AF_XDP backend shows a decent performance while running on top
> >> of a physical NIC with zero-copy support.
> >>
> >> Test setup:
> >>
> >> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
> >> Network backend is configured to open the NIC directly in native mode.
> >> The driver supports zero-copy.  NIC is configured to use 1 queue.
> >>
> >> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
> >> for PPS testing.
> >>
> >> iperf3 result:
> >>  TCP stream      : 19.1 Gbps
> >>
> >> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> >>  Tx only         : 3.4 Mpps
> >>  Rx only         : 2.0 Mpps
> >>  L2 FWD Loopback : 1.5 Mpps
> >>
> >> In skb mode the same setup shows much lower performance, similar to
> >> the setup where pair of physical NICs is replaced with veth pair:
> >>
> >> iperf3 result:
> >>   TCP stream      : 9 Gbps
> >>
> >> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
> >>   Tx only         : 1.2 Mpps
> >>   Rx only         : 1.0 Mpps
> >>   L2 FWD Loopback : 0.7 Mpps
> >>
> >> Results in skb mode or over the veth are close to results of a tap
> >> backend with vhost=on and disabled segmentation offloading bridged
> >> with a NIC.
> >>
> >> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> >
> > Looks good overall, see few comments inline.
>
> Thanks for review!
>
> >
> >> ---
> >>
> >> Version 2:
> >>
> >>   - Added support for running with no capabilities by passing
> >>     pre-created AF_XDP socket file descriptors via 'sock-fds' option.
> >>     Conditionally compiled because it requires unreleased libxdp 1.4.0.
> >>     The last restriction is having 32 MB of RLIMIT_MEMLOCK per queue.
> >>
> >>   - Refined and extended documentation.
> >>
> >>
> >>  MAINTAINERS                                   |   4 +
> >>  hmp-commands.hx                               |   2 +-
> >>  meson.build                                   |  19 +
> >>  meson_options.txt                             |   2 +
> >>  net/af-xdp.c                                  | 570 ++++++++++++++++++
> >>  net/clients.h                                 |   5 +
> >>  net/meson.build                               |   3 +
> >>  net/net.c                                     |   6 +
> >>  qapi/net.json                                 |  60 +-
> >>  qemu-options.hx                               |  83 ++-
> >>  .../ci/org.centos/stream/8/x86_64/configure   |   1 +
> >>  scripts/meson-buildoptions.sh                 |   3 +
> >>  tests/docker/dockerfiles/debian-amd64.docker  |   1 +
> >>  13 files changed, 756 insertions(+), 3 deletions(-)
> >>  create mode 100644 net/af-xdp.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index 7164cf55a1..80d4ba4004 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -2929,6 +2929,10 @@ W: http://info.iet.unipi.it/~luigi/netmap/
> >>  S: Maintained
> >>  F: net/netmap.c
> >>
> >> +AF_XDP network backend
> >> +R: Ilya Maximets <i.maximets@ovn.org>
> >> +F: net/af-xdp.c
> >> +
> >>  Host Memory Backends
> >>  M: David Hildenbrand <david@redhat.com>
> >>  M: Igor Mammedov <imammedo@redhat.com>
> >> diff --git a/hmp-commands.hx b/hmp-commands.hx
> >> index 2cbd0f77a0..af9ffe4681 100644
> >> --- a/hmp-commands.hx
> >> +++ b/hmp-commands.hx
> >> @@ -1295,7 +1295,7 @@ ERST
> >>      {
> >>          .name       = "netdev_add",
> >>          .args_type  = "netdev:O",
> >> -        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user"
> >> +        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user"
> >>  #ifdef CONFIG_VMNET
> >>                        "|vmnet-host|vmnet-shared|vmnet-bridged"
> >>  #endif
> >> diff --git a/meson.build b/meson.build
> >> index a9ba0bfab3..1f8772ea5d 100644
> >> --- a/meson.build
> >> +++ b/meson.build
> >> @@ -1891,6 +1891,18 @@ if libbpf.found() and not cc.links('''
> >>    endif
> >>  endif
> >>
> >> +# libxdp
> >> +libxdp = dependency('libxdp', required: get_option('af_xdp'), method: 'pkg-config')
> >> +if libxdp.found() and \
> >> +      not (libbpf.found() and libbpf.version().version_compare('>=0.7'))
> >> +  libxdp = not_found
> >> +  if get_option('af_xdp').enabled()
> >> +    error('af-xdp support requires libbpf version >= 0.7')
> >
> > Can we simply limit this to 1.4?
>
> This is a check for libbpf, not libxdp.  Or do you think there is no need
> to check libbpf version if we request libxdp version high enough?
> Users may still break the build by installing old libbpf manually even if
> distributions ship more modern versions.
>
> Or do you mean limit the libxdp version to 1.4 in order to avoid conditional
> on HAVE_XSK_UMEM__CREATE_WITH_FD ?

Yes.

> The problem with that is that libxdp 1.4 is a week+ old, so not available in
> any distribution, AFAIK.  Not sure how big of a problem that is though.

It doesn't matter, as this is a brand new backend.  It would simplify
future maintenance if we can get rid of any HAVE_XXX macros.

>
> >
> >
> >> +  else
> >> +    warning('af-xdp support requires libbpf version >= 0.7, disabling')
> >> +  endif
> >> +endif
> >> +
> >>  # libdw
> >>  libdw = not_found
> >>  if not get_option('libdw').auto() or \
> >> @@ -2112,6 +2124,12 @@ config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', get_option('hexagon_idef_pars
> >>  config_host_data.set('CONFIG_LIBATTR', have_old_libattr)
> >>  config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found())
> >>  config_host_data.set('CONFIG_EBPF', libbpf.found())
> >> +config_host_data.set('CONFIG_AF_XDP', libxdp.found())
> >> +if libxdp.found()
> >> +  config_host_data.set('HAVE_XSK_UMEM__CREATE_WITH_FD',
> >> +                       cc.has_function('xsk_umem__create_with_fd',
> >> +                                       dependencies: libxdp))
> >> +endif
> >>  config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found())
> >>  config_host_data.set('CONFIG_LIBISCSI', libiscsi.found())
> >>  config_host_data.set('CONFIG_LIBNFS', libnfs.found())
> >> @@ -4285,6 +4303,7 @@ summary_info += {'PVRDMA support':    have_pvrdma}
> >>  summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
> >>  summary_info += {'libcap-ng support': libcap_ng}
> >>  summary_info += {'bpf support':       libbpf}
> >> +summary_info += {'AF_XDP support':    libxdp}
> >>  summary_info += {'rbd support':       rbd}
> >>  summary_info += {'smartcard support': cacard}
> >>  summary_info += {'U2F support':       u2f}
> >> diff --git a/meson_options.txt b/meson_options.txt
> >> index bbb5c7e886..f4e950ce6a 100644
> >> --- a/meson_options.txt
> >> +++ b/meson_options.txt
> >> @@ -120,6 +120,8 @@ option('avx512bw', type: 'feature', value: 'auto',
> >>  option('keyring', type: 'feature', value: 'auto',
> >>         description: 'Linux keyring support')
> >>
> >> +option('af_xdp', type : 'feature', value : 'auto',
> >> +       description: 'AF_XDP network backend support')
> >>  option('attr', type : 'feature', value : 'auto',
> >>         description: 'attr/xattr support')
> >>  option('auth_pam', type : 'feature', value : 'auto',
> >> diff --git a/net/af-xdp.c b/net/af-xdp.c
> >> new file mode 100644
> >> index 0000000000..265ba6b12e
> >> --- /dev/null
> >> +++ b/net/af-xdp.c
> >> @@ -0,0 +1,570 @@
> >> +/*
> >> + * AF_XDP network backend.
> >> + *
> >> + * Copyright (c) 2023 Red Hat, Inc.
> >> + *
> >> + * Authors:
> >> + *  Ilya Maximets <i.maximets@ovn.org>
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> >> + * See the COPYING file in the top-level directory.
> >> + */
> >> +
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include <bpf/bpf.h>
> >> +#include <inttypes.h>
> >> +#include <linux/if_link.h>
> >> +#include <linux/if_xdp.h>
> >> +#include <net/if.h>
> >> +#include <xdp/xsk.h>
> >> +
> >> +#include "clients.h"
> >> +#include "monitor/monitor.h"
> >> +#include "net/net.h"
> >> +#include "qapi/error.h"
> >> +#include "qemu/cutils.h"
> >> +#include "qemu/error-report.h"
> >> +#include "qemu/iov.h"
> >> +#include "qemu/main-loop.h"
> >> +#include "qemu/memalign.h"
> >> +
> >> +
> >> +typedef struct AFXDPState {
> >> +    NetClientState       nc;
> >> +
> >> +    struct xsk_socket    *xsk;
> >> +    struct xsk_ring_cons rx;
> >> +    struct xsk_ring_prod tx;
> >> +    struct xsk_ring_cons cq;
> >> +    struct xsk_ring_prod fq;
> >> +
> >> +    char                 ifname[IFNAMSIZ];
> >> +    int                  ifindex;
> >> +    bool                 read_poll;
> >> +    bool                 write_poll;
> >> +    uint32_t             outstanding_tx;
> >> +
> >> +    uint64_t             *pool;
> >> +    uint32_t             n_pool;
> >> +    char                 *buffer;
> >> +    struct xsk_umem      *umem;
> >> +
> >> +    uint32_t             n_queues;
> >> +    uint32_t             xdp_flags;
> >> +    bool                 inhibit;
> >> +} AFXDPState;
> >> +
> >> +#define AF_XDP_BATCH_SIZE 64
> >> +
> >> +static void af_xdp_send(void *opaque);
> >> +static void af_xdp_writable(void *opaque);
> >> +
> >> +/* Set the event-loop handlers for the af-xdp backend. */
> >> +static void af_xdp_update_fd_handler(AFXDPState *s)
> >> +{
> >> +    qemu_set_fd_handler(xsk_socket__fd(s->xsk),
> >> +                        s->read_poll ? af_xdp_send : NULL,
> >> +                        s->write_poll ? af_xdp_writable : NULL,
> >> +                        s);
> >> +}
> >> +
> >> +/* Update the read handler. */
> >> +static void af_xdp_read_poll(AFXDPState *s, bool enable)
> >> +{
> >> +    if (s->read_poll != enable) {
> >> +        s->read_poll = enable;
> >> +        af_xdp_update_fd_handler(s);
> >> +    }
> >> +}
> >> +
> >> +/* Update the write handler. */
> >> +static void af_xdp_write_poll(AFXDPState *s, bool enable)
> >> +{
> >> +    if (s->write_poll != enable) {
> >> +        s->write_poll = enable;
> >> +        af_xdp_update_fd_handler(s);
> >> +    }
> >> +}
> >> +
> >> +static void af_xdp_poll(NetClientState *nc, bool enable)
> >> +{
> >> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> >> +
> >> +    if (s->read_poll != enable || s->write_poll != enable) {
> >> +        s->write_poll = enable;
> >> +        s->read_poll  = enable;
> >> +        af_xdp_update_fd_handler(s);
> >> +    }
> >> +}
> >> +
> >> +static void af_xdp_complete_tx(AFXDPState *s)
> >> +{
> >> +    uint32_t idx = 0;
> >> +    uint32_t done, i;
> >> +    uint64_t *addr;
> >> +
> >> +    done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, &idx);
> >> +
> >> +    for (i = 0; i < done; i++) {
> >> +        addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++);
> >> +        s->pool[s->n_pool++] = *addr;
> >> +        s->outstanding_tx--;
> >> +    }
> >> +
> >> +    if (done) {
> >> +        xsk_ring_cons__release(&s->cq, done);
> >> +    }
> >> +}
> >> +
> >> +/*
> >> + * The fd_write() callback, invoked if the fd is marked as writable
> >> + * after a poll.
> >> + */
> >> +static void af_xdp_writable(void *opaque)
> >> +{
> >> +    AFXDPState *s = opaque;
> >> +
> >> +    /* Try to recover buffers that are already sent. */
> >> +    af_xdp_complete_tx(s);
> >> +
> >> +    /*
> >> +     * Unregister the handler, unless we still have packets to transmit
> >> +     * and kernel needs a wake up.
> >> +     */
> >> +    if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) {
> >> +        af_xdp_write_poll(s, false);
> >> +    }
> >> +
> >> +    /* Flush any buffered packets. */
> >> +    qemu_flush_queued_packets(&s->nc);
> >> +}
> >> +
> >> +static ssize_t af_xdp_receive(NetClientState *nc,
> >> +                              const uint8_t *buf, size_t size)
> >> +{
> >> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> >> +    struct xdp_desc *desc;
> >> +    uint32_t idx;
> >> +    void *data;
> >> +
> >> +    /* Try to recover buffers that are already sent. */
> >> +    af_xdp_complete_tx(s);
> >> +
> >> +    if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) {
> >> +        /* We can't transmit packet this size... */
> >> +        return size;
> >> +    }
> >> +
> >> +    if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) {
> >> +        /*
> >> +         * Out of buffers or space in tx ring.  Poll until we can write.
> >> +         * This will also kick the Tx, if it was waiting on CQ.
> >> +         */
> >> +        af_xdp_write_poll(s, true);
> >> +        return 0;
> >> +    }
> >> +
> >> +    desc = xsk_ring_prod__tx_desc(&s->tx, idx);
> >> +    desc->addr = s->pool[--s->n_pool];
> >> +    desc->len = size;
> >> +
> >> +    data = xsk_umem__get_data(s->buffer, desc->addr);
> >> +    memcpy(data, buf, size);
> >> +
> >> +    xsk_ring_prod__submit(&s->tx, 1);
> >> +    s->outstanding_tx++;
> >> +
> >> +    if (xsk_ring_prod__needs_wakeup(&s->tx)) {
> >> +        af_xdp_write_poll(s, true);
> >> +    }
> >> +
> >> +    return size;
> >> +}
> >> +
> >> +/*
> >> + * Complete a previous send (backend --> guest) and enable the
> >> + * fd_read callback.
> >> + */
> >> +static void af_xdp_send_completed(NetClientState *nc, ssize_t len)
> >> +{
> >> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> >> +
> >> +    af_xdp_read_poll(s, true);
> >> +}
> >> +
> >> +static void af_xdp_fq_refill(AFXDPState *s, uint32_t n)
> >> +{
> >> +    uint32_t i, idx = 0;
> >> +
> >> +    /* Leave one packet for Tx, just in case. */
> >> +    if (s->n_pool < n + 1) {
> >> +        n = s->n_pool;
> >> +    }
> >> +
> >> +    if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) {
> >> +        return;
> >> +    }
> >> +
> >> +    for (i = 0; i < n; i++) {
> >> +        *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool];
> >> +    }
> >> +    xsk_ring_prod__submit(&s->fq, n);
> >> +
> >> +    if (xsk_ring_prod__needs_wakeup(&s->fq)) {
> >> +        /* Receive was blocked by not having enough buffers.  Wake it up. */
> >> +        af_xdp_read_poll(s, true);
> >> +    }
> >> +}
> >> +
> >> +static void af_xdp_send(void *opaque)
> >> +{
> >> +    uint32_t i, n_rx, idx = 0;
> >> +    AFXDPState *s = opaque;
> >> +
> >> +    n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx);
> >> +    if (!n_rx) {
> >> +        return;
> >> +    }
> >> +
> >> +    for (i = 0; i < n_rx; i++) {
> >> +        const struct xdp_desc *desc;
> >> +        struct iovec iov;
> >> +
> >> +        desc = xsk_ring_cons__rx_desc(&s->rx, idx++);
> >> +
> >> +        iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr);
> >> +        iov.iov_len = desc->len;
> >> +
> >> +        s->pool[s->n_pool++] = desc->addr;
> >> +
> >> +        if (!qemu_sendv_packet_async(&s->nc, &iov, 1,
> >> +                                     af_xdp_send_completed)) {
> >> +            /*
> >> +             * The peer does not receive anymore.  Packet is queued, stop
> >> +             * reading from the backend until af_xdp_send_completed().
> >> +             */
> >> +            af_xdp_read_poll(s, false);
> >> +
> >> +            /* Re-peek the descriptors to not break the ring cache. */
> >> +            xsk_ring_cons__cancel(&s->rx, n_rx);
> >> +            n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx);
> >
> > The code turns out to be hard to read here.
> >
> > 1) This seems to undo the peek (usually a peek doesn't touch the
> > producer/consumer state, but that is not how xsk_ring_cons__peek() behaves):
>
> Yeah, it's unfortunate that the peek() function changes the internal
> state, but that is what we have...
>
> >
> > static inline __u32 xsk_ring_cons__peek(struct xsk_ring_cons *cons,
> >                                         __u32 nb, __u32 *idx)
> > {
> >         __u32 entries = xsk_cons_nb_avail(cons, nb);
> >
> >         if (entries > 0) {
> >                 *idx = cons->cached_cons;
> >                 cons->cached_cons += entries;
> >         }
> >
> >         return entries;
> > }
> >
> > 2) It looks to me a partial rollback is sufficient?
> >
> > xsk_ring_cons__cancel(n_rx - i + 1)?
>
> Good point.  Should work.  It should be n_rx - i - 1 though, if I'm not
> mistaken.  So:
>
>     xsk_ring_cons__cancel(&s->rx, n_rx - i - 1);
>     n_rx = i + 1;
>
> I'm not sure if that is much easier to read, but that's OK.  Should be
> a touch faster as well.  What do you think?

Let's do that please.

>
> >
> >> +            g_assert(n_rx == i + 1);
> >> +            break;
> >> +        }
> >> +    }
> >> +
> >> +    /* Release actually sent descriptors and try to re-fill. */
> >> +    xsk_ring_cons__release(&s->rx, n_rx);
> >> +    af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE);
> >> +}
> >> +
> >> +/* Flush and close. */
> >> +static void af_xdp_cleanup(NetClientState *nc)
> >> +{
> >> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
> >> +
> >> +    qemu_purge_queued_packets(nc);
> >> +
> >> +    af_xdp_poll(nc, false);
> >> +
> >> +    xsk_socket__delete(s->xsk);
> >> +    s->xsk = NULL;
> >> +    g_free(s->pool);
> >> +    s->pool = NULL;
> >> +    xsk_umem__delete(s->umem);
> >> +    s->umem = NULL;
> >> +    qemu_vfree(s->buffer);
> >> +    s->buffer = NULL;
> >> +
> >> +    /* Remove the program if it's the last open queue. */
> >> +    if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags
> >> +        && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) {
> >> +        fprintf(stderr,
> >> +                "af-xdp: unable to remove XDP program from '%s', ifindex: %d\n",
> >> +                s->ifname, s->ifindex);
> >> +    }
> >> +}
> >> +
> >> +static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp)
> >> +{
> >> +    struct xsk_umem_config config = {
> >> +        .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
> >> +        .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
> >> +        .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
> >> +        .frame_headroom = 0,
> >> +    };
> >> +    uint64_t n_descs;
> >> +    uint64_t size;
> >> +    int64_t i;
> >> +    int ret;
> >> +
> >> +    /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. */
> >> +    n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS
> >> +               + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2;
> >> +    size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE;
> >> +
> >> +    s->buffer = qemu_memalign(qemu_real_host_page_size(), size);
> >> +    memset(s->buffer, 0, size);
> >> +
> >> +    if (sock_fd < 0) {
> >> +        ret = xsk_umem__create(&s->umem, s->buffer, size,
> >> +                               &s->fq, &s->cq, &config);
> >> +    } else {
> >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> >> +        ret = xsk_umem__create_with_fd(&s->umem, sock_fd, s->buffer, size,
> >> +                                       &s->fq, &s->cq, &config);
> >> +#else
> >
> > So sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD won't work. We'd better
> >
> > 1) disable sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD
>
> The qapi property is conditionally defined, so users will not be able to
> set sock-fds if not supported.  And qemu will complain about unknown
> property.  That should be enough?

Yes.

>
> >
> > or
> >
> > 2) disable af_xdp without HAVE_XSK_UMEM__CREATE_WITH_FD
>
> If we require libxdp 1.4 that will be the case.
>
> >
> >> +        ret = -1;
> >> +        errno = EINVAL;
> >> +#endif
> >> +    }
> >> +
> >> +    if (ret) {
> >> +        qemu_vfree(s->buffer);
> >> +        error_setg_errno(errp, errno,
> >> +                         "failed to create umem for %s queue_index: %d",
> >> +                         s->ifname, s->nc.queue_index);
> >> +        return -1;
> >> +    }
> >> +
> >> +    s->pool = g_new(uint64_t, n_descs);
> >> +    /* Fill the pool in the opposite order, because it's a LIFO queue. */
> >> +    for (i = n_descs - 1; i >= 0; i--) {
> >> +        s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE;
> >> +    }
> >> +    s->n_pool = n_descs;
> >> +
> >> +    af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS);
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static int af_xdp_socket_create(AFXDPState *s,
> >> +                                const NetdevAFXDPOptions *opts,
> >> +                                int xsks_map_fd, Error **errp)
> >> +{
> >> +    struct xsk_socket_config cfg = {
> >> +        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
> >> +        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
> >> +        .libxdp_flags = 0,
> >> +        .bind_flags = XDP_USE_NEED_WAKEUP,
> >> +        .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
> >> +    };
> >> +    int queue_id, error = 0;
> >> +
> >> +    s->inhibit = opts->has_inhibit && opts->inhibit;
> >> +    if (s->inhibit) {
> >> +        cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD;
> >> +    }
> >> +
> >> +    if (opts->has_force_copy && opts->force_copy) {
> >> +        cfg.bind_flags |= XDP_COPY;
> >> +    }
> >> +
> >> +    queue_id = s->nc.queue_index;
> >> +    if (opts->has_start_queue && opts->start_queue > 0) {
> >> +        queue_id += opts->start_queue;
> >> +    }
> >> +
> >> +    if (opts->has_mode) {
> >> +        /* Specific mode requested. */
> >> +        cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE)
> >> +                         ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;
> >> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
> >> +                               s->umem, &s->rx, &s->tx, &cfg)) {
> >> +            error = errno;
> >> +        }
> >> +    } else {
> >> +        /* No mode requested, try native first. */
> >> +        cfg.xdp_flags |= XDP_FLAGS_DRV_MODE;
> >> +
> >> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
> >> +                               s->umem, &s->rx, &s->tx, &cfg)) {
> >> +            /* Can't use native mode, try skb. */
> >> +            cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE;
> >> +            cfg.xdp_flags |= XDP_FLAGS_SKB_MODE;
> >> +
> >> +            if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
> >> +                                   s->umem, &s->rx, &s->tx, &cfg)) {
> >> +                error = errno;
> >> +            }
> >> +        }
> >> +    }
> >> +
> >> +    if (error) {
> >> +        error_setg_errno(errp, error,
> >> +                         "failed to create AF_XDP socket for %s queue_id: %d",
> >> +                         s->ifname, queue_id);
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (s->inhibit && xsks_map_fd >= 0) {
> >> +        int xsk_fd = xsk_socket__fd(s->xsk);
> >> +
> >> +        /* Need to update the map manually, libxdp skipped that step. */
> >> +        error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0);
> >> +        if (error) {
> >> +            error_setg_errno(errp, error,
> >> +                             "failed to update xsks map for %s queue_id: %d",
> >> +                             s->ifname, queue_id);
> >> +            return -1;
> >> +        }
> >> +    }
> >> +
> >> +    s->xdp_flags = cfg.xdp_flags;
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +/* NetClientInfo methods. */
> >> +static NetClientInfo net_af_xdp_info = {
> >> +    .type = NET_CLIENT_DRIVER_AF_XDP,
> >> +    .size = sizeof(AFXDPState),
> >> +    .receive = af_xdp_receive,
> >> +    .poll = af_xdp_poll,
> >> +    .cleanup = af_xdp_cleanup,
> >> +};
> >> +
> >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> >> +static int *parse_socket_fds(const char *sock_fds_str,
> >> +                             int64_t n_expected, Error **errp)
> >> +{
> >> +    gchar **substrings = g_strsplit(sock_fds_str, ":", -1);
> >> +    int64_t i, n_sock_fds = g_strv_length(substrings);
> >> +    int *sock_fds = NULL;
> >> +
> >> +    if (n_sock_fds != n_expected) {
> >> +        error_setg(errp, "expected %"PRIi64" socket fds, got %"PRIi64,
> >> +                   n_expected, n_sock_fds);
> >> +        goto exit;
> >> +    }
> >> +
> >> +    sock_fds = g_new(int, n_sock_fds);
> >> +
> >> +    for (i = 0; i < n_sock_fds; i++) {
> >> +        sock_fds[i] = monitor_fd_param(monitor_cur(), substrings[i], errp);
> >> +        if (sock_fds[i] < 0) {
> >> +            g_free(sock_fds);
> >> +            sock_fds = NULL;
> >> +            goto exit;
> >> +        }
> >> +    }
> >> +
> >> +exit:
> >> +    g_strfreev(substrings);
> >> +    return sock_fds;
> >> +}
> >> +#endif
> >> +
> >> +/*
> >> + * The exported init function.
> >> + *
> >> + * ... -net af-xdp,ifname="..."
> >
> > This is the legacy command line, let's say -netdev af-xdp,...
>
> Sure.
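
For readers following the 'sock-fds' parsing quoted above, here is a minimal standalone sketch of the same idea: split a colon-separated list and require exactly one fd per queue. It is an illustration only, not the patch code. The name parse_fd_list() is hypothetical, and it accepts numeric descriptors only, whereas the patch goes through monitor_fd_param(), which can also resolve fd names known to the monitor.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the patch): parse "15:16:17" into a
 * newly allocated array of n_expected descriptors.  Returns NULL on any
 * malformed element or if the element count differs from n_expected.
 * The caller frees the result with free().
 */
static int *parse_fd_list(const char *str, size_t n_expected)
{
    const char *p = str;
    size_t n = 0;
    int *fds;

    if (!str || n_expected == 0) {
        return NULL;
    }

    fds = calloc(n_expected, sizeof(*fds));
    if (!fds) {
        return NULL;
    }

    for (;;) {
        char *end;
        long val;

        errno = 0;
        val = strtol(p, &end, 10);
        /* Reject empty elements, overflow, negatives and extra elements. */
        if (end == p || errno || val < 0 || val > INT_MAX
            || n == n_expected) {
            goto fail;
        }
        fds[n++] = (int)val;

        if (*end == '\0') {
            break;
        }
        if (*end != ':') {
            goto fail;
        }
        p = end + 1;
    }

    if (n != n_expected) {
        goto fail;
    }
    return fds;

fail:
    free(fds);
    return NULL;
}
```

With queues=3, a string such as "15:16:17" yields three descriptors, while "15:16" or "15:x:17" is rejected, which matches the "expected N socket fds" check in the quoted function.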
>
> >
> >> + */
> >> +int net_init_af_xdp(const Netdev *netdev,
> >> +                    const char *name, NetClientState *peer, Error **errp)
> >> +{
> >> +    const NetdevAFXDPOptions *opts = &netdev->u.af_xdp;
> >> +    NetClientState *nc, *nc0 = NULL;
> >> +    unsigned int ifindex;
> >> +    uint32_t prog_id = 0;
> >> +    int *sock_fds = NULL;
> >> +    int xsks_map_fd = -1;
> >> +    int64_t i, queues;
> >> +    Error *err = NULL;
> >> +    AFXDPState *s;
> >> +
> >> +    ifindex = if_nametoindex(opts->ifname);
> >> +    if (!ifindex) {
> >> +        error_setg_errno(errp, errno, "failed to get ifindex for '%s'",
> >> +                         opts->ifname);
> >> +        return -1;
> >> +    }
> >> +
> >> +    queues = opts->has_queues ? opts->queues : 1;
> >> +    if (queues < 1) {
> >> +        error_setg(errp, "invalid number of queues (%" PRIi64 ") for '%s'",
> >> +                   queues, opts->ifname);
> >> +        return -1;
> >> +    }
> >> +
> >> +#ifndef HAVE_XSK_UMEM__CREATE_WITH_FD
> >> +    if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) {
> >> +        error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' together");
> >> +        return -1;
> >> +    }
> >> +#else
> >> +    if ((opts->has_inhibit && opts->inhibit)
> >> +        != (opts->xsks_map_fd || opts->sock_fds)) {
> >> +        error_setg(errp, "'inhibit=on' should be used together with "
> >> +                         "'sock-fds' or 'xsks-map-fd'");
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (opts->xsks_map_fd && opts->sock_fds) {
> >> +        error_setg(errp, "'sock-fds' and 'xsks-map-fd' are mutually exclusive");
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (opts->sock_fds) {
> >> +        sock_fds = parse_socket_fds(opts->sock_fds, queues, errp);
> >> +        if (!sock_fds) {
> >> +            return -1;
> >> +        }
> >> +    }
> >> +#endif
> >> +
> >> +    if (opts->xsks_map_fd) {
> >> +        xsks_map_fd = monitor_fd_param(monitor_cur(), opts->xsks_map_fd, errp);
> >> +        if (xsks_map_fd < 0) {
> >> +            goto err;
> >> +        }
> >> +    }
> >> +
> >> +    for (i = 0; i < queues; i++) {
> >> +        nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name);
> >> +        qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname);
> >> +        nc->queue_index = i;
> >> +
> >> +        if (!nc0) {
> >> +            nc0 = nc;
> >> +        }
> >> +
> >> +        s = DO_UPCAST(AFXDPState, nc, nc);
> >> +
> >> +        pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname);
> >> +        s->ifindex = ifindex;
> >> +        s->n_queues = queues;
> >> +
> >> +        if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp)
> >> +            || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) {
> >> +            /* Make sure the XDP program will be removed. */
> >> +            s->n_queues = i;
> >> +            error_propagate(errp, err);
> >> +            goto err;
> >> +        }
> >> +    }
> >> +
> >> +    if (nc0) {
> >> +        s = DO_UPCAST(AFXDPState, nc, nc0);
> >> +        if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || !prog_id) {
> >> +            error_setg_errno(errp, errno,
> >> +                             "no XDP program loaded on '%s', ifindex: %d",
> >> +                             s->ifname, s->ifindex);
> >> +            goto err;
> >> +        }
> >> +    }
> >> +
> >> +    af_xdp_read_poll(s, true); /* Initially only poll for reads. */
> >> +
> >> +    return 0;
> >> +
> >> +err:
> >> +    g_free(sock_fds);
> >> +    if (nc0) {
> >> +        qemu_del_net_client(nc0);
> >> +    }
> >> +
> >> +    return -1;
> >> +}
> >> diff --git a/net/clients.h b/net/clients.h
> >> index ed8bdfff1e..be53794582 100644
> >> --- a/net/clients.h
> >> +++ b/net/clients.h
> >> @@ -64,6 +64,11 @@ int net_init_netmap(const Netdev *netdev, const char *name,
> >>                      NetClientState *peer, Error **errp);
> >>  #endif
> >>
> >> +#ifdef CONFIG_AF_XDP
> >> +int net_init_af_xdp(const Netdev *netdev, const char *name,
> >> +                    NetClientState *peer, Error **errp);
> >> +#endif
> >> +
> >>  int net_init_vhost_user(const Netdev *netdev, const char *name,
> >>                          NetClientState *peer, Error **errp);
> >>
> >> diff --git a/net/meson.build b/net/meson.build
> >> index bdf564a57b..61628d4684 100644
> >> --- a/net/meson.build
> >> +++ b/net/meson.build
> >> @@ -36,6 +36,9 @@ system_ss.add(when: vde, if_true: files('vde.c'))
> >>  if have_netmap
> >>    system_ss.add(files('netmap.c'))
> >>  endif
> >> +
> >> +system_ss.add(when: libxdp, if_true: files('af-xdp.c'))
> >> +
> >>  if have_vhost_net_user
> >>    system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('vhost-user.c'), if_false: files('vhost-user-stub.c'))
> >>    system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c'))
> >> diff --git a/net/net.c b/net/net.c
> >> index 6492ad530e..127f70932b 100644
> >> --- a/net/net.c
> >> +++ b/net/net.c
> >> @@ -1082,6 +1082,9 @@ static int (* const net_client_init_fun[NET_CLIENT_DRIVER__MAX])(
> >>  #ifdef CONFIG_NETMAP
> >>          [NET_CLIENT_DRIVER_NETMAP]    = net_init_netmap,
> >>  #endif
> >> +#ifdef CONFIG_AF_XDP
> >> +        [NET_CLIENT_DRIVER_AF_XDP]    = net_init_af_xdp,
> >> +#endif
> >>  #ifdef CONFIG_NET_BRIDGE
> >>          [NET_CLIENT_DRIVER_BRIDGE]    = net_init_bridge,
> >>  #endif
> >> @@ -1186,6 +1189,9 @@ void show_netdevs(void)
> >>  #ifdef CONFIG_NETMAP
> >>          "netmap",
> >>  #endif
> >> +#ifdef CONFIG_AF_XDP
> >> +        "af-xdp",
> >> +#endif
> >>  #ifdef CONFIG_POSIX
> >>          "vhost-user",
> >>  #endif
> >> diff --git a/qapi/net.json b/qapi/net.json
> >> index db67501308..88f2c982c2 100644
> >> --- a/qapi/net.json
> >> +++ b/qapi/net.json
> >> @@ -408,6 +408,62 @@
> >>      'ifname':     'str',
> >>      '*devname':    'str' } }
> >>
> >> +##
> >> +# @AFXDPMode:
> >> +#
> >> +# Attach mode for a default XDP program
> >> +#
> >> +# @skb: generic mode, no driver support necessary
> >> +#
> >> +# @native: DRV mode, program is attached to a driver, packets are passed to
> >> +#     the socket without allocation of skb.
> >> +#
> >> +# Since: 8.1
> >
> > I'd make it for 8.2.
>
> OK.
>
> >
> >> +##
> >> +{ 'enum': 'AFXDPMode',
> >> +  'data': [ 'native', 'skb' ] }
> >> +
> >> +##
> >> +# @NetdevAFXDPOptions:
> >> +#
> >> +# AF_XDP network backend
> >> +#
> >> +# @ifname: The name of an existing network interface.
> >> +#
> >> +# @mode: Attach mode for a default XDP program.  If not specified, then
> >> +#     'native' will be tried first, then 'skb'.
> >> +#
> >> +# @force-copy: Force XDP copy mode even if device supports zero-copy.
> >> +#     (default: false)
> >> +#
> >> +# @queues: number of queues to be used for multiqueue interfaces (default: 1).
> >> +#
> >> +# @start-queue: Use @queues starting from this queue number (default: 0).
> >> +#
> >> +# @inhibit: Don't load a default XDP program, use one already loaded to
> >> +#     the interface (default: false).  Requires @sock-fds or @xsks-map-fd.
> >> +#
> >> +# @sock-fds: A colon (:) separated list of file descriptors for already open
> >> +#     but not bound AF_XDP sockets in the queue order.  One fd per queue.
> >> +#     These descriptors should already be added into XDP socket map for
> >> +#     corresponding queues.  Requires @inhibit.
> >> +#
> >> +# @xsks-map-fd: A file descriptor for an already open XDP socket map in
> >> +#     the already loaded XDP program.  Requires @inhibit.
> >> +#
> >> +# Since: 8.1
> >> +##
> >> +{ 'struct': 'NetdevAFXDPOptions',
> >> +  'data': {
> >> +    'ifname':       'str',
> >> +    '*mode':        'AFXDPMode',
> >> +    '*force-copy':  'bool',
> >> +    '*queues':      'int',
> >> +    '*start-queue': 'int',
> >> +    '*inhibit':     'bool',
> >> +    '*sock-fds':    { 'type': 'str', 'if': 'HAVE_XSK_UMEM__CREATE_WITH_FD' },
>
> The parameter is defined conditionally here and it will not be
> compiled in, if HAVE_XSK_UMEM__CREATE_WITH_FD is not defined.

Right, I missed that.
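
To make the two build configurations easier to compare, the option checks from net_init_af_xdp() can be sketched as a single function. This is a simplified model, not the patch code: struct xdp_opts and check_inhibit_opts() are hypothetical names, and the compile-time HAVE_XSK_UMEM__CREATE_WITH_FD switch is modelled as a runtime flag so that both variants can be exercised in one program.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the NetdevAFXDPOptions fields used here. */
struct xdp_opts {
    bool inhibit;
    const char *xsks_map_fd;   /* NULL if the option was not given */
    const char *sock_fds;      /* NULL if the option was not given */
};

/*
 * Returns NULL if the combination is acceptable, otherwise an error
 * string.  'have_sock_fds' models HAVE_XSK_UMEM__CREATE_WITH_FD: when it
 * is false the 'sock-fds' member does not exist in the real struct, so
 * the sketch ignores it.
 */
static const char *check_inhibit_opts(const struct xdp_opts *o,
                                      bool have_sock_fds)
{
    bool has_map = o->xsks_map_fd != NULL;
    bool has_socks = have_sock_fds && o->sock_fds != NULL;

    /* 'inhibit=on' must come with exactly the options that need it. */
    if (o->inhibit != (has_map || has_socks)) {
        return have_sock_fds
               ? "'inhibit=on' should be used together with "
                 "'sock-fds' or 'xsks-map-fd'"
               : "expected 'inhibit=on' and 'xsks-map-fd' together";
    }

    if (has_map && has_socks) {
        return "'sock-fds' and 'xsks-map-fd' are mutually exclusive";
    }

    return NULL;
}
```

In the model, 'inhibit=on,sock-fds=15:16' passes when fd passing is available and fails when it is not, because only 'xsks-map-fd' can satisfy the first check there.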

>
> >> +    '*xsks-map-fd': 'str' } }
> >> +
> >>  ##
> >>  # @NetdevVhostUserOptions:
> >>  #
> >> @@ -642,13 +698,14 @@
> >>  # @vmnet-bridged: since 7.1
> >>  # @stream: since 7.2
> >>  # @dgram: since 7.2
> >> +# @af-xdp: since 8.1
> >>  #
> >>  # Since: 2.7
> >>  ##
> >>  { 'enum': 'NetClientDriver',
> >>    'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream',
> >>              'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user',
> >> -            'vhost-vdpa',
> >> +            'vhost-vdpa', 'af-xdp',
> >>              { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' },
> >>              { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' },
> >>              { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] }
> >> @@ -680,6 +737,7 @@
> >>      'bridge':   'NetdevBridgeOptions',
> >>      'hubport':  'NetdevHubPortOptions',
> >>      'netmap':   'NetdevNetmapOptions',
> >> +    'af-xdp':   'NetdevAFXDPOptions',
> >>      'vhost-user': 'NetdevVhostUserOptions',
> >>      'vhost-vdpa': 'NetdevVhostVDPAOptions',
> >>      'vmnet-host': { 'type': 'NetdevVmnetHostOptions',
> >> diff --git a/qemu-options.hx b/qemu-options.hx
> >> index b57489d7ca..d91610701c 100644
> >> --- a/qemu-options.hx
> >> +++ b/qemu-options.hx
> >> @@ -2856,6 +2856,25 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
> >>      "                VALE port (created on the fly) called 'name' ('nmname' is name of the \n"
> >>      "                netmap device, defaults to '/dev/netmap')\n"
> >>  #endif
> >> +#ifdef CONFIG_AF_XDP
> >> +    "-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n"
> >> +    "         [,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k]\n"
> >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> >> +    "         [,sock-fds=x:y:...:z]\n"
> >> +#endif
> >> +    "                attach to the existing network interface 'name' with AF_XDP socket\n"
> >> +    "                use 'mode=MODE' to specify an XDP program attach mode\n"
> >> +    "                use 'force-copy=on|off' to force XDP copy mode even if device supports zero-copy (default: off)\n"
> >> +    "                use 'inhibit=on|off' to inhibit loading of a default XDP program (default: off)\n"
> >> +    "                with inhibit=on,\n"
> >> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
> >> +    "                  use 'sock-fds' to provide file descriptors for already open XDP sockets\n"
> >> +    "                  added to a socket map in XDP program.  One socket per queue.  Or\n"
> >> +#endif
> >> +    "                  use 'xsks-map-fd=k' to provide a file descriptor for xsks map\n"
> >> +    "                use 'queues=n' to specify how many queues of a multiqueue interface should be used\n"
> >> +    "                use 'start-queue=m' to specify the first queue that should be used\n"
> >> +#endif
> >>  #ifdef CONFIG_POSIX
> >>      "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n"
> >>      "                configure a vhost-user network, backed by a chardev 'dev'\n"
> >> @@ -2901,6 +2920,9 @@ DEF("nic", HAS_ARG, QEMU_OPTION_nic,
> >>  #ifdef CONFIG_NETMAP
> >>      "netmap|"
> >>  #endif
> >> +#ifdef CONFIG_AF_XDP
> >> +    "af-xdp|"
> >> +#endif
> >>  #ifdef CONFIG_POSIX
> >>      "vhost-user|"
> >>  #endif
> >> @@ -2929,6 +2951,9 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
> >>  #ifdef CONFIG_NETMAP
> >>      "netmap|"
> >>  #endif
> >> +#ifdef CONFIG_AF_XDP
> >> +    "af-xdp|"
> >> +#endif
> >>  #ifdef CONFIG_VMNET
> >>      "vmnet-host|vmnet-shared|vmnet-bridged|"
> >>  #endif
> >> @@ -2936,7 +2961,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
> >>      "                old way to initialize a host network interface\n"
> >>      "                (use the -netdev option if possible instead)\n", QEMU_ARCH_ALL)
> >>  SRST
> >> -``-nic [tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
> >> +``-nic [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
> >>      This option is a shortcut for configuring both the on-board
> >>      (default) guest NIC hardware and the host network backend in one go.
> >>      The host backend options are the same as with the corresponding
> >> @@ -3350,6 +3375,62 @@ SRST
> >>          # launch QEMU instance
> >>          |qemu_system| linux.img -nic vde,sock=/tmp/myswitch
> >>
> >> +``-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k][,sock-fds=x:y:...:z]``
> >> +    Configure AF_XDP backend to connect to a network interface 'name'
> >> +    using AF_XDP socket.  A specific program attach mode for a default
> >> +    XDP program can be forced with 'mode', defaults to best-effort,
> >> +    where the likely most performant mode will be in use.  Number of queues
> >> +    'n' should generally match the number of queues in the interface,
> >> +    defaults to 1.  Traffic arriving on non-configured device queues will
> >> +    not be delivered to the network backend.
> >> +
> >> +    .. parsed-literal::
> >> +
> >> +        # set number of queues to 4
> >> +        ethtool -L eth0 combined 4
> >> +        # launch QEMU instance
> >> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
> >> +            -netdev af-xdp,id=n1,ifname=eth0,queues=4
> >> +
> >> +    'start-queue' option can be specified if a particular range of queues
> >> +    [m, m + n - 1] should be in use.  For example, this is necessary in order
> >> +    to use MLX NICs in native mode.  The driver will create a separate set
> >> +    of queues on top of regular ones, and only these queues can be used
> >> +    for AF_XDP sockets.  MLX NICs will also require an additional traffic
> >> +    redirection with ethtool to these queues.
> >
> > Let's avoid mentioning any vendor name unless the emulation is done
> > for a specific one.
>
> AFAIK, MLX/NVIDIA is the only vendor that implements XDP queues in this
> strange fashion, so these are the only NICs that will not work without
> start-queue configuration and the extra traffic re-direction.  Unfortunately,
> this is also not documented anywhere, so you just have to know that it works
> this way...
>
> So, on one hand, I understand that it's not the place for vendor-specific
> docs.  On the other, it gives qemu users a chance to not waste a lot of time
> trying to figure out why everything is configured correctly, but the traffic
> doesn't flow.
>
> Maybe something like this:
>
>     'start-queue' option can be specified if a particular range of queues
>     [m, m + n - 1] should be in use.  For example, this may be necessary in
>     order to use certain NICs in native mode.  Kernel allows the driver to
>     create a separate set of XDP queues on top of regular ones, and only
>     these queues can be used for AF_XDP sockets.  NICs that work this way
>     may also require an additional traffic redirection with ethtool to these
>     special queues.
>
>     .. parsed-literal::
>
>         # set number of queues to 1
>         ethtool -L eth0 combined 1
>         # redirect all the traffic to the second queue (id: 1)
>         # note: drivers may require non-empty key/mask pair.
>         ethtool -N eth0 flow-type ether \\
>             dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
>         ethtool -N eth0 flow-type ether \\
>             dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
>         # launch QEMU instance
>         |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>             -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
>
> What do you think?

This is great.
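
As a small illustration of how 'queues=n' and 'start-queue=m' combine: going by the quoted af_xdp_socket_create(), backend queue i binds to device queue m + i, so n queues cover device queues m through m + n - 1. The helper names below are hypothetical and exist only for this sketch.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the quoted code: the device queue id is
 * the backend queue index, shifted by 'start-queue' when that is > 0.
 */
static int64_t xdp_device_queue_id(int64_t start_queue, int64_t queue_index)
{
    return start_queue > 0 ? start_queue + queue_index : queue_index;
}

/* Last device queue used by a backend configured with queues >= 1. */
static int64_t xdp_last_queue_id(int64_t start_queue, int64_t queues)
{
    return xdp_device_queue_id(start_queue, queues - 1);
}
```

For the example above ('queues=1,start-queue=1') both the first and the last device queue are queue 1, which is the queue the ethtool rules redirect traffic to.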

>
> >
> >> + E.g.:
> >> +
> >> +    .. parsed-literal::
> >> +
> >> +        # set number of queues to 1
> >> +        ethtool -L eth0 combined 1
> >> +        # redirect all the traffic to the second queue (id: 1)
> >> +        # note: mlx5 driver requires non-empty key/mask pair.
> >> +        ethtool -N eth0 flow-type ether \\
> >> +            dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
> >> +        ethtool -N eth0 flow-type ether \\
> >> +            dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
> >> +        # launch QEMU instance
> >> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
> >> +            -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
> >> +
> >> +    XDP program can also be loaded externally.  In this case 'inhibit' option
> >> +    should be set to 'on' and 'xsks-map-fd' provided with a file descriptor
> >> +    for an open XDP socket map of that program, or 'sock-fds' with file
> >> +    descriptors for already open but not bound XDP sockets already added to a
> >> +    socket map for corresponding queues.  One socket per queue.
> >> +
> >> +    .. parsed-literal::
> >> +
> >> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
> >> +            -netdev af-xdp,id=n1,ifname=eth0,queues=3,inhibit=on,sock-fds=15:16:17
> >> +
> >> +    With 'inhibit=on' and 'sock-fds', QEMU process will only require 32 MB
> >> +    of locked memory (RLIMIT_MEMLOCK) per queue or CAP_IPC_LOCK capability.
> >> +    With 'inhibit=on' and 'xsks-map-fd' it will additionally require
> >> +    CAP_NET_RAW capability.  With 'inhibit=off', CAP_SYS/NET_ADMIN should be
> >> +    added as well.
> >
> > This requires the synchronization with the code changes, I suggest not
> > to describe:
> >
> > 1) actual numbers of memory requirement
> > 2) capabilities required for each mode, we don't do that for other
> > netdev like tap.
>
> Makes sense.  So, should we just strike the paragraph above entirely?

I think so.

Thanks

>
> >
> > Thanks
> >
> >
> >
> >
> >> +
> >> +
> >>  ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]``
> >>      Establish a vhost-user netdev, backed by a chardev id. The chardev
> >>      should be a unix domain socket backed one. The vhost-user uses a
> >> diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
> >> index d02b09a4b9..7585c4c4ed 100755
> >> --- a/scripts/ci/org.centos/stream/8/x86_64/configure
> >> +++ b/scripts/ci/org.centos/stream/8/x86_64/configure
> >> @@ -35,6 +35,7 @@
> >>  --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
> >>  --with-coroutine=ucontext \
> >>  --tls-priority=@QEMU,SYSTEM \
> >> +--disable-af-xdp \
> >>  --disable-attr \
> >>  --disable-auth-pam \
> >>  --disable-avx2 \
> >> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
> >> index 7dd5709ef4..162e455309 100644
> >> --- a/scripts/meson-buildoptions.sh
> >> +++ b/scripts/meson-buildoptions.sh
> >> @@ -74,6 +74,7 @@ meson_options_help() {
> >>    printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if available'
> >>    printf "%s\n" '(unless built with --without-default-features):'
> >>    printf "%s\n" ''
> >> +  printf "%s\n" '  af-xdp          AF_XDP network backend support'
> >>    printf "%s\n" '  alsa            ALSA sound support'
> >>    printf "%s\n" '  attr            attr/xattr support'
> >>    printf "%s\n" '  auth-pam        PAM access control'
> >> @@ -207,6 +208,8 @@ meson_options_help() {
> >>  }
> >>  _meson_option_parse() {
> >>    case $1 in
> >> +    --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;;
> >> +    --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;;
> >>      --enable-alsa) printf "%s" -Dalsa=enabled ;;
> >>      --disable-alsa) printf "%s" -Dalsa=disabled ;;
> >>      --enable-attr) printf "%s" -Dattr=enabled ;;
> >> diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
> >> index e39871c7bb..207f7adfb9 100644
> >> --- a/tests/docker/dockerfiles/debian-amd64.docker
> >> +++ b/tests/docker/dockerfiles/debian-amd64.docker
> >> @@ -97,6 +97,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
> >>                        libvirglrenderer-dev \
> >>                        libvte-2.91-dev \
> >>                        libxen-dev \
> >> +                      libxdp-dev \
> >>                        libzstd-dev \
> >>                        llvm \
> >>                        locales \
> >> --
> >> 2.40.1
> >>
> >
>
Re: [PATCH v2] net: add initial support for AF_XDP network backend
Posted by Ilya Maximets 9 months, 1 week ago
On 7/25/23 08:55, Jason Wang wrote:
> On Thu, Jul 20, 2023 at 9:26 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>> On 7/20/23 09:37, Jason Wang wrote:
>>> On Thu, Jul 6, 2023 at 4:58 AM Ilya Maximets <i.maximets@ovn.org> wrote:
>>>>
>>>> AF_XDP is a network socket family that allows communication directly
>>>> with the network device driver in the kernel, bypassing most or all
>>>> of the kernel networking stack.  In essence, the technology is
>>>> pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
>>>> and works with any network interfaces without driver modifications.
>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
>>>> require access to character devices or unix sockets.  Only access to
>>>> the network interface itself is necessary.
>>>>
>>>> This patch implements a network backend that communicates with the
>>>> kernel by creating an AF_XDP socket.  A chunk of userspace memory
>>>> is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
>>>> Fill and Completion) are placed in that memory along with a pool of
>>>> memory buffers for the packet data.  Data transmission is done by
>>>> allocating one of the buffers, copying packet data into it and
>>>> placing the pointer into Tx ring.  After transmission, device will
>>>> return the buffer via Completion ring.  On Rx, device will take
>>>> a buffer from a pre-populated Fill ring, write the packet data into
>>>> it and place the buffer into Rx ring.
>>>>
>>>> AF_XDP network backend takes on the communication with the host
>>>> kernel and the network interface and forwards packets to/from the
>>>> peer device in QEMU.
>>>>
>>>> Usage example:
>>>>
>>>>   -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
>>>>   -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
>>>>
>>>> XDP program bridges the socket with a network interface.  It can be
>>>> attached to the interface in 2 different modes:
>>>>
>>>> 1. skb - this mode should work for any interface and doesn't require
>>>>          driver support.  With a caveat of lower performance.
>>>>
>>>> 2. native - this does require support from the driver and allows
>>>>             bypassing skb allocation in the kernel and potentially using
>>>>             zero-copy while getting packets in/out of userspace.
>>>>
>>>> By default, QEMU will try to use native mode and fall back to skb.
>>>> Mode can be forced via 'mode' option.  To force 'copy' even in native
>>>> mode, use 'force-copy=on' option.  This might be useful if there is
>>>> some issue with the driver.
>>>>
>>>> Option 'queues=N' allows specifying how many device queues should
>>>> be open.  Note that all the queues that are not open are still
>>>> functional and can receive traffic, but it will not be delivered to
>>>> QEMU.  So, the number of device queues should generally match the
>>>> QEMU configuration, unless the device is shared with something
>>>> else and the traffic re-direction to appropriate queues is correctly
>>>> configured on a device level (e.g. with ethtool -N).
>>>> 'start-queue=M' option can be used to specify from which queue id
>>>> QEMU should start configuring 'N' queues.  It might also be necessary
>>>> to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
>>>> for examples.
>>>>
>>>> In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
>>>> capabilities in order to load default XSK/XDP programs to the
>>>> network interface and configure BPF maps.  It is possible, however,
>>>> to run with no capabilities.  For that to work, an external process
>>>> with admin capabilities will need to pre-load default XSK program,
>>>> create AF_XDP sockets and pass their file descriptors to QEMU process
>>>> on startup via 'sock-fds' option.  Network backend will need to be
>>>> configured with 'inhibit=on' to avoid loading of the program.
>>>> QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
>>>> or CAP_IPC_LOCK.
>>>>
>>>> Alternatively, the file descriptor for 'xsks_map' can be passed via
>>>> 'xsks-map-fd=N' option instead of passing socket file descriptors.
>>>> That will additionally require CAP_NET_RAW on QEMU side.  This is
>>>> useful, because 'sock-fds' may not be available with older libxdp.
>>>> 'sock-fds' requires libxdp >= 1.4.0.
>>>>
>>>> There are a few performance challenges with the current network backends.
>>>>
>>>> First is that they do not support IO threads.  This means that data
>>>> path is handled by the main thread in QEMU and may slow down other
>>>> work or may be slowed down by some other work.  This also means that
>>>> taking advantage of multi-queue is generally not possible today.
>>>>
>>>> Another thing is that data path is going through the device emulation
>>>> code, which is not really optimized for performance.  The fastest
>>>> "frontend" device is virtio-net.  But it's not optimized for heavy
>>>> traffic either, because it expects such use-cases to be handled via
>>>> some implementation of vhost (user, kernel, vdpa).  In practice, we
>>>> have virtio notifications and rcu lock/unlock on a per-packet basis
>>>> and not very efficient accesses to the guest memory.  Communication
>>>> channels between backend and frontend devices do not allow passing
>>>> more than one packet at a time as well.
>>>>
>>>> Some of these challenges can be avoided in the future by adding better
>>>> batching into device emulation or by implementing vhost-af-xdp variant.
>>>>
>>>> There are also a few kernel limitations.  AF_XDP sockets do not
>>>> support any kinds of checksum or segmentation offloading.  Buffers
>>>> are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
>>>> support implementation for AF_XDP is in progress, but not ready yet.
>>>> Also, transmission in all non-zero-copy modes is synchronous, i.e.
>>>> done in a syscall.  That doesn't allow high packet rates on virtual
>>>> interfaces.
>>>>
>>>> However, keeping in mind all of these challenges, current implementation
>>>> of the AF_XDP backend shows a decent performance while running on top
>>>> of a physical NIC with zero-copy support.
>>>>
>>>> Test setup:
>>>>
>>>> 2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
>>>> Network backend is configured to open the NIC directly in native mode.
>>>> The driver supports zero-copy.  NIC is configured to use 1 queue.
>>>>
>>>> Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
>>>> for PPS testing.
>>>>
>>>> iperf3 result:
>>>>  TCP stream      : 19.1 Gbps
>>>>
>>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>>>  Tx only         : 3.4 Mpps
>>>>  Rx only         : 2.0 Mpps
>>>>  L2 FWD Loopback : 1.5 Mpps
>>>>
>>>> In skb mode the same setup shows much lower performance, similar to
>>>> the setup where pair of physical NICs is replaced with veth pair:
>>>>
>>>> iperf3 result:
>>>>   TCP stream      : 9 Gbps
>>>>
>>>> dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
>>>>   Tx only         : 1.2 Mpps
>>>>   Rx only         : 1.0 Mpps
>>>>   L2 FWD Loopback : 0.7 Mpps
>>>>
>>>> Results in skb mode or over the veth are close to results of a tap
>>>> backend with vhost=on and disabled segmentation offloading bridged
>>>> with a NIC.
>>>>
>>>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>>
>>> Looks good overall, see a few comments inline.
>>
>> Thanks for review!
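
For anyone reading the commit message without the code at hand, the buffer lifecycle it describes (take a free buffer for Tx, get it back via the Completion ring) can be pictured with a small LIFO pool of frame offsets. This is a sketch under stated assumptions, not the af-xdp.c implementation: the names are hypothetical, FRAME_SIZE is assumed to be 4096 in place of XSK_UMEM__DEFAULT_FRAME_SIZE, and no real AF_XDP rings are involved.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FRAME_SIZE 4096u   /* assumed stand-in for the libxdp default */

/* Hypothetical LIFO pool of free frame offsets into the shared umem. */
struct frame_pool {
    uint64_t *addrs;
    uint32_t n_free;
    uint32_t capacity;
};

static int frame_pool_init(struct frame_pool *p, uint32_t n_frames)
{
    if (n_frames == 0) {
        return -1;
    }
    p->addrs = malloc((size_t)n_frames * sizeof(*p->addrs));
    if (!p->addrs) {
        return -1;
    }
    /* The last valid index is n_frames - 1, so start the walk there. */
    for (int64_t i = (int64_t)n_frames - 1; i >= 0; i--) {
        p->addrs[i] = (uint64_t)i * FRAME_SIZE;
    }
    p->n_free = n_frames;
    p->capacity = n_frames;
    return 0;
}

/* Take a free frame, e.g. before copying a packet in for Tx. */
static int frame_pool_get(struct frame_pool *p, uint64_t *addr)
{
    if (p->n_free == 0) {
        return -1;
    }
    *addr = p->addrs[--p->n_free];
    return 0;
}

/* Return a frame, e.g. after it shows up on the Completion ring. */
static void frame_pool_put(struct frame_pool *p, uint64_t addr)
{
    assert(p->n_free < p->capacity);
    p->addrs[p->n_free++] = addr;
}
```

Because the pool is a stack, the frame returned most recently is the next one handed out, so recently used frames get reused first.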
>>
>>>
>>>> ---
>>>>
>>>> Version 2:
>>>>
>>>>   - Added support for running with no capabilities by passing
>>>>     pre-created AF_XDP socket file descriptors via 'sock-fds' option.
>>>>     Conditionally compiled because it requires unreleased libxdp 1.4.0.
>>>>     The last restriction is having 32 MB of RLIMIT_MEMLOCK per queue.
>>>>
>>>>   - Refined and extended documentation.
>>>>
>>>>
>>>>  MAINTAINERS                                   |   4 +
>>>>  hmp-commands.hx                               |   2 +-
>>>>  meson.build                                   |  19 +
>>>>  meson_options.txt                             |   2 +
>>>>  net/af-xdp.c                                  | 570 ++++++++++++++++++
>>>>  net/clients.h                                 |   5 +
>>>>  net/meson.build                               |   3 +
>>>>  net/net.c                                     |   6 +
>>>>  qapi/net.json                                 |  60 +-
>>>>  qemu-options.hx                               |  83 ++-
>>>>  .../ci/org.centos/stream/8/x86_64/configure   |   1 +
>>>>  scripts/meson-buildoptions.sh                 |   3 +
>>>>  tests/docker/dockerfiles/debian-amd64.docker  |   1 +
>>>>  13 files changed, 756 insertions(+), 3 deletions(-)
>>>>  create mode 100644 net/af-xdp.c
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index 7164cf55a1..80d4ba4004 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -2929,6 +2929,10 @@ W: http://info.iet.unipi.it/~luigi/netmap/
>>>>  S: Maintained
>>>>  F: net/netmap.c
>>>>
>>>> +AF_XDP network backend
>>>> +R: Ilya Maximets <i.maximets@ovn.org>
>>>> +F: net/af-xdp.c
>>>> +
>>>>  Host Memory Backends
>>>>  M: David Hildenbrand <david@redhat.com>
>>>>  M: Igor Mammedov <imammedo@redhat.com>
>>>> diff --git a/hmp-commands.hx b/hmp-commands.hx
>>>> index 2cbd0f77a0..af9ffe4681 100644
>>>> --- a/hmp-commands.hx
>>>> +++ b/hmp-commands.hx
>>>> @@ -1295,7 +1295,7 @@ ERST
>>>>      {
>>>>          .name       = "netdev_add",
>>>>          .args_type  = "netdev:O",
>>>> -        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|vhost-user"
>>>> +        .params     = "[user|tap|socket|stream|dgram|vde|bridge|hubport|netmap|af-xdp|vhost-user"
>>>>  #ifdef CONFIG_VMNET
>>>>                        "|vmnet-host|vmnet-shared|vmnet-bridged"
>>>>  #endif
>>>> diff --git a/meson.build b/meson.build
>>>> index a9ba0bfab3..1f8772ea5d 100644
>>>> --- a/meson.build
>>>> +++ b/meson.build
>>>> @@ -1891,6 +1891,18 @@ if libbpf.found() and not cc.links('''
>>>>    endif
>>>>  endif
>>>>
>>>> +# libxdp
>>>> +libxdp = dependency('libxdp', required: get_option('af_xdp'), method: 'pkg-config')
>>>> +if libxdp.found() and \
>>>> +      not (libbpf.found() and libbpf.version().version_compare('>=0.7'))
>>>> +  libxdp = not_found
>>>> +  if get_option('af_xdp').enabled()
>>>> +    error('af-xdp support requires libbpf version >= 0.7')
>>>
>>> Can we simply limit this to 1.4?
>>
>> This is a check for libbpf, not libxdp.  Or do you think there is no need
>> to check libbpf version if we request libxdp version high enough?
>> Users may still break the build by installing old libbpf manually even if
>> distributions ship more modern versions.
>>
>> Or do you mean limit the libxdp version to 1.4 in order to avoid conditional
>> on HAVE_XSK_UMEM__CREATE_WITH_FD ?
> 
> Yes.
> 
>> The problem with that is that libxdp 1.4 is a week+ old, so not available in
>> any distribution, AFAIK.  Not sure how big of a problem that is though.
> 
> It doesn't matter as this is a brand new backend, it would simplify
> future maintenance if we can get rid of any HAVE_XXX macros.

OK, makes sense.  I'll require libxdp 1.4 and remove all ifdefs.
I'll also remove the xsks-map-fd configuration, since it will always be
possible to just use sock-fds instead.  We may add it back later,
but it requires extra privileges (NET_RAW), so I'm not sure there
is much value in it.
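
For reference, a rough sketch of how the option check could collapse once
xsks-map-fd is gone and sock-fds is always compiled in.  This is not the
v3 code: the struct, the function name and the error text below are
hypothetical, and only the rule itself ('inhibit=on' goes together with
'sock-fds') is taken from the v2 checks in net_init_af_xdp().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down view of the options that stay relevant. */
struct toy_af_xdp_opts {
    bool has_inhibit;
    bool inhibit;
    const char *sock_fds;   /* NULL when 'sock-fds' is not given */
};

/* Returns NULL if the combination is valid, an error message otherwise. */
static const char *toy_check_opts(const struct toy_af_xdp_opts *o)
{
    bool inhibit = o->has_inhibit && o->inhibit;

    /* With xsks-map-fd removed, inhibit and sock-fds must come as a pair. */
    if (inhibit != (o->sock_fds != NULL)) {
        return "'inhibit=on' and 'sock-fds' must be used together";
    }
    return NULL;
}
```

With no HAVE_XXX macro involved there is a single code path, so the two
#ifndef/#else branches of the v2 check would no longer be needed.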

> 
>>
>>>
>>>
>>>> +  else
>>>> +    warning('af-xdp support requires libbpf version >= 0.7, disabling')
>>>> +  endif
>>>> +endif
>>>> +
>>>>  # libdw
>>>>  libdw = not_found
>>>>  if not get_option('libdw').auto() or \
>>>> @@ -2112,6 +2124,12 @@ config_host_data.set('CONFIG_HEXAGON_IDEF_PARSER', get_option('hexagon_idef_pars
>>>>  config_host_data.set('CONFIG_LIBATTR', have_old_libattr)
>>>>  config_host_data.set('CONFIG_LIBCAP_NG', libcap_ng.found())
>>>>  config_host_data.set('CONFIG_EBPF', libbpf.found())
>>>> +config_host_data.set('CONFIG_AF_XDP', libxdp.found())
>>>> +if libxdp.found()
>>>> +  config_host_data.set('HAVE_XSK_UMEM__CREATE_WITH_FD',
>>>> +                       cc.has_function('xsk_umem__create_with_fd',
>>>> +                                       dependencies: libxdp))
>>>> +endif
>>>>  config_host_data.set('CONFIG_LIBDAXCTL', libdaxctl.found())
>>>>  config_host_data.set('CONFIG_LIBISCSI', libiscsi.found())
>>>>  config_host_data.set('CONFIG_LIBNFS', libnfs.found())
>>>> @@ -4285,6 +4303,7 @@ summary_info += {'PVRDMA support':    have_pvrdma}
>>>>  summary_info += {'fdt support':       fdt_opt == 'disabled' ? false : fdt_opt}
>>>>  summary_info += {'libcap-ng support': libcap_ng}
>>>>  summary_info += {'bpf support':       libbpf}
>>>> +summary_info += {'AF_XDP support':    libxdp}
>>>>  summary_info += {'rbd support':       rbd}
>>>>  summary_info += {'smartcard support': cacard}
>>>>  summary_info += {'U2F support':       u2f}
>>>> diff --git a/meson_options.txt b/meson_options.txt
>>>> index bbb5c7e886..f4e950ce6a 100644
>>>> --- a/meson_options.txt
>>>> +++ b/meson_options.txt
>>>> @@ -120,6 +120,8 @@ option('avx512bw', type: 'feature', value: 'auto',
>>>>  option('keyring', type: 'feature', value: 'auto',
>>>>         description: 'Linux keyring support')
>>>>
>>>> +option('af_xdp', type : 'feature', value : 'auto',
>>>> +       description: 'AF_XDP network backend support')
>>>>  option('attr', type : 'feature', value : 'auto',
>>>>         description: 'attr/xattr support')
>>>>  option('auth_pam', type : 'feature', value : 'auto',
>>>> diff --git a/net/af-xdp.c b/net/af-xdp.c
>>>> new file mode 100644
>>>> index 0000000000..265ba6b12e
>>>> --- /dev/null
>>>> +++ b/net/af-xdp.c
>>>> @@ -0,0 +1,570 @@
>>>> +/*
>>>> + * AF_XDP network backend.
>>>> + *
>>>> + * Copyright (c) 2023 Red Hat, Inc.
>>>> + *
>>>> + * Authors:
>>>> + *  Ilya Maximets <i.maximets@ovn.org>
>>>> + *
>>>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> + * See the COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include <bpf/bpf.h>
>>>> +#include <inttypes.h>
>>>> +#include <linux/if_link.h>
>>>> +#include <linux/if_xdp.h>
>>>> +#include <net/if.h>
>>>> +#include <xdp/xsk.h>
>>>> +
>>>> +#include "clients.h"
>>>> +#include "monitor/monitor.h"
>>>> +#include "net/net.h"
>>>> +#include "qapi/error.h"
>>>> +#include "qemu/cutils.h"
>>>> +#include "qemu/error-report.h"
>>>> +#include "qemu/iov.h"
>>>> +#include "qemu/main-loop.h"
>>>> +#include "qemu/memalign.h"
>>>> +
>>>> +
>>>> +typedef struct AFXDPState {
>>>> +    NetClientState       nc;
>>>> +
>>>> +    struct xsk_socket    *xsk;
>>>> +    struct xsk_ring_cons rx;
>>>> +    struct xsk_ring_prod tx;
>>>> +    struct xsk_ring_cons cq;
>>>> +    struct xsk_ring_prod fq;
>>>> +
>>>> +    char                 ifname[IFNAMSIZ];
>>>> +    int                  ifindex;
>>>> +    bool                 read_poll;
>>>> +    bool                 write_poll;
>>>> +    uint32_t             outstanding_tx;
>>>> +
>>>> +    uint64_t             *pool;
>>>> +    uint32_t             n_pool;
>>>> +    char                 *buffer;
>>>> +    struct xsk_umem      *umem;
>>>> +
>>>> +    uint32_t             n_queues;
>>>> +    uint32_t             xdp_flags;
>>>> +    bool                 inhibit;
>>>> +} AFXDPState;
>>>> +
>>>> +#define AF_XDP_BATCH_SIZE 64
>>>> +
>>>> +static void af_xdp_send(void *opaque);
>>>> +static void af_xdp_writable(void *opaque);
>>>> +
>>>> +/* Set the event-loop handlers for the af-xdp backend. */
>>>> +static void af_xdp_update_fd_handler(AFXDPState *s)
>>>> +{
>>>> +    qemu_set_fd_handler(xsk_socket__fd(s->xsk),
>>>> +                        s->read_poll ? af_xdp_send : NULL,
>>>> +                        s->write_poll ? af_xdp_writable : NULL,
>>>> +                        s);
>>>> +}
>>>> +
>>>> +/* Update the read handler. */
>>>> +static void af_xdp_read_poll(AFXDPState *s, bool enable)
>>>> +{
>>>> +    if (s->read_poll != enable) {
>>>> +        s->read_poll = enable;
>>>> +        af_xdp_update_fd_handler(s);
>>>> +    }
>>>> +}
>>>> +
>>>> +/* Update the write handler. */
>>>> +static void af_xdp_write_poll(AFXDPState *s, bool enable)
>>>> +{
>>>> +    if (s->write_poll != enable) {
>>>> +        s->write_poll = enable;
>>>> +        af_xdp_update_fd_handler(s);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void af_xdp_poll(NetClientState *nc, bool enable)
>>>> +{
>>>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>>>> +
>>>> +    if (s->read_poll != enable || s->write_poll != enable) {
>>>> +        s->write_poll = enable;
>>>> +        s->read_poll  = enable;
>>>> +        af_xdp_update_fd_handler(s);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void af_xdp_complete_tx(AFXDPState *s)
>>>> +{
>>>> +    uint32_t idx = 0;
>>>> +    uint32_t done, i;
>>>> +    uint64_t *addr;
>>>> +
>>>> +    done = xsk_ring_cons__peek(&s->cq, XSK_RING_CONS__DEFAULT_NUM_DESCS, &idx);
>>>> +
>>>> +    for (i = 0; i < done; i++) {
>>>> +        addr = (void *) xsk_ring_cons__comp_addr(&s->cq, idx++);
>>>> +        s->pool[s->n_pool++] = *addr;
>>>> +        s->outstanding_tx--;
>>>> +    }
>>>> +
>>>> +    if (done) {
>>>> +        xsk_ring_cons__release(&s->cq, done);
>>>> +    }
>>>> +}
>>>> +
>>>> +/*
>>>> + * The fd_write() callback, invoked if the fd is marked as writable
>>>> + * after a poll.
>>>> + */
>>>> +static void af_xdp_writable(void *opaque)
>>>> +{
>>>> +    AFXDPState *s = opaque;
>>>> +
>>>> +    /* Try to recover buffers that are already sent. */
>>>> +    af_xdp_complete_tx(s);
>>>> +
>>>> +    /*
>>>> +     * Unregister the handler, unless we still have packets to transmit
>>>> +     * and kernel needs a wake up.
>>>> +     */
>>>> +    if (!s->outstanding_tx || !xsk_ring_prod__needs_wakeup(&s->tx)) {
>>>> +        af_xdp_write_poll(s, false);
>>>> +    }
>>>> +
>>>> +    /* Flush any buffered packets. */
>>>> +    qemu_flush_queued_packets(&s->nc);
>>>> +}
>>>> +
>>>> +static ssize_t af_xdp_receive(NetClientState *nc,
>>>> +                              const uint8_t *buf, size_t size)
>>>> +{
>>>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>>>> +    struct xdp_desc *desc;
>>>> +    uint32_t idx;
>>>> +    void *data;
>>>> +
>>>> +    /* Try to recover buffers that are already sent. */
>>>> +    af_xdp_complete_tx(s);
>>>> +
>>>> +    if (size > XSK_UMEM__DEFAULT_FRAME_SIZE) {
>>>> +        /* We can't transmit packet this size... */
>>>> +        return size;
>>>> +    }
>>>> +
>>>> +    if (!s->n_pool || !xsk_ring_prod__reserve(&s->tx, 1, &idx)) {
>>>> +        /*
>>>> +         * Out of buffers or space in tx ring.  Poll until we can write.
>>>> +         * This will also kick the Tx, if it was waiting on CQ.
>>>> +         */
>>>> +        af_xdp_write_poll(s, true);
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    desc = xsk_ring_prod__tx_desc(&s->tx, idx);
>>>> +    desc->addr = s->pool[--s->n_pool];
>>>> +    desc->len = size;
>>>> +
>>>> +    data = xsk_umem__get_data(s->buffer, desc->addr);
>>>> +    memcpy(data, buf, size);
>>>> +
>>>> +    xsk_ring_prod__submit(&s->tx, 1);
>>>> +    s->outstanding_tx++;
>>>> +
>>>> +    if (xsk_ring_prod__needs_wakeup(&s->tx)) {
>>>> +        af_xdp_write_poll(s, true);
>>>> +    }
>>>> +
>>>> +    return size;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Complete a previous send (backend --> guest) and enable the
>>>> + * fd_read callback.
>>>> + */
>>>> +static void af_xdp_send_completed(NetClientState *nc, ssize_t len)
>>>> +{
>>>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>>>> +
>>>> +    af_xdp_read_poll(s, true);
>>>> +}
>>>> +
>>>> +static void af_xdp_fq_refill(AFXDPState *s, uint32_t n)
>>>> +{
>>>> +    uint32_t i, idx = 0;
>>>> +
>>>> +    /* Leave one packet for Tx, just in case. */
>>>> +    if (s->n_pool < n + 1) {
>>>> +        n = s->n_pool;
>>>> +    }
>>>> +
>>>> +    if (!n || !xsk_ring_prod__reserve(&s->fq, n, &idx)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    for (i = 0; i < n; i++) {
>>>> +        *xsk_ring_prod__fill_addr(&s->fq, idx++) = s->pool[--s->n_pool];
>>>> +    }
>>>> +    xsk_ring_prod__submit(&s->fq, n);
>>>> +
>>>> +    if (xsk_ring_prod__needs_wakeup(&s->fq)) {
>>>> +        /* Receive was blocked by not having enough buffers.  Wake it up. */
>>>> +        af_xdp_read_poll(s, true);
>>>> +    }
>>>> +}
>>>> +
>>>> +static void af_xdp_send(void *opaque)
>>>> +{
>>>> +    uint32_t i, n_rx, idx = 0;
>>>> +    AFXDPState *s = opaque;
>>>> +
>>>> +    n_rx = xsk_ring_cons__peek(&s->rx, AF_XDP_BATCH_SIZE, &idx);
>>>> +    if (!n_rx) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    for (i = 0; i < n_rx; i++) {
>>>> +        const struct xdp_desc *desc;
>>>> +        struct iovec iov;
>>>> +
>>>> +        desc = xsk_ring_cons__rx_desc(&s->rx, idx++);
>>>> +
>>>> +        iov.iov_base = xsk_umem__get_data(s->buffer, desc->addr);
>>>> +        iov.iov_len = desc->len;
>>>> +
>>>> +        s->pool[s->n_pool++] = desc->addr;
>>>> +
>>>> +        if (!qemu_sendv_packet_async(&s->nc, &iov, 1,
>>>> +                                     af_xdp_send_completed)) {
>>>> +            /*
>>>> +             * The peer does not receive anymore.  Packet is queued, stop
>>>> +             * reading from the backend until af_xdp_send_completed().
>>>> +             */
>>>> +            af_xdp_read_poll(s, false);
>>>> +
>>>> +            /* Re-peek the descriptors to not break the ring cache. */
>>>> +            xsk_ring_cons__cancel(&s->rx, n_rx);
>>>> +            n_rx = xsk_ring_cons__peek(&s->rx, i + 1, &idx);
>>>
>>> The code turns out to be hard to read here.
>>>
>>> 1) This seems to undo the peek (usually a peek doesn't touch the
>>> producer/consumer, but that is not what xsk_ring_cons__peek() does):
>>
>> Yeah, it's unfortunate that the peek() function changes the internal
>> state, but that is what we have...
>>
>>>
>>> static inline __u32 xsk_ring_cons__peek(struct xsk_ring_cons *cons,
>>>                                         __u32 nb, __u32 *idx)
>>> {
>>>         __u32 entries = xsk_cons_nb_avail(cons, nb);
>>>
>>>         if (entries > 0) {
>>>                 *idx = cons->cached_cons;
>>>                 cons->cached_cons += entries;
>>>         }
>>>
>>>         return entries;
>>> }
>>>
>>> 2) It looks to me like a partial rollback is sufficient?
>>>
>>> xsk_ring_cons__cancel(n_rx - i + 1)?
>>
>> Good point.  Should work.  It should be n_rx - i - 1 though, if I'm not
>> mistaken.  So:
>>
>>     xsk_ring_cons__cancel(n_rx - i - 1);
>>     n_rx = i + 1;
>>
>> I'm not sure if that is much easier to read, but that's OK.  Should be
>> a touch faster as well.  What do you think?
> 
> Let's do that please.

OK, Sure.  Seems to work fine.
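
To illustrate why the two variants are equivalent, here is a small
stand-alone model of the consumer ring.  It is only a sketch: the toy_*
names are made up and are not libxdp API.  They mimic just the
cached_cons bookkeeping of xsk_ring_cons__peek() and
xsk_ring_cons__cancel(), and the model assumes the ring holds at least
as many entries as were peeked.

```c
#include <stdint.h>

/* Toy stand-in for struct xsk_ring_cons: only the counters matter here. */
struct toy_ring {
    uint32_t cached_cons;   /* advanced by peek, rewound by cancel */
    uint32_t producer;      /* number of entries the producer has filled */
};

/* Like xsk_ring_cons__peek(): reserves entries by moving cached_cons. */
static uint32_t toy_peek(struct toy_ring *r, uint32_t nb, uint32_t *idx)
{
    uint32_t avail = r->producer - r->cached_cons;
    uint32_t entries = avail < nb ? avail : nb;

    if (entries > 0) {
        *idx = r->cached_cons;
        r->cached_cons += entries;
    }
    return entries;
}

/* Like xsk_ring_cons__cancel(): gives reserved entries back. */
static void toy_cancel(struct toy_ring *r, uint32_t nb)
{
    r->cached_cons -= nb;
}

/* v2 variant: cancel everything, then re-peek the i + 1 used entries. */
static uint32_t rollback_full(struct toy_ring *r, uint32_t n_rx, uint32_t i)
{
    uint32_t idx = 0;

    toy_cancel(r, n_rx);
    return toy_peek(r, i + 1, &idx);
}

/* Discussed variant: cancel only the n_rx - i - 1 untouched entries. */
static uint32_t rollback_partial(struct toy_ring *r, uint32_t n_rx, uint32_t i)
{
    toy_cancel(r, n_rx - i - 1);
    return i + 1;
}
```

Here i is the index of the packet that got queued by the peer, so
descriptors 0..i are consumed and the remaining n_rx - i - 1 are handed
back.  Both helpers leave cached_cons at the same value and return
i + 1 as the new n_rx.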

I'll post v3 soon with this and the other discussed changes.

> 
>>
>>>
>>>> +            g_assert(n_rx == i + 1);
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* Release actually sent descriptors and try to re-fill. */
>>>> +    xsk_ring_cons__release(&s->rx, n_rx);
>>>> +    af_xdp_fq_refill(s, AF_XDP_BATCH_SIZE);
>>>> +}
>>>> +
>>>> +/* Flush and close. */
>>>> +static void af_xdp_cleanup(NetClientState *nc)
>>>> +{
>>>> +    AFXDPState *s = DO_UPCAST(AFXDPState, nc, nc);
>>>> +
>>>> +    qemu_purge_queued_packets(nc);
>>>> +
>>>> +    af_xdp_poll(nc, false);
>>>> +
>>>> +    xsk_socket__delete(s->xsk);
>>>> +    s->xsk = NULL;
>>>> +    g_free(s->pool);
>>>> +    s->pool = NULL;
>>>> +    xsk_umem__delete(s->umem);
>>>> +    s->umem = NULL;
>>>> +    qemu_vfree(s->buffer);
>>>> +    s->buffer = NULL;
>>>> +
>>>> +    /* Remove the program if it's the last open queue. */
>>>> +    if (!s->inhibit && nc->queue_index == s->n_queues - 1 && s->xdp_flags
>>>> +        && bpf_xdp_detach(s->ifindex, s->xdp_flags, NULL) != 0) {
>>>> +        fprintf(stderr,
>>>> +                "af-xdp: unable to remove XDP program from '%s', ifindex: %d\n",
>>>> +                s->ifname, s->ifindex);
>>>> +    }
>>>> +}
>>>> +
>>>> +static int af_xdp_umem_create(AFXDPState *s, int sock_fd, Error **errp)
>>>> +{
>>>> +    struct xsk_umem_config config = {
>>>> +        .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
>>>> +        .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
>>>> +        .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
>>>> +        .frame_headroom = 0,
>>>> +    };
>>>> +    uint64_t n_descs;
>>>> +    uint64_t size;
>>>> +    int64_t i;
>>>> +    int ret;
>>>> +
>>>> +    /* Number of descriptors if all 4 queues (rx, tx, cq, fq) are full. */
>>>> +    n_descs = (XSK_RING_PROD__DEFAULT_NUM_DESCS
>>>> +               + XSK_RING_CONS__DEFAULT_NUM_DESCS) * 2;
>>>> +    size = n_descs * XSK_UMEM__DEFAULT_FRAME_SIZE;
>>>> +
>>>> +    s->buffer = qemu_memalign(qemu_real_host_page_size(), size);
>>>> +    memset(s->buffer, 0, size);
>>>> +
>>>> +    if (sock_fd < 0) {
>>>> +        ret = xsk_umem__create(&s->umem, s->buffer, size,
>>>> +                               &s->fq, &s->cq, &config);
>>>> +    } else {
>>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>>>> +        ret = xsk_umem__create_with_fd(&s->umem, sock_fd, s->buffer, size,
>>>> +                                       &s->fq, &s->cq, &config);
>>>> +#else
>>>
>>> So sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD won't work. We'd better
>>>
>>> 1) disable sock_fds without HAVE_XSK_UMEM__CREATE_WITH_FD
>>
>> The qapi property is conditionally defined, so users will not be able to
>> set sock-fds if not supported.  And qemu will complain about unknown
>> property.  That should be enough?
> 
> Yes.
> 
>>
>>>
>>> or
>>>
>>> 2) disable af_xdp without HAVE_XSK_UMEM__CREATE_WITH_FD
>>
>> If we require libxdp 1.4 that will be the case.
>>
>>>
>>>> +        ret = -1;
>>>> +        errno = EINVAL;
>>>> +#endif
>>>> +    }
>>>> +
>>>> +    if (ret) {
>>>> +        qemu_vfree(s->buffer);
>>>> +        error_setg_errno(errp, errno,
>>>> +                         "failed to create umem for %s queue_index: %d",
>>>> +                         s->ifname, s->nc.queue_index);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    s->pool = g_new(uint64_t, n_descs);
>>>> +    /* Fill the pool in the opposite order, because it's a LIFO queue. */
>>>> +    for (i = n_descs - 1; i >= 0; i--) {
>>>> +        s->pool[i] = i * XSK_UMEM__DEFAULT_FRAME_SIZE;
>>>> +    }
>>>> +    s->n_pool = n_descs;
>>>> +
>>>> +    af_xdp_fq_refill(s, XSK_RING_PROD__DEFAULT_NUM_DESCS);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int af_xdp_socket_create(AFXDPState *s,
>>>> +                                const NetdevAFXDPOptions *opts,
>>>> +                                int xsks_map_fd, Error **errp)
>>>> +{
>>>> +    struct xsk_socket_config cfg = {
>>>> +        .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
>>>> +        .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
>>>> +        .libxdp_flags = 0,
>>>> +        .bind_flags = XDP_USE_NEED_WAKEUP,
>>>> +        .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
>>>> +    };
>>>> +    int queue_id, error = 0;
>>>> +
>>>> +    s->inhibit = opts->has_inhibit && opts->inhibit;
>>>> +    if (s->inhibit) {
>>>> +        cfg.libxdp_flags |= XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD;
>>>> +    }
>>>> +
>>>> +    if (opts->has_force_copy && opts->force_copy) {
>>>> +        cfg.bind_flags |= XDP_COPY;
>>>> +    }
>>>> +
>>>> +    queue_id = s->nc.queue_index;
>>>> +    if (opts->has_start_queue && opts->start_queue > 0) {
>>>> +        queue_id += opts->start_queue;
>>>> +    }
>>>> +
>>>> +    if (opts->has_mode) {
>>>> +        /* Specific mode requested. */
>>>> +        cfg.xdp_flags |= (opts->mode == AFXDP_MODE_NATIVE)
>>>> +                         ? XDP_FLAGS_DRV_MODE : XDP_FLAGS_SKB_MODE;
>>>> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
>>>> +                               s->umem, &s->rx, &s->tx, &cfg)) {
>>>> +            error = errno;
>>>> +        }
>>>> +    } else {
>>>> +        /* No mode requested, try native first. */
>>>> +        cfg.xdp_flags |= XDP_FLAGS_DRV_MODE;
>>>> +
>>>> +        if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
>>>> +                               s->umem, &s->rx, &s->tx, &cfg)) {
>>>> +            /* Can't use native mode, try skb. */
>>>> +            cfg.xdp_flags &= ~XDP_FLAGS_DRV_MODE;
>>>> +            cfg.xdp_flags |= XDP_FLAGS_SKB_MODE;
>>>> +
>>>> +            if (xsk_socket__create(&s->xsk, s->ifname, queue_id,
>>>> +                                   s->umem, &s->rx, &s->tx, &cfg)) {
>>>> +                error = errno;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (error) {
>>>> +        error_setg_errno(errp, error,
>>>> +                         "failed to create AF_XDP socket for %s queue_id: %d",
>>>> +                         s->ifname, queue_id);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (s->inhibit && xsks_map_fd >= 0) {
>>>> +        int xsk_fd = xsk_socket__fd(s->xsk);
>>>> +
>>>> +        /* Need to update the map manually, libxdp skipped that step. */
>>>> +        error = bpf_map_update_elem(xsks_map_fd, &queue_id, &xsk_fd, 0);
>>>> +        if (error) {
>>>> +            error_setg_errno(errp, error,
>>>> +                             "failed to update xsks map for %s queue_id: %d",
>>>> +                             s->ifname, queue_id);
>>>> +            return -1;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    s->xdp_flags = cfg.xdp_flags;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +/* NetClientInfo methods. */
>>>> +static NetClientInfo net_af_xdp_info = {
>>>> +    .type = NET_CLIENT_DRIVER_AF_XDP,
>>>> +    .size = sizeof(AFXDPState),
>>>> +    .receive = af_xdp_receive,
>>>> +    .poll = af_xdp_poll,
>>>> +    .cleanup = af_xdp_cleanup,
>>>> +};
>>>> +
>>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>>>> +static int *parse_socket_fds(const char *sock_fds_str,
>>>> +                             int64_t n_expected, Error **errp)
>>>> +{
>>>> +    gchar **substrings = g_strsplit(sock_fds_str, ":", -1);
>>>> +    int64_t i, n_sock_fds = g_strv_length(substrings);
>>>> +    int *sock_fds = NULL;
>>>> +
>>>> +    if (n_sock_fds != n_expected) {
>>>> +        error_setg(errp, "expected %"PRIi64" socket fds, got %"PRIi64,
>>>> +                   n_expected, n_sock_fds);
>>>> +        goto exit;
>>>> +    }
>>>> +
>>>> +    sock_fds = g_new(int, n_sock_fds);
>>>> +
>>>> +    for (i = 0; i < n_sock_fds; i++) {
>>>> +        sock_fds[i] = monitor_fd_param(monitor_cur(), substrings[i], errp);
>>>> +        if (sock_fds[i] < 0) {
>>>> +            g_free(sock_fds);
>>>> +            sock_fds = NULL;
>>>> +            goto exit;
>>>> +        }
>>>> +    }
>>>> +
>>>> +exit:
>>>> +    g_strfreev(substrings);
>>>> +    return sock_fds;
>>>> +}
>>>> +#endif
>>>> +
>>>> +/*
>>>> + * The exported init function.
>>>> + *
>>>> + * ... -net af-xdp,ifname="..."
>>>
>>> This is the legacy command line, let's say -netdev af-xdp,...
>>
>> Sure.
>>
>>>
>>>> + */
>>>> +int net_init_af_xdp(const Netdev *netdev,
>>>> +                    const char *name, NetClientState *peer, Error **errp)
>>>> +{
>>>> +    const NetdevAFXDPOptions *opts = &netdev->u.af_xdp;
>>>> +    NetClientState *nc, *nc0 = NULL;
>>>> +    unsigned int ifindex;
>>>> +    uint32_t prog_id = 0;
>>>> +    int *sock_fds = NULL;
>>>> +    int xsks_map_fd = -1;
>>>> +    int64_t i, queues;
>>>> +    Error *err = NULL;
>>>> +    AFXDPState *s;
>>>> +
>>>> +    ifindex = if_nametoindex(opts->ifname);
>>>> +    if (!ifindex) {
>>>> +        error_setg_errno(errp, errno, "failed to get ifindex for '%s'",
>>>> +                         opts->ifname);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    queues = opts->has_queues ? opts->queues : 1;
>>>> +    if (queues < 1) {
>>>> +        error_setg(errp, "invalid number of queues (%" PRIi64 ") for '%s'",
>>>> +                   queues, opts->ifname);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +#ifndef HAVE_XSK_UMEM__CREATE_WITH_FD
>>>> +    if ((opts->has_inhibit && opts->inhibit) != !!opts->xsks_map_fd) {
>>>> +        error_setg(errp, "expected 'inhibit=on' and 'xsks-map-fd' together");
>>>> +        return -1;
>>>> +    }
>>>> +#else
>>>> +    if ((opts->has_inhibit && opts->inhibit)
>>>> +        != (opts->xsks_map_fd || opts->sock_fds)) {
>>>> +        error_setg(errp, "'inhibit=on' should be used together with "
>>>> +                         "'sock-fds' or 'xsks-map-fd'");
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (opts->xsks_map_fd && opts->sock_fds) {
>>>> +        error_setg(errp, "'sock-fds' and 'xsks-map-fd' are mutually exclusive");
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (opts->sock_fds) {
>>>> +        sock_fds = parse_socket_fds(opts->sock_fds, queues, errp);
>>>> +        if (!sock_fds) {
>>>> +            return -1;
>>>> +        }
>>>> +    }
>>>> +#endif
>>>> +
>>>> +    if (opts->xsks_map_fd) {
>>>> +        xsks_map_fd = monitor_fd_param(monitor_cur(), opts->xsks_map_fd, errp);
>>>> +        if (xsks_map_fd < 0) {
>>>> +            goto err;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    for (i = 0; i < queues; i++) {
>>>> +        nc = qemu_new_net_client(&net_af_xdp_info, peer, "af-xdp", name);
>>>> +        qemu_set_info_str(nc, "af-xdp%"PRIi64" to %s", i, opts->ifname);
>>>> +        nc->queue_index = i;
>>>> +
>>>> +        if (!nc0) {
>>>> +            nc0 = nc;
>>>> +        }
>>>> +
>>>> +        s = DO_UPCAST(AFXDPState, nc, nc);
>>>> +
>>>> +        pstrcpy(s->ifname, sizeof(s->ifname), opts->ifname);
>>>> +        s->ifindex = ifindex;
>>>> +        s->n_queues = queues;
>>>> +
>>>> +        if (af_xdp_umem_create(s, sock_fds ? sock_fds[i] : -1, errp)
>>>> +            || af_xdp_socket_create(s, opts, xsks_map_fd, errp)) {
>>>> +            /* Make sure the XDP program will be removed. */
>>>> +            s->n_queues = i;
>>>> +            error_propagate(errp, err);
>>>> +            goto err;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (nc0) {
>>>> +        s = DO_UPCAST(AFXDPState, nc, nc0);
>>>> +        if (bpf_xdp_query_id(s->ifindex, s->xdp_flags, &prog_id) || !prog_id) {
>>>> +            error_setg_errno(errp, errno,
>>>> +                             "no XDP program loaded on '%s', ifindex: %d",
>>>> +                             s->ifname, s->ifindex);
>>>> +            goto err;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    af_xdp_read_poll(s, true); /* Initially only poll for reads. */
>>>> +
>>>> +    return 0;
>>>> +
>>>> +err:
>>>> +    g_free(sock_fds);
>>>> +    if (nc0) {
>>>> +        qemu_del_net_client(nc0);
>>>> +    }
>>>> +
>>>> +    return -1;
>>>> +}
>>>> diff --git a/net/clients.h b/net/clients.h
>>>> index ed8bdfff1e..be53794582 100644
>>>> --- a/net/clients.h
>>>> +++ b/net/clients.h
>>>> @@ -64,6 +64,11 @@ int net_init_netmap(const Netdev *netdev, const char *name,
>>>>                      NetClientState *peer, Error **errp);
>>>>  #endif
>>>>
>>>> +#ifdef CONFIG_AF_XDP
>>>> +int net_init_af_xdp(const Netdev *netdev, const char *name,
>>>> +                    NetClientState *peer, Error **errp);
>>>> +#endif
>>>> +
>>>>  int net_init_vhost_user(const Netdev *netdev, const char *name,
>>>>                          NetClientState *peer, Error **errp);
>>>>
>>>> diff --git a/net/meson.build b/net/meson.build
>>>> index bdf564a57b..61628d4684 100644
>>>> --- a/net/meson.build
>>>> +++ b/net/meson.build
>>>> @@ -36,6 +36,9 @@ system_ss.add(when: vde, if_true: files('vde.c'))
>>>>  if have_netmap
>>>>    system_ss.add(files('netmap.c'))
>>>>  endif
>>>> +
>>>> +system_ss.add(when: libxdp, if_true: files('af-xdp.c'))
>>>> +
>>>>  if have_vhost_net_user
>>>>    system_ss.add(when: 'CONFIG_VIRTIO_NET', if_true: files('vhost-user.c'), if_false: files('vhost-user-stub.c'))
>>>>    system_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-user-stub.c'))
>>>> diff --git a/net/net.c b/net/net.c
>>>> index 6492ad530e..127f70932b 100644
>>>> --- a/net/net.c
>>>> +++ b/net/net.c
>>>> @@ -1082,6 +1082,9 @@ static int (* const net_client_init_fun[NET_CLIENT_DRIVER__MAX])(
>>>>  #ifdef CONFIG_NETMAP
>>>>          [NET_CLIENT_DRIVER_NETMAP]    = net_init_netmap,
>>>>  #endif
>>>> +#ifdef CONFIG_AF_XDP
>>>> +        [NET_CLIENT_DRIVER_AF_XDP]    = net_init_af_xdp,
>>>> +#endif
>>>>  #ifdef CONFIG_NET_BRIDGE
>>>>          [NET_CLIENT_DRIVER_BRIDGE]    = net_init_bridge,
>>>>  #endif
>>>> @@ -1186,6 +1189,9 @@ void show_netdevs(void)
>>>>  #ifdef CONFIG_NETMAP
>>>>          "netmap",
>>>>  #endif
>>>> +#ifdef CONFIG_AF_XDP
>>>> +        "af-xdp",
>>>> +#endif
>>>>  #ifdef CONFIG_POSIX
>>>>          "vhost-user",
>>>>  #endif
>>>> diff --git a/qapi/net.json b/qapi/net.json
>>>> index db67501308..88f2c982c2 100644
>>>> --- a/qapi/net.json
>>>> +++ b/qapi/net.json
>>>> @@ -408,6 +408,62 @@
>>>>      'ifname':     'str',
>>>>      '*devname':    'str' } }
>>>>
>>>> +##
>>>> +# @AFXDPMode:
>>>> +#
>>>> +# Attach mode for a default XDP program
>>>> +#
>>>> +# @skb: generic mode, no driver support necessary
>>>> +#
>>>> +# @native: DRV mode, program is attached to a driver, packets are passed to
>>>> +#     the socket without allocation of skb.
>>>> +#
>>>> +# Since: 8.1
>>>
>>> I'd make it for 8.2.
>>
>> OK.
>>
>>>
>>>> +##
>>>> +{ 'enum': 'AFXDPMode',
>>>> +  'data': [ 'native', 'skb' ] }
>>>> +
>>>> +##
>>>> +# @NetdevAFXDPOptions:
>>>> +#
>>>> +# AF_XDP network backend
>>>> +#
>>>> +# @ifname: The name of an existing network interface.
>>>> +#
>>>> +# @mode: Attach mode for a default XDP program.  If not specified, then
>>>> +#     'native' will be tried first, then 'skb'.
>>>> +#
>>>> +# @force-copy: Force XDP copy mode even if device supports zero-copy.
>>>> +#     (default: false)
>>>> +#
>>>> +# @queues: number of queues to be used for multiqueue interfaces (default: 1).
>>>> +#
>>>> +# @start-queue: Use @queues starting from this queue number (default: 0).
>>>> +#
>>>> +# @inhibit: Don't load a default XDP program, use one already loaded to
>>>> +#     the interface (default: false).  Requires @sock-fds or @xsks-map-fd.
>>>> +#
>>>> +# @sock-fds: A colon (:) separated list of file descriptors for already open
>>>> +#     but not bound AF_XDP sockets in the queue order.  One fd per queue.
>>>> +#     These descriptors should already be added into XDP socket map for
>>>> +#     corresponding queues.  Requires @inhibit.
>>>> +#
>>>> +# @xsks-map-fd: A file descriptor for an already open XDP socket map in
>>>> +#     the already loaded XDP program.  Requires @inhibit.
>>>> +#
>>>> +# Since: 8.1
>>>> +##
>>>> +{ 'struct': 'NetdevAFXDPOptions',
>>>> +  'data': {
>>>> +    'ifname':       'str',
>>>> +    '*mode':        'AFXDPMode',
>>>> +    '*force-copy':  'bool',
>>>> +    '*queues':      'int',
>>>> +    '*start-queue': 'int',
>>>> +    '*inhibit':     'bool',
>>>> +    '*sock-fds':    { 'type': 'str', 'if': 'HAVE_XSK_UMEM__CREATE_WITH_FD' },
>>
>> The parameter is defined conditionally here and it will not be
>> compiled in, if HAVE_XSK_UMEM__CREATE_WITH_FD is not defined.
> 
> Right, I missed that.
> 
>>
>>>> +    '*xsks-map-fd': 'str' } }
>>>> +
>>>>  ##
>>>>  # @NetdevVhostUserOptions:
>>>>  #
>>>> @@ -642,13 +698,14 @@
>>>>  # @vmnet-bridged: since 7.1
>>>>  # @stream: since 7.2
>>>>  # @dgram: since 7.2
>>>> +# @af-xdp: since 8.1
>>>>  #
>>>>  # Since: 2.7
>>>>  ##
>>>>  { 'enum': 'NetClientDriver',
>>>>    'data': [ 'none', 'nic', 'user', 'tap', 'l2tpv3', 'socket', 'stream',
>>>>              'dgram', 'vde', 'bridge', 'hubport', 'netmap', 'vhost-user',
>>>> -            'vhost-vdpa',
>>>> +            'vhost-vdpa', 'af-xdp',
>>>>              { 'name': 'vmnet-host', 'if': 'CONFIG_VMNET' },
>>>>              { 'name': 'vmnet-shared', 'if': 'CONFIG_VMNET' },
>>>>              { 'name': 'vmnet-bridged', 'if': 'CONFIG_VMNET' }] }
>>>> @@ -680,6 +737,7 @@
>>>>      'bridge':   'NetdevBridgeOptions',
>>>>      'hubport':  'NetdevHubPortOptions',
>>>>      'netmap':   'NetdevNetmapOptions',
>>>> +    'af-xdp':   'NetdevAFXDPOptions',
>>>>      'vhost-user': 'NetdevVhostUserOptions',
>>>>      'vhost-vdpa': 'NetdevVhostVDPAOptions',
>>>>      'vmnet-host': { 'type': 'NetdevVmnetHostOptions',
>>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>>> index b57489d7ca..d91610701c 100644
>>>> --- a/qemu-options.hx
>>>> +++ b/qemu-options.hx
>>>> @@ -2856,6 +2856,25 @@ DEF("netdev", HAS_ARG, QEMU_OPTION_netdev,
>>>>      "                VALE port (created on the fly) called 'name' ('nmname' is name of the \n"
>>>>      "                netmap device, defaults to '/dev/netmap')\n"
>>>>  #endif
>>>> +#ifdef CONFIG_AF_XDP
>>>> +    "-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off]\n"
>>>> +    "         [,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k]\n"
>>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>>>> +    "         [,sock-fds=x:y:...:z]\n"
>>>> +#endif
>>>> +    "                attach to the existing network interface 'name' with AF_XDP socket\n"
>>>> +    "                use 'mode=MODE' to specify an XDP program attach mode\n"
>>>> +    "                use 'force-copy=on|off' to force XDP copy mode even if device supports zero-copy (default: off)\n"
>>>> +    "                use 'inhibit=on|off' to inhibit loading of a default XDP program (default: off)\n"
>>>> +    "                with inhibit=on,\n"
>>>> +#ifdef HAVE_XSK_UMEM__CREATE_WITH_FD
>>>> +    "                  use 'sock-fds' to provide file descriptors for already open XDP sockets\n"
>>>> +    "                  added to a socket map in XDP program.  One socket per queue.  Or\n"
>>>> +#endif
>>>> +    "                  use 'xsks-map-fd=k' to provide a file descriptor for xsks map\n"
>>>> +    "                use 'queues=n' to specify how many queues of a multiqueue interface should be used\n"
>>>> +    "                use 'start-queue=m' to specify the first queue that should be used\n"
>>>> +#endif
>>>>  #ifdef CONFIG_POSIX
>>>>      "-netdev vhost-user,id=str,chardev=dev[,vhostforce=on|off]\n"
>>>>      "                configure a vhost-user network, backed by a chardev 'dev'\n"
>>>> @@ -2901,6 +2920,9 @@ DEF("nic", HAS_ARG, QEMU_OPTION_nic,
>>>>  #ifdef CONFIG_NETMAP
>>>>      "netmap|"
>>>>  #endif
>>>> +#ifdef CONFIG_AF_XDP
>>>> +    "af-xdp|"
>>>> +#endif
>>>>  #ifdef CONFIG_POSIX
>>>>      "vhost-user|"
>>>>  #endif
>>>> @@ -2929,6 +2951,9 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>>>>  #ifdef CONFIG_NETMAP
>>>>      "netmap|"
>>>>  #endif
>>>> +#ifdef CONFIG_AF_XDP
>>>> +    "af-xdp|"
>>>> +#endif
>>>>  #ifdef CONFIG_VMNET
>>>>      "vmnet-host|vmnet-shared|vmnet-bridged|"
>>>>  #endif
>>>> @@ -2936,7 +2961,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
>>>>      "                old way to initialize a host network interface\n"
>>>>      "                (use the -netdev option if possible instead)\n", QEMU_ARCH_ALL)
>>>>  SRST
>>>> -``-nic [tap|bridge|user|l2tpv3|vde|netmap|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
>>>> +``-nic [tap|bridge|user|l2tpv3|vde|netmap|af-xdp|vhost-user|socket][,...][,mac=macaddr][,model=mn]``
>>>>      This option is a shortcut for configuring both the on-board
>>>>      (default) guest NIC hardware and the host network backend in one go.
>>>>      The host backend options are the same as with the corresponding
>>>> @@ -3350,6 +3375,62 @@ SRST
>>>>          # launch QEMU instance
>>>>          |qemu_system| linux.img -nic vde,sock=/tmp/myswitch
>>>>
>>>> +``-netdev af-xdp,id=str,ifname=name[,mode=native|skb][,force-copy=on|off][,queues=n][,start-queue=m][,inhibit=on|off][,xsks-map-fd=k][,sock-fds=x:y:...:z]``
>>>> +    Configure AF_XDP backend to connect to a network interface 'name'
>>>> +    using AF_XDP socket.  A specific program attach mode for a default
>>>> +    XDP program can be forced with 'mode', defaults to best-effort,
>>>> +    where the likely most performant mode will be in use.  Number of queues
>>>> +    'n' should generally match the number of queues in the interface,
>>>> +    defaults to 1.  Traffic arriving on non-configured device queues will
>>>> +    not be delivered to the network backend.
>>>> +
>>>> +    .. parsed-literal::
>>>> +
>>>> +        # set number of queues to 4
>>>> +        ethtool -L eth0 combined 4
>>>> +        # launch QEMU instance
>>>> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>>>> +            -netdev af-xdp,id=n1,ifname=eth0,queues=4
>>>> +
>>>> +    'start-queue' option can be specified if a particular range of queues
>>>> +    [m, m + n] should be in use.  For example, this is necessary in order
>>>> +    to use MLX NICs in native mode.  The driver will create a separate set
>>>> +    of queues on top of regular ones, and only these queues can be used
>>>> +    for AF_XDP sockets.  MLX NICs will also require an additional traffic
>>>> +    redirection with ethtool to these queues.
>>>
>>> Let's avoid mentioning any vendor name unless the emulation is done
>>> for a specific one.
>>
>> AFAIK, MLX/NVIDIA is the only vendor that implements XDP queues in this
>> strange fashion, so these are the only NICs that will not work without
>> start-queue configuration and the extra traffic re-direction.  Unfortunately,
>> this is also not documented anywhere, so you just have to know that it works
>> this way...
>>
>> So, on one hand, I understand that it's not the place for vendor-specific
>> docs.  On the other, it gives qemu users a chance to not waste a lot of time
>> trying to figure out why everything is configured correctly, but the traffic
>> doesn't flow.
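
The failure mode described above (everything looks configured, yet no traffic
flows) comes down to which device queues the backend listens on. The sketch
below assumes that 'queues=n' with 'start-queue=m' selects the n queue ids
starting at m, which is what the queues=1,start-queue=1 example in the patch
implies. The helper names are hypothetical and not part of QEMU.

```python
def queue_ids(start_queue, queues):
    """Device queue ids in use: 'queues' consecutive ids from 'start_queue'."""
    if start_queue < 0 or queues < 1:
        raise ValueError("start-queue must be >= 0 and queues must be >= 1")
    return list(range(start_queue, start_queue + queues))


def is_delivered(rx_queue, start_queue, queues):
    """True if traffic arriving on device queue 'rx_queue' reaches the backend."""
    return rx_queue in queue_ids(start_queue, queues)
```

For instance, `queue_ids(1, 1)` gives `[1]`, so `is_delivered(0, 1, 1)` is
False: packets that the NIC steers to queue 0 never reach the backend unless
they are redirected to queue 1 first.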
>>
>> Maybe something like this:
>>
>>     'start-queue' option can be specified if a particular range of queues
>>     [m, m + n] should be in use.  For example, this may be necessary in
>>     order to use certain NICs in native mode.  Kernel allows the driver to
>>     create a separate set of XDP queues on top of regular ones, and only
>>     these queues can be used for AF_XDP sockets.  NICs that work this way
>>     may also require an additional traffic redirection with ethtool to these
>>     special queues.
>>
>>     .. parsed-literal::
>>
>>         # set number of queues to 1
>>         ethtool -L eth0 combined 1
>>         # redirect all the traffic to the second queue (id: 1)
>>         # note: drivers may require non-empty key/mask pair.
>>         ethtool -N eth0 flow-type ether \\
>>             dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
>>         ethtool -N eth0 flow-type ether \\
>>             dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
>>         # launch QEMU instance
>>         |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>>             -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
>>
>> What do you think?
> 
> This is great.
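
A note on why the example uses two ethtool rules: each one matches on a
single bit of the destination MAC, and together they cover every address. The
sketch below checks that. It ASSUMES the ethtool command-line convention that
bits set in the 'm' mask are ignored and only clear bits are compared against
the key. The function names are hypothetical.

```python
MAC_BITS = 0xFFFFFFFFFFFF  # 48-bit address space


def parse_mac(text):
    """Convert 'aa:bb:cc:dd:ee:ff' into a 48-bit integer."""
    parts = text.split(":")
    if len(parts) != 6:
        raise ValueError("expected six colon-separated bytes")
    return int("".join(p.zfill(2) for p in parts), 16)


def rule_matches(dst, key, mask):
    """Assumed semantics: mask bits set to 1 are ignored, the rest must equal key."""
    care = ~parse_mac(mask) & MAC_BITS
    return (parse_mac(dst) & care) == (parse_mac(key) & care)


# The two key/mask pairs from the example; both use 'action 1'.
RULES = [
    ("00:00:00:00:00:00", "FF:FF:FF:FF:FF:FE"),
    ("00:00:00:00:00:01", "FF:FF:FF:FF:FF:FE"),
]


def redirected(dst):
    """True if at least one of the two rules steers 'dst' to queue 1."""
    return any(rule_matches(dst, key, mask) for key, mask in RULES)
```

Under that assumption the first rule catches addresses whose lowest bit is 0
and the second those whose lowest bit is 1, so every destination is steered
to queue 1 without needing a rule with an empty key/mask pair.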
> 
>>
>>>
>>>> + E.g.:
>>>> +
>>>> +    .. parsed-literal::
>>>> +
>>>> +        # set number of queues to 1
>>>> +        ethtool -L eth0 combined 1
>>>> +        # redirect all the traffic to the second queue (id: 1)
>>>> +        # note: mlx5 driver requires non-empty key/mask pair.
>>>> +        ethtool -N eth0 flow-type ether \\
>>>> +            dst 00:00:00:00:00:00 m FF:FF:FF:FF:FF:FE action 1
>>>> +        ethtool -N eth0 flow-type ether \\
>>>> +            dst 00:00:00:00:00:01 m FF:FF:FF:FF:FF:FE action 1
>>>> +        # launch QEMU instance
>>>> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>>>> +            -netdev af-xdp,id=n1,ifname=eth0,queues=1,start-queue=1
>>>> +
>>>> +    XDP program can also be loaded externally.  In this case 'inhibit' option
>>>> +    should be set to 'on' and 'xsks-map-fd' provided with a file descriptor
>>>> +    for an open XDP socket map of that program, or 'sock-fds' with file
>>>> +    descriptors for already open but not bound XDP sockets already added to a
>>>> +    socket map for corresponding queues.  One socket per queue.
>>>> +
>>>> +    .. parsed-literal::
>>>> +
>>>> +        |qemu_system| linux.img -device virtio-net-pci,netdev=n1 \\
>>>> +            -netdev af-xdp,id=n1,ifname=eth0,queues=3,inhibit=on,sock-fds=15:16:17
>>>> +
>>>> +    With 'inhibit=on' and 'sock-fds', QEMU process will only require 32 MB
>>>> +    of locked memory (RLIMIT_MEMLOCK) per queue or CAP_IPC_LOCK capability.
>>>> +    With 'inhibit=on' and 'xsks-map-fd' it will additionally require
>>>> +    CAP_NET_RAW capability.  With 'inhibit=off', CAP_SYS/NET_ADMIN should be
>>>> +    added as well.
>>>
>>> This would have to be kept in sync with code changes, so I suggest
>>> not describing:
>>>
>>> 1) actual numbers of memory requirement
>>> 2) capabilities required for each mode, we don't do that for other
>>> netdev like tap.
>>
>> Makes sense.  So, should we just strike the paragraph above entirely?
> 
> I think so.
> 
> Thanks
> 
>>
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>> +
>>>> +
>>>>  ``-netdev vhost-user,chardev=id[,vhostforce=on|off][,queues=n]``
>>>>      Establish a vhost-user netdev, backed by a chardev id. The chardev
>>>>      should be a unix domain socket backed one. The vhost-user uses a
>>>> diff --git a/scripts/ci/org.centos/stream/8/x86_64/configure b/scripts/ci/org.centos/stream/8/x86_64/configure
>>>> index d02b09a4b9..7585c4c4ed 100755
>>>> --- a/scripts/ci/org.centos/stream/8/x86_64/configure
>>>> +++ b/scripts/ci/org.centos/stream/8/x86_64/configure
>>>> @@ -35,6 +35,7 @@
>>>>  --block-drv-ro-whitelist="vmdk,vhdx,vpc,https,ssh" \
>>>>  --with-coroutine=ucontext \
>>>>  --tls-priority=@QEMU,SYSTEM \
>>>> +--disable-af-xdp \
>>>>  --disable-attr \
>>>>  --disable-auth-pam \
>>>>  --disable-avx2 \
>>>> diff --git a/scripts/meson-buildoptions.sh b/scripts/meson-buildoptions.sh
>>>> index 7dd5709ef4..162e455309 100644
>>>> --- a/scripts/meson-buildoptions.sh
>>>> +++ b/scripts/meson-buildoptions.sh
>>>> @@ -74,6 +74,7 @@ meson_options_help() {
>>>>    printf "%s\n" 'disabled with --disable-FEATURE, default is enabled if available'
>>>>    printf "%s\n" '(unless built with --without-default-features):'
>>>>    printf "%s\n" ''
>>>> +  printf "%s\n" '  af-xdp          AF_XDP network backend support'
>>>>    printf "%s\n" '  alsa            ALSA sound support'
>>>>    printf "%s\n" '  attr            attr/xattr support'
>>>>    printf "%s\n" '  auth-pam        PAM access control'
>>>> @@ -207,6 +208,8 @@ meson_options_help() {
>>>>  }
>>>>  _meson_option_parse() {
>>>>    case $1 in
>>>> +    --enable-af-xdp) printf "%s" -Daf_xdp=enabled ;;
>>>> +    --disable-af-xdp) printf "%s" -Daf_xdp=disabled ;;
>>>>      --enable-alsa) printf "%s" -Dalsa=enabled ;;
>>>>      --disable-alsa) printf "%s" -Dalsa=disabled ;;
>>>>      --enable-attr) printf "%s" -Dattr=enabled ;;
>>>> diff --git a/tests/docker/dockerfiles/debian-amd64.docker b/tests/docker/dockerfiles/debian-amd64.docker
>>>> index e39871c7bb..207f7adfb9 100644
>>>> --- a/tests/docker/dockerfiles/debian-amd64.docker
>>>> +++ b/tests/docker/dockerfiles/debian-amd64.docker
>>>> @@ -97,6 +97,7 @@ RUN export DEBIAN_FRONTEND=noninteractive && \
>>>>                        libvirglrenderer-dev \
>>>>                        libvte-2.91-dev \
>>>>                        libxen-dev \
>>>> +                      libxdp-dev \
>>>>                        libzstd-dev \
>>>>                        llvm \
>>>>                        locales \
>>>> --
>>>> 2.40.1
>>>>
>>>
>>
>