[PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses

Laine Stump posted 1 patch 4 months, 3 weeks ago
src/network/network_nftables.c                | 69 +++++++++++++++++++
tests/networkxml2firewalldata/base.nftables   | 14 ++++
.../forward-dev-linux.nftables                | 16 +++++
.../isolated-linux.nftables                   | 16 +++++
.../nat-default-linux.nftables                | 16 +++++
.../nat-ipv6-linux.nftables                   | 16 +++++
.../nat-ipv6-masquerade-linux.nftables        | 16 +++++
.../nat-many-ips-linux.nftables               | 16 +++++
.../nat-port-range-ipv6-linux.nftables        | 16 +++++
.../nat-port-range-linux.nftables             | 16 +++++
.../nat-tftp-linux.nftables                   | 16 +++++
.../route-default-linux.nftables              | 16 +++++
12 files changed, 243 insertions(+)
[PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Laine Stump 4 months, 3 weeks ago
Many long years ago (April 2010), soon after "vhost" in-kernel packet
processing was added to the virtio-net driver, people running RHEL5
virtual machines with a virtio-net interface connected via a libvirt
virtual network noticed that when vhost packet processing was enabled,
their VMs could no longer get an IP address via DHCP - the guest was
ignoring the DHCP response packets sent by the host.

(I've been informed by danpb that the same issue had been encountered,
and "fixed" even earlier than that, in 2006, with Xen as the
hypervisor.)

The "gory details" of the 2010 discussion are chronicled here:

  https://lists.isc.org/pipermail/dhcp-hackers/2010-April/001835.html

but basically it was because the checksum of packets wasn't being
fully computed on the host side (because QEMU on the host and the NIC
driver in the guest had agreed between themselves to turn off
checksums because they were unnecessary due to the "link" between the
two being entirely in local memory and not a physical cable), but

1) a partial checksum had been put into the packets at some point by
   "someone"

2) the "don't use checksums" info was known by the guest kernel, which
   would properly ignore the "bad" checksum), and

3) the packets were being read by the dhclient application on the
   guest side with a "raw" socket (thus bypassing the guest kernel UDP
   processing that would have known the checksum was irrelevant)),

The "fix" for this ended up being two-tiered:

1) The ISC DHCP package (which contains the aforementioned dhclient
program) made a fix to their dhclient code which caused it to accept
packets anyway even if they didn't have a proper checksum (NB: that's
not a full explanation, and possibly not accurate). This fixed the
problem for guests with an updated dhclient. Here is the code with the
fix to ISC DHCP:

  https://github.com/isc-projects/dhcp/blob/master/common/packet.c#L365

This fixed the issue for any new/updated guests that had the fixed
dhclient, but it didn't solve the problem for existing/old guest
images that didn't/couldn't get their dhclient updated. This brings us
to:

2) iptables added a new "CHECKSUM" target and "--checksum-fill"
action:

  http://patchwork.ozlabs.org/patch/58525/

and libvirt added an iptables rule for each virtual network to match
DHCP response packets and perform --checksum-fill. This way by the
time dhclient on the guest reads the raw packet, the checksum would be
corrected, and the packet would be accepted. This was pushed upstream
in libvirt commit v0.8.2-142-gfd5b15ff1a.

The word at the time from those more knowledgeable than me was that
the bad checksum problem was really specific to ISC's dhclient running
on Linux, and so once their fix was in use everywhere dhclient was
used, bad checksums would be a thing of the past and the
--checksum-fill iptables rules would no longer be needed (but would
otherwise be harmless if they were still there). (Plot twist: the
dhclient code in fix (1) above apparently is on a Linux-only code path
- this is very important later!)

Based on this information (and also due to the opinion that fixing it
by having iptables modify the packet checksum was really the wrong way
to permanently fix things, i.e. an "ugly hack"), the nftables
developers made the decision to not implement an equivalent to
--checksum-fill in nftables. As a result, when I wrote the nftables
firewall backend for libvirt virtual networks earlier this year, it
didn't add in any rule to "fix" broken UDP checksums (since there was
apparently no equivalent in nftables and, after all, that was fixed
somewhere else 14 years ago, right???)

But last week, when Rich Jones was doing routine testing using a Fedora
40 host (the first Fedora release to use the nftables backend of libvirt's
network driver by default) and a FreeBSD guest, for "some strange
reason", the FreeBSD guest was unable to get an IP address from DHCP!!

  https://www.spinics.net/linux/fedora/libvirt-users/msg14356.html

A few quick tests proved that it was the same old "bad checksum"
problem from 2010 come back to haunt us - it wasn't a Linux-only issue
after all.

Phil Sutter and Eric Garver (nftables people) pointed out that, while
nftables doesn't have an action that will *compute* the checksum of a
packet, it *does* have an action that will set the checksum to 0, and
suggested we try adding a "zero the checksum" rule for dhcp response
packets to our nftables ruleset. (Why? Because a checksum value of 0
in a IPv4 UDP packet is defined by RFC768 to mean "no checksum
generated", implying "checksum not needed").  It turns out that this
works - dhclient properly recognizes that a 0 checksum means "don't
bother with the checksum", and accepts the packet as valid.

So to once again fix this timeless bug, this patch adds such a
checksum zeroing rule to the nftables rules setup for each virtual
network.

This has been verified (on a Fedora 40 host) to fix DHCP with FreeBSD
guests, while not breaking it for Fedora or Windows (10) guests.

Fixes: b89c4991daa0ee9371f10937fab3b03c5ffdabc6
Reported-by: Rich Jones <rjones@redhat.com>
Fix-Suggested-by: Eric Garver <egarver@redhat.com>
Fix-Suggested-by: Phil Sutter <psutter@redhat.com>
Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---

(NB: jdenemar reminded me this is a bugfix, so we don't need to rush
to get it pushed before he freezes for the RC1 snapshot. I'm happy
with it as is though, if anyone is awake early enough and wants to
push it before he snapshots RC1. Otherwise I'll just push it later and
it will be in RC2)

Changes from V1:

* informational errors/omissions in commit log message fixed

 src/network/network_nftables.c                | 69 +++++++++++++++++++
 tests/networkxml2firewalldata/base.nftables   | 14 ++++
 .../forward-dev-linux.nftables                | 16 +++++
 .../isolated-linux.nftables                   | 16 +++++
 .../nat-default-linux.nftables                | 16 +++++
 .../nat-ipv6-linux.nftables                   | 16 +++++
 .../nat-ipv6-masquerade-linux.nftables        | 16 +++++
 .../nat-many-ips-linux.nftables               | 16 +++++
 .../nat-port-range-ipv6-linux.nftables        | 16 +++++
 .../nat-port-range-linux.nftables             | 16 +++++
 .../nat-tftp-linux.nftables                   | 16 +++++
 .../route-default-linux.nftables              | 16 +++++
 12 files changed, 243 insertions(+)

diff --git a/src/network/network_nftables.c b/src/network/network_nftables.c
index f8b5ab665d..5523207269 100644
--- a/src/network/network_nftables.c
+++ b/src/network/network_nftables.c
@@ -51,6 +51,7 @@ VIR_LOG_INIT("network.nftables");
 #define VIR_NFTABLES_FWD_OUT_CHAIN "guest_output"
 #define VIR_NFTABLES_FWD_X_CHAIN "guest_cross"
 #define VIR_NFTABLES_NAT_POSTROUTE_CHAIN "guest_nat"
+#define VIR_NFTABLES_MANGLE_POSTROUTE_CHAIN "postroute_mangle"
 
 /* we must avoid using the standard "filter" table as used by
  * iptables, as any subsequent attempts to use iptables commands will
@@ -106,6 +107,10 @@ nftablesGlobalChain nftablesChains[] = {
 
     /* chains for NAT rules */
     {NULL, VIR_NFTABLES_NAT_POSTROUTE_CHAIN, "{ type nat hook postrouting priority 100; policy accept; }"},
+
+    /* chain for "mangle" rules that modify packets (e.g. 0 out UDP checksums) */
+    {NULL, VIR_NFTABLES_MANGLE_POSTROUTE_CHAIN, "{ type filter hook postrouting priority 0; policy accept; }"},
+
 };
 
 
@@ -644,6 +649,44 @@ nftablesAddDontMasquerade(virFirewall *fw,
 }
 
 
+/**
+ * nftablesAddOutputFixUdpChecksum:
+ *
+ * Add a rule to @fw that will 0 out the checksum of udp packets
+ * output from @iface with destination port @port.
+
+ * Zeroing the checksum of a UDP packet tells the receiving end "you
+ * don't need to validate the checksum", which is useful in cases
+ * where the host (sender) thinks that packet checksums will be
+ * computed elsewhere (and so leaves a partially computed checksum in
+ * the packet header) while the guest (receiver) thinks that the
+ * checksum has already been fully computed; in the meantime none of
+ * the code in between has actually finished computing the
+ * checksum.
+ *
+ * An example of this is DHCP response packets from host to
+ * guest. If the checksum of each of these packets isn't zeroed, then
+ * many guests (e.g. FreeBSD) will drop them with reason BAD CHECKSUM;
+ * if the packets arrive at those guests with a checksum of 0, they
+ * will happily accept the packet.
+ */
+static void
+nftablesAddOutputFixUdpChecksum(virFirewall *fw,
+                                const char *iface,
+                                int port)
+{
+    g_autofree char *portstr = g_strdup_printf("%d", port);
+
+    virFirewallAddCmd(fw, VIR_FIREWALL_LAYER_IPV4,
+                      "insert", "rule", "ip",
+                      VIR_NFTABLES_PRIVATE_TABLE,
+                      VIR_NFTABLES_MANGLE_POSTROUTE_CHAIN,
+                      "oif", iface, "udp", "dport", portstr,
+                      "counter", "udp", "checksum", "set", "0",
+                      NULL);
+}
+
+
 static const char networkLocalMulticastIPv4[] = "224.0.0.0/24";
 static const char networkLocalMulticastIPv6[] = "ff02::/16";
 static const char networkLocalBroadcast[] = "255.255.255.255/32";
@@ -901,6 +944,30 @@ nftablesAddGeneralFirewallRules(virFirewall *fw,
 }
 
 
+static void
+nftablesAddChecksumFirewallRules(virFirewall *fw,
+                                 virNetworkDef *def)
+{
+    size_t i;
+    virNetworkIPDef *ipv4def;
+
+    /* Look for the first IPv4 address that has dhcp or tftpboot
+     * defined. We support dhcp config on 1 IPv4 interface only.
+     */
+    for (i = 0; (ipv4def = virNetworkDefGetIPByIndex(def, AF_INET, i)); i++) {
+        if (ipv4def->nranges || ipv4def->nhosts)
+            break;
+    }
+
+    /* If we are doing local DHCP service on this network, add a rule
+     * that will fixup the checksum of DHCP response packets back to
+     * the guests.
+     */
+    if (ipv4def)
+        nftablesAddOutputFixUdpChecksum(fw, def->bridge, 68);
+}
+
+
 static int
 nftablesAddIPSpecificFirewallRules(virFirewall *fw,
                                    virNetworkDef *def,
@@ -952,6 +1019,8 @@ nftablesAddFirewallRules(virNetworkDef *def, virFirewall **fwRemoval)
             return -1;
     }
 
+    nftablesAddChecksumFirewallRules(fw, def);
+
     if (virFirewallApply(fw) < 0)
         return -1;
 
diff --git a/tests/networkxml2firewalldata/base.nftables b/tests/networkxml2firewalldata/base.nftables
index a064318739..6371fc12dd 100644
--- a/tests/networkxml2firewalldata/base.nftables
+++ b/tests/networkxml2firewalldata/base.nftables
@@ -68,6 +68,13 @@ libvirt_network \
 guest_nat \
 '{ type nat hook postrouting priority 100; policy accept; }'
 nft \
+add \
+chain \
+ip \
+libvirt_network \
+postroute_mangle \
+'{ type filter hook postrouting priority 0; policy accept; }'
+nft \
 list \
 table \
 ip6 \
@@ -136,3 +143,10 @@ ip6 \
 libvirt_network \
 guest_nat \
 '{ type nat hook postrouting priority 100; policy accept; }'
+nft \
+add \
+chain \
+ip6 \
+libvirt_network \
+postroute_mangle \
+'{ type filter hook postrouting priority 0; policy accept; }'
diff --git a/tests/networkxml2firewalldata/forward-dev-linux.nftables b/tests/networkxml2firewalldata/forward-dev-linux.nftables
index 8badb74beb..9dea1a88a4 100644
--- a/tests/networkxml2firewalldata/forward-dev-linux.nftables
+++ b/tests/networkxml2firewalldata/forward-dev-linux.nftables
@@ -156,3 +156,19 @@ daddr \
 224.0.0.0/24 \
 counter \
 return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/isolated-linux.nftables b/tests/networkxml2firewalldata/isolated-linux.nftables
index d1b4dac178..67ee0a2bf5 100644
--- a/tests/networkxml2firewalldata/isolated-linux.nftables
+++ b/tests/networkxml2firewalldata/isolated-linux.nftables
@@ -62,3 +62,19 @@ oif \
 virbr0 \
 counter \
 accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-default-linux.nftables b/tests/networkxml2firewalldata/nat-default-linux.nftables
index 28508292f9..951a5a6d60 100644
--- a/tests/networkxml2firewalldata/nat-default-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-default-linux.nftables
@@ -142,3 +142,19 @@ daddr \
 224.0.0.0/24 \
 counter \
 return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-ipv6-linux.nftables b/tests/networkxml2firewalldata/nat-ipv6-linux.nftables
index d8a9ba706d..617ed8b753 100644
--- a/tests/networkxml2firewalldata/nat-ipv6-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-ipv6-linux.nftables
@@ -200,3 +200,19 @@ oif \
 virbr0 \
 counter \
 accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables b/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables
index a7f09cda59..a710d0e296 100644
--- a/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables
@@ -272,3 +272,19 @@ daddr \
 ff02::/16 \
 counter \
 return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-many-ips-linux.nftables b/tests/networkxml2firewalldata/nat-many-ips-linux.nftables
index b826fe6134..0be5fb7e65 100644
--- a/tests/networkxml2firewalldata/nat-many-ips-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-many-ips-linux.nftables
@@ -366,3 +366,19 @@ daddr \
 224.0.0.0/24 \
 counter \
 return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables b/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables
index ceaed6fa40..7574356855 100644
--- a/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables
@@ -384,3 +384,19 @@ daddr \
 ff02::/16 \
 counter \
 return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-port-range-linux.nftables b/tests/networkxml2firewalldata/nat-port-range-linux.nftables
index 1dc37a26ec..127536e4db 100644
--- a/tests/networkxml2firewalldata/nat-port-range-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-port-range-linux.nftables
@@ -312,3 +312,19 @@ oif \
 virbr0 \
 counter \
 accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-tftp-linux.nftables b/tests/networkxml2firewalldata/nat-tftp-linux.nftables
index 28508292f9..951a5a6d60 100644
--- a/tests/networkxml2firewalldata/nat-tftp-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-tftp-linux.nftables
@@ -142,3 +142,19 @@ daddr \
 224.0.0.0/24 \
 counter \
 return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/route-default-linux.nftables b/tests/networkxml2firewalldata/route-default-linux.nftables
index 282c9542a5..be9c4f5439 100644
--- a/tests/networkxml2firewalldata/route-default-linux.nftables
+++ b/tests/networkxml2firewalldata/route-default-linux.nftables
@@ -56,3 +56,19 @@ oif \
 virbr0 \
 counter \
 accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
-- 
2.47.0
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 3 weeks ago
On Fri, Oct 25, 2024 at 12:18:14AM -0400, Laine Stump wrote:
> Many long years ago (April 2010), soon after "vhost" in-kernel packet
> processing was added to the virtio-net driver, people running RHEL5
> virtual machines with a virtio-net interface connected via a libvirt
> virtual network noticed that when vhost packet processing was enabled,
> their VMs could no longer get an IP address via DHCP - the guest was
> ignoring the DHCP response packets sent by the host.
> 
> (I've been informed by danpb that the same issue had been encountered,
> and "fixed" even earlier than that, in 2006, with Xen as the
> hypervisor.)
> 
> The "gory details" of the 2010 discussion are chronicled here:
> 
>   https://lists.isc.org/pipermail/dhcp-hackers/2010-April/001835.html
> 
> but basically it was because the checksum of packets wasn't being
> fully computed on the host side (because QEMU on the host and the NIC
> driver in the guest had agreed between themselves to turn off
> checksums because they were unnecessary due to the "link" between the
> two being entirely in local memory and not a physical cable), but
> 
> 1) a partial checksum had been put into the packets at some point by
>    "someone"
> 
> 2) the "don't use checksums" info was known by the guest kernel, which
>    would properly ignore the "bad" checksum), and
> 
> 3) the packets were being read by the dhclient application on the
>    guest side with a "raw" socket (thus bypassing the guest kernel UDP
>    processing that would have known the checksum was irrelevant)),
> 
> The "fix" for this ended up being two-tiered:
> 
> 1) The ISC DHCP package (which contains the aforementioned dhclient
> program) made a fix to their dhclient code which caused it to accept
> packets anyway even if they didn't have a proper checksum (NB: that's
> not a full explanation, and possibly not accurate). This fixed the
> problem for guests with an updated dhclient. Here is the code with the
> fix to ISC DHCP:
> 
>   https://github.com/isc-projects/dhcp/blob/master/common/packet.c#L365
> 
> This fixed the issue for any new/updated guests that had the fixed
> dhclient, but it didn't solve the problem for existing/old guest
> images that didn't/couldn't get their dhclient updated. This brings us
> to:
> 
> 2) iptables added a new "CHECKSUM" target and "--checksum-fill"
> action:
> 
>   http://patchwork.ozlabs.org/patch/58525/
> 
> and libvirt added an iptables rule for each virtual network to match
> DHCP response packets and perform --checksum-fill. This way by the
> time dhclient on the guest reads the raw packet, the checksum would be
> corrected, and the packet would be accepted. This was pushed upstream
> in libvirt commit v0.8.2-142-gfd5b15ff1a.
> 
> The word at the time from those more knowledgeable than me was that
> the bad checksum problem was really specific to ISC's dhclient running
> on Linux, and so once their fix was in use everywhere dhclient was
> used, bad checksums would be a thing of the past and the
> --checksum-fill iptables rules would no longer be needed (but would
> otherwise be harmless if they were still there). (Plot twist: the
> dhclient code in fix (1) above apparently is on a Linux-only code path
> - this is very important later!)
> 
> Based on this information (and also due to the opinion that fixing it
> by having iptables modify the packet checksum was really the wrong way
> to permanently fix things, i.e. an "ugly hack"), the nftables
> developers made the decision to not implement an equivalent to
> --checksum-fill in nftables. As a result, when I wrote the nftables
> firewall backend for libvirt virtual networks earlier this year, it
> didn't add in any rule to "fix" broken UDP checksums (since there was
> apparently no equivalent in nftables and, after all, that was fixed
> somewhere else 14 years ago, right???)
> 
> But last week, when Rich Jones was doing routine testing using a Fedora
> 40 host (the first Fedora release to use the nftables backend of libvirt's
> network driver by default) and a FreeBSD guest, for "some strange
> reason", the FreeBSD guest was unable to get an IP address from DHCP!!
> 
>   https://www.spinics.net/linux/fedora/libvirt-users/msg14356.html
> 
> A few quick tests proved that it was the same old "bad checksum"
> problem from 2010 come back to haunt us - it wasn't a Linux-only issue
> after all.
> 
> Phil Sutter and Eric Garver (nftables people) pointed out that, while
> nftables doesn't have an action that will *compute* the checksum of a
> packet, it *does* have an action that will set the checksum to 0, and
> suggested we try adding a "zero the checksum" rule for dhcp response
> packets to our nftables ruleset. (Why? Because a checksum value of 0
> in a IPv4 UDP packet is defined by RFC768 to mean "no checksum
> generated", implying "checksum not needed").  It turns out that this
> works - dhclient properly recognizes that a 0 checksum means "don't
> bother with the checksum", and accepts the packet as valid.
> 
> So to once again fix this timeless bug, this patch adds such a
> checksum zeroing rule to the nftables rules setup for each virtual
> network.
> 
> This has been verified (on a Fedora 40 host) to fix DHCP with FreeBSD
> guests, while not breaking it for Fedora or Windows (10) guests.

You can add OpenBSD to that list, as I tested that too.

> 
> Fixes: b89c4991daa0ee9371f10937fab3b03c5ffdabc6
> Reported-by: Rich Jones <rjones@redhat.com>
> Fix-Suggested-by: Eric Garver <egarver@redhat.com>
> Fix-Suggested-by: Phil Sutter <psutter@redhat.com>
> Signed-off-by: Laine Stump <laine@redhat.com>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
> 
> (NB: jdenemar reminded me this is a bugfix, so we don't need to rush
> to get it pushed before he freezes for the RC1 snapshot. I'm happy
> with it as is though, if anyone is awake early enough and wants to
> push it before he snapshots RC1. Otherwise I'll just push it later and
> it will be in RC2)
> 
> Changes from V1:
> 
> * informational errors/omissions in commit log message fixed
> 
>  src/network/network_nftables.c                | 69 +++++++++++++++++++
>  tests/networkxml2firewalldata/base.nftables   | 14 ++++
>  .../forward-dev-linux.nftables                | 16 +++++
>  .../isolated-linux.nftables                   | 16 +++++
>  .../nat-default-linux.nftables                | 16 +++++
>  .../nat-ipv6-linux.nftables                   | 16 +++++
>  .../nat-ipv6-masquerade-linux.nftables        | 16 +++++
>  .../nat-many-ips-linux.nftables               | 16 +++++
>  .../nat-port-range-ipv6-linux.nftables        | 16 +++++
>  .../nat-port-range-linux.nftables             | 16 +++++
>  .../nat-tftp-linux.nftables                   | 16 +++++
>  .../route-default-linux.nftables              | 16 +++++
>  12 files changed, 243 insertions(+)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Andrea Bolognani 4 months, 2 weeks ago
On Fri, Oct 25, 2024 at 04:44:16PM +0100, Daniel P. Berrangé wrote:
> On Fri, Oct 25, 2024 at 12:18:14AM -0400, Laine Stump wrote:
> > This has been verified (on a Fedora 40 host) to fix DHCP with FreeBSD
> > guests, while not breaking it for Fedora or Windows (10) guests.
>
> You can add OpenBSD to that list, as I tested that too.

I did some testing of my own and I can confirm that FreeBSD and
OpenBSD are fine with this change, as are various Linux flavors
(Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu).

However, a few other operating systems aren't: namely GNU/Hurd, Haiku
and NetBSD break with this change. Interestingly, these were all fine
with the nftables backend before it.

Now, one could argue that GNU/Hurd and Haiku are toy/research
operating systems with fairly small audiences, and it would be hard
to disagree :) but I don't think we can put NetBSD in the same
bucket.

I'm also concerned about old versions of the operating systems that
we've listed as working above being unhappy with the change. It's
true that, to an extent, we can just tell people to upgrade their
guests, but sometimes running old operating systems is the whole
point of using virtualization in the first place...

In conclusion, even with this latest fix the nftables backend still
represents a step backwards compared to the iptables one.
Considering that we've made it the default one, we should try to
close the gap as much as possible.

-- 
Andrea Bolognani / Red Hat / Virtualization
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 2 weeks ago
On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote:
> On Fri, Oct 25, 2024 at 04:44:16PM +0100, Daniel P. Berrangé wrote:
> > On Fri, Oct 25, 2024 at 12:18:14AM -0400, Laine Stump wrote:
> > > This has been verified (on a Fedora 40 host) to fix DHCP with FreeBSD
> > > guests, while not breaking it for Fedora or Windows (10) guests.
> >
> > You can add OpenBSD to that list, as I tested that too.
> 
> I did some testing of my own and I can confirm that FreeBSD and
> OpenBSD are fine with this change, as are various Linux flavors
> (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu).
> 
> However, a few other operating systems aren't: namely GNU/Hurd, Haiku
> and NetBSD break with this change. Interestingly, these were all fine
> with the nftables backend before it.

Well that's odd. I've checked NetBSD source code and found no less
than 3 DHCP client impls, and all of them cope with checksum == 0.

https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497

https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507

https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373

the middle impl also directly copes with partial checksums

Not identified the Hurd/Haiku DHCP client code yet...

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Andrea Bolognani 4 months, 2 weeks ago
On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote:
> On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote:
> > I did some testing of my own and I can confirm that FreeBSD and
> > OpenBSD are fine with this change, as are various Linux flavors
> > (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu).
> >
> > However, a few other operating systems aren't: namely GNU/Hurd, Haiku
> > and NetBSD break with this change. Interestingly, these were all fine
> > with the nftables backend before it.
>
> Well that's odd. I've checked NetBSD source code and found no less
> than 3 DHCP client impls, and all of them cope with checksum == 0.
>
> https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497
>
> https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507
>
> https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373
>
> the middle impl also directly copes with partial checksums

The boot log contains

  Starting dhcpcd.
  wm0: checksum failure from 192.168.124.1

so I guess the second implementation is the relevant one.

> Not identified the Hurd/Haiku DHCP client code yet...

I'm using Debian GNU/Hurd, so the DHCP client is the same as regular
Debian (ISC DHCP). The source can be found at

  https://deb.debian.org/debian-ports/pool-hurd-i386/main/i/isc-dhcp/

The version is a bit old and there's the tiniest amount of patching
compared to the Linux build, specifically:

  --- isc-dhcp-4.4.3-P1-1.1/debian/patches/bind-fix 1970-01-01
01:00:00.000000000 +0100
  +++ isc-dhcp-4.4.3-P1-1.1+hurd.1/debian/patches/bind-fix
2023-02-15 15:39:49.000000000 +0100
  @@ -0,0 +1,26 @@
  +Index: isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c
  +===================================================================
  +--- isc-dhcp-4.4.3-P1-build.orig/bind/bind-9.11.36/lib/isc/unix/socket.c
  ++++ isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c
  +@@ -2633,7 +2633,7 @@ opensocket(isc__socketmgr_t *manager, is
  +       char strbuf[ISC_STRERRORSIZE];
  +       const char *err = "socket";
  +       int tries = 0;
  +-#if defined(USE_CMSG) || defined(SO_BSDCOMPAT) || defined(SO_NOSIGPIPE)
  ++#if 1
  +       int on = 1;
  + #endif
  + #if defined(SO_RCVBUF)

I'm not sure whether this could be relevant to the issue at hand.


To clarify, this is something that needs to be handled at the
userspace level, no kernel changes required? And clearly it affects
DHCP, but what about other protocols? Are we confident those will
cope just fine?

-- 
Andrea Bolognani / Red Hat / Virtualization
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote:
> On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote:
> > On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote:
> > > I did some testing of my own and I can confirm that FreeBSD and
> > > OpenBSD are fine with this change, as are various Linux flavors
> > > (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu).
> > >
> > > However, a few other operating systems aren't: namely GNU/Hurd, Haiku
> > > and NetBSD break with this change. Interestingly, these were all fine
> > > with the nftables backend before it.
> >
> > Well that's odd. I've checked NetBSD source code and found no less
> > than 3 DHCP client impls, and all of them cope with checksum == 0.
> >
> > https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497
> >
> > https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507
> >
> > https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373
> >
> > the middle impl also directly copes with partial checksums
> 
> The boot log contains
> 
>   Starting dhcpcd.
>   wm0: checksum failure from 192.168.124.1
> 
> so I guess the second implementation is the relevant one.

I've just tested netBSD 10.0 and get exactly the same failure
as you.

I've tried "tcpdump -vv -i vnetXXX port 68" on the host and
on the guest and that is reporting that the checksum is bad.
It is *not* getting set to zero.

Meanwhile, if I run the same tcpdump with OpenBSD guest, I
see tcpdump reporting a zero checksum as expected.

WTF ?

Somehow our nftables rule is not having an effect, or worse,
it is have a non-deterministic effect where it works for
packets on some guests, but not others.

I checked the rule counters and packets are hitting the rule,
but not getting their checksum zerod.

IOW, conceptually I still believe our desire to set the checksum
to zero is correct, but somehow the impl is not taking effect
as expected.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 12:22:42PM +0000, Daniel P. Berrangé wrote:
> On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote:
> > On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote:
> > > On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote:
> > > > I did some testing of my own and I can confirm that FreeBSD and
> > > > OpenBSD are fine with this change, as are various Linux flavors
> > > > (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu).
> > > >
> > > > However, a few other operating systems aren't: namely GNU/Hurd, Haiku
> > > > and NetBSD break with this change. Interestingly, these were all fine
> > > > with the nftables backend before it.
> > >
> > > Well that's odd. I've checked NetBSD source code and found no less
> > > than 3 DHCP client impls, and all of them cope with checksum == 0.
> > >
> > > https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497
> > >
> > > https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507
> > >
> > > https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373
> > >
> > > the middle impl also directly copes with partial checksums
> > 
> > The boot log contains
> > 
> >   Starting dhcpcd.
> >   wm0: checksum failure from 192.168.124.1
> > 
> > so I guess the second implementation is the relevant one.
> 
> I've just tested netBSD 10.0 and get exactly the same failure
> as you.
> 
> I've tried "tcpdump -vv -i vnetXXX port 68" on the host and
> on the guest and that is reporting that the checksum is bad.
> It is *not* getting set to zero.
> 
> Meanwhile, if I run the same tcpdump with OpenBSD guest, I
> see tcpdump reporting a zero checksum as expected.
> 
> WTF ?
> 
> Somehow our nftables rule is not having an effect, or worse,
> it is have a non-deterministic effect where it works for
> packets on some guests, but not others.
> 
> I checked the rule counters and packets are hitting the rule,
> but not getting their checksum zerod.

Further research shows tcpdump on packets leaving 'virbr0' have
the checksum correctly zerod. Our nftables rule is working.

A concurrent tcpdump on packets leaving 'vnetNNN' shows the
checksum is mangled.

With our old iptables rules, we set a valid checksum when leaving
virbr0, and I presume this causes all subsequent code to not touch
the checksum field.

With our new nftables rules, we set a zero checksum when leaving
virbr0, and "zero checksum" conceptually  means "not present (yet)".

I think there must be logic somewhere in the kernel/QEMU which
sees "not present" and decides it needs to do <something> with
the checksum field.

A key difference that is probably relevant is that netbsd is
using an e1000 NIC in QEMU, while openbsd is using a virtio-net
NIC. At least when created by virt-manager.

AFAIR, QEMU's magic checksum offload only happens for virtio-net,
so presumably our rules are incompatible with non-virtio-net NICs
in someway.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Laine Stump 4 months, 2 weeks ago
On 10/29/24 8:46 AM, Daniel P. Berrangé wrote:
> On Tue, Oct 29, 2024 at 12:22:42PM +0000, Daniel P. Berrangé wrote:
>> On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote:
>>> On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote:
>>>> On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote:
>>>>> I did some testing of my own and I can confirm that FreeBSD and
>>>>> OpenBSD are fine with this change, as are various Linux flavors
>>>>> (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu).
>>>>>
>>>>> However, a few other operating systems aren't: namely GNU/Hurd, Haiku
>>>>> and NetBSD break with this change. Interestingly, these were all fine
>>>>> with the nftables backend before it.
>>>>
>>>> Well that's odd. I've checked NetBSD source code and found no less
>>>> than 3 DHCP client impls, and all of them cope with checksum == 0.
>>>>
>>>> https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497
>>>>
>>>> https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507
>>>>
>>>> https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373
>>>>
>>>> the middle impl also directly copes with partial checksums
>>>
>>> The boot log contains
>>>
>>>    Starting dhcpcd.
>>>    wm0: checksum failure from 192.168.124.1
>>>
>>> so I guess the second implementation is the relevant one.
>>
>> I've just tested netBSD 10.0 and get exactly the same failure
>> as you.
>>
>> I've tried "tcpdump -vv -i vnetXXX port 68" on the host and
>> on the guest and that is reporting that the checksum is bad.
>> It is *not* getting set to zero.
>>
>> Meanwhile, if I run the same tcpdump with OpenBSD guest, I
>> see tcpdump reporting a zero checksum as expected.
>>
>> WTF ?
>>
>> Somehow our nftables rule is not having an effect, or worse,
>> it is have a non-deterministic effect where it works for
>> packets on some guests, but not others.
>>
>> I checked the rule counters and packets are hitting the rule,
>> but not getting their checksum zerod.
> 
> Further research shows tcpdump on packets leaving 'virbr0' have
> the checksum correctly zerod. Our nftables rule is working.
> 
> A concurrent tcpdump on packets leaving 'vnetNNN' shows the
> checksum is mangled.
> 
> With our old iptables rules, we set a valid checksum when leaving
> virbr0, and I presume this causes all subsequent code to not touch
> the checksum field.
> 
> With our new nftables rules, we set a zero checksum when leaving
> virbr0, and "zero checksum" conceptually  means "not present (yet)".
> 
> I think there must be logic somewhere in the kernel/QEMU which
> sees "not present" and decides it needs to do <something> with
> the checksum field.

Yikes!

> 
> A key difference that is probably relevant is that netbsd is
> using an e1000 NIC in QEMU, while openbsd is using a virtio-net
> NIC. At least when created by virt-manager.
 >
> AFAIR, QEMU's magic checksum offload only happens for virtio-net,
> so presumably our rules are incompatible with non-virtio-net NICs
> in someway.

Double and triple yikes!

So something in the packet path for non-virtio-net NICs is noticing that 
the packet checksum is 0, and then "fixing" it with the *wrong* checksum?

But in the past when it already had the correct checksum, that same bit 
of code said "Huh. The checksum is already correct" and left it alone.

So when the extra rules are removed, then those same guests begin 
working? (You can easily remove the checksum rules with:

   nft delete chain ip libvirt_network postroute_mangle

BTW, I just now tried an e1000e NIC on Fedora guest and it continues to 
work with the 0-checksum rules removed. In this case tcpdump on virbr0 
shows "bad cksum", but when I look at tcpdump on the guest, it shows 
"udp cksum ok" though, so something else somewhere is setting the 
checksum to the correct value.
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Phil Sutter 4 months, 2 weeks ago
Hi,

On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
> On 10/29/24 8:46 AM, Daniel P. Berrangé wrote:
> > On Tue, Oct 29, 2024 at 12:22:42PM +0000, Daniel P. Berrangé wrote:
> >> On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote:
> >>> On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote:
> >>>> On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote:
> >>>>> I did some testing of my own and I can confirm that FreeBSD and
> >>>>> OpenBSD are fine with this change, as are various Linux flavors
> >>>>> (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu).
> >>>>>
> >>>>> However, a few other operating systems aren't: namely GNU/Hurd, Haiku
> >>>>> and NetBSD break with this change. Interestingly, these were all fine
> >>>>> with the nftables backend before it.
> >>>>
> >>>> Well that's odd. I've checked NetBSD source code and found no less
> >>>> than 3 DHCP client impls, and all of them cope with checksum == 0.
> >>>>
> >>>> https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497
> >>>>
> >>>> https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507
> >>>>
> >>>> https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373
> >>>>
> >>>> the middle impl also directly copes with partial checksums
> >>>
> >>> The boot log contains
> >>>
> >>>    Starting dhcpcd.
> >>>    wm0: checksum failure from 192.168.124.1
> >>>
> >>> so I guess the second implementation is the relevant one.
> >>
> >> I've just tested netBSD 10.0 and get exactly the same failure
> >> as you.
> >>
> >> I've tried "tcpdump -vv -i vnetXXX port 68" on the host and
> >> on the guest and that is reporting that the checksum is bad.
> >> It is *not* getting set to zero.
> >>
> >> Meanwhile, if I run the same tcpdump with OpenBSD guest, I
> >> see tcpdump reporting a zero checksum as expected.
> >>
> >> WTF ?
> >>
> >> Somehow our nftables rule is not having an effect, or worse,
> >> it is have a non-deterministic effect where it works for
> >> packets on some guests, but not others.
> >>
> >> I checked the rule counters and packets are hitting the rule,
> >> but not getting their checksum zerod.
> > 
> > Further research shows tcpdump on packets leaving 'virbr0' have
> > the checksum correctly zerod. Our nftables rule is working.
> > 
> > A concurrent tcpdump on packets leaving 'vnetNNN' shows the
> > checksum is mangled.
> > 
> > With our old iptables rules, we set a valid checksum when leaving
> > virbr0, and I presume this causes all subsequent code to not touch
> > the checksum field.
> > 
> > With our new nftables rules, we set a zero checksum when leaving
> > virbr0, and "zero checksum" conceptually  means "not present (yet)".
> > 
> > I think there must be logic somewhere in the kernel/QEMU which
> > sees "not present" and decides it needs to do <something> with
> > the checksum field.
> 
> Yikes!
> 
> > 
> > A key difference that is probably relevant is that netbsd is
> > using an e1000 NIC in QEMU, while openbsd is using a virtio-net
> > NIC. At least when created by virt-manager.
>  >
> > AFAIR, QEMU's magic checksum offload only happens for virtio-net,
> > so presumably our rules are incompatible with non-virtio-net NICs
> > in someway.
> 
> Double and triple yikes!
> 
> So something in the packet path for non-virtio-net NICs is noticing that 
> the packet checksum is 0, and then "fixing" it with the *wrong* checksum?
> 
> But in the past when it already had the correct checksum, that same bit 
> of code said "Huh. The checksum is already correct" and left it alone.

Maybe this is due to the partial checksum thing in kernel which updates
the checksum if not using virtio "shortcut".

> So when the extra rules are removed, then those same guests begin 
> working? (You can easily remove the checksum rules with:
> 
>    nft delete chain ip libvirt_network postroute_mangle
> 
> BTW, I just now tried an e1000e NIC on Fedora guest and it continues to 
> work with the 0-checksum rules removed. In this case tcpdump on virbr0 
> shows "bad cksum", but when I look at tcpdump on the guest, it shows 
> "udp cksum ok" though, so something else somewhere is setting the 
> checksum to the correct value.

FWIW, I just tested an alternative workaround using tc. This works for
me with a FreeBSD guest and NIC switched to either e1000 or virtio:

# tc qd add dev vnetbr0 root handle 1: htb
# tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
	u32 match ip sport 67 ffff match ip dport 68 ffff \
	action csum ip and udp

Another alternative might be to add the nftables rule for virtio-based
guests only.

Cheers, Phil
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Laine Stump 4 months, 2 weeks ago
On 10/29/24 11:12 AM, Phil Sutter wrote:
> Hi,
> 
> On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
>> On 10/29/24 8:46 AM, Daniel P. Berrangé wrote:
>>> On Tue, Oct 29, 2024 at 12:22:42PM +0000, Daniel P. Berrangé wrote:
>>>> On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote:
>>>>> On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote:
>>>>>> On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote:
>>>>>>> I did some testing of my own and I can confirm that FreeBSD and
>>>>>>> OpenBSD are fine with this change, as are various Linux flavors
>>>>>>> (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu).
>>>>>>>
>>>>>>> However, a few other operating systems aren't: namely GNU/Hurd, Haiku
>>>>>>> and NetBSD break with this change. Interestingly, these were all fine
>>>>>>> with the nftables backend before it.
>>>>>>
>>>>>> Well that's odd. I've checked NetBSD source code and found no less
>>>>>> than 3 DHCP client impls, and all of them cope with checksum == 0.
>>>>>>
>>>>>> https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497
>>>>>>
>>>>>> https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507
>>>>>>
>>>>>> https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373
>>>>>>
>>>>>> the middle impl also directly copes with partial checksums
>>>>>
>>>>> The boot log contains
>>>>>
>>>>>     Starting dhcpcd.
>>>>>     wm0: checksum failure from 192.168.124.1
>>>>>
>>>>> so I guess the second implementation is the relevant one.
>>>>
>>>> I've just tested netBSD 10.0 and get exactly the same failure
>>>> as you.
>>>>
>>>> I've tried "tcpdump -vv -i vnetXXX port 68" on the host and
>>>> on the guest and that is reporting that the checksum is bad.
>>>> It is *not* getting set to zero.
>>>>
>>>> Meanwhile, if I run the same tcpdump with OpenBSD guest, I
>>>> see tcpdump reporting a zero checksum as expected.
>>>>
>>>> WTF ?
>>>>
>>>> Somehow our nftables rule is not having an effect, or worse,
>>>> it is have a non-deterministic effect where it works for
>>>> packets on some guests, but not others.
>>>>
>>>> I checked the rule counters and packets are hitting the rule,
>>>> but not getting their checksum zerod.
>>>
>>> Further research shows tcpdump on packets leaving 'virbr0' have
>>> the checksum correctly zerod. Our nftables rule is working.
>>>
>>> A concurrent tcpdump on packets leaving 'vnetNNN' shows the
>>> checksum is mangled.
>>>
>>> With our old iptables rules, we set a valid checksum when leaving
>>> virbr0, and I presume this causes all subsequent code to not touch
>>> the checksum field.
>>>
>>> With our new nftables rules, we set a zero checksum when leaving
>>> virbr0, and "zero checksum" conceptually  means "not present (yet)".
>>>
>>> I think there must be logic somewhere in the kernel/QEMU which
>>> sees "not present" and decides it needs to do <something> with
>>> the checksum field.
>>
>> Yikes!
>>
>>>
>>> A key difference that is probably relevant is that netbsd is
>>> using an e1000 NIC in QEMU, while openbsd is using a virtio-net
>>> NIC. At least when created by virt-manager.
>>   >
>>> AFAIR, QEMU's magic checksum offload only happens for virtio-net,
>>> so presumably our rules are incompatible with non-virtio-net NICs
>>> in someway.
>>
>> Double and triple yikes!
>>
>> So something in the packet path for non-virtio-net NICs is noticing that
>> the packet checksum is 0, and then "fixing" it with the *wrong* checksum?
>>
>> But in the past when it already had the correct checksum, that same bit
>> of code said "Huh. The checksum is already correct" and left it alone.
> 
> Maybe this is due to the partial checksum thing in kernel which updates
> the checksum if not using virtio "shortcut".
> 
>> So when the extra rules are removed, then those same guests begin
>> working? (You can easily remove the checksum rules with:
>>
>>     nft delete chain ip libvirt_network postroute_mangle
>>
>> BTW, I just now tried an e1000e NIC on Fedora guest and it continues to
>> work with the 0-checksum rules removed. In this case tcpdump on virbr0
>> shows "bad cksum", but when I look at tcpdump on the guest, it shows
>> "udp cksum ok" though, so something else somewhere is setting the
>> checksum to the correct value.
> 
> FWIW, I just tested an alternative workaround using tc. This works for
> me with a FreeBSD guest and NIC switched to either e1000 or virtio:
> 
> # tc qd add dev vnetbr0 root handle 1: htb
> # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
> 	u32 match ip sport 67 ffff match ip dport 68 ffff \
> 	action csum ip and udp

This did work for me. However, since we support bandwidth management of 
the interfaces connected to a virtual netowkr using tc, I decided to try 
doing the same on a network configured with bandwidth management 
enabled. I did this by adding the following to the <network> XML before 
starting the network:

   <bandwidth>
     <inbound average='1000' peak='2000' burst='5120'/>
     <outbound average='500' peak='1000' burst='2560'/>
   </bandwidth>

This resulted in the following tc-related stuff being setup:

# tc filter show dev virbr13
filter parent 1: protocol all pref 1 fw chain 0
filter parent 1: protocol all pref 1 fw chain 0 handle 0x1 classid :1
=======

# tc qd show dev virbr13
qdisc htb 1: root refcnt 2 r2q 10 default 0x2 direct_packets_stat 0 
direct_qlen 1000
qdisc sfq 2: parent 1:2 limit 127p quantum 1514b depth 127 divisor 1024 
perturb 10sec
qdisc ingress ffff: parent ffff:fff1 ----------------
=======

# tc class show dev virbr13
class htb 1:1 root rate 8Mbit ceil 16Mbit burst 1600b cburst 1600b
class htb 1:2 parent 1:1 leaf 2: prio 0 rate 8Mbit ceil 16Mbit burst 5Mb 
cburst 1600b
=======

(obviously my network is using virbr13). Now when I try running the 
commands above, I see this:

# tc qd add dev virbr13 root handle 1: htb
Error: Exclusivity flag on, cannot modify.

So how does this all need to change in order to have our 
per-bridge-device bandwidth management coexist with the qdisc/filter to 
re-compute dhcp response checksums? (sorry for requesting that you ELI5; 
that's just easier than digging through all the documentation to figure 
out what is probably a simple solution :-P)
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Phil Sutter 4 months, 1 week ago
Hi Laine,

On Fri, Nov 01, 2024 at 11:44:12AM -0400, Laine Stump wrote:
> On 10/29/24 11:12 AM, Phil Sutter wrote:
[...]
> > FWIW, I just tested an alternative workaround using tc. This works for
> > me with a FreeBSD guest and NIC switched to either e1000 or virtio:
> > 
> > # tc qd add dev vnetbr0 root handle 1: htb
> > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
> > 	u32 match ip sport 67 ffff match ip dport 68 ffff \
> > 	action csum ip and udp
> 
> This did work for me. However, since we support bandwidth management of 
> the interfaces connected to a virtual netowkr using tc, I decided to try 
> doing the same on a network configured with bandwidth management 
> enabled. I did this by adding the following to the <network> XML before 
> starting the network:
> 
>    <bandwidth>
>      <inbound average='1000' peak='2000' burst='5120'/>
>      <outbound average='500' peak='1000' burst='2560'/>
>    </bandwidth>
> 
> This resulted in the following tc-related stuff being setup:
> 
> # tc filter show dev virbr13
> filter parent 1: protocol all pref 1 fw chain 0
> filter parent 1: protocol all pref 1 fw chain 0 handle 0x1 classid :1
> =======
> 
> # tc qd show dev virbr13
> qdisc htb 1: root refcnt 2 r2q 10 default 0x2 direct_packets_stat 0 
> direct_qlen 1000
> qdisc sfq 2: parent 1:2 limit 127p quantum 1514b depth 127 divisor 1024 
> perturb 10sec
> qdisc ingress ffff: parent ffff:fff1 ----------------
> =======
> 
> # tc class show dev virbr13
> class htb 1:1 root rate 8Mbit ceil 16Mbit burst 1600b cburst 1600b
> class htb 1:2 parent 1:1 leaf 2: prio 0 rate 8Mbit ceil 16Mbit burst 5Mb 
> cburst 1600b
> =======
> 
> (obviously my network is using virbr13). Now when I try running the 
> commands above, I see this:
> 
> # tc qd add dev virbr13 root handle 1: htb
> Error: Exclusivity flag on, cannot modify.

No big deal: You're just trying to recreate the already existing HTB
qdisc and it fails. But my quoted commands create one merely because we
need one to attach the filter to.

> So how does this all need to change in order to have our 
> per-bridge-device bandwidth management coexist with the qdisc/filter to 
> re-compute dhcp response checksums? (sorry for requesting that you ELI5; 
> that's just easier than digging through all the documentation to figure 
> out what is probably a simple solution :-P)

It should just work if we attach the extra filter to the existing qdisc
(untested due to laziness):

# tc filter add dev virbr13 prio 2 protocol ip parent 1: \
	u32 match ip sport 67 ffff match ip dport 68 ffff \
	action csum ip and udp

I forgot the meaning of the prio value again. Lower value may be higher
prio, but IIRC all filters apply anyway and ordering doesn't matter in
this case.

In practice, you may just change the code to always add the HTB qdisc
and the checksum update filter/action. If <bandwidth> is specified, also
add the classes, leave qdisc(s) and fwmark filter.

Let me know if things work this way.

Cheers, Phil
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
> Hi,
> 
> On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
> > So when the extra rules are removed, then those same guests begin 
> > working? (You can easily remove the checksum rules with:
> > 
> >    nft delete chain ip libvirt_network postroute_mangle
> > 
> > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to 
> > work with the 0-checksum rules removed. In this case tcpdump on virbr0 
> > shows "bad cksum", but when I look at tcpdump on the guest, it shows 
> > "udp cksum ok" though, so something else somewhere is setting the 
> > checksum to the correct value.
> 
> FWIW, I just tested an alternative workaround using tc. This works for
> me with a FreeBSD guest and NIC switched to either e1000 or virtio:
> 
> # tc qd add dev vnetbr0 root handle 1: htb
> # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
> 	u32 match ip sport 67 ffff match ip dport 68 ffff \
> 	action csum ip and udp

This feels like it is functionally closest to what we've had historically,
even though it is annoying to have to deal with 'tc' tool, in addition
to 'nft'. So I'm thinking this is probably the way we'll have to go.

> Another alternative might be to add the nftables rule for virtio-based
> guests only.

The firewall rules are in a chain that's applied to all guests,
so we have no where to add a per-guest rule.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Phil Sutter 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
> On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
> > Hi,
> > 
> > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
> > > So when the extra rules are removed, then those same guests begin 
> > > working? (You can easily remove the checksum rules with:
> > > 
> > >    nft delete chain ip libvirt_network postroute_mangle
> > > 
> > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to 
> > > work with the 0-checksum rules removed. In this case tcpdump on virbr0 
> > > shows "bad cksum", but when I look at tcpdump on the guest, it shows 
> > > "udp cksum ok" though, so something else somewhere is setting the 
> > > checksum to the correct value.
> > 
> > FWIW, I just tested an alternative workaround using tc. This works for
> > me with a FreeBSD guest and NIC switched to either e1000 or virtio:
> > 
> > # tc qd add dev vnetbr0 root handle 1: htb
> > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
> > 	u32 match ip sport 67 ffff match ip dport 68 ffff \
> > 	action csum ip and udp
> 
> This feels like it is functionally closest to what we've had historically,
> even though it is annoying to have to deal with 'tc' tool, in addition
> to 'nft'. So I'm thinking this is probably the way we'll have to go.

Another ugly detail (inherent to 'tc') is that you have to attach a
classful qdisc to the interface since otherwise you can't add a filter
with attached action. While this may not be a problem in practice, there
is this side-effect of setting up a HTB on the bridge which by default
runs a "noqueue" qdisc.

> > Another alternative might be to add the nftables rule for virtio-based
> > guests only.
> 
> The firewall rules are in a chain that's applied to all guests,
> so we have no where to add a per-guest rule.

With nftables, you may create a chain in netdev family which binds to
the specific device(s) needing the hack. It needs maintenance after
guest startup and shutdown, though.

BTW: libvirt supports configurations which don't involve a 'vnetbr0'
bridge. In this case, you will have to setup tc on the actual tap
device, right?

Cheers, Phil
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote:
> On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
> > On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
> > > Hi,
> > > 
> > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
> > > > So when the extra rules are removed, then those same guests begin 
> > > > working? (You can easily remove the checksum rules with:
> > > > 
> > > >    nft delete chain ip libvirt_network postroute_mangle
> > > > 
> > > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to 
> > > > work with the 0-checksum rules removed. In this case tcpdump on virbr0 
> > > > shows "bad cksum", but when I look at tcpdump on the guest, it shows 
> > > > "udp cksum ok" though, so something else somewhere is setting the 
> > > > checksum to the correct value.
> > > 
> > > FWIW, I just tested an alternative workaround using tc. This works for
> > > me with a FreeBSD guest and NIC switched to either e1000 or virtio:
> > > 
> > > # tc qd add dev vnetbr0 root handle 1: htb
> > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
> > > 	u32 match ip sport 67 ffff match ip dport 68 ffff \
> > > 	action csum ip and udp
> > 
> > This feels like it is functionally closest to what we've had historically,
> > even though it is annoying to have to deal with 'tc' tool, in addition
> > to 'nft'. So I'm thinking this is probably the way we'll have to go.
> 
> Another ugly detail (inherent to 'tc') is that you have to attach a
> classful qdisc to the interface since otherwise you can't add a filter
> with attached action. While this may not be a problem in practice, there
> is this side-effect of setting up a HTB on the bridge which by default
> runs a "noqueue" qdisc.

I'm not that familiar with 'tc'.

Can you explain the functional effect of those 'qdisc' settings on
virbr0, as if I know nothing :-)

> > > Another alternative might be to add the nftables rule for virtio-based
> > > guests only.
> > 
> > The firewall rules are in a chain that's applied to all guests,
> > so we have no where to add a per-guest rule.
> 
> With nftables, you may create a chain in netdev family which binds to
> the specific device(s) needing the hack. It needs maintenance after
> guest startup and shutdown, though.
> 
> BTW: libvirt supports configurations which don't involve a 'vnetbr0'
> bridge. In this case, you will have to setup tc on the actual tap
> device, right?

In those cases, we haven't historically set firewall rules, so
users were on their own, so in that sense, it isn't a regression
we need to solve. Also in those cases, the DHCP daemon would be
off-host, and so packets we're getting back from it ought to
have a checksum present, as they've been over a  physical link.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Phil Sutter 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote:
> On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote:
> > On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
> > > On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
> > > > Hi,
> > > > 
> > > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
> > > > > So when the extra rules are removed, then those same guests begin 
> > > > > working? (You can easily remove the checksum rules with:
> > > > > 
> > > > >    nft delete chain ip libvirt_network postroute_mangle
> > > > > 
> > > > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to 
> > > > > work with the 0-checksum rules removed. In this case tcpdump on virbr0 
> > > > > shows "bad cksum", but when I look at tcpdump on the guest, it shows 
> > > > > "udp cksum ok" though, so something else somewhere is setting the 
> > > > > checksum to the correct value.
> > > > 
> > > > FWIW, I just tested an alternative workaround using tc. This works for
> > > > me with a FreeBSD guest and NIC switched to either e1000 or virtio:
> > > > 
> > > > # tc qd add dev vnetbr0 root handle 1: htb
> > > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
> > > > 	u32 match ip sport 67 ffff match ip dport 68 ffff \
> > > > 	action csum ip and udp
> > > 
> > > This feels like it is functionally closest to what we've had historically,
> > > even though it is annoying to have to deal with 'tc' tool, in addition
> > > to 'nft'. So I'm thinking this is probably the way we'll have to go.
> > 
> > Another ugly detail (inherent to 'tc') is that you have to attach a
> > classful qdisc to the interface since otherwise you can't add a filter
> > with attached action. While this may not be a problem in practice, there
> > is this side-effect of setting up a HTB on the bridge which by default
> > runs a "noqueue" qdisc.
> 
> I'm not that familiar with 'tc'.
> 
> Can you explain the functional effect of those 'qdisc' settings on
> virbr0, as if I know nothing :-)

'tc' controls QoS in Linux. 'qdisc's define how congestion should be
handled (basically: queue or drop, prioritization, etc). The default
qdisc for virtual devices like bridge or veth is "noqueue" - it sets the
device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it
accordingly (calls dev_hard_start_xmit() after a few checks to make sure
the device is working).

HTB is a container of classes (for packet classification) which
themselves hold qdiscs. On my system at least, it doesn't come with
default classes and thus should not do much by itself (apart from
running the filter for us which we want. Anything else is overhead.

I'm not sure how much detail you need - "as if I know nothing" is a bit
like naively typing 'find /' and wondering when it will end. ;)
Please shoot if you need more details. For the time being, let me point
at some howto[2] I wrote long ago.

> > > > Another alternative might be to add the nftables rule for virtio-based
> > > > guests only.
> > > 
> > > The firewall rules are in a chain that's applied to all guests,
> > > so we have no where to add a per-guest rule.
> > 
> > With nftables, you may create a chain in netdev family which binds to
> > the specific device(s) needing the hack. It needs maintenance after
> > guest startup and shutdown, though.
> > 
> > BTW: libvirt supports configurations which don't involve a 'vnetbr0'
> > bridge. In this case, you will have to setup tc on the actual tap
> > device, right?
> 
> In those cases, we haven't historically set firewall rules, so
> users were on their own, so in that sense, it isn't a regression
> we need to solve. Also in those cases, the DHCP daemon would be
> off-host, and so packets we're getting back from it ought to
> have a checksum present, as they've been over a  physical link.

OK. From my perspective, attaching the tc qdisc/filter/action to
individual guest devices would still be the cleaner solution. If there's
no mechanism to attach this to, it might be easier to just stick
everything to the bridge, of course.

Cheers, Phil

[1] https://github.com/torvalds/linux/blob/master/net/core/dev.c#L4393
[2] http://linux-ip.net/gl/tc-filters/
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Laine Stump 4 months, 2 weeks ago
On 10/29/24 3:41 PM, Phil Sutter wrote:
> On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote:
>> On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote:
>>> On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
>>>> On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
>>>>> Hi,
>>>>>
>>>>> On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
>>>>>> So when the extra rules are removed, then those same guests begin
>>>>>> working? (You can easily remove the checksum rules with:
>>>>>>
>>>>>>     nft delete chain ip libvirt_network postroute_mangle
>>>>>>
>>>>>> BTW, I just now tried an e1000e NIC on Fedora guest and it continues to
>>>>>> work with the 0-checksum rules removed. In this case tcpdump on virbr0
>>>>>> shows "bad cksum", but when I look at tcpdump on the guest, it shows
>>>>>> "udp cksum ok" though, so something else somewhere is setting the
>>>>>> checksum to the correct value.
>>>>>
>>>>> FWIW, I just tested an alternative workaround using tc. This works for
>>>>> me with a FreeBSD guest and NIC switched to either e1000 or virtio:
>>>>>
>>>>> # tc qd add dev vnetbr0 root handle 1: htb
>>>>> # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
>>>>> 	u32 match ip sport 67 ffff match ip dport 68 ffff \
>>>>> 	action csum ip and udp
>>>>
>>>> This feels like it is functionally closest to what we've had historically,
>>>> even though it is annoying to have to deal with 'tc' tool, in addition
>>>> to 'nft'. So I'm thinking this is probably the way we'll have to go.
>>>
>>> Another ugly detail (inherent to 'tc') is that you have to attach a
>>> classful qdisc to the interface since otherwise you can't add a filter
>>> with attached action. While this may not be a problem in practice, there
>>> is this side-effect of setting up a HTB on the bridge which by default
>>> runs a "noqueue" qdisc.
>>
>> I'm not that familiar with 'tc'.
>>
>> Can you explain the functional effect of those 'qdisc' settings on
>> virbr0, as if I know nothing :-)
> 
> 'tc' controls QoS in Linux. 'qdisc's define how congestion should be
> handled (basically: queue or drop, prioritization, etc). The default
> qdisc for virtual devices like bridge or veth is "noqueue" - it sets the
> device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it
> accordingly (calls dev_hard_start_xmit() after a few checks to make sure
> the device is working).
> 
> HTB is a container of classes (for packet classification) which
> themselves hold qdiscs. On my system at least, it doesn't come with
> default classes and thus should not do much by itself (apart from
> running the filter for us which we want. Anything else is overhead.
> 
> I'm not sure how much detail you need - "as if I know nothing" is a bit
> like naively typing 'find /' and wondering when it will end. ;)
> Please shoot if you need more details. For the time being, let me point
> at some howto[2] I wrote long ago.
> 
>>>>> Another alternative might be to add the nftables rule for virtio-based
>>>>> guests only.
>>>>
>>>> The firewall rules are in a chain that's applied to all guests,
>>>> so we have no where to add a per-guest rule.
>>>
>>> With nftables, you may create a chain in netdev family which binds to
>>> the specific device(s) needing the hack. It needs maintenance after
>>> guest startup and shutdown, though.
>>>
>>> BTW: libvirt supports configurations which don't involve a 'vnetbr0'
>>> bridge. In this case, you will have to setup tc on the actual tap
>>> device, right?
>>
>> In those cases, we haven't historically set firewall rules, so
>> users were on their own, so in that sense, it isn't a regression
>> we need to solve. Also in those cases, the DHCP daemon would be
>> off-host, and so packets we're getting back from it ought to
>> have a checksum present, as they've been over a  physical link.
> 
> OK. From my perspective, attaching the tc qdisc/filter/action to
> individual guest devices would still be the cleaner solution. If there's
> no mechanism to attach this to, it might be easier to just stick
> everything to the bridge, of course.


We do already use tc for bandwidth control, and when a <bandwidth> 
element is present in the interface (or the network) we run some tc 
commands as a port is added to a network to (I *think*) reserve a 
portion of the bridge's bandwidth for the new interface controls, and 
when a port is deleted we again run some tc commands to remove it. 
(mprivozn added all of this and so therefore knows the most about it)

However, the tc commands on the network side (during the CreatePort API) 
I believe are done with only the network's bridge + the MAC address of 
the guest's NIC (and a "class_id" is created and sent back to QEMU and 
is there I guess used for some *other* tc commands to setup bandwidth 
upper limits for the tap after it's created.)

More significantly, the tap device hasn't even been created yet at the 
time QEMU allocates the port from the network driver, so we don't even 
have a "name of future tap device" that we could send in the 
NetworkPortCreate API XML.

So, I guess what all that means is that having the network driver run a 
tap-device-specific tc command won't be that simple. (Maybe we could add 
"virNetworkPortStart|Stop" APIs or something)

> 
> Cheers, Phil
> 
> [1] https://github.com/torvalds/linux/blob/master/net/core/dev.c#L4393
> [2] http://linux-ip.net/gl/tc-filters/
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 11:21:36PM -0400, Laine Stump wrote:
> On 10/29/24 3:41 PM, Phil Sutter wrote:
> > On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote:
> > > On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote:
> > > > On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
> > > > > On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
> > > > > > > So when the extra rules are removed, then those same guests begin
> > > > > > > working? (You can easily remove the checksum rules with:
> > > > > > > 
> > > > > > >     nft delete chain ip libvirt_network postroute_mangle
> > > > > > > 
> > > > > > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to
> > > > > > > work with the 0-checksum rules removed. In this case tcpdump on virbr0
> > > > > > > shows "bad cksum", but when I look at tcpdump on the guest, it shows
> > > > > > > "udp cksum ok" though, so something else somewhere is setting the
> > > > > > > checksum to the correct value.
> > > > > > 
> > > > > > FWIW, I just tested an alternative workaround using tc. This works for
> > > > > > me with a FreeBSD guest and NIC switched to either e1000 or virtio:
> > > > > > 
> > > > > > # tc qd add dev vnetbr0 root handle 1: htb
> > > > > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
> > > > > > 	u32 match ip sport 67 ffff match ip dport 68 ffff \
> > > > > > 	action csum ip and udp
> > > > > 
> > > > > This feels like it is functionally closest to what we've had historically,
> > > > > even though it is annoying to have to deal with 'tc' tool, in addition
> > > > > to 'nft'. So I'm thinking this is probably the way we'll have to go.
> > > > 
> > > > Another ugly detail (inherent to 'tc') is that you have to attach a
> > > > classful qdisc to the interface since otherwise you can't add a filter
> > > > with attached action. While this may not be a problem in practice, there
> > > > is this side-effect of setting up a HTB on the bridge which by default
> > > > runs a "noqueue" qdisc.
> > > 
> > > I'm not that familiar with 'tc'.
> > > 
> > > Can you explain the functional effect of those 'qdisc' settings on
> > > virbr0, as if I know nothing :-)
> > 
> > 'tc' controls QoS in Linux. 'qdisc's define how congestion should be
> > handled (basically: queue or drop, prioritization, etc). The default
> > qdisc for virtual devices like bridge or veth is "noqueue" - it sets the
> > device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it
> > accordingly (calls dev_hard_start_xmit() after a few checks to make sure
> > the device is working).
> > 
> > HTB is a container of classes (for packet classification) which
> > themselves hold qdiscs. On my system at least, it doesn't come with
> > default classes and thus should not do much by itself (apart from
> > running the filter for us which we want. Anything else is overhead.
> > 
> > I'm not sure how much detail you need - "as if I know nothing" is a bit
> > like naively typing 'find /' and wondering when it will end. ;)
> > Please shoot if you need more details. For the time being, let me point
> > at some howto[2] I wrote long ago.
> > 
> > > > > > Another alternative might be to add the nftables rule for virtio-based
> > > > > > guests only.
> > > > > 
> > > > > The firewall rules are in a chain that's applied to all guests,
> > > > > so we have no where to add a per-guest rule.
> > > > 
> > > > With nftables, you may create a chain in netdev family which binds to
> > > > the specific device(s) needing the hack. It needs maintenance after
> > > > guest startup and shutdown, though.
> > > > 
> > > > BTW: libvirt supports configurations which don't involve a 'vnetbr0'
> > > > bridge. In this case, you will have to setup tc on the actual tap
> > > > device, right?
> > > 
> > > In those cases, we haven't historically set firewall rules, so
> > > users were on their own, so in that sense, it isn't a regression
> > > we need to solve. Also in those cases, the DHCP daemon would be
> > > off-host, and so packets we're getting back from it ought to
> > > have a checksum present, as they've been over a  physical link.
> > 
> > OK. From my perspective, attaching the tc qdisc/filter/action to
> > individual guest devices would still be the cleaner solution. If there's
> > no mechanism to attach this to, it might be easier to just stick
> > everything to the bridge, of course.
> 
> 
> We do already use tc for bandwidth control, and when a <bandwidth> element
> is present in the interface (or the network) we run some tc commands as a
> port is added to a network to (I *think*) reserve a portion of the bridge's
> bandwidth for the new interface controls, and when a port is deleted we
> again run some tc commands to remove it. (mprivozn added all of this and so
> therefore knows the most about it)
> 
> However, the tc commands on the network side (during the CreatePort API) I
> believe are done with only the network's bridge + the MAC address of the
> guest's NIC (and a "class_id" is created and sent back to QEMU and is there
> I guess used for some *other* tc commands to setup bandwidth upper limits
> for the tap after it's created.)
> 
> More significantly, the tap device hasn't even been created yet at the time
> QEMU allocates the port from the network driver, so we don't even have a
> "name of future tap device" that we could send in the NetworkPortCreate API
> XML.
> 
> So, I guess what all that means is that having the network driver run a
> tap-device-specific tc command won't be that simple. (Maybe we could add
> "virNetworkPortStart|Stop" APIs or something)

I would expect 'tc' rules to be set against virbr0, not the individual
NICs.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Laine Stump 4 months, 2 weeks ago
On 10/30/24 4:43 AM, Daniel P. Berrangé wrote:
> On Tue, Oct 29, 2024 at 11:21:36PM -0400, Laine Stump wrote:
>> On 10/29/24 3:41 PM, Phil Sutter wrote:
>>> On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote:
>>>> On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote:
>>>>> On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
>>>>>> On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
>>>>>>>> So when the extra rules are removed, then those same guests begin
>>>>>>>> working? (You can easily remove the checksum rules with:
>>>>>>>>
>>>>>>>>      nft delete chain ip libvirt_network postroute_mangle
>>>>>>>>
>>>>>>>> BTW, I just now tried an e1000e NIC on Fedora guest and it continues to
>>>>>>>> work with the 0-checksum rules removed. In this case tcpdump on virbr0
>>>>>>>> shows "bad cksum", but when I look at tcpdump on the guest, it shows
>>>>>>>> "udp cksum ok" though, so something else somewhere is setting the
>>>>>>>> checksum to the correct value.
>>>>>>>
>>>>>>> FWIW, I just tested an alternative workaround using tc. This works for
>>>>>>> me with a FreeBSD guest and NIC switched to either e1000 or virtio:
>>>>>>>
>>>>>>> # tc qd add dev vnetbr0 root handle 1: htb
>>>>>>> # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
>>>>>>> 	u32 match ip sport 67 ffff match ip dport 68 ffff \
>>>>>>> 	action csum ip and udp
>>>>>>
>>>>>> This feels like it is functionally closest to what we've had historically,
>>>>>> even though it is annoying to have to deal with 'tc' tool, in addition
>>>>>> to 'nft'. So I'm thinking this is probably the way we'll have to go.
>>>>>
>>>>> Another ugly detail (inherent to 'tc') is that you have to attach a
>>>>> classful qdisc to the interface since otherwise you can't add a filter
>>>>> with attached action. While this may not be a problem in practice, there
>>>>> is this side-effect of setting up a HTB on the bridge which by default
>>>>> runs a "noqueue" qdisc.
>>>>
>>>> I'm not that familiar with 'tc'.
>>>>
>>>> Can you explain the functional effect of those 'qdisc' settings on
>>>> virbr0, as if I know nothing :-)
>>>
>>> 'tc' controls QoS in Linux. 'qdisc's define how congestion should be
>>> handled (basically: queue or drop, prioritization, etc). The default
>>> qdisc for virtual devices like bridge or veth is "noqueue" - it sets the
>>> device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it
>>> accordingly (calls dev_hard_start_xmit() after a few checks to make sure
>>> the device is working).
>>>
>>> HTB is a container of classes (for packet classification) which
>>> themselves hold qdiscs. On my system at least, it doesn't come with
>>> default classes and thus should not do much by itself (apart from
>>> running the filter for us which we want. Anything else is overhead.
>>>
>>> I'm not sure how much detail you need - "as if I know nothing" is a bit
>>> like naively typing 'find /' and wondering when it will end. ;)
>>> Please shoot if you need more details. For the time being, let me point
>>> at some howto[2] I wrote long ago.
>>>
>>>>>>> Another alternative might be to add the nftables rule for virtio-based
>>>>>>> guests only.
>>>>>>
>>>>>> The firewall rules are in a chain that's applied to all guests,
>>>>>> so we have no where to add a per-guest rule.
>>>>>
>>>>> With nftables, you may create a chain in netdev family which binds to
>>>>> the specific device(s) needing the hack. It needs maintenance after
>>>>> guest startup and shutdown, though.
>>>>>
>>>>> BTW: libvirt supports configurations which don't involve a 'vnetbr0'
>>>>> bridge. In this case, you will have to setup tc on the actual tap
>>>>> device, right?
>>>>
>>>> In those cases, we haven't historically set firewall rules, so
>>>> users were on their own, so in that sense, it isn't a regression
>>>> we need to solve. Also in those cases, the DHCP daemon would be
>>>> off-host, and so packets we're getting back from it ought to
>>>> have a checksum present, as they've been over a  physical link.
>>>
>>> OK. From my perspective, attaching the tc qdisc/filter/action to
>>> individual guest devices would still be the cleaner solution. If there's
>>> no mechanism to attach this to, it might be easier to just stick
>>> everything to the bridge, of course.
>>
>>
>> We do already use tc for bandwidth control, and when a <bandwidth> element
>> is present in the interface (or the network) we run some tc commands as a
>> port is added to a network to (I *think*) reserve a portion of the bridge's
>> bandwidth for the new interface controls, and when a port is deleted we
>> again run some tc commands to remove it. (mprivozn added all of this and so
>> therefore knows the most about it)
>>
>> However, the tc commands on the network side (during the CreatePort API) I
>> believe are done with only the network's bridge + the MAC address of the
>> guest's NIC (and a "class_id" is created and sent back to QEMU and is there
>> I guess used for some *other* tc commands to setup bandwidth upper limits
>> for the tap after it's created.)
>>
>> More significantly, the tap device hasn't even been created yet at the time
>> QEMU allocates the port from the network driver, so we don't even have a
>> "name of future tap device" that we could send in the NetworkPortCreate API
>> XML.
>>
>> So, I guess what all that means is that having the network driver run a
>> tap-device-specific tc command won't be that simple. (Maybe we could add
>> "virNetworkPortStart|Stop" APIs or something)
> 
> I would expect 'tc' rules to be set against virbr0, not the individual
> NICs.

You may be right; I haven't looked at the qemu side to be honest. When I 
went in to gather enough info to respond last night I was only 
interested in figuring out how easy/difficult it would be to have the 
network driver issue tc commands with the tap device rather than the 
bridge *since Phil suggested that would be better/less bad/something 
like that. So I mainly looked at what info the network driver currently 
uses for its tc commands, and whether or not the tap device is created 
before or after the NetworkPortCreate (thought I recalled "after", and 
this is correct - makes sense, since we can't know until after 
virNetworkPortCreate whether or not the interface will even be using a 
tap device).

(Side note: for unrelated reasons it could be useful for the network 
driver to create the tap device actually - that could allow unprivileged 
virtqemud to directly/legitimately use the system virtnetworkd networks. 
Probably ot a simple task though.)
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 2 weeks ago
On Wed, Oct 30, 2024 at 07:41:02AM -0400, Laine Stump wrote:
> On 10/30/24 4:43 AM, Daniel P. Berrangé wrote:
> > On Tue, Oct 29, 2024 at 11:21:36PM -0400, Laine Stump wrote:
> > > 
> > > We do already use tc for bandwidth control, and when a <bandwidth> element
> > > is present in the interface (or the network) we run some tc commands as a
> > > port is added to a network to (I *think*) reserve a portion of the bridge's
> > > bandwidth for the new interface controls, and when a port is deleted we
> > > again run some tc commands to remove it. (mprivozn added all of this and so
> > > therefore knows the most about it)
> > > 
> > > However, the tc commands on the network side (during the CreatePort API) I
> > > believe are done with only the network's bridge + the MAC address of the
> > > guest's NIC (and a "class_id" is created and sent back to QEMU and is there
> > > I guess used for some *other* tc commands to setup bandwidth upper limits
> > > for the tap after it's created.)
> > > 
> > > More significantly, the tap device hasn't even been created yet at the time
> > > QEMU allocates the port from the network driver, so we don't even have a
> > > "name of future tap device" that we could send in the NetworkPortCreate API
> > > XML.
> > > 
> > > So, I guess what all that means is that having the network driver run a
> > > tap-device-specific tc command won't be that simple. (Maybe we could add
> > > "virNetworkPortStart|Stop" APIs or something)
> > 
> > I would expect 'tc' rules to be set against virbr0, not the individual
> > NICs.
> 
> You may be right; I haven't looked at the qemu side to be honest. When I
> went in to gather enough info to respond last night I was only interested in
> figuring out how easy/difficult it would be to have the network driver issue
> tc commands with the tap device rather than the bridge *since Phil suggested
> that would be better/less bad/something like that. So I mainly looked at
> what info the network driver currently uses for its tc commands, and whether
> or not the tap device is created before or after the NetworkPortCreate
> (thought I recalled "after", and this is correct - makes sense, since we
> can't know until after virNetworkPortCreate whether or not the interface
> will even be using a tap device).

I wasn't thinking about the complexity, but rather the redundancy.
We need this rule set for every NIC, as it is way too difficult
for us to accurately predict exactly which scenarios will hit the
DHCP checksumm problem. With that goal, then it is pointless to
inject the same logic on every tap device, when we can inject it
on the bridge device once and be done with it.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Laine Stump 4 months, 2 weeks ago
On 10/29/24 11:21 PM, Laine Stump wrote:
> On 10/29/24 3:41 PM, Phil Sutter wrote:
>> On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote:
>>> On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote:
>>>> On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
>>>>> On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
>>>>>>> So when the extra rules are removed, then those same guests begin
>>>>>>> working? (You can easily remove the checksum rules with:
>>>>>>>
>>>>>>>     nft delete chain ip libvirt_network postroute_mangle
>>>>>>>
>>>>>>> BTW, I just now tried an e1000e NIC on Fedora guest and it 
>>>>>>> continues to
>>>>>>> work with the 0-checksum rules removed. In this case tcpdump on 
>>>>>>> virbr0
>>>>>>> shows "bad cksum", but when I look at tcpdump on the guest, it shows
>>>>>>> "udp cksum ok" though, so something else somewhere is setting the
>>>>>>> checksum to the correct value.
>>>>>>
>>>>>> FWIW, I just tested an alternative workaround using tc. This works 
>>>>>> for
>>>>>> me with a FreeBSD guest and NIC switched to either e1000 or virtio:
>>>>>>
>>>>>> # tc qd add dev vnetbr0 root handle 1: htb
>>>>>> # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
>>>>>>     u32 match ip sport 67 ffff match ip dport 68 ffff \
>>>>>>     action csum ip and udp
>>>>>
>>>>> This feels like it is functionally closest to what we've had 
>>>>> historically,
>>>>> even though it is annoying to have to deal with 'tc' tool, in addition
>>>>> to 'nft'. So I'm thinking this is probably the way we'll have to go.
>>>>
>>>> Another ugly detail (inherent to 'tc') is that you have to attach a
>>>> classful qdisc to the interface since otherwise you can't add a filter
>>>> with attached action. While this may not be a problem in practice, 
>>>> there
>>>> is this side-effect of setting up a HTB on the bridge which by default
>>>> runs a "noqueue" qdisc.
>>>
>>> I'm not that familiar with 'tc'.
>>>
>>> Can you explain the functional effect of those 'qdisc' settings on
>>> virbr0, as if I know nothing :-)
>>
>> 'tc' controls QoS in Linux. 'qdisc's define how congestion should be
>> handled (basically: queue or drop, prioritization, etc). The default
>> qdisc for virtual devices like bridge or veth is "noqueue" - it sets the
>> device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it
>> accordingly (calls dev_hard_start_xmit() after a few checks to make sure
>> the device is working).
>>
>> HTB is a container of classes (for packet classification) which
>> themselves hold qdiscs. On my system at least, it doesn't come with
>> default classes and thus should not do much by itself (apart from
>> running the filter for us which we want. Anything else is overhead.
>>
>> I'm not sure how much detail you need - "as if I know nothing" is a bit
>> like naively typing 'find /' and wondering when it will end. ;)
>> Please shoot if you need more details. For the time being, let me point
>> at some howto[2] I wrote long ago.
>>
>>>>>> Another alternative might be to add the nftables rule for virtio- 
>>>>>> based
>>>>>> guests only.
>>>>>
>>>>> The firewall rules are in a chain that's applied to all guests,
>>>>> so we have no where to add a per-guest rule.
>>>>
>>>> With nftables, you may create a chain in netdev family which binds to
>>>> the specific device(s) needing the hack. It needs maintenance after
>>>> guest startup and shutdown, though.
>>>>
>>>> BTW: libvirt supports configurations which don't involve a 'vnetbr0'
>>>> bridge. In this case, you will have to setup tc on the actual tap
>>>> device, right?
>>>
>>> In those cases, we haven't historically set firewall rules, so
>>> users were on their own, so in that sense, it isn't a regression
>>> we need to solve. Also in those cases, the DHCP daemon would be
>>> off-host, and so packets we're getting back from it ought to
>>> have a checksum present, as they've been over a  physical link.
>>
>> OK. From my perspective, attaching the tc qdisc/filter/action to
>> individual guest devices would still be the cleaner solution. If there's
>> no mechanism to attach this to, it might be easier to just stick
>> everything to the bridge, of course.
> 
> 
> We do already use tc for bandwidth control, and when a <bandwidth> 
> element is present in the interface (or the network) we run some tc 
> commands as a port is added to a network to (I *think*) reserve a 
> portion of the bridge's bandwidth for the new interface controls, and 
> when a port is deleted we again run some tc commands to remove it. 
> (mprivozn added all of this and so therefore knows the most about it)
> 
> However, the tc commands on the network side (during the CreatePort API) 
> I believe are done with only the network's bridge + the MAC address of 
> the guest's NIC (and a "class_id" is created and sent back to QEMU and 
> is there I guess used for some *other* tc commands to setup bandwidth 
> upper limits for the tap after it's created.)
> 
> More significantly, the tap device hasn't even been created yet at the 
> time QEMU allocates the port from the network driver, so we don't even 
> have a "name of future tap device" that we could send in the 
> NetworkPortCreate API XML.
> 
> So, I guess what all that means is that having the network driver run a 
> tap-device-specific tc command won't be that simple. (Maybe we could add 
> "virNetworkPortStart|Stop" APIs or something)

Of course if we were going to do that, we might as well instead add a 
tap-specific nftables rule at that same time (but only for virtio-net 
(and e1000e?) devices...
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Andrea Bolognani 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote:
> A key difference that is probably relevant is that netbsd is
> using an e1000 NIC in QEMU, while openbsd is using a virtio-net
> NIC. At least when created by virt-manager.
>
> AFAIR, QEMU's magic checksum offload only happens for virtio-net,
> so presumably our rules are incompatible with non-virtio-net NICs
> in someway.

Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000,
since virtio drivers are not available there; moreover, if I switch a
random Linux guest from virtio-net to e1000 I can reproduce the issue
there as well.

-- 
Andrea Bolognani / Red Hat / Virtualization
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 01:09:00PM +0000, Andrea Bolognani wrote:
> On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote:
> > A key difference that is probably relevant is that netbsd is
> > using an e1000 NIC in QEMU, while openbsd is using a virtio-net
> > NIC. At least when created by virt-manager.
> >
> > AFAIR, QEMU's magic checksum offload only happens for virtio-net,
> > so presumably our rules are incompatible with non-virtio-net NICs
> > in someway.
> 
> Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000,
> since virtio drivers are not available there; moreover, if I switch a
> random Linux guest from virtio-net to e1000 I can reproduce the issue
> there as well.

Incidentally, I think this has crossed the threshold where the cure is
worse than the disease.

We cannot ship the forthcoming libvirt release with a checksum "fix"
that breaks all usage of NICs that aren't virtio-net, as that guarantees
brokeness for all historical OS.

If we can't quickly find a way to improve this, I think we need to
revert (or disable) the checksum zero'ing fix for this release and
spend more time investigating it.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Laine Stump 4 months, 2 weeks ago
On 10/29/24 9:14 AM, Daniel P. Berrangé wrote:
> On Tue, Oct 29, 2024 at 01:09:00PM +0000, Andrea Bolognani wrote:
>> On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote:
>>> A key difference that is probably relevant is that netbsd is
>>> using an e1000 NIC in QEMU, while openbsd is using a virtio-net
>>> NIC. At least when created by virt-manager.
>>>
>>> AFAIR, QEMU's magic checksum offload only happens for virtio-net,
>>> so presumably our rules are incompatible with non-virtio-net NICs
>>> in someway.
>>
>> Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000,
>> since virtio drivers are not available there; moreover, if I switch a
>> random Linux guest from virtio-net to e1000 I can reproduce the issue
>> there as well.
> 
> Incidentally, I think this has crossed the threshold where the cure is
> worse than the disease.
> 
> We cannot ship the forthcoming libvirt release with a checksum "fix"
> that breaks all usage of NICs that aren't virtio-net, as that guarantees
> brokeness for all historical OS.
> 
> If we can't quickly find a way to improve this, I think we need to
> revert (or disable) the checksum zero'ing fix for this release and
> spend more time investigating it.

I sadly agree :-/ (although I will point out that it's not *all* 
non-virtio, since e1000e seems to work). I am leaving the house now, but 
will make a patch to either disable or the revert the "fix" when I get 
back in a few hours. (I would rather leave it in with a switch or 
something so we could continue using standard builds to test things, but 
that might take too long and would definitely be too complicated for a 
last instant push)

> 
> With regards,
> Daniel
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Laine Stump 4 months, 2 weeks ago
On 10/29/24 10:51 AM, Laine Stump wrote:
> On 10/29/24 9:14 AM, Daniel P. Berrangé wrote:
>> On Tue, Oct 29, 2024 at 01:09:00PM +0000, Andrea Bolognani wrote:
>>> On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote:
>>>> A key difference that is probably relevant is that netbsd is
>>>> using an e1000 NIC in QEMU, while openbsd is using a virtio-net
>>>> NIC. At least when created by virt-manager.
>>>>
>>>> AFAIR, QEMU's magic checksum offload only happens for virtio-net,
>>>> so presumably our rules are incompatible with non-virtio-net NICs
>>>> in someway.
>>>
>>> Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000,
>>> since virtio drivers are not available there; moreover, if I switch a
>>> random Linux guest from virtio-net to e1000 I can reproduce the issue
>>> there as well.
>>
>> Incidentally, I think this has crossed the threshold where the cure is
>> worse than the disease.
>>
>> We cannot ship the forthcoming libvirt release with a checksum "fix"
>> that breaks all usage of NICs that aren't virtio-net, as that guarantees
>> brokeness for all historical OS.
>>
>> If we can't quickly find a way to improve this, I think we need to
>> revert (or disable) the checksum zero'ing fix for this release and
>> spend more time investigating it.
> 
> I sadly agree :-/ (although I will point out that it's not *all* non- 
> virtio, since e1000e seems to work). I am leaving the house now, but 
> will make a patch to either disable or the revert the "fix" when I get 
> back in a few hours. (I would rather leave it in with a switch or 
> something so we could continue using standard builds to test things, but 
> that might take too long and would definitely be too complicated for a 
> last instant push)

I just sent a revert patch to the list in case someone wants to ACK it. 
It was 100% done with "git revert", so no chance for typos.
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 10:51:57AM -0400, Laine Stump wrote:
> On 10/29/24 9:14 AM, Daniel P. Berrangé wrote:
> > On Tue, Oct 29, 2024 at 01:09:00PM +0000, Andrea Bolognani wrote:
> > > On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote:
> > > > A key difference that is probably relevant is that netbsd is
> > > > using an e1000 NIC in QEMU, while openbsd is using a virtio-net
> > > > NIC. At least when created by virt-manager.
> > > > 
> > > > AFAIR, QEMU's magic checksum offload only happens for virtio-net,
> > > > so presumably our rules are incompatible with non-virtio-net NICs
> > > > in someway.
> > > 
> > > Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000,
> > > since virtio drivers are not available there; moreover, if I switch a
> > > random Linux guest from virtio-net to e1000 I can reproduce the issue
> > > there as well.
> > 
> > Incidentally, I think this has crossed the threshold where the cure is
> > worse than the disease.
> > 
> > We cannot ship the forthcoming libvirt release with a checksum "fix"
> > that breaks all usage of NICs that aren't virtio-net, as that guarantees
> > brokeness for all historical OS.
> > 
> > If we can't quickly find a way to improve this, I think we need to
> > revert (or disable) the checksum zero'ing fix for this release and
> > spend more time investigating it.
> 
> I sadly agree :-/ (although I will point out that it's not *all* non-virtio,
> since e1000e seems to work). I am leaving the house now, but will make a
> patch to either disable or the revert the "fix" when I get back in a few
> hours. (I would rather leave it in with a switch or something so we could
> continue using standard builds to test things, but that might take too long
> and would definitely be too complicated for a last instant push)

e1000e has support for vnet-hdr similar to virtio-net.

With virtio-net, if I disable vhost-net, with <driver type=qemu>
then code in virtio-net.c will manually fixup DHCP checksums.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Andrea Bolognani 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 10:51:57AM -0400, Laine Stump wrote:
> On 10/29/24 9:14 AM, Daniel P. Berrangé wrote:
> > We cannot ship the forthcoming libvirt release with a checksum "fix"
> > that breaks all usage of NICs that aren't virtio-net, as that guarantees
> > brokeness for all historical OS.
> >
> > If we can't quickly find a way to improve this, I think we need to
> > revert (or disable) the checksum zero'ing fix for this release and
> > spend more time investigating it.
>
> I sadly agree :-/ (although I will point out that it's not *all* non-virtio,
> since e1000e seems to work). I am leaving the house now, but will make a
> patch to either disable or the revert the "fix" when I get back in a few
> hours. (I would rather leave it in with a switch or something so we could
> continue using standard builds to test things, but that might take too long
> and would definitely be too complicated for a last instant push)

I vote against adding yet another knob. If we can't fix things in
time, let's just revert the change and work on a fully working
solution over the next development cycle.

-- 
Andrea Bolognani / Red Hat / Virtualization
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Andrea Bolognani 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 01:14:28PM +0000, Daniel P. Berrangé wrote:
> On Tue, Oct 29, 2024 at 01:09:00PM +0000, Andrea Bolognani wrote:
> > On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote:
> > > A key difference that is probably relevant is that netbsd is
> > > using an e1000 NIC in QEMU, while openbsd is using a virtio-net
> > > NIC. At least when created by virt-manager.
> > >
> > > AFAIR, QEMU's magic checksum offload only happens for virtio-net,
> > > so presumably our rules are incompatible with non-virtio-net NICs
> > > in someway.
> >
> > Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000,
> > since virtio drivers are not available there; moreover, if I switch a
> > random Linux guest from virtio-net to e1000 I can reproduce the issue
> > there as well.
>
> Incidentally, I think this has crossed the threshold where the cure is
> worse than the disease.
>
> We cannot ship the forthcoming libvirt release with a checksum "fix"
> that breaks all usage of NICs that aren't virtio-net, as that guarantees
> brokeness for all historical OS.
>
> If we can't quickly find a way to improve this, I think we need to
> revert (or disable) the checksum zero'ing fix for this release and
> spend more time investigating it.

Agreed.

Going back to the drawing board in a sense, have we figured out why
FreeBSD's DHCP client is unhappy with the previous arrangement? It
seems to me that fixing it might be the most sensible course of
action.

-- 
Andrea Bolognani / Red Hat / Virtualization
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Daniel P. Berrangé 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote:
> On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote:
> > On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote:
> > > I did some testing of my own and I can confirm that FreeBSD and
> > > OpenBSD are fine with this change, as are various Linux flavors
> > > (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu).
> > >
> > > However, a few other operating systems aren't: namely GNU/Hurd, Haiku
> > > and NetBSD break with this change. Interestingly, these were all fine
> > > with the nftables backend before it.
> >
> > Well that's odd. I've checked NetBSD source code and found no less
> > than 3 DHCP client impls, and all of them cope with checksum == 0.
> >
> > https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497
> >
> > https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507
> >
> > https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373
> >
> > the middle impl also directly copes with partial checksums
> 
> The boot log contains
> 
>   Starting dhcpcd.
>   wm0: checksum failure from 192.168.124.1
> 
> so I guess the second implementation is the relevant one.

Yes, we can see that message here:

https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3609

	if (!checksums_valid(data, &from, bpf_flags)) {
		logerrx("%s: checksum failure from %s",
		    ifp->name, inet_ntoa(from));
		return;
	}

and the 'checksums_valid' method is the one I point to in the
2nd URL, which has the code to check whether the incoming checksum
is zeroed:

	if (udp.uh_sum == 0)
		return true;

so I'm really surprised you see a failure with this. It is as if
you don't have Laine's fix present at all.

> 
> > Not identified the Hurd/Haiku DHCP client code yet...
> 
> I'm using Debian GNU/Hurd, so the DHCP client is the same as regular
> Debian (ISC DHCP). The source can be found at
> 
>   https://deb.debian.org/debian-ports/pool-hurd-i386/main/i/isc-dhcp/
> 
> The version is a bit old and there's the tiniest amount of patching
> compared to the Linux build, specifically:
> 
>   --- isc-dhcp-4.4.3-P1-1.1/debian/patches/bind-fix 1970-01-01
> 01:00:00.000000000 +0100
>   +++ isc-dhcp-4.4.3-P1-1.1+hurd.1/debian/patches/bind-fix
> 2023-02-15 15:39:49.000000000 +0100
>   @@ -0,0 +1,26 @@
>   +Index: isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c
>   +===================================================================
>   +--- isc-dhcp-4.4.3-P1-build.orig/bind/bind-9.11.36/lib/isc/unix/socket.c
>   ++++ isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c
>   +@@ -2633,7 +2633,7 @@ opensocket(isc__socketmgr_t *manager, is
>   +       char strbuf[ISC_STRERRORSIZE];
>   +       const char *err = "socket";
>   +       int tries = 0;
>   +-#if defined(USE_CMSG) || defined(SO_BSDCOMPAT) || defined(SO_NOSIGPIPE)
>   ++#if 1
>   +       int on = 1;
>   + #endif
>   + #if defined(SO_RCVBUF)
> 
> I'm not sure whether this could be relevant to the issue at hand.

That impl has the explicit check for all-zeros checksum.

> To clarify, this is something that needs to be handled at the
> userspace level, no kernel changes required? And clearly it affects
> DHCP, but what about other protocols? Are we confident those will
> cope just fine?

It would affect *any* application which is reading raw packets
and manually verifying the IP checksum. DHCP is the common
case, but there could be others.

Historically our iptables rule only ever fixed up DHCP packets
and we've not seen other complaints. So if something else is
affected in the real world, it is sufficiently rare that the
few people affected have not noticed and/or cared enough to
escalate it.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses
Posted by Andrea Bolognani 4 months, 2 weeks ago
On Tue, Oct 29, 2024 at 11:21:44AM +0000, Daniel P. Berrangé wrote:
> On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote:
> > I'm using Debian GNU/Hurd, so the DHCP client is the same as regular
> > Debian (ISC DHCP). The source can be found at
> >
> >   https://deb.debian.org/debian-ports/pool-hurd-i386/main/i/isc-dhcp/
> >
> > The version is a bit old and there's the tiniest amount of patching
> > compared to the Linux build, specifically:
> >
> >   --- isc-dhcp-4.4.3-P1-1.1/debian/patches/bind-fix 1970-01-01
> > 01:00:00.000000000 +0100
> >   +++ isc-dhcp-4.4.3-P1-1.1+hurd.1/debian/patches/bind-fix
> > 2023-02-15 15:39:49.000000000 +0100
> >   @@ -0,0 +1,26 @@
> >   +Index: isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c
> >   +===================================================================
> >   +--- isc-dhcp-4.4.3-P1-build.orig/bind/bind-9.11.36/lib/isc/unix/socket.c
> >   ++++ isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c
> >   +@@ -2633,7 +2633,7 @@ opensocket(isc__socketmgr_t *manager, is
> >   +       char strbuf[ISC_STRERRORSIZE];
> >   +       const char *err = "socket";
> >   +       int tries = 0;
> >   +-#if defined(USE_CMSG) || defined(SO_BSDCOMPAT) || defined(SO_NOSIGPIPE)
> >   ++#if 1
> >   +       int on = 1;
> >   + #endif
> >   + #if defined(SO_RCVBUF)
> >
> > I'm not sure whether this could be relevant to the issue at hand.
>
> That impl has the explicit check for all-zeros checksum.

So you think it's affected by the same unexplicable behavior as
NetBSD? If you tell me how to run tcpdump the right way, I can do
that for you without you having to set up a GNU/Hurd guest.

> > To clarify, this is something that needs to be handled at the
> > userspace level, no kernel changes required? And clearly it affects
> > DHCP, but what about other protocols? Are we confident those will
> > cope just fine?
>
> It would affect *any* application which is reading raw packets
> and manually verifying the IP checksum. DHCP is the common
> case, but there could be others.
>
> Historically our iptables rule only ever fixed up DHCP packets
> and we've not seen other complaints. So if something else is
> affected in the real world, it is sufficiently rare that the
> few people affected have not noticed and/or cared enough to
> escalate it.

That sounds reassuring :)

-- 
Andrea Bolognani / Red Hat / Virtualization