src/network/network_nftables.c | 69 +++++++++++++++++++ tests/networkxml2firewalldata/base.nftables | 14 ++++ .../forward-dev-linux.nftables | 16 +++++ .../isolated-linux.nftables | 16 +++++ .../nat-default-linux.nftables | 16 +++++ .../nat-ipv6-linux.nftables | 16 +++++ .../nat-ipv6-masquerade-linux.nftables | 16 +++++ .../nat-many-ips-linux.nftables | 16 +++++ .../nat-port-range-ipv6-linux.nftables | 16 +++++ .../nat-port-range-linux.nftables | 16 +++++ .../nat-tftp-linux.nftables | 16 +++++ .../route-default-linux.nftables | 16 +++++ 12 files changed, 243 insertions(+)
Many long years ago (April 2010), soon after "vhost" in-kernel packet
processing was added to the virtio-net driver, people running RHEL5
virtual machines with a virtio-net interface connected via a libvirt
virtual network noticed that when vhost packet processing was enabled,
their VMs could no longer get an IP address via DHCP - the guest was
ignoring the DHCP response packets sent by the host.
(I've been informed by danpb that the same issue had been encountered,
and "fixed" even earlier than that, in 2006, with Xen as the
hypervisor.)
The "gory details" of the 2010 discussion are chronicled here:
https://lists.isc.org/pipermail/dhcp-hackers/2010-April/001835.html
but basically it was because the checksum of packets wasn't being
fully computed on the host side (because QEMU on the host and the NIC
driver in the guest had agreed between themselves to turn off
checksums because they were unnecessary due to the "link" between the
two being entirely in local memory and not a physical cable), but
1) a partial checksum had been put into the packets at some point by
"someone"
2) the "don't use checksums" info was known by the guest kernel, which
would properly ignore the "bad" checksum), and
3) the packets were being read by the dhclient application on the
guest side with a "raw" socket (thus bypassing the guest kernel UDP
processing that would have known the checksum was irrelevant)),
The "fix" for this ended up being two-tiered:
1) The ISC DHCP package (which contains the aforementioned dhclient
program) made a fix to their dhclient code which caused it to accept
packets anyway even if they didn't have a proper checksum (NB: that's
not a full explanation, and possibly not accurate). This fixed the
problem for guests with an updated dhclient. Here is the code with the
fix to ISC DHCP:
https://github.com/isc-projects/dhcp/blob/master/common/packet.c#L365
This fixed the issue for any new/updated guests that had the fixed
dhclient, but it didn't solve the problem for existing/old guest
images that didn't/couldn't get their dhclient updated. This brings us
to:
2) iptables added a new "CHECKSUM" target and "--checksum-fill"
action:
http://patchwork.ozlabs.org/patch/58525/
and libvirt added an iptables rule for each virtual network to match
DHCP response packets and perform --checksum-fill. This way by the
time dhclient on the guest reads the raw packet, the checksum would be
corrected, and the packet would be accepted. This was pushed upstream
in libvirt commit v0.8.2-142-gfd5b15ff1a.
The word at the time from those more knowledgeable than me was that
the bad checksum problem was really specific to ISC's dhclient running
on Linux, and so once their fix was in use everywhere dhclient was
used, bad checksums would be a thing of the past and the
--checksum-fill iptables rules would no longer be needed (but would
otherwise be harmless if they were still there). (Plot twist: the
dhclient code in fix (1) above apparently is on a Linux-only code path
- this is very important later!)
Based on this information (and also due to the opinion that fixing it
by having iptables modify the packet checksum was really the wrong way
to permanently fix things, i.e. an "ugly hack"), the nftables
developers made the decision to not implement an equivalent to
--checksum-fill in nftables. As a result, when I wrote the nftables
firewall backend for libvirt virtual networks earlier this year, it
didn't add in any rule to "fix" broken UDP checksums (since there was
apparently no equivalent in nftables and, after all, that was fixed
somewhere else 14 years ago, right???)
But last week, when Rich Jones was doing routine testing using a Fedora
40 host (the first Fedora release to use the nftables backend of libvirt's
network driver by default) and a FreeBSD guest, for "some strange
reason", the FreeBSD guest was unable to get an IP address from DHCP!!
https://www.spinics.net/linux/fedora/libvirt-users/msg14356.html
A few quick tests proved that it was the same old "bad checksum"
problem from 2010 come back to haunt us - it wasn't a Linux-only issue
after all.
Phil Sutter and Eric Garver (nftables people) pointed out that, while
nftables doesn't have an action that will *compute* the checksum of a
packet, it *does* have an action that will set the checksum to 0, and
suggested we try adding a "zero the checksum" rule for dhcp response
packets to our nftables ruleset. (Why? Because a checksum value of 0
in a IPv4 UDP packet is defined by RFC768 to mean "no checksum
generated", implying "checksum not needed"). It turns out that this
works - dhclient properly recognizes that a 0 checksum means "don't
bother with the checksum", and accepts the packet as valid.
So to once again fix this timeless bug, this patch adds such a
checksum zeroing rule to the nftables rules setup for each virtual
network.
This has been verified (on a Fedora 40 host) to fix DHCP with FreeBSD
guests, while not breaking it for Fedora or Windows (10) guests.
Fixes: b89c4991daa0ee9371f10937fab3b03c5ffdabc6
Reported-by: Rich Jones <rjones@redhat.com>
Fix-Suggested-by: Eric Garver <egarver@redhat.com>
Fix-Suggested-by: Phil Sutter <psutter@redhat.com>
Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
(NB: jdenemar reminded me this is a bugfix, so we don't need to rush
to get it pushed before he freezes for the RC1 snapshot. I'm happy
with it as is though, if anyone is awake early enough and wants to
push it before he snapshots RC1. Otherwise I'll just push it later and
it will be in RC2)
Changes from V1:
* informational errors/omissions in commit log message fixed
src/network/network_nftables.c | 69 +++++++++++++++++++
tests/networkxml2firewalldata/base.nftables | 14 ++++
.../forward-dev-linux.nftables | 16 +++++
.../isolated-linux.nftables | 16 +++++
.../nat-default-linux.nftables | 16 +++++
.../nat-ipv6-linux.nftables | 16 +++++
.../nat-ipv6-masquerade-linux.nftables | 16 +++++
.../nat-many-ips-linux.nftables | 16 +++++
.../nat-port-range-ipv6-linux.nftables | 16 +++++
.../nat-port-range-linux.nftables | 16 +++++
.../nat-tftp-linux.nftables | 16 +++++
.../route-default-linux.nftables | 16 +++++
12 files changed, 243 insertions(+)
diff --git a/src/network/network_nftables.c b/src/network/network_nftables.c
index f8b5ab665d..5523207269 100644
--- a/src/network/network_nftables.c
+++ b/src/network/network_nftables.c
@@ -51,6 +51,7 @@ VIR_LOG_INIT("network.nftables");
#define VIR_NFTABLES_FWD_OUT_CHAIN "guest_output"
#define VIR_NFTABLES_FWD_X_CHAIN "guest_cross"
#define VIR_NFTABLES_NAT_POSTROUTE_CHAIN "guest_nat"
+#define VIR_NFTABLES_MANGLE_POSTROUTE_CHAIN "postroute_mangle"
/* we must avoid using the standard "filter" table as used by
* iptables, as any subsequent attempts to use iptables commands will
@@ -106,6 +107,10 @@ nftablesGlobalChain nftablesChains[] = {
/* chains for NAT rules */
{NULL, VIR_NFTABLES_NAT_POSTROUTE_CHAIN, "{ type nat hook postrouting priority 100; policy accept; }"},
+
+ /* chain for "mangle" rules that modify packets (e.g. 0 out UDP checksums) */
+ {NULL, VIR_NFTABLES_MANGLE_POSTROUTE_CHAIN, "{ type filter hook postrouting priority 0; policy accept; }"},
+
};
@@ -644,6 +649,44 @@ nftablesAddDontMasquerade(virFirewall *fw,
}
+/**
+ * nftablesAddOutputFixUdpChecksum:
+ *
+ * Add a rule to @fw that will 0 out the checksum of udp packets
+ * output from @iface with destination port @port.
+
+ * Zeroing the checksum of a UDP packet tells the receiving end "you
+ * don't need to validate the checksum", which is useful in cases
+ * where the host (sender) thinks that packet checksums will be
+ * computed elsewhere (and so leaves a partially computed checksum in
+ * the packet header) while the guest (receiver) thinks that the
+ * checksum has already been fully computed; in the meantime none of
+ * the code in between has actually finished computing the
+ * checksum.
+ *
+ * An example of this is DHCP response packets from host to
+ * guest. If the checksum of each of these packets isn't zeroed, then
+ * many guests (e.g. FreeBSD) will drop them with reason BAD CHECKSUM;
+ * if the packets arrive at those guests with a checksum of 0, they
+ * will happily accept the packet.
+ */
+static void
+nftablesAddOutputFixUdpChecksum(virFirewall *fw,
+ const char *iface,
+ int port)
+{
+ g_autofree char *portstr = g_strdup_printf("%d", port);
+
+ virFirewallAddCmd(fw, VIR_FIREWALL_LAYER_IPV4,
+ "insert", "rule", "ip",
+ VIR_NFTABLES_PRIVATE_TABLE,
+ VIR_NFTABLES_MANGLE_POSTROUTE_CHAIN,
+ "oif", iface, "udp", "dport", portstr,
+ "counter", "udp", "checksum", "set", "0",
+ NULL);
+}
+
+
static const char networkLocalMulticastIPv4[] = "224.0.0.0/24";
static const char networkLocalMulticastIPv6[] = "ff02::/16";
static const char networkLocalBroadcast[] = "255.255.255.255/32";
@@ -901,6 +944,30 @@ nftablesAddGeneralFirewallRules(virFirewall *fw,
}
+static void
+nftablesAddChecksumFirewallRules(virFirewall *fw,
+ virNetworkDef *def)
+{
+ size_t i;
+ virNetworkIPDef *ipv4def;
+
+ /* Look for the first IPv4 address that has dhcp or tftpboot
+ * defined. We support dhcp config on 1 IPv4 interface only.
+ */
+ for (i = 0; (ipv4def = virNetworkDefGetIPByIndex(def, AF_INET, i)); i++) {
+ if (ipv4def->nranges || ipv4def->nhosts)
+ break;
+ }
+
+ /* If we are doing local DHCP service on this network, add a rule
+ * that will fixup the checksum of DHCP response packets back to
+ * the guests.
+ */
+ if (ipv4def)
+ nftablesAddOutputFixUdpChecksum(fw, def->bridge, 68);
+}
+
+
static int
nftablesAddIPSpecificFirewallRules(virFirewall *fw,
virNetworkDef *def,
@@ -952,6 +1019,8 @@ nftablesAddFirewallRules(virNetworkDef *def, virFirewall **fwRemoval)
return -1;
}
+ nftablesAddChecksumFirewallRules(fw, def);
+
if (virFirewallApply(fw) < 0)
return -1;
diff --git a/tests/networkxml2firewalldata/base.nftables b/tests/networkxml2firewalldata/base.nftables
index a064318739..6371fc12dd 100644
--- a/tests/networkxml2firewalldata/base.nftables
+++ b/tests/networkxml2firewalldata/base.nftables
@@ -68,6 +68,13 @@ libvirt_network \
guest_nat \
'{ type nat hook postrouting priority 100; policy accept; }'
nft \
+add \
+chain \
+ip \
+libvirt_network \
+postroute_mangle \
+'{ type filter hook postrouting priority 0; policy accept; }'
+nft \
list \
table \
ip6 \
@@ -136,3 +143,10 @@ ip6 \
libvirt_network \
guest_nat \
'{ type nat hook postrouting priority 100; policy accept; }'
+nft \
+add \
+chain \
+ip6 \
+libvirt_network \
+postroute_mangle \
+'{ type filter hook postrouting priority 0; policy accept; }'
diff --git a/tests/networkxml2firewalldata/forward-dev-linux.nftables b/tests/networkxml2firewalldata/forward-dev-linux.nftables
index 8badb74beb..9dea1a88a4 100644
--- a/tests/networkxml2firewalldata/forward-dev-linux.nftables
+++ b/tests/networkxml2firewalldata/forward-dev-linux.nftables
@@ -156,3 +156,19 @@ daddr \
224.0.0.0/24 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/isolated-linux.nftables b/tests/networkxml2firewalldata/isolated-linux.nftables
index d1b4dac178..67ee0a2bf5 100644
--- a/tests/networkxml2firewalldata/isolated-linux.nftables
+++ b/tests/networkxml2firewalldata/isolated-linux.nftables
@@ -62,3 +62,19 @@ oif \
virbr0 \
counter \
accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-default-linux.nftables b/tests/networkxml2firewalldata/nat-default-linux.nftables
index 28508292f9..951a5a6d60 100644
--- a/tests/networkxml2firewalldata/nat-default-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-default-linux.nftables
@@ -142,3 +142,19 @@ daddr \
224.0.0.0/24 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-ipv6-linux.nftables b/tests/networkxml2firewalldata/nat-ipv6-linux.nftables
index d8a9ba706d..617ed8b753 100644
--- a/tests/networkxml2firewalldata/nat-ipv6-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-ipv6-linux.nftables
@@ -200,3 +200,19 @@ oif \
virbr0 \
counter \
accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables b/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables
index a7f09cda59..a710d0e296 100644
--- a/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables
@@ -272,3 +272,19 @@ daddr \
ff02::/16 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-many-ips-linux.nftables b/tests/networkxml2firewalldata/nat-many-ips-linux.nftables
index b826fe6134..0be5fb7e65 100644
--- a/tests/networkxml2firewalldata/nat-many-ips-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-many-ips-linux.nftables
@@ -366,3 +366,19 @@ daddr \
224.0.0.0/24 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables b/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables
index ceaed6fa40..7574356855 100644
--- a/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables
@@ -384,3 +384,19 @@ daddr \
ff02::/16 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-port-range-linux.nftables b/tests/networkxml2firewalldata/nat-port-range-linux.nftables
index 1dc37a26ec..127536e4db 100644
--- a/tests/networkxml2firewalldata/nat-port-range-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-port-range-linux.nftables
@@ -312,3 +312,19 @@ oif \
virbr0 \
counter \
accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-tftp-linux.nftables b/tests/networkxml2firewalldata/nat-tftp-linux.nftables
index 28508292f9..951a5a6d60 100644
--- a/tests/networkxml2firewalldata/nat-tftp-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-tftp-linux.nftables
@@ -142,3 +142,19 @@ daddr \
224.0.0.0/24 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/route-default-linux.nftables b/tests/networkxml2firewalldata/route-default-linux.nftables
index 282c9542a5..be9c4f5439 100644
--- a/tests/networkxml2firewalldata/route-default-linux.nftables
+++ b/tests/networkxml2firewalldata/route-default-linux.nftables
@@ -56,3 +56,19 @@ oif \
virbr0 \
counter \
accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
--
2.47.0
On Fri, Oct 25, 2024 at 12:18:14AM -0400, Laine Stump wrote: > Many long years ago (April 2010), soon after "vhost" in-kernel packet > processing was added to the virtio-net driver, people running RHEL5 > virtual machines with a virtio-net interface connected via a libvirt > virtual network noticed that when vhost packet processing was enabled, > their VMs could no longer get an IP address via DHCP - the guest was > ignoring the DHCP response packets sent by the host. > > (I've been informed by danpb that the same issue had been encountered, > and "fixed" even earlier than that, in 2006, with Xen as the > hypervisor.) > > The "gory details" of the 2010 discussion are chronicled here: > > https://lists.isc.org/pipermail/dhcp-hackers/2010-April/001835.html > > but basically it was because the checksum of packets wasn't being > fully computed on the host side (because QEMU on the host and the NIC > driver in the guest had agreed between themselves to turn off > checksums because they were unnecessary due to the "link" between the > two being entirely in local memory and not a physical cable), but > > 1) a partial checksum had been put into the packets at some point by > "someone" > > 2) the "don't use checksums" info was known by the guest kernel, which > would properly ignore the "bad" checksum), and > > 3) the packets were being read by the dhclient application on the > guest side with a "raw" socket (thus bypassing the guest kernel UDP > processing that would have known the checksum was irrelevant)), > > The "fix" for this ended up being two-tiered: > > 1) The ISC DHCP package (which contains the aforementioned dhclient > program) made a fix to their dhclient code which caused it to accept > packets anyway even if they didn't have a proper checksum (NB: that's > not a full explanation, and possibly not accurate). This fixed the > problem for guests with an updated dhclient. Here is the code with the > fix to ISC DHCP: > > https://github.com/isc-projects/dhcp/blob/master/common/packet.c#L365 > > This fixed the issue for any new/updated guests that had the fixed > dhclient, but it didn't solve the problem for existing/old guest > images that didn't/couldn't get their dhclient updated. This brings us > to: > > 2) iptables added a new "CHECKSUM" target and "--checksum-fill" > action: > > http://patchwork.ozlabs.org/patch/58525/ > > and libvirt added an iptables rule for each virtual network to match > DHCP response packets and perform --checksum-fill. This way by the > time dhclient on the guest reads the raw packet, the checksum would be > corrected, and the packet would be accepted. This was pushed upstream > in libvirt commit v0.8.2-142-gfd5b15ff1a. > > The word at the time from those more knowledgeable than me was that > the bad checksum problem was really specific to ISC's dhclient running > on Linux, and so once their fix was in use everywhere dhclient was > used, bad checksums would be a thing of the past and the > --checksum-fill iptables rules would no longer be needed (but would > otherwise be harmless if they were still there). (Plot twist: the > dhclient code in fix (1) above apparently is on a Linux-only code path > - this is very important later!) > > Based on this information (and also due to the opinion that fixing it > by having iptables modify the packet checksum was really the wrong way > to permanently fix things, i.e. an "ugly hack"), the nftables > developers made the decision to not implement an equivalent to > --checksum-fill in nftables. As a result, when I wrote the nftables > firewall backend for libvirt virtual networks earlier this year, it > didn't add in any rule to "fix" broken UDP checksums (since there was > apparently no equivalent in nftables and, after all, that was fixed > somewhere else 14 years ago, right???) > > But last week, when Rich Jones was doing routine testing using a Fedora > 40 host (the first Fedora release to use the nftables backend of libvirt's > network driver by default) and a FreeBSD guest, for "some strange > reason", the FreeBSD guest was unable to get an IP address from DHCP!! > > https://www.spinics.net/linux/fedora/libvirt-users/msg14356.html > > A few quick tests proved that it was the same old "bad checksum" > problem from 2010 come back to haunt us - it wasn't a Linux-only issue > after all. > > Phil Sutter and Eric Garver (nftables people) pointed out that, while > nftables doesn't have an action that will *compute* the checksum of a > packet, it *does* have an action that will set the checksum to 0, and > suggested we try adding a "zero the checksum" rule for dhcp response > packets to our nftables ruleset. (Why? Because a checksum value of 0 > in a IPv4 UDP packet is defined by RFC768 to mean "no checksum > generated", implying "checksum not needed"). It turns out that this > works - dhclient properly recognizes that a 0 checksum means "don't > bother with the checksum", and accepts the packet as valid. > > So to once again fix this timeless bug, this patch adds such a > checksum zeroing rule to the nftables rules setup for each virtual > network. > > This has been verified (on a Fedora 40 host) to fix DHCP with FreeBSD > guests, while not breaking it for Fedora or Windows (10) guests. You can add OpenBSD to that list, as I tested that too. > > Fixes: b89c4991daa0ee9371f10937fab3b03c5ffdabc6 > Reported-by: Rich Jones <rjones@redhat.com> > Fix-Suggested-by: Eric Garver <egarver@redhat.com> > Fix-Suggested-by: Phil Sutter <psutter@redhat.com> > Signed-off-by: Laine Stump <laine@redhat.com> > Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> > --- > > (NB: jdenemar reminded me this is a bugfix, so we don't need to rush > to get it pushed before he freezes for the RC1 snapshot. I'm happy > with it as is though, if anyone is awake early enough and wants to > push it before he snapshots RC1. Otherwise I'll just push it later and > it will be in RC2) > > Changes from V1: > > * informational errors/omissions in commit log message fixed > > src/network/network_nftables.c | 69 +++++++++++++++++++ > tests/networkxml2firewalldata/base.nftables | 14 ++++ > .../forward-dev-linux.nftables | 16 +++++ > .../isolated-linux.nftables | 16 +++++ > .../nat-default-linux.nftables | 16 +++++ > .../nat-ipv6-linux.nftables | 16 +++++ > .../nat-ipv6-masquerade-linux.nftables | 16 +++++ > .../nat-many-ips-linux.nftables | 16 +++++ > .../nat-port-range-ipv6-linux.nftables | 16 +++++ > .../nat-port-range-linux.nftables | 16 +++++ > .../nat-tftp-linux.nftables | 16 +++++ > .../route-default-linux.nftables | 16 +++++ > 12 files changed, 243 insertions(+) Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Fri, Oct 25, 2024 at 04:44:16PM +0100, Daniel P. Berrangé wrote: > On Fri, Oct 25, 2024 at 12:18:14AM -0400, Laine Stump wrote: > > This has been verified (on a Fedora 40 host) to fix DHCP with FreeBSD > > guests, while not breaking it for Fedora or Windows (10) guests. > > You can add OpenBSD to that list, as I tested that too. I did some testing of my own and I can confirm that FreeBSD and OpenBSD are fine with this change, as are various Linux flavors (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu). However, a few other operating systems aren't: namely GNU/Hurd, Haiku and NetBSD break with this change. Interestingly, these were all fine with the nftables backend before it. Now, one could argue that GNU/Hurd and Haiku are toy/research operating systems with fairly small audiences, and it would be hard to disagree :) but I don't think we can put NetBSD in the same bucket. I'm also concerned about old versions of the operating systems that we've listed as working above being unhappy with the change. It's true that, to an extent, we can just tell people to upgrade their guests, but sometimes running old operating systems is the whole point of using virtualization in the first place... In conclusion, even with this latest fix the nftables backend still represents a step backwards compared to the iptables one. Considering that we've made it the default one, we should try to close the gap as much as possible. -- Andrea Bolognani / Red Hat / Virtualization
On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote: > On Fri, Oct 25, 2024 at 04:44:16PM +0100, Daniel P. Berrangé wrote: > > On Fri, Oct 25, 2024 at 12:18:14AM -0400, Laine Stump wrote: > > > This has been verified (on a Fedora 40 host) to fix DHCP with FreeBSD > > > guests, while not breaking it for Fedora or Windows (10) guests. > > > > You can add OpenBSD to that list, as I tested that too. > > I did some testing of my own and I can confirm that FreeBSD and > OpenBSD are fine with this change, as are various Linux flavors > (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu). > > However, a few other operating systems aren't: namely GNU/Hurd, Haiku > and NetBSD break with this change. Interestingly, these were all fine > with the nftables backend before it. Well that's odd. I've checked NetBSD source code and found no less than 3 DHCP client impls, and all of them cope with checksum == 0. https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497 https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507 https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373 the middle impl also directly copes with partial checksums Not identified the Hurd/Haiku DHCP client code yet... With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote: > On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote: > > I did some testing of my own and I can confirm that FreeBSD and > > OpenBSD are fine with this change, as are various Linux flavors > > (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu). > > > > However, a few other operating systems aren't: namely GNU/Hurd, Haiku > > and NetBSD break with this change. Interestingly, these were all fine > > with the nftables backend before it. > > Well that's odd. I've checked NetBSD source code and found no less > than 3 DHCP client impls, and all of them cope with checksum == 0. > > https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497 > > https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507 > > https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373 > > the middle impl also directly copes with partial checksums The boot log contains Starting dhcpcd. wm0: checksum failure from 192.168.124.1 so I guess the second implementation is the relevant one. > Not identified the Hurd/Haiku DHCP client code yet... I'm using Debian GNU/Hurd, so the DHCP client is the same as regular Debian (ISC DHCP). The source can be found at https://deb.debian.org/debian-ports/pool-hurd-i386/main/i/isc-dhcp/ The version is a bit old and there's the tiniest amount of patching compared to the Linux build, specifically: --- isc-dhcp-4.4.3-P1-1.1/debian/patches/bind-fix 1970-01-01 01:00:00.000000000 +0100 +++ isc-dhcp-4.4.3-P1-1.1+hurd.1/debian/patches/bind-fix 2023-02-15 15:39:49.000000000 +0100 @@ -0,0 +1,26 @@ +Index: isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c +=================================================================== +--- isc-dhcp-4.4.3-P1-build.orig/bind/bind-9.11.36/lib/isc/unix/socket.c ++++ isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c +@@ -2633,7 +2633,7 @@ opensocket(isc__socketmgr_t *manager, is + char strbuf[ISC_STRERRORSIZE]; + const char *err = "socket"; + int tries = 0; +-#if defined(USE_CMSG) || defined(SO_BSDCOMPAT) || defined(SO_NOSIGPIPE) ++#if 1 + int on = 1; + #endif + #if defined(SO_RCVBUF) I'm not sure whether this could be relevant to the issue at hand. To clarify, this is something that needs to be handled at the userspace level, no kernel changes required? And clearly it affects DHCP, but what about other protocols? Are we confident those will cope just fine? -- Andrea Bolognani / Red Hat / Virtualization
On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote: > On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote: > > On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote: > > > I did some testing of my own and I can confirm that FreeBSD and > > > OpenBSD are fine with this change, as are various Linux flavors > > > (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu). > > > > > > However, a few other operating systems aren't: namely GNU/Hurd, Haiku > > > and NetBSD break with this change. Interestingly, these were all fine > > > with the nftables backend before it. > > > > Well that's odd. I've checked NetBSD source code and found no less > > than 3 DHCP client impls, and all of them cope with checksum == 0. > > > > https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497 > > > > https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507 > > > > https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373 > > > > the middle impl also directly copes with partial checksums > > The boot log contains > > Starting dhcpcd. > wm0: checksum failure from 192.168.124.1 > > so I guess the second implementation is the relevant one. I've just tested netBSD 10.0 and get exactly the same failure as you. I've tried "tcpdump -vv -i vnetXXX port 68" on the host and on the guest and that is reporting that the checksum is bad. It is *not* getting set to zero. Meanwhile, if I run the same tcpdump with OpenBSD guest, I see tcpdump reporting a zero checksum as expected. WTF ? Somehow our nftables rule is not having an effect, or worse, it is have a non-deterministic effect where it works for packets on some guests, but not others. I checked the rule counters and packets are hitting the rule, but not getting their checksum zerod. IOW, conceptually I still believe our desire to set the checksum to zero is correct, but somehow the impl is not taking effect as expected. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Tue, Oct 29, 2024 at 12:22:42PM +0000, Daniel P. Berrangé wrote: > On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote: > > On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote: > > > On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote: > > > > I did some testing of my own and I can confirm that FreeBSD and > > > > OpenBSD are fine with this change, as are various Linux flavors > > > > (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu). > > > > > > > > However, a few other operating systems aren't: namely GNU/Hurd, Haiku > > > > and NetBSD break with this change. Interestingly, these were all fine > > > > with the nftables backend before it. > > > > > > Well that's odd. I've checked NetBSD source code and found no less > > > than 3 DHCP client impls, and all of them cope with checksum == 0. > > > > > > https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497 > > > > > > https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507 > > > > > > https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373 > > > > > > the middle impl also directly copes with partial checksums > > > > The boot log contains > > > > Starting dhcpcd. > > wm0: checksum failure from 192.168.124.1 > > > > so I guess the second implementation is the relevant one. > > I've just tested netBSD 10.0 and get exactly the same failure > as you. > > I've tried "tcpdump -vv -i vnetXXX port 68" on the host and > on the guest and that is reporting that the checksum is bad. > It is *not* getting set to zero. > > Meanwhile, if I run the same tcpdump with OpenBSD guest, I > see tcpdump reporting a zero checksum as expected. > > WTF ? > > Somehow our nftables rule is not having an effect, or worse, > it is have a non-deterministic effect where it works for > packets on some guests, but not others. > > I checked the rule counters and packets are hitting the rule, > but not getting their checksum zerod. Further research shows tcpdump on packets leaving 'virbr0' have the checksum correctly zerod. Our nftables rule is working. A concurrent tcpdump on packets leaving 'vnetNNN' shows the checksum is mangled. With our old iptables rules, we set a valid checksum when leaving virbr0, and I presume this causes all subsequent code to not touch the checksum field. With our new nftables rules, we set a zero checksum when leaving virbr0, and "zero checksum" conceptually means "not present (yet)". I think there must be logic somewhere in the kernel/QEMU which sees "not present" and decides it needs to do <something> with the checksum field. A key difference that is probably relevant is that netbsd is using an e1000 NIC in QEMU, while openbsd is using a virtio-net NIC. At least when created by virt-manager. AFAIR, QEMU's magic checksum offload only happens for virtio-net, so presumably our rules are incompatible with non-virtio-net NICs in someway. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On 10/29/24 8:46 AM, Daniel P. Berrangé wrote: > On Tue, Oct 29, 2024 at 12:22:42PM +0000, Daniel P. Berrangé wrote: >> On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote: >>> On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote: >>>> On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote: >>>>> I did some testing of my own and I can confirm that FreeBSD and >>>>> OpenBSD are fine with this change, as are various Linux flavors >>>>> (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu). >>>>> >>>>> However, a few other operating systems aren't: namely GNU/Hurd, Haiku >>>>> and NetBSD break with this change. Interestingly, these were all fine >>>>> with the nftables backend before it. >>>> >>>> Well that's odd. I've checked NetBSD source code and found no less >>>> than 3 DHCP client impls, and all of them cope with checksum == 0. >>>> >>>> https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497 >>>> >>>> https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507 >>>> >>>> https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373 >>>> >>>> the middle impl also directly copes with partial checksums >>> >>> The boot log contains >>> >>> Starting dhcpcd. >>> wm0: checksum failure from 192.168.124.1 >>> >>> so I guess the second implementation is the relevant one. >> >> I've just tested netBSD 10.0 and get exactly the same failure >> as you. >> >> I've tried "tcpdump -vv -i vnetXXX port 68" on the host and >> on the guest and that is reporting that the checksum is bad. >> It is *not* getting set to zero. >> >> Meanwhile, if I run the same tcpdump with OpenBSD guest, I >> see tcpdump reporting a zero checksum as expected. >> >> WTF ? >> >> Somehow our nftables rule is not having an effect, or worse, >> it is have a non-deterministic effect where it works for >> packets on some guests, but not others. >> >> I checked the rule counters and packets are hitting the rule, >> but not getting their checksum zerod. > > Further research shows tcpdump on packets leaving 'virbr0' have > the checksum correctly zerod. Our nftables rule is working. > > A concurrent tcpdump on packets leaving 'vnetNNN' shows the > checksum is mangled. > > With our old iptables rules, we set a valid checksum when leaving > virbr0, and I presume this causes all subsequent code to not touch > the checksum field. > > With our new nftables rules, we set a zero checksum when leaving > virbr0, and "zero checksum" conceptually means "not present (yet)". > > I think there must be logic somewhere in the kernel/QEMU which > sees "not present" and decides it needs to do <something> with > the checksum field. Yikes! > > A key difference that is probably relevant is that netbsd is > using an e1000 NIC in QEMU, while openbsd is using a virtio-net > NIC. At least when created by virt-manager. > > AFAIR, QEMU's magic checksum offload only happens for virtio-net, > so presumably our rules are incompatible with non-virtio-net NICs > in someway. Double and triple yikes! So something in the packet path for non-virtio-net NICs is noticing that the packet checksum is 0, and then "fixing" it with the *wrong* checksum? But in the past when it already had the correct checksum, that same bit of code said "Huh. The checksum is already correct" and left it alone. So when the extra rules are removed, then those same guests begin working? (You can easily remove the checksum rules with: nft delete chain ip libvirt_network postroute_mangle BTW, I just now tried an e1000e NIC on Fedora guest and it continues to work with the 0-checksum rules removed. In this case tcpdump on virbr0 shows "bad cksum", but when I look at tcpdump on the guest, it shows "udp cksum ok" though, so something else somewhere is setting the checksum to the correct value.
Hi, On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote: > On 10/29/24 8:46 AM, Daniel P. Berrangé wrote: > > On Tue, Oct 29, 2024 at 12:22:42PM +0000, Daniel P. Berrangé wrote: > >> On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote: > >>> On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote: > >>>> On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote: > >>>>> I did some testing of my own and I can confirm that FreeBSD and > >>>>> OpenBSD are fine with this change, as are various Linux flavors > >>>>> (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu). > >>>>> > >>>>> However, a few other operating systems aren't: namely GNU/Hurd, Haiku > >>>>> and NetBSD break with this change. Interestingly, these were all fine > >>>>> with the nftables backend before it. > >>>> > >>>> Well that's odd. I've checked NetBSD source code and found no less > >>>> than 3 DHCP client impls, and all of them cope with checksum == 0. > >>>> > >>>> https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497 > >>>> > >>>> https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507 > >>>> > >>>> https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373 > >>>> > >>>> the middle impl also directly copes with partial checksums > >>> > >>> The boot log contains > >>> > >>> Starting dhcpcd. > >>> wm0: checksum failure from 192.168.124.1 > >>> > >>> so I guess the second implementation is the relevant one. > >> > >> I've just tested netBSD 10.0 and get exactly the same failure > >> as you. > >> > >> I've tried "tcpdump -vv -i vnetXXX port 68" on the host and > >> on the guest and that is reporting that the checksum is bad. > >> It is *not* getting set to zero. > >> > >> Meanwhile, if I run the same tcpdump with OpenBSD guest, I > >> see tcpdump reporting a zero checksum as expected. > >> > >> WTF ? > >> > >> Somehow our nftables rule is not having an effect, or worse, > >> it is have a non-deterministic effect where it works for > >> packets on some guests, but not others. > >> > >> I checked the rule counters and packets are hitting the rule, > >> but not getting their checksum zerod. > > > > Further research shows tcpdump on packets leaving 'virbr0' have > > the checksum correctly zerod. Our nftables rule is working. > > > > A concurrent tcpdump on packets leaving 'vnetNNN' shows the > > checksum is mangled. > > > > With our old iptables rules, we set a valid checksum when leaving > > virbr0, and I presume this causes all subsequent code to not touch > > the checksum field. > > > > With our new nftables rules, we set a zero checksum when leaving > > virbr0, and "zero checksum" conceptually means "not present (yet)". > > > > I think there must be logic somewhere in the kernel/QEMU which > > sees "not present" and decides it needs to do <something> with > > the checksum field. > > Yikes! > > > > > A key difference that is probably relevant is that netbsd is > > using an e1000 NIC in QEMU, while openbsd is using a virtio-net > > NIC. At least when created by virt-manager. > > > > AFAIR, QEMU's magic checksum offload only happens for virtio-net, > > so presumably our rules are incompatible with non-virtio-net NICs > > in someway. > > Double and triple yikes! > > So something in the packet path for non-virtio-net NICs is noticing that > the packet checksum is 0, and then "fixing" it with the *wrong* checksum? > > But in the past when it already had the correct checksum, that same bit > of code said "Huh. The checksum is already correct" and left it alone. Maybe this is due to the partial checksum thing in kernel which updates the checksum if not using virtio "shortcut". > So when the extra rules are removed, then those same guests begin > working? (You can easily remove the checksum rules with: > > nft delete chain ip libvirt_network postroute_mangle > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to > work with the 0-checksum rules removed. In this case tcpdump on virbr0 > shows "bad cksum", but when I look at tcpdump on the guest, it shows > "udp cksum ok" though, so something else somewhere is setting the > checksum to the correct value. FWIW, I just tested an alternative workaround using tc. This works for me with a FreeBSD guest and NIC switched to either e1000 or virtio: # tc qd add dev vnetbr0 root handle 1: htb # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ u32 match ip sport 67 ffff match ip dport 68 ffff \ action csum ip and udp Another alternative might be to add the nftables rule for virtio-based guests only. Cheers, Phil
On 10/29/24 11:12 AM, Phil Sutter wrote: > Hi, > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote: >> On 10/29/24 8:46 AM, Daniel P. Berrangé wrote: >>> On Tue, Oct 29, 2024 at 12:22:42PM +0000, Daniel P. Berrangé wrote: >>>> On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote: >>>>> On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote: >>>>>> On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote: >>>>>>> I did some testing of my own and I can confirm that FreeBSD and >>>>>>> OpenBSD are fine with this change, as are various Linux flavors >>>>>>> (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu). >>>>>>> >>>>>>> However, a few other operating systems aren't: namely GNU/Hurd, Haiku >>>>>>> and NetBSD break with this change. Interestingly, these were all fine >>>>>>> with the nftables backend before it. >>>>>> >>>>>> Well that's odd. I've checked NetBSD source code and found no less >>>>>> than 3 DHCP client impls, and all of them cope with checksum == 0. >>>>>> >>>>>> https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497 >>>>>> >>>>>> https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507 >>>>>> >>>>>> https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373 >>>>>> >>>>>> the middle impl also directly copes with partial checksums >>>>> >>>>> The boot log contains >>>>> >>>>> Starting dhcpcd. >>>>> wm0: checksum failure from 192.168.124.1 >>>>> >>>>> so I guess the second implementation is the relevant one. >>>> >>>> I've just tested netBSD 10.0 and get exactly the same failure >>>> as you. >>>> >>>> I've tried "tcpdump -vv -i vnetXXX port 68" on the host and >>>> on the guest and that is reporting that the checksum is bad. >>>> It is *not* getting set to zero. >>>> >>>> Meanwhile, if I run the same tcpdump with OpenBSD guest, I >>>> see tcpdump reporting a zero checksum as expected. >>>> >>>> WTF ? >>>> >>>> Somehow our nftables rule is not having an effect, or worse, >>>> it is have a non-deterministic effect where it works for >>>> packets on some guests, but not others. >>>> >>>> I checked the rule counters and packets are hitting the rule, >>>> but not getting their checksum zerod. >>> >>> Further research shows tcpdump on packets leaving 'virbr0' have >>> the checksum correctly zerod. Our nftables rule is working. >>> >>> A concurrent tcpdump on packets leaving 'vnetNNN' shows the >>> checksum is mangled. >>> >>> With our old iptables rules, we set a valid checksum when leaving >>> virbr0, and I presume this causes all subsequent code to not touch >>> the checksum field. >>> >>> With our new nftables rules, we set a zero checksum when leaving >>> virbr0, and "zero checksum" conceptually means "not present (yet)". >>> >>> I think there must be logic somewhere in the kernel/QEMU which >>> sees "not present" and decides it needs to do <something> with >>> the checksum field. >> >> Yikes! >> >>> >>> A key difference that is probably relevant is that netbsd is >>> using an e1000 NIC in QEMU, while openbsd is using a virtio-net >>> NIC. At least when created by virt-manager. >> > >>> AFAIR, QEMU's magic checksum offload only happens for virtio-net, >>> so presumably our rules are incompatible with non-virtio-net NICs >>> in someway. >> >> Double and triple yikes! >> >> So something in the packet path for non-virtio-net NICs is noticing that >> the packet checksum is 0, and then "fixing" it with the *wrong* checksum? >> >> But in the past when it already had the correct checksum, that same bit >> of code said "Huh. The checksum is already correct" and left it alone. > > Maybe this is due to the partial checksum thing in kernel which updates > the checksum if not using virtio "shortcut". > >> So when the extra rules are removed, then those same guests begin >> working? (You can easily remove the checksum rules with: >> >> nft delete chain ip libvirt_network postroute_mangle >> >> BTW, I just now tried an e1000e NIC on Fedora guest and it continues to >> work with the 0-checksum rules removed. In this case tcpdump on virbr0 >> shows "bad cksum", but when I look at tcpdump on the guest, it shows >> "udp cksum ok" though, so something else somewhere is setting the >> checksum to the correct value. > > FWIW, I just tested an alternative workaround using tc. This works for > me with a FreeBSD guest and NIC switched to either e1000 or virtio: > > # tc qd add dev vnetbr0 root handle 1: htb > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ > u32 match ip sport 67 ffff match ip dport 68 ffff \ > action csum ip and udp This did work for me. However, since we support bandwidth management of the interfaces connected to a virtual netowkr using tc, I decided to try doing the same on a network configured with bandwidth management enabled. I did this by adding the following to the <network> XML before starting the network: <bandwidth> <inbound average='1000' peak='2000' burst='5120'/> <outbound average='500' peak='1000' burst='2560'/> </bandwidth> This resulted in the following tc-related stuff being setup: # tc filter show dev virbr13 filter parent 1: protocol all pref 1 fw chain 0 filter parent 1: protocol all pref 1 fw chain 0 handle 0x1 classid :1 ======= # tc qd show dev virbr13 qdisc htb 1: root refcnt 2 r2q 10 default 0x2 direct_packets_stat 0 direct_qlen 1000 qdisc sfq 2: parent 1:2 limit 127p quantum 1514b depth 127 divisor 1024 perturb 10sec qdisc ingress ffff: parent ffff:fff1 ---------------- ======= # tc class show dev virbr13 class htb 1:1 root rate 8Mbit ceil 16Mbit burst 1600b cburst 1600b class htb 1:2 parent 1:1 leaf 2: prio 0 rate 8Mbit ceil 16Mbit burst 5Mb cburst 1600b ======= (obviously my network is using virbr13). Now when I try running the commands above, I see this: # tc qd add dev virbr13 root handle 1: htb Error: Exclusivity flag on, cannot modify. So how does this all need to change in order to have our per-bridge-device bandwidth management coexist with the qdisc/filter to re-compute dhcp response checksums? (sorry for requesting that you ELI5; that's just easier than digging through all the documentation to figure out what is probably a simple solution :-P)
Hi Laine, On Fri, Nov 01, 2024 at 11:44:12AM -0400, Laine Stump wrote: > On 10/29/24 11:12 AM, Phil Sutter wrote: [...] > > FWIW, I just tested an alternative workaround using tc. This works for > > me with a FreeBSD guest and NIC switched to either e1000 or virtio: > > > > # tc qd add dev vnetbr0 root handle 1: htb > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ > > u32 match ip sport 67 ffff match ip dport 68 ffff \ > > action csum ip and udp > > This did work for me. However, since we support bandwidth management of > the interfaces connected to a virtual netowkr using tc, I decided to try > doing the same on a network configured with bandwidth management > enabled. I did this by adding the following to the <network> XML before > starting the network: > > <bandwidth> > <inbound average='1000' peak='2000' burst='5120'/> > <outbound average='500' peak='1000' burst='2560'/> > </bandwidth> > > This resulted in the following tc-related stuff being setup: > > # tc filter show dev virbr13 > filter parent 1: protocol all pref 1 fw chain 0 > filter parent 1: protocol all pref 1 fw chain 0 handle 0x1 classid :1 > ======= > > # tc qd show dev virbr13 > qdisc htb 1: root refcnt 2 r2q 10 default 0x2 direct_packets_stat 0 > direct_qlen 1000 > qdisc sfq 2: parent 1:2 limit 127p quantum 1514b depth 127 divisor 1024 > perturb 10sec > qdisc ingress ffff: parent ffff:fff1 ---------------- > ======= > > # tc class show dev virbr13 > class htb 1:1 root rate 8Mbit ceil 16Mbit burst 1600b cburst 1600b > class htb 1:2 parent 1:1 leaf 2: prio 0 rate 8Mbit ceil 16Mbit burst 5Mb > cburst 1600b > ======= > > (obviously my network is using virbr13). Now when I try running the > commands above, I see this: > > # tc qd add dev virbr13 root handle 1: htb > Error: Exclusivity flag on, cannot modify. No big deal: You're just trying to recreate the already existing HTB qdisc and it fails. But my quoted commands create one merely because we need one to attach the filter to. > So how does this all need to change in order to have our > per-bridge-device bandwidth management coexist with the qdisc/filter to > re-compute dhcp response checksums? (sorry for requesting that you ELI5; > that's just easier than digging through all the documentation to figure > out what is probably a simple solution :-P) It should just work if we attach the extra filter to the existing qdisc (untested due to laziness): # tc filter add dev virbr13 prio 2 protocol ip parent 1: \ u32 match ip sport 67 ffff match ip dport 68 ffff \ action csum ip and udp I forgot the meaning of the prio value again. Lower value may be higher prio, but IIRC all filters apply anyway and ordering doesn't matter in this case. In practice, you may just change the code to always add the HTB qdisc and the checksum update filter/action. If <bandwidth> is specified, also add the classes, leave qdisc(s) and fwmark filter. Let me know if things work this way. Cheers, Phil
On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote: > Hi, > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote: > > So when the extra rules are removed, then those same guests begin > > working? (You can easily remove the checksum rules with: > > > > nft delete chain ip libvirt_network postroute_mangle > > > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to > > work with the 0-checksum rules removed. In this case tcpdump on virbr0 > > shows "bad cksum", but when I look at tcpdump on the guest, it shows > > "udp cksum ok" though, so something else somewhere is setting the > > checksum to the correct value. > > FWIW, I just tested an alternative workaround using tc. This works for > me with a FreeBSD guest and NIC switched to either e1000 or virtio: > > # tc qd add dev vnetbr0 root handle 1: htb > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ > u32 match ip sport 67 ffff match ip dport 68 ffff \ > action csum ip and udp This feels like it is functionally closest to what we've had historically, even though it is annoying to have to deal with 'tc' tool, in addition to 'nft'. So I'm thinking this is probably the way we'll have to go. > Another alternative might be to add the nftables rule for virtio-based > guests only. The firewall rules are in a chain that's applied to all guests, so we have no where to add a per-guest rule. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote: > On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote: > > Hi, > > > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote: > > > So when the extra rules are removed, then those same guests begin > > > working? (You can easily remove the checksum rules with: > > > > > > nft delete chain ip libvirt_network postroute_mangle > > > > > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to > > > work with the 0-checksum rules removed. In this case tcpdump on virbr0 > > > shows "bad cksum", but when I look at tcpdump on the guest, it shows > > > "udp cksum ok" though, so something else somewhere is setting the > > > checksum to the correct value. > > > > FWIW, I just tested an alternative workaround using tc. This works for > > me with a FreeBSD guest and NIC switched to either e1000 or virtio: > > > > # tc qd add dev vnetbr0 root handle 1: htb > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ > > u32 match ip sport 67 ffff match ip dport 68 ffff \ > > action csum ip and udp > > This feels like it is functionally closest to what we've had historically, > even though it is annoying to have to deal with 'tc' tool, in addition > to 'nft'. So I'm thinking this is probably the way we'll have to go. Another ugly detail (inherent to 'tc') is that you have to attach a classful qdisc to the interface since otherwise you can't add a filter with attached action. While this may not be a problem in practice, there is this side-effect of setting up a HTB on the bridge which by default runs a "noqueue" qdisc. > > Another alternative might be to add the nftables rule for virtio-based > > guests only. > > The firewall rules are in a chain that's applied to all guests, > so we have no where to add a per-guest rule. With nftables, you may create a chain in netdev family which binds to the specific device(s) needing the hack. It needs maintenance after guest startup and shutdown, though. BTW: libvirt supports configurations which don't involve a 'vnetbr0' bridge. In this case, you will have to setup tc on the actual tap device, right? Cheers, Phil
On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote: > On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote: > > On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote: > > > Hi, > > > > > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote: > > > > So when the extra rules are removed, then those same guests begin > > > > working? (You can easily remove the checksum rules with: > > > > > > > > nft delete chain ip libvirt_network postroute_mangle > > > > > > > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to > > > > work with the 0-checksum rules removed. In this case tcpdump on virbr0 > > > > shows "bad cksum", but when I look at tcpdump on the guest, it shows > > > > "udp cksum ok" though, so something else somewhere is setting the > > > > checksum to the correct value. > > > > > > FWIW, I just tested an alternative workaround using tc. This works for > > > me with a FreeBSD guest and NIC switched to either e1000 or virtio: > > > > > > # tc qd add dev vnetbr0 root handle 1: htb > > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ > > > u32 match ip sport 67 ffff match ip dport 68 ffff \ > > > action csum ip and udp > > > > This feels like it is functionally closest to what we've had historically, > > even though it is annoying to have to deal with 'tc' tool, in addition > > to 'nft'. So I'm thinking this is probably the way we'll have to go. > > Another ugly detail (inherent to 'tc') is that you have to attach a > classful qdisc to the interface since otherwise you can't add a filter > with attached action. While this may not be a problem in practice, there > is this side-effect of setting up a HTB on the bridge which by default > runs a "noqueue" qdisc. I'm not that familiar with 'tc'. Can you explain the functional effect of those 'qdisc' settings on virbr0, as if I know nothing :-) > > > Another alternative might be to add the nftables rule for virtio-based > > > guests only. > > > > The firewall rules are in a chain that's applied to all guests, > > so we have no where to add a per-guest rule. > > With nftables, you may create a chain in netdev family which binds to > the specific device(s) needing the hack. It needs maintenance after > guest startup and shutdown, though. > > BTW: libvirt supports configurations which don't involve a 'vnetbr0' > bridge. In this case, you will have to setup tc on the actual tap > device, right? In those cases, we haven't historically set firewall rules, so users were on their own, so in that sense, it isn't a regression we need to solve. Also in those cases, the DHCP daemon would be off-host, and so packets we're getting back from it ought to have a checksum present, as they've been over a physical link. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote: > On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote: > > On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote: > > > On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote: > > > > Hi, > > > > > > > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote: > > > > > So when the extra rules are removed, then those same guests begin > > > > > working? (You can easily remove the checksum rules with: > > > > > > > > > > nft delete chain ip libvirt_network postroute_mangle > > > > > > > > > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to > > > > > work with the 0-checksum rules removed. In this case tcpdump on virbr0 > > > > > shows "bad cksum", but when I look at tcpdump on the guest, it shows > > > > > "udp cksum ok" though, so something else somewhere is setting the > > > > > checksum to the correct value. > > > > > > > > FWIW, I just tested an alternative workaround using tc. This works for > > > > me with a FreeBSD guest and NIC switched to either e1000 or virtio: > > > > > > > > # tc qd add dev vnetbr0 root handle 1: htb > > > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ > > > > u32 match ip sport 67 ffff match ip dport 68 ffff \ > > > > action csum ip and udp > > > > > > This feels like it is functionally closest to what we've had historically, > > > even though it is annoying to have to deal with 'tc' tool, in addition > > > to 'nft'. So I'm thinking this is probably the way we'll have to go. > > > > Another ugly detail (inherent to 'tc') is that you have to attach a > > classful qdisc to the interface since otherwise you can't add a filter > > with attached action. While this may not be a problem in practice, there > > is this side-effect of setting up a HTB on the bridge which by default > > runs a "noqueue" qdisc. > > I'm not that familiar with 'tc'. > > Can you explain the functional effect of those 'qdisc' settings on > virbr0, as if I know nothing :-) 'tc' controls QoS in Linux. 'qdisc's define how congestion should be handled (basically: queue or drop, prioritization, etc). The default qdisc for virtual devices like bridge or veth is "noqueue" - it sets the device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it accordingly (calls dev_hard_start_xmit() after a few checks to make sure the device is working). HTB is a container of classes (for packet classification) which themselves hold qdiscs. On my system at least, it doesn't come with default classes and thus should not do much by itself (apart from running the filter for us which we want. Anything else is overhead. I'm not sure how much detail you need - "as if I know nothing" is a bit like naively typing 'find /' and wondering when it will end. ;) Please shoot if you need more details. For the time being, let me point at some howto[2] I wrote long ago. > > > > Another alternative might be to add the nftables rule for virtio-based > > > > guests only. > > > > > > The firewall rules are in a chain that's applied to all guests, > > > so we have no where to add a per-guest rule. > > > > With nftables, you may create a chain in netdev family which binds to > > the specific device(s) needing the hack. It needs maintenance after > > guest startup and shutdown, though. > > > > BTW: libvirt supports configurations which don't involve a 'vnetbr0' > > bridge. In this case, you will have to setup tc on the actual tap > > device, right? > > In those cases, we haven't historically set firewall rules, so > users were on their own, so in that sense, it isn't a regression > we need to solve. Also in those cases, the DHCP daemon would be > off-host, and so packets we're getting back from it ought to > have a checksum present, as they've been over a physical link. OK. From my perspective, attaching the tc qdisc/filter/action to individual guest devices would still be the cleaner solution. If there's no mechanism to attach this to, it might be easier to just stick everything to the bridge, of course. Cheers, Phil [1] https://github.com/torvalds/linux/blob/master/net/core/dev.c#L4393 [2] http://linux-ip.net/gl/tc-filters/
On 10/29/24 3:41 PM, Phil Sutter wrote: > On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote: >> On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote: >>> On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote: >>>> On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote: >>>>> Hi, >>>>> >>>>> On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote: >>>>>> So when the extra rules are removed, then those same guests begin >>>>>> working? (You can easily remove the checksum rules with: >>>>>> >>>>>> nft delete chain ip libvirt_network postroute_mangle >>>>>> >>>>>> BTW, I just now tried an e1000e NIC on Fedora guest and it continues to >>>>>> work with the 0-checksum rules removed. In this case tcpdump on virbr0 >>>>>> shows "bad cksum", but when I look at tcpdump on the guest, it shows >>>>>> "udp cksum ok" though, so something else somewhere is setting the >>>>>> checksum to the correct value. >>>>> >>>>> FWIW, I just tested an alternative workaround using tc. This works for >>>>> me with a FreeBSD guest and NIC switched to either e1000 or virtio: >>>>> >>>>> # tc qd add dev vnetbr0 root handle 1: htb >>>>> # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ >>>>> u32 match ip sport 67 ffff match ip dport 68 ffff \ >>>>> action csum ip and udp >>>> >>>> This feels like it is functionally closest to what we've had historically, >>>> even though it is annoying to have to deal with 'tc' tool, in addition >>>> to 'nft'. So I'm thinking this is probably the way we'll have to go. >>> >>> Another ugly detail (inherent to 'tc') is that you have to attach a >>> classful qdisc to the interface since otherwise you can't add a filter >>> with attached action. While this may not be a problem in practice, there >>> is this side-effect of setting up a HTB on the bridge which by default >>> runs a "noqueue" qdisc. >> >> I'm not that familiar with 'tc'. >> >> Can you explain the functional effect of those 'qdisc' settings on >> virbr0, as if I know nothing :-) > > 'tc' controls QoS in Linux. 'qdisc's define how congestion should be > handled (basically: queue or drop, prioritization, etc). The default > qdisc for virtual devices like bridge or veth is "noqueue" - it sets the > device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it > accordingly (calls dev_hard_start_xmit() after a few checks to make sure > the device is working). > > HTB is a container of classes (for packet classification) which > themselves hold qdiscs. On my system at least, it doesn't come with > default classes and thus should not do much by itself (apart from > running the filter for us which we want. Anything else is overhead. > > I'm not sure how much detail you need - "as if I know nothing" is a bit > like naively typing 'find /' and wondering when it will end. ;) > Please shoot if you need more details. For the time being, let me point > at some howto[2] I wrote long ago. > >>>>> Another alternative might be to add the nftables rule for virtio-based >>>>> guests only. >>>> >>>> The firewall rules are in a chain that's applied to all guests, >>>> so we have no where to add a per-guest rule. >>> >>> With nftables, you may create a chain in netdev family which binds to >>> the specific device(s) needing the hack. It needs maintenance after >>> guest startup and shutdown, though. >>> >>> BTW: libvirt supports configurations which don't involve a 'vnetbr0' >>> bridge. In this case, you will have to setup tc on the actual tap >>> device, right? >> >> In those cases, we haven't historically set firewall rules, so >> users were on their own, so in that sense, it isn't a regression >> we need to solve. Also in those cases, the DHCP daemon would be >> off-host, and so packets we're getting back from it ought to >> have a checksum present, as they've been over a physical link. > > OK. From my perspective, attaching the tc qdisc/filter/action to > individual guest devices would still be the cleaner solution. If there's > no mechanism to attach this to, it might be easier to just stick > everything to the bridge, of course. We do already use tc for bandwidth control, and when a <bandwidth> element is present in the interface (or the network) we run some tc commands as a port is added to a network to (I *think*) reserve a portion of the bridge's bandwidth for the new interface controls, and when a port is deleted we again run some tc commands to remove it. (mprivozn added all of this and so therefore knows the most about it) However, the tc commands on the network side (during the CreatePort API) I believe are done with only the network's bridge + the MAC address of the guest's NIC (and a "class_id" is created and sent back to QEMU and is there I guess used for some *other* tc commands to setup bandwidth upper limits for the tap after it's created.) More significantly, the tap device hasn't even been created yet at the time QEMU allocates the port from the network driver, so we don't even have a "name of future tap device" that we could send in the NetworkPortCreate API XML. So, I guess what all that means is that having the network driver run a tap-device-specific tc command won't be that simple. (Maybe we could add "virNetworkPortStart|Stop" APIs or something) > > Cheers, Phil > > [1] https://github.com/torvalds/linux/blob/master/net/core/dev.c#L4393 > [2] http://linux-ip.net/gl/tc-filters/
On Tue, Oct 29, 2024 at 11:21:36PM -0400, Laine Stump wrote: > On 10/29/24 3:41 PM, Phil Sutter wrote: > > On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote: > > > On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote: > > > > On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote: > > > > > On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote: > > > > > > Hi, > > > > > > > > > > > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote: > > > > > > > So when the extra rules are removed, then those same guests begin > > > > > > > working? (You can easily remove the checksum rules with: > > > > > > > > > > > > > > nft delete chain ip libvirt_network postroute_mangle > > > > > > > > > > > > > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to > > > > > > > work with the 0-checksum rules removed. In this case tcpdump on virbr0 > > > > > > > shows "bad cksum", but when I look at tcpdump on the guest, it shows > > > > > > > "udp cksum ok" though, so something else somewhere is setting the > > > > > > > checksum to the correct value. > > > > > > > > > > > > FWIW, I just tested an alternative workaround using tc. This works for > > > > > > me with a FreeBSD guest and NIC switched to either e1000 or virtio: > > > > > > > > > > > > # tc qd add dev vnetbr0 root handle 1: htb > > > > > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ > > > > > > u32 match ip sport 67 ffff match ip dport 68 ffff \ > > > > > > action csum ip and udp > > > > > > > > > > This feels like it is functionally closest to what we've had historically, > > > > > even though it is annoying to have to deal with 'tc' tool, in addition > > > > > to 'nft'. So I'm thinking this is probably the way we'll have to go. > > > > > > > > Another ugly detail (inherent to 'tc') is that you have to attach a > > > > classful qdisc to the interface since otherwise you can't add a filter > > > > with attached action. While this may not be a problem in practice, there > > > > is this side-effect of setting up a HTB on the bridge which by default > > > > runs a "noqueue" qdisc. > > > > > > I'm not that familiar with 'tc'. > > > > > > Can you explain the functional effect of those 'qdisc' settings on > > > virbr0, as if I know nothing :-) > > > > 'tc' controls QoS in Linux. 'qdisc's define how congestion should be > > handled (basically: queue or drop, prioritization, etc). The default > > qdisc for virtual devices like bridge or veth is "noqueue" - it sets the > > device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it > > accordingly (calls dev_hard_start_xmit() after a few checks to make sure > > the device is working). > > > > HTB is a container of classes (for packet classification) which > > themselves hold qdiscs. On my system at least, it doesn't come with > > default classes and thus should not do much by itself (apart from > > running the filter for us which we want. Anything else is overhead. > > > > I'm not sure how much detail you need - "as if I know nothing" is a bit > > like naively typing 'find /' and wondering when it will end. ;) > > Please shoot if you need more details. For the time being, let me point > > at some howto[2] I wrote long ago. > > > > > > > > Another alternative might be to add the nftables rule for virtio-based > > > > > > guests only. > > > > > > > > > > The firewall rules are in a chain that's applied to all guests, > > > > > so we have no where to add a per-guest rule. > > > > > > > > With nftables, you may create a chain in netdev family which binds to > > > > the specific device(s) needing the hack. It needs maintenance after > > > > guest startup and shutdown, though. > > > > > > > > BTW: libvirt supports configurations which don't involve a 'vnetbr0' > > > > bridge. In this case, you will have to setup tc on the actual tap > > > > device, right? > > > > > > In those cases, we haven't historically set firewall rules, so > > > users were on their own, so in that sense, it isn't a regression > > > we need to solve. Also in those cases, the DHCP daemon would be > > > off-host, and so packets we're getting back from it ought to > > > have a checksum present, as they've been over a physical link. > > > > OK. From my perspective, attaching the tc qdisc/filter/action to > > individual guest devices would still be the cleaner solution. If there's > > no mechanism to attach this to, it might be easier to just stick > > everything to the bridge, of course. > > > We do already use tc for bandwidth control, and when a <bandwidth> element > is present in the interface (or the network) we run some tc commands as a > port is added to a network to (I *think*) reserve a portion of the bridge's > bandwidth for the new interface controls, and when a port is deleted we > again run some tc commands to remove it. (mprivozn added all of this and so > therefore knows the most about it) > > However, the tc commands on the network side (during the CreatePort API) I > believe are done with only the network's bridge + the MAC address of the > guest's NIC (and a "class_id" is created and sent back to QEMU and is there > I guess used for some *other* tc commands to setup bandwidth upper limits > for the tap after it's created.) > > More significantly, the tap device hasn't even been created yet at the time > QEMU allocates the port from the network driver, so we don't even have a > "name of future tap device" that we could send in the NetworkPortCreate API > XML. > > So, I guess what all that means is that having the network driver run a > tap-device-specific tc command won't be that simple. (Maybe we could add > "virNetworkPortStart|Stop" APIs or something) I would expect 'tc' rules to be set against virbr0, not the individual NICs. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On 10/30/24 4:43 AM, Daniel P. Berrangé wrote: > On Tue, Oct 29, 2024 at 11:21:36PM -0400, Laine Stump wrote: >> On 10/29/24 3:41 PM, Phil Sutter wrote: >>> On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote: >>>> On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote: >>>>> On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote: >>>>>> On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote: >>>>>>> Hi, >>>>>>> >>>>>>> On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote: >>>>>>>> So when the extra rules are removed, then those same guests begin >>>>>>>> working? (You can easily remove the checksum rules with: >>>>>>>> >>>>>>>> nft delete chain ip libvirt_network postroute_mangle >>>>>>>> >>>>>>>> BTW, I just now tried an e1000e NIC on Fedora guest and it continues to >>>>>>>> work with the 0-checksum rules removed. In this case tcpdump on virbr0 >>>>>>>> shows "bad cksum", but when I look at tcpdump on the guest, it shows >>>>>>>> "udp cksum ok" though, so something else somewhere is setting the >>>>>>>> checksum to the correct value. >>>>>>> >>>>>>> FWIW, I just tested an alternative workaround using tc. This works for >>>>>>> me with a FreeBSD guest and NIC switched to either e1000 or virtio: >>>>>>> >>>>>>> # tc qd add dev vnetbr0 root handle 1: htb >>>>>>> # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ >>>>>>> u32 match ip sport 67 ffff match ip dport 68 ffff \ >>>>>>> action csum ip and udp >>>>>> >>>>>> This feels like it is functionally closest to what we've had historically, >>>>>> even though it is annoying to have to deal with 'tc' tool, in addition >>>>>> to 'nft'. So I'm thinking this is probably the way we'll have to go. >>>>> >>>>> Another ugly detail (inherent to 'tc') is that you have to attach a >>>>> classful qdisc to the interface since otherwise you can't add a filter >>>>> with attached action. While this may not be a problem in practice, there >>>>> is this side-effect of setting up a HTB on the bridge which by default >>>>> runs a "noqueue" qdisc. >>>> >>>> I'm not that familiar with 'tc'. >>>> >>>> Can you explain the functional effect of those 'qdisc' settings on >>>> virbr0, as if I know nothing :-) >>> >>> 'tc' controls QoS in Linux. 'qdisc's define how congestion should be >>> handled (basically: queue or drop, prioritization, etc). The default >>> qdisc for virtual devices like bridge or veth is "noqueue" - it sets the >>> device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it >>> accordingly (calls dev_hard_start_xmit() after a few checks to make sure >>> the device is working). >>> >>> HTB is a container of classes (for packet classification) which >>> themselves hold qdiscs. On my system at least, it doesn't come with >>> default classes and thus should not do much by itself (apart from >>> running the filter for us which we want. Anything else is overhead. >>> >>> I'm not sure how much detail you need - "as if I know nothing" is a bit >>> like naively typing 'find /' and wondering when it will end. ;) >>> Please shoot if you need more details. For the time being, let me point >>> at some howto[2] I wrote long ago. >>> >>>>>>> Another alternative might be to add the nftables rule for virtio-based >>>>>>> guests only. >>>>>> >>>>>> The firewall rules are in a chain that's applied to all guests, >>>>>> so we have no where to add a per-guest rule. >>>>> >>>>> With nftables, you may create a chain in netdev family which binds to >>>>> the specific device(s) needing the hack. It needs maintenance after >>>>> guest startup and shutdown, though. >>>>> >>>>> BTW: libvirt supports configurations which don't involve a 'vnetbr0' >>>>> bridge. In this case, you will have to setup tc on the actual tap >>>>> device, right? >>>> >>>> In those cases, we haven't historically set firewall rules, so >>>> users were on their own, so in that sense, it isn't a regression >>>> we need to solve. Also in those cases, the DHCP daemon would be >>>> off-host, and so packets we're getting back from it ought to >>>> have a checksum present, as they've been over a physical link. >>> >>> OK. From my perspective, attaching the tc qdisc/filter/action to >>> individual guest devices would still be the cleaner solution. If there's >>> no mechanism to attach this to, it might be easier to just stick >>> everything to the bridge, of course. >> >> >> We do already use tc for bandwidth control, and when a <bandwidth> element >> is present in the interface (or the network) we run some tc commands as a >> port is added to a network to (I *think*) reserve a portion of the bridge's >> bandwidth for the new interface controls, and when a port is deleted we >> again run some tc commands to remove it. (mprivozn added all of this and so >> therefore knows the most about it) >> >> However, the tc commands on the network side (during the CreatePort API) I >> believe are done with only the network's bridge + the MAC address of the >> guest's NIC (and a "class_id" is created and sent back to QEMU and is there >> I guess used for some *other* tc commands to setup bandwidth upper limits >> for the tap after it's created.) >> >> More significantly, the tap device hasn't even been created yet at the time >> QEMU allocates the port from the network driver, so we don't even have a >> "name of future tap device" that we could send in the NetworkPortCreate API >> XML. >> >> So, I guess what all that means is that having the network driver run a >> tap-device-specific tc command won't be that simple. (Maybe we could add >> "virNetworkPortStart|Stop" APIs or something) > > I would expect 'tc' rules to be set against virbr0, not the individual > NICs. You may be right; I haven't looked at the qemu side to be honest. When I went in to gather enough info to respond last night I was only interested in figuring out how easy/difficult it would be to have the network driver issue tc commands with the tap device rather than the bridge *since Phil suggested that would be better/less bad/something like that. So I mainly looked at what info the network driver currently uses for its tc commands, and whether or not the tap device is created before or after the NetworkPortCreate (thought I recalled "after", and this is correct - makes sense, since we can't know until after virNetworkPortCreate whether or not the interface will even be using a tap device). (Side note: for unrelated reasons it could be useful for the network driver to create the tap device actually - that could allow unprivileged virtqemud to directly/legitimately use the system virtnetworkd networks. Probably ot a simple task though.)
On Wed, Oct 30, 2024 at 07:41:02AM -0400, Laine Stump wrote: > On 10/30/24 4:43 AM, Daniel P. Berrangé wrote: > > On Tue, Oct 29, 2024 at 11:21:36PM -0400, Laine Stump wrote: > > > > > > We do already use tc for bandwidth control, and when a <bandwidth> element > > > is present in the interface (or the network) we run some tc commands as a > > > port is added to a network to (I *think*) reserve a portion of the bridge's > > > bandwidth for the new interface controls, and when a port is deleted we > > > again run some tc commands to remove it. (mprivozn added all of this and so > > > therefore knows the most about it) > > > > > > However, the tc commands on the network side (during the CreatePort API) I > > > believe are done with only the network's bridge + the MAC address of the > > > guest's NIC (and a "class_id" is created and sent back to QEMU and is there > > > I guess used for some *other* tc commands to setup bandwidth upper limits > > > for the tap after it's created.) > > > > > > More significantly, the tap device hasn't even been created yet at the time > > > QEMU allocates the port from the network driver, so we don't even have a > > > "name of future tap device" that we could send in the NetworkPortCreate API > > > XML. > > > > > > So, I guess what all that means is that having the network driver run a > > > tap-device-specific tc command won't be that simple. (Maybe we could add > > > "virNetworkPortStart|Stop" APIs or something) > > > > I would expect 'tc' rules to be set against virbr0, not the individual > > NICs. > > You may be right; I haven't looked at the qemu side to be honest. When I > went in to gather enough info to respond last night I was only interested in > figuring out how easy/difficult it would be to have the network driver issue > tc commands with the tap device rather than the bridge *since Phil suggested > that would be better/less bad/something like that. So I mainly looked at > what info the network driver currently uses for its tc commands, and whether > or not the tap device is created before or after the NetworkPortCreate > (thought I recalled "after", and this is correct - makes sense, since we > can't know until after virNetworkPortCreate whether or not the interface > will even be using a tap device). I wasn't thinking about the complexity, but rather the redundancy. We need this rule set for every NIC, as it is way too difficult for us to accurately predict exactly which scenarios will hit the DHCP checksumm problem. With that goal, then it is pointless to inject the same logic on every tap device, when we can inject it on the bridge device once and be done with it. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On 10/29/24 11:21 PM, Laine Stump wrote: > On 10/29/24 3:41 PM, Phil Sutter wrote: >> On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote: >>> On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote: >>>> On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote: >>>>> On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote: >>>>>> Hi, >>>>>> >>>>>> On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote: >>>>>>> So when the extra rules are removed, then those same guests begin >>>>>>> working? (You can easily remove the checksum rules with: >>>>>>> >>>>>>> nft delete chain ip libvirt_network postroute_mangle >>>>>>> >>>>>>> BTW, I just now tried an e1000e NIC on Fedora guest and it >>>>>>> continues to >>>>>>> work with the 0-checksum rules removed. In this case tcpdump on >>>>>>> virbr0 >>>>>>> shows "bad cksum", but when I look at tcpdump on the guest, it shows >>>>>>> "udp cksum ok" though, so something else somewhere is setting the >>>>>>> checksum to the correct value. >>>>>> >>>>>> FWIW, I just tested an alternative workaround using tc. This works >>>>>> for >>>>>> me with a FreeBSD guest and NIC switched to either e1000 or virtio: >>>>>> >>>>>> # tc qd add dev vnetbr0 root handle 1: htb >>>>>> # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \ >>>>>> u32 match ip sport 67 ffff match ip dport 68 ffff \ >>>>>> action csum ip and udp >>>>> >>>>> This feels like it is functionally closest to what we've had >>>>> historically, >>>>> even though it is annoying to have to deal with 'tc' tool, in addition >>>>> to 'nft'. So I'm thinking this is probably the way we'll have to go. >>>> >>>> Another ugly detail (inherent to 'tc') is that you have to attach a >>>> classful qdisc to the interface since otherwise you can't add a filter >>>> with attached action. While this may not be a problem in practice, >>>> there >>>> is this side-effect of setting up a HTB on the bridge which by default >>>> runs a "noqueue" qdisc. >>> >>> I'm not that familiar with 'tc'. >>> >>> Can you explain the functional effect of those 'qdisc' settings on >>> virbr0, as if I know nothing :-) >> >> 'tc' controls QoS in Linux. 'qdisc's define how congestion should be >> handled (basically: queue or drop, prioritization, etc). The default >> qdisc for virtual devices like bridge or veth is "noqueue" - it sets the >> device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it >> accordingly (calls dev_hard_start_xmit() after a few checks to make sure >> the device is working). >> >> HTB is a container of classes (for packet classification) which >> themselves hold qdiscs. On my system at least, it doesn't come with >> default classes and thus should not do much by itself (apart from >> running the filter for us which we want. Anything else is overhead. >> >> I'm not sure how much detail you need - "as if I know nothing" is a bit >> like naively typing 'find /' and wondering when it will end. ;) >> Please shoot if you need more details. For the time being, let me point >> at some howto[2] I wrote long ago. >> >>>>>> Another alternative might be to add the nftables rule for virtio- >>>>>> based >>>>>> guests only. >>>>> >>>>> The firewall rules are in a chain that's applied to all guests, >>>>> so we have no where to add a per-guest rule. >>>> >>>> With nftables, you may create a chain in netdev family which binds to >>>> the specific device(s) needing the hack. It needs maintenance after >>>> guest startup and shutdown, though. >>>> >>>> BTW: libvirt supports configurations which don't involve a 'vnetbr0' >>>> bridge. In this case, you will have to setup tc on the actual tap >>>> device, right? >>> >>> In those cases, we haven't historically set firewall rules, so >>> users were on their own, so in that sense, it isn't a regression >>> we need to solve. Also in those cases, the DHCP daemon would be >>> off-host, and so packets we're getting back from it ought to >>> have a checksum present, as they've been over a physical link. >> >> OK. From my perspective, attaching the tc qdisc/filter/action to >> individual guest devices would still be the cleaner solution. If there's >> no mechanism to attach this to, it might be easier to just stick >> everything to the bridge, of course. > > > We do already use tc for bandwidth control, and when a <bandwidth> > element is present in the interface (or the network) we run some tc > commands as a port is added to a network to (I *think*) reserve a > portion of the bridge's bandwidth for the new interface controls, and > when a port is deleted we again run some tc commands to remove it. > (mprivozn added all of this and so therefore knows the most about it) > > However, the tc commands on the network side (during the CreatePort API) > I believe are done with only the network's bridge + the MAC address of > the guest's NIC (and a "class_id" is created and sent back to QEMU and > is there I guess used for some *other* tc commands to setup bandwidth > upper limits for the tap after it's created.) > > More significantly, the tap device hasn't even been created yet at the > time QEMU allocates the port from the network driver, so we don't even > have a "name of future tap device" that we could send in the > NetworkPortCreate API XML. > > So, I guess what all that means is that having the network driver run a > tap-device-specific tc command won't be that simple. (Maybe we could add > "virNetworkPortStart|Stop" APIs or something) Of course if we were going to do that, we might as well instead add a tap-specific nftables rule at that same time (but only for virtio-net (and e1000e?) devices...
On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote: > A key difference that is probably relevant is that netbsd is > using an e1000 NIC in QEMU, while openbsd is using a virtio-net > NIC. At least when created by virt-manager. > > AFAIR, QEMU's magic checksum offload only happens for virtio-net, > so presumably our rules are incompatible with non-virtio-net NICs > in someway. Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000, since virtio drivers are not available there; moreover, if I switch a random Linux guest from virtio-net to e1000 I can reproduce the issue there as well. -- Andrea Bolognani / Red Hat / Virtualization
On Tue, Oct 29, 2024 at 01:09:00PM +0000, Andrea Bolognani wrote: > On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote: > > A key difference that is probably relevant is that netbsd is > > using an e1000 NIC in QEMU, while openbsd is using a virtio-net > > NIC. At least when created by virt-manager. > > > > AFAIR, QEMU's magic checksum offload only happens for virtio-net, > > so presumably our rules are incompatible with non-virtio-net NICs > > in someway. > > Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000, > since virtio drivers are not available there; moreover, if I switch a > random Linux guest from virtio-net to e1000 I can reproduce the issue > there as well. Incidentally, I think this has crossed the threshold where the cure is worse than the disease. We cannot ship the forthcoming libvirt release with a checksum "fix" that breaks all usage of NICs that aren't virtio-net, as that guarantees brokeness for all historical OS. If we can't quickly find a way to improve this, I think we need to revert (or disable) the checksum zero'ing fix for this release and spend more time investigating it. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On 10/29/24 9:14 AM, Daniel P. Berrangé wrote: > On Tue, Oct 29, 2024 at 01:09:00PM +0000, Andrea Bolognani wrote: >> On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote: >>> A key difference that is probably relevant is that netbsd is >>> using an e1000 NIC in QEMU, while openbsd is using a virtio-net >>> NIC. At least when created by virt-manager. >>> >>> AFAIR, QEMU's magic checksum offload only happens for virtio-net, >>> so presumably our rules are incompatible with non-virtio-net NICs >>> in someway. >> >> Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000, >> since virtio drivers are not available there; moreover, if I switch a >> random Linux guest from virtio-net to e1000 I can reproduce the issue >> there as well. > > Incidentally, I think this has crossed the threshold where the cure is > worse than the disease. > > We cannot ship the forthcoming libvirt release with a checksum "fix" > that breaks all usage of NICs that aren't virtio-net, as that guarantees > brokeness for all historical OS. > > If we can't quickly find a way to improve this, I think we need to > revert (or disable) the checksum zero'ing fix for this release and > spend more time investigating it. I sadly agree :-/ (although I will point out that it's not *all* non-virtio, since e1000e seems to work). I am leaving the house now, but will make a patch to either disable or the revert the "fix" when I get back in a few hours. (I would rather leave it in with a switch or something so we could continue using standard builds to test things, but that might take too long and would definitely be too complicated for a last instant push) > > With regards, > Daniel
On 10/29/24 10:51 AM, Laine Stump wrote: > On 10/29/24 9:14 AM, Daniel P. Berrangé wrote: >> On Tue, Oct 29, 2024 at 01:09:00PM +0000, Andrea Bolognani wrote: >>> On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote: >>>> A key difference that is probably relevant is that netbsd is >>>> using an e1000 NIC in QEMU, while openbsd is using a virtio-net >>>> NIC. At least when created by virt-manager. >>>> >>>> AFAIR, QEMU's magic checksum offload only happens for virtio-net, >>>> so presumably our rules are incompatible with non-virtio-net NICs >>>> in someway. >>> >>> Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000, >>> since virtio drivers are not available there; moreover, if I switch a >>> random Linux guest from virtio-net to e1000 I can reproduce the issue >>> there as well. >> >> Incidentally, I think this has crossed the threshold where the cure is >> worse than the disease. >> >> We cannot ship the forthcoming libvirt release with a checksum "fix" >> that breaks all usage of NICs that aren't virtio-net, as that guarantees >> brokeness for all historical OS. >> >> If we can't quickly find a way to improve this, I think we need to >> revert (or disable) the checksum zero'ing fix for this release and >> spend more time investigating it. > > I sadly agree :-/ (although I will point out that it's not *all* non- > virtio, since e1000e seems to work). I am leaving the house now, but > will make a patch to either disable or the revert the "fix" when I get > back in a few hours. (I would rather leave it in with a switch or > something so we could continue using standard builds to test things, but > that might take too long and would definitely be too complicated for a > last instant push) I just sent a revert patch to the list in case someone wants to ACK it. It was 100% done with "git revert", so no chance for typos.
On Tue, Oct 29, 2024 at 10:51:57AM -0400, Laine Stump wrote: > On 10/29/24 9:14 AM, Daniel P. Berrangé wrote: > > On Tue, Oct 29, 2024 at 01:09:00PM +0000, Andrea Bolognani wrote: > > > On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote: > > > > A key difference that is probably relevant is that netbsd is > > > > using an e1000 NIC in QEMU, while openbsd is using a virtio-net > > > > NIC. At least when created by virt-manager. > > > > > > > > AFAIR, QEMU's magic checksum offload only happens for virtio-net, > > > > so presumably our rules are incompatible with non-virtio-net NICs > > > > in someway. > > > > > > Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000, > > > since virtio drivers are not available there; moreover, if I switch a > > > random Linux guest from virtio-net to e1000 I can reproduce the issue > > > there as well. > > > > Incidentally, I think this has crossed the threshold where the cure is > > worse than the disease. > > > > We cannot ship the forthcoming libvirt release with a checksum "fix" > > that breaks all usage of NICs that aren't virtio-net, as that guarantees > > brokeness for all historical OS. > > > > If we can't quickly find a way to improve this, I think we need to > > revert (or disable) the checksum zero'ing fix for this release and > > spend more time investigating it. > > I sadly agree :-/ (although I will point out that it's not *all* non-virtio, > since e1000e seems to work). I am leaving the house now, but will make a > patch to either disable or the revert the "fix" when I get back in a few > hours. (I would rather leave it in with a switch or something so we could > continue using standard builds to test things, but that might take too long > and would definitely be too complicated for a last instant push) e1000e has support for vnet-hdr similar to virtio-net. With virtio-net, if I disable vhost-net, with <driver type=qemu> then code in virtio-net.c will manually fixup DHCP checksums. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Tue, Oct 29, 2024 at 10:51:57AM -0400, Laine Stump wrote: > On 10/29/24 9:14 AM, Daniel P. Berrangé wrote: > > We cannot ship the forthcoming libvirt release with a checksum "fix" > > that breaks all usage of NICs that aren't virtio-net, as that guarantees > > brokeness for all historical OS. > > > > If we can't quickly find a way to improve this, I think we need to > > revert (or disable) the checksum zero'ing fix for this release and > > spend more time investigating it. > > I sadly agree :-/ (although I will point out that it's not *all* non-virtio, > since e1000e seems to work). I am leaving the house now, but will make a > patch to either disable or the revert the "fix" when I get back in a few > hours. (I would rather leave it in with a switch or something so we could > continue using standard builds to test things, but that might take too long > and would definitely be too complicated for a last instant push) I vote against adding yet another knob. If we can't fix things in time, let's just revert the change and work on a fully working solution over the next development cycle. -- Andrea Bolognani / Red Hat / Virtualization
On Tue, Oct 29, 2024 at 01:14:28PM +0000, Daniel P. Berrangé wrote: > On Tue, Oct 29, 2024 at 01:09:00PM +0000, Andrea Bolognani wrote: > > On Tue, Oct 29, 2024 at 12:46:55PM +0000, Daniel P. Berrangé wrote: > > > A key difference that is probably relevant is that netbsd is > > > using an e1000 NIC in QEMU, while openbsd is using a virtio-net > > > NIC. At least when created by virt-manager. > > > > > > AFAIR, QEMU's magic checksum offload only happens for virtio-net, > > > so presumably our rules are incompatible with non-virtio-net NICs > > > in someway. > > > > Yes, that's it! The GNU/Hurd and Haiku guests are also using e1000, > > since virtio drivers are not available there; moreover, if I switch a > > random Linux guest from virtio-net to e1000 I can reproduce the issue > > there as well. > > Incidentally, I think this has crossed the threshold where the cure is > worse than the disease. > > We cannot ship the forthcoming libvirt release with a checksum "fix" > that breaks all usage of NICs that aren't virtio-net, as that guarantees > brokeness for all historical OS. > > If we can't quickly find a way to improve this, I think we need to > revert (or disable) the checksum zero'ing fix for this release and > spend more time investigating it. Agreed. Going back to the drawing board in a sense, have we figured out why FreeBSD's DHCP client is unhappy with the previous arrangement? It seems to me that fixing it might be the most sensible course of action. -- Andrea Bolognani / Red Hat / Virtualization
On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote: > On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote: > > On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote: > > > I did some testing of my own and I can confirm that FreeBSD and > > > OpenBSD are fine with this change, as are various Linux flavors > > > (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu). > > > > > > However, a few other operating systems aren't: namely GNU/Hurd, Haiku > > > and NetBSD break with this change. Interestingly, these were all fine > > > with the nftables backend before it. > > > > Well that's odd. I've checked NetBSD source code and found no less > > than 3 DHCP client impls, and all of them cope with checksum == 0. > > > > https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497 > > > > https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3507 > > > > https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/packet.c#L373 > > > > the middle impl also directly copes with partial checksums > > The boot log contains > > Starting dhcpcd. > wm0: checksum failure from 192.168.124.1 > > so I guess the second implementation is the relevant one. Yes, we can see that message here: https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhcp.c#L3609 if (!checksums_valid(data, &from, bpf_flags)) { logerrx("%s: checksum failure from %s", ifp->name, inet_ntoa(from)); return; } and the 'checksums_valid' method is the one I point to in the 2nd URL, which has the code to check whether the incoming checksum is zeroed: if (udp.uh_sum == 0) return true; so I'm really surprised you see a failure with this. It is as if you don't have Laine's fix present at all. > > > Not identified the Hurd/Haiku DHCP client code yet... > > I'm using Debian GNU/Hurd, so the DHCP client is the same as regular > Debian (ISC DHCP). The source can be found at > > https://deb.debian.org/debian-ports/pool-hurd-i386/main/i/isc-dhcp/ > > The version is a bit old and there's the tiniest amount of patching > compared to the Linux build, specifically: > > --- isc-dhcp-4.4.3-P1-1.1/debian/patches/bind-fix 1970-01-01 > 01:00:00.000000000 +0100 > +++ isc-dhcp-4.4.3-P1-1.1+hurd.1/debian/patches/bind-fix > 2023-02-15 15:39:49.000000000 +0100 > @@ -0,0 +1,26 @@ > +Index: isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c > +=================================================================== > +--- isc-dhcp-4.4.3-P1-build.orig/bind/bind-9.11.36/lib/isc/unix/socket.c > ++++ isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c > +@@ -2633,7 +2633,7 @@ opensocket(isc__socketmgr_t *manager, is > + char strbuf[ISC_STRERRORSIZE]; > + const char *err = "socket"; > + int tries = 0; > +-#if defined(USE_CMSG) || defined(SO_BSDCOMPAT) || defined(SO_NOSIGPIPE) > ++#if 1 > + int on = 1; > + #endif > + #if defined(SO_RCVBUF) > > I'm not sure whether this could be relevant to the issue at hand. That impl has the explicit check for all-zeros checksum. > To clarify, this is something that needs to be handled at the > userspace level, no kernel changes required? And clearly it affects > DHCP, but what about other protocols? Are we confident those will > cope just fine? It would affect *any* application which is reading raw packets and manually verifying the IP checksum. DHCP is the common case, but there could be others. Historically our iptables rule only ever fixed up DHCP packets and we've not seen other complaints. So if something else is affected in the real world, it is sufficiently rare that the few people affected have not noticed and/or cared enough to escalate it. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Tue, Oct 29, 2024 at 11:21:44AM +0000, Daniel P. Berrangé wrote: > On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote: > > I'm using Debian GNU/Hurd, so the DHCP client is the same as regular > > Debian (ISC DHCP). The source can be found at > > > > https://deb.debian.org/debian-ports/pool-hurd-i386/main/i/isc-dhcp/ > > > > The version is a bit old and there's the tiniest amount of patching > > compared to the Linux build, specifically: > > > > --- isc-dhcp-4.4.3-P1-1.1/debian/patches/bind-fix 1970-01-01 > > 01:00:00.000000000 +0100 > > +++ isc-dhcp-4.4.3-P1-1.1+hurd.1/debian/patches/bind-fix > > 2023-02-15 15:39:49.000000000 +0100 > > @@ -0,0 +1,26 @@ > > +Index: isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c > > +=================================================================== > > +--- isc-dhcp-4.4.3-P1-build.orig/bind/bind-9.11.36/lib/isc/unix/socket.c > > ++++ isc-dhcp-4.4.3-P1-build/bind/bind-9.11.36/lib/isc/unix/socket.c > > +@@ -2633,7 +2633,7 @@ opensocket(isc__socketmgr_t *manager, is > > + char strbuf[ISC_STRERRORSIZE]; > > + const char *err = "socket"; > > + int tries = 0; > > +-#if defined(USE_CMSG) || defined(SO_BSDCOMPAT) || defined(SO_NOSIGPIPE) > > ++#if 1 > > + int on = 1; > > + #endif > > + #if defined(SO_RCVBUF) > > > > I'm not sure whether this could be relevant to the issue at hand. > > That impl has the explicit check for all-zeros checksum. So you think it's affected by the same unexplicable behavior as NetBSD? If you tell me how to run tcpdump the right way, I can do that for you without you having to set up a GNU/Hurd guest. > > To clarify, this is something that needs to be handled at the > > userspace level, no kernel changes required? And clearly it affects > > DHCP, but what about other protocols? Are we confident those will > > cope just fine? > > It would affect *any* application which is reading raw packets > and manually verifying the IP checksum. DHCP is the common > case, but there could be others. > > Historically our iptables rule only ever fixed up DHCP packets > and we've not seen other complaints. So if something else is > affected in the real world, it is sufficiently rare that the > few people affected have not noticed and/or cared enough to > escalate it. That sounds reassuring :) -- Andrea Bolognani / Red Hat / Virtualization
© 2016 - 2025 Red Hat, Inc.