net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash

[PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash

Posted by Dmitry Z via B4 Relay 4 months ago

From: Dmitry Z <demetriousz@proton.me>

In an IPv6 ECMP scenario, if a multi-homed host initiates a connection,
`saddr` may remain empty during the initial call to `rt6_multipath_hash()`.
It gets filled later, once the outgoing interface (OIF) is determined and
`ipv6_dev_get_saddr()` (RFC 6724) selects the proper source address.

In some cases, this can cause the flow to switch paths: the first packets
go via one link, while the rest of the flow is routed over another.

A practical example is a Git-over-SSH session. When running `git fetch`,
the initial control traffic uses TOS 0x48, but data transfer switches to
TOS 0x20. This triggers a new hash computation, and at that time `saddr`
is already populated. As a result, packets with TOS 0x20 may be sent via
a different OIF, because `rt6_multipath_hash()` now produces a different
result.

This issue can happen even if the matched IPv6 route specifies a `src`
(preferred source) address. The actual impact depends on the network
topology. In my setup, the flow was redirected to a different switch and
reached another host, leading to TCP RSTs from the host where the session
was never established.

Possible workarounds:
1. Use netfilter to normalize the DSCP field before route lookup.
   (breaks DSCP/TOS assignment set by the socket)
2. Exclude the source address from the ECMP hash via sysctl knobs.
   (excludes an important part from hash computation)

This patch uses the `fib6_prefsrc.addr` value from the selected route to
populate `saddr` before ECMP hash computation, ensuring consistent path
selection across the flow.

Signed-off-by: Dmitry Z <demetriousz@proton.me>
---
 net/ipv6/route.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 3299cfa12e21c96ecb5c4dea5f305d5f7ce16084..d2ecf16417a6f0fc6956f0ebff3d8dea593da059 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2270,6 +2270,11 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 	if (res.f6i == net->ipv6.fib6_null_entry)
 		goto out;
 
+	if (ipv6_addr_any(&fl6->saddr) &&
+	    !ipv6_addr_any(&res.f6i->fib6_prefsrc.addr)) {
+		fl6->saddr = res.f6i->fib6_prefsrc.addr;
+	}
+
 	fib6_select_path(net, &res, fl6, oif, false, skb, strict);
 
 	/*Search through exception table */

---
base-commit: e5f0a698b34ed76002dc5cff3804a61c80233a7a
change-id: 20251005-ipv6-set-saddr-to-prefsrc-before-hash-to-stabilize-ecmp-6d646ec96ac4

Best regards,
-- 
Dmitry Z <demetriousz@proton.me>

Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash

Posted by Ido Schimmel 4 months ago

On Sun, Oct 05, 2025 at 08:49:55PM +0000, Dmitry Z via B4 Relay wrote:
> From: Dmitry Z <demetriousz@proton.me>
> 
> In an IPv6 ECMP scenario, if a multi-homed host initiates a connection,
> `saddr` may remain empty during the initial call to `rt6_multipath_hash()`.
> It gets filled later, once the outgoing interface (OIF) is determined and
> `ipv6_dev_get_saddr()` (RFC 6724) selects the proper source address.
> 
> In some cases, this can cause the flow to switch paths: the first packets
> go via one link, while the rest of the flow is routed over another.
> 
> A practical example is a Git-over-SSH session. When running `git fetch`,
> the initial control traffic uses TOS 0x48, but data transfer switches to
> TOS 0x20. This triggers a new hash computation, and at that time `saddr`
> is already populated. As a result, packets with TOS 0x20 may be sent via
> a different OIF, because `rt6_multipath_hash()` now produces a different
> result.
> 
> This issue can happen even if the matched IPv6 route specifies a `src`
> (preferred source) address. The actual impact depends on the network
> topology. In my setup, the flow was redirected to a different switch and
> reached another host, leading to TCP RSTs from the host where the session
> was never established.
> 
> Possible workarounds:
> 1. Use netfilter to normalize the DSCP field before route lookup.
>    (breaks DSCP/TOS assignment set by the socket)
> 2. Exclude the source address from the ECMP hash via sysctl knobs.
>    (excludes an important part from hash computation)

Two more options (which I didn't test):

3. Setting "IPQoS" in SSH config to a single value. It should prevent
OpenSSH from switching DSCP while the connection is alive. Switching
DSCP triggers a route lookup since commit 305e95bb893c ("net-ipv6:
changes to ->tclass (via IPV6_TCLASS) should sk_dst_reset()"). To be
clear, I don't think this commit is problematic as there are other
events that can invalidate cached dst entries.

4. Setting "BindAddress" in SSH config. It should make sure that the
same source address is used for all route lookups.
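
At the socket level, option 4 presumably amounts to calling bind() with
the chosen source address before connect(), so that the source address
is already known at the first route lookup. A minimal sketch under that
assumption follows; the helper name tcp6_socket_bound_to() is made up
for illustration and is not taken from OpenSSH.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a TCP/IPv6 socket and bind it to the
 * given source address. Port 0 leaves the port choice to the kernel.
 * Returns the socket fd, or -1 on any failure.
 */
int tcp6_socket_bound_to(const char *src_text)
{
	struct sockaddr_in6 src;
	int fd;

	memset(&src, 0, sizeof(src));
	src.sin6_family = AF_INET6;

	/* Parse first, so a bad address fails without creating a socket. */
	if (inet_pton(AF_INET6, src_text, &src.sin6_addr) != 1)
		return -1;

	fd = socket(AF_INET6, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	/* Fix the source address before any connect() is attempted. */
	if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}
```

A caller would then connect() the returned fd to the destination as
usual, for example after tcp6_socket_bound_to("2001:db8:aaaa::") on a
host that owns that address.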

> This patch uses the `fib6_prefsrc.addr` value from the selected route to
> populate `saddr` before ECMP hash computation, ensuring consistent path
> selection across the flow.

I'm not convinced the problem is in the kernel. As long as all the
packets are sent with the same 5-tuple, it's up to the network to
deliver them correctly. I don't know what your topology looks like, but
in the general case packets belonging to the same flow can be routed via
different paths over time. If multiple servers can service incoming SSH
connections, then there should be a stateful load balancer between them
and the clients so that packets belonging to the same flow are always
delivered to the same server. ECMP cannot be relied on to do load
balancing alone as it's stateless.

Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash

Posted by Dmitry 4 months ago

> Two more options (which I didn't test):
>
> 3. Setting "IPQoS" in SSH config to a single value. It should prevent
> OpenSSH from switching DSCP while the connection is alive. Switching
> DSCP triggers a route lookup since commit 305e95bb893c ("net-ipv6:
> changes to ->tclass (via IPV6_TCLASS) should sk_dst_reset()"). To be
> clear, I don't think this commit is problematic as there are other
> events that can invalidate cached dst entries.

I haven't tested this, but I assume it should work, since the IP header isn't
changed during an active connection.

> 4. Setting "BindAddress" in SSH config. It should make sure that the
> same source address is used for all route lookups.

Yes, I've tested this one, and it works. I was focused on finding a system-level
solution and didn't think about application-level settings.

> As long as all the packets are sent with the same 5-tuple.

The problem is that in the beginning the SADDR remains empty during hash
computation. It appears to be filled later, once the outgoing interface (OIF) is
determined.

Let's look at how to reproduce the issue:

Test lab topology:

+-----+   vlan=1        +-----+
|     +---------------->|     |
|HostA+---------------->|HostF|
|     |...              |     |
|     +---------------->|     |
+-----+   vlan=99       +-----+

HostA lo: 2001:db8:aaaa::
HostF lo: 2001:db8:ffff::

Host A has an ECMP route to 2001:db8:ffff:: with a specified source address
2001:db8:aaaa::, distributed across all VLANs toward Host F. I run git fetch on
Host A to transfer data from Host F.

PCAP Without the fix:

16:34:40.875734 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 98: vlan 49, p 0, ethertype IPv6 (0x86dd), (class 0x48,
flowlabel 0xfdf8e, hlim 64, next-header TCP (6) payload length: 40)
2001:db8:aaaa::.44690 > 2001:db8:ffff::.22: Flags [S], cksum 0x064b (incorrect
-> 0x5490), seq 827400610, win 64800, options [mss 1440,sackOK,TS val 1303683318
ecr 0,nop,wscale 7], length 0

<skipped>

16:34:41.566130 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 90: vlan 49, p 0, ethertype IPv6 (0x86dd), (class 0x48,
flowlabel 0xfdf8e, hlim 64, next-header TCP (6) payload length: 32)
2001:db8:aaaa::.44690 > 2001:db8:ffff::.22: Flags [.], cksum 0x0643 (incorrect
-> 0xd980), seq 3570, ack 4031, win 509, options [nop,nop,TS val 1303684009 ecr
3265960348], length 0

16:34:41.567338 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 234: vlan 83, p 0, ethertype IPv6 (0x86dd), (class 0x20,
flowlabel 0xfdf8e, hlim 64, next-header TCP (6) payload length: 176)
2001:db8:aaaa::.44690 > 2001:db8:ffff::.22: Flags [P.], cksum 0x06d3 (incorrect
-> 0xce55), seq 3570:3714, ack 4031, win 509, options [nop,nop,TS val 1303684009
ecr 3265960348], length 144

As you can see, packets of the same connection leave through different
interfaces (vlan 49, then vlan 83); this is the symptom of the issue. In a real
environment with multiple physical links (up to 6–8 interfaces), the same
problem can be observed as well.

I put some prints around ip6_multipath_hash_policy():

PRINTS Without the fix:

Oct 06 16:34:40 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=:: dst=2001:db8:ffff:: proto=6 hash=2109163277

Oct 06 16:34:41 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=2001:db8:aaaa:: dst=2001:db8:ffff:: proto=6 hash=3559450110

Oct 06 16:34:41 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=2001:db8:aaaa:: dst=2001:db8:ffff:: proto=6 hash=3559450110

As you can see, the saddr field is empty at the beginning of the connection,
which causes the hash to be different initially.
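
The mechanism can be illustrated with a minimal user-space sketch (not
kernel code). The names struct toy_flow, toy_multipath_hash() and
toy_fill_prefsrc() are made up for illustration, and FNV-1a is only a
stand-in for the kernel's real flow hash. The sketch shows that an
all-zero saddr and the later-selected saddr are different hash inputs,
and that copying the route's preferred source in first, as the patch
does, makes the first lookup hash the same input as the later ones.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy flow key: only the fields shown in the debug prints above. */
struct toy_flow {
	uint8_t saddr[16];
	uint8_t daddr[16];
	uint8_t proto;
};

/* FNV-1a step over a byte range; a stand-in, NOT the kernel's hash. */
static uint32_t fnv1a(uint32_t h, const uint8_t *p, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

uint32_t toy_multipath_hash(const struct toy_flow *fl)
{
	uint32_t h = 2166136261u;

	h = fnv1a(h, fl->saddr, sizeof(fl->saddr));
	h = fnv1a(h, fl->daddr, sizeof(fl->daddr));
	h = fnv1a(h, &fl->proto, sizeof(fl->proto));
	return h;
}

/* Returns 1 if the address is all zeroes (the unspecified address). */
static int toy_addr_any(const uint8_t addr[16])
{
	static const uint8_t zero[16];

	return memcmp(addr, zero, sizeof(zero)) == 0;
}

/* Mirrors the patch logic: fill saddr only if it is still empty. */
void toy_fill_prefsrc(struct toy_flow *fl, const uint8_t prefsrc[16])
{
	if (toy_addr_any(fl->saddr) && !toy_addr_any(prefsrc))
		memcpy(fl->saddr, prefsrc, 16);
}
```

Calling toy_fill_prefsrc() before toy_multipath_hash() on the first
lookup yields the same value as hashing a flow whose saddr was already
set, which is the consistency the patch aims for.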

PCAP With the fix:

16:42:27.624160 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 98: vlan 70, p 0, ethertype IPv6 (0x86dd), (class 0x48,
flowlabel 0xcff07, hlim 64, next-header TCP (6) payload length: 40)
2001:db8:aaaa::.43660 > 2001:db8:ffff::.22: Flags [S], cksum 0x064b (incorrect
-> 0x174e), seq 1032224426, win 64800, options [mss 1440,sackOK,TS val
3603754981 ecr 0,nop,wscale 10], length 0

<skipped>

16:42:28.328572 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 90: vlan 70, p 0, ethertype IPv6 (0x86dd), (class 0x48,
flowlabel 0xcff07, hlim 64, next-header TCP (6) payload length: 32)
2001:db8:aaaa::.43660 > 2001:db8:ffff::.22: Flags [.], cksum 0x0643 (incorrect
-> 0xcd3f), seq 3570, ack 4031, win 66, options [nop,nop,TS val 3603755686 ecr
3266427110], length 0

16:42:28.329511 52:54:00:05:66:74 > 52:54:00:8d:24:26, ethertype 802.1Q
(0x8100), length 234: vlan 70, p 0, ethertype IPv6 (0x86dd), (class 0x20,
flowlabel 0xcff07, hlim 64, next-header TCP (6) payload length: 176)
2001:db8:aaaa::.43660 > 2001:db8:ffff::.22: Flags [P.], cksum 0x06d3 (incorrect
-> 0x3fd6), seq 3570:3714, ack 4031, win 66, options [nop,nop,TS val 3603755686
ecr 3266427110], length 144

As you can see, all three packets now leave on the same VLAN (vlan 70).

PRINTS With the fix:

Oct 06 16:42:27 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=2001:db8:aaaa:: dst=2001:db8:ffff:: proto=6 hash=3025767165
Oct 06 16:42:28 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=2001:db8:aaaa:: dst=2001:db8:ffff:: proto=6 hash=3025767165
Oct 06 16:42:28 arch1 kernel: IPv6: fib6 ip6_multipath_hash_policy DEBUG:
src=2001:db8:aaaa:: dst=2001:db8:ffff:: proto=6 hash=3025767165

So, with the fix applied, we populate SADDR and calculate the hash correctly.  I
think it's reasonable to respect the src field in the IPv6 route when computing
the hash.

> I'm not convinced the problem is in the kernel. As long as all the
> packets are sent with the same 5-tuple, it's up to the network to
> deliver them correctly. I don't know what your topology looks like, but
> in the general case packets belonging to the same flow can be routed via
> different paths over time. If multiple servers can service incoming SSH
> connections, then there should be a stateful load balancer between them
> and the clients so that packets belonging to the same flow are always
> delivered to the same server. ECMP cannot be relied on to do load
> balancing alone as it's stateless.

Well, it seems the current implementation doesn't properly respect the SRC field
and handles it inconsistently - it is ignored at the start of a session and only
taken into account once the session is established.

> as long as all the packets are sent with the same 5-tuple, it’s up to the
> network to deliver them correctly

If the 5-tuple is not changed, then both the hash and the outgoing interface
(OIF) should remain consistent, which is not the case. Only with the fix does it
respect the configured SRC and produce a consistent, correct 5-tuple with the
proper hash.

Therefore, in my opinion, this should be fixed.

Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash

Posted by Ido Schimmel 4 months ago

On Mon, Oct 06, 2025 at 06:31:10PM +0000, Dmitry wrote:
> If the 5-tuple is not changed, then both the hash and the outgoing interface
> (OIF) should remain consistent, which is not the case. Only with the fix does it
> respect the configured SRC and produce a consistent, correct 5-tuple with the
> proper hash.
> 
> Therefore, in my opinion, this should be fixed.

Note that even if the hash is consistent throughout the lifetime of the
socket, it is still possible for packets to be routed out of different
interfaces. This can happen, for example, if one of the nexthop devices
loses its carrier. This will change the hash thresholds in the ECMP
group and can cause packets to egress a different interface even if the
current one is not the one that went down. Obviously packets can also
change paths due to changes in other routers between you and the
destination. A network design that results in connections being severed
every time a flow is routed differently seems fragile to me.
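
The effect of changing hash thresholds can be sketched in user space.
This is a simplified model, not the kernel implementation: it assumes
equal-weight nexthops that split a 31-bit hash space into contiguous
regions, and the names toy_upper_bound() and toy_select_nexthop() are
made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_HASH_SPACE ((uint64_t)1 << 31)

/*
 * Upper bound (inclusive) of the hash region owned by nexthop idx when
 * n equal-weight nexthops share the 31-bit hash space. The division
 * rounds to the closest integer.
 */
static uint32_t toy_upper_bound(unsigned int idx, unsigned int n)
{
	uint64_t end = (uint64_t)(idx + 1) * TOY_HASH_SPACE;

	return (uint32_t)((end + n / 2) / n - 1);
}

/* Pick the first nexthop whose upper bound covers the hash. */
unsigned int toy_select_nexthop(uint32_t hash31, unsigned int n)
{
	assert(n > 0);
	assert(hash31 < TOY_HASH_SPACE);

	for (unsigned int i = 0; i < n; i++)
		if (hash31 <= toy_upper_bound(i, n))
			return i;

	return n - 1;
}
```

In this model a flow with hash 600000000 selects nexthop 1 while the
group has four members. If the last member (index 3) drops out and the
group shrinks to three, the same hash selects nexthop 0, even though
nexthop 1 itself never went down.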

If you still want to address the issue, then I believe that the correct
way to do it would be to align tcp_v6_connect() with tcp_v4_connect().
I'm not sure why they differ, but the IPv4 version will first do a route
lookup to determine the source address, then allocate a source port and
only when all the parameters are known it will do a final route lookup
and cache the result in the socket. IPv6 on the other hand, does a
single route lookup with an unknown source address and an unknown source
port.

This is explained in the comment above ip_route_connect_init() and
Willem also explained it here:

https://lore.kernel.org/all/20250424143549.669426-2-willemdebruijn.kernel@gmail.com/

Willem, do you happen to know why tcp_v6_connect() only performs a
single route lookup?

Link to the original patch:

https://lore.kernel.org/netdev/20251005-ipv6-set-saddr-to-prefsrc-before-hash-to-stabilize-ecmp-v1-1-d43b6ef00035@proton.me/

Thanks

Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash

Posted by Willem de Bruijn 4 months ago

Ido Schimmel wrote:
> On Mon, Oct 06, 2025 at 06:31:10PM +0000, Dmitry wrote:
> > If the 5-tuple is not changed, then both the hash and the outgoing interface
> > (OIF) should remain consistent, which is not the case.

With git fetch over SSH, the process apparently explicitly changes DSCP
(by calling setsockopt IPV6_TCLASS?). That triggers a dst reset, which
may trigger a different path. That is WAI, right?

Policy routing can explicitly specify different egress devices for
different DSCP settings.

Is this the entire issue? The original message states

> In an IPv6 ECMP scenario, if a multi-homed host initiates a connection,
> `saddr` may remain empty during the initial call to `rt6_multipath_hash()`.
> It gets filled later, once the outgoing interface (OIF) is determined and
> `ipv6_dev_get_saddr()` (RFC 6724) selects the proper source address.
>
> In some cases, this can cause the flow to switch paths: the first packets
> go via one link, while the rest of the flow is routed over another.

That sounds as if the OIF can change in between the rt6_multipath_hash
and ipv6_dev_get_saddr calls for a regular socket, without such
explicit DSCP changes. Does this happen?


> > Only with the fix does it
> > respect the configured SRC and produce a consistent, correct 5-tuple with the
> > proper hash.
> > 
> > Therefore, in my opinion, this should be fixed.
> 
> Note that even if the hash is consistent throughout the lifetime of the
> socket, it is still possible for packets to be routed out of different
> interfaces. This can happen, for example, if one of the nexthop devices
> loses its carrier. This will change the hash thresholds in the ECMP
> group and can cause packets to egress a different interface even if the
> current one is not the one that went down. Obviously packets can also
> change paths due to changes in other routers between you and the
> destination. A network design that results in connections being severed
> every time a flow is routed differently seems fragile to me.
> 
> If you still want to address the issue, then I believe that the correct
> way to do it would be to align tcp_v6_connect() with tcp_v4_connect().
> I'm not sure why they differ, but the IPv4 version will first do a route
> lookup to determine the source address, then allocate a source port and
> only when all the parameters are known it will do a final route lookup
> and cache the result in the socket. IPv6 on the other hand, does a
> single route lookup with an unknown source address and an unknown source
> port.
> 
> This is explained in the comment above ip_route_connect_init() and
> Willem also explained it here:
> 
> https://lore.kernel.org/all/20250424143549.669426-2-willemdebruijn.kernel@gmail.com/
> 
> Willem, do you happen to know why tcp_v6_connect() only performs a
> single route lookup?

I did not fully get to the historical reasons for the differences.
From v1 of that patch:

"Side-quest: I wonder if the second route lookup in ip_route_connect
is vestigial since the introduction of the third route lookup with
ip_route_newports. IPv6 has neither second nor third lookup, which
hints that perhaps both can be removed. "

https://lore.kernel.org/netdev/20250420180537.2973960-2-willemdebruijn.kernel@gmail.com/
 
> Link to the original patch:
> 
> https://lore.kernel.org/netdev/20251005-ipv6-set-saddr-to-prefsrc-before-hash-to-stabilize-ecmp-v1-1-d43b6ef00035@proton.me/
> 
> Thanks

Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash

Posted by Ido Schimmel 4 months ago

On Tue, Oct 07, 2025 at 06:25:47PM -0400, Willem de Bruijn wrote:
> Ido Schimmel wrote:
> > On Mon, Oct 06, 2025 at 06:31:10PM +0000, Dmitry wrote:
> > > If the 5-tuple is not changed, then both the hash and the outgoing interface
> > > (OIF) should remain consistent, which is not the case.
> 
> With git fetch over SSH, the process apparently explicitly changes DSCP
> (by calling setsockopt IPV6_TCLASS?). That triggers a dst reset, which
> may trigger a different path. That is WAI, right?

Yes, but in this case policy routing does not match on DSCP. The reason
for the path change after the dst reset is that the initial route lookup
is performed with an incomplete 5-tuple (missing source address and
source port) compared to subsequent lookups.

Dmitry already verified that specifying "BindAddress" in SSH config
resolves the issue as in this case the route lookups are always
performed with the same source address. This indicates that the DSCP
change itself is not the problem.

> 
> Policy routing can explicitly specify different egress devices for
> different DSCP settings.
> 
> Is this the entire issue? The original message states
> 
> > In an IPv6 ECMP scenario, if a multi-homed host initiates a connection,
> > `saddr` may remain empty during the initial call to `rt6_multipath_hash()`.
> > It gets filled later, once the outgoing interface (OIF) is determined and
> > `ipv6_dev_get_saddr()` (RFC 6724) selects the proper source address.
> >
> > In some cases, this can cause the flow to switch paths: the first packets
> > go via one link, while the rest of the flow is routed over another.
> 
> That sounds as if the OIF can change in between the rt6_multipath_hash
> and ipv6_dev_get_saddr calls for a regular socket, without such
> explicit DSCP changes. Does this happen?

I'm not sure what you mean by that, but any event that triggers a dst
reset can result in an OIF change as subsequent route lookups will be
performed with different parameters compared to the initial route
lookup.

> 
> 
> > > Only with the fix does it
> > > respect the configured SRC and produce a consistent, correct 5-tuple with the
> > > proper hash.
> > > 
> > > Therefore, in my opinion, this should be fixed.
> > 
> > Note that even if the hash is consistent throughout the lifetime of the
> > socket, it is still possible for packets to be routed out of different
> > interfaces. This can happen, for example, if one of the nexthop devices
> > loses its carrier. This will change the hash thresholds in the ECMP
> > group and can cause packets to egress a different interface even if the
> > current one is not the one that went down. Obviously packets can also
> > change paths due to changes in other routers between you and the
> > destination. A network design that results in connections being severed
> > every time a flow is routed differently seems fragile to me.
> > 
> > If you still want to address the issue, then I believe that the correct
> > way to do it would be to align tcp_v6_connect() with tcp_v4_connect().
> > I'm not sure why they differ, but the IPv4 version will first do a route
> > lookup to determine the source address, then allocate a source port and
> > only when all the parameters are known it will do a final route lookup
> > and cache the result in the socket. IPv6 on the other hand, does a
> > single route lookup with an unknown source address and an unknown source
> > port.
> > 
> > This is explained in the comment above ip_route_connect_init() and
> > Willem also explained it here:
> > 
> > https://lore.kernel.org/all/20250424143549.669426-2-willemdebruijn.kernel@gmail.com/
> > 
> > Willem, do you happen to know why tcp_v6_connect() only performs a
> > single route lookup?
> 
> I did not fully get to the historical reasons for the differences.
> From v1 of that patch:
> 
> "Side-quest: I wonder if the second route lookup in ip_route_connect
> is vestigial since the introduction of the third route lookup with
> ip_route_newports. IPv6 has neither second nor third lookup, which
> hints that perhaps both can be removed. "

I also wondered about the second route lookup in ip_route_connect(), but
the one in ip_route_newports() seems necessary as it will perform a
route lookup with a complete 5-tuple, unlike the first.

Thanks