include/net/mptcp.h | 4 ++++ net/ipv4/tcp_input.c | 5 +++++ net/mptcp/subflow.c | 8 ++++++++ 3 files changed, 17 insertions(+)
When an mptcp connection closes down, a MPTCP-level DATA_FIN (flag in mp tcp
options) is sent.
We should wait until we can be sure the peer has received the mptcp-level ack of
that DATA_FIN. Else, we can end up with the last mptcp subflow in FIN_WAIT2 state.
In that state, TCP won't respond to incoming ACK anymore, because no
progress happens from TCP point of view.
mptcp-packetdrill example:
// MPTCP sk moves to FIN_WAIT1, subflow remains in ESTABLISHED state:
+0.5 close(4) = 0
+0.0 > . 2:2(0) ack 1 win 256 <dss dack4=1 dsn8=2 ssn=0 dll=1 nocs fin, nop, nop>
// receive DATA_FIN ack. MPTCP sk moves to FIN_WAIT2, subflow remains in ESTABLISHED state.
+0.0 < . 1:1(0) ack 2 win 450 <dss dack4=3 dsn8=1 ssn=0 dll=0 nocs, nop, nop>
// Receive DATA_FIN. MPTCP socket is removed, subflow sk moves to FIN_WAIT1.
+0.0 < . 1:1(0) ack 2 win 450 <dss dack4=3 dsn8=1 ssn=0 dll=1 nocs fin, nop, nop>
// acks the clients DATA_FIN, but packet might get lost
+0 > . 2:2(0) ack 1 win 256 <dss dack4=2 nocs>
// Receive DATA_FIN retransmission, subflow still in TCP_FIN1 state.
+0.01 < . 1:1(0) ack 1 win 450 <dss dack4=2 dsn8=1 ssn=0 dll=1 nocs fin, nop, nop>
+0 > F. 2:2(0) ack 1 win 256 <dss dack4=2 nocs>
// Ack the tcp-level fin but also retransmit the data fin.
// Note that we still claim mptcp-level dack4=2, i.e. the
// servers DATA_FIN remains unacked.
// Without this change, this 'ack 3' moves the subflow to FIN_WAIT2 state.
+0.0 < . 2:2(0) ack 3 win 450 <dss dack4=2 dsn8=1 ssn=0 dll=1 nocs fin, nop, nop>
+0 > . 3:3(0) ack 1 win 256 <dss dack4=2 nocs>
// receive another retransmit of data fin...
+0.0 < . 2:2(0) ack 3 win 450 <dss dack4=2 dsn8=1 ssn=0 dll=1 nocs fin, nop, nop>
// ... but without this patch, nothing happens, TW sk won't send anything.
// With this patch, the subflow remains in FIN_WAIT1 and replies with a DSS ack.
+0 > . 3:3(0) ack 1 win 256 <dss dack4=2 nocs>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/181
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/mptcp.h | 4 ++++
net/ipv4/tcp_input.c | 5 +++++
net/mptcp/subflow.c | 8 ++++++++
3 files changed, 17 insertions(+)
Full test case stashed here for the time being:
https://github.com/fw-strlen/packetdrill/commit/60a3f57ea309f910643e06dbf123236741c7f8d9
Its possible to either change tw_sk to have enough mptcp context to be
able to send a packet back, or we can supress/delay the tw_sock
transition. This is what classic TCP is doing when it receives another
FIN while in FIN1 state.
This change makes the test case work (another dss ack is sent), but
there may be other corner cases where we need to delay the sk ->
tw_sk transition.
If the general idea looks ok, perhaps its better to replace
tcp_time_wait(sk, skb, ..
with a mptcp aware helper, e.g. tcp_time_wait_check(sk, skb, ..
and place the option parsing for mptcp-subflows there?
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 02ca83493470..de2249996671 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -152,6 +152,8 @@ void mptcp_write_options(struct tcphdr *th, __be32 *ptr, struct tcp_sock *tp,
void mptcp_diag_fill_info(struct mptcp_sock *msk, struct mptcp_info *info);
+bool mptcp_subflow_pending_data_fin_ack(const struct sock *ssk);
+
/* move the skb extension owership, with the assumption that 'to' is
* newly allocated
*/
@@ -297,6 +299,8 @@ static inline int mptcp_subflow_init_cookie_req(struct request_sock *req,
}
static inline __be32 mptcp_reset_option(const struct sk_buff *skb) { return htonl(0u); }
+
+static inline bool mptcp_subflow_pending_data_fin_ack(const struct sock *ssk) { return false; }
#endif /* CONFIG_MPTCP */
#if IS_ENABLED(CONFIG_MPTCP_IPV6)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ab5f0ea166f1..c3281e0cf9db 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6608,6 +6608,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
* marginal case.
*/
inet_csk_reset_keepalive_timer(sk, tmo);
+ } else if (sk_is_mptcp(sk) &&
+ mptcp_incoming_options(sk, skb) &&
+ mptcp_subflow_pending_data_fin_ack(sk)) {
+ inet_csk_schedule_ack(sk);
+ inet_csk_reset_keepalive_timer(sk, tmo);
} else {
tcp_time_wait(sk, TCP_FIN_WAIT2, tmo);
goto consume;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index c7d49fb6e7bd..d3e285bc45b8 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -351,6 +351,14 @@ static bool subflow_thmac_valid(struct mptcp_subflow_context *subflow)
return thmac == subflow->thmac;
}
+bool mptcp_subflow_pending_data_fin_ack(const struct sock *ssk)
+{
+ struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+ struct sock *sk = subflow->conn;
+
+ return READ_ONCE(mptcp_sk(sk)->rcv_data_fin);
+}
+
void mptcp_subflow_reset(struct sock *ssk)
{
struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
--
2.35.1
On Thu, 2022-08-18 at 18:21 +0200, Florian Westphal wrote: > When an mptcp connection closes down, a MPTCP-level DATA_FIN (flag in mp tcp > options) is sent. > > We should wait until we can be sure the peer has received the mptcp-level ack of > that DATA_FIN. Else, we can end up with the last mptcp subflow in FIN_WAIT2 state. > > In that state, TCP won't respond to incoming ACK anymore, because no > progress happens from TCP point of view. > > mptcp-packetdrill example: > > // MPTCP sk moves to FIN_WAIT1, subflow remains in ESTABLISHED state: > +0.5 close(4) = 0 > +0.0 > . 2:2(0) ack 1 win 256 <dss dack4=1 dsn8=2 ssn=0 dll=1 nocs fin, nop, nop> > > // receive DATA_FIN ack. MPTCP sk moves to FIN_WAIT2, subflow remains in ESTABLISHED state. > +0.0 < . 1:1(0) ack 2 win 450 <dss dack4=3 dsn8=1 ssn=0 dll=0 nocs, nop, nop> > > // Receive DATA_FIN. MPTCP socket is removed, subflow sk moves to FIN_WAIT1. > +0.0 < . 1:1(0) ack 2 win 450 <dss dack4=3 dsn8=1 ssn=0 dll=1 nocs fin, nop, nop> > > // acks the clients DATA_FIN, but packet might get lost > +0 > . 2:2(0) ack 1 win 256 <dss dack4=2 nocs> > > // Receive DATA_FIN retransmission, subflow still in TCP_FIN1 state. > +0.01 < . 1:1(0) ack 1 win 450 <dss dack4=2 dsn8=1 ssn=0 dll=1 nocs fin, nop, nop> > > +0 > F. 2:2(0) ack 1 win 256 <dss dack4=2 nocs> > > // Ack the tcp-level fin but also retransmit the data fin. > // Note that we still claim mptcp-level dack4=2, i.e. the > // servers DATA_FIN remains unacked. > > // Without this change, this 'ack 3' moves the subflow to FIN_WAIT2 state. > +0.0 < . 2:2(0) ack 3 win 450 <dss dack4=2 dsn8=1 ssn=0 dll=1 nocs fin, nop, nop> > > +0 > . 3:3(0) ack 1 win 256 <dss dack4=2 nocs> > > // receive another retransmit of data fin... > +0.0 < . 2:2(0) ack 3 win 450 <dss dack4=2 dsn8=1 ssn=0 dll=1 nocs fin, nop, nop> > > // ... but without this patch, nothing happens, TW sk won't send anything. > // With this patch, the subflow remains in FIN_WAIT1 and replies with a DSS ack. > +0 > . 3:3(0) ack 1 win 256 <dss dack4=2 nocs> > > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/181 > Signed-off-by: Florian Westphal <fw@strlen.de> > --- > include/net/mptcp.h | 4 ++++ > net/ipv4/tcp_input.c | 5 +++++ > net/mptcp/subflow.c | 8 ++++++++ > 3 files changed, 17 insertions(+) > > Full test case stashed here for the time being: > https://github.com/fw-strlen/packetdrill/commit/60a3f57ea309f910643e06dbf123236741c7f8d9 > > Its possible to either change tw_sk to have enough mptcp context to be > able to send a packet back, or we can supress/delay the tw_sock > transition. This is what classic TCP is doing when it receives another > FIN while in FIN1 state. > > This change makes the test case work (another dss ack is sent), but > there may be other corner cases where we need to delay the sk -> > tw_sk transition. > > If the general idea looks ok, perhaps its better to replace > > tcp_time_wait(sk, skb, .. > > with a mptcp aware helper, e.g. tcp_time_wait_check(sk, skb, .. > and place the option parsing for mptcp-subflows there? _if_ (big if) I read correctly, this patch "always" (most common shutdown sequence should match the condition checked here, right?) delay the tw_sk transiction for mptcp sockets, keeping alive the whole mptcp and subflow socks for a ??? timeout. Is that correct? If so, and the mptcp_tw socket is not too invasive (again, big if), I think implementing the mptcp_tw could be better: a busy server could keep a lot of memory around for the shutting-down-socks. I think the mptcp_tw will not need a reference to the mptcp sock, it just need to know the mptcp-level data_fin seq to be acked. I guess it will likely require 3 hooks: in tcp_time_wait() to fill the data_fin_ack info and in tcp_v4_send_ack()/tcp_v6_send_response() to fill the TCP options accordingly > diff --git a/include/net/mptcp.h b/include/net/mptcp.h > index 02ca83493470..de2249996671 100644 > --- a/include/net/mptcp.h > +++ b/include/net/mptcp.h > @@ -152,6 +152,8 @@ void mptcp_write_options(struct tcphdr *th, __be32 *ptr, struct tcp_sock *tp, > > void mptcp_diag_fill_info(struct mptcp_sock *msk, struct mptcp_info *info); > > +bool mptcp_subflow_pending_data_fin_ack(const struct sock *ssk); > + > /* move the skb extension owership, with the assumption that 'to' is > * newly allocated > */ > @@ -297,6 +299,8 @@ static inline int mptcp_subflow_init_cookie_req(struct request_sock *req, > } > > static inline __be32 mptcp_reset_option(const struct sk_buff *skb) { return htonl(0u); } > + > +static inline bool mptcp_subflow_pending_data_fin_ack(const struct sock *ssk) { return false; } > #endif /* CONFIG_MPTCP */ > > #if IS_ENABLED(CONFIG_MPTCP_IPV6) > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index ab5f0ea166f1..c3281e0cf9db 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -6608,6 +6608,11 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) > * marginal case. > */ > inet_csk_reset_keepalive_timer(sk, tmo); > + } else if (sk_is_mptcp(sk) && > + mptcp_incoming_options(sk, skb) && > + mptcp_subflow_pending_data_fin_ack(sk)) { Possibly ENOCOFFEE here, but could the above 2 function being replaced by checking if the incoming packet carries a data_fin option? (not sure if worthy and its very subjective, it "sounds" a little bit clearer to me) > + inet_csk_schedule_ack(sk); Why we explcitly need to schedule the ack, and the plain TCP-fin case does not need it? Cheers, Paolo
Paolo Abeni <pabeni@redhat.com> wrote: > > Full test case stashed here for the time being: > > https://github.com/fw-strlen/packetdrill/commit/60a3f57ea309f910643e06dbf123236741c7f8d9 > > > > Its possible to either change tw_sk to have enough mptcp context to be > > able to send a packet back, or we can supress/delay the tw_sock > > transition. This is what classic TCP is doing when it receives another > > FIN while in FIN1 state. > > > > This change makes the test case work (another dss ack is sent), but > > there may be other corner cases where we need to delay the sk -> > > tw_sk transition. > > > > If the general idea looks ok, perhaps its better to replace > > > > tcp_time_wait(sk, skb, .. > > > > with a mptcp aware helper, e.g. tcp_time_wait_check(sk, skb, .. > > and place the option parsing for mptcp-subflows there? > > _if_ (big if) I read correctly, this patch "always" (most common > shutdown sequence should match the condition checked here, right?) Really? I had to fudge with the test case a long time to get the needed transition, I'm not even sure the test case is legit. https://github.com/fw-strlen/packetdrill/commit/60a3f57ea309f910643e06dbf123236741c7f8d9 is that even correct?! (See line 43f., it acks the tcp-level fin, but retransmits the data_fin). > delay the tw_sk transiction for mptcp sockets, keeping alive the whole > mptcp and subflow socks for a ??? timeout. Is that correct? Yes. > If so, and the mptcp_tw socket is not too invasive (again, big if), I > think implementing the mptcp_tw could be better: a busy server could > keep a lot of memory around for the shutting-down-socks. Hmm, not sure, see above and below. > I think the mptcp_tw will not need a reference to the mptcp sock, it > just need to know the mptcp-level data_fin seq to be acked. Agreed. > > + } else if (sk_is_mptcp(sk) && > > + mptcp_incoming_options(sk, skb) && > > + mptcp_subflow_pending_data_fin_ack(sk)) { > > Possibly ENOCOFFEE here, but could the above 2 function being replaced > by checking if the incoming packet carries a data_fin option? Is that needed? AFAIU we ONLY need to respond if we have a pending fin_ack, i.e. if we know the data_fin ack is completed we don't need to act? If we need this unconditionally then of course your proposal makes more sense, i would then rework this to handle this as an extension of 'tcp->fin set' case with same handling (pass to tcp_data_queue()). > (not sure if worthy and its very subjective, it "sounds" a little bit > clearer to me) > > > + inet_csk_schedule_ack(sk); > > Why we explcitly need to schedule the ack, and the plain TCP-fin case > does not need it? Yes, I could rework this. For normal tcp, there are two cases: 1. pure ack -> no action needed, transition to FIN_WAIT2 + mini-tw-socket 2. fin set -> pass the skb to tcp_data_queue (this can't be seen from the context diff, but this is where the skb will end up, this will then call tcp_fin() again. In this case, the socket doesn't move to tw state. Thanks, Florian
Hello, sharing here private discussion to keep all on the same page... On Mon, 2022-08-22 at 14:32 +0200, Florian Westphal wrote: > Paolo Abeni <pabeni@redhat.com> wrote: > > > Full test case stashed here for the time being: > > > https://github.com/fw-strlen/packetdrill/commit/60a3f57ea309f910643e06dbf123236741c7f8d9 > > > > > > Its possible to either change tw_sk to have enough mptcp context to be > > > able to send a packet back, or we can supress/delay the tw_sock > > > transition. This is what classic TCP is doing when it receives another > > > FIN while in FIN1 state. > > > > > > This change makes the test case work (another dss ack is sent), but > > > there may be other corner cases where we need to delay the sk -> > > > tw_sk transition. > > > > > > If the general idea looks ok, perhaps its better to replace > > > > > > tcp_time_wait(sk, skb, .. > > > > > > with a mptcp aware helper, e.g. tcp_time_wait_check(sk, skb, .. > > > and place the option parsing for mptcp-subflows there? > > > > _if_ (big if) I read correctly, this patch "always" (most common > > shutdown sequence should match the condition checked here, right?) > > Really? I had to fudge with the test case a long time to get the needed > transition, I'm not even sure the test case is legit. > > https://github.com/fw-strlen/packetdrill/commit/60a3f57ea309f910643e06dbf123236741c7f8d9 > > is that even correct?! I mean/fear that a simple shutdown sequence (local close(), followed by remote data fin) could end-up matching the test implemented by this patch (even if the data fin retransmission will really happen on a much more rare condition). That is, possibly the test is a little more relaxed then actual packet sequence necessary to trigger the issue (but I don't know how to make it more strict). I'll try to double check the above. Thanks! Paolo
On Tue, 2022-08-23 at 17:54 +0200, Paolo Abeni wrote: > On Mon, 2022-08-22 at 14:32 +0200, Florian Westphal wrote: > > Paolo Abeni <pabeni@redhat.com> wrote: > > > > Full test case stashed here for the time being: > > > > https://github.com/fw-strlen/packetdrill/commit/60a3f57ea309f910643e06dbf123236741c7f8d9 > > > > > > > > Its possible to either change tw_sk to have enough mptcp context to be > > > > able to send a packet back, or we can supress/delay the tw_sock > > > > transition. This is what classic TCP is doing when it receives another > > > > FIN while in FIN1 state. > > > > > > > > This change makes the test case work (another dss ack is sent), but > > > > there may be other corner cases where we need to delay the sk -> > > > > tw_sk transition. > > > > > > > > If the general idea looks ok, perhaps its better to replace > > > > > > > > tcp_time_wait(sk, skb, .. > > > > > > > > with a mptcp aware helper, e.g. tcp_time_wait_check(sk, skb, .. > > > > and place the option parsing for mptcp-subflows there? > > > > > > _if_ (big if) I read correctly, this patch "always" (most common > > > shutdown sequence should match the condition checked here, right?) > > > > Really? I had to fudge with the test case a long time to get the needed > > transition, I'm not even sure the test case is legit. > > > > https://github.com/fw-strlen/packetdrill/commit/60a3f57ea309f910643e06dbf123236741c7f8d9 > > > > is that even correct?! > > I mean/fear that a simple shutdown sequence (local close(), followed by > remote data fin) could end-up matching the test implemented by this > patch (even if the data fin retransmission will really happen on a much > more rare condition). > > That is, possibly the test is a little more relaxed then actual packet > sequence necessary to trigger the issue (but I don't know how to make > it more strict). > > I'll try to double check the above. I'm sorry for the latency. I reviewed the relevant code, and I now think the new code introduced by this patch will not trigger frequently (at all) as I feared above, so it should be safe. Additionally I reviewed the relevant pkt drill, and I think the scenario addressed here is really a "strange" one, e.g. the remote peer must ack a TCP fin without "accepting" the MPTCP-level DSS ack carried by such packet. (because it retransmits the acked set/data_fin). I read the above as esplicitly forbidden by the protocol. According to the RFC, the mptcp impl. must not drop an accepted TCP packet while processing it at the MPTCP level. While that may be technically hard to enforce for packet carrying data (e.g. due to mem accounting) it looks easy for pure-ack, as in the above case. So the buggy implementation has no excuses ;) TL;DR: I think the patch is good but likely not needed, sorry. Thanks, Paolo p.s. the test drill contains another "strange" thing: the client mptcp- level data ack sequence moves back from 3 to 2 (line 30-38 in https://github.com/fw-strlen/packetdrill/commit/60a3f57ea309f910643e06dbf123236741c7f8d ) but that should not have any effect, as we can move the received dack seq only forward.
Hi Florian, On 18/08/2022 18:21, Florian Westphal wrote: > When an mptcp connection closes down, a MPTCP-level DATA_FIN (flag in mp tcp > options) is sent. > > We should wait until we can be sure the peer has received the mptcp-level ack of > that DATA_FIN. Else, we can end up with the last mptcp subflow in FIN_WAIT2 state. > > In that state, TCP won't respond to incoming ACK anymore, because no > progress happens from TCP point of view. Thank you for looking at that! (...) > Its possible to either change tw_sk to have enough mptcp context to be > able to send a packet back Do you think changing tw_sk would be accepted upstream? That would be good for MPTCP from a perf point of view I suppose. Or maybe not worth it? If it is worth it, could we have a 'struct mptcp_timewait_sock' for MPTCP case, extending 'struct tcp_timewait_sock'? But then I guess the "annoying" thing for upstream is that we might have to modify or split functions from tcp_minisocks.c if we cannot add more indirections. > or we can supress/delay the tw_sock > transition. All the time for MPTCP sockets? Do you think it the impact will just be minor? > This is what classic TCP is doing when it receives another > FIN while in FIN1 state. > > This change makes the test case work (another dss ack is sent), but > there may be other corner cases where we need to delay the sk -> > tw_sk transition. We didn't hear anything about other cases so far, we can also see case by case for this situation (corner cases at the end of a connection) :) > If the general idea looks ok, perhaps its better to replace > > tcp_time_wait(sk, skb, .. > > with a mptcp aware helper, e.g. tcp_time_wait_check(sk, skb, .. > and place the option parsing for mptcp-subflows there? Would it be better because we would put this in tcp_minisocks.c? Or just to isolate MPTCP specific function in a small and dedicated function? I don't have any strong opinion about one or the other. Cheers, Matt -- Tessares | Belgium | Hybrid Access Solutions www.tessares.net
Hi Florian, Thank you for your modifications, that's great! Our CI did some validations and here is its report: - KVM Validation: normal: - Unstable: 1 failed test(s): selftest_mptcp_join 🔴: - Task: https://cirrus-ci.com/task/5095347042123776 - Summary: https://api.cirrus-ci.com/v1/artifact/task/5095347042123776/summary/summary.txt - KVM Validation: debug: - Unstable: 2 failed test(s): packetdrill_add_addr selftest_mptcp_join 🔴: - Task: https://cirrus-ci.com/task/6221246948966400 - Summary: https://api.cirrus-ci.com/v1/artifact/task/6221246948966400/summary/summary.txt Initiator: Patchew Applier Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/6c09a70059d2 If there are some issues, you can reproduce them using the same environment as the one used by the CI thanks to a docker image, e.g.: $ cd [kernel source code] $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \ --pull always mptcp/mptcp-upstream-virtme-docker:latest \ auto-debug For more details: https://github.com/multipath-tcp/mptcp-upstream-virtme-docker Please note that despite all the efforts that have been already done to have a stable tests suite when executed on a public CI like here, it is possible some reported issues are not due to your modifications. Still, do not hesitate to help us improve that ;-) Cheers, MPTCP GH Action bot Bot operated by Matthieu Baerts (Tessares)
© 2016 - 2023 Red Hat, Inc.