Paolo worked on this RX path refactor for these two main reasons:

- Currently, the MPTCP RX path introduces quite a bit of 'exceptional'
  accounting/locking processing WRT plain TCP, adding to the
  implementation complexity in a miserable way.

- The performance gap WRT plain TCP for single subflow connections is
  quite measurable.

The present refactor addresses both the above items: most of the
additional complexity is dropped, and single stream performance
increases measurably, from 55Gbps to 71Gbps in Paolo's loopback test. As
a reference, plain TCP was around 84Gbps on the same host.

The above comes at a price: the patches are invasive, even in subtle
ways, and the chance of destabilizing the implementation is real (hence
the additional, intentional '-next' in the subject).

Note: patch 5/7 removes the sk_forward_alloc_get() helper, which caused
some trivial modifications in different places in the net tree: sockets,
IPv4, sched. That's why a few more people have been Cc'd here. Feel free
to look only at patch 5/7.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Changes from v1:
 - fixed several data stream corruptions and wake-up misses due to
   multi-subflow races
 - added patches 1-3 mainly to address the above
 - added an additional follow-up patch (patch 7) with more cleanup

Paolo Abeni (7):
  mptcp: consolidate subflow cleanup
  mptcp: drop __mptcp_fastopen_gen_msk_ackseq()
  mptcp: move the whole rx path under msk socket lock protection
  mptcp: cleanup mem accounting
  net: dismiss sk_forward_alloc_get()
  mptcp: dismiss __mptcp_rmem()
  mptcp: micro-optimize __mptcp_move_skb()

 include/net/sock.h   |  13 ---
 net/core/sock.c      |   2 +-
 net/ipv4/af_inet.c   |   2 +-
 net/ipv4/inet_diag.c |   2 +-
 net/mptcp/fastopen.c |  27 +----
 net/mptcp/protocol.c | 317 ++++++++++++++++-----------------------------------
 net/mptcp/protocol.h |  22 ++--
 net/mptcp/subflow.c  |  36 +++---
 net/sched/em_meta.c  |   2 +-
 9 files changed, 134 insertions(+), 289 deletions(-)
---
base-commit: b4cb730862cf4f59ac3dcb83b9ac4eeb29dbfb0e
change-id: 20250106-net-next-mptcp-rx-path-refactor-f44579efb57c

Best regards,
--
Matthieu Baerts (NGI0) <matttbe@kernel.org>
From: Paolo Abeni <pabeni@redhat.com>

Currently the skb size after coalescing is only limited by the skb
layout (the skb must not carry a frag_list). A single coalesced skb
covering several MSS can potentially fill the receive buffer
completely. In such a case, the snd win will stay at zero until the
receive buffer is empty again, affecting tput badly.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
No fixes tag because the problem is not very visible in practice
currently, but it will be apparent after the rx refactor.
Still, I hope this could positively affect the simult flows self-tests.
---
 net/mptcp/protocol.c | 1 +
 1 file changed, 1 insertion(+)

From: Paolo Abeni <pabeni@redhat.com>

When we will move the whole RX path under the msk socket lock, updating
the already queued skb for a passive fastopen socket at 3rd-ack time
will be extremely painful and race prone.

The map_seq for already enqueued skbs is used only to allow correct
coalescing with later data; by preventing collapsing onto the first skb
of a fastopen connection we can completely remove the
__mptcp_fastopen_gen_msk_ackseq() helper.

Before dropping this helper, a new item had to be added to the
mptcp_skb_cb structure. Because this item will be frequently tested in
the fast path -- almost on every packet -- and because there is free
space there, a single byte is used instead of a bitfield. This micro
optimisation slightly reduces the number of CPU operations to do the
associated check.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 net/mptcp/fastopen.c | 24 ++----------------------
 net/mptcp/protocol.c |  4 +++-
 net/mptcp/protocol.h |  5 ++---
 net/mptcp/subflow.c  |  3 ---
 4 files changed, 7 insertions(+), 29 deletions(-)
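For reference, a rough sketch of the resulting control block layout
(illustrative only; the real definition lives in net/mptcp/protocol.h and the
exact field set is visible in the protocol.h hunk below). Even with
has_rxtstamp widened from a bitfield to a byte and the new cant_coalesce byte
added, the structure stays well within the 48 bytes of skb->cb[], which is why
spending a whole byte on the flag costs nothing:

	#include <linux/skbuff.h>
	#include <linux/stddef.h>
	#include <linux/build_bug.h>

	/* Illustrative copy, not the in-tree definition. */
	struct example_mptcp_skb_cb {
		u64 map_seq;
		u64 end_seq;
		u32 offset;
		u8  has_rxtstamp;	/* plain byte instead of a :1 bitfield */
		u8  cant_coalesce;	/* e.g. set for the skb queued at MPTFO 3rd-ack time */
	};

	static_assert(sizeof(struct example_mptcp_skb_cb) <=
		      sizeof_field(struct sk_buff, cb));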
15
29
diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
30
index XXXXXXX..XXXXXXX 100644
31
--- a/net/mptcp/fastopen.c
32
+++ b/net/mptcp/fastopen.c
33
@@ -XXX,XX +XXX,XX @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
34
    tp->copied_seq += skb->len;
35
    subflow->ssn_offset += skb->len;
36
37
-    /* initialize a dummy sequence number, we will update it at MPC
38
-     * completion, if needed
39
-     */
40
+    /* Only the sequence delta is relevant */
41
    MPTCP_SKB_CB(skb)->map_seq = -skb->len;
42
    MPTCP_SKB_CB(skb)->end_seq = 0;
43
    MPTCP_SKB_CB(skb)->offset = 0;
44
    MPTCP_SKB_CB(skb)->has_rxtstamp = TCP_SKB_CB(skb)->has_rxtstamp;
45
+    MPTCP_SKB_CB(skb)->cant_coalesce = 1;
46
47
    mptcp_data_lock(sk);
48
49
@@ -XXX,XX +XXX,XX @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
50
51
    mptcp_data_unlock(sk);
52
}
53
-
54
-void __mptcp_fastopen_gen_msk_ackseq(struct mptcp_sock *msk, struct mptcp_subflow_context *subflow,
55
-                 const struct mptcp_options_received *mp_opt)
56
-{
57
-    struct sock *sk = (struct sock *)msk;
58
-    struct sk_buff *skb;
59
-
60
-    skb = skb_peek_tail(&sk->sk_receive_queue);
61
-    if (skb) {
62
-        WARN_ON_ONCE(MPTCP_SKB_CB(skb)->end_seq);
63
-        pr_debug("msk %p moving seq %llx -> %llx end_seq %llx -> %llx\n", sk,
64
-             MPTCP_SKB_CB(skb)->map_seq, MPTCP_SKB_CB(skb)->map_seq + msk->ack_seq,
65
-             MPTCP_SKB_CB(skb)->end_seq, MPTCP_SKB_CB(skb)->end_seq + msk->ack_seq);
66
-        MPTCP_SKB_CB(skb)->map_seq += msk->ack_seq;
67
-        MPTCP_SKB_CB(skb)->end_seq += msk->ack_seq;
68
-    }
69
-
70
-    pr_debug("msk=%p ack_seq=%llx\n", msk, msk->ack_seq);
71
-}
72
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
16
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
73
index XXXXXXX..XXXXXXX 100644
17
index XXXXXXX..XXXXXXX 100644
74
--- a/net/mptcp/protocol.c
18
--- a/net/mptcp/protocol.c
75
+++ b/net/mptcp/protocol.c
19
+++ b/net/mptcp/protocol.c
76
@@ -XXX,XX +XXX,XX @@ static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
20
@@ -XXX,XX +XXX,XX @@ static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
77
    bool fragstolen;
78
    int delta;
21
    int delta;
79
22
80
-    if (MPTCP_SKB_CB(from)->offset ||
23
    if (MPTCP_SKB_CB(from)->offset ||
81
+    if (unlikely(MPTCP_SKB_CB(to)->cant_coalesce) ||
24
+     ((to->len + from->len) > (sk->sk_rcvbuf >> 3)) ||
82
+     MPTCP_SKB_CB(from)->offset ||
83
     ((to->len + from->len) > (sk->sk_rcvbuf >> 3)) ||
84
     !skb_try_coalesce(to, from, &fragstolen, &delta))
25
     !skb_try_coalesce(to, from, &fragstolen, &delta))
85
        return false;
26
        return false;
86
@@ -XXX,XX +XXX,XX @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
27
87
    MPTCP_SKB_CB(skb)->end_seq = MPTCP_SKB_CB(skb)->map_seq + copy_len;
88
    MPTCP_SKB_CB(skb)->offset = offset;
89
    MPTCP_SKB_CB(skb)->has_rxtstamp = has_rxtstamp;
90
+    MPTCP_SKB_CB(skb)->cant_coalesce = 0;
91
92
    if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
93
        /* in sequence */
94
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
95
index XXXXXXX..XXXXXXX 100644
96
--- a/net/mptcp/protocol.h
97
+++ b/net/mptcp/protocol.h
98
@@ -XXX,XX +XXX,XX @@ struct mptcp_skb_cb {
99
    u64 map_seq;
100
    u64 end_seq;
101
    u32 offset;
102
-    u8 has_rxtstamp:1;
103
+    u8 has_rxtstamp;
104
+    u8 cant_coalesce;
105
};
106
107
#define MPTCP_SKB_CB(__skb)    ((struct mptcp_skb_cb *)&((__skb)->cb[0]))
108
@@ -XXX,XX +XXX,XX @@ void mptcp_event_pm_listener(const struct sock *ssk,
109
             enum mptcp_event_type event);
110
bool mptcp_userspace_pm_active(const struct mptcp_sock *msk);
111
112
-void __mptcp_fastopen_gen_msk_ackseq(struct mptcp_sock *msk, struct mptcp_subflow_context *subflow,
113
-                 const struct mptcp_options_received *mp_opt);
114
void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subflow,
115
                     struct request_sock *req);
116
int mptcp_nl_fill_addr(struct sk_buff *skb,
117
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
118
index XXXXXXX..XXXXXXX 100644
119
--- a/net/mptcp/subflow.c
120
+++ b/net/mptcp/subflow.c
121
@@ -XXX,XX +XXX,XX @@ void __mptcp_subflow_fully_established(struct mptcp_sock *msk,
122
    subflow_set_remote_key(msk, subflow, mp_opt);
123
    WRITE_ONCE(subflow->fully_established, true);
124
    WRITE_ONCE(msk->fully_established, true);
125
-
126
-    if (subflow->is_mptfo)
127
-        __mptcp_fastopen_gen_msk_ackseq(msk, subflow, mp_opt);
128
}
129
130
static struct sock *subflow_syn_recv_sock(const struct sock *sk,
131
132
--
28
--
133
2.47.1
29
2.45.2
From: Paolo Abeni <pabeni@redhat.com>

After the RX path refactor the mentioned function is expected to run
frequently; let's optimize it a bit.

Scan for ready subflows starting from the last processed one, and stop
after traversing the list once or after reaching the msk memory limit -
instead of looking for dubious per-subflow conditions.
Also re-order the memory limit checks, to avoid duplicate tests.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 net/mptcp/protocol.c | 111 +++++++++++++++++++++++----------------------------
 net/mptcp/protocol.h |   2 +
 2 files changed, 52 insertions(+), 61 deletions(-)

From: Paolo Abeni <pabeni@redhat.com>

If the recvmsg() blocks after receiving some data - i.e. due to
SO_RCVLOWAT - the MPTCP code will attempt multiple times to adjust the
receive buffer size, wrongly accounting the cumulative amount of
received data every time - instead of accounting only for the delta.

Address the issue by moving mptcp_rcv_space_adjust() just after the
data reception and passing it only the just-received bytes.

This also removes an unneeded difference between the TCP and MPTCP
RX code path implementations.

Fixes: 581302298524 ("mptcp: error out earlier on disconnect")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
Not strictly related to the refactor, found while investigating
the CI failure.
---
 net/mptcp/protocol.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
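To make the accounting problem described above concrete, here is a tiny
stand-alone sketch (all names hypothetical, not kernel code): feeding the
running total to the estimator on every wakeup counts the same bytes again and
again, while feeding it only the newly copied delta - as the fix does with
mptcp_rcv_space_adjust() - accounts each byte once:

	#include <stdio.h>

	static int accounted;

	/* stand-in for the receive buffer auto-tuning estimator */
	static void rcv_space_adjust(int newly_copied)
	{
		accounted += newly_copied;
	}

	int main(void)
	{
		int chunks[] = { 1000, 2000, 3000 };	/* bytes copied per wakeup */
		int copied = 0;

		for (int i = 0; i < 3; i++) {
			copied += chunks[i];
			/* buggy variant: rcv_space_adjust(copied) would account
			 * 1000 + 3000 + 6000 = 10000 bytes for 6000 copied
			 */
			rcv_space_adjust(chunks[i]);
		}
		printf("accounted %d bytes out of %d copied\n", accounted, copied);
		return 0;
	}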
18
21
19
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
22
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
20
index XXXXXXX..XXXXXXX 100644
23
index XXXXXXX..XXXXXXX 100644
21
--- a/net/mptcp/protocol.c
24
--- a/net/mptcp/protocol.c
22
+++ b/net/mptcp/protocol.c
25
+++ b/net/mptcp/protocol.c
23
@@ -XXX,XX +XXX,XX @@ static void mptcp_dss_corruption(struct mptcp_sock *msk, struct sock *ssk)
26
@@ -XXX,XX +XXX,XX @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
27
    goto out;
24
}
28
}
25
29
26
static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
30
+static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied);
27
-                     struct sock *ssk,
28
-                     unsigned int *bytes)
29
+                     struct sock *ssk)
30
{
31
    struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
32
    struct sock *sk = (struct sock *)msk;
33
-    unsigned int moved = 0;
34
    bool more_data_avail;
35
    struct tcp_sock *tp;
36
-    bool done = false;
37
+    bool ret = false;
38
39
    pr_debug("msk=%p ssk=%p\n", msk, ssk);
40
    tp = tcp_sk(ssk);
41
@@ -XXX,XX +XXX,XX @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
42
        struct sk_buff *skb;
43
        bool fin;
44
45
+        if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
46
+            break;
47
+
31
+
48
        /* try to move as much data as available */
32
static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
49
        map_remaining = subflow->map_data_len -
33
                struct msghdr *msg,
50
                mptcp_subflow_get_map_offset(subflow);
34
                size_t len, int flags,
51
35
@@ -XXX,XX +XXX,XX @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
52
        skb = skb_peek(&ssk->sk_receive_queue);
53
-        if (!skb) {
54
-            /* With racing move_skbs_to_msk() and __mptcp_move_skbs(),
55
-             * a different CPU can have already processed the pending
56
-             * data, stop here or we can enter an infinite loop
57
-             */
58
-            if (!moved)
59
-                done = true;
60
+        if (unlikely(!skb))
61
            break;
36
            break;
62
-        }
37
    }
63
38
64
        if (__mptcp_check_fallback(msk)) {
39
+    mptcp_rcv_space_adjust(msk, copied);
65
            /* Under fallback skbs have no MPTCP extension and TCP could
40
    return copied;
66
@@ -XXX,XX +XXX,XX @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
41
}
67
42
68
        offset = seq - TCP_SKB_CB(skb)->seq;
43
@@ -XXX,XX +XXX,XX @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
69
        fin = TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN;
44
        }
70
-        if (fin) {
45
71
-            done = true;
46
        pr_debug("block timeout %ld\n", timeo);
72
+        if (fin)
47
-        mptcp_rcv_space_adjust(msk, copied);
73
            seq++;
48
        err = sk_wait_data(sk, &timeo, NULL);
74
-        }
49
        if (err < 0) {
75
50
            err = copied ? : err;
76
        if (offset < skb->len) {
51
@@ -XXX,XX +XXX,XX @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
77
            size_t len = skb->len - offset;
52
        }
78
53
    }
79
-            if (tp->urg_data)
54
80
-                done = true;
55
-    mptcp_rcv_space_adjust(msk, copied);
81
-
56
-
82
-            if (__mptcp_move_skb(msk, ssk, skb, offset, len))
57
out_err:
83
-                moved += len;
58
    if (cmsg_flags && copied >= 0) {
84
+            ret = __mptcp_move_skb(msk, ssk, skb, offset, len) || ret;
59
        if (cmsg_flags & MPTCP_CMSG_TS)
85
            seq += len;
86
87
            if (unlikely(map_remaining < len)) {
88
@@ -XXX,XX +XXX,XX @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
89
            }
90
91
            sk_eat_skb(ssk, skb);
92
-            done = true;
93
        }
94
95
        WRITE_ONCE(tp->copied_seq, seq);
96
        more_data_avail = mptcp_subflow_data_available(ssk);
97
98
-        if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf) {
99
-            done = true;
100
-            break;
101
-        }
102
    } while (more_data_avail);
103
104
-    if (moved > 0)
105
+    if (ret)
106
        msk->last_data_recv = tcp_jiffies32;
107
-    *bytes += moved;
108
-    return done;
109
+    return ret;
110
}
111
112
static bool __mptcp_ofo_queue(struct mptcp_sock *msk)
113
@@ -XXX,XX +XXX,XX @@ void __mptcp_error_report(struct sock *sk)
114
static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
115
{
116
    struct sock *sk = (struct sock *)msk;
117
-    unsigned int moved = 0;
118
+    bool moved;
119
120
-    __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
121
+    moved = __mptcp_move_skbs_from_subflow(msk, ssk);
122
    __mptcp_ofo_queue(msk);
123
    if (unlikely(ssk->sk_err)) {
124
        if (!sock_owned_by_user(sk))
125
@@ -XXX,XX +XXX,XX @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
126
     */
127
    if (mptcp_pending_data_fin(sk, NULL))
128
        mptcp_schedule_work(sk);
129
-    return moved > 0;
130
+    return moved;
131
}
132
133
static void __mptcp_rcvbuf_update(struct sock *sk, struct sock *ssk)
134
@@ -XXX,XX +XXX,XX @@ static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
135
136
    __mptcp_rcvbuf_update(sk, ssk);
137
138
-    /* over limit? can't append more skbs to msk, Also, no need to wake-up*/
139
-    if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
140
-        return;
141
-
142
    /* Wake-up the reader only for in-sequence data */
143
    if (move_skbs_to_msk(msk, ssk) && mptcp_epollin_ready(sk))
144
        sk->sk_data_ready(sk);
145
@@ -XXX,XX +XXX,XX @@ bool mptcp_schedule_work(struct sock *sk)
146
    return false;
147
}
148
149
-static struct sock *mptcp_subflow_recv_lookup(const struct mptcp_sock *msk)
150
-{
151
-    struct mptcp_subflow_context *subflow;
152
-
153
-    msk_owned_by_me(msk);
154
-
155
-    mptcp_for_each_subflow(msk, subflow) {
156
-        if (READ_ONCE(subflow->data_avail))
157
-            return mptcp_subflow_tcp_sock(subflow);
158
-    }
159
-
160
-    return NULL;
161
-}
162
-
163
static bool mptcp_skb_can_collapse_to(u64 write_seq,
164
                 const struct sk_buff *skb,
165
                 const struct mptcp_ext *mpext)
166
@@ -XXX,XX +XXX,XX @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
167
    msk->rcvq_space.time = mstamp;
168
}
169
170
+static struct mptcp_subflow_context *
171
+__mptcp_first_ready_from(struct mptcp_sock *msk,
172
+             struct mptcp_subflow_context *subflow)
173
+{
174
+    struct mptcp_subflow_context *start_subflow = subflow;
175
+
176
+    while (!READ_ONCE(subflow->data_avail)) {
177
+        subflow = mptcp_next_subflow(msk, subflow);
178
+        if (subflow == start_subflow)
179
+            return NULL;
180
+    }
181
+    return subflow;
182
+}
183
+
184
static bool __mptcp_move_skbs(struct sock *sk)
185
{
186
    struct mptcp_subflow_context *subflow;
187
    struct mptcp_sock *msk = mptcp_sk(sk);
188
-    unsigned int moved = 0;
189
-    bool ret, done;
190
+    bool ret = false;
191
+
192
+    if (list_empty(&msk->conn_list))
193
+        return false;
194
195
    /* verify we can move any data from the subflow, eventually updating */
196
    if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK))
197
        mptcp_for_each_subflow(msk, subflow)
198
            __mptcp_rcvbuf_update(sk, subflow->tcp_sock);
199
200
-    if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
201
-        return false;
202
-
203
-    do {
204
-        struct sock *ssk = mptcp_subflow_recv_lookup(msk);
205
+    subflow = list_first_entry(&msk->conn_list,
206
+                 struct mptcp_subflow_context, node);
207
+    for (;;) {
208
+        struct sock *ssk;
209
        bool slowpath;
210
211
-        if (unlikely(!ssk))
212
+        /*
213
+         * As an optimization avoid traversing the subflows list
214
+         * and ev. acquiring the subflow socket lock before baling out
215
+         */
216
+        if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
217
            break;
218
219
-        slowpath = lock_sock_fast(ssk);
220
-        done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
221
+        subflow = __mptcp_first_ready_from(msk, subflow);
222
+        if (!subflow)
223
+            break;
224
225
+        ssk = mptcp_subflow_tcp_sock(subflow);
226
+        slowpath = lock_sock_fast(ssk);
227
+        ret = __mptcp_move_skbs_from_subflow(msk, ssk) || ret;
228
        if (unlikely(ssk->sk_err))
229
            __mptcp_error_report(sk);
230
        unlock_sock_fast(ssk, slowpath);
231
-    } while (!done);
232
233
-    ret = moved > 0 || __mptcp_ofo_queue(msk);
234
+        subflow = mptcp_next_subflow(msk, subflow);
235
+    }
236
+
237
+    __mptcp_ofo_queue(msk);
238
    if (ret)
239
        mptcp_check_data_fin((struct sock *)msk);
240
    return ret;
241
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
242
index XXXXXXX..XXXXXXX 100644
243
--- a/net/mptcp/protocol.h
244
+++ b/net/mptcp/protocol.h
245
@@ -XXX,XX +XXX,XX @@ struct mptcp_sock {
246
    list_for_each_entry(__subflow, &((__msk)->conn_list), node)
247
#define mptcp_for_each_subflow_safe(__msk, __subflow, __tmp)            \
248
    list_for_each_entry_safe(__subflow, __tmp, &((__msk)->conn_list), node)
249
+#define mptcp_next_subflow(__msk, __subflow)                \
250
+    list_next_entry_circular(__subflow, &((__msk)->conn_list), node)
251
252
extern struct genl_family mptcp_genl_family;
253
254
255
--
60
--
256
2.47.1
61
2.45.2
From: Paolo Abeni <pabeni@redhat.com>

After the RX path refactor, __mptcp_rmem() became a wrapper for
sk_rmem_alloc access, with a slightly misleading name. Just drop it.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 net/mptcp/protocol.c |  8 ++++----
 net/mptcp/protocol.h | 11 ++---------
 2 files changed, 6 insertions(+), 13 deletions(-)

From: Paolo Abeni <pabeni@redhat.com>

Under some corner cases the MPTCP protocol can end up invoking
mptcp_cleanup_rbuf() when no data has been copied, but such a helper
assumes the opposite condition.

Explicitly drop that assumption and perform the costly call only when
strictly needed - before releasing the msk socket lock.

Fixes: fd8976790a6c ("mptcp: be careful on MPTCP-level ack.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/mptcp/protocol.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)
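A simplified sketch of the intended behaviour for the second change above
(hypothetical helper name; the real logic is spread across
mptcp_cleanup_rbuf() and mptcp_subflow_cleanup_rbuf() in the hunks below): the
subflow-level window update is attempted only when some data was actually
copied, and the copied amount is propagated instead of a hard-coded value:

	static void example_subflow_cleanup_rbuf(struct sock *ssk, int copied)
	{
		bool slow;

		/* nothing copied: no window change worth advertising */
		if (!copied)
			return;

		slow = lock_sock_fast(ssk);
		if (tcp_can_send_ack(ssk))
			tcp_cleanup_rbuf(ssk, copied);
		unlock_sock_fast(ssk, slow);
	}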
13
13
14
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
14
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
15
index XXXXXXX..XXXXXXX 100644
15
index XXXXXXX..XXXXXXX 100644
16
--- a/net/mptcp/protocol.c
16
--- a/net/mptcp/protocol.c
17
+++ b/net/mptcp/protocol.c
17
+++ b/net/mptcp/protocol.c
18
@@ -XXX,XX +XXX,XX @@ static void mptcp_cleanup_rbuf(struct mptcp_sock *msk, int copied)
18
@@ -XXX,XX +XXX,XX @@ static void mptcp_send_ack(struct mptcp_sock *msk)
19
        mptcp_subflow_send_ack(mptcp_subflow_tcp_sock(subflow));
20
}
21
22
-static void mptcp_subflow_cleanup_rbuf(struct sock *ssk)
23
+static void mptcp_subflow_cleanup_rbuf(struct sock *ssk, int copied)
24
{
25
    bool slow;
26
27
    slow = lock_sock_fast(ssk);
28
    if (tcp_can_send_ack(ssk))
29
-        tcp_cleanup_rbuf(ssk, 1);
30
+        tcp_cleanup_rbuf(ssk, copied);
31
    unlock_sock_fast(ssk, slow);
32
}
33
34
@@ -XXX,XX +XXX,XX @@ static bool mptcp_subflow_could_cleanup(const struct sock *ssk, bool rx_empty)
35
             (ICSK_ACK_PUSHED2 | ICSK_ACK_PUSHED)));
36
}
37
38
-static void mptcp_cleanup_rbuf(struct mptcp_sock *msk)
39
+static void mptcp_cleanup_rbuf(struct mptcp_sock *msk, int copied)
40
{
41
    int old_space = READ_ONCE(msk->old_wspace);
42
    struct mptcp_subflow_context *subflow;
43
@@ -XXX,XX +XXX,XX @@ static void mptcp_cleanup_rbuf(struct mptcp_sock *msk)
44
    int space = __mptcp_space(sk);
19
    bool cleanup, rx_empty;
45
    bool cleanup, rx_empty;
20
46
21
    cleanup = (space > 0) && (space >= (old_space << 1)) && copied;
47
-    cleanup = (space > 0) && (space >= (old_space << 1));
22
-    rx_empty = !__mptcp_rmem(sk) && copied;
48
-    rx_empty = !__mptcp_rmem(sk);
23
+    rx_empty = !sk_rmem_alloc_get(sk) && copied;
49
+    cleanup = (space > 0) && (space >= (old_space << 1)) && copied;
50
+    rx_empty = !__mptcp_rmem(sk) && copied;
24
51
25
    mptcp_for_each_subflow(msk, subflow) {
52
    mptcp_for_each_subflow(msk, subflow) {
26
        struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
53
        struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
27
@@ -XXX,XX +XXX,XX @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
54
28
        WRITE_ONCE(tp->copied_seq, seq);
55
        if (cleanup || mptcp_subflow_could_cleanup(ssk, rx_empty))
29
        more_data_avail = mptcp_subflow_data_available(ssk);
56
-            mptcp_subflow_cleanup_rbuf(ssk);
30
57
+            mptcp_subflow_cleanup_rbuf(ssk, copied);
31
-        if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) {
58
    }
32
+        if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf) {
59
}
33
            done = true;
60
34
            break;
61
@@ -XXX,XX +XXX,XX @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
62
63
        copied += bytes_read;
64
65
-        /* be sure to advertise window change */
66
-        mptcp_cleanup_rbuf(msk);
67
-
68
        if (skb_queue_empty(&msk->receive_queue) && __mptcp_move_skbs(msk))
69
            continue;
70
71
@@ -XXX,XX +XXX,XX @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
35
        }
72
        }
36
@@ -XXX,XX +XXX,XX @@ static void __mptcp_data_ready(struct sock *sk, struct sock *ssk)
73
37
    __mptcp_rcvbuf_update(sk, ssk);
74
        pr_debug("block timeout %ld\n", timeo);
38
75
+        mptcp_cleanup_rbuf(msk, copied);
39
    /* over limit? can't append more skbs to msk, Also, no need to wake-up*/
76
        err = sk_wait_data(sk, &timeo, NULL);
40
-    if (__mptcp_rmem(sk) > sk->sk_rcvbuf)
77
        if (err < 0) {
41
+    if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
78
            err = copied ? : err;
42
        return;
79
@@ -XXX,XX +XXX,XX @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
43
80
        }
44
    /* Wake-up the reader only for in-sequence data */
81
    }
45
@@ -XXX,XX +XXX,XX @@ static bool __mptcp_move_skbs(struct sock *sk)
82
46
        mptcp_for_each_subflow(msk, subflow)
83
+    mptcp_cleanup_rbuf(msk, copied);
47
            __mptcp_rcvbuf_update(sk, subflow->tcp_sock);
84
+
48
85
out_err:
49
-    if (__mptcp_rmem(sk) > sk->sk_rcvbuf)
86
    if (cmsg_flags && copied >= 0) {
50
+    if (sk_rmem_alloc_get(sk) > sk->sk_rcvbuf)
87
        if (cmsg_flags & MPTCP_CMSG_TS)
51
        return false;
52
53
    do {
54
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
55
index XXXXXXX..XXXXXXX 100644
56
--- a/net/mptcp/protocol.h
57
+++ b/net/mptcp/protocol.h
58
@@ -XXX,XX +XXX,XX @@ static inline void msk_owned_by_me(const struct mptcp_sock *msk)
59
#define mptcp_sk(ptr) container_of_const(ptr, struct mptcp_sock, sk.icsk_inet.sk)
60
#endif
61
62
-/* the msk socket don't use the backlog, also account for the bulk
63
- * free memory
64
- */
65
-static inline int __mptcp_rmem(const struct sock *sk)
66
-{
67
-    return atomic_read(&sk->sk_rmem_alloc);
68
-}
69
-
70
static inline int mptcp_win_from_space(const struct sock *sk, int space)
71
{
72
    return __tcp_win_from_space(mptcp_sk(sk)->scaling_ratio, space);
73
@@ -XXX,XX +XXX,XX @@ static inline int mptcp_space_from_win(const struct sock *sk, int win)
74
75
static inline int __mptcp_space(const struct sock *sk)
76
{
77
-    return mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) - __mptcp_rmem(sk));
78
+    return mptcp_win_from_space(sk, READ_ONCE(sk->sk_rcvbuf) -
79
+                 sk_rmem_alloc_get(sk));
80
}
81
82
static inline struct mptcp_data_frag *mptcp_send_head(const struct sock *sk)
83
84
--
88
--
85
2.47.1
89
2.45.2
From: Paolo Abeni <pabeni@redhat.com>

Consolidate all the cleanup actions requiring the worker in a single
helper and ensure the dummy data fin creation for fallback socket is
performed only when the tcp rx queue is empty.

There are no functional changes intended, but this will simplify the
next patch, when the tcp rx queue spooling could be delayed at release_cb
time.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 net/mptcp/subflow.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)
17
13
18
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
14
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
...
...
78
-     mptcp_update_rcv_data_fin(msk, READ_ONCE(msk->ack_seq), true))
74
-     mptcp_update_rcv_data_fin(msk, READ_ONCE(msk->ack_seq), true))
79
-        mptcp_schedule_work(parent);
75
-        mptcp_schedule_work(parent);
80
}
76
}
81
77
82
void mptcp_subflow_queue_clean(struct sock *listener_sk, struct sock *listener_ssk)
78
void mptcp_subflow_queue_clean(struct sock *listener_sk, struct sock *listener_ssk)
83
84
--
79
--
85
2.47.1
80
2.45.2
From: Paolo Abeni <pabeni@redhat.com>

After commit c2e6048fa1cf ("mptcp: fix race in release_cb") we can
move the whole MPTCP rx path under the socket lock leveraging the
release_cb.

We can drop a bunch of spin_lock pairs in the receive functions, use
...
...
19
When the skbs move is delayed at msk release callback time, even the
msk rcvbuf update is delayed; additionally take care of such action in
__mptcp_move_skbs().

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
v1 -> v2:
 - cleanup && fixup msk sk_rcvbuf update
 - fix missed wakeup due to badly placed __mptcp_move_skbs(); move it
   from __mptcp_recvmsg_mskq() into mptcp_recvmsg()
 - added missing '\n' in debug message format string
 - keep 'snd_una' && friends update under the mptcp data lock
---
 net/mptcp/fastopen.c |   1 +
 net/mptcp/protocol.c | 123 ++++++++++++++++++++++++---------------------------
 net/mptcp/protocol.h |   2 +-
 3 files changed, 60 insertions(+), 66 deletions(-)
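As a hedged sketch of the locking scheme described above (the flag name and
helper shown here are made up; the real plumbing lives in mptcp_data_ready()
and mptcp_release_cb() in the patch): when the msk socket lock is owned by
user space, the subflow data-ready path only records that RX work is pending,
and the skbs are spooled later from the release callback, still under the msk
socket lock:

	static void example_data_ready(struct sock *sk, struct sock *ssk)
	{
		struct mptcp_sock *msk = mptcp_sk(sk);

		mptcp_data_lock(sk);
		if (!sock_owned_by_user(sk)) {
			/* lock owner: spool the subflow queue right away */
			if (move_skbs_to_msk(msk, ssk))
				sk->sk_data_ready(sk);
		} else {
			/* defer to the release_cb, run at release_sock() time */
			__set_bit(EXAMPLE_DEQUEUE_PENDING, &msk->cb_flags);
		}
		mptcp_data_unlock(sk);
	}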
31
34
32
diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
35
diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
33
index XXXXXXX..XXXXXXX 100644
36
index XXXXXXX..XXXXXXX 100644
34
--- a/net/mptcp/fastopen.c
37
--- a/net/mptcp/fastopen.c
35
+++ b/net/mptcp/fastopen.c
38
+++ b/net/mptcp/fastopen.c
36
@@ -XXX,XX +XXX,XX @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
39
@@ -XXX,XX +XXX,XX @@ void mptcp_fastopen_subflow_synack_set_params(struct mptcp_subflow_context *subf
37
    MPTCP_SKB_CB(skb)->cant_coalesce = 1;
40
    MPTCP_SKB_CB(skb)->has_rxtstamp = TCP_SKB_CB(skb)->has_rxtstamp;
38
41
39
    mptcp_data_lock(sk);
42
    mptcp_data_lock(sk);
40
+    DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
43
+    DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
41
44
42
    mptcp_set_owner_r(skb, sk);
45
    mptcp_set_owner_r(skb, sk);
43
    __skb_queue_tail(&sk->sk_receive_queue, skb);
46
    __skb_queue_tail(&sk->sk_receive_queue, skb);
47
@@ -XXX,XX +XXX,XX @@ void __mptcp_fastopen_gen_msk_ackseq(struct mptcp_sock *msk, struct mptcp_subflo
48
    struct sock *sk = (struct sock *)msk;
49
    struct sk_buff *skb;
50
51
+    DEBUG_NET_WARN_ON_ONCE(sock_owned_by_user_nocheck(sk));
52
    skb = skb_peek_tail(&sk->sk_receive_queue);
53
    if (skb) {
54
        WARN_ON_ONCE(MPTCP_SKB_CB(skb)->end_seq);
44
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
55
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
45
index XXXXXXX..XXXXXXX 100644
56
index XXXXXXX..XXXXXXX 100644
46
--- a/net/mptcp/protocol.c
57
--- a/net/mptcp/protocol.c
47
+++ b/net/mptcp/protocol.c
58
+++ b/net/mptcp/protocol.c
48
@@ -XXX,XX +XXX,XX @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
59
@@ -XXX,XX +XXX,XX @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
...
...
346
    struct rb_root out_of_order_queue;
357
    struct rb_root out_of_order_queue;
347
-    struct sk_buff_head receive_queue;
358
-    struct sk_buff_head receive_queue;
348
    struct list_head conn_list;
359
    struct list_head conn_list;
349
    struct list_head rtx_queue;
360
    struct list_head rtx_queue;
350
    struct mptcp_data_frag *first_pending;
361
    struct mptcp_data_frag *first_pending;
351
352
--
362
--
353
2.47.1
363
2.45.2
From: Paolo Abeni <pabeni@redhat.com>

After the previous patch, updating sk_forward_memory is cheap and
we can drop a lot of complexity from the MPTCP memory accounting,
removing the custom fwd mem allocations for rmem.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
v1 -> v2:
 - keep 'snd_una' and recovery-related fields under the msk data lock
 - dropped unneeded code in __mptcp_move_skbs()
---
 net/mptcp/fastopen.c |   2 +-
 net/mptcp/protocol.c | 115 ++++-----------------------------------------
 net/mptcp/protocol.h |   4 +-
 3 files changed, 10 insertions(+), 111 deletions(-)
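For context, a hedged sketch of the generic socket receive-memory helpers that
the MPTCP-specific scheme converges toward once the custom rmem forward
allocation is gone (this is the standard kernel pattern, not a copy of the
patched call sites): charging an skb to the msk updates sk_rmem_alloc and
sk_forward_alloc in one place, and the uncharge happens from the skb
destructor:

	static void example_rx_charge(struct sock *sk, struct sk_buff *skb)
	{
		/* adds skb->truesize to sk->sk_rmem_alloc and charges
		 * sk->sk_forward_alloc via sk_mem_charge()
		 */
		skb_set_owner_r(skb, sk);
	}

	static void example_rx_uncharge(struct sock *sk, struct sk_buff *skb)
	{
		/* what sock_rfree() does when the queued skb is freed */
		atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
		sk_mem_uncharge(sk, skb->truesize);
	}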
15
16
16
diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
17
diff --git a/net/mptcp/fastopen.c b/net/mptcp/fastopen.c
17
index XXXXXXX..XXXXXXX 100644
18
index XXXXXXX..XXXXXXX 100644
...
...
285
-    return atomic_read(&sk->sk_rmem_alloc) - READ_ONCE(mptcp_sk(sk)->rmem_released);
286
-    return atomic_read(&sk->sk_rmem_alloc) - READ_ONCE(mptcp_sk(sk)->rmem_released);
286
+    return atomic_read(&sk->sk_rmem_alloc);
287
+    return atomic_read(&sk->sk_rmem_alloc);
287
}
288
}
288
289
289
static inline int mptcp_win_from_space(const struct sock *sk, int space)
290
static inline int mptcp_win_from_space(const struct sock *sk, int space)
290
291
--
291
--
292
2.47.1
292
2.45.2
From: Paolo Abeni <pabeni@redhat.com>

After the previous patch we can remove the forward_alloc_get
proto callback, basically reverting commit 292e6077b040 ("net: introduce
sk_forward_alloc_get()") and commit 66d58f046c9d ("net: use
sk_forward_alloc_get() in sk_get_meminfo()").

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
 include/net/sock.h   | 13 -------------
 net/core/sock.c      |  2 +-
 net/ipv4/af_inet.c   |  2 +-
 net/ipv4/inet_diag.c |  2 +-
...
...
97
-    dst->value = sk_forward_alloc_get(sk);
93
-    dst->value = sk_forward_alloc_get(sk);
98
+    dst->value = READ_ONCE(sk->sk_forward_alloc);
94
+    dst->value = READ_ONCE(sk->sk_forward_alloc);
99
}
95
}
100
96
101
META_COLLECTOR(int_sk_sndbuf)
97
META_COLLECTOR(int_sk_sndbuf)
102
103
--
98
--
104
2.47.1
99
2.45.2