[PATCH mptcp-next] mptcp: enforce HoL-blocking estimation

Paolo Abeni posted 1 patch 1 month, 2 weeks ago
git fetch https://github.com/multipath-tcp/mptcp_net-next tags/patchew/22cda018a37459d99683e572e35ac61bbc43fcae.1634900440.git.pabeni@redhat.com
Maintainers: "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Mat Martineau <mathew.j.martineau@linux.intel.com>, Matthieu Baerts <matthieu.baerts@tessares.net>

Posted by Paolo Abeni 1 month, 2 weeks ago
The MPTCP packet scheduler has sub-optimal behavior with asymmetric
subflows: if the faster subflow-level cwin is closed, the packet
scheduler can enqueue "too much" data on a slower subflow.

When all the data on the faster subflow is acked, the mptcp-level
cwin may still be closed, and link utilization becomes suboptimal.

The solution is implementing a blest-like [1] HoL-blocking estimation,
transmitting only on the subflow with the shorter estimated time to
flush the queued memory. If such subflow's cwin is closed, we wait
even if other subflows are available.

This is quite a bit simpler than the original blest implementation, as
we leverage the pacing rate provided by the TCP socket. To get a more
accurate estimation of the subflow linger time, we maintain a
per-subflow weighted average of such info.

Additionally, drop the use of magic numbers in favor of newly defined
macros.

[1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
notes:
- this apparently solves issue/137 for good: more than 200 iterations
with no failures
- still to be investigated the impact on high-speed links, if any
---
 net/mptcp/protocol.c | 58 ++++++++++++++++++++++++++++++--------------
 net/mptcp/protocol.h |  1 +
 2 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 7803b0dbb1be..cc9d32cb7bc7 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1395,20 +1395,24 @@ bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
 	return __mptcp_subflow_active(subflow);
 }
 
+#define SSK_MODE_ACTIVE	0
+#define SSK_MODE_BACKUP	1
+#define SSK_MODE_MAX	2
+
 /* implement the mptcp packet scheduler;
  * returns the subflow that will transmit the next DSS
  * additionally updates the rtx timeout
  */
 static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 {
-	struct subflow_send_info send_info[2];
+	struct subflow_send_info send_info[SSK_MODE_MAX];
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
+	u32 pace, burst, wmem;
 	int i, nr_active = 0;
 	struct sock *ssk;
 	long tout = 0;
 	u64 ratio;
-	u32 pace;
 
 	sock_owned_by_me(sk);
 
@@ -1427,10 +1431,11 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 	}
 
 	/* pick the subflow with the lower wmem/wspace ratio */
-	for (i = 0; i < 2; ++i) {
+	for (i = 0; i < SSK_MODE_MAX; ++i) {
 		send_info[i].ssk = NULL;
 		send_info[i].ratio = -1;
 	}
+
 	mptcp_for_each_subflow(msk, subflow) {
 		trace_mptcp_subflow_get_send(subflow);
 		ssk =  mptcp_subflow_tcp_sock(subflow);
@@ -1439,12 +1444,13 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 
 		tout = max(tout, mptcp_timeout_from_subflow(subflow));
 		nr_active += !subflow->backup;
-		if (!sk_stream_memory_free(subflow->tcp_sock) || !tcp_sk(ssk)->snd_wnd)
-			continue;
-
-		pace = READ_ONCE(ssk->sk_pacing_rate);
-		if (!pace)
-			continue;
+		pace = subflow->avg_pacing_rate;
+		if (unlikely(!pace)) {
+			/* init pacing rate from socket */
+			pace = subflow->avg_pacing_rate = READ_ONCE(ssk->sk_pacing_rate);
+			if (!pace)
+				continue;
+		}
 
 		ratio = div_u64((u64)READ_ONCE(ssk->sk_wmem_queued) << 32,
 				pace);
@@ -1457,16 +1463,32 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 
 	/* pick the best backup if no other subflow is active */
 	if (!nr_active)
-		send_info[0].ssk = send_info[1].ssk;
-
-	if (send_info[0].ssk) {
-		msk->last_snd = send_info[0].ssk;
-		msk->snd_burst = min_t(int, MPTCP_SEND_BURST_SIZE,
-				       tcp_sk(msk->last_snd)->snd_wnd);
-		return msk->last_snd;
-	}
+		send_info[SSK_MODE_ACTIVE].ssk = send_info[SSK_MODE_BACKUP].ssk;
+
+	/* According to the blest algorithm, to avoid HoL blocking for the
+	 * faster flow, we need to:
+	 * - estimate the faster flow linger time
+	 * - use the above to estimate the amount of bytes transferred
+	 *   by the faster flow
+	 * - check that the amount of queued data is greater than the above,
+	 *   otherwise do not use the picked, slower, subflow
+	 * We select the subflow with the shorter estimated time to flush
+	 * the queued mem, which basically ensures the above. We just need
+	 * to check that the subflow has a non-empty cwin.
+	 */
+	ssk = send_info[SSK_MODE_ACTIVE].ssk;
+	if (!ssk || !sk_stream_memory_free(ssk) || !tcp_sk(ssk)->snd_wnd)
+		return NULL;
 
-	return NULL;
+	burst = min_t(int, MPTCP_SEND_BURST_SIZE, tcp_sk(ssk)->snd_wnd);
+	wmem = READ_ONCE(ssk->sk_wmem_queued);
+	subflow = mptcp_subflow_ctx(ssk);
+	subflow->avg_pacing_rate = div_u64((u64)subflow->avg_pacing_rate * wmem +
+					   READ_ONCE(ssk->sk_pacing_rate) * burst,
+					   burst + wmem);
+	msk->last_snd = ssk;
+	msk->snd_burst = burst;
+	return ssk;
 }
 
 static void mptcp_push_release(struct sock *ssk, struct mptcp_sendmsg_info *info)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 67a61ac48b20..46691acdea24 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -391,6 +391,7 @@ DECLARE_PER_CPU(struct mptcp_delegated_action, mptcp_delegated_actions);
 /* MPTCP subflow context */
 struct mptcp_subflow_context {
 	struct	list_head node;/* conn_list of subflows */
+	unsigned long avg_pacing_rate; /* protected by msk socket lock */
 	u64	local_key;
 	u64	remote_key;
 	u64	idsn;
-- 
2.26.3


Re: [PATCH mptcp-next] mptcp: enforce HoL-blocking estimation

Posted by Matthieu Baerts 1 month, 1 week ago
Hi Paolo,

On 22/10/2021 13:38, Paolo Abeni wrote:
> The MPTCP packet scheduler has sub-optimal behavior with asymmetric
> subflows: if the faster subflow-level cwin is closed, the packet
> scheduler can enqueue "too much" data on a slower subflow.
> 
> When all the data on the faster subflow is acked, if the mptcp-level
> cwin is closed, and link utilization becomes suboptimal.
> 
> The solution is implementing blest-like[1] HoL-blocking estimation,
> transmitting only on the subflow with the shorter estimated time to
> flush the queued memory. If such subflows cwin is closed, we wait
> even if other subflows are available.
> 
> This is quite simpler than the original blest implementation, as we
> leverage the pacing rate provided by the TCP socket. To get a more
> accurate estimation for the subflow linger-time, we maintain a
> per-subflow weighted average of such info.
> 
> Additionally drop magic numbers usage in favor of newly defined
> macros.
> 
> [1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf

Thank you for looking at that!

It improves the situation a lot! I ran the new version in a slow
environment with a "debug" kernel and it only failed after 806 attempts!
(more than 16 hours ;-) )

# unbalanced bwidth with unbalanced delay: transfer slower than
expected! runtime 4011 ms, expected 4005 ms max 4005        [ fail ]

I usually have:

3923 max 4005
3922 max 4005
3872 max 4005
3880 max 4005
3801 max 4005

So all good I guess. We can then close 137, no?

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/137
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

Re: [PATCH mptcp-next] mptcp: enforce HoL-blocking estimation

Posted by Mat Martineau 1 month, 1 week ago
On Fri, 22 Oct 2021, Paolo Abeni wrote:

> The MPTCP packet scheduler has sub-optimal behavior with asymmetric
> subflows: if the faster subflow-level cwin is closed, the packet
> scheduler can enqueue "too much" data on a slower subflow.
>
> When all the data on the faster subflow is acked, if the mptcp-level
> cwin is closed, and link utilization becomes suboptimal.
>
> The solution is implementing blest-like[1] HoL-blocking estimation,
> transmitting only on the subflow with the shorter estimated time to
> flush the queued memory. If such subflows cwin is closed, we wait
> even if other subflows are available.
>
> This is quite simpler than the original blest implementation, as we
> leverage the pacing rate provided by the TCP socket. To get a more
> accurate estimation for the subflow linger-time, we maintain a
> per-subflow weighted average of such info.
>
> Additionally drop magic numbers usage in favor of newly defined
> macros.
>
> [1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> notes:
> - this apparently solves for good issue/137, with > 200 iterations with
> no failures
> - still to be investigated the impact on high-speed links, if any
> ---
> net/mptcp/protocol.c | 58 ++++++++++++++++++++++++++++++--------------
> net/mptcp/protocol.h |  1 +
> 2 files changed, 41 insertions(+), 18 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 7803b0dbb1be..cc9d32cb7bc7 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -1395,20 +1395,24 @@ bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
> 	return __mptcp_subflow_active(subflow);
> }
>
> +#define SSK_MODE_ACTIVE	0
> +#define SSK_MODE_BACKUP	1
> +#define SSK_MODE_MAX	2
> +
> /* implement the mptcp packet scheduler;
>  * returns the subflow that will transmit the next DSS
>  * additionally updates the rtx timeout
>  */
> static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> {
> -	struct subflow_send_info send_info[2];
> +	struct subflow_send_info send_info[SSK_MODE_MAX];
> 	struct mptcp_subflow_context *subflow;
> 	struct sock *sk = (struct sock *)msk;
> +	u32 pace, burst, wmem;
> 	int i, nr_active = 0;
> 	struct sock *ssk;
> 	long tout = 0;
> 	u64 ratio;
> -	u32 pace;
>
> 	sock_owned_by_me(sk);
>
> @@ -1427,10 +1431,11 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> 	}
>
> 	/* pick the subflow with the lower wmem/wspace ratio */
> -	for (i = 0; i < 2; ++i) {
> +	for (i = 0; i < SSK_MODE_MAX; ++i) {
> 		send_info[i].ssk = NULL;
> 		send_info[i].ratio = -1;
> 	}
> +
> 	mptcp_for_each_subflow(msk, subflow) {
> 		trace_mptcp_subflow_get_send(subflow);
> 		ssk =  mptcp_subflow_tcp_sock(subflow);
> @@ -1439,12 +1444,13 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
>
> 		tout = max(tout, mptcp_timeout_from_subflow(subflow));
> 		nr_active += !subflow->backup;
> -		if (!sk_stream_memory_free(subflow->tcp_sock) || !tcp_sk(ssk)->snd_wnd)
> -			continue;
> -
> -		pace = READ_ONCE(ssk->sk_pacing_rate);
> -		if (!pace)
> -			continue;
> +		pace = subflow->avg_pacing_rate;
> +		if (unlikely(!pace)) {
> +			/* init pacing rate from socket */
> +			pace = subflow->avg_pacing_rate = READ_ONCE(ssk->sk_pacing_rate);

Checkpatch complains about the double assignment. I did do a double-take 
to confirm that it didn't look like a '==' typo, and slightly lean toward 
keeping it checkpatch-clean.

> +			if (!pace)
> +				continue;
> +		}
>
> 		ratio = div_u64((u64)READ_ONCE(ssk->sk_wmem_queued) << 32,
> 				pace);
> @@ -1457,16 +1463,32 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
>
> 	/* pick the best backup if no other subflow is active */
> 	if (!nr_active)
> -		send_info[0].ssk = send_info[1].ssk;
> -
> -	if (send_info[0].ssk) {
> -		msk->last_snd = send_info[0].ssk;
> -		msk->snd_burst = min_t(int, MPTCP_SEND_BURST_SIZE,
> -				       tcp_sk(msk->last_snd)->snd_wnd);
> -		return msk->last_snd;
> -	}
> +		send_info[SSK_MODE_ACTIVE].ssk = send_info[SSK_MODE_BACKUP].ssk;
> +
> +	/* According to the blest algorithm, to avoid HoL blocking for the
> +	 * faster flow, we need to:
> +	 * - estimate the faster flow linger time
> +	 * - use the above to estimate the amount of byte transferred
> +	 *   by the faster flow
> +	 * - check that the amount of queued data is greter than the above,
> +	 *   otherwise do not use the picked, slower, subflow
> +	 * We select the subflow with the shorter estimated time to flush
> +	 * the queued mem, which basically ensure the above. We just need
> +	 * to check that subflow has a non empty cwin.
> +	 */
> +	ssk = send_info[SSK_MODE_ACTIVE].ssk;
> +	if (!ssk || !sk_stream_memory_free(ssk) || !tcp_sk(ssk)->snd_wnd)
> +		return NULL;
>
> -	return NULL;
> +	burst = min_t(int, MPTCP_SEND_BURST_SIZE, tcp_sk(ssk)->snd_wnd);
> +	wmem = READ_ONCE(ssk->sk_wmem_queued);
> +	subflow = mptcp_subflow_ctx(ssk);
> +	subflow->avg_pacing_rate = div_u64((u64)subflow->avg_pacing_rate * wmem +
> +					   READ_ONCE(ssk->sk_pacing_rate) * burst,
> +					   burst + wmem);

I see how this gives a smoothed-out version of the pacing rate, using 
avg_pacing_rate for the bytes already sent on this subflow (wmem) and 
ssk->sk_pacing_rate for the bytes that will probably be sent now.

Is this a *better* estimation of the flow speed than the plain 
ssk->sk_pacing_rate? If there is a change in speed, it will be slow to 
respond to throughput changes. I think you could remove avg_pacing_rate 
modifications and include the expected burst length in the calculation 
earlier (in the mptcp_for_each_subflow() loop) to implement the blest-like 
behavior. Maybe rename subflow_send_info.ratio to 
subflow_send_info.time_to_flush.

What do you think?

> +	msk->last_snd = ssk;
> +	msk->snd_burst = burst;
> +	return ssk;
> }
>
> static void mptcp_push_release(struct sock *ssk, struct mptcp_sendmsg_info *info)
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index 67a61ac48b20..46691acdea24 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -391,6 +391,7 @@ DECLARE_PER_CPU(struct mptcp_delegated_action, mptcp_delegated_actions);
> /* MPTCP subflow context */
> struct mptcp_subflow_context {
> 	struct	list_head node;/* conn_list of subflows */
> +	unsigned long avg_pacing_rate; /* protected by msk socket lock */
> 	u64	local_key;
> 	u64	remote_key;
> 	u64	idsn;
> -- 
> 2.26.3
>
>
>

--
Mat Martineau
Intel

Re: [PATCH mptcp-next] mptcp: enforce HoL-blocking estimation

Posted by Paolo Abeni 1 month, 1 week ago
On Fri, 2021-10-22 at 18:19 -0700, Mat Martineau wrote:
> On Fri, 22 Oct 2021, Paolo Abeni wrote:
> 
> > The MPTCP packet scheduler has sub-optimal behavior with asymmetric
> > subflows: if the faster subflow-level cwin is closed, the packet
> > scheduler can enqueue "too much" data on a slower subflow.
> > 
> > When all the data on the faster subflow is acked, if the mptcp-level
> > cwin is closed, and link utilization becomes suboptimal.
> > 
> > The solution is implementing blest-like[1] HoL-blocking estimation,
> > transmitting only on the subflow with the shorter estimated time to
> > flush the queued memory. If such subflows cwin is closed, we wait
> > even if other subflows are available.
> > 
> > This is quite simpler than the original blest implementation, as we
> > leverage the pacing rate provided by the TCP socket. To get a more
> > accurate estimation for the subflow linger-time, we maintain a
> > per-subflow weighted average of such info.
> > 
> > Additionally drop magic numbers usage in favor of newly defined
> > macros.
> > 
> > [1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf
> > 
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > notes:
> > - this apparently solves for good issue/137, with > 200 iterations with
> > no failures
> > - still to be investigated the impact on high-speed links, if any
> > ---
> > net/mptcp/protocol.c | 58 ++++++++++++++++++++++++++++++--------------
> > net/mptcp/protocol.h |  1 +
> > 2 files changed, 41 insertions(+), 18 deletions(-)
> > 
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index 7803b0dbb1be..cc9d32cb7bc7 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -1395,20 +1395,24 @@ bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
> > 	return __mptcp_subflow_active(subflow);
> > }
> > 
> > +#define SSK_MODE_ACTIVE	0
> > +#define SSK_MODE_BACKUP	1
> > +#define SSK_MODE_MAX	2
> > +
> > /* implement the mptcp packet scheduler;
> >  * returns the subflow that will transmit the next DSS
> >  * additionally updates the rtx timeout
> >  */
> > static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> > {
> > -	struct subflow_send_info send_info[2];
> > +	struct subflow_send_info send_info[SSK_MODE_MAX];
> > 	struct mptcp_subflow_context *subflow;
> > 	struct sock *sk = (struct sock *)msk;
> > +	u32 pace, burst, wmem;
> > 	int i, nr_active = 0;
> > 	struct sock *ssk;
> > 	long tout = 0;
> > 	u64 ratio;
> > -	u32 pace;
> > 
> > 	sock_owned_by_me(sk);
> > 
> > @@ -1427,10 +1431,11 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> > 	}
> > 
> > 	/* pick the subflow with the lower wmem/wspace ratio */
> > -	for (i = 0; i < 2; ++i) {
> > +	for (i = 0; i < SSK_MODE_MAX; ++i) {
> > 		send_info[i].ssk = NULL;
> > 		send_info[i].ratio = -1;
> > 	}
> > +
> > 	mptcp_for_each_subflow(msk, subflow) {
> > 		trace_mptcp_subflow_get_send(subflow);
> > 		ssk =  mptcp_subflow_tcp_sock(subflow);
> > @@ -1439,12 +1444,13 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> > 
> > 		tout = max(tout, mptcp_timeout_from_subflow(subflow));
> > 		nr_active += !subflow->backup;
> > -		if (!sk_stream_memory_free(subflow->tcp_sock) || !tcp_sk(ssk)->snd_wnd)
> > -			continue;
> > -
> > -		pace = READ_ONCE(ssk->sk_pacing_rate);
> > -		if (!pace)
> > -			continue;
> > +		pace = subflow->avg_pacing_rate;
> > +		if (unlikely(!pace)) {
> > +			/* init pacing rate from socket */
> > +			pace = subflow->avg_pacing_rate = READ_ONCE(ssk->sk_pacing_rate);
> 
> Checkpatch complains about the double assignment. I did do a double-take 
> to confirm that it didn't look like a '==' typo, and slightly lean toward 
> keeping it checkpatch-clean.

Yep, I'll fix the above in v2.

> > +			if (!pace)
> > +				continue;
> > +		}
> > 
> > 		ratio = div_u64((u64)READ_ONCE(ssk->sk_wmem_queued) << 32,
> > 				pace);
> > @@ -1457,16 +1463,32 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> > 
> > 	/* pick the best backup if no other subflow is active */
> > 	if (!nr_active)
> > -		send_info[0].ssk = send_info[1].ssk;
> > -
> > -	if (send_info[0].ssk) {
> > -		msk->last_snd = send_info[0].ssk;
> > -		msk->snd_burst = min_t(int, MPTCP_SEND_BURST_SIZE,
> > -				       tcp_sk(msk->last_snd)->snd_wnd);
> > -		return msk->last_snd;
> > -	}
> > +		send_info[SSK_MODE_ACTIVE].ssk = send_info[SSK_MODE_BACKUP].ssk;
> > +
> > +	/* According to the blest algorithm, to avoid HoL blocking for the
> > +	 * faster flow, we need to:
> > +	 * - estimate the faster flow linger time
> > +	 * - use the above to estimate the amount of byte transferred
> > +	 *   by the faster flow
> > +	 * - check that the amount of queued data is greter than the above,
> > +	 *   otherwise do not use the picked, slower, subflow
> > +	 * We select the subflow with the shorter estimated time to flush
> > +	 * the queued mem, which basically ensure the above. We just need
> > +	 * to check that subflow has a non empty cwin.
> > +	 */
> > +	ssk = send_info[SSK_MODE_ACTIVE].ssk;
> > +	if (!ssk || !sk_stream_memory_free(ssk) || !tcp_sk(ssk)->snd_wnd)
> > +		return NULL;
> > 
> > -	return NULL;
> > +	burst = min_t(int, MPTCP_SEND_BURST_SIZE, tcp_sk(ssk)->snd_wnd);
> > +	wmem = READ_ONCE(ssk->sk_wmem_queued);
> > +	subflow = mptcp_subflow_ctx(ssk);
> > +	subflow->avg_pacing_rate = div_u64((u64)subflow->avg_pacing_rate * wmem +
> > +					   READ_ONCE(ssk->sk_pacing_rate) * burst,
> > +					   burst + wmem);
> 
> I see how this gives a smoothed-out version of the pacing rate, using 
> avg_pacing_rate for the bytes already sent on this subflow (wmem) and 
> ssk->sk_pacing_rate for the bytes that will probably be sent now.
> 
> Is this a *better* estimation of the flow speed than the plain 
> ssk->sk_pacing_rate?

It looks like it is. I get a consistent (higher) failure rate using
plain sk_pacing_rate instead. 'sk_pacing_rate' is an 'instantaneous'
value applied to the current packet; it can change significantly in a
short amount of time.

AFAIK, the weighted average above is a good approximation of what
happens: TCP assigns to each packet a minimum departure time dependent
on the current rate. That departure time is then enforced by the tc
scheduler. Even if the available b/w changes (increases) in between,
the packet will still respect the departure (rate) constraint.

> If there is a change in speed, it will be slow to 
> respond to throughput changes. I think you could remove avg_pacing_rate 
> modifications and include the expected burst length in the calculation 
> earlier (in the mptcp_for_each_subflow() loop) to implement the blest-like 
> behavior. Maybe rename subflow_send_info.ratio to 
> subflow_send_info.time_to_flush.
> 
> What do you think?

Field rename will make the code more readable, agreed!

I'm not sure I follow the other part correctly?!? Do you mean replacing:

	ratio = div_u64((u64)READ_ONCE(ssk->sk_wmem_queued) << 32, subflow->avg_pacing_rate);

with:

	ratio = div_u64((u64)(READ_ONCE(ssk->sk_wmem_queued) + burst) << 32, ssk->sk_pacing_rate);

?

In my experiments, adding 'burst' to the first expression does not
change the results at all, while replacing the average pacing rate with
sk_pacing_rate causes a relevant failure rate increase.

I'm reasonably sure the above change will not be for the better.

Cheers,

Paolo


Re: [PATCH mptcp-next] mptcp: enforce HoL-blocking estimation

Posted by Mat Martineau 1 month, 1 week ago
On Mon, 25 Oct 2021, Paolo Abeni wrote:

> On Fri, 2021-10-22 at 18:19 -0700, Mat Martineau wrote:
>> On Fri, 22 Oct 2021, Paolo Abeni wrote:
>>
>>> The MPTCP packet scheduler has sub-optimal behavior with asymmetric
>>> subflows: if the faster subflow-level cwin is closed, the packet
>>> scheduler can enqueue "too much" data on a slower subflow.
>>>
>>> When all the data on the faster subflow is acked, if the mptcp-level
>>> cwin is closed, and link utilization becomes suboptimal.
>>>
>>> The solution is implementing blest-like[1] HoL-blocking estimation,
>>> transmitting only on the subflow with the shorter estimated time to
>>> flush the queued memory. If such subflows cwin is closed, we wait
>>> even if other subflows are available.
>>>
>>> This is quite simpler than the original blest implementation, as we
>>> leverage the pacing rate provided by the TCP socket. To get a more
>>> accurate estimation for the subflow linger-time, we maintain a
>>> per-subflow weighted average of such info.
>>>
>>> Additionally drop magic numbers usage in favor of newly defined
>>> macros.
>>>
>>> [1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf
>>>
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>> notes:
>>> - this apparently solves for good issue/137, with > 200 iterations with
>>> no failures
>>> - still to be investigated the impact on high-speed links, if any
>>> ---
>>> net/mptcp/protocol.c | 58 ++++++++++++++++++++++++++++++--------------
>>> net/mptcp/protocol.h |  1 +
>>> 2 files changed, 41 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
>>> index 7803b0dbb1be..cc9d32cb7bc7 100644
>>> --- a/net/mptcp/protocol.c
>>> +++ b/net/mptcp/protocol.c
>>> @@ -1395,20 +1395,24 @@ bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
>>> 	return __mptcp_subflow_active(subflow);
>>> }
>>>
>>> +#define SSK_MODE_ACTIVE	0
>>> +#define SSK_MODE_BACKUP	1
>>> +#define SSK_MODE_MAX	2
>>> +
>>> /* implement the mptcp packet scheduler;
>>>  * returns the subflow that will transmit the next DSS
>>>  * additionally updates the rtx timeout
>>>  */
>>> static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
>>> {
>>> -	struct subflow_send_info send_info[2];
>>> +	struct subflow_send_info send_info[SSK_MODE_MAX];
>>> 	struct mptcp_subflow_context *subflow;
>>> 	struct sock *sk = (struct sock *)msk;
>>> +	u32 pace, burst, wmem;
>>> 	int i, nr_active = 0;
>>> 	struct sock *ssk;
>>> 	long tout = 0;
>>> 	u64 ratio;
>>> -	u32 pace;
>>>
>>> 	sock_owned_by_me(sk);
>>>
>>> @@ -1427,10 +1431,11 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
>>> 	}
>>>
>>> 	/* pick the subflow with the lower wmem/wspace ratio */
>>> -	for (i = 0; i < 2; ++i) {
>>> +	for (i = 0; i < SSK_MODE_MAX; ++i) {
>>> 		send_info[i].ssk = NULL;
>>> 		send_info[i].ratio = -1;
>>> 	}
>>> +
>>> 	mptcp_for_each_subflow(msk, subflow) {
>>> 		trace_mptcp_subflow_get_send(subflow);
>>> 		ssk =  mptcp_subflow_tcp_sock(subflow);
>>> @@ -1439,12 +1444,13 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
>>>
>>> 		tout = max(tout, mptcp_timeout_from_subflow(subflow));
>>> 		nr_active += !subflow->backup;
>>> -		if (!sk_stream_memory_free(subflow->tcp_sock) || !tcp_sk(ssk)->snd_wnd)
>>> -			continue;
>>> -
>>> -		pace = READ_ONCE(ssk->sk_pacing_rate);
>>> -		if (!pace)
>>> -			continue;
>>> +		pace = subflow->avg_pacing_rate;
>>> +		if (unlikely(!pace)) {
>>> +			/* init pacing rate from socket */
>>> +			pace = subflow->avg_pacing_rate = READ_ONCE(ssk->sk_pacing_rate);
>>
>> Checkpatch complains about the double assignment. I did do a double-take
>> to confirm that it didn't look like a '==' typo, and slightly lean toward
>> keeping it checkpatch-clean.
>
> Yep, I'll fix the above in v2.
>
>>> +			if (!pace)
>>> +				continue;
>>> +		}
>>>
>>> 		ratio = div_u64((u64)READ_ONCE(ssk->sk_wmem_queued) << 32,
>>> 				pace);
>>> @@ -1457,16 +1463,32 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
>>>
>>> 	/* pick the best backup if no other subflow is active */
>>> 	if (!nr_active)
>>> -		send_info[0].ssk = send_info[1].ssk;
>>> -
>>> -	if (send_info[0].ssk) {
>>> -		msk->last_snd = send_info[0].ssk;
>>> -		msk->snd_burst = min_t(int, MPTCP_SEND_BURST_SIZE,
>>> -				       tcp_sk(msk->last_snd)->snd_wnd);
>>> -		return msk->last_snd;
>>> -	}
>>> +		send_info[SSK_MODE_ACTIVE].ssk = send_info[SSK_MODE_BACKUP].ssk;
>>> +
>>> +	/* According to the blest algorithm, to avoid HoL blocking for the
>>> +	 * faster flow, we need to:
>>> +	 * - estimate the faster flow linger time
>>> +	 * - use the above to estimate the amount of byte transferred
>>> +	 *   by the faster flow
>>> +	 * - check that the amount of queued data is greter than the above,
>>> +	 *   otherwise do not use the picked, slower, subflow
>>> +	 * We select the subflow with the shorter estimated time to flush
>>> +	 * the queued mem, which basically ensure the above. We just need
>>> +	 * to check that subflow has a non empty cwin.
>>> +	 */
>>> +	ssk = send_info[SSK_MODE_ACTIVE].ssk;
>>> +	if (!ssk || !sk_stream_memory_free(ssk) || !tcp_sk(ssk)->snd_wnd)
>>> +		return NULL;
>>>
>>> -	return NULL;
>>> +	burst = min_t(int, MPTCP_SEND_BURST_SIZE, tcp_sk(ssk)->snd_wnd);
>>> +	wmem = READ_ONCE(ssk->sk_wmem_queued);
>>> +	subflow = mptcp_subflow_ctx(ssk);
>>> +	subflow->avg_pacing_rate = div_u64((u64)subflow->avg_pacing_rate * wmem +
>>> +					   READ_ONCE(ssk->sk_pacing_rate) * burst,
>>> +					   burst + wmem);
>>
>> I see how this gives a smoothed-out version of the pacing rate, using
>> avg_pacing_rate for the bytes already sent on this subflow (wmem) and
>> ssk->sk_pacing_rate for the bytes that will probably be sent now.
>>
>> Is this a *better* estimation of the flow speed than the plain
>> ssk->sk_pacing_rate?
>
> It looks like it is. I get a consistent (higher) failure rate using
> plain sk_pacing_rate instead. 'sk_pacing_rate' is an 'instant' value
> applied to the current pkt, can change in a relevant way in a little
> amount of time.
>

Ok, sk_pacing_rate is even noisier than I realized. My main concern here 
is that whatever filtering we do on that noisy signal handles a variety of 
conditions well. This includes cases like a failing wireless signal, where 
the current and future throughput might be far lower than the average 
pacing rate would indicate - and in that case the filtering algorithm 
needs to respond appropriately.

My first job was in test & measurement, where a lot of thought goes
into taking noisy A/D converter data and turning it into a "good"
measurement value. When using a weighted average like this, it's common
to throw away the average if the new value is a significant change
(like "X%" higher or lower). The value of X depends on the
characteristics of the expected noise: if X is too big, the measurement
is slow to respond; if X is too small, the measurement can be
inaccurate. I think there's more work to do to tune this well for MPTCP
scheduling: the existing upstream code uses the raw sk_pacing_rate,
which is too noisy, and the weighted average with no handling of
significant changes to network conditions could be too slow.

Maybe the cwin check helps with the slow response, but it also skips any 
update to the average pacing rate.

> AFAIK, the weighted average above is a good approximation of what
> happens: TCP assigns to each packet a minimum departure time dependant
> on the current rate. That departure time is then enforced by the tc
> scheduler. Even if the available b/w changes (increases) in-between,
> the packet will still respect the departure (rate) constraint.
>
>> If there is a change in speed, it will be slow to
>> respond to throughput changes. I think you could remove avg_pacing_rate
>> modifications and include the expected burst length in the calculation
>> earlier (in the mptcp_for_each_subflow() loop) to implement the blest-like
>> behavior. Maybe rename subflow_send_info.ratio to
>> subflow_send_info.time_to_flush.
>>
>> What do you think?
>
> Field rename will make the code more readable, agreed!
>
> I'm unsure I follow the other part correctly ?!? Do you mean replacing:
>
> 	ratio = div_u64((u64)READ_ONCE(ssk->sk_wmem_queued) << 32, subflow->avg_pacing_rate);
>
> with:
>
> 	ratio = div_u64((u64)(READ_ONCE(ssk->sk_wmem_queued) + burst) << 32, ssk->sk_pacing_rate);
>
> ?
>

That's what I meant, but it looks like it doesn't improve things.

> In my experiments, adding 'burst' to the first expression, does not
> change the results at all, while replacing the average pacing with
> sk_pacing_rate causes a relevant failure rate increases.
>
> I'm reasonably sure the above change will not be for the better.
>

Yeah, I think you're correct for the common cases covered by the tests.


I'll take a look at a sampling of sk_pacing_rate values and see if I have 
any filter tuning ideas.


--
Mat Martineau
Intel

Re: [PATCH mptcp-next] mptcp: enforce HoL-blocking estimation

Posted by Paolo Abeni 1 month, 1 week ago
On Mon, 2021-10-25 at 10:17 -0700, Mat Martineau wrote:
> On Mon, 25 Oct 2021, Paolo Abeni wrote:
> 
> > On Fri, 2021-10-22 at 18:19 -0700, Mat Martineau wrote:
> > > On Fri, 22 Oct 2021, Paolo Abeni wrote:
> > > 
> > > > The MPTCP packet scheduler has sub-optimal behavior with asymmetric
> > > > subflows: if the faster subflow-level cwin is closed, the packet
> > > > scheduler can enqueue "too much" data on a slower subflow.
> > > > 
> > > > When all the data on the faster subflow is acked, if the mptcp-level
> > > > cwin is closed, and link utilization becomes suboptimal.
> > > > 
> > > > The solution is implementing blest-like[1] HoL-blocking estimation,
> > > > transmitting only on the subflow with the shorter estimated time to
> > > > flush the queued memory. If such subflows cwin is closed, we wait
> > > > even if other subflows are available.
> > > > 
> > > > This is much simpler than the original blest implementation, as we
> > > > leverage the pacing rate provided by the TCP socket. To get a more
> > > > accurate estimation for the subflow linger-time, we maintain a
> > > > per-subflow weighted average of such info.
> > > > 
> > > > Additionally drop magic numbers usage in favor of newly defined
> > > > macros.
> > > > 
> > > > [1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf
> > > > 
> > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > ---
> > > > notes:
> > > > - this apparently solves for good issue/137, with > 200 iterations with
> > > > no failures
> > > > - still to be investigated the impact on high-speed links, if any
> > > > ---
> > > > net/mptcp/protocol.c | 58 ++++++++++++++++++++++++++++++--------------
> > > > net/mptcp/protocol.h |  1 +
> > > > 2 files changed, 41 insertions(+), 18 deletions(-)
> > > > 
> > > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > > > index 7803b0dbb1be..cc9d32cb7bc7 100644
> > > > --- a/net/mptcp/protocol.c
> > > > +++ b/net/mptcp/protocol.c
> > > > @@ -1395,20 +1395,24 @@ bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
> > > > 	return __mptcp_subflow_active(subflow);
> > > > }
> > > > 
> > > > +#define SSK_MODE_ACTIVE	0
> > > > +#define SSK_MODE_BACKUP	1
> > > > +#define SSK_MODE_MAX	2
> > > > +
> > > > /* implement the mptcp packet scheduler;
> > > >  * returns the subflow that will transmit the next DSS
> > > >  * additionally updates the rtx timeout
> > > >  */
> > > > static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> > > > {
> > > > -	struct subflow_send_info send_info[2];
> > > > +	struct subflow_send_info send_info[SSK_MODE_MAX];
> > > > 	struct mptcp_subflow_context *subflow;
> > > > 	struct sock *sk = (struct sock *)msk;
> > > > +	u32 pace, burst, wmem;
> > > > 	int i, nr_active = 0;
> > > > 	struct sock *ssk;
> > > > 	long tout = 0;
> > > > 	u64 ratio;
> > > > -	u32 pace;
> > > > 
> > > > 	sock_owned_by_me(sk);
> > > > 
> > > > @@ -1427,10 +1431,11 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> > > > 	}
> > > > 
> > > > 	/* pick the subflow with the lower wmem/wspace ratio */
> > > > -	for (i = 0; i < 2; ++i) {
> > > > +	for (i = 0; i < SSK_MODE_MAX; ++i) {
> > > > 		send_info[i].ssk = NULL;
> > > > 		send_info[i].ratio = -1;
> > > > 	}
> > > > +
> > > > 	mptcp_for_each_subflow(msk, subflow) {
> > > > 		trace_mptcp_subflow_get_send(subflow);
> > > > 		ssk =  mptcp_subflow_tcp_sock(subflow);
> > > > @@ -1439,12 +1444,13 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> > > > 
> > > > 		tout = max(tout, mptcp_timeout_from_subflow(subflow));
> > > > 		nr_active += !subflow->backup;
> > > > -		if (!sk_stream_memory_free(subflow->tcp_sock) || !tcp_sk(ssk)->snd_wnd)
> > > > -			continue;
> > > > -
> > > > -		pace = READ_ONCE(ssk->sk_pacing_rate);
> > > > -		if (!pace)
> > > > -			continue;
> > > > +		pace = subflow->avg_pacing_rate;
> > > > +		if (unlikely(!pace)) {
> > > > +			/* init pacing rate from socket */
> > > > +			pace = subflow->avg_pacing_rate = READ_ONCE(ssk->sk_pacing_rate);
> > > 
> > > Checkpatch complains about the double assignment. I did do a double-take
> > > to confirm that it didn't look like a '==' typo, and slightly lean toward
> > > keeping it checkpatch-clean.
> > 
> > Yep, I'll fix the above in v2.
> > 
> > > > +			if (!pace)
> > > > +				continue;
> > > > +		}
> > > > 
> > > > 		ratio = div_u64((u64)READ_ONCE(ssk->sk_wmem_queued) << 32,
> > > > 				pace);
> > > > @@ -1457,16 +1463,32 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
> > > > 
> > > > 	/* pick the best backup if no other subflow is active */
> > > > 	if (!nr_active)
> > > > -		send_info[0].ssk = send_info[1].ssk;
> > > > -
> > > > -	if (send_info[0].ssk) {
> > > > -		msk->last_snd = send_info[0].ssk;
> > > > -		msk->snd_burst = min_t(int, MPTCP_SEND_BURST_SIZE,
> > > > -				       tcp_sk(msk->last_snd)->snd_wnd);
> > > > -		return msk->last_snd;
> > > > -	}
> > > > +		send_info[SSK_MODE_ACTIVE].ssk = send_info[SSK_MODE_BACKUP].ssk;
> > > > +
> > > > +	/* According to the blest algorithm, to avoid HoL blocking for the
> > > > +	 * faster flow, we need to:
> > > > +	 * - estimate the faster flow linger time
> > > > +	 * - use the above to estimate the amount of bytes transferred
> > > > +	 *   by the faster flow
> > > > +	 * - check that the amount of queued data is greater than the above,
> > > > +	 *   otherwise do not use the picked, slower, subflow
> > > > +	 * We select the subflow with the shorter estimated time to flush
> > > > +	 * the queued mem, which basically ensures the above. We just need
> > > > +	 * to check that the subflow has a non-empty cwin.
> > > > +	 */
> > > > +	ssk = send_info[SSK_MODE_ACTIVE].ssk;
> > > > +	if (!ssk || !sk_stream_memory_free(ssk) || !tcp_sk(ssk)->snd_wnd)
> > > > +		return NULL;
> > > > 
> > > > -	return NULL;
> > > > +	burst = min_t(int, MPTCP_SEND_BURST_SIZE, tcp_sk(ssk)->snd_wnd);
> > > > +	wmem = READ_ONCE(ssk->sk_wmem_queued);
> > > > +	subflow = mptcp_subflow_ctx(ssk);
> > > > +	subflow->avg_pacing_rate = div_u64((u64)subflow->avg_pacing_rate * wmem +
> > > > +					   READ_ONCE(ssk->sk_pacing_rate) * burst,
> > > > +					   burst + wmem);
> > > 
> > > I see how this gives a smoothed-out version of the pacing rate, using
> > > avg_pacing_rate for the bytes already sent on this subflow (wmem) and
> > > ssk->sk_pacing_rate for the bytes that will probably be sent now.
> > > 
> > > Is this a *better* estimation of the flow speed than the plain
> > > ssk->sk_pacing_rate?
> > 
> > It looks like it is. I get a consistently higher failure rate using
> > plain sk_pacing_rate instead. 'sk_pacing_rate' is an 'instantaneous'
> > value applied to the current packet, and it can change significantly
> > in a short amount of time.
> > 
> 
> Ok, sk_pacing_rate is even noisier than I realized. My main concern here 
> is that whatever filtering we do on that noisy signal handles a variety of 
> conditions well. This includes cases like a failing wireless signal, where 
> the current and future throughput might be far lower than the average 
> pacing rate would indicate - and in that case the filtering algorithm 
> needs to respond appropriately.
> 
> My first job was in test & measurement, where a lot of thought goes into
> taking noisy A/D converter data and turning it into a "good" measurement
> value. When using a weighted average like this, it's common to throw away
> the average if the new value is a significant change (like "X%" higher or
> lower).

I think I do understand.

Please note that there are some differences here WRT sampling more or
less noisy values: the stack enforces that the current sampled value is
applied to the current packet, and we record the pacing rate for each
transmitted chunk of data.

Using the current pacing_rate to estimate the time to flush the ssk
xmit queue is subject to significant errors, as it applies the "wrong"
pace to previous samples.

The computed time could differ significantly from the real one only if
the TCP stack produces multiple GSO segments for a single burst, using
very different pacing rates for the later ones.

Since the max burst size is below the max GSO size, AFAICS the above
could happen only with GSO disabled, when performance is expected to
be bad. Additionally, throwing away the average on sudden pacing value
changes will not protect against the above scenario: a delta would be
caused by a _future_ sudden change of the pacing rate.

Not sure if the above is somewhat clear ;)

Cheers,

Paolo

p.s. just to move things forward, I'll send a v2 with the
uncontroversial changes applied...


Re: [PATCH mptcp-next] mptcp: enforce HoL-blocking estimation

Posted by Mat Martineau 1 month, 1 week ago
On Tue, 26 Oct 2021, Paolo Abeni wrote:

> [...]
>
> I think I do understand.
>
> Please note that here there are some difference WRT sampling more or
> less noisy values: the stack enforces that the current sampled value is
> applied to current packet and we record the pacing for each xmitted
> chunk of data.
>
> Using the current pacing_rate to estimate the time to flush the ssk
> xmit queue is subject to relevant errors, as it applies the "wrong"
> pace to previous samples.
>
> The computed time could differ in a significant way from the real one
> only if the TCP stack will produce multiple GSO segments for a single
> burst, using very different pacing rate for the 'next ones'.
>
> Since the max burst size is below the max gso size, AFAICS the above
> could happen only with GSO disabled, when performances are expected to
> be bad. Additionally, throwing away the average on sudden pacing value
> changes will not protect from the above scenario: a delta will be
> caused by _future_ sudden change of the pacing rate.
>
> Not sure if the above is somewhat clear ;)
>

Yeah, I get what you're saying. Appreciate the explanation since you know 
the underlying pacing implementation better than I do.

I'm still working on gathering some raw data to better understand what 
sk_pacing_rate, the values used to calculate sk_pacing_rate, and the 
weighted average look like over time. I want to have more confidence that 
the scheduling changes work well in a variety of conditions.

Would you say the slow start / ca ratios in tcp_update_pacing_rate() are 
interfering with subflow selection at all? It looks like the various 
tcp_sock members that are used to generate sk_pacing_rate could be useful, 
but they require holding the subflow socket lock to read them :(

>
> p.s. just to move the thing forward I'll send a v2 with the
> uncontroversial changes applied...

Ok, thanks for sending.

--
Mat Martineau
Intel