From: Paolo Abeni <pabeni@redhat.com>
To: mptcp@lists.linux.dev
Subject: [PATCH v2 mptcp-next] mptcp: enforce HoL-blocking estimation
Date: Tue, 26 Oct 2021 11:42:12 +0200

The MPTCP packet scheduler has sub-optimal behavior with asymmetric
subflows: if the faster subflow-level cwin is closed, the packet
scheduler can enqueue "too much" data on a slower subflow.

When all the data on the faster subflow is acked, if the mptcp-level
cwin is closed, link utilization becomes suboptimal.

The solution is implementing blest-like[1] HoL-blocking estimation,
transmitting only on the subflow with the shorter estimated time to
flush the queued memory.
If such subflow's cwin is closed, we wait even if other subflows are
available. This is considerably simpler than the original blest
implementation, as we leverage the pacing rate provided by the TCP
socket. To get a more accurate estimation for the subflow linger-time,
we maintain a per-subflow weighted average of the pacing rate.

Additionally, drop the magic numbers in favor of newly defined macros
and use more meaningful names for the status variables.

[1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v1 -> v2:
 - fix checkpatch issue (mat)
 - rename ratio as linger_time (mat)
---
 net/mptcp/protocol.c | 72 +++++++++++++++++++++++++++++---------------
 net/mptcp/protocol.h |  1 +
 2 files changed, 48 insertions(+), 25 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 7803b0dbb1be..582f5f88d6ef 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1370,7 +1370,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 
 struct subflow_send_info {
 	struct sock *ssk;
-	u64 ratio;
+	u64 linger_time;
 };
 
 void mptcp_subflow_set_active(struct mptcp_subflow_context *subflow)
@@ -1395,20 +1395,24 @@ bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
 	return __mptcp_subflow_active(subflow);
 }
 
+#define SSK_MODE_ACTIVE	0
+#define SSK_MODE_BACKUP	1
+#define SSK_MODE_MAX	2
+
 /* implement the mptcp packet scheduler;
  * returns the subflow that will transmit the next DSS
  * additionally updates the rtx timeout
  */
 static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 {
-	struct subflow_send_info send_info[2];
+	struct subflow_send_info send_info[SSK_MODE_MAX];
 	struct mptcp_subflow_context *subflow;
 	struct sock *sk = (struct sock *)msk;
+	u32 pace, burst, wmem;
 	int i, nr_active = 0;
 	struct sock *ssk;
+	u64 linger_time;
 	long tout = 0;
-	u64 ratio;
-	u32 pace;
 
 	sock_owned_by_me(sk);
 
@@ -1427,10 +1431,11 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 	}
 
 	/* pick the subflow with the lower wmem/wspace ratio */
-	for (i = 0; i < 2; ++i) {
+	for (i = 0; i < SSK_MODE_MAX; ++i) {
 		send_info[i].ssk = NULL;
-		send_info[i].ratio = -1;
+		send_info[i].linger_time = -1;
 	}
+
 	mptcp_for_each_subflow(msk, subflow) {
 		trace_mptcp_subflow_get_send(subflow);
 		ssk = mptcp_subflow_tcp_sock(subflow);
@@ -1439,34 +1444,51 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk)
 
 		tout = max(tout, mptcp_timeout_from_subflow(subflow));
 		nr_active += !subflow->backup;
-		if (!sk_stream_memory_free(subflow->tcp_sock) || !tcp_sk(ssk)->snd_wnd)
-			continue;
-
-		pace = READ_ONCE(ssk->sk_pacing_rate);
-		if (!pace)
-			continue;
+		pace = subflow->avg_pacing_rate;
+		if (unlikely(!pace)) {
+			/* init pacing rate from socket */
+			subflow->avg_pacing_rate = READ_ONCE(ssk->sk_pacing_rate);
+			pace = subflow->avg_pacing_rate;
+			if (!pace)
+				continue;
+		}
 
-		ratio = div_u64((u64)READ_ONCE(ssk->sk_wmem_queued) << 32,
-				pace);
-		if (ratio < send_info[subflow->backup].ratio) {
+		linger_time = div_u64((u64)READ_ONCE(ssk->sk_wmem_queued) << 32, pace);
+		if (linger_time < send_info[subflow->backup].linger_time) {
 			send_info[subflow->backup].ssk = ssk;
-			send_info[subflow->backup].ratio = ratio;
+			send_info[subflow->backup].linger_time = linger_time;
 		}
 	}
 	__mptcp_set_timeout(sk, tout);
 
 	/* pick the best backup if no other subflow is active */
 	if (!nr_active)
-		send_info[0].ssk = send_info[1].ssk;
-
-	if (send_info[0].ssk) {
-		msk->last_snd = send_info[0].ssk;
-		msk->snd_burst = min_t(int, MPTCP_SEND_BURST_SIZE,
-				       tcp_sk(msk->last_snd)->snd_wnd);
-		return msk->last_snd;
-	}
+		send_info[SSK_MODE_ACTIVE].ssk = send_info[SSK_MODE_BACKUP].ssk;
+
+	/* According to the blest algorithm, to avoid HoL blocking for the
+	 * faster flow, we need to:
+	 * - estimate the faster flow linger time
+	 * - use the above to estimate the amount of bytes transferred
+	 *   by the faster flow
+	 * - check that the amount of queued data is greater than the above,
+	 *   otherwise do not use the picked, slower, subflow
+	 * We select the subflow with the shorter estimated time to flush
+	 * the queued mem, which basically ensures the above. We just need
+	 * to check that subflow has a non-empty cwin.
+	 */
+	ssk = send_info[SSK_MODE_ACTIVE].ssk;
+	if (!ssk || !sk_stream_memory_free(ssk) || !tcp_sk(ssk)->snd_wnd)
+		return NULL;
 
-	return NULL;
+	burst = min_t(int, MPTCP_SEND_BURST_SIZE, tcp_sk(ssk)->snd_wnd);
+	wmem = READ_ONCE(ssk->sk_wmem_queued);
+	subflow = mptcp_subflow_ctx(ssk);
+	subflow->avg_pacing_rate = div_u64((u64)subflow->avg_pacing_rate * wmem +
+					   READ_ONCE(ssk->sk_pacing_rate) * burst,
+					   burst + wmem);
+	msk->last_snd = ssk;
+	msk->snd_burst = burst;
+	return ssk;
 }
 
 static void mptcp_push_release(struct sock *ssk, struct mptcp_sendmsg_info *info)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 67a61ac48b20..46691acdea24 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -391,6 +391,7 @@ DECLARE_PER_CPU(struct mptcp_delegated_action, mptcp_delegated_actions);
 /* MPTCP subflow context */
 struct mptcp_subflow_context {
 	struct list_head node;/* conn_list of subflows */
+	unsigned long avg_pacing_rate; /* protected by msk socket lock */
 	u64	local_key;
 	u64	remote_key;
 	u64	idsn;
-- 
2.26.3