[RFC PATCH v2] selftests: mptcp: more stable simult_flows tests

Paolo Abeni posted 1 patch 2 years, 5 months ago
Patches applied successfully
git fetch https://github.com/multipath-tcp/mptcp_net-next tags/patchew/9d66028a72b6807d4dc3397bb70f028cbc78161d.1634310418.git.pabeni@redhat.com
Maintainers: "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Mat Martineau <mathew.j.martineau@linux.intel.com>, Matthieu Baerts <matthieu.baerts@tessares.net>, Shuah Khan <shuah@kernel.org>
[RFC PATCH v2] selftests: mptcp: more stable simult_flows tests
Posted by Paolo Abeni 2 years, 5 months ago
Currently the simult_flows.sh self-tests are not very
stable, especially when running on a slow VM.

The tests measure the runtime of transfers on multiple subflows
and check that the time is near the theoretical maximum.

The current test infra introduces a bit of jitter in the test
runtime, due to multiple explicit delays. Additionally, the
runtime is measured by the shell script wrapper. On a slow
VM, the script overhead is measurable and subject to
significant jitter.

One solution to make the test more stable would be adding more
slack to the expected time; that could possibly hide real
regressions. Instead, move the measurement inside the command
doing the transfer, and drop most of the unneeded sleeps.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
@matttbe: could you please double-check that this makes the
simult_flows tests really more stable in your testbed?

Now the slack is quite tight; I think we can raise it a
little more without losing the ability to catch regressions
---
 .../selftests/net/mptcp/mptcp_connect.c       | 72 +++++++++++++++----
 .../selftests/net/mptcp/simult_flows.sh       | 36 ++++------
 2 files changed, 72 insertions(+), 36 deletions(-)

diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.c b/tools/testing/selftests/net/mptcp/mptcp_connect.c
index 89c4753c2760..53d267a1e3d7 100644
--- a/tools/testing/selftests/net/mptcp/mptcp_connect.c
+++ b/tools/testing/selftests/net/mptcp/mptcp_connect.c
@@ -14,6 +14,7 @@
 #include <strings.h>
 #include <signal.h>
 #include <unistd.h>
+#include <time.h>
 
 #include <sys/poll.h>
 #include <sys/sendfile.h>
@@ -64,6 +65,7 @@ static int cfg_sndbuf;
 static int cfg_rcvbuf;
 static bool cfg_join;
 static bool cfg_remove;
+static unsigned int cfg_time;
 static unsigned int cfg_do_w;
 static int cfg_wait;
 static uint32_t cfg_mark;
@@ -78,9 +80,10 @@ static struct cfg_cmsg_types cfg_cmsg_types;
 static void die_usage(void)
 {
 	fprintf(stderr, "Usage: mptcp_connect [-6] [-u] [-s MPTCP|TCP] [-p port] [-m mode]"
-		"[-l] [-w sec] connect_address\n");
+		"[-l] [-w sec] [-t num] [-T num] connect_address\n");
 	fprintf(stderr, "\t-6 use ipv6\n");
 	fprintf(stderr, "\t-t num -- set poll timeout to num\n");
+	fprintf(stderr, "\t-T num -- set expected runtime to num ms\n");
 	fprintf(stderr, "\t-S num -- set SO_SNDBUF to num\n");
 	fprintf(stderr, "\t-R num -- set SO_RCVBUF to num\n");
 	fprintf(stderr, "\t-p num -- use port num\n");
@@ -448,7 +451,7 @@ static void set_nonblock(int fd)
 	fcntl(fd, F_SETFL, flags | O_NONBLOCK);
 }
 
-static int copyfd_io_poll(int infd, int peerfd, int outfd)
+static int copyfd_io_poll(int infd, int peerfd, int outfd, bool *in_closed_after_out)
 {
 	struct pollfd fds = {
 		.fd = peerfd,
@@ -487,9 +490,11 @@ static int copyfd_io_poll(int infd, int peerfd, int outfd)
 				 */
 				fds.events &= ~POLLIN;
 
-				if ((fds.events & POLLOUT) == 0)
+				if ((fds.events & POLLOUT) == 0) {
+					*in_closed_after_out = true;
 					/* and nothing more to send */
 					break;
+				}
 
 			/* Else, still have data to transmit */
 			} else if (len < 0) {
@@ -547,7 +552,7 @@ static int copyfd_io_poll(int infd, int peerfd, int outfd)
 	}
 
 	/* leave some time for late join/announce */
-	if (cfg_join || cfg_remove)
+	if (cfg_remove)
 		usleep(cfg_wait);
 
 	close(peerfd);
@@ -646,7 +651,7 @@ static int do_sendfile(int infd, int outfd, unsigned int count)
 }
 
 static int copyfd_io_mmap(int infd, int peerfd, int outfd,
-			  unsigned int size)
+			  unsigned int size, bool *in_closed_after_out)
 {
 	int err;
 
@@ -664,13 +669,14 @@ static int copyfd_io_mmap(int infd, int peerfd, int outfd,
 		shutdown(peerfd, SHUT_WR);
 
 		err = do_recvfile(peerfd, outfd);
+		*in_closed_after_out = true;
 	}
 
 	return err;
 }
 
 static int copyfd_io_sendfile(int infd, int peerfd, int outfd,
-			      unsigned int size)
+			      unsigned int size, bool *in_closed_after_out)
 {
 	int err;
 
@@ -685,6 +691,7 @@ static int copyfd_io_sendfile(int infd, int peerfd, int outfd,
 		if (err)
 			return err;
 		err = do_recvfile(peerfd, outfd);
+		*in_closed_after_out = true;
 	}
 
 	return err;
@@ -692,27 +699,62 @@ static int copyfd_io_sendfile(int infd, int peerfd, int outfd,
 
 static int copyfd_io(int infd, int peerfd, int outfd)
 {
+	bool in_closed_after_out = false;
+	struct timespec start, end;
 	int file_size;
+	int ret;
+
+	if (cfg_time && (clock_gettime(CLOCK_MONOTONIC, &start) < 0))
+		xerror("can not fetch start time %d", errno);
 
 	switch (cfg_mode) {
 	case CFG_MODE_POLL:
-		return copyfd_io_poll(infd, peerfd, outfd);
+		ret = copyfd_io_poll(infd, peerfd, outfd, &in_closed_after_out);
+		break;
+
 	case CFG_MODE_MMAP:
 		file_size = get_infd_size(infd);
 		if (file_size < 0)
 			return file_size;
-		return copyfd_io_mmap(infd, peerfd, outfd, file_size);
+		ret = copyfd_io_mmap(infd, peerfd, outfd, file_size, &in_closed_after_out);
+		break;
+
 	case CFG_MODE_SENDFILE:
 		file_size = get_infd_size(infd);
 		if (file_size < 0)
 			return file_size;
-		return copyfd_io_sendfile(infd, peerfd, outfd, file_size);
+		ret = copyfd_io_sendfile(infd, peerfd, outfd, file_size, &in_closed_after_out);
+		break;
+
+	default:
+		fprintf(stderr, "Invalid mode %d\n", cfg_mode);
+
+		die_usage();
+		return 1;
 	}
 
-	fprintf(stderr, "Invalid mode %d\n", cfg_mode);
+	if (ret)
+		return ret;
 
-	die_usage();
-	return 1;
+	if (cfg_time) {
+		unsigned int delta_ms;
+
+		if (clock_gettime(CLOCK_MONOTONIC, &end) < 0)
+			xerror("can not fetch end time %d", errno);
+		delta_ms = (end.tv_sec - start.tv_sec) * 1000 + (end.tv_nsec - start.tv_nsec) / 1000000;
+		if (delta_ms > cfg_time) {
+			xerror("transfer slower than expected! runtime %d ms, expected %d ms",
+			       delta_ms, cfg_time);
+		}
+
+	/* show the runtime only if this end shutdown(wr) before receiving the EOF,
+	 * (that is, if this end got the longer runtime)
+	 */
+		if (in_closed_after_out)
+			fprintf(stderr, "%d", delta_ms);
+	}
+
+	return 0;
 }
 
 static void check_sockaddr(int pf, struct sockaddr_storage *ss,
@@ -1005,12 +1047,11 @@ static void parse_opts(int argc, char **argv)
 {
 	int c;
 
-	while ((c = getopt(argc, argv, "6jr:lp:s:hut:m:S:R:w:M:P:c:")) != -1) {
+	while ((c = getopt(argc, argv, "6jr:lp:s:hut:T:m:S:R:w:M:P:c:")) != -1) {
 		switch (c) {
 		case 'j':
 			cfg_join = true;
 			cfg_mode = CFG_MODE_POLL;
-			cfg_wait = 400000;
 			break;
 		case 'r':
 			cfg_remove = true;
@@ -1043,6 +1084,9 @@ static void parse_opts(int argc, char **argv)
 			if (poll_timeout <= 0)
 				poll_timeout = -1;
 			break;
+		case 'T':
+			cfg_time = atoi(optarg);
+			break;
 		case 'm':
 			cfg_mode = parse_mode(optarg);
 			break;
diff --git a/tools/testing/selftests/net/mptcp/simult_flows.sh b/tools/testing/selftests/net/mptcp/simult_flows.sh
index 910d8126af8f..f441ff7904fc 100755
--- a/tools/testing/selftests/net/mptcp/simult_flows.sh
+++ b/tools/testing/selftests/net/mptcp/simult_flows.sh
@@ -51,7 +51,7 @@ setup()
 	sout=$(mktemp)
 	cout=$(mktemp)
 	capout=$(mktemp)
-	size=$((2048 * 4096))
+	size=$((2 * 2048 * 4096))
 	dd if=/dev/zero of=$small bs=4096 count=20 >/dev/null 2>&1
 	dd if=/dev/zero of=$large bs=4096 count=$((size / 4096)) >/dev/null 2>&1
 
@@ -161,17 +161,15 @@ do_transfer()
 
 	timeout ${timeout_test} \
 		ip netns exec ${ns3} \
-			./mptcp_connect -jt ${timeout_poll} -l -p $port \
+			./mptcp_connect -jt ${timeout_poll} -l -p $port -T $time \
 				0.0.0.0 < "$sin" > "$sout" &
 	local spid=$!
 
 	wait_local_port_listen "${ns3}" "${port}"
 
-	local start
-	start=$(date +%s%3N)
 	timeout ${timeout_test} \
 		ip netns exec ${ns1} \
-			./mptcp_connect -jt ${timeout_poll} -p $port \
+			./mptcp_connect -jt ${timeout_poll} -p $port -T $time \
 				10.0.3.3 < "$cin" > "$cout" &
 	local cpid=$!
 
@@ -180,27 +178,20 @@ do_transfer()
 	wait $spid
 	local rets=$?
 
-	local stop
-	stop=$(date +%s%3N)
-
 	if $capture; then
 		sleep 1
 		kill ${cappid_listener}
 		kill ${cappid_connector}
 	fi
 
-	local duration
-	duration=$((stop-start))
-
 	cmp $sin $cout > /dev/null 2>&1
 	local cmps=$?
 	cmp $cin $sout > /dev/null 2>&1
 	local cmpc=$?
 
-	printf "%16s" "$duration max $max_time "
+	printf "%-16s" " max $max_time "
 	if [ $retc -eq 0 ] && [ $rets -eq 0 ] && \
-	   [ $cmpc -eq 0 ] && [ $cmps -eq 0 ] && \
-	   [ $duration -lt $max_time ]; then
+	   [ $cmpc -eq 0 ] && [ $cmps -eq 0 ]; then
 		echo "[ OK ]"
 		cat "$capout"
 		return 0
@@ -244,23 +235,24 @@ run_test()
 	tc -n $ns2 qdisc add dev ns2eth1 root netem rate ${rate1}mbit $delay1
 	tc -n $ns2 qdisc add dev ns2eth2 root netem rate ${rate2}mbit $delay2
 
-	# time is measure in ms
-	local time=$((size * 8 * 1000 / (( $rate1 + $rate2) * 1024 *1024) ))
+	# time is measured in ms, account for transfer size, aggregated link speed
+	# and header overhead (10%)
+	local time=$((size * 8 * 1000 * 10 / (( $rate1 + $rate2) * 1024 *1024 * 9) ))
 
 	# mptcp_connect will do some sleeps to allow the mp_join handshake
-	# completion
-	time=$((time + 1350))
+	# completion (see mptcp_connect): 200ms on each side, add some slack
+	time=$((time + 450))
 
-	printf "%-50s" "$msg"
-	do_transfer $small $large $((time * 11 / 10))
+	printf "%-60s" "$msg"
+	do_transfer $small $large $time
 	lret=$?
 	if [ $lret -ne 0 ]; then
 		ret=$lret
 		[ $bail -eq 0 ] || exit $ret
 	fi
 
-	printf "%-50s" "$msg - reverse direction"
-	do_transfer $large $small $((time * 11 / 10))
+	printf "%-60s" "$msg - reverse direction"
+	do_transfer $large $small $time
 	lret=$?
 	if [ $lret -ne 0 ]; then
 		ret=$lret
-- 
2.26.3
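
As a sanity check on how tight the new budget is, the arithmetic of the
new formula can be worked through by hand (the 20 and 40 Mbit aggregate
link speeds below are assumptions inferred from the max values reported
in the replies). With size = 2 * 2048 * 4096 = 16777216 bytes:

	balanced:   16777216 * 8 * 1000 * 10 / (20 * 1024 * 1024 * 9) + 450
	            = 7111 + 450 = 7561 ms

	unbalanced: 16777216 * 8 * 1000 * 10 / (40 * 1024 * 1024 * 9) + 450
	            = 3555 + 450 = 4005 ms

These are exactly the "max 7561" and "max 4005" budgets visible in the
test reports below, leaving only a few percent of headroom over the
observed passing runtimes.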


Re: [RFC PATCH v2] selftests: mptcp: more stable simult_flows tests
Posted by Mat Martineau 2 years, 5 months ago
On Fri, 15 Oct 2021, Paolo Abeni wrote:

> Currently the simult_flows.sh self-tests are not very
> stable, especially when running on a slow VM.
>
> The tests measure the runtime of transfers on multiple subflows
> and check that the time is near the theoretical maximum.
>
> The current test infra introduces a bit of jitter in the test
> runtime, due to multiple explicit delays. Additionally, the
> runtime is measured by the shell script wrapper. On a slow
> VM, the script overhead is measurable and subject to
> significant jitter.
>
> One solution to make the test more stable would be adding more
> slack to the expected time; that could possibly hide real
> regressions. Instead, move the measurement inside the command
> doing the transfer, and drop most of the unneeded sleeps.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> @matttbe: could you please double-check that this makes the
> simult_flows tests really more stable in your testbed?
>
> Now the slack is quite tight; I think we can raise it a
> little more without losing the ability to catch regressions


I still get occasional failures on my slow desktop VM (lots of debug 
features turned on), but a little more slack would help. Most of the time 
all cases succeed. If I see a failure, it's just one test out of the 
batch. These are each from separate runs of simult_flows.sh:

unbalanced bwidth                                           transfer slower than expected! runtime 4500 ms, expected 4005 ms max 4005        [ fail ]


unbalanced bwidth with opposed, unbalanced delay            transfer slower than expected! runtime 4632 ms, expected 4005 ms max 4005        [ fail ]


Curious to see Matthieu's results!

--
Mat Martineau
Intel

Re: [RFC PATCH v2] selftests: mptcp: more stable simult_flows tests
Posted by Matthieu Baerts 2 years, 5 months ago
Hi Paolo, Mat,

On 16/10/2021 02:12, Mat Martineau wrote:
> On Fri, 15 Oct 2021, Paolo Abeni wrote:
> 
>> Currently the simult_flows.sh self-tests are not very
>> stable, especially when running on a slow VM.
>>
>> The tests measure the runtime of transfers on multiple subflows
>> and check that the time is near the theoretical maximum.
>>
>> The current test infra introduces a bit of jitter in the test
>> runtime, due to multiple explicit delays. Additionally, the
>> runtime is measured by the shell script wrapper. On a slow
>> VM, the script overhead is measurable and subject to
>> significant jitter.
>>
>> One solution to make the test more stable would be adding more
>> slack to the expected time; that could possibly hide real
>> regressions. Instead, move the measurement inside the command
>> doing the transfer, and drop most of the unneeded sleeps.

Thank you for looking at that!

It is indeed more stable: with a debug kernel and a "slow host", the
test used to fail almost all the time. Now I got 3 failures in X
attempts, not bad! (and I stopped after the last failure)

>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>> ---
>> @matttbe: could you please double-check that this makes the
>> simult_flows tests really more stable in your testbed?
>>
>> Now the slack is quite tight; I think we can raise it a
>> little more without losing the ability to catch regressions
> 
> 
> I still get occasional failures on my slow desktop VM (lots of debug
> features turned on), but a little more slack would help. Most of the
> time all cases succeed. If I see a failure, it's just one test out of
> the batch. These are each from separate runs of simult_flows.sh:
> 
> unbalanced bwidth                                           transfer slower than expected! runtime 4500 ms, expected 4005 ms max 4005        [ fail ]
>
>
> unbalanced bwidth with opposed, unbalanced delay            transfer slower than expected! runtime 4632 ms, expected 4005 ms max 4005        [ fail ]
> 
> 
> Curious to see Matthieu's results!


I also have issues with the unbalanced bwidth tests. In a normal
situation, I'm also very close to (but under) the expected time, so
maybe we can increase the thresholds.
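
(For scale: 5% more slack would raise the unbalanced budget from 4005
to roughly 4205 ms, enough to absorb the 4039 ms failures reported
below, but not the 4324-4632 ms outliers seen on the slower debug VMs.)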

Here is what I have when the tests are succeeding:


> balanced bwidth                                             7404 max 7561       [ OK ]
> balanced bwidth - reverse direction                         7422 max 7561       [ OK ]
> balanced bwidth with unbalanced delay                       7428 max 7561       [ OK ]
> balanced bwidth with unbalanced delay - reverse direction   7406 max 7561       [ OK ]
> unbalanced bwidth                                           3837 max 4005       [ OK ]
> unbalanced bwidth - reverse direction                       3808 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay                     3880 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay - reverse direction 3845 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay            3881 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay - reverse direction 3834 max 4005       [ OK ]

> balanced bwidth                                             7413 max 7561       [ OK ]
> balanced bwidth - reverse direction                         [  195.535235] hrtimer: interrupt took 19400 ns
> 7406 max 7561       [ OK ]
> balanced bwidth with unbalanced delay                       7388 max 7561       [ OK ]
> balanced bwidth with unbalanced delay - reverse direction   7415 max 7561       [ OK ]
> unbalanced bwidth                                           3853 max 4005       [ OK ]
> unbalanced bwidth - reverse direction                       3869 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay                     3887 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay - reverse direction 3826 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay            3880 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay - reverse direction 3826 max 4005       [ OK ]

> balanced bwidth                                             7441 max 7561       [ OK ]
> balanced bwidth - reverse direction                         7439 max 7561       [ OK ]
> balanced bwidth with unbalanced delay                       7414 max 7561       [ OK ]
> balanced bwidth with unbalanced delay - reverse direction   7432 max 7561       [ OK ]
> unbalanced bwidth                                           3803 max 4005       [ OK ]
> unbalanced bwidth - reverse direction                       3810 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay                     3792 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay - reverse direction 3796 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay            3839 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay - reverse direction 3809 max 4005       [ OK ]

Now when they are failing:

> balanced bwidth                                             7397 max 7561       [ OK ]
> balanced bwidth - reverse direction                         7386 max 7561       [ OK ]
> balanced bwidth with unbalanced delay                       7401 max 7561       [ OK ]
> balanced bwidth with unbalanced delay - reverse direction   7403 max 7561       [ OK ]
> unbalanced bwidth                                           transfer slower than expected! runtime 4039 ms, expected 4005 ms max 4005        [ fail ]
> client exit code 1, server 0
>
> netns ns3-0-BaDnIV socket stat for 10005:
> State     Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
> TIME-WAIT 0      0           10.0.3.3:10005     10.0.2.1:39479 timer:(timewait,59sec,0)
> TIME-WAIT 0      0           10.0.3.3:10005     10.0.1.1:49984 timer:(timewait,59sec,0)
>
> netns ns1-0-BaDnIV socket stat for 10005:
> State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
> -rw------- 1 root root 16777216 Oct 18 09:33 /tmp/tmp.JugE78zTog
> -rw------- 1 root root 16777216 Oct 18 09:34 /tmp/tmp.SPV6oen3hT
> -rw------- 1 root root 81920 Oct 18 09:33 /tmp/tmp.7SJVlib0CM
> -rw------- 1 root root 81920 Oct 18 09:34 /tmp/tmp.QvimZ7jK8i
> unbalanced bwidth - reverse direction                       3808 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay                     3798 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay - reverse direction 3844 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay            3937 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay - reverse direction 3810 max 4005       [ OK ]

> balanced bwidth                                             7391 max 7561       [ OK ]
> balanced bwidth - reverse direction                         7380 max 7561       [ OK ]
> balanced bwidth with unbalanced delay                       7424 max 7561       [ OK ]
> balanced bwidth with unbalanced delay - reverse direction   7412 max 7561       [ OK ]
> unbalanced bwidth                                           3894 max 4005       [ OK ]
> unbalanced bwidth - reverse direction                       3884 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay                     3848 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay - reverse direction 3896 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay            transfer slower than expected! runtime 4039 ms, expected 4005 ms max 4005        [ fail ]
> client exit code 1, server 0
>
> netns ns3-0-4o5MrN socket stat for 10009:
> State     Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
> TIME-WAIT 0      0           10.0.3.3:10009     10.0.1.1:51946 timer:(timewait,59sec,0)
> TIME-WAIT 0      0           10.0.3.3:10009     10.0.2.1:40567 timer:(timewait,59sec,0)
>
> netns ns1-0-4o5MrN socket stat for 10009:
> State     Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
> TIME-WAIT 0      0           10.0.1.1:51946     10.0.3.3:10009 timer:(timewait,58sec,0)
>
> -rw------- 1 root root 16777216 Oct 18 09:41 /tmp/tmp.TZc6AFBB9s
> -rw------- 1 root root 16777216 Oct 18 09:43 /tmp/tmp.fwjA2UfvqJ
> -rw------- 1 root root 81920 Oct 18 09:41 /tmp/tmp.4XRtDSnt3Z
> -rw------- 1 root root 81920 Oct 18 09:42 /tmp/tmp.nqFmtr1ZHk
> unbalanced bwidth with opposed, unbalanced delay - reverse direction 3876 max 4005       [ OK ]

> balanced bwidth                                             7415 max 7561       [ OK ]
> balanced bwidth - reverse direction                         7391 max 7561       [ OK ]
> balanced bwidth with unbalanced delay                       7436 max 7561       [ OK ]
> balanced bwidth with unbalanced delay - reverse direction   7389 max 7561       [ OK ]
> unbalanced bwidth                                           3797 max 4005       [ OK ]
> unbalanced bwidth - reverse direction                       3797 max 4005       [ OK ]
> unbalanced bwidth with unbalanced delay                     transfer slower than expected! runtime 4324 ms, expected 4005 ms transfer slower than expected! runtime 4421 ms, expected 4005 ms max 4005        [ fail ]
> client exit code 1, server 1
>
> netns ns3-0-CL5xtY socket stat for 10007:
> State     Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
> TIME-WAIT 0      0           10.0.3.3:10007     10.0.1.1:57566 timer:(timewait,59sec,0)
>
> netns ns1-0-CL5xtY socket stat for 10007:
> State     Recv-Q Send-Q    Local Address:Port  Peer Address:Port Process
> TIME-WAIT 0      0              10.0.1.1:57566     10.0.3.3:10007 timer:(timewait,59sec,0)
> TIME-WAIT 0      0      10.0.2.1%ns1eth2:47275     10.0.3.3:10007 timer:(timewait,59sec,0)
>
> -rw------- 1 root root 16777216 Oct 18 09:56 /tmp/tmp.2Fm5JWJ2Gm
> -rw------- 1 root root 16777216 Oct 18 09:55 /tmp/tmp.myCVbJlKF1
> -rw------- 1 root root 81920 Oct 18 09:55 /tmp/tmp.GJRGytXcM1
> -rw------- 1 root root 81920 Oct 18 09:56 /tmp/tmp.WUONy22LUi
> unbalanced bwidth with unbalanced delay - reverse direction 3906 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay            3823 max 4005       [ OK ]
> unbalanced bwidth with opposed, unbalanced delay - reverse direction 3810 max 4005       [ OK ]


Please tell me if you need anything else!

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

Re: [RFC PATCH v2] selftests: mptcp: more stable simult_flows tests
Posted by Matthieu Baerts 2 years, 5 months ago
On 18/10/2021 12:24, Matthieu Baerts wrote:
> Hi Paolo, Mat,
> 
> On 16/10/2021 02:12, Mat Martineau wrote:
>> On Fri, 15 Oct 2021, Paolo Abeni wrote:
>>
>>> Currently the simult_flows.sh self-tests are not very
>>> stable, especially when running on a slow VM.
>>>
>>> The tests measure the runtime of transfers on multiple subflows
>>> and check that the time is near the theoretical maximum.
>>>
>>> The current test infra introduces a bit of jitter in the test
>>> runtime, due to multiple explicit delays. Additionally, the
>>> runtime is measured by the shell script wrapper. On a slow
>>> VM, the script overhead is measurable and subject to
>>> significant jitter.
>>>
>>> One solution to make the test more stable would be adding more
>>> slack to the expected time; that could possibly hide real
>>> regressions. Instead, move the measurement inside the command
>>> doing the transfer, and drop most of the unneeded sleeps.
> 
> Thank you for looking at that!
> 
> It is indeed more stable: with a debug kernel and a "slow host", the
> test used to fail almost all the time. Now I got 3 failures in X
> attempts, not bad! (and I stopped after the last failure)

And X=18 here ;-)

And for maybe a better output of the test results from my previous email:

https://paste.centos.org/view/d75b3836

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

Re: [RFC PATCH v2] selftests: mptcp: more stable simult_flows tests
Posted by Matthieu Baerts 2 years, 5 months ago
On 18/10/2021 14:30, Matthieu Baerts wrote:
> On 18/10/2021 12:24, Matthieu Baerts wrote:
>> Hi Paolo, Mat,
>>
>> On 16/10/2021 02:12, Mat Martineau wrote:
>>> On Fri, 15 Oct 2021, Paolo Abeni wrote:
>>>
>>>> Currently the simult_flows.sh self-tests are not very
>>>> stable, especially when running on a slow VM.
>>>>
>>>> The tests measure the runtime of transfers on multiple subflows
>>>> and check that the time is near the theoretical maximum.
>>>>
>>>> The current test infra introduces a bit of jitter in the test
>>>> runtime, due to multiple explicit delays. Additionally, the
>>>> runtime is measured by the shell script wrapper. On a slow
>>>> VM, the script overhead is measurable and subject to
>>>> significant jitter.
>>>>
>>>> One solution to make the test more stable would be adding more
>>>> slack to the expected time; that could possibly hide real
>>>> regressions. Instead, move the measurement inside the command
>>>> doing the transfer, and drop most of the unneeded sleeps.
>>
>> Thank you for looking at that!
>>
>> It is indeed more stable: with a debug kernel and a "slow host", the
>> test used to fail almost all the time. Now I got 3 failures in X
>> attempts, not bad! (and I stopped after the last failure)
> 
> And X=18 here ;-)

The public CI seems happy too, no issue with "selftest_simult_flows":

- KVM Validation: normal:
  - Unstable: 1 failed test(s): selftest_mptcp_join 🔴:
  - Task: https://cirrus-ci.com/task/5047114079141888
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/5047114079141888/summary/summary.txt

- KVM Validation: debug:
  - Unstable: 2 failed test(s): selftest_diag selftest_mptcp_join 🔴:
  - Task: https://cirrus-ci.com/task/6173013985984512
  - Summary: https://api.cirrus-ci.com/v1/artifact/task/6173013985984512/summary/summary.txt

Initiator: Matthieu Baerts
Commits:
https://github.com/multipath-tcp/mptcp_net-next/commits/d0beafc15723

Good job :)

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

Re: [RFC PATCH v2] selftests: mptcp: more stable simult_flows tests
Posted by Paolo Abeni 2 years, 5 months ago
On Fri, 2021-10-15 at 17:12 -0700, Mat Martineau wrote:
> On Fri, 15 Oct 2021, Paolo Abeni wrote:
> 
> > Currently the simult_flows.sh self-tests are not very
> > stable, especially when running on a slow VM.
> > 
> > The tests measure the runtime of transfers on multiple subflows
> > and check that the time is near the theoretical maximum.
> > 
> > The current test infra introduces a bit of jitter in the test
> > runtime, due to multiple explicit delays. Additionally, the
> > runtime is measured by the shell script wrapper. On a slow
> > VM, the script overhead is measurable and subject to
> > significant jitter.
> > 
> > One solution to make the test more stable would be adding more
> > slack to the expected time; that could possibly hide real
> > regressions. Instead, move the measurement inside the command
> > doing the transfer, and drop most of the unneeded sleeps.
> > 
> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > ---
> > @matttbe: could you please double-check that this makes the
> > simult_flows tests really more stable in your testbed?
> > 
> > Now the slack is quite tight; I think we can raise it a
> > little more without losing the ability to catch regressions
> 
> I still get occasional failures on my slow desktop VM (lots of debug 
> features turned on), but a little more slack would help. Most of the time 
> all cases succeed. If I see a failure, it's just one test out of the 
> batch. These are each from separate runs of simult_flows.sh:
> 
> unbalanced bwidth                                           transfer slower than expected! runtime 4500 ms, expected 4005 ms max 4005        [ fail ]
> 
> 
> unbalanced bwidth with opposed, unbalanced delay            transfer slower than expected! runtime 4632 ms, expected 4005 ms max 4005        [ fail ]

Let me elaborate a bit more here.

Before this patch, I observed 2 different kinds of failures:

- off by a few ms. The root cause was the jitter introduced by the
VM/scripts/tests env. That should be resolved now.

- off by a few hundred ms. The root cause is a little more uncertain,
but I think it's due to HoL blocking for the faster subflow. I observe
this kind of failure only when the subflows have different speeds.

Looking at the pcap traces with wireshark, and doing the time-sequence
stream graph (Stevens), the slower subflow has a pretty much constant
slope, while the faster one has some smallish periods with a lower
slope.

I'm experimenting with a patch doing HoL-blocking estimation in a
similar way to BLEST (blocking estimation-based scheduling). It looks
like it improves the situation, but I need to do more experiments.
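
For illustration, the core of such an estimate could look roughly like
this (a hedged sketch of the BLEST idea only; the names and structure
are made up for illustration and are not from the actual patch):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative BLEST-style head-of-line blocking check: scheduling
 * 'len' bytes on the slow subflow risks blocking the fast one when
 * those bytes, plus what the fast subflow can transmit during one
 * slow-subflow RTT, exceed the free connection-level send window.
 */
struct sf_est {
	uint64_t rate_bps;	/* estimated subflow throughput, bits/s */
	uint32_t srtt_us;	/* smoothed round-trip time, in usec */
};

static bool would_hol_block(const struct sf_est *slow,
			    const struct sf_est *fast,
			    uint32_t len, uint64_t snd_wnd_free)
{
	/* bytes the fast subflow can push while 'len' sits in flight
	 * on the slow subflow
	 */
	uint64_t fast_bytes = fast->rate_bps / 8 * slow->srtt_us / 1000000;

	return len + fast_bytes > snd_wnd_free;
}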

Cheers,

Paolo



Re: [RFC PATCH v2] selftests: mptcp: more stable simult_flows tests
Posted by Matthieu Baerts 2 years, 5 months ago
Hi Paolo,

On 20/10/2021 19:43, Paolo Abeni wrote:
> On Fri, 2021-10-15 at 17:12 -0700, Mat Martineau wrote:
>> On Fri, 15 Oct 2021, Paolo Abeni wrote:
>>
>>> Currently the simult_flows.sh self-tests are not very
>>> stable, especially when running on a slow VM.
>>>
>>> The tests measure the runtime of transfers on multiple subflows
>>> and check that the time is near the theoretical maximum.
>>>
>>> The current test infra introduces a bit of jitter in the test
>>> runtime, due to multiple explicit delays. Additionally, the
>>> runtime is measured by the shell script wrapper. On a slow
>>> VM, the script overhead is measurable and subject to
>>> significant jitter.
>>>
>>> One solution to make the test more stable would be adding more
>>> slack to the expected time; that could possibly hide real
>>> regressions. Instead, move the measurement inside the command
>>> doing the transfer, and drop most of the unneeded sleeps.
>>>
>>> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
>>> ---
>>> @matttbe: could you please double-check that this makes the
>>> simult_flows tests really more stable in your testbed?
>>>
>>> Now the slack is quite tight; I think we can raise it a
>>> little more without losing the ability to catch regressions
>>
>> I still get occasional failures on my slow desktop VM (lots of debug 
>> features turned on), but a little more slack would help. Most of the time 
>> all cases succeed. If I see a failure, it's just one test out of the 
>> batch. These are each from separate runs of simult_flows.sh:
>>
>> unbalanced bwidth                                           transfer slower than expected! runtime 4500 ms, expected 4005 ms max 4005        [ fail ]
>>
>>
>> unbalanced bwidth with opposed, unbalanced delay            transfer slower than expected! runtime 4632 ms, expected 4005 ms max 4005        [ fail ]
> 
> Let me elaborate a bit more here.
> 
> Before this patch, I observed 2 different kinds of failures:
> 
> - off by a few ms. The root cause was the jitter introduced by the
> VM/scripts/tests env. That should be resolved now.

Should we already apply your RFC patch in the export branch then?

> - off by a few hundred ms. The root cause is a little more uncertain,
> but I think it's due to HoL blocking for the faster subflow. I observe
> this kind of failure only when the subflows have different speeds.
> 
> Looking at the pcap traces with wireshark, and doing the time-sequence
> stream graph (Stevens), the slower subflow has a pretty much constant
> slope, while the faster one has some smallish periods with a lower
> slope.
> 
> I'm experimenting with a patch doing HoL-blocking estimation in a
> similar way to BLEST (blocking estimation-based scheduling). It looks
> like it improves the situation, but I need to do more experiments.

Thank you for looking at that!

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

Re: [RFC PATCH v2] selftests: mptcp: more stable simult_flows tests
Posted by Paolo Abeni 2 years, 5 months ago
On Thu, 2021-10-21 at 10:54 +0200, Matthieu Baerts wrote:
> Hi Paolo,
> 
> On 20/10/2021 19:43, Paolo Abeni wrote:
> > On Fri, 2021-10-15 at 17:12 -0700, Mat Martineau wrote:
> > > On Fri, 15 Oct 2021, Paolo Abeni wrote:
> > > 
> > > > Currently the simult_flows.sh self-tests are not very
> > > > stable, especially when running on a slow VM.
> > > > 
> > > > The tests measure the runtime of transfers on multiple subflows
> > > > and check that the time is near the theoretical maximum.
> > > > 
> > > > The current test infra introduces a bit of jitter in the test
> > > > runtime, due to multiple explicit delays. Additionally, the
> > > > runtime is measured by the shell script wrapper. On a slow
> > > > VM, the script overhead is measurable and subject to
> > > > significant jitter.
> > > > 
> > > > One solution to make the test more stable would be adding more
> > > > slack to the expected time; that could possibly hide real
> > > > regressions. Instead, move the measurement inside the command
> > > > doing the transfer, and drop most of the unneeded sleeps.
> > > > 
> > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > ---
> > > > @matttbe: could you please double-check that this makes the
> > > > simult_flows tests really more stable in your testbed?
> > > > 
> > > > Now the slack is quite tight; I think we can raise it a
> > > > little more without losing the ability to catch regressions
> > > 
> > > I still get occasional failures on my slow desktop VM (lots of debug 
> > > features turned on), but a little more slack would help. Most of the time 
> > > all cases succeed. If I see a failure, it's just one test out of the 
> > > batch. These are each from separate runs of simult_flows.sh:
> > > 
> > > unbalanced bwidth                                           transfer slower than expected! runtime 4500 ms, expected 4005 ms max 4005        [ fail ]
> > > 
> > > 
> > > unbalanced bwidth with opposed, unbalanced delay            transfer slower than expected! runtime 4632 ms, expected 4005 ms max 4005        [ fail ]
> > 
> > Let me elaborate a bit more here.
> > 
> > Before this patch, I observed 2 different kinds of failures:
> > 
> > - off by a few ms. The root cause was the jitter introduced by the
> > VM/scripts/tests env. That should be resolved now.
> 
> Should we already apply your RFC patch in the export branch then?

I'm not 110% sure it has no side effects on the mptcp_join.sh
self-tests. I guess we can go ahead and check that :)

/P


Re: [RFC PATCH v2] selftests: mptcp: more stable simult_flows tests
Posted by Matthieu Baerts 2 years, 5 months ago
Hi Paolo,

On 15/10/2021 17:09, Paolo Abeni wrote:
> Currently the simult_flows.sh self-tests are not very
> stable, especially when running on a slow VM.
> 
> The tests measure the runtime of transfers on multiple subflows
> and check that the time is near the theoretical maximum.
> 
> The current test infra introduces a bit of jitter in the test
> runtime, due to multiple explicit delays. Additionally, the
> runtime is measured by the shell script wrapper. On a slow
> VM, the script overhead is measurable and subject to
> significant jitter.
> 
> One solution to make the test more stable would be adding more
> slack to the expected time; that could possibly hide real
> regressions. Instead, move the measurement inside the command
> doing the transfer, and drop most of the unneeded sleeps.

Thank you for the patch!

Now in our tree (with the typos detected by codespell fixed) and with my RvB tag:

- e0c1e93cee3c: selftests: mptcp: more stable simult_flows tests
- Results: 3b796086040b..27b3739429a3

Builds and tests are now in progress:

https://cirrus-ci.com/github/multipath-tcp/mptcp_net-next/export/20211021T100529
https://github.com/multipath-tcp/mptcp_net-next/actions/workflows/build-validation.yml?query=branch:export/20211021T100529

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net