From: Menglong Dong <imagedong@tencent.com>
Add the function chk_msk_inuse() to diag.sh, which is used to check the
statistics of mptcp socket in use. As mptcp socket in listen state will
be closed randomly after 'accept', we need to get the count of listening
mptcp socket through 'ss' command.
All tests pass.
Signed-off-by: Menglong Dong <imagedong@tencent.com>
---
v6:
- reuse __chk_nr to check the number of msk in use
---
tools/testing/selftests/net/mptcp/diag.sh | 47 ++++++++++++++++++++---
1 file changed, 42 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/net/mptcp/diag.sh b/tools/testing/selftests/net/mptcp/diag.sh
index 515859a5168b..9994cb23f8a0 100755
--- a/tools/testing/selftests/net/mptcp/diag.sh
+++ b/tools/testing/selftests/net/mptcp/diag.sh
@@ -36,15 +36,20 @@ if [ $? -ne 0 ];then
exit $ksft_skip
fi
+get_msk_inuse()
+{
+ ip netns exec $ns cat /proc/net/protocols | awk '$1~/^MPTCP$/{print $3}'
+}
+
__chk_nr()
{
- local condition="$1"
+ local command="$1"
local expected=$2
local msg nr
shift 2
msg=$*
- nr=$(ss -inmHMN $ns | $condition)
+ nr=$(eval $command)
printf "%-50s" "$msg"
if [ $nr != $expected ]; then
@@ -56,9 +61,17 @@ __chk_nr()
test_cnt=$((test_cnt+1))
}
+__chk_msk_nr()
+{
+ local condition=$1
+ shift 1
+
+ __chk_nr "ss -inmHMN $ns | $condition" $*
+}
+
chk_msk_nr()
{
- __chk_nr "grep -c token:" $*
+ __chk_msk_nr "grep -c token:" $*
}
wait_msk_nr()
@@ -96,12 +109,12 @@ wait_msk_nr()
chk_msk_fallback_nr()
{
- __chk_nr "grep -c fallback" $*
+ __chk_msk_nr "grep -c fallback" $*
}
chk_msk_remote_key_nr()
{
- __chk_nr "grep -c remote_key" $*
+ __chk_msk_nr "grep -c remote_key" $*
}
__chk_listen()
@@ -141,6 +154,25 @@ chk_msk_listen()
nr=$(ss -Ml $filter | wc -l)
}
+chk_msk_inuse()
+{
+ local expected=$1
+ local listen_nr
+
+ listen_nr=$(ss -N $ns -Ml | grep -c LISTEN)
+ expected=$(($expected+$listen_nr))
+ shift 1
+
+ for i in $(seq 10); do
+ if [ $(get_msk_inuse) -eq $expected ];then
+ break
+ fi
+ sleep 0.1
+ done
+
+ __chk_nr get_msk_inuse $expected $*
+}
+
# $1: ns, $2: port
wait_local_port_listen()
{
@@ -194,8 +226,10 @@ wait_connected $ns 10000
chk_msk_nr 2 "after MPC handshake "
chk_msk_remote_key_nr 2 "....chk remote_key"
chk_msk_fallback_nr 0 "....chk no fallback"
+chk_msk_inuse 2 "chk 2 msk in use"
flush_pids
+chk_msk_inuse 0 "chk 0 msk in use after flush"
echo "a" | \
timeout ${timeout_test} \
@@ -231,6 +265,9 @@ for I in `seq 1 $NR_CLIENTS`; do
done
wait_msk_nr $((NR_CLIENTS*2)) "many msk socket present"
+chk_msk_inuse $((NR_CLIENTS*2)) "chk many msk in use"
flush_pids
+chk_msk_inuse 0 "chk 0 msk in use after flush"
+
exit $ret
--
2.37.2
Hi Menglong, Thank you for your modifications, that's great! Our CI did some validations and here is its report: - KVM Validation: normal: - Success! ✅: - Task: https://cirrus-ci.com/task/4909154062565376 - Summary: https://api.cirrus-ci.com/v1/artifact/task/4909154062565376/summary/summary.txt - KVM Validation: debug: - Unstable: 1 failed test(s): selftest_diag 🔴: - Task: https://cirrus-ci.com/task/6035053969408000 - Summary: https://api.cirrus-ci.com/v1/artifact/task/6035053969408000/summary/summary.txt Initiator: Patchew Applier Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/aa327877c6f9 If there are some issues, you can reproduce them using the same environment as the one used by the CI thanks to a docker image, e.g.: $ cd [kernel source code] $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \ --pull always mptcp/mptcp-upstream-virtme-docker:latest \ auto-debug For more details: https://github.com/multipath-tcp/mptcp-upstream-virtme-docker Please note that despite all the efforts that have been already done to have a stable tests suite when executed on a public CI like here, it is possible some reported issues are not due to your modifications. Still, do not hesitate to help us improve that ;-) Cheers, MPTCP GH Action bot Bot operated by Matthieu Baerts (Tessares)
Hi Menglong, Thank you for the v6! It looks like the CI is not happy with it: On 03/11/2022 14:21, MPTCP CI wrote: > Hi Menglong, > > Thank you for your modifications, that's great! > > Our CI did some validations and here is its report: > > - KVM Validation: normal: > - Success! ✅: > - Task: https://cirrus-ci.com/task/4909154062565376 > - Summary: https://api.cirrus-ci.com/v1/artifact/task/4909154062565376/summary/summary.txt > > - KVM Validation: debug: > - Unstable: 1 failed test(s): selftest_diag 🔴: > - Task: https://cirrus-ci.com/task/6035053969408000 > - Summary: https://api.cirrus-ci.com/v1/artifact/task/6035053969408000/summary/summary.txt As you can see: ---------------------------- (...) # all listen sockets [ ok ] # after MPC handshake [ ok ] # ....chk remote_key [ ok ] # ....chk no fallback [ ok ] # chk 2 msk in use [ ok ] # chk 0 msk in use after flush [ ok ] # check fallback [ ok ] # many msk socket present [ fail ] timeout while expecting 200 max 201 last 1 # chk many msk in use [ fail ] expected 200 found 0 # chk 0 msk in use after flush [ ok ] ---------------------------- I guess one socket is still present after the 'check fallback': you probably need to modify flush_pids() to wait for the processes to be over, as suggested on a comment in your v5, no? https://lore.kernel.org/all/b3f3c01e-4010-d5ce-970d-394711bcd0e1@tessares.net/ I don't think it is a good idea to wait for >= 200 except if it takes a very long time to have the previous socket terminated. If it does, maybe we should re-order the test or re-create the netns instead of re-using it. About the patch 3/4, note that the SIGUSR1 is probably stopping the test earlier than expected because the interrupt will cause some actions to stop but still good to check for the 'quit' variable. Also, one small detail for patch 4/4: can you add "...." at the beginning of the new lines you print in the selftest, similar to "....chk no fallback"? Cheers, Matt -- Tessares | Belgium | Hybrid Access Solutions www.tessares.net
> On 2022/11/4 01:34,“Matthieu Baerts”<matthieu.baerts@tessares.net> write: > > Hi Menglong, > > Thank you for the v6! > > It looks like the CI is not happy with it: > > On 03/11/2022 14:21, MPTCP CI wrote: > > Hi Menglong, > > > > Thank you for your modifications, that's great! > > > > Our CI did some validations and here is its report: > > > > - KVM Validation: normal: > > - Success! ✅: > > - Task: https://cirrus-ci.com/task/4909154062565376 > > - Summary: https://api.cirrus-ci.com/v1/artifact/task/4909154062565376/summary/summary.txt > > > > - KVM Validation: debug: > > - Unstable: 1 failed test(s): selftest_diag 🔴: > > - Task: https://cirrus-ci.com/task/6035053969408000 > > - Summary: https://api.cirrus-ci.com/v1/artifact/task/6035053969408000/summary/summary.txt > > As you can see: > > ---------------------------- > (...) > # all listen sockets [ ok ] > # after MPC handshake [ ok ] > # ....chk remote_key [ ok ] > # ....chk no fallback [ ok ] > # chk 2 msk in use [ ok ] > # chk 0 msk in use after flush [ ok ] > # check fallback [ ok ] > # many msk socket present [ fail ] timeout > while expecting 200 max 201 last 1 > # chk many msk in use [ fail ] expected > 200 found 0 > # chk 0 msk in use after flush [ ok ] > ---------------------------- > > I guess one socket is still present after the 'check fallback': you > probably need to modify flush_pids() to wait for the processes to be > over, as suggested on a comment in your v5, no? > In fact, I suspect that it's something else. the wait_msk_nr() has already wait long enough for the processes to be over. I suspect that the msk is not released after the processes exit. Now, I can reproduce it in my computer, about once in 10 times. And I'll try to figure out what's happening. > b3f3c01e-4010-d5ce-970d-394711bcd0e1@tessares.net <https://lore.kernel.org/all/<a href=>/">https://lore.kernel.org/all/b3f3c01e-4010-d5ce-970d-394711bcd0e1@tessares.net/ > > I don't think it is a good idea to wait for >= 200 except if it takes a > very long time to have the previous socket terminated. If it does, maybe > we should re-order the test or re-create the netns instead of re-using it. > It can happen in a very slow qemu machine, but not possible in a physical machine. > > About the patch 3/4, note that the SIGUSR1 is probably stopping the test > earlier than expected because the interrupt will cause some actions to > stop but still good to check for the 'quit' variable. > > > Also, one small detail for patch 4/4: can you add "...." at the > beginning of the new lines you print in the selftest, similar to > "....chk no fallback"? > Okay! Thanks! Menglong Dong > Cheers, > Matt > -- > Tessares | Belgium | Hybrid Access Solutions > www.tessares.net
On Thu, 3 Nov 2022, Matthieu Baerts wrote: > Hi Menglong, > > Thank you for the v6! > > It looks like the CI is not happy with it: > > On 03/11/2022 14:21, MPTCP CI wrote: >> Hi Menglong, >> >> Thank you for your modifications, that's great! >> >> Our CI did some validations and here is its report: >> >> - KVM Validation: normal: >> - Success! ✅: >> - Task: https://cirrus-ci.com/task/4909154062565376 >> - Summary: https://api.cirrus-ci.com/v1/artifact/task/4909154062565376/summary/summary.txt >> >> - KVM Validation: debug: >> - Unstable: 1 failed test(s): selftest_diag 🔴: >> - Task: https://cirrus-ci.com/task/6035053969408000 >> - Summary: https://api.cirrus-ci.com/v1/artifact/task/6035053969408000/summary/summary.txt > > As you can see: > > ---------------------------- > (...) > # all listen sockets [ ok ] > # after MPC handshake [ ok ] > # ....chk remote_key [ ok ] > # ....chk no fallback [ ok ] > # chk 2 msk in use [ ok ] > # chk 0 msk in use after flush [ ok ] > # check fallback [ ok ] > # many msk socket present [ fail ] timeout > while expecting 200 max 201 last 1 > # chk many msk in use [ fail ] expected > 200 found 0 > # chk 0 msk in use after flush [ ok ] > ---------------------------- > > I guess one socket is still present after the 'check fallback': you > probably need to modify flush_pids() to wait for the processes to be > over, as suggested on a comment in your v5, no? > > https://lore.kernel.org/all/b3f3c01e-4010-d5ce-970d-394711bcd0e1@tessares.net/ > > I don't think it is a good idea to wait for >= 200 except if it takes a > very long time to have the previous socket terminated. If it does, maybe > we should re-order the test or re-create the netns instead of re-using it. > > > About the patch 3/4, note that the SIGUSR1 is probably stopping the test > earlier than expected because the interrupt will cause some actions to > stop but still good to check for the 'quit' variable. > > > Also, one small detail for patch 4/4: can you add "...." at the > beginning of the new lines you print in the selftest, similar to > "....chk no fallback"? > Menglong - Thanks for the updated patches. The test ran ok on my local system, but the CI is slow on the debug build which makes the timing trickier. I don't have anything to add to Matthieu's comments above, seems like his suggestions will resolve the CI issue. -- Mat Martineau Intel
© 2016 - 2023 Red Hat, Inc.