[PATCH mptcp-next v11 0/5] BPF redundant scheduler

Geliang Tang posted 5 patches 1 year, 7 months ago
git fetch https://github.com/multipath-tcp/mptcp_net-next tags/patchew/cover.1662714028.git.geliang.tang@suse.com
Maintainers: Mykola Lysenko <mykolal@fb.com>, Daniel Borkmann <daniel@iogearbox.net>, Alexei Starovoitov <ast@kernel.org>, Song Liu <song@kernel.org>, "David S. Miller" <davem@davemloft.net>, Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Shuah Khan <shuah@kernel.org>, Matthieu Baerts <matthieu.baerts@tessares.net>, Hao Luo <haoluo@google.com>, Andrii Nakryiko <andrii@kernel.org>, Yonghong Song <yhs@fb.com>, John Fastabend <john.fastabend@gmail.com>, KP Singh <kpsingh@kernel.org>, Jakub Kicinski <kuba@kernel.org>, Eric Dumazet <edumazet@google.com>, Mat Martineau <mathew.j.martineau@linux.intel.com>, Martin KaFai Lau <martin.lau@linux.dev>
There is a newer version of this series
[PATCH mptcp-next v11 0/5] BPF redundant scheduler
Posted by Geliang Tang 1 year, 7 months ago
v11:
 - address Mat's comments on v10.
 - rebase to export/20220908T063452

v10:
 - send multiple dfrags in __mptcp_push_pending().

v9:
 - drop the extra *err parameter of mptcp_sched_get_send() as Florian
   suggested.

v8:
 - update __mptcp_push_pending(), send the same data on each subflow.
 - update __mptcp_retrans, track the max sent data.
 - add a new patch.

v7:
 - drop redundant flag in v6
 - drop __mptcp_subflows_push_pending in v6
 - update redundant subflows support in __mptcp_push_pending
 - update redundant subflows support in __mptcp_retrans

v6:
 - Add redundant flag for struct mptcp_sched_ops.
 - add a dedicated function __mptcp_subflows_push_pending() to deal with
   redundant subflows push pending.

v5:
 - address Paolo's comment: keep the optimization in
   mptcp_subflow_get_send() for the non-eBPF case.
 - merge mptcp_sched_get_send() and __mptcp_sched_get_send() in v4 into one.
 - depends on "cleanups for bpf sched selftests".

v4:
 - small cleanups in patch 1, 2.
 - add TODO in patch 3.
 - rebase patch 5 on 'cleanups for bpf sched selftests'.

v3:
 - use new API.
 - fix the link failure tests issue mentioned in ("https://patchwork.kernel.org/project/mptcp/cover/cover.1653033459.git.geliang.tang@suse.com/").

v2:
 - add MPTCP_SUBFLOWS_MAX limit to avoid infinite loops when the
   scheduler always sets call_again to true.
 - track the largest copied amount.
 - deal with __mptcp_subflow_push_pending() and the retransmit loop.
 - depends on "BPF round-robin scheduler" v14.

v1:

Implements the redundant BPF MPTCP scheduler, which sends all packets
redundantly on all available subflows.
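As a rough illustration of the policy described above, here is a minimal
userspace sketch in C. It is not the code from this series: struct
subflow_model, red_schedule() and SUBFLOWS_MAX are hypothetical names used
only for this example, and SUBFLOWS_MAX merely stands in for the kernel's
MPTCP_SUBFLOWS_MAX limit. The only idea taken from the cover letter is that
a redundant scheduler picks every available subflow instead of a single one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SUBFLOWS_MAX 8	/* hypothetical stand-in for MPTCP_SUBFLOWS_MAX */

/* Simplified, hypothetical model of one subflow. */
struct subflow_model {
	bool usable;	/* able to carry data right now */
	bool scheduled;	/* picked by the scheduler for this send */
};

/*
 * Redundant policy: mark every usable subflow as scheduled, so the same
 * data is sent on all of them. Returns the number of subflows picked.
 */
static int red_schedule(struct subflow_model *subflows, size_t n)
{
	int picked = 0;

	assert(n <= SUBFLOWS_MAX);
	for (size_t i = 0; i < n; i++) {
		subflows[i].scheduled = subflows[i].usable;
		if (subflows[i].scheduled)
			picked++;
	}
	return picked;
}
```

With three modelled subflows of which the second is unusable, red_schedule()
marks the first and third and returns 2.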

Geliang Tang (5):
  Squash to "mptcp: add get_subflow wrappers"
  mptcp: redundant subflows push pending
  mptcp: redundant subflows retrans support
  selftests/bpf: Add bpf_red scheduler
  selftests/bpf: Add bpf_red test

 net/mptcp/protocol.c                          | 158 +++++++++++-------
 net/mptcp/protocol.h                          |  12 +-
 net/mptcp/sched.c                             |  59 ++++---
 .../testing/selftests/bpf/prog_tests/mptcp.c  |  34 ++++
 .../selftests/bpf/progs/mptcp_bpf_red.c       |  36 ++++
 5 files changed, 209 insertions(+), 90 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/mptcp_bpf_red.c

-- 
2.35.3


Re: [PATCH mptcp-next v11 0/5] BPF redundant scheduler
Posted by Mat Martineau 1 year, 7 months ago
On Fri, 9 Sep 2022, Geliang Tang wrote:

> v11:
> - address Mat's comments on v10.
> - rebase to export/20220908T063452
>

Hi Geliang -

Thanks for the updates to this series.

I get slightly different kernel splats than the CI. For example, here's my 
kmsg output with the first test in mptcp_connect.sh:

[ 3102.670021] IPv6: ADDRCONF(NETDEV_CHANGE): ns1eth2: link becomes ready
[ 3102.885448] IPv6: ADDRCONF(NETDEV_CHANGE): ns2eth3: link becomes ready
[ 3103.112575] IPv6: ADDRCONF(NETDEV_CHANGE): ns3eth4: link becomes ready
[ 3103.463347] IPv6: ADDRCONF(NETDEV_CHANGE): ns2eth1: link becomes ready
[ 3107.580236] ------------[ cut here ]------------
[ 3107.581325] WARNING: CPU: 2 PID: 1112 at net/mptcp/protocol.c:1306 mptcp_sendmsg_frag (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1306 (discriminator 1)) 
[ 3107.583192] Modules linked in:
[ 3107.585317] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
[ 3107.587250] RIP: 0010:mptcp_sendmsg_frag (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1306 (discriminator 1)) 
[ 3107.588421] Code: 0f 85 21 fd ff ff 48 c7 c2 a0 c1 1a 83 be 78 00 00 00 48 c7 c7 80 c7 1a 83 c6 05 9e 31 f2 01 01 e8 03 83 03 00 e9 fd fc ff ff <0f> 0b 0f b6 44 24 63 88 44 24 30 e9 dc f4 ff ff 8b 74 24 18 48 89
All code
========
    0:	0f 85 21 fd ff ff    	jne    0xfffffffffffffd27
    6:	48 c7 c2 a0 c1 1a 83 	mov    $0xffffffff831ac1a0,%rdx
    d:	be 78 00 00 00       	mov    $0x78,%esi
   12:	48 c7 c7 80 c7 1a 83 	mov    $0xffffffff831ac780,%rdi
   19:	c6 05 9e 31 f2 01 01 	movb   $0x1,0x1f2319e(%rip)        # 0x1f231be
   20:	e8 03 83 03 00       	call   0x38328
   25:	e9 fd fc ff ff       	jmp    0xfffffffffffffd27
   2a:*	0f 0b                	ud2    		<-- trapping instruction
   2c:	0f b6 44 24 63       	movzbl 0x63(%rsp),%eax
   31:	88 44 24 30          	mov    %al,0x30(%rsp)
   35:	e9 dc f4 ff ff       	jmp    0xfffffffffffff516
   3a:	8b 74 24 18          	mov    0x18(%rsp),%esi
   3e:	48                   	rex.W
   3f:	89                   	.byte 0x89

Code starting with the faulting instruction
===========================================
    0:	0f 0b                	ud2
    2:	0f b6 44 24 63       	movzbl 0x63(%rsp),%eax
    7:	88 44 24 30          	mov    %al,0x30(%rsp)
    b:	e9 dc f4 ff ff       	jmp    0xfffffffffffff4ec
   10:	8b 74 24 18          	mov    0x18(%rsp),%esi
   14:	48                   	rex.W
   15:	89                   	.byte 0x89
[ 3107.592031] RSP: 0018:ffff888010f67910 EFLAGS: 00010202
[ 3107.593172] RAX: e585171f95821f87 RBX: ffff888113b04f00 RCX: ffffffff8260bb19
[ 3107.594458] RDX: 0000000000000001 RSI: ffffffff8260bac4 RDI: ffff88800fa84848
[ 3107.596326] RBP: e585171f95821f88 R08: 0000000000000000 R09: ffff88800fa848af
[ 3107.597665] R10: ffffed1001f50915 R11: 0000000000000000 R12: ffff888010f67a78
[ 3107.599903] R13: 0000000000000001 R14: ffff888107691800 R15: ffff8880357a0000
[ 3107.601366] FS:  00007f668bdb0740(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000
[ 3107.603201] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3107.604495] CR2: 00007ffe77890328 CR3: 0000000115318005 CR4: 0000000000370ee0
[ 3107.606212] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3107.607854] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3107.610532] Call Trace:
[ 3107.611229]  <TASK>
[ 3107.611746] ? mptcp_init_sock (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1230) 
[ 3107.612654] ? lockdep_hardirqs_on_prepare (/home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:4252 /home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:4319 
/home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:4271) 
[ 3107.613890] ? __local_bh_enable_ip (/home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:45 /home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:80 
/home/mjmartin/work/mptcp-nn/kernel/softirq.c:401) 
[ 3107.614952] __mptcp_push_pending (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1569) 
[ 3107.615948] ? mptcp_close (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1532) 
[ 3107.616850] ? __sk_mem_raise_allocated (/home/mjmartin/work/mptcp-nn/net/core/sock.c:2810 /home/mjmartin/work/mptcp-nn/net/core/sock.c:2981) 
[ 3107.617982] ? copy_page_from_iter (/home/mjmartin/work/mptcp-nn/lib/iov_iter.c:751 /home/mjmartin/work/mptcp-nn/lib/iov_iter.c:738) 
[ 3107.618991] mptcp_sendmsg (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1789) 
[ 3107.619896] ? __mptcp_push_pending (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1682) 
[ 3107.621006] ? inet_send_prepare (/home/mjmartin/work/mptcp-nn/net/ipv4/af_inet.c:807) 
[ 3107.622087] ? inet_send_prepare (/home/mjmartin/work/mptcp-nn/net/ipv4/af_inet.c:816) 
[ 3107.623082] sock_sendmsg (/home/mjmartin/work/mptcp-nn/net/socket.c:717 /home/mjmartin/work/mptcp-nn/net/socket.c:734) 
[ 3107.624372] sock_write_iter (/home/mjmartin/work/mptcp-nn/net/socket.c:1109) 
[ 3107.625361] ? sock_sendmsg (/home/mjmartin/work/mptcp-nn/net/socket.c:1092) 
[ 3107.626261] ? file_has_perm (/home/mjmartin/work/mptcp-nn/security/selinux/hooks.c:1724) 
[ 3107.627185] ? selinux_file_permission (/home/mjmartin/work/mptcp-nn/security/selinux/hooks.c:3570 /home/mjmartin/work/mptcp-nn/security/selinux/hooks.c:3590) 
[ 3107.628345] vfs_write (/home/mjmartin/work/mptcp-nn/./include/linux/fs.h:2187 /home/mjmartin/work/mptcp-nn/fs/read_write.c:491 /home/mjmartin/work/mptcp-nn/fs/read_write.c:578) 
[ 3107.629171] ? __ia32_sys_pread64 (/home/mjmartin/work/mptcp-nn/fs/read_write.c:559) 
[ 3107.630159] ? bit_wait_io_timeout (/home/mjmartin/work/mptcp-nn/kernel/locking/mutex.c:902) 
[ 3107.631177] ? __fget_light (/home/mjmartin/work/mptcp-nn/fs/file.c:1007 (discriminator 1)) 
[ 3107.632005] ksys_write (/home/mjmartin/work/mptcp-nn/fs/read_write.c:631) 
[ 3107.632957] ? __ia32_sys_read (/home/mjmartin/work/mptcp-nn/fs/read_write.c:621) 
[ 3107.633859] ? lockdep_hardirqs_on_prepare (/home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:466 /home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:4320 
/home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:4271) 
[ 3107.635041] ? syscall_enter_from_user_mode (/home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:45 /home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:80 
/home/mjmartin/work/mptcp-nn/kernel/entry/common.c:109) 
[ 3107.636398] do_syscall_64 (/home/mjmartin/work/mptcp-nn/arch/x86/entry/common.c:50 /home/mjmartin/work/mptcp-nn/arch/x86/entry/common.c:80) 
[ 3107.637254] entry_SYSCALL_64_after_hwframe (/home/mjmartin/work/mptcp-nn/arch/x86/entry/entry_64.S:120) 
[ 3107.638351] RIP: 0033:0x7f668beb48f7
[ 3107.639336] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
All code
========
    0:	0f 00                	(bad)
    2:	f7 d8                	neg    %eax
    4:	64 89 02             	mov    %eax,%fs:(%rdx)
    7:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
    e:	eb b7                	jmp    0xffffffffffffffc7
   10:	0f 1f 00             	nopl   (%rax)
   13:	f3 0f 1e fa          	endbr64
   17:	64 8b 04 25 18 00 00 	mov    %fs:0x18,%eax
   1e:	00
   1f:	85 c0                	test   %eax,%eax
   21:	75 10                	jne    0x33
   23:	b8 01 00 00 00       	mov    $0x1,%eax
   28:	0f 05                	syscall
   2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
   30:	77 51                	ja     0x83
   32:	c3                   	ret
   33:	48 83 ec 28          	sub    $0x28,%rsp
   37:	48 89 54 24 18       	mov    %rdx,0x18(%rsp)
   3c:	48                   	rex.W
   3d:	89                   	.byte 0x89
   3e:	74 24                	je     0x64

Code starting with the faulting instruction
===========================================
    0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
    6:	77 51                	ja     0x59
    8:	c3                   	ret
    9:	48 83 ec 28          	sub    $0x28,%rsp
    d:	48 89 54 24 18       	mov    %rdx,0x18(%rsp)
   12:	48                   	rex.W
   13:	89                   	.byte 0x89
   14:	74 24                	je     0x3a
[ 3107.643426] RSP: 002b:00007ffe778943c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 3107.645335] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f668beb48f7
[ 3107.647101] RDX: 0000000000002000 RSI: 00007ffe778943f0 RDI: 0000000000000003
[ 3107.648756] RBP: 0000000000000000 R08: 00007f668bfab214 R09: 00007f668bfab280
[ 3107.650320] R10: 00007f668bdba140 R11: 0000000000000246 R12: 0000000000001500
[ 3107.651883] R13: 0000000000002000 R14: 0000000000000000 R15: 0000000000002000
[ 3107.653497]  </TASK>
[ 3107.654005] irq event stamp: 17761
[ 3107.654780] hardirqs last enabled at (17773): __up_console_sem (/home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:45 (discriminator 1) /home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:80 
(discriminator 1) /home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:138 (discriminator 1) /home/mjmartin/work/mptcp-nn/kernel/printk/printk.c:264 (discriminator 1)) 
[ 3107.656996] hardirqs last disabled at (17788): __schedule (/home/mjmartin/work/mptcp-nn/kernel/sched/core.c:6393 (discriminator 1)) 
[ 3107.658933] softirqs last enabled at (17806): __irq_exit_rcu (/home/mjmartin/work/mptcp-nn/kernel/softirq.c:445 /home/mjmartin/work/mptcp-nn/kernel/softirq.c:650) 
[ 3107.660848] softirqs last disabled at (17819): __irq_exit_rcu (/home/mjmartin/work/mptcp-nn/kernel/softirq.c:445 /home/mjmartin/work/mptcp-nn/kernel/softirq.c:650) 
[ 3107.662940] ---[ end trace 0000000000000000 ]---


Do you see anything similar in your testing? This was on a 4-cpu VM for 
me.


Line 1306 of protocol.c that caused the splat is:

 		WARN_ON_ONCE(reuse_skb);

and it looks like that is expected to happen with a zero window and all 
data acked. Sounds like a condition that wasn't expected with previous 
schedulers (that only sent on one subflow at a time), but could happen 
with redundant schedulers when msk->snd_una is updated by another subflow.

If you can't reproduce this, let me know and I can investigate some more. 
It's reproducible on my system.

--
Mat Martineau
Intel

Re: [PATCH mptcp-next v11 0/5] BPF redundant scheduler
Posted by Geliang Tang 1 year, 6 months ago
Hi Mat,

Sorry for the late reply.

On Mon, Sep 12, 2022 at 05:01:39PM -0700, Mat Martineau wrote:
> On Fri, 9 Sep 2022, Geliang Tang wrote:
> 
> > v11:
> > - address Mat's comments on v10.
> > - rebase to export/20220908T063452
> > 
> 
> Hi Geliang -
> 
> Thanks for the updates to this series.
> 
> I get slightly different kernel splats than the CI. For example, here's my
> kmsg output with the first test in mptcp_connect.sh:
> 
> [ 3102.670021] IPv6: ADDRCONF(NETDEV_CHANGE): ns1eth2: link becomes ready
> [ 3102.885448] IPv6: ADDRCONF(NETDEV_CHANGE): ns2eth3: link becomes ready
> [ 3103.112575] IPv6: ADDRCONF(NETDEV_CHANGE): ns3eth4: link becomes ready
> [ 3103.463347] IPv6: ADDRCONF(NETDEV_CHANGE): ns2eth1: link becomes ready
> [ 3107.580236] ------------[ cut here ]------------
> [ 3107.581325] WARNING: CPU: 2 PID: 1112 at net/mptcp/protocol.c:1306
> mptcp_sendmsg_frag (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1306
> (discriminator 1)) [ 3107.583192] Modules linked in:
> [ 3107.585317] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014
> [ 3107.587250] RIP: 0010:mptcp_sendmsg_frag
> (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1306 (discriminator 1)) [
> 3107.588421] Code: 0f 85 21 fd ff ff 48 c7 c2 a0 c1 1a 83 be 78 00 00 00 48
> c7 c7 80 c7 1a 83 c6 05 9e 31 f2 01 01 e8 03 83 03 00 e9 fd fc ff ff <0f> 0b
> 0f b6 44 24 63 88 44 24 30 e9 dc f4 ff ff 8b 74 24 18 48 89
> All code
> ========
>    0:	0f 85 21 fd ff ff    	jne    0xfffffffffffffd27
>    6:	48 c7 c2 a0 c1 1a 83 	mov    $0xffffffff831ac1a0,%rdx
>    d:	be 78 00 00 00       	mov    $0x78,%esi
>   12:	48 c7 c7 80 c7 1a 83 	mov    $0xffffffff831ac780,%rdi
>   19:	c6 05 9e 31 f2 01 01 	movb   $0x1,0x1f2319e(%rip)        # 0x1f231be
>   20:	e8 03 83 03 00       	call   0x38328
>   25:	e9 fd fc ff ff       	jmp    0xfffffffffffffd27
>   2a:*	0f 0b                	ud2    		<-- trapping instruction
>   2c:	0f b6 44 24 63       	movzbl 0x63(%rsp),%eax
>   31:	88 44 24 30          	mov    %al,0x30(%rsp)
>   35:	e9 dc f4 ff ff       	jmp    0xfffffffffffff516
>   3a:	8b 74 24 18          	mov    0x18(%rsp),%esi
>   3e:	48                   	rex.W
>   3f:	89                   	.byte 0x89
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	0f 0b                	ud2
>    2:	0f b6 44 24 63       	movzbl 0x63(%rsp),%eax
>    7:	88 44 24 30          	mov    %al,0x30(%rsp)
>    b:	e9 dc f4 ff ff       	jmp    0xfffffffffffff4ec
>   10:	8b 74 24 18          	mov    0x18(%rsp),%esi
>   14:	48                   	rex.W
>   15:	89                   	.byte 0x89
> [ 3107.592031] RSP: 0018:ffff888010f67910 EFLAGS: 00010202
> [ 3107.593172] RAX: e585171f95821f87 RBX: ffff888113b04f00 RCX: ffffffff8260bb19
> [ 3107.594458] RDX: 0000000000000001 RSI: ffffffff8260bac4 RDI: ffff88800fa84848
> [ 3107.596326] RBP: e585171f95821f88 R08: 0000000000000000 R09: ffff88800fa848af
> [ 3107.597665] R10: ffffed1001f50915 R11: 0000000000000000 R12: ffff888010f67a78
> [ 3107.599903] R13: 0000000000000001 R14: ffff888107691800 R15: ffff8880357a0000
> [ 3107.601366] FS:  00007f668bdb0740(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000
> [ 3107.603201] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3107.604495] CR2: 00007ffe77890328 CR3: 0000000115318005 CR4: 0000000000370ee0
> [ 3107.606212] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 3107.607854] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 3107.610532] Call Trace:
> [ 3107.611229]  <TASK>
> [ 3107.611746] ? mptcp_init_sock
> (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1230) [ 3107.612654] ?
> lockdep_hardirqs_on_prepare
> (/home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:4252
> /home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:4319
> /home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:4271) [ 3107.613890] ?
> __local_bh_enable_ip
> (/home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:45
> /home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:80
> /home/mjmartin/work/mptcp-nn/kernel/softirq.c:401) [ 3107.614952]
> __mptcp_push_pending
> (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1569) [ 3107.615948] ?
> mptcp_close (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1532) [
> 3107.616850] ? __sk_mem_raise_allocated
> (/home/mjmartin/work/mptcp-nn/net/core/sock.c:2810
> /home/mjmartin/work/mptcp-nn/net/core/sock.c:2981) [ 3107.617982] ?
> copy_page_from_iter (/home/mjmartin/work/mptcp-nn/lib/iov_iter.c:751
> /home/mjmartin/work/mptcp-nn/lib/iov_iter.c:738) [ 3107.618991]
> mptcp_sendmsg (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1789) [
> 3107.619896] ? __mptcp_push_pending
> (/home/mjmartin/work/mptcp-nn/net/mptcp/protocol.c:1682) [ 3107.621006] ?
> inet_send_prepare (/home/mjmartin/work/mptcp-nn/net/ipv4/af_inet.c:807) [
> 3107.622087] ? inet_send_prepare
> (/home/mjmartin/work/mptcp-nn/net/ipv4/af_inet.c:816) [ 3107.623082]
> sock_sendmsg (/home/mjmartin/work/mptcp-nn/net/socket.c:717
> /home/mjmartin/work/mptcp-nn/net/socket.c:734) [ 3107.624372]
> sock_write_iter (/home/mjmartin/work/mptcp-nn/net/socket.c:1109) [
> 3107.625361] ? sock_sendmsg (/home/mjmartin/work/mptcp-nn/net/socket.c:1092)
> [ 3107.626261] ? file_has_perm
> (/home/mjmartin/work/mptcp-nn/security/selinux/hooks.c:1724) [ 3107.627185]
> ? selinux_file_permission
> (/home/mjmartin/work/mptcp-nn/security/selinux/hooks.c:3570
> /home/mjmartin/work/mptcp-nn/security/selinux/hooks.c:3590) [ 3107.628345]
> vfs_write (/home/mjmartin/work/mptcp-nn/./include/linux/fs.h:2187
> /home/mjmartin/work/mptcp-nn/fs/read_write.c:491
> /home/mjmartin/work/mptcp-nn/fs/read_write.c:578) [ 3107.629171] ?
> __ia32_sys_pread64 (/home/mjmartin/work/mptcp-nn/fs/read_write.c:559) [
> 3107.630159] ? bit_wait_io_timeout
> (/home/mjmartin/work/mptcp-nn/kernel/locking/mutex.c:902) [ 3107.631177] ?
> __fget_light (/home/mjmartin/work/mptcp-nn/fs/file.c:1007 (discriminator 1))
> [ 3107.632005] ksys_write (/home/mjmartin/work/mptcp-nn/fs/read_write.c:631)
> [ 3107.632957] ? __ia32_sys_read
> (/home/mjmartin/work/mptcp-nn/fs/read_write.c:621) [ 3107.633859] ?
> lockdep_hardirqs_on_prepare
> (/home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:466
> /home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:4320
> /home/mjmartin/work/mptcp-nn/kernel/locking/lockdep.c:4271) [ 3107.635041] ?
> syscall_enter_from_user_mode
> (/home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:45
> /home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:80
> /home/mjmartin/work/mptcp-nn/kernel/entry/common.c:109) [ 3107.636398]
> do_syscall_64 (/home/mjmartin/work/mptcp-nn/arch/x86/entry/common.c:50
> /home/mjmartin/work/mptcp-nn/arch/x86/entry/common.c:80) [ 3107.637254]
> entry_SYSCALL_64_after_hwframe
> (/home/mjmartin/work/mptcp-nn/arch/x86/entry/entry_64.S:120) [ 3107.638351]
> RIP: 0033:0x7f668beb48f7
> [ 3107.639336] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> All code
> ========
>    0:	0f 00                	(bad)
>    2:	f7 d8                	neg    %eax
>    4:	64 89 02             	mov    %eax,%fs:(%rdx)
>    7:	48 c7 c0 ff ff ff ff 	mov    $0xffffffffffffffff,%rax
>    e:	eb b7                	jmp    0xffffffffffffffc7
>   10:	0f 1f 00             	nopl   (%rax)
>   13:	f3 0f 1e fa          	endbr64
>   17:	64 8b 04 25 18 00 00 	mov    %fs:0x18,%eax
>   1e:	00
>   1f:	85 c0                	test   %eax,%eax
>   21:	75 10                	jne    0x33
>   23:	b8 01 00 00 00       	mov    $0x1,%eax
>   28:	0f 05                	syscall
>   2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
>   30:	77 51                	ja     0x83
>   32:	c3                   	ret
>   33:	48 83 ec 28          	sub    $0x28,%rsp
>   37:	48 89 54 24 18       	mov    %rdx,0x18(%rsp)
>   3c:	48                   	rex.W
>   3d:	89                   	.byte 0x89
>   3e:	74 24                	je     0x64
> 
> Code starting with the faulting instruction
> ===========================================
>    0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
>    6:	77 51                	ja     0x59
>    8:	c3                   	ret
>    9:	48 83 ec 28          	sub    $0x28,%rsp
>    d:	48 89 54 24 18       	mov    %rdx,0x18(%rsp)
>   12:	48                   	rex.W
>   13:	89                   	.byte 0x89
>   14:	74 24                	je     0x3a
> [ 3107.643426] RSP: 002b:00007ffe778943c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [ 3107.645335] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f668beb48f7
> [ 3107.647101] RDX: 0000000000002000 RSI: 00007ffe778943f0 RDI: 0000000000000003
> [ 3107.648756] RBP: 0000000000000000 R08: 00007f668bfab214 R09: 00007f668bfab280
> [ 3107.650320] R10: 00007f668bdba140 R11: 0000000000000246 R12: 0000000000001500
> [ 3107.651883] R13: 0000000000002000 R14: 0000000000000000 R15: 0000000000002000
> [ 3107.653497]  </TASK>
> [ 3107.654005] irq event stamp: 17761
> [ 3107.654780] hardirqs last enabled at (17773): __up_console_sem
> (/home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:45
> (discriminator 1)
> /home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:80
> (discriminator 1)
> /home/mjmartin/work/mptcp-nn/./arch/x86/include/asm/irqflags.h:138
> (discriminator 1) /home/mjmartin/work/mptcp-nn/kernel/printk/printk.c:264
> (discriminator 1)) [ 3107.656996] hardirqs last disabled at (17788):
> __schedule (/home/mjmartin/work/mptcp-nn/kernel/sched/core.c:6393
> (discriminator 1)) [ 3107.658933] softirqs last enabled at (17806):
> __irq_exit_rcu (/home/mjmartin/work/mptcp-nn/kernel/softirq.c:445
> /home/mjmartin/work/mptcp-nn/kernel/softirq.c:650) [ 3107.660848] softirqs
> last disabled at (17819): __irq_exit_rcu
> (/home/mjmartin/work/mptcp-nn/kernel/softirq.c:445
> /home/mjmartin/work/mptcp-nn/kernel/softirq.c:650) [ 3107.662940] ---[ end
> trace 0000000000000000 ]---
> 
> 
> Do you see anything similar in your testing? This was on a 4-cpu VM for me.
> 

Yes, I see both this warning (WARN_ON_ONCE(reuse_skb) at line 1306) and
another one (WARN_ON_ONCE(!msk->recovery) at line 1010) in my tests.

> 
> Line 1306 of protocol.c that caused the splat is:
> 
> 		WARN_ON_ONCE(reuse_skb);
> 
> and it looks like that is expected to happen with a zero window and all data
> acked. Sounds like a condition that wasn't expected with previous schedulers
> (that only sent on one subflow at a time), but could happen with redundant
> schedulers when msk->snd_una is updated by another subflow.
>

The original code updates dfrag->already_sent immediately after calling
mptcp_sendmsg_frag(). In this series we delay that update until all frags
have been sent, so mptcp_check_allowed_size() sometimes returns 0. That is
how we hit the WARN_ON_ONCE(reuse_skb) at line 1306:

1291         if (copy == 0) {
1292                 u64 snd_una = READ_ONCE(msk->snd_una);
1293 
1294                 if (snd_una != msk->snd_nxt) {
1295                         tcp_remove_empty_skb(ssk);
1296                         return 0;
1297                 }
1298 
1299                 zero_window_probe = true;
1300                 data_seq = snd_una - 1;
1301                 copy = 1;
1302 
1303                 /* all mptcp-level data is acked, no skbs should be present into the
1304                  * ssk write queue
1305                  */
1306                 WARN_ON_ONCE(reuse_skb);
1307         }
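
The decision made by the branch quoted above can be restated as a small
standalone C function, which may make the failing case easier to see. This
is only a sketch of the control flow in lines 1291-1307: enum frag_outcome
and classify_frag() are hypothetical names, and the real function works on
the msk and ssk rather than on plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the possible outcomes of the copy == 0 check. */
enum frag_outcome {
	FRAG_SEND,		/* copy != 0: normal transmission */
	FRAG_SKIP,		/* unacked data in flight: send nothing */
	FRAG_PROBE,		/* zero window probe, no skb being reused */
	FRAG_PROBE_WARN,	/* zero window probe while reusing an skb */
};

/*
 * Mirrors the quoted branch: a zero allowed size with all data acked
 * turns into a zero window probe, and the WARN_ON_ONCE() fires if an
 * skb from the subflow write queue is being reused at that point.
 */
static enum frag_outcome classify_frag(uint64_t copy, uint64_t snd_una,
				       uint64_t snd_nxt, bool reuse_skb)
{
	if (copy != 0)
		return FRAG_SEND;
	if (snd_una != snd_nxt)
		return FRAG_SKIP;
	return reuse_skb ? FRAG_PROBE_WARN : FRAG_PROBE;
}
```

In this model the splat corresponds to classify_frag(0, x, x, true): the
allowed size is 0, snd_una equals snd_nxt, and an skb is still being reused.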

The original code also updates msk->first_pending immediately after each
frag is sent, but we delay that update until all frags have been sent. As a
result, the code can reach the (dfrag == msk->first_pending) branch outside
recovery, and we hit the WARN_ON_ONCE(!msk->recovery) at line 1010:

1008                 if (unlikely(dfrag == msk->first_pending)) {
1009                         /* in recovery mode can see ack after the current snd head */
1010                         if (WARN_ON_ONCE(!msk->recovery))
1011                                 break;
1012 
1013                         WRITE_ONCE(msk->first_pending, mptcp_send_next(sk));
1014                 }
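
To make the ordering problem concrete, the check quoted above can be
modelled in userspace C with fragments represented by their index in the
send queue. This is a simplified sketch under that assumption: struct
send_state and clean_acked_warns() are hypothetical names, and the real
loop walks a list of dfrags and also frees them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, index-based model of the msk send state. */
struct send_state {
	size_t first_pending;	/* index of the first fragment not yet sent */
	bool recovery;		/* retransmission recovery in progress */
};

/*
 * Walks the fragments [0, acked) that an incoming ack covers. Returns
 * true if the walk reaches first_pending outside recovery, which is the
 * condition guarded by WARN_ON_ONCE(!msk->recovery) in the quoted code.
 */
static bool clean_acked_warns(struct send_state *st, size_t acked)
{
	for (size_t i = 0; i < acked; i++) {
		if (i == st->first_pending) {
			if (!st->recovery)
				return true;	/* mirrors the break after the WARN */
			st->first_pending = i + 1;
		}
	}
	return false;
}
```

If first_pending is advanced to 2 right after fragments 0 and 1 are sent,
an ack covering both does not warn. If the update is delayed and
first_pending is still 0 when the ack for fragment 0 arrives, the model
reports the warning.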

I'm trying to fix these two warnings but haven't made much progress, so I'd
like to hear your suggestions.

Thanks,
-Geliang

> If you can't reproduce this, let me know and I can investigate some more.
> It's reproducible on my system.
> 
> --
> Mat Martineau
> Intel