[PATCH mptcp-next v8 0/8] BPF packet scheduler

Geliang Tang posted 8 patches 2 years ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/multipath-tcp/mptcp_net-next tags/patchew/cover.1648537203.git.geliang.tang@suse.com
Maintainers: Alexei Starovoitov <ast@kernel.org>, Jonathan Corbet <corbet@lwn.net>, KP Singh <kpsingh@kernel.org>, Shuah Khan <shuah@kernel.org>, Jakub Kicinski <kuba@kernel.org>, Yonghong Song <yhs@fb.com>, Song Liu <songliubraving@fb.com>, John Fastabend <john.fastabend@gmail.com>, Mat Martineau <mathew.j.martineau@linux.intel.com>, Andrii Nakryiko <andrii@kernel.org>, Matthieu Baerts <matthieu.baerts@tessares.net>, "David S. Miller" <davem@davemloft.net>, Martin KaFai Lau <kafai@fb.com>, Daniel Borkmann <daniel@iogearbox.net>, Paolo Abeni <pabeni@redhat.com>
There is a newer version of this series
Documentation/networking/mptcp-sysctl.rst     |   8 ++
include/net/mptcp.h                           |  13 ++
kernel/bpf/bpf_struct_ops_types.h             |   4 +
net/mptcp/Makefile                            |   2 +-
net/mptcp/bpf.c                               | 102 ++++++++++++++++
net/mptcp/ctrl.c                              |  14 +++
net/mptcp/protocol.c                          |  15 ++-
net/mptcp/protocol.h                          |  16 +++
net/mptcp/sched.c                             | 104 ++++++++++++++++
tools/testing/selftests/bpf/bpf_tcp_helpers.h |  12 ++
.../testing/selftests/bpf/prog_tests/mptcp.c  | 114 ++++++++++++++++++
.../selftests/bpf/progs/mptcp_bpf_first.c     |  30 +++++
12 files changed, 429 insertions(+), 5 deletions(-)
create mode 100644 net/mptcp/sched.c
create mode 100644 tools/testing/selftests/bpf/progs/mptcp_bpf_first.c
[PATCH mptcp-next v8 0/8] BPF packet scheduler
Posted by Geliang Tang 2 years ago
v8:
 - use global sched_list instead of pernet sched_list.
 - drop synchronize_rcu() in mptcp_unregister_scheduler().
 - update mptcp_init_sched and mptcp_release_sched as Mat and Florian
   suggested.
 - fix the build break in patch 8.
 - depends on: "add skc_to_mptcp_sock" v14.
 - export/20220325T055307

v7:
 - add bpf_try_module_get in mptcp_init_sched.
 - add bpf_module_put in mptcp_release_sched.
 - rename bpf_first to mptcp_bpf_first.
 - update commit logs.

v6:
 - still use pernet sched_list, use current->nsproxy->net_ns in BPF
   context instead of using init_net.
 - patch 1:
   - use rcu_read_lock instead of spin_lock in mptcp_sched_find as Florian suggested.
   - drop synchronize_rcu in sched_exit_net as Florian suggested.
   - keep synchronize_rcu in mptcp_unregister_scheduler, otherwise, got
     a workqueue lockup in my test.
   - update Makefile as Mat suggested.
 - patch 2:
   - add mptcp_sched_data_init to register default sched, instead of
     registering it in init_net.
 - patch 5:
   - move mptcp_sched_get_subflow to protocol.h as Mat suggested.
 - patch 6:
   - use current->nsproxy->net_ns instead of init_net.
 - patch 8:
   - add send_data to send more data, instead of send_byte.

v5:
 - patch 1: define per-namespace sched_list (but only used init_net
   namespace. It is difficult to get 'net' in bpf_mptcp_sched_reg and
   bpf_mptcp_sched_unreg. I need some suggestions here.)
 - patch 2: skip mptcp_sched_default in mptcp_unregister_scheduler.
 - patch 8: add tests into mptcp.c, instead of bpf_tcp_ca.c.

v4:
 - set msk->sched to &mptcp_sched_default when the sched argument is NULL
in mptcp_init_sched().

v3:
 - add mptcp_release_sched helper in patch 4.
 - rename mptcp_set_sched to mptcp_init_sched in patch 4.
 - add mptcp_sched_first_release in patch 7.
 - do some cleanups.

v2:
 - split into more small patches.
 - change all parameters of mptcp_sched_ops from sk to msk:
       void (*init)(struct mptcp_sock *msk);
       void (*release)(struct mptcp_sock *msk);
       struct sock *   (*get_subflow)(struct mptcp_sock *msk);
 - add tests in bpf_tcp_ca.c, instead of adding a new one.

v1:
 - Addressed to the commends in the RFC version.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/75

Geliang Tang (8):
  mptcp: add struct mptcp_sched_ops
  mptcp: register default scheduler
  mptcp: add a new sysctl scheduler
  mptcp: add sched in mptcp_sock
  mptcp: add get_subflow wrapper
  mptcp: add bpf_mptcp_sched_ops
  selftests: bpf: add bpf_first scheduler
  selftests: bpf: add bpf_first test

 Documentation/networking/mptcp-sysctl.rst     |   8 ++
 include/net/mptcp.h                           |  13 ++
 kernel/bpf/bpf_struct_ops_types.h             |   4 +
 net/mptcp/Makefile                            |   2 +-
 net/mptcp/bpf.c                               | 102 ++++++++++++++++
 net/mptcp/ctrl.c                              |  14 +++
 net/mptcp/protocol.c                          |  15 ++-
 net/mptcp/protocol.h                          |  16 +++
 net/mptcp/sched.c                             | 104 ++++++++++++++++
 tools/testing/selftests/bpf/bpf_tcp_helpers.h |  12 ++
 .../testing/selftests/bpf/prog_tests/mptcp.c  | 114 ++++++++++++++++++
 .../selftests/bpf/progs/mptcp_bpf_first.c     |  30 +++++
 12 files changed, 429 insertions(+), 5 deletions(-)
 create mode 100644 net/mptcp/sched.c
 create mode 100644 tools/testing/selftests/bpf/progs/mptcp_bpf_first.c

-- 
2.34.1


Re: [PATCH mptcp-next v8 0/8] BPF packet scheduler
Posted by Mat Martineau 2 years ago
On Tue, 29 Mar 2022, Geliang Tang wrote:

> v8:
> - use global sched_list instead of pernet sched_list.

Yes, I think this is fine. I had initially asked about pernet 
configuration with respect to registering BPF schedulers, but the 
important thing is that the sysctl can be set per-namespace.

> - drop synchronize_rcu() in mptcp_unregister_scheduler().
> - update mptcp_init_sched and mptcp_release_sched as Mat and Florian
>   suggested.
> - fix the build break in patch 8.
> - depends on: "add skc_to_mptcp_sock" v14.
> - export/20220325T055307

Thanks for updating. Builds and runs fine here.

Before we add this to the export branch, have you thought about how 
subflow data (including the backup bit and throughput/latency data used by 
the default scheduler) can be accessed in the BPF get_subflow hook? Do you 
think that can be cleanly added after this series, or is there anything 
that may need to be changed in this series?

Thanks,

Mat


>
> v7:
> - add bpf_try_module_get in mptcp_init_sched.
> - add bpf_module_put in mptcp_release_sched.
> - rename bpf_first to mptcp_bpf_first.
> - update commit logs.
>
> v6:
> - still use pernet sched_list, use current->nsproxy->net_ns in BPF
>   context instead of using init_net.
> - patch 1:
>   - use rcu_read_lock instead of spin_lock in mptcp_sched_find as Florian suggested.
>   - drop synchronize_rcu in sched_exit_net as Florian suggested.
>   - keep synchronize_rcu in mptcp_unregister_scheduler, otherwise, got
>     a workqueue lockup in my test.
>   - update Makefile as Mat suggested.
> - patch 2:
>   - add mptcp_sched_data_init to register default sched, instead of
>     registering it in init_net.
> - patch 5:
>   - move mptcp_sched_get_subflow to protocol.h as Mat suggested.
> - patch 6:
>   - use current->nsproxy->net_ns instead of init_net.
> - patch 8:
>   - add send_data to send more data, instead of send_byte.
>
> v5:
> - patch 1: define per-namespace sched_list (but only used init_net
>   namespace. It is difficult to get 'net' in bpf_mptcp_sched_reg and
>   bpf_mptcp_sched_unreg. I need some suggestions here.)
> - patch 2: skip mptcp_sched_default in mptcp_unregister_scheduler.
> - patch 8: add tests into mptcp.c, instead of bpf_tcp_ca.c.
>
> v4:
> - set msk->sched to &mptcp_sched_default when the sched argument is NULL
> in mptcp_init_sched().
>
> v3:
> - add mptcp_release_sched helper in patch 4.
> - rename mptcp_set_sched to mptcp_init_sched in patch 4.
> - add mptcp_sched_first_release in patch 7.
> - do some cleanups.
>
> v2:
> - split into more small patches.
> - change all parameters of mptcp_sched_ops from sk to msk:
>       void (*init)(struct mptcp_sock *msk);
>       void (*release)(struct mptcp_sock *msk);
>       struct sock *   (*get_subflow)(struct mptcp_sock *msk);
> - add tests in bpf_tcp_ca.c, instead of adding a new one.
>
> v1:
> - Addressed to the commends in the RFC version.
>
> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/75
>
> Geliang Tang (8):
>  mptcp: add struct mptcp_sched_ops
>  mptcp: register default scheduler
>  mptcp: add a new sysctl scheduler
>  mptcp: add sched in mptcp_sock
>  mptcp: add get_subflow wrapper
>  mptcp: add bpf_mptcp_sched_ops
>  selftests: bpf: add bpf_first scheduler
>  selftests: bpf: add bpf_first test
>
> Documentation/networking/mptcp-sysctl.rst     |   8 ++
> include/net/mptcp.h                           |  13 ++
> kernel/bpf/bpf_struct_ops_types.h             |   4 +
> net/mptcp/Makefile                            |   2 +-
> net/mptcp/bpf.c                               | 102 ++++++++++++++++
> net/mptcp/ctrl.c                              |  14 +++
> net/mptcp/protocol.c                          |  15 ++-
> net/mptcp/protocol.h                          |  16 +++
> net/mptcp/sched.c                             | 104 ++++++++++++++++
> tools/testing/selftests/bpf/bpf_tcp_helpers.h |  12 ++
> .../testing/selftests/bpf/prog_tests/mptcp.c  | 114 ++++++++++++++++++
> .../selftests/bpf/progs/mptcp_bpf_first.c     |  30 +++++
> 12 files changed, 429 insertions(+), 5 deletions(-)
> create mode 100644 net/mptcp/sched.c
> create mode 100644 tools/testing/selftests/bpf/progs/mptcp_bpf_first.c
>
> -- 
> 2.34.1
>
>
>

--
Mat Martineau
Intel

Re: [PATCH mptcp-next v8 0/8] BPF packet scheduler
Posted by Geliang Tang 2 years ago
Hi Mat & Matt,

On Tue, Mar 29, 2022 at 04:13:02PM -0700, Mat Martineau wrote:
> 
> On Tue, 29 Mar 2022, Geliang Tang wrote:
> 
> > v8:
> > - use global sched_list instead of pernet sched_list.
> 
> Yes, I think this is fine. I had initially asked about pernet configuration
> with respect to registering BPF schedulers, but the important thing is that
> the sysctl can be set per-namespace.
> 
> > - drop synchronize_rcu() in mptcp_unregister_scheduler().
> > - update mptcp_init_sched and mptcp_release_sched as Mat and Florian
> >   suggested.
> > - fix the build break in patch 8.
> > - depends on: "add skc_to_mptcp_sock" v14.
> > - export/20220325T055307
> 
> Thanks for updating. Builds and runs fine here.
> 
> Before we add this to the export branch, have you thought about how subflow
> data (including the backup bit and throughput/latency data used by the
> default scheduler) can be accessed in the BPF get_subflow hook? Do you think
> that can be cleanly added after this series, or is there anything that may
> need to be changed in this series?

I plan to do that in the next series, BPF round-robin scheduler.

I had implemented round-robin in kernel before:

https://patchwork.kernel.org/project/mptcp/cover/cover.1631011068.git.geliangtang@xiaomi.com/

This time I plan to implement it using BPF.

In order to support bpf_rr, I think we need to do two more things
based on this series:

1. Get the subflow data from BPF, maybe iterate over the subflows
from mptcp_sock.

2. Make same members of struct mptcp_sock writable in BPF, say
last_snd, then we can set it like this:

        msk->last_snd = ssk.

Neither of these things is easy to implement. I'd like to hear your
opinions.

Thanks,
-Geliang

> 
> Thanks,
> 
> Mat
> 
> 
> > 
> > v7:
> > - add bpf_try_module_get in mptcp_init_sched.
> > - add bpf_module_put in mptcp_release_sched.
> > - rename bpf_first to mptcp_bpf_first.
> > - update commit logs.
> > 
> > v6:
> > - still use pernet sched_list, use current->nsproxy->net_ns in BPF
> >   context instead of using init_net.
> > - patch 1:
> >   - use rcu_read_lock instead of spin_lock in mptcp_sched_find as Florian suggested.
> >   - drop synchronize_rcu in sched_exit_net as Florian suggested.
> >   - keep synchronize_rcu in mptcp_unregister_scheduler, otherwise, got
> >     a workqueue lockup in my test.
> >   - update Makefile as Mat suggested.
> > - patch 2:
> >   - add mptcp_sched_data_init to register default sched, instead of
> >     registering it in init_net.
> > - patch 5:
> >   - move mptcp_sched_get_subflow to protocol.h as Mat suggested.
> > - patch 6:
> >   - use current->nsproxy->net_ns instead of init_net.
> > - patch 8:
> >   - add send_data to send more data, instead of send_byte.
> > 
> > v5:
> > - patch 1: define per-namespace sched_list (but only used init_net
> >   namespace. It is difficult to get 'net' in bpf_mptcp_sched_reg and
> >   bpf_mptcp_sched_unreg. I need some suggestions here.)
> > - patch 2: skip mptcp_sched_default in mptcp_unregister_scheduler.
> > - patch 8: add tests into mptcp.c, instead of bpf_tcp_ca.c.
> > 
> > v4:
> > - set msk->sched to &mptcp_sched_default when the sched argument is NULL
> > in mptcp_init_sched().
> > 
> > v3:
> > - add mptcp_release_sched helper in patch 4.
> > - rename mptcp_set_sched to mptcp_init_sched in patch 4.
> > - add mptcp_sched_first_release in patch 7.
> > - do some cleanups.
> > 
> > v2:
> > - split into more small patches.
> > - change all parameters of mptcp_sched_ops from sk to msk:
> >       void (*init)(struct mptcp_sock *msk);
> >       void (*release)(struct mptcp_sock *msk);
> >       struct sock *   (*get_subflow)(struct mptcp_sock *msk);
> > - add tests in bpf_tcp_ca.c, instead of adding a new one.
> > 
> > v1:
> > - Addressed to the commends in the RFC version.
> > 
> > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/75
> > 
> > Geliang Tang (8):
> >  mptcp: add struct mptcp_sched_ops
> >  mptcp: register default scheduler
> >  mptcp: add a new sysctl scheduler
> >  mptcp: add sched in mptcp_sock
> >  mptcp: add get_subflow wrapper
> >  mptcp: add bpf_mptcp_sched_ops
> >  selftests: bpf: add bpf_first scheduler
> >  selftests: bpf: add bpf_first test
> > 
> > Documentation/networking/mptcp-sysctl.rst     |   8 ++
> > include/net/mptcp.h                           |  13 ++
> > kernel/bpf/bpf_struct_ops_types.h             |   4 +
> > net/mptcp/Makefile                            |   2 +-
> > net/mptcp/bpf.c                               | 102 ++++++++++++++++
> > net/mptcp/ctrl.c                              |  14 +++
> > net/mptcp/protocol.c                          |  15 ++-
> > net/mptcp/protocol.h                          |  16 +++
> > net/mptcp/sched.c                             | 104 ++++++++++++++++
> > tools/testing/selftests/bpf/bpf_tcp_helpers.h |  12 ++
> > .../testing/selftests/bpf/prog_tests/mptcp.c  | 114 ++++++++++++++++++
> > .../selftests/bpf/progs/mptcp_bpf_first.c     |  30 +++++
> > 12 files changed, 429 insertions(+), 5 deletions(-)
> > create mode 100644 net/mptcp/sched.c
> > create mode 100644 tools/testing/selftests/bpf/progs/mptcp_bpf_first.c
> > 
> > -- 
> > 2.34.1
> > 
> > 
> > 
> 
> --
> Mat Martineau
> Intel
> 


Re: [PATCH mptcp-next v8 0/8] BPF packet scheduler
Posted by Mat Martineau 2 years ago
On Wed, 30 Mar 2022, Geliang Tang wrote:

> Hi Mat & Matt,
>
> On Tue, Mar 29, 2022 at 04:13:02PM -0700, Mat Martineau wrote:
>>
>> On Tue, 29 Mar 2022, Geliang Tang wrote:
>>
>>> v8:
>>> - use global sched_list instead of pernet sched_list.
>>
>> Yes, I think this is fine. I had initially asked about pernet configuration
>> with respect to registering BPF schedulers, but the important thing is that
>> the sysctl can be set per-namespace.
>>
>>> - drop synchronize_rcu() in mptcp_unregister_scheduler().
>>> - update mptcp_init_sched and mptcp_release_sched as Mat and Florian
>>>   suggested.
>>> - fix the build break in patch 8.
>>> - depends on: "add skc_to_mptcp_sock" v14.
>>> - export/20220325T055307
>>
>> Thanks for updating. Builds and runs fine here.
>>
>> Before we add this to the export branch, have you thought about how subflow
>> data (including the backup bit and throughput/latency data used by the
>> default scheduler) can be accessed in the BPF get_subflow hook? Do you think
>> that can be cleanly added after this series, or is there anything that may
>> need to be changed in this series?
>
> I plan to do that in the next series, BPF round-robin scheduler.
>
> I had implemented round-robin in kernel before:
>
> https://patchwork.kernel.org/project/mptcp/cover/cover.1631011068.git.geliangtang@xiaomi.com/
>
> This time I plan to implement it using BPF.
>
> In order to support bpf_rr, I think we need to do two more things
> based on this series:
>
> 1. Get the subflow data from BPF, maybe iterate over the subflows
> from mptcp_sock.
>

Like we discussed in the meeting, I think a helper function that's 
callable from the BPF code could gather the necessary information from the 
subflows in a safe way.

> 2. Make same members of struct mptcp_sock writable in BPF, say
> last_snd, then we can set it like this:
>
>        msk->last_snd = ssk.

The kernel C code can record which subflow was used last based on the 
previous value returned by the scheduler, or a BPF scheduler could track 
data in a BPF map (I think - I'm not a BPF expert).

For something like last_snd to work, I think we need some way to identify 
subflows that's not the raw ssk pointer. Are there rules about pointer 
usage in BPF that we need to consider?

--
Mat Martineau
Intel

Re: [PATCH mptcp-next v8 0/8] BPF packet scheduler
Posted by Matthieu Baerts 2 years ago
Hi Mat, Geliang,

On 02/04/2022 02:40, Mat Martineau wrote:
> On Wed, 30 Mar 2022, Geliang Tang wrote:
>> 1. Get the subflow data from BPF, maybe iterate over the subflows
>> from mptcp_sock.
>>
> 
> Like we discussed in the meeting, I think a helper function that's
> callable from the BPF code could gather the necessary information from
> the subflows in a safe way.

Yes, similar to what is done in TCP CA, e.g. bpf_tcp_send_ack(), etc.

>> 2. Make same members of struct mptcp_sock writable in BPF, say
>> last_snd, then we can set it like this:
>>
>>        msk->last_snd = ssk.
> 
> The kernel C code can record which subflow was used last based on the
> previous value returned by the scheduler, or a BPF scheduler could track
> data in a BPF map (I think - I'm not a BPF expert).

msk->last_snd seems quite specific to the current in-kernel packet
scheduler, no?

It will be useful to store this last ssk for a RR scheduler but I think
that's just a coincidence. I mean: it would probably be helpful to store
everything needed for the current in-kernel packet scheduler in a
dedicated structure in the msk. Then any packet scheduler can decide
what to store in this dedicated memory and we can also easily clear stuff.

> For something like last_snd to work, I think we need some way to
> identify subflows that's not the raw ssk pointer. Are there rules about
> pointer usage in BPF that we need to consider?

Yes, for BPF we might store info in a MAP linked to the msk. It would be
interesting to see what is done in BPF Cubic code for example.

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net

Re: [PATCH mptcp-next v8 0/8] BPF packet scheduler
Posted by Matthieu Baerts 2 years ago
Hi Geliang,

On 30/03/2022 16:12, Geliang Tang wrote:
> Hi Mat & Matt,
> 
> On Tue, Mar 29, 2022 at 04:13:02PM -0700, Mat Martineau wrote:
>>
>> On Tue, 29 Mar 2022, Geliang Tang wrote:
>>
>>> v8:
>>> - use global sched_list instead of pernet sched_list.
>>
>> Yes, I think this is fine. I had initially asked about pernet configuration
>> with respect to registering BPF schedulers, but the important thing is that
>> the sysctl can be set per-namespace.
>>
>>> - drop synchronize_rcu() in mptcp_unregister_scheduler().
>>> - update mptcp_init_sched and mptcp_release_sched as Mat and Florian
>>>   suggested.
>>> - fix the build break in patch 8.
>>> - depends on: "add skc_to_mptcp_sock" v14.
>>> - export/20220325T055307
>>
>> Thanks for updating. Builds and runs fine here.
>>
>> Before we add this to the export branch, have you thought about how subflow
>> data (including the backup bit and throughput/latency data used by the
>> default scheduler) can be accessed in the BPF get_subflow hook? Do you think
>> that can be cleanly added after this series, or is there anything that may
>> need to be changed in this series?
> 
> I plan to do that in the next series, BPF round-robin scheduler.
> 
> I had implemented round-robin in kernel before:
> 
> https://patchwork.kernel.org/project/mptcp/cover/cover.1631011068.git.geliangtang@xiaomi.com/
> 
> This time I plan to implement it using BPF.

Sounds like a good idea to exercise his new BPF packet scheduler
infrastructure, thanks!

> In order to support bpf_rr, I think we need to do two more things
> based on this series:
> 
> 1. Get the subflow data from BPF, maybe iterate over the subflows
> from mptcp_sock.

Yes, sounds good to me. What is blocking to do that?
Maybe some new BPF helpers will be needed?

> 2. Make same members of struct mptcp_sock writable in BPF, say
> last_snd, then we can set it like this:
> 
>         msk->last_snd = ssk.

I don't think this is a direction to take with BPF. I mean: not
directly. You can have helper doing that with extra verifications. How
is it done with TCP CA?

Or maybe for this specific case here with RR, you can also store the
last snd subflow in a MAP storage, no?

Cheers,
Matt
-- 
Tessares | Belgium | Hybrid Access Solutions
www.tessares.net