include/net/mptcp.h | 29 ++++++++++++++++++++++++ net/mptcp/bpf.c | 25 ++++++++++++++++++-- net/mptcp/protocol.c | 14 ++++++++---- net/mptcp/protocol.h | 9 ++++++++ net/mptcp/sched.c | 20 ++++++++++++++++ tools/testing/selftests/bpf/progs/mptcp_bpf.h | 1 + tools/testing/selftests/bpf/progs/mptcp_bpf_rr.c | 18 +++++++++++++++ 7 files changed, 110 insertions(+), 6 deletions(-)
This patchset proposes an updated API for mptcp schedulers to provide more flexibility for future scheduling strategies. This does not change the current behavior of get_subflow. It adds an optional hook to control scheduling behavior after selecting a subflow in get_subflow. The main idea is to provide a way to perform a scheduling per-MSS or per chunk of data. To demonstrate this, the mptcp_bpf_rr scheduler has been modified to send one MSS per subflow in an alternating fashion. The other potential application of this API could be to define a scheduler that will proactivelly reinject data on available subflows based on some heuristics (eg. when one subflow RTT increases when moving away from WiFi AP). It could also allow to limit the usage of a specific subflow, eg., in the case of high BDP. I am mainly looking for feedback to see whether this design make sense. Matthieu already did a quick pass on it. The API introduces a push() function (name TBD) that takes the subflow and data chunk (sequence number and length) as arguments. The scheduler can then control packet scheduling by: - Setting data transmission limits (e.g., per-MSS on a subflow or per window). - Setting flags associated with the data. Currently, only MPTCP_SCHED_FLAG_RESCHEDULE is defined, forcing the scheduler to call get_subflow after sending the chunk. I am envisioning another flag MPTCP_SCHED_FLAG_REINJECT that will ensure that chunk of data can be rescheduled directly on another subflow. I've written some packetdrill tests to validate the update RR scheduler the test are available on my fork: https://github.com/gdetal/packetdrill/tree/mptcp-net-next/gtests/net/mptcp-bpf/round-robin Open questions: - Is the current API for setting flags and limits appropriate? What are eBPF best practices to pass back information to kernel ? - should get_scheduler and this new API be merged somehow as to provide a single way of scheduling data ? Eg. could pass chunk and return flags, limit and subflow to send this data to. Co-Developed-By: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Gregory Detal <gregory.detal@gmail.com> --- Gregory Detal (4): mptcp: add push sched callback mptcp: use new push callback to schedule chunks mptcp: bpf: allow to write to mptcp_sched_chunk selftests/bpf: mptcp RR: send 1 MSS on each subflow include/net/mptcp.h | 29 ++++++++++++++++++++++++ net/mptcp/bpf.c | 25 ++++++++++++++++++-- net/mptcp/protocol.c | 14 ++++++++---- net/mptcp/protocol.h | 9 ++++++++ net/mptcp/sched.c | 20 ++++++++++++++++ tools/testing/selftests/bpf/progs/mptcp_bpf.h | 1 + tools/testing/selftests/bpf/progs/mptcp_bpf_rr.c | 18 +++++++++++++++ 7 files changed, 110 insertions(+), 6 deletions(-) --- base-commit: c98f8dc106a6d6c4424f2e612762821e1be14bf0 change-id: 20240503-sched_per_packet-d6e4488e54e8 Best regards, -- Gregory Detal <gregory.detal@gmail.com>
On Mon, 27 May 2024, Gregory Detal wrote: > This patchset proposes an updated API for mptcp schedulers to provide > more flexibility for future scheduling strategies. This does not change > the current behavior of get_subflow. It adds an optional hook to control > scheduling behavior after selecting a subflow in get_subflow. > > The main idea is to provide a way to perform a scheduling per-MSS or per > chunk of data. To demonstrate this, the mptcp_bpf_rr scheduler has been > modified to send one MSS per subflow in an alternating fashion. > > The other potential application of this API could be to define a > scheduler that will proactivelly reinject data on available subflows > based on some heuristics (eg. when one subflow RTT increases when moving > away from WiFi AP). It could also allow to limit the usage of a specific > subflow, eg., in the case of high BDP. > > I am mainly looking for feedback to see whether this design make sense. > Matthieu already did a quick pass on it. > Hi Gregory - Sorry about the delay, thanks for working on this scheduler code! > The API introduces a push() function (name TBD) that takes the subflow > and data chunk (sequence number and length) as arguments. The scheduler > can then control packet scheduling by: > - Setting data transmission limits (e.g., per-MSS on a subflow or per window). > - Setting flags associated with the data. Currently, only > MPTCP_SCHED_FLAG_RESCHEDULE is defined, forcing the scheduler to call > get_subflow after sending the chunk. > > I am envisioning another flag MPTCP_SCHED_FLAG_REINJECT that will ensure > that chunk of data can be rescheduled directly on another subflow. > > I've written some packetdrill tests to validate the update RR scheduler > the test are available on my fork: > > https://github.com/gdetal/packetdrill/tree/mptcp-net-next/gtests/net/mptcp-bpf/round-robin > > Open questions: > - Is the current API for setting flags and limits appropriate? What are > eBPF best practices to pass back information to kernel ? I think the approach of using a regular u16 for the flags and preprocessor constants like MPTCP_SCHED_FLAG_RESCHEDULE isn't a good fit for eBPF. The main reasons I see: * The user's eBPF code can set any combination of bits in the u16 flags without the kernel automatically checking that those bits are known/expected. * The flag values as preprocessor constants can't be represented in BTF format to allow compiling the BPF scheduler using type information from the kernel (rather than manually setting those #defines in the scheduler code). Even though it's not as space-efficient, I think individual struct members would be better than an integer bitfield for this kind of flag. Then the struct_access hook can check what's supported at load time. > - should get_scheduler and this new API be merged somehow as to provide > a single way of scheduling data ? Eg. could pass chunk and return > flags, limit and subflow to send this data to. Yes, I think it would be good to consolidate get_scheduler and the new API, so two calls out to the BPF scheduler are not required for each packet. It's ok to change the internal scheduler APIs, and I think that will be required to support reinjection. - Mat > > Co-Developed-By: Matthieu Baerts <matttbe@kernel.org> > Signed-off-by: Gregory Detal <gregory.detal@gmail.com> > --- > Gregory Detal (4): > mptcp: add push sched callback > mptcp: use new push callback to schedule chunks > mptcp: bpf: allow to write to mptcp_sched_chunk > selftests/bpf: mptcp RR: send 1 MSS on each subflow > > include/net/mptcp.h | 29 ++++++++++++++++++++++++ > net/mptcp/bpf.c | 25 ++++++++++++++++++-- > net/mptcp/protocol.c | 14 ++++++++---- > net/mptcp/protocol.h | 9 ++++++++ > net/mptcp/sched.c | 20 ++++++++++++++++ > tools/testing/selftests/bpf/progs/mptcp_bpf.h | 1 + > tools/testing/selftests/bpf/progs/mptcp_bpf_rr.c | 18 +++++++++++++++ > 7 files changed, 110 insertions(+), 6 deletions(-) > --- > base-commit: c98f8dc106a6d6c4424f2e612762821e1be14bf0 > change-id: 20240503-sched_per_packet-d6e4488e54e8 > > Best regards, > -- > Gregory Detal <gregory.detal@gmail.com> > > >
Hi Gregory, Thank you for your modifications, that's great! Our CI did some validations and here is its report: - KVM Validation: normal: Success! ✅ - KVM Validation: debug: Success! ✅ - KVM Validation: btf (only bpftest_all): Success! ✅ - Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9255729312 Initiator: Patchew Applier Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/d60094641280 Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=856158 If there are some issues, you can reproduce them using the same environment as the one used by the CI thanks to a docker image, e.g.: $ cd [kernel source code] $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \ --pull always mptcp/mptcp-upstream-virtme-docker:latest \ auto-normal For more details: https://github.com/multipath-tcp/mptcp-upstream-virtme-docker Please note that despite all the efforts that have been already done to have a stable tests suite when executed on a public CI like here, it is possible some reported issues are not due to your modifications. Still, do not hesitate to help us improve that ;-) Cheers, MPTCP GH Action bot Bot operated by Matthieu Baerts (NGI0 Core)
© 2016 - 2024 Red Hat, Inc.