From: Geliang Tang <tanggeliang@kylinos.cn>
It's necessary to traverse all subflows on the conn_list of an MPTCP
socket and then call a kfunc to modify the fields of each subflow. In
kernel space, the mptcp_for_each_subflow() helper is used for this:

	mptcp_for_each_subflow(msk, subflow)
		kfunc(subflow);
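
For reference, this helper is a thin wrapper around list_for_each_entry(),
as defined in net/mptcp/protocol.h:

	#define mptcp_for_each_subflow(__msk, __subflow)			\
		list_for_each_entry(__subflow, &((__msk)->conn_list), node)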

But this is not yet possible from an MPTCP BPF program. As Martin
suggested recently, this conn_list walking + modify-by-kfunc usage fits
the bpf_iter use case.

So this patch adds a new bpf_iter type named "mptcp_subflow" to do
this, implements its helpers bpf_iter_mptcp_subflow_new()/_next()/
_destroy(), and registers them in the MPTCP common kfunc set. Then
bpf_for_each() can be used on mptcp_subflow in a BPF program like this:

	bpf_for_each(mptcp_subflow, subflow, msk)
		kfunc(subflow);
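
To give an idea of the intended usage, here is a minimal BPF-side sketch:
the program body, the mark_subflow() kfunc and the way a trusted msk
pointer is obtained are illustrative only, not part of this patch.

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include "bpf_experimental.h"	/* bpf_for_each() */

	/* hypothetical kfunc modifying one subflow */
	extern void mark_subflow(struct mptcp_subflow_context *subflow) __ksym;

	SEC("cgroup/getsockopt")
	int iter_subflows(struct bpf_sockopt *ctx)
	{
		struct mptcp_subflow_context *subflow;
		struct mptcp_sock *msk;

		/* assumed: a trusted msk pointer can be obtained here */
		msk = bpf_skc_to_mptcp_sock(ctx->sk);
		if (!msk)
			return 1;	/* let the sockopt proceed */

		bpf_for_each(mptcp_subflow, subflow, msk)
			mark_subflow(subflow);

		return 1;
	}

	char _license[] SEC("license") = "GPL";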
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Notes:
    - v2:
      - Add BUILD_BUG_ON() checks, similar to the ones done with other
        bpf_iter_(...) helpers.
      - Replace msk_owned_by_me() with sock_owned_by_user_nocheck() and
        !spin_is_locked() (Martin).

    A few versions of this single patch have been previously posted to
    the BPF mailing list by Geliang, before continuing on the MPTCP
    mailing list only, with the other patches of this series. The version
    of the whole series has been reset to 1, but here is the changelog
    for the previous ones:
    - v2: remove msk->pm.lock in _new() and _destroy() (Martin)
          drop DEFINE_BPF_ITER_FUNC, change opaque[3] to opaque[2] (Andrii)
    - v3: drop bpf_iter__mptcp_subflow
    - v4: if msk is NULL, initialize kit->msk to NULL in _new() and check
          it in _next() (Andrii)
    - v5: use list_is_last() instead of list_entry_is_head()
          add KF_ITER_NEW/NEXT/DESTROY flags
          add msk_owned_by_me() in _new()
    - v6: add KF_TRUSTED_ARGS flag (Andrii, Martin)
---
net/mptcp/bpf.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 53 insertions(+)
diff --git a/net/mptcp/bpf.c b/net/mptcp/bpf.c
index c5bfd84c16c43230d9d8e1fd8ff781a767e647b5..e39f0e4fb683c1aa31ee075281daee218dac5878 100644
--- a/net/mptcp/bpf.c
+++ b/net/mptcp/bpf.c
@@ -35,6 +35,15 @@ static const struct btf_kfunc_id_set bpf_mptcp_fmodret_set = {
 	.set = &bpf_mptcp_fmodret_ids,
 };
 
+struct bpf_iter_mptcp_subflow {
+	__u64 __opaque[2];
+} __aligned(8);
+
+struct bpf_iter_mptcp_subflow_kern {
+	struct mptcp_sock *msk;
+	struct list_head *pos;
+} __aligned(8);
+
 __bpf_kfunc_start_defs();
 
 __bpf_kfunc static struct mptcp_subflow_context *
@@ -47,10 +56,54 @@ bpf_mptcp_subflow_ctx(const struct sock *sk)
 	return NULL;
 }
 
+__bpf_kfunc static int
+bpf_iter_mptcp_subflow_new(struct bpf_iter_mptcp_subflow *it,
+			   struct mptcp_sock *msk)
+{
+	struct bpf_iter_mptcp_subflow_kern *kit = (void *)it;
+	struct sock *sk = (struct sock *)msk;
+
+	BUILD_BUG_ON(sizeof(struct bpf_iter_mptcp_subflow_kern) >
+		     sizeof(struct bpf_iter_mptcp_subflow));
+	BUILD_BUG_ON(__alignof__(struct bpf_iter_mptcp_subflow_kern) !=
+		     __alignof__(struct bpf_iter_mptcp_subflow));
+
+	kit->msk = msk;
+	if (!msk)
+		return -EINVAL;
+
+	if (!sock_owned_by_user_nocheck(sk) &&
+	    !spin_is_locked(&sk->sk_lock.slock))
+		return -EINVAL;
+
+	kit->pos = &msk->conn_list;
+	return 0;
+}
+
+__bpf_kfunc static struct mptcp_subflow_context *
+bpf_iter_mptcp_subflow_next(struct bpf_iter_mptcp_subflow *it)
+{
+	struct bpf_iter_mptcp_subflow_kern *kit = (void *)it;
+
+	if (!kit->msk || list_is_last(kit->pos, &kit->msk->conn_list))
+		return NULL;
+
+	kit->pos = kit->pos->next;
+	return list_entry(kit->pos, struct mptcp_subflow_context, node);
+}
+
+__bpf_kfunc static void
+bpf_iter_mptcp_subflow_destroy(struct bpf_iter_mptcp_subflow *it)
+{
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_mptcp_common_kfunc_ids)
 BTF_ID_FLAGS(func, bpf_mptcp_subflow_ctx, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_iter_mptcp_subflow_new, KF_ITER_NEW | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_iter_mptcp_subflow_next, KF_ITER_NEXT | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_iter_mptcp_subflow_destroy, KF_ITER_DESTROY)
 BTF_KFUNCS_END(bpf_mptcp_common_kfunc_ids)
 
 static const struct btf_kfunc_id_set bpf_mptcp_common_kfunc_set = {
--
2.47.1
On 12/19/24 7:46 AM, Matthieu Baerts (NGI0) wrote:
> From: Geliang Tang <tanggeliang@kylinos.cn>
(...)
> +__bpf_kfunc static int
> +bpf_iter_mptcp_subflow_new(struct bpf_iter_mptcp_subflow *it,
> +			   struct mptcp_sock *msk)
> +{
> +	struct bpf_iter_mptcp_subflow_kern *kit = (void *)it;
> +	struct sock *sk = (struct sock *)msk;
> +
> +	BUILD_BUG_ON(sizeof(struct bpf_iter_mptcp_subflow_kern) >
> +		     sizeof(struct bpf_iter_mptcp_subflow));
> +	BUILD_BUG_ON(__alignof__(struct bpf_iter_mptcp_subflow_kern) !=
> +		     __alignof__(struct bpf_iter_mptcp_subflow));
> +
> +	kit->msk = msk;
> +	if (!msk)

The NULL check is not needed. The verifier should have rejected it for
KF_TRUSTED_ARGS.

> +		return -EINVAL;
> +
> +	if (!sock_owned_by_user_nocheck(sk) &&
> +	    !spin_is_locked(&sk->sk_lock.slock))

I could have missed something. If it is to catch a bug, should it be
sock_owned_by_me() that has the lockdep splat? For the cg get/setsockopt hook
here, the lock should have already been held earlier in the kernel.

This set is only showing the cg sockopt bpf prog but missing the major
struct_ops piece. It is hard to comment. I assumed the lock situation is the
same for the struct_ops where the lock will be held before calling the
struct_ops prog?
> +		return -EINVAL;
> +
> +	kit->pos = &msk->conn_list;
> +	return 0;
> +}
(...)
Hi Martin,
Thank you for your review!
Sorry for the delay here: Geliang started to work on a new version, but
it might take a bit of time as he is currently off for a few days.
On 24/01/2025 01:47, Martin KaFai Lau wrote:
> On 12/19/24 7:46 AM, Matthieu Baerts (NGI0) wrote:
>> From: Geliang Tang <tanggeliang@kylinos.cn>
(...)
>> diff --git a/net/mptcp/bpf.c b/net/mptcp/bpf.c
>> index
>> c5bfd84c16c43230d9d8e1fd8ff781a767e647b5..e39f0e4fb683c1aa31ee075281daee218dac5878 100644
>> --- a/net/mptcp/bpf.c
>> +++ b/net/mptcp/bpf.c
(...)
>> @@ -47,10 +56,54 @@ bpf_mptcp_subflow_ctx(const struct sock *sk)
>>  	return NULL;
>>  }
>>
>> +__bpf_kfunc static int
>> +bpf_iter_mptcp_subflow_new(struct bpf_iter_mptcp_subflow *it,
>> +			   struct mptcp_sock *msk)
>> +{
>> +	struct bpf_iter_mptcp_subflow_kern *kit = (void *)it;
>> +	struct sock *sk = (struct sock *)msk;
>> +
>> +	BUILD_BUG_ON(sizeof(struct bpf_iter_mptcp_subflow_kern) >
>> +		     sizeof(struct bpf_iter_mptcp_subflow));
>> +	BUILD_BUG_ON(__alignof__(struct bpf_iter_mptcp_subflow_kern) !=
>> +		     __alignof__(struct bpf_iter_mptcp_subflow));
>> +
>> +	kit->msk = msk;
>> +	if (!msk)
>
> The NULL check is not needed. The verifier should have rejected it for
> KF_TRUSTED_ARGS.
>
>> +		return -EINVAL;
>> +
>> +	if (!sock_owned_by_user_nocheck(sk) &&
>> +	    !spin_is_locked(&sk->sk_lock.slock))
>
> I could have missed something. If it is to catch a bug, should it be
> sock_owned_by_me() that has the lockdep splat? For the cg get/setsockopt
> hook here, the lock should have already been held earlier in the kernel.
Good point. Because in this series, the kfunc is currently restricted to
CG [gs]etsockopt hooks, we should use msk_owned_by_me(msk) here.
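
For reference, msk_owned_by_me() (net/mptcp/protocol.h) simply calls
sock_owned_by_me() on the casted socket, so _new() could then become
something like this; a minimal sketch only, assuming the lock is always
held by the hook as you described:

	kit->msk = msk;
	if (!msk)
		return -EINVAL;

	msk_owned_by_me(msk);	/* lockdep-only check, no error path */

	kit->pos = &msk->conn_list;
	return 0;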
> This set is only showing the cg sockopt bpf prog but missing the major
> struct_ops piece. It is hard to comment. I assumed the lock situation is
> the same for the struct_ops where the lock will be held before calling
> the struct_ops prog?
I understand it is hard to comment on that point. In the 'struct_ops' we
are designing, the lock will indeed be held before calling the struct_ops
program. So we will just need to make sure this assumption is correct
for all callbacks of our struct_ops.
Also, if I understood correctly, it is possible to restrict a kfunc to
some specific struct_ops, e.g. to not expose this kfunc to the TCP CA
struct_ops. So these checks should indeed not be needed, but I will
double-check that with Geliang.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
Hi Geliang,
(only with the MPTCP list)
On 24/01/2025 01:47, Martin KaFai Lau wrote:
> On 12/19/24 7:46 AM, Matthieu Baerts (NGI0) wrote:
>> From: Geliang Tang <tanggeliang@kylinos.cn>
(...)
>> +__bpf_kfunc static int
>> +bpf_iter_mptcp_subflow_new(struct bpf_iter_mptcp_subflow *it,
>> +			   struct mptcp_sock *msk)
>> +{
>> +	struct bpf_iter_mptcp_subflow_kern *kit = (void *)it;
>> +	struct sock *sk = (struct sock *)msk;
>> +
>> +	BUILD_BUG_ON(sizeof(struct bpf_iter_mptcp_subflow_kern) >
>> +		     sizeof(struct bpf_iter_mptcp_subflow));
>> +	BUILD_BUG_ON(__alignof__(struct bpf_iter_mptcp_subflow_kern) !=
>> +		     __alignof__(struct bpf_iter_mptcp_subflow));
>> +
>> +	kit->msk = msk;
>> +	if (!msk)
>
> The NULL check is not needed. The verifier should have rejected it for
> KF_TRUSTED_ARGS.
>
>> +		return -EINVAL;
>> +
>> +	if (!sock_owned_by_user_nocheck(sk) &&
>> +	    !spin_is_locked(&sk->sk_lock.slock))
>
> I could have missed something. If it is to catch a bug, should it be
> sock_owned_by_me() that has the lockdep splat? For the cg get/setsockopt
> hook here, the lock should have already been held earlier in the kernel.
:)
So from what I understand: either we need these checks because this
helper could technically be called when these conditions are not met, or
we don't need them, and we can have sock_owned_by_me() (without the
return that was in v1) for lockdep.
> This set is only showing the cg sockopt bpf prog but missing the major
> struct_ops piece. It is hard to comment. I assumed the lock situation is
> the same for the struct_ops where the lock will be held before calling
> the struct_ops prog?
Can we restrict the use of this helper to only some specific callbacks?
Or to only some specific struct_ops?

Last time I looked, I thought it could only be restricted like this:
"this helper can be used by (any) struct_ops". But then, does it mean
this helper could be called from a TCP BPF CC ops or another unrelated
one, for example? How can we avoid that?
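
One existing mechanism that may fit is the .filter callback of
struct btf_kfunc_id_set, which can reject a kfunc call based on the
calling program. A rough sketch of how it could look on the MPTCP side;
the policy and the mptcp_sched_ops_btf_id symbol are hypothetical:

	static int bpf_mptcp_kfunc_filter(const struct bpf_prog *prog,
					  u32 kfunc_id)
	{
		/* only filter the kfuncs of this set */
		if (!btf_id_set8_contains(&bpf_mptcp_common_kfunc_ids, kfunc_id))
			return 0;

		/* hypothetical policy: allow cg sockopt progs, and allow
		 * struct_ops progs only when attached to the (future) MPTCP
		 * struct_ops, identified here by an assumed BTF id */
		if (prog->type == BPF_PROG_TYPE_CGROUP_SOCKOPT)
			return 0;
		if (prog->type == BPF_PROG_TYPE_STRUCT_OPS &&
		    prog->aux->attach_btf_id == mptcp_sched_ops_btf_id)
			return 0;

		return -EACCES;
	}

	static const struct btf_kfunc_id_set bpf_mptcp_common_kfunc_set = {
		.owner	= THIS_MODULE,
		.set	= &bpf_mptcp_common_kfunc_ids,
		.filter	= bpf_mptcp_kfunc_filter,
	};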
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.