[PATCH mptcp-next] mptcp: preserve MSG_EOR semantics in sendmsg path

Gang Yan posted 1 patch 1 month, 1 week ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/multipath-tcp/mptcp_net-next tags/patchew/20260309025431.125943-1-gang.yan@linux.dev
There is a newer version of this series
[PATCH mptcp-next] mptcp: preserve MSG_EOR semantics in sendmsg path
Posted by Gang Yan 1 month, 1 week ago
From: Gang Yan <yangang@kylinos.cn>

Extend MPTCP's sendmsg handling to recognize and honor the MSG_EOR flag,
which marks the end of a record for application-level message boundaries.

Data fragments tagged with MSG_EOR are explicitly marked in the
mptcp_data_frag structure and skb context to prevent unintended
coalescing with subsequent data chunks. This ensures the intent of
applications using MSG_EOR is preserved across MPTCP subflows,
maintaining consistent message segmentation behavior.

Signed-off-by: Gang Yan <yangang@kylinos.cn>
---

Notes:
      - This patch incorporates feedback and suggestions from Paolo Abeni
        and Geliang Tang, including memory alignment optimizations for the
        mptcp_data_frag struct (shrinking overhead to u8 and using bitfield
        for eor to avoid size increase) and compile-time checks with BUILD_BUG_ON.
      - Packetdrill test cases validating this feature are available at:
        https://github.com/multipath-tcp/packetdrill/pull/189/changes/d6ce92a4786704fe749bbd848ced0c047632282e
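
For context, here is a minimal user-space sketch of how an application might
set MSG_EOR on an MPTCP socket. It is an illustration only and not part of the
patch: mptcp_socket() and send_record() are hypothetical helper names, and
IPPROTO_MPTCP is given a fallback definition (262, the Linux UAPI value) in
case the libc headers lack it. On a byte-stream socket the flag is not expected
to create a boundary the receiver can see; the intent described above is only
that the sender does not coalesce later data with the marked chunk.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262	/* fallback for older libc headers */
#endif

/* Hypothetical helper: open an MPTCP stream socket.
 * Returns -1 with errno set if the kernel has no MPTCP support.
 */
int mptcp_socket(int family)
{
	return socket(family, SOCK_STREAM, IPPROTO_MPTCP);
}

/* Hypothetical helper: send one application record with MSG_EOR set,
 * retrying only when the call is interrupted before sending anything.
 * Returns the number of bytes queued, or -1 with errno set.
 */
ssize_t send_record(int fd, const void *buf, size_t len)
{
	ssize_t n;

	do {
		n = send(fd, buf, len, MSG_EOR | MSG_NOSIGNAL);
	} while (n < 0 && errno == EINTR);

	return n;
}
```

A caller would typically connect() the descriptor returned by mptcp_socket()
and then call send_record() once per application message.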

 net/mptcp/protocol.c | 24 ++++++++++++++++++++++--
 net/mptcp/protocol.h |  4 +++-
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 17e43aff4459..3e574c87301b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1174,6 +1174,7 @@ mptcp_carve_data_frag(const struct mptcp_sock *msk, struct page_frag *pfrag,
 	dfrag->offset = offset + sizeof(struct mptcp_data_frag);
 	dfrag->already_sent = 0;
 	dfrag->page = pfrag->page;
+	dfrag->eor = 0;
 
 	return dfrag;
 }
@@ -1435,6 +1436,13 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
 		mptcp_update_infinite_map(msk, ssk, mpext);
 	trace_mptcp_sendmsg_frag(mpext);
 	mptcp_subflow_ctx(ssk)->rel_write_seq += copy;
+
+	/* If this is the last chunk of a dfrag with MSG_EOR set,
+	 * mark the skb to prevent coalescing with subsequent data.
+	 */
+	if (dfrag->eor && info->sent + copy >= dfrag->data_len)
+		TCP_SKB_CB(skb)->eor = 1;
+
 	return copy;
 }
 
@@ -1895,7 +1903,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	long timeo;
 
 	/* silently ignore everything else */
-	msg->msg_flags &= MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL | MSG_FASTOPEN;
+	msg->msg_flags &= MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL |
+			  MSG_FASTOPEN | MSG_EOR;
 
 	lock_sock(sk);
 
@@ -2002,8 +2011,16 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			goto do_error;
 	}
 
-	if (copied)
+	if (copied) {
+		/* Mark the last dfrag with EOR if MSG_EOR was set */
+		if (msg->msg_flags & MSG_EOR) {
+			struct mptcp_data_frag *dfrag = mptcp_pending_tail(sk);
+
+			if (dfrag)
+				dfrag->eor = 1;
+		}
 		__mptcp_push_pending(sk, msg->msg_flags);
+	}
 
 out:
 	release_sock(sk);
@@ -4621,6 +4638,9 @@ void __init mptcp_proto_init(void)
 	inet_register_protosw(&mptcp_protosw);
 
 	BUILD_BUG_ON(sizeof(struct mptcp_skb_cb) > sizeof_field(struct sk_buff, cb));
+	/* Compile-time check: ensure 'overhead' (alignment + struct size) fits in u8 */
+	BUILD_BUG_ON(ALIGN(1, sizeof(long)) + sizeof(struct mptcp_data_frag) > U8_MAX);
+
 }
 
 #if IS_ENABLED(CONFIG_MPTCP_IPV6)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index f5d4d7d030f2..db96f2945cbd 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -264,7 +264,9 @@ struct mptcp_data_frag {
 	u64 data_seq;
 	u16 data_len;
 	u16 offset;
-	u16 overhead;
+	u8 overhead;
+	u8 eor:1,
+	   __unused:7;
 	u16 already_sent;
 	struct page *page;
 };
-- 
2.43.0
Re: [PATCH mptcp-next] mptcp: preserve MSG_EOR semantics in sendmsg path
Posted by Matthieu Baerts 3 weeks, 4 days ago
Hi Gang,

Thank you for the new version.

On 09/03/2026 03:54, Gang Yan wrote:
> From: Gang Yan <yangang@kylinos.cn>
> 
> Extend MPTCP's sendmsg handling to recognize and honor the MSG_EOR flag,
> which marks the end of a record for application-level message boundaries.
> 
> Data fragments tagged with MSG_EOR are explicitly marked in the
> mptcp_data_frag structure and skb context to prevent unintended
> coalescing with subsequent data chunks. This ensures the intent of
> applications using MSG_EOR is preserved across MPTCP subflows,
> maintaining consistent message segmentation behavior.
> 
> Signed-off-by: Gang Yan <yangang@kylinos.cn>
> ---
> 
> Notes:
>       - This patch incorporates feedback and suggestions from Paolo Abeni
>         and Geliang Tang, including memory alignment optimizations for the
>         mptcp_data_frag struct (shrinking overhead to u8 and using bitfield
>         for eor to avoid size increase) and compile-time checks with BUILD_BUG_ON.

Please mention why you shrank "overhead" to a u8 (not to increase the
struct size), and why it is OK to do so (u16 not needed because ...) +
explaining the BUILD_BUG_ON().

>       - Packetdrill test cases validating this feature are available at:
>         https://github.com/multipath-tcp/packetdrill/pull/189/changes/d6ce92a4786704fe749bbd848ced0c047632282e

Thank you, I just reviewed it.

Do you mind checking the AI review there please:


https://netdev-ai.bots.linux.dev/ai-review.html?id=22434689-7326-48c8-af75-273d99fbef55

I think it is valid, but better to double-check.

> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 17e43aff4459..3e574c87301b 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c

(...)

> @@ -4621,6 +4638,9 @@ void __init mptcp_proto_init(void)
>  	inet_register_protosw(&mptcp_protosw);
>  
>  	BUILD_BUG_ON(sizeof(struct mptcp_skb_cb) > sizeof_field(struct sk_buff, cb));
> +	/* Compile-time check: ensure 'overhead' (alignment + struct size) fits in u8 */
> +	BUILD_BUG_ON(ALIGN(1, sizeof(long)) + sizeof(struct mptcp_data_frag) > U8_MAX);

Sorry, I'm not sure what you are checking here. Do you mind explaining
it please?

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.
Re: [PATCH mptcp-next] mptcp: preserve MSG_EOR semantics in sendmsg path
Posted by gang.yan@linux.dev 3 weeks ago
March 27, 2026 at 12:42 AM, "Matthieu Baerts" <matttbe@kernel.org> wrote:


> 
> Hi Gang,
> 
> Thank you for the new version.
> 
> On 09/03/2026 03:54, Gang Yan wrote:
> 
> > 
> >  (...)
> >  
> >  Notes:
> >  - This patch incorporates feedback and suggestions from Paolo Abeni
> >  and Geliang Tang, including memory alignment optimizations for the
> >  mptcp_data_frag struct (shrinking overhead to u8 and using bitfield
> >  for eor to avoid size increase) and compile-time checks with BUILD_BUG_ON.
> > 
> Please mention why you shrank "overhead" to a u8 (not to increase the
> struct size), and why it is OK to do so (u16 not needed because ...) +
> explaining the BUILD_BUG_ON().

The ‘u8’ is one of Paolo's suggestions[1]. I think 'u16' is not needed because:
 - 'offset = ALIGN(orig_offset, sizeof(long));'
 - 'dfrag->overhead = offset - orig_offset + sizeof(struct mptcp_data_frag);',
the max value of offset is 7, and sizeof(struct mptcp_data_frag) is
usually 40, so the overhead is at most 47, far less than 255.

Another suggestion from Paolo[1] is a build time check on the max 'overhead'
value. So I use 'ALIGN(1, sizeof(long)) + sizeof(struct mptcp_data_frag)' to
represent the max_val of 'overhead'.

But Paolo also mentioned it's probably too conservative. WDYT?
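
The arithmetic above can be checked with a small user-space sketch. It is only
an illustration under stated assumptions: data_frag_mirror is a hypothetical
copy of struct mptcp_data_frag after this patch (the leading list member is
assumed to be a two-pointer struct list_head, which is not visible in the diff
context), ALIGN_UP() stands in for the kernel's ALIGN(), and static_assert()
plays the role of BUILD_BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same rounding as the kernel's ALIGN() when 'a' is a power of two. */
#define ALIGN_UP(x, a) (((size_t)(x) + ((size_t)(a) - 1)) & ~((size_t)(a) - 1))

/* Stand-in for the kernel's struct list_head (two pointers). */
struct list_head_mirror {
	void *next, *prev;
};

/* Assumed user-space mirror of struct mptcp_data_frag after the patch. */
struct data_frag_mirror {
	struct list_head_mirror list;
	uint64_t data_seq;
	uint16_t data_len;
	uint16_t offset;
	uint8_t overhead;
	uint8_t eor:1,
		unused:7;
	uint16_t already_sent;
	void *page;
};

/* Conservative bound from the patch: ALIGN(1, sizeof(long)) evaluates to
 * sizeof(long), one more than the largest padding ALIGN() can really add.
 */
#define OVERHEAD_BOUND \
	(ALIGN_UP(1, sizeof(long)) + sizeof(struct data_frag_mirror))

static_assert(OVERHEAD_BOUND <= UINT8_MAX, "overhead must fit in a u8");

/* Padding added when orig_offset is rounded up to a multiple of sizeof(long). */
size_t align_padding(size_t orig_offset)
{
	return ALIGN_UP(orig_offset, sizeof(long)) - orig_offset;
}

/* Largest 'overhead' value seen for any orig_offset below 'limit'. */
size_t max_overhead(size_t limit)
{
	size_t worst = 0;

	for (size_t orig = 0; orig < limit; orig++) {
		size_t pad = align_padding(orig);

		if (pad > worst)
			worst = pad;
	}
	return worst + sizeof(struct data_frag_mirror);
}
```

On a typical 64-bit build the mirror struct is 40 bytes, so max_overhead()
yields 47 and OVERHEAD_BOUND is 48, consistent with the numbers quoted above.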

[1] https://patchwork.kernel.org/project/mptcp/patch/20260203023029.855434-1-gang.yan@linux.dev/

> 
> > 
> > - Packetdrill test cases validating this feature are available at:
> >  https://github.com/multipath-tcp/packetdrill/pull/189/changes/d6ce92a4786704fe749bbd848ced0c047632282e
> > 
> Thank you, I just reviewed it.

Thanks, I'll try to fix them.

> 
> Do you mind checking the AI review there please:
> 
> https://netdev-ai.bots.linux.dev/ai-review.html?id=22434689-7326-48c8-af75-273d99fbef55
> 
> I think it is valid, but better to double-check.

Yes, I think it's a good catch, and we should fix it as follows:

@@ -1032,7 +1032,8 @@ static bool mptcp_frag_can_collapse_to(const struct mptcp_sock *msk,
                                       const struct page_frag *pfrag,
                                       const struct mptcp_data_frag *df)
 {
-       return df && pfrag->page == df->page &&
+       return df && !df->eor &&
+               pfrag->page == df->page &&
                pfrag->size - pfrag->offset > 0 &&
                pfrag->offset == (df->offset + df->data_len) &&
                df->data_seq + df->data_len == msk->write_seq;

If OK, I'll apply it when sending v2.
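
The effect of the extra '!df->eor' test can be modelled in user space. This is
a sketch only: frag_model and pfrag_model are hypothetical, trimmed-down
stand-ins for struct mptcp_data_frag and struct page_frag, and write_seq is
passed directly instead of being read from the msk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct mptcp_data_frag used here. */
struct frag_model {
	const void *page;
	uint64_t data_seq;
	uint16_t data_len;
	uint16_t offset;
	uint8_t eor:1;
};

/* Hypothetical stand-in for struct page_frag. */
struct pfrag_model {
	const void *page;
	uint32_t offset;
	uint32_t size;
};

/* Mirrors the proposed check: a dfrag that ends a record is never extended,
 * even when the new data would be contiguous in the page and in sequence space.
 */
bool frag_can_collapse_to(uint64_t write_seq, const struct pfrag_model *pfrag,
			  const struct frag_model *df)
{
	return df && !df->eor &&
	       pfrag->page == df->page &&
	       pfrag->size - pfrag->offset > 0 &&
	       pfrag->offset == (uint32_t)(df->offset + df->data_len) &&
	       df->data_seq + df->data_len == write_seq;
}
```

With two otherwise identical fragments, only the one with eor cleared is
eligible, so data from the next sendmsg() call starts a new dfrag.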

> 
> > 
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> >  index 17e43aff4459..3e574c87301b 100644
> >  --- a/net/mptcp/protocol.c
> >  +++ b/net/mptcp/protocol.c
> > 
> (...)
> 
> > 
> > @@ -4621,6 +4638,9 @@ void __init mptcp_proto_init(void)
> >  inet_register_protosw(&mptcp_protosw);
> >  
> >  BUILD_BUG_ON(sizeof(struct mptcp_skb_cb) > sizeof_field(struct sk_buff, cb));
> >  + /* Compile-time check: ensure 'overhead' (alignment + struct size) fits in u8 */
> >  + BUILD_BUG_ON(ALIGN(1, sizeof(long)) + sizeof(struct mptcp_data_frag) > U8_MAX);
> > 
> Sorry, I'm not sure what you are checking here. Do you mind explaining
> it please?
> 

The 'BUILD_BUG_ON' is explained at the beginning of the reply, thanks.

Cheers,
Gang

> Cheers,
> Matt
> -- 
> Sponsored by the NGI0 Core fund.
>
Re: [PATCH mptcp-next] mptcp: preserve MSG_EOR semantics in sendmsg path
Posted by Matthieu Baerts 3 weeks ago
Hi Gang,

On 30/03/2026 10:19, gang.yan@linux.dev wrote:
> March 27, 2026 at 12:42 AM, "Matthieu Baerts" <matttbe@kernel.org> wrote:
> 
> 
>>
>> Hi Gang,
>>
>> Thank you for the new version.
>>
>> On 09/03/2026 03:54, Gang Yan wrote:
>>
>>>
>>>  (...)
>>>  
>>>  Notes:
>>>  - This patch incorporates feedback and suggestions from Paolo Abeni
>>>  and Geliang Tang, including memory alignment optimizations for the
>>>  mptcp_data_frag struct (shrinking overhead to u8 and using bitfield
>>>  for eor to avoid size increase) and compile-time checks with BUILD_BUG_ON.
>>>
>> Please mention why you shrank "overhead" to a u8 (not to increase the
>> struct size), and why it is OK to do so (u16 not needed because ...) +
>> explaining the BUILD_BUG_ON().
> 
> The ‘u8’ is one of Paolo's suggestions[1]. I think 'u16' is not needed because:
>  - 'offset = ALIGN(orig_offset, sizeof(long));'
>  - 'dfrag->overhead = offset - orig_offset + sizeof(struct mptcp_data_frag);',
> the max value of offset is 7, and sizeof(struct mptcp_data_frag) is
> usually 40, so the overhead is at most 47, far less than 255.

Thank you for the explanation. Can you then mention in the commit
message that it is fine to reduce overhead to a 'u8', and add the above
explanation, please?

If 'offset' max value is 7, it could also be reduced from a u16 to a u8
then, no?

> Another suggestion from Paolo[1] is a build time check on the max 'overhead'
> value. So I use 'ALIGN(1, sizeof(long)) + sizeof(struct mptcp_data_frag)' to
> represent the max_val of 'overhead'.

It might be good to add a comment here too, at least to explain that
"ALIGN(1, sizeof(long))" represents 'offset' maximum size.

> But Paolo also mentioned it's probably too conservative. WDYT?

Maybe, but it doesn't hurt, I suppose, as long as this check is clearly
linked to the relevant fields of the mptcp_data_frag structure, e.g. with
a comment explaining that.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.

Re: [PATCH mptcp-next] mptcp: preserve MSG_EOR semantics in sendmsg path
Posted by gang.yan@linux.dev 2 weeks, 6 days ago
March 30, 2026 at 5:50 PM, "Matthieu Baerts" <matttbe@kernel.org> wrote:


> 
> Hi Gang,
> 
> On 30/03/2026 10:19, gang.yan@linux.dev wrote:
> 
> > 
> > March 27, 2026 at 12:42 AM, "Matthieu Baerts" <matttbe@kernel.org> wrote:
> >  
> >  
> > 
> > > 
> > > Hi Gang,
> > > 
> > >  Thank you for the new version.
> > > 
> > >  On 09/03/2026 03:54, Gang Yan wrote:
> > > 
> > > >  (...)
> > > >  
> > > >  Notes:
> > > >  - This patch incorporates feedback and suggestions from Paolo Abeni
> > > >  and Geliang Tang, including memory alignment optimizations for the
> > > >  mptcp_data_frag struct (shrinking overhead to u8 and using bitfield
> > > >  for eor to avoid size increase) and compile-time checks with BUILD_BUG_ON.
> > 
> > > 
> > > Please mention why you shrank "overhead" to a u8 (not to increase the
> > >  struct size), and why it is OK to do so (u16 not needed because ...) +
> > >  explaining the BUILD_BUG_ON().
> > > 
> >  
> >  The ‘u8’ is one of Paolo's suggestions[1]. I think 'u16' is not needed because:
> >  - 'offset = ALIGN(orig_offset, sizeof(long));'
> >  - 'dfrag->overhead = offset - orig_offset + sizeof(struct mptcp_data_frag);',
> >  the max value of offset is 7, and sizeof(struct mptcp_data_frag) is
> >  usually 40, so the overhead is at most 47, far less than 255.
> > 
> Thank you for the explanation. Can you then mention in the commit
> message that it is fine to reduce overhead to a 'u8', and add the above
> explanation, please?
> 
> If 'offset' max value is 7, it could also be reduced from a u16 to a u8
> then, no?

Hi, Matt:

Sorry, there was an error in my explanation: it is (offset - orig_offset)
whose maximum value is 7, not 'offset' itself, so the 'offset' field
should stay a u16.

> 
> > 
> > Another suggestion from Paolo[1] is a build time check on the max 'overhead'
> >  value. So I use 'ALIGN(1, sizeof(long)) + sizeof(struct mptcp_data_frag)' to
> >  represent the max_val of 'overhead'.
> > 
> It might be good to add a comment here too, at least to explain that
> "ALIGN(1, sizeof(long))" represents 'offset' maximum size.

Good idea, I'll apply your suggestions in v2.

Thanks
Gang
> 
> > 
> > But Paolo also mentioned it's probably too conservative. WDYT?
> > 
> Maybe, but it doesn't hurt, I suppose, as long as this check is clearly
> linked to the relevant fields of the mptcp_data_frag structure, e.g. with
> a comment explaining that.
> 
> Cheers,
> Matt
> -- 
> Sponsored by the NGI0 Core fund.
>
Re: [PATCH mptcp-next] mptcp: preserve MSG_EOR semantics in sendmsg path
Posted by MPTCP CI 1 month, 1 week ago
Hi Gang,

Thank you for your modifications, that's great!

Our CI did some validations and here is its report:

- KVM Validation: normal (except selftest_mptcp_join): Success! ✅
- KVM Validation: normal (only selftest_mptcp_join): Success! ✅
- KVM Validation: debug (except selftest_mptcp_join): Unstable: 1 failed test(s): packetdrill_dss 🔴
- KVM Validation: debug (only selftest_mptcp_join): Success! ✅
- KVM Validation: btf-normal (only bpftest_all): Success! ✅
- KVM Validation: btf-debug (only bpftest_all): Success! ✅
- Task: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/22836823300

Initiator: Patchew Applier
Commits: https://github.com/multipath-tcp/mptcp_net-next/commits/070dbf41676b
Patchwork: https://patchwork.kernel.org/project/mptcp/list/?series=1063383


If there are some issues, you can reproduce them using the same environment as
the one used by the CI thanks to a docker image, e.g.:

    $ cd [kernel source code]
    $ docker run -v "${PWD}:${PWD}:rw" -w "${PWD}" --privileged --rm -it \
        --pull always mptcp/mptcp-upstream-virtme-docker:latest \
        auto-normal

For more details:

    https://github.com/multipath-tcp/mptcp-upstream-virtme-docker


Please note that despite all the effort already put into keeping the test
suite stable when it runs on a public CI like this one, some reported
issues may not be caused by your modifications. Still, do not hesitate to
help us improve that ;-)

Cheers,
MPTCP GH Action bot
Bot operated by Matthieu Baerts (NGI0 Core)