:p
atchew
Login
From: Geliang Tang <tanggeliang@kylinos.cn> v9: - merge the squash-to patches. - a new patch "drop release and lock again in splice_read". v8: - export struct tcp_splice_state and tcp_splice_data_recv() in net/tcp.h. - add a new helper mptcp_recv_should_stop. - add mptcp_connect_splice.sh. - update commit logs. v7: - only patch 1 and patch 2 changed. - add a new helper mptcp_eat_recv_skb. - invoke skb_peek in mptcp_recv_skb(). - use while ((skb = mptcp_recv_skb(sk)) != NULL) instead of skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp). v6: - address Paolo's comments for v4, v5 (thanks) v5: - extract the common code of __mptcp_recvmsg_mskq() and mptcp_read_sock() into a new helper __mptcp_recvmsg_desc() to reduce duplication code. v4: - v3 doesn't work for MPTCP fallback tests in mptcp_connect.sh, this set fix it. - invoke __mptcp_move_skbs in mptcp_read_sock. - use INDIRECT_CALL_INET_1 in __tcp_splice_read. v3: - merge the two squash-to patches. - use sk->sk_rcvbuf instead of INT_MAX as the max len in mptcp_read_sock(). - add splice io mode for mptcp_connect and drop mptcp_splice.c test. - the splice test for packetdrill is also added here: https://github.com/multipath-tcp/packetdrill/pull/162 v2: - set splice_read of mptcp - add a splice selftest. I have good news! I recently added MPTCP support to "NVME over TCP". And my RFC patches are under review by NVME maintainer Hannes. Replacing "NVME over TCP" with MPTCP is very simple. I used IPPROTO_MPTCP instead of IPPROTO_TCP to create MPTCP sockets on both target and host sides, these sockets are created in Kernel space. nvmet_tcp_add_port: ret = sock_create(port->addr.ss_family, SOCK_STREAM, IPPROTO_MPTCP, &port->sock); nvme_tcp_alloc_queue: ret = sock_create_kern(current->nsproxy->net_ns, ctrl->addr.ss_family, SOCK_STREAM, IPPROTO_MPTCP, &queue->sock); nvme_tcp_try_recv() needs to call .read_sock interface of struct proto_ops, but it is not implemented in MPTCP. So I implemented it with reference to __mptcp_recvmsg_mskq(). Since the NVME part patches are still under reviewing, I only send the MPTCP part patches in this set to MPTCP ML for your opinions. Geliang Tang (9): mptcp: add eat_recv_skb helper mptcp: implement .read_sock tcp: drop release and lock again in splice_read tcp: export splice_state and splice_data_recv tcp: add recv_should_stop helper mptcp: use recv_should_stop helper mptcp: implement .splice_read selftests: mptcp: add splice io mode selftests: mptcp: connect: cover splice mode include/net/tcp.h | 35 +++ net/ipv4/tcp.c | 90 ++------ net/mptcp/protocol.c | 204 +++++++++++++++--- tools/testing/selftests/net/mptcp/Makefile | 2 +- .../selftests/net/mptcp/mptcp_connect.c | 63 +++++- .../net/mptcp/mptcp_connect_splice.sh | 4 + 6 files changed, 288 insertions(+), 110 deletions(-) create mode 100755 tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh -- 2.48.1
From: Geliang Tang <tanggeliang@kylinos.cn> This patch extracts the free skb related code in __mptcp_recvmsg_mskq() into a new helper mptcp_eat_recv_skb(). Use sk_eat_skb() in this helper instead of open-coding it. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- net/mptcp/protocol.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -XXX,XX +XXX,XX @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied); +static void mptcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb) +{ + /* avoid the indirect call, we know the destructor is sock_wfree */ + skb->destructor = NULL; + atomic_sub(skb->truesize, &sk->sk_rmem_alloc); + sk_mem_uncharge(sk, skb->truesize); + sk_eat_skb(sk, skb); +} + static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg, size_t len, int flags, @@ -XXX,XX +XXX,XX @@ static int __mptcp_recvmsg_mskq(struct sock *sk, } if (!(flags & MSG_PEEK)) { - /* avoid the indirect call, we know the destructor is sock_wfree */ - skb->destructor = NULL; - atomic_sub(skb->truesize, &sk->sk_rmem_alloc); - sk_mem_uncharge(sk, skb->truesize); - __skb_unlink(skb, &sk->sk_receive_queue); - __kfree_skb(skb); + mptcp_eat_recv_skb(sk, skb); msk->bytes_consumed += count; } -- 2.48.1
From: Geliang Tang <tanggeliang@kylinos.cn> nvme_tcp_try_recv() needs to call .read_sock interface of struct proto_ops, but it's not implemented in MPTCP. This patch implements it with reference to __tcp_read_sock() and __mptcp_recvmsg_mskq(). Corresponding to tcp_recv_skb(), a new helper for MPTCP named mptcp_recv_skb() is added to peek a skb from sk->sk_receive_queue. Compared with __mptcp_recvmsg_mskq(), mptcp_read_sock() uses sk->sk_rcvbuf as the max read length. The LISTEN status is checked before the while loop, and mptcp_recv_skb() and mptcp_cleanup_rbuf() are invoked after the loop. In the loop, all flags checks for __mptcp_recvmsg_mskq() are removed. Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- net/mptcp/protocol.c | 62 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 62 insertions(+) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -XXX,XX +XXX,XX @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock, return mask; } +static struct sk_buff *mptcp_recv_skb(struct sock *sk) +{ + if (skb_queue_empty(&sk->sk_receive_queue)) + __mptcp_move_skbs(sk); + return skb_peek(&sk->sk_receive_queue); +} + +/* + * Note: + * - It is assumed that the socket was locked by the caller. + */ +static int mptcp_read_sock(struct sock *sk, read_descriptor_t *desc, + sk_read_actor_t recv_actor) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + size_t len = sk->sk_rcvbuf; + struct sk_buff *skb; + int copied = 0; + + if (sk->sk_state == TCP_LISTEN) + return -ENOTCONN; + while ((skb = mptcp_recv_skb(sk)) != NULL) { + u32 offset = MPTCP_SKB_CB(skb)->offset; + u32 data_len = skb->len - offset; + u32 size = min_t(size_t, len - copied, data_len); + int count; + + count = recv_actor(desc, skb, offset, size); + if (count <= 0) { + if (!copied) + copied = count; + break; + } + + copied += count; + + if (count < data_len) { + MPTCP_SKB_CB(skb)->offset += count; + MPTCP_SKB_CB(skb)->map_seq += count; + msk->bytes_consumed += count; + break; + } + + mptcp_eat_recv_skb(sk, skb); + msk->bytes_consumed += count; + + if (copied >= len) + break; + } + + mptcp_rcv_space_adjust(msk, copied); + + if (copied > 0) { + mptcp_recv_skb(sk); + mptcp_cleanup_rbuf(msk, copied); + } + + return copied; +} + static const struct proto_ops mptcp_stream_ops = { .family = PF_INET, .owner = THIS_MODULE, @@ -XXX,XX +XXX,XX @@ static const struct proto_ops mptcp_stream_ops = { .recvmsg = inet_recvmsg, .mmap = sock_no_mmap, .set_rcvlowat = mptcp_set_rcvlowat, + .read_sock = mptcp_read_sock, }; static struct inet_protosw mptcp_protosw = { @@ -XXX,XX +XXX,XX @@ static const struct proto_ops mptcp_v6_stream_ops = { .compat_ioctl = inet6_compat_ioctl, #endif .set_rcvlowat = mptcp_set_rcvlowat, + .read_sock = mptcp_read_sock, }; static struct proto mptcp_v6_prot; -- 2.48.1
From: Geliang Tang <tanggeliang@kylinos.cn> In the main loop of tcp_splice_read(), release_sock() is invoked to release the socket lock, and then lock_sock() is immediately invoked to hold the socket lock again. It looks like these two calls are useless, so let's drop them. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- net/ipv4/tcp.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index XXXXXXX..XXXXXXX 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -XXX,XX +XXX,XX @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, if (!tss.len || !timeo) break; - release_sock(sk); - lock_sock(sk); if (sk->sk_err || sk->sk_state == TCP_CLOSE || (sk->sk_shutdown & RCV_SHUTDOWN) || -- 2.48.1
From: Geliang Tang <tanggeliang@kylinos.cn> Export struct tcp_splice_state and tcp_splice_data_recv() in net/tcp.h so that they can be used by MPTCP. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- include/net/tcp.h | 12 ++++++++++++ net/ipv4/tcp.c | 13 ++----------- 2 files changed, 14 insertions(+), 11 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index XXXXXXX..XXXXXXX 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -XXX,XX +XXX,XX @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX); */ #define TFO_SERVER_WO_SOCKOPT1 0x400 +/* + * TCP splice context + */ +struct tcp_splice_state { + struct pipe_inode_info *pipe; + size_t len; + unsigned int flags; +}; + +int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, + unsigned int offset, size_t len); + /* sysctl variables for tcp */ extern int sysctl_tcp_max_orphans; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index XXXXXXX..XXXXXXX 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -XXX,XX +XXX,XX @@ EXPORT_SYMBOL(tcp_have_smc); struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp; EXPORT_IPV6_MOD(tcp_sockets_allocated); -/* - * TCP splice context - */ -struct tcp_splice_state { - struct pipe_inode_info *pipe; - size_t len; - unsigned int flags; -}; - /* * Pressure flag: try to collapse. * Technical note: it is used by multiple contexts non atomically. @@ -XXX,XX +XXX,XX @@ void tcp_push(struct sock *sk, int flags, int mss_now, __tcp_push_pending_frames(sk, mss_now, nonagle); } -static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, - unsigned int offset, size_t len) +int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, + unsigned int offset, size_t len) { struct tcp_splice_state *tss = rd_desc->arg.data; int ret; -- 2.48.1
From: Geliang Tang <tanggeliang@kylinos.cn> Factor out a new helper tcp_recv_should_stop() from tcp_recvmsg_locked() and tcp_splice_read() to check whether to stop receiving. It will be used for MPTCP too. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- include/net/tcp.h | 23 +++++++++++++++ net/ipv4/tcp.c | 75 ++++++++--------------------------------------- 2 files changed, 35 insertions(+), 63 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index XXXXXXX..XXXXXXX 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -XXX,XX +XXX,XX @@ struct tcp_splice_state { int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len); +static inline int tcp_recv_should_stop(struct sock *sk, long timeo) +{ + if (sock_flag(sk, SOCK_DONE)) + return -ENETDOWN; + + if (sk->sk_err) + return sock_error(sk); + + if (sk->sk_shutdown & RCV_SHUTDOWN) + return -ESHUTDOWN; + + if (sk->sk_state == TCP_CLOSE) + return -ENOTCONN; + + if (!timeo) + return -EAGAIN; + + if (signal_pending(current)) + return sock_intr_errno(timeo); + + return 0; +} + /* sysctl variables for tcp */ extern int sysctl_tcp_max_orphans; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index XXXXXXX..XXXXXXX 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -XXX,XX +XXX,XX @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, }; long timeo; ssize_t spliced; - int ret; + int ret, err; sock_rps_record_flow(sk); /* @@ -XXX,XX +XXX,XX @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, else if (!ret) { if (spliced) break; - if (sock_flag(sk, SOCK_DONE)) - break; - if (sk->sk_err) { - ret = sock_error(sk); - break; - } - if (sk->sk_shutdown & RCV_SHUTDOWN) - break; - if (sk->sk_state == TCP_CLOSE) { - /* - * This occurs when user tries to read - * from never connected socket. - */ - ret = -ENOTCONN; - break; - } - if (!timeo) { - ret = -EAGAIN; + err = tcp_recv_should_stop(sk, timeo); + if (err < 0) { + if (err != -ESHUTDOWN && err != -ENETDOWN) + ret = err; break; } /* if __tcp_splice_read() got nothing while we have @@ -XXX,XX +XXX,XX @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, ret = sk_wait_data(sk, &timeo, NULL); if (ret < 0) break; - if (signal_pending(current)) { - ret = sock_intr_errno(timeo); - break; - } continue; } tss.len -= ret; spliced += ret; - if (!tss.len || !timeo) + if (!tss.len) break; - if (sk->sk_err || sk->sk_state == TCP_CLOSE || - (sk->sk_shutdown & RCV_SHUTDOWN) || - signal_pending(current)) + if (tcp_recv_should_stop(sk, timeo)) break; } @@ -XXX,XX +XXX,XX @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, if (copied >= target && !READ_ONCE(sk->sk_backlog.tail)) break; - if (copied) { - if (!timeo || - sk->sk_err || - sk->sk_state == TCP_CLOSE || - (sk->sk_shutdown & RCV_SHUTDOWN) || - signal_pending(current)) - break; - } else { - if (sock_flag(sk, SOCK_DONE)) - break; - - if (sk->sk_err) { - copied = sock_error(sk); - break; - } - - if (sk->sk_shutdown & RCV_SHUTDOWN) - break; - - if (sk->sk_state == TCP_CLOSE) { - /* This occurs when user tries to read - * from never connected socket. - */ - copied = -ENOTCONN; - break; - } - - if (!timeo) { - copied = -EAGAIN; - break; - } - - if (signal_pending(current)) { - copied = sock_intr_errno(timeo); - break; - } + err = tcp_recv_should_stop(sk, timeo); + if (err < 0) { + if (!copied && err != -ESHUTDOWN && err != -ENETDOWN) + copied = err; + break; } if (copied >= target) { -- 2.48.1
From: Geliang Tang <tanggeliang@kylinos.cn> Use the newly added tcp_recv_should_stop() helper in mptcp_recvmsg() to check whether to stop receiving. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- net/mptcp/protocol.c | 32 ++++++-------------------------- 1 file changed, 6 insertions(+), 26 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -XXX,XX +XXX,XX @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, if (copied >= target) break; - if (copied) { - if (sk->sk_err || - sk->sk_state == TCP_CLOSE || - (sk->sk_shutdown & RCV_SHUTDOWN) || - !timeo || - signal_pending(current)) - break; - } else { - if (sk->sk_err) { - copied = sock_error(sk); + err = tcp_recv_should_stop(sk, timeo); + if (err < 0) { + if (copied) break; - } - if (sk->sk_shutdown & RCV_SHUTDOWN) { + if (err == -ESHUTDOWN) { /* race breaker: the shutdown could be after the * previous receive queue check */ @@ -XXX,XX +XXX,XX @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, break; } - if (sk->sk_state == TCP_CLOSE) { - copied = -ENOTCONN; - break; - } - - if (!timeo) { - copied = -EAGAIN; - break; - } - - if (signal_pending(current)) { - copied = sock_intr_errno(timeo); - break; - } + copied = err; + break; } pr_debug("block timeout %ld\n", timeo); -- 2.48.1
From: Geliang Tang <tanggeliang@kylinos.cn> This patch implements .splice_read interface of mptcp struct proto_ops as mptcp_splice_read() with reference to tcp_splice_read(). Corresponding to __tcp_splice_read(), __mptcp_splice_read() is defined, invoking mptcp_read_sock() instead of tcp_read_sock(). mptcp_splice_read() is almost the same as tcp_splice_read(), except for sock_rps_record_flow() and __mptcp_move_skbs(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- net/mptcp/protocol.c | 94 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 94 insertions(+) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -XXX,XX +XXX,XX @@ static int mptcp_read_sock(struct sock *sk, read_descriptor_t *desc, return copied; } +static int __mptcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) +{ + /* Store TCP splice context information in read_descriptor_t. */ + read_descriptor_t rd_desc = { + .arg.data = tss, + .count = tss->len, + }; + + return mptcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); +} + +/** + * mptcp_splice_read - splice data from MPTCP socket to a pipe + * @sock: socket to splice from + * @ppos: position (not valid) + * @pipe: pipe to splice to + * @len: number of bytes to splice + * @flags: splice modifier flags + * + * Description: + * Will read pages from given socket and fill them into a pipe. + * + **/ +static ssize_t mptcp_splice_read(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags) +{ + struct tcp_splice_state tss = { + .pipe = pipe, + .len = len, + .flags = flags, + }; + struct sock *sk = sock->sk; + ssize_t spliced = 0; + int ret = 0, err; + long timeo; + + /* + * We can't seek on a socket input + */ + if (unlikely(*ppos)) + return -ESPIPE; + + lock_sock(sk); + + timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK); + while (tss.len) { + ret = __mptcp_splice_read(sk, &tss); + if (ret < 0) { + break; + } else if (!ret) { + if (spliced) + break; + err = tcp_recv_should_stop(sk, timeo); + if (err < 0) { + if (err == -ESHUTDOWN) { + if (__mptcp_move_skbs(sk)) + continue; + break; + } + ret = err; + break; + } + /* if __mptcp_splice_read() got nothing while we have + * an skb in receive queue, we do not want to loop. + * This might happen with URG data. + */ + if (!skb_queue_empty(&sk->sk_receive_queue)) + break; + ret = sk_wait_data(sk, &timeo, NULL); + if (ret < 0) + break; + continue; + } + tss.len -= ret; + spliced += ret; + + if (!tss.len) + break; + + if (tcp_recv_should_stop(sk, timeo)) + break; + } + + release_sock(sk); + + if (spliced) + return spliced; + + return ret; +} + static const struct proto_ops mptcp_stream_ops = { .family = PF_INET, .owner = THIS_MODULE, @@ -XXX,XX +XXX,XX @@ static const struct proto_ops mptcp_stream_ops = { .mmap = sock_no_mmap, .set_rcvlowat = mptcp_set_rcvlowat, .read_sock = mptcp_read_sock, + .splice_read = mptcp_splice_read, }; static struct inet_protosw mptcp_protosw = { @@ -XXX,XX +XXX,XX @@ static const struct proto_ops mptcp_v6_stream_ops = { #endif .set_rcvlowat = mptcp_set_rcvlowat, .read_sock = mptcp_read_sock, + .splice_read = mptcp_splice_read, }; static struct proto mptcp_v6_prot; -- 2.48.1
From: Geliang Tang <tanggeliang@kylinos.cn> This patch adds a new 'splice' io mode for mptcp_connect to test the newly added read_sock() and splice_read() functions of MPTCP. do_splice() efficiently transfers data directly between two file descriptors (infd and outfd) without copying to userspace, using Linux's splice() system call. Usage: ./mptcp_connect.sh -m splice Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- .../selftests/net/mptcp/mptcp_connect.c | 63 ++++++++++++++++++- 1 file changed, 62 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.c b/tools/testing/selftests/net/mptcp/mptcp_connect.c index XXXXXXX..XXXXXXX 100644 --- a/tools/testing/selftests/net/mptcp/mptcp_connect.c +++ b/tools/testing/selftests/net/mptcp/mptcp_connect.c @@ -XXX,XX +XXX,XX @@ enum cfg_mode { CFG_MODE_POLL, CFG_MODE_MMAP, CFG_MODE_SENDFILE, + CFG_MODE_SPLICE, }; enum cfg_peek { @@ -XXX,XX +XXX,XX @@ static void die_usage(void) fprintf(stderr, "\t-j -- add additional sleep at connection start and tear down " "-- for MPJ tests\n"); fprintf(stderr, "\t-l -- listens mode, accepts incoming connection\n"); - fprintf(stderr, "\t-m [poll|mmap|sendfile] -- use poll(default)/mmap+write/sendfile\n"); + fprintf(stderr, "\t-m [poll|mmap|sendfile|splice] -- use poll(default)/mmap+write/sendfile/splice\n"); fprintf(stderr, "\t-M mark -- set socket packet mark\n"); fprintf(stderr, "\t-o option -- test sockopt <option>\n"); fprintf(stderr, "\t-p num -- use port num\n"); @@ -XXX,XX +XXX,XX @@ static int copyfd_io_sendfile(int infd, int peerfd, int outfd, return err; } +static int do_splice(const int infd, const int outfd, const size_t len, + struct wstate *winfo) +{ + int pipefd[2]; + ssize_t bytes; + int err; + + err = pipe(pipefd); + if (err) + return err; + + while ((bytes = splice(infd, NULL, pipefd[1], NULL, + len - winfo->total_len, + SPLICE_F_MOVE | SPLICE_F_MORE)) > 0) { + splice(pipefd[0], NULL, outfd, NULL, bytes, + SPLICE_F_MOVE | SPLICE_F_MORE); + } + + close(pipefd[0]); + close(pipefd[1]); + + return 0; +} + +static int copyfd_io_splice(int infd, int peerfd, int outfd, unsigned int size, + bool *in_closed_after_out, struct wstate *winfo) +{ + int err; + + if (listen_mode) { + err = do_splice(peerfd, outfd, size, winfo); + if (err) + return err; + + err = do_splice(infd, peerfd, size, winfo); + } else { + err = do_splice(infd, peerfd, size, winfo); + if (err) + return err; + + shut_wr(peerfd); + + err = do_splice(peerfd, outfd, size, winfo); + *in_closed_after_out = true; + } + + return err; +} + static int copyfd_io(int infd, int peerfd, int outfd, bool close_peerfd, struct wstate *winfo) { bool in_closed_after_out = false; @@ -XXX,XX +XXX,XX @@ static int copyfd_io(int infd, int peerfd, int outfd, bool close_peerfd, struct &in_closed_after_out, winfo); break; + case CFG_MODE_SPLICE: + file_size = get_infd_size(infd); + if (file_size < 0) + return file_size; + ret = copyfd_io_splice(infd, peerfd, outfd, file_size, + &in_closed_after_out, winfo); + break; + default: fprintf(stderr, "Invalid mode %d\n", cfg_mode); @@ -XXX,XX +XXX,XX @@ int parse_mode(const char *mode) return CFG_MODE_MMAP; if (!strcasecmp(mode, "sendfile")) return CFG_MODE_SENDFILE; + if (!strcasecmp(mode, "splice")) + return CFG_MODE_SPLICE; fprintf(stderr, "Unknown test mode: %s\n", mode); fprintf(stderr, "Supported modes are:\n"); fprintf(stderr, "\t\t\"poll\" - interleaved read/write using poll()\n"); fprintf(stderr, "\t\t\"mmap\" - send entire input file (mmap+write), then read response (-l will read input first)\n"); fprintf(stderr, "\t\t\"sendfile\" - send entire input file (sendfile), then read response (-l will read input first)\n"); + fprintf(stderr, "\t\t\"splice\" - send entire input file (splice), then read response (-l will read input first)\n"); die_usage(); -- 2.48.1
From: Geliang Tang <tanggeliang@kylinos.cn> The "splice" alternate mode for mptcp_connect.sh/.c is available now, this patch adds mptcp_connect_splice.sh to test it in the MPTCP CI by default. Suggested-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- tools/testing/selftests/net/mptcp/Makefile | 2 +- tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh | 4 ++++ 2 files changed, 5 insertions(+), 1 deletion(-) create mode 100755 tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh diff --git a/tools/testing/selftests/net/mptcp/Makefile b/tools/testing/selftests/net/mptcp/Makefile index XXXXXXX..XXXXXXX 100644 --- a/tools/testing/selftests/net/mptcp/Makefile +++ b/tools/testing/selftests/net/mptcp/Makefile @@ -XXX,XX +XXX,XX @@ top_srcdir = ../../../../.. CFLAGS += -Wall -Wl,--no-as-needed -O2 -g -I$(top_srcdir)/usr/include $(KHDR_INCLUDES) TEST_PROGS := mptcp_connect.sh mptcp_connect_mmap.sh mptcp_connect_sendfile.sh \ - mptcp_connect_checksum.sh pm_netlink.sh mptcp_join.sh diag.sh \ + mptcp_connect_splice.sh mptcp_connect_checksum.sh pm_netlink.sh mptcp_join.sh diag.sh \ simult_flows.sh mptcp_sockopt.sh userspace_pm.sh TEST_GEN_FILES = mptcp_connect pm_nl_ctl mptcp_sockopt mptcp_inq mptcp_diag diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh b/tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh new file mode 100755 index XXXXXXX..XXXXXXX --- /dev/null +++ b/tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh @@ -XXX,XX +XXX,XX @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +"$(dirname "${0}")/mptcp_connect.sh" -m splice "${@}" -- 2.48.1
From: Geliang Tang <tanggeliang@kylinos.cn> v12: - rebase on "mptcp: receive path improvement" 1-7. - some cleanups. v11: - drop "tcp: drop release and lock again in splice_read", and add this release and lock again in mptcp_splice_read too. (Thanks Mat, I didn't understand the intent of this code before.) - call mptcp_rps_record_subflows() in mptcp_splice_read as Mat suggested. v10: - add an offset parameter for mptcp_recv_skb and make it more like tcp_recv_skb. - Link: https://patchwork.kernel.org/project/mptcp/cover/cover.1756780274.git.tanggeliang@kylinos.cn/ v9: - merge the squash-to patches. - a new patch "drop release and lock again in splice_read". - Link: https://patchwork.kernel.org/project/mptcp/cover/cover.1752399660.git.tanggeliang@kylinos.cn/ v8: - export struct tcp_splice_state and tcp_splice_data_recv() in net/tcp.h. - add a new helper mptcp_recv_should_stop. - add mptcp_connect_splice.sh. - update commit logs. v7: - only patch 1 and patch 2 changed. - add a new helper mptcp_eat_recv_skb. - invoke skb_peek in mptcp_recv_skb(). - use while ((skb = mptcp_recv_skb(sk)) != NULL) instead of skb_queue_walk_safe(&sk->sk_receive_queue, skb, tmp). v6: - address Paolo's comments for v4, v5 (thanks) v5: - extract the common code of __mptcp_recvmsg_mskq() and mptcp_read_sock() into a new helper __mptcp_recvmsg_desc() to reduce duplication code. v4: - v3 doesn't work for MPTCP fallback tests in mptcp_connect.sh, this set fix it. - invoke __mptcp_move_skbs in mptcp_read_sock. - use INDIRECT_CALL_INET_1 in __tcp_splice_read. v3: - merge the two squash-to patches. - use sk->sk_rcvbuf instead of INT_MAX as the max len in mptcp_read_sock(). - add splice io mode for mptcp_connect and drop mptcp_splice.c test. - the splice test for packetdrill is also added here: https://github.com/multipath-tcp/packetdrill/pull/162 v2: - set splice_read of mptcp - add a splice selftest. I have good news! I recently added MPTCP support to "NVME over TCP". And my RFC patches are under review by NVME maintainer Hannes. Replacing "NVME over TCP" with MPTCP is very simple. I used IPPROTO_MPTCP instead of IPPROTO_TCP to create MPTCP sockets on both target and host sides, these sockets are created in Kernel space. nvmet_tcp_add_port: ret = sock_create(port->addr.ss_family, SOCK_STREAM, IPPROTO_MPTCP, &port->sock); nvme_tcp_alloc_queue: ret = sock_create_kern(current->nsproxy->net_ns, ctrl->addr.ss_family, SOCK_STREAM, IPPROTO_MPTCP, &queue->sock); nvme_tcp_try_recv() needs to call .read_sock interface of struct proto_ops, but it is not implemented in MPTCP. So I implemented it with reference to __mptcp_recvmsg_mskq(). Since the NVME part patches are still under reviewing, I only send the MPTCP part patches in this set to MPTCP ML for your opinions. Geliang Tang (8): mptcp: add eat_recv_skb helper mptcp: implement .read_sock tcp: export splice_state and splice_data_recv tcp: add recv_should_stop helper mptcp: use recv_should_stop helper mptcp: implement .splice_read selftests: mptcp: add splice io mode selftests: mptcp: connect: cover splice mode include/net/tcp.h | 35 +++ net/ipv4/tcp.c | 86 ++----- net/mptcp/protocol.c | 224 +++++++++++++++--- tools/testing/selftests/net/mptcp/Makefile | 1 + .../selftests/net/mptcp/mptcp_connect.c | 63 ++++- .../net/mptcp/mptcp_connect_splice.sh | 5 + 6 files changed, 304 insertions(+), 110 deletions(-) create mode 100755 tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh -- 2.43.0
From: Geliang Tang <tanggeliang@kylinos.cn> This patch extracts the free skb related code in __mptcp_recvmsg_mskq() into a new helper mptcp_eat_recv_skb(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- net/mptcp/protocol.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -XXX,XX +XXX,XX @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied); +static void mptcp_eat_recv_skb(struct sock *sk, struct sk_buff *skb) +{ + /* avoid the indirect call, we know the destructor is sock_rfree */ + skb->destructor = NULL; + skb->sk = NULL; + atomic_sub(skb->truesize, &sk->sk_rmem_alloc); + sk_mem_uncharge(sk, skb->truesize); + __skb_unlink(skb, &sk->sk_receive_queue); + skb_attempt_defer_free(skb); +} + static int __mptcp_recvmsg_mskq(struct sock *sk, struct msghdr *msg, size_t len, int flags, @@ -XXX,XX +XXX,XX @@ static int __mptcp_recvmsg_mskq(struct sock *sk, } if (!(flags & MSG_PEEK)) { - /* avoid the indirect call, we know the destructor is sock_rfree */ - skb->destructor = NULL; - skb->sk = NULL; - atomic_sub(skb->truesize, &sk->sk_rmem_alloc); - sk_mem_uncharge(sk, skb->truesize); - __skb_unlink(skb, &sk->sk_receive_queue); - skb_attempt_defer_free(skb); + mptcp_eat_recv_skb(sk, skb); msk->bytes_consumed += count; } -- 2.43.0
From: Geliang Tang <tanggeliang@kylinos.cn> nvme_tcp_try_recv() needs to call .read_sock interface of struct proto_ops, but it's not implemented in MPTCP. This patch implements it with reference to __tcp_read_sock() and __mptcp_recvmsg_mskq(). Corresponding to tcp_recv_skb(), a new helper for MPTCP named mptcp_recv_skb() is added to peek a skb from sk->sk_receive_queue. Compared with __mptcp_recvmsg_mskq(), mptcp_read_sock() uses sk->sk_rcvbuf as the max read length. The LISTEN status is checked before the while loop, and mptcp_recv_skb() and mptcp_cleanup_rbuf() are invoked after the loop. In the loop, all flags checks for __mptcp_recvmsg_mskq() are removed. Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- net/mptcp/protocol.c | 74 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 74 insertions(+) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -XXX,XX +XXX,XX @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock, return mask; } +static struct sk_buff *mptcp_recv_skb(struct sock *sk, u32 *off) +{ + struct sk_buff *skb; + u32 offset; + + if (skb_queue_empty(&sk->sk_receive_queue)) + __mptcp_move_skbs(sk); + + while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) { + offset = MPTCP_SKB_CB(skb)->offset; + if (offset < skb->len) { + *off = offset; + return skb; + } + mptcp_eat_recv_skb(sk, skb); + } + return NULL; +} + +/* + * Note: + * - It is assumed that the socket was locked by the caller. + */ +static int mptcp_read_sock(struct sock *sk, read_descriptor_t *desc, + sk_read_actor_t recv_actor) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + size_t len = sk->sk_rcvbuf; + struct sk_buff *skb; + int copied = 0; + u32 offset; + + if (sk->sk_state == TCP_LISTEN) + return -ENOTCONN; + while ((skb = mptcp_recv_skb(sk, &offset)) != NULL) { + u32 data_len = skb->len - offset; + u32 size = min_t(size_t, len - copied, data_len); + int count; + + count = recv_actor(desc, skb, offset, size); + if (count <= 0) { + if (!copied) + copied = count; + break; + } + + copied += count; + + if (count < data_len) { + MPTCP_SKB_CB(skb)->offset += count; + MPTCP_SKB_CB(skb)->map_seq += count; + msk->bytes_consumed += count; + break; + } + + mptcp_eat_recv_skb(sk, skb); + msk->bytes_consumed += count; + + if (copied >= len) + break; + } + + mptcp_rcv_space_adjust(msk, copied); + + if (copied > 0) { + mptcp_recv_skb(sk, &offset); + mptcp_cleanup_rbuf(msk, copied); + } + + return copied; +} + static const struct proto_ops mptcp_stream_ops = { .family = PF_INET, .owner = THIS_MODULE, @@ -XXX,XX +XXX,XX @@ static const struct proto_ops mptcp_stream_ops = { .recvmsg = inet_recvmsg, .mmap = sock_no_mmap, .set_rcvlowat = mptcp_set_rcvlowat, + .read_sock = mptcp_read_sock, }; static struct inet_protosw mptcp_protosw = { @@ -XXX,XX +XXX,XX @@ static const struct proto_ops mptcp_v6_stream_ops = { .compat_ioctl = inet6_compat_ioctl, #endif .set_rcvlowat = mptcp_set_rcvlowat, + .read_sock = mptcp_read_sock, }; static struct proto mptcp_v6_prot; -- 2.43.0
From: Geliang Tang <tanggeliang@kylinos.cn> Export struct tcp_splice_state and tcp_splice_data_recv() in net/tcp.h so that they can be used by MPTCP. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- include/net/tcp.h | 12 ++++++++++++ net/ipv4/tcp.c | 13 ++----------- 2 files changed, 14 insertions(+), 11 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index XXXXXXX..XXXXXXX 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -XXX,XX +XXX,XX @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX); */ #define TFO_SERVER_WO_SOCKOPT1 0x400 +/* + * TCP splice context + */ +struct tcp_splice_state { + struct pipe_inode_info *pipe; + size_t len; + unsigned int flags; +}; + +int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, + unsigned int offset, size_t len); + /* sysctl variables for tcp */ extern int sysctl_tcp_max_orphans; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index XXXXXXX..XXXXXXX 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -XXX,XX +XXX,XX @@ EXPORT_SYMBOL(tcp_have_smc); struct percpu_counter tcp_sockets_allocated ____cacheline_aligned_in_smp; EXPORT_IPV6_MOD(tcp_sockets_allocated); -/* - * TCP splice context - */ -struct tcp_splice_state { - struct pipe_inode_info *pipe; - size_t len; - unsigned int flags; -}; - /* * Pressure flag: try to collapse. * Technical note: it is used by multiple contexts non atomically. @@ -XXX,XX +XXX,XX @@ void tcp_push(struct sock *sk, int flags, int mss_now, __tcp_push_pending_frames(sk, mss_now, nonagle); } -static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, - unsigned int offset, size_t len) +int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, + unsigned int offset, size_t len) { struct tcp_splice_state *tss = rd_desc->arg.data; int ret; -- 2.43.0
From: Geliang Tang <tanggeliang@kylinos.cn> Factor out a new helper tcp_recv_should_stop() from tcp_recvmsg_locked() and tcp_splice_read() to check whether to stop receiving. It will be used for MPTCP too. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- include/net/tcp.h | 23 +++++++++++++++ net/ipv4/tcp.c | 73 ++++++++--------------------------------------- 2 files changed, 35 insertions(+), 61 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index XXXXXXX..XXXXXXX 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -XXX,XX +XXX,XX @@ struct tcp_splice_state { int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len); +static inline int tcp_recv_should_stop(struct sock *sk, long timeo) +{ + if (sock_flag(sk, SOCK_DONE)) + return -ENETDOWN; + + if (sk->sk_err) + return sock_error(sk); + + if (sk->sk_shutdown & RCV_SHUTDOWN) + return -ESHUTDOWN; + + if (sk->sk_state == TCP_CLOSE) + return -ENOTCONN; + + if (!timeo) + return -EAGAIN; + + if (signal_pending(current)) + return sock_intr_errno(timeo); + + return 0; +} + /* sysctl variables for tcp */ extern int sysctl_tcp_max_orphans; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index XXXXXXX..XXXXXXX 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -XXX,XX +XXX,XX @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, if (ret < 0) break; else if (!ret) { + int err; + if (spliced) break; - if (sock_flag(sk, SOCK_DONE)) - break; - if (sk->sk_err) { - ret = sock_error(sk); - break; - } - if (sk->sk_shutdown & RCV_SHUTDOWN) - break; - if (sk->sk_state == TCP_CLOSE) { - /* - * This occurs when user tries to read - * from never connected socket. - */ - ret = -ENOTCONN; - break; - } - if (!timeo) { - ret = -EAGAIN; + err = tcp_recv_should_stop(sk, timeo); + if (err < 0) { + if (err != -ENETDOWN && err != -ESHUTDOWN) + ret = err; break; } /* if __tcp_splice_read() got nothing while we have @@ -XXX,XX +XXX,XX @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, ret = sk_wait_data(sk, &timeo, NULL); if (ret < 0) break; - if (signal_pending(current)) { - ret = sock_intr_errno(timeo); - break; - } continue; } tss.len -= ret; @@ -XXX,XX +XXX,XX @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, release_sock(sk); lock_sock(sk); - if (sk->sk_err || sk->sk_state == TCP_CLOSE || - (sk->sk_shutdown & RCV_SHUTDOWN) || - signal_pending(current)) + if (tcp_recv_should_stop(sk, timeo)) break; } @@ -XXX,XX +XXX,XX @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, if (copied >= target && !READ_ONCE(sk->sk_backlog.tail)) break; - if (copied) { - if (!timeo || - sk->sk_err || - sk->sk_state == TCP_CLOSE || - (sk->sk_shutdown & RCV_SHUTDOWN) || - signal_pending(current)) - break; - } else { - if (sock_flag(sk, SOCK_DONE)) - break; - - if (sk->sk_err) { - copied = sock_error(sk); - break; - } - - if (sk->sk_shutdown & RCV_SHUTDOWN) - break; - - if (sk->sk_state == TCP_CLOSE) { - /* This occurs when user tries to read - * from never connected socket. - */ - copied = -ENOTCONN; - break; - } - - if (!timeo) { - copied = -EAGAIN; - break; - } - - if (signal_pending(current)) { - copied = sock_intr_errno(timeo); - break; - } + err = tcp_recv_should_stop(sk, timeo); + if (err < 0) { + if (!copied && err != -ENETDOWN && err != -ESHUTDOWN) + copied = err; + break; } if (copied >= target) { -- 2.43.0
From: Geliang Tang <tanggeliang@kylinos.cn> Use the newly added tcp_recv_should_stop() helper in mptcp_recvmsg() to check whether to stop receiving. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- net/mptcp/protocol.c | 35 +++++------------------------------ 1 file changed, 5 insertions(+), 30 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -XXX,XX +XXX,XX @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, if (copied >= target) break; - if (copied) { - if (sk->sk_err || - sk->sk_state == TCP_CLOSE || - (sk->sk_shutdown & RCV_SHUTDOWN) || - !timeo || - signal_pending(current)) - break; - } else { - if (sk->sk_err) { - copied = sock_error(sk); - break; - } - - if (sk->sk_shutdown & RCV_SHUTDOWN) - break; - - if (sk->sk_state == TCP_CLOSE) { - copied = -ENOTCONN; - break; - } - - if (!timeo) { - copied = -EAGAIN; - break; - } - - if (signal_pending(current)) { - copied = sock_intr_errno(timeo); - break; - } + err = tcp_recv_should_stop(sk, timeo); + if (err < 0) { + if (!copied && err != -ESHUTDOWN) + copied = err; + break; } pr_debug("block timeout %ld\n", timeo); -- 2.43.0
From: Geliang Tang <tanggeliang@kylinos.cn> This patch implements .splice_read interface of mptcp struct proto_ops as mptcp_splice_read() with reference to tcp_splice_read(). Corresponding to __tcp_splice_read(), __mptcp_splice_read() is defined, invoking mptcp_read_sock() instead of tcp_read_sock(). mptcp_splice_read() is almost the same as tcp_splice_read(), except for sock_rps_record_flow(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- net/mptcp/protocol.c | 96 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index XXXXXXX..XXXXXXX 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -XXX,XX +XXX,XX @@ static int mptcp_read_sock(struct sock *sk, read_descriptor_t *desc, return copied; } +static int __mptcp_splice_read(struct sock *sk, struct tcp_splice_state *tss) +{ + /* Store TCP splice context information in read_descriptor_t. */ + read_descriptor_t rd_desc = { + .arg.data = tss, + .count = tss->len, + }; + + return mptcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); +} + +/** + * mptcp_splice_read - splice data from MPTCP socket to a pipe + * @sock: socket to splice from + * @ppos: position (not valid) + * @pipe: pipe to splice to + * @len: number of bytes to splice + * @flags: splice modifier flags + * + * Description: + * Will read pages from given socket and fill them into a pipe. + * + **/ +static ssize_t mptcp_splice_read(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags) +{ + struct tcp_splice_state tss = { + .pipe = pipe, + .len = len, + .flags = flags, + }; + struct sock *sk = sock->sk; + ssize_t spliced = 0; + int ret = 0; + long timeo; + + /* + * We can't seek on a socket input + */ + if (unlikely(*ppos)) + return -ESPIPE; + + lock_sock(sk); + + mptcp_rps_record_subflows(mptcp_sk(sk)); + + timeo = sock_rcvtimeo(sk, sock->file->f_flags & O_NONBLOCK); + while (tss.len) { + ret = __mptcp_splice_read(sk, &tss); + if (ret < 0) { + break; + } else if (!ret) { + int err; + + if (spliced) + break; + err = tcp_recv_should_stop(sk, timeo); + if (err < 0) { + if (err != -ESHUTDOWN) + ret = err; + break; + } + /* if __mptcp_splice_read() got nothing while we have + * an skb in receive queue, we do not want to loop. + * This might happen with URG data. + */ + if (!skb_queue_empty(&sk->sk_receive_queue)) + break; + ret = sk_wait_data(sk, &timeo, NULL); + if (ret < 0) + break; + continue; + } + tss.len -= ret; + spliced += ret; + + if (!tss.len || !timeo) + break; + release_sock(sk); + lock_sock(sk); + + if (tcp_recv_should_stop(sk, timeo)) + break; + } + + release_sock(sk); + + if (spliced) + return spliced; + + return ret; +} + static const struct proto_ops mptcp_stream_ops = { .family = PF_INET, .owner = THIS_MODULE, @@ -XXX,XX +XXX,XX @@ static const struct proto_ops mptcp_stream_ops = { .mmap = sock_no_mmap, .set_rcvlowat = mptcp_set_rcvlowat, .read_sock = mptcp_read_sock, + .splice_read = mptcp_splice_read, }; static struct inet_protosw mptcp_protosw = { @@ -XXX,XX +XXX,XX @@ static const struct proto_ops mptcp_v6_stream_ops = { #endif .set_rcvlowat = mptcp_set_rcvlowat, .read_sock = mptcp_read_sock, + .splice_read = mptcp_splice_read, }; static struct proto mptcp_v6_prot; -- 2.43.0
From: Geliang Tang <tanggeliang@kylinos.cn> This patch adds a new 'splice' io mode for mptcp_connect to test the newly added read_sock() and splice_read() functions of MPTCP. do_splice() efficiently transfers data directly between two file descriptors (infd and outfd) without copying to userspace, using Linux's splice() system call. Usage: ./mptcp_connect.sh -m splice Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- .../selftests/net/mptcp/mptcp_connect.c | 63 ++++++++++++++++++- 1 file changed, 62 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect.c b/tools/testing/selftests/net/mptcp/mptcp_connect.c index XXXXXXX..XXXXXXX 100644 --- a/tools/testing/selftests/net/mptcp/mptcp_connect.c +++ b/tools/testing/selftests/net/mptcp/mptcp_connect.c @@ -XXX,XX +XXX,XX @@ enum cfg_mode { CFG_MODE_POLL, CFG_MODE_MMAP, CFG_MODE_SENDFILE, + CFG_MODE_SPLICE, }; enum cfg_peek { @@ -XXX,XX +XXX,XX @@ static void die_usage(void) fprintf(stderr, "\t-j -- add additional sleep at connection start and tear down " "-- for MPJ tests\n"); fprintf(stderr, "\t-l -- listens mode, accepts incoming connection\n"); - fprintf(stderr, "\t-m [poll|mmap|sendfile] -- use poll(default)/mmap+write/sendfile\n"); + fprintf(stderr, "\t-m [poll|mmap|sendfile|splice] -- use poll(default)/mmap+write/sendfile/splice\n"); fprintf(stderr, "\t-M mark -- set socket packet mark\n"); fprintf(stderr, "\t-o option -- test sockopt <option>\n"); fprintf(stderr, "\t-p num -- use port num\n"); @@ -XXX,XX +XXX,XX @@ static int copyfd_io_sendfile(int infd, int peerfd, int outfd, return err; } +static int do_splice(const int infd, const int outfd, const size_t len, + struct wstate *winfo) +{ + int pipefd[2]; + ssize_t bytes; + int err; + + err = pipe(pipefd); + if (err) + return err; + + while ((bytes = splice(infd, NULL, pipefd[1], NULL, + len - winfo->total_len, + SPLICE_F_MOVE | SPLICE_F_MORE)) > 0) { + splice(pipefd[0], NULL, outfd, NULL, bytes, + SPLICE_F_MOVE | SPLICE_F_MORE); + } + + close(pipefd[0]); + close(pipefd[1]); + + return 0; +} + +static int copyfd_io_splice(int infd, int peerfd, int outfd, unsigned int size, + bool *in_closed_after_out, struct wstate *winfo) +{ + int err; + + if (listen_mode) { + err = do_splice(peerfd, outfd, size, winfo); + if (err) + return err; + + err = do_splice(infd, peerfd, size, winfo); + } else { + err = do_splice(infd, peerfd, size, winfo); + if (err) + return err; + + shut_wr(peerfd); + + err = do_splice(peerfd, outfd, size, winfo); + *in_closed_after_out = true; + } + + return err; +} + static int copyfd_io(int infd, int peerfd, int outfd, bool close_peerfd, struct wstate *winfo) { bool in_closed_after_out = false; @@ -XXX,XX +XXX,XX @@ static int copyfd_io(int infd, int peerfd, int outfd, bool close_peerfd, struct &in_closed_after_out, winfo); break; + case CFG_MODE_SPLICE: + file_size = get_infd_size(infd); + if (file_size < 0) + return file_size; + ret = copyfd_io_splice(infd, peerfd, outfd, file_size, + &in_closed_after_out, winfo); + break; + default: fprintf(stderr, "Invalid mode %d\n", cfg_mode); @@ -XXX,XX +XXX,XX @@ int parse_mode(const char *mode) return CFG_MODE_MMAP; if (!strcasecmp(mode, "sendfile")) return CFG_MODE_SENDFILE; + if (!strcasecmp(mode, "splice")) + return CFG_MODE_SPLICE; fprintf(stderr, "Unknown test mode: %s\n", mode); fprintf(stderr, "Supported modes are:\n"); fprintf(stderr, "\t\t\"poll\" - interleaved read/write using poll()\n"); fprintf(stderr, "\t\t\"mmap\" - send entire input file (mmap+write), then read response (-l will read input first)\n"); fprintf(stderr, "\t\t\"sendfile\" - send entire input file (sendfile), then read response (-l will read input first)\n"); + fprintf(stderr, "\t\t\"splice\" - send entire input file (splice), then read response (-l will read input first)\n"); die_usage(); -- 2.43.0
From: Geliang Tang <tanggeliang@kylinos.cn> The "splice" alternate mode for mptcp_connect.sh/.c is available now, this patch adds mptcp_connect_splice.sh to test it in the MPTCP CI by default. Suggested-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> --- tools/testing/selftests/net/mptcp/Makefile | 1 + tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh | 5 +++++ 2 files changed, 6 insertions(+) create mode 100755 tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh diff --git a/tools/testing/selftests/net/mptcp/Makefile b/tools/testing/selftests/net/mptcp/Makefile index XXXXXXX..XXXXXXX 100644 --- a/tools/testing/selftests/net/mptcp/Makefile +++ b/tools/testing/selftests/net/mptcp/Makefile @@ -XXX,XX +XXX,XX @@ top_srcdir = ../../../../.. CFLAGS += -Wall -Wl,--no-as-needed -O2 -g -I$(top_srcdir)/usr/include $(KHDR_INCLUDES) TEST_PROGS := mptcp_connect.sh mptcp_connect_mmap.sh mptcp_connect_sendfile.sh \ + mptcp_connect_splice.sh \ mptcp_connect_checksum.sh pm_netlink.sh mptcp_join.sh diag.sh \ simult_flows.sh mptcp_sockopt.sh userspace_pm.sh diff --git a/tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh b/tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh new file mode 100755 index XXXXXXX..XXXXXXX --- /dev/null +++ b/tools/testing/selftests/net/mptcp/mptcp_connect_splice.sh @@ -XXX,XX +XXX,XX @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 + +MPTCP_LIB_KSFT_TEST="$(basename "${0}" .sh)" \ + "$(dirname "${0}")/mptcp_connect.sh" -m splice "${@}" -- 2.43.0