[v2] netlink: introduce netlink poll to resolve fast return issue

[PATCH net-next v2] netlink: introduce netlink poll to resolve fast return issue

Posted by Jong eon Park 2 years, 2 months ago

In very rare cases, there was an issue where a user's 'poll' function
waiting for a uevent would continuously return very quickly, causing
excessive CPU usage due to the following scenario.

When sk_rmem_alloc exceeds sk_rcvbuf, netlink_broadcast_deliver returns an
error and netlink_overrun is called. However, if netlink_overrun was
called in a context just before a another context returns from the 'poll'
and 'recv' is invoked, emptying the rcv queue, sk->sk_err = ENOBUF is
written to the netlink socket belatedly and it enters the
NETLINK_S_CONGESTED state. If the user does not check for POLLERR, they
cannot consume and clean sk_err and repeatedly enter the situation where
they call 'poll' again but return immediately. Moreover, in this
situation, rcv queue is already empty and NETLINK_S_CONGESTED flag
prevents any more incoming packets. This makes it impossible for the user
to call 'recv'.

This "congested" situation is a bit ambiguous. The queue is empty, yet
'congested' remains. This means kernel can no longer deliver uevents
despite the empty queue, and it lead to the persistent 'congested' status.

------------CPU1 (kernel)----------  --------------CPU2 (app)--------------
...
a driver delivers uevent.            poll was waiting for schedule.
a driver delivers uevent.
a driver delivers uevent.
...
1) netlink_broadcast_deliver fails.
(sk_rmem_alloc > sk_rcvbuf)
                                      getting schedule and poll returns,
                                      and the app calls recv.
                                      (rcv queue is empied)
                                      2)

netlink_overrun is called.
(NETLINK_S_CONGESTED flag is set,
ENOBUF is written in sk_err and,
wake up poll.)
                                      finishing its job and call poll.
                                      poll returns POLLERR.

                                      (the app doesn't have POLLERR handler)
                                      it calls poll, but getting POLLERR.
                                      it calls poll, but getting POLLERR.
                                      it calls poll, but getting POLLERR.
                                      ...

To address this issue, I would like to introduce the following netlink
poll.

After calling the datagram_poll, netlink poll checks the
NETLINK_S_CONGESTED status and rcv queue, and this make the user to be
readable once more even if the user has already emptied rcv queue. This
allows the user to be able to consume sk->sk_err value through
netlink_recvmsg, thus the situation described above can be avoided.

Co-developed-by: Dong ha Kang <dongha7.kang@samsung.com>
Signed-off-by: Jong eon Park <jongeon.park@samsung.com>

---
v2:
 - Add more detailed explanation.
---
 net/netlink/af_netlink.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index eb086b06d60d..f08c10220041 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2002,6 +2002,20 @@ static int netlink_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 	return err ? : copied;
 }
 
+static __poll_t netlink_poll(struct file *file, struct socket *sock,
+			     poll_table *wait)
+{
+	__poll_t mask = datagram_poll(file, sock, wait);
+	struct sock *sk = sock->sk;
+	struct netlink_sock *nlk = nlk_sk(sk);
+
+	if (test_bit(NETLINK_S_CONGESTED, &nlk->state) &&
+	    skb_queue_empty_lockless(&sk->sk_receive_queue))
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
 static void netlink_data_ready(struct sock *sk)
 {
 	BUG();
@@ -2803,7 +2817,7 @@ static const struct proto_ops netlink_ops = {
 	.socketpair =	sock_no_socketpair,
 	.accept =	sock_no_accept,
 	.getname =	netlink_getname,
-	.poll =		datagram_poll,
+	.poll =		netlink_poll,
 	.ioctl =	netlink_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
-- 
2.25.1

Re: [PATCH net-next v2] netlink: introduce netlink poll to resolve fast return issue

Posted by Paolo Abeni 2 years, 2 months ago

Hi,

I'm sorry for the delayed feedback.

On Tue, 2023-11-14 at 18:07 +0900, Jong eon Park wrote:
> In very rare cases, there was an issue where a user's 'poll' function
> waiting for a uevent would continuously return very quickly, causing
> excessive CPU usage due to the following scenario.
> 
> When sk_rmem_alloc exceeds sk_rcvbuf, netlink_broadcast_deliver returns an
> error and netlink_overrun is called. However, if netlink_overrun was
> called in a context just before a another context returns from the 'poll'
> and 'recv' is invoked, emptying the rcv queue, sk->sk_err = ENOBUF is
> written to the netlink socket belatedly and it enters the
> NETLINK_S_CONGESTED state. If the user does not check for POLLERR, they
> cannot consume and clean sk_err and repeatedly enter the situation where
> they call 'poll' again but return immediately. Moreover, in this
> situation, rcv queue is already empty and NETLINK_S_CONGESTED flag
> prevents any more incoming packets. This makes it impossible for the user
> to call 'recv'.
> 
> This "congested" situation is a bit ambiguous. The queue is empty, yet
> 'congested' remains. This means kernel can no longer deliver uevents
> despite the empty queue, and it lead to the persistent 'congested' status.
> 
> ------------CPU1 (kernel)----------  --------------CPU2 (app)--------------
> ...
> a driver delivers uevent.            poll was waiting for schedule.
> a driver delivers uevent.
> a driver delivers uevent.
> ...
> 1) netlink_broadcast_deliver fails.
> (sk_rmem_alloc > sk_rcvbuf)
>                                       getting schedule and poll returns,
>                                       and the app calls recv.
>                                       (rcv queue is empied)
>                                       2)
> 
> netlink_overrun is called.
> (NETLINK_S_CONGESTED flag is set,
> ENOBUF is written in sk_err and,
> wake up poll.)
>                                       finishing its job and call poll.
>                                       poll returns POLLERR.
> 
>                                       (the app doesn't have POLLERR handler)
>                                       it calls poll, but getting POLLERR.
>                                       it calls poll, but getting POLLERR.
>                                       it calls poll, but getting POLLERR.
>                                       ...
> 
> To address this issue, I would like to introduce the following netlink
> poll.

IMHO the above is an application bug, and should not be addressed in
the kernel.

If you want to limit the amount of CPU time your application could use,
you have to resort to process scheduler setting and/or container
limits: nothing could prevent a [buggy?] application from doing:

# in shell script
while true; do :; done

The above condition is IMHO not very different from the above: the
application is requesting POLLERR event and not processing them.

To more accurate is like looping on poll() getting read event without
reading any data. Nothing we should address in the kernel.

Cheers,

Paolo