This patch is the result of our paper titled "The NODROP Patch:
Hardening Secure Networking for Real-time Teleoperation by Preventing
Packet Drops in the Linux TUN Driver" [1].
It deals with the tun_net_xmit function, which drops SKBs with the reason
SKB_DROP_REASON_FULL_RING whenever the tx_ring (TUN queue) is full,
resulting in reduced TCP performance and packet loss for bursty video
streams when used over VPNs.
The abstract reads as follows:
"Throughput-critical teleoperation requires robust and low-latency
communication to ensure safety and performance. Often, these kinds of
applications are implemented in Linux-based operating systems and transmit
over virtual private networks, which ensure encryption and ease of use by
providing a dedicated tunneling interface (TUN) to user space
applications. In this work, we identified a specific behavior in the Linux
TUN driver, which results in significant performance degradation due to
the sender stack silently dropping packets. This design issue drastically
impacts real-time video streaming, inducing up to 29 % packet loss with
noticeable video artifacts when the internal queue of the TUN driver is
reduced to 25 packets to minimize latency. Furthermore, a small queue
length also drastically reduces the throughput of TCP traffic due to many
retransmissions. Instead, with our open-source NODROP Patch, we propose
generating backpressure in case of burst traffic or network congestion.
The patch effectively addresses the packet-dropping behavior, hardening
real-time video streaming and improving TCP throughput by 36 % in high
latency scenarios."
In addition to the mentioned performance and latency improvements for VPN
applications, this patch also allows the proper usage of qdiscs. For
example, fq_codel cannot control the queuing delay when packets are
already dropped in the TUN driver. This issue is also described in [2].
The performance evaluation in the paper (see Fig. 4) showed a 4%
performance hit for a single-queue TUN with the default TUN queue size of
500 packets. However, it is important to note that with the proposed
patch no packet drop ever occurred, even with a TUN queue size of 1
packet. The validation pipeline we used is available at [3].
As reducing the TUN queue size to as few as 5 packets showed no further
performance hit in the paper, a reduction of the default TUN queue size
might be desirable to accompany this patch. A reduction would obviously
reduce buffer bloat and memory requirements.
Implementation details:
- The netdev queue start/stop flow control is utilized (the generic
pattern is sketched after this list).
- Compatible with multi-queue by only stopping/waking the specific
netdevice subqueue.
- No additional locking is used.
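For orientation, the generic stop/wake pattern looks roughly like this
(a minimal sketch of the common driver idiom, not code from this patch;
the example_* helpers are placeholder names):

	/* Producer side (ndo_start_xmit): enqueue first, then pause the
	 * stack's TX path while the ring is full.
	 */
	static netdev_tx_t example_xmit(struct sk_buff *skb,
					struct net_device *dev)
	{
		struct netdev_queue *queue;

		queue = netdev_get_tx_queue(dev, skb->queue_mapping);

		if (example_ring_produce(dev, skb))	/* placeholder enqueue */
			goto drop;
		if (example_ring_full(dev))		/* placeholder fullness check */
			netif_tx_stop_queue(queue);
		return NETDEV_TX_OK;
	drop:
		kfree_skb(skb);
		return NETDEV_TX_OK;
	}

	/* Consumer side: once space is available again, resume the stack. */
	netif_tx_wake_queue(queue);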
In the tun_net_xmit function:
- Stopping the subqueue is done when the tx_ring gets full after inserting
the SKB into the tx_ring.
- In the unlikely case when the insertion with ptr_ring_produce fails, the
old dropping behavior is used for this SKB.
- In the unlikely case when tun_net_xmit is called even though the tx_ring
is full, the subqueue is stopped once again and NETDEV_TX_BUSY is returned.
In the tun_ring_recv function:
- Waking the subqueue is done after consuming an SKB from the tx_ring
when the tx_ring is empty. Waking the subqueue whenever the tx_ring has
any available space (i.e., whenever it is not full) caused crashes in our
testing. We are open to suggestions.
- Especially when the tx_ring is configured to be small, queuing might be
stopped in the tun_net_xmit function while, at the same time,
ptr_ring_consume is unable to grab a packet. This prevents tun_net_xmit
from being called again and causes tun_ring_recv to wait indefinitely for
a packet. Therefore, the queue is woken whenever it is found stopped while
trying to grab a packet. The same behavior is applied in the accompanying
wait queue.
- Because the tun_struct is required to get the tx_queue into the new txq
pointer, the tun_struct is passed to tun_do_read as well. This is likely
faster than obtaining it via the tun_file tfile, because that requires
taking an RCU read lock.
We are open to suggestions regarding the implementation :)
Thank you for your work!
[1] Link: https://cni.etit.tu-dortmund.de/storages/cni-etit/r/Research/Publications/2025/Gebauer_2025_VTCFall/Gebauer_VTCFall2025_AuthorsVersion.pdf
[2] Link: https://unix.stackexchange.com/questions/762935/traffic-shaping-ineffective-on-tun-device
[3] Link: https://github.com/tudo-cni/nodrop
Co-developed-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Tim Gebauer <tim.gebauer@tu-dortmund.de>
Signed-off-by: Simon Schippers <simon.schippers@tu-dortmund.de>
---
drivers/net/tun.c | 32 ++++++++++++++++++++++++++++----
1 file changed, 28 insertions(+), 4 deletions(-)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cc6c50180663..e88a312d3c72 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1023,6 +1023,13 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
netif_info(tun, tx_queued, tun->dev, "%s %d\n", __func__, skb->len);
+ if (unlikely(ptr_ring_full(&tfile->tx_ring))) {
+ queue = netdev_get_tx_queue(dev, txq);
+ netif_tx_stop_queue(queue);
+ rcu_read_unlock();
+ return NETDEV_TX_BUSY;
+ }
+
/* Drop if the filter does not like it.
* This is a noop if the filter is disabled.
* Filter can be enabled only for the TAP devices. */
@@ -1060,13 +1067,16 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
nf_reset_ct(skb);
- if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+ queue = netdev_get_tx_queue(dev, txq);
+ if (unlikely(ptr_ring_produce(&tfile->tx_ring, skb))) {
+ netif_tx_stop_queue(queue);
drop_reason = SKB_DROP_REASON_FULL_RING;
goto drop;
}
+ if (ptr_ring_full(&tfile->tx_ring))
+ netif_tx_stop_queue(queue);
/* dev->lltx requires to do our own update of trans_start */
- queue = netdev_get_tx_queue(dev, txq);
txq_trans_cond_update(queue);
/* Notify and wake up reader process */
@@ -2110,15 +2120,21 @@ static ssize_t tun_put_user(struct tun_struct *tun,
return total;
}
-static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
+static void *tun_ring_recv(struct tun_struct *tun, struct tun_file *tfile, int noblock, int *err)
{
DECLARE_WAITQUEUE(wait, current);
+ struct netdev_queue *txq;
void *ptr = NULL;
int error = 0;
ptr = ptr_ring_consume(&tfile->tx_ring);
if (ptr)
goto out;
+
+ txq = netdev_get_tx_queue(tun->dev, tfile->queue_index);
+ if (unlikely(netif_tx_queue_stopped(txq)))
+ netif_tx_wake_queue(txq);
+
if (noblock) {
error = -EAGAIN;
goto out;
@@ -2131,6 +2147,10 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
ptr = ptr_ring_consume(&tfile->tx_ring);
if (ptr)
break;
+
+ if (unlikely(netif_tx_queue_stopped(txq)))
+ netif_tx_wake_queue(txq);
+
if (signal_pending(current)) {
error = -ERESTARTSYS;
break;
@@ -2147,6 +2167,10 @@ static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
remove_wait_queue(&tfile->socket.wq.wait, &wait);
out:
+ if (ptr_ring_empty(&tfile->tx_ring)) {
+ txq = netdev_get_tx_queue(tun->dev, tfile->queue_index);
+ netif_tx_wake_queue(txq);
+ }
*err = error;
return ptr;
}
@@ -2165,7 +2189,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
if (!ptr) {
/* Read frames from ring */
- ptr = tun_ring_recv(tfile, noblock, &err);
+ ptr = tun_ring_recv(tun, tfile, noblock, &err);
if (!ptr)
return err;
}
--
2.43.0
Simon Schippers wrote:
> This patch is the result of our paper titled "The NODROP Patch:
> Hardening Secure Networking for Real-time Teleoperation by Preventing
> Packet Drops in the Linux TUN Driver" [1].
[...]
> impacts real-time video streaming, inducing up to 29 % packet loss with
> noticeable video artifacts when the internal queue of the TUN driver is
> reduced to 25 packets to minimize latency. Furthermore, a small queue

This clearly increases dropcount. Does it meaningfully reduce latency?

The cause of latency here is scheduling of the process reading from
the tun FD.

Task pinning and/or adjusting scheduler priority/algorithm/etc. may
be a more effective and robust approach to reducing latency.

[...]
> +	if (unlikely(ptr_ring_full(&tfile->tx_ring))) {
> +		queue = netdev_get_tx_queue(dev, txq);
> +		netif_tx_stop_queue(queue);
> +		rcu_read_unlock();
> +		return NETDEV_TX_BUSY;

returning NETDEV_TX_BUSY is discouraged.

In principle pausing the "device" queue for TUN, similar to other
devices, sounds reasonable, iff the simpler above suggestion is not
sufficient.

But then preferable to pause before the queue is full, to avoid having
to return failure. See for instance virtio_net.

[...]
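For context, the virtio_net-style approach mentioned above pauses the
queue while free slots remain, so the stack never has to see a failed
transmit, and wakes it with some hysteresis. A minimal sketch, assuming a
hypothetical ring-space helper and watermark constants (none of these
names exist in tun.c or virtio_net):

	/* Producer side (ndo_start_xmit), after a successful enqueue:
	 * stop early while capacity remains.
	 */
	if (example_ring_space(ring) < STOP_WATERMARK)	/* hypothetical helper */
		netif_tx_stop_queue(txq);

	/* Consumer side, after dequeueing: wake only with enough headroom,
	 * so the queue does not ping-pong between stopped and running.
	 */
	if (netif_tx_queue_stopped(txq) &&
	    example_ring_space(ring) >= WAKE_WATERMARK)
		netif_tx_wake_queue(txq);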
On Sat, Aug 9, 2025 at 10:15 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Simon Schippers wrote:
[...]
> > +	if (unlikely(ptr_ring_full(&tfile->tx_ring))) {
> > +		queue = netdev_get_tx_queue(dev, txq);
> > +		netif_tx_stop_queue(queue);
> > +		rcu_read_unlock();
> > +		return NETDEV_TX_BUSY;
>
> returning NETDEV_TX_BUSY is discouraged.
>
> In principle pausing the "device" queue for TUN, similar to other
> devices, sounds reasonable, iff the simpler above suggestion is not
> sufficient.
>
> But then preferable to pause before the queue is full, to avoid having
> to return failure. See for instance virtio_net.

+1 and we probably need to invent new ptr ring helpers for that.

Thanks
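Purely as an illustration of what such a helper could look like, a
hypothetical sketch, not an existing ptr_ring API (free slots in a
ptr_ring hold NULL, so a producer-side peek-ahead gives a conservative
space estimate):

	/* Hypothetical: returns true if fewer than cnt free slots remain
	 * ahead of the producer. Like __ptr_ring_full(), it would have to
	 * be called with the producer lock held. Conservative, because
	 * the consumer NULLs out consumed slots only in batches.
	 */
	static inline bool __ptr_ring_almost_full(struct ptr_ring *r, int cnt)
	{
		if (unlikely(r->size <= cnt))
			return true;
		return !!r->queue[(r->producer + cnt) % r->size];
	}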
Willem de Bruijn wrote:
> Simon Schippers wrote:
>> This patch is the result of our paper titled "The NODROP Patch:
>> Hardening Secure Networking for Real-time Teleoperation by Preventing
>> Packet Drops in the Linux TUN Driver" [1].
[...]
>> impacts real-time video streaming, inducing up to 29 % packet loss with
>> noticeable video artifacts when the internal queue of the TUN driver is
>> reduced to 25 packets to minimize latency. Furthermore, a small queue
>
> This clearly increases dropcount. Does it meaningfully reduce latency?
>
> The cause of latency here is scheduling of the process reading from
> the tun FD.
>
> Task pinning and/or adjusting scheduler priority/algorithm/etc. may
> be a more effective and robust approach to reducing latency.

Thank you for your answer!

In our case, we consider latencies mainly on the application level
end-to-end, e.g., a UDP real-time video stream. There, high latencies
mostly occur due to buffer bloat in the lower layers like the TUN driver.

Example:
--> A VPN application uses the TUN driver with the default 500-packet
    TUN queue and sends packets via a 10 Mbit/s interface.
--> Applications try to send traffic > 10 Mbit/s through the VPN,
    1500 bytes per packet.
--> The TUN queue fills up completely.
--> Approx. delay = (1500 bytes * 500 packets) / (10 Mbit/s / 8 bit/byte)
    = 600 ms
--> We were able to reproduce such huge latencies in our measurements.

Especially for low-latency applications, these buffer/queue sizes
reflect the maximum worst-case latency, which we focus on minimizing.

Just reducing the TUN queue is not an option here: without proper
backpropagation of the congestion to the upper-layer application (in
this case through the stopping of the queues), the applications will
consider the TUN network to be of "unlimited bandwidth" and will
therefore, e.g. in the case of TCP, treat every packet dropped by the
TUN driver as a packet loss, shrinking the congestion window. With
proper backpropagation, the application data rate is limited, resulting
in no artificial packet loss and keeping the data rate close to the
achievable maximum. In addition, the TUN queue size should depend on the
interface speed, which can change over time (e.g. Wi-Fi, cellular
modems).

--> This patch allows reducing the TUN queue without suffering from
    drops.
--> It lets the qdisc (e.g. fq_codel) manage the delay.
--> It allows the upper-level application to handle the congestion in
    its preferred way instead of having its packets dropped.

[...]
>> +	if (unlikely(ptr_ring_full(&tfile->tx_ring))) {
>> +		queue = netdev_get_tx_queue(dev, txq);
>> +		netif_tx_stop_queue(queue);
>> +		rcu_read_unlock();
>> +		return NETDEV_TX_BUSY;
>
> returning NETDEV_TX_BUSY is discouraged.

I agree with you: In the unlikely case when the start/stop flow control
fails and tun_net_xmit is called even though the TUN queue is full, it
should just drop the packet.

> In principle pausing the "device" queue for TUN, similar to other
> devices, sounds reasonable, iff the simpler above suggestion is not
> sufficient.

The current implementation pauses at the exact moment the tx_ring
becomes full, and that proved to be sufficient in our testing. Because
the tx_ring always stores same-size SKB pointers, I do not think we have
to stop the queuing earlier like virtio_net does. I will adjust the
implementation and also fix the general protection fault in tun_net_xmit
caused by the ptr_ring_full call.

> But then preferable to pause before the queue is full, to avoid having
> to return failure. See for instance virtio_net.
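Given the plan above, the adjusted slow path would presumably look
roughly like this (a sketch of the intended follow-up based on this
reply, reusing the existing drop path; not the posted patch):

	if (unlikely(ptr_ring_full(&tfile->tx_ring))) {
		queue = netdev_get_tx_queue(dev, txq);
		netif_tx_stop_queue(queue);
		drop_reason = SKB_DROP_REASON_FULL_RING;
		goto drop;	/* drop instead of returning NETDEV_TX_BUSY */
	}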
syzbot ci has tested the following series

[v1] TUN/TAP: Improving throughput and latency by avoiding SKB drops
https://lore.kernel.org/all/20250808153721.261334-1-simon.schippers@tu-dortmund.de
* [PATCH net] TUN/TAP: Improving throughput and latency by avoiding SKB drops

and found the following issue:
general protection fault in tun_net_xmit

Full report is available here:
https://ci.syzbot.org/series/4a9dd6ad-3c81-4957-b447-4d1e8e9ee7a2

***

general protection fault in tun_net_xmit

tree:      net
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net.git
base:      ae633388cae349886f1a3cfb27aa092854b24c1b
arch:      amd64
compiler:  Debian clang version 20.1.7 (++20250616065708+6146a88f6049-1~exp1~20250616065826.132), Debian LLD 20.1.7
config:    https://ci.syzbot.org/builds/f35af9e4-44af-4a13-8842-d9d36ecb06e7/config
C repro:   https://ci.syzbot.org/findings/e400bf02-40dc-43bb-8c15-d21b7ecb7304/c_repro
syz repro: https://ci.syzbot.org/findings/e400bf02-40dc-43bb-8c15-d21b7ecb7304/syz_repro

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
CPU: 1 UID: 0 PID: 12 Comm: kworker/u8:0 Not tainted 6.16.0-syzkaller-06620-gae633388cae3-dirty #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: ipv6_addrconf addrconf_dad_work
RIP: 0010:__ptr_ring_full include/linux/ptr_ring.h:51 [inline]
RIP: 0010:ptr_ring_full include/linux/ptr_ring.h:59 [inline]
RIP: 0010:tun_net_xmit+0x3ee/0x19c0 drivers/net/tun.c:1026
Code: 54 24 18 48 89 d0 48 c1 e8 03 48 89 44 24 58 42 0f b6 04 28 84 c0 0f 85 f9 11 00 00 48 63 02 48 8d 1c c3 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 74 08 48 89 df e8 d3 0f ac fb 48 8b 1b 48 8b 7c 24
RSP: 0018:ffffc900000f6f00 EFLAGS: 00010202
RAX: 0000000000000002 RBX: 0000000000000010 RCX: dffffc0000000000
RDX: ffff88811bf90940 RSI: 0000000000000004 RDI: ffffc900000f6e80
RBP: ffffc900000f7050 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffff5200001edd0 R12: 0000000000000000
R13: dffffc0000000000 R14: ffff8881054c8000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8881a3c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000002280 CR3: 0000000110b70000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __netdev_start_xmit include/linux/netdevice.h:5219 [inline]
 netdev_start_xmit include/linux/netdevice.h:5228 [inline]
 xmit_one net/core/dev.c:3827 [inline]
 dev_hard_start_xmit+0x2d7/0x830 net/core/dev.c:3843
 sch_direct_xmit+0x241/0x4b0 net/sched/sch_generic.c:344
 __dev_xmit_skb net/core/dev.c:4102 [inline]
 __dev_queue_xmit+0x1857/0x3b50 net/core/dev.c:4679
 neigh_output include/net/neighbour.h:547 [inline]
 ip6_finish_output2+0x11fe/0x16a0 net/ipv6/ip6_output.c:141
 NF_HOOK include/linux/netfilter.h:318 [inline]
 ndisc_send_skb+0xb54/0x1440 net/ipv6/ndisc.c:512
 addrconf_dad_completed+0x7ae/0xd60 net/ipv6/addrconf.c:4360
 addrconf_dad_work+0xc36/0x14b0 net/ipv6/addrconf.c:-1
 process_one_work kernel/workqueue.c:3238 [inline]
 process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321
 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402
 kthread+0x711/0x8a0 kernel/kthread.c:464
 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__ptr_ring_full include/linux/ptr_ring.h:51 [inline]
RIP: 0010:ptr_ring_full include/linux/ptr_ring.h:59 [inline]
RIP: 0010:tun_net_xmit+0x3ee/0x19c0 drivers/net/tun.c:1026
Code: 54 24 18 48 89 d0 48 c1 e8 03 48 89 44 24 58 42 0f b6 04 28 84 c0 0f 85 f9 11 00 00 48 63 02 48 8d 1c c3 48 89 d8 48 c1 e8 03 <42> 80 3c 28 00 74 08 48 89 df e8 d3 0f ac fb 48 8b 1b 48 8b 7c 24
RSP: 0018:ffffc900000f6f00 EFLAGS: 00010202
RAX: 0000000000000002 RBX: 0000000000000010 RCX: dffffc0000000000
RDX: ffff88811bf90940 RSI: 0000000000000004 RDI: ffffc900000f6e80
RBP: ffffc900000f7050 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffff5200001edd0 R12: 0000000000000000
R13: dffffc0000000000 R14: ffff8881054c8000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8881a3c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000200000002280 CR3: 0000000110b70000 CR4: 00000000000006f0
----------------
Code disassembly (best guess):
   0:	54                   	push   %rsp
   1:	24 18                	and    $0x18,%al
   3:	48 89 d0             	mov    %rdx,%rax
   6:	48 c1 e8 03          	shr    $0x3,%rax
   a:	48 89 44 24 58       	mov    %rax,0x58(%rsp)
   f:	42 0f b6 04 28       	movzbl (%rax,%r13,1),%eax
  14:	84 c0                	test   %al,%al
  16:	0f 85 f9 11 00 00    	jne    0x1215
  1c:	48 63 02             	movslq (%rdx),%rax
  1f:	48 8d 1c c3          	lea    (%rbx,%rax,8),%rbx
  23:	48 89 d8             	mov    %rbx,%rax
  26:	48 c1 e8 03          	shr    $0x3,%rax
* 2a:	42 80 3c 28 00       	cmpb   $0x0,(%rax,%r13,1) <-- trapping instruction
  2f:	74 08                	je     0x39
  31:	48 89 df             	mov    %rbx,%rdi
  34:	e8 d3 0f ac fb       	call   0xfbac100c
  39:	48 8b 1b             	mov    (%rbx),%rbx
  3c:	48                   	rex.W
  3d:	8b                   	.byte 0x8b
  3e:	7c 24                	jl     0x64

***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.