[PATCH v2] net/sched: netem: account for backlog updates from child qdisc

Posted by Martin Ottens 1 year ago
In general, 'qlen' of any classful qdisc should keep track of the
number of packets that the qdisc itself and all of its children hold.
In case of netem, 'qlen' only accounts for the packets in its internal
tfifo. When netem is used with a child qdisc, the child qdisc can use
'qdisc_tree_reduce_backlog' to inform its parent, netem, about created
or dropped SKBs. This function updates 'qlen' and the backlog statistics
of netem, but netem does not account for changes made by a child qdisc.
'qlen' then indicates the wrong number of packets in the tfifo.
If a child qdisc creates new SKBs during enqueue and informs its parent
about this, netem's 'qlen' value is increased. When netem dequeues the
newly created SKBs from the child, the 'qlen' in netem is not updated.
If 'qlen' reaches the configured sch->limit, the enqueue function stops
working, even though the tfifo is not full.
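
For illustration, the interaction is roughly the following (a simplified
sketch, not taken verbatim from the kernel sources; names such as 'nb',
'prev_len' and 'segs_len' only stand in for what the child computes):

    /* In the child (e.g. tbf_segment()): one GSO skb is split into nb
     * segments and the difference is reported upwards. A negative "n"
     * makes qdisc_tree_reduce_backlog() *increase* the qlen of every
     * ancestor, i.e. of netem:
     */
    qdisc_tree_reduce_backlog(child, 1 - nb, prev_len - segs_len);

    /* In netem (sch): the tfifo still holds only the one original entry,
     * but the limit check used the inflated counter:
     */
    if (unlikely(sch->q.qlen >= sch->limit))    /* trips too early */
        qdisc_drop_all(skb, sch, to_free);

    /* The extra segments are later pulled out of the child via
     * q->qdisc->ops->dequeue(), which did not decrement sch->q.qlen, so
     * the counter only ratchets upwards until netem "locks".
     */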

Reproduce the bug:
Ensure that the sender machine has GSO enabled. Configure netem as root
qdisc and tbf as its child on the outgoing interface of the machine
as follows:
$ tc qdisc add dev <oif> root handle 1: netem delay 100ms limit 100
$ tc qdisc add dev <oif> parent 1:0 tbf rate 50Mbit burst 1542 latency 50ms

Send bulk TCP traffic out via this interface, e.g., by running an iPerf3
client on the machine. Check the qdisc statistics:
$ tc -s qdisc show dev <oif>

tbf segments the GSO SKBs (tbf_segment) and updates netem's 'qlen'.
The interface fully stops transferring packets and "locks". In this case,
the child qdisc and tfifo are empty, but 'qlen' indicates the tfifo is at
its limit and no more packets are accepted.

This patch adds a counter for the entries in the tfifo. Netem's 'qlen' is
only decreased when a packet is returned by its dequeue function, and not
during enqueuing into the child qdisc. External updates to 'qlen' are thus
accounted for and only the behavior of the backlog statistics changes. As
in other qdiscs, 'qlen' then keeps track of how many packets are held in
netem and all of its children. As before, sch->limit remains the
maximum number of packets in the tfifo. The same applies to netem's
backlog statistics.
(Note: the 'backlog' byte-statistic of netem is incorrect in the example
above even after the patch is applied due to a bug in tbf. See my
previous patch ([PATCH] net/sched: tbf: correct backlog statistic for
GSO packets)).
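
In short, the intended accounting after this patch (a brief summary, not
a quote from the sources):

    /* q->t_len    - packets currently queued in netem's own tfifo;
     *               compared against sch->limit in netem_enqueue().
     * sch->q.qlen - packets held by netem plus its child, including any
     *               adjustments the child reports via
     *               qdisc_tree_reduce_backlog(); decremented when
     *               netem_dequeue() returns a packet (or drops one while
     *               handing it to the child).
     */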

Signed-off-by: Martin Ottens <martin.ottens@fau.de>
---
 net/sched/sch_netem.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index fe6fed291a7b..71ec9986ed37 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -79,6 +79,8 @@ struct netem_sched_data {
 	struct sk_buff	*t_head;
 	struct sk_buff	*t_tail;
 
+	u32 t_len;
+
 	/* optional qdisc for classful handling (NULL at netem init) */
 	struct Qdisc	*qdisc;
 
@@ -383,6 +385,7 @@ static void tfifo_reset(struct Qdisc *sch)
 	rtnl_kfree_skbs(q->t_head, q->t_tail);
 	q->t_head = NULL;
 	q->t_tail = NULL;
+	q->t_len = 0;
 }
 
 static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
@@ -412,6 +415,7 @@ static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 		rb_link_node(&nskb->rbnode, parent, p);
 		rb_insert_color(&nskb->rbnode, &q->t_root);
 	}
+	q->t_len++;
 	sch->q.qlen++;
 }
 
@@ -518,7 +522,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			1<<get_random_u32_below(8);
 	}
 
-	if (unlikely(sch->q.qlen >= sch->limit)) {
+	if (unlikely(q->t_len >= sch->limit)) {
 		/* re-link segs, so that qdisc_drop_all() frees them all */
 		skb->next = segs;
 		qdisc_drop_all(skb, sch, to_free);
@@ -702,8 +706,8 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 tfifo_dequeue:
 	skb = __qdisc_dequeue_head(&sch->q);
 	if (skb) {
-		qdisc_qstats_backlog_dec(sch, skb);
 deliver:
+		qdisc_qstats_backlog_dec(sch, skb);
 		qdisc_bstats_update(sch, skb);
 		return skb;
 	}
@@ -719,8 +723,7 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 
 		if (time_to_send <= now && q->slot.slot_next <= now) {
 			netem_erase_head(q, skb);
-			sch->q.qlen--;
-			qdisc_qstats_backlog_dec(sch, skb);
+			q->t_len--;
 			skb->next = NULL;
 			skb->prev = NULL;
 			/* skb->dev shares skb->rbnode area,
@@ -747,16 +750,21 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 					if (net_xmit_drop_count(err))
 						qdisc_qstats_drop(sch);
 					qdisc_tree_reduce_backlog(sch, 1, pkt_len);
+					sch->qstats.backlog -= pkt_len;
+					sch->q.qlen--;
 				}
 				goto tfifo_dequeue;
 			}
+			sch->q.qlen--;
 			goto deliver;
 		}
 
 		if (q->qdisc) {
 			skb = q->qdisc->ops->dequeue(q->qdisc);
-			if (skb)
+			if (skb) {
+				sch->q.qlen--;
 				goto deliver;
+			}
 		}
 
 		qdisc_watchdog_schedule_ns(&q->watchdog,
@@ -766,8 +774,10 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 
 	if (q->qdisc) {
 		skb = q->qdisc->ops->dequeue(q->qdisc);
-		if (skb)
+		if (skb) {
+			sch->q.qlen--;
 			goto deliver;
+		}
 	}
 	return NULL;
 }
-- 
2.39.5
Re: [PATCH v2] net/sched: netem: account for backlog updates from child qdisc
Posted by Jamal Hadi Salim 1 year ago
On Wed, Dec 4, 2024 at 7:29 AM Martin Ottens <martin.ottens@fau.de> wrote:
>
> In general, 'qlen' of any classful qdisc should keep track of the
> number of packets that the qdisc itself and all of its children holds.
> In case of netem, 'qlen' only accounts for the packets in its internal
> tfifo. When netem is used with a child qdisc, the child qdisc can use
> 'qdisc_tree_reduce_backlog' to inform its parent, netem, about created
> or dropped SKBs. This function updates 'qlen' and the backlog statistics
> of netem, but netem does not account for changes made by a child qdisc.
> 'qlen' then indicates the wrong number of packets in the tfifo.
> If a child qdisc creates new SKBs during enqueue and informs its parent
> about this, netem's 'qlen' value is increased. When netem dequeues the
> newly created SKBs from the child, the 'qlen' in netem is not updated.
> If 'qlen' reaches the configured sch->limit, the enqueue function stops
> working, even though the tfifo is not full.
>
> Reproduce the bug:
> Ensure that the sender machine has GSO enabled. Configure netem as root
> qdisc and tbf as its child on the outgoing interface of the machine
> as follows:
> $ tc qdisc add dev <oif> root handle 1: netem delay 100ms limit 100
> $ tc qdisc add dev <oif> parent 1:0 tbf rate 50Mbit burst 1542 latency 50ms
>
> Send bulk TCP traffic out via this interface, e.g., by running an iPerf3
> client on the machine. Check the qdisc statistics:
> $ tc -s qdisc show dev <oif>
>
> tbf segments the GSO SKBs (tbf_segment) and updates the netem's 'qlen'.
> The interface fully stops transferring packets and "locks". In this case,
> the child qdisc and tfifo are empty, but 'qlen' indicates the tfifo is at
> its limit and no more packets are accepted.
>

Would be nice to see the before and after (your change) output of the
stats to illustrate

> This patch adds a counter for the entries in the tfifo. Netem's 'qlen' is
> only decreased when a packet is returned by its dequeue function, and not
> during enqueuing into the child qdisc. External updates to 'qlen' are thus
> accounted for and only the behavior of the backlog statistics changes. As
> in other qdiscs, 'qlen' then keeps track of  how many packets are held in
> netem and all of its children. As before, sch->limit remains as the
> maximum number of packets in the tfifo. The same applies to netem's
> backlog statistics.
> (Note: the 'backlog' byte-statistic of netem is incorrect in the example
> above even after the patch is applied due to a bug in tbf. See my
> previous patch ([PATCH] net/sched: tbf: correct backlog statistic for
> GSO packets)).
>
> Signed-off-by: Martin Ottens <martin.ottens@fau.de>

A Fixes: tag is missing - the bug is probably there since 2.6.x. Stephen?

Your fix seems reasonable but I am curious: does this only happen with
TCP? If yes, perhaps the GSO handling may be contributing?
Can you run iperf with udp and see if the issue shows up again? Or
ping -f with size 1024.
You can slow down tbf for example with something like: tbf rate 20kbit
buffer 1600 limit 3000

cheers,
jamal


> ---
>  net/sched/sch_netem.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index fe6fed291a7b..71ec9986ed37 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -79,6 +79,8 @@ struct netem_sched_data {
>         struct sk_buff  *t_head;
>         struct sk_buff  *t_tail;
>
> +       u32 t_len;
> +
>         /* optional qdisc for classful handling (NULL at netem init) */
>         struct Qdisc    *qdisc;
>
> @@ -383,6 +385,7 @@ static void tfifo_reset(struct Qdisc *sch)
>         rtnl_kfree_skbs(q->t_head, q->t_tail);
>         q->t_head = NULL;
>         q->t_tail = NULL;
> +       q->t_len = 0;
>  }
>
>  static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
> @@ -412,6 +415,7 @@ static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
>                 rb_link_node(&nskb->rbnode, parent, p);
>                 rb_insert_color(&nskb->rbnode, &q->t_root);
>         }
> +       q->t_len++;
>         sch->q.qlen++;
>  }
>
> @@ -518,7 +522,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
>                         1<<get_random_u32_below(8);
>         }
>
> -       if (unlikely(sch->q.qlen >= sch->limit)) {
> +       if (unlikely(q->t_len >= sch->limit)) {
>                 /* re-link segs, so that qdisc_drop_all() frees them all */
>                 skb->next = segs;
>                 qdisc_drop_all(skb, sch, to_free);
> @@ -702,8 +706,8 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
>  tfifo_dequeue:
>         skb = __qdisc_dequeue_head(&sch->q);
>         if (skb) {
> -               qdisc_qstats_backlog_dec(sch, skb);
>  deliver:
> +               qdisc_qstats_backlog_dec(sch, skb);
>                 qdisc_bstats_update(sch, skb);
>                 return skb;
>         }
> @@ -719,8 +723,7 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
>
>                 if (time_to_send <= now && q->slot.slot_next <= now) {
>                         netem_erase_head(q, skb);
> -                       sch->q.qlen--;
> -                       qdisc_qstats_backlog_dec(sch, skb);
> +                       q->t_len--;
>                         skb->next = NULL;
>                         skb->prev = NULL;
>                         /* skb->dev shares skb->rbnode area,
> @@ -747,16 +750,21 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
>                                         if (net_xmit_drop_count(err))
>                                                 qdisc_qstats_drop(sch);
>                                         qdisc_tree_reduce_backlog(sch, 1, pkt_len);
> +                                       sch->qstats.backlog -= pkt_len;
> +                                       sch->q.qlen--;
>                                 }
>                                 goto tfifo_dequeue;
>                         }
> +                       sch->q.qlen--;
>                         goto deliver;
>                 }
>
>                 if (q->qdisc) {
>                         skb = q->qdisc->ops->dequeue(q->qdisc);
> -                       if (skb)
> +                       if (skb) {
> +                               sch->q.qlen--;
>                                 goto deliver;
> +                       }
>                 }
>
>                 qdisc_watchdog_schedule_ns(&q->watchdog,
> @@ -766,8 +774,10 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
>
>         if (q->qdisc) {
>                 skb = q->qdisc->ops->dequeue(q->qdisc);
> -               if (skb)
> +               if (skb) {
> +                       sch->q.qlen--;
>                         goto deliver;
> +               }
>         }
>         return NULL;
>  }
> --
> 2.39.5
>
Re: [PATCH v2] net/sched: netem: account for backlog updates from child qdisc
Posted by Martin Ottens 1 year ago
On 05.12.24 13:40, Jamal Hadi Salim wrote:
> Would be nice to see the before and after (your change) output of the
> stats to illustrate

Setup is as described in my patch. I used a larger netem limit of
1000 so that the overshoot of the qlen becomes more visible. The
kernel is from the current net-next tree; the sch_tbf patch referenced
in my patch is already applied (1596a135e318).


TCP before the fix (qlen is 1150p, exceeding the maximum of 1000p, 
netem qdisc becomes "locked" and stops accepting packets):

qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
 Sent 2760196 bytes 1843 pkt (dropped 389, overlimits 0 requeues 0)
 backlog 4294560030b 1150p requeues 0
qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
 Sent 2760196 bytes 1843 pkt (dropped 327, overlimits 7356 requeues 0)
 backlog 0b 0p requeues 0

UDP (iperf3 sends 50Mbit/s) before the fix, no issues here:

qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
 Sent 71917940 bytes 48286 pkt (dropped 2415, overlimits 0 requeues 0)
 backlog 643680b 432p requeues 0
qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
 Sent 71917940 bytes 48286 pkt (dropped 2415, overlimits 341057 requeues 0)
 backlog 311410b 209p requeues 0

TCP after the fix (UDP is not affected by the fix):

qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
 Sent 94859934 bytes 62676 pkt (dropped 15, overlimits 0 requeues 0)
 backlog 573806b 130p requeues 0
qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
 Sent 94859934 bytes 62676 pkt (dropped 324, overlimits 248442 requeues 0)
 backlog 4542b 3p requeues 0

> Your fix seems reasonable but I am curious: does this only happen with
> TCP? If yes, perhaps the
> GSO handling maybe contributing?
> Can you run iperf with udp and see if the issue shows up again? Or
> ping -f with size 1024.

I was only able to reproduce this behavior with tbf and it happens 
only when GSO packets are segmented inside the tbf child qdisc. As 
shown above, UDP is therefore not affected. The behavior also occurs 
if this configuration is used on the "outgoing" interface of a system 
that just forwards packets between two networks and GRO is enabled on 
the "incoming" interface.
Re: [PATCH v2] net/sched: netem: account for backlog updates from child qdisc
Posted by Jamal Hadi Salim 1 year ago
On Sat, Dec 7, 2024 at 11:37 AM Martin Ottens <martin.ottens@fau.de> wrote:
>
> On 05.12.24 13:40, Jamal Hadi Salim wrote:
> > Would be nice to see the before and after (your change) output of the
> > stats to illustrate
>
> Setup is as described in my patch. I used a larger limit of
> 1000 for netem so that the overshoot of the qlen becomes more
> visible. Kernel is from the current net-next tree (the patch to
> sch_tbf referenced in my patch is already applied (1596a135e318)).
>

Ok, wasn't aware of this one..

>
> TCP before the fix (qlen is 1150p, exceeding the maximum of 1000p,
> netem qdisc becomes "locked" and stops accepting packets):
>
> qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
>  Sent 2760196 bytes 1843 pkt (dropped 389, overlimits 0 requeues 0)
>  backlog 4294560030b 1150p requeues 0
> qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
>  Sent 2760196 bytes 1843 pkt (dropped 327, overlimits 7356 requeues 0)
>  backlog 0b 0p requeues 0
>
> UDP (iperf3 sends 50Mbit/s) before the fix, no issues here:
>
> qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
>  Sent 71917940 bytes 48286 pkt (dropped 2415, overlimits 0 requeues 0)
>  backlog 643680b 432p requeues 0
> qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
>  Sent 71917940 bytes 48286 pkt (dropped 2415, overlimits 341057 requeues 0)
>  backlog 311410b 209p requeues 0
>
> TCP after the fix (UDP is not affected by the fix):
>
> qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
>  Sent 94859934 bytes 62676 pkt (dropped 15, overlimits 0 requeues 0)
>  backlog 573806b 130p requeues 0
> qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
>  Sent 94859934 bytes 62676 pkt (dropped 324, overlimits 248442 requeues 0)
>  backlog 4542b 3p requeues 0
>

backlog being > 0 is a problem, unless your results are captured
mid-test (instead of at the end of the test).
I will validate on net-next and with your patch.

> > Your fix seems reasonable but I am curious: does this only happen with
> > TCP? If yes, perhaps the
> > GSO handling maybe contributing?
> > Can you run iperf with udp and see if the issue shows up again? Or
> > ping -f with size 1024.
>
> I was only able to reproduce this behavior with tbf and it happens
> only when GSO packets are segmented inside the tbf child qdisc. As
> shown above, UDP is therefore not affected. The behavior also occurs
> if this configuration is used on the "outgoing" interface of a system
> that just forwards packets between two networks and GRO is enabled on
> the "incoming" interface.

Ok, will do a quick check since I have cycles..

cheers,
jamal
Re: [PATCH v2] net/sched: netem: account for backlog updates from child qdisc
Posted by Jamal Hadi Salim 1 year ago
On Mon, Dec 9, 2024 at 4:13 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Sat, Dec 7, 2024 at 11:37 AM Martin Ottens <martin.ottens@fau.de> wrote:
> >
> > On 05.12.24 13:40, Jamal Hadi Salim wrote:
> > > Would be nice to see the before and after (your change) output of the
> > > stats to illustrate
> >
> > Setup is as described in my patch. I used a larger limit of
> > 1000 for netem so that the overshoot of the qlen becomes more
> > visible. Kernel is from the current net-next tree (the patch to
> > sch_tbf referenced in my patch is already applied (1596a135e318)).
> >
>
> Ok, wasnt aware of this one..
>
> >
> > TCP before the fix (qlen is 1150p, exceeding the maximum of 1000p,
> > netem qdisc becomes "locked" and stops accepting packets):
> >
> > qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
> >  Sent 2760196 bytes 1843 pkt (dropped 389, overlimits 0 requeues 0)
> >  backlog 4294560030b 1150p requeues 0
> > qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
> >  Sent 2760196 bytes 1843 pkt (dropped 327, overlimits 7356 requeues 0)
> >  backlog 0b 0p requeues 0
> >
> > UDP (iperf3 sends 50Mbit/s) before the fix, no issues here:
> >
> > qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
> >  Sent 71917940 bytes 48286 pkt (dropped 2415, overlimits 0 requeues 0)
> >  backlog 643680b 432p requeues 0
> > qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
> >  Sent 71917940 bytes 48286 pkt (dropped 2415, overlimits 341057 requeues 0)
> >  backlog 311410b 209p requeues 0
> >
> > TCP after the fix (UDP is not affected by the fix):
> >
> > qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
> >  Sent 94859934 bytes 62676 pkt (dropped 15, overlimits 0 requeues 0)
> >  backlog 573806b 130p requeues 0
> > qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
> >  Sent 94859934 bytes 62676 pkt (dropped 324, overlimits 248442 requeues 0)
> >  backlog 4542b 3p requeues 0
> >
>
> backlog being > 0 is a problem, unless your results are captured mid
> test (instead of end of test)
> I will validate on net-next and with your patch.
>

Ok, so this seems sane to me - but can you please put output in the
commit message reflecting the state after the test is completed?
Something like, before patch (highlighting stuck backlog on netem):

qdisc netem 1: root refcnt 2 limit 1000 delay 1s seed 17105543349430145291
 Sent 35220 bytes 43 pkt (dropped 7, overlimits 0 requeues 0)
 backlog 4294958212b 0p requeues 0
qdisc tbf 8003: parent 1: rate 50Mbit burst 1600b lat 224us
 Sent 35220 bytes 43 pkt (dropped 17, overlimits 1 requeues 0)
 backlog 0b 0p requeues 0

And after your patch:
qdisc netem 1: root refcnt 2 limit 1000 delay 1s seed 11503045766577034723
 Sent 42864 bytes 49 pkt (dropped 5, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc tbf 8001: parent 1: rate 50Mbit burst 1600b lat 224us
 Sent 42864 bytes 49 pkt (dropped 16, overlimits 5 requeues 0)
 backlog 0b 0p requeues 0

backlog is now shown as cleared.

Coincidentally, after removing your tbf patch (which is already in
net-next) and rerunning the test, it didn't seem to matter whether GSO
was on or off (as you can see below, the backlog is stuck on tbf):


GSO off:
qdisc netem 1: root refcnt 2 limit 1000 delay 1s seed 12925321237200695918
 Sent 26284 bytes 39 pkt (dropped 7, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc tbf 8001: parent 1: rate 50Mbit burst 1600b lat 224us
 Sent 26284 bytes 39 pkt (dropped 17, overlimits 1 requeues 0)
 backlog 4294959726b 0p requeues 0

GSO on:
qdisc netem 1: root refcnt 2 limit 1000 delay 1s seed 18236003995023052493
 Sent 35224 bytes 43 pkt (dropped 7, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc tbf 8002: parent 1: rate 50Mbit burst 1600b lat 224us
 Sent 35224 bytes 43 pkt (dropped 17, overlimits 1 requeues 0)
 backlog 4294958212b 0p requeues 0

Please resubmit the patch - add my Acked-by and include the proper
before/after stats.
The Fixes tag is likely: Linux-2.6.12-rc2

cheers,
jamal