tcf_action_fill_size() already computes the required dump size, but
RTM_GETACTION replies always allocate NLMSG_GOODSIZE. Large action
state can overrun that skb and make dumps fail.
Use the computed reply size for RTM_GETACTION replies so large actions
can be dumped, while still keeping NLMSG_GOODSIZE as a floor.
Fixes: 4e76e75d6aba ("net sched actions: calculate add/delete event message size")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moses <p@1g4.org>
---
net/sched/act_api.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index e1ab0faeb8113..8ab016d352850 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1685,12 +1685,12 @@ static int tca_get_fill(struct sk_buff *skb, struct tc_action *actions[],
static int
tcf_get_notify(struct net *net, u32 portid, struct nlmsghdr *n,
- struct tc_action *actions[], int event,
+ struct tc_action *actions[], int event, size_t attr_size,
struct netlink_ext_ack *extack)
{
struct sk_buff *skb;
- skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+ skb = alloc_skb(max_t(size_t, attr_size, NLMSG_GOODSIZE), GFP_KERNEL);
if (!skb)
return -ENOBUFS;
if (tca_get_fill(skb, actions, portid, n->nlmsg_seq, 0, event,
@@ -2041,7 +2041,8 @@ tca_action_gd(struct net *net, struct nlattr *nla, struct nlmsghdr *n,
attr_size = tcf_action_full_attrs_size(attr_size);
if (event == RTM_GETACTION)
- ret = tcf_get_notify(net, portid, n, actions, event, extack);
+ ret = tcf_get_notify(net, portid, n, actions, event,
+ attr_size, extack);
else { /* delete */
ret = tcf_del_notify(net, n, actions, portid, attr_size, extack);
if (ret)
--
2.52.GIT
On Fri, Jan 30, 2026 at 8:43 AM Paul Moses <p@1g4.org> wrote:
dunno. Is this based on some issue you found? This is a common pattern
in a lot of places in the stack and has not caused any issues (afaik).
cheers,
jamal
Yes. In net/sched/act_api.c the GETACTION notify path always does alloc_skb(NLMSG_GOODSIZE); if tca_get_fill()
runs out of tailroom it returns -1 and tcf_get_notify() maps that to -EINVAL. So failures are size-dependent
and can look intermittent across different action dumps. act_gate might be the outlier?
The size is already computed in tca_action_gd() (sum tcf_action_fill_size() then tcf_action_full_attrs_size())
and add/del already allocate max(attr_size, NLMSG_GOODSIZE). This patch just makes GETACTION consistent with
that.
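For reference, here is a paraphrased sketch (not verbatim kernel code, argument lists abbreviated) of the two
paths in net/sched/act_api.c being compared: the GETACTION reply path uses a fixed-size skb and turns a fill
failure into -EINVAL, while the add/delete notify paths already size the skb from attr_size:
---
/* Paraphrased sketch, not a copy of the kernel source. */

/* RTM_GETACTION reply path (before this patch): fixed-size skb. */
skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
if (!skb)
        return -ENOBUFS;
if (tca_get_fill(skb, actions, portid, n->nlmsg_seq, 0, event
                 /* remaining args elided */) <= 0) {
        kfree_skb(skb);
        return -EINVAL;         /* the cryptic error userspace sees */
}

/* tcf_add_notify()/tcf_del_notify(): skb sized from the computed attr_size,
 * with NLMSG_GOODSIZE as the floor. */
skb = alloc_skb(attr_size <= NLMSG_GOODSIZE ? NLMSG_GOODSIZE : attr_size,
                GFP_KERNEL);
---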
On the reproducer: the gatebench test with 100 entries is reasonable.
https://raw.githubusercontent.com/jopamo/gatebench/refs/heads/main/src/selftests/test_large_dump.c
I plan to follow this up with another patch for act_gate and believe they both are integral to fully stabilize
act_gate.
Thanks
Paul
On Fri, Jan 30, 2026 at 12:22 PM Paul Moses <p@1g4.org> wrote:
>
> Yes. In net/sched/act_api.c the GETACTION notify path always does alloc_skb(NLMSG_GOODSIZE); if tca_get_fill()
> runs out of tailroom it returns -1 and tcf_get_notify() maps that to -EINVAL. So failures are size-dependent
> and can look intermittent across different action dumps. act_gate might be the outlier?
>

Very bizarre that dump would fail because it is transactional. It shouldnt
matter that you are only allocing NLMSG_GOODSIZE. Is there a possibility
that a single act_gate entry can be larger than NLMSG_GOODSIZE?

> The size is already computed in tca_action_gd() (sum tcf_action_fill_size() then tcf_action_full_attrs_size())
> and add/del already allocate max(attr_size, NLMSG_GOODSIZE). This patch just makes GETACTION consistent with
> that.
>

I looked at act_gate dump and it is sane. Which leads to perhaps your
test program being buggy. Install the 100 actions then use tc to count.
Something like:

tc actions ls action gate | grep index | wc -l

cheers,
jamal

> On the reproducer: the gatebench test with 100 entries is reasonable.
> https://raw.githubusercontent.com/jopamo/gatebench/refs/heads/main/src/selftests/test_large_dump.c
>
> I plan to follow this up with another patch for act_gate and believe they both are integral to fully stabilize
> act_gate.
>
> Thanks
> Paul
What version of act_gate.c are you currently testing? Did you actually run the tests?

“large dump” creates ONE action at base_index, with num_entries=100, then immediately does GETACTION.
So “tc actions ls action gate | grep index | wc -l” won’t exercise this, because it only counts actions.
It doesn’t amplify the per action dump size (the entry list does).

It uses libmnl (mnl_socket_sendto / mnl_socket_recvfrom) with MNL_SOCKET_BUFFER_SIZE. There is no custom
netlink handling. The failure is returned by the kernel before userspace parses anything.

The dumps are transactional at the netlink level, but an individual action dump still has to fit in the skb
backing that message.
look at af_netlink.c
/* NLMSG_GOODSIZE is small to avoid high order allocations being
* required, but it makes sense to _attempt_ a 32KiB allocation
* to reduce number of system calls on dump operations, if user
* ever provided a big enough buffer.
*/
...
/* Trim skb to allocated size. User is expected to provide buffer as
* large as max(min_dump_alloc, 32KiB (max_recvmsg_len capped at
* netlink_recvmsg())). dump will pack as many smaller messages as
* could fit within the allocated skb. skb is typically allocated
* with larger space than required (could be as much as near 2x the
* requested size with align to next power of 2 approach). Allowing
* dump to use the excess space makes it difficult for a user to have a
* reasonable static buffer based on the expected largest dump of a
* single netdev. The outcome is MSG_TRUNC error.
*/
This is where I am currently, but I have seen these bugs appear throughout all my iterations, including what's in the tree right now. If you can show me better alternatives that solve my problems, I'll gladly accept them.
https://github.com/torvalds/linux/compare/master...jopamo:linux:net-stable-upstream-v4
gatebench --selftest
Configuration:
Iterations per run: 1000
Warmup iterations: 100
Runs: 5
Gate entries: 10
Gate interval: 1000000 ns
Starting index: 1000
CPU pinning: no
Netlink timeout: 1000 ms
Selftest: yes
JSON output: no
Sampling: no
Clock ID: 11
Base time: 0 ns
Cycle time: 0 ns
Cycle time ext: 0 ns
Environment:
Kernel: Linux 6.18.7 x86_64
Current CPU: 7
Clock source: CLOCK_MONOTONIC_RAW
Running selftests...
Running 20 selftests...
create missing parms PASS (got -22)
create missing entry list PASS (got -22)
create empty entry list PASS (got -22)
create zero interval PASS (got -22)
create bad clockid PASS (got -22)
replace without existing PASS (got 0)
duplicate create PASS (got -17)
dump correctness PASS (got 0)
replace persistence PASS (got 0)
clockid variants PASS (got 0)
cycle time derivation PASS (got 0)
cycle time extension parsing PASS (got 0)
replace preserve schedule PASS (got 0)
base time update PASS (got 0)
multiple entries PASS (got 0)
malformed nesting PASS (got -22)
bad attribute size PASS (got -22)
param validation PASS (got 0)
replace invalid PASS (got 0)
large dump DEBUG: msg->len = 3112
PASS (got 0)
Selftests: 20/20 passed
Selftests passed
Running benchmark...
Run 1/5... done (311721.5 ops/sec)
Run 2/5... done (321045.7 ops/sec)
Run 3/5... done (336402.3 ops/sec)
Run 4/5... done (338419.7 ops/sec)
Run 5/5... done (316618.9 ops/sec)
Benchmark completed successfully
On Fri, Jan 30, 2026 at 3:48 PM Paul Moses <p@1g4.org> wrote:
>
> What version of act_gate.c are you currently testing?
I am running plain ubuntu on this machine using their shipped kernel 6.8.0.
But i did look at the latest kernel tree and the dumping code has not changed.
+Cc Po Liu who i believe added that code.
>Did you actually run the tests? “large dump” creates ONE action at base_index, with num_entries=100, then immediately does GETACTION. So “tc actions ls action gate | grep index | wc -l” won’t exercise this, because it only counts actions. It doesn’t amplify the per action dump size (the entry list does). It uses libmnl (mnl_socket_sendto / mnl_socket_recvfrom) with MNL_SOCKET_BUFFER_SIZE. There is no custom netlink handling. The failure is returned by the kernel before userspace parses anything. The dumps are transactional at the netlink level, but an individual action dump still has to fit in the skb backing that message.
Sorry - I am not running your code (didnt want to compile anything on
this machine), just plain tc and i have to admit I dont know much
about the mechanics or spec for gate, so my example is based on
something Po Liu posted, here's a script to add 100 entries:
---
for i in {1..100}; do
echo "$i"
tc actions add action gate clockid CLOCK_TAI \
  sched-entry open 200000000 -1 8000000 \
  sched-entry close 100000000 -1 -1
done
---
Then dumping:
$ sudo tc actions ls action gate | grep index
index 1 ref 1 bind 0
index 2 ref 1 bind 0
index 3 ref 1 bind 0
index 4 ref 1 bind 0
index 5 ref 1 bind 0
index 6 ref 1 bind 0
..
...
....
index 95 ref 1 bind 0
index 96 ref 1 bind 0
index 97 ref 1 bind 0
index 98 ref 1 bind 0
index 99 ref 1 bind 0
index 100 ref 1 bind 0
$
>
> look at af_netlink.c
> /* NLMSG_GOODSIZE is small to avoid high order allocations being
> * required, but it makes sense to _attempt_ a 32KiB allocation
> * to reduce number of system calls on dump operations, if user
> * ever provided a big enough buffer.
> */
> ...
> /* Trim skb to allocated size. User is expected to provide buffer as
> * large as max(min_dump_alloc, 32KiB (max_recvmsg_len capped at
> * netlink_recvmsg())). dump will pack as many smaller messages as
> * could fit within the allocated skb. skb is typically allocated
> * with larger space than required (could be as much as near 2x the
> * requested size with align to next power of 2 approach). Allowing
> * dump to use the excess space makes it difficult for a user to have a
> * reasonable static buffer based on the expected largest dump of a
> * single netdev. The outcome is MSG_TRUNC error.
> */
>
> This is where I am currently, but I have seen these bugs appear throughout all my iterations, including what's in the tree right now. If you can show me better alternatives that solve my problems, I'll gladly accept them.
> https://github.com/torvalds/linux/compare/master...jopamo:linux:net-stable-upstream-v4
>
I dont see a problem with "dump" as you seem to be suggesting. I asked
earlier if it is possible that you can create some single entry - not
100 as shown above that will consume more than NLMSG_GOODSIZE? My
limited knowledge is not helping me see such a scenario.
I looked at the transaction of how the 100 entries are dumped and i
see the following:
$ sudo tc actions ls action gate | grep total
total acts 12
total acts 12
total acts 76
User space received batches of 12, 12, and last one was 76 before it
received an empty message with NLMSG_DONE.
cheers,
jamal
On Sat, Jan 31, 2026 at 11:51 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> I dont see a problem with "dump" as you seem to be suggesting. I asked
> earlier if it is possible that you can create some single entry - not
> 100 as shown above that will consume more than NLMSG_GOODSIZE? My
> limited knowledge is not helping me see such a scenario.
Aha. I think there is a terminology mixup ;->
"dump" (a very unfortunate use of that word in the netlink world ;->)
is a very special word. So when you take a dump in this world you are
GETing a whole table. In this case all the gate actions.
If i am not mistaken in your case this is not a dump - rather, you are
CREATing a single entry which is bigger than NLMSG_GOODSIZE as i
suspected. I dont believe iproute2 will allow you to do that.
What's happening then is that the generated netlink event notification
for that single entry is too big to fit in NLMSG_GOODSIZE.
Let me try to craft something for that...
cheers,
jamal
1. Your script creates 100 separate gate actions, not one gate action with a large schedule.
2. Each “tc actions add … gate …” call creates a new action, so you end up with 100 small actions.
3. The issue I am reporting needs one single gate action that contains many sched-entry objects.
4. Because of that, your test only exercises the dump path with many small actions.
5. The failure I see is in the GETACTION notify path, not in the generic dump batching logic.
6. In that path, tcf_get_notify() allocates a fixed-size skb using NLMSG_GOODSIZE.
7. The kernel then tries to serialize one action into that skb.
8. If a single action contains a large gate schedule, tca_get_fill() runs out of tailroom and fails, and the kernel returns -EINVAL.
9. A single sched-entry does not exceed NLMSG_GOODSIZE.
10. The problem is one action with many sched-entries, because the entire entry list is serialized into the payload of that one action.
11. The “total acts 12 / 12 / 76” output only shows how many small actions were packed into each dump batch.
12. It does not reflect the size of an individual action dump, and in your test each action is small.
13. To reproduce with tc, you need one tc invocation that adds many sched-entry attributes to the same gate action, and then run “tc actions get action gate index <idx>” on that action.
14. tc has its own limit at 1024 apparently: "addattr_l ERROR: message exceeded bound of 1024"
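The 1024 in item 14 is enforced by iproute2, not the kernel; paraphrased from iproute2's lib/libnetlink.c
(not an exact copy), addattr_l() refuses to grow the request past the maxlen the caller passed in, which is
the bound the error message reports:
---
int addattr_l(struct nlmsghdr *n, int maxlen, int type,
              const void *data, int alen)
{
        int len = RTA_LENGTH(alen);
        struct rtattr *rta;

        if (NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(len) > maxlen) {
                fprintf(stderr,
                        "addattr_l ERROR: message exceeded bound of %d\n",
                        maxlen);
                return -1;
        }
        rta = NLMSG_TAIL(n);
        rta->rta_type = type;
        rta->rta_len = len;
        if (alen)
                memcpy(RTA_DATA(rta), data, alen);
        n->nlmsg_len = NLMSG_ALIGN(n->nlmsg_len) + RTA_ALIGN(len);
        return 0;
}
---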
I'm not opposed to gate being clamped instead of adding support for large schedule sizes, but I wanted to thoroughly document why it's not possible so the next person isn't chasing a cryptic -EINVAL like I did.
Thanks
Paul
On Sat, Jan 31, 2026 at 12:18 PM Paul Moses <p@1g4.org> wrote:
>
> 1. Your script creates 100 separate gate actions, not one gate action with a large schedule.
> 2. Each “tc actions add … gate …” call creates a new action, so you end up with 100 small actions.
> 3. The issue I am reporting needs one single gate action that contains many sched-entry objects.
> 4. Because of that, your test only exercises the dump path with many small actions.
> 5. The failure I see is in the GETACTION notify path, not in the generic dump batching logic.
> 6. In that path, tcf_get_notify() allocates a fixed-size skb using NLMSG_GOODSIZE.
> 7. The kernel then tries to serialize one action into that skb.
> 8. If a single action contains a large gate schedule, tca_get_fill() runs out of tailroom and fails, and the kernel returns -EINVAL.
> 9. A single sched-entry does not exceed NLMSG_GOODSIZE.
> 10. The problem is one action with many sched-entries, because the entire entry list is serialized into the payload of that one action.
> 11. The “total acts 12 / 12 / 76” output only shows how many small actions were packed into each dump batch.
> 12. It does not reflect the size of an individual action dump, and in your test each action is small.
> 13. To reproduce with tc, you need one tc invocation that adds many sched-entry attributes to the same gate action, and then run “tc actions get action gate index <idx>” on that action.
> 14. tc has it's own limit at 1024 apparently "addattr_l ERROR: message exceeded bound of 1024"
>
Yes, thats the same error i was getting (with script below).
---
ENTRY="sched-entry open 200000000 -1 8000000 sched-entry close 100000000 -1 -1 "
SCHEDULE=$(printf "$ENTRY%.0s" {1..100})
#SCHEDULE=$(printf "$ENTRY%.0s" {1..10})
for i in {1..2}; do
echo "Iteration: $i"
tc actions add action gate clockid CLOCK_TAI $SCHEDULE
done
----
I know of no other action that exceeds this limit with all its params
batched, and of course tc in userspace truncates it to about 32.
Addition does succeed at 32 of those things per action.
I have no idea if above is legal but it is allowed by the system.
> I'm not opposed to gate being clamped instead of adding support for large schedule sizes, but I wanted to thoroughly document why it's not possible so the next person isn't chasing a cryptic -EINVAL like I did.
>
We cant have it to be infinite for sure - we will need to put an upper
bound in parse_gate_list().
Are you knowledgeable about this spec? I was Ccing Po Liu but his
email is bouncing (so i removed him).
So back to your first post: I agree we have an issue here. Your
solution will solve the event notifications but then we will need an
upper bound check. We will also need to check that same upper bound in
user space iproute2 code so we dont allow arbitrary values. Current
number of 16 seems to work just fine - if we agree that is a "good"
number (or if the specs dictate it is) then you can simply provide that
fix.
cheers,
jamal
The hardware manufacturers impose their own limits based on design constraints; it's not based on the spec. iproute2's value seems arbitrary: 1024 comes out to be about 32 entries, based on the message length of 3112 at 100 entries (this isn't counting overhead). Is page size ever less than 4k? May as well see what can safely fit into NLMSG_GOODSIZE at its lowest possible value.
With 4k page size, the failure point appears to be 93 entries:
large dump DEBUG: large dump msg_len=2904 cap=12288 entries=93 cycle_time=9304278
So bounding it at 64 entries or so (for now at least) would be a safe choice to maintain a margin and not impose arbitrarily low values.
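If we do clamp, here is a minimal sketch of what the bound in parse_gate_list() could look like; the
GATE_MAX_SCHED_ENTRIES name, the surrounding variable names, and the value 64 are assumptions from this
thread, not existing kernel code:
---
/* Hypothetical cap on sched-entries per gate action; name and value are
 * assumptions, not existing kernel code. */
#define GATE_MAX_SCHED_ENTRIES  64

        /* inside parse_gate_list(), while walking the nested entry list */
        nla_for_each_nested(n, list_attr, rem) {
                if (++num_entries > GATE_MAX_SCHED_ENTRIES) {
                        NL_SET_ERR_MSG(extack,
                                       "Too many gate schedule entries");
                        return -EINVAL;
                }
                /* existing per-entry parsing continues here */
        }
---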
Yes, I've wanted to talk to Po for a while now. :)
Thanks,
Paul
On Sun, Feb 1, 2026 at 4:57 AM Paul Moses <p@1g4.org> wrote:
>
> The hardware manufacturers impose their own limits based on design constraints, it's not based on the spec. iproute2's value seems arbitrary, 1024 comes out to be about 32 entries, based on the message length of 3112 at 100 entries (this isn't counting overhead). Is page size ever less than 4k? May as well see what can safely fit into NLMSG_GOODSIZE at it's lowest possible value.
>
> With 4k page size, the failure point appears to be 93 entries:
> large dump DEBUG: large dump msg_len=2904 cap=12288 entries=93 cycle_time=9304278
>
> So bounding it at 64 entries or so(for now at least) would be a safe choice to maintain a margin and not impose arbitrarily low values.
>
Why dont we pick some value that doesnt require changes to iproute2? Example 32.
> Yes, I've wanted to talk to Po for a while now. :)
>
There has to be someone else, vendor, etc who is invested in this..
That looks like magic valves to me that open/close - not sure why you
want to do it more than once.
cheers,
jamal
To be clear, as I said before, I spent months on this before I approached the list.
The gates are programmed by a controller and used to orchestrate deterministic traffic admission. This is not a simple open/close mechanism configured by humans.
I am moving closer to the IEEE spec, not further away from it.
Thanks
Paul
On Monday, February 2nd, 2026 at 8:33 AM, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
>
> On Sun, Feb 1, 2026 at 4:57 AM Paul Moses p@1g4.org wrote:
>
> > The hardware manufacturers impose their own limits based on design constraints, it's not based on the spec. iproute2's value seems arbitrary, 1024 comes out to be about 32 entries, based on the message length of 3112 at 100 entries (this isn't counting overhead). Is page size ever less than 4k? May as well see what can safely fit into NLMSG_GOODSIZE at it's lowest possible value.
> >
> > With 4k page size, the failure point appears to be 93 entries:
> > large dump DEBUG: large dump msg_len=2904 cap=12288 entries=93 cycle_time=9304278
> >
> > So bounding it at 64 entries or so(for now at least) would be a safe choice to maintain a margin and not impose arbitrarily low values.
>
>
> Why dont we pick some value that doesnt require changes to iproute2? Example 32.
>
> > Yes, I've wanted to talk to Po for a while now. :)
>
>
> There has to be someone else, vendor, etc who is invested in this..
> That looks like magic valves to me that open/close - not sure why you
> want to do it more than once.
>
> cheers,
> jamal
>
> > Thanks,
> > Paul
> >
> > On Saturday, January 31st, 2026 at 11:34 AM, Jamal Hadi Salim jhs@mojatatu.com wrote:
> >
> > > On Sat, Jan 31, 2026 at 12:18 PM Paul Moses p@1g4.org wrote:
> > >
> > > > 1. Your script creates 100 separate gate actions, not one gate action with a large schedule.
> > > > 2. Each “tc actions add … gate …” call creates a new action, so you end up with 100 small actions.
> > > > 3. The issue I am reporting needs one single gate action that contains many sched-entry objects.
> > > > 4. Because of that, your test only exercises the dump path with many small actions.
> > > > 5. The failure I see is in the GETACTION notify path, not in the generic dump batching logic.
> > > > 6. In that path, tcf_get_notify() allocates a fixed-size skb using NLMSG_GOODSIZE.
> > > > 7. The kernel then tries to serialize one action into that skb.
> > > > 8. If a single action contains a large gate schedule, tca_get_fill() runs out of tailroom and fails, and the kernel returns -EINVAL.
> > > > 9. A single sched-entry does not exceed NLMSG_GOODSIZE.
> > > > 10. The problem is one action with many sched-entries, because the entire entry list is serialized into the payload of that one action.
> > > > 11. The “total acts 12 / 12 / 76” output only shows how many small actions were packed into each dump batch.
> > > > 12. It does not reflect the size of an individual action dump, and in your test each action is small.
> > > > 13. To reproduce with tc, you need one tc invocation that adds many sched-entry attributes to the same gate action, and then run “tc actions get action gate index <idx>” on that action.
> > > > 14. tc has it's own limit at 1024 apparently "addattr_l ERROR: message exceeded bound of 1024"
> > >
> > > Yes, thats the same error i was getting (with script below).
> > > ---
> > > ENTRY="sched-entry open 200000000 -1 8000000 sched-entry close 100000000 -1 -1 "
> > > SCHEDULE=$(printf "$ENTRY%.0s" {1..100})
> > > #SCHEDULE=$(printf "$ENTRY%.0s" {1..10})
> > >
> > > for i in {1..2}; do
> > > echo "Iteration: $i"
> > > tc actions add action gate clockid CLOCK_TAI $SCHEDULE
> > > done
> > > ----
> > >
> > > I know of no other action that exceeds this limit with all its params
> > > batched, and of course tc in userspace truncates it to about 32.
> > > Addition does succeed at 32 of those things per action.
> > > I have no idea if above is legal but it is allowed by the system.
> > >
> > > > I'm not opposed to gate being clamped instead of adding support for large schedule sizes, but I wanted to thoroughly document why it's not possible so the next person isn't chasing a cryptic -EINVAL like I did.
> > >
> > > We cant have it to be infinite for sure - we will need to put an upper
> > > bound in parse_gate_list().
> > > Are you knowledgeable about this spec? I was Ccing Po Liu but his
> > > email is bouncing (so i removed him).
> > >
> > > So back to your first post: I agree we have an issue here. Your
> > > solution will solve the event notifications but then we will need an
> > > upper bound check. We will also need to check that same upper bound in
> > > user space iproute2 code so we dont allow arbitrary values. Current
> > > number of 16 seems to work just fine - if we agree that is a "good"
> > > number (or if the specs dicate it is) then you can simply provide that
> > > fix..
> > >
> > > cheers,
> > > jamal
> > >
> > > > Thanks
> > > > Paul
> > > >
> > > > On Saturday, January 31st, 2026 at 11:14 AM, Jamal Hadi Salim jhs@mojatatu.com wrote:
> > > >
> > > > > On Sat, Jan 31, 2026 at 11:51 AM Jamal Hadi Salim jhs@mojatatu.com wrote:
> > > > >
> > > > > > .
> > > > > >
> > > > > > On Fri, Jan 30, 2026 at 3:48 PM Paul Moses p@1g4.org wrote:
> > > > > >
> > > > > > > What version of act_gate.c are you currently testing?
> > > > > >
> > > > > > I am running plain ubuntu on this machine using their shipped kernel 6.8.0.
> > > > > > But i did look at the latest kernel tree and the dumping code has not changed.
> > > > > > +Cc Po Liu who i believe added that code.
> > > > > >
> > > > > > > Did you actually run the tests? “large dump” creates ONE action at base_index, with num_entries=100, then immediately does GETACTION. So “tc actions ls action gate | grep index | wc -l” won’t exercise this, because it only counts actions. It doesn’t amplify the per action dump size (the entry list does). It uses libmnl (mnl_socket_sendto / mnl_socket_recvfrom) with MNL_SOCKET_BUFFER_SIZE. There is no custom netlink handling. The failure is returned by the kernel before userspace parses anything. The dumps are transactional at the netlink level, but an individual action dump still has to fit in the skb backing that message.
> > > > > >
> > > > > > Sorry - I am not running your code (didnt want to compile anything on
> > > > > > this machine), just plain tc and i have to admit I dont know much
> > > > > > about the mechanics or spec for gate, so my example is based on
> > > > > > something Po Liu posted, here's a script to add 100 entries:
> > > > > > ---
> > > > > > for i in {1..100}; do
> > > > > > echo "$i"
> > > > > > tc actions add action gate clockid CLOCK_TAI sched-entry open
> > > > > > 200000000 -1 8000000 sched-entry close 100000000 -1 -1
> > > > > > done
> > > > > > ---
> > > > > >
> > > > > > Then dumping:
> > > > > >
> > > > > > $ sudo tc actions ls action gate | grep index
> > > > > > index 1 ref 1 bind 0
> > > > > > index 2 ref 1 bind 0
> > > > > > index 3 ref 1 bind 0
> > > > > > index 4 ref 1 bind 0
> > > > > > index 5 ref 1 bind 0
> > > > > > index 6 ref 1 bind 0
> > > > > > ..
> > > > > > ...
> > > > > > ....
> > > > > > index 95 ref 1 bind 0
> > > > > > index 96 ref 1 bind 0
> > > > > > index 97 ref 1 bind 0
> > > > > > index 98 ref 1 bind 0
> > > > > > index 99 ref 1 bind 0
> > > > > > index 100 ref 1 bind 0
> > > > > > $
> > > > > >
> > > > > > > look at af_netlink.c
> > > > > > > /* NLMSG_GOODSIZE is small to avoid high order allocations being
> > > > > > > * required, but it makes sense to attempt a 32KiB allocation
> > > > > > > * to reduce number of system calls on dump operations, if user
> > > > > > > * ever provided a big enough buffer.
> > > > > > > */
> > > > > > > ...
> > > > > > > /* Trim skb to allocated size. User is expected to provide buffer as
> > > > > > > * large as max(min_dump_alloc, 32KiB (max_recvmsg_len capped at
> > > > > > > * netlink_recvmsg())). dump will pack as many smaller messages as
> > > > > > > * could fit within the allocated skb. skb is typically allocated
> > > > > > > * with larger space than required (could be as much as near 2x the
> > > > > > > * requested size with align to next power of 2 approach). Allowing
> > > > > > > * dump to use the excess space makes it difficult for a user to have a
> > > > > > > * reasonable static buffer based on the expected largest dump of a
> > > > > > > * single netdev. The outcome is MSG_TRUNC error.
> > > > > > > */
> > > > > > >
> > > > > > > This is where I am currently but I have seen these bugs appear throughout all my iterations including what's in the tree currently, if you show me better alternatives that solve my problems, I'll gladly accept.
> > > > > > > https://github.com/torvalds/linux/compare/master...jopamo:linux:net-stable-upstream-v4
> > > > > >
> > > > > > I dont see a problem with "dump" as you seem to be suggesting. I asked
> > > > > > earlier if it is possible that you can create some single entry - not
> > > > > > 100 as shown above that will consume more than NLMSG_GOODSIZE? My
> > > > > > limited knowledge is not helping me see such a scenario.
> > > > >
> > > > > Aha. I think there is a terminology mixup ;->
> > > > >
> > > > > "dump" (a very unfortunate use of that word in the netlink world ;->)
> > > > >
> > > > > is a very special word. So when you take a dump in this world you are
> > > > > GETing a whole table. In this case all the gate actions.
> > > > >
> > > > > If i am not mistaken in your case this is not a dump - rather, you are
> > > > > CREATing a single entry which is bigger than NLMSG_GOODSIZE as i
> > > > > suspected. I dont believe iproute2 will allow you to do that.
> > > > > What's happening then is that the generated netlink event notification
> > > > > for that single entry is too big to fit in NLMSG_GOODSIZE.
> > > > > Let me try to craft something for that...
> > > > >
> > > > > cheers,
> > > > > jamal
Looks like pedit might also be affected. Hopefully this makes it clearer. I'm going to wait for more input before doing anything else with this.
NLMSG_GOODSIZE = SKB_WITH_OVERHEAD(min(PAGE_SIZE, 8192))
SKB_WITH_OVERHEAD(X) = X - SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
nla_total_size(payload) = NLA_ALIGN(NLA_HDRLEN + payload), with NLA_HDRLEN = 4 and 4 byte alignment
Per entry size for the gate list:
Each entry is a nested TCA_GATE_ONE_ENTRY plus five attributes:
TCA_GATE_ONE_ENTRY (nest, no payload) -> 4
INDEX (u32) -> 8
GATE (flag, no payload) -> 4
INTERVAL (u32) -> 8
MAX_OCTETS (s32) -> 8
IPV (s32) -> 8
So one entry is:
entry_sz = 4 + 8 + 4 + 8 + 8 + 8 = 40 bytes
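(For reference, the same per-entry arithmetic as a standalone C snippet; the NLA_* macros are re-declared locally so it compiles outside the kernel tree, and the gate attribute names appear only in comments.)
---
/* Re-derives entry_sz from nla_total_size(); the macros below mirror
 * the kernel's netlink attribute size helpers rather than including them.
 */
#include <stdio.h>

#define NLA_HDRLEN              4
#define NLA_ALIGN(len)          (((len) + 3) & ~3)
#define nla_total_size(payload) NLA_ALIGN(NLA_HDRLEN + (payload))

int main(void)
{
        int entry_sz = nla_total_size(0) +  /* TCA_GATE_ONE_ENTRY nest header */
                       nla_total_size(4) +  /* entry index (u32)              */
                       nla_total_size(0) +  /* gate state flag, no payload    */
                       nla_total_size(4) +  /* interval (u32)                 */
                       nla_total_size(4) +  /* max octets (s32)               */
                       nla_total_size(4);   /* ipv (s32)                      */

        printf("entry_sz = %d\n", entry_sz); /* prints 40 */
        return 0;
}
---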
Fixed overhead for one act_gate dump:
1. Action wrapper (RTM_GETACTION):
NLMSG_HDRLEN + sizeof(struct tcamsg) + nla_total_size(0)
= 16 + 4 + 4 = 24 bytes
2. Action shared attributes emitted by tcf_action_dump_1, baseline only
(no cookie, no HW stats, no flags):
TCA_ACT_KIND (IFNAMSIZ) = 20
TCA_ACT_STATS nest = 4
TCA_STATS_BASIC = 20
TCA_STATS_PKT64 = 12
TCA_STATS_QUEUE = 24
TCA_ACT_OPTIONS nest = 4
TCA_GACT_TM = 36
TCA_ACT_IN_HW_COUNT = 8
action number nest = 4
Total shared baseline = 156 bytes
Optional shared attributes, only if present:
TCA_ACT_HW_STATS = +12
TCA_ACT_USED_HW_STATS = +12
TCA_ACT_FLAGS = +12
TCA_ACT_COOKIE = +nla_total_size(cookie_len)
3. Gate specific attributes inside options, fixed part including TM:
TCA_GATE_PARMS = 24
BASE_TIME = 12
CYCLE_TIME = 12
CYCLE_TIME_EXT = 12
CLOCKID = 8
FLAGS = 8
PRIORITY = 8
ENTRY_LIST nest = 4
TCA_GATE_TM = 36
Total gate baseline = 124 bytes
4. 64 bit alignment padding, only when
!CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
There are 7 attributes that trigger the 64 bit padding:
- three stats blocks, three time values and the gate TM
- each adds 4 bytes, so add 28 bytes in that case
Putting it together:
fixed = 24 (wrapper) + 156 (shared baseline) + 124 (gate baseline)
fixed = 304 bytes
opt = nla_total_size(cookie_len)
+ 12 for each of HW_STATS, USED_HW_STATS and FLAGS if present
+ 28 if unaligned access padding is required
The maximum number of entries that fit in a single skb is:
Nmax = floor((NLMSG_GOODSIZE - fixed - opt) / 40)
If PAGE_SIZE = 4096 and sizeof(struct skb_shared_info) = 320:
NLMSG_GOODSIZE = 4096 - 320 = 3776
Nmax = floor((3776 - 304) / 40) = 86
If PAGE_SIZE >= 8192 (NLMSG_GOODSIZE caps at 8192):
NLMSG_GOODSIZE = 8192 - 320 = 7872
Nmax = floor((7872 - 304) / 40) = 189
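(And the Nmax arithmetic as a compilable snippet; 320 is the assumed SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) on a 64-bit build and 304 is the fixed baseline derived above, so both are inputs here rather than values read from the kernel.)
---
/* Recomputes Nmax from the numbers above.  The 320 byte shinfo overhead
 * and the 304 byte fixed baseline are assumptions carried over from the
 * breakdown, not something this snippet measures.
 */
#include <stdio.h>

static long nmax(long page_size, long shinfo, long fixed, long entry_sz)
{
        long goodsize = (page_size < 8192 ? page_size : 8192) - shinfo;

        return (goodsize - fixed) / entry_sz; /* integer division == floor here */
}

int main(void)
{
        printf("PAGE_SIZE 4096: Nmax = %ld\n", nmax(4096, 320, 304, 40)); /* 86  */
        printf("PAGE_SIZE 8192: Nmax = %ld\n", nmax(8192, 320, 304, 40)); /* 189 */
        return 0;
}
---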
Thanks,
Paul
On Monday, February 2nd, 2026 at 2:49 PM, Paul Moses <p@1g4.org> wrote:
>
>
> Want to be clear, as I said before, I spent months on this before I approached.
>
> The gates are programmed by a controller and used to orchestrate deterministic traffic admission. This is not a simple open/close mechanism configured by humans.
>
> I am moving closer to IEEE, not further away from it.
>
> Thanks
> Paul
>
>
> On Monday, February 2nd, 2026 at 8:33 AM, Jamal Hadi Salim jhs@mojatatu.com wrote:
>
> > On Sun, Feb 1, 2026 at 4:57 AM Paul Moses p@1g4.org wrote:
> >
> > > The hardware manufacturers impose their own limits based on design constraints; it's not based on the spec. iproute2's value seems arbitrary, 1024 comes out to be about 32 entries, based on the message length of 3112 at 100 entries (this isn't counting overhead). Is page size ever less than 4k? May as well see what can safely fit into NLMSG_GOODSIZE at its lowest possible value.
> > >
> > > With 4k page size, the failure point appears to be 93 entries:
> > > large dump DEBUG: large dump msg_len=2904 cap=12288 entries=93 cycle_time=9304278
> > >
> > > So bounding it at 64 entries or so (for now at least) would be a safe choice to maintain a margin and not impose arbitrarily low values.
> >
> > Why dont we pick some value that doesnt require changes to iproute2? Example 32.
> >
> > > Yes, I've wanted to talk to Po for a while now. :)
> >
> > There has to be someone else, vendor, etc who is invested in this..
> > That looks like magic valves to me that open/close - not sure why you
> > want to do it more than once.
> >
> > cheers,
> > jamal
> >
> > > Thanks,
> > > Paul
> > >
> > > On Saturday, January 31st, 2026 at 11:34 AM, Jamal Hadi Salim jhs@mojatatu.com wrote:
> > >
> > > > On Sat, Jan 31, 2026 at 12:18 PM Paul Moses p@1g4.org wrote:
> > > >
> > > > > 1. Your script creates 100 separate gate actions, not one gate action with a large schedule.
> > > > > 2. Each “tc actions add … gate …” call creates a new action, so you end up with 100 small actions.
> > > > > 3. The issue I am reporting needs one single gate action that contains many sched-entry objects.
> > > > > 4. Because of that, your test only exercises the dump path with many small actions.
> > > > > 5. The failure I see is in the GETACTION notify path, not in the generic dump batching logic.
> > > > > 6. In that path, tcf_get_notify() allocates a fixed-size skb using NLMSG_GOODSIZE.
> > > > > 7. The kernel then tries to serialize one action into that skb.
> > > > > 8. If a single action contains a large gate schedule, tca_get_fill() runs out of tailroom and fails, and the kernel returns -EINVAL.
> > > > > 9. A single sched-entry does not exceed NLMSG_GOODSIZE.
> > > > > 10. The problem is one action with many sched-entries, because the entire entry list is serialized into the payload of that one action.
> > > > > 11. The “total acts 12 / 12 / 76” output only shows how many small actions were packed into each dump batch.
> > > > > 12. It does not reflect the size of an individual action dump, and in your test each action is small.
> > > > > 13. To reproduce with tc, you need one tc invocation that adds many sched-entry attributes to the same gate action, and then run “tc actions get action gate index <idx>” on that action.
> > > > > 14. tc has its own limit at 1024 apparently: "addattr_l ERROR: message exceeded bound of 1024"
> > > >
> > > > Yes, thats the same error i was getting (with script below).
> > > > ---
> > > > ENTRY="sched-entry open 200000000 -1 8000000 sched-entry close 100000000 -1 -1 "
> > > > SCHEDULE=$(printf "$ENTRY%.0s" {1..100})
> > > > #SCHEDULE=$(printf "$ENTRY%.0s" {1..10})
> > > >
> > > > for i in {1..2}; do
> > > > echo "Iteration: $i"
> > > > tc actions add action gate clockid CLOCK_TAI $SCHEDULE
> > > > done
> > > > ----
> > > >
> > > > I know of no other action that exceeds this limit with all its params
> > > > batched, and of course tc in userspace truncates it to about 32.
> > > > Addition does succeed at 32 of those things per action.
> > > > I have no idea if above is legal but it is allowed by the system.
> > > >
> > > > > I'm not opposed to gate being clamped instead of adding support for large schedule sizes, but I wanted to thoroughly document why it's not possible so the next person isn't chasing a cryptic -EINVAL like I did.
> > > >
> > > > We cant have it to be infinite for sure - we will need to put an upper
> > > > bound in parse_gate_list().
> > > > Are you knowledgeable about this spec? I was Ccing Po Liu but his
> > > > email is bouncing (so i removed him).
> > > >
> > > > So back to your first post: I agree we have an issue here. Your
> > > > solution will solve the event notifications but then we will need an
> > > > upper bound check. We will also need to check that same upper bound in
> > > > user space iproute2 code so we dont allow arbitrary values. Current
> > > > number of 16 seems to work just fine - if we agree that is a "good"
> > > > number (or if the specs dictate it is) then you can simply provide that
> > > > fix..
> > > >
> > > > cheers,
> > > > jamal
> > > >
> > > > > Thanks
> > > > > Paul
> > > > >
> > > > > On Saturday, January 31st, 2026 at 11:14 AM, Jamal Hadi Salim jhs@mojatatu.com wrote:
> > > > >
> > > > > > On Sat, Jan 31, 2026 at 11:51 AM Jamal Hadi Salim jhs@mojatatu.com wrote:
> > > > > >
> > > > > > > .
> > > > > > >
> > > > > > > On Fri, Jan 30, 2026 at 3:48 PM Paul Moses p@1g4.org wrote:
> > > > > > >
> > > > > > > > What version of act_gate.c are you currently testing?
> > > > > > >
> > > > > > > I am running plain ubuntu on this machine using their shipped kernel 6.8.0.
> > > > > > > But i did look at the latest kernel tree and the dumping code has not changed.
> > > > > > > +Cc Po Liu who i believe added that code.
> > > > > > >
> > > > > > > > Did you actually run the tests? “large dump” creates ONE action at base_index, with num_entries=100, then immediately does GETACTION. So “tc actions ls action gate | grep index | wc -l” won’t exercise this, because it only counts actions. It doesn’t amplify the per action dump size (the entry list does). It uses libmnl (mnl_socket_sendto / mnl_socket_recvfrom) with MNL_SOCKET_BUFFER_SIZE. There is no custom netlink handling. The failure is returned by the kernel before userspace parses anything. The dumps are transactional at the netlink level, but an individual action dump still has to fit in the skb backing that message.
> > > > > > >
> > > > > > > Sorry - I am not running your code (didnt want to compile anything on
> > > > > > > this machine), just plain tc and i have to admit I dont know much
> > > > > > > about the mechanics or spec for gate, so my example is based on
> > > > > > > something Po Liu posted, here's a script to add 100 entries:
> > > > > > > ---
> > > > > > > for i in {1..100}; do
> > > > > > > echo "$i"
> > > > > > > tc actions add action gate clockid CLOCK_TAI sched-entry open
> > > > > > > 200000000 -1 8000000 sched-entry close 100000000 -1 -1
> > > > > > > done
> > > > > > > ---
> > > > > > >
> > > > > > > Then dumping:
> > > > > > >
> > > > > > > $ sudo tc actions ls action gate | grep index
> > > > > > > index 1 ref 1 bind 0
> > > > > > > index 2 ref 1 bind 0
> > > > > > > index 3 ref 1 bind 0
> > > > > > > index 4 ref 1 bind 0
> > > > > > > index 5 ref 1 bind 0
> > > > > > > index 6 ref 1 bind 0
> > > > > > > ..
> > > > > > > ...
> > > > > > > ....
> > > > > > > index 95 ref 1 bind 0
> > > > > > > index 96 ref 1 bind 0
> > > > > > > index 97 ref 1 bind 0
> > > > > > > index 98 ref 1 bind 0
> > > > > > > index 99 ref 1 bind 0
> > > > > > > index 100 ref 1 bind 0
> > > > > > > $
> > > > > > >
> > > > > > > > look at af_netlink.c
> > > > > > > > /* NLMSG_GOODSIZE is small to avoid high order allocations being
> > > > > > > > * required, but it makes sense to attempt a 32KiB allocation
> > > > > > > > * to reduce number of system calls on dump operations, if user
> > > > > > > > * ever provided a big enough buffer.
> > > > > > > > */
> > > > > > > > ...
> > > > > > > > /* Trim skb to allocated size. User is expected to provide buffer as
> > > > > > > > * large as max(min_dump_alloc, 32KiB (max_recvmsg_len capped at
> > > > > > > > * netlink_recvmsg())). dump will pack as many smaller messages as
> > > > > > > > * could fit within the allocated skb. skb is typically allocated
> > > > > > > > * with larger space than required (could be as much as near 2x the
> > > > > > > > * requested size with align to next power of 2 approach). Allowing
> > > > > > > > * dump to use the excess space makes it difficult for a user to have a
> > > > > > > > * reasonable static buffer based on the expected largest dump of a
> > > > > > > > * single netdev. The outcome is MSG_TRUNC error.
> > > > > > > > */
> > > > > > > >
> > > > > > > > This is where I am currently but I have seen these bugs appear throughout all my iterations including what's in the tree currently, if you show me better alternatives that solve my problems, I'll gladly accept.
> > > > > > > > https://github.com/torvalds/linux/compare/master...jopamo:linux:net-stable-upstream-v4
> > > > > > >
> > > > > > > I dont see a problem with "dump" as you seem to be suggesting. I asked
> > > > > > > earlier if it is possible that you can create some single entry - not
> > > > > > > 100 as shown above that will consume more than NLMSG_GOODSIZE? My
> > > > > > > limited knowledge is not helping me see such a scenario.
> > > > > >
> > > > > > Aha. I think there is a terminology mixup ;->
> > > > > >
> > > > > > "dump" (a very unfortunate use of that word in the netlink world ;->)
> > > > > >
> > > > > > is a very special word. So when you take a dump in this world you are
> > > > > > GETing a whole table. In this case all the gate actions.
> > > > > >
> > > > > > If i am not mistaken in your case this is not a dump - rather, you are
> > > > > > CREATing a single entry which is bigger than NLMSG_GOODSIZE as i
> > > > > > suspected. I dont believe iproute2 will allow you to do that.
> > > > > > What's happening then is that the generated netlink event notification
> > > > > > for that single entry is too big to fit in NLMSG_GOODSIZE.
> > > > > > Let me try to craft something for that...
> > > > > >
> > > > > > cheers,
> > > > > > jamal
On Thu, Feb 5, 2026 at 10:13 AM Paul Moses <p@1g4.org> wrote:
>
> Looks like pedit might also be affected. Hopefully this makes it clearer. I'm going to wait for more input before doing anything else with this.
>
> [...]
>
> The maximum number of entries that fit in a single skb is:
>
> Nmax = floor((NLMSG_GOODSIZE - fixed - opt) / 40)
>
> If PAGE_SIZE = 4096 and sizeof(struct skb_shared_info) = 320:
>
> NLMSG_GOODSIZE = 4096 - 320 = 3776
> Nmax = floor((3776 - 304) / 40) = 86
>
> If PAGE_SIZE >= 8192 (NLMSG_GOODSIZE caps at 8192):
>
> NLMSG_GOODSIZE = 8192 - 320 = 7872
> Nmax = floor((7872 - 304) / 40) = 189
>

Seems arbitrary and I was hoping you dont have to change iproute2
which restricts the total size to 1KB.
Earlier, unless i misread, you said you are looking at IEEE - what
does the spec say?
If i am not mistaken, the spec is IEEE 802.1Qbv which unfortunately
is behind a paywall.
The closest i could find was a vendor talking about it here:
https://onlinedocs.microchip.com/oxy/GUID-82119957-1E11-4B69-84AC-EF0EA08F5595-en-US-5/GUID-7E7509A4-351E-4D82-8266-967681BA2644.html

And they seem to indicate you can only have _one_ off and one timer
per queue, for a max of 8 queues.

Since Po is AWOL, +Cc the taprio folks (Vinicius, Vladimir).

cheers,
jamal
On Thu, Feb 05, 2026 at 02:23:00PM -0500, Jamal Hadi Salim wrote:
> On Thu, Feb 5, 2026 at 10:13 AM Paul Moses <p@1g4.org> wrote:
> >
> > [...]
>
> Seems arbitrary and I was hoping you dont have to change iproute2
> which restricts the total size to 1KB.
> Earlier, unless i misread, you said you are looking at IEEE - what
> does the spec say?
> If i am not mistaken, the spec is IEEE 802.1Qbv which unfortunately
> is behind a paywall.
> The closest i could find was a vendor talking about it here:
> https://onlinedocs.microchip.com/oxy/GUID-82119957-1E11-4B69-84AC-EF0EA08F5595-en-US-5/GUID-7E7509A4-351E-4D82-8266-967681BA2644.html
>
> And they seem to indicate you can only have _one_ off and one timer
> per queue, for a max of 8 queues.
>
> Since Po is AWOL, +Cc the taprio folks (Vinicius, Vladimir).
>
> cheers,
> jamal

Sorry, I haven't been following this thread, I don't know what the
question to me is?

The tc-gate action corresponds to a feature which can be identified by
the "stream gate" keyword in standard IEEE 802.1Q (-2018 or later).
It is a sub-function of clause 8.6.5.1 Per-stream filtering and policing
(PSFP).

This is different from what you reference above as taprio / IEEE 802.1Qbv
(old/obsolete name for workgroup which later became merged into standard
802.1Q as clause 8.6.8.4 Enhancements for scheduled traffic).

The tc-gate is not defined per queue, but rather a standalone object
that streams (tc filters) point to. The schedule (or "gate control list")
size, translatable into the number of TCA_GATE_ONE_ENTRY elements, is
arbitrary as far as the standard is concerned.

We at NXP have hardware today which supports up to 256 gates in a single
stream gate control list.

I'm not sure I understand the reference to the [number of] timers.
On Thu, Feb 5, 2026 at 3:36 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>
> On Thu, Feb 05, 2026 at 02:23:00PM -0500, Jamal Hadi Salim wrote:
> > [...]
>
> Sorry, I haven't been following this thread, I don't know what the
> question to me is?
>
> The tc-gate action corresponds to a feature which can be identified by
> the "stream gate" keyword in standard IEEE 802.1Q (-2018 or later).
> It is a sub-function of clause 8.6.5.1 Per-stream filtering and policing
> (PSFP).
>
> This is different from what you reference above as taprio / IEEE 802.1Qbv
> (old/obsolete name for workgroup which later became merged into standard
> 802.1Q as clause 8.6.8.4 Enhancements for scheduled traffic).
>
> The tc-gate is not defined per queue, but rather a standalone object
> that streams (tc filters) point to. The schedule (or "gate control list")
> size, translatable into the number of TCA_GATE_ONE_ENTRY elements, is
> arbitrary as far as the standard is concerned.
>
> We at NXP have hardware today which supports up to 256 gates in a single
> stream gate control list.

Yes, this kinda answers the question: we are looking for something
that serves as an upper bound for the control list.
Does the standard explicitly specify that it is arbitrary - or is that
deduced by lack of mention of an upper bound.
Either way imo we need to have a "reasonable" upper bound in the code.

cheers,
jamal

> I'm not sure I understand the reference to the [number of] timers.
On Thu, Feb 05, 2026 at 04:30:06PM -0500, Jamal Hadi Salim wrote:
> Yes, this kinda answers the question: we are looking for something
> that serves as an upper bound for the control list.
> Does the standard explicitly specify that it is arbitrary - or is that
> deduced by lack of mention of an upper bound.
> Either way imo we need to have a "reasonable" upper bound in the code.
>
> cheers,
> jamal

It doesn't specifically use the word "arbitrary" but it describes a
mechanism to indicate what the arbitrarily chosen upper bound is, if
there is one.

Specifically, clause 12.31.1.4 talks of a managed object for PSFP called
SupportedListMax. This is supposed to report the maximum values that the
AdminControlListLength and OperControlListLength parameters can hold in
this particular implementation.

There is no intrinsic or universally reasonable limit on their count.
It depends on the required schedule complexity.
On Thu, Feb 5, 2026 at 5:12 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>
> On Thu, Feb 05, 2026 at 04:30:06PM -0500, Jamal Hadi Salim wrote:
> > Yes, this kinda answers the question: we are looking for something
> > that serves as an upper bound for the control list.
> > Does the standard explicitly specify that it is arbitrary - or is that
> > deduced by lack of mention of an upper bound.
> > Either way imo we need to have a "reasonable" upper bound in the code.
> >
> > cheers,
> > jamal
>
> It doesn't specifically use the word "arbitrary" but it describes a
> mechanism to indicate what the arbitrarily chosen upper bound is, if
> there is one.
>
> Specifically, clause 12.31.1.4 talks of a managed object for PSFP called
> SupportedListMax. This is supposed to report the maximum values that the
> AdminControlListLength and OperControlListLength parameters can hold in
> this particular implementation.
>

Very helpful details - Thanks Vladimir.

Paul, maybe a nice number like 512 for something analogous to
AdminControlListLength? The analogous OperControlListLength can be
derived from counting the list elements.

cheers,
jamal

> There is no intrinsic or universally reasonable limit on their count.
> It depends on the required schedule complexity.
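For illustration only, here is a minimal sketch of the kind of upper-bound check being discussed for parse_gate_list() in act_gate.c. The GATE_LIST_LEN_MAX name and the value 512 are placeholders taken from this thread, not code that exists today:
---
/* Sketch of a cap on the number of TCA_GATE_ONE_ENTRY elements, roughly
 * analogous to the standard's SupportedListMax.  Name and value are
 * placeholders for discussion.
 */
#define GATE_LIST_LEN_MAX	512

static int gate_list_len_check(struct nlattr *list_attr,
			       struct netlink_ext_ack *extack)
{
	struct nlattr *attr;
	int rem, n = 0;

	nla_for_each_nested(attr, list_attr, rem) {
		if (nla_type(attr) != TCA_GATE_ONE_ENTRY)
			continue;
		if (++n > GATE_LIST_LEN_MAX) {
			NL_SET_ERR_MSG(extack, "Gate control list too long");
			return -EINVAL;
		}
	}

	return 0;
}
---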