[PATCH bpf-next v2 0/8] Support defragmenting IPv(4|6) packets in BPF

Daniel Xu posted 8 patches 2 years, 6 months ago
There is a newer version of this series
[PATCH bpf-next v2 0/8] Support defragmenting IPv(4|6) packets in BPF
Posted by Daniel Xu 2 years, 6 months ago
=== Context ===

In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment, which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which is very nice:

1. Enforce policy on first fragment and accept all subsequent fragments.
   This works but may let in certain attacks or allow data exfiltration.

2. Enforce policy on first fragment and drop all subsequent fragments.
   This does not really work because some protocols rely on
   fragmentation. For example, DNS may rely on oversized UDP packets for
   large responses.

So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:

    Middleboxes [...] should process IP fragments in a manner that is
    consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
    must maintain state in order to achieve this goal.

=== BPF related bits ===

However, when policy is enforced through BPF, the prog is run before the
kernel reassembles fragmented packets. This leaves BPF developers in an
awkward place: implement reassembly (possibly poorly) or use a stateless
method as described above.

Fortunately, the kernel has robust support for fragmented IP packets.
This patchset wraps the existing defragmentation facilities in kfuncs so
that BPF progs running on middleboxes can reassemble fragmented packets
before applying policy.
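
For illustration, a TC prog might use the kfunc roughly like this. This is
a sketch only: the kfunc name comes from this series, but the signature,
the netns argument, and the return convention shown here are placeholders,
not the definitions from the patches.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define TC_ACT_OK 0

/* Assumed signature, for illustration only. */
extern int bpf_ip_check_defrag(struct __sk_buff *ctx, u64 netns) __ksym;

SEC("tc")
int ingress_policy(struct __sk_buff *skb)
{
	/* Hand fragments to the kernel reassembler before enforcing policy. */
	if (bpf_ip_check_defrag(skb, 0 /* assumed: 0 == current netns */) != 0)
		return TC_ACT_OK; /* fragment queued; full datagram not ready yet */

	/* skb now carries the reassembled datagram, so the full 5-tuple is
	 * visible; the usual policy checks go here.
	 */
	return TC_ACT_OK;
}

char LICENSE[] SEC("license") = "GPL";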

=== Patchset details ===

This patchset is (hopefully) relatively straightforward from a BPF perspective.
One thing I'd like to call out is the skb_copy()ing of the prog skb. I
did this to maintain the invariant that the ctx remains valid after the
prog has run. This is relevant because ip_defrag() and ip_check_defrag()
may consume the skb if the skb is a fragment.
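
Roughly, the kfunc follows a copy-then-defrag pattern like the simplified
sketch below. This is not the literal patch code: prog_skb just names the
skb backing the prog's ctx, and IP_DEFRAG_BPF is a placeholder for
whatever defrag user the series actually registers.

#include <linux/skbuff.h>
#include <net/ip.h>

static int defrag_sketch(struct sk_buff *prog_skb)
{
	struct sk_buff *copy;

	/* Work on a copy so the prog's ctx skb stays valid even when the
	 * reassembler consumes a fragment by queueing it.
	 */
	copy = skb_copy(prog_skb, GFP_ATOMIC);
	if (!copy)
		return -ENOMEM;

	/* ip_check_defrag() may consume the skb it is given, which is why
	 * it gets the copy rather than prog_skb itself.
	 */
	copy = ip_check_defrag(dev_net(prog_skb->dev), copy, IP_DEFRAG_BPF);
	if (!copy)
		return 0;	/* fragment queued; no full datagram yet */

	/* copy now holds the reassembled datagram and can be handed back
	 * to the prog's ctx.
	 */
	return 0;
}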

Originally I did play around with teaching the verifier about kfuncs
that may consume the ctx and disallowing ctx accesses in ret != 0
branches. It worked ok, but it seemed too complex to modify the
surrounding assumptions about ctx validity.

[0]: https://datatracker.ietf.org/doc/html/rfc8900

===

Changes from v1:
* Add support for ipv6 defragmentation


Daniel Xu (8):
  ip: frags: Return actual error codes from ip_check_defrag()
  bpf: verifier: Support KF_CHANGES_PKT flag
  bpf, net, frags: Add bpf_ip_check_defrag() kfunc
  net: ipv6: Factor ipv6_frag_rcv() to take netns and user
  bpf: net: ipv6: Add bpf_ipv6_frag_rcv() kfunc
  bpf: selftests: Support not connecting client socket
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Add defrag selftests

 Documentation/bpf/kfuncs.rst                  |   7 +
 drivers/net/macvlan.c                         |   2 +-
 include/linux/btf.h                           |   1 +
 include/net/ip.h                              |  11 +
 include/net/ipv6.h                            |   1 +
 include/net/ipv6_frag.h                       |   1 +
 include/net/transp_v6.h                       |   1 +
 kernel/bpf/verifier.c                         |   8 +
 net/ipv4/Makefile                             |   1 +
 net/ipv4/ip_fragment.c                        |  15 +-
 net/ipv4/ip_fragment_bpf.c                    |  98 ++++++
 net/ipv6/Makefile                             |   1 +
 net/ipv6/af_inet6.c                           |   4 +
 net/ipv6/reassembly.c                         |  16 +-
 net/ipv6/reassembly_bpf.c                     | 143 ++++++++
 net/packet/af_packet.c                        |   2 +-
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/generate_udp_fragments.py   |  90 +++++
 .../selftests/bpf/ip_check_defrag_frags.h     |  57 +++
 tools/testing/selftests/bpf/network_helpers.c |  26 +-
 tools/testing/selftests/bpf/network_helpers.h |   3 +
 .../bpf/prog_tests/ip_check_defrag.c          | 327 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/ip_check_defrag.c     | 133 +++++++
 24 files changed, 931 insertions(+), 21 deletions(-)
 create mode 100644 net/ipv4/ip_fragment_bpf.c
 create mode 100644 net/ipv6/reassembly_bpf.c
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c

-- 
2.39.1
Re: [PATCH bpf-next v2 0/8] Support defragmenting IPv(4|6) packets in BPF
Posted by Alexei Starovoitov 2 years, 6 months ago
On Mon, Feb 27, 2023 at 12:51:02PM -0700, Daniel Xu wrote:
> === Context ===
> 
> In the context of a middlebox, fragmented packets are tricky to handle.
> The full 5-tuple of a packet is often only available in the first
> fragment which makes enforcing consistent policy difficult. There are
> really only two stateless options, neither of which are very nice:
> 
> 1. Enforce policy on first fragment and accept all subsequent fragments.
>    This works but may let in certain attacks or allow data exfiltration.
> 
> 2. Enforce policy on first fragment and drop all subsequent fragments.
>    This does not really work b/c some protocols may rely on
>    fragmentation. For example, DNS may rely on oversized UDP packets for
>    large responses.
> 
> So stateful tracking is the only sane option. RFC 8900 [0] calls this
> out as well in section 6.3:
> 
>     Middleboxes [...] should process IP fragments in a manner that is
>     consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
>     must maintain state in order to achieve this goal.
> 
> === BPF related bits ===
> 
> However, when policy is enforced through BPF, the prog is run before the
> kernel reassembles fragmented packets. This leaves BPF developers in a
> awkward place: implement reassembly (possibly poorly) or use a stateless
> method as described above.
> 
> Fortunately, the kernel has robust support for fragmented IP packets.
> This patchset wraps the existing defragmentation facilities in kfuncs so
> that BPF progs running on middleboxes can reassemble fragmented packets
> before applying policy.
> 
> === Patchset details ===
> 
> This patchset is (hopefully) relatively straightforward from BPF perspective.
> One thing I'd like to call out is the skb_copy()ing of the prog skb. I
> did this to maintain the invariant that the ctx remains valid after prog
> has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
> consume the skb if the skb is a fragment.

Instead of doing all that with extra skb copy can you hook bpf prog after
the networking stack already handled ip defrag?
What kind of middle box are you doing? Why does it have to run at TC layer?
Re: [PATCH bpf-next v2 0/8] Support defragmenting IPv(4|6) packets in BPF
Posted by Edward Cree 2 years, 6 months ago
On 27/02/2023 19:51, Daniel Xu wrote:
> However, when policy is enforced through BPF, the prog is run before the
> kernel reassembles fragmented packets. This leaves BPF developers in a
> awkward place: implement reassembly (possibly poorly) or use a stateless
> method as described above.

Just out of curiosity - what stops BPF progs using the middle ground of
 stateful validation?  I'm thinking of something like:
First-frag: run the usual checks on L4 headers etc, if we PASS then save
 IPID and maybe expected next frag-offset into a map.  But don't try to
 stash the packet contents anywhere for later reassembly, just PASS it.
Subsequent frags: look up the IPID in the map.  If we find it, validate
 and update the frag-offset in the map; if this is the last fragment then
 delete the map entry.  If the frag-offset was bogus or the IPID wasn't
 found in the map, DROP; otherwise PASS.
(If re-ordering is prevalent then use something more sophisticated than
 just expected next frag-offset, but the principle is the same. And of
 course you might want to put in timers for expiry etc.)
So this avoids the need to stash the packet data and modify/consume SKBs,
 because you're not actually doing reassembly; the down-side is that the
 BPF program can't so easily make decisions about the application-layer
 contents of the fragmented datagram, but for the common case (we just
 care about the 5-tuple) it's simple enough.
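In BPF terms the bookkeeping might look roughly like the untested sketch
 below; key layout, map sizing and expiry are all hand-waved, and the
 first-fragment path that inserts entries after the L4 checks is elided:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define TC_ACT_OK   0
#define TC_ACT_SHOT 2

struct frag_key {
	__be32 saddr;
	__be32 daddr;
	__be16 ipid;
	__u8   proto;
};

struct frag_state {
	__u16 next_off;		/* expected next fragment offset, in bytes */
};

struct {
	__uint(type, BPF_MAP_TYPE_LRU_HASH);
	__uint(max_entries, 16384);
	__type(key, struct frag_key);
	__type(value, struct frag_state);
} frag_map SEC(".maps");

/* Verdict for a non-first fragment; the first fragment is validated
 * separately and inserted into frag_map with next_off set.
 */
static __always_inline int check_frag(struct frag_key *key, __u16 frag_off,
				      __u16 payload_len, bool last_frag)
{
	struct frag_state *st = bpf_map_lookup_elem(&frag_map, key);

	if (!st)
		return TC_ACT_SHOT;	/* first fragment never passed validation */
	if (frag_off != st->next_off)
		return TC_ACT_SHOT;	/* bogus (or re-ordered) offset */
	if (last_frag)
		bpf_map_delete_elem(&frag_map, key);
	else
		st->next_off = frag_off + payload_len;
	return TC_ACT_OK;
}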
But I haven't actually tried it, so maybe there's some obvious reason why
 it can't work this way.

-ed
Re: [PATCH bpf-next v2 0/8] Support defragmenting IPv(4|6) packets in BPF
Posted by Daniel Xu 2 years, 6 months ago
Hi Ed,

Thanks for giving this a look.

On Mon, Feb 27, 2023 at 08:38:41PM +0000, Edward Cree wrote:
> On 27/02/2023 19:51, Daniel Xu wrote:
> > However, when policy is enforced through BPF, the prog is run before the
> > kernel reassembles fragmented packets. This leaves BPF developers in a
> > awkward place: implement reassembly (possibly poorly) or use a stateless
> > method as described above.
> 
> Just out of curiosity - what stops BPF progs using the middle ground of
>  stateful validation?  I'm thinking of something like:
> First-frag: run the usual checks on L4 headers etc, if we PASS then save
>  IPID and maybe expected next frag-offset into a map.  But don't try to
>  stash the packet contents anywhere for later reassembly, just PASS it.
> Subsequent frags: look up the IPID in the map.  If we find it, validate
>  and update the frag-offset in the map; if this is the last fragment then
>  delete the map entry.  If the frag-offset was bogus or the IPID wasn't
>  found in the map, DROP; otherwise PASS.
> (If re-ordering is prevalent then use something more sophisticated than
>  just expected next frag-offset, but the principle is the same. And of
>  course you might want to put in timers for expiry etc.)
> So this avoids the need to stash the packet data and modify/consume SKBs,
>  because you're not actually doing reassembly; the down-side is that the
>  BPF program can't so easily make decisions about the application-layer
>  contents of the fragmented datagram, but for the common case (we just
>  care about the 5-tuple) it's simple enough.
> But I haven't actually tried it, so maybe there's some obvious reason why
>  it can't work this way.

I don't believe full L4 headers are required in the first fragment.
Sufficiently sneaky attackers can, I think, send a byte at a time to
subvert your proposed algorithm. Storing skb data seems inevitable here.
Someone can correct me if I'm wrong.

Reordering like you mentioned is another attack vector. Perhaps there
are more sophisticated semi-stateful algorithms that can solve the
problem, but it leads me to my next point.

A semi-stateful method like you are proposing is concerning to me from a
reliability and correctness standpoint. Such a method can suffer from
impedance mismatches with the rest of the system. For example, whatever
map sizes you choose should probably be aligned with the conntrack sysctl
values, otherwise you may get some very interesting and unexpected packet
drops. I think Cilium had a talk a while ago about debugging a related
conntrack issue in the same vein. Furthermore, the debugging and
troubleshooting facilities will be different (counters, logs, etc.).

Unless someone has had lots of experience writing an IP stack from
the ground up, I suspect there are quite a few more unknown unknowns
here. What I find valuable about this patch series is that we can
leverage the well understood and battle hardened kernel facilities. So
avoid all the correctness and security issues that the kernel has spent
20+ years fixing. And make it trivial for the next person that comes
along to do the right thing.

Hopefully this all makes sense.

Thanks,
Daniel
Re: [PATCH bpf-next v2 0/8] Support defragmenting IPv(4|6) packets in BPF
Posted by Edward Cree 2 years, 6 months ago
On 27/02/2023 22:04, Daniel Xu wrote:
> I don't believe full L4 headers are required in the first fragment.
> Sufficiently sneaky attackers can, I think, send a byte at a time to
> subvert your proposed algorithm. Storing skb data seems inevitable here.
> Someone can correct me if I'm wrong here.

My thinking was that legitimate traffic would never do this and thus if
 your first fragment doesn't have enough data to make a determination
 then you just DROP the packet.

> What I find valuable about this patch series is that we can
> leverage the well understood and battle hardened kernel facilities. So
> avoid all the correctness and security issues that the kernel has spent
> 20+ years fixing.

I can certainly see the argument here.  I guess it's a question of are
 you more worried about the DoS from tricking the validator into thinking
 good fragments are bad (the reverse is irrelevant because if you can
 trick a validator into thinking your bad fragment belongs to a previously
 seen good packet, then you can equally trick a reassembler into stitching
 your bad fragment into that packet), or are you more worried about the
 DoS from tying lots of memory down in the reassembly cache.
Even with reordering handling, a data structure to record which ranges of
 a packet have been seen takes much less memory than storing the complete
 fragment bodies.  (Just a simple bitmap of 8-byte blocks — the resolution
 of iph->frag_off — reduces size by a factor of 64, not counting all the
 overhead of a struct sk_buff for each fragment in the queue.  Or you
 could re-use the rbtree-based code from the reassembler, just with a
 freshly allocated node containing only offset & length, instead of the
 whole SKB.)
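To put rough numbers on it (illustrative only, not tested code):

/* An IPv4 datagram is at most 64 KiB and frag_off has 8-byte resolution,
 * so one bit per 8-byte block is 65536 / 8 = 8192 bits, i.e. a 1 KiB
 * bitmap per in-flight datagram, versus buffering up to 64 KiB of
 * fragment payload (plus one struct sk_buff of overhead per fragment)
 * in a real reassembly queue.
 */
struct seen_blocks {
	unsigned long bits[8192 / (8 * sizeof(unsigned long))];	/* 1024 bytes */
};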
And having a BPF helper effectively consume the skb is awkward, as you
 noted; someone is likely to decide that skb_copy() is too slow, try to
 add ctx invalidation, and thereby create a whole new swathe of potential
 correctness and security issues.
Plus, imagine trying to support this in a hardware-offload XDP device.
 They'd have to reimplement the entire frag cache, which is a much bigger
 attack surface than just a frag validator, and they couldn't leverage
 the battle-hardened kernel implementation.

> And make it trivial for the next person that comes
> along to do the right thing.

Fwiw the validator approach could *also* be a helper, it doesn't have to
 be something the BPF developer writes for themselves.

But if after thinking about the possibility you still prefer your way, I
 won't try to stop you — I just wanted to ensure it had been considered.

-ed
Re: [PATCH bpf-next v2 0/8] Support defragmenting IPv(4|6) packets in BPF
Posted by Daniel Xu 2 years, 6 months ago
Hi Ed,

Had some trouble with email yesterday (forgot to renew domain
registration) and this reply might not have made it out. Apologies
if it's a repost.

On Mon, Feb 27, 2023 at 10:58:47PM +0000, Edward Cree wrote:
> On 27/02/2023 22:04, Daniel Xu wrote:
> > I don't believe full L4 headers are required in the first fragment.
> > Sufficiently sneaky attackers can, I think, send a byte at a time to
> > subvert your proposed algorithm. Storing skb data seems inevitable here.
> > Someone can correct me if I'm wrong here.
> 
> My thinking was that legitimate traffic would never do this and thus if
>  your first fragment doesn't have enough data to make a determination
>  then you just DROP the packet.

Right, that would be practical. I had some discussion with coworkers, and
the other option on the table is to drop all fragments. At least for us
in the cloud, fragments are heavily frowned upon (where are they not?)
anyway.

> > What I find valuable about this patch series is that we can
> > leverage the well understood and battle hardened kernel facilities. So
> > avoid all the correctness and security issues that the kernel has spent
> > 20+ years fixing.
> 
> I can certainly see the argument here.  I guess it's a question of are
>  you more worried about the DoS from tricking the validator into thinking
>  good fragments are bad (the reverse is irrelevant because if you can
>  trick a validator into thinking your bad fragment belongs to a previously
>  seen good packet, then you can equally trick a reassembler into stitching
>  your bad fragment into that packet), or are you more worried about the
>  DoS from tying lots of memory down in the reassembly cache.

Equal balance of concerns on my side. Ideally no valid packets are
dropped and DoS is very hard to achieve.

> Even with reordering handling, a data structure to record which ranges of
>  a packet have been seen takes much less memory than storing the complete
>  fragment bodies.  (Just a simple bitmap of 8-byte blocks — the resolution
>  of iph->frag_off — reduces size by a factor of 64, not counting all the
>  overhead of a struct sk_buff for each fragment in the queue.  Or you
>  could re-use the rbtree-based code from the reassembler, just with a
>  freshly allocated node containing only offset & length, instead of the
>  whole SKB.)

Yeah, now that you say that, it doesn't sound too bad on the space side.
But I do wonder -- how much code and complexity is that going to be? For
example, I think IPv6 frags have a 60s reassembly timeout, which adds more
stuff to consider. And probably even more I've already forgotten.

Because, at least on the kernel side, this series is 80% code for tests,
and the kfunc wrappers are not very invasive at all. Plus it's wrapping
infra that hasn't changed much in decades.


> And having a BPF helper effectively consume the skb is awkward, as you
>  noted; someone is likely to decide that skb_copy() is too slow, try to
>  add ctx invalidation, and thereby create a whole new swathe of potential
>  correctness and security issues.

Yep. I did try that. While the verifier bits weren't too tricky, there
are a lot of infra concerns to solve:

* https://github.com/danobi/linux/commit/35a66af8d54cca647b0adfc7c1da7105d2603dde
* https://github.com/danobi/linux/commit/e8c86ea75e2ca8f0631632d54ef763381308711e
* https://github.com/danobi/linux/commit/972bcf769f41fbfa7f84ce00faf06b5b57bc6f7a

But FWIW, fragmented packets are kind of a corner case anyway. I don't
think it would be reasonable to expect high perf when fragments are in
play.

> Plus, imagine trying to support this in a hardware-offload XDP device.
>  They'd have to reimplement the entire frag cache, which is a much bigger
>  attack surface than just a frag validator, and they couldn't leverage
>  the battle-hardened kernel implementation.

Hmm, well, this helper is restricted to TC progs for now. I don't quite
see a path to enabling it for XDP, as there would have to be, at a minimum,
quite a few allocations to handle frags. So I'm not sure XDP is a factor at
the moment.

> 
> > And make it trivial for the next person that comes
> > along to do the right thing.
> 
> Fwiw the validator approach could *also* be a helper, it doesn't have to
>  be something the BPF developer writes for themselves.
> 
> But if after thinking about the possibility you still prefer your way, I
>  won't try to stop you — I just wanted to ensure it had been considered.

Thank you for the discussion. The thought had come to mind originally,
but I shied away after seeing some of the reassembly details. Would be
interested in hearing more from other folks.


Thanks,
Daniel