When a packet enters the OVS datapath and there is no flow to handle it,
the packet goes to userspace through a MISS upcall. With the per-CPU
upcall dispatch mechanism, we use the current CPU id to select the
Netlink PID on which to send this packet. This allows us to send
packets from the same traffic flow through the same handler.
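Conceptually, the per-CPU dispatch picks the socket roughly like this
(a simplified sketch, not the exact kernel code):

    /* One Netlink socket PID per CPU; index by the CPU that is
     * currently processing the packet in the kernel. */
    u32 portid = pids->pids[smp_processor_id() % pids->n_pids];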
The handler will process the packet, install the required flow into the
kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
recirculation action that will pass the (likely modified) packet
through the flow lookup again. And if the flow is not found, the
packet will be sent to userspace again through another MISS upcall.
However, the handler thread in userspace is likely running on a
different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
in the syscall context of that thread. So, when the time comes to
send the packet through another upcall, the per-CPU dispatch will
choose a different Netlink PID, and this packet will end up processed
by a different handler thread on a different CPU.
The process continues as long as there are new recirculations, each
time the packet goes to a different handler thread before it is sent
out of the OVS datapath to the destination port. In real setups the
number of recirculations can go up to 4 or 5, sometimes more.
There is always a chance to re-order packets while processing upcalls,
because userspace will first install the flow and then re-inject the
original packet. So, there is a race window when the flow is already
installed and the second packet can match it and be forwarded to the
destination before the first packet is re-injected. But the fact that
packets are going through multiple upcalls handled by different
userspace threads makes the reordering noticeably more likely, because
we not only have a race between the kernel and a userspace handler
(which is hard to avoid), but also between multiple userspace handlers.
For example, let's assume that 10 packets got enqueued through a MISS
upcall for handler-1. It will start processing them, install the flow
into the kernel and start re-injecting the packets back, from where
they will go through another MISS to handler-2. Handler-2 will install
the flow into the kernel and start re-injecting the packets. While
handler-1 continues to re-inject the last of the 10 packets, they will
hit the flow installed by handler-2 and be forwarded without going to
handler-2, while handler-2 is still re-injecting the first of these 10
packets. Given multiple recirculations and misses, these 10 packets
may end up completely mixed up on the output from the datapath.
Let's allow userspace to specify on which Netlink PID the packets
should be upcalled while processing OVS_PACKET_CMD_EXECUTE.
This makes it possible to ensure that all the packets are processed
by the same handler thread in userspace even if they are upcalled
multiple times in the process. Packets will remain in order since
they will be enqueued to the same socket and re-injected in the same
order. This doesn't eliminate re-ordering completely, as stated above,
since we still have a race between the kernel and the userspace thread,
but it does eliminate races between multiple userspace threads.
Userspace knows the PID of the socket on which the original upcall is
received, so there is no need to send it up from the kernel.
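A rough sketch of the userspace side (the encode_execute() wrapper
below is hypothetical and not part of this patch; only the attribute
itself is defined here):

    /* Hypothetical sketch: while encoding the OVS_PACKET_CMD_EXECUTE
     * request, the handler adds the PID of the socket the original
     * upcall arrived on, so that any further MISS for this packet is
     * delivered to the same socket.
     * nl_msg_put_u32() is the helper from OVS's lib/netlink.h. */
    static void encode_execute(struct ofpbuf *request, uint32_t upcall_sock_pid)
    {
        /* ... OVS_PACKET_ATTR_PACKET, _KEY, _ACTIONS, etc. ... */
        nl_msg_put_u32(request, OVS_PACKET_ATTR_UPCALL_PID, upcall_sock_pid);
    }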
The solution requires storing the value somewhere for the duration of
the packet processing. There are two potential places for this: our skb
extension or the per-CPU storage. It's not clear which is better,
so just follow the currently used scheme of storing this kind of
information along with the skb.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
include/uapi/linux/openvswitch.h | 6 ++++++
net/openvswitch/actions.c | 6 ++++--
net/openvswitch/datapath.c | 10 +++++++++-
net/openvswitch/datapath.h | 3 +++
net/openvswitch/vport.c | 1 +
5 files changed, 23 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 3a701bd1f31b..3092c2c6f1d2 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -186,6 +186,11 @@ enum ovs_packet_cmd {
* %OVS_PACKET_ATTR_USERSPACE action specify the Maximum received fragment
* size.
* @OVS_PACKET_ATTR_HASH: Packet hash info (e.g. hash, sw_hash and l4_hash in skb).
+ * @OVS_PACKET_ATTR_UPCALL_PID: Netlink PID to use for upcalls while
+ * processing %OVS_PACKET_CMD_EXECUTE. Takes precedence over all other ways
+ * to determine the Netlink PID including %OVS_USERSPACE_ATTR_PID,
+ * %OVS_DP_ATTR_UPCALL_PID, %OVS_DP_ATTR_PER_CPU_PIDS and the
+ * %OVS_VPORT_ATTR_UPCALL_PID.
*
* These attributes follow the &struct ovs_header within the Generic Netlink
* payload for %OVS_PACKET_* commands.
@@ -205,6 +210,7 @@ enum ovs_packet_attr {
OVS_PACKET_ATTR_MRU, /* Maximum received IP fragment size. */
OVS_PACKET_ATTR_LEN, /* Packet size before truncation. */
OVS_PACKET_ATTR_HASH, /* Packet hash. */
+ OVS_PACKET_ATTR_UPCALL_PID, /* u32 Netlink PID. */
__OVS_PACKET_ATTR_MAX
};
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 3add108340bf..2832e0794197 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -941,8 +941,10 @@ static int output_userspace(struct datapath *dp, struct sk_buff *skb,
break;
case OVS_USERSPACE_ATTR_PID:
- if (dp->user_features &
- OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
+ if (OVS_CB(skb)->upcall_pid)
+ upcall.portid = OVS_CB(skb)->upcall_pid;
+ else if (dp->user_features &
+ OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
upcall.portid =
ovs_dp_get_upcall_portid(dp,
smp_processor_id());
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index b990dc83504f..ec08ce72f439 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -267,7 +267,9 @@ void ovs_dp_process_packet(struct sk_buff *skb, struct sw_flow_key *key)
memset(&upcall, 0, sizeof(upcall));
upcall.cmd = OVS_PACKET_CMD_MISS;
- if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
+ if (OVS_CB(skb)->upcall_pid)
+ upcall.portid = OVS_CB(skb)->upcall_pid;
+ else if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
upcall.portid =
ovs_dp_get_upcall_portid(dp, smp_processor_id());
else
@@ -616,6 +618,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
struct sw_flow_actions *sf_acts;
struct datapath *dp;
struct vport *input_vport;
+ u32 upcall_pid = 0;
u16 mru = 0;
u64 hash;
int len;
@@ -651,6 +654,10 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
!!(hash & OVS_PACKET_HASH_L4_BIT));
}
+ if (a[OVS_PACKET_ATTR_UPCALL_PID])
+ upcall_pid = nla_get_u32(a[OVS_PACKET_ATTR_UPCALL_PID]);
+ OVS_CB(packet)->upcall_pid = upcall_pid;
+
/* Build an sw_flow for sending this packet. */
flow = ovs_flow_alloc();
err = PTR_ERR(flow);
@@ -719,6 +726,7 @@ static const struct nla_policy packet_policy[OVS_PACKET_ATTR_MAX + 1] = {
[OVS_PACKET_ATTR_PROBE] = { .type = NLA_FLAG },
[OVS_PACKET_ATTR_MRU] = { .type = NLA_U16 },
[OVS_PACKET_ATTR_HASH] = { .type = NLA_U64 },
+ [OVS_PACKET_ATTR_UPCALL_PID] = { .type = NLA_U32 },
};
static const struct genl_small_ops dp_packet_genl_ops[] = {
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index cfeb817a1889..db0c3e69d66c 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -121,6 +121,8 @@ struct datapath {
* @cutlen: The number of bytes from the packet end to be removed.
* @probability: The sampling probability that was applied to this skb; 0 means
* no sampling has occurred; U32_MAX means 100% probability.
+ * @upcall_pid: Netlink socket PID to use for sending this packet to userspace;
+ * 0 means "not set" and default per-CPU or per-vport dispatch should be used.
*/
struct ovs_skb_cb {
struct vport *input_vport;
@@ -128,6 +130,7 @@ struct ovs_skb_cb {
u16 acts_origlen;
u32 cutlen;
u32 probability;
+ u32 upcall_pid;
};
#define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 8732f6e51ae5..6bbbc16ab778 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -501,6 +501,7 @@ int ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
OVS_CB(skb)->mru = 0;
OVS_CB(skb)->cutlen = 0;
OVS_CB(skb)->probability = 0;
+ OVS_CB(skb)->upcall_pid = 0;
if (unlikely(dev_net(skb->dev) != ovs_dp_get_net(vport->dp))) {
u32 mark;
--
2.49.0
On Sat, 28 Jun 2025 00:01:33 +0200 Ilya Maximets <i.maximets@ovn.org> wrote:

> However, the handler thread in userspace is likely running on a
> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
> in the syscall context of that thread. So, when the time comes to
> send the packet through another upcall, the per-CPU dispatch will
> choose a different Netlink PID, and this packet will end up processed
> by a different handler thread on a different CPU.

The per-CPU dispatch mode is supposed to rely on the CPU context,
which, according to what you said above, is working okay on the first
MISS. However, when we hit a recirculation action and there is another
MISS, another thread from another CPU context is selected, why?

Thanks,
Flavio
On 7/2/25 3:53 PM, Flavio Leitner wrote:
> The per-CPU dispatch mode is supposed to rely on the CPU context,
> which, according to what you said above, is working okay on the first
> MISS. However, when we hit a recirculation action and there is another
> MISS, another thread from another CPU context is selected, why?

Because the second miss is happening while processing
OVS_PACKET_CMD_EXECUTE, which is happening in a syscall context of a
userspace handler thread, which is running on a different CPU.
On Wed, 2 Jul 2025 16:41:19 +0200 Ilya Maximets <i.maximets@ovn.org> wrote:
> Because the second miss is happening while processing
> OVS_PACKET_CMD_EXECUTE, which is happening in a syscall context of a
> userspace handler thread, which is running on a different CPU.

Got it, thanks, see below.

> >> @@ -651,6 +654,10 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
> >> +	if (a[OVS_PACKET_ATTR_UPCALL_PID])
> >> +		upcall_pid = nla_get_u32(a[OVS_PACKET_ATTR_UPCALL_PID]);
> >> +	OVS_CB(packet)->upcall_pid = upcall_pid;

Since this is coming from userspace, does it make sense to check if the
upcall_pid is one of the pids in the dp->upcall_portids array?

Otherwise the approach and patch looks good to me.
Thanks!
On 7/3/25 1:08 AM, Flavio Leitner wrote:
>>>> +	if (a[OVS_PACKET_ATTR_UPCALL_PID])
>>>> +		upcall_pid = nla_get_u32(a[OVS_PACKET_ATTR_UPCALL_PID]);
>>>> +	OVS_CB(packet)->upcall_pid = upcall_pid;
>
> Since this is coming from userspace, does it make sense to check if the
> upcall_pid is one of the pids in the dp->upcall_portids array?

Not really. IMO, this would be an unnecessary artificial restriction.
We're not concerned about security here since OVS_PACKET_CMD_EXECUTE
requires the same privileges as the OVS_DP_CMD_NEW or the
OVS_DP_CMD_SET.

Best regards, Ilya Maximets.
On Thu, 3 Jul 2025 10:38:49 +0200 Ilya Maximets <i.maximets@ovn.org> wrote:
> Not really. IMO, this would be an unnecessary artificial restriction.
> We're not concerned about security here since OVS_PACKET_CMD_EXECUTE
> requires the same privileges as the OVS_DP_CMD_NEW or the
> OVS_DP_CMD_SET.

What if the userspace is buggy or compromised?
It seems netlink API will return -ECONNREFUSED and the upcall is dropped.
Therefore, we would be okay either way, correct?
On 7/3/25 1:04 PM, Flavio Leitner wrote:
> What if the userspace is buggy or compromised?
> It seems netlink API will return -ECONNREFUSED and the upcall is dropped.
> Therefore, we would be okay either way, correct?

If the userspace is compromised, it can overwrite the upcall_portids
and do many other things, since the userspace application here has
CAP_NET_ADMIN. And if it's buggy, then the packet will be just dropped
on validation or on the upcall, there isn't much difference.

It's the responsibility of the userspace application to make sure these
sockets exist before passing PIDs into the kernel. From the kernel's
perspective dropping the upcall is completely fine. So, yes, we should
be OK.

Best regards, Ilya Maximets.
On Thu, 3 Jul 2025 13:15:17 +0200 Ilya Maximets <i.maximets@ovn.org> wrote:
> It's the responsibility of the userspace application to make sure these
> sockets exist before passing PIDs into the kernel. From the kernel's
> perspective dropping the upcall is completely fine. So, yes, we should
> be OK.

ack, thanks!
-- 
Flavio Leitner
Ilya Maximets <i.maximets@ovn.org> writes:

> However, the handler thread in userspace is likely running on a
> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
> in the syscall context of that thread. So, when the time comes to
> send the packet through another upcall, the per-CPU dispatch will
> choose a different Netlink PID, and this packet will end up processed
> by a different handler thread on a different CPU.

Just wondering but why can't we choose the existing core handler when
running the packet_cmd_execute? For example, when looking into the
per-cpu table we know what the current core is, can we just queue to
that one? I actually thought that's what the PER_CPU dispatch mode was
supposed to do. Or is it that we want to make sure we keep the
association between the skbuff for re-injection always?

> The process continues as long as there are new recirculations, each
> time the packet goes to a different handler thread before it is sent
> out of the OVS datapath to the destination port. In real setups the
> number of recirculations can go up to 4 or 5, sometimes more.

Is it because the userspace handler threads are being rescheduled across
CPUs? Do we still see this behavior if we pinned each handler thread to
a specific CPU rather than letting the scheduler make the decision?

> The solution requires storing the value somewhere for the duration of
> the packet processing. There are two potential places for this: our skb
> extension or the per-CPU storage. It's not clear which is better,
> so just follow the currently used scheme of storing this kind of
> information along with the skb.

With this change we're almost full on the OVS sk_buff control block.
Might be good to mention it in the commit message if you're respinning.
On 7/2/25 2:14 PM, Aaron Conole wrote:
> Just wondering but why can't we choose the existing core handler when
> running the packet_cmd_execute? For example, when looking into the
> per-cpu table we know what the current core is, can we just queue to
> that one? I actually thought that's what the PER_CPU dispatch mode was
> supposed to do.

This is exactly how it works today and it is the problem, because our
userspace handler is running on a different CPU and so the 'current CPU'
during the packet_cmd_execute is different from the one where the kernel
was processing the original upcall.

> Or is it that we want to make sure we keep the
> association between the skbuff for re-injection always?

We want the same packet to be enqueued to the same upcall socket after
each recirculation, so it gets handled by the same userspace thread.

> Is it because the userspace handler threads are being rescheduled across
> CPUs?

Yes. Userspace handlers are not pinned to a specific core in most cases,
so they will be running on different CPUs and will float around.

> Do we still see this behavior if we pinned each handler thread to
> a specific CPU rather than letting the scheduler make the decision?

If you pin each userspace thread to a core that is specified in the
PCPU_UPCALL_PIDS for the socket that it is listening on, then the
problem will go away, as the packet_cmd_execute syscall will be executed
on the same core where the kernel received the original packet.
However, that's not possible in many cases (reserved CPUs, different
CPU affinity for IRQs and userspace applications, etc.) and not desired
as it may impact performance of the system, because the kernel and
userspace will compete for the same core.

> With this change we're almost full on the OVS sk_buff control block.
> Might be good to mention it in the commit message if you're respinning.

Are we full? The skb->cb size is 48 bytes and we're only using 24
with this change, unless I'm missing something.

Best regards, Ilya Maximets.
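For reference, a rough accounting of struct ovs_skb_cb on a 64-bit
build, based on the fields visible in the patch (illustrative only;
the compile-time guard shown below is the kind of check the datapath
relies on to stay within skb->cb):

    struct ovs_skb_cb {
        struct vport *input_vport;  /*  8 bytes on 64-bit */
        u16 mru;                    /*  2 bytes */
        u16 acts_origlen;           /*  2 bytes */
        u32 cutlen;                 /*  4 bytes */
        u32 probability;            /*  4 bytes */
        u32 upcall_pid;             /*  4 bytes -> 24 of the 48-byte skb->cb */
    };

    /* Compile-time guard against overflowing the control block. */
    BUILD_BUG_ON(sizeof(struct ovs_skb_cb) > sizeof_field(struct sk_buff, cb));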
Ilya Maximets <i.maximets@ovn.org> writes: > On 7/2/25 2:14 PM, Aaron Conole wrote: >> Ilya Maximets <i.maximets@ovn.org> writes: >> >>> When a packet enters OVS datapath and there is no flow to handle it, >>> packet goes to userspace through a MISS upcall. With per-CPU upcall >>> dispatch mechanism, we're using the current CPU id to select the >>> Netlink PID on which to send this packet. This allows us to send >>> packets from the same traffic flow through the same handler. >>> >>> The handler will process the packet, install required flow into the >>> kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE. >>> >>> While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a >>> recirculation action that will pass the (likely modified) packet >>> through the flow lookup again. And if the flow is not found, the >>> packet will be sent to userspace again through another MISS upcall. >>> >>> However, the handler thread in userspace is likely running on a >>> different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled >>> in the syscall context of that thread. So, when the time comes to >>> send the packet through another upcall, the per-CPU dispatch will >>> choose a different Netlink PID, and this packet will end up processed >>> by a different handler thread on a different CPU. >> >> Just wondering but why can't we choose the existing core handler when >> running the packet_cmd_execute? For example, when looking into the >> per-cpu table we know what the current core is, can we just queue to >> that one? I actually thought that's what the PER_CPU dispatch mode was >> supposed to do. > > This is exactly how it works today and it is the problem, because our > userspace handler is running on a different CPU and so the 'current CPU' > during the packet_cmd_execute is different from the one where kernel > was processing the original upcall. > >> Or is it that we want to make sure we keep the >> association between the skbuff for re-injection always? > > We want the same packet to be enqueued to the same upcall socket after > each recirculation, so it gets handled by the same userspace thread. > >> >>> The process continues as long as there are new recirculations, each >>> time the packet goes to a different handler thread before it is sent >>> out of the OVS datapath to the destination port. In real setups the >>> number of recirculations can go up to 4 or 5, sometimes more. >> >> Is it because the userspace handler threads are being rescheduled across >> CPUs? > > Yes. Userspace handlers are not pinned to a specific core in most cases, > so they will be running on different CPUs and will float around. > >> Do we still see this behavior if we pinned each handler thread to >> a specific CPU rather than letting the scheduler make the decision? > > If you pin each userspace thread to a core that is specified in the > PCPU_UPCALL_PIDS for the socket that it is listening on, then the > problem will go away, as the packet_cmd_execute syscall will be executed > on the same core where the kernel received the original packet. > However, that's not possible in many cases (reserved CPUs, different > CPU affinity for IRQs and userspace applications, etc.) and not desired > as may impact performance of the system, because the kernel and userspace > will compete for the same core. Just wanted to make sure I was understanding the problem. Okay, this makes sense. 
>> >>> There is always a chance to re-order packets while processing upcalls, >>> because userspace will first install the flow and then re-inject the >>> original packet. So, there is a race window when the flow is already >>> installed and the second packet can match it and be forwarded to the >>> destination before the first packet is re-injected. But the fact that >>> packets are going through multiple upcalls handled by different >>> userspace threads makes the reordering noticeably more likely, because >>> we not only have a race between the kernel and a userspace handler >>> (which is hard to avoid), but also between multiple userspace handlers. >>> >>> For example, let's assume that 10 packets got enqueued through a MISS >>> upcall for handler-1, it will start processing them, will install the >>> flow into the kernel and start re-injecting packets back, from where >>> they will go through another MISS to handler-2. Handler-2 will install >>> the flow into the kernel and start re-injecting the packets, while >>> handler-1 continues to re-inject the last of the 10 packets, they will >>> hit the flow installed by handler-2 and be forwarded without going to >>> the handler-2, while handler-2 still re-injects the first of these 10 >>> packets. Given multiple recirculations and misses, these 10 packets >>> may end up completely mixed up on the output from the datapath. >>> >>> Let's allow userspace to specify on which Netlink PID the packets >>> should be upcalled while processing OVS_PACKET_CMD_EXECUTE. >>> This makes it possible to ensure that all the packets are processed >>> by the same handler thread in the userspace even with them being >>> upcalled multiple times in the process. Packets will remain in order >>> since they will be enqueued to the same socket and re-injected in the >>> same order. This doesn't eliminate re-ordering as stated above, since >>> we still have a race between kernel and the userspace thread, but it >>> allows to eliminate races between multiple userspace threads. >>> >>> Userspace knows the PID of the socket on which the original upcall is >>> received, so there is no need to send it up from the kernel. >>> >>> Solution requires storing the value somewhere for the duration of the >>> packet processing. There are two potential places for this: our skb >>> extension or the per-CPU storage. It's not clear which is better, >>> so just following currently used scheme of storing this kind of things >>> along the skb. >> >> With this change we're almost full on the OVS sk_buff control block. >> Might be good to mention it in the commit message if you're respinning. > > Are we full? The skb->cb size is 48 bytes and we're only using 24 > with this change, unless I'm missing something. Hrrm... I guess I miscounted. Yes I agree, with this change we're at 24 bytes in-use on 64-bit platform. Okay no worries. > Best regards, Ilya Maximets.
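For reference on the skb->cb accounting discussed above, here is a rough sketch of the arithmetic; the field list is an approximation (the authoritative definition of struct ovs_skb_cb is in net/openvswitch/datapath.h) and example_ovs_skb_cb / example_cb_fits are made-up names:

/* Approximate layout for illustration; fields of the real struct may be
 * missing or ordered differently here.
 */
struct example_ovs_skb_cb {
	struct vport	*input_vport;	/*  8 bytes on 64-bit */
	u16		mru;		/*  2 bytes */
	u16		acts_origlen;	/*  2 bytes */
	u32		cutlen;		/*  4 bytes */
	u32		upcall_pid;	/*  4 bytes, added by this patch */
};					/* ~24 bytes after padding on 64-bit */

/* A build-time check that the private data still fits into the 48-byte
 * skb->cb area:
 */
static inline void example_cb_fits(void)
{
	BUILD_BUG_ON(sizeof(struct example_ovs_skb_cb) >
		     sizeof_field(struct sk_buff, cb));
}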
On Sat, 28 Jun 2025 00:01:33 +0200 Ilya Maximets wrote: > @@ -616,6 +618,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) > struct sw_flow_actions *sf_acts; > struct datapath *dp; > struct vport *input_vport; > + u32 upcall_pid = 0; > u16 mru = 0; > u64 hash; > int len; > @@ -651,6 +654,10 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) > !!(hash & OVS_PACKET_HASH_L4_BIT)); > } > > + if (a[OVS_PACKET_ATTR_UPCALL_PID]) > + upcall_pid = nla_get_u32(a[OVS_PACKET_ATTR_UPCALL_PID]); > + OVS_CB(packet)->upcall_pid = upcall_pid; sorry for a late nit: OVS_CB(packet)->upcall_pid = nla_get_u32_default(a[OVS_PACKET_ATTR_UPCALL_PID], 0); ?
On 7/2/25 4:21 AM, Jakub Kicinski wrote: > On Sat, 28 Jun 2025 00:01:33 +0200 Ilya Maximets wrote: >> @@ -616,6 +618,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) >> struct sw_flow_actions *sf_acts; >> struct datapath *dp; >> struct vport *input_vport; >> + u32 upcall_pid = 0; >> u16 mru = 0; >> u64 hash; >> int len; >> @@ -651,6 +654,10 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info) >> !!(hash & OVS_PACKET_HASH_L4_BIT)); >> } >> >> + if (a[OVS_PACKET_ATTR_UPCALL_PID]) >> + upcall_pid = nla_get_u32(a[OVS_PACKET_ATTR_UPCALL_PID]); >> + OVS_CB(packet)->upcall_pid = upcall_pid; > > sorry for a late nit: > > OVS_CB(packet)->upcall_pid = > nla_get_u32_default(a[OVS_PACKET_ATTR_UPCALL_PID], 0); > > ? No worries. I was actually looking for ways to collapse this block, but somehow missed these "new" accessors. I'll send a v2 a bit later today if there is no further feedback. Best regards, Ilya Maximets.
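For clarity, the collapsed assignment would presumably look roughly like this in v2 (a sketch of the shape only, not the committed diff); nla_get_u32_default() returns the attribute's u32 value, or the given default when the attribute is absent, so the separate NULL check and the local variable are no longer needed:

	/* Sketch: attribute lookup, default and assignment in one step. */
	OVS_CB(packet)->upcall_pid =
		nla_get_u32_default(a[OVS_PACKET_ATTR_UPCALL_PID], 0);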
On 6/28/25 12:01 AM, Ilya Maximets wrote: > When a packet enters OVS datapath and there is no flow to handle it, > packet goes to userspace through a MISS upcall. With per-CPU upcall > dispatch mechanism, we're using the current CPU id to select the > Netlink PID on which to send this packet. This allows us to send > packets from the same traffic flow through the same handler. > > The handler will process the packet, install required flow into the > kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE. > > While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a > recirculation action that will pass the (likely modified) packet > through the flow lookup again. And if the flow is not found, the > packet will be sent to userspace again through another MISS upcall. > > However, the handler thread in userspace is likely running on a > different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled > in the syscall context of that thread. So, when the time comes to > send the packet through another upcall, the per-CPU dispatch will > choose a different Netlink PID, and this packet will end up processed > by a different handler thread on a different CPU. > > The process continues as long as there are new recirculations, each > time the packet goes to a different handler thread before it is sent > out of the OVS datapath to the destination port. In real setups the > number of recirculations can go up to 4 or 5, sometimes more. > > There is always a chance to re-order packets while processing upcalls, > because userspace will first install the flow and then re-inject the > original packet. So, there is a race window when the flow is already > installed and the second packet can match it and be forwarded to the > destination before the first packet is re-injected. But the fact that > packets are going through multiple upcalls handled by different > userspace threads makes the reordering noticeably more likely, because > we not only have a race between the kernel and a userspace handler > (which is hard to avoid), but also between multiple userspace handlers. > > For example, let's assume that 10 packets got enqueued through a MISS > upcall for handler-1, it will start processing them, will install the > flow into the kernel and start re-injecting packets back, from where > they will go through another MISS to handler-2. Handler-2 will install > the flow into the kernel and start re-injecting the packets, while > handler-1 continues to re-inject the last of the 10 packets, they will > hit the flow installed by handler-2 and be forwarded without going to > the handler-2, while handler-2 still re-injects the first of these 10 > packets. Given multiple recirculations and misses, these 10 packets > may end up completely mixed up on the output from the datapath. > > Let's allow userspace to specify on which Netlink PID the packets > should be upcalled while processing OVS_PACKET_CMD_EXECUTE. > This makes it possible to ensure that all the packets are processed > by the same handler thread in the userspace even with them being > upcalled multiple times in the process. Packets will remain in order > since they will be enqueued to the same socket and re-injected in the > same order. This doesn't eliminate re-ordering as stated above, since > we still have a race between kernel and the userspace thread, but it > allows to eliminate races between multiple userspace threads. 
> > Userspace knows the PID of the socket on which the original upcall is > received, so there is no need to send it up from the kernel. > > Solution requires storing the value somewhere for the duration of the > packet processing. There are two potential places for this: our skb > extension or the per-CPU storage. It's not clear which is better, > so just following currently used scheme of storing this kind of things > along the skb. > > Signed-off-by: Ilya Maximets <i.maximets@ovn.org> > --- The userspace counterpart that makes use of this functionality is available here: https://patchwork.ozlabs.org/project/openvswitch/patch/20250627220752.1505153-1-i.maximets@ovn.org/ Best regards, Ilya Maximets.
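As a rough illustration of what that userspace counterpart does (a sketch under assumptions, not the actual change linked above; nl_msg_put_u32() is the attribute helper from OVS's lib/netlink.h, while example_put_upcall_pid() and both parameter names are hypothetical): when re-injecting a packet via OVS_PACKET_CMD_EXECUTE, the handler attaches the Netlink PID of its own upcall socket so that any further MISS upcalls for this packet land back on the same socket:

/* Simplified userspace-side sketch; not the actual OVS change.
 * 'request' is the OVS_PACKET_CMD_EXECUTE Netlink message being built
 * and 'handler_sock_pid' is the PID of the socket on which this handler
 * received the original upcall (both hypothetical names).
 */
static void
example_put_upcall_pid(struct ofpbuf *request, uint32_t handler_sock_pid)
{
    /* OVS_PACKET_ATTR_UPCALL_PID is the new attribute from this patch. */
    nl_msg_put_u32(request, OVS_PACKET_ATTR_UPCALL_PID, handler_sock_pid);
}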