include/linux/skmsg.h | 1 + net/core/skmsg.c | 5 ++++- 2 files changed, 5 insertions(+), 1 deletion(-)
We observed an issue from the latest selftest: sockmap_redir where
sk_psock(psock->sk) != psock in the backlog. The root cause is the special
behavior in sockmap_redir - it frequently performs map_update() and
map_delete() on the same socket. During map_update(), we create a new
psock and during map_delete(), we eventually free the psock via rcu_work
in sk_psock_drop(). However, pending workqueues might still exist and not
be processed yet. If users immediately perform another map_update(), a new
psock will be allocated for the same sk, resulting in two psocks pointing
to the same sk.
When the pending workqueue is later triggered, it uses the old psock to
access sk for I/O operations, which is incorrect.
Timing Diagram:
cpu0 cpu1
map_update(sk):
sk->psock = psock1
psock1->sk = sk
map_delete(sk):
rcu_work_free(psock1)
map_update(sk):
sk->psock = psock2
psock2->sk = sk
workqueue:
wakeup with psock1, but the sk of psock1
doesn't belong to psock1
rcu_handler:
clean psock1
free(psock1)
Previously, we used reference counting to address the concurrency issue
between backlog and sock_map_close(). This logic remains necessary as it
prevents the sk from being freed while processing the backlog. But this
patch prevents pending backlogs from using a psock after it has been
freed.
Note: We cannot call cancel_delayed_work_sync() in map_delete() since this
might be invoked in BPF context by BPF helper, and the function may sleep.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
Thanks to Michal Luczaj for providing the sockmap_redir test case, which
indeed covers almost all sockmap forwarding paths.
---
include/linux/skmsg.h | 1 +
net/core/skmsg.c | 5 ++++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 0b9095a281b8..b17221eef2f4 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -67,6 +67,7 @@ struct sk_psock_progs {
enum sk_psock_state_bits {
SK_PSOCK_TX_ENABLED,
SK_PSOCK_RX_STRP_ENABLED,
+ SK_PSOCK_DROPPED,
};
struct sk_psock_link {
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 34c51eb1a14f..bd58a693ce9a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -656,6 +656,9 @@ static void sk_psock_backlog(struct work_struct *work)
bool ingress;
int ret;
+ if (sk_psock_test_state(psock, SK_PSOCK_DROPPED))
+ return;
+
/* Increment the psock refcnt to synchronize with close(fd) path in
* sock_map_close(), ensuring we wait for backlog thread completion
* before sk_socket freed. If refcnt increment fails, it indicates
@@ -867,7 +870,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
write_unlock_bh(&sk->sk_callback_lock);
sk_psock_stop(psock);
-
+ sk_psock_set_state(psock, SK_PSOCK_DROPPED);
INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
queue_rcu_work(system_wq, &psock->rwork);
}
--
2.47.1
On 2025-05-24 00:22:19, Jiayuan Chen wrote:
> We observed an issue from the latest selftest: sockmap_redir where
> sk_psock(psock->sk) != psock in the backlog. The root cause is the special
> behavior in sockmap_redir - it frequently performs map_update() and
> map_delete() on the same socket. During map_update(), we create a new
> psock and during map_delete(), we eventually free the psock via rcu_work
> in sk_psock_drop(). However, pending workqueues might still exist and not
> be processed yet. If users immediately perform another map_update(), a new
> psock will be allocated for the same sk, resulting in two psocks pointing
> to the same sk.
>
> When the pending workqueue is later triggered, it uses the old psock to
> access sk for I/O operations, which is incorrect.
>
> Timing Diagram:
>
> cpu0 cpu1
>
> map_update(sk):
> sk->psock = psock1
> psock1->sk = sk
> map_delete(sk):
> rcu_work_free(psock1)
>
> map_update(sk):
> sk->psock = psock2
> psock2->sk = sk
> workqueue:
> wakeup with psock1, but the sk of psock1
> doesn't belong to psock1
> rcu_handler:
> clean psock1
> free(psock1)
>
> Previously, we used reference counting to address the concurrency issue
> between backlog and sock_map_close(). This logic remains necessary as it
> prevents the sk from being freed while processing the backlog. But this
> patch prevents pending backlogs from using a psock after it has been
> freed.
Nit, its not that psock would be freed because we do have the
cancel_delayed_work_sync() before the kfree(psock). But this
is not a good state with two psocks referenceing the same sk.
>
> Note: We cannot call cancel_delayed_work_sync() in map_delete() since this
> might be invoked in BPF context by BPF helper, and the function may sleep.
>
> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
>
> ---
> Thanks to Michal Luczaj for providing the sockmap_redir test case, which
> indeed covers almost all sockmap forwarding paths.
> ---
> include/linux/skmsg.h | 1 +
> net/core/skmsg.c | 5 ++++-
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 0b9095a281b8..b17221eef2f4 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -67,6 +67,7 @@ struct sk_psock_progs {
> enum sk_psock_state_bits {
> SK_PSOCK_TX_ENABLED,
> SK_PSOCK_RX_STRP_ENABLED,
> + SK_PSOCK_DROPPED,
> };
>
> struct sk_psock_link {
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 34c51eb1a14f..bd58a693ce9a 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -656,6 +656,9 @@ static void sk_psock_backlog(struct work_struct *work)
> bool ingress;
> int ret;
>
> + if (sk_psock_test_state(psock, SK_PSOCK_DROPPED))
> + return;
Could we use the SK_PSOCK_TX_ENABLED bit here? Its already used to
ensure we wont requeue work after the psock has started being
removed. Seems like we don't need two flags? wdyt?
> +
> /* Increment the psock refcnt to synchronize with close(fd) path in
> * sock_map_close(), ensuring we wait for backlog thread completion
> * before sk_socket freed. If refcnt increment fails, it indicates
> @@ -867,7 +870,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
> write_unlock_bh(&sk->sk_callback_lock);
>
> sk_psock_stop(psock);
Can we add this to sk_psock_stop where we have the TX_ENABLED bit
cleared.
> -
> + sk_psock_set_state(psock, SK_PSOCK_DROPPED);
> INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
> queue_rcu_work(system_wq, &psock->rwork);
> }
> --
> 2.47.1
>
May 29, 2025 at 07:46, "John Fastabend" <john.fastabend@gmail.com> wrote: > > On 2025-05-24 00:22:19, Jiayuan Chen wrote: > > > > > We observed an issue from the latest selftest: sockmap_redir where > > > > sk_psock(psock->sk) != psock in the backlog. The root cause is the special > > > > behavior in sockmap_redir - it frequently performs map_update() and > > > > map_delete() on the same socket. During map_update(), we create a new > > > > psock and during map_delete(), we eventually free the psock via rcu_work > > > > in sk_psock_drop(). However, pending workqueues might still exist and not > > > > be processed yet. If users immediately perform another map_update(), a new > > > > psock will be allocated for the same sk, resulting in two psocks pointing > > > > to the same sk. > > > > > > > > When the pending workqueue is later triggered, it uses the old psock to > > > > access sk for I/O operations, which is incorrect. > > > > > > > > Timing Diagram: > > > > > > > > cpu0 cpu1 > > > > > > > > map_update(sk): > > > > sk->psock = psock1 > > > > psock1->sk = sk > > > > map_delete(sk): > > > > rcu_work_free(psock1) > > > > > > > > map_update(sk): > > > > sk->psock = psock2 > > > > psock2->sk = sk > > > > workqueue: > > > > wakeup with psock1, but the sk of psock1 > > > > doesn't belong to psock1 > > > > rcu_handler: > > > > clean psock1 > > > > free(psock1) > > > > > > > > Previously, we used reference counting to address the concurrency issue > > > > between backlog and sock_map_close(). This logic remains necessary as it > > > > prevents the sk from being freed while processing the backlog. But this > > > > patch prevents pending backlogs from using a psock after it has been > > > > freed. > > > > Nit, its not that psock would be freed because we do have the > > cancel_delayed_work_sync() before the kfree(psock). But this > > is not a good state with two psocks referenceing the same sk. > BTW, did we miss ingress_lock while processing ingress_skb in backlog? will we have the concurrency issue when skb was appended into ingress_skb in sk_psock_skb_redirect().
May 29, 2025 at 07:46, "John Fastabend" <john.fastabend@gmail.com> wrote:
>
> On 2025-05-24 00:22:19, Jiayuan Chen wrote:
>
> >
> > We observed an issue from the latest selftest: sockmap_redir where
> >
> > sk_psock(psock->sk) != psock in the backlog. The root cause is the special
> >
> > behavior in sockmap_redir - it frequently performs map_update() and
> >
> > map_delete() on the same socket. During map_update(), we create a new
> >
> > psock and during map_delete(), we eventually free the psock via rcu_work
> >
> > in sk_psock_drop(). However, pending workqueues might still exist and not
> >
> > be processed yet. If users immediately perform another map_update(), a new
> >
> > psock will be allocated for the same sk, resulting in two psocks pointing
> >
> > to the same sk.
> >
> >
> >
> > When the pending workqueue is later triggered, it uses the old psock to
> >
> > access sk for I/O operations, which is incorrect.
> >
> >
> >
> > Timing Diagram:
> >
> >
> >
> > cpu0 cpu1
> >
> >
> >
> > map_update(sk):
> >
> > sk->psock = psock1
> >
> > psock1->sk = sk
> >
> > map_delete(sk):
> >
> > rcu_work_free(psock1)
> >
> >
> >
> > map_update(sk):
> >
> > sk->psock = psock2
> >
> > psock2->sk = sk
> >
> > workqueue:
> >
> > wakeup with psock1, but the sk of psock1
> >
> > doesn't belong to psock1
> >
> > rcu_handler:
> >
> > clean psock1
> >
> > free(psock1)
> >
> >
> >
> > Previously, we used reference counting to address the concurrency issue
> >
> > between backlog and sock_map_close(). This logic remains necessary as it
> >
> > prevents the sk from being freed while processing the backlog. But this
> >
> > patch prevents pending backlogs from using a psock after it has been
> >
> > freed.
> >
>
> Nit, its not that psock would be freed because we do have the
>
> cancel_delayed_work_sync() before the kfree(psock). But this
>
> is not a good state with two psocks referenceing the same sk.
>
> >
> > Note: We cannot call cancel_delayed_work_sync() in map_delete() since this
> >
> > might be invoked in BPF context by BPF helper, and the function may sleep.
> >
> >
> >
> > Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
> >
> > Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> >
> >
> >
> > ---
> >
> > Thanks to Michal Luczaj for providing the sockmap_redir test case, which
> >
> > indeed covers almost all sockmap forwarding paths.
> >
> > ---
> >
> > include/linux/skmsg.h | 1 +
> >
> > net/core/skmsg.c | 5 ++++-
> >
> > 2 files changed, 5 insertions(+), 1 deletion(-)
> >
> >
> >
> > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> >
> > index 0b9095a281b8..b17221eef2f4 100644
> >
> > --- a/include/linux/skmsg.h
> >
> > +++ b/include/linux/skmsg.h
> >
> > @@ -67,6 +67,7 @@ struct sk_psock_progs {
> >
> > enum sk_psock_state_bits {
> >
> > SK_PSOCK_TX_ENABLED,
> >
> > SK_PSOCK_RX_STRP_ENABLED,
> >
> > + SK_PSOCK_DROPPED,
> >
> > };
> >
> >
> >
> > struct sk_psock_link {
> >
> > diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> >
> > index 34c51eb1a14f..bd58a693ce9a 100644
> >
> > --- a/net/core/skmsg.c
> >
> > +++ b/net/core/skmsg.c
> >
> > @@ -656,6 +656,9 @@ static void sk_psock_backlog(struct work_struct *work)
> >
> > bool ingress;
> >
> > int ret;
> >
> >
> >
> > + if (sk_psock_test_state(psock, SK_PSOCK_DROPPED))
> >
> > + return;
> >
>
> Could we use the SK_PSOCK_TX_ENABLED bit here? Its already used to
>
> ensure we wont requeue work after the psock has started being
>
> removed. Seems like we don't need two flags? wdyt?
>
> >
> > +
> >
> > /* Increment the psock refcnt to synchronize with close(fd) path in
> >
> > * sock_map_close(), ensuring we wait for backlog thread completion
> >
> > * before sk_socket freed. If refcnt increment fails, it indicates
> >
> > @@ -867,7 +870,7 @@ void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
> >
> > write_unlock_bh(&sk->sk_callback_lock);
> >
> >
> >
> > sk_psock_stop(psock);
> >
>
> Can we add this to sk_psock_stop where we have the TX_ENABLED bit
>
> cleared.
Thanks, I just add SK_PSOCK_TX_ENABLED checking at the start of sk_psock_backlog().
Every works fine, and truly no more flag needed !
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 34c51eb1a14f..83c78379932e 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -656,6 +656,13 @@ static void sk_psock_backlog(struct work_struct *work)
bool ingress;
int ret;
+ /* If sk is quickly removed from the map and then added back, the old
+ * psock should not be scheduled, because there are now two psocks
+ * pointing to the same sk.
+ */
+ if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
+ return;
+
/* Increment the psock refcnt to synchronize with close(fd) path in
* sock_map_close(), ensuring we wait for backlog thread completion
* before sk_socket freed. If refcnt increment fails, it indicates
> >
> > -
> >
> > + sk_psock_set_state(psock, SK_PSOCK_DROPPED);
> >
> > INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
> >
> > queue_rcu_work(system_wq, &psock->rwork);
> >
> > }
> >
> > --
> >
> > 2.47.1
> >
>
On 2025-05-29 02:42:14, Jiayuan Chen wrote: > May 29, 2025 at 07:46, "John Fastabend" <john.fastabend@gmail.com> wrote: > > > > > > > On 2025-05-24 00:22:19, Jiayuan Chen wrote: > > > > > > > > We observed an issue from the latest selftest: sockmap_redir where > > > > > > sk_psock(psock->sk) != psock in the backlog. The root cause is the special > > > behavior in sockmap_redir - it frequently performs map_update() and > > > map_delete() on the same socket. During map_update(), we create a new > > > psock and during map_delete(), we eventually free the psock via rcu_work > > > in sk_psock_drop(). However, pending workqueues might still exist and not > > > be processed yet. If users immediately perform another map_update(), a new > > > psock will be allocated for the same sk, resulting in two psocks pointing > > > to the same sk. [...] > > > > Can we add this to sk_psock_stop where we have the TX_ENABLED bit > > > > cleared. > > > > Thanks, I just add SK_PSOCK_TX_ENABLED checking at the start of sk_psock_backlog(). > Every works fine, and truly no more flag needed ! > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index 34c51eb1a14f..83c78379932e 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c > @@ -656,6 +656,13 @@ static void sk_psock_backlog(struct work_struct *work) > bool ingress; > int ret; > > + /* If sk is quickly removed from the map and then added back, the old > + * psock should not be scheduled, because there are now two psocks > + * pointing to the same sk. > + */ > + if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) > + return; > + > /* Increment the psock refcnt to synchronize with close(fd) path in > * sock_map_close(), ensuring we wait for backlog thread completion > * before sk_socket freed. If refcnt increment fails, it indicates > Thanks. Please submit an official patch so we can get it merged.
© 2016 - 2025 Red Hat, Inc.