From: Lee Jones
To: lee@kernel.org, "David S. Miller", Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Christian Brauner, Kuniyuki Iwashima,
	Alexander Mikhalitsyn, Jens Axboe, Sasha Levin, Michal Luczaj,
	Rao Shoaib, Pavel Begunkov, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
Cc: stable@vger.kernel.org
Subject: [PATCH v6.1 21/27] af_unix: Replace garbage collection algorithm.
Date: Wed, 21 May 2025 16:27:20 +0100
Message-ID: <20250521152920.1116756-22-lee@kernel.org>
In-Reply-To: <20250521152920.1116756-1-lee@kernel.org>
References: <20250521152920.1116756-1-lee@kernel.org>

From: Kuniyuki Iwashima

[ Upstream commit 4090fa373f0e763c43610853d2774b5979915959 ]

If we find a dead SCC during iteration, we call unix_collect_skb()
to splice all skb in the SCC to the global sk_buff_head, hitlist.

After iterating all SCC, we unlock unix_gc_lock and purge the queue.
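A compact userspace illustration of the collect-then-purge flow described
above (not kernel code and not part of the patch): struct skb, struct scc,
collect_skb() and gc() below are hypothetical stand-ins for the kernel's
sk_buff queues, SCC lists, unix_collect_skb() and __unix_gc(), with a
pthread mutex standing in for unix_gc_lock.

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct skb { struct skb *next; };                  /* stand-in for struct sk_buff */
struct scc { struct scc *next; bool dead; struct skb *queue; };

static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;
static struct scc *all_sccs;                       /* stand-in for the vertex/SCC lists */

/* Splice a dead SCC's queued skbs onto the hitlist (unix_collect_skb() analogue). */
static void collect_skb(struct scc *scc, struct skb **hitlist)
{
	while (scc->queue) {
		struct skb *skb = scc->queue;

		scc->queue = skb->next;
		skb->next = *hitlist;
		*hitlist = skb;
	}
}

static void gc(void)
{
	struct skb *hitlist = NULL;

	pthread_mutex_lock(&gc_lock);

	/* Walk every SCC under the lock; only dead ones feed the hitlist. */
	for (struct scc *scc = all_sccs; scc; scc = scc->next)
		if (scc->dead)
			collect_skb(scc, &hitlist);

	pthread_mutex_unlock(&gc_lock);

	/* Purge outside the lock, like __skb_queue_purge(&hitlist). */
	while (hitlist) {
		struct skb *skb = hitlist;

		hitlist = skb->next;
		free(skb);
	}
}

int main(void)
{
	static struct scc dead_scc = { .dead = true };

	dead_scc.queue = calloc(1, sizeof(struct skb)); /* one queued skb */
	all_sccs = &dead_scc;

	gc();                                           /* frees the queued skb */
	return 0;
}

Purging after the lock is released mirrors the patch: freeing the queued
skbs can itself drop socket references, so it must not run under
unix_gc_lock.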
Signed-off-by: Kuniyuki Iwashima
Acked-by: Paolo Abeni
Link: https://lore.kernel.org/r/20240325202425.60930-15-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski
(cherry picked from commit 4090fa373f0e763c43610853d2774b5979915959)
Signed-off-by: Lee Jones
---
 include/net/af_unix.h |   8 --
 net/unix/af_unix.c    |  12 --
 net/unix/garbage.c    | 318 +++++++++---------------------------------
 3 files changed, 64 insertions(+), 274 deletions(-)

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index 14d56b07a54d..47808e366731 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -19,9 +19,6 @@ static inline struct unix_sock *unix_get_socket(struct file *filp)
 
 extern spinlock_t unix_gc_lock;
 extern unsigned int unix_tot_inflight;
-
-void unix_inflight(struct user_struct *user, struct file *fp);
-void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_add_edges(struct scm_fp_list *fpl, struct unix_sock *receiver);
 void unix_del_edges(struct scm_fp_list *fpl);
 void unix_update_edges(struct unix_sock *receiver);
@@ -85,12 +82,7 @@ struct unix_sock {
 	struct sock		*peer;
 	struct sock		*listener;
 	struct unix_vertex	*vertex;
-	struct list_head	link;
-	unsigned long		inflight;
 	spinlock_t		lock;
-	unsigned long		gc_flags;
-#define UNIX_GC_CANDIDATE	0
-#define UNIX_GC_MAYBE_CYCLE	1
 	struct socket_wq	peer_wq;
 	wait_queue_entry_t	peer_wake;
 	struct scm_stat		scm_stat;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 4d8b2b2b9a70..25f66adf47d1 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -955,12 +955,10 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern,
 	sk->sk_destruct	= unix_sock_destructor;
 	u = unix_sk(sk);
 	u->listener = NULL;
-	u->inflight = 0;
 	u->vertex = NULL;
 	u->path.dentry = NULL;
 	u->path.mnt = NULL;
 	spin_lock_init(&u->lock);
-	INIT_LIST_HEAD(&u->link);
 	mutex_init(&u->iolock); /* single task reading lock */
 	mutex_init(&u->bindlock); /* single task binding lock */
 	init_waitqueue_head(&u->peer_wait);
@@ -1744,8 +1742,6 @@ static inline bool too_many_unix_fds(struct task_struct *p)
 
 static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 {
-	int i;
-
 	if (too_many_unix_fds(current))
 		return -ETOOMANYREFS;
 
@@ -1757,9 +1753,6 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 	if (!UNIXCB(skb).fp)
 		return -ENOMEM;
 
-	for (i = scm->fp->count - 1; i >= 0; i--)
-		unix_inflight(scm->fp->user, scm->fp->fp[i]);
-
 	if (unix_prepare_fpl(UNIXCB(skb).fp))
 		return -ENOMEM;
 
@@ -1768,15 +1761,10 @@ static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 
 static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
 {
-	int i;
-
 	scm->fp = UNIXCB(skb).fp;
 	UNIXCB(skb).fp = NULL;
 
 	unix_destroy_fpl(scm->fp);
-
-	for (i = scm->fp->count - 1; i >= 0; i--)
-		unix_notinflight(scm->fp->user, scm->fp->fp[i]);
 }
 
 static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index 1f53c25fc71b..89ea71d9297b 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -322,6 +322,52 @@ static bool unix_vertex_dead(struct unix_vertex *vertex)
 	return true;
 }
 
+enum unix_recv_queue_lock_class {
+	U_RECVQ_LOCK_NORMAL,
+	U_RECVQ_LOCK_EMBRYO,
+};
+
+static void unix_collect_skb(struct list_head *scc, struct sk_buff_head *hitlist)
+{
+	struct unix_vertex *vertex;
+
+	list_for_each_entry_reverse(vertex, scc, scc_entry) {
+		struct sk_buff_head *queue;
+		struct unix_edge *edge;
+		struct unix_sock *u;
+
+		edge = list_first_entry(&vertex->edges, typeof(*edge), vertex_entry);
+		u = edge->predecessor;
+		queue = &u->sk.sk_receive_queue;
+
+		spin_lock(&queue->lock);
+
+		if (u->sk.sk_state == TCP_LISTEN) {
+			struct sk_buff *skb;
+
+			skb_queue_walk(queue, skb) {
+				struct sk_buff_head *embryo_queue = &skb->sk->sk_receive_queue;
+
+				/* listener -> embryo order, the inversion never happens. */
+				spin_lock_nested(&embryo_queue->lock, U_RECVQ_LOCK_EMBRYO);
+				skb_queue_splice_init(embryo_queue, hitlist);
+				spin_unlock(&embryo_queue->lock);
+			}
+		} else {
+			skb_queue_splice_init(queue, hitlist);
+
+#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+			if (u->oob_skb) {
+				kfree_skb(u->oob_skb);
+				u->oob_skb = NULL;
+			}
+#endif
+		}
+
+		spin_unlock(&queue->lock);
+	}
+}
+
 static bool unix_scc_cyclic(struct list_head *scc)
 {
 	struct unix_vertex *vertex;
@@ -345,7 +391,8 @@ static bool unix_scc_cyclic(struct list_head *scc)
 static LIST_HEAD(unix_visited_vertices);
 static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2;
 
-static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_index)
+static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_index,
+			    struct sk_buff_head *hitlist)
 {
 	LIST_HEAD(vertex_stack);
 	struct unix_edge *edge;
@@ -430,7 +477,9 @@ static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_inde
 			scc_dead = unix_vertex_dead(vertex);
 		}
 
-		if (!unix_graph_maybe_cyclic)
+		if (scc_dead)
+			unix_collect_skb(&scc, hitlist);
+		else if (!unix_graph_maybe_cyclic)
 			unix_graph_maybe_cyclic = unix_scc_cyclic(&scc);
 
 		list_del(&scc);
@@ -441,7 +490,7 @@ static void __unix_walk_scc(struct unix_vertex *vertex, unsigned long *last_inde
 	goto prev_vertex;
 }
 
-static void unix_walk_scc(void)
+static void unix_walk_scc(struct sk_buff_head *hitlist)
 {
 	unsigned long last_index = UNIX_VERTEX_INDEX_START;
 
@@ -454,7 +503,7 @@ static void unix_walk_scc(void)
 		struct unix_vertex *vertex;
 
 		vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry);
-		__unix_walk_scc(vertex, &last_index);
+		__unix_walk_scc(vertex, &last_index, hitlist);
 	}
 
 	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
@@ -463,7 +512,7 @@ static void unix_walk_scc(void)
 	unix_graph_grouped = true;
 }
 
-static void unix_walk_scc_fast(void)
+static void unix_walk_scc_fast(struct sk_buff_head *hitlist)
 {
 	while (!list_empty(&unix_unvisited_vertices)) {
 		struct unix_vertex *vertex;
@@ -480,279 +529,40 @@ static void unix_walk_scc_fast(void)
 			scc_dead = unix_vertex_dead(vertex);
 		}
 
+		if (scc_dead)
+			unix_collect_skb(&scc, hitlist);
+
 		list_del(&scc);
 	}
 
 	list_replace_init(&unix_visited_vertices, &unix_unvisited_vertices);
 }
 
-static LIST_HEAD(gc_candidates);
-static LIST_HEAD(gc_inflight_list);
-
-/* Keep the number of times in flight count for the file
- * descriptor if it is for an AF_UNIX socket.
- */
-void unix_inflight(struct user_struct *user, struct file *filp)
-{
-	struct unix_sock *u = unix_get_socket(filp);
-
-	spin_lock(&unix_gc_lock);
-
-	if (u) {
-		if (!u->inflight) {
-			WARN_ON_ONCE(!list_empty(&u->link));
-			list_add_tail(&u->link, &gc_inflight_list);
-		} else {
-			WARN_ON_ONCE(list_empty(&u->link));
-		}
-		u->inflight++;
-	}
-
-	spin_unlock(&unix_gc_lock);
-}
-
-void unix_notinflight(struct user_struct *user, struct file *filp)
-{
-	struct unix_sock *u = unix_get_socket(filp);
-
-	spin_lock(&unix_gc_lock);
-
-	if (u) {
-		WARN_ON_ONCE(!u->inflight);
-		WARN_ON_ONCE(list_empty(&u->link));
-
-		u->inflight--;
-		if (!u->inflight)
-			list_del_init(&u->link);
-	}
-
-	spin_unlock(&unix_gc_lock);
-}
-
-static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
-			  struct sk_buff_head *hitlist)
-{
-	struct sk_buff *skb;
-	struct sk_buff *next;
-
-	spin_lock(&x->sk_receive_queue.lock);
-	skb_queue_walk_safe(&x->sk_receive_queue, skb, next) {
-		/* Do we have file descriptors ? */
-		if (UNIXCB(skb).fp) {
-			bool hit = false;
-			/* Process the descriptors of this socket */
-			int nfd = UNIXCB(skb).fp->count;
-			struct file **fp = UNIXCB(skb).fp->fp;
-
-			while (nfd--) {
-				/* Get the socket the fd matches if it indeed does so */
-				struct unix_sock *u = unix_get_socket(*fp++);
-
-				/* Ignore non-candidates, they could have been added
-				 * to the queues after starting the garbage collection
-				 */
-				if (u && test_bit(UNIX_GC_CANDIDATE, &u->gc_flags)) {
-					hit = true;
-
-					func(u);
-				}
-			}
-			if (hit && hitlist != NULL) {
-				__skb_unlink(skb, &x->sk_receive_queue);
-				__skb_queue_tail(hitlist, skb);
-			}
-		}
-	}
-	spin_unlock(&x->sk_receive_queue.lock);
-}
-
-static void scan_children(struct sock *x, void (*func)(struct unix_sock *),
-			  struct sk_buff_head *hitlist)
-{
-	if (x->sk_state != TCP_LISTEN) {
-		scan_inflight(x, func, hitlist);
-	} else {
-		struct sk_buff *skb;
-		struct sk_buff *next;
-		struct unix_sock *u;
-		LIST_HEAD(embryos);
-
-		/* For a listening socket collect the queued embryos
-		 * and perform a scan on them as well.
-		 */
-		spin_lock(&x->sk_receive_queue.lock);
-		skb_queue_walk_safe(&x->sk_receive_queue, skb, next) {
-			u = unix_sk(skb->sk);
-
-			/* An embryo cannot be in-flight, so it's safe
-			 * to use the list link.
-			 */
-			WARN_ON_ONCE(!list_empty(&u->link));
-			list_add_tail(&u->link, &embryos);
-		}
-		spin_unlock(&x->sk_receive_queue.lock);
-
-		while (!list_empty(&embryos)) {
-			u = list_entry(embryos.next, struct unix_sock, link);
-			scan_inflight(&u->sk, func, hitlist);
-			list_del_init(&u->link);
-		}
-	}
-}
-
-static void dec_inflight(struct unix_sock *usk)
-{
-	usk->inflight--;
-}
-
-static void inc_inflight(struct unix_sock *usk)
-{
-	usk->inflight++;
-}
-
-static void inc_inflight_move_tail(struct unix_sock *u)
-{
-	u->inflight++;
-
-	/* If this still might be part of a cycle, move it to the end
-	 * of the list, so that it's checked even if it was already
-	 * passed over
-	 */
-	if (test_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags))
-		list_move_tail(&u->link, &gc_candidates);
-}
-
 static bool gc_in_progress;
 
 static void __unix_gc(struct work_struct *work)
 {
 	struct sk_buff_head hitlist;
-	struct unix_sock *u, *next;
-	LIST_HEAD(not_cycle_list);
-	struct list_head cursor;
 
 	spin_lock(&unix_gc_lock);
 
-	if (!unix_graph_maybe_cyclic)
+	if (!unix_graph_maybe_cyclic) {
+		spin_unlock(&unix_gc_lock);
 		goto skip_gc;
-
-	if (unix_graph_grouped)
-		unix_walk_scc_fast();
-	else
-		unix_walk_scc();
-
-	/* First, select candidates for garbage collection.  Only
-	 * in-flight sockets are considered, and from those only ones
-	 * which don't have any external reference.
-	 *
-	 * Holding unix_gc_lock will protect these candidates from
-	 * being detached, and hence from gaining an external
-	 * reference. Since there are no possible receivers, all
-	 * buffers currently on the candidates' queues stay there
-	 * during the garbage collection.
-	 *
-	 * We also know that no new candidate can be added onto the
-	 * receive queues. Other, non candidate sockets _can_ be
-	 * added to queue, so we must make sure only to touch
-	 * candidates.
-	 *
-	 * Embryos, though never candidates themselves, affect which
-	 * candidates are reachable by the garbage collector. Before
-	 * being added to a listener's queue, an embryo may already
-	 * receive data carrying SCM_RIGHTS, potentially making the
-	 * passed socket a candidate that is not yet reachable by the
-	 * collector. It becomes reachable once the embryo is
-	 * enqueued. Therefore, we must ensure that no SCM-laden
-	 * embryo appears in a (candidate) listener's queue between
-	 * consecutive scan_children() calls.
-	 */
-	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
-		struct sock *sk = &u->sk;
-		long total_refs;
-
-		total_refs = file_count(sk->sk_socket->file);
-
-		WARN_ON_ONCE(!u->inflight);
-		WARN_ON_ONCE(total_refs < u->inflight);
-		if (total_refs == u->inflight) {
-			list_move_tail(&u->link, &gc_candidates);
-			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
-
-			if (sk->sk_state == TCP_LISTEN) {
-				unix_state_lock_nested(sk, U_LOCK_GC_LISTENER);
-				unix_state_unlock(sk);
-			}
-		}
-	}
-
-	/* Now remove all internal in-flight reference to children of
-	 * the candidates.
-	 */
-	list_for_each_entry(u, &gc_candidates, link)
-		scan_children(&u->sk, dec_inflight, NULL);
-
-	/* Restore the references for children of all candidates,
-	 * which have remaining references. Do this recursively, so
-	 * only those remain, which form cyclic references.
-	 *
-	 * Use a "cursor" link, to make the list traversal safe, even
-	 * though elements might be moved about.
-	 */
-	list_add(&cursor, &gc_candidates);
-	while (cursor.next != &gc_candidates) {
-		u = list_entry(cursor.next, struct unix_sock, link);
-
-		/* Move cursor to after the current position. */
-		list_move(&cursor, &u->link);
-
-		if (u->inflight) {
-			list_move_tail(&u->link, &not_cycle_list);
-			__clear_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
-			scan_children(&u->sk, inc_inflight_move_tail, NULL);
-		}
 	}
-	list_del(&cursor);
 
-	/* Now gc_candidates contains only garbage. Restore original
-	 * inflight counters for these as well, and remove the skbuffs
-	 * which are creating the cycle(s).
-	 */
-	skb_queue_head_init(&hitlist);
-	list_for_each_entry(u, &gc_candidates, link) {
-		scan_children(&u->sk, inc_inflight, &hitlist);
+	__skb_queue_head_init(&hitlist);
 
-#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
-		if (u->oob_skb) {
-			kfree_skb(u->oob_skb);
-			u->oob_skb = NULL;
-		}
-#endif
-	}
-
-	/* not_cycle_list contains those sockets which do not make up a
-	 * cycle. Restore these to the inflight list.
-	 */
-	while (!list_empty(&not_cycle_list)) {
-		u = list_entry(not_cycle_list.next, struct unix_sock, link);
-		__clear_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
-		list_move_tail(&u->link, &gc_inflight_list);
-	}
+	if (unix_graph_grouped)
+		unix_walk_scc_fast(&hitlist);
+	else
+		unix_walk_scc(&hitlist);
 
 	spin_unlock(&unix_gc_lock);
 
-	/* Here we are. Hitlist is filled. Die. */
 	__skb_queue_purge(&hitlist);
-
-	spin_lock(&unix_gc_lock);
-
-	/* All candidates should have been detached by now. */
-	WARN_ON_ONCE(!list_empty(&gc_candidates));
 skip_gc:
-	/* Paired with READ_ONCE() in wait_for_unix_gc(). */
 	WRITE_ONCE(gc_in_progress, false);
-
-	spin_unlock(&unix_gc_lock);
 }
 
 static DECLARE_WORK(unix_gc_work, __unix_gc);
-- 
2.49.0.1143.g0be31eac6b-goog