[v4] coredump: add coredump socket

[PATCH v4 04/11] net: reserve prefix

Posted by Christian Brauner 9 months ago

Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
CAP_NET_ADMIN in the owning user namespace of the network namespace to
bind it. This will be used in next patches to support the coredump
socket but is a generally useful concept.

The collision risk is so low that we can just start using it. Userspace
must already be prepared to retry if a given abstract address isn't
usable anyway.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/uapi/linux/un.h |  2 ++
 net/unix/af_unix.c      | 39 +++++++++++++++++++++++++++++++++++----
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/un.h b/include/uapi/linux/un.h
index 0ad59dc8b686..bbd5ad508dfa 100644
--- a/include/uapi/linux/un.h
+++ b/include/uapi/linux/un.h
@@ -5,6 +5,8 @@
 #include <linux/socket.h>
 
 #define UNIX_PATH_MAX	108
+/* reserved AF_UNIX socket namespace. */
+#define UNIX_SOCKET_NAMESPACE "linuxafsk/"
 
 struct sockaddr_un {
 	__kernel_sa_family_t sun_family; /* AF_UNIX */
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 472f8aa9ea15..148d008862e7 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -114,6 +114,13 @@ static atomic_long_t unix_nr_socks;
 static struct hlist_head bsd_socket_buckets[UNIX_HASH_SIZE / 2];
 static spinlock_t bsd_socket_locks[UNIX_HASH_SIZE / 2];
 
+static const struct sockaddr_un linuxafsk_addr = {
+	.sun_family = AF_UNIX,
+	.sun_path = "\0"UNIX_SOCKET_NAMESPACE,
+};
+
+#define UNIX_SOCKET_NAMESPACE_ADDR_LEN (offsetof(struct sockaddr_un, sun_path) + sizeof(UNIX_SOCKET_NAMESPACE))
+
 /* SMP locking strategy:
  *    hash table is protected with spinlock.
  *    each socket state is protected by separate spinlock.
@@ -436,6 +443,30 @@ static struct sock *__unix_find_socket_byname(struct net *net,
 	return NULL;
 }
 
+static int unix_may_bind_name(struct net *net, struct sockaddr_un *sunname,
+			      int len, unsigned int hash)
+{
+	struct sock *s;
+
+	s = __unix_find_socket_byname(net, sunname, len, hash);
+	if (s)
+		return -EADDRINUSE;
+
+	/*
+	 * Check whether this is our reserved prefix and if so ensure
+	 * that only privileged processes can bind it.
+	 */
+	if (UNIX_SOCKET_NAMESPACE_ADDR_LEN <= len &&
+	    !memcmp(&linuxafsk_addr, sunname, UNIX_SOCKET_NAMESPACE_ADDR_LEN)) {
+		/* Don't bind the namespace itself. */
+		if (UNIX_SOCKET_NAMESPACE_ADDR_LEN == len)
+			return -ECONNREFUSED;
+		if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+			return -ECONNREFUSED;
+	}
+	return 0;
+}
+
 static inline struct sock *unix_find_socket_byname(struct net *net,
 						   struct sockaddr_un *sunname,
 						   int len, unsigned int hash)
@@ -1258,10 +1289,10 @@ static int unix_autobind(struct sock *sk)
 	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
 	unix_table_double_lock(net, old_hash, new_hash);
 
-	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash)) {
+	if (unix_may_bind_name(net, addr->name, addr->len, new_hash)) {
 		unix_table_double_unlock(net, old_hash, new_hash);
 
-		/* __unix_find_socket_byname() may take long time if many names
+		/* unix_may_bind_name() may take long time if many names
 		 * are already in use.
 		 */
 		cond_resched();
@@ -1379,7 +1410,8 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
 	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
 	unix_table_double_lock(net, old_hash, new_hash);
 
-	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash))
+	err = unix_may_bind_name(net, addr->name, addr->len, new_hash);
+	if (err)
 		goto out_spin;
 
 	__unix_set_addr_hash(net, sk, addr, new_hash);
@@ -1389,7 +1421,6 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
 
 out_spin:
 	unix_table_double_unlock(net, old_hash, new_hash);
-	err = -EADDRINUSE;
 out_mutex:
 	mutex_unlock(&u->bindlock);
 out:

-- 
2.47.2
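
The length arithmetic in unix_may_bind_name() can be modelled in userspace. This is only a sketch of the comparison, not the kernel code. RESERVED_PREFIX mirrors UNIX_SOCKET_NAMESPACE from the patch (the define is assumed to be absent from installed headers), and make_abstract(), classify_bind() and enum bind_class are hypothetical names introduced for illustration. It shows why sizeof() of the prefix literal is used: the literal's terminating NUL accounts for the leading NUL byte of an abstract address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Assumption: mirrors UNIX_SOCKET_NAMESPACE from the patch. */
#define RESERVED_PREFIX "linuxafsk/"
/* sun_family + leading NUL + prefix; sizeof() counts the literal's NUL. */
#define RESERVED_ADDR_LEN \
	(offsetof(struct sockaddr_un, sun_path) + sizeof(RESERVED_PREFIX))

/* Hypothetical outcome of the prefix check for a bind() request. */
enum bind_class {
	BIND_UNRESERVED,	/* prefix does not match, no extra check */
	BIND_NEEDS_CAP,		/* prefix matches, patch requires CAP_NET_ADMIN */
	BIND_REFUSED,		/* bare prefix, patch returns -ECONNREFUSED */
};

/* Build the abstract address "\0" + name and return its address length. */
static socklen_t make_abstract(struct sockaddr_un *addr, const char *name)
{
	size_t n = strlen(name);

	memset(addr, 0, sizeof(*addr));
	addr->sun_family = AF_UNIX;
	if (n > sizeof(addr->sun_path) - 1)
		n = sizeof(addr->sun_path) - 1;
	memcpy(addr->sun_path + 1, name, n);	/* sun_path[0] stays NUL */
	return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);
}

/* Same comparison as the patch: family and prefix bytes, then the length. */
static enum bind_class classify_bind(const struct sockaddr_un *addr,
				     socklen_t len)
{
	struct sockaddr_un reserved;
	socklen_t rlen = make_abstract(&reserved, RESERVED_PREFIX);

	assert(rlen == RESERVED_ADDR_LEN);
	if (len < rlen || memcmp(&reserved, addr, rlen))
		return BIND_UNRESERVED;
	if (len == rlen)
		return BIND_REFUSED;
	return BIND_NEEDS_CAP;
}
```

Under these assumptions an address such as "\0linuxafsk/coredump.socket" would fall into the privileged class, the bare "\0linuxafsk/" would always be refused, and any other abstract name would be left alone.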

Re: [PATCH v4 04/11] net: reserve prefix

Posted by Kuniyuki Iwashima 9 months ago

From: Christian Brauner <brauner@kernel.org>
Date: Wed, 07 May 2025 18:13:37 +0200
> Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
> CAP_NET_ADMIN in the owning user namespace of the network namespace to
> bind it. This will be used in next patches to support the coredump
> socket but is a generally useful concept.

I really think we shouldn't reserve an address; it should be
configurable by users via core_pattern, as with the other
coredump types.

AF_UNIX doesn't support SO_REUSEPORT, so while the old socket is
dying, the user can't start a new coredump listener until it's
fully cleaned up, which is an unnecessary drawback.
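
The rebind behaviour can be observed from userspace with a small sketch, assuming a Linux system with abstract AF_UNIX sockets. listen_abstract() is a hypothetical helper introduced here, not part of the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind and listen on the abstract address
 * "\0" + name. Returns the listening fd or a negative errno.
 */
static int listen_abstract(const char *name)
{
	struct sockaddr_un addr;
	socklen_t len;
	size_t n;
	int fd, err;

	assert(name != NULL);
	n = strlen(name);
	if (n == 0 || n > sizeof(addr.sun_path) - 1)
		return -EINVAL;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	memcpy(addr.sun_path + 1, name, n);	/* sun_path[0] stays NUL */
	len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -errno;

	/* There is no SO_REUSEPORT equivalent: a name in use cannot be shared. */
	if (bind(fd, (struct sockaddr *)&addr, len) < 0 || listen(fd, 16) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}
```

A second listen_abstract() call with the same name is expected to return -EADDRINUSE for as long as any reference keeps the first socket alive, and to succeed only once the old socket has been released.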

The semantics should be the same as for the other types: all the
coredump service has to do is prepare a target (file, process,
socket) that can receive data and set its name in core_pattern.

Also, abstract sockets are namespaced by design, so there is
no point in enforcing the same restriction on non-initial netns.



Re: [PATCH v4 04/11] net: reserve prefix

Posted by Christian Brauner 9 months ago

On Wed, May 07, 2025 at 03:45:52PM -0700, Kuniyuki Iwashima wrote:
> From: Christian Brauner <brauner@kernel.org>
> Date: Wed, 07 May 2025 18:13:37 +0200
> > Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
> > CAP_NET_ADMIN in the owning user namespace of the network namespace to
> > bind it. This will be used in next patches to support the coredump
> > socket but is a generally useful concept.
> 
> I really think we shouldn't reserve address and it should be
> configurable by users via core_pattern as with the other
> coredump types.
> 
> AF_UNIX doesn't support SO_REUSEPORT, so once the socket is
> dying, user can't start the new coredump listener until it's
> fully cleaned up, which adds unnecessary drawback.

This really doesn't matter.

> The semantic should be same with other types, and the todo
> for the coredump service is prepare file (file, process, socket)
> that can receive data and set its name to core_pattern.

We need to perform a capability check during bind() for the host's
coredump socket. Otherwise if the coredump server crashes an
unprivileged attacker can simply bind the address and receive all
coredumps from suid binaries.

This is also a problem for legitimate coredump server updates. To change
the coredump address the coredump server must first set up a new socket,
then update core_pattern, and then shut down the old coredump socket.

Now an unprivileged attacker can rebind the old coredump socket address
but there's still a crashing task that got scheduled out after it copied
the old coredump server address but before it connected to the coredump
server. The new server is now up and the old server's address has been
reused by the attacker. Now the crashing task gets scheduled back in and
connects to the unprivileged attacker and forwards its suid dump to the
attacker.

The name of the socket needs to be protected. This can be done by prefix
but the simplest way is what I did in my earlier version: just use
a well-known name. The name itself really doesn't matter, and making it
configurable only adds potential for subtle bugs. I want the coredump
code I have to maintain to have as few moving parts as possible.

I'm happy to drop the patch to reserve the prefix as that seems to
bother you. But the coredump socket name won't be configurable. It'd be
good if we could just compromise here. Without the capability check on
bind we can just throw this all out as that's never going to be safe.
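
A coredump server could check up front whether it holds the capability that bind() would require. The following is a minimal sketch, assuming the raw capget(2) interface from <linux/capability.h>. has_cap_net_admin() is a hypothetical helper, and it only inspects the caller's effective set, so it is at best an approximation of the kernel's ns_capable() check against the user namespace that owns the network namespace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if CAP_NET_ADMIN is in the calling
 * thread's effective set, 0 if it is not, or a negative errno.
 */
static int has_cap_net_admin(void)
{
	struct __user_cap_header_struct hdr;
	struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

	/* CAP_NET_ADMIN must fall inside the array capget() fills in. */
	assert(CAP_TO_INDEX(CAP_NET_ADMIN) < _LINUX_CAPABILITY_U32S_3);

	memset(&hdr, 0, sizeof(hdr));
	memset(data, 0, sizeof(data));
	hdr.version = _LINUX_CAPABILITY_VERSION_3;
	hdr.pid = 0;	/* 0 means the calling thread */

	if (syscall(SYS_capget, &hdr, data) < 0)
		return -errno;

	return !!(data[CAP_TO_INDEX(CAP_NET_ADMIN)].effective &
		  CAP_TO_MASK(CAP_NET_ADMIN));
}
```

A server that gets 0 back here can report a clear error instead of discovering the missing privilege through a failed bind().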

Re: [PATCH v4 04/11] net: reserve prefix

Posted by Kuniyuki Iwashima 9 months ago

From: Christian Brauner <brauner@kernel.org>
Date: Thu, 8 May 2025 08:16:29 +0200
> On Wed, May 07, 2025 at 03:45:52PM -0700, Kuniyuki Iwashima wrote:
> > From: Christian Brauner <brauner@kernel.org>
> > Date: Wed, 07 May 2025 18:13:37 +0200
> > > Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
> > > CAP_NET_ADMIN in the owning user namespace of the network namespace to
> > > bind it. This will be used in next patches to support the coredump
> > > socket but is a generally useful concept.
> > 
> > I really think we shouldn't reserve address and it should be
> > configurable by users via core_pattern as with the other
> > coredump types.
> > 
> > AF_UNIX doesn't support SO_REUSEPORT, so once the socket is
> > dying, user can't start the new coredump listener until it's
> > fully cleaned up, which adds unnecessary drawback.
> 
> This really doesn't matter.
> 
> > The semantic should be same with other types, and the todo
> > for the coredump service is prepare file (file, process, socket)
> > that can receive data and set its name to core_pattern.
> 
> We need to perform a capability check during bind() for the host's
> coredump socket. Otherwise if the coredump server crashes an
> unprivileged attacker can simply bind the address and receive all
> coredumps from suid binaries.

As I mentioned in the previous thread, this can be better
handled by BPF LSM with more fine-grained rules.

1. register a socket with its name in a BPF map
2. check if the destination socket is registered at connect

Even when LSM is not available, the cgroup BPF prog can make
connect() fail if the destination name is not registered
in the map.

> 
> This is also a problem for legitimate coredump server updates. To change
> the coredump address the coredump server must first setup a new socket
> and then update core_pattern and then shutdown the old coredump socket.

So, for completeness, the server should set up a cgroup BPF
prog to route the request for the old name to the new one.

Here, the bpf map above can be reused to check if the socket
name is registered in the map or route to another socket in
the map.

Then, the unprivileged issue below and the non-dumpable issue
mentioned in the cover letter can also be resolved.

The server is expected to have CAP_SYS_ADMIN, so BPF should
play a role.


> 
> Now an unprivileged attacker can rebind the old coredump socket address
> but there's still a crashing task that got scheduled out after it copied
> the old coredump server address but before it connected to the coredump
> server. The new server is now up and the old server's address has been
> reused by the attacker. Now the crashing task gets scheduled back in and
> connects to the unprivileged attacker and forwards its suid dump to the
> attacker.
> 
> The name of the socket needs to be protected. This can be done by prefix
> but the simplest way is what I did in my earlier version and to just use
> a well-known name. The name really doesn't matter and all it adds is
> potential for subtle bugs. I want the coredump code I have to maintain
> to have as little moving parts as possible.
> 
> I'm happy to drop the patch to reserve the prefix as that seems to
> bother you. But the coredump socket name won't be configurable. It'd be
> good if we could just compromise here. Without the capability check on
> bind we can just throw this all out as that's never going to be safe.

Re: [PATCH v4 04/11] net: reserve prefix

Posted by Christian Brauner 9 months ago

On Thu, May 08, 2025 at 02:47:45PM -0700, Kuniyuki Iwashima wrote:
> From: Christian Brauner <brauner@kernel.org>
> Date: Thu, 8 May 2025 08:16:29 +0200
> > On Wed, May 07, 2025 at 03:45:52PM -0700, Kuniyuki Iwashima wrote:
> > > From: Christian Brauner <brauner@kernel.org>
> > > Date: Wed, 07 May 2025 18:13:37 +0200
> > > > Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
> > > > CAP_NET_ADMIN in the owning user namespace of the network namespace to
> > > > bind it. This will be used in next patches to support the coredump
> > > > socket but is a generally useful concept.
> > > 
> > > I really think we shouldn't reserve address and it should be
> > > configurable by users via core_pattern as with the other
> > > coredump types.
> > > 
> > > AF_UNIX doesn't support SO_REUSEPORT, so once the socket is
> > > dying, user can't start the new coredump listener until it's
> > > fully cleaned up, which adds unnecessary drawback.
> > 
> > This really doesn't matter.
> > 
> > > The semantic should be same with other types, and the todo
> > > for the coredump service is prepare file (file, process, socket)
> > > that can receive data and set its name to core_pattern.
> > 
> > We need to perform a capability check during bind() for the host's
> > coredump socket. Otherwise if the coredump server crashes an
> > unprivileged attacker can simply bind the address and receive all
> > coredumps from suid binaries.
> 
> As I mentioned in the previous thread, this can be better
> handled by BPF LSM with more fine-grained rule.
> 
> 1. register a socket with its name to BPF map
> 2. check if the destination socket is registered at connect
> 
> Even when LSM is not available, the cgroup BPF prog can make
> connect() fail if the destination name is not registered
> in the map.
> 
> > 
> > This is also a problem for legitimate coredump server updates. To change
> > the coredump address the coredump server must first setup a new socket
> > and then update core_pattern and then shutdown the old coredump socket.
> 
> So, for completeness, the server should set up a cgroup BPF
> prog to route the request for the old name to the new one.
> 
> Here, the bpf map above can be reused to check if the socket
> name is registered in the map or route to another socket in
> the map.
> 
> Then, the unprivileged issue below and the non-dumpable issue
> mentioned in the cover letter can also be resolved.
> 
> The server is expected to have CAP_SYS_ADMIN, so BPF should
> play a role.

This has been explained by multiple people over the course of this
thread already. It is simply not acceptable for basic kernel
functionality to be unsafe without the use of additional separate
subsystems. It is not ok to require bpf for a core kernel api to be
safely usable. It's irrelevant whether that's for security or cgroup
hooks. None of which we can require.

I won't even get this past Linus for that matter because he will rightly
NAK that hard and probably ask me whether I've paid any attention to
basic kernel development requirements in the last 10 years. Let alone
for coredumping which handles crashing suid binaries. I understand the
urge to outsource this problem to userspace but that's not ok.

Coredumping is a core kernel service and all options have to be safely
usable by themselves. In fact, that goes for any kernel API and
especially VFS apis.

Using AF_UNIX sockets will be a major step forward in both simplicity
and security. We've compromised on every front so far. It's not too much
to ask for a basic permission check on a single well-known address
that's exposed as a kernel-level service.

Re: [PATCH v4 04/11] net: reserve prefix

Posted by Daniel Borkmann 9 months ago

On 5/9/25 7:54 AM, Christian Brauner wrote:
> On Thu, May 08, 2025 at 02:47:45PM -0700, Kuniyuki Iwashima wrote:
>> From: Christian Brauner <brauner@kernel.org>
>> Date: Thu, 8 May 2025 08:16:29 +0200
>>> On Wed, May 07, 2025 at 03:45:52PM -0700, Kuniyuki Iwashima wrote:
>>>> From: Christian Brauner <brauner@kernel.org>
>>>> Date: Wed, 07 May 2025 18:13:37 +0200
>>>>> Add the reserved "linuxafsk/" prefix for AF_UNIX sockets and require
>>>>> CAP_NET_ADMIN in the owning user namespace of the network namespace to
>>>>> bind it. This will be used in next patches to support the coredump
>>>>> socket but is a generally useful concept.
>>>>
>>>> I really think we shouldn't reserve address and it should be
>>>> configurable by users via core_pattern as with the other
>>>> coredump types.
>>>>
>>>> AF_UNIX doesn't support SO_REUSEPORT, so once the socket is
>>>> dying, user can't start the new coredump listener until it's
>>>> fully cleaned up, which adds unnecessary drawback.
>>>
>>> This really doesn't matter.
>>>
>>>> The semantic should be same with other types, and the todo
>>>> for the coredump service is prepare file (file, process, socket)
>>>> that can receive data and set its name to core_pattern.
>>>
>>> We need to perform a capability check during bind() for the host's
>>> coredump socket. Otherwise if the coredump server crashes an
>>> unprivileged attacker can simply bind the address and receive all
>>> coredumps from suid binaries.
>>
>> As I mentioned in the previous thread, this can be better
>> handled by BPF LSM with more fine-grained rule.
>>
>> 1. register a socket with its name to BPF map
>> 2. check if the destination socket is registered at connect
>>
>> Even when LSM is not available, the cgroup BPF prog can make
>> connect() fail if the destination name is not registered
>> in the map.
>>
>>> This is also a problem for legitimate coredump server updates. To change
>>> the coredump address the coredump server must first setup a new socket
>>> and then update core_pattern and then shutdown the old coredump socket.
>>
>> So, for completeness, the server should set up a cgroup BPF
>> prog to route the request for the old name to the new one.
>>
>> Here, the bpf map above can be reused to check if the socket
>> name is registered in the map or route to another socket in
>> the map.
>>
>> Then, the unprivileged issue below and the non-dumpable issue
>> mentioned in the cover letter can also be resolved.
>>
>> The server is expected to have CAP_SYS_ADMIN, so BPF should
>> play a role.
> 
> This has been explained by multiple people over the course of this
> thread already. It is simply not acceptable for basic kernel
> functionality to be unsafe without the use of additional separate
> subsystems. It is not ok to require bpf for a core kernel api to be
> safely usable. It's irrelevant whether that's for security or cgroup
> hooks. None of which we can require.

As much as I like BPF, I agree with Christian here that we should
not additionally rely on other subsystems, which might even be compiled
out in some cases where coredumps are needed (e.g. embedded).