[PATCH 1/2] pid_namespace: allow opening pid_for_children before init was created

Pavel Tikhomirov posted 2 patches 1 month, 1 week ago
There is a newer version of this series
[PATCH 1/2] pid_namespace: allow opening pid_for_children before init was created
Posted by Pavel Tikhomirov 1 month, 1 week ago
This effectively gives us an ability to create the pid namespace init as
a child of the process (setns-ed to the pid namespace) different to the
process which created the pid namespace itself.

Original problem:

There is a cool set_tid feature in clone3() syscall, it allows you to
create process with desired pids on multiple pid namespace levels. Which
is useful to restore processes in CRIU for nested pid namespace case.

In nested container case we can potentially see this kind of pid/user
namespace tree:

                            Process
                          ┌─────────┐
    User NS0 ──▶ Pid NS0 ──▶ Pid p0 │
        │           │     │         │
        ▼           ▼     │         │
    User NS1 ──▶ Pid NS1 ──▶ Pid p1 │
        │           │     │         │
       ...         ...    │   ...   │
        │           │     │         │
        ▼           ▼     │         │
    User NSn ──▶ Pid NSn ──▶ Pid pn │
                          └─────────┘

So to create the "Process" and set pids {p0, p1, ... pn} for it on all
pid namespace levels we can use clone3() syscall set_tid feature, BUT
the syscall does not allow you to set pid on pid namespace levels you
don't have permission to. So basically you have to be in "User NS0" when
creating the "Process" to actually be able to set pids on all levels.

It is ok for almost any process, but with pid namespace init this does
not work, as currently we can only create pid namespace init and the pid
namespace itself simultaneously, so to make "Pid NSn" owned by "User
NSn" we have to be in the "User NSn".

We can't possibly be in "User NS0" and "User NSn" at the same time,
hence the problem.

Alternative solution:

Yes, for the case of pid namespace init we can use old and gold
/proc/sys/kernel/ns_last_pid interface on the levels lower than n. But
it is much more complicated and introduces tons of extra code to do. It
would be nice to make clone3() set_tid interface also aplicable to this
corner case.

Implementation:

Now when anyone can setns to the pid namespace before the creation of
init, and thus multiple processes can fork children to the pid
namespace, we enforce that the first process created is always the init,
and only allow other processes after the init passed the same steps as
were previously required to allow to do setns.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
---
 kernel/pid.c           | 10 +++++++++-
 kernel/pid_namespace.c |  9 ---------
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index a31771bc89c1..d549f08036ab 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -188,9 +188,15 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	pid->level = ns->level;
 
 	for (i = ns->level; i >= 0; i--) {
+		bool pidns_ready = true;
 		int tid = 0;
 		int pid_max = READ_ONCE(tmp->pid_max);
 
+		read_lock(&tasklist_lock);
+		if (!tmp->child_reaper)
+			pidns_ready = false;
+		read_unlock(&tasklist_lock);
+
 		if (set_tid_size) {
 			tid = set_tid[ns->level - i];
 
@@ -201,7 +207,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			 * Also fail if a PID != 1 is requested and
 			 * no PID 1 exists.
 			 */
-			if (tid != 1 && !tmp->child_reaper)
+			if (tid != 1 && !pidns_ready)
 				goto out_free;
 			retval = -EPERM;
 			if (!checkpoint_restore_ns_capable(tmp->user_ns))
@@ -221,6 +227,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			 */
 			if (nr == -ENOSPC)
 				nr = -EEXIST;
+		} else if (!pidns_ready && idr_get_cursor(&tmp->idr) != 0) {
+			nr = -EINVAL;
 		} else {
 			int pid_min = 1;
 			/*
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index e48f5de41361..d36afc58ee1d 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -369,15 +369,6 @@ static struct ns_common *pidns_for_children_get(struct task_struct *task)
 	}
 	task_unlock(task);
 
-	if (ns) {
-		read_lock(&tasklist_lock);
-		if (!ns->child_reaper) {
-			put_pid_ns(ns);
-			ns = NULL;
-		}
-		read_unlock(&tasklist_lock);
-	}
-
 	return ns ? &ns->ns : NULL;
 }
 
-- 
2.53.0

Re: [PATCH 1/2] pid_namespace: allow opening pid_for_children before init was created
Posted by Oleg Nesterov 1 month, 1 week ago
Pavel,

Your patch doesn't apply to Linus's tree.

and in any case... can you avoid read_lock(tasklist) in alloc_pid() ?
This is really not good.

Oleg.

On 02/20, Pavel Tikhomirov wrote:
>
> This effectively gives us an ability to create the pid namespace init as
> a child of the process (setns-ed to the pid namespace) different to the
> process which created the pid namespace itself.
> 
> Original problem:
> 
> There is a cool set_tid feature in clone3() syscall, it allows you to
> create process with desired pids on multiple pid namespace levels. Which
> is useful to restore processes in CRIU for nested pid namespace case.
> 
> In nested container case we can potentially see this kind of pid/user
> namespace tree:
> 
>                             Process
>                           ┌─────────┐
>     User NS0 ──▶ Pid NS0 ──▶ Pid p0 │
>         │           │     │         │
>         ▼           ▼     │         │
>     User NS1 ──▶ Pid NS1 ──▶ Pid p1 │
>         │           │     │         │
>        ...         ...    │   ...   │
>         │           │     │         │
>         ▼           ▼     │         │
>     User NSn ──▶ Pid NSn ──▶ Pid pn │
>                           └─────────┘
> 
> So to create the "Process" and set pids {p0, p1, ... pn} for it on all
> pid namespace levels we can use clone3() syscall set_tid feature, BUT
> the syscall does not allow you to set pid on pid namespace levels you
> don't have permission to. So basically you have to be in "User NS0" when
> creating the "Process" to actually be able to set pids on all levels.
> 
> It is ok for almost any process, but with pid namespace init this does
> not work, as currently we can only create pid namespace init and the pid
> namespace itself simultaneously, so to make "Pid NSn" owned by "User
> NSn" we have to be in the "User NSn".
> 
> We can't possibly be in "User NS0" and "User NSn" at the same time,
> hence the problem.
> 
> Alternative solution:
> 
> Yes, for the case of pid namespace init we can use old and gold
> /proc/sys/kernel/ns_last_pid interface on the levels lower than n. But
> it is much more complicated and introduces tons of extra code to do. It
> would be nice to make clone3() set_tid interface also aplicable to this
> corner case.
> 
> Implementation:
> 
> Now when anyone can setns to the pid namespace before the creation of
> init, and thus multiple processes can fork children to the pid
> namespace, we enforce that the first process created is always the init,
> and only allow other processes after the init passed the same steps as
> were previously required to allow to do setns.
> 
> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
> ---
>  kernel/pid.c           | 10 +++++++++-
>  kernel/pid_namespace.c |  9 ---------
>  2 files changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/pid.c b/kernel/pid.c
> index a31771bc89c1..d549f08036ab 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -188,9 +188,15 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  	pid->level = ns->level;
>  
>  	for (i = ns->level; i >= 0; i--) {
> +		bool pidns_ready = true;
>  		int tid = 0;
>  		int pid_max = READ_ONCE(tmp->pid_max);
>  
> +		read_lock(&tasklist_lock);
> +		if (!tmp->child_reaper)
> +			pidns_ready = false;
> +		read_unlock(&tasklist_lock);
> +
>  		if (set_tid_size) {
>  			tid = set_tid[ns->level - i];
>  
> @@ -201,7 +207,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  			 * Also fail if a PID != 1 is requested and
>  			 * no PID 1 exists.
>  			 */
> -			if (tid != 1 && !tmp->child_reaper)
> +			if (tid != 1 && !pidns_ready)
>  				goto out_free;
>  			retval = -EPERM;
>  			if (!checkpoint_restore_ns_capable(tmp->user_ns))
> @@ -221,6 +227,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  			 */
>  			if (nr == -ENOSPC)
>  				nr = -EEXIST;
> +		} else if (!pidns_ready && idr_get_cursor(&tmp->idr) != 0) {
> +			nr = -EINVAL;
>  		} else {
>  			int pid_min = 1;
>  			/*
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index e48f5de41361..d36afc58ee1d 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -369,15 +369,6 @@ static struct ns_common *pidns_for_children_get(struct task_struct *task)
>  	}
>  	task_unlock(task);
>  
> -	if (ns) {
> -		read_lock(&tasklist_lock);
> -		if (!ns->child_reaper) {
> -			put_pid_ns(ns);
> -			ns = NULL;
> -		}
> -		read_unlock(&tasklist_lock);
> -	}
> -
>  	return ns ? &ns->ns : NULL;
>  }
>  
> -- 
> 2.53.0
> 

Re: [PATCH 1/2] pid_namespace: allow opening pid_for_children before init was created
Posted by Oleg Nesterov 1 month, 1 week ago
On 02/22, Oleg Nesterov wrote:
>
> Pavel,
>
> Your patch doesn't apply to Linus's tree.
>
> and in any case... can you avoid read_lock(tasklist) in alloc_pid() ?
> This is really not good.

I'm afraid I am totally confused, but it looks unnecessary.

Once ->child_reaper is set, it can be changed but not cleared.
This means that pidns_for_children_get() doesn't need tasklist too.

However. With or without this patch, we probably need WRITE_ONCE()
in find_child_reaper() and copy_process() + READ_ONCE() in alloc_pid()
to avoid the possible warnings from KCSAN.

No?

Oleg.
Re: [PATCH 1/2] pid_namespace: allow opening pid_for_children before init was created
Posted by Pavel Tikhomirov 1 month, 1 week ago

On 2/22/26 17:55, Oleg Nesterov wrote:
> On 02/22, Oleg Nesterov wrote:
>>
>> Pavel,
>>
>> Your patch doesn't apply to Linus's tree.
>>
>> and in any case... can you avoid read_lock(tasklist) in alloc_pid() ?
>> This is really not good.
> 
> I'm afraid I am totally confused, but it looks unnecessary.
> 
> Once ->child_reaper is set, it can be changed but not cleared.

Yes.

> This means that pidns_for_children_get() doesn't need tasklist too.

I originally thought that, maybe, something around first setting of
->child_reaper had to be observed together with child_reaper change.
But yes, I looked into the code around it and failed to identify
something like that.

> 
> However. With or without this patch, we probably need WRITE_ONCE()
> in find_child_reaper() and copy_process() + READ_ONCE() in alloc_pid()
> to avoid the possible warnings from KCSAN.

OK, I will add this, sounds like a good idea. Thanks!

> 
> No?
> 
> Oleg.
> 

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.
Re: [PATCH 1/2] pid_namespace: allow opening pid_for_children before init was created
Posted by Pavel Tikhomirov 1 month, 1 week ago

On 2/22/26 16:38, Oleg Nesterov wrote:
> Pavel,
> 
> Your patch doesn't apply to Linus's tree.

Oh, sorry, for some reason I based it on top of v6.19. Will rebase, I 
guess, when I figure out the second (tasklist lock) part.

> 
> and in any case... can you avoid read_lock(tasklist) in alloc_pid() ?
> This is really not good.

I tried this logic with the tasklist lock to be on the safe side, it makes
sure that when a second process in the pid namespace is created, the pid
namespace is in exactly the same initialization stage as before. Moreover
the tasklist lock is already taken in the caller of alloc_pid a bit later,
so I thought it's not that bad.

I understand that taking the global lock is probably a bad idea if we can 
in reality do without it. I will try to dive into the code around setting
child_reaper under tasklist lock and see if there is something which will
be not initialized enough if we check child_reaper and see it set in the
second process without lock.

Thank you for review!

> 
> Oleg.

-- 
Best regards, Pavel Tikhomirov
Senior Software Developer, Virtuozzo.