[PATCH v4 0/2] ipc: Store mq and ipc sysctls in the ipc namespace

Alexey Gladkov posted 2 patches 4 years, 4 months ago
Failed in applying to current master (apply log)
include/linux/ipc_namespace.h |  37 ++++++-
ipc/ipc_sysctl.c              | 189 ++++++++++++++++++++++------------
ipc/mq_sysctl.c               | 121 ++++++++++++----------
ipc/mqueue.c                  |  10 +-
ipc/namespace.c               |  10 ++
5 files changed, 235 insertions(+), 132 deletions(-)
[PATCH v4 0/2] ipc: Store mq and ipc sysctls in the ipc namespace
Posted by Alexey Gladkov 4 years, 4 months ago
This patchset changes the implementation of mq and ipc sysctls. It moves the
sysctls inside the ipc namespace. This will allow us to manage these sysctls
inside the user namespace.

--

Alexey Gladkov (2):
  ipc: Store mqueue sysctls in the ipc namespace
  ipc: Store ipc sysctls in the ipc namespace

 include/linux/ipc_namespace.h |  37 ++++++-
 ipc/ipc_sysctl.c              | 189 ++++++++++++++++++++++------------
 ipc/mq_sysctl.c               | 121 ++++++++++++----------
 ipc/mqueue.c                  |  10 +-
 ipc/namespace.c               |  10 ++
 5 files changed, 235 insertions(+), 132 deletions(-)

-- 
2.33.0

[GIT PULL] ipc: Bind to the ipc namespace at open time.
Posted by Eric W. Biederman 4 years, 3 months ago
Linus,

Please pull the per-namespace-ipc-sysctls-for-v5.18 tag from the git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git per-namespace-ipc-sysctls-for-v5.18
  HEAD: 1f5c135ee509e89e0cc274333a65f73c62cb16e5 ipc: Store ipc sysctls in the ipc namespace


The per ipc namespace sysctls have been imperfect since they were
implemented.  Instead of binding to the ipc namespace of the opener of
the file the code bound to the ipc namespace of the writer of the
file.

This short series of changes addresses that old deficiency in the
code.

Alexey Gladkov (2):
      ipc: Store mqueue sysctls in the ipc namespace
      ipc: Store ipc sysctls in the ipc namespace

 include/linux/ipc_namespace.h |  37 ++++++++-
 ipc/ipc_sysctl.c              | 189 +++++++++++++++++++++++++++---------------
 ipc/mq_sysctl.c               | 121 +++++++++++++++------------
 ipc/mqueue.c                  |  10 +--
 ipc/namespace.c               |  10 +++
 5 files changed, 235 insertions(+), 132 deletions(-)

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>


Eric
Re: [GIT PULL] ipc: Bind to the ipc namespace at open time.
Posted by Linus Torvalds 4 years, 3 months ago
On Wed, Mar 23, 2022 at 1:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Please pull the per-namespace-ipc-sysctls-for-v5.18 tag from the git tree:

Ugh.

I pulled this. Then I stared at it for a long time.

And then I decided that this is too ugly to live.

I'm sorry. I think Alexey has probably spent a fair amount of time on
it, but I really think the sysctl code needs to be cleaned up way more
than this.

The old code was horribly hacky too, but that setup_ipc_sysctls() (and
setup_mq_sysctls()) thing that copies the whole sysctls table, and
then walks it entry by entry to modify it, is just too ugly for words.

And then it hides things in .extra1, and because it does that it can't
use the normal "extra1 and extra2 contains the limits" so then at
write() time it copies it into a local one AGAIN only to set the
limits back so that it can call the normal routines again.

Not ok.

Yes, yes, the old code did some similar things - to set the 'data'
pointer. That was disgusting too. Don't get me wrong - the existing
code was nasty too. But this took nasty code, and doubled down on it.

I really think this is a fundamental problem, and needs a more
fundamental fix than adding more and more of these kinds of nasty
hacks.

And yes, that fundamental fix is almost certainly to pass in 'struct
cred *' to all those sysctl helper functions.

Then, when you do the actual 'sysctl()' system calls, you pass in
'current_cred()".

And the /proc users would pass in file->f_cred.

And yes, that patch might be quite messy, because we have quite a lot
of those random .proc_handler users.

But *most* of them by far (at least in random code) use the standard
proc_dointvec_minmax() and friends, and won't even notice.

And then the ones that are about namespace issues will have to
continue to do the nasty "make a copy and update the data pointer",
but *MAYBE* we could also introduce the notion of an "offset" to those
proc_dointvec_minmax() things to help them out (and at least avoid the
"make a copy" thing).

Anyway, I really think we must not make that sysctl code even uglier
than it is today, and I think we need to move towards a model that
actually makes sense. And that "pass in the right cred" is the only
sensible model I can see.

I haven't tried to create such a patch, and maybe Alexey already tried
to do something like that and it turned out to be too ugly for words
and that's why these nasty patches happened.

But at least for now, I can't with a good conscience pull this.

Sorry,
                   Linus
Re: [GIT PULL] ipc: Bind to the ipc namespace at open time.
Posted by Eric W. Biederman 4 years, 3 months ago
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Mar 23, 2022 at 1:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Please pull the per-namespace-ipc-sysctls-for-v5.18 tag from the git tree:
>
> Ugh.
>
> I pulled this. Then I stared at it for a long time.
>
> And then I decided that this is too ugly to live.
>
> I'm sorry. I think Alexey has probably spent a fair amount of time on
> it, but I really think the sysctl code needs to be cleaned up way more
> than this.
>
> The old code was horribly hacky too, but that setup_ipc_sysctls() (and
> setup_mq_sysctls()) thing that copies the whole sysctls table, and
> then walks it entry by entry to modify it, is just too ugly for words.

I am not convinced when it comes to the .data pointers it is too
ugly to live.  The only realistic alternative is doing something
with the offset to adjust the pointers to the per namespace values.

Fundamentally the copy is needed because in the general case (which is
present in the networking stack) we only show a subset of the sysctls in
anything other than the initial namespace.  That is necessary as some
sysctls have not been shown to be safe for unprivileged users to write.

So there does need to be a separate array of ctl_table entries per
namespace, and the data pointers in those entries need to be adjusted to
the per namespace values.

> And then it hides things in .extra1, and because it does that it can't
> use the normal "extra1 and extra2 contains the limits" so then at
> write() time it copies it into a local one AGAIN only to set the
> limits back so that it can call the normal routines again.
>
> Not ok.

I agree.  That is a bit ugly, and I missed it when I was looking
at the code.

The mq_sysctl.c changes look reasonable to me.  There are no strange
games being played with .extra1 or .extra2.  All I see happening
in mq_sysctl.c is boiler plate to make things work.

Looking at ipc_sysctl.c I agree playing games with .extra1 and .extra2
is ugly and error prone.  Worse I noticed when reading it through that
proc_ipc_sem_dointvec passes sem_check_semmni current->nsproxy->ipc_ns
instead of passing the computed ns value.  A bug but not a regression.

I went through the code and we don't need to play games with .extra1
to get the namespace value all we need to call container_of with
the .data member.  That takes a little extra boiler plate especially
for the checkpoint_restore case.

The checkpoint_restore sysctls arguably need to be done differently so
that the permissions can be checked at open time instead of at write
time.  That would eliminate the need to play games with foobar.

>
> Anyway, I really think we must not make that sysctl code even uglier
> than it is today, and I think we need to move towards a model that
> actually makes sense. And that "pass in the right cred" is the only
> sensible model I can see.

I don't see how passing in struct cred does anything helpful.  The
namespace can not be found in struct cred.

Now this set of changes is a setup to allow implementing .set_ownership
and .permission methods so that we can allow the namespace root to write
to sysctls it is safe for the namespace root to be able to write to.
Even there I don't see how passing in a struct cred would help.

Given that there are no regressions I don't see it even being helpful
to not merge this code.

I did look and there are some cleanup pretty significant cleanup
opportunities, by using container_of on .data, which avoids the need
to stuff a namespace parameter in .extra1.

There are also significant cleanup opportunities in implementing a
.permission method that will allow the checkpoint_restart sysctls
to perform all of their permission checks at open time, and not
need any other special code.

But all of these cleanups I see are moving forward from the current
point so I don't see why we would not want to merge the code as is
and then get the tested versions of my cleanups below.



ipc/ipc_sysctl.c | 81 +++++++++++++++++++++++++-------------------------------
 1 file changed, 36 insertions(+), 45 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 15210ac47e9e..e2209d48db04 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -19,16 +19,11 @@
 static int proc_ipc_dointvec_minmax_orphans(struct ctl_table *table, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
 {
-	struct ipc_namespace *ns = table->extra1;
-	struct ctl_table ipc_table;
+	struct ipc_namespace *ns =
+		container_of(table->data, struct ipc_namespace, shm_rmid_forced);
 	int err;
 
-	memcpy(&ipc_table, table, sizeof(ipc_table));
-
-	ipc_table.extra1 = SYSCTL_ZERO;
-	ipc_table.extra2 = SYSCTL_ONE;
-
-	err = proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
+	err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
 
 	if (err < 0)
 		return err;
@@ -55,20 +50,15 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
 static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 	void *buffer, size_t *lenp, loff_t *ppos)
 {
-	struct ipc_namespace *ns = table->extra1;
-	struct ctl_table ipc_table;
+	struct ipc_namespace *ns =
+		container_of(table->data, struct ipc_namespace, sem_ctls);
 	int ret, semmni;
 
-	memcpy(&ipc_table, table, sizeof(ipc_table));
-
-	ipc_table.extra1 = NULL;
-	ipc_table.extra2 = NULL;
-
 	semmni = ns->sem_ctls[3];
 	ret = proc_dointvec(table, write, buffer, lenp, ppos);
 
 	if (!ret)
-		ret = sem_check_semmni(current->nsproxy->ipc_ns);
+		ret = sem_check_semmni(ns);
 
 	/*
 	 * Reset the semmni value if an error happens.
@@ -78,24 +68,6 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
 	return ret;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
-static int proc_ipc_dointvec_minmax_checkpoint_restore(struct ctl_table *table,
-		int write, void *buffer, size_t *lenp, loff_t *ppos)
-{
-	struct ipc_namespace *ns = table->extra1;
-	struct ctl_table ipc_table;
-
-	if (write && !checkpoint_restore_ns_capable(ns->user_ns))
-		return -EPERM;
-
-	memcpy(&ipc_table, table, sizeof(ipc_table));
-
-	ipc_table.extra1 = SYSCTL_ZERO;
-	ipc_table.extra2 = SYSCTL_INT_MAX;
-
-	return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
-}
-#endif
 
 int ipc_mni = IPCMNI;
 int ipc_mni_shift = IPCMNI_SHIFT;
@@ -131,6 +103,8 @@ static struct ctl_table ipc_sysctls[] = {
 		.maxlen		= sizeof(init_ipc_ns.shm_rmid_forced),
 		.mode		= 0644,
 		.proc_handler	= proc_ipc_dointvec_minmax_orphans,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	},
 	{
 		.procname	= "msgmax",
@@ -180,22 +154,28 @@ static struct ctl_table ipc_sysctls[] = {
 		.procname	= "sem_next_id",
 		.data		= &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
 		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
-		.mode		= 0666,
-		.proc_handler	= proc_ipc_dointvec_minmax_checkpoint_restore,
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_INT_MAX,
 	},
 	{
 		.procname	= "msg_next_id",
 		.data		= &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
 		.maxlen		= sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
-		.mode		= 0666,
-		.proc_handler	= proc_ipc_dointvec_minmax_checkpoint_restore,
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_INT_MAX,
 	},
 	{
 		.procname	= "shm_next_id",
 		.data		= &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
 		.maxlen		= sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
-		.mode		= 0666,
-		.proc_handler	= proc_ipc_dointvec_minmax_checkpoint_restore,
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_INT_MAX,
 	},
 #endif
 	{}
@@ -211,8 +191,24 @@ static int set_is_seen(struct ctl_table_set *set)
 	return &current->nsproxy->ipc_ns->ipc_set == set;
 }
 
+static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table)
+{
+	struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+	int mode = table->mode;
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) ||
+	     (table->data == &ns->ids[IPC_MSG_IDS].next_id) ||
+	     (table->data == &ns->ids[IPC_SHM_IDS].next_id)) &&
+	    checkpoint_restore_ns_capable(ns->user_ns))
+		mode = 0666;
+#endif
+	return mode;
+}
+
 static struct ctl_table_root set_root = {
 	.lookup = set_lookup,
+	.permissions = ipc_permissions,
 };
 
 bool setup_ipc_sysctls(struct ipc_namespace *ns)
@@ -237,7 +233,6 @@ bool setup_ipc_sysctls(struct ipc_namespace *ns)
 
 			} else if (tbl[i].data == &init_ipc_ns.shm_rmid_forced) {
 				tbl[i].data = &ns->shm_rmid_forced;
-				tbl[i].extra1 = ns;
 
 			} else if (tbl[i].data == &init_ipc_ns.msg_ctlmax) {
 				tbl[i].data = &ns->msg_ctlmax;
@@ -250,19 +245,15 @@ bool setup_ipc_sysctls(struct ipc_namespace *ns)
 
 			} else if (tbl[i].data == &init_ipc_ns.sem_ctls) {
 				tbl[i].data = &ns->sem_ctls;
-				tbl[i].extra1 = ns;
 #ifdef CONFIG_CHECKPOINT_RESTORE
 			} else if (tbl[i].data == &init_ipc_ns.ids[IPC_SEM_IDS].next_id) {
 				tbl[i].data = &ns->ids[IPC_SEM_IDS].next_id;
-				tbl[i].extra1 = ns;
 
 			} else if (tbl[i].data == &init_ipc_ns.ids[IPC_MSG_IDS].next_id) {
 				tbl[i].data = &ns->ids[IPC_MSG_IDS].next_id;
-				tbl[i].extra1 = ns;
 
 			} else if (tbl[i].data == &init_ipc_ns.ids[IPC_SHM_IDS].next_id) {
 				tbl[i].data = &ns->ids[IPC_SHM_IDS].next_id;
-				tbl[i].extra1 = ns;
 #endif
 			} else {
 				tbl[i].data = NULL;



Eric
Re: [GIT PULL] ipc: Bind to the ipc namespace at open time.
Posted by Linus Torvalds 4 years, 3 months ago
On Thu, Mar 24, 2022 at 2:49 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> I don't see how passing in struct cred does anything helpful.  The
> namespace can not be found in struct cred.

Duh. For some reason I thought we had it there, but yeah, that only
contains the user-namespace.

The rest is in tsk->nsproxy.

So it would have to be squirrelled away in some other way.

> But all of these cleanups I see are moving forward from the current
> point so I don't see why we would not want to merge the code as is
> and then get the tested versions of my cleanups below.

What part of "that code is too ugly to live" did not end up registering.

Maybe it can be cleaned up. But it had better be cleaned up before I pull it.

                Linus
Re: [GIT PULL] ipc: Bind to the ipc namespace at open time.
Posted by Alexey Gladkov 4 years, 3 months ago
On Thu, Mar 24, 2022 at 11:12:21AM -0700, Linus Torvalds wrote:
> On Wed, Mar 23, 2022 at 1:24 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > Please pull the per-namespace-ipc-sysctls-for-v5.18 tag from the git tree:
> 
> Ugh.
> 
> I pulled this. Then I stared at it for a long time.
> 
> And then I decided that this is too ugly to live.
> 
> I'm sorry. I think Alexey has probably spent a fair amount of time on
> it, but I really think the sysctl code needs to be cleaned up way more
> than this.

Apparently it's my fault that the purpose of these changes is not clear. I
did this refactoring not for the sake of refactoring as such.

In my original patch [1], I was trying to fix the situation where the user
cannot change the /proc/sys/fs/mqueue/* options inside rootless container.

But then I split the changes into refactoring which fixes the hack and
permission changes which I wanted to discuss and propose later.

[1] https://lore.kernel.org/lkml/0f0408bb7fbf3187966a9bf19a98642a5d9669fe.1641225760.git.legion@kernel.org/

> The old code was horribly hacky too, but that setup_ipc_sysctls() (and
> setup_mq_sysctls()) thing that copies the whole sysctls table, and
> then walks it entry by entry to modify it, is just too ugly for words.
> 
> And then it hides things in .extra1, and because it does that it can't
> use the normal "extra1 and extra2 contains the limits" so then at
> write() time it copies it into a local one AGAIN only to set the
> limits back so that it can call the normal routines again.
> 
> Not ok.
> 
> Yes, yes, the old code did some similar things - to set the 'data'
> pointer. That was disgusting too. Don't get me wrong - the existing
> code was nasty too. But this took nasty code, and doubled down on it.
> 
> I really think this is a fundamental problem, and needs a more
> fundamental fix than adding more and more of these kinds of nasty
> hacks.
> 
> And yes, that fundamental fix is almost certainly to pass in 'struct
> cred *' to all those sysctl helper functions.
> 
> Then, when you do the actual 'sysctl()' system calls, you pass in
> 'current_cred()".
> 
> And the /proc users would pass in file->f_cred.
> 
> And yes, that patch might be quite messy, because we have quite a lot
> of those random .proc_handler users.
> 
> But *most* of them by far (at least in random code) use the standard
> proc_dointvec_minmax() and friends, and won't even notice.
> 
> And then the ones that are about namespace issues will have to
> continue to do the nasty "make a copy and update the data pointer",
> but *MAYBE* we could also introduce the notion of an "offset" to those
> proc_dointvec_minmax() things to help them out (and at least avoid the
> "make a copy" thing).
> 
> Anyway, I really think we must not make that sysctl code even uglier
> than it is today, and I think we need to move towards a model that
> actually makes sense. And that "pass in the right cred" is the only
> sensible model I can see.
> 
> I haven't tried to create such a patch, and maybe Alexey already tried
> to do something like that and it turned out to be too ugly for words
> and that's why these nasty patches happened.
> 
> But at least for now, I can't with a good conscience pull this.
> 
> Sorry,
>                    Linus

OK. I'll try to come up with some other solution.

-- 
Rgrds, legion