drivers/nvme/host/tcp.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)
From: Peijie Shao <shaopeijie@cestc.cn>
The patch is for nvme-tcp host side.
commit 1be52169c348
("nvme-tcp: fix selinux denied when calling sock_sendmsg")
uses sock_create_kern instead of sock_create to solve SELinux
problem, however sock_create_kern does not take a reference of
given netns, which results in a use-after-freewhen the
non-init_net netns is destroyed before sock_release.
For example: a container not share with host's network namespace
doing a 'nvme connect', and is stopped without 'nvme disconnect'.
The patch changes parameter current->nsproxy->net_ns to init_net,
makes the socket always belongs to the host. It also naturally
avoids changing sock's netns from previous creator's netns to
init_net when sock is re-created by nvme recovery path
(workqueue is in init_net namespace).
Signed-off-by: Peijie Shao <shaopeijie@cestc.cn>
---
Hi all,
This is the v1 patch. Before this version, I tried to
get_net(current->nsproxy->net_ns) in nvme_tcp_alloc_queue() to
fix the issue, but failed to find a suitable placeto do
put_net(). Because the socket is released by fput() internally.
I think code like below:
nvme_tcp_free_queue() {
fput()
put_net()
}
can not ensure the socket was released before put_net, since
someone is still holding the file.
So I prefer using the 'init_net' net namespace.
---
drivers/nvme/host/tcp.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 26c459f0198d..1b2d3d37656d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1789,7 +1789,13 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
queue->cmnd_capsule_len = sizeof(struct nvme_command) +
NVME_TCP_ADMIN_CCSZ;
- ret = sock_create_kern(current->nsproxy->net_ns,
+ /* sock_create_kern() does not take a reference to
+ * current->nsproxy->net_ns, use init_net instead.
+ * This also avoid changing sock's netns from previous
+ * creator's netns to init_net when sock is re-created
+ * by nvme recovery path.
+ */
+ ret = sock_create_kern(&init_net,
ctrl->addr.ss_family, SOCK_STREAM,
IPPROTO_TCP, &queue->sock);
if (ret) {
--
2.43.0
On Tue, Apr 01, 2025 at 02:19:34PM +0800, shaopeijie@cestc.cn wrote: > + /* sock_create_kern() does not take a reference to > + * current->nsproxy->net_ns, use init_net instead. > + * This also avoid changing sock's netns from previous > + * creator's netns to init_net when sock is re-created > + * by nvme recovery path. > + */ Kernel comment style is /* * .... */ > + ret = sock_create_kern(&init_net, > ctrl->addr.ss_family, SOCK_STREAM, > IPPROTO_TCP, &queue->sock); This can be realigned: ret = sock_create_kern(&init_net, ctrl->addr.ss_family, SOCK_STREAM, IPPROTO_TCP, &queue->sock);
On Thu, Apr 03, 2025 at 06:30:01AM +0200, Christoph Hellwig wrote: > > + ret = sock_create_kern(&init_net, > > ctrl->addr.ss_family, SOCK_STREAM, > > IPPROTO_TCP, &queue->sock); > > This can be realigned: > > ret = sock_create_kern(&init_net, ctrl->addr.ss_family, SOCK_STREAM, > IPPROTO_TCP, &queue->sock); Also did the original patch get merged already? If not please fold both into a single one.
On 4/3/2025 12:30 PM, Christoph Hellwig wrote: > On Thu, Apr 03, 2025 at 06:30:01AM +0200, Christoph Hellwig wrote: >>> + ret = sock_create_kern(&init_net, >>> ctrl->addr.ss_family, SOCK_STREAM, >>> IPPROTO_TCP, &queue->sock); >> >> This can be realigned: >> >> ret = sock_create_kern(&init_net, ctrl->addr.ss_family, SOCK_STREAM, >> IPPROTO_TCP, &queue->sock); > > Also did the original patch get merged already? If not please fold > both into a single one. > > Thanks! Yes,the patch is merged into https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1be52169c3488ef98582ed553ab35cefa3978817 Style problems will be fixed in v2 version.
From: Peijie Shao <shaopeijie@cestc.cn>
The patch is for nvme-tcp host side.
commit 1be52169c348
("nvme-tcp: fix selinux denied when calling sock_sendmsg")
uses sock_create_kern instead of sock_create to solve SELinux
problem, however sock_create_kern does not take a reference of
the given netns, which results in a use-after-free when the
non-init_net netns is destroyed before sock_release.
For example: a container not share with host's network namespace
doing a 'nvme connect', and is stopped without 'nvme disconnect'.
The patch changes parameter current->nsproxy->net_ns to init_net,
makes the socket always belongs to the host. It also naturally
avoids changing sock's netns from previous creator's netns to
init_net when sock is re-created by nvme recovery path
(workqueue is in init_net namespace).
Signed-off-by: Peijie Shao <shaopeijie@cestc.cn>
---
Changes in v2:
1. Fix style problems reviewed by Christoph Hellwig, thanks!
2. Add 'nvme-tcp:' prefix for the patch.
Version v1:
Hi all,
This is the v1 patch. Before this version, I tried to
get_net(current->nsproxy->net_ns) in nvme_tcp_alloc_queue() to
fix the issue, but failed to find a suitable placeto do
put_net(). Because the socket is released by fput() internally.
I think code like below:
nvme_tcp_free_queue() {
fput()
put_net()
}
can not ensure the socket was released before put_net, since
someone is still holding the file.
So I would like to use the 'init_net' net namespace.
---
drivers/nvme/host/tcp.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 26c459f0198d..9b1d0ad18b77 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1789,8 +1789,14 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
queue->cmnd_capsule_len = sizeof(struct nvme_command) +
NVME_TCP_ADMIN_CCSZ;
- ret = sock_create_kern(current->nsproxy->net_ns,
- ctrl->addr.ss_family, SOCK_STREAM,
+ /*
+ * sock_create_kern() does not take a reference to
+ * current->nsproxy->net_ns, use init_net instead.
+ * This also avoid changing sock's netns from previous
+ * creator's netns to init_net when sock is re-created
+ * by nvme recovery path.
+ */
+ ret = sock_create_kern(&init_net, ctrl->addr.ss_family, SOCK_STREAM,
IPPROTO_TCP, &queue->sock);
if (ret) {
dev_err(nctrl->device,
--
2.43.0
I had another look at this patch, and it feels wrong to me. I don't think we are supposed to create sockets triggered by activity in a network namespace in the global namespace even if they are indirectly created through the nvme interface. But maybe I'm misunderstanding how network namespaces work, which is entirely possible. So to avoid the failure I'd be tempted to instead revert commit 1be52169c348 until the problem is fully sorted out.
From: Christoph Hellwig <hch@lst.de> Date: Mon, 7 Apr 2025 16:31:21 +0200 > I had another look at this patch, and it feels wrong to me. I don't > think we are supposed to create sockets triggered by activity in > a network namespace in the global namespace even if they are indirectly > created through the nvme interface. But maybe I'm misunderstanding > how network namespaces work, which is entirely possible. > > So to avoid the failure I'd be tempted to instead revert commit > 1be52169c348 until the problem is fully sorted out. The followup patch is wrong, and the correct fix is to take a reference to the netns by sk_net_refcnt_upgrade(). ---8<--- diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 26c459f0198d..72d260201d8c 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -1803,6 +1803,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid, ret = PTR_ERR(sock_file); goto err_destroy_mutex; } + + sk_net_refcnt_upgrade(queue->sock->sk); nvme_tcp_reclassify_socket(queue->sock); /* Single syn retry */ ---8<---
On Mon, Apr 07, 2025 at 10:18:18AM -0700, Kuniyuki Iwashima wrote: > The followup patch is wrong, and the correct fix is to take a reference > to the netns by sk_net_refcnt_upgrade(). Can you send a formal patch for this?
From: Christoph Hellwig <hch@lst.de> Date: Tue, 8 Apr 2025 07:04:08 +0200 > On Mon, Apr 07, 2025 at 10:18:18AM -0700, Kuniyuki Iwashima wrote: > > The followup patch is wrong, and the correct fix is to take a reference > > to the netns by sk_net_refcnt_upgrade(). > > Can you send a formal patch for this? For sure. Which branch/tag should be based on, for-next or nvme-6.15 ? http://git.infradead.org/nvme.git
On Mon, Apr 07, 2025 at 10:55:27PM -0700, Kuniyuki Iwashima wrote: > Which branch/tag should be based on, for-next or nvme-6.15 ? > http://git.infradead.org/nvme.git nvme-6.15 is the canonical tree, but for bug fixes I'm fine with almost anything :)
From: Christoph Hellwig <hch@lst.de> Date: Tue, 8 Apr 2025 07:58:30 +0200 > On Mon, Apr 07, 2025 at 10:55:27PM -0700, Kuniyuki Iwashima wrote: > > Which branch/tag should be based on, for-next or nvme-6.15 ? > > http://git.infradead.org/nvme.git > > nvme-6.15 is the canonical tree, but for bug fixes I'm fine with > almost anything :) Thanks, will post a patch tomorrow :)
I'll do another minor fixup for the comment formatting, but otherwise
this looks good. I'll queue it up.
On Thu, Apr 03, 2025 at 10:47:48PM +0800, shaopeijie@cestc.cn wrote:
> From: Peijie Shao <shaopeijie@cestc.cn>
>
> The patch is for nvme-tcp host side.
>
> commit 1be52169c348
> ("nvme-tcp: fix selinux denied when calling sock_sendmsg")
> uses sock_create_kern instead of sock_create to solve SELinux
> problem, however sock_create_kern does not take a reference of
> the given netns, which results in a use-after-free when the
> non-init_net netns is destroyed before sock_release.
>
> For example: a container not share with host's network namespace
> doing a 'nvme connect', and is stopped without 'nvme disconnect'.
>
> The patch changes parameter current->nsproxy->net_ns to init_net,
> makes the socket always belongs to the host. It also naturally
> avoids changing sock's netns from previous creator's netns to
> init_net when sock is re-created by nvme recovery path
> (workqueue is in init_net namespace).
>
> Signed-off-by: Peijie Shao <shaopeijie@cestc.cn>
> ---
>
> Changes in v2:
> 1. Fix style problems reviewed by Christoph Hellwig, thanks!
> 2. Add 'nvme-tcp:' prefix for the patch.
>
> Version v1:
> Hi all,
> This is the v1 patch. Before this version, I tried to
> get_net(current->nsproxy->net_ns) in nvme_tcp_alloc_queue() to
> fix the issue, but failed to find a suitable placeto do
> put_net(). Because the socket is released by fput() internally.
> I think code like below:
> nvme_tcp_free_queue() {
> fput()
> put_net()
> }
> can not ensure the socket was released before put_net, since
> someone is still holding the file.
>
> So I would like to use the 'init_net' net namespace.
>
> ---
> drivers/nvme/host/tcp.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 26c459f0198d..9b1d0ad18b77 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1789,8 +1789,14 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
> queue->cmnd_capsule_len = sizeof(struct nvme_command) +
> NVME_TCP_ADMIN_CCSZ;
>
> - ret = sock_create_kern(current->nsproxy->net_ns,
> - ctrl->addr.ss_family, SOCK_STREAM,
> + /*
> + * sock_create_kern() does not take a reference to
> + * current->nsproxy->net_ns, use init_net instead.
> + * This also avoid changing sock's netns from previous
> + * creator's netns to init_net when sock is re-created
> + * by nvme recovery path.
> + */
> + ret = sock_create_kern(&init_net, ctrl->addr.ss_family, SOCK_STREAM,
> IPPROTO_TCP, &queue->sock);
> if (ret) {
> dev_err(nctrl->device,
> --
> 2.43.0
>
>
---end quoted text---
© 2016 - 2025 Red Hat, Inc.