[PATCH v1 13/13] libceph: force host network namespace for kernel CephFS mounts

Ionut Nechita (Wind River) posted 13 patches 3 weeks, 5 days ago
From: Ionut Nechita <ionut.nechita@windriver.com>

In containerized environments (e.g., Rook-Ceph CSI with
forcecephkernelclient=true), the mount() syscall
for kernel CephFS may be invoked from a pod's network namespace
instead of the host namespace. This happens despite the CSI node
plugin (csi-cephfsplugin) running with hostNetwork: true, due to
race conditions during kubelet restart or pod scheduling.

ceph_messenger_init() captures current->nsproxy->net_ns at mount
time and uses it for all subsequent socket operations. When a pod
NS is captured, all kernel ceph sockets (mon, mds, osd) are
created in that namespace, which typically lacks routes to the
Ceph monitors (e.g., fd04:: ClusterIP addresses).
This causes permanent EADDRNOTAVAIL (-99) on every connection
attempt at ip6_dst_lookup_flow(), with no possibility of recovery
short of force-unmount and remount from the correct namespace.

Root cause confirmed via kprobe tracing on ip6_dst_lookup_flow:
the net pointer passed to the routing lookup was the pod's
net_ns (0xff367a0125dd5780) instead of init_net
(0xffffffffbda76940). The pod NS had no route for fd04::/64
(monitor ClusterIP range), while userspace python connect() from
the same host succeeded because it ran in host NS.

Fix this by always using init_net (the host network namespace)
in ceph_messenger_init(). The kernel CephFS client inherently
requires host-level network access to reach Ceph monitors, OSDs,
and MDS daemons. Using the caller's namespace was inherited from
generic socket patterns but is incorrect for a kernel filesystem
client that must survive beyond the lifetime of the mounting
process and its network namespace.

A warning is logged when a mount from a non-init namespace is
detected, to aid debugging.

Observed in production (kernel 6.12.0-1-rt-amd64, Ceph Reef
18.2.5, IPv6-only cluster, ceph-csi v3.13.1):
  - Fresh boot of compute-0, ceph-csi mounts CephFS via kernel
  - All monitor connections fail with EADDRNOTAVAIL immediately
  - kprobe confirms wrong net_ns in ip6_dst_lookup_flow
  - Workaround: umount -l + systemctl restart kubelet
  - After restart: mount captures host NS, works immediately

Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 net/ceph/messenger.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 8165e6a8fe092..a2e8ea6d339c9 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1791,7 +1791,32 @@ void ceph_messenger_init(struct ceph_messenger *msgr,
 
 	atomic_set(&msgr->stopping, 0);
 	atomic_set(&msgr->addr_notavail_count, 0);
-	write_pnet(&msgr->net, get_net(current->nsproxy->net_ns));
+
+	/*
+	 * Use the initial (host) network namespace instead of the
+	 * caller's current namespace. In containerized environments
+	 * (e.g., Rook-Ceph CSI with forcecephkernelclient=true), the
+	 * mount() syscall may be invoked from a pod's network namespace
+	 * even when the CSI plugin runs with hostNetwork: true (race
+	 * conditions during kubelet restart, pod scheduling, etc.).
+	 *
+	 * If the pod NS is captured here, all kernel ceph sockets will
+	 * be created in that NS, which typically lacks routes to the
+	 * Ceph monitors (e.g., fd04:: ClusterIP addresses). This causes
+	 * permanent EADDRNOTAVAIL on every connection attempt with no
+	 * possibility of recovery short of force-unmount + remount.
+	 *
+	 * The kernel CephFS client always needs host-level network
+	 * access to reach Ceph monitors, OSDs, and MDS daemons, so
+	 * using init_net is the correct choice. The previous behavior
+	 * of capturing current->nsproxy->net_ns was inherited from
+	 * generic socket code but is wrong for a kernel filesystem
+	 * client that must survive beyond the lifetime of the mounting
+	 * process's network namespace.
+	 */
+	if (current->nsproxy->net_ns != &init_net)
+		pr_warn("libceph: mount from non-init network namespace detected, using host namespace instead\n");
+	write_pnet(&msgr->net, get_net(&init_net));
 
 	dout("%s %p\n", __func__, msgr);
 }
-- 
2.53.0
Re: [PATCH v1 13/13] libceph: force host network namespace for kernel CephFS mounts
Posted by Ilya Dryomov 3 weeks, 1 day ago
On Thu, Mar 12, 2026 at 9:17 AM Ionut Nechita (Wind River)
<ionut.nechita@windriver.com> wrote:
>
> From: Ionut Nechita <ionut.nechita@windriver.com>
>
> In containerized environments (e.g., Rook-Ceph CSI with
> forcecephkernelclient=true), the mount() syscall
> for kernel CephFS may be invoked from a pod's network namespace
> instead of the host namespace. This happens despite the CSI node
> plugin (csi-cephfsplugin) running with hostNetwork: true, due to
> race conditions during kubelet restart or pod scheduling.

Hi Ionut,

Can you elaborate on these race conditions?  It sounds like a bug or
a misconfiguration in the orchestration in userspace that this patch is
trying to work around in the kernel client.

>
> ceph_messenger_init() captures current->nsproxy->net_ns at mount
> time and uses it for all subsequent socket operations. When a pod
> NS is captured, all kernel ceph sockets (mon, mds, osd) are
> created in that namespace, which typically lacks routes to the
> Ceph monitors (e.g., fd04:: ClusterIP addresses).
> This causes permanent EADDRNOTAVAIL (-99) on every connection
> attempt at ip6_dst_lookup_flow(), with no possibility of recovery
> short of force-unmount and remount from the correct namespace.

What network provider (in the sense of [1]) are you using?

>
> Root cause confirmed via kprobe tracing on ip6_dst_lookup_flow:
> the net pointer passed to the routing lookup was the pod's
> net_ns (0xff367a0125dd5780) instead of init_net
> (0xffffffffbda76940). The pod NS had no route for fd04::/64
> (monitor ClusterIP range), while userspace python connect() from
> the same host succeeded because it ran in host NS.
>
> Fix this by always using init_net (the host network namespace)
> in ceph_messenger_init(). The kernel CephFS client inherently
> requires host-level network access to reach Ceph monitors, OSDs,
> and MDS daemons. Using the caller's namespace was inherited from
> generic socket patterns but is incorrect for a kernel filesystem

This behavior wasn't inherited but actually introduced as a feature in
commit [2] at someone's request.  Prior to that change attempting to
mount a CephFS filesystem or map an RBD image from anywhere but init_net
produced an error, see commit [3].

I'm going to challenge your "incorrect for a kernel filesystem" claim
because NFS, SMB/CIFS, AFS and likely other network filesystem clients
in the kernel behave the same way.  Mounts outside of init_net are
allowed with the mounting process network namespace getting captured
and used when creating sockets.

> client that must survive beyond the lifetime of the mounting
> process and its network namespace.

Network namespaces are reference counted and CephFS grabs a reference
for the namespace it's mounted in.  The namespace should persist for as
long as the CephFS mount persists even if the mounting process goes
away: another process should be able to enter that namespace, etc.  The
namespace can of course get wedged by the orchestration tearing down
the relevant virtual network devices prematurely, but it's a separate
issue.

>
> A warning is logged when a mount from a non-init namespace is
> detected, to aid debugging.
>
> Observed in production (kernel 6.12.0-1-rt-amd64, Ceph Reef
> 18.2.5, IPv6-only cluster, ceph-csi v3.13.1):
>   - Fresh boot of compute-0, ceph-csi mounts CephFS via kernel
>   - All monitor connections fail with EADDRNOTAVAIL immediately
>   - kprobe confirms wrong net_ns in ip6_dst_lookup_flow
>   - Workaround: umount -l + systemctl restart kubelet
>   - After restart: mount captures host NS, works immediately

[1] https://github.com/rook/rook/blob/master/Documentation/CRDs/Cluster/network-providers.md
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=757856d2b9568a701df9ea6a4be68effbb9d6f44
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eea553c21fbfa486978c82525ee8256239d4f921

Thanks,

                Ilya
Re: [PATCH v1 13/13] libceph: force host network namespace for kernel CephFS mounts
Posted by Ionut Nechita (Wind River) 3 weeks ago
Hi Ilya,

Thank you for the detailed feedback and the historical context around
commits 757856d2 and eea553c2 -- I wasn't aware that namespace-aware
mounting was an intentional feature.

With the full series applied (patches 1-13), the Rook-Ceph rolling
upgrade scenario (e.g., Ceph 18.2.2 -> 18.2.5) with active CephFS
workloads completes successfully. The connection recovery, sync
timeouts, and mdsmap refresh patches address the core issues.

I agree that patch 13 is an invasive change and can be seen as a
workaround for what is likely a race condition in the CSI/kubelet
orchestration layer. I'll drop it from v2 of this series.

I'd like to collaborate to better understand the namespace interaction
with CephFS in containerized environments. I'll gather more details  
about the specific race condition and share them both here and in the
Ceph tracker bug I opened:

  https://tracker.ceph.com/issues/74897

Regarding your questions:
- The network provider is Calico (IPv6-only cluster)
- The race condition occurs during kubelet restart when ceph-csi
  issues mount() -- in some cases the mount syscall appears to
  execute in the context of a pod namespace rather than the host
  namespace, though I need to investigate further to provide a
  proper reproducer

Thanks again for the review.

Best regards,
Ionut
Re: [PATCH v1 13/13] libceph: force host network namespace for kernel CephFS mounts
Posted by Ionut Nechita (Wind River) 5 days, 1 hour ago
Hi Ilya,

Following up with the additional data I promised. I reproduced the
issue on a fresh cluster and have concrete evidence of the namespace
problem.

Environment (4-node cluster, 2 controllers + 2 workers):

  $ kubectl get nodes
  NAME           STATUS   ROLES           KERNEL-VERSION
  compute-0      Ready    <none>          6.12.0-1-rt-amd64
  compute-1      Ready    <none>          6.12.0-1-rt-amd64
  controller-0   Ready    control-plane   6.12.0-1-amd64
  controller-1   Ready    control-plane   6.12.0-1-amd64

  - OS: Debian GNU/Linux 11 (bullseye)
  - Container runtime: containerd 1.7.27
  - Kubernetes: v1.29.2
  - Rook: v1.16.6
  - ceph-csi: v3.13.1
  - Ceph: 18.2.5 (Reef)
  - Network: Calico + Multus, IPv6-only
  - Pod CIDR: dead:beef::/64 (Calico, vxlanMode: Never)
  - Service CIDR: fd04::/112
  - CSI_FORCE_CEPHFS_KERNEL_CLIENT: true
  - CSI_ENABLE_HOST_NETWORK: true

The scenario is a Rook-Ceph rolling upgrade (Ceph 18.2.2 -> 18.2.5).
During the upgrade, Rook recreates the CSI DaemonSet pods and various
Ceph daemon pods (MON, MDS, OSD). Kubelet then needs to remount
CephFS volumes for workload pods on the node.

After the upgrade, the kernel ceph client is stuck with permanent
EADDRNOTAVAIL (-99) on all monitor connections:

  libceph: connect (1)[fd04::652b]:6789 error -99
  libceph: mon0 (1)[fd04::652b]:6789 connect error

The monitors are Kubernetes ClusterIP services:

  rook-ceph-mon-a  ClusterIP  fd04::652b  6789/TCP,3300/TCP
  rook-ceph-mon-b  ClusterIP  fd04::c0e7  6789/TCP,3300/TCP
  rook-ceph-mon-c  ClusterIP  fd04::1981  6789/TCP,3300/TCP

Here is the key evidence. The kernel ceph client debugfs status shows:

  $ cat /sys/kernel/debug/ceph/*/status
  instance: client.374328 (3)[dead:beef::a2bf:c94c:345d:bc66]:0

The source address dead:beef::a2bf:c94c:345d:bc66 is from the Calico
pod CIDR (dead:beef::/64). This address does NOT belong to any
currently running pod on the node. I enumerated all active CNI
namespaces:

  $ for ns in $(ip netns list | awk '{print $1}'); do
      ip netns exec "$ns" ip -6 addr show | \
        awk -v ns="$ns" '/dead:beef/ {print $2, ns}'
    done

  ...bc6d  kube-sriov-cni-ds
  ...bc70  stx-centos
  ...bc73  rook-ceph-mon-a
  ...bc74  rook-ceph-crashcollector
  ...bc75  rook-ceph-exporter
  ...bc76  rook-ceph-mgr-c
  ...bc78  rook-ceph-osd-0

Address ...bc66 is not present in any existing namespace. The pod
that owned it was destroyed during the upgrade, and Calico removed
its veth interfaces during CNI cleanup.

Meanwhile, the CSI plugin pod is correctly in the host namespace:

  $ kubectl exec csi-cephfsplugin-gdrqr -c csi-cephfsplugin \
      -- readlink /proc/1/ns/net
  net:[4026531840]

  $ readlink /proc/1/ns/net   # on host
  net:[4026531840]

And from host userspace, connecting to the same ClusterIP monitors
works fine (goes through kube-proxy iptables DNAT):

  $ python3 -c "import socket; s=socket.socket(socket.AF_INET6, \
      socket.SOCK_STREAM); s.connect(('fd04::652b', 6789)); print('OK')"
  OK

But ping6 from host fails (ICMP not NAT'd by kube-proxy):

  $ ping6 -c1 fd04::652b
  From fdff:719a:bf60:4008::46e icmp_seq=1 Destination unreachable: No route

So the situation is:
  1. The kernel ceph client captured a pod network namespace at mount
     time (source address from dead:beef::/64 proves this)
  2. That pod was later destroyed during the upgrade
  3. Calico tore down the veth interfaces in that namespace
  4. The namespace persists (ref-counted by ceph) but has no
     interfaces or routes -- it is a zombie namespace
  5. All kernel ceph connect() calls fail with EADDRNOTAVAIL
  6. No recovery is possible without force-unmount + remount
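The check in steps 1 and 4 can be scripted. A minimal sketch of the
debugfs side, assuming the status line format shown above
(`extract_client_addr` and `ceph_client_addrs` are hypothetical helper
names, not part of any ceph tooling):

```python
import glob
import re

def extract_client_addr(status_text):
    """Pull the client source IPv6 address out of a libceph debugfs
    status line such as:
      instance: client.374328 (3)[dead:beef::a2bf:c94c:345d:bc66]:0
    Returns the address string, or None if no instance line matches."""
    m = re.search(r'instance: client\.\d+ \(\d\)\[([0-9a-fA-F:]+)\]',
                  status_text)
    return m.group(1) if m else None

def ceph_client_addrs(debugfs_glob="/sys/kernel/debug/ceph/*/status"):
    """Collect the captured source address of every kernel ceph client
    on this node (requires debugfs mounted and root)."""
    addrs = []
    for path in glob.glob(debugfs_glob):
        with open(path) as f:
            addr = extract_client_addr(f.read())
        if addr:
            addrs.append(addr)
    return addrs

if __name__ == "__main__":
    sample = "instance: client.374328 (3)[dead:beef::a2bf:c94c:345d:bc66]:0"
    print(extract_client_addr(sample))  # dead:beef::a2bf:c94c:345d:bc66
```

Any address this returns that is in the pod CIDR but absent from every
active `ip netns` is a captured namespace that has since been torn down.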

As you noted, this is the "orchestration tearing down the relevant
virtual network devices prematurely" scenario. The namespace is kept
alive by the ceph reference, but it becomes non-functional.

I'm still investigating exactly how mount.ceph ends up in a pod
namespace despite the CSI plugin having hostNetwork: true. I have a
monitoring script set up to capture the namespace of mount.ceph
processes during the next upgrade attempt. I suspect it happens
during the brief window when the old CSI pod is terminated and the
new one is not yet ready, but kubelet still attempts to mount
volumes. I'll follow up with that data.
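For reference, the monitor amounts to polling /proc and comparing
namespace links. A minimal sketch, assuming a Linux /proc and that
comparing the /proc/<pid>/ns/net symlink targets is enough to identify
the namespace (function names are mine, not from any existing tool):

```python
import os
import time

def netns_of(pid):
    """Return the net namespace identifier (e.g. 'net:[4026531840]')
    of a process, or None if it has already exited."""
    try:
        return os.readlink(f"/proc/{pid}/ns/net")
    except (FileNotFoundError, PermissionError):
        return None

def find_procs(comm):
    """PIDs whose command name matches comm (e.g. 'mount.ceph')."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == comm:
                    pids.append(int(entry))
        except OSError:
            pass  # process exited while we were scanning
    return pids

def watch(comm="mount.ceph", interval=0.2):
    """Log any matching process that runs outside the host netns."""
    host_ns = netns_of(1)  # pid 1 is in the init network namespace
    while True:
        for pid in find_procs(comm):
            ns = netns_of(pid)
            if ns and ns != host_ns:
                print(f"{comm} pid {pid} in foreign netns {ns}")
        time.sleep(interval)
```

mount.ceph is short-lived, so a tight polling interval (or a fork/exec
tracepoint instead) is needed to catch it in the act.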

I've also filed this on the Ceph tracker:
  https://tracker.ceph.com/issues/74897

Thanks,
Ionut
Re: [PATCH v1 13/13] libceph: force host network namespace for kernel CephFS mounts
Posted by Ionut Nechita (Wind River) 4 days, 3 hours ago
Hi Ilya,

I've identified the root cause. You were right -- this is an
orchestration issue, not a kernel bug.

The problem is caused by the Rook "holder pod" mechanism in
Rook v1.13.7 (used in our older release). Here is the full picture:

In Rook v1.13.7, when Multus is present or CSI_ENABLE_HOST_NETWORK
is false, Rook deploys a "csi-cephfsplugin-holder" DaemonSet. This
holder pod does NOT have hostNetwork: true -- it runs in a Calico
pod network namespace. Its purpose is to expose its network namespace
via a symlink:

  ln -s /proc/$$/ns/net /var/lib/kubelet/plugins/<driver>/<ns>.net.ns

Ceph-CSI then uses this network namespace when performing kernel
mounts. The holder pod template even has a comment:

  "This pod is not expected to be updated nor restarted unless
   the node reboots."

And uses updateStrategy: OnDelete to prevent rolling updates.

The condition for enabling holder pods (controller.go:206):

  holderEnabled := !csiHostNetworkEnabled || cluster.Spec.Network.IsMultus()

Our cluster uses Calico + Multus, so holderEnabled is always true
regardless of CSI_ENABLE_HOST_NETWORK.
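To make the truth table explicit, here is a Python paraphrase of that
quoted condition (the function name and boolean parameters are mine;
the logic is taken directly from the controller.go line above):

```python
def holder_enabled(csi_host_network_enabled, is_multus):
    """Paraphrase of the Rook v1.13.7 condition (controller.go:206):
    holder pods are deployed whenever CSI host networking is off OR
    Multus is in use."""
    return not csi_host_network_enabled or is_multus

# With Multus in use, CSI_ENABLE_HOST_NETWORK is irrelevant:
# the holder pod (and hence a pod netns for kernel mounts) is
# always enabled.
for host_net in (True, False):
    print(host_net, holder_enabled(host_net, is_multus=True))
# → True True
#   False True
```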

During the upgrade from Rook v1.13.7 to v1.16.6, the new Rook
version sets holderEnabled = false unconditionally and deletes the
holder DaemonSets. When the holder pod is deleted, Calico tears
down the veth interfaces in its network namespace. The kernel ceph
client still holds a reference to that namespace, but it no longer
has any network interfaces or routes, resulting in permanent
EADDRNOTAVAIL (-99).

Evidence from the live reproduction:

  Kernel ceph client status:
    instance: client.74244 (3)[dead:beef::a2bf:c94c:345d:bc6f]:0

  The holder pod on compute-0 had the same address:
    csi-cephfsplugin-holder-rook-ceph-dpnbl  dead:beef::a2bf:c94c:345d:bc6f

  After upgrade, the address ...bc6f is not present in any active
  CNI namespace -- the holder pod was deleted and Calico cleaned up
  the veth.

  dmesg shows the session was initially established successfully
  (at boot time, from the holder pod namespace), then lost when
  the holder pod was destroyed during upgrade:

    [  204.515008] libceph: mon0 session established
    [  959.829581] libceph: mon0 session lost, hunting for new mon
    [  959.829698] libceph: connect error -99  (permanent)

Version details:
  Old release (stx.10): Rook v1.13.7, ceph-csi v3.10.2, Ceph v18.2.2
  New release (stx.11): Rook v1.16.6, ceph-csi v3.13.1, Ceph v18.2.5

The new release (Rook v1.16.6) eliminates holder pods entirely and
performs kernel mounts directly from the csi-cephfsplugin DaemonSet,
which has hostNetwork: true. After the upgrade completes and the
stale mount is cleared (umount -l + kubelet restart), new mounts
work correctly from the host namespace.

So to summarize: this was not a kernel bug. The kernel ceph client
correctly captured the network namespace of the mounting process
(the holder pod), as designed. The problem was that the orchestration
(Rook upgrade) destroyed the holder pod and its network namespace
while the kernel mount was still active.

I'll drop patch 13 from the series as previously agreed. Thank you
for pushing me to investigate this properly.

I've also updated the Ceph tracker:
  https://tracker.ceph.com/issues/74897

Thanks,
Ionut