fs/ceph/inode.c | 4 ++++ 1 file changed, 4 insertions(+)
When ceph_readdir_prepopulate calls ceph_get_inode while holding
mdsc->snap_rwsem, a deadlock may occur, blocking all subsequent
requests of the current session.
Fix by release the mds->snap_rwsem read lock before calling the
ceph_get_inode function.
Link: https://tracker.ceph.com/issues/72307
Signed-off-by: Zhao Sun <sunzhao03@kuaishou.com>
---
fs/ceph/inode.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 06cd2963e41e..3d7fb045ba76 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1900,6 +1900,7 @@ static int fill_readdir_cache(struct inode *dir, struct dentry *dn,
int ceph_readdir_prepopulate(struct ceph_mds_request *req,
struct ceph_mds_session *session)
{
+ struct ceph_mds_client *mdsc = session->s_mdsc;
struct dentry *parent = req->r_dentry;
struct inode *inode = d_inode(parent);
struct ceph_inode_info *ci = ceph_inode(inode);
@@ -2029,7 +2030,10 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req,
if (d_really_is_positive(dn)) {
in = d_inode(dn);
} else {
+ /* Release mdsc->snap_rwsem in advance to avoid deadlock */
+ up_read(&mdsc->snap_rwsem);
in = ceph_get_inode(parent->d_sb, tvino, NULL);
+ down_read(&mdsc->snap_rwsem);
if (IS_ERR(in)) {
doutc(cl, "new_inode badness\n");
d_drop(dn);
--
2.39.2 (Apple Git-143)
Hi, The patch correctly identifies and addresses the deadlock by releasing snap_rwsem before the blocking ceph_get_inode call. My Recommendation: Use down_read_killable: The current patch uses down_read(&mdsc->snap_rwsem) after calling ceph_get_inode. This is problematic because down_read is uninterruptible. If a signal (like SIGKILL or Ctrl+C) is sent to the process while it's waiting to re-acquire the snap_rwsem read lock (e.g., if another thread holds the write lock), the process will hang indefinitely and cannot be killed. Using down_read_killable(&mdsc->snap_rwsem) instead allows the lock acquisition to be interrupted by signals. If it fails (returns -EINTR), the error must be handled properly (e.g., release the inode reference obtained from ceph_get_inode and propagate the error) to ensure the process remains killable and doesn't hang. This change is essential to prevent potential system hangs. Best regards, On Wed, Jul 30, 2025 at 1:49 PM Zhao Sun <sunzhao03@kuaishou.com> wrote: > > When ceph_readdir_prepopulate calls ceph_get_inode while holding > mdsc->snap_rwsem, a deadlock may occur, blocking all subsequent > requests of the current session. > > Fix by release the mds->snap_rwsem read lock before calling the > ceph_get_inode function. > > Link: https://tracker.ceph.com/issues/72307 > Signed-off-by: Zhao Sun <sunzhao03@kuaishou.com> > --- > fs/ceph/inode.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c > index 06cd2963e41e..3d7fb045ba76 100644 > --- a/fs/ceph/inode.c > +++ b/fs/ceph/inode.c > @@ -1900,6 +1900,7 @@ static int fill_readdir_cache(struct inode *dir, struct dentry *dn, > int ceph_readdir_prepopulate(struct ceph_mds_request *req, > struct ceph_mds_session *session) > { > + struct ceph_mds_client *mdsc = session->s_mdsc; > struct dentry *parent = req->r_dentry; > struct inode *inode = d_inode(parent); > struct ceph_inode_info *ci = ceph_inode(inode); > @@ -2029,7 +2030,10 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req, > if (d_really_is_positive(dn)) { > in = d_inode(dn); > } else { > + /* Release mdsc->snap_rwsem in advance to avoid deadlock */ > + up_read(&mdsc->snap_rwsem); > in = ceph_get_inode(parent->d_sb, tvino, NULL); > + down_read(&mdsc->snap_rwsem); > if (IS_ERR(in)) { > doutc(cl, "new_inode badness\n"); > d_drop(dn); > -- > 2.39.2 (Apple Git-143) > >
I've taken another look, there is a much bigger issue here I'm not entirely convinced its safe to release and re-acquire the lock: The original design ensures that all snap context operations, both reading and writing, happen under the same lock, guaranteeing consistency. The patch breaks this invariant by creating a window where snap contexts can change between the inode creation and the subsequent ceph_fill_inode call. Bottom line: The patch trades deadlock prevention for potential snap context inconsistency, which could lead to data corruption or incorrect file visibility in snapshots. On Wed, Jul 30, 2025 at 4:45 PM Alex Markuze <amarkuze@redhat.com> wrote: > > Hi, > > The patch correctly identifies and addresses the deadlock by releasing > snap_rwsem before the blocking ceph_get_inode call. > > My Recommendation: > > Use down_read_killable: The current patch uses > down_read(&mdsc->snap_rwsem) after calling ceph_get_inode. > This is problematic because down_read is uninterruptible. If a signal > (like SIGKILL or Ctrl+C) is sent to the process while it's waiting to > re-acquire the snap_rwsem read lock (e.g., if another thread holds the > write lock), the process will hang indefinitely and cannot be killed. > Using down_read_killable(&mdsc->snap_rwsem) instead allows the lock > acquisition to be interrupted by signals. If it fails (returns > -EINTR), the error must be handled properly (e.g., release the inode > reference obtained from ceph_get_inode and propagate the error) to > ensure the process remains killable and doesn't hang. > > This change is essential to prevent potential system hangs. > > Best regards, > > On Wed, Jul 30, 2025 at 1:49 PM Zhao Sun <sunzhao03@kuaishou.com> wrote: > > > > When ceph_readdir_prepopulate calls ceph_get_inode while holding > > mdsc->snap_rwsem, a deadlock may occur, blocking all subsequent > > requests of the current session. > > > > Fix by release the mds->snap_rwsem read lock before calling the > > ceph_get_inode function. > > > > Link: https://tracker.ceph.com/issues/72307 > > Signed-off-by: Zhao Sun <sunzhao03@kuaishou.com> > > --- > > fs/ceph/inode.c | 4 ++++ > > 1 file changed, 4 insertions(+) > > > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c > > index 06cd2963e41e..3d7fb045ba76 100644 > > --- a/fs/ceph/inode.c > > +++ b/fs/ceph/inode.c > > @@ -1900,6 +1900,7 @@ static int fill_readdir_cache(struct inode *dir, struct dentry *dn, > > int ceph_readdir_prepopulate(struct ceph_mds_request *req, > > struct ceph_mds_session *session) > > { > > + struct ceph_mds_client *mdsc = session->s_mdsc; > > struct dentry *parent = req->r_dentry; > > struct inode *inode = d_inode(parent); > > struct ceph_inode_info *ci = ceph_inode(inode); > > @@ -2029,7 +2030,10 @@ int ceph_readdir_prepopulate(struct ceph_mds_request *req, > > if (d_really_is_positive(dn)) { > > in = d_inode(dn); > > } else { > > + /* Release mdsc->snap_rwsem in advance to avoid deadlock */ > > + up_read(&mdsc->snap_rwsem); > > in = ceph_get_inode(parent->d_sb, tvino, NULL); > > + down_read(&mdsc->snap_rwsem); > > if (IS_ERR(in)) { > > doutc(cl, "new_inode badness\n"); > > d_drop(dn); > > -- > > 2.39.2 (Apple Git-143) > > > >
© 2016 - 2025 Red Hat, Inc.