[PATCH v1 08/13] ceph: make ceph_start_io_write() killable

Ionut Nechita (Wind River) posted 13 patches 3 weeks, 5 days ago
Posted by Ionut Nechita (Wind River) 3 weeks, 5 days ago
From: Ionut Nechita <ionut.nechita@windriver.com>

When multiple processes write to the same file and one of them is
blocked waiting for an MDS/OSD response (e.g., during MDS failover),
the other processes block indefinitely on down_write(&inode->i_rwsem)
in ceph_start_io_write().

This causes hung task warnings:

  INFO: task dd:12345 blocked for more than 122 seconds.
  Call Trace:
    ceph_start_io_write+0x...
    ceph_write_iter+0x...

The i_rwsem is held by a process doing fsync/writeback that is
waiting for an MDS or OSD response. Other writers queue up on the
rwsem and cannot be interrupted while they wait.

Fix this by using down_write_killable() instead of down_write().
This allows blocked processes to be killed with SIGKILL, preventing
indefinite hangs. The function now returns an error code that
callers must check.

Update ceph_write_iter() to handle the new error return from
ceph_start_io_write().

Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
 fs/ceph/file.c | 9 +++++++--
 fs/ceph/io.c   | 9 +++++++--
 fs/ceph/io.h   | 2 +-
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 6587c2d5af1e0..01e4f31b1f2f3 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -2359,8 +2359,13 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
 retry_snap:
 	if (direct_lock)
 		ceph_start_io_direct(inode);
-	else
-		ceph_start_io_write(inode);
+	else {
+		err = ceph_start_io_write(inode);
+		if (err) {
+			ceph_free_cap_flush(prealloc_cf);
+			return err;
+		}
+	}
 
 	if (iocb->ki_flags & IOCB_APPEND) {
 		err = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
diff --git a/fs/ceph/io.c b/fs/ceph/io.c
index c456509b31c3f..f9ac89ec1d6a1 100644
--- a/fs/ceph/io.c
+++ b/fs/ceph/io.c
@@ -83,11 +83,16 @@ ceph_end_io_read(struct inode *inode)
  * Declare that a buffered write operation is about to start, and ensure
  * that we block all direct I/O.
  */
-void
+int
 ceph_start_io_write(struct inode *inode)
 {
-	down_write(&inode->i_rwsem);
+	int ret;
+
+	ret = down_write_killable(&inode->i_rwsem);
+	if (ret)
+		return ret;
 	ceph_block_o_direct(ceph_inode(inode), inode);
+	return 0;
 }
 
 /**
diff --git a/fs/ceph/io.h b/fs/ceph/io.h
index fa594cd77348a..94ce176df9997 100644
--- a/fs/ceph/io.h
+++ b/fs/ceph/io.h
@@ -4,7 +4,7 @@
 
 void ceph_start_io_read(struct inode *inode);
 void ceph_end_io_read(struct inode *inode);
-void ceph_start_io_write(struct inode *inode);
+int ceph_start_io_write(struct inode *inode);
 void ceph_end_io_write(struct inode *inode);
 void ceph_start_io_direct(struct inode *inode);
 void ceph_end_io_direct(struct inode *inode);
-- 
2.53.0
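
The error path in the ceph_write_iter() hunk above frees prealloc_cf before returning, which suggests that the cap flush is allocated before the lock attempt. A minimal userspace sketch of that cleanup pattern follows. Every name in it (fake_cap_flush, fake_alloc_cap_flush(), fake_free_cap_flush(), fake_write_iter(), live_allocs) is a hypothetical stand-in rather than the kernel code, and the lock attempt is reduced to an integer passed in by the caller.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the preallocated cap flush. */
struct fake_cap_flush {
	int unused;
};

/* Counts outstanding allocations so a leak is easy to detect. */
static int live_allocs;

static struct fake_cap_flush *fake_alloc_cap_flush(void)
{
	struct fake_cap_flush *cf = malloc(sizeof(*cf));

	if (cf)
		live_allocs++;
	return cf;
}

static void fake_free_cap_flush(struct fake_cap_flush *cf)
{
	if (cf) {
		live_allocs--;
		free(cf);
	}
}

/*
 * lock_result simulates the return value of a killable lock attempt:
 * 0 on success, a negative errno (e.g. -EINTR) if the wait was killed.
 */
static long fake_write_iter(int lock_result)
{
	struct fake_cap_flush *prealloc_cf = fake_alloc_cap_flush();

	if (!prealloc_cf)
		return -ENOMEM;

	if (lock_result) {
		/* Lock was never taken: release the allocation and bail out. */
		fake_free_cap_flush(prealloc_cf);
		return lock_result;
	}

	/* The write itself would happen here, under the lock. */

	fake_free_cap_flush(prealloc_cf);
	return 0;
}
```

Calling fake_write_iter(-EINTR) returns -EINTR and leaves live_allocs at zero, which is the property the added error path is meant to preserve.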
Re: [PATCH v1 08/13] ceph: make ceph_start_io_write() killable
Posted by Viacheslav Dubeyko 3 weeks, 4 days ago
On Thu, 2026-03-12 at 10:16 +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> When multiple processes write to the same file and one of them is
> blocked waiting for MDS/OSD response (e.g., during MDS failover),
> other processes block indefinitely on down_write(&inode->i_rwsem)
> in ceph_start_io_write().
> 
> This causes hung task warnings:
> 
>   INFO: task dd:12345 blocked for more than 122 seconds.
>   Call Trace:
>     ceph_start_io_write+0x...
>     ceph_write_iter+0x...
> 
> The i_rwsem is held by a process doing fsync/writeback that is
> waiting for MDS or OSD response. Other writers queue up on the
> rwsem and block indefinitely.
> 
> Fix this by using down_write_killable() instead of down_write().
> This allows blocked processes to be killed with SIGKILL, preventing
> indefinite hangs. The function now returns an error code that
> callers must check.
> 
> Update ceph_write_iter() to handle the new error return from
> ceph_start_io_write().
> 
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
> ---
>  fs/ceph/file.c | 9 +++++++--
>  fs/ceph/io.c   | 9 +++++++--
>  fs/ceph/io.h   | 2 +-
>  3 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 6587c2d5af1e0..01e4f31b1f2f3 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -2359,8 +2359,13 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  retry_snap:
>  	if (direct_lock)
>  		ceph_start_io_direct(inode);
> -	else
> -		ceph_start_io_write(inode);
> +	else {
> +		err = ceph_start_io_write(inode);
> +		if (err) {
> +			ceph_free_cap_flush(prealloc_cf);
> +			return err;
> +		}
> +	}
>  
>  	if (iocb->ki_flags & IOCB_APPEND) {
>  		err = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
> diff --git a/fs/ceph/io.c b/fs/ceph/io.c
> index c456509b31c3f..f9ac89ec1d6a1 100644
> --- a/fs/ceph/io.c
> +++ b/fs/ceph/io.c
> @@ -83,11 +83,16 @@ ceph_end_io_read(struct inode *inode)
>   * Declare that a buffered write operation is about to start, and ensure
>   * that we block all direct I/O.
>   */
> -void
> +int
>  ceph_start_io_write(struct inode *inode)
>  {
> -	down_write(&inode->i_rwsem);
> +	int ret;
> +
> +	ret = down_write_killable(&inode->i_rwsem);
> +	if (ret)
> +		return ret;
>  	ceph_block_o_direct(ceph_inode(inode), inode);
> +	return 0;
>  }

Which kernel version is this series based on? In v7.0-rc3 we already have
the following [1]:

/**
 * ceph_start_io_write - declare the file is being used for buffered writes
 * @inode: file inode
 *
 * Declare that a buffered write operation is about to start, and ensure
 * that we block all direct I/O.
 */
int ceph_start_io_write(struct inode *inode)
{
	int err = down_write_killable(&inode->i_rwsem);
	if (!err)
		ceph_block_o_direct(ceph_inode(inode), inode);
	return err;
}

Thanks,
Slava.

>  
>  /**
> diff --git a/fs/ceph/io.h b/fs/ceph/io.h
> index fa594cd77348a..94ce176df9997 100644
> --- a/fs/ceph/io.h
> +++ b/fs/ceph/io.h
> @@ -4,7 +4,7 @@
>  
>  void ceph_start_io_read(struct inode *inode);
>  void ceph_end_io_read(struct inode *inode);
> -void ceph_start_io_write(struct inode *inode);
> +int ceph_start_io_write(struct inode *inode);
>  void ceph_end_io_write(struct inode *inode);
>  void ceph_start_io_direct(struct inode *inode);
>  void ceph_end_io_direct(struct inode *inode);

[1] https://elixir.bootlin.com/linux/v7.0-rc3/source/fs/ceph/io.c#L110
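
The patch's version of ceph_start_io_write() and the upstream v7.0-rc3 version quoted above differ only in shape: one returns early on error, the other guards the follow-up call with if (!err). A minimal userspace sketch, assuming stubbed primitives, shows that both behave the same way. The fake_inode structure and the stub_* helpers are hypothetical stand-ins; the only kernel behaviour modelled is that down_write_killable() returns 0 on success and -EINTR when a fatal signal interrupts the wait.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the inode state involved. */
struct fake_inode {
	bool fatal_signal_pending;	/* simulates SIGKILL arriving while waiting */
	bool rwsem_held;		/* simulates i_rwsem held for write */
	bool o_direct_blocked;		/* simulates the effect of ceph_block_o_direct() */
};

/* Stub modelled on down_write_killable(): 0 on success, -EINTR if killed. */
static int stub_down_write_killable(struct fake_inode *inode)
{
	if (inode->fatal_signal_pending)
		return -EINTR;
	inode->rwsem_held = true;
	return 0;
}

/* Stub for ceph_block_o_direct(); only records that it ran. */
static void stub_block_o_direct(struct fake_inode *inode)
{
	inode->o_direct_blocked = true;
}

/* Shape used by the patch in this thread: early return on error. */
static int start_io_write_patch(struct fake_inode *inode)
{
	int ret;

	ret = stub_down_write_killable(inode);
	if (ret)
		return ret;
	stub_block_o_direct(inode);
	return 0;
}

/* Shape quoted from upstream v7.0-rc3: proceed only when err is zero. */
static int start_io_write_upstream(struct fake_inode *inode)
{
	int err = stub_down_write_killable(inode);

	if (!err)
		stub_block_o_direct(inode);
	return err;
}
```

In both variants a killed waiter returns -EINTR without holding the lock and without blocking direct I/O, so a caller must not call ceph_end_io_write() on that path.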
Re: [PATCH v1 08/13] ceph: make ceph_start_io_write() killable
Posted by Ionut Nechita (Wind River) 3 weeks, 4 days ago
From: Ionut Nechita <ionut.nechita@windriver.com>

Hi Slava,

Thanks for pointing this out.

My patch series is based on v6.12.57 (stable/LTS), where
ceph_start_io_write() still uses the non-killable down_write().

I see that upstream v7.0-rc3 already has this change. I will take
this into account and adapt the series for 6.18 LTS and 7.0+ as
well, dropping patches that are already upstream.

Thanks,
Ionut
RE: [PATCH v1 08/13] ceph: make ceph_start_io_write() killable
Posted by Viacheslav Dubeyko 3 weeks, 3 days ago
On Thu, 2026-03-12 at 22:45 +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> Hi Slava,
> 
> Thanks for pointing this out.
> 
> My patch series is based on v6.12.57 (stable/LTS), where
> ceph_start_io_write() still uses the non-killable down_write().
> 
> I see that upstream v7.0-rc3 already has this change. I will take
> this into account and adapt the series for 6.18 LTS and 7.0+ as
> well, dropping patches that are already upstream.
> 
> 

Sounds great! :)

I think you should consider specifying the timeout explicitly as 100 ms
instead of HZ/10.

Thanks,
Slava.
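
The timeout under discussion belongs to other patches in the series and is not shown in this thread. As a rough illustration of the suggestion, the userspace sketch below compares HZ/10 with a 100 ms value converted to ticks. The helper names ms_to_ticks() and tenth_of_second_ticks() are hypothetical; in kernel code the conversion would presumably be written with msecs_to_jiffies(100), but that this is the intended form is an assumption.

```c
/*
 * Hypothetical userspace model of a milliseconds-to-ticks conversion.
 * It rounds up so that a timeout never expires earlier than requested.
 */
static unsigned long ms_to_ticks(unsigned long ms, unsigned long hz)
{
	return (ms * hz + 999) / 1000;
}

/* The HZ/10 spelling: one tenth of a second expressed in ticks. */
static unsigned long tenth_of_second_ticks(unsigned long hz)
{
	return hz / 10;
}
```

For tick rates of 100, 250, 300 and 1000 the two expressions give the same number of ticks, so the suggestion is about stating the unit in the code rather than changing the duration.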