[PATCH] ext4: fix quota accounting WARN in bigalloc punch hole

Qiliang Yuan posted 1 patch 1 week, 4 days ago
fs/ext4/extents_status.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
[PATCH] ext4: fix quota accounting WARN in bigalloc punch hole
Posted by Qiliang Yuan 1 week, 4 days ago
When doing direct I/O write on a bigalloc filesystem, the allocated
extent might not cover entire clusters at its boundaries, leaving
delayed blocks in those boundary clusters.  In ext4_es_insert_extent(),
__revise_pending() inserts new pending reservations for those boundary
clusters, and the return value (pending=true) was added to resv_used,
causing ext4_da_update_reserve_space() to incorrectly release the
quota reservations for those boundary clusters.

Later when PUNCH_HOLE removes the DIO-allocated blocks, the
extent removal path detects the pending reservation via
ext4_is_pending() and calls ext4_rereserve_cluster().  This tries
to reclaim quota from dq_dqb.dqb_curspace back to dqb_rsvspace,
but since the quota was already incorrectly released, dqb_curspace
is insufficient, triggering:

  WARNING at dquot_reclaim_space_nodirty+0x77c/0x8c0

The subsequent delalloc writeback then fires a second WARN from
dquot_claim_space_nodirty() for the same reason: dqb_rsvspace was
depleted by the earlier incorrect release.

__es_remove_extent() -> get_rsvd() already correctly excludes
boundary clusters that still contain delayed blocks from resv_used.
Adding pending to resv_used double-counts those boundary clusters,
erroneously releasing reservations that are still needed.

Remove the pending variable and the resv_used += pending addition.

Fixes: c543e2429640 ("ext4: update delalloc data reserve spcae in ext4_es_insert_extent()")
Reported-by: Zijing Yin <yzjaurora@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221570
Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
---
 fs/ext4/extents_status.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 6e4a191e82191..fefe0bb8ac4d1 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -909,7 +909,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 	int err1 = 0, err2 = 0, err3 = 0;
-	int resv_used = 0, pending = 0;
+	int resv_used = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	struct extent_status *es1 = NULL;
 	struct extent_status *es2 = NULL;
@@ -977,7 +977,6 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 			__free_pending(pr);
 			pr = NULL;
 		}
-		pending = err3;
 	}
 	ext4_es_inc_seq(inode);
 error:
@@ -998,7 +997,6 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	 * any previously delayed allocated clusters instead of claim them
 	 * again.
 	 */
-	resv_used += pending;
 	if (resv_used)
 		ext4_da_update_reserve_space(inode, resv_used,
 					     delalloc_reserve_used);

---
base-commit: eb3f4b7426cfd2b79d65b7d37155480b32259a11
change-id: 20260528-fix-ext4-bigalloc-punch-hole-quota-v2-2adca315d1ba

Best regards,
-- 
Qiliang Yuan <realwujing@gmail.com>
Re: [PATCH] ext4: fix quota accounting WARN in bigalloc punch hole
Posted by Zhang Yi 1 week, 3 days ago
Hi, Qiliang!

On 5/28/2026 6:21 PM, Qiliang Yuan wrote:
> When doing direct I/O write on a bigalloc filesystem, the allocated
> extent might not cover entire clusters at its boundaries, leaving
> delayed blocks in those boundary clusters.  In ext4_es_insert_extent(),
> __revise_pending() inserts new pending reservations for those boundary
> clusters, and the return value (pending=true) was added to resv_used,
> causing ext4_da_update_reserve_space() to incorrectly release the
> quota reservations for those boundary clusters.
> 
> Later when PUNCH_HOLE removes the DIO-allocated blocks, the
> extent removal path detects the pending reservation via
> ext4_is_pending() and calls ext4_rereserve_cluster().  This tries
> to reclaim quota from dq_dqb.dqb_curspace back to dqb_rsvspace,
> but since the quota was already incorrectly released, dqb_curspace
> is insufficient, triggering:
> 
>   WARNING at dquot_reclaim_space_nodirty+0x77c/0x8c0

Hmm, the analysis doesn't seem correct to me. Do you mean the
following case?

# Assume the cluster size is 16KB.
xfs_io -f -c "pwrite 12k 4k" /mnt/foo
xfs_io -d -c "pwrite 0 4k" /mnt/foo
xfs_io -c "fpunch 0 4k" /mnt/foo

During the direct I/O write, quota space will be added in
ext4_mb_new_blocks() because the EXT4_MB_DELALLOC_RESERVED flag is
not set. Therefore, in ext4_es_insert_extent(), we should release the
quota reservations, since this cluster has already been allocated.

Then, in the third operation (punch hole), it will reclaim the added
dqb_curspace. This should not cause an insufficiency.

Am I missing something?

> 
> The subsequent delalloc writeback then fires a second WARN from
> dquot_claim_space_nodirty() for the same reason: dqb_rsvspace was
> depleted by the earlier incorrect release.
> 
> __es_remove_extent() -> get_rsvd() already correctly excludes
> boundary clusters that still contain delayed blocks from resv_used.
> Adding pending to resv_used double-counts those boundary clusters,
> erroneously releasing reservations that are still needed.
> 
> Remove the pending variable and the resv_used += pending addition.

That's not correct. Assume the following case.

# Assume the cluster size is 16KB.
xfs_io -f -c "pwrite 0 16k" /mnt/foo
xfs_io -c "sync_range -w 0 4k" /mnt/foo

Although only part of the cluster is written back, the cluster has been
allocated. Therefore, the quota needs to be claimed. However, since we
only wrote back a portion of a cluster, __es_remove_extent() will not
return the reserved clusters that need to be consumed (i.e., resv_used
is zero). Therefore, we need to determine whether a new pending
allocation has been created by checking the pending status, so that we
can correctly claim the quota.

Thanks,
Yi

> 
> Fixes: c543e2429640 ("ext4: update delalloc data reserve spcae in ext4_es_insert_extent()")
> Reported-by: Zijing Yin <yzjaurora@gmail.com>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221570
> Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
> ---
>  fs/ext4/extents_status.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 6e4a191e82191..fefe0bb8ac4d1 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -909,7 +909,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  	struct extent_status newes;
>  	ext4_lblk_t end = lblk + len - 1;
>  	int err1 = 0, err2 = 0, err3 = 0;
> -	int resv_used = 0, pending = 0;
> +	int resv_used = 0;
>  	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>  	struct extent_status *es1 = NULL;
>  	struct extent_status *es2 = NULL;
> @@ -977,7 +977,6 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  			__free_pending(pr);
>  			pr = NULL;
>  		}
> -		pending = err3;
>  	}
>  	ext4_es_inc_seq(inode);
>  error:
> @@ -998,7 +997,6 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>  	 * any previously delayed allocated clusters instead of claim them
>  	 * again.
>  	 */
> -	resv_used += pending;
>  	if (resv_used)
>  		ext4_da_update_reserve_space(inode, resv_used,
>  					     delalloc_reserve_used);
> 
> ---
> base-commit: eb3f4b7426cfd2b79d65b7d37155480b32259a11
> change-id: 20260528-fix-ext4-bigalloc-punch-hole-quota-v2-2adca315d1ba
> 
> Best regards,
Re: [PATCH] ext4: fix quota accounting WARN in bigalloc punch hole
Posted by Qiliang Yuan 1 week, 3 days ago
Hi Zhang Yi,

On 5/29/2026 3:32 PM, Zhang Yi wrote:
> On 5/28/2026 6:21 PM, Qiliang Yuan wrote:
>> __es_remove_extent() -> get_rsvd() already correctly excludes
>> boundary clusters that still contain delayed blocks from resv_used.
>> Adding pending to resv_used double-counts those boundary clusters,
>> erroneously releasing reservations that are still needed.
>
> Hmm, the analysis doesn't seem correct to me. Do you mean the
> following case?
>
> # Assume the cluster size is 16KB.
> xfs_io -f -c "pwrite 12k 4k" /mnt/foo
> xfs_io -d -c "pwrite 0 4k" /mnt/foo
> xfs_io -c "fpunch 0 4k" /mnt/foo
>
> During the direct I/O write, quota space will be added in
> ext4_mb_new_blocks() because the EXT4_MB_DELALLOC_RESERVED flag is
> not set. Therefore, in ext4_es_insert_extent(), we should release the
> quota reservations, since this cluster has already been allocated.
>
> Then, in the third operation (punch hole), it will reclaim the added
> dqb_curspace. This should not cause an insufficiency.
>
> Am I missing something?

Thanks for the review!  Let me explain the issue with your specific
example.

After step 1 (delalloc write 4KB@12KB in a 16KB cluster), we have:

  ES tree: [0,1) hole, [1,3) delayed, [3,4) hole   (blocks 0..3)
  Quota:   dqb_rsvspace += 16KB (one cluster reserved)

Step 2 (DIO write 4KB@0KB, RWF_DSYNC):

  The DIO allocates one cluster, but the mapped extent from
  ext4_ext_map_blocks() only covers the written range, e.g. [0,0].

  In ext4_es_insert_extent():

    a) __es_remove_extent([0,0]) removes the [0,1) hole entry, but
       there are no delayed extents within [0,0], so resv_used = 0.

       This is correct: the DIO extent [0,0] does not overlap the
       delayed region [1,3).

    b) __revise_pending() scans outside the newly inserted extent [0,0]:

       - Left boundary (block 0): the range starts at cluster
         boundary, no blocks to scan on the left → no pending insert.

       - Right boundary (block 3): blocks [1,3] are outside [0,0]
         and are delayed → __revise_pending() inserts a pending
         reservation and returns pending = 1.

       This pending reservation means: "cluster containing block 3
       still has delayed blocks, keep this cluster reserved."

    c) Then comes the bug:

         resv_used += pending;   // resv_used = 0 + 1 = 1

       This causes ext4_da_update_reserve_space() to release 16KB
       of quota reservation (dquot_release_reservation_block()).  But
       block 3 is still delayed!  Its quota reservation should NOT
       be released — no blocks within [0,0] actually used the delalloc
       reservation.

    So after DIO:
      dqb_rsvspace: originally 16KB, now 0 (incorrectly released)
      dqb_curspace: 16KB (from ext4_mb_new_blocks)

Step 3 (punch 4KB@0KB):

  ext4_remove_blocks() sees that block 0's cluster has a pending
  reservation → calls ext4_rereserve_cluster() → dquot_reclaim_block()
  tries to move 16KB from dqb_curspace to dqb_rsvspace.  But
  dqb_curspace may already be insufficient (depending on whether
  other allocs/frees have happened), triggering:

    WARNING at dquot_reclaim_space_nodirty

Step 4 (delalloc writeback of block 3):

  ext4_da_update_reserve_space() → dquot_claim_block() tries to
  move 16KB from dqb_rsvspace to dqb_curspace.  Since rsvspace was
  incorrectly released and not fully restored by the punch hole's
  rereseve, this triggers:

    WARNING at dquot_claim_space_nodirty

The key point is: pending from __revise_pending() indicates clusters
that *still contain delayed blocks outside the newly inserted extent*.
These clusters' quota reservations must be preserved — get_rsvd()
inside __es_remove_extent() already correctly excludes them from
resv_used.  Adding pending to resv_used double-counts them.

I also found a pre-existing retry loop issue: if __es_insert_extent()
fails with -ENOMEM on the first pass, resv_used carries a stale value
back to retry and could be double-released.  I'll include a fix for
that in v2.

-- 
Qiliang Yuan
Re: [PATCH] ext4: fix quota accounting WARN in bigalloc punch hole
Posted by Zhang Yi 1 week, 3 days ago
On 5/29/2026 4:34 PM, Qiliang Yuan wrote:
> Hi Zhang Yi,
> 
> On 5/29/2026 3:32 PM, Zhang Yi wrote:
>> On 5/28/2026 6:21 PM, Qiliang Yuan wrote:
>>> __es_remove_extent() -> get_rsvd() already correctly excludes
>>> boundary clusters that still contain delayed blocks from resv_used.
>>> Adding pending to resv_used double-counts those boundary clusters,
>>> erroneously releasing reservations that are still needed.
>>
>> Hmm, the analysis doesn't seem correct to me. Do you mean the
>> following case?
>>
>> # Assume the cluster size is 16KB.
>> xfs_io -f -c "pwrite 12k 4k" /mnt/foo
>> xfs_io -d -c "pwrite 0 4k" /mnt/foo
>> xfs_io -c "fpunch 0 4k" /mnt/foo
>>
>> During the direct I/O write, quota space will be added in
>> ext4_mb_new_blocks() because the EXT4_MB_DELALLOC_RESERVED flag is
>> not set. Therefore, in ext4_es_insert_extent(), we should release the
>> quota reservations, since this cluster has already been allocated.
>>
>> Then, in the third operation (punch hole), it will reclaim the added
>> dqb_curspace. This should not cause an insufficiency.
>>
>> Am I missing something?
> 
> Thanks for the review!  Let me explain the issue with your specific
> example.

Thanks for the explanation, some comments below.

> 
> After step 1 (delalloc write 4KB@12KB in a 16KB cluster), we have:
> 
>   ES tree: [0,1) hole, [1,3) delayed, [3,4) hole   (blocks 0..3)

             [0,3) hole, [3,4) delayed(blocks 0..3)  ?

>   Quota:   dqb_rsvspace += 16KB (one cluster reserved)

Right.

> 
> Step 2 (DIO write 4KB@0KB, RWF_DSYNC):
> 
>   The DIO allocates one cluster, but the mapped extent from
>   ext4_ext_map_blocks() only covers the written range, e.g. [0,0].
> 
>   In ext4_es_insert_extent():
> 
>     a) __es_remove_extent([0,0]) removes the [0,1) hole entry, but
>        there are no delayed extents within [0,0], so resv_used = 0.
> 
>        This is correct: the DIO extent [0,0] does not overlap the
>        delayed region [1,3).

Right.

> 
>     b) __revise_pending() scans outside the newly inserted extent [0,0]:
> 
>        - Left boundary (block 0): the range starts at cluster
>          boundary, no blocks to scan on the left → no pending insert.
> 
>        - Right boundary (block 3): blocks [1,3] are outside [0,0]
>          and are delayed → __revise_pending() inserts a pending
>          reservation and returns pending = 1.
> 
>        This pending reservation means: "cluster containing block 3
>        still has delayed blocks, keep this cluster reserved."

Right.

> 
>     c) Then comes the bug:
> 
>          resv_used += pending;   // resv_used = 0 + 1 = 1
> 
>        This causes ext4_da_update_reserve_space() to release 16KB
>        of quota reservation (dquot_release_reservation_block()).  But
>        block 3 is still delayed!  Its quota reservation should NOT
>        be released — no blocks within [0,0] actually used the delalloc
>        reservation.

No. Because this cluster has already been allocated, it needs to occupy
actual quota space and can no longer use reserved counts, so the
reserved counts must be released.

For better understanding, please see the implementation of
ext4_insert_delayed_blocks(). When a new extent is delallocated to an
already allocated cluster using the the reserved count is not increased
(ext4_clu_alloc_state() returns 2).

> 
>     So after DIO:
>       dqb_rsvspace: originally 16KB, now 0 (incorrectly released)
>       dqb_curspace: 16KB (from ext4_mb_new_blocks)
> 
> Step 3 (punch 4KB@0KB):
> 
>   ext4_remove_blocks() sees that block 0's cluster has a pending
>   reservation → calls ext4_rereserve_cluster() → dquot_reclaim_block()
>   tries to move 16KB from dqb_curspace to dqb_rsvspace.  But
>   dqb_curspace may already be insufficient (depending on whether
>   other allocs/frees have happened), triggering:
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
How exactly did it happen? All data cluster allocations are precisely
calculated. In addition, the current example cannot reproduce the issue
you encountered.

> 
>     WARNING at dquot_reclaim_space_nodirty
> 
> Step 4 (delalloc writeback of block 3):
> 
>   ext4_da_update_reserve_space() → dquot_claim_block() tries to
>   move 16KB from dqb_rsvspace to dqb_curspace.  Since rsvspace was
>   incorrectly released and not fully restored by the punch hole's
>   rereseve, this triggers:
> 
>     WARNING at dquot_claim_space_nodirty
> 
> The key point is: pending from __revise_pending() indicates clusters
> that *still contain delayed blocks outside the newly inserted extent*.
> These clusters' quota reservations must be preserved — get_rsvd()

Whether to reserve a quota is determined not by whether there are
delalloc extents left, but by whether there are only pure delalloc
extents in this cluster range.

Cheers,
Yi.

> inside __es_remove_extent() already correctly excludes them from
> resv_used.  Adding pending to resv_used double-counts them.
> 
> I also found a pre-existing retry loop issue: if __es_insert_extent()
> fails with -ENOMEM on the first pass, resv_used carries a stale value
> back to retry and could be double-released.  I'll include a fix for
> that in v2.
> 

Re: [PATCH] ext4: fix quota accounting WARN in bigalloc punch hole
Posted by Qiliang Yuan 1 week, 2 days ago
On 5/29/2026 6:37 PM, Zhang Yi wrote:
> On 5/29/2026 4:34 PM, Qiliang Yuan wrote:
>> Thanks for the review!  Let me explain the issue with your specific
>> example.
> 
> Thanks for the explanation, some comments below.
>
>>   ext4_remove_blocks() sees that block 0's cluster has a pending
>>   reservation → calls ext4_rereserve_cluster() → dquot_reclaim_block()
>>   tries to move 16KB from dqb_curspace to dqb_rsvspace.  But
>>   dqb_curspace may already be insufficient (depending on whether
>>   other allocs/frees have happened), triggering:
>     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> How exactly did it happen? All data cluster allocations are precisely
> calculated. In addition, the current example cannot reproduce the issue
> you encountered.
>
>> The key point is: pending from __revise_pending() indicates clusters
>> that *still contain delayed blocks outside the newly inserted extent*.
>> These clusters' quota reservations must be preserved — get_rsvd()
> 
> Whether to reserve a quota is determined not by whether there are
> delalloc extents left, but by whether there are only pure delalloc
> extents in this cluster range.

You are right about the simple case — I missed the ext4_clu_alloc_state()
logic in ext4_insert_delayed_blocks().  When a cluster already has
allocated blocks, subsequent delalloc within that cluster does not
increase rsvspace, so releasing rsvspace after DIO allocation is
correct there.

However, the syzkaller reproducer's scenario is more complex.  It uses
three operations:

  1. DIO write at sub-block-aligned offset 0x6400 with RWF_DSYNC
     (pwritev2, 0x140000 bytes at offset 0x6400)
  2. Delalloc single-byte write at offset 0x8080c61 (far away cluster)
  3. PUNCH_HOLE from offset 0x7f covering 0x8000c61 bytes

The DIO and delalloc are in different, distant clusters.  The
sub-block-aligned DIO may cause it to fall through a different
code path — possibly zeroing or partial writeback within the cluster
boundary.

I'll instrument a kernel to trace the exact rsvspace/curspace
transitions and get back with concrete numbers showing where the
mismatch occurs.  This should settle whether the root cause is
in resv_used += pending or something else.

Thanks for the detailed review!
-- 
Qiliang Yuan