[v2] ext4: check folio uptodate state in ext4_page_mkwrite()

[PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Deepanshu Kartikey 2 months, 2 weeks ago

When delayed block allocation fails due to filesystem corruption,
ext4's writeback error handling invalidates affected folios by calling
mpage_release_unused_pages() with invalidate=true, which explicitly
clears the uptodate flag:

    static void mpage_release_unused_pages(..., bool invalidate)
    {
        ...
        if (invalidate) {
            block_invalidate_folio(folio, 0, folio_size(folio));
            folio_clear_uptodate(folio);
        }
    }

If ext4_page_mkwrite() is subsequently called on such a non-uptodate
folio, it can proceed to mark the folio dirty without checking its
state. This triggers a warning in __folio_mark_dirty():

    WARNING: CPU: 0 PID: 5 at mm/page-writeback.c:2960
    __folio_mark_dirty+0x578/0x880

    Call Trace:
     fault_dirty_shared_page+0x16e/0x2d0
     do_wp_page+0x38b/0xd20
     handle_pte_fault+0x1da/0x450
     __handle_mm_fault+0x652/0x13b0
     handle_mm_fault+0x22a/0x6f0
     do_user_addr_fault+0x200/0x8a0
     exc_page_fault+0x81/0x1b0

This scenario occurs when:
1. A write with delayed allocation marks a folio dirty (uptodate=1)
2. Writeback attempts block allocation but detects filesystem corruption
3. Error handling calls mpage_release_unused_pages(invalidate=true),
   which clears the uptodate flag via folio_clear_uptodate()
4. A subsequent ftruncate() triggers ext4_truncate()
5. ext4_block_truncate_page() attempts to zero the page tail
6. This triggers a write fault on the mmap'd page
7. ext4_page_mkwrite() is called with the non-uptodate folio
8. Without checking uptodate, it proceeds to mark the folio dirty
9. __folio_mark_dirty() triggers: WARN_ON_ONCE(!folio_test_uptodate())

Fix this by checking folio_test_uptodate() early in ext4_page_mkwrite()
and returning VM_FAULT_SIGBUS if the folio is not uptodate. This prevents
attempting to write to invalidated folios and properly signals the error
to userspace.

The check is placed early, before the delalloc/journal/normal code paths,
as none of these paths should proceed with a non-uptodate folio.

Reported-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com
Tested-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b0a0670332b6b3230a0a
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
---
 fs/ext4/inode.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e99306a8f47c..18a029362c1f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6688,6 +6688,14 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	if (err)
 		goto out_ret;
 
+	folio_lock(folio);
+	if (!folio_test_uptodate(folio)) {
+		folio_unlock(folio);
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
+	folio_unlock(folio);
+
 	/*
 	 * On data journalling we skip straight to the transaction handle:
 	 * there's no delalloc; page truncated will be checked later; the
-- 
2.43.0

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Deepanshu Kartikey 2 months, 1 week ago

On Sat, Nov 22, 2025 at 7:27 AM Deepanshu Kartikey
<kartikey406@gmail.com> wrote:
>
> When delayed block allocation fails due to filesystem corruption,
> ext4's writeback error handling invalidates affected folios by calling
> mpage_release_unused_pages() with invalidate=true, which explicitly
> clears the uptodate flag:
>
>     static void mpage_release_unused_pages(..., bool invalidate)
>     {
>         ...
>         if (invalidate) {
>             block_invalidate_folio(folio, 0, folio_size(folio));
>             folio_clear_uptodate(folio);
>         }
>     }
>
> If ext4_page_mkwrite() is subsequently called on such a non-uptodate
> folio, it can proceed to mark the folio dirty without checking its
> state. This triggers a warning in __folio_mark_dirty():
>
>     WARNING: CPU: 0 PID: 5 at mm/page-writeback.c:2960
>     __folio_mark_dirty+0x578/0x880
>
>     Call Trace:
>      fault_dirty_shared_page+0x16e/0x2d0
>      do_wp_page+0x38b/0xd20
>      handle_pte_fault+0x1da/0x450
>      __handle_mm_fault+0x652/0x13b0
>      handle_mm_fault+0x22a/0x6f0
>      do_user_addr_fault+0x200/0x8a0
>      exc_page_fault+0x81/0x1b0
>
> This scenario occurs when:
> 1. A write with delayed allocation marks a folio dirty (uptodate=1)
> 2. Writeback attempts block allocation but detects filesystem corruption
> 3. Error handling calls mpage_release_unused_pages(invalidate=true),
>    which clears the uptodate flag via folio_clear_uptodate()
> 4. A subsequent ftruncate() triggers ext4_truncate()
> 5. ext4_block_truncate_page() attempts to zero the page tail
> 6. This triggers a write fault on the mmap'd page
> 7. ext4_page_mkwrite() is called with the non-uptodate folio
> 8. Without checking uptodate, it proceeds to mark the folio dirty
> 9. __folio_mark_dirty() triggers: WARN_ON_ONCE(!folio_test_uptodate())
>
> Fix this by checking folio_test_uptodate() early in ext4_page_mkwrite()
> and returning VM_FAULT_SIGBUS if the folio is not uptodate. This prevents
> attempting to write to invalidated folios and properly signals the error
> to userspace.
>
> The check is placed early, before the delalloc/journal/normal code paths,
> as none of these paths should proceed with a non-uptodate folio.
>
> Reported-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com
> Tested-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=b0a0670332b6b3230a0a
> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
> ---
>  fs/ext4/inode.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e99306a8f47c..18a029362c1f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -6688,6 +6688,14 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>         if (err)
>                 goto out_ret;
>
> +       folio_lock(folio);
> +       if (!folio_test_uptodate(folio)) {
> +               folio_unlock(folio);
> +               ret = VM_FAULT_SIGBUS;
> +               goto out;
> +       }
> +       folio_unlock(folio);
> +
>         /*
>          * On data journalling we skip straight to the transaction handle:
>          * there's no delalloc; page truncated will be checked later; the
> --
> 2.43.0
>

Hi Ted and ext4 maintainers,

I wanted to follow up on this patch submitted a week ago. This fixes
a syzbot-reported WARNING in __folio_mark_dirty() that occurs when
ext4_page_mkwrite() is called with a non-uptodate folio after delayed
allocation writeback failure.

Please let me know if there's any feedback or if I should make any
changes.

Thanks,
Deepanshu

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Zhang Yi 2 months, 1 week ago

Hi Deepanshu!

On 11/30/2025 10:06 AM, Deepanshu Kartikey wrote:
> On Sat, Nov 22, 2025 at 7:27 AM Deepanshu Kartikey
> <kartikey406@gmail.com> wrote:
>>
>> When delayed block allocation fails due to filesystem corruption,
>> ext4's writeback error handling invalidates affected folios by calling
>> mpage_release_unused_pages() with invalidate=true, which explicitly
>> clears the uptodate flag:
>>
>>     static void mpage_release_unused_pages(..., bool invalidate)
>>     {
>>         ...
>>         if (invalidate) {
>>             block_invalidate_folio(folio, 0, folio_size(folio));
>>             folio_clear_uptodate(folio);
>>         }
>>     }
>>
>> If ext4_page_mkwrite() is subsequently called on such a non-uptodate
>> folio, it can proceed to mark the folio dirty without checking its
>> state. This triggers a warning in __folio_mark_dirty():
>>
>>     WARNING: CPU: 0 PID: 5 at mm/page-writeback.c:2960
>>     __folio_mark_dirty+0x578/0x880
>>
>>     Call Trace:
>>      fault_dirty_shared_page+0x16e/0x2d0
>>      do_wp_page+0x38b/0xd20
>>      handle_pte_fault+0x1da/0x450
>>      __handle_mm_fault+0x652/0x13b0
>>      handle_mm_fault+0x22a/0x6f0
>>      do_user_addr_fault+0x200/0x8a0
>>      exc_page_fault+0x81/0x1b0
>>
>> This scenario occurs when:
>> 1. A write with delayed allocation marks a folio dirty (uptodate=1)
>> 2. Writeback attempts block allocation but detects filesystem corruption
>> 3. Error handling calls mpage_release_unused_pages(invalidate=true),
>>    which clears the uptodate flag via folio_clear_uptodate()
>> 4. A subsequent ftruncate() triggers ext4_truncate()
>> 5. ext4_block_truncate_page() attempts to zero the page tail
>> 6. This triggers a write fault on the mmap'd page
>> 7. ext4_page_mkwrite() is called with the non-uptodate folio
>> 8. Without checking uptodate, it proceeds to mark the folio dirty
>> 9. __folio_mark_dirty() triggers: WARN_ON_ONCE(!folio_test_uptodate())

Thank you a lot for analyzing this issue and the fix patch. As I was
going through the process of understanding this issue, I had one
question. Is the code flow that triggers the warning as follows?

wp_page_shared()
  do_page_mkwrite()
    ext4_page_mkwrite()
      block_page_mkwrite()   //The default delalloc path
        block_commit_write()
          mark_buffer_dirty()
            __folio_mark_dirty(0)  //'warn' is false, doesn't trigger warning
        folio_mark_dirty()
          ext4_dirty_folio()
            block_dirty_folio  //newly_dirty is false, doesn't call __folio_mark_dirty()
  fault_dirty_shared_page()
    folio_mark_dirty()  //Trigger warning ?

This folio has been marked as dirty. How was this warning triggered?
Am I missing something?

Thanks,
Yi.

>>
>> Fix this by checking folio_test_uptodate() early in ext4_page_mkwrite()
>> and returning VM_FAULT_SIGBUS if the folio is not uptodate. This prevents
>> attempting to write to invalidated folios and properly signals the error
>> to userspace.
>>
>> The check is placed early, before the delalloc/journal/normal code paths,
>> as none of these paths should proceed with a non-uptodate folio.
>>
>> Reported-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com
>> Tested-by: syzbot+b0a0670332b6b3230a0a@syzkaller.appspotmail.com
>> Closes: https://syzkaller.appspot.com/bug?extid=b0a0670332b6b3230a0a
>> Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
>> ---
>>  fs/ext4/inode.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index e99306a8f47c..18a029362c1f 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -6688,6 +6688,14 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>>         if (err)
>>                 goto out_ret;
>>
>> +       folio_lock(folio);
>> +       if (!folio_test_uptodate(folio)) {
>> +               folio_unlock(folio);
>> +               ret = VM_FAULT_SIGBUS;
>> +               goto out;
>> +       }
>> +       folio_unlock(folio);
>> +
>>         /*
>>          * On data journalling we skip straight to the transaction handle:
>>          * there's no delalloc; page truncated will be checked later; the
>> --
>> 2.43.0
>>
> 
> Hi Ted and ext4 maintainers,
> 
> I wanted to follow up on this patch submitted a week ago. This fixes
> a syzbot-reported WARNING in __folio_mark_dirty() that occurs when
> ext4_page_mkwrite() is called with a non-uptodate folio after delayed
> allocation writeback failure.
> 
> Please let me know if there's any feedback or if I should make any
> changes.
> 
> Thanks,
> Deepanshu
>

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Deepanshu Kartikey 2 months ago

On Tue, Dec 2, 2025 at 5:54 PM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>
> Hi Deepanshu!
>
> On 11/30/2025 10:06 AM, Deepanshu Kartikey wrote:
> > On Sat, Nov 22, 2025 at 7:27 AM Deepanshu Kartikey
> > <kartikey406@gmail.com> wrote:
> >>
> >> When delayed block allocation fails due to filesystem corruption,
> >> ext4's writeback error handling invalidates affected folios by calling
> >> mpage_release_unused_pages() with invalidate=true, which explicitly
> >> clears the uptodate flag:
> >>
> >>     static void mpage_release_unused_pages(..., bool invalidate)
> >>     {
> >>         ...
> >>         if (invalidate) {
> >>             block_invalidate_folio(folio, 0, folio_size(folio));
> >>             folio_clear_uptodate(folio);
> >>         }
> >>     }
> >>
> >> If ext4_page_mkwrite() is subsequently called on such a non-uptodate
> >> folio, it can proceed to mark the folio dirty without checking its
> >> state. This triggers a warning in __folio_mark_dirty():
> >>
> >>     WARNING: CPU: 0 PID: 5 at mm/page-writeback.c:2960
> >>     __folio_mark_dirty+0x578/0x880
> >>
> >>     Call Trace:
> >>      fault_dirty_shared_page+0x16e/0x2d0
> >>      do_wp_page+0x38b/0xd20
> >>      handle_pte_fault+0x1da/0x450
> >>      __handle_mm_fault+0x652/0x13b0
> >>      handle_mm_fault+0x22a/0x6f0
> >>      do_user_addr_fault+0x200/0x8a0
> >>      exc_page_fault+0x81/0x1b0
> >>
> >> This scenario occurs when:
> >> 1. A write with delayed allocation marks a folio dirty (uptodate=1)
> >> 2. Writeback attempts block allocation but detects filesystem corruption
> >> 3. Error handling calls mpage_release_unused_pages(invalidate=true),
> >>    which clears the uptodate flag via folio_clear_uptodate()
> >> 4. A subsequent ftruncate() triggers ext4_truncate()
> >> 5. ext4_block_truncate_page() attempts to zero the page tail
> >> 6. This triggers a write fault on the mmap'd page
> >> 7. ext4_page_mkwrite() is called with the non-uptodate folio
> >> 8. Without checking uptodate, it proceeds to mark the folio dirty
> >> 9. __folio_mark_dirty() triggers: WARN_ON_ONCE(!folio_test_uptodate())
>
> Thank you a lot for analyzing this issue and the fix patch. As I was
> going through the process of understanding this issue, I had one
> question. Is the code flow that triggers the warning as follows?
>
> wp_page_shared()
>   do_page_mkwrite()
>     ext4_page_mkwrite()
>       block_page_mkwrite()   //The default delalloc path
>         block_commit_write()
>           mark_buffer_dirty()
>             __folio_mark_dirty(0)  //'warn' is false, doesn't trigger warning
>         folio_mark_dirty()
>           ext4_dirty_folio()
>             block_dirty_folio  //newly_dirty is false, doesn't call __folio_mark_dirty()
>   fault_dirty_shared_page()
>     folio_mark_dirty()  //Trigger warning ?
>
> This folio has been marked as dirty. How was this warning triggered?
> Am I missing something?
>
> Thanks,
> Yi.
>

Hi Yi,

Thank you for your question about the exact flow that triggers the warning.

You're correct that the code paths within ext4_page_mkwrite() and
block_page_mkwrite() call __folio_mark_dirty() with warn=0, so no
warning occurs there. The warning actually triggers later, in
fault_dirty_shared_page() after page_mkwrite returns.

Here's the complete flow:

  wp_page_shared()
    ↓
    do_page_mkwrite()
      ↓
      ext4_page_mkwrite()
        ↓
        block_page_mkwrite()
          ↓
          mark_buffer_dirty() → __folio_mark_dirty(warn=0)  // No warning
        ↓
        Returns success
    ↓
    fault_dirty_shared_page(vma, folio)  ← Warning triggers here
      ↓
      folio_mark_dirty(folio)
        ↓
        ext4_dirty_folio()
          ↓
          block_dirty_folio()
            ↓
            if (!folio_test_set_dirty(folio))  // Folio not already dirty
              __folio_mark_dirty(folio, mapping, 1)  ← warn=1, triggers WARNING

The key is that the folio can become non-uptodate between when it's
initially read and when wp_page_shared() is called. This happens when:

1. Delayed block allocation fails due to filesystem corruption
2. Error handling in mpage_release_unused_pages() explicitly clears uptodate:

     if (invalidate) {
         block_invalidate_folio(folio, 0, folio_size(folio));
         folio_clear_uptodate(folio);
     }

3. A subsequent operation (like ftruncate) triggers ext4_block_truncate_page()
4. This causes a write fault on the mmap'd page
5. wp_page_shared() is called with the now-non-uptodate folio

From my debug logs with a test kernel:

  [22.387777] EXT4-fs error: lblock 0 mapped to illegal pblock 0
  [22.389798] EXT4-fs: Delayed block allocation failed... error 117
  [22.390401] EXT4-fs: This should not happen!! Data will be lost

  [22.399463] EXT4-fs error: Corrupt filesystem

  [22.400513] WP_PAGE_SHARED: ENTER folio=... uptodate=0 dirty=0
  [22.401953] WP_PAGE_SHARED: page_mkwrite failed, returning 2

With my fix, ext4_page_mkwrite() detects the non-uptodate state and
returns VM_FAULT_SIGBUS before block_page_mkwrite() is called,
preventing wp_page_shared() from reaching fault_dirty_shared_page().

Without the fix, the sequence would be:
- ext4_page_mkwrite() succeeds (doesn't check uptodate)
- block_page_mkwrite() marks buffers dirty (warn=0, no warning)
- Returns to wp_page_shared()
- fault_dirty_shared_page() calls folio_mark_dirty()
- block_dirty_folio() finds folio not dirty (uptodate=0, dirty=0)
- Calls __folio_mark_dirty() with warn=1
- WARNING triggers: WARN_ON_ONCE(warn && !folio_test_uptodate(folio)
&& !folio_test_dirty(folio))

The syzbot call trace confirms this:

  Call Trace:
   fault_dirty_shared_page+0x16e/0x2d0
   do_wp_page+0x38b/0xd20
   handle_pte_fault+0x1da/0x450

Does this clarify the flow?

Best regards,
Deepanshu

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Zhang Yi 2 months ago

On 12/3/2025 9:37 AM, Deepanshu Kartikey wrote:
> On Tue, Dec 2, 2025 at 5:54 PM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>>
>> Hi Deepanshu!
>>
>> On 11/30/2025 10:06 AM, Deepanshu Kartikey wrote:
>>> On Sat, Nov 22, 2025 at 7:27 AM Deepanshu Kartikey
>>> <kartikey406@gmail.com> wrote:
>>>>
>>>> When delayed block allocation fails due to filesystem corruption,
>>>> ext4's writeback error handling invalidates affected folios by calling
>>>> mpage_release_unused_pages() with invalidate=true, which explicitly
>>>> clears the uptodate flag:
>>>>
>>>>     static void mpage_release_unused_pages(..., bool invalidate)
>>>>     {
>>>>         ...
>>>>         if (invalidate) {
>>>>             block_invalidate_folio(folio, 0, folio_size(folio));
>>>>             folio_clear_uptodate(folio);
>>>>         }
>>>>     }
>>>>
>>>> If ext4_page_mkwrite() is subsequently called on such a non-uptodate
>>>> folio, it can proceed to mark the folio dirty without checking its
>>>> state. This triggers a warning in __folio_mark_dirty():
>>>>
>>>>     WARNING: CPU: 0 PID: 5 at mm/page-writeback.c:2960
>>>>     __folio_mark_dirty+0x578/0x880
>>>>
>>>>     Call Trace:
>>>>      fault_dirty_shared_page+0x16e/0x2d0
>>>>      do_wp_page+0x38b/0xd20
>>>>      handle_pte_fault+0x1da/0x450
>>>>      __handle_mm_fault+0x652/0x13b0
>>>>      handle_mm_fault+0x22a/0x6f0
>>>>      do_user_addr_fault+0x200/0x8a0
>>>>      exc_page_fault+0x81/0x1b0
>>>>
>>>> This scenario occurs when:
>>>> 1. A write with delayed allocation marks a folio dirty (uptodate=1)
>>>> 2. Writeback attempts block allocation but detects filesystem corruption
>>>> 3. Error handling calls mpage_release_unused_pages(invalidate=true),
>>>>    which clears the uptodate flag via folio_clear_uptodate()
>>>> 4. A subsequent ftruncate() triggers ext4_truncate()
>>>> 5. ext4_block_truncate_page() attempts to zero the page tail
>>>> 6. This triggers a write fault on the mmap'd page
>>>> 7. ext4_page_mkwrite() is called with the non-uptodate folio
>>>> 8. Without checking uptodate, it proceeds to mark the folio dirty
>>>> 9. __folio_mark_dirty() triggers: WARN_ON_ONCE(!folio_test_uptodate())
>>
>> Thank you a lot for analyzing this issue and the fix patch. As I was
>> going through the process of understanding this issue, I had one
>> question. Is the code flow that triggers the warning as follows?
>>
>> wp_page_shared()
>>   do_page_mkwrite()
>>     ext4_page_mkwrite()
>>       block_page_mkwrite()   //The default delalloc path
>>         block_commit_write()
>>           mark_buffer_dirty()
>>             __folio_mark_dirty(0)  //'warn' is false, doesn't trigger warning
>>         folio_mark_dirty()
>>           ext4_dirty_folio()
>>             block_dirty_folio  //newly_dirty is false, doesn't call __folio_mark_dirty()
>>   fault_dirty_shared_page()
>>     folio_mark_dirty()  //Trigger warning ?
>>
>> This folio has been marked as dirty. How was this warning triggered?
>> Am I missing something?
>>
>> Thanks,
>> Yi.
>>
> 
> Hi Yi,
> 
> Thank you for your question about the exact flow that triggers the warning.
> 

Thank you for the clarification, but there are still some details that
are unclear.

> You're correct that the code paths within ext4_page_mkwrite() and
> block_page_mkwrite() call __folio_mark_dirty() with warn=0, so no
> warning occurs there. The warning actually triggers later, in
> fault_dirty_shared_page() after page_mkwrite returns.
> 
> Here's the complete flow:
> 
>   wp_page_shared()
>     ↓
>     do_page_mkwrite()
>       ↓
>       ext4_page_mkwrite()
>         ↓
>         block_page_mkwrite()
>           ↓
>           mark_buffer_dirty() → __folio_mark_dirty(warn=0)  // No warning

             ↓
             if (!folio_test_set_dirty(folio))
                        //The folio will be mark as dirty --- 1

>         ↓
>         Returns success
>     ↓
>     fault_dirty_shared_page(vma, folio)  ← Warning triggers here
>       ↓
>       folio_mark_dirty(folio)
>         ↓
>         ext4_dirty_folio()
>           ↓
>           block_dirty_folio()
>             ↓
>             if (!folio_test_set_dirty(folio))  // Folio not already dirty

              This makes me confused, this folio should be set to dirty
              at position 1. If it is not dirty here, who cleared the dirty
              flag for this folio?

>               __folio_mark_dirty(folio, mapping, 1)  ← warn=1, triggers WARNING
> 
> The key is that the folio can become non-uptodate between when it's
> initially read and when wp_page_shared() is called. This happens when:
> 
> 1. Delayed block allocation fails due to filesystem corruption
> 2. Error handling in mpage_release_unused_pages() explicitly clears uptodate:
> 
>      if (invalidate) {
>          block_invalidate_folio(folio, 0, folio_size(folio));
>          folio_clear_uptodate(folio);
>      }
> 
> 3. A subsequent operation (like ftruncate) triggers ext4_block_truncate_page()
> 4. This causes a write fault on the mmap'd page
> 5. wp_page_shared() is called with the now-non-uptodate folio
> 
> From my debug logs with a test kernel:
> 
>   [22.387777] EXT4-fs error: lblock 0 mapped to illegal pblock 0
>   [22.389798] EXT4-fs: Delayed block allocation failed... error 117
>   [22.390401] EXT4-fs: This should not happen!! Data will be lost
> 
>   [22.399463] EXT4-fs error: Corrupt filesystem
> 
>   [22.400513] WP_PAGE_SHARED: ENTER folio=... uptodate=0 dirty=0
>   [22.401953] WP_PAGE_SHARED: page_mkwrite failed, returning 2
> 
> With my fix, ext4_page_mkwrite() detects the non-uptodate state and
> returns VM_FAULT_SIGBUS before block_page_mkwrite() is called,
> preventing wp_page_shared() from reaching fault_dirty_shared_page().
> 
> Without the fix, the sequence would be:
> - ext4_page_mkwrite() succeeds (doesn't check uptodate)
> - block_page_mkwrite() marks buffers dirty (warn=0, no warning)
> - Returns to wp_page_shared()
> - fault_dirty_shared_page() calls folio_mark_dirty()
> - block_dirty_folio() finds folio not dirty (uptodate=0, dirty=0)
> - Calls __folio_mark_dirty() with warn=1
> - WARNING triggers: WARN_ON_ONCE(warn && !folio_test_uptodate(folio)
> && !folio_test_dirty(folio))
> 
> The syzbot call trace confirms this:
> 
>   Call Trace:
>    fault_dirty_shared_page+0x16e/0x2d0
>    do_wp_page+0x38b/0xd20
>    handle_pte_fault+0x1da/0x450
> 
> Does this clarify the flow?
> 
> Best regards,
> Deepanshu
>

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Deepanshu Kartikey 2 months ago

On Wed, Dec 3, 2025 at 12:22 PM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>
> On 12/3/2025 9:37 AM, Deepanshu Kartikey wrote:
> > On Tue, Dec 2, 2025 at 5:54 PM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
> >>
> >> Hi Deepanshu!
> >>
> >> On 11/30/2025 10:06 AM, Deepanshu Kartikey wrote:
> >>> On Sat, Nov 22, 2025 at 7:27 AM Deepanshu Kartikey
> >>> <kartikey406@gmail.com> wrote:
> >>>>
> >>>> When delayed block allocation fails due to filesystem corruption,
> >>>> ext4's writeback error handling invalidates affected folios by calling
> >>>> mpage_release_unused_pages() with invalidate=true, which explicitly
> >>>> clears the uptodate flag:
> >>>>
> >>>>     static void mpage_release_unused_pages(..., bool invalidate)
> >>>>     {
> >>>>         ...
> >>>>         if (invalidate) {
> >>>>             block_invalidate_folio(folio, 0, folio_size(folio));
> >>>>             folio_clear_uptodate(folio);
> >>>>         }
> >>>>     }
> >>>>
> >>>> If ext4_page_mkwrite() is subsequently called on such a non-uptodate
> >>>> folio, it can proceed to mark the folio dirty without checking its
> >>>> state. This triggers a warning in __folio_mark_dirty():
> >>>>
> >>>>     WARNING: CPU: 0 PID: 5 at mm/page-writeback.c:2960
> >>>>     __folio_mark_dirty+0x578/0x880
> >>>>
> >>>>     Call Trace:
> >>>>      fault_dirty_shared_page+0x16e/0x2d0
> >>>>      do_wp_page+0x38b/0xd20
> >>>>      handle_pte_fault+0x1da/0x450
> >>>>      __handle_mm_fault+0x652/0x13b0
> >>>>      handle_mm_fault+0x22a/0x6f0
> >>>>      do_user_addr_fault+0x200/0x8a0
> >>>>      exc_page_fault+0x81/0x1b0
> >>>>
> >>>> This scenario occurs when:
> >>>> 1. A write with delayed allocation marks a folio dirty (uptodate=1)
> >>>> 2. Writeback attempts block allocation but detects filesystem corruption
> >>>> 3. Error handling calls mpage_release_unused_pages(invalidate=true),
> >>>>    which clears the uptodate flag via folio_clear_uptodate()
> >>>> 4. A subsequent ftruncate() triggers ext4_truncate()
> >>>> 5. ext4_block_truncate_page() attempts to zero the page tail
> >>>> 6. This triggers a write fault on the mmap'd page
> >>>> 7. ext4_page_mkwrite() is called with the non-uptodate folio
> >>>> 8. Without checking uptodate, it proceeds to mark the folio dirty
> >>>> 9. __folio_mark_dirty() triggers: WARN_ON_ONCE(!folio_test_uptodate())
> >>
> >> Thank you a lot for analyzing this issue and the fix patch. As I was
> >> going through the process of understanding this issue, I had one
> >> question. Is the code flow that triggers the warning as follows?
> >>
> >> wp_page_shared()
> >>   do_page_mkwrite()
> >>     ext4_page_mkwrite()
> >>       block_page_mkwrite()   //The default delalloc path
> >>         block_commit_write()
> >>           mark_buffer_dirty()
> >>             __folio_mark_dirty(0)  //'warn' is false, doesn't trigger warning
> >>         folio_mark_dirty()
> >>           ext4_dirty_folio()
> >>             block_dirty_folio  //newly_dirty is false, doesn't call __folio_mark_dirty()
> >>   fault_dirty_shared_page()
> >>     folio_mark_dirty()  //Trigger warning ?
> >>
> >> This folio has been marked as dirty. How was this warning triggered?
> >> Am I missing something?
> >>
> >> Thanks,
> >> Yi.
> >>
> >
> > Hi Yi,
> >
> > Thank you for your question about the exact flow that triggers the warning.
> >
>
> Thank you for the clarification, but there are still some details that
> are unclear.
>
> > You're correct that the code paths within ext4_page_mkwrite() and
> > block_page_mkwrite() call __folio_mark_dirty() with warn=0, so no
> > warning occurs there. The warning actually triggers later, in
> > fault_dirty_shared_page() after page_mkwrite returns.
> >
> > Here's the complete flow:
> >
> >   wp_page_shared()
> >     ↓
> >     do_page_mkwrite()
> >       ↓
> >       ext4_page_mkwrite()
> >         ↓
> >         block_page_mkwrite()
> >           ↓
> >           mark_buffer_dirty() → __folio_mark_dirty(warn=0)  // No warning
>
>              ↓
>              if (!folio_test_set_dirty(folio))
>                         //The folio will be mark as dirty --- 1
>
> >         ↓
> >         Returns success
> >     ↓
> >     fault_dirty_shared_page(vma, folio)  ← Warning triggers here
> >       ↓
> >       folio_mark_dirty(folio)
> >         ↓
> >         ext4_dirty_folio()
> >           ↓
> >           block_dirty_folio()
> >             ↓
> >             if (!folio_test_set_dirty(folio))  // Folio not already dirty
>
>               This makes me confused, this folio should be set to dirty
>               at position 1. If it is not dirty here, who cleared the dirty
>               flag for this folio?
>

Hi Yi,

Excellent question! You're absolutely right that mark_buffer_dirty()
marks the folio dirty at position 1. The key is that the folio dirty
flag is cleared later by the error handling code.

When delayed allocation fails and mpage_release_unused_pages() is
called with invalidate=true, it calls:

  block_invalidate_folio(folio, 0, folio_size(folio));

This function not only invalidates the folio but also clears the dirty
flag. Looking at the code path:

  block_invalidate_folio()
    → do_invalidate_folio()
      → Clears the dirty flag
      → Detaches buffer heads

So the sequence is:

1. First wp_page_shared(): folio becomes dirty=1, uptodate=1 (via
mark_buffer_dirty)
2. Writeback fails due to corruption
3. mpage_release_unused_pages(invalidate=true):
   - block_invalidate_folio() clears dirty flag
   - folio_clear_uptodate() clears uptodate flag
   - Result: folio is now dirty=0, uptodate=0
4. Second wp_page_shared(): called with folio dirty=0, uptodate=0

This is confirmed by my debug logs:

  [22.400513] WP_PAGE_SHARED: ENTER folio=... uptodate=0 dirty=0

The folio is BOTH non-uptodate AND non-dirty when the second
wp_page_shared() is called.

Without my fix:
- ext4_page_mkwrite() succeeds (doesn't check uptodate)
- block_page_mkwrite() tries to mark the folio dirty again
- fault_dirty_shared_page() is reached
- block_dirty_folio() finds folio not dirty (dirty=0)
- Calls __folio_mark_dirty(warn=1) with uptodate=0
- WARNING triggers

With my fix:
- ext4_page_mkwrite() checks uptodate and returns VM_FAULT_SIGBUS
- Never reaches fault_dirty_shared_page()
- No warning

Does this answer your question?

Best regards,
Deepanshu

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Theodore Tso 2 months ago

My main concern with your patch is folio_lock() is *incredibly*
heavyweight and is going to be a real scalability concern if we need
to take it every single time we need to make a page writeable.

So could we perhaps do something like this?  So the first question is
do we need to take the lock at all?  I'm not sure we need to worry
about the case where the page is not uptodate because we're racing
with the page being brought into memory; if we that could happen under
normal circumstances we would be triggering the warning even without
these situations such as a delayed allocaiton write failing due to a
corrupted file system image.   So can we just do this?

	if (!folio_test_uptodate(folio)) {
		ret = VM_FAULT_SIGBUS;
		goto out;
	}

If it is legitmate that ext4_page_mkwrite() could be called while the
page is still being read in (and again, I don't think it is), then we
could do something like this:

	if (!folio_test_uptodate(folio)) {
		folio_lock(folio);
		if (!folio_test_uptodate(folio)) {
			folio_unlock(folio);
			ret = VM_FAULT_SIGBUS;
			goto out;
		}
		folio_unlock(folio);
	}

Matthew, as the page cache maintainer, do we actually need this extra
rigamarole.  Or can we just skip taking the lock before checking to
see if the folio is uptodate in ext4_page_mkwrite()?

							- Ted

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Matthew Wilcox 2 months ago

On Wed, Dec 03, 2025 at 10:46:57AM -0500, Theodore Tso wrote:
> Matthew, as the page cache maintainer, do we actually need this extra
> rigamarole.  Or can we just skip taking the lock before checking to
> see if the folio is uptodate in ext4_page_mkwrite()?

Oh, this triggered a memory.  Looks like you missed my reply in that
thread.  This patch is bogus.  You need to prevent !uptodate folios from
being referenced by the page tables.

https://groups.google.com/g/syzkaller-bugs/c/kjaCOwdrWVg/m/6Li1dROPBQAJ

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Matthew Wilcox 2 months ago

You snipped out all the context when adding me to the cc, and I'm on
holiday until after Plumbers, so I'm disinclined to go looking for
context.

On Wed, Dec 03, 2025 at 10:46:57AM -0500, Theodore Tso wrote:
> My main concern with your patch is folio_lock() is *incredibly*
> heavyweight and is going to be a real scalability concern if we need
> to take it every single time we need to make a page writeable.
> 
> So could we perhaps do something like this?  So the first question is
> do we need to take the lock at all?  I'm not sure we need to worry
> about the case where the page is not uptodate because we're racing
> with the page being brought into memory; if we that could happen under
> normal circumstances we would be triggering the warning even without
> these situations such as a delayed allocaiton write failing due to a
> corrupted file system image.   So can we just do this?
> 
> 	if (!folio_test_uptodate(folio)) {
> 		ret = VM_FAULT_SIGBUS;
> 		goto out;
> 	}
> 
> If it is legitmate that ext4_page_mkwrite() could be called while the
> page is still being read in (and again, I don't think it is), then we
> could do something like this:
> 
> 	if (!folio_test_uptodate(folio)) {
> 		folio_lock(folio);
> 		if (!folio_test_uptodate(folio)) {
> 			folio_unlock(folio);
> 			ret = VM_FAULT_SIGBUS;
> 			goto out;
> 		}
> 		folio_unlock(folio);
> 	}
> 
> Matthew, as the page cache maintainer, do we actually need this extra
> rigamarole.  Or can we just skip taking the lock before checking to
> see if the folio is uptodate in ext4_page_mkwrite()?
> 
> 							- Ted

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Theodore Tso 2 months ago

On Wed, Dec 03, 2025 at 09:35:29PM +0000, Matthew Wilcox wrote:
> You snipped out all the context when adding me to the cc, and I'm on
> holiday until after Plumbers, so I'm disinclined to go looking for
> context.

Sorry about that.  A quick summary.  Deepanshu was attempting to
address a Syzbot complaint[1].

[1] https://syzkaller.appspot.com/bug?extid=b0a0670332b6b3230a0a

The TL;DR summary is that the syzbot complaint involved a maliciously
corrupted file system which resulted file system getting detected when
delayed allocation writeback attempts to do a block allocation.  Error
handling calls mpage_release_unused_pages(invalidate=true), which
clears the uptodate flag via folio_clear_uptodate().

Because syzbot mounts the file system using errors=continue (which is
the worst case; we're not panic'ing the kernel or forcing the file
system to be read-only), we now have a situation where we have a folio
which can be mapped, but !uptodate, but the file system can still be
subject to changes.

In the syzkaller reproducer, the potential malware might call
ftruncate on the file, and this results ext4_truncate() calling
ext4_block_truncate_page() which thentries to zero the page tail,
which triggers a write fault, resulting in ext4_page_mkwrite() on a
page/folio which is not uptodate.  It then tries to mark the folio
dirty, mapped but !uptdate, and then __folio_mark_dirty() triggers:
WARN_ON_ONCE(!folio_test_uptodate()).

Since in Syzkaller assumes users are stupid enough to panic on warn,
this is an urgent security issue because it's a denial of service
attack which is CVE worthy --- where the system admiinstrator is
stupid enough to allow an untrusted user to mount an untrusted,
maliciously crafted file system, instead of using fuse2fs.  The
security people thinkt his is super-duper important.  Personally, I
don't think it's all that urgent, so by all means, don't feel obliged
to think about this while on vacation.  :-)

Anyway, that's the context.  Deepanshu has a proposed fix here[2]
which puts a folio_lock() into every single write page fault for ext4:

+	folio_lock(folio);
+	if (!folio_test_uptodate(folio)) {
+		folio_unlock(folio);
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
+	folio_unlock(folio);

[2] https://lore.kernel.org/r/20251122015742.362444-1-kartikey406@gmail.com

This seems.... unfortunate to me, so the first question is, "is
locking the folio really necessary"?  (I suspect the answer is no),
and two, should this check be done in the mm layer calling
page_mkwrite(), or in ext4_page_mkwrite()?

Presumably, this might happen for other file systems, with either
syzkaller coming up with this rather implausible scenario of really
stupid, unfortunately system adminsitrator choices --- or in real
life, if we do have some system adminisrtators making really stupid,
unfortunate life choices.  So maybe we should this check should be
done above the file system layer?

						- Ted

> 
> On Wed, Dec 03, 2025 at 10:46:57AM -0500, Theodore Tso wrote:
> > My main concern with your patch is folio_lock() is *incredibly*
> > heavyweight and is going to be a real scalability concern if we need
> > to take it every single time we need to make a page writeable.
> > 
> > So could we perhaps do something like this?  So the first question is
> > do we need to take the lock at all?  I'm not sure we need to worry
> > about the case where the page is not uptodate because we're racing
> > with the page being brought into memory; if we that could happen under
> > normal circumstances we would be triggering the warning even without
> > these situations such as a delayed allocaiton write failing due to a
> > corrupted file system image.   So can we just do this?
> > 
> > 	if (!folio_test_uptodate(folio)) {
> > 		ret = VM_FAULT_SIGBUS;
> > 		goto out;
> > 	}
> > 
> > If it is legitmate that ext4_page_mkwrite() could be called while the
> > page is still being read in (and again, I don't think it is), then we
> > could do something like this:
> > 
> > 	if (!folio_test_uptodate(folio)) {
> > 		folio_lock(folio);
> > 		if (!folio_test_uptodate(folio)) {
> > 			folio_unlock(folio);
> > 			ret = VM_FAULT_SIGBUS;
> > 			goto out;
> > 		}
> > 		folio_unlock(folio);
> > 	}
> > 
> > Matthew, as the page cache maintainer, do we actually need this extra
> > rigamarole.  Or can we just skip taking the lock before checking to
> > see if the folio is uptodate in ext4_page_mkwrite()?
> > 
> > 							- Ted

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Deepanshu Kartikey 2 months ago

On Thu, Dec 4, 2025 at 4:04 AM Theodore Tso <tytso@mit.edu> wrote:
>
> On Wed, Dec 03, 2025 at 09:35:29PM +0000, Matthew Wilcox wrote:
> > You snipped out all the context when adding me to the cc, and I'm on
> > holiday until after Plumbers, so I'm disinclined to go looking for
> > context.
>
> Sorry about that.  A quick summary.  Deepanshu was attempting to
> address a Syzbot complaint[1].
>
> [1] https://syzkaller.appspot.com/bug?extid=b0a0670332b6b3230a0a
>
> The TL;DR summary is that the syzbot complaint involved a maliciously
> corrupted file system which resulted file system getting detected when
> delayed allocation writeback attempts to do a block allocation.  Error
> handling calls mpage_release_unused_pages(invalidate=true), which
> clears the uptodate flag via folio_clear_uptodate().
>
> Because syzbot mounts the file system using errors=continue (which is
> the worst case; we're not panic'ing the kernel or forcing the file
> system to be read-only), we now have a situation where we have a folio
> which can be mapped, but !uptodate, but the file system can still be
> subject to changes.
>
> In the syzkaller reproducer, the potential malware might call
> ftruncate on the file, and this results ext4_truncate() calling
> ext4_block_truncate_page() which thentries to zero the page tail,
> which triggers a write fault, resulting in ext4_page_mkwrite() on a
> page/folio which is not uptodate.  It then tries to mark the folio
> dirty, mapped but !uptdate, and then __folio_mark_dirty() triggers:
> WARN_ON_ONCE(!folio_test_uptodate()).
>
> Since in Syzkaller assumes users are stupid enough to panic on warn,
> this is an urgent security issue because it's a denial of service
> attack which is CVE worthy --- where the system admiinstrator is
> stupid enough to allow an untrusted user to mount an untrusted,
> maliciously crafted file system, instead of using fuse2fs.  The
> security people thinkt his is super-duper important.  Personally, I
> don't think it's all that urgent, so by all means, don't feel obliged
> to think about this while on vacation.  :-)
>
> Anyway, that's the context.  Deepanshu has a proposed fix here[2]
> which puts a folio_lock() into every single write page fault for ext4:
>
> +       folio_lock(folio);
> +       if (!folio_test_uptodate(folio)) {
> +               folio_unlock(folio);
> +               ret = VM_FAULT_SIGBUS;
> +               goto out;
> +       }
> +       folio_unlock(folio);
>
> [2] https://lore.kernel.org/r/20251122015742.362444-1-kartikey406@gmail.com
>
> This seems.... unfortunate to me, so the first question is, "is
> locking the folio really necessary"?  (I suspect the answer is no),
> and two, should this check be done in the mm layer calling
> page_mkwrite(), or in ext4_page_mkwrite()?
>
> Presumably, this might happen for other file systems, with either
> syzkaller coming up with this rather implausible scenario of really
> stupid, unfortunately system adminsitrator choices --- or in real
> life, if we do have some system adminisrtators making really stupid,
> unfortunate life choices.  So maybe we should this check should be
> done above the file system layer?
>
>                                                 - Ted
>
>
>
>
>
>
>
> >
> > On Wed, Dec 03, 2025 at 10:46:57AM -0500, Theodore Tso wrote:
> > > My main concern with your patch is folio_lock() is *incredibly*
> > > heavyweight and is going to be a real scalability concern if we need
> > > to take it every single time we need to make a page writeable.
> > >
> > > So could we perhaps do something like this?  So the first question is
> > > do we need to take the lock at all?  I'm not sure we need to worry
> > > about the case where the page is not uptodate because we're racing
> > > with the page being brought into memory; if we that could happen under
> > > normal circumstances we would be triggering the warning even without
> > > these situations such as a delayed allocaiton write failing due to a
> > > corrupted file system image.   So can we just do this?
> > >
> > >     if (!folio_test_uptodate(folio)) {
> > >             ret = VM_FAULT_SIGBUS;
> > >             goto out;
> > >     }
> > >
> > > If it is legitmate that ext4_page_mkwrite() could be called while the
> > > page is still being read in (and again, I don't think it is), then we
> > > could do something like this:
> > >
> > >     if (!folio_test_uptodate(folio)) {
> > >             folio_lock(folio);
> > >             if (!folio_test_uptodate(folio)) {
> > >                     folio_unlock(folio);
> > >                     ret = VM_FAULT_SIGBUS;
> > >                     goto out;
> > >             }
> > >             folio_unlock(folio);
> > >     }
> > >
> > > Matthew, as the page cache maintainer, do we actually need this extra
> > > rigamarole.  Or can we just skip taking the lock before checking to
> > > see if the folio is uptodate in ext4_page_mkwrite()?
> > >
> > >                                                     - Ted


Hi Ted,

Thank you for the detailed summary and context.

Based on Matthew's earlier feedback that we need to "prevent !uptodate
folios from being referenced by the page tables," I believe the
correct fix is not in ext4_page_mkwrite() at all, but rather in
mpage_release_unused_pages().

When we invalidate folios due to writeback failure, we should also
unmap them from page tables:

  static void mpage_release_unused_pages(..., bool invalidate)
  {
      // ... existing code ...

      if (invalidate) {
          if (folio_mapped(folio))
              folio_clear_dirty_for_io(folio);

          // Unmap from page tables before invalidating
          if (folio_mapped(folio))
              unmap_mapping_folio(folio);

          block_invalidate_folio(folio, 0, folio_size(folio));
          folio_clear_uptodate(folio);
      }
  }

This way:
1. No performance impact on normal operations (only in error path)
2. No folio_lock() needed in ext4_page_mkwrite()
3. Prevents stale PTEs from referencing invalidated folios
4. Any subsequent access triggers a new page fault instead

To address your question about whether this should be in MM layer:
other filesystems could hit this same issue if they use delayed
allocation and invalidate folios on writeback failure. However,
since the invalidation is happening in ext4-specific code
(mpage_release_unused_pages), fixing it there seems appropriate.
If this pattern appears in other filesystems, perhaps the fix
could be moved to a common helper.

I'll send v3 with this approach. Does this look correct?

Best regards,
Deepanshu

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Theodore Tso 2 months ago

On Thu, Dec 04, 2025 at 03:24:50PM +0530, Deepanshu Kartikey wrote:
> Based on Matthew's earlier feedback that we need to "prevent !uptodate
> folios from being referenced by the page tables," I believe the
> correct fix is not in ext4_page_mkwrite() at all, but rather in
> mpage_release_unused_pages().
> 
> When we invalidate folios due to writeback failure, we should also
> unmap them from page tables....

Hmm.... if the page is mmap'ed into the user process, on a writeback
failure, the page contents will suddenly and without any warning,
*disappear*.

So the other option is we could simply *not* invalidate the folio, but
instead leave the folio dirty.  In some cases, where a particular
block group is corrupted, if we retry the block allocation, the
corrupted block group will be busied out, and so when the write back
is retried, it's possible that the data will be actually be persisted.

We do need to make sure the right thing we unmount the filesystem,
since at that point, we have no choice but the invalidate the page and
the data will get lost when the file system is unmounted.  So it's a
more complicated approach.  But if this is happening when the file
system is corrupted, especially if it was maliciously corrupted, all
bets are off anyway, so maybe it's not worth the complexity.

     	     	     	      	       - Ted

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Matthew Wilcox 2 months ago

On Thu, Dec 04, 2025 at 09:18:18PM -0500, Theodore Tso wrote:
> On Thu, Dec 04, 2025 at 03:24:50PM +0530, Deepanshu Kartikey wrote:
> > Based on Matthew's earlier feedback that we need to "prevent !uptodate
> > folios from being referenced by the page tables," I believe the
> > correct fix is not in ext4_page_mkwrite() at all, but rather in
> > mpage_release_unused_pages().
> > 
> > When we invalidate folios due to writeback failure, we should also
> > unmap them from page tables....
> 
> Hmm.... if the page is mmap'ed into the user process, on a writeback
> failure, the page contents will suddenly and without any warning,
> *disappear*.

It sounds like I was confused -- I thought the folios being invalidated 
in mpage_release_unused_pages() belonged to the block device, but from 
what you're saying, they belong to a user-visible file?

Once we hit a writeback error (whether we're in a "device gave EIO" or
"filesystem is corrupted" situation), we're firmly outside what POSIX 
speaks to, and so all that matters is quality of implementation.

Now, is the folio necessarily dirty at this point?  I guess so if we're
in the writeback path.  Darrick got rid of similar code in iomap a few 
years ago; see commit e9c3a8e820ed.  So it'd probably be good to have
ext4 behave the same way.

> So the other option is we could simply *not* invalidate the folio, but
> instead leave the folio dirty.  In some cases, where a particular
> block group is corrupted, if we retry the block allocation, the
> corrupted block group will be busied out, and so when the write back
> is retried, it's possible that the data will be actually be persisted.
> 
> We do need to make sure the right thing we unmount the filesystem,
> since at that point, we have no choice but the invalidate the page and
> the data will get lost when the file system is unmounted.  So it's a
> more complicated approach.  But if this is happening when the file
> system is corrupted, especially if it was maliciously corrupted, all
> bets are off anyway, so maybe it's not worth the complexity.

I'm generally in favour of doing anything we can to write dirty user
data back to storage ;-)  Of course if the storage is throwing a wobbly,
that's beyond our abilities.

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Theodore Tso 2 months ago

On Fri, Dec 05, 2025 at 03:33:22AM +0000, Matthew Wilcox wrote:
> It sounds like I was confused -- I thought the folios being
> invalidated in mpage_release_unused_pages() belonged to the block
> device, but from what you're saying, they belong to a user-visible
> file?

Yes, correct.  I'm guessing that we were marking the page !uptodate
back when that was the only way to indicate that there had been any
kind of I/O error (either on the read or write side).  Obviously we
have much better ways of doing it in the 21st century.  :-)

> Now, is the folio necessarily dirty at this point?  I guess so if
> we're in the writeback path.  Darrick got rid of similar code in
> iomap a few years ago; see commit e9c3a8e820ed.  So it'd probably be
> good to have ext4 behave the same way.

Hmm, yes.   Agreed.

    commit e9c3a8e820ed0eeb2be05072f29f80d1b79f053b
    Author: Darrick J. Wong <djwong@kernel.org>
    Date:   Mon May 16 15:27:38 2022 -0700

    iomap: don't invalidate folios after writeback errors
    
    XFS has the unique behavior (as compared to the other Linux
    filesystems) that on writeback errors it will completely
    invalidate the affected folio and force the page cache to reread
    the contents from disk.  All other filesystems leave the page
    mapped and up to date.
    
    This is a rude awakening for user programs, since (in the case
    where write fails but reread doesn't) file contents will appear to
    revert old disk contents with no notification other than an EIO on
    fsync.  This might have been annoying back in the days when iomap
    dealt with one page at a time, but with multipage folios, we can
    now throw away *megabytes* worth of data for a single write error...

As Darrick pointed out we could potentially append a *single* byte to
a file, and if there was some kind of writeback error, we could
potentially throw away *vast* amounts of data for no good reason.

     	  	     	       	     - Ted

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Deepanshu Kartikey 2 months ago

On Fri, Dec 5, 2025 at 7:08 PM Theodore Tso <tytso@mit.edu> wrote:
>
> On Fri, Dec 05, 2025 at 03:33:22AM +0000, Matthew Wilcox wrote:
> > It sounds like I was confused -- I thought the folios being
> > invalidated in mpage_release_unused_pages() belonged to the block
> > device, but from what you're saying, they belong to a user-visible
> > file?
>
> Yes, correct.  I'm guessing that we were marking the page !uptodate
> back when that was the only way to indicate that there had been any
> kind of I/O error (either on the read or write side).  Obviously we
> have much better ways of doing it in the 21st century.  :-)
>
> > Now, is the folio necessarily dirty at this point?  I guess so if
> > we're in the writeback path.  Darrick got rid of similar code in
> > iomap a few years ago; see commit e9c3a8e820ed.  So it'd probably be
> > good to have ext4 behave the same way.
>
> Hmm, yes.   Agreed.
>
>     commit e9c3a8e820ed0eeb2be05072f29f80d1b79f053b
>     Author: Darrick J. Wong <djwong@kernel.org>
>     Date:   Mon May 16 15:27:38 2022 -0700
>
>     iomap: don't invalidate folios after writeback errors
>
>     XFS has the unique behavior (as compared to the other Linux
>     filesystems) that on writeback errors it will completely
>     invalidate the affected folio and force the page cache to reread
>     the contents from disk.  All other filesystems leave the page
>     mapped and up to date.
>
>     This is a rude awakening for user programs, since (in the case
>     where write fails but reread doesn't) file contents will appear to
>     revert old disk contents with no notification other than an EIO on
>     fsync.  This might have been annoying back in the days when iomap
>     dealt with one page at a time, but with multipage folios, we can
>     now throw away *megabytes* worth of data for a single write error...
>
> As Darrick pointed out we could potentially append a *single* byte to
> a file, and if there was some kind of writeback error, we could
> potentially throw away *vast* amounts of data for no good reason.
>
>                                      - Ted


Hi Ted and Matthew,

Thank you for pointing out the iomap commit. I now understand that
invalidating folios on writeback errors is the wrong approach.

Looking at Darrick's commit e9c3a8e820ed, iomap removed both
folio_clear_uptodate() and the invalidation call, keeping folios in
memory with their data intact even after writeback errors.

For ext4, should I apply the same approach to mpage_release_unused_pages()?
Specifically, remove the invalidation entirely:

  if (invalidate) {
      /*
       * On writeback errors, do not invalidate the folio or
       * clear the uptodate flag. This follows the behavior
       * established by iomap (commit e9c3a8e820ed "iomap:
       * don't invalidate folios after writeback errors").
       */
      if (folio_mapped(folio))
          folio_clear_dirty_for_io(folio);
-     block_invalidate_folio(folio, 0, folio_size(folio));
-     folio_clear_uptodate(folio);
  }

This would:
- Keep user data in memory instead of discarding it
- Prevent the WARNING since folio remains uptodate
- Match the behavior of modern filesystems
- Prevent data loss from discarding potentially megabytes of data

Is this the correct approach? If so, I'll send v4 with this fix.

Best regards,
Deepanshu

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Deepanshu Kartikey 1 month ago

On Fri, Dec 5, 2025 at 7:58 PM Deepanshu Kartikey <kartikey406@gmail.com> wrote:
>
> On Fri, Dec 5, 2025 at 7:08 PM Theodore Tso <tytso@mit.edu> wrote:
> >
> > On Fri, Dec 05, 2025 at 03:33:22AM +0000, Matthew Wilcox wrote:
> > > It sounds like I was confused -- I thought the folios being
> > > invalidated in mpage_release_unused_pages() belonged to the block
> > > device, but from what you're saying, they belong to a user-visible
> > > file?
> >
> > Yes, correct.  I'm guessing that we were marking the page !uptodate
> > back when that was the only way to indicate that there had been any
> > kind of I/O error (either on the read or write side).  Obviously we
> > have much better ways of doing it in the 21st century.  :-)
> >
> > > Now, is the folio necessarily dirty at this point?  I guess so if
> > > we're in the writeback path.  Darrick got rid of similar code in
> > > iomap a few years ago; see commit e9c3a8e820ed.  So it'd probably be
> > > good to have ext4 behave the same way.
> >
> > Hmm, yes.   Agreed.
> >
> >     commit e9c3a8e820ed0eeb2be05072f29f80d1b79f053b
> >     Author: Darrick J. Wong <djwong@kernel.org>
> >     Date:   Mon May 16 15:27:38 2022 -0700
> >
> >     iomap: don't invalidate folios after writeback errors
> >
> >     XFS has the unique behavior (as compared to the other Linux
> >     filesystems) that on writeback errors it will completely
> >     invalidate the affected folio and force the page cache to reread
> >     the contents from disk.  All other filesystems leave the page
> >     mapped and up to date.
> >
> >     This is a rude awakening for user programs, since (in the case
> >     where write fails but reread doesn't) file contents will appear to
> >     revert old disk contents with no notification other than an EIO on
> >     fsync.  This might have been annoying back in the days when iomap
> >     dealt with one page at a time, but with multipage folios, we can
> >     now throw away *megabytes* worth of data for a single write error...
> >
> > As Darrick pointed out we could potentially append a *single* byte to
> > a file, and if there was some kind of writeback error, we could
> > potentially throw away *vast* amounts of data for no good reason.
> >
> >                                      - Ted
>
>
> Hi Ted and Matthew,
>
> Thank you for pointing out the iomap commit. I now understand that
> invalidating folios on writeback errors is the wrong approach.
>
> Looking at Darrick's commit e9c3a8e820ed, iomap removed both
> folio_clear_uptodate() and the invalidation call, keeping folios in
> memory with their data intact even after writeback errors.
>
> For ext4, should I apply the same approach to mpage_release_unused_pages()?
> Specifically, remove the invalidation entirely:
>
>   if (invalidate) {
>       /*
>        * On writeback errors, do not invalidate the folio or
>        * clear the uptodate flag. This follows the behavior
>        * established by iomap (commit e9c3a8e820ed "iomap:
>        * don't invalidate folios after writeback errors").
>        */
>       if (folio_mapped(folio))
>           folio_clear_dirty_for_io(folio);
> -     block_invalidate_folio(folio, 0, folio_size(folio));
> -     folio_clear_uptodate(folio);
>   }
>
> This would:
> - Keep user data in memory instead of discarding it
> - Prevent the WARNING since folio remains uptodate
> - Match the behavior of modern filesystems
> - Prevent data loss from discarding potentially megabytes of data
>
> Is this the correct approach? If so, I'll send v4 with this fix.
>
> Best regards,
> Deepanshu

Hi Ted and Matthew,

I wanted to follow up on my previous email about the fix approach.

Just to confirm: should I remove the folio invalidation from
mpage_release_unused_pages() entirely, following Darrick's commit
e9c3a8e820ed for iomap?

The change would be:

  if (invalidate) {
      if (folio_mapped(folio))
          folio_clear_dirty_for_io(folio);
-     block_invalidate_folio(folio, 0, folio_size(folio));
-     folio_clear_uptodate(folio);
  }

This keeps the folio in memory with data intact, prevents the WARNING,
and matches modern filesystem behavior.

If this approach is correct, I'll test and send v4. Otherwise, please
let me know what adjustments are needed.

Thank you,
Deepanshu

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Deepanshu Kartikey 2 months ago

On Fri, Dec 5, 2025 at 7:49 AM Theodore Tso <tytso@mit.edu> wrote:
>
> Hmm.... if the page is mmap'ed into the user process, on a writeback
> failure, the page contents will suddenly and without any warning,
> *disappear*.
>
> So the other option is we could simply *not* invalidate the folio, but
> instead leave the folio dirty.  In some cases, where a particular
> block group is corrupted, if we retry the block allocation, the
> corrupted block group will be busied out, and so when the write back
> is retried, it's possible that the data will be actually be persisted.
>
> We do need to make sure the right thing we unmount the filesystem,
> since at that point, we have no choice but the invalidate the page and
> the data will get lost when the file system is unmounted.  So it's a
> more complicated approach.  But if this is happening when the file
> system is corrupted, especially if it was maliciously corrupted, all
> bets are off anyway, so maybe it's not worth the complexity.
>
>                                        - Ted

Hi Ted,

I understand your concern about data loss. You're right that unmapping
causes user data to disappear from memory.

However, as you noted, when the filesystem is corrupted:
1. The error message already says "Data will be lost"
2. All bets are off anyway
3. The simpler fix prevents the WARNING/crash

The more complex approach (keep dirty + retry) would be nice, but given
that corruption is already detected, I agree it's not worth the complexity.

I'll proceed with v3 using the unmap approach.

Best regards,
Deepanshu

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Theodore Tso 2 months ago

On Wed, Dec 03, 2025 at 10:46:57AM -0500, Theodore Tso wrote:
> 
> 	if (!folio_test_uptodate(folio)) {
> 		ret = VM_FAULT_SIGBUS;
> 		goto out;
> 	}

And actually, thinking about this some more, is this check something
that we should be doing in ext4?  Or in the mm layer?

Matthew, what do you think?

     	       	  	   	     	- Ted

Re: [PATCH v2] ext4: check folio uptodate state in ext4_page_mkwrite()

Posted by Deepanshu Kartikey 2 months ago

On Wed, Dec 3, 2025 at 9:18 PM Theodore Tso <tytso@mit.edu> wrote:
>
> My main concern with your patch is folio_lock() is *incredibly*
> heavyweight and is going to be a real scalability concern if we need
> to take it every single time we need to make a page writeable.
>
> So could we perhaps do something like this?  So the first question is
> do we need to take the lock at all?  I'm not sure we need to worry
> about the case where the page is not uptodate because we're racing
> with the page being brought into memory; if we that could happen under
> normal circumstances we would be triggering the warning even without
> these situations such as a delayed allocaiton write failing due to a
> corrupted file system image.   So can we just do this?
>
>         if (!folio_test_uptodate(folio)) {
>                 ret = VM_FAULT_SIGBUS;
>                 goto out;
>         }
>
> If it is legitmate that ext4_page_mkwrite() could be called while the
> page is still being read in (and again, I don't think it is), then we
> could do something like this:
>
>         if (!folio_test_uptodate(folio)) {
>                 folio_lock(folio);
>                 if (!folio_test_uptodate(folio)) {
>                         folio_unlock(folio);
>                         ret = VM_FAULT_SIGBUS;
>                         goto out;
>                 }
>                 folio_unlock(folio);
>         }
>
> Matthew, as the page cache maintainer, do we actually need this extra
> rigamarole.  Or can we just skip taking the lock before checking to
> see if the folio is uptodate in ext4_page_mkwrite()?
>
>                                                         - Ted

Hi Ted,

Thank you for the feedback and the performance concern!

You're absolutely right that folio_lock() is heavyweight. I included
it because I was being overly cautious about potential races, but I
agree with your analysis that under normal circumstances,
ext4_page_mkwrite() should never be called with a non-uptodate folio.

The non-uptodate state only occurs in this specific error case where:
1. Delayed allocation fails due to corruption
2. mpage_release_unused_pages() invalidates the folio
3. A subsequent operation triggers the fault

In this error path, the folio is already in an inconsistent state, so
checking folio_test_uptodate() without the lock should be sufficient
to catch it.

I'll wait for Matthew's input on the locking question, and then send
v3 with the appropriate changes.

Thank you for the guidance!

Best regards,
Deepanshu