In the Firecracker VM scenario, we sporadically encountered threads
stuck in the UN state with the following call stack:
[<0>] io_wq_put_and_exit+0xa1/0x210
[<0>] io_uring_clean_tctx+0x8e/0xd0
[<0>] io_uring_cancel_generic+0x19f/0x370
[<0>] __io_uring_cancel+0x14/0x20
[<0>] do_exit+0x17f/0x510
[<0>] do_group_exit+0x35/0x90
[<0>] get_signal+0x963/0x970
[<0>] arch_do_signal_or_restart+0x39/0x120
[<0>] syscall_exit_to_user_mode+0x206/0x260
[<0>] do_syscall_64+0x8d/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
The cause is a large number of IOU kernel threads saturating the CPU
and never exiting. When the issue occurs, CPU usage stays at 100% and
can only be resolved by rebooting. Each thread's stack appears as follows:
iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
iou-wrk-44588 [kernel.kallsyms] [k] io_write
iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
iou-wrk-44588 [kernel.kallsyms] [k] schedule
iou-wrk-44588 [kernel.kallsyms] [k] __schedule
iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping

I tracked the address that triggered the fault and the related function
graph, as well as the wake-up side of the userfault, and discovered the
following: when an IOU worker faults in a user-space page that is
registered with userfaultfd, it does not sleep, because a check for the
IOU worker context in the scheduling path causes an early return.
Meanwhile, the userfaultfd listener in user space never responds with a
COPY, so the page table entry remains empty. Because of the early
return, the worker never sleeps waiting to be woken as a normal user
fault would; it keeps faulting at the same address, so the CPU loops.
Therefore, I believe user faults need special handling: set a new flag
that allows the schedule function to continue in such cases, making
sure the thread sleeps.

Patch 1 io_uring: Add new functions to handle user fault scenarios
Patch 2 userfaultfd: Set the corresponding flag in IOU worker context

 fs/userfaultfd.c |  7 ++++++
 io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
 io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
 3 files changed, 68 insertions(+), 41 deletions(-)

--
2.34.1
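[To see why this manifests as a hard CPU loop, it helps to look at the
retry structure in iomap_write_iter() that the worker stack above sits
in. The following is a simplified paraphrase, not the verbatim kernel
source: if the copy from the user buffer makes no progress, the loop
pre-faults the page and tries again. When the pre-fault lands on a
userfaultfd-registered range that is never resolved, and the fault
handler returns without sleeping, the loop spins on the same address:

	do {
		/* Pre-fault the user page; on a userfaultfd-registered
		 * VMA this reaches handle_userfault() via the normal
		 * page fault path. Returns the number of bytes NOT
		 * faulted in, so == bytes means nothing is accessible. */
		if (fault_in_iov_iter_readable(i, bytes) == bytes) {
			status = -EFAULT;
			break;
		}
		copied = copy_page_from_iter_atomic(page, offset, bytes, i);
		/* copied == 0: no progress, loop back and fault in again */
	} while (iov_iter_count(i));

If handle_userfault() slept as intended, this loop would simply wait;
the report above is that for iou-wrk threads it does not.]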
On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
> In the Firecracker VM scenario, we sporadically encountered threads
> stuck in the UN state with the following call stack:
> [<0>] io_wq_put_and_exit+0xa1/0x210
> [<0>] io_uring_clean_tctx+0x8e/0xd0
> [<0>] io_uring_cancel_generic+0x19f/0x370
> [<0>] __io_uring_cancel+0x14/0x20
> [<0>] do_exit+0x17f/0x510
> [<0>] do_group_exit+0x35/0x90
> [<0>] get_signal+0x963/0x970
> [<0>] arch_do_signal_or_restart+0x39/0x120
> [<0>] syscall_exit_to_user_mode+0x206/0x260
> [<0>] do_syscall_64+0x8d/0x170
> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
> The cause is a large number of IOU kernel threads saturating the CPU
> and never exiting. When the issue occurs, CPU usage stays at 100% and
> can only be resolved by rebooting. Each thread's stack appears as follows:
> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
> iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
> iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
> iou-wrk-44588 [kernel.kallsyms] [k] io_write
> iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
> iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
> iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
> iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
> iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
> iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
> iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
> iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
> iou-wrk-44588 [kernel.kallsyms] [k] schedule
> iou-wrk-44588 [kernel.kallsyms] [k] __schedule
> iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping
>
> I tracked the address that triggered the fault and the related function
> graph, as well as the wake-up side of the userfault, and discovered the
> following: when an IOU worker faults in a user-space page that is
> registered with userfaultfd, it does not sleep, because a check for the
> IOU worker context in the scheduling path causes an early return.
> Meanwhile, the userfaultfd listener in user space never responds with a
> COPY, so the page table entry remains empty. Because of the early
> return, the worker never sleeps waiting to be woken as a normal user
> fault would; it keeps faulting at the same address, so the CPU loops.
> Therefore, I believe user faults need special handling: set a new flag
> that allows the schedule function to continue in such cases, making
> sure the thread sleeps.
>
> Patch 1 io_uring: Add new functions to handle user fault scenarios
> Patch 2 userfaultfd: Set the corresponding flag in IOU worker context
>
> fs/userfaultfd.c | 7 ++++++
> io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
> io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
> 3 files changed, 68 insertions(+), 41 deletions(-)

Do you have a test case for this? I don't think the proposed solution is
very elegant, userfaultfd should not need to know about thread workers.
I'll ponder this a bit...

--
Jens Axboe
On Tue, Apr 22, 2025 at 9:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
> > In the Firecracker VM scenario, we sporadically encountered threads
> > stuck in the UN state with the following call stack:
> > [<0>] io_wq_put_and_exit+0xa1/0x210
> > [<0>] io_uring_clean_tctx+0x8e/0xd0
> > [<0>] io_uring_cancel_generic+0x19f/0x370
> > [<0>] __io_uring_cancel+0x14/0x20
> > [<0>] do_exit+0x17f/0x510
> > [<0>] do_group_exit+0x35/0x90
> > [<0>] get_signal+0x963/0x970
> > [<0>] arch_do_signal_or_restart+0x39/0x120
> > [<0>] syscall_exit_to_user_mode+0x206/0x260
> > [<0>] do_syscall_64+0x8d/0x170
> > [<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
> > The cause is a large number of IOU kernel threads saturating the CPU
> > and never exiting. When the issue occurs, CPU usage stays at 100% and
> > can only be resolved by rebooting. Each thread's stack appears as follows:
> > iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
> > iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
> > iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
> > iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
> > iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
> > iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
> > iou-wrk-44588 [kernel.kallsyms] [k] io_write
> > iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
> > iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
> > iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
> > iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
> > iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
> > iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
> > iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
> > iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
> > iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
> > iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
> > iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
> > iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
> > iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
> > iou-wrk-44588 [kernel.kallsyms] [k] schedule
> > iou-wrk-44588 [kernel.kallsyms] [k] __schedule
> > iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
> > iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping
> >
> > I tracked the address that triggered the fault and the related function
> > graph, as well as the wake-up side of the userfault, and discovered the
> > following: when an IOU worker faults in a user-space page that is
> > registered with userfaultfd, it does not sleep, because a check for the
> > IOU worker context in the scheduling path causes an early return.
> > Meanwhile, the userfaultfd listener in user space never responds with a
> > COPY, so the page table entry remains empty. Because of the early
> > return, the worker never sleeps waiting to be woken as a normal user
> > fault would; it keeps faulting at the same address, so the CPU loops.
> > Therefore, I believe user faults need special handling: set a new flag
> > that allows the schedule function to continue in such cases, making
> > sure the thread sleeps.
> >
> > Patch 1 io_uring: Add new functions to handle user fault scenarios
> > Patch 2 userfaultfd: Set the corresponding flag in IOU worker context
> >
> > fs/userfaultfd.c | 7 ++++++
> > io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
> > io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
> > 3 files changed, 68 insertions(+), 41 deletions(-)
>
> Do you have a test case for this? I don't think the proposed solution is
> very elegant, userfaultfd should not need to know about thread workers.
> I'll ponder this a bit...
>
> --
> Jens Axboe

Sorry, the issue occurs very infrequently, and I can't manually
reproduce it. It's not very elegant, but for corner cases, it seems
necessary to make some compromises.
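[Since no ready-made test case exists, a synthetic one for the scenario
in the cover letter might look roughly like the sketch below: register
a buffer with userfaultfd, never service the faults, and force an
io_uring write from that buffer onto an io-wq worker. This is
illustrative only, not a confirmed reproducer - it uses anonymous
memory and a regular file where the report involved hugetlb-backed
guest memory and a block device, the file path and sizes are arbitrary,
and error handling is omitted:

	#include <fcntl.h>
	#include <liburing.h>
	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define LEN (2UL * 1024 * 1024)

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;
		struct uffdio_api api = { .api = UFFD_API };
		struct uffdio_register reg;
		void *buf;
		int uffd, fd;

		/* Region whose missing faults are routed to userfaultfd.
		 * Nobody ever reads the uffd or does UFFDIO_COPY, so a
		 * fault on buf can never be resolved. */
		buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
		ioctl(uffd, UFFDIO_API, &api);
		reg.range.start = (unsigned long)buf;
		reg.range.len = LEN;
		reg.mode = UFFDIO_REGISTER_MODE_MISSING;
		ioctl(uffd, UFFDIO_REGISTER, &reg);

		/* Buffered write sourced from the unfaulted buffer;
		 * IOSQE_ASYNC forces it onto an iou-wrk thread, which
		 * then faults on buf in fault_in_iov_iter_readable(). */
		fd = open("/tmp/uffd-iou-test", O_RDWR | O_CREAT, 0600);
		io_uring_queue_init(8, &ring, 0);
		sqe = io_uring_get_sqe(&ring);
		io_uring_prep_write(sqe, fd, buf, LEN, 0);
		sqe->flags |= IOSQE_ASYNC;
		io_uring_submit(&ring);

		/* With the reported bug, the worker would now spin at
		 * 100% CPU instead of sleeping, and this never returns. */
		io_uring_wait_cqe(&ring, &cqe);
		return 0;
	}

Whether this actually triggers the loop would need to be verified on an
affected kernel.]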
On 4/22/25 8:10 AM, Zhiwei Jiang wrote:
> On Tue, Apr 22, 2025 at 9:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
>>> In the Firecracker VM scenario, we sporadically encountered threads
>>> stuck in the UN state with the following call stack:
>>> [<0>] io_wq_put_and_exit+0xa1/0x210
>>> [<0>] io_uring_clean_tctx+0x8e/0xd0
>>> [<0>] io_uring_cancel_generic+0x19f/0x370
>>> [<0>] __io_uring_cancel+0x14/0x20
>>> [<0>] do_exit+0x17f/0x510
>>> [<0>] do_group_exit+0x35/0x90
>>> [<0>] get_signal+0x963/0x970
>>> [<0>] arch_do_signal_or_restart+0x39/0x120
>>> [<0>] syscall_exit_to_user_mode+0x206/0x260
>>> [<0>] do_syscall_64+0x8d/0x170
>>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
>>> The cause is a large number of IOU kernel threads saturating the CPU
>>> and never exiting. When the issue occurs, CPU usage stays at 100% and
>>> can only be resolved by rebooting. Each thread's stack appears as follows:
>>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
>>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
>>> iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
>>> iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
>>> iou-wrk-44588 [kernel.kallsyms] [k] io_write
>>> iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
>>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
>>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
>>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
>>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
>>> iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
>>> iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
>>> iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
>>> iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
>>> iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
>>> iou-wrk-44588 [kernel.kallsyms] [k] schedule
>>> iou-wrk-44588 [kernel.kallsyms] [k] __schedule
>>> iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping
>>>
>>> I tracked the address that triggered the fault and the related function
>>> graph, as well as the wake-up side of the userfault, and discovered the
>>> following: when an IOU worker faults in a user-space page that is
>>> registered with userfaultfd, it does not sleep, because a check for the
>>> IOU worker context in the scheduling path causes an early return.
>>> Meanwhile, the userfaultfd listener in user space never responds with a
>>> COPY, so the page table entry remains empty. Because of the early
>>> return, the worker never sleeps waiting to be woken as a normal user
>>> fault would; it keeps faulting at the same address, so the CPU loops.
>>> Therefore, I believe user faults need special handling: set a new flag
>>> that allows the schedule function to continue in such cases, making
>>> sure the thread sleeps.
>>>
>>> Patch 1 io_uring: Add new functions to handle user fault scenarios
>>> Patch 2 userfaultfd: Set the corresponding flag in IOU worker context
>>>
>>> fs/userfaultfd.c | 7 ++++++
>>> io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
>>> io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
>>> 3 files changed, 68 insertions(+), 41 deletions(-)
>>
>> Do you have a test case for this? I don't think the proposed solution is
>> very elegant, userfaultfd should not need to know about thread workers.
>> I'll ponder this a bit...
>>
>> --
>> Jens Axboe
> Sorry, the issue occurs very infrequently, and I can't manually
> reproduce it. It's not very elegant, but for corner cases, it seems
> necessary to make some compromises.

I'm going to see if I can create one. Not sure I fully understand the
issue yet, but I'd be surprised if there isn't a more appropriate and
elegant solution rather than exposing the io-wq guts and having
userfaultfd manipulate them. That really should not be necessary.

--
Jens Axboe
On Tue, Apr 22, 2025 at 10:13 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/22/25 8:10 AM, Zhiwei Jiang wrote:
> > On Tue, Apr 22, 2025 at 9:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
> >>> In the Firecracker VM scenario, we sporadically encountered threads
> >>> stuck in the UN state with the following call stack:
> >>> [<0>] io_wq_put_and_exit+0xa1/0x210
> >>> [<0>] io_uring_clean_tctx+0x8e/0xd0
> >>> [<0>] io_uring_cancel_generic+0x19f/0x370
> >>> [<0>] __io_uring_cancel+0x14/0x20
> >>> [<0>] do_exit+0x17f/0x510
> >>> [<0>] do_group_exit+0x35/0x90
> >>> [<0>] get_signal+0x963/0x970
> >>> [<0>] arch_do_signal_or_restart+0x39/0x120
> >>> [<0>] syscall_exit_to_user_mode+0x206/0x260
> >>> [<0>] do_syscall_64+0x8d/0x170
> >>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
> >>> The cause is a large number of IOU kernel threads saturating the CPU
> >>> and never exiting. When the issue occurs, CPU usage stays at 100% and
> >>> can only be resolved by rebooting. Each thread's stack appears as follows:
> >>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
> >>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
> >>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
> >>> iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
> >>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
> >>> iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
> >>> iou-wrk-44588 [kernel.kallsyms] [k] io_write
> >>> iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
> >>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
> >>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
> >>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
> >>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
> >>> iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
> >>> iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
> >>> iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
> >>> iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
> >>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
> >>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
> >>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
> >>> iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
> >>> iou-wrk-44588 [kernel.kallsyms] [k] schedule
> >>> iou-wrk-44588 [kernel.kallsyms] [k] __schedule
> >>> iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
> >>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping
> >>>
> >>> I tracked the address that triggered the fault and the related function
> >>> graph, as well as the wake-up side of the userfault, and discovered the
> >>> following: when an IOU worker faults in a user-space page that is
> >>> registered with userfaultfd, it does not sleep, because a check for the
> >>> IOU worker context in the scheduling path causes an early return.
> >>> Meanwhile, the userfaultfd listener in user space never responds with a
> >>> COPY, so the page table entry remains empty. Because of the early
> >>> return, the worker never sleeps waiting to be woken as a normal user
> >>> fault would; it keeps faulting at the same address, so the CPU loops.
> >>> Therefore, I believe user faults need special handling: set a new flag
> >>> that allows the schedule function to continue in such cases, making
> >>> sure the thread sleeps.
> >>>
> >>> Patch 1 io_uring: Add new functions to handle user fault scenarios
> >>> Patch 2 userfaultfd: Set the corresponding flag in IOU worker context
> >>>
> >>> fs/userfaultfd.c | 7 ++++++
> >>> io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
> >>> io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
> >>> 3 files changed, 68 insertions(+), 41 deletions(-)
> >>
> >> Do you have a test case for this? I don't think the proposed solution is
> >> very elegant, userfaultfd should not need to know about thread workers.
> >> I'll ponder this a bit...
> >>
> >> --
> >> Jens Axboe
> > Sorry, the issue occurs very infrequently, and I can't manually
> > reproduce it. It's not very elegant, but for corner cases, it seems
> > necessary to make some compromises.
>
> I'm going to see if I can create one. Not sure I fully understand the
> issue yet, but I'd be surprised if there isn't a more appropriate and
> elegant solution rather than exposing the io-wq guts and having
> userfaultfd manipulate them. That really should not be necessary.
>
> --
> Jens Axboe

Thanks. I'm looking forward to your good news.
On 4/22/25 8:18 AM, Zhiwei Jiang wrote:
> On Tue, Apr 22, 2025 at 10:13 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 4/22/25 8:10 AM, Zhiwei Jiang wrote:
>>> On Tue, Apr 22, 2025 at 9:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
>>>>> In the Firecracker VM scenario, we sporadically encountered threads
>>>>> stuck in the UN state with the following call stack:
>>>>> [<0>] io_wq_put_and_exit+0xa1/0x210
>>>>> [<0>] io_uring_clean_tctx+0x8e/0xd0
>>>>> [<0>] io_uring_cancel_generic+0x19f/0x370
>>>>> [<0>] __io_uring_cancel+0x14/0x20
>>>>> [<0>] do_exit+0x17f/0x510
>>>>> [<0>] do_group_exit+0x35/0x90
>>>>> [<0>] get_signal+0x963/0x970
>>>>> [<0>] arch_do_signal_or_restart+0x39/0x120
>>>>> [<0>] syscall_exit_to_user_mode+0x206/0x260
>>>>> [<0>] do_syscall_64+0x8d/0x170
>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
>>>>> The cause is a large number of IOU kernel threads saturating the CPU
>>>>> and never exiting. When the issue occurs, CPU usage stays at 100% and
>>>>> can only be resolved by rebooting. Each thread's stack appears as follows:
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_write
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] schedule
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] __schedule
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping
>>>>>
>>>>> I tracked the address that triggered the fault and the related function
>>>>> graph, as well as the wake-up side of the userfault, and discovered the
>>>>> following: when an IOU worker faults in a user-space page that is
>>>>> registered with userfaultfd, it does not sleep, because a check for the
>>>>> IOU worker context in the scheduling path causes an early return.
>>>>> Meanwhile, the userfaultfd listener in user space never responds with a
>>>>> COPY, so the page table entry remains empty. Because of the early
>>>>> return, the worker never sleeps waiting to be woken as a normal user
>>>>> fault would; it keeps faulting at the same address, so the CPU loops.
>>>>> Therefore, I believe user faults need special handling: set a new flag
>>>>> that allows the schedule function to continue in such cases, making
>>>>> sure the thread sleeps.
>>>>>
>>>>> Patch 1 io_uring: Add new functions to handle user fault scenarios
>>>>> Patch 2 userfaultfd: Set the corresponding flag in IOU worker context
>>>>>
>>>>> fs/userfaultfd.c | 7 ++++++
>>>>> io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
>>>>> io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
>>>>> 3 files changed, 68 insertions(+), 41 deletions(-)
>>>>
>>>> Do you have a test case for this? I don't think the proposed solution is
>>>> very elegant, userfaultfd should not need to know about thread workers.
>>>> I'll ponder this a bit...
>>>>
>>>> --
>>>> Jens Axboe
>>> Sorry, the issue occurs very infrequently, and I can't manually
>>> reproduce it. It's not very elegant, but for corner cases, it seems
>>> necessary to make some compromises.
>>
>> I'm going to see if I can create one. Not sure I fully understand the
>> issue yet, but I'd be surprised if there isn't a more appropriate and
>> elegant solution rather than exposing the io-wq guts and having
>> userfaultfd manipulate them. That really should not be necessary.
>>
>> --
>> Jens Axboe
> Thanks. I'm looking forward to your good news.
Well, let's hope there is! In any case, your patches could be
considerably improved if you did:
void set_userfault_flag_for_ioworker(void)
{
	struct io_worker *worker;

	if (!(current->flags & PF_IO_WORKER))
		return;

	worker = current->worker_private;
	set_bit(IO_WORKER_F_FAULT, &worker->flags);
}

void clear_userfault_flag_for_ioworker(void)
{
	struct io_worker *worker;

	if (!(current->flags & PF_IO_WORKER))
		return;

	worker = current->worker_private;
	clear_bit(IO_WORKER_F_FAULT, &worker->flags);
}
and then userfaultfd would not need any odd checking, or need io-wq
related structures made public. That'd drastically cut down on the size
of the patches, and make them a bit more palatable.
--
Jens Axboe
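[To make the intended division of labor concrete, the userfaultfd side
(patch 2 of the series) would then shrink to something like the
fragment below. This is an illustration of the described approach using
the helper names from the snippet above, not the actual patch, and the
exact placement inside handle_userfault() is a guess:

	/* in fs/userfaultfd.c:handle_userfault(), bracketing the wait
	 * so the io-wq scheduling hooks let an iou-wrk thread really
	 * go to sleep on the fault */
	set_userfault_flag_for_ioworker();
	schedule();
	clear_userfault_flag_for_ioworker();

With only these two calls crossing the boundary, io-wq keeps its
internals private and userfaultfd stays unaware of worker bookkeeping.]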
On 4/22/25 8:29 AM, Jens Axboe wrote:
> On 4/22/25 8:18 AM, Zhiwei Jiang wrote:
>> On Tue, Apr 22, 2025 at 10:13 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 4/22/25 8:10 AM, Zhiwei Jiang wrote:
>>>> On Tue, Apr 22, 2025 at 9:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
>>>>>> In the Firecracker VM scenario, we sporadically encountered threads
>>>>>> stuck in the UN state with the following call stack:
>>>>>> [<0>] io_wq_put_and_exit+0xa1/0x210
>>>>>> [<0>] io_uring_clean_tctx+0x8e/0xd0
>>>>>> [<0>] io_uring_cancel_generic+0x19f/0x370
>>>>>> [<0>] __io_uring_cancel+0x14/0x20
>>>>>> [<0>] do_exit+0x17f/0x510
>>>>>> [<0>] do_group_exit+0x35/0x90
>>>>>> [<0>] get_signal+0x963/0x970
>>>>>> [<0>] arch_do_signal_or_restart+0x39/0x120
>>>>>> [<0>] syscall_exit_to_user_mode+0x206/0x260
>>>>>> [<0>] do_syscall_64+0x8d/0x170
>>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
>>>>>> The cause is a large number of IOU kernel threads saturating the CPU
>>>>>> and never exiting. When the issue occurs, CPU usage stays at 100% and
>>>>>> can only be resolved by rebooting. Each thread's stack appears as follows:
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_write
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] schedule
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] __schedule
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping
>>>>>>
>>>>>> I tracked the address that triggered the fault and the related function
>>>>>> graph, as well as the wake-up side of the userfault, and discovered the
>>>>>> following: when an IOU worker faults in a user-space page that is
>>>>>> registered with userfaultfd, it does not sleep, because a check for the
>>>>>> IOU worker context in the scheduling path causes an early return.
>>>>>> Meanwhile, the userfaultfd listener in user space never responds with a
>>>>>> COPY, so the page table entry remains empty. Because of the early
>>>>>> return, the worker never sleeps waiting to be woken as a normal user
>>>>>> fault would; it keeps faulting at the same address, so the CPU loops.
>>>>>> Therefore, I believe user faults need special handling: set a new flag
>>>>>> that allows the schedule function to continue in such cases, making
>>>>>> sure the thread sleeps.
>>>>>>
>>>>>> Patch 1 io_uring: Add new functions to handle user fault scenarios
>>>>>> Patch 2 userfaultfd: Set the corresponding flag in IOU worker context
>>>>>>
>>>>>> fs/userfaultfd.c | 7 ++++++
>>>>>> io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
>>>>>> io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
>>>>>> 3 files changed, 68 insertions(+), 41 deletions(-)
>>>>>
>>>>> Do you have a test case for this? I don't think the proposed solution is
>>>>> very elegant, userfaultfd should not need to know about thread workers.
>>>>> I'll ponder this a bit...
>>>>>
>>>>> --
>>>>> Jens Axboe
>>>> Sorry, the issue occurs very infrequently, and I can't manually
>>>> reproduce it. It's not very elegant, but for corner cases, it seems
>>>> necessary to make some compromises.
>>>
>>> I'm going to see if I can create one. Not sure I fully understand the
>>> issue yet, but I'd be surprised if there isn't a more appropriate and
>>> elegant solution rather than exposing the io-wq guts and having
>>> userfaultfd manipulate them. That really should not be necessary.
>>>
>>> --
>>> Jens Axboe
>> Thanks. I'm looking forward to your good news.
>
> Well, let's hope there is! In any case, your patches could be
> considerably improved if you did:
>
> void set_userfault_flag_for_ioworker(void)
> {
> 	struct io_worker *worker;
>
> 	if (!(current->flags & PF_IO_WORKER))
> 		return;
>
> 	worker = current->worker_private;
> 	set_bit(IO_WORKER_F_FAULT, &worker->flags);
> }
>
> void clear_userfault_flag_for_ioworker(void)
> {
> 	struct io_worker *worker;
>
> 	if (!(current->flags & PF_IO_WORKER))
> 		return;
>
> 	worker = current->worker_private;
> 	clear_bit(IO_WORKER_F_FAULT, &worker->flags);
> }
>
> and then userfaultfd would not need any odd checking, or need io-wq
> related structures made public. That'd drastically cut down on the size
> of the patches, and make them a bit more palatable.
Forgot to ask, what kernel are you running on?
--
Jens Axboe
On Tue, Apr 22, 2025 at 11:50 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/22/25 8:29 AM, Jens Axboe wrote:
> > On 4/22/25 8:18 AM, Zhiwei Jiang wrote:
> >> On Tue, Apr 22, 2025 at 10:13 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>> On 4/22/25 8:10 AM, Zhiwei Jiang wrote:
> >>>> On Tue, Apr 22, 2025 at 9:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>
> >>>>> On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
> >>>>>> In the Firecracker VM scenario, we sporadically encountered threads
> >>>>>> stuck in the UN state with the following call stack:
> >>>>>> [<0>] io_wq_put_and_exit+0xa1/0x210
> >>>>>> [<0>] io_uring_clean_tctx+0x8e/0xd0
> >>>>>> [<0>] io_uring_cancel_generic+0x19f/0x370
> >>>>>> [<0>] __io_uring_cancel+0x14/0x20
> >>>>>> [<0>] do_exit+0x17f/0x510
> >>>>>> [<0>] do_group_exit+0x35/0x90
> >>>>>> [<0>] get_signal+0x963/0x970
> >>>>>> [<0>] arch_do_signal_or_restart+0x39/0x120
> >>>>>> [<0>] syscall_exit_to_user_mode+0x206/0x260
> >>>>>> [<0>] do_syscall_64+0x8d/0x170
> >>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
> >>>>>> The cause is a large number of IOU kernel threads saturating the CPU
> >>>>>> and never exiting. When the issue occurs, CPU usage stays at 100% and
> >>>>>> can only be resolved by rebooting. Each thread's stack appears as follows:
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_write
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] schedule
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] __schedule
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
> >>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping
> >>>>>>
> >>>>>> I tracked the address that triggered the fault and the related function
> >>>>>> graph, as well as the wake-up side of the userfault, and discovered the
> >>>>>> following: when an IOU worker faults in a user-space page that is
> >>>>>> registered with userfaultfd, it does not sleep, because a check for the
> >>>>>> IOU worker context in the scheduling path causes an early return.
> >>>>>> Meanwhile, the userfaultfd listener in user space never responds with a
> >>>>>> COPY, so the page table entry remains empty. Because of the early
> >>>>>> return, the worker never sleeps waiting to be woken as a normal user
> >>>>>> fault would; it keeps faulting at the same address, so the CPU loops.
> >>>>>> Therefore, I believe user faults need special handling: set a new flag
> >>>>>> that allows the schedule function to continue in such cases, making
> >>>>>> sure the thread sleeps.
> >>>>>>
> >>>>>> Patch 1 io_uring: Add new functions to handle user fault scenarios
> >>>>>> Patch 2 userfaultfd: Set the corresponding flag in IOU worker context
> >>>>>>
> >>>>>> fs/userfaultfd.c | 7 ++++++
> >>>>>> io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
> >>>>>> io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
> >>>>>> 3 files changed, 68 insertions(+), 41 deletions(-)
> >>>>>
> >>>>> Do you have a test case for this? I don't think the proposed solution is
> >>>>> very elegant, userfaultfd should not need to know about thread workers.
> >>>>> I'll ponder this a bit...
> >>>>>
> >>>>> --
> >>>>> Jens Axboe
> >>>> Sorry, the issue occurs very infrequently, and I can't manually
> >>>> reproduce it. It's not very elegant, but for corner cases, it seems
> >>>> necessary to make some compromises.
> >>>
> >>> I'm going to see if I can create one. Not sure I fully understand the
> >>> issue yet, but I'd be surprised if there isn't a more appropriate and
> >>> elegant solution rather than exposing the io-wq guts and having
> >>> userfaultfd manipulate them. That really should not be necessary.
> >>>
> >>> --
> >>> Jens Axboe
> >> Thanks. I'm looking forward to your good news.
> >
> > Well, let's hope there is! In any case, your patches could be
> > considerably improved if you did:
> >
> > void set_userfault_flag_for_ioworker(void)
> > {
> > 	struct io_worker *worker;
> >
> > 	if (!(current->flags & PF_IO_WORKER))
> > 		return;
> >
> > 	worker = current->worker_private;
> > 	set_bit(IO_WORKER_F_FAULT, &worker->flags);
> > }
> >
> > void clear_userfault_flag_for_ioworker(void)
> > {
> > 	struct io_worker *worker;
> >
> > 	if (!(current->flags & PF_IO_WORKER))
> > 		return;
> >
> > 	worker = current->worker_private;
> > 	clear_bit(IO_WORKER_F_FAULT, &worker->flags);
> > }
> >
> > and then userfaultfd would not need any odd checking, or need io-wq
> > related structures made public. That'd drastically cut down on the size
> > of the patches, and make them a bit more palatable.
>
> Forgot to ask, what kernel are you running on?
>
> --
> Jens Axboe
Thanks Jens, it is linux-image-6.8.0-1026-gcp.
On 4/22/25 10:14 AM, Zhiwei Jiang wrote:
> On Tue, Apr 22, 2025 at 11:50 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 4/22/25 8:29 AM, Jens Axboe wrote:
>>> On 4/22/25 8:18 AM, Zhiwei Jiang wrote:
>>>> On Tue, Apr 22, 2025 at 10:13 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 4/22/25 8:10 AM, Zhiwei Jiang wrote:
>>>>>> On Tue, Apr 22, 2025 at 9:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>
>>>>>>> On 4/22/25 4:45 AM, Zhiwei Jiang wrote:
>>>>>>>> In the Firecracker VM scenario, we sporadically encountered threads
>>>>>>>> stuck in the UN state with the following call stack:
>>>>>>>> [<0>] io_wq_put_and_exit+0xa1/0x210
>>>>>>>> [<0>] io_uring_clean_tctx+0x8e/0xd0
>>>>>>>> [<0>] io_uring_cancel_generic+0x19f/0x370
>>>>>>>> [<0>] __io_uring_cancel+0x14/0x20
>>>>>>>> [<0>] do_exit+0x17f/0x510
>>>>>>>> [<0>] do_group_exit+0x35/0x90
>>>>>>>> [<0>] get_signal+0x963/0x970
>>>>>>>> [<0>] arch_do_signal_or_restart+0x39/0x120
>>>>>>>> [<0>] syscall_exit_to_user_mode+0x206/0x260
>>>>>>>> [<0>] do_syscall_64+0x8d/0x170
>>>>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
>>>>>>>> The cause is a large number of IOU kernel threads saturating the CPU
>>>>>>>> and never exiting. When the issue occurs, CPU usage stays at 100% and
>>>>>>>> can only be resolved by rebooting. Each thread's stack appears as follows:
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork_asm
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] ret_from_fork
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_worker_handle_work
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_submit_work
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_issue_sqe
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_write
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] blkdev_write_iter
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_file_buffered_write
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] iomap_write_iter
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_iov_iter_readable
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] fault_in_readable
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] asm_exc_page_fault
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] exc_page_fault
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] do_user_addr_fault
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] handle_mm_fault
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_fault
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_no_page
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] hugetlb_handle_userfault
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] handle_userfault
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] schedule
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] __schedule
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] __raw_spin_unlock_irq
>>>>>>>> iou-wrk-44588 [kernel.kallsyms] [k] io_wq_worker_sleeping
>>>>>>>>
>>>>>>>> I tracked the address that triggered the fault and the related function
>>>>>>>> graph, as well as the wake-up side of the userfault, and discovered the
>>>>>>>> following: when an IOU worker faults in a user-space page that is
>>>>>>>> registered with userfaultfd, it does not sleep, because a check for the
>>>>>>>> IOU worker context in the scheduling path causes an early return.
>>>>>>>> Meanwhile, the userfaultfd listener in user space never responds with a
>>>>>>>> COPY, so the page table entry remains empty. Because of the early
>>>>>>>> return, the worker never sleeps waiting to be woken as a normal user
>>>>>>>> fault would; it keeps faulting at the same address, so the CPU loops.
>>>>>>>> Therefore, I believe user faults need special handling: set a new flag
>>>>>>>> that allows the schedule function to continue in such cases, making
>>>>>>>> sure the thread sleeps.
>>>>>>>>
>>>>>>>> Patch 1 io_uring: Add new functions to handle user fault scenarios
>>>>>>>> Patch 2 userfaultfd: Set the corresponding flag in IOU worker context
>>>>>>>>
>>>>>>>> fs/userfaultfd.c | 7 ++++++
>>>>>>>> io_uring/io-wq.c | 57 +++++++++++++++---------------------------------
>>>>>>>> io_uring/io-wq.h | 45 ++++++++++++++++++++++++++++++++++++--
>>>>>>>> 3 files changed, 68 insertions(+), 41 deletions(-)
>>>>>>>
>>>>>>> Do you have a test case for this? I don't think the proposed solution is
>>>>>>> very elegant, userfaultfd should not need to know about thread workers.
>>>>>>> I'll ponder this a bit...
>>>>>>>
>>>>>>> --
>>>>>>> Jens Axboe
>>>>>> Sorry, the issue occurs very infrequently, and I can't manually
>>>>>> reproduce it. It's not very elegant, but for corner cases, it seems
>>>>>> necessary to make some compromises.
>>>>>
>>>>> I'm going to see if I can create one. Not sure I fully understand the
>>>>> issue yet, but I'd be surprised if there isn't a more appropriate and
>>>>> elegant solution rather than exposing the io-wq guts and having
>>>>> userfaultfd manipulate them. That really should not be necessary.
>>>>>
>>>>> --
>>>>> Jens Axboe
>>>> Thanks. I'm looking forward to your good news.
>>>
>>> Well, let's hope there is! In any case, your patches could be
>>> considerably improved if you did:
>>>
>>> void set_userfault_flag_for_ioworker(void)
>>> {
>>> 	struct io_worker *worker;
>>>
>>> 	if (!(current->flags & PF_IO_WORKER))
>>> 		return;
>>>
>>> 	worker = current->worker_private;
>>> 	set_bit(IO_WORKER_F_FAULT, &worker->flags);
>>> }
>>>
>>> void clear_userfault_flag_for_ioworker(void)
>>> {
>>> 	struct io_worker *worker;
>>>
>>> 	if (!(current->flags & PF_IO_WORKER))
>>> 		return;
>>>
>>> 	worker = current->worker_private;
>>> 	clear_bit(IO_WORKER_F_FAULT, &worker->flags);
>>> }
>>>
>>> and then userfaultfd would not need any odd checking, or need io-wq
>>> related structures made public. That'd drastically cut down on the size
>>> of the patches, and make them a bit more palatable.
>>
>> Forgot to ask, what kernel are you running on?
>>
>> --
>> Jens Axboe
> Thanks Jens, it is linux-image-6.8.0-1026-gcp.
OK, that's ancient and unsupported, in that no stable releases are
happening for that kernel. Does it happen on newer kernels too?

FWIW, I haven't been able to reproduce anything odd so far. The io_uring
writes going via io-wq and hitting the userfaultfd path end up sleeping
in the schedule() in handle_userfault() - which is what I'd expect.

Do you know how many pending writes there are? I have a hard time
understanding your description of the problem, but it sounds like a ton
of workers are being created. But it's still not clear to me why that
would be - workers would only get created if there's more work to do,
and the current worker is going to sleep.

Puzzled...
--
Jens Axboe
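[For readers following along: the worker-creation behavior Jens refers
to lives in io_wq_worker_sleeping() and io_wq_dec_running(). Roughly
paraphrased from io_uring/io-wq.c of that era - not verbatim:

	void io_wq_worker_sleeping(struct task_struct *tsk)
	{
		struct io_worker *worker = tsk->worker_private;

		if (!(worker->flags & IO_WORKER_F_UP))
			return;
		if (!(worker->flags & IO_WORKER_F_RUNNING))
			return;

		worker->flags &= ~IO_WORKER_F_RUNNING;
		/* io_wq_dec_running(): if this was the last running
		 * worker and work is still queued, io-wq schedules
		 * creation of a replacement worker */
		io_wq_dec_running(worker);
	}

If each replacement worker immediately hits the same never-resolving
userfault without actually sleeping, workers and CPU usage pile up,
which would be consistent with the report.]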