[PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large

Bin Meng posted 6 patches 10 months, 1 week ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20230628152726.110295-1-bmeng@tinylab.org
Maintainers: Jason Wang <jasowang@redhat.com>, "Edgar E. Iglesias" <edgar.iglesias@gmail.com>
include/qemu/osdep.h                |  1 +
net/tap.c                           | 24 ++++++------
tests/tcg/cris/libc/check_openpf5.c | 57 +++++++++++++--------------
util/async-teardown.c               | 37 +-----------------
util/osdep.c                        | 60 +++++++++++++++++++++++++++++
5 files changed, 101 insertions(+), 78 deletions(-)
[PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
Posted by Bin Meng 10 months, 1 week ago
Current codes using a brute-force traversal of all file descriptors
do not scale on a system where the maximum number of file descriptors
is set to a very large value (e.g.: in a Docker container of Manjaro
distribution it is set to 1073741816). QEMU just looks frozen during
start-up.

The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
doesn't need to manually close the fds for child process as the proper
O_CLOEXEC flag should have been set properly on files with its own
codes, QEMU uses a huge number of 3rd party libraries and we don't
trust them to reliably be using O_CLOEXEC on everything they open.

Modern Linux and BSDs have the close_range() call we can use to do the
job, and on Linux we have one more way to walk through /proc/self/fd
to complete the task efficiently, which is what qemu_close_range()
does, a new API we add in util/osdep.c.

V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/

Changes in v4:
- add 'first > last' check logic
- reorder the ifdefs logic
- change i to unsigned int type
- use qemu_strtoi() instead of atoi()
- limit last upper value to sysconf(_SC_OPEN_MAX) - 1
- call sysconf directly instead of using a variable
- put fd on its own line

Changes in v3:
- fix win32 build failure
- limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)

Changes in v2:
- new patch: "tests/tcg/cris: Fix the coding style"
- new patch: "tests/tcg/cris: Correct the off-by-one error"
- new patch: "util/async-teardown: Fall back to close fds one by one"
- new patch: "util/osdep: Introduce qemu_close_range()"
- new patch: "util/async-teardown: Use qemu_close_range() to close fds"
- Change to use qemu_close_range() to close fds for child process efficiently
- v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/

Bin Meng (4):
  tests/tcg/cris: Fix the coding style
  tests/tcg/cris: Correct the off-by-one error
  util/async-teardown: Fall back to close fds one by one
  util/osdep: Introduce qemu_close_range()

Zhangjin Wu (2):
  util/async-teardown: Use qemu_close_range() to close fds
  net: tap: Use qemu_close_range() to close fds

 include/qemu/osdep.h                |  1 +
 net/tap.c                           | 24 ++++++------
 tests/tcg/cris/libc/check_openpf5.c | 57 +++++++++++++--------------
 util/async-teardown.c               | 37 +-----------------
 util/osdep.c                        | 60 +++++++++++++++++++++++++++++
 5 files changed, 101 insertions(+), 78 deletions(-)

-- 
2.34.1
Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
Posted by Bin Meng 10 months ago
On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
>
>
> Current codes using a brute-force traversal of all file descriptors
> do not scale on a system where the maximum number of file descriptors
> is set to a very large value (e.g.: in a Docker container of Manjaro
> distribution it is set to 1073741816). QEMU just looks frozen during
> start-up.
>
> The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
> 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
> doesn't need to manually close the fds for child process as the proper
> O_CLOEXEC flag should have been set properly on files with its own
> codes, QEMU uses a huge number of 3rd party libraries and we don't
> trust them to reliably be using O_CLOEXEC on everything they open.
>
> Modern Linux and BSDs have the close_range() call we can use to do the
> job, and on Linux we have one more way to walk through /proc/self/fd
> to complete the task efficiently, which is what qemu_close_range()
> does, a new API we add in util/osdep.c.
>
> V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>
> Changes in v4:
> - add 'first > last' check logic
> - reorder the ifdefs logic
> - change i to unsigned int type
> - use qemu_strtoi() instead of atoi()
> - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
> - call sysconf directly instead of using a variable
> - put fd on its own line
>
> Changes in v3:
> - fix win32 build failure
> - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
>
> Changes in v2:
> - new patch: "tests/tcg/cris: Fix the coding style"
> - new patch: "tests/tcg/cris: Correct the off-by-one error"
> - new patch: "util/async-teardown: Fall back to close fds one by one"
> - new patch: "util/osdep: Introduce qemu_close_range()"
> - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
> - Change to use qemu_close_range() to close fds for child process efficiently
> - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>
> Bin Meng (4):
>   tests/tcg/cris: Fix the coding style
>   tests/tcg/cris: Correct the off-by-one error
>   util/async-teardown: Fall back to close fds one by one
>   util/osdep: Introduce qemu_close_range()
>
> Zhangjin Wu (2):
>   util/async-teardown: Use qemu_close_range() to close fds
>   net: tap: Use qemu_close_range() to close fds
>

Ping for 8.1?
Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
Posted by Jason Wang 10 months ago
On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>
> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
> >
> >
> > Current codes using a brute-force traversal of all file descriptors
> > do not scale on a system where the maximum number of file descriptors
> > is set to a very large value (e.g.: in a Docker container of Manjaro
> > distribution it is set to 1073741816). QEMU just looks frozen during
> > start-up.
> >
> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
> > doesn't need to manually close the fds for child process as the proper
> > O_CLOEXEC flag should have been set properly on files with its own
> > codes, QEMU uses a huge number of 3rd party libraries and we don't
> > trust them to reliably be using O_CLOEXEC on everything they open.
> >
> > Modern Linux and BSDs have the close_range() call we can use to do the
> > job, and on Linux we have one more way to walk through /proc/self/fd
> > to complete the task efficiently, which is what qemu_close_range()
> > does, a new API we add in util/osdep.c.
> >
> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >
> > Changes in v4:
> > - add 'first > last' check logic
> > - reorder the ifdefs logic
> > - change i to unsigned int type
> > - use qemu_strtoi() instead of atoi()
> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
> > - call sysconf directly instead of using a variable
> > - put fd on its own line
> >
> > Changes in v3:
> > - fix win32 build failure
> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
> >
> > Changes in v2:
> > - new patch: "tests/tcg/cris: Fix the coding style"
> > - new patch: "tests/tcg/cris: Correct the off-by-one error"
> > - new patch: "util/async-teardown: Fall back to close fds one by one"
> > - new patch: "util/osdep: Introduce qemu_close_range()"
> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
> > - Change to use qemu_close_range() to close fds for child process efficiently
> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >
> > Bin Meng (4):
> >   tests/tcg/cris: Fix the coding style
> >   tests/tcg/cris: Correct the off-by-one error
> >   util/async-teardown: Fall back to close fds one by one
> >   util/osdep: Introduce qemu_close_range()
> >
> > Zhangjin Wu (2):
> >   util/async-teardown: Use qemu_close_range() to close fds
> >   net: tap: Use qemu_close_range() to close fds
> >
>
> Ping for 8.1?

Queued.

Thanks
Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
Posted by Markus Armbruster 10 months ago
Jason Wang <jasowang@redhat.com> writes:

> On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote:
>>
>> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
>> >
>> >
>> > Current codes using a brute-force traversal of all file descriptors
>> > do not scale on a system where the maximum number of file descriptors
>> > is set to a very large value (e.g.: in a Docker container of Manjaro
>> > distribution it is set to 1073741816). QEMU just looks frozen during
>> > start-up.
>> >
>> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
>> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
>> > doesn't need to manually close the fds for child process as the proper
>> > O_CLOEXEC flag should have been set properly on files with its own
>> > codes, QEMU uses a huge number of 3rd party libraries and we don't
>> > trust them to reliably be using O_CLOEXEC on everything they open.
>> >
>> > Modern Linux and BSDs have the close_range() call we can use to do the
>> > job, and on Linux we have one more way to walk through /proc/self/fd
>> > to complete the task efficiently, which is what qemu_close_range()
>> > does, a new API we add in util/osdep.c.
>> >
>> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>> >
>> > Changes in v4:
>> > - add 'first > last' check logic
>> > - reorder the ifdefs logic
>> > - change i to unsigned int type
>> > - use qemu_strtoi() instead of atoi()
>> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
>> > - call sysconf directly instead of using a variable
>> > - put fd on its own line
>> >
>> > Changes in v3:
>> > - fix win32 build failure
>> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
>> >
>> > Changes in v2:
>> > - new patch: "tests/tcg/cris: Fix the coding style"
>> > - new patch: "tests/tcg/cris: Correct the off-by-one error"
>> > - new patch: "util/async-teardown: Fall back to close fds one by one"
>> > - new patch: "util/osdep: Introduce qemu_close_range()"
>> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
>> > - Change to use qemu_close_range() to close fds for child process efficiently
>> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
>> >
>> > Bin Meng (4):
>> >   tests/tcg/cris: Fix the coding style
>> >   tests/tcg/cris: Correct the off-by-one error
>> >   util/async-teardown: Fall back to close fds one by one
>> >   util/osdep: Introduce qemu_close_range()
>> >
>> > Zhangjin Wu (2):
>> >   util/async-teardown: Use qemu_close_range() to close fds
>> >   net: tap: Use qemu_close_range() to close fds
>> >
>>
>> Ping for 8.1?
>
> Queued.

There are review questions open on PATCH 4+5.
Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
Posted by Jason Wang 10 months ago
On Mon, Jul 10, 2023 at 2:07 PM Markus Armbruster <armbru@redhat.com> wrote:
>
> Jason Wang <jasowang@redhat.com> writes:
>
> > On Sun, Jul 9, 2023 at 11:48 PM Bin Meng <bmeng.cn@gmail.com> wrote:
> >>
> >> On Wed, Jun 28, 2023 at 11:29 PM Bin Meng <bmeng@tinylab.org> wrote:
> >> >
> >> >
> >> > Current codes using a brute-force traversal of all file descriptors
> >> > do not scale on a system where the maximum number of file descriptors
> >> > is set to a very large value (e.g.: in a Docker container of Manjaro
> >> > distribution it is set to 1073741816). QEMU just looks frozen during
> >> > start-up.
> >> >
> >> > The close-on-exec flag (O_CLOEXEC) was introduced since Linux kernel
> >> > 2.6.23, FreeBSD 8.3, OpenBSD 5.0, Solaris 11. While it's true QEMU
> >> > doesn't need to manually close the fds for child process as the proper
> >> > O_CLOEXEC flag should have been set properly on files with its own
> >> > codes, QEMU uses a huge number of 3rd party libraries and we don't
> >> > trust them to reliably be using O_CLOEXEC on everything they open.
> >> >
> >> > Modern Linux and BSDs have the close_range() call we can use to do the
> >> > job, and on Linux we have one more way to walk through /proc/self/fd
> >> > to complete the task efficiently, which is what qemu_close_range()
> >> > does, a new API we add in util/osdep.c.
> >> >
> >> > V1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >> >
> >> > Changes in v4:
> >> > - add 'first > last' check logic
> >> > - reorder the ifdefs logic
> >> > - change i to unsigned int type
> >> > - use qemu_strtoi() instead of atoi()
> >> > - limit last upper value to sysconf(_SC_OPEN_MAX) - 1
> >> > - call sysconf directly instead of using a variable
> >> > - put fd on its own line
> >> >
> >> > Changes in v3:
> >> > - fix win32 build failure
> >> > - limit the last_fd of qemu_close_range() to sysconf(_SC_OPEN_MAX)
> >> >
> >> > Changes in v2:
> >> > - new patch: "tests/tcg/cris: Fix the coding style"
> >> > - new patch: "tests/tcg/cris: Correct the off-by-one error"
> >> > - new patch: "util/async-teardown: Fall back to close fds one by one"
> >> > - new patch: "util/osdep: Introduce qemu_close_range()"
> >> > - new patch: "util/async-teardown: Use qemu_close_range() to close fds"
> >> > - Change to use qemu_close_range() to close fds for child process efficiently
> >> > - v1 link: https://lore.kernel.org/qemu-devel/20230406112041.798585-1-bmeng@tinylab.org/
> >> >
> >> > Bin Meng (4):
> >> >   tests/tcg/cris: Fix the coding style
> >> >   tests/tcg/cris: Correct the off-by-one error
> >> >   util/async-teardown: Fall back to close fds one by one
> >> >   util/osdep: Introduce qemu_close_range()
> >> >
> >> > Zhangjin Wu (2):
> >> >   util/async-teardown: Use qemu_close_range() to close fds
> >> >   net: tap: Use qemu_close_range() to close fds
> >> >
> >>
> >> Ping for 8.1?
> >
> > Queued.
>
> There are review questions open on PATCH 4+5.

Sorry, I missed that. I've dropped them from the queue now.

Thanks

>
Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
Posted by Michael Tokarev 10 months, 1 week ago
28.06.2023 18:27, Bin Meng wrote:
> 
> Current codes using a brute-force traversal of all file descriptors
> do not scale on a system where the maximum number of file descriptors
> is set to a very large value (e.g.: in a Docker container of Manjaro
> distribution it is set to 1073741816). QEMU just looks frozen during
> start-up.

So, the same question as before. *Why* do we close all filedescriptors
to begin with?

Thanks,

/mjt
Re: [PATCH v4 0/6] net/tap: Fix QEMU frozen issue when the maximum number of file descriptors is very large
Posted by Daniel P. Berrangé 10 months, 1 week ago
On Thu, Jun 29, 2023 at 11:33:29AM +0300, Michael Tokarev wrote:
> 28.06.2023 18:27, Bin Meng wrote:
> > 
> > Current codes using a brute-force traversal of all file descriptors
> > do not scale on a system where the maximum number of file descriptors
> > is set to a very large value (e.g.: in a Docker container of Manjaro
> > distribution it is set to 1073741816). QEMU just looks frozen during
> > start-up.
> 
> So, the same question as before. *Why* do we close all filedescriptors
> to begin with?

The O_CLOSEXEC flag is a terrible concept, as the default behaviour of
file descriptors is to be leaked into all child processes, unless code
takes explicit action to set O_CLOEXEC in every case. Even if they are
diligent about their own code, apps developers can have zero confidence
that every library they use has set O_CLOEXEC. Not just set it after the
FD is open, but set it atomically when the the FD is open, because
threads create race conditions if not atomically set.

Leaking FDs is a security risk, and QEMU is an especially security
critical application. QEMU needs stronger guarantees that O_CLOEXEC
can offer, and mass-close before execve is the only viable option
to achieve this.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|