gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79

The virtio-balloon device now has the ability to report free pages
back to the hypervisor for reuse by other programs.

This kernel feature allows for more stable and resource-friendly systems.

The feature is available in QEMU and is enabled with free-page-reporting=on.
The default setting should be off, but the user should be able to enable it.

Nico Pache (4):
  Document and parser support for the Virtio free page reporting
    feature.
  QEMU: declare qemu capabilities for the Virtio Free page reporting
    feature
  QEMU: introduce Virtio free page reporting feature
  provide testing for free-page-reporting feature in QEMU

 docs/formatdomain.rst                         |  6 ++++
 docs/schemas/domaincommon.rng                 |  5 +++
 src/conf/domain_conf.c                        | 21 +++++++++++
 src/conf/domain_conf.h                        |  1 +
 src/qemu/qemu_capabilities.c                  |  2 ++
 src/qemu/qemu_capabilities.h                  |  1 +
 src/qemu/qemu_command.c                       |  5 +++
 src/qemu/qemu_validate.c                      |  7 ++++
 .../caps_5.1.0.x86_64.xml                     |  1 +
 .../caps_5.2.0.x86_64.xml                     |  1 +
 ...-options-memballoon-freepage-reporting.err |  1 +
 ...loon-freepage-reporting.x86_64-latest.args | 36 +++++++++++++++++++
 ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++
 tests/qemuxml2argvtest.c                      |  2 ++
 ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++
 15 files changed, 133 insertions(+)
 create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.err
 create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.x86_64-latest.args
 create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.xml
 create mode 100644 tests/qemuxml2xmloutdata/virtio-options-memballoon-freepage-reporting.xml

-- 
2.18.4
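For readers following along, a minimal sketch of what the resulting domain XML would look like with the feature enabled (the `freePageReporting` attribute name follows this patch series; treat the exact spelling as an assumption until the docs patch lands):

```xml
<devices>
  <!-- free page reporting is off by default; enable it per-domain -->
  <memballoon model='virtio' freePageReporting='on'>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
  </memballoon>
</devices>
```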
On 10/12/20 8:35 PM, Nico Pache wrote:
> gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
>
> The virtio-balloon device now has the ability to report free pages
> back to the hypervisor for reuse by other programs.
>
> This kernel feature allows for more stable and resource friendly systems.
>
> This feature is available in QEMU and is enabled with free-page-reporting=on
> default virt setting should be off but the user should be able to enable it.
>
> Nico Pache (4):
>   Document and parser support for the Virtio free page reporting
>     feature.
>   QEMU: declare qemu capabilities for the Virtio Free page reporting
>     feature
>   QEMU: introduce Virtio free page reporting feature
>   provide testing for free-page-reporting feature in QEMU

Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>

>  docs/formatdomain.rst                         |  6 ++++
>  docs/schemas/domaincommon.rng                 |  5 +++
>  src/conf/domain_conf.c                        | 21 +++++++++++
>  src/conf/domain_conf.h                        |  1 +
>  src/qemu/qemu_capabilities.c                  |  2 ++
>  src/qemu/qemu_capabilities.h                  |  1 +
>  src/qemu/qemu_command.c                       |  5 +++
>  src/qemu/qemu_validate.c                      |  7 ++++
>  .../caps_5.1.0.x86_64.xml                     |  1 +
>  .../caps_5.2.0.x86_64.xml                     |  1 +
>  ...-options-memballoon-freepage-reporting.err |  1 +
>  ...loon-freepage-reporting.x86_64-latest.args | 36 +++++++++++++++++++
>  ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++
>  tests/qemuxml2argvtest.c                      |  2 ++
>  ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++
>  15 files changed, 133 insertions(+)
>  create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.err
>  create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.x86_64-latest.args
>  create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.xml
>  create mode 100644 tests/qemuxml2xmloutdata/virtio-options-memballoon-freepage-reporting.xml
On 10/13/20 1:35 AM, Nico Pache wrote:
> gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
>
> The virtio-balloon device now has the ability to report free pages
> back to the hypervisor for reuse by other programs.
>
> This kernel feature allows for more stable and resource friendly systems.
>
> This feature is available in QEMU and is enabled with free-page-reporting=on
> default virt setting should be off but the user should be able to enable it.
>
> Nico Pache (4):
>   Document and parser support for the Virtio free page reporting
>     feature.
>   QEMU: declare qemu capabilities for the Virtio Free page reporting
>     feature
>   QEMU: introduce Virtio free page reporting feature
>   provide testing for free-page-reporting feature in QEMU
>
>  docs/formatdomain.rst                         |  6 ++++
>  docs/schemas/domaincommon.rng                 |  5 +++
>  src/conf/domain_conf.c                        | 21 +++++++++++
>  src/conf/domain_conf.h                        |  1 +
>  src/qemu/qemu_capabilities.c                  |  2 ++
>  src/qemu/qemu_capabilities.h                  |  1 +
>  src/qemu/qemu_command.c                       |  5 +++
>  src/qemu/qemu_validate.c                      |  7 ++++
>  .../caps_5.1.0.x86_64.xml                     |  1 +
>  .../caps_5.2.0.x86_64.xml                     |  1 +
>  ...-options-memballoon-freepage-reporting.err |  1 +
>  ...loon-freepage-reporting.x86_64-latest.args | 36 +++++++++++++++++++
>  ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++
>  tests/qemuxml2argvtest.c                      |  2 ++
>  ...-options-memballoon-freepage-reporting.xml | 22 ++++++++++++
>  15 files changed, 133 insertions(+)
>  create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.err
>  create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.x86_64-latest.args
>  create mode 100644 tests/qemuxml2argvdata/virtio-options-memballoon-freepage-reporting.xml
>  create mode 100644 tests/qemuxml2xmloutdata/virtio-options-memballoon-freepage-reporting.xml

A couple of whitespace problems, but nothing I couldn't fix before pushing.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>

Congratulations on your first libvirt contribution!

Michal
On 10/13/20 5:10 PM, Michal Privoznik wrote:
> On 10/13/20 1:35 AM, Nico Pache wrote:
>> gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
>>
>> The virtio-balloon device now has the ability to report free pages
>> back to the hypervisor for reuse by other programs.

Is this something that we might want to report? I mean, we have 'virsh
dommemstat $dom' which under the hood calls:

{"execute":"qom-get","arguments":{"path":"/machine/peripheral/balloon0","property":"guest-stats"},"id":"libvirt-400"}

Michal
I don't believe so, but David might have some more insight into that question.

Once the feature is enabled, the guest will report pages on its freelist to the hypervisor, and the host/hypervisor will then release them.

I'm not exactly sure what information we can or would want to 'stat' out of this process... A counter for the number of pages returned?

Cheers,
Nico

On Tue, Oct 13, 2020, 12:47 PM Michal Privoznik <mprivozn@redhat.com> wrote:
> On 10/13/20 5:10 PM, Michal Privoznik wrote:
> > On 10/13/20 1:35 AM, Nico Pache wrote:
> >> gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
> >>
> >> The virtio-balloon device now has the ability to report free pages
> >> back to the hypervisor for reuse by other programs.
>
> Is this something that we might want to report? I mean, we have 'virsh
> dommemstat $dom' which under the hood calls:
>
> {"execute":"qom-get","arguments":{"path":"/machine/peripheral/balloon0","property":"guest-stats"},"id":"libvirt-400"}
>
> Michal
On 13.10.20 19:00, Nico Pache wrote:
> I don't believe so, but David might have some more insight into that
> question.
>
> Once the feature is enabled the guest will report pages on its freelist
> to the hypervisor, and the host/hypervisor will then release them.

We report (some, not all) free memory of our guest to the hypervisor. So whatever the guest reported shows up in "stat-free-memory" in the stats already. The hypervisor will "discard" the backing storage of reported free memory, resulting in the hypervisor having more free memory available (and the QEMU process consuming less memory).

> I'm not exactly sure what information we can or would want to 'stat' out
> of this process... A counter for number of pages returned?

As the guest can reuse memory at any time without coordination with the hypervisor, that wouldn't be of too much help. A counter would only tell you whether free page reporting was once performed successfully. You would really have to compare the QEMU process memory footprint with the guest stats to figure out if free page reporting is doing what it's supposed to do.

-- 
Thanks,

David / dhildenb
On Tue, Oct 13, 2020 at 18:47:39 +0200, Michal Privoznik wrote:
> On 10/13/20 5:10 PM, Michal Privoznik wrote:
> > On 10/13/20 1:35 AM, Nico Pache wrote:
> > > gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
> > >
> > > The virtio-balloon device now has the ability to report free pages
> > > back to the hypervisor for reuse by other programs.
>
> Is this something that we might want to report? I mean, we have 'virsh
> dommemstat $dom' which under the hood calls:
>
> {"execute":"qom-get","arguments":{"path":"/machine/peripheral/balloon0","property":"guest-stats"},"id":"libvirt-400"}

As I've pointed out in an earlier review, I think that the feature name is a bit misleading. It sounds like a statistic, not something that actually returns memory to the host. The docs are now better but still leave a lot of room for imagination.

IMO, if the feature is mainly for returning memory to the host, the 'reporting' word should not have been used.
On 10/13/20 7:57 PM, Peter Krempa wrote:
> On Tue, Oct 13, 2020 at 18:47:39 +0200, Michal Privoznik wrote:
>> On 10/13/20 5:10 PM, Michal Privoznik wrote:
>>> On 10/13/20 1:35 AM, Nico Pache wrote:
>>>> gitlab issue: https://gitlab.com/libvirt/libvirt/-/issues/79
>>>>
>>>> The virtio-balloon device now has the ability to report free pages
>>>> back to the hypervisor for reuse by other programs.
>>
>> Is this something that we might want to report? I mean, we have 'virsh
>> dommemstat $dom' which under the hood calls:
>>
>> {"execute":"qom-get","arguments":{"path":"/machine/peripheral/balloon0","property":"guest-stats"},"id":"libvirt-400"}
>
> As I've pointed out in earlier review, I think that the feature name is
> a bit misleading. It sounds like a statistic, not something that
> actually returns memory to the host. The docs are now better but still
> leave a lot of room for imagination.
>
> IMO, if the feature is mainly for returning memory to the host, the
> 'reporting' word should not have been used.

Oh sorry, I missed that. As penance I will post a cleanup patch. How does "free-pages" or "return-pages" or even "discard-pages" sound?

Michal
IMO "return-pages" sounds the best out of those and stays relatively consistent with the kernel and qemu terminology for this feature.

I personally don't see a huge problem with the current name, but I've also been staring at the words "free page reporting" for too long.

Cheers!
-- Nico

On Wed, Oct 14, 2020 at 1:56 AM Michal Privoznik <mprivozn@redhat.com> wrote:
> On 10/13/20 7:57 PM, Peter Krempa wrote:
> > As I've pointed out in earlier review, I think that the feature name is
> > a bit misleading. It sounds like a statistic, not something that
> > actually returns memory to the host. The docs are now better but still
> > leave a lot of room for imagination.
> >
> > IMO, if the feature is mainly for returning memory to the host, the
> > 'reporting' word should not have been used.
>
> Oh sorry I missed that. I a penance I will post a cleanup patch. How
> does "free-pages" or "return-pages" or even "discard-pages" sound?
>
> Michal
On 10/14/20 8:07 AM, Nico Pache wrote:
> IMO "return-pages" sounds the best out of those and stays relatively
> consistent with the kernel and qemu terminology for this feature.
>
> I personally don't see a huge problem with the current name, but I've
> also been staring at the words "free page reporting" for too long.

Right, I can see both reasons. But now that Peter raised it (again, sorry): at the libvirt level, reporting usually means "to report something to the user". QEMU <-> KVM communication is too low level, and since this is not being reported to the user directly (even though it affects the "stat-free-memory" attribute of the balloon), reporting does sound a bit weird. Let's wait for Peter's opinion.

Michal
On 14.10.20 08:30, Michal Privoznik wrote:
> On 10/14/20 8:07 AM, Nico Pache wrote:
>> IMO "return-pages" sounds the best out of those and stays relatively
>> consistent with the kernel and qemu terminology for this feature.
>>
>> I personally don't see a huge problem with the current name, but I've
>> also been staring at the words "free page reporting" for too long.
>
> Right, I can see both reasons. But now that Peter raised it (again,
> sorry) at libvirt level reporting usually means "to report something to
> user". QEMU <-> KVM communication is too low level and since this is not
> being reported to user directly (even though it affects
> "stat-free-memory" attribute of the balloon) reporting does sound a bit
> weird. Let's wait for Peter's opinion.

We originally wanted to use "free page hinting", but as that name is already taken by another virtio-balloon feature to improve live migration speed (a one-time hint of free pages just before migration), we used "reporting" instead.

On the QEMU level, this feature name makes perfect sense. The guest (continuously) reports free pages to the hypervisor. The hypervisor tries to use the reports to free up some memory (which might not always be possible).

Now, all I can say is that:

1. "free page reporting" is the official name of the feature. Using a different name in libvirt will create more confusion than it might actually help. Just saying.

2. The proposed alternatives "free-pages", "return-pages", or even "discard-pages" don't match what's actually going on.

-- 
Thanks,

David / dhildenb
On 10/14/20 10:26 AM, David Hildenbrand wrote:
> On 14.10.20 08:30, Michal Privoznik wrote:

Sorry for hijacking this thread, but I need to report it somewhere (what is the best place?)

When I want to start a guest with this feature turned on + memfd + hugepages, the host kernel prints this warning into dmesg and hugepages stop working from then on (meaning, even if the pool of allocated HPs is large enough, I can't start any guest with HPs):

[  139.434748] ------------[ cut here ]------------
[  139.434754] WARNING: CPU: 2 PID: 6280 at mm/page_counter.c:57 page_counter_uncharge+0x33/0x40
[  139.434754] Modules linked in: kvm_amd amdgpu kvm btusb btrtl btbcm btintel sp5100_tco watchdog k10temp mfd_core gpu_sched ttm
[  139.434759] CPU: 2 PID: 6280 Comm: CPU 1/KVM Not tainted 5.8.13-gentoo-x86_64 #2
[  139.434759] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 1005 08/01/2019
[  139.434760] RIP: 0010:page_counter_uncharge+0x33/0x40
[  139.434762] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41
[  139.434762] RSP: 0018:ffffc9000355fb38 EFLAGS: 00010286
[  139.434763] RAX: fffffffffffb4000 RBX: ffff888fc267e900 RCX: fffffffffffb4000
[  139.434763] RDX: 0000000000000402 RSI: fffffffffffb4000 RDI: ffff888fd8411dd0
[  139.434764] RBP: ffff888fcba983c0 R08: 0000000000080400 R09: fffffffffff7fc00
[  139.434764] R10: ffffc9000355fb40 R11: 000000000000000a R12: 0000000000000001
[  139.434765] R13: ffff888fc3d89140 R14: 00000000000001b2 R15: 00000000000001b1
[  139.434765] FS:  00007fc9d4c35700(0000) GS:ffff888fde880000(0000) knlGS:0000000000000000
[  139.434766] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  139.434766] CR2: 00007f09a4003000 CR3: 0000000fc06fe000 CR4: 0000000000340ee0
[  139.434767] Call Trace:
[  139.434769]  hugetlb_cgroup_uncharge_file_region+0x46/0x70
[  139.434772]  region_del+0x1a0/0x260
[  139.434773]  hugetlb_unreserve_pages+0x32/0xa0
[  139.434775]  remove_inode_hugepages+0x19d/0x3a0
[  139.434776]  hugetlbfs_fallocate+0x3f2/0x4a0
[  139.434778]  ? __seccomp_filter+0x75/0x6a0
[  139.434779]  vfs_fallocate+0x124/0x260
[  139.434780]  __x64_sys_fallocate+0x39/0x60
[  139.434783]  do_syscall_64+0x38/0x60
[  139.434784]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  139.434785] RIP: 0033:0x7fc9e0994de7
[  139.434786] Code: 89 7c 24 08 48 89 4c 24 18 e8 45 fc f8 ff 41 89 c0 4c 8b 54 24 18 48 8b 54 24 10 b8 1d 01 00 00 8b 74 24 0c 8b 7c 24 08 0f 05 <48> 3d 00 f0 ff ff 77 41 44 89 c7 89 44 24 08 e8 75 fc f8 ff 8b 44
[  139.434787] RSP: 002b:00007fc9d4c337a0 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[  139.434787] RAX: ffffffffffffffda RBX: 0000000036400000 RCX: 00007fc9e0994de7
[  139.434788] RDX: 0000000036200000 RSI: 0000000000000003 RDI: 000000000000001d
[  139.434788] RBP: 00007fc9d4c33800 R08: 0000000000000000 R09: 0000000000000000
[  139.434789] R10: 0000000000200000 R11: 0000000000000293 R12: 00007fff9a75c3fe
[  139.434789] R13: 00007fff9a75c3ff R14: 00007fc9d4c35700 R15: 00007fc9d4c33dc0
[  139.434790] ---[ end trace fb9808303959fc01 ]---

Is this a known problem?

Michal
On 14.10.20 13:53, Michal Privoznik wrote: > On 10/14/20 10:26 AM, David Hildenbrand wrote: >> On 14.10.20 08:30, Michal Privoznik wrote: > > Sorry for hijacking this thread, but I need to report it somewhere (what > is the best place?) > > When I want to start a guest with this feature turned on + memfd + > hugepages, the host kernel prints this warning into dmesg and hugepages > stop working from then on (meaning, even if the pool of allocated HPs is > large enough I can't start any guest with HPs): > > > > > [ 139.434748] ------------[ cut here ]------------ > [ 139.434754] WARNING: CPU: 2 PID: 6280 at mm/page_counter.c:57 > page_counter_uncharge+0x33/0x40 > [ 139.434754] Modules linked in: kvm_amd amdgpu kvm btusb btrtl btbcm > btintel sp5100_tco watchdog k10temp mfd_core gpu_sched ttm > [ 139.434759] CPU: 2 PID: 6280 Comm: CPU 1/KVM Not tainted > 5.8.13-gentoo-x86_64 #2 > [ 139.434759] Hardware name: System manufacturer System Product > Name/PRIME X570-PRO, BIOS 1005 08/01/2019 > [ 139.434760] RIP: 0010:page_counter_uncharge+0x33/0x40 > [ 139.434762] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 > 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc > c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41 > [ 139.434762] RSP: 0018:ffffc9000355fb38 EFLAGS: 00010286 > [ 139.434763] RAX: fffffffffffb4000 RBX: ffff888fc267e900 RCX: > fffffffffffb4000 > [ 139.434763] RDX: 0000000000000402 RSI: fffffffffffb4000 RDI: > ffff888fd8411dd0 > [ 139.434764] RBP: ffff888fcba983c0 R08: 0000000000080400 R09: > fffffffffff7fc00 > [ 139.434764] R10: ffffc9000355fb40 R11: 000000000000000a R12: > 0000000000000001 > [ 139.434765] R13: ffff888fc3d89140 R14: 00000000000001b2 R15: > 00000000000001b1 > [ 139.434765] FS: 00007fc9d4c35700(0000) GS:ffff888fde880000(0000) > knlGS:0000000000000000 > [ 139.434766] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 139.434766] CR2: 00007f09a4003000 CR3: 0000000fc06fe000 CR4: > 0000000000340ee0 
> [ 139.434767] Call Trace: > [ 139.434769] hugetlb_cgroup_uncharge_file_region+0x46/0x70 > [ 139.434772] region_del+0x1a0/0x260 > [ 139.434773] hugetlb_unreserve_pages+0x32/0xa0 > [ 139.434775] remove_inode_hugepages+0x19d/0x3a0 > [ 139.434776] hugetlbfs_fallocate+0x3f2/0x4a0 > [ 139.434778] ? __seccomp_filter+0x75/0x6a0 > [ 139.434779] vfs_fallocate+0x124/0x260 > [ 139.434780] __x64_sys_fallocate+0x39/0x60 > [ 139.434783] do_syscall_64+0x38/0x60 > [ 139.434784] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [ 139.434785] RIP: 0033:0x7fc9e0994de7 > [ 139.434786] Code: 89 7c 24 08 48 89 4c 24 18 e8 45 fc f8 ff 41 89 c0 > 4c 8b 54 24 18 48 8b 54 24 10 b8 1d 01 00 00 8b 74 24 0c 8b 7c 24 08 0f > 05 <48> 3d 00 f0 ff ff 77 41 44 89 c7 89 44 24 08 e8 75 fc f8 ff 8b 44 > [ 139.434787] RSP: 002b:00007fc9d4c337a0 EFLAGS: 00000293 ORIG_RAX: > 000000000000011d > [ 139.434787] RAX: ffffffffffffffda RBX: 0000000036400000 RCX: > 00007fc9e0994de7 > [ 139.434788] RDX: 0000000036200000 RSI: 0000000000000003 RDI: > 000000000000001d > [ 139.434788] RBP: 00007fc9d4c33800 R08: 0000000000000000 R09: > 0000000000000000 > [ 139.434789] R10: 0000000000200000 R11: 0000000000000293 R12: > 00007fff9a75c3fe > [ 139.434789] R13: 00007fff9a75c3ff R14: 00007fc9d4c35700 R15: > 00007fc9d4c33dc0 > [ 139.434790] ---[ end trace fb9808303959fc01 ]--- > > > Is this known problem? No, not at all. Thanks for reporting! And the "bad" thing is, that QEMU doesn't do anything too fancy. All it does is "fallocate(FALLOC_FL_PUNCH_HOLE)" on hugetlbfs when trying to zap reported pages. The same mechanism is also used for postcopy live migration and virtio-mem with hugetlbfs. Which kernel are you running? 1. Is it an upstream kernel, lkml + -mm lists are the right place (please cc me, or I can try to reproduce and report it). 2. Is it a distro kernel? Then create a BUG there. I was just recently testing virtio-mem with hugetlbfs and it worked on decent upstream Fedora. But maybe I was not able to trigger it. 
>
> Michal

-- 
Thanks,

David / dhildenb
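The "zap" David describes boils down to a hole-punch on the file backing guest RAM. A minimal sketch of the mechanism, calling fallocate(2) with FALLOC_FL_PUNCH_HOLE via ctypes on an ordinary temporary file instead of hugetlbfs (purely illustrative, not QEMU's actual code path):

```python
import ctypes
import errno
import os
import tempfile

FALLOC_FL_KEEP_SIZE = 0x01   # keep the apparent file size
FALLOC_FL_PUNCH_HOLE = 0x02  # deallocate the byte range

def punch_hole(fd, offset, length):
    """Discard backing blocks for [offset, offset+length), analogous to
    what QEMU does for pages the guest reports as free."""
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         ctypes.c_long(offset), ctypes.c_long(length))
    if ret != 0:
        raise OSError(ctypes.get_errno(), os.strerror(ctypes.get_errno()))

def demo():
    with tempfile.TemporaryFile(dir="/tmp") as f:
        f.write(b"\xaa" * 65536)      # back 64 KiB with real blocks
        f.flush()
        os.fsync(f.fileno())
        before = os.fstat(f.fileno()).st_blocks
        try:
            punch_hole(f.fileno(), 0, 32768)   # "zap" the first 32 KiB
        except OSError as e:
            if e.errno != errno.EOPNOTSUPP:    # fs without hole punching
                raise
        st = os.fstat(f.fileno())
        # KEEP_SIZE: the file still appears 64 KiB, but fewer blocks back it
        return before, st.st_blocks, st.st_size
```

The key property, which free page reporting relies on, is that the apparent size is unchanged while the backing blocks are released; a later write to the punched range transparently allocates fresh (zeroed) blocks.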
On 10/14/20 2:06 PM, David Hildenbrand wrote:
> On 14.10.20 13:53, Michal Privoznik wrote:
>> On 10/14/20 10:26 AM, David Hildenbrand wrote:
>>> On 14.10.20 08:30, Michal Privoznik wrote:

<snip/>

> No, not at all. Thanks for reporting!
>
> And the "bad" thing is, that QEMU doesn't do anything too fancy. All it
> does is "fallocate(FALLOC_FL_PUNCH_HOLE)" on hugetlbfs when trying to
> zap reported pages. The same mechanism is also used for postcopy live
> migration and virtio-mem with hugetlbfs.
>
> Which kernel are you running?
>
> 1. Is it an upstream kernel, lkml + -mm lists are the right place
> (please cc me, or I can try to reproduce and report it).
>
> 2. Is it a distro kernel? Then create a BUG there.
>
> I was just recently testing virtio-mem with hugetlbfs and it worked on
> decent upstream Fedora. But maybe I was not able to trigger it.

Okay, I've upgraded to 5.9.0-gentoo, but the problem persists. Gentoo puts only a very few patches on top of the vanilla kernel, none of which touches that area of the code:

https://dev.gentoo.org/~mpagano/genpatches/trunk/5.9/

So I think this is reproducible on vanilla too.

BTW: Have you tried placing the qemu inside v1 cgroups? Libvirt does that, so maybe that's the problem.
Anyway, here's the cmd line:

/home/zippy/work/qemu/qemu.git/build/qemu-system-x86_64 \
-name guest=fedora,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-fedora/master-key.aes \
-machine pc-i440fx-4.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \
-cpu host,migratable=on \
-m 4096 \
-object memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,prealloc=yes,size=4294967296,host-nodes=0,policy=bind \
-overcommit mem-lock=off \
-smp 4,sockets=1,dies=1,cores=2,threads=2 \
-object iothread,id=iothread1 \
-object iothread,id=iothread2 \
-object iothread,id=iothread3 \
-object iothread,id=iothread4 \
-uuid 63840878-0deb-4095-97e6-fc444d9bc9fa \
-no-user-config \
-nodefaults \
-device sga \
-chardev socket,id=charmonitor,fd=33,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc \
-no-shutdown \
-global PIIX4_PM.disable_s3=0 \
-global PIIX4_PM.disable_s4=0 \
-boot menu=on,strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \
-blockdev '{"driver":"file","filename":"/var/lib/libvirt/images/fedora.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1 \
-netdev tap,fd=35,id=hostnet0 \
-device virtio-net-pci,host_mtu=9000,netdev=hostnet0,id=net0,mac=52:54:00:a4:6f:91,bus=pci.0,addr=0x3 \
-chardev pty,id=charserial0 \
-device isa-serial,chardev=charserial0,id=serial0 \
-chardev socket,id=charchannel0,fd=36,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
-spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on \
-device virtio-vga,id=video0,virgl=on,max_outputs=1,bus=pci.0,addr=0x2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on

Michal
On 14.10.20 15:46, Michal Privoznik wrote: > On 10/14/20 2:06 PM, David Hildenbrand wrote: >> On 14.10.20 13:53, Michal Privoznik wrote: >>> On 10/14/20 10:26 AM, David Hildenbrand wrote: >>>> On 14.10.20 08:30, Michal Privoznik wrote: >>> >> <snip/> >> >> No, not at all. Thanks for reporting! >> >> And the "bad" thing is, that QEMU doesn't do anything too fancy. All it >> does is "fallocate(FALLOC_FL_PUNCH_HOLE)" on hugetlbfs when trying to >> zap reported pages. The same mechanism is also used for postcopy live >> migration and virtio-mem with hugetlbfs. >> >> Which kernel are you running? >> >> 1. Is it an upstream kernel, lkml + -mm lists are the right place >> (please cc me, or I can try to reproduce and report it). >> >> 2. Is it a distro kernel? Then create a BUG there. >> >> I was just recently testing virtio-mem with hugetlbfs and it worked on >> decent upstream Fedora. But maybe I was not able to trigger it. > > Okay, I've upgraded to 5.9.0-gentoo, but the problem persists. Gentoo > puts only a very few patches on top of vanilla kernel neither of which > touches that area of the code: > > https://dev.gentoo.org/~mpagano/genpatches/trunk/5.9/ > > So I think this is reproducible on vanilla too. > > BTW: Have you tried placing the qemu inside v1 cgroups? Libvirt does > that so maybe that's the problem. 
Anyway, here's the cmd line: > > /home/zippy/work/qemu/qemu.git/build/qemu-system-x86_64 \ > -name guest=fedora,debug-threads=on \ > -S \ > -object > secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-2-fedora/master-key.aes > \ > -machine > pc-i440fx-4.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \ > -cpu host,migratable=on \ > -m 4096 \ > -object > memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,prealloc=yes,size=4294967296,host-nodes=0,policy=bind > \ > -overcommit mem-lock=off \ > -smp 4,sockets=1,dies=1,cores=2,threads=2 \ > -object iothread,id=iothread1 \ > -object iothread,id=iothread2 \ > -object iothread,id=iothread3 \ > -object iothread,id=iothread4 \ > -uuid 63840878-0deb-4095-97e6-fc444d9bc9fa \ > -no-user-config \ > -nodefaults \ > -device sga \ > -chardev socket,id=charmonitor,fd=33,server,nowait \ > -mon chardev=charmonitor,id=monitor,mode=control \ > -rtc base=utc \ > -no-shutdown \ > -global PIIX4_PM.disable_s3=0 \ > -global PIIX4_PM.disable_s4=0 \ > -boot menu=on,strict=on \ > -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \ > -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \ > -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \ > -blockdev > '{"driver":"file","filename":"/var/lib/libvirt/images/fedora.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' > \ > -blockdev > '{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}' > \ > -device > scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1 > \ > -netdev tap,fd=35,id=hostnet0 \ > -device > virtio-net-pci,host_mtu=9000,netdev=hostnet0,id=net0,mac=52:54:00:a4:6f:91,bus=pci.0,addr=0x3 > \ > -chardev pty,id=charserial0 \ > -device isa-serial,chardev=charserial0,id=serial0 \ > -chardev socket,id=charchannel0,fd=36,server,nowait \ > -device > 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 > \ > -spice port=5900,addr=127.0.0.1,disable-ticketing,seamless-migration=on \ > -device virtio-vga,id=video0,virgl=on,max_outputs=1,bus=pci.0,addr=0x2 \ > -device > virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on \ > -sandbox > on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \ > -msg timestamp=on > Thanks! Reproduced easily on Fedora 32 (5.7.16-200.fc32.x86_64). [ 70.641802] CPU: 3 PID: 2178 Comm: qemu-system-x86 Not tainted 5.7.16-200.fc32.x86_64 #1 [ 70.641802] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020 [ 70.641803] RIP: 0010:page_counter_uncharge+0x4b/0x50 [ 70.641804] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 20 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 0f 1f 44 00 00 48 8b 17 48 39 d6 72 41 41 54 49 89 [ 70.641804] RSP: 0018:ffffb4044139bb18 EFLAGS: 00010286 [ 70.641805] RAX: fffffffffff94600 RBX: fffffffffff94600 RCX: ffff8da63d007000 [ 70.641805] RDX: 000000000000046e RSI: fffffffffff94600 RDI: ffff8da678412e28 [ 70.641806] RBP: ffff8da678412e28 R08: ffff8da678412e28 R09: ffff8da63d0078c0 [ 70.641806] R10: ffff8da634173000 R11: 0000000000000007 R12: 000000000008dc00 [ 70.641806] R13: fffffffffff72400 R14: ffff8da63d007000 R15: 0000000000000391 [ 70.641807] FS: 00007fe7ab5fe700(0000) GS:ffff8da67eac0000(0000) knlGS:0000000000000000 [ 70.641808] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 70.641808] CR2: 000055568796a468 CR3: 0000000fe9860000 CR4: 0000000000340ee0 [ 70.641808] Call Trace: [ 70.641813] hugetlb_cgroup_uncharge_file_region+0x4b/0x80 [ 70.641815] region_del+0x1d3/0x300 [ 70.641816] hugetlb_unreserve_pages+0x39/0xb0 [ 70.641818] remove_inode_hugepages+0x1a8/0x3d0 [ 70.641831] ? kvm_mmu_notifier_invalidate_range+0x38/0x60 [kvm] [ 70.641832] ? 
tlb_finish_mmu+0x7a/0x1d0
[   70.641833]  hugetlbfs_fallocate+0x3ac/0x5e0
[   70.641835]  ? avc_has_perm+0x3b/0x160
[   70.641836]  ? file_has_perm+0xa2/0xb0
[   70.641837]  ? selinux_inode_follow_link+0x4c/0xb0
[   70.641838]  ? selinux_file_permission+0x4e/0x120
[   70.641839]  ? security_file_permission+0x2e/0x160
[   70.641840]  vfs_fallocate+0x146/0x280
[   70.641841]  __x64_sys_fallocate+0x3e/0x70
[   70.641843]  do_syscall_64+0x5b/0xf0
[   70.641846]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Note: prealloc=yes is a bad choice in this environment. It contradicts memory overcommit, which is what we want to optimize with free page reporting. You allocate all VM memory only to throw it away once the guest is up. Is there an option to turn this off with hugetlbfs? I hope so.

I'll try reproducing upstream and send a BUG report upstream, ccing you. Thanks!

> Michal

-- 
Thanks,

David / dhildenb
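Following David's note, the overcommit-friendly configuration would drop preallocation from the memory backend. A sketch of just the two relevant options from the command line above (prealloc is simply omitted here since it defaults to off; the rest of the command line is unchanged):

```
-object memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,size=4294967296,host-nodes=0,policy=bind \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on \
```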
On 10/14/20 4:07 PM, David Hildenbrand wrote:
> Note: prealloc=yes is a bad choice in this environment. It
> contradicts memory overcommit - what we want to optimize with
> free page reporting. You allocate all VM memory to throw it away
> once the guest is up. Is there an option to turn this off with
> hugetlbfs? I hope so.

Is this something that should be addressed on the qemu side? Since I'm writing follow-up patches for libvirt, I can forbid this combination in libvirt.

> I'll try reproducing upstream and send a BUG report upstream, ccing you. Thanks!

Thank you very much!

Michal
On 14.10.20 16:15, Michal Privoznik wrote:
> On 10/14/20 4:07 PM, David Hildenbrand wrote:
>> Note: prealloc=yes is a bad choice in this environment. It
>> contradicts memory overcommit - what we want to optimize with
>> free page reporting. You allocate all VM memory to throw it away
>> once the guest is up. Is there an option to turn this off with
>> hugetlbfs? I hope so.
>
> Is this something that should be addressed on qemu side? Since I'm
> writing follow-up patches for libvirt I can forbid this combination in
> libvirt.

You mean, bailing out in QEMU if "prealloc=yes" is used along with free page reporting? Hmm, not sure if that's the right approach ... we would also have to check for any DIMMs we hotplug and things like that. It's definitely a setup issue to me (where we could try to warn the user, but QEMU warnings tend to get ignored by everybody ...).

It's also going to be "not what you want" once we have virtio-mem support in libvirt (https://virtio-mem.gitlab.io/). But there, it's easier to check+warn. You only want "prealloc=no" for the memory device assigned to a virtio-mem device, not for all system RAM.

Thanks!

-- 
Thanks,

David / dhildenb