Dear maintainers,

First of all, thank you very much for your review of my patch [1].

In this version (v6), I have attempted to address all the problems raised by Daniel and Paolo during the last review.

However, two open questions remain unanswered that require the attention of an x86 maintainer:

1) Should I move the rapl feature from -kvm to -cpu? [2]

2) Should I already rename to "rapl_vmsr_*" in order to anticipate the future TMPI architecture? [3]

Thank you again for your continued guidance.

v5 -> v6
--------
- Better error consistency in qio_channel_get_peerpid()
- Corrected a memory leak from g_strdup_printf()/g_build_filename()
- Renamed several structs with "vmsr_*" for a better namespace
- Renamed several structs with "guest_*" for better comprehension
- Optimization suggested by Daniel
- Crash problem solved [4]

v4 -> v5
--------
- correct qio_channel_get_peerpid: return pid = -1 in case of error
- Vmsr_helper: compile only for x86
- Vmsr_helper: use qio_channel_read/write_all
- Vmsr_helper: abandon user/group
- Vmsr_energy.c: correct all error_report
- Vmsr thread: compute default socket path only once
- Vmsr thread: open socket only once
- Pass relevant QEMU CI

v3 -> v4
--------
- Correct memory leaks with AddressSanitizer
- Add sanity checks in QEMU and qemu-vmsr-helper that verify the host is Intel and RAPL is active
- Rename poorly named variables for easier comprehension
- Move the code that checks the host before creating the VMSR thread
- Get rid of libnuma: add a function that reads the host topology from sysfs instead

v2 -> v3
--------
- Move all memory allocations from Clib to Glib
- Compile on *BSD (working on Linux only)
- No more limitation on the virtual package: each vCPU that belongs to the same virtual package gives the same results, as expected on a real CPU.
  This has been tested with topologies like:
  -smp 4,sockets=2
  -smp 16,sockets=4,cores=2,threads=2

v1 -> v2
--------
- To overcome CVE-2020-8694, a socket communication to a privileged helper is created
- Add the privileged helper (qemu-vmsr-helper)
- Add SO_PEERCRED in the qio channel socket

RFC -> v1
---------
- Add vmsr_* in front of all vmsr-specific functions
- Change malloc()/calloc()... to their glib equivalents
- Pre-allocate all dynamic memory when possible
- Add documentation of the implementation, limitations and usage

Best regards,
Anthony

[1]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg01570.html
[2]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg03947.html
[3]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02350.html
[4]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02481.html

Anthony Harivel (3):
  qio: add support for SO_PEERCRED for socket channel
  tools: build qemu-vmsr-helper
  Add support for RAPL MSRs in KVM/Qemu

 accel/kvm/kvm-all.c                      |  27 ++
 contrib/systemd/qemu-vmsr-helper.service |  15 +
 contrib/systemd/qemu-vmsr-helper.socket  |   9 +
 docs/specs/index.rst                     |   1 +
 docs/specs/rapl-msr.rst                  | 155 +++++++
 docs/tools/index.rst                     |   1 +
 docs/tools/qemu-vmsr-helper.rst          |  89 ++++
 include/io/channel.h                     |  21 +
 include/sysemu/kvm_int.h                 |  32 ++
 io/channel-socket.c                      |  28 ++
 io/channel.c                             |  13 +
 meson.build                              |   7 +
 target/i386/cpu.h                        |   8 +
 target/i386/kvm/kvm.c                    | 431 +++++++++++++++++-
 target/i386/kvm/meson.build              |   1 +
 target/i386/kvm/vmsr_energy.c            | 337 ++++++++++++++
 target/i386/kvm/vmsr_energy.h            |  99 +++++
 tools/i386/qemu-vmsr-helper.c            | 530 +++++++++++++++++++++++
 tools/i386/rapl-msr-index.h              |  28 ++
 19 files changed, 1831 insertions(+), 1 deletion(-)
 create mode 100644 contrib/systemd/qemu-vmsr-helper.service
 create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
 create mode 100644 docs/specs/rapl-msr.rst
 create mode 100644 docs/tools/qemu-vmsr-helper.rst
 create mode 100644 target/i386/kvm/vmsr_energy.c
 create mode 100644 target/i386/kvm/vmsr_energy.h
 create mode 100644 tools/i386/qemu-vmsr-helper.c
 create mode 100644 tools/i386/rapl-msr-index.h

-- 
2.45.1
On Wed, 22 May 2024 17:34:49 +0200
Anthony Harivel <aharivel@redhat.com> wrote:

> Dear maintainers,
>
> First of all, thank you very much for your review of my patch
> [1].

I've tried to play with this feature and have a few questions about it.

1. trying to start with a non-accessible or non-existent socket
   -accel kvm,rapl=on,rapl-helper-socket=/tmp/socket
   I get:
   qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: vmsr socket opening failed
   qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/socks: kvm : error RAPL feature requirement not met
 * is it possible to report the actual OS error that happened during open/connect,
   instead of the unhelpful 'socket opening failed'?
   What I see in vmsr_open_socket() is that the error is ignored,
   and btw it's an error leak as well
 * the 2nd line shouldn't be there if the 1st error is already present.

2. getting periodic errors on the console where QEMU has been started
   # ./qemu-vmsr-helper -k /tmp/sock
   ./qemu-system-x86_64 -snapshot -m 4G -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock rhel90.img -vnc :0 -cpu host
   and let it run

   it appears rdmsr works (well, it returns some values at least),
   however there are recurring errors in qemu's stderr (or stdout):

   qemu-system-x86_64: Error opening /proc/2496093/task/2496109/stat
   qemu-system-x86_64: Error opening /proc/2496093/task/2496095/stat

   My guess is it's some temporary threads that come and go, but still,
   they shouldn't cause errors if it's normal operation.

   Also on the daemon side, I got a few times while the guest was running:
   qemu-vmsr-helper: Failed to open /proc at /proc/2496026/task/2496044
   qemu-vmsr-helper: Requested TID not in peer PID: 2496026 2496044
   though I can't reproduce it reliably.

3. when starting the daemon not as root, it starts 'fine' but later on complains:
   qemu-vmsr-helper: Failed to open MSR file at /dev/cpu/0/msr
   perhaps it would be better for the daemon to fail at start if it doesn't
   have access to the necessary files.

4. in case #3, the guest also fails to start, with errors:
   qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: can't read any virtual msr
   qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: kvm : error RAPL feature requirement not met
   again, line #2 is not useful and probably not needed (maybe make it a tracepoint),
   and #1 is unhelpful - it would be better if it directed the user to check qemu-vmsr-helper.

5. does AMD have similar MSRs that we could use to make this feature complete?

6. What happens to power accounting if the host constantly migrates vcpus
   between sockets, are the values we are getting still correct/meaningful?
   Or do we need to pin vcpus to get 'accurate' values?

7. do we have to have a dedicated thread for polling data from the daemon?
   Can we fetch data from the vcpu thread that has accessed the msr
   (with some caching and rate limiting of access to the daemon)?

> In this version (v6), I have attempted to address all the problems
> raised by Daniel and Paolo during the last review.
[...]
Hi Igor,

Igor Mammedov, Oct 16, 2024 at 13:52:
> On Wed, 22 May 2024 17:34:49 +0200
> Anthony Harivel <aharivel@redhat.com> wrote:
>
> I've tried to play with this feature and have a few questions about it

Thanks for testing this new feature.

> 1. trying to start with non accessible or not existent socket
[...]
> * is it possible to report actual OS error that happened during open/connect,
>   instead of unhelpful 'socket opening failed'?
>
>   What I see in vmsr_open_socket() error is ignored
>   and btw it's error leak as well

Shame you missed the 6 iterations of that patch that lasted for a year; I would have changed that directly! Anyway, I take note of that comment and will send a modification.

> * 2nd line shouldn't be there if the 1st error already present.
>
> 2. getting periodic error on console where QEMU has been started
[...]
>   My guess it's some temporary threads, that come and go, but still
>   they shouldn't cause errors if it's normal operation.

There is a WIP patch that changes this into a tracepoint. Maybe you were SSHing to the VM in the meantime?
> Also on daemon side, I a few times got while guest was running:
> qemu-vmsr-helper: Failed to open /proc at /proc/2496026/task/2496044
> qemu-vmsr-helper: Requested TID not in peer PID: 2496026 2496044
> though I can't reproduce it reliably

This could only happen when a vCPU thread ID has changed between the rdmsr call through the socket and the helper reading the MSR. No idea how a vCPU can change TID or shut down that fast.

> 3. when starting daemon not as root, it starts 'fine' but later on complains
>   qemu-vmsr-helper: Failed to open MSR file at /dev/cpu/0/msr
>   perhaps it would be better to fail at start if it doesn't have
>   access to necessary files.

Right, taking a note of that as well.

> 4. in case #3, guest also fails to start with errors:
[...]
>   and #1 is unhelpful - it would be better if it directed user to check qemu-vmsr-helper

I will try to see how to improve that part. Thanks for your valuable feedback.

> 5. does AMD have similar MSRs that we could use to make this feature complete?

Yes, but the addresses are completely different. However, this is on my to-do list. First I need way more feedback like yours to move on with extending this feature.

> 6. What happens to power accounting if host constantly migrates
>   vcpus between sockets, are values we are getting still correct/meaningful?
>   Or do we need to pin vcpus to get 'accurate' values?

The ratio calculation takes into account which socket the vCPU has just been scheduled on. But yes, the values are more 'accurate' when the vCPU is pinned.

> 7. do we have to have a dedicated thread for polling data from daemon?
>   Can we fetch data from vcpu thread that have accessed msr
>   (with some caching and rate limiting access to the daemon)?

This feature revolves around a thread. Please look at the documentation, if not already done:

https://www.qemu.org/docs/master/specs/rapl-msr.html#high-level-implementation

If we only fetch from the vCPU threads, we won't have the consumption of the non-vCPU threads. They are taken into account in the total.

Thanks again for your feedback.

Anthony
On Wed, 16 Oct 2024 14:56:39 +0200
"Anthony Harivel" <aharivel@redhat.com> wrote:

[...]
> > 2. getting periodic error on console where QEMU has been started
[...]
> > My guess it's some temporary threads, that come and go, but still
> > they shouldn't cause errors if it's normal operation.
> There is a WIP patch that changes this into a tracepoint. Maybe you
> were SSHing to the VM in the meantime?

it's just an idling VM that doesn't do anything, hence the question.

> > Also on daemon side, I a few times got while guest was running:
> > qemu-vmsr-helper: Failed to open /proc at /proc/2496026/task/2496044
> > qemu-vmsr-helper: Requested TID not in peer PID: 2496026 2496044
> > though I can't reproduce it reliably
>
> This could only happen when a vCPU thread ID has changed between the
> rdmsr call through the socket and the helper reading the MSR.
> No idea how a vCPU can change TID or shut down that fast.

I guess it needs to be figured out, to decide whether it's safe to ignore (and not print an error) or whether it's a genuine error/bug somewhere.

> > 3. when starting daemon not as root, it starts 'fine' but later on complains
[...]
> Right, taking a note of that as well.

> > 4. in case #3, guest also fails to start with errors:
[...]
> I will try to see how to improve that part.
> Thanks for your valuable feedback.

> > 5. does AMD have similar MSRs that we could use to make this feature complete?
>
> Yes, but the addresses are completely different. However, this is on my
> to-do list. First I need way more feedback like yours to move on with
> extending this feature.

If adding AMD's MSRs is not difficult, then I'd make it a priority. This way users (and libvirt) won't have to deal with 2 different feature sets and decide when to allow this to be turned on depending on the host.

> > 6. What happens to power accounting if host constantly migrates
> > vcpus between sockets, are values we are getting still correct/meaningful?
> > Or do we need to pin vcpus to get 'accurate' values?
>
> The ratio calculation takes into account which socket the vCPU has just
> been scheduled on. But yes, the values are more 'accurate' when the
> vCPU is pinned.

In the worst case, vCPUs might be moved between sockets many times during the sample window; can you explain how that is accounted for?

Anyways, it would be better to have some numbers in the doc that clarify what kind of accuracy we are talking about (for example pinned vs unpinned), or whether the unpinned case measures the average temperature of patients in a hospital and we should recommend pinning vcpus and everything else.

Also, actual use-case examples for the feature should be mentioned in the doc, so users could figure out when they need to enable this feature (with attached accuracy numbers). Aka how this new feature is good for end users and what they can do with it.

> > 7. do we have to have a dedicated thread for polling data from daemon?
> > Can we fetch data from vcpu thread that have accessed msr
> > (with some caching and rate limiting access to the daemon)?
>
> This feature revolves around a thread. Please look at the
> documentation, if not already done:
>
> https://www.qemu.org/docs/master/specs/rapl-msr.html#high-level-implementation
>
> If we only fetch from the vCPU threads, we won't have the consumption of
> the non-vCPU threads. They are taken into account in the total.

One can collect the same data from a vcpu thread as well; the bonus part is that we don't have an extra thread hanging around and doing work even if the guest never asks for those MSRs.

This also leads to the question of whether we should account for non-vCPU threads at all. Looking at real hardware, those MSRs return the power usage of CPUs only, and they do not return consumption from auxiliary system components (io/memory/...). One can consider non-vCPU threads in QEMU as auxiliary components, so we probably should not account for them at all when modeling the same hw feature (aka be consistent with what real hw does).
Igor Mammedov, Oct 18, 2024 at 14:25:
> On Wed, 16 Oct 2024 14:56:39 +0200
> "Anthony Harivel" <aharivel@redhat.com> wrote:

[...]
in case #3, guest also fails to start with errors: >> > qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: can't read any virtual msr >> > qemu-system-x86_64: -accel kvm,rapl=on,rapl-helper-socket=/tmp/sock: kvm : error RAPL feature requirement not met >> > again line #2 is not useful and probably not needed (maybe make it tracepoint) >> > and #1 is unhelpful - it would be better if it directed user to check qemu-vmsr-helper >> > >> >> I will try to see how to improve that part. >> Thanks for your valuable feedback. >> >> > 5. does AMD have similar MSRs that we could use to make this feature complete? >> > >> >> Yes but the address are completely different. However, this in my ToDo >> list. First I need way more feedback like yours to move on extending >> this feature. > > If adding AMD's MSRs is not difficult, then I'd make it priority. > This way users (and libvirt) won't have to deal with 2 different > feature-sets and decide when to allow this to be turned on depending on host. > QEMU needs to know if it runs on Intel or AMD machine in order to choose which set of MSR it must read. I did not check how to achieve this at the moment but I will when I will work on that. >> >> > 6. What happens to power accounting if host constantly migrates >> > vcpus between sockets, are values we are getting still correct/meaningful? >> > Or do we need to pin vcpus to get 'accurate' values? >> > >> >> It's taken into account during the ratio calculation which socket the >> vCPU has just been scheduled. But yes the value are more 'accurate' when >> the vCPU is pinned. > > in worst case VCPUs might be moved between sockets many times during > sample window, can you explain how that is accounted for? > If one vCPU is moving socket during the sample period then it is detected and not taken into account. 
That said, if your system is bouncing vCPUs back and forth between sockets then you will experience a lot of cache misses, CPU cache thrashing, context switches, increased memory latency (NUMA issues), etc. This will lead to performance degradation and very poor VM performance. In that case you should probably fix the configuration. > Anyways, it would be better to have some numbers in doc that would > clarify what kind of accuracy we are talking about (and example > pinned vs unpinned), or whether unpinned case measures average > temperature of patients in hospital and we should recommend > to pin vcpus and everything else. > I totally understand that; I can add more clarification in the documentation for things that might be obvious for some but not for others, e.g. that isolating your VM properly will give better results. But I won't give any numbers; it doesn't make sense. Accuracy is not the goal of this feature, it never was and it never will be. First of all because RAPL is not accurate for power monitoring. You want accuracy? Use a power metering device. You want a reproducible way to compare power consumption between A and B in order to optimize your software? You can use RAPL, and thus this feature, which shows good reproducible results. > Also actual usecase examples for the feature should be mentioned > in the doc. So users could figure out when they need to enable > this feature (with attached accuracy numbers). Aka how this > new feature is good for end users and what they can do with it. > Got it. More documentation, use cases, examples. I will see what can be added to QEMU documentation. >> > 7. do we have to have a dedicated thread for pooling data from daemon? >> > >> > Can we fetch data from vcpu thread that have accessed msr >> > (with some caching and rate limiting access to the daemon)? >> > >> >> This feature is revolving around a thread.
Please look at the >> documentation is not already done: >> >> https://www.qemu.org/docs/master/specs/rapl-msr.html#high-level-implementation >> >> If we only fetch from vCPU thread, we won't have the consumption of the >> non-vcpu thread. They are taken into account in the total. > > one can collect the same data from vcpu thread as well, > the bonus part is that we don't have an extra thread > hanging around and doing work even if guest never asks > for those MSRs. > > This also leads to a question, if we should account for > not VCPU threads at all. Looking at real hardware, those > MSRs return power usage of CPUs only, and they do not > return consumption from auxiliary system components > (io/memory/...). One can consider non VCPU threads in QEMU > as auxiliary components, so we probably should not to > account for them at all when modeling the same hw feature. > (aka be consistent with what real hw does). > >> Thanks again for your feedback. >> >> Anthony >> >> >> >> In this version (v6), I have attempted to address all the problems >> >> addressed by Daniel and Paolo during the last review. >> >> >> >> However, two open questions remains unanswered that would require the >> >> attention of a x86 maintainers: >> >> >> >> 1)Should I move from -kvm to -cpu the rapl feature ? [2] >> >> >> >> 2)Should I already rename to "rapl_vmsr_*" in order to anticipate the >> >> futur TMPI architecture ? [end of 3] >> >> >> >> Thank you again for your continued guidance. 
>> >> [...] >> >>
On Tue, 22 Oct 2024 15:49:33 +0200 "Anthony Harivel" <aharivel@redhat.com> wrote: > Igor Mammedov, Oct 18, 2024 at 14:25: > > On Wed, 16 Oct 2024 14:56:39 +0200 > > "Anthony Harivel" <aharivel@redhat.com> wrote: [...] > >> > 5. does AMD have similar MSRs that we could use to make this feature complete? > >> > > >> > >> Yes but the address are completely different. However, this in my ToDo > >> list. First I need way more feedback like yours to move on extending > >> this feature. > > > > If adding AMD's MSRs is not difficult, then I'd make it priority. > > This way users (and libvirt) won't have to deal with 2 different > > feature-sets and decide when to allow this to be turned on depending on host. > > > > QEMU needs to know if it runs on Intel or AMD machine in order to choose > which set of MSR it must read. I did not check how to achieve this at the > moment but I will when I will work on that. If we talk to the daemon in terms of power per package, etc., we won't care which MSRs go over the wire between the daemon and QEMU. Then QEMU can map that to the relevant MSRs (based on the CPU model) internally. > > >> > >> > 6. What happens to power accounting if host constantly migrates > >> > vcpus between sockets, are values we are getting still correct/meaningful? > >> > Or do we need to pin vcpus to get 'accurate' values? > >> > > >> > >> It's taken into account during the ratio calculation which socket the > >> vCPU has just been scheduled. But yes the value are more 'accurate' when > >> the vCPU is pinned. > > > > in worst case VCPUs might be moved between sockets many times during > > sample window, can you explain how that is accounted for?
> > > > If one vCPU is moving socket during the sample period then it is > detected and not taken into account. > > That said, if your system is bouncing vCPU back and forth between socket > then you will experience a lot of caches misses, cpu caches trashes, > context switches, increase of memory latency (numa issues), etc. This > will lead to performance degradation and VM performance being very poor. > Then you should probably fix it. yep, it's bad config, but typical for overcommit scenario. if we can't get correct measurement in this case, then at least printing a warning once could be nice. (it would be better to refuse starting without vCPU pinning, but given pinning isn't done by QEMU, I don't see a way to do that.) > > Anyways, it would be better to have some numbers in doc that would > > clarify what kind of accuracy we are talking about (and example > > pinned vs unpinned), or whether unpinned case measures average > > temperature of patients in hospital and we should recommend > > to pin vcpus and everything else. > > > > I totally understand that I can add more clarification in the > documentation that might be obvious for some but not for other. Like > isolating your VM properly will give better result. > > But I won't give any number. It doesn't make sens. Accuracy is not the > goal of this feature, it never was and it never will. First of all > because RAPL is not accurate for power monitoring. You want accuracy? > Use a Power Metering device. > You want a reproducible way to compare power energy between > A and B in order to optimize your software ? Use can use RAPL and so > this feature that shows good reproducible results. > > > Also actual usecase examples for the feature should be mentioned > > in the doc. So users could figure out when they need to enable > > this feature (with attached accuracy numbers). Aka how this > > new feature is good for end users and what they can do with it. > > > > Got it. More documentation, use case, examples. 
> I will see what can be added to QEMU documentation. > > > >> > 7. do we have to have a dedicated thread for pooling data from daemon? > >> > > >> > Can we fetch data from vcpu thread that have accessed msr > >> > (with some caching and rate limiting access to the daemon)? > >> > > >> > >> This feature is revolving around a thread. Please look at the > >> documentation is not already done: > >> > >> https://www.qemu.org/docs/master/specs/rapl-msr.html#high-level-implementation > >> > >> If we only fetch from vCPU thread, we won't have the consumption of the > >> non-vcpu thread. They are taken into account in the total. > > > > one can collect the same data from vcpu thread as well, > > the bonus part is that we don't have an extra thread > > hanging around and doing work even if guest never asks > > for those MSRs. > > > > This also leads to a question, if we should account for > > not VCPU threads at all. Looking at real hardware, those > > MSRs return power usage of CPUs only, and they do not > > return consumption from auxiliary system components > > (io/memory/...). One can consider non VCPU threads in QEMU > > as auxiliary components, so we probably should not to > > account for them at all when modeling the same hw feature. > > (aka be consistent with what real hw does). > > > >> Thanks again for your feedback. > >> > >> Anthony > >> > >> > >> >> In this version (v6), I have attempted to address all the problems > >> >> addressed by Daniel and Paolo during the last review. > >> >> > >> >> However, two open questions remains unanswered that would require the > >> >> attention of a x86 maintainers: > >> >> > >> >> 1)Should I move from -kvm to -cpu the rapl feature ? [2] > >> >> > >> >> 2)Should I already rename to "rapl_vmsr_*" in order to anticipate the > >> >> futur TMPI architecture ? [end of 3] > >> >> > >> >> Thank you again for your continued guidance. 
> >> >> [...] > >> >
On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote: > On Wed, 16 Oct 2024 14:56:39 +0200 > "Anthony Harivel" <aharivel@redhat.com> wrote: [...] > > > 7. do we have to have a dedicated thread for pooling data from daemon? > > > > > > Can we fetch data from vcpu thread that have accessed msr > > > (with some caching and rate limiting access to the daemon)? > > > > > > > This feature is revolving around a thread. Please look at the > > documentation is not already done: > > > > https://www.qemu.org/docs/master/specs/rapl-msr.html#high-level-implementation > > > > If we only fetch from vCPU thread, we won't have the consumption of the > > non-vcpu thread. They are taken into account in the total. > > one can collect the same data from vcpu thread as well, > the bonus part is that we don't have an extra thread > hanging around and doing work even if guest never asks > for those MSRs. > > This also leads to a question, if we should account for > not VCPU threads at all. Looking at real hardware, those > MSRs return power usage of CPUs only, and they do not > return consumption from auxiliary system components > (io/memory/...). One can consider non VCPU threads in QEMU > as auxiliary components, so we probably should not to > account for them at all when modeling the same hw feature. > (aka be consistent with what real hw does). I understand your POV, but I think that would be a mistake, and it would undermine the usefulness of the feature. The deployment model has a cluster of hosts and guests, all belonging to the same user. The user's goal is to measure the host power consumption imposed by the guest, and dynamically adjust guest workloads in order to minimize the power consumption of the host. The guest workloads can impose non-negligible power consumption loads on non-vCPU threads in QEMU. Without that accounted for, any adjustments will be working from (sometimes very) inaccurate data.
IOW, I think it is right to include non-vCPU threads usage in the reported info, as it is still fundamentally part of the load that the guest imposes on host pCPUs it is permitted to run on. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Fri, 18 Oct 2024 13:59:34 +0100
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
> > On Wed, 16 Oct 2024 14:56:39 +0200
> > "Anthony Harivel" <aharivel@redhat.com> wrote:
[...]
> >
> > This also leads to a question, if we should account for
> > not VCPU threads at all. Looking at real hardware, those
> > MSRs return power usage of CPUs only, and they do not
> > return consumption from auxiliary system components
> > (io/memory/...). One can consider non VCPU threads in QEMU
> > as auxiliary components, so we probably should not to
> > account for them at all when modeling the same hw feature.
> > (aka be consistent with what real hw does).
>
> I understand your POV, but I think that would be a mistake,
> and would undermine the usefulness of the feature.
>
> The deployment model has a cluster of hosts and guests, all
> belonging to the same user. The user goal is to measure host
> power consumption imposed by the guest, and dynamically adjust
> guest workloads in order to minimize power consumption of the
> host.

For cloud use-case, host side is likely in a better position
to accomplish the task of saving power by migrating VM to
another socket/host to compact idle load. (I've found at least 1
kubernetes tool[1], which does energy monitoring). Perhaps there
are schedulers out there that do that using its data.

> The guest workloads can impose non-negligible power consumption
> loads on non-vCPU threads in QEMU. Without that accounted for,
> any adjustments will be working from (sometimes very) inaccurate
> data.

Perhaps adding one or several energy sensors (ex: some i2c ones)
would let us provide auxiliary threads consumption to guest, and
even make it more granular if necessary (incl. vhost user/out of
process device models or pass-through devices if they have PMU).
It would be better than further muddling vCPUs consumption
estimates with something that doesn't belong there.
> IOW, I think it is right to include non-vCPU threads usage in
> the reported info, as it is still fundamentally part of the
> load that the guest imposes on host pCPUs it is permitted to
> run on.

From what I've read, process energy usage done via RAPL is not
exactly accurate. But there are monitoring tools out there that
use RAPL and other sources to make energy consumption monitoring
more reliable.

Reinventing that wheel and pulling all of the nuances of process
power monitoring inside of QEMU process, needlessly complicates it.
Maybe we should reuse one of existing tools and channel its data
through appropriate QEMU channels (RAPL/emulated PMU counters/...).

Implementing RAPL in pure form though looks fine to me,
so the same tools could use it the same way as on the host
if needed without VM specific quirks.

1) https://github.com/sustainable-computing-io/kepler

> With regards,
> Daniel
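[For reference, the "pure form" under discussion boils down to two MSRs and some fixed-point arithmetic (per the Intel SDM: MSR_RAPL_POWER_UNIT at 0x606, whose bits 12:8 give the energy unit as 1/2^ESU joules, and MSR_PKG_ENERGY_STATUS at 0x611, a free-running 32-bit counter in those units). Reading the MSRs needs root and the msr driver, or the privileged helper this series adds, so only the decoding step is sketched below:]

```c
#include <stdint.h>
#include <assert.h>

/* Extract the Energy Status Units (ESU) field, bits 12:8 of
 * MSR_RAPL_POWER_UNIT; the energy unit is 1/2^ESU joules. */
static inline unsigned rapl_energy_unit_shift(uint64_t power_unit_msr)
{
    return (power_unit_msr >> 8) & 0x1f;
}

/* Energy consumed between two samples of MSR_PKG_ENERGY_STATUS, in
 * microjoules.  The counter is 32 bits wide and wraps, so the delta
 * is taken modulo 2^32 (unsigned subtraction) before scaling. */
uint64_t rapl_delta_uj(uint32_t before, uint32_t after, unsigned esu)
{
    uint32_t ticks = after - before;   /* wraps correctly mod 2^32 */
    return ((uint64_t)ticks * 1000000) >> esu;
}
```

[With the common ESU of 16, one counter tick is about 15 µJ, and 65536 ticks decode to exactly 1 J; the unsigned subtraction keeps deltas correct across counter wraparound.]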
On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:
> On Fri, 18 Oct 2024 13:59:34 +0100
> Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
> > > On Wed, 16 Oct 2024 14:56:39 +0200
> > > "Anthony Harivel" <aharivel@redhat.com> wrote:
> [...]
> > >
> > > This also leads to a question, if we should account for
> > > not VCPU threads at all. Looking at real hardware, those
> > > MSRs return power usage of CPUs only, and they do not
> > > return consumption from auxiliary system components
> > > (io/memory/...). One can consider non VCPU threads in QEMU
> > > as auxiliary components, so we probably should not to
> > > account for them at all when modeling the same hw feature.
> > > (aka be consistent with what real hw does).
> >
> > I understand your POV, but I think that would be a mistake,
> > and would undermine the usefulness of the feature.
> >
> > The deployment model has a cluster of hosts and guests, all
> > belonging to the same user. The user goal is to measure host
> > power consumption imposed by the guest, and dynamically adjust
> > guest workloads in order to minimize power consumption of the
> > host.
>
> For cloud use-case, host side is likely in a better position
> to accomplish the task of saving power by migrating VM to
> another socket/host to compact idle load. (I've found at least 1
> kubernetes tool[1], which does energy monitoring). Perhaps there
> are schedulers out there that do that using its data.

The host admin can merely shuffle workloads around, hoping that
a different packing of workloads onto machines will reduce power
in some amount. You might win a few %, or low 10s of % with this
if you're good at it.

The guest admin can change the way their workload operates to
reduce its inherent power consumption baseline. You could easily
come across ways to win high 10s of % with this. That's why it
is interesting to expose power consumption info to the guest
admin.
IOW, neither makes the other obsolete, both approaches are
desirable.

> > The guest workloads can impose non-negligible power consumption
> > loads on non-vCPU threads in QEMU. Without that accounted for,
> > any adjustments will be working from (sometimes very) inaccurate
> > data.
> >
> > Perhaps adding one or several energy sensors (ex: some i2c ones)
> > would let us provide auxiliary threads consumption to guest, and
> > even make it more granular if necessary (incl. vhost user/out of
> > process device models or pass-through devices if they have PMU).
> > It would be better than further muddling vCPUs consumption
> > estimates with something that doesn't belong there.

There's a tradeoff here in that info directly associated with
backend threads is effectively exposing private QEMU impl
details as public ABI. IOW, we don't want too fine granularity
here; we need it abstracted sufficiently that different
backend choices for a given device don't change what sensors are
exposed.

I also wonder how existing power monitoring applications
would consume such custom sensors - is there sufficient
standardization in this area that we're not inventing
something totally QEMU specific ?

> > IOW, I think it is right to include non-vCPU threads usage in
> > the reported info, as it is still fundamentally part of the
> > load that the guest imposes on host pCPUs it is permitted to
> > run on.
>
> From what I've read, process energy usage done via RAPL is not
> exactly accurate. But there are monitoring tools out there that
> use RAPL and other sources to make energy consumption monitoring
> more reliable.
>
> Reinventing that wheel and pulling all of the nuances of process
> power monitoring inside of QEMU process, needlessly complicates it.
> Maybe we should reuse one of existing tools and channel its data
> through appropriate QEMU channels (RAPL/emulated PMU counters/...).

Note, this feature is already released in QEMU 9.1.0.
> > Implementing RAPL in pure form though looks fine to me,
> > so the same tools could use it the same way as on the host
> > if needed without VM specific quirks.

IMHO the so called "pure" form is misleading to applications, unless
we first provided some other practical way to expose the data that
we would be throwing away from RAPL.

With regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Tue, 22 Oct 2024 14:15:44 +0100 Daniel P. Berrangé <berrange@redhat.com> wrote: > On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote: > > On Fri, 18 Oct 2024 13:59:34 +0100 > > Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote: > > > > On Wed, 16 Oct 2024 14:56:39 +0200 > > > > "Anthony Harivel" <aharivel@redhat.com> wrote: > > [...] > > > > > > > > > > This also leads to a question, if we should account for > > > > not VCPU threads at all. Looking at real hardware, those > > > > MSRs return power usage of CPUs only, and they do not > > > > return consumption from auxiliary system components > > > > (io/memory/...). One can consider non VCPU threads in QEMU > > > > as auxiliary components, so we probably should not to > > > > account for them at all when modeling the same hw feature. > > > > (aka be consistent with what real hw does). > > > > > > I understand your POV, but I think that would be a mistake, > > > and would undermine the usefulness of the feature. > > > > > > The deployment model has a cluster of hosts and guests, all > > > belonging to the same user. The user goal is to measure host > > > power consumption imposed by the guest, and dynamically adjust > > > guest workloads in order to minimize power consumption of the > > > host. > > > > For cloud use-case, host side is likely in a better position > > to accomplish the task of saving power by migrating VM to > > another socket/host to compact idle load. (I've found at least 1 > > kubernetis tool[1], which does energy monitoring). Perhaps there > > are schedulers out there that do that using its data. > > The host admin can merely shuffle workloads around, hoping that > a different packing of workloads onto machines, will reduce power > in some aount. You might win a few %, or low 10s of % with this > if you're good at it. 
package level savings probably won't make much of a dent (older hw,
less impact), but if one would think about vacating/powering down a
host it's a bit different story (it was in my home lab case - trying
to minimize idle consumption of 24/7 systems). But even with that,
when switching to newer hardware it might come to the point of
diminishing returns eventually.

> The guest admin can change the way their workload operates to
> reduce its inherent power consumption baseline. You could easily
> come across ways to win high 10s of % with this. That's why it
> is interesting to expose power consumption info to the guest
> admin.

Looking at discussions around Intel's hybrid CPUs, I got an
impression that neither userspace nor the kernel has enough energy
consumption info to make decent scheduling decisions, and no _one
really wishes to do scheduling manually_ to begin with. That's where
Intel's CPUs with Thread Director come into the picture, to help the
kernel somehow bin tasks based on efficiency figures (since the CPU
knows exactly how much resources it is using). But that's relatively
new, and whether such cpus will stick or not is still an open
question (it makes sense for the mobile market, but for other
applications I'd guess time will show).

> IOW, neither makes the other obsolete, both approaches are
> desirable.

no argument here.

> > > The guest workloads can impose non-negligible power consumption
> > > loads on non-vCPU threads in QEMU. Without that accounted for,
> > > any adjustments will be working from (sometimes very) inaccurate
> > > data.
> >
> > Perhaps adding one or several energy sensors (ex: some i2c ones)
> > would let us provide auxiliary threads consumption to guest, and
> > even make it more granular if necessary (incl. vhost user/out of
> > process device models or pass-through devices if they have PMU).
> > It would be better than further muddling vCPUs consumption
> > estimates with something that doesn't belong there.
>
> There's a tradeoff here in that info directly associated with
> backend threads is effectively exposing private QEMU impl
> details as public ABI. IOW, we don't want too fine granularity
> here; we need it abstracted sufficiently that different
> backend choices for a given device don't change what sensors are
> exposed.
>
> I also wonder how existing power monitoring applications
> would consume such custom sensors - is there sufficient
> standardization in this area that we're not inventing
> something totally QEMU specific ?

we can expose them as ACPI power meter devices, to make it abstract
for the guest OS (i.e. guest would need only a standard driver for
it), or alternatively model some real i2c sensors. But yes, it's
something that should be explored, so it would work with/be supported
by common tools or the tool of choice.

> > > IOW, I think it is right to include non-vCPU threads usage in
> > > the reported info, as it is still fundamentally part of the
> > > load that the guest imposes on host pCPUs it is permitted to
> > > run on.
> >
> > From what I've read, process energy usage done via RAPL is not
> > exactly accurate. But there are monitoring tools out there that
> > use RAPL and other sources to make energy consumption monitoring
> > more reliable.
> >
> > Reinventing that wheel and pulling all of the nuances of process
> > power monitoring inside of QEMU process, needlessly complicates it.
> > Maybe we should reuse one of existing tools and channel its data
> > through appropriate QEMU channels (RAPL/emulated PMU counters/...).
>
> Note, this feature is already released in QEMU 9.1.0.

that doesn't preclude us from improving impl. details /i.e. what
tasks qemu does and what is up to the backend (external daemon)/
though. Incl. changing the backend if that would do a better job
in the end (with the benefit that it's mostly maintained by another
project).
> > Implementing RAPL in pure form though looks fine to me,
> > so the same tools could use it the same way as on the host
> > if needed without VM specific quirks.
>
> IMHO the so called "pure" form is misleading to applications, unless
> we first provided some other practical way to expose the data that
> we would be throwing away from RAPL.

I'm not arguing that the data should be thrown away, just that we
should provide it some other way instead of the vCPU RAPL interface,
and not confuse the host's pCPUs with vCPUs.

PS:
Taking the example above, that aux threads are inherent pCPU load,
and stretching it to the host side: one can equally say a pCPU
inherently incurs power draw on other system components with some
workloads, so RAPL MSRs should include that load as well. But yep,
at this point it turns into pointless bike-shedding.

PS2:
in a nutshell, my questions are:
 * should we expose aux threads as another power meter device
 * would it be better to reuse/integrate with existing (hopefully
   mature) projects for monitoring on the host side, instead of
   duplicating a subset of their capabilities in a QEMU specific
   helper and then maintaining it.

> With regards,
> Daniel
Daniel P. Berrangé, Oct 22, 2024 at 15:15:
> On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote:
>> On Fri, 18 Oct 2024 13:59:34 +0100
>> Daniel P. Berrangé <berrange@redhat.com> wrote:
>>
>> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote:
>> > > On Wed, 16 Oct 2024 14:56:39 +0200
>> > > "Anthony Harivel" <aharivel@redhat.com> wrote:
>> [...]
>>
>> > >
>> > > This also leads to a question, if we should account for
>> > > not VCPU threads at all. Looking at real hardware, those
>> > > MSRs return power usage of CPUs only, and they do not
>> > > return consumption from auxiliary system components
>> > > (io/memory/...). One can consider non VCPU threads in QEMU
>> > > as auxiliary components, so we probably should not to
>> > > account for them at all when modeling the same hw feature.
>> > > (aka be consistent with what real hw does).
>> >
>> > I understand your POV, but I think that would be a mistake,
>> > and would undermine the usefulness of the feature.
>> >
>> > The deployment model has a cluster of hosts and guests, all
>> > belonging to the same user. The user goal is to measure host
>> > power consumption imposed by the guest, and dynamically adjust
>> > guest workloads in order to minimize power consumption of the
>> > host.
>>
>> For cloud use-case, host side is likely in a better position
>> to accomplish the task of saving power by migrating VM to
>> another socket/host to compact idle load. (I've found at least 1
>> kubernetes tool[1], which does energy monitoring). Perhaps there
>> are schedulers out there that do that using its data.

I also work on the Kepler project. I use it to monitor my VM as a
black box, and I use it inside my VM with this feature enabled.
Thanks to that I can optimize the workloads (dpdk application,
database, ..) inside my VM.
This is the use-case in NFV deployment, and I'm pretty sure this
could be the use-case of many others.

> The host admin can merely shuffle workloads around, hoping that
> a different packing of workloads onto machines will reduce power
> in some amount. You might win a few %, or low 10s of % with this
> if you're good at it.
>
> The guest admin can change the way their workload operates to
> reduce its inherent power consumption baseline. You could easily
> come across ways to win high 10s of % with this. That's why it
> is interesting to expose power consumption info to the guest
> admin.
>
> IOW, neither makes the other obsolete, both approaches are
> desirable.
>
>> > The guest workloads can impose non-negligible power consumption
>> > loads on non-vCPU threads in QEMU. Without that accounted for,
>> > any adjustments will be working from (sometimes very) inaccurate
>> > data.
>>
>> Perhaps adding one or several energy sensors (ex: some i2c ones)
>> would let us provide auxiliary threads consumption to guest, and
>> even make it more granular if necessary (incl. vhost user/out of
>> process device models or pass-through devices if they have PMU).
>> It would be better than further muddling vCPUs consumption
>> estimates with something that doesn't belong there.

I'm confused about your statement. Every software power metering
tool out there is using RAPL (Kepler, Scaphandre, PowerMon, etc),
and custom sensors would be better than what everyone is using?

The goal is not to be accurate. The goal is to be able to compare
A against B in the same environment, and RAPL gives reproducible
values to do so.

Adding RAPL inside the VM makes total sense because you can use
tools that are already out in the market.

> There's a tradeoff here in that info directly associated with
> backend threads is effectively exposing private QEMU impl
> details as public ABI. IOW, we don't want too fine granularity
> here; we need it abstracted sufficiently that different
> backend choices for a given device don't change what sensors are
> exposed.
>
> I also wonder how existing power monitoring applications
> would consume such custom sensors - is there sufficient
> standardization in this area that we're not inventing
> something totally QEMU specific ?
>
>> > IOW, I think it is right to include non-vCPU threads usage in
>> > the reported info, as it is still fundamentally part of the
>> > load that the guest imposes on host pCPUs it is permitted to
>> > run on.
>>
>> From what I've read, process energy usage done via RAPL is not
>> exactly accurate. But there are monitoring tools out there that
>> use RAPL and other sources to make energy consumption monitoring
>> more reliable.
>>
>> Reinventing that wheel and pulling all of the nuances of process
>> power monitoring inside of QEMU process, needlessly complicates it.
>> Maybe we should reuse one of existing tools and channel its data
>> through appropriate QEMU channels (RAPL/emulated PMU counters/...).
>
> Note, this feature is already released in QEMU 9.1.0.
>
>> Implementing RAPL in pure form though looks fine to me,
>> so the same tools could use it the same way as on the host
>> if needed without VM specific quirks.
>
> IMHO the so called "pure" form is misleading to applications, unless
> we first provided some other practical way to expose the data that
> we would be throwing away from RAPL.

The other possibility that I've thought of is using a 3rd party tool
to give maybe more "accurate" values to QEMU. For example, Kepler
could be used to give a value for each thread of QEMU, and so,
instead of calculating them with qemu-vmsr-helper, each value would
be transferred on request by QEMU via the UNIX socket that is used
today between the daemon and QEMU. It's just an idea that I have,
and I don't know if that is acceptable for each project (QEMU and
Kepler), but it would really solve a few issues.
> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Tue, 22 Oct 2024 16:16:36 +0200 "Anthony Harivel" <aharivel@redhat.com> wrote: > Daniel P. Berrangé, Oct 22, 2024 at 15:15: > > On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote: > >> On Fri, 18 Oct 2024 13:59:34 +0100 > >> Daniel P. Berrangé <berrange@redhat.com> wrote: > >> > >> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote: > >> > > On Wed, 16 Oct 2024 14:56:39 +0200 > >> > > "Anthony Harivel" <aharivel@redhat.com> wrote: > >> [...] > >> > >> > > > >> > > This also leads to a question, if we should account for > >> > > not VCPU threads at all. Looking at real hardware, those > >> > > MSRs return power usage of CPUs only, and they do not > >> > > return consumption from auxiliary system components > >> > > (io/memory/...). One can consider non VCPU threads in QEMU > >> > > as auxiliary components, so we probably should not to > >> > > account for them at all when modeling the same hw feature. > >> > > (aka be consistent with what real hw does). > >> > > >> > I understand your POV, but I think that would be a mistake, > >> > and would undermine the usefulness of the feature. > >> > > >> > The deployment model has a cluster of hosts and guests, all > >> > belonging to the same user. The user goal is to measure host > >> > power consumption imposed by the guest, and dynamically adjust > >> > guest workloads in order to minimize power consumption of the > >> > host. > >> > >> For cloud use-case, host side is likely in a better position > >> to accomplish the task of saving power by migrating VM to > >> another socket/host to compact idle load. (I've found at least 1 > >> kubernetis tool[1], which does energy monitoring). Perhaps there > >> are schedulers out there that do that using its data. > > I also work for Kepler project. I use it to monitor my VM has a black > box and I used it inside my VM with this feature enable. Thanks to that > I can optimize the workloads (dpdk application,database,..) inside my VM. 
> > This is the use-case in NFV deployment and I'm pretty sure this could be > the use-case of many others. > > > > > The host admin can merely shuffle workloads around, hoping that > > a different packing of workloads onto machines, will reduce power > > in some aount. You might win a few %, or low 10s of % with this > > if you're good at it. > > > > The guest admin can change the way their workload operates to > > reduce its inherant power consumption baseline. You could easily > > come across ways to win high 10s of % with this. That's why it > > is interesting to expose power consumption info to the guest > > admin. > > > > IOW, neither makes the other obsolete, both approaches are > > desirable. > > > >> > The guest workloads can impose non-negligble power consumption > >> > loads on non-vCPU threads in QEMU. Without that accounted for, > >> > any adjustments will be working from (sometimes very) inaccurate > >> > data. > >> > >> Perhaps adding one or several energy sensors (ex: some i2c ones), > >> would let us provide auxiliary threads consumption to guest, and > >> even make it more granular if necessary (incl. vhost user/out of > >> process device models or pass-through devices if they have PMU). > >> It would be better than further muddling vCPUs consumption > >> estimates with something that doesn't belong there. > > I'm confused about your statement. Like every software power metering > tools out is using RAPL (Kepler, Scaphandre, PowerMon, etc) and custom > sensors would be better than a what everyone is using ? RAPL is used to measure CPU/DRAM/maybe GPU domains. see my other reply to Daniel RAPL + aux (https://www.mail-archive.com/qemu-devel@nongnu.org/msg1072593.html) My point wrt RAPL is: CPU domain on host and inside guest should be doing the same thing, i.e. report only package/core consumption of virtual CPU and nothing else (non vCPU induced load should not be included in CPU domain). 
For non vCPU consumption, we should do the same as bare-metal,
i.e. add power sensors where necessary. As a minimum we can add
a system power meter sensor, which could account for total
energy draw (and that can include not only QEMU aux threads,
but also other related processes (aka the process handling a dpdk
NIC, or another vhost user backend)).
Individual per device sensors are also a possibility in the future
(i.e. per NIC), if we can find a suitable sensor on the host to
derive the guest value from.

[...]

> Adding RAPL inside VM makes total sense because you can use tools
> that are already out in the market.

no disagreement here.

Given the topic is relatively new, the tooling mostly concentrates
on RAPL as the most available sensor. But some tools can pull energy
values from other sources; we surely can teach them to pull values
from a sensor(s) we'd want to add to QEMU (i.e. for an easy start,
borrow sensor handling from lm_sensors). I'd pick the acpi power
meter as a possible candidate, for it is guest OS agnostic and
we can attach it to anything in the machine tree.

> > There's a tradeoff here in that info directly associated with
> > backend threads is effectively exposing private QEMU impl
> > details as public ABI. IOW, we don't want too fine granularity
> > here; we need it abstracted sufficiently that different
> > backend choices for a given device don't change what sensors are
> > exposed.
> >
> > I also wonder how existing power monitoring applications
> > would consume such custom sensors - is there sufficient
> > standardization in this area that we're not inventing
> > something totally QEMU specific ?
> >
> >> > IOW, I think it is right to include non-vCPU threads usage in
> >> > the reported info, as it is still fundamentally part of the
> >> > load that the guest imposes on host pCPUs it is permitted to
> >> > run on.
> >>
> >> From what I've read, process energy usage done via RAPL is not
> >> exactly accurate.
> >> But there are monitoring tools out there that
> >> use RAPL and other sources to make energy consumption monitoring
> >> more reliable.
> >>
> >> Reinventing that wheel and pulling all of the nuances of process
> >> power monitoring inside of QEMU process, needlessly complicates it.
> >> Maybe we should reuse one of existing tools and channel its data
> >> through appropriate QEMU channels (RAPL/emulated PMU counters/...).
> >
> > Note, this feature is already released in QEMU 9.1.0.
> >
> >> Implementing RAPL in pure form though looks fine to me,
> >> so the same tools could use it the same way as on the host
> >> if needed without VM specific quirks.
> >
> > IMHO the so called "pure" form is misleading to applications, unless
> > we first provided some other practical way to expose the data that
> > we would be throwing away from RAPL.
>
> The other possibility that I've thought of is using a 3rd party tool
> to give maybe more "accurate" values to QEMU. For example, Kepler
> could be used to give a value for each thread of QEMU, and so,
> instead of calculating and using the qemu-vmsr-helper, each value is
> transferred on request by QEMU via the UNIX socket that is used
> today between the daemon and QEMU. It's just an idea that I have
> and I don't know if that is acceptable for each project (QEMU and
> Kepler) that would really solve a few issues.

From QEMU point of view, it would be fine to get values from an
external process and just proxy them to the guest (preferably
without any massaging).

Also on the QEMU side, I'd suggest splitting the current monolithic
functionality in 2 parts: a frontend (KVM MSR interface for starters)
and a backend object (created with the -object CLI option) that will
handle communication with an external daemon. That way QEMU would be
able to easily change/add different frontend and backend options
(ex: add a frontend for RAPL with TCG accel, add a backend for Kepler
or other project(s) down the road).
(it would be good to make this split even for qemu-vmsr-helper).
(if you are interested, I can guide you wrt the QEMU side of the
question).

PS:
As for other projects, we probably should ask if they are open to
the idea. They definitely would need some patches for per thread
accounting, and maybe for some API to talk with external users
(but the latter might already exist, and it might be better for QEMU
to adopt it (here a QEMU backend object might help as a translator
of an existing protocol to QEMU specific internals)).
The point is QEMU won't have to reinvent the wheel, and other
projects will get more exposure/user-base.

On top of the projects you've already pointed out for possible
integration, I could add pmdadenki (CCed a few authors), which some
distros are shipping/using.

> > With regards,
> > Daniel
> > --
> > |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org -o- https://fstop138.berrange.com :|
> > |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
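[For concreteness, here is today's interface (as documented in docs/specs/rapl-msr.rst) next to one possible shape of the split Igor describes. The object/property names and the socket path in the second half are invented purely for illustration; nothing like them exists in QEMU today:]

```
# Today (since QEMU 9.1): the accel property points straight at the
# helper socket (path illustrative).
-accel kvm,rapl=true,rapl-helper-socket=/run/qemu-vmsr-helper.sock

# Hypothetical split: a backend object owns the daemon connection,
# and the KVM MSR frontend merely references it by id.
-object rapl-backend-vmsr,id=rapl0,socket=/run/qemu-vmsr-helper.sock
-accel kvm,rapl=rapl0
```

[The benefit of the second form is the one Igor names: backends (vmsr-helper, Kepler, something else) could be swapped without touching the MSR frontend, and other frontends (e.g. TCG) could reuse the same backend.]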
Hi Igor, Igor Mammedov, Nov 01, 2024 at 16:09: > On Tue, 22 Oct 2024 16:16:36 +0200 > "Anthony Harivel" <aharivel@redhat.com> wrote: > >> Daniel P. Berrangé, Oct 22, 2024 at 15:15: >> > On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote: >> >> On Fri, 18 Oct 2024 13:59:34 +0100 >> >> Daniel P. Berrangé <berrange@redhat.com> wrote: >> >> >> >> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote: >> >> > > On Wed, 16 Oct 2024 14:56:39 +0200 >> >> > > "Anthony Harivel" <aharivel@redhat.com> wrote: >> >> [...] >> >> >> >> > > >> >> > > This also leads to a question, if we should account for >> >> > > not VCPU threads at all. Looking at real hardware, those >> >> > > MSRs return power usage of CPUs only, and they do not >> >> > > return consumption from auxiliary system components >> >> > > (io/memory/...). One can consider non VCPU threads in QEMU >> >> > > as auxiliary components, so we probably should not to >> >> > > account for them at all when modeling the same hw feature. >> >> > > (aka be consistent with what real hw does). >> >> > >> >> > I understand your POV, but I think that would be a mistake, >> >> > and would undermine the usefulness of the feature. >> >> > >> >> > The deployment model has a cluster of hosts and guests, all >> >> > belonging to the same user. The user goal is to measure host >> >> > power consumption imposed by the guest, and dynamically adjust >> >> > guest workloads in order to minimize power consumption of the >> >> > host. >> >> >> >> For cloud use-case, host side is likely in a better position >> >> to accomplish the task of saving power by migrating VM to >> >> another socket/host to compact idle load. (I've found at least 1 >> >> kubernetis tool[1], which does energy monitoring). Perhaps there >> >> are schedulers out there that do that using its data. >> >> I also work for Kepler project. I use it to monitor my VM has a black >> box and I used it inside my VM with this feature enable. 
Thanks to that >> I can optimize the workloads (dpdk application,database,..) inside my VM. >> >> This is the use-case in NFV deployment and I'm pretty sure this could be >> the use-case of many others. >> >> > >> > The host admin can merely shuffle workloads around, hoping that >> > a different packing of workloads onto machines, will reduce power >> > in some aount. You might win a few %, or low 10s of % with this >> > if you're good at it. >> > >> > The guest admin can change the way their workload operates to >> > reduce its inherant power consumption baseline. You could easily >> > come across ways to win high 10s of % with this. That's why it >> > is interesting to expose power consumption info to the guest >> > admin. >> > >> > IOW, neither makes the other obsolete, both approaches are >> > desirable. >> > >> >> > The guest workloads can impose non-negligble power consumption >> >> > loads on non-vCPU threads in QEMU. Without that accounted for, >> >> > any adjustments will be working from (sometimes very) inaccurate >> >> > data. >> >> >> >> Perhaps adding one or several energy sensors (ex: some i2c ones), >> >> would let us provide auxiliary threads consumption to guest, and >> >> even make it more granular if necessary (incl. vhost user/out of >> >> process device models or pass-through devices if they have PMU). >> >> It would be better than further muddling vCPUs consumption >> >> estimates with something that doesn't belong there. >> >> I'm confused about your statement. Like every software power metering >> tools out is using RAPL (Kepler, Scaphandre, PowerMon, etc) and custom >> sensors would be better than a what everyone is using ? > > RAPL is used to measure CPU/DRAM/maybe GPU domains. > see my other reply to Daniel RAPL + aux > (https://www.mail-archive.com/qemu-devel@nongnu.org/msg1072593.html) > My point wrt RAPL is: CPU domain on host and inside guest > should be doing the same thing, i.e. 
report only package/core > consumption of virtual CPUs and nothing else (non-vCPU induced load > should not be included in the CPU domain). > > For non-vCPU consumption, we should do the same as bare metal, > i.e. add power sensors where necessary. As a minimum we can add > a system power meter sensor, which could account for total > energy draw (and that can include not only QEMU aux threads, > but also other related processes (aka the process handling a dpdk NIC, > or another vhost-user backend)). > Individual per-device sensors are also a possibility in the future > (i.e. per NIC) if we can find a suitable sensor on the host to derive > the guest value. > > [...] > >> Adding RAPL inside the VM makes total sense because you can use tools that >> are already out in the market. > no disagreement here. > > Given the topic is relatively new, the tooling mostly concentrates on > RAPL as the most available sensor. But some tools can pull energy values > from other sources; we surely can teach them to pull values from > a sensor(s) we'd want to add to QEMU (i.e. for an easy start, borrow > sensor handling from lm_sensors). I'd pick acpi power meter as > a possible candidate for being guest-OS agnostic, and > we can attach it to anything in the machine tree. > >> > There's a tradeoff here in that info directly associated with >> > backend threads is effectively exposing private QEMU impl >> > details as public ABI. IOW, we don't want too fine a granularity >> > here; we need it abstracted sufficiently that different >> > backend choices for a given don't change what sensors are >> > exposed. >> > >> > I also wonder how existing power monitoring applications >> > would consume such custom sensors - is there sufficient >> > standardization in this area that we're not inventing >> > something totally QEMU specific ? 
>> > >> >> > IOW, I think it is right to include non-vCPU thread usage in >> >> > the reported info, as it is still fundamentally part of the >> >> > load that the guest imposes on the host pCPUs it is permitted to >> >> > run on. >> >> >> >> >> >> From what I've read, process energy usage done via RAPL is not >> >> exactly accurate. But there are monitoring tools out there that >> >> use RAPL and other sources to make energy consumption monitoring >> >> more reliable. >> >> >> >> Reinventing that wheel and pulling all of the nuances of process >> >> power monitoring inside of the QEMU process needlessly complicates it. >> >> Maybe we should reuse one of the existing tools and channel its data >> >> through appropriate QEMU channels (RAPL/emulated PMU counters/...). >> > >> > Note, this feature is already released in QEMU 9.1.0. >> > >> >> Implementing RAPL in pure form though looks fine to me, >> >> so the same tools could use it the same way as on the host >> >> if needed, without VM-specific quirks. >> > >> > IMHO the so-called "pure" form is misleading to applications, unless >> > we first provided some other practical way to expose the data that >> > we would be throwing away from RAPL. >> > >> >> The other possibility that I've thought of is using a 3rd-party tool to >> maybe give more "accurate values" to QEMU. >> For example, Kepler could be used to give a value for each thread >> of QEMU, and so instead of calculating and using the qemu-vmsr-helper, >> each value is transferred on request by QEMU via the UNIX socket that is >> used today between the daemon and QEMU. It's just an idea that I have >> and I don't know if it is acceptable for each project (QEMU and >> Kepler), but it would really solve a few issues. > > From QEMU's point of view, it would be fine to get values from an external > process and just proxy them to the guest (preferably without any massaging). 
> > Also, on the QEMU side, I'd suggest splitting the current monolithic functionality > in 2 parts: a frontend (KVM MSR interface for starters) and a backend object > (created with the -object CLI option) that will handle communication > with an external daemon. That way QEMU would easily be able to change/add > different frontend and backend options (ex: add a frontend for RAPL > with the TCG accel, add a backend for Kepler or other project(s) > down the road). (It would be good to make this split even for > qemu-vmsr-helper.) (If you are interested, I can guide you wrt the > QEMU side of the question.) > > PS: > As for other projects, we probably should ask if they are open to the idea. > They definitely would need some patches for per-thread accounting, > and maybe for some API to talk with external users (but the latter > might exist, and it might be better for QEMU to adopt it (here a QEMU > backend object might help as a translator of an existing protocol to > QEMU-specific internals)). > The point is QEMU won't have to reinvent the wheel, and other projects > will get more exposure/user-base. > > On top of the projects you've already pointed out for possible > integration, I could add pmda-denki (CCed a few authors), which > some distros are shipping/using. > I think you have a fair amount of ideas and opinions on how to handle RAPL in QEMU and that's really good for improving the feature. What I would really like is to have Paolo's opinion on all of that. When I started working on the subject I talked to him several times and we agreed on the current implementation. Not that I disagree with all you said, to the contrary, but the amount of change is quite significant and it would be very annoying if the results of this work don't make it upstream because of Y & X. Let's see if we have more opinions from the people in the loop as well. Thanks for the feedback. Anthony
On Sat, 02 Nov 2024 10:32:17 +0100 "Anthony Harivel" <aharivel@redhat.com> wrote: [...] 
> > Not that I disagree with all you said, to the contrary, but the amount > of change is quite significant and it would be very annoying if the results > of this work don't make it upstream because of Y & X. A split frontend/backend design is an established pattern in QEMU, so I'm not suggesting anything revolutionary (the probability that anyone would object to it is very low). Sending an RFC can serve as a starting point for discussion. > > Let's see if we have more opinions from the people in the loop as well. Yep; given that it would be better to reuse existing power monitoring projects, it would be nice to hear some feedback from them. > > Thanks for the feedback. > > Anthony >
Hi all, some thoughts: - I vote for making the metrics available in the guest as much as possible, just as on the host. That allows cascading, and having in-guest monitoring working like on bare metal. - As a result, really just plain vCPU consumption would be made available in the guest as rapl-core. If the host can at some point understand the guest's GPU or I/O consumption, better hand that in separately. - Bearing in mind that we will also need this for other architectures, at least aarch64: RAPL comes from x86; rather than extending it to also cover I/O and such, we might aim at an interface which will also work on aarch64. - A bigger scope will be to look at the consumption of multiple systems; for that we will need to move the metrics to the network eventually, changing from MSR or such mechanisms. - For reading the metrics in the guest, I was tempted to suggest PCP with pmda-denki to cover RAPL, but it's right now just reading /sysfs, not MSRs. pmda-lmsensors for further sensors offered on various systems, and pmda-openmetrics for covering anything appearing somewhere on /sysfs as a number. > > Not that I disagree with all you said, to the contrary, but the amount > > of change is quite significant and it would be very annoying if the results > > of this work don't make it upstream because of Y & X. > > A split frontend/backend design is an established pattern in QEMU, so I'm not > suggesting anything revolutionary (the probability that anyone would object > to it is very low). > > Sending an RFC can serve as a starting point for discussion. +1, Christian
On Tue, 5 Nov 2024 08:11:14 +0100 Christian Horn <chorn@fluxcoil.net> wrote: > Hi all, > > some thoughts: > > - I vote for making the metrics available in the guest as much as possible, > just as on the host. Allows cascading, and having in-guest monitoring > working like on bare metal. > - As a result, really just plain vCPU consumption would be made available > in the guest as rapl-core. If the host can at some point understand > the guest's GPU or I/O consumption, better hand that in separately. > - Bearing in mind that we will also need this for other architectures, > at least aarch64: RAPL comes from x86; rather than extending it > to also cover I/O and such, we might aim at an interface which will also > work on aarch64. +1 to both points > - A bigger scope will be to look at the consumption of multiple systems; for > that we will need to move the metrics to the network eventually, changing > from MSR or such mechanisms. That isn't VM scope though, which is what this topic is about. But yes, the same tools as on bare metal can collect data and send/aggregate them elsewhere. The main point from the VM perspective is to act just like bare-metal systems so the same monitoring tools can be reused. > - For reading the metrics in the guest, I was tempted to suggest PCP with > pmda-denki to cover RAPL, but it's right now just reading /sysfs, not > MSRs. pmda-lmsensors for further sensors offered on various systems, For the NFV use-case, I was also eyeing pmda-denki. How hard would it be to add MSR-based sampling to denki? Can we borrow Anthony's MSR sampling from qemu-vmsr-helper to reduce the amount of work needed? Also, for guest per-vCPU accounting, we would need per-thread accounting (which I haven't noticed from a quick look at denki). So some effort would be needed to add it there. I didn't know about pmda-lmsensors; I guess we should be able to use it out of the box with an 'acpi power meter' sensor, if QEMU were to provide such. 
I've also seen denki supporting a battery power sensor; we could abuse that and make QEMU provide that, but I'd rather add an 'acpi power meter' sensor to denki (which to some degree intersects with the battery power sensor functionality). PS: In this series Anthony uses a custom protocol to get data from the privileged MSR helper to QEMU. Would that be acceptable? Or is there a preferred way for PCP to do inter-process comms? > and pmda-openmetrics for covering anything appearing somewhere on > /sysfs as a number. > > > > > Not that I disagree with all you said, to the contrary, but the amount > > > of change is quite significant and it would be very annoying if the results > > > of this work don't make it upstream because of Y & X. > > > > A split frontend/backend design is an established pattern in QEMU, so I'm not > > suggesting anything revolutionary (the probability that anyone would object > > to it is very low). > > > > Sending an RFC can serve as a starting point for discussion. > > +1, > Christian >
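[Editorial note: the MSR sampling being discussed above is fairly small in scope. A minimal sketch of reading the RAPL package-energy counter through the Linux msr driver could look like the following; this assumes an Intel host with the msr kernel module loaded and root privileges, and the function names are illustrative, not taken from qemu-vmsr-helper.]

```python
import struct

MSR_RAPL_POWER_UNIT = 0x606    # energy-unit field in bits 12:8
MSR_PKG_ENERGY_STATUS = 0x611  # 32-bit wrapping counter of energy units

def energy_unit_joules(power_unit_raw):
    # Bits 12:8 of MSR_RAPL_POWER_UNIT give N; one energy unit is
    # 1/2^N joules (N=14, i.e. ~61 uJ, is common on recent Intel parts).
    return 1.0 / (1 << ((power_unit_raw >> 8) & 0x1f))

def read_msr(cpu, reg):
    # Requires root and the msr kernel module (modprobe msr).
    with open("/dev/cpu/%d/msr" % cpu, "rb", buffering=0) as f:
        f.seek(reg)
        return struct.unpack("<Q", f.read(8))[0]

def pkg_energy_joules(cpu=0):
    unit = energy_unit_joules(read_msr(cpu, MSR_RAPL_POWER_UNIT))
    # The counter is 32 bits and wraps; callers must diff two samples
    # modulo 2^32 to get the energy consumed over an interval.
    return (read_msr(cpu, MSR_PKG_ENERGY_STATUS) & 0xffffffff) * unit
```

Sampling twice and subtracting (mod 2^32) yields joules consumed by the package over the interval, which is essentially what any MSR-based denki backend would do.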
* Igor Mammedov wrote: > On Tue, 5 Nov 2024 08:11:14 +0100 > Christian Horn <chorn@fluxcoil.net> wrote: > > > - For reading the metrics in the guest, I was tempted to suggest PCP with > > pmda-denki to cover RAPL, but it's right now just reading /sysfs, not > > MSRs. pmda-lmsensors for further sensors offered on various systems, > For the NFV use-case, I was also eyeing pmda-denki. > > How hard would it be to add MSR-based sampling to denki? > Can we borrow Anthony's MSR sampling from > qemu-vmsr-helper to reduce the amount of work needed? Should be possible. Also, for /sysfs we do a detection of domains, and based on that register metrics and instances with pmcd. For rapl-msr, that could be done in a similar way, i.e. as denki.rapl-msr, or by separating into denki.rapl.sysfs and denki.rapl.msr. As for the actual doing, I'm not part of the engineering org but support, so it's a spare-time activity, when I get to it. PCP engineering has people on the project; a Jira would be a first step. Direct pull requests to upstream are also a good start, of course. When developing that, one would iteratively modify src/pmdas/denki/denki.c, compile it, get pmcd to use the modified pmda-denki, and look at debug output and metrics. > Also, for guest per-vCPU accounting, we would need per-thread > accounting (which I haven't noticed from a quick look at denki). > So some effort would be needed to add it there. I think we have these metrics in pmcd already from pmda-linux, i.e. we can see them with this: # pmrep -1gU -t 5 -J 3 proc.hog.cpu [..] [ 1] - proc.hog.cpu["083377 /usr/lib64/firefox/firefox"] [ 2] - proc.hog.cpu["084634 /usr/lib64/firefox/firefox"] [ 3] - proc.hog.cpu["085225 md5sum"] 1 2 3 0.001 0.003 16.304 => Top 3 consumers, process 3 is heaviest. This uses derived metrics, computed from others, defined here: $ cat /etc/pcp/derived/proc.conf [..] 
proc.hog.cpu = 100 * (rate(proc.psinfo.utime) + rate(proc.psinfo.stime)) / (kernel.all.uptime - proc.psinfo.start_time) proc.hog.cpu(oneline) = average percentage CPU utilization of each process [..] I was brainstorming with Nathan about this in the past, but we did not quickly get to something and lost track. Following the PCP approach, a client would query the required metrics from pmcd (i.e. "process md5sum is right now using the most cpu cycles"), and together with "the overall VM or bare-metal system consumes right now 100W", one could attribute. We might get away with derived metrics as per the above. If the computation is not doable with that, we might also use our own client code (i.e. C, or python) which gets the metrics and computes the accounting per thread. A last resort would be to collect the required process metrics in pmda-denki for computation there. We might want to take this one out and discuss it on PCP upstream, i.e. pcp@groups.io. > I didn't know about pmda-lmsensors; I guess we should be able to use > it out of the box with an 'acpi power meter' sensor, if QEMU were to provide such. > I've also seen denki supporting a battery power sensor; we could abuse that > and make QEMU provide that, but I'd rather add an 'acpi power meter' sensor > to denki (which to some degree intersects with the battery power sensor > functionality). On this aarch64/Asahi macbook here, recent kernels made /sys/class/hwmon/hwmon1 available, and 'sensors' offers: [chris@asahi sensors]$ sensors [..] Total System Power: 7.71 W AC Input Power: 9.99 W 3.8 V Rail Power: 0.00 W Heatpipe Power: 2.46 W [..] I'm still wondering how these fit into a picture like this one: https://htmlpreview.github.io/?https://github.com/christianhorn/smallhelpers/blob/main/pmda-denki-handbook/denki.html#_hardware_requirements_new_version So with these, overall system consumption is also available while AC powered - of course, just on that hardware right now. 
> PS: > In this series Anthony uses a custom protocol to get data from > the privileged MSR helper to QEMU. Would that be acceptable? The only request would be that implementing that is "an optional on-top source", so it does not prevent MSR access on bare-metal hosts not having it. I guess that's given. So then it's an abstracted channel we provide into the guest. > Or is there a preferred way for PCP to do inter-process comms? Hm... I thought this was used here to communicate between host and guest? On the good side, if we get the per-thread attribution done, we can illustrate attribution up into guests with what Mermaid calls a sankey diagram: https://mermaid.js.org/syntax/sankey.html :) cheers, -- Christian Horn AMC Technical Account Manager, Red Hat K.K. pgp fprint ADA6 C79C AF2E 973E 3F70 73C5 9373 49E7 347B 904F
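[Editorial note: the per-thread attribution discussed in this exchange, splitting a package-energy delta across threads in proportion to the CPU time each consumed as read from /proc, can be sketched as follows. This is illustrative Python, not code from the series or from PCP, and the function names are made up.]

```python
import os

def parse_stat_times(stat_line):
    # /proc/<pid>/task/<tid>/stat: comm (field 2) may contain spaces
    # and ')', so split after the LAST ')'; utime and stime are then
    # the 12th and 13th of the remaining fields, in clock ticks.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])

def thread_cpu_ticks(pid):
    # CPU time consumed so far by each thread of a process, keyed by tid.
    return {int(tid): parse_stat_times(
                open("/proc/%s/task/%s/stat" % (pid, tid)).read())
            for tid in os.listdir("/proc/%s/task" % pid)}

def attribute_energy(pkg_delta_joules, tick_deltas):
    # Split a package-energy delta over an interval across threads,
    # proportionally to the CPU ticks each consumed in that interval.
    total = sum(tick_deltas.values())
    if total == 0:
        return {tid: 0.0 for tid in tick_deltas}
    return {tid: pkg_delta_joules * t / total
            for tid, t in tick_deltas.items()}
```

Sampling thread_cpu_ticks() and the package counter twice, then feeding the deltas to attribute_energy(), gives the per-thread estimate; whether that computation lives in a PCP derived metric, a client, or pmda-denki is exactly the open question above.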
On Tue, Oct 22, 2024 at 04:16:36PM +0200, Anthony Harivel wrote: [...] > The goal is not to be accurate. The goal is to be able to compare > A against B in the same environment, and RAPL is giving reproducible > values to do so. Be careful with saying "The goal is not to be accurate", as that's a very broad statement, and I don't think it is true. 
If you're doing A/B comparisons, you *do* need accuracy, in the sense that if a guest workload config change alters host CPU power consumption, you want that to be reflected in what the guest is told about its power usage. i.e. if a change in B moves some power usage from a vCPU thread to a non-vCPU thread, you don't want that power usage to disappear from what's reported to the guest. It would give you the false idea that B is more efficient than A, even if the non-vCPU thread for B was consuming x2 what the original vCPU thread was for A. What I think you don't need is for the absolute magnitude of the reported power consumption to be a precise match to the actual power consumption. i.e. if A and B are reported as 7 and 9 Watts respectively, it doesn't matter if the actual consumption was 12 and 15 watts. The relationship between the two measurements is still valid, and enables tuning, despite the magnitude being under-reported. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Daniel P. Berrangé, Oct 22, 2024 at 16:29: > On Tue, Oct 22, 2024 at 04:16:36PM +0200, Anthony Harivel wrote: >> Daniel P. Berrangé, Oct 22, 2024 at 15:15: >> > On Tue, Oct 22, 2024 at 02:46:15PM +0200, Igor Mammedov wrote: >> >> On Fri, 18 Oct 2024 13:59:34 +0100 >> >> Daniel P. Berrangé <berrange@redhat.com> wrote: >> >> >> >> > On Fri, Oct 18, 2024 at 02:25:26PM +0200, Igor Mammedov wrote: >> >> > > On Wed, 16 Oct 2024 14:56:39 +0200 >> >> > > "Anthony Harivel" <aharivel@redhat.com> wrote: >> >> [...] >> >> >> >> > > >> >> > > This also leads to a question, if we should account for >> >> > > not VCPU threads at all. Looking at real hardware, those >> >> > > MSRs return power usage of CPUs only, and they do not >> >> > > return consumption from auxiliary system components >> >> > > (io/memory/...). One can consider non VCPU threads in QEMU >> >> > > as auxiliary components, so we probably should not to >> >> > > account for them at all when modeling the same hw feature. >> >> > > (aka be consistent with what real hw does). >> >> > >> >> > I understand your POV, but I think that would be a mistake, >> >> > and would undermine the usefulness of the feature. >> >> > >> >> > The deployment model has a cluster of hosts and guests, all >> >> > belonging to the same user. The user goal is to measure host >> >> > power consumption imposed by the guest, and dynamically adjust >> >> > guest workloads in order to minimize power consumption of the >> >> > host. >> >> >> >> For cloud use-case, host side is likely in a better position >> >> to accomplish the task of saving power by migrating VM to >> >> another socket/host to compact idle load. (I've found at least 1 >> >> kubernetis tool[1], which does energy monitoring). Perhaps there >> >> are schedulers out there that do that using its data. >> >> I also work for Kepler project. I use it to monitor my VM has a black >> box and I used it inside my VM with this feature enable. 
Thanks to that >> I can optimize the workloads (dpdk application,database,..) inside my VM. >> >> This is the use-case in NFV deployment and I'm pretty sure this could be >> the use-case of many others. >> >> > >> > The host admin can merely shuffle workloads around, hoping that >> > a different packing of workloads onto machines, will reduce power >> > in some aount. You might win a few %, or low 10s of % with this >> > if you're good at it. >> > >> > The guest admin can change the way their workload operates to >> > reduce its inherant power consumption baseline. You could easily >> > come across ways to win high 10s of % with this. That's why it >> > is interesting to expose power consumption info to the guest >> > admin. >> > >> > IOW, neither makes the other obsolete, both approaches are >> > desirable. >> > >> >> > The guest workloads can impose non-negligble power consumption >> >> > loads on non-vCPU threads in QEMU. Without that accounted for, >> >> > any adjustments will be working from (sometimes very) inaccurate >> >> > data. >> >> >> >> Perhaps adding one or several energy sensors (ex: some i2c ones), >> >> would let us provide auxiliary threads consumption to guest, and >> >> even make it more granular if necessary (incl. vhost user/out of >> >> process device models or pass-through devices if they have PMU). >> >> It would be better than further muddling vCPUs consumption >> >> estimates with something that doesn't belong there. >> >> I'm confused about your statement. Like every software power metering >> tools out is using RAPL (Kepler, Scaphandre, PowerMon, etc) and custom >> sensors would be better than a what everyone is using ? >> The goal is not to be accurate. The goal is to be able to compare >> A against B in the same environment and RAPL is given reproducible >> values to do so. > > Be careful with saying "The goal isnot to be accurate", as that's > a very broad statement, and I don't think it is true. 
>
> If you're doing A/B comparisons, you *do* need accuracy, in the
> sense that if a guest workload config change alters host CPU
> power consumption, you want that to be reflected in what the
> guest is told about its power usage.
>
> ie if a change in B moves some power usage from a vCPU thread
> to a non-vCPU thread, you don't want that power usage to
> disappear from what's reported to the guest. It would give you
> the false idea that B is more efficient than A, even if the
> non-vCPU thread for B was consuming 2x what the original vCPU
> thread was for A.
>
> What I think you don't need is for the absolute magnitude of
> the reported power consumption to be a precise match to the
> actual power consumption.
>
> ie if A and B are reported as 7 and 9 Watts respectively, it
> doesn't matter if the actual consumption was 12 and 15 Watts.

Right, my bad, I agree. When I said "not accurate" I was indeed talking
about the absolute magnitude of the reported power consumption. Your
example above is exactly what I had in mind. Sorry for my clumsy
shortcut, and thanks for clarifying this important point.
Just a gentle ping for the above patch series.

Anthony Harivel, May 22, 2024 at 17:34:
> Dear maintainers,
>
> First of all, thank you very much for your review of my patch [1].
>
> In this version (v6), I have attempted to address all the problems
> raised by Daniel and Paolo during the last review.
>
> However, two open questions remain unanswered that would require the
> attention of an x86 maintainer:
>
> 1) Should I move the rapl feature from -kvm to -cpu? [2]
>
> 2) Should I already rename to "rapl_vmsr_*" in order to anticipate the
>    future TPMI architecture? [3]
>
> Thank you again for your continued guidance.
>
> v5 -> v6
> --------
> - Better error consistency in qio_channel_get_peerpid()
> - Memory leaks with g_strdup_printf/g_build_filename corrected
> - Renamed several structs with "vmsr_*" for a better namespace
> - Renamed several structs with "guest_*" for better comprehension
> - Optimizations suggested by Daniel
> - Crash problem solved [4]
>
> v4 -> v5
> --------
>
> - Correct qio_channel_get_peerpid: return pid = -1 in case of error
> - Vmsr_helper: compile only for x86
> - Vmsr_helper: use qio_channel_read/write_all
> - Vmsr_helper: abandon user/group
> - Vmsr_energy.c: correct all error_report
> - Vmsr thread: compute default socket path only once
> - Vmsr thread: open socket only once
> - Pass relevant QEMU CI
>
> v3 -> v4
> --------
>
> - Correct memory leaks with AddressSanitizer
> - Add sanity checks in QEMU and qemu-vmsr-helper to verify that the
>   host is Intel and that RAPL is activated.
> - Rename poorly named variables for easier comprehension
> - Move the code that checks the host before creating the VMSR thread
> - Get rid of libnuma: create a function that reads sysfs for the
>   host topology instead
>
> v2 -> v3
> --------
>
> - Move all memory allocations from Clib to Glib
> - Compile on *BSD (working on Linux only)
> - No more limitation on the virtual package: each vCPU that belongs to
>   the same virtual package gives the same results, as expected on a
>   real CPU. This has been tested with topologies like:
>     -smp 4,sockets=2
>     -smp 16,sockets=4,cores=2,threads=2
>
> v1 -> v2
> --------
>
> - To overcome CVE-2020-8694, a socket communication is created
>   to a privileged helper
> - Add the privileged helper (qemu-vmsr-helper)
> - Add SO_PEERCRED in qio channel socket
>
> RFC -> v1
> ---------
>
> - Add vmsr_* in front of all vmsr-specific functions
> - Change malloc()/calloc()... to the glib equivalents
> - Pre-allocate all dynamic memory where possible
> - Add documentation of the implementation, limitations and usage
>
> Best regards,
> Anthony
>
> [1]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg01570.html
> [2]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg03947.html
> [3]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02350.html
> [4]: https://mail.gnu.org/archive/html/qemu-devel/2024-04/msg02481.html
>
> Anthony Harivel (3):
>   qio: add support for SO_PEERCRED for socket channel
>   tools: build qemu-vmsr-helper
>   Add support for RAPL MSRs in KVM/Qemu
>
>  accel/kvm/kvm-all.c                      |  27 ++
>  contrib/systemd/qemu-vmsr-helper.service |  15 +
>  contrib/systemd/qemu-vmsr-helper.socket  |   9 +
>  docs/specs/index.rst                     |   1 +
>  docs/specs/rapl-msr.rst                  | 155 +++++++
>  docs/tools/index.rst                     |   1 +
>  docs/tools/qemu-vmsr-helper.rst          |  89 ++++
>  include/io/channel.h                     |  21 +
>  include/sysemu/kvm_int.h                 |  32 ++
>  io/channel-socket.c                      |  28 ++
>  io/channel.c                             |  13 +
>  meson.build                              |   7 +
>  target/i386/cpu.h                        |   8 +
>  target/i386/kvm/kvm.c                    | 431 +++++++++++++++++-
>  target/i386/kvm/meson.build              |   1 +
>  target/i386/kvm/vmsr_energy.c            | 337 ++++++++++++++
>  target/i386/kvm/vmsr_energy.h            |  99 +++++
>  tools/i386/qemu-vmsr-helper.c            | 530 +++++++++++++++++++++++
>  tools/i386/rapl-msr-index.h              |  28 ++
>  19 files changed, 1831 insertions(+), 1 deletion(-)
>  create mode 100644 contrib/systemd/qemu-vmsr-helper.service
>  create mode 100644 contrib/systemd/qemu-vmsr-helper.socket
>  create mode 100644 docs/specs/rapl-msr.rst
>  create mode 100644 docs/tools/qemu-vmsr-helper.rst
>  create mode 100644 target/i386/kvm/vmsr_energy.c
>  create mode 100644 target/i386/kvm/vmsr_energy.h
>  create mode 100644 tools/i386/qemu-vmsr-helper.c
>  create mode 100644 tools/i386/rapl-msr-index.h
>
> --
> 2.45.1