From nobody Wed Nov  5 10:07:49 2025
Delivered-To: importer@patchew.org
Received-SPF: pass (zoho.com: domain of gnu.org designates 208.118.235.17 as
 permitted sender) client-ip=208.118.235.17;
 envelope-from=qemu-devel-bounces+importer=patchew.org@nongnu.org;
 helo=lists.gnu.org;
Authentication-Results: mx.zohomail.com;
	spf=pass (zoho.com: domain of gnu.org designates 208.118.235.17 as permitted
 sender)  smtp.mailfrom=qemu-devel-bounces+importer=patchew.org@nongnu.org
Return-Path: <qemu-devel-bounces+importer=patchew.org@nongnu.org>
Received: from lists.gnu.org (lists.gnu.org [208.118.235.17]) by
 mx.zohomail.com
	with SMTPS id 1534180454026110.47931817283063;
 Mon, 13 Aug 2018 10:14:14 -0700 (PDT)
Received: from localhost ([::1]:40531 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <qemu-devel-bounces+importer=patchew.org@nongnu.org>)
	id 1fpGPw-0006m9-Le
	for importer@patchew.org; Mon, 13 Aug 2018 13:14:08 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51735)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <cota@braap.org>) id 1fpGNg-0005Wt-8E
	for qemu-devel@nongnu.org; Mon, 13 Aug 2018 13:11:54 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <cota@braap.org>) id 1fpGNb-0006b8-LG
	for qemu-devel@nongnu.org; Mon, 13 Aug 2018 13:11:48 -0400
Received: from out1-smtp.messagingengine.com ([66.111.4.25]:57003)
	by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71) (envelope-from <cota@braap.org>) id 1fpGNb-0006aY-Cq
	for qemu-devel@nongnu.org; Mon, 13 Aug 2018 13:11:43 -0400
Received: from compute4.internal (compute4.nyi.internal [10.202.2.44])
	by mailout.nyi.internal (Postfix) with ESMTP id C47E921C76;
	Mon, 13 Aug 2018 13:11:42 -0400 (EDT)
Received: from mailfrontend1 ([10.202.2.162])
	by compute4.internal (MEProxy); Mon, 13 Aug 2018 13:11:42 -0400
Received: from localhost (flamenco.cs.columbia.edu [128.59.20.216])
	by mail.messagingengine.com (Postfix) with ESMTPA id 7E287E455F;
	Mon, 13 Aug 2018 13:11:41 -0400 (EDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=braap.org; h=cc
	:date:from:in-reply-to:message-id:references:subject:to
	:x-me-sender:x-me-sender:x-sasl-enc; s=mesmtp; bh=b/l1FTxkNfCcBT
	Uvf+Nm+eEipBP8vt2p1PWJzU7OrgE=; b=X/daegeIc6NPj5jtvzO4zdtYvvu6vQ
	Jqk+UpEmBeXyQMc5jsKSSK/TBY/SzgPCRldpshxPyzvbj6Ptd5CdqI6PpEspv7Yf
	orYS1W2Q+kdpOmzTCIUiSEATVDXQNBs49r6wz+x79cIc3bTQUj+kI3UZJdpHGXfj
	iuwH+RtQnA6ZM=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=
	messagingengine.com; h=cc:date:from:in-reply-to:message-id
	:references:subject:to:x-me-sender:x-me-sender:x-sasl-enc; s=
	fm3; bh=b/l1FTxkNfCcBTUvf+Nm+eEipBP8vt2p1PWJzU7OrgE=; b=ng3qLrU2
	gxqDw9JMt/Y2CzGEQIOP3wVeekwTHC40XeuLkgaj9BZg1xuK/Q1UQ4dwPsLAhl1U
	qTblwhntae8ypGwJdFcqf9B7XW9X8SGEEqV6lY1+kR8sMoaIoy5VtJ2kVTbxdivc
	1nBKyPBDjzI0qEMUJZGfMyzN3Gqqt3PrIEz7ZpumXgbDUmUKyC52kPixtMMw40Ip
	pIsCZBY+vW7K2f7Y0yVvkGuHIwshuHMpM96e6BNlpzs7b2rjsaL9OlYU0OQdk7Lv
	B7d+SQ5sriJBFSQqLjpn6Z1Q8hVXRzmqUdnKVu3wGv3mr5ZTvzhFAV+bx/BcHWcR
	nqKP1LiQcddvig==
X-ME-Proxy: <xmx:zbtxW4RBTmhdwyZ22UJ6Tqi0fvaknF_JF-hjS-2EUXYrLqA8VhPTrw>
	<xmx:zbtxW13eXgOwNprmjNHVH-1v4XIPK7Jd8TH3ol1Dpl1rcB5Tbd4HcA>
	<xmx:zbtxW2UOufjTGfrHk0TlJv32d1YxEuKVNnYebxV1Mn9aSvLWWP9MKA>
	<xmx:zbtxWyuY7URCPmZX81gH_omMZwOVNCOIdKi5_Bo8qjHUEu84AKK-rA>
	<xmx:zbtxW6p5J65Ma6WgoBBb_lkyx2HGaRbZlV--eHzfwwP2DjvYF6ELnw>
	<xmx:zrtxWza4depuIIuSBE35X7dNId-RTp312sc0FetHQmKYffWKUBI2Gg>
X-ME-Sender: <xms:zbtxW9beIjpHX56sd5PSCC3DUc3-51kfvP3wcV9Jh5FNbqTr6Q-HYQ>
From: "Emilio G. Cota" <cota@braap.org>
To: qemu-devel@nongnu.org
Date: Mon, 13 Aug 2018 13:11:30 -0400
Message-Id: <20180813171132.21939-2-cota@braap.org>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20180813171132.21939-1-cota@braap.org>
References: <20180813171132.21939-1-cota@braap.org>
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
X-Received-From: 66.111.4.25
Subject: [Qemu-devel] [PATCH 1/3] qsp: QEMU's Synchronization Profiler
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.21
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Cc: Peter Crosthwaite <crosthwaite.peter@gmail.com>,
	Stefan Weil <sw@weilnetz.de>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Peter Xu <peterx@redhat.com>, Markus Armbruster <armbru@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>, Richard Henderson <rth@twiddle.net>
Errors-To: qemu-devel-bounces+importer=patchew.org@nongnu.org
Sender: "Qemu-devel" <qemu-devel-bounces+importer=patchew.org@nongnu.org>
X-ZohoMail: RSF_0  Z_629925259 SPT_0
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The goal of this module is to profile synchronization primitives
(i.e. mutexes, recursive mutexes and condition variables)
so that scalability issues can be quickly diagnosed.

Sync primitives are profiled by QSP based on the vaddr of the object access=
ed
as well as the call site (file:line_nr). That means the same object called
from two different call sites will be tracked in separate entries, and in t=
he
report (see below) they will show up in different rows. See the comments
at the top of qsp.c for further info on this decision and the overall
design of QSP.

A report shows info about the tracked entries, sorted in descending order
of total wait time. Sample output:

Type               Object  Call site          Wait Time (s)         Count  =
Average (us)
---------------------------------------------------------------------------=
------------
condvar    0x557ee3090e80  cpus.c:1084              1.69207          2916  =
      580.27
condvar    0x557ee30ceb10  cpus.c:1084              1.43442          2404  =
      596.68
mutex      0x557ee2503540  cpus.c:1583              0.30300        636558  =
        0.48
condvar    0x557ee2534460  cpus-common.c:216        0.13549         15513  =
        8.73
condvar    0x557ee2534420  cpus-common.c:186        0.08927         15747  =
        5.67
mutex      0x557ee3090a40  tcg/tcg.c:378            0.00823        120034  =
        0.07
mutex      0x557ee3090a00  tcg/tcg.c:378            0.00820        117890  =
        0.07
mutex      0x557ee3090ac0  tcg/tcg.c:378            0.00764        115954  =
        0.07
mutex      0x557ee3090a80  tcg/tcg.c:378            0.00762        116573  =
        0.07
mutex      0x557ee3090b00  tcg/tcg.c:378            0.00560         82547  =
        0.07
---------------------------------------------------------------------------=
------------

Some perf numbers:

- Double-check that profiling disabled does not affect performance
  with `taskset -c 0 tests/atomic_add-bench -d 5 -m'. Host: AMD Opteron 637=
6.
  - Before: 14.25 Mops/s
  - After:  14.27 Mops/s (within noise range)

- Profiling overhead for `taskset -c 0 atomic_add-bench -d 5 -m' on
  AMD Opteron 6376:
  - No profiling: 14.27 Mops/s
  - W/ profiling:  3.05 Mops/s
  - W/o get_clock: 6.73 Mops/s
  That is, a 4.67X slowdown for mutex_lock.

  The third result is the performance measured for an instrumented mutex_lo=
ck
  that does not call get_clock and instead just increments the number of
  acquisitions for each entry. With this we measure the overhead of just
  the instrumentation + the hash table lookup.

- `taskset -c 0 tests/atomic_add-bench -d 5 -m' on Intel Xeon E5-2690:
  - W/o profiling: 33.95 Mops/s
  - W/ profiling:   8.08 Mops/s
  - W/o get_clock: 14.19 Mops/s
  That is, a 4.20X slowdown for mutex_lock.

Synchronization primitives are not called very often during a regular
QEMU run, so the overall performance impact of the profiler is quite
tolerable, as the following results show.

- Boot + shut down of debian-aarch64, on Intel Xeon E5-2690, 3 runs:
  -smp 1:
    - W/o profiling:
      16448.021211      task-clock (msec)         #    1.000 CPUs utilized =
           ( +-  0.27% )
            35,383      context-switches          #    0.002 M/sec         =
           ( +-  0.17% )
                 1      cpu-migrations            #    0.000 K/sec
            13,410      page-faults               #    0.815 K/sec         =
           ( +-  1.79% )
    47,526,630,360      cycles                    #    2.890 GHz           =
           ( +-  0.18% )  (83.33%)
    22,943,171,724      stalled-cycles-frontend   #   48.27% frontend cycle=
s idle     ( +-  0.18% )  (83.35%)
    16,585,410,756      stalled-cycles-backend    #   34.90% backend  cycle=
s idle     ( +-  0.32% )  (66.68%)
    60,962,863,668      instructions              #    1.28  insns per cycle
                                                  #    0.38  stalled cycles=
 per insn  ( +-  0.06% )  (83.33%)
    10,910,210,349      branches                  #  663.314 M/sec         =
           ( +-  0.08% )  (83.33%)
       240,046,904      branch-misses             #    2.20% of all branche=
s          ( +-  0.29% )  (83.33%)

      16.452269799 seconds time elapsed                                    =
      ( +-  0.27% )

    - W/ profiling:
      16678.033593      task-clock (msec)         #    1.000 CPUs utilized =
           ( +-  0.09% )
            35,740      context-switches          #    0.002 M/sec         =
           ( +-  0.12% )
                 1      cpu-migrations            #    0.000 K/sec
            13,283      page-faults               #    0.796 K/sec         =
           ( +-  1.79% )
    48,208,446,365      cycles                    #    2.891 GHz           =
           ( +-  0.15% )  (83.34%)
    23,237,819,053      stalled-cycles-frontend   #   48.20% frontend cycle=
s idle     ( +-  0.22% )  (83.34%)
    16,926,585,765      stalled-cycles-backend    #   35.11% backend  cycle=
s idle     ( +-  0.30% )  (66.68%)
    61,620,102,560      instructions              #    1.28  insns per cycle
                                                  #    0.38  stalled cycles=
 per insn  ( +-  0.20% )  (83.34%)
    11,015,082,852      branches                  #  660.455 M/sec         =
           ( +-  0.22% )  (83.32%)
       243,335,050      branch-misses             #    2.21% of all branche=
s          ( +-  0.47% )  (83.33%)

      16.685661370 seconds time elapsed                                    =
      ( +-  0.08% )

    That is, a 1.3% slowdown.

  -smp 8:
    - W/o profiling:
      21591.674202      task-clock (msec)         #    2.512 CPUs utilized =
           ( +-  0.04% )
           428,423      context-switches          #    0.020 M/sec         =
           ( +-  0.38% )
             8,474      cpu-migrations            #    0.392 K/sec         =
           ( +-  1.64% )
            27,668      page-faults               #    0.001 M/sec         =
           ( +-  4.56% )
    58,611,751,157      cycles                    #    2.715 GHz           =
           ( +-  0.27% )  (83.40%)
    31,033,543,440      stalled-cycles-frontend   #   52.95% frontend cycle=
s idle     ( +-  0.45% )  (82.98%)
    22,983,788,382      stalled-cycles-backend    #   39.21% backend  cycle=
s idle     ( +-  0.50% )  (66.80%)
    67,020,810,260      instructions              #    1.14  insns per cycle
                                                  #    0.46  stalled cycles=
 per insn  ( +-  0.21% )  (83.45%)
    12,050,905,600      branches                  #  558.127 M/sec         =
           ( +-  0.16% )  (83.40%)
       271,074,726      branch-misses             #    2.25% of all branche=
s          ( +-  0.33% )  (83.42%)

       8.595004141 seconds time elapsed                                    =
      ( +-  0.65% )

    - W/ profiling:
      22227.582945      task-clock (msec)         #    2.516 CPUs utilized =
           ( +-  0.71% )
           487,009      context-switches          #    0.022 M/sec         =
           ( +-  1.43% )
             9,049      cpu-migrations            #    0.407 K/sec         =
           ( +-  0.99% )
            25,963      page-faults               #    0.001 M/sec         =
           ( +-  1.70% )
    60,705,049,201      cycles                    #    2.731 GHz           =
           ( +-  0.52% )  (83.33%)
    32,225,054,360      stalled-cycles-frontend   #   53.08% frontend cycle=
s idle     ( +-  0.39% )  (83.50%)
    23,943,332,206      stalled-cycles-backend    #   39.44% backend  cycle=
s idle     ( +-  0.40% )  (66.79%)
    68,775,229,321      instructions              #    1.13  insns per cycle
                                                  #    0.47  stalled cycles=
 per insn  ( +-  0.45% )  (83.38%)
    12,361,594,986      branches                  #  556.138 M/sec         =
           ( +-  0.36% )  (83.46%)
       281,775,777      branch-misses             #    2.28% of all branche=
s          ( +-  1.16% )  (82.93%)

       8.836031859 seconds time elapsed                                    =
      ( +-  1.07% )

    That is, a 2.7% slowdown.

- Boot + shut down of debian-aarch64, on AMD Opteron 6376, 3 runs:
  -smp 1:
    - W/o profiling:
      27601.373288      task-clock (msec)         #    0.999 CPUs utilized =
           ( +-  0.74% )
            51,532      context-switches          #    0.002 M/sec         =
           ( +-  0.49% )
                 1      cpu-migrations            #    0.000 K/sec
            22,085      page-faults               #    0.800 K/sec         =
           ( +-  1.94% )
    63,051,935,104      cycles                    #    2.284 GHz           =
           ( +-  0.59% )  (83.32%)
    15,210,275,327      stalled-cycles-frontend   #   24.12% frontend cycle=
s idle     ( +-  1.57% )  (83.33%)
    12,701,776,029      stalled-cycles-backend    #   20.14% backend  cycle=
s idle     ( +-  0.65% )  (33.35%)
    64,895,509,291      instructions              #    1.03  insns per cycle
                                                  #    0.23  stalled cycles=
 per insn  ( +-  0.15% )  (50.03%)
    11,598,175,695      branches                  #  420.203 M/sec         =
           ( +-  0.12% )  (66.69%)
       424,964,774      branch-misses             #    3.66% of all branche=
s          ( +-  2.44% )  (83.34%)

      27.634390949 seconds time elapsed                                    =
      ( +-  0.73% )

    - W/ profiling:
      28539.459861      task-clock (msec)         #    0.999 CPUs utilized =
           ( +-  0.90% )
            52,971      context-switches          #    0.002 M/sec         =
           ( +-  0.62% )
                 1      cpu-migrations            #    0.000 K/sec
            22,828      page-faults               #    0.800 K/sec         =
           ( +-  2.19% )
    65,196,667,837      cycles                    #    2.284 GHz           =
           ( +-  0.76% )  (83.32%)
    16,438,749,522      stalled-cycles-frontend   #   25.21% frontend cycle=
s idle     ( +-  2.44% )  (83.34%)
    13,080,610,434      stalled-cycles-backend    #   20.06% backend  cycle=
s idle     ( +-  1.22% )  (33.35%)
    65,744,290,786      instructions              #    1.01  insns per cycle
                                                  #    0.25  stalled cycles=
 per insn  ( +-  0.17% )  (50.05%)
    11,760,663,051      branches                  #  412.084 M/sec         =
           ( +-  0.11% )  (66.69%)
       433,815,865      branch-misses             #    3.69% of all branche=
s          ( +-  2.27% )  (83.34%)

      28.558462655 seconds time elapsed                                    =
      ( +-  0.90% )

    That is, a 3.5% slowdown.

  -smp 8:
    - W/o profiling:
      67084.039812      task-clock (msec)         #    3.075 CPUs utilized =
           ( +-  0.57% )
           448,562      context-switches          #    0.007 M/sec         =
           ( +-  1.30% )
            13,417      cpu-migrations            #    0.200 K/sec         =
           ( +-  1.67% )
            38,428      page-faults               #    0.573 K/sec         =
           ( +-  2.60% )
    92,018,881,848      cycles                    #    1.372 GHz           =
           ( +-  0.51% )  (83.27%)
    24,887,145,043      stalled-cycles-frontend   #   27.05% frontend cycle=
s idle     ( +-  0.97% )  (83.31%)
    10,843,517,286      stalled-cycles-backend    #   11.78% backend  cycle=
s idle     ( +-  0.84% )  (33.45%)
    72,663,707,013      instructions              #    0.79  insns per cycle
                                                  #    0.34  stalled cycles=
 per insn  ( +-  0.34% )  (50.14%)
    13,148,472,626      branches                  #  196.000 M/sec         =
           ( +-  0.26% )  (66.68%)
       524,683,844      branch-misses             #    3.99% of all branche=
s          ( +-  0.31% )  (83.34%)

      21.816863743 seconds time elapsed                                    =
      ( +-  1.37% )

    - W/ profiling:
      72646.052577      task-clock (msec)         #    3.088 CPUs utilized =
           ( +-  0.52% )
           558,334      context-switches          #    0.008 M/sec         =
           ( +-  0.35% )
            16,471      cpu-migrations            #    0.227 K/sec         =
           ( +-  1.35% )
            40,420      page-faults               #    0.556 K/sec         =
           ( +-  2.31% )
    99,025,850,252      cycles                    #    1.363 GHz           =
           ( +-  0.19% )  (83.14%)
    28,317,752,389      stalled-cycles-frontend   #   28.60% frontend cycle=
s idle     ( +-  0.21% )  (83.29%)
    11,370,442,683      stalled-cycles-backend    #   11.48% backend  cycle=
s idle     ( +-  0.78% )  (33.59%)
    75,859,240,699      instructions              #    0.77  insns per cycle
                                                  #    0.37  stalled cycles=
 per insn  ( +-  0.11% )  (50.25%)
    13,774,590,759      branches                  #  189.612 M/sec         =
           ( +-  0.10% )  (66.87%)
       541,354,013      branch-misses             #    3.93% of all branche=
s          ( +-  0.13% )  (83.50%)

      23.525659377 seconds time elapsed                                    =
      ( +-  0.90% )

    That is, a 7.8% slowdown.

Note that as -smp increases the memory footprint of the profiling data grows
as well, which explains the slowdown increase as we add more guest CPUs.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 configure                   |   8 +
 include/qemu/qht.h          |   1 +
 include/qemu/qsp.h          |  38 +++
 include/qemu/thread-posix.h |   4 +-
 include/qemu/thread-win32.h |   5 +-
 include/qemu/thread.h       |  41 ++-
 util/qemu-thread-win32.c    |   4 +-
 util/qht.c                  |  47 +++-
 util/qsp.c                  | 545 ++++++++++++++++++++++++++++++++++++
 util/Makefile.objs          |   1 +
 10 files changed, 671 insertions(+), 23 deletions(-)
 create mode 100644 include/qemu/qsp.h
 create mode 100644 util/qsp.c

diff --git a/configure b/configure
index 2a7796ea80..d7fead7c85 100755
--- a/configure
+++ b/configure
@@ -408,6 +408,7 @@ bsd=3D"no"
 linux=3D"no"
 solaris=3D"no"
 profiler=3D"no"
+sync_profiler=3D"no"
 cocoa=3D"no"
 softmmu=3D"yes"
 linux_user=3D"no"
@@ -1163,6 +1164,8 @@ for opt do
   ;;
   --enable-profiler) profiler=3D"yes"
   ;;
+  --enable-sync-profiler) sync_profiler=3D"yes"
+  ;;
   --disable-cocoa) cocoa=3D"no"
   ;;
   --enable-cocoa)
@@ -1621,6 +1624,7 @@ Advanced options (experts only):
   --tls-priority           default TLS protocol/cipher priority string
   --enable-gprof           QEMU profiling with gprof
   --enable-profiler        profiler support
+  --enable-sync-profiler   support profiling of synchronization primitives
   --enable-xen-pv-domain-build
                            xen pv domain builder
   --enable-debug-stack-usage
@@ -5893,6 +5897,7 @@ echo "gprof enabled     $gprof"
 echo "sparse enabled    $sparse"
 echo "strip binaries    $strip_opt"
 echo "profiler          $profiler"
+echo "sync profiler     $sync_profiler"
 echo "static build      $static"
 if test "$darwin" =3D "yes" ; then
     echo "Cocoa support     $cocoa"
@@ -6146,6 +6151,9 @@ fi
 if test "$profiler" =3D "yes" ; then
   echo "CONFIG_PROFILER=3Dy" >> $config_host_mak
 fi
+if test "$sync_profiler" =3D "yes" ; then
+  echo "CONFIG_SYNC_PROFILER=3Dy" >> $config_host_mak
+fi
 if test "$slirp" =3D "yes" ; then
   echo "CONFIG_SLIRP=3Dy" >> $config_host_mak
   echo "CONFIG_SMBD_COMMAND=3D\"$smbd\"" >> $config_host_mak
diff --git a/include/qemu/qht.h b/include/qemu/qht.h
index 1fb9116fa0..c9a11cc29a 100644
--- a/include/qemu/qht.h
+++ b/include/qemu/qht.h
@@ -46,6 +46,7 @@ typedef bool (*qht_lookup_func_t)(const void *obj, const =
void *userp);
 typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void =
*up);
=20
 #define QHT_MODE_AUTO_RESIZE 0x1 /* auto-resize when heavily loaded */
+#define QHT_MODE_RAW_MUTEXES 0x2 /* bypass the profiler (QSP) */
=20
 /**
  * qht_init - Initialize a QHT
diff --git a/include/qemu/qsp.h b/include/qemu/qsp.h
new file mode 100644
index 0000000000..e36f94ab1d
--- /dev/null
+++ b/include/qemu/qsp.h
@@ -0,0 +1,38 @@
+/*
+ * qsp.c - QEMU Synchronization Profiler
+ *
+ * Copyright (C) 2017, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ *
+ * Note: this header file can *only* be included from thread.h.
+ */
+#ifndef QEMU_QSP_H
+#define QEMU_QSP_H
+
+#include "qemu/fprintf-fn.h"
+
+#ifdef CONFIG_SYNC_PROFILER
+
+void qsp_mutex_lock(QemuMutex *mutex, const char *file, unsigned line);
+int qsp_mutex_trylock(QemuMutex *mutex, const char *file, unsigned line);
+
+void qsp_rec_mutex_lock(QemuRecMutex *mutex, const char *file, unsigned li=
ne);
+int qsp_rec_mutex_trylock(QemuRecMutex *mutex, const char *file, unsigned =
line);
+
+void qsp_cond_wait(QemuCond *cond, QemuMutex *mutex, const char *file,
+                   unsigned line);
+
+void qsp_report(FILE *f, fprintf_function cpu_fprintf, size_t max);
+
+#else /* !CONF_SYNC_PROFILER */
+
+static inline void qsp_report(FILE *f, fprintf_function cpu_fprintf, size_=
t max)
+{
+    cpu_fprintf(f, "[Sync profiler not compiled]\n");
+}
+
+#endif /* !CONF_SYNC_PROFILER */
+
+#endif /* QEMU_QSP_H */
diff --git a/include/qemu/thread-posix.h b/include/qemu/thread-posix.h
index fd27b34128..c903525062 100644
--- a/include/qemu/thread-posix.h
+++ b/include/qemu/thread-posix.h
@@ -6,8 +6,8 @@
=20
 typedef QemuMutex QemuRecMutex;
 #define qemu_rec_mutex_destroy qemu_mutex_destroy
-#define qemu_rec_mutex_lock qemu_mutex_lock
-#define qemu_rec_mutex_trylock qemu_mutex_trylock
+#define qemu_rec_mutex_lock_impl    qemu_mutex_lock_impl
+#define qemu_rec_mutex_trylock_impl qemu_mutex_trylock_impl
 #define qemu_rec_mutex_unlock qemu_mutex_unlock
=20
 struct QemuMutex {
diff --git a/include/qemu/thread-win32.h b/include/qemu/thread-win32.h
index d668d789b4..50af5dd7ab 100644
--- a/include/qemu/thread-win32.h
+++ b/include/qemu/thread-win32.h
@@ -19,8 +19,9 @@ struct QemuRecMutex {
 };
=20
 void qemu_rec_mutex_destroy(QemuRecMutex *mutex);
-void qemu_rec_mutex_lock(QemuRecMutex *mutex);
-int qemu_rec_mutex_trylock(QemuRecMutex *mutex);
+void qemu_rec_mutex_lock_impl(QemuRecMutex *mutex, const char *file, int l=
ine);
+int qemu_rec_mutex_trylock_impl(QemuRecMutex *mutex, const char *file,
+                                int line);
 void qemu_rec_mutex_unlock(QemuRecMutex *mutex);
=20
 struct QemuCond {
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index ef7bd16123..55009bbf26 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -16,6 +16,9 @@ typedef struct QemuThread QemuThread;
 #include "qemu/thread-posix.h"
 #endif
=20
+/* include QSP header once QemuMutex, QemuCond etc. are defined */
+#include "qemu/qsp.h"
+
 #define QEMU_THREAD_JOINABLE 0
 #define QEMU_THREAD_DETACHED 1
=20
@@ -25,10 +28,27 @@ int qemu_mutex_trylock_impl(QemuMutex *mutex, const cha=
r *file, const int line);
 void qemu_mutex_lock_impl(QemuMutex *mutex, const char *file, const int li=
ne);
 void qemu_mutex_unlock_impl(QemuMutex *mutex, const char *file, const int =
line);
=20
-#define qemu_mutex_lock(mutex) \
-        qemu_mutex_lock_impl(mutex, __FILE__, __LINE__)
-#define qemu_mutex_trylock(mutex) \
-        qemu_mutex_trylock_impl(mutex, __FILE__, __LINE__)
+/* convenience macros to bypass the profiler */
+#define qemu_mutex_lock__raw(m)                         \
+        qemu_mutex_lock_impl(m, __FILE__, __LINE__)
+#define qemu_mutex_trylock__raw(m)                      \
+        qemu_mutex_trylock_impl(m, __FILE__, __LINE__)
+
+#ifdef CONFIG_SYNC_PROFILER
+# define qemu_mutex_lock(m)        qsp_mutex_lock(m, __FILE__, __LINE__)
+# define qemu_mutex_trylock(m)     qsp_mutex_trylock(m, __FILE__, __LINE__)
+# define qemu_rec_mutex_lock(m)    qsp_rec_mutex_lock(m, __FILE__, __LINE_=
_)
+# define qemu_rec_mutex_trylock(m) qsp_rec_mutex_trylock(m, __FILE__, __LI=
NE__)
+# define qemu_cond_wait(c, m)      qsp_cond_wait(c, m, __FILE__, __LINE__)
+#else
+# define qemu_mutex_lock(m)     qemu_mutex_lock_impl(m, __FILE__, __LINE__)
+# define qemu_mutex_trylock(m)  qemu_mutex_trylock_impl(m, __FILE__, __LIN=
E__)
+# define qemu_rec_mutex_lock(m) qemu_rec_mutex_lock_impl(m, __FILE__, __LI=
NE__)
+# define qemu_rec_mutex_trylock(m)                              \
+         qemu_rec_mutex_trylock_impl(m, __FILE__, __LINE__)
+# define qemu_cond_wait(c, m)   qemu_cond_wait_impl(c, m, __FILE__, __LINE=
__)
+#endif /* !CONFIG_SYNC_PROFILER */
+
 #define qemu_mutex_unlock(mutex) \
         qemu_mutex_unlock_impl(mutex, __FILE__, __LINE__)
=20
@@ -47,6 +67,16 @@ static inline void (qemu_mutex_unlock)(QemuMutex *mutex)
     qemu_mutex_unlock(mutex);
 }
=20
+static inline void (qemu_rec_mutex_lock)(QemuRecMutex *mutex)
+{
+    qemu_rec_mutex_lock(mutex);
+}
+
+static inline int (qemu_rec_mutex_trylock)(QemuRecMutex *mutex)
+{
+    return qemu_rec_mutex_trylock(mutex);
+}
+
 /* Prototypes for other functions are in thread-posix.h/thread-win32.h.  */
 void qemu_rec_mutex_init(QemuRecMutex *mutex);
=20
@@ -63,9 +93,6 @@ void qemu_cond_broadcast(QemuCond *cond);
 void qemu_cond_wait_impl(QemuCond *cond, QemuMutex *mutex,
                          const char *file, const int line);
=20
-#define qemu_cond_wait(cond, mutex) \
-        qemu_cond_wait_impl(cond, mutex, __FILE__, __LINE__)
-
 static inline void (qemu_cond_wait)(QemuCond *cond, QemuMutex *mutex)
 {
     qemu_cond_wait(cond, mutex);
diff --git a/util/qemu-thread-win32.c b/util/qemu-thread-win32.c
index b303188a36..4a363ca675 100644
--- a/util/qemu-thread-win32.c
+++ b/util/qemu-thread-win32.c
@@ -97,13 +97,13 @@ void qemu_rec_mutex_destroy(QemuRecMutex *mutex)
     DeleteCriticalSection(&mutex->lock);
 }
=20
-void qemu_rec_mutex_lock(QemuRecMutex *mutex)
+void qemu_rec_mutex_lock_impl(QemuRecMutex *mutex, const char *file, int l=
ine)
 {
     assert(mutex->initialized);
     EnterCriticalSection(&mutex->lock);
 }
=20
-int qemu_rec_mutex_trylock(QemuRecMutex *mutex)
+int qemu_rec_mutex_trylock_impl(QemuRecMutex *mutex, const char *file, int=
 line)
 {
     assert(mutex->initialized);
     return !TryEnterCriticalSection(&mutex->lock);
diff --git a/util/qht.c b/util/qht.c
index c138777a9c..1e3a072e25 100644
--- a/util/qht.c
+++ b/util/qht.c
@@ -89,6 +89,33 @@
 #define QHT_BUCKET_ENTRIES 4
 #endif
=20
+/*
+ * Do _not_ use qemu_mutex_[try]lock directly! Use these macros, otherwise
+ * the profiler (QSP) will deadlock.
+ */
+static inline void qht_lock(struct qht *ht)
+{
+    if (ht->mode & QHT_MODE_RAW_MUTEXES) {
+        qemu_mutex_lock__raw(&ht->lock);
+    } else {
+        qemu_mutex_lock(&ht->lock);
+    }
+}
+
+static inline int qht_trylock(struct qht *ht)
+{
+    if (ht->mode & QHT_MODE_RAW_MUTEXES) {
+        return qemu_mutex_trylock__raw(&(ht)->lock);
+    }
+    return qemu_mutex_trylock(&(ht)->lock);
+}
+
+/* this inline is not really necessary, but it helps keep code consistent =
*/
+static inline void qht_unlock(struct qht *ht)
+{
+    qemu_mutex_unlock(&ht->lock);
+}
+
 /*
  * Note: reading partially-updated pointers in @pointers could lead to
  * segfaults. We thus access them with atomic_read/set; this guarantees
@@ -254,10 +281,10 @@ void qht_map_lock_buckets__no_stale(struct qht *ht, s=
truct qht_map **pmap)
     qht_map_unlock_buckets(map);
=20
     /* we raced with a resize; acquire ht->lock to see the updated ht->map=
 */
-    qemu_mutex_lock(&ht->lock);
+    qht_lock(ht);
     map =3D ht->map;
     qht_map_lock_buckets(map);
-    qemu_mutex_unlock(&ht->lock);
+    qht_unlock(ht);
     *pmap =3D map;
     return;
 }
@@ -288,11 +315,11 @@ struct qht_bucket *qht_bucket_lock__no_stale(struct q=
ht *ht, uint32_t hash,
     qemu_spin_unlock(&b->lock);
=20
     /* we raced with a resize; acquire ht->lock to see the updated ht->map=
 */
-    qemu_mutex_lock(&ht->lock);
+    qht_lock(ht);
     map =3D ht->map;
     b =3D qht_map_to_bucket(map, hash);
     qemu_spin_lock(&b->lock);
-    qemu_mutex_unlock(&ht->lock);
+    qht_unlock(ht);
     *pmap =3D map;
     return b;
 }
@@ -430,13 +457,13 @@ bool qht_reset_size(struct qht *ht, size_t n_elems)
=20
     n_buckets =3D qht_elems_to_buckets(n_elems);
=20
-    qemu_mutex_lock(&ht->lock);
+    qht_lock(ht);
     map =3D ht->map;
     if (n_buckets !=3D map->n_buckets) {
         new =3D qht_map_create(n_buckets);
     }
     qht_do_resize_and_reset(ht, new);
-    qemu_mutex_unlock(&ht->lock);
+    qht_unlock(ht);
=20
     return !!new;
 }
@@ -565,7 +592,7 @@ static __attribute__((noinline)) void qht_grow_maybe(st=
ruct qht *ht)
      * If the lock is taken it probably means there's an ongoing resize,
      * so bail out.
      */
-    if (qemu_mutex_trylock(&ht->lock)) {
+    if (qht_trylock(ht)) {
         return;
     }
     map =3D ht->map;
@@ -575,7 +602,7 @@ static __attribute__((noinline)) void qht_grow_maybe(st=
ruct qht *ht)
=20
         qht_do_resize(ht, new);
     }
-    qemu_mutex_unlock(&ht->lock);
+    qht_unlock(ht);
 }
=20
 bool qht_insert(struct qht *ht, void *p, uint32_t hash, void **existing)
@@ -788,7 +815,7 @@ bool qht_resize(struct qht *ht, size_t n_elems)
     size_t n_buckets =3D qht_elems_to_buckets(n_elems);
     size_t ret =3D false;
=20
-    qemu_mutex_lock(&ht->lock);
+    qht_lock(ht);
     if (n_buckets !=3D ht->map->n_buckets) {
         struct qht_map *new;
=20
@@ -796,7 +823,7 @@ bool qht_resize(struct qht *ht, size_t n_elems)
         qht_do_resize(ht, new);
         ret =3D true;
     }
-    qemu_mutex_unlock(&ht->lock);
+    qht_unlock(ht);
=20
     return ret;
 }
diff --git a/util/qsp.c b/util/qsp.c
new file mode 100644
index 0000000000..908e99cccf
--- /dev/null
+++ b/util/qsp.c
@@ -0,0 +1,545 @@
+/*
+ * qsp.c - QEMU Synchronization Profiler
+ *
+ * Copyright (C) 2017, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ *
+ * QSP profiles the time spent in synchronization primitives, which can
+ * help diagnose performance problems, e.g. scalability issues when
+ * contention is high.
+ *
+ * The primitives currently supported are mutexes, recursive mutexes and
+ * condition variables. Note that not all related functions are intercepte=
d;
+ * instead we profile only those functions that can have a performance imp=
act,
+ * either due to blocking (e.g. cond_wait, mutex_lock) or cache line
+ * contention (e.g. mutex_lock, mutex_trylock).
+ *
+ * QSP's design focuses on speed and scalability. This is achieved
+ * by having threads do their profiling entirely on thread-local data.
+ * The appropriate thread-local data is found via a QHT, i.e. a concurrent=
 hash
+ * table. To aggregate data in order to generate a report, we iterate over
+ * all entries in the hash table. Depending on the number of threads and
+ * synchronization objects this might be expensive, but note that it is
+ * very rarely called -- reports are generated only when requested by user=
s.
+ *
+ * Reports are generated as a table where each row represents a call site.=
 A
+ * call site is the triplet formed by the __file__ and __LINE__ of the cal=
ler
+ * as well as the address of the "object" (i.e. mutex, rec. mutex or condv=
ar)
+ * being operated on. Focusing on call sites instead of just on objects mi=
ght
+ * seem puzzling. However, it is a sensible choice since otherwise dealing=
 with
+ * dynamically-allocated objects becomes difficult (e.g. what to do when an
+ * object is destroyed, or reused?). Furthermore, the call site info is of=
 most
+ * importance, since it is callers, and not objects, what cause wait time.
+ *
+ * Alternative designs considered:
+ *
+ * - Use an off-the-shelf profiler such as mutrace. This is not a viable o=
ption
+ *   for us because QEMU has __malloc_hook set (by one of the libraries it
+ *   uses); leaving this hook unset is required to avoid deadlock in mutra=
ce.
+ *
+ * - Use a glib HT for each thread, protecting each HT with its own lock.
+ *   This isn't simpler than the current design, and is 10% slower in the
+ *   atomic_add-bench microbenchmark (-m option).
+ *
+ * - For reports, just use a binary tree as we aggregate data, instead of =
having
+ *   an intermediate hash table. This would simplify the code only slightl=
y, but
+ *   would perform badly if there were many threads and objects to track.
+ *
+ * Related Work:
+ * - Lennart Poettering's mutrace: http://0pointer.de/blog/projects/mutrac=
e.html
+ * - Lozi, David, Thomas, Lawall and Muller. "Remote Core Locking: Migrati=
ng
+ *   Critical-Section Execution to Improve the Performance of Multithreaded
+ *   Applications", USENIX ATC'12.
+ */
+#include "qemu/osdep.h"
+#include "qemu/thread.h"
+#include "qemu/timer.h"
+#include "qemu/qht.h"
+#include "exec/tb-hash-xx.h"
+
+enum qsp_type {
+    QSP_MUTEX,
+    QSP_REC_MUTEX,
+    QSP_CONDVAR,
+};
+
+struct qsp_callsite {
+    const void *obj;
+    const char *file; /* i.e. __FILE__; shortened later */
+    unsigned int line;
+    enum qsp_type type;
+};
+
+struct qsp_entry {
+    void *thread_ptr;
+    const struct qsp_callsite *callsite;
+    uint64_t n_acqs;
+    uint64_t ns;
+};
+
+/* initial sizing for hash tables */
+#define QSP_INITIAL_SIZE 64
+
+/* If this file is moved, QSP_REL_PATH should be updated accordingly */
+#define QSP_REL_PATH "util/qsp.c"
+
+/* this file's full path. Used to present all call sites with relative pat=
hs */
+static size_t qsp_qemu_path_len;
+
+/* the address of qsp_thread gives us a unique 'thread ID' */
+static __thread int qsp_thread;
+
+/*
+ * Call sites are the same for all threads, so we track them in a separate=
 hash
+ * table to save memory.
+ */
+static struct qht qsp_callsite_ht;
+
+static struct qht qsp_ht;
+static bool qsp_initialized, qsp_initializing;
+
+static const char * const qsp_typenames[] =3D {
+    [QSP_MUTEX]     =3D "mutex",
+    [QSP_REC_MUTEX] =3D "rec_mutex",
+    [QSP_CONDVAR]   =3D "condvar",
+};
+
+/*
+ * It pays off to _not_ hash callsite->file; hashing a string is slow, and
+ * without it we still get a pretty unique hash.
+ */
+static inline
+uint32_t do_qsp_callsite_hash(const struct qsp_callsite *callsite, uint64_=
t a)
+{
+    uint64_t b =3D (uint64_t)callsite->obj;
+    uint32_t e =3D callsite->line;
+    uint32_t f =3D callsite->type;
+
+    return tb_hash_func7(a, b, e, f, 0);
+}
+
+static inline
+uint32_t qsp_callsite_hash(const struct qsp_callsite *callsite)
+{
+    return do_qsp_callsite_hash(callsite, 0);
+}
+
+static inline uint32_t do_qsp_entry_hash(const struct qsp_entry *entry,
+                                         uint64_t a)
+{
+    return do_qsp_callsite_hash(entry->callsite, a);
+}
+
+static uint32_t qsp_entry_hash(const struct qsp_entry *entry)
+{
+    return do_qsp_entry_hash(entry, (uint64_t)entry->thread_ptr);
+}
+
+static uint32_t qsp_entry_no_thread_hash(const struct qsp_entry *entry)
+{
+    return do_qsp_entry_hash(entry, 0);
+}
+
+static bool qsp_callsite_cmp(const void *ap, const void *bp)
+{
+    const struct qsp_callsite *a =3D ap;
+    const struct qsp_callsite *b =3D bp;
+
+    return a =3D=3D b ||
+        (a->obj =3D=3D b->obj &&
+         a->line =3D=3D b->line &&
+         a->type =3D=3D b->type &&
+         (a->file =3D=3D b->file || !strcmp(a->file, b->file)));
+}
+
+static bool qsp_entry_no_thread_cmp(const void *ap, const void *bp)
+{
+    const struct qsp_entry *a =3D ap;
+    const struct qsp_entry *b =3D bp;
+
+    return qsp_callsite_cmp(a->callsite, b->callsite);
+}
+
+static bool qsp_entry_cmp(const void *ap, const void *bp)
+{
+    const struct qsp_entry *a =3D ap;
+    const struct qsp_entry *b =3D bp;
+
+    return a->thread_ptr =3D=3D b->thread_ptr &&
+        qsp_callsite_cmp(a->callsite, b->callsite);
+}
+
+/*
+ * Normally we'd call this from a constructor function, but we want it to =
work
+ * via libutil as well.
+ */
+static void qsp_do_init(void)
+{
+    /* make sure this file's path in the tree is up to date with QSP_REL_P=
ATH */
+    g_assert(strstr(__FILE__, QSP_REL_PATH));
+    qsp_qemu_path_len =3D strlen(__FILE__) - strlen(QSP_REL_PATH);
+
+    qht_init(&qsp_ht, qsp_entry_cmp, QSP_INITIAL_SIZE,
+             QHT_MODE_AUTO_RESIZE | QHT_MODE_RAW_MUTEXES);
+    qht_init(&qsp_callsite_ht, qsp_callsite_cmp, QSP_INITIAL_SIZE,
+             QHT_MODE_AUTO_RESIZE | QHT_MODE_RAW_MUTEXES);
+}
+
+static __attribute__((noinline)) void qsp_init__slowpath(void)
+{
+    if (atomic_cmpxchg(&qsp_initializing, false, true) =3D=3D false) {
+        qsp_do_init();
+        atomic_set(&qsp_initialized, true);
+    } else {
+        while (!atomic_read(&qsp_initialized)) {
+            cpu_relax();
+        }
+    }
+}
+
+/* qsp_init() must be called from _all_ exported functions */
+static inline void qsp_init(void)
+{
+    if (likely(atomic_read(&qsp_initialized))) {
+        return;
+    }
+    qsp_init__slowpath();
+}
+
+static struct qsp_callsite *qsp_callsite_find(const struct qsp_callsite *o=
rig)
+{
+    struct qsp_callsite *callsite;
+    uint32_t hash;
+
+    hash =3D qsp_callsite_hash(orig);
+    callsite =3D qht_lookup(&qsp_callsite_ht, orig, hash);
+    if (callsite =3D=3D NULL) {
+        void *existing =3D NULL;
+
+        callsite =3D g_new(struct qsp_callsite, 1);
+        memcpy(callsite, orig, sizeof(*callsite));
+        qht_insert(&qsp_callsite_ht, callsite, hash, &existing);
+        if (unlikely(existing)) {
+            g_free(callsite);
+            callsite =3D existing;
+        }
+    }
+    return callsite;
+}
+
+static struct qsp_entry *qsp_entry_create(struct qht *ht,
+                                          const struct qsp_entry *entry,
+                                          uint32_t hash)
+{
+    struct qsp_entry *e;
+    void *existing =3D NULL;
+
+    e =3D g_new(struct qsp_entry, 1);
+    memcpy(e, entry, sizeof(*e));
+    e->callsite =3D qsp_callsite_find(entry->callsite);
+    /* zero the counters; all other fields are constant */
+    e->n_acqs =3D 0;
+    e->ns =3D 0;
+
+    qht_insert(ht, e, hash, &existing);
+    if (unlikely(existing)) {
+        g_free(e);
+        e =3D existing;
+    }
+    return e;
+}
+
+static struct qsp_entry *qsp_entry_find(struct qht *ht,
+                                        const struct qsp_entry *entry,
+                                        uint32_t hash)
+{
+    struct qsp_entry *e;
+
+    e =3D qht_lookup(ht, entry, hash);
+    if (e =3D=3D NULL) {
+        e =3D qsp_entry_create(ht, entry, hash);
+    }
+    return e;
+}
+
+static struct qsp_entry *qsp_entry_get(const void *obj, const char *file,
+                                       unsigned line, enum qsp_type type)
+{
+    struct qsp_callsite callsite =3D {
+        .obj =3D obj,
+        .file =3D file,
+        .line =3D line,
+        .type =3D type,
+    };
+    struct qsp_entry orig;
+    uint32_t hash;
+
+    qsp_init();
+
+    orig.thread_ptr =3D &qsp_thread;
+    orig.callsite =3D &callsite;
+
+    hash =3D qsp_entry_hash(&orig);
+    return qsp_entry_find(&qsp_ht, &orig, hash);
+}
+
+#define QSP_GEN_VOID(type_, qsp_t_, func_, impl_)                       \
+    void func_(type_ *obj, const char *file, unsigned line)             \
+    {                                                                   \
+        struct qsp_entry *e =3D qsp_entry_get(obj, file, line, qsp_t_);   \
+        int64_t t;                                                      \
+                                                                        \
+        t =3D get_clock();                                                \
+        impl_(obj, file, line);                                         \
+        atomic_set(&e->ns, e->ns + get_clock() - t);                    \
+        atomic_set(&e->n_acqs, e->n_acqs + 1);                          \
+    }
+
+#define QSP_GEN_RET1(type_, qsp_t_, func_, impl_)                       \
+    int func_(type_ *obj, const char *file, unsigned line)              \
+    {                                                                   \
+        struct qsp_entry *e =3D qsp_entry_get(obj, file, line, qsp_t_);   \
+        int64_t t;                                                      \
+        int err;                                                        \
+                                                                        \
+        t =3D get_clock();                                                \
+        err =3D impl_(obj, file, line);                                   \
+        atomic_set(&e->ns, e->ns + get_clock() - t);                    \
+        if (!err) {                                                     \
+            atomic_set(&e->n_acqs, e->n_acqs + 1);                      \
+        }                                                               \
+        return err;                                                     \
+    }
+
+QSP_GEN_VOID(QemuMutex, QSP_MUTEX, qsp_mutex_lock, qemu_mutex_lock_impl)
+QSP_GEN_RET1(QemuMutex, QSP_MUTEX, qsp_mutex_trylock, qemu_mutex_trylock_i=
mpl)
+
+QSP_GEN_VOID(QemuRecMutex, QSP_REC_MUTEX, qsp_rec_mutex_lock,
+             qemu_rec_mutex_lock_impl)
+QSP_GEN_RET1(QemuRecMutex, QSP_REC_MUTEX, qsp_rec_mutex_trylock,
+             qemu_rec_mutex_trylock_impl)
+
+void qsp_cond_wait(QemuCond *cond, QemuMutex *mutex, const char *file,
+                   unsigned line)
+{
+    struct qsp_entry *e;
+    int64_t t;
+
+    qsp_init();
+
+    e =3D qsp_entry_get(cond, file, line, QSP_CONDVAR);
+    t =3D get_clock();
+    qemu_cond_wait_impl(cond, mutex, file, line);
+    atomic_set(&e->ns, e->ns + get_clock() - t);
+    atomic_set(&e->n_acqs, e->n_acqs + 1);
+}
+
+static gint qsp_tree_cmp(gconstpointer ap, gconstpointer bp, gpointer up)
+{
+    const struct qsp_entry *a =3D ap;
+    const struct qsp_entry *b =3D bp;
+
+    if (a->ns > b->ns) {
+        return -1;
+    } else if (a->ns < b->ns) {
+        return 1;
+    } else {
+        const struct qsp_callsite *ca =3D a->callsite;
+        const struct qsp_callsite *cb =3D b->callsite;
+
+        /* same ns. Break the tie with the object's address */
+        if (ca->obj < cb->obj) {
+            return -1;
+        } else if (ca->obj > cb->obj) {
+            return 1;
+        } else {
+            int cmp;
+
+            /* same obj. Break the tie with the callsite's file */
+            cmp =3D strcmp(ca->file, cb->file);
+            if (cmp) {
+                return cmp;
+            }
+            /* same callsite file. Break the tie with the callsite's line =
*/
+            g_assert(ca->line !=3D cb->line);
+            if (ca->line < cb->line) {
+                return -1;
+            } else if (ca->line > cb->line) {
+                return 1;
+            } else {
+                /* break the tie with the callsite's type */
+                return cb->type - ca->type;
+            }
+        }
+    }
+}
+
+static void qsp_sort(struct qht *ht, void *p, uint32_t h, void *userp)
+{
+    struct qsp_entry *e =3D p;
+    GTree *tree =3D userp;
+
+    g_tree_insert(tree, e, NULL);
+}
+
+static void qsp_aggregate(struct qht *global_ht, void *p, uint32_t h, void=
 *up)
+{
+    struct qht *ht =3D up;
+    const struct qsp_entry *e =3D p;
+    struct qsp_entry *agg;
+    uint32_t hash;
+
+    hash =3D qsp_entry_no_thread_hash(e);
+    agg =3D qsp_entry_find(ht, e, hash);
+    agg->ns +=3D e->ns;
+    agg->n_acqs +=3D e->n_acqs;
+}
+
+static void qsp_mktree(GTree *tree)
+{
+    struct qht ht;
+
+    /* first, create a hash table to aggregate all results */
+    qht_init(&ht, qsp_entry_no_thread_cmp, QSP_INITIAL_SIZE,
+             QHT_MODE_AUTO_RESIZE | QHT_MODE_RAW_MUTEXES);
+
+    /* then fill in the hash table by iterating over the global hash table=
 */
+    qht_iter(&qsp_ht, qsp_aggregate, &ht);
+
+    /* sort the hash table elements by using a tree */
+    qht_iter(&ht, qsp_sort, tree);
+
+    /* free the hash table, but keep the elements (those are in the tree n=
ow) */
+    qht_destroy(&ht);
+}
+
+/* free string with g_free */
+static char *qsp_at(const struct qsp_callsite *callsite)
+{
+    GString *s =3D g_string_new("");
+    const char *shortened;
+
+    /* remove the absolute path to qemu */
+    if (unlikely(strlen(callsite->file) < qsp_qemu_path_len)) {
+        shortened =3D callsite->file;
+    } else {
+        shortened =3D callsite->file + qsp_qemu_path_len;
+    }
+    g_string_append_printf(s, "%s:%u", shortened, callsite->line);
+    return g_string_free(s, FALSE);
+}
+
+struct qsp_report_entry {
+    const void *obj;
+    char *callsite_at;
+    const char *typename;
+    double time_s;
+    double ns_avg;
+    uint64_t n_acqs;
+};
+
+struct qsp_report {
+    struct qsp_report_entry *entries;
+    size_t n_entries;
+    size_t max_n_entries;
+};
+
+static gboolean qsp_tree_report(gpointer key, gpointer value, gpointer uda=
ta)
+{
+    const struct qsp_entry *e =3D key;
+    struct qsp_report *report =3D udata;
+    struct qsp_report_entry *entry;
+
+    if (report->n_entries =3D=3D report->max_n_entries) {
+        return TRUE;
+    }
+    entry =3D &report->entries[report->n_entries];
+    report->n_entries++;
+
+    entry->obj =3D e->callsite->obj;
+    entry->callsite_at =3D qsp_at(e->callsite);
+    entry->typename =3D qsp_typenames[e->callsite->type];
+    entry->time_s =3D e->ns * 1e-9;
+    entry->n_acqs =3D e->n_acqs;
+    entry->ns_avg =3D e->n_acqs ? e->ns / e->n_acqs : 0;
+    return FALSE;
+}
+
+static void
+pr_report(const struct qsp_report *rep, FILE *f, fprintf_function pr)
+{
+    char *dashes;
+    size_t max_len =3D 0;
+    int callsite_len =3D 0;
+    int callsite_rspace;
+    int n_dashes;
+    size_t i;
+
+    /* find out the maximum length of all 'callsite' fields */
+    for (i =3D 0; i < rep->n_entries; i++) {
+        const struct qsp_report_entry *e =3D &rep->entries[i];
+        size_t len =3D strlen(e->callsite_at);
+
+        if (len > max_len) {
+            max_len =3D len;
+        }
+    }
+
+    callsite_len =3D MAX(max_len, strlen("Call site"));
+    /* white space to leave to the right of "Call site" */
+    callsite_rspace =3D callsite_len - strlen("Call site");
+
+    pr(f, "Type               Object  Call site%*s  Wait Time (s)  "
+       "       Count  Average (us)\n", callsite_rspace, "");
+
+    /* build a horizontal rule with dashes */
+    n_dashes =3D 79 + callsite_rspace;
+    dashes =3D g_malloc(n_dashes + 1);
+    memset(dashes, '-', n_dashes);
+    dashes[n_dashes] =3D '\0';
+    pr(f, "%s\n", dashes);
+
+    for (i =3D 0; i < rep->n_entries; i++) {
+        const struct qsp_report_entry *e =3D &rep->entries[i];
+
+        pr(f, "%-9s  %14p  %s%*s  %13.5f  %12" PRIu64 "  %12.2f\n", e->typ=
ename,
+           e->obj, e->callsite_at, callsite_len - (int)strlen(e->callsite_=
at),
+           "", e->time_s, e->n_acqs, e->ns_avg * 1e-3);
+    }
+
+    pr(f, "%s\n", dashes);
+    g_free(dashes);
+}
+
+static void report_destroy(struct qsp_report *rep)
+{
+    size_t i;
+
+    for (i =3D 0; i < rep->n_entries; i++) {
+        struct qsp_report_entry *e =3D &rep->entries[i];
+
+        g_free(e->callsite_at);
+    }
+    g_free(rep->entries);
+}
+
+void qsp_report(FILE *f, fprintf_function cpu_fprintf, size_t max)
+{
+    GTree *tree =3D g_tree_new_full(qsp_tree_cmp, NULL, g_free, NULL);
+    struct qsp_report rep;
+
+    qsp_init();
+
+    rep.entries =3D g_new0(struct qsp_report_entry, max);
+    rep.n_entries =3D 0;
+    rep.max_n_entries =3D max;
+
+    qsp_mktree(tree);
+    g_tree_foreach(tree, qsp_tree_report, &rep);
+    g_tree_destroy(tree);
+
+    pr_report(&rep, f, cpu_fprintf);
+    report_destroy(&rep);
+}
diff --git a/util/Makefile.objs b/util/Makefile.objs
index e1c3fed4dc..579faa2ca6 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -3,6 +3,7 @@ util-obj-y +=3D bufferiszero.o
 util-obj-y +=3D lockcnt.o
 util-obj-y +=3D aiocb.o async.o aio-wait.o thread-pool.o qemu-timer.o
 util-obj-y +=3D main-loop.o iohandler.o
+util-obj-$(CONFIG_SYNC_PROFILER) +=3D qsp.o
 util-obj-$(CONFIG_POSIX) +=3D aio-posix.o
 util-obj-$(CONFIG_POSIX) +=3D compatfd.o
 util-obj-$(CONFIG_POSIX) +=3D event_notifier-posix.o
--=20
2.17.1