From: "Emilio G. Cota" <cota@braap.org>
Cota" To: qemu-devel@nongnu.org Date: Mon, 14 Jan 2019 11:50:17 -0500 Message-Id: <20190114165017.27298-4-cota@braap.org> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190114165017.27298-1-cota@braap.org> References: <20190114165017.27298-1-cota@braap.org> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 66.111.4.27 Subject: [Qemu-devel] [PATCH v6 3/3] tcg/i386: enable dynamic TLB sizing X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: =?UTF-8?q?Alex=20Benn=C3=A9e?= , Richard Henderson Errors-To: qemu-devel-bounces+importer=patchew.org@nongnu.org Sender: "Qemu-devel" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" As the following experiments show, this series is a net perf gain, particularly for memory-heavy workloads. Experiments are run on an Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz. 1. System boot + shudown, debian aarch64: - Before (v3.1.0): Performance counter stats for './die.sh v3.1.0' (10 runs): 9019.797015 task-clock (msec) # 0.993 CPUs utilized = ( +- 0.23% ) 29,910,312,379 cycles # 3.316 GHz = ( +- 0.14% ) 54,699,252,014 instructions # 1.83 insn per cycle= ( +- 0.08% ) 10,061,951,686 branches # 1115.541 M/sec = ( +- 0.08% ) 172,966,530 branch-misses # 1.72% of all branche= s ( +- 0.07% ) 9.084039051 seconds time elapsed = ( +- 0.23% ) - After: Performance counter stats for './die.sh tlb-dyn-v5' (10 runs): 8624.084842 task-clock (msec) # 0.993 CPUs utilized = ( +- 0.23% ) 28,556,123,404 cycles # 3.311 GHz = ( +- 0.13% ) 51,755,089,512 instructions # 1.81 insn per cycle= ( +- 0.05% ) 9,526,513,946 branches # 1104.641 M/sec = ( +- 0.05% ) 166,578,509 branch-misses # 1.75% of all branche= s ( +- 0.19% ) 8.680540350 seconds time elapsed = ( +- 0.24% ) That is, a 4.4% perf increase. 2. System boot + shutdown, ubuntu 18.04 x86_64: - Before (v3.1.0): 56100.574751 task-clock (msec) # 1.016 CPUs utilized = ( +- 4.81% ) 200,745,466,128 cycles # 3.578 GHz = ( +- 5.24% ) 431,949,100,608 instructions # 2.15 insn per cycle= ( +- 5.65% ) 77,502,383,330 branches # 1381.490 M/sec = ( +- 6.18% ) 844,681,191 branch-misses # 1.09% of all branche= s ( +- 3.82% ) 55.221556378 seconds time elapsed = ( +- 5.01% ) - After: 56603.419540 task-clock (msec) # 1.019 CPUs utilized = ( +- 10.19% ) 202,217,930,479 cycles # 3.573 GHz = ( +- 10.69% ) 439,336,291,626 instructions # 2.17 insn per cycle= ( +- 14.14% ) 80,538,357,447 branches # 1422.853 M/sec = ( +- 16.09% ) 776,321,622 branch-misses # 0.96% of all branche= s ( +- 3.77% ) 55.549661409 seconds time elapsed = ( +- 10.44% ) No improvement (within noise range). Note that for this workload, increasing the time window too much can lead to perf degradation, since it flushes the TLB *very* frequently. 3. x86_64 SPEC06int: x86_64-softmmu speedup vs. 
x86_64-softmmu speedup vs. v3.1.0 for SPEC06int (test set)
Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Skylake)

  [per-benchmark ASCII bar chart of tlb-dyn-v5 speedups omitted; see the png below]
  png: https://imgur.com/YRF90f7

That is, a 1.51x average speedup over the baseline, with a max speedup
of 5.17x.

Here's a different look at the SPEC06int results, using KVM as the
baseline:

x86_64-softmmu slowdown vs. KVM for SPEC06int (test set)
Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Skylake)

  [per-benchmark ASCII bar chart comparing v3.1.0 and tlb-dyn-v5 omitted; see the png below]
  png: https://imgur.com/YzAMNEV

After this series, we bring down the average SPEC06int slowdown vs KVM
from 11.47x to 7.58x.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/i386/tcg-target.h     |  2 +-
 tcg/i386/tcg-target.inc.c | 28 ++++++++++++++--------------
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index bd7d37c7ef..bdcf613f65 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -27,7 +27,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE  1
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
-#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0
+#define TCG_TARGET_IMPLEMENTS_DYN_TLB 1
 
 #ifdef __x86_64__
 # define TCG_TARGET_REG_BITS 64
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 1b4e3b80e1..df8b20755c 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -329,6 +329,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_ARITH_GvEv	(0x03)		/* ... plus (ARITH_FOO << 3) */
 #define OPC_ANDN        (0xf2 | P_EXT38)
 #define OPC_ADD_GvEv	(OPC_ARITH_GvEv | (ARITH_ADD << 3))
+#define OPC_AND_GvEv    (OPC_ARITH_GvEv | (ARITH_AND << 3))
 #define OPC_BLENDPS     (0x0c | P_EXT3A | P_DATA16)
 #define OPC_BSF         (0xbc | P_EXT)
 #define OPC_BSR         (0xbd | P_EXT)
@@ -1621,7 +1622,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
     }
     if (TCG_TYPE_PTR == TCG_TYPE_I64) {
         hrexw = P_REXW;
-        if (TARGET_PAGE_BITS + CPU_TLB_BITS > 32) {
+        if (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32) {
             tlbtype = TCG_TYPE_I64;
             tlbrexw = P_REXW;
         }
@@ -1629,6 +1630,15 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
     }
 
     tcg_out_mov(s, tlbtype, r0, addrlo);
+    tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0,
+                   TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
+
+    tcg_out_modrm_offset(s, OPC_AND_GvEv + trexw, r0, TCG_AREG0,
+                         offsetof(CPUArchState, tlb_mask[mem_index]));
+
+    tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r0, TCG_AREG0,
+                         offsetof(CPUArchState, tlb_table[mem_index]));
+
     /* If the required alignment is at least as large as the access, simply
        copy the address and mask.  For lesser alignments, check that we don't
        cross pages for the complete access. */
@@ -1638,20 +1648,10 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
         tcg_out_modrm_offset(s, OPC_LEA + trexw, r1, addrlo, s_mask - a_mask);
     }
     tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
-
-    tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0,
-                   TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
-
     tgen_arithi(s, ARITH_AND + trexw, r1, tlb_mask, 0);
-    tgen_arithi(s, ARITH_AND + tlbrexw, r0,
-                (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS, 0);
-
-    tcg_out_modrm_sib_offset(s, OPC_LEA + hrexw, r0, TCG_AREG0, r0, 0,
-                             offsetof(CPUArchState, tlb_table[mem_index][0])
-                             + which);
 
     /* cmp 0(r0), r1 */
-    tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, 0);
+    tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, which);
 
     /* Prepare for both the fast path add of the tlb addend, and the slow
        path function argument setup.  */
@@ -1664,7 +1664,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
 
     if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
         /* cmp 4(r0), addrhi */
-        tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, 4);
+        tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, which + 4);
 
         /* jne slow_path */
         tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
@@ -1676,7 +1676,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
 
     /* add addend(r0), r1 */
     tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r1, r0,
-                         offsetof(CPUTLBEntry, addend) - which);
+                         offsetof(CPUTLBEntry, addend));
 }
 
 /*
-- 
2.17.1
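
To make the diff above easier to follow, here is a rough C-level sketch of
what the x86-64 fast path emitted by tcg_out_tlb_load() computes after this
patch: shift the guest address, AND it with the per-MMU-index tlb_mask loaded
from env, ADD the tlb_table base loaded from env, compare the tag at offset
'which', then add the addend on a hit. This is a minimal sketch, not QEMU
code: the Env/CPUTLBEntry definitions, the constants (TARGET_PAGE_BITS = 12,
CPU_TLB_ENTRY_BITS = 5, four MMU indexes) and the omission of the alignment
bits folded into the real comparison are simplifications for illustration.

#include <stddef.h>
#include <stdint.h>

#define TARGET_PAGE_BITS   12
#define TARGET_PAGE_MASK   (~(((uint64_t)1 << TARGET_PAGE_BITS) - 1))
#define CPU_TLB_ENTRY_BITS 5    /* log2(sizeof(CPUTLBEntry)) on a 64-bit host */

typedef uint64_t target_ulong;  /* assume a 64-bit guest for simplicity */

typedef struct CPUTLBEntry {    /* stand-in for the real QEMU struct */
    target_ulong addr_read;
    target_ulong addr_write;
    target_ulong addr_code;
    uintptr_t addend;           /* guest-RAM-to-host-pointer offset */
} CPUTLBEntry;

typedef struct Env {            /* stand-in for CPUArchState */
    uintptr_t tlb_mask[4];      /* (n_entries - 1) << CPU_TLB_ENTRY_BITS */
    CPUTLBEntry *tlb_table[4];  /* dynamically sized, one table per MMU index */
} Env;

/* Returns the host address on a TLB hit, or 0 to signal "take the slow path".
 * 'which' is offsetof(CPUTLBEntry, addr_read/addr_write), as in the patch. */
static uintptr_t tlb_fast_path(Env *env, int mem_index, target_ulong addr,
                               size_t which)
{
    /* shr r0, (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS); and r0, tlb_mask[mem_index](env) */
    uintptr_t ofs = (uintptr_t)(addr >> (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS))
                    & env->tlb_mask[mem_index];

    /* add r0, tlb_table[mem_index](env): r0 now points at the candidate entry */
    CPUTLBEntry *entry = (CPUTLBEntry *)((uintptr_t)env->tlb_table[mem_index] + ofs);

    /* cmp which(r0), r1: compare the page-aligned address against the tag */
    target_ulong tag = *(target_ulong *)((char *)entry + which);
    if ((addr & TARGET_PAGE_MASK) != tag) {
        return 0;   /* miss: the real code jumps to the slow-path helper here */
    }

    /* add addend(r0), r1: host address = guest address + addend */
    return (uintptr_t)addr + entry->addend;
}

Roughly speaking, compared with the previous fixed-size lookup, the AND with
an immediate mask becomes an AND with a mask loaded from env, and the LEA of
the fixed table base becomes an ADD of a table pointer loaded from env, so the
fast-path instruction count stays the same while the TLB can be resized at
runtime.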