From: "Emilio G. Cota" <cota@braap.org>
Cota" To: qemu-devel@nongnu.org Date: Mon, 14 Jan 2019 11:50:17 -0500 Message-Id: <20190114165017.27298-4-cota@braap.org> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190114165017.27298-1-cota@braap.org> References: <20190114165017.27298-1-cota@braap.org> X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Received-From: 66.111.4.27 Subject: [Qemu-devel] [PATCH v6 3/3] tcg/i386: enable dynamic TLB sizing X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: =?UTF-8?q?Alex=20Benn=C3=A9e?= , Richard Henderson Errors-To: qemu-devel-bounces+importer=patchew.org@nongnu.org Sender: "Qemu-devel" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" As the following experiments show, this series is a net perf gain, particularly for memory-heavy workloads. Experiments are run on an Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz. 1. System boot + shudown, debian aarch64: - Before (v3.1.0): Performance counter stats for './die.sh v3.1.0' (10 runs): 9019.797015 task-clock (msec) # 0.993 CPUs utilized = ( +- 0.23% ) 29,910,312,379 cycles # 3.316 GHz = ( +- 0.14% ) 54,699,252,014 instructions # 1.83 insn per cycle= ( +- 0.08% ) 10,061,951,686 branches # 1115.541 M/sec = ( +- 0.08% ) 172,966,530 branch-misses # 1.72% of all branche= s ( +- 0.07% ) 9.084039051 seconds time elapsed = ( +- 0.23% ) - After: Performance counter stats for './die.sh tlb-dyn-v5' (10 runs): 8624.084842 task-clock (msec) # 0.993 CPUs utilized = ( +- 0.23% ) 28,556,123,404 cycles # 3.311 GHz = ( +- 0.13% ) 51,755,089,512 instructions # 1.81 insn per cycle= ( +- 0.05% ) 9,526,513,946 branches # 1104.641 M/sec = ( +- 0.05% ) 166,578,509 branch-misses # 1.75% of all branche= s ( +- 0.19% ) 8.680540350 seconds time elapsed = ( +- 0.24% ) That is, a 4.4% perf increase. 2. System boot + shutdown, ubuntu 18.04 x86_64: - Before (v3.1.0): 56100.574751 task-clock (msec) # 1.016 CPUs utilized = ( +- 4.81% ) 200,745,466,128 cycles # 3.578 GHz = ( +- 5.24% ) 431,949,100,608 instructions # 2.15 insn per cycle= ( +- 5.65% ) 77,502,383,330 branches # 1381.490 M/sec = ( +- 6.18% ) 844,681,191 branch-misses # 1.09% of all branche= s ( +- 3.82% ) 55.221556378 seconds time elapsed = ( +- 5.01% ) - After: 56603.419540 task-clock (msec) # 1.019 CPUs utilized = ( +- 10.19% ) 202,217,930,479 cycles # 3.573 GHz = ( +- 10.69% ) 439,336,291,626 instructions # 2.17 insn per cycle= ( +- 14.14% ) 80,538,357,447 branches # 1422.853 M/sec = ( +- 16.09% ) 776,321,622 branch-misses # 0.96% of all branche= s ( +- 3.77% ) 55.549661409 seconds time elapsed = ( +- 10.44% ) No improvement (within noise range). Note that for this workload, increasing the time window too much can lead to perf degradation, since it flushes the TLB *very* frequently. 3. x86_64 SPEC06int: x86_64-softmmu speedup vs. 
x86_64-softmmu speedup vs. v3.1.0 for SPEC06int (test set)
Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Skylake)

  [per-benchmark ASCII bar chart of tlb-dyn-v5 speedups omitted; see the png below]
  png: https://imgur.com/YRF90f7

That is, a 1.51x average speedup over the baseline, with a max speedup
of 5.17x.

Here's a different look at the SPEC06int results, using KVM as the
baseline:

x86_64-softmmu slowdown vs. KVM for SPEC06int (test set)
Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Skylake)

  [per-benchmark ASCII bar chart comparing v3.1.0 and tlb-dyn-v5 omitted; see the png below]
  png: https://imgur.com/YzAMNEV

After this series, we bring down the average SPEC06int slowdown vs KVM
from 11.47x to 7.58x.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/i386/tcg-target.h     |  2 +-
 tcg/i386/tcg-target.inc.c | 28 ++++++++++++++--------------
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index bd7d37c7ef..bdcf613f65 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -27,7 +27,7 @@
 
 #define TCG_TARGET_INSN_UNIT_SIZE  1
 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31
-#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0
+#define TCG_TARGET_IMPLEMENTS_DYN_TLB 1
 
 #ifdef __x86_64__
 # define TCG_TARGET_REG_BITS 64
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 1b4e3b80e1..df8b20755c 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -329,6 +329,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type,
 #define OPC_ARITH_GvEv	(0x03)		/* ... plus (ARITH_FOO << 3) */
 #define OPC_ANDN        (0xf2 | P_EXT38)
 #define OPC_ADD_GvEv	(OPC_ARITH_GvEv | (ARITH_ADD << 3))
+#define OPC_AND_GvEv    (OPC_ARITH_GvEv | (ARITH_AND << 3))
 #define OPC_BLENDPS     (0x0c | P_EXT3A | P_DATA16)
 #define OPC_BSF         (0xbc | P_EXT)
 #define OPC_BSR         (0xbd | P_EXT)
@@ -1621,7 +1622,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
     }
     if (TCG_TYPE_PTR == TCG_TYPE_I64) {
         hrexw = P_REXW;
-        if (TARGET_PAGE_BITS + CPU_TLB_BITS > 32) {
+        if (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32) {
             tlbtype = TCG_TYPE_I64;
             tlbrexw = P_REXW;
         }
@@ -1629,6 +1630,15 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
     }
 
     tcg_out_mov(s, tlbtype, r0, addrlo);
+    tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0,
+                   TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
+
+    tcg_out_modrm_offset(s, OPC_AND_GvEv + trexw, r0, TCG_AREG0,
+                         offsetof(CPUArchState, tlb_mask[mem_index]));
+
+    tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r0, TCG_AREG0,
+                         offsetof(CPUArchState, tlb_table[mem_index]));
+
     /* If the required alignment is at least as large as the access, simply
        copy the address and mask.  For lesser alignments, check that we don't
        cross pages for the complete access. */
@@ -1638,20 +1648,10 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
         tcg_out_modrm_offset(s, OPC_LEA + trexw, r1, addrlo, s_mask - a_mask);
     }
     tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask;
-
-    tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0,
-                   TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
-
     tgen_arithi(s, ARITH_AND + trexw, r1, tlb_mask, 0);
-    tgen_arithi(s, ARITH_AND + tlbrexw, r0,
-                (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS, 0);
-
-    tcg_out_modrm_sib_offset(s, OPC_LEA + hrexw, r0, TCG_AREG0, r0, 0,
-                             offsetof(CPUArchState, tlb_table[mem_index][0])
-                             + which);
 
     /* cmp 0(r0), r1 */
-    tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, 0);
+    tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, which);
 
     /* Prepare for both the fast path add of the tlb addend, and the slow
        path function argument setup.  */
@@ -1664,7 +1664,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
 
     if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
         /* cmp 4(r0), addrhi */
-        tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, 4);
+        tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, which + 4);
 
         /* jne slow_path */
         tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
@@ -1676,7 +1676,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
 
     /* add addend(r0), r1 */
     tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r1, r0,
-                         offsetof(CPUTLBEntry, addend) - which);
+                         offsetof(CPUTLBEntry, addend));
 }
 
 /*
-- 
2.17.1
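
To make the diff above easier to follow, here is a rough C-level sketch of
what the x86-64 fast path emitted by tcg_out_tlb_load() computes after this
patch: shift the guest address, AND it with the per-MMU-index tlb_mask loaded
from env, ADD the tlb_table base loaded from env, compare the tag at offset
'which', then add the addend on a hit. This is a minimal sketch, not QEMU
code: the Env/CPUTLBEntry definitions, the constants (TARGET_PAGE_BITS = 12,
CPU_TLB_ENTRY_BITS = 5, four MMU indexes) and the omission of the alignment
bits folded into the real comparison are simplifications for illustration.

#include <stddef.h>
#include <stdint.h>

#define TARGET_PAGE_BITS   12
#define TARGET_PAGE_MASK   (~(((uint64_t)1 << TARGET_PAGE_BITS) - 1))
#define CPU_TLB_ENTRY_BITS 5    /* log2(sizeof(CPUTLBEntry)) on a 64-bit host */

typedef uint64_t target_ulong;  /* assume a 64-bit guest for simplicity */

typedef struct CPUTLBEntry {    /* stand-in for the real QEMU struct */
    target_ulong addr_read;
    target_ulong addr_write;
    target_ulong addr_code;
    uintptr_t addend;           /* guest-RAM-to-host-pointer offset */
} CPUTLBEntry;

typedef struct Env {            /* stand-in for CPUArchState */
    uintptr_t tlb_mask[4];      /* (n_entries - 1) << CPU_TLB_ENTRY_BITS */
    CPUTLBEntry *tlb_table[4];  /* dynamically sized, one table per MMU index */
} Env;

/* Returns the host address on a TLB hit, or 0 to signal "take the slow path".
 * 'which' is offsetof(CPUTLBEntry, addr_read/addr_write), as in the patch. */
static uintptr_t tlb_fast_path(Env *env, int mem_index, target_ulong addr,
                               size_t which)
{
    /* shr r0, (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS); and r0, tlb_mask[mem_index](env) */
    uintptr_t ofs = (uintptr_t)(addr >> (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS))
                    & env->tlb_mask[mem_index];

    /* add r0, tlb_table[mem_index](env): r0 now points at the candidate entry */
    CPUTLBEntry *entry = (CPUTLBEntry *)((uintptr_t)env->tlb_table[mem_index] + ofs);

    /* cmp which(r0), r1: compare the page-aligned address against the tag */
    target_ulong tag = *(target_ulong *)((char *)entry + which);
    if ((addr & TARGET_PAGE_MASK) != tag) {
        return 0;   /* miss: the real code jumps to the slow-path helper here */
    }

    /* add addend(r0), r1: host address = guest address + addend */
    return (uintptr_t)addr + entry->addend;
}

Roughly speaking, compared with the previous fixed-size lookup, the AND with
an immediate mask becomes an AND with a mask loaded from env, and the LEA of
the fixed table base becomes an ADD of a table pointer loaded from env, so the
fast-path instruction count stays the same while the TLB can be resized at
runtime.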