From: "Emilio G. Cota"
To: qemu-devel@nongnu.org
Cc: Alex Bennée, Pranith Kumar, Richard Henderson
Date: Sat, 6 Oct 2018 17:45:08 -0400
Message-Id: <20181006214508.5331-7-cota@braap.org>
In-Reply-To: <20181006214508.5331-1-cota@braap.org>
References: <20181006214508.5331-1-cota@braap.org>
Subject: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate

Perform the resizing only on flushes, otherwise we'd have to take a
perf hit by either rehashing the array or unnecessarily flushing it.
We grow the array aggressively, and reduce the size more slowly. This
accommodates mixed workloads, where some processes might be memory-heavy
while others are not.

As the following experiments show, this is a net perf gain, particularly
for memory-heavy workloads. Experiments are run on an Intel i7-6700K
CPU @ 4.00GHz.

1. System boot + shutdown, debian aarch64:

- Before (tb-lock-v3):

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7469.363393      task-clock (msec)  #    0.998 CPUs utilized    ( +- 0.07% )
    31,507,707,190      cycles             #    4.218 GHz              ( +- 0.07% )
    57,101,577,452      instructions       #    1.81  insns per cycle  ( +- 0.08% )
    10,265,531,804      branches           # 1374.352 M/sec            ( +- 0.07% )
       173,020,681      branch-misses      #    1.69% of all branches  ( +- 0.10% )

       7.483359063 seconds time elapsed                                ( +- 0.08% )

- After:

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7185.036730      task-clock (msec)  #    0.999 CPUs utilized    ( +- 0.11% )
    30,303,501,143      cycles             #    4.218 GHz              ( +- 0.11% )
    54,198,386,487      instructions       #    1.79  insns per cycle  ( +- 0.08% )
     9,726,518,945      branches           # 1353.719 M/sec            ( +- 0.08% )
       167,082,307      branch-misses      #    1.72% of all branches  ( +- 0.08% )

       7.195597842 seconds time elapsed                                ( +- 0.11% )

That is, a 3.8% improvement.

2. System boot + shutdown, ubuntu 18.04 x86_64:

- Before (tb-lock-v3):

 Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):

      49971.036482      task-clock (msec)  #    0.999 CPUs utilized    ( +- 1.62% )
   210,766,077,140      cycles             #    4.218 GHz              ( +- 1.63% )
   428,829,830,790      instructions       #    2.03  insns per cycle  ( +- 0.75% )
    77,313,384,038      branches           # 1547.164 M/sec            ( +- 0.54% )
       835,610,706      branch-misses      #    1.08% of all branches  ( +- 2.97% )

      50.003855102 seconds time elapsed                                ( +- 1.61% )

- After:

 Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):

      50118.124477      task-clock (msec)  #    0.999 CPUs utilized    ( +- 4.30% )
           132,396      context-switches   #    0.003 M/sec            ( +- 1.20% )
                 0      cpu-migrations     #    0.000 K/sec            ( +-100.00% )
           167,754      page-faults        #    0.003 M/sec            ( +- 0.06% )
   211,414,701,601      cycles             #    4.218 GHz              ( +- 4.30% )
                        stalled-cycles-frontend
                        stalled-cycles-backend
   431,618,818,597      instructions       #    2.04  insns per cycle  ( +- 6.40% )
    80,197,256,524      branches           # 1600.165 M/sec            ( +- 8.59% )
       794,830,352      branch-misses      #    0.99% of all branches  ( +- 2.05% )

      50.177077175 seconds time elapsed                                ( +- 4.23% )

No improvement (within noise range).

3. x86_64 SPEC06int:

  [gnuplot ASCII chart garbled in transit: "SPEC06int (test set)",
   Y axis: speedup over master; bars for tlb-lock-v3, +indirection and
   +resizing across the SPEC06int benchmarks plus their geomean. See
   the png link below for a readable rendering.]

png: https://imgur.com/a/b1wn3wc

That is, a 1.53x average speedup over master, with a max speedup of 7.13x.

Note that "indirection" (i.e. the first patch in this series) incurs
no overhead, on average.

Signed-off-by: Emilio G. Cota
---
 include/exec/cpu-defs.h |  1 +
 accel/tcg/cputlb.c      | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 27b9433976..4d1d6b2b8b 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -145,6 +145,7 @@ typedef struct CPUTLBDesc {
     size_t size;
     size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
     size_t used;
+    size_t n_flushes_low_rate;
 } CPUTLBDesc;
 
 #define CPU_COMMON_TLB \
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 1ca71ecfc4..afb61e9c2b 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -85,6 +85,7 @@ void tlb_init(CPUState *cpu)
         desc->size = MIN_CPU_TLB_SIZE;
         desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
         desc->used = 0;
+        desc->n_flushes_low_rate = 0;
         env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
         env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
     }
@@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
     return count;
 }
 
+/* Call with tlb_lock held */
+static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
+{
+    CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
+    size_t rate = desc->used * 100 / desc->size;
+    size_t new_size = desc->size;
+
+    if (rate == 100) {
+        new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
+    } else if (rate > 70) {
+        new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
+    } else if (rate < 30) {
+        desc->n_flushes_low_rate++;
+        if (desc->n_flushes_low_rate == 100) {
+            new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
+            desc->n_flushes_low_rate = 0;
+        }
+    }
+
+    if (new_size == desc->size) {
+        return;
+    }
+
+    g_free(env->tlb_table[mmu_idx]);
+    g_free(env->iotlb[mmu_idx]);
+
+    desc->size = new_size;
+    desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
+    desc->n_flushes_low_rate = 0;
+    env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, desc->size);
+    env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, desc->size);
+}
+
 /* This is OK because CPU architectures generally permit an
  * implementation to drop entries from the TLB at any time, so
  * flushing more entries than required is only an efficiency issue,
@@ -151,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
      */
     qemu_spin_lock(&env->tlb_lock);
     for (i = 0; i < NB_MMU_MODES; i++) {
+        tlb_mmu_resize_locked(env, i);
         memset(env->tlb_table[i], -1,
                env->tlb_desc[i].size * sizeof(CPUTLBEntry));
         env->tlb_desc[i].used = 0;
@@ -215,6 +250,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
         if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
             tlb_debug("%d\n", mmu_idx);
 
+            tlb_mmu_resize_locked(env, mmu_idx);
             memset(env->tlb_table[mmu_idx], -1,
                    env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
             memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
-- 
2.17.1