[Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate

Emilio G. Cota posted 6 patches 7 years, 1 month ago
There is a newer version of this series
Posted by Emilio G. Cota 7 years, 1 month ago
Perform the resizing only on flushes, otherwise we'd
have to take a perf hit by either rehashing the array
or unnecessarily flushing it.

We grow the array aggressively, and reduce the size more
slowly. This accommodates mixed workloads, where some
processes might be memory-heavy while others are not.

As the following experiments show, this is a net perf gain,
particularly for memory-heavy workloads. Experiments
are run on an Intel i7-6700K CPU @ 4.00GHz.

1. System boot + shutdown, debian aarch64:

- Before (tb-lock-v3):
 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7469.363393      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.07% )
    31,507,707,190      cycles                    #    4.218 GHz                      ( +-  0.07% )
    57,101,577,452      instructions              #    1.81  insns per cycle          ( +-  0.08% )
    10,265,531,804      branches                  # 1374.352 M/sec                    ( +-  0.07% )
       173,020,681      branch-misses             #    1.69% of all branches          ( +-  0.10% )

       7.483359063 seconds time elapsed                                          ( +-  0.08% )

- After:
 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7185.036730      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.11% )
    30,303,501,143      cycles                    #    4.218 GHz                      ( +-  0.11% )
    54,198,386,487      instructions              #    1.79  insns per cycle          ( +-  0.08% )
     9,726,518,945      branches                  # 1353.719 M/sec                    ( +-  0.08% )
       167,082,307      branch-misses             #    1.72% of all branches          ( +-  0.08% )

       7.195597842 seconds time elapsed                                          ( +-  0.11% )

That is, a 3.8% improvement.

2. System boot + shutdown, ubuntu 18.04 x86_64:

- Before (tb-lock-v3):
Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):

      49971.036482      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.62% )
   210,766,077,140      cycles                    #    4.218 GHz                      ( +-  1.63% )
   428,829,830,790      instructions              #    2.03  insns per cycle          ( +-  0.75% )
    77,313,384,038      branches                  # 1547.164 M/sec                    ( +-  0.54% )
       835,610,706      branch-misses             #    1.08% of all branches          ( +-  2.97% )

      50.003855102 seconds time elapsed                                          ( +-  1.61% )

- After:
 Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):

      50118.124477      task-clock (msec)         #    0.999 CPUs utilized            ( +-  4.30% )
           132,396      context-switches          #    0.003 M/sec                    ( +-  1.20% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
           167,754      page-faults               #    0.003 M/sec                    ( +-  0.06% )
   211,414,701,601      cycles                    #    4.218 GHz                      ( +-  4.30% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
   431,618,818,597      instructions              #    2.04  insns per cycle          ( +-  6.40% )
    80,197,256,524      branches                  # 1600.165 M/sec                    ( +-  8.59% )
       794,830,352      branch-misses             #    0.99% of all branches          ( +-  2.05% )

      50.177077175 seconds time elapsed                                          ( +-  4.23% )

No improvement (within noise range).

3. x86_64 SPEC06int:
                              SPEC06int (test set)
                         [ Y axis: speedup over master ]
  8 +-+--+----+----+-----+----+----+----+----+----+----+-----+----+----+--+-+
    |                                                                       |
    |                                                   tlb-lock-v3         |
  7 +-+..................$$$...........................+indirection       +-+
    |                    $ $                              +resizing         |
    |                    $ $                                                |
  6 +-+..................$.$..............................................+-+
    |                    $ $                                                |
    |                    $ $                                                |
  5 +-+..................$.$..............................................+-+
    |                    $ $                                                |
    |                    $ $                                                |
  4 +-+..................$.$..............................................+-+
    |                    $ $                                                |
    |          +++       $ $                                                |
  3 +-+........$$+.......$.$..............................................+-+
    |          $$        $ $                                                |
    |          $$        $ $                                 $$$            |
  2 +-+........$$........$.$.................................$.$..........+-+
    |          $$        $ $                                 $ $       +$$  |
    |          $$   $$+  $ $  $$$       +$$                  $ $  $$$   $$  |
  1 +-+***#$***#$+**#$+**#+$**#+$**##$**##$***#$***#$+**#$+**#+$**#+$**##$+-+
    |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
    |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
  0 +-+***#$***#$-**#$-**#$$**#$$**##$**##$***#$***#$-**#$-**#$$**#$$**##$+-+
     401.bzi403.gc429445.g456.h462.libq464.h471.omne4483.xalancbgeomean
png: https://imgur.com/a/b1wn3wc

That is, a 1.53x average speedup over master, with a max speedup of 7.13x.

Note that "indirection" (i.e. the first patch in this series) incurs
no overhead, on average.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu-defs.h |  1 +
 accel/tcg/cputlb.c      | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 27b9433976..4d1d6b2b8b 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -145,6 +145,7 @@ typedef struct CPUTLBDesc {
     size_t size;
     size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
     size_t used;
+    size_t n_flushes_low_rate;
 } CPUTLBDesc;
 
 #define CPU_COMMON_TLB  \
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 1ca71ecfc4..afb61e9c2b 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -85,6 +85,7 @@ void tlb_init(CPUState *cpu)
         desc->size = MIN_CPU_TLB_SIZE;
         desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
         desc->used = 0;
+        desc->n_flushes_low_rate = 0;
         env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
         env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
     }
@@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
     return count;
 }
 
+/* Call with tlb_lock held */
+static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
+{
+    CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
+    size_t rate = desc->used * 100 / desc->size;
+    size_t new_size = desc->size;
+
+    if (rate == 100) {
+        new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
+    } else if (rate > 70) {
+        new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
+    } else if (rate < 30) {
+        desc->n_flushes_low_rate++;
+        if (desc->n_flushes_low_rate == 100) {
+            new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
+            desc->n_flushes_low_rate = 0;
+        }
+    }
+
+    if (new_size == desc->size) {
+        return;
+    }
+
+    g_free(env->tlb_table[mmu_idx]);
+    g_free(env->iotlb[mmu_idx]);
+
+    desc->size = new_size;
+    desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
+    desc->n_flushes_low_rate = 0;
+    env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, desc->size);
+    env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, desc->size);
+}
+
 /* This is OK because CPU architectures generally permit an
  * implementation to drop entries from the TLB at any time, so
  * flushing more entries than required is only an efficiency issue,
@@ -151,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
      */
     qemu_spin_lock(&env->tlb_lock);
     for (i = 0; i < NB_MMU_MODES; i++) {
+        tlb_mmu_resize_locked(env, i);
         memset(env->tlb_table[i], -1,
                env->tlb_desc[i].size * sizeof(CPUTLBEntry));
         env->tlb_desc[i].used = 0;
@@ -215,6 +250,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
         if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
             tlb_debug("%d\n", mmu_idx);
 
+            tlb_mmu_resize_locked(env, mmu_idx);
             memset(env->tlb_table[mmu_idx], -1,
                    env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
             memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
-- 
2.17.1


Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
Posted by Philippe Mathieu-Daudé 7 years ago
Hi Emilio,

On 10/6/18 11:45 PM, Emilio G. Cota wrote:
> Perform the resizing only on flushes, otherwise we'd
> have to take a perf hit by either rehashing the array
> or unnecessarily flushing it.
> 
> We grow the array aggressively, and reduce the size more
> slowly. This accommodates mixed workloads, where some
> processes might be memory-heavy while others are not.
> 
> As the following experiments show, this is a net perf gain,
> particularly for memory-heavy workloads. Experiments
> are run on an Intel i7-6700K CPU @ 4.00GHz.
> 
> 1. System boot + shutdown, debian aarch64:
> 
> - Before (tb-lock-v3):
>  Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
> 
>        7469.363393      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.07% )
>     31,507,707,190      cycles                    #    4.218 GHz                      ( +-  0.07% )
>     57,101,577,452      instructions              #    1.81  insns per cycle          ( +-  0.08% )
>     10,265,531,804      branches                  # 1374.352 M/sec                    ( +-  0.07% )
>        173,020,681      branch-misses             #    1.69% of all branches          ( +-  0.10% )
> 
>        7.483359063 seconds time elapsed                                          ( +-  0.08% )
> 
> - After:
>  Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
> 
>        7185.036730      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.11% )
>     30,303,501,143      cycles                    #    4.218 GHz                      ( +-  0.11% )
>     54,198,386,487      instructions              #    1.79  insns per cycle          ( +-  0.08% )
>      9,726,518,945      branches                  # 1353.719 M/sec                    ( +-  0.08% )
>        167,082,307      branch-misses             #    1.72% of all branches          ( +-  0.08% )
> 
>        7.195597842 seconds time elapsed                                          ( +-  0.11% )
> 
> That is, a 3.8% improvement.
> 
> 2. System boot + shutdown, ubuntu 18.04 x86_64:

You can also run the VM tests to build QEMU:

$ make vm-test
vm-test: Test QEMU in preconfigured virtual machines

  vm-build-ubuntu.i386            - Build QEMU in ubuntu i386 VM
  vm-build-freebsd                - Build QEMU in FreeBSD VM
  vm-build-netbsd                 - Build QEMU in NetBSD VM
  vm-build-openbsd                - Build QEMU in OpenBSD VM
  vm-build-centos                 - Build QEMU in CentOS VM, with Docker

> 
> - Before (tb-lock-v3):
> Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):
> 
>       49971.036482      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.62% )
>    210,766,077,140      cycles                    #    4.218 GHz                      ( +-  1.63% )
>    428,829,830,790      instructions              #    2.03  insns per cycle          ( +-  0.75% )
>     77,313,384,038      branches                  # 1547.164 M/sec                    ( +-  0.54% )
>        835,610,706      branch-misses             #    1.08% of all branches          ( +-  2.97% )
> 
>       50.003855102 seconds time elapsed                                          ( +-  1.61% )
> 
> - After:
>  Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):
> 
>       50118.124477      task-clock (msec)         #    0.999 CPUs utilized            ( +-  4.30% )
>            132,396      context-switches          #    0.003 M/sec                    ( +-  1.20% )
>                  0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
>            167,754      page-faults               #    0.003 M/sec                    ( +-  0.06% )
>    211,414,701,601      cycles                    #    4.218 GHz                      ( +-  4.30% )
>    <not supported>      stalled-cycles-frontend
>    <not supported>      stalled-cycles-backend
>    431,618,818,597      instructions              #    2.04  insns per cycle          ( +-  6.40% )
>     80,197,256,524      branches                  # 1600.165 M/sec                    ( +-  8.59% )
>        794,830,352      branch-misses             #    0.99% of all branches          ( +-  2.05% )
> 
>       50.177077175 seconds time elapsed                                          ( +-  4.23% )
> 
> No improvement (within noise range).
> 
> 3. x86_64 SPEC06int:
>                               SPEC06int (test set)
>                          [ Y axis: speedup over master ]
>   8 +-+--+----+----+-----+----+----+----+----+----+----+-----+----+----+--+-+
>     |                                                                       |
>     |                                                   tlb-lock-v3         |
>   7 +-+..................$$$...........................+indirection       +-+
>     |                    $ $                              +resizing         |
>     |                    $ $                                                |
>   6 +-+..................$.$..............................................+-+
>     |                    $ $                                                |
>     |                    $ $                                                |
>   5 +-+..................$.$..............................................+-+
>     |                    $ $                                                |
>     |                    $ $                                                |
>   4 +-+..................$.$..............................................+-+
>     |                    $ $                                                |
>     |          +++       $ $                                                |
>   3 +-+........$$+.......$.$..............................................+-+
>     |          $$        $ $                                                |
>     |          $$        $ $                                 $$$            |
>   2 +-+........$$........$.$.................................$.$..........+-+
>     |          $$        $ $                                 $ $       +$$  |
>     |          $$   $$+  $ $  $$$       +$$                  $ $  $$$   $$  |
>   1 +-+***#$***#$+**#$+**#+$**#+$**##$**##$***#$***#$+**#$+**#+$**#+$**##$+-+
>     |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
>     |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
>   0 +-+***#$***#$-**#$-**#$$**#$$**##$**##$***#$***#$-**#$-**#$$**#$$**##$+-+
>      401.bzi403.gc429445.g456.h462.libq464.h471.omne4483.xalancbgeomean

This description line is hard to read ;)

> png: https://imgur.com/a/b1wn3wc
> 
> That is, a 1.53x average speedup over master, with a max speedup of 7.13x.
> 
> Note that "indirection" (i.e. the first patch in this series) incurs
> no overhead, on average.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/exec/cpu-defs.h |  1 +
>  accel/tcg/cputlb.c      | 36 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 27b9433976..4d1d6b2b8b 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -145,6 +145,7 @@ typedef struct CPUTLBDesc {
>      size_t size;
>      size_t mask; /* (.size - 1) << CPU_TLB_ENTRY_BITS for TLB fast path */
>      size_t used;
> +    size_t n_flushes_low_rate;
>  } CPUTLBDesc;
>  
>  #define CPU_COMMON_TLB  \
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index 1ca71ecfc4..afb61e9c2b 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -85,6 +85,7 @@ void tlb_init(CPUState *cpu)
>          desc->size = MIN_CPU_TLB_SIZE;
>          desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
>          desc->used = 0;
> +        desc->n_flushes_low_rate = 0;
>          env->tlb_table[i] = g_new(CPUTLBEntry, desc->size);
>          env->iotlb[i] = g_new0(CPUIOTLBEntry, desc->size);
>      }
> @@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
>      return count;
>  }
>  
> +/* Call with tlb_lock held */
> +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
> +{
> +    CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
> +    size_t rate = desc->used * 100 / desc->size;
> +    size_t new_size = desc->size;
> +
> +    if (rate == 100) {
> +        new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> +    } else if (rate > 70) {
> +        new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> +    } else if (rate < 30) {

I wonder if those thresholds might be per TCG_TARGET.

Btw the paper used 40% here, did you try it too?

Regards,

Phil.

> +        desc->n_flushes_low_rate++;
> +        if (desc->n_flushes_low_rate == 100) {
> +            new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
> +            desc->n_flushes_low_rate = 0;
> +        }
> +    }
> +
> +    if (new_size == desc->size) {
> +        return;
> +    }
> +
> +    g_free(env->tlb_table[mmu_idx]);
> +    g_free(env->iotlb[mmu_idx]);
> +
> +    desc->size = new_size;
> +    desc->mask = (desc->size - 1) << CPU_TLB_ENTRY_BITS;
> +    desc->n_flushes_low_rate = 0;
> +    env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, desc->size);
> +    env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, desc->size);
> +}
> +
>  /* This is OK because CPU architectures generally permit an
>   * implementation to drop entries from the TLB at any time, so
>   * flushing more entries than required is only an efficiency issue,
> @@ -151,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
>       */
>      qemu_spin_lock(&env->tlb_lock);
>      for (i = 0; i < NB_MMU_MODES; i++) {
> +        tlb_mmu_resize_locked(env, i);
>          memset(env->tlb_table[i], -1,
>                 env->tlb_desc[i].size * sizeof(CPUTLBEntry));
>          env->tlb_desc[i].used = 0;
> @@ -215,6 +250,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
>          if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
>              tlb_debug("%d\n", mmu_idx);
>  
> +            tlb_mmu_resize_locked(env, mmu_idx);
>              memset(env->tlb_table[mmu_idx], -1,
>                     env->tlb_desc[mmu_idx].size * sizeof(CPUTLBEntry));
>              memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
> 

Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
Posted by Emilio G. Cota 7 years ago
On Sun, Oct 07, 2018 at 19:37:50 +0200, Philippe Mathieu-Daudé wrote:
> On 10/6/18 11:45 PM, Emilio G. Cota wrote:
> > 2. System boot + shutdown, ubuntu 18.04 x86_64:
> 
> You can also run the VM tests to build QEMU:
> 
> $ make vm-test

Thanks, will give that a look.

> > +    if (rate == 100) {
> > +        new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> > +    } else if (rate > 70) {
> > +        new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> > +    } else if (rate < 30) {
> 
> I wonder if those thresholds might be per TCG_TARGET.

Do you mean to tune the growth rate based on each TCG target?
(max and min are already determined by the TCG target).
The optimal growth rate is mostly dependent on the guest
workload, so I wouldn't expect the TCG target to matter
much.

That said, we could spend quite some time tweaking the
TLB sizing algorithm. But with this RFC I wanted to see
(a) whether this approach is a good idea at all, and (b) show
what 'easy' speedups might look like (because converting
all TCG targets is a pain, so it had better be justified).

> Btw the paper used 40% here, did you try it too?

Yes, I tried several alternatives including what the
paper describes, i.e. (skipping the min/max checks
for simplicity):

	if (rate > 70) {
		new_size = 2 * old_size;
	} else if (rate < 40) {
		new_size = old_size / 2;
	}

But that didn't give great speedups (see "resizing-paper"
set):
  https://imgur.com/a/w3AqHP7

A few points stand out to me:

- We get very different speedups even if we implement
  the algorithm they describe (not sure that's exactly
  what they implemented though). But there are many
  variables that could explain that, e.g. different guest
  images (and therefore different TLB flush rates) and
  different QEMU baselines (ours is faster than the paper's,
  so getting speedups is harder).

- 70/40% use rate for growing/shrinking the TLB does not
  seem a great choice, if one wants to avoid a pathological
  case that can induce constant resizing. Imagine we got
  exactly 70% use rate, and all TLB misses were compulsory
  (i.e. a direct-mapped TLB would have not prevented a
  single miss). We'd then double the TLB size:
    size_new = 2*size_old
  But then the use rate will halve:
    use_new = 0.7/2 = 0.35
  So we'd then end up in a grow-shrink loop!
  Picking a "shrink threshold" below 0.70/2=0.35 avoids this.

- Aggressively increasing the TLB size when usage is high
  makes sense. However, reducing the size at the same
  rate does not make much sense. Imagine the following
  scenario with two processes being scheduled: one process
  uses a lot of memory, and the other one uses little, but
  both are CPU-intensive and therefore being assigned similar
  time slices by the scheduler. Ideally you'd resize the TLB
  to meet each process' memory demands. However, at flush
  time we don't even know what process is running or about
  to run, so we have to size the TLB exclusively based on
  recent use rates. In this scenario you're probably close
  to optimal if you size the TLB to meet the demands of the
  most memory-hungry process. You'll lose some extra time
  flushing the (now larger) TLB, but your net gain is likely
  to be positive given the TLB fills you won't have to do
  when the memory-heavy process is scheduled in.
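
The grow-shrink loop from the second point above can be checked with a
small standalone model. This is only a sketch, not QEMU code:
next_size() and count_resizes() are hypothetical helpers, the min/max
clamps are skipped as in the snippet above, and the TLB is assumed to
hold the whole working set whenever it is large enough (i.e. no
conflict misses).

```c
#include <stddef.h>

/*
 * Hypothetical helper: one sizing decision taken at flush time.
 * grow_pct/shrink_pct are use-rate thresholds in percent; the
 * min/max clamps are left out for simplicity.
 */
static size_t next_size(size_t size, size_t used,
                        unsigned grow_pct, unsigned shrink_pct)
{
    size_t rate = used * 100 / size;

    if (rate > grow_pct) {
        return size * 2;
    }
    if (rate < shrink_pct) {
        return size / 2;
    }
    return size;
}

/*
 * Hypothetical helper: count how many of n_flushes flushes end up
 * resizing the TLB when the guest touches a constant working set.
 * Assumption: the number of used entries is min(working_set, size).
 */
static unsigned count_resizes(size_t size, size_t working_set,
                              unsigned grow_pct, unsigned shrink_pct,
                              unsigned n_flushes)
{
    unsigned resizes = 0;
    unsigned i;

    for (i = 0; i < n_flushes; i++) {
        size_t used = working_set < size ? working_set : size;
        size_t new_size = next_size(size, used, grow_pct, shrink_pct);

        if (new_size != size) {
            resizes++;
        }
        size = new_size;
    }
    return resizes;
}
```

With a 256-entry TLB and a constant working set of 184 pages (a 71%
use rate), count_resizes(256, 184, 70, 40, 10) returns 10, i.e. a
resize on every flush, because doubling drops the rate to 35%, which is
below the 40% shrink threshold. count_resizes(256, 184, 70, 30, 10)
returns 1: the TLB grows once and then stays at 512 entries.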

So to me it's quite likely that in the paper they
could have gotten even better results by reducing the
shrink rate, like we did.
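
To illustrate the mixed-workload argument, here is a rough standalone
model of the sizing policy in this patch. Everything in it is an
assumption made for illustration: struct tlb_model, model_resize() and
simulate() are hypothetical, the 256/65536 bounds stand in for the real
per-target min/max, and the two alternating "processes" simply touch
900 and 50 pages between flushes.

```c
#include <stddef.h>

#define MODEL_MIN_SIZE 256u    /* assumed lower bound, not QEMU's */
#define MODEL_MAX_SIZE 65536u  /* assumed upper bound, not QEMU's */

struct tlb_model {
    size_t size;   /* number of entries */
    size_t n_low;  /* low-use-rate flushes seen since the last resize */
};

/*
 * Mirrors the thresholds in the patch: grow 4x at a 100% use rate,
 * 2x above 70%, and halve only after shrink_after flushes with a use
 * rate below 30%. Returns 1 if the size changed, 0 otherwise.
 */
static int model_resize(struct tlb_model *t, size_t used,
                        size_t shrink_after)
{
    size_t old_size = t->size;
    size_t rate = used * 100 / old_size;
    size_t new_size = old_size;

    if (rate == 100) {
        new_size = old_size * 4;
    } else if (rate > 70) {
        new_size = old_size * 2;
    } else if (rate < 30) {
        if (++t->n_low == shrink_after) {
            new_size = old_size / 2;
            t->n_low = 0;
        }
    }
    if (new_size > MODEL_MAX_SIZE) {
        new_size = MODEL_MAX_SIZE;
    }
    if (new_size < MODEL_MIN_SIZE) {
        new_size = MODEL_MIN_SIZE;
    }
    if (new_size == old_size) {
        return 0;
    }
    t->size = new_size;
    t->n_low = 0;
    return 1;
}

/*
 * Alternate a memory-heavy slice (900 pages) and a light slice
 * (50 pages), flushing after each one. Returns the number of resizes;
 * the final size is stored in *final_size if it is not NULL.
 */
static unsigned simulate(size_t shrink_after, unsigned n_flushes,
                         size_t *final_size)
{
    struct tlb_model t = { MODEL_MIN_SIZE, 0 };
    unsigned resizes = 0;
    unsigned i;

    for (i = 0; i < n_flushes; i++) {
        size_t touched = (i % 2 == 0) ? 900 : 50;
        size_t used = touched < t.size ? touched : t.size;

        resizes += model_resize(&t, used, shrink_after);
    }
    if (final_size) {
        *final_size = t.size;
    }
    return resizes;
}
```

In this model, simulate(100, 200, NULL) returns 2: the TLB grows to
2048 entries within the first three flushes and stays there. Shrinking
immediately, i.e. simulate(1, 200, NULL), returns 200, so the TLB is
resized on every single flush.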

Thanks,

		Emilio

Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
Posted by Emilio G. Cota 7 years ago
On Sun, Oct 07, 2018 at 21:48:34 -0400, Emilio G. Cota wrote:
> - 70/40% use rate for growing/shrinking the TLB does not
>   seem a great choice, if one wants to avoid a pathological
>   case that can induce constant resizing. Imagine we got
>   exactly 70% use rate, and all TLB misses were compulsory
>   (i.e. a direct-mapped TLB would have not prevented a
            ^^^
>   single miss). We'd then double the TLB size:

I meant fully associative.

		E.

Re: [Qemu-devel] [RFC 6/6] cputlb: dynamically resize TLBs based on use rate
Posted by Richard Henderson 7 years ago
On 10/6/18 2:45 PM, Emilio G. Cota wrote:
> @@ -122,6 +123,39 @@ size_t tlb_flush_count(void)
>      return count;
>  }
>  
> +/* Call with tlb_lock held */
> +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
> +{
> +    CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
> +    size_t rate = desc->used * 100 / desc->size;
> +    size_t new_size = desc->size;
> +
> +    if (rate == 100) {
> +        new_size = MIN(desc->size << 2, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> +    } else if (rate > 70) {
> +        new_size = MIN(desc->size << 1, 1 << TCG_TARGET_TLB_MAX_INDEX_BITS);
> +    } else if (rate < 30) {
> +        desc->n_flushes_low_rate++;
> +        if (desc->n_flushes_low_rate == 100) {
> +            new_size = MAX(desc->size >> 1, 1 << MIN_CPU_TLB_BITS);
> +            desc->n_flushes_low_rate = 0;
> +        }
> +    }
> +
> +    if (new_size == desc->size) {

s/desc->size/old_size/g
Otherwise it looks plausible as a first cut.


r~