[v2] TCG optimizations for 2.10

[Qemu-devel] [PATCH v2 00/13] TCG optimizations for 2.10
Posted by Emilio G. Cota 8 years, 9 months ago
v1 for context:
  https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg02021.html

This series is aimed at 2.10 or beyond. Its goal is to improve
TCG code execution performance by optimizing:

1- Cross-page direct jumps (softmmu only, obviously)
2- Indirect branches (both softmmu and user-mode)
3- tb_jmp_cache hashing in user-mode (last patch)

Optimizations 1 and 2 are optional. This series implements them
for the i386 TCG backend and the ARM and i386 front-ends; other
backends/frontends can easily opt-in later on.

Changes from v1:

- Followed Richard's design, i.e. have a single helper in tcg-runtime
  and have the TCG op (now called "goto_ptr") to directly jump
  to the host pointer. This pointer is always valid since it's
  either pointing to the (valid) target or to TCG's epilogue. This
  simplifies the whole thing; the only branch in the code path is
  now the one that checks whether the tb pointer from tb_jmp_cache
  is valid.

- Much better performance (e.g. 2.4x speedup for "train" xalancbmk) --
  I'm guessing the design with just one branch is the reason. Also,
  I was unconditionally assigning ret=0 when entering the epilogue;
  fixed now.

- Document goto_ptr in tcg/README, as suggested by Paolo.

- target/i386: also optimized ret/ret im.

- Ensure that TCGContext's read-mostly fields are accessed without
  cache line bouncing. Note that (1) every time we translate,
  TCGContext is heavily written to, and (2) the address of the
  epilogue, which is now accessed in a fast path, is part of
  TCGContext. So patches 3 and 4 make sure there is no false sharing
  of cache lines between these two access patterns.

- Evaluated Paolo's suggestion of using multiplicative hashing. See
  the last patch's commit log.

Things I didn't do:

- Apply the optimization to syscall instructions in target/i386.

- Look at the impact of TLB flushes. With these (new, improved) perf
  numbers there is less reason to worry about this, although they
  should explain the perf differences between softmmu and user-mode.
  Thanks Alex for pointing me out to your profiling code though!
  Learning to use trace points is next in my QEMU TODO list, so I'll
  take a look.

The series applies cleanly on v2.9.0. Measurements are in the
commit logs. You can inspect/fetch the changes at:
  https://github.com/cota/qemu/tree/tcg-opt-v2

Thanks,

		Emilio