From: "Emilio G. Cota"
To: qemu-devel@nongnu.org
Cc: Richard Henderson
Date: Sun, 9 Jul 2017 03:50:12 -0400
Message-Id: <1499586614-20507-21-git-send-email-cota@braap.org>
In-Reply-To: <1499586614-20507-1-git-send-email-cota@braap.org>
References: <1499586614-20507-1-git-send-email-cota@braap.org>
Subject: [Qemu-devel] [PATCH 20/22] tcg: dynamically allocate from code_gen_buffer using equally-sized regions

In preparation for having multiple TCG threads.

The naive solution here is to split code_gen_buffer statically among the
TCG threads; this however results in poor utilization if translation
needs are different across TCG threads.

What we do here is to add an extra layer of indirection, assigning
regions that act just like pages do in virtual memory allocation.
(BTW if you are wondering about the chosen naming, I did not want to
use blocks or pages because those are already heavily used in QEMU).

The effectiveness of this approach is clear after seeing some numbers.
I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark.
Note that I'm evaluating this after enabling per-thread TCG (which
is done by a subsequent commit).

* -smp 1, 1 region (entire buffer):
    qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357
    qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363
    qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364
    qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373
    qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373
    qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360
    qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370
    qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367

  That is, 8 flushes.

* -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]:
    qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356
    qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361
    qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361
    qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375
    qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375
    qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360
    qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365
    qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368

  Again, 8 flushes. Note how buffer utilization is not 100%, but it is
  close. Smaller region sizes would yield higher utilization, but we want
  region allocation to be rare (it acquires a lock), so we do not want
  to go too small.

* -smp 8, static partitioning of 8 regions (10 MB per region):
    qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354
    qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370
    qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365
    qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377
    qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358
    qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367
    qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364
    qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358
    qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362
    qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372
    qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374
    qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376
    qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374
    qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372
    qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359
    qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362
    qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368
    qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378
    qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367
    qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364

  That is, 20 flushes. Note how a static partitioning approach uses
  the code buffer poorly, leading to many unnecessary flushes.

Signed-off-by: Emilio G. Cota
---
 tcg/tcg.h                 |   8 +++
 accel/tcg/translate-all.c |  61 ++++++++++++----
 bsd-user/main.c           |   1 +
 linux-user/main.c         |   1 +
 tcg/tcg.c                 | 175 +++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 230 insertions(+), 16 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index be5f3fd..a767a33 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -761,6 +761,14 @@ void *tcg_malloc_internal(TCGContext *s, int size);
 void tcg_pool_reset(TCGContext *s);
 TranslationBlock *tcg_tb_alloc(TCGContext *s);
 
+void tcg_region_init(TCGContext *s);
+bool tcg_region_alloc(TCGContext *s);
+void tcg_region_set_size(size_t size);
+void tcg_region_reset_all(void);
+
+size_t tcg_code_size(void);
+size_t tcg_code_capacity(void);
+
 /* Called with tb_lock held. */
 static inline void *tcg_malloc(int size)
 {
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 31a9d42..ce9d746 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -53,11 +53,13 @@
 #include "exec/cputlb.h"
 #include "exec/tb-hash.h"
 #include "translate-all.h"
+#include "qemu/error-report.h"
 #include "qemu/bitmap.h"
 #include "qemu/timer.h"
 #include "qemu/main-loop.h"
 #include "exec/log.h"
 #include "sysemu/cpus.h"
+#include "sysemu/sysemu.h"
 
 /* #define DEBUG_TB_INVALIDATE */
 /* #define DEBUG_TB_FLUSH */
@@ -808,6 +810,41 @@ static inline void code_gen_alloc(size_t tb_size)
     qemu_mutex_init(&tb_ctx.tb_lock);
 }
 
+#ifdef CONFIG_SOFTMMU
+/*
+ * It is likely that some vCPUs will translate more code than others, so we
+ * first try to set more regions than smp_cpus, with those regions being
+ * larger than the minimum code_gen_buffer size. If that's not possible we
+ * make do by evenly dividing the code_gen_buffer among the vCPUs.
+ */
+static void code_gen_set_region_size(TCGContext *s)
+{
+    size_t per_cpu = s->code_gen_buffer_size / smp_cpus;
+    size_t div;
+
+    assert(per_cpu);
+    /*
+     * Use a single region if all we have is one vCPU.
+     * We could also use a single region with !mttcg, but at this time we have
+     * not yet processed the thread=single|multi flag.
+     */
+    if (smp_cpus == 1) {
+        tcg_region_set_size(0);
+        return;
+    }
+
+    for (div = 8; div > 0; div--) {
+        size_t region_size = per_cpu / div;
+
+        if (region_size >= 2 * MIN_CODE_GEN_BUFFER_SIZE) {
+            tcg_region_set_size(region_size);
+            return;
+        }
+    }
+    tcg_region_set_size(per_cpu);
+}
+#endif
+
 static void tb_htable_init(void)
 {
     unsigned int mode = QHT_MODE_AUTO_RESIZE;
@@ -829,6 +866,8 @@ void tcg_exec_init(unsigned long tb_size)
     /* There's no guest base to take into account, so go ahead and
        initialize the prologue now. */
     tcg_prologue_init(&tcg_ctx);
+    code_gen_set_region_size(&tcg_ctx);
+    tcg_region_init(&tcg_ctx);
 #endif
 }
 
@@ -929,14 +968,9 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 #if defined(DEBUG_TB_FLUSH)
     g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
     nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
-    printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%zu\n",
-           (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
-           nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
+    fprintf(stderr, "qemu: flush code_size=%zu nb_tbs=%d avg_tb_size=%zu\n",
+            tcg_code_size(), nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
 #endif
-    if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
-        > tcg_ctx.code_gen_buffer_size) {
-        cpu_abort(cpu, "Internal error: code buffer overflow\n");
-    }
 
     CPU_FOREACH(cpu) {
         cpu_tb_jmp_cache_clear(cpu);
@@ -949,7 +983,7 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
-    tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
+    tcg_region_reset_all();
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
     atomic_mb_set(&tb_ctx.tb_flush_count, tb_ctx.tb_flush_count + 1);
@@ -1281,9 +1315,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
         cflags |= CF_USE_ICOUNT;
     }
 
+ buffer_overflow:
     tb = tb_alloc(pc);
     if (unlikely(!tb)) {
- buffer_overflow:
         /* flush must be done */
         tb_flush(cpu);
         mmap_unlock();
@@ -1366,9 +1400,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     }
 #endif
 
-    tcg_ctx.code_gen_ptr = (void *)
+    atomic_set(&tcg_ctx.code_gen_ptr, (void *)
         ROUND_UP((uintptr_t)gen_code_buf + gen_code_size + search_size,
-                 CODE_GEN_ALIGN);
+                 CODE_GEN_ALIGN));
 
     /* init jump list */
     assert(((uintptr_t)tb & 3) == 0);
@@ -1907,9 +1941,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
      * otherwise users might think "-tb-size" is not honoured.
      * For avg host size we use the precise numbers from tb_tree_stats though.
      */
-    cpu_fprintf(f, "gen code size %td/%zd\n",
-                tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
-                tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
+    cpu_fprintf(f, "gen code size %zu/%zd\n",
+                tcg_code_size(), tcg_code_capacity());
     cpu_fprintf(f, "TB count %d\n", nb_tbs);
     cpu_fprintf(f, "TB avg target size %zu max=%zu bytes\n",
                 nb_tbs ? tst.target_size / nb_tbs : 0,
diff --git a/bsd-user/main.c b/bsd-user/main.c
index fa9c012..1a16052 100644
--- a/bsd-user/main.c
+++ b/bsd-user/main.c
@@ -979,6 +979,7 @@ int main(int argc, char **argv)
        generating the prologue until now so that the prologue can take
        the real value of GUEST_BASE into account. */
     tcg_prologue_init(&tcg_ctx);
+    tcg_region_init(&tcg_ctx);
 
     /* build Task State */
     memset(ts, 0, sizeof(TaskState));
diff --git a/linux-user/main.c b/linux-user/main.c
index 630c73d..b73759c 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -4457,6 +4457,7 @@ int main(int argc, char **argv, char **envp)
        generating the prologue until now so that the prologue can take
        the real value of GUEST_BASE into account. */
     tcg_prologue_init(&tcg_ctx);
+    tcg_region_init(&tcg_ctx);
 
 #if defined(TARGET_I386)
     env->cr[0] = CR0_PG_MASK | CR0_WP_MASK | CR0_PE_MASK;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 8febf53..03ebc8c 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -129,6 +129,23 @@ static QemuMutex tcg_lock;
 static QSIMPLEQ_HEAD(, TCGContext) ctx_list =
     QSIMPLEQ_HEAD_INITIALIZER(ctx_list);
 
+/*
+ * We divide code_gen_buffer into equally-sized "regions" that TCG threads
+ * dynamically allocate from as demand dictates. Given appropriate region
+ * sizing, this minimizes flushes even when some TCG threads generate a lot
+ * more code than others.
+ */
+struct tcg_region_state {
+    void *buf;
+    size_t n;
+    size_t current;
+    size_t n_full;
+    size_t size; /* size of one region */
+};
+
+/* protected by tcg_lock */
+static struct tcg_region_state region;
+
 static TCGRegSet tcg_target_available_regs[2];
 static TCGRegSet tcg_target_call_clobber_regs;
 
@@ -410,6 +427,156 @@ void tcg_context_init(TCGContext *s)
     tcg_register_thread();
 }
 
+static void tcg_region_set_size__locked(size_t size)
+{
+    if (!size) {
+        region.size = tcg_init_ctx->code_gen_buffer_size;
+        region.n = 1;
+    } else {
+        region.size = size;
+        region.n = tcg_init_ctx->code_gen_buffer_size / size;
+    }
+    if (unlikely(region.size < TCG_HIGHWATER)) {
+        tcg_abort();
+    }
+}
+
+/*
+ * Call this function at init time (i.e. only once). Calling this function is
+ * optional: if no region size is set, a single region will be used.
+ *
+ * Note: calling this function *after* calling tcg_region_init() is a bug.
+ */
+void tcg_region_set_size(size_t size)
+{
+    tcg_debug_assert(!region.size);
+
+    qemu_mutex_lock(&tcg_lock);
+    tcg_region_set_size__locked(size);
+    qemu_mutex_unlock(&tcg_lock);
+}
+
+static void tcg_region_assign__locked(TCGContext *s)
+{
+    void *buf = region.buf + region.size * region.current;
+
+    s->code_gen_buffer = buf;
+    s->code_gen_ptr = buf;
+    s->code_gen_buffer_size = region.size;
+    s->code_gen_highwater = buf + region.size - TCG_HIGHWATER;
+}
+
+static bool tcg_region_alloc__locked(TCGContext *s)
+{
+    if (region.current == region.n) {
+        return false;
+    }
+    tcg_region_assign__locked(s);
+    region.current++;
+    return true;
+}
+
+/*
+ * Request a new region once the one in use has filled up.
+ * Note: upon initializing a TCG thread, allocate a new region with
+ * tcg_region_init() instead.
+ * Returns true on success.
+ * */
+bool tcg_region_alloc(TCGContext *s)
+{
+    bool success;
+
+    qemu_mutex_lock(&tcg_lock);
+    success = tcg_region_alloc__locked(s);
+    if (success) {
+        region.n_full++;
+    }
+    qemu_mutex_unlock(&tcg_lock);
+    return success;
+}
+
+/*
+ * Allocate an initial region.
+ * All TCG threads must have called this function before any of them initiates
+ * translation.
+ *
+ * The region size might have previously been set by tcg_region_set_size();
+ * otherwise a single region will be used on the entire code_gen_buffer.
+ *
+ * Note: allocate subsequent regions with tcg_region_alloc().
+ */
+void tcg_region_init(TCGContext *s)
+{
+    qemu_mutex_lock(&tcg_lock);
+    if (region.buf == NULL) {
+        region.buf = tcg_init_ctx->code_gen_buffer;
+    }
+    if (!region.size) {
+        tcg_region_set_size__locked(0);
+    }
+    /* if we cannot allocate on init, then we did something wrong */
+    if (!tcg_region_alloc__locked(s)) {
+        tcg_abort();
+    }
+    qemu_mutex_unlock(&tcg_lock);
+
+}
+
+/* Call from a safe-work context */
+void tcg_region_reset_all(void)
+{
+    TCGContext *s;
+
+    qemu_mutex_lock(&tcg_lock);
+    region.current = 0;
+    region.n_full = 0;
+
+    QSIMPLEQ_FOREACH(s, &ctx_list, entry) {
+        if (unlikely(!tcg_region_alloc__locked(s))) {
+            tcg_abort();
+        }
+    }
+    qemu_mutex_unlock(&tcg_lock);
+}
+
+/*
+ * Returns the size (in bytes) of all translated code (i.e. from all regions)
+ * currently in the cache.
+ * See also: tcg_code_capacity()
+ * Do not confuse with tcg_current_code_size(); that one applies to a single
+ * TCG context.
+ */
+size_t tcg_code_size(void)
+{
+    const TCGContext *s;
+    size_t total;
+
+    qemu_mutex_lock(&tcg_lock);
+    total = region.n_full * (region.size - TCG_HIGHWATER);
+    QSIMPLEQ_FOREACH(s, &ctx_list, entry) {
+        size_t size;
+
+        size = atomic_read(&s->code_gen_ptr) - s->code_gen_buffer;
+        if (unlikely(size > s->code_gen_buffer_size)) {
+            tcg_abort();
+        }
+        total += size;
+    }
+    qemu_mutex_unlock(&tcg_lock);
+    return total;
+}
+
+/*
+ * Returns the code capacity (in bytes) of the entire cache, i.e. including all
+ * regions.
+ * See also: tcg_code_size()
+ */
+size_t tcg_code_capacity(void)
+{
+    /* no need for synchronization; these variables are set at init time */
+    return region.n * (region.size - TCG_HIGHWATER);
+}
+
 /*
  * Clone the initial TCGContext. Used by TCG threads to copy the TCGContext
  * set up by their parent thread via tcg_context_init().
@@ -432,13 +599,17 @@ TranslationBlock *tcg_tb_alloc(TCGContext *s)
     TranslationBlock *tb;
     void *next;
 
+ retry:
     tb = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, align);
     next = (void *)ROUND_UP((uintptr_t)(tb + 1), align);
 
     if (unlikely(next > s->code_gen_highwater)) {
-        return NULL;
+        if (!tcg_region_alloc(s)) {
+            return NULL;
+        }
+        goto retry;
     }
-    s->code_gen_ptr = next;
+    atomic_set(&s->code_gen_ptr, next);
     return tb;
 }
 
-- 
2.7.4
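
For readers who want the allocation scheme of this patch in isolation, the following is a minimal standalone C sketch of the idea, not the QEMU code itself. Everything named here is hypothetical and exists only for illustration: `region_pool` loosely mirrors `struct tcg_region_state`, `region_ctx` stands in for the per-thread buffer fields of `TCGContext`, and `pool_init`, `pool_alloc` and `ctx_alloc` are made-up helpers. A pthread mutex is assumed in place of `tcg_lock`, and the highwater mark and `n_full` accounting are left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct tcg_region_state. */
struct region_pool {
    pthread_mutex_t lock;
    char *buf;       /* start of the shared buffer */
    size_t n;        /* number of equally-sized regions */
    size_t current;  /* index of the next unassigned region */
    size_t size;     /* size of one region, in bytes */
};

/* Hypothetical stand-in for the per-thread buffer fields of TCGContext. */
struct region_ctx {
    char *start;     /* start of the region currently owned */
    char *ptr;       /* next free byte inside that region */
    char *end;       /* one past the last byte of that region */
};

/* A region_size of 0 means "one region covering the whole buffer". */
static void pool_init(struct region_pool *p, char *buf, size_t buf_size,
                      size_t region_size)
{
    pthread_mutex_init(&p->lock, NULL);
    p->buf = buf;
    p->size = region_size ? region_size : buf_size;
    p->n = buf_size / p->size;
    p->current = 0;
}

/* Hand the next unassigned region to context c; false once all are taken. */
static bool pool_alloc(struct region_pool *p, struct region_ctx *c)
{
    bool ok = false;

    pthread_mutex_lock(&p->lock);
    if (p->current < p->n) {
        c->start = p->buf + p->size * p->current;
        c->ptr = c->start;
        c->end = c->start + p->size;
        p->current++;
        ok = true;
    }
    pthread_mutex_unlock(&p->lock);
    return ok;
}

/*
 * Lock-free bump allocation inside the owned region. The lock is only
 * taken when the region is exhausted and a new one must be fetched.
 * c must already own a region (call pool_alloc() once beforehand).
 * Returning NULL corresponds to "the whole buffer must be flushed".
 */
static void *ctx_alloc(struct region_pool *p, struct region_ctx *c, size_t len)
{
    void *ret;

    if (len > p->size) {
        return NULL;
    }
    while ((size_t)(c->end - c->ptr) < len) {
        if (!pool_alloc(p, c)) {
            return NULL;
        }
    }
    ret = c->ptr;
    c->ptr += len;
    return ret;
}
```

In this sketch a context that generates more code simply claims more regions, while a quiet context keeps only the one it was given at init; that is the property the flush counts in the commit message rely on. The tail of an exhausted region is wasted, which is one plausible reason utilization in the 32-region run stays slightly below 100%.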