From nobody Tue Oct 28 12:39:19 2025 Delivered-To: importer@patchew.org Received-SPF: pass (zoho.com: domain of gnu.org designates 208.118.235.17 as permitted sender) client-ip=208.118.235.17; envelope-from=qemu-devel-bounces+importer=patchew.org@nongnu.org; helo=lists.gnu.org; Authentication-Results: mx.zohomail.com; spf=pass (zoho.com: domain of gnu.org designates 208.118.235.17 as permitted sender) smtp.mailfrom=qemu-devel-bounces+importer=patchew.org@nongnu.org Return-Path: Received: from lists.gnu.org (lists.gnu.org [208.118.235.17]) by mx.zohomail.com with SMTPS id 1522981318442629.1289400991451; Thu, 5 Apr 2018 19:21:58 -0700 (PDT) Received: from localhost ([::1]:44033 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f4H0k-0000w6-EW for importer@patchew.org; Thu, 05 Apr 2018 22:21:54 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:58540) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1f4GsO-0002I1-VE for qemu-devel@nongnu.org; Thu, 05 Apr 2018 22:13:19 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1f4GsK-0003Wh-Rk for qemu-devel@nongnu.org; Thu, 05 Apr 2018 22:13:16 -0400 Received: from out5-smtp.messagingengine.com ([66.111.4.29]:39197) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1f4GsK-0003W4-Kf for qemu-devel@nongnu.org; Thu, 05 Apr 2018 22:13:12 -0400 Received: from compute4.internal (compute4.nyi.internal [10.202.2.44]) by mailout.nyi.internal (Postfix) with ESMTP id 35E3C20B71; Thu, 5 Apr 2018 22:13:12 -0400 (EDT) Received: from mailfrontend1 ([10.202.2.162]) by compute4.internal (MEProxy); Thu, 05 Apr 2018 22:13:12 -0400 Received: from localhost (flamenco.cs.columbia.edu [128.59.20.216]) by mail.messagingengine.com (Postfix) with ESMTPA id E429DE47F2; Thu, 5 Apr 2018 22:13:11 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=braap.org; h=cc :content-transfer-encoding:content-type:date:from:in-reply-to :message-id:mime-version:references:subject:to:x-me-sender :x-me-sender:x-sasl-enc; s=mesmtp; bh=0gZKbq1o4s8mvrgrJrr1bJyKcC rKjcUl61DzQxw/oHw=; b=1I7Ygxm/9KZlCmtGM6uNdJgUCetGGT5U4yQ1/Qlg3v YdLIjryC+w7EeWnVb3ObYQomXa8Y3hX61V351kNB2zr7NA3vmC+MDNvRiXvAry4P VBIf4PplNDmT+7vLNfp0eFm8wwlcpb4Lg2zCPwzgad1cMpPi7jTVSoOFkcQ3EgBB Q= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-transfer-encoding:content-type :date:from:in-reply-to:message-id:mime-version:references :subject:to:x-me-sender:x-me-sender:x-sasl-enc; s=fm2; bh=0gZKbq 1o4s8mvrgrJrr1bJyKcCrKjcUl61DzQxw/oHw=; b=G/d4yLOKO8sUty+gfXRzND +0aey7JDa4wYBBZD3+fmxgE2MoMH1sKaIgPcZ3bbhFvB0ysjN4u/zJ5Z4KSKtGgd yWslxFhvOklUySLX1LJDkwuL0WBmZXicN+KbCxiw4WUnwVyC8+CEPSg7C/SJp7q3 cwD/LUBJZngvM651svjrTa8vGBeczCMaCHhGH7MrintJNOY1iPabU3LZJh58yt2C O+smkDK1LDsfmAHXFgEZOR1+Eoj9P8aAge/Uq3x8xH0e42TwHQoY6Mtdi1Hltqqv Cw7g0z/rOcMIAFFJ661IDv3U440BGE7r/FqPvlORDZ3AnrR9IJH5+NWs4ZRXLNbQ == X-ME-Sender: From: "Emilio G. Cota" To: qemu-devel@nongnu.org Date: Thu, 5 Apr 2018 22:13:05 -0400 Message-Id: <1522980788-1252-15-git-send-email-cota@braap.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1522980788-1252-1-git-send-email-cota@braap.org> References: <1522980788-1252-1-git-send-email-cota@braap.org> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 66.111.4.29 Subject: [Qemu-devel] [PATCH v2 14/17] translate-all: protect TB jumps with a per-destination-TB lock X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Richard Henderson , =?UTF-8?q?Alex=20Benn=C3=A9e?= , Paolo Bonzini Errors-To: qemu-devel-bounces+importer=patchew.org@nongnu.org Sender: "Qemu-devel" X-ZohoMail: RSF_0 Z_629925259 SPT_0 This applies to both user-mode and !user-mode emulation. Instead of relying on a global lock, protect the list of incoming jumps with tb->jmp_lock. This lock also protects tb->cflags, so update all tb->cflags readers outside tb->jmp_lock to use atomic reads via tb_cflags(). In order to find the destination TB (and therefore its jmp_lock) from the origin TB, we introduce tb->jmp_dest[]. I considered not using a linked list of jumps, which simplifies code and makes the struct smaller. However, it unnecessarily increases memory usage, which results in a performance decrease. See for instance these numbers booting+shutting down debian-arm: Time (s) Rel. err (%) Abs. err (s) Rel. slowdown (= %) ---------------------------------------------------------------------------= --- before 20.88 0.74 0.154512 = 0. after 20.81 0.38 0.079078 -0.335249= 04 GTree 21.02 0.28 0.058856 0.670498= 08 GHashTable + xxhash 21.63 1.08 0.233604 3.59195= 40 Using a hash table or a binary tree to keep track of the jumps doesn't really pay off, not only due to the increased memory usage, but also because most TBs have only 0 or 1 jumps to them. The maximum number of jumps when booting debian-arm that I measured is 35, but as we can see in the histogram below a TB with that many incoming jumps is extremely rare; the average TB has 0.80 incoming jumps. n_jumps: 379208; avg jumps/tb: 0.801099 dist: [0.0,1.0)|=E2=96=84=E2=96=88=E2=96=81=E2=96=81=E2=96=81=E2=96=81=E2= =96=81=E2=96=81=E2=96=81=E2=96=81=E2=96=81=E2=96=81=E2=96=81 =E2=96=81=E2= =96=81=E2=96=81=E2=96=81=E2=96=81=E2=96=81 =E2=96=81=E2=96=81=E2=96=81 =E2= =96=81=E2=96=81=E2=96=81 =E2=96=81|[34.0,35.0] Signed-off-by: Emilio G. Cota --- docs/devel/multi-thread-tcg.txt | 6 +- include/exec/exec-all.h | 35 +++++++----- accel/tcg/cpu-exec.c | 41 +++++++++----- accel/tcg/translate-all.c | 118 ++++++++++++++++++++++++------------= ---- 4 files changed, 124 insertions(+), 76 deletions(-) diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.= txt index faf09c6..df83445 100644 --- a/docs/devel/multi-thread-tcg.txt +++ b/docs/devel/multi-thread-tcg.txt @@ -131,8 +131,10 @@ DESIGN REQUIREMENT: Safely handle invalidation of TBs =20 The direct jump themselves are updated atomically by the TCG tb_set_jmp_target() code. Modification to the linked lists that allow -searching for linked pages are done under the protect of the -tb_lock(). +searching for linked pages are done under the protection of tb->jmp_lock, +where tb is the destination block of a jump. Each origin block keeps a +pointer to its destinations so that the appropriate lock can be acquired b= efore +iterating over a jump list. =20 The global page table is a lockless radix tree; cmpxchg is used to atomically insert new elements. diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h index 7911e69..f8adeb8 100644 --- a/include/exec/exec-all.h +++ b/include/exec/exec-all.h @@ -341,7 +341,7 @@ struct TranslationBlock { #define CF_LAST_IO 0x00008000 /* Last insn may be an IO access. */ #define CF_NOCACHE 0x00010000 /* To be freed after execution */ #define CF_USE_ICOUNT 0x00020000 -#define CF_INVALID 0x00040000 /* TB is stale. Setters need tb_lock */ +#define CF_INVALID 0x00040000 /* TB is stale. Set with @jmp_lock held = */ #define CF_PARALLEL 0x00080000 /* Generate code for a parallel context = */ /* cflags' mask for hashing/comparison */ #define CF_HASH_MASK \ @@ -360,6 +360,9 @@ struct TranslationBlock { uintptr_t page_next[2]; tb_page_addr_t page_addr[2]; =20 + /* jmp_lock placed here to fill a 4-byte hole. Its documentation is be= low */ + QemuSpin jmp_lock; + /* The following data are used to directly call another TB from * the code of this one. This can be done either by emitting direct or * indirect native jump instructions. These jumps are reset so that th= e TB @@ -371,20 +374,26 @@ struct TranslationBlock { #define TB_JMP_RESET_OFFSET_INVALID 0xffff /* indicates no jump generated = */ uintptr_t jmp_target_arg[2]; /* target address or offset */ =20 - /* Each TB has an associated circular list of TBs jumping to this one. - * jmp_list_first points to the first TB jumping to this one. - * jmp_list_next is used to point to the next TB in a list. - * Since each TB can have two jumps, it can participate in two lists. - * jmp_list_first and jmp_list_next are 4-byte aligned pointers to a - * TranslationBlock structure, but the two least significant bits of - * them are used to encode which data field of the pointed TB should - * be used to traverse the list further from that TB: - * 0 =3D> jmp_list_next[0], 1 =3D> jmp_list_next[1], 2 =3D> jmp_list_f= irst. - * In other words, 0/1 tells which jump is used in the pointed TB, - * and 2 means that this is a pointer back to the target TB of this li= st. + /* + * Each TB has a NULL-terminated list (jmp_list_head) of incoming jump= s. + * Each TB can have two outgoing jumps, and therefore can participate + * in two lists. The list entries are kept in jmp_list_next[2]. The le= ast + * significant bit (LSB) of the pointers in these lists is used to enc= ode + * which of the two list entries is to be used in the pointed TB. + * + * List traversals are protected by jmp_lock. The destination TB of ea= ch + * outgoing jump is kept in jmp_dest[] so that the appropriate jmp_lock + * can be acquired from any origin TB. + * + * jmp_dest[] are tagged pointers as well. The LSB is set when the TB = is + * being invalidated, so that no further outgoing jumps from it can be= set. + * + * jmp_lock also protects the CF_INVALID cflag; a jump must not be cha= ined + * to a destination TB that has CF_INVALID set. */ + uintptr_t jmp_list_head; uintptr_t jmp_list_next[2]; - uintptr_t jmp_list_first; + uintptr_t jmp_dest[2]; }; =20 extern bool parallel_cpus; diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c index c79c43b..178452a 100644 --- a/accel/tcg/cpu-exec.c +++ b/accel/tcg/cpu-exec.c @@ -350,28 +350,43 @@ void tb_set_jmp_target(TranslationBlock *tb, int n, u= intptr_t addr) } } =20 -/* Called with tb_lock held. */ static inline void tb_add_jump(TranslationBlock *tb, int n, TranslationBlock *tb_next) { + uintptr_t old; + assert(n < ARRAY_SIZE(tb->jmp_list_next)); - if (tb->jmp_list_next[n]) { - /* Another thread has already done this while we were - * outside of the lock; nothing to do in this case */ - return; + qemu_spin_lock(&tb_next->jmp_lock); + + /* make sure the destination TB is valid */ + if (tb_next->cflags & CF_INVALID) { + goto out_unlock_next; + } + /* Atomically claim the jump destination slot only if it was NULL */ + old =3D atomic_cmpxchg(&tb->jmp_dest[n], (uintptr_t)NULL, (uintptr_t)t= b_next); + if (old) { + goto out_unlock_next; } + + /* patch the native jump address */ + tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc.ptr); + + /* add in TB jmp list */ + tb->jmp_list_next[n] =3D tb_next->jmp_list_head; + tb_next->jmp_list_head =3D (uintptr_t)tb | n; + + qemu_spin_unlock(&tb_next->jmp_lock); + qemu_log_mask_and_addr(CPU_LOG_EXEC, tb->pc, "Linking TBs %p [" TARGET_FMT_lx "] index %d -> %p [" TARGET_FMT_lx "]\n", tb->tc.ptr, tb->pc, n, tb_next->tc.ptr, tb_next->pc); + return; =20 - /* patch the native jump address */ - tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc.ptr); - - /* add in TB jmp circular list */ - tb->jmp_list_next[n] =3D tb_next->jmp_list_first; - tb_next->jmp_list_first =3D (uintptr_t)tb | n; + out_unlock_next: + qemu_spin_unlock(&tb_next->jmp_lock); + return; } =20 static inline TranslationBlock *tb_find(CPUState *cpu, @@ -414,9 +429,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu, tb_lock(); acquired_tb_lock =3D true; } - if (!(tb->cflags & CF_INVALID)) { - tb_add_jump(last_tb, tb_exit, tb); - } + tb_add_jump(last_tb, tb_exit, tb); } if (acquired_tb_lock) { tb_unlock(); diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c index aabde7e..486c9df 100644 --- a/accel/tcg/translate-all.c +++ b/accel/tcg/translate-all.c @@ -170,6 +170,9 @@ struct page_collection { #define PAGE_FOR_EACH_TB(pagedesc, tb, n) \ TB_FOR_EACH_TAGGED((pagedesc)->first_tb, tb, n, page_next) =20 +#define TB_FOR_EACH_JMP(head_tb, tb, n) \ + TB_FOR_EACH_TAGGED((head_tb)->jmp_list_head, tb, n, jmp_list_next) + /* In system mode we want L1_MAP to be based on ram offsets, while in user mode we want it to be based on virtual addresses. */ #if !defined(CONFIG_USER_ONLY) @@ -387,7 +390,7 @@ static int cpu_restore_state_from_tb(CPUState *cpu, Tra= nslationBlock *tb, return -1; =20 found: - if (tb->cflags & CF_USE_ICOUNT) { + if (tb_cflags(tb) & CF_USE_ICOUNT) { assert(use_icount); /* Reset the cycle counter to the start of the block. */ cpu->icount_decr.u16.low +=3D num_insns; @@ -432,7 +435,7 @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc) tb =3D tcg_tb_lookup(host_pc); if (tb) { cpu_restore_state_from_tb(cpu, tb, host_pc); - if (tb->cflags & CF_NOCACHE) { + if (tb_cflags(tb) & CF_NOCACHE) { /* one-shot translation, invalidate it immediately */ tb_phys_invalidate(tb, -1); tcg_tb_remove(tb); @@ -1360,34 +1363,53 @@ static inline void tb_page_remove(PageDesc *pd, Tra= nslationBlock *tb) g_assert_not_reached(); } =20 -/* remove the TB from a list of TBs jumping to the n-th jump target of the= TB */ -static inline void tb_remove_from_jmp_list(TranslationBlock *tb, int n) +/* remove @orig from its @n_orig-th jump list */ +static inline void tb_remove_from_jmp_list(TranslationBlock *orig, int n_o= rig) { - TranslationBlock *tb1; - uintptr_t *ptb, ntb; - unsigned int n1; + uintptr_t ptr, ptr_locked; + TranslationBlock *dest; + TranslationBlock *tb; + uintptr_t *pprev; + int n; =20 - ptb =3D &tb->jmp_list_next[n]; - if (*ptb) { - /* find tb(n) in circular list */ - for (;;) { - ntb =3D *ptb; - n1 =3D ntb & 3; - tb1 =3D (TranslationBlock *)(ntb & ~3); - if (n1 =3D=3D n && tb1 =3D=3D tb) { - break; - } - if (n1 =3D=3D 2) { - ptb =3D &tb1->jmp_list_first; - } else { - ptb =3D &tb1->jmp_list_next[n1]; - } - } - /* now we can suppress tb(n) from the list */ - *ptb =3D tb->jmp_list_next[n]; + /* mark the LSB of jmp_dest[] so that no further jumps can be inserted= */ + ptr =3D atomic_or_fetch(&orig->jmp_dest[n_orig], 1); + dest =3D (TranslationBlock *)(ptr & ~1); + if (dest =3D=3D NULL) { + return; + } =20 - tb->jmp_list_next[n] =3D (uintptr_t)NULL; + qemu_spin_lock(&dest->jmp_lock); + /* + * While acquiring the lock, the jump might have been removed if the + * destination TB was invalidated; check again. + */ + ptr_locked =3D atomic_read(&orig->jmp_dest[n_orig]); + if (ptr_locked !=3D ptr) { + qemu_spin_unlock(&dest->jmp_lock); + /* + * The only possibility is that the jump was unlinked via + * tb_jump_unlink(dest). Seeing here another destination would be = a bug, + * because we set the LSB above. + */ + g_assert(ptr_locked =3D=3D 1 && dest->cflags & CF_INVALID); + return; } + /* + * We first acquired the lock, and since the destination pointer match= es, + * we know for sure that @orig is in the jmp list. + */ + pprev =3D &dest->jmp_list_head; + TB_FOR_EACH_JMP(dest, tb, n) { + if (tb =3D=3D orig && n =3D=3D n_orig) { + *pprev =3D tb->jmp_list_next[n]; + /* no need to set orig->jmp_dest[n]; setting the LSB was enoug= h */ + qemu_spin_unlock(&dest->jmp_lock); + return; + } + pprev =3D &tb->jmp_list_next[n]; + } + g_assert_not_reached(); } =20 /* reset the jump entry 'n' of a TB so that it is not chained to @@ -1399,24 +1421,21 @@ static inline void tb_reset_jump(TranslationBlock *= tb, int n) } =20 /* remove any jumps to the TB */ -static inline void tb_jmp_unlink(TranslationBlock *tb) +static inline void tb_jmp_unlink(TranslationBlock *dest) { - TranslationBlock *tb1; - uintptr_t *ptb, ntb; - unsigned int n1; + TranslationBlock *tb; + int n; =20 - ptb =3D &tb->jmp_list_first; - for (;;) { - ntb =3D *ptb; - n1 =3D ntb & 3; - tb1 =3D (TranslationBlock *)(ntb & ~3); - if (n1 =3D=3D 2) { - break; - } - tb_reset_jump(tb1, n1); - *ptb =3D tb1->jmp_list_next[n1]; - tb1->jmp_list_next[n1] =3D (uintptr_t)NULL; + qemu_spin_lock(&dest->jmp_lock); + + TB_FOR_EACH_JMP(dest, tb, n) { + tb_reset_jump(tb, n); + atomic_and(&tb->jmp_dest[n], (uintptr_t)NULL | 1); + /* No need to clear the list entry; setting the dest ptr is enough= */ } + dest->jmp_list_head =3D (uintptr_t)NULL; + + qemu_spin_unlock(&dest->jmp_lock); } =20 /* If @rm_from_page_list is set, call with the TB's pages' locks held */ @@ -1429,11 +1448,14 @@ static void do_tb_phys_invalidate(TranslationBlock = *tb, bool rm_from_page_list) =20 assert_tb_locked(); =20 + /* make sure no further incoming jumps will be chained to this TB */ + qemu_spin_lock(&tb->jmp_lock); atomic_set(&tb->cflags, tb->cflags | CF_INVALID); + qemu_spin_unlock(&tb->jmp_lock); =20 /* remove the TB from the hash list */ phys_pc =3D tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK); - h =3D tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MA= SK, + h =3D tb_hash_func(phys_pc, tb->pc, tb->flags, tb_cflags(tb) & CF_HASH= _MASK, tb->trace_vcpu_dstate); if (!qht_remove(&tb_ctx.htable, tb, h)) { return; @@ -1784,10 +1806,12 @@ TranslationBlock *tb_gen_code(CPUState *cpu, CODE_GEN_ALIGN)); =20 /* init jump list */ - assert(((uintptr_t)tb & 3) =3D=3D 0); - tb->jmp_list_first =3D (uintptr_t)tb | 2; + qemu_spin_init(&tb->jmp_lock); + tb->jmp_list_head =3D (uintptr_t)NULL; tb->jmp_list_next[0] =3D (uintptr_t)NULL; tb->jmp_list_next[1] =3D (uintptr_t)NULL; + tb->jmp_dest[0] =3D (uintptr_t)NULL; + tb->jmp_dest[1] =3D (uintptr_t)NULL; =20 /* init original jump addresses wich has been set during tcg_gen_code(= ) */ if (tb->jmp_reset_offset[0] !=3D TB_JMP_RESET_OFFSET_INVALID) { @@ -1879,7 +1903,7 @@ tb_invalidate_phys_page_range__locked(struct page_col= lection *pages, } } if (current_tb =3D=3D tb && - (current_tb->cflags & CF_COUNT_MASK) !=3D 1) { + (tb_cflags(current_tb) & CF_COUNT_MASK) !=3D 1) { /* If we are modifying the current TB, we must stop its execution. We could be more precise by checking that the modification is after the current PC, but it @@ -2076,7 +2100,7 @@ static bool tb_invalidate_phys_page(tb_page_addr_t ad= dr, uintptr_t pc) PAGE_FOR_EACH_TB(p, tb, n) { #ifdef TARGET_HAS_PRECISE_SMC if (current_tb =3D=3D tb && - (current_tb->cflags & CF_COUNT_MASK) !=3D 1) { + (tb_cflags(current_tb) & CF_COUNT_MASK) !=3D 1) { /* If we are modifying the current TB, we must stop its execution. We could be more precise by checking that the modification is after the current PC, but it @@ -2201,7 +2225,7 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retadd= r) /* Generate a new TB executing the I/O insn. */ cpu->cflags_next_tb =3D curr_cflags() | CF_LAST_IO | n; =20 - if (tb->cflags & CF_NOCACHE) { + if (tb_cflags(tb) & CF_NOCACHE) { if (tb->orig_tb) { /* Invalidate original TB if this TB was generated in * cpu_exec_nocache() */ --=20 2.7.4