[Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB

Emilio G. Cota posted 16 patches 7 years, 7 months ago
[Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB
Posted by Emilio G. Cota 7 years, 7 months ago
Use the recently-gained QHT feature of returning the matching TB if it
already exists. This allows us to get rid of the lookup we perform
right after acquiring tb_lock.

Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c      | 14 ++------------
 accel/tcg/translate-all.c | 47 ++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 40 insertions(+), 21 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 7c83887..8aed38c 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -243,10 +243,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
         if (tb == NULL) {
             mmap_lock();
             tb_lock();
-            tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
-            if (likely(tb == NULL)) {
-                tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
-            }
+            tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
             tb_unlock();
             mmap_unlock();
         }
@@ -396,14 +393,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
         tb_lock();
         acquired_tb_lock = true;
 
-        /* There's a chance that our desired tb has been translated while
-         * taking the locks so we check again inside the lock.
-         */
-        tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
-        if (likely(tb == NULL)) {
-            /* if no translated code available, then translate it now */
-            tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
-        }
+        tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
 
         mmap_unlock();
         /* We add the TB in the virtual pc hash table for the fast lookup */
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 82832ef..dbe6c12 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1503,12 +1503,16 @@ static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
  * (-1) to indicate that only one page contains the TB.
  *
  * Called with mmap_lock held for user-mode emulation.
+ *
+ * Returns @tb or an existing TB that matches @tb.
  */
-static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
-                         tb_page_addr_t phys_page2)
+static TranslationBlock *
+tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
+             tb_page_addr_t phys_page2)
 {
     PageDesc *p;
     PageDesc *p2 = NULL;
+    void *existing_tb;
     uint32_t h;
 
     assert_memory_lock();
@@ -1516,6 +1520,11 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     /*
      * Add the TB to the page list.
      * To avoid deadlock, acquire first the lock of the lower-addressed page.
+     * We keep the locks held until after inserting the TB in the hash table,
+     * so that if the insertion fails we know for sure that the TBs are still
+     * in the page descriptors.
+     * Note that inserting into the hash table first isn't an option, since
+     * we can only insert TBs that are fully initialized.
      */
     p = page_find_alloc(phys_pc >> TARGET_PAGE_BITS, 1);
     if (likely(phys_page2 == -1)) {
@@ -1535,21 +1544,33 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
         tb_page_add(p2, tb, 1, phys_page2);
     }
 
+    /* add in the hash table */
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
+                     tb->trace_vcpu_dstate);
+    existing_tb = qht_insert(&tb_ctx.htable, tb, h);
+
+    /* remove TB from the page(s) if we couldn't insert it */
+    if (unlikely(existing_tb)) {
+        tb_page_remove(p, tb);
+        invalidate_page_bitmap(p);
+        if (p2) {
+            tb_page_remove(p2, tb);
+            invalidate_page_bitmap(p2);
+        }
+        tb = existing_tb;
+    }
+
     if (p2) {
         page_unlock(p2);
     }
     page_unlock(p);
 
-    /* add in the hash table */
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
-                     tb->trace_vcpu_dstate);
-    qht_insert(&tb_ctx.htable, tb, h);
-
 #ifdef CONFIG_USER_ONLY
     if (DEBUG_TB_CHECK_GATE) {
         tb_page_check();
     }
 #endif
+    return tb;
 }
 
 /* Called with mmap_lock held for user mode emulation.  */
@@ -1558,7 +1579,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
                               uint32_t flags, int cflags)
 {
     CPUArchState *env = cpu->env_ptr;
-    TranslationBlock *tb;
+    TranslationBlock *tb, *existing_tb;
     tb_page_addr_t phys_pc, phys_page2;
     target_ulong virt_page2;
     tcg_insn_unit *gen_code_buf;
@@ -1706,7 +1727,15 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
      * memory barrier is required before tb_link_page() makes the TB visible
      * through the physical hash table and physical page list.
      */
-    tb_link_page(tb, phys_pc, phys_page2);
+    existing_tb = tb_link_page(tb, phys_pc, phys_page2);
+    /* if the TB already exists, discard what we just translated */
+    if (unlikely(existing_tb != tb)) {
+        uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
+
+        orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
+        atomic_set(&tcg_ctx->code_gen_ptr, orig_aligned);
+        return existing_tb;
+    }
     tcg_tb_insert(tb);
     return tb;
 }
-- 
2.7.4


Re: [Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB
Posted by Alex Bennée 7 years, 6 months ago
Emilio G. Cota <cota@braap.org> writes:

> Use the recently-gained QHT feature of returning the matching TB if it
> already exists. This allows us to get rid of the lookup we perform
> right after acquiring tb_lock.
>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/cpu-exec.c      | 14 ++------------
>  accel/tcg/translate-all.c | 47 ++++++++++++++++++++++++++++++++++++++---------
>  2 files changed, 40 insertions(+), 21 deletions(-)
>
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 7c83887..8aed38c 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -243,10 +243,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
>          if (tb == NULL) {
>              mmap_lock();
>              tb_lock();
> -            tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
> -            if (likely(tb == NULL)) {
> -                tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
> -            }
> +            tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);

tb_gen_code needs to be renamed to reflect it's semantics.
tb_get_or_gen_code? Or maybe tb_get_code with a sub-helper to do the
generation.

>              tb_unlock();
>              mmap_unlock();
>          }
> @@ -396,14 +393,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
>          tb_lock();
>          acquired_tb_lock = true;
>
> -        /* There's a chance that our desired tb has been translated while
> -         * taking the locks so we check again inside the lock.
> -         */
> -        tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
> -        if (likely(tb == NULL)) {
> -            /* if no translated code available, then translate it now */
> -            tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
> -        }
> +        tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
>
>          mmap_unlock();
>          /* We add the TB in the virtual pc hash table for the fast lookup */
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 82832ef..dbe6c12 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1503,12 +1503,16 @@ static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
>   * (-1) to indicate that only one page contains the TB.
>   *
>   * Called with mmap_lock held for user-mode emulation.
> + *
> + * Returns @tb or an existing TB that matches @tb.

That's just confusing to read. So this returns a TB like the @tb we
passed in but actually a different one matching the same conditions?

>   */
> -static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
> -                         tb_page_addr_t phys_page2)
> +static TranslationBlock *
> +tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
> +             tb_page_addr_t phys_page2)
>  {
>      PageDesc *p;
>      PageDesc *p2 = NULL;
> +    void *existing_tb;
>      uint32_t h;
>
>      assert_memory_lock();
> @@ -1516,6 +1520,11 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>      /*
>       * Add the TB to the page list.
>       * To avoid deadlock, acquire first the lock of the lower-addressed page.
> +     * We keep the locks held until after inserting the TB in the hash table,
> +     * so that if the insertion fails we know for sure that the TBs are still
> +     * in the page descriptors.
> +     * Note that inserting into the hash table first isn't an option, since
> +     * we can only insert TBs that are fully initialized.
>       */
>      p = page_find_alloc(phys_pc >> TARGET_PAGE_BITS, 1);
>      if (likely(phys_page2 == -1)) {
> @@ -1535,21 +1544,33 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>          tb_page_add(p2, tb, 1, phys_page2);
>      }
>
> +    /* add in the hash table */
> +    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
> +                     tb->trace_vcpu_dstate);
> +    existing_tb = qht_insert(&tb_ctx.htable, tb, h);

modulo comments about qht_insert API earlier in the series.

> +
> +    /* remove TB from the page(s) if we couldn't insert it */
> +    if (unlikely(existing_tb)) {
> +        tb_page_remove(p, tb);
> +        invalidate_page_bitmap(p);
> +        if (p2) {
> +            tb_page_remove(p2, tb);
> +            invalidate_page_bitmap(p2);
> +        }
> +        tb = existing_tb;
> +    }
> +
>      if (p2) {
>          page_unlock(p2);
>      }
>      page_unlock(p);
>
> -    /* add in the hash table */
> -    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
> -                     tb->trace_vcpu_dstate);
> -    qht_insert(&tb_ctx.htable, tb, h);
> -
>  #ifdef CONFIG_USER_ONLY
>      if (DEBUG_TB_CHECK_GATE) {
>          tb_page_check();
>      }
>  #endif
> +    return tb;
>  }
>
>  /* Called with mmap_lock held for user mode emulation.  */
> @@ -1558,7 +1579,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>                                uint32_t flags, int cflags)
>  {
>      CPUArchState *env = cpu->env_ptr;
> -    TranslationBlock *tb;
> +    TranslationBlock *tb, *existing_tb;
>      tb_page_addr_t phys_pc, phys_page2;
>      target_ulong virt_page2;
>      tcg_insn_unit *gen_code_buf;
> @@ -1706,7 +1727,15 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>       * memory barrier is required before tb_link_page() makes the TB visible
>       * through the physical hash table and physical page list.
>       */
> -    tb_link_page(tb, phys_pc, phys_page2);
> +    existing_tb = tb_link_page(tb, phys_pc, phys_page2);
> +    /* if the TB already exists, discard what we just translated */

So are we in the position now that we could potentially do a translation
but be beaten by another thread generating the same code? I suspect we could
do with a bit of explanatory commentary for the tb_gen_code functions.

Also I think the "Translation Blocks" section needs updating in the
MTTCG design document to make this clear.

I'm curious if we should be counting unused translations somewhere in
the JIT stats. I'm guessing you need to work at a pathalogical case to
hit this much?

> +    if (unlikely(existing_tb != tb)) {
> +        uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
> +
> +        orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
> +        atomic_set(&tcg_ctx->code_gen_ptr, orig_aligned);
> +        return existing_tb;
> +    }
>      tcg_tb_insert(tb);
>      return tb;
>  }


--
Alex Bennée

Re: [Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB
Posted by Emilio G. Cota 7 years, 6 months ago
On Thu, Mar 29, 2018 at 16:19:39 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <cota@braap.org> writes:
> 
> > Use the recently-gained QHT feature of returning the matching TB if it
> > already exists. This allows us to get rid of the lookup we perform
> > right after acquiring tb_lock.
> >
> > Suggested-by: Richard Henderson <rth@twiddle.net>
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  accel/tcg/cpu-exec.c      | 14 ++------------
> >  accel/tcg/translate-all.c | 47 ++++++++++++++++++++++++++++++++++++++---------
> >  2 files changed, 40 insertions(+), 21 deletions(-)
> >
> > diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> > index 7c83887..8aed38c 100644
> > --- a/accel/tcg/cpu-exec.c
> > +++ b/accel/tcg/cpu-exec.c
> > @@ -243,10 +243,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
> >          if (tb == NULL) {
> >              mmap_lock();
> >              tb_lock();
> > -            tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
> > -            if (likely(tb == NULL)) {
> > -                tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
> > -            }
> > +            tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
> 
> tb_gen_code needs to be renamed to reflect it's semantics.
> tb_get_or_gen_code? Or maybe tb_get_code with a sub-helper to do the
> generation.

I think it can remain as tb_gen_code. The caller still gets
a TB, and whether that TB has been generated by this thread or
any other thread is irrelevant.

(snip)
> > --- a/accel/tcg/translate-all.c
> > +++ b/accel/tcg/translate-all.c
> > @@ -1503,12 +1503,16 @@ static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
> >   * (-1) to indicate that only one page contains the TB.
> >   *
> >   * Called with mmap_lock held for user-mode emulation.
> > + *
> > + * Returns @tb or an existing TB that matches @tb.
> 
> That's just confusing to read. So this returns a TB like the @tb we
> passed in but actually a different one matching the same conditions?

Good point. Here tb_link_page is not a great name, but instead
of adding a long name such as tb_link_page_or_get_existing, in
v2 I've expanded the above comment. It now looks as follows:

 * Returns a pointer @tb, or a pointer to an existing TB that matches @tb.
 * Note that in !user-mode, another thread might have already added a TB
 * for the same block of guest code that @tb corresponds to. In that case,
 * the caller should discard the original @tb, and use instead the returned TB.
 
> > @@ -1706,7 +1727,15 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
> >       * memory barrier is required before tb_link_page() makes the TB visible
> >       * through the physical hash table and physical page list.
> >       */
> > -    tb_link_page(tb, phys_pc, phys_page2);
> > +    existing_tb = tb_link_page(tb, phys_pc, phys_page2);
> > +    /* if the TB already exists, discard what we just translated */
> 
> So are we in the position now that we could potentially do a translation
> but be beaten by another thread generating the same code?

Exactly.

> I suspect we could
> do with a bit of explanatory commentary for the tb_gen_code functions.

As I said above I don't think tb_gen_code changes at all
to its callers, since the caller still gets a TB pointer that it
did not have before.

tb_link_page is the key here -- I hope the updated comment
I quoted above is enough to make things clear.

> Also I think the "Translation Blocks" section needs updating in the
> MTTCG design document to make this clear.

I've added a comment at the bottom of that section:

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.

> I'm curious if we should be counting unused translations somewhere in
> the JIT stats. I'm guessing you need to work at a pathalogical case to
> hit this much?

This should be extremely rare on most workloads. Given that and the
fact that we won't have unused translated code (we discard it by
resetting code_gen_ptr), I wouldn't worry too much about this.
In the unlikely case that it ever became a problem, TCG profiling
time would account for it, and on a perf profile we'd see the slow
path in tb_link_page being taken.

Thanks,

		Emilio