tcg: parallel code generation (Work in Progress)

[Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress)

Posted by Emilio G. Cota 8 years, 7 months ago

This is a request for comments as well as a request for help :-)

I've been experimenting with making TCGContext per-thread, so that we can
run most of tcg_gen_code in parallel. I've made some progress,
but haven't yet got it to work.

My guess is that the TCG stack is still global instead of per-vCPU
(it's been global since tmp_buf was removed from CPUState, right?),
but I'm having trouble following that code, so most likely I'm wrong.

Any help would be appreciated. Please disregard minor nits for now: I
first want to see whether I can make this work, and then take
measurements to decide whether it is worth the trouble.

- Patch 1 is a trivial doc fixup; feel free to pick it up.

- Patches 2-3 remove *tbs[] to use a binary search tree instead.
  This removes the assumption in tb_find_pc that *tbs[] are ordered
  by tc_ptr, thereby allowing us to generate code regardless of
  its location on the host (as we do after patch 6).

- Patch 4 addresses a reporting issue: ever since we embedded the
  struct TB's in code_gen_buffer (6e3b2bfd6), we have been
  misreporting the size of the generated code. Not a huge deal,
  but I noticed it while working on this.

- Patches 5-7 make TCGContext per-thread in softmmu. I have left some
  XXX's in there to mark issues I'm already aware of, so don't worry
  too much about those--unless of course you have any input on
  what the cause of the race(s) might be.
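
To illustrate the idea behind patches 2-3, here is a minimal sketch of
looking up a TB by host PC in a binary search tree keyed on the host code
range, which works no matter in what order the ranges were allocated.
This is not the code from the patches: struct tb_node, tb_tree_insert and
tb_tree_find_pc are hypothetical names, the tree is unbalanced for
brevity, and the ranges are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a TB: only the host code range matters here. */
struct tb_node {
    uintptr_t tc_ptr;            /* start of the generated host code */
    size_t tc_size;              /* size of the generated host code, in bytes */
    struct tb_node *left, *right;
};

/* Insert by start address. Ranges are assumed not to overlap. */
static void tb_tree_insert(struct tb_node **root, struct tb_node *tb)
{
    assert(tb->tc_size > 0);
    tb->left = tb->right = NULL;
    while (*root) {
        root = tb->tc_ptr < (*root)->tc_ptr ? &(*root)->left
                                            : &(*root)->right;
    }
    *root = tb;
}

/*
 * Return the node whose [tc_ptr, tc_ptr + tc_size) range contains host_pc,
 * or NULL if there is none. No global ordering of allocations is needed.
 */
static struct tb_node *tb_tree_find_pc(struct tb_node *root, uintptr_t host_pc)
{
    while (root) {
        if (host_pc < root->tc_ptr) {
            root = root->left;
        } else if (host_pc - root->tc_ptr >= root->tc_size) {
            root = root->right;
        } else {
            return root;
        }
    }
    return NULL;
}
```

A real implementation would probably want a balanced tree so that lookups
stay O(log n) even when insertions arrive in address order.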

Thanks,

		Emilio

Re: [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress)

Posted by Richard Henderson 8 years, 7 months ago

On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> - Patches 2-3 remove *tbs[] to use a binary search tree instead.
>    This removes the assumption in tb_find_pc that *tbs[] are ordered
>    by tc_ptr, thereby allowing us to generate code regardless of
>    its location on the host (as we do after patch 6).

Have you considered a scheme by which the front end translation and tcg 
optimization are done outside the lock, but final code generation is done 
inside the lock?

It would put at least half of the translation time in the parallel space 
without requiring changes to code_buffer allocation.
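
A rough sketch of that split, assuming a pthread mutex and a single shared
code buffer. All names here (struct ir_block, translate_and_optimize,
emit_locked, code_buf) are made up for illustration and do not correspond
to actual TCG functions; the "translation" is just a copy standing in for
the real work.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define CODE_BUF_SIZE 4096

/* Shared output buffer, protected by code_buf_lock. */
static unsigned char code_buf[CODE_BUF_SIZE];
static size_t code_buf_used;
static pthread_mutex_t code_buf_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical thread-private intermediate representation. */
struct ir_block {
    unsigned char ops[64];
    size_t n_ops;
};

/*
 * Stand-in for front end translation plus optimization. It only touches
 * thread-private state, so it can run without holding the lock.
 */
static void translate_and_optimize(const unsigned char *guest, size_t len,
                                   struct ir_block *ir)
{
    assert(guest != NULL && ir != NULL);
    ir->n_ops = len < sizeof(ir->ops) ? len : sizeof(ir->ops);
    memcpy(ir->ops, guest, ir->n_ops);
}

/*
 * Stand-in for final code generation: only this step takes the lock,
 * because it is the only one that writes to the shared buffer.
 * Returns NULL if the buffer is full.
 */
static unsigned char *emit_locked(const struct ir_block *ir)
{
    unsigned char *out = NULL;

    pthread_mutex_lock(&code_buf_lock);
    if (ir->n_ops <= CODE_BUF_SIZE - code_buf_used) {
        out = code_buf + code_buf_used;
        memcpy(out, ir->ops, ir->n_ops);
        code_buf_used += ir->n_ops;
    }
    pthread_mutex_unlock(&code_buf_lock);
    return out;
}
```

Each vCPU thread would call translate_and_optimize() on its own ir_block
and then emit_locked(), so the buffer allocation stays serialized and
sequential as it is today.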

r~

Re: [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress)

Posted by Emilio G. Cota 8 years, 7 months ago

On Fri, Jun 30, 2017 at 01:25:54 -0700, Richard Henderson wrote:
> On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> >- Patches 2-3 remove *tbs[] to use a binary search tree instead.
> >   This removes the assumption in tb_find_pc that *tbs[] are ordered
> >   by tc_ptr, thereby allowing us to generate code regardless of
> >   its location on the host (as we do after patch 6).
> 
> Have you considered a scheme by which the front end translation and tcg
> optimization are done outside the lock, but final code generation is done
> inside the lock?
> 
> It would put at least half of the translation time in the parallel space
> without requiring changes to code_buffer allocation.

I don't think that would save much, because the performance issue comes
from having to grab the lock at all, regardless of how long we hold it.
So even if we did nothing inside the lock, scalability when translating
a lot of code (e.g. booting) would still be quite bad.

So we either get rid of the lock altogether, or use a more scalable lock.
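
As one possible shape for the "no lock" option, here is a minimal sketch
in which the code buffer is split into fixed per-thread regions, so each
thread bump-allocates from its own region without touching shared state.
This only illustrates the idea and is not what the patches do; N_THREADS,
REGION_SIZE, struct thread_region, region_init and region_alloc are all
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N_THREADS   4
#define REGION_SIZE 1024

/* One buffer, statically partitioned into N_THREADS equal regions. */
static unsigned char code_buf[N_THREADS * REGION_SIZE];

/* Per-thread allocation state; never shared between threads. */
struct thread_region {
    unsigned char *start;
    size_t used;
};

/* Bind a region descriptor to the slice owned by thread_idx. */
static void region_init(struct thread_region *r, unsigned int thread_idx)
{
    assert(thread_idx < N_THREADS);
    r->start = code_buf + (size_t)thread_idx * REGION_SIZE;
    r->used = 0;
}

/*
 * Bump-allocate size bytes from the caller's own region. No lock is
 * taken because no other thread writes to this region or to r.
 * Returns NULL when the region is exhausted.
 */
static unsigned char *region_alloc(struct thread_region *r, size_t size)
{
    unsigned char *p;

    if (size > REGION_SIZE - r->used) {
        return NULL;
    }
    p = r->start + r->used;
    r->used += size;
    return p;
}
```

The price is that generated code is no longer laid out in allocation
order across threads, which is why a lookup that does not depend on
ordering is needed first.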

		E.