One of the follow-on items I had from the bitops clean-up is this:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 648d6dd475ba..9c3a017606ed 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3425,7 +3425,7 @@ static int vcpumask_to_pcpumask(
             unsigned int cpu;
 
             vcpu_id = ffsl(vmask) - 1;
-            vmask &= ~(1UL << vcpu_id);
+            vmask &= vmask - 1;
             vcpu_id += vcpu_bias;
             if ( (vcpu_id >= d->max_vcpus) )
                 return 0;
which yields the following improvement:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
Function                                     old     new   delta
vcpumask_to_pcpumask                         519     485     -34
While I (the programmer) can reason the two expressions are equivalent,
the compiler can't, so we really are saving a SHL to re-calculate (1 <<
vcpu_id) and swapping it for a LEA -1(vmask) which happens to be hoisted
above the ffsl() call.
However, the majority of the savings came from the fact that the old
code used to hold 1 in %r15 (across the entire function!) so it could
calculate (1 << vcpu_id) on each loop iteration. With %r15 now free for
other use, we have one fewer thing spilt to the stack.
Anyway - while it goes to show that while/ffs logic should be extra
careful to use x &= x - 1 for its loop condition logic, that's not all.
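For reference, here are the two loop shapes side by side as a standalone
sketch (plain C, not the Xen code; ffsl() approximated by the
__builtin_ffsl() builtin):

#include <stdio.h>

/* Old shape: recompute (1UL << bit) in order to clear the bit just found. */
static void walk_shift(unsigned long mask)
{
    while ( mask )
    {
        unsigned int bit = __builtin_ffsl(mask) - 1;

        mask &= ~(1UL << bit);   /* depends on the ffsl() result */
        printf("bit %u\n", bit);
    }
}

/* New shape: clear the lowest set bit without needing 'bit' at all. */
static void walk_lsb(unsigned long mask)
{
    while ( mask )
    {
        unsigned int bit = __builtin_ffsl(mask) - 1;

        mask &= mask - 1;        /* independent of the ffsl() result */
        printf("bit %u\n", bit);
    }
}

int main(void)
{
    walk_shift(0x91UL); /* bits 0, 4, 7 */
    walk_lsb(0x91UL);   /* same bits visited, cheaper clearing */
    return 0;
}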
The rest of this function is crazy. We're reading a guest-word at a
time in order to make a d->max_vcpus sized bitmap (with a reasonable
amount of open-coded logic to reassemble the fragments back into a vcpu
number).
PVH dom0 can reach here, and because it's not pv32, will be deemed to
have a guest word size of 64. Also, word-at-a-time for any HVM guest is
an insane overhead in terms of the pagewalks behind the copy_from_hvm()
infrastructure.
Instead, we should calculate the size of the bitmap at
DIV_ROUND_UP(d->max_vcpus, 8), copy the bitmap in one whole go, and then
just have a single for_each_set_bit() looping over it. Importantly this
avoids needing to know or care about the guest word size.
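Very roughly, and purely as a sketch (the copy helper, the fixed bitmap
size and the loop body below are all placeholders rather than the real
code):

    unsigned long vmask[BITS_TO_LONGS(8192)] = {}; /* 8k = current PV limit */
    unsigned int vcpu_id, nr_bytes = DIV_ROUND_UP(d->max_vcpus, 8);

    /* One copy of the whole guest bitmap, regardless of guest word size. */
    if ( copy_from_guest_bytes(vmask, bmap, nr_bytes) ) /* placeholder name */
        return -EFAULT;

    for_each_set_bit ( vcpu_id, vmask, d->max_vcpus )
    {
        const struct vcpu *v = d->vcpu[vcpu_id];

        /* ... existing per-vCPU to pCPU translation ... */
    }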
This removes 4 of the 6 hidden lfences: the pair for calculating
is_native to start with, and then one of the two pairs behind the use of
is_native to select the type of the access.
The only complication is this: Right now, PV max vCPUs is 8k, so we
could even get away with this being an on-stack bitmap. However, the
vast majority of PV guests are small, and a 64-bit bitmap would do fine.
But, given this is just one example, wouldn't it be better if we just
unconditionally had a marshalling buffer for hypercalls? There's the
XLAT area, but it doesn't exist for PV64, and we've got various other pieces
of scratch per-cpu data.
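By "marshalling buffer" I mean conceptually something like this (name,
size and exact declaration are purely illustrative):

    /* Hypothetical per-CPU scratch space for marshalling hypercall args. */
    static DEFINE_PER_CPU(uint8_t, hypercall_scratch[PAGE_SIZE]);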
If not, having a 128/256-bit bitmap on the stack will still be good
enough in practice, and still ok at amortising the PVH dom0 costs.
Thoughts?
~Andrew
On 01.06.2024 20:50, Andrew Cooper wrote:
> One of the follow-on items I had from the bitops clean-up is this:
> [...]
> -            vmask &= ~(1UL << vcpu_id);
> +            vmask &= vmask - 1;
> [...]
> which yields the following improvement:
>
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-34 (-34)
> Function                                     old     new   delta
> vcpumask_to_pcpumask                         519     485     -34

Nice. At the risk of getting flamed again for raising dumb questions:
Considering that elsewhere "trickery" like the &= mask - 1 here was
deemed not nice to have (at least wanting to be hidden behind a
suitably named macro; see e.g. ISOLATE_LSB()), wouldn't __clear_bit()
be suitable here too, and less at risk of being considered "trickery"?

But yes, that would eliminate the benefit of making the bit clearing
independent of the ffsl() result. And personally I'm fine anyway with
the form as suggested.

> While I (the programmer) can reason the two expressions are equivalent,
> the compiler can't,

Why is it you think it can't? There's no further knowledge that you
as a human need to rely on for this, afaics. If ffsl() uses the
built-in (as it now does), the compiler has full insight into what's
going on. It's just that compiler engineers may not deem it worth the
effort to carry out such a special-case optimization.

> [...]
>
> If not, having a 128/256-bit bitmap on the stack will still be good
> enough in practice, and still ok at amortising the PVH dom0 costs.
>
> Thoughts?

Well, yes, the last of what you suggest might certainly be worthwhile
as a minimal improvement. The only slight concern with increasing the
granularity is that it'll be increasingly less likely that a 2nd or
further iteration of the loop would actually be exercised in routine
testing.

A marshalling buffer might also make sense to have, as long as we have
more than just one or two places wanting to use it. Since you say "just
one example", likely you have further uses in mind. In most other
places we use input data more directly, so it looks like I can't right
away come up with possible further use cases.

Jan
On 03/06/2024 10:19 pm, Jan Beulich wrote:
> Nice. At the risk of getting flamed again for raising dumb questions:
> Considering that elsewhere "trickery" like the &= mask - 1 here was
> deemed not nice to have (at least wanting to be hidden behind a
> suitably named macro; see e.g. ISOLATE_LSB()), wouldn't __clear_bit()
> be suitable here too, and less at risk of being considered "trickery"?

__clear_bit() is even worse, because it forces the bitmap to be spilled
to memory. It hopefully won't when I've given the test/set helpers the
same treatment as ffs/fls.

I'm not really a fan of the bithack. Like the others, it's completely
unintuitive unless you know it. However, the improvements speak for
themselves, and in this one case the best chance of people recognising
it is in full; hiding any part behind a macro (of any name) obscures
things further.

> Why is it you think it can't? There's no further knowledge that you
> as a human need to rely on for this, afaics. If ffsl() uses the
> built-in (as it now does), the compiler has full insight into what's
> going on. It's just that compiler engineers may not deem it worth the
> effort to carry out such a special-case optimization.

On x86, it's a block of asm, not the builtin. But even with the
builtin, just because there is a builtin transforming a common
expression into efficient assembly doesn't mean there's semantic
reasoning about the result.

https://godbolt.org/z/eah1374a1

I can't find any way to get any compiler to reason about "bit" in order
to avoid the shift (or rotate, for Clang).

~Andrew
On 07.06.2024 14:35, Andrew Cooper wrote:
> __clear_bit() is even worse, because it forces the bitmap to be spilled
> to memory. It hopefully won't when I've given the test/set helpers the
> same treatment as ffs/fls.

Sorry, not directly related here: When you're saying "when I've given",
does that mean you'd like Oleksii's "xen: introduce generic non-atomic
test_*bit()" to not go in once at least an Arm ack has arrived?

Jan
On 10/06/2024 8:15 am, Jan Beulich wrote:
> Sorry, not directly related here: When you're saying "when I've given",
> does that mean you'd like Oleksii's "xen: introduce generic non-atomic
> test_*bit()" to not go in once at least an Arm ack has arrived?

If we weren't deep in a code freeze, I'd be insisting on some changes
in that patch.

For now, I'll settle for not introducing regressions, so it needs at
least one more spin (there's a MISRA and UB regression I spotted, but I
haven't reviewed it in detail yet).

But yes - they're going to end up rather different when I've applied
all the compile-time optimisations which are available.

~Andrew
On 10.06.2024 12:12, Andrew Cooper wrote:
> If we weren't deep in a code freeze, I'd be insisting on some changes
> in that patch.
>
> For now, I'll settle for not introducing regressions, so it needs at
> least one more spin (there's a MISRA and UB regression I spotted, but I
> haven't reviewed it in detail yet).

Did you point this[1] out to him? I've just checked, and I can't seem
to find any reply of yours on any of the respective sub-threads, which
formally still means the patch would be fine to go in as is once an Arm
ack arrives (taking the same approach as you did elsewhere wrt a PPC
one). It's just that informally at least I now know to wait ...

Jan

[1] It'll likely be embarrassing to learn what I've overlooked during
the many rounds of review.