In the HPET_STATUS handling, the use of __clear_bit(i, &new_val) is the only
thing causing it to be spilled to the stack. Furthemore we only care about
the bottom 3 bits, so rewrite it to be a plain for loop.
For the {start,stop}_timer variables, these are spilled to the stack despite
the __{set,clear}_bit() calls. Again we only care about the bottom 3 bits, so
shrink the variables from long to int. Use for_each_set_bit() rather than
opencoding it at the end which amongst other things means the loop predicate
is no longer forced to the stack by the loop body.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
All in all, it's modest according to bloat-o-meter:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-29 (-29)
Function old new delta
hpet_write 2225 2196 -29
but we have shrunk the stack frame by 8 bytes; 0x28 as opposed to 0x30 before.
---
xen/arch/x86/hvm/hpet.c | 29 ++++++++---------------------
1 file changed, 8 insertions(+), 21 deletions(-)
diff --git a/xen/arch/x86/hvm/hpet.c b/xen/arch/x86/hvm/hpet.c
index 87642575f9cd..e3981d5e467c 100644
--- a/xen/arch/x86/hvm/hpet.c
+++ b/xen/arch/x86/hvm/hpet.c
@@ -349,8 +349,7 @@ static int cf_check hpet_write(
unsigned int tn, i;
/* Acculumate a bit mask of timers whos state is changed by this write. */
- unsigned long start_timers = 0;
- unsigned long stop_timers = 0;
+ unsigned int start_timers = 0, stop_timers = 0;
#define set_stop_timer(n) (__set_bit((n), &stop_timers))
#define set_start_timer(n) (__set_bit((n), &start_timers))
#define set_restart_timer(n) (set_stop_timer(n),set_start_timer(n))
@@ -405,16 +404,12 @@ static int cf_check hpet_write(
case HPET_STATUS:
/* write 1 to clear. */
- while ( new_val )
+ for ( i = 0; i < HPET_TIMER_NUM; i++ )
{
- bool active;
-
- i = ffsl(new_val) - 1;
- if ( i >= HPET_TIMER_NUM )
- break;
- __clear_bit(i, &new_val);
- active = __test_and_clear_bit(i, &h->hpet.isr);
- if ( active )
+ if ( !(new_val & (1U << i)) )
+ continue;
+
+ if ( __test_and_clear_bit(i, &h->hpet.isr) )
{
hvm_ioapic_deassert(v->domain, timer_int_route(h, i));
if ( hpet_enabled(h) && timer_enabled(h, i) &&
@@ -533,19 +528,11 @@ static int cf_check hpet_write(
}
/* stop/start timers whos state was changed by this write. */
- while (stop_timers)
- {
- i = ffsl(stop_timers) - 1;
- __clear_bit(i, &stop_timers);
+ for_each_set_bit ( i, stop_timers )
hpet_stop_timer(h, i, guest_time);
- }
- while (start_timers)
- {
- i = ffsl(start_timers) - 1;
- __clear_bit(i, &start_timers);
+ for_each_set_bit ( i, start_timers )
hpet_set_timer(h, i, guest_time);
- }
#undef set_stop_timer
#undef set_start_timer
--
2.39.2
On 27.08.2024 15:57, Andrew Cooper wrote: > In the HPET_STATUS handling, the use of __clear_bit(i, &new_val) is the only > thing causing it to be spilled to the stack. Furthemore we only care about > the bottom 3 bits, so rewrite it to be a plain for loop. > > For the {start,stop}_timer variables, these are spilled to the stack despite > the __{set,clear}_bit() calls. That's an observation from what the compiler happens to do? I don't see any other reason why they would need spilling; I expect it's merely a matter of registers better be used for other variables. If we ever meant to build Xen with APX fully in use, that might change. IOW may I at least ask for s/are/happen to be/? I'm also a little irritated by "despite", but you're the native speaker. It would have seemed to me that e.g. "irrespective of" would better express what (I think) is meant. > Again we only care about the bottom 3 bits, so > shrink the variables from long to int. Use for_each_set_bit() rather than > opencoding it at the end which amongst other things means the loop predicate > is no longer forced to the stack by the loop body. > > Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> > --- > CC: Jan Beulich <JBeulich@suse.com> > CC: Roger Pau Monné <roger.pau@citrix.com> > > All in all, it's modest according to bloat-o-meter: > > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-29 (-29) > Function old new delta > hpet_write 2225 2196 -29 > > but we have shrunk the stack frame by 8 bytes; 0x28 as opposed to 0x30 before. However, on the negative side all the first of the loops you touch now always takes 3 iterations, when previously we may have got away with as little as none. Is there a reason not to use for_each_set_bit ( i, new_val & ((1U << HPET_TIMER_NUM) - 1) ) there (with the masking of the low bit possibly pulled out)? > @@ -533,19 +528,11 @@ static int cf_check hpet_write( > } > > /* stop/start timers whos state was changed by this write. */ > - while (stop_timers) > - { > - i = ffsl(stop_timers) - 1; > - __clear_bit(i, &stop_timers); > + for_each_set_bit ( i, stop_timers ) > hpet_stop_timer(h, i, guest_time); > - } > > - while (start_timers) > - { > - i = ffsl(start_timers) - 1; > - __clear_bit(i, &start_timers); > + for_each_set_bit ( i, start_timers ) > hpet_set_timer(h, i, guest_time); > - } To avoid variable shadowing, I think you don't want to use i in these two loops. Alternatively the function scope i would need constraining to the individual loops. Unrelated to the change you make, but related to the code you touch: Isn't there a bug there with the length != 8 handling ahead of the switch()? The bits being write-1-to-clear, using the value read for parts the original insn didn't write means we might clear ISR bits we weren't asked to clear. I guess I'll make a patch, which may want to go ahead of yours for ease of backporting. (Of course guests should have no need to write to other than the bottom part of the register, but still.) Jan
On 28/08/2024 9:13 am, Jan Beulich wrote: > On 27.08.2024 15:57, Andrew Cooper wrote: >> In the HPET_STATUS handling, the use of __clear_bit(i, &new_val) is the only >> thing causing it to be spilled to the stack. Furthemore we only care about >> the bottom 3 bits, so rewrite it to be a plain for loop. >> >> For the {start,stop}_timer variables, these are spilled to the stack despite >> the __{set,clear}_bit() calls. > That's an observation from what the compiler happens to do? I don't see any > other reason why they would need spilling; I expect it's merely a matter of > registers better be used for other variables. It is a consequence of how our helpers are written. I do expect it to improve when I get around to reworking them. For example, the Linux helpers have enough constant folding capabilities to allow the compiler to turn: { int foo = 0; ... __set_bit(1, &foo); into: { int foo = 1; as well as being able to emit LOCK AND/OR/XOR in place of LOCK BT{C,S,R} for a constant bit position. One thing I want to do, which I haven't figured out how to do yet, is to allow the arch form to emit BT?Q forms. Right now, code generation for PGC_* and PGT_* suffers quite a lot. We mix between reg/imm logic, then spill to the stack because top bits aren't within range for the "I" constraint on 32-bit instructions, issue a BT?L reg/mem (which has much higher latency than any other form), then pick it back off the stack to do more reg/imm logic. I was wondering if, because of the always_inline, I could do something like __builtin_constant_p(bit) && __builtin_object_size(addr, 0) >= 8 and emitting long-granular logic, which will be able to pick the imm/reg form rather than turning into reg/mem. But, I've not had time to experiment here, and I doubt I'll get around to it soon. Another optimisation we're lacking vs Linux is that our test_bit() has a volatile pointer where Linux's is non-volatile. This makes a massive difference for the ability to optimise looking at multiple bits. > If we ever meant to build Xen > with APX fully in use, that might change. IOW may I at least ask for > s/are/happen to be/? I'm also a little irritated by "despite", but you're > the native speaker. It would have seemed to me that e.g. "irrespective of" > would better express what (I think) is meant. "despite" isn't really the right term, but I also wouldn't have said it was something to be irritated over. What I was trying to say was "they're spilled to the stack even with the __set_bit() calls removed". Which makes sense; they're values held for almost the full duration of the function, that are not used in ~every step of logic. Interestingly, given that they're spilled to the stack, the __set_bit() form is more efficient than the plain C "|= (1u << i);", but I'd still like an implementation which could make that determination itself. > >> Again we only care about the bottom 3 bits, so >> shrink the variables from long to int. Use for_each_set_bit() rather than >> opencoding it at the end which amongst other things means the loop predicate >> is no longer forced to the stack by the loop body. >> >> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >> --- >> CC: Jan Beulich <JBeulich@suse.com> >> CC: Roger Pau Monné <roger.pau@citrix.com> >> >> All in all, it's modest according to bloat-o-meter: >> >> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-29 (-29) >> Function old new delta >> hpet_write 2225 2196 -29 >> >> but we have shrunk the stack frame by 8 bytes; 0x28 as opposed to 0x30 before. > However, on the negative side all the first of the loops you touch now always > takes 3 iterations, when previously we may have got away with as little as > none. Is there a reason not to use > > for_each_set_bit ( i, new_val & ((1U << HPET_TIMER_NUM) - 1) ) > > there (with the masking of the low bit possibly pulled out)? There are multiple angles here. First, I got an unexpected surprise on ARM with an expression, and while this one won't pick up pointer const-ness, I can never remember what MISRA's view on this is. Second, this is the odd-loop-out compared to rest of the function, which are all of the form "for ( i = 0; i < HPET_TIMER_NUM ;". But perhaps most importantly, OSes don't touch this register. Xen not at all, and Linux only in _hpet_print_config(). Neither bother preserving/clearing it on suspend/resume, even when running the HPET in legacy replacement mode. I haven't checked windows behaviour, but I don't expect it to differ here. This register simply isn't interesting for the preferred type of interrupts (edge), and also isn't useful for an ISR handling a line interrupt. So my choice was based on which produced the smallest code, because it's an dead-in-practice codepath. > >> @@ -533,19 +528,11 @@ static int cf_check hpet_write( >> } >> >> /* stop/start timers whos state was changed by this write. */ >> - while (stop_timers) >> - { >> - i = ffsl(stop_timers) - 1; >> - __clear_bit(i, &stop_timers); >> + for_each_set_bit ( i, stop_timers ) >> hpet_stop_timer(h, i, guest_time); >> - } >> >> - while (start_timers) >> - { >> - i = ffsl(start_timers) - 1; >> - __clear_bit(i, &start_timers); >> + for_each_set_bit ( i, start_timers ) >> hpet_set_timer(h, i, guest_time); >> - } > To avoid variable shadowing, I think you don't want to use i in these two > loops. Alternatively the function scope i would need constraining to the > individual loops. Yeah, I was bitten by that on one of the ARM patches. I'll adjust. ~Andrew
On 28.08.2024 19:50, Andrew Cooper wrote: > On 28/08/2024 9:13 am, Jan Beulich wrote: >> On 27.08.2024 15:57, Andrew Cooper wrote: >>> In the HPET_STATUS handling, the use of __clear_bit(i, &new_val) is the only >>> thing causing it to be spilled to the stack. Furthemore we only care about >>> the bottom 3 bits, so rewrite it to be a plain for loop. >>> >>> For the {start,stop}_timer variables, these are spilled to the stack despite >>> the __{set,clear}_bit() calls. >> That's an observation from what the compiler happens to do? I don't see any >> other reason why they would need spilling; I expect it's merely a matter of >> registers better be used for other variables. > > It is a consequence of how our helpers are written. I do expect it to > improve when I get around to reworking them. > > For example, the Linux helpers have enough constant folding capabilities > to allow the compiler to turn: > > { > int foo = 0; > ... > __set_bit(1, &foo); > > into: > > { > int foo = 1; > > > as well as being able to emit LOCK AND/OR/XOR in place of LOCK BT{C,S,R} > for a constant bit position. > > One thing I want to do, which I haven't figured out how to do yet, is to > allow the arch form to emit BT?Q forms. > > Right now, code generation for PGC_* and PGT_* suffers quite a lot. We > mix between reg/imm logic, then spill to the stack because top bits > aren't within range for the "I" constraint on 32-bit instructions, issue > a BT?L reg/mem (which has much higher latency than any other form), then > pick it back off the stack to do more reg/imm logic. > > I was wondering if, because of the always_inline, I could do something > like __builtin_constant_p(bit) && __builtin_object_size(addr, 0) >= 8 > and emitting long-granular logic, which will be able to pick the imm/reg > form rather than turning into reg/mem. That may work, provided there actually was always_inline. >> If we ever meant to build Xen >> with APX fully in use, that might change. IOW may I at least ask for >> s/are/happen to be/? I'm also a little irritated by "despite", but you're >> the native speaker. It would have seemed to me that e.g. "irrespective of" >> would better express what (I think) is meant. > > "despite" isn't really the right term, but I also wouldn't have said it > was something to be irritated over. > > What I was trying to say was "they're spilled to the stack even with the > __set_bit() calls removed". Which makes sense; they're values held for > almost the full duration of the function, that are not used in ~every > step of logic. Right, the "not a good use for a register var" reason that I had alluded to. >>> Again we only care about the bottom 3 bits, so >>> shrink the variables from long to int. Use for_each_set_bit() rather than >>> opencoding it at the end which amongst other things means the loop predicate >>> is no longer forced to the stack by the loop body. >>> >>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> >>> --- >>> CC: Jan Beulich <JBeulich@suse.com> >>> CC: Roger Pau Monné <roger.pau@citrix.com> >>> >>> All in all, it's modest according to bloat-o-meter: >>> >>> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-29 (-29) >>> Function old new delta >>> hpet_write 2225 2196 -29 >>> >>> but we have shrunk the stack frame by 8 bytes; 0x28 as opposed to 0x30 before. >> However, on the negative side all the first of the loops you touch now always >> takes 3 iterations, when previously we may have got away with as little as >> none. Is there a reason not to use >> >> for_each_set_bit ( i, new_val & ((1U << HPET_TIMER_NUM) - 1) ) >> >> there (with the masking of the low bit possibly pulled out)? > > There are multiple angles here. > > First, I got an unexpected surprise on ARM with an expression, and while > this one won't pick up pointer const-ness, I can never remember what > MISRA's view on this is. > > Second, this is the odd-loop-out compared to rest of the function, which > are all of the form "for ( i = 0; i < HPET_TIMER_NUM ;". > > But perhaps most importantly, OSes don't touch this register. Xen not > at all, and Linux only in _hpet_print_config(). Neither bother > preserving/clearing it on suspend/resume, even when running the HPET in > legacy replacement mode. > > I haven't checked windows behaviour, but I don't expect it to differ > here. This register simply isn't interesting for the preferred type of > interrupts (edge), and also isn't useful for an ISR handling a line > interrupt. Yet there must have been an environment where the register is of use, or else Roger wouldn't have been prompted to make what is now be07023be115 ("x86/vhpet: add support for level triggered interrupts"). Jan
© 2016 - 2024 Red Hat, Inc.