[Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history

Ian Jackson posted 13 patches 1 week ago
Failed in applying to current master (apply log)
sg-report-host-history | 189 +++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 160 insertions(+), 29 deletions(-)

[Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history

Posted by Ian Jackson 1 week ago
Earlier this week we discovered that sg-report-host-history was running
extremely slowly.  We applied an emergency fix 0fa72b13f5af
  sg-report-host-history: Reduce limit from 2000 to 200

The main problem is that sg-report-host-history runs once for each
flight, and must generate a relevant history view of the recent
history for each host - including much history that is already in the
old version of the html file.

The slow part is asking the database about information about each job,
including its final step, allocation step, etc.  (The main query which
digs out relevant jobs is also rather time consuming it runs all in
one go and takes only a minute or two.)

In this series we introduce a mechanism which caches much of the
historical analysis.

It is not straightforward to reuse old html data as-is because we
would have to do a merge sort with the new data and that would involve
rewriting the alternating background colour (!)

So instead, we stuff the information we got from the database into
comments in the HTML, which we can then scan on future runs.

The overall result is a factor of 10 speedup in my tests, for the
original history limit of 2000.  That is now fast enough we can put
it back.

(I was not able to reproduce the exceptional case I saw earlier in the
week, where it was apparently taking hours.  I suspect that there
comes a tipping point where db transactions end up being restarted.)

The patches are broken down into small pieces so that I could think
about them clearly and do self-review.

Ian Jackson (13):
  sg-report-host-history: Improve debugging output
  sg-report-host-history: New --no-install option for testing
  sg-report-host-history: Move `computeflightsrange' after hosts
  sg-report-host-history: Actually honour $minflight
  sg-report-host-history: Get job status from mainquery
  sg-report-host-history: Add $cachekey argument to jobquery
  sg-report-host-history: Store per-job query results in %$jr
  sg-report-host-history: Write cache entries
  sg-report-host-history: Write cache entries for tail, too
  sg-report-host-history: Read cache entries
  sg-report-host-history: Move job runvars query later
  sg-report-host-history: Cache runvar queries (power information)
  Revert "sg-report-host-history: Reduce limit from 2000 to 200"

 sg-report-host-history | 189 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 160 insertions(+), 29 deletions(-)

-- 
2.11.0


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history

Posted by Sander Eikelenboom 1 week ago
On 08/11/2019 19:49, Ian Jackson wrote:
> Earlier this week we discovered that sg-report-host-history was running
> extremely slowly.  We applied an emergency fix 0fa72b13f5af
>   sg-report-host-history: Reduce limit from 2000 to 200
> 
> The main problem is that sg-report-host-history runs once for each
> flight, and must generate a relevant history view of the recent
> history for each host - including much history that is already in the
> old version of the html file.
> 
> The slow part is asking the database about information about each job,
> including its final step, allocation step, etc.  (The main query which
> digs out relevant jobs is also rather time consuming it runs all in
> one go and takes only a minute or two.)
> 
> In this series we introduce a mechanism which caches much of the
> historical analysis.
> 
> It is not straightforward to reuse old html data as-is because we
> would have to do a merge sort with the new data and that would involve
> rewriting the alternating background colour (!)
> 
> So instead, we stuff the information we got from the database into
> comments in the HTML, which we can then scan on future runs.

Not mend to bike shed, so just for consideration:
- Have you considered (inline) css for the background colouring, or does
  it have to be html only  ?
- And for caching perhaps a materialized view with aggregated data only
  refreshed at a more convient time could perhaps help at the database
  level ?

--
Sander

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history

Posted by Ian Jackson 1 week ago
Sander Eikelenboom writes ("Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history"):
> Not mend to bike shed, so just for consideration:

Suggestions are very welcome.  Be careful, I'm still looking for a
co-maintainer :-).

> - Have you considered (inline) css for the background colouring, or does
>   it have to be html only  ?

There is no particular reason why it shouldn't be CSS.  Is there a
reason why doing it in html causes problems for you ?

The background colours for the cells are made with
  report_altcolour
  report_altchangecolour
in Osstest/Executive.pm.

report_altcolour returns something that can be put into an element
open tag, given a definite indication of whether the colour should be
paler or darker.

report_altchangecolour is used to produce background colours which
change when the value in the cell changes.

I think it would be easy to replace bgcolour= with some appropriate
style= and some CSS.  Patches - even very rough ones - welcome.

> - And for caching perhaps a materialized view with aggregated data only
>   refreshed at a more convient time could perhaps help at the database
>   level ?

Maybe, but currently the archaeology algorithm is not expressed
entirely in SQL so it couldn't be a materialised view.  And converting
it to SQL would be annoying because SQL is a rather poor programming
language.

It might be possible to, instead, have table(s) containing archaeology
results.  I hadn't really properly considered that possibility.  That
might well have been a better approach.  So thank you for your helpful
prompt.  I will definitely bear this in mind for the future.

I'm not sure I feel like reengineering this particular series at this
time, though.  One reason (apart from that I've done it like this now)
is that the current approach has the advantage that it doesn't need a
DB schema change.  I have a system for doing schema changes but they
add risk and I don't want to do that in the Xen release freeze.

Regards,
Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history

Posted by Sander Eikelenboom 1 week ago
On 11/11/2019 12:00, Ian Jackson wrote:
> Sander Eikelenboom writes ("Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history"):
>> Not mend to bike shed, so just for consideration:
> 
> Suggestions are very welcome.  Be careful, I'm still looking for a
> co-maintainer :-).
/me is ducking under the table ;)
Seems to be quite a lot of intracate Perl, I never was a prince of Perl
and that hasn't got any better by not using it actively the past years.

>> - Have you considered (inline) css for the background colouring, or does
>>   it have to be html only  ?
> 
> There is no particular reason why it shouldn't be CSS.  Is there a
> reason why doing it in html causes problems for you ?

Not really, but especially applying style to alternating rows is now
quite simple with pseudo classes:

 tr:nth-child(even){
   background-color: grey;
 }

 tr:nth-child(even){
   background-color: white;
 }

You could stuff this in a <head><style> ... </style></head>,
so you don't have to repeat this every row for the common case.
For any special cases you could overrule based on class.
I happen to find it one of the most useful CSS features.

https://www.w3.org/wiki/CSS/Selectors/pseudo-classes/:nth-child

> The background colours for the cells are made with
>   report_altcolour
>   report_altchangecolour
> in Osstest/Executive.pm.
> 
> report_altcolour returns something that can be put into an element
> open tag, given a definite indication of whether the colour should be
> paler or darker.
> 
> report_altchangecolour is used to produce background colours which
> change when the value in the cell changes.
> 
> I think it would be easy to replace bgcolour= with some appropriate
> style= and some CSS.  Patches - even very rough ones - welcome.
> 
>> - And for caching perhaps a materialized view with aggregated data only
>>   refreshed at a more convient time could perhaps help at the database
>>   level ?
> 
> Maybe, but currently the archaeology algorithm is not expressed
> entirely in SQL so it couldn't be a materialised view.  And converting
> it to SQL would be annoying because SQL is a rather poor programming
> language.

It is a poor programming language, but it is very good at retrieving and
modifying data. Sometimes it takes some effort to wrap your head around
the way you have to specify what data you want and in what for, without
being to explicit in how it is supposed to be retrieved.

> It might be possible to, instead, have table(s) containing archaeology
> results.  I hadn't really properly considered that possibility.  That
> might well have been a better approach.  So thank you for your helpful
> prompt.  I will definitely bear this in mind for the future.

If I remember correctly Postgres is being used, perhaps there is stull
some relatively low hanging fruit when analyzing the performance of the
queries you run at the actual data.

> I'm not sure I feel like reengineering this particular series at this
> time, though.  One reason (apart from that I've done it like this now)
> is that the current approach has the advantage that it doesn't need a
> DB schema change.  I have a system for doing schema changes but they
> add risk and I don't want to do that in the Xen release freeze.

I understand, and I concur that that is probably the best at the moment.

I will take a look at the code somewhere this or next week and see if I
can get any familiarity with it and perhaps end up with some contributions.

--
Sander

> Regards,
> Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history

Posted by Ian Jackson 1 week ago
Sander Eikelenboom writes ("Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history"):
> /me is ducking under the table ;)
> Seems to be quite a lot of intracate Perl, I never was a prince of Perl
> and that hasn't got any better by not using it actively the past years.

Heh.  Although it's generally not supposed to be intricate.  I have
tried to keep it fairly straightforward.

> Not really, but especially applying style to alternating rows is now
> quite simple with pseudo classes:
> 
>  tr:nth-child(even){
>    background-color: grey;
>  }
> 
>  tr:nth-child(even){
>    background-color: white;
>  }
> 
> You could stuff this in a <head><style> ... </style></head>,
> so you don't have to repeat this every row for the common case.
> For any special cases you could overrule based on class.
> I happen to find it one of the most useful CSS features.

Interesting.  Mmm.  (Although your vignette above ought to have an
`odd' in it I think...)

> > Maybe, but currently the archaeology algorithm is not expressed
> > entirely in SQL so it couldn't be a materialised view.  And converting
> > it to SQL would be annoying because SQL is a rather poor programming
> > language.
> 
> It is a poor programming language, but it is very good at retrieving and
> modifying data. Sometimes it takes some effort to wrap your head around
> the way you have to specify what data you want and in what for, without
> being to explicit in how it is supposed to be retrieved.

Indeed so.

> If I remember correctly Postgres is being used, perhaps there is stull
> some relatively low hanging fruit when analyzing the performance of the
> queries you run at the actual data.

Yes.  One of the biggest problems is I really want to make an index on
runvar *values*.  But if I do that then routine runvar updates have to
update that index.  So what I want is a partial index but the rows of
runvars which are indexed ought to be controlled by the corresponding
row of the flights table.

I am considering denormalising this by including a `finalised' bit in
the runvars table.  But not now...

> I will take a look at the code somewhere this or next week and see if I
> can get any familiarity with it and perhaps end up with some contributions.

All contributions and suggestions are welcome.

Regards,
Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [OSSTEST PATCH 00/13] Speed up and restore host history

Posted by Ian Jackson 10 hours ago
Hi, I promised you to do a risk/benefit analysis on this series and
here is my report.  With your permission I plan to push it on Sunday
night or Monday morning, if you think that is a convenient time.


Summary:

There are three kinds of risk here:

* There is a nonneglible chance that these changes have a significant
  adverse performance impact on post-flight reporting, so that
  overall throughput is adversely affected.  I have tried to exclude
  it by both reasoning and testing but it remains a risk.

  I propose to deal with this risk by pushing the change to osstest
  pretest at the beginning of the week, so that when it makes it
  through the self-push gate I am around to monitor it.  I will check
  to see that it is DTRT, and, particularly, that the reporting is not
  overly slow.

* I expect a certain amount of additional delay during the
  transitional period, when some flights are using old code and some
  new code.

  I propose to deal with this issue by negotiating a good time to do
  this when we can afford to, effectively, lose a few hours'
  throughput.

* There is a pretty small chance that these changes breaks everything
  by causing all flights to crash during host reporting.

  This will be obvious, especially if I'm watching it all closely.
  If this happens it will need to be reverted.

If we decide this series is a problem, after it has gone into
production, we can simply revert it.  There is nothing else in the
osstest push gate right now.  The old code will still function and we
could confidently force push it.

The upside of this change is to undo a regression in our ability to
diagnose host problems.  Particularly, if a host has a low probability
or intermittent fault, we will want to be able to look further back
than the current ~200 jobs (not sure how long that is without looking
it up but it is only a few days I think, at least for some hosts).

Ian.


Patch-by-patch notes:


sg-report-host-history: Improve debugging output

This is just additional prints.  If they accidentally refer to wrong
variables, this would generate perl nonfatal warnings in debug mode
(which we do not use in production).


sg-report-host-history: New --no-install option for testing

By inspection and testing this code does nothing if the new option is
not passed.


sg-report-host-history: Move `computeflightsrange' after hosts

I double checked that global variables used and set by
computeflightsrange.  It uses and sets $flightlimit; nothing else uses
this and it is set by the option parser.  It uses $limit, which is
only set by the option parser.  It sets $minflight and $flightcond;
these are used only by mainquery, which still comes after
computeflightsrange.


sg-report-host-history: Actually honour $minflight

The effect of this is to limit the output from some of
sg-report-host-history's queries.  If this is wrong somehow the worst
case is that information would be missing from the host history
reports.  That information would be for flights earlier than a minimum
flight number, so it would be quite obvious.

In principle the code code have a bug which causes the queries to
fail, for example if the parameters or syntax are wrong.  But the new
syntax is unconditional and such a bug should therefore be spotted
during testing.


sg-report-host-history: Get job status from mainquery

This unconditionally joins the jobs table to the runvars table in the
`mainquery'.  (Unconditionality means the query syntax is right.)

The jobs table is much smaller.  A handful of empirical tests suggest
this change does not slow things down significantly.  It not
particularly likely, but it is possible that this will be different in
production.

The change to the $infoq is slightly confusing.  There is now a dummy
"AND ?!='X'" condition in the query.  Its purpose is to consume a
redundant job name argument which is not needed any more.  jobs are
never called X so this condition is always true.  Testing shows this
works.


sg-report-host-history: Add $cachekey argument to jobquery

This patch does nothing but add an unused argument.  Syntax errors and
missed call sites (even on non-taken paths) would be caught by perl.


sg-report-host-history: Store per-job query results in %$jr

This is quite complex.  It stores new data in a hash %$jr which is
about the size of the host history report.  Those host history reports
have limited size so we expect this to be OK from a performance point
of view.  If not, we would see slow sg-report-host-history processes
(see mitigation above).

In principle this code might cause perl errors and cause
sg-report-host-history to crash, maybe because of a wrong or undefined
reference.  But I have tested both the cache hit and cache miss cases.


sg-report-host-history: Write cache entries
sg-report-host-history: Write cache entries for tail, too

This dumps the data out to the HTML.  There is new fiddly quoting code
but it is largely unconditional so has been executed and tested, so it
will probably not crash entirely.  There remains a risk that the
quoting algorithm or something else is wrong and generates corrupted
HTML.  That would not be a crisis for us as users, but it might affect
the program's ability to read it in.  See the next section for that:


sg-report-host-history: Read cache entries

The biggest risk here is that the logfile parser which reads the cache
entries finds something it doesn't like and crashes, refusing to parse
it.

If this occurs it is because of strange data in the osstest database:
weird job names or something, which trigger quoting/unquoting bugs.
But this code has been manually tested on existing recent data.  So
existing data is good.  And we aren't making new changes to osstest.


sg-report-host-history: Move job runvars query later

This is fine because it just sets local (my) variables.  Perl would
notice if we had got things wrong.


sg-report-host-history: Cache runvar queries (power information)

This relies on the changes made so far and does not add significant
risks of its own.


Revert "sg-report-host-history: Reduce limit from 2000 to 200"

This is the purpose of the exercise.

The risk is that the changes are not sufficient to, in practice, give
adequate performance.  During the transition (while some jobs are
using new code and some old) there will be some delays as things are
needlessly regenerated, but afterwards all should be well.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel