[v1] thread-pool: fix performance regression

[PATCH 0/2] thread-pool: fix performance regression

Posted by Paolo Bonzini 3 years, 9 months ago

Together, these two patches fix the performance regression induced by
QemuSemaphore; individually they don't though.

6.2:
   iops        : min=58051, max=62260, avg=60282.57, stdev=1081.18, samples=30
    clat percentiles (usec):   1.00th=[  490],   99.99th=[  775]
   iops        : min=59401, max=61290, avg=60651.27, stdev=468.24, samples=30
    clat percentiles (usec):   1.00th=[  490],   99.99th=[  717]
   iops        : min=59583, max=60816, avg=60353.43, stdev=282.69, samples=30
    clat percentiles (usec):   1.00th=[  490],   99.99th=[  701]
   iops        : min=58099, max=60713, avg=59739.53, stdev=755.49, samples=30
    clat percentiles (usec):   1.00th=[  494],   99.99th=[  717]


patched:
   iops        : min=60616, max=62522, avg=61654.37, stdev=555.67, samples=30
    clat percentiles (usec):   1.00th=[  474],   99.99th=[ 1303]
   iops        : min=61841, max=63600, avg=62878.47, stdev=442.40, samples=30
    clat percentiles (usec):   1.00th=[  465],   99.99th=[  685]
   iops        : min=62976, max=63910, avg=63531.60, stdev=261.05, samples=30
    clat percentiles (usec):   1.00th=[  461],   99.99th=[  693]
   iops        : min=60803, max=63623, avg=62653.37, stdev=808.76, samples=30
    clat percentiles (usec):   1.00th=[  465],   99.99th=[  685]

Paolo

Supersedes: <20220505131346.823941-1-pbonzini@redhat.com>

Paolo Bonzini (2):
  thread-pool: optimize scheduling of completion bottom half
  thread-pool: replace semaphore with condition variable

 util/thread-pool.c | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

-- 
2.31.1

Re: [PATCH 0/2] thread-pool: fix performance regression

Posted by Lukáš Doktor 3 years, 9 months ago

Hello Paolo, folks, I gave it a try (on top of the f9fc8932) and it's better than the f9fc8932, better than the previous patch by Stefan, but still I'm not reaching the performance of d7482ffe97 (before the f9fc8932 commit):

f9f    |  0.0 | -2.8 |  0.6
stefan | -3.1 | -1.2 | -2.2
paolo  |  5.3 |  5.4 |  7.1
d74    |  7.2 |  9.1 |  8.2

Anyway it's definitely closer to the previous baseline (~-2%). Note I have not tried other scenarios, just the 4K nbd writes on rotational disk. I'll try running more throughout the night.

Regards,
Lukáš

Dne 06. 05. 22 v 13:47 Paolo Bonzini napsal(a):
> Together, these two patches fix the performance regression induced by
> QemuSemaphore; individually they don't though.
> 
> 6.2:
>    iops        : min=58051, max=62260, avg=60282.57, stdev=1081.18, samples=30
>     clat percentiles (usec):   1.00th=[  490],   99.99th=[  775]
>    iops        : min=59401, max=61290, avg=60651.27, stdev=468.24, samples=30
>     clat percentiles (usec):   1.00th=[  490],   99.99th=[  717]
>    iops        : min=59583, max=60816, avg=60353.43, stdev=282.69, samples=30
>     clat percentiles (usec):   1.00th=[  490],   99.99th=[  701]
>    iops        : min=58099, max=60713, avg=59739.53, stdev=755.49, samples=30
>     clat percentiles (usec):   1.00th=[  494],   99.99th=[  717]
> 
> 
> patched:
>    iops        : min=60616, max=62522, avg=61654.37, stdev=555.67, samples=30
>     clat percentiles (usec):   1.00th=[  474],   99.99th=[ 1303]
>    iops        : min=61841, max=63600, avg=62878.47, stdev=442.40, samples=30
>     clat percentiles (usec):   1.00th=[  465],   99.99th=[  685]
>    iops        : min=62976, max=63910, avg=63531.60, stdev=261.05, samples=30
>     clat percentiles (usec):   1.00th=[  461],   99.99th=[  693]
>    iops        : min=60803, max=63623, avg=62653.37, stdev=808.76, samples=30
>     clat percentiles (usec):   1.00th=[  465],   99.99th=[  685]
> 
> Paolo
> 
> Supersedes: <20220505131346.823941-1-pbonzini@redhat.com>
> 
> Paolo Bonzini (2):
>   thread-pool: optimize scheduling of completion bottom half
>   thread-pool: replace semaphore with condition variable
> 
>  util/thread-pool.c | 34 +++++++++++++---------------------
>  1 file changed, 13 insertions(+), 21 deletions(-)
>

Re: [PATCH 0/2] thread-pool: fix performance regression

Posted by Lukáš Doktor 3 years, 9 months ago

Dne 06. 05. 22 v 20:55 Lukáš Doktor napsal(a):
> Hello Paolo, folks, I gave it a try (on top of the f9fc8932) and it's better than the f9fc8932, better than the previous patch by Stefan, but still I'm not reaching the performance of d7482ffe97 (before the f9fc8932 commit):
> 
> f9f    |  0.0 | -2.8 |  0.6
> stefan | -3.1 | -1.2 | -2.2
> paolo  |  5.3 |  5.4 |  7.1
> d74    |  7.2 |  9.1 |  8.2
> 
> Anyway it's definitely closer to the previous baseline (~-2%). Note I have not tried other scenarios, just the 4K nbd writes on rotational disk. I'll try running more throughout the night.
> 

I tried a couple of iterations of fio-nbd 4/64/256KB read/writes on a rotational disk and overall the latest fix results in a steady 2.5% throughput regression for the 4KiB writes. The remaining tested variants performed similarly. Please let me know if you want me to test the fio execution inside the guest as well or some other variants.

Regards,
Lukáš

Re: [PATCH 0/2] thread-pool: fix performance regression

Posted by Paolo Bonzini 3 years, 9 months ago

On 5/9/22 08:42, Lukáš Doktor wrote:
> Dne 06. 05. 22 v 20:55 Lukáš Doktor napsal(a):
>> Hello Paolo, folks, I gave it a try (on top of the f9fc8932) and
>> it's better than the f9fc8932, better than the previous patch by
>> Stefan, but still I'm not reaching the performance of d7482ffe97
>> (before the f9fc8932 commit):
>> 
>> f9f    |  0.0 | -2.8 |  0.6 stefan | -3.1 | -1.2 | -2.2 paolo  |
>> 5.3 |  5.4 |  7.1 d74    |  7.2 |  9.1 |  8.2
>> 
>> Anyway it's definitely closer to the previous baseline (~-2%). Note
>> I have not tried other scenarios, just the 4K nbd writes on
>> rotational disk. I'll try running more throughout the night.
>> 
> 
> I tried a couple of iterations of fio-nbd 4/64/256KB read/writes on a
> rotational disk and overall the latest fix results in a steady 2.5%
> throughput regression for the 4KiB writes. The remaining tested
> variants performed similarly. Please let me know if you want me to
> test the fio execution inside the guest as well or some other
> variants.

Considering we have conflicting results (I get a 2-3% improvement over 
6.2), and that in general aio=native/aio=io_uring is preferred, I think 
we can proceed with these patches at least for now.

Paolo