[PATCH v2 0/5] fast poll multishot mode

Hao Xu posted 5 patches 4 years ago
There is a newer version of this series
fs/io_uring.c                 | 121 ++++++++++++++++++++++++++--------
include/uapi/linux/io_uring.h |   5 ++
2 files changed, 100 insertions(+), 26 deletions(-)
[PATCH v2 0/5] fast poll multishot mode
Posted by Hao Xu 4 years ago
Let multishot support multishot mode, currently only add accept as its
first comsumer.
theoretical analysis:
  1) when connections come in fast
    - singleshot:
              add accept sqe(userpsace) --> accept inline
                              ^                 |
                              |-----------------|
    - multishot:
             add accept sqe(userspace) --> accept inline
                                              ^     |
                                              |--*--|

    we do accept repeatedly in * place until get EAGAIN

  2) when connections come in at a low pressure
    similar thing like 1), we reduce a lot of userspace-kernel context
    switch and useless vfs_poll()


tests:
Did some tests, which goes in this way:

  server    client(multiple)
  accept    connect
  read      write
  write     read
  close     close

Basically, raise up a number of clients(on same machine with server) to
connect to the server, and then write some data to it, the server will
write those data back to the client after it receives them, and then
close the connection after write return. Then the client will read the
data and then close the connection. Here I test 10000 clients connect
one server, data size 128 bytes. And each client has a go routine for
it, so they come to the server in short time.
test 20 times before/after this patchset, time spent:(unit cycle, which
is the return value of clock())
before:
  1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
  +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
  +1934226+1914385)/20.0 = 1927633.75
after:
  1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
  +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
  +1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 = 1.65%


A liburing test is here:
https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c

v1->v2:
 - re-implement it against the reworked poll code

Hao Xu (5):
  io_uring: add IORING_ACCEPT_MULTISHOT for accept
  io_uring: add REQ_F_APOLL_MULTISHOT for requests
  io_uring: let fast poll support multishot
  io_uring: add a helper for poll clean
  io_uring: implement multishot mode for accept

 fs/io_uring.c                 | 121 ++++++++++++++++++++++++++--------
 include/uapi/linux/io_uring.h |   5 ++
 2 files changed, 100 insertions(+), 26 deletions(-)


base-commit: f2e030dd7aaea5a937a2547dc980fab418fbc5e7
-- 
2.36.0
Re: [PATCH v2 0/5] fast poll multishot mode
Posted by Jens Axboe 4 years ago
On 5/6/22 1:00 AM, Hao Xu wrote:
> Let multishot support multishot mode, currently only add accept as its
> first comsumer.
> theoretical analysis:
>   1) when connections come in fast
>     - singleshot:
>               add accept sqe(userpsace) --> accept inline
>                               ^                 |
>                               |-----------------|
>     - multishot:
>              add accept sqe(userspace) --> accept inline
>                                               ^     |
>                                               |--*--|
> 
>     we do accept repeatedly in * place until get EAGAIN
> 
>   2) when connections come in at a low pressure
>     similar thing like 1), we reduce a lot of userspace-kernel context
>     switch and useless vfs_poll()
> 
> 
> tests:
> Did some tests, which goes in this way:
> 
>   server    client(multiple)
>   accept    connect
>   read      write
>   write     read
>   close     close
> 
> Basically, raise up a number of clients(on same machine with server) to
> connect to the server, and then write some data to it, the server will
> write those data back to the client after it receives them, and then
> close the connection after write return. Then the client will read the
> data and then close the connection. Here I test 10000 clients connect
> one server, data size 128 bytes. And each client has a go routine for
> it, so they come to the server in short time.
> test 20 times before/after this patchset, time spent:(unit cycle, which
> is the return value of clock())
> before:
>   1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
>   +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
>   +1934226+1914385)/20.0 = 1927633.75
> after:
>   1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
>   +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
>   +1871324+1940803)/20.0 = 1894750.45
> 
> (1927633.75 - 1894750.45) / 1927633.75 = 1.65%
> 
> 
> A liburing test is here:
> https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c

Wish I had seen that, I wrote my own! But maybe that's good, you tend to
find other issues through that.

Anyway, works for me in testing, and I can see this being a nice win for
accept intensive workloads. I pushed a bunch of cleanup patches that
should just get folded in. Can you fold them into your patches and
address the other feedback, and post a v3? I pushed the test branch
here:

https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot

-- 
Jens Axboe
Re: [PATCH v2 0/5] fast poll multishot mode
Posted by Jens Axboe 4 years ago
On 5/6/22 4:23 PM, Jens Axboe wrote:
> On 5/6/22 1:00 AM, Hao Xu wrote:
>> Let multishot support multishot mode, currently only add accept as its
>> first comsumer.
>> theoretical analysis:
>>   1) when connections come in fast
>>     - singleshot:
>>               add accept sqe(userpsace) --> accept inline
>>                               ^                 |
>>                               |-----------------|
>>     - multishot:
>>              add accept sqe(userspace) --> accept inline
>>                                               ^     |
>>                                               |--*--|
>>
>>     we do accept repeatedly in * place until get EAGAIN
>>
>>   2) when connections come in at a low pressure
>>     similar thing like 1), we reduce a lot of userspace-kernel context
>>     switch and useless vfs_poll()
>>
>>
>> tests:
>> Did some tests, which goes in this way:
>>
>>   server    client(multiple)
>>   accept    connect
>>   read      write
>>   write     read
>>   close     close
>>
>> Basically, raise up a number of clients(on same machine with server) to
>> connect to the server, and then write some data to it, the server will
>> write those data back to the client after it receives them, and then
>> close the connection after write return. Then the client will read the
>> data and then close the connection. Here I test 10000 clients connect
>> one server, data size 128 bytes. And each client has a go routine for
>> it, so they come to the server in short time.
>> test 20 times before/after this patchset, time spent:(unit cycle, which
>> is the return value of clock())
>> before:
>>   1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
>>   +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
>>   +1934226+1914385)/20.0 = 1927633.75
>> after:
>>   1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
>>   +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
>>   +1871324+1940803)/20.0 = 1894750.45
>>
>> (1927633.75 - 1894750.45) / 1927633.75 = 1.65%
>>
>>
>> A liburing test is here:
>> https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c
> 
> Wish I had seen that, I wrote my own! But maybe that's good, you tend to
> find other issues through that.
> 
> Anyway, works for me in testing, and I can see this being a nice win for
> accept intensive workloads. I pushed a bunch of cleanup patches that
> should just get folded in. Can you fold them into your patches and
> address the other feedback, and post a v3? I pushed the test branch
> here:
> 
> https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot

Quick benchmark here, accepting 10k connections:

Stock kernel
real	0m0.728s
user	0m0.009s
sys	0m0.192s

Patched
real	0m0.684s
user	0m0.018s
sys	0m0.102s

Looks like a nice win for a highly synthetic benchmark. Nothing
scientific, was just curious.

-- 
Jens Axboe
Re: [PATCH v2 0/5] fast poll multishot mode
Posted by Jens Axboe 4 years ago
On 5/6/22 5:26 PM, Jens Axboe wrote:
> On 5/6/22 4:23 PM, Jens Axboe wrote:
>> On 5/6/22 1:00 AM, Hao Xu wrote:
>>> Let multishot support multishot mode, currently only add accept as its
>>> first comsumer.
>>> theoretical analysis:
>>>   1) when connections come in fast
>>>     - singleshot:
>>>               add accept sqe(userpsace) --> accept inline
>>>                               ^                 |
>>>                               |-----------------|
>>>     - multishot:
>>>              add accept sqe(userspace) --> accept inline
>>>                                               ^     |
>>>                                               |--*--|
>>>
>>>     we do accept repeatedly in * place until get EAGAIN
>>>
>>>   2) when connections come in at a low pressure
>>>     similar thing like 1), we reduce a lot of userspace-kernel context
>>>     switch and useless vfs_poll()
>>>
>>>
>>> tests:
>>> Did some tests, which goes in this way:
>>>
>>>   server    client(multiple)
>>>   accept    connect
>>>   read      write
>>>   write     read
>>>   close     close
>>>
>>> Basically, raise up a number of clients(on same machine with server) to
>>> connect to the server, and then write some data to it, the server will
>>> write those data back to the client after it receives them, and then
>>> close the connection after write return. Then the client will read the
>>> data and then close the connection. Here I test 10000 clients connect
>>> one server, data size 128 bytes. And each client has a go routine for
>>> it, so they come to the server in short time.
>>> test 20 times before/after this patchset, time spent:(unit cycle, which
>>> is the return value of clock())
>>> before:
>>>   1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
>>>   +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
>>>   +1934226+1914385)/20.0 = 1927633.75
>>> after:
>>>   1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
>>>   +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
>>>   +1871324+1940803)/20.0 = 1894750.45
>>>
>>> (1927633.75 - 1894750.45) / 1927633.75 = 1.65%
>>>
>>>
>>> A liburing test is here:
>>> https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c
>>
>> Wish I had seen that, I wrote my own! But maybe that's good, you tend to
>> find other issues through that.
>>
>> Anyway, works for me in testing, and I can see this being a nice win for
>> accept intensive workloads. I pushed a bunch of cleanup patches that
>> should just get folded in. Can you fold them into your patches and
>> address the other feedback, and post a v3? I pushed the test branch
>> here:
>>
>> https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot
> 
> Quick benchmark here, accepting 10k connections:
> 
> Stock kernel
> real	0m0.728s
> user	0m0.009s
> sys	0m0.192s
> 
> Patched
> real	0m0.684s
> user	0m0.018s
> sys	0m0.102s
> 
> Looks like a nice win for a highly synthetic benchmark. Nothing
> scientific, was just curious.

One more thought on this - how is it supposed to work with
accept-direct? One idea would be to make it incrementally increasing.
But we need a good story for that, if it's exclusive to non-direct
files, then it's a lot less interesting as the latter is really nice win
for lots of files. If we can combine the two, even better.

-- 
Jens Axboe
Re: [PATCH v2 0/5] fast poll multishot mode
Posted by Jens Axboe 4 years ago
On 5/6/22 8:33 PM, Jens Axboe wrote:
> On 5/6/22 5:26 PM, Jens Axboe wrote:
>> On 5/6/22 4:23 PM, Jens Axboe wrote:
>>> On 5/6/22 1:00 AM, Hao Xu wrote:
>>>> Let multishot support multishot mode, currently only add accept as its
>>>> first comsumer.
>>>> theoretical analysis:
>>>>   1) when connections come in fast
>>>>     - singleshot:
>>>>               add accept sqe(userpsace) --> accept inline
>>>>                               ^                 |
>>>>                               |-----------------|
>>>>     - multishot:
>>>>              add accept sqe(userspace) --> accept inline
>>>>                                               ^     |
>>>>                                               |--*--|
>>>>
>>>>     we do accept repeatedly in * place until get EAGAIN
>>>>
>>>>   2) when connections come in at a low pressure
>>>>     similar thing like 1), we reduce a lot of userspace-kernel context
>>>>     switch and useless vfs_poll()
>>>>
>>>>
>>>> tests:
>>>> Did some tests, which goes in this way:
>>>>
>>>>   server    client(multiple)
>>>>   accept    connect
>>>>   read      write
>>>>   write     read
>>>>   close     close
>>>>
>>>> Basically, raise up a number of clients(on same machine with server) to
>>>> connect to the server, and then write some data to it, the server will
>>>> write those data back to the client after it receives them, and then
>>>> close the connection after write return. Then the client will read the
>>>> data and then close the connection. Here I test 10000 clients connect
>>>> one server, data size 128 bytes. And each client has a go routine for
>>>> it, so they come to the server in short time.
>>>> test 20 times before/after this patchset, time spent:(unit cycle, which
>>>> is the return value of clock())
>>>> before:
>>>>   1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
>>>>   +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
>>>>   +1934226+1914385)/20.0 = 1927633.75
>>>> after:
>>>>   1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
>>>>   +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
>>>>   +1871324+1940803)/20.0 = 1894750.45
>>>>
>>>> (1927633.75 - 1894750.45) / 1927633.75 = 1.65%
>>>>
>>>>
>>>> A liburing test is here:
>>>> https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c
>>>
>>> Wish I had seen that, I wrote my own! But maybe that's good, you tend to
>>> find other issues through that.
>>>
>>> Anyway, works for me in testing, and I can see this being a nice win for
>>> accept intensive workloads. I pushed a bunch of cleanup patches that
>>> should just get folded in. Can you fold them into your patches and
>>> address the other feedback, and post a v3? I pushed the test branch
>>> here:
>>>
>>> https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot
>>
>> Quick benchmark here, accepting 10k connections:
>>
>> Stock kernel
>> real	0m0.728s
>> user	0m0.009s
>> sys	0m0.192s
>>
>> Patched
>> real	0m0.684s
>> user	0m0.018s
>> sys	0m0.102s
>>
>> Looks like a nice win for a highly synthetic benchmark. Nothing
>> scientific, was just curious.
> 
> One more thought on this - how is it supposed to work with
> accept-direct? One idea would be to make it incrementally increasing.
> But we need a good story for that, if it's exclusive to non-direct
> files, then it's a lot less interesting as the latter is really nice win
> for lots of files. If we can combine the two, even better.

Running some quick testing, on an actual test box (previous numbers were
from a vm on my laptop):

Testing singleshot, normal files
Did 10000 accepts

________________________________________________________
Executed in  216.10 millis    fish           external
   usr time    9.32 millis  150.00 micros    9.17 millis
   sys time  110.06 millis   67.00 micros  109.99 millis

Testing multishot, fixed files
Did 10000 accepts

________________________________________________________
Executed in  189.04 millis    fish           external
   usr time   11.86 millis  159.00 micros   11.71 millis
   sys time   93.71 millis   70.00 micros   93.64 millis

That's about ~19 usec to accept a connection, pretty decent. Using
singleshot and with fixed files, it shaves about ~8% off, ends at around
200msec.

I think we can get away with using fixed files and multishot, attaching
the quick patch I did below to test it. We need something better than
this, otherwise once the space fills up, we'll likely end up with a
sparse space and the naive approach of just incrementing the next slot
won't work at all.

-- 
Jens Axboe
Re: [PATCH v2 0/5] fast poll multishot mode
Posted by Hao Xu 4 years ago
在 2022/5/7 上午11:08, Jens Axboe 写道:
> On 5/6/22 8:33 PM, Jens Axboe wrote:
>> On 5/6/22 5:26 PM, Jens Axboe wrote:
>>> On 5/6/22 4:23 PM, Jens Axboe wrote:
>>>> On 5/6/22 1:00 AM, Hao Xu wrote:
>>>>> Let multishot support multishot mode, currently only add accept as its
>>>>> first comsumer.
>>>>> theoretical analysis:
>>>>>    1) when connections come in fast
>>>>>      - singleshot:
>>>>>                add accept sqe(userpsace) --> accept inline
>>>>>                                ^                 |
>>>>>                                |-----------------|
>>>>>      - multishot:
>>>>>               add accept sqe(userspace) --> accept inline
>>>>>                                                ^     |
>>>>>                                                |--*--|
>>>>>
>>>>>      we do accept repeatedly in * place until get EAGAIN
>>>>>
>>>>>    2) when connections come in at a low pressure
>>>>>      similar thing like 1), we reduce a lot of userspace-kernel context
>>>>>      switch and useless vfs_poll()
>>>>>
>>>>>
>>>>> tests:
>>>>> Did some tests, which goes in this way:
>>>>>
>>>>>    server    client(multiple)
>>>>>    accept    connect
>>>>>    read      write
>>>>>    write     read
>>>>>    close     close
>>>>>
>>>>> Basically, raise up a number of clients(on same machine with server) to
>>>>> connect to the server, and then write some data to it, the server will
>>>>> write those data back to the client after it receives them, and then
>>>>> close the connection after write return. Then the client will read the
>>>>> data and then close the connection. Here I test 10000 clients connect
>>>>> one server, data size 128 bytes. And each client has a go routine for
>>>>> it, so they come to the server in short time.
>>>>> test 20 times before/after this patchset, time spent:(unit cycle, which
>>>>> is the return value of clock())
>>>>> before:
>>>>>    1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
>>>>>    +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
>>>>>    +1934226+1914385)/20.0 = 1927633.75
>>>>> after:
>>>>>    1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
>>>>>    +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
>>>>>    +1871324+1940803)/20.0 = 1894750.45
>>>>>
>>>>> (1927633.75 - 1894750.45) / 1927633.75 = 1.65%
>>>>>
>>>>>
>>>>> A liburing test is here:
>>>>> https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c
>>>>
>>>> Wish I had seen that, I wrote my own! But maybe that's good, you tend to
>>>> find other issues through that.
>>>>
>>>> Anyway, works for me in testing, and I can see this being a nice win for
>>>> accept intensive workloads. I pushed a bunch of cleanup patches that
>>>> should just get folded in. Can you fold them into your patches and
>>>> address the other feedback, and post a v3? I pushed the test branch
>>>> here:
>>>>
>>>> https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot
>>>
>>> Quick benchmark here, accepting 10k connections:
>>>
>>> Stock kernel
>>> real	0m0.728s
>>> user	0m0.009s
>>> sys	0m0.192s
>>>
>>> Patched
>>> real	0m0.684s
>>> user	0m0.018s
>>> sys	0m0.102s
>>>
>>> Looks like a nice win for a highly synthetic benchmark. Nothing
>>> scientific, was just curious.
>>
>> One more thought on this - how is it supposed to work with
>> accept-direct? One idea would be to make it incrementally increasing.
>> But we need a good story for that, if it's exclusive to non-direct
>> files, then it's a lot less interesting as the latter is really nice win
>> for lots of files. If we can combine the two, even better.
> 
> Running some quick testing, on an actual test box (previous numbers were
> from a vm on my laptop):
> 
> Testing singleshot, normal files
> Did 10000 accepts
> 
> ________________________________________________________
> Executed in  216.10 millis    fish           external
>     usr time    9.32 millis  150.00 micros    9.17 millis
>     sys time  110.06 millis   67.00 micros  109.99 millis
> 
> Testing multishot, fixed files
> Did 10000 accepts
> 
> ________________________________________________________
> Executed in  189.04 millis    fish           external
>     usr time   11.86 millis  159.00 micros   11.71 millis
>     sys time   93.71 millis   70.00 micros   93.64 millis
> 
> That's about ~19 usec to accept a connection, pretty decent. Using
> singleshot and with fixed files, it shaves about ~8% off, ends at around
> 200msec.
> 
> I think we can get away with using fixed files and multishot, attaching
I'm not following, do you mean we shouldn't do the multishot+fixed file
or we should use multishot+fixed to make the result better?
> the quick patch I did below to test it. We need something better than
Sorry Jens, I didn't see the quick patch, is there anything I misunderstand?
> this, otherwise once the space fills up, we'll likely end up with a
> sparse space and the naive approach of just incrementing the next slot
> won't work at all.

> 

Re: [PATCH v2 0/5] fast poll multishot mode
Posted by Hao Xu 4 years ago
Hi All,
I actually had a question about the current poll code, from the code it
seems when we cancel a poll-like request, it will ignore the existing
events and just raise a -ECANCELED cqe though I haven't tested it. Is
this by design or am I missing something?

Regards,
Hao
Re: [PATCH v2 0/5] fast poll multishot mode
Posted by Jens Axboe 4 years ago
On 5/6/22 1:36 AM, Hao Xu wrote:
> Hi All,
> I actually had a question about the current poll code, from the code it
> seems when we cancel a poll-like request, it will ignore the existing
> events and just raise a -ECANCELED cqe though I haven't tested it. Is
> this by design or am I missing something?

That's by design, but honestly I don't think anyone considered the case
where it's being canceled but has events already. For that case, I think
we should follow the usual logic of only returning an error (canceled)
if we don't have events, if we have events just return them. For
multi-shot, obviously also terminate, but same logic there.

Care to do a separate patch for that?

-- 
Jens Axboe
Re: [PATCH v2 0/5] fast poll multishot mode
Posted by Pavel Begunkov 4 years ago
On 5/6/22 15:18, Jens Axboe wrote:
> On 5/6/22 1:36 AM, Hao Xu wrote:
>> Hi All,
>> I actually had a question about the current poll code, from the code it
>> seems when we cancel a poll-like request, it will ignore the existing
>> events and just raise a -ECANCELED cqe though I haven't tested it. Is
>> this by design or am I missing something?
> 
> That's by design, but honestly I don't think anyone considered the case
> where it's being canceled but has events already. For that case, I think
> we should follow the usual logic of only returning an error (canceled)
> if we don't have events, if we have events just return them. For
> multi-shot, obviously also terminate, but same logic there.

Why would we care? It's inherently racy in any case and any
user not handling this is already screwed.

-- 
Pavel Begunkov
Re: [PATCH v2 0/5] fast poll multishot mode
Posted by Jens Axboe 4 years ago
On 5/6/22 10:01 AM, Pavel Begunkov wrote:
> On 5/6/22 15:18, Jens Axboe wrote:
>> On 5/6/22 1:36 AM, Hao Xu wrote:
>>> Hi All,
>>> I actually had a question about the current poll code, from the code it
>>> seems when we cancel a poll-like request, it will ignore the existing
>>> events and just raise a -ECANCELED cqe though I haven't tested it. Is
>>> this by design or am I missing something?
>>
>> That's by design, but honestly I don't think anyone considered the case
>> where it's being canceled but has events already. For that case, I think
>> we should follow the usual logic of only returning an error (canceled)
>> if we don't have events, if we have events just return them. For
>> multi-shot, obviously also terminate, but same logic there.
> 
> Why would we care? It's inherently racy in any case and any
> user not handling this is already screwed.

We don't really need to care, but the normal logic is "return error if
we have no result, return result otherwise". No reason this should be
any different imho, it's not really a correctness thing.

-- 
Jens Axboe