Let fast poll support multishot mode; currently only accept is added as its
first consumer.
theoretical analysis:
1) when connections come in fast
- singleshot:
      add accept sqe(userspace) --> accept inline
                    ^                    |
                    |--------------------|
- multishot:
      add accept sqe(userspace) --> accept inline
                                   ^     |
                                   |--*--|

  we do accept repeatedly at the * place until we get EAGAIN
2) when connections come in at low pressure
   similar to 1), we save a lot of userspace-kernel context
   switches and useless vfs_poll() calls
tests:
Did some tests, which go like this:

    server                  client(multiple)
    accept                  connect
    read                    write
    write                   read
    close                   close
Basically, raise a number of clients (on the same machine as the server) to
connect to the server and write some data to it; the server writes that data
back to the client after receiving it, then closes the connection once the
write returns. The client then reads the data and closes the connection.
Here I test 10000 clients connecting to one server, data size 128 bytes.
Each client has its own goroutine, so they hit the server within a short
time.
Tested 20 times before/after this patchset; time spent (unit: cycles, the
return value of clock()):
before:
(1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
+1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
+1934226+1914385)/20.0 = 1927633.75
after:
(1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
+1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
+1871324+1940803)/20.0 = 1894750.45
(1927633.75 - 1894750.45) / 1927633.75 ≈ 1.71%
A liburing test is here:
https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c
v1->v2:
- re-implement it against the reworked poll code
Hao Xu (5):
io_uring: add IORING_ACCEPT_MULTISHOT for accept
io_uring: add REQ_F_APOLL_MULTISHOT for requests
io_uring: let fast poll support multishot
io_uring: add a helper for poll clean
io_uring: implement multishot mode for accept
fs/io_uring.c | 121 ++++++++++++++++++++++++++--------
include/uapi/linux/io_uring.h | 5 ++
2 files changed, 100 insertions(+), 26 deletions(-)
base-commit: f2e030dd7aaea5a937a2547dc980fab418fbc5e7
--
2.36.0
On 5/6/22 1:00 AM, Hao Xu wrote:
> [...]
> A liburing test is here:
> https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c

Wish I had seen that, I wrote my own! But maybe that's good, you tend to
find other issues through that.

Anyway, works for me in testing, and I can see this being a nice win for
accept intensive workloads. I pushed a bunch of cleanup patches that
should just get folded in. Can you fold them into your patches and
address the other feedback, and post a v3? I pushed the test branch
here:

https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot

-- 
Jens Axboe
On 5/6/22 4:23 PM, Jens Axboe wrote:
> [...]
> Anyway, works for me in testing, and I can see this being a nice win for
> accept intensive workloads. [...]
> https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot

Quick benchmark here, accepting 10k connections:

Stock kernel
real	0m0.728s
user	0m0.009s
sys	0m0.192s

Patched
real	0m0.684s
user	0m0.018s
sys	0m0.102s

Looks like a nice win for a highly synthetic benchmark. Nothing
scientific, was just curious.

-- 
Jens Axboe
On 5/6/22 5:26 PM, Jens Axboe wrote:
> [...]
> Quick benchmark here, accepting 10k connections:
> [...]
> Looks like a nice win for a highly synthetic benchmark. Nothing
> scientific, was just curious.

One more thought on this - how is it supposed to work with
accept-direct? One idea would be to make it incrementally increasing.
But we need a good story for that, if it's exclusive to non-direct
files, then it's a lot less interesting as the latter is really nice win
for lots of files. If we can combine the two, even better.

-- 
Jens Axboe
On 5/6/22 8:33 PM, Jens Axboe wrote:
> [...]
> One more thought on this - how is it supposed to work with
> accept-direct? [...] If we can combine the two, even better.

Running some quick testing, on an actual test box (previous numbers were
from a vm on my laptop):

Testing singleshot, normal files
Did 10000 accepts

________________________________________________________
Executed in  216.10 millis    fish           external
   usr time    9.32 millis  150.00 micros    9.17 millis
   sys time  110.06 millis   67.00 micros  109.99 millis

Testing multishot, fixed files
Did 10000 accepts

________________________________________________________
Executed in  189.04 millis    fish           external
   usr time   11.86 millis  159.00 micros   11.71 millis
   sys time   93.71 millis   70.00 micros   93.64 millis

That's about ~19 usec to accept a connection, pretty decent. Using
singleshot and with fixed files, it shaves about ~8% off, ends at around
200msec.

I think we can get away with using fixed files and multishot, attaching
the quick patch I did below to test it. We need something better than
this, otherwise once the space fills up, we'll likely end up with a
sparse space and the naive approach of just incrementing the next slot
won't work at all.

-- 
Jens Axboe
On 5/7/22 11:08 AM, Jens Axboe wrote:
> [...]
> I think we can get away with using fixed files and multishot, attaching

I'm not following, do you mean we shouldn't do the multishot+fixed file
or we should use multishot+fixed to make the result better?

> the quick patch I did below to test it. We need something better than

Sorry Jens, I didn't see the quick patch, is there anything I
misunderstand?

> this, otherwise once the space fills up, we'll likely end up with a
> sparse space and the naive approach of just incrementing the next slot
> won't work at all.
Hi All,
I actually had a question about the current poll code, from the code it
seems when we cancel a poll-like request, it will ignore the existing
events and just raise a -ECANCELED cqe though I haven't tested it. Is
this by design or am I missing something?

Regards,
Hao
On 5/6/22 1:36 AM, Hao Xu wrote:
> Hi All,
> I actually had a question about the current poll code, from the code it
> seems when we cancel a poll-like request, it will ignore the existing
> events and just raise a -ECANCELED cqe though I haven't tested it. Is
> this by design or am I missing something?

That's by design, but honestly I don't think anyone considered the case
where it's being canceled but has events already. For that case, I think
we should follow the usual logic of only returning an error (canceled)
if we don't have events, if we have events just return them. For
multi-shot, obviously also terminate, but same logic there.

Care to do a separate patch for that?

-- 
Jens Axboe
On 5/6/22 15:18, Jens Axboe wrote:
> On 5/6/22 1:36 AM, Hao Xu wrote:
>> [...]
>
> That's by design, but honestly I don't think anyone considered the case
> where it's being canceled but has events already. For that case, I think
> we should follow the usual logic of only returning an error (canceled)
> if we don't have events, if we have events just return them. For
> multi-shot, obviously also terminate, but same logic there.

Why would we care? It's inherently racy in any case and any
user not handling this is already screwed.

-- 
Pavel Begunkov
On 5/6/22 10:01 AM, Pavel Begunkov wrote:
> On 5/6/22 15:18, Jens Axboe wrote:
>> [...]
>> That's by design, but honestly I don't think anyone considered the case
>> where it's being canceled but has events already. [...]
>
> Why would we care? It's inherently racy in any case and any
> user not handling this is already screwed.

We don't really need to care, but the normal logic is "return error if
we have no result, return result otherwise". No reason this should be
any different imho, it's not really a correctness thing.

-- 
Jens Axboe