Hi Al,
I've incorporated your suggestions into Jens' patches and simplified a
lot of the control flow. Could you take a look at these and let me know
if it looks sane? In particular, I'm using the property you mentioned in
which copy_{to,from}_iter() can take a maximum and do less if the
remaining length is too small.
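To make that property concrete, here is a minimal sketch (not the queued
patch; fill_block() is a hypothetical 64-byte generator) of a ->read_iter()
implementation that hands a full block to copy_to_iter() and lets the iter
machinery truncate to whatever room remains:

static ssize_t example_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
	u8 block[64];
	size_t copied;
	ssize_t total = 0;

	while (iov_iter_count(iter)) {
		fill_block(block);	/* hypothetical 64-byte generator */
		/*
		 * copy_to_iter() copies at most
		 * min(sizeof(block), iov_iter_count(iter)) bytes and
		 * returns how many it actually copied, so no explicit
		 * clamping of the length is needed here.
		 */
		copied = copy_to_iter(block, sizeof(block), iter);
		if (!copied)
			return total ? total : -EFAULT;
		total += copied;
	}
	return total;
}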
Jason
Jens Axboe (3):
random: convert to using fops->read_iter()
random: convert to using fops->write_iter()
random: wire up fops->splice_{read,write}_iter()
drivers/char/random.c | 126 +++++++++++++++++++++---------------------
1 file changed, 63 insertions(+), 63 deletions(-)
--
2.35.1
Hi Jens,
On Fri, May 20, 2022 at 11:44:56AM +0200, Jason A. Donenfeld wrote:
> Jens Axboe (3):
> random: convert to using fops->read_iter()
> random: convert to using fops->write_iter()
> random: wire up fops->splice_{read,write}_iter()
FYI, this series makes reads from /dev/urandom slower, from around 616
MiB/s to 598 MiB/s on my system. That seems rather unfortunate.
Are we sure we really want to do this and need to do this?
Jason
On 5/20/22 9:25 AM, Jason A. Donenfeld wrote:
> Hi Jens,
>
> On Fri, May 20, 2022 at 11:44:56AM +0200, Jason A. Donenfeld wrote:
>> Jens Axboe (3):
>> random: convert to using fops->read_iter()
>> random: convert to using fops->write_iter()
>> random: wire up fops->splice_{read,write}_iter()
>
> FYI, this series makes reads from /dev/urandom slower, from around 616
> MiB/s to 598 MiB/s on my system. That seems rather unfortunate.
How reproducible is that? That seems like a huge difference for the
change. How big are the reads?
> Are we sure we really want to do this and need to do this?
I'm very sure, otherwise we're just accepting that we're breaking real
world applications. Alternatively, you can keep the ->read() and add the
->read_iter() as well, but I'm a bit skeptical of your initial results
there.
--
Jens Axboe
On Fri, May 20, 2022 at 09:34:46AM -0600, Jens Axboe wrote:
> I'm very sure, otherwise we're just accepting that we're breaking real
> world applications.

"Breaking" as in "it used to work with earlier kernels, doesn't work with
recent ones"? Details, please...
On 5/20/22 9:47 AM, Al Viro wrote:
> On Fri, May 20, 2022 at 09:34:46AM -0600, Jens Axboe wrote:
>
>> I'm very sure, otherwise we're just accepting that we're breaking real
>> world applications.
>
> "Breaking" as in "it used to work with earlier kernels, doesn't work with
> recent ones"? Details, please...
Yes, as in exactly that. This is what drove this addition of
->read_iter() for urandom. See commit:
commit 36e2c7421f02a22f71c9283e55fdb672a9eb58e7
Author: Christoph Hellwig <hch@lst.de>
Date: Thu Sep 3 16:22:34 2020 +0200
fs: don't allow splice read/write without explicit ops
related to the set_fs() changes, and now go look for any commit that
has:
Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops")
in it and see that this isn't an isolated incident at all.
tldr - splice from /dev/urandom used to work, and I recently got a
report internally on an application that broke on upgrade from 5.6 to
5.12, exactly because it now just -EINVAL's instead.
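(For illustration, a minimal userspace reproducer under the assumption of
an affected kernel: the splice() call below fails with EINVAL instead of
transferring bytes.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int pipefd[2];
	int fd = open("/dev/urandom", O_RDONLY);

	if (fd < 0 || pipe(pipefd) < 0) {
		perror("setup");
		return 1;
	}
	/* Fails with EINVAL on kernels lacking ->splice_read for urandom. */
	ssize_t n = splice(fd, NULL, pipefd[1], NULL, 4096, 0);
	if (n < 0)
		perror("splice");
	else
		printf("spliced %zd bytes\n", n);
	return 0;
}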
--
Jens Axboe
On Fri, May 20, 2022 at 09:53:30AM -0600, Jens Axboe wrote:
> On 5/20/22 9:47 AM, Al Viro wrote:
> > On Fri, May 20, 2022 at 09:34:46AM -0600, Jens Axboe wrote:
> >
> >> I'm very sure, otherwise we're just accepting that we're breaking real
> >> world applications.
> >
> > "Breaking" as in "it used to work with earlier kernels, doesn't work with
> > recent ones"? Details, please...
>
> Yes, as in exactly that. This is what drove this addition of
> ->read_iter() for urandom. See commit:
>
> commit 36e2c7421f02a22f71c9283e55fdb672a9eb58e7
> Author: Christoph Hellwig <hch@lst.de>
> Date: Thu Sep 3 16:22:34 2020 +0200
>
> fs: don't allow splice read/write without explicit ops
>
> related to the set_fs() changes, and now go look for any commit that
> has:
>
> Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops")
>
> in it and see that this isn't an isolated incident at all.
>
> tldr - splice from /dev/urandom used to work, and I recently got a
> report internally on an application that broke on upgrade from 5.6 to
> 5.12, exactly because it now just -EINVAL's instead.
IIRC, Linus' position at the time had been along the lines of
"splice is not so good ABI anyway, so let's do it and fix up
the places that do get real-world complaints once such appear".
So /dev/urandom is one such place...
On 5/20/22 10:15 AM, Al Viro wrote:
> On Fri, May 20, 2022 at 09:53:30AM -0600, Jens Axboe wrote:
>> On 5/20/22 9:47 AM, Al Viro wrote:
>>> On Fri, May 20, 2022 at 09:34:46AM -0600, Jens Axboe wrote:
>>>
>>>> I'm very sure, otherwise we're just accepting that we're breaking real
>>>> world applications.
>>>
>>> "Breaking" as in "it used to work with earlier kernels, doesn't work with
>>> recent ones"? Details, please...
>>
>> Yes, as in exactly that. This is what drove this addition of
>> ->read_iter() for urandom. See commit:
>>
>> commit 36e2c7421f02a22f71c9283e55fdb672a9eb58e7
>> Author: Christoph Hellwig <hch@lst.de>
>> Date: Thu Sep 3 16:22:34 2020 +0200
>>
>> fs: don't allow splice read/write without explicit ops
>>
>> related to the set_fs() changes, and now go look for any commit that
>> has:
>>
>> Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops")
>>
>> in it and see that this isn't an isolated incident at all.
>>
>> tldr - splice from /dev/urandom used to work, and I recently got a
>> report internally on an application that broke on upgrade from 5.6 to
>> 5.12, exactly because it now just -EINVAL's instead.
>
> IIRC, Linus' position at the time had been along the lines of
> "splice is not so good ABI anyway, so let's do it and fix up
> the places that do get real-world complaints once such appear".
> So /dev/urandom is one such place...
That's what Christoph said too. Honestly that's a very odd way to
attempt to justify breakage like this, even if it is tempting to
facilitate the set_fs() removal. But then be honest about it and say
it like it is, rather than some hand wavy explanation that frankly
doesn't make any sense.
The referenced change doesn't change the splice ABI at all, hence the
justification seems very random to me. It kept what we already have,
except we randomly break some use cases.
--
Jens Axboe
Hi Jens,
On Fri, May 20, 2022 at 10:24:36AM -0600, Jens Axboe wrote:
> On 5/20/22 10:15 AM, Al Viro wrote:
> > IIRC, Linus' position at the time had been along the lines of
> > "splice is not so good ABI anyway, so let's do it and fix up
> > the places that do get real-world complaints once such appear".
> > So /dev/urandom is one such place...
>
> That's what Christoph said too. Honestly that's a very odd way to
> attempt to justify breakage like this, even if it is tempting to
> facilitate the set_fs() removal. But then be honest about it and say
> it like it is, rather than some hand wavy explanation that frankly
> doesn't make any sense.
>
> The referenced change doesn't change the splice ABI at all, hence the
> justification seems very random to me. It kept what we already have,
> except we randomly break some use cases.
It looks like Al is right in the sense that Linus must certainly be
aware of the breakage. He fixed tty in 9bb48c82aced ("tty: implement
write_iter").
Anyway, it seems like the iter functions are the way forward, so this v4
is queued up now (with a few minor cosmetic changes) and pushed to:
https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git/log/
I'll send an early 5.19 pull for everything either tonight or Sunday.
And then next week I'll start on backports. (Though, 5.12 is a weird
kernel version; I assume this is some Meta kernel that has its own
backport team?)
Meanwhile, hopefully Al can pick up the splice.c sendfile(2) chardev
patch for 5.19, so at least there'll be some silver lining to the
performance hit.
Jason
On 5/20/22 10:39 AM, Jason A. Donenfeld wrote:
> Hi Jens,
>
> On Fri, May 20, 2022 at 10:24:36AM -0600, Jens Axboe wrote:
>> On 5/20/22 10:15 AM, Al Viro wrote:
>>> IIRC, Linus' position at the time had been along the lines of
>>> "splice is not so good ABI anyway, so let's do it and fix up
>>> the places that do get real-world complaints once such appear".
>>> So /dev/urandom is one such place...
>>
>> That's what Christoph said too. Honestly that's a very odd way to
>> attempt to justify breakage like this, even if it is tempting to
>> facilitate the set_fs() removal. But then be honest about it and say
>> it like it is, rather than some hand wavy explanation that frankly
>> doesn't make any sense.
>>
>> The referenced change doesn't change the splice ABI at all, hence the
>> justification seems very random to me. It kept what we already have,
>> except we randomly break some use cases.
>
> It looks like Al is right in the sense that Linus must certainly be
> aware of the breakage. He fixed tty in 9bb48c82aced ("tty: implement
> write_iter").
I don't think anyone is disputing that, but I also know that Linus wants
these fixed up as they are discovered. And I agree with him on that,
even if I disagree on the process to get there as it introduced
frivolous breakage...
> Anyway, it seems like the iter functions are the way forward, so this v4
> is queued up now (with a few minor cosmetic changes) and pushed to:
> https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git/log/
> I'll send an early 5.19 pull for everything either tonight or Sunday.
> And then next week I'll start on backports. (Though, 5.12 is a weird
> kernel version; I assume this is some Meta kernel that has its own
> backport team?)
Thanks!
> Meanwhile, hopefully Al can pick up the splice.c sendfile(2) chardev
> patch for 5.19, so at least there'll be some silver lining to the
> performance hit.
Let's hope so.
--
Jens Axboe
Jens Axboe <axboe@kernel.dk> writes:
> On 5/20/22 10:39 AM, Jason A. Donenfeld wrote:
>> Hi Jens,
>>
>> On Fri, May 20, 2022 at 10:24:36AM -0600, Jens Axboe wrote:
>>> On 5/20/22 10:15 AM, Al Viro wrote:
>>>> IIRC, Linus' position at the time had been along the lines of
>>>> "splice is not so good ABI anyway, so let's do it and fix up
>>>> the places that do get real-world complaints once such appear".
>>>> So /dev/urandom is one such place...
>>>
>>> That's what Christoph said too. Honestly that's a very odd way to
>>> attempt to justify breakage like this, even if it is tempting to
>>> facilitate the set_fs() removal. But then be honest about it and say
>>> it like it is, rather than some hand wavy explanation that frankly
>>> doesn't make any sense.
>>>
>>> The referenced change doesn't change the splice ABI at all, hence the
>>> justification seems very random to me. It kept what we already have,
>>> except we randomly break some use cases.
>>
>> It looks like Al is right in the sense that Linus must certainly be
>> aware of the breakage. He fixed tty in 9bb48c82aced ("tty: implement
>> write_iter").
>
> I don't think anyone is disputing that, but I also know that Linus wants
> these fixed up as they are discovered. And I agree with him on that,
> even if I disagree on the process to get there as it introduced
> frivolous breakage...
I believe the hypothesis at the time was that splice was not used on many
of these interfaces, so breaking it there would not matter.
With everything that has since needed fixing in those places, the
hypothesis turned out to be wrong.
The rule is no regressions, not bug compatibility forever. This allows
things like removing a.out binary support, and many other things that
are just dead code these days.
Sometimes the only way to discover what has users is to remove support
and wait a while and see if anyone complains. Sometimes doing that
is worth it, other times it is not.
My general sense is that there have been few enough reports that users
who care about splice support have been few and far between. Which
suggests the hypothesis was not an unreasonable one when Linus made it.
The truly unfortunate part is that no one knew enough about these users
to be able to step up and say that they care about splice on those
interfaces at the start of this process.
Eric
Hi Jens,

On Fri, May 20, 2022 at 09:34:46AM -0600, Jens Axboe wrote:
> On 5/20/22 9:25 AM, Jason A. Donenfeld wrote:
> > Are we sure we really want to do this and need to do this?
>
> I'm very sure, otherwise we're just accepting that we're breaking real
> world applications.

Would we really? I always thought splice() and copy_file_range() and
sendfile() were all kind of "special" in that they mostly do not work
for many things, and so all userspaces need fallback code. And the state
of "mostly not working" has always just been the norm. So broken today,
working tomorrow, broken next week would be par for the course? I might
be *super* wrong here, so feel free to say so, but this has been my
general impression.

Anyway, I do like the idea of supporting splice() and sendfile(). The
performance hit is just kind of sad.

Jason
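(A sketch of the kind of userspace fallback code mentioned above;
copy_chunk() is a hypothetical helper that tries splice() first and falls
back to plain read()+write() when splice is unsupported.)

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Copy up to len bytes from in_fd to a pipe at out_fd, preferring
 * splice() and falling back to read()+write() on EINVAL. */
static ssize_t copy_chunk(int in_fd, int out_fd, size_t len)
{
	char buf[4096];
	ssize_t n = splice(in_fd, NULL, out_fd, NULL, len, 0);

	if (n >= 0 || errno != EINVAL)
		return n;
	n = read(in_fd, buf, len < sizeof(buf) ? len : sizeof(buf));
	if (n > 0)
		n = write(out_fd, buf, n);
	return n;
}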
On 5/20/22 9:46 AM, Jason A. Donenfeld wrote:
> Hi Jens,
>
> On Fri, May 20, 2022 at 09:34:46AM -0600, Jens Axboe wrote:
>> On 5/20/22 9:25 AM, Jason A. Donenfeld wrote:
>>> Are we sure we really want to do this and need to do this?
>>
>> I'm very sure, otherwise we're just accepting that we're breaking real
>> world applications.
>
> Would we really? I always thought splice() and copy_file_range() and
> sendfile() were all kind of "special" in that they mostly do not work
> for many things, and so all userspaces need fallback code. And the state
> of "mostly not working" has always just been the norm. So broken today,
> working tomorrow, broken next week would be par for the course? I might
> be *super* wrong here, so feel free to say so, but this has been my
> general impression.

No, I think that is exactly the wrong impression. If you have an
application that is written using eg splice from /dev/urandom, then it
should indeed be safe to expect that it will continue working. If we
have one core tenet in the kernel, it's that you should ALWAYS be able
to upgrade your kernel and not have any breakage in terms of userspace
ABI. Obviously that can happen sometimes, but I think this one is
exactly the poster child of breakage that should NOT happen. We took
away a feature that someone depended on.

This situation of splice being flakily available is new from when it was
decided that it was OK to only have it be available if ->read_iter() and
->splice_read() is set. And that is very unfortunate.

> Anyway, I do like the idea of supporting splice() and sendfile(). The
> performance hit is just kind of sad.

I like the idea too, but this is deeper than that. We simply cannot just
break existing use cases like that. Thankfully we do have an option here,
as I outlined in the previous email - keep the ->read() and just add
->read_iter() on the side so that splice still works. Is that ideal? No,
it needs more code to support. But hopefully that can die at some point
when the performance gap is such that we no longer need to worry about
having ->read() for those cases.

--
Jens Axboe
Hi Jens,

On Fri, May 20, 2022 at 09:51:06AM -0600, Jens Axboe wrote:
> ABI. Obviously that can happen sometimes, but I think this one is
> exactly the poster child of breakage that should NOT happen. We took
> away a feature that someone depended on.

I suppose somebody (Meta, I presume) *did* notice the breakage, which is
sign enough.

Jason
On 5/20/22 9:58 AM, Jason A. Donenfeld wrote:
> Hi Jens,
>
> On Fri, May 20, 2022 at 09:51:06AM -0600, Jens Axboe wrote:
>> ABI. Obviously that can happen sometimes, but I think this one is
>> exactly the poster child of breakage that should NOT happen. We took
>> away a feature that someone depended on.
>
> I suppose somebody (Meta, I presume) *did* notice the breakage, which is
> sign enough.

Indeed, this is how I found out. And then you have to wonder how often
this has happened elsewhere where nobody bothered to report it, or how
many breakages are still waiting to happen because people are still on
an older kernel and just haven't upgraded yet. Or upgraded and rolled
back, perhaps. None of the latter really matters; it's eventual breakage
there, and very real breakage right now for the first case.

--
Jens Axboe
Hi Jens,
On Fri, May 20, 2022 at 09:34:46AM -0600, Jens Axboe wrote:
> On 5/20/22 9:25 AM, Jason A. Donenfeld wrote:
> > Hi Jens,
> >
> > On Fri, May 20, 2022 at 11:44:56AM +0200, Jason A. Donenfeld wrote:
> >> Jens Axboe (3):
> >> random: convert to using fops->read_iter()
> >> random: convert to using fops->write_iter()
> >> random: wire up fops->splice_{read,write}_iter()
> >
> > FYI, this series makes reads from /dev/urandom slower, from around 616
> > MiB/s to 598 MiB/s on my system. That seems rather unfortunate.
>
> How reproducible is that? That seems like a huge difference for the
> change. How big are the reads?
Fairly reproducible. Actually, if anything, it reproduces consistently
with worst results; I chose the most favorable ones for the new code.
This isn't any fancy `perf` profiling, but just running:
$ pv /dev/urandom > /dev/null
From looking at strace, the read size appears to be 131072.
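(For anyone wanting to reproduce without pv, a rough standalone probe
under the same assumed conditions; the 131072-byte buffer mirrors the
read size observed above. This is a sketch, not a rigorous benchmark.)

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	static char buf[131072];	/* matches the read size pv uses */
	int fd = open("/dev/urandom", O_RDONLY);
	struct timespec t0, t1;
	size_t total = 0;

	if (fd < 0)
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (total < (1UL << 30)) {	/* read 1 GiB total */
		ssize_t n = read(fd, buf, sizeof(buf));
		if (n <= 0)
			return 1;
		total += n;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%.1f MiB/s\n", total / ((1 << 20) *
	       ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9)));
	return 0;
}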
Jason
On 5/20/22 9:39 AM, Jason A. Donenfeld wrote:
> Hi Jens,
>
> On Fri, May 20, 2022 at 09:34:46AM -0600, Jens Axboe wrote:
>> On 5/20/22 9:25 AM, Jason A. Donenfeld wrote:
>>> Hi Jens,
>>>
>>> On Fri, May 20, 2022 at 11:44:56AM +0200, Jason A. Donenfeld wrote:
>>>> Jens Axboe (3):
>>>> random: convert to using fops->read_iter()
>>>> random: convert to using fops->write_iter()
>>>> random: wire up fops->splice_{read,write}_iter()
>>>
>>> FYI, this series makes reads from /dev/urandom slower, from around 616
>>> MiB/s to 598 MiB/s on my system. That seems rather unfortunate.
>>
>> How reproducible is that? That seems like a huge difference for the
>> change. How big are the reads?
>
> Fairly reproducible. Actually, if anything, it reproduces consistently
> with worst results; I chose the most favorable ones for the new code.
> This isn't any fancy `perf` profiling, but just running:
>
> $ pv /dev/urandom > /dev/null
>
> From looking at strace, the read size appears to be 131072.
Ran 32, 1k, 4k here and it does seem to be down about 3%. Which is
definitely bigger than I expected, particularly for larger reads. If
anything, the 32b read seems comparably better than eg 1k or 4k, which
is also unexpected. Let me do a bit of profiling to see what is up.
If you're worried about it, I'd just keep the read/write and add the
iter variants on the side.
--
Jens Axboe
Hi Jens,

On Fri, May 20, 2022 at 09:44:25AM -0600, Jens Axboe wrote:
> Ran 32, 1k, 4k here and it does seem to be down about 3%. Which is
> definitely bigger than I expected, particularly for larger reads. If
> anything, the 32b read seems comparably better than eg 1k or 4k, which
> is also unexpected. Let me do a bit of profiling to see what is up.

Something to keep in mind wrt 32b is that for complicated crypto
reasons, the function has this logic:

- If len <= 32, generate one 64 byte block and give <= 32 bytes of it to
  the caller.

- If len > 32, generate one 64 byte block, but give 0 of it to the
  caller. Then generate ⌈len/64⌉ blocks for the caller.

Put together, this means:

- 1..32, 1 block
- 33..64, 2 blocks
- 65..128, 3 blocks
- 129..192, 4 blocks

So you get this sort of shelf where the amortization benefits don't
really kick in until after 3 blocks.

> If you're worried about it, I'd just keep the read/write and add the
> iter variants on the side.

Not a chance of that. These functions are already finicky as-is; I would
really hate to have to duplicate all of these paths.

Jason
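(The shelf described in the message above can be written down directly.
A hypothetical helper, not the driver's actual code; DIV_ROUND_UP is the
usual kernel macro from <linux/math.h>.)

/* Total 64-byte blocks generated for a read of len bytes, per the
 * rule above: one block (partially used) for len <= 32, otherwise
 * one discarded block plus ceil(len/64) blocks for the caller. */
static size_t blocks_generated(size_t len)
{
	if (len <= 32)
		return 1;			/* partial block to caller */
	return 1 + DIV_ROUND_UP(len, 64);	/* first block discarded */
}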
On 5/20/22 9:55 AM, Jason A. Donenfeld wrote:
> Hi Jens,
>
> On Fri, May 20, 2022 at 09:44:25AM -0600, Jens Axboe wrote:
>> Ran 32, 1k, 4k here and it does seem to be down about 3%. Which is
>> definitely bigger than I expected, particularly for larger reads. If
>> anything, the 32b read seems comparably better than eg 1k or 4k, which
>> is also unexpected. Let me do a bit of profiling to see what is up.
>
> Something to keep in mind wrt 32b is that for complicated crypto
> reasons, the function has this logic:
>
> - If len <= 32, generate one 64 byte block and give <= 32 bytes of it to
>   the caller.
>
> - If len > 32, generate one 64 byte block, but give 0 of it to the
>   caller. Then generate ⌈len/64⌉ blocks for the caller.
>
> Put together, this means:
>
> - 1..32, 1 block
> - 33..64, 2 blocks
> - 65..128, 3 blocks
> - 129..192, 4 blocks
>
> So you get this sort of shelf where the amortization benefits don't
> really kick in until after 3 blocks.

Ah I see, I can see if 64b is closer to the change for eg 1k.

>> If you're worried about it, I'd just keep the read/write and add the
>> iter variants on the side.
>
> Not a chance of that. These functions are already finicky as-is; I would
> really hate to have to duplicate all of these paths.

Then I'd say there are only two options:

- Add a helper that provides splice for something that only has
  read/write set.

- Just accept that we're 3% slower reading from /dev/urandom for now,
  and maybe 1-2% for small reads. Can't really imagine this being a huge
  issue; how many high throughput /dev/urandom read situations exist in
  the real world?

--
Jens Axboe
Hi Jens,

On Fri, May 20, 2022 at 09:58:28AM -0600, Jens Axboe wrote:
> On 5/20/22 9:55 AM, Jason A. Donenfeld wrote:
> > Hi Jens,
> >
> > On Fri, May 20, 2022 at 09:44:25AM -0600, Jens Axboe wrote:
> >> Ran 32, 1k, 4k here and it does seem to be down about 3%. Which is
> >> definitely bigger than I expected, particularly for larger reads. If
> >> anything, the 32b read seems comparably better than eg 1k or 4k, which
> >> is also unexpected. Let me do a bit of profiling to see what is up.
> >
> > Something to keep in mind wrt 32b is that for complicated crypto
> > reasons, the function has this logic:
> >
> > - If len <= 32, generate one 64 byte block and give <= 32 bytes of it
> >   to the caller.
> >
> > - If len > 32, generate one 64 byte block, but give 0 of it to the
> >   caller. Then generate ⌈len/64⌉ blocks for the caller.
> >
> > Put together, this means:
> >
> > - 1..32, 1 block
> > - 33..64, 2 blocks
> > - 65..128, 3 blocks
> > - 129..192, 4 blocks
> >
> > So you get this sort of shelf where the amortization benefits don't
> > really kick in until after 3 blocks.
>
> Ah I see, I can see if 64b is closer to the change for eg 1k.

What I meant by providing all that detail is that from a cycles-per-byte
perspective, smaller = more expensive. So it's possible that the
difference from the patchset is less visible as it gets lost in the more
expensive operation.

> >> If you're worried about it, I'd just keep the read/write and add the
> >> iter variants on the side.
> >
> > Not a chance of that. These functions are already finicky as-is; I
> > would really hate to have to duplicate all of these paths.
>
> Then I'd say there are only two options:
>
> - Add a helper that provides splice for something that only has
>   read/write set.

That'd be fine with me, but wouldn't it involve bringing back set_fs(),
because of the copy_to_user() in there?

> - Just accept that we're 3% slower reading from /dev/urandom for now,
>   and maybe 1-2% for small reads. Can't really imagine this being a huge
>   issue; how many high throughput /dev/urandom read situations exist in
>   the real world?

An option three might be that eventually the VFS overhead is worked out
and read_iter() reaches parity. One can hope, I guess.

Jason
On 5/20/22 10:03 AM, Jason A. Donenfeld wrote:
>> Then I'd say there are only two options:
>>
>> - Add a helper that provides splice for something that only has
>>   read/write set.
>
> That'd be fine with me, but wouldn't it involve bringing back set_fs(),
> because of the copy_to_user() in there?

I haven't even looked into whether it's currently feasible or not, just
mentioning it as a potential option. But the better one is definitely
the next one...

>> - Just accept that we're 3% slower reading from /dev/urandom for now,
>>   and maybe 1-2% for small reads. Can't really imagine this being a huge
>>   issue; how many high throughput /dev/urandom read situations exist in
>>   the real world?
>
> An option three might be that eventually the VFS overhead is worked out
> and read_iter() reaches parity. One can hope, I guess.

And that will certainly happen, especially as we have other paths that
don't really have the choice; they have to use the iterator versions.
And if we can get a bit closer, that also opens the door to more generic
conversions so we can kill ->read/->write for almost all cases (except
those weirdo ones that Al pointed out).

--
Jens Axboe
On 5/20/22 3:44 AM, Jason A. Donenfeld wrote:
> Hi Al,
>
> I've incorporated your suggestions into Jens' patches and simplified a
> lot of the control flow. Could you take a look at these and let me know
> if it looks sane? In particular, I'm using the property you mentioned in
> which copy_{to,from}_iter() can take a maximum and do less if the
> remaining length is too small.
FWIW, ran my (small) tests and this series looks and tests fine to me.
--
Jens Axboe