deconflicting new syscall numbers for 6.11

public inbox for linux-kernel@vger.kernel.org

* deconflicting new syscall numbers for 6.11
@ 2024-07-04 17:10 Jason A. Donenfeld
  2024-07-04 17:21 ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-04 17:10 UTC (permalink / raw)
  To: jolsa, mhiramat, cgzones, brauner; +Cc: linux-kernel, torvalds, arnd

Hi Christian, Jiri,

The three of us all have new syscalls planned for 6.11. Arnd suggested
that we coordinate to deconflict, to make the merge easier.

Would you mind if I take 463?
Maybe Jiri can take 464?
And then Christian can take 465-onward for those several syscalls?

Does that work?

Alternatively, we can all take 463 and let Linus work it out when
merging. I don't know what the norm is or what he'd prefer.

Regards,
Jason


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 17:10 deconflicting new syscall numbers for 6.11 Jason A. Donenfeld
@ 2024-07-04 17:21 ` Linus Torvalds
  2024-07-04 17:33   ` Linus Torvalds
                     ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Linus Torvalds @ 2024-07-04 17:21 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Thu, 4 Jul 2024 at 10:10, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> The three of us all have new syscalls planned for 6.11. Arnd suggested
> that we coordinate to deconflict, to make the merge easier.

Nobody has explained to me what has changed since your last vdso
getrandom, and I'm not planning on pulling it unless that fundamental
flaw is fixed.

Why is this _so_ critical that it needs a vdso?

Why isn't user space just doing it itself?

What's so magical about this all?

This all seems entirely pointless to me still, because it's optimizing
something that nobody seems to care about, adding new VM
infrastructure, new magic system calls, yadda yadda.

I was very sceptical last time, and absolutely _nothing_ has changed.
Not a peep on why it's now suddenly so hugely important again.

We don't add stuff "just because we can". We need to have a damn good
reason for it. And I still don't see the reason, and I haven't seen
anybody even trying to explain the reason.

              Linus


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 17:21 ` Linus Torvalds
@ 2024-07-04 17:33   ` Linus Torvalds
  2024-07-04 17:47     ` Linus Torvalds
  2024-07-04 17:46   ` Jason A. Donenfeld
  2024-07-06  1:14   ` Mathieu Desnoyers
  2 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-04 17:33 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Thu, 4 Jul 2024 at 10:21, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> We don't add stuff "just because we can". We need to have a damn good
> reason for it. And I still don't see the reason, and I haven't seen
> anybody even trying to explain the reason.

IOW, I want to see actual *users* piping up and saying "this is a
problem, here's my real load that spends 10% of time on getrandom(),
and this fixes it".

I'm not AT ALL interested in microbenchmarks or theoretical "if users
need high-performance random numbers".

I need a real actual live user that says "I can't just use rdrand and
my own chacha mixing on top" and explains why having a SSE2 chachacha
in kernel code exposed as a vdso is so critical, and a magical buffer
maintained by the kernel.

                Linus


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 17:33   ` Linus Torvalds
@ 2024-07-04 17:47     ` Linus Torvalds
  2024-07-04 17:51       ` Jason A. Donenfeld
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-04 17:47 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Thu, 4 Jul 2024 at 10:33, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I need a real actual live user that says "I can't just use rdrand and
> my own chacha mixing on top" and explains why having a SSE2 chachacha
> in kernel code exposed as a vdso is so critical, and a magical buffer
> maintained by the kernel.

One final note: the reason I'm so negative about this all is that the
random number subsystem has such an absolutely _horrendous_ history of
two main conflicting issues: people wanting reasonable usable random
numbers on one side, and then the people that discuss what the word
"entropy" means on the other side.

And honestly, I don't want the kernel stuck even *more* in the middle
of that morass. I strongly suspect that one reason why glibc people
would want this is the exact same reason: _they_ don't want to be
stuck in the same padded room with the crazies _either_, so they love
the concept of "somebody else's problem".

So no. I do not think "libc people want this" is an argument at all
for the kernel doing it. Quite the reverse. It's a "pass the hot
potato" thing. Which is why I really really want those real users
standing up and saying "we can't use rdrand and rdtsc and our own
mixing".

                Linus


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 17:47     ` Linus Torvalds
@ 2024-07-04 17:51       ` Jason A. Donenfeld
  0 siblings, 0 replies; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-04 17:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi Linus,

On Thu, Jul 4, 2024 at 7:47 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> One final note: the reason I'm so negative about this all is that the
> random number subsystem has such an absolutely _horrendous_ history of
> two main conflicting issues: people wanting reasonable usable random
> numbers on one side, and then the people that discuss what the word
> "entropy" means on the other side.

Yes. My entire goal since the beginning has been to clean up the filth
and insanity that's emerged from this. And there's a real userspace
side of filth too that's not going to be solved without this.

> And honestly, I don't want the kernel stuck even *more* in the middle
> of that morass.

Certainly I am not bringing us anywhere near that morass. I'm the one
who's been diligently trying to dig us out of it!

> I strongly suspect that one reason why glibc people
> would want this is the exact same reason: _they_ don't want to be
> stuck in the same padded room with the crazies _either_, so they love
> the concept of "somebody else's problem".

On the contrary, the glibc people were busy doing something grotesque
and incomplete, when I paused things so that I could do it properly
where it belongs.

> potato" thing. Which is why I really really want those real users
> standing up and saying "we can't use rdrand and rdtsc and our own
> mixing".

The point is that the people trying to "use rdrand and rdtsc and our
own mixing" are in for a world of pain, will come to a solution that
isn't complete and will fall over catastrophically in some
circumstances, and proliferates the problem.

Jason


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 17:21 ` Linus Torvalds
  2024-07-04 17:33   ` Linus Torvalds
@ 2024-07-04 17:46   ` Jason A. Donenfeld
  2024-07-04 17:55     ` Linus Torvalds
  2024-07-06  1:14   ` Mathieu Desnoyers
  2 siblings, 1 reply; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-04 17:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi Linus,

On Thu, Jul 04, 2024 at 10:21:34AM -0700, Linus Torvalds wrote:
> On Thu, 4 Jul 2024 at 10:10, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > The three of us all have new syscalls planned for 6.11. Arnd suggested
> > that we coordinate to deconflict, to make the merge easier.
> 
> Nobody has explained to me what has changed since your last vdso
> getrandom, and I'm not planning on pulling it unless that fundamental
> flaw is fixed.

Oh. That's an unpleasant surprise. I've been hard at work on bringing
everything up to snuff. That's pretty much been my sole focus.

Changes since the last time I worked on this are explained in large at
the top of this:

https://lore.kernel.org/lkml/20240703183115.1075219-1-Jason@zx2c4.com/

The big issue before was that the mm additions were too insane. I've
pared those down and made them really minimal. Then the mm people piped
up and it became even more minimal. Now I think it's pretty alright.

But I think, perhaps evidently barring you, the use case of this in the
first place and need for it is well understood and appreciated at large
by now. So to answer that,

> Why is this _so_ critical that it needs a vdso?
> 
> Why isn't user space just doing it itself?
> 
> What's so magical about this all?
> 
> This all seems entirely pointless to me still, because it's optimizing
> something that nobody seems to care about
>
> IOW, I want to see actual *users* piping up and saying "this is a
> problem, here's my real load that spends 10% of time on getrandom(),
> and this fixes it".
>
> I'm not AT ALL interested in microbenchmarks or theoretical "if users
> need high-performance random numbers".
>
> I need a real actual live user that says "I can't just use rdrand and
> my own chacha mixing on top" and explains why having a SSE2 chachacha
> in kernel code exposed as a vdso is so critical, and a magical buffer
> maintained by the kernel.

As far as speed goes, there are many legitimate applications that cannot
make a syscall every time. TLS nonces and keys come to mind as a huge
one. "Make getrandom() fast enough that the TLS library can use it" is
something that's come up over and over. There's now also arc4random() in
glibc, whose addition is what sparked this whole patchset two years ago.
That's not a micro benchmark thing either. I too don't really care for
microbenchmarks with the random driver. But I do want it to be actually
useable, so that people use it, because it is the best facility for the
task. With regards to why VDSO, the cover letter lays that out in
detail. Userspace does not have access to the information in a timely
manner that the kernel does, and the particulars of the kernel's
accounting are bound to change, especially as all this matures with VMs.
The RNG in the vDSO needs to be tightly coupled with the RNG in the
kernel; these are part of the same thing.

Anyway, those actual users exist, and the partial solutions and hacks
required to work around this shortcoming are kind of grotesque and in one
way or another bad. This isn't theoretical. I'm not working on this for
"fun".

Jason


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 17:46   ` Jason A. Donenfeld
@ 2024-07-04 17:55     ` Linus Torvalds
  2024-07-04 18:04       ` Jason A. Donenfeld
  2024-07-04 18:44       ` Willy Tarreau
  0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2024-07-04 17:55 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Thu, 4 Jul 2024 at 10:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> As far as speed goes, there are many legitimate applications that cannot
> make a syscall every time.

This is not an argument.

Nobody suggested a system call each time.

What I talked about, and suggested, was rdrand and user-space mixing.
The system call would be an "initialize the pool" thing with possibly
some re-seeding occasionally.

> Anyway, those actual users exist, and the partial solutions and hacks
> required to work around this shortcoming are kind of grotesque and in one
> way or another bad. This isn't theoretical. I'm not working on this for
> "fun".

Once again: I don't want to hear "users exist".

I want to hear *from* those users. Because I would have expected all
those users to already have perfectly working setups in place already.

A trivial google for "rdrand library" finds lots of hits for things
that then use the AES-NI instructions to whiten things etc.

And several of them mention OS X and Windows in addition to Linux. So
those things are at least partly portable.

And no, I'm *NOT* interested in catering to the crazies that say "we
can't mix in the TSC values and do rdrand, because we don't trust
those". That's literally the kind of people I want to avoid like the
plague, and WHY I don't want more of this in the kernel.

Because sane users don't say that. Sane users say "every round, we mix
in the TSC, and every X rounds we do rdrand, and every 100*X rounds we
do rdseed, and that means that the end result is not really
predictable even if you've started from the same virtual machine
image".

And sane users presumably ALREADY HAVE THIS.

                  Linus
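
The schedule described above (TSC every round, rdrand every X rounds,
rdseed every 100*X rounds) could be sketched roughly as follows. This is
only an illustration under stated assumptions: the struct, the callback
type, and the counting stubs are hypothetical names, the stubs stand in
for the hardware sources so the sketch stays portable, and toy_mix() is a
splitmix64-style placeholder rather than a cryptographic mixer such as
the chacha or AES-NI whitening mentioned in the thread.

```c
#include <stdint.h>

/* Hypothetical callback type for an entropy source. On x86 these could
 * wrap __rdtsc(), _rdrand64_step() and _rdseed64_step(). */
typedef uint64_t (*entropy_fn)(void);

struct mix_state {
    uint64_t pool;   /* toy 64-bit pool, NOT a real CSPRNG state */
    uint64_t round;  /* number of rounds completed so far */
    uint64_t x;      /* the "X" interval from the text; must be nonzero */
    entropy_fn tsc, rdrand, rdseed;
};

/* Placeholder mixer (splitmix64 finalizer). Not cryptographic. */
static uint64_t toy_mix(uint64_t pool, uint64_t input)
{
    uint64_t z = pool ^ (input + 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

/* One round of the schedule: cheap source always, costlier ones rarely. */
static uint64_t next_round(struct mix_state *s)
{
    s->pool = toy_mix(s->pool, s->tsc());         /* every round */
    if (s->round % s->x == 0)
        s->pool = toy_mix(s->pool, s->rdrand());  /* every X rounds */
    if (s->round % (100 * s->x) == 0)
        s->pool = toy_mix(s->pool, s->rdseed());  /* every 100*X rounds */
    s->round++;
    return s->pool;  /* a real design would not expose its state directly */
}

/* Counting stubs standing in for the hardware sources (assumptions). */
static unsigned tsc_calls, rdrand_calls, rdseed_calls;
static uint64_t stub_tsc(void)    { return ++tsc_calls; }
static uint64_t stub_rdrand(void) { return 0x1111u + ++rdrand_calls; }
static uint64_t stub_rdseed(void) { return 0x2222u + ++rdseed_calls; }
```

A caller would fill in a struct mix_state with an interval and the three
callbacks, then call next_round() once per output word. Note that nothing
in this loop notices a VM clone or resume, which is the gap the rest of
the thread argues about.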


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 17:55     ` Linus Torvalds
@ 2024-07-04 18:04       ` Jason A. Donenfeld
  2024-07-04 18:18         ` Linus Torvalds
  2024-07-04 18:44       ` Willy Tarreau
  1 sibling, 1 reply; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-04 18:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi Linus,

On Thu, Jul 4, 2024 at 7:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, 4 Jul 2024 at 10:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > As far as speed goes, there are many legitimate applications that cannot
> > make a syscall every time.
>
> This is not an argument.
>
> Nobody suggested a system call each time.

Well, that's currently the only way to get random numbers that are
sure to be fresh and not, for example, cloned or resumed in a VM.

> What I talked about, and suggested, was rdrand and user-space mixing.
> The system call would be an "initialize the pool" thing with possibly
> some re-seeding occasionally.

And this does not work well at all. The question is "when to reseed?"
and only the kernel is in a position to reliably know this in a
race-free manner.

> > Anyway, those actual users exist, and the partial solutions and hacks
> > required to work around this shortcoming are kind of grotesque and in one
> > way or another bad. This isn't theoretical. I'm not working on this for
> > "fun".
>
> Once again: I don't want to hear "users exist".
>
> I want to hear *from* those users. Because I would have expected all
> those users to already have perfectly working setups in place already.

What do you want me to do here? Every time somebody talks to me about
this, tell them, "hey would you talk to Linus about this?" and then,
"omg you want me to send Linus an email?!" Library authors wish they
could call getrandom() for their needs, yet they cannot, and are
forced to invent incomplete solutions. Coupling kernel RNG semantics
to userspace RNG semantics is not even a new idea; Microsoft heard
from their customers, for example, and made things work. (Maybe
hearing "Microsoft ..." will turn you off even more? I don't know.
This solution isn't like theirs and is nicer, but it stems from the
same need.)

Jason


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 18:04       ` Jason A. Donenfeld
@ 2024-07-04 18:18         ` Linus Torvalds
  2024-07-04 18:35           ` Linus Torvalds
  2024-07-04 18:36           ` Jason A. Donenfeld
  0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2024-07-04 18:18 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Thu, 4 Jul 2024 at 11:04, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> >
> > I want to hear *from* those users. Because I would have expected all
> > those users to already have perfectly working setups in place already.
>
> What do you want me to do here?

You literally said "those users exist".

Make them pipe up.

Make them explain why what they have now doesn't work. What this
solves. In real terms.

Make them explain why that random "we duplicated the VM, and now we
worry that mixing in TSC doesn't help" is an actual real-world
concern, rather than something COMPLETELY MADE UP BY RANDOM NUMBER
PEOPLE.

See what my argument is? My argument is literally that theoretical
random number people will make up arguments that aren't actually
relevant in real life.

Do real people migrate VMs? Hell yes they do. Do they care about the
numbers being magically "stale" after said migration? I seriously
doubt that.

Do real people start multiple VMs from one single starting image?
Again, hell yes they do.

But do they start those multiple VMs from some random slapdash
snapshot that they just picked without any concern and cannot just
reseed in user space? And if they do, why should *WE* clean up after
their mindbogglingly stupid setup?

See what my argument is? I suspect _strongly_ that this is all
completely over-engineered based on theoretical grounds that aren't
actually practical grounds.

And dammit, I'm asking for the practical grounds. For the actual users.

And if you have trouble finding those, you just proved my point.

           Linus


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 18:18         ` Linus Torvalds
@ 2024-07-04 18:35           ` Linus Torvalds
  2024-07-04 18:46             ` Jason A. Donenfeld
  2024-07-04 18:36           ` Jason A. Donenfeld
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-04 18:35 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Thu, 4 Jul 2024 at 11:18, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> See what my argument is? I suspect _strongly_ that this is all
> completely over-engineered based on theoretical grounds that aren't
> actually practical grounds.

I also have a separate suggestion: I'm more than happy to do something
*MUCH SIMPLER*.

If people want just generation counts, we can give them generation
counts and maybe something extra in the vdso read-only page.  No new
VM stuff, no new "allocate a buffer that the kernel manages", just
something like one cacheline of helper data in the vdso page that is
shared with everybody and is already mapped.

THAT is what the vdso stuff is designed for. It's not supposed to be a
whole new library routine.

The state allocation should all be done in user space. The mixing
should all be done in user space. As far as I can tell, the *ONLY*
reason this is at all about the kernel is that "generation" counter.

Just expose the generation counter in the vdso data. It will even be
backwards compatible, in that old kernels will always have a value of
zero, and whatever user space library then uses the generation counter
to check that we haven't had some migration event or whatever won't
get the *migration* events, but the code will otherwise work.

And the regular user space library can decide to use whatever mixing
it wants, whatever state size it wants, and the kernel doesn't have to
worry about special memory allocations.

See why I think this is all so *HORRENDOUSLY* over-engineered? The
kernel has absolutely _zero_ special knowledge about random numbers
that user space doesn't have, except for that *one* number.

And you literally don't want to do kernel system calls anyway due to
performance, so your code is 99% user code anyway. KEEP IT THAT WAY.
Don't add it to the kernel.

                Linus
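
The generation-counter idea above might look something like the sketch
below from the user-space side. It is a guess at the shape, not an
existing interface: no such counter is exported in the vDSO data at the
time of this thread, so fake_vdso_rng_generation is a hypothetical
stand-in (an ordinary atomic in this process), and struct user_rng and
both functions are made-up names. The only real call is getrandom(2).

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/random.h>
#include <sys/types.h>

/* HYPOTHETICAL: stands in for a kernel-maintained counter that would live
 * in the read-only vDSO data page. Here it is just a process-local atomic. */
static _Atomic uint64_t fake_vdso_rng_generation;

struct user_rng {
    unsigned char seed[32];    /* user-space state; size chosen by the library */
    uint64_t seen_generation;  /* counter value observed at the last reseed */
    int seeded;                /* covers old kernels where the counter stays 0 */
    unsigned reseed_count;     /* bookkeeping for the example only */
};

/* Refill the seed from the kernel and remember which generation it matches. */
static int user_rng_reseed(struct user_rng *r, uint64_t gen)
{
    size_t have = 0;

    while (have < sizeof(r->seed)) {
        ssize_t n = getrandom(r->seed + have, sizeof(r->seed) - have, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        have += (size_t)n;
    }
    r->seen_generation = gen;
    r->seeded = 1;
    r->reseed_count++;
    return 0;
}

/* Call before producing output: reseed only if the counter has moved. */
static int user_rng_check(struct user_rng *r)
{
    uint64_t gen = atomic_load_explicit(&fake_vdso_rng_generation,
                                        memory_order_acquire);

    if (!r->seeded || gen != r->seen_generation)
        return user_rng_reseed(r, gen);
    return 0;
}
```

On a kernel that never updates the counter, the library would reseed once
and then carry on, which matches the backwards-compatibility point above.
The check and the later use of the seed are not atomic with respect to
each other, so a counter bump can land in between; whether that window
matters in practice is part of what this thread disputes.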


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 18:35           ` Linus Torvalds
@ 2024-07-04 18:46             ` Jason A. Donenfeld
  2024-07-04 18:52               ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-04 18:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi Linus,

On Thu, Jul 04, 2024 at 11:35:12AM -0700, Linus Torvalds wrote:
> On Thu, 4 Jul 2024 at 11:18, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > See what my argument is? I suspect _strongly_ that this is all
> > completely over-engineered based on theoretical grounds that aren't
> > actually practical grounds.
> 
> I also have a separate suggestion: I'm more than happy to do something
> *MUCH SIMPLER*.
> 
> If people want just generation counts, we can give them generation

I addressed this in the cover letter:

| How do we rectify this? By putting a safe implementation of getrandom()
| in the vDSO, which has access to whatever information a
| particular iteration of random.c is using to make its decisions. I use
| that careful language of "particular iteration of random.c", because the
| set of things that a vDSO getrandom() implementation might need for making
| decisions as good as the kernel's will likely change over time. This
| isn't just a matter of exporting certain *data* to userspace. We're not
| going to commit to a "data API" where the various heuristics used are
| exposed, locking in how the kernel works for decades to come, and then
| leave it to various userspaces to roll something on top and shoot
| themselves in the foot and have all sorts of complexity disasters.
| Rather, vDSO getrandom() is supposed to be the *same exact algorithm*
| that runs in the kernel, except it's been hoisted into userspace as
| much as possible. And so vDSO getrandom() and kernel getrandom() will
| always mirror each other hermetically.

random.c has a long history of exposing lots of particulars that we've
had to stub out. Enough of that. It's far better to have a function (not
a piece of data!) that uses the *exact same algorithm* and hence has the
exact same guarantees as random.c, and the kernel can keep those in
sync.

Jason


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 18:46             ` Jason A. Donenfeld
@ 2024-07-04 18:52               ` Linus Torvalds
  2024-07-04 18:57                 ` Jason A. Donenfeld
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-04 18:52 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Thu, 4 Jul 2024 at 11:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
>
> I addressed this in the cover letter:
>
> | How do we rectify this? By putting a safe implementation of getrandom()
> | in the vDSO, which has access to whatever information a
> | particular iteration of random.c is using to make its decisions. I use
> | that careful language of "particular iteration of random.c", because the
> | set of things that a vDSO getrandom() implementation might need for making
> | decisions as good as the kernel's will likely change over time.

Jason. This smells. It's BS.

Christ, let's make a deal: do a five-liner patch that adds the
generation number to the vdso data, and basically document it as a
"the kernel thinks you need to reseed your buffers using getrandom"
flag.

And *if* it turns out in the future that there is then any major
reason why that doesn't work, I'll take the 1000+ line thing, ok?

Deal?

                    Linus


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 18:52               ` Linus Torvalds
@ 2024-07-04 18:57                 ` Jason A. Donenfeld
  2024-07-04 19:19                   ` Linus Torvalds
  2024-07-07 16:56                   ` Russell Haley
  0 siblings, 2 replies; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-04 18:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi Linus,

On Thu, Jul 4, 2024 at 8:52 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 4 Jul 2024 at 11:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> >
> > I addressed this in the cover letter:
> >
> > | How do we rectify this? By putting a safe implementation of getrandom()
> > | in the vDSO, which has access to whatever information a
> > | particular iteration of random.c is using to make its decisions. I use
> > | that careful language of "particular iteration of random.c", because the
> > | set of things that a vDSO getrandom() implementation might need for making
> > | decisions as good as the kernel's will likely change over time.
>
> Jason. This smells. It's BS.

It's not BS. And that's not a real argument from you, but rather is
something else.

> Christ, let's make a deal: do a five-liner patch that adds the
> generation number to the vdso data, and basically document it as a
> "the kernel thinks you need to reseed your buffers using getrandom"
> flag.

I really do not want to expose random.c internals, and then deal with
the consequences of breaking user code that relied on that. The fake
entropy count API was already a nightmare to move away from. And I
think there's tremendous value in letting users use the kernel's
*exact algorithm*, whatever it happens to be, without syscall
overhead. Plus, this means further proliferation of bad userspace
RNGs. So I think the deal is a bad one.

> reason why that doesn't work, I'll take the 1000+ line thing

(I would like to point out that a good deal of that series is test
code and documentation and such.)

Jason


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 18:57                 ` Jason A. Donenfeld
@ 2024-07-04 19:19                   ` Linus Torvalds
  2024-07-04 21:07                     ` Linus Torvalds
  2024-07-07 16:56                   ` Russell Haley
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-04 19:19 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Thu, 4 Jul 2024 at 11:57, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> I really do not want to expose random.c internals, and then deal with
> the consequences of breaking user code that relied on that. The fake
> entropy count API was already a nightmare to move away from. And I
> think there's tremendous value in letting users use the kernel's
> *exact algorithm*, whatever it happens to be, without syscall
> overhead. Plus, this means further proliferation of bad userspace
> RNGs. So I think the deal is a bad one.

Bah. I guess I'll have to walk through the patch series once again.

I'm still not thrilled about it. But I'll give it another go.

                 Linus


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 19:19                   ` Linus Torvalds
@ 2024-07-04 21:07                     ` Linus Torvalds
  2024-07-04 21:44                       ` Arnd Bergmann
  2024-07-05 16:18                       ` Jason A. Donenfeld
  0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2024-07-04 21:07 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Thu, 4 Jul 2024 at 12:19, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Bah. I guess I'll have to walk through the patch series once again.

Ok, I went through it once. First comments:

The system call additions look really random. You don't add them to
all architectures, but the ones you *do* add them to seem positively
pointless:

 - I don't think you should introduce the system call on 32-bit
architectures, and that includes as a compat call on 64-bit.

    The VM_DROPPABLE infrastructure doesn't even exist on 32-bit, and
while that might not be technically a requirement, it does seem to
argue against doing this on 32-bit architectures. Plus nobody sane
cares.

    You didn't even enable it on 32-bit x86 in the vdso, so why did
you enable it as a syscall?

 - even 64-bit architectures don't necessarily have anything like a
vdso, eg alpha.

It looks like you randomly just picked the architectures that have a
syscall.tbl file, rather than architectures where this made sense. I
think you should drop all of them except possibly arm64, s390 and
powerpc.

I'm very ambivalent about the VM_DROPPABLE code.

On one hand, it's something we've discussed many times, and I don't
hate it. On the other hand, the discussions have always been about
actually exposing it to user space as a MAP_DROPPABLE so that user
space can do caching.

In fact, I'm almost certain that *because* you didn't expose it to
mmap(), people will now then instead mis-use vgetrandom_alloc()
instead to allocate random MAP_DROPPABLE pages. That is going to be a
nightmare.

And that nightmare has to be avoided. Which in turn means that I think
vgetrandom_alloc() has to go, and you just need to expose
MAP_DROPPABLE instead that only works for private anonymous mappings,
and make sure glibc uses that.

Because as your patch series stands now, the semantics are unacceptable.

This is a non-starter. When I see a new system call where my reaction
is not just "this should have been just a mmap()", but then
immediately followed by "Oh, and people will mis-use this as a cool
mmap", I'm not merging that system call.

So I don't hate VM_DROPPABLE per se, but the interface is simply not
ok. vgetrandom_alloc() absolutely *has* to go, and needs to just be a
user-space wrapper around regular mmap.

                 Linus
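
A user-space wrapper around regular mmap(), as suggested above, might be
sketched like this. Treat the details as assumptions: MAP_DROPPABLE did
not exist as an mmap() flag when this message was written, the value 0x08
below is what later kernel headers are believed to use and should be
checked against the installed <linux/mman.h>, and rng_state_alloc() is a
hypothetical name. The fallback path relies on the kernel rejecting an
unknown mapping type with an error.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* ASSUMPTION: not defined by headers contemporary with this thread; the
 * value is believed to match what was later merged, verify before use. */
#ifndef MAP_DROPPABLE
#define MAP_DROPPABLE 0x08
#endif

/*
 * Allocate len bytes of anonymous memory for RNG state. Tries a droppable
 * mapping first and falls back to an ordinary private one when the kernel
 * does not know the flag. *droppable reports which kind was obtained.
 */
static void *rng_state_alloc(size_t len, int *droppable)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_DROPPABLE, -1, 0);

    if (p != MAP_FAILED) {
        *droppable = 1;
        return p;
    }

    /* Kernel without droppable mappings: plain private anonymous memory. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    *droppable = 0;
    return p == MAP_FAILED ? NULL : p;
}
```

A caller holding a droppable mapping would have to tolerate its contents
being discarded by the kernel and regenerate them, which is the caching
use that has been discussed for such an interface. The mapping is released
with munmap() as usual.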


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 21:07                     ` Linus Torvalds
@ 2024-07-04 21:44                       ` Arnd Bergmann
  2024-07-04 22:07                         ` Linus Torvalds
  2024-07-05 16:18                       ` Jason A. Donenfeld
  1 sibling, 1 reply; 39+ messages in thread
From: Arnd Bergmann @ 2024-07-04 21:44 UTC (permalink / raw)
  To: Linus Torvalds, Jason A . Donenfeld
  Cc: Jiri Olsa, Masami Hiramatsu, cgzones, Christian Brauner,
	linux-kernel

On Thu, Jul 4, 2024, at 23:07, Linus Torvalds wrote:
>
>  - even 64-bit architectures don't necessarily have anything like a
> vdso, eg alpha.
>
> It looks like you randomly just picked the architectures that have a
> syscall.tbl file, rather than architectures where this made sense. I
> think you should drop all of them except possibly arm64, s390 and
> powerpc.

It's not random, it's all the architectures: the ones that
don't have a syscall.tbl file are the ones that use the table
in include/uapi/asm-generic/unistd.h. I generally recommend
doing it like that to ensure all architectures define the same
__NR_* macro for new syscalls even if the implementation
gets added later. As you say, this one is special because
it's not useful without a vdso, but that doesn't require making
it more special than necessary by adding it selectively.

In particular, if the entries above number 402 are kept
consistent across all architectures, we can
more easily move them into a shared file in the future to
avoid some of the complexity of adding syscalls.

     Arnd
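
To illustrate why a consistent __NR_* macro is useful to user space even
before an implementation exists everywhere: a program can name a syscall
by number and handle ENOSYS at run time. The sketch below uses the
long-established getrandom number rather than any of the numbers being
negotiated in this thread; raw_getrandom() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Invoke getrandom by syscall number instead of through a libc wrapper.
 * Returns the number of bytes written, or -1 with errno set; errno is
 * ENOSYS when the running kernel does not implement the call.
 */
static long raw_getrandom(void *buf, size_t len, unsigned int flags)
{
    return syscall(__NR_getrandom, buf, len, flags);
}
```

If the macro were defined on only some architectures, the same source
would need per-architecture #ifdefs just to compile, independent of
whether the kernel underneath implements the call.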


* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 21:44                       ` Arnd Bergmann
@ 2024-07-04 22:07                         ` Linus Torvalds
  2024-07-05  8:32                           ` Arnd Bergmann
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-04 22:07 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jason A . Donenfeld, Jiri Olsa, Masami Hiramatsu, cgzones,
	Christian Brauner, linux-kernel

On Thu, 4 Jul 2024 at 14:45, Arnd Bergmann <arnd@arndb.de> wrote:
>
> It's not random, it's all the architectures: the ones that
> don't have a syscall.tbl file are the ones that use the table
> in include/uapi/asm-generic/unistd.h.

Ok.

I think it's bogus to reserve system calls for everybody even when it
makes no sense.

But it's also pretty moot, since I think the whole system call has to go away.

All it is is an odd wrapper around mmap() anyway, and it's a useful
enough thing *outside* of getrandom() that I pretty much guarantee it
will be used for other things than vgetrandom anyway.

              Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 22:07                         ` Linus Torvalds
@ 2024-07-05  8:32                           ` Arnd Bergmann
  2024-07-05 16:59                             ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Arnd Bergmann @ 2024-07-05  8:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A . Donenfeld, Jiri Olsa, Masami Hiramatsu, cgzones,
	Christian Brauner, linux-kernel

On Fri, Jul 5, 2024, at 00:07, Linus Torvalds wrote:
> On Thu, 4 Jul 2024 at 14:45, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> It's not random, it's all the architectures: the ones that
>> don't have a syscall.tbl file are the ones that use the table
>> in include/uapi/asm-generic/unistd.h.
>
> Ok.
>
> I think it's bogus to reserve system calls for everybody even when it
> makes no sense.

I see. Just to make sure: do you think it's ok to still
reserve system call numbers everywhere if they are used
on most architectures? I posted a series yesterday to
convert include/uapi/asm-generic/unistd.h into the syscall.tbl
format, and I did this change for clone3:

https://lore.kernel.org/lkml/20240704143611.2979589-8-arnd@kernel.org/

The reasoning here is that we want this to be available
everywhere but there are four architectures still missing
it, and having the macro defined in the generated unistd.h
avoids a special case.

On the other hand, I left memfd_secret a special case since
that one is only implemented on one architecture using the
generic table.

> But it's also pretty moot, since I think the whole system call has to go away.
>
> All it is is an odd wrapper around mmap() anyway, and it's a useful
> enough thing *outside* of getrandom() that I pretty much guarantee it
> will be used for other things than vgetrandom anyway.

Right.

    Arnd

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-05  8:32                           ` Arnd Bergmann
@ 2024-07-05 16:59                             ` Linus Torvalds
  0 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2024-07-05 16:59 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jason A . Donenfeld, Jiri Olsa, Masami Hiramatsu, cgzones,
	Christian Brauner, linux-kernel

On Fri, 5 Jul 2024 at 01:34, Arnd Bergmann <arnd@arndb.de> wrote:
>
> I see. Just to make sure: do you think it's ok to still
> reserve system call numbers everywhere if they are used
> on most architectures?

Yes. If there's a reason why a system call might be used, no problem.

             Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 21:07                     ` Linus Torvalds
  2024-07-04 21:44                       ` Arnd Bergmann
@ 2024-07-05 16:18                       ` Jason A. Donenfeld
  2024-07-05 17:39                         ` Linus Torvalds
  1 sibling, 1 reply; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-05 16:18 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi Linus,

On Thu, Jul 04, 2024 at 02:07:41PM -0700, Linus Torvalds wrote:
> On Thu, 4 Jul 2024 at 12:19, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Bah. I guess I'll have to walk through the patch series once again.

Thanks for having a look. I really appreciate it.

> Ok, I went through it once. First comments:
> 
> The system call additions look really random. You don't add them to
> all architectures, but the ones you *do* add them to seem positively
> pointless:
> 
>  - I don't think you should introduce the system call on 32-bit
> architectures, and that includes as a compat call on 64-bit.
> 
>     The VM_DROPPABLE infrastructure doesn't even exist on 32-bit, and
> while that might not be technically a requirement, it does seem to
> argue against doing this on 32-bit architectures. Plus nobody sane
> cares.
> 
>     You didn't even enable it on 32-bit x86 in the vdso, so why did
> you enable it as a syscall?
> 
>  - even 64-bit architectures don't necessarily have anything like a
> vdso, eg alpha.
> 
> It looks like you randomly just picked the architectures that have a
> syscall.tbl file, rather than architectures where this made sense. I
> think you should drop all of them except possibly arm64, s390 and
> powerpc.

The first versions of my series actually only enabled it on x86.
(Somebody also wrote an arm64 implementation of all this already, but
that's for later.) But after I posted that, people (Arnd, I think?) told
me I should add it to all architectures to "reserve" the number. That
was a lot of annoying busy work to do, but I did it, and not just random
archs, but *all* of them.

I'd be happy to revert all this and just enable it on x86. I'll do that
for the v+1 patch. It's less work for me and would make this series one
patch less.

But there might be a conversation to have (that I think you've begun
with Arnd) about what the expectations are for this, because the "enable
it on all of them" seems to be something I've heard on more than one
occasion.

> I'm very ambivalent about the VM_DROPPABLE code.
> 
> On one hand, it's something we've discussed many times, and I don't
> hate it. On the other hand, the discussions have always been about
> actually exposing it to user space as a MAP_DROPPABLE so that user
> space can do caching.
> 
> In fact, I'm almost certain that *because* you didn't expose it to
> mmap(), people will now then instead mis-use vgetrandom_alloc()
> instead to allocate random MAP_DROPPABLE pages. That is going to be a
> nightmare.

VM_DROPPABLE *is* actually a very useful feature. Or it at least seems
like it could be one. One can imagine various database caches that do a
memory vs cpu trade off using it. (But, to be clear, I've never actually
spoken with database developers about it.)

There are some other improvements for it I have in mind that I was
considering posting in some time when this work here has settled.

And then, indeed, it'd make sense to eventually expose this properly to
mmap() and let people use it. (Or if you want to do that in reverse,
adding it to mmap() first, so that people don't misuse
vgetrandom_alloc(), that's fine.)

> And that nightmare has to be avoided. Which in turn means that I think
> vgetrandom_alloc() has to go, and you just need to expose
> MAP_DROPPABLE instead that only works for private anonymous mappings,
> and make sure glibc uses that.
> 
> Because as your patch series stands now, the semantics are unacceptable.
> 
> This is a non-starter. When I see a new system call where my reaction
> is not just "this should have been just a mmap()", but then
> immediately followed by "Oh, and people will mis-use this as a cool
> mmap", I'm not merging that system call.
> 
> So I don't hate VM_DROPPABLE per se, but the interface is simply not
> ok. vgetrandom_alloc() absolutely *has* to go, and needs to just be a
> user-space wrapper around regular mmap.

So I'm not wedded to adding a syscall for this and am pretty open to
other ways of doing it, but I actually think given the requirements,
this kind of makes sense. I was talking about this problem with tglx or
with Greg a while back, kind of frustrated, and one of them suggested,
"well just make it a syscall; that's what those are for," and it
immediately made sense, and so that's what I've done. Here are the
requirements:

- The "mechanism" needs to return allocated memory to userspace that can
  be chunked up on a per-thread basis, with no state straddling pages,
  which means it also needs to return the size of each state, and the
  number of states that were allocated.

- The size of each state might change kernel version to kernel version.

- In an effort to match the behaviors of syscall getrandom() as much as
  possible, it needs to be mapped with various flags (the ones in the
  current vgetrandom_alloc() implementation).

- Which flags are needed might change kernel version to kernel version.

- Future memory tagging CPU extensions might allow us to prevent the
  memory from being accessed unless the accesses are coming from vDSO
  code, which would avoid heartbleed-like bugs. This is very appealing.
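
The first requirement above can be sketched as follows. This is only an
illustration of "no state straddling pages", not the kernel's
allocator: carve_states is a hypothetical helper, and the 144-byte
state size used in the example is the figure quoted elsewhere in this
thread.

```python
def carve_states(num_pages, page_size, state_size):
    """Return byte offsets of states so that no state straddles a page."""
    per_page = page_size // state_size  # whole states that fit in one page
    offsets = []
    for page in range(num_pages):
        base = page * page_size
        for i in range(per_page):
            offsets.append(base + i * state_size)
    return offsets


offsets = carve_states(num_pages=2, page_size=4096, state_size=144)
print(len(offsets), offsets[:3], offsets[28])
```

The tail of each page that cannot hold a whole state is left unused,
which is why the caller needs both the state size and the number of
states rather than just a total length.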

So, the memory that's returned, and the parameters about it are sort of
tied to the actual [v]getrandom() implementation. That sounds to me like
this should be done by a function that the kernel is in charge of. Hence
the syscall. (Or a vDSO function, but then it wouldn't correspond with
an equivalent syscall, which might not be appealing to tglx, and it
starts to smell like "library code" which we really don't want.)

Given this, it seemed like a syscall was the cleanest most cromulent
solution. But if you have other suggestions, I'm open to it.

Maybe, though, the best way of assuaging your concerns would be to
expose MAP_DROPPABLE in mmap() in the same series as the rest, so that
there *isn't* a chance that vgetrandom_alloc() will be abused when
people realize it's a handy feature to have.

Thoughts?

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-05 16:18                       ` Jason A. Donenfeld
@ 2024-07-05 17:39                         ` Linus Torvalds
  2024-07-05 17:53                           ` Jason A. Donenfeld
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-05 17:39 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Fri, 5 Jul 2024 at 09:18, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> VM_DROPPABLE *is* actually a very useful feature. Or it at least seems
> like it could be one.

Yes. It's been discussed exactly in that "this _could_ be very useful"
sense, although we've never actually pulled the trigger.

I tried to find previous discussions on lore, but failed miserably, so
I can't point to previous discussions from long ago, but one question
was also always about whether you wanted some explicit "populate this
page range" interface together with getting a SIGBUS when it's
unpopulated (so that you can basically do demand-paging in user
space).

With just a "this could be useful" but no hard users, it never really
got anywhere.

Anyway, I really don't mind VM_DROPPABLE with "it just gets
re-populated as a new anonymous page" model, particularly since we
could easily then later decide that we could expand on it as a
MAP_SHARED thing with SIGBUS semantics and explicit initialization if
we ever really want it.

End result: I don't think there are necessarily *lots* of users, but I
do think that this is something where some enterprising person goes "I
can use this", and makes some cool library that uses it for caching,
and then we'd be stuck with it.

> And then, indeed, it'd make sense to eventually expose this properly to
> mmap() and let people use it. (Or if you want to do that in reverse,
> adding it to mmap() first, so that people don't misuse
> vgetrandom_alloc(), that's fine.)

Yes. And it should be pretty trivial.

We just at least initially have to be very careful to limit it to
MAP_ANONYMOUS and MAP_PRIVATE. Because dropping dirty bits on shared
mappings sounds insane and like a possible source of confusion (and
thus bugs and maybe even security issues).
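
A minimal sketch of that sanity check, written as plain flag
arithmetic. The MAP_SHARED, MAP_PRIVATE, MAP_TYPE and MAP_ANONYMOUS
values are the usual x86 Linux ones; the MAP_DROPPABLE bit is a
placeholder chosen purely for illustration, and droppable_request_ok is
a hypothetical name.

```python
# Usual x86 Linux values for the existing flags.
MAP_SHARED = 0x01
MAP_PRIVATE = 0x02
MAP_TYPE = 0x0F
MAP_ANONYMOUS = 0x20
# Hypothetical placeholder bit, not a real ABI value.
MAP_DROPPABLE = 0x1000_0000


def droppable_request_ok(flags):
    """Accept a droppable request only for private anonymous mappings."""
    if not flags & MAP_DROPPABLE:
        return True  # not a droppable request, nothing extra to check
    is_private = (flags & MAP_TYPE) == MAP_PRIVATE
    is_anonymous = bool(flags & MAP_ANONYMOUS)
    return is_private and is_anonymous


print(droppable_request_ok(MAP_PRIVATE | MAP_ANONYMOUS | MAP_DROPPABLE))
print(droppable_request_ok(MAP_SHARED | MAP_ANONYMOUS | MAP_DROPPABLE))
```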

It's possible that we might even use a MAP_TYPE flag for this. Or make
it a PROT_xyz bit rather than a MAP_xyz.

So there's some trivial sanity checks and some UI issues to just pick,
but apart from "just pick something sane", exposing this for mmap() is
_not_ hard, and I do think it needs to be done first.

And once it's done, I think the argument for having a special system
call is basically gone too.

> - The "mechanism" needs to return allocated memory to userspace that can
>   be chunked up on a per-thread basis, with no state straddling pages,
>   which means it also needs to return the size of each state, and the
>   number of states that were allocated.
>
> - The size of each state might change kernel version to kernel version.

Just pick a size large enough.

And why would that size not be one page?

Considering that you really don't want to rely on page-crossing state
*ANYWAY* because of the whole "one page can go away while another one
sticks around" issue, I would expect that states over one page per
thread would be a *very* questionable idea to begin with.

I don't think we'll ever see systems with page sizes smaller than 4k.
They have existed in the past, but they're not making a comeback.
People want larger pages, not smaller ones.

And the state size right now is what - 200 bytes? So a single page
seems (a) sufficient and (b) kind of the sane maximum anyway due to
the dropping.

No?

              Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-05 17:39                         ` Linus Torvalds
@ 2024-07-05 17:53                           ` Jason A. Donenfeld
  2024-07-05 18:08                             ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-05 17:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi Linus,

On Fri, Jul 05, 2024 at 10:39:48AM -0700, Linus Torvalds wrote:
> Yes. And it should be pretty trivial.
> 
> We just at least initially have to be very careful to limit it to
> MAP_ANONYMOUS and MAP_PRIVATE. Because dropping dirty bits on shared
> mappings sounds insane and like a possible source of confusion (and
> thus bugs and maybe even security issues).
> 
> It's possible that we might even use a MAP_TYPE flag for this. Or make
> it a PROT_xyz bit rather than a MAP_xyz.
> 
> So there's some trivial sanity checks and some UI issues to just pick,
> but apart from "just pick something sane", exposing this for mmap() is
> _not_ hard, and I do think it needs to be done first.

I can take a stab at it.

> > - The "mechanism" needs to return allocated memory to userspace that can
> >   be chunked up on a per-thread basis, with no state straddling pages,
> >   which means it also needs to return the size of each state, and the
> >   number of states that were allocated.
> >
> > - The size of each state might change kernel version to kernel version.
> 
> Just pick a size large enough.
> 
> And why would that size not be one page?
> 
> Considering that you really don't want to rely on page-crossing state
> *ANYWAY* because of the whole "one page can go away while another one
> sticks around" issue, I would expect that states over one page per
> thread would be a *very* questionable idea to begin with.
> 
> I don't think we'll ever see systems with page sizes smaller than 4k.
> They have existed in the past, but they're not making a comeback.
> People want larger pages, not smaller ones.

That sounds not so good: the current state is 144 bytes, and it's
expected that there'll be one of these per thread. Mapping 16k or 4k per
thread seems pretty bad. At least it certainly seems that way? Wasting
16240 bytes per thread + a new vmap I can't imagine is okay.
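
The arithmetic behind that figure, as a small sketch assuming one
144-byte state per thread in a mapping of exactly one page
(wasted_bytes is a hypothetical helper name):

```python
STATE_SIZE = 144  # bytes, the current state size quoted above


def wasted_bytes(page_size, state_size=STATE_SIZE):
    """Bytes left unused when one state occupies a whole page."""
    return page_size - state_size


for page_size in (4096, 16384):
    print(page_size, wasted_bytes(page_size))
```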

Also, these points still stand:

| - In an effort to match the behaviors of syscall getrandom() as much as
|   possible, it needs to be mapped with various flags (the ones in the
|   current vgetrandom_alloc() implementation).
|
| - Which flags are needed might change kernel version to kernel version.
|
| - Future memory tagging CPU extensions might allow us to prevent the
|   memory from being accessed unless the accesses are coming from vDSO
|   code, which would avoid heartbleed-like bugs. This is very appealing.

It seems like leaving it just up to mmap() will not only result in users
doing it wrong, but kind of limits our options moving forward. And
there's this whole issue of communicating sizes so as not to be
wasteful.

Another idea I had, if you hate the syscall, is I could just add this as
(another) private ioctl() on the /dev/random node. This sounds worse
than a syscall because it means that node has to exist and the fd
has to be opened -- and concerns about this were what led to the
getrandom() syscall being introduced in the first place -- but it would
at least avoid the syscall. I'm not crazy about that though.

Maybe the winning solution is MAP_DROPPABLE (or PROT_DROPPABLE) in
mmap(), and then in the following commit, add the vgetrandom_alloc()
syscall, and then we'll avoid vgetrandom_alloc() getting abused, but
still have a nice interface that isn't too constraining.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-05 17:53                           ` Jason A. Donenfeld
@ 2024-07-05 18:08                             ` Linus Torvalds
  2024-07-05 18:56                               ` Jason A. Donenfeld
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-05 18:08 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Fri, 5 Jul 2024 at 10:53, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> That sounds not so good: the current state is 144 bytes, and it's
> expected that there'll be one of these per thread. Mapping 16k or 4k per
> thread seems pretty bad. At least it certainly seems that way? Wasting
> 16240 bytes per thread + a new vmap I can't imagine is okay.

Well, I guess the simple solution would be "just pick a size that is
guaranteed to be at most a page, and a power-of-two, and big enough".

You really don't have that many choices. Presumably we won't have
per-architecture random states anyway, so the smallest supported page
size is the upper limit, and if the current size is 144 bytes, we know
that 256 is the lower limit.

IOW, we pretty much know that the number is _always_ going to be 2**n
where 8 <= n <= 12.

Just pick one.
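
The bounds above can be enumerated with a short sketch. It assumes the
144-byte state size and a 4096-byte smallest page size, as stated in
this thread; candidate_sizes is a hypothetical helper name.

```python
def candidate_sizes(state_size, min_page_size):
    """Powers of two from the first one >= state_size up to min_page_size."""
    size = 1 << (state_size - 1).bit_length()  # round up to a power of two
    sizes = []
    while size <= min_page_size:
        sizes.append(size)
        size *= 2
    return sizes


print(candidate_sizes(144, 4096))
```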

> | - Future memory tagging CPU extensions might allow us to prevent the
> |   memory from being accessed unless the accesses are coming from vDSO
> |   code, which would avoid heartbleed-like bugs. This is very appealing.

No. Stop this idiocy.

Now you are getting into cray-cray land. Nobody cares about random
numbers so much that they'd worry about leaking them from other
sources thanks to hardware bugs.

Seriously. This is the kind of "crazy random number" talk that makes
me go "I don't want to touch this".

Get your act together. There is *NO* way we care about this kind of
garbage, and just bringing it up makes me doubt that you have the
right mindset.

You claimed to not be one of the crazy people. SHOW IT.

                   Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-05 18:08                             ` Linus Torvalds
@ 2024-07-05 18:56                               ` Jason A. Donenfeld
  2024-07-05 19:21                                 ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-05 18:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Fri, Jul 05, 2024 at 11:08:03AM -0700, Linus Torvalds wrote:
> On Fri, 5 Jul 2024 at 10:53, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > That sounds not so good: the current state is 144 bytes, and it's
> > expected that there'll be one of these per thread. Mapping 16k or 4k per
> > thread seems pretty bad. At least it certainly seems that way? Wasting
> > 16240 bytes per thread + a new vmap I can't imagine is okay.
> 
> Well, I guess the simple solution would be "just pick a size that is
> guaranteed to be at most a page, and a power-of-two, and big enough".
> 
> You really don't have that many choices. Presumably we won't have
> per-architecture random states anyway, so the smallest supported page
> size is the upper limit, and if the current size is 144 bytes, we know
> that 256 is the lower limit.
> 
> IOW, we pretty much know that the number is _always_ going to be 2**n
> where 8 <= n <= 12.
> 
> Just pick one.

And if we want to exceed that size in the future, then what? Just seems
like hard coding it locks us in.

Also, pow2 is still wasteful - 28 states for a 4k page at optimal size
versus 16 states for a 4k page at rounding up to current pow2. That's
not a huge difference at small scale. But also, why? Seems like we could
do this a lot better.
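
Those two counts follow from integer division, sketched here with the
4k page and 144-byte state from above (states_per_page is a
hypothetical helper name):

```python
PAGE_SIZE = 4096
STATE_SIZE = 144


def states_per_page(slot_size, page_size=PAGE_SIZE):
    """Whole slots of slot_size bytes that fit in one page."""
    return page_size // slot_size


exact = states_per_page(STATE_SIZE)  # slots at the exact state size
pow2 = states_per_page(256)          # slots when rounded up to 256 bytes
unused_exact = PAGE_SIZE - exact * STATE_SIZE  # tail bytes left over

print(exact, pow2, unused_exact)
```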

> 
> > | - Future memory tagging CPU extensions might allow us to prevent the
> > |   memory from being accessed unless the accesses are coming from vDSO
> > |   code, which would avoid heartbleed-like bugs. This is very appealing.
> 
> No. Stop this idiocy.
> 
> Now you are getting into cray-cray land. Nobody cares about random
> numbers so much that they'd worry about leaking them from other
> sources thanks to hardware bugs.
> 
> Seriously. This is the kind of "crazy random number" talk that makes
> me go "I don't want to touch this".
> 
> Get your act together. There is *NO* way we care about this kind of
> garbage, and just bringing it up makes me doubt that you have the
> right mindset.
> 
> You claimed to not be one of the crazy people. SHOW IT.

I'm pretty sure you just misunderstood what I'm referring to.
"Heartbleed-like" refers to remote info leak. Like, some server process
spits out a bunch of memory onto the network. If the rng pages can only
be accessed when the caller is at some specified address range, then
those kinds of bugs are mitigated. Anyway, just an idea, but doesn't
seem like an impossible one.

There were also those two other unrelated points I raised, trimmed from
the context. To repaste them all from before:

| Here are the requirements:
|
| - The "mechanism" needs to return allocated memory to userspace that can
|   be chunked up on a per-thread basis, with no state straddling pages,
|   which means it also needs to return the size of each state, and the
|   number of states that were allocated.
| 
| - The size of each state might change kernel version to kernel version.

Your suggestion is to hard code the state size to a power of 2, which
will lock us in to having that as an upper bound forever, and also waste
memory because it's not ideally sized.

| 
| - In an effort to match the behaviors of syscall getrandom() as much as
|   possible, it needs to be mapped with various flags (the ones in the
|   current vgetrandom_alloc() implementation).
| 
| - Which flags are needed might change kernel version to kernel version.

Unaddressed.

| 
| - Future memory tagging CPU extensions might allow us to prevent the
|   memory from being accessed unless the accesses are coming from vDSO
|   code, which would avoid heartbleed-like bugs. This is very appealing.

I think you misunderstood me as referring to "hardware bugs", but that's
not what I was talking about, as I described above. Anyway, regardless,
if your take on this is, "I don't care about making certain rng memory
harder to leak than other memory," then so be it and I'll drop this
point.

| So, the memory that's returned, and the parameters about it are sort of
| tied to the actual [v]getrandom() implementation. That sounds to me like
| this should be done by a function that the kernel is in charge of. Hence
| the syscall.

I'm having a hard time seeing how, "let the user guess and pass
whatever flags were decided at one moment" is preferable to, "have a
syscall/function/ioctl/whatever communicate to userspace what it needs
to do and to set up the mapping in exactly the way it's needed." I'm
sorry to keep belaboring this, but I'm actually just sort of surprised
by your take.

I get the part about, "users will abuse vgetrandom_alloc() for something
uncouth," which seems very real, but the solution to that is to just
expose this to mmap() first. Once that's there, vgetrandom_alloc()
becomes kind of similar to, say, map_shadow_stack().

But okay, spit-balling further, there are the current ideas proposed,
and I'll add two more to the bottom:

0) Syscall.

1) /dev/random ioctl. Downside: needs filesystem node, fd.

2) Hard coding 256 and set of mmap flags. Downside: discussed above.

3) Expose /proc/sys/kernel/random/vgetrandom_info, which gives one field
   of the state size and another of the flags needed for mmap. Downside:
   still less flexible than the kernel doing the allocation, like if
   it'd be nice in the future for some additional step to be taken on
   the memory after mmap(). Downside: needs filesystem node, fd.

4) Same as (3), but expose this through passing -1 as opaque_len to
   vgetrandom(). Downside: kinda ugly, adds branch.

I think of these, (3) is preferable to (2). (0) still seems best, but
I'm not sure you'll agree yet. (4) might be preferable to (3) because no
filesystem stuff.

If (0) and (1) are still sounding bad to you, do (3) or (4) sound
better?

Also, I'm just brainstorming here; if you find these deranged, that's
okay.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-05 18:56                               ` Jason A. Donenfeld
@ 2024-07-05 19:21                                 ` Linus Torvalds
  2024-07-05 19:46                                   ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-05 19:21 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Fri, 5 Jul 2024 at 11:56, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> And if we want to exceed that size in the future, then what? Just seems
> like hard coding it locks us in.

KISS. Keep It Simple Stupid. Make a sane decision. Stick with it.

This is *not* something where things will change radically over the years.

But what this *is* is something where we want to actively avoid
overcomplicating things.

If saying "the state size is fixed at 256 bytes" means that ten years
from now, we won't be updating to some super-duper fancy new algorithm
that wants to keep a huge state size - then that's a GOOD thing.

We are software ENGINEERS. That means that we make sane decisions and
live with real life limits.

We know that we don't have infinite entropy, and we understand that we
can't even know how much entropy we do have.  At some point, you just
have to put your foot down.

Leave the people who have theoretical concerns behind. They can damn
well do their own thing. We should not care.

If somebody is unhappy with the result, let them go make their own
random number generator.

We've used the current chacha state for what, a decade now? Just let it be.

                Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-05 19:21                                 ` Linus Torvalds
@ 2024-07-05 19:46                                   ` Linus Torvalds
  2024-07-06  0:11                                     ` Jason A. Donenfeld
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-05 19:46 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Fri, 5 Jul 2024 at 12:21, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> KISS. Keep It Simple Stupid. Make a sane decision. Stick with it.

Side note: you could just stick the size as a constant in the vdso too.

But honestly, what's the argument for more than 256 if 144 bytes is
the reality now?

Does anybody seriously think our current getrandom() isn't good enough?

              Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-05 19:46                                   ` Linus Torvalds
@ 2024-07-06  0:11                                     ` Jason A. Donenfeld
  2024-07-06  2:10                                       ` Jason A. Donenfeld
  0 siblings, 1 reply; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-06  0:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi Linus,

On Fri, Jul 05, 2024 at 12:46:37PM -0700, Linus Torvalds wrote:
> If saying "the state size is fixed at 256 bytes" means that ten years
> from now, we won't be updating to some super-duper fancy new algorithm
> that wants to keep a huge state size - then that's a GOOD thing.

I'm all for avoiding fanciness. I can imagine three plausible scenarios
where we benefit from the kernel doing the allocation, rather than mmap,
or where it's nice to have the kernel decide on the size:

- On some platform, it's actually more efficient to generate N blocks,
  such that the state there needs to be larger.

- The amount of state that we buffer increases according to some speed
  vs practicality trade off that changes. (Right now we buffer 1.5
  blocks; maybe 3.5 would be better eventually.)

- We find out that there's a better way of doing all this with a special
  mapping instead, or some other means.

What I have in mind, IOW, isn't fanciness. But alright, let me run with
where you're urging me and see where that takes things. 

> Side note: you could just stick the size as a constant in the vdso too.

Yea, this sounds more like solution (4) from my last email. I'll give
that a shot and see what it's like nuking the syscall. I'll ping here
when v21 of the series is ready, and hopefully you like it more.

Thanks for brainstorming this all with me.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-06  0:11                                     ` Jason A. Donenfeld
@ 2024-07-06  2:10                                       ` Jason A. Donenfeld
  2024-07-06  2:56                                         ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-06  2:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi again Linus,

On Sat, Jul 06, 2024 at 02:11:59AM +0200, Jason A. Donenfeld wrote:
> What I have in mind, IOW, isn't fanciness. But alright, let me run with
> where you're urging me and see where that takes things. 
> 
> > Side note: you could just stick the size as a constant in the vdso too.
> 
> Yea, this sounds more like solution (4) from my last email. I'll give
> that a shot and see what it's like nuking the syscall. I'll ping here
> when v21 of the series is ready, and hopefully you like it more.

I'll spend the weekend doing my own code review and fixing things up and
working on commit messages and documentation and all that, but there are
now three simpler commits in here that implement what I have in mind
based on our discussion:

    https://git.zx2c4.com/linux-rng/log/

The selftest code is the largest part of it. There's no more syscall. I
think it should be much more to your liking and seems like an alright
set of compromises. Hopefully that's a bit closer to the mark.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-06  2:10                                       ` Jason A. Donenfeld
@ 2024-07-06  2:56                                         ` Linus Torvalds
  2024-07-06 23:26                                           ` Jason A. Donenfeld
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2024-07-06  2:56 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

On Fri, 5 Jul 2024 at 19:10, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
>     https://git.zx2c4.com/linux-rng/log/

So we already expose VM_WIPEONFORK and VM_DONTDUMP using madvise().
Exposing them at mmap creation time with MMAP_xyz sounds fine.

However, I do note that both the pre-existing VM_WIPEONFORK - and the
new VM_DROPPABLE - need to be limited to anonymous private mappings
only.

You did that for VM_DROPPABLE, but not for VM_WIPEONFORK.

Now, admittedly I don't remember *why* we made VM_WIPEONFORK only work
for private mappings, but that's what we did.
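
For reference, the existing madvise() route mentioned above can be
sketched roughly as follows. This is only an illustration: the helper
names map_secret_region() and child_sees_wiped_region() are made up for
the example, and the fallback value for MADV_WIPEONFORK is only there
for older libc headers.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_WIPEONFORK
#define MADV_WIPEONFORK 18	/* value from the Linux UAPI headers */
#endif

/* Hypothetical helper: map `len` bytes of anonymous private memory and
 * mark it wipe-on-fork and excluded from core dumps. NULL on failure. */
static void *map_secret_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	/* MADV_WIPEONFORK is only accepted on private anonymous mappings. */
	if (madvise(p, len, MADV_WIPEONFORK) != 0 ||
	    madvise(p, len, MADV_DONTDUMP) != 0) {
		munmap(p, len);
		return NULL;
	}
	return p;
}

/* Returns 1 if a forked child sees the region zeroed, 0 if it does not,
 * and -1 if the region could not be set up or the fork failed. */
static int child_sees_wiped_region(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p = map_secret_region(len);
	int status, result = -1;
	pid_t pid;

	if (!p)
		return -1;
	memset(p, 0xAA, len);	/* parent-side "secret" */

	pid = fork();
	if (pid == 0)		/* child: the page should read back as zero */
		_exit((p[0] == 0 && p[len - 1] == 0) ? 0 : 1);
	if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
		result = (WEXITSTATUS(status) == 0) ? 1 : 0;

	munmap(p, len);
	return result;
}
```

The parent keeps its own contents; only the child's view is wiped.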

Anyway, that patch looks largely fine to me apart from that note, but
I do think you want to check it with the mm people on linux-mm.

> The selftest code is the largest part of it. There's no more syscall. I
> think it should be much more to your liking and seems like an alright
> set of compromises. Hopefully that's a bit closer to the mark.

From a "look through the patches" standpoint, this did look more
palatable to me, but I also would have had an easier time with looking
at the patches if the self-tests were separate commits.

              Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-06  2:56                                         ` Linus Torvalds
@ 2024-07-06 23:26                                           ` Jason A. Donenfeld
  0 siblings, 0 replies; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-06 23:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi Linus,

On Fri, Jul 05, 2024 at 07:56:03PM -0700, Linus Torvalds wrote:
> On Fri, 5 Jul 2024 at 19:10, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> >     https://git.zx2c4.com/linux-rng/log/
> 
> So we already expose VM_WIPEONFORK and VM_DONTDUMP using madvise().
> Exposing them at mmap creation time with MMAP_xyz sounds fine.
> 
> However, I do note that both the pre-existing VM_WIPEONFORK - and the
> new VM_DROPPABLE - need to be limited to anonymous private mappings
> only.
> 
> You did that for VM_DROPPABLE, but not for VM_WIPEONFORK.

Good catch, thanks. I'll look over all of that again closely too.

> Anyway, that patch looks largely fine to me apart from that note, but
> I do think you want to check it with the mm people on linux-mm.

They'll certainly be on the list of recipients for the v+1 series when I
post it (hopefully shortly).

> > The selftest code is the largest part of it. There's no more syscall. I
> > think it should be much more to your liking and seems like an alright
> > set of compromises. Hopefully that's a bit closer to the mark.
> 
> From a "look through the patches" standpoint, this did look more
> palatable to me, but I also would have had an easier time with looking
> at the patches if the self-tests were separate commits.

Okay, will do. I think you've got some selftest makefile fixes from
John/Shuah that'll be sent your way if they haven't already for 6.10
that I'll rebase on so that there isn't an annoying merge conflict.
https://lore.kernel.org/all/d99a1e3b-1893-4fac-bf05-bcb60ca7f89c@linuxfoundation.org/

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 18:57                 ` Jason A. Donenfeld
  2024-07-04 19:19                   ` Linus Torvalds
@ 2024-07-07 16:56                   ` Russell Haley
  1 sibling, 0 replies; 39+ messages in thread
From: Russell Haley @ 2024-07-07 16:56 UTC (permalink / raw)
  To: jason; +Cc: arnd, brauner, cgzones, jolsa, linux-kernel, mhiramat, torvalds

Since any PRNG will have the concept of re-seeding, I had to think
*really hard* to understand how a pseudo-generation number that really
means "reseed advised on change" could restrict future kernel
development, so for anyone else following along in the peanut gallery,
here's the scenario I came up with:

Suppose on some future CPU, RDRAND is improved to be essentially
perfect, with the same latency and throughput as a load from L1. So it
acts like a HWRNG, not a PRNG.  On such a CPU and with a command line
option that means "I 100% trust my CPU vendor," the kernel could
statically replace getrandom() with a function that just uses RDRAND,
and statically disable all the machinery for gathering entropy from
events and re-seeding the PRNG.

*Unless*, that is, userspace potentially needs to know when a
reseed-necessitating event has happened.

- Russell

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 18:18         ` Linus Torvalds
  2024-07-04 18:35           ` Linus Torvalds
@ 2024-07-04 18:36           ` Jason A. Donenfeld
  1 sibling, 0 replies; 39+ messages in thread
From: Jason A. Donenfeld @ 2024-07-04 18:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd

Hi Linus,

On Thu, Jul 4, 2024 at 8:18 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> > What do you want me to do here?
>
> You literally said "those users exist".
>
> Make them pipe up.
>
> Make them explain why what they have now doesn't work. What this
> solves. In real terms.
>
> Make them explain why that random "we duplicated the VM, and now we
> worry that mixing in TSC doesn't help" is an actual real-world
> concern, rather than something COMPLETELY MADE UP BY RANDOM NUMBER
> PEOPLE.
>
> See what my argument is? My argument is literally that theoretical
> random number people will make up arguments that aren't actually
> relevant in real life.

No, I don't think this is made up by random number nutsos. I believe
this is a real actual concern.

> Do real people migrate VMs? Hell yes they do. Do they care about the
> numbers being magically "stale" after said migration? I seriously
> doubt that.

Yes! They do!

>
> Do real people start multiple VMs from one single starting image?
> Again, hell yes they do.
>
> But do they start those multiple VMs from some random slapdash
> snapshot that they just picked without any concern and cannot just
> reseed in user space? And if they do, why should *WE* clean up after
> their mindbogglingly stupid setup?

Except userspace isn't really in a great position to do that. There's
no need to suggest that people proliferate these foot guns either.

> See what my argument is? I suspect _strongly_ that this is all
> completely over-engineered based on theoretical grounds that aren't
> actually practical grounds.
>
> And dammit, I'm asking for the practical grounds. For the actual users.
>
> And if you have trouble finding those, you just proved my point.

And I think what you're missing here is that these concerns come _from
actual users_. This *isn't* theoretical.

Look, I am not some "random number" nut job. I've worked very hard to
move the kernel's RNG far outside the realm of that world. And I'm not
looking for things to do or code to write or ways to occupy my time,
just 'cuz. I'm working on this because there's a real, tangible, need
for it. This has come out of countless recurring discussions with
folks at conferences and elsewhere. I am very much part of the world
where people are writing code that makes use of getrandom(), or would
like to make use of getrandom() but can't, and this pickle comes up
repeatedly. "Oh but we can't because of syscall speed, so we've got
this userspace thing, but it's not optimal, so we're just kind of
hoping for the best, but yea one of these days somebody should do
something..."

It's okay that people aren't having those discussions with you. That's
why I'm maintaining this thing and talking to folks and caring about
it and thinking carefully about it. And because people are having
these conversations with me, that's *also* why I am very sensitive to,
"is this guy a random number nut?" concerns, because lord I've met a
lot of them and they all have their little hang-ups. I don't want to
add code "just because we can." But I think this here will solve a
very real problem for very real users, and every time the fact that I'm
working on this comes up, there are real people with real concerns who
are glad to hear it's finally coming.

Alternatively, you can say, "well until they talk to me directly, no
way josé", and that'd be your prerogative, I guess. But that'd be
pretty darn disappointing.

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 17:55     ` Linus Torvalds
  2024-07-04 18:04       ` Jason A. Donenfeld
@ 2024-07-04 18:44       ` Willy Tarreau
  2024-07-05  7:01         ` Matthias Urlichs
  1 sibling, 1 reply; 39+ messages in thread
From: Willy Tarreau @ 2024-07-04 18:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A. Donenfeld, jolsa, mhiramat, cgzones, brauner,
	linux-kernel, arnd

On Thu, Jul 04, 2024 at 10:55:46AM -0700, Linus Torvalds wrote:
> A trivial google for "rdrand library" finds lots of hits for things
> that then use the AES-NI instructions to whiten things etc.

As a userland developer, I can say that dealing with external libs for
low-level stuff, which themselves sometimes even come with their own
set of dependencies, is always a pain. There must be compelling reasons
for adding dependencies. It's reinforced when you have to deal with
long term support on your software that goes beyond the lib's.

And having to go through instruction support detection and open-coding
all that stuff with runtime fallbacks for older CPUs is also a pain. Not
to mention the cases where you run in VMs where features are there but
not listed, or are presented but slowly emulated.

I'm using a lot of arch-specific code at build time, I'm often fine with
detecting -ENOSYS at run time to fall back to an older implementation of
a syscall, but I've not crossed the barrier of runtime CPU features
detection which adds further burden and further fragments bug reports
between users.
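
The -ENOSYS style of runtime fallback described above might look
something like this sketch. The function name fill_random() is invented
for the example, and /dev/urandom is just one plausible legacy path,
not necessarily the one any particular program uses.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fill buf with len random bytes, preferring the
 * getrandom() syscall and falling back to /dev/urandom when the running
 * kernel does not implement it. Returns 0 on success, -1 on error. */
static int fill_random(void *buf, size_t len)
{
	unsigned char *p = buf;
	int fd;

	while (len > 0) {
		long n = syscall(SYS_getrandom, p, len, 0);

		if (n > 0) {
			p += n;
			len -= (size_t)n;
		} else if (n < 0 && errno == EINTR) {
			continue;
		} else if (n < 0 && errno == ENOSYS) {
			break;		/* old kernel: use the legacy path */
		} else {
			return -1;
		}
	}
	if (len == 0)
		return 0;

	/* Legacy fallback for kernels without getrandom(). */
	fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	while (len > 0) {
		ssize_t n = read(fd, p, len);

		if (n > 0) {
			p += n;
			len -= (size_t)n;
		} else if (n < 0 && errno == EINTR) {
			continue;
		} else {
			close(fd);
			return -1;
		}
	}
	close(fd);
	return 0;
}
```

No CPU feature detection is involved; the only runtime probe is the
syscall's own error code.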

Regarding VM migration, my code is not concerned because I'm not aware
of users migrating such VMs. BUT I've had complaints in the past from
some users generating UUIDs for each forwarded request that they were
seeing duplicates in their logs due to the lack of thread safety in
random(), which made me work on an alternative. Thus I can easily
imagine that equivalent applications that just want to assign a unique
ID to an event that ends up in a log could, when they suffer a VM
migration, face a similar problem that is not easy to address in
userland.

In my opinion, abstracting the hardware is the role of the kernel. If
getrandom() is fast enough for my uses, why not. If it's not, I find
value in having a much faster proposal that offers the same API to all
applications without each having to reinvent the wheel. I can't judge
on the merits of vgetrandom() vs getrandom() though. But to give you an
idea, years ago for portability reasons (random() thread safety, multiple
OS support, performance), I ended up writing my own xoroshiro128 generator
to address multiple problems at once and I must confess I was a bit sad
to see that random number generation remains so poorly portable between
operating systems and their various versions, and that the work left to
be done for users is non-trivial.

I can imagine that users with higher expectations than mine would want
to adopt vgetrandom() when available.

Now would I replace my existing RNG with this new syscall when it gets
widely available ? Maybe, if it brings some value. It's easy enough to
deal with two code branches, one with the new, optimal syscall, and the
legacy generic fallback.

Hoping this matches the type of feedback you were looking for.

Willy

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 18:44       ` Willy Tarreau
@ 2024-07-05  7:01         ` Matthias Urlichs
  0 siblings, 0 replies; 39+ messages in thread
From: Matthias Urlichs @ 2024-07-05  7:01 UTC (permalink / raw)
  To: Willy Tarreau, Linus Torvalds
  Cc: Jason A. Donenfeld, jolsa, mhiramat, cgzones, brauner,
	linux-kernel, arnd


On 04.07.24 20:44, Willy Tarreau wrote:
> BUT I've had complaints in the past from
> some users generating UUIDs for each forwarded request that they were
> seeing duplicates in their logs due to the lack of thread safety in
> random(), which made me work on an alternative. Thus I can easily
> imagine that equivalent applications that just want to assign a unique
> ID to an event that ends up in a log could, when they suffer a VM
> migration, face a similar problem that is not easy to address in
> userland.

I'd like to second that.

I sometimes need to duplicate a running VM, mostly in order to debug 
stuff. Now both VMs run the same code with the same pseudo-RNG, 
generating the same message IDs when they log something. I've seen 
rejects on logs from the real VM because the dupe got there first.

Owch.

A userspace RNG with a zapped VM_DROPPABLE page that re-initializes 
itself from the kernel RNG would solve this problem (and others).

Thus a reasonable implementation seems to be

* implement VM_DROPPABLE (which I'd like to use for userspace caching 
anyway)

* teach VM cloners, task migrators and whatnot not to copy pages marked thus

* add a RNG generation counter to the VDSO

* teach libc's getrandom() to use these

Yes this doesn't use the exact same implementation of random.c that's in 
the kernel, but frankly I don't care about that.
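
The "zapped page re-initializes itself" idea can be sketched as below.
This is an assumption-laden illustration: it uses an ordinary anonymous
private mapping as a stand-in for a VM_DROPPABLE one, the struct and
function names are invented, and a real implementation would also have
to cope with the page being dropped in the middle of a read.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/random.h>
#include <sys/types.h>

/* Hypothetical state layout. A dropped (zero-filled) page reads back
 * with ready == 0, which is indistinguishable from "never seeded". */
struct rng_state {
	unsigned char ready;		/* nonzero once pool[] is filled */
	size_t pos;			/* next unused byte in pool[] */
	unsigned char pool[256];	/* bytes fetched from the kernel */
};

/* Stand-in mapping: plain anonymous private memory, zero-filled. */
static struct rng_state *rng_state_map(void)
{
	void *p = mmap(NULL, sizeof(struct rng_state),
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Refill the pool from the kernel RNG and mark the state usable. */
static int rng_refill(struct rng_state *st)
{
	size_t got = 0;

	while (got < sizeof st->pool) {
		ssize_t n = getrandom(st->pool + got,
				      sizeof st->pool - got, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		got += (size_t)n;
	}
	st->pos = 0;
	st->ready = 1;
	return 0;
}

/* Hand out len bytes, reseeding whenever the state looks wiped or the
 * pool is exhausted. Returns 0 on success, -1 on error. */
static int rng_bytes(struct rng_state *st, void *out, size_t len)
{
	unsigned char *dst = out;

	while (len > 0) {
		size_t n;

		if (!st->ready || st->pos == sizeof st->pool) {
			if (rng_refill(st) != 0)
				return -1;
		}
		n = sizeof st->pool - st->pos;
		if (n > len)
			n = len;
		memcpy(dst, st->pool + st->pos, n);
		memset(st->pool + st->pos, 0, n); /* never reuse bytes */
		st->pos += n;
		dst += n;
		len -= n;
	}
	return 0;
}
```

Zeroing the whole struct by hand simulates the page being dropped; the
next rng_bytes() call then reseeds before returning anything.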

-- 
-- regards
-- 
-- Matthias Urlichs


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-04 17:21 ` Linus Torvalds
  2024-07-04 17:33   ` Linus Torvalds
  2024-07-04 17:46   ` Jason A. Donenfeld
@ 2024-07-06  1:14   ` Mathieu Desnoyers
  2024-07-06 10:01     ` Florian Weimer
  2 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2024-07-06  1:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason A. Donenfeld, jolsa, mhiramat, cgzones, brauner,
	linux-kernel, arnd, Adhemerval Zanella Netto, Zack Weinberg,
	Cristian Rodríguez, Florian Weimer, Wilco Dijkstra

On 04-Jul-2024 10:21:34 AM, Linus Torvalds wrote:
> On Thu, 4 Jul 2024 at 10:10, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > The three of us all have new syscalls planned for 6.11. Arnd suggested
> > that we coordinate to deconflict, to make the merge easier.
> 
> Nobody has explained to me what has changed since your last vdso
> getrandom, and I'm not planning on pulling it unless that fundamental
> flaw is fixed.
> 
> Why is this _so_ critical that it needs a vdso?
> 
> Why isn't user space just doing it itself?
> 
> What's so magical about this all?
> 
> This all seems entirely pointless to me still, because it's optimizing
> something that nobody seems to care about, adding new VM
> infrastructure, new magic system calls, yadda yadda.
> 
> I was very sceptical last time, and absolutely _nothing_ has changed.
> Not a peep on why it's now suddenly so hugely important again.
> 
> We don't add stuff "just because we can". We need to have a damn good
> reason for it. And I still don't see the reason, and I haven't seen
> anybody even trying to explain the reason.

[ Note: as I wrote down this email, I notice that you are heading
  towards the same conclusions I'm reaching on other sub-threads of this
  discussion. But I'm providing this feedback because it adds relevant
  information based on earlier discussions with libc developers. ]

In March of this year, I jumped into the discussion on the
libc-alpha mailing list to understand the userspace RNG seeding
requirements better. The interesting bits that explain how the kernel
can play an important role start here:

https://sourceware.org/pipermail/libc-alpha/2024-March/155534.html

From an absolutely-not-security-expert perspective, here is how I see
the desiderata breakdown:

- There appears to be a need to make sure the random seed is not exposed
  across fork, core dump and other similar scenarios. This can be
  achieved by simply letting userspace use the appropriate madvise(2)
  advices on a memory mapping created through mmap(2). I don't see why
  there would be any need to create any RNG-centric ABI for this. If
  new madvise(2) advices are needed, they can simply be added there.

- There appears to be interest in having a RNG faster than a system call
  for various reasons I'm not familiar with. A vDSO appears to be one
  way to do this. Another way would be to let userspace implement it
  all, which raises the following question: what is the minimal state
  known only by the kernel currently unknown from userspace ? This
  brings the following point.

- Based on the libc-alpha discussion, I understand that the main thing
  the kernel knows about which is unknown from userspace is a sort-of
  generation counter, which tracks for instance the fact that the kernel
  was migrated to a different VM, or suspended and then resumed, and
  hence the current seed should be discarded and re-seeded entirely.
  I suspect that is the _key_ information that is currently missing from
  a purely userspace RNG perspective today. I hinted at extending the
  rseq(2) ABI for that purpose: exposing a generation counter for the
  RNG in a thread area shared between kernel and user-space. The
  per-thread area is already there and the hard work of integrating it
  with libc is mostly complete. Another alternative would be, as you
  hint elsewhere in this thread
(https://lore.kernel.org/lkml/CAHk-=wgqD9h0Eb-n94ZEuK9SugnkczXvX497X=OdACVEhsw5xQ@mail.gmail.com/)
  to create a vDSO to expose exactly this kind of generation counter.
  Given this is not a thread-specific thing, it might be a better
  approach than the rseq per-thread area.
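
A minimal sketch of the generation-counter check from the last point
follows. The counter here is a plain process-local variable standing in
for whatever the kernel would export (via rseq or a vDSO data page);
that variable, the struct, and the function names are all hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/* HYPOTHETICAL stand-in for a kernel-exported generation counter. It
 * starts at 1 so that 0 can mean "never seeded" in struct urng. */
static _Atomic unsigned long fake_kernel_generation = 1;

struct urng {
	unsigned long seeded_generation; /* generation the seed belongs to */
	unsigned long reseed_count;	 /* for illustration only */
	unsigned char seed[32];
};

/* Fetch a fresh seed from the kernel and record its generation. */
static int urng_reseed(struct urng *r, unsigned long gen)
{
	size_t got = 0;

	while (got < sizeof r->seed) {
		ssize_t n = getrandom(r->seed + got,
				      sizeof r->seed - got, 0);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		got += (size_t)n;
	}
	r->seeded_generation = gen;
	r->reseed_count++;
	return 0;
}

/* Call before producing output: reseeds only when the counter moved
 * (VM fork, resume, ...). Returns 0 on success, -1 on error. */
static int urng_check(struct urng *r)
{
	unsigned long gen = atomic_load_explicit(&fake_kernel_generation,
						 memory_order_acquire);

	if (r->seeded_generation == gen)
		return 0;
	return urng_reseed(r, gen);
}
```

A careful implementation would probably re-read the counter after
generating output as well, and discard the output if it changed in
between.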

So either I'm missing something important (please enlighten me), or we
could achieve all those end-goals with a small fraction of the ABI
complexity introduced by the vDSO as it is initially proposed.

I don't think that just because there happen to be bad userspace RNG
implementations out there we should give up on userspace and maintain
all this complexity in the kernel. This is just working around userspace
ecosystem issues by moving the implementation and maintenance burden
into the kernel.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-06  1:14   ` Mathieu Desnoyers
@ 2024-07-06 10:01     ` Florian Weimer
  2024-07-06 14:34       ` Zack Weinberg
  0 siblings, 1 reply; 39+ messages in thread
From: Florian Weimer @ 2024-07-06 10:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Jason A. Donenfeld, jolsa, mhiramat, cgzones,
	brauner, linux-kernel, arnd, Adhemerval Zanella Netto,
	Zack Weinberg, Cristian Rodríguez, Wilco Dijkstra

* Mathieu Desnoyers:

> From an absolutely-not-security-expert perspective, here is how I see
> the desiderata breakdown:
>
> - There appears to be a need to make sure the random seed is not exposed
>   across fork, core dump and other similar scenarios. This can be
>   achieved by simply letting userspace use the appropriate madvise(2)
>   advices on a memory mapping created through mmap(2). I don't see why
>   there would be any need to create any RNG-centric ABI for this. If
>   new madvise(2) advices are needed, they can simply be added there.

I don't think there's consensus about protecting coredumps and VM-level
forks (migration where multiple clones continue executing).

Personally, I'm not convinced either that it's sufficient to protect
just the RNG from VM-level forks if nonce-reliant ciphers are involved.
It needs careful consideration how these ciphers are used, and I'm not
sure that VM-level fork protection for the RNG itself is even a critical
part of that.  (The ciphers are still deterministic, and the forks will
compute the same result if the operations are ordered correctly,
resulting in no information leak.  Anyway, I don't understand why
cryptographers prefer algorithms where nonces are so critical to avoid
long-term key leaks.)

> - There appears to be interest in having a RNG faster than a system call
>   for various reasons I'm not familiar with. A vDSO appears to be one
>   way to do this. Another way would be to let userspace implement it
>   all, which raises the following question: what is the minimal state
>   known only by the kernel currently unknown from userspace ? This
>   brings the following point.

The history here is that we had a reasonably fast userspace
implementation that could deal with the process fork case (which is
much easier within glibc).  It could not deal with VM-level forks.  The
goal was to provide something that is unpredictable in practice and
about as fast as random() (or even rand()), so that programmers could
just use arc4random() if they do not need a reproducible sequence and
not worry about performance.  We removed this implementation from glibc
and replaced it with something that makes a system call on every
arc4random call.  The promise at the time was that we'll soon get a vDSO
call to accelerate this, without the need for some sort of stream cipher
in glibc.  That hasn't happened so far.

Meanwhile, it's been reported that if chrony uses arc4random from glibc,
NTP server performance drops by 25%:

  Bug 29437 - arc4random is too slow
  <https://sourceware.org/bugzilla/show_bug.cgi?id=29437>.

Obviously, we need to fix this eventually.

The arc4random implementation in glibc was never intended to displace
randomness generation for cryptographic purposes.  And it doesn't have
to: none of the major cryptographic libraries will give up their RNG in
favor of glibc's, so if you are doing cryptography, you already have a
RNG recommended by the cryptographers that is ready to use.  The
arc4random implementation had a different use case, replacing random()
and rand() calls, but it was somehow repurposed.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-06 10:01     ` Florian Weimer
@ 2024-07-06 14:34       ` Zack Weinberg
  2024-07-06 15:30         ` Florian Weimer
  0 siblings, 1 reply; 39+ messages in thread
From: Zack Weinberg @ 2024-07-06 14:34 UTC (permalink / raw)
  To: Florian Weimer, Mathieu Desnoyers
  Cc: Linus Torvalds, Jason A. Donenfeld, jolsa, mhiramat, cgzones,
	brauner, linux-kernel, Arnd Bergmann, Adhemerval Zanella,
	Cristian Rodríguez, Wilco Dijkstra

Without commenting on the rest of this...

On Sat, Jul 6, 2024, at 6:01 AM, Florian Weimer wrote:
> The arc4random implementation in glibc was never intended to displace
> randomness generation for cryptographic purposes.

...arc4random on the BSDs (particularly on OpenBSD) *is* intended to be
suitable for cryptographic purposes, and, simultaneously, intended to be
fast enough that user space programs should never hesitate to use it.
Therefore, Linux+glibc needs to be prepared for user space programs to
use it that way -- expecting both speed and cryptographic strength --
or else we shouldn't have added it at all.

zw

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-06 14:34       ` Zack Weinberg
@ 2024-07-06 15:30         ` Florian Weimer
  2024-07-07 20:57           ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 39+ messages in thread
From: Florian Weimer @ 2024-07-06 15:30 UTC (permalink / raw)
  To: Zack Weinberg
  Cc: Mathieu Desnoyers, Linus Torvalds, Jason A. Donenfeld, jolsa,
	mhiramat, cgzones, brauner, linux-kernel, Arnd Bergmann,
	Adhemerval Zanella, Cristian Rodríguez, Wilco Dijkstra

* Zack Weinberg:

> Without commenting on the rest of this...
>
> On Sat, Jul 6, 2024, at 6:01 AM, Florian Weimer wrote:
>> The arc4random implementation in glibc was never intended to displace
>> randomness generation for cryptographic purposes.
>
> ...arc4random on the BSDs (particularly on OpenBSD) *is* intended to be
> suitable for cryptographic purposes, and, simultaneously, intended to be
> fast enough that user space programs should never hesitate to use it.
> Therefore, Linux+glibc needs to be prepared for user space programs to
> use it that way -- expecting both speed and cryptographic strength --
> or else we shouldn't have added it at all.

None of the major cryptographic libraries (OpenSSL, NSS, nettle,
libgcrypt, OpenJDK, Go, GNUTLS) use arc4random in their upstream
version.  If the BSDs use arc4random rather than the bundled generators,
they must have downstream-only patches.  I also don't see why someone
writing a new library from scratch would use arc4random because its
addition to glibc is still quite recent, and it provides no performance
advantage over going to the kernel interfaces directly.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: deconflicting new syscall numbers for 6.11
  2024-07-06 15:30         ` Florian Weimer
@ 2024-07-07 20:57           ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 39+ messages in thread
From: Adhemerval Zanella Netto @ 2024-07-07 20:57 UTC (permalink / raw)
  To: Florian Weimer, Zack Weinberg
  Cc: Mathieu Desnoyers, Linus Torvalds, Jason A. Donenfeld, jolsa,
	mhiramat, cgzones, brauner, linux-kernel, Arnd Bergmann,
	Cristian Rodríguez, Wilco Dijkstra

On 06/07/24 12:30, Florian Weimer wrote:
> * Zack Weinberg:
> 
>> Without commenting on the rest of this...
>>
>> On Sat, Jul 6, 2024, at 6:01 AM, Florian Weimer wrote:
>>> The arc4random implementation in glibc was never intended to displace
>>> randomness generation for cryptographic purposes.
>>
>> ...arc4random on the BSDs (particularly on OpenBSD) *is* intended to be
>> suitable for cryptographic purposes, and, simultaneously, intended to be
>> fast enough that user space programs should never hesitate to use it.
>> Therefore, Linux+glibc needs to be prepared for user space programs to
>> use it that way -- expecting both speed and cryptographic strength --
>> or else we shouldn't have added it at all.
> 
> None of the major cryptographic libraries (OpenSSL, NSS, nettle,
> libgcrypt, OpenJDK, Go, GNUTLS) use arc4random in their upstream
> version.  If the BSDs use arc4random rather than the bundled generators,
> they must have downstream-only patches.  I also don't see why someone
> writing a new library from scratch would use arc4random because its
> addition to glibc is still quite recent, and it provides no performance
> advantage over going to the kernel interfaces directly.

The BSDs seem to use it extensively, especially in the base system for
tools like smtpd/relayd/etc., as an alternative to rand/random and to
avoid pulling in an RNG from a cryptographic library. But I agree that
for glibc, arc4random being just a shim over getrandom is only helpful
as a way to avoid a biased implementation of arc4random_uniform (which
is quite common if you check on the internet about it...).
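
To make the bias point concrete, here is a sketch of an unbiased
bounded draw using rejection sampling on top of getrandom(). It follows
the well-known "reject values below 2^32 mod bound" approach; the
function names are made up, and this is not claimed to be glibc's
actual code.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/random.h>
#include <sys/types.h>

/* Fetch one uniformly random 32-bit value from the kernel. */
static int random_u32(uint32_t *out)
{
	for (;;) {
		ssize_t n = getrandom(out, sizeof *out, 0);

		if (n == (ssize_t)sizeof *out)
			return 0;
		if (n < 0 && errno != EINTR)
			return -1;
	}
}

/* Store a uniformly distributed value in [0, bound) in *out.
 * A plain `r % bound` over-represents small results whenever 2^32 is
 * not a multiple of bound; rejecting the first (2^32 mod bound) values
 * of r removes that bias. Returns 0 on success, -1 on error. */
static int uniform_u32(uint32_t bound, uint32_t *out)
{
	uint32_t min, r;

	if (bound < 2) {
		*out = 0;
		return 0;
	}
	min = (uint32_t)(0u - bound) % bound;	/* == 2^32 mod bound */
	do {
		if (random_u32(&r) != 0)
			return -1;
	} while (r < min);

	*out = r % bound;
	return 0;
}
```

The loop rejects fewer than half of all draws in the worst case, so it
terminates quickly in practice.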

Also, this vDSO proposal, and the way it is now up to the kernel to
manage the RNG state, would add some extra considerations for the libc
getrandom implementation.  The libc symbol is currently fully
async-signal-safe and thread-safe because it is just a syscall wrapper,
and to sanely manage the vDSO buffer as it is designed (allocated either
by vgetrandom_alloc or by mmap), the runtime will need a way to allocate
and manage these per-thread states with a block allocator (assuming the
runtime would like to keep a per-thread state).

For arc4random, the libbsd way or the old way glibc used to do it (prior
to Jason's refactor) would be simple because it was never intended to be
async-signal-safe.  But for getrandom it would require either an
async-signal-safe malloc implementation (to keep track of the extra
states) or a block allocation over mmap (which adds some extra memory
usage). So getrandom will now potentially use 2 more pages, which is
not the end of the world since the interface is designed to allow
failure, but it is something to consider.
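
As a rough sketch of the "block allocation over mmap" option: one
region is mapped up front and carved into fixed-size slots, and slots
are claimed and released with a lock-free bitmap so that no malloc or
lock is needed afterwards. STATE_SIZE, NSLOTS and all names are
assumptions for illustration (the real per-thread state size would come
from the kernel), and __builtin_ctz is a GCC/Clang builtin.

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define STATE_SIZE 256	/* hypothetical per-thread state size */
#define NSLOTS     32	/* one bit per slot in a 32-bit bitmap */

struct state_pool {
	unsigned char *base;	/* start of the mmap'd region */
	_Atomic uint32_t used;	/* bit i set means slot i is taken */
};

/* Map the backing region once, e.g. at library initialisation. */
static int pool_init(struct state_pool *p)
{
	void *m = mmap(NULL, (size_t)STATE_SIZE * NSLOTS,
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (m == MAP_FAILED)
		return -1;
	p->base = m;
	atomic_init(&p->used, 0);
	return 0;
}

/* Claim the lowest free slot using only atomic operations.
 * Returns NULL when all slots are in use. */
static void *pool_get(struct state_pool *p)
{
	uint32_t cur = atomic_load(&p->used);

	for (;;) {
		uint32_t bit;

		if (cur == UINT32_MAX)
			return NULL;
		bit = ~cur & (cur + 1);	/* lowest clear bit of cur */
		/* On failure, cur is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&p->used, &cur, cur | bit))
			return p->base +
			       (size_t)STATE_SIZE * (size_t)__builtin_ctz(bit);
	}
}

/* Release a slot previously returned by pool_get(). */
static void pool_put(struct state_pool *p, void *slot)
{
	size_t idx = (size_t)((unsigned char *)slot - p->base) / STATE_SIZE;

	atomic_fetch_and(&p->used, ~(UINT32_C(1) << idx));
}
```

A caller that gets NULL back can simply fall back to the plain syscall,
which fits the point above that the interface is allowed to fail.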

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2024-07-07 20:57 UTC | newest]

Thread overview: 39+ messages
2024-07-04 17:10 deconflicting new syscall numbers for 6.11 Jason A. Donenfeld
2024-07-04 17:21 ` Linus Torvalds
2024-07-04 17:33   ` Linus Torvalds
2024-07-04 17:47     ` Linus Torvalds
2024-07-04 17:51       ` Jason A. Donenfeld
2024-07-04 17:46   ` Jason A. Donenfeld
2024-07-04 17:55     ` Linus Torvalds
2024-07-04 18:04       ` Jason A. Donenfeld
2024-07-04 18:18         ` Linus Torvalds
2024-07-04 18:35           ` Linus Torvalds
2024-07-04 18:46             ` Jason A. Donenfeld
2024-07-04 18:52               ` Linus Torvalds
2024-07-04 18:57                 ` Jason A. Donenfeld
2024-07-04 19:19                   ` Linus Torvalds
2024-07-04 21:07                     ` Linus Torvalds
2024-07-04 21:44                       ` Arnd Bergmann
2024-07-04 22:07                         ` Linus Torvalds
2024-07-05  8:32                           ` Arnd Bergmann
2024-07-05 16:59                             ` Linus Torvalds
2024-07-05 16:18                       ` Jason A. Donenfeld
2024-07-05 17:39                         ` Linus Torvalds
2024-07-05 17:53                           ` Jason A. Donenfeld
2024-07-05 18:08                             ` Linus Torvalds
2024-07-05 18:56                               ` Jason A. Donenfeld
2024-07-05 19:21                                 ` Linus Torvalds
2024-07-05 19:46                                   ` Linus Torvalds
2024-07-06  0:11                                     ` Jason A. Donenfeld
2024-07-06  2:10                                       ` Jason A. Donenfeld
2024-07-06  2:56                                         ` Linus Torvalds
2024-07-06 23:26                                           ` Jason A. Donenfeld
2024-07-07 16:56                   ` Russell Haley
2024-07-04 18:36           ` Jason A. Donenfeld
2024-07-04 18:44       ` Willy Tarreau
2024-07-05  7:01         ` Matthias Urlichs
2024-07-06  1:14   ` Mathieu Desnoyers
2024-07-06 10:01     ` Florian Weimer
2024-07-06 14:34       ` Zack Weinberg
2024-07-06 15:30         ` Florian Weimer
2024-07-07 20:57           ` Adhemerval Zanella Netto
