* deconflicting new syscall numbers for 6.11 @ 2024-07-04 17:10 Jason A. Donenfeld 2024-07-04 17:21 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-04 17:10 UTC (permalink / raw) To: jolsa, mhiramat, cgzones, brauner; +Cc: linux-kernel, torvalds, arnd Hi Christian, Jiri, The three of us all have new syscalls planned for 6.11. Arnd suggested that we coordinate to deconflict, to make the merge easier. Would you mind if I take 463? Maybe Jiri can take 464? And then Christian can take 465-onward for those several syscalls? Does that work? Alternatively, we can all take 463 and let Linus work it out when merging. I don't know what the norm is or what he'd prefer. Regards, Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 17:10 deconflicting new syscall numbers for 6.11 Jason A. Donenfeld @ 2024-07-04 17:21 ` Linus Torvalds 2024-07-04 17:33 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 39+ messages in thread From: Linus Torvalds @ 2024-07-04 17:21 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Thu, 4 Jul 2024 at 10:10, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > The three of us all have new syscalls planned for 6.11. Arnd suggested > that we coordinate to deconflict, to make the merge easier. Nobody has explained to me what has changed since your last vdso getrandom, and I'm not planning on pulling it unless that fundamental flaw is fixed. Why is this _so_ critical that it needs a vdso? Why isn't user space just doing it itself? What's so magical about this all? This all seems entirely pointless to me still, because it's optimizing something that nobody seems to care about, adding new VM infrastructure, new magic system calls, yadda yadda. I was very sceptical last time, and absolutely _nothing_ has changed. Not a peep on why it's now suddenly so hugely important again. We don't add stuff "just because we can". We need to have a damn good reason for it. And I still don't see the reason, and I haven't seen anybody even trying to explain the reason. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 17:21 ` Linus Torvalds @ 2024-07-04 17:33 ` Linus Torvalds 2024-07-04 17:47 ` Linus Torvalds 2024-07-04 17:46 ` Jason A. Donenfeld 2024-07-06 1:14 ` Mathieu Desnoyers 2 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-04 17:33 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Thu, 4 Jul 2024 at 10:21, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > We don't add stuff "just because we can". We need to have a damn good > reason for it. And I still don't see the reason, and I haven't seen > anybody even trying to explain the reason. IOW, I want to see actual *users* piping up and saying "this is a problem, here's my real load that spends 10% of time on getrandom(), and this fixes it". I'm not AT ALL interested in microbenchmarks or theoretical "if users need high-performance random numbers". I need a real actual live user that says "I can't just use rdrand and my own chacha mixing on top" and explains why having a SSE2 chachacha in kernel code exposed as a vdso is so critical, and a magical buffer maintained by the kernel. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 17:33 ` Linus Torvalds @ 2024-07-04 17:47 ` Linus Torvalds 2024-07-04 17:51 ` Jason A. Donenfeld 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-04 17:47 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Thu, 4 Jul 2024 at 10:33, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > I need a real actual live user that says "I can't just use rdrand and > my own chacha mixing on top" and explains why having a SSE2 chachacha > in kernel code exposed as a vdso is so critical, and a magical buffer > maintained by the kernel. One final note: the reason I'm so negative about this all is that the random number subsystem has such an absolutely _horrendous_ history of two main conflicting issues: people wanting reasonable usable random numbers on one side, and then the people that discuss what the word "entropy" means on the other side. And honestly, I don't want the kernel stuck even *more* in the middle of that morass. I strongly suspect that one reason why glibc people would want this is the exact same reason: _they_ don't want to be stuck in the same padded room with the crazies _either_, so they love the concept of "somebody else's problem". So no. I do not think "libc people want this" is an argument at all for the kernel doing it. Quite the reverse. It's a "pass the hot potato" thing. Which is why I really really want those real users standing up and saying "we can't use rdrand and rdtsc and our own mixing". Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 17:47 ` Linus Torvalds @ 2024-07-04 17:51 ` Jason A. Donenfeld 0 siblings, 0 replies; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-04 17:51 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi Linus, On Thu, Jul 4, 2024 at 7:47 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > One final note: the reason I'm so negative about this all is that the > random number subsystem has such an absolutely _horrendous_ history of > two main conflicting issues: people wanting reasonable usable random > numbers on one side, and then the people that discuss what the word > "entropy" means on the other side. Yes. My entire goal since the beginning has been to clean up the filth and insanity that's emerged from this. And there's a real userspace side of filth too that's not going to be solved without this. > And honestly, I don't want the kernel stuck even *more* in the middle > of that morass. Certainly I am not bringing us anywhere near that morass. I'm the one who's been diligently trying to dig us out of it! > I strongly suspect that one reason why glibc people > would want this is the exact same reason: _they_ don't want to be > stuck in the same padded room with the crazies _either_, so they love > the concept of "somebody else's problem". On the contrary, the glibc people were busy doing something grotesque and incomplete, when I paused things so that I could do it properly where it belongs. > potato" thing. Which is why I really really want those real users > standing up and saying "we can't use rdrand and rdtsc and our own > mixing". The point is that the people trying to "use rdrand and rdtsc and our own mixing" are in for a world of pain, will come to a solution that isn't complete and will fall over catastrophically in some circumstances, and proliferates the problem. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 17:21 ` Linus Torvalds 2024-07-04 17:33 ` Linus Torvalds @ 2024-07-04 17:46 ` Jason A. Donenfeld 2024-07-04 17:55 ` Linus Torvalds 2024-07-06 1:14 ` Mathieu Desnoyers 2 siblings, 1 reply; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-04 17:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi Linus, On Thu, Jul 04, 2024 at 10:21:34AM -0700, Linus Torvalds wrote: > On Thu, 4 Jul 2024 at 10:10, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > > > The three of us all have new syscalls planned for 6.11. Arnd suggested > > that we coordinate to deconflict, to make the merge easier. > > Nobody has explained to me what has changed since your last vdso > getrandom, and I'm not planning on pulling it unless that fundamental > flaw is fixed. Oh. That's an unpleasant surprise. I've been hard at work on bringing everything up to snuff. That's pretty much been my sole focus. Changes since the last time I worked on this are explained at length at the top of this: https://lore.kernel.org/lkml/20240703183115.1075219-1-Jason@zx2c4.com/ The big issue before was that the mm additions were too insane. I've pared those down and made them really minimal. Then the mm people piped up and it became even more minimal. Now I think it's pretty alright. But I think, perhaps evidently barring you, the use case of this in the first place and need for it is well understood and appreciated at large by now. So to answer that, > Why is this _so_ critical that it needs a vdso? > > Why isn't user space just doing it itself? > > What's so magical about this all? > > This all seems entirely pointless to me still, because it's optimizing > something that nobody seems to care about > > IOW, I want to see actual *users* piping up and saying "this is a > problem, here's my real load that spends 10% of time on getrandom(), > and this fixes it".
> > I'm not AT ALL interested in microbenchmarks or theoretical "if users > need high-performance random numbers". > > I need a real actual live user that says "I can't just use rdrand and > my own chacha mixing on top" and explains why having a SSE2 chachacha > in kernel code exposed as a vdso is so critical, and a magical buffer > maintained by the kernel. As far as speed goes, there are many legitimate applications that cannot make a syscall every time. TLS nonces and keys come to mind as a huge one. "Make getrandom() fast enough that the TLS library can use it" is something that's come up over and over. There's now also arc4random() in glibc, whose addition is what sparked this whole patchset two years ago. That's not a micro benchmark thing either. I too don't really care for microbenchmarks with the random driver. But I do want it to be actually useable, so that people use it, because it is the best facility for the task. With regards to why VDSO, the cover letter lays that out in detail. Userspace does not have access to the information in a timely manner that the kernel does, and the particulars of the kernel's accounting are bound to change, especially as all this matures with VMs. The RNG in the vDSO needs to be tightly coupled with the RNG in the kernel; these are part of the same thing. Anyway, those actual users exist, and the partial solutions and hacks required to workaround this shortcoming are kind of grotesque and in one way or another bad. This isn't theoretical. I'm not working on this for "fun". Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 17:46 ` Jason A. Donenfeld @ 2024-07-04 17:55 ` Linus Torvalds 2024-07-04 18:04 ` Jason A. Donenfeld 2024-07-04 18:44 ` Willy Tarreau 0 siblings, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2024-07-04 17:55 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Thu, 4 Jul 2024 at 10:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > As far as speed goes, there are many legitimate applications that cannot > make a syscall every time. This is not an argument. Nobody suggested a system call each time. What I talked about, and suggested, was rdrand and user-space mixing. The system call would be an "initialize the pool" thing with possibly some re-seeding occasionally. > Anyway, those actual users exist, and the partial solutions and hacks > required to workaround this shortcoming are kind of grotesque and in one > way or another bad. This isn't theoretical. I'm not working on this for > "fun". Once again: I don't want to hear "users exist". I want to hear *from* those users. Because I would have expected all those users to already have perfectly working setups in place already. A trivial google for "rdrand library" finds lots of hits for things that then use the AES-NI instructions to whiten things etc. And several of them mention OS X and Windows in addition to Linux. So those things are at least partly portable. And no, I'm *NOT* interested in catering to the crazies that say "we can't mix in the TSC values and do rdrand, because we don't trust those". That's literally the kind of people I want to avoid like the plague, and WHY I don't want more of this in the kernel. Because sane users don't say that. Sane users say "every round, we mix in the TSC, and every X rounds we do rdrand, and every 100*X rounds we do rdseed, and that means that the end result is not really predictable even if you've started from the same virtual machine image".
And sane users presumably ALREADY HAVE THIS. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
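The tiered schedule described in the message above (TSC every round, RDRAND every X rounds, RDSEED every 100*X rounds) can be sketched as a small selection function. This is only an illustration of the schedule, not code from the thread: the `SRC_*` flags and `sources_for_round()` are hypothetical names, and the actual reading of TSC/RDRAND/RDSEED and the mixing step are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags naming the three sources mentioned in the message. */
enum {
    SRC_TSC    = 1u << 0,  /* timestamp counter, mixed in every round */
    SRC_RDRAND = 1u << 1,  /* mixed in every X rounds */
    SRC_RDSEED = 1u << 2,  /* mixed in every 100*X rounds */
};

/*
 * Return the set of sources to mix in on a given round.
 * `x` is the "X" from the message; it is an assumption of this sketch
 * that the caller picks it (the thread does not give a value).
 */
static unsigned sources_for_round(uint64_t round, uint64_t x)
{
    /* x must be non-zero and small enough that 100 * x cannot overflow. */
    assert(x != 0 && x <= UINT64_MAX / 100);

    unsigned sources = SRC_TSC;
    if (round % x == 0)
        sources |= SRC_RDRAND;
    if (round % (100 * x) == 0)
        sources |= SRC_RDSEED;
    return sources;
}
```

A caller would invoke this once per output round and feed whichever sources are selected into its own mixing function; round 0 selects all three, which doubles as initial seeding.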
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 17:55 ` Linus Torvalds @ 2024-07-04 18:04 ` Jason A. Donenfeld 2024-07-04 18:18 ` Linus Torvalds 2024-07-04 18:44 ` Willy Tarreau 1 sibling, 1 reply; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-04 18:04 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi Linus, On Thu, Jul 4, 2024 at 7:56 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, 4 Jul 2024 at 10:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > > > As far as speed goes, there are many legitimate applications that cannot > > make a syscall every time. > > This is not an argument. > > Nobody suggested a system call each time. Well, that's currently the only way to get random numbers that are sure to be fresh and not, for example, cloned or resumed in a VM. > What I talked about, and suggested, was rdrand and user-space mixing. > The system call would be a "initialize the pool" thing with possibly > some re-seeding occasionally. And this does not work well at all. The question is "when to reseed?" and only the kernel is in a position to reliably know this in a race-free manner. > > Anyway, those actual users exist, and the partial solutions and hacks > > required to workaround this shortcoming are kind of grotesque and in one > > way or another bad. This isn't theoretical. I'm not working on this for > > "fun". > > Once again: I don't want to hear "users exist". > > I want to hear *from* those users. Because I would have expected all > those users to already have perfectly working setups in place already. What do you want me to do here? Every time somebody talks to me about this, tell them, "hey would you talk to Linus about this?" and then, "omg you want me to send Linus an email?!" Library authors wish they could call getrandom() for their needs, yet they cannot, and are forced to invent incomplete solutions. 
Coupling kernel RNG semantics to userspace RNG semantics is not even a new idea; Microsoft heard from their customers, for example, and made things work. (Maybe hearing "Microsoft ..." will turn you off even more? I don't know. This solution isn't like theirs and is nicer, but it stems from the same need.) Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 18:04 ` Jason A. Donenfeld @ 2024-07-04 18:18 ` Linus Torvalds 2024-07-04 18:35 ` Linus Torvalds 2024-07-04 18:36 ` Jason A. Donenfeld 0 siblings, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2024-07-04 18:18 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Thu, 4 Jul 2024 at 11:04, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > > > > I want to hear *from* those users. Because I would have expected all > > those users to already have perfectly working setups in place already. > > What do you want me to do here? You literally said "those users exist". Make them pipe up. Make them explain why what they don't have now doesn't work. What this solves. In real terms. Make them explain why that random "we duplicated the VM, and now we worry that mixing in TSC doesn't help" is an actual real-world concern, rather than something COMPLETELY MADE UP BY RANDOM NUMBER PEOPLE. See what my argument is? My argument is literally that theoretical random number people will make up arguments that aren't actually relevant in real life. Do real people migrate VMs? Hell yes they do. Do they care about the numbers being magically "stale" after said migration? I seriously doubt that. Do real people start multiple VMs from one single starting image? Again, hell yes they do. But do they start those multiple VMs from some random slapdash snapshot that they just picked without any concern and cannot just reseed in user space? And if they do, why should *WE* clean up after their mindbogglingly stupid setup? See what my argument is? I suspect _strongly_ that this is all completely over-engineered based on theoretical grounds that aren't actually practical grounds. And dammit, I'm asking for the practical grounds. For the actual users. And if you have trouble finding those, you just proved my point. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 18:18 ` Linus Torvalds @ 2024-07-04 18:35 ` Linus Torvalds 2024-07-04 18:46 ` Jason A. Donenfeld 2024-07-04 18:36 ` Jason A. Donenfeld 1 sibling, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-04 18:35 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Thu, 4 Jul 2024 at 11:18, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > See what my argument is? I suspect _strongly_ that this is all > completely over-engineered based on theoretical grounds that aren't > actually practical grounds. I also have a separate suggestion: I'm more than happy to do something *MUCH SIMPLER*. If people want just generation counts, we can give them generation counts and maybe something extra in the vdso read-only page. No new VM stuff, no new "allocate a buffer that the kernel manages", just something like one cacheline of helper data in the vdso page that is shared with everybody and is already mapped. THAT is what the vdso stuff is designed for. It's not supposed to be a whole new library routine. The state allocation should all be done in user space. The mixing should all be done in user space. As far as I can tell, the *ONLY* reason this is at all about the kernel is that "generation" counter. Just expose the generation counter in the vdso data. It will even be backwards compatible, in that old kernels will always have a value of zero, and whatever user space library then uses the generation counter to check that we haven't had some migration event or whatever won't get the *migration* events, but the code will otherwise work. And the regular user space library can decide to use whatever mixing it wants, whatever state size it wants, and the kernel doesn't have to worry about special memory allocations. See why I think this is all so *HORRENDOUSLY* over-engineered?
The kernel has absolutely _zero_ special knowledge about random numbers that user space doesn't have, except for that *one* number. And you literally don't want to do kernel system calls anyway due to performance, so your code is 99% user code anyway. KEEP IT THAT WAY. Don't add it to the kernel. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
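The generation-counter idea in the message above could look roughly like this from the user-space side. This is a minimal sketch under stated assumptions: no such counter is exported in the thread, so `fake_vdso_generation` is a plain variable standing in for it, and `struct user_rng` and `user_rng_check_reseed()` are hypothetical names. Only `getrandom()` from `<sys/random.h>` is a real interface. The sketch also does not handle the counter changing between the check and the use of the output, which is the race raised elsewhere in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/random.h>

/* Stand-in for a counter the kernel would publish in vdso data (hypothetical). */
static volatile uint64_t fake_vdso_generation;

/* User-space RNG state: the key is seeded from the kernel, mixing is up to the library. */
struct user_rng {
    int seeded;             /* 0 until the first successful reseed */
    uint64_t generation;    /* counter value observed at the last reseed */
    unsigned char key[32];  /* seed material obtained from getrandom() */
};

/*
 * Reseed from the kernel if the state was never seeded or the generation
 * counter has moved. Returns 1 if a reseed happened, 0 if the state is
 * still current, -1 if getrandom() failed or returned short.
 */
static int user_rng_check_reseed(struct user_rng *rng)
{
    assert(rng != NULL);

    uint64_t now = fake_vdso_generation;
    if (rng->seeded && rng->generation == now)
        return 0;

    if (getrandom(rng->key, sizeof rng->key, 0) != (ssize_t)sizeof rng->key)
        return -1;

    rng->generation = now;
    rng->seeded = 1;
    return 1;
}
```

On a kernel that never updates the counter it stays at zero, so the state is seeded once and then left alone, which matches the backwards-compatibility behaviour described in the message.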
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 18:35 ` Linus Torvalds @ 2024-07-04 18:46 ` Jason A. Donenfeld 2024-07-04 18:52 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-04 18:46 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi Linus, On Thu, Jul 04, 2024 at 11:35:12AM -0700, Linus Torvalds wrote: > On Thu, 4 Jul 2024 at 11:18, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > See what my argument is? I suspect _strongly_ that this is all > > completely over-engineered based on theoretical grounds that aren't > > actually practical grounds. > > I also have a separate suggestion: I'm more than happy to do something > *MUCH SIMPLER*. > > If people want just generation counts, we can give them generation I addressed this in the cover letter: | How do we rectify this? By putting a safe implementation of getrandom() | in the vDSO, which has access to whatever information a | particular iteration of random.c is using to make its decisions. I use | that careful language of "particular iteration of random.c", because the | set of things that a vDSO getrandom() implementation might need for making | decisions as good as the kernel's will likely change over time. This | isn't just a matter of exporting certain *data* to userspace. We're not | going to commit to a "data API" where the various heuristics used are | exposed, locking in how the kernel works for decades to come, and then | leave it to various userspaces to roll something on top and shoot | themselves in the foot and have all sorts of complexity disasters. | Rather, vDSO getrandom() is supposed to be the *same exact algorithm* | that runs in the kernel, except it's been hoisted into userspace as | much as possible. And so vDSO getrandom() and kernel getrandom() will | always mirror each other hermetically. 
random.c has a long history of exposing lots of particulars that we've had to stub out. Enough of that. It's far better to have a function (not a piece of data!) that uses the *exact same algorithm* and hence has the exact same guarantees as random.c, and the kernel can keep those in sync. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 18:46 ` Jason A. Donenfeld @ 2024-07-04 18:52 ` Linus Torvalds 2024-07-04 18:57 ` Jason A. Donenfeld 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-04 18:52 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Thu, 4 Jul 2024 at 11:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > > I addressed this in the cover letter: > > | How do we rectify this? By putting a safe implementation of getrandom() > | in the vDSO, which has access to whatever information a > | particular iteration of random.c is using to make its decisions. I use > | that careful language of "particular iteration of random.c", because the > | set of things that a vDSO getrandom() implementation might need for making > | decisions as good as the kernel's will likely change over time. Jason. This smells. It's BS. Christ, let's make a deal: do a five-liner patch that adds the generation number to the vdso data, and basically document it as a "the kernel thinks you need to reseed your buffers using getrandom" flag. And *if* it turns out in the future that there is then any major reason why that doesn't work, I'll take the 1000+ line thing, ok? Deal? Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 18:52 ` Linus Torvalds @ 2024-07-04 18:57 ` Jason A. Donenfeld 2024-07-04 19:19 ` Linus Torvalds 2024-07-07 16:56 ` Russell Haley 0 siblings, 2 replies; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-04 18:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi Linus, On Thu, Jul 4, 2024 at 8:52 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 4 Jul 2024 at 11:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > > > > > I addressed this in the cover letter: > > > > | How do we rectify this? By putting a safe implementation of getrandom() > > | in the vDSO, which has access to whatever information a > > | particular iteration of random.c is using to make its decisions. I use > > | that careful language of "particular iteration of random.c", because the > > | set of things that a vDSO getrandom() implementation might need for making > > | decisions as good as the kernel's will likely change over time. > > Jason. This smells. It's BS. It's not BS. And that's not a real argument from you, but rather is something else. > Christ, let's make a deal: do a five-liner patch that adds the > generation number to the vdso data, and basically document it as a > "the kernel thinks you need to reseed your buffers using getrandom" > flag. I really do not want to expose random.c internals, and then deal with the consequences of breaking user code that relied on that. The fake entropy count API was already a nightmare to move away from. And I think there's tremendous value in letting users use the kernel's *exact algorithm*, whatever it happens to be, without syscall overhead. Plus, this means further proliferation of bad userspace RNGs. So I think the deal is a bad one. > reason why that doesn't work, I'll take the 1000+ line thing (I would like to point out that a good deal of that series is test code and documentation and such.) 
Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 18:57 ` Jason A. Donenfeld @ 2024-07-04 19:19 ` Linus Torvalds 2024-07-04 21:07 ` Linus Torvalds 2024-07-07 16:56 ` Russell Haley 1 sibling, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-04 19:19 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Thu, 4 Jul 2024 at 11:57, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > I really do not want to expose random.c internals, and then deal with > the consequences of breaking user code that relied on that. The fake > entropy count API was already a nightmare to move away from. And I > think there's tremendous value in letting users use the kernel's > *exact algorithm*, whatever it happens to be, without syscall > overhead. Plus, this means further proliferation of bad userspace > RNGs. So I think the deal is a bad one. Bah. I guess I'll have to walk through the patch series once again. I'm still not thrilled about it. But I'll give it another go. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 19:19 ` Linus Torvalds @ 2024-07-04 21:07 ` Linus Torvalds 2024-07-04 21:44 ` Arnd Bergmann 2024-07-05 16:18 ` Jason A. Donenfeld 0 siblings, 2 replies; 39+ messages in thread From: Linus Torvalds @ 2024-07-04 21:07 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Thu, 4 Jul 2024 at 12:19, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Bah. I guess I'll have to walk through the patch series once again. Ok, I went through it once. First comments: The system call additions look really random. You don't add them to all architectures, but the ones you *do* add them to seem positively pointless: - I don't think you should introduce the system call on 32-bit architectures, and that includes as a compat call on 64-bit. The VM_DROPPABLE infrastructure doesn't even exist on 32-bit, and while that might not be technically a requirement, it does seem to argue against doing this on 32-bit architectures. Plus nobody sane cares. You didn't even enable it on 32-bit x86 in the vdso, so why did you enable it as a syscall? - even 64-bit architectures don't necessarily have anything like a vdso, eg alpha. It looks like you randomly just picked the architectures that have a syscall.tbl file, rather than architectures where this made sense. I think you should drop all of them except possibly arm64, s390 and powerpc. I'm very ambivalent about the VM_DROPPABLE code. On one hand, it's something we've discussed many times, and I don't hate it. On the other hand, the discussions have always been about actually exposing it to user space as a MAP_DROPPABLE so that user space can do caching. In fact, I'm almost certain that *because* you didn't expose it to mmap(), people will now instead mis-use vgetrandom_alloc() to allocate random MAP_DROPPABLE pages. That is going to be a nightmare. And that nightmare has to be avoided.
Which in turn means that I think vgetrandom_alloc() has to go, and you just need to expose MAP_DROPPABLE instead that only works for private anonymous mappings, and make sure glibc uses that. Because as your patch series stands now, the semantics are unacceptable. This is a non-starter. When I see a new system call where my reaction is not just "this should have been just a mmap()", but then immediately followed by "Oh, and people will mis-use this as a cool mmap", I'm not merging that system call. So I don't hate VM_DROPPABLE per se, but the interface is simply not ok. vgetrandom_alloc() absolutely *has* to go, and needs to just be a user-space wrapper around regular mmap. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
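A "user-space wrapper around regular mmap" for allocating RNG states, as suggested in the message above, might be sketched as follows. The helper name `alloc_rng_states()` and its parameters are hypothetical, and the per-state size is whatever the caller decides. The sketch uses an ordinary private anonymous mapping; the MAP_DROPPABLE mapping type discussed in the thread is only a proposal at this point, so it is not used here.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map enough page-rounded memory for `count` opaque
 * RNG states of `state_size` bytes each. On success returns the mapping
 * and stores its length in *mapped_len (needed later for munmap());
 * returns NULL on bad arguments, overflow, or mmap() failure.
 */
static void *alloc_rng_states(size_t count, size_t state_size, size_t *mapped_len)
{
    assert(mapped_len != NULL);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || count == 0 || state_size == 0)
        return NULL;
    if (count > (size_t)-1 / state_size)
        return NULL;  /* count * state_size would overflow */

    size_t pg = (size_t)page;
    size_t len = count * state_size;
    if (len > (size_t)-1 - (pg - 1))
        return NULL;  /* rounding up would overflow */
    len = (len + pg - 1) / pg * pg;

    /*
     * Plain private anonymous memory. A droppable mapping type, as
     * proposed in the thread, would be requested at this call instead.
     */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    *mapped_len = len;
    return p;
}
```

The point of the design is that state allocation stays an ordinary memory-mapping decision made by the library, so no dedicated system call is needed for it.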
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 21:07 ` Linus Torvalds @ 2024-07-04 21:44 ` Arnd Bergmann 2024-07-04 22:07 ` Linus Torvalds 2024-07-05 16:18 ` Jason A. Donenfeld 1 sibling, 1 reply; 39+ messages in thread From: Arnd Bergmann @ 2024-07-04 21:44 UTC (permalink / raw) To: Linus Torvalds, Jason A . Donenfeld Cc: Jiri Olsa, Masami Hiramatsu, cgzones, Christian Brauner, linux-kernel On Thu, Jul 4, 2024, at 23:07, Linus Torvalds wrote: > > - even 64-bit architectures don't necessarily have anything like a > vdso, eg alpha. > > It looks like you randomly just picked the architectures that have a > syscall.tbl file, rather than architectures where this made sense. I > think you should drop all of them except possibly arm64, s390 and > powerpc. It's not random, it's all the architectures: the ones that don't have a syscall.tbl file are the ones that use the table in include/uapi/asm-generic/unistd.h. I generally recommend doing it like this to ensure all architectures define the same __NR_* macro for new syscalls even if the implementation gets added later. As you say, this one is special because it's not useful without a vdso, but that doesn't require making it more special than necessary by adding it selectively. In particular, if the entries above number 402 are kept consistent across all architectures, we can more easily move them into a shared file in the future to avoid some of the complexity of adding syscalls. Arnd ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 21:44 ` Arnd Bergmann @ 2024-07-04 22:07 ` Linus Torvalds 2024-07-05 8:32 ` Arnd Bergmann 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-04 22:07 UTC (permalink / raw) To: Arnd Bergmann Cc: Jason A . Donenfeld, Jiri Olsa, Masami Hiramatsu, cgzones, Christian Brauner, linux-kernel On Thu, 4 Jul 2024 at 14:45, Arnd Bergmann <arnd@arndb.de> wrote: > > It's not random, it's all the architectures: the ones that > don't have a syscall.tbl file are the ones that use the table > in include/uapi/asm-generic/unistd.h. Ok. I think it's bogus to reserve system calls for everybody even when it makes no sense. But it's also pretty moot, since I think the whole system call has to go away. All it is is an odd wrapper around mmap() anyway, and it's a useful enough thing *outside* of getrandom() that I pretty much guarantee it will be used for other things than vgetrandom anyway. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 22:07 ` Linus Torvalds @ 2024-07-05 8:32 ` Arnd Bergmann 2024-07-05 16:59 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Arnd Bergmann @ 2024-07-05 8:32 UTC (permalink / raw) To: Linus Torvalds Cc: Jason A . Donenfeld, Jiri Olsa, Masami Hiramatsu, cgzones, Christian Brauner, linux-kernel On Fri, Jul 5, 2024, at 00:07, Linus Torvalds wrote: > On Thu, 4 Jul 2024 at 14:45, Arnd Bergmann <arnd@arndb.de> wrote: >> >> It's not random, it's all the architectures: the ones that >> don't have a syscall.tbl file are the ones that use the table >> in include/uapi/asm-generic/unistd.h. > > Ok. > > I think it's bogus to reserve system calls for everybody even when it > makes no sense. I see. Just to make sure: do you think it's ok to still reserve system call numbers everywhere if they are used on most architectures? I posted a series yesterday to convert include/uapi/asm-generic/unistd.h into the syscall.tbl format, and I did this change for clone3: https://lore.kernel.org/lkml/20240704143611.2979589-8-arnd@kernel.org/ The reasoning here is that we want this to be available everywhere but there are four architectures still missing it, and having the macro defined in the generated unistd.h avoids a special case. On the other hand, I left memfd_secret a special case since that one is only implemented on one architecture using the generic table. > But it's also pretty moot, since I think the whole system call has to go away. > > All it is is an odd wrapper around mmap() anyway, and it's a useful > enough thing *outside* of getrandom() that I pretty much guarantee it > will be used for other things than vgetrandom anyway. Right. Arnd ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-05 8:32 ` Arnd Bergmann @ 2024-07-05 16:59 ` Linus Torvalds 0 siblings, 0 replies; 39+ messages in thread From: Linus Torvalds @ 2024-07-05 16:59 UTC (permalink / raw) To: Arnd Bergmann Cc: Jason A . Donenfeld, Jiri Olsa, Masami Hiramatsu, cgzones, Christian Brauner, linux-kernel On Fri, 5 Jul 2024 at 01:34, Arnd Bergmann <arnd@arndb.de> wrote: > > I see. Just to make sure: do you think it's ok to still > reserve system call numbers everywhere if they are used > on most architectures? Yes. If there's a reason why a system call might be used, no problem. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 21:07 ` Linus Torvalds 2024-07-04 21:44 ` Arnd Bergmann @ 2024-07-05 16:18 ` Jason A. Donenfeld 2024-07-05 17:39 ` Linus Torvalds 1 sibling, 1 reply; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-05 16:18 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi Linus, On Thu, Jul 04, 2024 at 02:07:41PM -0700, Linus Torvalds wrote: > On Thu, 4 Jul 2024 at 12:19, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Bah. I guess I'll have to walk through the patch series once again. Thanks for having a look. I really appreciate it. > Ok, I went through it once. First comments: > > The system call additions look really random. You don't add them to > all architectures, but the ones you *do* add them to seem positively > pointless: > > - I don't think you should introduce the system call on 32-bit > architectures, and that includes as a compat call on 64-bit. > > The VM_DROPPABLE infrastructure doesn't even exist on 32-bit, and > while that might not be technically a requirement, it does seem to > argue against doing this on 32-bit architectures. Plus nobody sane > cares. > > You didn't even enable it on 32-bit x86 in the vdso, so why did > you enable it as a syscall? > > - even 64-bit architectures don't necessarily have anything like a > vdso, eg alpha. > > It looks like you randomly just picked the architectures that have a > syscall.tbl file, rather than architectures where this made sense. I > think you should drop all of them except possibly arm64, s390 and > powerpc. The first versions of my series actually only enabled it on x86. (Somebody also wrote an arm64 implementation of all this already, but that's for later.) But after I posted that, people (Arnd, I think?) told me I should add it to all architectures to "reserve" the number. That was a lot of annoying busy work to do, but I did it, and not just random archs, but *all* of them. 
I'd be happy to revert all this and just enable it on x86. I'll do that for the v+1 patch. It's less work for me and would make this series one patch less. But there might be a conversation to have (that I think you've begun with Arnd) about what the expectations are for this, because the "enable it on all of them" seems to be something I've heard on more than one occasion. > I'm very ambivalent about the VM_DROPPABLE code. > > On one hand, it's something we've discussed many times, and I don't > hate it. On the other hand, the discussions have always been about > actually exposing it to user space as a MAP_DROPPABLE so that user > space can do caching. > > In fact, I'm almost certain that *because* you didn't expose it to > mmap(), people will now then instead mis-use vgetrandom_alloc() > instead to allocate random MAP_DROPPABLE pages. That is going to be a > nightmare. VM_DROPPABLE *is* actually a very useful feature. Or it at least seems like it could be one. One can imagine various database caches that do a memory vs cpu trade off using it. (But, to be clear, I've never actually spoken with database developers about it.) There are some other improvements for it I have in mind that I was considering posting in some time when this work here has settled. And then, indeed, it'd make sense to eventually expose this properly to mmap() and let people use it. (Or if you want to do that in reverse, adding it to mmap() first, so that people don't misuse vgetrandom_alloc(), that's fine.) > And that nightmare has to be avoided. Which in turn means that I think > vgetrandom_alloc() has to go, and you just need to expose > MAP_DROPPABLE instead that only works for private anonymous mappings, > and make sure glibc uses that. > > Because as your patch series stands now, the semantics are unacceptable. > > This is a non-starter. 
When I see a new system call where my reaction > is not just "this should have been just a mmap()", but then > immediately followed by "Oh, and people will mis-use this as a cool > mmap", I'm not merging that system call. > > So I don't hate VM_DROPPABLE per se, but the interface is simply not > ok. vgetrandom_alloc() absolutely *has* to go, and needs to just be a > user-space wrapper around regular mmap. So I'm not wedded to adding a syscall for this and am pretty open to other ways of doing it, but I actually think given the requirements, this kind of makes sense. I was talking about this problem with tglx or with Greg a while back, kind of frustrated, and one of them suggested, "well just make it a syscall; that's what those are for," and it immediately made sense, and so that's what I've done. Here are the requirements: - The "mechanism" needs to return allocated memory to userspace that can be chunked up on a per-thread basis, with no state straddling pages, which means it also needs to return the size of each state, and the number of states that were allocated. - The size of each state might change kernel version to kernel version. - In an effort to match the behaviors of syscall getrandom() as much as possible, it needs to be mapped with various flags (the ones in the current vgetrandom_alloc() implementation). - Which flags are needed might change kernel version to kernel version. - Future memory tagging CPU extensions might allow us to prevent the memory from being accessed unless the accesses are coming from vDSO code, which would avoid heartbleed-like bugs. This is very appealing. So, the memory that's returned, and the parameters about it are sort of tied to the actual [v]getrandom() implementation. That sounds to me like this should be done by a function that the kernel is in charge of. Hence the syscall. 
(Or a vDSO function, but then it wouldn't correspond with an equivalent syscall, which might not be appealing to tglx, and it starts to smell like "library code" which we really don't want.) Given this, it seemed like a syscall was the cleanest most cromulent solution. But if you have other suggestions, I'm open to it. Maybe, though, the best way of assuaging your concerns would be to expose MAP_DROPPABLE in mmap() in the same series as the rest, so that there *isn't* a chance that vgetrandom_alloc() will be abused when people realize it's a handy feature to have. Thoughts? Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
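Editorial sketch of the first requirement above (chunking an allocation into per-thread states so that no state straddles a page). This is a minimal illustration in C, not code from the patch series: the helper names `states_per_page`, `state_offset`, and `state_within_one_page` are hypothetical, and the 4096-byte page and 144-byte state are taken from figures mentioned elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many whole states fit in one page. */
size_t states_per_page(size_t page_size, size_t state_size)
{
    assert(state_size > 0 && state_size <= page_size);
    return page_size / state_size;
}

/*
 * Hypothetical helper: byte offset of state number `index` inside a
 * page-aligned allocation. Each page holds a whole number of states and
 * the tail of the page is left unused, so no state crosses a page boundary.
 */
size_t state_offset(size_t page_size, size_t state_size, size_t index)
{
    size_t per_page = states_per_page(page_size, state_size);

    return (index / per_page) * page_size + (index % per_page) * state_size;
}

/* Returns 1 if a state starting at `offset` lies entirely within one page. */
int state_within_one_page(size_t page_size, size_t state_size, size_t offset)
{
    return offset / page_size == (offset + state_size - 1) / page_size;
}
```

With these assumptions, state 27 starts at byte 3888 and state 28 starts at byte 4096 (the next page), whereas naive back-to-back packing would place state 28 at byte 4032 and let it spill into the following page. That is why the caller needs to know both the state size and the page size before handing states to threads.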
* Re: deconflicting new syscall numbers for 6.11 2024-07-05 16:18 ` Jason A. Donenfeld @ 2024-07-05 17:39 ` Linus Torvalds 2024-07-05 17:53 ` Jason A. Donenfeld 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-05 17:39 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Fri, 5 Jul 2024 at 09:18, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > VM_DROPPABLE *is* actually a very useful feature. Or it at least seems > like it could be one. Yes. It's been discussed exactly in that "this _could_ be very useful" sense, although we've never actually pulled the trigger. I tried to find previous discussions on lore, but failed miserably, so I can't point to previous discussions from long ago, but one question was also always about whether you wanted some explicit "populate this page range" interface together with getting a SIGBUS when it's unpopulated (so that you can basically do demand-paging in user space). With just a "this could be useful" but no hard users, it never really got anywhere. Anyway, I really don't mind VM_DROPPABLE with "it just gets re-populated as a new anonymous page" model, particularly since we could easily then later decide that we could expand on it as a MAP_SHARED thing with SIGBUS semantics and explicit initialization if we ever really want it. End result: I don't think there are necessarily *lots* of users, but I do think that this is something where some enterprising person goes "I can use this", and makes some cool library that uses it for caching, and then we'd be stuck with it. > And then, indeed, it'd make sense to eventually expose this properly to > mmap() and let people use it. (Or if you want to do that in reverse, > adding it to mmap() first, so that people don't misuse > vgetrandom_alloc(), that's fine.) Yes. And it should be pretty trivial. We just at least initially have to be very careful to limit it to MAP_ANONYMOUS and MAP_PRIVATE. 
Because dropping dirty bits on shared mappings sounds insane and like a possible source of confusion (and thus bugs and maybe even security issues). It's possible that we might even use a MAP_TYPE flag for this. Or make it a PROT_xyz bit rather than a MAP_xyz. So there's some trivial sanity checks and some UI issues to just pick, but apart from "just pick something sane", exposing this for mmap() is _not_ hard, and I do think it needs to be done first. And once it's done, I think the argument for having a special system call is basically gone too. > - The "mechanism" needs to return allocated memory to userspace that can > be chunked up on a per-thread basis, with no state straddling pages, > which means it also needs to return the size of each state, and the > number of states that were allocated. > > - The size of each state might change kernel version to kernel version. Just pick a size large enough. And why would that size not be one page? Considering that you really don't want to rely on page-crossing state *ANYWAY* because of the whole "one page can go away while another one sticks around" issue, I would expect that states over one page per thread would be a *very* questionable idea to begin with. I don't think we'll ever see systems with page sizes smaller than 4k. They have existed in the past, but they're not making a comeback. People want larger pages, not smaller ones. And the state size right now is what - 200 bytes? So a single page seems (a) sufficient and (b) kind of the sane maximum anyway due to the dropping. No? Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
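Editorial sketch of the "trivial sanity checks" described in the message above: a droppable mapping is accepted only when it is both anonymous and private, and never when it is shared. This is an illustration in C under loud assumptions. The `SK_` flag constants are made-up bit values for this sketch, not the kernel's mmap() flag encoding, and `droppable_request_ok` is a hypothetical name, not a kernel function.

```c
#include <assert.h>

/* Illustrative bit values only; these are NOT the kernel's flag encodings. */
enum {
    SK_MAP_SHARED    = 1 << 0,
    SK_MAP_PRIVATE   = 1 << 1,
    SK_MAP_ANONYMOUS = 1 << 2,
    SK_MAP_DROPPABLE = 1 << 3,
};

/*
 * Hypothetical check: returns 1 if the flag combination is acceptable,
 * 0 if a droppable mapping is requested without being private + anonymous.
 */
int droppable_request_ok(unsigned int flags)
{
    if (!(flags & SK_MAP_DROPPABLE))
        return 1;               /* not a droppable request; nothing to check */
    if (flags & SK_MAP_SHARED)
        return 0;               /* dropping dirty shared pages is refused */
    return (flags & SK_MAP_PRIVATE) && (flags & SK_MAP_ANONYMOUS);
}

/* Small self-check that doubles as a usage example. */
void droppable_selfcheck(void)
{
    assert(droppable_request_ok(SK_MAP_DROPPABLE | SK_MAP_PRIVATE | SK_MAP_ANONYMOUS));
    assert(!droppable_request_ok(SK_MAP_DROPPABLE | SK_MAP_SHARED | SK_MAP_ANONYMOUS));
}
```

Whether the real interface ends up as a MAP_TYPE value, a MAP_xyz bit, or a PROT_xyz bit is left open in the message; the sketch only captures the restriction itself.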
* Re: deconflicting new syscall numbers for 6.11 2024-07-05 17:39 ` Linus Torvalds @ 2024-07-05 17:53 ` Jason A. Donenfeld 2024-07-05 18:08 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-05 17:53 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi Linus, On Fri, Jul 05, 2024 at 10:39:48AM -0700, Linus Torvalds wrote: > Yes. And it should be pretty trivial. > > We just at least initially have to be very careful to limit it to > MAP_ANONYMOUS and MAP_PRIVATE. Because dropping dirty bits on shared > mappings sounds insane and like a possible source of confusion (and > thus bugs and maybe even security issues). > > It's possible that we might even use a MAP_TYPE flag for this. Or make > it a PROT_xyz bit rather than a MAP_xyz. > > So there's some trivial sanity checks and some UI issues to just pick, > but apart from "just pick something sane", exposing this for mmap() is > _not_ hard, and I do think it needs to be done first. I can take a stab at it. > > - The "mechanism" needs to return allocated memory to userspace that can > > be chunked up on a per-thread basis, with no state straddling pages, > > which means it also needs to return the size of each state, and the > > number of states that were allocated. > > > > - The size of each state might change kernel version to kernel version. > > Just pick a size large enough. > > And why would that size not be one page? > > Considering that you really don't want to rely on page-crossing state > *ANYWAY* because of the whole "one page can go away while another one > sticks around" issue, I would expect that states over one page per > thread would be a *very* questionable idea to begin with. > > I don't think we'll ever see systems with page sizes smaller than 4k. > They have existed in the past, but they're not making a comeback. > People want larger pages, not smaller ones. 
That sounds not so good: the current state is 144 bytes, and it's expected that there'll be one of these per thread. Mapping 16k or 4k per thread seems pretty bad. At least it certainly seems that way? Wasting 16240 bytes per thread + a new vmap I can't imagine is okay. Also, these points still stand: | - In an effort to match the behaviors of syscall getrandom() as much as | possible, it needs to be mapped with various flags (the ones in the | current vgetrandom_alloc() implementation). | | - Which flags are needed might change kernel version to kernel version. | | - Future memory tagging CPU extensions might allow us to prevent the | memory from being accessed unless the accesses are coming from vDSO | code, which would avoid heartbleed-like bugs. This is very appealing. It seems like leaving it just up to mmap() will not only result in users doing it wrong, but kind of limits our options moving forward. And there's this whole issue of communicating sizes so as not to be wasteful. Another idea I had, if you hate the syscall, is I could just add this as (another) private ioctl() on the /dev/random node. This sounds worse than a syscall because it means that node has to exist and the fd has to be opened -- and concerns about this were what led to the getrandom() syscall being introduced in the first place -- but it would at least avoid the syscall. I'm not crazy about that though. Maybe the winning solution is MAP_DROPPABLE (or PROT_DROPPABLE) in mmap(), and then in the following commit, add the vgetrandom_alloc() syscall, and then we'll avoid vgetrandom_alloc() getting abused, but still have a nice interface that isn't too constraining. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-05 17:53 ` Jason A. Donenfeld @ 2024-07-05 18:08 ` Linus Torvalds 2024-07-05 18:56 ` Jason A. Donenfeld 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-05 18:08 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Fri, 5 Jul 2024 at 10:53, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > That sounds not so good: the current state is 144 bytes, and it's > expected that there'll be one of these per thread. Mapping 16k or 4k per > thread seems pretty bad. At least it certainly seems that way? Wasting > 16240 bytes per thread + a new vmap I can't imagine is okay. Well, I guess the simple solution would be "just pick a size that is guaranteed to be at most a page, and a power-of-two, and big enough". You really don't have that many choices. Presumably we won't have per-architecture random states anyway, so the smallest supported page size is the upper limit, and if the current size is 144 bytes, we know that 256 is the lower limit. IOW, we pretty much know that the number is _always_ going to be 2**n where 8 <= n <= 12. Just pick one. > | - Future memory tagging CPU extensions might allow us to prevent the > | memory from being accessed unless the accesses are coming from vDSO > | code, which would avoid heartbleed-like bugs. This is very appealing. No. Stop this idiocy. Now you are getting into cray-cray land. Nobody cares about random numbers so much that they'd worry about leaking them from other sources thanks to hardware bugs. Seriously. This is the kind of "crazy random number" talk that makes me go "I don't want to touch this". Get your act together. There is *NO* way we care about this kind of garbage, and just bringing it up makes me doubt that you have the right mindset. You claimed to not be one of the crazy people. SHOW IT. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
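Editorial sketch of the "power-of-two, at most a page, big enough" rule from the message above. The function name `pick_state_slot_size` is hypothetical; the 144-byte state and the 4096-byte smallest page size are the figures used in the thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: smallest power of two that is >= state_size.
 * The caller guarantees the state fits in the smallest supported page,
 * so the result is also bounded above by min_page_size when that page
 * size is itself a power of two.
 */
size_t pick_state_slot_size(size_t state_size, size_t min_page_size)
{
    size_t slot = 1;

    assert(state_size > 0 && state_size <= min_page_size);
    while (slot < state_size)
        slot <<= 1;
    return slot;
}
```

For a 144-byte state this gives 256, which is 2**8, the lower end of the 8 <= n <= 12 range quoted above; a state of exactly one 4096-byte page gives 2**12, the upper end.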
* Re: deconflicting new syscall numbers for 6.11 2024-07-05 18:08 ` Linus Torvalds @ 2024-07-05 18:56 ` Jason A. Donenfeld 2024-07-05 19:21 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-05 18:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Fri, Jul 05, 2024 at 11:08:03AM -0700, Linus Torvalds wrote: > On Fri, 5 Jul 2024 at 10:53, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > > > That sounds not so good: the current state is 144 bytes, and it's > > expected that there'll be one of these per thread. Mapping 16k or 4k per > > thread seems pretty bad. At least it certainly seems that way? Wasting > > 16240 bytes per thread + a new vmap I can't imagine is okay. > > Well, I guess the simple solution would be "just pick a size that is > guaranteed to be at most a page, and a power-of-two, and big enough". > > You really don't have that many choices. Presumably we won't have > per-architecture random states anyway, so the smallest supported page > size is the upper limit, and if the current size is 144 bytes, we know > that 256 is the lower limit. > > IOW, we pretty much know that the number is _always_ going to be 2**n > where 8 <= n <= 12. > > Just pick one. And if we want to exceed that size in the future, then what? Just seems like hard coding it locks us in. Also, pow2 is still wasteful - 28 states for a 4k page at optimal size versus 16 states for a 4k page at rounding up to current pow2. That's not a huge difference at small scale. But also, why? Seems like we could do this a lot better. > > > | - Future memory tagging CPU extensions might allow us to prevent the > > | memory from being accessed unless the accesses are coming from vDSO > > | code, which would avoid heartbleed-like bugs. This is very appealing. > > No. Stop this idiocy. > > Now you are getting into cray-cray land. 
Nobody cares about random > numbers so much that they'd worry about leaking them from other > sources thanks to hardware bugs. > > Seriously. This is the kind of "crazy random number" talk that makes > me go "I don't want to touch this". > > Get your act together. There is *NO* way we care about this kind of > garbage, and just bringing it up makes me doubt that you have the > right mindset. > > You claimed to not be one of the crazy people. SHOW IT. I'm pretty sure you just misunderstood what I'm referring to. "Heartbleed-like" refers to remote info leak. Like, some server process spits out a bunch of memory onto the network. If the rng pages can only be accessed when the caller is at some specified address range, then those kinds of bugs are mitigated. Anyway, just an idea, but doesn't seem like an impossible one. There were also those two other unrelated points I raised, trimmed from the context. To repaste them all from before: | Here are the requirements: | | - The "mechanism" needs to return allocated memory to userspace that can | be chunked up on a per-thread basis, with no state straddling pages, | which means it also needs to return the size of each state, and the | number of states that were allocated. | | - The size of each state might change kernel version to kernel version. Your suggestion is to hard code the state size to a power of 2, which will lock us in to having that as an upper bound forever, and also waste memory because it's not ideally sized. | | - In an effort to match the behaviors of syscall getrandom() as much as | possible, it needs to be mapped with various flags (the ones in the | current vgetrandom_alloc() implementation). | | - Which flags are needed might change kernel version to kernel version. Unaddressed. | | - Future memory tagging CPU extensions might allow us to prevent the | memory from being accessed unless the accesses are coming from vDSO | code, which would avoid heartbleed-like bugs. This is very appealing. 
I think you misunderstood me as referring to "hardware bugs", but that's not what I was talking about, as I described above. Anyway, regardless, if your take on this is, "I don't care about making certain rng memory harder to leak than other memory," then so be it and I'll drop this point. | So, the memory that's returned, and the parameters about it are sort of | tied to the actual [v]getrandom() implementation. That sounds to me like | this should be done by a function that the kernel is in charge of. Hence | the syscall. I'm having a hard time seeing how, "let the user guess and pass whatever flags were decided at one moment" is preferable to, "have a syscall/function/ioctl/whatever communicate to userspace what it needs to do and to set up the mapping in exactly the way it's needed." I'm sorry to keep belaboring this, but I'm actually just sort of surprised by your take. I get the part about, "users will abuse vgetrandom_alloc() for something uncouth," which seems very real, but the solution to that is to just expose this to mmap() first. Once that's there, vgetrandom_alloc() becomes kind of similar to, say, map_shadow_stack(). But okay, spit-balling further, there are the current ideas proposed, and I'll add two more to the bottom: 0) Syscall. 1) /dev/random ioctl. Downside: needs filesystem node, fd. 2) Hard coding 256 and set of mmap flags. Downside: discussed above. 3) Expose /proc/sys/kernel/random/vgetrandom_info, which gives one field of the state size and another of the flags needed for mmap. Downside: still less flexible than the kernel doing the allocation, like if it'd be nice in the future for some additional step to be taken on the memory after mmap(). Downside: needs filesystem node, fd. 4) Same as (3), but expose this through passing -1 as opaque_len to vgetrandom(). Downside: kinda ugly, adds branch. I think of these, (3) is preferable to (2). (0) still seems best, but I'm not sure you'll agree yet. 
(4) might be preferable to (3) because no filesystem stuff. If (0) and (1) are still sounding bad to you, do (3) or (4) sound better? Also, I'm just brainstorming here; if you find these deranged, that's okay. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
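Editorial sketch of the memory arithmetic argued over in the message above, written as a few C helpers so the numbers can be checked. The helper names are hypothetical; the inputs (4096- and 16384-byte pages, a 144-byte state, a 256-byte power-of-two slot) come from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Whole slots of `slot_size` bytes that fit in one page. */
size_t slots_in_page(size_t page_size, size_t slot_size)
{
    assert(slot_size > 0 && slot_size <= page_size);
    return page_size / slot_size;
}

/* Bytes left unused at the end of a page packed with whole slots. */
size_t unused_tail_bytes(size_t page_size, size_t slot_size)
{
    return page_size - slots_in_page(page_size, slot_size) * slot_size;
}

/* Bytes wasted when an entire mapping holds just one state. */
size_t waste_with_one_state_per_mapping(size_t mapping_size, size_t state_size)
{
    assert(state_size <= mapping_size);
    return mapping_size - state_size;
}
```

Evaluated with the thread's figures: a 4096-byte page holds 28 states at 144 bytes (64 bytes unused) versus 16 states at 256 bytes, and giving each thread its own 16384-byte page wastes 16384 - 144 = 16240 bytes, matching the numbers quoted in the discussion.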
* Re: deconflicting new syscall numbers for 6.11 2024-07-05 18:56 ` Jason A. Donenfeld @ 2024-07-05 19:21 ` Linus Torvalds 2024-07-05 19:46 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-05 19:21 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Fri, 5 Jul 2024 at 11:56, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > And if we want to exceed that size in the future, then what? Just seems > like hard coding it locks us in. KISS. Keep It Simple Stupid. Make a sane decision. Stick with it. This is *not* something where things will change radically over the years. But what this *is* is something where we want to actively avoid overcomplicating things. If saying "the state size is fixed at 256 bytes" means that ten years from now, we won't be updating to some super-duper fancy new algorithm that wants to keep a huge state size - then that's a GOOD thing. We are software ENGINEERS. That means that we make sane decisions and live with real life limits. We know that we don't have infinite entropy, and we understand that we can't even know how much entropy we do have. At some point, you just have to put your foot down. Leave the people who have theoretical concerns behind. They can damn well do their own thing. We should not care. If somebody is unhappy with the result, let them go make their own random number generator. We've used the current chacha state for what, a decade now? Just let it be. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-05 19:21 ` Linus Torvalds @ 2024-07-05 19:46 ` Linus Torvalds 2024-07-06 0:11 ` Jason A. Donenfeld 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-05 19:46 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Fri, 5 Jul 2024 at 12:21, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > KISS. Keep It Simple Stupid. Make a sane decision. Stick with it. Side note: you could just stick the size as a constant in the vdso too. But honestly, what's the argument for more than 256 if 144 bytes is the reality now? Does anybody seriously think our current getrandom() isn't good enough? Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-05 19:46 ` Linus Torvalds @ 2024-07-06 0:11 ` Jason A. Donenfeld 2024-07-06 2:10 ` Jason A. Donenfeld 0 siblings, 1 reply; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-06 0:11 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi Linus, On Fri, Jul 05, 2024 at 12:46:37PM -0700, Linus Torvalds wrote: > If saying "the state size is fixed at 256 bytes" means that ten years > from now, we won't be updating to some super-duper fancy new algorithm > that wants to keep a huge state size - then that's a GOOD thing. I'm all for avoiding fanciness. I can imagine three plausible scenarios where we benefit from the kernel doing the allocation, rather than mmap, or where it's nice to have the kernel decide on the size: - On some platform, it's actually more efficient to generate N blocks, such that the state there needs to be larger. - The amount of state that we buffer increases according to some speed vs practicality trade off that changes. (Right now we buffer 1.5 blocks; maybe 3.5 would be better eventually.) - We find out that there's a better way of doing all this with a special mapping instead, or some other means. What I have in mind, IOW, isn't fanciness. But alright, let me run with where you're urging me and see where that takes things. > Side note: you could just stick the size as a constant in the vdso too. Yea, this sounds more like solution (4) from my last email. I'll give that a shot and see what it's like nuking the syscall. I'll ping here when v21 of the series is ready, and hopefully you like it more. Thanks for brainstorming this all with me. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-06 0:11 ` Jason A. Donenfeld @ 2024-07-06 2:10 ` Jason A. Donenfeld 2024-07-06 2:56 ` Linus Torvalds 0 siblings, 1 reply; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-06 2:10 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi again Linus, On Sat, Jul 06, 2024 at 02:11:59AM +0200, Jason A. Donenfeld wrote: > What I have in mind, IOW, isn't fanciness. But alright, let me run with > where you're urging me and see where that takes things. > > > Side note: you could just stick the size as a constant in the vdso too. > > Yea, this sounds more like solution (4) from my last email. I'll give > that a shot and see what it's like nuking the syscall. I'll ping here > when v21 of the series is ready, and hopefully you like it more. I'll spend the weekend doing my own code review and fixing things up and working on commit messages and documentation and all that, but there are now three simpler commits in here that implement what I have in mind based on our discussion: https://git.zx2c4.com/linux-rng/log/ The selftest code is the largest part of it. There's no more syscall. I think it should be much more to your liking and seems like an alright set of compromises. Hopefully that's a bit closer to the mark. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-06 2:10 ` Jason A. Donenfeld @ 2024-07-06 2:56 ` Linus Torvalds 2024-07-06 23:26 ` Jason A. Donenfeld 0 siblings, 1 reply; 39+ messages in thread From: Linus Torvalds @ 2024-07-06 2:56 UTC (permalink / raw) To: Jason A. Donenfeld; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Fri, 5 Jul 2024 at 19:10, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > https://git.zx2c4.com/linux-rng/log/ So we already expose VM_WIPEONFORK and VM_DONTDUMP using madvise(). Exposing them at mmap creation time with MAP_xyz sounds fine. However, I do note that both the pre-existing VM_WIPEONFORK - and the new VM_DROPPABLE - need to be limited to anonymous private mappings only. You did that for VM_DROPPABLE, but not for VM_WIPEONFORK. Now, admittedly I don't remember *why* we made VM_WIPEONFORK only work for private mappings, but that's what we did. Anyway, that patch looks largely fine to me apart from that note, but I do think you want to check it with the mm people on linux-mm. > The selftest code is the largest part of it. There's no more syscall. I > think it should be much more to your liking and seems like an alright > set of compromises. Hopefully that's a bit closer to the mark. From a "look through the patches" standpoint, this did look more palatable to me, but I also would have had an easier time with looking at the patches if the self-tests were separate commits. Linus ^ permalink raw reply [flat|nested] 39+ messages in thread
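Editorial sketch of the existing madvise() route mentioned in the message above, assuming a Linux system with glibc and a kernel recent enough to support `MADV_WIPEONFORK`. It maps a private anonymous buffer, marks it wipe-on-fork and excluded from core dumps, and provides a way to observe the wipe from a forked child. The function names `map_secret_buffer` and `child_sees_wiped` are hypothetical; the system calls and `MADV_*` constants are the standard ones.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map `len` bytes as private anonymous memory, then ask the kernel to
 * zero it in any forked child and to leave it out of core dumps.
 * Returns NULL on failure.
 */
void *map_secret_buffer(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    if (madvise(p, len, MADV_WIPEONFORK) != 0 ||
        madvise(p, len, MADV_DONTDUMP) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}

/*
 * Fork and report what the child observes in buf[0].
 * Returns 1 if the child saw zero, 0 if it saw the parent's data,
 * -1 on error.
 */
int child_sees_wiped(const unsigned char *buf)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(buf[0] == 0 ? 0 : 1);
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

A caller would write a nonzero byte into the buffer, call `child_sees_wiped()`, and expect the child to see zero while the parent's copy stays intact. Requesting the same properties at mmap() time, as proposed in the message, would fold the two madvise() calls into the initial mapping.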
* Re: deconflicting new syscall numbers for 6.11 2024-07-06 2:56 ` Linus Torvalds @ 2024-07-06 23:26 ` Jason A. Donenfeld 0 siblings, 0 replies; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-06 23:26 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi Linus, On Fri, Jul 05, 2024 at 07:56:03PM -0700, Linus Torvalds wrote: > On Fri, 5 Jul 2024 at 19:10, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > > > https://git.zx2c4.com/linux-rng/log/ > > So we already expose VM_WIPEONFORK and VM_DONTDUMP using madvise(). > Exposing them at mmap creation time with MMAP_xyz sounds fine. > > However, I do note that both the pre-existing VM_WIPEONFORK - and the > new VM_DROPPABLE - needs to be limited to anonymous private mappings > only. > > You did that for VM_DROPPABLE, but not for VM_WIPEONFORK. Good catch, thanks. I'll look over all of that again closely too. > Anyway, that patch looks largely fine to me apart from that note, but > I do think you want to check it with the mm people on linux-mm. They'll certainly be on the list of recipients for the v+1 series when I post it (hopefully shortly). > > The selftest code is the largest part of it. There's no more syscall. I > > think it should be much more to your liking and seems like an alright > > set of compromises. Hopefully that's a bit closer to the mark. > > From a "look through the patches" standpoint, this did look more > palatable to me, but I also would have had an easier time with looking > at the patches if the self-tests were separate commits. Okay, will do. I think you've got some selftest makefile fixes from John/Shuah that'll be sent your way if they haven't already for 6.10 that I'll rebase on so that there isn't an annoying merge conflict. https://lore.kernel.org/all/d99a1e3b-1893-4fac-bf05-bcb60ca7f89c@linuxfoundation.org/ Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 18:57 ` Jason A. Donenfeld 2024-07-04 19:19 ` Linus Torvalds @ 2024-07-07 16:56 ` Russell Haley 1 sibling, 0 replies; 39+ messages in thread From: Russell Haley @ 2024-07-07 16:56 UTC (permalink / raw) To: jason; +Cc: arnd, brauner, cgzones, jolsa, linux-kernel, mhiramat, torvalds Since any PRNG will have the concept of re-seeding, I had to think *really hard* to understand how a pseudo-generation number that really means "reseed advised on change" could restrict future kernel development, so for anyone else following along in the peanut gallery, here's the scenario I came up with: Suppose on some future CPU, RDRAND is improved to be essentially perfect, with the same latency and throughput as a load from L1. So it acts like a HWRNG, not a PRNG. On such a CPU and with a command line option that means "I 100% trust my CPU vendor," the kernel could statically replace getrandom() with a function that just uses RDRAND, and statically disable all the machinery for gathering entropy from events and re-seeding the PRNG. *Unless*, that is, userspace potentially needs to know when a reseed-necessitating event has happened. - Russell ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 18:18 ` Linus Torvalds 2024-07-04 18:35 ` Linus Torvalds @ 2024-07-04 18:36 ` Jason A. Donenfeld 1 sibling, 0 replies; 39+ messages in thread From: Jason A. Donenfeld @ 2024-07-04 18:36 UTC (permalink / raw) To: Linus Torvalds; +Cc: jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd Hi Linus, On Thu, Jul 4, 2024 at 8:18 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > What do you want me to do here? > > You literally said "those users exist". > > Make them pipe up. > > Make them explain why what they don't have now doesn't work. What this > solves. In real terms. > > Make them explain why that random "we duplicated the VM, and now we > worry that mixing in TSC doesn't help" is an actual real-world > concern, rather than something COMPLETELY MADE UP BY RANDOM NUMBER > PEOPLE. > > See what my argument is? My argument is literally that theoretical > random number people will make up arguments that aren't actually > relevant in real life. No, I don't think this is made up by random number nutsos. I believe this is a real actual concern. > Do real people migrate VMs? Hell yes they do. Do they care about the > numbers being magically "stale" after said migration? I seriously > doubt that. Yes! They do! > > Do real people start multiple VMs from one single starting image? > Again, hell yes they do. > > But do they start those multiple VMs from some random slapdash > snapshot that they just picked without any concern and cannot just > reseed in user space? And if they do, why should *WE* clean up after > their mindbogglingly stupid setup? Except userspace isn't really in a great position to do that. There's no need to suggest that people proliferate these foot guns either. > See what my argument is? I suspect _strongly_ that this is all > completely over-engineered based on theoretical grounds that aren't > actually practical grounds. > > And dammit, I'm asking for the practical grounds. 
For the actual users. > > And if you have trouble finding those, you just proved my point. And I think what you're missing here is that these concerns come _from actual users_. This *isn't* theoretical. Look, I am not some "random number" nut job. I've worked very hard to move the kernel's RNG far outside the realm of that world. And I'm not looking for things to do or code to write or ways to occupy my time, just 'cuz. I'm working on this because there's a real, tangible, need for it. This has come out of countless recurring discussions with folks at conferences and elsewhere. I am very much part of the world where people are writing code that makes use of getrandom(), or would like to make use of getrandom() but can't, and this pickle comes up repeatedly. "Oh but we can't because of syscall speed, so we've got this userspace thing, but it's not optimal, so we're just kind of hoping for the best, but yea one of these days somebody should do something..." It's okay that people aren't having those discussions with you. That's why I'm maintaining this thing and talking to folks and caring about it and thinking carefully about it. And because people are having these conversations with me, that's *also* why I am very sensitive to, "is this guy a random number nut?" concerns, because lord I've met a lot of them and they all have their little hang up. I don't want to add code "just because we can." But I think this here will solve a very real problem for very real users, and every time the fact that I'm working on this comes up, there are real people with real concerns who are glad to hear it's coming finally. Alternatively, you can say, "well until they talk to me directly, no way josé", and that'd be your prerogative, I guess. But that'd be pretty darn disappointing. Jason ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 17:55 ` Linus Torvalds 2024-07-04 18:04 ` Jason A. Donenfeld @ 2024-07-04 18:44 ` Willy Tarreau 2024-07-05 7:01 ` Matthias Urlichs 1 sibling, 1 reply; 39+ messages in thread From: Willy Tarreau @ 2024-07-04 18:44 UTC (permalink / raw) To: Linus Torvalds Cc: Jason A. Donenfeld, jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On Thu, Jul 04, 2024 at 10:55:46AM -0700, Linus Torvalds wrote: > A trivial google for "rdrand library" finds lots of hits for things > that then use the AES-NI instructions to whiten things etc. As a userland developer, I can say that dealing with external libs for low-level stuff, which themselves sometimes even come with their own set of dependencies, is always a pain. There must be compelling reasons for adding dependencies. It's reinforced when you have to deal with long term support on your software that goes beyond the lib's. And having to go through instruction support detection and open-coding all that stuff with runtime fallbacks for older CPUs is also a pain. Not to mention the cases where you run in VMs where features are there but not listed or presented but slowly emulated. I'm using a lot of arch-specific code at build time, I'm often fine with detecting -ENOSYS at run time to fall back to an older implementation of a syscall, but I've not crossed the barrier of runtime CPU features detection which adds further burden and further fragments bug reports between users. Regarding VM migration, my code is not concerned because I'm not aware of users migrating such VMs. BUT I've got complaints in the past from some users generating UUIDs for each forwarded request that they were seeing duplicates in their logs due to the lack of thread safety on random(), which made me work on an alternative. 
Thus I can easily imagine that equivalent applications that just want to assign a unique ID to an event that ends up in a log, and when such applications suffer a VM migration could face a similar problem that is not easy to address in userland. In my opinion, abstracting the hardware is the role of the kernel. If getrandom() is fast enough for my uses, why not. If it's not, I find value in having a much faster proposal that offers the same API to all applications without each having to reinvent the wheel. I can't judge on the merits of vgetrandom() vs getrandom() though. But to give you an idea, years ago for portability reasons (random() thread safety, multiple OS support, performance), I ended up writing my own xoroshiro128 generator to address multiple problems at once and I must confess I was a bit sad to see that randoms remain so little portable between operating systems and their various versions, and that the work left to be done for users is non trivial. I can imagine that users with higher expectations than mine would want to adopt vgetrandom() when available. Now would I replace my existing RNG with this new syscall when it gets widely available ? Maybe, if it brings some value. It's easy enough to deal with two code branches, one with the new, optimal syscall, and the legacy generic fallback. Hoping this matches the type of feedback you were looking for. Willy ^ permalink raw reply [flat|nested] 39+ messages in thread
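The "two code branches" approach described above (try the newer syscall, fall back on -ENOSYS) can be sketched as follows. This is only an illustration under stated assumptions: it uses the existing getrandom(2) syscall as the "new" path and /dev/urandom as the legacy path, and the function name fill_random is hypothetical. It does not use vgetrandom(), whose interface was still under discussion in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill buf with len random bytes.
 * Returns 0 on success, -1 on failure.
 */
int fill_random(void *buf, size_t len)
{
    unsigned char *p = buf;

    assert(buf != NULL);

#ifdef SYS_getrandom
    /* Preferred branch: the getrandom(2) syscall. */
    while (len > 0) {
        long n = syscall(SYS_getrandom, p, len, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            if (errno == ENOSYS)
                break;          /* old kernel: use the legacy branch */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    if (len == 0)
        return 0;
#endif

    /* Legacy branch: read from /dev/urandom. */
    int fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    close(fd);
    return 0;
}
```

The run-time check costs one failed syscall on old kernels; a real program would likely cache the ENOSYS result rather than retry on every call.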
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 18:44 ` Willy Tarreau @ 2024-07-05 7:01 ` Matthias Urlichs 0 siblings, 0 replies; 39+ messages in thread From: Matthias Urlichs @ 2024-07-05 7:01 UTC (permalink / raw) To: Willy Tarreau, Linus Torvalds Cc: Jason A. Donenfeld, jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd On 04.07.24 20:44, Willy Tarreau wrote: > BUT I've got complaints in the past from > some users generating UUIDs for each forwarded request that they were > seeing duplicates in their logs due to the lack of thread safety on > random(), which made me work on an alternative. Thus I can easily > imagine that equivalent applications that just want to assign a unique > ID to an event that ends up in a log, and when such applications suffer > a VM migration could face a similar problem that is not easy to address > in userland. I'd like to second that. I sometimes need to duplicate a running VM, mostly in order to debug stuff. Now both VMs run the same code with the same pseudo-RNG, generating the same message IDs when they log something. I've seen rejects on logs from the real VM because the dupe got there first. Owch. A userspace RNG with a zapped VM_DROPPABLE page that re-initializes itself from the kernel RNG would solve this problem (and others). Thus a reasonable implementation seems to be: * implement VM_DROPPABLE (which I'd like to use for userspace caching anyway) * teach VM cloners, task migrators and whatnot not to copy pages marked thus * add a RNG generation counter to the VDSO * teach libc's getrandom() to use these Yes this doesn't use the exact same implementation of random.c that's in the kernel, but frankly I don't care about that. 
-- regards -- Matthias Urlichs ^ permalink raw reply [flat|nested] 39+ messages in thread
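The "userspace RNG that re-initializes itself when its page was zapped" idea from the message above can be sketched as follows. Everything here is an assumption made for illustration: the struct and function names are hypothetical, the page drop is only simulated by zeroing the struct, the generator is a non-cryptographic splitmix64 (suitable for log or message IDs at best), and the sketch ignores the race where the page is dropped between the check and the use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/random.h>

/* Hypothetical state that would live in a droppable/wipeable page. */
struct id_rng {
    uint64_t seeded;  /* 0 means "state was wiped, re-seed before use" */
    uint64_t x;       /* splitmix64 state (NOT cryptographic) */
};

/* Return the next 64-bit ID, re-seeding from the kernel if wiped. */
uint64_t id_rng_next(struct id_rng *st)
{
    assert(st != NULL);

    if (!st->seeded) {
        /* Fresh seed from the kernel RNG via getrandom(2). */
        if (getrandom(&st->x, sizeof st->x, 0) != (ssize_t)sizeof st->x)
            abort();
        st->seeded = 1;
    }

    /* One splitmix64 step. */
    uint64_t z = (st->x += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}
```

Two byte-for-byte copies of a seeded struct produce the same next ID, which is the duplicate-VM problem described above; zeroing both copies first makes each one draw its own seed.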
* Re: deconflicting new syscall numbers for 6.11 2024-07-04 17:21 ` Linus Torvalds 2024-07-04 17:33 ` Linus Torvalds 2024-07-04 17:46 ` Jason A. Donenfeld @ 2024-07-06 1:14 ` Mathieu Desnoyers 2024-07-06 10:01 ` Florian Weimer 2 siblings, 1 reply; 39+ messages in thread From: Mathieu Desnoyers @ 2024-07-06 1:14 UTC (permalink / raw) To: Linus Torvalds Cc: Jason A. Donenfeld, jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd, Adhemerval Zanella Netto, Zack Weinberg, Cristian Rodríguez, Florian Weimer, Wilco Dijkstra On 04-Jul-2024 10:21:34 AM, Linus Torvalds wrote: > On Thu, 4 Jul 2024 at 10:10, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > > > The three of us all have new syscalls planned for 6.11. Arnd suggested > > that we coordinate to deconflict, to make the merge easier. > > Nobody has explained to me what has changed since your last vdso > getrandom, and I'm not planning on pulling it unless that fundamental > flaw is fixed. > > Why is this _so_ critical that it needs a vdso? > > Why isn't user space just doing it itself? > > What's so magical about this all? > > This all seems entirely pointless to me still, because it's optimizing > something that nobody seems to care about, adding new VM > infrastructure, new magic system calls, yadda yadda. > > I was very sceptical last time, and absolutely _nothing_ has changed. > Not a peep on why it's now suddenly so hugely important again. > > We don't add stuff "just because we can". We need to have a damn good > reason for it. And I still don't see the reason, and I haven't seen > anybody even trying to explain the reason. [ Note: as I wrote down this email, I notice that you are heading towards the same conclusions I'm reaching on other sub-threads of this discussion. But I'm providing this feedback because it adds relevant information based on earlier discussions with libc developers. 
] Earlier this year in March, I've jumped into the discussion on the libc-alpha mailing list to understand the userspace RNG seeding requirements better. The interesting bits that explain how the kernel can play an important role start here: https://sourceware.org/pipermail/libc-alpha/2024-March/155534.html From an absolutely-not-security-expert perspective, here is how I see the desiderata breakdown: - There appears to be a need to make sure the random seed is not exposed across fork, core dump and other similar scenarios. This can be achieved by simply letting userspace use the appropriate madvise(2) advices on a memory mapping created through mmap(2). I don't see why there would be any need to create any RNG-centric ABI for this. If new madvise(2) advices are needed, they can simply be added there. - There appears to be interest in having a RNG faster than a system call for various reasons I'm not familiar with. A vDSO appears to be one way to do this. Another way would be to let userspace implement it all, which raises the following question: what is the minimal state known only by the kernel currently unknown from userspace ? This brings the following point. - Based on the libc-alpha discussion, I understand that the main thing the kernel knows about which is unknown from userspace is a sort-of generation counter, which tracks for instance the fact that the kernel was migrated to a different VM, or suspended and then resumed, and hence the current seed should be discarded and re-seeded entirely. I suspect that is the _key_ information that is currently missing from a purely userspace RNG perspective today. I hinted at extending the rseq(2) ABI for that purpose: exposing a generation counter for the RNG in a thread area shared between kernel and user-space. The per-thread area is already there and the hard work of integrating it with libc is mostly complete. 
Another alternative would be, as you hint elsewhere in this thread (https://lore.kernel.org/lkml/CAHk-=wgqD9h0Eb-n94ZEuK9SugnkczXvX497X=OdACVEhsw5xQ@mail.gmail.com/) to create a vDSO to expose exactly this kind of generation counter. Given this is not a thread-specific thing, it might be a better approach than the rseq per-thread area. So either I'm missing something important (please enlighten me), or we could achieve all those end-goals with a small fraction of the ABI complexity introduced by the vDSO as it is initially proposed. I don't think that just because there happen to be bad userspace RNG implementations out there we should give up on userspace and maintain all this complexity in the kernel. This is just working around userspace ecosystem issues by moving the implementation and maintenance burden into the kernel. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 39+ messages in thread
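The generation-counter scheme discussed in the message above can be sketched as follows. This is a sketch under loud assumptions: no such counter was exported by the kernel at the time of this thread, so fake_rng_generation is a stand-in variable that a test bumps by hand, and all names are hypothetical. It only shows the control flow: compare the cached generation with the current one, re-seed on mismatch, and re-check afterwards in case the counter moved while re-seeding.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/random.h>

/* Stand-in for a kernel-exported counter; purely hypothetical. */
_Atomic uint64_t fake_rng_generation = 1;

struct user_rng {
    uint64_t generation;  /* generation the seed was drawn under; 0 = never */
    uint64_t seed;        /* current seed material */
    uint64_t reseeds;     /* how many times we re-seeded (for inspection) */
};

/* Return a seed that is valid for the current generation. */
uint64_t user_rng_seed(struct user_rng *r)
{
    assert(r != NULL);

    for (;;) {
        uint64_t gen = atomic_load(&fake_rng_generation);

        if (r->generation != gen) {
            /* Generation changed (e.g. VM fork): discard and re-seed. */
            if (getrandom(&r->seed, sizeof r->seed, 0) !=
                (ssize_t)sizeof r->seed)
                abort();
            r->generation = gen;
            r->reseeds++;
        }

        /* If the counter moved while we were re-seeding, go around again. */
        if (atomic_load(&fake_rng_generation) == gen)
            return r->seed;
    }
}
```

In this shape the only kernel-provided input is one integer, which is the "minimal state known only by the kernel" the message asks about.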
* Re: deconflicting new syscall numbers for 6.11 2024-07-06 1:14 ` Mathieu Desnoyers @ 2024-07-06 10:01 ` Florian Weimer 2024-07-06 14:34 ` Zack Weinberg 0 siblings, 1 reply; 39+ messages in thread From: Florian Weimer @ 2024-07-06 10:01 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linus Torvalds, Jason A. Donenfeld, jolsa, mhiramat, cgzones, brauner, linux-kernel, arnd, Adhemerval Zanella Netto, Zack Weinberg, Cristian Rodríguez, Wilco Dijkstra * Mathieu Desnoyers: > From an absolutely-not-security-expert perspective, here is how I see > the desiderata breakdown: > > - There appears to be a need to make sure the random seed is not exposed > across fork, core dump and other similar scenarios. This can be > achieved by simply letting userspace use the appropriate madvise(2) > advices on a memory mapping created through mmap(2). I don't see why > there would be any need to create any RNG-centric ABI for this. If > new madvise(2) advices are needed, they can simply be added there. I don't think there's consensus about protecting coredumps and VM-level forks (migration where multiple clones continue executing). Personally, I'm not convinced either that it's sufficient to protect just the RNG from VM-level forks if nonce-reliant ciphers are involved. It needs careful consideration how these ciphers are used, and I'm not sure that VM-level fork protection for the RNG itself is even a critical part of that. (The ciphers are still deterministic, and the forks will compute the same result if the operations are ordered correctly, resulting in no information leak. Anyway, I don't understand why cryptographers prefer algorithms where nonces are so critical to avoid long-term key leaks.) > - There appears to be interest in having a RNG faster than a system call > for various reasons I'm not familiar with. A vDSO appears to be one > way to do this. 
Another way would be to let userspace implement it > all, which raises the following question: what is the minimal state > known only by the kernel currently unknown from userspace ? This > brings the following point. The history here is that we had a reasonably fast userspace implementation that could deal with the process fork case (which is much easier within glibc). It could not deal with VM-level forks. The goal was to provide something that is unpredictable in practice and about as fast as random() (or even rand()), so that programmers could just use arc4random() if they do not need a reproducible sequence and not worry about performance. We removed this implementation from glibc and replaced it with something that makes a system call on every arc4random call. The promise at the time was that we'll soon get a vDSO call to accelerate this, without the need for some sort of stream cipher in glibc. That hasn't happened so far. Meanwhile, it's been reported that if chrony uses arc4random from glibc, NTP server performance drops by 25%: Bug 29437 - arc4random is too slow <https://sourceware.org/bugzilla/show_bug.cgi?id=29437>. Obviously, we need to fix this eventually. The arc4random implementation in glibc was never intended to displace randomness generation for cryptographic purposes. It doesn't have to: none of the major cryptographic libraries will give up their RNG in favor of glibc's, so if you are doing cryptography, you already have a RNG recommended by the cryptographers that is ready to use. The arc4random implementation had a different use case, replacing random() and rand() calls, but it was somehow repurposed. Thanks, Florian ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-06 10:01 ` Florian Weimer @ 2024-07-06 14:34 ` Zack Weinberg 2024-07-06 15:30 ` Florian Weimer 0 siblings, 1 reply; 39+ messages in thread From: Zack Weinberg @ 2024-07-06 14:34 UTC (permalink / raw) To: Florian Weimer, Mathieu Desnoyers Cc: Linus Torvalds, Jason A. Donenfeld, jolsa, mhiramat, cgzones, brauner, linux-kernel, Arnd Bergmann, Adhemerval Zanella, Cristian Rodríguez, Wilco Dijkstra Without commenting on the rest of this... On Sat, Jul 6, 2024, at 6:01 AM, Florian Weimer wrote: > The arc4random implementation in glibc was never intended to displace > randomness generation for cryptographic purposes. ...arc4random on the BSDs (particularly on OpenBSD) *is* intended to be suitable for cryptographic purposes, and, simultaneously, intended to be fast enough that user space programs should never hesitate to use it. Therefore, Linux+glibc needs to be prepared for user space programs to use it that way -- expecting both speed and cryptographic strength -- or else we shouldn't have added it at all. zw ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-06 14:34 ` Zack Weinberg @ 2024-07-06 15:30 ` Florian Weimer 2024-07-07 20:57 ` Adhemerval Zanella Netto 0 siblings, 1 reply; 39+ messages in thread From: Florian Weimer @ 2024-07-06 15:30 UTC (permalink / raw) To: Zack Weinberg Cc: Mathieu Desnoyers, Linus Torvalds, Jason A. Donenfeld, jolsa, mhiramat, cgzones, brauner, linux-kernel, Arnd Bergmann, Adhemerval Zanella, Cristian Rodríguez, Wilco Dijkstra * Zack Weinberg: > Without commenting on the rest of this... > > On Sat, Jul 6, 2024, at 6:01 AM, Florian Weimer wrote: >> The arc4random implementation in glibc was never intended to displace >> randomness generation for cryptographic purposes. > > ...arc4random on the BSDs (particularly on OpenBSD) *is* intended to be > suitable for cryptographic purposes, and, simultaneously, intended to be > fast enough that user space programs should never hesitate to use it. > Therefore, Linux+glibc needs to be prepared for user space programs to > use it that way -- expecting both speed and cryptographic strength -- > or else we shouldn't have added it at all. None of the major cryptographic libraries (OpenSSL, NSS, nettle, libgcrypt, OpenJDK, Go, GNUTLS) use arc4random in their upstream version. If the BSDs use arc4random rather than the bundled generators, they must have downstream-only patches. I also don't see why someone writing a new library from scratch would use arc4random because its addition to glibc is still quite recent, and it provides no performance advantage over going to the kernel interfaces directly. Thanks, Florian ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: deconflicting new syscall numbers for 6.11 2024-07-06 15:30 ` Florian Weimer @ 2024-07-07 20:57 ` Adhemerval Zanella Netto 0 siblings, 0 replies; 39+ messages in thread From: Adhemerval Zanella Netto @ 2024-07-07 20:57 UTC (permalink / raw) To: Florian Weimer, Zack Weinberg Cc: Mathieu Desnoyers, Linus Torvalds, Jason A. Donenfeld, jolsa, mhiramat, cgzones, brauner, linux-kernel, Arnd Bergmann, Cristian Rodríguez, Wilco Dijkstra On 06/07/24 12:30, Florian Weimer wrote: > * Zack Weinberg: > >> Without commenting on the rest of this... >> >> On Sat, Jul 6, 2024, at 6:01 AM, Florian Weimer wrote: >>> The arc4random implementation in glibc was never intended to displace >>> randomness generation for cryptographic purposes. >> >> ...arc4random on the BSDs (particularly on OpenBSD) *is* intended to be >> suitable for cryptographic purposes, and, simultaneously, intended to be >> fast enough that user space programs should never hesitate to use it. >> Therefore, Linux+glibc needs to be prepared for user space programs to >> use it that way -- expecting both speed and cryptographic strength -- >> or else we shouldn't have added it at all. > > None of the major cryptographic libraries (OpenSSL, NSS, nettle, > libgcrypt, OpenJDK, Go, GNUTLS) use arc4random in their upstream > version. If the BSDs use arc4random rather than the bundled generators, > they must have downstream-only patches. I also don't see why someone > writing a new library from scratch would use arc4random because its > addition to glibc is still quite recent, and it provides no performance > advantage over going to the kernel interfaces directly. The BSDs seem to use it extensively, especially in the base system for tools like smtpd/relayd/etc. as an alternative to rand/random and to avoid pulling in an RNG from a cryptographic library. 
But I agree that for glibc, arc4random being just a shim over getrandom is only helpful as a way to avoid a biased implementation of arc4random_uniform (which is quite common if you search the internet for it...). Also, this vDSO proposal, and the way it is now up to the kernel to manage the RNG state, would add some extra considerations for the libc getrandom implementation. The libc symbol is currently fully async-signal-safe and thread-safe because it is just a syscall wrapper. To sanely manage the vDSO buffer as designed (allocated either by vgetrandom_alloc or mmap), the runtime will need a way to allocate and manage these per-thread states with a block allocator (assuming the runtime would like to keep a per-thread state). For arc4random, the libbsd way or the way glibc used to do it (prior to Jason's refactor) would be simple, because it was never intended to be async-signal-safe. But getrandom would require either an async-signal-safe malloc implementation (to keep track of the extra states) or a block allocation over mmap (which adds some extra memory usage). So getrandom will now potentially use 2 more pages, which is not the end of the world since the interface is designed to allow failure, but it is something to consider. ^ permalink raw reply [flat|nested] 39+ messages in thread
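On the "biased implementation of arc4random_uniform" remark above: the usual mistake is returning r % upper_bound directly, which over-represents small results whenever 2^32 is not a multiple of upper_bound. A common fix is rejection sampling, sketched below. The function names are hypothetical, the raw 32-bit source is getrandom(2) for simplicity, and this is not presented as glibc's or any BSD's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/random.h>

/* Hypothetical raw source: 32 random bits from the kernel. */
uint32_t random_u32(void)
{
    uint32_t r;

    if (getrandom(&r, sizeof r, 0) != (ssize_t)sizeof r)
        abort();
    return r;
}

/*
 * Return a value uniformly distributed in [0, upper_bound).
 * Values below (2^32 mod upper_bound) are rejected so that the
 * remaining range is an exact multiple of upper_bound.
 */
uint32_t uniform_u32(uint32_t upper_bound)
{
    if (upper_bound < 2)
        return 0;

    /* In 32-bit unsigned arithmetic, -n % n equals 2^32 mod n. */
    uint32_t min = -upper_bound % upper_bound;

    for (;;) {
        uint32_t r = random_u32();
        if (r >= min)
            return r % upper_bound;
    }
}
```

The loop rejects fewer than half of all draws in the worst case, so the expected number of iterations is below two.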