linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH v6 2/3] random: introduce generic vDSO getrandom() implementation
       [not found] ` <20221121152909.3414096-3-Jason@zx2c4.com>
@ 2022-11-23  8:51   ` Rasmus Villemoes
  2022-11-24  1:18     ` Jason A. Donenfeld
  0 siblings, 1 reply; 9+ messages in thread
From: Rasmus Villemoes @ 2022-11-23  8:51 UTC (permalink / raw)
  To: Jason A. Donenfeld, linux-kernel, patches, tglx
  Cc: linux-crypto, x86, Greg Kroah-Hartman, Adhemerval Zanella Netto,
	Carlos O'Donell, Linux API

On 21/11/2022 16.29, Jason A. Donenfeld wrote:

Cc += linux-api

> 
>       if (!new_block)
>         goto out;
>       new_cap = grnd_allocator.cap + num;
>       new_states = reallocarray(grnd_allocator.states, new_cap, sizeof(*grnd_allocator.states));
>       if (!new_states) {
>         munmap(new_block, num * size_per_each);

Hm. This does leak an implementation detail of vgetrandom_alloc(),
namely that it is based on mmap() of that size rounded up to page size.
Do we want to commit to this being the proper way of disposing of a
succesful vgetrandom_alloc(), or should there also be a
vgetrandom_free(void *states, long num, long size_per_each)?

And if so, what color should the bikeshed really have. I.e.,

- does it need to take that size_per_each parameter which the kernel knows

- should it rather take the product so it can for now be a simple alias
for munmap

- should it also have a flags argument just because that's what all
well-behaving syscalls have these days...

Also, should vgetrandom_alloc() take a void *hint argument that
would/could be passed through to mmap() to give userspace some control
over where the memory is located - possibly only in the future, i.e.
insist on it being NULL for now, but it could open the possibility for
adding e.g. VGRND_MAP_FIXED[_NOREPLACE] that would translate to the
corresponding MAP_ flags.

Rasmus


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 2/3] random: introduce generic vDSO getrandom() implementation
  2022-11-23  8:51   ` [PATCH v6 2/3] random: introduce generic vDSO getrandom() implementation Rasmus Villemoes
@ 2022-11-24  1:18     ` Jason A. Donenfeld
  2022-11-25  8:02       ` Rasmus Villemoes
  0 siblings, 1 reply; 9+ messages in thread
From: Jason A. Donenfeld @ 2022-11-24  1:18 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: linux-kernel, patches, tglx, linux-crypto, x86,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	Linux API

Hi Rasmus,

On Wed, Nov 23, 2022 at 09:51:04AM +0100, Rasmus Villemoes wrote:
> On 21/11/2022 16.29, Jason A. Donenfeld wrote:
> 
> Cc += linux-api
> 
> > 
> >       if (!new_block)
> >         goto out;
> >       new_cap = grnd_allocator.cap + num;
> >       new_states = reallocarray(grnd_allocator.states, new_cap, sizeof(*grnd_allocator.states));
> >       if (!new_states) {
> >         munmap(new_block, num * size_per_each);
> 
> Hm. This does leak an implementation detail of vgetrandom_alloc(),
> namely that it is based on mmap() of that size rounded up to page size.
> Do we want to commit to this being the proper way of disposing of a
> succesful vgetrandom_alloc(), or should there also be a
> vgetrandom_free(void *states, long num, long size_per_each)?

Yes, this is intentional, and this is exactly what I wanted to do. There
are various wrappers of vm_mmap() throughout, mmap being one of them,
and they typically then resort to munmap to unmap it. This is how
userspace handles memory - maps, always maps. So I think doing that is
fine and consistent.

However, your point about it relying on it being a rounded up size isn't
correct. `munmap` will unmap the whole page if the size you pass lies
within a page. So `num * size_of_each` will always do the right thing,
without needing to have userspace code round anything up. (From the man
page: "The  address addr must be a multiple of the page size (but length
need not be). All pages containing a part of the indicated range are
unmapped.") And as you can see in my example code, nothing is rounded
up. So I don't know why you made that comment.

> And if so, what color should the bikeshed really have. I.e.,

No color, thanks.

> Also, should vgetrandom_alloc() take a void *hint argument that
> would/could be passed through to mmap() to give userspace some control
> over where the memory is located - possibly only in the future, i.e.
> insist on it being NULL for now, but it could open the possibility for
> adding e.g. VGRND_MAP_FIXED[_NOREPLACE] that would translate to the
> corresponding MAP_ flags.

I think adding more control is exactly what this is trying to avoid.
It's very intentionally *not* a general allocator function, but
something specific for vDSO getrandom(). However, it does already, in
this very patchset here, take a (currently unused) flags argument, in
case we have the need for later extension.

In the meantime, however, I'm not very interested in complicating this
interface into oblivion. Firstly, it ensures nothing will get done. But
moreover, this interface needs to be somewhat future-proof, yes, but it
does not need to be a general syscall; rather, it's a specific syscall
for a specific task.

Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 1/3] random: add vgetrandom_alloc() syscall
       [not found]           ` <87cz9c5z1f.fsf@oldenburg.str.redhat.com>
@ 2022-11-24 12:24             ` Jason A. Donenfeld
  2022-11-24 12:48               ` Jason A. Donenfeld
                                 ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Jason A. Donenfeld @ 2022-11-24 12:24 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, patches, tglx, linux-crypto, x86,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	linux-api

Hi Florian,

On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> * Jason A. Donenfeld:
> 
> > Hi Florian,
> >
> > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> >> * Jason A. Donenfeld:
> >> 
> >> > Hi Florian,
> >> >
> >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> >> >> * Jason A. Donenfeld:
> >> >> 
> >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> >> >> > + * function provides to userspace, by mapping a certain number of special pages
> >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> >> >> > + * desired, and returns the number of opaque states actually allocated, the
> >> >> > + * size of each one in bytes, and the address of the first state.
> >> >> > + */
> >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> >> >> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> >> >> 
> >> >> I think you should make this __u64, so that you get a consistent
> >> >> userspace interface on all architectures, without the need for compat
> >> >> system calls.
> >> >
> >> > That would be quite unconventional. Most syscalls that take lengths do
> >> > so with the native register size (`unsigned long`, `size_t`), rather
> >> > than u64. If you can point to a recent trend away from this by
> >> > indicating some commits that added new syscalls with u64, I'd be happy
> >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> >> 
> >> See clone3 and struct clone_args.
> >
> > The struct is one thing. But actually, clone3 takes a `size_t`:
> >
> >     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> >
> > I take from this that I too should use `size_t` rather than `unsigned
> > long.` And it doesn't seem like there's any compat clone3.
> 
> But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> You need to look at the contents for struct clone_args for comparison.

Ah! I see what you mean; that's a good point. The usual register
clearing thing isn't going to happen because these are addresses.

I still am somewhat hesitant, though, because `size_t` is really the
"proper" type to be used. Maybe the compat syscall thing is just a
necessary evil?

The other direction would be making this a u32, since 640k ought to be
enough for anybody and such, but maybe that'd be a mistake too.

So I'm not sure. Anybody else on the list with experience adding
syscalls have an opinion?

Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 1/3] random: add vgetrandom_alloc() syscall
  2022-11-24 12:24             ` [PATCH v6 1/3] random: add vgetrandom_alloc() syscall Jason A. Donenfeld
@ 2022-11-24 12:48               ` Jason A. Donenfeld
  2022-11-24 13:18                 ` Arnd Bergmann
  2022-11-24 12:49               ` Christian Brauner
  2022-11-24 16:30               ` Jason A. Donenfeld
  2 siblings, 1 reply; 9+ messages in thread
From: Jason A. Donenfeld @ 2022-11-24 12:48 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, patches, tglx, linux-crypto, x86,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	linux-api

Hey again,

On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:
> Hi Florian,
> 
> On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> > * Jason A. Donenfeld:
> > 
> > > Hi Florian,
> > >
> > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> > >> * Jason A. Donenfeld:
> > >> 
> > >> > Hi Florian,
> > >> >
> > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> > >> >> * Jason A. Donenfeld:
> > >> >> 
> > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > >> >> > + * function provides to userspace, by mapping a certain number of special pages
> > >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> > >> >> > + * desired, and returns the number of opaque states actually allocated, the
> > >> >> > + * size of each one in bytes, and the address of the first state.
> > >> >> > + */
> > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > >> >> > +                unsigned long __user *, size_per_each, unsigned int, flags)
> > >> >> 
> > >> >> I think you should make this __u64, so that you get a consistent
> > >> >> userspace interface on all architectures, without the need for compat
> > >> >> system calls.
> > >> >
> > >> > That would be quite unconventional. Most syscalls that take lengths do
> > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > >> > than u64. If you can point to a recent trend away from this by
> > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > >> 
> > >> See clone3 and struct clone_args.
> > >
> > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > >
> > >     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > >
> > > I take from this that I too should use `size_t` rather than `unsigned
> > > long.` And it doesn't seem like there's any compat clone3.
> > 
> > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > You need to look at the contents for struct clone_args for comparison.
> 
> Ah! I see what you mean; that's a good point. The usual register
> clearing thing isn't going to happen because these are addresses.
> 
> I still am somewhat hesitant, though, because `size_t` is really the
> "proper" type to be used. Maybe the compat syscall thing is just a
> necessary evil?
> 
> The other direction would be making this a u32, since 640k ought to be
> enough for anybody and such, but maybe that'd be a mistake too.
> 
> So I'm not sure. Anybody else on the list with experience adding
> syscalls have an opinion?

Looks like set_mempolicy, get_mempoliy, and migrate_pages pass an
unsigned long pointer and I don't see any compat stuff around it:

    SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
                    unsigned long, maxnode)
    
    SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
                    unsigned long __user *, nmask, unsigned long, maxnode,
                    unsigned long, addr, unsigned long, flags)

    SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
                    const unsigned long __user *, old_nodes,
                    const unsigned long __user *, new_nodes)
     

In contrast sched_setaffinity and get_robust_list take a unsigned long
pointer and does have a compat wrapper:

    SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
                    unsigned long __user *, user_mask_ptr)

    SYSCALL_DEFINE3(get_robust_list, int, pid,
                    struct robust_list_head __user * __user *, head_ptr,
                    size_t __user *, len_ptr)

Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 1/3] random: add vgetrandom_alloc() syscall
  2022-11-24 12:24             ` [PATCH v6 1/3] random: add vgetrandom_alloc() syscall Jason A. Donenfeld
  2022-11-24 12:48               ` Jason A. Donenfeld
@ 2022-11-24 12:49               ` Christian Brauner
  2022-11-24 12:57                 ` Jason A. Donenfeld
  2022-11-24 16:30               ` Jason A. Donenfeld
  2 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2022-11-24 12:49 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Florian Weimer, linux-kernel, patches, tglx, linux-crypto, x86,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	linux-api, Arnd Bergmann

On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:
> Hi Florian,
> 
> On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> > * Jason A. Donenfeld:
> > 
> > > Hi Florian,
> > >
> > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> > >> * Jason A. Donenfeld:
> > >> 
> > >> > Hi Florian,
> > >> >
> > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> > >> >> * Jason A. Donenfeld:
> > >> >> 
> > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > >> >> > + * function provides to userspace, by mapping a certain number of special pages
> > >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> > >> >> > + * desired, and returns the number of opaque states actually allocated, the
> > >> >> > + * size of each one in bytes, and the address of the first state.
> > >> >> > + */
> > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > >> >> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> > >> >> 
> > >> >> I think you should make this __u64, so that you get a consistent
> > >> >> userspace interface on all architectures, without the need for compat
> > >> >> system calls.
> > >> >
> > >> > That would be quite unconventional. Most syscalls that take lengths do
> > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > >> > than u64. If you can point to a recent trend away from this by
> > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > >> 
> > >> See clone3 and struct clone_args.

For system calls that take structs as arguments we use u64 in the struct
for proper alignment so we can extend structs without regressing old
kernels. We have a few of those extensible struct system calls.

But we don't really have a lot system calls that pass u64 as a pointer
outside of a structure so far. Neither as register and nor as pointer
iirc. Passing them as a register arg is problematic because of 32bit
arches. But passing as pointer should be fine but it is indeed uncommon.

> > >
> > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > >
> > >     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > >
> > > I take from this that I too should use `size_t` rather than `unsigned
> > > long.` And it doesn't seem like there's any compat clone3.
> > 
> > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > You need to look at the contents for struct clone_args for comparison.
> 
> Ah! I see what you mean; that's a good point. The usual register
> clearing thing isn't going to happen because these are addresses.
> 
> I still am somewhat hesitant, though, because `size_t` is really the
> "proper" type to be used. Maybe the compat syscall thing is just a
> necessary evil?

We try to avoid adding new compat-requiring syscalls like the plague
usually. (At least for new syscalls that don't need to inherit behavior
from earlier syscalls they are a revisions of.)

> 
> The other direction would be making this a u32, since 640k ought to be
> enough for anybody and such, but maybe that'd be a mistake too.

I think making this a size_t is fine. We haven't traditionally used u32
for sizes. All syscalls that pass structs versioned by size use size_t.
So I would recommend to stick with that.

Alternatively, you could also introduce a simple struct versioned by
size for this system call similar to mount_setatt() and clone3() and so
on. This way you don't need to worry about future extensibilty. Just a
thought.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 1/3] random: add vgetrandom_alloc() syscall
  2022-11-24 12:49               ` Christian Brauner
@ 2022-11-24 12:57                 ` Jason A. Donenfeld
  0 siblings, 0 replies; 9+ messages in thread
From: Jason A. Donenfeld @ 2022-11-24 12:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Florian Weimer, linux-kernel, patches, tglx, linux-crypto, x86,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	linux-api, Arnd Bergmann

Hi Christian,

Thanks a bunch for chiming in.

On Thu, Nov 24, 2022 at 01:49:27PM +0100, Christian Brauner wrote:
> Alternatively, you could also introduce a simple struct versioned by
> size for this system call similar to mount_setatt() and clone3() and so
> on. This way you don't need to worry about future extensibilty. Just a
> thought.

Briefly considered that, but it seemed a bit heavy for something like
this. I'm not super heavily opposed, but just seemed like a bit much.

> > > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > > >> >> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> > > >> >> 
> > > >> >> I think you should make this __u64, so that you get a consistent
> > > >> >> userspace interface on all architectures, without the need for compat
> > > >> >> system calls.
> > > >> >
> > > >> > That would be quite unconventional. Most syscalls that take lengths do
> > > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > > >> > than u64. If you can point to a recent trend away from this by
> > > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > > >> 
> > > >> See clone3 and struct clone_args.
> 
> For system calls that take structs as arguments we use u64 in the struct
> for proper alignment so we can extend structs without regressing old
> kernels. We have a few of those extensible struct system calls.
> 
> But we don't really have a lot system calls that pass u64 as a pointer
> outside of a structure so far. Neither as register and nor as pointer
> iirc.

Right, the __u64_aligned business seemed to be mostly about
extensibility.

> > > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > > >
> > > >     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > > >
> > > > I take from this that I too should use `size_t` rather than `unsigned
> > > > long.` And it doesn't seem like there's any compat clone3.
> > > 
> > > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > > You need to look at the contents for struct clone_args for comparison.
> > 
> > Ah! I see what you mean; that's a good point. The usual register
> > clearing thing isn't going to happen because these are addresses.
> > 
> > I still am somewhat hesitant, though, because `size_t` is really the
> > "proper" type to be used. Maybe the compat syscall thing is just a
> > necessary evil?
>
> I think making this a size_t is fine. We haven't traditionally used u32
> for sizes. All syscalls that pass structs versioned by size use size_t.
> So I would recommend to stick with that.

This isn't quite a struct versioned by size. This is:

    void *vgetrandom_alloc([inout] size_t *num, [out] size_t *size_per_each, unsigned int flags);

You give it an input 'num' and some flags (currently flags=0), and it
gives you back an output 'num' size, an output 'size_per_each' size, and
an opaque pointer value mapping as its return value.

I do like the idea of keeping size_t so that the type is "right". But
the other arguments are equally compelling as well, so not sure.

Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 1/3] random: add vgetrandom_alloc() syscall
  2022-11-24 12:48               ` Jason A. Donenfeld
@ 2022-11-24 13:18                 ` Arnd Bergmann
  0 siblings, 0 replies; 9+ messages in thread
From: Arnd Bergmann @ 2022-11-24 13:18 UTC (permalink / raw)
  To: Jason A . Donenfeld, Florian Weimer
  Cc: linux-kernel, patches, Thomas Gleixner, linux-crypto, x86,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	linux-api

On Thu, Nov 24, 2022, at 13:48, Jason A. Donenfeld wrote:
> On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:

> Looks like set_mempolicy, get_mempoliy, and migrate_pages pass an
> unsigned long pointer and I don't see any compat stuff around it:
>
>     SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long 
> __user *, nmask,
>                     unsigned long, maxnode)
>    
>     SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
>                     unsigned long __user *, nmask, unsigned long, maxnode,
>                     unsigned long, addr, unsigned long, flags)
>
>     SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
>                     const unsigned long __user *, old_nodes,
>                     const unsigned long __user *, new_nodes)

Compat handling for these is done all the way down in the
pointer access:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/mempolicy.c#n1368

This works here because it's a special bitmap but is not the
best approach if you just have a pointer to a single value.

       Arnd

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 1/3] random: add vgetrandom_alloc() syscall
  2022-11-24 12:24             ` [PATCH v6 1/3] random: add vgetrandom_alloc() syscall Jason A. Donenfeld
  2022-11-24 12:48               ` Jason A. Donenfeld
  2022-11-24 12:49               ` Christian Brauner
@ 2022-11-24 16:30               ` Jason A. Donenfeld
  2 siblings, 0 replies; 9+ messages in thread
From: Jason A. Donenfeld @ 2022-11-24 16:30 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, patches, tglx, linux-crypto, x86,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	linux-api

On Thu, Nov 24, 2022 at 01:24:42PM +0100, Jason A. Donenfeld wrote:
> Hi Florian,
> 
> On Thu, Nov 24, 2022 at 01:15:24PM +0100, Florian Weimer wrote:
> > * Jason A. Donenfeld:
> > 
> > > Hi Florian,
> > >
> > > On Thu, Nov 24, 2022 at 06:25:39AM +0100, Florian Weimer wrote:
> > >> * Jason A. Donenfeld:
> > >> 
> > >> > Hi Florian,
> > >> >
> > >> > On Wed, Nov 23, 2022 at 11:46:58AM +0100, Florian Weimer wrote:
> > >> >> * Jason A. Donenfeld:
> > >> >> 
> > >> >> > + * The vgetrandom() function in userspace requires an opaque state, which this
> > >> >> > + * function provides to userspace, by mapping a certain number of special pages
> > >> >> > + * into the calling process. It takes a hint as to the number of opaque states
> > >> >> > + * desired, and returns the number of opaque states actually allocated, the
> > >> >> > + * size of each one in bytes, and the address of the first state.
> > >> >> > + */
> > >> >> > +SYSCALL_DEFINE3(vgetrandom_alloc, unsigned long __user *, num,
> > >> >> > +		unsigned long __user *, size_per_each, unsigned int, flags)
> > >> >> 
> > >> >> I think you should make this __u64, so that you get a consistent
> > >> >> userspace interface on all architectures, without the need for compat
> > >> >> system calls.
> > >> >
> > >> > That would be quite unconventional. Most syscalls that take lengths do
> > >> > so with the native register size (`unsigned long`, `size_t`), rather
> > >> > than u64. If you can point to a recent trend away from this by
> > >> > indicating some commits that added new syscalls with u64, I'd be happy
> > >> > to be shown otherwise. But AFAIK, that's not the way it's done.
> > >> 
> > >> See clone3 and struct clone_args.
> > >
> > > The struct is one thing. But actually, clone3 takes a `size_t`:
> > >
> > >     SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> > >
> > > I take from this that I too should use `size_t` rather than `unsigned
> > > long.` And it doesn't seem like there's any compat clone3.
> > 
> > But vgetrandom_alloc does not use unsigned long, but unsigned long *.
> > You need to look at the contents for struct clone_args for comparison.
> 
> The other direction would be making this a u32

I think `unsigned int` is actually a sensible size for what these values
should be. That eliminates the problem and potential bikeshed too. So
I'll go with that for v+1.

Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6 2/3] random: introduce generic vDSO getrandom() implementation
  2022-11-24  1:18     ` Jason A. Donenfeld
@ 2022-11-25  8:02       ` Rasmus Villemoes
  0 siblings, 0 replies; 9+ messages in thread
From: Rasmus Villemoes @ 2022-11-25  8:02 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: linux-kernel, patches, tglx, linux-crypto, x86,
	Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell,
	Linux API

On 24/11/2022 02.18, Jason A. Donenfeld wrote:
> Hi Rasmus,
> 
> On Wed, Nov 23, 2022 at 09:51:04AM +0100, Rasmus Villemoes wrote:
>> On 21/11/2022 16.29, Jason A. Donenfeld wrote:
>>
>> Cc += linux-api
>>
>>>
>>>       if (!new_block)
>>>         goto out;
>>>       new_cap = grnd_allocator.cap + num;
>>>       new_states = reallocarray(grnd_allocator.states, new_cap, sizeof(*grnd_allocator.states));
>>>       if (!new_states) {
>>>         munmap(new_block, num * size_per_each);
>>
>> Hm. This does leak an implementation detail of vgetrandom_alloc(),
>> namely that it is based on mmap() of that size rounded up to page size.
>> Do we want to commit to this being the proper way of disposing of a
>> succesful vgetrandom_alloc(), or should there also be a
>> vgetrandom_free(void *states, long num, long size_per_each)?
> 
> Yes, this is intentional, and this is exactly what I wanted to do. There
> are various wrappers of vm_mmap() throughout, mmap being one of them,
> and they typically then resort to munmap to unmap it. This is how
> userspace handles memory - maps, always maps. So I think doing that is
> fine and consistent.

OK. Perhaps for the benefit of future libc implementors drop a comment
somewhere as to how to dealloc the blob.

> However, your point about it relying on it being a rounded up size isn't
> correct. `munmap` will unmap the whole page if the size you pass lies
> within a page. So `num * size_of_each` will always do the right thing,
> without needing to have userspace code round anything up. (From the man
> page: "The  address addr must be a multiple of the page size (but length
> need not be). 

I know, and I never said userspace needed to round anything up.

All pages containing a part of the indicated range are
> unmapped.") And as you can see in my example code, nothing is rounded
> up. So I don't know why you made that comment.

I made that comment because it's clear from what this does that you get
something back that is _at least_ num*size_per_each in size, but what is
not clear is that the actual allocation is exactly and will always be
that size rounded up to a page size (and no more), so that
munmap(num*size_per_each), with its well-known and documented semantics,
will DTRT.

> I think adding more control is exactly what this is trying to avoid.
> It's very intentionally *not* a general allocator function, but
> something specific for vDSO getrandom(). However, it does already, in
> this very patchset here, take a (currently unused) flags argument, in
> case we have the need for later extension.

OK.

Perhaps you can spend a few more words on why this allocation _needs_ to
be MAP_LOCKED? That seems somewhat of a policy thing imposed by the
kernel, something that would be better left to the libc or distro or
whatnot to request via a flag. I could imagine applications that
currently run at the mlock limit start failing after a libc upgrade -
which could of course be considered a libc problem, and perhaps it's too
unlikely to worry about.

Rasmus


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-11-25  8:02 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20221121152909.3414096-1-Jason@zx2c4.com>
     [not found] ` <20221121152909.3414096-3-Jason@zx2c4.com>
2022-11-23  8:51   ` [PATCH v6 2/3] random: introduce generic vDSO getrandom() implementation Rasmus Villemoes
2022-11-24  1:18     ` Jason A. Donenfeld
2022-11-25  8:02       ` Rasmus Villemoes
     [not found] ` <20221121152909.3414096-2-Jason@zx2c4.com>
     [not found]   ` <87v8n6lzh9.fsf@oldenburg.str.redhat.com>
     [not found]     ` <Y37DDX5RtiGsV6MO@zx2c4.com>
     [not found]       ` <87a64g7wks.fsf@oldenburg.str.redhat.com>
     [not found]         ` <Y39djiBSmgXfgWJv@zx2c4.com>
     [not found]           ` <87cz9c5z1f.fsf@oldenburg.str.redhat.com>
2022-11-24 12:24             ` [PATCH v6 1/3] random: add vgetrandom_alloc() syscall Jason A. Donenfeld
2022-11-24 12:48               ` Jason A. Donenfeld
2022-11-24 13:18                 ` Arnd Bergmann
2022-11-24 12:49               ` Christian Brauner
2022-11-24 12:57                 ` Jason A. Donenfeld
2022-11-24 16:30               ` Jason A. Donenfeld

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).