Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Florian Weimer @ 2019-06-10 17:28 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, linux-kernel, linux-doc, linux-mm,
	linux-arch, linux-api, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, H.J. Lu, Jann Horn, Jonathan Corbet,
	Kees Cook, Mike Kravetz
In-Reply-To: <5dc357f5858f8036cad5847cfe214401bb9138bf.camel@intel.com>

* Yu-cheng Yu:

> On Fri, 2019-06-07 at 14:09 -0700, Dave Hansen wrote:
>> On 6/7/19 1:06 PM, Yu-cheng Yu wrote:
>> > > Huh, how does glibc know about all possible past and future legacy code
>> > > in the application?
>> > 
>> > When dlopen() gets a legacy binary and the policy allows that, it will
>> > manage
>> > the bitmap:
>> > 
>> >   If a bitmap has not been created, create one.
>> >   Set bits for the legacy code being loaded.
>> 
>> I was thinking about code that doesn't go through GLIBC like JITs.
>
> If JIT manages the bitmap, it knows where it is.
> It can always read the bitmap again, right?

The problem are JIT libraries without assembler code which can be marked
non-CET, such as liborc.  Our builds (e.g., orc-0.4.29-2.fc30.x86_64)
currently carries the IBT and SHSTK flag, although the entry points into
the generated code do not start with ENDBR, so that a jump to them will
fault with the CET enabled.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH 06/13] keys: Add a notification facility [ver #4]
From: David Howells @ 2019-06-10 17:47 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: dhowells, viro, raven, linux-fsdevel, linux-api, linux-block,
	keyrings, linux-security-module, linux-kernel
In-Reply-To: <20190610111110.72468326@lwn.net>

Jonathan Corbet <corbet@lwn.net> wrote:

> > 	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
> 
> One little nit: it seems that keyctl_watch_key is actually spelled
> keyctl(KEYCTL_WATCH_KEY, ...).

Yeah - but there'll be a wrapper for it in -lkeyutils.  The syscalls I added
in other patches are, technically, referred to syscall(__NR_xxx, ...) at the
moment too.

David

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-10 17:59 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <5dc357f5858f8036cad5847cfe214401bb9138bf.camel@intel.com>

On 6/10/19 9:05 AM, Yu-cheng Yu wrote:
> On Fri, 2019-06-07 at 14:09 -0700, Dave Hansen wrote:
>> On 6/7/19 1:06 PM, Yu-cheng Yu wrote:
>>>> Huh, how does glibc know about all possible past and future legacy code
>>>> in the application?
>>> When dlopen() gets a legacy binary and the policy allows that, it will
>>> manage
>>> the bitmap:
>>>
>>>   If a bitmap has not been created, create one.
>>>   Set bits for the legacy code being loaded.
>> I was thinking about code that doesn't go through GLIBC like JITs.
> If JIT manages the bitmap, it knows where it is.
> It can always read the bitmap again, right?

Let's just be clear:

The design proposed here is that all code mappers (anybody wanting to
get legacy non-CET code into the address space):

1. Know about CET
2. Know where the bitmap is, and identify the part that needs to be
   changed
3. Be able to mprotect() the bitmap to be writable (undoing glibc's
   PROT_READ)
4. Set the bits in the bitmap for the legacy code
5. mprotect() the bitmap back to PROT_READ

Do the non-glibc code mappers have glibc interfaces for this?
Otherwise, how could a bunch of JITs in a big multi-threaded application
possibly coordinate the mprotect()s?  Won't they race with each other?

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Casey Schaufler @ 2019-06-10 18:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Stephen Smalley, David Howells, Al Viro, USB list, LSM List,
	Greg Kroah-Hartman, raven, Linux FS Devel, Linux API, linux-block,
	keyrings, LKML, Paul Moore, casey
In-Reply-To: <CALCETrX5O18q2=dUeC=hEtK2=t5KQpGBy9XveHxFw36OqkbNOg@mail.gmail.com>

On 6/10/2019 9:42 AM, Andy Lutomirski wrote:
> On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
>>> On 6/7/19 10:17 AM, David Howells wrote:
>>>> Hi Al,
>>>>
>>>> Here's a set of patches to add a general variable-length notification queue
>>>> concept and to add sources of events for:
>>>>
>>>>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>>>       mount reconfiguration.
>>>>
>>>>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>>>       errors (not complete yet).
>>>>
>>>>   (3) Key/keyring events, such as creating, linking and removal of keys.
>>>>
>>>>   (4) General device events (single common queue) including:
>>>>
>>>>       - Block layer events, such as device errors
>>>>
>>>>       - USB subsystem events, such as device/bus attach/remove, device
>>>>         reset, device errors.
>>>>
>>>> One of the reasons for this is so that we can remove the issue of processes
>>>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>>>> be a system performance problem.  To further aid this, the fsinfo() syscall
>>>> on which this patch series depends, provides a way to access superblock and
>>>> mount information in binary form without the need to parse /proc/mounts.
>>>>
>>>>
>>>> LSM support is included, but controversial:
>>>>
>>>>   (1) The creds of the process that did the fput() that reduced the refcount
>>>>       to zero are cached in the file struct.
>>>>
>>>>   (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>       doing the cleanup, thereby making sure that the creds seen by the
>>>>       destruction notification generated by mntput() appears to come from
>>>>       the last fputter.
>>>>
>>>>   (3) security_post_notification() is called for each queue that we might
>>>>       want to post a notification into, thereby allowing the LSM to prevent
>>>>       covert communications.
>>>>
>>>>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>       may be set in the first place?  I might need to add a variant per
>>>>       watch-type.
>>>>
>>>>   (?) Do I really need to keep track of the process creds in which an
>>>>       implicit object destruction happened?  For example, imagine you create
>>>>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>       refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>       someone looking at that fd through procfs at the same time as you exit
>>>>       due to an error.  The LSM sees the destruction notification come from
>>>>       the looker if they happen to do their fput() after yours.
>>> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>>>
>>> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>>>
>>> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>> I cannot agree. There is an explicit action by a subject that results
>> in information being delivered to an object. Just like a signal or a
>> UDP packet delivery. Smack handles this kind of thing just fine. The
>> internal mechanism that results in the access is irrelevant from
>> this viewpoint. I can understand how a mechanism like SELinux that
>> works on finer granularity might view it differently.
> I think you really need to give an example of a coherent policy that
> needs this.

I keep telling you, and you keep ignoring what I say.

>   As it stands, your analogy seems confusing.

It's pretty simple. I have given both the abstract
and examples.

>   If someone
> changes the system clock, we don't restrict who is allowed to be
> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
> was changed based on who changed the clock.

That's right. The system clock is not an object that
unprivileged processes can modify. In fact, it is not
an object at all. If you care to look, you will see that
Smack does nothing with the clock.

>   Similarly, if someone
> tries to receive a packet on a socket, we check whether they have the
> right to receive on that socket (from the endpoint in question) and,
> if the sender is local, whether the sender can send to that socket.
> We do not check whether the sender can send to the receiver.

Bzzzt! Smack sure does.

> The signal example is inapplicable.

From a modeling viewpoint the actions are identical.

>   Sending a signal to a process is
> an explicit action done to that process, and it can easily adversely
> affect the target.  Of course it requires permission.
>
> --Andy

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-10 18:02 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <e26f7d09376740a5f7e8360fac4805488b2c0a4f.camel@intel.com>

On 6/10/19 8:22 AM, Yu-cheng Yu wrote:
>> How does glibc know the linear address space size?  We don’t want LA64 to
>> break old binaries because the address calculation changed.
> When an application starts, its highest stack address is determined.
> It uses that as the maximum the bitmap needs to cover.

Huh, I didn't think we ran code from the stack. ;)

Especially given the way that we implemented the new 5-level-paging
address space, I don't think that expecting code to be below the stack
is a good universal expectation.

^ permalink raw reply

* Re: [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT
From: Dave Hansen @ 2019-06-10 18:03 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190610111252.239156-1-minchan@kernel.org>

I'd really love to see the manpages for these new flags.  The devil is
in the details of our promises to userspace.

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Andy Lutomirski @ 2019-06-10 18:22 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Andy Lutomirski, Stephen Smalley, David Howells, Al Viro,
	USB list, LSM List, Greg Kroah-Hartman, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LKML, Paul Moore
In-Reply-To: <dac74580-5b48-86e4-8222-cac29a9f541d@schaufler-ca.com>


> On Jun 10, 2019, at 11:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 6/10/2019 9:42 AM, Andy Lutomirski wrote:
>>> On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
>>>>> On 6/7/19 10:17 AM, David Howells wrote:
>>>>> Hi Al,
>>>>> 
>>>>> Here's a set of patches to add a general variable-length notification queue
>>>>> concept and to add sources of events for:
>>>>> 
>>>>>  (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>>>>      mount reconfiguration.
>>>>> 
>>>>>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>>>>      errors (not complete yet).
>>>>> 
>>>>>  (3) Key/keyring events, such as creating, linking and removal of keys.
>>>>> 
>>>>>  (4) General device events (single common queue) including:
>>>>> 
>>>>>      - Block layer events, such as device errors
>>>>> 
>>>>>      - USB subsystem events, such as device/bus attach/remove, device
>>>>>        reset, device errors.
>>>>> 
>>>>> One of the reasons for this is so that we can remove the issue of processes
>>>>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>>>>> be a system performance problem.  To further aid this, the fsinfo() syscall
>>>>> on which this patch series depends, provides a way to access superblock and
>>>>> mount information in binary form without the need to parse /proc/mounts.
>>>>> 
>>>>> 
>>>>> LSM support is included, but controversial:
>>>>> 
>>>>>  (1) The creds of the process that did the fput() that reduced the refcount
>>>>>      to zero are cached in the file struct.
>>>>> 
>>>>>  (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>>      doing the cleanup, thereby making sure that the creds seen by the
>>>>>      destruction notification generated by mntput() appears to come from
>>>>>      the last fputter.
>>>>> 
>>>>>  (3) security_post_notification() is called for each queue that we might
>>>>>      want to post a notification into, thereby allowing the LSM to prevent
>>>>>      covert communications.
>>>>> 
>>>>>  (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>>      may be set in the first place?  I might need to add a variant per
>>>>>      watch-type.
>>>>> 
>>>>>  (?) Do I really need to keep track of the process creds in which an
>>>>>      implicit object destruction happened?  For example, imagine you create
>>>>>      an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>>      refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>>      someone looking at that fd through procfs at the same time as you exit
>>>>>      due to an error.  The LSM sees the destruction notification come from
>>>>>      the looker if they happen to do their fput() after yours.
>>>> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>>>> 
>>>> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>>>> 
>>>> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>>> I cannot agree. There is an explicit action by a subject that results
>>> in information being delivered to an object. Just like a signal or a
>>> UDP packet delivery. Smack handles this kind of thing just fine. The
>>> internal mechanism that results in the access is irrelevant from
>>> this viewpoint. I can understand how a mechanism like SELinux that
>>> works on finer granularity might view it differently.
>> I think you really need to give an example of a coherent policy that
>> needs this.
> 
> I keep telling you, and you keep ignoring what I say.
> 
>>  As it stands, your analogy seems confusing.
> 
> It's pretty simple. I have given both the abstract
> and examples.

You gave the /dev/null example, which is inapplicable to this patchset.

> 
>>  If someone
>> changes the system clock, we don't restrict who is allowed to be
>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
>> was changed based on who changed the clock.
> 
> That's right. The system clock is not an object that
> unprivileged processes can modify. In fact, it is not
> an object at all. If you care to look, you will see that
> Smack does nothing with the clock.

And this is different from the mount tree how?

> 
>>  Similarly, if someone
>> tries to receive a packet on a socket, we check whether they have the
>> right to receive on that socket (from the endpoint in question) and,
>> if the sender is local, whether the sender can send to that socket.
>> We do not check whether the sender can send to the receiver.
> 
> Bzzzt! Smack sure does.

This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.

> 
>> The signal example is inapplicable.
> 
> From a modeling viewpoint the actions are identical.

This seems incorrect to me and, I think, to most everyone else reading this. Can you explain?

In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Casey Schaufler @ 2019-06-10 19:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Stephen Smalley, David Howells, Al Viro,
	USB list, LSM List, Greg Kroah-Hartman, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LKML, Paul Moore, casey
In-Reply-To: <E0925E1F-E5F2-4457-8704-47B6E64FE3F3@amacapital.net>

On 6/10/2019 11:22 AM, Andy Lutomirski wrote:
>> On Jun 10, 2019, at 11:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 6/10/2019 9:42 AM, Andy Lutomirski wrote:
>>>> On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
>>>>>> On 6/7/19 10:17 AM, David Howells wrote:
>>>>>> Hi Al,
>>>>>>
>>>>>> Here's a set of patches to add a general variable-length notification queue
>>>>>> concept and to add sources of events for:
>>>>>>
>>>>>>  (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>>>>>      mount reconfiguration.
>>>>>>
>>>>>>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>>>>>      errors (not complete yet).
>>>>>>
>>>>>>  (3) Key/keyring events, such as creating, linking and removal of keys.
>>>>>>
>>>>>>  (4) General device events (single common queue) including:
>>>>>>
>>>>>>      - Block layer events, such as device errors
>>>>>>
>>>>>>      - USB subsystem events, such as device/bus attach/remove, device
>>>>>>        reset, device errors.
>>>>>>
>>>>>> One of the reasons for this is so that we can remove the issue of processes
>>>>>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>>>>>> be a system performance problem.  To further aid this, the fsinfo() syscall
>>>>>> on which this patch series depends, provides a way to access superblock and
>>>>>> mount information in binary form without the need to parse /proc/mounts.
>>>>>>
>>>>>>
>>>>>> LSM support is included, but controversial:
>>>>>>
>>>>>>  (1) The creds of the process that did the fput() that reduced the refcount
>>>>>>      to zero are cached in the file struct.
>>>>>>
>>>>>>  (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>>>      doing the cleanup, thereby making sure that the creds seen by the
>>>>>>      destruction notification generated by mntput() appears to come from
>>>>>>      the last fputter.
>>>>>>
>>>>>>  (3) security_post_notification() is called for each queue that we might
>>>>>>      want to post a notification into, thereby allowing the LSM to prevent
>>>>>>      covert communications.
>>>>>>
>>>>>>  (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>>>      may be set in the first place?  I might need to add a variant per
>>>>>>      watch-type.
>>>>>>
>>>>>>  (?) Do I really need to keep track of the process creds in which an
>>>>>>      implicit object destruction happened?  For example, imagine you create
>>>>>>      an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>>>      refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>>>      someone looking at that fd through procfs at the same time as you exit
>>>>>>      due to an error.  The LSM sees the destruction notification come from
>>>>>>      the looker if they happen to do their fput() after yours.
>>>>> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>>>>>
>>>>> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>>>>>
>>>>> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>>>> I cannot agree. There is an explicit action by a subject that results
>>>> in information being delivered to an object. Just like a signal or a
>>>> UDP packet delivery. Smack handles this kind of thing just fine. The
>>>> internal mechanism that results in the access is irrelevant from
>>>> this viewpoint. I can understand how a mechanism like SELinux that
>>>> works on finer granularity might view it differently.
>>> I think you really need to give an example of a coherent policy that
>>> needs this.
>> I keep telling you, and you keep ignoring what I say.
>>
>>>  As it stands, your analogy seems confusing.
>> It's pretty simple. I have given both the abstract
>> and examples.
> You gave the /dev/null example, which is inapplicable to this patchset.

That addressed an explicit objection, and pointed out
an exception to a generality you had asserted, which was
not true. It's also a red herring regarding the current
discussion.

>>>  If someone
>>> changes the system clock, we don't restrict who is allowed to be
>>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
>>> was changed based on who changed the clock.
>> That's right. The system clock is not an object that
>> unprivileged processes can modify. In fact, it is not
>> an object at all. If you care to look, you will see that
>> Smack does nothing with the clock.
> And this is different from the mount tree how?

The mount tree can be modified by unprivileged users.
If nothing that unprivileged users can do to the mount
tree can trigger a notification you are correct, the
mount tree is very like the system clock. Is that the
case? 

>>>  Similarly, if someone
>>> tries to receive a packet on a socket, we check whether they have the
>>> right to receive on that socket (from the endpoint in question) and,
>>> if the sender is local, whether the sender can send to that socket.
>>> We do not check whether the sender can send to the receiver.
>> Bzzzt! Smack sure does.
> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.

Process A sends a packet to process B.
If A has access to TopSecret data and B is not
allowed to see TopSecret data, the delivery should
be prevented. Is that nonsensical?

>>> The signal example is inapplicable.
>> From a modeling viewpoint the actions are identical.
> This seems incorrect to me

What would be correct then? Some convoluted combination
of system entities that aren't owned or controlled by
any mechanism? 

>  and, I think, to most everyone else reading this.

That's quite the assertion. You may even be correct.

>  Can you explain?
>
> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.

YES!!!!!!!!!!!!

And when a process triggers a notification it is the subject
and the watching process is the object!

Subject == active entity
Object  == passive entity

Triggering an event is, like calling kill(), an action!

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-10 19:38 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <3f19582d-78b1-5849-ffd0-53e8ca747c0d@intel.com>

On Mon, 2019-06-10 at 11:02 -0700, Dave Hansen wrote:
> On 6/10/19 8:22 AM, Yu-cheng Yu wrote:
> > > How does glibc know the linear address space size?  We don’t want LA64 to
> > > break old binaries because the address calculation changed.
> > 
> > When an application starts, its highest stack address is determined.
> > It uses that as the maximum the bitmap needs to cover.
> 
> Huh, I didn't think we ran code from the stack. ;)
> 
> Especially given the way that we implemented the new 5-level-paging
> address space, I don't think that expecting code to be below the stack
> is a good universal expectation.

Yes, you make a good point.  However, allowing the application manage the bitmap
is the most efficient and flexible.  If the loader finds a legacy lib is beyond
the bitmap can cover, it can deal with the problem by moving the lib to a lower
address; or re-allocate the bitmap.  If the loader cannot allocate a big bitmap
to cover all 5-level address space (the bitmap will be large), it can put all
legacy lib's at lower address.  We cannot do these easily in the kernel.

Yu-cheng

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-10 19:52 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <5aa98999b1343f34828414b74261201886ec4591.camel@intel.com>

On 6/10/19 12:38 PM, Yu-cheng Yu wrote:
>>> When an application starts, its highest stack address is determined.
>>> It uses that as the maximum the bitmap needs to cover.
>> Huh, I didn't think we ran code from the stack. ;)
>>
>> Especially given the way that we implemented the new 5-level-paging
>> address space, I don't think that expecting code to be below the stack
>> is a good universal expectation.
> Yes, you make a good point.  However, allowing the application manage the bitmap
> is the most efficient and flexible.  If the loader finds a legacy lib is beyond
> the bitmap can cover, it can deal with the problem by moving the lib to a lower
> address; or re-allocate the bitmap.

How could the loader reallocate the bitmap and coordinate with other
users of the bitmap?

> If the loader cannot allocate a big bitmap to cover all 5-level
> address space (the bitmap will be large), it can put all legacy lib's
> at lower address.  We cannot do these easily in the kernel.

This is actually an argument to do it in the kernel.  The kernel can
always allocate the virtual space however it wants, no matter how large.
 If we hide the bitmap behind a kernel API then we can put it at high
5-level user addresses because we also don't have to worry about the
high bits confusing userspace.

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Andy Lutomirski @ 2019-06-10 19:53 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Andy Lutomirski, Stephen Smalley, David Howells, Al Viro,
	USB list, LSM List, Greg Kroah-Hartman, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LKML, Paul Moore
In-Reply-To: <4b7d02b2-2434-8a7c-66cc-7dbebc37efbc@schaufler-ca.com>

On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>> I think you really need to give an example of a coherent policy that
> >>> needs this.
> >> I keep telling you, and you keep ignoring what I say.
> >>
> >>>  As it stands, your analogy seems confusing.
> >> It's pretty simple. I have given both the abstract
> >> and examples.
> > You gave the /dev/null example, which is inapplicable to this patchset.
>
> That addressed an explicit objection, and pointed out
> an exception to a generality you had asserted, which was
> not true. It's also a red herring regarding the current
> discussion.

This argument is pointless.

Please humor me and just give me an example.  If you think you have
already done so, feel free to repeat yourself.  If you have no
example, then please just say so.

>
> >>>  If someone
> >>> changes the system clock, we don't restrict who is allowed to be
> >>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
> >>> was changed based on who changed the clock.
> >> That's right. The system clock is not an object that
> >> unprivileged processes can modify. In fact, it is not
> >> an object at all. If you care to look, you will see that
> >> Smack does nothing with the clock.
> > And this is different from the mount tree how?
>
> The mount tree can be modified by unprivileged users.
> If nothing that unprivileged users can do to the mount
> tree can trigger a notification you are correct, the
> mount tree is very like the system clock. Is that the
> case?

The mount tree can't be modified by unprivileged users, unless a
privileged user very carefully configured it as such.  An unprivileged
user can create a new userns and a new mount ns, but then they're
modifying a whole different mount tree.

>
> >>>  Similarly, if someone
> >>> tries to receive a packet on a socket, we check whether they have the
> >>> right to receive on that socket (from the endpoint in question) and,
> >>> if the sender is local, whether the sender can send to that socket.
> >>> We do not check whether the sender can send to the receiver.
> >> Bzzzt! Smack sure does.
> > This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>
> Process A sends a packet to process B.
> If A has access to TopSecret data and B is not
> allowed to see TopSecret data, the delivery should
> be prevented. Is that nonsensical?

It makes sense.  As I see it, the way that a sensible policy should do
this is by making sure that there are no sockets, pipes, etc that
Process A can write and that Process B can read.

If you really want to prevent a malicious process with TopSecret data
from sending it to a different process, then you can't use Linux on
x86 or ARM.  Maybe that will be fixed some day, but you're going to
need to use an extremely tight sandbox to make this work.

>
> >>> The signal example is inapplicable.
> >> From a modeling viewpoint the actions are identical.
> > This seems incorrect to me
>
> What would be correct then? Some convoluted combination
> of system entities that aren't owned or controlled by
> any mechanism?
>

POSIX signal restrictions aren't there to prevent two processes from
communicating.  They're there to prevent the sender from manipulating
or crashing the receiver without appropriate privilege.


> >  and, I think, to most everyone else reading this.
>
> That's quite the assertion. You may even be correct.
>
> >  Can you explain?
> >
> > In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>
> YES!!!!!!!!!!!!
>
> And when a process triggers a notification it is the subject
> and the watching process is the object!
>
> Subject == active entity
> Object  == passive entity
>
> Triggering an event is, like calling kill(), an action!
>

And here is where I disagree with your interpretation.  Triggering an
event is a side effect of writing to the file.  There are *two*
security relevant actions, not one, and they are:

First, the write:

Subject == the writer
Action == write
Object == the file

Then the event, which could be modeled in a couple of ways:

Subject == the file
Action == notify
Object == the recipient

or

Subject == the recipient
Action == watch
Object == the file

By conflating these two actions into one, you've made the modeling
very hard, and you start running into all these nasty questions like
"who actually closed this open file"

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Andy Lutomirski @ 2019-06-10 19:55 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, Peter Zijlstra, X86 ML, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar, LKML, open list:DOCUMENTATION,
	Linux-MM, linux-arch, Linux API, Arnd Bergmann, Balbir Singh,
	Borislav Petkov, Cyrill Gorcunov, Dave Hansen,
	Eugene Syromiatnikov, Florian Weimer, H.J. Lu, Jann Horn,
	Jonathan Corbet <corbe>
In-Reply-To: <0665416d-9999-b394-df17-f2a5e1408130@intel.com>

On Mon, Jun 10, 2019 at 12:52 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/10/19 12:38 PM, Yu-cheng Yu wrote:
> >>> When an application starts, its highest stack address is determined.
> >>> It uses that as the maximum the bitmap needs to cover.
> >> Huh, I didn't think we ran code from the stack. ;)
> >>
> >> Especially given the way that we implemented the new 5-level-paging
> >> address space, I don't think that expecting code to be below the stack
> >> is a good universal expectation.
> > Yes, you make a good point.  However, allowing the application manage the bitmap
> > is the most efficient and flexible.  If the loader finds a legacy lib is beyond
> > the bitmap can cover, it can deal with the problem by moving the lib to a lower
> > address; or re-allocate the bitmap.
>
> How could the loader reallocate the bitmap and coordinate with other
> users of the bitmap?
>
> > If the loader cannot allocate a big bitmap to cover all 5-level
> > address space (the bitmap will be large), it can put all legacy lib's
> > at lower address.  We cannot do these easily in the kernel.
>
> This is actually an argument to do it in the kernel.  The kernel can
> always allocate the virtual space however it wants, no matter how large.
>  If we hide the bitmap behind a kernel API then we can put it at high
> 5-level user addresses because we also don't have to worry about the
> high bits confusing userspace.
>

That's a fairly compelling argument.

The bitmap is one bit per page, right?  So it's smaller than the
address space by a factor of 8*2^12 == 2^15.  This means that, if we
ever get full 64-bit linear addresses reserved entirely for userspace
(which could happen if my perennial request to Intel to split user and
kernel addresses completely happens), then we'll need 2^48 bytes for
the bitmap, which simply does not fit in the address space of a legacy
application.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-10 20:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <0665416d-9999-b394-df17-f2a5e1408130@intel.com>

On Mon, 2019-06-10 at 12:52 -0700, Dave Hansen wrote:
> On 6/10/19 12:38 PM, Yu-cheng Yu wrote:
> > > > When an application starts, its highest stack address is determined.
> > > > It uses that as the maximum the bitmap needs to cover.
> > > 
> > > Huh, I didn't think we ran code from the stack. ;)
> > > 
> > > Especially given the way that we implemented the new 5-level-paging
> > > address space, I don't think that expecting code to be below the stack
> > > is a good universal expectation.
> > 
> > Yes, you make a good point.  However, allowing the application manage the
> > bitmap
> > is the most efficient and flexible.  If the loader finds a legacy lib is
> > beyond
> > the bitmap can cover, it can deal with the problem by moving the lib to a
> > lower
> > address; or re-allocate the bitmap.
> 
> How could the loader reallocate the bitmap and coordinate with other
> users of the bitmap?

Assuming the loader actually chooses to re-allocate, it can copy the old bitmap
over to the new before doing the switch.  But, I agree, the other choice is
easier; the loader can simply put the lib at lower address.  AFAIK, the loader
does not request high address in mmap().

> 
> > If the loader cannot allocate a big bitmap to cover all 5-level
> > address space (the bitmap will be large), it can put all legacy lib's
> > at lower address.  We cannot do these easily in the kernel.
> 
> This is actually an argument to do it in the kernel.  The kernel can
> always allocate the virtual space however it wants, no matter how large.
>  If we hide the bitmap behind a kernel API then we can put it at high
> 5-level user addresses because we also don't have to worry about the
> high bits confusing userspace.

We actually tried this.  The kernel needs to reserve the bitmap space in the
beginning for every CET-enabled app, regardless of actual needs.  On each memory
request, the kernel then must consider a percentage of allocated space in its
calculation, and on systems with less memory this quickly becomes a problem.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-10 20:43 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <5c8727dde9653402eea97bfdd030c479d1e8dd99.camel@intel.com>

On 6/10/19 1:27 PM, Yu-cheng Yu wrote:
>>> If the loader cannot allocate a big bitmap to cover all 5-level
>>> address space (the bitmap will be large), it can put all legacy lib's
>>> at lower address.  We cannot do these easily in the kernel.
>> This is actually an argument to do it in the kernel.  The kernel can
>> always allocate the virtual space however it wants, no matter how large.
>>  If we hide the bitmap behind a kernel API then we can put it at high
>> 5-level user addresses because we also don't have to worry about the
>> high bits confusing userspace.
> We actually tried this.  The kernel needs to reserve the bitmap space in the
> beginning for every CET-enabled app, regardless of actual needs. 

I don't think this is a problem.  In fact, I think reserving the space
is actually the only sane behavior.  If you don't reserve it, you
fundamentally limit where future legacy instructions can go.

One idea is that we always size the bitmap for the 48-bit addressing
systems.  Legacy code probably doesn't _need_ to go in the new address
space, and if we do this we don't have to worry about the gigantic
57-bit address space bitmap.

> On each memory request, the kernel then must consider a percentage of
> allocated space in its calculation, and on systems with less memory
> this quickly becomes a problem.

I'm not sure what you're referring to here?  Are you referring to our
overcommit limits?

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-10 20:58 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <ac9a20a6-170a-694e-beeb-605a17195034@intel.com>

On Mon, 2019-06-10 at 13:43 -0700, Dave Hansen wrote:
> On 6/10/19 1:27 PM, Yu-cheng Yu wrote:
> > > > If the loader cannot allocate a big bitmap to cover all 5-level
> > > > address space (the bitmap will be large), it can put all legacy lib's
> > > > at lower address.  We cannot do these easily in the kernel.
> > > 
> > > This is actually an argument to do it in the kernel.  The kernel can
> > > always allocate the virtual space however it wants, no matter how large.
> > >  If we hide the bitmap behind a kernel API then we can put it at high
> > > 5-level user addresses because we also don't have to worry about the
> > > high bits confusing userspace.
> > 
> > We actually tried this.  The kernel needs to reserve the bitmap space in the
> > beginning for every CET-enabled app, regardless of actual needs. 
> 
> I don't think this is a problem.  In fact, I think reserving the space
> is actually the only sane behavior.  If you don't reserve it, you
> fundamentally limit where future legacy instructions can go.
> 
> One idea is that we always size the bitmap for the 48-bit addressing
> systems.  Legacy code probably doesn't _need_ to go in the new address
> space, and if we do this we don't have to worry about the gigantic
> 57-bit address space bitmap.
> 
> > On each memory request, the kernel then must consider a percentage of
> > allocated space in its calculation, and on systems with less memory
> > this quickly becomes a problem.
> 
> I'm not sure what you're referring to here?  Are you referring to our
> overcommit limits?

Yes.

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Casey Schaufler @ 2019-06-10 21:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Stephen Smalley, David Howells, Al Viro, USB list, LSM List,
	Greg Kroah-Hartman, raven, Linux FS Devel, Linux API, linux-block,
	keyrings, LKML, Paul Moore
In-Reply-To: <CALCETrU+PKVbrKQJoXj9x_5y+vTZENMczHqyM_Xb85ca5YDZuA@mail.gmail.com>

On 6/10/2019 12:53 PM, Andy Lutomirski wrote:
> On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> I think you really need to give an example of a coherent policy that
>>>>> needs this.
>>>> I keep telling you, and you keep ignoring what I say.
>>>>
>>>>>  As it stands, your analogy seems confusing.
>>>> It's pretty simple. I have given both the abstract
>>>> and examples.
>>> You gave the /dev/null example, which is inapplicable to this patchset.
>> That addressed an explicit objection, and pointed out
>> an exception to a generality you had asserted, which was
>> not true. It's also a red herring regarding the current
>> discussion.
> This argument is pointless.
>
> Please humor me and just give me an example.  If you think you have
> already done so, feel free to repeat yourself.  If you have no
> example, then please just say so.

To repeat the /dev/null example:

Process A and process B both open /dev/null.
A and B can write and read to their hearts content
to/from /dev/null without ever once communicating.
The mutual accessibility of /dev/null in no way implies that
A and B can communicate. If A can set a watch on /dev/null,
and B triggers an event, there still has to be an access
check on the delivery of the event because delivering an event
to A is not an action on /dev/null, but on A.


>
>>>>>  If someone
>>>>> changes the system clock, we don't restrict who is allowed to be
>>>>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
>>>>> was changed based on who changed the clock.
>>>> That's right. The system clock is not an object that
>>>> unprivileged processes can modify. In fact, it is not
>>>> an object at all. If you care to look, you will see that
>>>> Smack does nothing with the clock.
>>> And this is different from the mount tree how?
>> The mount tree can be modified by unprivileged users.
>> If nothing that unprivileged users can do to the mount
>> tree can trigger a notification you are correct, the
>> mount tree is very like the system clock. Is that the
>> case?
> The mount tree can't be modified by unprivileged users, unless a
> privileged user very carefully configured it as such.

"Unless" means *is* possible. In which case access control is
required. I will admit to being less then expert on the extent
to which mounts can be done without privilege.

>   An unprivileged
> user can create a new userns and a new mount ns, but then they're
> modifying a whole different mount tree.

Within those namespaces you can still have multiple users,
constrained be system access control policy.

>
>>>>>  Similarly, if someone
>>>>> tries to receive a packet on a socket, we check whether they have the
>>>>> right to receive on that socket (from the endpoint in question) and,
>>>>> if the sender is local, whether the sender can send to that socket.
>>>>> We do not check whether the sender can send to the receiver.
>>>> Bzzzt! Smack sure does.
>>> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>> Process A sends a packet to process B.
>> If A has access to TopSecret data and B is not
>> allowed to see TopSecret data, the delivery should
>> be prevented. Is that nonsensical?
> It makes sense.  As I see it, the way that a sensible policy should do
> this is by making sure that there are no sockets, pipes, etc that
> Process A can write and that Process B can read.

You can't explain UDP controls without doing the access check
on packet delivery. The sendmsg() succeeds when the packet leaves
the sender. There doesn't even have to be a socket bound to the
port. The only opportunity you have for control is on packet
delivery, which is the only point at which you can have the
information required.

> If you really want to prevent a malicious process with TopSecret data
> from sending it to a different process, then you can't use Linux on
> x86 or ARM.  Maybe that will be fixed some day, but you're going to
> need to use an extremely tight sandbox to make this work.

I won't be commenting on that.

>
>>>>> The signal example is inapplicable.
>>>> From a modeling viewpoint the actions are identical.
>>> This seems incorrect to me
>> What would be correct then? Some convoluted combination
>> of system entities that aren't owned or controlled by
>> any mechanism?
>>
> POSIX signal restrictions aren't there to prevent two processes from
> communicating.  They're there to prevent the sender from manipulating
> or crashing the receiver without appropriate privilege.

POSIX signal restrictions have a long history. In the P10031e/2c
debates both communication and manipulation where seriously
considered. I would say both are true.

>>>  and, I think, to most everyone else reading this.
>> That's quite the assertion. You may even be correct.
>>
>>>  Can you explain?
>>>
>>> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>> YES!!!!!!!!!!!!
>>
>> And when a process triggers a notification it is the subject
>> and the watching process is the object!
>>
>> Subject == active entity
>> Object  == passive entity
>>
>> Triggering an event is, like calling kill(), an action!
>>
> And here is where I disagree with your interpretation.  Triggering an
> event is a side effect of writing to the file.  There are *two*
> security relevant actions, not one, and they are:
>
> First, the write:
>
> Subject == the writer
> Action == write
> Object == the file
>
> Then the event, which could be modeled in a couple of ways:
>
> Subject == the file

Files   are   not   subjects. They are passive entities.

> Action == notify
> Object == the recipient
>
> or
>
> Subject == the recipient
> Action == watch
> Object == the file
>
> By conflating these two actions into one, you've made the modeling
> very hard, and you start running into all these nasty questions like
> "who actually closed this open file"

No, I've made the code more difficult. You can not call
the file a subject. That is just wrong. It's not a valid
model.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-10 22:02 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <328275c9b43c06809c9937c83d25126a6e3efcbd.camel@intel.com>

On 6/10/19 1:58 PM, Yu-cheng Yu wrote:
>>> On each memory request, the kernel then must consider a percentage of
>>> allocated space in its calculation, and on systems with less memory
>>> this quickly becomes a problem.
>> I'm not sure what you're referring to here?  Are you referring to our
>> overcommit limits?
> Yes.

My assumption has always been that these large, potentially sparse
hardware tables *must* be mmap()'d with MAP_NORESERVE specified.  That
should keep them from being problematic with respect to overcommit.

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: David Howells @ 2019-06-10 22:07 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: dhowells, Andy Lutomirski, Stephen Smalley, Al Viro, USB list,
	LSM List, Greg Kroah-Hartman, raven, Linux FS Devel, Linux API,
	linux-block, keyrings, LKML, Paul Moore
In-Reply-To: <25d88489-9850-f092-205e-0a4fc292f41b@schaufler-ca.com>

Casey Schaufler <casey@schaufler-ca.com> wrote:

> Process A and process B both open /dev/null.
> A and B can write and read to their hearts content
> to/from /dev/null without ever once communicating.
> The mutual accessibility of /dev/null in no way implies that
> A and B can communicate. If A can set a watch on /dev/null,
> and B triggers an event, there still has to be an access
> check on the delivery of the event because delivering an event
> to A is not an action on /dev/null, but on A.

If a process has the privilege, it appears that fanotify() allows that process
to see others accessing /dev/null (FAN_ACCESS, FAN_ACCESS_PERM).  There don't
seem to be any LSM checks there either.

On the other hand, the privilege required is CAP_SYS_ADMIN,

> > The mount tree can't be modified by unprivileged users, unless a
> > privileged user very carefully configured it as such.
> 
> "Unless" means *is* possible. In which case access control is
> required. I will admit to being less then expert on the extent
> to which mounts can be done without privilege.

Automounts in network filesystems, for example.

The initial mount of the network filesystem requires local privilege, but then
mountpoints are managed with remote privilege as granted by things like
kerberos tickets.  The local kernel has no control.

If you have CONFIG_AFS_FS enabled in your kernel, for example, and you install
the keyutils package (dnf, rpm, apt, etc.), then you should be able to do:

	mount -t afs none /mnt -o dyn
	ls /afs/grand.central.org/software/

for example.  That will go through a couple of automount points.  Assuming you
don't have a kerberos login on those servers, however, you shouldn't be able
to add new mountpoints.

Someone watching the mount topology can see events when an automount is
enacted and when it expires, the latter being an event with the system as the
subject since the expiry is done on a timeout set by the kernel.

David

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Yu-cheng Yu @ 2019-06-10 22:40 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <92e56b28-0cd4-e3f4-867b-639d9b98b86c@intel.com>

On Mon, 2019-06-10 at 15:02 -0700, Dave Hansen wrote:
> On 6/10/19 1:58 PM, Yu-cheng Yu wrote:
> > > > On each memory request, the kernel then must consider a percentage of
> > > > allocated space in its calculation, and on systems with less memory
> > > > this quickly becomes a problem.
> > > 
> > > I'm not sure what you're referring to here?  Are you referring to our
> > > overcommit limits?
> > 
> > Yes.
> 
> My assumption has always been that these large, potentially sparse
> hardware tables *must* be mmap()'d with MAP_NORESERVE specified.  That
> should keep them from being problematic with respect to overcommit.

Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and
VM_DONTDUMP.  The bitmap will cover only 48-bit address space.

We then create PR_MARK_CODE_AS_LEGACY.  The kernel will set the bitmap, but it
is going to be slow.

Perhaps we still let the app fill the bitmap?

Yu-cheng

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-10 22:59 UTC (permalink / raw)
  To: Yu-cheng Yu, Andy Lutomirski
  Cc: Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
	linux-kernel, linux-doc, linux-mm, linux-arch, linux-api,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit
In-Reply-To: <1b961c71d30e31ecb22da2c5401b1a81cb802d86.camel@intel.com>

On 6/10/19 3:40 PM, Yu-cheng Yu wrote:
> Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and
> VM_DONTDUMP.  The bitmap will cover only 48-bit address space.

Could you make sure to discuss the downsides of only doing a 48-bit
address space?

What are the reasons behind and implications of VM_DONTDUMP?

> We then create PR_MARK_CODE_AS_LEGACY.  The kernel will set the bitmap, but it
> is going to be slow.

Slow compared to what?  We're effectively adding one (quick) system call
to a path that, today, has at *least* half a dozen syscalls and probably
a bunch of page faults.  Heck, we can probably avoid the actual page
fault to populate the bitmap if we're careful.  That alone would put a
syscall on equal footing with any other approach.  If the bit setting
crossed a page boundary it would probably win.

> Perhaps we still let the app fill the bitmap?

I think I'd want to see some performance data on it first.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: H.J. Lu @ 2019-06-10 23:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, linux-doc, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, Jann Horn,
	Jonathan Corbet <corbet@
In-Reply-To: <ea5e333f-8cd6-8396-635f-a9dc580d5364@intel.com>

On Mon, Jun 10, 2019 at 3:59 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> > We then create PR_MARK_CODE_AS_LEGACY.  The kernel will set the bitmap, but it
> > is going to be slow.
>
> Slow compared to what?  We're effectively adding one (quick) system call
> to a path that, today, has at *least* half a dozen syscalls and probably
> a bunch of page faults.  Heck, we can probably avoid the actual page
> fault to populate the bitmap if we're careful.  That alone would put a
> syscall on equal footing with any other approach.  If the bit setting
> crossed a page boundary it would probably win.
>
> > Perhaps we still let the app fill the bitmap?
>
> I think I'd want to see some performance data on it first.

Updating legacy bitmap in user space from kernel requires

long q;

get_user(q, ...);
q |= mask;
put_user(q, ...);

instead of

*p |= mask;

get_user + put_user was quite slow when we tried before.

-- 
H.J.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-10 23:37 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Yu-cheng Yu, Andy Lutomirski, Peter Zijlstra,
	the arch/x86 maintainers, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, LKML, linux-doc, Linux-MM, linux-arch, Linux API,
	Arnd Bergmann, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, Jann Horn,
	Jonathan Corbet
In-Reply-To: <CAMe9rOqLxNxE-gGX9ozX=emW9iQ+gOeUiS3ec5W4jmF6wk6cng@mail.gmail.com>

On 6/10/19 4:20 PM, H.J. Lu wrote:
>>> Perhaps we still let the app fill the bitmap?
>> I think I'd want to see some performance data on it first.
> Updating legacy bitmap in user space from kernel requires
> 
> long q;
> 
> get_user(q, ...);
> q |= mask;
> put_user(q, ...);
> 
> instead of
> 
> *p |= mask;
> 
> get_user + put_user was quite slow when we tried before.

Numbers, please.

There are *lots* of ways to speed something like that up if you have
actual issues with it.  For instance, you can skip the get_user() for
whole bytes.  You can write bits with 0's for unallocated address space.
 You can do user_access_begin/end() to avoid bunches of STAC/CLACs...

The list goes on and on. :)

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Andy Lutomirski @ 2019-06-10 23:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Yu-cheng Yu, Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <ea5e333f-8cd6-8396-635f-a9dc580d5364@intel.com>

> On Jun 10, 2019, at 3:59 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 6/10/19 3:40 PM, Yu-cheng Yu wrote:
>> Ok, we will go back to do_mmap() with MAP_PRIVATE, MAP_NORESERVE and
>> VM_DONTDUMP.  The bitmap will cover only 48-bit address space.
> 
> Could you make sure to discuss the downsides of only doing a 48-bit
> address space?
> 
> What are the reasons behind and implications of VM_DONTDUMP?
> 
>> We then create PR_MARK_CODE_AS_LEGACY.  The kernel will set the bitmap, but it
>> is going to be slow.
> 
> Slow compared to what?  We're effectively adding one (quick) system call
> to a path that, today, has at *least* half a dozen syscalls and probably
> a bunch of page faults.  Heck, we can probably avoid the actual page
> fault to populate the bitmap if we're careful.  That alone would put a
> syscall on equal footing with any other approach.  If the bit setting
> crossed a page boundary it would probably win.
> 
>> Perhaps we still let the app fill the bitmap?
> 
> I think I'd want to see some performance data on it first.

Trying to summarize:

If we manage the whole thing in user space, we are basically committing to only covering 48 bits — otherwise the whole model falls apart in quite a few ways. We gain some simplicity in the kernel.

If we do it in the kernel, we still have to decide how much address space to cover. We get to play games like allocating the bitmap above 2^48, but then we might have CRIU issues if we migrate to a system with fewer BA bits.

I doubt that the performance matters much one way or another. I just don’t expect any of this to be a bottleneck.

Another benefit of kernel management: we could plausibly auto-clear the bits corresponding to munmapped regions. Is this worth it?

And a maybe-silly benefit: if we manage it in the kernel, we could optimize the inevitable case where the bitmap contains pages that are all ones :). If it’s in userspace, KSM could do the, but that will be inefficient at best.

^ permalink raw reply

* Re: [PATCH v7 03/14] x86/cet/ibt: Add IBT legacy code bitmap setup function
From: Dave Hansen @ 2019-06-11  0:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Yu-cheng Yu, Peter Zijlstra, x86, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar, linux-kernel, linux-doc, linux-mm, linux-arch,
	linux-api, Arnd Bergmann, Balbir Singh, Borislav Petkov,
	Cyrill Gorcunov, Dave Hansen, Eugene Syromiatnikov,
	Florian Weimer, H.J. Lu, Jann Horn, Jonathan Corbet, Kees Cook,
	Mike Kravetz
In-Reply-To: <BBBF82D3-EE21-49E1-92A4-713C7729E6AD@amacapital.net>

On 6/10/19 4:54 PM, Andy Lutomirski wrote:
> Another benefit of kernel management: we could plausibly auto-clear
> the bits corresponding to munmapped regions. Is this worth it?

I did it for MPX.  I think I even went to the trouble of zapping the
whole pages that got unused.

But, MPX tables took 80% of the address space, worst-case.  This takes
0.003% :)  The only case it would really matter would be a task was
long-running, used legacy executables/JITs, and was mapping/unmapping
text all over the address space.  That seems rather unlikely.

^ permalink raw reply

* Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Andy Lutomirski @ 2019-06-11  0:13 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Andy Lutomirski, Stephen Smalley, David Howells, Al Viro,
	USB list, LSM List, Greg Kroah-Hartman, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LKML, Paul Moore
In-Reply-To: <25d88489-9850-f092-205e-0a4fc292f41b@schaufler-ca.com>



> On Jun 10, 2019, at 2:25 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 6/10/2019 12:53 PM, Andy Lutomirski wrote:
>> On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>> I think you really need to give an example of a coherent policy that
>>>>>> needs this.
>>>>> I keep telling you, and you keep ignoring what I say.
>>>>> 
>>>>>> As it stands, your analogy seems confusing.
>>>>> It's pretty simple. I have given both the abstract
>>>>> and examples.
>>>> You gave the /dev/null example, which is inapplicable to this patchset.
>>> That addressed an explicit objection, and pointed out
>>> an exception to a generality you had asserted, which was
>>> not true. It's also a red herring regarding the current
>>> discussion.
>> This argument is pointless.
>> 
>> Please humor me and just give me an example.  If you think you have
>> already done so, feel free to repeat yourself.  If you have no
>> example, then please just say so.
> 
> To repeat the /dev/null example:
> 
> Process A and process B both open /dev/null.
> A and B can write and read to their hearts content
> to/from /dev/null without ever once communicating.
> The mutual accessibility of /dev/null in no way implies that
> A and B can communicate. If A can set a watch on /dev/null,
> and B triggers an event, there still has to be an access
> check on the delivery of the event because delivering an event
> to A is not an action on /dev/null, but on A.
> 

At discussed, this is an irrelevant straw man. This patch series does not produce events when this happens. I’m looking for a relevant example, please.
> 
> 
>>  An unprivileged
>> user can create a new userns and a new mount ns, but then they're
>> modifying a whole different mount tree.
> 
> Within those namespaces you can still have multiple users,
> constrained be system access control policy.

And the one doing the mounting will be constrained by MAC and DAC policy, as always.  The namespace creator is, from the perspective of those processes, admin.

> 
>> 
>>>>>> Similarly, if someone
>>>>>> tries to receive a packet on a socket, we check whether they have the
>>>>>> right to receive on that socket (from the endpoint in question) and,
>>>>>> if the sender is local, whether the sender can send to that socket.
>>>>>> We do not check whether the sender can send to the receiver.
>>>>> Bzzzt! Smack sure does.
>>>> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>>> Process A sends a packet to process B.
>>> If A has access to TopSecret data and B is not
>>> allowed to see TopSecret data, the delivery should
>>> be prevented. Is that nonsensical?
>> It makes sense.  As I see it, the way that a sensible policy should do
>> this is by making sure that there are no sockets, pipes, etc that
>> Process A can write and that Process B can read.
> 
> You can't explain UDP controls without doing the access check
> on packet delivery. The sendmsg() succeeds when the packet leaves
> the sender. There doesn't even have to be a socket bound to the
> port. The only opportunity you have for control is on packet
> delivery, which is the only point at which you can have the
> information required.

Huh?  You sendmsg() from an address to an address.  My point is that, for most purposes, that’s all the information that’s needed.

> 
>> If you really want to prevent a malicious process with TopSecret data
>> from sending it to a different process, then you can't use Linux on
>> x86 or ARM.  Maybe that will be fixed some day, but you're going to
>> need to use an extremely tight sandbox to make this work.
> 
> I won't be commenting on that.

Then why is preventing this is an absolute requirement? It’s unattainable.

> 
>> 
>>>>>> The signal example is inapplicable.
>>>>> From a modeling viewpoint the actions are identical.
>>>> This seems incorrect to me
>>> What would be correct then? Some convoluted combination
>>> of system entities that aren't owned or controlled by
>>> any mechanism?
>>> 
>> POSIX signal restrictions aren't there to prevent two processes from
>> communicating.  They're there to prevent the sender from manipulating
>> or crashing the receiver without appropriate privilege.
> 
> POSIX signal restrictions have a long history. In the P10031e/2c
> debates both communication and manipulation where seriously
> considered. I would say both are true.
> 
>>>> and, I think, to most everyone else reading this.
>>> That's quite the assertion. You may even be correct.
>>> 
>>>> Can you explain?
>>>> 
>>>> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>>> YES!!!!!!!!!!!!
>>> 
>>> And when a process triggers a notification it is the subject
>>> and the watching process is the object!
>>> 
>>> Subject == active entity
>>> Object  == passive entity
>>> 
>>> Triggering an event is, like calling kill(), an action!
>>> 
>> And here is where I disagree with your interpretation.  Triggering an
>> event is a side effect of writing to the file.  There are *two*
>> security relevant actions, not one, and they are:
>> 
>> First, the write:
>> 
>> Subject == the writer
>> Action == write
>> Object == the file
>> 
>> Then the event, which could be modeled in a couple of ways:
>> 
>> Subject == the file
> 
> Files   are   not   subjects. They are passive entities.
> 
>> Action == notify
>> Object == the recipient

Great. Then use the variant below.

>> 
>> or
>> 
>> Subject == the recipient
>> Action == watch
>> Object == the file
>> 
>> By conflating these two actions into one, you've made the modeling
>> very hard, and you start running into all these nasty questions like
>> "who actually closed this open file"
> 
> No, I've made the code more difficult.
> You can not call
> the file a subject. That is just wrong. It's not a valid
> model.

You’ve ignored the “Action == watch” variant. Do you care to comment?

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox