Linux Security Modules development

Linux Security Modules development
 help / color / mirror / Atom feed

* Re: [RFC PATCH v2] security, capability: pass object information to security_capable
From: Paul Moore @ 2019-08-14 21:27 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Richard Guy Briggs, Aaron Goidel, mortonm, john.johansen, selinux,
	James Morris, luto, linux-security-module, linux-audit,
	Nicholas Franck, Serge Hallyn
In-Reply-To: <b47e07bc-1b01-c5f0-305d-e6fe014b00d8@tycho.nsa.gov>

On Wed, Aug 14, 2019 at 5:08 PM Stephen Smalley <sds@tycho.nsa.gov> wrote:
> On 8/14/19 3:59 PM, Paul Moore wrote:
> > On Tue, Aug 13, 2019 at 5:27 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> >> On 2019-08-13 11:01, Aaron Goidel wrote:
> >>> On 8/8/19 12:30 PM, Paul Moore wrote:
> >>>> On Thu, Aug 1, 2019 at 10:43 AM Aaron Goidel <acgoide@tycho.nsa.gov> wrote:
> >>>>> From: Nicholas Franck <nhfran2@tycho.nsa.gov>
> >>>>>
> >>>>> At present security_capable does not pass any object information
> >>>>> and therefore can neither audit the particular object nor take it
> >>>>> into account. Augment the security_capable interface to support
> >>>>> passing supplementary data. Use this facility initially to convey
> >>>>> the inode for capability checks relevant to inodes. This only
> >>>>> addresses capable_wrt_inode_uidgid calls; other capability checks
> >>>>> relevant to inodes will be addressed in subsequent changes. In the
> >>>>> future, this will be further extended to pass object information for
> >>>>> other capability checks such as the target task for CAP_KILL.
> >>>>>
> >>>>> In SELinux this new information is leveraged here to include the inode
> >>>>> in the audit message. In the future, it could also be used to perform
> >>>>> a per inode capability checks.
> >>>>>
> >>>>> It would be possible to fold the existing opts argument into this new
> >>>>> supplementary data structure. This was omitted from this change to
> >>>>> minimize changes.
> >>>>>
> >>>>> Signed-off-by: Nicholas Franck <nhfran2@tycho.nsa.gov>
> >>>>> Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
> >>>>> ---
> >>>>> v2:
> >>>>> - Changed order of audit prints so optional information comes second
> >>>>> ---
> >>>>>    include/linux/capability.h             |  7 ++++++
> >>>>>    include/linux/lsm_audit.h              |  5 +++-
> >>>>>    include/linux/lsm_hooks.h              |  3 ++-
> >>>>>    include/linux/security.h               | 23 +++++++++++++-----
> >>>>>    kernel/capability.c                    | 33 ++++++++++++++++++--------
> >>>>>    kernel/seccomp.c                       |  2 +-
> >>>>>    security/apparmor/capability.c         |  8 ++++---
> >>>>>    security/apparmor/include/capability.h |  4 +++-
> >>>>>    security/apparmor/ipc.c                |  2 +-
> >>>>>    security/apparmor/lsm.c                |  5 ++--
> >>>>>    security/apparmor/resource.c           |  2 +-
> >>>>>    security/commoncap.c                   | 11 +++++----
> >>>>>    security/lsm_audit.c                   | 21 ++++++++++++++--
> >>>>>    security/safesetid/lsm.c               |  3 ++-
> >>>>>    security/security.c                    |  5 ++--
> >>>>>    security/selinux/hooks.c               | 20 +++++++++-------
> >>>>>    security/smack/smack_access.c          |  2 +-
> >>>>>    17 files changed, 110 insertions(+), 46 deletions(-)
> >>>>
> >>>> You should CC the linux-audit list, I've added them on this mail.
> >>>>
> >>>> I had hoped to see some thought put into the idea of dynamically
> >>>> emitting the proper audit records as I mentioned in the previous patch
> >>>> set, but regardless there are some comments on this code as written
> >>>> ...
> >>>>
> >>>>> diff --git a/security/lsm_audit.c b/security/lsm_audit.c
> >>>>> index 33028c098ef3..18cc7c956b69 100644
> >>>>> --- a/security/lsm_audit.c
> >>>>> +++ b/security/lsm_audit.c
> >>>>> @@ -229,9 +229,26 @@ static void dump_common_audit_data(struct audit_buffer *ab,
> >>>>>           case LSM_AUDIT_DATA_IPC:
> >>>>>                   audit_log_format(ab, " key=%d ", a->u.ipc_id);
> >>>>>                   break;
> >>>>> -       case LSM_AUDIT_DATA_CAP:
> >>>>> -               audit_log_format(ab, " capability=%d ", a->u.cap);
> >>>>> +       case LSM_AUDIT_DATA_CAP: {
> >>>>> +               const struct inode *inode;
> >>>>> +
> >>>>> +               audit_log_format(ab, " capability=%d ", a->u.cap_struct.cap);
> >>>>> +               if (a->u.cap_struct.cad) {
> >>>>> +                       switch (a->u.cap_struct.cad->type) {
> >>>>> +                       case CAP_AUX_DATA_INODE: {
> >>>>> +                               inode = a->u.cap_struct.cad->u.inode;
> >>>>> +
> >>>>> +                               audit_log_format(ab, " dev=");
> >>>>> +                               audit_log_untrustedstring(ab,
> >>>>> +                                       inode->i_sb->s_id);
> >>>>> +                               audit_log_format(ab, " ino=%lu",
> >>>>> +                                       inode->i_ino);
> >>>>> +                               break;
> >>>>> +                       }
> >>>>
> >>>> Since you are declaring "inode" further up, there doesn't appear to be
> >>>> any need for the CAP_AUX_DATA_INODE braces, please remove them.
> >>>>
> >>>> The general recommended practice when it comes to "sometimes" fields
> >>>> in an audit record, is to always record them in the record, but use a
> >>>> value of "?" when there is nothing relevant to record.  For example,
> >>>> when *not* recording inode information you would do something like the
> >>>> following:
> >>>>
> >>>>     audit_log_format(ab, " dev=? ino=?");
> >>>>
> >>> The issue this brings up is what happens when this is expanded to more
> >>> cases? Assuming there will be a case here for logging audit data for task
> >>> based capabilities (CAP_AUX_DATA_TASK), what do we want to have happen if we
> >>> are recording *neither* inode information nor task information (say a PID)?
> >>> If we log something in the inode case, we presumably don't want to call
> >>> audit_log_format(ab, " dev=?, pid=?") as well. (And vice versa for when we
> >>> log a pid and no inode).
> >>
> >> Yup.  This record is already a mess due to that.
> >>
> >> The general solution is to either use a new record type for each
> >> variant, or to use an auxiliary record for each additional piece of
> >> information.  The long term solution that has been suggested it to move
> >> away from a text message format.
> >
> > This is why I was suggesting that Aaron look into allowing the LSMs to
> > request generation of audit records in the earlier thread (and
> > mentioned it again at the top of my comments in this thread).
>
> How would that work? The behavior we want is to capture some information
> about the inode whenever there is a capability denial upon an attempted
> access to that inode.  Allowing the LSM to enable audit collection on a
> per-process basis doesn't appear to help with that goal, because this is
> something we want for all processes.  Further, we don't really want the
> rest of the audit collection machinery engaged here ...

I read this as "we want to audit this information, but we don't to
turn on the very thing which is designed to do this".  At some point,
if you want to audit things, you actually need to turn on auditing.

> ... we just want the
> inode information for this specific inode, and we have the inode readily
> accessible in the caller of the LSM hook already, so we don't need to do
> it earlier.

Aaron appeared to be complaining that if we stuck to the current best
practices regarding record formatting for the LSM generated audit
record, the record was going to get complicated when people started
adding additional information.  FWIW, I don't disagree.  The only real
alternative I've seen thus far is to look into having the LSM enable
certain records, if anyone has another idea I'm all ears/eyes.

> Further, in the future we want to be able to take the inode security
> label into consideration as part of the capability checking, which is
> independent of audit entirely.  So the rest of the patch will still be
> required even if the audit solution ends up being different.

That's a different topic, I don't think there are any remaining
objections to passing the inode information here.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-14 22:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Colascione, Song Liu, Kees Cook, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <CALCETrUkqUprujww26VxHwkdXQ3DWJH8nnL2VBYpK2EU0oX_YA@mail.gmail.com>

On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> 
> If eBPF is genuinely not usable by programs that are not fully trusted
> by the admin, then no kernel changes at all are needed.  Programs that
> want to reduce their own privileges can easily fork() a privileged
> subprocess or run a little helper to which they delegate BPF
> operations.  This is far more flexible than anything that will ever be
> in the kernel because it allows the helper to verify that the rest of
> the program is doing exactly what it's supposed to and restrict eBPF
> operations to exactly the subset that is needed.  So a container
> manager or network manager that drops some provilege could have a
> little bpf-helper that manages its BPF XDP, firewalling, etc
> configuration.  The two processes would talk over a socketpair.

there were three projects that tried to delegate bpf operations.
All of them failed.
bpf operational workflow is much more complex than you're imagining.
fork() also doesn't work for all cases.
I gave this example before: consider multiple systemd-like deamons
that need to do bpf operations that want to pass this 'bpf capability'
to other deamons written by other teams. Some of them will start
non-root, but still need to do bpf. They will be rpm installed
and live upgraded while running.
We considered to make systemd such centralized bpf delegation
authority too. It didn't work. bpf in kernel grows quickly.
libbpf part grows independently. llvm keeps evolving.
All of them are being changed while system overall has to stay
operational. Centralized approach breaks apart.

> The interesting cases you're talking about really *do* involved
> unprivileged or less privileged eBPF, though.  Let's see:
> 
> systemd --user: systemd --user *is not privileged at all*.  There's no
> issue of reducing privilege, since systemd --user doesn't have any
> privilege to begin with.  But systemd supports some eBPF features, and
> presumably it would like to support them in the systemd --user case.
> This is unprivileged eBPF.

Let's disambiguate the terminology.
This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
I think that was a mistake.
Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
This is not unprivileged.
'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.

There is a huge difference between the two.
I'm against extending 'unprivileged bpf' even a bit more than what it is
today for many reasons mentioned earlier.
The /dev/bpf is about 'less privileged'.
Less privileged than root. We need to split part of full root capability
into bpf capability. So that most of the root can be dropped.
This is very similar to what cap_net_admin does.
cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
cap_net_admin is very much privileged. Just 'less privileged' than root.
Same thing for cap_bpf.

May be we should do both cap_bpf and /dev/bpf to make it clear that
this is the same thing. Two interfaces to achieve the same result.

> Seccomp.  Seccomp already uses cBPF, which is a form of BPF although
> it doesn't involve the bpf() syscall.  There are some seccomp
> proposals in the works that will want some stuff from eBPF.  In

I'm afraid these proposals won't go anywhere.

> So it's a bit of a chicken-and-egg situation.  There aren't major
> unprivileged eBPF users because the kernel support isn't there.

As I said before there are zero known use cases of 'unprivileged bpf'.

If I understand you correctly you're refusing to accept that
'less privileged bpf' is a valid use case while pushing for extending
scope of 'unprivileged'.
Extending the scope is an orthogonal discussion. Currently I'm
opposed to that. Whereas 'less privileged' is what people require.

> It will remain extremely awkward for containers and
> especially nested containers to use eBPF.

I'm afraid we have to agree to disagree here and move on.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-14 22:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190814220545.co5pucyo5jk3weiv@ast-mbp.dhcp.thefacebook.com>



> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>> 
>> If eBPF is genuinely not usable by programs that are not fully trusted
>> by the admin, then no kernel changes at all are needed.  Programs that
>> want to reduce their own privileges can easily fork() a privileged
>> subprocess or run a little helper to which they delegate BPF
>> operations.  This is far more flexible than anything that will ever be
>> in the kernel because it allows the helper to verify that the rest of
>> the program is doing exactly what it's supposed to and restrict eBPF
>> operations to exactly the subset that is needed.  So a container
>> manager or network manager that drops some provilege could have a
>> little bpf-helper that manages its BPF XDP, firewalling, etc
>> configuration.  The two processes would talk over a socketpair.
> 
> there were three projects that tried to delegate bpf operations.
> All of them failed.
> bpf operational workflow is much more complex than you're imagining.
> fork() also doesn't work for all cases.
> I gave this example before: consider multiple systemd-like deamons
> that need to do bpf operations that want to pass this 'bpf capability'
> to other deamons written by other teams. Some of them will start
> non-root, but still need to do bpf. They will be rpm installed
> and live upgraded while running.
> We considered to make systemd such centralized bpf delegation
> authority too. It didn't work. bpf in kernel grows quickly.
> libbpf part grows independently. llvm keeps evolving.
> All of them are being changed while system overall has to stay
> operational. Centralized approach breaks apart.
> 
>> The interesting cases you're talking about really *do* involved
>> unprivileged or less privileged eBPF, though.  Let's see:
>> 
>> systemd --user: systemd --user *is not privileged at all*.  There's no
>> issue of reducing privilege, since systemd --user doesn't have any
>> privilege to begin with.  But systemd supports some eBPF features, and
>> presumably it would like to support them in the systemd --user case.
>> This is unprivileged eBPF.
> 
> Let's disambiguate the terminology.
> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> I think that was a mistake.
> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> This is not unprivileged.
> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> 
> There is a huge difference between the two.
> I'm against extending 'unprivileged bpf' even a bit more than what it is
> today for many reasons mentioned earlier.
> The /dev/bpf is about 'less privileged'.
> Less privileged than root. We need to split part of full root capability
> into bpf capability. So that most of the root can be dropped.
> This is very similar to what cap_net_admin does.
> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> cap_net_admin is very much privileged. Just 'less privileged' than root.
> Same thing for cap_bpf.

The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?

> 
> May be we should do both cap_bpf and /dev/bpf to make it clear that
> this is the same thing. Two interfaces to achieve the same result.

What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.

> 
>> Seccomp.  Seccomp already uses cBPF, which is a form of BPF although
>> it doesn't involve the bpf() syscall.  There are some seccomp
>> proposals in the works that will want some stuff from eBPF.  In
> 
> I'm afraid these proposals won't go anywhere.

Can you explain why?

> 
>> So it's a bit of a chicken-and-egg situation.  There aren't major
>> unprivileged eBPF users because the kernel support isn't there.
> 
> As I said before there are zero known use cases of 'unprivileged bpf'.
> 
> If I understand you correctly you're refusing to accept that
> 'less privileged bpf' is a valid use case while pushing for extending
> scope of 'unprivileged'.

No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.

^ permalink raw reply

* Re: [PATCH 1/2] KEYS: Replace uid/gid/perm permissions checking with an ACL
From: Eric Biggers @ 2019-08-14 22:41 UTC (permalink / raw)
  To: David Howells
  Cc: keyrings, linux-security-module, linux-fsdevel, linux-kernel
In-Reply-To: <20190807025814.GA1167@sol.localdomain>

On Tue, Aug 06, 2019 at 07:58:14PM -0700, Eric Biggers wrote:
> On Tue, Jul 30, 2019 at 06:16:14PM -0700, Eric Biggers wrote:
> > On Mon, Jul 29, 2019 at 08:49:56PM -0700, Eric Biggers wrote:
> > > Hi David,
> > > 
> > > On Tue, Jul 09, 2019 at 06:16:01PM -0700, Eric Biggers wrote:
> > > > On Thu, May 23, 2019 at 04:58:27PM +0100, David Howells wrote:
> > > > > Replace the uid/gid/perm permissions checking on a key with an ACL to allow
> > > > > the SETATTR and SEARCH permissions to be split.  This will also allow a
> > > > > greater range of subjects to represented.
> > > > > 
> > > > 
> > > > This patch broke 'keyctl new_session', and hence broke all the fscrypt tests:
> > > > 
> > > > $ keyctl new_session
> > > > keyctl_session_to_parent: Permission denied
> > > > 
> > > > Output of 'keyctl show' is
> > > > 
> > > > $ keyctl show
> > > > Session Keyring
> > > >  605894913 --alswrv      0     0  keyring: _ses
> > > >  189223103 ----s-rv      0     0   \_ user: invocation_id
> > > > 
> > > > - Eric
> > > 
> > > This bug is still present in next-20190729.
> > > 
> > > - Eric
> > 
> > This fixes it:
> > 
> > diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
> > index aa3bfcadbc660..519c94f1cc3c2 100644
> > --- a/security/keys/process_keys.c
> > +++ b/security/keys/process_keys.c
> > @@ -58,7 +58,7 @@ static struct key_acl session_keyring_acl = {
> >  	.possessor_viewable = true,
> >  	.nr_ace	= 2,
> >  	.aces = {
> > -		KEY_POSSESSOR_ACE(KEY_ACE__PERMS & ~KEY_ACE_JOIN),
> > +		KEY_POSSESSOR_ACE(KEY_ACE__PERMS),
> >  		KEY_OWNER_ACE(KEY_ACE_VIEW | KEY_ACE_READ),
> >  	}
> >  };
> > 
> > 
> > The old permissions were KEY_POS_ALL | KEY_USR_VIEW | KEY_USR_READ, so
> > I'm not sure why JOIN permission was removed?
> > 
> > - Eric
> 
> Ping.  This is still broken in linux-next.
> 

David, any comment on this?  This is still broken in linux-next.

- Eric

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-14 23:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <AD211133-EA60-4B91-AB1B-201713F50AB2@amacapital.net>

On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> >> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >> 
> >> If eBPF is genuinely not usable by programs that are not fully trusted
> >> by the admin, then no kernel changes at all are needed.  Programs that
> >> want to reduce their own privileges can easily fork() a privileged
> >> subprocess or run a little helper to which they delegate BPF
> >> operations.  This is far more flexible than anything that will ever be
> >> in the kernel because it allows the helper to verify that the rest of
> >> the program is doing exactly what it's supposed to and restrict eBPF
> >> operations to exactly the subset that is needed.  So a container
> >> manager or network manager that drops some provilege could have a
> >> little bpf-helper that manages its BPF XDP, firewalling, etc
> >> configuration.  The two processes would talk over a socketpair.
> > 
> > there were three projects that tried to delegate bpf operations.
> > All of them failed.
> > bpf operational workflow is much more complex than you're imagining.
> > fork() also doesn't work for all cases.
> > I gave this example before: consider multiple systemd-like deamons
> > that need to do bpf operations that want to pass this 'bpf capability'
> > to other deamons written by other teams. Some of them will start
> > non-root, but still need to do bpf. They will be rpm installed
> > and live upgraded while running.
> > We considered to make systemd such centralized bpf delegation
> > authority too. It didn't work. bpf in kernel grows quickly.
> > libbpf part grows independently. llvm keeps evolving.
> > All of them are being changed while system overall has to stay
> > operational. Centralized approach breaks apart.
> > 
> >> The interesting cases you're talking about really *do* involved
> >> unprivileged or less privileged eBPF, though.  Let's see:
> >> 
> >> systemd --user: systemd --user *is not privileged at all*.  There's no
> >> issue of reducing privilege, since systemd --user doesn't have any
> >> privilege to begin with.  But systemd supports some eBPF features, and
> >> presumably it would like to support them in the systemd --user case.
> >> This is unprivileged eBPF.
> > 
> > Let's disambiguate the terminology.
> > This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> > I think that was a mistake.
> > Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> > This is not unprivileged.
> > 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> > 
> > There is a huge difference between the two.
> > I'm against extending 'unprivileged bpf' even a bit more than what it is
> > today for many reasons mentioned earlier.
> > The /dev/bpf is about 'less privileged'.
> > Less privileged than root. We need to split part of full root capability
> > into bpf capability. So that most of the root can be dropped.
> > This is very similar to what cap_net_admin does.
> > cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> > cap_net_admin is very much privileged. Just 'less privileged' than root.
> > Same thing for cap_bpf.
> 
> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?

Initially I agreed that it's probably too broad, but then realized
that they're perfect as-is. There is no need to partition further.

> > May be we should do both cap_bpf and /dev/bpf to make it clear that
> > this is the same thing. Two interfaces to achieve the same result.
> 
> What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.

Indeed, ambient capabilities should work for all cases.

> No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.

There are not that many bits left. I prefer to consume single CAP_BPF bit.
All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
This is no-brainer.

The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
should be extended to CAP_BPF or not.
imo devmap and xskmap can stay CAP_NET_ADMIN,
but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
Initially cgroup-bpf hooks were limited to networking.
It's no longer the case. Requiring NET_ADMIN there make little sense now.


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-14 23:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190814233335.37t4zfsiswrpd4d6@ast-mbp.dhcp.thefacebook.com>



> On Aug 14, 2019, at 4:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
>> 
>> 
>>>> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>> 
>>>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>>>> 
>>>> If eBPF is genuinely not usable by programs that are not fully trusted
>>>> by the admin, then no kernel changes at all are needed.  Programs that
>>>> want to reduce their own privileges can easily fork() a privileged
>>>> subprocess or run a little helper to which they delegate BPF
>>>> operations.  This is far more flexible than anything that will ever be
>>>> in the kernel because it allows the helper to verify that the rest of
>>>> the program is doing exactly what it's supposed to and restrict eBPF
>>>> operations to exactly the subset that is needed.  So a container
>>>> manager or network manager that drops some provilege could have a
>>>> little bpf-helper that manages its BPF XDP, firewalling, etc
>>>> configuration.  The two processes would talk over a socketpair.
>>> 
>>> there were three projects that tried to delegate bpf operations.
>>> All of them failed.
>>> bpf operational workflow is much more complex than you're imagining.
>>> fork() also doesn't work for all cases.
>>> I gave this example before: consider multiple systemd-like deamons
>>> that need to do bpf operations that want to pass this 'bpf capability'
>>> to other deamons written by other teams. Some of them will start
>>> non-root, but still need to do bpf. They will be rpm installed
>>> and live upgraded while running.
>>> We considered to make systemd such centralized bpf delegation
>>> authority too. It didn't work. bpf in kernel grows quickly.
>>> libbpf part grows independently. llvm keeps evolving.
>>> All of them are being changed while system overall has to stay
>>> operational. Centralized approach breaks apart.
>>> 
>>>> The interesting cases you're talking about really *do* involved
>>>> unprivileged or less privileged eBPF, though.  Let's see:
>>>> 
>>>> systemd --user: systemd --user *is not privileged at all*.  There's no
>>>> issue of reducing privilege, since systemd --user doesn't have any
>>>> privilege to begin with.  But systemd supports some eBPF features, and
>>>> presumably it would like to support them in the systemd --user case.
>>>> This is unprivileged eBPF.
>>> 
>>> Let's disambiguate the terminology.
>>> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
>>> I think that was a mistake.
>>> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
>>> This is not unprivileged.
>>> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
>>> 
>>> There is a huge difference between the two.
>>> I'm against extending 'unprivileged bpf' even a bit more than what it is
>>> today for many reasons mentioned earlier.
>>> The /dev/bpf is about 'less privileged'.
>>> Less privileged than root. We need to split part of full root capability
>>> into bpf capability. So that most of the root can be dropped.
>>> This is very similar to what cap_net_admin does.
>>> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
>>> cap_net_admin is very much privileged. Just 'less privileged' than root.
>>> Same thing for cap_bpf.
>> 
>> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?
> 
> Initially I agreed that it's probably too broad, but then realized
> that they're perfect as-is. There is no need to partition further.
> 
>>> May be we should do both cap_bpf and /dev/bpf to make it clear that
>>> this is the same thing. Two interfaces to achieve the same result.
>> 
>> What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.
> 
> Indeed, ambient capabilities should work for all cases.
> 
>> No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.
> 
> There are not that many bits left. I prefer to consume single CAP_BPF bit.
> All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
> This is no-brainer.
> 
> The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
> should be extended to CAP_BPF or not.
> imo devmap and xskmap can stay CAP_NET_ADMIN,
> but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
> Initially cgroup-bpf hooks were limited to networking.
> It's no longer the case. Requiring NET_ADMIN there make little sense now.
> 

Cgroup bpf attach/detach, with the current API, gives very strong control over the whole system, and it will just get stronger as bpf gains features. Making it CAP_BPF means that you will never have the ability to make CAP_BPF safe to give to anything other than an extremely highly trusted process.  Unsafe pointers are similar.  The rest could plausibly be hardened in the future, although the by_id stuff may be tricky too.

Do new programs really need the by_id calls?  It could make sense to leave those unchanged and to have new programs use persistent maps instead.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-15  0:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <317422C3-ACE3-42A7-A287-7B8FEE12E33A@amacapital.net>

On Wed, Aug 14, 2019 at 04:59:18PM -0700, Andy Lutomirski wrote:
> 
> 
> > On Aug 14, 2019, at 4:33 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> >> On Wed, Aug 14, 2019 at 03:30:51PM -0700, Andy Lutomirski wrote:
> >> 
> >> 
> >>>> On Aug 14, 2019, at 3:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>> 
> >>>> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >>>> 
> >>>> If eBPF is genuinely not usable by programs that are not fully trusted
> >>>> by the admin, then no kernel changes at all are needed.  Programs that
> >>>> want to reduce their own privileges can easily fork() a privileged
> >>>> subprocess or run a little helper to which they delegate BPF
> >>>> operations.  This is far more flexible than anything that will ever be
> >>>> in the kernel because it allows the helper to verify that the rest of
> >>>> the program is doing exactly what it's supposed to and restrict eBPF
> >>>> operations to exactly the subset that is needed.  So a container
> >>>> manager or network manager that drops some provilege could have a
> >>>> little bpf-helper that manages its BPF XDP, firewalling, etc
> >>>> configuration.  The two processes would talk over a socketpair.
> >>> 
> >>> there were three projects that tried to delegate bpf operations.
> >>> All of them failed.
> >>> bpf operational workflow is much more complex than you're imagining.
> >>> fork() also doesn't work for all cases.
> >>> I gave this example before: consider multiple systemd-like deamons
> >>> that need to do bpf operations that want to pass this 'bpf capability'
> >>> to other deamons written by other teams. Some of them will start
> >>> non-root, but still need to do bpf. They will be rpm installed
> >>> and live upgraded while running.
> >>> We considered to make systemd such centralized bpf delegation
> >>> authority too. It didn't work. bpf in kernel grows quickly.
> >>> libbpf part grows independently. llvm keeps evolving.
> >>> All of them are being changed while system overall has to stay
> >>> operational. Centralized approach breaks apart.
> >>> 
> >>>> The interesting cases you're talking about really *do* involved
> >>>> unprivileged or less privileged eBPF, though.  Let's see:
> >>>> 
> >>>> systemd --user: systemd --user *is not privileged at all*.  There's no
> >>>> issue of reducing privilege, since systemd --user doesn't have any
> >>>> privilege to begin with.  But systemd supports some eBPF features, and
> >>>> presumably it would like to support them in the systemd --user case.
> >>>> This is unprivileged eBPF.
> >>> 
> >>> Let's disambiguate the terminology.
> >>> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> >>> I think that was a mistake.
> >>> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> >>> This is not unprivileged.
> >>> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> >>> 
> >>> There is a huge difference between the two.
> >>> I'm against extending 'unprivileged bpf' even a bit more than what it is
> >>> today for many reasons mentioned earlier.
> >>> The /dev/bpf is about 'less privileged'.
> >>> Less privileged than root. We need to split part of full root capability
> >>> into bpf capability. So that most of the root can be dropped.
> >>> This is very similar to what cap_net_admin does.
> >>> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> >>> cap_net_admin is very much privileged. Just 'less privileged' than root.
> >>> Same thing for cap_bpf.
> >> 
> >> The new pseudo-capability in this patch set is absurdly broad. I’ve proposed some finer-grained divisions in this thread. Do you have comments on them?
> > 
> > Initially I agreed that it's probably too broad, but then realized
> > that they're perfect as-is. There is no need to partition further.
> > 
> >>> May be we should do both cap_bpf and /dev/bpf to make it clear that
> >>> this is the same thing. Two interfaces to achieve the same result.
> >> 
> >> What for?  If there’s a CAP_BPF, then why do you want /dev/bpf? Especially if you define it to do the same thing.
> > 
> > Indeed, ambient capabilities should work for all cases.
> > 
> >> No, I’m not.  I have no objection at all if you try to come up with a clear definition of what the capability checks do and what it means to grant a new permission to a task.  Changing *all* of the capable checks is needlessly broad.
> > 
> > There are not that many bits left. I prefer to consume single CAP_BPF bit.
> > All capable(CAP_SYS_ADMIN) checks in kernel/bpf/ will become CAP_BPF.
> > This is no-brainer.
> > 
> > The only question is whether few cases of CAP_NET_ADMIN in kernel/bpf/
> > should be extended to CAP_BPF or not.
> > imo devmap and xskmap can stay CAP_NET_ADMIN,
> > but cgroup bpf attach/detach should be either CAP_NET_ADMIN or CAP_BPF.
> > Initially cgroup-bpf hooks were limited to networking.
> > It's no longer the case. Requiring NET_ADMIN there make little sense now.
> > 
> 
> Cgroup bpf attach/detach, with the current API, gives very strong control over the whole system, and it will just get stronger as bpf gains features. Making it CAP_BPF means that you will never have the ability to make CAP_BPF safe to give to anything other than an extremely highly trusted process.  Unsafe pointers are similar. 

'never to less trusted process' ? why do you think so?
I don't see a problem adding /dev/bpf/foo in the future and make things
more granular. There is no such use case today. Hence I don't want to
spend time and design something without clear use case in mind.

> Do new programs really need the by_id calls? 

yes. Lorenz gave an example earlier. map-in-map returns map_id.
To operate on that map by_id is needed.


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Jordan Glover @ 2019-08-15 11:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190814220545.co5pucyo5jk3weiv@ast-mbp.dhcp.thefacebook.com>

On Wednesday, August 14, 2019 10:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
>
> > If eBPF is genuinely not usable by programs that are not fully trusted
> > by the admin, then no kernel changes at all are needed. Programs that
> > want to reduce their own privileges can easily fork() a privileged
> > subprocess or run a little helper to which they delegate BPF
> > operations. This is far more flexible than anything that will ever be
> > in the kernel because it allows the helper to verify that the rest of
> > the program is doing exactly what it's supposed to and restrict eBPF
> > operations to exactly the subset that is needed. So a container
> > manager or network manager that drops some provilege could have a
> > little bpf-helper that manages its BPF XDP, firewalling, etc
> > configuration. The two processes would talk over a socketpair.
>
> there were three projects that tried to delegate bpf operations.
> All of them failed.
> bpf operational workflow is much more complex than you're imagining.
> fork() also doesn't work for all cases.
> I gave this example before: consider multiple systemd-like deamons
> that need to do bpf operations that want to pass this 'bpf capability'
> to other deamons written by other teams. Some of them will start
> non-root, but still need to do bpf. They will be rpm installed
> and live upgraded while running.
> We considered to make systemd such centralized bpf delegation
> authority too. It didn't work. bpf in kernel grows quickly.
> libbpf part grows independently. llvm keeps evolving.
> All of them are being changed while system overall has to stay
> operational. Centralized approach breaks apart.
>
> > The interesting cases you're talking about really do involved
> > unprivileged or less privileged eBPF, though. Let's see:
> > systemd --user: systemd --user is not privileged at all. There's no
> > issue of reducing privilege, since systemd --user doesn't have any
> > privilege to begin with. But systemd supports some eBPF features, and
> > presumably it would like to support them in the systemd --user case.
> > This is unprivileged eBPF.
>
> Let's disambiguate the terminology.
> This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> I think that was a mistake.
> Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> This is not unprivileged.
> 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
>
> There is a huge difference between the two.
> I'm against extending 'unprivileged bpf' even a bit more than what it is
> today for many reasons mentioned earlier.
> The /dev/bpf is about 'less privileged'.
> Less privileged than root. We need to split part of full root capability
> into bpf capability. So that most of the root can be dropped.
> This is very similar to what cap_net_admin does.
> cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> cap_net_admin is very much privileged. Just 'less privileged' than root.
> Same thing for cap_bpf.
>
> May be we should do both cap_bpf and /dev/bpf to make it clear that
> this is the same thing. Two interfaces to achieve the same result.
>

systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
Granting them cap_bpf is the same as granting it to every other unprivileged user
process. Also unprivileged user process can start systemd --user process with any
command they like.

Jordan

^ permalink raw reply

* Re: [RFC/RFT v4 0/5] Add generic trusted keys framework/subsystem
From: Sumit Garg @ 2019-08-15 13:03 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: keyrings, linux-integrity,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	linux-security-module, dhowells, Herbert Xu, davem, peterhuewe,
	jgg, jejb, Jarkko Sakkinen, Arnd Bergmann, Greg Kroah-Hartman,
	James Morris, Serge E. Hallyn, Casey Schaufler, Ard Biesheuvel,
	Daniel Thompson, Linux Kernel Mailing List,
	tee-dev @ lists . linaro . org
In-Reply-To: <1565789078.10490.10.camel@kernel.org>

Hi Mimi,

On Wed, 14 Aug 2019 at 18:54, Mimi Zohar <zohar@kernel.org> wrote:
>
> Hi Sumit,
>
> On Tue, 2019-08-13 at 13:22 +0530, Sumit Garg wrote:
> > This patch-set is an outcome of discussion here [1]. It has evolved very
> > much since v1 to create, consolidate and generalize trusted keys
> > subsystem.
> >
> > This framework has been tested with trusted keys support provided via TEE
> > but I wasn't able to test it with a TPM device as I don't possess one. It
> > would be really helpful if others could test this patch-set using a TPM
> > device.
>
> With the "CONFIG_HEADER_TEST" and "CONFIG_KERNEL_HEADER_TEST" config
> options enabled, which is required for linux-next, it fails to build.
>

TBH, I wasn't aware about this test feature for headers. It looks like
the header which fails this test is "include/keys/trusted_tpm.h" which
is basically a rename of "include/keys/trusted.h" plus changes in this
patch-set.

And "include/keys/trusted.h" header is already put under blacklist
here: "include/Kbuild +68" as it fails to build. So its that rename
due to which build failure is observed now.

It seems to be an easy fix for this build failure via following changes:

diff --git a/include/keys/trusted_tpm.h b/include/keys/trusted_tpm.h
index 7b593447920b..ca1bec0ef65d 100644
--- a/include/keys/trusted_tpm.h
+++ b/include/keys/trusted_tpm.h
@@ -2,6 +2,9 @@
 #ifndef __TRUSTED_TPM_H
 #define __TRUSTED_TPM_H

+#include <keys/trusted-type.h>
+#include <linux/tpm_command.h>
+
 /* implementation specific TPM constants */
 #define MAX_BUF_SIZE                   1024
 #define TPM_GETRANDOM_SIZE             14

So I will include above changes in this patch-set and also remove
"include/keys/trusted.h" header from the blacklist.

-Sumit

> Mimi

^ permalink raw reply related

* Re: [Non-DoD Source] Re: [RFC PATCH v2] security, capability: pass object information to security_capable
From: Aaron Goidel @ 2019-08-15 13:10 UTC (permalink / raw)
  To: Paul Moore, Stephen Smalley
  Cc: Richard Guy Briggs, mortonm, john.johansen, selinux, James Morris,
	luto, linux-security-module, linux-audit, Nicholas Franck,
	Serge Hallyn
In-Reply-To: <CAHC9VhRzz52bVwMikM7C65vCCSLb0=y1HtB50o-H0G3AMHqRNw@mail.gmail.com>

On 8/14/19 5:27 PM, Paul Moore wrote:
> On Wed, Aug 14, 2019 at 5:08 PM Stephen Smalley <sds@tycho.nsa.gov> wrote:
>> On 8/14/19 3:59 PM, Paul Moore wrote:
>>> On Tue, Aug 13, 2019 at 5:27 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>>>> On 2019-08-13 11:01, Aaron Goidel wrote:
>>>>> On 8/8/19 12:30 PM, Paul Moore wrote:
>>>>>> On Thu, Aug 1, 2019 at 10:43 AM Aaron Goidel <acgoide@tycho.nsa.gov> wrote:
>>>>>>> From: Nicholas Franck <nhfran2@tycho.nsa.gov>
>>>>>>>
>>>>>>> At present security_capable does not pass any object information
>>>>>>> and therefore can neither audit the particular object nor take it
>>>>>>> into account. Augment the security_capable interface to support
>>>>>>> passing supplementary data. Use this facility initially to convey
>>>>>>> the inode for capability checks relevant to inodes. This only
>>>>>>> addresses capable_wrt_inode_uidgid calls; other capability checks
>>>>>>> relevant to inodes will be addressed in subsequent changes. In the
>>>>>>> future, this will be further extended to pass object information for
>>>>>>> other capability checks such as the target task for CAP_KILL.
>>>>>>>
>>>>>>> In SELinux this new information is leveraged here to include the inode
>>>>>>> in the audit message. In the future, it could also be used to perform
>>>>>>> a per inode capability checks.
>>>>>>>
>>>>>>> It would be possible to fold the existing opts argument into this new
>>>>>>> supplementary data structure. This was omitted from this change to
>>>>>>> minimize changes.
>>>>>>>
>>>>>>> Signed-off-by: Nicholas Franck <nhfran2@tycho.nsa.gov>
>>>>>>> Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
>>>>>>> ---
>>>>>>> v2:
>>>>>>> - Changed order of audit prints so optional information comes second
>>>>>>> ---
>>>>>>>     include/linux/capability.h             |  7 ++++++
>>>>>>>     include/linux/lsm_audit.h              |  5 +++-
>>>>>>>     include/linux/lsm_hooks.h              |  3 ++-
>>>>>>>     include/linux/security.h               | 23 +++++++++++++-----
>>>>>>>     kernel/capability.c                    | 33 ++++++++++++++++++--------
>>>>>>>     kernel/seccomp.c                       |  2 +-
>>>>>>>     security/apparmor/capability.c         |  8 ++++---
>>>>>>>     security/apparmor/include/capability.h |  4 +++-
>>>>>>>     security/apparmor/ipc.c                |  2 +-
>>>>>>>     security/apparmor/lsm.c                |  5 ++--
>>>>>>>     security/apparmor/resource.c           |  2 +-
>>>>>>>     security/commoncap.c                   | 11 +++++----
>>>>>>>     security/lsm_audit.c                   | 21 ++++++++++++++--
>>>>>>>     security/safesetid/lsm.c               |  3 ++-
>>>>>>>     security/security.c                    |  5 ++--
>>>>>>>     security/selinux/hooks.c               | 20 +++++++++-------
>>>>>>>     security/smack/smack_access.c          |  2 +-
>>>>>>>     17 files changed, 110 insertions(+), 46 deletions(-)
>>>>>>
>>>>>> You should CC the linux-audit list, I've added them on this mail.
>>>>>>
>>>>>> I had hoped to see some thought put into the idea of dynamically
>>>>>> emitting the proper audit records as I mentioned in the previous patch
>>>>>> set, but regardless there are some comments on this code as written
>>>>>> ...
>>>>>>
>>>>>>> diff --git a/security/lsm_audit.c b/security/lsm_audit.c
>>>>>>> index 33028c098ef3..18cc7c956b69 100644
>>>>>>> --- a/security/lsm_audit.c
>>>>>>> +++ b/security/lsm_audit.c
>>>>>>> @@ -229,9 +229,26 @@ static void dump_common_audit_data(struct audit_buffer *ab,
>>>>>>>            case LSM_AUDIT_DATA_IPC:
>>>>>>>                    audit_log_format(ab, " key=%d ", a->u.ipc_id);
>>>>>>>                    break;
>>>>>>> -       case LSM_AUDIT_DATA_CAP:
>>>>>>> -               audit_log_format(ab, " capability=%d ", a->u.cap);
>>>>>>> +       case LSM_AUDIT_DATA_CAP: {
>>>>>>> +               const struct inode *inode;
>>>>>>> +
>>>>>>> +               audit_log_format(ab, " capability=%d ", a->u.cap_struct.cap);
>>>>>>> +               if (a->u.cap_struct.cad) {
>>>>>>> +                       switch (a->u.cap_struct.cad->type) {
>>>>>>> +                       case CAP_AUX_DATA_INODE: {
>>>>>>> +                               inode = a->u.cap_struct.cad->u.inode;
>>>>>>> +
>>>>>>> +                               audit_log_format(ab, " dev=");
>>>>>>> +                               audit_log_untrustedstring(ab,
>>>>>>> +                                       inode->i_sb->s_id);
>>>>>>> +                               audit_log_format(ab, " ino=%lu",
>>>>>>> +                                       inode->i_ino);
>>>>>>> +                               break;
>>>>>>> +                       }
>>>>>>
>>>>>> Since you are declaring "inode" further up, there doesn't appear to be
>>>>>> any need for the CAP_AUX_DATA_INODE braces, please remove them.
>>>>>>
>>>>>> The general recommended practice when it comes to "sometimes" fields
>>>>>> in an audit record, is to always record them in the record, but use a
>>>>>> value of "?" when there is nothing relevant to record.  For example,
>>>>>> when *not* recording inode information you would do something like the
>>>>>> following:
>>>>>>
>>>>>>      audit_log_format(ab, " dev=? ino=?");
>>>>>>
>>>>> The issue this brings up is what happens when this is expanded to more
>>>>> cases? Assuming there will be a case here for logging audit data for task
>>>>> based capabilities (CAP_AUX_DATA_TASK), what do we want to have happen if we
>>>>> are recording *neither* inode information nor task information (say a PID)?
>>>>> If we log something in the inode case, we presumably don't want to call
>>>>> audit_log_format(ab, " dev=?, pid=?") as well. (And vice versa for when we
>>>>> log a pid and no inode).
>>>>
>>>> Yup.  This record is already a mess due to that.
>>>>
>>>> The general solution is to either use a new record type for each
>>>> variant, or to use an auxiliary record for each additional piece of
>>>> information.  The long term solution that has been suggested it to move
>>>> away from a text message format.
>>>
>>> This is why I was suggesting that Aaron look into allowing the LSMs to
>>> request generation of audit records in the earlier thread (and
>>> mentioned it again at the top of my comments in this thread).
>>
>> How would that work? The behavior we want is to capture some information
>> about the inode whenever there is a capability denial upon an attempted
>> access to that inode.  Allowing the LSM to enable audit collection on a
>> per-process basis doesn't appear to help with that goal, because this is
>> something we want for all processes.  Further, we don't really want the
>> rest of the audit collection machinery engaged here ...
> 
> I read this as "we want to audit this information, but we don't to
> turn on the very thing which is designed to do this".  At some point,
> if you want to audit things, you actually need to turn on auditing.
> 
>> ... we just want the
>> inode information for this specific inode, and we have the inode readily
>> accessible in the caller of the LSM hook already, so we don't need to do
>> it earlier.
> 
> Aaron appeared to be complaining that if we stuck to the current best
> practices regarding record formatting for the LSM generated audit
> record, the record was going to get complicated when people started
> adding additional information.  FWIW, I don't disagree.  The only real
> alternative I've seen thus far is to look into having the LSM enable
> certain records, if anyone has another idea I'm all ears/eyes.
> 
>> Further, in the future we want to be able to take the inode security
>> label into consideration as part of the capability checking, which is
>> independent of audit entirely.  So the rest of the patch will still be
>> required even if the audit solution ends up being different.
> 
> That's a different topic, I don't think there are any remaining
> objections to passing the inode information here.
> 

I'm looking at how to enable LSMs to selectively turn on audit 
collection. So there seems to be two key points: audit_alloc() and 
__audit_syscall_entry(). Would it suffice to define a single boolean 
hook that takes the task and call it from both functions, to decide 
whether to override an AUDIT_DISABLED state in audit_alloc() and to 
override a 0 audit_n_rules in __audit_syscall_entry(). In audit_alloc() 
if audit_filter_task() returned AUDIT_DISABLED and the hook returned 
true, we would change the state to AUDIT_BUILD_CONTEXT. In 
__audit_syscall_entry(), if the hook returned true, we would set dummy 
to 0. Obviously, we could have a more general hook which lets us return 
arbitrary audit states, but, it isn't clear how we would reconcile 
conflicting results from audit_filter_task() and the hook for any 
situation other than AUDIT_DISABLED. We could also potentially use a 
different hook in __audit_syscall_entry(), though I don't think that we 
want the LSMs trying to interpret the syscall number or arguments.

Do you think that is sufficiently general or would you suggest something 
different?

-- 
Aaron

^ permalink raw reply

* Re: [RFC/RFT v4 0/5] Add generic trusted keys framework/subsystem
From: Mimi Zohar @ 2019-08-15 15:06 UTC (permalink / raw)
  To: Sumit Garg
  Cc: keyrings, linux-integrity,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	linux-security-module, dhowells, Herbert Xu, davem, peterhuewe,
	jgg, jejb, Jarkko Sakkinen, Arnd Bergmann, Greg Kroah-Hartman,
	James Morris, Serge E. Hallyn, Casey Schaufler, Ard Biesheuvel,
	Daniel Thompson, Linux Kernel Mailing List,
	tee-dev @ lists . linaro . org
In-Reply-To: <CAFA6WYPU0oREaHROhhRsEXJTijvER8G4riBk4e4=Bd5XgGFqtQ@mail.gmail.com>

On Thu, 2019-08-15 at 18:33 +0530, Sumit Garg wrote:
> Hi Mimi,
> 
> On Wed, 14 Aug 2019 at 18:54, Mimi Zohar <zohar@kernel.org> wrote:
> >
> > Hi Sumit,
> >
> > On Tue, 2019-08-13 at 13:22 +0530, Sumit Garg wrote:
> > > This patch-set is an outcome of discussion here [1]. It has evolved very
> > > much since v1 to create, consolidate and generalize trusted keys
> > > subsystem.
> > >
> > > This framework has been tested with trusted keys support provided via TEE
> > > but I wasn't able to test it with a TPM device as I don't possess one. It
> > > would be really helpful if others could test this patch-set using a TPM
> > > device.
> >
> > With the "CONFIG_HEADER_TEST" and "CONFIG_KERNEL_HEADER_TEST" config
> > options enabled, which is required for linux-next, it fails to build.
> >
> 
> TBH, I wasn't aware about this test feature for headers. 

It's new to me too.

> It looks like
> the header which fails this test is "include/keys/trusted_tpm.h" which
> is basically a rename of "include/keys/trusted.h" plus changes in this
> patch-set.
> 
> And "include/keys/trusted.h" header is already put under blacklist
> here: "include/Kbuild +68" as it fails to build. So its that rename
> due to which build failure is observed now.
> 
> It seems to be an easy fix for this build failure via following changes:
> 
> diff --git a/include/keys/trusted_tpm.h b/include/keys/trusted_tpm.h
> index 7b593447920b..ca1bec0ef65d 100644
> --- a/include/keys/trusted_tpm.h
> +++ b/include/keys/trusted_tpm.h
> @@ -2,6 +2,9 @@
>  #ifndef __TRUSTED_TPM_H
>  #define __TRUSTED_TPM_H
> 
> +#include <keys/trusted-type.h>
> +#include <linux/tpm_command.h>
> +
>  /* implementation specific TPM constants */
>  #define MAX_BUF_SIZE                   1024
>  #define TPM_GETRANDOM_SIZE             14
> 
> So I will include above changes in this patch-set and also remove
> "include/keys/trusted.h" header from the blacklist.

That works, thanks.  With this patch set, at least the EVM trusted key
is properly being decrypted by the encrypted key with both a TPM 1.2
and PTT TPM 2.0.  My laptop still boots properly.  Over the weekend
I'll try to actually review the patches.

Mimi

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-15 17:28 UTC (permalink / raw)
  To: Jordan Glover
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <HG0x24u69mnaMFKuxHVAzHpyjwsD5-U6RpqFRua87wGWQCHg00Q8ZqPeA_5kJ9l-d6oe0cXa4HyYXMnOO0Aofp_LcPcQdG0WFV21z1MbgcE=@protonmail.ch>

On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> On Wednesday, August 14, 2019 10:05 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> >
> > > If eBPF is genuinely not usable by programs that are not fully trusted
> > > by the admin, then no kernel changes at all are needed. Programs that
> > > want to reduce their own privileges can easily fork() a privileged
> > > subprocess or run a little helper to which they delegate BPF
> > > operations. This is far more flexible than anything that will ever be
> > > in the kernel because it allows the helper to verify that the rest of
> > > the program is doing exactly what it's supposed to and restrict eBPF
> > > operations to exactly the subset that is needed. So a container
> > > manager or network manager that drops some provilege could have a
> > > little bpf-helper that manages its BPF XDP, firewalling, etc
> > > configuration. The two processes would talk over a socketpair.
> >
> > there were three projects that tried to delegate bpf operations.
> > All of them failed.
> > bpf operational workflow is much more complex than you're imagining.
> > fork() also doesn't work for all cases.
> > I gave this example before: consider multiple systemd-like deamons
> > that need to do bpf operations that want to pass this 'bpf capability'
> > to other deamons written by other teams. Some of them will start
> > non-root, but still need to do bpf. They will be rpm installed
> > and live upgraded while running.
> > We considered to make systemd such centralized bpf delegation
> > authority too. It didn't work. bpf in kernel grows quickly.
> > libbpf part grows independently. llvm keeps evolving.
> > All of them are being changed while system overall has to stay
> > operational. Centralized approach breaks apart.
> >
> > > The interesting cases you're talking about really do involved
> > > unprivileged or less privileged eBPF, though. Let's see:
> > > systemd --user: systemd --user is not privileged at all. There's no
> > > issue of reducing privilege, since systemd --user doesn't have any
> > > privilege to begin with. But systemd supports some eBPF features, and
> > > presumably it would like to support them in the systemd --user case.
> > > This is unprivileged eBPF.
> >
> > Let's disambiguate the terminology.
> > This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> > I think that was a mistake.
> > Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> > This is not unprivileged.
> > 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> >
> > There is a huge difference between the two.
> > I'm against extending 'unprivileged bpf' even a bit more than what it is
> > today for many reasons mentioned earlier.
> > The /dev/bpf is about 'less privileged'.
> > Less privileged than root. We need to split part of full root capability
> > into bpf capability. So that most of the root can be dropped.
> > This is very similar to what cap_net_admin does.
> > cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> > cap_net_admin is very much privileged. Just 'less privileged' than root.
> > Same thing for cap_bpf.
> >
> > May be we should do both cap_bpf and /dev/bpf to make it clear that
> > this is the same thing. Two interfaces to achieve the same result.
> >
> 
> systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> Granting them cap_bpf is the same as granting it to every other unprivileged user
> process. Also unprivileged user process can start systemd --user process with any
> command they like.

systemd itself is trusted. It's the same binary whether it runs as pid=1
or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
CAP_BPF is a natural step in the same direction.


^ permalink raw reply

* [RFC PATCH] audit, security: allow LSMs to selectively enable audit collection
From: Aaron Goidel @ 2019-08-15 17:41 UTC (permalink / raw)
  To: paul
  Cc: jmorris, serge, sds, keescook, casey, rgb, linux-audit,
	linux-security-module, selinux, Aaron Goidel

Presently, there is no way for LSMs to enable collection of supplemental
audit records such as path and inode information when a permission denial
occurs. Provide a LSM hook to allow LSMs to selectively enable collection
on a per-task basis, even if the audit configuration would otherwise
disable auditing of a task and/or contains no audit filter rules. If the
hook returns a non-zero result, collect all available audit information. If
the hook generates its own audit record, then supplemental audit
information will be emitted at syscall exit.

In SELinux, we implement this hook by returning the result of a permission
check on the process. If the new process2:audit_enable permission is
allowed by the policy, then audit collection will be enabled for that
process. Otherwise, SELinux will defer to the audit configuration.

Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
---
 include/linux/lsm_hooks.h           |  7 +++++++
 include/linux/security.h            |  7 ++++++-
 kernel/auditsc.c                    | 10 +++++++---
 security/security.c                 |  5 +++++
 security/selinux/hooks.c            | 11 +++++++++++
 security/selinux/include/classmap.h |  2 +-
 6 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index ead98af9c602..7d70a6759621 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1380,6 +1380,11 @@
  *	audit_rule_init.
  *	@lsmrule contains the allocated rule
  *
+ * @audit_enable:
+ *	Allow the security module to selectively enable audit collection
+ *	on permission denials based on whether or not @tsk has the
+ *	process2:audit_enable permission.
+ *
  * @inode_invalidate_secctx:
  *	Notify the security module that it must revalidate the security context
  *	of an inode.
@@ -1800,6 +1805,7 @@ union security_list_options {
 	int (*audit_rule_known)(struct audit_krule *krule);
 	int (*audit_rule_match)(u32 secid, u32 field, u32 op, void *lsmrule);
 	void (*audit_rule_free)(void *lsmrule);
+	int (*audit_enable)(struct task_struct *tsk);
 #endif /* CONFIG_AUDIT */
 
 #ifdef CONFIG_BPF_SYSCALL
@@ -2043,6 +2049,7 @@ struct security_hook_heads {
 	struct hlist_head audit_rule_known;
 	struct hlist_head audit_rule_match;
 	struct hlist_head audit_rule_free;
+	struct hlist_head audit_enable;
 #endif /* CONFIG_AUDIT */
 #ifdef CONFIG_BPF_SYSCALL
 	struct hlist_head bpf;
diff --git a/include/linux/security.h b/include/linux/security.h
index 7d9c1da1f659..7be66db8de4e 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -1719,7 +1719,7 @@ int security_audit_rule_init(u32 field, u32 op, char *rulestr, void **lsmrule);
 int security_audit_rule_known(struct audit_krule *krule);
 int security_audit_rule_match(u32 secid, u32 field, u32 op, void *lsmrule);
 void security_audit_rule_free(void *lsmrule);
-
+int security_audit_enable(struct task_struct *tsk);
 #else
 
 static inline int security_audit_rule_init(u32 field, u32 op, char *rulestr,
@@ -1742,6 +1742,11 @@ static inline int security_audit_rule_match(u32 secid, u32 field, u32 op,
 static inline void security_audit_rule_free(void *lsmrule)
 { }
 
+static inline int security_audit_enable(struct task_struct *tsk)
+{
+	return 0;
+}
+
 #endif /* CONFIG_SECURITY */
 #endif /* CONFIG_AUDIT */
 
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 95ae27edd417..7e052b71bc42 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -906,8 +906,12 @@ int audit_alloc(struct task_struct *tsk)
 
 	state = audit_filter_task(tsk, &key);
 	if (state == AUDIT_DISABLED) {
-		clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
-		return 0;
+		if (security_audit_enable(tsk)) {
+			state = AUDIT_BUILD_CONTEXT;
+		} else {
+			clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
+			return 0;
+		}
 	}
 
 	if (!(context = audit_alloc_context(state))) {
@@ -1623,7 +1627,7 @@ void __audit_syscall_entry(int major, unsigned long a1, unsigned long a2,
 	if (state == AUDIT_DISABLED)
 		return;
 
-	context->dummy = !audit_n_rules;
+	context->dummy = !audit_n_rules && !security_audit_enable(current);
 	if (!context->dummy && state == AUDIT_BUILD_CONTEXT) {
 		context->prio = 0;
 		if (auditd_test_task(current))
diff --git a/security/security.c b/security/security.c
index 30687e1366b7..04e160e5d4ab 100644
--- a/security/security.c
+++ b/security/security.c
@@ -2333,6 +2333,11 @@ int security_audit_rule_match(u32 secid, u32 field, u32 op, void *lsmrule)
 {
 	return call_int_hook(audit_rule_match, 0, secid, field, op, lsmrule);
 }
+
+int security_audit_enable(struct task_struct *tsk)
+{
+	return call_int_hook(audit_enable, 0, tsk);
+}
 #endif /* CONFIG_AUDIT */
 
 #ifdef CONFIG_BPF_SYSCALL
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index d55571c585ff..88764aa0ab43 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -6628,6 +6628,16 @@ static void selinux_ib_free_security(void *ib_sec)
 }
 #endif
 
+#ifdef CONFIG_AUDIT
+static int selinux_audit_enable(struct task_struct *tsk)
+{
+	u32 sid = current_sid();
+
+	return !avc_has_perm(&selinux_state, sid, sid, SECCLASS_PROCESS2,
+			PROCESS2__AUDIT_ENABLE, NULL);
+}
+#endif
+
 #ifdef CONFIG_BPF_SYSCALL
 static int selinux_bpf(int cmd, union bpf_attr *attr,
 				     unsigned int size)
@@ -6999,6 +7009,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(audit_rule_known, selinux_audit_rule_known),
 	LSM_HOOK_INIT(audit_rule_match, selinux_audit_rule_match),
 	LSM_HOOK_INIT(audit_rule_free, selinux_audit_rule_free),
+	LSM_HOOK_INIT(audit_enable, selinux_audit_enable),
 #endif
 
 #ifdef CONFIG_BPF_SYSCALL
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 32e9b03be3dd..d7d856cbd486 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -52,7 +52,7 @@ struct security_class_mapping secclass_map[] = {
 	    "execmem", "execstack", "execheap", "setkeycreate",
 	    "setsockcreate", "getrlimit", NULL } },
 	{ "process2",
-	  { "nnp_transition", "nosuid_transition", NULL } },
+	  { "nnp_transition", "nosuid_transition", "audit_enable", NULL } },
 	{ "system",
 	  { "ipc_info", "syslog_read", "syslog_mod",
 	    "syslog_console", "module_request", "module_load", NULL } },
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-15 18:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jordan Glover, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <20190815172856.yoqvgu2yfrgbkowu@ast-mbp.dhcp.thefacebook.com>

On Thu, Aug 15, 2019 at 10:29 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > process. Also unprivileged user process can start systemd --user process with any
> > command they like.
>
> systemd itself is trusted. It's the same binary whether it runs as pid=1
> or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> CAP_BPF is a natural step in the same direction.
>

I have the feeling that we're somehow speaking different languages.
What, precisely, do you mean when you say "systemd itself is trusted"?
 Do you mean "the administrator trusts that the /lib/systemd/systemd
binary is not malicious"?  Do you mean "the administrator trusts that
the running systemd process is not malicious"?

On a regular Linux desktop or server box, passing CAP_NET_ADMIN, your
envisioned CAP_BPF, or /dev/bpf as in this patchset through to a
systemd --user binary would be a gaping security hole.  You are
welcome to do it on your own systemd, but if a distro did it, it would
be a major error.

If you want IPAddressDeny= to work in a user systemd unit (i.e.
/etc/systemd/user/*), then I think you have two choices.  You could
have an API by which systemd --user can ask a privileged helper to
assist (which has all the challenges you mentioned but is definitely
*possible*) or the kernel bpf() interfaces need to be designed so
that, in the absence of kernel bugs, they are safe to use from an
unprivileged process.  By "safe", I mean "would not expose the system
to attack if the kernel's implementation of the bpf() ABI were
perfect".

My suggestions upthread for incrementally making bpf() depend less on
privilege would accomplish this goal.  It would be entirely reasonable
to say that, even with those changes, bpf() is still a large attack
surface and access to it should be restricted, and having a capability
or other mechanism to explicitly grant access to the
hopefully-secure-but-plausibly-buggy parts of bpf() would make sense.
But you rejected that idea and said you "realized that [changing all
the capable() checks is] perfect as-is" without much explanation,
which makes it hard to understand where you're coming from.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Jordan Glover @ 2019-08-15 18:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190815172856.yoqvgu2yfrgbkowu@ast-mbp.dhcp.thefacebook.com>

On Thursday, August 15, 2019 5:28 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
>
> > On Wednesday, August 14, 2019 10:05 PM, Alexei Starovoitov alexei.starovoitov@gmail.com wrote:
> >
> > > On Wed, Aug 14, 2019 at 10:51:23AM -0700, Andy Lutomirski wrote:
> > >
> > > > If eBPF is genuinely not usable by programs that are not fully trusted
> > > > by the admin, then no kernel changes at all are needed. Programs that
> > > > want to reduce their own privileges can easily fork() a privileged
> > > > subprocess or run a little helper to which they delegate BPF
> > > > operations. This is far more flexible than anything that will ever be
> > > > in the kernel because it allows the helper to verify that the rest of
> > > > the program is doing exactly what it's supposed to and restrict eBPF
> > > > operations to exactly the subset that is needed. So a container
> > > > manager or network manager that drops some provilege could have a
> > > > little bpf-helper that manages its BPF XDP, firewalling, etc
> > > > configuration. The two processes would talk over a socketpair.
> > >
> > > there were three projects that tried to delegate bpf operations.
> > > All of them failed.
> > > bpf operational workflow is much more complex than you're imagining.
> > > fork() also doesn't work for all cases.
> > > I gave this example before: consider multiple systemd-like deamons
> > > that need to do bpf operations that want to pass this 'bpf capability'
> > > to other deamons written by other teams. Some of them will start
> > > non-root, but still need to do bpf. They will be rpm installed
> > > and live upgraded while running.
> > > We considered to make systemd such centralized bpf delegation
> > > authority too. It didn't work. bpf in kernel grows quickly.
> > > libbpf part grows independently. llvm keeps evolving.
> > > All of them are being changed while system overall has to stay
> > > operational. Centralized approach breaks apart.
> > >
> > > > The interesting cases you're talking about really do involved
> > > > unprivileged or less privileged eBPF, though. Let's see:
> > > > systemd --user: systemd --user is not privileged at all. There's no
> > > > issue of reducing privilege, since systemd --user doesn't have any
> > > > privilege to begin with. But systemd supports some eBPF features, and
> > > > presumably it would like to support them in the systemd --user case.
> > > > This is unprivileged eBPF.
> > >
> > > Let's disambiguate the terminology.
> > > This /dev/bpf patch set started as describing the feature as 'unprivileged bpf'.
> > > I think that was a mistake.
> > > Let's call systemd-like deamon usage of bpf 'less privileged bpf'.
> > > This is not unprivileged.
> > > 'unprivileged bpf' is what sysctl kernel.unprivileged_bpf_disabled controls.
> > > There is a huge difference between the two.
> > > I'm against extending 'unprivileged bpf' even a bit more than what it is
> > > today for many reasons mentioned earlier.
> > > The /dev/bpf is about 'less privileged'.
> > > Less privileged than root. We need to split part of full root capability
> > > into bpf capability. So that most of the root can be dropped.
> > > This is very similar to what cap_net_admin does.
> > > cap_net_amdin can bring down eth0 which is just as bad as crashing the box.
> > > cap_net_admin is very much privileged. Just 'less privileged' than root.
> > > Same thing for cap_bpf.
> > > May be we should do both cap_bpf and /dev/bpf to make it clear that
> > > this is the same thing. Two interfaces to achieve the same result.
> >
> > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > process. Also unprivileged user process can start systemd --user process with any
> > command they like.
>
> systemd itself is trusted. It's the same binary whether it runs as pid=1
> or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> CAP_BPF is a natural step in the same direction.

The point was that systemd will run any arbitrary command you'll throw at it and you want
to automatically attach CAP_BPF to it. AmbientCapabilities is not valid option for
systemd --user instance (otherwise it would be nuts).

I think we may have misunderstanding here. Did you mean systemd "system" service with
"User=xxx" option instead of "systemd --user" service? It would make sense then.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Kees Cook @ 2019-08-15 19:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Song Liu, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <20190813215823.3sfbakzzjjykyng2@ast-mbp>

On Tue, Aug 13, 2019 at 02:58:25PM -0700, Alexei Starovoitov wrote:
> agree that containers (namespaces) reduce amount of trust necessary
> for apps to run, but the end goal is not security though.

Unsurprisingly, I totally disagree: this is the very definition of
improved "security": reduced attack surface, confined trust, etc.

> Linux has become a single user system.

I hope this is just hyperbole, because it's not true in reality. I agree
that the vast majority of Linux devices are single-user-at-a-time
systems now (rather than the "shell servers" of yore), but the system
still has to be expected to confine users from each other, root, and the
hardware. Switching users on Chrome OS or a distro laptop, etc is still
very much expected to _mean_ something.

> If user can ssh into the host they can become root.
> If arbitrary code can run on the host it will be break out of any sandbox.
> Containers are not providing the level of security that is enough
> to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.

I'm not sure why you draw the line for VMs -- they're just as buggy
as anything else. Regardless, I reject this line of thinking: yes,
all software is buggy, but that isn't a reason to give up. In fact,
we should be trying very hard to create safe code (*insert arguments
for sane languages and toolchains here*).

If you look at software safety as a binary, you will always be
disappointed. If you look at it as it manifests in the real world,
then there is some perspective to be had. Reachability of flaws becomes
a major factor; exploit chain length becomes a factor. There are very
real impacts to be had from security hardening, sandboxing, etc. Of
course nothing is perfect, but the current state of the world isn't
as you describe. (And I say this with the knowledge of how long
the lifetime of bugs are in the kernel.)

> Containers are used to make production systems safer.

Yes.

> Some people call it more 'secure', but it's clearly not secure for
> arbitrary code

Perhaps it's just a language issue. "More secure" and "safer" mean
mostly the same thing to me. I tend to think "safer" is actually
a superset that includes things that wreck the user experience but
aren't actually in the privilege manipulation realm. In the traditional
"security" triad of confidentiality, integrity, and availability, I tend
to weigh availability less highly, but a bug that stops someone from
doing their work but doesn't wreck data, let them switch users, etc,
is still considered a "security" issue by many folks. The fewer bugs
someone is exposed to improves their security, safety, whatever. The
easiest way to do that is confinement and its associated attack surface
reduction. tl;dr: security and safety are very use-case-specific
continuum, not a binary state.

> When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> It's been a constant source of pain. The constant blinding, randomization,
> verifier speculative analysis, all spectre v1, v2, v4 mitigations
> are simply not worth it. It's a lot of complex kernel code without users.
> There is not a single use case to allow arbitrary malicious bpf
> program to be loaded and executed.

The world isn't binary (safe code/malicious code), and we need to build
systems that can be used safely even when things go wrong. Yes, probably
no one has a system that _intentionally_ feeds eBPF into the kernel from
a web form. But there is probably someone who does it unintentionally,
or has a user login exposed on a system where unpriv BPF is enabled. The
point is to create primitives as safely as possible so when things DO
go wrong, they fail safe instead of making things worse.

I'm all for a "less privileged than root" API for eBPF, but I get worried
when I see "security" being treated as a binary state. Especially when
it is considered an always-failed state. :)

-- 
Kees Cook

^ permalink raw reply

* [RFC PATCH v3] security,capability: pass object information to security_capable
From: Aaron Goidel @ 2019-08-15 20:23 UTC (permalink / raw)
  To: paul
  Cc: rgb, mortonm, john.johansen, selinux, jmorris, luto,
	linux-security-module, linux-audit, nhfran2, serge, sds,
	Aaron Goidel

From: Nicholas Franck <nhfran2@tycho.nsa.gov>

At present security_capable does not pass any object information and
therefore can neither audit the particular object nor take it into
account. Augment the security_capable interface to support passing
supplementary data. Use this facility initially to convey the inode for
capability checks relevant to inodes. This only addresses
capable_wrt_inode_uidgid() calls; other capability checks relevant to
inodes will be addressed in subsequent changes. In the future, this will
be further extended to pass object information for other capability
checks such as the target task for CAP_KILL.

In SELinux this new information is leveraged here to perform an
additional inode based check for capabilities relevant to inodes. Since
the inode provided to capable_wrt_inode_uidgid() is a const argument,
this also required propagating const down to dump_common_audit_data() and
dropping the use of d_find_alias() to find an alias for the inode. This
was sketchy to begin with and should be obsoleted by a separate change
that will allow LSMs to trigger audit collection for all file-related
information.

A new security class is defined for the inode capability permission
checks. This is because the existing file-related classes were already
too close to being full to support the addition of these permissions.
This has the limitation of not supporting per-file-class distinctions
(e.g. distinguishing regular files from directories or special files if
they have the same security label). An alternative would be to
instantiate a separate class for every existing file class, or, to widen
access vectors to 64 bits. Neither seems justified.

It would be possible to fold the existing opts argument into this new
supplementary data structure. This was omitted from this change to
minimize changes.

Signed-off-by: Nicholas Franck <nhfran2@tycho.nsa.gov>
Signed-off-by: Aaron Goidel <acgoide@tycho.nsa.gov>
---
v3:
  - No longer change the audit record for capability checks
  - Introduce an inode-based check in selinux_capable
  - Drop attempts to find dentry when only inode is available
v2:
  - Changed order of audit prints so optional information comes second 
---
 include/linux/capability.h             |  7 +++
 include/linux/lsm_audit.h              |  2 +-
 include/linux/lsm_hooks.h              |  3 +-
 include/linux/security.h               | 23 ++++++---
 kernel/capability.c                    | 33 +++++++++----
 kernel/seccomp.c                       |  2 +-
 security/apparmor/capability.c         |  3 +-
 security/apparmor/include/capability.h |  4 +-
 security/apparmor/ipc.c                |  2 +-
 security/apparmor/lsm.c                |  5 +-
 security/apparmor/resource.c           |  2 +-
 security/commoncap.c                   | 11 +++--
 security/lsm_audit.c                   | 10 +---
 security/safesetid/lsm.c               |  3 +-
 security/security.c                    |  5 +-
 security/selinux/hooks.c               | 66 +++++++++++++++++++++++---
 security/selinux/include/classmap.h    |  3 ++
 security/smack/smack_access.c          |  2 +-
 18 files changed, 136 insertions(+), 50 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index ecce0f43c73a..f72de64c179d 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -211,6 +211,8 @@ extern bool capable(int cap);
 extern bool ns_capable(struct user_namespace *ns, int cap);
 extern bool ns_capable_noaudit(struct user_namespace *ns, int cap);
 extern bool ns_capable_setid(struct user_namespace *ns, int cap);
+extern bool ns_capable_inode(struct user_namespace *ns, int cap,
+			     const struct inode *inode);
 #else
 static inline bool has_capability(struct task_struct *t, int cap)
 {
@@ -246,6 +248,11 @@ static inline bool ns_capable_setid(struct user_namespace *ns, int cap)
 {
 	return true;
 }
+static bool ns_capable_inode(struct user_namespace *ns, int cap,
+			     const struct inode *inode)
+{
+	return true;
+}
 #endif /* CONFIG_MULTIUSER */
 extern bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode);
 extern bool capable_wrt_inode_uidgid(const struct inode *inode, int cap);
diff --git a/include/linux/lsm_audit.h b/include/linux/lsm_audit.h
index 915330abf6e5..dc7d00c7ee67 100644
--- a/include/linux/lsm_audit.h
+++ b/include/linux/lsm_audit.h
@@ -77,7 +77,7 @@ struct common_audit_data {
 	union 	{
 		struct path path;
 		struct dentry *dentry;
-		struct inode *inode;
+		const struct inode *inode;
 		struct lsm_network_audit *net;
 		int cap;
 		int ipc_id;
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index ead98af9c602..3270b8af3498 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1472,7 +1472,8 @@ union security_list_options {
 	int (*capable)(const struct cred *cred,
 			struct user_namespace *ns,
 			int cap,
-			unsigned int opts);
+			unsigned int opts,
+			struct cap_aux_data *cad);
 	int (*quotactl)(int cmds, int type, int id, struct super_block *sb);
 	int (*quota_on)(struct dentry *dentry);
 	int (*syslog)(int type);
diff --git a/include/linux/security.h b/include/linux/security.h
index 7d9c1da1f659..a37377065401 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -77,9 +77,18 @@ enum lsm_event {
 	LSM_POLICY_CHANGE,
 };
 
+
+struct cap_aux_data {
+	char type;
+#define CAP_AUX_DATA_INODE	1
+	union	{
+		const struct inode *inode;
+	} u;
+};
+
 /* These functions are in security/commoncap.c */
 extern int cap_capable(const struct cred *cred, struct user_namespace *ns,
-		       int cap, unsigned int opts);
+		       int cap, unsigned int opts, struct cap_aux_data *cad);
 extern int cap_settime(const struct timespec64 *ts, const struct timezone *tz);
 extern int cap_ptrace_access_check(struct task_struct *child, unsigned int mode);
 extern int cap_ptrace_traceme(struct task_struct *parent);
@@ -215,9 +224,10 @@ int security_capset(struct cred *new, const struct cred *old,
 		    const kernel_cap_t *inheritable,
 		    const kernel_cap_t *permitted);
 int security_capable(const struct cred *cred,
-		       struct user_namespace *ns,
-		       int cap,
-		       unsigned int opts);
+		     struct user_namespace *ns,
+		     int cap,
+		     unsigned int opts,
+		     struct cap_aux_data *cad);
 int security_quotactl(int cmds, int type, int id, struct super_block *sb);
 int security_quota_on(struct dentry *dentry);
 int security_syslog(int type);
@@ -478,9 +488,10 @@ static inline int security_capset(struct cred *new,
 static inline int security_capable(const struct cred *cred,
 				   struct user_namespace *ns,
 				   int cap,
-				   unsigned int opts)
+				   unsigned int opts,
+				   struct cap_aux_data *cad)
 {
-	return cap_capable(cred, ns, cap, opts);
+	return cap_capable(cred, ns, cap, opts, NULL);
 }
 
 static inline int security_quotactl(int cmds, int type, int id,
diff --git a/kernel/capability.c b/kernel/capability.c
index 1444f3954d75..c3723443904a 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -297,7 +297,7 @@ bool has_ns_capability(struct task_struct *t,
 	int ret;
 
 	rcu_read_lock();
-	ret = security_capable(__task_cred(t), ns, cap, CAP_OPT_NONE);
+	ret = security_capable(__task_cred(t), ns, cap, CAP_OPT_NONE, NULL);
 	rcu_read_unlock();
 
 	return (ret == 0);
@@ -338,7 +338,7 @@ bool has_ns_capability_noaudit(struct task_struct *t,
 	int ret;
 
 	rcu_read_lock();
-	ret = security_capable(__task_cred(t), ns, cap, CAP_OPT_NOAUDIT);
+	ret = security_capable(__task_cred(t), ns, cap, CAP_OPT_NOAUDIT, NULL);
 	rcu_read_unlock();
 
 	return (ret == 0);
@@ -363,7 +363,8 @@ bool has_capability_noaudit(struct task_struct *t, int cap)
 
 static bool ns_capable_common(struct user_namespace *ns,
 			      int cap,
-			      unsigned int opts)
+			      unsigned int opts,
+			      struct cap_aux_data *cad)
 {
 	int capable;
 
@@ -372,7 +373,7 @@ static bool ns_capable_common(struct user_namespace *ns,
 		BUG();
 	}
 
-	capable = security_capable(current_cred(), ns, cap, opts);
+	capable = security_capable(current_cred(), ns, cap, opts, cad);
 	if (capable == 0) {
 		current->flags |= PF_SUPERPRIV;
 		return true;
@@ -393,7 +394,7 @@ static bool ns_capable_common(struct user_namespace *ns,
  */
 bool ns_capable(struct user_namespace *ns, int cap)
 {
-	return ns_capable_common(ns, cap, CAP_OPT_NONE);
+	return ns_capable_common(ns, cap, CAP_OPT_NONE, NULL);
 }
 EXPORT_SYMBOL(ns_capable);
 
@@ -411,7 +412,7 @@ EXPORT_SYMBOL(ns_capable);
  */
 bool ns_capable_noaudit(struct user_namespace *ns, int cap)
 {
-	return ns_capable_common(ns, cap, CAP_OPT_NOAUDIT);
+	return ns_capable_common(ns, cap, CAP_OPT_NOAUDIT, NULL);
 }
 EXPORT_SYMBOL(ns_capable_noaudit);
 
@@ -430,7 +431,7 @@ EXPORT_SYMBOL(ns_capable_noaudit);
  */
 bool ns_capable_setid(struct user_namespace *ns, int cap)
 {
-	return ns_capable_common(ns, cap, CAP_OPT_INSETID);
+	return ns_capable_common(ns, cap, CAP_OPT_INSETID, NULL);
 }
 EXPORT_SYMBOL(ns_capable_setid);
 
@@ -470,7 +471,7 @@ bool file_ns_capable(const struct file *file, struct user_namespace *ns,
 	if (WARN_ON_ONCE(!cap_valid(cap)))
 		return false;
 
-	if (security_capable(file->f_cred, ns, cap, CAP_OPT_NONE) == 0)
+	if (security_capable(file->f_cred, ns, cap, CAP_OPT_NONE, NULL) == 0)
 		return true;
 
 	return false;
@@ -503,7 +504,8 @@ bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
 {
 	struct user_namespace *ns = current_user_ns();
 
-	return ns_capable(ns, cap) && privileged_wrt_inode_uidgid(ns, inode);
+	return ns_capable_inode(ns, cap, inode) &&
+		privileged_wrt_inode_uidgid(ns, inode);
 }
 EXPORT_SYMBOL(capable_wrt_inode_uidgid);
 
@@ -524,7 +526,18 @@ bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns)
 	cred = rcu_dereference(tsk->ptracer_cred);
 	if (cred)
 		ret = security_capable(cred, ns, CAP_SYS_PTRACE,
-				       CAP_OPT_NOAUDIT);
+				       CAP_OPT_NOAUDIT, NULL);
 	rcu_read_unlock();
 	return (ret == 0);
 }
+
+bool ns_capable_inode(struct user_namespace *ns, int cap,
+			const struct inode *inode)
+{
+	struct cap_aux_data cad;
+
+	cad.type = CAP_AUX_DATA_INODE;
+	cad.u.inode = inode;
+	return ns_capable_common(ns, cap, CAP_OPT_NONE, &cad);
+}
+EXPORT_SYMBOL(ns_capable_inode);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 811b4a86cdf6..d59dd7079ece 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -446,7 +446,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	 */
 	if (!task_no_new_privs(current) &&
 	    security_capable(current_cred(), current_user_ns(),
-				     CAP_SYS_ADMIN, CAP_OPT_NOAUDIT) != 0)
+			     CAP_SYS_ADMIN, CAP_OPT_NOAUDIT, NULL) != 0)
 		return ERR_PTR(-EACCES);
 
 	/* Allocate a new seccomp_filter */
diff --git a/security/apparmor/capability.c b/security/apparmor/capability.c
index 752f73980e30..eb4a63fa9888 100644
--- a/security/apparmor/capability.c
+++ b/security/apparmor/capability.c
@@ -148,7 +148,8 @@ static int profile_capable(struct aa_profile *profile, int cap,
  *
  * Returns: 0 on success, or else an error code.
  */
-int aa_capable(struct aa_label *label, int cap, unsigned int opts)
+int aa_capable(struct aa_label *label, int cap, unsigned int opts,
+	       struct cap_aux_data *cad)
 {
 	struct aa_profile *profile;
 	int error = 0;
diff --git a/security/apparmor/include/capability.h b/security/apparmor/include/capability.h
index 1b3663b6ab12..d888f09d76d1 100644
--- a/security/apparmor/include/capability.h
+++ b/security/apparmor/include/capability.h
@@ -20,6 +20,7 @@
 #include "apparmorfs.h"
 
 struct aa_label;
+struct cap_aux_data;
 
 /* aa_caps - confinement data for capabilities
  * @allowed: capabilities mask
@@ -40,7 +41,8 @@ struct aa_caps {
 
 extern struct aa_sfs_entry aa_sfs_entry_caps[];
 
-int aa_capable(struct aa_label *label, int cap, unsigned int opts);
+int aa_capable(struct aa_label *label, int cap, unsigned int opts,
+	       struct cap_aux_data *cad);
 
 static inline void aa_free_cap_rules(struct aa_caps *caps)
 {
diff --git a/security/apparmor/ipc.c b/security/apparmor/ipc.c
index aacd1e95cb59..deb5267ca695 100644
--- a/security/apparmor/ipc.c
+++ b/security/apparmor/ipc.c
@@ -108,7 +108,7 @@ static int profile_tracer_perm(struct aa_profile *tracer,
 	aad(sa)->peer = tracee;
 	aad(sa)->request = 0;
 	aad(sa)->error = aa_capable(&tracer->label, CAP_SYS_PTRACE,
-				    CAP_OPT_NONE);
+				    CAP_OPT_NONE, NULL);
 
 	return aa_audit(AUDIT_APPARMOR_AUTO, tracer, sa, audit_ptrace_cb);
 }
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index 87500bde5a92..82790accb679 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -172,14 +172,15 @@ static int apparmor_capget(struct task_struct *target, kernel_cap_t *effective,
 }
 
 static int apparmor_capable(const struct cred *cred, struct user_namespace *ns,
-			    int cap, unsigned int opts)
+			    int cap, unsigned int opts,
+			    struct cap_aux_data *cad)
 {
 	struct aa_label *label;
 	int error = 0;
 
 	label = aa_get_newest_cred_label(cred);
 	if (!unconfined(label))
-		error = aa_capable(label, cap, opts);
+		error = aa_capable(label, cap, opts, cad);
 	aa_put_label(label);
 
 	return error;
diff --git a/security/apparmor/resource.c b/security/apparmor/resource.c
index 552ed09cb47e..9b3d4b4437f2 100644
--- a/security/apparmor/resource.c
+++ b/security/apparmor/resource.c
@@ -124,7 +124,7 @@ int aa_task_setrlimit(struct aa_label *label, struct task_struct *task,
 	 */
 
 	if (label != peer &&
-	    aa_capable(label, CAP_SYS_RESOURCE, CAP_OPT_NOAUDIT) != 0)
+	    aa_capable(label, CAP_SYS_RESOURCE, CAP_OPT_NOAUDIT, NULL) != 0)
 		error = fn_for_each(label, profile,
 				audit_resource(profile, resource,
 					       new_rlim->rlim_max, peer,
diff --git a/security/commoncap.c b/security/commoncap.c
index c477fb673701..1860ea50f473 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -68,7 +68,7 @@ static void warn_setuid_and_fcaps_mixed(const char *fname)
  * kernel's capable() and has_capability() returns 1 for this case.
  */
 int cap_capable(const struct cred *cred, struct user_namespace *targ_ns,
-		int cap, unsigned int opts)
+		int cap, unsigned int opts, struct cap_aux_data *cad)
 {
 	struct user_namespace *ns = targ_ns;
 
@@ -226,7 +226,7 @@ static inline int cap_inh_is_capped(void)
 	 * capability
 	 */
 	if (cap_capable(current_cred(), current_cred()->user_ns,
-			CAP_SETPCAP, CAP_OPT_NONE) == 0)
+			CAP_SETPCAP, CAP_OPT_NONE, NULL) == 0)
 		return 0;
 	return 1;
 }
@@ -1211,7 +1211,8 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 		    || (cap_capable(current_cred(),
 				    current_cred()->user_ns,
 				    CAP_SETPCAP,
-				    CAP_OPT_NONE) != 0)			/*[4]*/
+				    CAP_OPT_NONE,
+				    NULL) != 0)				/*[4]*/
 			/*
 			 * [1] no changing of bits that are locked
 			 * [2] no unlocking of locks
@@ -1307,7 +1308,7 @@ int cap_vm_enough_memory(struct mm_struct *mm, long pages)
 	int cap_sys_admin = 0;
 
 	if (cap_capable(current_cred(), &init_user_ns,
-				CAP_SYS_ADMIN, CAP_OPT_NOAUDIT) == 0)
+				CAP_SYS_ADMIN, CAP_OPT_NOAUDIT, NULL) == 0)
 		cap_sys_admin = 1;
 
 	return cap_sys_admin;
@@ -1328,7 +1329,7 @@ int cap_mmap_addr(unsigned long addr)
 
 	if (addr < dac_mmap_min_addr) {
 		ret = cap_capable(current_cred(), &init_user_ns, CAP_SYS_RAWIO,
-				  CAP_OPT_NONE);
+				  CAP_OPT_NONE, NULL);
 		/* set PF_SUPERPRIV if it turns out we allow the low mmap */
 		if (ret == 0)
 			current->flags |= PF_SUPERPRIV;
diff --git a/security/lsm_audit.c b/security/lsm_audit.c
index 33028c098ef3..67c2d02cd722 100644
--- a/security/lsm_audit.c
+++ b/security/lsm_audit.c
@@ -288,17 +288,9 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 		break;
 	}
 	case LSM_AUDIT_DATA_INODE: {
-		struct dentry *dentry;
-		struct inode *inode;
+		const struct inode *inode;
 
 		inode = a->u.inode;
-		dentry = d_find_alias(inode);
-		if (dentry) {
-			audit_log_format(ab, " name=");
-			audit_log_untrustedstring(ab,
-					 dentry->d_name.name);
-			dput(dentry);
-		}
 		audit_log_format(ab, " dev=");
 		audit_log_untrustedstring(ab, inode->i_sb->s_id);
 		audit_log_format(ab, " ino=%lu", inode->i_ino);
diff --git a/security/safesetid/lsm.c b/security/safesetid/lsm.c
index cecd38e2ac80..c74ed83e9501 100644
--- a/security/safesetid/lsm.c
+++ b/security/safesetid/lsm.c
@@ -80,7 +80,8 @@ static bool check_setuid_policy_hashtable_key_value(kuid_t parent,
 static int safesetid_security_capable(const struct cred *cred,
 				      struct user_namespace *ns,
 				      int cap,
-				      unsigned int opts)
+				      unsigned int opts,
+				      struct cap_aux_data *cad)
 {
 	if (cap == CAP_SETUID &&
 	    check_setuid_policy_hashtable_key(cred->uid)) {
diff --git a/security/security.c b/security/security.c
index 30687e1366b7..e5e0637f43ac 100644
--- a/security/security.c
+++ b/security/security.c
@@ -691,9 +691,10 @@ int security_capset(struct cred *new, const struct cred *old,
 int security_capable(const struct cred *cred,
 		     struct user_namespace *ns,
 		     int cap,
-		     unsigned int opts)
+		     unsigned int opts,
+		     struct cap_aux_data *cad)
 {
-	return call_int_hook(capable, 0, cred, ns, cap, opts);
+	return call_int_hook(capable, 0, cred, ns, cap, opts, cad);
 }
 
 int security_quotactl(int cmds, int type, int id, struct super_block *sb)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index d55571c585ff..093d17f9d500 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -1620,7 +1620,10 @@ static inline u32 signal_to_av(int sig)
 
 /* Check whether a task is allowed to use a capability. */
 static int cred_has_capability(const struct cred *cred,
-			       int cap, unsigned int opts, bool initns)
+				int cap,
+				unsigned int opts,
+				bool initns,
+				struct cap_aux_data *cad)
 {
 	struct common_audit_data ad;
 	struct av_decision avd;
@@ -1653,6 +1656,55 @@ static int cred_has_capability(const struct cred *cred,
 		if (rc2)
 			return rc2;
 	}
+
+	if (rc)
+		return rc;
+
+	if (cad && cad->type == CAP_AUX_DATA_INODE) {
+		const struct inode *inode = cad->u.inode;
+		struct inode_security_struct *isec = selinux_inode(inode);
+
+		switch (cap) {
+		case CAP_DAC_OVERRIDE:
+			av = INODE_CAP__DAC_OVERRIDE;
+			break;
+		case CAP_CHOWN:
+			av = INODE_CAP__CHOWN;
+			break;
+		case CAP_FSETID:
+			av = INODE_CAP__FSETID;
+			break;
+		case CAP_DAC_READ_SEARCH:
+			av = INODE_CAP__DAC_READ_SEARCH;
+			break;
+		case CAP_FOWNER:
+			av = INODE_CAP__FOWNER;
+			break;
+		case CAP_SETFCAP:
+			av = INODE_CAP__SETFCAP;
+			break;
+		default:
+			pr_err("SELinux: Unknown capability for inode %d\n",
+									cap);
+			return -EINVAL;
+		}
+
+		rc = avc_has_perm_noaudit(&selinux_state, sid, isec->sid,
+				SECCLASS_INODE_CAP, av, 0, &avd);
+		if (!(opts & CAP_OPT_NOAUDIT)) {
+			int rc2;
+
+			ad.type = LSM_AUDIT_DATA_INODE;
+			ad.u.inode = inode;
+			rc2 = avc_audit(&selinux_state, sid, isec->sid,
+					SECCLASS_INODE_CAP, av, &avd,
+					rc, &ad, 0);
+			if (rc2)
+				return rc2;
+		}
+	}
+
+
 	return rc;
 }
 
@@ -2167,9 +2219,9 @@ static int selinux_capset(struct cred *new, const struct cred *old,
  */
 
 static int selinux_capable(const struct cred *cred, struct user_namespace *ns,
-			   int cap, unsigned int opts)
+			   int cap, unsigned int opts, struct cap_aux_data *cad)
 {
-	return cred_has_capability(cred, cap, opts, ns == &init_user_ns);
+	return cred_has_capability(cred, cap, opts, ns == &init_user_ns, cad);
 }
 
 static int selinux_quotactl(int cmds, int type, int id, struct super_block *sb)
@@ -2243,7 +2295,7 @@ static int selinux_vm_enough_memory(struct mm_struct *mm, long pages)
 	int rc, cap_sys_admin = 0;
 
 	rc = cred_has_capability(current_cred(), CAP_SYS_ADMIN,
-				 CAP_OPT_NOAUDIT, true);
+				 CAP_OPT_NOAUDIT, true, NULL);
 	if (rc == 0)
 		cap_sys_admin = 1;
 
@@ -3103,9 +3155,9 @@ static bool has_cap_mac_admin(bool audit)
 	const struct cred *cred = current_cred();
 	unsigned int opts = audit ? CAP_OPT_NONE : CAP_OPT_NOAUDIT;
 
-	if (cap_capable(cred, &init_user_ns, CAP_MAC_ADMIN, opts))
+	if (cap_capable(cred, &init_user_ns, CAP_MAC_ADMIN, opts, NULL))
 		return false;
-	if (cred_has_capability(cred, CAP_MAC_ADMIN, opts, true))
+	if (cred_has_capability(cred, CAP_MAC_ADMIN, opts, true, NULL))
 		return false;
 	return true;
 }
@@ -3609,7 +3661,7 @@ static int selinux_file_ioctl(struct file *file, unsigned int cmd,
 	case KDSKBENT:
 	case KDSKBSENT:
 		error = cred_has_capability(cred, CAP_SYS_TTY_CONFIG,
-					    CAP_OPT_NONE, true);
+					    CAP_OPT_NONE, true, NULL);
 		break;
 
 	/* default case assumes that the command will go
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 32e9b03be3dd..09b3ac77fe97 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -244,6 +244,9 @@ struct security_class_mapping secclass_map[] = {
 	  {"map_create", "map_read", "map_write", "prog_load", "prog_run"} },
 	{ "xdp_socket",
 	  { COMMON_SOCK_PERMS, NULL } },
+	{ "inode_cap",
+	  { "dac_override", "chown", "fsetid", "fowner",
+	    "dac_read_search", "setfcap", NULL } },
 	{ NULL }
   };
 
diff --git a/security/smack/smack_access.c b/security/smack/smack_access.c
index fe2ce3a65822..e961bfe8f00a 100644
--- a/security/smack/smack_access.c
+++ b/security/smack/smack_access.c
@@ -640,7 +640,7 @@ bool smack_privileged_cred(int cap, const struct cred *cred)
 	struct smack_known_list_elem *sklep;
 	int rc;
 
-	rc = cap_capable(cred, &init_user_ns, cap, CAP_OPT_NONE);
+	rc = cap_capable(cred, &init_user_ns, cap, CAP_OPT_NONE, NULL);
 	if (rc)
 		return false;
 
-- 
2.21.0


^ permalink raw reply related

* Re: [RFC PATCH v3] security,capability: pass object information to security_capable
From: James Morris @ 2019-08-15 22:32 UTC (permalink / raw)
  To: Aaron Goidel
  Cc: paul, rgb, mortonm, john.johansen, selinux, luto,
	linux-security-module, linux-audit, nhfran2, serge, sds
In-Reply-To: <20190815202357.4238-1-acgoide@tycho.nsa.gov>

On Thu, 15 Aug 2019, Aaron Goidel wrote:

> In SELinux this new information is leveraged here to perform an
> additional inode based check for capabilities relevant to inodes. Since
> the inode provided to capable_wrt_inode_uidgid() is a const argument,
> this also required propagating const down to dump_common_audit_data() and
> dropping the use of d_find_alias() to find an alias for the inode. This
> was sketchy to begin with and should be obsoleted by a separate change
> that will allow LSMs to trigger audit collection for all file-related
> information.

Will the audit logs look the same once the 2nd patch is applied? We need 
to be careful about breaking existing userland.


-- 
James Morris
<jmorris@namei.org>


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-15 23:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jordan Glover, Daniel Colascione, Song Liu, Kees Cook, Networking,
	bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <CALCETrUv+g+cb79FJ1S4XuV0K=kowFkPXpzoC99svoOfs4-Kvg@mail.gmail.com>

On Thu, Aug 15, 2019 at 11:36:43AM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2019 at 10:29 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> > > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > > process. Also unprivileged user process can start systemd --user process with any
> > > command they like.
> >
> > systemd itself is trusted. It's the same binary whether it runs as pid=1
> > or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> > Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> > CAP_BPF is a natural step in the same direction.
> >
> 
> I have the feeling that we're somehow speaking different languages.
> What, precisely, do you mean when you say "systemd itself is trusted"?
>  Do you mean "the administrator trusts that the /lib/systemd/systemd
> binary is not malicious"?  Do you mean "the administrator trusts that
> the running systemd process is not malicious"?

please see
https://github.com/systemd/systemd/commit/4c1567f29aeb60a6741874bca8a8e3a0bd69ed01
I'm not advocating for or against this approach.
Call it 'security hole' or 'better security'.
There are two categories of people for any feature like this.
My point that there is a demand to use bpf for non-root and CAP_NET_ADMIN
level of privileges is acceptable.
Another option is to relax all of bpf to CAP_NET_ADMIN instead of CAP_SYS_ADMIN.
But CAP_BPF is clearly better way.

> My suggestions upthread for incrementally making bpf() depend less on
> privilege would accomplish this goal.

As I pointed out countless times it would make the system overall _less_ secure.
One of the goals here is to do sysctl kernel.unprivileged_bpf_disabled=1 to
make it _more_ secure.


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Alexei Starovoitov @ 2019-08-15 23:46 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andy Lutomirski, Song Liu, Networking, bpf, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team, Lorenz Bauer, Jann Horn, Greg KH,
	Linux API, LSM List
In-Reply-To: <201908151203.FE87970@keescook>

On Thu, Aug 15, 2019 at 12:46:41PM -0700, Kees Cook wrote:
> On Tue, Aug 13, 2019 at 02:58:25PM -0700, Alexei Starovoitov wrote:
> > agree that containers (namespaces) reduce amount of trust necessary
> > for apps to run, but the end goal is not security though.
> 
> Unsurprisingly, I totally disagree: this is the very definition of
> improved "security": reduced attack surface, confined trust, etc.

there are different ways to define the meaning of the word "security".
Of course containers reduce attack surface, etc.
The 'attack surface' as a mitigation from malicious users is not always the goal
of running containers. Ask yourself why containers are used in the datacenters
where only root can ssh into a server, only signed packages can
ever be installed, no browsers running, and no remote code is ever downloaded?

> > Linux has become a single user system.
> 
> I hope this is just hyperbole, because it's not true in reality. I agree
> that the vast majority of Linux devices are single-user-at-a-time
> systems now (rather than the "shell servers" of yore), but the system
> still has to be expected to confine users from each other, root, and the
> hardware. Switching users on Chrome OS or a distro laptop, etc is still
> very much expected to _mean_ something.

of course.

> 
> > If user can ssh into the host they can become root.
> > If arbitrary code can run on the host it will be break out of any sandbox.
> > Containers are not providing the level of security that is enough
> > to run arbitrary code. VMs can do it better, but cpu bugs don't make it easy.
> 
> I'm not sure why you draw the line for VMs -- they're just as buggy
> as anything else. Regardless, I reject this line of thinking: yes,
> all software is buggy, but that isn't a reason to give up.

hmm. are you saying you want kernel community to work towards
making containers (namespaces) being able to run arbitrary code
downloaded from the internet?
In other words the problems solved by user space sandboxing, gvisor, etc
should be solved by the kernel?
I really don't think it's a good idea.

> If you look at software safety as a binary, you will always be
> disappointed. If you look at it as it manifests in the real world,
> then there is some perspective to be had. Reachability of flaws becomes
> a major factor; exploit chain length becomes a factor. There are very
> real impacts to be had from security hardening, sandboxing, etc. Of
> course nothing is perfect, but the current state of the world isn't
> as you describe. (And I say this with the knowledge of how long
> the lifetime of bugs are in the kernel.)

No arguing here. Security today is mainly the number of layers.
Hardening at all levels, sanboxing do help.
namespaces is one of the layers provided by the kernel.
The point that the kernel did its job already.
All other security layers are in user space.
Looking for bugs at every layer is still must have.
In the kernel, systemd, qemu, OS, browsers, etc.
Containers provide one level of security. VMs have another.

> > Some people call it more 'secure', but it's clearly not secure for
> > arbitrary code
> 
> Perhaps it's just a language issue. "More secure" and "safer" mean
> mostly the same thing to me. I tend to think "safer" is actually
> a superset that includes things that wreck the user experience but
> aren't actually in the privilege manipulation realm. In the traditional
> "security" triad of confidentiality, integrity, and availability, I tend
> to weigh availability less highly, but a bug that stops someone from
> doing their work but doesn't wreck data, let them switch users, etc,
> is still considered a "security" issue by many folks. The fewer bugs
> someone is exposed to improves their security, safety, whatever. The
> easiest way to do that is confinement and its associated attack surface
> reduction. tl;dr: security and safety are very use-case-specific
> continuum, not a binary state.

yep

> 
> > When we say 'unprivileged bpf' we really mean arbitrary malicious bpf program.
> > It's been a constant source of pain. The constant blinding, randomization,
> > verifier speculative analysis, all spectre v1, v2, v4 mitigations
> > are simply not worth it. It's a lot of complex kernel code without users.
> > There is not a single use case to allow arbitrary malicious bpf
> > program to be loaded and executed.
> 
> The world isn't binary (safe code/malicious code), and we need to build
> systems that can be used safely even when things go wrong. Yes, probably
> no one has a system that _intentionally_ feeds eBPF into the kernel from
> a web form. But there is probably someone who does it unintentionally,
> or has a user login exposed on a system where unpriv BPF is enabled. The
> point is to create primitives as safely as possible so when things DO
> go wrong, they fail safe instead of making things worse.
> 
> I'm all for a "less privileged than root" API for eBPF, but I get worried
> when I see "security" being treated as a binary state. Especially when
> it is considered an always-failed state. :)

'security as always failed state' ? hmm.
not sure where this impression came from.
One of the goals here is to do sysctl kernel.unprivileged_bpf_disabled=1
which will make the system overall _more_ secure.
I hope we can agree on that.


^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-16  0:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Kees Cook, Andy Lutomirski, Song Liu, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190815234622.t65oxm5mtfzy6fhg@ast-mbp.dhcp.thefacebook.com>

> On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

>> 
>> I'm not sure why you draw the line for VMs -- they're just as buggy
>> as anything else. Regardless, I reject this line of thinking: yes,
>> all software is buggy, but that isn't a reason to give up.
> 
> hmm. are you saying you want kernel community to work towards
> making containers (namespaces) being able to run arbitrary code
> downloaded from the internet?

Yes.

As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)

To be clear, Sandstorm did not allow creation of a userns by the untrusted code, and Sandstorm would have heavily restricted bpf(), but that should only be necessary because of the possibility of kernel bugs, not because of the overall design.

Alexei, I’m trying to encourage you to aim for something even better than you have now. Right now, if you grant a user various very strong capabilities, that user’s systemd can use bpf network filters.  Your proposal would allow this with a different, but still very strong, set of capabilities. There’s nothing wrong with this per se, but I think you can aim much higher:

CAP_NET_ADMIN and your CAP_BPF both effectively allow the holder to take over the system, *by design*.  I’m suggesting that you engage the security community (Kees, myself, Aleksa, Jann, Serge, Christian, etc) to aim for something better: make it so that a normal Linux distro would be willing to relax its settings enough so that normal users can use bpf filtering in the systemd units and maybe eventually use even more bpf() capabilities. And let’s make is to that mainstream container managers (that use userns!) will be willing (as an option) to delegate bpf() to their containers. We’re happy to help design, review, and even write code, but we need you to be willing to work with us to make a design that seems like it will work and then to wait long enough to merge it for us to think about it, try to poke holes in it, and convince ourselves and each other that it has a good chance of being sound.

Obviously there will be many cases where an unprivileged program should *not* be able to use bpf() IP filtering, but let’s make it so that enabling these advanced features does not automatically give away the keys to the kingdom.

(Sandstorm still exists but is no longer as actively developed, sadly.)

^ permalink raw reply

* Re: [RFC/RFT v4 0/5] Add generic trusted keys framework/subsystem
From: Sumit Garg @ 2019-08-16  4:58 UTC (permalink / raw)
  To: Mimi Zohar
  Cc: keyrings, linux-integrity,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	linux-security-module, dhowells, Herbert Xu, davem, peterhuewe,
	jgg, jejb, Jarkko Sakkinen, Arnd Bergmann, Greg Kroah-Hartman,
	James Morris, Serge E. Hallyn, Casey Schaufler, Ard Biesheuvel,
	Daniel Thompson, Linux Kernel Mailing List,
	tee-dev @ lists . linaro . org
In-Reply-To: <1565881609.9424.7.camel@kernel.org>

On Thu, 15 Aug 2019 at 20:36, Mimi Zohar <zohar@kernel.org> wrote:
>
> On Thu, 2019-08-15 at 18:33 +0530, Sumit Garg wrote:
> > Hi Mimi,
> >
> > On Wed, 14 Aug 2019 at 18:54, Mimi Zohar <zohar@kernel.org> wrote:
> > >
> > > Hi Sumit,
> > >
> > > On Tue, 2019-08-13 at 13:22 +0530, Sumit Garg wrote:
> > > > This patch-set is an outcome of discussion here [1]. It has evolved very
> > > > much since v1 to create, consolidate and generalize trusted keys
> > > > subsystem.
> > > >
> > > > This framework has been tested with trusted keys support provided via TEE
> > > > but I wasn't able to test it with a TPM device as I don't possess one. It
> > > > would be really helpful if others could test this patch-set using a TPM
> > > > device.
> > >
> > > With the "CONFIG_HEADER_TEST" and "CONFIG_KERNEL_HEADER_TEST" config
> > > options enabled, which is required for linux-next, it fails to build.
> > >
> >
> > TBH, I wasn't aware about this test feature for headers.
>
> It's new to me too.
>
> > It looks like
> > the header which fails this test is "include/keys/trusted_tpm.h" which
> > is basically a rename of "include/keys/trusted.h" plus changes in this
> > patch-set.
> >
> > And "include/keys/trusted.h" header is already put under blacklist
> > here: "include/Kbuild +68" as it fails to build. So its that rename
> > due to which build failure is observed now.
> >
> > It seems to be an easy fix for this build failure via following changes:
> >
> > diff --git a/include/keys/trusted_tpm.h b/include/keys/trusted_tpm.h
> > index 7b593447920b..ca1bec0ef65d 100644
> > --- a/include/keys/trusted_tpm.h
> > +++ b/include/keys/trusted_tpm.h
> > @@ -2,6 +2,9 @@
> >  #ifndef __TRUSTED_TPM_H
> >  #define __TRUSTED_TPM_H
> >
> > +#include <keys/trusted-type.h>
> > +#include <linux/tpm_command.h>
> > +
> >  /* implementation specific TPM constants */
> >  #define MAX_BUF_SIZE                   1024
> >  #define TPM_GETRANDOM_SIZE             14
> >
> > So I will include above changes in this patch-set and also remove
> > "include/keys/trusted.h" header from the blacklist.
>
> That works, thanks.  With this patch set, at least the EVM trusted key
> is properly being decrypted by the encrypted key with both a TPM 1.2
> and PTT TPM 2.0.  My laptop still boots properly.  Over the weekend
> I'll try to actually review the patches.
>

Thanks Mimi for testing this patch-set.

-Sumit

> Mimi

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Song Liu @ 2019-08-16  5:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Kees Cook, Andy Lutomirski, Networking, bpf,
	Alexei Starovoitov, Daniel Borkmann, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <B0364660-AD6A-4E5C-B04F-3B6DA78B4BBE@amacapital.net>



> On Aug 15, 2019, at 5:54 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
>> On Aug 15, 2019, at 4:46 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> 
>>> 
>>> I'm not sure why you draw the line for VMs -- they're just as buggy
>>> as anything else. Regardless, I reject this line of thinking: yes,
>>> all software is buggy, but that isn't a reason to give up.
>> 
>> hmm. are you saying you want kernel community to work towards
>> making containers (namespaces) being able to run arbitrary code
>> downloaded from the internet?
> 
> Yes.
> 
> As an example, Sandstorm uses a combination of namespaces (user, network, mount, ipc) and a moderately permissive seccomp policy to run arbitrary code. Not just little snippets, either — node.js, Mongo, MySQL, Meteor, and other fairly heavyweight stacks can all run under Sandstorm, with the whole stack (database engine binaries, etc) supplied by entirely untrusted customers.  During the time Sandstorm was under active development, I can recall *one* bug that would have allowed a sandbox escape. That’s a pretty good track record.  (Also, Meltdown and Spectre, sigh.)
> 
> To be clear, Sandstorm did not allow creation of a userns by the untrusted code, and Sandstorm would have heavily restricted bpf(), but that should only be necessary because of the possibility of kernel bugs, not because of the overall design.
> 
> Alexei, I’m trying to encourage you to aim for something even better than you have now. Right now, if you grant a user various very strong capabilities, that user’s systemd can use bpf network filters.  Your proposal would allow this with a different, but still very strong, set of capabilities. There’s nothing wrong with this per se, but I think you can aim much higher:
> 
> CAP_NET_ADMIN and your CAP_BPF both effectively allow the holder to take over the system, *by design*.  I’m suggesting that you engage the security community (Kees, myself, Aleksa, Jann, Serge, Christian, etc) to aim for something better: make it so that a normal Linux distro would be willing to relax its settings enough so that normal users can use bpf filtering in the systemd units and maybe eventually use even more bpf() capabilities. And let’s make is to that mainstream container managers (that use userns!) will be willing (as an option) to delegate bpf() to their containers. We’re happy to help design, review, and even write code, but we need you to be willing to work with us to make a design that seems like it will work and then to wait long enough to merge it for us to think about it, try to poke holes in it, and convince ourselves and each other that it has a good chance of being sound.
> 
> Obviously there will be many cases where an unprivileged program should *not* be able to use bpf() IP filtering, but let’s make it so that enabling these advanced features does not automatically give away the keys to the kingdom.
> 
> (Sandstorm still exists but is no longer as actively developed, sadly.)

I am trying to understand different perspectives here. 

Disclaimer: Alexei and I both work for Facebook. But he may disagree 
with everything I am about to say below, because we haven't sync'ed 
about this for a while. :)

I think there are two types of use cases here: 

    1. CAP_BPF_ADMIN: one big key to all sys_bpf(). 
    2. CAP_BPF: subset of sys_bpf() that is safe for containers.

IIUC, currently, CAP_BPF_ADMIN is (almost) same as CAP_SYS_ADMIN. 
And there aren't many real world use cases for CAP_BPF. 

The /dev/bpf patch tries to separate CAP_BPF_ADMIN from CAP_SYS_ADMIN.
On the other hand, Andy would like to introduce CAP_BPF and build
amazing use cases around it (chicken-egg problem). 

Did I misunderstand anything?

If not, I think these two use cases do not really conflict with each
other, and we probably need both of them. Then, the next question is 
do we really need both/either of them. Maybe having two separate 
discussions would make it easier?


The following are some questions I am trying to understand for 
the two cases. 

For CAP_BPF_ADMIN (or /dev/bpf):
Can we just use CAP_NET_ADMIN? It is safer than CAP_SYS_ADMIN, and
reuse existing CAP_ should be easier than introducing a new one? 

For CAP_BPF: 
Do we really need it for the containers? Is it possible to implement 
all container use cases with SUID? At this moment, I think SUID is 
the right way to go for this use case, because this is likely to 
start with a small set of functionalities. We can introduce CAP_BPF
when the container use case is too complicated for SUID. 


I hope some of these questions/thoughts would make some sense?

Thanks,
Song

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Jordan Glover @ 2019-08-16  9:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Daniel Colascione, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Daniel Borkmann, Kernel Team,
	Lorenz Bauer, Jann Horn, Greg KH, Linux API, LSM List
In-Reply-To: <20190815230808.2o2qe7a72cwdce2m@ast-mbp.dhcp.thefacebook.com>

On Thursday, August 15, 2019 11:08 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Aug 15, 2019 at 11:36:43AM -0700, Andy Lutomirski wrote:
>
> > On Thu, Aug 15, 2019 at 10:29 AM Alexei Starovoitov
> > alexei.starovoitov@gmail.com wrote:
> >
> > > On Thu, Aug 15, 2019 at 11:24:54AM +0000, Jordan Glover wrote:
> > >
> > > > systemd --user processes aren't "less privileged". The are COMPLETELY unprivileged.
> > > > Granting them cap_bpf is the same as granting it to every other unprivileged user
> > > > process. Also unprivileged user process can start systemd --user process with any
> > > > command they like.
> > >
> > > systemd itself is trusted. It's the same binary whether it runs as pid=1
> > > or as pid=123. One of the use cases is to make IPAddressDeny= work with --user.
> > > Subset of that feature already works with AmbientCapabilities=CAP_NET_ADMIN.
> > > CAP_BPF is a natural step in the same direction.
> >
> > I have the feeling that we're somehow speaking different languages.
> > What, precisely, do you mean when you say "systemd itself is trusted"?
> > Do you mean "the administrator trusts that the /lib/systemd/systemd
> > binary is not malicious"? Do you mean "the administrator trusts that
> > the running systemd process is not malicious"?
>
> please see
> https://github.com/systemd/systemd/commit/4c1567f29aeb60a6741874bca8a8e3a0bd69ed01
> I'm not advocating for or against this approach.
> Call it 'security hole' or 'better security'.
> There are two categories of people for any feature like this.
> My point that there is a demand to use bpf for non-root and CAP_NET_ADMIN
> level of privileges is acceptable.
> Another option is to relax all of bpf to CAP_NET_ADMIN instead of CAP_SYS_ADMIN.
> But CAP_BPF is clearly better way.
>

Do you realize it's not possible to grant CAP_NET_ADMIN or any other CAP in
"systemd --user" service? Trying to do so will fail with:
"Failed to apply ambient capabilities (before UID change): Operation not permitted"

I think it's crucial to clear that point to avoid confusion in this discussion
where people are talking about different things.

On the other hand running "systemd --system" service with:

User=nobody
AmbientCapabilities=CAP_NET_ADMIN

is perfectly legit and clears some security concerns as only privileged user
can start such service.

Jordan

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Thomas Gleixner @ 2019-08-16  9:59 UTC (permalink / raw)
  To: Jordan Glover
  Cc: Alexei Starovoitov, Andy Lutomirski, Daniel Colascione, Song Liu,
	Kees Cook, Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Lorenz Bauer, Jann Horn, Greg KH, Linux API,
	LSM List
In-Reply-To: <fkD3fs46a1YnR4lh0tEG-g3tDnDcyZuzji7bAUR9wujPLLl75ZhI8Yk-H1jZpSugO7qChVeCwxAMmxLdeoF2QFS3ZzuYlh7zmeZOmhDJxww=@protonmail.ch>

On Fri, 16 Aug 2019, Jordan Glover wrote:
> "systemd --user" service? Trying to do so will fail with:
> "Failed to apply ambient capabilities (before UID change): Operation not permitted"
> 
> I think it's crucial to clear that point to avoid confusion in this discussion
> where people are talking about different things.
> 
> On the other hand running "systemd --system" service with:
> 
> User=nobody
> AmbientCapabilities=CAP_NET_ADMIN
> 
> is perfectly legit and clears some security concerns as only privileged user
> can start such service.

While we are at it, can we please stop looking at this from a systemd only
perspective. There is a world outside of systemd.

Thanks,

	tglx

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox