Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Eric W. Biederman @ 2014-11-01  1:09 UTC (permalink / raw)
  To: Aditya Kali
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
	tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	mingo-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <1414783141-6947-8-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:

> This patch enables cgroup mounting inside userns when a process
> as appropriate privileges. The cgroup filesystem mounted is
> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs api is added to lookup the
> dentry for the cgroupns-root.

There is a misdesign in this.  Because files already exist we need the
protections that are present in proc and sysfs that only allow you to
mount the filesystem if it is already mounted.  Otherwise you can wind
up mounting this cgroupfs in a chroot jail when the global root would
not like you to see it.  cgroupfs isn't as bad as proc and sys but there
is at the very least an information leak in mounting it.

Given that we are effectively performing a bind mount in this patch, and
that we need to require cgroupfs be mounted anyway (to be safe).

I don't see the point of this change.  

If we could change the set of cgroups or visible in cgroupfs I could
probably see the point.  But as it is this change seems to be pointless.

Eric


> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/kernfs.h |  2 ++
>  kernel/cgroup.c        | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..e334f45 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>  	return NULL;
>  }
>  
> +/**
> + * kernfs_make_root - create new root dentry for the given kernfs_node.
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can used used by callers which want to mount only a part of the kernfs
> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +				  struct kernfs_node *kn)
> +{
> +	struct dentry *dentry;
> +	struct inode *inode;
> +
> +	BUG_ON(sb->s_op != &kernfs_sops);
> +
> +	/* inode for the given kernfs_node should already exist. */
> +	inode = ilookup(sb, kn->ino);
> +	if (!inode) {
> +		pr_debug("kernfs: could not get inode for '");
> +		pr_cont_kernfs_path(kn);
> +		pr_cont("'.\n");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	/* instantiate and link root dentry */
> +	dentry = d_obtain_root(inode);
> +	if (!dentry) {
> +		pr_debug("kernfs: could not get dentry for '");
> +		pr_cont_kernfs_path(kn);
> +		pr_cont("'.\n");
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	/* If this is a new dentry, set it up. We need kernfs_mutex because this
> +	 * may be called by callers other than kernfs_fill_super. */
> +	mutex_lock(&kernfs_mutex);
> +	if (!dentry->d_fsdata) {
> +		kernfs_get(kn);
> +		dentry->d_fsdata = kn;
> +	} else {
> +		WARN_ON(dentry->d_fsdata != kn);
> +	}
> +	mutex_unlock(&kernfs_mutex);
> +
> +	return dentry;
> +}
> +
>  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>  {
>  	struct kernfs_super_info *info = kernfs_info(sb);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 3c2be75..b9538e0 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>  
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> +				  struct kernfs_node *kn);
>  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>  				       unsigned int flags, void *priv);
>  void kernfs_destroy_root(struct kernfs_root *root);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 7e5d597..250aaec 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>  
>  	memset(opts, 0, sizeof(*opts));
>  
> +	/* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
> +	 * namespace.
> +	 */
> +	if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
> +		opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> +	}
> +
>  	while ((token = strsep(&o, ",")) != NULL) {
>  		nr_opts++;
>  
> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>  
>  	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>  		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> -		if (nr_opts != 1) {
> +		if (nr_opts > 1) {
>  			pr_err("sane_behavior: no other mount options allowed\n");
>  			return -EINVAL;
>  		}
> @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
>  		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>  }
>  
> +struct dentry *cgroupns_get_root(struct super_block *sb,
> +				 struct cgroup_namespace *ns)
> +{
> +	struct dentry *nsdentry;
> +
> +	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
> +	return nsdentry;
> +}
> +
>  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
>  {
>  	LIST_HEAD(tmp_links);
> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>  	int ret;
>  	int i;
>  	bool new_sb;
> +	struct cgroup_namespace *ns =
> +		get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> +	/* Check if the caller has permission to mount. */
> +	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> +		put_cgroup_ns(ns);
> +		return ERR_PTR(-EPERM);
> +	}
>  
>  	/*
>  	 * The first time anyone tries to mount a cgroup, enable the list
> @@ -1817,11 +1841,28 @@ out_free:
>  	kfree(opts.release_agent);
>  	kfree(opts.name);
>  
> -	if (ret)
> +	if (ret) {
> +		put_cgroup_ns(ns);
>  		return ERR_PTR(ret);
> +	}
>  
>  	dentry = kernfs_mount(fs_type, flags, root->kf_root,
>  				CGROUP_SUPER_MAGIC, &new_sb);
> +
> +	if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
> +		/* If this mount is for the default hierarchy in non-init cgroup
> +		 * namespace, then instead of root cgroup's dentry, we return
> +		 * the dentry corresponding to the cgroupns->root_cgrp.
> +		 */
> +		if (ns != &init_cgroup_ns) {
> +			struct dentry *nsdentry;
> +
> +			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
> +			dput(dentry);
> +			dentry = nsdentry;
> +		}
> +	}
> +
>  	if (IS_ERR(dentry) || !new_sb)
>  		cgroup_put(&root->cgrp);
>  
> @@ -1834,6 +1875,7 @@ out_free:
>  		deactivate_super(pinned_sb);
>  	}
>  
> +	put_cgroup_ns(ns);
>  	return dentry;
>  }
>  
> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>  	.name = "cgroup",
>  	.mount = cgroup_mount,
>  	.kill_sb = cgroup_kill_sb,
> +	.fs_flags = FS_USERNS_MOUNT,
>  };
>  
>  static struct kobject *cgroup_kobj;

^ permalink raw reply

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Eric W. Biederman @ 2014-11-01  2:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aditya Kali, Tejun Heo, Li Zefan, Serge Hallyn,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel@vger.kernel.org,
	Linux API, Ingo Molnar, Linux Containers, Rohit Jnagal
In-Reply-To: <CALCETrXTaZ3SJ_t-gnbc93BVZXg-912NqO78kFd0Tpi-5-dZoQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>>         .name = "cgroup",
>>         .mount = cgroup_mount,
>>         .kill_sb = cgroup_kill_sb,
>> +       .fs_flags = FS_USERNS_MOUNT,
>
> Aargh, another one!  Eric, can you either ack or nack my patch?
> Because if my patch goes in, then this line may need to change.  Or
> not, but if a stable release with cgroupfs and without my patch
> happens, then we'll have an ABI break.

cgroupfs has no device nodes.  So as long as we are consistent in any
given release what happens here is orthogonal.

I don't remember if we have managed to get the original problem fixed
with the trivial backportable solution.  I think so.

My apologies for not getting to that I haven't even had time to shepherd
through the regression associated regression fix.  I probably just lock
track of them but I haven't found the Tested-By's for it yet.

Nor have I had time to dig through and figure out how to safely deal
with umount -l aka MOUNT_DETACH.

Along with the question about what to do with nodev, there is also
your patch about nosuid.

Starting in about 5 minutes I am going to be mostly offline until
sometime in the 3rd week in November as I haul all of my stuff accross
the country to someplace that actually has winter and my allergies don't
kill me.

I am going to have to review and merge a lot of code as soon as I am
back to being a programmer full time again.  There is a lot of
interesting stuff coming in right now.

Eric

^ permalink raw reply

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Andy Lutomirski @ 2014-11-01  3:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar
In-Reply-To: <87a94blj6m.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Fri, Oct 31, 2014 at 7:59 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>>>         .name = "cgroup",
>>>         .mount = cgroup_mount,
>>>         .kill_sb = cgroup_kill_sb,
>>> +       .fs_flags = FS_USERNS_MOUNT,
>>
>> Aargh, another one!  Eric, can you either ack or nack my patch?
>> Because if my patch goes in, then this line may need to change.  Or
>> not, but if a stable release with cgroupfs and without my patch
>> happens, then we'll have an ABI break.
>
> cgroupfs has no device nodes.  So as long as we are consistent in any
> given release what happens here is orthogonal.
>
> I don't remember if we have managed to get the original problem fixed
> with the trivial backportable solution.  I think so.

I don't remember.  I think the problem is still there, since I think
my patch still applies, and my patch conflicts with your fix.  It's
been long enough that I'm not sure it's worth applying your patch as
an interim fix.

>
> My apologies for not getting to that I haven't even had time to shepherd
> through the regression associated regression fix.  I probably just lock
> track of them but I haven't found the Tested-By's for it yet.

No worries.  I've tested it, but it's my patch, so there's a big grain
of salt there.  I think Serge tested it, too.

>
> Nor have I had time to dig through and figure out how to safely deal
> with umount -l aka MOUNT_DETACH.

If you're talking about the do_remount_sb thing, that's already in Linus' tree.

>
> Along with the question about what to do with nodev, there is also
> your patch about nosuid.

The nosuid patch has a couple versions, and I'm not sure which version
I prefer.  It's certainly debatable.

>
> Starting in about 5 minutes I am going to be mostly offline until
> sometime in the 3rd week in November as I haul all of my stuff accross
> the country to someplace that actually has winter and my allergies don't
> kill me.

Have fun!

--Andy

>
> I am going to have to review and merge a lot of code as soon as I am
> back to being a programmer full time again.  There is a lot of
> interesting stuff coming in right now.
>
> Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: kdbus: add code to gather metadata
From: Daniel Mack @ 2014-11-01 11:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Greg Kroah-Hartman, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, John Stultz,
	Arnd Bergmann, Tejun Heo, Marcel Holtmann, Ryan Lortie,
	Bastien Nocera, David Herrmann, Djalal Harouni, Simon McVittie,
	alban.crequy, Javier Martinez Canillas, Tom Gundersen
In-Reply-To: <CALCETrV6MLYUQN6mqZbH=FrLyrETVoemtdC05po8+X=6SKQ70A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Hi Andy,

On 10/30/2014 10:01 PM, Andy Lutomirski wrote:
> On Thu, Oct 30, 2014 at 8:54 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:

>> We merely allow user that are connected to a bus to query the
>> credentials of the creator of the bus they're connected to.
> 
> Why do you allow this?  What purpose does it serve?  Is that idea that
> each application will own one bus?  If so, what goes wrong if you only
> capture the specific credentials that the creator of a given bus asks
> to have captured?

There are different kinds of buses. There is the system bus and a number
of user buses. It's really useful to be able to identify the user who
owns one of these user buses, for the sake of access control. More
specifically, we have a compatibility service called "bus-proxy" that
speaks the old D-Bus socket protocol on one side and translates all
messages to kdbus messages onto the other. For that, it needs to enforce
the old D-Bus access control semantics, which is described in XML, and
has quite elaborate checks. In order to enforce it, it's relevant to be
able to compare peer credentials with bus owner credentials, because
there's usually a rule that the bus owner UID is allowed more than other
peers.

Sure, there are other ways to figure out the identity of the bus, but
it's really nice to have similar semantics for identifying the bus
owner. kdbus internally has that piece of information anyway, so we
decided to export it, optionally. However, that's really a minor detail
after all.

>> For example, if a privileged service can reboot the system, it checks
>> whether the asking peer has CAP_SYS_BOOT set. If it checks for uid==0
>> instead, and it works in your tests because you happen to test as root,
>> that just a bug in the service, right? But I might have missed your point.
> 
> The issue is the following: if have the privilege needed to talk to
> journald, I may want to enhance security by opening a connection to
> journald (and capture that privilege) and then drop privilege.  I
> should still be able to talk to journald.

Hmm, this is not how D-Bus works, and kdbus stays close the design
principles of D-Bus. There's no concept of 'opening connections to
services'. You just connect to a bus, and then on that bus, you send
individual messages and method calls to other services.

The design of Binder and D-Bus are fundamentally different in that
regard. On D-Bus, the focus is really about method call transaction (a
method call message and a corresponding reply message), and there's no
way to continuously reference a peer via the concept of a 'connection'.

This is why we have this functionality of passing over the caller creds
every time a method call is made. The focus is really on the individual
method call transaction, each one is individually routed, dispatched and
checked for permission. Hence, it should carry individual credential
information from the time the call is issued.

So, back to your example: you cannot 'open a connection to journald'.
You can only connect to a bus and send messages to journald.

> Alternatively, if the privilege needed to reboot is CAP_SYS_BOOT, then
> clients should send that capability bit.  Capturing extra information
> to try to give daemons the flexibility to change their authorization
> conditions later on just moves the problem if you need to change
> policy down the line.

This would be similar to changing the Linux kernel so that each system
call gets a set of capabilities passed in explicitly. RPC calls are very
much comparable to syscalls, though they don't transition into kernel
space, but simply into another process.

We've been augmenting what syscalls check on for access control ever
since. Initially, it check was UID based, then became capability based,
and nowadays we have a concept of MACs, that are actually different on
every system, because some systems use SElinux, some SMACK or some other
MAC. This is why this metadata should be implicit and controlled by the
receiver, not by the sender, because the implementation of the policy
might change eventually, be extended with more sophisticated access
control etc.

But it's not only about access control, there's also auditing. In order
to generate useful audit logs, a system service that is offering
privileged operation to certain clients needs to know the audit
credentials (sessionid and loginuid). Hence, this information needs to
be implicitly appended, controlled by that service. Because if it is not
implicitly appended, than it will more often be missing than expected
(simply because in real-life very few people actually use auditing), and
the system service would not be able to log about it.

So, this is not a metadata leak by accident but metadata that system
services need to know about in order to work properly. Individual
services will require slightly different components of these
credentials, but if you combine things, they need to know pretty much
all of the details we currently offer as implied metadata.

The system bus is about unprivileged apps asking system services for
system operations, in which case the system services must have a way to
know who wants them to do what. The purpose of a system bus is _not_ to
allow unprivileged peers to talk to each other, that's actually even
forbidden in the default policy. That's what user buses are for, and on
those, pretty much every client will have the same privileges anyway,
hence there's no information leak there either.

> What's the actual problem for logging?  I very much understand why a
> logging service never wants to log an incorrect credential (and legacy
> syslog has serious problems here because it doesn't even try to
> capture credential), but what's wrong with having a log that shows the
> uid for legit log messages and that reliably says "declined to state"
> for messages that decline to state.

A system's administrator should be able to gather all sorts of
information about things that happened on a system. Trying to hide
associated metadata is not how things are done anyway - we show it in
numerous places, in /proc, in SCM_CREDENTIALS, by listing /tmp or /home etc.

The separation to limit what is passed around is, in our concept, rather
on the level of connecting to separate buses, PID namespaces, kdbus
domains etc, than to suppress information.

> (Also, I presume that cmdline is for logging.  Keep in mind that the
> cmdline is yanked from user memory and can be freely spoofed.)

Sure. But If a task did that, we want to know about it, and log it
accordingly. Journald provides such features already today, and it's a
great deal in detecting runtime inconsistencies between a task's real
nature and what is displayed in 'ps'.

Thanks,
Daniel

^ permalink raw reply

* Re: kdbus: add code to gather metadata
From: Andy Lutomirski @ 2014-11-01 16:19 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Greg Kroah-Hartman, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, John Stultz,
	Arnd Bergmann, Tejun Heo, Marcel Holtmann, Ryan Lortie,
	Bastien Nocera, David Herrmann, Djalal Harouni, Simon McVittie,
	alban.crequy, Javier Martinez Canillas, Tom Gundersen
In-Reply-To: <5454BE6E.5040507-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

On Sat, Nov 1, 2014 at 4:05 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
> Hi Andy,
>
> On 10/30/2014 10:01 PM, Andy Lutomirski wrote:
>> On Thu, Oct 30, 2014 at 8:54 AM, Daniel Mack <daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org> wrote:
>
>>> We merely allow user that are connected to a bus to query the
>>> credentials of the creator of the bus they're connected to.
>>
>> Why do you allow this?  What purpose does it serve?  Is that idea that
>> each application will own one bus?  If so, what goes wrong if you only
>> capture the specific credentials that the creator of a given bus asks
>> to have captured?
>
> There are different kinds of buses. There is the system bus and a number
> of user buses. It's really useful to be able to identify the user who
> owns one of these user buses, for the sake of access control. More
> specifically, we have a compatibility service called "bus-proxy" that
> speaks the old D-Bus socket protocol on one side and translates all
> messages to kdbus messages onto the other. For that, it needs to enforce
> the old D-Bus access control semantics, which is described in XML, and
> has quite elaborate checks. In order to enforce it, it's relevant to be
> able to compare peer credentials with bus owner credentials, because
> there's usually a rule that the bus owner UID is allowed more than other
> peers.
>
> Sure, there are other ways to figure out the identity of the bus, but
> it's really nice to have similar semantics for identifying the bus
> owner. kdbus internally has that piece of information anyway, so we
> decided to export it, optionally. However, that's really a minor detail
> after all.
>

I'm sceptical about the kernel offering APIs just because it can.  I'm
not fundamentally opposed to objects (e.g. busses) having ownership
information, but I think it needs to be well-justified.

Keep in mind that the kernel *has* a concept of ownership.  It's uid,
gid, and security label.  Having the creator's full set of caps and
even *command line* as part of the ownership information is really
weird.

>>> For example, if a privileged service can reboot the system, it checks
>>> whether the asking peer has CAP_SYS_BOOT set. If it checks for uid==0
>>> instead, and it works in your tests because you happen to test as root,
>>> that just a bug in the service, right? But I might have missed your point.
>>
>> The issue is the following: if have the privilege needed to talk to
>> journald, I may want to enhance security by opening a connection to
>> journald (and capture that privilege) and then drop privilege.  I
>> should still be able to talk to journald.
>
> Hmm, this is not how D-Bus works, and kdbus stays close the design
> principles of D-Bus. There's no concept of 'opening connections to
> services'. You just connect to a bus, and then on that bus, you send
> individual messages and method calls to other services.
>
> The design of Binder and D-Bus are fundamentally different in that
> regard. On D-Bus, the focus is really about method call transaction (a
> method call message and a corresponding reply message), and there's no
> way to continuously reference a peer via the concept of a 'connection'.
>
> This is why we have this functionality of passing over the caller creds
> every time a method call is made. The focus is really on the individual
> method call transaction, each one is individually routed, dispatched and
> checked for permission. Hence, it should carry individual credential
> information from the time the call is issued.
>
> So, back to your example: you cannot 'open a connection to journald'.
> You can only connect to a bus and send messages to journald.

But you can open a kdbus fd.  IMO there should be some actual thought
as to what should happen if that credentials change after the fd is
opened.  The default answer in UNIX is that credentials are only
checked at open time.  Violating that should have a good reason.

>
>> Alternatively, if the privilege needed to reboot is CAP_SYS_BOOT, then
>> clients should send that capability bit.  Capturing extra information
>> to try to give daemons the flexibility to change their authorization
>> conditions later on just moves the problem if you need to change
>> policy down the line.
>
> This would be similar to changing the Linux kernel so that each system
> call gets a set of capabilities passed in explicitly. RPC calls are very
> much comparable to syscalls, though they don't transition into kernel
> space, but simply into another process.

That would be a fantastic idea, and, in fact, it would have been
vastly better if syscalls had always worked that way.  It's how
anything on a network *has* to work, it's how object capability
systems work (and object capability systems are much less prone to
security bugs caused by overcomplicated implicit checks), and it's how
Capsicum (which builds on POSIX!) works.

>
> We've been augmenting what syscalls check on for access control ever
> since. Initially, it check was UID based, then became capability based,
> and nowadays we have a concept of MACs, that are actually different on
> every system, because some systems use SElinux, some SMACK or some other
> MAC. This is why this metadata should be implicit and controlled by the
> receiver, not by the sender, because the implementation of the policy
> might change eventually, be extended with more sophisticated access
> control etc.

I agree that this is convenient, but it's a hack, and we've only
gotten away with it so far because only something that's very much in
the trusted computing base can see this information at all.  And I
strongly suspect that, if I were inclined to try to break SELinux, I
could find any number of rather fundamental holes based on the fact
that it adds checks to capabilities in places where they didn't exist
originally.

>
> But it's not only about access control, there's also auditing. In order
> to generate useful audit logs, a system service that is offering
> privileged operation to certain clients needs to know the audit
> credentials (sessionid and loginuid). Hence, this information needs to
> be implicitly appended, controlled by that service. Because if it is not
> implicitly appended, than it will more often be missing than expected
> (simply because in real-life very few people actually use auditing), and
> the system service would not be able to log about it.
>
> So, this is not a metadata leak by accident but metadata that system
> services need to know about in order to work properly. Individual
> services will require slightly different components of these
> credentials, but if you combine things, they need to know pretty much
> all of the details we currently offer as implied metadata.

Sorry, but this is bogus.  The audit system is a bit of an unstable
mess, and, to all appearances, its main design goal seems to be
regulatory compliance instead of actual security.  It seems to be
mostly harmless, but only because all of the data gathered by auditd
that should never have existed in the first place can only be
collected by an extremely privileged global daemon and is shoved in
logs that no one outside the TCB can read.

If you want a configurable-out off-by-default bolt-on system that
allows auditd (and nothing else!) to sniff kdbus busses and log
whatever random crap it wants to log (and keep it the hell out of the
part of the journal that unprivileged users can see), then do so.

>
> The system bus is about unprivileged apps asking system services for
> system operations, in which case the system services must have a way to
> know who wants them to do what. The purpose of a system bus is _not_ to
> allow unprivileged peers to talk to each other, that's actually even
> forbidden in the default policy. That's what user buses are for, and on
> those, pretty much every client will have the same privileges anyway,
> hence there's no information leak there either.

This is nonsense.  If you don't need real access control on user
busses, then *don't add any*.

Except that you *do* need real access control, because of containers,
seccomp, the Chromium sandbox, the upcoming Firefox sandbox, all of
the other little sandboxes that should exist but don't yet, remoting,
etc.  And I guarantee that your list of implicitly transmitted
credentials will fail to handle this case, and this case is the only
one that really matters.  At the end of the day, you're going to need
either crypto, something like object capabilities, or an access
broker.  And none of those benefit at all from the current proposed
model.  (And none of them check anything like implicit credentials at
the time of an RPC call, either.)

Interestingly, it sounds to be like traditional D-Bus will work
considerably better than kdbus in this scenario.

As a concrete example, Wayland takes sensitive operations like
screenshots very seriously.  Which of the kdbus metadata items would
be appropriate to use for access control for something like that?

>
>> What's the actual problem for logging?  I very much understand why a
>> logging service never wants to log an incorrect credential (and legacy
>> syslog has serious problems here because it doesn't even try to
>> capture credential), but what's wrong with having a log that shows the
>> uid for legit log messages and that reliably says "declined to state"
>> for messages that decline to state.
>
> A system's administrator should be able to gather all sorts of
> information about things that happened on a system. Trying to hide
> associated metadata is not how things are done anyway - we show it in
> numerous places, in /proc, in SCM_CREDENTIALS, by listing /tmp or /home etc.

See above.  The sysadmin != every single kdbus user.

>> (Also, I presume that cmdline is for logging.  Keep in mind that the
>> cmdline is yanked from user memory and can be freely spoofed.)
>
> Sure. But If a task did that, we want to know about it, and log it
> accordingly. Journald provides such features already today, and it's a
> great deal in detecting runtime inconsistencies between a task's real
> nature and what is displayed in 'ps'.
>

Sorry, but this is a bit off the deep end.  The kdbus logging
mechanism shouldn't be in the business of trying to do this, because
it's out of scope, and it will fail.

You can't justify logging fundamentally unverifiable things like the
command line by saying that you want to know if someone tries to play
(impossible-to-reliably-detect) games to obscure their command line.

--Andy

^ permalink raw reply

* Re: [PATCH net-next v4 0/4] netns: allow to identify peer netns
From: David Miller @ 2014-11-01 21:08 UTC (permalink / raw)
  To: ebiederm-aS9lmoZGLiVWk0Htik3J/w
  Cc: nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	luto-kltTT9wpgjJwATOyAt5JVQ, cwang-xCSkyg8dI+0RB7SZvlqPiA
In-Reply-To: <871tpph03k.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

From: ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman)
Date: Thu, 30 Oct 2014 11:41:03 -0700

> Nicolas Dichtel <nicolas.dichtel-pdR9zngts4EAvxtiuMwx3w@public.gmane.org> writes:
> 
>> The goal of this serie is to be able to multicast netlink messages with an
>> attribute that identify a peer netns.
>> This is needed by the userland to interpret some informations contained in
>> netlink messages (like IFLA_LINK value, but also some other attributes in case
>> of x-netns netdevice (see also
>> http://thread.gmane.org/gmane.linux.network/315933/focus=316064 and
>> http://thread.gmane.org/gmane.linux.kernel.containers/28301/focus=4239)).
>>
>> Ids of peer netns are set by userland via a new genl messages. These ids are
>> stored per netns and are local (ie only valid in the netns where they are set).
>> To avoid allocating an int for each peer netns, I use idr_for_each() to retrieve
>> the id of a peer netns. Note that it will be possible to add a table (struct net
>> -> id) later to optimize this lookup if needed.
>>
>> Patch 1/4 introduces the netlink API mechanism to set and get these ids.
>> Patch 2/4 and 3/4 implements an example of how to use these ids in rtnetlink
>> messages. And patch 4/4 shows that the netlink messages can be symetric between
>> a GET and a SET.
>>
>> iproute2 patches are available, I can send them on demand.
> 
> A quick reply.  I think this patchset is in the right general direction.
> There are some oddball details that seem odd/awkward to me such as using
> genetlink instead of rtnetlink to get and set the ids, and not having
> ids if they are not set (that feels like a maintenance/usability challenge).
> 
> I would like to give your patches a deep review, but I won't be able to
> do that for a couple of weeks.  I am deep in the process of moving,
> and will be mostly offline until about the Nov 11th.

I'm going to mark this patch set 'deferred' in patchwork until things
move forward.

Thanks.

^ permalink raw reply

* Re: [PATCH 00/12] Add kdbus implementation
From: Greg Kroah-Hartman @ 2014-11-02  1:21 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, arnd-r2nGTMty4D4,
	tj-DgEjT+Ai2ygdnm+yROfE0A, marcel-kz+m5ild9QBg9hUCZPvPmw,
	desrt-0xnayjDhYQY, hadess-0MeiytkfxGOsTnJN9+BGXg,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, teg-B22kvLQNl6c
In-Reply-To: <alpine.LRH.2.00.1410292354480.11562-1ReQVI26iDCaZKY3DrU6dA@public.gmane.org>

On Thu, Oct 30, 2014 at 12:00:16AM +0100, Jiri Kosina wrote:
> On Wed, 29 Oct 2014, Greg Kroah-Hartman wrote:
> 
> > kdbus is a kernel-level IPC implementation that aims for resemblance to
> > the the protocol layer with the existing userspace D-Bus daemon while
> > enabling some features that couldn't be implemented before in userspace.
> 
> I'd be interested in the features that can't be implemented in userspace 
> (and therefore would justify existence of kdbus in the kernel). Could you 
> please point me to such list / documentation?
> 
> It seems to me that most of the highlight features from the cover letter 
> can be "easily" (for certain definition of that word, of course) 
> implemented in userspace (vmsplice(), sending fd through unix socket, user 
> namespaces, UUID management, etc).

Sorry for the long delay in getting back to this, I'm battling a bad
case of jet-lag at the moment...

Here's some reasons why I feel it is better to have kdbus in the kernel
rather than trying to implement the same thing in a userspace daemon:

- performance: fewer process context switches, fewer copies, fewer
  syscalls, larger memory chunks via memfd.  This is really important
  for a whole class of userspace programs that are ported from other
  operating systems that are run on tiny ARM systems that rely on
  hundreds of thousands of messages passed at boot time, and at
  "critical" times in their user interaction loops.
- security: the peers which communicate do not have to trust each other,
  as the only trustworthy compoenent in the game is the kernel which
  adds metadata and ensures that all data passed as payload is either
  copied or sealed, so that the receiver can parse the data without
  having to protect against changing memory while parsing buffers.  Also,
  all the data transfer is controlled by the kernel, so that LSMs can
  track and control what is going on, without involving userspace.
  Because of the LSM issue, security people are much happier with this
  model than the current scheme of having to hook into dbus to mediate
  things.
- more metadata can be attached to messages than in userspace
- semantics for apps with heavy data payloads (media apps, for instance)
  with optinal priority message dequeuing, and global message ordering.
  Some "crazy" people are playing with using kdbus for audio data in the
  system.  I'm not saying that this is the best model for this, but
  until now, there wasn't any other way to do this without having to
  create custom "busses", one for each application library.
- being in the kernle closes a lot of races which can't be fixed with
  the current userspace solutions.  For example, with kdbus, there is a
  way a client can disconnect from a bus, but do so only if no further
  messages present in its queue, which is crucial for implementing
  race-free "exit-on-idle" services
- eavesdropping on the kernel level, so privileged users can hook into
  the message stream without hacking support for that into their
  userspace processes
- a number of smaller benefits: for example kdbus learned a way to peek
  full messages without dequeing them, which is really useful for
  logging metadata when handling bus-activation requests. 

Of course, some of the bits above could be implemented in userspace
alone, for example with more sophisticated memory management APIs, but
this is usually done by losing out on the other details.  For example,
for many of the memory management APIs, it's hard to not require the
communicating peers to fully trust each other.  And we _really_ don't
want peers to have to trust each other.

Another benefit of having this in the kernel, rather than as a userspace
daemon, is that you can now easily use the bus from the initrd, or up to
the very end when the system shuts down.  On current userspace D-Bus,
this is not really possible, as this requires passing the bus instance
around between initrd and the "real" system.  Such a transition of all
fds also requires keeping full state of what has already been read from
the connection fds.  kdbus makes this much simpler, as we can change the
ownership of the bus, just by passing one fd over from one part to the
other.

Regarding binder: binder and kdbus follow very different design
concepts.  Binder implies the use of thread-pools to dispatch incoming
method calls.  This is a very efficient scheme, and completely natural
in programming languages like Java.  On most Linux programs, however,
there's a much stronger focus on central poll() loops that dispatch all
sources a program cares about.  kdbus is much more usable in such
environments, as it doesn't enforce a threading model, and it is happy
with serialized dispatching.  In fact, this major difference had an
effect on much of the design decisions: binder does not guarantee global
message ordering due to the parallel dispatching in the thread-pools,
but  kdbus does.  Moreover, there's also a difference in the way message
handling.  In kdbus, every message is basically taken and dispatched as
one blob, while in binder, continious connections to other peers are
created, which are then used to send messages on.  Hence, the models are
quite different, and they serve different needs.  I believe that the
D-Bus/kdbus model is more compatible and friendly with how Linux
programs are usually implemented.  I went into the kdbus vs. binder
stuff in a blog post that I linked to earlier in this thread that goes
into more detail here.

Hopefully this helps explain why I feel kdbus should be in the kernel
and not a userspace daemon.  I'll put this information in the cover
letter for the next round of patches that are sent out.

thanks,

greg k-h

^ permalink raw reply

* Re: kdbus: add documentation
From: Greg Kroah-Hartman @ 2014-11-02  1:29 UTC (permalink / raw)
  To: Peter Meerwald
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <alpine.DEB.2.02.1410301231040.32212-jW+XmwGofnusTnJN9+BGXg@public.gmane.org>

On Thu, Oct 30, 2014 at 01:20:23PM +0100, Peter Meerwald wrote:
> 
> > kdbus is a system for low-latency, low-overhead, easy to use
> > interprocess communication (IPC).
> > 
> > The interface to all functions in this driver is implemented through ioctls
> > on /dev nodes.  This patch adds detailed documentation about the kernel
> > level API design.
> 
> just some typos below

<snip>

Many thanks for the fixes, I've made them all to the file now, it will
show up in the next version we send out.

greg k-h

^ permalink raw reply

* Re: [PATCH 1/2] seccomp.2: document seccomp syscall
From: Michael Kerrisk (man-pages) @ 2014-11-02 16:36 UTC (permalink / raw)
  To: Kees Cook
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1411685267-27949-2-git-send-email-keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>

Hi Kees,

On 09/26/2014 12:47 AM, Kees Cook wrote:
> Combines documentation from prctl, in-kernel seccomp_filter.txt and
> dropper.c, along with details specific to the new syscall.

I am working on integrating this page at the moment, and I'll have
an edited draft for you sometime soon, I hope.

In the meantime, I have a question. Could you show a sample run
of the example program given in the man page? In my attempts so far,
I always get EINVAL from seccomp(). Obviously, I am missing something.

Cheers,

Michael

 
> Signed-off-by: Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> ---
>  man2/seccomp.2 | 400 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 400 insertions(+)
>  create mode 100644 man2/seccomp.2
> 
> diff --git a/man2/seccomp.2 b/man2/seccomp.2
> new file mode 100644
> index 0000000..f64950f
> --- /dev/null
> +++ b/man2/seccomp.2
> @@ -0,0 +1,400 @@
> +.\" Copyright (C) 2014 Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> +.\" and Copyright (C) 2012 Will Drewry <wad-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org>
> +.\" and Copyright (C) 2008 Michael Kerrisk <mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein.  The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +seccomp \-
> +operate on Secure Computing state of the process
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/seccomp.h>
> +.B #include <linux/filter.h>
> +.B #include <linux/audit.h>
> +.B #include <linux/signal.h>
> +.B #include <sys/ptrace.h>
> +
> +.BI "int seccomp(unsigned int " operation ", unsigned int " flags ,
> +.BI "            void *" args );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR seccomp ()
> +system call operates on the Secure Computing (seccomp) state of the
> +current process.
> +
> +Currently, Linux supports the following
> +.IR operation
> +values:
> +.TP
> +.BR SECCOMP_SET_MODE_STRICT
> +Only system calls that the thread is permitted to make are
> +.BR read (2),
> +.BR write (2),
> +.BR _exit (2),
> +and
> +.BR sigreturn (2).
> +Other system calls result in the delivery of a
> +.BR SIGKILL
> +signal. Strict secure computing mode is useful for number-crunching
> +applications that may need to execute untrusted byte code, perhaps
> +obtained by reading from a pipe or socket.
> +
> +This operation is available only if the kernel is configured with
> +.BR CONFIG_SECCOMP
> +enabled.
> +
> +The value of
> +.IR flags
> +must be 0, and
> +.IR args
> +must be NULL.
> +
> +This operation is functionally identical to calling
> +.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_STRICT)" .
> +.TP
> +.BR SECCOMP_SET_MODE_FILTER
> +The system calls allowed are defined by a pointer to a Berkeley Packet
> +Filter (BPF) passed via
> +.IR args .
> +This argument is a pointer to
> +.IR "struct\ sock_fprog" ;
> +it can be designed to filter arbitrary system calls and system call
> +arguments. If the filter is invalid, the call will fail, returning
> +.BR EACCESS
> +in
> +.IR errno .
> +
> +If
> +.BR fork (2),
> +.BR clone (2),
> +or
> +.BR execve (2)
> +are allowed by the filter, any child processes will be constrained to
> +the same filters and system calls as the parent.
> +
> +Prior to using this operation, the process must call
> +.IR "prctl(PR_SET_NO_NEW_PRIVS,\ 1)"
> +or run with
> +.BR CAP_SYS_ADMIN
> +privileges in its namespace. If these are not true, the call will fail
> +and return
> +.BR EACCES
> +in
> +.IR errno .
> +This requirement ensures that filter programs cannot be applied to child
> +processes with greater privileges than the process that installed them.
> +
> +Additionally, if
> +.BR prctl (2)
> +or
> +.BR seccomp (2)
> +is allowed by the attached filter, additional filters may be layered on
> +which will increase evaluation time, but allow for further reduction of
> +the attack surface during execution of a process.
> +
> +This operation is available only if the kernel is configured with
> +.BR CONFIG_SECCOMP_FILTER
> +enabled.
> +
> +When
> +.IR flags
> +are 0, this operation is functionally identical to calling
> +.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_FILTER,\ args)" .
> +
> +The recognized
> +.IR flags
> +are:
> +.RS
> +.TP
> +.BR SECCOMP_FILTER_FLAG_TSYNC
> +When adding a new filter, synchronize all other threads of the current
> +process to the same seccomp filter tree. If any thread cannot do this,
> +the call will not attach the new seccomp filter, and will fail returning
> +the first thread ID found that cannot synchronize.  Synchronization will
> +fail if another thread is in
> +.BR SECCOMP_MODE_STRICT
> +or if it has attached new seccomp filters to itself, diverging from the
> +calling thread's filter tree.
> +.RE
> +.SH FILTERS
> +When adding filters via
> +.BR SECCOMP_SET_MODE_FILTER ,
> +.IR args
> +points to a filter program:
> +
> +.in +4n
> +.nf
> +struct sock_fprog {
> +    unsigned short      len;    /* Number of BPF instructions */
> +    struct sock_filter *filter;
> +};
> +.fi
> +.in
> +
> +Each program must contain one or more BPF instructions:
> +
> +.in +4n
> +.nf
> +struct sock_filter {    /* Filter block */
> +    __u16   code;       /* Actual filter code */
> +    __u8    jt;         /* Jump true */
> +    __u8    jf;         /* Jump false */
> +    __u32   k;          /* Generic multiuse field */
> +};
> +.fi
> +.in
> +
> +When executing the instructions, the BPF program executes over the
> +syscall information made available via:
> +
> +.in +4n
> +.nf
> +struct seccomp_data {
> +    int nr;                     /* system call number */
> +    __u32 arch;                 /* AUDIT_ARCH_* value */
> +    __u64 instruction_pointer;  /* CPU instruction pointer */
> +    __u64 args[6];              /* up to 6 system call arguments */
> +};
> +.fi
> +.in
> +
> +A seccomp filter may return any of the following values. If multiple
> +filters exist, the return value for the evaluation of a given system
> +call will always use the highest precedent value. (For example,
> +.BR SECCOMP_RET_KILL
> +will always take precedence.)
> +
> +In precedence order, they are:
> +.TP
> +.BR SECCOMP_RET_KILL
> +Results in the task exiting immediately without executing the
> +system call.  The exit status of the task (status & 0x7f) will
> +be
> +.BR SIGSYS ,
> +not
> +.BR SIGKILL .
> +.TP
> +.BR SECCOMP_RET_TRAP
> +Results in the kernel sending a
> +.BR SIGSYS
> +signal to the triggering task without executing the system call.
> +.IR siginfo\->si_call_addr
> +will show the address of the system call instruction, and
> +.IR siginfo\->si_syscall
> +and
> +.IR siginfo\->si_arch
> +will indicate which syscall was attempted.  The program counter will be
> +as though the syscall happened (i.e. it will not point to the syscall
> +instruction).  The return value register will contain an arch\-dependent
> +value; if resuming execution, set it to something sensible.
> +(The architecture dependency is because replacing it with
> +.BR ENOSYS
> +could overwrite some useful information.)
> +
> +The
> +.BR SECCOMP_RET_DATA
> +portion of the return value will be passed as
> +.IR si_errno .
> +
> +.BR SIGSYS
> +triggered by seccomp will have a
> +.IR si_code
> +of
> +.BR SYS_SECCOMP .
> +.TP
> +.BR SECCOMP_RET_ERRNO
> +Results in the lower 16-bits of the return value being passed
> +to userland as the
> +.IR errno
> +without executing the system call.
> +.TP
> +.BR SECCOMP_RET_TRACE
> +When returned, this value will cause the kernel to attempt to
> +notify a ptrace()-based tracer prior to executing the system
> +call.  If there is no tracer present,
> +.BR ENOSYS
> +is returned to userland and the system call is not executed.
> +
> +A tracer will be notified if it requests
> +.BR PTRACE_O_TRACESECCOMP
> +using
> +.IR ptrace(PTRACE_SETOPTIONS) .
> +The tracer will be notified of a
> +.BR PTRACE_EVENT_SECCOMP
> +and the
> +.BR SECCOMP_RET_DATA
> +portion of the BPF program return value will be available to the tracer
> +via
> +.BR PTRACE_GETEVENTMSG .
> +
> +The tracer can skip the system call by changing the syscall number
> +to \-1.  Alternatively, the tracer can change the system call
> +requested by changing the system call to a valid syscall number.  If
> +the tracer asks to skip the system call, then the system call will
> +appear to return the value that the tracer puts in the return value
> +register.
> +
> +The seccomp check will not be run again after the tracer is
> +notified.  (This means that seccomp-based sandboxes MUST NOT
> +allow use of ptrace, even of other sandboxed processes, without
> +extreme care; ptracers can use this mechanism to escape.)
> +.TP
> +.BR SECCOMP_RET_ALLOW
> +Results in the system call being executed.
> +
> +If multiple filters exist, the return value for the evaluation of a
> +given system call will always use the highest precedent value.
> +
> +Precedence is only determined using the
> +.BR SECCOMP_RET_ACTION
> +mask.  When multiple filters return values of the same precedence,
> +only the
> +.BR SECCOMP_RET_DATA
> +from the most recently installed filter will be returned.
> +.SH RETURN VALUE
> +On success,
> +.BR seccomp ()
> +returns 0.
> +On error, if
> +.BR SECCOMP_FILTER_FLAG_TSYNC
> +was used, the return value is the thread ID that caused the
> +synchronization failure. On other errors, \-1 is returned, and
> +.IR errno
> +is set to indicate the cause of the error.
> +.SH ERRORS
> +.BR seccomp ()
> +can fail for the following reasons:
> +.TP
> +.BR EACCESS
> +the caller did not have the
> +.BR CAP_SYS_ADMIN
> +capability, or had not set
> +.IR no_new_privs
> +before using
> +.BR SECCOMP_SET_MODE_FILTER .
> +.TP
> +.BR EFAULT
> +.IR args
> +was required to be a valid address.
> +.TP
> +.BR EINVAL
> +.IR operation
> +is unknown; or
> +.IR flags
> +are invalid for the given
> +.IR operation
> +.TP
> +.BR ESRCH
> +Another thread caused a failure during thread sync, but its ID could not
> +be determined.
> +.SH VERSIONS
> +This system call first appeared in Linux 3.16.
> +.\" FIXME Add glibc version
> +.SH CONFORMING TO
> +This system call is a nonstandard Linux extension.
> +.SH NOTES
> +.BR seccomp ()
> +provides a superset of the functionality provided by
> +.IR PR_SET_SECCOMP
> +of
> +.BR prctl (2) .
> +(Which does not support
> +.IR flags .)
> +.SH EXAMPLE
> +.nf
> +#include <errno.h>
> +#include <stddef.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <linux/audit.h>
> +#include <linux/filter.h>
> +#include <linux/seccomp.h>
> +#include <sys/prctl.h>
> +
> +static int install_filter(int syscall, int arch, int error)
> +{
> +    struct sock_filter filter[] = {
> +        /* Load architecture. */
> +        BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                 (offsetof(struct seccomp_data, arch))),
> +        /* Jump forward 4 instructions on architecture mismatch. */
> +        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, arch, 0, 4),
> +        /* Load syscall number. */
> +        BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                 (offsetof(struct seccomp_data, nr))),
> +        /* Jump forward 1 instruction on syscall mismatch. */
> +        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, syscall, 0, 1),
> +        /* Matching arch and syscall: return specific errno. */
> +        BPF_STMT(BPF_RET+BPF_K,
> +                 SECCOMP_RET_ERRNO|(error & SECCOMP_RET_DATA)),
> +        /* Destination of syscall mismatch: Allow other syscalls. */
> +        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +        /* Destination of arch mismatch: Kill process. */
> +        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
> +    };
> +    struct sock_fprog prog = {
> +        .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
> +        .filter = filter,
> +    };
> +    if (seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog)) {
> +        perror("seccomp");
> +        return EXIT_FAILURE;
> +    }
> +    return EXIT_SUCCESS;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +    if (argc < 5) {
> +        fprintf(stderr, "Usage:\\n"
> +                "refuse <syscall_nr> <arch> <errno> <prog> [<args>]\\n"
> +                "Hint:  AUDIT_ARCH_I386: 0x%X\\n"
> +                "       AUDIT_ARCH_X86_64: 0x%X\\n"
> +                "\\n", AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
> +        return EXIT_FAILURE;
> +    }
> +    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
> +        perror("prctl");
> +        return EXIT_FAILURE;
> +    }
> +    if (install_filter(strtol(argv[1], NULL, 0),
> +                       strtol(argv[2], NULL, 0),
> +                       strtol(argv[3], NULL, 0)))
> +        return EXIT_FAILURE;
> +    execv(argv[4], &argv[4]);
> +    perror("execv");
> +    return EXIT_FAILURE;
> +}
> +.fi
> +.SH SEE ALSO
> +.ad l
> +.nh
> +.BR prctl (2),
> +.BR ptrace (2),
> +.BR signal (7),
> +.BR socket (7)
> +.ad
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH 1/2] seccomp.2: document seccomp syscall
From: Michael Kerrisk (man-pages) @ 2014-11-02 16:37 UTC (permalink / raw)
  To: Kees Cook, Andy Lutomirski
  Cc: mtk.manpages-Re5JQEeQqe8AvxtiuMwx3w, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <CAGXu5jJEn5QZPrVa4UjuT34PVuadhXqSb_d=o3W04QvZq-fMNw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 09/26/2014 07:55 AM, Kees Cook wrote:
> On Thu, Sep 25, 2014 at 6:29 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Thu, Sep 25, 2014 at 3:47 PM, Kees Cook <keescook-F7+t8E8rja9g9hUCZPvPmw@public.gmane.org> wrote:
>>> +.SH VERSIONS
>>> +This system call first appeared in Linux 3.16.
>>> +.\" FIXME Add glibc version
>>
>> 3.17?  (And remove FIXME?)
>>
>> Otherwise I like it.
> 
> Ooops, yes: 3.17. Michael do you want me to resend with that
> corrected, or do you want to handle fixing that from this version?

No need. I've fixed that.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: [PATCH 1/2] seccomp.2: document seccomp syscall
From: Michael Kerrisk (man-pages) @ 2014-11-02 17:10 UTC (permalink / raw)
  To: Kees Cook; +Cc: mtk.manpages, linux-api, linux-kernel
In-Reply-To: <54565DA4.50408@gmail.com>

On 11/02/2014 05:36 PM, Michael Kerrisk (man-pages) wrote:
> Hi Kees,
> 
> On 09/26/2014 12:47 AM, Kees Cook wrote:
>> Combines documentation from prctl, in-kernel seccomp_filter.txt and
>> dropper.c, along with details specific to the new syscall.
> 
> I am working on integrating this page at the moment, and I'll have
> an edited draft for you sometime soon, I hope.
> 
> In the meantime, I have a question. Could you show a sample run
> of the example program given in the man page? In my attempts so far,
> I always get EINVAL from seccomp(). Obviously, I am missing something.

Don't worry -- I found the problem. Silly error on my part. (I got the 
syscall number wrong, when rolling my wrappers.)

Cheers,

Michael


>> Signed-off-by: Kees Cook <keescook@chromium.org>
>> ---
>>  man2/seccomp.2 | 400 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 400 insertions(+)
>>  create mode 100644 man2/seccomp.2
>>
>> diff --git a/man2/seccomp.2 b/man2/seccomp.2
>> new file mode 100644
>> index 0000000..f64950f
>> --- /dev/null
>> +++ b/man2/seccomp.2
>> @@ -0,0 +1,400 @@
>> +.\" Copyright (C) 2014 Kees Cook <keescook@chromium.org>
>> +.\" and Copyright (C) 2012 Will Drewry <wad@chromium.org>
>> +.\" and Copyright (C) 2008 Michael Kerrisk <mtk.manpages@gmail.com>
>> +.\"
>> +.\" %%%LICENSE_START(VERBATIM)
>> +.\" Permission is granted to make and distribute verbatim copies of this
>> +.\" manual provided the copyright notice and this permission notice are
>> +.\" preserved on all copies.
>> +.\"
>> +.\" Permission is granted to copy and distribute modified versions of this
>> +.\" manual under the conditions for verbatim copying, provided that the
>> +.\" entire resulting derived work is distributed under the terms of a
>> +.\" permission notice identical to this one.
>> +.\"
>> +.\" Since the Linux kernel and libraries are constantly changing, this
>> +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
>> +.\" responsibility for errors or omissions, or for damages resulting from
>> +.\" the use of the information contained herein.  The author(s) may not
>> +.\" have taken the same level of care in the production of this manual,
>> +.\" which is licensed free of charge, as they might when working
>> +.\" professionally.
>> +.\"
>> +.\" Formatted or processed versions of this manual, if unaccompanied by
>> +.\" the source, must acknowledge the copyright and authors of this work.
>> +.\" %%%LICENSE_END
>> +.\"
>> +.TH SECCOMP 2 2014-06-23 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +seccomp \-
>> +operate on Secure Computing state of the process
>> +.SH SYNOPSIS
>> +.nf
>> +.B #include <linux/seccomp.h>
>> +.B #include <linux/filter.h>
>> +.B #include <linux/audit.h>
>> +.B #include <linux/signal.h>
>> +.B #include <sys/ptrace.h>
>> +
>> +.BI "int seccomp(unsigned int " operation ", unsigned int " flags ,
>> +.BI "            void *" args );
>> +.fi
>> +.SH DESCRIPTION
>> +The
>> +.BR seccomp ()
>> +system call operates on the Secure Computing (seccomp) state of the
>> +current process.
>> +
>> +Currently, Linux supports the following
>> +.IR operation
>> +values:
>> +.TP
>> +.BR SECCOMP_SET_MODE_STRICT
>> +Only system calls that the thread is permitted to make are
>> +.BR read (2),
>> +.BR write (2),
>> +.BR _exit (2),
>> +and
>> +.BR sigreturn (2).
>> +Other system calls result in the delivery of a
>> +.BR SIGKILL
>> +signal. Strict secure computing mode is useful for number-crunching
>> +applications that may need to execute untrusted byte code, perhaps
>> +obtained by reading from a pipe or socket.
>> +
>> +This operation is available only if the kernel is configured with
>> +.BR CONFIG_SECCOMP
>> +enabled.
>> +
>> +The value of
>> +.IR flags
>> +must be 0, and
>> +.IR args
>> +must be NULL.
>> +
>> +This operation is functionally identical to calling
>> +.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_STRICT)" .
>> +.TP
>> +.BR SECCOMP_SET_MODE_FILTER
>> +The system calls allowed are defined by a pointer to a Berkeley Packet
>> +Filter (BPF) passed via
>> +.IR args .
>> +This argument is a pointer to
>> +.IR "struct\ sock_fprog" ;
>> +it can be designed to filter arbitrary system calls and system call
>> +arguments. If the filter is invalid, the call will fail, returning
>> +.BR EACCESS
>> +in
>> +.IR errno .
>> +
>> +If
>> +.BR fork (2),
>> +.BR clone (2),
>> +or
>> +.BR execve (2)
>> +are allowed by the filter, any child processes will be constrained to
>> +the same filters and system calls as the parent.
>> +
>> +Prior to using this operation, the process must call
>> +.IR "prctl(PR_SET_NO_NEW_PRIVS,\ 1)"
>> +or run with
>> +.BR CAP_SYS_ADMIN
>> +privileges in its namespace. If these are not true, the call will fail
>> +and return
>> +.BR EACCES
>> +in
>> +.IR errno .
>> +This requirement ensures that filter programs cannot be applied to child
>> +processes with greater privileges than the process that installed them.
>> +
>> +Additionally, if
>> +.BR prctl (2)
>> +or
>> +.BR seccomp (2)
>> +is allowed by the attached filter, additional filters may be layered on
>> +which will increase evaluation time, but allow for further reduction of
>> +the attack surface during execution of a process.
>> +
>> +This operation is available only if the kernel is configured with
>> +.BR CONFIG_SECCOMP_FILTER
>> +enabled.
>> +
>> +When
>> +.IR flags
>> +are 0, this operation is functionally identical to calling
>> +.IR "prctl(PR_SET_SECCOMP,\ SECCOMP_MODE_FILTER,\ args)" .
>> +
>> +The recognized
>> +.IR flags
>> +are:
>> +.RS
>> +.TP
>> +.BR SECCOMP_FILTER_FLAG_TSYNC
>> +When adding a new filter, synchronize all other threads of the current
>> +process to the same seccomp filter tree. If any thread cannot do this,
>> +the call will not attach the new seccomp filter, and will fail returning
>> +the first thread ID found that cannot synchronize.  Synchronization will
>> +fail if another thread is in
>> +.BR SECCOMP_MODE_STRICT
>> +or if it has attached new seccomp filters to itself, diverging from the
>> +calling thread's filter tree.
>> +.RE
>> +.SH FILTERS
>> +When adding filters via
>> +.BR SECCOMP_SET_MODE_FILTER ,
>> +.IR args
>> +points to a filter program:
>> +
>> +.in +4n
>> +.nf
>> +struct sock_fprog {
>> +    unsigned short      len;    /* Number of BPF instructions */
>> +    struct sock_filter *filter;
>> +};
>> +.fi
>> +.in
>> +
>> +Each program must contain one or more BPF instructions:
>> +
>> +.in +4n
>> +.nf
>> +struct sock_filter {    /* Filter block */
>> +    __u16   code;       /* Actual filter code */
>> +    __u8    jt;         /* Jump true */
>> +    __u8    jf;         /* Jump false */
>> +    __u32   k;          /* Generic multiuse field */
>> +};
>> +.fi
>> +.in
>> +
>> +When executing the instructions, the BPF program executes over the
>> +syscall information made available via:
>> +
>> +.in +4n
>> +.nf
>> +struct seccomp_data {
>> +    int nr;                     /* system call number */
>> +    __u32 arch;                 /* AUDIT_ARCH_* value */
>> +    __u64 instruction_pointer;  /* CPU instruction pointer */
>> +    __u64 args[6];              /* up to 6 system call arguments */
>> +};
>> +.fi
>> +.in
>> +
>> +A seccomp filter may return any of the following values. If multiple
>> +filters exist, the return value for the evaluation of a given system
>> +call will always use the highest precedent value. (For example,
>> +.BR SECCOMP_RET_KILL
>> +will always take precedence.)
>> +
>> +In precedence order, they are:
>> +.TP
>> +.BR SECCOMP_RET_KILL
>> +Results in the task exiting immediately without executing the
>> +system call.  The exit status of the task (status & 0x7f) will
>> +be
>> +.BR SIGSYS ,
>> +not
>> +.BR SIGKILL .
>> +.TP
>> +.BR SECCOMP_RET_TRAP
>> +Results in the kernel sending a
>> +.BR SIGSYS
>> +signal to the triggering task without executing the system call.
>> +.IR siginfo\->si_call_addr
>> +will show the address of the system call instruction, and
>> +.IR siginfo\->si_syscall
>> +and
>> +.IR siginfo\->si_arch
>> +will indicate which syscall was attempted.  The program counter will be
>> +as though the syscall happened (i.e. it will not point to the syscall
>> +instruction).  The return value register will contain an arch\-dependent
>> +value; if resuming execution, set it to something sensible.
>> +(The architecture dependency is because replacing it with
>> +.BR ENOSYS
>> +could overwrite some useful information.)
>> +
>> +The
>> +.BR SECCOMP_RET_DATA
>> +portion of the return value will be passed as
>> +.IR si_errno .
>> +
>> +.BR SIGSYS
>> +triggered by seccomp will have a
>> +.IR si_code
>> +of
>> +.BR SYS_SECCOMP .
>> +.TP
>> +.BR SECCOMP_RET_ERRNO
>> +Results in the lower 16-bits of the return value being passed
>> +to userland as the
>> +.IR errno
>> +without executing the system call.
>> +.TP
>> +.BR SECCOMP_RET_TRACE
>> +When returned, this value will cause the kernel to attempt to
>> +notify a ptrace()-based tracer prior to executing the system
>> +call.  If there is no tracer present,
>> +.BR ENOSYS
>> +is returned to userland and the system call is not executed.
>> +
>> +A tracer will be notified if it requests
>> +.BR PTRACE_O_TRACESECCOMP
>> +using
>> +.IR ptrace(PTRACE_SETOPTIONS) .
>> +The tracer will be notified of a
>> +.BR PTRACE_EVENT_SECCOMP
>> +and the
>> +.BR SECCOMP_RET_DATA
>> +portion of the BPF program return value will be available to the tracer
>> +via
>> +.BR PTRACE_GETEVENTMSG .
>> +
>> +The tracer can skip the system call by changing the syscall number
>> +to \-1.  Alternatively, the tracer can change the system call
>> +requested by changing the system call to a valid syscall number.  If
>> +the tracer asks to skip the system call, then the system call will
>> +appear to return the value that the tracer puts in the return value
>> +register.
>> +
>> +The seccomp check will not be run again after the tracer is
>> +notified.  (This means that seccomp-based sandboxes MUST NOT
>> +allow use of ptrace, even of other sandboxed processes, without
>> +extreme care; ptracers can use this mechanism to escape.)
>> +.TP
>> +.BR SECCOMP_RET_ALLOW
>> +Results in the system call being executed.
>> +
>> +If multiple filters exist, the return value for the evaluation of a
>> +given system call will always use the highest precedent value.
>> +
>> +Precedence is only determined using the
>> +.BR SECCOMP_RET_ACTION
>> +mask.  When multiple filters return values of the same precedence,
>> +only the
>> +.BR SECCOMP_RET_DATA
>> +from the most recently installed filter will be returned.
>> +.SH RETURN VALUE
>> +On success,
>> +.BR seccomp ()
>> +returns 0.
>> +On error, if
>> +.BR SECCOMP_FILTER_FLAG_TSYNC
>> +was used, the return value is the thread ID that caused the
>> +synchronization failure. On other errors, \-1 is returned, and
>> +.IR errno
>> +is set to indicate the cause of the error.
>> +.SH ERRORS
>> +.BR seccomp ()
>> +can fail for the following reasons:
>> +.TP
>> +.BR EACCESS
>> +the caller did not have the
>> +.BR CAP_SYS_ADMIN
>> +capability, or had not set
>> +.IR no_new_privs
>> +before using
>> +.BR SECCOMP_SET_MODE_FILTER .
>> +.TP
>> +.BR EFAULT
>> +.IR args
>> +was required to be a valid address.
>> +.TP
>> +.BR EINVAL
>> +.IR operation
>> +is unknown; or
>> +.IR flags
>> +are invalid for the given
>> +.IR operation
>> +.TP
>> +.BR ESRCH
>> +Another thread caused a failure during thread sync, but its ID could not
>> +be determined.
>> +.SH VERSIONS
>> +This system call first appeared in Linux 3.16.
>> +.\" FIXME Add glibc version
>> +.SH CONFORMING TO
>> +This system call is a nonstandard Linux extension.
>> +.SH NOTES
>> +.BR seccomp ()
>> +provides a superset of the functionality provided by
>> +.IR PR_SET_SECCOMP
>> +of
>> +.BR prctl (2) .
>> +(Which does not support
>> +.IR flags .)
>> +.SH EXAMPLE
>> +.nf
>> +#include <errno.h>
>> +#include <stddef.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <unistd.h>
>> +#include <linux/audit.h>
>> +#include <linux/filter.h>
>> +#include <linux/seccomp.h>
>> +#include <sys/prctl.h>
>> +
>> +static int install_filter(int syscall, int arch, int error)
>> +{
>> +    struct sock_filter filter[] = {
>> +        /* Load architecture. */
>> +        BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
>> +                 (offsetof(struct seccomp_data, arch))),
>> +        /* Jump forward 4 instructions on architecture mismatch. */
>> +        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, arch, 0, 4),
>> +        /* Load syscall number. */
>> +        BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
>> +                 (offsetof(struct seccomp_data, nr))),
>> +        /* Jump forward 1 instruction on syscall mismatch. */
>> +        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, syscall, 0, 1),
>> +        /* Matching arch and syscall: return specific errno. */
>> +        BPF_STMT(BPF_RET+BPF_K,
>> +                 SECCOMP_RET_ERRNO|(error & SECCOMP_RET_DATA)),
>> +        /* Destination of syscall mismatch: Allow other syscalls. */
>> +        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
>> +        /* Destination of arch mismatch: Kill process. */
>> +        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
>> +    };
>> +    struct sock_fprog prog = {
>> +        .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
>> +        .filter = filter,
>> +    };
>> +    if (seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog)) {
>> +        perror("seccomp");
>> +        return EXIT_FAILURE;
>> +    }
>> +    return EXIT_SUCCESS;
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +    if (argc < 5) {
>> +        fprintf(stderr, "Usage:\\n"
>> +                "refuse <syscall_nr> <arch> <errno> <prog> [<args>]\\n"
>> +                "Hint:  AUDIT_ARCH_I386: 0x%X\\n"
>> +                "       AUDIT_ARCH_X86_64: 0x%X\\n"
>> +                "\\n", AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
>> +        return EXIT_FAILURE;
>> +    }
>> +    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
>> +        perror("prctl");
>> +        return EXIT_FAILURE;
>> +    }
>> +    if (install_filter(strtol(argv[1], NULL, 0),
>> +                       strtol(argv[2], NULL, 0),
>> +                       strtol(argv[3], NULL, 0)))
>> +        return EXIT_FAILURE;
>> +    execv(argv[4], &argv[4]);
>> +    perror("execv");
>> +    return EXIT_FAILURE;
>> +}
>> +.fi
>> +.SH SEE ALSO
>> +.ad l
>> +.nh
>> +.BR prctl (2),
>> +.BR ptrace (2),
>> +.BR signal (7),
>> +.BR socket (7)
>> +.ad
>>
> 
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply

* Re: kdbus namespaces etc
From: Andy Lutomirski @ 2014-11-02 17:13 UTC (permalink / raw)
  To: Lennart Poettering, linux-kernel@vger.kernel.org, Linux API
  Cc: Daniel Mack, Tom Gundersen, Greg KH, Djalal Harouni
In-Reply-To: <20141102165122.GA23921@gardel-login>

[adding lkml and linux-api because this is quite helpful]

On Sun, Nov 2, 2014 at 8:51 AM, Lennart Poettering
<lennart@poettering.net> wrote:
> On Sat, 01.11.14 11:14, Daniel Mack (daniel@zonque.org) wrote:
>
>> On 10/31/2014 04:07 PM, Andy Lutomirski wrote:
>> > There are two major issues, I think.
>> >
>> > The easy one is the metadata thing.  I think that just using the
>> > standard in-kernel APIs and translating when metadata is sent back to
>> > userspace will work.  So, if you store kuid_t, struct pid *, etc, and
>> > remove ns_eq, everything should just work (with the caveat that, in
>> > some circumstances, certain metadata items may be untranslatable).
>> >
>> > The much harder one is kdbus domains.  The basic model for namespaces
>> > is that you can unshare a user namespace (if you want) and then
>> > unshare everything else and set up whatever lives in your container.
>> > So there really should be some way to make that work with kdbus,
>> > especially since Bastien Nocera has kdbus on his wish list as a thing
>> > to make containers work better.
>
> Note that we are working with Bastien all the time, we try to keep him
> in the loop of things anyway, his wishes shouldn't be too far off from
> what we are working on with kdbus.
>
>> > That means that it should be possible to create a kdbus domain from
>> > inside a userns, which means you can't check any global privilege.
>> > This could be done by adding syscalls for kdbus, by creating a
>> > kdbusfs, or possibly even keeping the current device node based
>> > system.  The latter seems likely to be a mess, though, and you'd need
>> > to come up with some sensible semantics for how everything will fit
>> > together, and you'll be up against the weird consideration that using
>> > device nodes more or less enforces a hierarchy of domains when there
>> > really doesn't seem to be anything hierarchical about them.
>
> A couple of things to point out regarding
> namespaces/conainers/sandboxing and kbdus:
>
> a) kdbus is not a generic IPC to use between multiple OS
>    containers. Instead there's just a "system bus" for the OS plus a
>    number of "user busses", one for each user. The system bus is where
>    unprivileged programs talk to system services, possibly requesting
>    priviliged operations that way. The user bus is where unpriviliged
>    user programs talk to user services of the same user. And that's
>    really it. It's not a protocol to talk between OS containers or
>    anything like that. It has a very well defined focus, and that's
>    what we develop it for (now, we can of course extend the scope one
>    day, but for now, let's keep in focus the existing usecase. I mean,
>    there's a reason we didn't call it "kbus", but "kdbus", because we
>    actually focus on the classic dbus usecase, and not something
>    that'd be more generic than that.)
>
> b) To allow multiple OS containers to run in parallel we devised the
>    kdbus "domains" concept: each OS conainer gets its own set of
>    device nodes for its busses, completely isolated from the other
>    domains. While the naming scheme is hierarchial, the "domains" are
>    otherwise completely disconnected, and there's no effective
>    hierarchal structure between them, because there's no structure at
>    all between them, except for the naming. The domain concept is
>    simple: whenever you create a new domain, you get a subdir in your
>    /dev/kdbus which you then mount over the containers /dev/kdbus and
>    so on.
>
> c) To allow sandbox-like filtering containers, where a service runs on
>    a host but only sees a smaller "namespace" of user and system
>    services we came up with the "custom endpoint" concept: a sandboxed
>    app or service gets its own "alias" device nodes for the
>    user/system busses, that have some additional policy applied, which
>    can hide services or make them inaccessible.

I'll try to find the docs for this.  It sounds potentially quite helpful.

>
> We currently have not played around with userns stuff to allow
> creation of unpriviliged domains, but opening this up is not too hard,
> it simply requires us to weaken the permission checks and enforcing
> some minimal naming rules to avoid domain name clashes.

This is IMO not correct.  Linux namespaces have survived this long
without having names, for good reason: it avoids ever dealing with how
to name them.

And just loosening the permissions allows anyone to pollute their
kdbus domain hierarchy, and, possibly worse, make those names visible
outside their new domain.  This is done for the questionable gain of
using device nodes.

>
> Translating credentials is not a priority for us really, as kdbus is
> not an IPC to use between completely different OS containers that have
> different user lists and process lists. Again, we can widen the scope
> one day, but for now we decided to go the safe route: we will suppress
> the creds if we they cannot be mapped. If one day we want to allow
> inter-namespace communication with properly translated creds then we
> can revisit this of course, but for now, simply suppressing them is
> good enough.

But it *is* intended to be used between app containers.  Given that
this is an explicit design goal, I think that someone should really
clarify how the design is compatible with the goal.  "Systemd can do
it" may or may not be a true statement, but it isn't a useful
statement for reviewers.

>
> Note that kdbus is explicitly *not* just an IPC primitive like AF_UNIX
> sockets are. While you use AF_UNIX to build all kinds of communication
> schemes, kdbus comes with a very clear usage scheme: the system and
> the user busses, and nothing else. Hence, because AF_UNIX as IPC
> primitive is so much more generic, covering the namespace translation
> logic for AF_UNIX from day 1 was essential, but this is different for
> kdbus, which is strictly used in one way so far, and intra-container
> communication is not it.
>
> Note that with suppressing the metadata for now for intra-container
> communication we leave a nice avenue open to later on turn this on,

You don't have this avenue to open it up later.  Someone will,
correctly, rely on this suppression to provide anonymity, and, when
you turn it off, you will introduce security holes.

> as we can always add new stuff later on without breaking compat. It
> would be much harder if we let the bits through, but in a broken way
> or in a way that we'd have to change later on.
>
> Also note that the current kdbus client code in systemd already makes
> use of both "domains" and "custom end points" for
> containerization/sandboxing purposes. systemd's "nspawn" tool (which
> implements a minimal LXC-like container manager, that "just works",
> and needs no configuration) already implicitly sets up kdbus domains,
> so that we can make sure the domains concept works nicely and can
> later-on be adopted by LXC, libvirt-lxc, docker, ... too.

nspawn is a great development tool.  It's not such a good test bed for
more complicated use cases, especially since it appears to completely
lack user namespace support.

> In fact, we
> even tested systemd-nspawn recursively, in order to make sure that
> kdbus domains can be corretcly stacked, and do the right thing
> then). Also, systemd's service logic is already able to lock arbitrary
> services into kdbus sandboxes, enforcing much stricter access rights
> on specific services than the usual generic user-id based policy.

How does that work?  What is systemd doing to prevent containerized
things from seeing the full view of kdbus?

>
> Making the whole credential passing stuff opt-in-by-reciever rather
> than opt-in-by-sender is btw also the right thing, because we know
> exactly what dbus is used for, we have a very clear usecase, since
> dbus is already so well established. And for the usecases it has
> (system bus as place where apps talk to system services plus user bus
> as place where apps talk to user services owned by the same user), we
> hence know that it really should be the receiver which decides, since
> it needs to make auth decisions, needs to generate log and audit
> records, and so on.

My media player does not need to generate audit records.  Nor does my
screensaver, and, for that matter, nor do most genuine system
services.

--Andy

(quoting continued below)

>
> To summarize the above: the container usecases were a priority for us
> since day 1. With the "domains" and "custom end points" we think we
> found really convincing concepts to match the common usecases of the
> kernel's PID/UID/... namespacing functionality. We also have
> implementions of usercode ready for them to make sure things work that
> way. kdbus has a much stricter focus than AF_UNIX, as it only is used
> for system + user busses, and for that translating the metadata is not
> a priority.
>
> Anyway, so much about the background why kdbus looks the way it looks
> like. I can understand that it would be great to adapt kdbus to more
> usecases later on (for example, by making it useful for
> intra-namespace communication by doing proper translation of
> credentials, but that would probably would open entirely new cans of
> worms, since then we'd have to establish a third kind of bus really,
> the "all-container" bus that multiple containers can use to
> communicate, but that requires a ton more thinkign), but we'd really
> like to stay focuses on the immediate usecase of the current dbus.
>
> Hope this makes sense,
>
> Lennart
>
> --
> Lennart Poettering, Red Hat

^ permalink raw reply

* (unknown)
From: MRS GRACE MANDA @ 2014-11-02 19:54 UTC (permalink / raw)

In-Reply-To: <1480763910.146593.1414958012342.JavaMail.yahoo-o8Yl8dfWkGi9yjMHE8D8k1Z8N9CAUha/QQ4Iyu8u01E@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 71 bytes --]









This is Mrs Grace Manda (  Please I need your Help is Urgent). 

[-- Attachment #2: Mrs Grace Manda.rtf --]
[-- Type: application/rtf, Size: 35796 bytes --]

^ permalink raw reply

* [PATCH 0/3] fs: add O_BENEATH flag to openat(2)
From: David Drysdale @ 2014-11-03 11:48 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Alexander Viro, Kees Cook
  Cc: Greg Kroah-Hartman, Meredydd Luff, Will Drewry,
	Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell, Julien Tinnes,
	Mike Depinet, James Morris, Andy Lutomirski, Paolo Bonzini,
	Paul Moore, Christoph Hellwig, Eric W. Biederman,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA, David Drysdale

This change adds a new O_BENEATH flag for openat(2) which restricts the
provided path, rejecting (with -EACCES) paths that are not beneath
the provided dfd.

This change was previously included as part of a larger patchset
(https://lkml.org/lkml/2014/7/25/426) for Capsicum support; however, it
is potentially useful as an independent change so I've pulled it out
separately here.

In particular, various folks from Chrome[OS] have indicated an interest
in having this functionality.


Changes since the version included in the Capsicum v2 patchset:
 - Add tests of normal symlinks
 - Fix man-page typo
 - Update patch to 3.17

Changes from v1 to v2 of Capsicum patchset:
 - renamed O_BENEATH_ONLY to O_BENEATH [Christoph Hellwig]


David Drysdale (2):
  fs: add O_BENEATH flag to openat(2)
  selftests: Add test of O_BENEATH & openat(2)

 arch/alpha/include/uapi/asm/fcntl.h       |   1 +
 arch/parisc/include/uapi/asm/fcntl.h      |   1 +
 arch/sparc/include/uapi/asm/fcntl.h       |   1 +
 fs/fcntl.c                                |   5 +-
 fs/namei.c                                |  43 ++++++---
 fs/open.c                                 |   4 +-
 include/linux/namei.h                     |   1 +
 include/uapi/asm-generic/fcntl.h          |   4 +
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/openat/.gitignore |   3 +
 tools/testing/selftests/openat/Makefile   |  24 +++++
 tools/testing/selftests/openat/openat.c   | 149 ++++++++++++++++++++++++++++++
 12 files changed, 220 insertions(+), 17 deletions(-)
 create mode 100644 tools/testing/selftests/openat/.gitignore
 create mode 100644 tools/testing/selftests/openat/Makefile
 create mode 100644 tools/testing/selftests/openat/openat.c

-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply

* [PATCH 1/3] fs: add O_BENEATH flag to openat(2)
From: David Drysdale @ 2014-11-03 11:48 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Alexander Viro, Kees Cook
  Cc: Greg Kroah-Hartman, Meredydd Luff, Will Drewry,
	Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell, Julien Tinnes,
	Mike Depinet, James Morris, Andy Lutomirski, Paolo Bonzini,
	Paul Moore, Christoph Hellwig, Eric W. Biederman,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA, David Drysdale
In-Reply-To: <1415015305-15494-1-git-send-email-drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Add a new O_BENEATH flag for openat(2) which restricts the
provided path, rejecting (with -EACCES) paths that are not beneath
the provided dfd.  In particular, reject:
 - paths that contain .. components
 - paths that begin with /
 - symlinks that have paths as above.

Signed-off-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 arch/alpha/include/uapi/asm/fcntl.h  |  1 +
 arch/parisc/include/uapi/asm/fcntl.h |  1 +
 arch/sparc/include/uapi/asm/fcntl.h  |  1 +
 fs/fcntl.c                           |  5 +++--
 fs/namei.c                           | 43 ++++++++++++++++++++++++------------
 fs/open.c                            |  4 +++-
 include/linux/namei.h                |  1 +
 include/uapi/asm-generic/fcntl.h     |  4 ++++
 8 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 09f49a6b87d1..76a87038d2c1 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -33,6 +33,7 @@
 
 #define O_PATH		040000000
 #define __O_TMPFILE	0100000000
+#define O_BENEATH	0200000000	/* no / or .. in openat path */
 
 #define F_GETLK		7
 #define F_SETLK		8
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 34a46cbc76ed..3adadf72f929 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -21,6 +21,7 @@
 
 #define O_PATH		020000000
 #define __O_TMPFILE	040000000
+#define O_BENEATH	080000000	/* no / or .. in openat path */
 
 #define F_GETLK64	8
 #define F_SETLK64	9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 7e8ace5bf760..ea38f0bd6cec 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -36,6 +36,7 @@
 
 #define O_PATH		0x1000000
 #define __O_TMPFILE	0x2000000
+#define O_BENEATH	0x4000000	/* no / or .. in openat path */
 
 #define F_GETOWN	5	/*  for sockets. */
 #define F_SETOWN	6	/*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 22d1c3df61ac..c07a32efc34b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -747,14 +747,15 @@ static int __init fcntl_init(void)
 	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 	 * is defined as O_NONBLOCK on some platforms and not on others.
 	 */
-	BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
 		O_RDONLY	| O_WRONLY	| O_RDWR	|
 		O_CREAT		| O_EXCL	| O_NOCTTY	|
 		O_TRUNC		| O_APPEND	| /* O_NONBLOCK	| */
 		__O_SYNC	| O_DSYNC	| FASYNC	|
 		O_DIRECT	| O_LARGEFILE	| O_DIRECTORY	|
 		O_NOFOLLOW	| O_NOATIME	| O_CLOEXEC	|
-		__FMODE_EXEC	| O_PATH	| __O_TMPFILE
+		__FMODE_EXEC	| O_PATH	| __O_TMPFILE	|
+		O_BENEATH
 		));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
diff --git a/fs/namei.c b/fs/namei.c
index a7b05bf82d31..2fd547014b6b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -647,7 +647,7 @@ static __always_inline void set_root(struct nameidata *nd)
 	get_fs_root(current->fs, &nd->root);
 }
 
-static int link_path_walk(const char *, struct nameidata *);
+static int link_path_walk(const char *, struct nameidata *, unsigned int);
 
 static __always_inline unsigned set_root_rcu(struct nameidata *nd)
 {
@@ -819,7 +819,8 @@ static int may_linkat(struct path *link)
 }
 
 static __always_inline int
-follow_link(struct path *link, struct nameidata *nd, void **p)
+follow_link(struct path *link, struct nameidata *nd, unsigned int flags,
+	    void **p)
 {
 	struct dentry *dentry = link->dentry;
 	int error;
@@ -867,7 +868,7 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
 			nd->flags |= LOOKUP_JUMPED;
 		}
 		nd->inode = nd->path.dentry->d_inode;
-		error = link_path_walk(s, nd);
+		error = link_path_walk(s, nd, flags);
 		if (unlikely(error))
 			put_link(nd, link, *p);
 	}
@@ -1585,7 +1586,8 @@ out_err:
  * Without that kind of total limit, nasty chains of consecutive
  * symlinks can cause almost arbitrarily long lookups.
  */
-static inline int nested_symlink(struct path *path, struct nameidata *nd)
+static inline int nested_symlink(struct path *path, struct nameidata *nd,
+				 unsigned int flags)
 {
 	int res;
 
@@ -1603,7 +1605,7 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
 		struct path link = *path;
 		void *cookie;
 
-		res = follow_link(&link, nd, &cookie);
+		res = follow_link(&link, nd, flags, &cookie);
 		if (res)
 			break;
 		res = walk_component(nd, path, LOOKUP_FOLLOW);
@@ -1739,13 +1741,19 @@ static inline u64 hash_name(const char *name)
  * Returns 0 and nd will have valid dentry and mnt on success.
  * Returns error and drops reference to input namei data on failure.
  */
-static int link_path_walk(const char *name, struct nameidata *nd)
+static int link_path_walk(const char *name, struct nameidata *nd,
+			  unsigned int flags)
 {
 	struct path next;
 	int err;
 	
-	while (*name=='/')
+	while (*name == '/') {
+		if (flags & LOOKUP_BENEATH) {
+			err = -EACCES;
+			goto exit;
+		}
 		name++;
+	}
 	if (!*name)
 		return 0;
 
@@ -1764,6 +1772,10 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 		if (name[0] == '.') switch (hashlen_len(hash_len)) {
 			case 2:
 				if (name[1] == '.') {
+					if (flags & LOOKUP_BENEATH) {
+						err = -EACCES;
+						goto exit;
+					}
 					type = LAST_DOTDOT;
 					nd->flags |= LOOKUP_JUMPED;
 				}
@@ -1806,7 +1818,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 			return err;
 
 		if (err) {
-			err = nested_symlink(&next, nd);
+			err = nested_symlink(&next, nd, flags);
 			if (err)
 				return err;
 		}
@@ -1815,6 +1827,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 			break;
 		}
 	}
+exit:
 	terminate_walk(nd);
 	return err;
 }
@@ -1853,6 +1866,8 @@ static int path_init(int dfd, const char *name, unsigned int flags,
 
 	nd->m_seq = read_seqbegin(&mount_lock);
 	if (*name=='/') {
+		if (flags & LOOKUP_BENEATH)
+			return -EACCES;
 		if (flags & LOOKUP_RCU) {
 			rcu_read_lock();
 			nd->seq = set_root_rcu(nd);
@@ -1953,7 +1968,7 @@ static int path_lookupat(int dfd, const char *name,
 		return err;
 
 	current->total_link_count = 0;
-	err = link_path_walk(name, nd);
+	err = link_path_walk(name, nd, flags);
 
 	if (!err && !(flags & LOOKUP_PARENT)) {
 		err = lookup_last(nd, &path);
@@ -1964,7 +1979,7 @@ static int path_lookupat(int dfd, const char *name,
 			if (unlikely(err))
 				break;
 			nd->flags |= LOOKUP_PARENT;
-			err = follow_link(&link, nd, &cookie);
+			err = follow_link(&link, nd, flags, &cookie);
 			if (err)
 				break;
 			err = lookup_last(nd, &path);
@@ -2304,7 +2319,7 @@ path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
 		return err;
 
 	current->total_link_count = 0;
-	err = link_path_walk(name, &nd);
+	err = link_path_walk(name, &nd, flags);
 	if (err)
 		goto out;
 
@@ -2316,7 +2331,7 @@ path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
 		if (unlikely(err))
 			break;
 		nd.flags |= LOOKUP_PARENT;
-		err = follow_link(&link, &nd, &cookie);
+		err = follow_link(&link, &nd, flags, &cookie);
 		if (err)
 			break;
 		err = mountpoint_last(&nd, path);
@@ -3202,7 +3217,7 @@ static struct file *path_openat(int dfd, struct filename *pathname,
 		goto out;
 
 	current->total_link_count = 0;
-	error = link_path_walk(pathname->name, nd);
+	error = link_path_walk(pathname->name, nd, flags);
 	if (unlikely(error))
 		goto out;
 
@@ -3221,7 +3236,7 @@ static struct file *path_openat(int dfd, struct filename *pathname,
 			break;
 		nd->flags |= LOOKUP_PARENT;
 		nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
-		error = follow_link(&link, nd, &cookie);
+		error = follow_link(&link, nd, flags, &cookie);
 		if (unlikely(error))
 			break;
 		error = do_last(nd, &path, file, op, &opened, pathname);
diff --git a/fs/open.c b/fs/open.c
index d6fd3acde134..8afca5b87a0b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -874,7 +874,7 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		 * If we have O_PATH in the open flag. Then we
 		 * cannot have anything other than the below set of flags
 		 */
-		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
+		flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH | O_BENEATH;
 		acc_mode = 0;
 	} else {
 		acc_mode = MAY_OPEN | ACC_MODE(flags);
@@ -905,6 +905,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
 		lookup_flags |= LOOKUP_DIRECTORY;
 	if (!(flags & O_NOFOLLOW))
 		lookup_flags |= LOOKUP_FOLLOW;
+	if (flags & O_BENEATH)
+		lookup_flags |= LOOKUP_BENEATH;
 	op->lookup_flags = lookup_flags;
 	return 0;
 }
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 492de72560fa..bd0615d1143b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -39,6 +39,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
 #define LOOKUP_FOLLOW		0x0001
 #define LOOKUP_DIRECTORY	0x0002
 #define LOOKUP_AUTOMOUNT	0x0004
+#define LOOKUP_BENEATH		0x0008
 
 #define LOOKUP_PARENT		0x0010
 #define LOOKUP_REVAL		0x0020
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 7543b3e51331..f63aa749a4fb 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -92,6 +92,10 @@
 #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
 #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)      
 
+#ifndef O_BENEATH
+#define O_BENEATH	040000000	/* no / or .. in openat path */
+#endif
+
 #ifndef O_NDELAY
 #define O_NDELAY	O_NONBLOCK
 #endif
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* [PATCH 2/3] selftests: Add test of O_BENEATH & openat(2)
From: David Drysdale @ 2014-11-03 11:48 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Alexander Viro, Kees Cook
  Cc: Greg Kroah-Hartman, Meredydd Luff, Will Drewry,
	Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell, Julien Tinnes,
	Mike Depinet, James Morris, Andy Lutomirski, Paolo Bonzini,
	Paul Moore, Christoph Hellwig, Eric W. Biederman,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA, David Drysdale
In-Reply-To: <1415015305-15494-1-git-send-email-drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Add simple tests of openat(2) variations, including examples that
check the new O_BENEATH flag.

Signed-off-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 tools/testing/selftests/Makefile          |   1 +
 tools/testing/selftests/openat/.gitignore |   4 +
 tools/testing/selftests/openat/Makefile   |  28 ++++++
 tools/testing/selftests/openat/openat.c   | 161 ++++++++++++++++++++++++++++++
 4 files changed, 194 insertions(+)
 create mode 100644 tools/testing/selftests/openat/.gitignore
 create mode 100644 tools/testing/selftests/openat/Makefile
 create mode 100644 tools/testing/selftests/openat/openat.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 36ff2e4c7b6f..812e973233d2 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -14,6 +14,7 @@ TARGETS += powerpc
 TARGETS += user
 TARGETS += sysctl
 TARGETS += firmware
+TARGETS += openat
 
 TARGETS_HOTPLUG = cpu-hotplug
 TARGETS_HOTPLUG += memory-hotplug
diff --git a/tools/testing/selftests/openat/.gitignore b/tools/testing/selftests/openat/.gitignore
new file mode 100644
index 000000000000..835b2dcd8678
--- /dev/null
+++ b/tools/testing/selftests/openat/.gitignore
@@ -0,0 +1,4 @@
+openat
+subdir
+topfile
+symlinkdown
\ No newline at end of file
diff --git a/tools/testing/selftests/openat/Makefile b/tools/testing/selftests/openat/Makefile
new file mode 100644
index 000000000000..84cd06e7ee82
--- /dev/null
+++ b/tools/testing/selftests/openat/Makefile
@@ -0,0 +1,28 @@
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+BINARIES = openat
+DEPS = subdir topfile symlinkdown subdir/bottomfile subdir/symlinkup subdir/symlinkout subdir/symlinkin
+all: $(BINARIES) $(DEPS)
+
+subdir:
+	mkdir -p subdir
+topfile:
+	echo 0123456789 > $@
+subdir/bottomfile: | subdir
+	echo 0123456789 > $@
+subdir/symlinkup: | subdir
+	ln -s ../topfile $@
+subdir/symlinkout: | subdir
+	ln -s /etc/passwd $@
+subdir/symlinkin: | subdir
+	ln -s bottomfile $@
+symlinkdown:
+	ln -s subdir/bottomfile $@
+%: %.c
+	$(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+	./openat
+
+clean:
+	rm -rf $(BINARIES) $(DEPS)
diff --git a/tools/testing/selftests/openat/openat.c b/tools/testing/selftests/openat/openat.c
new file mode 100644
index 000000000000..61acfb53442e
--- /dev/null
+++ b/tools/testing/selftests/openat/openat.c
@@ -0,0 +1,161 @@
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+
+#include <linux/fcntl.h>
+
+/* Bypass glibc */
+static int openat_(int dirfd, const char *pathname, int flags)
+{
+	return syscall(__NR_openat, dirfd, pathname, flags);
+}
+
+static int openat_or_die(int dfd, const char *path, int flags)
+{
+	int fd = openat_(dfd, path, flags);
+
+	if (fd < 0) {
+		printf("Failed to openat(%d, '%s'); "
+			"check prerequisites are available\n", dfd, path);
+		exit(1);
+	}
+	return fd;
+}
+
+static int check_openat(int dfd, const char *path, int flags)
+{
+	int rc;
+	int fd;
+	char buffer[4];
+
+	errno = 0;
+	printf("Check success of openat(%d, '%s', %x)... ",
+	       dfd, path?:"(null)", flags);
+	fd = openat_(dfd, path, flags);
+	if (fd < 0) {
+		printf("[FAIL]: openat() failed, rc=%d errno=%d (%s)\n",
+			fd, errno, strerror(errno));
+		return 1;
+	}
+	errno = 0;
+	rc = read(fd, buffer, sizeof(buffer));
+	if (rc < 0) {
+		printf("[FAIL]: read() failed, rc=%d errno=%d (%s)\n",
+			rc, errno, strerror(errno));
+		return 1;
+	}
+	close(fd);
+	printf("[OK]\n");
+	return 0;
+}
+
+#define check_openat_fail(dfd, path, flags, errno)	\
+	_check_openat_fail(dfd, path, flags, errno, #errno)
+static int _check_openat_fail(int dfd, const char *path, int flags,
+			      int expected_errno, const char *errno_str)
+{
+	int rc;
+
+	errno = 0;
+	printf("Check failure of openat(%d, '%s', %x) with %s... ",
+		dfd, path?:"(null)", flags, errno_str);
+	rc = openat_(dfd, path, flags);
+	if (rc > 0) {
+		printf("[FAIL] (unexpected success from openat(2))\n");
+		close(rc);
+		return 1;
+	}
+	if (errno != expected_errno) {
+		printf("[FAIL] (expected errno %d (%s) not %d (%s)\n",
+			expected_errno, strerror(expected_errno),
+			errno, strerror(errno));
+		return 1;
+	}
+	printf("[OK]\n");
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int fail = 0;
+	int dot_dfd = openat_or_die(AT_FDCWD, ".", O_RDONLY);
+	int subdir_dfd = openat_or_die(AT_FDCWD, "subdir", O_RDONLY);
+	int file_fd = openat_or_die(AT_FDCWD, "topfile", O_RDONLY);
+
+	/* Sanity check normal behavior */
+	fail |= check_openat(AT_FDCWD, "topfile", O_RDONLY);
+	fail |= check_openat(AT_FDCWD, "subdir/bottomfile", O_RDONLY);
+
+	fail |= check_openat(dot_dfd, "topfile", O_RDONLY);
+	fail |= check_openat(dot_dfd, "subdir/bottomfile", O_RDONLY);
+	fail |= check_openat(dot_dfd, "subdir/../topfile", O_RDONLY);
+
+	fail |= check_openat(subdir_dfd, "../topfile", O_RDONLY);
+	fail |= check_openat(subdir_dfd, "bottomfile", O_RDONLY);
+	fail |= check_openat(subdir_dfd, "../subdir/bottomfile", O_RDONLY);
+	fail |= check_openat(subdir_dfd, "symlinkup", O_RDONLY);
+	fail |= check_openat(subdir_dfd, "symlinkout", O_RDONLY);
+
+	fail |= check_openat(AT_FDCWD, "/etc/passwd", O_RDONLY);
+	fail |= check_openat(dot_dfd, "/etc/passwd", O_RDONLY);
+	fail |= check_openat(subdir_dfd, "/etc/passwd", O_RDONLY);
+
+	fail |= check_openat_fail(AT_FDCWD, "bogus", O_RDONLY, ENOENT);
+	fail |= check_openat_fail(dot_dfd, "bogus", O_RDONLY, ENOENT);
+	fail |= check_openat_fail(999, "bogus", O_RDONLY, EBADF);
+	fail |= check_openat_fail(file_fd, "bogus", O_RDONLY, ENOTDIR);
+
+#ifdef O_BENEATH
+	/* Test out O_BENEATH */
+	fail |= check_openat(AT_FDCWD, "topfile", O_RDONLY|O_BENEATH);
+	fail |= check_openat(AT_FDCWD, "subdir/bottomfile",
+			     O_RDONLY|O_BENEATH);
+
+	fail |= check_openat(dot_dfd, "topfile", O_RDONLY|O_BENEATH);
+	fail |= check_openat(dot_dfd, "subdir/bottomfile",
+			     O_RDONLY|O_BENEATH);
+	fail |= check_openat(subdir_dfd, "bottomfile", O_RDONLY|O_BENEATH);
+
+	/* Symlinks without .. or leading / are OK */
+	fail |= check_openat(dot_dfd, "symlinkdown", O_RDONLY|O_BENEATH);
+	fail |= check_openat(dot_dfd, "subdir/symlinkin", O_RDONLY|O_BENEATH);
+	fail |= check_openat(subdir_dfd, "symlinkin", O_RDONLY|O_BENEATH);
+	/* ... unless of course we specify O_NOFOLLOW */
+	fail |= check_openat_fail(dot_dfd, "symlinkdown",
+				  O_RDONLY|O_BENEATH|O_NOFOLLOW, ELOOP);
+	fail |= check_openat_fail(dot_dfd, "subdir/symlinkin",
+				  O_RDONLY|O_BENEATH|O_NOFOLLOW, ELOOP);
+	fail |= check_openat_fail(subdir_dfd, "symlinkin",
+				  O_RDONLY|O_BENEATH|O_NOFOLLOW, ELOOP);
+
+	/* Can't open paths with ".." in them */
+	fail |= check_openat_fail(dot_dfd, "subdir/../topfile",
+				O_RDONLY|O_BENEATH, EACCES);
+	fail |= check_openat_fail(subdir_dfd, "../topfile",
+				  O_RDONLY|O_BENEATH, EACCES);
+	fail |= check_openat_fail(subdir_dfd, "../subdir/bottomfile",
+				O_RDONLY|O_BENEATH, EACCES);
+
+	/* Can't open paths starting with "/" */
+	fail |= check_openat_fail(AT_FDCWD, "/etc/passwd",
+				  O_RDONLY|O_BENEATH, EACCES);
+	fail |= check_openat_fail(dot_dfd, "/etc/passwd",
+				  O_RDONLY|O_BENEATH, EACCES);
+	fail |= check_openat_fail(subdir_dfd, "/etc/passwd",
+				  O_RDONLY|O_BENEATH, EACCES);
+	/* Can't sneak around constraints with symlinks */
+	fail |= check_openat_fail(subdir_dfd, "symlinkup",
+				  O_RDONLY|O_BENEATH, EACCES);
+	fail |= check_openat_fail(subdir_dfd, "symlinkout",
+				  O_RDONLY|O_BENEATH, EACCES);
+#else
+	printf("Skipping O_BENEATH tests due to missing #define\n");
+#endif
+
+	return fail ? -1 : 0;
+}
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* [PATCH man-pages 3/3] open.2: describe O_BENEATH flag
From: David Drysdale @ 2014-11-03 11:48 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Alexander Viro, Kees Cook
  Cc: Greg Kroah-Hartman, Meredydd Luff, Will Drewry,
	Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell, Julien Tinnes,
	Mike Depinet, James Morris, Andy Lutomirski, Paolo Bonzini,
	Paul Moore, Christoph Hellwig, Eric W. Biederman,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA, David Drysdale
In-Reply-To: <1415015305-15494-1-git-send-email-drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

Signed-off-by: David Drysdale <drysdale-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
 man2/open.2 | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/man2/open.2 b/man2/open.2
index abc3c35b8b3a..495c7f1e81a4 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -716,6 +716,31 @@ XFS support was added
 .\" commit ab29743117f9f4c22ac44c13c1647fb24fb2bafe
 in Linux 3.15.
 .TP
+.B O_BENEATH " (since Linux 3.??)"
+Ensure that the
+.I pathname
+is beneath the current working directory (for
+.BR open (2))
+or the
+.I dirfd
+(for
+.BR openat (2)).
+If the
+.I pathname
+is absolute or contains a path component of "..", the
+.BR open ()
+fails with the error
+.BR EACCES.
+This occurs even if ".." path component would not actually
+escape the original directory; for example, a
+.I pathname
+of "subdir/../filename" would be rejected.
+Path components that are symbolic links to absolute paths, or that are
+relative paths containing a ".." component, will also cause the
+.BR open ()
+operation to fail with the error
+.BR EACCES.
+.TP
 .B O_TRUNC
 If the file already exists and is a regular file and the access mode allows
 writing (i.e., is
@@ -792,7 +817,11 @@ The requested access to the file is not allowed, or search permission
 is denied for one of the directories in the path prefix of
 .IR pathname ,
 or the file did not exist yet and write access to the parent directory
-is not allowed.
+is not allowed, or the
+.B O_BENEATH
+flag was specified and the
+.I pathname
+was not beneath the relevant directory.
 (See also
 .BR path_resolution (7).)
 .TP
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* Re: [PATCH man-pages 3/3] open.2: describe O_BENEATH flag
From: Paolo Bonzini @ 2014-11-03 11:56 UTC (permalink / raw)
  To: David Drysdale, linux-kernel, Alexander Viro, Kees Cook
  Cc: Greg Kroah-Hartman, Meredydd Luff, Will Drewry,
	Jorge Lucangeli Obes, Ricky Zhou, Lee Campbell, Julien Tinnes,
	Mike Depinet, James Morris, Andy Lutomirski, Paul Moore,
	Christoph Hellwig, Eric W. Biederman, linux-api,
	linux-security-module
In-Reply-To: <1415015305-15494-4-git-send-email-drysdale@google.com>

On 03/11/2014 12:48, David Drysdale wrote:
> +.I pathname
> +is beneath the current working directory (for
> +.BR open (2))
> +or the
> +.I dirfd
> +(for
> +.BR openat (2)).
> +If the
> +.I pathname
> +is absolute or contains a path component of "..", the
> +.BR open ()
> +fails with the error
> +.BR EACCES.
> +This occurs even if ".." path component would not actually
> +escape the original directory; for example, a
> +.I pathname
> +of "subdir/../filename" would be rejected.
> +Path components that are symbolic links to absolute paths, or that are
> +relative paths containing a ".." component, will also cause the
> +.BR open ()
> +operation to fail with the error
> +.BR EACCES.

I wonder if EPERM is more appropriate than EACCES.

Apart from this, the patches look fine.

Paolo

^ permalink raw reply

* Re: kdbus: add code to gather metadata
From: Simon McVittie @ 2014-11-03 12:00 UTC (permalink / raw)
  To: Andy Lutomirski, Daniel Mack
  Cc: Greg Kroah-Hartman, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, John Stultz,
	Arnd Bergmann, Tejun Heo, Marcel Holtmann, Ryan Lortie,
	Bastien Nocera, David Herrmann, Djalal Harouni, alban.crequy,
	Javier Martinez Canillas, Tom Gundersen
In-Reply-To: <CALCETrXxx4juUGA3mwOxq0BtErM0kj7_THxiO5LwCVLzCXnd2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 01/11/14 16:19, Andy Lutomirski wrote:
> You can't justify logging fundamentally unverifiable things like the
> command line by saying that you want to know if someone tries to play
> (impossible-to-reliably-detect) games to obscure their command line.

I think kdbus might be mixing up two orthogonal things here.

It has an easy, kernel-checked, race-free way to determine
kernel-mediated credential-like information that cannot be faked or
interfered with (uid, primary gid, other gids?, security label,
capabilities) because these are usable for security decisions, but if
they are *not* received in a kernel-checked, race-free way, then they
are useless.

One concrete example of using non-ucred credential-like information is
that traditional D-Bus can only restrict sysadmin tasks to uid 0 (or a
root-equivalent uid in group sudo/admin/whatever), whereas when systemd
and systemd-logind are run on kdbus, many of their D-Bus methods require
specific capabilities(7): KillUser requires CAP_KILL, PowerOff requires
CAP_SYS_BOOT, and so on. If capabilities(7) are a good thing, then
that's surely a good thing too. (On the other hand, if you think
capabilities(7) are a waste of time, then so is this.)

It also uses the same mechanism as an easy, race-free, but *not*
kernel-checked way to determine bits and pieces that are valuable for
debugging (dbus-monitor etc.), but unsuitable for security decisions,
such as cmdline.

In traditional D-Bus, you can get the uid and pid of a remote process,
but in a debug log you would probably actually prefer to log the cmdline
in addition; yes a malicious user could fake the cmdline, but when
debugging a system problem, information that is known to be forgeable
seems better than no information at all. After all, ps(1) shows the
forgeable cmdline, not just the executable. You can get that by
rummaging in /proc/$pid, but there is a race: if the remote process
exits too soon (a "fire and forget" method call) then you'll never know
who it was. kdbus solves that race, but does not make cmdline unforgeable.

If client libraries wishing to attach their cmdline (or other debug
info) to messages for debugging were required to add it as an
out-of-band KDBUS_ITEM, or as a D-Bus message header inside the payload,
then that would be duplicating work in client libraries that could have
been done centrally, but would still solve the race.

    S

^ permalink raw reply

* Re: [PATCH] kernel, add panic_on_warn
From: Prarit Bhargava @ 2014-11-03 13:32 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andi Kleen, Jonathan Corbet,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Rusty Russell,
	linux-doc-u79uwXL29TY76Z2rM5mHXA, jbaron-JqFfY2XvxFXQT0dZR+AlfA,
	Fabian Frederick, isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A,
	H. Peter Anvin, Masami Hiramatsu, Andrew Morton,
	linux-api-u79uwXL29TY76Z2rM5mHXA, vgoyal-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <20141031015843.GW9743-2aXy8kx9jd5qWu+OMeiFJmRHvnxUvMa2@public.gmane.org>



On 10/30/2014 09:58 PM, Hedi Berriche wrote:
> On Thu, Oct 30, 2014 at 17:06 Prarit Bhargava wrote:
> | There have been several times where I have had to rebuild a kernel to
> | cause a panic when hitting a WARN() in the code in order to get a crash
> | dump from a system.  Sometimes this is easy to do, other times (such as
> | in the case of a remote admin) it is not trivial to send new images to the
> | user.
> | 
> | A much easier method would be a switch to change the WARN() over to a
> | panic.  This makes debugging easier in that I can now test the actual
> | image the WARN() was seen on and I do not have to engage in remote
> | debugging.
> 
> Do we want to leave it to usersspace[1] to ensure panic_on_warn is out
> of the way in when the kdump kernel boots? or would a self-contained
> approach be more preferable i.e. test whether we're a kdump kernel
> before bothering with panic_on_warn?

Hmm ... this is a good point.  Vivek, do you have a preference?  I'm willing to
code it either way.  I should be able to put in a is_kdump_kernel() check
without any problems but I'm not sure if that is the right thing to do here.

P.

> 
> Cheers,
> Hedi.
> 
> [1] kexec-tools in the case of the boot param by filtering it out of the
>     kdump kernel cmdline. In the case of sysctl.conf, it would depend on
>     whether there are distros out there that  include it in the kdump
>     initrd.
> 

^ permalink raw reply

* Re: [PATCH V2] kernel, add bug_on_warn
From: Prarit Bhargava @ 2014-11-03 13:43 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Andi Kleen, hedi-sJ/iWh9BUns, Jonathan Corbet,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Fabian Frederick,
	isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A, H. Peter Anvin,
	Masami Hiramatsu, Andrew Morton, vgoyal-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <87ppd984qv.fsf-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>



On 10/30/2014 08:25 PM, Rusty Russell wrote:
> Prarit Bhargava <prarit-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>> On 10/22/2014 12:27 AM, Rusty Russell wrote:
>>> Prarit Bhargava <prarit-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes:
>>>> There have been several times where I have had to rebuild a kernel to
>>>> cause a panic when hitting a WARN() in the code in order to get a crash
>>>> dump from a system.  Sometimes this is easy to do, other times (such as
>>>> in the case of a remote admin) it is not trivial to send new images to the
>>>> user.
>>>
>>> What about during early boot?
>>
>> Hi Rusty,
>>
>> I really don't have a use case for this in early boot.  The kernel boots, the
>> initramfs, and then we run whatever init (systemd in my case).  A systemd script
>> configures kexec for kdump and that point kdump is "armed".  Doing a bug_on_warn
>> before this will simply result in a panicked system.  I don't get any "new"
>> information FWIW as I get a stack trace, etc., in both the WARN() and BUG() cases.
>>
>>>
>>> I'd recommend you use core_param().  Less code, and can be set on
>>> commandline.

Yeah, I was just starting to do this and then I saw Hedi's comment about
disabling panic_on_warn during kdump to avoid a situation where the kdump kernel
bogus panics on a warn.

So that makes the setup function look like:

static int __init panic_on_warn_setup(char *s)
{
        /* Enabling this on a kdump kernel could cause a bogus panic. */
        if (!is_kdump_kernel())
                panic_on_warn = 1;
        return 0;
}
early_param("panic_on_warn", panic_on_warn_setup);

... so I dunno if core_param would work here :(.  It would have been nice if it did.

P.

^ permalink raw reply

* [PATCH v6] kernel, add panic_on_warn
From: Prarit Bhargava @ 2014-11-03 14:32 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Prarit Bhargava, Jonathan Corbet, Andrew Morton, Rusty Russell,
	H. Peter Anvin, Andi Kleen, Masami Hiramatsu, Fabian Frederick,
	vgoyal-H+wXaHxf7aLQT0dZR+AlfA,
	isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A,
	jbaron-JqFfY2XvxFXQT0dZR+AlfA, hedi-sJ/iWh9BUns,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-api-u79uwXL29TY76Z2rM5mHXA

There have been several times where I have had to rebuild a kernel to
cause a panic when hitting a WARN() in the code in order to get a crash
dump from a system.  Sometimes this is easy to do, other times (such as
in the case of a remote admin) it is not trivial to send new images to the
user.

A much easier method would be a switch to change the WARN() over to a
panic.  This makes debugging easier in that I can now test the actual
image the WARN() was seen on and I do not have to engage in remote
debugging.

This patch adds a panic_on_warn kernel parameter and
/proc/sys/kernel/panic_on_warn calls panic() in the warn_slowpath_common()
path.  The function will still print out the location of the warning.

An example of the panic_on_warn output:

The first line below is from the WARN_ON() to output the WARN_ON()'s location.
After that the panic() output is displayed.

WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
Kernel panic - not syncing: panic_on_warn set ...

CPU: 30 PID: 11698 Comm: insmod Tainted: G        W  OE  3.17.0+ #57
Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
 0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
 0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
 ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
Call Trace:
 [<ffffffff81665190>] dump_stack+0x46/0x58
 [<ffffffff8165e2ec>] panic+0xd0/0x204
 [<ffffffffa038e05f>] ? init_dummy+0x1f/0x30 [dummy_module]
 [<ffffffff81076b90>] warn_slowpath_common+0xd0/0xd0
 [<ffffffffa038e040>] ? dummy_greetings+0x40/0x40 [dummy_module]
 [<ffffffff81076c8a>] warn_slowpath_null+0x1a/0x20
 [<ffffffffa038e05f>] init_dummy+0x1f/0x30 [dummy_module]
 [<ffffffff81002144>] do_one_initcall+0xd4/0x210
 [<ffffffff811b52c2>] ? __vunmap+0xc2/0x110
 [<ffffffff810f8889>] load_module+0x16a9/0x1b30
 [<ffffffff810f3d30>] ? store_uevent+0x70/0x70
 [<ffffffff810f49b9>] ? copy_module_from_fd.isra.44+0x129/0x180
 [<ffffffff810f8ec6>] SyS_finit_module+0xa6/0xd0
 [<ffffffff8166cf29>] system_call_fastpath+0x12/0x17

Successfully tested by me.

Cc: Jonathan Corbet <corbet-T1hC0tSOHrs@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Rusty Russell <rusty-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>
Cc: "H. Peter Anvin" <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
Cc: Andi Kleen <ak-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt-FCd8Q96Dh0JBDgjK7y7TUQ@public.gmane.org>
Cc: Fabian Frederick <fabf-AgBVmzD5pcezQB+pC5nmwQ@public.gmane.org>
Cc: vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
Cc: isimatu.yasuaki-+CUm20s59erQFUHtdCDX3A@public.gmane.org
Cc: jbaron-JqFfY2XvxFXQT0dZR+AlfA@public.gmane.org
Cc: hedi-sJ/iWh9BUns@public.gmane.org
Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: kexec-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Prarit Bhargava <prarit-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

[v2]: add /proc/sys/kernel/panic_on_warn, additional documentation, modify
      !slowpath cases
[v3]: use proc_dointvec_minmax() in sysctl handler
[v4]: remove !slowpath cases, and add __read_mostly
[v5]: change to panic_on_warn, re-alphabetize Documentation/sysctl/kernel.txt
[v6]: disable on kdump kernel to avoid bogus panicks.
---
 Documentation/kdump/kdump.txt       |    7 ++++++
 Documentation/kernel-parameters.txt |    3 +++
 Documentation/sysctl/kernel.txt     |   40 +++++++++++++++++++++++------------
 include/linux/kernel.h              |    1 +
 include/uapi/linux/sysctl.h         |    1 +
 kernel/panic.c                      |   23 +++++++++++++++++++-
 kernel/sysctl.c                     |    9 ++++++++
 kernel/sysctl_binary.c              |    1 +
 8 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index 6c0b9f2..bc4bd5a 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at the following URL:
 
    http://people.redhat.com/~anderson/
 
+Trigger Kdump on WARN()
+=======================
+
+The kernel parameter, panic_on_warn, calls panic() in all WARN() paths.  This
+will cause a kdump to occur at the panic() call.  In cases where a user wants
+to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 1
+to achieve the same behaviour.
 
 Contact
 =======
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4c81a86..ea5d57c 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2509,6 +2509,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			timeout < 0: reboot immediately
 			Format: <timeout>
 
+	panic_on_warn	panic() instead of WARN().  Useful to cause kdump
+			on a WARN().
+
 	crash_kexec_post_notifiers
 			Run kdump after running panic-notifiers and dumping
 			kmsg. This only for the users who doubt kdump always
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 57baff5..b5d0c85 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -54,8 +54,9 @@ show up in /proc/sys/kernel:
 - overflowuid
 - panic
 - panic_on_oops
-- panic_on_unrecovered_nmi
 - panic_on_stackoverflow
+- panic_on_unrecovered_nmi
+- panic_on_warn
 - pid_max
 - powersave-nap               [ PPC only ]
 - printk
@@ -527,19 +528,6 @@ the recommended setting is 60.
 
 ==============================================================
 
-panic_on_unrecovered_nmi:
-
-The default Linux behaviour on an NMI of either memory or unknown is
-to continue operation. For many environments such as scientific
-computing it is preferable that the box is taken out and the error
-dealt with than an uncorrected parity/ECC error get propagated.
-
-A small number of systems do generate NMI's for bizarre random reasons
-such as power management so the default is off. That sysctl works like
-the existing panic controls already in that directory.
-
-==============================================================
-
 panic_on_oops:
 
 Controls the kernel's behaviour when an oops or BUG is encountered.
@@ -563,6 +551,30 @@ This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled.
 
 ==============================================================
 
+panic_on_unrecovered_nmi:
+
+The default Linux behaviour on an NMI of either memory or unknown is
+to continue operation. For many environments such as scientific
+computing it is preferable that the box is taken out and the error
+dealt with than an uncorrected parity/ECC error get propagated.
+
+A small number of systems do generate NMI's for bizarre random reasons
+such as power management so the default is off. That sysctl works like
+the existing panic controls already in that directory.
+
+==============================================================
+
+panic_on_warn:
+
+Calls panic() in the WARN() path when set to 1.  This is useful to avoid
+a kernel rebuild when attempting to kdump at the location of a WARN().
+
+0: only WARN(), default behaviour.
+
+1: call panic() after printing out WARN() location.
+
+==============================================================
+
 perf_cpu_time_max_percent:
 
 Hints to the kernel how much CPU time it should be allowed to
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3d770f55..d60d31d 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -422,6 +422,7 @@ extern int panic_timeout;
 extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
+extern int panic_on_warn;
 extern int sysctl_panic_on_stackoverflow;
 /*
  * Only to be used by arch init code. If the user over-wrote the default
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 43aaba1..0956373 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -153,6 +153,7 @@ enum
 	KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
 	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
 	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+	KERN_PANIC_ON_WARN=77, /* int: call panic() in WARN() functions */
 };
 
 
diff --git a/kernel/panic.c b/kernel/panic.c
index d09dc5c..b30550a 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -23,6 +23,7 @@
 #include <linux/sysrq.h>
 #include <linux/init.h>
 #include <linux/nmi.h>
+#include <linux/crash_dump.h>
 
 #define PANIC_TIMER_STEP 100
 #define PANIC_BLINK_SPD 18
@@ -33,6 +34,7 @@ static int pause_on_oops;
 static int pause_on_oops_flag;
 static DEFINE_SPINLOCK(pause_on_oops_lock);
 static bool crash_kexec_post_notifiers;
+int panic_on_warn __read_mostly;
 
 int panic_timeout = CONFIG_PANIC_TIMEOUT;
 EXPORT_SYMBOL_GPL(panic_timeout);
@@ -420,13 +422,23 @@ static void warn_slowpath_common(const char *file, int line, void *caller,
 {
 	disable_trace_on_warning();
 
-	pr_warn("------------[ cut here ]------------\n");
+	if (!panic_on_warn)
+		pr_warn("------------[ cut here ]------------\n");
 	pr_warn("WARNING: CPU: %d PID: %d at %s:%d %pS()\n",
 		raw_smp_processor_id(), current->pid, file, line, caller);
 
 	if (args)
 		vprintk(args->fmt, args->args);
 
+	if (panic_on_warn) {
+		/*
+		 * A flood of WARN()s may occur.  Prevent further WARN()s
+		 * from panicking the system.
+		 */
+		panic_on_warn = 0;
+		panic("panic_on_warn set ...\n");
+	}
+
 	print_modules();
 	dump_stack();
 	print_oops_end_marker();
@@ -501,3 +513,12 @@ static int __init oops_setup(char *s)
 	return 0;
 }
 early_param("oops", oops_setup);
+
+static int __init panic_on_warn_setup(char *s)
+{
+	/* Enabling this on a kdump kernel could cause a bogus panic. */
+	if (!is_kdump_kernel())
+		panic_on_warn = 1;
+	return 0;
+}
+early_param("panic_on_warn", panic_on_warn_setup);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 15f2511..7c54ff7 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1104,6 +1104,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+	{
+		.procname	= "panic_on_warn",
+		.data		= &panic_on_warn,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
 	{ }
 };
 
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 9a4f750..7e7746a 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -137,6 +137,7 @@ static const struct bin_table bin_kern_table[] = {
 	{ CTL_INT,	KERN_COMPAT_LOG,		"compat-log" },
 	{ CTL_INT,	KERN_MAX_LOCK_DEPTH,		"max_lock_depth" },
 	{ CTL_INT,	KERN_PANIC_ON_NMI,		"panic_on_unrecovered_nmi" },
+	{ CTL_INT,	KERN_PANIC_ON_WARN,		"panic_on_warn" },
 	{}
 };
 
-- 
1.7.9.3

^ permalink raw reply related

* Re: [PATCH 00/12] Add kdbus implementation
From: One Thousand Gnomes @ 2014-11-03 14:38 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jiri Kosina, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	john.stultz-QSEj5FYQhm4dnm+yROfE0A, arnd-r2nGTMty4D4,
	tj-DgEjT+Ai2ygdnm+yROfE0A, marcel-kz+m5ild9QBg9hUCZPvPmw,
	desrt-0xnayjDhYQY, hadess-0MeiytkfxGOsTnJN9+BGXg,
	dh.herrmann-Re5JQEeQqe8AvxtiuMwx3w, tixxdz-Umm1ozX2/EEdnm+yROfE0A,
	simon.mcvittie-ZGY8ohtN/8pPYcu2f3hruQ,
	daniel-cYrQPVfZoowdnm+yROfE0A,
	alban.crequy-ZGY8ohtN/8pPYcu2f3hruQ,
	javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ, teg-B22kvLQNl6c
In-Reply-To: <20141102012130.GA9335-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org>

On Sat, 1 Nov 2014 18:21:30 -0700
Greg Kroah-Hartman <gregkh-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r@public.gmane.org> wrote:

> Here's some reasons why I feel it is better to have kdbus in the kernel
> rather than trying to implement the same thing in a userspace daemon:

No - these are reasons to have *something* in the kernel. I think it
would be far more constructive to treat the current kdbus as a proof of
concept/prototype or even a draft requirements specification.

>   as the only trustworthy compoenent in the game is the kernel which
>   adds metadata and ensures that all data passed as payload is either
>   copied or sealed, so that the receiver can parse the data without

When the kernel adds metadata without being told to do so by one end of
the link you create a new set of security and privacy leaks. Far better
that the sender must choose what metadata is added and the receiver can
decide to bin stuff that's not acceptable. The job of the kernel is
really more like that of an auditor in a business transaction - to make
sure that the data they agree to pass is truthful.

(ie its the sender who must say "attach my user info", the receiver who
must say "no info, no play" and the kernel who must provide the info so
it can't be faked.

> - semantics for apps with heavy data payloads (media apps, for instance)
>   with optinal priority message dequeuing, and global message ordering.

Sounds like System 5 IPC ;-)

> Regarding binder: binder and kdbus follow very different design
> concepts. 

We know binder is broken but the Android guys are stuck in a special
kind of hell with it for some years to come. We need to make sure kdbus
isn't the same result.

Alan

^ permalink raw reply

* Re: [PATCH 2/2] perf: Userspace software event and ioctl
From: Tomeu Vizoso @ 2014-11-03 14:48 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
In-Reply-To: <1412006003.3817.30.camel@hornet>

On 29 September 2014 17:53, Pawel Moll <pawel.moll@arm.com> wrote:
> On Mon, 2014-09-29 at 16:32 +0100, Peter Zijlstra wrote:
>> Also none of the many words above describe
>> PERF_SAMPLE_USERSPACE_EVENT(), wth is that about?
>
> Hopefully description of the v2 makes better job in this:
>
> http://thread.gmane.org/gmane.linux.kernel/1793272/focus=4813
>
> where it's already called "UEVENT" and was generated by write().
>
> Before you get into this, though, the most important outcomes of both v1
> and v2 discussions:
>
> * Ingo suggested prctl(PR_TRACE_UEVENT, type, size, data, 0) as the way
> of generating such events (so the tracee doesn't have to know the fd to
> do ioctl); Frederic seems to have the same on his mind.
>
> * Namhyung proposed sticking the userspace-originating events into the
> buffer as PERF_RECORD_UEVENT rather then PERF_SAMPLE_UEVENT.
>
> Working on making both happen now.

Hi Pawel,

are you still working on this? Would be happy to lend a hand if that
can speed things up.

Cheers,

Tomeu

> Pawel
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply

* Re: [PATCH 2/2] perf: Userspace software event and ioctl
From: Pawel Moll @ 2014-11-03 15:04 UTC (permalink / raw)
  To: Tomeu Vizoso
  Cc: Peter Zijlstra, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
In-Reply-To: <CAAObsKBw9VU3oZZSz5hV-rNXK2WAkjXeOgY_X1AF1Lpp9iLk7g@mail.gmail.com>

On Mon, 2014-11-03 at 14:48 +0000, Tomeu Vizoso wrote:
> On 29 September 2014 17:53, Pawel Moll <pawel.moll@arm.com> wrote:
> > On Mon, 2014-09-29 at 16:32 +0100, Peter Zijlstra wrote:
> >> Also none of the many words above describe
> >> PERF_SAMPLE_USERSPACE_EVENT(), wth is that about?
> >
> > Hopefully description of the v2 makes better job in this:
> >
> > http://thread.gmane.org/gmane.linux.kernel/1793272/focus=4813
> >
> > where it's already called "UEVENT" and was generated by write().
> >
> > Before you get into this, though, the most important outcomes of both v1
> > and v2 discussions:
> >
> > * Ingo suggested prctl(PR_TRACE_UEVENT, type, size, data, 0) as the way
> > of generating such events (so the tracee doesn't have to know the fd to
> > do ioctl); Frederic seems to have the same on his mind.
> >
> > * Namhyung proposed sticking the userspace-originating events into the
> > buffer as PERF_RECORD_UEVENT rather then PERF_SAMPLE_UEVENT.
> >
> > Working on making both happen now.
> 
> are you still working on this? Would be happy to lend a hand if that
> can speed things up.

By all means! In fact I'm typing commit messages right now and will post
the patches later today. Stay tuned and I'm looking forward to all
suggestions, reviews etc.

Cheers!

Pawel

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox