Linux Security Modules development


* Re: [PATCH v33 12/21] x86/sgx: Allow a limited use of ATTRIBUTE.PROVISIONKEY for attestation
From: Jarkko Sakkinen @ 2020-07-03  2:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Borislav Petkov, x86, linux-sgx, linux-kernel,
	linux-security-module, Jethro Beekman, Andy Lutomirski, akpm,
	andriy.shevchenko, asapek, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, josh,
	kai.huang, kai.svahn, kmoy, ludloff, nhorman, npmccallum,
	puiterwijk, rientjes, tglx, yaozhangx
In-Reply-To: <20200629220400.GI12312@linux.intel.com>

On Mon, Jun 29, 2020 at 03:04:00PM -0700, Sean Christopherson wrote:
> On Mon, Jun 29, 2020 at 06:02:42PM +0200, Borislav Petkov wrote:
> > On Thu, Jun 18, 2020 at 01:08:34AM +0300, Jarkko Sakkinen wrote:
> > > Provisioning Certification Enclave (PCE), the root of trust for other
> > > enclaves, generates a signing key from a fused key called Provisioning
> > > Certification Key. PCE can then use this key to certify an attestation key
> > > of a QE, e.g. we get the chain of trust down to the hardware if the Intel
> > 
> > What's a QE?
> > 
> > I don't see this acronym resolved anywhere in the whole patchset.
> 
> Quoting Enclave.
> 
> > > signed PCE is used.
> > > 
> > > To use the needed keys, ATTRIBUTE.PROVISIONKEY is required but should be
> > > only allowed for those who actually need it so that only the trusted
> > > parties can certify QE's.
> > > 
> > > Obviously the attestation service should know the public key of the used
> > > PCE and that way detect illegit attestation, but whitelisting the legit
> > > users still adds an additional layer of defence.
> > > 
> > > Add new device file called /dev/sgx/provision. The sole purpose of this
> > > file is to provide file descriptors that act as privilege tokens to allow
> > > to build enclaves with ATTRIBUTE.PROVISIONKEY set. A new ioctl called
> > > SGX_IOC_ENCLAVE_SET_ATTRIBUTE is used to assign this token to an enclave.
> > 
> > So I'm sure I'm missing something here: what controls which
> > enclave can open /dev/sgx/provision and thus pass the FD to
> > SGX_IOC_ENCLAVE_SET_ATTRIBUTE?
> 
> /dev/sgx/provision is root-only by default, the expectation is that the admin
> will configure the system to grant only specific enclaves access to the
> PROVISION_KEY.
> 
> > And in general, what does that whole flow look like: what calls
> > SGX_IOC_ENCLAVE_SET_ATTRIBUTE when?
> 
> The basic gist is that the host process of an enclave that needs/wants access
> to the PROVISION_KEY will invoke SGX_IOC_ENCLAVE_SET_ATTRIBUTE when building
> the enclave.  Any enclave can request access to PROVISION_KEY, but practically
> speaking only the PCE and QE (or their non-Intel equivalents) actually need
> access to the key.  KVM (future series) will also respect /dev/sgx/provision,
> i.e. require a similar ioctl() to expose the PROVISION_KEY to a guest.
> 
> E.g. for my own personal testing, I never do anything attestation related, so
> none of the enclaves I run request PROVISION_KEY, but I do expose it to VMs to
> test the KVM paths.
> 
> In this series, access is fairly binary, i.e. there's no additional kernel
> infrastructure to help userspace make per-enclave decisions.  There have been
> more than a few proposals on how to extend the kernel to help provide better
> granularity, e.g. LSM hooks, but it was generally agreed to punt that stuff
> to post-upstreaming to keep things "simple" once we went far enough down
> various paths to ensure we weren't painting ourselves into a corner.
> 
> If you want super gory details, Intel's whitepaper on attestation in cloud
> environments is a good starting point[*], but I don't recommend doing much
> more than skimming unless you really like attestation stuff or are
> masochistic, which IMO amount to the same thing :-)
> 
> [*] https://download.01.org/intel-sgx/dcap-1.0/docs/SGX_ECDSA_QuoteGenReference_DCAP_API_Linux_1.0.pdf

Section 3 in [*] is what describes the infrastructure. DCAP is only a
component in the whole attestation infrastructure.

[*] https://software.intel.com/sites/default/files/managed/f1/b8/intel-sgx-support-for-third-party-attestation.pdf

/Jarkko

* Re: [PATCH v33 12/21] x86/sgx: Allow a limited use of ATTRIBUTE.PROVISIONKEY for attestation
From: Jarkko Sakkinen @ 2020-07-03  2:43 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Sean Christopherson, x86, linux-sgx, linux-kernel,
	linux-security-module, Jethro Beekman, Andy Lutomirski, akpm,
	andriy.shevchenko, asapek, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, josh,
	kai.huang, kai.svahn, kmoy, ludloff, nhorman, npmccallum,
	puiterwijk, rientjes, tglx, yaozhangx
In-Reply-To: <20200630084956.GB1093@zn.tnic>

On Tue, Jun 30, 2020 at 10:49:56AM +0200, Borislav Petkov wrote:
> On Mon, Jun 29, 2020 at 03:04:00PM -0700, Sean Christopherson wrote:
> > > I don't see this acronym resolved anywhere in the whole patchset.
> > 
> > Quoting Enclave.
> 
> Yah, pls add it somewhere.
> 
> > /dev/sgx/provision is root-only by default, the expectation is that the admin
> > will configure the system to grant only specific enclaves access to the
> > PROVISION_KEY.
> 
> Uuh, I don't like "the expectation is" - the reality happens to turn
> differently, more often than not.
> 
> > In this series, access is fairly binary, i.e. there's no additional kernel
> > infrastructure to help userspace make per-enclave decisions.  There have been
> > more than a few proposals on how to extend the kernel to help provide better
> > granularity, e.g. LSM hooks, but it was generally agreed to punt that stuff
> > to post-upstreaming to keep things "simple" once we went far enough down
> > various paths to ensure we weren't painting ourselves into a corner.
> 
> So this all sounds to me like we should not upstream /dev/sgx/provision
> now but delay it until the infrastructure for that has been made more
> concrete. We can always add it then. Changing it after the fact -
> if we have to and for whatever reason - would be a lot harder for a
> user-visible interface which someone has started using already.
> 
> So I'd leave that out of the initial patchset.

I'm trying to understand what is meant by "more concrete". Attestation
is needed for most enclave applications.

If this patch is dropped, should we also allow the PROVISION_KEY attribute
for all enclaves?  Dropping this patch while keeping that check in the
driver patch is not very coherent behaviour.

/Jarkko

* Re: [PATCH v33 12/21] x86/sgx: Allow a limited use of ATTRIBUTE.PROVISIONKEY for attestation
From: Jarkko Sakkinen @ 2020-07-03  2:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-sgx, linux-kernel, linux-security-module,
	Jethro Beekman, Andy Lutomirski, akpm, andriy.shevchenko, asapek,
	cedric.xing, chenalexchen, conradparker, cyhanish, dave.hansen,
	haitao.huang, josh, kai.huang, kai.svahn, kmoy, ludloff, nhorman,
	npmccallum, puiterwijk, rientjes, sean.j.christopherson, tglx,
	yaozhangx
In-Reply-To: <20200703023146.GA306897@linux.intel.com>

On Fri, Jul 03, 2020 at 05:32:28AM +0300, Jarkko Sakkinen wrote:
> On Mon, Jun 29, 2020 at 06:02:42PM +0200, Borislav Petkov wrote:
> > On Thu, Jun 18, 2020 at 01:08:34AM +0300, Jarkko Sakkinen wrote:
> > > Provisioning Certification Enclave (PCE), the root of trust for other
> > > enclaves, generates a signing key from a fused key called Provisioning
> > > Certification Key. PCE can then use this key to certify an attestation key
> > > of a QE, e.g. we get the chain of trust down to the hardware if the Intel
> > 
> > What's a QE?
> > 
> > I don't see this acronym resolved anywhere in the whole patchset.
> 
> Quoting Enclave.

Thanks for spotting this. I updated my GIT-tree accordingly.

/Jarkko

* Re: [PATCH v33 11/21] x86/sgx: Linux Enclave Driver
From: Jarkko Sakkinen @ 2020-07-03  3:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-sgx, linux-kernel, linux-security-module,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Sean Christopherson,
	Suresh Siddha, akpm, andriy.shevchenko, asapek, cedric.xing,
	chenalexchen, conradparker, cyhanish, dave.hansen, haitao.huang,
	josh, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx
In-Reply-To: <20200626091419.GB27151@zn.tnic>

On Fri, Jun 26, 2020 at 11:14:19AM +0200, Borislav Petkov wrote:
> On Thu, Jun 18, 2020 at 01:08:33AM +0300, Jarkko Sakkinen wrote:
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index 59472cd6a11d..35f713e3a267 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -323,6 +323,7 @@ Code  Seq#    Include File                                           Comments
> >                                                                       <mailto:tlewis@mindspring.com>
> >  0xA3  90-9F  linux/dtlk.h
> >  0xA4  00-1F  uapi/linux/tee.h                                        Generic TEE subsystem
> > +0xA4  00-1F  uapi/asm/sgx.h                                          Intel SGX subsystem (a legit conflict as TEE and SGX do not co-exist)
> 
> Maybe add <mailto:linux-sgx@vger.kernel.org> ?
> 
> >  0xAA  00-3F  linux/uapi/linux/userfaultfd.h
> >  0xAB  00-1F  linux/nbd.h
> >  0xAC  00-1F  linux/raw.h
> 
> ...
> 
> > +static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
> > +{
> > +	unsigned long encl_size = secs->size + PAGE_SIZE;
> 
> Wait, you just copied @secs from user memory in sgx_ioc_enclave_create()
> and now use ->size unverified? You're kidding, right?

The validation is done in sgx_validate_secs().

> 
> > +	struct sgx_epc_page *secs_epc;
> > +	unsigned long ssaframesize;
> > +	struct sgx_pageinfo pginfo;
> > +	struct sgx_secinfo secinfo;
> > +	struct file *backing;
> > +	long ret;
> > +
> > +	if (atomic_read(&encl->flags) & SGX_ENCL_CREATED)
> > +		return -EINVAL;
> > +
> > +	ssaframesize = sgx_calc_ssaframesize(secs->miscselect, secs->xfrm);
> 
> So this is using more un-validated user input to do further calculations.
> What can possibly go wrong?
> 
> I sure hope *I* am wrong and am missing something here.
> 
> If not, please, for the next version, audit all your user input and
> validate it before using it. Srsly.

It works but is unclean. I'd guess the reason for this is just that the
code has evolved into this state over time.

I'd just move the call to sgx_calc_ssaframesize() inside
sgx_validate_secs().

/Jarkko

* Re: [PATCH v4 1/3] capabilities: Introduce CAP_CHECKPOINT_RESTORE
From: Adrian Reber @ 2020-07-03 11:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
	Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
	Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
	Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
	Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
	linux-security-module, linux-kernel, selinux, Eric Paris,
	Jann Horn, linux-fsdevel
In-Reply-To: <20200701082708.pgfskg7hrsnfi36k@wittgenstein>

On Wed, Jul 01, 2020 at 10:27:08AM +0200, Christian Brauner wrote:
> On Wed, Jul 01, 2020 at 08:49:04AM +0200, Adrian Reber wrote:
> > This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> > checkpoint/restore for non-root users.
> > 
> > Over the last years, the CRIU (Checkpoint/Restore In Userspace) team has been
> > asked numerous times if it is possible to checkpoint/restore a process as
> > non-root. The answer usually was: 'almost'.
> > 
> > The main blocker to restore a process as non-root was to control the PID of the
> > restored process. This feature available via the clone3 system call, or via
> > /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
> > 
> > In the past two years, requests for non-root checkpoint/restore have increased
> > due to the following use cases:
> > * Checkpoint/Restore in an HPC environment in combination with a resource
> >   manager distributing jobs where users are always running as non-root.
> >   There is a desire to provide a way to checkpoint and restore long running
> >   jobs.
> > * Container migration as non-root
> > * We have been in contact with JVM developers who are integrating
> >   CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> >   applications are not meant to be running with CAP_SYS_ADMIN.
> > 
> > We have seen the following workarounds:
> > * Use a setuid wrapper around CRIU:
> >   See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> > * Use a setuid helper that writes to ns_last_pid.
> >   Unfortunately, this helper delegation technique is impossible to use with
> >   clone3, and is thus prone to races.
> >   See https://github.com/twosigma/set_ns_last_pid
> > * Cycle through PIDs with fork() until the desired PID is reached:
> >   This has been demonstrated to work with cycling rates of 100,000 PIDs/s
> >   See https://github.com/twosigma/set_ns_last_pid
> > * Patch out the CAP_SYS_ADMIN check from the kernel
> > * Run the desired application in a new user and PID namespace to provide
> >   a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
> >   typical container environments (e.g., Kubernetes) as /proc is
> >   typically protected with read-only layers (e.g., /proc/sys) for hardening
> >   purposes. Read-only layers prevent additional /proc mounts (due to proc's
> >   SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
> >   certain applications need access to /proc matching their PID namespace.
> > 
> > The introduced capability allows to:
> > * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> >   for the corresponding PID namespace via ns_last_pid/clone3.
> > * Open files in /proc/pid/map_files when the current user is
> >   CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
> >   files that are unreachable via the file system such as deleted files, or memfd
> >   files.
> > 
> > See corresponding selftest for an example with clone3().
> > 
> > Signed-off-by: Adrian Reber <areber@redhat.com>
> > Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> > ---
> 
> I think that now looks reasonable. A few comments.
> 
> Before we proceed, please split the addition of
> checkpoint_restore_ns_capable() out into a separate patch.
> In fact, I think the cleanest way of doing this would be:
> - 0/n capability: add CAP_CHECKPOINT_RESTORE
> - 1/n pid: use checkpoint_restore_ns_capable() for set_tid
> - 2/n pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
> - 3/n: proc: require checkpoint_restore_ns_capable() in init userns for map_files
> 
> (commit subjects up to you of course) and a nice commit message for each
> time we relax a permissions on something so we have a clear separate
> track record for each change in case we need to revert something. Then
> the rest of the patches in this series. Testing patches probably last.

Yes, makes sense. I was thinking about this already, but I was not sure
if I should do it or not. But I had the same idea already.

		Adrian
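
[Editorial sketch] For readers unfamiliar with the ns_last_pid route
mentioned in the commit message: the restorer writes "desired PID minus
one" to the file and then forks, hoping nothing else forks in between.
The helper below is a minimal sketch with a hypothetical name; the path is
a parameter (normally /proc/sys/kernel/ns_last_pid) and the write needs
whatever capability the running kernel demands for that file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask for `desired` to be the next PID handed out by writing
 * desired - 1 to the ns_last_pid file at `path`.
 *
 * Returns 0 on success or a negative errno value. Success only means
 * the value was written; another fork() in the same PID namespace can
 * still win the race before the caller forks.
 */
static int request_next_pid(const char *path, pid_t desired)
{
	char buf[32];
	ssize_t written;
	int fd, len, ret = 0;

	assert(path != NULL);

	if (desired < 2)
		return -EINVAL;

	len = snprintf(buf, sizeof(buf), "%d", (int)(desired - 1));

	fd = open(path, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -errno;

	written = write(fd, buf, len);
	if (written < 0)
		ret = -errno;
	else if (written != len)
		ret = -EIO;

	close(fd);
	return ret;
}
```

The window between this write and the following fork() is the race the
commit message refers to; clone3() with set_tid closes it by making the
PID request and the process creation a single call.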


* Re: [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test
From: Adrian Reber @ 2020-07-03 11:18 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
	Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
	Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
	Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Stephen Smalley,
	Sargun Dhillon, Arnd Bergmann, linux-security-module,
	linux-kernel, selinux, Eric Paris, Jann Horn, linux-fsdevel
In-Reply-To: <20200702205305.GA3283@mail.hallyn.com>

On Thu, Jul 02, 2020 at 03:53:05PM -0500, Serge E. Hallyn wrote:
> On Wed, Jul 01, 2020 at 08:49:05AM +0200, Adrian Reber wrote:
> > This adds a test that changes its UID, uses capabilities to
> > get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
> > create a process with a given PID as non-root.
> 
> Seems worth also verifying that it fails if you have no capabilities.
> I don't see that in the existing clone3/ test dir.

Bit confused about what you mean. This test does:

 * switch UID to 1000
 * run clone3() with set_tid set and expect EPERM
 * set CAP_CHECKPOINT_RESTORE capability
 * run clone3() with set_tid set and expect success

So it already does what I think you are asking for. Did I misunderstand
your comment?

		Adrian

> > Signed-off-by: Adrian Reber <areber@redhat.com>
> > ---
> >  tools/testing/selftests/clone3/Makefile       |   4 +-
> >  .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
> >  2 files changed, 206 insertions(+), 1 deletion(-)
> >  create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > 
> > diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
> > index cf976c732906..ef7564cb7abe 100644
> > --- a/tools/testing/selftests/clone3/Makefile
> > +++ b/tools/testing/selftests/clone3/Makefile
> > @@ -1,6 +1,8 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  CFLAGS += -g -I../../../../usr/include/
> > +LDLIBS += -lcap
> >  
> > -TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
> > +TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
> > +	clone3_cap_checkpoint_restore
> >  
> >  include ../lib.mk
> > diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > new file mode 100644
> > index 000000000000..2cc3d57b91f2
> > --- /dev/null
> > +++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > @@ -0,0 +1,203 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Based on Christian Brauner's clone3() example.
> > + * These tests are assuming to be running in the host's
> > + * PID namespace.
> > + */
> > +
> > +/* capabilities related code based on selftests/bpf/test_verifier.c */
> > +
> > +#define _GNU_SOURCE
> > +#include <errno.h>
> > +#include <linux/types.h>
> > +#include <linux/sched.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <stdbool.h>
> > +#include <sys/capability.h>
> > +#include <sys/prctl.h>
> > +#include <sys/syscall.h>
> > +#include <sys/types.h>
> > +#include <sys/un.h>
> > +#include <sys/wait.h>
> > +#include <unistd.h>
> > +#include <sched.h>
> > +
> > +#include "../kselftest.h"
> > +#include "clone3_selftests.h"
> > +
> > +#ifndef MAX_PID_NS_LEVEL
> > +#define MAX_PID_NS_LEVEL 32
> > +#endif
> > +
> > +static void child_exit(int ret)
> > +{
> > +	fflush(stdout);
> > +	fflush(stderr);
> > +	_exit(ret);
> > +}
> > +
> > +static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
> > +{
> > +	int status;
> > +	pid_t pid = -1;
> > +
> > +	struct clone_args args = {
> > +		.exit_signal = SIGCHLD,
> > +		.set_tid = ptr_to_u64(set_tid),
> > +		.set_tid_size = set_tid_size,
> > +	};
> > +
> > +	pid = sys_clone3(&args, sizeof(struct clone_args));
> > +	if (pid < 0) {
> > +		ksft_print_msg("%s - Failed to create new process\n",
> > +			       strerror(errno));
> > +		return -errno;
> > +	}
> > +
> > +	if (pid == 0) {
> > +		int ret;
> > +		char tmp = 0;
> > +
> > +		ksft_print_msg
> > +		    ("I am the child, my PID is %d (expected %d)\n",
> > +		     getpid(), set_tid[0]);
> > +
> > +		if (set_tid[0] != getpid())
> > +			child_exit(EXIT_FAILURE);
> > +		child_exit(EXIT_SUCCESS);
> > +	}
> > +
> > +	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> > +		       getpid(), pid);
> > +
> > +	if (waitpid(pid, &status, 0) < 0) {
> > +		ksft_print_msg("Child returned %s\n", strerror(errno));
> > +		return -errno;
> > +	}
> > +
> > +	if (!WIFEXITED(status))
> > +		return -1;
> > +
> > +	return WEXITSTATUS(status);
> > +}
> > +
> > +static int test_clone3_set_tid(pid_t * set_tid,
> > +			       size_t set_tid_size, int expected)
> > +{
> > +	int ret;
> > +
> > +	ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
> > +		       getpid(), set_tid[0]);
> > +	ret = call_clone3_set_tid(set_tid, set_tid_size);
> > +
> > +	ksft_print_msg
> > +	    ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
> > +	     getpid(), set_tid[0], ret, expected);
> > +	if (ret != expected) {
> > +		ksft_test_result_fail
> > +		    ("[%d] Result (%d) is different than expected (%d)\n",
> > +		     getpid(), ret, expected);
> > +		return -1;
> > +	}
> > +	ksft_test_result_pass
> > +	    ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
> > +	     expected);
> > +
> > +	return 0;
> > +}
> > +
> > +struct libcap {
> > +	struct __user_cap_header_struct hdr;
> > +	struct __user_cap_data_struct data[2];
> > +};
> > +
> > +static int set_capability()
> > +{
> > +	cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
> > +	struct libcap *cap;
> > +	int ret = -1;
> > +	cap_t caps;
> > +
> > +	caps = cap_get_proc();
> > +	if (!caps) {
> > +		perror("cap_get_proc");
> > +		return -1;
> > +	}
> > +
> > +	/* Drop all capabilities */
> > +	if (cap_clear(caps)) {
> > +		perror("cap_clear");
> > +		goto out;
> > +	}
> > +
> > +	cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
> > +	cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
> > +
> > +	cap = (struct libcap *) caps;
> > +
> > +	/* 40 -> CAP_CHECKPOINT_RESTORE */
> > +	cap->data[1].effective |= 1 << (40 - 32);
> > +	cap->data[1].permitted |= 1 << (40 - 32);
> > +
> > +	if (cap_set_proc(caps)) {
> > +		perror("cap_set_proc");
> > +		goto out;
> > +	}
> > +	ret = 0;
> > +out:
> > +	if (cap_free(caps))
> > +		perror("cap_free");
> > +	return ret;
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{
> > +	pid_t pid;
> > +	int status;
> > +	int ret = 0;
> > +	pid_t set_tid[1];
> > +	uid_t uid = getuid();
> > +
> > +	ksft_print_header();
> > +	test_clone3_supported();
> > +	ksft_set_plan(2);
> > +
> > +	if (uid != 0) {
> > +		ksft_cnt.ksft_xskip = ksft_plan;
> > +		ksft_print_msg("Skipping all tests as non-root\n");
> > +		return ksft_exit_pass();
> > +	}
> > +
> > +	memset(&set_tid, 0, sizeof(set_tid));
> > +
> > +	/* Find the current active PID */
> > +	pid = fork();
> > +	if (pid == 0) {
> > +		ksft_print_msg("Child has PID %d\n", getpid());
> > +		child_exit(EXIT_SUCCESS);
> > +	}
> > +	if (waitpid(pid, &status, 0) < 0)
> > +		ksft_exit_fail_msg("Waiting for child %d failed", pid);
> > +
> > +	/* After the child has finished, its PID should be free. */
> > +	set_tid[0] = pid;
> > +
> > +	if (set_capability())
> > +		ksft_test_result_fail
> > +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> > +	prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
> > +	/* This would fail without CAP_CHECKPOINT_RESTORE */
> > +	setgid(1000);
> > +	setuid(1000);
> > +	set_tid[0] = pid;
> > +	ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
> > +	if (set_capability())
> > +		ksft_test_result_fail
> > +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> > +	/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
> > +	ret |= test_clone3_set_tid(set_tid, 1, 0);
> > +
> > +	return !ret ? ksft_exit_pass() : ksft_exit_fail();
> > +}
> > -- 
> > 2.26.2
> 
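
[Editorial sketch] The `1 << (40 - 32)` expression in set_capability()
quoted above is the usual "word index plus bit mask" split for capability
sets stored as two 32-bit words. The helpers below spell that arithmetic
out; the names are hypothetical, and the value 40 is taken from the
comment in the quoted patch. Note that the quoted test reaches into
data[1] by casting cap_t to a private struct, which appears to rely on
libcap's internal layout.

```c
#include <assert.h>
#include <stdint.h>

/* Value used for CAP_CHECKPOINT_RESTORE in the quoted patch. */
#define DEMO_CAP_CHECKPOINT_RESTORE 40

/* Index of the 32-bit word that holds capability `cap`. */
static unsigned int demo_cap_index(unsigned int cap)
{
	assert(cap < 64);	/* two 32-bit words in this layout */
	return cap >> 5;
}

/* Bit mask for capability `cap` within its 32-bit word. */
static uint32_t demo_cap_mask(unsigned int cap)
{
	assert(cap < 64);
	return UINT32_C(1) << (cap & 31);
}
```

With these, `data[demo_cap_index(40)].effective |= demo_cap_mask(40)`
is equivalent to the open-coded `data[1].effective |= 1 << (40 - 32)`.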


* Re: INFO: task hung in request_key_tag
From: syzbot @ 2020-07-03 11:40 UTC (permalink / raw)
  To: akpm, dhowells, ebiederm, hch, herbert, jarkko.sakkinen, jmorris,
	keyrings, linux-kernel, linux-security-module, mcgrof,
	naresh.kamboju, penguin-kernel, serge, syzkaller-bugs
In-Reply-To: <000000000000961dea05a95c9558@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    aab20039 Add linux-next specific files for 20200701
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=14000a5b100000
kernel config:  https://syzkaller.appspot.com/x/.config?x=739f6fbf326049f4
dashboard link: https://syzkaller.appspot.com/bug?extid=46c77dc7e98c732de754
compiler:       gcc (GCC) 10.1.0-syz 20200507
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=127e085b100000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=107ffaa7100000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+46c77dc7e98c732de754@syzkaller.appspotmail.com

INFO: task syz-executor067:6800 blocked for more than 143 seconds.
      Not tainted 5.8.0-rc3-next-20200701-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor067 D27632  6800   6794 0x00004000
Call Trace:
 context_switch kernel/sched/core.c:3445 [inline]
 __schedule+0x8b4/0x1e80 kernel/sched/core.c:4169
 schedule+0xd0/0x2a0 kernel/sched/core.c:4244
 bit_wait+0x12/0xa0 kernel/sched/wait_bit.c:199
 __wait_on_bit+0x60/0x190 kernel/sched/wait_bit.c:49
 out_of_line_wait_on_bit+0xd5/0x110 kernel/sched/wait_bit.c:64
 wait_on_bit include/linux/wait_bit.h:76 [inline]
 wait_for_key_construction+0x10b/0x140 security/keys/request_key.c:664
 request_key_tag+0x7a/0xb0 security/keys/request_key.c:705
 dns_query+0x257/0x6c3 net/dns_resolver/dns_query.c:128
 ceph_dns_resolve_name net/ceph/messenger.c:1887 [inline]
 ceph_parse_server_name net/ceph/messenger.c:1922 [inline]
 ceph_parse_ips+0x77f/0x8c0 net/ceph/messenger.c:1949
 ceph_parse_mon_ips+0x59/0xc0 net/ceph/ceph_common.c:411
 ceph_parse_source fs/ceph/super.c:271 [inline]
 ceph_parse_mount_param+0x1239/0x17e0 fs/ceph/super.c:322
 vfs_parse_fs_param fs/fs_context.c:117 [inline]
 vfs_parse_fs_param+0x203/0x550 fs/fs_context.c:98
 vfs_parse_fs_string+0xe6/0x150 fs/fs_context.c:161
 do_new_mount fs/namespace.c:2905 [inline]
 do_mount+0x1222/0x1df0 fs/namespace.c:3237
 __do_sys_mount fs/namespace.c:3447 [inline]
 __se_sys_mount fs/namespace.c:3424 [inline]
 __x64_sys_mount+0x18f/0x230 fs/namespace.c:3424
 do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:359
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x4401b9
Code: Bad RIP value.
RSP: 002b:00007fff4cd307c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0030656c69662f2e RCX: 00000000004401b9
RDX: 0000000020000040 RSI: 0000000020000080 RDI: 00000000200002c0
RBP: 00000000006ca018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a40
R13: 0000000000401ad0 R14: 0000000000000000 R15: 0000000000000000

Showing all locks held in the system:
1 lock held by khungtaskd/1146:
 #0: ffffffff89bc3000 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:5779
1 lock held by in:imklog/6692:
 #0: ffff88809a086af0 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:928

=============================================

NMI backtrace for cpu 0
CPU: 0 PID: 1146 Comm: khungtaskd Not tainted 5.8.0-rc3-next-20200701-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 nmi_cpu_backtrace.cold+0x70/0xb1 lib/nmi_backtrace.c:101
 nmi_trigger_cpumask_backtrace+0x1b3/0x223 lib/nmi_backtrace.c:62
 trigger_all_cpu_backtrace include/linux/nmi.h:147 [inline]
 check_hung_uninterruptible_tasks kernel/hung_task.c:253 [inline]
 watchdog+0xd89/0xf30 kernel/hung_task.c:339
 kthread+0x3b5/0x4a0 kernel/kthread.c:292
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 3852 Comm: systemd-journal Not tainted 5.8.0-rc3-next-20200701-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:pmd_none arch/x86/include/asm/pgtable.h:825 [inline]
RIP: 0010:pmd_none_or_trans_huge_or_clear_bad include/linux/pgtable.h:1198 [inline]
RIP: 0010:pmd_trans_unstable include/linux/pgtable.h:1223 [inline]
RIP: 0010:pmd_devmap_trans_unstable mm/memory.c:3496 [inline]
RIP: 0010:pmd_devmap_trans_unstable mm/memory.c:3494 [inline]
RIP: 0010:handle_pte_fault mm/memory.c:4198 [inline]
RIP: 0010:__handle_mm_fault mm/memory.c:4360 [inline]
RIP: 0010:handle_mm_fault+0x1da0/0x43f0 mm/memory.c:4397
Code: ff e9 9d f3 ff ff e8 0f 93 cf ff 48 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 b9 20 00 00 4c 8b 65 00 <4c> 89 e3 31 ff 48 83 e3 9f 48 89 de e8 7f 8f cf ff 48 85 db 0f 84
RSP: 0000:ffffc900015d7d70 EFLAGS: 00000246
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff81a44fec
RDX: 1ffff11012ad8f4a RSI: ffffffff81a45861 RDI: 0000000000000007
RBP: ffff8880956c7a50 R08: 0000000000000000 R09: ffffffff8aaeb0e7
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000091cd0067
R13: 0000000000000000 R14: 0000000000000040 R15: 0000000000000001
FS:  00007fdaac0718c0(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdaa9411000 CR3: 0000000094cdd000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 do_user_addr_fault+0x5a2/0xd00 arch/x86/mm/fault.c:1295
 handle_page_fault arch/x86/mm/fault.c:1365 [inline]
 exc_page_fault+0xab/0x170 arch/x86/mm/fault.c:1418
 asm_exc_page_fault+0x1e/0x30 arch/x86/include/asm/idtentry.h:565
RIP: 0033:0x7fdaab37b124
Code: Bad RIP value.
RSP: 002b:00007ffcfb2ef578 EFLAGS: 00010202
RAX: 00007fdaa9410fe0 RBX: 0000000000000090 RCX: 000000000021ffa0
RDX: 0000000000000090 RSI: 00007ffcfb2ef620 RDI: 00007fdaa9410fe0
RBP: 0000000000000000 R08: 0000000000220070 R09: 00007ffcfb2f26b0
R10: 000000000001b706 R11: 00007fdaa9410fa0 R12: 00005652b21dbe80
R13: 715c95999b8098ed R14: 00007ffcfb2ef810 R15: 00007ffcfb2ef620


* Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
From: Tetsuo Handa @ 2020-07-03 13:19 UTC (permalink / raw)
  To: Eric W. Biederman, Al Viro, Casey Schaufler
  Cc: Alexei Starovoitov, linux-kernel, David Miller,
	Greg Kroah-Hartman, Kees Cook, Andrew Morton, Alexei Starovoitov,
	bpf, linux-fsdevel, Daniel Borkmann, Jakub Kicinski,
	Masahiro Yamada, Gary Lin, Bruno Meneguele, LSM List,
	Luis Chamberlain, Linus Torvalds
In-Reply-To: <87h7upucqi.fsf@x220.int.ebiederm.org>

On 2020/07/02 22:08, Eric W. Biederman wrote:
>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>> called from both kernel thread and from process context (e.g. init_module()
>> syscall by /sbin/insmod )) ?
> 
> And __fput_sync needs to be even more careful.
> umd_load_blob is called in these changes without any locks held.

But where is the guarantee that a thread which called flush_delayed_fput() waits for
the completion of processing _all_ "struct file" linked into delayed_fput_list ?
What if some other thread or delayed_fput_work (scheduled by fput_many()) called
flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
needs to be processed before execve(), can't it?

Also, I don't know how convoluted the dependency of all "struct file" linked into
delayed_fput_list might be, for there can be "struct file" which will not be a
simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.

On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
there is a guarantee that __fput_sync() waits for the completion of "struct file"
which needs to be flushed before execve(), isn't there?

> 
> We fundamentally AKA in any correct version of this code need to flush
> the file descriptor before we call exec or exec can not open it as
> read-only denying all writes from any other opens.
> 
> The use case of flush_delayed_fput is exactly the same as that used
> when loading the initramfs.

When loading the initramfs, the number of threads is quite few (which
means that the possibility of hitting the race window and convoluted
dependency is small).

But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
flush_delayed_fput() might be called after many threads have already
started running.

On 2020/07/03 1:02, Eric W. Biederman wrote:
>>>> On 2020/06/30 21:29, Eric W. Biederman wrote:
>>>>> Hmm.  The wake up of tgid->wait_pidfd happens just before
>>>>> release_task is called so there is a race.  As it is possible to wake
>>>>> up and then go back to sleep before pid_has_task becomes false.
>>>>
>>>> What is the reason we want to wait until pid_has_task() becomes false?
>>>>
>>>> - wait_event(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID));
>>>> + while (!wait_event_timeout(tgid->wait_pidfd, !pid_has_task(tgid, PIDTYPE_TGID), 1));
>>>
>>> So that it is safe to call bpfilter_umh_cleanup.  The previous code
>>> performed the wait by having a callback in do_exit.
>>
>> But bpfilter_umh_cleanup() does only
>>
>> 	fput(info->pipe_to_umh);
>> 	fput(info->pipe_from_umh);
>> 	put_pid(info->tgid);
>> 	info->tgid = NULL;
>>
>> which is (I think) already safe regardless of the usermode process because
>> bpfilter_umh_cleanup() merely closes one side of two pipes used between
>> two processes and forgets about the usermode process.
> 
> It is not safe.
> 
> Barring bugs there is only one use of shutdown_umh that matters.  The one
> in fini_umh.  The use of the file by the mm must be finished before
> umd_unload_blob.  AKA unmount.  Which completely frees the filesystem.

Do we really need to mount upon umd_load_blob() and unmount upon umd_unload_blob() ?
LSM modules might prefer only one instance of filesystem for umd blobs.

For pathname based LSMs, since that filesystem is not visible from mount tree, only
info->driver_name can be used for distinction. Therefore, one instance of filesystem
with files created with file_open_root(O_CREAT | O_WRONLY | O_EXCL) might be preferable.

For inode based LSMs, reusing one instance of filesystem created upon early boot might
be convenient for labeling.

Also, we might want a dedicated filesystem (say, "umdfs") instead of regular tmpfs in
order to implement protections without labeling files. Then, we might also be able to
implement minimal protections without LSMs.

^ permalink raw reply

* Re: linux-next: umh: fix processed error when UMH_WAIT_PROC is used seems to break linux bridge on s390x (bisected)
From: Luis Chamberlain @ 2020-07-03 13:28 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: David Howells, Christian Borntraeger, Christoph Hellwig,
	Eric W. Biederman, ast, axboe, bfields, bridge, chainsaw,
	christian.brauner, chuck.lever, davem, gregkh, jarkko.sakkinen,
	jmorris, josh, keescook, keyrings, kuba, lars.ellenberg,
	linux-fsdevel, linux-kernel, linux-nfs, linux-security-module,
	nikolay, philipp.reisner, ravenexp, roopa, serge, slyfox, viro,
	yangtiezhu, netdev, markward, linux-s390
In-Reply-To: <d8a74a06-de97-54ae-de03-0d955e82f62b@i-love.sakura.ne.jp>

On Fri, Jul 03, 2020 at 09:52:01AM +0900, Tetsuo Handa wrote:
> On 2020/07/03 4:46, Luis Chamberlain wrote:
> > The alternative to making a compromise is using generic wrappers for
> > things which make sense and letting the callers use those.
> 
> I suggest just introducing KWIFEXITED()/KWEXITSTATUS()/KWIFSIGNALED()/KWTERMSIG()
> macros and fixing the callers, for some callers are not aware of possibility of
> KWIFSIGNALED() case.

OK so we open code all uses. Do that in a next iteration.

  Luis
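
For reference, a minimal sketch of what such status-decoding macros could
look like, assuming the helper's status uses the same encoding as wait(2)
on Linux. The macro bodies and the umh_status_to_result() helper are
hypothetical illustrations, not the definitions from any posted patch.

```c
#include <errno.h>

/*
 * Hypothetical definitions, assuming a wait(2)-style status word:
 * low 7 bits carry the terminating signal, bits 8..15 the exit code.
 */
#define KWTERMSIG(status)	((status) & 0x7f)
#define KWEXITSTATUS(status)	(((status) & 0xff00) >> 8)
#define KWIFEXITED(status)	(KWTERMSIG(status) == 0)
#define KWIFSIGNALED(status)	(KWTERMSIG(status) != 0 && \
				 KWTERMSIG(status) != 0x7f)

/*
 * Hypothetical caller-side helper: check the signaled case first, which
 * is the case some callers are said to overlook, then the exit code.
 */
static int umh_status_to_result(int status)
{
	if (KWIFSIGNALED(status))
		return -EINTR;			/* helper was killed by a signal */
	if (KWIFEXITED(status))
		return KWEXITSTATUS(status);	/* 0 on success, else exit code */
	return -EINVAL;				/* neither exited nor signaled */
}
```

A caller that only compares the raw status against zero cannot tell an
exit code of 1 (raw value 0x100) from death by SIGKILL (raw value 9); the
helper above keeps the two apart.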


^ permalink raw reply

* Re: [v2 PATCH] crypto: af_alg - Fix regression on empty requests
From: Luis Chamberlain @ 2020-07-03 13:35 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Naresh Kamboju, Eric Biggers, LTP List, open list,
	linux-security-module, keyrings, lkft-triage,
	Linux Crypto Mailing List, Jan Stancek, chrubis, Serge E. Hallyn,
	James Morris, Jarkko Sakkinen, David Howells, David S. Miller,
	Sachin Sant, Linux Next Mailing List, linuxppc-dev, linux- stable
In-Reply-To: <20200702033221.GA19367@gondor.apana.org.au>

On Thu, Jul 02, 2020 at 01:32:21PM +1000, Herbert Xu wrote:
> On Tue, Jun 30, 2020 at 02:18:11PM +0530, Naresh Kamboju wrote:
> > 
> > Since we are on this subject,
> > LTP af_alg02  test case fails on stable 4.9 and stable 4.4
> > This is not a regression because the test case has been failing from
> > the beginning.
> > 
> > Is this test case expected to fail on stable 4.9 and 4.4 ?
> > or any chance to fix this on these older branches ?
> > 
> > Test output:
> > af_alg02.c:52: BROK: Timed out while reading from request socket.
> > 
> > ref:
> > https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.228-191-g082e807235d7/testrun/2884917/suite/ltp-crypto-tests/test/af_alg02/history/
> > https://qa-reports.linaro.org/lkft/linux-stable-rc-4.9-oe/build/v4.9.228-191-g082e807235d7/testrun/2884606/suite/ltp-crypto-tests/test/af_alg02/log
> 
> Actually this test really is broken.

FWIW the patch "umh: fix processed error when UMH_WAIT_PROC is used" was
dropped from linux-next for now as it was missing checking for signals.
I'll be open coding all checks for each UMH_WAIT_PROC caller next. It's
not clear if this was the issue with this test case, but figured I'd let
you know.

  Luis

^ permalink raw reply

* Re: [PATCH v2 09/11] ima: Move validation of the keyrings conditional into ima_validate_rule()
From: Mimi Zohar @ 2020-07-03 14:15 UTC (permalink / raw)
  To: Tyler Hicks
  Cc: Dmitry Kasatkin, James Morris, Serge E . Hallyn,
	Lakshmi Ramasubramanian, Prakhar Srivastava, linux-kernel,
	linux-integrity, linux-security-module
In-Reply-To: <20200702221656.GH4694@sequoia>

On Thu, 2020-07-02 at 17:16 -0500, Tyler Hicks wrote:
> On 2020-06-30 19:07:29, Mimi Zohar wrote:
> > On Fri, 2020-06-26 at 17:38 -0500, Tyler Hicks wrote:
> > > Use ima_validate_rule() to ensure that the combination of a hook
> > > function and the keyrings conditional is valid and that the keyrings
> > > conditional is not specified without an explicit KEY_CHECK func
> > > conditional. This is a code cleanup and has no user-facing change.
> > > 
> > > Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
> > > ---
> > > 
> > > * v2
> > >   - Allowed IMA_DIGSIG_REQUIRED, IMA_PERMIT_DIRECTIO,
> > >     IMA_MODSIG_ALLOWED, and IMA_CHECK_BLACKLIST conditionals to be
> > >     present in the rule entry flags for non-buffer hook functions.
> > > 
> > >  security/integrity/ima/ima_policy.c | 13 +++++++++++--
> > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
> > > index 8cdca2399d59..43d49ad958fb 100644
> > > --- a/security/integrity/ima/ima_policy.c
> > > +++ b/security/integrity/ima/ima_policy.c
> > > @@ -1000,6 +1000,15 @@ static bool ima_validate_rule(struct ima_rule_entry *entry)
> > >  		case KEXEC_KERNEL_CHECK:
> > >  		case KEXEC_INITRAMFS_CHECK:
> > >  		case POLICY_CHECK:
> > > +			if (entry->flags & ~(IMA_FUNC | IMA_MASK | IMA_FSMAGIC |
> > > +					     IMA_UID | IMA_FOWNER | IMA_FSUUID |
> > > +					     IMA_INMASK | IMA_EUID | IMA_PCR |
> > > +					     IMA_FSNAME | IMA_DIGSIG_REQUIRED |
> > > +					     IMA_PERMIT_DIRECTIO |
> > > +					     IMA_MODSIG_ALLOWED |
> > > +					     IMA_CHECK_BLACKLIST))
> > 
> > Other than KEYRINGS, this patch should continue to behave the same.
> > > However, this list gives the impression that all of these flags are
> > > permitted on all of the above hook functions, which isn't true.
> > 
> > For example, both IMA_MODSIG_ALLOWED & IMA_CHECK_BLACKLIST are limited
> > to appended signatures, meaning KERNEL_CHECK and KEXEC_KERNEL_CHECK.
> 
> Just to clarify, are both IMA_MODSIG_ALLOWED and IMA_CHECK_BLACKLIST
> limited to KEXEC_KERNEL_CHECK, KEXEC_INITRAMFS_CHECK, and MODULE_CHECK?
> That's what ima_hook_supports_modsig() suggests.

Theoretically that is true, but I have no idea how you would append a
signature to the kexec boot command line.  The only users of appended
signatures are currently kernel modules and the kexec'ed kernel image.

> 
> >  Both should only be allowed on APPRAISE action rules.
> 
> For completeness, it looks like DONT_APPRAISE should not be allowed.

Good point.  

> 
> > IMA_PCR should be limited to MEASURE action rules.
> 
> It looks like DONT_MEASURE should not be allowed.

The TPM PCR isn't a file attribute.

> 
> > IMA_DIGSIG_REQUIRED should be limited to APPRAISE action rules.
> 
> It looks like DONT_APPRAISE should not be allowed.

Right, in all of these cases the DONT_XXXX isn't applicable.

Mimi
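
The per-action restrictions discussed above could be checked roughly as
follows. This is only a sketch: the SK_* names and bit values are
hypothetical stand-ins for the real IMA_* flags and actions, and the
function is not part of ima_validate_rule().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the IMA rule actions. */
enum sk_action {
	SK_MEASURE,
	SK_DONT_MEASURE,
	SK_APPRAISE,
	SK_DONT_APPRAISE,
};

/* Hypothetical bit values; the real IMA_* flags are defined in the kernel. */
#define SK_PCR			0x01u
#define SK_DIGSIG_REQUIRED	0x02u
#define SK_MODSIG_ALLOWED	0x04u
#define SK_CHECK_BLACKLIST	0x08u
#define SK_UID			0x10u	/* example of a flag with no action limit */

/* Flags whose use depends on the rule's action. */
#define SK_ACTION_LIMITED	(SK_PCR | SK_DIGSIG_REQUIRED | \
				 SK_MODSIG_ALLOWED | SK_CHECK_BLACKLIST)

static bool sk_action_flags_valid(enum sk_action action, unsigned int flags)
{
	unsigned int allowed;

	switch (action) {
	case SK_MEASURE:
		allowed = SK_PCR;
		break;
	case SK_APPRAISE:
		allowed = SK_DIGSIG_REQUIRED | SK_MODSIG_ALLOWED |
			  SK_CHECK_BLACKLIST;
		break;
	default:
		/* DONT_MEASURE, DONT_APPRAISE: none of the limited flags apply */
		allowed = 0;
		break;
	}

	/* Only the action-limited flags are checked; others pass through. */
	return ((flags & SK_ACTION_LIMITED) & ~allowed) == 0;
}
```

Flags outside SK_ACTION_LIMITED are deliberately ignored here, since the
discussion only restricts these four by action; per-hook limits such as
modsig being tied to specific hook functions would need a separate check.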

^ permalink raw reply

* [PATCH ghak96 v3] audit: issue CWD record to accompany LSM_AUDIT_DATA_* records
From: Richard Guy Briggs @ 2020-07-03 16:56 UTC (permalink / raw)
  To: Linux-Audit Mailing List, LKML, Linux Security Module list
  Cc: Paul Moore, eparis, john.johansen, Richard Guy Briggs

The LSM_AUDIT_DATA_* records for PATH, FILE, IOCTL_OP, DENTRY and INODE
are incomplete without the task context of the AUDIT Current Working
Directory record.  Add it.

This record addition can't use audit_dummy_context to determine whether
or not to store the record information since the LSM_AUDIT_DATA_*
records are initiated by various LSMs independent of any audit rules.
context->in_syscall is used to determine if it was called in user
context like audit_getname.

Please see the upstream issue
https://github.com/linux-audit/audit-kernel/issues/96

Adapted from Vladis Dronov's v2 patch.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
Passes audit-testsuite.

Changelog:
v3
- adapt and refactor __audit_getname, don't key on dummy

v2
2020-04-02 vdronov https://www.redhat.com/archives/linux-audit/2020-April/msg00004.html
- convert to standalone CWD record

v1:
2020-03-24 vdronov https://github.com/nefigtut/audit-kernel/commit/df0b55b7ab84e1c9faa588b08e547e604bf25c87
- add cwd= field to LSM record

 include/linux/audit.h |  9 ++++++++-
 kernel/auditsc.c      | 17 +++++++++++++++--
 security/lsm_audit.c  |  5 +++++
 3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 03c4035a532b..bb850d588e1c 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -292,7 +292,7 @@ extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
 extern void __audit_syscall_exit(int ret_success, long ret_value);
 extern struct filename *__audit_reusename(const __user char *uptr);
 extern void __audit_getname(struct filename *name);
-
+extern void __audit_getcwd(void);
 extern void __audit_inode(struct filename *name, const struct dentry *dentry,
 				unsigned int flags);
 extern void __audit_file(const struct file *);
@@ -351,6 +351,11 @@ static inline void audit_getname(struct filename *name)
 	if (unlikely(!audit_dummy_context()))
 		__audit_getname(name);
 }
+static inline void audit_getcwd(void)
+{
+	if (unlikely(audit_context()))
+		__audit_getcwd();
+}
 static inline void audit_inode(struct filename *name,
 				const struct dentry *dentry,
 				unsigned int aflags) {
@@ -579,6 +584,8 @@ static inline struct filename *audit_reusename(const __user char *name)
 }
 static inline void audit_getname(struct filename *name)
 { }
+static inline void audit_getcwd(void)
+{ }
 static inline void audit_inode(struct filename *name,
 				const struct dentry *dentry,
 				unsigned int aflags)
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 3a9100e95fda..934ab5b8c1c5 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1891,6 +1891,20 @@ struct filename *
 	return NULL;
 }
 
+inline void _audit_getcwd(struct audit_context *context)
+{
+	if (!context->pwd.dentry)
+		get_fs_pwd(current->fs, &context->pwd);
+}
+
+void __audit_getcwd(void)
+{
+	struct audit_context *context = audit_context();
+
+	if (context->in_syscall)
+		_audit_getcwd(context);
+}
+
 /**
  * __audit_getname - add a name to the list
  * @name: name to add
@@ -1915,8 +1929,7 @@ void __audit_getname(struct filename *name)
 	name->aname = n;
 	name->refcnt++;
 
-	if (!context->pwd.dentry)
-		get_fs_pwd(current->fs, &context->pwd);
+	_audit_getcwd(context);
 }
 
 static inline int audit_copy_fcaps(struct audit_names *name,
diff --git a/security/lsm_audit.c b/security/lsm_audit.c
index 2d2bf49016f4..7c555621c2bd 100644
--- a/security/lsm_audit.c
+++ b/security/lsm_audit.c
@@ -241,6 +241,7 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 			audit_log_untrustedstring(ab, inode->i_sb->s_id);
 			audit_log_format(ab, " ino=%lu", inode->i_ino);
 		}
+		audit_getcwd();
 		break;
 	}
 	case LSM_AUDIT_DATA_FILE: {
@@ -254,6 +255,7 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 			audit_log_untrustedstring(ab, inode->i_sb->s_id);
 			audit_log_format(ab, " ino=%lu", inode->i_ino);
 		}
+		audit_getcwd();
 		break;
 	}
 	case LSM_AUDIT_DATA_IOCTL_OP: {
@@ -269,6 +271,7 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 		}
 
 		audit_log_format(ab, " ioctlcmd=0x%hx", a->u.op->cmd);
+		audit_getcwd();
 		break;
 	}
 	case LSM_AUDIT_DATA_DENTRY: {
@@ -283,6 +286,7 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 			audit_log_untrustedstring(ab, inode->i_sb->s_id);
 			audit_log_format(ab, " ino=%lu", inode->i_ino);
 		}
+		audit_getcwd();
 		break;
 	}
 	case LSM_AUDIT_DATA_INODE: {
@@ -300,6 +304,7 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 		audit_log_format(ab, " dev=");
 		audit_log_untrustedstring(ab, inode->i_sb->s_id);
 		audit_log_format(ab, " ino=%lu", inode->i_ino);
+		audit_getcwd();
 		break;
 	}
 	case LSM_AUDIT_DATA_TASK: {
-- 
1.8.3.1
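
The capture logic in the patch above can be modelled in user space as a
small sketch. The struct and function names below are hypothetical; a
plain string pointer stands in for the kernel's struct path, and the NULL
check on the context models the check done by the audit_getcwd() wrapper.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct audit_context. */
struct sk_audit_ctx {
	bool in_syscall;	/* set while a syscall is being audited */
	const char *pwd;	/* NULL until the CWD has been captured */
};

/* Mirrors _audit_getcwd(): capture the CWD once, keep the first value. */
static void sk_capture_cwd(struct sk_audit_ctx *ctx, const char *cwd)
{
	if (!ctx->pwd)
		ctx->pwd = cwd;
}

/* Mirrors audit_getcwd() plus __audit_getcwd(): only act in syscall context. */
static void sk_getcwd(struct sk_audit_ctx *ctx, const char *cwd)
{
	if (ctx && ctx->in_syscall)
		sk_capture_cwd(ctx, cwd);
}
```

Because the capture is idempotent, calling it from every LSM_AUDIT_DATA_*
case that involves a path costs at most one capture per audited syscall in
this model.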


^ permalink raw reply related

* [PATCH ghak84 v2] audit: purge audit_log_string from the intra-kernel audit API
From: Richard Guy Briggs @ 2020-07-03 17:01 UTC (permalink / raw)
  To: Linux-Audit Mailing List, LKML, Linux Security Module list
  Cc: Paul Moore, eparis, john.johansen, Richard Guy Briggs

audit_log_string() was intended to be an internal audit function and
since there are only two internal uses, remove them.  Purge all external
uses of it by restructuring code to use an existing audit_log_format()
or using audit_log_format().

Please see the upstream issue
https://github.com/linux-audit/audit-kernel/issues/84

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
Passes audit-testsuite.

Changelog:
v2
- restructure to piggyback on existing audit_log_format() calls, checking quoting needs for each.

v1 Vlad Dronov
- https://github.com/nefigtut/audit-kernel/commit/dbbcba46335a002f44b05874153a85b9cc18aebf

 include/linux/audit.h     |  5 -----
 kernel/audit.c            |  4 ++--
 security/apparmor/audit.c | 10 ++++------
 security/apparmor/file.c  | 25 +++++++------------------
 security/apparmor/ipc.c   | 44 +++++++++++++++++++++-----------------------
 security/apparmor/net.c   | 14 ++++++++------
 security/lsm_audit.c      |  4 ++--
 7 files changed, 44 insertions(+), 62 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 604ede630580..5ad7cd65d76f 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -695,9 +695,4 @@ static inline bool audit_loginuid_set(struct task_struct *tsk)
 	return uid_valid(audit_get_loginuid(tsk));
 }
 
-static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
-{
-	audit_log_n_string(ab, buf, strlen(buf));
-}
-
 #endif
diff --git a/kernel/audit.c b/kernel/audit.c
index 8c201f414226..a2f3e34aa724 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -2080,13 +2080,13 @@ void audit_log_d_path(struct audit_buffer *ab, const char *prefix,
 	/* We will allow 11 spaces for ' (deleted)' to be appended */
 	pathname = kmalloc(PATH_MAX+11, ab->gfp_mask);
 	if (!pathname) {
-		audit_log_string(ab, "<no_memory>");
+		audit_log_format(ab, "\"<no_memory>\"");
 		return;
 	}
 	p = d_path(path, pathname, PATH_MAX+11);
 	if (IS_ERR(p)) { /* Should never happen since we send PATH_MAX */
 		/* FIXME: can we save some information here? */
-		audit_log_string(ab, "<too_long>");
+		audit_log_format(ab, "\"<too_long>\"");
 	} else
 		audit_log_untrustedstring(ab, p);
 	kfree(pathname);
diff --git a/security/apparmor/audit.c b/security/apparmor/audit.c
index 597732503815..335b5b8d300b 100644
--- a/security/apparmor/audit.c
+++ b/security/apparmor/audit.c
@@ -57,18 +57,16 @@ static void audit_pre(struct audit_buffer *ab, void *ca)
 	struct common_audit_data *sa = ca;
 
 	if (aa_g_audit_header) {
-		audit_log_format(ab, "apparmor=");
-		audit_log_string(ab, aa_audit_type[aad(sa)->type]);
+		audit_log_format(ab, "apparmor=%s",
+				 aa_audit_type[aad(sa)->type]);
 	}
 
 	if (aad(sa)->op) {
-		audit_log_format(ab, " operation=");
-		audit_log_string(ab, aad(sa)->op);
+		audit_log_format(ab, " operation=%s", aad(sa)->op);
 	}
 
 	if (aad(sa)->info) {
-		audit_log_format(ab, " info=");
-		audit_log_string(ab, aad(sa)->info);
+		audit_log_format(ab, " info=\"%s\"", aad(sa)->info);
 		if (aad(sa)->error)
 			audit_log_format(ab, " error=%d", aad(sa)->error);
 	}
diff --git a/security/apparmor/file.c b/security/apparmor/file.c
index 9a2d14b7c9f8..70f27124d051 100644
--- a/security/apparmor/file.c
+++ b/security/apparmor/file.c
@@ -35,20 +35,6 @@ static u32 map_mask_to_chr_mask(u32 mask)
 }
 
 /**
- * audit_file_mask - convert mask to permission string
- * @buffer: buffer to write string to (NOT NULL)
- * @mask: permission mask to convert
- */
-static void audit_file_mask(struct audit_buffer *ab, u32 mask)
-{
-	char str[10];
-
-	aa_perm_mask_to_str(str, sizeof(str), aa_file_perm_chrs,
-			    map_mask_to_chr_mask(mask));
-	audit_log_string(ab, str);
-}
-
-/**
  * file_audit_cb - call back for file specific audit fields
  * @ab: audit_buffer  (NOT NULL)
  * @va: audit struct to audit values of  (NOT NULL)
@@ -57,14 +43,17 @@ static void file_audit_cb(struct audit_buffer *ab, void *va)
 {
 	struct common_audit_data *sa = va;
 	kuid_t fsuid = current_fsuid();
+	char str[10];
 
 	if (aad(sa)->request & AA_AUDIT_FILE_MASK) {
-		audit_log_format(ab, " requested_mask=");
-		audit_file_mask(ab, aad(sa)->request);
+		aa_perm_mask_to_str(str, sizeof(str), aa_file_perm_chrs,
+				    map_mask_to_chr_mask(aad(sa)->request));
+		audit_log_format(ab, " requested_mask=%s", str);
 	}
 	if (aad(sa)->denied & AA_AUDIT_FILE_MASK) {
-		audit_log_format(ab, " denied_mask=");
-		audit_file_mask(ab, aad(sa)->denied);
+		aa_perm_mask_to_str(str, sizeof(str), aa_file_perm_chrs,
+				    map_mask_to_chr_mask(aad(sa)->denied));
+		audit_log_format(ab, " denied_mask=%s", str);
 	}
 	if (aad(sa)->request & AA_AUDIT_FILE_MASK) {
 		audit_log_format(ab, " fsuid=%d",
diff --git a/security/apparmor/ipc.c b/security/apparmor/ipc.c
index 4ecedffbdd33..18ca807e7872 100644
--- a/security/apparmor/ipc.c
+++ b/security/apparmor/ipc.c
@@ -20,24 +20,21 @@
 
 /**
  * audit_ptrace_mask - convert mask to permission string
- * @buffer: buffer to write string to (NOT NULL)
  * @mask: permission mask to convert
+ *
+ * Returns: pointer to static string
  */
-static void audit_ptrace_mask(struct audit_buffer *ab, u32 mask)
+static const char *audit_ptrace_mask(u32 mask)
 {
 	switch (mask) {
 	case MAY_READ:
-		audit_log_string(ab, "read");
-		break;
+		return "read";
 	case MAY_WRITE:
-		audit_log_string(ab, "trace");
-		break;
+		return "trace";
 	case AA_MAY_BE_READ:
-		audit_log_string(ab, "readby");
-		break;
+		return "readby";
 	case AA_MAY_BE_TRACED:
-		audit_log_string(ab, "tracedby");
-		break;
+		return "tracedby";
 	}
 }
 
@@ -47,12 +44,12 @@ static void audit_ptrace_cb(struct audit_buffer *ab, void *va)
 	struct common_audit_data *sa = va;
 
 	if (aad(sa)->request & AA_PTRACE_PERM_MASK) {
-		audit_log_format(ab, " requested_mask=");
-		audit_ptrace_mask(ab, aad(sa)->request);
+		audit_log_format(ab, " requested_mask=%s",
+				 audit_ptrace_mask(aad(sa)->request));
 
 		if (aad(sa)->denied & AA_PTRACE_PERM_MASK) {
-			audit_log_format(ab, " denied_mask=");
-			audit_ptrace_mask(ab, aad(sa)->denied);
+			audit_log_format(ab, " denied_mask=%s",
+					 audit_ptrace_mask(aad(sa)->denied));
 		}
 	}
 	audit_log_format(ab, " peer=");
@@ -142,16 +139,17 @@ static inline int map_signal_num(int sig)
 }
 
 /**
- * audit_file_mask - convert mask to permission string
- * @buffer: buffer to write string to (NOT NULL)
+ * audit_signal_mask - convert mask to permission string
  * @mask: permission mask to convert
+ *
+ * Returns: pointer to static string
  */
-static void audit_signal_mask(struct audit_buffer *ab, u32 mask)
+static const char *audit_signal_mask(u32 mask)
 {
 	if (mask & MAY_READ)
-		audit_log_string(ab, "receive");
+		return "receive";
 	if (mask & MAY_WRITE)
-		audit_log_string(ab, "send");
+		return "send";
 }
 
 /**
@@ -164,11 +162,11 @@ static void audit_signal_cb(struct audit_buffer *ab, void *va)
 	struct common_audit_data *sa = va;
 
 	if (aad(sa)->request & AA_SIGNAL_PERM_MASK) {
-		audit_log_format(ab, " requested_mask=");
-		audit_signal_mask(ab, aad(sa)->request);
+		audit_log_format(ab, " requested_mask=%s",
+				 audit_signal_mask(aad(sa)->request));
 		if (aad(sa)->denied & AA_SIGNAL_PERM_MASK) {
-			audit_log_format(ab, " denied_mask=");
-			audit_signal_mask(ab, aad(sa)->denied);
+			audit_log_format(ab, " denied_mask=%s",
+					 audit_signal_mask(aad(sa)->denied));
 		}
 	}
 	if (aad(sa)->signal == SIGUNKNOWN)
diff --git a/security/apparmor/net.c b/security/apparmor/net.c
index d8afc39f663a..fa0e85568450 100644
--- a/security/apparmor/net.c
+++ b/security/apparmor/net.c
@@ -72,16 +72,18 @@ void audit_net_cb(struct audit_buffer *ab, void *va)
 {
 	struct common_audit_data *sa = va;
 
-	audit_log_format(ab, " family=");
 	if (address_family_names[sa->u.net->family])
-		audit_log_string(ab, address_family_names[sa->u.net->family]);
+		audit_log_format(ab, " family=\"%s\"",
+				 address_family_names[sa->u.net->family]);
 	else
-		audit_log_format(ab, "\"unknown(%d)\"", sa->u.net->family);
-	audit_log_format(ab, " sock_type=");
+		audit_log_format(ab, " family=\"unknown(%d)\"",
+				 sa->u.net->family);
 	if (sock_type_names[aad(sa)->net.type])
-		audit_log_string(ab, sock_type_names[aad(sa)->net.type]);
+		audit_log_format(ab, " sock_type=\"%s\"",
+				 sock_type_names[aad(sa)->net.type]);
 	else
-		audit_log_format(ab, "\"unknown(%d)\"", aad(sa)->net.type);
+		audit_log_format(ab, " sock_type=\"unknown(%d)\"",
+				 aad(sa)->net.type);
 	audit_log_format(ab, " protocol=%d", aad(sa)->net.protocol);
 
 	if (aad(sa)->request & NET_PERMS_MASK) {
diff --git a/security/lsm_audit.c b/security/lsm_audit.c
index 2d2bf49016f4..221370794d14 100644
--- a/security/lsm_audit.c
+++ b/security/lsm_audit.c
@@ -427,8 +427,8 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 				 a->u.ibendport->port);
 		break;
 	case LSM_AUDIT_DATA_LOCKDOWN:
-		audit_log_format(ab, " lockdown_reason=");
-		audit_log_string(ab, lockdown_reasons[a->u.reason]);
+		audit_log_format(ab, " lockdown_reason=\"%s\"",
+				 lockdown_reasons[a->u.reason]);
 		break;
 	} /* switch (a->type) */
 }
-- 
1.8.3.1
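
A user-space sketch of the "return a static string, format it once"
pattern used in the apparmor/ipc.c hunks above. The SK_* mask values are
hypothetical, and the "unknown" fallback is an assumption added here so
that every path returns a value; the patch as posted has no return after
the switch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical mask values standing in for MAY_READ, AA_MAY_BE_READ, etc. */
#define SK_MAY_WRITE		0x02u
#define SK_MAY_READ		0x04u
#define SK_MAY_BE_READ		0x10u
#define SK_MAY_BE_TRACED	0x20u

/* Return a pointer to a static string instead of writing to the buffer. */
static const char *sk_ptrace_mask(unsigned int mask)
{
	switch (mask) {
	case SK_MAY_READ:
		return "read";
	case SK_MAY_WRITE:
		return "trace";
	case SK_MAY_BE_READ:
		return "readby";
	case SK_MAY_BE_TRACED:
		return "tracedby";
	}
	/* Assumed fallback, not in the posted patch. */
	return "unknown";
}

/* One format call replaces the " requested_mask=" plus string pair. */
static int sk_format_requested(char *buf, size_t len, unsigned int mask)
{
	return snprintf(buf, len, " requested_mask=%s", sk_ptrace_mask(mask));
}
```

With the fallback in place a mask outside the four known values produces
a defined field value rather than undefined behaviour in the "%s"
conversion.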


^ permalink raw reply related

* Re: [PATCH v4 2/3] selftests: add clone3() CAP_CHECKPOINT_RESTORE test
From: Serge E. Hallyn @ 2020-07-03 18:12 UTC (permalink / raw)
  To: Adrian Reber
  Cc: Serge E. Hallyn, Christian Brauner, Eric Biederman,
	Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov, Andrei Vagin,
	Nicolas Viennot, Michał Cłapiński, Kamil Yurtsever,
	Dirk Petersen, Christine Flood, Casey Schaufler, Mike Rapoport,
	Radostin Stoyanov, Cyrill Gorcunov, Stephen Smalley,
	Sargun Dhillon, Arnd Bergmann, linux-security-module,
	linux-kernel, selinux, Eric Paris, Jann Horn, linux-fsdevel
In-Reply-To: <20200703111807.GC243637@dcbz.redhat.com>

On Fri, Jul 03, 2020 at 01:18:07PM +0200, Adrian Reber wrote:
> On Thu, Jul 02, 2020 at 03:53:05PM -0500, Serge E. Hallyn wrote:
> > On Wed, Jul 01, 2020 at 08:49:05AM +0200, Adrian Reber wrote:
> > > This adds a test that changes its UID, uses capabilities to
> > > get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
> > > create a process with a given PID as non-root.
> > 
> > Seems worth also verifying that it fails if you have no capabilities.
> > I don't see that in the existing clone3/ test dir.
> 
> Bit confused about what you mean. This test does:
> 
>  * switch UID to 1000
>  * run clone3() with set_tid set and expect EPERM
>  * set CAP_CHECKPOINT_RESTORE capability
>  * run clone3() with set_tid set and expect success
> 
> So it already does what I think you are asking for. Did I misunderstand
> your comment?

Ah, no, I missed that line doing the call with -EPERM.  Thanks!

Acked-by: Serge Hallyn <serge@hallyn.com>


> 		Adrian
> 
> > > Signed-off-by: Adrian Reber <areber@redhat.com>
> > > ---
> > >  tools/testing/selftests/clone3/Makefile       |   4 +-
> > >  .../clone3/clone3_cap_checkpoint_restore.c    | 203 ++++++++++++++++++
> > >  2 files changed, 206 insertions(+), 1 deletion(-)
> > >  create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > > 
> > > diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
> > > index cf976c732906..ef7564cb7abe 100644
> > > --- a/tools/testing/selftests/clone3/Makefile
> > > +++ b/tools/testing/selftests/clone3/Makefile
> > > @@ -1,6 +1,8 @@
> > >  # SPDX-License-Identifier: GPL-2.0
> > >  CFLAGS += -g -I../../../../usr/include/
> > > +LDLIBS += -lcap
> > >  
> > > -TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
> > > +TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
> > > +	clone3_cap_checkpoint_restore
> > >  
> > >  include ../lib.mk
> > > diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > > new file mode 100644
> > > index 000000000000..2cc3d57b91f2
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> > > @@ -0,0 +1,203 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +/*
> > > + * Based on Christian Brauner's clone3() example.
> > > + * These tests are assuming to be running in the host's
> > > + * PID namespace.
> > > + */
> > > +
> > > +/* capabilities related code based on selftests/bpf/test_verifier.c */
> > > +
> > > +#define _GNU_SOURCE
> > > +#include <errno.h>
> > > +#include <linux/types.h>
> > > +#include <linux/sched.h>
> > > +#include <stdio.h>
> > > +#include <stdlib.h>
> > > +#include <stdbool.h>
> > > +#include <sys/capability.h>
> > > +#include <sys/prctl.h>
> > > +#include <sys/syscall.h>
> > > +#include <sys/types.h>
> > > +#include <sys/un.h>
> > > +#include <sys/wait.h>
> > > +#include <unistd.h>
> > > +#include <sched.h>
> > > +
> > > +#include "../kselftest.h"
> > > +#include "clone3_selftests.h"
> > > +
> > > +#ifndef MAX_PID_NS_LEVEL
> > > +#define MAX_PID_NS_LEVEL 32
> > > +#endif
> > > +
> > > +static void child_exit(int ret)
> > > +{
> > > +	fflush(stdout);
> > > +	fflush(stderr);
> > > +	_exit(ret);
> > > +}
> > > +
> > > +static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
> > > +{
> > > +	int status;
> > > +	pid_t pid = -1;
> > > +
> > > +	struct clone_args args = {
> > > +		.exit_signal = SIGCHLD,
> > > +		.set_tid = ptr_to_u64(set_tid),
> > > +		.set_tid_size = set_tid_size,
> > > +	};
> > > +
> > > +	pid = sys_clone3(&args, sizeof(struct clone_args));
> > > +	if (pid < 0) {
> > > +		ksft_print_msg("%s - Failed to create new process\n",
> > > +			       strerror(errno));
> > > +		return -errno;
> > > +	}
> > > +
> > > +	if (pid == 0) {
> > > +		int ret;
> > > +		char tmp = 0;
> > > +
> > > +		ksft_print_msg
> > > +		    ("I am the child, my PID is %d (expected %d)\n",
> > > +		     getpid(), set_tid[0]);
> > > +
> > > +		if (set_tid[0] != getpid())
> > > +			child_exit(EXIT_FAILURE);
> > > +		child_exit(EXIT_SUCCESS);
> > > +	}
> > > +
> > > +	ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> > > +		       getpid(), pid);
> > > +
> > > +	if (waitpid(pid, &status, 0) < 0) {
> > > +		ksft_print_msg("Child returned %s\n", strerror(errno));
> > > +		return -errno;
> > > +	}
> > > +
> > > +	if (!WIFEXITED(status))
> > > +		return -1;
> > > +
> > > +	return WEXITSTATUS(status);
> > > +}
> > > +
> > > +static int test_clone3_set_tid(pid_t * set_tid,
> > > +			       size_t set_tid_size, int expected)
> > > +{
> > > +	int ret;
> > > +
> > > +	ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
> > > +		       getpid(), set_tid[0]);
> > > +	ret = call_clone3_set_tid(set_tid, set_tid_size);
> > > +
> > > +	ksft_print_msg
> > > +	    ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
> > > +	     getpid(), set_tid[0], ret, expected);
> > > +	if (ret != expected) {
> > > +		ksft_test_result_fail
> > > +		    ("[%d] Result (%d) is different than expected (%d)\n",
> > > +		     getpid(), ret, expected);
> > > +		return -1;
> > > +	}
> > > +	ksft_test_result_pass
> > > +	    ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
> > > +	     expected);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +struct libcap {
> > > +	struct __user_cap_header_struct hdr;
> > > +	struct __user_cap_data_struct data[2];
> > > +};
> > > +
> > > +static int set_capability()
> > > +{
> > > +	cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
> > > +	struct libcap *cap;
> > > +	int ret = -1;
> > > +	cap_t caps;
> > > +
> > > +	caps = cap_get_proc();
> > > +	if (!caps) {
> > > +		perror("cap_get_proc");
> > > +		return -1;
> > > +	}
> > > +
> > > +	/* Drop all capabilities */
> > > +	if (cap_clear(caps)) {
> > > +		perror("cap_clear");
> > > +		goto out;
> > > +	}
> > > +
> > > +	cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
> > > +	cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
> > > +
> > > +	cap = (struct libcap *) caps;
> > > +
> > > +	/* 40 -> CAP_CHECKPOINT_RESTORE */
> > > +	cap->data[1].effective |= 1 << (40 - 32);
> > > +	cap->data[1].permitted |= 1 << (40 - 32);
> > > +
> > > +	if (cap_set_proc(caps)) {
> > > +		perror("cap_set_proc");
> > > +		goto out;
> > > +	}
> > > +	ret = 0;
> > > +out:
> > > +	if (cap_free(caps))
> > > +		perror("cap_free");
> > > +	return ret;
> > > +}
> > > +
> > > +int main(int argc, char *argv[])
> > > +{
> > > +	pid_t pid;
> > > +	int status;
> > > +	int ret = 0;
> > > +	pid_t set_tid[1];
> > > +	uid_t uid = getuid();
> > > +
> > > +	ksft_print_header();
> > > +	test_clone3_supported();
> > > +	ksft_set_plan(2);
> > > +
> > > +	if (uid != 0) {
> > > +		ksft_cnt.ksft_xskip = ksft_plan;
> > > +		ksft_print_msg("Skipping all tests as non-root\n");
> > > +		return ksft_exit_pass();
> > > +	}
> > > +
> > > +	memset(&set_tid, 0, sizeof(set_tid));
> > > +
> > > +	/* Find the current active PID */
> > > +	pid = fork();
> > > +	if (pid == 0) {
> > > +		ksft_print_msg("Child has PID %d\n", getpid());
> > > +		child_exit(EXIT_SUCCESS);
> > > +	}
> > > +	if (waitpid(pid, &status, 0) < 0)
> > > +		ksft_exit_fail_msg("Waiting for child %d failed", pid);
> > > +
> > > +	/* After the child has finished, its PID should be free. */
> > > +	set_tid[0] = pid;
> > > +
> > > +	if (set_capability())
> > > +		ksft_test_result_fail
> > > +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> > > +	prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
> > > +	/* This would fail without CAP_CHECKPOINT_RESTORE */
> > > +	setgid(1000);
> > > +	setuid(1000);
> > > +	set_tid[0] = pid;
> > > +	ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
> > > +	if (set_capability())
> > > +		ksft_test_result_fail
> > > +		    ("Could not set CAP_CHECKPOINT_RESTORE\n");
> > > +	/* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
> > > +	ret |= test_clone3_set_tid(set_tid, 1, 0);
> > > +
> > > +	return !ret ? ksft_exit_pass() : ksft_exit_fail();
> > > +}
> > > -- 
> > > 2.26.2
> > 


* Re: [PATCH v3 13/16] exit: Factor thread_group_exited out of pidfd_poll
From: Alexei Starovoitov @ 2020-07-03 20:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, David Miller, Greg Kroah-Hartman, Tetsuo Handa,
	Kees Cook, Andrew Morton, Alexei Starovoitov, Al Viro, bpf,
	linux-fsdevel, Daniel Borkmann, Jakub Kicinski, Masahiro Yamada,
	Gary Lin, Bruno Meneguele, LSM List, Casey Schaufler,
	Luis Chamberlain, Linus Torvalds, Christian Brauner
In-Reply-To: <20200702164140.4468-13-ebiederm@xmission.com>

On Thu, Jul 02, 2020 at 11:41:37AM -0500, Eric W. Biederman wrote:
> Create an independent helper thread_group_exited report return true
> when all threads have passed exit_notify in do_exit.  AKA all of the
> threads are at least zombies and might be dead or completely gone.
> 
> Create this helper by taking the logic out of pidfd_poll where
> it is already tested, and adding a missing READ_ONCE on
> the read of task->exit_state.
> 
> I will be changing the user mode driver code to use this same logic
> to know when a user mode driver needs to be restarted.
> 
> Place the new helper thread_group_exited in kernel/exit.c and
> EXPORT it so it can be used by modules.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  include/linux/sched/signal.h |  2 ++
>  kernel/exit.c                | 24 ++++++++++++++++++++++++
>  kernel/fork.c                |  6 +-----
>  3 files changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 0ee5e696c5d8..1bad18a1d8ba 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -674,6 +674,8 @@ static inline int thread_group_empty(struct task_struct *p)
>  #define delay_group_leader(p) \
>  		(thread_group_leader(p) && !thread_group_empty(p))
>  
> +extern bool thread_group_exited(struct pid *pid);
> +
>  extern struct sighand_struct *__lock_task_sighand(struct task_struct *task,
>  							unsigned long *flags);
>  
> diff --git a/kernel/exit.c b/kernel/exit.c
> index d3294b611df1..a7f112feb0f6 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -1713,6 +1713,30 @@ COMPAT_SYSCALL_DEFINE5(waitid,
>  }
>  #endif
>  
> +/**
> + * thread_group_exited - check that a thread group has exited
> + * @pid: tgid of thread group to be checked.
> + *
> + * Test if thread group is has exited (all threads are zombies, dead
> + * or completely gone).
> + *
> + * Return: true if the thread group has exited. false otherwise.
> + */
> +bool thread_group_exited(struct pid *pid)
> +{
> +	struct task_struct *task;
> +	bool exited;
> +
> +	rcu_read_lock();
> +	task = pid_task(pid, PIDTYPE_PID);
> +	exited = !task ||
> +		(READ_ONCE(task->exit_state) && thread_group_empty(task));
> +	rcu_read_unlock();
> +
> +	return exited;
> +}

I'm not sure why you think READ_ONCE was missing.
It's different in wait_consider_task() where READ_ONCE is needed because
of multiple checks. Here it's done once.

The rest all looks good to me. Tested with and without bpf_preload patches.
Feel free to create a frozen branch with this set.

btw I'll be offline starting tomorrow for a week.
Will catch up with threads afterwards.


* Re: [PATCH v3 13/16] exit: Factor thread_group_exited out of pidfd_poll
From: Eric W. Biederman @ 2020-07-03 21:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-kernel, David Miller, Greg Kroah-Hartman, Tetsuo Handa,
	Kees Cook, Andrew Morton, Alexei Starovoitov, Al Viro, bpf,
	linux-fsdevel, Daniel Borkmann, Jakub Kicinski, Masahiro Yamada,
	Gary Lin, Bruno Meneguele, LSM List, Casey Schaufler,
	Luis Chamberlain, Linus Torvalds, Christian Brauner
In-Reply-To: <20200703203021.paebx25miovmaxqt@ast-mbp.dhcp.thefacebook.com>

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Jul 02, 2020 at 11:41:37AM -0500, Eric W. Biederman wrote:
>> Create an independent helper thread_group_exited report return true
>> when all threads have passed exit_notify in do_exit.  AKA all of the
>> threads are at least zombies and might be dead or completely gone.
>> 
>> Create this helper by taking the logic out of pidfd_poll where
>> it is already tested, and adding a missing READ_ONCE on
>> the read of task->exit_state.
>> 
>> I will be changing the user mode driver code to use this same logic
>> to know when a user mode driver needs to be restarted.
>> 
>> Place the new helper thread_group_exited in kernel/exit.c and
>> EXPORT it so it can be used by modules.
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  include/linux/sched/signal.h |  2 ++
>>  kernel/exit.c                | 24 ++++++++++++++++++++++++
>>  kernel/fork.c                |  6 +-----
>>  3 files changed, 27 insertions(+), 5 deletions(-)
>> 
>> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>> index 0ee5e696c5d8..1bad18a1d8ba 100644
>> --- a/include/linux/sched/signal.h
>> +++ b/include/linux/sched/signal.h
>> @@ -674,6 +674,8 @@ static inline int thread_group_empty(struct task_struct *p)
>>  #define delay_group_leader(p) \
>>  		(thread_group_leader(p) && !thread_group_empty(p))
>>  
>> +extern bool thread_group_exited(struct pid *pid);
>> +
>>  extern struct sighand_struct *__lock_task_sighand(struct task_struct *task,
>>  							unsigned long *flags);
>>  
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index d3294b611df1..a7f112feb0f6 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -1713,6 +1713,30 @@ COMPAT_SYSCALL_DEFINE5(waitid,
>>  }
>>  #endif
>>  
>> +/**
>> + * thread_group_exited - check that a thread group has exited
>> + * @pid: tgid of thread group to be checked.
>> + *
>> + * Test if thread group is has exited (all threads are zombies, dead
>> + * or completely gone).
>> + *
>> + * Return: true if the thread group has exited. false otherwise.
>> + */
>> +bool thread_group_exited(struct pid *pid)
>> +{
>> +	struct task_struct *task;
>> +	bool exited;
>> +
>> +	rcu_read_lock();
>> +	task = pid_task(pid, PIDTYPE_PID);
>> +	exited = !task ||
>> +		(READ_ONCE(task->exit_state) && thread_group_empty(task));
>> +	rcu_read_unlock();
>> +
>> +	return exited;
>> +}
>
> I'm not sure why you think READ_ONCE was missing.
> It's different in wait_consider_task() where READ_ONCE is needed because
> of multiple checks. Here it's done once.

In practice it probably has no effect on the generated code.  But
READ_ONCE is about telling the compiler not to be clever.  Don't use
tearing loads or stores etc.  When all of the other readers are using
READ_ONCE I just get nervous if we have a case that doesn't.
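
A rough userspace sketch of what that means, assuming a simplified macro
rather than the kernel's real READ_ONCE() definition (which handles more
cases).  READ_ONCE_SKETCH, struct fake_task and group_exited() are names
made up for this illustration, and __typeof__ is a GCC/Clang extension:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified stand-in for READ_ONCE(): the volatile cast asks the
 * compiler for exactly one full-width load of x, so it may not tear,
 * repeat or elide the access.
 */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical miniature of the state the helper inspects. */
struct fake_task {
	int exit_state;		/* non-zero once the task is a zombie or dead */
	bool group_empty;	/* stand-in for thread_group_empty(task) */
};

/* Same shape as the check in thread_group_exited(). */
static bool group_exited(struct fake_task *task)
{
	return !task ||
	       (READ_ONCE_SKETCH(task->exit_state) && task->group_empty);
}
```

With a single read the generated code is most likely identical either
way; the annotation mainly documents that the field is read locklessly.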

> The rest all looks good to me. Tested with and without bpf_preload patches.
> Feel free to create a frozen branch with this set.

Can I have your Tested-by and Acked-by?

> btw I'll be offline starting tomorrow for a week.
> Will catch up with threads afterwards.

Eric



* [PATCH ghak84 v3] audit: purge audit_log_string from the intra-kernel audit API
From: Richard Guy Briggs @ 2020-07-03 21:49 UTC (permalink / raw)
  To: Linux-Audit Mailing List, LKML, Linux Security Module list
  Cc: Paul Moore, eparis, john.johansen, Richard Guy Briggs

audit_log_string() was intended to be an internal audit function and
since there are only two internal uses, remove them.  Purge all external
uses of it by restructuring code to piggyback on an existing
audit_log_format() call or by adding a new audit_log_format() call.
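
A minimal userspace sketch of the conversion pattern, assuming that
audit_log_string() emits its argument wrapped in double quotes (which is
why some converted call sites spell the quotes out in the format string).
struct buf, buf_format() and buf_string() are hypothetical stand-ins for
the audit buffer API, not kernel code:

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for struct audit_buffer. */
struct buf {
	char data[128];
	size_t len;
};

/* Stand-in for audit_log_format(): append printf-style output. */
static void buf_format(struct buf *b, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(b->data + b->len, sizeof(b->data) - b->len, fmt, ap);
	va_end(ap);
	if (n > 0) {
		size_t room = sizeof(b->data) - b->len - 1;

		/* On truncation only `room` bytes were actually stored. */
		b->len += (size_t)n < room ? (size_t)n : room;
	}
}

/* Stand-in for the removed audit_log_string(): append s in quotes. */
static void buf_string(struct buf *b, const char *s)
{
	buf_format(b, "\"%s\"", s);
}

/* Old style: field name and value logged by two separate calls. */
static void log_info_old(struct buf *b, const char *info)
{
	buf_format(b, " info=");
	buf_string(b, info);
}

/* New style: one call, with the quoting explicit in the format. */
static void log_info_new(struct buf *b, const char *info)
{
	buf_format(b, " info=\"%s\"", info);
}
```

Both helpers are meant to produce the same bytes for the same input, so
a conversion of this shape leaves the record text unchanged.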

Please see the upstream issue
https://github.com/linux-audit/audit-kernel/issues/84

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
---
Passes audit-testsuite.

Changelog:
v3
- fix two warnings: non-void function does not return a value in all control paths
	Reported-by: kernel test robot <lkp@intel.com>

v2
- restructure to piggyback on existing audit_log_format() calls, checking quoting needs for each.

v1 Vlad Dronov
- https://github.com/nefigtut/audit-kernel/commit/dbbcba46335a002f44b05874153a85b9cc18aebf

 include/linux/audit.h     |  5 -----
 kernel/audit.c            |  4 ++--
 security/apparmor/audit.c | 10 ++++------
 security/apparmor/file.c  | 25 +++++++------------------
 security/apparmor/ipc.c   | 46 +++++++++++++++++++++++-----------------------
 security/apparmor/net.c   | 14 ++++++++------
 security/lsm_audit.c      |  4 ++--
 7 files changed, 46 insertions(+), 62 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index 604ede630580..5ad7cd65d76f 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -695,9 +695,4 @@ static inline bool audit_loginuid_set(struct task_struct *tsk)
 	return uid_valid(audit_get_loginuid(tsk));
 }
 
-static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
-{
-	audit_log_n_string(ab, buf, strlen(buf));
-}
-
 #endif
diff --git a/kernel/audit.c b/kernel/audit.c
index 8c201f414226..a2f3e34aa724 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -2080,13 +2080,13 @@ void audit_log_d_path(struct audit_buffer *ab, const char *prefix,
 	/* We will allow 11 spaces for ' (deleted)' to be appended */
 	pathname = kmalloc(PATH_MAX+11, ab->gfp_mask);
 	if (!pathname) {
-		audit_log_string(ab, "<no_memory>");
+		audit_log_format(ab, "\"<no_memory>\"");
 		return;
 	}
 	p = d_path(path, pathname, PATH_MAX+11);
 	if (IS_ERR(p)) { /* Should never happen since we send PATH_MAX */
 		/* FIXME: can we save some information here? */
-		audit_log_string(ab, "<too_long>");
+		audit_log_format(ab, "\"<too_long>\"");
 	} else
 		audit_log_untrustedstring(ab, p);
 	kfree(pathname);
diff --git a/security/apparmor/audit.c b/security/apparmor/audit.c
index 597732503815..335b5b8d300b 100644
--- a/security/apparmor/audit.c
+++ b/security/apparmor/audit.c
@@ -57,18 +57,16 @@ static void audit_pre(struct audit_buffer *ab, void *ca)
 	struct common_audit_data *sa = ca;
 
 	if (aa_g_audit_header) {
-		audit_log_format(ab, "apparmor=");
-		audit_log_string(ab, aa_audit_type[aad(sa)->type]);
+		audit_log_format(ab, "apparmor=%s",
+				 aa_audit_type[aad(sa)->type]);
 	}
 
 	if (aad(sa)->op) {
-		audit_log_format(ab, " operation=");
-		audit_log_string(ab, aad(sa)->op);
+		audit_log_format(ab, " operation=%s", aad(sa)->op);
 	}
 
 	if (aad(sa)->info) {
-		audit_log_format(ab, " info=");
-		audit_log_string(ab, aad(sa)->info);
+		audit_log_format(ab, " info=\"%s\"", aad(sa)->info);
 		if (aad(sa)->error)
 			audit_log_format(ab, " error=%d", aad(sa)->error);
 	}
diff --git a/security/apparmor/file.c b/security/apparmor/file.c
index 9a2d14b7c9f8..70f27124d051 100644
--- a/security/apparmor/file.c
+++ b/security/apparmor/file.c
@@ -35,20 +35,6 @@ static u32 map_mask_to_chr_mask(u32 mask)
 }
 
 /**
- * audit_file_mask - convert mask to permission string
- * @buffer: buffer to write string to (NOT NULL)
- * @mask: permission mask to convert
- */
-static void audit_file_mask(struct audit_buffer *ab, u32 mask)
-{
-	char str[10];
-
-	aa_perm_mask_to_str(str, sizeof(str), aa_file_perm_chrs,
-			    map_mask_to_chr_mask(mask));
-	audit_log_string(ab, str);
-}
-
-/**
  * file_audit_cb - call back for file specific audit fields
  * @ab: audit_buffer  (NOT NULL)
  * @va: audit struct to audit values of  (NOT NULL)
@@ -57,14 +43,17 @@ static void file_audit_cb(struct audit_buffer *ab, void *va)
 {
 	struct common_audit_data *sa = va;
 	kuid_t fsuid = current_fsuid();
+	char str[10];
 
 	if (aad(sa)->request & AA_AUDIT_FILE_MASK) {
-		audit_log_format(ab, " requested_mask=");
-		audit_file_mask(ab, aad(sa)->request);
+		aa_perm_mask_to_str(str, sizeof(str), aa_file_perm_chrs,
+				    map_mask_to_chr_mask(aad(sa)->request));
+		audit_log_format(ab, " requested_mask=%s", str);
 	}
 	if (aad(sa)->denied & AA_AUDIT_FILE_MASK) {
-		audit_log_format(ab, " denied_mask=");
-		audit_file_mask(ab, aad(sa)->denied);
+		aa_perm_mask_to_str(str, sizeof(str), aa_file_perm_chrs,
+				    map_mask_to_chr_mask(aad(sa)->denied));
+		audit_log_format(ab, " denied_mask=%s", str);
 	}
 	if (aad(sa)->request & AA_AUDIT_FILE_MASK) {
 		audit_log_format(ab, " fsuid=%d",
diff --git a/security/apparmor/ipc.c b/security/apparmor/ipc.c
index 4ecedffbdd33..fe431731883f 100644
--- a/security/apparmor/ipc.c
+++ b/security/apparmor/ipc.c
@@ -20,25 +20,23 @@
 
 /**
  * audit_ptrace_mask - convert mask to permission string
- * @buffer: buffer to write string to (NOT NULL)
  * @mask: permission mask to convert
+ *
+ * Returns: pointer to static string
  */
-static void audit_ptrace_mask(struct audit_buffer *ab, u32 mask)
+static const char *audit_ptrace_mask(u32 mask)
 {
 	switch (mask) {
 	case MAY_READ:
-		audit_log_string(ab, "read");
-		break;
+		return "read";
 	case MAY_WRITE:
-		audit_log_string(ab, "trace");
-		break;
+		return "trace";
 	case AA_MAY_BE_READ:
-		audit_log_string(ab, "readby");
-		break;
+		return "readby";
 	case AA_MAY_BE_TRACED:
-		audit_log_string(ab, "tracedby");
-		break;
+		return "tracedby";
 	}
+	return "";
 }
 
 /* call back to audit ptrace fields */
@@ -47,12 +45,12 @@ static void audit_ptrace_cb(struct audit_buffer *ab, void *va)
 	struct common_audit_data *sa = va;
 
 	if (aad(sa)->request & AA_PTRACE_PERM_MASK) {
-		audit_log_format(ab, " requested_mask=");
-		audit_ptrace_mask(ab, aad(sa)->request);
+		audit_log_format(ab, " requested_mask=%s",
+				 audit_ptrace_mask(aad(sa)->request));
 
 		if (aad(sa)->denied & AA_PTRACE_PERM_MASK) {
-			audit_log_format(ab, " denied_mask=");
-			audit_ptrace_mask(ab, aad(sa)->denied);
+			audit_log_format(ab, " denied_mask=%s",
+					 audit_ptrace_mask(aad(sa)->denied));
 		}
 	}
 	audit_log_format(ab, " peer=");
@@ -142,16 +140,18 @@ static inline int map_signal_num(int sig)
 }
 
 /**
- * audit_file_mask - convert mask to permission string
- * @buffer: buffer to write string to (NOT NULL)
+ * audit_signal_mask - convert mask to permission string
  * @mask: permission mask to convert
+ *
+ * Returns: pointer to static string
  */
-static void audit_signal_mask(struct audit_buffer *ab, u32 mask)
+static const char *audit_signal_mask(u32 mask)
 {
 	if (mask & MAY_READ)
-		audit_log_string(ab, "receive");
+		return "receive";
 	if (mask & MAY_WRITE)
-		audit_log_string(ab, "send");
+		return "send";
+	return "";
 }
 
 /**
@@ -164,11 +164,11 @@ static void audit_signal_cb(struct audit_buffer *ab, void *va)
 	struct common_audit_data *sa = va;
 
 	if (aad(sa)->request & AA_SIGNAL_PERM_MASK) {
-		audit_log_format(ab, " requested_mask=");
-		audit_signal_mask(ab, aad(sa)->request);
+		audit_log_format(ab, " requested_mask=%s",
+				 audit_signal_mask(aad(sa)->request));
 		if (aad(sa)->denied & AA_SIGNAL_PERM_MASK) {
-			audit_log_format(ab, " denied_mask=");
-			audit_signal_mask(ab, aad(sa)->denied);
+			audit_log_format(ab, " denied_mask=%s",
+					 audit_signal_mask(aad(sa)->denied));
 		}
 	}
 	if (aad(sa)->signal == SIGUNKNOWN)
diff --git a/security/apparmor/net.c b/security/apparmor/net.c
index d8afc39f663a..fa0e85568450 100644
--- a/security/apparmor/net.c
+++ b/security/apparmor/net.c
@@ -72,16 +72,18 @@ void audit_net_cb(struct audit_buffer *ab, void *va)
 {
 	struct common_audit_data *sa = va;
 
-	audit_log_format(ab, " family=");
 	if (address_family_names[sa->u.net->family])
-		audit_log_string(ab, address_family_names[sa->u.net->family]);
+		audit_log_format(ab, " family=\"%s\"",
+				 address_family_names[sa->u.net->family]);
 	else
-		audit_log_format(ab, "\"unknown(%d)\"", sa->u.net->family);
-	audit_log_format(ab, " sock_type=");
+		audit_log_format(ab, " family=\"unknown(%d)\"",
+				 sa->u.net->family);
 	if (sock_type_names[aad(sa)->net.type])
-		audit_log_string(ab, sock_type_names[aad(sa)->net.type]);
+		audit_log_format(ab, " sock_type=\"%s\"",
+				 sock_type_names[aad(sa)->net.type]);
 	else
-		audit_log_format(ab, "\"unknown(%d)\"", aad(sa)->net.type);
+		audit_log_format(ab, " sock_type=\"unknown(%d)\"",
+				 aad(sa)->net.type);
 	audit_log_format(ab, " protocol=%d", aad(sa)->net.protocol);
 
 	if (aad(sa)->request & NET_PERMS_MASK) {
diff --git a/security/lsm_audit.c b/security/lsm_audit.c
index 2d2bf49016f4..221370794d14 100644
--- a/security/lsm_audit.c
+++ b/security/lsm_audit.c
@@ -427,8 +427,8 @@ static void dump_common_audit_data(struct audit_buffer *ab,
 				 a->u.ibendport->port);
 		break;
 	case LSM_AUDIT_DATA_LOCKDOWN:
-		audit_log_format(ab, " lockdown_reason=");
-		audit_log_string(ab, lockdown_reasons[a->u.reason]);
+		audit_log_format(ab, " lockdown_reason=\"%s\"",
+				 lockdown_reasons[a->u.reason]);
 		break;
 	} /* switch (a->type) */
 }
-- 
1.8.3.1



* Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
From: Eric W. Biederman @ 2020-07-03 22:25 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Al Viro, Casey Schaufler, Alexei Starovoitov, linux-kernel,
	David Miller, Greg Kroah-Hartman, Kees Cook, Andrew Morton,
	Alexei Starovoitov, bpf, linux-fsdevel, Daniel Borkmann,
	Jakub Kicinski, Masahiro Yamada, Gary Lin, Bruno Meneguele,
	LSM List, Luis Chamberlain, Linus Torvalds
In-Reply-To: <d0266a24-dfab-83d0-e178-aa67c9f5ebc0@i-love.sakura.ne.jp>

Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:

> On 2020/07/02 22:08, Eric W. Biederman wrote:
>>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>>> called from both kernel thread and from process context (e.g. init_module()
>>> syscall by /sbin/insmod )) ?
>> 
>> And __fput_sync needs to be even more careful.
>> umd_load_blob is called in these changes without any locks held.
>
> But where is the guarantee that a thread which called flush_delayed_fput() waits for
> the completion of processing _all_ "struct file" linked into delayed_fput_list ?
> If some other thread or delayed_fput_work (scheduled by fput_many()) called
> flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
> needs to be processed before execve(), can't it?

As a module the guarantee is we call task_work_run.
Built into the kernel the guarantee as best I can trace it is that
kthreadd hasn't started, and as such nothing that is scheduled has run
yet.

> Also, I don't know how convoluted the dependency of all "struct file" linked into
> delayed_fput_list might be, for there can be "struct file" which will not be a
> simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.
>
> On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
> there is a guarantee that __fput_sync() waits for the completion of "struct file"
> which needs to be flushed before execve(), isn't there?

There is really not a good helper or helpers, and this code suggests we
should have something better.  Right now I have used the existing helpers
to the best of my ability.  If you or someone else wants to write a better
version of flushing so that exec can happen, be my guest.

As far as I can tell what I have is good enough.

>> We fundamentally AKA in any correct version of this code need to flush
>> the file descriptor before we call exec or exec can not open it a
>> read-only denying all writes from any other opens.
>> 
>> The use case of flush_delayed_fput is exactly the same as that used
>> when loading the initramfs.
>
> When loading the initramfs, the number of threads is quite few (which
> means that the possibility of hitting the race window and convoluted
> dependency is small).

But the reality is the code runs very early, before the initramfs is
initialized in practice.

> But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
> flush_delayed_fput() might be called after many number of threads already
> started running.

At which point the code probably won't be running from a kernel thread
but instead will be running in a thread where task_work_run is relevant.

At worst it is a very small race, where someone else in another thread
starts flushing the file.  Which means the file might still not be
completely closed before exec.   Even that is not necessarily fatal,
as the usermode driver code has a respawn capability.

Code that is used enough that it hits that race sounds like a very
good problem to have from the perspective of the usermode driver code.

> Do we really need to mount upon umd_load_blob() and unmount upon umd_unload_blob() ?
> LSM modules might prefer only one instance of filesystem for umd
> blobs.

It is simple. People are free to change it, but a single filesystem
seems like a very good place to start with this functionality.

> For pathname based LSMs, since that filesystem is not visible from mount tree, only
> info->driver_name can be used for distinction. Therefore, one instance of filesystem
> with files created with file_open_root(O_CREAT | O_WRONLY | O_EXCL)
> might be preferable.

I took a quick look and the creation and removal of files with the
in-kernel helpers is not particularly easy.  Certainly it is more work
and thus a higher likelihood of bugs than what I have done.

A directory per driver does sound tempting.  Just more work than I am
willing to do.

> For inode based LSMs, reusing one instance of filesystem created upon early boot might
> be convenient for labeling.
>
> Also, we might want a dedicated filesystem (say, "umdfs") instead of regular tmpfs in
> order to implement protections without labeling files. Then, we might also be able to
> implement minimal protections without LSMs.

All valid points.  Nothing sets this design in stone.
Nothing says this is the endpoint of the evolution of this code.

The entire point of this patchset for me is that I remove the
unnecessary special cases from exec and do_exit, so I don't have to deal
with the usermode driver code anymore.

Eric


* Re: [PATCH v33 11/21] x86/sgx: Linux Enclave Driver
From: Jarkko Sakkinen @ 2020-07-03 23:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Sean Christopherson, x86, linux-sgx, linux-kernel,
	linux-security-module, Jethro Beekman, Haitao Huang, Chunyang Hui,
	Jordan Hand, Nathaniel McCallum, Seth Moore, Suresh Siddha, akpm,
	andriy.shevchenko, asapek, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, josh,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk,
	rientjes, tglx, yaozhangx
In-Reply-To: <20200626142019.GD27151@zn.tnic>

On Fri, Jun 26, 2020 at 04:20:19PM +0200, Borislav Petkov wrote:
> On Fri, Jun 26, 2020 at 07:16:27AM -0700, Sean Christopherson wrote:
> > That being said, I agree that it would be safer to move sgx_calc_ssaframesize()
> > inside sgx_validate_secs() and only compute encl_size after the secs is
> > validated.
> 

Changed as

if (!secs->ssa_frame_size)
	return -EINVAL;

if (sgx_calc_ssa_frame_size(secs->miscselect, secs->xfrm) >
    secs->ssa_frame_size)
	return -EINVAL;

/Jarkko


* Re: [PATCH v3 13/16] exit: Factor thread_group_exited out of pidfd_poll
From: Alexei Starovoitov @ 2020-07-04  0:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, David Miller, Greg Kroah-Hartman, Tetsuo Handa,
	Kees Cook, Andrew Morton, Alexei Starovoitov, Al Viro, bpf,
	linux-fsdevel, Daniel Borkmann, Jakub Kicinski, Masahiro Yamada,
	Gary Lin, Bruno Meneguele, LSM List, Casey Schaufler,
	Luis Chamberlain, Linus Torvalds, Christian Brauner
In-Reply-To: <873668s2j8.fsf@x220.int.ebiederm.org>

On Fri, Jul 03, 2020 at 04:37:47PM -0500, Eric W. Biederman wrote:
> 
> > The rest all looks good to me. Tested with and without bpf_preload patches.
> > Feel free to create a frozen branch with this set.
> 
> Can I have your Tested-by and Acked-by?

For the set:
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Alexei Starovoitov <ast@kernel.org>


* Re: [PATCH v33 11/21] x86/sgx: Linux Enclave Driver
From: Jarkko Sakkinen @ 2020-07-04  0:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-sgx, linux-kernel, linux-security-module,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Sean Christopherson,
	Suresh Siddha, akpm, andriy.shevchenko, asapek, cedric.xing,
	chenalexchen, conradparker, cyhanish, dave.hansen, haitao.huang,
	josh, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx
In-Reply-To: <20200626153400.GE27151@zn.tnic>

On Fri, Jun 26, 2020 at 05:34:00PM +0200, Borislav Petkov wrote:
> On Thu, Jun 18, 2020 at 01:08:33AM +0300, Jarkko Sakkinen wrote:
> 
> ...
> 
> This could use some commenting along the lines of:
> 
> "— If the enclave developer requires measurement of the page as a
> proof for the content, use EEXTEND to add a measurement for 256 bytes of
> the page. Repeat this operation until the entire page is measured."
> 
> At least this text from the SDM maps to the 256 bytes below. Otherwise
> it is magic.

Copied with pride:

/*
 * If the caller requires measurement of the page as a proof for the content,
 * use EEXTEND to add a measurement for 256 bytes of the page. Repeat this
 * operation until the entire page is measured.
 */

> > +static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
> > +			     unsigned long offset, unsigned long length,
> > +			     struct sgx_secinfo *secinfo, unsigned long flags)
> > +{
> > +	struct sgx_encl_page *encl_page;
> > +	struct sgx_epc_page *epc_page;
> > +	int ret;
> > +
> > +	encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
> > +	if (IS_ERR(encl_page))
> > +		return PTR_ERR(encl_page);
> > +
> > +	epc_page = __sgx_alloc_epc_page();
> > +	if (IS_ERR(epc_page)) {
> > +		kfree(encl_page);
> > +		return PTR_ERR(epc_page);
> > +	}
> > +
> > +	if (atomic_read(&encl->flags) &
> > +	    (SGX_ENCL_INITIALIZED | SGX_ENCL_DEAD)) {
> > +		ret = -EFAULT;
> > +		goto err_out_free;
> > +	}
> 
> You can do this first thing when you enter the function so that
> you don't have to allocate needlessly in the error case, when
> SGX_ENCL_INITIALIZED | SGX_ENCL_DEAD is set.

Updated version:

static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
			     unsigned long offset, unsigned long length,
			     struct sgx_secinfo *secinfo, unsigned long flags)
{
	struct sgx_encl_page *encl_page;
	struct sgx_epc_page *epc_page;
	struct sgx_va_page *va_page;
	int ret;

	if (atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED)
		return -EFAULT;

The SGX_ENCL_DEAD check is unnecessary altogether as this flag cannot
possibly be unset inside ioctl. It is set by 'sgx_release()', which is
the release callback for the enclave file.

'sgx_ioctl()' also unnecessarily has this check I just noticed (and
removed).

> "uninitialized"?
> 
> Where is the test for SGX_ENCL_INITIALIZED and erroring out otherwise?
> 
> I.e., what happens if you add pages to an initialized enclave?

For historical reasons it is in sgx_encl_add_page(). Back then we
allowed ioctls to operate on an enclave concurrently. Today we enforce
sequential operation on a single enclave with the SGX_ENCL_IOCTL flag
because that is the only sane way to use the construction operations.

Therefore the check can be moved to sgx_ioc_enclave_add_pages() if you
request so, but first I have one remark to discuss.

I noticed that sometimes wrong state flags turn into -EINVAL and
sometimes into -EFAULT (like in the previous case). I'd suggest
that when the ioctl is blocked based on encl->flags, and only on that,
the ioctl would return -ENOIOCTLCMD in both cases, i.e. this
command is not available.

That would give much better aids for debugging user space code.

> 
> > + * measurement with the contents of the page. The address range of pages must
> > + * be contiguous.
> 
> Must? Who is enforcing this? I'm trying to find where...

Unfortunately I cannot recall what I meant when I wrote that. I removed
that sentence. I'm not sure what I meant exactly when I used 'contiguous'
here.

> > The SECINFO and measurement mask are applied to all pages.
> > + *
> > + * A SECINFO for a TCS is required to always contain zero permissions because
> > + * CPU silently zeros them. Allowing anything else would cause a mismatch in
> > + * the measurement.
> > + *
> > + * mmap()'s protection bits are capped by the page permissions. For each page
> > + * address, the maximum protection bits are computed with the following
> > + * heuristics:
> > + *
> > + * 1. A regular page: PROT_R, PROT_W and PROT_X match the SECINFO permissions.
> > + * 2. A TCS page: PROT_R | PROT_W.
> > + *
> > + * mmap() is not allowed to surpass the minimum of the maximum protection bits
> > + * within the given address range.
> > + *
> > + * If ENCLS opcode fails, that effectively means that EPC has been invalidated.
> > + * When this happens the enclave is destroyed and -EIO is returned to the
> > + * caller.
> > + *
> > + * Return:
> > + *   0 on success,
> > + *   -EACCES if an executable source page is located in a noexec partition,
> > + *   -EIO if either ENCLS[EADD] or ENCLS[EEXTEND] fails
> > + *   -errno otherwise
> > + */
> > +static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
> > +{
> > +	struct sgx_enclave_add_pages addp;
> > +	struct sgx_secinfo secinfo;
> > +	unsigned long c;
> > +	int ret;
> > +
> > +	if (!(atomic_read(&encl->flags) & SGX_ENCL_CREATED))
> > +		return -EINVAL;
> > +
> > +	if (copy_from_user(&addp, arg, sizeof(addp)))
> > +		return -EFAULT;
> > +
> > +	if (!IS_ALIGNED(addp.offset, PAGE_SIZE) ||
> > +	    !IS_ALIGNED(addp.src, PAGE_SIZE))
> > +		return -EINVAL;
> > +
> > +	if (!(access_ok(addp.src, PAGE_SIZE)))
> > +		return -EFAULT;
> > +
> > +	if (addp.length & (PAGE_SIZE - 1))
> > +		return -EINVAL;
> 
> How many pages are allowed? Unlimited? I'm hoping some limits are
> checked somewhere...

SGX_IOC_ENCLAVE_CREATE defines the address range, and thus sets the
limit on how many pages in total can be added to the enclave.

sgx_encl_size_max_64 contains the maximum size for the address range
and is initialized as follows:

cpuid_count(SGX_CPUID, 0, &eax, &ebx, &ecx, &edx);
sgx_encl_size_max_64 = 1ULL << ((edx >> 8) & 0xFF);

[derived from sgx_drv_init()]
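
As a worked instance of that computation, here is a small userspace
sketch. The helper names are made up for the illustration, the 4 KiB
page size is an assumption, and the EDX value 0x241f used in the note
after the code is only an example input, not something taken from this
thread:

```c
#include <stdint.h>

/*
 * Bits 15:8 of EDX hold log2 of the maximum enclave size in 64-bit
 * mode; this is the same extraction as in the snippet above.  The
 * sketch assumes the reported exponent is below 64.
 */
static uint64_t encl_size_max_64(uint32_t edx)
{
	return 1ULL << ((edx >> 8) & 0xFF);
}

/* Upper bound on how many 4 KiB pages fit into that address range. */
static uint64_t encl_pages_max_64(uint32_t edx)
{
	return encl_size_max_64(edx) / 4096;
}
```

For edx = 0x241f, bits 15:8 are 0x24 (36), so the size limit is 2^36
bytes (64 GiB) and at most 2^24 pages can ever be added.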

> > +
> > +	if (addp.offset + addp.length - PAGE_SIZE >= encl->size)
> > +		return -EINVAL;
> > +
> > +	if (copy_from_user(&secinfo, (void __user *)addp.secinfo,
> > +			   sizeof(secinfo)))
> > +		return -EFAULT;
> > +
> > +	if (sgx_validate_secinfo(&secinfo))
> > +		return -EINVAL;
> > +
> > +	for (c = 0 ; c < addp.length; c += PAGE_SIZE) {
> > +		if (signal_pending(current)) {
> > +			ret = -EINTR;
> > +			break;
> > +		}
> > +
> > +		if (need_resched())
> > +			cond_resched();
> > +
> > +		ret = sgx_encl_add_page(encl, addp.src + c, addp.offset + c,
> > +					addp.length - c, &secinfo, addp.flags);
> > +		if (ret)
> > +			break;
> > +	}
> > +
> > +	addp.count = c;

If your previous question was about how to limit the number of pages
that this ioctl can process in one run, that is already supported in the
API with 'addp.count'.

It'd be possible to add this if required:

addp.length = min(addp.length, SGX_ENCLAVE_IOC_ADD_PAGES_MAX_LENGTH);
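
A rough userspace model of how a caller could cope with such a clamp,
assuming 'count' reports the number of bytes processed and that user
space simply repeats the call for the remainder. MAX_LENGTH, struct
add_pages_req and both functions are hypothetical and only mimic the
shape of the ioctl; lengths are assumed to be page aligned:

```c
#define PAGE_SZ		4096UL
#define MAX_LENGTH	(4 * PAGE_SZ)	/* hypothetical per-call cap */

/* Hypothetical mirror of the ioctl argument fields used here. */
struct add_pages_req {
	unsigned long offset;
	unsigned long length;
	unsigned long count;	/* bytes actually processed, set by callee */
};

/* Models one ioctl call: clamp the length and report the progress. */
static int add_pages_once(struct add_pages_req *req)
{
	unsigned long len = req->length < MAX_LENGTH ? req->length : MAX_LENGTH;

	/* A real driver would add one page per PAGE_SZ step here. */
	req->count = len;
	return 0;
}

/* Caller side: repeat until the whole range is consumed. */
static unsigned long add_all_pages(unsigned long offset, unsigned long length)
{
	unsigned long calls = 0;

	while (length) {
		struct add_pages_req req = { .offset = offset, .length = length };

		if (add_pages_once(&req) || !req.count)
			break;
		offset += req.count;
		length -= req.count;
		calls++;
	}
	return calls;
}
```

With a cap of four pages, a ten page request takes three calls (4 + 4 +
2 pages).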

/Jarkko


* Re: [PATCH v33 11/21] x86/sgx: Linux Enclave Driver
From: Jarkko Sakkinen @ 2020-07-04  1:42 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: x86, linux-sgx, linux-kernel, linux-security-module,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Sean Christopherson,
	Suresh Siddha, akpm, andriy.shevchenko, asapek, cedric.xing,
	chenalexchen, conradparker, cyhanish, dave.hansen, haitao.huang,
	josh, kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman,
	puiterwijk, rientjes, tglx, yaozhangx
In-Reply-To: <20200627174335.GC15585@zn.tnic>

On Sat, Jun 27, 2020 at 07:43:35PM +0200, Borislav Petkov wrote:
> On Thu, Jun 18, 2020 at 01:08:33AM +0300, Jarkko Sakkinen wrote:
> > +static int sgx_encl_init(struct sgx_encl *encl, struct sgx_sigstruct *sigstruct,
> > +			 void *token)
> > +{
> > +	u64 mrsigner[4];
> > +	int ret;
> > +	int i;
> > +	int j;
> > +
> > +	/* Check that the required attributes have been authorized. */
> > +	if (encl->secs_attributes & ~encl->allowed_attributes)
> > +		return -EACCES;
> > +
> > +	ret = sgx_get_key_hash(sigstruct->modulus, mrsigner);
> > +	if (ret)
> > +		return ret;
> > +
> > +	mutex_lock(&encl->lock);
> > +
> > +	if (atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) {
> > +		ret = -EFAULT;
> > +		goto err_out;
> > +	}
> 
> That test should be the first thing this function or its caller does.

Fixed.

> 
> > +	for (i = 0; i < SGX_EINIT_SLEEP_COUNT; i++) {
> > +		for (j = 0; j < SGX_EINIT_SPIN_COUNT; j++) {
> 
> Ew, what's that double-loop for?
> 
> It tries to init an enclave a bunch of times. Why does it need to init
> more than once?

From SDM:

"Periodically, EINIT polls for certain asynchronous events. If such an
event is detected, it completes with failure code (ZF=1 and RAX =
SGX_UNMASKED_EVENT), and RIP is incremented to point to the next
instruction. These events includes external interrupts, non-maskable
interrupts, system-management interrupts, machine checks, INIT signals,
and the VMX-preemption timer. EINIT does not fail if the pending event
is inhibited (e.g., external interrupts could be inhibited due to
blocking by MOV SS blocking or by STI)."

Not exactly sure why this must be polled inside the kernel, though.
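
For reference, the retry structure in the quoted code boils down to
"spin a few times, then back off and sleep, then spin again". The sketch
below is a user-space model of that shape only. The names, the function
pointer interface and the fake operation are invented for the example,
UNMASKED_EVENT is a stand-in for SGX_UNMASKED_EVENT, and the signal
check from the patch is left out.

```c
#include <assert.h>

#define UNMASKED_EVENT 128	/* stand-in for SGX_UNMASKED_EVENT */

typedef int (*einit_fn)(void *ctx);
typedef void (*sleep_fn)(void *ctx);

/*
 * Try the operation spin_count times in a tight loop. If it keeps being
 * interrupted, back off once and repeat, at most sleep_count times.
 * Returns the last result of the operation.
 */
int retry_einit(einit_fn try_einit, sleep_fn back_off, void *ctx,
		int sleep_count, int spin_count)
{
	int ret = UNMASKED_EVENT;
	int i, j;

	assert(sleep_count > 0 && spin_count > 0);

	for (i = 0; i < sleep_count; i++) {
		for (j = 0; j < spin_count; j++) {
			ret = try_einit(ctx);
			if (ret != UNMASKED_EVENT)
				return ret;
		}
		back_off(ctx);
	}

	return ret;
}

/* Fake operation: interrupted 'failures_left' times, then succeeds. */
struct fake_einit {
	int failures_left;
	int calls;
	int sleeps;
};

int fake_try(void *ctx)
{
	struct fake_einit *f = ctx;

	f->calls++;
	if (f->failures_left > 0) {
		f->failures_left--;
		return UNMASKED_EVENT;
	}
	return 0;
}

void fake_sleep(void *ctx)
{
	struct fake_einit *f = ctx;

	f->sleeps++;
}
```

With five pending interruptions, a spin count of 3 and a sleep count of
4, the fake operation succeeds on the sixth attempt after a single
back-off.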

> 
> > +			ret = sgx_einit(sigstruct, token, encl->secs.epc_page,
> > +					mrsigner);
> > +			if (ret == SGX_UNMASKED_EVENT)
> > +				continue;
> > +			else
> > +				break;
> > +		}
> > +
> > +		if (ret != SGX_UNMASKED_EVENT)
> > +			break;
> > +
> > +		msleep_interruptible(SGX_EINIT_SLEEP_TIME);
> > +
> > +		if (signal_pending(current)) {
> > +			ret = -ERESTARTSYS;
> > +			goto err_out;
> > +		}
> > +	}
> > +
> > +	if (ret & ENCLS_FAULT_FLAG) {
> > +		if (encls_failed(ret))
> > +			ENCLS_WARN(ret, "EINIT");
> > +
> > +		sgx_encl_destroy(encl);
> > +		ret = -EFAULT;
> > +	} else if (ret) {
> > +		pr_debug("EINIT returned %d\n", ret);
> > +		ret = -EPERM;
> > +	} else {
> > +		atomic_or(SGX_ENCL_INITIALIZED, &encl->flags);
> > +	}
> > +
> > +err_out:
> > +	mutex_unlock(&encl->lock);
> > +	return ret;
> > +}
> > +
> > +/**
> > + * sgx_ioc_enclave_init - handler for %SGX_IOC_ENCLAVE_INIT
> > + *
> > + * @filep:	open file to /dev/sgx
> 
> @encl:       pointer to an enclave instance (via ioctl() file pointer)
> 
> > + * @arg:	userspace pointer to a struct sgx_enclave_init instance
> > + *
> > + * Flush any outstanding enqueued EADD operations and perform EINIT.  The
> > + * Launch Enclave Public Key Hash MSRs are rewritten as necessary to match
> > + * the enclave's MRSIGNER, which is calculated from the provided sigstruct.
> > + *
> > + * Return:
> > + *   0 on success,
> > + *   SGX error code on EINIT failure,
> > + *   -errno otherwise
> > + */
> > +static long sgx_ioc_enclave_init(struct sgx_encl *encl, void __user *arg)
> > +{
> > +	struct sgx_sigstruct *sigstruct;
> > +	struct sgx_enclave_init einit;
> > +	struct page *initp_page;
> > +	void *token;
> > +	int ret;
> > +
> > +	if (!(atomic_read(&encl->flags) & SGX_ENCL_CREATED))
> 
> Might just as well check the other flags: doing EINIT on an already
> initialized enclave - SGX_ENCL_INITIALIZED - is perhaps a nono or
> similarly on a SGX_ENCL_DEAD enclave.
> 
> And you could do similar sanity checks in the other ioctl functions.

Agreed (see my earlier response, let's continue this discussion there).

/Jarkko


* Re: [PATCH v33 11/21] x86/sgx: Linux Enclave Driver
From: Jarkko Sakkinen @ 2020-07-04  1:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Borislav Petkov, x86, linux-sgx, linux-kernel,
	linux-security-module, Jethro Beekman, Haitao Huang, Chunyang Hui,
	Jordan Hand, Nathaniel McCallum, Seth Moore, Suresh Siddha, akpm,
	andriy.shevchenko, asapek, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, josh,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk,
	rientjes, tglx, yaozhangx
In-Reply-To: <20200629152718.GA12312@linux.intel.com>

On Mon, Jun 29, 2020 at 08:27:19AM -0700, Sean Christopherson wrote:
> On Sat, Jun 27, 2020 at 07:43:35PM +0200, Borislav Petkov wrote:
> > On Thu, Jun 18, 2020 at 01:08:33AM +0300, Jarkko Sakkinen wrote:
> > > +static int sgx_encl_init(struct sgx_encl *encl, struct sgx_sigstruct *sigstruct,
> > > +			 void *token)
> > > +{
> > > +	u64 mrsigner[4];
> > > +	int ret;
> > > +	int i;
> > > +	int j;
> > > +
> > > +	/* Check that the required attributes have been authorized. */
> > > +	if (encl->secs_attributes & ~encl->allowed_attributes)
> > > +		return -EACCES;
> > > +
> > > +	ret = sgx_get_key_hash(sigstruct->modulus, mrsigner);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	mutex_lock(&encl->lock);
> > > +
> > > +	if (atomic_read(&encl->flags) & SGX_ENCL_INITIALIZED) {
> > > +		ret = -EFAULT;
> > > +		goto err_out;
> > > +	}
> > 
> > That test should be the first thing this function or its caller does.
> 
> Hmm, I was going to say that SGX_ENCL_INITIALIZED can't be checked until
> encl->lock is held, but that's not true for this path as mutual exclusion
> is provided by the SGX_ENCL_IOCTL flag.  So yeah, this can be checked at
> the same time as SGX_ENCL_CREATED in sgx_ioc_enclave_init().
> 
> > > +	for (i = 0; i < SGX_EINIT_SLEEP_COUNT; i++) {
> > > +		for (j = 0; j < SGX_EINIT_SPIN_COUNT; j++) {
> > 
> > Ew, what's that double-loop for?
> > 
> > It tries to init an enclave a bunch of times. Why does it need to init
> > more than once?
> 
> ENCLS[EINIT] is interruptible because it has such a high latency, e.g. 50k+
> cycles on success.  If an IRQ/NMI/SMI becomes pending, EINIT may fail with
> SGX_UNMASKED_EVENT so that the event can be serviced.
> 
> The idea behind the double loop is to try EINIT in a tight loop, then back
> off and sleep for a while before retrying that tight inner loop.
> 
> > > +			ret = sgx_einit(sigstruct, token, encl->secs.epc_page,
> > > +					mrsigner);
> > > +			if (ret == SGX_UNMASKED_EVENT)
> > > +				continue;
> > > +			else
> > > +				break;
> > > +		}
> > > +
> > > +		if (ret != SGX_UNMASKED_EVENT)
> > > +			break;
> > > +
> > > +		msleep_interruptible(SGX_EINIT_SLEEP_TIME);
> > > +
> > > +		if (signal_pending(current)) {
> > > +			ret = -ERESTARTSYS;
> > > +			goto err_out;
> > > +		}
> > > +	}
> > > +
> > > +	if (ret & ENCLS_FAULT_FLAG) {
> > > +		if (encls_failed(ret))
> > > +			ENCLS_WARN(ret, "EINIT");
> > > +
> > > +		sgx_encl_destroy(encl);
> > > +		ret = -EFAULT;
> > > +	} else if (ret) {
> > > +		pr_debug("EINIT returned %d\n", ret);
> > > +		ret = -EPERM;
> > > +	} else {
> > > +		atomic_or(SGX_ENCL_INITIALIZED, &encl->flags);
> > > +	}
> > > +
> > > +err_out:
> > > +	mutex_unlock(&encl->lock);
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * sgx_ioc_enclave_init - handler for %SGX_IOC_ENCLAVE_INIT
> > > + *
> > > + * @filep:	open file to /dev/sgx
> > 
> > @encl:       pointer to an enclave instance (via ioctl() file pointer)
> > 
> > > + * @arg:	userspace pointer to a struct sgx_enclave_init instance
> > > + *
> > > + * Flush any outstanding enqueued EADD operations and perform EINIT.  The
> > > + * Launch Enclave Public Key Hash MSRs are rewritten as necessary to match
> > > + * the enclave's MRSIGNER, which is calculated from the provided sigstruct.
> > > + *
> > > + * Return:
> > > + *   0 on success,
> > > + *   SGX error code on EINIT failure,
> > > + *   -errno otherwise
> > > + */
> > > +static long sgx_ioc_enclave_init(struct sgx_encl *encl, void __user *arg)
> > > +{
> > > +	struct sgx_sigstruct *sigstruct;
> > > +	struct sgx_enclave_init einit;
> > > +	struct page *initp_page;
> > > +	void *token;
> > > +	int ret;
> > > +
> > > +	if (!(atomic_read(&encl->flags) & SGX_ENCL_CREATED))
> > 
> > Might just as well check the other flags: doing EINIT on an already
> > initialized enclave - SGX_ENCL_INITIALIZED - is perhaps a nono or
> > similarly on a SGX_ENCL_DEAD enclave.
> > 
> > And you could do similar sanity checks in the other ioctl functions.
> 
> Ya, as above, SGX_ENCL_INITIALIZED can be checked here.
> 
> SGX_ENCL_DEAD is actually already checked in the top level sgx_ioctl(),
> i.e. the check in sgx_encl_add_page() can technically be flat out dropped.
> 
> I say "technically" because I'm a bit torn over SGX_ENCL_DEAD; encl->lock
> must be held to set SGX_ENCL_DEAD (the page fault and reclaim flows rely on
> this), but as it stands today only ioctl() paths (guarded by SGX_ENCL_IOCTL)
> and sgx_release() (makes the ioctls() unreachable) set SGX_ENCL_DEAD.
> 
> So it's safe to check SGX_ENCL_DEAD from ioctl() context without holding
> encl->lock, at least in the current code base, but it feels weird/sketchy.
> 
> In the end I don't think I have a strong opinion.  Removing the technically
> unnecessary DEAD check in sgx_encl_add_page() is the simplest change, so it
> may make sense to do that and nothing more for initial upstreaming.  Long
> term, I fully expect we'll add paths that set SGX_ENCL_DEAD outside of
> ioctl() context, e.g. to handle EPC OOM, but it wouldn't be a bad thing to
> have a standalone commit in a future series to add DEAD checks (under
> encl->lock) in the ADD and INIT flows.

AFAIK none of the ioctls should need the SGX_ENCL_DEAD check.
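
A minimal user-space sketch of that arrangement follows: one DEAD check
in the top-level dispatcher, and the per-command handlers only checking
CREATED and INITIALIZED. The flag values, command names and error codes
are placeholders chosen for the example and do not come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Placeholder flag values, not the ones used by the driver. */
enum {
	ENCL_CREATED	 = 1 << 0,
	ENCL_INITIALIZED = 1 << 1,
	ENCL_DEAD	 = 1 << 2,
};

enum encl_cmd { CMD_ADD_PAGES, CMD_INIT };

struct encl {
	atomic_uint flags;
};

int encl_ioctl(struct encl *e, enum encl_cmd cmd)
{
	unsigned int flags = atomic_load(&e->flags);

	/* Checked once here, so no handler has to repeat it. */
	if (flags & ENCL_DEAD)
		return -EFAULT;

	switch (cmd) {
	case CMD_ADD_PAGES:
		/* Pages can only be added between CREATE and INIT. */
		if (!(flags & ENCL_CREATED) || (flags & ENCL_INITIALIZED))
			return -EINVAL;
		return 0;
	case CMD_INIT:
		/* INIT needs a created, not yet initialized enclave. */
		if (!(flags & ENCL_CREATED) || (flags & ENCL_INITIALIZED))
			return -EINVAL;
		atomic_fetch_or(&e->flags, ENCL_INITIALIZED);
		return 0;
	}

	return -ENOTTY;
}
```

The model assumes, as the quoted discussion does, that something like
the SGX_ENCL_IOCTL flag already serializes ioctls on one enclave, which
is why the flags are sampled once without taking a lock.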

/Jarkko


* Re: [PATCH v33 11/21] x86/sgx: Linux Enclave Driver
From: Jarkko Sakkinen @ 2020-07-04  3:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: x86, linux-sgx, linux-kernel, linux-security-module,
	Jethro Beekman, Haitao Huang, Chunyang Hui, Jordan Hand,
	Nathaniel McCallum, Seth Moore, Suresh Siddha, akpm,
	andriy.shevchenko, asapek, bp, cedric.xing, chenalexchen,
	conradparker, cyhanish, dave.hansen, haitao.huang, josh,
	kai.huang, kai.svahn, kmoy, ludloff, luto, nhorman, puiterwijk,
	rientjes, tglx, yaozhangx
In-Reply-To: <20200702035902.GC1819@linux.intel.com>

On Wed, Jul 01, 2020 at 08:59:02PM -0700, Sean Christopherson wrote:
> On Thu, Jun 18, 2020 at 01:08:33AM +0300, Jarkko Sakkinen wrote:
> > +static int sgx_validate_secs(const struct sgx_secs *secs,
> > +			     unsigned long ssaframesize)
> > +{
> > +	if (secs->size < (2 * PAGE_SIZE) || !is_power_of_2(secs->size))
> > +		return -EINVAL;
> > +
> > +	if (secs->base & (secs->size - 1))
> > +		return -EINVAL;
> > +
> > +	if (secs->miscselect & sgx_misc_reserved_mask ||
> > +	    secs->attributes & sgx_attributes_reserved_mask ||
> > +	    secs->xfrm & sgx_xfrm_reserved_mask)
> > +		return -EINVAL;
> > +
> > +	if (secs->attributes & SGX_ATTR_MODE64BIT) {
> > +		if (secs->size > sgx_encl_size_max_64)
> > +			return -EINVAL;
> > +	} else if (secs->size > sgx_encl_size_max_32)
> > +		return -EINVAL;
> 
> These should be >=, not >, the SDM uses one of those fancy ≥ ligatures.
> 
> Internal versions use more obvious pseudocode, e.g.:
> 
>     if ((DS:TMP_SECS.ATTRIBUTES.MODE64BIT = 1) AND
>         (DS:TMP_SECS.SIZE AND (~((1 << CPUID.18.0:EDX[15:8]) - 1)))
>     {
>         #GP(0);

Updated as:

static int sgx_validate_secs(const struct sgx_secs *secs)
{
	u64 max_size = (secs->attributes & SGX_ATTR_MODE64BIT) ?
		       sgx_encl_size_max_64 : sgx_encl_size_max_32;

	if (secs->size < (2 * PAGE_SIZE) || !is_power_of_2(secs->size))
		return -EINVAL;

	if (secs->base & (secs->size - 1))
		return -EINVAL;

	if (secs->miscselect & sgx_misc_reserved_mask ||
	    secs->attributes & sgx_attributes_reserved_mask ||
	    secs->xfrm & sgx_xfrm_reserved_mask)
		return -EINVAL;

	if (secs->size >= max_size)
		return -EINVAL;
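
As a sanity check of the ">=" form against the pseudocode quoted above,
the following user-space sketch expresses both as predicates over the
size and the CPUID exponent. The function names are made up for the
example; for unsigned values, "some bit at or above position N is set"
and "size >= 2^N" are the same condition.

```c
#include <assert.h>
#include <stdint.h>

/* Pseudocode form: any bit at or above position 'bits' is set. */
int size_rejected_by_mask(uint64_t size, unsigned int bits)
{
	assert(bits < 64);
	return (size & ~((1ULL << bits) - 1)) != 0;
}

/* Comparison form, matching "secs->size >= max_size". */
int size_rejected_by_compare(uint64_t size, unsigned int bits)
{
	assert(bits < 64);
	return size >= (1ULL << bits);
}
```

With an exponent of 36, a size of exactly 2^36 is rejected by both
forms, whereas the earlier ">" comparison would have let it through.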

/Jarkko


* Re: [PATCH v2 00/15] Make the user mode driver code a better citizen
From: Tetsuo Handa @ 2020-07-04  6:57 UTC (permalink / raw)
  To: Eric W. Biederman, Al Viro
  Cc: Casey Schaufler, Alexei Starovoitov, linux-kernel, David Miller,
	Greg Kroah-Hartman, Kees Cook, Andrew Morton, Alexei Starovoitov,
	bpf, linux-fsdevel, Daniel Borkmann, Jakub Kicinski,
	Masahiro Yamada, Gary Lin, Bruno Meneguele, LSM List,
	Luis Chamberlain, Linus Torvalds
In-Reply-To: <87lfk0nslu.fsf@x220.int.ebiederm.org>

On 2020/07/04 7:25, Eric W. Biederman wrote:
> Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> writes:
> 
>> On 2020/07/02 22:08, Eric W. Biederman wrote:
>>>> By the way, commit 4a9d4b024a3102fc ("switch fput to task_work_add") says
>>>> that use of flush_delayed_fput() has to be careful. Al, is it safe to call
>>>> flush_delayed_fput() from blob_to_mnt() from umd_load_blob() (which might be
>>>> called from both kernel thread and from process context (e.g. init_module()
>>>> syscall by /sbin/insmod )) ?
>>>
>>> And __fput_sync needs to be even more careful.
>>> umd_load_blob is called in these changes without any locks held.
>>
>> But where is the guarantee that a thread which called flush_delayed_fput() waits for
>> the completion of processing _all_ "struct file" linked into delayed_fput_list ?
>> If some other thread or delayed_fput_work (scheduled by fput_many()) called
>> flush_delayed_fput() between blob_to_mnt()'s fput(file) and flush_delayed_fput()
>> sequence? blob_to_mnt()'s flush_delayed_fput() can miss the "struct file" which
>> needs to be processed before execve(), can't it?
> 
> As a module the guarantee is we call task_work_run.

No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .

> Built into the kernel the guarantee as best I can trace it is that
> kthreadd hasn't started, and as such nothing that is scheduled has run
> yet.

Have you ever checked how early the kthreadd (PID=2) gets started?

----------
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2306,6 +2306,7 @@ static __latent_entropy struct task_struct *copy_process(
        trace_task_newtask(p, clone_flags);
        uprobe_copy_process(p, clone_flags);

+       printk(KERN_INFO "Created PID: %u Comm: %s\n", p->pid, p->comm);
        return p;

 bad_fork_cancel_cgroup:
----------

----------
[    0.090757][    T0] pid_max: default: 65536 minimum: 512
[    0.090890][    T0] LSM: Security Framework initializing
[    0.090890][    T0] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.090890][    T0] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.090890][    T0] Disabled fast string operations
[    0.090890][    T0] Last level iTLB entries: 4KB 1024, 2MB 1024, 4MB 1024
[    0.090890][    T0] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 1024, 1GB 4
[    0.090890][    T0] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.090890][    T0] Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
[    0.090890][    T0] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    0.090890][    T0] SRBDS: Unknown: Dependent on hypervisor status
[    0.090890][    T0] MDS: Mitigation: Clear CPU buffers
[    0.090890][    T0] Freeing SMP alternatives memory: 24K
[    0.090890][    T0] Created PID: 1 Comm: swapper/0
[    0.090890][    T0] Created PID: 2 Comm: swapper/0
[    0.090890][    T1] smpboot: CPU0: Intel(R) Core(TM) i5-4440S CPU @ 2.80GHz (family: 0x6, model: 0x3c, stepping: 0x3)
[    0.091000][    T2] Created PID: 3 Comm: kthreadd
[    0.091995][    T2] Created PID: 4 Comm: kthreadd
[    0.093028][    T2] Created PID: 5 Comm: kthreadd
[    0.093997][    T2] Created PID: 6 Comm: kthreadd
[    0.094995][    T2] Created PID: 7 Comm: kthreadd
[    0.096037][    T2] Created PID: 8 Comm: kthreadd
(...snipped...)
[    0.135716][    T2] Created PID: 13 Comm: kthreadd
[    0.135716][    T1] smp: Bringing up secondary CPUs ...
[    0.135716][    T2] Created PID: 14 Comm: kthreadd
[    0.135716][    T2] Created PID: 15 Comm: kthreadd
[    0.135716][    T2] Created PID: 16 Comm: kthreadd
[    0.135716][    T2] Created PID: 17 Comm: kthreadd
[    0.135716][    T2] Created PID: 18 Comm: kthreadd
[    0.135716][    T1] x86: Booting SMP configuration:
(...snipped...)
[    0.901990][    T1] pci 0000:00:00.0: Limiting direct PCI/PCI transfers
[    0.902145][    T1] pci 0000:00:0f.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    0.902213][    T1] pci 0000:02:00.0: CLS mismatch (32 != 64), using 64 bytes
[    0.902224][    T1] Trying to unpack rootfs image as initramfs...
[    1.107993][    T1] Freeing initrd memory: 18876K
[    1.109049][    T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    1.111003][    T1] software IO TLB: mapped [mem 0xab000000-0xaf000000] (64MB)
[    1.112136][    T1] check: Scanning for low memory corruption every 60 seconds
[    1.115040][    T2] Created PID: 52 Comm: kthreadd
[    1.116110][    T1] workingset: timestamp_bits=46 max_order=20 bucket_order=0
[    1.120936][    T1] SGI XFS with ACLs, security attributes, verbose warnings, quota, no debug enabled
[    1.129626][    T2] Created PID: 53 Comm: kthreadd
[    1.131403][    T2] Created PID: 54 Comm: kthreadd
----------

kthreadd (PID=2) is created by swapper/0 (PID=0) immediately after swapper/0 creates init (PID=1).
This happens even before the secondary CPUs are brought up, and far earlier than the initramfs
is unpacked.

And how can we prove that blob_to_mnt() is only called by a kernel thread before some kernel
thread that interferes with fput() starts running? blob_to_mnt() needs to be prepared for being
called after many processes have already started running.
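
The race window I am worried about can be modelled in a few lines of
user-space C. This is a deliberately simplified, single-threaded model
and not the kernel implementation: it assumes that a flush first
detaches the whole deferred list and only then releases the detached
entries, and every name in it is invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

struct file_model {
	int closed;			/* set once the final release ran */
	struct file_model *next;	/* link in the deferred list */
};

/* Stand-in for the global list of files awaiting their final release. */
struct file_model *deferred_list;

/* Model of a deferred fput(): queue the file for a later release. */
void defer_fput(struct file_model *f)
{
	f->next = deferred_list;
	deferred_list = f;
}

/* First half of a flush: take everything queued so far off the list. */
struct file_model *flush_detach(void)
{
	struct file_model *batch = deferred_list;

	deferred_list = NULL;
	return batch;
}

/* Second half of a flush: release every file in a detached batch. */
void flush_release(struct file_model *batch)
{
	while (batch) {
		struct file_model *next = batch->next;

		batch->next = NULL;
		batch->closed = 1;
		batch = next;
	}
}
```

If another flusher runs flush_detach() after our defer_fput() but has
not yet reached flush_release(), our own flush_detach() returns an empty
batch. Our flush then completes immediately while the file is still
open, which is exactly the state in which execve() would be attempted.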

> 
>> Also, I don't know how convoluted the dependency of all "struct file" linked into
>> delayed_fput_list might be, for there can be "struct file" which will not be a
>> simple close of tmpfs file created by blob_to_mnt()'s file_open_root() request.
>>
>> On the other hand, although __fput_sync() cannot be called from !PF_KTHREAD threads,
>> there is a guarantee that __fput_sync() waits for the completion of "struct file"
>> which needs to be flushed before execve(), isn't there?
> 
> There is really not a good helper or helpers, and this code suggests we
> have something better.  Right now I have used the existing helpers to
> the best of my ability.  If you or someone else wants to write a better
> version of flushing so that exec can happen be my guest.
> 
> As far as I can tell what I have is good enough.

Just saying what you think is not a "review". I'm waiting for an answer from Al Viro
because I expect Al to be the most familiar with fput()'s behavior.
At least I consider that

	if (current->flags & PF_KTHREAD) {
		__fput_sync(file);
	} else {
		fput(file);
		task_work_run();
	}

is a candidate for closing the race window. And depending on Al's answer,
removing

	BUG_ON(!(task->flags & PF_KTHREAD));

 from __fput_sync() and unconditionally using

	__fput_sync(file);

 from blob_to_mnt() might be the better choice. Anyway, I consider that
Al's response is important for this "review".

> 
>>> We fundamentally AKA in any correct version of this code need to flush
>>> the file descriptor before we call exec or exec can not open it a
>>> read-only denying all writes from any other opens.
>>>
>>> The use case of flush_delayed_fput is exactly the same as that used
>>> when loading the initramfs.
>>
>> When loading the initramfs, the number of threads is quite few (which
>> means that the possibility of hitting the race window and convoluted
>> dependency is small).
> 
> But the reality is the code run very early, before the initramfs is
> initialized in practice.

Such an expectation does not match reality.

> 
>> But like EXPORT_SYMBOL_GPL(umd_load_blob) indicates, blob_to_mnt()'s
>> flush_delayed_fput() might be called after many number of threads already
>> started running.
> 
> At which point the code probably won't be running from a kernel thread
> but instead will be running in a thread where task_work_run is relevant.

No. It is possible that blob_to_mnt() is called by a kernel thread which was
started by init_module() syscall by /sbin/insmod .

> 
> At worst it is a very small race, where someone else in another thread
> starts flushing the file.  Which means the file could still be
> completely closed before exec.   Even that is not necessarily fatal,
> as the usermode driver code has a respawn capability.
> 
> Code that is used enough that it hits that race sounds like a very
> good problem to have from the perspective of the usermode driver code.

In general, unconditionally retrying call_usermodehelper() when it returned
a negative value (e.g. -ENOENT, -ENOMEM, -EBUSY) is bad. I don't know which
code is an implementation of "a respawn capability"; I'd like to check where
that code is and whether that code is checking -ETXTBSY.

