* Re: [PATCH v1 4/5] LSM: Define SELinux function to measure security state
From: Stephen Smalley @ 2020-07-15 18:04 UTC (permalink / raw)
To: Lakshmi Ramasubramanian
Cc: Mimi Zohar, Casey Schaufler, James Morris, linux-integrity,
SElinux list, LSM List, linux-kernel
In-Reply-To: <20200715154853.23374-5-nramas@linux.microsoft.com>
On Wed, Jul 15, 2020 at 11:48 AM Lakshmi Ramasubramanian
<nramas@linux.microsoft.com> wrote:
>
> SELinux configuration and policy are some of the critical data for this
> security module that needs to be measured. To enable this measurement
> SELinux needs to implement the interface function, security_state(), that
> the LSM can call.
>
> Define the security_state() function in SELinux to measure SELinux
> configuration and policy. Call this function to measure SELinux data
> when there is a change in the security module's state.
>
> Sample measurement of SELinux state and hash of the policy:
>
> 10 e32e...5ac3 ima-buf sha256:86e8...4594 selinux-state 656e61626c65643d313b656e666f7263696e673d303b636865636b72657170726f743d313b6e6574706565723d313b6f70656e7065726d3d313b657874736f636b636c6173733d313b616c776179736e6574776f726b3d303b6367726f75707365636c6162656c3d313b6e6e706e6f737569647472616e736974696f6e3d313b67656e66737365636c6162656c73796d6c696e6b3d303b
> 10 f4a7...9408 ima-buf sha256:4941...68fc selinux-policy-hash 8d1d...1834
>
> The data for selinux-state in the above measurement is:
> enabled=1;enforcing=0;checkreqprot=1;netpeer=1;openperm=1;extsockclass=1;alwaysnetwork=0;cgroupseclabel=1;nnpnosuidtransition=1;genfsseclabelsymlink=0;
>
> The data for selinux-policy-hash in the above measurement is
> the SHA256 hash of the SELinux policy.
>
> Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
> Suggested-by: Stephen Smalley <stephen.smalley.work@gmail.com>
> ---
> diff --git a/security/selinux/include/security.h b/security/selinux/include/security.h
> index b0e02cfe3ce1..691c7e35f82a 100644
> --- a/security/selinux/include/security.h
> +++ b/security/selinux/include/security.h
> @@ -222,16 +222,35 @@ static inline bool selinux_policycap_genfs_seclabel_symlinks(void)
> return state->policycap[POLICYDB_CAPABILITY_GENFS_SECLABEL_SYMLINKS];
> }
>
> +static inline bool selinux_checkreqprot(void)
> +{
> + struct selinux_state *state = &selinux_state;
> +
> + return state->checkreqprot;
> +}
Probably should use READ_ONCE().
> diff --git a/security/selinux/measure.c b/security/selinux/measure.c
> new file mode 100644
> index 000000000000..b909e8e61542
> --- /dev/null
> +++ b/security/selinux/measure.c
> @@ -0,0 +1,122 @@
> +int selinux_security_state(void)
Let's call this selinux_measure_state() or similar. Needs a verb. And
pass it a struct selinux_state * pointer argument to be measured, even
though initially it will always be passed &selinux_state. The
encapsulation of selinux state within selinux_state was to support
multiple selinux namespaces in the future, each with their own state.
> +{
> + int rc = 0;
> + int count;
> + size_t buflen;
> + int policy_hash_len;
> + char *state = NULL;
> + void *policy = NULL;
> + void *policy_hash = NULL;
> + static char *security_state_string =
> + "enabled=%d;enforcing=%d;checkreqprot=%d;" \
> + "netpeer=%d;openperm=%d;extsockclass=%d;" \
> + "alwaysnetwork=%d;cgroupseclabel=%d;" \
> + "nnpnosuidtransition=%d;genfsseclabelsymlink=%d;";
Rather than hardcoding policy capability names here, I would recommend
using the selinux_policycap_names[] array for the names and the
selinux_state.policycap[] array for the values. Also recommend
passing in a struct selinux_state * here to allow for future case
where there are multiple selinux states, one per selinux namespace.
> diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
> index ef0afd878bfc..0c289d13ef6a 100644
> --- a/security/selinux/ss/services.c
> +++ b/security/selinux/ss/services.c
> @@ -3724,10 +3724,11 @@ int security_netlbl_sid_to_secattr(struct selinux_state *state,
> * security_read_policy - read the policy.
> * @data: binary policy data
> * @len: length of data in bytes
> - *
> + * @alloc_kernel_memory: flag to indicate memory allocation
> */
> -int security_read_policy(struct selinux_state *state,
> - void **data, size_t *len)
> +int security_read_selinux_policy(struct selinux_state *state,
> + void **data, size_t *len,
> + bool alloc_kernel_memory)
Instead of passing in a boolean to control how the memory is
allocated, split this into a helper function that takes an
already-allocated buffer and two
different front-end wrappers, one for kernel-internal use and one for
userspace use.
> @@ -3738,7 +3739,10 @@ int security_read_policy(struct selinux_state *state,
>
> *len = security_policydb_len(state);
>
> - *data = vmalloc_user(*len);
> + if (alloc_kernel_memory)
> + *data = kzalloc(*len, GFP_KERNEL);
You need vmalloc() since policy image size may exceed kmalloc max (or
at least that used to be the case).
^ permalink raw reply
* RE: [PATCH 7/7] exec: Implement kernel_execve
From: David Laight @ 2020-07-15 16:46 UTC (permalink / raw)
To: 'Kees Cook'
Cc: 'Christoph Hellwig', Eric W. Biederman,
linux-kernel@vger.kernel.org, Linus Torvalds, Andy Lutomirski,
H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Al Viro, Luis Chamberlain, linux-fsdevel@vger.kernel.org,
Tetsuo Handa, linux-security-module@vger.kernel.org,
Serge E. Hallyn, James Morris, Kentaro Takeda, Casey Schaufler,
John Johansen
In-Reply-To: <202007150801.27B6690@keescook>
From: Kees Cook <keescook@chromium.org>
> Sent: 15 July 2020 16:09
>
> On Wed, Jul 15, 2020 at 02:55:50PM +0000, David Laight wrote:
> > From: Christoph Hellwig
> > > Sent: 15 July 2020 07:43
> > > Subject: Re: [PATCH 7/7] exec: Implement kernel_execve
> > >
> > > On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > > > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > > > +static int count_strings_kernel(const char *const *argv)
> > > > > +{
> > > > > + int i;
> > > > > +
> > > > > + if (!argv)
> > > > > + return 0;
> > > > > +
> > > > > + for (i = 0; argv[i]; ++i) {
> > > > > + if (i >= MAX_ARG_STRINGS)
> > > > > + return -E2BIG;
> > > > > + if (fatal_signal_pending(current))
> > > > > + return -ERESTARTNOHAND;
> > > > > + cond_resched();
> > > > > + }
> > > > > + return i;
> > > > > +}
> > > >
> > > > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > > > refactor that too? (And maybe rename it to count_strings_user()?)
> >
> > Thinks....
> > If you setup env[] and argv[] on the new user stack early in exec processing
> > then you may not need any limits at all - except the size of the user stack.
> > Even the get_user() loop will hit an invalid address before the counter
> > wraps (provided it is unsigned long).
>
> *grumpy noises* Yes, but not in practice (if I'm understanding what you
> mean). The expectations of a number of execution environments can be
> really odd-ball. I've tried to collect the notes from over the years in
> prepare_arg_pages()'s comments, and it mostly boils down to "there has
> to be enough room for the exec to start" otherwise the exec ends up in a
> hard-to-debug failure state (i.e. past the "point of no return", where you
> get no useful information about the cause of the SEGV). So the point has
> been to move as many of the setup checks as early as possible and report
> about them if they fail. The argv processing is already very early, but
> it needs to do the limit checks otherwise it'll just break after the exec
> is underway and the process will just SEGV. (And ... some environments
> will attempt to dynamically check the size of the argv space by growing
> until it sees E2BIG, so we can't just remove it and let those hit SEGV.)
Yes - I bet the code is horrid.
I guess the real problem is that you'd need access to the old process's
user addresses and the new process's stack area at the same time.
Unless there is a suitable hole in the old process's address map
any attempted trick will fall foul of cache aliasing on some
architectures - like anything else that does page-loaning.
I'm sure there are hair-brained schemes that might work.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply
* RE: [PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link
From: Nicolas Viennot @ 2020-07-15 15:49 UTC (permalink / raw)
To: Christian Brauner, Adrian Reber
Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
Andrei Vagin, Michał Cłapiński, Kamil Yurtsever,
Dirk Petersen, Christine Flood, Casey Schaufler, Mike Rapoport,
Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn, Stephen Smalley,
Sargun Dhillon, Arnd Bergmann,
linux-security-module@vger.kernel.org,
linux-kernel@vger.kernel.org, selinux@vger.kernel.org, Eric Paris,
Jann Horn, linux-fsdevel@vger.kernel.org
In-Reply-To: <20200715152011.whdeysy3ztqrnocn@wittgenstein>
> On Wed, Jul 15, 2020 at 04:49:53PM +0200, Adrian Reber wrote:
> > From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> >
> > Allow CAP_CHECKPOINT_RESTORE capable users to change /proc/self/exe.
> >
> > This commit also changes the permission error code from -EINVAL to
> > -EPERM for consistency with the rest of the prctl() syscall when
> > checking capabilities.
> I agree that EINVAL seems weird here but this is a potentially user visible change. Might be nice to have the EINVAL->EPERM change be an additional patch on top after this one so we can revert it in case it breaks someone (unlikely though). I can split this out myself though so no need to resend for that alone.
> What I would also prefer is to have some history in the commit message tbh. The reason is that when we started discussing that specific change I had to hunt down the history of changing /proc/self/exe and had to dig up and read through ancient threads on lore to come up with the explanation why this is placed under a capability. The commit message should then also mention that there are other ways to change the /proc/self/exe link that don't require capabilities and that /proc/self/exe itself is not something userspace should rely on for security. Mainly so that in a few months/years we can read through that commit message and go "Weird, but ok.". :)
> But maybe I can just rewrite this myself so you don't have to go through the trouble. This is really not pedantry it's just that it's a lot of work digging up the reasons for a piece of code existing when it's really not obvious. :)
Hello Christian,
I agree.
Thank you for suggesting doing the work, but you've done plenty already. So we'll come back to you with:
1) A separate commit for EINVAL->EPERM
2) A full history of discussions in the commit message related to /proc/self/exe capability check
Thanks,
Nico
^ permalink raw reply
* [PATCH v1 1/5] IMA: Add LSM_STATE func to measure LSM data
From: Lakshmi Ramasubramanian @ 2020-07-15 15:48 UTC (permalink / raw)
To: zohar, stephen.smalley.work, casey
Cc: jmorris, linux-integrity, selinux, linux-security-module,
linux-kernel
In-Reply-To: <20200715154853.23374-1-nramas@linux.microsoft.com>
Critical data structures of security modules need to be measured to
enable an attestation service to verify if the policies and
configuration have been setup correctly and that they haven't been
tampered with at runtime. A new IMA policy is required for handling
this measurement.
Define a new IMA policy func namely LSM_STATE to measure data provided
by security modules. Update ima_match_rules() to check for LSM_STATE
and ima_parse_rule() to handle LSM_STATE.
Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
---
Documentation/ABI/testing/ima_policy | 6 +++++-
security/integrity/ima/ima.h | 1 +
security/integrity/ima/ima_api.c | 2 +-
security/integrity/ima/ima_policy.c | 29 +++++++++++++++++++++++-----
4 files changed, 31 insertions(+), 7 deletions(-)
diff --git a/Documentation/ABI/testing/ima_policy b/Documentation/ABI/testing/ima_policy
index cd572912c593..355bc3eade33 100644
--- a/Documentation/ABI/testing/ima_policy
+++ b/Documentation/ABI/testing/ima_policy
@@ -29,7 +29,7 @@ Description:
base: func:= [BPRM_CHECK][MMAP_CHECK][CREDS_CHECK][FILE_CHECK][MODULE_CHECK]
[FIRMWARE_CHECK]
[KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK]
- [KEXEC_CMDLINE] [KEY_CHECK]
+ [KEXEC_CMDLINE] [KEY_CHECK] [LSM_STATE]
mask:= [[^]MAY_READ] [[^]MAY_WRITE] [[^]MAY_APPEND]
[[^]MAY_EXEC]
fsmagic:= hex value
@@ -125,3 +125,7 @@ Description:
keys added to .builtin_trusted_keys or .ima keyring:
measure func=KEY_CHECK keyrings=.builtin_trusted_keys|.ima
+
+ Example of measure rule using LSM_STATE to measure LSM data:
+
+ measure func=LSM_STATE
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 4515975cc540..880fda11a61b 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -200,6 +200,7 @@ static inline unsigned int ima_hash_key(u8 *digest)
hook(POLICY_CHECK, policy) \
hook(KEXEC_CMDLINE, kexec_cmdline) \
hook(KEY_CHECK, key) \
+ hook(LSM_STATE, lsm_state) \
hook(MAX_CHECK, none)
#define __ima_hook_enumify(ENUM, str) ENUM,
diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index bf22de8b7ce0..0cebd2404dcf 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -176,7 +176,7 @@ void ima_add_violation(struct file *file, const unsigned char *filename,
* subj=, obj=, type=, func=, mask=, fsmagic=
* subj,obj, and type: are LSM specific.
* func: FILE_CHECK | BPRM_CHECK | CREDS_CHECK | MMAP_CHECK | MODULE_CHECK
- * | KEXEC_CMDLINE | KEY_CHECK
+ * | KEXEC_CMDLINE | KEY_CHECK | LSM_STATE
* mask: contains the permission mask
* fsmagic: hex value
*
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index 66aa3e17a888..fc8457d9242b 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -417,15 +417,31 @@ static bool ima_match_rules(struct ima_rule_entry *rule, struct inode *inode,
const char *keyring)
{
int i;
+ int funcmatch = 0;
- if ((func == KEXEC_CMDLINE) || (func == KEY_CHECK)) {
+ switch (func) {
+ case KEXEC_CMDLINE:
+ case KEY_CHECK:
+ case LSM_STATE:
if ((rule->flags & IMA_FUNC) && (rule->func == func)) {
if (func == KEY_CHECK)
- return ima_match_keyring(rule, keyring, cred);
- return true;
- }
- return false;
+ funcmatch = ima_match_keyring(rule, keyring,
+ cred) ? 1 : -1;
+ else
+ funcmatch = 1;
+ } else
+ funcmatch = -1;
+
+ break;
+
+ default:
+ funcmatch = 0;
+ break;
}
+
+ if (funcmatch)
+ return (funcmatch == 1) ? true : false;
+
if ((rule->flags & IMA_FUNC) &&
(rule->func != func && func != POST_SETATTR))
return false;
@@ -1068,6 +1084,9 @@ static int ima_parse_rule(char *rule, struct ima_rule_entry *entry)
entry->func = KEXEC_CMDLINE;
else if (strcmp(args[0].from, "KEY_CHECK") == 0)
entry->func = KEY_CHECK;
+ else if (strcmp(args[0].from, "LSM_STATE") == 0)
+ entry->func = LSM_STATE;
+
else
result = -EINVAL;
if (!result)
--
2.27.0
^ permalink raw reply related
* [PATCH v1 2/5] IMA: Define an IMA hook to measure LSM data
From: Lakshmi Ramasubramanian @ 2020-07-15 15:48 UTC (permalink / raw)
To: zohar, stephen.smalley.work, casey
Cc: jmorris, linux-integrity, selinux, linux-security-module,
linux-kernel
In-Reply-To: <20200715154853.23374-1-nramas@linux.microsoft.com>
IMA subsystem needs to define an IMA hook that the security modules can
call to measure critical data of the security modules.
Define a new IMA hook, namely ima_lsm_state(), that the security modules
can call to measure data.
Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
---
include/linux/ima.h | 4 ++++
security/integrity/ima/ima_main.c | 17 +++++++++++++++++
2 files changed, 21 insertions(+)
diff --git a/include/linux/ima.h b/include/linux/ima.h
index 9164e1534ec9..7e2686f4953a 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -26,6 +26,7 @@ extern int ima_post_read_file(struct file *file, void *buf, loff_t size,
extern void ima_post_path_mknod(struct dentry *dentry);
extern int ima_file_hash(struct file *file, char *buf, size_t buf_size);
extern void ima_kexec_cmdline(const void *buf, int size);
+extern void ima_lsm_state(const char *lsm_event_name, const void *buf, int size);
#ifdef CONFIG_IMA_KEXEC
extern void ima_add_kexec_buffer(struct kimage *image);
@@ -104,6 +105,9 @@ static inline int ima_file_hash(struct file *file, char *buf, size_t buf_size)
}
static inline void ima_kexec_cmdline(const void *buf, int size) {}
+
+static inline void ima_lsm_state(const char *lsm_event_name,
+ const void *buf, int size) {}
#endif /* CONFIG_IMA */
#ifndef CONFIG_IMA_KEXEC
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 8351b2fd48e0..04d9a1d35300 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -835,6 +835,23 @@ void ima_kexec_cmdline(const void *buf, int size)
KEXEC_CMDLINE, 0, NULL);
}
+/**
+ * ima_lsm_state - measure LSM specific state
+ * @lsm_event_name: LSM event
+ * @buf: pointer to buffer containing LSM specific state
+ * @size: Number of bytes in buf
+ *
+ * Buffers can only be measured, not appraised.
+ */
+void ima_lsm_state(const char *lsm_event_name, const void *buf, int size)
+{
+ if (!lsm_event_name || !buf || !size)
+ return;
+
+ process_buffer_measurement(buf, size, lsm_event_name,
+ LSM_STATE, 0, NULL);
+}
+
static int __init init_ima(void)
{
int error;
--
2.27.0
^ permalink raw reply related
* [PATCH v1 5/5] LSM: Define workqueue for measuring security module state
From: Lakshmi Ramasubramanian @ 2020-07-15 15:48 UTC (permalink / raw)
To: zohar, stephen.smalley.work, casey
Cc: jmorris, linux-integrity, selinux, linux-security-module,
linux-kernel
In-Reply-To: <20200715154853.23374-1-nramas@linux.microsoft.com>
Data structures critical to the functioning of a security module could
be tampered with by malware or changed inadvertently at runtime
thereby disabling or reducing the security guarantees provided by
the security module. Such critical data need to be periodically checked
and measured, if there is any change. This would enable an attestation
service, for instance, to verify that the security modules are operating
with the configuration and policy setup by the system administrator.
Define a workqueue in the LSM and invoke the security modules in
the workqueue handler to check their data and measure.
Note that the data given by the security module would be measured by
the IMA subsystem only if it has changed since the last time it was
measured.
Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
---
security/security.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/security/security.c b/security/security.c
index 3fbabc2e6ddb..6a0ceff815a2 100644
--- a/security/security.c
+++ b/security/security.c
@@ -89,6 +89,11 @@ static __initdata struct lsm_info *exclusive;
static struct lsm_info *security_state_lsms;
static int security_state_lsms_count;
+static long security_state_timeout = 300000; /* 5 Minutes */
+static void security_state_handler(struct work_struct *work);
+static DECLARE_DELAYED_WORK(security_state_delayed_work,
+ security_state_handler);
+
static __initdata bool debug;
#define init_debug(...) \
do { \
@@ -277,6 +282,26 @@ static void __init initialize_security_state_lsms(void)
security_state_lsms_count = count;
}
+static void initialize_security_state_monitor(void)
+{
+ if (security_state_lsms_count == 0)
+ return;
+
+ schedule_delayed_work(&security_state_delayed_work,
+ msecs_to_jiffies(security_state_timeout));
+}
+
+static void security_state_handler(struct work_struct *work)
+{
+ int inx;
+
+ for (inx = 0; inx < security_state_lsms_count; inx++)
+ measure_security_state(&(security_state_lsms[inx]));
+
+ schedule_delayed_work(&security_state_delayed_work,
+ msecs_to_jiffies(security_state_timeout));
+}
+
/* Populate ordered LSMs list from comma-separated LSM name list. */
static void __init ordered_lsm_parse(const char *order, const char *origin)
{
@@ -400,6 +425,7 @@ static void __init ordered_lsm_init(void)
}
initialize_security_state_lsms();
+ initialize_security_state_monitor();
kfree(ordered_lsms);
}
--
2.27.0
^ permalink raw reply related
* [PATCH v1 3/5] LSM: Add security_state function pointer in lsm_info struct
From: Lakshmi Ramasubramanian @ 2020-07-15 15:48 UTC (permalink / raw)
To: zohar, stephen.smalley.work, casey
Cc: jmorris, linux-integrity, selinux, linux-security-module,
linux-kernel
In-Reply-To: <20200715154853.23374-1-nramas@linux.microsoft.com>
The security modules that require their data to be measured using
the IMA subsystem need to define a function that the LSM can call
to trigger the measurement.
Add a function pointer field namely security_state in lsm_info structure.
Update LSM to call this security module function, if defined, to measure
the security module's data using the IMA subsystem.
Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
---
include/linux/lsm_hooks.h | 3 +++
security/security.c | 48 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 50 insertions(+), 1 deletion(-)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 95b7c1d32062..a688eb29e5d8 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1591,6 +1591,9 @@ struct lsm_info {
int *enabled; /* Optional: controlled by CONFIG_LSM */
int (*init)(void); /* Required. */
struct lsm_blob_sizes *blobs; /* Optional: for blob sharing. */
+ int (*security_state)(void); /* Optional: for measuring the state
+ * of the security module.
+ */
};
extern struct lsm_info __start_lsm_info[], __end_lsm_info[];
diff --git a/security/security.c b/security/security.c
index 70a7ad357bc6..3fbabc2e6ddb 100644
--- a/security/security.c
+++ b/security/security.c
@@ -86,6 +86,9 @@ static __initconst const char * const builtin_lsm_order = CONFIG_LSM;
static __initdata struct lsm_info **ordered_lsms;
static __initdata struct lsm_info *exclusive;
+static struct lsm_info *security_state_lsms;
+static int security_state_lsms_count;
+
static __initdata bool debug;
#define init_debug(...) \
do { \
@@ -235,6 +238,45 @@ static void __init initialize_lsm(struct lsm_info *lsm)
}
}
+static int measure_security_state(struct lsm_info *lsm)
+{
+ if (!lsm->security_state)
+ return 0;
+
+ return lsm->security_state();
+}
+
+static void __init initialize_security_state_lsms(void)
+{
+ struct lsm_info **lsm;
+ int count = 0;
+ int inx;
+
+ for (lsm = ordered_lsms; *lsm; lsm++) {
+ if ((*lsm)->security_state)
+ count++;
+ }
+
+ if (count == 0)
+ return;
+
+ security_state_lsms = kcalloc(count, sizeof(struct lsm_info),
+ GFP_KERNEL);
+ if (!security_state_lsms)
+ return;
+
+ inx = 0;
+ for (lsm = ordered_lsms; *lsm; lsm++) {
+ if ((*lsm)->security_state) {
+ security_state_lsms[inx].security_state =
+ (*lsm)->security_state;
+ inx++;
+ }
+ }
+
+ security_state_lsms_count = count;
+}
+
/* Populate ordered LSMs list from comma-separated LSM name list. */
static void __init ordered_lsm_parse(const char *order, const char *origin)
{
@@ -352,8 +394,12 @@ static void __init ordered_lsm_init(void)
lsm_early_cred((struct cred *) current->cred);
lsm_early_task(current);
- for (lsm = ordered_lsms; *lsm; lsm++)
+ for (lsm = ordered_lsms; *lsm; lsm++) {
initialize_lsm(*lsm);
+ measure_security_state(*lsm);
+ }
+
+ initialize_security_state_lsms();
kfree(ordered_lsms);
}
--
2.27.0
^ permalink raw reply related
* [PATCH v1 4/5] LSM: Define SELinux function to measure security state
From: Lakshmi Ramasubramanian @ 2020-07-15 15:48 UTC (permalink / raw)
To: zohar, stephen.smalley.work, casey
Cc: jmorris, linux-integrity, selinux, linux-security-module,
linux-kernel
In-Reply-To: <20200715154853.23374-1-nramas@linux.microsoft.com>
SELinux configuration and policy are some of the critical data for this
security module that needs to be measured. To enable this measurement
SELinux needs to implement the interface function, security_state(), that
the LSM can call.
Define the security_state() function in SELinux to measure SELinux
configuration and policy. Call this function to measure SELinux data
when there is a change in the security module's state.
Sample measurement of SELinux state and hash of the policy:
10 e32e...5ac3 ima-buf sha256:86e8...4594 selinux-state 656e61626c65643d313b656e666f7263696e673d303b636865636b72657170726f743d313b6e6574706565723d313b6f70656e7065726d3d313b657874736f636b636c6173733d313b616c776179736e6574776f726b3d303b6367726f75707365636c6162656c3d313b6e6e706e6f737569647472616e736974696f6e3d313b67656e66737365636c6162656c73796d6c696e6b3d303b
10 f4a7...9408 ima-buf sha256:4941...68fc selinux-policy-hash 8d1d...1834
The data for selinux-state in the above measurement is:
enabled=1;enforcing=0;checkreqprot=1;netpeer=1;openperm=1;extsockclass=1;alwaysnetwork=0;cgroupseclabel=1;nnpnosuidtransition=1;genfsseclabelsymlink=0;
The data for selinux-policy-hash in the above measurement is
the SHA256 hash of the SELinux policy.
Signed-off-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
Suggested-by: Stephen Smalley <stephen.smalley.work@gmail.com>
---
security/selinux/Makefile | 2 +
security/selinux/hooks.c | 2 +
security/selinux/include/security.h | 19 +++++
security/selinux/measure.c | 122 ++++++++++++++++++++++++++++
security/selinux/selinuxfs.c | 1 +
security/selinux/ss/services.c | 23 +++++-
6 files changed, 165 insertions(+), 4 deletions(-)
create mode 100644 security/selinux/measure.c
diff --git a/security/selinux/Makefile b/security/selinux/Makefile
index 4d8e0e8adf0b..83d512116341 100644
--- a/security/selinux/Makefile
+++ b/security/selinux/Makefile
@@ -16,6 +16,8 @@ selinux-$(CONFIG_NETLABEL) += netlabel.o
selinux-$(CONFIG_SECURITY_INFINIBAND) += ibpkey.o
+selinux-$(CONFIG_IMA) += measure.o
+
ccflags-y := -I$(srctree)/security/selinux -I$(srctree)/security/selinux/include
$(addprefix $(obj)/,$(selinux-y)): $(obj)/flask.h
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index efa6108b1ce9..1f270206f0ee 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -7284,6 +7284,7 @@ DEFINE_LSM(selinux) = {
.enabled = &selinux_enabled_boot,
.blobs = &selinux_blob_sizes,
.init = selinux_init,
+ .security_state = selinux_security_state,
};
#if defined(CONFIG_NETFILTER)
@@ -7394,6 +7395,7 @@ int selinux_disable(struct selinux_state *state)
}
selinux_mark_disabled(state);
+ selinux_security_state();
pr_info("SELinux: Disabled at runtime.\n");
diff --git a/security/selinux/include/security.h b/security/selinux/include/security.h
index b0e02cfe3ce1..691c7e35f82a 100644
--- a/security/selinux/include/security.h
+++ b/security/selinux/include/security.h
@@ -222,16 +222,35 @@ static inline bool selinux_policycap_genfs_seclabel_symlinks(void)
return state->policycap[POLICYDB_CAPABILITY_GENFS_SECLABEL_SYMLINKS];
}
+static inline bool selinux_checkreqprot(void)
+{
+ struct selinux_state *state = &selinux_state;
+
+ return state->checkreqprot;
+}
+
int security_mls_enabled(struct selinux_state *state);
int security_load_policy(struct selinux_state *state,
void *data, size_t len);
int security_read_policy(struct selinux_state *state,
void **data, size_t *len);
+int security_read_selinux_policy(struct selinux_state *state,
+ void **data, size_t *len,
+ bool alloc_kernel_memory);
size_t security_policydb_len(struct selinux_state *state);
int security_policycap_supported(struct selinux_state *state,
unsigned int req_cap);
+#ifdef CONFIG_IMA
+extern int selinux_security_state(void);
+#else
+static inline int selinux_security_state(void)
+{
+ return 0;
+}
+#endif
+
#define SEL_VEC_MAX 32
struct av_decision {
u32 allowed;
diff --git a/security/selinux/measure.c b/security/selinux/measure.c
new file mode 100644
index 000000000000..b909e8e61542
--- /dev/null
+++ b/security/selinux/measure.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Measure SELinux state using IMA subsystem.
+ */
+#include <linux/ima.h>
+#include "security.h"
+
+static int selinux_hash_policy(const char *hash_alg_name,
+ void *policy, size_t policy_len,
+ void **policy_hash, int *policy_hash_len)
+{
+ struct crypto_shash *tfm;
+ struct shash_desc *desc = NULL;
+ void *digest = NULL;
+ int desc_size;
+ int digest_size;
+ int ret = 0;
+
+ tfm = crypto_alloc_shash(hash_alg_name, 0, 0);
+ if (IS_ERR(tfm))
+ return PTR_ERR(tfm);
+
+ desc_size = crypto_shash_descsize(tfm) + sizeof(*desc);
+ digest_size = crypto_shash_digestsize(tfm);
+
+ digest = kmalloc(digest_size, GFP_KERNEL);
+ if (!digest) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ desc = kzalloc(desc_size, GFP_KERNEL);
+ if (!desc) {
+ ret = -ENOMEM;
+ goto error;
+ }
+
+ desc->tfm = tfm;
+
+ ret = crypto_shash_digest(desc, policy, policy_len, digest);
+ if (ret < 0)
+ goto error;
+
+ *policy_hash_len = digest_size;
+ *policy_hash = digest;
+ digest = NULL;
+
+error:
+ kfree(desc);
+ kfree(digest);
+
+ crypto_free_shash(tfm);
+
+ if (ret)
+ pr_err("%s: error %d\n", __func__, ret);
+
+ return ret;
+}
+
+int selinux_security_state(void)
+{
+ int rc = 0;
+ int count;
+ size_t buflen;
+ int policy_hash_len;
+ char *state = NULL;
+ void *policy = NULL;
+ void *policy_hash = NULL;
+ static char *security_state_string =
+ "enabled=%d;enforcing=%d;checkreqprot=%d;" \
+ "netpeer=%d;openperm=%d;extsockclass=%d;" \
+ "alwaysnetwork=%d;cgroupseclabel=%d;" \
+ "nnpnosuidtransition=%d;genfsseclabelsymlink=%d;";
+
+ if (!selinux_initialized(&selinux_state))
+ return -EOPNOTSUPP;
+
+ buflen = strlen(security_state_string) + 1;
+ state = kzalloc(buflen, GFP_KERNEL);
+ if (!state) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ count = snprintf(state, buflen, security_state_string,
+ !selinux_disabled(&selinux_state),
+ enforcing_enabled(&selinux_state),
+ selinux_checkreqprot(),
+ selinux_policycap_netpeer(),
+ selinux_policycap_openperm(),
+ selinux_policycap_extsockclass(),
+ selinux_policycap_alwaysnetwork(),
+ selinux_policycap_cgroupseclabel(),
+ selinux_policycap_nnp_nosuid_transition(),
+ selinux_policycap_genfs_seclabel_symlinks());
+ if (count >= 0 && count < buflen)
+ ima_lsm_state("selinux-state", state, count);
+ else {
+ pr_err("selinux state error: %d\n", count);
+ rc = -EINVAL;
+ goto out;
+ }
+
+ rc = security_read_selinux_policy(&selinux_state,
+ &policy, &buflen, true);
+ if (!rc)
+ rc = selinux_hash_policy("sha256", policy, buflen,
+ &policy_hash, &policy_hash_len);
+ if (!rc)
+ ima_lsm_state("selinux-policy-hash", policy_hash,
+ policy_hash_len);
+
+out:
+ kfree(state);
+ kfree(policy);
+ kfree(policy_hash);
+
+ if (rc)
+ pr_err("%s: error %d\n", __func__, rc);
+
+ return rc;
+}
diff --git a/security/selinux/selinuxfs.c b/security/selinux/selinuxfs.c
index 4781314c2510..8f68964e4ba8 100644
--- a/security/selinux/selinuxfs.c
+++ b/security/selinux/selinuxfs.c
@@ -173,6 +173,7 @@ static ssize_t sel_write_enforce(struct file *file, const char __user *buf,
from_kuid(&init_user_ns, audit_get_loginuid(current)),
audit_get_sessionid(current));
enforcing_set(state, new_value);
+ selinux_security_state();
if (new_value)
avc_ss_reset(state->avc, 0);
selnl_notify_setenforce(new_value);
diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index ef0afd878bfc..0c289d13ef6a 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -3724,10 +3724,11 @@ int security_netlbl_sid_to_secattr(struct selinux_state *state,
* security_read_policy - read the policy.
* @data: binary policy data
* @len: length of data in bytes
- *
+ * @alloc_kernel_memory: flag to indicate memory allocation
*/
-int security_read_policy(struct selinux_state *state,
- void **data, size_t *len)
+int security_read_selinux_policy(struct selinux_state *state,
+ void **data, size_t *len,
+ bool alloc_kernel_memory)
{
struct policydb *policydb = &state->ss->policydb;
int rc;
@@ -3738,7 +3739,10 @@ int security_read_policy(struct selinux_state *state,
*len = security_policydb_len(state);
- *data = vmalloc_user(*len);
+ if (alloc_kernel_memory)
+ *data = kzalloc(*len, GFP_KERNEL);
+ else
+ *data = vmalloc_user(*len);
if (!*data)
return -ENOMEM;
@@ -3754,5 +3758,16 @@ int security_read_policy(struct selinux_state *state,
*len = (unsigned long)fp.data - (unsigned long)*data;
return 0;
+}
+/**
+ * security_read_policy - read the policy.
+ * @data: binary policy data
+ * @len: length of data in bytes
+ *
+ */
+int security_read_policy(struct selinux_state *state,
+ void **data, size_t *len)
+{
+ return security_read_selinux_policy(state, data, len, false);
}
--
2.27.0
^ permalink raw reply related
* [PATCH v1 0/5] LSM: Measure security module state
From: Lakshmi Ramasubramanian @ 2020-07-15 15:48 UTC (permalink / raw)
To: zohar, stephen.smalley.work, casey
Cc: jmorris, linux-integrity, selinux, linux-security-module,
linux-kernel
Critical data structures of security modules are currently not measured.
Therefore an attestation service, for instance, would not be able to
attest whether the security modules are always operating with the policies
and configuration that the system administrator had setup. The policies
and configuration for the security modules could be tampered with by
malware by exploiting Kernel vulnerabilities or modified through some
inadvertent actions on the system. Measuring such critical data would
enable an attestation service to better assess the state of the system.
IMA subsystem measures system files, command line arguments passed to
kexec, boot aggregate, keys, etc. It can be used to measure critical
data structures of security modules as well.
This change aims to address measuring critical data structures
of security modules when they are initialized, when they are updated
at runtime, and also periodically to detect any tampering.
This change set is based off of Linux Kernel version 5.8-rc5.
The following patch needs to be applied first before applying
the patches in this patch set:
https://patchwork.kernel.org/patch/11612989/
Change log:
v1:
=> Per Stephen Smalley's suggestion added selinux_state booleans
and hash of SELinux policy in the measured data for SELinux.
=> Call IMA hook from the security module directly instead of
redirecting through the LSM.
Lakshmi Ramasubramanian (5):
IMA: Add LSM_STATE func to measure LSM data
IMA: Define an IMA hook to measure LSM data
LSM: Add security_state function pointer in lsm_info struct
LSM: Define SELinux function to measure security state
LSM: Define workqueue for measuring security module state
Documentation/ABI/testing/ima_policy | 6 +-
include/linux/ima.h | 4 +
include/linux/lsm_hooks.h | 3 +
security/integrity/ima/ima.h | 1 +
security/integrity/ima/ima_api.c | 2 +-
security/integrity/ima/ima_main.c | 17 ++++
security/integrity/ima/ima_policy.c | 29 +++++--
security/security.c | 74 +++++++++++++++-
security/selinux/Makefile | 2 +
security/selinux/hooks.c | 2 +
security/selinux/include/security.h | 19 +++++
security/selinux/measure.c | 122 +++++++++++++++++++++++++++
security/selinux/selinuxfs.c | 1 +
security/selinux/ss/services.c | 23 ++++-
14 files changed, 293 insertions(+), 12 deletions(-)
create mode 100644 security/selinux/measure.c
--
2.27.0
^ permalink raw reply
* Re: [PATCH v5 6/6] selftests: add clone3() CAP_CHECKPOINT_RESTORE test
From: Christian Brauner @ 2020-07-15 15:24 UTC (permalink / raw)
To: Adrian Reber
Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-7-areber@redhat.com>
On Wed, Jul 15, 2020 at 04:49:54PM +0200, Adrian Reber wrote:
> This adds a test that changes its UID, uses capabilities to
> get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
> create a process with a given PID as non-root.
>
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> ---
You need to add the new clone3_cap_checkpoint_restore binary to
.gitignore too. :)
And one more annoying request: can you port the selftests to use the
kselftest_harness.h infrastructure, please? It is way nicer to use, has
a more uniform feel, and generates better output. You can look at:
tools/testing/selftests/pidfd/pidfd_setns_test.c
tools/testing/selftests/seccomp/seccomp_bpf.c
and others for examples on how to use it.
Thanks!
Christian
> tools/testing/selftests/clone3/Makefile | 4 +-
> .../clone3/clone3_cap_checkpoint_restore.c | 203 ++++++++++++++++++
> 2 files changed, 206 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
>
> diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
> index cf976c732906..ef7564cb7abe 100644
> --- a/tools/testing/selftests/clone3/Makefile
> +++ b/tools/testing/selftests/clone3/Makefile
> @@ -1,6 +1,8 @@
> # SPDX-License-Identifier: GPL-2.0
> CFLAGS += -g -I../../../../usr/include/
> +LDLIBS += -lcap
>
> -TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
> +TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
> + clone3_cap_checkpoint_restore
>
> include ../lib.mk
> diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> new file mode 100644
> index 000000000000..2cc3d57b91f2
> --- /dev/null
> +++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
> @@ -0,0 +1,203 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Based on Christian Brauner's clone3() example.
> + * These tests are assuming to be running in the host's
> + * PID namespace.
> + */
> +
> +/* capabilities related code based on selftests/bpf/test_verifier.c */
> +
> +#define _GNU_SOURCE
> +#include <errno.h>
> +#include <linux/types.h>
> +#include <linux/sched.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdbool.h>
> +#include <sys/capability.h>
> +#include <sys/prctl.h>
> +#include <sys/syscall.h>
> +#include <sys/types.h>
> +#include <sys/un.h>
> +#include <sys/wait.h>
> +#include <unistd.h>
> +#include <sched.h>
> +
> +#include "../kselftest.h"
> +#include "clone3_selftests.h"
> +
> +#ifndef MAX_PID_NS_LEVEL
> +#define MAX_PID_NS_LEVEL 32
> +#endif
> +
> +static void child_exit(int ret)
> +{
> + fflush(stdout);
> + fflush(stderr);
> + _exit(ret);
> +}
> +
> +static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
> +{
> + int status;
> + pid_t pid = -1;
> +
> + struct clone_args args = {
> + .exit_signal = SIGCHLD,
> + .set_tid = ptr_to_u64(set_tid),
> + .set_tid_size = set_tid_size,
> + };
> +
> + pid = sys_clone3(&args, sizeof(struct clone_args));
> + if (pid < 0) {
> + ksft_print_msg("%s - Failed to create new process\n",
> + strerror(errno));
> + return -errno;
> + }
> +
> + if (pid == 0) {
> + int ret;
> + char tmp = 0;
> +
> + ksft_print_msg
> + ("I am the child, my PID is %d (expected %d)\n",
> + getpid(), set_tid[0]);
> +
> + if (set_tid[0] != getpid())
> + child_exit(EXIT_FAILURE);
> + child_exit(EXIT_SUCCESS);
> + }
> +
> + ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
> + getpid(), pid);
> +
> + if (waitpid(pid, &status, 0) < 0) {
> + ksft_print_msg("Child returned %s\n", strerror(errno));
> + return -errno;
> + }
> +
> + if (!WIFEXITED(status))
> + return -1;
> +
> + return WEXITSTATUS(status);
> +}
> +
> +static int test_clone3_set_tid(pid_t * set_tid,
> + size_t set_tid_size, int expected)
> +{
> + int ret;
> +
> + ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
> + getpid(), set_tid[0]);
> + ret = call_clone3_set_tid(set_tid, set_tid_size);
> +
> + ksft_print_msg
> + ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
> + getpid(), set_tid[0], ret, expected);
> + if (ret != expected) {
> + ksft_test_result_fail
> + ("[%d] Result (%d) is different than expected (%d)\n",
> + getpid(), ret, expected);
> + return -1;
> + }
> + ksft_test_result_pass
> + ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
> + expected);
> +
> + return 0;
> +}
> +
> +struct libcap {
> + struct __user_cap_header_struct hdr;
> + struct __user_cap_data_struct data[2];
> +};
> +
> +static int set_capability()
> +{
> + cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
> + struct libcap *cap;
> + int ret = -1;
> + cap_t caps;
> +
> + caps = cap_get_proc();
> + if (!caps) {
> + perror("cap_get_proc");
> + return -1;
> + }
> +
> + /* Drop all capabilities */
> + if (cap_clear(caps)) {
> + perror("cap_clear");
> + goto out;
> + }
> +
> + cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
> + cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
> +
> + cap = (struct libcap *) caps;
> +
> + /* 40 -> CAP_CHECKPOINT_RESTORE */
> + cap->data[1].effective |= 1 << (40 - 32);
> + cap->data[1].permitted |= 1 << (40 - 32);
> +
> + if (cap_set_proc(caps)) {
> + perror("cap_set_proc");
> + goto out;
> + }
> + ret = 0;
> +out:
> + if (cap_free(caps))
> + perror("cap_free");
> + return ret;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + pid_t pid;
> + int status;
> + int ret = 0;
> + pid_t set_tid[1];
> + uid_t uid = getuid();
> +
> + ksft_print_header();
> + test_clone3_supported();
> + ksft_set_plan(2);
> +
> + if (uid != 0) {
> + ksft_cnt.ksft_xskip = ksft_plan;
> + ksft_print_msg("Skipping all tests as non-root\n");
> + return ksft_exit_pass();
> + }
> +
> + memset(&set_tid, 0, sizeof(set_tid));
> +
> + /* Find the current active PID */
> + pid = fork();
> + if (pid == 0) {
> + ksft_print_msg("Child has PID %d\n", getpid());
> + child_exit(EXIT_SUCCESS);
> + }
> + if (waitpid(pid, &status, 0) < 0)
> + ksft_exit_fail_msg("Waiting for child %d failed", pid);
> +
> + /* After the child has finished, its PID should be free. */
> + set_tid[0] = pid;
> +
> + if (set_capability())
> + ksft_test_result_fail
> + ("Could not set CAP_CHECKPOINT_RESTORE\n");
> + prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
> + /* This would fail without CAP_CHECKPOINT_RESTORE */
> + setgid(1000);
> + setuid(1000);
> + set_tid[0] = pid;
> + ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
> + if (set_capability())
> + ksft_test_result_fail
> + ("Could not set CAP_CHECKPOINT_RESTORE\n");
> + /* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
> + ret |= test_clone3_set_tid(set_tid, 1, 0);
> +
> + return !ret ? ksft_exit_pass() : ksft_exit_fail();
> +}
> --
> 2.26.2
>
^ permalink raw reply
* Re: [PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link
From: Christian Brauner @ 2020-07-15 15:20 UTC (permalink / raw)
To: Adrian Reber
Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-6-areber@redhat.com>
On Wed, Jul 15, 2020 at 04:49:53PM +0200, Adrian Reber wrote:
> From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
>
> Allow CAP_CHECKPOINT_RESTORE capable users to change /proc/self/exe.
>
> This commit also changes the permission error code from -EINVAL to
> -EPERM for consistency with the rest of the prctl() syscall when
> checking capabilities.
I agree that EINVAL seems weird here but this is a potentially user
visible change. Might be nice to have the EINVAL->EPERM change be an
additional patch on top after this one so we can revert it in case it
breaks someone (unlikely though). I can split this out myself though so
no need to resend for that alone.
What I would also prefer is to have some history in the commit message
tbh. The reason is that when we started discussing that specific change
I had to hunt down the history of changing /proc/self/exe and had to
dig up and read through ancient threads on lore to come up with the
explanation why this is placed under a capability. The commit message
should then also mention that there are other ways to change the
/proc/self/exe link that don't require capabilities and that
/proc/self/exe itself is not something userspace should rely on for
security. Mainly so that in a few months/years we can read through that
commit message and go "Weird, but ok.". :)
But maybe I can just rewrite this myself so you don't have to go through
the trouble. This is really not pedantry it's just that it's a lot of
work digging up the reasons for a piece of code existing when it's
really not obvious. :)
Christian
>
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> Signed-off-by: Adrian Reber <areber@redhat.com>
> ---
> kernel/sys.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 00a96746e28a..dd59b9142b1d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2007,12 +2007,14 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>
> if (prctl_map.exe_fd != (u32)-1) {
> /*
> - * Make sure the caller has the rights to
> - * change /proc/pid/exe link: only local sys admin should
> - * be allowed to.
> + * Check if the current user is checkpoint/restore capable.
> + * At the time of this writing, it checks for CAP_SYS_ADMIN
> + * or CAP_CHECKPOINT_RESTORE.
> + * Note that a user with access to ptrace can masquerade an
> + * arbitrary program as any executable, even setuid ones.
> */
> - if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> - return -EINVAL;
> + if (!checkpoint_restore_ns_capable(current_user_ns()))
> + return -EPERM;
>
> error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
> if (error)
> --
> 2.26.2
>
^ permalink raw reply
* Re: [PATCH 7/7] exec: Implement kernel_execve
From: Kees Cook @ 2020-07-15 15:09 UTC (permalink / raw)
To: David Laight
Cc: 'Christoph Hellwig', Eric W. Biederman,
linux-kernel@vger.kernel.org, Linus Torvalds, Andy Lutomirski,
H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Al Viro, Luis Chamberlain, linux-fsdevel@vger.kernel.org,
Tetsuo Handa, linux-security-module@vger.kernel.org,
Serge E. Hallyn, James Morris, Kentaro Takeda, Casey Schaufler,
John Johansen
In-Reply-To: <d6d204c4427b49f6b24ac24bf1082fa4@AcuMS.aculab.com>
On Wed, Jul 15, 2020 at 02:55:50PM +0000, David Laight wrote:
> From: Christoph Hellwig
> > Sent: 15 July 2020 07:43
> > Subject: Re: [PATCH 7/7] exec: Implement kernel_execve
> >
> > On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > > +static int count_strings_kernel(const char *const *argv)
> > > > +{
> > > > + int i;
> > > > +
> > > > + if (!argv)
> > > > + return 0;
> > > > +
> > > > + for (i = 0; argv[i]; ++i) {
> > > > + if (i >= MAX_ARG_STRINGS)
> > > > + return -E2BIG;
> > > > + if (fatal_signal_pending(current))
> > > > + return -ERESTARTNOHAND;
> > > > + cond_resched();
> > > > + }
> > > > + return i;
> > > > +}
> > >
> > > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > > refactor that too? (And maybe rename it to count_strings_user()?)
>
> Thinks....
> If you setup env[] and argv[] on the new user stack early in exec processing
> then you may not need any limits at all - except the size of the user stack.
> Even the get_user() loop will hit an invalid address before the counter
> wraps (provided it is unsigned long).
*grumpy noises* Yes, but not in practice (if I'm understanding what you
mean). The expectations of a number of execution environments can be
really odd-ball. I've tried to collect the notes from over the years in
prepare_arg_pages()'s comments, and it mostly boils down to "there has
to be enough room for the exec to start" otherwise the exec ends up in a
hard-to-debug failure state (i.e. past the "point of no return", where you
get no useful information about the cause of the SEGV). So the point has
been to move as many of the setup checks as early as possible and report
about them if they fail. The argv processing is already very early, but
it needs to do the limit checks otherwise it'll just break after the exec
is underway and the process will just SEGV. (And ... some environments
will attempt to dynamically check the size of the argv space by growing
until it sees E2BIG, so we can't just remove it and let those hit SEGV.)
--
Kees Cook
^ permalink raw reply
* Re: [PATCH v5 3/6] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
From: Christian Brauner @ 2020-07-15 15:08 UTC (permalink / raw)
To: Adrian Reber
Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-4-areber@redhat.com>
On Wed, Jul 15, 2020 at 04:49:51PM +0200, Adrian Reber wrote:
> Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
> writing to ns_last_pid.
>
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> ---
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
^ permalink raw reply
* Re: [PATCH v5 2/6] pid: use checkpoint_restore_ns_capable() for set_tid
From: Christian Brauner @ 2020-07-15 15:08 UTC (permalink / raw)
To: Adrian Reber
Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-3-areber@redhat.com>
On Wed, Jul 15, 2020 at 04:49:50PM +0200, Adrian Reber wrote:
> Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
> using clone3() with set_tid set.
>
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> ---
Looks good!
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
> kernel/pid.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index de9d29c41d77..a9cbab0194d9 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -199,7 +199,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> if (tid != 1 && !tmp->child_reaper)
> goto out_free;
> retval = -EPERM;
> - if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
> + if (!checkpoint_restore_ns_capable(tmp->user_ns))
> goto out_free;
> set_tid_size--;
> }
> --
> 2.26.2
>
^ permalink raw reply
* Re: [PATCH v5 1/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE
From: Christian Brauner @ 2020-07-15 15:06 UTC (permalink / raw)
To: Adrian Reber
Cc: Eric Biederman, Pavel Emelyanov, Oleg Nesterov, Dmitry Safonov,
Andrei Vagin, Nicolas Viennot, Michał Cłapiński,
Kamil Yurtsever, Dirk Petersen, Christine Flood, Casey Schaufler,
Mike Rapoport, Radostin Stoyanov, Cyrill Gorcunov, Serge Hallyn,
Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-2-areber@redhat.com>
On Wed, Jul 15, 2020 at 04:49:49PM +0200, Adrian Reber wrote:
> This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
> checkpoint/restore for non-root users.
>
> Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
> asked numerous times if it is possible to checkpoint/restore a process as
> non-root. The answer usually was: 'almost'.
>
> The main blocker to restore a process as non-root was to control the PID of the
> restored process. This feature available via the clone3 system call, or via
> /proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
>
> In the past two years, requests for non-root checkpoint/restore have increased
> due to the following use cases:
> * Checkpoint/Restore in an HPC environment in combination with a resource
> manager distributing jobs where users are always running as non-root.
> There is a desire to provide a way to checkpoint and restore long running
> jobs.
> * Container migration as non-root
> * We have been in contact with JVM developers who are integrating
> CRIU into a Java VM to decrease the startup time. These checkpoint/restore
> applications are not meant to be running with CAP_SYS_ADMIN.
>
> We have seen the following workarounds:
> * Use a setuid wrapper around CRIU:
> See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
> * Use a setuid helper that writes to ns_last_pid.
> Unfortunately, this helper delegation technique is impossible to use with
> clone3, and is thus prone to races.
> See https://github.com/twosigma/set_ns_last_pid
> * Cycle through PIDs with fork() until the desired PID is reached:
> This has been demonstrated to work with cycling rates of 100,000 PIDs/s
> See https://github.com/twosigma/set_ns_last_pid
> * Patch out the CAP_SYS_ADMIN check from the kernel
> * Run the desired application in a new user and PID namespace to provide
> a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
> typical container environments (e.g., Kubernetes) as /proc is
> typically protected with read-only layers (e.g., /proc/sys) for hardening
> purposes. Read-only layers prevent additional /proc mounts (due to proc's
> SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
> certain applications need access to /proc matching their PID namespace.
>
> The introduced capability allows to:
> * Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
> for the corresponding PID namespace via ns_last_pid/clone3.
> * Open files in /proc/pid/map_files when the current user is
> CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
> files that are unreachable via the file system such as deleted files, or memfd
> files.
>
> See corresponding selftest for an example with clone3().
>
> Signed-off-by: Adrian Reber <areber@redhat.com>
> Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
> ---
Thanks!
This looks good now.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
> include/linux/capability.h | 6 ++++++
> include/uapi/linux/capability.h | 9 ++++++++-
> security/selinux/include/classmap.h | 5 +++--
> 3 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index b4345b38a6be..1e7fe311cabe 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
> return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
> }
>
> +static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
> +{
> + return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
> + ns_capable(ns, CAP_SYS_ADMIN);
> +}
> +
> /* audit system wants to get cap info from files as well */
> extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 48ff0757ae5e..395dd0df8d08 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
> */
> #define CAP_BPF 39
>
> -#define CAP_LAST_CAP CAP_BPF
> +
> +/* Allow checkpoint/restore related operations */
> +/* Allow PID selection during clone3() */
> +/* Allow writing to ns_last_pid */
> +
> +#define CAP_CHECKPOINT_RESTORE 40
> +
> +#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE
>
> #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
>
> diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
> index e54d62d529f1..ba2e01a6955c 100644
> --- a/security/selinux/include/classmap.h
> +++ b/security/selinux/include/classmap.h
> @@ -27,9 +27,10 @@
> "audit_control", "setfcap"
>
> #define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \
> - "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
> + "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
> + "checkpoint_restore"
>
> -#if CAP_LAST_CAP > CAP_BPF
> +#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
> #error New capability defined, please update COMMON_CAP2_PERMS.
> #endif
>
> --
> 2.26.2
>
^ permalink raw reply
* Re: [PATCH 7/7] exec: Implement kernel_execve
From: Kees Cook @ 2020-07-15 15:00 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Eric W. Biederman, linux-kernel, Linus Torvalds, Andy Lutomirski,
H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Al Viro, Luis Chamberlain, linux-fsdevel, Tetsuo Handa,
linux-security-module, Serge E. Hallyn, James Morris,
Kentaro Takeda, Casey Schaufler, John Johansen
In-Reply-To: <20200715064248.GH32470@infradead.org>
On Wed, Jul 15, 2020 at 07:42:48AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > +static int count_strings_kernel(const char *const *argv)
> > > +{
> > > + int i;
> > > +
> > > + if (!argv)
> > > + return 0;
> > > +
> > > + for (i = 0; argv[i]; ++i) {
> > > + if (i >= MAX_ARG_STRINGS)
> > > + return -E2BIG;
> > > + if (fatal_signal_pending(current))
> > > + return -ERESTARTNOHAND;
> > > + cond_resched();
> > > + }
> > > + return i;
> > > +}
> >
> > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > refactor that too? (And maybe rename it to count_strings_user()?)
>
> Liks this?
>
> http://git.infradead.org/users/hch/misc.git/commitdiff/35a3129dab5b712b018c30681d15de42d9509731
Heh, yes please. :) (Which branch is this from? Are yours and Eric's
tree going to collide?)
--
Kees Cook
^ permalink raw reply
* RE: [PATCH 7/7] exec: Implement kernel_execve
From: David Laight @ 2020-07-15 14:55 UTC (permalink / raw)
To: 'Christoph Hellwig', Kees Cook
Cc: Eric W. Biederman, linux-kernel@vger.kernel.org, Linus Torvalds,
Andy Lutomirski, H. Peter Anvin, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Al Viro, Luis Chamberlain,
linux-fsdevel@vger.kernel.org, Tetsuo Handa,
linux-security-module@vger.kernel.org, Serge E. Hallyn,
James Morris, Kentaro Takeda, Casey Schaufler, John Johansen
In-Reply-To: <20200715064248.GH32470@infradead.org>
From: Christoph Hellwig
> Sent: 15 July 2020 07:43
> Subject: Re: [PATCH 7/7] exec: Implement kernel_execve
>
> On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > +static int count_strings_kernel(const char *const *argv)
> > > +{
> > > + int i;
> > > +
> > > + if (!argv)
> > > + return 0;
> > > +
> > > + for (i = 0; argv[i]; ++i) {
> > > + if (i >= MAX_ARG_STRINGS)
> > > + return -E2BIG;
> > > + if (fatal_signal_pending(current))
> > > + return -ERESTARTNOHAND;
> > > + cond_resched();
> > > + }
> > > + return i;
> > > +}
> >
> > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > refactor that too? (And maybe rename it to count_strings_user()?)
Thinks....
If you setup env[] and argv[] on the new user stack early in exec processing
then you may not need any limits at all - except the size of the user stack.
Even the get_user() loop will hit an invalid address before the counter
wraps (provided it is unsigned long).
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply
* [PATCH v5 6/6] selftests: add clone3() CAP_CHECKPOINT_RESTORE test
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
Christine Flood, Casey Schaufler
Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-1-areber@redhat.com>
This adds a test that changes its UID, uses capabilities to
get CAP_CHECKPOINT_RESTORE and uses clone3() with set_tid to
create a process with a given PID as non-root.
Signed-off-by: Adrian Reber <areber@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
---
tools/testing/selftests/clone3/Makefile | 4 +-
.../clone3/clone3_cap_checkpoint_restore.c | 203 ++++++++++++++++++
2 files changed, 206 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile
index cf976c732906..ef7564cb7abe 100644
--- a/tools/testing/selftests/clone3/Makefile
+++ b/tools/testing/selftests/clone3/Makefile
@@ -1,6 +1,8 @@
# SPDX-License-Identifier: GPL-2.0
CFLAGS += -g -I../../../../usr/include/
+LDLIBS += -lcap
-TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid
+TEST_GEN_PROGS := clone3 clone3_clear_sighand clone3_set_tid \
+ clone3_cap_checkpoint_restore
include ../lib.mk
diff --git a/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
new file mode 100644
index 000000000000..2cc3d57b91f2
--- /dev/null
+++ b/tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Based on Christian Brauner's clone3() example.
+ * These tests are assuming to be running in the host's
+ * PID namespace.
+ */
+
+/* capabilities related code based on selftests/bpf/test_verifier.c */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/capability.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sched.h>
+
+#include "../kselftest.h"
+#include "clone3_selftests.h"
+
+#ifndef MAX_PID_NS_LEVEL
+#define MAX_PID_NS_LEVEL 32
+#endif
+
+static void child_exit(int ret)
+{
+ fflush(stdout);
+ fflush(stderr);
+ _exit(ret);
+}
+
+static int call_clone3_set_tid(pid_t * set_tid, size_t set_tid_size)
+{
+ int status;
+ pid_t pid = -1;
+
+ struct clone_args args = {
+ .exit_signal = SIGCHLD,
+ .set_tid = ptr_to_u64(set_tid),
+ .set_tid_size = set_tid_size,
+ };
+
+ pid = sys_clone3(&args, sizeof(struct clone_args));
+ if (pid < 0) {
+ ksft_print_msg("%s - Failed to create new process\n",
+ strerror(errno));
+ return -errno;
+ }
+
+ if (pid == 0) {
+ int ret;
+ char tmp = 0;
+
+ ksft_print_msg
+ ("I am the child, my PID is %d (expected %d)\n",
+ getpid(), set_tid[0]);
+
+ if (set_tid[0] != getpid())
+ child_exit(EXIT_FAILURE);
+ child_exit(EXIT_SUCCESS);
+ }
+
+ ksft_print_msg("I am the parent (%d). My child's pid is %d\n",
+ getpid(), pid);
+
+ if (waitpid(pid, &status, 0) < 0) {
+ ksft_print_msg("Child returned %s\n", strerror(errno));
+ return -errno;
+ }
+
+ if (!WIFEXITED(status))
+ return -1;
+
+ return WEXITSTATUS(status);
+}
+
+static int test_clone3_set_tid(pid_t * set_tid,
+ size_t set_tid_size, int expected)
+{
+ int ret;
+
+ ksft_print_msg("[%d] Trying clone3() with CLONE_SET_TID to %d\n",
+ getpid(), set_tid[0]);
+ ret = call_clone3_set_tid(set_tid, set_tid_size);
+
+ ksft_print_msg
+ ("[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n",
+ getpid(), set_tid[0], ret, expected);
+ if (ret != expected) {
+ ksft_test_result_fail
+ ("[%d] Result (%d) is different than expected (%d)\n",
+ getpid(), ret, expected);
+ return -1;
+ }
+ ksft_test_result_pass
+ ("[%d] Result (%d) matches expectation (%d)\n", getpid(), ret,
+ expected);
+
+ return 0;
+}
+
+struct libcap {
+ struct __user_cap_header_struct hdr;
+ struct __user_cap_data_struct data[2];
+};
+
+static int set_capability()
+{
+ cap_value_t cap_values[] = { CAP_SETUID, CAP_SETGID };
+ struct libcap *cap;
+ int ret = -1;
+ cap_t caps;
+
+ caps = cap_get_proc();
+ if (!caps) {
+ perror("cap_get_proc");
+ return -1;
+ }
+
+ /* Drop all capabilities */
+ if (cap_clear(caps)) {
+ perror("cap_clear");
+ goto out;
+ }
+
+ cap_set_flag(caps, CAP_EFFECTIVE, 2, cap_values, CAP_SET);
+ cap_set_flag(caps, CAP_PERMITTED, 2, cap_values, CAP_SET);
+
+ cap = (struct libcap *) caps;
+
+ /* 40 -> CAP_CHECKPOINT_RESTORE */
+ cap->data[1].effective |= 1 << (40 - 32);
+ cap->data[1].permitted |= 1 << (40 - 32);
+
+ if (cap_set_proc(caps)) {
+ perror("cap_set_proc");
+ goto out;
+ }
+ ret = 0;
+out:
+ if (cap_free(caps))
+ perror("cap_free");
+ return ret;
+}
+
+int main(int argc, char *argv[])
+{
+ pid_t pid;
+ int status;
+ int ret = 0;
+ pid_t set_tid[1];
+ uid_t uid = getuid();
+
+ ksft_print_header();
+ test_clone3_supported();
+ ksft_set_plan(2);
+
+ if (uid != 0) {
+ ksft_cnt.ksft_xskip = ksft_plan;
+ ksft_print_msg("Skipping all tests as non-root\n");
+ return ksft_exit_pass();
+ }
+
+ memset(&set_tid, 0, sizeof(set_tid));
+
+ /* Find the current active PID */
+ pid = fork();
+ if (pid == 0) {
+ ksft_print_msg("Child has PID %d\n", getpid());
+ child_exit(EXIT_SUCCESS);
+ }
+ if (waitpid(pid, &status, 0) < 0)
+ ksft_exit_fail_msg("Waiting for child %d failed", pid);
+
+ /* After the child has finished, its PID should be free. */
+ set_tid[0] = pid;
+
+ if (set_capability())
+ ksft_test_result_fail
+ ("Could not set CAP_CHECKPOINT_RESTORE\n");
+ prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);
+ /* This would fail without CAP_CHECKPOINT_RESTORE */
+ setgid(1000);
+ setuid(1000);
+ set_tid[0] = pid;
+ ret |= test_clone3_set_tid(set_tid, 1, -EPERM);
+ if (set_capability())
+ ksft_test_result_fail
+ ("Could not set CAP_CHECKPOINT_RESTORE\n");
+ /* This should work as we have CAP_CHECKPOINT_RESTORE as non-root */
+ ret |= test_clone3_set_tid(set_tid, 1, 0);
+
+ return !ret ? ksft_exit_pass() : ksft_exit_fail();
+}
--
2.26.2
^ permalink raw reply related
* [PATCH v5 4/6] proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
Christine Flood, Casey Schaufler
Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-1-areber@redhat.com>
Opening files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace is useful for
checkpointing and restoring to recover files that are unreachable via
the file system such as deleted files, or memfd files.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
---
fs/proc/base.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 65893686d1f1..cada783f229e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2194,16 +2194,16 @@ struct map_files_info {
};
/*
- * Only allow CAP_SYS_ADMIN to follow the links, due to concerns about how the
- * symlinks may be used to bypass permissions on ancestor directories in the
- * path to the file in question.
+ * Only allow CAP_SYS_ADMIN and CAP_CHECKPOINT_RESTORE to follow the links, due
+ * to concerns about how the symlinks may be used to bypass permissions on
+ * ancestor directories in the path to the file in question.
*/
static const char *
proc_map_files_get_link(struct dentry *dentry,
struct inode *inode,
struct delayed_call *done)
{
- if (!capable(CAP_SYS_ADMIN))
+ if (!capable(CAP_SYS_ADMIN) || !capable(CAP_CHECKPOINT_RESTORE))
return ERR_PTR(-EPERM);
return proc_pid_get_link(dentry, inode, done);
--
2.26.2
^ permalink raw reply related
* [PATCH v5 5/6] prctl: Allow checkpoint/restore capable processes to change exe link
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
Christine Flood, Casey Schaufler
Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-1-areber@redhat.com>
From: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Allow CAP_CHECKPOINT_RESTORE capable users to change /proc/self/exe.
This commit also changes the permission error code from -EINVAL to
-EPERM for consistency with the rest of the prctl() syscall when
checking capabilities.
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
---
kernel/sys.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..dd59b9142b1d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2007,12 +2007,14 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
if (prctl_map.exe_fd != (u32)-1) {
/*
- * Make sure the caller has the rights to
- * change /proc/pid/exe link: only local sys admin should
- * be allowed to.
+ * Check if the current user is checkpoint/restore capable.
+ * At the time of this writing, it checks for CAP_SYS_ADMIN
+ * or CAP_CHECKPOINT_RESTORE.
+ * Note that a user with access to ptrace can masquerade an
+ * arbitrary program as any executable, even setuid ones.
*/
- if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
- return -EINVAL;
+ if (!checkpoint_restore_ns_capable(current_user_ns()))
+ return -EPERM;
error = prctl_set_mm_exe_file(mm, prctl_map.exe_fd);
if (error)
--
2.26.2
^ permalink raw reply related
* [PATCH v5 3/6] pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
Christine Flood, Casey Schaufler
Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-1-areber@redhat.com>
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
writing to ns_last_pid.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
---
kernel/pid_namespace.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0e5ac162c3a8..ac135bd600eb 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
struct ctl_table tmp = *table;
int ret, next;
- if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+ if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
return -EPERM;
/*
--
2.26.2
^ permalink raw reply related
* [PATCH v5 2/6] pid: use checkpoint_restore_ns_capable() for set_tid
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
Christine Flood, Casey Schaufler
Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-1-areber@redhat.com>
Use the newly introduced capability CAP_CHECKPOINT_RESTORE to allow
using clone3() with set_tid set.
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
---
kernel/pid.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index de9d29c41d77..a9cbab0194d9 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -199,7 +199,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
if (tid != 1 && !tmp->child_reaper)
goto out_free;
retval = -EPERM;
- if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+ if (!checkpoint_restore_ns_capable(tmp->user_ns))
goto out_free;
set_tid_size--;
}
--
2.26.2
^ permalink raw reply related
* [PATCH v5 1/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
Christine Flood, Casey Schaufler
Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
In-Reply-To: <20200715144954.1387760-1-areber@redhat.com>
This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.
Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has been
asked numerous times if it is possible to checkpoint/restore a process as
non-root. The answer usually was: 'almost'.
The main blocker to restore a process as non-root was to control the PID of the
restored process. This feature available via the clone3 system call, or via
/proc/sys/kernel/ns_last_pid is unfortunately guarded by CAP_SYS_ADMIN.
In the past two years, requests for non-root checkpoint/restore have increased
due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a resource
manager distributing jobs where users are always running as non-root.
There is a desire to provide a way to checkpoint and restore long running
jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
CRIU into a Java VM to decrease the startup time. These checkpoint/restore
applications are not meant to be running with CAP_SYS_ADMIN.
We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
Unfortunately, this helper delegation technique is impossible to use with
clone3, and is thus prone to races.
See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
This has been demonstrated to work with cycling rates of 100,000 PIDs/s
See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited use in
typical container environments (e.g., Kubernetes) as /proc is
typically protected with read-only layers (e.g., /proc/sys) for hardening
purposes. Read-only layers prevent additional /proc mounts (due to proc's
SB_I_USERNS_VISIBLE property), making the use of new PID namespaces limited as
certain applications need access to /proc matching their PID namespace.
The introduced capability allows to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for recovering
files that are unreachable via the file system such as deleted files, or memfd
files.
See corresponding selftest for an example with clone3().
Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
---
include/linux/capability.h | 6 ++++++
include/uapi/linux/capability.h | 9 ++++++++-
security/selinux/include/classmap.h | 5 +++--
3 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/include/linux/capability.h b/include/linux/capability.h
index b4345b38a6be..1e7fe311cabe 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -261,6 +261,12 @@ static inline bool bpf_capable(void)
return capable(CAP_BPF) || capable(CAP_SYS_ADMIN);
}
+static inline bool checkpoint_restore_ns_capable(struct user_namespace *ns)
+{
+ return ns_capable(ns, CAP_CHECKPOINT_RESTORE) ||
+ ns_capable(ns, CAP_SYS_ADMIN);
+}
+
/* audit system wants to get cap info from files as well */
extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 48ff0757ae5e..395dd0df8d08 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -408,7 +408,14 @@ struct vfs_ns_cap_data {
*/
#define CAP_BPF 39
-#define CAP_LAST_CAP CAP_BPF
+
+/* Allow checkpoint/restore related operations */
+/* Allow PID selection during clone3() */
+/* Allow writing to ns_last_pid */
+
+#define CAP_CHECKPOINT_RESTORE 40
+
+#define CAP_LAST_CAP CAP_CHECKPOINT_RESTORE
#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index e54d62d529f1..ba2e01a6955c 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -27,9 +27,10 @@
"audit_control", "setfcap"
#define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \
- "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf"
+ "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \
+ "checkpoint_restore"
-#if CAP_LAST_CAP > CAP_BPF
+#if CAP_LAST_CAP > CAP_CHECKPOINT_RESTORE
#error New capability defined, please update COMMON_CAP2_PERMS.
#endif
--
2.26.2
^ permalink raw reply related
* [PATCH v5 0/6] capabilities: Introduce CAP_CHECKPOINT_RESTORE
From: Adrian Reber @ 2020-07-15 14:49 UTC (permalink / raw)
To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Oleg Nesterov,
Dmitry Safonov, Andrei Vagin, Nicolas Viennot,
Michał Cłapiński, Kamil Yurtsever, Dirk Petersen,
Christine Flood, Casey Schaufler
Cc: Mike Rapoport, Radostin Stoyanov, Adrian Reber, Cyrill Gorcunov,
Serge Hallyn, Stephen Smalley, Sargun Dhillon, Arnd Bergmann,
linux-security-module, linux-kernel, selinux, Eric Paris,
Jann Horn, linux-fsdevel
This is v5 of the 'Introduce CAP_CHECKPOINT_RESTORE' patchset. The
changes to v4 are:
* split into more patches to have the introduction of
CAP_CHECKPOINT_RESTORE and the actual usage in different
patches
* reduce the /proc/self/exe patch to only be about
CAP_CHECKPOINT_RESTORE
Adrian Reber (5):
capabilities: Introduce CAP_CHECKPOINT_RESTORE
pid: use checkpoint_restore_ns_capable() for set_tid
pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
selftests: add clone3() CAP_CHECKPOINT_RESTORE test
Nicolas Viennot (1):
prctl: Allow checkpoint/restore capable processes to change exe link
fs/proc/base.c | 8 +-
include/linux/capability.h | 6 +
include/uapi/linux/capability.h | 9 +-
kernel/pid.c | 2 +-
kernel/pid_namespace.c | 2 +-
kernel/sys.c | 12 +-
security/selinux/include/classmap.h | 5 +-
tools/testing/selftests/clone3/Makefile | 4 +-
.../clone3/clone3_cap_checkpoint_restore.c | 203 ++++++++++++++++++
9 files changed, 236 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c
base-commit: d31958b30ea3b7b6e522d6bf449427748ad45822
--
2.26.2
^ permalink raw reply
* Re: [PATCH net-next] cipso: Remove unused inline functions
From: David Miller @ 2020-07-15 14:45 UTC (permalink / raw)
To: yuehaibing; +Cc: paul, kuba, netdev, linux-security-module, linux-kernel
In-Reply-To: <20200715021846.34096-1-yuehaibing@huawei.com>
From: YueHaibing <yuehaibing@huawei.com>
Date: Wed, 15 Jul 2020 10:18:46 +0800
> They are not used any more since commit b1edeb102397 ("netlabel: Replace
> protocol/NetLabel linking with refrerence counts")
>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Applied.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox