Linux Security Modules development

Linux Security Modules development
 help / color / mirror / Atom feed

* [PATCH 02/11] security: Add hooks to rule on setting a watch [ver #6]
From: David Howells @ 2019-08-29 18:30 UTC (permalink / raw)
  To: viro
  Cc: dhowells, Casey Schaufler, Stephen Smalley, Greg Kroah-Hartman,
	nicolas.dichtel, raven, Christian Brauner, dhowells, keyrings,
	linux-usb, linux-security-module, linux-fsdevel, linux-api,
	linux-block, linux-security-module, linux-kernel
In-Reply-To: <156710338860.10009.12524626894838499011.stgit@warthog.procyon.org.uk>

Add security hooks that will allow an LSM to rule on whether or not a watch
may be set.  More than one hook is required as the watches watch different
types of object.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Casey Schaufler <casey@schaufler-ca.com>
cc: Stephen Smalley <sds@tycho.nsa.gov>
cc: linux-security-module@vger.kernel.org
---

 include/linux/lsm_hooks.h |   22 ++++++++++++++++++++++
 include/linux/security.h  |   15 +++++++++++++++
 security/security.c       |   13 +++++++++++++
 3 files changed, 50 insertions(+)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index df1318d85f7d..19108185b254 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1413,6 +1413,20 @@
  *	@ctx is a pointer in which to place the allocated security context.
  *	@ctxlen points to the place to put the length of @ctx.
  *
+ * Security hooks for the general notification queue:
+ *
+ * @watch_key:
+ *	Check to see if a process is allowed to watch for event notifications
+ *	from a key or keyring.
+ *	@watch: The watch object
+ *	@key: The key to watch.
+ *
+ * @watch_devices:
+ *	Check to see if a process is allowed to watch for event notifications
+ *	from devices (as a global set).
+ *	@watch: The watch object
+ *
+ *
  * Security hooks for using the eBPF maps and programs functionalities through
  * eBPF syscalls.
  *
@@ -1688,6 +1702,10 @@ union security_list_options {
 	int (*inode_notifysecctx)(struct inode *inode, void *ctx, u32 ctxlen);
 	int (*inode_setsecctx)(struct dentry *dentry, void *ctx, u32 ctxlen);
 	int (*inode_getsecctx)(struct inode *inode, void **ctx, u32 *ctxlen);
+#ifdef CONFIG_WATCH_QUEUE
+	int (*watch_key)(struct watch *watch, struct key *key);
+	int (*watch_devices)(struct watch *watch);
+#endif /* CONFIG_WATCH_QUEUE */
 
 #ifdef CONFIG_SECURITY_NETWORK
 	int (*unix_stream_connect)(struct sock *sock, struct sock *other,
@@ -1964,6 +1982,10 @@ struct security_hook_heads {
 	struct hlist_head inode_notifysecctx;
 	struct hlist_head inode_setsecctx;
 	struct hlist_head inode_getsecctx;
+#ifdef CONFIG_WATCH_QUEUE
+	struct hlist_head watch_key;
+	struct hlist_head watch_devices;
+#endif /* CONFIG_WATCH_QUEUE */
 #ifdef CONFIG_SECURITY_NETWORK
 	struct hlist_head unix_stream_connect;
 	struct hlist_head unix_may_send;
diff --git a/include/linux/security.h b/include/linux/security.h
index 5f7441abbf42..feeade454308 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -57,6 +57,7 @@ struct mm_struct;
 struct fs_context;
 struct fs_parameter;
 enum fs_value_type;
+struct watch;
 
 /* Default (no) options for the capable function */
 #define CAP_OPT_NONE 0x0
@@ -392,6 +393,10 @@ void security_inode_invalidate_secctx(struct inode *inode);
 int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen);
 int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen);
 int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
+#ifdef CONFIG_WATCH_QUEUE
+int security_watch_key(struct watch *watch, struct key *key);
+int security_watch_devices(struct watch *watch);
+#endif /* CONFIG_WATCH_QUEUE */
 #else /* CONFIG_SECURITY */
 
 static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data)
@@ -1204,6 +1209,16 @@ static inline int security_inode_getsecctx(struct inode *inode, void **ctx, u32
 {
 	return -EOPNOTSUPP;
 }
+#ifdef CONFIG_WATCH_QUEUE
+static inline int security_watch_key(struct watch *watch, struct key *key)
+{
+	return 0;
+}
+static inline int security_watch_devices(struct watch *watch)
+{
+	return 0;
+}
+#endif /* CONFIG_WATCH_QUEUE */
 #endif	/* CONFIG_SECURITY */
 
 #ifdef CONFIG_SECURITY_NETWORK
diff --git a/security/security.c b/security/security.c
index 250ee2d76406..1ebd2c936a57 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1916,6 +1916,19 @@ int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen)
 }
 EXPORT_SYMBOL(security_inode_getsecctx);
 
+#ifdef CONFIG_WATCH_QUEUE
+int security_watch_key(struct watch *watch, struct key *key)
+{
+	return call_int_hook(watch_key, 0, watch, key);
+}
+
+int security_watch_devices(struct watch *watch)
+{
+	return call_int_hook(watch_devices, 0, watch);
+}
+
+#endif /* CONFIG_WATCH_QUEUE */
+
 #ifdef CONFIG_SECURITY_NETWORK
 
 int security_unix_stream_connect(struct sock *sock, struct sock *other, struct sock *newsk)


^ permalink raw reply related

* [PATCH 03/11] security: Add a hook for the point of notification insertion [ver #6]
From: David Howells @ 2019-08-29 18:30 UTC (permalink / raw)
  To: viro
  Cc: dhowells, Casey Schaufler, Stephen Smalley, Greg Kroah-Hartman,
	nicolas.dichtel, raven, Christian Brauner, dhowells, keyrings,
	linux-usb, linux-security-module, linux-fsdevel, linux-api,
	linux-block, linux-security-module, linux-kernel
In-Reply-To: <156710338860.10009.12524626894838499011.stgit@warthog.procyon.org.uk>

Add a security hook that allows an LSM to rule on whether a notification
message is allowed to be inserted into a particular watch queue.

The hook is given the following information:

 (1) The credentials of the triggerer (which may be init_cred for a system
     notification, eg. a hardware error).

 (2) The credentials of the whoever set the watch.

 (3) The notification message.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Casey Schaufler <casey@schaufler-ca.com>
cc: Stephen Smalley <sds@tycho.nsa.gov>
cc: linux-security-module@vger.kernel.org
---

 include/linux/lsm_hooks.h |   10 ++++++++++
 include/linux/security.h  |   10 ++++++++++
 security/security.c       |    6 ++++++
 3 files changed, 26 insertions(+)

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 19108185b254..e9f1f69cd04d 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1426,6 +1426,12 @@
  *	from devices (as a global set).
  *	@watch: The watch object
  *
+ * @post_notification:
+ *	Check to see if a watch notification can be posted to a particular
+ *	queue.
+ *	@w_cred: The credentials of the whoever set the watch.
+ *	@cred: The event-triggerer's credentials
+ *	@n: The notification being posted
  *
  * Security hooks for using the eBPF maps and programs functionalities through
  * eBPF syscalls.
@@ -1705,6 +1711,9 @@ union security_list_options {
 #ifdef CONFIG_WATCH_QUEUE
 	int (*watch_key)(struct watch *watch, struct key *key);
 	int (*watch_devices)(struct watch *watch);
+	int (*post_notification)(const struct cred *w_cred,
+				 const struct cred *cred,
+				 struct watch_notification *n);
 #endif /* CONFIG_WATCH_QUEUE */
 
 #ifdef CONFIG_SECURITY_NETWORK
@@ -1985,6 +1994,7 @@ struct security_hook_heads {
 #ifdef CONFIG_WATCH_QUEUE
 	struct hlist_head watch_key;
 	struct hlist_head watch_devices;
+	struct hlist_head post_notification;
 #endif /* CONFIG_WATCH_QUEUE */
 #ifdef CONFIG_SECURITY_NETWORK
 	struct hlist_head unix_stream_connect;
diff --git a/include/linux/security.h b/include/linux/security.h
index feeade454308..003437714eee 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -58,6 +58,7 @@ struct fs_context;
 struct fs_parameter;
 enum fs_value_type;
 struct watch;
+struct watch_notification;
 
 /* Default (no) options for the capable function */
 #define CAP_OPT_NONE 0x0
@@ -396,6 +397,9 @@ int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen);
 #ifdef CONFIG_WATCH_QUEUE
 int security_watch_key(struct watch *watch, struct key *key);
 int security_watch_devices(struct watch *watch);
+int security_post_notification(const struct cred *w_cred,
+			       const struct cred *cred,
+			       struct watch_notification *n);
 #endif /* CONFIG_WATCH_QUEUE */
 #else /* CONFIG_SECURITY */
 
@@ -1218,6 +1222,12 @@ static inline int security_watch_devices(struct watch *watch)
 {
 	return 0;
 }
+static inline int security_post_notification(const struct cred *w_cred,
+					     const struct cred *cred,
+					     struct watch_notification *n)
+{
+	return 0;
+}
 #endif /* CONFIG_WATCH_QUEUE */
 #endif	/* CONFIG_SECURITY */
 
diff --git a/security/security.c b/security/security.c
index 1ebd2c936a57..5afe966aea4e 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1927,6 +1927,12 @@ int security_watch_devices(struct watch *watch)
 	return call_int_hook(watch_devices, 0, watch);
 }
 
+int security_post_notification(const struct cred *w_cred,
+			       const struct cred *cred,
+			       struct watch_notification *n)
+{
+	return call_int_hook(post_notification, 0, w_cred, cred, n);
+}
 #endif /* CONFIG_WATCH_QUEUE */
 
 #ifdef CONFIG_SECURITY_NETWORK


^ permalink raw reply related

* [PATCH 01/11] uapi: General notification ring definitions [ver #6]
From: David Howells @ 2019-08-29 18:29 UTC (permalink / raw)
  To: viro
  Cc: dhowells, Casey Schaufler, Stephen Smalley, Greg Kroah-Hartman,
	nicolas.dichtel, raven, Christian Brauner, dhowells, keyrings,
	linux-usb, linux-security-module, linux-fsdevel, linux-api,
	linux-block, linux-security-module, linux-kernel
In-Reply-To: <156710338860.10009.12524626894838499011.stgit@warthog.procyon.org.uk>

Add UAPI definitions for the general notification ring, including the
following pieces:

 (1) struct watch_notification.

     This is the metadata header for each entry in the ring.  It includes a
     type and subtype that indicate the source of the message
     (eg. WATCH_TYPE_MOUNT_NOTIFY) and the kind of the message
     (eg. NOTIFY_MOUNT_NEW_MOUNT).

     The header also contains an information field that conveys the
     following information:

	- WATCH_INFO_LENGTH.  The size of the entry (entries are variable
          length).

	- WATCH_INFO_ID.  The watch ID specified when the watchpoint was
          set.

	- WATCH_INFO_TYPE_INFO.  (Sub)type-specific information.

	- WATCH_INFO_FLAG_*.  Flag bits overlain on the type-specific
          information.  For use by the type.

     All the information in the header can be used in filtering messages at
     the point of writing into the buffer.

 (2) struct watch_queue_buffer.

     This describes the layout of the ring.  Note that the first slots in
     the ring contain a special metadata entry that contains the ring
     pointers.  The producer in the kernel knows to skip this and it has a
     proper header (WATCH_TYPE_META, WATCH_META_SKIP_NOTIFICATION) that
     indicates the size so that the ring consumer can handle it the same as
     any other record and just skip it.

     Note that this means that ring entries can never be split over the end
     of the ring, so if an entry would need to be split, a skip record is
     inserted to wrap the ring first; this is also WATCH_TYPE_META,
     WATCH_META_SKIP_NOTIFICATION.

 (3) WATCH_INFO_NOTIFICATIONS_LOST.

     This is a flag that can be set in the metadata header by the kernel to
     indicate that at least one message was lost since it was last cleared
     by userspace.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---

 include/uapi/linux/watch_queue.h |   67 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)
 create mode 100644 include/uapi/linux/watch_queue.h

diff --git a/include/uapi/linux/watch_queue.h b/include/uapi/linux/watch_queue.h
new file mode 100644
index 000000000000..70f575099968
--- /dev/null
+++ b/include/uapi/linux/watch_queue.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_WATCH_QUEUE_H
+#define _UAPI_LINUX_WATCH_QUEUE_H
+
+#include <linux/types.h>
+
+enum watch_notification_type {
+	WATCH_TYPE_META		= 0,	/* Special record */
+	WATCH_TYPE___NR		= 1
+};
+
+enum watch_meta_notification_subtype {
+	WATCH_META_SKIP_NOTIFICATION	= 0,	/* Just skip this record */
+	WATCH_META_REMOVAL_NOTIFICATION	= 1,	/* Watched object was removed */
+};
+
+#define WATCH_LENGTH_GRANULARITY sizeof(__u64)
+
+/*
+ * Notification record header.  This is aligned to 64-bits so that subclasses
+ * can contain __u64 fields.
+ */
+struct watch_notification {
+	__u32			type:24;	/* enum watch_notification_type */
+	__u32			subtype:8;	/* Type-specific subtype (filterable) */
+	__u32			info;
+#define WATCH_INFO_LENGTH	0x0000003f	/* Length of record / sizeof(watch_notification) */
+#define WATCH_INFO_LENGTH__SHIFT 0
+#define WATCH_INFO_ID		0x0000ff00	/* ID of watchpoint, if type-appropriate */
+#define WATCH_INFO_ID__SHIFT	8
+#define WATCH_INFO_TYPE_INFO	0xffff0000	/* Type-specific info */
+#define WATCH_INFO_TYPE_INFO__SHIFT 16
+#define WATCH_INFO_FLAG_0	0x00010000	/* Type-specific info, flag bit 0 */
+#define WATCH_INFO_FLAG_1	0x00020000	/* ... */
+#define WATCH_INFO_FLAG_2	0x00040000
+#define WATCH_INFO_FLAG_3	0x00080000
+#define WATCH_INFO_FLAG_4	0x00100000
+#define WATCH_INFO_FLAG_5	0x00200000
+#define WATCH_INFO_FLAG_6	0x00400000
+#define WATCH_INFO_FLAG_7	0x00800000
+} __attribute__((aligned(WATCH_LENGTH_GRANULARITY)));
+
+struct watch_queue_buffer {
+	union {
+		/* The first few entries are special, containing the
+		 * ring management variables.
+		 */
+		struct {
+			struct watch_notification watch; /* WATCH_TYPE_META */
+			__u32		head;		/* Ring head index */
+			__u32		tail;		/* Ring tail index */
+			__u32		mask;		/* Ring index mask */
+			__u32		__reserved;
+		} meta;
+		struct watch_notification slots[0];
+	};
+};
+
+/*
+ * The Metadata pseudo-notification message uses a flag bits in the information
+ * field to convey the fact that messages have been lost.  We can only use a
+ * single bit in this manner per word as some arches that support SMP
+ * (eg. parisc) have no kernel<->user atomic bit ops.
+ */
+#define WATCH_INFO_NOTIFICATIONS_LOST WATCH_INFO_FLAG_0
+
+#endif /* _UAPI_LINUX_WATCH_QUEUE_H */


^ permalink raw reply related

* [PATCH 00/11] Keyrings, Block and USB notifications [ver #6]
From: David Howells @ 2019-08-29 18:29 UTC (permalink / raw)
  To: viro
  Cc: dhowells, Casey Schaufler, Stephen Smalley, Greg Kroah-Hartman,
	nicolas.dichtel, raven, Christian Brauner, dhowells, keyrings,
	linux-usb, linux-security-module, linux-fsdevel, linux-api,
	linux-block, linux-security-module, linux-kernel


Here's a set of patches to add a general notification queue concept and to
add sources of events for:

 (1) Key/keyring events, such as creating, linking and removal of keys.

 (2) General device events (single common queue) including:

     - Block layer events, such as device errors

     - USB subsystem events, such as device/bus attach/remove, device
       reset, device errors.

Tests for the key/keyring events can be found on the keyutils next branch:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=next

Notifications are done automatically inside of the testing infrastructure
on every change to that every test makes to a key or keyring.

Manual pages can be found there also, including pages for watch_queue(7)
and the watch_devices(2) system call (these should be transferred to the
manpages package if taken upstream).

LSM hooks are included:

 (1) A set of hooks are provided that allow an LSM to rule on whether or
     not a watch may be set.  Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable.  The LSM
     should use current's credentials.  [Wanted by SELinux & Smack]

 (2) A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue.  This is given
     the credentials from the event generator (which may be the system) and
     the watch setter.  [Wanted by Smack]

I've provided a preliminary attempt to provide SELinux and Smack with
implementations of some of these hooks.


Design decisions:

 (1) A misc chardev is used to create and open a ring buffer:

	fd = open("/dev/watch_queue", O_RDWR);

     which is then configured and mmap'd into userspace:

	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

     The fd cannot be read or written (though there is a facility to use
     write to inject records for debugging) and userspace just pulls data
     directly out of the buffer.

 (2) The ring index pointers are stored inside the ring and are thus
     accessible to userspace.  Userspace should only update the tail
     pointer and never the head pointer or risk breaking the buffer.  The
     kernel checks that the pointers appear valid before trying to use
     them.  A 'skip' record is maintained around the pointers.

 (3) poll() can be used to wait for data to appear in the buffer.

 (4) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This means that multiple heterogeneous sources can share a common
     buffer.  Tags may be specified when a watchpoint is created to help
     distinguish the sources.

 (5) The queue is reusable as there are 16 million types available, of
     which I've used just a few, so there is scope for others to be used.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Each time the buffer is opened, a new buffer is created - this means
     that there's no interference between watchers.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as overrun if there's insufficient space, thereby
     avoiding userspace causing the kernel to hang.

 (9) The 'watchpoint' should be specific where possible, meaning that you
     specify the object that you want to watch.

(10) The buffer is created and then watchpoints are attached to it, using
     one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
	watch_devices(fd, 0x02, 0);

     where in both cases, fd indicates the queue and the number after is a
     tag between 0 and 255.

(11) The watch must be removed if either the watch buffer is destroyed or
     the watched object is destroyed.


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications-core

Changes:

 ver #6:

 (*) Fix mmap bug in watch_queue driver.

 (*) Add an extended removal notification that can transmit an identifier
     to userspace (such as a key ID).

 (*) Don't produce a instantiation notification in mark_key_instantiated()
     but rather do it in the caller to prevent key updates from producing
     an instantiate notification as well as an update notification.

 (*) Set the right number of filters in the sample program.

 (*) Provide preliminary hook implementations for SELinux and Smack.

 ver #5:

 (*) Split the superblock watch and mount watch parts out into their own
     branch (notifications-mount) as they really need certain fsinfo()
     attributes.

 (*) Rearrange the watch notification UAPI header to push the length down
     to bits 0-5 and remove the lost-message bits.  The userspace's watch
     ID tag is moved to bits 8-15 and then the message type is allocated
     all of bits 16-31 for its own purposes.

     The lost-message bit is moved over to the header, rather than being
     placed in the next message to be generated and given its own word so
     it can be cleared with xchg(,0) for parisc.

 (*) The security_post_notification() hook is no longer called with the
     spinlock held and softirqs disabled - though the RCU readlock is still
     held.

 (*) Buffer pages are now accounted towards RLIMIT_MEMLOCK and CAP_IPC_LOCK
     will skip the overuse check.

 (*) The buffer is marked VM_DONTEXPAND.

 (*) Save the watch-setter's creds in struct watch and give that to the LSM
     hook for posting a message.

 ver #4:

 (*) Split the basic UAPI bits out into their own patch and then split the
     LSM hooks out into an intermediate patch.  Add LSM hooks for setting
     watches.

     Rename the *_notify() system calls to watch_*() for consistency.

 ver #3:

 (*) I've added a USB notification source and reformulated the block
     notification source so that there's now a common watch list, for which
     the system call is now device_notify().

     I've assigned a pair of unused ioctl numbers in the 'W' series to the
     ioctls added by this series.

     I've also added a description of the kernel API to the documentation.

 ver #2:

 (*) I've fixed various issues raised by Jann Horn and GregKH and moved to
     krefs for refcounting.  I've added some security features to try and
     give Casey Schaufler the LSM control he wants.

David
---
David Howells (11):
      uapi: General notification ring definitions
      security: Add hooks to rule on setting a watch
      security: Add a hook for the point of notification insertion
      General notification queue with user mmap()'able ring buffer
      keys: Add a notification facility
      Add a general, global device notification watch list
      block: Add block layer notifications
      usb: Add USB subsystem notifications
      Add sample notification program
      selinux: Implement the watch_key security hook
      smack: Implement the watch_key and post_notification hooks [untested]


 Documentation/ioctl/ioctl-number.rst        |    1 
 Documentation/security/keys/core.rst        |   58 ++
 Documentation/watch_queue.rst               |  460 ++++++++++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 
 arch/arm/tools/syscall.tbl                  |    1 
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 
 arch/s390/kernel/syscalls/syscall.tbl       |    1 
 arch/sh/kernel/syscalls/syscall.tbl         |    1 
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 
 block/Kconfig                               |    9 
 block/blk-core.c                            |   29 +
 drivers/base/Kconfig                        |    9 
 drivers/base/Makefile                       |    1 
 drivers/base/watch.c                        |   94 +++
 drivers/misc/Kconfig                        |   13 
 drivers/misc/Makefile                       |    1 
 drivers/misc/watch_queue.c                  |  892 +++++++++++++++++++++++++++
 drivers/usb/core/Kconfig                    |    9 
 drivers/usb/core/devio.c                    |   56 ++
 drivers/usb/core/hub.c                      |    4 
 include/linux/blkdev.h                      |   15 
 include/linux/device.h                      |    7 
 include/linux/key.h                         |    3 
 include/linux/lsm_audit.h                   |    1 
 include/linux/lsm_hooks.h                   |   32 +
 include/linux/security.h                    |   25 +
 include/linux/syscalls.h                    |    1 
 include/linux/usb.h                         |   18 +
 include/linux/watch_queue.h                 |   94 +++
 include/uapi/asm-generic/unistd.h           |    4 
 include/uapi/linux/keyctl.h                 |    2 
 include/uapi/linux/watch_queue.h            |  183 ++++++
 kernel/sys_ni.c                             |    1 
 samples/Kconfig                             |    6 
 samples/Makefile                            |    1 
 samples/watch_queue/Makefile                |    8 
 samples/watch_queue/watch_test.c            |  233 +++++++
 security/keys/Kconfig                       |    9 
 security/keys/compat.c                      |    3 
 security/keys/gc.c                          |    5 
 security/keys/internal.h                    |   30 +
 security/keys/key.c                         |   38 +
 security/keys/keyctl.c                      |  103 +++
 security/keys/keyring.c                     |   20 -
 security/keys/request_key.c                 |    4 
 security/security.c                         |   19 +
 security/selinux/hooks.c                    |   17 +
 security/smack/smack_lsm.c                  |   81 ++
 58 files changed, 2587 insertions(+), 28 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 drivers/base/watch.c
 create mode 100644 drivers/misc/watch_queue.c
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c


^ permalink raw reply

* [PATCH] ima: ima_api: Use struct_size() in kzalloc()
From: Gustavo A. R. Silva @ 2019-08-29 17:29 UTC (permalink / raw)
  To: Mimi Zohar, Dmitry Kasatkin, James Morris, Serge E. Hallyn
  Cc: linux-integrity, linux-security-module, linux-kernel,
	Gustavo A. R. Silva

One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct ima_template_entry {
	...
        struct ima_field_data template_data[0]; /* template related data */
};

instance = kzalloc(sizeof(struct ima_template_entry) + count * sizeof(struct ima_field_data), GFP_NOFS);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kzalloc(struct_size(instance, entry, count), GFP_NOFS);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
---
 security/integrity/ima/ima_api.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index 65224474675b..610759fe63b8 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -45,8 +45,8 @@ int ima_alloc_init_template(struct ima_event_data *event_data,
 	else
 		template_desc = ima_template_desc_current();
 
-	*entry = kzalloc(sizeof(**entry) + template_desc->num_fields *
-			 sizeof(struct ima_field_data), GFP_NOFS);
+	*entry = kzalloc(struct_size(*entry, template_data,
+				     template_desc->num_fields), GFP_NOFS);
 	if (!*entry)
 		return -ENOMEM;
 
-- 
2.23.0


^ permalink raw reply related

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Steven Rostedt @ 2019-08-29 17:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Peter Zijlstra, Alexei Starovoitov, Kees Cook,
	LSM List, James Morris, Jann Horn, Masami Hiramatsu,
	David S. Miller, Daniel Borkmann, Network Development, bpf,
	kernel-team, Linux API
In-Reply-To: <20190829172309.xd73ax4wgsjmv6zg@ast-mbp.dhcp.thefacebook.com>

On Thu, 29 Aug 2019 10:23:10 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > CAP_TRACE_KERNEL: Use all of perf, ftrace, kprobe, etc.
> > 
> > CAP_TRACE_USER: Use all of perf with scope limited to user mode and uprobes.  
> 
> imo that makes little sense from security pov, since
> such CAP_TRACE_KERNEL (ex kprobe) can trace "unrelated user process"
> just as well. Yet not letting it do cleanly via uprobe.
> Sort of like giving a spare key for back door of the house and
> saying no, you cannot have main door key.

I took it as CAP_TRACE_KERNEL as a superset of CAP_TRACE_USER. That is,
if you have CAP_TRACE_KERNEL, by default you get USER. Where as
CAP_TRACE_USER, is much more limiting.

-- Steve

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Steven Rostedt @ 2019-08-29 17:47 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Andy Lutomirski, Alexei Starovoitov, Kees Cook,
	LSM List, James Morris, Jann Horn, Masami Hiramatsu,
	David S. Miller, Daniel Borkmann, Network Development, bpf,
	kernel-team, Linux API
In-Reply-To: <20190829171922.hkuceiurscsxk5jq@ast-mbp.dhcp.thefacebook.com>

On Thu, 29 Aug 2019 10:19:24 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Aug 29, 2019 at 09:34:34AM -0400, Steven Rostedt wrote:
> > 
> > As the above seems to favor the idea of CAP_TRACING allowing write
> > access to tracefs, should we have a CAP_TRACING_RO for just read access
> > and limited perf abilities?  
> 
> read only vs writeable is an attribute of the file system.
> Bringing such things into caps seem wrong to me.

So using groups then? I'm fine with that.

-- Steve

^ permalink raw reply

* Re: [PATCH] ima: use struct_size() in kzalloc()
From: Gustavo A. R. Silva @ 2019-08-29 17:16 UTC (permalink / raw)
  To: Mimi Zohar, Dmitry Kasatkin, James Morris, Serge E. Hallyn
  Cc: linux-integrity, linux-security-module, linux-kernel
In-Reply-To: <1567050355.11493.3.camel@linux.ibm.com>

Hi Mimi,

On 8/28/19 10:45 PM, Mimi Zohar wrote:

> 
> The same usage exists in ima_api.c: ima_alloc_init_template().  Did
> you want to make the change there as well?
> 

Yep. I'll write a patch for that one too.

Thanks
--
Gustavo

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Andy Lutomirski @ 2019-08-29 17:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Steven Rostedt, Peter Zijlstra,
	Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Network Development, bpf, kernel-team, Linux API
In-Reply-To: <20190829172309.xd73ax4wgsjmv6zg@ast-mbp.dhcp.thefacebook.com>

On Thu, Aug 29, 2019 at 10:23 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Aug 29, 2019 at 08:43:23AM -0700, Andy Lutomirski wrote:
> >
> > I can imagine splitting it into three capabilities:
> >
> > CAP_TRACE_KERNEL: learn which kernel functions are called when.  This
> > would allow perf profiling, for example, but not sampling of kernel
> > regs.
> >
> > CAP_TRACE_READ_KERNEL_DATA: allow the tracing, profiling, etc features
> > that can read the kernel's data.  So you get function arguments via
> > kprobe, kernel regs, and APIs that expose probe_kernel_read()
> >
> > CAP_TRACE_USER: trace unrelated user processes
> >
> > I'm not sure the code is written in a way that makes splitting
> > CAP_TRACE_KERNEL and CAP_TRACE_READ_KERNEL_DATA, and I'm not sure that
> > CAP_TRACE_KERNEL is all that useful except for plain perf record
> > without CAP_TRACE_READ_KERNEL_DATA.  What do you all think?  I suppose
> > it could also be:
> >
> > CAP_PROFILE_KERNEL: Use perf with events that aren't kprobes or
> > tracepoints.  Does not grant the ability to sample regs or the kernel
> > stack directly.
> >
> > CAP_TRACE_KERNEL: Use all of perf, ftrace, kprobe, etc.
> >
> > CAP_TRACE_USER: Use all of perf with scope limited to user mode and uprobes.
>
> imo that makes little sense from security pov, since
> such CAP_TRACE_KERNEL (ex kprobe) can trace "unrelated user process"
> just as well. Yet not letting it do cleanly via uprobe.
> Sort of like giving a spare key for back door of the house and
> saying no, you cannot have main door key.
>

Not all combinations of capabilities make total sense.  CAP_SETUID,
for example, generally lets you get all the other capabilities.
CAP_TRACE_KERNEL + CAP_TRACE_USER makes sense.  CAP_TRACE_USER by
itself makes sense.  CAP_TRACE_READ_KERNEL_DATA without
CAP_TRACE_KERNEL does not.  I don't think this is a really a problem.

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-29 17:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Peter Zijlstra, Alexei Starovoitov, Kees Cook,
	LSM List, James Morris, Jann Horn, Masami Hiramatsu,
	David S. Miller, Daniel Borkmann, Network Development, bpf,
	kernel-team, Linux API
In-Reply-To: <CALCETrWYu0XB_d-MhXFgopEmBu-pog493G1e+KsE3dS32UULgA@mail.gmail.com>

On Thu, Aug 29, 2019 at 08:43:23AM -0700, Andy Lutomirski wrote:
> 
> I can imagine splitting it into three capabilities:
> 
> CAP_TRACE_KERNEL: learn which kernel functions are called when.  This
> would allow perf profiling, for example, but not sampling of kernel
> regs.
> 
> CAP_TRACE_READ_KERNEL_DATA: allow the tracing, profiling, etc features
> that can read the kernel's data.  So you get function arguments via
> kprobe, kernel regs, and APIs that expose probe_kernel_read()
> 
> CAP_TRACE_USER: trace unrelated user processes
> 
> I'm not sure the code is written in a way that makes splitting
> CAP_TRACE_KERNEL and CAP_TRACE_READ_KERNEL_DATA, and I'm not sure that
> CAP_TRACE_KERNEL is all that useful except for plain perf record
> without CAP_TRACE_READ_KERNEL_DATA.  What do you all think?  I suppose
> it could also be:
> 
> CAP_PROFILE_KERNEL: Use perf with events that aren't kprobes or
> tracepoints.  Does not grant the ability to sample regs or the kernel
> stack directly.
> 
> CAP_TRACE_KERNEL: Use all of perf, ftrace, kprobe, etc.
> 
> CAP_TRACE_USER: Use all of perf with scope limited to user mode and uprobes.

imo that makes little sense from security pov, since
such CAP_TRACE_KERNEL (ex kprobe) can trace "unrelated user process"
just as well. Yet not letting it do cleanly via uprobe.
Sort of like giving a spare key for back door of the house and
saying no, you cannot have main door key.


^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-29 17:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Andy Lutomirski, Alexei Starovoitov, Kees Cook,
	LSM List, James Morris, Jann Horn, Masami Hiramatsu,
	David S. Miller, Daniel Borkmann, Network Development, bpf,
	kernel-team, Linux API
In-Reply-To: <20190829093434.36540972@gandalf.local.home>

On Thu, Aug 29, 2019 at 09:34:34AM -0400, Steven Rostedt wrote:
> 
> As the above seems to favor the idea of CAP_TRACING allowing write
> access to tracefs, should we have a CAP_TRACING_RO for just read access
> and limited perf abilities?

read only vs writeable is an attribute of the file system.
Bringing such things into caps seem wrong to me.


^ permalink raw reply

* Re: [PATCH] tpm_tis: Fix interrupt probing
From: Jarkko Sakkinen @ 2019-08-29 16:10 UTC (permalink / raw)
  To: Stefan Berger
  Cc: Stefan Berger, linux-integrity, linux-security-module,
	linux-kernel
In-Reply-To: <20190829132021.6vfc535ecb62jokf@linux.intel.com>

On Thu, Aug 29, 2019 at 04:20:21PM +0300, Jarkko Sakkinen wrote:
> On Tue, Aug 27, 2019 at 03:34:36PM -0400, Stefan Berger wrote:
> > On 8/27/19 11:19 AM, Jarkko Sakkinen wrote:
> > > On Tue, Aug 27, 2019 at 04:14:00PM +0300, Jarkko Sakkinen wrote:
> > > > On Tue, Aug 20, 2019 at 08:25:17AM -0400, Stefan Berger wrote:
> > > > > From: Stefan Berger <stefanb@linux.ibm.com>
> > > > > 
> > > > > The interrupt probing of the TPM TIS was broken since we are trying to
> > > > > run it without an active locality and without the TPM_CHIP_FLAG_IRQ set.
> > > > > 
> > > > > Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
> > > > Need these:
> > > > 
> > > > Cc: linux-stable@vger.kernel.org
> > > > Fixes: a3fbfae82b4c ("tpm: take TPM chip power gating out of tpm_transmit()")
> > > > 
> > > > Thank you. I'll apply this to my tree.
> > > > 
> > > > Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > > The commit went in the following form:
> > > 
> > > http://git.infradead.org/users/jjs/linux-tpmdd.git/commit/9b558deab2c5d7dc23d5f7a4064892ede482ad32
> > 
> > I saw you dropped the stetting of the IRQ flag - I needed it, otherwise it
> > wouldn't execute certain code paths.
> 
> I explained why I removed that part. There was no any reasoning for
> it. Also, it cannot be in the same commit if it fixes a diffent
> issue.

AFAIK they go with different fixes-tags.

/Jarkko

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Andy Lutomirski @ 2019-08-29 15:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Alexei Starovoitov, Peter Zijlstra, Andy Lutomirski,
	Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
	Masami Hiramatsu, David S. Miller, Daniel Borkmann,
	Network Development, bpf, kernel-team, Linux API
In-Reply-To: <20190829093434.36540972@gandalf.local.home>

On Thu, Aug 29, 2019 at 6:34 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 28 Aug 2019 15:08:28 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > On Wed, Aug 28, 2019 at 09:14:21AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 27, 2019 at 04:01:08PM -0700, Andy Lutomirski wrote:
> > >
> > > > > Tracing:
> > > > >
> > > > > CAP_BPF and perf_paranoid_tracepoint_raw() (which is kernel.perf_event_paranoid == -1)
> > > > > are necessary to:
> > >
> > > That's not tracing, that's perf.
> > >
>
> > re: your first comment above.
> > I'm not sure what difference you see in words 'tracing' and 'perf'.
> > I really hope we don't partition the overall tracing category
> > into CAP_PERF and CAP_FTRACE only because these pieces are maintained
> > by different people.
>
> I think Peter meant: It's not tracing, it's profiling.
>
> And there is a bit of separation between the two, although there is an
> overlap.
>
> Yes, perf can do tracing but it's designed more for profiling.

As I see it, there are a couple of reasons to split something into
multiple capabilities.  If they allow users to do well-defined things
that have materially different risks from the perspective of the
person granting the capabilities, then they can usefully be different.
Similarly, if one carries a risk of accidental use that another does
not, they should usefully be different.  An example of the first is
that CAP_NET_BIND_SERVICE has very different powers from
CAP_NET_ADMIN, whereas CAP_SYS_ADMIN and CAP_PTRACE are really quite
similar from a security perspective.  An example of the latter is that
CAP_DAC_OVERRIDE changes overall open() semantics and CAP_SYS_ADMIN
does not, at least not outside of /proc.

Things having different development histories and different
maintainers doesn't seem like a good reason to split the capabilities
IMO.

>
> > On one side perf_event_open() isn't really doing tracing (as step by
> > step ftracing of function sequences), but perf_event_open() opens
> > an event and the sequence of events (may include IP) becomes a trace.
> > imo CAP_TRACING is the best name to descibe the privileged space
> > of operations possible via perf_event_open, ftrace, kprobe, stack traces, etc.
>
> I have no issue with what you suggest. I guess it comes down to how
> fine grain people want to go. Do we want it to be all or nothing?
> Should CAP_TRACING allow for write access to tracefs? Or should we go
> with needing both CAP_TRACING and permissions in that directory
> (like changing the group ownership of the files at every boot).
>
> Perhaps we should have a CAP_TRACING_RO, that gives read access to
> tracefs (and write if the users have permissions). And have CAP_TRACING
> to allow full write access as well (allowing for users to add kprobe
> events and enabling tracers like the function tracer).

I can imagine splitting it into three capabilities:

CAP_TRACE_KERNEL: learn which kernel functions are called when.  This
would allow perf profiling, for example, but not sampling of kernel
regs.

CAP_TRACE_READ_KERNEL_DATA: allow the tracing, profiling, etc features
that can read the kernel's data.  So you get function arguments via
kprobe, kernel regs, and APIs that expose probe_kernel_read()

CAP_TRACE_USER: trace unrelated user processes

I'm not sure the code is written in a way that makes splitting
CAP_TRACE_KERNEL and CAP_TRACE_READ_KERNEL_DATA, and I'm not sure that
CAP_TRACE_KERNEL is all that useful except for plain perf record
without CAP_TRACE_READ_KERNEL_DATA.  What do you all think?  I suppose
it could also be:

CAP_PROFILE_KERNEL: Use perf with events that aren't kprobes or
tracepoints.  Does not grant the ability to sample regs or the kernel
stack directly.

CAP_TRACE_KERNEL: Use all of perf, ftrace, kprobe, etc.

CAP_TRACE_USER: Use all of perf with scope limited to user mode and uprobes.

> As the above seems to favor the idea of CAP_TRACING allowing write
> access to tracefs, should we have a CAP_TRACING_RO for just read access
> and limited perf abilities?

How about making a separate cap for limited perf capabilities along
the lines of the above?

For what it's worth, it should be straightforward using full tracing
to read out the kernel's random number pool, for example, but it would
be difficult or impossible to do that using just perf record -e
cycles.

^ permalink raw reply

* Re: [PATCH v5 4/4] KEYS: trusted: move tpm2 trusted keys code
From: Jarkko Sakkinen @ 2019-08-29 14:53 UTC (permalink / raw)
  To: Sumit Garg
  Cc: keyrings, linux-integrity,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE,
	linux-security-module, dhowells, Herbert Xu, davem, peterhuewe,
	jgg, jejb, Arnd Bergmann, Greg Kroah-Hartman, Mimi Zohar,
	James Morris, Serge E. Hallyn, Casey Schaufler, Ard Biesheuvel,
	Daniel Thompson, Linux Kernel Mailing List,
	tee-dev @ lists . linaro . org
In-Reply-To: <CAFA6WYPnoDoMWd=PT4mgXPhg1Wp0=AFDnWd_44UMP7sijXzAZA@mail.gmail.com>

On Wed, Aug 28, 2019 at 10:58:11AM +0530, Sumit Garg wrote:
> So you mean to say we should use upper-case letters for 'TPM2' acronym?

Yes.

/Jarkko

^ permalink raw reply

* RE: [WIP][RFC][PATCH 1/3] security: introduce call_int_hook_and() macro
From: Roberto Sassu @ 2019-08-29 14:29 UTC (permalink / raw)
  To: Casey Schaufler, linux-integrity@vger.kernel.org
  Cc: linux-security-module@vger.kernel.org, zohar@linux.ibm.com,
	Dmitry Kasatkin, Silviu Vlasceanu
In-Reply-To: <4c13c8a7-e255-a3a8-c19a-cae85a71cae9@schaufler-ca.com>

> -----Original Message-----
> From: Casey Schaufler [mailto:casey@schaufler-ca.com]
> Sent: Monday, August 19, 2019 4:52 PM
> To: Roberto Sassu <roberto.sassu@huawei.com>; linux-
> integrity@vger.kernel.org
> Cc: linux-security-module@vger.kernel.org; zohar@linux.ibm.com; Dmitry
> Kasatkin <dmitry.kasatkin@huawei.com>; Silviu Vlasceanu
> <Silviu.Vlasceanu@huawei.com>
> Subject: Re: [WIP][RFC][PATCH 1/3] security: introduce call_int_hook_and()
> macro
> 
> On 8/18/2019 4:57 PM, Roberto Sassu wrote:
> > The LSM hooks audit_rule_known() and audit_rule_match() define 1 as
> > result for successful operation. However, the security_ functions use
> > call_int_hook() which stops iterating over LSMs if the result is not
> > zero.
> >
> > Introduce call_int_hook_and(), so that the final result returned by
> > the security_ functions is 1 if all LSMs return 1.
> 
> I don't think this is what you want. You want an audit record generated if
> any of the security modules want one, not only if all of the security modules
> want one.

Right, it would be better if I can specify the prefix of the LSM that should
execute the audit_rule_match() hook.

For example, I would like to specify in the IMA policy:

measure subj_type=infoflow:tcb

'infoflow:tcb' would be the value of the 'lsmrule' parameter of
security_audit_rule_match().

The rule would be evaluated only by Infoflow LSM, and not SELinux.

Roberto

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Steven Rostedt @ 2019-08-29 13:34 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, Andy Lutomirski, Alexei Starovoitov, Kees Cook,
	LSM List, James Morris, Jann Horn, Masami Hiramatsu,
	David S. Miller, Daniel Borkmann, Network Development, bpf,
	kernel-team, Linux API
In-Reply-To: <20190828220826.nlkpp632rsomocve@ast-mbp.dhcp.thefacebook.com>

On Wed, 28 Aug 2019 15:08:28 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Wed, Aug 28, 2019 at 09:14:21AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 27, 2019 at 04:01:08PM -0700, Andy Lutomirski wrote:
> >   
> > > > Tracing:
> > > >
> > > > CAP_BPF and perf_paranoid_tracepoint_raw() (which is kernel.perf_event_paranoid == -1)
> > > > are necessary to:  
> > 
> > That's not tracing, that's perf.
> >   

> re: your first comment above.
> I'm not sure what difference you see in words 'tracing' and 'perf'.
> I really hope we don't partition the overall tracing category
> into CAP_PERF and CAP_FTRACE only because these pieces are maintained
> by different people.

I think Peter meant: It's not tracing, it's profiling.

And there is a bit of separation between the two, although there is an
overlap.

Yes, perf can do tracing but it's designed more for profiling.

> On one side perf_event_open() isn't really doing tracing (as step by
> step ftracing of function sequences), but perf_event_open() opens
> an event and the sequence of events (may include IP) becomes a trace.
> imo CAP_TRACING is the best name to descibe the privileged space
> of operations possible via perf_event_open, ftrace, kprobe, stack traces, etc.

I have no issue with what you suggest. I guess it comes down to how
fine grain people want to go. Do we want it to be all or nothing?
Should CAP_TRACING allow for write access to tracefs? Or should we go
with needing both CAP_TRACING and permissions in that directory
(like changing the group ownership of the files at every boot). 

Perhaps we should have a CAP_TRACING_RO, that gives read access to
tracefs (and write if the users have permissions). And have CAP_TRACING
to allow full write access as well (allowing for users to add kprobe
events and enabling tracers like the function tracer).

> 
> Another reason are kuprobes. They can be crated via perf_event_open
> and via tracefs. Are they in CAP_PERF or in CAP_FTRACE ? In both, right?
> Should then CAP_KPROBE be used ? that would be an overkill.
> It would partition the space even further without obvious need.
> 
> Looking from BPF angle... BPF doesn't have integration with ftrace yet.
> bpf_trace_printk is using ftrace mechanism, but that's 1% of ftrace.
> In the long run I really like to see bpf using all of ftrace.
> Whereas bpf is using a lot of 'perf'.
> And extending some perf things in bpf specific way.
> Take a look at how BPF_F_STACK_BUILD_ID. It's clearly perf/stack_tracing
> feature that generic perf can use one day.
> Currently it sits in bpf land and accessible via bpf only.
> Though its bpf only today I categorize it under CAP_TRACING.
> 
> I think CAP_TRACING privilege should allow task to do all of perf_event_open,
> kuprobe, stack trace, ftrace, and kallsyms.
> We can think of some exceptions that should stay under CAP_SYS_ADMIN,
> but most of the functionality available by 'perf' binary should be
> usable with CAP_TRACING. 'perf' can do bpf too.
> With CAP_BPF it would be all set.

As the above seems to favor the idea of CAP_TRACING allowing write
access to tracefs, should we have a CAP_TRACING_RO for just read access
and limited perf abilities?

-- Steve

^ permalink raw reply

* Re: [PATCH] tpm_tis: Fix interrupt probing
From: Jarkko Sakkinen @ 2019-08-29 13:20 UTC (permalink / raw)
  To: Stefan Berger
  Cc: Stefan Berger, linux-integrity, linux-security-module,
	linux-kernel
In-Reply-To: <797ff54e-dceb-21d2-dd74-e5244f9c6dfd@linux.ibm.com>

On Tue, Aug 27, 2019 at 03:34:36PM -0400, Stefan Berger wrote:
> On 8/27/19 11:19 AM, Jarkko Sakkinen wrote:
> > On Tue, Aug 27, 2019 at 04:14:00PM +0300, Jarkko Sakkinen wrote:
> > > On Tue, Aug 20, 2019 at 08:25:17AM -0400, Stefan Berger wrote:
> > > > From: Stefan Berger <stefanb@linux.ibm.com>
> > > > 
> > > > The interrupt probing of the TPM TIS was broken since we are trying to
> > > > run it without an active locality and without the TPM_CHIP_FLAG_IRQ set.
> > > > 
> > > > Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
> > > Need these:
> > > 
> > > Cc: linux-stable@vger.kernel.org
> > > Fixes: a3fbfae82b4c ("tpm: take TPM chip power gating out of tpm_transmit()")
> > > 
> > > Thank you. I'll apply this to my tree.
> > > 
> > > Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > The commit went in the following form:
> > 
> > http://git.infradead.org/users/jjs/linux-tpmdd.git/commit/9b558deab2c5d7dc23d5f7a4064892ede482ad32
> 
> I saw you dropped the stetting of the IRQ flag - I needed it, otherwise it
> wouldn't execute certain code paths.

I explained why I removed that part. There was no any reasoning for
it. Also, it cannot be in the same commit if it fixes a diffent
issue.


/Jarkko

^ permalink raw reply

* Re: [PATCH 1/2] rtnetlink: gate MAC address with an LSM hook
From: Michal Kubecek @ 2019-08-29  7:45 UTC (permalink / raw)
  To: netdev; +Cc: Paul Moore, Jeffrey Vander Stoep, David Miller, LSM List, selinux
In-Reply-To: <CAHC9VhRmmEp_nFtOFy_YRa9NwZA4qPnjw7D3JQvqED-tO4Ha1g@mail.gmail.com>

On Tue, Aug 27, 2019 at 04:47:04PM -0400, Paul Moore wrote:
> 
> I'm also not a big fan of inserting the hook in rtnl_fill_ifinfo(); as
> presented it is way too specific for a LSM hook for me to be happy.
> However, I do agree that giving the LSMs some control over netlink
> messages makes sense.  As others have pointed out, it's all a matter
> of where to place the hook.
> 
> If we only care about netlink messages which leverage nlattrs I
> suppose one option that I haven't seen mentioned would be to place a
> hook in nla_put().  While it is a bit of an odd place for a hook, it
> would allow the LSM easy access to the skb and attribute type to make
> decisions, and all of the callers should already be checking the
> return code (although we would need to verify this).  One notable
> drawback (not the only one) is that the hook is going to get hit
> multiple times for each message.

For most messages, "multiple times" would mean tens, for many even
hundreds of calls. For each, you would have to check corresponding
socket (and possibly also genetlink header) to see which netlink based
protocol it is and often even parse existing part of the message to get
the context (because the same numeric attribute type can mean something
completely different if it appears in a nested attribute).

Also, nla_put() (or rather __nla_put()) is not used for all attributes,
one may also use nla_reserve() and then compose the attribute date in
place.

Michal Kubecek

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-29  4:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Alexei Starovoitov, Kees Cook, LSM List,
	James Morris, Jann Horn, Peter Zijlstra, Masami Hiramatsu,
	Steven Rostedt, David S. Miller, Daniel Borkmann,
	Network Development, bpf, kernel-team, Linux API
In-Reply-To: <DA52992F-4862-4945-8482-FE619A04C753@amacapital.net>

On Wed, Aug 28, 2019 at 05:45:47PM -0700, Andy Lutomirski wrote:
> > 
> >> It seems like you are specifically trying to add a new switch to turn
> >> as much of BPF as possible on and off.  Why?
> > 
> > Didn't I explain it several times already with multiple examples
> > from systemd, daemons, bpftrace ?
> > 
> > Let's try again.
> > Take your laptop with linux distro.
> > You're the only user there. I'm assuming you're not sharing it with
> > partner and kids. This is my definition of 'single user system'.
> > You can sudo on it at any time, but obviously prefer to run as many
> > apps as possible without cap_sys_admin.
> > Now you found some awesome open source app on the web that monitors
> > the health of the kernel and will pop a nice message on a screen if
> > something is wrong. Currently this app needs root. You hesitate,
> > but the apps is so useful and it has strong upstream code review process
> > that you keep running it 24/7.
> > This is open source app. New versions come. You upgrade.
> > You have enough trust in that app that you keep running it as root.
> > But there is always a chance that new version doing accidentaly
> > something stupid as 'kill -9 -1'. It's an open source app at the end.
> > 
> > Now I come with this CAP* proposal to make this app safer.
> > I'm not making your system more secure and not making this app
> > more secure. I can only make your laptop safer for day to day work
> > by limiting the operations this app can do.
> > This particular app monitros the kernel via bpf and tracing.
> > Hence you can give it CAP_TRACING and CAP_BPF and drop the rest.
> 
> This won’t make me much more comfortable, since CAP_BPF lets it do an ever-growing set of nasty things. I’d much rather one or both of two things happen:
> 
> 1. Give it CAP_TRACING only. It can leak my data, but it’s rather hard for it to crash my laptop, lose data, or cause other shenanigans.
> 
> 2. Improve it a bit do all the privileged ops are wrapped by capset().
> 
> Does this make sense?  I’m a security person on occasion. I find vulnerabilities and exploit them deliberately and I break things by accident on a regular basis. In my considered opinion, CAP_TRACING alone, even extended to cover part of BPF as I’ve described, is decently safe. Getting root with just CAP_TRACING will be decently challenging, especially if I don’t get to read things like sshd’s memory, and improvements to mitigate even that could be added.  I am quite confident that attacks starting with CAP_TRACING will have clear audit signatures if auditing is on.  I am also confident that CAP_BPF *will* allow DoS and likely privilege escalation, and this will only get more likely as BPF gets more widely used. And, if BPF-based auditing ever becomes a thing, writing to the audit daemon’s maps will be a great way to cover one’s tracks.

CAP_TRACING, as I'm proposing it, will allow full tracefs access.
I think Steven and Massami prefer that as well.
That includes kprobe with probe_kernel_read.
That also means mini-DoS by installing kprobes everywhere or running too much ftrace.

CAP_TRACING will allow perf_event_open() too.
Which also means mini-DoS with too many events.

CAP_TRACING with or without CAP_BPF is safe, but it's not secure.
And that's what I need to make above 'open source kernel health app' to be safe.

In real world we have tens of such apps and they use all of the things that
I'm allowing via CAP_BPF + CAP_NET_ADMIN + CAP_TRACING.
Some apps will need only two out of three.
I don't see any further possibility to shrink the scope of the proposal.

> I’m trying to convince you that bpf’s security model can be made better
> than what you’re proposing. I’m genuinely not trying to get in your way.
> I’m trying to help you improve bpf.

If you really want to help please don't reject the real use cases
just because they don't fit into your proposal.

There is not a single feature in BPF land that we did because we simply
wanted to. For every feature we drilled into use cases to make sure
there is a real user behind it.
Same thing with CAP_BPF. I'm defining it to include GET_FD_BY_ID because
apps use it and they need to made safer.

Anyway the v2 version of the patch with CAP_TRACING and CAP_BPF is on the way.
Hopefully later tonight or tomorrow.


^ permalink raw reply

* Re: [PATCH] ima: use struct_size() in kzalloc()
From: Mimi Zohar @ 2019-08-29  3:45 UTC (permalink / raw)
  To: Gustavo A. R. Silva, Dmitry Kasatkin, James Morris,
	Serge E. Hallyn
  Cc: linux-integrity, linux-security-module, linux-kernel
In-Reply-To: <671185b9-5c91-5235-b5ea-96d3449bf716@embeddedor.com>

Hi Gustavo,

On Wed, 2019-08-28 at 13:29 -0500, Gustavo A. R. Silva wrote:
> On 5/29/19 11:53 AM, Gustavo A. R. Silva wrote:
> > One of the more common cases of allocation size calculations is finding
> > the size of a structure that has a zero-sized array at the end, along
> > with memory for some number of elements for that array. For example:
> > 
> > struct foo {
> >    int stuff;
> >    struct boo entry[];
> > };
> > 
> > instance = kzalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);
> > 
> > Instead of leaving these open-coded and prone to type mistakes, we can
> > now use the new struct_size() helper:
> > 
> > instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
> > 
> > This code was detected with the help of Coccinelle.
> > 
> > Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
> > ---
> >  security/integrity/ima/ima_template.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/security/integrity/ima/ima_template.c b/security/integrity/ima/ima_template.c
> > index b631b8bc7624..b945dff2ed14 100644
> > --- a/security/integrity/ima/ima_template.c
> > +++ b/security/integrity/ima/ima_template.c
> > @@ -281,9 +281,8 @@ static int ima_restore_template_data(struct ima_template_desc *template_desc,
> >  	int ret = 0;
> >  	int i;
> >  
> > -	*entry = kzalloc(sizeof(**entry) +
> > -		    template_desc->num_fields * sizeof(struct ima_field_data),
> > -		    GFP_NOFS);
> > +	*entry = kzalloc(struct_size(*entry, template_data,
> > +				     template_desc->num_fields), GFP_NOFS);
> >  	if (!*entry)
> >  		return -ENOMEM;
> >  
> > 

The same usage exists in ima_api.c: ima_alloc_init_template().  Did
you want to make the change there as well?

thanks,

Mimi


^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Andy Lutomirski @ 2019-08-29  0:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Alexei Starovoitov, Kees Cook, LSM List,
	James Morris, Jann Horn, Peter Zijlstra, Masami Hiramatsu,
	Steven Rostedt, David S. Miller, Daniel Borkmann,
	Network Development, bpf, kernel-team, Linux API
In-Reply-To: <20190828233828.p7xddyw3fjzfinm6@ast-mbp.dhcp.thefacebook.com>



> On Aug 28, 2019, at 4:38 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Tue, Aug 27, 2019 at 11:20:19PM -0700, Andy Lutomirski wrote:
>> On Tue, Aug 27, 2019 at 9:49 PM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>> 
>>>> On Tue, Aug 27, 2019 at 07:00:40PM -0700, Andy Lutomirski wrote:
>>>> 
>>>> Let me put this a bit differently. Part of the point is that
>>>> CAP_TRACING should allow a user or program to trace without being able
>>>> to corrupt the system. CAP_BPF as you’ve proposed it *can* likely
>>>> crash the system.
>>> 
>>> Really? I'm still waiting for your example where bpf+kprobe crashes the system...
>>> 
>> 
>> That's not what I meant.  bpf+kprobe causing a crash is a bug.  I'm
>> referring to a totally different issue.  On my laptop:
>> 
>> $ sudo bpftool map
>> 48: hash  name foobar  flags 0x0
>>    key 8B  value 8B  max_entries 64  memlock 8192B
>> 181: lpm_trie  flags 0x1
>>    key 8B  value 8B  max_entries 1  memlock 4096B
>> 182: lpm_trie  flags 0x1
>>    key 20B  value 8B  max_entries 1  memlock 4096B
>> 183: lpm_trie  flags 0x1
>>    key 8B  value 8B  max_entries 1  memlock 4096B
>> 184: lpm_trie  flags 0x1
>>    key 20B  value 8B  max_entries 1  memlock 4096B
>> 185: lpm_trie  flags 0x1
>>    key 8B  value 8B  max_entries 1  memlock 4096B
>> 186: lpm_trie  flags 0x1
>>    key 20B  value 8B  max_entries 1  memlock 4096B
>> 187: lpm_trie  flags 0x1
>>    key 8B  value 8B  max_entries 1  memlock 4096B
>> 188: lpm_trie  flags 0x1
>>    key 20B  value 8B  max_entries 1  memlock 4096B
>> 
>> $ sudo bpftool map dump id 186
>> key:
>> 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
>> 00 00 00 00
>> value:
>> 02 00 00 00 00 00 00 00
>> Found 1 element
>> 
>> $ sudo bpftool map delete id 186 key hex 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00
>> [this worked]
>> 
>> I don't know what my laptop was doing with map id 186 in particular,
>> but, whatever it was, I definitely broke it.  If a BPF firewall is in
>> use on something important enough, this could easily remove
>> connectivity from part or all of the system.  Right now, this needs
>> CAP_SYS_ADMIN.  With your patch, CAP_BPF is sufficient to do this, but
>> you *also* need CAP_BPF to trace the system using BPF.  Tracing with
>> BPF is 'safe' in the absence of bugs.  Modifying other peoples' maps
>> is not.
> 
> That lpm_trie is likely systemd implementing IP sandboxing.
> Not sure whether it's white or black list.
> Deleting an IP address from that map will either allow or disallow
> network traffic.
> Out of band operation on bpf map broke some bpf program. Sure.
> But calling it 'breaking the system' is quite a stretch.
> Calling it 'crashing the system' is plain wrong.
> Yet you're generalizing this bpf map read/write as
> "CAP_BPF as you’ve proposed it *can* likely crash the system."
> This is what I have a problem with.

Well, after I sent that email, firewalld on my laptop exploded and the system eventually hung.  I call that broken, and I really made a minimal effort here to break things.

> 
> Anyway, changing gears...
> Yes. I did propose to make a task with CAP_BPF to be able to
> manipulate arbitrary maps in the system.
> You could have said that if CAP_BPF is given to 'bpftool'
> then any user will be able to mess with other maps because
> bpftool is likely chmod-ed 755.
> Absolutely correct!
> It's not a fault of the CAP_BPF scope.
> Just don't give that cap to bpftool or do different acl/chmod.

I see no reason that allowing a user to use most of bpftool’s functionality necessarily needs to allow that user to corrupt the system. It obviously will expand the attack surface available to that user, but that should be it.

I’m trying to convince you that bpf’s security model can be made better than what you’re proposing. I’m genuinely not trying to get in your way. I’m trying to help you improve bpf.

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Andy Lutomirski @ 2019-08-29  0:53 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Alexei Starovoitov, Kees Cook, LSM List,
	James Morris, Jann Horn, Peter Zijlstra, Masami Hiramatsu,
	Steven Rostedt, David S. Miller, Daniel Borkmann,
	Network Development, bpf, kernel-team, Linux API
In-Reply-To: <DA52992F-4862-4945-8482-FE619A04C753@amacapital.net>



> On Aug 28, 2019, at 5:45 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
>>> On Aug 28, 2019, at 3:55 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>> 
>>> On Tue, Aug 27, 2019 at 11:12:29PM -0700, Andy Lutomirski wrote:
>>>>> 
>>>>> 
>>>>> From the previous discussion, you want to make progress toward solving
>>>>> a lot of problems with CAP_BPF.  One of them was making BPF
>>>>> firewalling more generally useful. By making CAP_BPF grant the ability
>>>>> to read kernel memory, you will make administrators much more nervous
>>>>> to grant CAP_BPF.
>>>> 
>>>> Andy, were your email hacked?
>>>> I explained several times that in this proposal
>>>> CAP_BPF _and_ CAP_TRACING _both_ are necessary to read kernel memory.
>>>> CAP_BPF alone is _not enough_.
>>> 
>>> You have indeed said this many times.  You've stated it as a matter of
>>> fact as though it cannot possibly discussed.  I'm asking you to
>>> justify it.
>> 
>> That's not how I see it.
>> I kept stating that both CAP_BPF and CAP_TRACING are necessary to read
>> kernel memory whereas you kept distorting my statement by dropping second
>> part and then making claims that "CAP_BPF grant the ability to read
>> kernel memory, you will make administrators much more nervous".
> 
> Mea culpa. CAP_BPF does, however, appear to permit breaking kASLR due to unsafe pointer conversions, and it allows reading and writing everyone’s maps.  I stand by my overall point.
> 
>> 
>> Just s/CAP_BPF/CAP_BPF and CAP_TRACING/ in this above sentence.
>> See that meaning suddenly changes?
>> Now administrators would be worried about tasks that have both at once.
>> They also would be worried about tasks that have CAP_TRACING alone,
>> because that's what allows probe_kernel_read().
> 
> This is not all what I meant. Of course granting CAP_BPF+CAP_TRACING allows reading kernel memory. This is not at all a problem.  Here is a problem I see:
> 
> CAP_TRACING + CAP_BPF allows modification of other people’s maps and potentially other things that should not be implied by CAP_TRACING alone and that don’t need to be available to tracers. So CAP_TRACING, which is powerful but has somewhat limited scope, isn’t fully useful without CAP_BPF, and giving CAP_TRACING *and* CAP_BPF allows things that teachers shouldn’t be able to do. I think this would make the whole mechanism less useful to Android, for example.
> 
> (Also, I’m not sure quite what you mean by “CAP_TRACING ... allows probe_kernel_read()”. probe_kernel_read() is a kernel function that can’t be directly called by userspace. CAP_TRACING allows reading kernel memory in plenty of ways regardless.)
> 
>> 
>>> It seems like you are specifically trying to add a new switch to turn
>>> as much of BPF as possible on and off.  Why?
>> 
>> Didn't I explain it several times already with multiple examples
>> from systemd, daemons, bpftrace ?
>> 
>> Let's try again.
>> Take your laptop with linux distro.
>> You're the only user there. I'm assuming you're not sharing it with
>> partner and kids. This is my definition of 'single user system'.
>> You can sudo on it at any time, but obviously prefer to run as many
>> apps as possible without cap_sys_admin.
>> Now you found some awesome open source app on the web that monitors
>> the health of the kernel and will pop a nice message on a screen if
>> something is wrong. Currently this app needs root. You hesitate,
>> but the apps is so useful and it has strong upstream code review process
>> that you keep running it 24/7.
>> This is open source app. New versions come. You upgrade.
>> You have enough trust in that app that you keep running it as root.
>> But there is always a chance that new version doing accidentaly
>> something stupid as 'kill -9 -1'. It's an open source app at the end.
>> 
>> Now I come with this CAP* proposal to make this app safer.
>> I'm not making your system more secure and not making this app
>> more secure. I can only make your laptop safer for day to day work
>> by limiting the operations this app can do.
>> This particular app monitros the kernel via bpf and tracing.
>> Hence you can give it CAP_TRACING and CAP_BPF and drop the rest.
> 
> This won’t make me much more comfortable, since CAP_BPF lets it do an ever-growing set of nasty things. I’d much rather one or both of two things happen:
> 
> 1. Give it CAP_TRACING only. It can leak my data, but it’s rather hard for it to crash my laptop, lose data, or cause other shenanigans.
> 
> 2. Improve it a bit do all the privileged ops are wrapped by capset().
> 
> Does this make sense?  I’m a security person on occasion. I find vulnerabilities and exploit them deliberately and I break things by accident on a regular basis. In my considered opinion, CAP_TRACING alone, even extended to cover part of BPF as I’ve described, is decently safe. Getting root with just CAP_TRACING will be decently challenging, especially if I don’t get to read things like sshd’s memory, and improvements to mitigate even that could be added.  I am quite confident that attacks starting with CAP_TRACING will have clear audit signatures if auditing is on.  I am also confident that CAP_BPF *will* allow DoS and likely privilege escalation, and this will only get more likely as BPF gets more widely used. And, if BPF-based auditing ever becomes a thing, writing to the audit daemon’s maps will be a great way to cover one’s tracks.
> 
> 
>> I think they have no choice but to do kernel.unprivileged_bpf_disabled=1.
>> We, as a kernel community, are forcing the users into it.
>> Hence I really do not see a value in any proposal today that expands
>> unprivileged bpf usage.
> 
> I think you’re overemphasizing bpf’s role in the whole speculation mess. I realize that you’ve spent an insane amount of time on mitigations to stupid issues. I’ve spent a less insane amount of time on mitigating similar issues outside of bpf.  It’s a mess.  At the end of the day, the kernel does its best, and new bugs show up. New CPUs will be less buggy.

Bah, accidentally hit send.

If the kernel’s mitigations aren’t good enough or you’re subject to direct user attack (e.g. via insufficient IBPB, SMT attack, etc) then you’re vulnerable. Otherwise you’re less vulnerable. BPF is by no means the whole story. Heck, the kernel *could*, at unfortunate performance cost, more aggressively flush around BPF and effectively treat it like user code.

So I think we should design bpf’s API’s security with the philosophy that speculation attacks are just one more type of bug, and we should make sure that real-world useful configurations don’t give BPF to tasks that don’t need it. My unpriv proposal tries to do this. This is *why* my proposal keeps test_run locked down and restricts running each program type to tasks that are explicitly granted the ability to attach it.

So let’s let CAP_TRACING use bpf. Speculation attacks are mostly irrelevant to tracers anyway, but all the rest of the stuff I’ve been talking is relevant.

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Andy Lutomirski @ 2019-08-29  0:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Alexei Starovoitov, Kees Cook, LSM List,
	James Morris, Jann Horn, Peter Zijlstra, Masami Hiramatsu,
	Steven Rostedt, David S. Miller, Daniel Borkmann,
	Network Development, bpf, kernel-team, Linux API
In-Reply-To: <20190828225512.q6qbvkdiqih2iewk@ast-mbp.dhcp.thefacebook.com>

> On Aug 28, 2019, at 3:55 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Tue, Aug 27, 2019 at 11:12:29PM -0700, Andy Lutomirski wrote:
>>>> 
>>>> 
>>>> From the previous discussion, you want to make progress toward solving
>>>> a lot of problems with CAP_BPF.  One of them was making BPF
>>>> firewalling more generally useful. By making CAP_BPF grant the ability
>>>> to read kernel memory, you will make administrators much more nervous
>>>> to grant CAP_BPF.
>>> 
>>> Andy, were your email hacked?
>>> I explained several times that in this proposal
>>> CAP_BPF _and_ CAP_TRACING _both_ are necessary to read kernel memory.
>>> CAP_BPF alone is _not enough_.
>> 
>> You have indeed said this many times.  You've stated it as a matter of
>> fact as though it cannot possibly discussed.  I'm asking you to
>> justify it.
> 
> That's not how I see it.
> I kept stating that both CAP_BPF and CAP_TRACING are necessary to read
> kernel memory whereas you kept distorting my statement by dropping second
> part and then making claims that "CAP_BPF grant the ability to read
> kernel memory, you will make administrators much more nervous".

Mea culpa. CAP_BPF does, however, appear to permit breaking kASLR due to unsafe pointer conversions, and it allows reading and writing everyone’s maps.  I stand by my overall point.

> 
> Just s/CAP_BPF/CAP_BPF and CAP_TRACING/ in this above sentence.
> See that meaning suddenly changes?
> Now administrators would be worried about tasks that have both at once.
> They also would be worried about tasks that have CAP_TRACING alone,
> because that's what allows probe_kernel_read().

This is not all what I meant. Of course granting CAP_BPF+CAP_TRACING allows reading kernel memory. This is not at all a problem.  Here is a problem I see:

CAP_TRACING + CAP_BPF allows modification of other people’s maps and potentially other things that should not be implied by CAP_TRACING alone and that don’t need to be available to tracers. So CAP_TRACING, which is powerful but has somewhat limited scope, isn’t fully useful without CAP_BPF, and giving CAP_TRACING *and* CAP_BPF allows things that teachers shouldn’t be able to do. I think this would make the whole mechanism less useful to Android, for example.

(Also, I’m not sure quite what you mean by “CAP_TRACING ... allows probe_kernel_read()”. probe_kernel_read() is a kernel function that can’t be directly called by userspace. CAP_TRACING allows reading kernel memory in plenty of ways regardless.)

> 
>> It seems like you are specifically trying to add a new switch to turn
>> as much of BPF as possible on and off.  Why?
> 
> Didn't I explain it several times already with multiple examples
> from systemd, daemons, bpftrace ?
> 
> Let's try again.
> Take your laptop with linux distro.
> You're the only user there. I'm assuming you're not sharing it with
> partner and kids. This is my definition of 'single user system'.
> You can sudo on it at any time, but obviously prefer to run as many
> apps as possible without cap_sys_admin.
> Now you found some awesome open source app on the web that monitors
> the health of the kernel and will pop a nice message on a screen if
> something is wrong. Currently this app needs root. You hesitate,
> but the apps is so useful and it has strong upstream code review process
> that you keep running it 24/7.
> This is open source app. New versions come. You upgrade.
> You have enough trust in that app that you keep running it as root.
> But there is always a chance that new version doing accidentaly
> something stupid as 'kill -9 -1'. It's an open source app at the end.
> 
> Now I come with this CAP* proposal to make this app safer.
> I'm not making your system more secure and not making this app
> more secure. I can only make your laptop safer for day to day work
> by limiting the operations this app can do.
> This particular app monitros the kernel via bpf and tracing.
> Hence you can give it CAP_TRACING and CAP_BPF and drop the rest.

This won’t make me much more comfortable, since CAP_BPF lets it do an ever-growing set of nasty things. I’d much rather one or both of two things happen:

1. Give it CAP_TRACING only. It can leak my data, but it’s rather hard for it to crash my laptop, lose data, or cause other shenanigans.

2. Improve it a bit do all the privileged ops are wrapped by capset().

Does this make sense?  I’m a security person on occasion. I find vulnerabilities and exploit them deliberately and I break things by accident on a regular basis. In my considered opinion, CAP_TRACING alone, even extended to cover part of BPF as I’ve described, is decently safe. Getting root with just CAP_TRACING will be decently challenging, especially if I don’t get to read things like sshd’s memory, and improvements to mitigate even that could be added.  I am quite confident that attacks starting with CAP_TRACING will have clear audit signatures if auditing is on.  I am also confident that CAP_BPF *will* allow DoS and likely privilege escalation, and this will only get more likely as BPF gets more widely used. And, if BPF-based auditing ever becomes a thing, writing to the audit daemon’s maps will be a great way to cover one’s tracks.

> I think they have no choice but to do kernel.unprivileged_bpf_disabled=1.
> We, as a kernel community, are forcing the users into it.
> Hence I really do not see a value in any proposal today that expands
> unprivileged bpf usage.

I think you’re overemphasizing bpf’s role in the whole speculation mess. I realize that you’ve spent an insane amount of time on mitigations to stupid issues. I’ve spent a less insane amount of time on mitigating similar issues outside of bpf.  It’s a mess.  At the end of the day, the kernel does its best, and new bugs show up. New CPUs will be less buggy. 

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-28 23:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, Steven Rostedt, David S. Miller,
	Daniel Borkmann, Network Development, bpf, kernel-team, Linux API
In-Reply-To: <CALCETrX-bn2SpVzTkPz+A=z_oWDs7PNeouzK7wRWMzyaBd4+7g@mail.gmail.com>

On Tue, Aug 27, 2019 at 11:20:19PM -0700, Andy Lutomirski wrote:
> On Tue, Aug 27, 2019 at 9:49 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Aug 27, 2019 at 07:00:40PM -0700, Andy Lutomirski wrote:
> > >
> > > Let me put this a bit differently. Part of the point is that
> > > CAP_TRACING should allow a user or program to trace without being able
> > > to corrupt the system. CAP_BPF as you’ve proposed it *can* likely
> > > crash the system.
> >
> > Really? I'm still waiting for your example where bpf+kprobe crashes the system...
> >
> 
> That's not what I meant.  bpf+kprobe causing a crash is a bug.  I'm
> referring to a totally different issue.  On my laptop:
> 
> $ sudo bpftool map
> 48: hash  name foobar  flags 0x0
>     key 8B  value 8B  max_entries 64  memlock 8192B
> 181: lpm_trie  flags 0x1
>     key 8B  value 8B  max_entries 1  memlock 4096B
> 182: lpm_trie  flags 0x1
>     key 20B  value 8B  max_entries 1  memlock 4096B
> 183: lpm_trie  flags 0x1
>     key 8B  value 8B  max_entries 1  memlock 4096B
> 184: lpm_trie  flags 0x1
>     key 20B  value 8B  max_entries 1  memlock 4096B
> 185: lpm_trie  flags 0x1
>     key 8B  value 8B  max_entries 1  memlock 4096B
> 186: lpm_trie  flags 0x1
>     key 20B  value 8B  max_entries 1  memlock 4096B
> 187: lpm_trie  flags 0x1
>     key 8B  value 8B  max_entries 1  memlock 4096B
> 188: lpm_trie  flags 0x1
>     key 20B  value 8B  max_entries 1  memlock 4096B
> 
> $ sudo bpftool map dump id 186
> key:
> 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
> 00 00 00 00
> value:
> 02 00 00 00 00 00 00 00
> Found 1 element
> 
> $ sudo bpftool map delete id 186 key hex 00 00 00 00 00 00 00 00 00 00
> 00 00 00 00 00 00 00 00 00 00
> [this worked]
> 
> I don't know what my laptop was doing with map id 186 in particular,
> but, whatever it was, I definitely broke it.  If a BPF firewall is in
> use on something important enough, this could easily remove
> connectivity from part or all of the system.  Right now, this needs
> CAP_SYS_ADMIN.  With your patch, CAP_BPF is sufficient to do this, but
> you *also* need CAP_BPF to trace the system using BPF.  Tracing with
> BPF is 'safe' in the absence of bugs.  Modifying other peoples' maps
> is not.

That lpm_trie is likely systemd implementing IP sandboxing.
Not sure whether it's white or black list.
Deleting an IP address from that map will either allow or disallow
network traffic.
Out of band operation on bpf map broke some bpf program. Sure.
But calling it 'breaking the system' is quite a stretch.
Calling it 'crashing the system' is plain wrong.
Yet you're generalizing this bpf map read/write as
"CAP_BPF as you’ve proposed it *can* likely crash the system."
This is what I have a problem with.

Anyway, changing gears...
Yes. I did propose to make a task with CAP_BPF to be able to
manipulate arbitrary maps in the system.
You could have said that if CAP_BPF is given to 'bpftool'
then any user will be able to mess with other maps because
bpftool is likely chmod-ed 755.
Absolutely correct!
It's not a fault of the CAP_BPF scope.
Just don't give that cap to bpftool or do different acl/chmod.

> If the answer is the latter, then maybe it would make sense to try to
> implement some of the unprivileged bpf stuff and then to see whether
> CAP_BPF is still needed.

<broken_record_mode=on> Nack to extensions to unprivileged bpf.


^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-28 22:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, Steven Rostedt, David S. Miller,
	Daniel Borkmann, Network Development, bpf, kernel-team, Linux API
In-Reply-To: <CALCETrW1o+Lazi2Ng6b9JN6jeJffgdW9f3HvqYhNo4TpHRXW=g@mail.gmail.com>

On Tue, Aug 27, 2019 at 11:12:29PM -0700, Andy Lutomirski wrote:
> >
> > >
> > > From the previous discussion, you want to make progress toward solving
> > > a lot of problems with CAP_BPF.  One of them was making BPF
> > > firewalling more generally useful. By making CAP_BPF grant the ability
> > > to read kernel memory, you will make administrators much more nervous
> > > to grant CAP_BPF.
> >
> > Andy, were your email hacked?
> > I explained several times that in this proposal
> > CAP_BPF _and_ CAP_TRACING _both_ are necessary to read kernel memory.
> > CAP_BPF alone is _not enough_.
> 
> You have indeed said this many times.  You've stated it as a matter of
> fact as though it cannot possibly discussed.  I'm asking you to
> justify it.

That's not how I see it.
I kept stating that both CAP_BPF and CAP_TRACING are necessary to read
kernel memory whereas you kept distorting my statement by dropping second
part and then making claims that "CAP_BPF grant the ability to read
kernel memory, you will make administrators much more nervous".

Just s/CAP_BPF/CAP_BPF and CAP_TRACING/ in this above sentence.
See that meaning suddenly changes?
Now administrators would be worried about tasks that have both at once.
They also would be worried about tasks that have CAP_TRACING alone,
because that's what allows probe_kernel_read().

> It seems like you are specifically trying to add a new switch to turn
> as much of BPF as possible on and off.  Why?

Didn't I explain it several times already with multiple examples
from systemd, daemons, bpftrace ?

Let's try again.
Take your laptop with linux distro.
You're the only user there. I'm assuming you're not sharing it with
partner and kids. This is my definition of 'single user system'.
You can sudo on it at any time, but obviously prefer to run as many
apps as possible without cap_sys_admin.
Now you found some awesome open source app on the web that monitors
the health of the kernel and will pop a nice message on a screen if
something is wrong. Currently this app needs root. You hesitate,
but the apps is so useful and it has strong upstream code review process
that you keep running it 24/7.
This is open source app. New versions come. You upgrade.
You have enough trust in that app that you keep running it as root.
But there is always a chance that new version doing accidentaly
something stupid as 'kill -9 -1'. It's an open source app at the end.

Now I come with this CAP* proposal to make this app safer.
I'm not making your system more secure and not making this app
more secure. I can only make your laptop safer for day to day work
by limiting the operations this app can do.
This particular app monitros the kernel via bpf and tracing.
Hence you can give it CAP_TRACING and CAP_BPF and drop the rest.

> > speaking of MDS... I already asked you to help investigate its
> > applicability with existing bpf exposure. Are you going to do that?
> 
> I am blissfully uninvolved in MDS, and I don't know all that much more
> about the overall mechanism than a random reader of tech news :)  ISTM
> there are two meaningful ways that BPF could be involved: a BPF
> program could leak info into the state exposed by MDS, or a BPF
> program could try to read that state.  From what little I understand,
> it's essentially inevitable that BPF leaks information into MDS state,
> and this is probably even controllable by an attacker that understands
> MDS in enough detail.    So the interesting questions are: can BPF be
> used to read MDS state and can BPF be used to leak information in a
> more useful way than the rest of the kernel to an attacker.

agree. that's exactly the question to ask.

> Keeping in mind that the kernel will flush MDS state on every exit to
> usermode, I think the most likely attack is to try to read MDS state
> with BPF.  This could happen, I suppose -- BPF programs can easily
> contain the usual speculation gadgets of "do something and read an
> address that depends on the outcome".  Fortunately, outside of
> bpf_probe_read(), AFAIK BPF programs can't directly touch user memory,
> and an attacker that is allowed to use bpf_probe_read() doesn't need
> MDS to read things.

true as well.
So what do we do with that sentence in Documentation/x86/mds.rst?
Nothing?
New hw bugs will keep coming.
All of them should get similar wording?
Your understanding of MDS and BPF is way above the average.
What other users suppose to do when they read such sentence?
I think they have no choice but to do kernel.unprivileged_bpf_disabled=1.
We, as a kernel community, are forcing the users into it.
Hence I really do not see a value in any proposal today that expands
unprivileged bpf usage.
Since kernel.unprivileged_bpf_disabled=1 all bpf is under cap_sys_admin.
It's not great from security and safety pov. Hence this CAP* proposal.

> > Please take a look at Jann's var1 exploit. Was it hard to run bpf prog
> > in controlled environment without test_run command ?
> >
> 
> Can you send me a link?

https://bugs.chromium.org/p/project-zero/issues/detail?id=1272
writeup_files.tar:kernel_leak_exploit_amd_pro_a8_9600_r7/bpf_stuff.c
Execution is as trivial as write(sockfd, "X", 1) line 405.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox