Linux Security Modules development
 help / color / mirror / Atom feed
* Re: [PATCH v7 3/7] exec: Move path_noexec() check earlier
From: Mickaël Salaün @ 2020-08-13 15:31 UTC (permalink / raw)
  To: Eric W. Biederman, Kees Cook
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
	Andrew Morton, Andy Lutomirski, Christian Brauner,
	Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
	Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
	Jann Horn, Jonathan Corbet, Lakshmi Ramasubramanian,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
	Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
	linux-integrity, linux-security-module, linux-fsdevel
In-Reply-To: <87a6z1m0u1.fsf@x220.int.ebiederm.org>

Kees Cook wrote this patch, which is in Andrew Morton's tree, but I
think you're talking about O_MAYEXEC, not this patch specifically.

On 11/08/2020 21:36, Eric W. Biederman wrote:
> Mickaël Salaün <mic@digikod.net> writes:
> 
>> From: Kees Cook <keescook@chromium.org>
>>
>> The path_noexec() check, like the regular file check, was happening too
>> late, letting LSMs see impossible execve()s. Check it earlier as well
>> in may_open() and collect the redundant fs/exec.c path_noexec() test
>> under the same robustness comment as the S_ISREG() check.
>>
>> My notes on the call path, and related arguments, checks, etc:
> 
> A big question arises, that I think someone already asked.

Al Viro and Jann Horn expressed such concerns for O_MAYEXEC:
https://lore.kernel.org/lkml/0cc94c91-afd3-27cd-b831-8ea16ca8ca93@digikod.net/

> 
> Why perform this test in may_open directly instead of moving
> it into inode_permission.  That way the code can be shared with
> faccessat, and any other code path that wants it?

This patch is just a refactoring.

About O_MAYEXEC, path-based LSM, IMA and IPE need to work on a struct
file, whereas inode_permission() only gives a struct inode. However,
faccessat2(2) (with extended flags) seems to be the perfect candidate if
we want to be able to check file descriptors.

> 
> That would look to provide a more maintainable kernel.

Why would it be more maintainable?

> 
> Eric
> 
> 
>> do_open_execat()
>>     struct open_flags open_exec_flags = {
>>         .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
>>         .acc_mode = MAY_EXEC,
>>         ...
>>     do_filp_open(dfd, filename, open_flags)
>>         path_openat(nameidata, open_flags, flags)
>>             file = alloc_empty_file(open_flags, current_cred());
>>             do_open(nameidata, file, open_flags)
>>                 may_open(path, acc_mode, open_flag)
>>                     /* new location of MAY_EXEC vs path_noexec() test */
>>                     inode_permission(inode, MAY_OPEN | acc_mode)
>>                         security_inode_permission(inode, acc_mode)
>>                 vfs_open(path, file)
>>                     do_dentry_open(file, path->dentry->d_inode, open)
>>                         security_file_open(f)
>>                         open()
>>     /* old location of path_noexec() test */
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Signed-off-by: Kees Cook <keescook@chromium.org>
>> Link: https://lore.kernel.org/r/20200605160013.3954297-4-keescook@chromium.org
>> ---
>>  fs/exec.c  | 12 ++++--------
>>  fs/namei.c |  4 ++++
>>  2 files changed, 8 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index bdc6a6eb5dce..4eea20c27b01 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -147,10 +147,8 @@ SYSCALL_DEFINE1(uselib, const char __user *, library)
>>  	 * and check again at the very end too.
>>  	 */
>>  	error = -EACCES;
>> -	if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)))
>> -		goto exit;
>> -
>> -	if (path_noexec(&file->f_path))
>> +	if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) ||
>> +			 path_noexec(&file->f_path)))
>>  		goto exit;
>>  
>>  	fsnotify_open(file);
>> @@ -897,10 +895,8 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags)
>>  	 * and check again at the very end too.
>>  	 */
>>  	err = -EACCES;
>> -	if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode)))
>> -		goto exit;
>> -
>> -	if (path_noexec(&file->f_path))
>> +	if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) ||
>> +			 path_noexec(&file->f_path)))
>>  		goto exit;
>>  
>>  	err = deny_write_access(file);
>> diff --git a/fs/namei.c b/fs/namei.c
>> index a559ad943970..ddc9b25540fe 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -2863,6 +2863,10 @@ static int may_open(const struct path *path, int acc_mode, int flag)
>>  			return -EACCES;
>>  		flag &= ~O_TRUNC;
>>  		break;
>> +	case S_IFREG:
>> +		if ((acc_mode & MAY_EXEC) && path_noexec(path))
>> +			return -EACCES;
>> +		break;
>>  	}
>>  
>>  	error = inode_permission(inode, MAY_OPEN | acc_mode);

^ permalink raw reply

* Loadpin: Possibility of exceptions for certain firmwares?
From: Martin.U.Lang @ 2020-08-13 15:14 UTC (permalink / raw)
  To: keescook, linux-security-module

Hi Kees, hi LSM community,

I'm currently struggling a bit with the bare simplicity of loadpin and trying to understand which options exist for me.

We have the following situation: 
Our root file system is integrity protected with dm-verity. During the development/validation stage, we create builds that use a test key for secure boot.  For certain builds that successfully passed validation, we would like to "promote" these builds to be a real release and sign them with the real secure-boot/verified-boot release keys. This requires us to re-sign all secure-boot signed artifacts (bootloaders, TEE, kernel, ...) including the dm-verity root hash(es) of the filesystem(s). During this re-signing process we do not want to touch the root file system (due to validation concerns) but only to provide new signatures for the dm-verity root hash.
As it turns out, some of the firmware blobs of our system are signed with the secure boot keys of the system and this signature is validated somewhere outside of the Linux kernel (by other firmware, by the TEE...). So, we wanted to move those outside of the root filesystem in order to avoid changing the file system as part of the release re-signing process. However, we then noticed that the kernel driver seems to load these firmwares using the usual firmware loader infrastructure of Linux. Consequently, loadpin kicks in, does its job and denies loading the firmware as it comes now from this extra partition.

So, I was wondering if adding an exception list to loadpin could be a reasonable approach to this issue. If I understand the whole infrastructure correctly, the driver requests a firmware (file) by name. If I know on my system that the firmware with this name is self-contained secure-boot signed and later-on validated independently anyways, there is no security benefit in letting loadpin enforce that it must come from my root fs. Consequently, I was thinking of adding a (configurable) exception list mechanism to loadpin to accept certain firmwares (by name) even if they are not loaded from the pinned fs.

Would you be open to such an extension of loadpin? If there are other suggestions how this issue could be addressed, I'm also grateful for input. I would like to see a solution that enables us to use loadpin as it saves quite some effort on kernel module signing.

Best Regards,

Martin

^ permalink raw reply

* Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)
From: James Bottomley @ 2020-08-13 15:10 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Mimi Zohar, James Morris, Deven Bowers, Pavel Machek, Sasha Levin,
	snitzer, dm-devel, tyhicks, agk, Paul Moore, Jonathan Corbet,
	nramas, serge, pasha.tatashin, Jann Horn, linux-block, Al Viro,
	Jens Axboe, mdsakib, open list, eparis, linux-security-module,
	linux-audit, linux-fsdevel, linux-integrity, jaskarankhurana
In-Reply-To: <D470BA4B-EF1A-49CA-AFB9-0F7FFC4C6001@gmail.com>

On Thu, 2020-08-13 at 10:42 -0400, Chuck Lever wrote:
> > On Aug 12, 2020, at 11:51 AM, James Bottomley <James.Bottomley@Hans
> > enPartnership.com> wrote:
> > On Wed, 2020-08-12 at 10:15 -0400, Chuck Lever wrote:
> > > > On Aug 11, 2020, at 11:53 AM, James Bottomley
> > > > <James.Bottomley@HansenPartnership.com> wrote:
> > > > On Tue, 2020-08-11 at 10:48 -0400, Chuck Lever wrote:
[...]
> > > > > > > The client would have to reconstruct that tree again if
> > > > > > > memory pressure caused some or all of the tree to be
> > > > > > > evicted, so perhaps an on-demand mechanism is preferable.
> > > > > > 
> > > > > > Right, but I think that's implementation detail.  Probably
> > > > > > what we need is a way to get the log(N) verification hashes
> > > > > > from the server and it's up to the client whether it caches
> > > > > > them or not.
> > > > > 
> > > > > Agreed, these are implementation details. But see above about
> > > > > the trustworthiness of the intermediate hashes. If they are
> > > > > conveyed on an untrusted network, then they can't be trusted
> > > > > either.
> > > > 
> > > > Yes, they can, provided enough of them are asked for to
> > > > verify.  If you look at the simple example above, suppose I
> > > > have cached H11 and H12, but I've lost the entire H2X layer.  I
> > > > want to verify B3 so I also ask you for your copy of H24.  Then
> > > > I generate H23 from B3 and Hash H23 and H24.  If this doesn't
> > > > hash to H12 I know either you supplied me the wrong block or
> > > > lied about H24.  However, if it all hashes correctly I know you
> > > > supplied me with both the correct B3 and the correct H24.
> > > 
> > > My point is there is a difference between a trusted cache and an
> > > untrusted cache. I argue there is not much value in a cache where
> > > the hashes have to be verified again.
> > 
> > And my point isn't about caching, it's about where the tree comes
> > from. I claim and you agree the client can get the tree from the
> > server a piece at a time (because it can path verify it) and
> > doesn't have to generate it itself.
> 
> OK, let's focus on where the tree comes from. It is certainly
> possible to build protocol to exchange parts of a Merkle tree.

Which is what I think we need to extend IMA to do.

>  The question is how it might be stored on the server.

I think the only thing the server has to guarantee to store is the head
hash, possibly signed.

>  There are some underlying assumptions about the metadata storage
> mechanism that should be stated up front.
> 
> Current forms of IMA metadata are limited in size and stored in a
> container that is read and written in a single operation. If we stick
> with that container format, I don't see a way to store a Merkle tree
> in there for all file sizes.

Well, I don't think you need to.  The only thing that needs to be
stored is the head hash.  Everything else can be reconstructed.  If you
asked me to implement it locally, I'd probably put the head hash in an
xattr but use a CAM based cache for the merkel trees and construct the
tree on first access if it weren't already in the cache.

However, the above isn't what fs-verity does: it stores the tree in a
hidden section of the file.  That's why I don't think we'd mandate
anything about tree storage.  Just describe the partial retrieval
properties we'd like and leave the rest as an implementation detail.

> Thus it seems to me that we cannot begin to consider the tree-on-the-
> server model unless there is a proposed storage mechanism for that
> whole tree. Otherwise, the client must have the primary role in
> unpacking and verifying the tree.

Well, as I said,  I don't think you need to store the tree.  You
certainly could decide to store the entire tree (as fs-verity does) if
it fitted your use case, but it's not required.  Perhaps even in my
case I'd make the CAM based cache persistent, like android's dalvik
cache.

James


> Storing only the tree root in the metadata means the metadata format
> is nicely bounded in size.

^ permalink raw reply

* Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)
From: Chuck Lever @ 2020-08-13 14:56 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mimi Zohar, James Morris, Deven Bowers, Pavel Machek, Sasha Levin,
	snitzer, dm-devel, tyhicks, agk, Paul Moore, Jonathan Corbet,
	nramas, serge, pasha.tatashin, Jann Horn, linux-block, Al Viro,
	Jens Axboe, mdsakib, open list, eparis, linux-security-module,
	linux-audit, linux-fsdevel, linux-integrity, jaskarankhurana
In-Reply-To: <1597329763.3708.13.camel@HansenPartnership.com>



> On Aug 13, 2020, at 10:42 AM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> On Thu, 2020-08-13 at 10:21 -0400, Chuck Lever wrote:
>>> On Aug 12, 2020, at 11:42 AM, James Bottomley <James.Bottomley@Hans
>>> enPartnership.com> wrote:
> [...]
>>> For most people the security mechanism of local xattrs is
>>> sufficient.  If you're paranoid, you don't believe it is and you
>>> use EVM.
>> 
>> When IMA metadata happens to be stored in local filesystems in
>> a trusted xattr, it's going to enjoy the protection you describe
>> without needing the addition of a cryptographic signature.
>> 
>> However, that metadata doesn't live its whole life there. It
>> can reside in a tar file, it can cross a network, it can live
>> on a back-up tape. I think we agree that any time that metadata
>> is in transit or at rest outside of a Linux local filesystem, it
>> is exposed.
>> 
>> Thus I'm interested in a metadata protection mechanism that does
>> not rely on the security characteristics of a particular storage
>> container. For me, a cryptographic signature fits that bill
>> nicely.
> 
> Sure, but one of the points about IMA is a separation of mechanism from
> policy.  Signed hashes (called appraisal in IMA terms) is just one
> policy you can decide to require or not or even make it conditional on
> other things.

AFAICT, the current EVM_IMA_DIGSIG and EVM_PORTABLE_DIGSIG formats are
always signed. The policy choice is whether or not to verify the
signature, not whether or not the metadata format is signed.


--
Chuck Lever
chucklever@gmail.com




^ permalink raw reply

* Re: [PATCH v7 5/7] fs,doc: Enable to enforce noexec mounts or file exec through O_MAYEXEC
From: Mickaël Salaün @ 2020-08-13 14:49 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
	Andrew Morton, Andy Lutomirski, Christian Brauner,
	Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
	Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
	Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
	Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
	linux-integrity, linux-security-module, linux-fsdevel,
	Thibaut Sautereau, Randy Dunlap
In-Reply-To: <87364tkl99.fsf@x220.int.ebiederm.org>



On 11/08/2020 21:58, Eric W. Biederman wrote:
> Mickaël Salaün <mic@digikod.net> writes:
> 
>> Allow for the enforcement of the O_MAYEXEC openat2(2) flag.  Thanks to
>> the noexec option from the underlying VFS mount, or to the file execute
>> permission, userspace can enforce these execution policies.  This may
>> allow script interpreters to check execution permission before reading
>> commands from a file, or dynamic linkers to allow shared object
>> loading.
> 
> Ick!!!!!
> 
> This feels like being so open minded your brains fall out.

Really? :)

> 
> I can see having a sysctl that allows the new open flag to be ignored
> so that the existing lack of enforcement when the flag is passed
> continues.

And that could break distros (i.e. multiple interpreters, with or
without O_MAYEXEC, different versions, and different kernels).

> 
> But having the sysctl be fine grained seems like way too much rope.

This follow the same rational: file permissions and mount options can
change but they are controled by the sysadmin, how also configure sysctl
values.

> 
> I don't think the code needs to do more than enforce or not enforce this
> logic.

I think this is the more viable behavior for an eclectic set of distros
(and different use cases).

> 
> You can test the sysctl once when you process O_MAYEXEC.  But code such
> as may_open should not have the conditional behavior.  It should get an
> appropriate set of flags that are always enforced.  With the madness of
> what to do left at the edge of userspace.

The problem is that this userspace is not in charge of the system-wide
policy, the sysadmin is. As pointed out by the commit message, please
take a look at the rational:
https://lore.kernel.org/lkml/1477d3d7-4b36-afad-7077-a38f42322238@digikod.net/

> 
> Anything else appears to be madness, overengineering, and a failure to
> separate concerns.

Oh! What a conclusion! :)

I'd say it is a pragmatic approach.

> 
> Eric
> 
> 
>> Add a new sysctl fs.open_mayexec_enforce to enable system administrators
>> to enforce two complementary security policies according to the
>> installed system: enforce the noexec mount option, and enforce
>> executable file permission.  Indeed, because of compatibility with
>> installed systems, only system administrators are able to check that
>> this new enforcement is in line with the system mount points and file
>> permissions.  A following patch adds documentation.
>>
>> Being able to restrict execution also enables to protect the kernel by
>> restricting arbitrary syscalls that an attacker could perform with a
>> crafted binary or certain script languages.  It also improves multilevel
>> isolation by reducing the ability of an attacker to use side channels
>> with specific code.  These restrictions can natively be enforced for ELF
>> binaries (with the noexec mount option) but require this kernel
>> extension to properly handle scripts (e.g., Python, Perl).  To get a
>> consistent execution policy, additional memory restrictions should also
>> be enforced (e.g. thanks to SELinux).
>>
>> Because the O_MAYEXEC flag is a meant to enforce a system-wide security
>> policy (but not application-centric policies), it does not make sense
>> for userland to check the sysctl value.  Indeed, this new flag only
>> enables to extend the system ability to enforce a policy thanks to (some
>> trusted) userland collaboration.  Moreover, additional security policies
>> could be managed by LSMs.  This is a best-effort approach from the
>> application developer point of view:
>> https://lore.kernel.org/lkml/1477d3d7-4b36-afad-7077-a38f42322238@digikod.net/
>>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Reviewed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
>> Cc: Aleksa Sarai <cyphar@cyphar.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Randy Dunlap <rdunlap@infradead.org>
>> ---
>>
>> Changes since v6:
>> * Allow opening pipes, block devices and character devices with
>>   O_MAYEXEC when there is no enforced policy, but forbid any non-regular
>>   file opened with O_MAYEXEC otherwise (i.e. for any enforced policy).
>> * Add a paragraph about the non-regular files policy.
>> * Move path_noexec() calls out of the fast-path (suggested by Kees
>>   Cook).
>>
>> Changes since v5:
>> * Remove the static enforcement configuration through Kconfig because it
>>   makes the code more simple like this, and because the current sysctl
>>   configuration can only be set with CAP_SYS_ADMIN, the same way mount
>>   options (i.e. noexec) can be set.  If an harden distro wants to
>>   enforce a configuration, it should restrict capabilities or sysctl
>>   configuration.  Furthermore, an LSM can easily leverage O_MAYEXEC to
>>   fit its need.
>> * Move checks from inode_permission() to may_open() and make the error
>>   codes more consistent according to file types (in line with a previous
>>   commit): opening a directory with O_MAYEXEC returns EISDIR and other
>>   non-regular file types may return EACCES.
>> * In may_open(), when OMAYEXEC_ENFORCE_FILE is set, replace explicit
>>   call to generic_permission() with an artificial MAY_EXEC to avoid
>>   double calls.  This makes sense especially when an LSM policy forbids
>>   execution of a file.
>> * Replace the custom proc_omayexec() with
>>   proc_dointvec_minmax_sysadmin(), and then replace the CAP_MAC_ADMIN
>>   check with a CAP_SYS_ADMIN one (suggested by Kees Cook and Stephen
>>   Smalley).
>> * Use BIT() (suggested by Kees Cook).
>> * Rename variables (suggested by Kees Cook).
>> * Reword the kconfig help.
>> * Import the documentation patch (suggested by Kees Cook):
>>   https://lore.kernel.org/lkml/20200505153156.925111-6-mic@digikod.net/
>> * Update documentation and add LWN.net article.
>>
>> Changes since v4:
>> * Add kernel configuration options to enforce O_MAYEXEC at build time,
>>   and disable the sysctl in such case (requested by James Morris).
>> * Reword commit message.
>>
>> Changes since v3:
>> * Update comment with O_MAYEXEC.
>>
>> Changes since v2:
>> * Cosmetic changes.
>>
>> Changes since v1:
>> * Move code from Yama to the FS subsystem (suggested by Kees Cook).
>> * Make omayexec_inode_permission() static (suggested by Jann Horn).
>> * Use mode 0600 for the sysctl.
>> * Only match regular files (not directories nor other types), which
>>   follows the same semantic as commit 73601ea5b7b1 ("fs/open.c: allow
>>   opening only regular files during execve()").
>> ---
>>  Documentation/admin-guide/sysctl/fs.rst | 49 +++++++++++++++++++++++++
>>  fs/namei.c                              | 24 ++++++++++++
>>  include/linux/fs.h                      |  1 +
>>  kernel/sysctl.c                         | 12 +++++-
>>  4 files changed, 84 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
>> index 2a45119e3331..ce6e2081d3a9 100644
>> --- a/Documentation/admin-guide/sysctl/fs.rst
>> +++ b/Documentation/admin-guide/sysctl/fs.rst
>> @@ -37,6 +37,7 @@ Currently, these files are in /proc/sys/fs:
>>  - inode-nr
>>  - inode-state
>>  - nr_open
>> +- open_mayexec_enforce
>>  - overflowuid
>>  - overflowgid
>>  - pipe-user-pages-hard
>> @@ -165,6 +166,54 @@ system needs to prune the inode list instead of allocating
>>  more.
>>  
>>  
>> +open_mayexec_enforce
>> +--------------------
>> +
>> +While being ignored by :manpage:`open(2)` and :manpage:`openat(2)`, the
>> +``O_MAYEXEC`` flag can be passed to :manpage:`openat2(2)` to only open regular
>> +files that are expected to be executable.  If the file is not identified as
>> +executable, then the syscall returns -EACCES.  This may allow a script
>> +interpreter to check executable permission before reading commands from a file,
>> +or a dynamic linker to only load executable shared objects.  One interesting
>> +use case is to enforce a "write xor execute" policy through interpreters.
>> +
>> +The ability to restrict code execution must be thought as a system-wide policy,
>> +which first starts by restricting mount points with the ``noexec`` option.
>> +This option is also automatically applied to special filesystems such as /proc .
>> +This prevents files on such mount points to be directly executed by the kernel
>> +or mapped as executable memory (e.g. libraries).  With script interpreters
>> +using the ``O_MAYEXEC`` flag, the executable permission can then be checked
>> +before reading commands from files. This makes it possible to enforce the
>> +``noexec`` at the interpreter level, and thus propagates this security policy
>> +to scripts.  To be fully effective, these interpreters also need to handle the
>> +other ways to execute code: command line parameters (e.g., option ``-e`` for
>> +Perl), module loading (e.g., option ``-m`` for Python), stdin, file sourcing,
>> +environment variables, configuration files, etc.  According to the threat
>> +model, it may be acceptable to allow some script interpreters (e.g. Bash) to
>> +interpret commands from stdin, may it be a TTY or a pipe, because it may not be
>> +enough to (directly) perform syscalls.
>> +
>> +There are two complementary security policies: enforce the ``noexec`` mount
>> +option, and enforce executable file permission.  These policies are handled by
>> +the ``fs.open_mayexec_enforce`` sysctl (writable only with ``CAP_SYS_ADMIN``)
>> +as a bitmask:
>> +
>> +1 - Mount restriction: checks that the mount options for the underlying VFS
>> +    mount do not prevent execution.
>> +
>> +2 - File permission restriction: checks that the to-be-opened file is marked as
>> +    executable for the current process (e.g., POSIX permissions).
>> +
>> +Note that as long as a policy is enforced, opening any non-regular file with
>> +``O_MAYEXEC`` is denied (e.g. TTYs, pipe), even when such a file is marked as
>> +executable or is on an executable mount point.
>> +
>> +Code samples can be found in tools/testing/selftests/openat2/omayexec_test.c
>> +and interpreter patches (for the original O_MAYEXEC version) may be found at
>> +https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC .
>> +See also an overview article: https://lwn.net/Articles/820000/ .
>> +
>> +
>>  overflowgid & overflowuid
>>  -------------------------
>>  
>> diff --git a/fs/namei.c b/fs/namei.c
>> index 3f074ec77390..8ec13c7fd403 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -39,6 +39,7 @@
>>  #include <linux/bitops.h>
>>  #include <linux/init_task.h>
>>  #include <linux/uaccess.h>
>> +#include <linux/sysctl.h>
>>  
>>  #include "internal.h"
>>  #include "mount.h"
>> @@ -425,6 +426,11 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
>>  	return 0;
>>  }
>>  
>> +#define OPEN_MAYEXEC_ENFORCE_MOUNT	BIT(0)
>> +#define OPEN_MAYEXEC_ENFORCE_FILE	BIT(1)
>> +
>> +int sysctl_open_mayexec_enforce __read_mostly;
>> +
>>  /**
>>   * inode_permission - Check for access rights to a given inode
>>   * @inode: Inode to check permission on
>> @@ -2861,11 +2867,29 @@ static int may_open(const struct path *path, int acc_mode, int flag)
>>  	case S_IFSOCK:
>>  		if (acc_mode & MAY_EXEC)
>>  			return -EACCES;
>> +		/*
>> +		 * Opening devices (e.g. TTYs) or pipes with O_MAYEXEC may be
>> +		 * legitimate when there is no enforced policy.
>> +		 */
>> +		if ((acc_mode & MAY_OPENEXEC) && sysctl_open_mayexec_enforce)
>> +			return -EACCES;
>>  		flag &= ~O_TRUNC;
>>  		break;
>>  	case S_IFREG:
>>  		if ((acc_mode & MAY_EXEC) && path_noexec(path))
>>  			return -EACCES;
>> +		if (acc_mode & MAY_OPENEXEC) {
>> +			if ((sysctl_open_mayexec_enforce & OPEN_MAYEXEC_ENFORCE_MOUNT)
>> +					&& path_noexec(path))
>> +				return -EACCES;
>> +			if (sysctl_open_mayexec_enforce & OPEN_MAYEXEC_ENFORCE_FILE)
>> +				/*
>> +				 * Because acc_mode may change here, the next and only
>> +				 * use of acc_mode should then be by the following call
>> +				 * to inode_permission().
>> +				 */
>> +				acc_mode |= MAY_EXEC;
>> +		}
>>  		break;
>>  	}
>>  
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 56f835c9a87a..071f37707ccc 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -83,6 +83,7 @@ extern int sysctl_protected_symlinks;
>>  extern int sysctl_protected_hardlinks;
>>  extern int sysctl_protected_fifos;
>>  extern int sysctl_protected_regular;
>> +extern int sysctl_open_mayexec_enforce;
>>  
>>  typedef __kernel_rwf_t rwf_t;
>>  
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index db1ce7af2563..5008a2566e79 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -113,6 +113,7 @@ static int sixty = 60;
>>  
>>  static int __maybe_unused neg_one = -1;
>>  static int __maybe_unused two = 2;
>> +static int __maybe_unused three = 3;
>>  static int __maybe_unused four = 4;
>>  static unsigned long zero_ul;
>>  static unsigned long one_ul = 1;
>> @@ -888,7 +889,6 @@ static int proc_taint(struct ctl_table *table, int write,
>>  	return err;
>>  }
>>  
>> -#ifdef CONFIG_PRINTK
>>  static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
>>  				void *buffer, size_t *lenp, loff_t *ppos)
>>  {
>> @@ -897,7 +897,6 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
>>  
>>  	return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
>>  }
>> -#endif
>>  
>>  /**
>>   * struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range checking structure
>> @@ -3264,6 +3263,15 @@ static struct ctl_table fs_table[] = {
>>  		.extra1		= SYSCTL_ZERO,
>>  		.extra2		= &two,
>>  	},
>> +	{
>> +		.procname       = "open_mayexec_enforce",
>> +		.data           = &sysctl_open_mayexec_enforce,
>> +		.maxlen         = sizeof(int),
>> +		.mode           = 0600,
>> +		.proc_handler	= proc_dointvec_minmax_sysadmin,
>> +		.extra1		= SYSCTL_ZERO,
>> +		.extra2		= &three,
>> +	},
>>  #if defined(CONFIG_BINFMT_MISC) || defined(CONFIG_BINFMT_MISC_MODULE)
>>  	{
>>  		.procname	= "binfmt_misc",

^ permalink raw reply

* Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)
From: Chuck Lever @ 2020-08-13 14:42 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mimi Zohar, James Morris, Deven Bowers, Pavel Machek, Sasha Levin,
	snitzer, dm-devel, tyhicks, agk, Paul Moore, Jonathan Corbet,
	nramas, serge, pasha.tatashin, Jann Horn, linux-block, Al Viro,
	Jens Axboe, mdsakib, open list, eparis, linux-security-module,
	linux-audit, linux-fsdevel, linux-integrity, jaskarankhurana
In-Reply-To: <1597247482.7293.18.camel@HansenPartnership.com>



> On Aug 12, 2020, at 11:51 AM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> On Wed, 2020-08-12 at 10:15 -0400, Chuck Lever wrote:
>>> On Aug 11, 2020, at 11:53 AM, James Bottomley
>>> <James.Bottomley@HansenPartnership.com> wrote:
>>> 
>>> On Tue, 2020-08-11 at 10:48 -0400, Chuck Lever wrote:
> [...]
>>>>> 
>>>>> and what is nice to have to speed up the verification
>>>>> process.  The choice for the latter is cache or reconstruct
>>>>> depending on the resources available.  If the tree gets cached
>>>>> on the server, that would be a server implementation detail
>>>>> invisible to the client.
>>>> 
>>>> We assume that storage targets (for block or file) are not
>>>> trusted. Therefore storage clients cannot rely on intermediate
>>>> results (eg, middle nodes in a Merkle tree) unless those results
>>>> are generated within the client's trust envelope.
>>> 
>>> Yes, they can ... because supplied nodes can be verified.  That's
>>> the whole point of a merkle tree.  As long as I'm sure of the root
>>> hash I can verify all the rest even if supplied by an untrusted
>>> source.  If you consider a simple merkle tree covering 4 blocks:
>>> 
>>>      R
>>>    /   \
>>> H11     H12
>>> / \     / \
>>> H21 H22 H23 H24
>>>>   |   |   |
>>> 
>>> B1   B2  B3  B4
>>> 
>>> Assume I have the verified root hash R.  If you supply B3 you also
>>> supply H24 and H11 as proof.  I verify by hashing B3 to produce H23
>>> then hash H23 and H24 to produce H12 and if H12 and your supplied
>>> H11 hash to R the tree is correct and the B3 you supplied must
>>> likewise be correct.
>> 
>> I'm not sure what you are proving here. Obviously this has to work
>> in order for a client to reconstruct the file's Merkle tree given
>> only R and the file content.
> 
> You implied the server can't be trusted to generate the merkel tree. 
> I'm showing above it can because of the tree path based verification.

What I was implying is that clients can't trust intermediate Merkle
tree content that is not also signed. So far we are talking about
signing only the tree root.

The storage server can store the tree durably, but if the intermediate
parts of the tree are not signed, the client has to verify them anyway,
and that reduces the value of storing potentially large data structures.


>> It's the construction of the tree and verification of the hashes that
>> are potentially expensive. The point of caching intermediate hashes
>> is so that the client verifies them as few times as possible.  I
>> don't see value in caching those hashes on an untrusted server --
>> the client will have to reverify them anyway, and there will be no
>> savings.
> 
> I'm not making any claim about server caching, I'm just saying the
> client can request pieces of the tree from the server without having to
> reconstruct the whole thing itself because it can verify their
> correctness.

To be clear, my concern is about how much of the tree might be stored
in a Merkle-based metadata format. I just don't see that it has much
value to store more than the signed tree root, because the client will
have to reconstitute or verify some tree contents on most every read.

For sufficiently large files, the tree itself can be larger than what
can be stored in an xattr. This is the same problem that fs-verity
faces. And, as I stated earlier, xattr objects are read in their
entirety, they can't be seeked into or read piecemeal.

What it seemed to me that you were suggesting was an offloaded cache
of the Merkle tree. Either the whole tree is stored on the storage
server, or the storage server provides a service that reconstitutes
that tree on behalf of clients. (Please correct me if I misunderstood).
I just don't think that will be practicable or provide the kind of
benefit you might want.


>> Cache once, as close as you can to where the data will be used.
>> 
>> 
>>>> So: if the storage target is considered inside the client's trust
>>>> envelope, it can cache or store durably any intermediate parts of
>>>> the verification process. If not, the network and file storage is
>>>> considered untrusted, and the client has to rely on nothing but
>>>> the signed digest of the tree root.
>>>> 
>>>> We could build a scheme around, say, fscache, that might save the
>>>> intermediate results durably and locally.
>>> 
>>> I agree we want caching on the client, but we can always page in
>>> from the remote as long as we page enough to verify up to R, so
>>> we're always sure the remote supplied genuine information.
>> 
>> Agreed.
>> 
>> 
>>>>>> For this reason, the idea was to save only the signature of
>>>>>> the tree's root on durable storage. The client would retrieve
>>>>>> that signature possibly at open time, and reconstruct the
>>>>>> tree at that time.
>>>>> 
>>>>> Right that's the integrity data you must have.
>>>>> 
>>>>>> Or the tree could be partially constructed on-demand at the
>>>>>> time each unit is to be checked (say, as part of 2. above).
>>>>> 
>>>>> Whether it's reconstructed or cached can be an implementation
>>>>> detail. You clearly have to reconstruct once, but whether you
>>>>> have to do it again depends on the memory available for caching
>>>>> and all the other resource calls in the system.
>>>>> 
>>>>>> The client would have to reconstruct that tree again if
>>>>>> memory pressure caused some or all of the tree to be evicted,
>>>>>> so perhaps an on-demand mechanism is preferable.
>>>>> 
>>>>> Right, but I think that's implementation detail.  Probably what
>>>>> we need is a way to get the log(N) verification hashes from the
>>>>> server and it's up to the client whether it caches them or not.
>>>> 
>>>> Agreed, these are implementation details. But see above about the
>>>> trustworthiness of the intermediate hashes. If they are conveyed
>>>> on an untrusted network, then they can't be trusted either.
>>> 
>>> Yes, they can, provided enough of them are asked for to verify.  If
>>> you look at the simple example above, suppose I have cached H11 and
>>> H12, but I've lost the entire H2X layer.  I want to verify B3 so I
>>> also ask you for your copy of H24.  Then I generate H23 from B3 and
>>> Hash H23 and H24.  If this doesn't hash to H12 I know either you
>>> supplied me the wrong block or lied about H24.  However, if it all
>>> hashes correctly I know you supplied me with both the correct B3
>>> and the correct H24.
>> 
>> My point is there is a difference between a trusted cache and an
>> untrusted cache. I argue there is not much value in a cache where
>> the hashes have to be verified again.
> 
> And my point isn't about caching, it's about where the tree comes from.
> I claim and you agree the client can get the tree from the server a
> piece at a time (because it can path verify it) and doesn't have to
> generate it itself.

OK, let's focus on where the tree comes from. It is certainly
possible to build protocol to exchange parts of a Merkle tree. The
question is how it might be stored on the server. There are some
underlying assumptions about the metadata storage mechanism that
should be stated up front.

Current forms of IMA metadata are limited in size and stored in a
container that is read and written in a single operation. If we stick
with that container format, I don't see a way to store a Merkle tree
in there for all file sizes.

Thus it seems to me that we cannot begin to consider the tree-on-the-
server model unless there is a proposed storage mechanism for that
whole tree. Otherwise, the client must have the primary role in
unpacking and verifying the tree.

Storing only the tree root in the metadata means the metadata format
is nicely bounded in size.


> How much of the tree the client has to store and
> whether the server caches, reads it in from somewhere or reconstructs
> it is an implementation detail.

Sure.


--
Chuck Lever
chucklever@gmail.com




^ permalink raw reply

* Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)
From: James Bottomley @ 2020-08-13 14:42 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Mimi Zohar, James Morris, Deven Bowers, Pavel Machek, Sasha Levin,
	snitzer, dm-devel, tyhicks, agk, Paul Moore, Jonathan Corbet,
	nramas, serge, pasha.tatashin, Jann Horn, linux-block, Al Viro,
	Jens Axboe, mdsakib, open list, eparis, linux-security-module,
	linux-audit, linux-fsdevel, linux-integrity, jaskarankhurana
In-Reply-To: <3F328A12-25DD-418B-A7D0-64DA09236E1C@gmail.com>

On Thu, 2020-08-13 at 10:21 -0400, Chuck Lever wrote:
> > On Aug 12, 2020, at 11:42 AM, James Bottomley <James.Bottomley@Hans
> > enPartnership.com> wrote:
[...]
> > For most people the security mechanism of local xattrs is
> > sufficient.  If you're paranoid, you don't believe it is and you
> > use EVM.
> 
> When IMA metadata happens to be stored in local filesystems in
> a trusted xattr, it's going to enjoy the protection you describe
> without needing the addition of a cryptographic signature.
> 
> However, that metadata doesn't live its whole life there. It
> can reside in a tar file, it can cross a network, it can live
> on a back-up tape. I think we agree that any time that metadata
> is in transit or at rest outside of a Linux local filesystem, it
> is exposed.
> 
> Thus I'm interested in a metadata protection mechanism that does
> not rely on the security characteristics of a particular storage
> container. For me, a cryptographic signature fits that bill
> nicely.

Sure, but one of the points about IMA is a separation of mechanism from
policy.  Signed hashes (called appraisal in IMA terms) is just one
policy you can decide to require or not or even make it conditional on
other things.

> > > > I think Mimi's other point is actually that IMA uses a flat
> > > > hash which we derive by reading the entire file and then
> > > > watching for mutations. Since you cannot guarantee we get
> > > > notice of mutation with NFS, the entire IMA mechanism can't
> > > > really be applied in its current form and we have to resort to
> > > > chunk at a time verifications that a Merkel tree would provide.
> > > 
> > > I'm not sure what you mean by this. An NFS client relies on
> > > notification of mutation to maintain the integrity of its cache
> > > of NFS file content, and it's done that since the 1980s.
> > 
> > Mutation detection is part of the current IMA security model.  If
> > IMA sees a file mutate it has to be rehashed the next time it
> > passes the gate.  If we can't trust the NFS server, we can't trust
> > the NFS mutation notification and we have to have a different
> > mechanism to check the file.
> 
> When an NFS server lies about mtime and ctime, then NFS is completely
> broken. Untrusted NFS server doesn't mean "broken behavior" -- I
> would think that local filesystems will have the same problem if
> they can't trust a local block device to store filesystem metadata
> like indirect blocks and timestamps.
> 
> It's not clear to me that IMA as currently implemented can protect
> against broken storage devices or incorrect filesystem behavior.

IMA doesn't really care about the storage.  The gate check will fail if
the storage corrupts the file because the hashes won't match.  The
mechanism for modification notification is the province of the
filesystem and there are definitely some which don't do it (or other fs
features) correctly and thus can't use IMA.

> > > In addition to examining a file's mtime and ctime as maintained
> > > by the NFS server, a client can rely on the file's NFSv4 change
> > > attribute or an NFSv4 delegation.
> > 
> > And that's secure in the face of a malicious or compromised server?
> > 
> > The bottom line is still, I think we can't use linear hashes with
> > an open/exec/mmap gate with NFS and we have to move to chunk at a
> > time verification like that provided by a merkel tree.
> 
> That's fine until we claim that remote filesystems require one form
> of metadata and local filesystems use some other form.
> 
> To guarantee an unbroken chain of provenance, everyone has to use the
> same portable metadata format that is signed once by the content
> creator. That's essentially why I believe the Merkle-based metadata
> format must require that the tree root is signed.

Well, no, that would be optional policy.  We should certainly support
signed head hashes and require it if the policy said so, but we
shouldn't enforce it without the policy.

Suppose I'm a cloud service provider exporting files over NFS on the
control (private) network.  I use IMA to measure untrusted tenants to
get a feel for what they're doing, but since I control the NFS server,
the client and the private network, I wouldn't feel the requirement to
have signed hashes because I trust other mechanisms for the security.

James


^ permalink raw reply

* Re: [PATCH v7 4/7] fs: Introduce O_MAYEXEC flag for openat2(2)
From: Mickaël Salaün @ 2020-08-13 14:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Aleksa Sarai, Alexei Starovoitov, Al Viro,
	Andrew Morton, Andy Lutomirski, Christian Brauner,
	Christian Heimes, Daniel Borkmann, Deven Bowers, Dmitry Vyukov,
	Eric Biggers, Eric Chiang, Florian Weimer, James Morris, Jan Kara,
	Jann Horn, Jonathan Corbet, Kees Cook, Lakshmi Ramasubramanian,
	Matthew Garrett, Matthew Wilcox, Michael Kerrisk, Mimi Zohar,
	Philippe Trébuchet, Scott Shell, Sean Christopherson,
	Shuah Khan, Steve Dower, Steve Grubb, Tetsuo Handa,
	Thibaut Sautereau, Vincent Strubel, kernel-hardening, linux-api,
	linux-integrity, linux-security-module, linux-fsdevel,
	Thibaut Sautereau
In-Reply-To: <87mu31klld.fsf@x220.int.ebiederm.org>


On 11/08/2020 21:51, Eric W. Biederman wrote:
> Mickaël Salaün <mic@digikod.net> writes:
> 
>> When the O_MAYEXEC flag is passed, openat2(2) may be subject to
>> additional restrictions depending on a security policy managed by the
>> kernel through a sysctl or implemented by an LSM thanks to the
>> inode_permission hook.  This new flag is ignored by open(2) and
>> openat(2) because of their unspecified flags handling.  When used with
>> openat2(2), the default behavior is only to forbid to open a directory.
>>
>> The underlying idea is to be able to restrict scripts interpretation
>> according to a policy defined by the system administrator.  For this to
>> be possible, script interpreters must use the O_MAYEXEC flag
>> appropriately.  To be fully effective, these interpreters also need to
>> handle the other ways to execute code: command line parameters (e.g.,
>> option -e for Perl), module loading (e.g., option -m for Python), stdin,
>> file sourcing, environment variables, configuration files, etc.
>> According to the threat model, it may be acceptable to allow some script
>> interpreters (e.g. Bash) to interpret commands from stdin, may it be a
>> TTY or a pipe, because it may not be enough to (directly) perform
>> syscalls.  Further documentation can be found in a following patch.
>>
>> Even without enforced security policy, userland interpreters can set it
>> to enforce the system policy at their level, knowing that it will not
>> break anything on running systems which do not care about this feature.
>> However, on systems which want this feature enforced, there will be
>> knowledgeable people (i.e. sysadmins who enforced O_MAYEXEC
>> deliberately) to manage it.  A simple security policy implementation,
>> configured through a dedicated sysctl, is available in a following
>> patch.
>>
>> O_MAYEXEC should not be confused with the O_EXEC flag which is intended
>> for execute-only, which obviously doesn't work for scripts.  However, a
>> similar behavior could be implemented in userland with O_PATH:
>> https://lore.kernel.org/lkml/1e2f6913-42f2-3578-28ed-567f6a4bdda1@digikod.net/
>>
>> The implementation of O_MAYEXEC almost duplicates what execve(2) and
>> uselib(2) are already doing: setting MAY_OPENEXEC in acc_mode (which can
>> then be checked as MAY_EXEC, if enforced).
> 
> You are allowing S_IFBLK, S_IFCHR, S_IFIFO, S_IFSOCK as targets for
> O_MAYEXEC?

There is a switch case for each file type (in this patch and the next one).

> 
> You are not requiring the opened script be executable?

The (conditional) enforcement is in the next patch, with the rational.

> 
> You are not requring path_noexec?  Despite the original patch that
> inspired this was checking path_noexec?

This patch just introduces the new flag and its default behavior. See
the next patch for a security policy configuration.

> 
> I honestly think this patch is buggy.  If you could reuse MAY_EXEC in
> the kernel and mean what exec means when it says MAY_EXEC that would be
> useful.

Yeah, but unfortunately this is not possible in practice because of
general Linux distro, as explained in the next patch.


> 
> As it is this patch appears wrong and dangerously confusing as it implies
> execness but does not implement execness.

Please see next patch.

> 
> If you were simply defining O_EXEC and reusing MAY_EXEC as it exists
> or exists with cleanups in the kernel this would be a small change that
> would seem to make reasonable sense.  But as you are not reusing
> anything from MAY_EXEC this code does not make any sense as I am reading
> it.

As explained in this commit message, O_EXEC doesn't have the same
semantic. Also, see next patch.

> 
> Eric
> 
> 
>> This is an updated subset of the patch initially written by Vincent
>> Strubel for CLIP OS 4:
>> https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch
>> This patch has been used for more than 12 years with customized script
>> interpreters.  Some examples (with the original O_MAYEXEC) can be found
>> here:
>> https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC
>>
>> Co-developed-by: Vincent Strubel <vincent.strubel@ssi.gouv.fr>
>> Signed-off-by: Vincent Strubel <vincent.strubel@ssi.gouv.fr>
>> Co-developed-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
>> Signed-off-by: Thibaut Sautereau <thibaut.sautereau@ssi.gouv.fr>
>> Signed-off-by: Mickaël Salaün <mic@digikod.net>
>> Cc: Aleksa Sarai <cyphar@cyphar.com>
>> Cc: Al Viro <viro@zeniv.linux.org.uk>
>> Cc: Deven Bowers <deven.desai@linux.microsoft.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> ---
>>
>> Changes since v6:
>> * Do not set __FMODE_EXEC for now because of inconsistent behavior:
>>   https://lore.kernel.org/lkml/202007160822.CCDB5478@keescook/
>> * Returns EISDIR when opening a directory with O_MAYEXEC.
>> * Removed Deven Bowers and Kees Cook Reviewed-by tags because of the
>>   current update.
>>
>> Changes since v5:
>> * Update commit message.
>>
>> Changes since v3:
>> * Switch back to O_MAYEXEC, but only handle it with openat2(2) which
>>   checks unknown flags (suggested by Aleksa Sarai). Cf.
>>   https://lore.kernel.org/lkml/20200430015429.wuob7m5ofdewubui@yavin.dot.cyphar.com/
>>
>> Changes since v2:
>> * Replace O_MAYEXEC with RESOLVE_MAYEXEC from openat2(2).  This change
>>   enables to not break existing application using bogus O_* flags that
>>   may be ignored by current kernels by using a new dedicated flag, only
>>   usable through openat2(2) (suggested by Jeff Layton).  Using this flag
>>   will results in an error if the running kernel does not support it.
>>   User space needs to manage this case, as with other RESOLVE_* flags.
>>   The best effort approach to security (for most common distros) will
>>   simply consists of ignoring such an error and retry without
>>   RESOLVE_MAYEXEC.  However, a fully controlled system may which to
>>   error out if such an inconsistency is detected.
>>
>> Changes since v1:
>> * Set __FMODE_EXEC when using O_MAYEXEC to make this information
>>   available through the new fanotify/FAN_OPEN_EXEC event (suggested by
>>   Jan Kara and Matthew Bobrowski):
>>   https://lore.kernel.org/lkml/20181213094658.GA996@lithium.mbobrowski.org/
>> ---
>>  fs/fcntl.c                       | 2 +-
>>  fs/namei.c                       | 4 ++--
>>  fs/open.c                        | 6 ++++++
>>  include/linux/fcntl.h            | 2 +-
>>  include/linux/fs.h               | 2 ++
>>  include/uapi/asm-generic/fcntl.h | 7 +++++++
>>  6 files changed, 19 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/fcntl.c b/fs/fcntl.c
>> index 2e4c0fa2074b..0357ad667563 100644
>> --- a/fs/fcntl.c
>> +++ b/fs/fcntl.c
>> @@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
>>  	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
>>  	 * is defined as O_NONBLOCK on some platforms and not on others.
>>  	 */
>> -	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
>> +	BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
>>  		HWEIGHT32(
>>  			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
>>  			__FMODE_EXEC | __FMODE_NONOTIFY));
>> diff --git a/fs/namei.c b/fs/namei.c
>> index ddc9b25540fe..3f074ec77390 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -428,7 +428,7 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask)
>>  /**
>>   * inode_permission - Check for access rights to a given inode
>>   * @inode: Inode to check permission on
>> - * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
>> + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC, %MAY_OPENEXEC)
>>   *
>>   * Check for read/write/execute permissions on an inode.  We use fs[ug]id for
>>   * this, letting us set arbitrary permissions for filesystem access without
>> @@ -2849,7 +2849,7 @@ static int may_open(const struct path *path, int acc_mode, int flag)
>>  	case S_IFLNK:
>>  		return -ELOOP;
>>  	case S_IFDIR:
>> -		if (acc_mode & (MAY_WRITE | MAY_EXEC))
>> +		if (acc_mode & (MAY_WRITE | MAY_EXEC | MAY_OPENEXEC))
>>  			return -EISDIR;
>>  		break;
>>  	case S_IFBLK:
>> diff --git a/fs/open.c b/fs/open.c
>> index 623b7506a6db..21c2c1020574 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -987,6 +987,8 @@ inline struct open_how build_open_how(int flags, umode_t mode)
>>  		.mode = mode & S_IALLUGO,
>>  	};
>>  
>> +	/* O_MAYEXEC is ignored by syscalls relying on build_open_how(). */
>> +	how.flags &= ~O_MAYEXEC;
>>  	/* O_PATH beats everything else. */
>>  	if (how.flags & O_PATH)
>>  		how.flags &= O_PATH_FLAGS;
>> @@ -1054,6 +1056,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
>>  	if (flags & __O_SYNC)
>>  		flags |= O_DSYNC;
>>  
>> +	/* Checks execution permissions on open. */
>> +	if (flags & O_MAYEXEC)
>> +		acc_mode |= MAY_OPENEXEC;
>> +
>>  	op->open_flag = flags;
>>  
>>  	/* O_TRUNC implies we need access checks for write permissions */
>> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
>> index 7bcdcf4f6ab2..e188a360fa5f 100644
>> --- a/include/linux/fcntl.h
>> +++ b/include/linux/fcntl.h
>> @@ -10,7 +10,7 @@
>>  	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
>>  	 O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
>>  	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
>> -	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
>> +	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_MAYEXEC)
>>  
>>  /* List of all valid flags for the how->upgrade_mask argument: */
>>  #define VALID_UPGRADE_FLAGS \
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index f5abba86107d..56f835c9a87a 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -101,6 +101,8 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>>  #define MAY_CHDIR		0x00000040
>>  /* called from RCU mode, don't block */
>>  #define MAY_NOT_BLOCK		0x00000080
>> +/* the inode is opened with O_MAYEXEC */
>> +#define MAY_OPENEXEC		0x00000100
>>  
>>  /*
>>   * flags in file.f_mode.  Note that FMODE_READ and FMODE_WRITE must correspond
>> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
>> index 9dc0bf0c5a6e..bca90620119f 100644
>> --- a/include/uapi/asm-generic/fcntl.h
>> +++ b/include/uapi/asm-generic/fcntl.h
>> @@ -97,6 +97,13 @@
>>  #define O_NDELAY	O_NONBLOCK
>>  #endif
>>  
>> +/*
>> + * Code execution from file is intended, checks such permission.  A simple
>> + * policy can be enforced system-wide as explained in
>> + * Documentation/admin-guide/sysctl/fs.rst .
>> + */
>> +#define O_MAYEXEC	040000000
>> +
>>  #define F_DUPFD		0	/* dup */
>>  #define F_GETFD		1	/* get close_on_exec */
>>  #define F_SETFD		2	/* set/clear close_on_exec */

^ permalink raw reply

* Re: [dm-devel] [RFC PATCH v5 00/11] Integrity Policy Enforcement LSM (IPE)
From: Chuck Lever @ 2020-08-13 14:21 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mimi Zohar, James Morris, Deven Bowers, Pavel Machek, Sasha Levin,
	snitzer, dm-devel, tyhicks, agk, Paul Moore, Jonathan Corbet,
	nramas, serge, pasha.tatashin, Jann Horn, linux-block, Al Viro,
	Jens Axboe, mdsakib, open list, eparis, linux-security-module,
	linux-audit, linux-fsdevel, linux-integrity, jaskarankhurana
In-Reply-To: <1597246946.7293.9.camel@HansenPartnership.com>



> On Aug 12, 2020, at 11:42 AM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
> 
> On Wed, 2020-08-12 at 09:56 -0400, Chuck Lever wrote:
>>> On Aug 11, 2020, at 2:28 PM, James Bottomley <James.Bottomley@Hanse
>>> nPartnership.com> wrote:
>>> 
>>> On Tue, 2020-08-11 at 10:48 -0400, Chuck Lever wrote:
>>>> Mimi's earlier point is that any IMA metadata format that
>>>> involves unsigned digests is exposed to an alteration attack at
>>>> rest or in transit, thus will not provide a robust end-to-end
>>>> integrity guarantee.
>>> 
>>> I don't believe that is Mimi's point, because it's mostly not
>>> correct: the xattr mechanism does provide this today.  The point is
>>> the mechanism we use for storing IMA hashes and signatures today is
>>> xattrs because they have robust security properties for local
>>> filesystems that the kernel enforces.  This use goes beyond IMA,
>>> selinux labels for instance use this property as well.
>> 
>> I don't buy this for a second. If storing a security label in a
>> local xattr is so secure, we wouldn't have any need for EVM.
> 
> What don't you buy?  Security xattrs can only be updated by local root.
> If you trust local root, the xattr mechanism is fine ... it's the only
> one a lot of LSMs use, for instance.  If you don't trust local root or
> worry about offline backups, you use EVM.  A thing isn't secure or
> insecure, it depends on the threat model.  However, if you don't trust
> the NFS server it doesn't matter whether you do or don't trust local
> root, you can't believe the contents of the xattr.
> 
>>> What I think you're saying is that NFS can't provide the robust
>>> security for xattrs we've been relying on, so you need some other
>>> mechanism for storing them.
>> 
>> For NFS, there's a network traversal which is an attack surface.
>> 
>> A local xattr can be attacked as well: a device or bus malfunction
>> can corrupt the content of an xattr, or a privileged user can modify
>> it.
>> 
>> How does that metadata get from the software provider to the end
>> user? It's got to go over a network, stored in various ways, some
>> of which will not be trusted. To attain an unbroken chain of
>> provenance, that metadata has to be signed.
>> 
>> I don't think the question is the storage mechanism, but rather the
>> protection mechanism. Signing the metadata protects it in all of
>> these cases.
> 
> I think we're saying about the same thing.

Roughly.


> For most people the
> security mechanism of local xattrs is sufficient.  If you're paranoid,
> you don't believe it is and you use EVM.

When IMA metadata happens to be stored in local filesystems in
a trusted xattr, it's going to enjoy the protection you describe
without needing the addition of a cryptographic signature.

However, that metadata doesn't live its whole life there. It
can reside in a tar file, it can cross a network, it can live
on a back-up tape. I think we agree that any time that metadata
is in transit or at rest outside of a Linux local filesystem, it
is exposed.

Thus I'm interested in a metadata protection mechanism that does
not rely on the security characteristics of a particular storage
container. For me, a cryptographic signature fits that bill
nicely.


>>> I think Mimi's other point is actually that IMA uses a flat hash
>>> which we derive by reading the entire file and then watching for
>>> mutations. Since you cannot guarantee we get notice of mutation
>>> with NFS, the entire IMA mechanism can't really be applied in its
>>> current form and we have to resort to chunk at a time verifications
>>> that a Merkel tree would provide.
>> 
>> I'm not sure what you mean by this. An NFS client relies on
>> notification of mutation to maintain the integrity of its cache of
>> NFS file content, and it's done that since the 1980s.
> 
> Mutation detection is part of the current IMA security model.  If IMA
> sees a file mutate it has to be rehashed the next time it passes the
> gate.  If we can't trust the NFS server, we can't trust the NFS
> mutation notification and we have to have a different mechanism to
> check the file.

When an NFS server lies about mtime and ctime, then NFS is completely
broken. Untrusted NFS server doesn't mean "broken behavior" -- I
would think that local filesystems will have the same problem if
they can't trust a local block device to store filesystem metadata
like indirect blocks and timestamps.

It's not clear to me that IMA as currently implemented can protect
against broken storage devices or incorrect filesystem behavior.


>> In addition to examining a file's mtime and ctime as maintained by
>> the NFS server, a client can rely on the file's NFSv4 change
>> attribute or an NFSv4 delegation.
> 
> And that's secure in the face of a malicious or compromised server?
> 
> The bottom line is still, I think we can't use linear hashes with an
> open/exec/mmap gate with NFS and we have to move to chunk at a time
> verification like that provided by a merkel tree.

That's fine until we claim that remote filesystems require one form of
metadata and local filesystems use some other form.

To guarantee an unbroken chain of provenance, everyone has to use the
same portable metadata format that is signed once by the content creator.
That's essentially why I believe the Merkle-based metadata format must
require that the tree root is signed.


--
Chuck Lever
chucklever@gmail.com




^ permalink raw reply

* Re: [PATCH v20 05/12] LSM: Infrastructure management of the superblock
From: Mickaël Salaün @ 2020-08-13 14:15 UTC (permalink / raw)
  To: Stephen Smalley, Casey Schaufler, Kees Cook, John Johansen
  Cc: linux-kernel, Al Viro, Andy Lutomirski, Anton Ivanov,
	Arnd Bergmann, James Morris, Jann Horn, Jeff Dike,
	Jonathan Corbet, Michael Kerrisk, Richard Weinberger,
	Serge E . Hallyn, Shuah Khan, Vincent Dagonneau, kernel-hardening,
	linux-api, linux-arch, linux-doc, linux-fsdevel, linux-kselftest,
	linux-security-module, x86
In-Reply-To: <779c290b-45f5-b86c-c573-2edb4004105d@tycho.nsa.gov>


On 12/08/2020 21:16, Stephen Smalley wrote:
> On 8/2/20 5:58 PM, Mickaël Salaün wrote:
>> From: Casey Schaufler <casey@schaufler-ca.com>
>>
>> Move management of the superblock->sb_security blob out
>> of the individual security modules and into the security
>> infrastructure. Instead of allocating the blobs from within
>> the modules the modules tell the infrastructure how much
>> space is required, and the space is allocated there.
>>
>> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
>> Reviewed-by: Kees Cook <keescook@chromium.org>
>> Reviewed-by: John Johansen <john.johansen@canonical.com>
>> Reviewed-by: Stephen Smalley <sds@tycho.nsa.gov>
>> Reviewed-by: Mickaël Salaün <mic@digikod.net>
>> Link:
>> https://lore.kernel.org/r/20190829232935.7099-2-casey@schaufler-ca.com
>> ---
>>
>> Changes since v17:
>> * Rebase the original LSM stacking patch from v5.3 to v5.7: I fixed some
>>    diff conflicts caused by code moves and function renames in
>>    selinux/include/objsec.h and selinux/hooks.c .  I checked that it
>>    builds but I didn't test the changes for SELinux nor SMACK.
> 
> You shouldn't retain Signed-off-by and Reviewed-by lines from an earlier
> patch if you made non-trivial changes to it (even more so if you didn't
> test them).

I think I made trivial changes according to the original patch. But
without reply from other people with Signed-off-by or Reviewed-by
(Casey, Kees, John), I'll remove them. I guess you don't want your
Reviewed-by to be kept, so I'll remove it, except if you want to review
this patch (or the modified part).

^ permalink raw reply

* Re: file metadata via fs API
From: Karel Zak @ 2020-08-13 10:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Whitehouse, David Howells, Miklos Szeredi, linux-fsdevel,
	Al Viro, Jeff Layton, Miklos Szeredi, Nicolas Dichtel,
	Christian Brauner, Lennart Poettering, Linux API, Ian Kent, LSM,
	Linux Kernel Mailing List
In-Reply-To: <CAHk-=wgz5H-xYG4bOrHaEtY7rvFA1_6+mTSpjrgK8OsNbfF+Pw@mail.gmail.com>

On Wed, Aug 12, 2020 at 12:50:28PM -0700, Linus Torvalds wrote:
> Convince me otherwise. AGAIN. This is the exact same issue I had with
> the notification queues that I really wanted actual use-cases for, and
> feedback from actual outside users.

I thought (in last 10 years) we all agree that /proc/self/mountinfo is
the expensive, ineffective and fragile way how to deliver information to
userspace.

We have systems with thousands of mountpoints and compose mountinfo in
kernel and again parse it in userspace takes time and it's strange if
you need info about just one mountpoint. 

Unfortunately, the same systems modify the huge mount table extremely
often, because it starts/stops large number of containers and every 
container means a mount operation(s).

In this crazy environment, we have userspace tools like systemd or udisk 
which react to VFS changes and there is no elegant way how to get
details about a modified mount node from kernel.

And of course we already have negative feedback from users who
maintain large systems -- mountinfo returns inconsistent data if you
read it by more read() calls (hopefully fixed by recent Miklos'
mountinfo cursors); system is pretty busy to compose+parse mountinfo,
etc.

> I really think this is engineering for its own sake, rather than
> responding to actual user concerns.

We're too old and too lazy for "engineering for its own sake" :-)
there is pressure from users ...

Maybe David's fsinfo() sucks, but it does not mean that
/proc/self/mountinfo is something cool. Right?

We have to dig deep grave for /proc/self/mountinfo ...

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com


^ permalink raw reply

* Re: file metadata via fs API
From: Karel Zak @ 2020-08-13  8:52 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Steven Whitehouse, David Howells, Linus Torvalds, linux-fsdevel,
	Al Viro, Jeff Layton, Miklos Szeredi, Nicolas Dichtel,
	Christian Brauner, Lennart Poettering, Linux API, Ian Kent, LSM,
	Linux Kernel Mailing List
In-Reply-To: <CAJfpegv4sC2zm+N5tvEmYaEFvvWJRHfdGqXUoBzbeKj81uNCvQ@mail.gmail.com>

On Wed, Aug 12, 2020 at 02:43:32PM +0200, Miklos Szeredi wrote:
> On Wed, Aug 12, 2020 at 1:28 PM Karel Zak <kzak@redhat.com> wrote:
> 
> > The proposal is based on paths and open(), how do you plan to deal
> > with mount IDs? David's fsinfo() allows to ask for mount info by mount
> > ID and it works well with mount notification where you get the ID. The
> > collaboration with notification interface is critical for our use-cases.
> 
> One would use the notification to keep an up to date set of attributes
> for each watched mount, right?
> 
> That presumably means the mount ID <-> mount path mapping already
> exists, which means it's just possible to use the open(mount_path,
> O_PATH) to obtain the base fd.

The notification also reports new mount nodes, so we have no mount ID
<-> mount path mapping in userspace yet.

The another problem is that open(path) cannot be used if you have multiple
filesystems on the same mount point -- in this case (at least in theory)
you can get ID for by-path inaccessible filesystem.

> A new syscall that returns an fd pointing to the root of the mount
> might be the best solution:
> 
>    int open_mount(int root_fd, u64 mntid, int flags);

Yes, something like this is necessary. You do not want to depend
on paths if you want to read information about mountpoints.

 Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com


^ permalink raw reply

* Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
From: Jeffrey E Altman @ 2020-08-13  3:53 UTC (permalink / raw)
  To: Linus Torvalds (torvalds@linux-foundation.org), David Howells
  Cc: Miklos Szeredi, linux-fsdevel, Al Viro, Karel Zak, Jeff Layton,
	Miklos Szeredi, Nicolas Dichtel, Christian Brauner,
	Lennart Poettering, Linux API, Ian Kent, LSM,
	Linux Kernel Mailing List
In-Reply-To: <CAHk-=wiPx0UJ6Q1X=azwz32xrSeKnTJcH8enySwuuwnGKkHoPA@mail.gmail.com>


[-- Attachment #1.1: Type: text/plain, Size: 7195 bytes --]

On 8/12/2020 2:18 PM, Linus Torvalds (torvalds@linux-foundation.org) wrote:
> What's wrong with fstatfs()? All the extra magic metadata seems to not
> really be anything people really care about.
> 
> What people are actually asking for seems to be some unique mount ID,
> and we have 16 bytes of spare information in 'struct statfs64'.
> 
> All the other fancy fsinfo stuff seems to be "just because", and like
> complete overdesign.

Hi Linus,

Is there any existing method by which userland applications can
determine the properties of the filesystem in which a directory or file
is stored in a filesystem agnostic manner?

Over the past year I've observed the opendev/openstack community
struggle with performance issues caused by rsync's inability to
determine if the source and destination object's last update time have
the same resolution and valid time range.  If the source file system
supports 100 nanosecond granularity and the destination file system
supports one second granularity, any source file with a non-zero
fractional seconds timestamp will appear to have changed compared to the
copy in the destination filesystem which discarded the fractional
seconds during the last sync.  Sure, the end user could use the
--modify-window=1 option to inform rsync to add fuzz to the comparisons,
but that introduces the possibility that a file updated a fraction of a
second after an rsync execution would not synchronize the file on the
next run when both source and target have fine grained timestamps.  If
the userland sync processes have access to the source and destination
filesystem time capabilities, they can make more intelligent decisions
without explicit user input.  At a minimum, the timestamp properties
that are important to know include the range of valid timestamps and the
resolution.  Some filesystems support unsigned 32-bit time starting with
UNIX epoch.  Others signed 32-bit time with UNIX epoch.  Still others
FAT, NTFS, etc use alternative epochs and range and resolutions.

Another case where lack of filesystem properties is problematic is "df
--local" which currently relies upon string comparisons of file system
name strings to determine if the underlying file system is local or
remote.  This requires that the gnulib maintainers have knowledge of all
file systems implementations, their published names, and which category
they belong to.  Patches have been accepted in the past year to add
"smb3", "afs", and "gpfs" to the list of remote file systems.  There are
many more remote filesystems that have yet to be added including
"cephfs", "lustre", "gluster", etc.

In many cases, the filesystem properties cannot be inferred from the
filesystem name.  For network file systems, these properties might
depend upon the remote server capabilities or even the properties
associated with a particular volume or share.  Consider the case of a
remote file server that supports 64-bit 100ns time but which for
backward compatibility exports certain volumes or shares with more
restrictive capabilities. Or the case of a network file system protocol
that has evolved over time and gained new capabilities.

For the AFS community, fsinfo offers a method of exposing some server
and volume properties that are obtained via "path ioctls" in OpenAFS and
AuriStorFS.  Some example of properties that might be exposed include
answers to questions such as:

 * what is the volume cell id? perhaps a uuid.
 * what is the volume id in the cell? unsigned 64-bit integer
 * where is a mounted volume hosted? which fileservers, named by uuid
 * what is the block size? 1K, 4K, ...
 * how many blocks are in use or available?
 * what is the quota (thin provisioning), if any?
 * what is the reserved space (fat provisioning), if any?
 * how many vnodes are present?
 * what is the vnode count limit, if any?
 * when was the volume created and last updated?
 * what is the file size limit?
 * are byte range locks supported?
 * are mandatory locks supported?
 * how many entries can be created within a directory?
 * are cross-directory hard links supported?
 * are directories just-send-8, case-sensitive, case-preserving, or
   case-insensitive?
 * if not just-send-8, what character set is used?
 * if Unicode, what normalization rules? etc.
 * are per-object acls supported?
 * what volume maximum acl is assigned, if any?
 * what volume security policy (authn, integ, priv) is assigned, if any?
 * what is the replication policy, if any?
 * what is the volume encryption policy, if any?
 * what is the volume compression policy, if any?
 * are server-to-server copies supported?
 * which of atime, ctime and mtime does the volume support?
 * what is the permitted timestamp range and resolution?
 * are xattrs supported?
 * what is the xattr maximum name length?
 * what is the xattr maximum object size?
 * is the volume currently reachable?
 * is the volume immutable?
 * etc ...

Its true that there isn't widespread use of these filesystem properties
by today's userland applications but that might be due to the lack of
standard interfaces necessary to acquire the information.  For example,
userland frameworks for parallel i/o HPC applications such as HDF5,
PnetCDF and ROMIO require each supported filesystem to provide its own
proprietary "driver" which does little more than expose the filesystem
properties necessary to optimize the layout of file stream data
structures.  With something like "fsinfo" it would be much easier to
develop these HPC frameworks in a filesystem agnostic manner.  This
would permit applications built upon these frameworks to use the best
Linux filesystem available for the workload and not simply the ones for
which proprietary "drivers" have been published.

Although I am sympathetic to the voices in the community that would
prefer to start over with a different architectural approach, David's
fsinfo has been under development for more than two years.  It has not
been developed in a vacuum but in parallel with other kernel components
that have been merged during that time frame.  From my reading of this
thread and those that preceded it, fsinfo has also been developed with
input from significant userland development communities that intend to
leverage the syscall interface as soon as it becomes available.  The
March 2020 discussion of fsinfo received positive feedback not only from
within Red Hat but from other parties as well.

Since no one stepped up to provide an alternative approach in the last
five months, how long should those that desire access to the
functionality be expected to wait for it?

What is the likelihood that an alternative robust solution will be
available in the next merge window or two?

Is the design so horrid that it is better to go without the
functionality than to live with the imperfections?

I for one would like to see this functionality be made available sooner
rather than later.  I know my end users would benefit from the
availability of fsinfo.

Thank you for listening.  Stay healthy and safe, and please wear a mask.

Jeffrey Altman


[-- Attachment #1.2: jaltman.vcf --]
[-- Type: text/x-vcard, Size: 283 bytes --]

begin:vcard
fn:Jeffrey Altman
n:Altman;Jeffrey
org:AuriStor, Inc.
adr:;;255 W 94TH ST STE 6B;New York;NY;10025-6985;United States
email;internet:jaltman@auristor.com
title:CEO
tel;work:+1-212-769-9018
url:https://www.linkedin.com/in/jeffreyaltman/
version:2.1
end:vcard


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4033 bytes --]

^ permalink raw reply

* Re: file metadata via fs API
From: Ian Kent @ 2020-08-13  3:44 UTC (permalink / raw)
  To: Linus Torvalds, Steven Whitehouse
  Cc: David Howells, Miklos Szeredi, linux-fsdevel, Al Viro, Karel Zak,
	Jeff Layton, Miklos Szeredi, Nicolas Dichtel, Christian Brauner,
	Lennart Poettering, Linux API, LSM, Linux Kernel Mailing List
In-Reply-To: <CAHk-=wgz5H-xYG4bOrHaEtY7rvFA1_6+mTSpjrgK8OsNbfF+Pw@mail.gmail.com>

On Wed, 2020-08-12 at 12:50 -0700, Linus Torvalds wrote:
> On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse <
> swhiteho@redhat.com> wrote:
> > The point of this is to give us the ability to monitor mounts from
> > userspace.
> 
> We haven't had that before, I don't see why it's suddenly such a big
> deal.

Because there's a trend occurring in user space where there are
frequent and persistent mount changes that cause high overhead.

I've seen the number of problems building up over the last few months
that are essentially the same problem that I wanted to resolve. And
that's related to side effects of autofs using a large number of
mounts.

The problems are real.

> 
> The notification side I understand. Polling /proc files is not the
> answer.

Yep, that's one aspect, getting the information about a mount without
reading the entire mount table seems like the sensible thing to do to
allow for a more efficient notification mechanism.

> 
> But the whole "let's design this crazy subsystem for it" seems way
> overkill. I don't see anybody caring that deeply.
> 
> It really smells like "do it because we can, not because we must".
> 
> Who the hell cares about monitoring mounts at a kHz frequencies? If
> this is for MIS use, you want a nice GUI and not wasting CPU time
> polling.

That part of the problem still remains.

The kernel sending a continuous stream of wake ups under load does
also introduce a resource problem but that's probably something to
handle in user space.

> 
> I'm starting to ignore the pull requests from David Howells, because
> by now they have had the same pattern for a couple of years now:
> esoteric new interfaces that seem overdesigned for corner-cases that
> I'm not seeing people clamoring for.
> 
> I need (a) proof this is actualyl something real users care about and
> (b) way more open discussion and implementation from multiple
> parties.
> 
> Because right now it looks like a small in-cabal of a couple of
> people
> who have wild ideas but I'm not seeing the wider use of it.
> 
> Convince me otherwise. AGAIN. This is the exact same issue I had with
> the notification queues that I really wanted actual use-cases for,
> and
> feedback from actual outside users.
> 
> I really think this is engineering for its own sake, rather than
> responding to actual user concerns.
> 
>                Linus


^ permalink raw reply

* Re: file metadata via fs API
From: Ian Kent @ 2020-08-13  1:01 UTC (permalink / raw)
  To: David Howells, Miklos Szeredi
  Cc: Karel Zak, Steven Whitehouse, Linus Torvalds, linux-fsdevel,
	Al Viro, Jeff Layton, Miklos Szeredi, Nicolas Dichtel,
	Christian Brauner, Lennart Poettering, Linux API, LSM,
	Linux Kernel Mailing List
In-Reply-To: <131358.1597237585@warthog.procyon.org.uk>

On Wed, 2020-08-12 at 14:06 +0100, David Howells wrote:
> Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > That presumably means the mount ID <-> mount path mapping already
> > exists, which means it's just possible to use the open(mount_path,
> > O_PATH) to obtain the base fd.
> 
> No, you can't.  A path more correspond to multiple mounts stacked on
> top of
> each other, e.g.:
> 
> 	mount -t tmpfs none /mnt
> 	mount -t tmpfs none /mnt
> 	mount -t tmpfs none /mnt
> 
> Now you have three co-located mounts and you can't use the path to
> differentiate them.  I think this might be an issue in autofs, but
> Ian would
> need to comment on that.

It is a problem for autofs, direct mounts in particular, but also
for mount ordering at times when umounting a tree of mounts where
mounts are covered or at shutdown.

Ian


^ permalink raw reply

* Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)
From: Al Viro @ 2020-08-12 21:30 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linus Torvalds, Jann Horn, Casey Schaufler, Andy Lutomirski,
	linux-fsdevel, David Howells, Karel Zak, Jeff Layton,
	Miklos Szeredi, Nicolas Dichtel, Christian Brauner,
	Lennart Poettering, Linux API, Ian Kent, LSM,
	Linux Kernel Mailing List
In-Reply-To: <20200812183326.GU1236603@ZenIV.linux.org.uk>

On Wed, Aug 12, 2020 at 07:33:26PM +0100, Al Viro wrote:

> BTW, what would such opened files look like from /proc/*/fd/* POV?  And
> what would happen if you walk _through_ that symlink, with e.g. ".."
> following it?  Or with names of those attributes, for that matter...
> What about a normal open() of such a sucker?  It won't know where to
> look for your ->private_data...
> 
> FWIW, you keep refering to regularity of this stuff from the syscall
> POV, but it looks like you have no real idea of what subset of the
> things available for normal descriptors will be available for those.

Another question: what should happen with that sucker on umount of
the filesystem holding the underlying object?  Should it be counted
as pinning that fs?

Who controls what's in that tree?  If we plan to have xattrs there,
will they be in a flat tree, or should it mirror the hierarchy of
xattrs?  When is it populated?  open() time?  What happens if we
add/remove an xattr after that point?

If we open the same file several times, what should we get?  A full
copy of the tree every time, with all coherency being up to whatever's
putting attributes there?

What are the permissions needed to do lookups in that thing?

All of that is about semantics and the answers are needed before we
start looking into implementations.  "Whatever my implementation
does" is _not_ a good way to go, especially since that'll be cast
in stone as soon as API becomes exposed to userland...

^ permalink raw reply

* Re: [PATCH v6 0/4] LSM: Measure security module data
From: Lakshmi Ramasubramanian @ 2020-08-12 20:37 UTC (permalink / raw)
  To: Casey Schaufler, Mimi Zohar, Tyler Hicks
  Cc: stephen.smalley.work, sashal, jmorris, linux-integrity, selinux,
	linux-security-module, linux-kernel
In-Reply-To: <5570a4d8-8779-6efe-b208-f7efa8ba9488@schaufler-ca.com>

On 8/5/20 11:25 AM, Casey Schaufler wrote:

>>>>>
>>>>> I think moving away from the idea that measuring "critical" data should
>>>>> be limited to LSMs, will clarify this.
>>>>>
>>>>
>>>> Are you suggesting that instead of calling the hooks LSM_STATE and LSM_POLICY, we should keep it more generic so that it can be utilized by any subsystem to measure their "critical data"?
>>>
>>> Policy, state, history or whim, it should be up to the security module
>>> to determine what data it cares about, and how it should be measured.
>>> Smack does not keep its policy in a single blob of data, it uses lists
>>> which can be modified at will. Your definition of the behavior for
>>> LSM_POLICY wouldn't work for Smack. That doesn't mean that there isn't
>>> a viable way to measure the Smack policy, it just means that IMA isn't
>>> the place for it. If SELinux wants its data measured, SELinux should be
>>> providing the mechanism to do it.
>>>
>>> I guess that I'm agreeing with you in part. If you want a generic measurement
>>> of "critical data", you don't need to assign a type to it, you have the
>>> caller (a security module, a device driver or whatever) identify itself and
>>> how it is going to deal with the data. That's very different from what you've
>>> done to date.
>>
>> Agree.
>>
>> Like Stephen had stated earlier, the reason we kept separate hooks (STATE and POLICY) is because the data for state is usually small and therefore we measure the entire data. Whereas, policy data is usually quite large (a few MB) and hence we measure a hash of the policy.
> 
> SELinux should determine how it wants its data measured.
> SELinux, not IMA, should determine if some "critical data"
> be measured in total, by its hash or as a count of policy
> rules. It would be handy for IMA to supply functions to do
> data blobs and hashes, but it should be up to the caller to
> decide if they meet their needs.
> 

Per feedback from you all, my colleague Tushar has posted a patch series 
that defines a generic IMA hook to measure critical data from other 
subsystems (such as SELinux, AppArmor, Device-Mapper targets, etc.)

Link to the patch series is given below:

	https://patchwork.kernel.org/patch/11711249/

Shortly I will re-post the SELinux state and hash of policy measurement 
patch that will be based on the above patch series.

thanks,
  -lakshmi

^ permalink raw reply

* Re: file metadata via fs API
From: Linus Torvalds @ 2020-08-12 19:50 UTC (permalink / raw)
  To: Steven Whitehouse
  Cc: David Howells, Miklos Szeredi, linux-fsdevel, Al Viro, Karel Zak,
	Jeff Layton, Miklos Szeredi, Nicolas Dichtel, Christian Brauner,
	Lennart Poettering, Linux API, Ian Kent, LSM,
	Linux Kernel Mailing List
In-Reply-To: <066f9aaf-ee97-46db-022f-5d007f9e6edb@redhat.com>

On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse <swhiteho@redhat.com> wrote:
>
> The point of this is to give us the ability to monitor mounts from
> userspace.

We haven't had that before, I don't see why it's suddenly such a big deal.

The notification side I understand. Polling /proc files is not the answer.

But the whole "let's design this crazy subsystem for it" seems way
overkill. I don't see anybody caring that deeply.

It really smells like "do it because we can, not because we must".

Who the hell cares about monitoring mounts at a kHz frequencies? If
this is for MIS use, you want a nice GUI and not wasting CPU time
polling.

I'm starting to ignore the pull requests from David Howells, because
by now they have had the same pattern for a couple of years now:
esoteric new interfaces that seem overdesigned for corner-cases that
I'm not seeing people clamoring for.

I need (a) proof this is actualyl something real users care about and
(b) way more open discussion and implementation from multiple parties.

Because right now it looks like a small in-cabal of a couple of people
who have wild ideas but I'm not seeing the wider use of it.

Convince me otherwise. AGAIN. This is the exact same issue I had with
the notification queues that I really wanted actual use-cases for, and
feedback from actual outside users.

I really think this is engineering for its own sake, rather than
responding to actual user concerns.

               Linus

^ permalink raw reply

* Re: file metadata via fs API
From: Steven Whitehouse @ 2020-08-12 19:34 UTC (permalink / raw)
  To: Linus Torvalds, David Howells
  Cc: Miklos Szeredi, linux-fsdevel, Al Viro, Karel Zak, Jeff Layton,
	Miklos Szeredi, Nicolas Dichtel, Christian Brauner,
	Lennart Poettering, Linux API, Ian Kent, LSM,
	Linux Kernel Mailing List
In-Reply-To: <CAHk-=wiPx0UJ6Q1X=azwz32xrSeKnTJcH8enySwuuwnGKkHoPA@mail.gmail.com>

Hi,

On 12/08/2020 19:18, Linus Torvalds wrote:
> On Tue, Aug 11, 2020 at 5:05 PM David Howells <dhowells@redhat.com> wrote:
>> Well, the start of it was my proposal of an fsinfo() system call.
> Ugh. Ok, it's that thing.
>
> This all seems *WAY* over-designed - both your fsinfo and Miklos' version.
>
> What's wrong with fstatfs()? All the extra magic metadata seems to not
> really be anything people really care about.
>
> What people are actually asking for seems to be some unique mount ID,
> and we have 16 bytes of spare information in 'struct statfs64'.
>
> All the other fancy fsinfo stuff seems to be "just because", and like
> complete overdesign.
>
> Let's not add system calls just because we can.
>
>               Linus
>

The point of this is to give us the ability to monitor mounts from 
userspace. The original inspiration was rtnetlink, in that we need a 
"dump" operation to give us a snapshot of the current mount state, plus 
then a stream of events which allow us to keep that state updated. The 
tricky question is what happens in case of overflow of the events queue, 
and just like netlink, that needs a resync of the current state to fix 
that, since we can't block mounts, of course.

The fsinfo syscall was designed to be the "dump" operation in this 
system. David's other patch set provides the stream of events. So the 
two are designed to work together. We had the discussion on using 
netlink, of whatever form a while back, and there are a number of 
reasons why that doesn't work (namespace being one).

I think fstatfs might also suffer from the issue of not being easy to 
call on things for which you have no path (e.g. over-mounted mounts) 
Plus we need to know which paths to query, which is why we need to 
enumerate the mounts in the first place - how would we get the fds for 
each mount? It might give you some sb info, but it doesn't tell you the 
options that the sb is mounted with, and it doesn't tell you where it is 
mounted either.

The overall aim is to solve some issues relating to scaling to large 
numbers of mount in systemd and autofs, and also to provide a 
generically useful interface that other tools may use to monitor mounts 
in due course too. Currently parsing /proc/mounts is the only option, 
and that tends to be slow and is certainly not atomic. Extension to 
other sb related messages is a future goal, quota being one possible 
application for the notifications.

If there is a simpler way to get to that goal, then thats all to the 
good, and we should definitely consider it,

Steve.





^ permalink raw reply

* [PATCH 2/3] IMA: add policy to support measuring critical data from kernel components
From: Tushar Sugandhi @ 2020-08-12 19:31 UTC (permalink / raw)
  To: zohar, stephen.smalley.work, casey, gmazyland
  Cc: tyhicks, sashal, jmorris, linux-integrity, selinux,
	linux-security-module, linux-kernel, dm-devel, nramas
In-Reply-To: <20200812193102.18636-1-tusharsu@linux.microsoft.com>

There would be several candidate kernel components suitable for IMA
measurement. Not all of them would be enlightened for IMA measurement.
Also, system administrators may not want to measure data for all of
them, even when they are enlightened for IMA measurements. An IMA policy
specific to various kernel components is needed to measure their
respective critical data.

Add a new IMA policy CRITICAL_DATA+data_sources to support measuring
various critical kernel components. This policy would enable the
system administrators to limit the measurement to the components,
if the components are enlightened for IMA measurement.

Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
---
 Documentation/ABI/testing/ima_policy |  6 +-
 security/integrity/ima/ima.h         |  1 +
 security/integrity/ima/ima_api.c     |  2 +-
 security/integrity/ima/ima_policy.c  | 84 ++++++++++++++++++++++------
 4 files changed, 74 insertions(+), 19 deletions(-)

diff --git a/Documentation/ABI/testing/ima_policy b/Documentation/ABI/testing/ima_policy
index cd572912c593..a0dd0f108555 100644
--- a/Documentation/ABI/testing/ima_policy
+++ b/Documentation/ABI/testing/ima_policy
@@ -29,7 +29,7 @@ Description:
 		base: 	func:= [BPRM_CHECK][MMAP_CHECK][CREDS_CHECK][FILE_CHECK][MODULE_CHECK]
 				[FIRMWARE_CHECK]
 				[KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK]
-				[KEXEC_CMDLINE] [KEY_CHECK]
+				[KEXEC_CMDLINE] [KEY_CHECK] [CRITICAL_DATA]
 			mask:= [[^]MAY_READ] [[^]MAY_WRITE] [[^]MAY_APPEND]
 			       [[^]MAY_EXEC]
 			fsmagic:= hex value
@@ -125,3 +125,7 @@ Description:
 		keys added to .builtin_trusted_keys or .ima keyring:
 
 			measure func=KEY_CHECK keyrings=.builtin_trusted_keys|.ima
+
+		Example of measure rule using CRITICAL_DATA to measure critical data
+
+			measure func=CRITICAL_DATA data_sources=selinux|apparmor|dm-crypt
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index e2a151d6653d..99773dfa2541 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -200,6 +200,7 @@ static inline unsigned int ima_hash_key(u8 *digest)
 	hook(POLICY_CHECK, policy)			\
 	hook(KEXEC_CMDLINE, kexec_cmdline)		\
 	hook(KEY_CHECK, key)				\
+	hook(CRITICAL_DATA, critical_data)		\
 	hook(MAX_CHECK, none)
 
 #define __ima_hook_enumify(ENUM, str)	ENUM,
diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index af218babd198..9917e1730cb6 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -176,7 +176,7 @@ void ima_add_violation(struct file *file, const unsigned char *filename,
  *		subj=, obj=, type=, func=, mask=, fsmagic=
  *	subj,obj, and type: are LSM specific.
  *	func: FILE_CHECK | BPRM_CHECK | CREDS_CHECK | MMAP_CHECK | MODULE_CHECK
- *	| KEXEC_CMDLINE | KEY_CHECK
+ *	| KEXEC_CMDLINE | KEY_CHECK | CRITICAL_DATA
  *	mask: contains the permission mask
  *	fsmagic: hex value
  *
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index 4efaf8956eb8..8451ccb2a775 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -22,17 +22,18 @@
 #include "ima.h"
 
 /* flags definitions */
-#define IMA_FUNC	0x0001
-#define IMA_MASK	0x0002
-#define IMA_FSMAGIC	0x0004
-#define IMA_UID		0x0008
-#define IMA_FOWNER	0x0010
-#define IMA_FSUUID	0x0020
-#define IMA_INMASK	0x0040
-#define IMA_EUID	0x0080
-#define IMA_PCR		0x0100
-#define IMA_FSNAME	0x0200
-#define IMA_KEYRINGS	0x0400
+#define IMA_FUNC		0x0001
+#define IMA_MASK		0x0002
+#define IMA_FSMAGIC		0x0004
+#define IMA_UID			0x0008
+#define IMA_FOWNER		0x0010
+#define IMA_FSUUID		0x0020
+#define IMA_INMASK		0x0040
+#define IMA_EUID		0x0080
+#define IMA_PCR			0x0100
+#define IMA_FSNAME		0x0200
+#define IMA_KEYRINGS		0x0400
+#define IMA_DATA_SOURCES	0x0800
 
 #define UNKNOWN		0
 #define MEASURE		0x0001	/* same as IMA_MEASURE */
@@ -84,6 +85,7 @@ struct ima_rule_entry {
 	} lsm[MAX_LSM_RULES];
 	char *fsname;
 	struct ima_rule_opt_list *keyrings; /* Measure keys added to these keyrings */
+	struct ima_rule_opt_list *data_sources; /* Measure data from these sources */
 	struct ima_template_desc *template;
 };
 
@@ -508,14 +510,23 @@ static bool ima_match_rules(struct ima_rule_entry *rule, struct inode *inode,
 {
 	int i;
 
-	if (func == KEY_CHECK) {
-		return (rule->flags & IMA_FUNC) && (rule->func == func) &&
-		       ima_match_rule_data(rule, rule->keyrings, func_data,
-					   true, cred);
-	}
 	if ((rule->flags & IMA_FUNC) &&
 	    (rule->func != func && func != POST_SETATTR))
 		return false;
+
+	switch (func) {
+	case KEY_CHECK:
+		return ((rule->func == func) &&
+			ima_match_rule_data(rule, rule->keyrings,
+					    func_data, true, cred));
+	case CRITICAL_DATA:
+		return ((rule->func == func) &&
+			ima_match_rule_data(rule, rule->data_sources,
+					    func_data, false, cred));
+	default:
+		break;
+	}
+
 	if ((rule->flags & IMA_MASK) &&
 	    (rule->mask != mask && func != POST_SETATTR))
 		return false;
@@ -911,7 +922,7 @@ enum {
 	Opt_uid_lt, Opt_euid_lt, Opt_fowner_lt,
 	Opt_appraise_type, Opt_appraise_flag,
 	Opt_permit_directio, Opt_pcr, Opt_template, Opt_keyrings,
-	Opt_err
+	Opt_data_sources, Opt_err
 };
 
 static const match_table_t policy_tokens = {
@@ -948,6 +959,7 @@ static const match_table_t policy_tokens = {
 	{Opt_pcr, "pcr=%s"},
 	{Opt_template, "template=%s"},
 	{Opt_keyrings, "keyrings=%s"},
+	{Opt_data_sources, "data_sources=%s"},
 	{Opt_err, NULL}
 };
 
@@ -1110,6 +1122,19 @@ static bool ima_validate_rule(struct ima_rule_entry *entry)
 		if (ima_rule_contains_lsm_cond(entry))
 			return false;
 
+		break;
+	case CRITICAL_DATA:
+		if (entry->action & ~(MEASURE | DONT_MEASURE))
+			return false;
+
+		if (!(entry->flags & IMA_DATA_SOURCES) ||
+		    (entry->flags & ~(IMA_FUNC | IMA_UID | IMA_PCR |
+		    IMA_DATA_SOURCES)))
+			return false;
+
+		if (ima_rule_contains_lsm_cond(entry))
+			return false;
+
 		break;
 	default:
 		return false;
@@ -1242,6 +1267,8 @@ static int ima_parse_rule(char *rule, struct ima_rule_entry *entry)
 			else if (IS_ENABLED(CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS) &&
 				 strcmp(args[0].from, "KEY_CHECK") == 0)
 				entry->func = KEY_CHECK;
+			else if (strcmp(args[0].from, "CRITICAL_DATA") == 0)
+				entry->func = CRITICAL_DATA;
 			else
 				result = -EINVAL;
 			if (!result)
@@ -1312,6 +1339,23 @@ static int ima_parse_rule(char *rule, struct ima_rule_entry *entry)
 
 			entry->flags |= IMA_KEYRINGS;
 			break;
+		case Opt_data_sources:
+			ima_log_string(ab, "data_sources", args[0].from);
+
+			if (entry->data_sources) {
+				result = -EINVAL;
+				break;
+			}
+
+			entry->data_sources = ima_alloc_rule_opt_list(args);
+			if (IS_ERR(entry->data_sources)) {
+				result = PTR_ERR(entry->data_sources);
+				entry->data_sources = NULL;
+				break;
+			}
+
+			entry->flags |= IMA_DATA_SOURCES;
+			break;
 		case Opt_fsuuid:
 			ima_log_string(ab, "fsuuid", args[0].from);
 
@@ -1692,6 +1736,12 @@ int ima_policy_show(struct seq_file *m, void *v)
 		seq_puts(m, " ");
 	}
 
+	if (entry->flags & IMA_DATA_SOURCES) {
+		seq_puts(m, "data_sources=");
+		ima_show_rule_opt_list(m, entry->data_sources);
+		seq_puts(m, " ");
+	}
+
 	if (entry->flags & IMA_PCR) {
 		snprintf(tbuf, sizeof(tbuf), "%d", entry->pcr);
 		seq_printf(m, pt(Opt_pcr), tbuf);
-- 
2.17.1


^ permalink raw reply related

* [PATCH 3/3] IMA: define IMA hook to measure critical data from kernel components
From: Tushar Sugandhi @ 2020-08-12 19:31 UTC (permalink / raw)
  To: zohar, stephen.smalley.work, casey, gmazyland
  Cc: tyhicks, sashal, jmorris, linux-integrity, selinux,
	linux-security-module, linux-kernel, dm-devel, nramas
In-Reply-To: <20200812193102.18636-1-tusharsu@linux.microsoft.com>

Currently, IMA does not provide a generic function to kernel components
to measure their data. A generic function provided by IMA would
enable various parts of the kernel with easier and faster on-boarding to
use IMA infrastructure, would avoid code duplication, and consistent
usage of IMA policy CRITICAL_DATA+data_sources across the kernel.

Define a generic IMA function ima_measure_critical_data() to measure
data from various kernel components. Limit the measurement to the
components that are specified in the IMA policy - 
CRITICAL_DATA+data_sources.
Update process_buffer_measurement() to return the status code of the
operation.

Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
---
 include/linux/ima.h               |  9 ++++++++
 security/integrity/ima/ima.h      |  8 +++----
 security/integrity/ima/ima_main.c | 37 ++++++++++++++++++++++++-------
 3 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/include/linux/ima.h b/include/linux/ima.h
index d15100de6cdd..865332ecedcb 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -26,6 +26,9 @@ extern int ima_post_read_file(struct file *file, void *buf, loff_t size,
 extern void ima_post_path_mknod(struct dentry *dentry);
 extern int ima_file_hash(struct file *file, char *buf, size_t buf_size);
 extern void ima_kexec_cmdline(int kernel_fd, const void *buf, int size);
+extern int ima_measure_critical_data(const char *event_name,
+				     const char *event_data_source,
+				     const void *buf, int buf_len);
 
 #ifdef CONFIG_IMA_KEXEC
 extern void ima_add_kexec_buffer(struct kimage *image);
@@ -104,6 +107,12 @@ static inline int ima_file_hash(struct file *file, char *buf, size_t buf_size)
 }
 
 static inline void ima_kexec_cmdline(int kernel_fd, const void *buf, int size) {}
+static inline int ima_measure_critical_data(const char *event_name,
+					    const char *event_data_source,
+					    const void *buf, int buf_len)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_IMA */
 
 #ifndef CONFIG_IMA_KEXEC
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 99773dfa2541..e65ab067e700 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -266,10 +266,10 @@ void ima_store_measurement(struct integrity_iint_cache *iint, struct file *file,
 			   struct evm_ima_xattr_data *xattr_value,
 			   int xattr_len, const struct modsig *modsig, int pcr,
 			   struct ima_template_desc *template_desc);
-void process_buffer_measurement(struct inode *inode, const void *buf,
-				int buf_len, const char *eventname,
-				enum ima_hooks func, int pcr,
-				const char *func_data);
+int process_buffer_measurement(struct inode *inode, const void *buf,
+			       int buf_len, const char *eventname,
+			       enum ima_hooks func, int pcr,
+			       const char *func_data);
 void ima_audit_measurement(struct integrity_iint_cache *iint,
 			   const unsigned char *filename);
 int ima_alloc_init_template(struct ima_event_data *event_data,
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index a8740b7ea417..129bcaaf13e2 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -736,10 +736,11 @@ int ima_load_data(enum kernel_load_data_id id)
  *
  * Based on policy, the buffer is measured into the ima log.
  */
-void process_buffer_measurement(struct inode *inode, const void *buf,
-				int buf_len, const char *eventname,
-				enum ima_hooks func, int pcr,
-				const char *func_data)
+
+int process_buffer_measurement(struct inode *inode, const void *buf,
+			       int buf_len, const char *eventname,
+			       enum ima_hooks func, int pcr,
+			       const char *func_data)
 {
 	int ret = 0;
 	const char *audit_cause = "ENOMEM";
@@ -759,7 +760,7 @@ void process_buffer_measurement(struct inode *inode, const void *buf,
 	u32 secid;
 
 	if (!ima_policy_flag)
-		return;
+		return 0;
 
 	/*
 	 * Both LSM hooks and auxilary based buffer measurements are
@@ -773,7 +774,7 @@ void process_buffer_measurement(struct inode *inode, const void *buf,
 		action = ima_get_action(inode, current_cred(), secid, 0, func,
 					&pcr, &template, func_data);
 		if (!(action & IMA_MEASURE))
-			return;
+			return 0;
 	}
 
 	if (!pcr)
@@ -788,7 +789,7 @@ void process_buffer_measurement(struct inode *inode, const void *buf,
 			pr_err("template %s init failed, result: %d\n",
 			       (strlen(template->name) ?
 				template->name : template->fmt), ret);
-			return;
+			return ret;
 		}
 	}
 
@@ -820,7 +821,7 @@ void process_buffer_measurement(struct inode *inode, const void *buf,
 					func_measure_str(func),
 					audit_cause, ret, 0, ret);
 
-	return;
+	return ret;
 }
 
 /**
@@ -847,6 +848,26 @@ void ima_kexec_cmdline(int kernel_fd, const void *buf, int size)
 	fdput(f);
 }
 
+/**
+ * ima_measure_critical_data - measure critical data
+ * @event_name: name for the given data
+ * @event_data_source: name of the event data source
+ * @buf: pointer to buffer containing data to measure
+ * @size: Number of bytes in buf
+ *
+ * Buffers can only be measured, not appraised.
+ */
+int ima_measure_critical_data(const char *event_name,
+			      const char *event_data_source,
+			      const void *buf, int buf_len)
+{
+	if (!event_name || !event_data_source || !buf || !buf_len)
+		return -EINVAL;
+
+	return process_buffer_measurement(NULL, buf, buf_len, event_name,
+					  CRITICAL_DATA, 0, event_data_source);
+}
+
 static int __init init_ima(void)
 {
 	int error;
-- 
2.17.1


^ permalink raw reply related

* [PATCH 1/3] IMA: generalize keyring specific measurement constructs
From: Tushar Sugandhi @ 2020-08-12 19:31 UTC (permalink / raw)
  To: zohar, stephen.smalley.work, casey, gmazyland
  Cc: tyhicks, sashal, jmorris, linux-integrity, selinux,
	linux-security-module, linux-kernel, dm-devel, nramas
In-Reply-To: <20200812193102.18636-1-tusharsu@linux.microsoft.com>

IMA functions such as ima_match_keyring(), process_buffer_measurement(),
ima_match_policy() etc. handle data specific to keyrings. Currently,
these constructs are not generic to handle any func specific data. 
This makes it harder to extend without code duplication.

Refactor the keyring specific measurement constructs to be generic and
reusable in other measurement scenarios. Rename the parameter "size"
in process_buffer_measurement() to "buf_len" to indicate it is the
length of the buffer pointed by the parameter "buf".

Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com>
---
 security/integrity/ima/ima.h        | 11 ++++----
 security/integrity/ima/ima_api.c    |  6 ++---
 security/integrity/ima/ima_main.c   | 17 ++++++------
 security/integrity/ima/ima_policy.c | 40 +++++++++++++++++------------
 4 files changed, 41 insertions(+), 33 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 38043074ce5e..e2a151d6653d 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -255,7 +255,7 @@ static inline void ima_process_queued_keys(void) {}
 int ima_get_action(struct inode *inode, const struct cred *cred, u32 secid,
 		   int mask, enum ima_hooks func, int *pcr,
 		   struct ima_template_desc **template_desc,
-		   const char *keyring);
+		   const char *func_data);
 int ima_must_measure(struct inode *inode, int mask, enum ima_hooks func);
 int ima_collect_measurement(struct integrity_iint_cache *iint,
 			    struct file *file, void *buf, loff_t size,
@@ -265,9 +265,10 @@ void ima_store_measurement(struct integrity_iint_cache *iint, struct file *file,
 			   struct evm_ima_xattr_data *xattr_value,
 			   int xattr_len, const struct modsig *modsig, int pcr,
 			   struct ima_template_desc *template_desc);
-void process_buffer_measurement(struct inode *inode, const void *buf, int size,
-				const char *eventname, enum ima_hooks func,
-				int pcr, const char *keyring);
+void process_buffer_measurement(struct inode *inode, const void *buf,
+				int buf_len, const char *eventname,
+				enum ima_hooks func, int pcr,
+				const char *func_data);
 void ima_audit_measurement(struct integrity_iint_cache *iint,
 			   const unsigned char *filename);
 int ima_alloc_init_template(struct ima_event_data *event_data,
@@ -283,7 +284,7 @@ const char *ima_d_path(const struct path *path, char **pathbuf, char *filename);
 int ima_match_policy(struct inode *inode, const struct cred *cred, u32 secid,
 		     enum ima_hooks func, int mask, int flags, int *pcr,
 		     struct ima_template_desc **template_desc,
-		     const char *keyring);
+		     const char *func_data);
 void ima_init_policy(void);
 void ima_update_policy(void);
 void ima_update_policy_flag(void);
diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index 4f39fb93f278..af218babd198 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -170,7 +170,7 @@ void ima_add_violation(struct file *file, const unsigned char *filename,
  * @func: caller identifier
  * @pcr: pointer filled in if matched measure policy sets pcr=
  * @template_desc: pointer filled in if matched measure policy sets template=
- * @keyring: keyring name used to determine the action
+ * @func_data: private data specific to @func, can be NULL.
  *
  * The policy is defined in terms of keypairs:
  *		subj=, obj=, type=, func=, mask=, fsmagic=
@@ -186,14 +186,14 @@ void ima_add_violation(struct file *file, const unsigned char *filename,
 int ima_get_action(struct inode *inode, const struct cred *cred, u32 secid,
 		   int mask, enum ima_hooks func, int *pcr,
 		   struct ima_template_desc **template_desc,
-		   const char *keyring)
+		   const char *func_data)
 {
 	int flags = IMA_MEASURE | IMA_AUDIT | IMA_APPRAISE | IMA_HASH;
 
 	flags &= ima_policy_flag;
 
 	return ima_match_policy(inode, cred, secid, func, mask, flags, pcr,
-				template_desc, keyring);
+				template_desc, func_data);
 }
 
 /*
diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
index 8a91711ca79b..a8740b7ea417 100644
--- a/security/integrity/ima/ima_main.c
+++ b/security/integrity/ima/ima_main.c
@@ -728,17 +728,18 @@ int ima_load_data(enum kernel_load_data_id id)
  * process_buffer_measurement - Measure the buffer to ima log.
  * @inode: inode associated with the object being measured (NULL for KEY_CHECK)
  * @buf: pointer to the buffer that needs to be added to the log.
- * @size: size of buffer(in bytes).
+ * @buf_len: length of buffer(in bytes).
  * @eventname: event name to be used for the buffer entry.
  * @func: IMA hook
  * @pcr: pcr to extend the measurement
- * @keyring: keyring name to determine the action to be performed
+ * @func_data: private data specific to @func, can be NULL.
  *
  * Based on policy, the buffer is measured into the ima log.
  */
-void process_buffer_measurement(struct inode *inode, const void *buf, int size,
-				const char *eventname, enum ima_hooks func,
-				int pcr, const char *keyring)
+void process_buffer_measurement(struct inode *inode, const void *buf,
+				int buf_len, const char *eventname,
+				enum ima_hooks func, int pcr,
+				const char *func_data)
 {
 	int ret = 0;
 	const char *audit_cause = "ENOMEM";
@@ -747,7 +748,7 @@ void process_buffer_measurement(struct inode *inode, const void *buf, int size,
 	struct ima_event_data event_data = {.iint = &iint,
 					    .filename = eventname,
 					    .buf = buf,
-					    .buf_len = size};
+					    .buf_len = buf_len};
 	struct ima_template_desc *template = NULL;
 	struct {
 		struct ima_digest_data hdr;
@@ -770,7 +771,7 @@ void process_buffer_measurement(struct inode *inode, const void *buf, int size,
 	if (func) {
 		security_task_getsecid(current, &secid);
 		action = ima_get_action(inode, current_cred(), secid, 0, func,
-					&pcr, &template, keyring);
+					&pcr, &template, func_data);
 		if (!(action & IMA_MEASURE))
 			return;
 	}
@@ -795,7 +796,7 @@ void process_buffer_measurement(struct inode *inode, const void *buf, int size,
 	iint.ima_hash->algo = ima_hash_algo;
 	iint.ima_hash->length = hash_digest_size[ima_hash_algo];
 
-	ret = ima_calc_buffer_hash(buf, size, iint.ima_hash);
+	ret = ima_calc_buffer_hash(buf, buf_len, iint.ima_hash);
 	if (ret < 0) {
 		audit_cause = "hashing_error";
 		goto out;
diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
index fe1df373c113..4efaf8956eb8 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -451,15 +451,21 @@ int ima_lsm_policy_change(struct notifier_block *nb, unsigned long event,
 }
 
 /**
- * ima_match_keyring - determine whether the keyring matches the measure rule
- * @rule: a pointer to a rule
- * @keyring: name of the keyring to match against the measure rule
+ * ima_match_rule_data - determine whether the given func_data matches
+ *			 the measure rule data
+ * @rule: IMA policy rule
+ * @opt_list: rule data to match func_data against
+ * @func_data: data to match against the measure rule data
+ * @allow_empty_opt_list: If true matches all func_data
  * @cred: a pointer to a credentials structure for user validation
  *
  * Returns true if keyring matches one in the rule, false otherwise.
  */
-static bool ima_match_keyring(struct ima_rule_entry *rule,
-			      const char *keyring, const struct cred *cred)
+static bool ima_match_rule_data(struct ima_rule_entry *rule,
+				const struct ima_rule_opt_list *opt_list,
+				const char *func_data,
+				bool allow_empty_opt_list,
+				const struct cred *cred)
 {
 	bool matched = false;
 	size_t i;
@@ -467,14 +473,14 @@ static bool ima_match_keyring(struct ima_rule_entry *rule,
 	if ((rule->flags & IMA_UID) && !rule->uid_op(cred->uid, rule->uid))
 		return false;
 
-	if (!rule->keyrings)
-		return true;
+	if (!opt_list)
+		return allow_empty_opt_list;
 
-	if (!keyring)
+	if (!func_data)
 		return false;
 
-	for (i = 0; i < rule->keyrings->count; i++) {
-		if (!strcmp(rule->keyrings->items[i], keyring)) {
+	for (i = 0; i < opt_list->count; i++) {
+		if (!strcmp(opt_list->items[i], func_data)) {
 			matched = true;
 			break;
 		}
@@ -491,20 +497,21 @@ static bool ima_match_keyring(struct ima_rule_entry *rule,
  * @secid: the secid of the task to be validated
  * @func: LIM hook identifier
  * @mask: requested action (MAY_READ | MAY_WRITE | MAY_APPEND | MAY_EXEC)
- * @keyring: keyring name to check in policy for KEY_CHECK func
+ * @func_data: private data specific to @func, can be NULL.
  *
  * Returns true on rule match, false on failure.
  */
 static bool ima_match_rules(struct ima_rule_entry *rule, struct inode *inode,
 			    const struct cred *cred, u32 secid,
 			    enum ima_hooks func, int mask,
-			    const char *keyring)
+			    const char *func_data)
 {
 	int i;
 
 	if (func == KEY_CHECK) {
 		return (rule->flags & IMA_FUNC) && (rule->func == func) &&
-		       ima_match_keyring(rule, keyring, cred);
+		       ima_match_rule_data(rule, rule->keyrings, func_data,
+					   true, cred);
 	}
 	if ((rule->flags & IMA_FUNC) &&
 	    (rule->func != func && func != POST_SETATTR))
@@ -608,8 +615,7 @@ static int get_subaction(struct ima_rule_entry *rule, enum ima_hooks func)
  * @mask: requested action (MAY_READ | MAY_WRITE | MAY_APPEND | MAY_EXEC)
  * @pcr: set the pcr to extend
  * @template_desc: the template that should be used for this rule
- * @keyring: the keyring name, if given, to be used to check in the policy.
- *           keyring can be NULL if func is anything other than KEY_CHECK.
+ * @func_data: private data specific to @func, can be NULL.
  *
  * Measure decision based on func/mask/fsmagic and LSM(subj/obj/type)
  * conditions.
@@ -621,7 +627,7 @@ static int get_subaction(struct ima_rule_entry *rule, enum ima_hooks func)
 int ima_match_policy(struct inode *inode, const struct cred *cred, u32 secid,
 		     enum ima_hooks func, int mask, int flags, int *pcr,
 		     struct ima_template_desc **template_desc,
-		     const char *keyring)
+		     const char *func_data)
 {
 	struct ima_rule_entry *entry;
 	int action = 0, actmask = flags | (flags << 1);
@@ -636,7 +642,7 @@ int ima_match_policy(struct inode *inode, const struct cred *cred, u32 secid,
 			continue;
 
 		if (!ima_match_rules(entry, inode, cred, secid, func, mask,
-				     keyring))
+				     func_data))
 			continue;
 
 		action |= entry->flags & IMA_ACTION_FLAGS;
-- 
2.17.1


^ permalink raw reply related

* [PATCH 0/3] IMA: Infrastructure for measurement of critical kernel data
From: Tushar Sugandhi @ 2020-08-12 19:30 UTC (permalink / raw)
  To: zohar, stephen.smalley.work, casey, gmazyland
  Cc: tyhicks, sashal, jmorris, linux-integrity, selinux,
	linux-security-module, linux-kernel, dm-devel, nramas

There are several kernel components that contain critical data which if
accidentally or maliciously altered, can compromise the security of the
kernel. Example of such components would include LSMs like SELinux, or
AppArmor; or device-mapper targets like dm-crypt, dm-verity etc.

Many of these components do not use the capabilities provided by kernel
integrity subsystem (IMA), and thus they don't use the benefits of
extended TPM PCR quotes and ultimately the benefits of remote attestation.

This series bridges this gap, so that potential kernel components that
contain data critical to the security of the kernel could take advantage
of IMA's measuring and quoting abilities - thus ultimately enabling
remote attestation for their specific data.

System administrators may want to pick and choose which kernel
components they would want to enable for measurements, quoting, and
remote attestation. To enable that, a new IMA policy is introduced.

And lastly, the functionality is exposed through a function
ima_measure_critical_data(). The functionality is generic enough to
measure the data of any kernel component at runtime.

This series is based on the following repo/branch:
 repo: https://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git
 branch: next-integrity

This series also has a dependency on the following patch series:
 https://patchwork.kernel.org/patch/11709527/

Tushar Sugandhi (3):
  IMA: generalize keyring specific measurement constructs
  IMA: add policy to support measuring critical data from kernel
    components
  IMA: define IMA hook to measure critical data from kernel components

 Documentation/ABI/testing/ima_policy |   6 +-
 include/linux/ima.h                  |   9 ++
 security/integrity/ima/ima.h         |  12 +--
 security/integrity/ima/ima_api.c     |   8 +-
 security/integrity/ima/ima_main.c    |  46 +++++++---
 security/integrity/ima/ima_policy.c  | 120 ++++++++++++++++++++-------
 6 files changed, 147 insertions(+), 54 deletions(-)

-- 
2.17.1


^ permalink raw reply

* Re: [PATCH v20 05/12] LSM: Infrastructure management of the superblock
From: Stephen Smalley @ 2020-08-12 19:16 UTC (permalink / raw)
  To: Mickaël Salaün, linux-kernel
  Cc: Al Viro, Andy Lutomirski, Anton Ivanov, Arnd Bergmann,
	Casey Schaufler, James Morris, Jann Horn, Jeff Dike,
	Jonathan Corbet, Kees Cook, Michael Kerrisk, Richard Weinberger,
	Serge E . Hallyn, Shuah Khan, Vincent Dagonneau, kernel-hardening,
	linux-api, linux-arch, linux-doc, linux-fsdevel, linux-kselftest,
	linux-security-module, x86, John Johansen
In-Reply-To: <20200802215903.91936-6-mic@digikod.net>

On 8/2/20 5:58 PM, Mickaël Salaün wrote:
> From: Casey Schaufler <casey@schaufler-ca.com>
> 
> Move management of the superblock->sb_security blob out
> of the individual security modules and into the security
> infrastructure. Instead of allocating the blobs from within
> the modules the modules tell the infrastructure how much
> space is required, and the space is allocated there.
> 
> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: John Johansen <john.johansen@canonical.com>
> Reviewed-by: Stephen Smalley <sds@tycho.nsa.gov>
> Reviewed-by: Mickaël Salaün <mic@digikod.net>
> Link: https://lore.kernel.org/r/20190829232935.7099-2-casey@schaufler-ca.com
> ---
> 
> Changes since v17:
> * Rebase the original LSM stacking patch from v5.3 to v5.7: I fixed some
>    diff conflicts caused by code moves and function renames in
>    selinux/include/objsec.h and selinux/hooks.c .  I checked that it
>    builds but I didn't test the changes for SELinux nor SMACK.

You shouldn't retain Signed-off-by and Reviewed-by lines from an earlier 
patch if you made non-trivial changes to it (even more so if you didn't 
test them).


^ permalink raw reply

* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Madhavan T. Venkataraman @ 2020-08-12 18:47 UTC (permalink / raw)
  To: Mark Rutland
  Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module, oleg, x86
In-Reply-To: <20200812100650.GB28154@C02TD0UTHF1T.local>



On 8/12/20 5:06 AM, Mark Rutland wrote:
> [..]
>>
>> The general principle of the mitigation is W^X. I would argue that
>> the above options are violations of the W^X principle. If they are
>> allowed today, they must be fixed. And they will be. So, we cannot
>> rely on them.
> 
> Hold on.
> 
> Contemporary W^X means that a given virtual alias cannot be writeable
> and executeable simultaneously, permitting (a) and (b). If you read the
> references on the Wikipedia page for W^X you'll see the OpenBSD 3.3
> release notes and related presentation make this clear, and further they
> expect (b) to occur with JITS flipping W/X with mprotect().
> 
> Please don't conflate your assumed stronger semantics with the general
> principle. It not matching you expectations does not necessarily mean
> that it is wrong.
> 
> If you want a stronger W^X semantics, please refer to this specifically
> with a distinct name.

OK. Fair enough. We can give a different name to the stronger requirement.
Just for the sake of this discussion and for the want of a better name,
let us call it WX2.


> 
>> a) This requires a remap operation. Two mappings point to the same
>>      physical page. One mapping has W and the other one has X. This
>>      is a violation of W^X.
>>
>> b) This is again a violation. The kernel should refuse to give execute
>>      permission to a page that was writeable in the past and refuse to
>>      give write permission to a page that was executable in the past.
>>
>> c) This is just a variation of (a).
> 
> As above, this is not true.
> 
> If you have a rationale for why this is desirable or necessary, please
> justify that before using this as justification for additional features.
> 

I already supplied the justification. Any user level method can potentially
be hijacked by an attacker for his purpose.

WX does not prevent all of the methods. We need WX2.


>> In general, the problem with user-level methods to map and execute
>> dynamic code is that the kernel cannot tell if a genuine application is
>> using them or an attacker is using them or piggy-backing on them.
> 
> Yes, and as I pointed out the same is true for trampfd unless you can
> somehow authenticate the calls are legitimate (in both callsite and the
> set of arguments), and I don't see any reasonable way of doing that.
> 

I am afraid I am not in agreement with this. If WX2 is not implemented,
an attacker can hack both code and data. If WX2 is implemented, an attacker
can only attack data. The attack surface is reduced.

Also, trampfd calls coming from code from a signed file can be authenticated.
trampfd calls coming from an attacker's generated code cannot be authenticated.


> If you relax your threat model to an attacker not being able to make
> arbitrary syscalls, then your suggestion that userspace can perorm
> chceks between syscalls may be sufficient, but as I pointed out that's
> equally true for a sealed memfd or similar.
> 

Actually, I did not suggest that userspace can perform checks. I said that
the kernel can perform checks.

User space cannot reliably perform checks between calls. A clever hacker
can cover his tracks.

In any case, the kernel has no knowledge of these checks. So, when execute
permissions are requested for a page, a properly implemented WX2 can refuse.


>> Off the top of my head, I have tried to identify some examples
>> where we can have more trust on dynamic code and have the kernel
>> permit its execution.
>>
>> 1. If the kernel can do the job, then that is one safe way. Here, the kernel
>>     is the code. There is no code generation involved. This is what I
>>     have presented in the patch series as the first cut.
> 
> This is sleight-of-hand; it doesn't matter where the logic is performed
> if the power is identical. Practically speaking this is equivalent to
> some dynamic code generation.
> 
> I think that it's misleading to say that because the kernel emulates
> something it is safe when the provenance of the syscall arguments cannot
> be verified.


I submit that there are two aspects - code and data. In one case, both
code and data can be hacked. So, an attacker can modify both code
and data. In the other case, the attacker can only modify data.
The power is not identical. The attack surface is not the same.

Most of the times, security measures are mitigations. They are not a 100%.
This approach of not allowing the user to do certain things that can be
exploited and having the kernel doing them increases our confidence.
From that perspective, the two approaches are different and it is worth
pursuing a kernel based mitigation.


> 
> [...]
> 
>> Anyway, these are just examples. The principle is - if we can identify
>> dynamic code that has a certain measure of trust, can the kernel
>> permit their execution?
> 
> My point generally is that the kernel cannot identify this, and if
> usrspace code is trusted to dynamically generate trampfd arguments it
> can equally be trusted to dyncamilly generate code.

I am afraid not. See my previous response. Ability to hack only data
gives an attacker fewer options as compared to the ability to hack
both code and data.

> 
> [...]
> 
>> As I have mentioned above, I intend to have the kernel generate code
>> only if the code generation is simple enough. For more complicated cases,
>> I plan to use a user-level code generator that is for exclusive kernel use.
>> I have yet to work out the details on how this would work. Need time.
> 
> This reads to me like trampfd is only dealing with a few special cases
> and we know that we need a more general solution.
> 
> I hope I am mistaken, but I get the strong impression that you're trying
> to justify your existing solution rather than trying to understand the
> problem space.
> 

I do understand the problem space. I wanted to address dynamic code in 3
different ways in separate phases starting from the easiest and working
my way up to the more difficult ones.

1. Remove dynamic code where possible

   If the kernel can replace user level dynamic code, then do it.
   This is what I did in version 1.

2. Replace dynamic code with static code

   Where you cannot do (1), replace dynamic code with static code with
   the kernel's help. I wanted to do this later. But I have decided to
   do this in version 2. This combined with signature verification of
   files adds a measure or trust in the code.

3. Deal with JIT, DBT, etc

   In (1) and (2), we deal with machine code. In (3), there is some source
   from which dynamic code needs to be generated using a code generator.
   E.g., JIT code from Java byte code. Here, the solution I had in mind
   had two parts:

       - Make the source more trustworthy by requiring it to be part
         of a signed file
       - Design a code generator trusted and used exclusively by the kernel

In this patchset, I wanted to lay a foundation for all 3 and attempt to
solve (1) first. Once this was in place, I wanted to do (2) and then (3).

In retrospect, I should have probably started with the big picture first
instead of starting with just item (1). But I always had the big picture
in mind. That said, I did not necessarily have all the details fleshed
out for all the phases. (3) is complex.

My focus was to define the API in a generic enough fashion so that all
3 phases can be implemented. But I realize that it is a hard sell at this
point to convince people that the API is adequate for phase 3. So,
I have decided to do (1) and (2). (3) has to be done separately with
more thought and details put into it.

Also, it may be the case that there are some examples of dynamic code
out there than can never be addressed. My goal is to try to address a
majority of the dynamic code out there.


> To be clear, my strong opinion is that we should not be trying to do
> this sort of emulation or code generation within the kernel. I do think
> it's worthwhile to look at mechanisms to make it harder to subvert
> dynamic userspace code generation, but I think the code generation
> itself needs to live in userspace (e.g. for ABI reasons I previously
> mentioned).
> 

I completely agree that the kernel should not deal with the complexities
of code generation and ABI details. My version 1 did not have any code
generation. But since a performance issue was raised, I explored the idea
of kernel code generation. To be honest, I was not really that
comfortable with the idea.

That is why I have decided to implement the second piece I had in
my plan now. This piece does not have the code generation complexities
or ABI issues. This piece can be used to solve libffi, GCC, etc.
I will still write the code in such a way that I can use the first
approach in the future if I really need it. But it will not involve any
code generation from the kernel. It will only be used for cases that
don't mind the extra trip to the kernel.

Madhavan

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox