* Re: [PATCH v3 15/19] IMA: Add support for file reads without contents
From: Kees Cook @ 2020-07-28 19:44 UTC
To: Mimi Zohar
Cc: Greg Kroah-Hartman, Scott Branden, Luis Chamberlain, Jessica Yu,
SeongJae Park, KP Singh, linux-efi, linux-security-module,
linux-integrity, selinux, linux-kselftest, linux-kernel
In-Reply-To: <1595856214.4841.86.camel@kernel.org>
On Mon, Jul 27, 2020 at 09:23:34AM -0400, Mimi Zohar wrote:
> On Fri, 2020-07-24 at 14:36 -0700, Kees Cook wrote:
> > From: Scott Branden <scott.branden@broadcom.com>
> >
> > When the kernel_read_file LSM hook is called with contents=false, IMA
> > can appraise the file directly, without requiring a filled buffer. When
> > such a buffer is available, though, IMA can continue to use it instead
> > of forcing a double read here.
> >
> > Signed-off-by: Scott Branden <scott.branden@broadcom.com>
> > Link: https://lore.kernel.org/lkml/20200706232309.12010-10-scott.branden@broadcom.com/
> > Signed-off-by: Kees Cook <keescook@chromium.org>
>
> After adjusting the comment below.
>
> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Sure!
Greg, shall I send a v4 with added Reviews and the comment change or is
that minor enough that you're able to do it?
Thanks for the reviews Mimi!
-Kees
>
> > ---
> > security/integrity/ima/ima_main.c | 22 ++++++++++++++++------
> > 1 file changed, 16 insertions(+), 6 deletions(-)
> >
> > diff --git a/security/integrity/ima/ima_main.c b/security/integrity/ima/ima_main.c
> > index dc4f90660aa6..459e50526a12 100644
> > --- a/security/integrity/ima/ima_main.c
> > +++ b/security/integrity/ima/ima_main.c
> > @@ -613,11 +613,8 @@ void ima_post_path_mknod(struct dentry *dentry)
> > int ima_read_file(struct file *file, enum kernel_read_file_id read_id,
> > bool contents)
> > {
> > - /* Reject all partial reads during appraisal. */
> > - if (!contents) {
> > - if (ima_appraise & IMA_APPRAISE_ENFORCE)
> > - return -EACCES;
> > - }
> > + enum ima_hooks func;
> > + u32 secid;
> >
> > /*
> > * Do devices using pre-allocated memory run the risk of the
> > @@ -626,7 +623,20 @@ int ima_read_file(struct file *file, enum kernel_read_file_id read_id,
> > * buffers? It may be desirable to include the buffer address
> > * in this API and walk all the dma_map_single() mappings to check.
> > */
> > - return 0;
> > +
> > + /*
> > + * There will be a call made to ima_post_read_file() with
> > + * a filled buffer, so we don't need to perform an extra
> > + * read early here.
> > + */
> > + if (contents)
> > + return 0;
> > +
> > + /* Read entire file for all partial reads during appraisal. */
>
> In addition to verifying the file signature, the file might be
> included in the IMA measurement list or the file hash may be used to
> augment the audit record. Please remove "during appraisal" from the
> comment.
>
> > + func = read_idmap[read_id] ?: FILE_CHECK;
> > + security_task_getsecid(current, &secid);
> > + return process_measurement(file, current_cred(), secid, NULL,
> > + 0, MAY_READ, func);
> > }
> >
> > const int read_idmap[READING_MAX_ID] = {
>
--
Kees Cook
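
For readers following the hunk above, here is a minimal userspace sketch of the control-flow change, not the kernel code itself. The names read_file_hook_old(), read_file_hook_new(), measure_whole_file() and measured_calls are hypothetical stand-ins; measure_whole_file() only models the point where the patch calls process_measurement().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Counts how often the (hypothetical) whole-file measurement ran. */
static int measured_calls;

/* Hypothetical stand-in for process_measurement(): 0 on success. */
static int measure_whole_file(const char *path)
{
	if (path == NULL)
		return -EINVAL;
	measured_calls++;
	return 0;
}

/* Behaviour before the patch: partial reads fail when appraisal is enforced. */
static int read_file_hook_old(const char *path, bool contents, bool enforce)
{
	(void)path;
	if (!contents && enforce)
		return -EACCES;
	return 0;
}

/* Behaviour after the patch, as described in the commit message. */
static int read_file_hook_new(const char *path, bool contents)
{
	/*
	 * A later post-read hook will see the filled buffer, so there is
	 * nothing to do here and no second read of the file is needed.
	 */
	if (contents)
		return 0;

	/* No buffer will be handed over: process the whole file now. */
	return measure_whole_file(path);
}
```

Under these assumptions, a partial read that used to be refused with -EACCES now triggers one whole-file pass, while a full read still defers to the post-read hook.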
* Re: [PATCH v3 12/19] firmware_loader: Use security_post_load_data()
From: Kees Cook @ 2020-07-28 19:43 UTC
To: Mimi Zohar
Cc: Greg Kroah-Hartman, Scott Branden, Luis Chamberlain, Jessica Yu,
SeongJae Park, KP Singh, linux-efi, linux-security-module,
linux-integrity, selinux, linux-kselftest, linux-kernel
In-Reply-To: <1595847465.4841.63.camel@kernel.org>
On Mon, Jul 27, 2020 at 06:57:45AM -0400, Mimi Zohar wrote:
> On Fri, 2020-07-24 at 14:36 -0700, Kees Cook wrote:
> > Now that security_post_load_data() is wired up, use it instead
> > of the NULL file argument style of security_post_read_file(),
> > and update the security_kernel_load_data() call to indicate that a
> > security_kernel_post_load_data() call is expected.
> >
> > Wire up the IMA check to match earlier logic. Perhaps a generalized
> > change to ima_post_load_data() might look something like this:
> >
> > return process_buffer_measurement(buf, size,
> > kernel_load_data_id_str(load_id),
> > read_idmap[load_id] ?: FILE_CHECK,
> > 0, NULL);
> >
> > Signed-off-by: Kees Cook <keescook@chromium.org>
>
> process_measurement() measures, verifies a file signature - both
> signatures stored as an xattr and as an appended buffer signature -
> and augments audit records with the file hash. (Support for measuring,
> augmenting audit records, and/or verifying fs-verity signatures has
> yet to be added.)
>
> As explained in my response to 11/19, the file descriptor provides the
> file pathname associated with the buffer data. In addition, IMA
> policy rules may be defined in terms of other file descriptor info -
> uid, euid, uuid, etc.
>
> Recently support was added for measuring the kexec boot command line,
> certificates being loaded onto a keyring, and blacklisted file hashes
> (limited to appended signatures). None of these buffers are signed.
> process_buffer_measurement() was added for this reason and as a
> result is limited to just measuring the buffer data.
>
> Whether process_measurement() or process_buffer_measurement() should
> be modified needs to be determined. In either case, supporting the
> init_module syscall would at minimum require the associated file
> pathname.
Right -- I don't intend to make changes to the init_module() syscall
since it's deprecated, so this hook is more of a "fuller LSM coverage
for old syscalls" addition.
IMA can happily continue to ignore it, which is what I have here, but I
thought I'd at least show what it *might* look like. Perhaps BPF LSM is
a better example.
Does anything need to change for this patch?
--
Kees Cook
* Re: [PATCH v3 11/19] LSM: Introduce kernel_post_load_data() hook
From: Kees Cook @ 2020-07-28 19:41 UTC
To: Mimi Zohar
Cc: Greg Kroah-Hartman, Scott Branden, Luis Chamberlain, Jessica Yu,
SeongJae Park, KP Singh, linux-efi, linux-security-module,
linux-integrity, selinux, linux-kselftest, linux-kernel
In-Reply-To: <1595846951.4841.61.camel@kernel.org>
On Mon, Jul 27, 2020 at 06:49:11AM -0400, Mimi Zohar wrote:
> On Fri, 2020-07-24 at 14:36 -0700, Kees Cook wrote:
> > There are a few places in the kernel where LSMs would like to have
> > visibility into the contents of a kernel buffer that has been loaded or
> > read. While security_kernel_post_read_file() (which includes the
> > buffer) exists as a pairing for security_kernel_read_file(), no such
> > hook exists to pair with security_kernel_load_data().
> >
> > Earlier proposals for just using security_kernel_post_read_file() with a
> > NULL file argument were rejected (i.e. "file" should always be valid for
> > the security_..._file hooks), but it appears at least one case was
> > left in the kernel during earlier refactoring. (This will be fixed in
> > a subsequent patch.)
> >
> > Since not all cases of security_kernel_load_data() can have a single
> > contiguous buffer made available to the LSM hook (e.g. kexec image
> > segments are separately loaded), there needs to be a way for the LSM to
> > reason about its expectations of the hook coverage. In order to handle
> > this, add a "contents" argument to the "kernel_load_data" hook that
> > indicates if the newly added "kernel_post_load_data" hook will be called
> > with the full contents once loaded. That way, LSMs requiring full contents
> > can choose to unilaterally reject "kernel_load_data" with contents=false
> > (which is effectively the existing hook coverage), but when contents=true
> > they can allow it and later evaluate the "kernel_post_load_data" hook
> > once the buffer is loaded.
> >
> > With this change, LSMs can gain coverage over non-file-backed data loads
> > (e.g. init_module(2) and firmware userspace helper), which will happen
> > in subsequent patches.
> >
> > Additionally prepare IMA to start processing these cases.
> >
> > Signed-off-by: Kees Cook <keescook@chromium.org>
>
> At least from an IMA perspective, the original
> security_kernel_load_data() hook was defined in order to prevent
> certain syscalls - init_module, kexec_load - and loading firmware via
> sysfs. The resulting error messages were generic.
>
> Unlike security_kernel_load_data(), security_kernel_post_load_data()
> is meant to be used, but without a file descriptor, specific
> information, like the filename associated with the buffer, is missing.
> Having the filename isn't actually necessary for verifying the
> appended signature, but it is needed for auditing signature
> verification failures and including in the IMA measurement list.
Right -- I'm open to ideas on this, but as it stands, other LSMs (e.g.
BPF LSM) can benefit from the security_kernel_post_load_data() to
examine the contents, etc.
Is there anything that needs to change in this patch?
--
Kees Cook
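
As a rough illustration of the "contents" pairing described in the commit message, here is a userspace sketch assuming a toy policy. demo_load_data(), demo_post_load_data(), load_buffer() and the "OK" magic prefix are all hypothetical; the real hooks take different arguments.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical LSM that needs the full contents: it refuses any load
 * for which the post-load hook will not be called with the buffer.
 */
static int demo_load_data(bool contents)
{
	return contents ? 0 : -EPERM;
}

/* Hypothetical content check: the buffer must start with "OK". */
static int demo_post_load_data(const char *buf, size_t size)
{
	if (size < 2 || memcmp(buf, "OK", 2) != 0)
		return -EPERM;
	return 0;
}

/* Models a caller that announces up front whether it can supply contents. */
static int load_buffer(const char *buf, size_t size, bool contents)
{
	int ret = demo_load_data(contents);

	if (ret)
		return ret;

	/* ... a real caller would fill the buffer at this point ... */

	if (contents)
		ret = demo_post_load_data(buf, size);
	return ret;
}
```

The point of the split is that the early hook can reject a load it will never get to inspect, while the late hook does the actual inspection once the buffer exists.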
* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Madhavan T. Venkataraman @ 2020-07-28 19:01 UTC
To: Andy Lutomirski
Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
In-Reply-To: <CALCETrVy5OMuUx04-wWk9FJbSxkrT2vMfN_kANinudrDwC4Cig@mail.gmail.com>
I am working on a response to this. I will send it soon.
Thanks.
Madhavan
On 7/28/20 12:31 PM, Andy Lutomirski wrote:
>> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>>
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> This is quite clever, but now I’m wondering just how much kernel help
> is really needed. In your series, the trampoline is a non-executable
> page. I can think of at least two alternative approaches, and I'd
> like to know the pros and cons.
>
> 1. Entirely userspace: a return trampoline would be something like:
>
> 1:
> pushq %rax
> pushq %rbx
> pushq %rcx
> ...
> pushq %r15
> movq %rsp, %rdi # pointer to saved regs
> leaq 1b(%rip), %rsi # pointer to the trampoline itself
> callq trampoline_handler # see below
>
> You would fill a page with a bunch of these, possibly compacted to get
> more per page, and then you would remap as many copies as needed. The
> 'callq trampoline_handler' part would need to be a bit clever to make
> it continue to work despite this remapping. This will be *much*
> faster than trampfd. How much of your use case would it cover? For
> the inverse, it's not too hard to write a bit of asm to set all
> registers and jump somewhere.
>
> 2. Use existing kernel functionality. Raise a signal, modify the
> state, and return from the signal. This is very flexible and may not
> be all that much slower than trampfd.
>
> 3. Use a syscall. Instead of having the kernel handle page faults,
> have the trampoline code push the syscall nr register, load a special
> new syscall nr into the syscall nr register, and do a syscall. On
> x86_64, this would be:
>
> pushq %rax
> movq $__NR_magic_trampoline, %rax
> syscall
>
> with some adjustment if the stack slot you're clobbering is important.
>
>
> Also, will using trampfd cause issues with various unwinders? I can
> easily imagine unwinders expecting code to be readable, although this
> is slowly going away for other reasons.
>
> All this being said, I think that the kernel should absolutely add a
> sensible interface for JITs to use to materialize their code. This
> would integrate sanely with LSMs and wouldn't require hacks like using
> files, etc. A cleverly designed JIT interface could function without
> serialization IPIs, and even lame architectures like x86 could
> potentially avoid shootdown IPIs if the interface copied code instead
> of playing virtual memory games. At its very simplest, this could be:
>
> void *jit_create_code(const void *source, size_t len);
>
> and the result would be a new anonymous mapping that contains exactly
> the code requested. There could also be:
>
> int jittfd_create(...);
>
> that does something similar but creates a memfd. A nicer
> implementation for short JIT sequences would allow appending more code
> to an existing JIT region. On x86, an appendable JIT region would
> start filled with 0xCC, and I bet there's a way to materialize new
> code into a previously 0xcc-filled virtual page without any
> synchronization. One approach would be to start with:
>
> <some code>
> 0xcc
> 0xcc
> ...
> 0xcc
>
> and to create a whole new page like:
>
> <some code>
> <some more code>
> 0xcc
> ...
> 0xcc
>
> so that the only difference is that some code changed to some more
> code. Then replace the PTE to swap from the old page to the new page,
> and arrange to avoid freeing the old page until we're sure it's gone
> from all TLBs. This may not work if <some more code> spans a page
> boundary. The #BP fixup would zap the TLB and retry. Even just
> directly copying code over some 0xcc bytes almost works, but there's a
> nasty corner case involving instructions that cross I$ fetch
> boundaries. I'm not sure to what extent I$ snooping helps.
>
> --Andy
* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Madhavan T. Venkataraman @ 2020-07-28 18:52 UTC
To: Andy Lutomirski
Cc: David Laight, kernel-hardening@lists.openwall.com,
linux-api@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-security-module@vger.kernel.org, oleg@redhat.com,
x86@kernel.org
In-Reply-To: <CALCETrUta5-0TLJ9-jfdehpTAp2Efmukk2npYadFzz9ozOrG2w@mail.gmail.com>
On 7/28/20 12:16 PM, Andy Lutomirski wrote:
> On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
> <madvenka@linux.microsoft.com> wrote:
>> Thanks. See inline..
>>
>> On 7/28/20 10:13 AM, David Laight wrote:
>>> From: madvenka@linux.microsoft.com
>>>> Sent: 28 July 2020 14:11
>>> ...
>>>> The kernel creates the trampoline mapping without any permissions. When
>>>> the trampoline is executed by user code, a page fault happens and the
>>>> kernel gets control. The kernel recognizes that this is a trampoline
>>>> invocation. It sets up the user registers based on the specified
>>>> register context, and/or pushes values on the user stack based on the
>>>> specified stack context, and sets the user PC to the requested target
>>>> PC. When the kernel returns, execution continues at the target PC.
>>>> So, the kernel does the work of the trampoline on behalf of the
>>>> application.
>>> Isn't the performance of this going to be horrid?
>> It takes about the same amount of time as getpid(). So, it is
>> one quick trip into the kernel. I expect that applications will
>> typically not care about this extra overhead as long as
>> they are able to run.
> What did you test this on? A page fault on any modern x86_64 system
> is much, much, much, much slower than a syscall.
I sent a response to this. But the mail was returned to me.
I am resending.
I tested it on a KVM guest running Ubuntu. So, when you say that a
page fault is much slower, do you mean a regular page fault that is handled
through the VM layer? Here is the relevant code in do_user_addr_fault():
    if (unlikely(access_error(hw_error_code, vma))) {
            /*
             * If it is a user execute fault, it could be a trampoline
             * invocation.
             */
            if ((hw_error_code & tflags) == tflags &&
                trampfd_fault(vma, regs)) {
                    up_read(&mm->mmap_sem);
                    return;
            }
            bad_area_access_error(regs, hw_error_code, address, vma);
            return;
    }
    ...
    fault = handle_mm_fault(vma, address, flags);
trampfd faults are instruction faults that go through a different code path than
the one that calls handle_mm_fault(). Perhaps, it is the handle_mm_fault() that
is time consuming. Could you clarify?
Thanks.
Madhavan
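
One way to put rough numbers on this question is a small userspace measurement, sketched below for Linux. It times a raw getpid syscall against first-touch faults on fresh anonymous pages. Those faults do go through handle_mm_fault(), so this measures the regular path and not the trampfd path; the helper names are hypothetical, and transparent huge pages may reduce the number of faults actually taken.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average cost of one raw getpid syscall, in nanoseconds. */
static double ns_per_getpid(long iters)
{
	double start = now_ns();

	for (long i = 0; i < iters; i++)
		syscall(SYS_getpid);
	return (now_ns() - start) / (double)iters;
}

/*
 * Average cost of touching each page of a fresh anonymous mapping once,
 * in nanoseconds per page. Returns -1.0 if the mapping cannot be created.
 */
static double ns_per_first_touch(long pages)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = (size_t)pages * page;
	void *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	volatile char *p = map;
	double start, elapsed;

	if (map == MAP_FAILED)
		return -1.0;

	start = now_ns();
	for (long i = 0; i < pages; i++)
		p[(size_t)i * page] = 1;	/* first write faults the page in */
	elapsed = now_ns() - start;

	munmap(map, len);
	return elapsed / (double)pages;
}
```

Calling ns_per_getpid(1000000) and ns_per_first_touch(4096) and printing both values gives a first comparison on a given machine; results inside a KVM guest and on bare metal may differ considerably.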
* Re: [PATCH v3 00/19] Introduce partial kernel_read_file() support
From: Mimi Zohar @ 2020-07-28 18:48 UTC
To: Scott Branden, Kees Cook, Greg Kroah-Hartman
Cc: Luis Chamberlain, Jessica Yu, SeongJae Park, KP Singh, linux-efi,
linux-security-module, linux-integrity, selinux, linux-kselftest,
linux-kernel
In-Reply-To: <1a46db6f-1c8a-3509-6371-7c77999833f2@broadcom.com>
On Mon, 2020-07-27 at 12:18 -0700, Scott Branden wrote:
> Hi Mimi/Kees,
>
> On 2020-07-27 4:16 a.m., Mimi Zohar wrote:
> > On Fri, 2020-07-24 at 14:36 -0700, Kees Cook wrote:
> >> v3:
> >> - add reviews/acks
> >> - add "IMA: Add support for file reads without contents" patch
> >> - trim CC list, in case that's why vger ignored v2
> >> v2: [missing from lkml archives! (CC list too long?) repeating changes here]
> >> - fix issues in firmware test suite
> >> - add firmware partial read patches
> >> - various bug fixes/cleanups
> >> v1: https://lore.kernel.org/lkml/20200717174309.1164575-1-keescook@chromium.org/
> >>
> >> Hi,
> >>
> >> Here's my tree for adding partial read support in kernel_read_file(),
> >> which fixes a number of issues along the way. It's got Scott's firmware
> >> and IMA patches ported and everything tests cleanly for me (even with
> >> CONFIG_IMA_APPRAISE=y).
> > Thanks, Kees. Other than my comments on the new
> > security_kernel_post_load_data() hook, the patch set is really nice.
> >
> > In addition to compiling with CONFIG_IMA_APPRAISE enabled, have you
> > booted the kernel with the ima_policy=tcb? The tcb policy will add
> > measurements to the IMA measurement list and extend the TPM with the
> > file or buffer data digest. Are you seeing the firmware measurements,
> > in particular the partial read measurement?
> I booted the kernel with ima_policy=tcb.
>
> Unfortunately after enabling the following, fw_run_tests.sh does not run.
>
> mkdir /sys/kernel/security
> mount -t securityfs securityfs /sys/kernel/security
> echo "measure func=FIRMWARE_CHECK" > /sys/kernel/security/ima/policy
> echo "appraise func=FIRMWARE_CHECK appraise_type=imasig" > /sys/kernel/security/ima/policy
> ./fw_run_tests.sh
>
> [ 1296.258052] test_firmware: loading 'test-firmware.bin'
> [ 1296.263903] misc test_firmware: loading /lib/firmware/test-firmware.bin failed with error -13
> [ 1296.263905] audit: type=1800 audit(1595905754.266:9): pid=5696 uid=0 auid=4294967295 ses=4294967295 subj=kernel op=appraise_data cause=IMA-signature-required comm="fw_namespace" name="/lib/firmware/test-firmware.bin" dev="tmpfs" ino=4592 res=0
> [ 1296.297085] misc test_firmware: Direct firmware load for test-firmware.bin failed with error -13
> [ 1296.305947] test_firmware: load of 'test-firmware.bin' failed: -13
The "appraise" rule verifies the IMA signature. Unless you signed the firmware
(evmctl) and loaded the public key on the IMA keyring, that's to be expected. I
assume you are seeing firmware measurements in the IMA measurement log.
Mimi
* Re: [PATCH V3fix ghak120] audit: initialize context values in case of mandatory events
From: Paul Moore @ 2020-07-28 18:47 UTC
To: Richard Guy Briggs
Cc: Eric Paris, Linux Security Module list, Linux-Audit Mailing List,
LKML
In-Reply-To: <20200728162722.djvy3qyclj57wsfn@madcap2.tricolour.ca>
On Tue, Jul 28, 2020 at 12:27 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2020-07-27 22:14, Paul Moore wrote:
> > On Mon, Jul 27, 2020 at 5:30 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > Issue ghak120 enabled syscall records to accompany required records when
> > > no rules are present to trigger the storage of syscall context. A
> > > reported issue showed that the cwd was not always initialized. That
> > > issue was already resolved ...
> >
> > Yes and no. Yes, it appears to be resolved in v5.8-rc1 and above, but
> > the problematic commit is in v5.7 and I'm not sure backporting the fix
> > in v5.8-rcX plus this patch is the right thing to do for a released
> > kernel. The lowest risk fix for v5.7 at this point is to do a revert;
>
> Ok, fair enough. I don't understand why you didn't do the revert since
> it appears so trivial to you and this review and fix turned out to be
> marginally more work. I didn't understand what you wanted when you
> referred to stable.
I held off on the revert because I thought you might want the chance
to submit the revert with your authorship. I made an assumption that
it meant the same to you as it does to me; that's my mistake, I should
have known better.
I'll do the revert myself for stable-5.8 (which should trickle down to
v5.7.z with the right metadata), don't bother with it.
> > regardless of what happens with this patch and v5.8-rcX please post a
> > revert for the audit/stable-5.7 tree as soon as you can.
>
> (more below...)
>
> > > ... but a review of all other records that could
> > > be triggered at the time of a syscall record revealed other potential
> > > values that could be missing or misleading. Initialize them.
> > >
> > > The fds array is reset to -1 after the first syscall to indicate it
> > > isn't valid any more, but was never set to -1 when the context was
> > > allocated to indicate it wasn't yet valid.
> > >
> > > The audit_inode* functions can be called without going through
> > > getname_flags() or getname_kernel() that sets audit_names and cwd, so
> > > set the cwd if it has not already been done so due to audit_names being
> > > valid.
> > >
> > > The LSM dump_common_audit_data() LSM_AUDIT_DATA_NET:AF_UNIX case was
> > > missed with the ghak96 patch, so add that case here.
> > >
> > > Please see issue https://github.com/linux-audit/audit-kernel/issues/120
> > > Please see issue https://github.com/linux-audit/audit-kernel/issues/96
> > > Passes audit-testsuite.
> > >
> > > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > > ---
> > > kernel/auditsc.c | 3 +++
> > > security/lsm_audit.c | 1 +
> > > 2 files changed, 4 insertions(+)
> > >
> > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > index 6884b50069d1..2f97618e6a34 100644
> > > --- a/kernel/auditsc.c
> > > +++ b/kernel/auditsc.c
> > > @@ -929,6 +929,7 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state)
> > > context->prio = state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0;
> > > INIT_LIST_HEAD(&context->killed_trees);
> > > INIT_LIST_HEAD(&context->names_list);
> > > + context->fds[0] = -1;
> > > return context;
> > > }
> > >
> > > @@ -2076,6 +2077,7 @@ void __audit_inode(struct filename *name, const struct dentry *dentry,
> > > }
> > > handle_path(dentry);
> > > audit_copy_inode(n, dentry, inode, flags & AUDIT_INODE_NOEVAL);
> > > + _audit_getcwd(context);
> > > }
> > >
> > > void __audit_file(const struct file *file)
> > > @@ -2194,6 +2196,7 @@ void __audit_inode_child(struct inode *parent,
> > > audit_copy_inode(found_child, dentry, inode, 0);
> > > else
> > > found_child->ino = AUDIT_INO_UNSET;
> > > + _audit_getcwd(context);
> > > }
> > > EXPORT_SYMBOL_GPL(__audit_inode_child);
> > >
> > > diff --git a/security/lsm_audit.c b/security/lsm_audit.c
> > > index 53d0d183db8f..e93077612246 100644
> > > --- a/security/lsm_audit.c
> > > +++ b/security/lsm_audit.c
> > > @@ -369,6 +369,7 @@ static void dump_common_audit_data(struct audit_buffer *ab,
> > > audit_log_untrustedstring(ab, p);
> > > else
> > > audit_log_n_hex(ab, p, len);
> > > + audit_getcwd();
> > > break;
> > > }
> > > }
> >
> > I understand the "fds[0] = -1" fix in audit_alloc_context()
> > (ironically, the kzalloc() which is supposed to help with cases like
> > this, hurts us with this particular field), but I'm still not quite
> > seeing why we need to sprinkle audit_getcwd() calls everywhere to fix
> > this bug (this seems more like a feature add than a bugfix). Yes,
> > they may fix the problem but it seems like simply adding a
> > context->pwd test in audit_log_name() similar to what we do in
> > audit_log_exit() is the correct fix.
>
> Well, considering that ghak96 ended up being a bugfix (that wasn't its
> intent), I figured these audit_getcwd() were also bugfixes to prevent
> the same BUG under different calling conditions.
>
> > We are currently at -rc7 and this really needs to land before v5.8 is
> > released, presumably this weekend; this means a small and limited bug
> > fix patch is what is needed.
>
> Ok, so it sounds like rather than just fix it now, it would be better to
> revert it, then submit *one* patch for ghak120 plus this fix that will
> go tentatively upstream in 3 months, fully in 5. Arguably the last
> chunk above should be added to ghak96, so that should be reverted too,
> then resubmitted with this added chunk (or it could be a fixup chunk
> that would need to be sequenced with ghak120). As for the middle two
> chunks, they could either be resubmitted with a resubmitted ghak96, or
> with a resubmitted ghak120. As for the timing of all of these, ghak96
> should be in place before the ghak120 patch, so even resubmitting one
> patch for the combined ghak120 and ghak96 might make more sense.
Sigh.
I can't even reply to that paragraph above without going to GH and
looking up all those different ghak references, which is annoying, and
right now it seems like my time is better spent cleaning up this mess.
I'm not exactly sure what you mean by "one patch", but right now we
are at -rc7 and we've/I've got broken kernels to fix; submit whatever
you want and we'll deal with it when it's posted.
> I know you like only really minimal fixes this late, but this seemed
> pretty minimal to me...
Minimal is a one (two?) line NULL check in audit_log_name(), this
patch is not that.
--
paul moore
www.paul-moore.com
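
The remark about kzalloc() and fds[0] can be illustrated with a small userspace sketch. It assumes the consumer treats a non-negative fds[0] as "a descriptor pair was recorded"; struct demo_context and both helpers are hypothetical, and calloc() stands in for kzalloc().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context: fds[0] < 0 means "no descriptor pair recorded". */
struct demo_context {
	int fds[2];
};

/* Assumed consumer-side test: a non-negative fds[0] counts as valid. */
static int demo_has_fd_pair(const struct demo_context *ctx)
{
	return ctx->fds[0] >= 0;
}

/* Zeroed allocation plus the explicit sentinel, as in the first hunk. */
static struct demo_context *demo_alloc_context(void)
{
	struct demo_context *ctx = calloc(1, sizeof(*ctx));

	if (!ctx)
		return NULL;
	/*
	 * Zero is a valid descriptor number, so zero-filling alone would
	 * make the pair look recorded. The sentinel has to be set by hand.
	 */
	ctx->fds[0] = -1;
	return ctx;
}
```

With only the zeroing allocation, demo_has_fd_pair() reports a pair on a brand-new context, which is the misleading state the one-line initialization avoids.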
* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Andy Lutomirski @ 2020-07-28 17:31 UTC
To: madvenka
Cc: Kernel Hardening, Linux API, linux-arm-kernel, Linux FS Devel,
linux-integrity, LKML, LSM List, Oleg Nesterov, X86 ML
In-Reply-To: <20200728131050.24443-1-madvenka@linux.microsoft.com>
> On Jul 28, 2020, at 6:11 AM, madvenka@linux.microsoft.com wrote:
>
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
This is quite clever, but now I’m wondering just how much kernel help
is really needed. In your series, the trampoline is a non-executable
page. I can think of at least two alternative approaches, and I'd
like to know the pros and cons.
1. Entirely userspace: a return trampoline would be something like:
1:
pushq %rax
pushq %rbx
pushq %rcx
...
pushq %r15
movq %rsp, %rdi # pointer to saved regs
leaq 1b(%rip), %rsi # pointer to the trampoline itself
callq trampoline_handler # see below
You would fill a page with a bunch of these, possibly compacted to get
more per page, and then you would remap as many copies as needed. The
'callq trampoline_handler' part would need to be a bit clever to make
it continue to work despite this remapping. This will be *much*
faster than trampfd. How much of your use case would it cover? For
the inverse, it's not too hard to write a bit of asm to set all
registers and jump somewhere.
2. Use existing kernel functionality. Raise a signal, modify the
state, and return from the signal. This is very flexible and may not
be all that much slower than trampfd.
3. Use a syscall. Instead of having the kernel handle page faults,
have the trampoline code push the syscall nr register, load a special
new syscall nr into the syscall nr register, and do a syscall. On
x86_64, this would be:
pushq %rax
movq $__NR_magic_trampoline, %rax
syscall
with some adjustment if the stack slot you're clobbering is important.
Also, will using trampfd cause issues with various unwinders? I can
easily imagine unwinders expecting code to be readable, although this
is slowly going away for other reasons.
All this being said, I think that the kernel should absolutely add a
sensible interface for JITs to use to materialize their code. This
would integrate sanely with LSMs and wouldn't require hacks like using
files, etc. A cleverly designed JIT interface could function without
serialization IPIs, and even lame architectures like x86 could
potentially avoid shootdown IPIs if the interface copied code instead
of playing virtual memory games. At its very simplest, this could be:
void *jit_create_code(const void *source, size_t len);
and the result would be a new anonymous mapping that contains exactly
the code requested. There could also be:
int jittfd_create(...);
that does something similar but creates a memfd. A nicer
implementation for short JIT sequences would allow appending more code
to an existing JIT region. On x86, an appendable JIT region would
start filled with 0xCC, and I bet there's a way to materialize new
code into a previously 0xcc-filled virtual page without any
synchronization. One approach would be to start with:
<some code>
0xcc
0xcc
...
0xcc
and to create a whole new page like:
<some code>
<some more code>
0xcc
...
0xcc
so that the only difference is that some code changed to some more
code. Then replace the PTE to swap from the old page to the new page,
and arrange to avoid freeing the old page until we're sure it's gone
from all TLBs. This may not work if <some more code> spans a page
boundary. The #BP fixup would zap the TLB and retry. Even just
directly copying code over some 0xcc bytes almost works, but there's a
nasty corner case involving instructions that cross I$ fetch
boundaries. I'm not sure to what extent I$ snooping helps.
--Andy
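
The page-replacement idea for an appendable region can be modeled in plain userspace C. This is a sketch of the invariant only (the new page differs from the old one solely where the old page held 0xcc); it does not map or execute anything, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096
#define DEMO_INT3 0xCC	/* x86 breakpoint opcode used as filler */

struct demo_jit_page {
	unsigned char bytes[DEMO_PAGE_SIZE];
	size_t used;	/* bytes of real code at the start of the page */
};

/* A fresh region starts out entirely filled with 0xcc. */
static void demo_jit_init(struct demo_jit_page *pg)
{
	memset(pg->bytes, DEMO_INT3, sizeof(pg->bytes));
	pg->used = 0;
}

/*
 * Build "next" as a copy of "cur" with "code" appended. Returns -1 if
 * the code would cross the page boundary, the case the text flags as
 * possibly not working.
 */
static int demo_jit_append(const struct demo_jit_page *cur,
			   struct demo_jit_page *next,
			   const unsigned char *code, size_t len)
{
	if (len > DEMO_PAGE_SIZE - cur->used)
		return -1;
	*next = *cur;
	memcpy(next->bytes + cur->used, code, len);
	next->used = cur->used + len;
	return 0;
}

/* Check the invariant: bytes may only change where "old" held 0xcc. */
static int demo_only_filler_changed(const struct demo_jit_page *old,
				    const struct demo_jit_page *next)
{
	for (size_t i = 0; i < DEMO_PAGE_SIZE; i++) {
		if (old->bytes[i] != next->bytes[i] &&
		    old->bytes[i] != DEMO_INT3)
			return 0;
	}
	return 1;
}
```

In the scheme sketched in the text, the kernel would then swap the PTE from the old page to the new one and keep the old page alive until it has left every TLB; that part has no userspace equivalent here.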
* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Andy Lutomirski @ 2020-07-28 17:16 UTC
To: Madhavan T. Venkataraman
Cc: David Laight, kernel-hardening@lists.openwall.com,
linux-api@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-security-module@vger.kernel.org, oleg@redhat.com,
x86@kernel.org
In-Reply-To: <f5cfd11b-04fe-9db7-9d67-7ee898636edb@linux.microsoft.com>
On Tue, Jul 28, 2020 at 9:32 AM Madhavan T. Venkataraman
<madvenka@linux.microsoft.com> wrote:
>
> Thanks. See inline..
>
> On 7/28/20 10:13 AM, David Laight wrote:
> > From: madvenka@linux.microsoft.com
> >> Sent: 28 July 2020 14:11
> > ...
> >> The kernel creates the trampoline mapping without any permissions. When
> >> the trampoline is executed by user code, a page fault happens and the
> >> kernel gets control. The kernel recognizes that this is a trampoline
> >> invocation. It sets up the user registers based on the specified
> >> register context, and/or pushes values on the user stack based on the
> >> specified stack context, and sets the user PC to the requested target
> >> PC. When the kernel returns, execution continues at the target PC.
> >> So, the kernel does the work of the trampoline on behalf of the
> >> application.
> > Isn't the performance of this going to be horrid?
>
> It takes about the same amount of time as getpid(). So, it is
> one quick trip into the kernel. I expect that applications will
> typically not care about this extra overhead as long as
> they are able to run.
What did you test this on? A page fault on any modern x86_64 system
is much, much, much, much slower than a syscall.
--Andy
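One rough way to compare the two costs discussed above is a small userspace microbenchmark. The sketch below is only an illustration and is not the test either author ran. It times a raw getpid system call against the first touch of a fresh anonymous page, which takes an ordinary demand-paging fault rather than the trampfd fault path. The helper names (now_ns, avg_syscall_ns, avg_first_touch_ns) are made up for this example, and Linux with glibc is assumed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() and MAP_ANONYMOUS */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average cost of one raw getpid system call, in nanoseconds. */
double avg_syscall_ns(long iterations)
{
	double start = now_ns();

	for (long i = 0; i < iterations; i++)
		syscall(SYS_getpid);
	return (now_ns() - start) / (double)iterations;
}

/*
 * Average cost of the first write to a fresh anonymous page, in
 * nanoseconds. Each first write takes a demand-paging fault.
 * Returns -1.0 if the mapping cannot be created.
 */
double avg_first_touch_ns(size_t pages)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = pages * page_size;
	char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	double start, elapsed;

	if (mem == MAP_FAILED)
		return -1.0;
	start = now_ns();
	for (size_t i = 0; i < pages; i++)
		((volatile char *)mem)[i * page_size] = 1;
	elapsed = now_ns() - start;
	munmap(mem, len);
	return elapsed / (double)pages;
}
```

Printing avg_syscall_ns(1000000) next to avg_first_touch_ns(4096) gives a per-machine comparison. The results depend on the CPU, kernel version and enabled mitigations, so no figures are quoted here.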
* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Madhavan T. Venkataraman @ 2020-07-28 17:08 UTC (permalink / raw)
To: James Morris, Casey Schaufler
Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
linux-integrity, linux-kernel, linux-security-module, oleg, x86
In-Reply-To: <alpine.LRH.2.21.2007290300400.31310@namei.org>
On 7/28/20 12:05 PM, James Morris wrote:
> On Tue, 28 Jul 2020, Casey Schaufler wrote:
>
>> You could make a separate LSM to do these checks instead of limiting
>> it to SELinux. Your use case, your call, of course.
> It's not limited to SELinux. This is hooked via the LSM API and
> implementable by any LSM (similar to execmem, execstack etc.)
Yes. I have an implementation that I am testing right now that
defines the hook for exectramp and implements it for
SELinux. That is why I mentioned SELinux.
Madhavan
* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: James Morris @ 2020-07-28 17:05 UTC (permalink / raw)
To: Casey Schaufler
Cc: madvenka, kernel-hardening, linux-api, linux-arm-kernel,
linux-fsdevel, linux-integrity, linux-kernel,
linux-security-module, oleg, x86
In-Reply-To: <3fd22f92-7f45-1b0f-e4fe-857f3bceedd0@schaufler-ca.com>
On Tue, 28 Jul 2020, Casey Schaufler wrote:
> You could make a separate LSM to do these checks instead of limiting
> it to SELinux. Your use case, your call, of course.
It's not limited to SELinux. This is hooked via the LSM API and
implementable by any LSM (similar to execmem, execstack etc.)
--
James Morris
<jmorris@namei.org>
* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Madhavan T. Venkataraman @ 2020-07-28 16:49 UTC (permalink / raw)
To: Casey Schaufler, kernel-hardening, linux-api, linux-arm-kernel,
linux-fsdevel, linux-integrity, linux-kernel,
linux-security-module, oleg, x86
In-Reply-To: <3fd22f92-7f45-1b0f-e4fe-857f3bceedd0@schaufler-ca.com>
Thanks.
On 7/28/20 11:05 AM, Casey Schaufler wrote:
>> In this solution, the kernel recognizes certain sequences of instructions
>> as "well-known" trampolines. When such a trampoline is executed, a page
>> fault happens because the trampoline page does not have execute permission.
>> The kernel recognizes the trampoline and emulates it. Basically, the
>> kernel does the work of the trampoline on behalf of the application.
> What prevents a malicious process from using the "well-known" trampoline
> for its own purposes? I expect it is obvious, but I'm not seeing it. Old
> eyes, I suppose.
You are quite right. As I note below, the attack surface is the
buffer that contains the trampoline code. Since the kernel does
check the instruction sequence, the sequence cannot be
changed by a hacker. But the hacker can presumably change
the register values and redirect the PC to his desired location.
The assumption with trampoline emulation is that the
system will have security settings that will prevent pages from
having both write and execute permissions. So, a hacker
cannot load his own code in a page and redirect the PC to
it and execute his own code. But he can probably set the
PC to point to arbitrary locations. For instance, jump to
the middle of a C library function.
>
>> Here, the attack surface is the buffer that contains the trampoline.
>> The attack surface is narrower than before. A hacker may still be able to
>> modify what gets loaded in the registers or modify the target PC to point
>> to arbitrary locations.
...
>> Work that is pending
>> --------------------
>>
>> - I am working on implementing an SELinux setting called "exectramp"
>> similar to "execmem" to allow the use of trampfd on a per application
>> basis.
> You could make a separate LSM to do these checks instead of limiting
> it to SELinux. Your use case, your call, of course.
OK. I will research this.
Madhavan
* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Madhavan T. Venkataraman @ 2020-07-28 16:32 UTC (permalink / raw)
To: David Laight, kernel-hardening@lists.openwall.com,
linux-api@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-security-module@vger.kernel.org, oleg@redhat.com,
x86@kernel.org
In-Reply-To: <c23de6ec47614f489943e1a89a21dfa3@AcuMS.aculab.com>
Thanks. See inline..
On 7/28/20 10:13 AM, David Laight wrote:
> From: madvenka@linux.microsoft.com
>> Sent: 28 July 2020 14:11
> ...
>> The kernel creates the trampoline mapping without any permissions. When
>> the trampoline is executed by user code, a page fault happens and the
>> kernel gets control. The kernel recognizes that this is a trampoline
>> invocation. It sets up the user registers based on the specified
>> register context, and/or pushes values on the user stack based on the
>> specified stack context, and sets the user PC to the requested target
>> PC. When the kernel returns, execution continues at the target PC.
>> So, the kernel does the work of the trampoline on behalf of the
>> application.
> Isn't the performance of this going to be horrid?
It takes about the same amount of time as getpid(). So, it is
one quick trip into the kernel. I expect that applications will
typically not care about this extra overhead as long as
they are able to run.
But I agree that if there is an application that cannot tolerate
this extra overhead, then it is an issue. See below for further
discussion.
In the libffi changes I have included in the cover letter, I have
done it in such a way that trampfd is chosen when current
security settings don't allow other methods such as
loading trampoline code into a file and mapping it. In this
case, the application can at least run with trampfd.
>
> If you don't care that much about performance the fixup can
> all be done in userspace within the fault signal handler.
I do care about performance.
This is a framework to address trampolines. In this initial
work, I want to establish one basic way for things to work.
In the future, trampfd can be enhanced for performance.
For instance, it is easy for an architecture to generate
the exact instructions required to load specified registers,
push specified values on the stack and jump to a target
PC. The kernel can map a page with the generated code
with execute permissions. In this case, the performance
issue goes away.
> Since whatever you do needs the application changed why
> not change the implementation of nested functions to not
> need on-stack executable trampolines.
I kinda agree with your suggestion.
But it is up to the GCC folks to change its implementation.
I am trying to provide a way for their existing implementation
to work in a more secure way.
> I can think of other alternatives that don't need much more
> than an array of 'push constant; jump trampoline' instructions
> be created (all jump to the same place).
>
> You might want something to create an executable page of such
> instructions.
Agreed. And that can be done within this framework as
I have mentioned above.
But it is not just this trampoline type that I have implemented
in this patchset. In the future, other types can be implemented
and other contexts can be defined. Basically, the approach is
for the user to supply a recipe to the kernel and leave it up to
the kernel to do it in the best way possible. I am hoping that
other forms of dynamic code can be addressed in the future
using the same framework.
*Purely as a hypothetical example*, a user can supply
instructions in a language such as BPF that the kernel
understands and have the kernel arrange for that to
be executed in user context.
Madhavan
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
* Re: [PATCH V3fix ghak120] audit: initialize context values in case of mandatory events
From: Richard Guy Briggs @ 2020-07-28 16:27 UTC (permalink / raw)
To: Paul Moore
Cc: Eric Paris, Linux Security Module list, Linux-Audit Mailing List,
LKML
In-Reply-To: <CAHC9VhSx23JiN6GprskxdEcs9uftJOp03Svh7YJbQLOV91AMiQ@mail.gmail.com>
On 2020-07-27 22:14, Paul Moore wrote:
> On Mon, Jul 27, 2020 at 5:30 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > Issue ghak120 enabled syscall records to accompany required records when
> > no rules are present to trigger the storage of syscall context. A
> > reported issue showed that the cwd was not always initialized. That
> > issue was already resolved ...
>
> Yes and no. Yes, it appears to be resolved in v5.8-rc1 and above, but
> the problematic commit is in v5.7 and I'm not sure backporting the fix
> in v5.8-rcX plus this patch is the right thing to do for a released
> kernel. The lowest risk fix for v5.7 at this point is to do a revert;
Ok, fair enough. I don't understand why you didn't do the revert since
it appears so trivial to you and this review and fix turned out to be
marginally more work. I didn't understand what you wanted when you
referred to stable.
> regardless of what happens with this patch and v5.8-rcX please post a
> revert for the audit/stable-5.7 tree as soon as you can.
(more below...)
> > ... but a review of all other records that could
> > be triggered at the time of a syscall record revealed other potential
> > values that could be missing or misleading. Initialize them.
> >
> > The fds array is reset to -1 after the first syscall to indicate it
> > isn't valid any more, but was never set to -1 when the context was
> > allocated to indicate it wasn't yet valid.
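The fds initialization issue described above can be modeled in userspace. This is a minimal sketch, not the kernel code: struct audit_ctx_model, ctx_alloc() and ctx_has_fd_pair() are hypothetical names, calloc() stands in for kzalloc(), and the "fds[0] >= 0 means a valid pair" test is an assumption based on -1 being the invalid marker.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the part of the audit context that holds fds. */
struct audit_ctx_model {
	int fds[2];
};

/*
 * Allocate a zero-filled context, as kzalloc() would. With init_sentinel
 * set, fds[0] is marked invalid (-1) the way the patch does at allocation.
 */
struct audit_ctx_model *ctx_alloc(bool init_sentinel)
{
	struct audit_ctx_model *ctx = calloc(1, sizeof(*ctx));

	if (ctx && init_sentinel)
		ctx->fds[0] = -1;
	return ctx;
}

/* Assumed validity test: a non-negative fds[0] means a pair was recorded. */
bool ctx_has_fd_pair(const struct audit_ctx_model *ctx)
{
	return ctx->fds[0] >= 0;
}
```

Without the sentinel, the zero-filled context reports a recorded pair of fd 0 and fd 0, because 0 is a valid descriptor number. With it, a fresh context correctly reports that nothing has been recorded.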
> >
> > The audit_inode* functions can be called without going through
> > getname_flags() or getname_kernel() that sets audit_names and cwd, so
> > set the cwd if it has not already been done so due to audit_names being
> > valid.
> >
> > The LSM dump_common_audit_data() LSM_AUDIT_DATA_NET:AF_UNIX case was
> > missed with the ghak96 patch, so add that case here.
> >
> > Please see issue https://github.com/linux-audit/audit-kernel/issues/120
> > Please see issue https://github.com/linux-audit/audit-kernel/issues/96
> > Passes audit-testsuite.
> >
> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> > ---
> > kernel/auditsc.c | 3 +++
> > security/lsm_audit.c | 1 +
> > 2 files changed, 4 insertions(+)
> >
> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index 6884b50069d1..2f97618e6a34 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -929,6 +929,7 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state)
> > context->prio = state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0;
> > INIT_LIST_HEAD(&context->killed_trees);
> > INIT_LIST_HEAD(&context->names_list);
> > + context->fds[0] = -1;
> > return context;
> > }
> >
> > @@ -2076,6 +2077,7 @@ void __audit_inode(struct filename *name, const struct dentry *dentry,
> > }
> > handle_path(dentry);
> > audit_copy_inode(n, dentry, inode, flags & AUDIT_INODE_NOEVAL);
> > + _audit_getcwd(context);
> > }
> >
> > void __audit_file(const struct file *file)
> > @@ -2194,6 +2196,7 @@ void __audit_inode_child(struct inode *parent,
> > audit_copy_inode(found_child, dentry, inode, 0);
> > else
> > found_child->ino = AUDIT_INO_UNSET;
> > + _audit_getcwd(context);
> > }
> > EXPORT_SYMBOL_GPL(__audit_inode_child);
> >
> > diff --git a/security/lsm_audit.c b/security/lsm_audit.c
> > index 53d0d183db8f..e93077612246 100644
> > --- a/security/lsm_audit.c
> > +++ b/security/lsm_audit.c
> > @@ -369,6 +369,7 @@ static void dump_common_audit_data(struct audit_buffer *ab,
> > audit_log_untrustedstring(ab, p);
> > else
> > audit_log_n_hex(ab, p, len);
> > + audit_getcwd();
> > break;
> > }
> > }
>
> I understand the "fds[0] = -1" fix in audit_alloc_context()
> (ironically, the kzalloc() which is supposed to help with cases like
> this, hurts us with this particular field), but I'm still not quite
> seeing why we need to sprinkle audit_getcwd() calls everywhere to fix
> this bug (this seems more like a feature add than a bugfix). Yes,
> they may fix the problem but it seems like simply adding a
> context->pwd test in audit_log_name() similar to what we do in
> audit_log_exit() is the correct fix.
Well, considering that ghak96 ended up being a bugfix (that wasn't its
intent), I figured these audit_getcwd() were also bugfixes to prevent
the same BUG under different calling conditions.
> We are currently at -rc7 and this really needs to land before v5.8 is
> released, presumably this weekend; this means a small and limited bug
> fix patch is what is needed.
Ok, so it sounds like rather than just fix it now, it would be better to
revert it, then submit *one* patch for ghak120 plus this fix that will
go tentatively upstream in 3 months, fully in 5. Arguably the last
chunk above should be added to ghak96, so that should be reverted too,
then resubmitted with this added chunk (or it could be a fixup chunk
that would need to be sequenced with ghak120). As for the middle two
chunks, they could either be resubmitted with a resubmitted ghak96, or
with a resubmitted ghak120. As for the timing of all of these, ghak96
should be in place before the ghak120 patch, so even resubmitting one
patch for the combined ghak120 and ghak96 might make more sense.
I know you like only really minimal fixes this late, but this seemed
pretty minimal to me...
> paul moore
- RGB
--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
* Re: [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
From: Oleg Nesterov @ 2020-07-28 16:06 UTC (permalink / raw)
To: Madhavan T. Venkataraman
Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
linux-integrity, linux-kernel, linux-security-module, x86
In-Reply-To: <dc41589a-647a-ba59-5376-abbf5d07c6e7@linux.microsoft.com>
On 07/28, Madhavan T. Venkataraman wrote:
>
> I guess since the symbol is not used by any modules, I don't need to
> export it.
Yes,
Oleg.
* Re: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: Casey Schaufler @ 2020-07-28 16:05 UTC (permalink / raw)
To: madvenka, kernel-hardening, linux-api, linux-arm-kernel,
linux-fsdevel, linux-integrity, linux-kernel,
linux-security-module, oleg, x86
In-Reply-To: <20200728131050.24443-1-madvenka@linux.microsoft.com>
On 7/28/2020 6:10 AM, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>
> Introduction
> ------------
>
> Trampolines are used in many different user applications. Trampoline
> code is often generated at runtime. Trampoline code can also just be a
> pre-defined sequence of machine instructions in a data buffer.
>
> Trampoline code is placed either in a data page or in a stack page. In
> order to execute a trampoline, the page it resides in needs to be mapped
> with execute permissions. Writable pages with execute permissions provide
> an attack surface for hackers. Attackers can use this to inject malicious
> code, modify existing code or do other harm.
>
> To mitigate this, LSMs such as SELinux may not allow pages to have both
> write and execute permissions. This prevents trampolines from executing
> and blocks applications that use trampolines. To allow genuine applications
> to run, exceptions have to be made for them (by setting execmem, etc).
> In this case, the attack surface is just the pages of such applications.
>
> An application that is not allowed to have writable executable pages
> may try to load trampoline code into a file and map the file with execute
> permissions. In this case, the attack surface is just the buffer that
> contains trampoline code. However, a successful exploit may provide the
> hacker with means to load his own code in a file, map it and execute it.
>
> LSMs (such as the IPE proposal [1]) may allow only properly signed object
> files to be mapped with execute permissions. This will prevent trampoline
> files from being mapped. Again, exceptions have to be made for genuine
> applications.
>
> We need a way to execute trampolines without making security exceptions
> where possible and to reduce the attack surface even further.
>
> Examples of trampolines
> -----------------------
>
> libffi (A Portable Foreign Function Interface Library):
>
> libffi allows a user to define functions with an arbitrary list of
> arguments and return value through a feature called "Closures".
> Closures use trampolines to jump to ABI handlers that handle calling
> conventions and call a target function. libffi is used by a lot
> of different applications. To name a few:
>
> - Python
> - Java
> - Javascript
> - Ruby FFI
> - Lisp
> - Objective C
>
> GCC nested functions:
>
> GCC has traditionally used trampolines for implementing nested
> functions. The trampoline is placed on the user stack. So, the stack
> needs to be executable.
>
> Currently available solution
> ----------------------------
>
> One solution that has been proposed to allow trampolines to be executed
> without making security exceptions is Trampoline Emulation. See:
>
> https://pax.grsecurity.net/docs/emutramp.txt
>
> In this solution, the kernel recognizes certain sequences of instructions
> as "well-known" trampolines. When such a trampoline is executed, a page
> fault happens because the trampoline page does not have execute permission.
> The kernel recognizes the trampoline and emulates it. Basically, the
> kernel does the work of the trampoline on behalf of the application.
What prevents a malicious process from using the "well-known" trampoline
for its own purposes? I expect it is obvious, but I'm not seeing it. Old
eyes, I suppose.
> Here, the attack surface is the buffer that contains the trampoline.
> The attack surface is narrower than before. A hacker may still be able to
> modify what gets loaded in the registers or modify the target PC to point
> to arbitrary locations.
>
> Currently, the emulated trampolines are the ones used in libffi and GCC
> nested functions. To my knowledge, only X86 is supported at this time.
>
> As noted in emutramp.txt, this is not a generic solution. For every new
> trampoline that needs to be supported, new instruction sequences need to
> be recognized by the kernel and emulated. And this has to be done for
> every architecture that needs to be supported.
>
> emutramp.txt notes the following:
>
> "... the real solution is not in emulation but by designing a kernel API
> for runtime code generation and modifying userland to make use of it."
>
> Trampoline File Descriptor (trampfd)
> ------------------------------------
>
> I am proposing a kernel API using anonymous file descriptors that
> can be used to create and execute trampolines with the help of the
> kernel. In this solution also, the kernel does the work of the trampoline.
> The API is described in patch 1/4 of this patchset. I provide a
> summary here:
>
> Trampolines commonly execute the following sequence:
>
> - Load some values in some registers and/or
> - Push some values on the stack
> - Jump to a target PC
>
> libffi and GCC nested function trampolines fit into this model.
>
> Using the kernel API, applications and libraries can:
>
> - Create a trampoline object
> - Associate a register context with the trampoline (including
> a target PC)
> - Associate a stack context with the trampoline
> - Map the trampoline into a process address space
> - Execute the trampoline by executing at the trampoline address
>
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
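The sequence described above (load registers, push stack values, set the PC) can be modeled in plain userspace C. This is only a sketch under stated assumptions: the structures and the trampoline_invoke() name are hypothetical, the register file has four slots, and the "user stack" is a byte array indexed by an offset. It does not reflect the actual trampfd structures in the patchset.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NREGS 4 /* size of the simulated register file */

/* Simulated user register state; sp is a byte offset into the stack array. */
struct user_regs {
	uint64_t gpr[NREGS];
	uint64_t sp;
	uint64_t pc;
};

struct reg_value {
	unsigned int reg; /* index into gpr[] */
	uint64_t value;
};

/* Hypothetical trampoline object: register context, stack context, target. */
struct trampoline {
	const struct reg_value *regs;
	size_t nregs;
	const uint64_t *stack_values;
	size_t nstack;
	uint64_t target_pc;
};

/*
 * Do the work of the trampoline on the simulated state. Everything is
 * validated before anything is modified. Returns 0 on success, -1 on error.
 */
int trampoline_invoke(const struct trampoline *t, struct user_regs *regs,
		      uint8_t *stack)
{
	size_t i;

	if (t->nstack * sizeof(uint64_t) > regs->sp)
		return -1; /* not enough room left on the stack */
	for (i = 0; i < t->nregs; i++)
		if (t->regs[i].reg >= NREGS)
			return -1;

	for (i = 0; i < t->nregs; i++)
		regs->gpr[t->regs[i].reg] = t->regs[i].value;
	for (i = 0; i < t->nstack; i++) {
		regs->sp -= sizeof(uint64_t);
		memcpy(stack + regs->sp, &t->stack_values[i], sizeof(uint64_t));
	}
	regs->pc = t->target_pc; /* execution would resume here */
	return 0;
}
```

In this model the last value in stack_values ends up at the top of the stack, and a failed validation leaves the register state untouched.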
>
> In this case, the attack surface is the context buffer. A hacker may
> attack an application with a vulnerability and may be able to modify the
> context buffer. So, when the register or stack context is set for
> a trampoline, the values may have been tampered with. From an attack
> surface perspective, this is similar to Trampoline Emulation. But
> with trampfd, user code can retrieve a trampoline's context from the
> kernel and add defensive checks to see if the context has been
> tampered with.
>
> As for the target PC, trampfd implements a measure called the
> "Allowed PCs" context (see Advantages) to prevent a hacker from making
> the target PC point to arbitrary locations. So, the attack surface is
> narrower than Trampoline Emulation.
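The "Allowed PCs" idea above amounts to checking a requested target against a fixed list before accepting it. A minimal sketch follows, assuming a flat array of addresses. The struct allowed_pcs type and both function names are hypothetical and are not taken from the patchset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an "Allowed PCs" context: a fixed list of targets. */
struct allowed_pcs {
	const uintptr_t *pcs;
	size_t count;
};

/* True if pc matches one of the listed entry points exactly. */
bool pc_is_allowed(const struct allowed_pcs *ctx, uintptr_t pc)
{
	for (size_t i = 0; i < ctx->count; i++)
		if (ctx->pcs[i] == pc)
			return true;
	return false;
}

/*
 * Accept a new target PC only if it is on the list.
 * Returns 0 on success, -1 if the PC is rejected (target is left unchanged).
 */
int set_target_pc(const struct allowed_pcs *ctx, uintptr_t *target,
		  uintptr_t pc)
{
	if (!pc_is_allowed(ctx, pc))
		return -1;
	*target = pc;
	return 0;
}
```

Because the comparison is exact, an address in the middle of a listed handler is rejected, which is the "jump into the middle of a function" case the list is meant to rule out.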
>
> Advantages of the Trampoline File Descriptor approach
> -----------------------------------------------------
>
> - trampfd is customizable. The user can specify any combination of
> allowed register name-value pairs in the register context and the kernel
> will set it up accordingly. This allows different user trampolines to be
> converted to use trampfd.
>
> - trampfd allows a stack context to be set up so that trampolines that
> need to push values on the user stack can do that.
>
> - The initial work is targeted for X86 and ARM. But the implementation
> leverages small portions of existing signal delivery code. Specifically,
> it uses pt_regs for setting up user registers and copy_to_user()
> to push values on the stack. So, this can be very easily ported to other
> architectures.
>
> - trampfd provides a basic framework. In the future, new trampoline types
> can be implemented, new contexts can be defined, and additional rules
> can be implemented for security purposes.
>
> - For instance, trampfd defines an "Allowed PCs" context in this initial
> work. As an example, libffi can create a read-only array of all ABI
> handlers for an architecture at build time. This array can be used to
> set the list of allowed PCs for a trampoline. This will mean that a hacker
> cannot hack the PC part of the register context and make it point to
> arbitrary locations.
>
> - An SELinux setting called "exectramp" can be implemented along the
> lines of "execmem", "execstack" and "execheap" to selectively allow the
> use of trampolines on a per application basis.
>
> - User code can add defensive checks in the code before invoking a
> trampoline to make sure that a hacker has not modified the context data.
> It can do this by getting the trampoline context from the kernel and
> double checking it.
>
> - In the future, if the kernel can be enhanced to use a safe code
> generation component, that code can be placed in the trampoline mapping
> pages. Then, the trampoline invocation does not have to incur a trip
> into the kernel.
>
> - Also, if the kernel can be enhanced to use a safe code generation
> component, other forms of dynamic code such as JIT code can be
> addressed by the trampfd framework.
>
> - Trampolines can be shared across processes which can give rise to
> interesting uses in the future.
>
> - Trampfd can be used for other purposes to extend the kernel's
> functionality.
>
> libffi
> ------
>
> I have implemented my solution for libffi and provided the changes for
> X86 and ARM, 32-bit and 64-bit. Here is the reference patch:
>
> http://linux.microsoft.com/~madvenka/libffi/libffi.txt
>
> If the trampfd patchset gets accepted, I will send the libffi changes
> to the maintainers for a review. BTW, I have also successfully executed
> the libffi self tests.
>
> Work that is pending
> --------------------
>
> - I am working on implementing an SELinux setting called "exectramp"
> similar to "execmem" to allow the use of trampfd on a per application
> basis.
You could make a separate LSM to do these checks instead of limiting
it to SELinux. Your use case, your call, of course.
>
> - I have a comprehensive test program to test the kernel API. I am
> working on adding it to selftests.
>
> References
> ----------
>
> [1] https://microsoft.github.io/ipe/
> ---
> Madhavan T. Venkataraman (4):
> fs/trampfd: Implement the trampoline file descriptor API
> x86/trampfd: Support for the trampoline file descriptor
> arm64/trampfd: Support for the trampoline file descriptor
> arm/trampfd: Support for the trampoline file descriptor
>
> arch/arm/include/uapi/asm/ptrace.h | 20 ++
> arch/arm/kernel/Makefile | 1 +
> arch/arm/kernel/trampfd.c | 214 +++++++++++++++++
> arch/arm/mm/fault.c | 12 +-
> arch/arm/tools/syscall.tbl | 1 +
> arch/arm64/include/asm/ptrace.h | 9 +
> arch/arm64/include/asm/unistd.h | 2 +-
> arch/arm64/include/asm/unistd32.h | 2 +
> arch/arm64/include/uapi/asm/ptrace.h | 57 +++++
> arch/arm64/kernel/Makefile | 2 +
> arch/arm64/kernel/trampfd.c | 278 ++++++++++++++++++++++
> arch/arm64/mm/fault.c | 15 +-
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/x86/include/uapi/asm/ptrace.h | 38 +++
> arch/x86/kernel/Makefile | 2 +
> arch/x86/kernel/trampfd.c | 313 +++++++++++++++++++++++++
> arch/x86/mm/fault.c | 11 +
> fs/Makefile | 1 +
> fs/trampfd/Makefile | 6 +
> fs/trampfd/trampfd_data.c | 43 ++++
> fs/trampfd/trampfd_fops.c | 131 +++++++++++
> fs/trampfd/trampfd_map.c | 78 ++++++
> fs/trampfd/trampfd_pcs.c | 95 ++++++++
> fs/trampfd/trampfd_regs.c | 137 +++++++++++
> fs/trampfd/trampfd_stack.c | 131 +++++++++++
> fs/trampfd/trampfd_stubs.c | 41 ++++
> fs/trampfd/trampfd_syscall.c | 92 ++++++++
> include/linux/syscalls.h | 3 +
> include/linux/trampfd.h | 82 +++++++
> include/uapi/asm-generic/unistd.h | 4 +-
> include/uapi/linux/trampfd.h | 171 ++++++++++++++
> init/Kconfig | 8 +
> kernel/sys_ni.c | 3 +
> 34 files changed, 1998 insertions(+), 7 deletions(-)
> create mode 100644 arch/arm/kernel/trampfd.c
> create mode 100644 arch/arm64/kernel/trampfd.c
> create mode 100644 arch/x86/kernel/trampfd.c
> create mode 100644 fs/trampfd/Makefile
> create mode 100644 fs/trampfd/trampfd_data.c
> create mode 100644 fs/trampfd/trampfd_fops.c
> create mode 100644 fs/trampfd/trampfd_map.c
> create mode 100644 fs/trampfd/trampfd_pcs.c
> create mode 100644 fs/trampfd/trampfd_regs.c
> create mode 100644 fs/trampfd/trampfd_stack.c
> create mode 100644 fs/trampfd/trampfd_stubs.c
> create mode 100644 fs/trampfd/trampfd_syscall.c
> create mode 100644 include/linux/trampfd.h
> create mode 100644 include/uapi/linux/trampfd.h
>
* RE: [PATCH v1 0/4] [RFC] Implement Trampoline File Descriptor
From: David Laight @ 2020-07-28 15:13 UTC (permalink / raw)
To: 'madvenka@linux.microsoft.com',
kernel-hardening@lists.openwall.com, linux-api@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linux-fsdevel@vger.kernel.org, linux-integrity@vger.kernel.org,
linux-kernel@vger.kernel.org,
linux-security-module@vger.kernel.org, oleg@redhat.com,
x86@kernel.org
In-Reply-To: <20200728131050.24443-1-madvenka@linux.microsoft.com>
From: madvenka@linux.microsoft.com
> Sent: 28 July 2020 14:11
...
> The kernel creates the trampoline mapping without any permissions. When
> the trampoline is executed by user code, a page fault happens and the
> kernel gets control. The kernel recognizes that this is a trampoline
> invocation. It sets up the user registers based on the specified
> register context, and/or pushes values on the user stack based on the
> specified stack context, and sets the user PC to the requested target
> PC. When the kernel returns, execution continues at the target PC.
> So, the kernel does the work of the trampoline on behalf of the
> application.
Isn't the performance of this going to be horrid?
If you don't care that much about performance the fixup can
all be done in userspace within the fault signal handler.
Since whatever you do needs the application changed why
not change the implementation of nested functions to not
need on-stack executable trampolines.
I can think of other alternatives that don't need much more
than an array of 'push constant; jump trampoline' instructions
be created (all jump to the same place).
You might want something to create an executable page of such
instructions.
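As an illustration of the 'push constant; jump trampoline' array, the sketch below fills a buffer with x86-64 stubs, each made of a "push imm32" (opcode 0x68) followed by a "jmp rel32" (opcode 0xE9) to one shared location in the same buffer. The emit_stubs() name and its interface are assumptions for this example. The code only generates bytes: it does not map them executable or run them, and it assumes all offsets fit in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>

#define STUB_SIZE 10 /* push imm32 (5 bytes) + jmp rel32 (5 bytes) */

/* Store a 32-bit value in little-endian byte order, as x86 expects. */
static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/*
 * Fill buf with n stubs. Stub i pushes constants[i] and then jumps to the
 * byte at offset common_off from the start of buf.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t emit_stubs(uint8_t *buf, size_t buf_len, const int32_t *constants,
		  size_t n, size_t common_off)
{
	if (n > buf_len / STUB_SIZE)
		return 0;
	for (size_t i = 0; i < n; i++) {
		uint8_t *p = buf + i * STUB_SIZE;
		/* rel32 is measured from the end of the jmp instruction. */
		int64_t rel = (int64_t)common_off -
			      (int64_t)((i + 1) * STUB_SIZE);

		p[0] = 0x68; /* push imm32 */
		put_le32(p + 1, (uint32_t)constants[i]);
		p[5] = 0xE9; /* jmp rel32 */
		put_le32(p + 6, (uint32_t)rel);
	}
	return n * STUB_SIZE;
}
```

Each stub differs only in its pushed constant and its relative displacement, so the shared code at common_off can recover which stub was entered from the value on the stack.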
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* Re: [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
From: Madhavan T. Venkataraman @ 2020-07-28 14:58 UTC (permalink / raw)
To: Oleg Nesterov
Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
linux-integrity, linux-kernel, linux-security-module, x86
In-Reply-To: <20200728145013.GA9972@redhat.com>
Thanks. See inline..
On 7/28/20 9:50 AM, Oleg Nesterov wrote:
> On 07/28, madvenka@linux.microsoft.com wrote:
>> +bool is_trampfd_vma(struct vm_area_struct *vma)
>> +{
>> + struct file *file = vma->vm_file;
>> +
>> + if (!file)
>> + return false;
>> + return !strcmp(file->f_path.dentry->d_name.name, trampfd_name);
> Hmm, this looks obviously wrong or I am totally confused. A user can
> create a file named "[trampfd]", mmap it, and fool trampfd_fault() ?
>
> Why not
>
> return file->f_op == trampfd_fops;
This is definitely the correct check. I will fix it.
>
> ?
>
>> +EXPORT_SYMBOL_GPL(is_trampfd_vma);
> why is it exported?
This is in common code and is called by arch code. Should I not export it?
I guess since the symbol is not used by any modules, I don't need to
export it. Please confirm and I will fix this.
Madhavan
* Re: [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
From: Oleg Nesterov @ 2020-07-28 14:50 UTC (permalink / raw)
To: madvenka
Cc: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
linux-integrity, linux-kernel, linux-security-module, x86
In-Reply-To: <20200728131050.24443-2-madvenka@linux.microsoft.com>
On 07/28, madvenka@linux.microsoft.com wrote:
>
> +bool is_trampfd_vma(struct vm_area_struct *vma)
> +{
> + struct file *file = vma->vm_file;
> +
> + if (!file)
> + return false;
> + return !strcmp(file->f_path.dentry->d_name.name, trampfd_name);
Hmm, this looks obviously wrong or I am totally confused. A user can
create a file named "[trampfd]", mmap it, and fool trampfd_fault() ?
Why not
return file->f_op == trampfd_fops;
?
> +EXPORT_SYMBOL_GPL(is_trampfd_vma);
why is it exported?
Oleg.
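The difference between the two checks can be shown with a small userspace model. This is a sketch only: struct file and struct file_operations here are simplified stand-ins rather than the kernel definitions, the name field stands in for f_path.dentry->d_name.name, and trampfd_fops is assumed to be an object whose address identifies trampfd files.

```c
#include <stdbool.h>
#include <string.h>

/* Simplified userspace stand-ins for the kernel structures. */
struct file_operations {
	int unused;
};

struct file {
	const char *name; /* user-controllable file name */
	const struct file_operations *f_op;
};

const struct file_operations trampfd_fops = { 0 };
const struct file_operations other_fops = { 0 };
static const char trampfd_name[] = "[trampfd]";

/* Name-based check: anyone who can name a file "[trampfd]" passes. */
bool is_trampfd_by_name(const struct file *file)
{
	return file && strcmp(file->name, trampfd_name) == 0;
}

/* Identity-based check: only files created with trampfd_fops pass. */
bool is_trampfd_by_ops(const struct file *file)
{
	return file && file->f_op == &trampfd_fops;
}
```

A file that merely carries the name "[trampfd]" but uses other_fops passes the first check and fails the second, which is the spoofing case raised above.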
* Re: [PATCH 1/2] ima: Pre-parse the list of keyrings in a KEY_CHECK rule
From: Lakshmi Ramasubramanian @ 2020-07-28 14:25 UTC (permalink / raw)
To: Tyler Hicks, Mimi Zohar, Dmitry Kasatkin
Cc: James Morris, Serge E . Hallyn, Tushar Sugandhi, Nayna Jain,
linux-kernel, linux-integrity, linux-security-module
In-Reply-To: <20200727140831.64251-2-tyhicks@linux.microsoft.com>
On 7/27/20 7:08 AM, Tyler Hicks wrote:
> The ima_keyrings buffer was used as a work buffer for strsep()-based
> parsing of the "keyrings=" option of an IMA policy rule. This parsing
> was re-performed each time an asymmetric key was added to a kernel
> keyring for each loaded policy rule that contained a "keyrings=" option.
>
> An example rule specifying this option is:
>
> measure func=KEY_CHECK keyrings=a|b|c
>
> The rule says to measure asymmetric keys added to any of the kernel
> keyrings named "a", "b", or "c". The size of the buffer was
> equal to the size of the largest "keyrings=" value seen in a previously
> loaded rule (5 + 1 for the NUL-terminator in the previous example) and
> the buffer was pre-allocated at the time of policy load.
>
> The pre-allocated buffer approach suffered from a couple of bugs:
>
> 1) There was no locking around the use of the buffer so concurrent key
> add operations, to two different keyrings, would cause the
> strsep() loop of ima_match_keyring() to modify the buffer at the same
> time. This resulted in unexpected results from ima_match_keyring()
> and, therefore, could cause unintended keys to be measured or keys to
> not be measured when IMA policy intended for them to be measured.
>
> 2) If the kstrdup() that initialized entry->keyrings in ima_parse_rule()
> failed, the ima_keyrings buffer was freed and set to NULL even when a
> valid KEY_CHECK rule was previously loaded. The next KEY_CHECK event
> would trigger a call to strcpy() with a NULL destination pointer and
> crash the kernel.
>
> Remove the need for a pre-allocated global buffer by parsing the list of
> keyrings in a KEY_CHECK rule at the time of policy load. The
> ima_rule_entry will contain an array of string pointers which point to
> the name of each keyring specified in the rule. No string processing
> needs to happen at the time of asymmetric key add so iterating through
> the list and doing a string comparison is all that's required at the
> time of policy check.
>
> In the process of changing how the "keyrings=" policy option is handled,
> a couple additional bugs were fixed:
>
> 1) The rule parser accepted rules containing invalid "keyrings=" values
> such as "a|b||c", "a|b|", or simply "|".
>
> 2) The /sys/kernel/security/ima/policy file did not display the entire
> "keyrings=" value if the list of keyrings was longer than what could
> fit in the fixed size tbuf buffer in ima_policy_show().
>
> Fixes: 5c7bac9fb2c5 ("IMA: pre-allocate buffer to hold keyrings string")
> Fixes: 2b60c0ecedf8 ("IMA: Read keyrings= option from the IMA policy")
> Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
> ---
> security/integrity/ima/ima_policy.c | 138 +++++++++++++++++++---------
> 1 file changed, 93 insertions(+), 45 deletions(-)
Reviewed-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
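For readers following the thread, the parse-at-policy-load idea from the quoted commit message can be sketched in userspace C. This is an illustration only, not the code in ima_policy.c: struct keyring_list, parse_keyrings(), match_keyring() and free_keyrings() are hypothetical names, and malloc() stands in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for a pre-parsed "keyrings=" value. */
struct keyring_list {
	char *buf;	/* private copy of the value, '|' replaced by NUL */
	char **names;	/* pointers into buf, one per keyring name */
	size_t count;
};

static void free_keyrings(struct keyring_list *kl)
{
	free(kl->names);
	free(kl->buf);
	memset(kl, 0, sizeof(*kl));
}

/*
 * Split a value such as "a|b|c" once, at rule-load time.
 * Returns the number of names, or -1 if any name is empty
 * ("a|b||c", "a|b|", "|") or an allocation fails.
 */
static int parse_keyrings(const char *value, struct keyring_list *kl)
{
	size_t i, n = 1, len;
	char *p;

	assert(value && kl);
	memset(kl, 0, sizeof(*kl));

	len = strlen(value);
	for (i = 0; i < len; i++)
		if (value[i] == '|')
			n++;

	kl->buf = malloc(len + 1);
	kl->names = calloc(n, sizeof(*kl->names));
	if (!kl->buf || !kl->names) {
		free_keyrings(kl);
		return -1;
	}
	memcpy(kl->buf, value, len + 1);

	p = kl->buf;
	for (i = 0; i < n; i++) {
		char *sep = strchr(p, '|');

		if (sep)
			*sep = '\0';
		if (*p == '\0') {	/* empty name: reject the whole value */
			free_keyrings(kl);
			return -1;
		}
		kl->names[i] = p;
		p = sep ? sep + 1 : p + strlen(p);
	}
	kl->count = n;
	return (int)n;
}

/* At key-add time only read-only string comparisons remain. */
static int match_keyring(const struct keyring_list *kl, const char *name)
{
	size_t i;

	for (i = 0; i < kl->count; i++)
		if (!strcmp(kl->names[i], name))
			return 1;
	return 0;
}
```

Because the match path never writes to shared state, concurrent key additions cannot interfere with each other in this model, which is the property the commit message is after.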
* Re: [PATCH 2/2] ima: Fail rule parsing when asymmetric key measurement isn't supportable
From: Lakshmi Ramasubramanian @ 2020-07-28 14:14 UTC (permalink / raw)
To: Tyler Hicks, Mimi Zohar, Dmitry Kasatkin
Cc: James Morris, Serge E . Hallyn, Tushar Sugandhi, Nayna Jain,
linux-kernel, linux-integrity, linux-security-module
In-Reply-To: <20200727140831.64251-3-tyhicks@linux.microsoft.com>
On 7/27/20 7:08 AM, Tyler Hicks wrote:
> Measuring keys is currently only supported for asymmetric keys. In the
> future, this might change.
>
> For now, the "func=KEY_CHECK" and "keyrings=" options are only
> appropriate when CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS is enabled. Make
> this clear at policy load so that IMA policy authors don't assume that
> these policy language constructs are supported.
>
> Fixes: 2b60c0ecedf8 ("IMA: Read keyrings= option from the IMA policy")
> Fixes: 5808611cccb2 ("IMA: Add KEY_CHECK func to measure keys")
> Suggested-by: Nayna Jain <nayna@linux.ibm.com>
> Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
> ---
> security/integrity/ima/ima_policy.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/security/integrity/ima/ima_policy.c b/security/integrity/ima/ima_policy.c
> index c328cfa0fc49..05f012fd3dca 100644
> --- a/security/integrity/ima/ima_policy.c
> +++ b/security/integrity/ima/ima_policy.c
> @@ -1233,7 +1233,8 @@ static int ima_parse_rule(char *rule, struct ima_rule_entry *entry)
> entry->func = POLICY_CHECK;
> else if (strcmp(args[0].from, "KEXEC_CMDLINE") == 0)
> entry->func = KEXEC_CMDLINE;
> - else if (strcmp(args[0].from, "KEY_CHECK") == 0)
> + else if (IS_ENABLED(CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS) &&
> + strcmp(args[0].from, "KEY_CHECK") == 0)
> entry->func = KEY_CHECK;
> else
> result = -EINVAL;
> @@ -1290,7 +1291,8 @@ static int ima_parse_rule(char *rule, struct ima_rule_entry *entry)
> case Opt_keyrings:
> ima_log_string(ab, "keyrings", args[0].from);
>
> - if (entry->keyrings) {
> + if (!IS_ENABLED(CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS) ||
> + entry->keyrings) {
> result = -EINVAL;
> break;
> }
>
Reviewed-by: Lakshmi Ramasubramanian <nramas@linux.microsoft.com>
* [PATCH v1 3/4] [RFC] arm64/trampfd: Provide support for the trampoline file descriptor
From: madvenka @ 2020-07-28 13:10 UTC (permalink / raw)
To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
linux-integrity, linux-kernel, linux-security-module, oleg, x86,
madvenka
In-Reply-To: <20200728131050.24443-1-madvenka@linux.microsoft.com>
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
Implement 64-bit ARM support for the trampoline file descriptor.
- Define architecture specific register names
- Handle the trampoline invocation page fault
- Set up the user register context on trampoline invocation
- Set up the user stack context on trampoline invocation
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
arch/arm64/include/asm/ptrace.h | 9 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/arm64/include/uapi/asm/ptrace.h | 57 ++++++
arch/arm64/kernel/Makefile | 2 +
arch/arm64/kernel/trampfd.c | 278 +++++++++++++++++++++++++++
arch/arm64/mm/fault.c | 15 +-
7 files changed, 361 insertions(+), 4 deletions(-)
create mode 100644 arch/arm64/kernel/trampfd.c
diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index 953b6a1ce549..dad6cdbd59c6 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -232,6 +232,15 @@ static inline unsigned long user_stack_pointer(struct pt_regs *regs)
return regs->sp;
}
+static inline void user_stack_pointer_set(struct pt_regs *regs,
+ unsigned long val)
+{
+ if (compat_user_mode(regs))
+ regs->compat_sp = val;
+ else
+ regs->sp = val;
+}
+
extern int regs_query_register_offset(const char *name);
extern unsigned long regs_get_kernel_stack_nth(struct pt_regs *regs,
unsigned int n);
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 3b859596840d..b3b2019f8d16 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 440
+#define __NR_compat_syscalls 441
#endif
#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 6d95d0c8bf2f..821ddcaf9683 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -885,6 +885,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_trampfd_create 440
+__SYSCALL(__NR_trampfd_create, sys_trampfd_create)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/arm64/include/uapi/asm/ptrace.h b/arch/arm64/include/uapi/asm/ptrace.h
index 42cbe34d95ce..f4d1974dd795 100644
--- a/arch/arm64/include/uapi/asm/ptrace.h
+++ b/arch/arm64/include/uapi/asm/ptrace.h
@@ -88,6 +88,63 @@ struct user_pt_regs {
__u64 pstate;
};
+/*
+ * These register names are to be used by 32-bit applications.
+ */
+enum reg_32_name {
+ arm_r0,
+ arm_r1,
+ arm_r2,
+ arm_r3,
+ arm_r4,
+ arm_r5,
+ arm_r6,
+ arm_r7,
+ arm_r8,
+ arm_r9,
+ arm_r10,
+ arm_ip,
+ arm_pc,
+ arm_max,
+};
+
+/*
+ * These register names are to be used by 64-bit applications.
+ */
+enum reg_64_name {
+ arm64_r0 = arm_max,
+ arm64_r1,
+ arm64_r2,
+ arm64_r3,
+ arm64_r4,
+ arm64_r5,
+ arm64_r6,
+ arm64_r7,
+ arm64_r8,
+ arm64_r9,
+ arm64_r10,
+ arm64_r11,
+ arm64_r12,
+ arm64_r13,
+ arm64_r14,
+ arm64_r15,
+ arm64_r16,
+ arm64_r17,
+ arm64_r18,
+ arm64_r19,
+ arm64_r20,
+ arm64_r21,
+ arm64_r22,
+ arm64_r23,
+ arm64_r24,
+ arm64_r25,
+ arm64_r26,
+ arm64_r27,
+ arm64_r28,
+ arm64_pc,
+ arm64_max,
+};
+
struct user_fpsimd_state {
__uint128_t vregs[32];
__u32 fpsr;
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index a561cbb91d4d..18d373fb1208 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -71,3 +71,5 @@ extra-y += $(head-y) vmlinux.lds
ifeq ($(CONFIG_DEBUG_EFI),y)
AFLAGS_head.o += -DVMLINUX_PATH="\"$(realpath $(objtree)/vmlinux)\""
endif
+
+obj-$(CONFIG_TRAMPFD) += trampfd.o
diff --git a/arch/arm64/kernel/trampfd.c b/arch/arm64/kernel/trampfd.c
new file mode 100644
index 000000000000..d79e749e0c30
--- /dev/null
+++ b/arch/arm64/kernel/trampfd.c
@@ -0,0 +1,278 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - ARM64 support.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+
+#include <linux/trampfd.h>
+#include <linux/mm_types.h>
+#include <linux/uaccess.h>
+
+/* ---------------------------- Register Context ---------------------------- */
+
+static inline bool is_compat(void)
+{
+ return is_compat_thread(task_thread_info(current));
+}
+
+static void set_reg_32(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+ switch (name) {
+ case arm_r0:
+ case arm_r1:
+ case arm_r2:
+ case arm_r3:
+ case arm_r4:
+ case arm_r5:
+ case arm_r6:
+ case arm_r7:
+ case arm_r8:
+ case arm_r9:
+ case arm_r10:
+ pt_regs->regs[name] = (__u64)value;
+ break;
+ case arm_ip:
+ pt_regs->regs[arm64_r16 - arm_max] = (__u64)value;
+ break;
+ case arm_pc:
+ pt_regs->pc = (__u64)value;
+ break;
+ default:
+ WARN(1, "%s: Illegal register name %u\n", __func__, name);
+ break;
+ }
+}
+
+static void set_reg_64(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+ switch (name) {
+ case arm64_r0:
+ case arm64_r1:
+ case arm64_r2:
+ case arm64_r3:
+ case arm64_r4:
+ case arm64_r5:
+ case arm64_r6:
+ case arm64_r7:
+ case arm64_r8:
+ case arm64_r9:
+ case arm64_r10:
+ case arm64_r11:
+ case arm64_r12:
+ case arm64_r13:
+ case arm64_r14:
+ case arm64_r15:
+ case arm64_r16:
+ case arm64_r17:
+ case arm64_r18:
+ case arm64_r19:
+ case arm64_r20:
+ case arm64_r21:
+ case arm64_r22:
+ case arm64_r23:
+ case arm64_r24:
+ case arm64_r25:
+ case arm64_r26:
+ case arm64_r27:
+ case arm64_r28:
+ pt_regs->regs[name - arm_max] = (__u64)value;
+ break;
+ case arm64_pc:
+ pt_regs->pc = (__u64)value;
+ break;
+ default:
+ WARN(1, "%s: Illegal register name %u\n", __func__, name);
+ break;
+ }
+}
+
+static void set_regs(struct pt_regs *pt_regs, struct trampfd_regs *tregs)
+{
+ struct trampfd_reg *reg = tregs->regs;
+ struct trampfd_reg *reg_end = reg + tregs->nregs;
+ bool compat = is_compat();
+
+ for (; reg < reg_end; reg++) {
+ if (compat)
+ set_reg_32(pt_regs, reg->name, reg->value);
+ else
+ set_reg_64(pt_regs, reg->name, reg->value);
+ }
+}
+
+/*
+ * Check if the register names are valid. Check if the user PC has been set.
+ */
+bool trampfd_valid_regs(struct trampfd_regs *tregs)
+{
+ struct trampfd_reg *reg = tregs->regs;
+ struct trampfd_reg *reg_end = reg + tregs->nregs;
+ int min, max, pc_name;
+ bool pc_set = false;
+
+ if (is_compat()) {
+ min = 0;
+ pc_name = arm_pc;
+ max = arm_max;
+ } else {
+ min = arm_max;
+ pc_name = arm64_pc;
+ max = arm64_max;
+ }
+
+ for (; reg < reg_end; reg++) {
+ if (reg->name < min || reg->name >= max || reg->reserved)
+ return false;
+ if (reg->name == pc_name && reg->value)
+ pc_set = true;
+ }
+ return pc_set;
+}
+EXPORT_SYMBOL_GPL(trampfd_valid_regs);
+
+/*
+ * Check if the PC specified in a register context is allowed.
+ */
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *tregs)
+{
+ struct trampfd_reg *reg = tregs->regs;
+ struct trampfd_reg *reg_end = reg + tregs->nregs;
+ struct trampfd_values *allowed_pcs = trampfd->allowed_pcs;
+ u64 *allowed_values, pc_value = 0;
+ u32 nvalues, pc_name;
+ int i;
+
+ if (!allowed_pcs)
+ return true;
+
+ pc_name = is_compat() ? arm_pc : arm64_pc;
+
+ /*
+ * Find the PC register and its value. If the PC register has been
+ * specified multiple times, only the last one counts.
+ */
+ for (; reg < reg_end; reg++) {
+ if (reg->name == pc_name)
+ pc_value = reg->value;
+ }
+
+ allowed_values = allowed_pcs->values;
+ nvalues = allowed_pcs->nvalues;
+
+ for (i = 0; i < nvalues; i++) {
+ if (pc_value == allowed_values[i])
+ return true;
+ }
+ return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_allowed_pc);
+
+/* ---------------------------- Stack Context ---------------------------- */
+
+static int push_data(struct pt_regs *pt_regs, struct trampfd_stack *tstack)
+{
+ unsigned long sp;
+
+ sp = user_stack_pointer(pt_regs) - tstack->size - tstack->offset;
+ if (tstack->flags & TRAMPFD_SET_SP)
+ sp = round_down(sp, 16);
+
+ if (!access_ok((void *)sp, user_stack_pointer(pt_regs) - sp))
+ return -EFAULT;
+
+ if (copy_to_user(USERPTR(sp), tstack->data, tstack->size))
+ return -EFAULT;
+
+ if (tstack->flags & TRAMPFD_SET_SP)
+ user_stack_pointer_set(pt_regs, sp);
+
+ return 0;
+}
+
+/* ---------------------------- Fault Handlers ---------------------------- */
+
+static int trampfd_user_fault(struct trampfd *trampfd,
+ struct vm_area_struct *vma,
+ struct pt_regs *pt_regs)
+{
+ char buf[TRAMPFD_MAX_STACK_SIZE];
+ struct trampfd_regs *tregs;
+ struct trampfd_stack *tstack = NULL;
+ unsigned long addr;
+ size_t size;
+ int rc = 0;
+
+ mutex_lock(&trampfd->lock);
+
+ /*
+ * Execution of the trampoline must start at the offset specified by
+ * the kernel.
+ */
+ addr = vma->vm_start + trampfd->map.ioffset;
+ if (addr != pt_regs->pc) {
+ rc = -EINVAL;
+ goto unlock;
+ }
+
+ /*
+ * At a minimum, the user PC register must be specified for a
+ * user trampoline.
+ */
+ tregs = trampfd->regs;
+ if (!tregs) {
+ rc = -EINVAL;
+ goto unlock;
+ }
+
+ /*
+ * Set the register context for the trampoline.
+ */
+ set_regs(pt_regs, tregs);
+
+ if (trampfd->stack) {
+ /*
+ * Copy the stack context into a local buffer and push stack
+ * data after dropping the lock.
+ */
+ size = sizeof(*trampfd->stack) + trampfd->stack->size;
+ tstack = (struct trampfd_stack *) buf;
+ memcpy(tstack, trampfd->stack, size);
+ }
+unlock:
+ mutex_unlock(&trampfd->lock);
+
+ if (!rc && tstack) {
+ mmap_read_unlock(vma->vm_mm);
+ rc = push_data(pt_regs, tstack);
+ mmap_read_lock(vma->vm_mm);
+ }
+ return rc;
+}
+
+/*
+ * Handle it if it is a trampoline fault.
+ */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs)
+{
+ struct trampfd *trampfd;
+
+ if (!is_trampfd_vma(vma))
+ return false;
+ trampfd = vma->vm_private_data;
+
+ if (trampfd->type == TRAMPFD_USER)
+ return !trampfd_user_fault(trampfd, vma, pt_regs);
+ return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_fault);
+
+/* ---------------------------- Miscellaneous ---------------------------- */
+
+int trampfd_check_arch(struct trampfd *trampfd)
+{
+ return 0;
+}
+EXPORT_SYMBOL_GPL(trampfd_check_arch);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 8afb238ff335..6e5e3193919a 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -23,6 +23,7 @@
#include <linux/perf_event.h>
#include <linux/preempt.h>
#include <linux/hugetlb.h>
+#include <linux/trampfd.h>
#include <asm/acpi.h>
#include <asm/bug.h>
@@ -404,7 +405,8 @@ static void do_bad_area(unsigned long addr, unsigned int esr, struct pt_regs *re
#define VM_FAULT_BADACCESS 0x020000
static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
- unsigned int mm_flags, unsigned long vm_flags)
+ unsigned int mm_flags, unsigned long vm_flags,
+ struct pt_regs *regs)
{
struct vm_area_struct *vma = find_vma(mm, addr);
@@ -426,8 +428,15 @@ static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr,
* Check that the permissions on the VMA allow for the fault which
* occurred.
*/
- if (!(vma->vm_flags & vm_flags))
+ if (!(vma->vm_flags & vm_flags)) {
+ /*
+ * If it is an execute fault, it could be a trampoline
+ * invocation.
+ */
+ if ((vm_flags & VM_EXEC) && trampfd_fault(vma, regs))
+ return 0;
return VM_FAULT_BADACCESS;
+ }
return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags);
}
@@ -516,7 +525,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
#endif
}
- fault = __do_page_fault(mm, addr, mm_flags, vm_flags);
+ fault = __do_page_fault(mm, addr, mm_flags, vm_flags, regs);
major |= fault & VM_FAULT_MAJOR;
/* Quick path to respond to signals */
--
2.17.1
* [PATCH v1 1/4] [RFC] fs/trampfd: Implement the trampoline file descriptor API
From: madvenka @ 2020-07-28 13:10 UTC (permalink / raw)
To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
linux-integrity, linux-kernel, linux-security-module, oleg, x86,
madvenka
In-Reply-To: <20200728131050.24443-1-madvenka@linux.microsoft.com>
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
There are many applications that use trampoline code. Trampoline code is
usually placed in a data page or a stack page. In order to execute a
trampoline, the page that contains the trampoline needs to have execute
permissions.
Writable pages with execute permissions provide an attack surface for
hackers. To mitigate this, LSMs such as SELinux may prevent a page from
having both write and execute permissions.
An application may attempt to circumvent this by writing the trampoline
code into a temporary file and mapping the file into its process
address space with just execute permissions. This presents the same
opportunity to hackers as before. LSMs that implement cryptographic
verification of files can prevent such temporary files from being mapped.
Such security mitigations prevent genuine trampoline code from running
as well.
Typically, trampolines simply load some values in some registers and/or
push some values on the stack and jump to a target PC. For such simple
trampolines, an application could request the kernel to do that work
instead of executing trampoline code to do that work. trampfd allows
applications to do exactly this.
Such applications can then run without having to relax security
settings for them. For instance, libffi trampolines can easily be
replaced by trampfd. libffi is used by a variety of applications.
trampfd_create() system call
----------------------------
A new system call is introduced to create a trampoline. The system call
number for this is 440. The system call is invoked like this:
int trampfd;
trampfd = syscall(440, type, data);
type Trampoline type.
data Trampoline type-specific data.
Types of trampolines
--------------------
Different types of trampolines can be defined based on the desired
functionality. In this initial work, the following type is defined:
TRAMPFD_USER
This implements the simple trampoline type I referred to earlier.
The type-specific structure for TRAMPFD_USER is struct trampfd_user.
Trampoline contexts
-------------------
A trampoline can have one or more contexts associated with it. Contexts
are of two kinds:
- Contexts that can be specified by the user. These can be added,
retrieved and removed by user code.
- Contexts that are specified by the kernel. These can only be
added by the kernel, but they can be read by the user.
In this initial work, I define the following contexts:
User specifiable:
Register Context
----------------
Contains register name-value pairs. When a trampoline is invoked,
the specified values are loaded in the specified registers. This
includes the value of the PC register. The kernel specifies the
subset of registers that can be specified.
Stack Context
-------------
Contains data to push on the user stack when a trampoline is
invoked.
Allowed PCs
-----------
This specifies a list of PCs that the trampoline is allowed to
jump to. This prevents a hacker from modifying the trampoline's
target PC.
Kernel specified:
Mapping parameters
------------------
Used to map a trampoline into an address space. Mapping parameters
are determined by the kernel based on the trampoline type and
type-specific information.
Other contexts can be defined in the future.
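As an illustration of the Allowed PCs context, the check amounts to a membership test on the target PC. The sketch below is a simplified userspace model, not the patch code: struct allowed_pcs and pc_is_allowed() are hypothetical names, and the "no list means no restriction" rule mirrors what the arm64 trampfd_allowed_pc() in this series does when no list has been set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the allowed-PCs context. */
struct allowed_pcs {
	size_t nvalues;
	const uint64_t *values;
};

/* Return true if target_pc may be used; no list means no restriction. */
static bool pc_is_allowed(const struct allowed_pcs *pcs, uint64_t target_pc)
{
	size_t i;

	if (!pcs)
		return true;
	assert(pcs->nvalues == 0 || pcs->values != NULL);

	for (i = 0; i < pcs->nvalues; i++)
		if (pcs->values[i] == target_pc)
			return true;
	return false;
}
```

A linear scan is enough in this model because the list is bounded and small; a sorted list with binary search would be an alternative if the limit were large.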
How to set and read contexts
----------------------------
A symbolic file offset is associated with each context type.
TRAMPFD_MAP_OFFSET
TRAMPFD_REGS_OFFSET
TRAMPFD_STACK_OFFSET
TRAMPFD_PCS_OFFSET
A structure is defined for each context type as well:
struct trampfd_map
struct trampfd_regs
struct trampfd_stack
struct trampfd_pcs
To set/retrieve a context, seek to the corresponding offset and
write()/read() the corresponding structure. As a convenience, pread()
and pwrite() can be used so it can be done in one call instead of two.
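A minimal userspace sketch of the set/retrieve pattern described above, using pread() and pwrite(). The helper names context_set() and context_get() are made up for this example; the caller is assumed to pass one of the symbolic offsets and a buffer holding the matching structure, whose layout is defined by the uapi header and is not reproduced here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one whole context structure at its symbolic offset: 0 or -1. */
static int context_set(int fd, off_t ctx_offset, const void *ctx, size_t len)
{
	ssize_t n;

	assert(ctx != NULL && len > 0);
	n = pwrite(fd, ctx, len, ctx_offset);
	if (n < 0)
		return -1;		/* errno was set by pwrite() */
	if ((size_t)n != len) {
		errno = EIO;		/* treat a short write as a failure */
		return -1;
	}
	return 0;
}

/* Read one whole context structure from its symbolic offset: 0 or -1. */
static int context_get(int fd, off_t ctx_offset, void *ctx, size_t len)
{
	ssize_t n;

	assert(ctx != NULL && len > 0);
	n = pread(fd, ctx, len, ctx_offset);
	if (n < 0)
		return -1;		/* errno was set by pread() */
	if ((size_t)n != len) {
		errno = EIO;		/* treat a short read as a failure */
		return -1;
	}
	return 0;
}
```

With a trampoline descriptor, a call would look like context_set(fd, TRAMPFD_REGS_OFFSET, regs, size), where size covers the header plus the register array.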
Invoking a trampoline
---------------------
Map the file descriptor into process address space using mmap(). The
kernel returns an address to invoke the trampoline with. The protection
for the mapping is set to PROT_NONE.
Execute the trampoline in one of two ways depending upon what the target
PC points to:
- Branch to the trampoline address.
- Use the trampoline address as a function pointer and call it.
Because the user process does not have execute permissions on the
trampoline address, it traps into the kernel. The kernel recognizes
it as a trampoline invocation and performs the action indicated by the
trampoline's type and context. In the case of TRAMPFD_USER, the
kernel loads the user registers with the values specified in the
register context, pushes the values specified in the stack context on
the user stack and sets the user PC to point to the PC register value
in the register context. Then, the process returns to user land and
continues execution at the target PC.
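The TRAMPFD_USER steps above can be modelled in a few lines of userspace C. This is a toy, not kernel code: the register file, the 256-byte stack, and all toy_* names are invented for the illustration, and the model always moves the stack pointer and aligns it to 16 bytes, whereas the arm64 patch in this series does so only when TRAMPFD_SET_SP is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy register names; TOY_PC plays the role of the PC register name. */
enum { TOY_R0, TOY_R1, TOY_R2, TOY_R3, TOY_PC, TOY_MAX };

struct toy_cpu {
	uint64_t regs[TOY_PC];
	uint64_t pc;
	size_t sp;		/* index into stack[], grows downward */
	uint8_t stack[256];
};

struct toy_reg {
	unsigned int name;
	uint64_t value;
};

struct toy_tramp {
	const struct toy_reg *regs;	/* register context */
	size_t nregs;
	const void *stack_data;		/* stack context */
	size_t stack_size;
};

/* Emulate one invocation; on failure (-1) the cpu state is untouched. */
static int toy_invoke(struct toy_cpu *cpu, const struct toy_tramp *t)
{
	bool pc_set = false;
	size_t i, sp;

	assert(cpu != NULL && t != NULL);

	/* Validate first: names in range and a non-zero PC present. */
	for (i = 0; i < t->nregs; i++) {
		if (t->regs[i].name >= TOY_MAX)
			return -1;
		if (t->regs[i].name == TOY_PC && t->regs[i].value)
			pc_set = true;
	}
	if (!pc_set || cpu->sp > sizeof(cpu->stack) || t->stack_size > cpu->sp)
		return -1;

	/* Load the registers; the PC entry becomes the resume address. */
	for (i = 0; i < t->nregs; i++) {
		if (t->regs[i].name == TOY_PC)
			cpu->pc = t->regs[i].value;
		else
			cpu->regs[t->regs[i].name] = t->regs[i].value;
	}

	/* Push the stack data below the current sp, 16-byte aligned. */
	sp = (cpu->sp - t->stack_size) & ~(size_t)15;
	if (t->stack_size)
		memcpy(&cpu->stack[sp], t->stack_data, t->stack_size);
	cpu->sp = sp;
	return 0;
}
```

After a successful toy_invoke(), execution would continue at cpu->pc with the new register and stack state, which corresponds to the return to user land described above.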
Removing a context
------------------
To remove a context, write the context structure into trampfd but
specify a zero context. For example, for register context, specify
the number of registers as 0. For stack context, specify size of
stack data as 0.
Removing a trampoline
---------------------
To remove a trampoline, unmap it and close the file descriptor. When
the last reference on the trampoline goes away, the trampoline is freed.
Sharing trampolines
-------------------
A trampoline created by one thread can be used by other threads sharing
the same address space.
Trampolines, in general, may be shared across processes by the usual
mechanism of sending the file descriptor to another process over a Unix
domain socket.
Architecture support
--------------------
The handling of the trampoline page fault and the setting up of the
register and stack contexts are architecture specific. Architecture
specific patches will implement support for the API.
The signal delivery code in the kernel already implements the elements
needed for this work. That will be leveraged.
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
fs/Makefile | 1 +
fs/trampfd/Makefile | 6 ++
fs/trampfd/trampfd_data.c | 43 ++++++++
fs/trampfd/trampfd_fops.c | 131 +++++++++++++++++++++++
fs/trampfd/trampfd_map.c | 78 ++++++++++++++
fs/trampfd/trampfd_pcs.c | 95 +++++++++++++++++
fs/trampfd/trampfd_regs.c | 137 ++++++++++++++++++++++++
fs/trampfd/trampfd_stack.c | 131 +++++++++++++++++++++++
fs/trampfd/trampfd_stubs.c | 41 +++++++
fs/trampfd/trampfd_syscall.c | 92 ++++++++++++++++
include/linux/syscalls.h | 3 +
include/linux/trampfd.h | 82 ++++++++++++++
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/trampfd.h | 171 ++++++++++++++++++++++++++++++
init/Kconfig | 8 ++
kernel/sys_ni.c | 3 +
16 files changed, 1025 insertions(+), 1 deletion(-)
create mode 100644 fs/trampfd/Makefile
create mode 100644 fs/trampfd/trampfd_data.c
create mode 100644 fs/trampfd/trampfd_fops.c
create mode 100644 fs/trampfd/trampfd_map.c
create mode 100644 fs/trampfd/trampfd_pcs.c
create mode 100644 fs/trampfd/trampfd_regs.c
create mode 100644 fs/trampfd/trampfd_stack.c
create mode 100644 fs/trampfd/trampfd_stubs.c
create mode 100644 fs/trampfd/trampfd_syscall.c
create mode 100644 include/linux/trampfd.h
create mode 100644 include/uapi/linux/trampfd.h
diff --git a/fs/Makefile b/fs/Makefile
index 2ce5112b02c8..227761302000 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -136,3 +136,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/
obj-$(CONFIG_EROFS_FS) += erofs/
obj-$(CONFIG_VBOXSF_FS) += vboxsf/
obj-$(CONFIG_ZONEFS_FS) += zonefs/
+obj-$(CONFIG_TRAMPFD) += trampfd/
diff --git a/fs/trampfd/Makefile b/fs/trampfd/Makefile
new file mode 100644
index 000000000000..bdf5e487facc
--- /dev/null
+++ b/fs/trampfd/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_TRAMPFD) += trampfd.o
+
+trampfd-y += trampfd_data.o trampfd_fops.o trampfd_map.o trampfd_pcs.o
+trampfd-y += trampfd_regs.o trampfd_stack.o trampfd_stubs.o trampfd_syscall.o
diff --git a/fs/trampfd/trampfd_data.c b/fs/trampfd/trampfd_data.c
new file mode 100644
index 000000000000..0a316754cbe4
--- /dev/null
+++ b/fs/trampfd/trampfd_data.c
@@ -0,0 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Trampoline type-specific code.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/mman.h>
+#include <linux/trampfd.h>
+
+int trampfd_create_data(struct trampfd *trampfd, const void __user *tramp_data)
+{
+ struct trampfd_map *map = &trampfd->map;
+ struct trampfd_user *user;
+
+ if (trampfd->type == TRAMPFD_USER) {
+ user = kmalloc(sizeof(*user), GFP_KERNEL);
+ if (!user)
+ return -ENOMEM;
+
+ if (copy_from_user(user, tramp_data, sizeof(*user))) {
+ kfree(user);
+ return -EFAULT;
+ }
+ if (user->flags || user->reserved) {
+ kfree(user);
+ return -EINVAL;
+ }
+ trampfd->data = user;
+
+ map->size = PAGE_SIZE;
+ map->prot = PROT_NONE;
+ map->flags = MAP_PRIVATE;
+ map->offset = 0;
+ map->ioffset = 0;
+ }
+ return 0;
+}
diff --git a/fs/trampfd/trampfd_fops.c b/fs/trampfd/trampfd_fops.c
new file mode 100644
index 000000000000..94b82e0da75b
--- /dev/null
+++ b/fs/trampfd/trampfd_fops.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - File operations.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/seq_file.h>
+#include <linux/trampfd.h>
+
+#ifdef CONFIG_PROC_FS
+static const char * const trampfd_type_names[TRAMPFD_NUM_TYPES] = {
+ "TRAMPFD_USER",
+};
+
+static void trampfd_show_fdinfo(struct seq_file *sfile, struct file *file)
+{
+ struct trampfd *trampfd = file->private_data;
+
+ seq_printf(sfile, "type: %s\n", trampfd_type_names[trampfd->type]);
+}
+#endif
+
+static loff_t trampfd_llseek(struct file *file, loff_t offset, int whence)
+{
+ struct trampfd *trampfd = file->private_data;
+
+ if (whence != SEEK_SET)
+ return -EINVAL;
+
+ if ((offset < 0) || (offset >= TRAMPFD_NUM_OFFSETS))
+ return -EINVAL;
+
+ mutex_lock(&trampfd->lock);
+ if (offset != file->f_pos) {
+ file->f_pos = offset;
+ file->f_version = 0;
+ }
+ mutex_unlock(&trampfd->lock);
+ return offset;
+}
+
+static ssize_t trampfd_read(struct file *file, char __user *arg,
+ size_t count, loff_t *ppos)
+{
+ int rc;
+
+ if (!arg || !count)
+ return -EINVAL;
+
+ switch (*ppos) {
+ case TRAMPFD_MAP_OFFSET:
+ rc = trampfd_get_map(file, arg, count);
+ break;
+
+ case TRAMPFD_REGS_OFFSET:
+ rc = trampfd_get_regs(file, arg, count);
+ break;
+
+ case TRAMPFD_STACK_OFFSET:
+ rc = trampfd_get_stack(file, arg, count);
+ break;
+
+ default:
+ rc = -EINVAL;
+ goto out;
+ }
+out:
+ return rc ? rc : (ssize_t) count;
+}
+
+static ssize_t trampfd_write(struct file *file, const char __user *arg,
+ size_t count, loff_t *ppos)
+{
+ int rc;
+
+ if (!arg || !count)
+ return -EINVAL;
+
+ switch (*ppos) {
+ case TRAMPFD_REGS_OFFSET:
+ rc = trampfd_set_regs(file, arg, count);
+ break;
+
+ case TRAMPFD_STACK_OFFSET:
+ rc = trampfd_set_stack(file, arg, count);
+ break;
+
+ case TRAMPFD_ALLOWED_PCS_OFFSET:
+ rc = trampfd_set_allowed_pcs(file, arg, count);
+ break;
+
+ default:
+ rc = -EINVAL;
+ goto out;
+ }
+out:
+ return rc ? rc : (ssize_t) count;
+}
+
+static int trampfd_release(struct inode *inode, struct file *file)
+{
+ struct trampfd *trampfd = file->private_data;
+
+ if (trampfd->type == TRAMPFD_USER) {
+ kfree(trampfd->regs);
+ kfree(trampfd->stack);
+ kfree(trampfd->allowed_pcs);
+ }
+ kfree(trampfd->data);
+ mutex_destroy(&trampfd->lock);
+ kmem_cache_free(trampfd_cache, trampfd);
+ return 0;
+}
+
+const struct file_operations trampfd_fops = {
+#ifdef CONFIG_PROC_FS
+ .show_fdinfo = trampfd_show_fdinfo,
+#endif
+ .llseek = trampfd_llseek,
+ .read = trampfd_read,
+ .write = trampfd_write,
+ .release = trampfd_release,
+ .mmap = trampfd_mmap,
+ .get_unmapped_area = trampfd_get_unmapped_area,
+};
diff --git a/fs/trampfd/trampfd_map.c b/fs/trampfd/trampfd_map.c
new file mode 100644
index 000000000000..1a156c850ca8
--- /dev/null
+++ b/fs/trampfd/trampfd_map.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Memory mapping.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/mman.h>
+#include <linux/security.h>
+#include <linux/trampfd.h>
+
+int trampfd_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct trampfd *trampfd = file->private_data;
+
+ if (trampfd->type == TRAMPFD_USER) {
+ /*
+ * These mappings are special mappings that should not be
+ * merged or inherited. No physical page is currently allocated
+ * to these mappings. So, there is nothing to read/write.
+ * When the trampoline is invoked, an execute fault must be
+ * encountered so the kernel can intercept the invocation and
+ * set up user context.
+ */
+ if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
+ return -EINVAL;
+ vma->vm_flags = VM_SPECIAL | VM_DONTCOPY | VM_DONTDUMP;
+ }
+ vma->vm_private_data = trampfd;
+ return 0;
+}
+
+unsigned long
+trampfd_get_unmapped_area(struct file *file, unsigned long orig_addr,
+ unsigned long len, unsigned long pgoff,
+ unsigned long flags)
+{
+ struct trampfd *trampfd = file->private_data;
+ struct trampfd_map *map = &trampfd->map;
+ unsigned long map_pgoff = map->offset >> PAGE_SHIFT;
+
+ const typeof_member(struct file_operations, get_unmapped_area)
+ get_area = current->mm->get_unmapped_area;
+
+ if (len != map->size || pgoff != map_pgoff || (flags != map->flags))
+ return -EINVAL;
+
+ return get_area(file, orig_addr, len, pgoff, flags);
+}
+
+/*
+ * Retrieve the mapping parameters of a trampoline.
+ */
+int trampfd_get_map(struct file *file, char __user *arg, size_t count)
+{
+ struct trampfd *trampfd = file->private_data;
+
+ if (count != sizeof(trampfd->map))
+ return -EINVAL;
+ if (copy_to_user(arg, &trampfd->map, count))
+ return -EFAULT;
+ return 0;
+}
+
+bool is_trampfd_vma(struct vm_area_struct *vma)
+{
+ struct file *file = vma->vm_file;
+
+ if (!file)
+ return false;
+ return !strcmp(file->f_path.dentry->d_name.name, trampfd_name);
+}
+EXPORT_SYMBOL_GPL(is_trampfd_vma);
diff --git a/fs/trampfd/trampfd_pcs.c b/fs/trampfd/trampfd_pcs.c
new file mode 100644
index 000000000000..0ed36fd2169f
--- /dev/null
+++ b/fs/trampfd/trampfd_pcs.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Allowed PCs context.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/trampfd.h>
+
+/*
+ * Copy list of allowed PCs from the user and validate it.
+ */
+static int trampfd_copy_allowed_pcs(struct trampfd_values *allowed_pcs,
+ const void __user *arg, size_t count)
+{
+ u32 npcs;
+ size_t size;
+ u64 *values;
+ int i;
+
+ if (copy_from_user(allowed_pcs, arg, count))
+ return -EFAULT;
+
+ if (allowed_pcs->reserved)
+ return -EINVAL;
+
+ npcs = allowed_pcs->nvalues;
+ if (npcs > TRAMPFD_MAX_PCS)
+ return -EINVAL;
+
+ size = sizeof(*allowed_pcs);
+ size += npcs * sizeof(u64);
+ if (size != count)
+ return -EINVAL;
+
+ values = allowed_pcs->values;
+ for (i = 0; i < npcs; i++) {
+ if (!values[i])
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/*
+ * Set the allowed PCs for a trampoline. If the trampoline has a register
+ * context at this point, the PC register value in that register context is
+ * not checked against this list of allowed PCs.
+ */
+int trampfd_set_allowed_pcs(struct file *file, const char __user *arg,
+ size_t count)
+{
+ struct trampfd *trampfd = file->private_data;
+ struct trampfd_values *allowed_pcs, *cur_allowed_pcs;
+ int rc;
+
+ if (count < sizeof(*allowed_pcs) || count > TRAMPFD_MAX_PCS_SIZE)
+ return -EINVAL;
+
+ allowed_pcs = kmalloc(count, GFP_KERNEL);
+ if (!allowed_pcs)
+ return -ENOMEM;
+
+ rc = trampfd_copy_allowed_pcs(allowed_pcs, arg, count);
+ if (rc)
+ goto out;
+
+ /*
+ * If the number of PCs is 0, there are no new PCs to set.
+ */
+ if (!allowed_pcs->nvalues) {
+ kfree(allowed_pcs);
+ allowed_pcs = NULL;
+ }
+
+ /*
+ * Swap the new list of allowed PCs with the current one and free
+ * the current one, if any.
+ */
+ mutex_lock(&trampfd->lock);
+
+ cur_allowed_pcs = trampfd->allowed_pcs;
+ trampfd->allowed_pcs = allowed_pcs;
+ allowed_pcs = cur_allowed_pcs;
+
+ mutex_unlock(&trampfd->lock);
+out:
+ kfree(allowed_pcs);
+ return rc;
+}
diff --git a/fs/trampfd/trampfd_regs.c b/fs/trampfd/trampfd_regs.c
new file mode 100644
index 000000000000..35114d647385
--- /dev/null
+++ b/fs/trampfd/trampfd_regs.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Register context.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/trampfd.h>
+
+/*
+ * Copy context from the user and validate it.
+ */
+static int trampfd_copy_regs(struct trampfd_regs *regs, const void __user *arg,
+ size_t count)
+{
+ u32 nregs;
+ size_t size;
+
+ if (copy_from_user(regs, arg, count))
+ return -EFAULT;
+
+ if (regs->reserved)
+ return -EINVAL;
+
+ nregs = regs->nregs;
+ if (nregs > TRAMPFD_MAX_REGS)
+ return -EINVAL;
+
+ size = sizeof(*regs);
+ size += nregs * sizeof(struct trampfd_reg);
+ if (size != count)
+ return -EINVAL;
+
+ if (nregs && !trampfd_valid_regs(regs))
+ return -EINVAL;
+ return 0;
+}
+
+/*
+ * Set the register context for a trampoline.
+ */
+int trampfd_set_regs(struct file *file, const char __user *arg, size_t count)
+{
+ struct trampfd *trampfd = file->private_data;
+ struct trampfd_regs *regs, *cur_regs;
+ int rc;
+
+ if (count < sizeof(*regs) || count > TRAMPFD_MAX_REGS_SIZE)
+ return -EINVAL;
+
+ regs = kmalloc(count, GFP_KERNEL);
+ if (!regs)
+ return -ENOMEM;
+
+ rc = trampfd_copy_regs(regs, arg, count);
+ if (rc)
+ goto out;
+
+ /*
+ * If nregs is 0, there is no new register context to set.
+ */
+ if (!regs->nregs) {
+ kfree(regs);
+ regs = NULL;
+ }
+
+ /*
+ * Swap the new register context with the current one and free the
+ * current one, if any.
+ */
+ mutex_lock(&trampfd->lock);
+
+ /*
+ * Check if the specified PC is allowed.
+ */
+ if (!regs || trampfd_allowed_pc(trampfd, regs)) {
+ cur_regs = trampfd->regs;
+ trampfd->regs = regs;
+ regs = cur_regs;
+ } else {
+ rc = -EINVAL;
+ }
+
+ mutex_unlock(&trampfd->lock);
+out:
+ kfree(regs);
+ return rc;
+}
+
+/*
+ * Retrieve the register context of a trampoline.
+ */
+int trampfd_get_regs(struct file *file, char __user *arg, size_t count)
+{
+ struct trampfd *trampfd = file->private_data;
+ struct trampfd_regs *regs, *cur_regs;
+ size_t size;
+ int rc = 0;
+
+ if (count < sizeof(*regs) || count > TRAMPFD_MAX_REGS_SIZE)
+ return -EINVAL;
+
+ regs = kmalloc(count, GFP_KERNEL);
+ if (!regs)
+ return -ENOMEM;
+
+ mutex_lock(&trampfd->lock);
+
+ /*
+ * Copy the current register context into a local buffer so we can
+ * copy it to the user outside the lock.
+ */
+ cur_regs = trampfd->regs;
+ if (cur_regs) {
+ size = sizeof(*cur_regs);
+ size += sizeof(struct trampfd_reg) * cur_regs->nregs;
+ if (size > count)
+ size = count;
+ memcpy(regs, cur_regs, size);
+ } else {
+ size = sizeof(*regs);
+ memset(regs, 0, size);
+ }
+
+ mutex_unlock(&trampfd->lock);
+
+ if (copy_to_user(arg, regs, size))
+ rc = -EFAULT;
+
+ kfree(regs);
+ return rc;
+}
diff --git a/fs/trampfd/trampfd_stack.c b/fs/trampfd/trampfd_stack.c
new file mode 100644
index 000000000000..032c5ed70d57
--- /dev/null
+++ b/fs/trampfd/trampfd_stack.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Stack context.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/trampfd.h>
+
+/*
+ * Copy context from the user and validate it.
+ */
+static int trampfd_copy_stack(struct trampfd_stack *stack,
+ const void __user *arg, size_t count)
+{
+ size_t size;
+
+ if (copy_from_user(stack, arg, count))
+ return -EFAULT;
+
+ if (stack->reserved)
+ return -EINVAL;
+
+ size = stack->size;
+ if (size > TRAMPFD_MAX_DATA_SIZE)
+ return -EINVAL;
+
+ size += sizeof(*stack);
+ if (size != count)
+ return -EINVAL;
+
+ if (!stack->size)
+ return 0;
+
+ if ((stack->flags & ~TRAMPFD_SFLAGS) ||
+ stack->offset > TRAMPFD_MAX_STACK_OFFSET)
+ return -EINVAL;
+ return 0;
+}
+
+/*
+ * Set the stack context for a trampoline.
+ */
+int trampfd_set_stack(struct file *file, const char __user *arg, size_t count)
+{
+ struct trampfd *trampfd = file->private_data;
+ struct trampfd_stack *stack, *cur_stack;
+ int rc;
+
+ if (count < sizeof(*stack) || count > TRAMPFD_MAX_STACK_SIZE)
+ return -EINVAL;
+
+ stack = kmalloc(count, GFP_KERNEL);
+ if (!stack)
+ return -ENOMEM;
+
+ rc = trampfd_copy_stack(stack, arg, count);
+ if (rc)
+ goto out;
+
+ /*
+ * If size is 0, there is no new stack context to set.
+ */
+ if (!stack->size) {
+ kfree(stack);
+ stack = NULL;
+ }
+
+ /*
+ * Swap the new stack context with the current one and free the
+ * current one, if any.
+ */
+ mutex_lock(&trampfd->lock);
+
+ cur_stack = trampfd->stack;
+ trampfd->stack = stack;
+ stack = cur_stack;
+
+ mutex_unlock(&trampfd->lock);
+out:
+ kfree(stack);
+ return rc;
+}
+
+/*
+ * Retrieve the stack context of a trampoline.
+ */
+int trampfd_get_stack(struct file *file, char __user *arg, size_t count)
+{
+ struct trampfd *trampfd = file->private_data;
+ struct trampfd_stack *stack, *cur_stack;
+ size_t size;
+ int rc = 0;
+
+ if (count < sizeof(*stack) || count > TRAMPFD_MAX_STACK_SIZE)
+ return -EINVAL;
+
+ stack = kmalloc(count, GFP_KERNEL);
+ if (!stack)
+ return -ENOMEM;
+
+ mutex_lock(&trampfd->lock);
+
+ /*
+ * Copy the current stack context into a local buffer so we can
+ * copy it to the user outside the lock.
+ */
+ cur_stack = trampfd->stack;
+ if (cur_stack) {
+ size = sizeof(*cur_stack) + cur_stack->size;
+ if (size > count)
+ size = count;
+ memcpy(stack, cur_stack, size);
+ } else {
+ size = sizeof(*stack);
+ memset(stack, 0, size);
+ }
+
+ mutex_unlock(&trampfd->lock);
+
+ if (copy_to_user(arg, stack, size))
+ rc = -EFAULT;
+
+ kfree(stack);
+ return rc;
+}
diff --git a/fs/trampfd/trampfd_stubs.c b/fs/trampfd/trampfd_stubs.c
new file mode 100644
index 000000000000..8ca29dccbbf7
--- /dev/null
+++ b/fs/trampfd/trampfd_stubs.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - Stub functions.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/trampfd.h>
+
+/*
+ * Stub for the arch function that checks if a trampoline type is supported
+ * by the architecture. Return an error for all types that require architecture
+ * support. Return success for the rest as they are generic.
+ */
+int __attribute__((weak)) trampfd_check_arch(struct trampfd *trampfd)
+{
+ if (trampfd->type == TRAMPFD_USER)
+ return -EINVAL;
+ return 0;
+}
+
+/*
+ * Stub for the arch function that checks if a specified register context
+ * is valid.
+ */
+bool __attribute__((weak)) trampfd_valid_regs(struct trampfd_regs *regs)
+{
+ return false;
+}
+
+/*
+ * Stub for the arch function that checks if the PC register in a specified
+ * register context is allowed.
+ */
+bool __attribute__((weak)) trampfd_allowed_pc(struct trampfd *trampfd,
+ struct trampfd_regs *regs)
+{
+ return false;
+}
diff --git a/fs/trampfd/trampfd_syscall.c b/fs/trampfd/trampfd_syscall.c
new file mode 100644
index 000000000000..675460afc521
--- /dev/null
+++ b/fs/trampfd/trampfd_syscall.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - System call.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@microsoft.com)
+ *
+ * Copyright (C) 2020 Microsoft Corporation.
+ */
+
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/mman.h>
+#include <linux/syscalls.h>
+#include <linux/anon_inodes.h>
+#include <linux/trampfd.h>
+
+char *trampfd_name = "[trampfd]";
+
+struct kmem_cache *trampfd_cache;
+
+SYSCALL_DEFINE3(trampfd_create,
+ int, tramp_type,
+ const void __user *, tramp_data,
+ unsigned int, flags)
+{
+ struct trampfd *trampfd;
+ struct file *file;
+ int fd, rc = 0;
+
+ if (!trampfd_cache)
+ return -ENOMEM;
+
+ /*
+ * Flags are for future use.
+ */
+ if (flags || !tramp_data)
+ return -EINVAL;
+
+ if (tramp_type < 0 || tramp_type >= TRAMPFD_NUM_TYPES)
+ return -EINVAL;
+
+ trampfd = kmem_cache_zalloc(trampfd_cache, GFP_KERNEL);
+ if (!trampfd)
+ return -ENOMEM;
+
+ mutex_init(&trampfd->lock);
+ trampfd->type = tramp_type;
+
+ rc = trampfd_create_data(trampfd, tramp_data);
+ if (rc)
+ goto freetramp;
+
+ rc = trampfd_check_arch(trampfd);
+ if (rc)
+ goto freedata;
+
+ rc = get_unused_fd_flags(O_CLOEXEC);
+ if (rc < 0)
+ goto freedata;
+ fd = rc;
+
+ file = anon_inode_getfile(trampfd_name, &trampfd_fops, trampfd, O_RDWR);
+ if (IS_ERR(file)) {
+ rc = PTR_ERR(file);
+ goto freefd;
+ }
+ file->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
+
+ fd_install(fd, file);
+ return fd;
+freefd:
+ put_unused_fd(fd);
+freedata:
+ kfree(trampfd->data);
+freetramp:
+ kmem_cache_free(trampfd_cache, trampfd);
+ return rc;
+}
+
+int __init trampfd_init(void)
+{
+ trampfd_cache = kmem_cache_create("trampfd_cache",
+ sizeof(struct trampfd), 0, SLAB_HWCACHE_ALIGN, NULL);
+
+ if (trampfd_cache == NULL) {
+ pr_warn("%s: kmem_cache_create failed\n", __func__);
+ return -ENOMEM;
+ }
+ return 0;
+}
+core_initcall(trampfd_init);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b951a87da987..25ddf29477bc 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1005,6 +1005,9 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
siginfo_t __user *info,
unsigned int flags);
asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_trampfd_create(int tramp_type,
+ const void __user *tramp_data,
+ unsigned int flags);
/*
* Architecture-specific system calls
diff --git a/include/linux/trampfd.h b/include/linux/trampfd.h
new file mode 100644
index 000000000000..383d7eeda2d1
--- /dev/null
+++ b/include/linux/trampfd.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Trampoline File Descriptor - Internal structures and definitions.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+#ifndef _LINUX_TRAMPFD_H
+#define _LINUX_TRAMPFD_H
+
+#include <uapi/linux/trampfd.h>
+
+#define TRAMPFD_MAX_REGS_SIZE \
+ (sizeof(struct trampfd_regs) + \
+ (sizeof(struct trampfd_reg) * TRAMPFD_MAX_REGS))
+
+#define TRAMPFD_MAX_STACK_SIZE \
+ (sizeof(struct trampfd_stack) + TRAMPFD_MAX_DATA_SIZE)
+
+#define TRAMPFD_MAX_PCS_SIZE \
+ (sizeof(struct trampfd_values) + sizeof(u64) * TRAMPFD_MAX_PCS)
+
+/*
+ * Trampoline structure.
+ */
+struct trampfd {
+ struct mutex lock; /* to serialize access */
+ enum trampfd_type type; /* type of trampoline */
+ void *data; /* type specific data */
+ struct trampfd_map map; /* mmap() parameters */
+ struct trampfd_regs *regs; /* register context */
+ struct trampfd_stack *stack; /* stack context */
+ struct trampfd_values *allowed_pcs; /* allowed PCs */
+};
+
+#ifdef CONFIG_TRAMPFD
+
+/* Trampoline mapping */
+int trampfd_mmap(struct file *file, struct vm_area_struct *vma);
+unsigned long trampfd_get_unmapped_area(struct file *file,
+ unsigned long orig_addr,
+ unsigned long len,
+ unsigned long pgoff,
+ unsigned long flags);
+bool is_trampfd_vma(struct vm_area_struct *vma);
+
+/* Trampoline context */
+int trampfd_get_map(struct file *file, char __user *arg, size_t count);
+int trampfd_set_regs(struct file *file, const char __user *arg, size_t count);
+int trampfd_get_regs(struct file *file, char __user *arg, size_t count);
+int trampfd_set_stack(struct file *file, const char __user *arg, size_t count);
+int trampfd_get_stack(struct file *file, char __user *arg, size_t count);
+int trampfd_set_allowed_pcs(struct file *file, const char __user *arg,
+ size_t count);
+
+/* Arch functions */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs);
+bool trampfd_valid_regs(struct trampfd_regs *regs);
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *regs);
+int trampfd_check_arch(struct trampfd *trampfd);
+
+/* Trampoline type-specific */
+int trampfd_create_data(struct trampfd *trampfd, const void __user *tramp_data);
+
+extern char *trampfd_name;
+extern struct kmem_cache *trampfd_cache;
+extern const struct file_operations trampfd_fops;
+
+#define USERPTR(ptr) ((void __user *)(uintptr_t)(ptr))
+
+#else
+
+static inline bool trampfd_fault(struct vm_area_struct *vma,
+ struct pt_regs *pt_regs)
+{
+ return false;
+}
+
+#endif /* CONFIG_TRAMPFD */
+
+#endif /* _LINUX_TRAMPFD_H */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index f4a01305d9a6..14e526a45624 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -857,9 +857,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#define __NR_faccessat2 439
__SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_trampfd_create 440
+__SYSCALL(__NR_trampfd_create, sys_trampfd_create)
#undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/trampfd.h b/include/uapi/linux/trampfd.h
new file mode 100644
index 000000000000..bf9a6ef3683b
--- /dev/null
+++ b/include/uapi/linux/trampfd.h
@@ -0,0 +1,171 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Trampoline File Descriptor - API structures and definitions.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+#ifndef _UAPI_LINUX_TRAMPFD_H
+#define _UAPI_LINUX_TRAMPFD_H
+
+#include <linux/types.h>
+#include <linux/ptrace.h>
+
+/*
+ * All structure fields are defined so that they are the same width and at the
+ * same structure offset on 32-bit and 64-bit to avoid compat code.
+ *
+ * All fields named "reserved" must be set to 0. They are there primarily for
+ * alignment. But they may be used in the future.
+ */
+
+/* ------------------------- Types of Trampolines ------------------------- */
+
+/*
+ * TRAMPFD_USER
+ * User programs use the kernel as a trampoline to set up a user context
+ * and jump to a user function. This trampoline type can be used to
+ * replace user trampoline code.
+ */
+enum trampfd_type {
+ TRAMPFD_USER,
+ TRAMPFD_NUM_TYPES,
+};
+
+/* ---------------------------- Context offsets ---------------------------- */
+
+/*
+ * A trampoline has different types of context associated with it. Each context
+ * type has a symbolic offset into trampfd. The context can be read from or
+ * written to at its symbolic offset in trampfd.
+ *
+ * TRAMPFD_MAP_OFFSET
+ * To read trampoline mapping parameters - struct trampfd_map.
+ *
+ * TRAMPFD_REGS_OFFSET
+ * To read/write trampoline register context - struct trampfd_regs.
+ *
+ * TRAMPFD_STACK_OFFSET
+ * To read/write trampoline stack context - struct trampfd_stack.
+ *
+ * TRAMPFD_ALLOWED_PCS_OFFSET
+ * To write a list of allowed PCs - struct trampfd_values.
+ */
+enum trampfd_offsets {
+ TRAMPFD_MAP_OFFSET,
+ TRAMPFD_REGS_OFFSET,
+ TRAMPFD_STACK_OFFSET,
+ TRAMPFD_ALLOWED_PCS_OFFSET,
+ TRAMPFD_NUM_OFFSETS,
+};
+
+/* ------------------- Trampoline type specific data -------------------- */
+
+/*
+ * For TRAMPFD_USER.
+ */
+struct trampfd_user {
+ __u32 flags; /* for future enhancements */
+ __u32 reserved;
+};
+
+/* ------------------- Trampoline mapping parameters ---------------------- */
+
+/*
+ * Since the kernel implements the trampoline object, the kernel specifies
+ * how a trampoline should be mapped. User code must obtain these parameters
+ * and do an mmap() to map the trampoline. The first four parameters are used
+ * in the mmap() call. User code must add ioffset to the address returned by
+ * mmap() to get the actual invocation address for the trampoline.
+ */
+struct trampfd_map {
+ __u32 size; /* Size of the mapping */
+ __u32 prot; /* memory protection */
+ __u32 flags; /* map flags */
+ __u32 offset; /* file offset */
+ __u32 ioffset; /* invocation offset */
+ __u32 reserved;
+};
+
+/* -------------------------- Register context -------------------------- */
+
+/*
+ * A register context may be specified for a trampoline, if applicable
+ * to the trampoline type. E.g., TRAMPFD_USER. The register context is
+ * an array of name-value pairs. When a trampoline is invoked, its user
+ * registers are loaded with the specified values. Register names are
+ * architecture specific and can be found in <linux/ptrace.h> for architectures
+ * that support trampolines. Enumerations reg_32_name and reg_64_name in
+ * <linux/ptrace.h> refer to 32-bit and 64-bit respectively.
+ */
+struct trampfd_reg {
+ __u32 name; /* Register name */
+ __u32 reserved;
+ __u64 value; /* Register value */
+};
+
+/*
+ * Register context. It is a variable sized structure sized by the number
+ * of registers.
+ */
+struct trampfd_regs {
+ __u32 nregs; /* Number of registers */
+ __u32 reserved;
+ struct trampfd_reg regs[0]; /* Array of registers */
+};
+
+#define TRAMPFD_MAX_REGS 40
+
+/* ---------------------------- Stack context ---------------------------- */
+
+/*
+ * A stack context may be specified for a trampoline, if applicable
+ * to the trampoline type. E.g., TRAMPFD_USER. The stack context contains
+ * a data buffer. When a trampoline is invoked, the specified data is pushed
+ * on the stack at a specified offset from the current stack pointer.
+ * Optionally, the stack pointer can be moved to the top of the data.
+ *
+ * This is a variable sized structure sized by the amount of data that is
+ * to be pushed on the user stack.
+ */
+struct trampfd_stack {
+ __u32 flags; /* TRAMPFD_SFLAGS */
+ __u32 offset; /* Offset from top of stack */
+ __u32 size; /* Size of data to push */
+ __u32 reserved;
+ __u8 data[0]; /* Data to push on the stack */
+};
+
+#define TRAMPFD_MAX_DATA_SIZE 64
+#define TRAMPFD_MAX_STACK_OFFSET 256
+
+/*
+ * Stack context flags:
+ *
+ * TRAMPFD_SET_SP
+ * After pushing the data to user stack, move the stack pointer to the
+ * base of the data pushed. Note that the kernel will align the stack
+ * pointer based on the alignment requirements of the architecture.
+ */
+#define TRAMPFD_SET_SP 0x1
+#define TRAMPFD_SFLAGS (TRAMPFD_SET_SP)
+
+/* ---------------------------- Values context ---------------------------- */
+
+/*
+ * Some contexts may be just a list of values. For instance, the user can
+ * specify a list of allowed PCs for a trampoline. The following structure
+ * is used for those contexts.
+ */
+struct trampfd_values {
+ __u32 nvalues; /* number of values */
+ __u32 reserved;
+ __u64 values[0]; /* Array of values */
+};
+
+#define TRAMPFD_MAX_PCS 16
+
+/* -------------------------------------------------------------------------- */
+
+#endif /* _UAPI_LINUX_TRAMPFD_H */
diff --git a/init/Kconfig b/init/Kconfig
index 0498af567f70..783a0b98fce1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2313,3 +2313,11 @@ config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
# <asm/syscall_wrapper.h>.
config ARCH_HAS_SYSCALL_WRAPPER
def_bool n
+
+config TRAMPFD
+ bool "Enable trampfd_create() system call"
+ depends on MMU
+ help
+ Enable the trampfd_create() system call that allows a process to
+ map trampolines within its address space that can be invoked
+ with the help of the kernel.
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 3b69a560a7ac..136acf9234a3 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -349,6 +349,9 @@ COND_SYSCALL(pkey_mprotect);
COND_SYSCALL(pkey_alloc);
COND_SYSCALL(pkey_free);
+/* Trampoline fd */
+COND_SYSCALL(trampfd_create);
+
/*
* Architecture specific weak syscall entries.
--
2.17.1
^ permalink raw reply related
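
For readers following the size check in trampfd_copy_regs() in the patch above, here is a minimal userspace sketch of the byte count a register-context write would need. It assumes the structure layouts from the proposed include/uapi/linux/trampfd.h; the structs are local copies because that header is not in mainline, and regs_write_size() is a hypothetical helper name, not part of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the proposed uapi structures (assumption: same layout). */
struct trampfd_reg {
	uint32_t name;		/* register name */
	uint32_t reserved;	/* must be 0 */
	uint64_t value;		/* register value */
};

struct trampfd_regs {
	uint32_t nregs;		/* number of registers */
	uint32_t reserved;	/* must be 0 */
	struct trampfd_reg regs[];	/* array of registers */
};

#define TRAMPFD_MAX_REGS 40

/*
 * Hypothetical helper: the exact byte count the kernel-side check would
 * accept for nregs registers, or 0 if nregs exceeds the limit.
 */
static size_t regs_write_size(uint32_t nregs)
{
	if (nregs > TRAMPFD_MAX_REGS)
		return 0;
	return sizeof(struct trampfd_regs) +
	       (size_t)nregs * sizeof(struct trampfd_reg);
}
```

Because the kernel compares the computed size against count with !=, a write that is even one byte longer or shorter than this value would be rejected with -EINVAL.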
* [PATCH v1 2/4] [RFC] x86/trampfd: Provide support for the trampoline file descriptor
From: madvenka @ 2020-07-28 13:10 UTC (permalink / raw)
To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
linux-integrity, linux-kernel, linux-security-module, oleg, x86,
madvenka
In-Reply-To: <20200728131050.24443-1-madvenka@linux.microsoft.com>
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
Implement 32-bit and 64-bit X86 support for the trampoline file descriptor.
- Define architecture specific register names
- Handle the trampoline invocation page fault
- Set up the user register context on trampoline invocation
- Set up the user stack context on trampoline invocation
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
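
A rough userspace sketch of the register-name validation that this patch implements in trampfd_valid_regs(), assuming the enum values defined in the ptrace.h hunk of this patch. The constants are abbreviated stand-ins, and struct reg and valid_regs() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Abbreviated stand-ins for enum reg_32_name / enum reg_64_name. */
enum { X32_EIP = 7, X32_MAX = 8 };
enum { X64_RIP = X32_MAX + 15, X64_MAX = X32_MAX + 16 };

/* Same field layout as the proposed struct trampfd_reg. */
struct reg {
	uint32_t name;
	uint32_t reserved;
	uint64_t value;
};

/*
 * Hypothetical mirror of the kernel check: every name must fall in the
 * range for the task's mode, reserved must be 0, and the PC register
 * must be given a nonzero value at least once.
 */
static bool valid_regs(const struct reg *r, size_t n, bool compat)
{
	uint32_t min = compat ? 0 : X32_MAX;
	uint32_t max = compat ? X32_MAX : X64_MAX;
	uint32_t pc = compat ? X32_EIP : X64_RIP;
	bool pc_set = false;

	for (size_t i = 0; i < n; i++) {
		if (r[i].name < min || r[i].name >= max || r[i].reserved)
			return false;
		if (r[i].name == pc && r[i].value)
			pc_set = true;
	}
	return pc_set;
}
```

Since the 64-bit names start at x32_max, a context built with 64-bit names is rejected for a compat task and vice versa.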
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/uapi/asm/ptrace.h | 38 +++
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/trampfd.c | 313 +++++++++++++++++++++++++
arch/x86/mm/fault.c | 11 +
6 files changed, 366 insertions(+)
create mode 100644 arch/x86/kernel/trampfd.c
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index d8f8a1a69ed1..77eb50414591 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -443,3 +443,4 @@
437 i386 openat2 sys_openat2
438 i386 pidfd_getfd sys_pidfd_getfd
439 i386 faccessat2 sys_faccessat2
+440 i386 trampfd_create sys_trampfd_create
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 78847b32e137..9d962de1d21f 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -360,6 +360,7 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common trampfd_create sys_trampfd_create
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/x86/include/uapi/asm/ptrace.h b/arch/x86/include/uapi/asm/ptrace.h
index 85165c0edafc..b031598f857e 100644
--- a/arch/x86/include/uapi/asm/ptrace.h
+++ b/arch/x86/include/uapi/asm/ptrace.h
@@ -9,6 +9,44 @@
#ifndef __ASSEMBLY__
+/*
+ * These register names are to be used by 32-bit applications.
+ */
+enum reg_32_name {
+ x32_eax,
+ x32_ebx,
+ x32_ecx,
+ x32_edx,
+ x32_esi,
+ x32_edi,
+ x32_ebp,
+ x32_eip,
+ x32_max,
+};
+
+/*
+ * These register names are to be used by 64-bit applications.
+ */
+enum reg_64_name {
+ x64_rax = x32_max,
+ x64_rbx,
+ x64_rcx,
+ x64_rdx,
+ x64_rsi,
+ x64_rdi,
+ x64_rbp,
+ x64_r8,
+ x64_r9,
+ x64_r10,
+ x64_r11,
+ x64_r12,
+ x64_r13,
+ x64_r14,
+ x64_r15,
+ x64_rip,
+ x64_max,
+};
+
#ifdef __i386__
/* this struct defines the way the registers are stored on the
stack during a system call. */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index e77261db2391..5d968ac4c7d9 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -157,3 +157,5 @@ ifeq ($(CONFIG_X86_64),y)
endif
obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT) += ima_arch.o
+
+obj-$(CONFIG_TRAMPFD) += trampfd.o
diff --git a/arch/x86/kernel/trampfd.c b/arch/x86/kernel/trampfd.c
new file mode 100644
index 000000000000..f6b5507134d2
--- /dev/null
+++ b/arch/x86/kernel/trampfd.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - X86 support.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+
+#include <linux/thread_info.h>
+#include <linux/mm_types.h>
+#include <linux/trampfd.h>
+#include <linux/uaccess.h>
+
+/* ---------------------------- Register Context ---------------------------- */
+
+static inline bool is_compat(void)
+{
+ return (IS_ENABLED(CONFIG_X86_32) ||
+ (IS_ENABLED(CONFIG_COMPAT) && test_thread_flag(TIF_ADDR32)));
+}
+
+static void set_reg_32(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+ switch (name) {
+ case x32_eax:
+ pt_regs->ax = (unsigned long)value;
+ break;
+ case x32_ebx:
+ pt_regs->bx = (unsigned long)value;
+ break;
+ case x32_ecx:
+ pt_regs->cx = (unsigned long)value;
+ break;
+ case x32_edx:
+ pt_regs->dx = (unsigned long)value;
+ break;
+ case x32_esi:
+ pt_regs->si = (unsigned long)value;
+ break;
+ case x32_edi:
+ pt_regs->di = (unsigned long)value;
+ break;
+ case x32_ebp:
+ pt_regs->bp = (unsigned long)value;
+ break;
+ case x32_eip:
+ pt_regs->ip = (unsigned long)value;
+ break;
+ default:
+ WARN(1, "%s: Illegal register name %u\n", __func__, name);
+ break;
+ }
+}
+
+#ifdef __i386__
+
+static void set_reg_64(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+}
+
+#else
+
+static void set_reg_64(struct pt_regs *pt_regs, u32 name, u64 value)
+{
+ switch (name) {
+ case x64_rax:
+ pt_regs->ax = (unsigned long)value;
+ break;
+ case x64_rbx:
+ pt_regs->bx = (unsigned long)value;
+ break;
+ case x64_rcx:
+ pt_regs->cx = (unsigned long)value;
+ break;
+ case x64_rdx:
+ pt_regs->dx = (unsigned long)value;
+ break;
+ case x64_rsi:
+ pt_regs->si = (unsigned long)value;
+ break;
+ case x64_rdi:
+ pt_regs->di = (unsigned long)value;
+ break;
+ case x64_rbp:
+ pt_regs->bp = (unsigned long)value;
+ break;
+ case x64_r8:
+ pt_regs->r8 = (unsigned long)value;
+ break;
+ case x64_r9:
+ pt_regs->r9 = (unsigned long)value;
+ break;
+ case x64_r10:
+ pt_regs->r10 = (unsigned long)value;
+ break;
+ case x64_r11:
+ pt_regs->r11 = (unsigned long)value;
+ break;
+ case x64_r12:
+ pt_regs->r12 = (unsigned long)value;
+ break;
+ case x64_r13:
+ pt_regs->r13 = (unsigned long)value;
+ break;
+ case x64_r14:
+ pt_regs->r14 = (unsigned long)value;
+ break;
+ case x64_r15:
+ pt_regs->r15 = (unsigned long)value;
+ break;
+ case x64_rip:
+ pt_regs->ip = (unsigned long)value;
+ break;
+ default:
+ WARN(1, "%s: Illegal register name %u\n", __func__, name);
+ break;
+ }
+}
+
+#endif /* __i386__ */
+
+static void set_regs(struct pt_regs *pt_regs, struct trampfd_regs *tregs)
+{
+ struct trampfd_reg *reg = tregs->regs;
+ struct trampfd_reg *reg_end = reg + tregs->nregs;
+ bool compat = is_compat();
+
+ for (; reg < reg_end; reg++) {
+ if (compat)
+ set_reg_32(pt_regs, reg->name, reg->value);
+ else
+ set_reg_64(pt_regs, reg->name, reg->value);
+ }
+}
+
+/*
+ * Check if the register names are valid. Check if the user PC has been set.
+ */
+bool trampfd_valid_regs(struct trampfd_regs *tregs)
+{
+ struct trampfd_reg *reg = tregs->regs;
+ struct trampfd_reg *reg_end = reg + tregs->nregs;
+ int min, max, pc_name;
+ bool pc_set = false;
+
+ if (is_compat()) {
+ min = 0;
+ pc_name = x32_eip;
+ max = x32_max;
+ } else {
+ min = x32_max;
+ pc_name = x64_rip;
+ max = x64_max;
+ }
+
+ for (; reg < reg_end; reg++) {
+ if (reg->name < min || reg->name >= max || reg->reserved)
+ return false;
+ if (reg->name == pc_name && reg->value)
+ pc_set = true;
+ }
+ return pc_set;
+}
+EXPORT_SYMBOL_GPL(trampfd_valid_regs);
+
+/*
+ * Check if the PC specified in a register context is allowed.
+ */
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *tregs)
+{
+ struct trampfd_reg *reg = tregs->regs;
+ struct trampfd_reg *reg_end = reg + tregs->nregs;
+ struct trampfd_values *allowed_pcs = trampfd->allowed_pcs;
+ u64 *allowed_values, pc_value = 0;
+ u32 nvalues, pc_name;
+ int i;
+
+ if (!allowed_pcs)
+ return true;
+
+ pc_name = is_compat() ? x32_eip : x64_rip;
+
+ /*
+ * Find the PC register and its value. If the PC register has been
+ * specified multiple times, only the last one counts.
+ */
+ for (; reg < reg_end; reg++) {
+ if (reg->name == pc_name)
+ pc_value = reg->value;
+ }
+
+ allowed_values = allowed_pcs->values;
+ nvalues = allowed_pcs->nvalues;
+
+ for (i = 0; i < nvalues; i++) {
+ if (pc_value == allowed_values[i])
+ return true;
+ }
+ return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_allowed_pc);
+
+/* ---------------------------- Stack Context ---------------------------- */
+
+static int push_data(struct pt_regs *pt_regs, struct trampfd_stack *tstack)
+{
+ unsigned long sp;
+
+ sp = user_stack_pointer(pt_regs) - tstack->size - tstack->offset;
+ if (tstack->flags & TRAMPFD_SET_SP) {
+ if (is_compat())
+ sp = ((sp + 4) & -16ul) - 4;
+ else
+ sp = round_down(sp, 16) - 8;
+ }
+
+ if (!access_ok(sp, user_stack_pointer(pt_regs) - sp))
+ return -EFAULT;
+
+ if (copy_to_user(USERPTR(sp), tstack->data, tstack->size))
+ return -EFAULT;
+
+ if (tstack->flags & TRAMPFD_SET_SP)
+ user_stack_pointer_set(pt_regs, sp);
+
+ return 0;
+}
+
+/* ---------------------------- Fault Handlers ---------------------------- */
+
+static int trampfd_user_fault(struct trampfd *trampfd,
+ struct vm_area_struct *vma,
+ struct pt_regs *pt_regs)
+{
+ char buf[TRAMPFD_MAX_STACK_SIZE];
+ struct trampfd_regs *tregs;
+ struct trampfd_stack *tstack = NULL;
+ unsigned long addr;
+ size_t size;
+ int rc = 0;
+
+ mutex_lock(&trampfd->lock);
+
+ /*
+ * Execution of the trampoline must start at the offset specified by
+ * the kernel.
+ */
+ addr = vma->vm_start + trampfd->map.ioffset;
+ if (addr != pt_regs->ip) {
+ rc = -EINVAL;
+ goto unlock;
+ }
+
+ /*
+ * At a minimum, the user PC register must be specified for a
+ * user trampoline.
+ */
+ tregs = trampfd->regs;
+ if (!tregs) {
+ rc = -EINVAL;
+ goto unlock;
+ }
+
+ /*
+ * Set the register context for the trampoline.
+ */
+ set_regs(pt_regs, tregs);
+
+ if (trampfd->stack) {
+ /*
+ * Copy the stack context into a local buffer and push stack
+ * data after dropping the lock.
+ */
+ size = sizeof(*trampfd->stack) + trampfd->stack->size;
+ tstack = (struct trampfd_stack *) buf;
+ memcpy(tstack, trampfd->stack, size);
+ }
+unlock:
+ mutex_unlock(&trampfd->lock);
+
+ if (!rc && tstack) {
+ mmap_read_unlock(vma->vm_mm);
+ rc = push_data(pt_regs, tstack);
+ mmap_read_lock(vma->vm_mm);
+ }
+ return rc;
+}
+
+/*
+ * Handle it if it is a trampoline fault.
+ */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs)
+{
+ struct trampfd *trampfd;
+
+ if (!is_trampfd_vma(vma))
+ return false;
+ trampfd = vma->vm_private_data;
+
+ if (trampfd->type == TRAMPFD_USER)
+ return !trampfd_user_fault(trampfd, vma, pt_regs);
+ return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_fault);
+
+/* ------------------------- Arch Initialization ------------------------- */
+
+int trampfd_check_arch(struct trampfd *trampfd)
+{
+ return 0;
+}
+EXPORT_SYMBOL_GPL(trampfd_check_arch);
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1ead568c0101..a1432ee2a1a2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
#include <linux/uaccess.h> /* faulthandler_disabled() */
#include <linux/efi.h> /* efi_recover_from_page_fault()*/
#include <linux/mm_types.h>
+#include <linux/trampfd.h> /* trampoline invocation */
#include <asm/cpufeature.h> /* boot_cpu_has, ... */
#include <asm/traps.h> /* dotraplinkage, ... */
@@ -1142,6 +1143,7 @@ void do_user_addr_fault(struct pt_regs *regs,
struct mm_struct *mm;
vm_fault_t fault, major = 0;
unsigned int flags = FAULT_FLAG_DEFAULT;
+ unsigned long tflags = X86_PF_INSTR | X86_PF_USER;
tsk = current;
mm = tsk->mm;
@@ -1275,6 +1277,15 @@ void do_user_addr_fault(struct pt_regs *regs,
*/
good_area:
if (unlikely(access_error(hw_error_code, vma))) {
+ /*
+ * If it is a user execute fault, it could be a trampoline
+ * invocation.
+ */
+ if ((hw_error_code & tflags) == tflags &&
+ trampfd_fault(vma, regs)) {
+ mmap_read_unlock(mm);
+ return;
+ }
bad_area_access_error(regs, hw_error_code, address, vma);
return;
}
--
2.17.1
* [PATCH v1 4/4] [RFC] arm/trampfd: Provide support for the trampoline file descriptor
From: madvenka @ 2020-07-28 13:10 UTC (permalink / raw)
To: kernel-hardening, linux-api, linux-arm-kernel, linux-fsdevel,
linux-integrity, linux-kernel, linux-security-module, oleg, x86,
madvenka
In-Reply-To: <20200728131050.24443-1-madvenka@linux.microsoft.com>
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
Implement 32-bit ARM support for the trampoline file descriptor.
- Define architecture specific register names
- Handle the trampoline invocation page fault
- Set up the user register context on trampoline invocation
- Set up the user stack context on trampoline invocation
Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
arch/arm/include/uapi/asm/ptrace.h | 20 +++
arch/arm/kernel/Makefile | 1 +
arch/arm/kernel/trampfd.c | 214 +++++++++++++++++++++++++++++
arch/arm/mm/fault.c | 12 +-
arch/arm/tools/syscall.tbl | 1 +
5 files changed, 246 insertions(+), 2 deletions(-)
create mode 100644 arch/arm/kernel/trampfd.c
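Reader's note on the stack context: the address arithmetic that push_data()
performs in this patch can be tried out in user space. The sketch below is
only an illustration under stated assumptions, not kernel code:
tramp_stack_top() and its parameters are hypothetical names, and it assumes,
as the patch does, that the 8-byte round-down is applied only when the
caller asked for the SP to be updated.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the arithmetic in push_data():
 * skip `offset` bytes below the current SP, reserve `size` bytes for
 * the data, and round down to an 8-byte boundary when the SP itself
 * is going to be updated.
 */
static uint32_t tramp_stack_top(uint32_t sp, uint32_t size, uint32_t offset,
				bool set_sp)
{
	uint32_t new_sp = sp - size - offset;

	if (set_sp)
		new_sp &= ~(uint32_t)7;
	return new_sp;
}
```

With sp = 0x1000, size = 12 and offset = 0, the data lands at 0xff4 when the
SP is left alone and at 0xff0 when the SP is updated and therefore aligned.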
diff --git a/arch/arm/include/uapi/asm/ptrace.h b/arch/arm/include/uapi/asm/ptrace.h
index e61c65b4018d..47b1c5e2f32c 100644
--- a/arch/arm/include/uapi/asm/ptrace.h
+++ b/arch/arm/include/uapi/asm/ptrace.h
@@ -151,6 +151,26 @@ struct pt_regs {
#define ARM_r0 uregs[0]
#define ARM_ORIG_r0 uregs[17]
+/*
+ * These register names are to be used by 32-bit applications.
+ */
+enum reg_32_name {
+ arm_r0,
+ arm_r1,
+ arm_r2,
+ arm_r3,
+ arm_r4,
+ arm_r5,
+ arm_r6,
+ arm_r7,
+ arm_r8,
+ arm_r9,
+ arm_r10,
+ arm_ip,
+ arm_pc,
+ arm_max,
+};
+
/*
* The size of the user-visible VFP state as seen by PTRACE_GET/SETVFPREGS
* and core dumps.
diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile
index 89e5d864e923..652c54c2f19a 100644
--- a/arch/arm/kernel/Makefile
+++ b/arch/arm/kernel/Makefile
@@ -105,5 +105,6 @@ obj-$(CONFIG_SMP) += psci_smp.o
endif
obj-$(CONFIG_HAVE_ARM_SMCCC) += smccc-call.o
+obj-$(CONFIG_TRAMPFD) += trampfd.o
extra-y := $(head-y) vmlinux.lds
diff --git a/arch/arm/kernel/trampfd.c b/arch/arm/kernel/trampfd.c
new file mode 100644
index 000000000000..50fc5706e85b
--- /dev/null
+++ b/arch/arm/kernel/trampfd.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Trampoline File Descriptor - ARM support.
+ *
+ * Author: Madhavan T. Venkataraman (madvenka@linux.microsoft.com)
+ *
+ * Copyright (c) 2020, Microsoft Corporation.
+ */
+
+#include <linux/trampfd.h>
+#include <linux/mm_types.h>
+#include <linux/uaccess.h>
+
+/* ---------------------------- Register Context ---------------------------- */
+
+static void set_reg(long *uregs, u32 name, u64 value)
+{
+ switch (name) {
+ case arm_r0:
+ case arm_r1:
+ case arm_r2:
+ case arm_r3:
+ case arm_r4:
+ case arm_r5:
+ case arm_r6:
+ case arm_r7:
+ case arm_r8:
+ case arm_r9:
+ case arm_r10:
+ uregs[name] = (__u64)value;
+ break;
+ case arm_ip:
+ ARM_ip = (__u64)value;
+ break;
+ case arm_pc:
+ ARM_pc = (__u64)value;
+ break;
+ default:
+ WARN(1, "%s: Illegal register name %u\n", __func__, name);
+ break;
+ }
+}
+
+static void set_regs(long *uregs, struct trampfd_regs *tregs)
+{
+ struct trampfd_reg *reg = tregs->regs;
+ struct trampfd_reg *reg_end = reg + tregs->nregs;
+
+ for (; reg < reg_end; reg++)
+ set_reg(uregs, reg->name, reg->value);
+}
+
+/*
+ * Check if the register names are valid. Check if the user PC has been set.
+ */
+bool trampfd_valid_regs(struct trampfd_regs *tregs)
+{
+ struct trampfd_reg *reg = tregs->regs;
+ struct trampfd_reg *reg_end = reg + tregs->nregs;
+ bool pc_set = false;
+
+ for (; reg < reg_end; reg++) {
+ if (reg->name >= arm_max || reg->reserved)
+ return false;
+ if (reg->name == arm_pc && reg->value)
+ pc_set = true;
+ }
+ return pc_set;
+}
+EXPORT_SYMBOL_GPL(trampfd_valid_regs);
+
+/*
+ * Check if the PC specified in a register context is allowed.
+ */
+bool trampfd_allowed_pc(struct trampfd *trampfd, struct trampfd_regs *tregs)
+{
+ struct trampfd_reg *reg = tregs->regs;
+ struct trampfd_reg *reg_end = reg + tregs->nregs;
+ struct trampfd_values *allowed_pcs = trampfd->allowed_pcs;
+ u64 *allowed_values, pc_value = 0;
+ u32 nvalues, pc_name;
+ int i;
+
+ if (!allowed_pcs)
+ return true;
+
+ pc_name = arm_pc;
+
+ /*
+ * Find the PC register and its value. If the PC register has been
+ * specified multiple times, only the last one counts.
+ */
+ for (; reg < reg_end; reg++) {
+ if (reg->name == pc_name)
+ pc_value = reg->value;
+ }
+
+ allowed_values = allowed_pcs->values;
+ nvalues = allowed_pcs->nvalues;
+
+ for (i = 0; i < nvalues; i++) {
+ if (pc_value == allowed_values[i])
+ return true;
+ }
+ return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_allowed_pc);
+
+/* ---------------------------- Stack Context ---------------------------- */
+
+static int push_data(long *uregs, struct trampfd_stack *tstack)
+{
+ unsigned long sp;
+
+ sp = ARM_sp - tstack->size - tstack->offset;
+ if (tstack->flags & TRAMPFD_SET_SP)
+ sp &= ~7;
+
+ if (!access_ok(sp, ARM_sp - sp))
+ return -EFAULT;
+
+ if (copy_to_user(USERPTR(sp), tstack->data, tstack->size))
+ return -EFAULT;
+
+ if (tstack->flags & TRAMPFD_SET_SP)
+ ARM_sp = sp;
+ return 0;
+}
+
+/* ---------------------------- Fault Handlers ---------------------------- */
+
+static int trampfd_user_fault(struct trampfd *trampfd,
+ struct vm_area_struct *vma,
+ long *uregs)
+{
+ char buf[TRAMPFD_MAX_STACK_SIZE];
+ struct trampfd_regs *tregs;
+ struct trampfd_stack *tstack = NULL;
+ unsigned long addr;
+ size_t size;
+ int rc = 0;
+
+ mutex_lock(&trampfd->lock);
+
+ /*
+ * Execution of the trampoline must start at the offset specified by
+ * the kernel.
+ */
+ addr = vma->vm_start + trampfd->map.ioffset;
+ if (addr != ARM_pc) {
+ rc = -EINVAL;
+ goto unlock;
+ }
+
+ /*
+ * At a minimum, the user PC register must be specified for a
+ * user trampoline.
+ */
+ tregs = trampfd->regs;
+ if (!tregs) {
+ rc = -EINVAL;
+ goto unlock;
+ }
+
+ /*
+ * Set the register context for the trampoline.
+ */
+ set_regs(uregs, tregs);
+
+ if (trampfd->stack) {
+ /*
+ * Copy the stack context into a local buffer and push stack
+ * data after dropping the lock.
+ */
+ size = sizeof(*trampfd->stack) + trampfd->stack->size;
+ tstack = (struct trampfd_stack *) buf;
+ memcpy(tstack, trampfd->stack, size);
+ }
+unlock:
+ mutex_unlock(&trampfd->lock);
+
+ if (!rc && tstack) {
+ mmap_read_unlock(vma->vm_mm);
+ rc = push_data(uregs, tstack);
+ mmap_read_lock(vma->vm_mm);
+ }
+ return rc;
+}
+
+/*
+ * Handle it if it is a trampoline fault.
+ */
+bool trampfd_fault(struct vm_area_struct *vma, struct pt_regs *pt_regs)
+{
+ struct trampfd *trampfd;
+ unsigned long *uregs = pt_regs->uregs;
+
+ if (!is_trampfd_vma(vma))
+ return false;
+ trampfd = vma->vm_private_data;
+
+ if (trampfd->type == TRAMPFD_USER)
+ return !trampfd_user_fault(trampfd, vma, uregs);
+ return false;
+}
+EXPORT_SYMBOL_GPL(trampfd_fault);
+
+/* ---------------------------- Miscellaneous ---------------------------- */
+
+int trampfd_check_arch(struct trampfd *trampfd)
+{
+ return 0;
+}
+EXPORT_SYMBOL_GPL(trampfd_check_arch);
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c6550eddfce1..21a81d19336b 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -17,6 +17,7 @@
#include <linux/sched/debug.h>
#include <linux/highmem.h>
#include <linux/perf_event.h>
+#include <linux/trampfd.h>
#include <asm/system_misc.h>
#include <asm/system_info.h>
@@ -202,7 +203,8 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
static vm_fault_t __kprobes
__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
- unsigned int flags, struct task_struct *tsk)
+ unsigned int flags, struct task_struct *tsk,
+ struct pt_regs *regs)
{
struct vm_area_struct *vma;
vm_fault_t fault;
@@ -220,6 +222,12 @@ __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
*/
good_area:
if (access_error(fsr, vma)) {
+ /*
+ * If it is an execute fault, it could be a trampoline
+ * invocation.
+ */
+ if ((fsr & FSR_LNX_PF) && trampfd_fault(vma, regs))
+ return 0;
fault = VM_FAULT_BADACCESS;
goto out;
}
@@ -290,7 +298,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
#endif
}
- fault = __do_page_fault(mm, addr, fsr, flags, tsk);
+ fault = __do_page_fault(mm, addr, fsr, flags, tsk, regs);
/* If we need to retry but a fatal signal is pending, handle the
* signal first. We do not need to release the mmap_lock because
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index d5cae5ffede0..88cf4c45069a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -452,3 +452,4 @@
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
+440 common trampfd_create sys_trampfd_create
--
2.17.1
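
Reader's note on the PC check: trampfd_allowed_pc() in this patch takes the
last PC value given in the register context and looks it up in the list of
allowed PCs. A minimal user-space sketch of that rule follows. It is an
illustration only: struct reg_entry and pc_is_allowed() are hypothetical
names that stand in for the kernel types and are not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct trampfd_reg; illustration only. */
struct reg_entry {
	uint32_t name;
	uint64_t value;
};

/*
 * Return true if the last value given for register `pc_name` appears
 * in `allowed`. A NULL `allowed` list means no restriction has been
 * configured, matching the early return in trampfd_allowed_pc().
 */
static bool pc_is_allowed(const struct reg_entry *regs, size_t nregs,
			  uint32_t pc_name,
			  const uint64_t *allowed, size_t nallowed)
{
	uint64_t pc_value = 0;
	size_t i;

	if (!allowed)
		return true;

	/* Later entries override earlier ones. */
	for (i = 0; i < nregs; i++) {
		if (regs[i].name == pc_name)
			pc_value = regs[i].value;
	}

	for (i = 0; i < nallowed; i++) {
		if (pc_value == allowed[i])
			return true;
	}
	return false;
}
```

A context that names the PC twice is therefore judged only on the second
value; an earlier value that happens to be on the allowed list does not help.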