* [PATCH 1/9] exec: add a global execve counter
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
@ 2012-03-10 23:25 ` Djalal Harouni
2012-03-11 0:12 ` Linus Torvalds
2012-03-11 17:25 ` Oleg Nesterov
2012-03-10 23:25 ` [PATCH 2/9] proc: add proc_file_private struct to store private information Djalal Harouni
` (9 subsequent siblings)
10 siblings, 2 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-10 23:25 UTC (permalink / raw)
To: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel
Cc: Alan Cox, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Djalal Harouni, Vasiliy Kulikov,
Solar Designer
Procfs files and other important objects may contain sensitive information
which must not be seen, inherited or processed across execve.
The commit e268337dfe26dfc7efd422a804dbb27977a3cccc tries to fix the same
problem.
Given that consideration this patch introduces two counters:
A global atomic execve counter that will be incremented on every
do_execve_common() call, and an atomic exec_id member for the task_struct.
The task's exec_id can be used as a security check to see and to test if
some special objects belong to the appropriate process image since it will
be a uniq id on each execve operation.
There is also new helper functions:
get_task_exec_id() to get the task's exec_id and task_exec_id_ok() to
check if the provided exec_id equals the task's exec_id.
The exec_id idea is taken from the recent grsecurity patches.
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
fs/exec.c | 11 +++++++++++
include/linux/sched.h | 32 ++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+), 0 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 153dee1..fb8ccae 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1448,6 +1448,14 @@ int search_binary_handler(struct linux_binprm *bprm,struct pt_regs *regs)
EXPORT_SYMBOL(search_binary_handler);
/*
+ * A global execve counter that we need to set atomically.
+ * It will be incremented on every do_execve_common() this way we can use
+ * it to check if some special objects belong to the appropriate process
+ * image.
+ */
+static atomic64_t exec_counter = ATOMIC_INIT(0);
+
+/*
* sys_execve() executes a new program.
*/
static int do_execve_common(const char *filename,
@@ -1460,6 +1468,7 @@ static int do_execve_common(const char *filename,
struct files_struct *displaced;
bool clear_in_exec;
int retval;
+ u64 exec_id;
const struct cred *cred = current_cred();
/*
@@ -1542,6 +1551,8 @@ static int do_execve_common(const char *filename,
goto out;
/* execve succeeded */
+ exec_id = atomic64_inc_return(&exec_counter);
+ atomic64_set(¤t->exec_id, exec_id);
current->fs->in_exec = 0;
current->in_execve = 0;
acct_update_integrals(current);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0657368..762f206 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1420,6 +1420,9 @@ struct task_struct {
#endif
seccomp_t seccomp;
+/* Execve counter: will be used to check if objects belong to the appropriate
+ * process image */
+ atomic64_t exec_id;
/* Thread group tracking */
u32 parent_exec_id;
u32 self_exec_id;
@@ -1752,6 +1755,35 @@ static inline int is_global_init(struct task_struct *tsk)
return tsk->pid == 1;
}
+/**
+ * task_get_exec_id - get the task's exec_id
+ * @task: Task struct to operate on.
+ *
+ * Get the task's exec_id.
+ */
+static inline u64 get_task_exec_id(struct task_struct *task)
+{
+ return atomic64_read(&task->exec_id);
+}
+
+/**
+ * task_exec_id_ok - check if the task's exec_id equals the provided
+ * exec_id.
+ * @task: Task struct to check against.
+ * @exec_id: Execve ID.
+ *
+ * Check if the task's exec_id equals the provided exec_id. Use it to
+ * protect special objects.
+ *
+ * It will be more effective if the check is delayed as mush as possible
+ * to avoid any new execve surprises especially if we are checking against
+ * target task's exec_id.
+ */
+static inline int task_exec_id_ok(struct task_struct *task, u64 exec_id)
+{
+ return atomic64_read(&task->exec_id) == exec_id;
+}
+
/*
* is_container_init:
* check whether in the task is init in its own pid namespace.
--
1.7.1
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-10 23:25 ` [PATCH 1/9] exec: add a global execve counter Djalal Harouni
@ 2012-03-11 0:12 ` Linus Torvalds
2012-03-11 0:36 ` Linus Torvalds
` (2 more replies)
2012-03-11 17:25 ` Oleg Nesterov
1 sibling, 3 replies; 48+ messages in thread
From: Linus Torvalds @ 2012-03-11 0:12 UTC (permalink / raw)
To: Djalal Harouni
Cc: linux-kernel, kernel-hardening, Andrew Morton, Al Viro,
Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sat, Mar 10, 2012 at 3:25 PM, Djalal Harouni <tixxdz@opendz.org> wrote:
>
> Given that consideration this patch introduces two counters:
> A global atomic execve counter that will be incremented on every
> do_execve_common() call, and an atomic exec_id member for the task_struct.
This seems horribly expensive on most 32-bit architectures, including
very much x86-32. That atomic64_inc_return() is not cheap. It's
possible that it's basically an impossible operation to do atomically
on certain platforms, causing it to use some random spinlock instead.
I wonder if we couldn't make something much cheaper, since we don't
actually care about it being globally incrementing, we just care about
it being globally unique. IOW, it could easily be a 56-bit per-cpu
counter along with the CPU number in the high bits or something like
that. Avoiding the whole atomicity issue, and thus avoiding the main
reason those things are really expensive.
IOW, something like
cpu = get_cpu();
.. increment percpu 64-bit counter ..
id = counter * MAX_CPUS + cpu;
put_cpu(cpu);
or equivalent would seem to be a potentially much cheaper approach.
Linus
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 0:12 ` Linus Torvalds
@ 2012-03-11 0:36 ` Linus Torvalds
2012-03-11 0:58 ` Linus Torvalds
2012-03-11 14:03 ` Alan Cox
2012-03-11 8:39 ` Djalal Harouni
2012-03-11 9:40 ` Solar Designer
2 siblings, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2012-03-11 0:36 UTC (permalink / raw)
To: Djalal Harouni
Cc: linux-kernel, kernel-hardening, Andrew Morton, Al Viro,
Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sat, Mar 10, 2012 at 4:12 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I wonder if we couldn't make something much cheaper, since we don't
> actually care about it being globally incrementing, we just care about
> it being globally unique.
Actually, we don't even need it to be "globally unique" afaik. It
really needs to be "relatively unique", since it's always compared
within the confines of a particular task.
So the problem with using the existing "self_exec_id" is that it's
trivially the same across tasks. But a *tuple* of <struct
task_struct, thread-specific exec_id> doesn't have that issue, so as
far as I can tell we could easily make *that* the unique ID.
I wonder if the number part of exec_id would even have to be 64-bit. I
think I can do about 10000 execves per second if I make the program a
small static one - and that's on a fast CPU. And it's a per-thread
counter, so you can't scale it with lots of CPU's. So it would take
something like four days to wrap. Hmm..
Linus
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 0:36 ` Linus Torvalds
@ 2012-03-11 0:58 ` Linus Torvalds
2012-03-11 8:24 ` Solar Designer
2012-03-11 14:03 ` Alan Cox
1 sibling, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2012-03-11 0:58 UTC (permalink / raw)
To: Djalal Harouni
Cc: linux-kernel, kernel-hardening, Andrew Morton, Al Viro,
Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sat, Mar 10, 2012 at 4:36 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I wonder if the number part of exec_id would even have to be 64-bit. I
> think I can do about 10000 execves per second if I make the program a
> small static one - and that's on a fast CPU. And it's a per-thread
> counter, so you can't scale it with lots of CPU's. So it would take
> something like four days to wrap. Hmm..
Actually, using a pure counter is horrible, because even if it takes
four days to wrap, it *will* wrap, and the attacker can just count his
own execve's.
If, instead, you were to use a counter that counts *independently* of
execve's, you're much better off.
And if you use one that is free - because the CPU implements it
natively - you're even better off.
IOW, why is the exec-id just the time stamp counter (on any random cpu
- we really don't care)? That should be safe even in just 32 bits
exactly because it's not under the control of the user.
And it's zero cost for us to update.
Linus
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 0:58 ` Linus Torvalds
@ 2012-03-11 8:24 ` Solar Designer
2012-03-11 9:56 ` Ingo Molnar
0 siblings, 1 reply; 48+ messages in thread
From: Solar Designer @ 2012-03-11 8:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Djalal Harouni, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sat, Mar 10, 2012 at 04:58:01PM -0800, Linus Torvalds wrote:
> On Sat, Mar 10, 2012 at 4:36 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I wonder if the number part of exec_id would even have to be 64-bit. I
> > think I can do about 10000 execves per second if I make the program a
> > small static one - and that's on a fast CPU. And it's a per-thread
> > counter, so you can't scale it with lots of CPU's. So it would take
> > something like four days to wrap. Hmm..
>
> Actually, using a pure counter is horrible, because even if it takes
> four days to wrap, it *will* wrap, and the attacker can just count his
> own execve's.
Four days (for a 32-bit counter) is just not enough, so the counter
needs to be e.g. 64-bit as proposed. A 64-bit counter won't wrap during
lifetime of a system.
> If, instead, you were to use a counter that counts *independently* of
> execve's, you're much better off.
>
> And if you use one that is free - because the CPU implements it
> natively - you're even better off.
>
> IOW, why is the exec-id just the time stamp counter
The CPUs' timestamp counters were not designed for security. I would
not be too surprised if some implementation of a CPU architecture (maybe
emulated, maybe under a hypervisor) has such timestamp counter
granularity that we may see the same value across a second execve().
Also, we'd need extra code for archs/CPUs that lack timestamp counters.
> (on any random cpu - we really don't care)?
I'd rather not rely on the timestamp counters across CPUs being in sync,
and if they are not in sync then we may see the same value again (on one
CPU vs. another). So at least we'd need to record the CPU number as well.
> That should be safe even in just 32 bits
> exactly because it's not under the control of the user.
32 bits is just not enough even if not under control of an attacker.
If there's some low chance of hitting the same counter value on execve()
vs. procfs file access and there's no lockout on multiple counter
mismatches, an attacker may simply try that in a loop, and with 32-bit
values might have a good chance of succeeding in a reasonable amount of
time (such as a few days).
Alexander
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 8:24 ` Solar Designer
@ 2012-03-11 9:56 ` Ingo Molnar
0 siblings, 0 replies; 48+ messages in thread
From: Ingo Molnar @ 2012-03-11 9:56 UTC (permalink / raw)
To: Solar Designer
Cc: Linus Torvalds, Djalal Harouni, linux-kernel, kernel-hardening,
Andrew Morton, Al Viro, Alexey Dobriyan, Eric W. Biederman,
Vasiliy Kulikov, Kees Cook, WANG Cong, James Morris,
Oleg Nesterov, linux-security-module, linux-fsdevel, Alan Cox,
Greg KH, Stephen Wilson, Jason A. Donenfeld
* Solar Designer <solar@openwall.com> wrote:
> > Actually, using a pure counter is horrible, because even if
> > it takes four days to wrap, it *will* wrap, and the attacker
> > can just count his own execve's.
>
> Four days (for a 32-bit counter) is just not enough, so the
> counter needs to be e.g. 64-bit as proposed. A 64-bit counter
> won't wrap during lifetime of a system.
A 64-bit counter is OK on 32-bit platforms as well as long as
it's not *atomic*.
Linus's scheme of using the CPU ID for the high bits would solve
that particular problem IMO. Each CPU would have its own count
set apart in a percpu area, accessible via __this_cpu_inc() or
so.
16 bits for the CPU ID and 48 bits for the actual count should
be enough for everyone! ;-) It wraps in about 700 years, with
current CPU speeds and assuming that exec() will be relatively
slow in the future as well.
Some other kernel code might make use of such a fast, global
generation count as well, so if this is reasonably abstracted
out and named properly it would not just be a single-purpose
security facility. DEFINE_GENCOUNT() or so? [the count itself
would not be reused, of course.]
> The CPUs' timestamp counters were not designed for security.
> I would not be too surprised if some implementation of a CPU
> architecture (maybe emulated, maybe under a hypervisor) has
> such timestamp counter granularity that we may see the same
> value across a second execve().
Not using the TSC would certainly make this logic simpler and
faster - which is a big plus for any security measure.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 0:36 ` Linus Torvalds
2012-03-11 0:58 ` Linus Torvalds
@ 2012-03-11 14:03 ` Alan Cox
2012-03-11 17:15 ` Djalal Harouni
1 sibling, 1 reply; 48+ messages in thread
From: Alan Cox @ 2012-03-11 14:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Djalal Harouni, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Greg KH, Ingo Molnar,
Stephen Wilson, Jason A. Donenfeld
> I wonder if the number part of exec_id would even have to be 64-bit. I
> think I can do about 10000 execves per second if I make the program a
> small static one - and that's on a fast CPU. And it's a per-thread
> counter, so you can't scale it with lots of CPU's. So it would take
> something like four days to wrap. Hmm..
I don't think an exec id trick works that well here. It needs to bind to
the actual *object* being used and refcount it in the cases that matter,
then have a proper way of ensuring we clean up references such as a list
of objects to zap hooked to each task struct
So something like
struct foo_node {
struct list_node node;
struct proc_object *ref;
};
ref = NULL;
if (foo->ptr != NULL
ref = kref_get(foo->ptr);
And in the task exit paths walk the node list doing a kref_put/NULL
Add a suitable lock and it ought to be able to generically do that for
anything you need to clobber.
You've still got a sort of race however, just as the proposed execid base
code. You can pass the fd and access the proc function *as* the exec
occurs. Assuming your ref counting is valid and you use new objects after
the exec that ought to just mean you get the data for the old mm struct,
which seems fine to me. It's logically equivalent to having asked a
microsecond before the exec rather than during it.
Alan
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 14:03 ` Alan Cox
@ 2012-03-11 17:15 ` Djalal Harouni
0 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-11 17:15 UTC (permalink / raw)
To: Alan Cox
Cc: Linus Torvalds, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Greg KH, Ingo Molnar,
Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 02:03:10PM +0000, Alan Cox wrote:
> > I wonder if the number part of exec_id would even have to be 64-bit. I
> > think I can do about 10000 execves per second if I make the program a
> > small static one - and that's on a fast CPU. And it's a per-thread
> > counter, so you can't scale it with lots of CPU's. So it would take
> > something like four days to wrap. Hmm..
>
> I don't think an exec id trick works that well here. It needs to bind to
> the actual *object* being used and refcount it in the cases that matter,
> then have a proper way of ensuring we clean up references such as a list
> of objects to zap hooked to each task struct
>
> So something like
>
>
> struct foo_node {
> struct list_node node;
> struct proc_object *ref;
> };
>
> ref = NULL;
> if (foo->ptr != NULL
> ref = kref_get(foo->ptr);
>
> And in the task exit paths walk the node list doing a kref_put/NULL
Here I assume that you are talking about the target task.
Yes but proc inode which are created on the fly can also be released by
the reader before the target exits, so do we want to walk the node list
each time release is called by the reader ?
The current implementation tries to track target, but as noted in the
other thread we can just track the reader and in this case we do not need
atomic for task_struct nor for the proc_file_private, a simple u64
comparison will do the job
But perhaps what you propose is better, I'll try to think more about it.
> Add a suitable lock and it ought to be able to generically do that for
> anything you need to clobber.
>
> You've still got a sort of race however, just as the proposed execid base
> code. You can pass the fd and access the proc function *as* the exec
IMO the proposed patch do not suffer from this race if we do the propre
permission checks just after setting the exec_id at open, or do permission
checks and then check exec_id at read. And here I'm talking about target
tracking. If we just check reader (current) then I assume that there are no
room for races.
> occurs. Assuming your ref counting is valid and you use new objects after
> the exec that ought to just mean you get the data for the old mm struct,
> which seems fine to me. It's logically equivalent to having asked a
> microsecond before the exec rather than during it.
>
> Alan
Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
tixxdz
http://opendz.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 0:12 ` Linus Torvalds
2012-03-11 0:36 ` Linus Torvalds
@ 2012-03-11 8:39 ` Djalal Harouni
2012-03-11 9:40 ` Solar Designer
2 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-11 8:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, kernel-hardening, Andrew Morton, Al Viro,
Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sat, Mar 10, 2012 at 04:12:20PM -0800, Linus Torvalds wrote:
> On Sat, Mar 10, 2012 at 3:25 PM, Djalal Harouni <tixxdz@opendz.org> wrote:
> >
> > Given that consideration this patch introduces two counters:
> > A global atomic execve counter that will be incremented on every
> > do_execve_common() call, and an atomic exec_id member for the task_struct.
>
> This seems horribly expensive on most 32-bit architectures, including
> very much x86-32. That atomic64_inc_return() is not cheap. It's
> possible that it's basically an impossible operation to do atomically
> on certain platforms, causing it to use some random spinlock instead.
Yes it will use spinlocks, and before that we also hold the
current->signal->cred_guard_mutex during all the do_execve_common(), not
to mention the search_binary_handler() logic which will test binary
format... it can even call request_module()
There is the first:
exec_id = atomic64_inc_return(&exec_counter); /* increment global */
and the second one:
atomic64_set(¤t->exec_id, exec_id); /* set task exec_id */
I've added this since when we track the target task we do not want to race
against it when it does its execve call or when we set the exec_id at
open time of the /proc/<pid>/* files.
I can remove this second one but in this situation we must track only
reader (current), which is an aggressive behaviour, actually it may be
safer to do so. I've made the target track change to fit the desired
behaviour of "bind files to their process image".
And when we are at it, we should only allow reading of /proc/<pid>/mem, at
least it's only info leak in case there are future changes that can affect
the current protections.
--
tixxdz
http://opendz.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 0:12 ` Linus Torvalds
2012-03-11 0:36 ` Linus Torvalds
2012-03-11 8:39 ` Djalal Harouni
@ 2012-03-11 9:40 ` Solar Designer
2 siblings, 0 replies; 48+ messages in thread
From: Solar Designer @ 2012-03-11 9:40 UTC (permalink / raw)
To: Linus Torvalds
Cc: Djalal Harouni, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sat, Mar 10, 2012 at 04:12:20PM -0800, Linus Torvalds wrote:
> On Sat, Mar 10, 2012 at 3:25 PM, Djalal Harouni <tixxdz@opendz.org> wrote:
> >
> > Given that consideration this patch introduces two counters:
> > A global atomic execve counter that will be incremented on every
> > do_execve_common() call, and an atomic exec_id member for the task_struct.
>
> This seems horribly expensive on most 32-bit architectures, including
> very much x86-32. That atomic64_inc_return() is not cheap. It's
> possible that it's basically an impossible operation to do atomically
> on certain platforms, causing it to use some random spinlock instead.
I think these are _relatively_ cheap considering that it's execve().
If we can do e.g. 10 million of atomic64_inc_return()'s per second (and
I think we can do a lot more, although I did not benchmark this), but
only 10000 of execve()'s per second, then the performance impact is 0.1%.
I admit that 0.1% is significant, but I used worst-case guesstimates
here; the actual number might be more like 0.01% (50 million vs. 5 thousand)
or even 0.002% (200 million vs. 4 thousand). (200 million is 15 CPU
cycles at 3 GHz, which may be a reasonable optimistic estimate.
4 thousand is a realistic execve() speed for dynamically-linked programs.)
> I wonder if we couldn't make something much cheaper, since we don't
> actually care about it being globally incrementing, we just care about
> it being globally unique. IOW, it could easily be a 56-bit per-cpu
> counter along with the CPU number in the high bits or something like
> that. Avoiding the whole atomicity issue, and thus avoiding the main
> reason those things are really expensive.
>
> IOW, something like
>
> cpu = get_cpu();
> .. increment percpu 64-bit counter ..
> id = counter * MAX_CPUS + cpu;
> put_cpu(cpu);
>
> or equivalent would seem to be a potentially much cheaper approach.
This makes sense to me. We just need to ensure that the per-CPU counter
is still large enough that it won't wrap. 56 bits is enough
(considering that the attack has to run on a single CPU of the target
system, unlike e.g. an attack on a cipher with a 56-bit key could be),
but there's little room for reducing this further (such as to support
many more than 256 CPUs in a system). So if we use this approach, maybe
we should simply keep a 64-bit per-CPU counter and a 32-bit CPU number,
for a 96-bit id. This may even be faster on execve() (no need to
combine the two values into one).
I don't know which one of these approaches has lower overhead on current
systems. My _guess_ is that atomic64_inc_return() may well be faster in
many cases.
Alexander
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-10 23:25 ` [PATCH 1/9] exec: add a global execve counter Djalal Harouni
2012-03-11 0:12 ` Linus Torvalds
@ 2012-03-11 17:25 ` Oleg Nesterov
2012-03-11 17:49 ` self_exec_id/parent_exec_id && CLONE_PARENT Oleg Nesterov
` (2 more replies)
1 sibling, 3 replies; 48+ messages in thread
From: Oleg Nesterov @ 2012-03-11 17:25 UTC (permalink / raw)
To: Djalal Harouni
Cc: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On 03/11, Djalal Harouni wrote:
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1420,6 +1420,9 @@ struct task_struct {
> #endif
> seccomp_t seccomp;
>
> +/* Execve counter: will be used to check if objects belong to the appropriate
> + * process image */
> + atomic64_t exec_id;
> /* Thread group tracking */
> u32 parent_exec_id;
> u32 self_exec_id;
Well, I don't think it is right to add this counter into task_struct.
It should be per-process, signal_struct makes more sense. Or may be
mm_struct.
Btw this is also true for parent_exec_id/self_exec_id, but this is
another story.
Oleg.
^ permalink raw reply [flat|nested] 48+ messages in thread
* self_exec_id/parent_exec_id && CLONE_PARENT
2012-03-11 17:25 ` Oleg Nesterov
@ 2012-03-11 17:49 ` Oleg Nesterov
2012-03-11 18:02 ` Linus Torvalds
2012-03-11 22:48 ` [PATCH 1/9] exec: add a global execve counter Linus Torvalds
2012-03-11 23:36 ` Djalal Harouni
2 siblings, 1 reply; 48+ messages in thread
From: Oleg Nesterov @ 2012-03-11 17:49 UTC (permalink / raw)
To: Djalal Harouni, Linus Torvalds, Alan Cox
Cc: linux-kernel, kernel-hardening, Andrew Morton, Al Viro,
Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, linux-security-module,
linux-fsdevel, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld
(change subject)
On 03/11, Oleg Nesterov wrote:
>
> Well, I don't think it is right to add this counter into task_struct.
>
> It should be per-process, signal_struct makes more sense. Or may be
> mm_struct.
>
> Btw this is also true for parent_exec_id/self_exec_id, but this is
> another story.
In fact I think it would be nice to kill parent_exec_id/self_exec_id.
Afaics, this only problem is clone(CLONE_PARENT | SIGXXX). I expect
the answer is "no, can break existing applications", but I'll ask
anyway.
Can't we change this? IOW, can't we modify copy_process
- p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL);
+ p->exit_signal =
+ (clone_flags & CLONE_THREAD) ? -1 :
+ (clobe_flags & CLONE_PARENT) ? current->group_leader->exit_signal :
+ (clone_flags & CSIGNAL);
(or simply use SIGCHLD instead of group_leader->exit_signal).
Then we can kill parent_exec_id/self_exec_id if me modify de_thread()
to set ->exit_signal = SIGCHLD for every child.
I am also asking because the change above looks like the fix to me.
The child must not control its ->exit_signal, it is the parent who
decides which signal the child should use for notification.
And to me, clone(CLONE_PARENT | SIGXXX) looks like a violation of
rule above.
Oleg.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: self_exec_id/parent_exec_id && CLONE_PARENT
2012-03-11 17:49 ` self_exec_id/parent_exec_id && CLONE_PARENT Oleg Nesterov
@ 2012-03-11 18:02 ` Linus Torvalds
2012-03-11 18:37 ` richard -rw- weinberger
2012-03-14 18:55 ` [PATCH 0/1] (Was: self_exec_id/parent_exec_id && CLONE_PARENT) Oleg Nesterov
0 siblings, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2012-03-11 18:02 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Djalal Harouni, Alan Cox, linux-kernel, kernel-hardening,
Andrew Morton, Al Viro, Alexey Dobriyan, Eric W. Biederman,
Vasiliy Kulikov, Kees Cook, Solar Designer, WANG Cong,
James Morris, linux-security-module, linux-fsdevel, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 10:49 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> I am also asking because the change above looks like the fix to me.
> The child must not control its ->exit_signal, it is the parent who
> decides which signal the child should use for notification.
>
> And to me, clone(CLONE_PARENT | SIGXXX) looks like a violation of
> rule above.
SIGXXX is for doing things like AIO with threads, but it would never
be used together with CLONE_PARENT, that would be odd and wrong.
So I think we could disallow that - or at least try. See if anybody
notices, and if it breaks anything.
The rule about the Linux ABI is not that the ABI is set in stone. It's
that we can't break any existing binaries. And *maybe* there are users
of CLONE_PARENT and special signals, but it sounds unlikely and would
probably confuse real programs. So feel free to just try it (early in
the 3.4 merge window - not at this point, though).
Linus
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: self_exec_id/parent_exec_id && CLONE_PARENT
2012-03-11 18:02 ` Linus Torvalds
@ 2012-03-11 18:37 ` richard -rw- weinberger
2012-03-11 18:39 ` Oleg Nesterov
2012-03-14 18:55 ` [PATCH 0/1] (Was: self_exec_id/parent_exec_id && CLONE_PARENT) Oleg Nesterov
1 sibling, 1 reply; 48+ messages in thread
From: richard -rw- weinberger @ 2012-03-11 18:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, Djalal Harouni, Alan Cox, linux-kernel,
kernel-hardening, Andrew Morton, Al Viro, Alexey Dobriyan,
Eric W. Biederman, Vasiliy Kulikov, Kees Cook, Solar Designer,
WANG Cong, James Morris, linux-security-module, linux-fsdevel,
Greg KH, Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 7:02 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, Mar 11, 2012 at 10:49 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> I am also asking because the change above looks like the fix to me.
>> The child must not control its ->exit_signal, it is the parent who
>> decides which signal the child should use for notification.
>>
>> And to me, clone(CLONE_PARENT | SIGXXX) looks like a violation of
>> rule above.
>
> SIGXXX is for doing things like AIO with threads, but it would never
> be used together with CLONE_PARENT, that would be odd and wrong.
>
> So I think we could disallow that - or at least try. See if anybody
> notices, and if it breaks anything.
>
UserModeLinux is using CLONE_PARENT | CLONE_FILES | SIGCHLD.
Is this a problem?
--
Thanks,
//richard
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: self_exec_id/parent_exec_id && CLONE_PARENT
2012-03-11 18:37 ` richard -rw- weinberger
@ 2012-03-11 18:39 ` Oleg Nesterov
0 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2012-03-11 18:39 UTC (permalink / raw)
To: richard -rw- weinberger
Cc: Linus Torvalds, Djalal Harouni, Alan Cox, linux-kernel,
kernel-hardening, Andrew Morton, Al Viro, Alexey Dobriyan,
Eric W. Biederman, Vasiliy Kulikov, Kees Cook, Solar Designer,
WANG Cong, James Morris, linux-security-module, linux-fsdevel,
Greg KH, Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On 03/11, richard -rw- weinberger wrote:
>
> On Sun, Mar 11, 2012 at 7:02 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Sun, Mar 11, 2012 at 10:49 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >>
> >> I am also asking because the change above looks like the fix to me.
> >> The child must not control its ->exit_signal, it is the parent who
> >> decides which signal the child should use for notification.
> >>
> >> And to me, clone(CLONE_PARENT | SIGXXX) looks like a violation of
> >> rule above.
> >
> > SIGXXX is for doing things like AIO with threads, but it would never
> > be used together with CLONE_PARENT, that would be odd and wrong.
> >
> > So I think we could disallow that - or at least try. See if anybody
> > notices, and if it breaks anything.
> >
>
> UserModeLinux is using CLONE_PARENT | CLONE_FILES | SIGCHLD.
> Is this a problem?
This depends on how the forking process was created, and what its
parent expects.
If it was forked with SIGCHLD as well, then no problems. Otherwise
the parent can notice the difference.
Oleg.
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 0/1] (Was: self_exec_id/parent_exec_id && CLONE_PARENT)
2012-03-11 18:02 ` Linus Torvalds
2012-03-11 18:37 ` richard -rw- weinberger
@ 2012-03-14 18:55 ` Oleg Nesterov
2012-03-14 18:55 ` [PATCH 1/1] CLONE_PARENT shouldn't allow to set ->exit_signal Oleg Nesterov
1 sibling, 1 reply; 48+ messages in thread
From: Oleg Nesterov @ 2012-03-14 18:55 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds
Cc: Djalal Harouni, Alan Cox, linux-kernel, kernel-hardening, Al Viro,
Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, linux-security-module,
linux-fsdevel, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Roland McGrath
On 03/11, Linus Torvalds wrote:
>
> On Sun, Mar 11, 2012 at 10:49 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > I am also asking because the change above looks like the fix to me.
> > The child must not control its ->exit_signal, it is the parent who
> > decides which signal the child should use for notification.
> >
> > And to me, clone(CLONE_PARENT | SIGXXX) looks like a violation of
> > rule above.
>
> SIGXXX is for doing things like AIO with threads, but it would never
> be used together with CLONE_PARENT, that would be odd and wrong.
>
> So I think we could disallow that - or at least try. See if anybody
> notices, and if it breaks anything.
>
> The rule about the Linux ABI is not that the ABI is set in stone. It's
> that we can't break any existing binaries. And *maybe* there are users
> of CLONE_PARENT and special signals, but it sounds unlikely and would
> probably confuse real programs. So feel free to just try it (early in
> the 3.4 merge window - not at this point, though).
OK, nobody seems to object.
Andrew, could you take this patch?
As for self_exec_id/parent_exec_id, this needs cleanups and fixes
in any case. But perhaps we can kill them after this change.
Oleg.
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 1/1] CLONE_PARENT shouldn't allow to set ->exit_signal
2012-03-14 18:55 ` [PATCH 0/1] (Was: self_exec_id/parent_exec_id && CLONE_PARENT) Oleg Nesterov
@ 2012-03-14 18:55 ` Oleg Nesterov
2012-03-18 18:25 ` Linus Torvalds
0 siblings, 1 reply; 48+ messages in thread
From: Oleg Nesterov @ 2012-03-14 18:55 UTC (permalink / raw)
To: Andrew Morton, Linus Torvalds
Cc: Djalal Harouni, Alan Cox, linux-kernel, kernel-hardening, Al Viro,
Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, linux-security-module,
linux-fsdevel, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Roland McGrath
The child must not control its ->exit_signal, it is the parent who
decides which signal the child should use for notification.
This means that CLONE_PARENT should not use "clone_flags & CSIGNAL",
the forking task is the sibling of the new process and their parent
doesn't control exit_signal in this case.
This patch uses ->exit_signal of the forking process, but perhaps
we should simply use SIGCHLD.
We read group_leader->exit_signal lockless, this can race with the
ORIGINAL_SIGNAL -> SIGCHLD transition, but this is fine.
Potentially this change allows to kill self_exec_id/parent_exec_id.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/fork.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index b77fd55..34d7ed1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1310,7 +1310,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
clear_all_latency_tracing(p);
/* ok, now we should be set up.. */
- p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL);
+ if (clone_flags & CLONE_THREAD)
+ p->exit_signal = -1;
+ else if (clone_flags & CLONE_PARENT)
+ p->exit_signal = current->group_leader->exit_signal;
+ else
+ p->exit_signal = (clone_flags & CSIGNAL);
+
p->pdeath_signal = 0;
p->exit_state = 0;
--
1.5.5.1
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 1/1] CLONE_PARENT shouldn't allow to set ->exit_signal
2012-03-14 18:55 ` [PATCH 1/1] CLONE_PARENT shouldn't allow to set ->exit_signal Oleg Nesterov
@ 2012-03-18 18:25 ` Linus Torvalds
2012-03-18 20:53 ` Oleg Nesterov
0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2012-03-18 18:25 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Djalal Harouni, Alan Cox, linux-kernel,
kernel-hardening, Al Viro, Alexey Dobriyan, Eric W. Biederman,
Vasiliy Kulikov, Kees Cook, Solar Designer, WANG Cong,
James Morris, linux-security-module, linux-fsdevel, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld, Roland McGrath
On Wed, Mar 14, 2012 at 11:55 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> + if (clone_flags & CLONE_THREAD)
> + p->exit_signal = -1;
> + else if (clone_flags & CLONE_PARENT)
> + p->exit_signal = current->group_leader->exit_signal;
> + else
> + p->exit_signal = (clone_flags & CSIGNAL);
So why is it "current->group_leader->exit_signal" rather than the much
more logical (imho) and simpler "current->exit_signal"?
Linus
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/1] CLONE_PARENT shouldn't allow to set ->exit_signal
2012-03-18 18:25 ` Linus Torvalds
@ 2012-03-18 20:53 ` Oleg Nesterov
0 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2012-03-18 20:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, Djalal Harouni, Alan Cox, linux-kernel,
kernel-hardening, Al Viro, Alexey Dobriyan, Eric W. Biederman,
Vasiliy Kulikov, Kees Cook, Solar Designer, WANG Cong,
James Morris, linux-security-module, linux-fsdevel, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld, Roland McGrath
On 03/18, Linus Torvalds wrote:
>
> On Wed, Mar 14, 2012 at 11:55 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> > + if (clone_flags & CLONE_THREAD)
> > + p->exit_signal = -1;
> > + else if (clone_flags & CLONE_PARENT)
> > + p->exit_signal = current->group_leader->exit_signal;
> > + else
> > + p->exit_signal = (clone_flags & CSIGNAL);
>
> So why is it "current->group_leader->exit_signal" rather than the much
> more logical (imho) and simpler "current->exit_signal"?
This would be wrong if current is not the main thread. In this
case current->exit_signal = -1. Only group_leader has the "real"
exit_signal used for notification.
Historically "exit_signal = -1" meant different things, currently
it only means "I am not the leader", see thread_group_leader().
I'll write another email tomorrow. Yes, I do remember I promised
the security fixes in this area, sorry for delay.
Oleg.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 17:25 ` Oleg Nesterov
2012-03-11 17:49 ` self_exec_id/parent_exec_id && CLONE_PARENT Oleg Nesterov
@ 2012-03-11 22:48 ` Linus Torvalds
2012-03-11 23:32 ` Djalal Harouni
2012-03-11 23:36 ` Djalal Harouni
2 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2012-03-11 22:48 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Djalal Harouni, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 10:25 AM, Oleg Nesterov <oleg@redhat.com> wrote:
>
> It should be per-process, signal_struct makes more sense. Or may be
> mm_struct.
I do wonder if we shouldn't just consider the "struct mm_struct"
pointer to *be* the unique exec ID. It's what /proc/pid/mem does, and
it works fine, and allows us to just use a normal pointer as the
unique ID.
Just increment the mm_count for the thing, and hold a reference to it,
and now you're all done.
By definition an execve() will change the mm struct, and if you have a
refcount to the old one, it won't be re-used. And it's not a huge
allocation, although it would definitely be good to put that thing on
a diet.
And as long as we use 'mm_count', not 'mm_users', it won't pin
anything else in memory. That's part of the whole point of the
doubly-refcounted thing.
Linus
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 22:48 ` [PATCH 1/9] exec: add a global execve counter Linus Torvalds
@ 2012-03-11 23:32 ` Djalal Harouni
2012-03-11 23:42 ` Linus Torvalds
0 siblings, 1 reply; 48+ messages in thread
From: Djalal Harouni @ 2012-03-11 23:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 03:48:02PM -0700, Linus Torvalds wrote:
> On Sun, Mar 11, 2012 at 10:25 AM, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > It should be per-process, signal_struct makes more sense. Or may be
> > mm_struct.
>
> I do wonder if we shouldn't just consider the "struct mm_struct"
> pointer to *be* the unique exec ID. It's what /proc/pid/mem does, and
> it works fine, and allows us to just use a normal pointer as the
> unique ID.
>
> Just increment the mm_count for the thing, and hold a reference to it,
> and now you're all done.
Please Linus have you checked the:
[PATCH 9/9] proc: improve and clean up /proc/<pid>/mem protection
That keeping the mm struct wont work, since it will eat memory and the
OOM-killer will kill some innocent processes, and the abuse can only be
catched by the VFS.
What's your opinion on it ?
And if we do the same for all the /proc/<pid>/* files and even if you
clear the VM of the old process, some other information will still be
available, even if it is not useful, this is not consistent.
Is it ok if the code keeps different objects related to dead processes
alive ?
--
tixxdz
http://opendz.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 23:32 ` Djalal Harouni
@ 2012-03-11 23:42 ` Linus Torvalds
2012-03-12 0:25 ` Djalal Harouni
0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2012-03-11 23:42 UTC (permalink / raw)
To: Djalal Harouni
Cc: Oleg Nesterov, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 4:32 PM, Djalal Harouni <tixxdz@opendz.org> wrote:
>>
>> Just increment the mm_count for the thing, and hold a reference to it,
>> and now you're all done.
> Please Linus have you checked the:
> [PATCH 9/9] proc: improve and clean up /proc/<pid>/mem protection
>
> That keeping the mm struct wont work, since it will eat memory and the
> OOM-killer will kill some innocent processes, and the abuse can only be
> catched by the VFS.
That's the point. I made the mistake of using mm_users initially, but
ysing mm_count - which is what I said to use (and what Oleg fixed
things to in commit 6d08f2c71397) should *not* have that problem. It
just keeps the 'struct mm_struct' itself around.
> What's your opinion on it ?
What's the advantage? You replace it with *another* allocation, and a
64-bit thing that is much less useful.
The size of the patch also speaks for itself:
fs/proc/base.c | 99 ++++++++++++++++++++++++++++++++++++++++++++------------
and it's more complex and uses more memory on average (the refcount
thing is *free* for usual cases).
I do agree that it would be nicer if mm_struct was a bit smaller, but
at the same time, I really don't see the advantage of replacing it
with another allocation entirely that makes the code just more
complicated.
Linus
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 23:42 ` Linus Torvalds
@ 2012-03-12 0:25 ` Djalal Harouni
2012-03-12 10:11 ` Linus Torvalds
0 siblings, 1 reply; 48+ messages in thread
From: Djalal Harouni @ 2012-03-12 0:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 04:42:37PM -0700, Linus Torvalds wrote:
> On Sun, Mar 11, 2012 at 4:32 PM, Djalal Harouni <tixxdz@opendz.org> wrote:
> >>
> >> Just increment the mm_count for the thing, and hold a reference to it,
> >> and now you're all done.
> > Please Linus have you checked the:
> > [PATCH 9/9] proc: improve and clean up /proc/<pid>/mem protection
> >
> > That keeping the mm struct wont work, since it will eat memory and the
> > OOM-killer will kill some innocent processes, and the abuse can only be
> > catched by the VFS.
>
> That's the point. I made the mistake of using mm_users initially, but
> ysing mm_count - which is what I said to use (and what Oleg fixed
> things to in commit 6d08f2c71397) should *not* have that problem. It
> just keeps the 'struct mm_struct' itself around.
And that mm_struct will explode and only the VFS will catch it.
Given 1024 processes * (RLIMIT_NOFILE 1024 - 3) == ~1020000
more than 1020000 mm structs (all of dead processes ?)
A quick test on a default ubuntu:
cat /proc/sys/fs/file-max
388411
So we are able to keep around 388411 dead mm_struct in memory, just try it.
Our embedded devices will suffer, serial login will be killed, getty, ...
ssh root owned ... I've experienced this.
I'll try to post more numbers.
> > What's your opinion on it ?
>
> What's the advantage? You replace it with *another* allocation, and a
That allocation is much smaller and it can be used to protect all
/proc/<pid>/* files in a unified way. Consistency seems the right thing
to do, even the /proc/<pid>/{syscall,stack,...} that do not operate on mm
structs will be protected. I don't know how you will protect these files ?
We can even remove the allocation and just reference the exec_id or use
other fields of the file struct.
I'm must admit that grsecurity just did the check in the seq files, they
did not add anything, this can leave some other files unprotected, but it
is cheaper.
> 64-bit thing that is much less useful.
Well, it protect as from wraps.
>From the previous discussions it seems that perhaps we can work around it
or have an agreement.
> The size of the patch also speaks for itself:
>
> fs/proc/base.c | 99 ++++++++++++++++++++++++++++++++++++++++++++------------
That's not fair since most of the logic of /proc/<pid>/* files is in
fs/proc/base.c and we can clean the code, there are some duplicated
functions in my patch, which we can improve.
> and it's more complex and uses more memory on average (the refcount
> thing is *free* for usual cases).
>
> I do agree that it would be nicer if mm_struct was a bit smaller, but
> at the same time, I really don't see the advantage of replacing it
> with another allocation entirely that makes the code just more
> complicated.
As I've said we can improve it.
I'll take all the feedback I got and re-work the patches, in the meantime
some of these sensitive files need to be protected especially the
/proc/<pid>/{maps,smaps,numa_maps}
I'll re-submit.
Thanks for the feedback.
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
tixxdz
http://opendz.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-12 0:25 ` Djalal Harouni
@ 2012-03-12 10:11 ` Linus Torvalds
2012-03-12 14:01 ` Djalal Harouni
0 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2012-03-12 10:11 UTC (permalink / raw)
To: Djalal Harouni
Cc: Oleg Nesterov, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 5:25 PM, Djalal Harouni <tixxdz@opendz.org> wrote:
> On Sun, Mar 11, 2012 at 04:42:37PM -0700, Linus Torvalds wrote:
>> That's the point. I made the mistake of using mm_users initially, but
>> ysing mm_count - which is what I said to use (and what Oleg fixed
>> things to in commit 6d08f2c71397) should *not* have that problem. It
>> just keeps the 'struct mm_struct' itself around.
> And that mm_struct will explode and only the VFS will catch it.
>
> Given 1024 processes * (RLIMIT_NOFILE 1024 - 3) == ~1020000
>
> more than 1020000 mm structs (all of dead processes ?)
>
> A quick test on a default ubuntu:
> cat /proc/sys/fs/file-max
> 388411
>
> So we are able to keep around 388411 dead mm_struct in memory, just try it.
Umm.
I think your argument is totally braindead and wrong.
My counter-argument is very simple: "So what?"
Those mm_structs are small. They are something like a couple of
hundred bytes. If you really worry about open files, you should worry
about the size of the inode, and people using the "pipe()" system
call. Then you have those open files with an inode, *and* several kB
of data that can be trivially filled by the user with a simple
"write()" that they never need read.
So "struct mm_struct" is totally irrelevant, and not in any way a
special thing. It's not the biggest, it's not the most interesting,
and it's simply not interesting. You're barking up the wrong tree.
> Our embedded devices will suffer, serial login will be killed, getty, ...
> ssh root owned ... I've experienced this.
None of it has anything to do with 'struct mm_struct', though, has it?
I suspect the real thing to do is to just make the OOM killer look at
how many files are open too. Make each open file count as 4kB (or
more), and use it when deciding what to kill. Fix the actual real
problem instead of trying to fix one small detail - and one that isn't
even the right small detail.
Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-12 10:11 ` Linus Torvalds
@ 2012-03-12 14:01 ` Djalal Harouni
0 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-12 14:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Oleg Nesterov, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Mon, Mar 12, 2012 at 03:11:43AM -0700, Linus Torvalds wrote:
> On Sun, Mar 11, 2012 at 5:25 PM, Djalal Harouni <tixxdz@opendz.org> wrote:
> > On Sun, Mar 11, 2012 at 04:42:37PM -0700, Linus Torvalds wrote:
> >> That's the point. I made the mistake of using mm_users initially, but
> >> ysing mm_count - which is what I said to use (and what Oleg fixed
> >> things to in commit 6d08f2c71397) should *not* have that problem. It
> >> just keeps the 'struct mm_struct' itself around.
> > And that mm_struct will explode and only the VFS will catch it.
> >
> > Given 1024 processes * (RLIMIT_NOFILE 1024 - 3) == ~1020000
> >
> > more than 1020000 mm structs (all of dead processes ?)
> >
> > A quick test on a default ubuntu:
> > cat /proc/sys/fs/file-max
> > 388411
> >
> > So we are able to keep around 388411 dead mm_struct in memory, just try it.
>
> Umm.
>
> I think your argument is totally braindead and wrong.
>
> My counter-argument is very simple: "So what?"
I thought that keeping these dead bytes alive is not the best solution,
this is why we have all these counters...
My first solution to protect these files was to emulate what you are
currently suggesting, your patch + Olge's patch. I've written a similar
patch to protect /proc/<pid>/{maps,smaps} but later after more thoughts,
an atomic operation (which I'll re-work) and a comparison (without
allocations) seems more benefical if we achieve the same protection.
I must say that I'm confused here:
Sometimes we try to align structs and later we just ignore this ?
> Those mm_structs are small. They are something like a couple of
/sys/kernel/slab/mm_struct/object_size on 64bit gives:
1232
> hundred bytes. If you really worry about open files, you should worry
> about the size of the inode, and people using the "pipe()" system
> call. Then you have those open files with an inode, *and* several kB
> of data that can be trivially filled by the user with a simple
> "write()" that they never need read.
Ok, yes I see.
To clarify: I was worrying about the downsides of keeping dead data alive.
> So "struct mm_struct" is totally irrelevant, and not in any way a
> special thing. It's not the biggest, it's not the most interesting,
> and it's simply not interesting. You're barking up the wrong tree.
Yes it's not the biggest but it's related to the problem, any way here we
are not trying to fix mm_struct, yes we are protecting procfs files.
Perhaps another simple/clean patch will do the job, otherwise protecting
these files is the priority. Thank you for the reminder.
These files need protections and do not operate on mm:
/proc/<pid>/{syscall,stack}
And we need to add the open file operation to most of the /proc/<pid>/*
sensitive files.
> > Our embedded devices will suffer, serial login will be killed, getty, ...
> > ssh root owned ... I've experienced this.
>
> None of it has anything to do with 'struct mm_struct', though, has it?
>
> I suspect the real thing to do is to just make the OOM killer look at
> how many files are open too. Make each open file count as 4kB (or
> more), and use it when deciding what to kill. Fix the actual real
> problem instead of trying to fix one small detail - and one that isn't
> even the right small detail.
With 64 processes or less OOM killer can start killing innocent processes.
(and what about legitimate open files ?)
Thanks for the comments.
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
tixxdz
http://opendz.org
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 17:25 ` Oleg Nesterov
2012-03-11 17:49 ` self_exec_id/parent_exec_id && CLONE_PARENT Oleg Nesterov
2012-03-11 22:48 ` [PATCH 1/9] exec: add a global execve counter Linus Torvalds
@ 2012-03-11 23:36 ` Djalal Harouni
2012-03-12 14:34 ` Oleg Nesterov
2 siblings, 1 reply; 48+ messages in thread
From: Djalal Harouni @ 2012-03-11 23:36 UTC (permalink / raw)
To: Oleg Nesterov
Cc: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 06:25:12PM +0100, Oleg Nesterov wrote:
> On 03/11, Djalal Harouni wrote:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1420,6 +1420,9 @@ struct task_struct {
> > #endif
> > seccomp_t seccomp;
> >
> > +/* Execve counter: will be used to check if objects belong to the appropriate
> > + * process image */
> > + atomic64_t exec_id;
> > /* Thread group tracking */
> > u32 parent_exec_id;
> > u32 self_exec_id;
>
> Well, I don't think it is right to add this counter into task_struct.
>
> It should be per-process, signal_struct makes more sense. Or may be
> mm_struct.
Some /proc/<pid>/{syscall,stack,...} do not operate on mm_struct so why we
should add the: "acquire a reference to mm, get exec_id and mmput".
For the signal_struct currently I don't know, from a comment it seems that
signal_struct can be shared! I don't know.
> Btw this is also true for parent_exec_id/self_exec_id, but this is
> another story.
Yes someone who knows this should check it.
> Oleg.
Thanks.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
tixxdz
http://opendz.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/9] exec: add a global execve counter
2012-03-11 23:36 ` Djalal Harouni
@ 2012-03-12 14:34 ` Oleg Nesterov
0 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2012-03-12 14:34 UTC (permalink / raw)
To: Djalal Harouni
Cc: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On 03/12, Djalal Harouni wrote:
>
> On Sun, Mar 11, 2012 at 06:25:12PM +0100, Oleg Nesterov wrote:
> >
> > Well, I don't think it is right to add this counter into task_struct.
> >
> > It should be per-process, signal_struct makes more sense. Or may be
> > mm_struct.
> Some /proc/<pid>/{syscall,stack,...} do not operate on mm_struct so why we
> should add the: "acquire a reference to mm, get exec_id and mmput".
This could be simpler, just read the counter under task_lock(). And
unless I misread the next patches syscall/stack can use current->mm
lockless.
OK, nevermind.
> For the signal_struct currently I don't know, from a comment it seems that
> signal_struct can be shared!
Yes, it is shared, and that is why it makes sense for the per-process
data. All threads in the thread group (process) have the same ->signal.
And unlike ->mm, ->signal survives after exec.
Oleg.
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 2/9] proc: add proc_file_private struct to store private information
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
2012-03-10 23:25 ` [PATCH 1/9] exec: add a global execve counter Djalal Harouni
@ 2012-03-10 23:25 ` Djalal Harouni
2012-03-10 23:25 ` [PATCH 3/9] proc: new proc_exec_id_ok() helper function Djalal Harouni
` (8 subsequent siblings)
10 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-10 23:25 UTC (permalink / raw)
To: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel
Cc: Alan Cox, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Djalal Harouni, Vasiliy Kulikov,
Solar Designer
In order to store and to track the private information specific to each
procfs opened file we convert the 'proc_maps_private' struct which is used
by the /proc/$pid/{maps,smaps,numa_maps} files to 'proc_file_private'
which can now be used by all the /proc/<pid>/* and even other /proc/* files
to store private information that must be kept across syscalls.
The proc_file_private struct includes an exec_id member which can be used
to track and to protect special and sensitive procfs files. In general it
will be set at open time to the task's exec_id to bind the file to this
task.
The proc_file_private struct offers the way to make the checks and to
bind special procfs files to the appropriate task in a consistent and
unified way.
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
fs/proc/internal.h | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletions(-)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 2925775..a8968f6 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -61,10 +61,19 @@ extern const struct file_operations proc_pagemap_operations;
extern const struct file_operations proc_net_operations;
extern const struct inode_operations proc_net_inode_operations;
-struct proc_maps_private {
+/*
+ * Internal proc_file_private is used to track procfs files being
+ * processed especially the ones that varies during runtime.
+ */
+struct proc_file_private {
+ /* The Execve ID of the task, use this to protect special procfs
+ * files. Must be set at open time. */
+ u64 exec_id;
struct pid *pid;
+ struct inode *inode;
struct task_struct *task;
#ifdef CONFIG_MMU
+ /* For /proc/pid/{maps,smaps...} */
struct vm_area_struct *tail_vma;
#endif
};
--
1.7.1
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 3/9] proc: new proc_exec_id_ok() helper function
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
2012-03-10 23:25 ` [PATCH 1/9] exec: add a global execve counter Djalal Harouni
2012-03-10 23:25 ` [PATCH 2/9] proc: add proc_file_private struct to store private information Djalal Harouni
@ 2012-03-10 23:25 ` Djalal Harouni
2012-03-10 23:25 ` [PATCH 4/9] proc: protect /proc/<pid>/* INF files from reader across execve Djalal Harouni
` (7 subsequent siblings)
10 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-10 23:25 UTC (permalink / raw)
To: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel
Cc: Alan Cox, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Djalal Harouni
proc_exec_id_ok() will be used for /proc/<pid>/* files checks to see if the
task's exec_id equals the exec_id of the proc_file_private struct.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
fs/proc/internal.h | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+), 0 deletions(-)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index a8968f6..bed27a8 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -10,6 +10,7 @@
*/
#include <linux/proc_fs.h>
+#include <linux/sched.h>
extern struct proc_dir_entry proc_root;
#ifdef CONFIG_PROC_SYSCTL
@@ -95,6 +96,25 @@ static inline int proc_fd(struct inode *inode)
return PROC_I(inode)->fd;
}
+/**
+ * proc_exec_id_ok - check if the task's exec_id equals the exec_id of
+ * the proc_file_private.
+ * @task: Task struct to check against.
+ * @proc_private: The proc_file_private struct.
+ *
+ * Check if the exec_id of the two structs are equal. Use it to protect
+ * special procfs files when the fd is passed to a new execve (i.e. suid)
+ *
+ * It will be more effective if the check is delayed as mush as possible
+ * to avoid any new execve surprises especially if we are checking against
+ * target task's exec_id.
+ */
+static inline int proc_exec_id_ok(struct task_struct *task,
+ struct proc_file_private *proc_priv)
+{
+ return task_exec_id_ok(task, proc_priv->exec_id);
+}
+
struct dentry *proc_lookup_de(struct proc_dir_entry *de, struct inode *ino,
struct dentry *dentry);
int proc_readdir_de(struct proc_dir_entry *de, struct file *filp, void *dirent,
--
1.7.1
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 4/9] proc: protect /proc/<pid>/* INF files from reader across execve
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
` (2 preceding siblings ...)
2012-03-10 23:25 ` [PATCH 3/9] proc: new proc_exec_id_ok() helper function Djalal Harouni
@ 2012-03-10 23:25 ` Djalal Harouni
2012-03-10 23:25 ` [PATCH 5/9] proc: add protection support for /proc/<pid>/* ONE files Djalal Harouni
` (6 subsequent siblings)
10 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-10 23:25 UTC (permalink / raw)
To: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel
Cc: Alan Cox, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Djalal Harouni, Vasiliy Kulikov,
Solar Designer
Protect all the INF files /proc/$pid/{auxv,limits,syscall,cmdline,io,...}
from reader across execve. A process can open its /proc/self/* INF files
and exec a setuid or a privileged program, then this last one will use the
passed fd and leak its own sensitive information.
Since there are multiple INF files this patch provides a solution without
any internal processing logic change.
The exec_id of the proc_file_private will be set to the reader's exec_id
instead of the target's exec_id. Using the target's exec_id should be the
natural choice since this will bind these files to their task, but in
order to do that we must do permission checks (ptrace) at each syscall, and
currently these checks are done at read time and only for some important
procfs files. By using the current's (reader) exec_id we are sure that the
reader process at read time did not perform an execve.
All the /proc/<pid>/* INF files are protected by a one hit check just
before returning results to userspace.
This is an aggressive check compared to the target's exec_id check.
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
fs/proc/base.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d4548dd..387e637 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -610,15 +610,42 @@ static const struct inode_operations proc_def_inode_operations = {
#define PROC_BLOCK_SIZE (3*1024) /* 4K page size but our output routines use some slack for overruns */
+static int proc_info_open(struct inode *inode, struct file *filp)
+{
+ struct proc_file_private *priv;
+ int ret = -ENOMEM;
+
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return ret;
+
+ /*
+ * Use the exec_id of current to protect INF files. We do this
+ * since there are no permission checks (ptrace...) at open, they
+ * are done at read time, and only for the sensitive files.
+ * By using the current's exec_id we are sure that the reader process
+ * did not change.
+ */
+ priv->exec_id = get_task_exec_id(current);
+ filp->private_data = priv;
+
+ return 0;
+}
+
static ssize_t proc_info_read(struct file * file, char __user * buf,
size_t count, loff_t *ppos)
{
struct inode * inode = file->f_path.dentry->d_inode;
+ struct proc_file_private *priv = file->private_data;
unsigned long page;
- ssize_t length;
- struct task_struct *task = get_proc_task(inode);
+ ssize_t length = 0;
+ struct task_struct *task;
+
+ if (!priv)
+ return length;
length = -ESRCH;
+ task = get_proc_task(inode);
if (!task)
goto out_no_task;
@@ -631,8 +658,16 @@ static ssize_t proc_info_read(struct file * file, char __user * buf,
length = PROC_I(inode)->op.proc_read(task, (char*)page);
+ /* Check delayed */
+ if (!proc_exec_id_ok(current, priv)) {
+ length = 0;
+ goto out_free;
+ }
+
if (length >= 0)
length = simple_read_from_buffer(buf, count, ppos, (char *)page, length);
+
+out_free:
free_page(page);
out:
put_task_struct(task);
@@ -640,9 +675,19 @@ out_no_task:
return length;
}
+static int proc_info_release(struct inode *inode, struct file *filp)
+{
+ struct proc_file_private *priv = filp->private_data;
+
+ kfree(priv);
+ return 0;
+}
+
static const struct file_operations proc_info_file_operations = {
+ .open = proc_info_open,
.read = proc_info_read,
.llseek = generic_file_llseek,
+ .release = proc_info_release,
};
static int proc_single_show(struct seq_file *m, void *v)
--
1.7.1
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 5/9] proc: add protection support for /proc/<pid>/* ONE files
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
` (3 preceding siblings ...)
2012-03-10 23:25 ` [PATCH 4/9] proc: protect /proc/<pid>/* INF files from reader across execve Djalal Harouni
@ 2012-03-10 23:25 ` Djalal Harouni
2012-03-10 23:25 ` [PATCH 6/9] proc: protect /proc/<pid>/* ONE files from reader across execve Djalal Harouni
` (5 subsequent siblings)
10 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-10 23:25 UTC (permalink / raw)
To: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel
Cc: Alan Cox, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Djalal Harouni, Vasiliy Kulikov,
Solar Designer
Sensitive information of the /proc/<pid>/* ONE files can be leaked if a
process opens its own /proc/self/* files and exec a setuid or a privileged
program, this last one will read from the fd and leak its own sensitive
data.
Since there are multiple ONE files this patch will use the following
logic: the exec_id of the proc_file_private will be set to the reader's
exec_id instead of the target's exec_id. Using the target's exec_id should
be the natural choice since this will bind these files to their task, but
in order to do that we must do permission checks (ptrace) at each syscall,
and currently these checks are done at read time and only for some
important procfs files. By using the current's (reader) exec_id we are sure
that the reader process at read time did not perform an execve.
Currently this is the list of the ONE files that will use this protection:
/proc/<pid>/{personality,stat,stack} and each file will have its own
check. This will be provided by the next patch.
This is an aggressive check compared to the target's exec_id check.
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
fs/proc/base.c | 36 ++++++++++++++++++++++++++++++++----
1 files changed, 32 insertions(+), 4 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 387e637..6df8ddd 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -692,12 +692,17 @@ static const struct file_operations proc_info_file_operations = {
static int proc_single_show(struct seq_file *m, void *v)
{
- struct inode *inode = m->private;
+ struct proc_file_private *priv = m->private;
+ struct inode *inode;
struct pid_namespace *ns;
struct pid *pid;
struct task_struct *task;
- int ret;
+ int ret = 0;
+ if (!priv)
+ return ret;
+
+ inode = priv->inode;
ns = inode->i_sb->s_fs_info;
pid = proc_pid(inode);
task = get_pid_task(pid, PIDTYPE_PID);
@@ -712,14 +717,37 @@ static int proc_single_show(struct seq_file *m, void *v)
static int proc_single_open(struct inode *inode, struct file *filp)
{
- return single_open(filp, proc_single_show, inode);
+ struct proc_file_private *priv;
+ int ret = -ENOMEM;
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (priv) {
+ /* Protect ONE files with current's exec_id */
+ priv->exec_id = get_task_exec_id(current);
+ ret = single_open(filp, proc_single_show, priv);
+ if (!ret)
+ priv->inode = inode;
+ else
+ kfree(priv);
+ }
+ return ret;
+}
+
+static int proc_single_release(struct inode *inode, struct file *filp)
+{
+ struct seq_file *seq = filp->private_data;
+ int ret = 0;
+ if (seq) {
+ kfree(seq->private);
+ ret = single_release(inode, filp);
+ }
+ return ret;
}
static const struct file_operations proc_single_file_operations = {
.open = proc_single_open,
.read = seq_read,
.llseek = seq_lseek,
- .release = single_release,
+ .release = proc_single_release,
};
static int mem_open(struct inode* inode, struct file* file)
--
1.7.1
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 6/9] proc: protect /proc/<pid>/* ONE files from reader across execve
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
` (4 preceding siblings ...)
2012-03-10 23:25 ` [PATCH 5/9] proc: add protection support for /proc/<pid>/* ONE files Djalal Harouni
@ 2012-03-10 23:25 ` Djalal Harouni
2012-03-10 23:25 ` [PATCH 7/9] proc: protect /proc/<pid>/{maps,smaps,numa_maps} Djalal Harouni
` (4 subsequent siblings)
10 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-10 23:25 UTC (permalink / raw)
To: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel
Cc: Alan Cox, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Djalal Harouni, Vasiliy Kulikov,
Solar Designer
Protect the /proc/<pid>/{stack,stat,personality} ONE files from reader
across execve by checking the exec_id of the reader.
The check is against current's exec_id since there are no permission
checks (ptrace) at open time and currently we do not want to change the
internal logic of these files, so we just make sure that the reader process
did not perform an execve at read time.
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
fs/proc/array.c | 4 ++++
fs/proc/base.c | 34 +++++++++++++++++++++++-----------
2 files changed, 27 insertions(+), 11 deletions(-)
diff --git a/fs/proc/array.c b/fs/proc/array.c
index c602b8d..20b6f5e 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -381,6 +381,10 @@ static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
state = *get_task_state(task);
vsize = eip = esp = 0;
permitted = ptrace_may_access(task, PTRACE_MODE_READ | PTRACE_MODE_NOAUDIT);
+
+ if (!proc_exec_id_ok(current, m->private))
+ return 0;
+
mm = get_task_mm(task);
if (mm) {
vsize = task_vsize(mm);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 6df8ddd..0a5383e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -305,6 +305,7 @@ static void unlock_trace(struct task_struct *task)
static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
+ struct proc_file_private *priv = m->private;
struct stack_trace trace;
unsigned long *entries;
int err;
@@ -320,17 +321,24 @@ static int proc_pid_stack(struct seq_file *m, struct pid_namespace *ns,
trace.skip = 0;
err = lock_trace(task);
- if (!err) {
- save_stack_trace_tsk(task, &trace);
+ if (err)
+ goto out_free;
- for (i = 0; i < trace.nr_entries; i++) {
- seq_printf(m, "[<%pK>] %pS\n",
- (void *)entries[i], (void *)entries[i]);
- }
- unlock_trace(task);
+ err = 0;
+ if (!proc_exec_id_ok(current, priv))
+ goto out_unlock;
+
+ save_stack_trace_tsk(task, &trace);
+
+ for (i = 0; i < trace.nr_entries; i++) {
+ seq_printf(m, "[<%pK>] %pS\n",
+ (void *)entries[i], (void *)entries[i]);
}
- kfree(entries);
+out_unlock:
+ unlock_trace(task);
+out_free:
+ kfree(entries);
return err;
}
#endif
@@ -3021,10 +3029,14 @@ static int proc_pid_personality(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
int err = lock_trace(task);
- if (!err) {
+ if (err)
+ return err;
+
+ err = 0;
+ if (proc_exec_id_ok(current, m->private))
seq_printf(m, "%08x\n", task->personality);
- unlock_trace(task);
- }
+
+ unlock_trace(task);
return err;
}
--
1.7.1
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 7/9] proc: protect /proc/<pid>/{maps,smaps,numa_maps}
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
` (5 preceding siblings ...)
2012-03-10 23:25 ` [PATCH 6/9] proc: protect /proc/<pid>/* ONE files from reader across execve Djalal Harouni
@ 2012-03-10 23:25 ` Djalal Harouni
2012-03-10 23:25 ` [PATCH 8/9] proc: protect /proc/<pid>/{environ,pagemap} across execve Djalal Harouni
` (3 subsequent siblings)
10 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-10 23:25 UTC (permalink / raw)
To: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel
Cc: Alan Cox, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Djalal Harouni, Vasiliy Kulikov,
Solar Designer
Protect the /proc/<pid>/{maps,smaps,numa_maps} files from reader across
execve by checking its exec_id.
The best solution should be to bind these files to their task by using the
target's exec_id and to perform permission checks at each syscall, but
currently this is not possible, it will break glibc FORTIFY_SOURCE
protection. For the moment the exec_id check is against the reader to be
sure that the reader process is not playing tricks and did not perform an
execve at read time.
Use the new 'proc_file_private' struct to protect these sensitive files,
this struct can store and handle all the
/proc/<pid>/{maps,smaps,numa_maps} internal data.
The proc_exec_id_ok() check is performed inside the functions that are
responsible of constructing and reporting results.
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
fs/proc/task_mmu.c | 37 ++++++++++++++++++++++++++-----------
1 files changed, 26 insertions(+), 11 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7dcd2a2..96a0e4a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -90,7 +90,7 @@ static void pad_len_spaces(struct seq_file *m, int len)
seq_printf(m, "%*c", len, ' ');
}
-static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
+static void vma_stop(struct proc_file_private *priv, struct vm_area_struct *vma)
{
if (vma && vma != priv->tail_vma) {
struct mm_struct *mm = vma->vm_mm;
@@ -101,7 +101,7 @@ static void vma_stop(struct proc_maps_private *priv, struct vm_area_struct *vma)
static void *m_start(struct seq_file *m, loff_t *pos)
{
- struct proc_maps_private *priv = m->private;
+ struct proc_file_private *priv = m->private;
unsigned long last_addr = m->version;
struct mm_struct *mm;
struct vm_area_struct *vma, *tail_vma = NULL;
@@ -168,7 +168,7 @@ out:
static void *m_next(struct seq_file *m, void *v, loff_t *pos)
{
- struct proc_maps_private *priv = m->private;
+ struct proc_file_private *priv = m->private;
struct vm_area_struct *vma = v;
struct vm_area_struct *tail_vma = priv->tail_vma;
@@ -181,7 +181,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *pos)
static void m_stop(struct seq_file *m, void *v)
{
- struct proc_maps_private *priv = m->private;
+ struct proc_file_private *priv = m->private;
struct vm_area_struct *vma = v;
if (!IS_ERR(vma))
@@ -193,15 +193,20 @@ static void m_stop(struct seq_file *m, void *v)
static int do_maps_open(struct inode *inode, struct file *file,
const struct seq_operations *ops)
{
- struct proc_maps_private *priv;
+ struct proc_file_private *priv;
int ret = -ENOMEM;
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (priv) {
- priv->pid = proc_pid(inode);
+ /*
+ * Use current's exec_id since currently there are no
+ * permission checks at open.
+ */
+ priv->exec_id = get_task_exec_id(current);
ret = seq_open(file, ops);
if (!ret) {
struct seq_file *m = file->private_data;
m->private = priv;
+ priv->pid = proc_pid(inode);
} else {
kfree(priv);
}
@@ -278,9 +283,12 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
static int show_map(struct seq_file *m, void *v)
{
struct vm_area_struct *vma = v;
- struct proc_maps_private *priv = m->private;
+ struct proc_file_private *priv = m->private;
struct task_struct *task = priv->task;
+ if (!proc_exec_id_ok(current, priv))
+ return 0;
+
show_map_vma(m, vma);
if (m->count < m->size) /* vma is copied successfully */
@@ -424,7 +432,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
static int show_smap(struct seq_file *m, void *v)
{
- struct proc_maps_private *priv = m->private;
+ struct proc_file_private *priv = m->private;
struct task_struct *task = priv->task;
struct vm_area_struct *vma = v;
struct mem_size_stats mss;
@@ -434,6 +442,9 @@ static int show_smap(struct seq_file *m, void *v)
.private = &mss,
};
+ if (!proc_exec_id_ok(current, priv))
+ return 0;
+
memset(&mss, 0, sizeof mss);
mss.vma = vma;
/* mmap_sem is held in m_start */
@@ -878,7 +889,7 @@ struct numa_maps {
};
struct numa_maps_private {
- struct proc_maps_private proc_maps;
+ struct proc_file_private proc_maps;
struct numa_maps md;
};
@@ -1005,7 +1016,7 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
static int show_numa_map(struct seq_file *m, void *v)
{
struct numa_maps_private *numa_priv = m->private;
- struct proc_maps_private *proc_priv = &numa_priv->proc_maps;
+ struct proc_file_private *proc_priv = &numa_priv->proc_maps;
struct vm_area_struct *vma = v;
struct numa_maps *md = &numa_priv->md;
struct file *file = vma->vm_file;
@@ -1018,6 +1029,9 @@ static int show_numa_map(struct seq_file *m, void *v)
if (!mm)
return 0;
+ if (!proc_exec_id_ok(current, proc_priv))
+ return 0;
+
/* Ensure we start with an empty set of numa_maps statistics. */
memset(md, 0, sizeof(*md));
@@ -1097,11 +1111,12 @@ static int numa_maps_open(struct inode *inode, struct file *file)
int ret = -ENOMEM;
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (priv) {
- priv->proc_maps.pid = proc_pid(inode);
+ priv->proc_maps.exec_id = get_task_exec_id(current);
ret = seq_open(file, &proc_pid_numa_maps_op);
if (!ret) {
struct seq_file *m = file->private_data;
m->private = priv;
+ priv->proc_maps.pid = proc_pid(inode);
} else {
kfree(priv);
}
--
1.7.1
^ permalink raw reply related [flat|nested] 48+ messages in thread
* [PATCH 8/9] proc: protect /proc/<pid>/{environ,pagemap} across execve
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
` (6 preceding siblings ...)
2012-03-10 23:25 ` [PATCH 7/9] proc: protect /proc/<pid>/{maps,smaps,numa_maps} Djalal Harouni
@ 2012-03-10 23:25 ` Djalal Harouni
2012-03-11 8:05 ` Alexey Dobriyan
2012-03-10 23:25 ` [PATCH 9/9] proc: improve and clean up /proc/<pid>/mem protection Djalal Harouni
` (2 subsequent siblings)
10 siblings, 1 reply; 48+ messages in thread
From: Djalal Harouni @ 2012-03-10 23:25 UTC (permalink / raw)
To: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel
Cc: Alan Cox, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Djalal Harouni
The /proc/<pid>/{environ,pagemap} are sensitive files which must be
protected across execve to avoid information leaks.
These files are protected by attaching them to their task at open time by
saving the exec_id of the target task, this way in read we just compare
the target task's exec_id and the previously saved exec_id of the
proc_file_private struct, in other words we just bind these files to their
appropriate process image at open time. We do this since we are able to do
proper permission checks (ptrace) at each syscall, so we do not care about
the reader.
Another important rule is to set the exec_id of the target task before the
permission checks at open, this way we do not race against target task
execve, and it will be more effective if the exec_id check at read/write
times are delayed as much as possible to be sure that the target task do
not change during execve.
This patch adds the open file_operation to the
/proc/<pid>/{environ,pagemap} so we are able to set the exec_id of the
target task and to do the appropriate permission checks. The exec_id check
is done in the related read file_operation.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
fs/proc/base.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++--
fs/proc/task_mmu.c | 69 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 148 insertions(+), 6 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 0a5383e..536cd65 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -882,15 +882,66 @@ static const struct file_operations proc_mem_operations = {
.release = mem_release,
};
+static int environ_open(struct inode *inode, struct file *filp)
+{
+ struct proc_file_private *priv;
+ struct mm_struct *mm;
+ struct task_struct *task;
+ int ret = -ENOMEM;
+
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return ret;
+
+ ret = -ESRCH;
+ task = get_proc_task(filp->f_path.dentry->d_inode);
+ if (!task)
+ goto out_free;
+
+ /*
+ * Use target task's exec_id to bind the file to the task.
+ * Setup exec_id first then check for permissions.
+ */
+ priv->exec_id = get_task_exec_id(task);
+ mm = mm_for_maps(task);
+ put_task_struct(task);
+
+ if (!mm) {
+ ret = -ENOENT;
+ goto out_free;
+ }
+
+ if (IS_ERR(mm)) {
+ ret = PTR_ERR(mm);
+ goto out_free;
+ }
+
+ filp->private_data = priv;
+ /* do not pin mm */
+ mmput(mm);
+
+ return 0;
+
+out_free:
+ kfree(priv);
+ return ret;
+}
+
static ssize_t environ_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
- struct task_struct *task = get_proc_task(file->f_dentry->d_inode);
+ struct proc_file_private *priv = file->private_data;
+ struct task_struct *task;
char *page;
unsigned long src = *ppos;
- int ret = -ESRCH;
+ int ret = 0;
struct mm_struct *mm;
+ if (!priv)
+ return ret;
+
+ ret = -ESRCH;
+ task = get_proc_task(file->f_dentry->d_inode);
if (!task)
goto out_no_task;
@@ -901,9 +952,15 @@ static ssize_t environ_read(struct file *file, char __user *buf,
mm = mm_for_maps(task);
- ret = PTR_ERR(mm);
- if (!mm || IS_ERR(mm))
+ if (!mm) {
+ ret = -ENOENT;
+ goto out_free;
+ }
+
+ if (IS_ERR(mm)) {
+ ret = PTR_ERR(mm);
goto out_free;
+ }
ret = 0;
while (count > 0) {
@@ -925,6 +982,16 @@ static ssize_t environ_read(struct file *file, char __user *buf,
break;
}
+ /*
+ * Delay the check since access_process_vm() operates
+ * on a task.
+ */
+ if (!proc_exec_id_ok(task, priv)) {
+ /* There was an execv */
+ ret = 0;
+ break;
+ }
+
if (copy_to_user(buf, page, retval)) {
ret = -EFAULT;
break;
@@ -946,9 +1013,19 @@ out_no_task:
return ret;
}
+static int environ_release(struct inode *inode, struct file *filp)
+{
+ struct proc_file_private *priv = filp->private_data;
+
+ kfree(priv);
+ return 0;
+}
+
static const struct file_operations proc_environ_operations = {
+ .open = environ_open,
.read = environ_read,
.llseek = generic_file_llseek,
+ .release = environ_release,
};
static ssize_t oom_adjust_read(struct file *file, char __user *buf,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 96a0e4a..5b00969 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -770,13 +770,59 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
*/
#define PAGEMAP_WALK_SIZE (PMD_SIZE)
#define PAGEMAP_WALK_MASK (PMD_MASK)
+static int pagemap_open(struct inode *inode, struct file *filp)
+{
+ struct proc_file_private *priv;
+ struct mm_struct *mm;
+ struct task_struct *task;
+ int ret = -ENOMEM;
+
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return ret;
+
+ ret = -ESRCH;
+ task = get_proc_task(filp->f_path.dentry->d_inode);
+ if (!task)
+ goto out_free;
+
+ /*
+ * Use target task's exec_id to bind the file to the task.
+ * Setup exec_id first then check for permissions.
+ */
+ priv->exec_id = get_task_exec_id(task);
+ mm = mm_for_maps(task);
+ put_task_struct(task);
+
+ if (!mm) {
+ ret = -ENOENT;
+ goto out_free;
+ }
+
+ if (IS_ERR(mm)) {
+ ret = PTR_ERR(mm);
+ goto out_free;
+ }
+
+ filp->private_data = priv;
+ /* do not pin mm */
+ mmput(mm);
+
+ return 0;
+
+out_free:
+ kfree(priv);
+ return ret;
+}
+
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
- struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+ struct proc_file_private *priv = file->private_data;
+ struct task_struct *task;
struct mm_struct *mm;
struct pagemapread pm;
- int ret = -ESRCH;
+ int ret = 0;
struct mm_walk pagemap_walk = {};
unsigned long src;
unsigned long svpfn;
@@ -784,6 +830,11 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
unsigned long end_vaddr;
int copied = 0;
+ if (!priv)
+ return ret;
+
+ ret = -ESRCH;
+ task = get_proc_task(file->f_path.dentry->d_inode);
if (!task)
goto out;
@@ -831,6 +882,10 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
* will stop when we hit the end of the buffer.
*/
ret = 0;
+ if (!proc_exec_id_ok(task, priv))
+ /* There was an execve */
+ goto out_mm;
+
while (count && (start_vaddr < end_vaddr)) {
int len;
unsigned long end;
@@ -868,9 +923,19 @@ out:
return ret;
}
+static int pagemap_release(struct inode *inode, struct file *filp)
+{
+ struct proc_file_private *priv = filp->private_data;
+
+ kfree(priv);
+ return 0;
+}
+
const struct file_operations proc_pagemap_operations = {
+ .open = pagemap_open,
.llseek = mem_lseek, /* borrow this */
.read = pagemap_read,
+ .release = pagemap_release,
};
#endif /* CONFIG_PROC_PAGE_MONITOR */
--
1.7.1
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 8/9] proc: protect /proc/<pid>/{environ,pagemap} across execve
2012-03-10 23:25 ` [PATCH 8/9] proc: protect /proc/<pid>/{environ,pagemap} across execve Djalal Harouni
@ 2012-03-11 8:05 ` Alexey Dobriyan
2012-03-11 17:01 ` Djalal Harouni
0 siblings, 1 reply; 48+ messages in thread
From: Alexey Dobriyan @ 2012-03-11 8:05 UTC (permalink / raw)
To: Djalal Harouni
Cc: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 12:25:18AM +0100, Djalal Harouni wrote:
> The /proc/<pid>/{environ,pagemap} are sensitive files which must be
> protected across execve to avoid information leaks.
>
> These files are protected by attaching them to their task at open time by
> saving the exec_id of the target task, this way in read we just compare
> the target task's exec_id and the previously saved exec_id of the
> proc_file_private struct, in other words we just bind these files to their
> appropriate process image at open time. We do this since we are able to do
> proper permission checks (ptrace) at each syscall, so we do not care about
> the reader.
>
> Another important rule is to set the exec_id of the target task before the
> permission checks at open, this way we do not race against target task
> execve, and it will be more effective if the exec_id check at read/write
> times are delayed as much as possible to be sure that the target task do
> not change during execve.
>
> This patch adds the open file_operation to the
> /proc/<pid>/{environ,pagemap} so we are able to set the exec_id of the
> target task and to do the appropriate permission checks. The exec_id check
> is done in the related read file_operation.
->open is duplicated.
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> +static int environ_open(struct inode *inode, struct file *filp)
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 8/9] proc: protect /proc/<pid>/{environ,pagemap} across execve
2012-03-11 8:05 ` Alexey Dobriyan
@ 2012-03-11 17:01 ` Djalal Harouni
0 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-11 17:01 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sun, Mar 11, 2012 at 11:05:23AM +0300, Alexey Dobriyan wrote:
> On Sun, Mar 11, 2012 at 12:25:18AM +0100, Djalal Harouni wrote:
> > The /proc/<pid>/{environ,pagemap} are sensitive files which must be
> > protected across execve to avoid information leaks.
> >
> > These files are protected by attaching them to their task at open time by
> > saving the exec_id of the target task, this way in read we just compare
> > the target task's exec_id and the previously saved exec_id of the
> > proc_file_private struct, in other words we just bind these files to their
> > appropriate process image at open time. We do this since we are able to do
> > proper permission checks (ptrace) at each syscall, so we do not care about
> > the reader.
> >
> > Another important rule is to set the exec_id of the target task before the
> > permission checks at open, this way we do not race against target task
> > execve, and it will be more effective if the exec_id check at read/write
> > times are delayed as much as possible to be sure that the target task do
> > not change during execve.
> >
> > This patch adds the open file_operation to the
> > /proc/<pid>/{environ,pagemap} so we are able to set the exec_id of the
> > target task and to do the appropriate permission checks. The exec_id check
> > is done in the related read file_operation.
>
> ->open is duplicated.
Right, I'll unify the code in a generic open function that does:
* alloc and setup proc_file_private (which includes the exec_id)
* ptrace check using mm_for_maps()
unify only those who check PTRACE_MODE_READ
* save priv_file_proc.
This applies also to the release functions, I'll re-submit it.
Thanks Alexey.
--
tixxdz
http://opendz.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH 9/9] proc: improve and clean up /proc/<pid>/mem protection
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
` (7 preceding siblings ...)
2012-03-10 23:25 ` [PATCH 8/9] proc: protect /proc/<pid>/{environ,pagemap} across execve Djalal Harouni
@ 2012-03-10 23:25 ` Djalal Harouni
2012-03-11 0:01 ` [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Linus Torvalds
2012-03-12 19:13 ` [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Eric W. Biederman
10 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-10 23:25 UTC (permalink / raw)
To: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel
Cc: Alan Cox, Greg KH, Ingo Molnar, Stephen Wilson,
Jason A. Donenfeld, Djalal Harouni
The /proc/<pid>/mem is a sensitive file which needs proper protection.
Commit e268337dfe26dfc7efd422a804dbb27977a3cccc changes the handling of
the /proc/<pid>/mem file by attaching and tracking the VM of the target
process at open time which means that even after an execve the old VM will
still be referenced.
Commit 6d08f2c7139790c268820a2e590795cb8333181a makes sure that the VM of
the target process is not pinned which will prevent rlimits bypass but the
internal kernel mm_struct will be.
The mm_struct is a big struct and this can have negative effects on the
machine especially low memory ones if a process keep /proc/<pid>/mem open,
fork and re-exec where the /proc/<pid>/mem is related to a dead or a
zombie process.
Some tests regarding the current implementation:
1) slabtop results with 32 processes each one with 1000 opened file:
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
26272 26270 99% 1.56K 1634 20 52288K mm_struct
26685 26681 99% 0.69K 1251 23 20016K filp
...
108 103 95% 5.20K 18 6 576K task_struct
The number of mm_structs do not reflect the number of task_structs, and
all these mm_structs are related to zombie/dead processes. In this case
with 32 processes on low memory the OOM killer will be triggered killing
other processes.
2) With enough memory and with a higher number of processes:
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
34233 31414 91% 1.25K 2858 12 45728K mm_struct
34944 32062 91% 0.38K 1664 21 13312K filp
44835 41973 93% 0.19K 2135 21 8540K kmalloc-192
...
294 163 55% 4.88K 49 6 1568K task_struct
In this (2) situation we reach the VFS file-max value
/proc/sys/fs/file-max which was 34869 on the tested system.
But It is not the job of the VFS to block this behaviour, and systems with
higher file-max values will suffer.
Based on the above considerations and if we can achieve the same
/proc/<pid>/mem protection then we should go for it.
Use the proc_file_private and its exec_id to track /proc/<pid>/mem usage.
Set the exec_id to the target task's exec_id at open time, and later
compare it on each read and write calls before processing the VM of the
target which means we bind the /proc/<pid>/mem to its exec_id process at
open time. Permission checks are also done on each syscall so we do not
care about the reader, only the target is tracked and its mm_struct is not
pinned, this means that on each syscall readers/writers will operate on
the right mm_struct.
We must perform permission checks at each syscall.
The appropriate exec_id of the proc_file_private must be set at the right
time just before doing the permission checks at open time, this way we do
not race against target task execve. The exec_id checks at read and write
time must also be done at the right time after permission checks when the
mm is already grabbed.
Signed-off-by: Djalal Harouni <tixxdz@opendz.org>
---
fs/proc/base.c | 99 ++++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 78 insertions(+), 21 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 536cd65..a9cc73c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -760,50 +760,99 @@ static const struct file_operations proc_single_file_operations = {
static int mem_open(struct inode* inode, struct file* file)
{
- struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
+ struct proc_file_private *priv;
+ struct task_struct *task;
struct mm_struct *mm;
+ int ret = -ENOMEM;
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return ret;
+
+ ret = -ESRCH;
+ task = get_proc_task(file->f_path.dentry->d_inode);
if (!task)
- return -ESRCH;
+ goto out_free;
+ /*
+ * Use target task's exec_id to bind the current opened
+ * /proc/<pid>/mem file to its process image.
+ *
+ * Setup exec_id first then check for permissions.
+ */
+ priv->exec_id = get_task_exec_id(task);
mm = mm_access(task, PTRACE_MODE_ATTACH);
put_task_struct(task);
- if (IS_ERR(mm))
- return PTR_ERR(mm);
+ if (!mm) {
+ ret = -ENOENT;
+ goto out_free;
+ }
- if (mm) {
- /* ensure this mm_struct can't be freed */
- atomic_inc(&mm->mm_count);
- /* but do not pin its memory */
- mmput(mm);
+ if (IS_ERR(mm)) {
+ ret = PTR_ERR(mm);
+ goto out_free;
}
/* OK to pass negative loff_t, we can catch out-of-range */
file->f_mode |= FMODE_UNSIGNED_OFFSET;
- file->private_data = mm;
+ file->private_data = priv;
+
+ mmput(mm);
return 0;
+
+out_free:
+ kfree(priv);
+ return ret;
}
static ssize_t mem_rw(struct file *file, char __user *buf,
size_t count, loff_t *ppos, int write)
{
- struct mm_struct *mm = file->private_data;
+ struct proc_file_private *priv = file->private_data;
+ struct task_struct *task;
unsigned long addr = *ppos;
- ssize_t copied;
+ ssize_t copied = 0;
char *page;
+ struct mm_struct *mm;
- if (!mm)
- return 0;
+ if (!priv)
+ return copied;
+
+ copied = -ESRCH;
+ task = get_proc_task(file->f_dentry->d_inode);
+ if (!task)
+ goto out_no_task;
+ copied = -ENOMEM;
page = (char *)__get_free_page(GFP_TEMPORARY);
if (!page)
- return -ENOMEM;
+ goto out;
+
+ mm = mm_access(task, PTRACE_MODE_ATTACH);
+ if (!mm) {
+ copied = -ENOENT;
+ goto out_free;
+ }
+
+ if (IS_ERR(mm)) {
+ copied = PTR_ERR(mm);
+ goto out_free;
+ }
copied = 0;
- if (!atomic_inc_not_zero(&mm->mm_users))
- goto free;
+ /*
+ * See if the processed /proc/<pid>/mem file belongs to the
+ * original opened /proc/<pid>/ process image and that there was
+ * no new execve.
+ *
+ * Do the check here since access_remote_vm() will operate
+ * on the already grabbed mm.
+ */
+ if (!proc_exec_id_ok(task, priv))
+ /* There was an execv */
+ goto out_mm;
while (count > 0) {
int this_len = min_t(int, count, PAGE_SIZE);
@@ -813,6 +862,10 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
break;
}
+ /*
+ * Do not change this: we must operate on the previously
+ * grabbed mm.
+ */
this_len = access_remote_vm(mm, addr, page, this_len, write);
if (!this_len) {
if (!copied)
@@ -832,9 +885,13 @@ static ssize_t mem_rw(struct file *file, char __user *buf,
}
*ppos = addr;
+out_mm:
mmput(mm);
-free:
+out_free:
free_page((unsigned long) page);
+out:
+ put_task_struct(task);
+out_no_task:
return copied;
}
@@ -868,9 +925,9 @@ loff_t mem_lseek(struct file *file, loff_t offset, int orig)
static int mem_release(struct inode *inode, struct file *file)
{
- struct mm_struct *mm = file->private_data;
- if (mm)
- mmdrop(mm);
+ struct proc_file_private *priv = file->private_data;
+
+ kfree(priv);
return 0;
}
--
1.7.1
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
` (8 preceding siblings ...)
2012-03-10 23:25 ` [PATCH 9/9] proc: improve and clean up /proc/<pid>/mem protection Djalal Harouni
@ 2012-03-11 0:01 ` Linus Torvalds
2012-03-11 0:27 ` Djalal Harouni
` (2 more replies)
2012-03-12 19:13 ` [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Eric W. Biederman
10 siblings, 3 replies; 48+ messages in thread
From: Linus Torvalds @ 2012-03-11 0:01 UTC (permalink / raw)
To: Djalal Harouni
Cc: linux-kernel, kernel-hardening, Andrew Morton, Al Viro,
Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sat, Mar 10, 2012 at 3:25 PM, Djalal Harouni <tixxdz@opendz.org> wrote:
>
> 1) Use the target exec_id to bind files to their exec_id task:
>
> For the REG files /proc/<pid>/{environ,pagemap,mem} we set the exec_id
> of the proc_file_private to the target task, and we continue with
> permission checks at open time, later on each read/write call the
> permission checks are done + check the target exec_id if it equals the
> exec_id of the proc_file_private that was set at open time, in other words
> we bind the file to its task's exec_id, this way new exec programs can not
> operate on the passed fd.
So the exec_id approach was totally broken when it was used for
/proc/<pid>/mem, is there any reason to believe it's a good idea now?
It's entirely predictable, and you can make the exec_id match by
simply forking elsewhere and then passing the fd around using unix
domain sockets, since the exec_id is just updated by incrementing a
counter.
I would in general suggest strongly against using exec_id for anything
that involves files. It isn't designed for that, it's designed for the
whole "check the parent exec_id" thing for ptrace, where that whole
"pass things around to another process" approach doesn't work.
Linus
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve
2012-03-11 0:01 ` [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Linus Torvalds
@ 2012-03-11 0:27 ` Djalal Harouni
2012-03-11 8:46 ` Djalal Harouni
2012-03-11 10:35 ` exec_id protection from bad child exit signals (was: Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve) Solar Designer
2 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-11 0:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, kernel-hardening, Andrew Morton, Al Viro,
Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sat, Mar 10, 2012 at 04:01:09PM -0800, Linus Torvalds wrote:
> On Sat, Mar 10, 2012 at 3:25 PM, Djalal Harouni <tixxdz@opendz.org> wrote:
> >
> > 1) Use the target exec_id to bind files to their exec_id task:
> >
> > For the REG files /proc/<pid>/{environ,pagemap,mem} we set the exec_id
> > of the proc_file_private to the target task, and we continue with
> > permission checks at open time, later on each read/write call the
> > permission checks are done + check the target exec_id if it equals the
> > exec_id of the proc_file_private that was set at open time, in other words
> > we bind the file to its task's exec_id, this way new exec programs can not
> > operate on the passed fd.
>
> So the exec_id approach was totally broken when it was used for
> /proc/<pid>/mem, is there any reason to believe it's a good idea now?
Yes the previously one was broken since it was not a global uniq exec_id,
it was designed for threads tracking.
The current one is a global exec_id with uniq IDs, incremented on each
do_execve_common() call.
> It's entirely predictable, and you can make the exec_id match by
> simply forking elsewhere and then passing the fd around using unix
> domain sockets, since the exec_id is just updated by incrementing a
> counter.
For the fork one yes exec_id will match but we have the permission
checks (ptrace) at each syscall, so even if two processes share the same
exec_id the ptrace check should fail.
Yes it's predictable, but I don't see how you could pass the fd to another
extern privileged process without failing at the exec_id check.
--
tixxdz
http://opendz.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve
2012-03-11 0:01 ` [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Linus Torvalds
2012-03-11 0:27 ` Djalal Harouni
@ 2012-03-11 8:46 ` Djalal Harouni
2012-03-11 10:35 ` exec_id protection from bad child exit signals (was: Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve) Solar Designer
2 siblings, 0 replies; 48+ messages in thread
From: Djalal Harouni @ 2012-03-11 8:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, kernel-hardening, Andrew Morton, Al Viro,
Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sat, Mar 10, 2012 at 04:01:09PM -0800, Linus Torvalds wrote:
> I would in general suggest strongly against using exec_id for anything
> that involves files. It isn't designed for that, it's designed for the
> whole "check the parent exec_id" thing for ptrace, where that whole
> "pass things around to another process" approach doesn't work.
Yes exec_id for files can seem strange, but here we are speaking about
virtual files that are related to the image of the process being run,
these files are in reality just some portion of the process memory.
Linus please check the patch:
[PATCH 9/9] proc: improve and clean up /proc/<pid>/mem protection
I have explained why the current protection is not the best solution. The
oom killer will kill other processes including getty,klogd,... even root
owned ssh.
--
tixxdz
http://opendz.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* exec_id protection from bad child exit signals (was: Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve)
2012-03-11 0:01 ` [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Linus Torvalds
2012-03-11 0:27 ` Djalal Harouni
2012-03-11 8:46 ` Djalal Harouni
@ 2012-03-11 10:35 ` Solar Designer
2012-03-11 18:20 ` Oleg Nesterov
2 siblings, 1 reply; 48+ messages in thread
From: Solar Designer @ 2012-03-11 10:35 UTC (permalink / raw)
To: Linus Torvalds
Cc: Djalal Harouni, linux-kernel, kernel-hardening, Andrew Morton,
Al Viro, Alexey Dobriyan, Eric W. Biederman, Vasiliy Kulikov,
Kees Cook, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Sat, Mar 10, 2012 at 04:01:09PM -0800, Linus Torvalds wrote:
> I would in general suggest strongly against using exec_id for anything
> that involves files. It isn't designed for that, it's designed for the
> whole "check the parent exec_id" thing for ptrace, where that whole
> "pass things around to another process" approach doesn't work.
Actually, the original/historical purpose of the exec_id stuff was to
protect privileged parent processes (those having done a SUID/SGID exec)
from non-standard child exit signals, which could be set with clone().
I think we may want to audit the current implementation and see if it
still fully achieves the goal or maybe not (and fix it if not).
IIRC, 32 bits was considered enough because it was only the trusted
privileged parent process itself that could potentially cause a
wraparound. (I did not verify this conclusion now. It might be wrong.)
I include below pieces of the prototype implementation from
linux-2.2.12-ow6.tar.gz released in 1999. One notable difference from
the code that went into mainline kernels was that I only incremented the
counter on privileged execve(), and I additionally handled counter
wraparound. I am a bit concerned that a wraparound attack might be
possible on the code currently in mainline kernels, thereby allowing for
a bad exit signal to be sent to a privileged new parent program. Does
anything prevent the wraparound attack currently? (I did not check for
this yet, sorry.)
On exec:
+ bprm->priv_change = id_change || cap_raised;
+ if (bprm->priv_change) {
...
+ /*
+ * Increment the privileged execution counter, so that our
+ * old children know not to send bad exit_signal's to us.
+ * Also, wait on the lock if there's an exit_signal being
+ * sent to us now, to make sure it doesn't get sent to the
+ * new privileged program.
+ */
+ spin_lock_irqsave(¤t->priv_lock, flags);
+ if (!++current->priv) {
+ struct task_struct *p;
+
+ /*
+ * The counter can't really overflow with real-world
+ * programs (and it has to be the privileged program
+ * itself that causes the overflow), but we handle
+ * this case anyway, just for correctness.
+ */
+ read_lock(&tasklist_lock);
+ for_each_task(p) {
+ if (p->p_pptr == current) {
+ p->ppriv = 0;
+ current->priv = 1;
+ }
+ }
+ read_unlock(&tasklist_lock);
+ }
+ spin_unlock_irqrestore(¤t->priv_lock, flags);
In task_struct:
+/* Privileged execution counters, for exit_signal permission checking */
+ spinlock_t priv_lock;
+ int priv, ppriv;
On fork() and clone():
+ spin_lock_init(&p->priv_lock);
+ p->priv = 0;
+ p->ppriv = current->priv;
Exit signal:
+ unsigned long flags = 0;
+ int locked = 0;
+
+ if (sig && sig != SIGCHLD) {
+ /*
+ * Make sure our parent hasn't executed a privileged program
+ * (such as, SUID) since we were born.
+ *
+ * We do some locking here to ensure that there's no race
+ * between the check and actually sending the signal.
+ * Currently, this is probably redundant as notify_parent()
+ * is only used either with the big lock obtained, or with
+ * the signal set to SIGCHLD.
+ */
+ locked = 1;
+ spin_lock_irqsave(&tsk->p_pptr->priv_lock, flags);
+ if (tsk->p_pptr->priv != tsk->ppriv) {
+ spin_unlock_irqrestore(&tsk->p_pptr->priv_lock, flags);
+ locked = 0;
+ sig = 0;
+ }
+ }
...
+ if (locked) spin_unlock_irqrestore(&tsk->p_pptr->priv_lock, flags);
IIRC, an equivalent of the above went upstream (with simplifications
and a variables rename by Alan) in 2.2.13, so that may be another
"reference implementation" to check against.
Alexander
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: exec_id protection from bad child exit signals (was: Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve)
2012-03-11 10:35 ` exec_id protection from bad child exit signals (was: Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve) Solar Designer
@ 2012-03-11 18:20 ` Oleg Nesterov
0 siblings, 0 replies; 48+ messages in thread
From: Oleg Nesterov @ 2012-03-11 18:20 UTC (permalink / raw)
To: Solar Designer
Cc: Linus Torvalds, Djalal Harouni, linux-kernel, kernel-hardening,
Andrew Morton, Al Viro, Alexey Dobriyan, Eric W. Biederman,
Vasiliy Kulikov, Kees Cook, WANG Cong, James Morris,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On 03/11, Solar Designer wrote:
>
> Actually, the original/historical purpose of the exec_id stuff was to
> protect privileged parent processes (those having done a SUID/SGID exec)
> from non-standard child exit signals, which could be set with clone().
> I think we may want to audit the current implementation and see if it
> still fully achieves the goal or maybe not (and fix it if not).
Funny that, I noticed this message only after I sent the question about
the current exec_id stuff.
> I include below pieces of the prototype implementation from
> linux-2.2.12-ow6.tar.gz released in 1999.
Perhaps I missed something, but ignoring the "cap_raised" issues, this
all is very simple. de_thread() should simply do:
current->exit_signal = SIGCHLD;
read_lock(&tasklist_lock);
list_for_each_entry(p, ¤t->children, sibling)
p->exit_signal = SIGCHILD;
read_unlock(&tasklist_lock);
The only problem is CLONE_PARENT.
Oleg.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve
2012-03-10 23:25 [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Djalal Harouni
` (9 preceding siblings ...)
2012-03-11 0:01 ` [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Linus Torvalds
@ 2012-03-12 19:13 ` Eric W. Biederman
2012-03-12 20:44 ` Djalal Harouni
10 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2012-03-12 19:13 UTC (permalink / raw)
To: Djalal Harouni
Cc: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
Djalal Harouni <tixxdz@opendz.org> writes:
> Procfs files and other important objects may contain sensitive information
> which must not be seen, inherited or processed across execve.
So I am dense. /proc/<pid>/mem was special in that it uses a different
set of checks than other files, and to do those access checks
/proc/<pid>/mem needed to look at exec_id.
For all of the access checks that are not written in that silly way.
What is wrong with ptrace_may_access run at every read/write of a file?
We redo all of the permission checks every time so that should avoid
races.
I really think you are trying to solve something that is not broken.
Certainly I could not see your argument for why anything but
/proc/<pid>/mem needs attention.
Eric
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve
2012-03-12 19:13 ` [PATCH 0/9] proc: protect /proc/<pid>/* files across execve Eric W. Biederman
@ 2012-03-12 20:44 ` Djalal Harouni
2012-03-12 21:47 ` Eric W. Biederman
0 siblings, 1 reply; 48+ messages in thread
From: Djalal Harouni @ 2012-03-12 20:44 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Mon, Mar 12, 2012 at 12:13:15PM -0700, Eric W. Biederman wrote:
> Djalal Harouni <tixxdz@opendz.org> writes:
>
> > Procfs files and other important objects may contain sensitive information
> > which must not be seen, inherited or processed across execve.
>
> So I am dense. /proc/<pid>/mem was special in that it uses a different
> set of checks than other files, and to do those access checks
> /proc/<pid>/mem needed to look at exec_id.
If you are referring to the old protection, yes it was against an ID, but
not uniq IDs, so you can execve a suid do some tricks to have a match on IDs
and bypass the protection, how: by opening your /proc/self/mem and pass
the fd to the exec suid who at read/write time will process its own
/proc/self/mem
> For all of the access checks that are not written in that silly way.
> What is wrong with ptrace_may_access run at every read/write of a file?
As it was noted, these files change during runtime, so even if you do the
ptrace check at each syscall (which is of course a good thing), you must be
sure that you are doing the check on the right target and the processed
file belongs really to the appropriate process image of the target.
This is what Alan says: "bind objects to their process image".
So if ptrace_may_access is done when current == target then it will
succeed, and if we are processing /proc/self/* then every ptrace check
will pass, and if you pass the fd of /proc/self/* to a privileged process
then it will just operate on its own file.
So we need a way/solution to know that we are processing the _right_ files.
The proposed exec_id here will change on each execve(): uniq ids, so if we
execve an external privileged program or have an fd of one of its files,
the exec_id check will fail, but we must also perform the ptrace check.
> We redo all of the permission checks every time so that should avoid
> races.
races are against new execves.
IMHO even with that there will be races, here we are operating on
virtual files /proc/<pid>/* and the creds and privileges of the process
can change at any time even after the ptrace check. Some files use locks
see the /proc/<pid>/io commit 293eb1e7772b25a93647c798c7b89bf26c2da2e0
Hope that I'm not out of context.
> I really think you are trying to solve something that is not broken.
> Certainly I could not see your argument for why anything but
> /proc/<pid>/mem needs attention.
This patch series is to complete the protection for all the /proc/<pid>/*
sensitive files and to update the /proc/<pid>/mem protection.
For this last one, the checks were ok, but I thought that we can do better
without keeping dead bytes.
For all the thing and to keep track:
This is not news. Alan's historical links:
http://lkml.org/lkml/2012/1/29/35
The previous discussion on kernel-hardening:
(includes some variants described by Vasiliy and other problems which I'll
try to discuss here)
http://www.openwall.com/lists/kernel-hardening/2012/02/10/1
These are sensitive files, combined with some binaries properties, attackers
can achieve ASLR bypass ...
> Eric
Thanks Eric.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
tixxdz
http://opendz.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve
2012-03-12 20:44 ` Djalal Harouni
@ 2012-03-12 21:47 ` Eric W. Biederman
2012-03-12 22:41 ` Djalal Harouni
0 siblings, 1 reply; 48+ messages in thread
From: Eric W. Biederman @ 2012-03-12 21:47 UTC (permalink / raw)
To: Djalal Harouni
Cc: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
Djalal Harouni <tixxdz@opendz.org> writes:
> On Mon, Mar 12, 2012 at 12:13:15PM -0700, Eric W. Biederman wrote:
>> Djalal Harouni <tixxdz@opendz.org> writes:
>>
>> > Procfs files and other important objects may contain sensitive information
>> > which must not be seen, inherited or processed across execve.
>>
>> So I am dense. /proc/<pid>/mem was special in that it uses a different
>> set of checks than other files, and to do those access checks
>> /proc/<pid>/mem needed to look at exec_id.
> If you are referring to the old protection, yes it was against an ID, but
> not uniq IDs, so you can execve a suid do some tricks to have a match on IDs
> and bypass the protection, how: by opening your /proc/self/mem and pass
> the fd to the exec suid who at read/write time will process its own
> /proc/self/mem
Yes that case is silly and I don't care. I care that you seem to be
stomping all over the non-silly cases using the excuse that there
was one silly case.
>> For all of the access checks that are not written in that silly way.
>> What is wrong with ptrace_may_access run at every read/write of a file?
> As it was noted, these files change during runtime, so even if you do the
> ptrace check at each syscall (which is of course a good thing), you must be
> sure that you are doing the check on the right target and the processed
> file belongs really to the appropriate process image of the target.
The right target is by definition the current value of the process in
question.
/proc/<pid>/<attr> files are supposed to work after an exec.
Adding an exec_id to additional files simply breaks existing
applications for no good reason.
What is needed is for safety is to guard against the race of exec
happening during a read or a write, so that we don't get access
to something we shouldn't have permissions to.
In general that means reference counting or locks. All exec_id can
meaningfully be used for in the general case is a trigger to try again.
> This is not news. Alan's historical links:
> http://lkml.org/lkml/2012/1/29/35
Alan's case refers to how to handle /proc/<pid>/maps the right way.
> The previous discussion on kernel-hardening:
> (includes some variants described by Vasiliy and other problems which I'll
> try to discuss here)
> http://www.openwall.com/lists/kernel-hardening/2012/02/10/1
In your post on openwall I just see you arguing that the current
and deliberate semantics of the permission checks on the proc files are
wrong. Because the permission checks happen at access time rather than
at open time.
Well I am sorry. The permission checks have happened at access time for
ages and that is deliberate.
Furthermore one of your confused arguments seems to imply that there is
a such a thing as a /proc/self file distinct from a /proc/<pid> file.
There not. /proc/self is just a symlink into /proc/<pid> and as such
does not open new security holes.
If you expect /proc/ not to let you find out things about yourself
if you are a suid executable that just seems silly.
So in short you seem to be arguing for changing the semantics of access
of every file under /proc/<pid> because there is cognitive dissonance
between your understanding of those files and what is implemented. I
see no acknowledgement that you are arguing for a semantic change or
any arguments in favor of that change. At best I see is an argument
that says you the files don't work the way you expect which is most
definitely not sufficient for a change.
Eric
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve
2012-03-12 21:47 ` Eric W. Biederman
@ 2012-03-12 22:41 ` Djalal Harouni
2012-03-12 23:10 ` Eric W. Biederman
0 siblings, 1 reply; 48+ messages in thread
From: Djalal Harouni @ 2012-03-12 22:41 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
On Mon, Mar 12, 2012 at 02:47:43PM -0700, Eric W. Biederman wrote:
> Djalal Harouni <tixxdz@opendz.org> writes:
>
> > On Mon, Mar 12, 2012 at 12:13:15PM -0700, Eric W. Biederman wrote:
> >> Djalal Harouni <tixxdz@opendz.org> writes:
> >>
> >> > Procfs files and other important objects may contain sensitive information
> >> > which must not be seen, inherited or processed across execve.
> >>
> >> So I am dense. /proc/<pid>/mem was special in that it uses a different
> >> set of checks than other files, and to do those access checks
> >> /proc/<pid>/mem needed to look at exec_id.
> > If you are referring to the old protection, yes it was against an ID, but
> > not uniq IDs, so you can execve a suid do some tricks to have a match on IDs
> > and bypass the protection, how: by opening your /proc/self/mem and pass
> > the fd to the exec suid who at read/write time will process its own
> > /proc/self/mem
>
> Yes that case is silly and I don't care. I care that you seem to be
> stomping all over the non-silly cases using the excuse that there
> was one silly case.
I'm not sure I follow you here, these proc files suffer from the same
problems, can you please point what are these non-silly case ?
> >> For all of the access checks that are not written in that silly way.
> >> What is wrong with ptrace_may_access run at every read/write of a file?
> > As it was noted, these files change during runtime, so even if you do the
> > ptrace check at each syscall (which is of course a good thing), you must be
> > sure that you are doing the check on the right target and the processed
> > file belongs really to the appropriate process image of the target.
>
> The right target is by definition the current value of the process in
> question.
>
> /proc/<pid>/<attr> files are supposed to work after an exec.
Yes.
> Adding an exec_id to additional files simply breaks existing
> applications for no good reason.
Can you please point these applications that this patch will break ?
So we do not do it.
I've tested 'ps' but perhaps I've missed something ?
We just return 0 at read() which is a correct thing to do.
> What is needed is for safety is to guard against the race of exec
> happening during a read or a write, so that we don't get access
before and during. I say before since this is what we are tying to
emulate.
> to something we shouldn't have permissions to.
>
> In general that means reference counting or locks. All exec_id can
Locks ? counting yes this perhaps can work as Alan suggested, but a simple
check will catch all the things without any node list nor count inc/dec.
Linus's /proc/pid/mem patch do not even do that, he just keep the old mm
at open time.
> meaningfully be used for in the general case is a trigger to try again.
>
> > This is not news. Alan's historical links:
> > http://lkml.org/lkml/2012/1/29/35
>
> Alan's case refers to how to handle /proc/<pid>/maps the right way.
Alan's case refers to how to handle all the /proc/<pid>/* files and any
other object that should be attached to its process image.
> > The previous discussion on kernel-hardening:
> > (includes some variants described by Vasiliy and other problems which I'll
> > try to discuss here)
> > http://www.openwall.com/lists/kernel-hardening/2012/02/10/1
>
> In your post on openwall I just see you arguing that the current
> and deliberate semantics of the permission checks on the proc files are
> wrong. Because the permission checks happen at access time rather than
> at open time.
Yes if you are speaking about {maps,smaps...} then it should also be at
open time and at each syscall (it depends on files) and for the
/proc/pid/{maps,smaps,numa_maps} yes it is not safe to give non-privileged
processes an fd to an objects attached to a privileged process.
> Well I am sorry. The permission checks have happened at access time for
> ages and that is deliberate.
Yes (perhaps even before I use Linux :) ), but IMHO they are not perfect,
if they were, then we'll not see all the /proc/<pid>/ problems.
The current patches protect /proc/pid/{maps,smaps,numa_maps} without breaking
things, there are no checks at open which is not safe, but I did not add
it since I was not sure and I don't want to break things (glibc
FORITFY_SOURCE ...).
My patches add the ptrace checks at open only for
/proc/<pid>/{environ,pagemap,mem} which is safe, I did not include
/proc/pid/{auxv,maps,smaps,...} so please if you have _real_ cases of
problems just post them.
> Furthermore one of your confused arguments seems to imply that there is
> a such a thing as a /proc/self file distinct from a /proc/<pid> file.
> There not. /proc/self is just a symlink into /proc/<pid> and as such
> does not open new security holes.
No it's not confused, we just use the /proc/self as it is easy to explain.
Sorry if you feel that.
> If you expect /proc/ not to let you find out things about yourself
> if you are a suid executable that just seems silly.
Ah yes this one about suid/privileged, if you are still privileged then
there is no harm, but if you drop your privileges IMHO that means that you
do not want to do privileged operations, but that current == task in
ptrace which is before the creds check will just allow you to do so.
Well, this is another subject.
> So in short you seem to be arguing for changing the semantics of access
> of every file under /proc/<pid> because there is cognitive dissonance
> between your understanding of those files and what is implemented. I
> see no acknowledgement that you are arguing for a semantic change or
> any arguments in favor of that change. At best I see is an argument
> that says you the files don't work the way you expect which is most
> definitely not sufficient for a change.
I really don't understant what you are trying to say/prove here, this is
the same issue of the last /proc/<pid>/mem one that got fixed by Linus and
Olge patches. These are dynamic files.
For arguments I'll re-try.
> Eric
Thanks Eric.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
tixxdz
http://opendz.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 0/9] proc: protect /proc/<pid>/* files across execve
2012-03-12 22:41 ` Djalal Harouni
@ 2012-03-12 23:10 ` Eric W. Biederman
0 siblings, 0 replies; 48+ messages in thread
From: Eric W. Biederman @ 2012-03-12 23:10 UTC (permalink / raw)
To: Djalal Harouni
Cc: linux-kernel, kernel-hardening, Andrew Morton, Linus Torvalds,
Al Viro, Alexey Dobriyan, Vasiliy Kulikov, Kees Cook,
Solar Designer, WANG Cong, James Morris, Oleg Nesterov,
linux-security-module, linux-fsdevel, Alan Cox, Greg KH,
Ingo Molnar, Stephen Wilson, Jason A. Donenfeld
Djalal Harouni <tixxdz@opendz.org> writes:
> I'm not sure I follow you here, these proc files suffer from the same
> problems, can you please point what are these non-silly case ?
As I read 3.3-rc1 which I have handy I do not see problems.
>> What is needed is for safety is to guard against the race of exec
>> happening during a read or a write, so that we don't get access
> before and during. I say before since this is what we are tying to
> emulate.
>> to something we shouldn't have permissions to.
>>
>> In general that means reference counting or locks. All exec_id can
> Locks ? counting yes this perhaps can work as Alan suggested, but a simple
> check will catch all the things without any node list nor count inc/dec.
An exec_id check adds no value, to files like /proc/pid/environ.
mm_for_maps nicely does the necessary permission checks and avoids the
race.
>> > The previous discussion on kernel-hardening:
>> > (includes some variants described by Vasiliy and other problems which I'll
>> > try to discuss here)
>> > http://www.openwall.com/lists/kernel-hardening/2012/02/10/1
>>
>> In your post on openwall I just see you arguing that the current
>> and deliberate semantics of the permission checks on the proc files are
>> wrong. Because the permission checks happen at access time rather than
>> at open time.
> Yes if you are speaking about {maps,smaps...} then it should also be at
> open time and at each syscall (it depends on files) and for the
> /proc/pid/{maps,smaps,numa_maps} yes it is not safe to give non-privileged
> processes an fd to an objects attached to a privileged process.
Then the permission checks at access time are broken, and exec_id
nonsense won't make them better.
>> Well I am sorry. The permission checks have happened at access time for
>> ages and that is deliberate.
> Yes (perhaps even before I use Linux :) ), but IMHO they are not perfect,
> if they were, then we'll not see all the /proc/<pid>/ problems.
We see problems because we lots of people play in the code and are not
careful. It looks to me like you are trying to make the problem worse
by making the code more complicated and even harder to understand.
> The current patches protect /proc/pid/{maps,smaps,numa_maps} without breaking
> things, there are no checks at open which is not safe, but I did not add
> it since I was not sure and I don't want to break things (glibc
> FORITFY_SOURCE ...).
>
> My patches add the ptrace checks at open only for
> /proc/<pid>/{environ,pagemap,mem} which is safe, I did not include
> /proc/pid/{auxv,maps,smaps,...} so please if you have _real_ cases of
> problems just post them.
You are thrashing the code and making it more complicated for no
apparent reason.
>> If you expect /proc/ not to let you find out things about yourself
>> if you are a suid executable that just seems silly.
> Ah yes this one about suid/privileged, if you are still privileged then
> there is no harm, but if you drop your privileges IMHO that means that you
> do not want to do privileged operations, but that current == task in
> ptrace which is before the creds check will just allow you to do so.
Please show me how you can drop privileges to access yourself.
The current == task check is there because of extreme silliness in
the security modules and the fact that glibc breaks in entertaining
was if it can not access /proc/self/masp. If you I read you right
you want to reintroduce that breaking for suid exectuables. Frankly
that sounds like you a security bug in the making.
> Well, this is another subject.
>
>> So in short you seem to be arguing for changing the semantics of access
>> of every file under /proc/<pid> because there is cognitive dissonance
>> between your understanding of those files and what is implemented. I
>> see no acknowledgement that you are arguing for a semantic change or
>> any arguments in favor of that change. At best I see is an argument
>> that says you the files don't work the way you expect which is most
>> definitely not sufficient for a change.
> I really don't understant what you are trying to say/prove here, this is
> the same issue of the last /proc/<pid>/mem one that got fixed by Linus and
> Olge patches. These are dynamic files.
The /proc/<pid>/mem permission checks were implemented in a very stupid
way. In that stupid way they used task->exec_id. I don't see any
evidence that anyone touching any of the other proc files made that same
foolish set of choices. So I see no evidence that the problem exists
outside of anywhere except /proc/<pid>/mem.
So very simply it looks to me like your entire patchset is very busily
trying to fix problems that don't exist.
Eric
^ permalink raw reply [flat|nested] 48+ messages in thread