[RFC] cgroup: syscalls limiting subsystem

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] cgroup: syscalls limiting subsystem
@ 2011-10-19  0:21 Łukasz Sowa
  2011-10-19  5:26 ` Paul Menage
  0 siblings, 1 reply; 5+ messages in thread
From: Łukasz Sowa @ 2011-10-19  0:21 UTC (permalink / raw)
  To: containers; +Cc: linux-kernel, linux-security-module

Hi,

Currently, I'm writing BSc thesis about security in modern Linux.
Together with my thesis mentor, I decided that as practical part of my
work I'll implement cgroup subsystem that allows to limit particular
(defined by number) syscalls for groups of processes. The patch is ready
for first review and we decided that I may try to push it to the
mainline - so here it is.

The syscalls cgroup subsystem is designed to improve security of
specified container (or simply process under cgroup) by limiting actions
that malicious application or potential attacker can do (for example by
code injection). It's of course more flexible than seccomp because it
gives far more possibilities, more fine-grained than capabilities and
it's different from syscall auditing - it's prevention, not only
detection. It's similar to the, left unmaintained, systrace patch for
Linux.
Recently, I came across article @ LWN https://lwn.net/Articles/458805/
and I noticed that my subsystem addresses some of the problems expressed
under 'API/ABI restrictions' and it would be my rationale for merging
this subsystem into mainline.

Usage of the subsystem is quite simple. To disallow a syscall for given
cgroup you have to echo its number to syscalls.deny. To allow a syscall
you have to echo its number to syscalls.allow (and parent cgroup also
have to have it allowed). Doing 'cat syscalls.allow' or 'cat
syscalls.deny' shows allowed/disallowed syscalls of given cgroup (with
respect of the parents' settings).

Things that I dislike or particularly need comments are:
1. The asm parts where I push/pop callee-modified registers are ugly.
I'm aware of macros (for x64) like SAVE_ARGS and RESTORE_ARGS but they
simply don't work for me because they modify EFLAGS registers (because
of sub instruction I suppose), forcing me to write uglier and less
efficient code (additional jump needed). I can elaborate on this, if
someone's in doubt. Maybe another version of the SAVE_ARGS/RESTORE_ARGS
macro is needed?
2. Performance. It's not that bad: I measured 5% more sys time for
process on root level, and 8% more sys time for processes on first
level. However, I think it may and should be improved but currently
I have no idea how to do it.
3. Naming convention - it's not bad either but I don't like the names -
'scs' abbreviation sound a little bit silly but full form (syscalls)
makes lines far too long.
4. Lack of documentation - I promise I'll write it, as soon as first RFC
will succeed.
5. Last but not least, it's only for x86 and x86-64. I don't know other
archs so I won't be able to port it, though it should be easy.

Please, let me know what you think.

Best Regards,
Lukasz Sowa

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index f3f6f53..7b57e0a 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -425,6 +425,17 @@ sysenter_past_esp:
 sysenter_do_call:
 	cmpl $(nr_syscalls), %eax
 	jae syscall_badsys
+#ifdef CONFIG_CGROUP_SYSCALLS
+	push %eax
+	push %ecx
+	push %edx
+	call scs_cgroup_perm
+	cmpl $0, %eax
+	pop %edx
+	pop %ecx
+	pop %eax
+	je syscall_badsys
+#endif
 	call *sys_call_table(,%eax,4)
 	movl %eax,PT_EAX(%esp)
 	LOCKDEP_SYS_EXIT
@@ -507,6 +518,17 @@ ENTRY(system_call)
 	cmpl $(nr_syscalls), %eax
 	jae syscall_badsys
 syscall_call:
+#ifdef CONFIG_CGROUP_SYSCALLS
+	push %eax
+	push %ecx
+	push %edx
+	call scs_cgroup_perm
+	cmpl $0, %eax
+	pop %edx
+	pop %ecx
+	pop %eax
+	je syscall_badsys
+#endif
 	call *sys_call_table(,%eax,4)
 	movl %eax,PT_EAX(%esp)		# store the return value
 syscall_exit:
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 6419bb0..4f21a00 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -479,6 +479,30 @@ ENTRY(system_call_after_swapgs)
 system_call_fastpath:
 	cmpq $__NR_syscall_max,%rax
 	ja badsys
+#ifdef CONFIG_CGROUP_SYSCALLS
+	pushq %rax
+	pushq %rdi
+	pushq %rsi
+	pushq %rdx
+	pushq %rcx
+	pushq %r8
+	pushq %r9
+	pushq %r10
+	pushq %r11
+	movq %rax, %rdi
+	call scs_cgroup_perm
+	cmpl $0, %eax
+	popq %r11
+	popq %r10
+	popq %r9
+	popq %r8
+	popq %rcx
+	popq %rdx
+	popq %rsi
+	popq %rdi
+	popq %rax
+	je badsys
+#endif
 	movq %r10,%rcx
 	call *sys_call_table(,%rax,8)  # XXX:	 rip relative
 	movq %rax,RAX-ARGOFFSET(%rsp)
@@ -595,6 +619,30 @@ tracesys:
 	RESTORE_REST
 	cmpq $__NR_syscall_max,%rax
 	ja   int_ret_from_sys_call	/* RAX(%rsp) set to -ENOSYS above */
+#ifdef CONFIG_CGROUP_SYSCALLS
+	pushq %rax
+	pushq %rdi
+	pushq %rsi
+	pushq %rdx
+	pushq %rcx
+	pushq %r8
+	pushq %r9
+	pushq %r10
+	pushq %r11
+	movq %rax, %rdi
+	call scs_cgroup_perm
+	cmpl $0, %eax
+	popq %r11
+	popq %r10
+	popq %r9
+	popq %r8
+	popq %rcx
+	popq %rdx
+	popq %rsi
+	popq %rdi
+	popq %rax
+	je int_ret_from_sys_call
+#endif
 	movq %r10,%rcx	/* fixup for C */
 	call *sys_call_table(,%rax,8)
 	movq %rax,RAX-ARGOFFSET(%rsp)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..ad6b600 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -64,3 +64,9 @@ SUBSYS(perf)
 #endif
  /* */
+
+#ifdef CONFIG_CGROUP_SYSCALLS
+SUBSYS(syscalls)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index d627783..a03c16e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -609,6 +609,13 @@ config CGROUP_DEVICE
 	  Provides a cgroup implementing whitelists for devices which
 	  a process in the cgroup can mknod or open.
 +config CGROUP_SYSCALLS
+	bool "Syscalls controller for cgroups"
+	depends on X86
+	help
+	  Provides a way to limit access to specified syscalls for
+	  tasks in a cgroup.
+
 config CPUSETS
 	bool "Cpuset support"
 	help
diff --git a/security/Makefile b/security/Makefile
index 8bb0fe9..6db6d88 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -22,6 +22,7 @@ obj-$(CONFIG_AUDIT)			+= lsm_audit.o
 obj-$(CONFIG_SECURITY_TOMOYO)		+= tomoyo/built-in.o
 obj-$(CONFIG_SECURITY_APPARMOR)		+= apparmor/built-in.o
 obj-$(CONFIG_CGROUP_DEVICE)		+= device_cgroup.o
+obj-$(CONFIG_CGROUP_SYSCALLS)	+= syscalls_cgroup.o
  # Object integrity file lists
 subdir-$(CONFIG_IMA)			+= integrity/ima
diff --git a/security/syscalls_cgroup.c b/security/syscalls_cgroup.c
new file mode 100644
index 0000000..cfd0ea0
--- /dev/null
+++ b/security/syscalls_cgroup.c
@@ -0,0 +1,223 @@
+/*
+ * security/syscalls_cgroup.c - syscalls cgroup subsystem
+ *
+ * Copyright (C) 2011 Lukasz Sowa <luksow@gmail.com>
+ */
+
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/cgroup.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include <linux/seqlock.h>
+#include <linux/slab.h>
+
+struct scs_cgroup {
+	unsigned long *syscalls_bitmap;
+	struct cgroup_subsys_state css;
+	seqlock_t seqlock;
+};
+
+static inline struct scs_cgroup *css_to_scs_cgroup(struct cgroup_subsys_state *subsys_state)
+{
+	return container_of(subsys_state, struct scs_cgroup, css);
+}
+
+static inline struct scs_cgroup *cgroup_to_scs_cgroup(struct cgroup *cgroup)
+{
+	if (!cgroup)
+		return NULL;
+
+	return css_to_scs_cgroup(cgroup_subsys_state(cgroup,
+							 syscalls_subsys_id));
+}
+
+static inline struct scs_cgroup *task_to_scs_cgroup(struct task_struct *task)
+{
+	return css_to_scs_cgroup(task_subsys_state(task, syscalls_subsys_id));
+}
+
+/*
+ * The range of syscall number is not checked here, because it is done
+ * in low level assembly code.
+ */
+static int __scs_cgroup_perm(struct scs_cgroup *scg, int number)
+{
+	int ret = 1;
+	unsigned int seq;
+
+	if (scg) {
+		do {
+			seq = read_seqbegin(&scg->seqlock);
+			ret = test_bit(number, scg->syscalls_bitmap) &&
+				__scs_cgroup_perm(cgroup_to_scs_cgroup(scg->css.cgroup->parent),
+								number);
+		} while (read_seqretry(&scg->seqlock, seq));
+	}
+
+	return ret;
+}
+
+inline int scs_cgroup_perm(int number)
+{
+	return __scs_cgroup_perm(task_to_scs_cgroup(current), number);
+}
+
+/*
+ * On cgroup creation, syscalls bitmap is simply inherited from parent. In case
+ * of root cgroup, we set all bits.
+ */
+static struct cgroup_subsys_state *scs_cgroup_create(struct cgroup_subsys *subsys, struct cgroup *cgroup)
+{
+	struct scs_cgroup *scg;
+	struct scs_cgroup *parent_scg;
+	unsigned int seq;
+
+	scg = kmalloc(sizeof(*scg), GFP_KERNEL);
+	if (!scg)
+		return ERR_PTR(-ENOMEM);
+	scg->syscalls_bitmap = kmalloc(BITS_TO_LONGS(NR_syscalls) * sizeof(unsigned long),
+								GFP_KERNEL);
+	if (!scg->syscalls_bitmap) {
+		kfree(scg);
+		return ERR_PTR(-ENOMEM);
+	}
+	seqlock_init(&scg->seqlock);
+
+	parent_scg = cgroup_to_scs_cgroup(cgroup->parent);
+	if (parent_scg) {
+		do {
+			seq = read_seqbegin(&parent_scg->seqlock);
+			bitmap_copy(scg->syscalls_bitmap, parent_scg->syscalls_bitmap,
+					NR_syscalls);
+		} while (read_seqretry(&parent_scg->seqlock, seq));
+	} else {
+		bitmap_fill(scg->syscalls_bitmap, NR_syscalls);
+	}
+
+	return &scg->css;
+}
+
+static void scs_cgroup_destroy(struct cgroup_subsys *subsys,
+							struct cgroup *cgroup)
+{
+	struct scs_cgroup *scg = cgroup_to_scs_cgroup(cgroup);
+	kfree(scg->syscalls_bitmap);
+	kfree(scg);
+}
+
+#define SCS_CGROUP_ALLOW 0
+#define SCS_CGROUP_DENY 1
+
+static int scs_cgroup_read(struct cgroup *cgroup, struct cftype *cftype,
+						struct seq_file *seq_file)
+{
+	struct scs_cgroup *scg = cgroup_to_scs_cgroup(cgroup);
+	int bit;
+
+	switch (cftype->private) {
+	case SCS_CGROUP_ALLOW:
+		for (bit = 0; bit < NR_syscalls; ++bit) {
+			if (__scs_cgroup_perm(scg, bit))
+				seq_printf(seq_file, "%d\n", bit);
+		}
+		break;
+	case SCS_CGROUP_DENY:
+		for (bit = 0; bit < NR_syscalls; ++bit) {
+			if (!__scs_cgroup_perm(scg, bit))
+				seq_printf(seq_file, "%d\n", bit);
+		}
+		break;
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/*
+ * It is always possible to blacklist specified syscall but whitelisting
+ * requires that all ancestors have the syscall on their whitelist.
+ */
+static int scs_cgroup_write(struct cgroup *cgroup, struct cftype *cftype,
+							const char *buffer)
+{
+	struct scs_cgroup *scg = cgroup_to_scs_cgroup(cgroup);
+	struct scs_cgroup *parent_scg = cgroup_to_scs_cgroup(cgroup->parent);
+	int number;
+	int ret;
+
+	ret = kstrtoint(buffer, 0, &number);
+	if (ret)
+		return ret;
+
+	if (number < 0 || number >= NR_syscalls)
+		return -ERANGE;
+
+	switch (cftype->private) {
+	case SCS_CGROUP_ALLOW:
+		if (__scs_cgroup_perm(parent_scg, number)) {
+			write_seqlock(&scg->seqlock);
+			set_bit(number, scg->syscalls_bitmap);
+			write_sequnlock(&scg->seqlock);
+		} else {
+			return -EPERM;
+		}
+		break;
+	case SCS_CGROUP_DENY:
+		write_seqlock(&scg->seqlock);
+		clear_bit(number, scg->syscalls_bitmap);
+		write_sequnlock(&scg->seqlock);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static struct cftype scs_cgroup_files[] = {
+	{
+		.name = "allow",
+		.read_seq_string = scs_cgroup_read,
+		.write_string = scs_cgroup_write,
+		.private = SCS_CGROUP_ALLOW,
+	},
+	{
+		.name = "deny",
+		.read_seq_string = scs_cgroup_read,
+		.write_string = scs_cgroup_write,
+		.private = SCS_CGROUP_DENY,
+	}
+};
+
+static int scs_cgroup_populate(struct cgroup_subsys *subsys,
+							struct cgroup *cgroup)
+{
+	return cgroup_add_files(cgroup, subsys, scs_cgroup_files,
+						ARRAY_SIZE(scs_cgroup_files));
+}
+
+struct cgroup_subsys syscalls_subsys = {
+	.name = "syscalls",
+	.create = scs_cgroup_create,
+	.destroy = scs_cgroup_destroy,
+	.populate = scs_cgroup_populate,
+	.subsys_id = syscalls_subsys_id,
+	.module = THIS_MODULE,
+};
+
+static int __init init_syscalls_subsys(void)
+{
+	return cgroup_load_subsys(&syscalls_subsys);
+}
+
+static void __exit exit_syscalls_subsys(void)
+{
+	cgroup_unload_subsys(&syscalls_subsys);
+}
+
+module_init(init_syscalls_subsys);
+module_exit(exit_syscalls_subsys);
+MODULE_LICENSE("GPL");
+

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC] cgroup: syscalls limiting subsystem
  2011-10-19  0:21 [RFC] cgroup: syscalls limiting subsystem Łukasz Sowa
@ 2011-10-19  5:26 ` Paul Menage
  2011-10-20 21:32   ` Łukasz Sowa
  0 siblings, 1 reply; 5+ messages in thread
From: Paul Menage @ 2011-10-19  5:26 UTC (permalink / raw)
  To: Łukasz Sowa; +Cc: containers, linux-kernel, linux-security-module

On Tue, Oct 18, 2011 at 5:21 PM, Łukasz Sowa <luksow@gmail.com> wrote:
> Hi,
>
> Currently, I'm writing BSc thesis about security in modern Linux.
> Together with my thesis mentor, I decided that as practical part of my
> work I'll implement cgroup subsystem that allows to limit particular
> (defined by number) syscalls for groups of processes. The patch is ready
> for first review and we decided that I may try to push it to the
> mainline - so here it is.
>

The major problem with this approach is that denying/allowing system
calls based purely on the syscall number is very coarse. I'd guess
that most vulnerabilities in a system can be exploited just using
system calls that almost all applications need in order to get regular
work done (open, write, exec ,mmap, etc) which limits the utility of
only being able to turn them off by syscall number. So overall I don't
think you'll find very much support for this patch.

Having said that, to address the patch itself:

>
> Things that I dislike or particularly need comments are:
> 1. The asm parts where I push/pop callee-modified registers are ugly.
> I'm aware of macros (for x64) like SAVE_ARGS and RESTORE_ARGS but they
> simply don't work for me because they modify EFLAGS registers (because
> of sub instruction I suppose), forcing me to write uglier and less
> efficient code (additional jump needed). I can elaborate on this, if
> someone's in doubt. Maybe another version of the SAVE_ARGS/RESTORE_ARGS
> macro is needed?

Can't you hook into the ptrace callpath? That's already implemented on
every architecture. Set the thread bit that triggers diverting to
syscall_trace_enter() only when any of the thread's syscalls are
denied, and then you don't have to work in assembly.

> 2. Performance. It's not that bad: I measured 5% more sys time for
> process on root level, and 8% more sys time for processes on first
> level. However, I think it may and should be improved but currently
> I have no idea how to do it.

Do you mean for all processes, or just for processes that had deny
bits set? If the former, then 8% is a non-starter, I think. But if you
hook into the ptrace as I suggested above, you can probably get it to
zero overhead for threads in allow-all cgroups, and I suspect people
would scream much less about slowing down threads that are being
specifically limited anyway.

> 3. Naming convention - it's not bad either but I don't like the names -
> 'scs' abbreviation sound a little bit silly but full form (syscalls)
> makes lines far too long.

I'd suggest syscall rather than scs.

> +       switch (cftype->private) {
> +       case SCS_CGROUP_ALLOW:
> +               for (bit = 0; bit < NR_syscalls; ++bit) {
> +                       if (__scs_cgroup_perm(scg, bit))
> +                               seq_printf(seq_file, "%d\n", bit);

Since this is presumably meant to be only used programmatically (as
syscall numbers can be different on different platforms, so you'll
need programs to translate between names and numbers) I wouldn't
bother with the \n - a plain space will be fine for the process
reading it, and less destructive to screen real-estate if someone
happens to cat it by hand.

> +       switch (cftype->private) {
> +       case SCS_CGROUP_ALLOW:
> +               if (__scs_cgroup_perm(parent_scg, number)) {
> +                       write_seqlock(&scg->seqlock);
> +                       set_bit(number, scg->syscalls_bitmap);
> +                       write_sequnlock(&scg->seqlock);
> +               } else {
> +                       return -EPERM;
> +               }
> +               break;
> +       case SCS_CGROUP_DENY:
> +               write_seqlock(&scg->seqlock);
> +               clear_bit(number, scg->syscalls_bitmap);
> +               write_sequnlock(&scg->seqlock);
> +               break;

Deny needs to clear the bit in all descendants as well. And there
would need to be synchronization between checking the parent bit
during an ALLOW write, and someone changing the parent bit to a DENY.

Paul

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] cgroup: syscalls limiting subsystem
  2011-10-19  5:26 ` Paul Menage
@ 2011-10-20 21:32   ` Łukasz Sowa
  2011-11-01 20:02     ` Will Drewry
  0 siblings, 1 reply; 5+ messages in thread
From: Łukasz Sowa @ 2011-10-20 21:32 UTC (permalink / raw)
  To: Paul Menage; +Cc: containers, linux-kernel, linux-security-module

First of all, thanks a lot for your replay.

> On Tue, Oct 18, 2011 at 5:21 PM, Łukasz Sowa <luksow@gmail.com> wrote:
>> Hi,
>>
>> Currently, I'm writing BSc thesis about security in modern Linux.
>> Together with my thesis mentor, I decided that as practical part of my
>> work I'll implement cgroup subsystem that allows to limit particular
>> (defined by number) syscalls for groups of processes. The patch is ready
>> for first review and we decided that I may try to push it to the
>> mainline - so here it is.
>>
> 
> The major problem with this approach is that denying/allowing system
> calls based purely on the syscall number is very coarse. I'd guess
> that most vulnerabilities in a system can be exploited just using
> system calls that almost all applications need in order to get regular
> work done (open, write, exec ,mmap, etc) which limits the utility of
> only being able to turn them off by syscall number. So overall I don't
> think you'll find very much support for this patch.

I have to disagree. Have you read through the link to the article at LWN
I gave? If not, please do so. There are a few people being interested in
ability to limit access to some system calls, just like seccomp does.
Moreover, as stated in the article, there are some projects looking
forward to this feature like QEMU, vsftpd or Chromium. Another (but
maybe not very convincing) point is that there are some other operating
systems that have such feature successfully implemented and used, like BSD.

> Having said that, to address the patch itself:
> 
>> 2. Performance. It's not that bad: I measured 5% more sys time for
>> process on root level, and 8% more sys time for processes on first
>> level. However, I think it may and should be improved but currently
>> I have no idea how to do it.
> 
> Do you mean for all processes, or just for processes that had deny
> bits set? If the former, then 8% is a non-starter, I think. But if you
> hook into the ptrace as I suggested above, you can probably get it to
> zero overhead for threads in allow-all cgroups, and I suspect people
> would scream much less about slowing down threads that are being
> specifically limited anyway.

The overhead is for all processes, not only the 'denying one'.
Implementing it in a way that you suggested (setting bits in all
descendants) is a very good idea. It enables us to have constant
overhead (same on all levels) - currently 5%.
All existing syscalls tracing solutions use ptrace hooking. It's clean
but the performance of ptrace is horrible. I ran the same benchmark as I
used for my code on ptrace based tracing and the overhead was immense.
It was 100 times (sic!) slower. I have another good paper on the
syscalls tracing that you may look through:
http://www.linux-kongress.org/2009/slides/system_call_tracing_overhead_joerg_zinke.pdf
All in all, I think that 5% overhead tracing is quite good trade-off
between no tracing and extremely slow tracing.
I have to say it again - I'm not satisfied with the current performance
and I'm investigating into making it lower but I wonder that there's a
way to go under, let's say - 3%.
However, if such overhead is still unacceptable maybe we should provide
a way to turn it off (besides not compiling it, of course)? My current
idea is to introduce a per-cgroup triggering bit (which will not be
hierarchical, of course). Then, the overhead for cgroups with tracing
turned off would be minimal (additional bit checking and short jumping
only). Or maybe there's a better way (for ex. a per-process triggering
bit)?

>> 3. Naming convention - it's not bad either but I don't like the names -
>> 'scs' abbreviation sound a little bit silly but full form (syscalls)
>> makes lines far too long.
> 
> I'd suggest syscall rather than scs.

Thanks, I prefer syscall too, but I'll have to cope with long lines.

>> +       switch (cftype->private) {
>> +       case SCS_CGROUP_ALLOW:
>> +               for (bit = 0; bit < NR_syscalls; ++bit) {
>> +                       if (__scs_cgroup_perm(scg, bit))
>> +                               seq_printf(seq_file, "%d\n", bit);
> 
> Since this is presumably meant to be only used programmatically (as
> syscall numbers can be different on different platforms, so you'll
> need programs to translate between names and numbers) I wouldn't
> bother with the \n - a plain space will be fine for the process
> reading it, and less destructive to screen real-estate if someone
> happens to cat it by hand.

Plane space is a good idea, I'll make it this way.
Relaying on the names is in my opinion very error prone, so I think it's
best to left it for sysadmins scripts working on unistd.h file, fetching
numbers based on the names of syscalls.

>> +       switch (cftype->private) {
>> +       case SCS_CGROUP_ALLOW:
>> +               if (__scs_cgroup_perm(parent_scg, number)) {
>> +                       write_seqlock(&scg->seqlock);
>> +                       set_bit(number, scg->syscalls_bitmap);
>> +                       write_sequnlock(&scg->seqlock);
>> +               } else {
>> +                       return -EPERM;
>> +               }
>> +               break;
>> +       case SCS_CGROUP_DENY:
>> +               write_seqlock(&scg->seqlock);
>> +               clear_bit(number, scg->syscalls_bitmap);
>> +               write_sequnlock(&scg->seqlock);
>> +               break;
> 
> Deny needs to clear the bit in all descendants as well. And there
> would need to be synchronization between checking the parent bit
> during an ALLOW write, and someone changing the parent bit to a DENY.

As I said above, it's a good idea.

Thanks a lot again,
Lukasz Sowa

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] cgroup: syscalls limiting subsystem
  2011-10-20 21:32   ` Łukasz Sowa
@ 2011-11-01 20:02     ` Will Drewry
  2011-11-03 19:18       ` Łukasz Sowa
  0 siblings, 1 reply; 5+ messages in thread
From: Will Drewry @ 2011-11-01 20:02 UTC (permalink / raw)
  To: Łukasz Sowa
  Cc: Paul Menage, containers, linux-kernel, linux-security-module

Some much delayed comments - thanks for the post.  It's cool to see
all the different ways to approach system call filtering!

On Thu, Oct 20, 2011 at 4:32 PM, Łukasz Sowa <luksow@gmail.com> wrote:
> First of all, thanks a lot for your replay.
>
>> On Tue, Oct 18, 2011 at 5:21 PM, Łukasz Sowa <luksow@gmail.com> wrote:
>>> Hi,
>>>
>>> Currently, I'm writing BSc thesis about security in modern Linux.
>>> Together with my thesis mentor, I decided that as practical part of my
>>> work I'll implement cgroup subsystem that allows to limit particular
>>> (defined by number) syscalls for groups of processes. The patch is ready
>>> for first review and we decided that I may try to push it to the
>>> mainline - so here it is.

Have you considered doing this as a system call namespace instead of a
cgroup?  (Just curious!)

>>
>> The major problem with this approach is that denying/allowing system
>> calls based purely on the syscall number is very coarse. I'd guess
>> that most vulnerabilities in a system can be exploited just using
>> system calls that almost all applications need in order to get regular
>> work done (open, write, exec ,mmap, etc) which limits the utility of
>> only being able to turn them off by syscall number. So overall I don't
>> think you'll find very much support for this patch.
>
> I have to disagree. Have you read through the link to the article at LWN
> I gave? If not, please do so. There are a few people being interested in
> ability to limit access to some system calls, just like seccomp does.
> Moreover, as stated in the article, there are some projects looking
> forward to this feature like QEMU, vsftpd or Chromium. Another (but
> maybe not very convincing) point is that there are some other operating
> systems that have such feature successfully implemented and used, like BSD.

Feel free to pull the archives from ~May-June w.r.t seccomp filters.
I've agreed with you, and I've tried to lay out the uses.  The other
important point which I've neglected to mention in the past is that
the system call ABI is the only stable place where the kernel attack
surface can be instrumented.  LSM hooks provide policy level decision
making, but it doesn't help explicitly with attack surface (only
implicitly :), and deeper tracepoints aren't considered "stable".

>> Having said that, to address the patch itself:
>>
>>> 2. Performance. It's not that bad: I measured 5% more sys time for
>>> process on root level, and 8% more sys time for processes on first
>>> level. However, I think it may and should be improved but currently
>>> I have no idea how to do it.
>>
>> Do you mean for all processes, or just for processes that had deny
>> bits set? If the former, then 8% is a non-starter, I think. But if you
>> hook into the ptrace as I suggested above, you can probably get it to
>> zero overhead for threads in allow-all cgroups, and I suspect people
>> would scream much less about slowing down threads that are being
>> specifically limited anyway.
>
> The overhead is for all processes, not only the 'denying one'.
> Implementing it in a way that you suggested (setting bits in all
> descendants) is a very good idea. It enables us to have constant
> overhead (same on all levels) - currently 5%.
> All existing syscalls tracing solutions use ptrace hooking. It's clean
> but the performance of ptrace is horrible. I ran the same benchmark as I
> used for my code on ptrace based tracing and the overhead was immense.
> It was 100 times (sic!) slower. I have another good paper on the
> syscalls tracing that you may look through:
> http://www.linux-kongress.org/2009/slides/system_call_tracing_overhead_joerg_zinke.pdf
> All in all, I think that 5% overhead tracing is quite good trade-off
> between no tracing and extremely slow tracing.
> I have to say it again - I'm not satisfied with the current performance
> and I'm investigating into making it lower but I wonder that there's a
> way to go under, let's say - 3%.

Have you considered using the semi-slow path used by auditsc? It uses
a thread_info flag but doesn't take the completely slow-path if _only_
audit is selected.  You may be able to get by with a new TIF flag that
fits in with the same mask that is always called for all syscalls,
then only fork if the process is in a filtered cgroup.  It will be
messy to ensure all the paths work correctly, but it should mean that
the overhead for normal applications is unchanged, and you might avoid
the total slow-path overhead (just something similar to audit
overhead).

Another approach I've been toying with is putting a sys_call_table
pointer in the thread_info and then use it if the given flag is set.
But there are so many potential reasons that's a bad idea, I'm not
sure if it makes sense to suggest exploring it :)

> However, if such overhead is still unacceptable maybe we should provide
> a way to turn it off (besides not compiling it, of course)? My current
> idea is to introduce a per-cgroup triggering bit (which will not be
> hierarchical, of course). Then, the overhead for cgroups with tracing
> turned off would be minimal (additional bit checking and short jumping
> only). Or maybe there's a better way (for ex. a per-process triggering
> bit)?

A TIF flag might be the fit.

>>> 3. Naming convention - it's not bad either but I don't like the names -
>>> 'scs' abbreviation sound a little bit silly but full form (syscalls)
>>> makes lines far too long.
>>
>> I'd suggest syscall rather than scs.
>
> Thanks, I prefer syscall too, but I'll have to cope with long lines.
>
>>> +       switch (cftype->private) {
>>> +       case SCS_CGROUP_ALLOW:
>>> +               for (bit = 0; bit < NR_syscalls; ++bit) {
>>> +                       if (__scs_cgroup_perm(scg, bit))
>>> +                               seq_printf(seq_file, "%d\n", bit);
>>
>> Since this is presumably meant to be only used programmatically (as
>> syscall numbers can be different on different platforms, so you'll
>> need programs to translate between names and numbers) I wouldn't
>> bother with the \n - a plain space will be fine for the process
>> reading it, and less destructive to screen real-estate if someone
>> happens to cat it by hand.
>
> Plane space is a good idea, I'll make it this way.
> Relaying on the names is in my opinion very error prone, so I think it's
> best to left it for sysadmins scripts working on unistd.h file, fetching
> numbers based on the names of syscalls.

Agreed here.  Names are inconsistent -- just compare ftrace syscall
metadata names to those exported to userspace.  Only the exported
NR_blah is ABI as far as I can tell.  I believe deferencing those is
pretty easy in userspace and should be left to the tools. (e.g.,
http://goo.gl/7rpRp )

That said, your approach won't work on platforms which offset system
call start points, have gaps, and different ABI modes which change
those.  You might want to consider a btree or something that doesn't
need a pre-allocated array, etc.

(If not, you'll need to populate helpers for arches that need it to
get their starting number for the current abi and the max numbers and
then make sure processes either can't flip-flop, like CONFIG_COMPAT,
and exceed the sized array.  But perhaps the btree lookup cost is too
much.)

>>> +       switch (cftype->private) {
>>> +       case SCS_CGROUP_ALLOW:
>>> +               if (__scs_cgroup_perm(parent_scg, number)) {
>>> +                       write_seqlock(&scg->seqlock);
>>> +                       set_bit(number, scg->syscalls_bitmap);
>>> +                       write_sequnlock(&scg->seqlock);
>>> +               } else {
>>> +                       return -EPERM;
>>> +               }
>>> +               break;
>>> +       case SCS_CGROUP_DENY:
>>> +               write_seqlock(&scg->seqlock);
>>> +               clear_bit(number, scg->syscalls_bitmap);
>>> +               write_sequnlock(&scg->seqlock);
>>> +               break;

Have you considered supporting ftrace filters?

>> Deny needs to clear the bit in all descendants as well. And there
>> would need to be synchronization between checking the parent bit
>> during an ALLOW write, and someone changing the parent bit to a DENY.
>
> As I said above, it's a good idea.

You can avoid clearing up the group if you evaluate each set rather
than creating a composed view.  It'll yield more runtime overhead, but
only if there's a reason for the processes to have different filter
rules. (I'm doing this in the current seccomp filter patchset I need
to repost :/  because it's non-trivial (doable, but an optimization)
to compose ftrace filters.)

Good luck - I look forward to seeing your next patch series!
will

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] cgroup: syscalls limiting subsystem
  2011-11-01 20:02     ` Will Drewry
@ 2011-11-03 19:18       ` Łukasz Sowa
  0 siblings, 0 replies; 5+ messages in thread
From: Łukasz Sowa @ 2011-11-03 19:18 UTC (permalink / raw)
  To: Will Drewry
  Cc: Paul Menage, containers, linux-kernel, linux-security-module,
	cgroups

Thanks a lot for all valuable remarks and for supporting the idea!

> 
> Have you considered doing this as a system call namespace instead of a
> cgroup?  (Just curious!)
> 

Yes, I have but I didn't see any advantages of system call namespace
over cgroup (maybe I missed something?). However, I think that using
namespace is in this particular case harder - less dynamic and thus less
useful.

> Have you considered using the semi-slow path used by auditsc? It uses
> a thread_info flag but doesn't take the completely slow-path if _only_
> audit is selected.  You may be able to get by with a new TIF flag that
> fits in with the same mask that is always called for all syscalls,
> then only fork if the process is in a filtered cgroup.  It will be
> messy to ensure all the paths work correctly, but it should mean that
> the overhead for normal applications is unchanged, and you might avoid
> the total slow-path overhead (just something similar to audit
> overhead).

I will try thread_info flag in next patch series. However, what I am
worried about is breaking consistency when you end up having processes
in a cgroup that does nothing because of TIF flags set. Another dirty
thing is that the TIF flag cannot be hierarchical (cannot be inherited)
so it's somehow breaking the idea of cgroups.
Another thing - what's better in using TIF flag instead of a per-cgroup
variable (held internally in struct) - is the performance that makes the
difference?

> That said, your approach won't work on platforms which offset system
> call start points, have gaps, and different ABI modes which change
> those.  You might want to consider a btree or something that doesn't
> need a pre-allocated array, etc.
> 
> (If not, you'll need to populate helpers for arches that need it to
> get their starting number for the current abi and the max numbers and
> then make sure processes either can't flip-flop, like CONFIG_COMPAT,
> and exceed the sized array.  But perhaps the btree lookup cost is too
> much.)

That sounds worrying. Could you elaborate on that? I'm not very
other-arches-aware and those things may be important for future work.

> 
> Have you considered supporting ftrace filters?
> 

No I haven't yet. Now, I'm reading through the seccomp patchset (and
discussion) you mentioned. At first glance it seems a nice idea but it
looks like a hard task to get it right. Another thing - isn't the
performance really bad when using those filters?

> Good luck - I look forward to seeing your next patch series!

I hope to post another patch for RFC next week. I will implement Paul's
remarks and TIF flag option and measure the performance again. I'm
looking forward to a nice and fruitful discussion then :).

Thanks,
Lukasz Sowa

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2011-11-03 19:19 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-10-19  0:21 [RFC] cgroup: syscalls limiting subsystem Łukasz Sowa
2011-10-19  5:26 ` Paul Menage
2011-10-20 21:32   ` Łukasz Sowa
2011-11-01 20:02     ` Will Drewry
2011-11-03 19:18       ` Łukasz Sowa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).