linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] kthread: Make kthread_create() killable.
       [not found] <201309132318.HEC05218.OtVOFOFMLJHFQS@I-love.SAKURA.ne.jp>
@ 2013-09-14 16:05 ` Oleg Nesterov
  2013-09-14 23:46   ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2013-09-14 16:05 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: security, linux-kernel

Add lkml.

On 09/13, Tetsuo Handa wrote:
>
> This patch makes kthread_create() killable,

Probably makes sense...

> @@ -255,36 +266,59 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
>  					   const char namefmt[],
>  					   ...)
>  {
> -	struct kthread_create_info create;
> -
> -	create.threadfn = threadfn;
> -	create.data = data;
> -	create.node = node;
> -	init_completion(&create.done);
> +	struct task_struct *task;
> +	struct kthread_create_info *create = kmalloc(sizeof(*create),
> +						     GFP_KERNEL);
> +
> +	if (!create)
> +		return ERR_PTR(-ENOMEM);
> +	create->threadfn = threadfn;
> +	create->data = data;
> +	create->node = node;
> +	create->owner = (void *) 1;
> +	init_completion(&create->done);
>  
>  	spin_lock(&kthread_create_lock);
> -	list_add_tail(&create.list, &kthread_create_list);
> +	list_add_tail(&create->list, &kthread_create_list);
>  	spin_unlock(&kthread_create_lock);
>  
>  	wake_up_process(kthreadd_task);
> -	wait_for_completion(&create.done);
> -
> -	if (!IS_ERR(create.result)) {
> +	/*
> +	 * Wait for completion in killable state, for I might be chosen by
> +	 * the OOM killer while kthreadd is trying to allocate memory for
> +	 * new kernel thread.
> +	 */
> +	if (unlikely(wait_for_completion_killable(&create->done))) {
> +		/*
> +		 * If I was SIGKILLed before kthreadd (or new kernel thread)
> +		 * calls complete(), leave the cleanup of this structure to
> +		 * that thread.
> +		 */
> +		if (xchg(&create->owner, NULL))
> +			return ERR_PTR(-ENOMEM);

I am wondering if this can be simplified...

At least you can move create->done from kthread_create_info to the
stack, and turn create->owner into the pointer to that completion.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] kthread: Make kthread_create() killable.
  2013-09-14 16:05 ` [PATCH] kthread: Make kthread_create() killable Oleg Nesterov
@ 2013-09-14 23:46   ` Tetsuo Handa
  2013-09-15 16:39     ` Oleg Nesterov
       [not found]     ` <201309152038.DDJ59271.OJMOHLFtSFOQVF@I-love.SAKURA.ne.jp>
  0 siblings, 2 replies; 17+ messages in thread
From: Tetsuo Handa @ 2013-09-14 23:46 UTC (permalink / raw)
  To: oleg; +Cc: security, linux-kernel

Oleg Nesterov wrote:
> I am wondering if this can be simplified...
> 
> At least you can move create->done from kthread_create_info to the
> stack, and turn create->owner into the pointer to that completion.

Use of DECLARE_COMPLETION_ONSTACK() looks harmful to me because the current
thread needs to be able to terminate as soon as possible when SIGKILLed
(especially when SIGKILLed by the OOM killer). If we move something from the
kmalloc()ed region to the stack, the current thread cannot terminate until
that something is guaranteed to no longer be used.

I think we need to convert on-stack objects to kmalloc()ed objects so that
the current thread gains the ability to terminate as soon as possible.


* Re: [PATCH] kthread: Make kthread_create() killable.
  2013-09-14 23:46   ` Tetsuo Handa
@ 2013-09-15 16:39     ` Oleg Nesterov
  2013-09-16  0:53       ` [PATCH v2] " Tetsuo Handa
       [not found]     ` <201309152038.DDJ59271.OJMOHLFtSFOQVF@I-love.SAKURA.ne.jp>
  1 sibling, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2013-09-15 16:39 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: security, linux-kernel

On 09/15, Tetsuo Handa wrote:
>
> Oleg Nesterov wrote:
> > I am wondering if this can be simplified...
> >
> > At least you can move create->done from kthread_create_info to the
> > stack, and turn create->owner into the pointer to that completion.
>
> Use of DECLARE_COMPLETION_ONSTACK() looks harmful to me because the current
> thread needs to be able to terminate as soon as possible when SIGKILLed

Sure.

> If we move something from the kmalloc()ed region to the stack, the
> current thread cannot terminate until that something is guaranteed to no
> longer be used.

Please look at call_usermodehelper_exec() which does this trick. The
logic is the same, just you need to xchg(create->completion) instead
of create->owner.

In any case "void *owner" looks strange. It is in fact boolean (in
your patch). You can even use, say, test_and_set_bit() instead of
xchg.

Oleg.



* Re: [PATCH] kthread: Make kthread_create() killable.
       [not found]     ` <201309152038.DDJ59271.OJMOHLFtSFOQVF@I-love.SAKURA.ne.jp>
@ 2013-09-15 16:55       ` Oleg Nesterov
  2013-09-16  7:25         ` [PATCH] coredump: Make startup of coredump to pipe killable Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2013-09-15 16:55 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: viro, linux-kernel, David Rientjes, Sergey Dyasly, Michal Hocko

Hi Tetsuo,

please do not start the off-list discussions ;)

On 09/15, Tetsuo Handa wrote:
>
> I examined do_coredump() case and found similar result.

...

> Do we want to change from call_usermodehelper_exec(UMH_WAIT_EXEC) to
> call_usermodehelper_exec(UMH_WAIT_EXEC | UMH_WAIT_KILLABLE)

To me, this makes sense in any case. And this matches other recent
"make coredump killable" changes.

> thread (who is waiting for pipe reader process to start) can terminate as soon
> as possible (rather than waiting forever unless somebody releases memory)?

But as for OOM this can't help if the dumping process is multithreaded.
This connects to other problems we discuss in another thread. Probably
I should send the patch which does s/PT_TRACE_EXIT/SIGNAL_GROUP_COREDUMP/
at least.

Oleg.



* Re: [PATCH v2] kthread: Make kthread_create() killable.
  2013-09-15 16:39     ` Oleg Nesterov
@ 2013-09-16  0:53       ` Tetsuo Handa
  2013-09-23 23:07         ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2013-09-16  0:53 UTC (permalink / raw)
  To: oleg; +Cc: security, linux-kernel

Oleg Nesterov wrote:
> Please look at call_usermodehelper_exec() which does this trick. The
> logic is the same, just you need to xchg(create->completion) instead
> of create->owner.

OK. I understood that we can use the same logic.
----------------------------------------
From 87373d6938f045abffe8d9b4910bd132036eccaa Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 16 Sep 2013 09:39:17 +0900
Subject: [PATCH v2] kthread: Make kthread_create() killable.

Any users of wait_for_completion() might be chosen by the OOM killer while
waiting for completion() call by some process which does memory allocation.
kthread_create() is one of such users.

When such users are chosen by the OOM killer when they are waiting for
completion() in TASK_UNINTERRUPTIBLE, problem similar to CVE-2012-4398
"kernel: request_module() OOM local DoS" can happen.

This patch makes kthread_create() killable, using the same approach used for
fixing CVE-2012-4398.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 kernel/kthread.c |   73 ++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 55 insertions(+), 18 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 760e86d..b5ae3ee 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -33,7 +33,7 @@ struct kthread_create_info
 
 	/* Result passed back to kthread_create() from kthreadd. */
 	struct task_struct *result;
-	struct completion done;
+	struct completion *done;
 
 	struct list_head list;
 };
@@ -178,6 +178,7 @@ static int kthread(void *_create)
 	struct kthread_create_info *create = _create;
 	int (*threadfn)(void *data) = create->threadfn;
 	void *data = create->data;
+	struct completion *done;
 	struct kthread self;
 	int ret;
 
@@ -187,10 +188,16 @@ static int kthread(void *_create)
 	init_completion(&self.parked);
 	current->vfork_done = &self.exited;
 
+	/* If user was SIGKILLed, I release the structure. */
+	done = xchg(&create->done, NULL);
+	if (!done) {
+		kfree(create);
+		do_exit(-EINTR);
+	}
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_UNINTERRUPTIBLE);
 	create->result = current;
-	complete(&create->done);
+	complete(done);
 	schedule();
 
 	ret = -EINTR;
@@ -223,8 +230,15 @@ static void create_kthread(struct kthread_create_info *create)
 	/* We want our own signal handler (we take no signals by default). */
 	pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
 	if (pid < 0) {
+		/* If user was SIGKILLed, I release the structure. */
+		struct completion *done = xchg(&create->done, NULL);
+
+		if (!done) {
+			kfree(create);
+			return;
+		}
 		create->result = ERR_PTR(pid);
-		complete(&create->done);
+		complete(done);
 	}
 }
 
@@ -255,36 +269,59 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 					   const char namefmt[],
 					   ...)
 {
-	struct kthread_create_info create;
-
-	create.threadfn = threadfn;
-	create.data = data;
-	create.node = node;
-	init_completion(&create.done);
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct task_struct *task;
+	struct kthread_create_info *create = kmalloc(sizeof(*create),
+						     GFP_KERNEL);
+
+	if (!create)
+		return ERR_PTR(-ENOMEM);
+	create->threadfn = threadfn;
+	create->data = data;
+	create->node = node;
+	create->done = &done;
 
 	spin_lock(&kthread_create_lock);
-	list_add_tail(&create.list, &kthread_create_list);
+	list_add_tail(&create->list, &kthread_create_list);
 	spin_unlock(&kthread_create_lock);
 
 	wake_up_process(kthreadd_task);
-	wait_for_completion(&create.done);
-
-	if (!IS_ERR(create.result)) {
+	/*
+	 * Wait for completion in killable state, for I might be chosen by
+	 * the OOM killer while kthreadd is trying to allocate memory for
+	 * new kernel thread.
+	 */
+	if (unlikely(wait_for_completion_killable(&done))) {
+		/*
+		 * If I was SIGKILLed before kthreadd (or new kernel thread)
+		 * calls complete(), leave the cleanup of this structure to
+		 * that thread.
+		 */
+		if (xchg(&create->done, NULL))
+			return ERR_PTR(-ENOMEM);
+		/*
+		 * kthreadd (or new kernel thread) will call complete()
+		 * shortly.
+		 */
+		wait_for_completion(&done);
+	}
+	task = create->result;
+	if (!IS_ERR(task)) {
 		static const struct sched_param param = { .sched_priority = 0 };
 		va_list args;
 
 		va_start(args, namefmt);
-		vsnprintf(create.result->comm, sizeof(create.result->comm),
-			  namefmt, args);
+		vsnprintf(task->comm, sizeof(task->comm), namefmt, args);
 		va_end(args);
 		/*
 		 * root may have changed our (kthreadd's) priority or CPU mask.
 		 * The kernel thread should not inherit these properties.
 		 */
-		sched_setscheduler_nocheck(create.result, SCHED_NORMAL, &param);
-		set_cpus_allowed_ptr(create.result, cpu_all_mask);
+		sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
+		set_cpus_allowed_ptr(task, cpu_all_mask);
 	}
-	return create.result;
+	kfree(create);
+	return task;
 }
 EXPORT_SYMBOL(kthread_create_on_node);
 
-- 
1.7.1


* [PATCH] coredump: Make startup of coredump to pipe killable.
  2013-09-15 16:55       ` [PATCH] " Oleg Nesterov
@ 2013-09-16  7:25         ` Tetsuo Handa
  0 siblings, 0 replies; 17+ messages in thread
From: Tetsuo Handa @ 2013-09-16  7:25 UTC (permalink / raw)
  To: oleg, viro; +Cc: linux-kernel, rientjes, dserrg, mhocko

Oleg Nesterov wrote:
> Hi Tetsuo,
> 
> please do not start the off-list discussions ;)

Sorry. Although I think and hope that there is no easy way to trigger this
bug, it might become a CVE if one is found. That is why I started off-list.
I assume you also think that there is no easy way to trigger it.

> > Do we want to change from call_usermodehelper_exec(UMH_WAIT_EXEC) to
> > call_usermodehelper_exec(UMH_WAIT_EXEC | UMH_WAIT_KILLABLE)
> 
> To me, this makes sense in any case. And this matches other recent
> "make coredump killable" changes.

OK, this patch is for do_coredump(). But I might be missing something. (e.g.
Is it safe to terminate the current process while its file descriptor table
is going to be updated by the kthread later? Are there other resources which
have to be kept valid until the kthread starts the coredump process?)
----------------------------------------
From 6b81b9956df284564112a95c941bf390c15f4f06 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 16 Sep 2013 15:59:05 +0900
Subject: [PATCH] coredump: Make startup of coredump to pipe killable.

Any users of wait_for_completion() might be chosen by the OOM killer while
waiting for completion() call by some process which does memory allocation.
call_usermodehelper() without UMH_KILLABLE flag is one of such users.

When such users are chosen by the OOM killer when they are waiting for
completion() in TASK_UNINTERRUPTIBLE, problem similar to CVE-2012-4398
"kernel: request_module() OOM local DoS" can happen.

This patch makes call_usermodehelper_exec() call in do_coredump() killable,
using similar approach used for fixing CVE-2012-4398.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 fs/coredump.c           |  113 ++++++++++++++++++++++++++++++-----------------
 include/linux/binfmts.h |    2 +
 2 files changed, 74 insertions(+), 41 deletions(-)

diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index e8112ae..547690f 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -61,6 +61,8 @@ struct coredump_params {
 	struct file *file;
 	unsigned long limit;
 	unsigned long mm_flags;
+	char **argv; /* Maybe NULL. Used by only fs/coredump.c */
+	bool in_use; /* Used by only fs/coredump.c */
 };
 
 /*
diff --git a/fs/coredump.c b/fs/coredump.c
index 9bdeca1..45abf0b 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -485,6 +485,23 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 	return err;
 }
 
+/*
+ * umh_pipe_cleanup - Clean up resources as needed.
+ *
+ * @info: Pointer to "struct subprocess_info".
+ */
+static void umh_pipe_cleanup(struct subprocess_info *info)
+{
+	/* If user was SIGKILLed, I release the structure. */
+	struct coredump_params *cprm = (struct coredump_params *)info->data;
+
+	if (!xchg(&cprm->in_use, false)) {
+		if (cprm->argv)
+			argv_free(cprm->argv);
+		kfree(cprm);
+	}
+}
+
 void do_coredump(siginfo_t *siginfo)
 {
 	struct core_state core_state;
@@ -500,24 +517,26 @@ void do_coredump(siginfo_t *siginfo)
 	bool need_nonrelative = false;
 	bool core_dumped = false;
 	static atomic_t core_dump_count = ATOMIC_INIT(0);
-	struct coredump_params cprm = {
-		.siginfo = siginfo,
-		.regs = signal_pt_regs(),
-		.limit = rlimit(RLIMIT_CORE),
-		/*
-		 * We must use the same mm->flags while dumping core to avoid
-		 * inconsistency of bit flags, since this flag is not protected
-		 * by any locks.
-		 */
-		.mm_flags = mm->flags,
-	};
+	struct coredump_params *cprm = kzalloc(sizeof(*cprm), GFP_KERNEL);
 
 	audit_core_dumps(siginfo->si_signo);
 
+	if (!cprm)
+		return;
+	cprm->siginfo = siginfo;
+	cprm->regs = signal_pt_regs();
+	cprm->limit = rlimit(RLIMIT_CORE);
+	/*
+	 * We must use the same mm->flags while dumping core to avoid
+	 * inconsistency of bit flags, since this flag is not protected
+	 * by any locks.
+	 */
+	cprm->mm_flags = mm->flags;
+
 	binfmt = mm->binfmt;
 	if (!binfmt || !binfmt->core_dump)
 		goto fail;
-	if (!__get_dumpable(cprm.mm_flags))
+	if (!__get_dumpable(cprm->mm_flags))
 		goto fail;
 
 	cred = prepare_creds();
@@ -529,7 +548,7 @@ void do_coredump(siginfo_t *siginfo)
 	 * so we dump it as root in mode 2, and only into a controlled
 	 * environment (pipe handler or fully qualified path).
 	 */
-	if (__get_dumpable(cprm.mm_flags) == SUID_DUMP_ROOT) {
+	if (__get_dumpable(cprm->mm_flags) == SUID_DUMP_ROOT) {
 		/* Setuid core dump mode */
 		flag = O_EXCL;		/* Stop rewrite attacks */
 		cred->fsuid = GLOBAL_ROOT_UID;	/* Dump root private */
@@ -542,11 +561,10 @@ void do_coredump(siginfo_t *siginfo)
 
 	old_cred = override_creds(cred);
 
-	ispipe = format_corename(&cn, &cprm);
+	ispipe = format_corename(&cn, cprm);
 
 	if (ispipe) {
 		int dump_count;
-		char **helper_argv;
 		struct subprocess_info *sub_info;
 
 		if (ispipe < 0) {
@@ -555,12 +573,12 @@ void do_coredump(siginfo_t *siginfo)
 			goto fail_unlock;
 		}
 
-		if (cprm.limit == 1) {
+		if (cprm->limit == 1) {
 			/* See umh_pipe_setup() which sets RLIMIT_CORE = 1.
 			 *
 			 * Normally core limits are irrelevant to pipes, since
 			 * we're not writing to the file system, but we use
-			 * cprm.limit of 1 here as a speacial value, this is a
+			 * cprm->limit of 1 here as a speacial value, this is a
 			 * consistent way to catch recursive crashes.
 			 * We can still crash if the core_pattern binary sets
 			 * RLIM_CORE = !1, but it runs as root, and can do
@@ -577,7 +595,7 @@ void do_coredump(siginfo_t *siginfo)
 			printk(KERN_WARNING "Aborting core\n");
 			goto fail_unlock;
 		}
-		cprm.limit = RLIM_INFINITY;
+		cprm->limit = RLIM_INFINITY;
 
 		dump_count = atomic_inc_return(&core_dump_count);
 		if (core_pipe_limit && (core_pipe_limit < dump_count)) {
@@ -587,22 +605,29 @@ void do_coredump(siginfo_t *siginfo)
 			goto fail_dropcount;
 		}
 
-		helper_argv = argv_split(GFP_KERNEL, cn.corename, NULL);
-		if (!helper_argv) {
+		cprm->argv = argv_split(GFP_KERNEL, cn.corename, NULL);
+		if (!cprm->argv) {
 			printk(KERN_WARNING "%s failed to allocate memory\n",
 			       __func__);
 			goto fail_dropcount;
 		}
 
 		retval = -ENOMEM;
-		sub_info = call_usermodehelper_setup(helper_argv[0],
-						helper_argv, NULL, GFP_KERNEL,
-						umh_pipe_setup, NULL, &cprm);
-		if (sub_info)
-			retval = call_usermodehelper_exec(sub_info,
-							  UMH_WAIT_EXEC);
-
-		argv_free(helper_argv);
+		sub_info = call_usermodehelper_setup
+			(cprm->argv[0], cprm->argv, NULL, GFP_KERNEL,
+			 umh_pipe_setup, umh_pipe_cleanup, cprm);
+		if (sub_info) {
+			/*
+			 * Let the cleanup of cprm to kthread if I was
+			 * SIGKILLed before the kthread starts cprm->argv[0].
+			 * We use cprm->in_use field for indicating whether
+			 * the kthread needs to cleanup cprm or not.
+			 */
+			cprm->in_use = true;
+			retval = call_usermodehelper_exec
+				(sub_info, UMH_WAIT_EXEC | UMH_KILLABLE);
+		}
+
 		if (retval) {
 			printk(KERN_INFO "Core dump to |%s pipe failed\n",
 			       cn.corename);
@@ -611,7 +636,7 @@ void do_coredump(siginfo_t *siginfo)
 	} else {
 		struct inode *inode;
 
-		if (cprm.limit < binfmt->min_coredump)
+		if (cprm->limit < binfmt->min_coredump)
 			goto fail_unlock;
 
 		if (need_nonrelative && cn.corename[0] != '/') {
@@ -622,16 +647,16 @@ void do_coredump(siginfo_t *siginfo)
 			goto fail_unlock;
 		}
 
-		cprm.file = filp_open(cn.corename,
+		cprm->file = filp_open(cn.corename,
 				 O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
 				 0600);
-		if (IS_ERR(cprm.file))
+		if (IS_ERR(cprm->file))
 			goto fail_unlock;
 
-		inode = file_inode(cprm.file);
+		inode = file_inode(cprm->file);
 		if (inode->i_nlink > 1)
 			goto close_fail;
-		if (d_unhashed(cprm.file->f_path.dentry))
+		if (d_unhashed(cprm->file->f_path.dentry))
 			goto close_fail;
 		/*
 		 * AK: actually i see no reason to not allow this for named
@@ -645,9 +670,9 @@ void do_coredump(siginfo_t *siginfo)
 		 */
 		if (!uid_eq(inode->i_uid, current_fsuid()))
 			goto close_fail;
-		if (!cprm.file->f_op || !cprm.file->f_op->write)
+		if (!cprm->file->f_op || !cprm->file->f_op->write)
 			goto close_fail;
-		if (do_truncate(cprm.file->f_path.dentry, 0, 0, cprm.file))
+		if (do_truncate(cprm->file->f_path.dentry, 0, 0, cprm->file))
 			goto close_fail;
 	}
 
@@ -658,15 +683,15 @@ void do_coredump(siginfo_t *siginfo)
 	if (displaced)
 		put_files_struct(displaced);
 	if (!dump_interrupted()) {
-		file_start_write(cprm.file);
-		core_dumped = binfmt->core_dump(&cprm);
-		file_end_write(cprm.file);
+		file_start_write(cprm->file);
+		core_dumped = binfmt->core_dump(cprm);
+		file_end_write(cprm->file);
 	}
 	if (ispipe && core_pipe_limit)
-		wait_for_dump_helpers(cprm.file);
+		wait_for_dump_helpers(cprm->file);
 close_fail:
-	if (cprm.file)
-		filp_close(cprm.file, NULL);
+	if (cprm->file)
+		filp_close(cprm->file, NULL);
 fail_dropcount:
 	if (ispipe)
 		atomic_dec(&core_dump_count);
@@ -677,6 +702,12 @@ fail_unlock:
 fail_creds:
 	put_cred(cred);
 fail:
+	/* Cleanup if this structure is no longer used by the kthread. */
+	if (!xchg(&cprm->in_use, false)) {
+		if (cprm->argv)
+			argv_free(cprm->argv);
+		kfree(cprm);
+	}
 	return;
 }
 
-- 
1.7.1


* Re: [PATCH v2] kthread: Make kthread_create() killable.
  2013-09-16  0:53       ` [PATCH v2] " Tetsuo Handa
@ 2013-09-23 23:07         ` Andrew Morton
  2013-09-24 15:17           ` [PATCH v3] " Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2013-09-23 23:07 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: oleg, security, linux-kernel

On Mon, 16 Sep 2013 09:53:59 +0900 Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote:

> Subject: [PATCH v2] kthread: Make kthread_create() killable.
> 
> Any users of wait_for_completion() might be chosen by the OOM killer while
> waiting for completion() call by some process which does memory allocation.
> kthread_create() is one of such users.
> 
> When such users are chosen by the OOM killer when they are waiting for
> completion() in TASK_UNINTERRUPTIBLE, problem similar to CVE-2012-4398
> "kernel: request_module() OOM local DoS" can happen.
> 
> This patch makes kthread_create() killable, using the same approach used for
> fixing CVE-2012-4398.

That's a pretty big patch.  What's the status of this?  Do you think
it's ready to go?  Oleg?

I don't like the changelog much - it doesn't really describe the bug. 
I can google "CVE-2012-4398" and that turns up a bunch of stuff but
it's more oriented toward sysadmins etc, rather than kernel developers.

Can we please have a conventional description?  When the user does A,
the kernel does B because C, so we fix it via D?



From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Subject: kthread: make kthread_create() killable

Any users of wait_for_completion() might be chosen by the OOM killer while
waiting for completion() call by some process which does memory
allocation.  kthread_create() is one of such users.

When such users are chosen by the OOM killer when they are waiting for
completion() in TASK_UNINTERRUPTIBLE, problem similar to CVE-2012-4398
"kernel: request_module() OOM local DoS" can happen.

This patch makes kthread_create() killable, using the same approach used for
fixing CVE-2012-4398.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/kthread.c |   73 +++++++++++++++++++++++++++++++++------------
 1 file changed, 55 insertions(+), 18 deletions(-)

diff -puN kernel/kthread.c~kthread-make-kthread_create-killable kernel/kthread.c
--- a/kernel/kthread.c~kthread-make-kthread_create-killable
+++ a/kernel/kthread.c
@@ -33,7 +33,7 @@ struct kthread_create_info
 
 	/* Result passed back to kthread_create() from kthreadd. */
 	struct task_struct *result;
-	struct completion done;
+	struct completion *done;
 
 	struct list_head list;
 };
@@ -178,6 +178,7 @@ static int kthread(void *_create)
 	struct kthread_create_info *create = _create;
 	int (*threadfn)(void *data) = create->threadfn;
 	void *data = create->data;
+	struct completion *done;
 	struct kthread self;
 	int ret;
 
@@ -187,10 +188,16 @@ static int kthread(void *_create)
 	init_completion(&self.parked);
 	current->vfork_done = &self.exited;
 
+	/* If user was SIGKILLed, I release the structure. */
+	done = xchg(&create->done, NULL);
+	if (!done) {
+		kfree(create);
+		do_exit(-EINTR);
+	}
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_UNINTERRUPTIBLE);
 	create->result = current;
-	complete(&create->done);
+	complete(done);
 	schedule();
 
 	ret = -EINTR;
@@ -223,8 +230,15 @@ static void create_kthread(struct kthrea
 	/* We want our own signal handler (we take no signals by default). */
 	pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
 	if (pid < 0) {
+		/* If user was SIGKILLed, I release the structure. */
+		struct completion *done = xchg(&create->done, NULL);
+
+		if (!done) {
+			kfree(create);
+			return;
+		}
 		create->result = ERR_PTR(pid);
-		complete(&create->done);
+		complete(done);
 	}
 }
 
@@ -255,36 +269,59 @@ struct task_struct *kthread_create_on_no
 					   const char namefmt[],
 					   ...)
 {
-	struct kthread_create_info create;
-
-	create.threadfn = threadfn;
-	create.data = data;
-	create.node = node;
-	init_completion(&create.done);
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct task_struct *task;
+	struct kthread_create_info *create = kmalloc(sizeof(*create),
+						     GFP_KERNEL);
+
+	if (!create)
+		return ERR_PTR(-ENOMEM);
+	create->threadfn = threadfn;
+	create->data = data;
+	create->node = node;
+	create->done = &done;
 
 	spin_lock(&kthread_create_lock);
-	list_add_tail(&create.list, &kthread_create_list);
+	list_add_tail(&create->list, &kthread_create_list);
 	spin_unlock(&kthread_create_lock);
 
 	wake_up_process(kthreadd_task);
-	wait_for_completion(&create.done);
-
-	if (!IS_ERR(create.result)) {
+	/*
+	 * Wait for completion in killable state, for I might be chosen by
+	 * the OOM killer while kthreadd is trying to allocate memory for
+	 * new kernel thread.
+	 */
+	if (unlikely(wait_for_completion_killable(&done))) {
+		/*
+		 * If I was SIGKILLed before kthreadd (or new kernel thread)
+		 * calls complete(), leave the cleanup of this structure to
+		 * that thread.
+		 */
+		if (xchg(&create->done, NULL))
+			return ERR_PTR(-ENOMEM);
+		/*
+		 * kthreadd (or new kernel thread) will call complete()
+		 * shortly.
+		 */
+		wait_for_completion(&done);
+	}
+	task = create->result;
+	if (!IS_ERR(task)) {
 		static const struct sched_param param = { .sched_priority = 0 };
 		va_list args;
 
 		va_start(args, namefmt);
-		vsnprintf(create.result->comm, sizeof(create.result->comm),
-			  namefmt, args);
+		vsnprintf(task->comm, sizeof(task->comm), namefmt, args);
 		va_end(args);
 		/*
 		 * root may have changed our (kthreadd's) priority or CPU mask.
 		 * The kernel thread should not inherit these properties.
 		 */
-		sched_setscheduler_nocheck(create.result, SCHED_NORMAL, &param);
-		set_cpus_allowed_ptr(create.result, cpu_all_mask);
+		sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
+		set_cpus_allowed_ptr(task, cpu_all_mask);
 	}
-	return create.result;
+	kfree(create);
+	return task;
 }
 EXPORT_SYMBOL(kthread_create_on_node);
 
_



* [PATCH v3] kthread: Make kthread_create() killable.
  2013-09-23 23:07         ` Andrew Morton
@ 2013-09-24 15:17           ` Tetsuo Handa
  2013-09-24 16:38             ` Oleg Nesterov
  2013-09-25 21:05             ` David Rientjes
  0 siblings, 2 replies; 17+ messages in thread
From: Tetsuo Handa @ 2013-09-24 15:17 UTC (permalink / raw)
  To: akpm; +Cc: oleg, security, linux-kernel

Andrew Morton wrote:
> That's a pretty big patch.  What's the status of this?  Do you think
> it's ready to go?  Oleg?
> 
> I don't like the changelog much - it doesn't really describe the bug. 
> I can google "CVE-2012-4398" and that turns up a bunch of stuff but
> it's more oriented toward sysadmins etc, rather than kernel developers.
> 
> Can we please have a conventional description?  When the user does A,
> the kernel does B because C, so we fix it via D?

I see. I updated description. Updated patch follows.
----------
From 0fe0c9d09b45cce0f00457755861204d51d7c2c9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 25 Sep 2013 00:00:27 +0900
Subject: [PATCH] kthread: make kthread_create() killable

Any users of wait_for_completion() might be chosen by the OOM killer while
waiting for completion() call by some process which does memory
allocation.  kthread_create() is one of such users.

When such users are chosen by the OOM killer when they are waiting for
completion() in TASK_UNINTERRUPTIBLE, the system will be kept stressed
due to memory starvation because the OOM killer cannot kill such users.

Fix this problem for kthreadd by making kthread_create() killable.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 kernel/kthread.c |   73 ++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 55 insertions(+), 18 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 760e86d..b5ae3ee 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -33,7 +33,7 @@ struct kthread_create_info
 
 	/* Result passed back to kthread_create() from kthreadd. */
 	struct task_struct *result;
-	struct completion done;
+	struct completion *done;
 
 	struct list_head list;
 };
@@ -178,6 +178,7 @@ static int kthread(void *_create)
 	struct kthread_create_info *create = _create;
 	int (*threadfn)(void *data) = create->threadfn;
 	void *data = create->data;
+	struct completion *done;
 	struct kthread self;
 	int ret;
 
@@ -187,10 +188,16 @@ static int kthread(void *_create)
 	init_completion(&self.parked);
 	current->vfork_done = &self.exited;
 
+	/* If user was SIGKILLed, I release the structure. */
+	done = xchg(&create->done, NULL);
+	if (!done) {
+		kfree(create);
+		do_exit(-EINTR);
+	}
 	/* OK, tell user we're spawned, wait for stop or wakeup */
 	__set_current_state(TASK_UNINTERRUPTIBLE);
 	create->result = current;
-	complete(&create->done);
+	complete(done);
 	schedule();
 
 	ret = -EINTR;
@@ -223,8 +230,15 @@ static void create_kthread(struct kthread_create_info *create)
 	/* We want our own signal handler (we take no signals by default). */
 	pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
 	if (pid < 0) {
+		/* If user was SIGKILLed, I release the structure. */
+		struct completion *done = xchg(&create->done, NULL);
+
+		if (!done) {
+			kfree(create);
+			return;
+		}
 		create->result = ERR_PTR(pid);
-		complete(&create->done);
+		complete(done);
 	}
 }
 
@@ -255,36 +269,59 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
 					   const char namefmt[],
 					   ...)
 {
-	struct kthread_create_info create;
-
-	create.threadfn = threadfn;
-	create.data = data;
-	create.node = node;
-	init_completion(&create.done);
+	DECLARE_COMPLETION_ONSTACK(done);
+	struct task_struct *task;
+	struct kthread_create_info *create = kmalloc(sizeof(*create),
+						     GFP_KERNEL);
+
+	if (!create)
+		return ERR_PTR(-ENOMEM);
+	create->threadfn = threadfn;
+	create->data = data;
+	create->node = node;
+	create->done = &done;
 
 	spin_lock(&kthread_create_lock);
-	list_add_tail(&create.list, &kthread_create_list);
+	list_add_tail(&create->list, &kthread_create_list);
 	spin_unlock(&kthread_create_lock);
 
 	wake_up_process(kthreadd_task);
-	wait_for_completion(&create.done);
-
-	if (!IS_ERR(create.result)) {
+	/*
+	 * Wait for completion in killable state, for I might be chosen by
+	 * the OOM killer while kthreadd is trying to allocate memory for
+	 * new kernel thread.
+	 */
+	if (unlikely(wait_for_completion_killable(&done))) {
+		/*
+		 * If I was SIGKILLed before kthreadd (or new kernel thread)
+		 * calls complete(), leave the cleanup of this structure to
+		 * that thread.
+		 */
+		if (xchg(&create->done, NULL))
+			return ERR_PTR(-ENOMEM);
+		/*
+		 * kthreadd (or new kernel thread) will call complete()
+		 * shortly.
+		 */
+		wait_for_completion(&done);
+	}
+	task = create->result;
+	if (!IS_ERR(task)) {
 		static const struct sched_param param = { .sched_priority = 0 };
 		va_list args;
 
 		va_start(args, namefmt);
-		vsnprintf(create.result->comm, sizeof(create.result->comm),
-			  namefmt, args);
+		vsnprintf(task->comm, sizeof(task->comm), namefmt, args);
 		va_end(args);
 		/*
 		 * root may have changed our (kthreadd's) priority or CPU mask.
 		 * The kernel thread should not inherit these properties.
 		 */
-		sched_setscheduler_nocheck(create.result, SCHED_NORMAL, &param);
-		set_cpus_allowed_ptr(create.result, cpu_all_mask);
+		sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
+		set_cpus_allowed_ptr(task, cpu_all_mask);
 	}
-	return create.result;
+	kfree(create);
+	return task;
 }
 EXPORT_SYMBOL(kthread_create_on_node);
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3] kthread: Make kthread_create() killable.
  2013-09-24 15:17           ` [PATCH v3] " Tetsuo Handa
@ 2013-09-24 16:38             ` Oleg Nesterov
  2013-09-25 21:05             ` David Rientjes
  1 sibling, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2013-09-24 16:38 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, security, linux-kernel

On 09/25, Tetsuo Handa wrote:
>
> Subject: [PATCH] kthread: make kthread_create() killable

The patch looks correct... although I'll try to reread it later ;)

Minor nit, since you are sending the new version anyway...

> +	if (unlikely(wait_for_completion_killable(&done))) {
> +		/*
> +		 * If I was SIGKILLed before kthreadd (or new kernel thread)
> +		 * calls complete(), leave the cleanup of this structure to
> +		 * that thread.
> +		 */
> +		if (xchg(&create->done, NULL))
> +			return ERR_PTR(-ENOMEM);

ENOMEM? Perhaps EINTR makes more sense.

Oleg.



* Re: [PATCH v3] kthread: Make kthread_create() killable.
  2013-09-24 15:17           ` [PATCH v3] " Tetsuo Handa
  2013-09-24 16:38             ` Oleg Nesterov
@ 2013-09-25 21:05             ` David Rientjes
  2013-09-26  2:59               ` Tetsuo Handa
  1 sibling, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-09-25 21:05 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, oleg, security, linux-kernel

On Wed, 25 Sep 2013, Tetsuo Handa wrote:

> Any users of wait_for_completion() might be chosen by the OOM killer while
> waiting for completion() call by some process which does memory
> allocation.  kthread_create() is one of such users.
> 

Any user process callers of wait_for_completion() you mean.

> When such users are chosen by the OOM killer when they are waiting for
> completion() in TASK_UNINTERRUPTIBLE, the system will be kept stressed
> due to memory starvation because the OOM killer cannot kill such users.
> 

Also results in a livelock if you're running in a memcg and have hit its 
limit.

> Fix this problem for kthreadd by making kthread_create() killable.
> 

There appear to be other potential callers of wait_for_completion() in the 
tree as well that could be holding lots of memory besides 
kthread_create().

Perhaps that's beyond the scope of this particular patch, though.

wait_for_completion() is scary if that completion requires memory that 
cannot be allocated because the caller is killed but uninterruptible.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: David Rientjes <rientjes@google.com>


* Re: [PATCH v3] kthread: Make kthread_create() killable.
  2013-09-25 21:05             ` David Rientjes
@ 2013-09-26  2:59               ` Tetsuo Handa
  2013-09-26 18:53                 ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2013-09-26  2:59 UTC (permalink / raw)
  To: rientjes; +Cc: akpm, oleg, security, linux-kernel

Thank you for comments, David.

David Rientjes wrote:
> Also results in a livelock if you're running in a memcg and have hit its
> limit.

> wait_for_completion() is scary if that completion requires memory that 
> cannot be allocated because the caller is killed but uninterruptible.

I don't think these lines are specific to wait_for_completion() users.

Currently the OOM killer is disabled throughout from "the moment the OOM killer
chose a process to kill" to "the moment the task_struct of the chosen process
becomes unreachable". Any blocking functions which wait in TASK_UNINTERRUPTIBLE
(e.g. mutex_lock()) can disable the OOM killer if the current thread is chosen
by the OOM killer. Therefore, any users of blocking functions which wait in
TASK_UNINTERRUPTIBLE are considered scary if they assume that the current
thread will not be chosen by the OOM killer.

But it seems to me that re-enabling the OOM killer at some point is more
realizable than purging all such users.

To re-enable the OOM killer at some point, the OOM killer needs to choose more
processes if the to-be-killed process cannot be terminated within an adequate
period.

For example, add "unsigned long memdie_stamp;" to "struct task_struct" and do
"p->memdie_stamp = jiffies + 5 * HZ;" before "set_tsk_thread_flag(p, TIF_MEMDIE);"
and do

 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
 		if (unlikely(frozen(task)))
 			__thaw_task(task);
+		/* Choose more processes if the chosen process cannot die. */
+		if (time_after(jiffies, p->memdie_stamp) &&
+		    task->state == TASK_UNINTERRUPTIBLE)
+		        return OOM_SCAN_CONTINUE;
 		if (!force_kill)
 			return OOM_SCAN_ABORT;
 	}

in oom_scan_process_thread().

This idea comes at the cost of an increased possibility of different side
effects (e.g. the second-worst process is chosen by the OOM killer when the
worst process cannot be terminated => memory allocation for writeback fails
because the second-worst process was in the ext3 writeback path => an fs-error
action (remount read-only or panic) gets triggered by the second-worst
process).



Anyway, this patch helps the OOM killer kill the process smoothly when the
chosen process is waiting in kthread_create(). I attach an updated patch
description. Did I merge your comments appropriately?
----------
[PATCH v3] kthread: Make kthread_create() killable.

Any user-process caller of wait_for_completion(), except the global init
process, might be chosen by the OOM killer while waiting for a completion()
call by some other process which does memory allocation.

When such a user is chosen by the OOM killer while waiting for completion()
in TASK_UNINTERRUPTIBLE, the system will be kept stressed due to memory
starvation because the OOM killer cannot kill it.

kthread_create() is one such user, and this patch fixes the problem for
kthreadd by making kthread_create() killable.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


* Re: [PATCH v3] kthread: Make kthread_create() killable.
  2013-09-26  2:59               ` Tetsuo Handa
@ 2013-09-26 18:53                 ` David Rientjes
  2013-09-28  7:18                   ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-09-26 18:53 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, oleg, security, linux-kernel

On Thu, 26 Sep 2013, Tetsuo Handa wrote:

> > wait_for_completion() is scary if that completion requires memory that 
> > cannot be allocated because the caller is killed but uninterruptible.
> 
> I don't think these lines are specific to wait_for_completion() users.
> 
> Currently the OOM killer is disabled throughout from "the moment the OOM killer
> chose a process to kill" to "the moment the task_struct of the chosen process
> becomes unreachable". Any blocking functions which wait in TASK_UNINTERRUPTIBLE
> (e.g. mutex_lock()) can disable the OOM killer if the current thread is chosen
> by the OOM killer. Therefore, any users of blocking functions which wait in
> TASK_UNINTERRUPTIBLE are considered scary if they assume that the current
> thread will not be chosen by the OOM killer.
> 

Yeah, that's always been true.

> But it seems to me that re-enabling the OOM killer at some point is more
> realizable than purging all such users.
> 
> To re-enable the OOM killer at some point, the OOM killer needs to choose more
> processes if the to-be-killed process cannot be terminated within an adequate
> period.
> 
> For example, add "unsigned long memdie_stamp;" to "struct task_struct" and do
> "p->memdie_stamp = jiffies + 5 * HZ;" before "set_tsk_thread_flag(p, TIF_MEMDIE);"
> and do
> 
>  	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
>  		if (unlikely(frozen(task)))
>  			__thaw_task(task);
> +		/* Choose more processes if the chosen process cannot die. */
> +		if (time_after(jiffies, p->memdie_stamp) &&
> +		    task->state == TASK_UNINTERRUPTIBLE)
> +		        return OOM_SCAN_CONTINUE;
>  		if (!force_kill)
>  			return OOM_SCAN_ABORT;
>  	}
> 
> in oom_scan_process_thread().
> 

There may not be any eligible processes left and then the machine panics.  
These time-based delays also have caused a complete depletion of memory 
reserves if more than one process is chosen and each consumes a 
non-negligible amount of memory which would then cause livelock.  We used to 
have a jiffies-based rekill in 2.6.18 internally and we finally could 
remove it when mm->mmap_sem issues were fixed (mostly by checking for 
fatal_signal_pending() and aborting when necessary).

> [PATCH v3] kthread: Make kthread_create() killable.
> 
> Any user process callers of wait_for_completion() except global init process
> might be chosen by the OOM killer while waiting for completion() call by some
> other process which does memory allocation.
> 
> When such users are chosen by the OOM killer when they are waiting for
> completion() in TASK_UNINTERRUPTIBLE, the system will be kept stressed
> due to memory starvation because the OOM killer cannot kill such users.
> 
> kthread_create() is one of such users and this patch fixes the problem for
> kthreadd by making kthread_create() killable.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Absolutely, thanks.


* Re: kthread: Make kthread_create() killable.
  2013-09-26 18:53                 ` David Rientjes
@ 2013-09-28  7:18                   ` Tetsuo Handa
  2013-09-30 21:57                     ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2013-09-28  7:18 UTC (permalink / raw)
  To: rientjes; +Cc: akpm, oleg, security, linux-kernel

David Rientjes wrote:
> There may not be any eligible processes left and then the machine panics.  

Some enterprise users might prefer "kernel panic followed by kdump and
automatic reboot" to "a system not responding for an unpredictable period",
since the panic helps gather information for analyzing which process caused
the freeze. Well, can they use the "Panic (Reboot) On Soft Lockups" option?

> These time-based delays also have caused a complete depletion of memory 
> reserves if more than one process is chosen and each consumes a 
> non-negligible amount of memory which would then cause livelock.  We used to 
> have a jiffies-based rekill in 2.6.18 internally and we finally could 
> remove it when mm->mmap_sem issues were fixed (mostly by checking for 
> fatal_signal_pending() and aborting when necessary).

So, you've already tried that.

Currently the OOM killer kills a process after

  blocking_notifier_call_chain(&oom_notify_list, 0, &freed);

in out_of_memory() has released all reclaimable memory. This call helps reduce
the chance of killing a process if the bad process no longer asks for more
memory. But if the bad process keeps asking for more memory and the chosen task
is in TASK_UNINTERRUPTIBLE state, this call leaves the OOM killer disabled for
an unpredictable period. Therefore, releasing all reclaimable memory before the
OOM killer kills a process might be considered harmful.

Then, what about an approach described below?

(1) Introduce a kernel thread which reserves (e.g.) 1 percent of kernel memory
    (this amount should be configurable via sysctl) upon startup.

(2) The kernel thread sleeps using wait_event(memory_reservoir_wait) and
    releases PAGE_SIZE bytes from the reserved memory upon each wakeup.

(3) The OOM killer calls wake_up() like

     	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
     		if (unlikely(frozen(task)))
     			__thaw_task(task);
    +		/* Let the memory reservoir release memory if the chosen process cannot die. */
    +		if (time_after(jiffies, p->memdie_stamp) &&
    +		    task->state == TASK_UNINTERRUPTIBLE)
    +		        wake_up(&memory_reservoir_wait);
     		if (!force_kill)
     			return OOM_SCAN_ABORT;
     	}

    in oom_scan_process_thread().

(4) When a task where test_tsk_thread_flag(task, TIF_MEMDIE) is true has
    terminated and memory used by the task is reclaimed, the reclaimed memory
    is again reserved by the kernel thread up to 1 percent of kernel memory.

In this way, we could shorten the duration of the OOM killer being disabled
unless the reserved memory was not enough to terminate the chosen process.


* Re: kthread: Make kthread_create() killable.
  2013-09-28  7:18                   ` Tetsuo Handa
@ 2013-09-30 21:57                     ` David Rientjes
  2013-10-01 13:20                       ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-09-30 21:57 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: akpm, oleg, security, linux-kernel

On Sat, 28 Sep 2013, Tetsuo Handa wrote:

> Some of enterprise users might prefer "kernel panic followed by kdump and
> automatic reboot" to "a system is not responding for unpredictable period", for
> the panic helps getting information for analyzing what process caused the
> freeze. Well, can they use "Panic (Reboot) On Soft Lockups" option?
> 

Or, when the system doesn't respond for a long period of time you do 
sysrq+T and you find the TIF_MEMDIE bit set on a process that makes no 
progress exiting.  These instances _should_ be very rare since we don't 
have any other reports of it (and the oom killer hasn't differed in this 
regard for over three years).  It used to be much more common for 
mm->mmap_sem dependencies that were fixed.

> Currently the OOM killer kills a process after
> 
>   blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> 
> in out_of_memory() released all reclaimable memory.

The oom notifiers usually don't do any good on x86.

> This call helps reducing
> the chance to kill a process if the bad process no longer asks for more memory.

The "bad process" could be anything, it's simply the process that is 
allocating memory when all memory is exhausted.

> But if the bad process continues asking for more memory and the chosen task is
> in TASK_UNINTERRUPTIBLE state, this call helps the OOM killer to be disabled
> for unpredictable period. Therefore, releasing all reclaimable memory before
> the OOM killer kills a process might be considered bad.
> 

I don't follow this statement, could you reword it?

If current calls the oom killer and the oom notifiers don't free any 
memory (which is very likely), then choosing an uninterruptible process is 
possible and has always been possible.  If sending SIGKILL and giving that 
process access to memory reserves does not allow it to exit in a short 
amount of time, then it must be waiting on another process that also 
cannot make forward progress.  We must identify these cases (which is 
easily doable as described above) and fix them.

> Then, what about an approach described below?
> 
> (1) Introduce a kernel thread which reserves (e.g.) 1 percent of kernel memory
>     (this amount should be configurable via sysctl) upon startup.
> 

We don't need kernel threads, this is what per-zone memory reserves are 
intended to provide for GFP_ATOMIC and TIF_MEMDIE allocations (or 
PF_MEMALLOC for reclaimers).

> (2) The kernel thread sleeps using wait_event(memory_reservoir_wait) and
>     releases PAGE_SIZE bytes from the reserved memory upon each wakeup.
> 
> (3) The OOM killer calls wake_up() like
> 
>      	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
>      		if (unlikely(frozen(task)))
>      			__thaw_task(task);
>     +		/* Let the memory reservoir release memory if the chosen process cannot die. */
>     +		if (time_after(jiffies, p->memdie_stamp) &&
>     +		    task->state == TASK_UNINTERRUPTIBLE)
>     +		        wake_up(&memory_reservoir_wait);
>      		if (!force_kill)
>      			return OOM_SCAN_ABORT;
>      	}
> 
>     in oom_scan_process_thread().
> 

This doesn't guarantee that the process that the chosen process is waiting 
for will be able to allocate that page, so it's useless.

> (4) When a task where test_tsk_thread_flag(task, TIF_MEMDIE) is true has
>     terminated and memory used by the task is reclaimed, the reclaimed memory
>     is again reserved by the kernel thread up to 1 percent of kernel memory.
> 
> In this way, we could shorten the duration of the OOM killer being disabled
> unless the reserved memory was not enough to terminate the chosen process.
> 

Could you please describe and post the details of any case where this 
currently happens so we can address the problem directly instead of trying 
to workaround it?


* Re: kthread: Make kthread_create() killable.
  2013-09-30 21:57                     ` David Rientjes
@ 2013-10-01 13:20                       ` Tetsuo Handa
  2013-10-02  7:06                         ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2013-10-01 13:20 UTC (permalink / raw)
  To: rientjes; +Cc: akpm, oleg, security, linux-kernel

David Rientjes wrote:
> On Sat, 28 Sep 2013, Tetsuo Handa wrote:
> 
> > Some of enterprise users might prefer "kernel panic followed by kdump and
> > automatic reboot" to "a system is not responding for unpredictable period", for
> > the panic helps getting information for analyzing what process caused the
> > freeze. Well, can they use "Panic (Reboot) On Soft Lockups" option?
> > 
> 
> Or, when the system doesn't respond for a long period of time you do 
> sysrq+T and you find the TIF_MEMDIE bit set on a process that makes no 
> progress exiting.

In enterprise systems, an operator is not always sitting in front of the server
for pressing sysrq keys (nor kept ssh session for issuing sysrq via procfs).
The operator will likely notice the problem many hours after the system froze,
find that he/she cannot log in, and press the power reset button. Rather than
wasting those hours, an unattended automatic reboot might be preferred.

>                    These instances _should_ be very rare since we don't 
> have any other reports of it (and the oom killer hasn't differed in this 
> regard for over three years).  It used to be much more common for 
> mm->mmap_sem dependencies that were fixed.
> 

Such reports in the real world might be rare, but I care about potential bugs
which can affect availability of the server.

If local unprivileged users can execute their own programs, they can easily
freeze the server. Therefore, I test whether such freeze can happen using DoS
attacking programs executed by local unprivileged users. I confirmed that
request_module() can easily freeze the server and the request_module() case was
fixed as CVE-2012-4398. I confirmed that kthread_create() can freeze the server
(though not easy to trigger but can happen by chance) and posted a patch in
this thread.

> > Currently the OOM killer kills a process after
> > 
> >   blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> > 
> > in out_of_memory() released all reclaimable memory.
> 
> The oom notifiers usually don't do any good on x86.
> 
> > This call helps reducing
> > the chance to kill a process if the bad process no longer asks for more memory.
> 
> The "bad process" could be anything, it's simply the process that is 
> allocating memory when all memory is exhausted.
> 

I'm using "bad process" as what you mean.

> > But if the bad process continues asking for more memory and the chosen task is
> > in TASK_UNINTERRUPTIBLE state, this call helps the OOM killer to be disabled
> > for unpredictable period. Therefore, releasing all reclaimable memory before
> > the OOM killer kills a process might be considered bad.
> > 
> 
> I don't follow this statement, could you reword it?
> 
> If current calls the oom killer and the oom notifiers don't free any 
> memory (which is very likely), then choosing an uninterruptible process is 
> possible and has always been possible.

Yes, this does happen if a local unprivileged user who can execute his/her own
program consumed a lots of memory.

>                                         If sending SIGKILL and giving that 
> process access to memory reserves does not allow it to exit in a short 
> amount of time, then it must be waiting on another process that also 
> cannot make forward progress.

Yes. kthread_create(), do_coredump() and call_usermodehelper_keys() are
examples of such cases which I think I can trigger deadlock using DoS attacking
programs executed by local unprivileged users.

>                               We must identify these cases (which is 
> easily doable as described above) and fix them.
> 

I'm not expecting that we identify all possible cases, for any blocking
functions which wait in TASK_UNINTERRUPTIBLE are candidates for such cases.
This is as problematic as GFP_NOFS allocation functions calling other functions
which do GFP_KERNEL allocation.

> > Then, what about an approach described below?
> > 
> > (1) Introduce a kernel thread which reserves (e.g.) 1 percent of kernel memory
> >     (this amount should be configurable via sysctl) upon startup.
> > 
> 
> We don't need kernel threads, this is what per-zone memory reserves are 
> intended to provide for GFP_ATOMIC and TIF_MEMDIE allocations (or 
> PF_MEMALLOC for reclaimers).
> 

I know some amount of memory is reserved for GFP_ATOMIC/TIF_MEMDIE allocations.
What I'm talking about is GFP_KERNEL allocating processes which are preventing
TIF_MEMDIE process from terminating due to TASK_UNINTERRUPTIBLE wait.

> > (2) The kernel thread sleeps using wait_event(memory_reservoir_wait) and
> >     releases PAGE_SIZE bytes from the reserved memory upon each wakeup.
> > 
> > (3) The OOM killer calls wake_up() like
> > 
> >      	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> >      		if (unlikely(frozen(task)))
> >      			__thaw_task(task);
> >     +		/* Let the memory reservoir release memory if the chosen process cannot die. */
> >     +		if (time_after(jiffies, p->memdie_stamp) &&
> >     +		    task->state == TASK_UNINTERRUPTIBLE)
> >     +		        wake_up(&memory_reservoir_wait);
> >      		if (!force_kill)
> >      			return OOM_SCAN_ABORT;
> >      	}
> > 
> >     in oom_scan_process_thread().
> > 
> 
> This doesn't guarantee that the process that the chosen process is waiting 
> for will be able to allocate that page, so it's useless.
> 

I don't think this is completely useless.

The chosen process was marked as TIF_MEMDIE because the process was consuming
most memory, but it does not mean that the process marked as TIF_MEMDIE itself
needs more memory. The process marked as TIF_MEMDIE may be simply waiting at
kthread_create() etc. where other process which is not marked as TIF_MEMDIE
depends on more memory.

Even if the process marked as TIF_MEMDIE is allowed to access reserved memory,
the process which really wants memory may not be marked as TIF_MEMDIE.
To be able to give the process which really wants memory chance to allocate
memory, memory pool via kernel thread might help.

> > (4) When a task where test_tsk_thread_flag(task, TIF_MEMDIE) is true has
> >     terminated and memory used by the task is reclaimed, the reclaimed memory
> >     is again reserved by the kernel thread up to 1 percent of kernel memory.
> > 
> > In this way, we could shorten the duration of the OOM killer being disabled
> > unless the reserved memory was not enough to terminate the chosen process.
> > 
> 
> Could you please describe and post the details of any case where this 
> currently happens so we can address the problem directly instead of trying 
> to workaround it?
> 

Even if the process marked as TIF_MEMDIE can access per-zone memory reserves,
that does not help if the process which really needs memory (in order to
terminate the process marked as TIF_MEMDIE) cannot access per-zone memory
reserves.

If (e.g.) the caller of kthread_create() was marked as TIF_MEMDIE but the
kthreadd cannot allocate memory which is needed for terminating the caller of
kthread_create(), the system freezes. We can make kthread_create() killable
because it is not too difficult to fix.

Some of such dependencies (e.g. kthread_create()) can be addressed directly,
but it is too hard to address all of them directly. Thus, I propose a
memory-reserving thread as a workaround.


* Re: kthread: Make kthread_create() killable.
  2013-10-01 13:20                       ` Tetsuo Handa
@ 2013-10-02  7:06                         ` David Rientjes
  2013-10-02 13:44                           ` Tetsuo Handa
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2013-10-02  7:06 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrew Morton, oleg, security, linux-kernel

On Tue, 1 Oct 2013, Tetsuo Handa wrote:

> > Or, when the system doesn't respond for a long period of time you do 
> > sysrq+T and you find the TIF_MEMDIE bit set on a process that makes no 
> > progress exiting.
> 
> In enterprise systems, an operator is not always sitting in front of the server
> for pressing sysrq keys (nor kept ssh session for issuing sysrq via procfs).

We have a lot of machines and we collect netdumps from those that 
encounter soft lockups, I haven't seen _any_ examples of the oom killer 
stalling out because of wait_for_completion() in three years, period.

> Such reports in the real world might be rare, but I care about potential bugs
> which can affect availability of the server.
> 

If you have a reproducible testcase or an example from the kernel log that 
demonstrates this problem, please post it.

> If local unprivileged users can execute their own programs, they can easily
> freeze the server. Therefore, I test whether such freeze can happen using DoS
> attacking programs executed by local unprivileged users. I confirmed that
> request_module() can easily freeze the server and the request_module() case was
> fixed as CVE-2012-4398. I confirmed that kthread_create() can freeze the server
> (though not easy to trigger but can happen by chance) and posted a patch in
> this thread.
> 

Right, and that issue required an instrumented kernel to impose a five 
second delay for you to be able to test it.  I acked the patch.  I'm not 
against fixing issues as they arise, I'm against time-based oom kill 
mechanisms that (1) aren't guaranteed to solve the issue and (2) lead to 
unnecessary memory killing or machine panic.

> > process access to memory reserves does not allow it to exit in a short 
> > amount of time, then it must be waiting on another process that also 
> > cannot make forward progress.
> 
> Yes. kthread_create(), do_coredump() and call_usermodehelper_keys() are
> examples of such cases which I think I can trigger deadlock using DoS attacking
> programs executed by local unprivileged users.
> 

I thought you had addressed the kthread_create() issue on 
security@kernel.org, is it possible to address the others?

> The chosen process was marked as TIF_MEMDIE because the process was consuming
> most memory, but it does not mean that the process marked as TIF_MEMDIE itself
> needs more memory. The process marked as TIF_MEMDIE may be simply waiting at
> kthread_create() etc. where other process which is not marked as TIF_MEMDIE
> depends on more memory.
> 
> Even if the process marked as TIF_MEMDIE is allowed to access reserved memory,
> the process which really wants memory may not be marked as TIF_MEMDIE.
> To be able to give the process which really wants memory chance to allocate
> memory, memory pool via kernel thread might help.
> 

That may or may not work depending on whether other processes are 
allocating memory or not, there's no guarantee that this pool of memory is 
available only to the process needing to make progress, so it's not a 100% 
solution and now we've allocated 1% of memory to a kthread needlessly in 
99.999% of users who will never see this issue.


* Re: kthread: Make kthread_create() killable.
  2013-10-02  7:06                         ` David Rientjes
@ 2013-10-02 13:44                           ` Tetsuo Handa
  0 siblings, 0 replies; 17+ messages in thread
From: Tetsuo Handa @ 2013-10-02 13:44 UTC (permalink / raw)
  To: rientjes; +Cc: akpm, oleg, security, linux-kernel

David Rientjes wrote:
> I thought you had addressed the kthread_create() issue on 
> security@kernel.org, is it possible to address the others?

I don't know how to call call_usermodehelper_keys(). I need a lot of help
if I write a patch and a testcase for call_usermodehelper_keys().

Thus, I explain only do_coredump() case.

> If you have a reproducible testcase or an example from the kernel log that 
> demonstrates this problem, please post it.

OK, quoting a kernel log. I want to keep the reproducer program and the
backtrace it produces secret for now, since I don't think it is a good idea
to publish a DoS attack program before the problem is fixed.

Theoretically, the OOM killer can choose a process which is sleeping in
wait_for_completion(). I reproduced this condition using the patch below,
which widens the race window for triggering the OOM killer.

---------- patch start ----------
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -485,6 +485,8 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 	return err;
 }
 
+#include <linux/delay.h>
+
 void do_coredump(siginfo_t *siginfo)
 {
 	struct core_state core_state;
@@ -598,9 +600,16 @@ void do_coredump(siginfo_t *siginfo)
 		sub_info = call_usermodehelper_setup(helper_argv[0],
 						helper_argv, NULL, GFP_KERNEL,
 						umh_pipe_setup, NULL, &cprm);
+		if (sub_info) {
+			printk(KERN_INFO "***** Delaying\n");
+			ssleep(5); /* Give other tasks chance to trigger OOM killer */
+			printk(KERN_INFO "***** Calling call_usermodehelper_exec()\n");
+		}
 		if (sub_info)
 			retval = call_usermodehelper_exec(sub_info,
 							  UMH_WAIT_EXEC);
+		if (sub_info)
+			printk(KERN_INFO "***** call_usermodehelper_exec() done.\n");
 
 		argv_free(helper_argv);
 		if (retval) {
---------- patch end ----------

---------- dmesg start ----------
[  412.089694] a.out[4110]: segfault at 0 ip 08048476 sp bfe82730 error 4 in a.out[8048000+1000]
[  412.091986] ***** Delaying
[  415.988417] memeater invoked oom-killer: gfp_mask=0x200d2, order=0, oom_score_adj=0
[  415.990294] CPU: 0 PID: 4111 Comm: memeater Not tainted 3.11.0-10050-g3711d86-dirty #119
[  415.992732] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 08/15/2008
[  415.995265]  df1da3d0 de02bd38 c05542eb 000200d2 de02bd9c c01a76a1 c067c530 df1da5c8
[  415.997938]  000200d2 00000000 00000000 c013e3d1 de02bd60 c012817c de02bdb8 c0557b6d
[  416.000153]  df147bd0 00000027 00002727 00000206 de02bd9c c0351862 df147edc 0000007b
[  416.002366] Call Trace:
(... snipped ...)
[  416.027799] Mem-Info:
[  416.028453] DMA per-cpu:
[  416.030255] CPU    0: hi:    0, btch:   1 usd:   0
[  416.031483] CPU    1: hi:    0, btch:   1 usd:   0
[  416.032733] Normal per-cpu:
[  416.033486] CPU    0: hi:  186, btch:  31 usd:  30
[  416.034706] CPU    1: hi:  186, btch:  31 usd:   0
[  416.035935] active_anon:117681 inactive_anon:350 isolated_anon:0
[  416.035935]  active_file:13 inactive_file:6 isolated_file:0
[  416.035935]  unevictable:0 dirty:0 writeback:0 unstable:0
[  416.035935]  free:1181 slab_reclaimable:412 slab_unreclaimable:1472
[  416.035935]  mapped:0 shmem:34 pagetables:161 bounce:0
[  416.035935]  free_cma:0
[  416.044759] DMA free:2008kB min:48kB low:60kB high:72kB active_anon:6880kB inactive_anon:20kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:9248kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:40kB slab_unreclaimable:84kB kernel_stack:0kB pagetables:20kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  416.055474] lowmem_reserve[]: 0 491 491 491
[  416.056800] Normal free:2716kB min:2808kB low:3508kB high:4212kB active_anon:463844kB inactive_anon:1380kB active_file:52kB inactive_file:24kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:507840kB managed:503244kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:136kB slab_reclaimable:1608kB slab_unreclaimable:5804kB kernel_stack:552kB pagetables:624kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:130 all_unreclaimable? yes
[  416.068128] lowmem_reserve[]: 0 0 0 0
[  416.069291] DMA: 0*4kB 1*8kB (R) 1*16kB (R) 0*32kB 1*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 2008kB
[  416.074373] Normal: 275*4kB (UEM) 136*8kB (UEM) 33*16kB (EM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2716kB
[  416.078022] 56 total pagecache pages
[  416.078954] 0 pages in swap cache
[  416.079829] Swap cache stats: add 120624, delete 120624, find 23686/24787
[  416.081544] Free swap  = 0kB
[  416.083515] Total swap = 0kB
[  416.085827] 131071 pages RAM
[  416.086644] 0 pages HighMem
[  416.087686] 2948 pages reserved
[  416.088581] 523856 pages shared
[  416.089421] 126585 pages non-shared
[  416.090338] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  416.092365] [ 2943]     0  2943      625       94       2        0         -1000 udevd
[  416.095624] [ 3542]     0  3542     2732       45       2        6         -1000 auditd
[  416.097695] [ 3790]     0  3790     2171      122       4        0         -1000 sshd
[  416.099690] [ 3888]     0  3888      624       95       2        0         -1000 udevd
[  416.101750] [ 3889]     0  3889      624       93       2        0         -1000 udevd
[  416.104906] [ 4074]     0  4074      993      106       4        0             0 login
[  416.106926] [ 4075]     0  4075      513       14       3        0             0 mingetty
[  416.109019] [ 4076]     0  4076      513       13       3        0             0 mingetty
[  416.111072] [ 4077]     0  4077      513       14       3        0             0 mingetty
[  416.113162] [ 4078]     0  4078      513       13       3        0             0 mingetty
[  416.116363] [ 4079]     0  4079      993      107       3        0             0 login
[  416.118357] [ 4080]     0  4080     1611       70       4        0             0 bash
[  416.120354] [ 4093]   500  4093     1611       60       4        0             0 bash
[  416.122327] [ 4110]     0  4110   117665   117200     117        0             0 a.out
[  416.125436] [ 4111]   500  4111      478       10       3        0             0 memeater
[  416.127509] Out of memory: Kill process 4110 (a.out) score 885 or sacrifice child
[  416.129458] Killed process 4110 (a.out) total-vm:470660kB, anon-rss:468800kB, file-rss:0kB
[  417.096035] ***** Calling call_usermodehelper_exec()
[  600.976438] INFO: task a.out:4110 blocked for more than 120 seconds.
[  600.978185]       Not tainted 3.11.0-10050-g3711d86-dirty #119
[  600.979655] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  600.980561] a.out           D 0000bc00     0  4110   4080 0x00100084
[  600.982320]  de6a9d20 00000082 00000000 0000bc00 0000000a 1cf02990 df1dab10 df097510
[  600.985850]  df1dab10 c07b8180 c07b2000 c07b8180 df1dab10 df1dab50 00000000 c0568220
[  600.988293]  dffea180 c0568220 de6a9cf0 c015c29a dffea180 df0974d0 dffea180 de6a9d04
[  600.990597] Call Trace:
[  600.993308]  [<c015c29a>] ? check_preempt_curr+0x6a/0x80
[  600.994658]  [<c015c2ca>] ? ttwu_do_wakeup+0x1a/0xd0
[  600.995931]  [<c015fe5a>] ? ttwu_do_activate.clone.1+0x3a/0x50
[  600.996651]  [<c055660e>] schedule+0x1e/0x50
[  600.997749]  [<c0554435>] schedule_timeout+0x135/0x180
[  600.999069]  [<c016009a>] ? wake_up_process+0x1a/0x30
[  601.000470]  [<c014c229>] ? wake_up_worker+0x19/0x20
[  601.001733]  [<c014c643>] ? insert_work+0x53/0x90
[  601.002941]  [<c0555dc4>] wait_for_common+0x94/0x120
[  601.005966]  [<c0160060>] ? try_to_wake_up+0x1f0/0x1f0
[  601.007258]  [<c0555ef2>] wait_for_completion+0x12/0x20
[  601.008590]  [<c014b697>] call_usermodehelper_exec+0xd7/0x100
[  601.010026]  [<c0219104>] do_coredump+0x994/0xca0
[  601.011215]  [<c0219410>] ? do_coredump+0xca0/0xca0
[  601.012540]  [<c0140000>] ? do_proc_dointvec_ms_jiffies_conv+0x70/0x80
[  601.014169]  [<c0145d42>] ? __dequeue_signal+0xd2/0x140
[  601.015500]  [<c014877a>] get_signal_to_deliver+0x19a/0x4e0
[  601.016609]  [<c0101a47>] do_signal+0x37/0x960
[  601.017747]  [<c013e3d1>] ? irq_exit+0x51/0x90
[  601.020353]  [<c0103236>] ? do_IRQ+0x46/0xb0
[  601.021455]  [<c010931d>] ? init_fpu+0x3d/0xa0
[  601.022602]  [<c055a840>] ? __do_page_fault+0x500/0x500
[  601.023915]  [<c01023c5>] do_notify_resume+0x55/0x70
[  601.025088]  [<c0557a6b>] work_notifysig+0x24/0x29
[  601.026304]  [<c055a840>] ? __do_page_fault+0x500/0x500
---------- dmesg end ----------

Below is the patch to fix this problem.

----------------------------------------
From 052fedc920b735354b618e23c0b74c7b88ecd3c6 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 2 Oct 2013 22:11:00 +0900
Subject: [PATCH v2] coredump: Make startup of coredump to pipe killable.

Any user-process caller of wait_for_completion(), except the global init
process, might be chosen by the OOM killer while waiting for a complete()
call from some other process which is doing memory allocation.

When such a user is chosen by the OOM killer while waiting in
TASK_UNINTERRUPTIBLE, the system stays stressed by memory starvation
because the OOM killer cannot actually kill it.

call_usermodehelper() without the UMH_KILLABLE flag is one such user, and this
patch fixes the problem for do_coredump() by making the startup of a coredump
to a pipe killable.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 fs/coredump.c           |  124 +++++++++++++++++++++++++++++++----------------
 include/linux/binfmts.h |    2 +
 2 files changed, 84 insertions(+), 42 deletions(-)

diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index e8112ae..b0cb384 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -61,6 +61,8 @@ struct coredump_params {
 	struct file *file;
 	unsigned long limit;
 	unsigned long mm_flags;
+	char **argv; /* Maybe NULL. Used by only fs/coredump.c */
+	bool defer_cleanup; /* Used by only fs/coredump.c */
 };
 
 /*
diff --git a/fs/coredump.c b/fs/coredump.c
index 9bdeca1..42d6d0e 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -485,6 +485,38 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
 	return err;
 }
 
+/*
+ * umh_pipe_free - Clean up resources.
+ *
+ * @cprm: Pointer to "struct coredump_params".
+ */
+static void umh_pipe_free(struct coredump_params *cprm)
+{
+	/*
+	 * This xchg() returns true only if "the caller of do_coredump() needs
+	 * to wait for completion of the khelper" or vice versa.
+	 */
+	if (xchg(&cprm->defer_cleanup, false))
+		return;
+	/* Close the pipe's writer side if it is not yet closed. */
+	if (cprm->file)
+		fput(cprm->file);
+	/* Release arguments used by execve() if they are not yet released. */
+	if (cprm->argv)
+		argv_free(cprm->argv);
+	kfree(cprm);
+}
+
+/*
+ * umh_pipe_cleanup - Clean up resources as needed.
+ *
+ * @info: Pointer to "struct subprocess_info".
+ */
+static void umh_pipe_cleanup(struct subprocess_info *info)
+{
+	umh_pipe_free((struct coredump_params *) info->data);
+}
+
 void do_coredump(siginfo_t *siginfo)
 {
 	struct core_state core_state;
@@ -500,24 +532,26 @@ void do_coredump(siginfo_t *siginfo)
 	bool need_nonrelative = false;
 	bool core_dumped = false;
 	static atomic_t core_dump_count = ATOMIC_INIT(0);
-	struct coredump_params cprm = {
-		.siginfo = siginfo,
-		.regs = signal_pt_regs(),
-		.limit = rlimit(RLIMIT_CORE),
-		/*
-		 * We must use the same mm->flags while dumping core to avoid
-		 * inconsistency of bit flags, since this flag is not protected
-		 * by any locks.
-		 */
-		.mm_flags = mm->flags,
-	};
+	struct coredump_params *cprm = kzalloc(sizeof(*cprm), GFP_KERNEL);
 
 	audit_core_dumps(siginfo->si_signo);
 
+	if (!cprm)
+		return;
+	cprm->siginfo = siginfo;
+	cprm->regs = signal_pt_regs();
+	cprm->limit = rlimit(RLIMIT_CORE);
+	/*
+	 * We must use the same mm->flags while dumping core to avoid
+	 * inconsistency of bit flags, since this flag is not protected
+	 * by any locks.
+	 */
+	cprm->mm_flags = mm->flags;
+
 	binfmt = mm->binfmt;
 	if (!binfmt || !binfmt->core_dump)
 		goto fail;
-	if (!__get_dumpable(cprm.mm_flags))
+	if (!__get_dumpable(cprm->mm_flags))
 		goto fail;
 
 	cred = prepare_creds();
@@ -529,7 +563,7 @@ void do_coredump(siginfo_t *siginfo)
 	 * so we dump it as root in mode 2, and only into a controlled
 	 * environment (pipe handler or fully qualified path).
 	 */
-	if (__get_dumpable(cprm.mm_flags) == SUID_DUMP_ROOT) {
+	if (__get_dumpable(cprm->mm_flags) == SUID_DUMP_ROOT) {
 		/* Setuid core dump mode */
 		flag = O_EXCL;		/* Stop rewrite attacks */
 		cred->fsuid = GLOBAL_ROOT_UID;	/* Dump root private */
@@ -542,11 +576,10 @@ void do_coredump(siginfo_t *siginfo)
 
 	old_cred = override_creds(cred);
 
-	ispipe = format_corename(&cn, &cprm);
+	ispipe = format_corename(&cn, cprm);
 
 	if (ispipe) {
 		int dump_count;
-		char **helper_argv;
 		struct subprocess_info *sub_info;
 
 		if (ispipe < 0) {
@@ -555,12 +588,12 @@ void do_coredump(siginfo_t *siginfo)
 			goto fail_unlock;
 		}
 
-		if (cprm.limit == 1) {
+		if (cprm->limit == 1) {
 			/* See umh_pipe_setup() which sets RLIMIT_CORE = 1.
 			 *
 			 * Normally core limits are irrelevant to pipes, since
 			 * we're not writing to the file system, but we use
-			 * cprm.limit of 1 here as a speacial value, this is a
+			 * cprm->limit of 1 here as a speacial value, this is a
 			 * consistent way to catch recursive crashes.
 			 * We can still crash if the core_pattern binary sets
 			 * RLIM_CORE = !1, but it runs as root, and can do
@@ -577,7 +610,7 @@ void do_coredump(siginfo_t *siginfo)
 			printk(KERN_WARNING "Aborting core\n");
 			goto fail_unlock;
 		}
-		cprm.limit = RLIM_INFINITY;
+		cprm->limit = RLIM_INFINITY;
 
 		dump_count = atomic_inc_return(&core_dump_count);
 		if (core_pipe_limit && (core_pipe_limit < dump_count)) {
@@ -587,22 +620,27 @@ void do_coredump(siginfo_t *siginfo)
 			goto fail_dropcount;
 		}
 
-		helper_argv = argv_split(GFP_KERNEL, cn.corename, NULL);
-		if (!helper_argv) {
+		cprm->argv = argv_split(GFP_KERNEL, cn.corename, NULL);
+		if (!cprm->argv) {
 			printk(KERN_WARNING "%s failed to allocate memory\n",
 			       __func__);
 			goto fail_dropcount;
 		}
 
 		retval = -ENOMEM;
-		sub_info = call_usermodehelper_setup(helper_argv[0],
-						helper_argv, NULL, GFP_KERNEL,
-						umh_pipe_setup, NULL, &cprm);
-		if (sub_info)
-			retval = call_usermodehelper_exec(sub_info,
-							  UMH_WAIT_EXEC);
-
-		argv_free(helper_argv);
+		sub_info = call_usermodehelper_setup
+			(cprm->argv[0], cprm->argv, NULL, GFP_KERNEL,
+			 umh_pipe_setup, umh_pipe_cleanup, cprm);
+		if (sub_info) {
+			/*
+			 * Leave the cleanup of cprm to the khelper if I was
+			 * SIGKILLed before the khelper starts cprm->argv[0].
+			 */
+			cprm->defer_cleanup = true;
+			retval = call_usermodehelper_exec
+				(sub_info, UMH_WAIT_EXEC | UMH_KILLABLE);
+		}
+
 		if (retval) {
 			printk(KERN_INFO "Core dump to |%s pipe failed\n",
 			       cn.corename);
@@ -611,7 +649,7 @@ void do_coredump(siginfo_t *siginfo)
 	} else {
 		struct inode *inode;
 
-		if (cprm.limit < binfmt->min_coredump)
+		if (cprm->limit < binfmt->min_coredump)
 			goto fail_unlock;
 
 		if (need_nonrelative && cn.corename[0] != '/') {
@@ -622,16 +660,16 @@ void do_coredump(siginfo_t *siginfo)
 			goto fail_unlock;
 		}
 
-		cprm.file = filp_open(cn.corename,
+		cprm->file = filp_open(cn.corename,
 				 O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
 				 0600);
-		if (IS_ERR(cprm.file))
+		if (IS_ERR(cprm->file))
 			goto fail_unlock;
 
-		inode = file_inode(cprm.file);
+		inode = file_inode(cprm->file);
 		if (inode->i_nlink > 1)
 			goto close_fail;
-		if (d_unhashed(cprm.file->f_path.dentry))
+		if (d_unhashed(cprm->file->f_path.dentry))
 			goto close_fail;
 		/*
 		 * AK: actually i see no reason to not allow this for named
@@ -645,9 +683,9 @@ void do_coredump(siginfo_t *siginfo)
 		 */
 		if (!uid_eq(inode->i_uid, current_fsuid()))
 			goto close_fail;
-		if (!cprm.file->f_op || !cprm.file->f_op->write)
+		if (!cprm->file->f_op || !cprm->file->f_op->write)
 			goto close_fail;
-		if (do_truncate(cprm.file->f_path.dentry, 0, 0, cprm.file))
+		if (do_truncate(cprm->file->f_path.dentry, 0, 0, cprm->file))
 			goto close_fail;
 	}
 
@@ -658,15 +696,17 @@ void do_coredump(siginfo_t *siginfo)
 	if (displaced)
 		put_files_struct(displaced);
 	if (!dump_interrupted()) {
-		file_start_write(cprm.file);
-		core_dumped = binfmt->core_dump(&cprm);
-		file_end_write(cprm.file);
+		file_start_write(cprm->file);
+		core_dumped = binfmt->core_dump(cprm);
+		file_end_write(cprm->file);
 	}
 	if (ispipe && core_pipe_limit)
-		wait_for_dump_helpers(cprm.file);
+		wait_for_dump_helpers(cprm->file);
 close_fail:
-	if (cprm.file)
-		filp_close(cprm.file, NULL);
+	if (cprm->file) {
+		filp_close(cprm->file, NULL);
+		cprm->file = NULL;
+	}
 fail_dropcount:
 	if (ispipe)
 		atomic_dec(&core_dump_count);
@@ -677,7 +717,7 @@ fail_unlock:
 fail_creds:
 	put_cred(cred);
 fail:
-	return;
+	umh_pipe_free(cprm);
 }
 
 /*
-- 
1.7.1


end of thread, other threads:[~2013-10-02 13:44 UTC | newest]

Thread overview: 17+ messages
     [not found] <201309132318.HEC05218.OtVOFOFMLJHFQS@I-love.SAKURA.ne.jp>
2013-09-14 16:05 ` [PATCH] kthread: Make kthread_create() killable Oleg Nesterov
2013-09-14 23:46   ` Tetsuo Handa
2013-09-15 16:39     ` Oleg Nesterov
2013-09-16  0:53       ` [PATCH v2] " Tetsuo Handa
2013-09-23 23:07         ` Andrew Morton
2013-09-24 15:17           ` [PATCH v3] " Tetsuo Handa
2013-09-24 16:38             ` Oleg Nesterov
2013-09-25 21:05             ` David Rientjes
2013-09-26  2:59               ` Tetsuo Handa
2013-09-26 18:53                 ` David Rientjes
2013-09-28  7:18                   ` Tetsuo Handa
2013-09-30 21:57                     ` David Rientjes
2013-10-01 13:20                       ` Tetsuo Handa
2013-10-02  7:06                         ` David Rientjes
2013-10-02 13:44                           ` Tetsuo Handa
     [not found]     ` <201309152038.DDJ59271.OJMOHLFtSFOQVF@I-love.SAKURA.ne.jp>
2013-09-15 16:55       ` [PATCH] " Oleg Nesterov
2013-09-16  7:25         ` [PATCH] coredump: Make startup of coredump to pipe killable Tetsuo Handa
