public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mel@csn.ul.ie>
To: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Greg Kroah-Hartman <gregkh@suse.de>,
	Nick Piggin <npiggin@suse.de>,
	Peter Ziljstra <a.p.ziljstra@chello.nl>,
	Christoph Lameter <cl@linux-foundation.org>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	San Mehat <san@android.com>, Arve Hj?nnev?g <arve@android.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct
Date: Tue, 12 May 2009 10:56:58 +0100	[thread overview]
Message-ID: <20090512095657.GD25923@csn.ul.ie> (raw)
In-Reply-To: <alpine.DEB.2.00.0905101501270.18804@chino.kir.corp.google.com>

On Sun, May 10, 2009 at 03:07:18PM -0700, David Rientjes wrote:
> The per-task oom_adj value is a characteristic of its mm more than the
> task itself since it's not possible to oom kill any thread that shares
> the mm.  If a task were to be killed while attached to an mm that could
> not be freed because another thread were set to OOM_DISABLE, it would
> have needlessly been terminated since there is no potential for future
> memory freeing.
> 

Good point. It also sets up weird races for processes that create/delete
threads a lot where they can have different oom_adj values. Now you only
need one thread to configure for the mm which is probably what users wanted
in the first place.

> This patch moves oomkilladj (now more appropriately named oom_adj) from
> struct task_struct to struct mm_struct.  This requires task_lock() on a
> task to check its oom_adj value to protect against exec, but it's already
> necessary to take the lock when dereferencing the mm to find the total VM
> size for the badness heuristic.
> 
> This fixes a livelock if the oom killer chooses a task and another thread
> sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
> because oom_kill_task() repeatedly returns 1 and refuses to kill the
> chosen task while select_bad_process() will repeatedly choose the same
> task during the next retry.
> 
> Taking task_lock() in select_bad_process() to check for OOM_DISABLE and
> in oom_kill_task() to check for threads sharing the same memory will be
> removed in the next patch in this series where it will no longer be
> necessary.
> 
> Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
> these threads are immune from oom killing already.  They simply report an
> oom_adj value of OOM_DISABLE.
> 

Makes sense but I see two things at least that may need to be fixed up.
Details below.

> Cc: Nick Piggin <npiggin@suse.de>
> Cc: San Mehat <san@android.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/filesystems/proc.txt        |   15 +++++++++----
>  drivers/staging/android/lowmemorykiller.c |    2 +-
>  fs/proc/base.c                            |   19 ++++++++++++++--
>  include/linux/mm_types.h                  |    2 +
>  include/linux/sched.h                     |    1 -
>  mm/oom_kill.c                             |   32 ++++++++++++++++++----------
>  6 files changed, 49 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
>  3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
>  ------------------------------------------------------
>  
> -This file can be used to adjust the score used to select which processes
> -should be killed in an  out-of-memory  situation.  Giving it a high score will
> -increase the likelihood of this process being killed by the oom-killer.  Valid
> -values are in the range -16 to +15, plus the special value -17, which disables
> -oom-killing altogether for this process.
> +This file can be used to adjust the score used to select which processes should
> +be killed in an out-of-memory situation.  The oom_adj value is a characteristic
> +of the task's mm, so all threads that share an mm with pid will have the same
> +oom_adj value.  A high value will increase the liklihood of this process being

%s/liklihood/likelihood//

> +killed by the oom-killer.  Valid values are in the range -16 to +15 as
> +explained below and a special value of -17, which disables oom-killing
> +altogether for threads sharing pid's mm.
>  
>  The process to be killed in an out-of-memory situation is selected among all others
>  based on its badness score. This value equals the original memory size of the process
> @@ -1021,6 +1023,9 @@ the parent's score if they do not share the same memory. Thus forking servers
>  are the prime candidates to be killed. Having only one 'hungry' child will make
>  parent less preferable than the child.
>  
> +/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
> +oom-killing already.
> +
>  /proc/<pid>/oom_score shows process' current badness score.
>  
>  The following heuristics are then applied:
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -97,7 +97,7 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>  			task_unlock(p);
>  			continue;
>  		}
> -		oom_adj = p->oomkilladj;
> +		oom_adj = p->mm->oom_adj;
>  		if (oom_adj < min_adj) {
>  			task_unlock(p);
>  			continue;

Ouch, it's a pity that a staging driver and the core VM have a dependency
like this. It makes me wonder again why there is a low memory killer that
is separate from the OOM killer. I should dig through the archives and
see what the reasoning was.

> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1003,7 +1003,12 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
>  
>  	if (!task)
>  		return -ESRCH;
> -	oom_adjust = task->oomkilladj;
> +	task_lock(task);
> +	if (task->mm)
> +		oom_adjust = task->mm->oom_adj;
> +	else
> +		oom_adjust = OOM_DISABLE;
> +	task_unlock(task);
>  	put_task_struct(task);
>  
>  	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
> @@ -1032,11 +1037,19 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  	task = get_proc_task(file->f_path.dentry->d_inode);
>  	if (!task)
>  		return -ESRCH;
> -	if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
> +	task_lock(task);
> +	if (!task->mm) {
> +		task_unlock(task);
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> +		task_unlock(task);
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> -	task->oomkilladj = oom_adjust;
> +	task->mm->oom_adj = oom_adjust;
> +	task_unlock(task);
>  	put_task_struct(task);
>  	if (end - buffer == 0)
>  		return -EIO;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -232,6 +232,8 @@ struct mm_struct {
>  
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>  
> +	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> +
>  	cpumask_t cpu_vm_mask;
>  
>  	/* Architecture-specific MM context */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1149,7 +1149,6 @@ struct task_struct {
>  	 * a short time
>  	 */
>  	unsigned char fpu_counter;
> -	s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>  	unsigned int btrace_seq;
>  #endif
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -58,6 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	unsigned long points, cpu_time, run_time;
>  	struct mm_struct *mm;
>  	struct task_struct *child;
> +	int oom_adj;
>  
>  	task_lock(p);
>  	mm = p->mm;
> @@ -65,6 +66,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		task_unlock(p);
>  		return 0;
>  	}
> +	oom_adj = mm->oom_adj;
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
> @@ -148,15 +150,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		points /= 8;
>  
>  	/*
> -	 * Adjust the score by oomkilladj.
> +	 * Adjust the score by oom_adj.
>  	 */
> -	if (p->oomkilladj) {
> -		if (p->oomkilladj > 0) {
> +	if (oom_adj) {
> +		if (oom_adj > 0) {
>  			if (!points)
>  				points = 1;
> -			points <<= p->oomkilladj;
> +			points <<= oom_adj;
>  		} else
> -			points >>= -(p->oomkilladj);
> +			points >>= -(oom_adj);
>  	}
>  
>  #ifdef DEBUG
> @@ -251,8 +253,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			*ppoints = ULONG_MAX;
>  		}
>  
> -		if (p->oomkilladj == OOM_DISABLE)
> +		task_lock(p);
> +		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
>  			continue;

Oops, it would appear we didn't unlock the task here. I bet we deadlock
if oom_adj == OOM_DISABLE for a normal process and it then exits.

> +		task_unlock(p);
>  
>  		points = badness(p, uptime.tv_sec);
>  		if (points > *ppoints || !chosen) {
> @@ -304,8 +308,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
>  		}
>  		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
>  		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> -		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
> -		       p->comm);
> +		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
>  		task_unlock(p);
>  	} while_each_thread(g, p);
>  }
> @@ -367,8 +370,12 @@ static int oom_kill_task(struct task_struct *p)
>  	 * Don't kill the process if any threads are set to OOM_DISABLE
>  	 */
>  	do_each_thread(g, q) {
> -		if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> +		task_lock(q);
> +		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
> +			task_unlock(q);
>  			return 1;
> +		}
> +		task_unlock(q);
>  	} while_each_thread(g, q);
>  
>  	__oom_kill_task(p, 1);
> @@ -393,10 +400,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	struct task_struct *c;
>  
>  	if (printk_ratelimit()) {
> -		printk(KERN_WARNING "%s invoked oom-killer: "
> -			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
> -			current->comm, gfp_mask, order, current->oomkilladj);
>  		task_lock(current);
> +		printk(KERN_WARNING "%s invoked oom-killer: "
> +			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
> +			current->comm, gfp_mask, order,
> +			current->mm ? current->mm->oom_adj : OOM_DISABLE);
>  		cpuset_print_task_mems_allowed(current);
>  		task_unlock(current);
>  		dump_stack();
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

  parent reply	other threads:[~2009-05-12  9:57 UTC|newest]

Thread overview: 95+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
2009-05-10 22:07 ` [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
2009-05-12  9:23   ` Mel Gorman
2009-05-13  0:27     ` Arve Hjønnevåg
2009-05-13  9:42       ` Mel Gorman
2009-05-14 23:25         ` Arve Hjønnevåg
2009-05-15  9:18           ` Mel Gorman
2009-05-10 22:07 ` [patch 03/11 -mmotm] oom: cleanup android low memory killer David Rientjes
2009-05-10 22:07 ` [patch 04/11 -mmotm] oom: fix possible android low memory killer NULL pointer David Rientjes
2009-05-10 22:07 ` [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks " David Rientjes
2009-05-11 21:11   ` Andrew Morton
2009-05-11 21:28     ` David Rientjes
2009-05-11 21:41       ` Greg KH
2009-05-11 22:05         ` David Rientjes
2009-05-12  9:38   ` Mel Gorman
2009-05-10 22:07 ` [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct David Rientjes
2009-05-11  0:17   ` KOSAKI Motohiro
2009-05-11  0:26     ` David Rientjes
2009-05-11  1:47       ` KOSAKI Motohiro
2009-05-11  8:43         ` David Rientjes
2009-05-11 21:19           ` Andrew Morton
2009-05-12  9:56   ` Mel Gorman [this message]
2009-05-10 22:07 ` [patch 07/11 -mmotm] oom: prevent possible OOM_DISABLE livelock David Rientjes
2009-05-11 21:22   ` Andrew Morton
2009-05-10 22:07 ` [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
2009-05-10 23:59   ` KOSAKI Motohiro
2009-05-11  0:24     ` David Rientjes
2009-05-11  1:45       ` KOSAKI Motohiro
2009-05-11  7:40         ` Minchan Kim
2009-05-11  8:49           ` David Rientjes
2009-05-11 11:23             ` Minchan Kim
2009-05-11  8:45         ` David Rientjes
2009-05-11 16:03           ` Dave Hansen
2009-05-11 19:09             ` David Rientjes
2009-05-11 19:45               ` Dave Hansen
2009-05-11 20:21                 ` David Rientjes
2009-05-11 21:29   ` Andrew Morton
2009-05-11 21:45     ` David Rientjes
2009-05-11 22:11       ` Andrew Morton
2009-05-11 22:31         ` David Rientjes
2009-05-11 22:46           ` Andrew Morton
2009-05-11 23:00             ` David Rientjes
2009-05-11 23:14               ` Andrew Morton
2009-05-11 23:37                 ` David Rientjes
2009-05-12  5:39         ` Peter Zijlstra
2009-05-12 11:36           ` KOSAKI Motohiro
2009-05-12 10:05         ` Mel Gorman
2009-05-10 22:07 ` [patch 09/11 -mmotm] oom: return vm size of oom killed task David Rientjes
2009-05-11 21:36   ` Andrew Morton
2009-05-10 22:07 ` [patch 10/11 -mmotm] oom: avoid oom kill if no interruptible tasks David Rientjes
2009-05-11 21:39   ` Andrew Morton
2009-05-11 23:08     ` David Rientjes
2009-05-10 22:07 ` [patch 11/11 -mmotm] oom: fail allocations if oom killer can't free memory David Rientjes
2009-05-12 21:14   ` Misleading OOM messages Christoph Lameter
2009-05-14  9:29     ` Pavel Machek
2009-05-14 19:46       ` Christoph Lameter
2009-05-14 20:38         ` Dave Hansen
2009-05-14 20:49           ` Christoph Lameter
2009-05-14 20:49           ` David Rientjes
2009-05-14 21:05             ` Dave Hansen
2009-05-14 21:12               ` David Rientjes
2009-05-14 21:30               ` Christoph Lameter
2009-05-14 21:34                 ` Pavel Machek
2009-05-14 21:41                   ` Dave Hansen
2009-05-15 13:05                     ` Pavel Machek
2009-05-15 17:59                       ` Christoph Lameter
2009-05-15 18:22                         ` Dave Hansen
2009-05-15 19:29                           ` Christoph Lameter
2009-05-15 20:02                         ` Pavel Machek
2009-05-15 21:15                           ` Christoph Lameter
2009-05-19 20:39                             ` Pavel Machek
2009-05-22 13:53                               ` Christoph Lameter
2009-05-22 14:17                                 ` Warn when we run out of swap space (was Re: Misleading OOM messages) Christoph Lameter
2009-05-22 14:56                                   ` Pavel Machek
2009-05-22 19:01                               ` Misleading OOM messages Christoph Lameter
2009-05-22 19:40                                 ` Randy Dunlap
2009-05-22 19:44                                   ` Christoph Lameter
2009-05-22 21:45                                     ` Alan Cox
2009-05-22 21:43                                 ` Alan Cox
2009-05-15 17:57                   ` Christoph Lameter
2009-05-15 18:15                     ` Dave Hansen
2009-05-15 18:19                       ` Balbir Singh
2009-05-15 19:26                       ` Christoph Lameter
2009-05-15 20:31                         ` Balbir Singh
2009-05-18 14:34                           ` Christoph Lameter
2009-05-18 15:45                             ` Balbir Singh
2009-05-14 21:37                 ` Dave Hansen
2009-05-14 22:00                   ` David Rientjes
2009-05-15 17:58                     ` Christoph Lameter
2009-05-15 18:23                       ` Dave Hansen
2009-05-15 18:57                         ` Balbir Singh
2009-05-15 19:37                         ` David Rientjes
2009-05-14 20:56         ` Pavel Machek
2009-05-12  9:09 ` [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed Mel Gorman
2009-05-13  0:43   ` Arve Hjønnevåg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090512095657.GD25923@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=a.p.ziljstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=arve@android.com \
    --cc=cl@linux-foundation.org \
    --cc=dave@linux.vnet.ibm.com \
    --cc=gregkh@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=rientjes@google.com \
    --cc=san@android.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox