All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mel Gorman <mel@csn.ul.ie>
To: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Greg Kroah-Hartman <gregkh@suse.de>,
	Nick Piggin <npiggin@suse.de>,
	Peter Ziljstra <a.p.ziljstra@chello.nl>,
	Christoph Lameter <cl@linux-foundation.org>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	San Mehat <san@android.com>, Arve Hj?nnev?g <arve@android.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct
Date: Tue, 12 May 2009 10:56:58 +0100	[thread overview]
Message-ID: <20090512095657.GD25923@csn.ul.ie> (raw)
In-Reply-To: <alpine.DEB.2.00.0905101501270.18804@chino.kir.corp.google.com>

On Sun, May 10, 2009 at 03:07:18PM -0700, David Rientjes wrote:
> The per-task oom_adj value is a characteristic of its mm more than the
> task itself since it's not possible to oom kill any thread that shares
> the mm.  If a task were to be killed while attached to an mm that could
> not be freed because another thread were set to OOM_DISABLE, it would
> have needlessly been terminated since there is no potential for future
> memory freeing.
> 

Good point. It also sets up weird races for processes that create/delete
threads a lot where they can have different oom_adj values. Now you only
need one thread to configure for the mm which is probably what users wanted
in the first place.

> This patch moves oomkilladj (now more appropriately named oom_adj) from
> struct task_struct to struct mm_struct.  This requires task_lock() on a
> task to check its oom_adj value to protect against exec, but it's already
> necessary to take the lock when dereferencing the mm to find the total VM
> size for the badness heuristic.
> 
> This fixes a livelock if the oom killer chooses a task and another thread
> sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
> because oom_kill_task() repeatedly returns 1 and refuses to kill the
> chosen task while select_bad_process() will repeatedly choose the same
> task during the next retry.
> 
> Taking task_lock() in select_bad_process() to check for OOM_DISABLE and
> in oom_kill_task() to check for threads sharing the same memory will be
> removed in the next patch in this series where it will no longer be
> necessary.
> 
> Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
> these threads are immune from oom killing already.  They simply report an
> oom_adj value of OOM_DISABLE.
> 

Makes sense but I see two things at least that may need to be fixed up.
Details below.

> Cc: Nick Piggin <npiggin@suse.de>
> Cc: San Mehat <san@android.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  Documentation/filesystems/proc.txt        |   15 +++++++++----
>  drivers/staging/android/lowmemorykiller.c |    2 +-
>  fs/proc/base.c                            |   19 ++++++++++++++--
>  include/linux/mm_types.h                  |    2 +
>  include/linux/sched.h                     |    1 -
>  mm/oom_kill.c                             |   32 ++++++++++++++++++----------
>  6 files changed, 49 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS
>  3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score
>  ------------------------------------------------------
>  
> -This file can be used to adjust the score used to select which processes
> -should be killed in an  out-of-memory  situation.  Giving it a high score will
> -increase the likelihood of this process being killed by the oom-killer.  Valid
> -values are in the range -16 to +15, plus the special value -17, which disables
> -oom-killing altogether for this process.
> +This file can be used to adjust the score used to select which processes should
> +be killed in an out-of-memory situation.  The oom_adj value is a characteristic
> +of the task's mm, so all threads that share an mm with pid will have the same
> +oom_adj value.  A high value will increase the liklihood of this process being

%s/liklihood/likelihood//

> +killed by the oom-killer.  Valid values are in the range -16 to +15 as
> +explained below and a special value of -17, which disables oom-killing
> +altogether for threads sharing pid's mm.
>  
>  The process to be killed in an out-of-memory situation is selected among all others
>  based on its badness score. This value equals the original memory size of the process
> @@ -1021,6 +1023,9 @@ the parent's score if they do not share the same memory. Thus forking servers
>  are the prime candidates to be killed. Having only one 'hungry' child will make
>  parent less preferable than the child.
>  
> +/proc/<pid>/oom_adj cannot be changed for kthreads since they are immune from
> +oom-killing already.
> +
>  /proc/<pid>/oom_score shows process' current badness score.
>  
>  The following heuristics are then applied:
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -97,7 +97,7 @@ static int lowmem_shrink(int nr_to_scan, gfp_t gfp_mask)
>  			task_unlock(p);
>  			continue;
>  		}
> -		oom_adj = p->oomkilladj;
> +		oom_adj = p->mm->oom_adj;
>  		if (oom_adj < min_adj) {
>  			task_unlock(p);
>  			continue;

Ouch, it's a pity that a staging driver and the core VM have a dependency
like this. It makes me wonder again why there is a low memory killer that
is separate from the OOM killer. I should dig through the archives and
see what the reasoning was.

> diff --git a/fs/proc/base.c b/fs/proc/base.c
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1003,7 +1003,12 @@ static ssize_t oom_adjust_read(struct file *file, char __user *buf,
>  
>  	if (!task)
>  		return -ESRCH;
> -	oom_adjust = task->oomkilladj;
> +	task_lock(task);
> +	if (task->mm)
> +		oom_adjust = task->mm->oom_adj;
> +	else
> +		oom_adjust = OOM_DISABLE;
> +	task_unlock(task);
>  	put_task_struct(task);
>  
>  	len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
> @@ -1032,11 +1037,19 @@ static ssize_t oom_adjust_write(struct file *file, const char __user *buf,
>  	task = get_proc_task(file->f_path.dentry->d_inode);
>  	if (!task)
>  		return -ESRCH;
> -	if (oom_adjust < task->oomkilladj && !capable(CAP_SYS_RESOURCE)) {
> +	task_lock(task);
> +	if (!task->mm) {
> +		task_unlock(task);
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +	if (oom_adjust < task->mm->oom_adj && !capable(CAP_SYS_RESOURCE)) {
> +		task_unlock(task);
>  		put_task_struct(task);
>  		return -EACCES;
>  	}
> -	task->oomkilladj = oom_adjust;
> +	task->mm->oom_adj = oom_adjust;
> +	task_unlock(task);
>  	put_task_struct(task);
>  	if (end - buffer == 0)
>  		return -EIO;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -232,6 +232,8 @@ struct mm_struct {
>  
>  	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>  
> +	s8 oom_adj;	/* OOM kill score adjustment (bit shift) */
> +
>  	cpumask_t cpu_vm_mask;
>  
>  	/* Architecture-specific MM context */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1149,7 +1149,6 @@ struct task_struct {
>  	 * a short time
>  	 */
>  	unsigned char fpu_counter;
> -	s8 oomkilladj; /* OOM kill score adjustment (bit shift). */
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>  	unsigned int btrace_seq;
>  #endif
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -58,6 +58,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  	unsigned long points, cpu_time, run_time;
>  	struct mm_struct *mm;
>  	struct task_struct *child;
> +	int oom_adj;
>  
>  	task_lock(p);
>  	mm = p->mm;
> @@ -65,6 +66,7 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		task_unlock(p);
>  		return 0;
>  	}
> +	oom_adj = mm->oom_adj;
>  
>  	/*
>  	 * The memory size of the process is the basis for the badness.
> @@ -148,15 +150,15 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
>  		points /= 8;
>  
>  	/*
> -	 * Adjust the score by oomkilladj.
> +	 * Adjust the score by oom_adj.
>  	 */
> -	if (p->oomkilladj) {
> -		if (p->oomkilladj > 0) {
> +	if (oom_adj) {
> +		if (oom_adj > 0) {
>  			if (!points)
>  				points = 1;
> -			points <<= p->oomkilladj;
> +			points <<= oom_adj;
>  		} else
> -			points >>= -(p->oomkilladj);
> +			points >>= -(oom_adj);
>  	}
>  
>  #ifdef DEBUG
> @@ -251,8 +253,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
>  			*ppoints = ULONG_MAX;
>  		}
>  
> -		if (p->oomkilladj == OOM_DISABLE)
> +		task_lock(p);
> +		if (p->mm && p->mm->oom_adj == OOM_DISABLE)
>  			continue;

Oops, it would appear we didn't unlock the task here. I bet we deadlock
if oom_adj == OOM_DISABLE for a normal process and it then exits.

> +		task_unlock(p);
>  
>  		points = badness(p, uptime.tv_sec);
>  		if (points > *ppoints || !chosen) {
> @@ -304,8 +308,7 @@ static void dump_tasks(const struct mem_cgroup *mem)
>  		}
>  		printk(KERN_INFO "[%5d] %5d %5d %8lu %8lu %3d     %3d %s\n",
>  		       p->pid, __task_cred(p)->uid, p->tgid, mm->total_vm,
> -		       get_mm_rss(mm), (int)task_cpu(p), p->oomkilladj,
> -		       p->comm);
> +		       get_mm_rss(mm), (int)task_cpu(p), mm->oom_adj, p->comm);
>  		task_unlock(p);
>  	} while_each_thread(g, p);
>  }
> @@ -367,8 +370,12 @@ static int oom_kill_task(struct task_struct *p)
>  	 * Don't kill the process if any threads are set to OOM_DISABLE
>  	 */
>  	do_each_thread(g, q) {
> -		if (q->mm == mm && q->oomkilladj == OOM_DISABLE)
> +		task_lock(q);
> +		if (q->mm == mm && q->mm && q->mm->oom_adj == OOM_DISABLE) {
> +			task_unlock(q);
>  			return 1;
> +		}
> +		task_unlock(q);
>  	} while_each_thread(g, q);
>  
>  	__oom_kill_task(p, 1);
> @@ -393,10 +400,11 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	struct task_struct *c;
>  
>  	if (printk_ratelimit()) {
> -		printk(KERN_WARNING "%s invoked oom-killer: "
> -			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n",
> -			current->comm, gfp_mask, order, current->oomkilladj);
>  		task_lock(current);
> +		printk(KERN_WARNING "%s invoked oom-killer: "
> +			"gfp_mask=0x%x, order=%d, oom_adj=%d\n",
> +			current->comm, gfp_mask, order,
> +			current->mm ? current->mm->oom_adj : OOM_DISABLE);
>  		cpuset_print_task_mems_allowed(current);
>  		task_unlock(current);
>  		dump_stack();
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

  parent reply	other threads:[~2009-05-12  9:57 UTC|newest]

Thread overview: 95+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-10 22:07 [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed David Rientjes
2009-05-10 22:07 ` [patch 02/11 -mmotm] lowmemorykiller: Don't count free space unless it meets the specified limit by itself David Rientjes
2009-05-12  9:23   ` Mel Gorman
2009-05-13  0:27     ` Arve Hjønnevåg
2009-05-13  9:42       ` Mel Gorman
2009-05-14 23:25         ` Arve Hjønnevåg
2009-05-15  9:18           ` Mel Gorman
2009-05-10 22:07 ` [patch 03/11 -mmotm] oom: cleanup android low memory killer David Rientjes
2009-05-10 22:07 ` [patch 04/11 -mmotm] oom: fix possible android low memory killer NULL pointer David Rientjes
2009-05-10 22:07 ` [patch 05/11 -mmotm] oom: fix possible oom_dump_tasks " David Rientjes
2009-05-11 21:11   ` Andrew Morton
2009-05-11 21:28     ` David Rientjes
2009-05-11 21:41       ` Greg KH
2009-05-11 22:05         ` David Rientjes
2009-05-12  9:38   ` Mel Gorman
2009-05-10 22:07 ` [patch 06/11 -mmotm] oom: move oom_adj value from task_struct to mm_struct David Rientjes
2009-05-11  0:17   ` KOSAKI Motohiro
2009-05-11  0:26     ` David Rientjes
2009-05-11  1:47       ` KOSAKI Motohiro
2009-05-11  8:43         ` David Rientjes
2009-05-11 21:19           ` Andrew Morton
2009-05-12  9:56   ` Mel Gorman [this message]
2009-05-10 22:07 ` [patch 07/11 -mmotm] oom: prevent possible OOM_DISABLE livelock David Rientjes
2009-05-11 21:22   ` Andrew Morton
2009-05-10 22:07 ` [patch 08/11 -mmotm] oom: invoke oom killer for __GFP_NOFAIL David Rientjes
2009-05-10 23:59   ` KOSAKI Motohiro
2009-05-11  0:24     ` David Rientjes
2009-05-11  1:45       ` KOSAKI Motohiro
2009-05-11  7:40         ` Minchan Kim
2009-05-11  8:49           ` David Rientjes
2009-05-11 11:23             ` Minchan Kim
2009-05-11  8:45         ` David Rientjes
2009-05-11 16:03           ` Dave Hansen
2009-05-11 19:09             ` David Rientjes
2009-05-11 19:45               ` Dave Hansen
2009-05-11 20:21                 ` David Rientjes
2009-05-11 21:29   ` Andrew Morton
2009-05-11 21:45     ` David Rientjes
2009-05-11 22:11       ` Andrew Morton
2009-05-11 22:31         ` David Rientjes
2009-05-11 22:46           ` Andrew Morton
2009-05-11 23:00             ` David Rientjes
2009-05-11 23:14               ` Andrew Morton
2009-05-11 23:37                 ` David Rientjes
2009-05-12  5:39         ` Peter Zijlstra
2009-05-12 11:36           ` KOSAKI Motohiro
2009-05-12 10:05         ` Mel Gorman
2009-05-10 22:07 ` [patch 09/11 -mmotm] oom: return vm size of oom killed task David Rientjes
2009-05-11 21:36   ` Andrew Morton
2009-05-10 22:07 ` [patch 10/11 -mmotm] oom: avoid oom kill if no interruptible tasks David Rientjes
2009-05-11 21:39   ` Andrew Morton
2009-05-11 23:08     ` David Rientjes
2009-05-10 22:07 ` [patch 11/11 -mmotm] oom: fail allocations if oom killer can't free memory David Rientjes
2009-05-12 21:14   ` Misleading OOM messages Christoph Lameter
2009-05-14  9:29     ` Pavel Machek
2009-05-14 19:46       ` Christoph Lameter
2009-05-14 20:38         ` Dave Hansen
2009-05-14 20:49           ` Christoph Lameter
2009-05-14 20:49           ` David Rientjes
2009-05-14 21:05             ` Dave Hansen
2009-05-14 21:12               ` David Rientjes
2009-05-14 21:30               ` Christoph Lameter
2009-05-14 21:34                 ` Pavel Machek
2009-05-14 21:41                   ` Dave Hansen
2009-05-15 13:05                     ` Pavel Machek
2009-05-15 17:59                       ` Christoph Lameter
2009-05-15 18:22                         ` Dave Hansen
2009-05-15 19:29                           ` Christoph Lameter
2009-05-15 20:02                         ` Pavel Machek
2009-05-15 21:15                           ` Christoph Lameter
2009-05-19 20:39                             ` Pavel Machek
2009-05-22 13:53                               ` Christoph Lameter
2009-05-22 14:17                                 ` Warn when we run out of swap space (was Re: Misleading OOM messages) Christoph Lameter
2009-05-22 14:56                                   ` Pavel Machek
2009-05-22 19:01                               ` Misleading OOM messages Christoph Lameter
2009-05-22 19:40                                 ` Randy Dunlap
2009-05-22 19:44                                   ` Christoph Lameter
2009-05-22 21:45                                     ` Alan Cox
2009-05-22 21:43                                 ` Alan Cox
2009-05-15 17:57                   ` Christoph Lameter
2009-05-15 18:15                     ` Dave Hansen
2009-05-15 18:19                       ` Balbir Singh
2009-05-15 19:26                       ` Christoph Lameter
2009-05-15 20:31                         ` Balbir Singh
2009-05-18 14:34                           ` Christoph Lameter
2009-05-18 15:45                             ` Balbir Singh
2009-05-14 21:37                 ` Dave Hansen
2009-05-14 22:00                   ` David Rientjes
2009-05-15 17:58                     ` Christoph Lameter
2009-05-15 18:23                       ` Dave Hansen
2009-05-15 18:57                         ` Balbir Singh
2009-05-15 19:37                         ` David Rientjes
2009-05-14 20:56         ` Pavel Machek
2009-05-12  9:09 ` [patch 01/11 -mmotm] lowmemorykiller: Only iterate over process list when needed Mel Gorman
2009-05-13  0:43   ` Arve Hjønnevåg

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090512095657.GD25923@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=a.p.ziljstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=arve@android.com \
    --cc=cl@linux-foundation.org \
    --cc=dave@linux.vnet.ibm.com \
    --cc=gregkh@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=rientjes@google.com \
    --cc=san@android.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.