From: Andrew Morton <akpm@linux-foundation.org>
To: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	Dave Jones <davej@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [patch] mm, oom: normalize oom scores to oom_score_adj scale only for userspace
Date: Thu, 17 May 2012 14:50:22 -0700
Message-ID: <20120517145022.a99f41e8.akpm@linux-foundation.org>
In-Reply-To: <alpine.DEB.2.00.1205171432250.6951@chino.kir.corp.google.com>

On Thu, 17 May 2012 14:33:27 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> The oom_score_adj scale ranges from -1000 to 1000 and represents the
> proportion of memory available to the process at allocation time.  This
> means an oom_score_adj value of 300, for example, will bias a process as
> though it was using an extra 30.0% of available memory and a value of
> -350 will discount 35.0% of available memory from its usage.
> 
> The oom killer badness heuristic also uses this scale to report the oom
> score for each eligible process in determining the "best" process to
> kill.  Thus, it can only differentiate each process's memory usage by
> 0.1% of system RAM.
> 
> On large systems, this can end up being a large amount of memory: 256MB
> on 256GB systems, for example.
> 
> This can be fixed by having the badness heuristic to use the actual
> memory usage in scoring threads and then normalizing it to the
> oom_score_adj scale for userspace.  This results in better comparison
> between eligible threads for kill and no change from the userspace
> perspective.
> 
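
To make the granularity point concrete, here is a rough userspace sketch
of the old per-mille scoring.  It assumes 4KB pages, and
score_permille() is a made-up name for illustration, not a kernel
function.

```c
/*
 * Hypothetical helper: score a task the way the old code did, as
 * per-mille of totalpages.  Everything below 0.1% of totalpages is
 * lost to integer division.
 */
static unsigned long score_permille(unsigned long usage_pages,
				    unsigned long totalpages)
{
	/* Widen before multiplying so the product cannot wrap on 32-bit. */
	return (unsigned long)((unsigned long long)usage_pages * 1000 /
			       totalpages);
}
```

With totalpages = 67108864 (256GB of 4KB pages), tasks using 10000000
and 10050000 pages both score 149, although they differ by roughly
195MB.  Comparing the raw page counts keeps them apart.
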
> ...
>
> @@ -198,45 +198,33 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  	}
>  
>  	/*
> -	 * The memory controller may have a limit of 0 bytes, so avoid a divide
> -	 * by zero, if necessary.
> -	 */
> -	if (!totalpages)
> -		totalpages = 1;
> -
> -	/*
>  	 * The baseline for the badness score is the proportion of RAM that each
>  	 * task's rss, pagetable and swap space use.
>  	 */
> -	points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> -	points += get_mm_counter(p->mm, MM_SWAPENTS);
> -
> -	points *= 1000;
> -	points /= totalpages;
> +	points = get_mm_rss(p->mm) + p->mm->nr_ptes +
> +		 get_mm_counter(p->mm, MM_SWAPENTS);
>  	task_unlock(p);
>  
>  	/*
>  	 * Root processes get 3% bonus, just like the __vm_enough_memory()
>  	 * implementation used by LSMs.
>  	 */
> -	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> -		points -= 30;
> +	if (has_capability_noaudit(p, CAP_SYS_ADMIN) && totalpages)

There doesn't seem to be much point in testing totalpages here - it's a
micro-optimisation which adds a branch on a slow path.

> +		points -= 30 * totalpages / 1000;
>  
>  	/*
>  	 * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
>  	 * either completely disable oom killing or always prefer a certain
>  	 * task.
>  	 */
> -	points += p->signal->oom_score_adj;
> +	points += p->signal->oom_score_adj * totalpages / 1000;

And if we *do* want to add that micro-optimisation, we may as well
extend it to cover this expression also:

	if (totalpages) {	/* reason goes here */
		if (has_capability_noaudit(...))
			points -= 30 * totalpages / 1000;
		points += p->signal->oom_score_adj * totalpages / 1000;
	}
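
One thing to watch in either form is the sign of oom_score_adj.  A
minimal sketch, assuming the adjustment is a signed int and totalpages
is unsigned long; adj_to_pages() is a hypothetical helper, not
something in the patch:

```c
/*
 * Hypothetical helper: convert an adjustment in [-1000, 1000] to a
 * signed number of pages.  The casts matter: in plain
 * "adj * totalpages" the int is converted to unsigned long first, so
 * a negative adj would turn into a very large positive value.
 */
static long adj_to_pages(int adj, unsigned long totalpages)
{
	return (long)adj * (long)totalpages / 1000;
}
```

adj_to_pages(300, 1000000) gives 300000 pages and
adj_to_pages(-350, 1000000) gives -350000, matching the 30.0% and
35.0% examples in the changelog.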

>  	/*
>  	 * Never return 0 for an eligible task that may be killed since it's
>  	 * possible that no single user task uses more than 0.1% of memory and
>  	 * no single admin tasks uses more than 3.0%.
>  	 */
> -	if (points <= 0)
> -		return 1;
> -	return (points < 1000) ? points : 1000;
> +	return points <= 0 ? 1 : points;

`points' is unsigned - testing it for negative looks odd.
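
For illustration, a small sketch of why the test looks odd, assuming
`points' really is unsigned long.  Both function names are invented for
this example:

```c
#include <limits.h>

/*
 * Hypothetical: apply a discount in unsigned arithmetic.  If bonus is
 * larger than points the subtraction wraps to a huge value, and the
 * "<= 0" test can only ever catch points == 0.
 */
static unsigned long discount_unsigned(unsigned long points,
				       unsigned long bonus)
{
	points -= bonus;
	return points <= 0 ? 1 : points;
}

/*
 * Hypothetical: the same discount in signed arithmetic, where a
 * negative intermediate result is visible and clamps to 1.
 */
static unsigned long discount_signed(long points, long bonus)
{
	points -= bonus;
	return points > 0 ? (unsigned long)points : 1;
}
```

discount_unsigned(10, 30) returns ULONG_MAX - 19, whereas
discount_signed(10, 30) returns 1.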

>  }
>  
>  /*
> @@ -314,7 +302,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  {
>  	struct task_struct *g, *p;
>  	struct task_struct *chosen = NULL;
> -	*ppoints = 0;
> +	unsigned long chosen_points = 0;
>  
>  	do_each_thread(g, p) {
>  		unsigned int points;
> @@ -354,7 +342,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  			 */
>  			if (p == current) {
>  				chosen = p;
> -				*ppoints = 1000;
> +				chosen_points = ULONG_MAX;
>  			} else if (!force_kill) {
>  				/*
>  				 * If this task is not being ptraced on exit,
> @@ -367,12 +355,13 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>  		}
>  
>  		points = oom_badness(p, memcg, nodemask, totalpages);
> -		if (points > *ppoints) {
> +		if (points > chosen_points) {
>  			chosen = p;
> -			*ppoints = points;
> +			chosen_points = points;
>  		}
>  	} while_each_thread(g, p);
>  
> +	*ppoints = chosen_points * 1000 / totalpages;

So it's up to the select_bad_process() callers to prevent the
divide-by-zero.  It is not obvious that they actually do this, and this
important caller requirement is undocumented.
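
One way to make the requirement go away is to keep a guard at the point
of division, as the removed hunk in oom_badness() did.  A sketch only;
normalize_points() is a hypothetical helper and is not part of the
patch:

```c
/*
 * Hypothetical helper: convert a raw page count back to the
 * oom_score_adj scale for userspace.  A totalpages of 0 (the removed
 * comment mentions a memory controller limit of 0 bytes) is treated
 * as 1 so the division is always defined.
 */
static unsigned int normalize_points(unsigned long chosen_points,
				     unsigned long totalpages)
{
	if (!totalpages)
		totalpages = 1;
	return (unsigned int)(chosen_points * 1000 / totalpages);
}
```

The alternative is to leave the division bare and document at the top
of select_bad_process() that totalpages must be nonzero.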

>  	return chosen;
>  }
>  

