From: Andrea Arcangeli <andrea@suse.de>
To: Andrew Morton <akpm@osdl.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>,
linux-kernel@vger.kernel.org, Hugh Dickins <hugh@veritas.com>
Subject: Re: OOM fixes 2/5
Date: Fri, 21 Jan 2005 08:04:29 +0100
Message-ID: <20050121070429.GE17050@dualathlon.random>
In-Reply-To: <20050120224645.3351d22c.akpm@osdl.org>
On Thu, Jan 20, 2005 at 10:46:45PM -0800, Andrew Morton wrote:
> Thus empirically, it appears that the number of machines which need a
> non-zero protection ratio is exceedingly small. Why change the setting on
> all machines for the benefit of the tiny few? Seems weird. Especially
> when this problem could be solved with a few-line initscript. Ho hum.
It's up to you; IMHO you're making a mistake, but I don't mind as long as
our customers aren't at risk of early OOM kills (or worse, kernel crashes)
with some db loads (especially without swap the risk is huge for all
users, since all anonymous memory will be pinned like ptes, but with ~3G
of pagetables they're at risk even with swap). At least you *must*
admit that without my patch applied as I posted it, there's a >0
probability of running out of the normal zone, which will lead to an
OOM kill or a deadlock even though 10G of highmem might still be freeable
(like with clean cache). My patch obviously cannot make it impossible to
run out of the normal zone, since there's only 800M of normal zone and one
can open more files than fit in it, but at least it gives the user the
security that a certain workload can run reliably. Without this patch
there's no guarantee at all that any workload will run when >1G of
ptes is allocated.
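For comparison, the "few-line initscript" Andrew suggests would boil down
to something like this (a sketch: vm.lower_zone_protection is the 2.6-era
sysctl, and the value 100 is purely illustrative, not a recommendation):

```shell
#!/bin/sh
# Reserve part of the normal zone against highmem-capable allocations,
# so lowmem-only users (ptes, buffer heads) are less likely to starve.
# Requires root and a 2.6-era kernel exposing this sysctl.
echo 100 > /proc/sys/vm/lower_zone_protection
```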
The fix below is needed as well, and you won't find reports of people
reproducing this race condition. Please apply. CC'ed Hugh. Sorry Hugh, I
know you were working on it (you said not over the weekend, IIRC), but
I've been upgraded to the latest bk tree, so I had to fix it up quickly or
I would have had to run the racy code on my smp systems to test new
kernels.
From: Andrea Arcangeli <andrea@suse.de>
Subject: fixup smp race introduced in 2.6.11-rc1
Signed-off-by: Andrea Arcangeli <andrea@suse.de>
--- x/mm/memory.c.~1~	2005-01-21 06:58:14.747335048 +0100
+++ x/mm/memory.c	2005-01-21 07:16:15.318063328 +0100
@@ -1555,8 +1555,17 @@ void unmap_mapping_range(struct address_
 	spin_lock(&mapping->i_mmap_lock);
 
+	/* serialize i_size write against truncate_count write */
+	smp_wmb();
 	/* Protect against page faults, and endless unmapping loops */
 	mapping->truncate_count++;
+	/*
+	 * For archs where spin_lock has acquire (one-way) semantics,
+	 * like ia64, this smp_mb() prevents pagetable contents from
+	 * being read before the truncate_count increment is visible
+	 * to other cpus.
+	 */
+	smp_mb();
 	if (unlikely(is_restart_addr(mapping->truncate_count))) {
 		if (mapping->truncate_count == 0)
 			reset_vma_truncate_counts(mapping);
@@ -1864,10 +1873,18 @@ do_no_page(struct mm_struct *mm, struct 
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
 		sequence = mapping->truncate_count;
+		smp_rmb(); /* serializes i_size against truncate_count */
 	}
 retry:
 	cond_resched();
 	new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
+	/*
+	 * No smp_rmb is needed here as long as there's a full
+	 * spin_lock/unlock sequence inside the ->nopage callback
+	 * (for the pagecache lookup) that acts as an implicit
+	 * smp_mb() and prevents the i_size read from happening
+	 * after the next truncate_count read.
+	 */
 	/* no page was available -- either SIGBUS or OOM */
 	if (new_page == NOPAGE_SIGBUS)