public inbox for linux-kernel@vger.kernel.org
From: Greg KH <gregkh@suse.de>
To: linux-kernel@vger.kernel.org, stable@kernel.org
Cc: stable-review@kernel.org, torvalds@linux-foundation.org,
	akpm@linux-foundation.org, alan@lxorguk.ukuu.org.uk,
	Mel Gorman <mel@csn.ul.ie>, Wu Fengguang <fengguang.wu@intel.com>,
	Rik van Riel <riel@redhat.com>, Jiri Slaby <jslaby@suse.cz>
Subject: [38/59] vmscan: raise the bar to PAGEOUT_IO_SYNC stalls
Date: Tue, 24 Aug 2010 15:24:50 -0700
Message-ID: <20100824222525.706491630@clark.site>
In-Reply-To: <20100824224625.GA5449@kroah.com>

2.6.32-stable review patch.  If anyone has any objections, please let us know.

------------------

From: Wu Fengguang <fengguang.wu@intel.com>

commit e31f3698cd3499e676f6b0ea12e3528f569c4fa3 upstream.

Fix "system goes unresponsive under memory pressure and lots of
dirty/writeback pages" bug.

	http://lkml.org/lkml/2010/4/4/86

In the above thread, Andreas Mohr described that

	Invoking any command locked up for minutes (note that I'm
	talking about attempted additional I/O to the _other_,
	_unaffected_ main system HDD - such as loading some shell
	binaries -, NOT the external SSD18M!!).

This happens when both of the following conditions are met:
- under memory pressure
- writing heavily to a slow device

OOM also occurred on Andreas' system.  The OOM trace shows that 3 processes
were stuck in wait_on_page_writeback() in the direct reclaim path: one in
do_fork() and the other two in unix_stream_sendmsg().  They were blocked on
this condition:

	(sc->order && priority < DEF_PRIORITY - 2)

which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
also should use PAGEOUT_IO_SYNC) one year ago.  That condition may be too
permissive.  In Andreas' case, 512MB/1024 = 512KB.  If the direct reclaim
for the order-1 fork() allocation runs into a range of 512KB
hard-to-reclaim LRU pages, it will be stalled.
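
To make the 512MB/1024 arithmetic concrete, here is a minimal userspace
sketch, not kernel code.  It assumes that one reclaim pass covers roughly
the LRU size shifted right by the priority, that DEF_PRIORITY is 12 (which
matches the 1/1024 figure above), and that the 512MB figure is the size
being scanned.  scan_window_bytes() is a hypothetical helper name.

```c
#include <stdint.h>

/* Assumed value; consistent with the 1/1024 figure in this changelog. */
#define DEF_PRIORITY 12

/*
 * Hypothetical helper: approximate size of the LRU range covered by one
 * scan pass at the given priority, modelled as lru_bytes >> priority.
 */
static uint64_t scan_window_bytes(uint64_t lru_bytes, int priority)
{
	return lru_bytes >> priority;
}
```

With 512MB and a shift of DEF_PRIORITY - 2, the helper yields the 512KB
range mentioned above.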

It's a severe problem in three ways.

Firstly, it can easily happen in daily desktop usage.  vmscan priority can
easily go below (DEF_PRIORITY - 2) under _local_ memory pressure.  Even if
50% of the system's pages are globally reclaimable, there is still a good
chance of running into 0.1%-sized hard-to-reclaim ranges.  For example, a
simple dd can easily create a big range (up to 20%) of dirty pages in the
LRU lists.  And order-1 to order-3 allocations are very common with
SLUB.  Try "grep -v '1 :' /proc/slabinfo" to get the list of high order
slab caches.  For example, the order-1 radix_tree_node slab cache may
stall applications at swap-in time; the order-3 inode cache on most
filesystems may stall applications when trying to read some file; the
order-2 proc_inode_cache may stall applications when trying to open a
/proc file.

Secondly, once triggered, it will stall unrelated processes (not doing IO
at all) in the system.  This "one slow USB device stalls the whole system"
avalanching effect is very bad.

Thirdly, once stalled, the stall time can be intolerably long for the
users.  When there are 20MB of queued writeback pages and USB 1.1 is writing
them at 1MB/s, wait_on_page_writeback() will be stuck for up to 20 seconds.
Not to mention that it may be called multiple times.
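
The 20-second figure is simply the queued writeback divided by the device
throughput.  A small userspace sketch of that estimate follows; it assumes
a steady throughput and a single wait, and writeback_drain_secs() is a
hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/*
 * Hypothetical helper: seconds needed to drain queued writeback at a
 * steady device throughput, rounded up.  A throughput of zero means the
 * queue never drains, which is reported as UINT64_MAX.
 */
static uint64_t writeback_drain_secs(uint64_t queued_bytes,
				     uint64_t bytes_per_sec)
{
	if (bytes_per_sec == 0)
		return UINT64_MAX;
	return (queued_bytes + bytes_per_sec - 1) / bytes_per_sec;
}
```

For 20MB queued at 1MB/s this gives 20 seconds per wait; repeated calls
add up.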

So raise the bar to only enable PAGEOUT_IO_SYNC when priority drops to
DEF_PRIORITY/3 or below, i.e. 6.25% of the LRU size.  As the default dirty
throttle ratio is 20%, it will hardly be triggered by pure dirty pages.
We had better treat PAGEOUT_IO_SYNC as a last-resort workaround -- its
stall time is uncomfortably long (it easily goes beyond 1s).
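
The 6.25% figure can be checked with a short userspace sketch.  It assumes
DEF_PRIORITY is 12 and models the share of the LRU covered per pass as
1 / 2^priority; scan_share_bp() is a hypothetical helper that returns the
share in basis points (hundredths of a percent) to stay in integers.

```c
/* Assumed value; consistent with the 6.25% figure in this changelog. */
#define DEF_PRIORITY 12

/*
 * Hypothetical helper: share of the LRU covered by one scan pass at the
 * given priority, in basis points (10000 == 100%).
 */
static unsigned int scan_share_bp(int priority)
{
	return 10000u >> priority;
}
```

At DEF_PRIORITY / 3 (priority 4) this is 625 basis points, i.e. 6.25%,
well under the 2000 basis points of the default 20% dirty ratio.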

The bar is only raised for (order <= PAGE_ALLOC_COSTLY_ORDER) allocations,
which are easy to satisfy on 1TB memory boxes.  So, although 6.25% of
memory could be an awful lot of pages to scan on a system with 1TB of
memory, it won't really have to busy scan that much.
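
The effect on a low-order allocation can be seen by comparing the old
condition quoted above with the order/priority part of the new check.
This is a userspace sketch only: it assumes DEF_PRIORITY is 12 and
PAGE_ALLOC_COSTLY_ORDER is 3, and it leaves out the kswapd, lumpy_reclaim
and nr_freed tests that the patch also performs.  old_sync_stall() and
new_sync_stall() are hypothetical names.

```c
#include <stdbool.h>

/* Assumed values, mirroring the kernel definitions of this era. */
#define DEF_PRIORITY            12
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Old condition as quoted in the changelog. */
static bool old_sync_stall(int order, int priority)
{
	return order && priority < DEF_PRIORITY - 2;
}

/*
 * Order/priority part of the new check.  The caller is assumed to be a
 * direct reclaimer that is already doing lumpy reclaim (order > 0).
 */
static bool new_sync_stall(int order, int priority)
{
	int lumpy_stall_priority;

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		lumpy_stall_priority = DEF_PRIORITY;
	else
		lumpy_stall_priority = DEF_PRIORITY / 3;

	return priority <= lumpy_stall_priority;
}
```

For an order-1 request at priority 8 the old condition allows a
synchronous stall while the new one does not; the new one only allows it
once priority has dropped to 4.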

Andreas tested an older version of this patch and reported that it mostly
fixed his problem.  Mel Gorman helped improve it and KOSAKI Motohiro will
fix it further in the next patch.

Reported-by: Andreas Mohr <andi@lisas.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 mm/vmscan.c |   53 +++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 45 insertions(+), 8 deletions(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1083,6 +1083,48 @@ static int too_many_isolated(struct zone
 }
 
 /*
+ * Returns true if the caller should wait to clean dirty/writeback pages.
+ *
+ * If we are direct reclaiming for contiguous pages and we do not reclaim
+ * everything in the list, try again and wait for writeback IO to complete.
+ * This will stall high-order allocations noticeably. Only do that when we
+ * really need to free the pages under high memory pressure.
+ */
+static inline bool should_reclaim_stall(unsigned long nr_taken,
+					unsigned long nr_freed,
+					int priority,
+					int lumpy_reclaim,
+					struct scan_control *sc)
+{
+	int lumpy_stall_priority;
+
+	/* kswapd should not stall on sync IO */
+	if (current_is_kswapd())
+		return false;
+
+	/* Only stall on lumpy reclaim */
+	if (!lumpy_reclaim)
+		return false;
+
+	/* If we have reclaimed everything on the isolated list, no stall */
+	if (nr_freed == nr_taken)
+		return false;
+
+	/*
+	 * For high-order allocations, there are two stall thresholds.
+	 * High-cost allocations stall immediately whereas lower
+	 * order allocations such as stacks require the scanning
+	 * priority to be much higher before stalling.
+	 */
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		lumpy_stall_priority = DEF_PRIORITY;
+	else
+		lumpy_stall_priority = DEF_PRIORITY / 3;
+
+	return priority <= lumpy_stall_priority;
+}
+
+/*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
@@ -1176,14 +1218,9 @@ static unsigned long shrink_inactive_lis
 		nr_scanned += nr_scan;
 		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
 
-		/*
-		 * If we are direct reclaiming for contiguous pages and we do
-		 * not reclaim everything in the list, try again and wait
-		 * for IO to complete. This will stall high-order allocations
-		 * but that should be acceptable to the caller
-		 */
-		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    lumpy_reclaim) {
+		/* Check if we should synchronously wait for writeback */
+		if (should_reclaim_stall(nr_taken, nr_freed, priority,
+					lumpy_reclaim, sc)) {
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 			/*


