All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Greg KH <gregkh@linuxfoundation.org>,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	alan@lxorguk.ukuu.org.uk, Mel Gorman <mgorman@suse.de>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Fengguang Wu <fengguang.wu@intel.com>,
	Michal Hocko <mhocko@suse.cz>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>,
	Minchan Kim <minchan@kernel.org>, Rik van Riel <riel@redhat.com>,
	Ying Han <yinghan@google.com>, Greg Thelen <gthelen@google.com>,
	Hugh Dickins <hughd@google.com>
Subject: [ 16/82] memcg: prevent OOM with too many dirty pages
Date: Mon, 13 Aug 2012 13:18:52 -0700	[thread overview]
Message-ID: <20120813201747.856572538@linuxfoundation.org> (raw)
In-Reply-To: <20120813201746.448504360@linuxfoundation.org>

From: Greg KH <gregkh@linuxfoundation.org>

3.5-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Michal Hocko <mhocko@suse.cz>

commit e62e384e9da8d9a0c599795464a7e76fd490931c upstream.

The current implementation of dirty pages throttling is not memcg aware
which makes it easy to have memcg LRUs full of dirty pages.  Without
throttling, these LRUs can be scanned faster than the rate of writeback,
leading to memcg OOM conditions when the hard limit is small.

This patch fixes the problem by throttling the allocating process
(possibly a writer) during the hard limit reclaim by waiting on
PageReclaim pages.  We are waiting only for PageReclaim pages because
those are the pages that made one full round over LRU and that means that
the writeback is much slower than scanning.

The solution is far from being ideal - long term solution is memcg aware
dirty throttling - but it is meant to be a band aid until we have a real
fix.  We are seeing this happening during nightly backups which are placed
into containers to prevent from eviction of the real working set.

The change affects only memcg reclaim and only when we encounter
PageReclaim pages which is a signal that the reclaim doesn't catch up on
with the writers so somebody should be throttled.  This could be
potentially unfair because it could be somebody else from the group who
gets throttled on behalf of the writer but as writers need to allocate as
well and they allocate in higher rate the probability that only innocent
processes would be penalized is not that high.

I have tested this change by a simple dd copying /dev/zero to tmpfs or
ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G
containers) and dd got killed by OOM killer every time.  With the patch I
could run the dd with the same size under 5M controller without any OOM.
The issue is more visible with slower devices for output.

* With the patch
================
* tmpfs size=2G
---------------
$ vim cgroup_cache_oom_test.sh
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s

* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s

* Without the patch
===================
* tmpfs size=2G
---------------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4668 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s

* ext3
------
$ ./cgroup_cache_oom_test.sh 5M
using Limit 5M for group
./cgroup_cache_oom_test.sh: line 46:  4689 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 60M
using Limit 60M for group
./cgroup_cache_oom_test.sh: line 46:  4692 Killed                  dd if=/dev/zero of=$OUT/zero bs=1M count=$count
$ ./cgroup_cache_oom_test.sh 300M
using Limit 300M for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
$ ./cgroup_cache_oom_test.sh 2G
using Limit 2G for group
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s

[akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
[hughd@google.com: fix deadlock with loop driver]
Reviewed-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 mm/vmscan.c |   23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -720,9 +720,26 @@ static unsigned long shrink_page_list(st
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
 		if (PageWriteback(page)) {
-			nr_writeback++;
-			unlock_page(page);
-			goto keep;
+			/*
+			 * memcg doesn't have any dirty pages throttling so we
+			 * could easily OOM just because too many pages are in
+			 * writeback from reclaim and there is nothing else to
+			 * reclaim.
+			 *
+			 * Check may_enter_fs, certainly because a loop driver
+			 * thread might enter reclaim, and deadlock if it waits
+			 * on a page for which it is needed to do the write
+			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
+			 * but more thought would probably show more reasons.
+			 */
+			if (!global_reclaim(sc) && PageReclaim(page) &&
+					may_enter_fs)
+				wait_on_page_writeback(page);
+			else {
+				nr_writeback++;
+				unlock_page(page);
+				goto keep;
+			}
 		}
 
 		references = page_check_references(page, sc);



  parent reply	other threads:[~2012-08-13 20:42 UTC|newest]

Thread overview: 92+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-08-13 20:18 [ 00/82] 3.5.2-stable review Greg Kroah-Hartman
2012-08-13 20:18 ` [ 01/82] virtio-blk: Call del_gendisk() before disable guest kick Greg Kroah-Hartman
2012-08-13 20:18   ` Greg Kroah-Hartman
2012-08-13 20:18 ` [ 02/82] virtio-blk: Reset device after blk_cleanup_queue() Greg Kroah-Hartman
2012-08-13 20:18   ` Greg Kroah-Hartman
2012-08-13 20:18 ` [ 03/82] virtio-blk: Use block layer provided spinlock Greg Kroah-Hartman
2012-08-13 20:18   ` Greg Kroah-Hartman
2012-08-13 20:18 ` [ 04/82] [IA64] Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts Greg Kroah-Hartman
2012-08-13 20:18 ` [ 05/82] asus-wmi: use ASUS_WMI_METHODID_DSTS2 as default DSTS ID Greg Kroah-Hartman
2012-08-13 20:18 ` [ 06/82] selinux: fix selinux_inode_setxattr oops Greg Kroah-Hartman
2012-08-13 20:18 ` [ 07/82] lib/vsprintf.c: kptr_restrict: fix pK-error in SysRq show-all-timers(Q) Greg Kroah-Hartman
2012-08-13 20:18 ` [ 08/82] sunrpc: clnt: Add missing braces Greg Kroah-Hartman
2012-08-13 20:18 ` [ 09/82] SUNRPC: return negative value in case rpcbind client creation error Greg Kroah-Hartman
2012-08-13 20:18 ` [ 10/82] mISDN: Bugfix only few bytes are transfered on a connection Greg Kroah-Hartman
2012-08-13 20:18 ` [ 11/82] nilfs2: fix deadlock issue between chcp and thaw ioctls Greg Kroah-Hartman
2012-08-13 20:18 ` [ 12/82] media: ene_ir: Fix driver initialisation Greg Kroah-Hartman
2012-08-13 20:18 ` [ 13/82] media: m5mols: Correct reported ISO values Greg Kroah-Hartman
2012-08-13 20:18 ` [ 14/82] media: videobuf-dma-contig: restore buffer mapping for uncached bufers Greg Kroah-Hartman
2012-08-13 20:18 ` [ 15/82] pcdp: use early_ioremap/early_iounmap to access pcdp table Greg Kroah-Hartman
2012-08-13 20:18 ` Greg Kroah-Hartman [this message]
2012-08-13 20:18 ` [ 17/82] memcg: further prevent OOM with too many dirty pages Greg Kroah-Hartman
2012-08-13 20:18 ` [ 18/82] mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page() Greg Kroah-Hartman
2012-08-13 20:18 ` [ 19/82] ARM: 7466/1: disable interrupt before spinning endlessly Greg Kroah-Hartman
2012-08-13 20:18 ` [ 20/82] ARM: 7467/1: mutex: use generic xchg-based implementation for ARMv6+ Greg Kroah-Hartman
2012-08-15 13:56   ` Ben Hutchings
2012-08-15 14:08     ` Greg Kroah-Hartman
2012-08-15 14:11       ` Ben Hutchings
2012-08-15 14:49         ` Nicolas Pitre
2012-08-15 14:49         ` Greg Kroah-Hartman
2012-08-15 14:55           ` Will Deacon
2012-08-13 20:18 ` [ 21/82] ARM: 7476/1: vfp: only clear vfp state for current cpu in vfp_pm_suspend Greg Kroah-Hartman
2012-08-13 20:18 ` [ 22/82] ARM: 7477/1: vfp: Always save VFP state in vfp_pm_suspend on UP Greg Kroah-Hartman
2012-08-13 20:18 ` [ 23/82] ARM: 7478/1: errata: extend workaround for erratum #720789 Greg Kroah-Hartman
2012-08-13 20:19 ` [ 24/82] ARM: 7479/1: mm: avoid NULL dereference when flushing gate_vma with VIVT caches Greg Kroah-Hartman
2012-08-13 20:19 ` [ 25/82] ARM: 7480/1: only call smp_send_stop() on SMP Greg Kroah-Hartman
2012-08-13 20:19 ` [ 26/82] ARM: Fix undefined instruction exception handling Greg Kroah-Hartman
2012-08-13 20:19 ` [ 27/82] ALSA: hda - add dock support for Thinkpad T430s Greg Kroah-Hartman
2012-08-13 20:19 ` [ 28/82] ALSA: hda - add dock support for Thinkpad X230 Greg Kroah-Hartman
2012-08-13 20:19 ` [ 29/82] ALSA: hda - remove quirk for Dell Vostro 1015 Greg Kroah-Hartman
2012-08-13 20:19 ` [ 30/82] ALSA: hda - Fix double quirk for Quanta FL1 / Lenovo Ideapad Greg Kroah-Hartman
2012-08-13 20:19 ` [ 31/82] mm: setup pageblock_order before its used by sparsemem Greg Kroah-Hartman
2012-08-13 20:19 ` [ 32/82] mm: mmu_notifier: fix freed page still mapped in secondary MMU Greg Kroah-Hartman
2012-08-13 20:19 ` [ 33/82] md/raid1: dont abort a resync on the first badblock Greg Kroah-Hartman
2012-08-13 20:19 ` [ 34/82] video/smscufx: fix line counting in fb_write Greg Kroah-Hartman
2012-08-13 20:19 ` [ 35/82] block: uninitialized ioc->nr_tasks triggers WARN_ON Greg Kroah-Hartman
2012-08-13 20:19 ` [ 36/82] sh: Fix up recursive fault in oops with unset TTB Greg Kroah-Hartman
2012-08-13 20:19 ` [ 37/82] ore: Fix out-of-bounds access in _ios_obj() Greg Kroah-Hartman
2012-08-13 20:19 ` [ 38/82] ACPI processor: Fix tick_broadcast_mask online/offline regression Greg Kroah-Hartman
2012-08-13 20:19 ` [ 39/82] mISDN: Bugfix for layer2 fixed TEI mode Greg Kroah-Hartman
2012-08-13 20:19 ` [ 40/82] mac80211: cancel mesh path timer Greg Kroah-Hartman
2012-08-13 20:19 ` [ 41/82] ath9k: Add PID/VID support for AR1111 Greg Kroah-Hartman
2012-08-13 20:19 ` [ 42/82] wireless: reg: restore previous behaviour of chan->max_power calculations Greg Kroah-Hartman
2012-08-13 20:19 ` [ 43/82] x86, nops: Missing break resulting in incorrect selection on Intel Greg Kroah-Hartman
2012-08-13 20:19 ` [ 44/82] x86-64, kcmp: The kcmp system call can be common Greg Kroah-Hartman
2012-08-13 20:19 ` [ 45/82] Input: synaptics - handle out of bounds values from the hardware Greg Kroah-Hartman
2012-08-13 20:19 ` [ 46/82] random: make add_interrupt_randomness() do something sane Greg Kroah-Hartman
2012-08-13 20:19 ` [ 47/82] random: use lockless techniques in the interrupt path Greg Kroah-Hartman
2012-08-13 20:19 ` [ 48/82] random: create add_device_randomness() interface Greg Kroah-Hartman
2012-08-13 20:19 ` [ 49/82] usb: feed USB device information to the /dev/random driver Greg Kroah-Hartman
2012-08-13 20:19 ` [ 50/82] net: feed /dev/random with the MAC address when registering a device Greg Kroah-Hartman
2012-08-13 20:19 ` [ 51/82] random: use the arch-specific rng in xfer_secondary_pool Greg Kroah-Hartman
2012-08-13 20:19 ` [ 52/82] random: add new get_random_bytes_arch() function Greg Kroah-Hartman
2012-08-13 20:19 ` [ 53/82] random: add tracepoints for easier debugging and verification Greg Kroah-Hartman
2012-08-13 20:19 ` [ 54/82] MAINTAINERS: Theodore Tso is taking over the random driver Greg Kroah-Hartman
2012-08-13 20:19 ` [ 55/82] rtc: wm831x: Feed the write counter into device_add_randomness() Greg Kroah-Hartman
2012-08-13 20:19 ` [ 56/82] mfd: wm831x: Feed the device UUID " Greg Kroah-Hartman
2012-08-13 20:19 ` [ 57/82] random: remove rand_initialize_irq() Greg Kroah-Hartman
2012-08-13 20:19 ` [ 58/82] random: Add comment to random_initialize() Greg Kroah-Hartman
2012-08-13 20:19 ` [ 59/82] dmi: Feed DMI table to /dev/random driver Greg Kroah-Hartman
2012-08-13 20:19 ` [ 60/82] random: mix in architectural randomness in extract_buf() Greg Kroah-Hartman
2012-08-13 20:19 ` [ 61/82] HID: multitouch: add support for Novatek touchscreen Greg Kroah-Hartman
2012-08-13 20:19 ` [ 62/82] HID: add support for Cypress barcode scanner 04B4:ED81 Greg Kroah-Hartman
2012-08-13 20:19 ` [ 63/82] HID: add ASUS AIO keyboard model AK1D Greg Kroah-Hartman
2012-08-13 20:19 ` [ 64/82] mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables Greg Kroah-Hartman
2012-08-13 20:19 ` [ 65/82] target: Add range checking to UNMAP emulation Greg Kroah-Hartman
2012-08-13 20:19 ` [ 66/82] target: Fix reading of data length fields for UNMAP commands Greg Kroah-Hartman
2012-08-13 20:19 ` [ 67/82] target: Fix possible integer underflow in UNMAP emulation Greg Kroah-Hartman
2012-08-13 20:19 ` [ 68/82] target: Check number of unmap descriptors against our limit Greg Kroah-Hartman
2012-08-13 20:19 ` [ 69/82] ARM: clk-imx31: Fix the keypad clock name Greg Kroah-Hartman
2012-08-13 20:19 ` [ 70/82] ARM: imx: enable emi_slow_gate clock for imx5 Greg Kroah-Hartman
2012-08-13 20:19 ` [ 71/82] ARM: mxs: Remove MMAP_MIN_ADDR setting from mxs_defconfig Greg Kroah-Hartman
2012-08-13 20:19 ` [ 72/82] ARM: dts: imx53-ard: add regulators for lan9220 Greg Kroah-Hartman
2012-08-13 20:19 ` [ 73/82] ARM: pxa: remove irq_to_gpio from ezx-pcap driver Greg Kroah-Hartman
2012-08-13 20:19 ` [ 74/82] cfg80211: process pending events when unregistering net device Greg Kroah-Hartman
2012-08-13 20:19 ` [ 75/82] printk: Fix calculation of length used to discard records Greg Kroah-Hartman
2012-08-13 20:19 ` [ 76/82] tun: dont zeroize sock->file on detach Greg Kroah-Hartman
2012-08-13 20:19 ` [ 77/82] Yama: higher restrictions should block PTRACE_TRACEME Greg Kroah-Hartman
2012-08-13 20:19 ` [ 78/82] iwlwifi: disable greenfield transmissions as a workaround Greg Kroah-Hartman
2012-08-13 20:19 ` [ 79/82] e1000e: NIC goes up and immediately goes down Greg Kroah-Hartman
2012-08-13 20:19 ` [ 80/82] Input: eeti_ts: pass gpio value instead of IRQ Greg Kroah-Hartman
2012-08-13 20:19 ` [ 81/82] Input: wacom - Bamboo One 1024 pressure fix Greg Kroah-Hartman
2012-08-13 20:19 ` [ 82/82] rt61pci: fix NULL pointer dereference in config_lna_gain Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120813201747.856572538@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=fengguang.wu@intel.com \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=kamezawa.hiroyu@jp.fujtisu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.cz \
    --cc=minchan@kernel.org \
    --cc=riel@redhat.com \
    --cc=stable@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=yinghan@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.