linux-mm.kvack.org archive mirror
From: Minchan Kim <minchan.kim@gmail.com>
To: Andrew Lutomirski <luto@mit.edu>,
	mgorman@suse.de, KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: aarcange@redhat.com, kamezawa.hiroyu@jp.fujitsu.com,
	fengguang.wu@intel.com, andi@firstfloor.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, hannes@cmpxchg.org,
	riel@redhat.com
Subject: Re: Easy portable testcase! (Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux))
Date: Mon, 30 May 2011 03:28:50 +0900	[thread overview]
Message-ID: <BANLkTim8ngH8ASTk9js-G9DxySWVb7VL3A@mail.gmail.com> (raw)
In-Reply-To: <BANLkTimDtpVeLYisfon7g_=H80D0XXgkGQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 15981 bytes --]

On Fri, May 27, 2011 at 8:58 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Thu, May 26, 2011 at 5:17 AM, Andrew Lutomirski <luto@mit.edu> wrote:
>> On Tue, May 24, 2011 at 8:43 PM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>>>
>>> Unfortunately, this log doesn't tell us why DM doesn't issue any swap I/O. ;-)
>>> I doubt it's a DM issue. Can you please try putting swap outside of DM?
>>>
>>>
>>
>> I can do one better: I can tell you how to reproduce the OOM in the
>> comfort of your own VM without using dm_crypt or a Sandy Bridge
>> laptop.  This is on Fedora 15, but it really ought to work on any
>> x86_64 distribution that has kvm.  You'll probably want at least 6GB
>> on your host machine because the VM wants 4GB ram.
>>
>> Here's how:
>>
>> Step 1: Clone git://gitorious.org/linux-test-utils/reproduce-annoying-mm-bug.git
>>
>> (You can browse here:)
>> https://gitorious.org/linux-test-utils/reproduce-annoying-mm-bug
>>
>> Instructions to reproduce the mm bug:
>>
>> Step 2: Build Linux v2.6.38.6 with config-2.6.38.6 and the patch
>> 0001-Minchan-patch-for-testing-23-05-2011.patch (both files are in the
>> git repo)
>>
>> Step 3: cd back to reproduce-annoying-mm-bug
>>
>> Step 4: Type this.
>>
>> $ make
>> $ qemu-kvm -m 4G -smp 2 -kernel <linux_dir>/arch/x86/boot/bzImage
>> -initrd initramfs.gz
>>
>> Step 5: Wait for the VM to boot (it's really fast) and then run ./repro_bug.sh.
>>
>> Step 6: Wait a bit and watch the fireworks.  Note that it can take a
>> couple minutes to reproduce the bug.
>>
>> Tested on my Sandy Bridge laptop and on a Xeon W3520.
>>
>> For whatever reason, on my laptop without the VM I can hit the bug
>> almost instantaneously.  Maybe it's because I'm using dm-crypt on my
>> laptop.
>>
>> --Andy
>>
>> P.S.  I think that the mk_trivial_initramfs.sh script is cute, and
>
> That's cool. :)
>
>> maybe I'll try to flesh it out and turn it into a real project some
>> day.
>>
>
> Thanks for the good test environment.
> Yesterday I tried to reproduce your problem on my system (4G DRAM) but
> unfortunately failed. I tried various settings but couldn't hit it.
> Maybe I need an 8G system or Sandy Bridge.  :(
>
> Hi mm folks, it's the next round.
> Andrew Lutomirski's first problem, the kswapd hang, was solved by
> Mel's recent series (the !pgdat_balanced bug and the shrink_slab
> cond_resched), which was also the key for the problems James and
> Collins reported.
>
> Andrew's next problem is an early OOM kill.
>
> [   60.627550] cryptsetup invoked oom-killer: gfp_mask=0x201da,
> order=0, oom_adj=0, oom_score_adj=0
> [   60.627553] cryptsetup cpuset=/ mems_allowed=0
> [   60.627555] Pid: 1910, comm: cryptsetup Not tainted 2.6.38.6-no-fpu+ #47
> [   60.627556] Call Trace:
> [   60.627563]  [<ffffffff8107f9c5>] ? cpuset_print_task_mems_allowed+0x91/0x9c
> [   60.627567]  [<ffffffff810b3ef1>] ? dump_header+0x7f/0x1ba
> [   60.627570]  [<ffffffff8109e4d6>] ? trace_hardirqs_on+0x9/0x20
> [   60.627572]  [<ffffffff810b42ba>] ? oom_kill_process+0x50/0x24e
> [   60.627574]  [<ffffffff810b4961>] ? out_of_memory+0x2e4/0x359
> [   60.627576]  [<ffffffff810b879e>] ? __alloc_pages_nodemask+0x5f3/0x775
> [   60.627579]  [<ffffffff810e127e>] ? alloc_pages_current+0xbe/0xd8
> [   60.627581]  [<ffffffff810b2126>] ? __page_cache_alloc+0x77/0x7e
> [   60.627585]  [<ffffffff8135d009>] ? dm_table_unplug_all+0x52/0xed
> [   60.627587]  [<ffffffff810b9f74>] ? __do_page_cache_readahead+0x98/0x1a4
> [   60.627589]  [<ffffffff810ba321>] ? ra_submit+0x21/0x25
> [   60.627590]  [<ffffffff810ba4ee>] ? ondemand_readahead+0x1c9/0x1d8
> [   60.627592]  [<ffffffff810ba5dd>] ? page_cache_sync_readahead+0x3d/0x40
> [   60.627594]  [<ffffffff810b342d>] ? filemap_fault+0x119/0x36c
> [   60.627597]  [<ffffffff810caf5f>] ? __do_fault+0x56/0x342
> [   60.627600]  [<ffffffff810f5630>] ? lookup_page_cgroup+0x32/0x48
> [   60.627602]  [<ffffffff810cd437>] ? handle_pte_fault+0x29f/0x765
> [   60.627604]  [<ffffffff810ba75e>] ? add_page_to_lru_list+0x6e/0x73
> [   60.627606]  [<ffffffff810be487>] ? page_evictable+0x1b/0x8d
> [   60.627607]  [<ffffffff810bae36>] ? put_page+0x24/0x35
> [   60.627610]  [<ffffffff810cdbfc>] ? handle_mm_fault+0x18e/0x1a1
> [   60.627612]  [<ffffffff810cded2>] ? __get_user_pages+0x2c3/0x3ed
> [   60.627614]  [<ffffffff810cfb4b>] ? __mlock_vma_pages_range+0x67/0x6b
> [   60.627616]  [<ffffffff810cfc01>] ? do_mlock_pages+0xb2/0x11a
> [   60.627618]  [<ffffffff810d0448>] ? sys_mlockall+0x111/0x11c
> [   60.627621]  [<ffffffff81002a3b>] ? system_call_fastpath+0x16/0x1b
> [   60.627623] Mem-Info:
> [   60.627624] Node 0 DMA per-cpu:
> [   60.627626] CPU    0: hi:    0, btch:   1 usd:   0
> [   60.627627] CPU    1: hi:    0, btch:   1 usd:   0
> [   60.627628] CPU    2: hi:    0, btch:   1 usd:   0
> [   60.627629] CPU    3: hi:    0, btch:   1 usd:   0
> [   60.627630] Node 0 DMA32 per-cpu:
> [   60.627631] CPU    0: hi:  186, btch:  31 usd:   0
> [   60.627633] CPU    1: hi:  186, btch:  31 usd:   0
> [   60.627634] CPU    2: hi:  186, btch:  31 usd:   0
> [   60.627635] CPU    3: hi:  186, btch:  31 usd:   0
> [   60.627638] active_anon:51586 inactive_anon:17384 isolated_anon:0
> [   60.627639]  active_file:0 inactive_file:226 isolated_file:0
> [   60.627639]  unevictable:395661 dirty:0 writeback:3 unstable:0
> [   60.627640]  free:13258 slab_reclaimable:3979 slab_unreclaimable:9755
> [   60.627640]  mapped:11910 shmem:24046 pagetables:5062 bounce:0
> [   60.627642] Node 0 DMA free:8352kB min:340kB low:424kB high:508kB
> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:952kB
> unevictable:6580kB isolated(anon):0kB isolated(file):0kB
> present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB
> shmem:0kB slab_reclaimable:16kB slab_unreclaimable:0kB
> kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB
> writeback_tmp:0kB pages_scanned:1645 all_unreclaimable? yes
> [   60.627649] lowmem_reserve[]: 0 2004 2004 2004
> [   60.627651] Node 0 DMA32 free:44680kB min:44712kB low:55888kB
> high:67068kB active_anon:206344kB inactive_anon:69536kB
> active_file:0kB inactive_file:0kB unevictable:1576064kB
> isolated(anon):0kB isolated(file):0kB present:2052320kB
> mlocked:47540kB dirty:0kB writeback:12kB mapped:47640kB shmem:96184kB
> slab_reclaimable:15900kB slab_unreclaimable:39020kB
> kernel_stack:2424kB pagetables:20248kB unstable:0kB bounce:0kB
> writeback_tmp:0kB pages_scanned:499225 all_unreclaimable? yes
> [   60.627658] lowmem_reserve[]: 0 0 0 0
> [   60.627660] Node 0 DMA: 0*4kB 0*8kB 2*16kB 2*32kB 1*64kB 2*128kB
> 1*256kB 1*512kB 1*1024kB 3*2048kB 0*4096kB = 8352kB
> [   60.627665] Node 0 DMA32: 959*4kB 2071*8kB 682*16kB 165*32kB
> 27*64kB 4*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB = 44980kB
> [   60.627670] 419957 total pagecache pages
> [   60.627671] 0 pages in swap cache
> [   60.627672] Swap cache stats: add 137, delete 137, find 0/0
> [   60.627673] Free swap  = 6290904kB
> [   60.627674] Total swap = 6291452kB
> [   60.632560] 524272 pages RAM
> [   60.632562] 9451 pages reserved
> [   60.632563] 45558 pages shared
> [   60.632564] 469944 pages non-shared
>
>
> There are about 270M of anon pages and lots of free swap space in the system.
> Nonetheless, he saw the OOM, which doesn't make sense to me.
> From the log above, he put swap on an encrypted device-mapper target and
> used a 1.4G ramfs. Andy, right?
>
> The first thing I suspected was the big ramfs.
> During reclaim, shrink_page_list will start to cull mlocked pages.
> If there are many ramfs pages and working-set pages on the LRU, the
> reclaimer can't reclaim any page until it meets a non-unevictable page
> or a non-working-set page (!PG_referenced and !pte_young). His workload
> had lots of anon pages and ramfs pages: the ramfs pages are unevictable,
> so they get culled, and the anon pages are promoted very easily, so we
> can't reclaim them either.
> That means zone->pages_scanned becomes very high, and eventually
> zone->all_unreclaimable gets set.
> From the log above, the number of LRU pages in the DMA32 zone is 68970
> and the number of unevictable pages is 394016.
>
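> For reference, and roughly from memory of 2.6.38 rather than a verbatim
> quote (names and argument lists are abbreviated), this is where the factor
> of six below comes from: the scan count is charged at isolation time,
> before shrink_page_list discovers which pages are mlocked, and the zone is
> written off once that count exceeds six times the reclaimable LRU size.
>
> /* mm/vmscan.c: shrink_inactive_list() - scanned pages are counted here */
>         nr_taken = isolate_pages_global(nr_to_scan, &page_list, &nr_scanned, ...);
>         zone->pages_scanned += nr_scanned;
>
> /* mm/vmscan.c: the "scanned too much" heuristic */
> static bool zone_reclaimable(struct zone *zone)
> {
>         return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> }
>
> /* mm/vmscan.c: balance_pgdat() - once that ratio is blown, give up on the zone */
>         if (nr_slab == 0 && !zone_reclaimable(zone))
>                 zone->all_unreclaimable = 1;
>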
> 394016 plus the working-set pages (I don't know how many) is almost
> equal to 68970 * 6 = 413820.
> So it's possible that zone->all_unreclaimable was set.
> I tried the patch below privately, but it doesn't solve his problem.
> I still think we need the patch below, though: the same thing can happen
> whenever there is a long run of consecutive mlocked pages on the LRU.
>
> ===
>
> From e37f150328aedeea9a88b6190ab2b6e6c1067163 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@gmail.com>
> Date: Wed, 25 May 2011 07:09:17 +0900
> Subject: [PATCH 3/3] vmscan: decrease pages_scanned on unevictable page
>
> If there are many unevictable pages on the evictable LRU list (e.g. a big ramfs),
> shrink_page_list will move them to the unevictable list and can't reclaim any pages,
> but we have already increased zone->pages_scanned.
> If the situation repeats, the number of evictable LRU pages decreases
> while zone->pages_scanned increases without any pages being reclaimed.
> That can turn on zone->all_unreclaimable, which is a totally false alarm.
>
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> ---
>  mm/vmscan.c |   22 +++++++++++++++++++---
>  1 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 08d3077..a7df813 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -700,7 +700,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>  static unsigned long shrink_page_list(struct list_head *page_list,
>                                      struct zone *zone,
>                                      struct scan_control *sc,
> -                                     unsigned long *dirty_pages)
> +                                     unsigned long *dirty_pages,
> +                                     unsigned long *unevictable_pages)
>  {
>        LIST_HEAD(ret_pages);
>        LIST_HEAD(free_pages);
> @@ -708,6 +709,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>        unsigned long nr_dirty = 0;
>        unsigned long nr_congested = 0;
>        unsigned long nr_reclaimed = 0;
> +       unsigned long nr_unevictable = 0;
>
>        cond_resched();
>
> @@ -908,6 +910,7 @@ cull_mlocked:
>                        try_to_free_swap(page);
>                unlock_page(page);
>                putback_lru_page(page);
> +               nr_unevictable++;
>                continue;
>
>  activate_locked:
> @@ -936,6 +939,7 @@ keep_lumpy:
>                zone_set_flag(zone, ZONE_CONGESTED);
>
>        *dirty_pages = nr_dirty;
> +       *unevictable_pages = nr_unevictable;
>        free_page_list(&free_pages);
>
>        list_splice(&ret_pages, page_list);
> @@ -1372,6 +1376,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>        unsigned long nr_scanned;
>        unsigned long nr_reclaimed = 0;
>        unsigned long nr_dirty;
> +       unsigned long nr_unevictable;
>        unsigned long nr_taken;
>        unsigned long nr_anon;
>        unsigned long nr_file;
> @@ -1425,7 +1430,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>        spin_unlock_irq(&zone->lru_lock);
>
>        reclaim_mode = sc->reclaim_mode;
> -       nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty);
> +       nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty, &nr_unevictable);
>
>        /* Check if we should syncronously wait for writeback */
>        if ((nr_dirty && !(reclaim_mode & RECLAIM_MODE_SINGLE) &&
> @@ -1434,7 +1439,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>                unsigned long nr_active = clear_active_flags(&page_list, NULL);
>                count_vm_events(PGDEACTIVATE, nr_active);
>                set_reclaim_mode(priority, sc, true);
> -               nr_reclaimed += shrink_page_list(&page_list, zone, sc, &nr_dirty);
> +               nr_reclaimed += shrink_page_list(&page_list, zone, sc,
> +                                               &nr_dirty, &nr_unevictable);
>        }
>
>        local_irq_disable();
> @@ -1442,6 +1448,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>                __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
>        __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
>
> +       /*
> +        * Too many unevictable pages on the evictable LRU list (ex, big ramfs)
> +        * can make zone->pages_scanned high and reduce the number of lru pages
> +        * on the evictable lru while reclaim is going on.
> +        * It could turn on all_unreclaimable, which is a false alarm.
> +        */
> +       spin_lock(&zone->lru_lock);
> +       if (zone->pages_scanned >= nr_unevictable)
> +               zone->pages_scanned -= nr_unevictable;
> +       else
> +               zone->pages_scanned = 0;
> +       spin_unlock(&zone->lru_lock);
> +
>        putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
>
>        trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
> --
> 1.7.1
>
> ===
>
> The second thing I suspect is zone_set_flag(zone, ZONE_CONGESTED).
> He used swap on an encrypted device-mapper target.
> Device mapper could make I/O slow for his workload, which means he is
> more likely to hit ZONE_CONGESTED than with a normal swap device.
>
> Let's think about it.
> The swap device is very congested, so shrink_page_list would mark the
> zone as CONGESTED.
> Who clears ZONE_CONGESTED? There are two places, both in kswapd.
> One only does work for order > 0, so it's probably a no-op for Andy's
> workload (i.e., it's mostly order-0 allocations).
> The remaining one is below.
>
>                                 * If a zone reaches its high watermark,
>                                 * consider it to be no longer congested. It's
>                                 * possible there are dirty pages backed by
>                                 * congested BDIs but as pressure is relieved,
>                                 * spectulatively avoid congestion waits
>                                 */
>                                zone_clear_flag(zone, ZONE_CONGESTED);
>                                if (i <= *classzone_idx)
>                                        balanced += zone->present_pages;
>
> It works only once the zone meets its high watermark. If allocation is
> faster than reclaim (which is true with a slow swap device), the zone
> remains congested.
> That means swapout keeps blocking.
> As the OOM log shows, the DMA32 zone can't meet its high watermark.
>
> Does my guess make sense?
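>
> Condensed from the code that the revert patch attached below removes (so
> this is just a restatement of what that patch contains, not new code),
> the loop I have in mind looks roughly like this:
>
> /* mm/vmscan.c: shrink_page_list() tags the zone when every dirty page it
>  * saw was backed by a congested BDI (e.g. a slow encrypted swap device) */
>         if (nr_dirty == nr_congested && nr_dirty != 0)
>                 zone_set_flag(zone, ZONE_CONGESTED);
>
> /* mm/backing-dev.c: wait_iff_congested() then really sleeps for reclaimers
>  * instead of just yielding */
>         if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
>                         !zone_is_reclaim_congested(zone)) {
>                 cond_resched();
>                 ...
>         }
>         prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
>         ret = io_schedule_timeout(timeout);
>
> /* mm/vmscan.c: balance_pgdat() only clears the flag once the zone meets
>  * its high watermark, which never happens in Andy's log */
>         zone_clear_flag(zone, ZONE_CONGESTED);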

Hi Andrew.
I couldn't reproduce your scenario on my machine, so would you be willing
to test this patch to prove the scenario above?
The patch is just a revert of 0e093d99 ["do not sleep on the
congestion queue..."] for 2.6.38.6.
I would like to use it to prove my zone-congestion theory above.

I based it on 2.6.38.6 so it is easy for you to apply; it should apply
cleanly on vanilla v2.6.38.6.
You also have to add the !pgdat_balanced and shrink_slab patches.

Thanks, Andrew.

-- 
Kind regards,
Minchan Kim

[-- Attachment #2: 0001-Revert-writeback-do-not-sleep-on-the-congestion-queu.patch --]
[-- Type: text/x-patch, Size: 10333 bytes --]

From 244e37f1f3978ff182b5e33b77b327e4f48bb438 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan.kim@gmail.com>
Date: Mon, 30 May 2011 02:23:49 +0900
Subject: [PATCH] Revert "writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone"

This reverts commit 0e093d99763eb4cea09f8ca4f1d01f34e121d10b.

Conflicts:

	mm/vmscan.c

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/backing-dev.h      |    2 +-
 include/linux/mmzone.h           |    8 -----
 include/trace/events/writeback.h |    7 ----
 mm/backing-dev.c                 |   61 +------------------------------------
 mm/page_alloc.c                  |    4 +-
 mm/vmscan.c                      |   41 ++-----------------------
 6 files changed, 9 insertions(+), 114 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 4ce34fa..8b0ae8b 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -286,7 +286,7 @@ enum {
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
 void set_bdi_congested(struct backing_dev_info *bdi, int sync);
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(struct zone *zone, int sync, long timeout);
+
 
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02ecb01..e1b16aa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -424,9 +424,6 @@ struct zone {
 typedef enum {
 	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 	ZONE_OOM_LOCKED,		/* zone is in OOM killer zonelist */
-	ZONE_CONGESTED,			/* zone has many dirty pages backed by
-					 * a congested BDI
-					 */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -444,11 +441,6 @@ static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag)
 	clear_bit(flag, &zone->flags);
 }
 
-static inline int zone_is_reclaim_congested(const struct zone *zone)
-{
-	return test_bit(ZONE_CONGESTED, &zone->flags);
-}
-
 static inline int zone_is_reclaim_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 4e249b9..fc2b3a0 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -180,13 +180,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
-	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-	TP_ARGS(usec_timeout, usec_delayed)
-);
-
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8e4ed88..c9e59de 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -729,7 +729,6 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
-static atomic_t nr_bdi_congested[2];
 
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
@@ -737,8 +736,7 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	if (test_and_clear_bit(bit, &bdi->state))
-		atomic_dec(&nr_bdi_congested[sync]);
+	clear_bit(bit, &bdi->state);
 	smp_mb__after_clear_bit();
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
@@ -750,8 +748,7 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	enum bdi_state bit;
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	if (!test_and_set_bit(bit, &bdi->state))
-		atomic_inc(&nr_bdi_congested[sync]);
+	set_bit(bit, &bdi->state);
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
@@ -782,57 +779,3 @@ long congestion_wait(int sync, long timeout)
 }
 EXPORT_SYMBOL(congestion_wait);
 
-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
- * @zone: A zone to check if it is heavily congested
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) and the given
- * @zone has experienced recent congestion, this waits for up to @timeout
- * jiffies for either a BDI to exit congestion of the given @sync queue
- * or a write to complete.
- *
- * In the absense of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(struct zone *zone, int sync, long timeout)
-{
-	long ret;
-	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
-
-	/*
-	 * If there is no congestion, or heavy congestion is not being
-	 * encountered in the current zone, yield if necessary instead
-	 * of sleeping on the congestion queue
-	 */
-	if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
-			!zone_is_reclaim_congested(zone)) {
-		cond_resched();
-
-		/* In case we scheduled, work out time remaining */
-		ret = timeout - (jiffies - start);
-		if (ret < 0)
-			ret = 0;
-
-		goto out;
-	}
-
-	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
-
-out:
-	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
-					jiffies_to_usecs(jiffies - start));
-
-	return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2828037..71e9842 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1929,7 +1929,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
-			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+			congestion_wait(BLK_RW_ASYNC, HZ/50);
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	return page;
@@ -2137,7 +2137,7 @@ rebalance:
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
+		congestion_wait(BLK_RW_ASYNC, HZ/50);
 		goto rebalance;
 	} else {
 		/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0665520..59de427 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -703,14 +703,11 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-				      struct zone *zone,
 				      struct scan_control *sc)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
-	unsigned long nr_dirty = 0;
-	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -730,7 +727,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto keep;
 
 		VM_BUG_ON(PageActive(page));
-		VM_BUG_ON(page_zone(page) != zone);
 
 		sc->nr_scanned++;
 
@@ -808,8 +804,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (PageDirty(page)) {
-			nr_dirty++;
-
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -820,7 +814,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			/* Page is dirty, try to write it out here */
 			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
-				nr_congested++;
 				goto keep_locked;
 			case PAGE_ACTIVATE:
 				goto activate_locked;
@@ -931,15 +924,6 @@ keep_lumpy:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
-	/*
-	 * Tag a zone as congested if all the dirty pages encountered were
-	 * backed by a congested BDI. In this case, reclaimers should just
-	 * back off and wait for congestion to clear because further reclaim
-	 * will encounter the same problem
-	 */
-	if (nr_dirty == nr_congested && nr_dirty != 0)
-		zone_set_flag(zone, ZONE_CONGESTED);
-
 	free_page_list(&free_pages);
 
 	list_splice(&ret_pages, page_list);
@@ -1426,12 +1410,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc);
+	nr_reclaimed = shrink_page_list(&page_list, sc);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+		nr_reclaimed += shrink_page_list(&page_list, sc);
 	}
 
 	local_irq_disable();
@@ -2085,14 +2069,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/* Take a nap, wait for some writeback to complete */
 		if (!sc->hibernation_mode && sc->nr_scanned &&
-		    priority < DEF_PRIORITY - 2) {
-			struct zone *preferred_zone;
-
-			first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
-						&cpuset_current_mems_allowed,
-						&preferred_zone);
-			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);
-		}
+		    priority < DEF_PRIORITY - 2)
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
 	}
 
 out:
@@ -2455,14 +2433,6 @@ loop_again:
 					    min_wmark_pages(zone), end_zone, 0))
 					has_under_min_watermark_zone = 1;
 			} else {
-				/*
-				 * If a zone reaches its high watermark,
-				 * consider it to be no longer congested. It's
-				 * possible there are dirty pages backed by
-				 * congested BDIs but as pressure is relieved,
-				 * spectulatively avoid congestion waits
-				 */
-				zone_clear_flag(zone, ZONE_CONGESTED);
 				if (i <= *classzone_idx)
 					balanced += zone->present_pages;
 			}
@@ -2546,9 +2516,6 @@ out:
 				order = sc.order = 0;
 				goto loop_again;
 			}
-
-			/* If balanced, clear the congested flag */
-			zone_clear_flag(zone, ZONE_CONGESTED);
 		}
 	}
 
-- 
1.7.0.4


Thread overview: 7+ messages
2011-05-25 20:17 Easy portable testcase! (Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)) Andrew Lutomirski
2011-05-26  8:18 ` KOSAKI Motohiro
2011-05-26 23:58 ` Minchan Kim
2011-05-29 18:28   ` Minchan Kim [this message]
2011-05-30  0:28     ` Andrew Lutomirski
2011-06-14 10:10       ` Johannes Weiner
2011-06-14 12:32         ` Andrew Lutomirski
