Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure

From: Mel Gorman <mel@csn.ul.ie>
To: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>,
	linux-mm@kvack.org, Nick Piggin <npiggin@suse.de>,
	Chris Mason <chris.mason@oracle.com>,
	Jens Axboe <jens.axboe@oracle.com>,
	linux-kernel@vger.kernel.org, gregkh@novell.com,
	Corrado Zoccolo <czoccolo@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zone pressure
Date: Wed, 24 Mar 2010 14:50:29 +0000
Message-ID: <20100324145028.GD2024@csn.ul.ie>
In-Reply-To: <4BA940E7.2030308@redhat.com>

On Tue, Mar 23, 2010 at 06:29:59PM -0400, Rik van Riel wrote:
> On 03/22/2010 07:50 PM, Mel Gorman wrote:
>
>> Test scenario
>> =============
>> X86-64 machine 1 socket 4 cores
>> 4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
>> 	on-board and a piece of crap, and a decent RAID card could blow
>> 	the budget.
>> Booted mem=256 to ensure it is fully IO-bound and match closer to what
>> 	Christian was doing
>
> With that many disks, you can easily have dozens of megabytes
> of data in flight to the disk at once.  That is a major
> fraction of memory.
>

That is easily possible. Note, I'm not claiming this workload configuration
is a good idea.

The background to this problem is Christian running a disk-intensive iozone
workload over many CPUs and disks with limited memory. It's already known
that if he added a small amount of extra memory, the problem went away.
The problem was a massive throughput regression and a bisect pinpointed
two patches (both mine) but neither makes sense. One altered the order pages
come back from lists but not availability and his hardware does no automatic
merging. A second does alter the availability of pages via the per-cpu lists
but reverting the behaviour didn't help.

The first fix to this was to replace congestion_wait with a waitqueue
that woke up processes if the watermarks were met. This fixed
Christian's problem but Andrew wants to pin the underlying cause.

I strongly suspect that evict-once behaves sensibly when memory is ample
but in this particular case, it's not helping.

> In fact, you might have all of the inactive file pages under
> IO...
>

Possibly. The tests have a write and a read phase but I wasn't
collecting the data with sufficient granularity to see which of the
tests are actually stalling.

>> 3. Page reclaim evict-once logic from 56e49d21 hurts really badly
>> 	fix title: revertevict
>> 	fixed in mainline? no
>> 	affects: 2.6.31 to now
>>
>> 	For reasons that are not immediately obvious, the evict-once patches
>> 	*really* hurt the time spent on congestion and the number of pages
>> 	reclaimed. Rik, I'm afraid I'm punting this to you for explanation
>> 	because clearly you tested this for AIM7 and might have some
>> 	theories. For the purposes of testing, I just reverted the changes.
>
> The patch helped IO tests with reasonable amounts of memory
> available, because the VM can cache frequently used data
> much more effectively.
>
> This comes at the cost of caching less recently accessed
> use-once data, which should not be an issue since the data
> is only used once...
>

Indeed. With or without evict-once, I'd have an expectation of all the
pages being recycled anyway because of the amount of data involved.

>> Rik, any theory on evict-once?
>
> No real theories yet, just the observation that your revert
> appears to be buggy (see below) and the possibility that your
> test may have all of the inactive file pages under IO...
>

Bah. I had the initial revert right and screwed up reverting from
2.6.32.10 on. I'm rerunning the tests. Is this right?

-       if (is_active_lru(lru)) {
-               if (inactive_list_is_low(zone, sc, file))
-                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
+       if (is_active_lru(lru)) {
+               shrink_active_list(nr_to_scan, zone, sc, priority, file);
                return 0;

> Can you reproduce the stall if you lower the dirty limits?
>

I'm rerunning the revertevict patches at the moment. When they complete,
I'll experiment with dirty limits. Any suggested values or will I just
increase it by some arbitrary amount and see what falls out? e.g.
increase dirty_ratio to 80.

>>   static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
>>   	struct zone *zone, struct scan_control *sc, int priority)
>>   {
>>   	int file = is_file_lru(lru);
>>
>> -	if (is_active_lru(lru)) {
>> -		if (inactive_list_is_low(zone, sc, file))
>> -		    shrink_active_list(nr_to_scan, zone, sc, priority, file);
>> +	if (lru == LRU_ACTIVE_FILE) {
>> +		shrink_active_list(nr_to_scan, zone, sc, priority, file);
>>   		return 0;
>>   	}
>
> Your revert is buggy.  With this change, anonymous pages will
> never get deactivated via shrink_list.
>

/me slaps self

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
