Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Mel Gorman <mel@csn.ul.ie>
To: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	epasch@de.ibm.com, SCHILLIG@de.ibm.com,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	christof.schmitt@de.ibm.com, thoss@de.ibm.com, hare@suse.de,
	gregkh@novell.com
Subject: Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
Date: Fri, 19 Feb 2010 15:19:34 +0000	[thread overview]
Message-ID: <20100219151934.GA1445@csn.ul.ie> (raw)
In-Reply-To: <4B7E73BF.5030901@linux.vnet.ibm.com>

On Fri, Feb 19, 2010 at 12:19:27PM +0100, Christian Ehrhardt wrote:
> >>>>
> >>>> PAGES-FREED  fast path   slow path
> >>>> GOOD CASE      ~62       ~1.46
> >>>> BAD CASE       ~62       ~37
> >>> 5f8dcc21 introduced per migrate type pcp lists, is it possible that 
> >>> we  run in a scenario where try_to_free frees a lot of pages via, but 
> >>> of the  wrong migrate type?
> >>
> >> It's possible but the window is small. When a threshold is reached on the
> >> PCP lists, they get drained to the buddy lists and later picked up again
> >> by __rmqueue_fallback(). I had considered the possibility of pages of the
> >> wrong type being on the PCP lists which led to the patch "page-allocator:
> >> Fallback to other per-cpu lists when the target list is empty and 
> >> memory is
> >> low" but you reported it made no difference even when fallback was 
> >> allowed
> >> with high watermarks.
> >> [...]
> > 
> > Today I created rather huge debug logs - I'll spare everyone with too 
> > much detail.
> > Eventually it comes down to some kind of cat /proc/zoneinfo like output 
> > extended to list all things per migrate type too.
> > 
> >  From that I still think there should be plenty of pages to get the 
> > allocation through, as it was suggested by the high amount of pages 
> > freed by try_to_free.
> > > 
> [...]
> 
> > More about that tomorrow,
> 
> Well tomorrow is now, and I think I got some important new insights.
> 
> As mentioned before I realized that a second call still fails most of the
> time (>99%). Therefore I added a "debugme" parameter to get_page_from_freelist
> and buffered_rmqueue to see where the allocations exactly fails (patch
> attached).
> 
> Now with debugme active in a second call after it had progress&&!page in direct_reclaim I saw the following repeating pattern in most of the cases:
>    get_page_from_freelist - zone loop - current zone 'DMA'
>    get_page_from_freelist - watermark check due to !(alloc_flags & ALLOC_NO_WATERMARKS)
>    get_page_from_freelist - goto zone_full due to zone_reclaim_mode==0
>    get_page_from_freelist - return page '(null)'
> 
> Ok - now we at least exactly know why it gets no page.
> Remember there are plenty of pages like it was in my zoneinfo like report in the last mail.
> I didn't expect that, but actually watermarks are what stops the allocations here.

That is somewhat expected. We also don't want to go underneath them
beause that can lead to system deadlock.

> The zone_watermark_ok check reports that there are not enough pages for the current watermark and
> finally it ends with zone_reclaim_mode which is always zero on s390 as we are not CONFIG_NUMA.
> 
> Ok remember my scenario - several parallel iozone sequential read processes
> - theres not much allocation going on except for the page cache read ahead
> related to that read workload.
> The allocations for page cache seem to have no special watermarks selected
> via their GFP flags and therefore get stalled by congestion_wait - which in
> consequence of no available writes in flight consumes its full timeout.
>

Which I'd expect to some extent, but it still stuns me that e084b makes
a difference to any of this. The one-list-per-migratetype patch would
make some sense except restoring something similar to the old behaviour
didn't help either.

> As I see significant impacts to the iozone throughput and not only
> e.g. bad counters in direct_reclaim the congestion_wait stalling seems to
> be so often/long to stall the application I/O itself.
>
> That means from the time VFS starting a read ahead it seems to stall that
> page allocation long enough that the data is not ready when the application
> tries to read it, while it would be if the allocation would be fast enough.
>
> A simple test for this theory was to allow those failing allocations a
> second chance without watermark restrictions before putting them to sleep
> for such a long time.

> 
> Index: linux-2.6-git/mm/page_alloc.c
> ===================================================================
> --- linux-2.6-git.orig/mm/page_alloc.c  2010-02-19 09:53:14.000000000 +0100
> +++ linux-2.6-git/mm/page_alloc.c       2010-02-19 09:56:26.000000000 +0100
> @@ -1954,7 +1954,22 @@
> 
>         if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
>                 /* Wait for some write requests to complete then retry */
> -               congestion_wait(BLK_RW_ASYNC, HZ/50);
> +               /*
> +                * If it gets here try it without watermarks before going
> +                * to sleep.
> +                *
> +                * This will end up in alloc_high_priority and if that fails
> +                * once more direct_reclaim but this time without watermark
> +                * checks.
> +                *
> +                * FIXME: that is just for verification - a real fix needs to
> +                * ensure e.g. page cache allocations don't drain all pages
> +                * under watermark
> +                */
> +               if (!(alloc_flags & ALLOC_NO_WATERMARKS))
> +                       alloc_flags &= ALLOC_NO_WATERMARKS;
> +               else
> +                       congestion_wait(BLK_RW_ASYNC, HZ/50);
>                 goto rebalance;
>         }
> 
> This fixes all issues I have, but as stated in the FIXME it is unfortunately
> no fix for the real world.

It's possible to deadlock a system with this patch. It's also not acting
as you intended. I think you meant either |= or = there but anyway.

> Unfortunately even now knowing the place of the issue so well I don't see
> the connection to the commits e084b+5f8dcc21

Still a mystery.

>  - I couldn't find something but
> did they change the accounting somewhere or e.g. changed the timing/order
> of watermark updates and allocations?
>

Not that I can think of.

> Eventually it might come down to a discussion of allocation priorities and
> we might even keep them as is and accept this issue - I still would prefer
> a good second chance implementation, other page cache allocation flags or
> something else that explicitly solves this issue.
>

In that line, the patch that replaced congestion_wait() with a waitqueue
makes some sense.

> Mel's patch that replaces congestion_wait with a wait for the zone watermarks
> becoming available again is definitely a step in the right direction and
> should go into upstream and the long term support branches.

I'll need to do a number of tests before I can move that upstream but I
don't think it's a merge candidate. Unfortunately, I'll be offline for a
week starting tomorrow so I won't be able to do the testing.

When I get back, I'll revisit those patches with the view to pushing
them upstream. I hate to treat symptoms here without knowing the
underlying problem but this has been spinning in circles for ages with
little forward progress :(

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

next prev parent reply	other threads:[~2010-02-19 15:19 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-12-07 14:39 Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set" Christian Ehrhardt
2009-12-07 15:09 ` Mel Gorman
2009-12-08 17:59   ` Christian Ehrhardt
2009-12-10 14:36     ` Christian Ehrhardt
2009-12-11 11:20       ` Mel Gorman
2009-12-11 14:47         ` Christian Ehrhardt
2009-12-18 13:38           ` Christian Ehrhardt
2009-12-18 17:42             ` Mel Gorman
2010-01-14 12:30               ` Christian Ehrhardt
2010-01-19 11:33                 ` Mel Gorman
2010-02-05 15:51                   ` Christian Ehrhardt
2010-02-05 17:49                     ` Mel Gorman
2010-02-08 14:01                       ` Christian Ehrhardt
2010-02-08 15:21                         ` Mel Gorman
2010-02-08 16:55                           ` Mel Gorman
2010-02-09  6:23                           ` Christian Ehrhardt
2010-02-09 15:52                           ` Christian Ehrhardt
2010-02-09 17:57                             ` Mel Gorman
2010-02-11 16:11                               ` Christian Ehrhardt
2010-02-12 10:05                                 ` Nick Piggin
2010-02-15  6:59                                   ` Nick Piggin
2010-02-15 15:46                                   ` Christian Ehrhardt
2010-02-16 11:25                                     ` Mel Gorman
2010-02-16 16:47                                       ` Christian Ehrhardt
2010-02-17  9:55                                         ` Christian Ehrhardt
2010-02-17 10:03                                           ` Christian Ehrhardt
2010-02-18 11:43                                           ` Mel Gorman
2010-02-18 16:09                                             ` Christian Ehrhardt
2010-02-19 11:19                                               ` Christian Ehrhardt
2010-02-19 15:19                                                 ` Mel Gorman [this message]
2010-02-22 15:42                                                   ` Christian Ehrhardt
2010-02-25 15:13                                                     ` Christian Ehrhardt
2010-02-26 11:18                                                       ` Nick Piggin
2010-03-02  6:52                                                   ` Nick Piggin
2010-03-02 10:04                                                     ` Mel Gorman
2010-03-02 10:36                                                       ` Nick Piggin
2010-03-02 11:01                                                         ` Mel Gorman
2010-03-02 11:18                                                           ` Nick Piggin
2010-03-02 11:24                                                             ` Mel Gorman
2010-03-03  6:51                                                               ` Christian Ehrhardt
2010-02-08 15:02                       ` Christian Ehrhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100219151934.GA1445@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=SCHILLIG@de.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=christof.schmitt@de.ibm.com \
    --cc=ehrhardt@linux.vnet.ibm.com \
    --cc=epasch@de.ibm.com \
    --cc=gregkh@novell.com \
    --cc=hare@suse.de \
    --cc=heiko.carstens@de.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=schwidefsky@de.ibm.com \
    --cc=thoss@de.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).