Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mel@csn.ul.ie>
To: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	epasch@de.ibm.com, SCHILLIG@de.ibm.com,
	Martin Schwidefsky <schwidefsky@de.ibm.com>,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	christof.schmitt@de.ibm.com, thoss@de.ibm.com, hare@suse.de,
	gregkh@novell.com
Subject: Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
Date: Fri, 19 Feb 2010 15:19:34 +0000	[thread overview]
Message-ID: <20100219151934.GA1445@csn.ul.ie> (raw)
In-Reply-To: <4B7E73BF.5030901@linux.vnet.ibm.com>

On Fri, Feb 19, 2010 at 12:19:27PM +0100, Christian Ehrhardt wrote:
> >>>>
> >>>> PAGES-FREED  fast path   slow path
> >>>> GOOD CASE      ~62       ~1.46
> >>>> BAD CASE       ~62       ~37
> >>> 5f8dcc21 introduced per migrate type pcp lists, is it possible that 
> >>> we  run in a scenario where try_to_free frees a lot of pages via, but 
> >>> of the  wrong migrate type?
> >>
> >> It's possible but the window is small. When a threshold is reached on the
> >> PCP lists, they get drained to the buddy lists and later picked up again
> >> by __rmqueue_fallback(). I had considered the possibility of pages of the
> >> wrong type being on the PCP lists which led to the patch "page-allocator:
> >> Fallback to other per-cpu lists when the target list is empty and 
> >> memory is
> >> low" but you reported it made no difference even when fallback was 
> >> allowed
> >> with high watermarks.
> >> [...]
> > 
> > Today I created rather huge debug logs - I'll spare everyone with too 
> > much detail.
> > Eventually it comes down to some kind of cat /proc/zoneinfo like output 
> > extended to list all things per migrate type too.
> > 
> >  From that I still think there should be plenty of pages to get the 
> > allocation through, as it was suggested by the high amount of pages 
> > freed by try_to_free.
> > > 
> [...]
> 
> > More about that tomorrow,
> 
> Well tomorrow is now, and I think I got some important new insights.
> 
> As mentioned before I realized that a second call still fails most of the
> time (>99%). Therefore I added a "debugme" parameter to get_page_from_freelist
> and buffered_rmqueue to see where the allocations exactly fails (patch
> attached).
> 
> Now with debugme active in a second call after it had progress&&!page in direct_reclaim I saw the following repeating pattern in most of the cases:
>    get_page_from_freelist - zone loop - current zone 'DMA'
>    get_page_from_freelist - watermark check due to !(alloc_flags & ALLOC_NO_WATERMARKS)
>    get_page_from_freelist - goto zone_full due to zone_reclaim_mode==0
>    get_page_from_freelist - return page '(null)'
> 
> Ok - now we at least exactly know why it gets no page.
> Remember there are plenty of pages like it was in my zoneinfo like report in the last mail.
> I didn't expect that, but actually watermarks are what stops the allocations here.

That is somewhat expected. We also don't want to go underneath them
beause that can lead to system deadlock.

> The zone_watermark_ok check reports that there are not enough pages for the current watermark and
> finally it ends with zone_reclaim_mode which is always zero on s390 as we are not CONFIG_NUMA.
> 
> Ok remember my scenario - several parallel iozone sequential read processes
> - theres not much allocation going on except for the page cache read ahead
> related to that read workload.
> The allocations for page cache seem to have no special watermarks selected
> via their GFP flags and therefore get stalled by congestion_wait - which in
> consequence of no available writes in flight consumes its full timeout.
>

Which I'd expect to some extent, but it still stuns me that e084b makes
a difference to any of this. The one-list-per-migratetype patch would
make some sense except restoring something similar to the old behaviour
didn't help either.

> As I see significant impacts to the iozone throughput and not only
> e.g. bad counters in direct_reclaim the congestion_wait stalling seems to
> be so often/long to stall the application I/O itself.
>
> That means from the time VFS starting a read ahead it seems to stall that
> page allocation long enough that the data is not ready when the application
> tries to read it, while it would be if the allocation would be fast enough.
>
> A simple test for this theory was to allow those failing allocations a
> second chance without watermark restrictions before putting them to sleep
> for such a long time.

> 
> Index: linux-2.6-git/mm/page_alloc.c
> ===================================================================
> --- linux-2.6-git.orig/mm/page_alloc.c  2010-02-19 09:53:14.000000000 +0100
> +++ linux-2.6-git/mm/page_alloc.c       2010-02-19 09:56:26.000000000 +0100
> @@ -1954,7 +1954,22 @@
> 
>         if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
>                 /* Wait for some write requests to complete then retry */
> -               congestion_wait(BLK_RW_ASYNC, HZ/50);
> +               /*
> +                * If it gets here try it without watermarks before going
> +                * to sleep.
> +                *
> +                * This will end up in alloc_high_priority and if that fails
> +                * once more direct_reclaim but this time without watermark
> +                * checks.
> +                *
> +                * FIXME: that is just for verification - a real fix needs to
> +                * ensure e.g. page cache allocations don't drain all pages
> +                * under watermark
> +                */
> +               if (!(alloc_flags & ALLOC_NO_WATERMARKS))
> +                       alloc_flags &= ALLOC_NO_WATERMARKS;
> +               else
> +                       congestion_wait(BLK_RW_ASYNC, HZ/50);
>                 goto rebalance;
>         }
> 
> This fixes all issues I have, but as stated in the FIXME it is unfortunately
> no fix for the real world.

It's possible to deadlock a system with this patch. It's also not acting
as you intended. I think you meant either |= or = there but anyway.

> Unfortunately even now knowing the place of the issue so well I don't see
> the connection to the commits e084b+5f8dcc21

Still a mystery.

>  - I couldn't find something but
> did they change the accounting somewhere or e.g. changed the timing/order
> of watermark updates and allocations?
>

Not that I can think of.

> Eventually it might come down to a discussion of allocation priorities and
> we might even keep them as is and accept this issue - I still would prefer
> a good second chance implementation, other page cache allocation flags or
> something else that explicitly solves this issue.
>

In that line, the patch that replaced congestion_wait() with a waitqueue
makes some sense.

> Mel's patch that replaces congestion_wait with a wait for the zone watermarks
> becoming available again is definitely a step in the right direction and
> should go into upstream and the long term support branches.

I'll need to do a number of tests before I can move that upstream but I
don't think it's a merge candidate. Unfortunately, I'll be offline for a
week starting tomorrow so I won't be able to do the testing.

When I get back, I'll revisit those patches with the view to pushing
them upstream. I hate to treat symptoms here without knowing the
underlying problem but this has been spinning in circles for ages with
little forward progress :(

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

next prev parent reply	other threads:[~2010-02-19 15:19 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-12-07 14:39 Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set" Christian Ehrhardt
2009-12-07 15:09 ` Mel Gorman
2009-12-08 17:59   ` Christian Ehrhardt
2009-12-10 14:36     ` Christian Ehrhardt
2009-12-11 11:20       ` Mel Gorman
2009-12-11 14:47         ` Christian Ehrhardt
2009-12-18 13:38           ` Christian Ehrhardt
2009-12-18 17:42             ` Mel Gorman
2010-01-14 12:30               ` Christian Ehrhardt
2010-01-19 11:33                 ` Mel Gorman
2010-02-05 15:51                   ` Christian Ehrhardt
2010-02-05 17:49                     ` Mel Gorman
2010-02-08 14:01                       ` Christian Ehrhardt
2010-02-08 15:21                         ` Mel Gorman
2010-02-08 16:55                           ` Mel Gorman
2010-02-09  6:23                           ` Christian Ehrhardt
2010-02-09 15:52                           ` Christian Ehrhardt
2010-02-09 17:57                             ` Mel Gorman
2010-02-11 16:11                               ` Christian Ehrhardt
2010-02-12 10:05                                 ` Nick Piggin
2010-02-15  6:59                                   ` Nick Piggin
2010-02-15 15:46                                   ` Christian Ehrhardt
2010-02-16 11:25                                     ` Mel Gorman
2010-02-16 16:47                                       ` Christian Ehrhardt
2010-02-17  9:55                                         ` Christian Ehrhardt
2010-02-17 10:03                                           ` Christian Ehrhardt
2010-02-18 11:43                                           ` Mel Gorman
2010-02-18 16:09                                             ` Christian Ehrhardt
2010-02-19 11:19                                               ` Christian Ehrhardt
2010-02-19 15:19                                                 ` Mel Gorman [this message]
2010-02-22 15:42                                                   ` Christian Ehrhardt
2010-02-25 15:13                                                     ` Christian Ehrhardt
2010-02-26 11:18                                                       ` Nick Piggin
2010-03-02  6:52                                                   ` Nick Piggin
2010-03-02 10:04                                                     ` Mel Gorman
2010-03-02 10:36                                                       ` Nick Piggin
2010-03-02 11:01                                                         ` Mel Gorman
2010-03-02 11:18                                                           ` Nick Piggin
2010-03-02 11:24                                                             ` Mel Gorman
2010-03-03  6:51                                                               ` Christian Ehrhardt
2010-02-08 15:02                       ` Christian Ehrhardt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100219151934.GA1445@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=SCHILLIG@de.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=christof.schmitt@de.ibm.com \
    --cc=ehrhardt@linux.vnet.ibm.com \
    --cc=epasch@de.ibm.com \
    --cc=gregkh@novell.com \
    --cc=hare@suse.de \
    --cc=heiko.carstens@de.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=schwidefsky@de.ibm.com \
    --cc=thoss@de.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.