From: Minchan Kim <minchan@kernel.org>
To: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Richard Davies <richard@arachsys.com>,
Shaohua Li <shli@kernel.org>, Rik van Riel <riel@redhat.com>,
Avi Kivity <avi@redhat.com>, QEMU-devel <qemu-devel@nongnu.org>,
KVM <kvm@vger.kernel.org>, Linux-MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 8/9] mm: compaction: Cache if a pageblock was scanned and no pages were isolated
Date: Wed, 26 Sep 2012 09:49:30 +0900 [thread overview]
Message-ID: <20120926004930.GA10229@bbox> (raw)
In-Reply-To: <20120925091207.GD11266@suse.de>
On Tue, Sep 25, 2012 at 10:12:07AM +0100, Mel Gorman wrote:
> On Mon, Sep 24, 2012 at 02:26:44PM -0700, Andrew Morton wrote:
> > On Mon, 24 Sep 2012 10:39:38 +0100
> > Mel Gorman <mgorman@suse.de> wrote:
> >
> > > On Fri, Sep 21, 2012 at 02:36:56PM -0700, Andrew Morton wrote:
> > >
> > > > Also, what has to be done to avoid the polling altogether? eg/ie, zap
> > > > a pageblock's PB_migrate_skip synchronously, when something was done to
> > > > that pageblock which justifies repolling it?
> > > >
> > >
> > > The "something" event you are looking for is pages being freed or
> > > allocated in the page allocator. A movable page being allocated in a block
> > > or a page being freed should clear the PB_migrate_skip bit if it's set.
> > > Unfortunately this would impact the fast path of the alloc and free paths
> > > of the page allocator. I felt that that was too high a price to pay.
> >
> > We already do a similar thing in the page allocator: clearing of
> > ->all_unreclaimable and ->pages_scanned.
>
> That is true but that is a simple write (shared cache line but still) to
> a struct zone. Worse, now that you point it out, that's pretty stupid. It
> should be checking if the value is non-zero before writing to it to avoid
> a cache line bounce.
>
> Clearing the PG_migrate_skip in this path to avoid the need to ever poll
> is not as cheap as it needs to be:
>
> set_pageblock_skip
> -> set_pageblock_flags_group
> -> page_zone
> -> page_to_pfn
> -> get_pageblock_bitmap
> -> pfn_to_bitidx
> -> __set_bit
>
> > But that isn't on the "fast
> > path" really - it happens once per pcp unload.
>
> That's still an important enough path that I'm wary of making it fatter
> and that only covers the free path. To avoid the polling, the allocation
> side needs to be handled too. It could be shoved down into rmqueue() to
> put it into a slightly colder path but still, it's a price to pay to keep
> compaction happy.
>
> > Can we do something
> > like that? Drop some hint into the zone without having to visit each
> > page?
> >
>
> Not without incurring a cost, but yes, it is possible to give a hint on when
> PG_migrate_skip should be cleared and move away from that time-based hammer.
>
> First, we'd introduce a variant of get_pageblock_migratetype() that returns
> all the bits for the pageblock flags and then helpers to extract either the
> migratetype or the PG_migrate_skip. We already are incurring the cost of
> get_pageblock_migratetype() so it will not be much more expensive than what
> is already there. If there is an allocation or free within a pageblock that
> has the PG_migrate_skip bit set then we increment a counter. When the counter
> reaches some to-be-decided "threshold" then compaction may clear all the
> bits. This would match the criteria of the clearing being based on activity.
>
> There are four potential problems with this
>
> 1. The logic to retrieve all the bits and split them up will be a little
> convoluted but maybe it would not be that bad.
>
> 2. The counter is a shared-writable cache line but obviously it could
> be moved to vmstat and incremented with inc_zone_page_state to offset
> the cost a little.
>
> 3. The biggest weakness is that there is no way to know if the
> counter is incremented based on activity in a small subset of blocks.
>
> 4. What should the threshold be?
>
> The first problem is minor but the other three are potentially a mess.
> Adding another vmstat counter is bad enough in itself but if the counter
> is incremented based on a small subset of pageblocks, the hint is
> potentially useless.
Another idea is to add two bits (PG_check_migrate/PG_check_free) to
pageblock_flags_group. In the allocation path we set PG_check_migrate
for the pageblock, and in the free path we set PG_check_free; both are
cleared by compaction's scan, as happens now. That would let us discard
problems 3 and 4, at least.

Another idea is to cure it by fixing the fundamental problem: make the
zone's locks more fine-grained. Systems keep growing bigger memory, but
our zone locking doesn't scale with it. Reports of lru_lock and
zone->lock contention are no longer rare, so I think it's a good time
to take the next step. How about defining a struct sub_zone per 2G or
4G? A zone would then contain several sub_zones according to its size;
each sub_zone takes over the current zone's role and the zone becomes
just a container of sub_zones. Of course it's not easy to implement,
but I think someday we should go that way. Is it really overkill?
>
> However, does this match what you have in mind or am I over-complicating
> things?
>
> > > > >
> > > > > ...
> > > > >
> > > > > +static void reset_isolation_suitable(struct zone *zone)
> > > > > +{
> > > > > + unsigned long start_pfn = zone->zone_start_pfn;
> > > > > + unsigned long end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > > > > + unsigned long pfn;
> > > > > +
> > > > > + /*
> > > > > + * Do not reset more than once every five seconds. If allocations are
> > > > > + * failing sufficiently quickly to allow this to happen then continually
> > > > > + * scanning for compaction is not going to help. The choice of five
> > > > > + * seconds is arbitrary but will mitigate excessive scanning.
> > > > > + */
> > > > > + if (time_before(jiffies, zone->compact_blockskip_expire))
> > > > > + return;
> > > > > + zone->compact_blockskip_expire = jiffies + (HZ * 5);
> > > > > +
> > > > > + /* Walk the zone and mark every pageblock as suitable for isolation */
> > > > > + for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
> > > > > + struct page *page;
> > > > > + if (!pfn_valid(pfn))
> > > > > + continue;
> > > > > +
> > > > > + page = pfn_to_page(pfn);
> > > > > + if (zone != page_zone(page))
> > > > > + continue;
> > > > > +
> > > > > + clear_pageblock_skip(page);
> > > > > + }
> > > >
> > > > What's the worst-case loop count here?
> > > >
> > >
> > > zone->spanned_pages >> pageblock_order
> >
> > What's the worst-case value of (zone->spanned_pages >> pageblock_order) :)
>
> Let's take an unlikely case - 128G single-node machine. That loop count
> on x86-64 would be 65536. It'll be fast enough, particularly in this
> path.
>
> --
> Mel Gorman
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
Kind regards,
Minchan Kim