linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@suse.de>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Ingo Molnar <mingo@kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter
Date: Thu, 4 Jul 2013 10:23:56 +0100	[thread overview]
Message-ID: <20130704092356.GK1875@suse.de> (raw)
In-Reply-To: <20130703215654.GN17812@cmpxchg.org>

On Wed, Jul 03, 2013 at 05:56:54PM -0400, Johannes Weiner wrote:
> On Wed, Jul 03, 2013 at 03:21:34PM +0100, Mel Gorman wrote:
> > Ideally it would be possible to distinguish between NUMA hinting faults
> > that are private to a task and those that are shared. This would require
> > that the last task that accessed a page for a hinting fault would be
> > recorded which would increase the size of struct page. Instead this patch
> > approximates private pages by assuming that faults that pass the two-stage
> > filter are private pages and all others are shared. The preferred NUMA
> > node is then selected based on where the maximum number of approximately
> > private faults were measured. Shared faults are not taken into
> > consideration for a few reasons.
> 
> Ingo had a patch that would just encode a few bits of the PID along
> with the last_nid (last_cpu in his case) member of struct page.  No
> extra space required and should be accurate enough.
> 

Yes, I'm aware of it. I noted in the changelog that ideally we'd record
the task both to remind myself and so that the patch that introduces it
could refer to this changelog so there is some sort of logical progression
for reviewers.

I was not keen on the use of last_cpu because I felt there was an implicit
assumption that scanning would always be fast enough to record hinting
faults before a task got moved to another CPU for any reason. I feared this
would be worse as memory and task sizes increased. That's why I stayed
with tracking the nid for the two-stage filter until it could be proven
it was insufficient for some reason.

The lack of anything resembling pid tracking now is that the series is
already a bit of a mouthful and I thought the other parts were more
important for now.

> Otherwise this is blind to sharedness within the node the task is
> currently running on, right?
> 

Yes, it is.

> > First, if there are many tasks sharing the page then they'll all move
> > towards the same node. The node will be compute overloaded and then
> > scheduled away later only to bounce back again. Alternatively the shared
> > tasks would just bounce around nodes because the fault information is
> > effectively noise. Either way accounting for shared faults the same as
> > private faults may result in lower performance overall.
> 
> When the node with many shared pages is compute overloaded then there
> is arguably not an optimal node for the tasks and moving them off is
> inevitable. 

Yes. If such an event occurs then the ideal is that the task interleaves
between a subset of nodes. The situation could be partially detected by
tracking if the number of historical faults is approximately larger than
the preferred node and then interleave between the top N nodes most faulted
nodes until the working set fits. Starting the interleave should just be
a matter of coding. The difficulty is correctly backing off that if there
is a phase change.

> However, the node with the most page accesses, private or
> shared, is still the preferred node from a memory stand point.
> Compute load being equal, the task should go to the node with 2GB of
> shared memory and not to the one with 2 private pages.
> 

Agreed. The level of shared vs private needs to be detected. The problem
here is that detecting private dominated workloads is not straight-forward,
particularly as the scan rate slows as we've already discussed.

> If the load balancer moves the task off due to cpu load reasons,
> wouldn't the settle count mechanism prevent it from bouncing back?
> 
> Likewise, if the cpu load situation changes, the balancer could move
> the task back to its truly preferred node.
> 
> > The second reason is based on a hypothetical workload that has a small
> > number of very important, heavily accessed private pages but a large shared
> > array. The shared array would dominate the number of faults and be selected
> > as a preferred node even though it's the wrong decision.
> 
> That's a scan granularity problem and I can't see how you solve it
> with ignoring the shared pages. 

I acknowledge it's a problem and basically I'm making a big assumption
that private-dominated workloads are going to be the common case. Threaded
application on UMA with heavy amounts of shared data (within cache lines)
already suck in terms of performance so I'm expecting programmers already
try and avoid this sort of sharing. Obviously we are at a page granularity
here so the assumption will depend entirely on alignments and buffer sizes
so it might still fall apart.

I think that dealing with this specific problem is a series all on its
own and treating it on its own in isolation would be best.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2013-07-04  9:24 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-03 14:21 [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
2013-07-03 14:21 ` [PATCH 01/13] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
2013-07-03 14:21 ` [PATCH 02/13] sched: Track NUMA hinting faults on per-node basis Mel Gorman
2013-07-03 14:21 ` [PATCH 03/13] sched: Select a preferred node with the most numa hinting faults Mel Gorman
2013-07-03 14:21 ` [PATCH 04/13] sched: Update NUMA hinting faults once per scan Mel Gorman
2013-07-03 14:21 ` [PATCH 05/13] sched: Favour moving tasks towards the preferred node Mel Gorman
2013-07-03 14:21 ` [PATCH 06/13] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
2013-07-04 12:26   ` Srikar Dronamraju
2013-07-04 13:29     ` Mel Gorman
2013-07-03 14:21 ` [PATCH 07/13] sched: Split accounting of NUMA hinting faults that pass two-stage filter Mel Gorman
2013-07-03 21:56   ` Johannes Weiner
2013-07-04  9:23     ` Mel Gorman [this message]
2013-07-04 14:24       ` Rik van Riel
2013-07-04 19:36       ` Johannes Weiner
2013-07-05  9:41         ` Mel Gorman
2013-07-05 10:48         ` Peter Zijlstra
2013-07-03 14:21 ` [PATCH 08/13] sched: Increase NUMA PTE scanning when a new preferred node is selected Mel Gorman
2013-07-03 14:21 ` [PATCH 09/13] sched: Favour moving tasks towards nodes that incurred more faults Mel Gorman
2013-07-03 18:27   ` Peter Zijlstra
2013-07-04  9:25     ` Mel Gorman
2013-07-03 14:21 ` [PATCH 10/13] sched: Set the scan rate proportional to the size of the task being scanned Mel Gorman
2013-07-03 14:21 ` [PATCH 11/13] sched: Check current->mm before allocating NUMA faults Mel Gorman
2013-07-03 15:33   ` Mel Gorman
2013-07-04 12:48   ` Srikar Dronamraju
2013-07-05 10:07     ` Mel Gorman
2013-07-03 14:21 ` [PATCH 12/13] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
2013-07-03 18:35   ` Peter Zijlstra
2013-07-04  9:27     ` Mel Gorman
2013-07-03 18:41   ` Peter Zijlstra
2013-07-04  9:32     ` Mel Gorman
2013-07-03 18:42   ` Peter Zijlstra
2013-07-03 14:21 ` [PATCH 13/13] sched: Account for the number of preferred tasks running on a node when selecting a preferred node Mel Gorman
2013-07-03 18:32   ` Peter Zijlstra
2013-07-04  9:37     ` Mel Gorman
2013-07-04 13:07       ` Srikar Dronamraju
2013-07-04 13:54         ` Mel Gorman
2013-07-04 14:06           ` Peter Zijlstra
2013-07-04 14:40             ` Mel Gorman
2013-07-03 16:19 ` [PATCH 0/13] Basic scheduler support for automatic NUMA balancing V2 Mel Gorman
2013-07-03 16:26   ` Mel Gorman
2013-07-04 18:02 ` [PATCH RFC WIP] Process weights based scheduling for better consolidation Srikar Dronamraju
2013-07-05 10:16   ` Peter Zijlstra
2013-07-05 12:49     ` Srikar Dronamraju

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130704092356.GK1875@suse.de \
    --to=mgorman@suse.de \
    --cc=a.p.zijlstra@chello.nl \
    --cc=aarcange@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mingo@kernel.org \
    --cc=srikar@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).