Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mel@csn.ul.ie>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	riel@redhat.com, cl@linux-foundation.org, fengguang.wu@intel.com,
	linuxram@us.ibm.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3
Date: Tue, 16 Jun 2009 14:44:24 +0100	[thread overview]
Message-ID: <20090616134423.GD14241@csn.ul.ie> (raw)
In-Reply-To: <20090616202157.99AF.A69D9226@jp.fujitsu.com>

On Tue, Jun 16, 2009 at 09:57:53PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> 
> > > > > vmscan-zone_reclaim-use-may_swap.patch
> > > > > 
> > > > 
> > > > This is a tricky one. Kosaki, I think this patch is a little dangerous. With
> > > > this applied, pages get unmapped whether RECLAIM_SWAP is set or not. This
> > > > means that zone_reclaim() now has more work to do when it's enabled and it
> > > > incurs a number of minor faults for no reason as a result of trying to avoid
> > > > going off-node. I don't believe that is desirable because it would manifest
> > > > as high minor fault counts on NUMA and would be difficult to pin down why
> > > > that was happening.
> > > 
> > > (cc to hanns)
> > > 
> > > First, if this patch should be dropped, commit bd2f6199 
> > > (vmscan: respect higher order in zone_reclaim()) should be too. I think.
> > > the combination of lumply reclaim and !may_unmap is really ineffective.
> > 
> > Whether it's ineffective or not, it's what the user has asked for. They
> > want a high-order page found if possible within the limits of
> > zone_reclaim_mode. If it fails, they will enter full direct reclaim
> > later in the path and try again.
> > 
> > How effective lumpy reclaim is in this case really depends on what the
> > system has been used for in the past. It's impossible to know in advance
> > how effective lumpy reclaim will be in every case.
> 
> In general, performance discussion need to concern typical use-case.

What typical use case? zone_reclaim logic is enabled by default on NUMA
machines that report a large latency for remote node access. I do not
believe we can draw conclusions on what a typical use case is just
because the machine happens to be a particular NUMA type.

And this isn't a performance discussion as such either. The patch isn't
going to improve performance. I believe it'll have the opposite effect.

> Almost zone-reclaim enabled machine is not file server. Thus unmapped file
> page are not so high ratio.
> 
> I have pessimistic suspection of successful rate of lumpy reclaim in those server.

While I agree on that particular case, I don't think it justifies the patch
to unmap pages so easily just because zone_reclaim() is enabled.

> Of cource, it don't make allocation failure, it only make full direct reclaim.
> 
> but I don't hope strange and unnecessary lru shuffling. Also I don't think
> it makes performance improvement.
> 

I'm all for avoiding unnecessary LRU shuffling

> > > it might cause isolate neighbor pages and give up unmapping and pages put
> > > back tail of lru.
> > > it mean to shuffle lru list.
> > > 
> > > I don't think it is desirable.
> > > 
> > 
> > With Kamezawa Hiroyuki's patch that avoids unnecessary shuffles of the LRU
> > list due to lumpy reclaim, the situation might be better?
> 
> I still my_unmap is better choice, but if we use it, I agree with adding
> may_unmap and page_mapped() condition to isolate_pages_global() is better and
> good choice.
> 
> nice idea.
> 

Ok, I agree that lumpy reclaim should be checking may_unmap and page_mapped()
but I still don't think that means that reclaim_mode of 1 allows zone_reclaim()
to unmap pages.

> > > Second, we did learned that "mapped or not mapped" is not appropriate
> > > reclaim boosting between split-lru discussion.
> > > So, I think to make consistent is better. if no considerness of may_unmap
> > > makes serious performance issue, we need to fix try_to_free_pages() path too.
> > > 
> > 
> > I don't understand this paragraph.
> > 
> > If zone_reclaim_mode is set to 1, I don't believe the expected behaviour is
> > for pages to be unmapped from page tables. I think it will lead to mysterious
> > bug reports of higher numbers of minor faults when running applications on
> > NUMA machines in some situations.
> 
> AFAIK, 99.9% user read documentation, not actual code. and documentatin
> didn't describe so.
> I don't think this is expected behavior.
> 
> That's my point.
> 

Which part of the documentation for zone_reclaim_mode == 1 implies that
pages will be unmapped from page tables? If the documentation is misleading,
I would prefer for it to be fixed up than pages be unmapped by default
causing performance regressions due to increased minor faults on NUMA.

> 
> > > Third, if we consider MPI program on NUMA, each process only access
> > > a part of array data frequently and never touch rest part of array.
> > > So, AFAIK "rarely, but access" is rarely, no freqent access is not major performance source.
> > > 
> > > I have one question. your "difficultness of pinning down" is major issue?
> > > 
> > 
> > Yes. If an administrator notices that minor fault rates are higher than
> > expected, it's going to be very difficult for them to understand why
> > it is happening and why setting reclaim_mode to 0 apparently fixes the
> > problem. oprofile for example might just show that a lot of time is being
> > spent in the fault paths but not explain why.
> 
> I don't understand this paragraph a bit. I feel this is only theorical issue.
> successing of try_to_unmap_one() mean the pte don't have accessed bit.
> it's obvious sign to be able to unmap pte.
> 

Ok, even if we are depending on the accessed bit, we are making assumptions
of the frequency of the bit being cleared and how often zone_reclaim()
is called as to whether it will cause more minor faults or not.

Yes, what I'm saying about minor faults being potentially increased is a
theoretical issue and I have no proof but it feels like a real
possibility. I would like to be convinced that setting may_unmap to 1 by
default when zone_reclaim == 1 is not going to result in this problem
occuring or at least to be convinced that it will not happen very often.

I would be much happier if setting may_unmap and may_swap only happened when
RECLAIM_SWAP was enabled.

> if we convice MPI program, long time untouched pages often mean never touched again.
> Am I missing anything? or you don't talk about non-hpc workload?
> 

I don't have a particular workload in mind to be perfectly honest. I'm just not
convinced of the wisdom of trying to unmap pages by default in zone_reclaim()
just because the NUMA distances happen to be large.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

WARNING: multiple messages have this Message-ID (diff)

From: Mel Gorman <mel@csn.ul.ie>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	riel@redhat.com, cl@linux-foundation.org, fengguang.wu@intel.com,
	linuxram@us.ibm.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3
Date: Tue, 16 Jun 2009 14:44:24 +0100	[thread overview]
Message-ID: <20090616134423.GD14241@csn.ul.ie> (raw)
In-Reply-To: <20090616202157.99AF.A69D9226@jp.fujitsu.com>

On Tue, Jun 16, 2009 at 09:57:53PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> 
> > > > > vmscan-zone_reclaim-use-may_swap.patch
> > > > > 
> > > > 
> > > > This is a tricky one. Kosaki, I think this patch is a little dangerous. With
> > > > this applied, pages get unmapped whether RECLAIM_SWAP is set or not. This
> > > > means that zone_reclaim() now has more work to do when it's enabled and it
> > > > incurs a number of minor faults for no reason as a result of trying to avoid
> > > > going off-node. I don't believe that is desirable because it would manifest
> > > > as high minor fault counts on NUMA and would be difficult to pin down why
> > > > that was happening.
> > > 
> > > (cc to hanns)
> > > 
> > > First, if this patch should be dropped, commit bd2f6199 
> > > (vmscan: respect higher order in zone_reclaim()) should be too. I think.
> > > the combination of lumply reclaim and !may_unmap is really ineffective.
> > 
> > Whether it's ineffective or not, it's what the user has asked for. They
> > want a high-order page found if possible within the limits of
> > zone_reclaim_mode. If it fails, they will enter full direct reclaim
> > later in the path and try again.
> > 
> > How effective lumpy reclaim is in this case really depends on what the
> > system has been used for in the past. It's impossible to know in advance
> > how effective lumpy reclaim will be in every case.
> 
> In general, performance discussion need to concern typical use-case.

What typical use case? zone_reclaim logic is enabled by default on NUMA
machines that report a large latency for remote node access. I do not
believe we can draw conclusions on what a typical use case is just
because the machine happens to be a particular NUMA type.

And this isn't a performance discussion as such either. The patch isn't
going to improve performance. I believe it'll have the opposite effect.

> Almost zone-reclaim enabled machine is not file server. Thus unmapped file
> page are not so high ratio.
> 
> I have pessimistic suspection of successful rate of lumpy reclaim in those server.

While I agree on that particular case, I don't think it justifies the patch
to unmap pages so easily just because zone_reclaim() is enabled.

> Of cource, it don't make allocation failure, it only make full direct reclaim.
> 
> but I don't hope strange and unnecessary lru shuffling. Also I don't think
> it makes performance improvement.
> 

I'm all for avoiding unnecessary LRU shuffling

> > > it might cause isolate neighbor pages and give up unmapping and pages put
> > > back tail of lru.
> > > it mean to shuffle lru list.
> > > 
> > > I don't think it is desirable.
> > > 
> > 
> > With Kamezawa Hiroyuki's patch that avoids unnecessary shuffles of the LRU
> > list due to lumpy reclaim, the situation might be better?
> 
> I still my_unmap is better choice, but if we use it, I agree with adding
> may_unmap and page_mapped() condition to isolate_pages_global() is better and
> good choice.
> 
> nice idea.
> 

Ok, I agree that lumpy reclaim should be checking may_unmap and page_mapped()
but I still don't think that means that reclaim_mode of 1 allows zone_reclaim()
to unmap pages.

> > > Second, we did learned that "mapped or not mapped" is not appropriate
> > > reclaim boosting between split-lru discussion.
> > > So, I think to make consistent is better. if no considerness of may_unmap
> > > makes serious performance issue, we need to fix try_to_free_pages() path too.
> > > 
> > 
> > I don't understand this paragraph.
> > 
> > If zone_reclaim_mode is set to 1, I don't believe the expected behaviour is
> > for pages to be unmapped from page tables. I think it will lead to mysterious
> > bug reports of higher numbers of minor faults when running applications on
> > NUMA machines in some situations.
> 
> AFAIK, 99.9% user read documentation, not actual code. and documentatin
> didn't describe so.
> I don't think this is expected behavior.
> 
> That's my point.
> 

Which part of the documentation for zone_reclaim_mode == 1 implies that
pages will be unmapped from page tables? If the documentation is misleading,
I would prefer for it to be fixed up than pages be unmapped by default
causing performance regressions due to increased minor faults on NUMA.

> 
> > > Third, if we consider MPI program on NUMA, each process only access
> > > a part of array data frequently and never touch rest part of array.
> > > So, AFAIK "rarely, but access" is rarely, no freqent access is not major performance source.
> > > 
> > > I have one question. your "difficultness of pinning down" is major issue?
> > > 
> > 
> > Yes. If an administrator notices that minor fault rates are higher than
> > expected, it's going to be very difficult for them to understand why
> > it is happening and why setting reclaim_mode to 0 apparently fixes the
> > problem. oprofile for example might just show that a lot of time is being
> > spent in the fault paths but not explain why.
> 
> I don't understand this paragraph a bit. I feel this is only theorical issue.
> successing of try_to_unmap_one() mean the pte don't have accessed bit.
> it's obvious sign to be able to unmap pte.
> 

Ok, even if we are depending on the accessed bit, we are making assumptions
of the frequency of the bit being cleared and how often zone_reclaim()
is called as to whether it will cause more minor faults or not.

Yes, what I'm saying about minor faults being potentially increased is a
theoretical issue and I have no proof but it feels like a real
possibility. I would like to be convinced that setting may_unmap to 1 by
default when zone_reclaim == 1 is not going to result in this problem
occuring or at least to be convinced that it will not happen very often.

I would be much happier if setting may_unmap and may_swap only happened when
RECLAIM_SWAP was enabled.

> if we convice MPI program, long time untouched pages often mean never touched again.
> Am I missing anything? or you don't talk about non-hpc workload?
> 

I don't have a particular workload in mind to be perfectly honest. I'm just not
convinced of the wisdom of trying to unmap pages by default in zone_reclaim()
just because the NUMA distances happen to be large.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2009-06-16 13:44 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-11 10:47 [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3 Mel Gorman
2009-06-11 10:47 ` Mel Gorman
2009-06-11 10:47 ` [PATCH 1/3] Properly account for the number of page cache pages zone_reclaim() can reclaim Mel Gorman
2009-06-11 10:47   ` Mel Gorman
2009-06-11 11:37   ` KOSAKI Motohiro
2009-06-11 11:37     ` KOSAKI Motohiro
2009-06-12 10:17     ` Mel Gorman
2009-06-12 10:17       ` Mel Gorman
2009-06-15  4:51       ` KOSAKI Motohiro
2009-06-15  4:51         ` KOSAKI Motohiro
2009-06-15 10:05         ` Mel Gorman
2009-06-15 10:05           ` Mel Gorman
2009-06-11 10:47 ` [PATCH 2/3] Do not unconditionally treat zones that fail zone_reclaim() as full Mel Gorman
2009-06-11 10:47   ` Mel Gorman
2009-06-11 13:48   ` Christoph Lameter
2009-06-11 13:48     ` Christoph Lameter
2009-06-12 10:36     ` Mel Gorman
2009-06-12 10:36       ` Mel Gorman
2009-06-12 15:44       ` Andrew Morton
2009-06-12 15:44         ` Andrew Morton
2009-06-15 10:28         ` Mel Gorman
2009-06-15 10:28           ` Mel Gorman
2009-06-15 15:58           ` Andrew Morton
2009-06-15 15:58             ` Andrew Morton
2009-06-11 10:47 ` [PATCH 3/3] Count the number of times zone_reclaim() scans and fails Mel Gorman
2009-06-11 10:47   ` Mel Gorman
2009-06-11 11:33   ` KOSAKI Motohiro
2009-06-11 11:33     ` KOSAKI Motohiro
2009-06-15 21:19   ` Andrew Morton
2009-06-15 21:19     ` Andrew Morton
2009-06-16  9:05     ` Mel Gorman
2009-06-16  9:05       ` Mel Gorman
2009-06-11 23:30 ` [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3 Andrew Morton
2009-06-11 23:30   ` Andrew Morton
2009-06-12 11:04   ` Mel Gorman
2009-06-12 11:04     ` Mel Gorman
2009-06-12 16:08     ` Andrew Morton
2009-06-12 16:08       ` Andrew Morton
2009-06-15  9:42     ` KOSAKI Motohiro
2009-06-15  9:42       ` KOSAKI Motohiro
2009-06-15 10:56       ` Mel Gorman
2009-06-15 10:56         ` Mel Gorman
2009-06-15 15:01         ` Christoph Lameter
2009-06-15 15:01           ` Christoph Lameter
2009-06-15 15:25           ` Mel Gorman
2009-06-15 15:25             ` Mel Gorman
2009-06-16 12:08             ` KOSAKI Motohiro
2009-06-16 12:08               ` KOSAKI Motohiro
2009-06-16 12:20               ` Mel Gorman
2009-06-16 12:20                 ` Mel Gorman
2009-06-16 12:30                 ` KOSAKI Motohiro
2009-06-16 12:30                   ` KOSAKI Motohiro
2009-06-16 12:57         ` KOSAKI Motohiro
2009-06-16 12:57           ` KOSAKI Motohiro
2009-06-16 13:44           ` Mel Gorman [this message]
2009-06-16 13:44             ` Mel Gorman
2009-06-16 14:51             ` Christoph Lameter
2009-06-16 14:51               ` Christoph Lameter
2009-06-17 10:06               ` KOSAKI Motohiro
2009-06-17 10:06                 ` KOSAKI Motohiro
2009-06-17 12:03                 ` Mel Gorman
2009-06-17 12:03                   ` Mel Gorman
2009-06-17 18:48                 ` Christoph Lameter
2009-06-17 18:48                   ` Christoph Lameter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090616134423.GD14241@csn.ul.ie \
    --to=mel@csn.ul.ie \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux-foundation.org \
    --cc=fengguang.wu@intel.com \
    --cc=hannes@cmpxchg.org \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxram@us.ibm.com \
    --cc=riel@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.