Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node

From: Mel Gorman <mel@csn.ul.ie>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: LKML <linux-kernel@vger.kernel.org>,
	linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Balbir Singh <balbir@linux.vnet.ibm.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: [PATCH 1/4] vmstat: remove zone->lock from walk_zones_in_node
Date: Tue, 5 Jan 2010 10:18:21 +0000
Message-ID: <20100105101821.GA28975@csn.ul.ie>
In-Reply-To: <20100105105328.96CE.A69D9226@jp.fujitsu.com>

On Tue, Jan 05, 2010 at 11:04:58AM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > On Mon, Dec 28, 2009 at 04:47:22PM +0900, KOSAKI Motohiro wrote:
> > > zone->lock is one of the performance-critical locks, so it shouldn't
> > > be held for a long time. Currently there are four walk_zones_in_node()
> > > users, and almost none of them need to hold zone->lock.
> > > 
> > > Thus, this patch moves the locking responsibility from walk_zones_in_node()
> > > to its callbacks. It also removes the unnecessary taking of zone->lock.
> > > 
> > > Cc: Mel Gorman <mel@csn.ul.ie>
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > ---
> > >  mm/vmstat.c |    8 +++++---
> > >  1 files changed, 5 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > > index 6051fba..a5d45bc 100644
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -418,15 +418,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
> > >  {
> > >  	struct zone *zone;
> > >  	struct zone *node_zones = pgdat->node_zones;
> > > -	unsigned long flags;
> > >  
> > >  	for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
> > >  		if (!populated_zone(zone))
> > >  			continue;
> > >  
> > > -		spin_lock_irqsave(&zone->lock, flags);
> > >  		print(m, pgdat, zone);
> > > -		spin_unlock_irqrestore(&zone->lock, flags);
> > >  	}
> > >  }
> > >  
> > > @@ -455,6 +452,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > >  					pg_data_t *pgdat, struct zone *zone)
> > >  {
> > >  	int order, mtype;
> > > +	unsigned long flags;
> > >  
> > >  	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > >  		seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > @@ -468,8 +466,11 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > >  
> > >  			area = &(zone->free_area[order]);
> > >  
> > > +			spin_lock_irqsave(&zone->lock, flags);
> > >  			list_for_each(curr, &area->free_list[mtype])
> > >  				freecount++;
> > > +			spin_unlock_irqrestore(&zone->lock, flags);
> > > +
> > 
> > It's not clear why you feel this information requires the lock and the
> > others do not.
> 
> I think the list operation above requires the lock to prevent a NULL pointer access, but in the
> other parts the lock doesn't protect anything, because memory hotplug changes them without the zone lock.
> 

True. Add a comment explaining that. I considered list_for_each_safe()
but it wouldn't work in all cases.
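
To illustrate why the walk wants the lock, here is a rough user-space
sketch, not kernel code. It assumes a pthread mutex standing in for
zone->lock and a hand-rolled circular list standing in for list_head;
free_list_sketch and the free_list_*() helpers are hypothetical names
made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modelled on list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical stand-in for one free list plus the lock guarding it. */
struct free_list_sketch {
	pthread_mutex_t lock;
	struct list_node head;
};

static void free_list_init(struct free_list_sketch *fl)
{
	pthread_mutex_init(&fl->lock, NULL);
	fl->head.next = fl->head.prev = &fl->head;
}

/* Writers link and unlink entries only while holding the lock. */
static void free_list_add(struct free_list_sketch *fl, struct list_node *n)
{
	pthread_mutex_lock(&fl->lock);
	n->next = fl->head.next;
	n->prev = &fl->head;
	fl->head.next->prev = n;
	fl->head.next = n;
	pthread_mutex_unlock(&fl->lock);
}

static void free_list_del(struct free_list_sketch *fl, struct list_node *n)
{
	pthread_mutex_lock(&fl->lock);
	assert(n->next != NULL && n->prev != NULL); /* must be linked */
	n->prev->next = n->next;
	n->next->prev = n->prev;
	/* A lockless walker standing on n would now follow a NULL pointer. */
	n->next = n->prev = NULL;
	pthread_mutex_unlock(&fl->lock);
}

/* Reader: hold the lock for the whole walk so nothing is unlinked under us. */
static unsigned long free_list_count(struct free_list_sketch *fl)
{
	unsigned long count = 0;
	struct list_node *curr;

	pthread_mutex_lock(&fl->lock);
	for (curr = fl->head.next; curr != &fl->head; curr = curr->next)
		count++;
	pthread_mutex_unlock(&fl->lock);
	return count;
}
```

The point of the sketch is only that the counter and the unlink path
serialise on the same lock; a "safe" iterator that caches the next
pointer does not help when another thread unlinks the cached entry.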

> 
> > For the most part, I agree that the accuracy of the information is
> > not critical. Assuming partial writes of the data are not a problem,
> > the information is not going to go so badly out of sync that it would be
> > noticeable, even if the information is out of date within the zone.
> > 
> > However, inconsistent reads in zoneinfo really could be a problem. I am
> > concerned that under heavy allocation load "pages free" would
> > not match "nr_free_pages", for example. Another example is that adding all the
> > counters together may or may not equal the total number of pages in the zone.
> > 
> > Let's say for example there was a subtle bug related to __inc_zone_page_state()
> > that meant that counters were getting slightly out of sync but it was very
> > marginal and/or difficult to reproduce. With this patch applied, we could
> > not be absolutely sure the counters were correct because it could always have
> > raced with someone holding the zone->lock.
> > 
> > Minimally, I think zoneinfo should be taking the zone lock.
> 
> Thanks for the many comments.
> Hmm, I'd like to clarify your point. My point is that memory hotplug doesn't take the zone lock,
> so the zone lock doesn't protect anything. So we have two options:
> 
> 1) Add the zone lock to memory hotplug
> 2) Remove the zone lock from zoneinfo
> 
> I thought (2) was sufficient. Do you mean you prefer (1)? Or do you prefer to ignore a rare event
> (of course, memory hotplug is rare)?
> 

I think (2) will make zoneinfo harder to use for examining all the counters
properly as I explained above. I haven't looked at memory-hotplug in a
while but IIRC, fields like present_pages should be protected by a lock on
the pgdat and a seq lock on the zone. If this is not true at the moment,
it is a problem.
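
For reference, the general seqlock read pattern looks roughly like the
user-space sketch below. This is an illustration with C11 atomics, not
the kernel's seqlock API; zone_span_sketch, its fields and the span_*()
helpers are hypothetical, and it assumes a single writer (in the kernel
the writer side would additionally be serialised by a lock).

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical zone fields guarded by a sequence counter. */
struct zone_span_sketch {
	atomic_uint seq;            /* even: stable, odd: update in progress */
	atomic_ulong start_pfn;
	atomic_ulong spanned_pages;
};

static void span_init(struct zone_span_sketch *z)
{
	atomic_init(&z->seq, 0);
	atomic_init(&z->start_pfn, 0);
	atomic_init(&z->spanned_pages, 0);
}

/* Writer (assumed to be the only one): bump seq to odd, update, bump to even. */
static void span_write(struct zone_span_sketch *z,
		       unsigned long start, unsigned long pages)
{
	unsigned int s = atomic_load_explicit(&z->seq, memory_order_relaxed);

	assert((s & 1) == 0); /* no update may already be in flight */
	atomic_store_explicit(&z->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&z->start_pfn, start, memory_order_relaxed);
	atomic_store_explicit(&z->spanned_pages, pages, memory_order_relaxed);
	atomic_store_explicit(&z->seq, s + 2, memory_order_release);
}

/* Reader: takes no lock, retries if a write overlapped the read. */
static void span_read(struct zone_span_sketch *z,
		      unsigned long *start, unsigned long *pages)
{
	unsigned int s;

	for (;;) {
		s = atomic_load_explicit(&z->seq, memory_order_acquire);
		if (s & 1)
			continue; /* writer active, try again */
		*start = atomic_load_explicit(&z->start_pfn,
					      memory_order_relaxed);
		*pages = atomic_load_explicit(&z->spanned_pages,
					      memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&z->seq, memory_order_relaxed) == s)
			return; /* both fields came from the same update */
	}
}
```

A reader built this way never blocks the writer, which is why the
pattern suits fields that change rarely, as with hotplug, but are
read often.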

For the free lists, memory hotplug should be taking the zone->lock properly as
the final stage of onlining memory is to walk the sections being hot-added,
init the memmap and then __free_page() each page individually - i.e. the
normal free path.

So, if memory hotplug is not protected by proper locking, it's not intentional.

> 
> > Secondly, has increased zone->lock contention due to reading /proc
> > really been shown to be a problem? The only situation that I can think
> > of is a badly-written monitor program that is copying all of /proc
> > instead of the files of interest. If a monitor program is doing
> > something like that, it's likely to be incurring performance problems in
> > a large number of different areas. If that is not the trigger case, what
> > is?
> 
> Ah, no. I haven't observed such an issue. My point is to remove a meaningless lock.
> 

Then I believe the zone->lock should be preserved so that all entries in
/proc/zoneinfo are consistent.
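
As a rough user-space illustration of what "consistent" buys us, assume
two counters that are only ever updated together under one lock. The
names below (zone_counters_sketch and friends) are hypothetical and a
pthread mutex stands in for zone->lock.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical pair of counters that must always add up to the zone size. */
struct zone_counters_sketch {
	pthread_mutex_t lock;
	unsigned long nr_free;
	unsigned long nr_allocated;
};

struct zone_snapshot_sketch {
	unsigned long nr_free;
	unsigned long nr_allocated;
};

static void zone_counters_init(struct zone_counters_sketch *z,
			       unsigned long total)
{
	pthread_mutex_init(&z->lock, NULL);
	z->nr_free = total;
	z->nr_allocated = 0;
}

/* Both counters change inside one critical section. */
static int zone_alloc_page(struct zone_counters_sketch *z)
{
	int ret = -1;

	pthread_mutex_lock(&z->lock);
	if (z->nr_free > 0) {
		z->nr_free--;
		z->nr_allocated++;
		ret = 0;
	}
	pthread_mutex_unlock(&z->lock);
	return ret;
}

static void zone_free_page(struct zone_counters_sketch *z)
{
	pthread_mutex_lock(&z->lock);
	assert(z->nr_allocated > 0);
	z->nr_allocated--;
	z->nr_free++;
	pthread_mutex_unlock(&z->lock);
}

/*
 * Reader: copy every counter under the same lock, so the copy can never
 * show the state between the two halves of an update.
 */
static struct zone_snapshot_sketch
zone_read_snapshot(struct zone_counters_sketch *z)
{
	struct zone_snapshot_sketch snap;

	pthread_mutex_lock(&z->lock);
	snap.nr_free = z->nr_free;
	snap.nr_allocated = z->nr_allocated;
	pthread_mutex_unlock(&z->lock);
	return snap;
}
```

With the lock held in the reader, nr_free + nr_allocated in a snapshot
always equals the total, so a mismatch seen there would point at a real
accounting bug rather than at a race with the reader.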

> 
> > >  			seq_printf(m, "%6lu ", freecount);
> > >  		}
> > >  		seq_putc(m, '\n');
> > > @@ -709,6 +710,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> > >  							struct zone *zone)
> > >  {
> > >  	int i;
> > > +
> > 
> > Unnecessary whitespace change.
> 
> Ug. thanks, it's my fault.
> 
> 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
