Re: [PATCH v2] mm: workingset: make workingset detection logic memcg aware

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Vladimir Davydov <vdavydov@virtuozzo.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] mm: workingset: make workingset detection logic memcg aware
Date: Tue, 26 Jan 2016 11:27:56 +0300	[thread overview]
Message-ID: <20160126082756.GF11586@esperanza> (raw)
In-Reply-To: <20160125163907.GA29291@cmpxchg.org>

On Mon, Jan 25, 2016 at 11:39:07AM -0500, Johannes Weiner wrote:
> On Sun, Jan 24, 2016 at 07:56:16PM +0300, Vladimir Davydov wrote:
> > Currently, inactive_age is maintained per zone, which results in
> > unexpected file page activations in case memory cgroups are used. For
> > example, if the total number of active pages is big, a memory cgroup
> > might get every refaulted file page activated even if refault distance
> > is much greater than the number of active file pages in the cgroup. This
> > patch fixes this issue by making inactive_age per lruvec.
> 
> Argh!!
> 
> It's great that you're still interested in this and kept working on
> it. I just regret that I worked on the same stuff the last couple days
> without pinging you before picking it up. Oh well...

:-)

> 
> However, my patches are sufficiently different that I think it makes
> sense to discuss them both and figure out the best end result.  I have
> some comments below and will followup this email with my version.
> 
> > The patch is pretty straightforward and self-explaining, but there are
> > two things that should be noted:
> > 
> >  - workingset_{eviction,activation} need to get lruvec given a page.
> >    On the default hierarchy one can safely access page->mem_cgroup
> >    provided the page is pinned, but on the legacy hierarchy a page can
> >    be migrated from one cgroup to another at any moment, so extra care
> >    must be taken to assure page->mem_cgroup will stay put.
> > 
> >    workingset_eviction is passed a locked page, so it is safe to use
> >    page->mem_cgroup in this function. workingset_activation is trickier:
> >    it is called from mark_page_accessed, where the page is not
> >    necessarily locked. To protect it against page->mem_cgroup change, we
> >    move it to __activate_page, which is called by mark_page_accessed
> >    once there's enough pages on percpu pagevec. This function is called
> >    with zone->lru_lock held, which rules out page charge migration.
> 
> When a page moves to another cgroup at the same time it's activated,
> there really is no wrong lruvec to age. Both would be correct. The
> locking guarantees a stable answer, but we don't need it here. It's
> enough to take the rcu lock here to ensure page_memcg() isn't freed.

Yeah, that's true, but I think that taking rcu lock when memcg is
disabled or even compiled out is ugly. May be, we should introduce
helpers lock/unlock_page_memcg(), which would be no-op on the unified
hierarchy while on the legacy hierarchy they would act just like
mem_cgroup_begin/end_page_stat, i.e. we could just rename the latter
two. This would also simplify dropping legacy code once the legacy
hierarchy has passed away.

> 
> >  - To calculate refault distance correctly even in case a page is
> >    refaulted by a different cgroup, we need to store memcg id in shadow
> >    entry. There's no problem with it on 64-bit, but on 32-bit there's
> >    not much space left in radix tree slot after storing information
> >    about node, zone, and memory cgroup, so we can't just save eviction
> >    counter as is, because it would trim max refault distance making it
> >    unusable.
> > 
> >    To overcome this problem, we increase refault distance granularity,
> >    as proposed by Johannes Weiner. We disregard 10 least significant
> >    bits of eviction counter. This reduces refault distance accuracy to
> >    4MB, which is still fine. With the default NODE_SHIFT (3) this leaves
> >    us 9 bits for storing eviction counter, hence maximal refault
> >    distance will be 2GB, which should be enough for 32-bit systems.
> 
> If we limit it to 2G it becomes a clear-cut correctness issue once you
> have more memory. Instead, we should continue to stretch out the
> distance with an ever-increasing bucket size. The more memory you
> have, the less important the granularity becomes anyway. With 8G, an
> 8M granularity is still okay, and so forth. And once we get beyond a
> certain point, and it creates problems for people, it should be fair
> enough to recommend upgrading to 64 bit.

That's a great idea. I'm all for it.

> 
> > @@ -152,8 +152,72 @@
> >   * refault distance will immediately activate the refaulting page.
> >   */
> >  
> > -static void *pack_shadow(unsigned long eviction, struct zone *zone)
> > +#ifdef CONFIG_MEMCG
> > +/*
> > + * On 32-bit there is not much space left in radix tree slot after
> > + * storing information about node, zone, and memory cgroup, so we
> > + * disregard 10 least significant bits of eviction counter. This
> > + * reduces refault distance accuracy to 4MB, which is still fine.
> > + *
> > + * With the default NODE_SHIFT (3) this leaves us 9 bits for storing
> > + * eviction counter, hence maximal refault distance will be 2GB, which
> > + * should be enough for 32-bit systems.
> > + */
> > +#ifdef CONFIG_64BIT
> > +# define REFAULT_DISTANCE_GRANULARITY		0
> > +#else
> > +# define REFAULT_DISTANCE_GRANULARITY		10
> > +#endif
> > +
> > +static unsigned long pack_shadow_memcg(unsigned long eviction,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	if (mem_cgroup_disabled())
> > +		return eviction;
> > +
> > +	eviction >>= REFAULT_DISTANCE_GRANULARITY;
> > +	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | mem_cgroup_id(memcg);
> > +	return eviction;
> > +}
> > +
> > +static unsigned long unpack_shadow_memcg(unsigned long entry,
> > +					 unsigned long *mask,
> > +					 struct mem_cgroup **memcg)
> > +{
> > +	if (mem_cgroup_disabled()) {
> > +		*memcg = NULL;
> > +		return entry;
> > +	}
> > +
> > +	rcu_read_lock();
> > +	*memcg = mem_cgroup_from_id(entry & MEM_CGROUP_ID_MAX);
> > +	rcu_read_unlock();
> > +
> > +	entry >>= MEM_CGROUP_ID_SHIFT;
> > +	entry <<= REFAULT_DISTANCE_GRANULARITY;
> > +	*mask >>= MEM_CGROUP_ID_SHIFT - REFAULT_DISTANCE_GRANULARITY;
> > +	return entry;
> > +}
> > +#else /* !CONFIG_MEMCG */
> > +static unsigned long pack_shadow_memcg(unsigned long eviction,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	return eviction;
> > +}
> > +
> > +static unsigned long unpack_shadow_memcg(unsigned long entry,
> > +					 unsigned long *mask,
> > +					 struct mem_cgroup **memcg)
> > +{
> > +	*memcg = NULL;
> > +	return entry;
> > +}
> > +#endif /* CONFIG_MEMCG */
> > +
> > +static void *pack_shadow(unsigned long eviction, struct zone *zone,
> > +			 struct mem_cgroup *memcg)
> >  {
> > +	eviction = pack_shadow_memcg(eviction, memcg);
> >  	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
> >  	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> >  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
> 
> For !CONFIG_MEMCG, we can define MEMCG_ID_SHIFT to 0 and pass in a
> cssid of 0. That would save much of the special casing here.

But what about mem_cgroup_disabled() case? Do we want to waste 16 bits
in this case?

> 
> > @@ -213,10 +282,16 @@ static void unpack_shadow(void *shadow,
> >  void *workingset_eviction(struct address_space *mapping, struct page *page)
> >  {
> >  	struct zone *zone = page_zone(page);
> > +	struct mem_cgroup *memcg = page_memcg(page);
> > +	struct lruvec *lruvec;
> >  	unsigned long eviction;
> >  
> > -	eviction = atomic_long_inc_return(&zone->inactive_age);
> > -	return pack_shadow(eviction, zone);
> > +	if (!mem_cgroup_disabled())
> > +		mem_cgroup_get(memcg);
> > +
> > +	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +	eviction = atomic_long_inc_return(&lruvec->inactive_age);
> > +	return pack_shadow(eviction, zone, memcg);
> 
> I don't think we need to hold a reference to the memcg here, it should
> be enough to verify whether the cssid is still existent upon refault.
> 
> What could theoretically happen then is that the original memcg gets
> deleted, a new one will reuse the same id and then refault the same
> pages. However, there are two things that should make this acceptable:
> 1. a couple pages don't matter in this case, and sharing data between
> cgroups on a large scale already leads to weird accounting artifacts
> in other places; this wouldn't be much different. 2. from a system
> perspective, those pages were in fact recently evicted, so even if we
> activate them by accident in the new cgroup, it wouldn't be entirely
> unreasonable. The workload will shake out what the true active list
> frequency is on its own.
> 
> So I think we can just do away with the reference counting for now,
> and reconsider it should the false sharing in this case create new
> problems that are worse than existing consequences of false sharing.

Memory cgroup creation/destruction are rare events, so I agree that we
can close our eyes on this and not clutter the code with
mem_cgroup_get/put.

> 
> > @@ -230,13 +305,22 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
> >   */
> >  bool workingset_refault(void *shadow)
> >  {
> > -	unsigned long refault_distance;
> > +	unsigned long refault_distance, nr_active;
> >  	struct zone *zone;
> > +	struct mem_cgroup *memcg;
> > +	struct lruvec *lruvec;
> >  
> > -	unpack_shadow(shadow, &zone, &refault_distance);
> > +	unpack_shadow(shadow, &zone, &memcg, &refault_distance);
> >  	inc_zone_state(zone, WORKINGSET_REFAULT);
> >  
> > -	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> > +	if (!mem_cgroup_disabled()) {
> > +		lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +		nr_active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_FILE);
> > +		mem_cgroup_put(memcg);
> > +	} else
> > +		nr_active = zone_page_state(zone, NR_ACTIVE_FILE);
> 
> This is basically get_lru_size(), so I reused that instead.

Yeah. In the previous version I did the same.

>From quick glance, I think that in general, your patch set looks better.

Thanks,
Vladimir

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Vladimir Davydov <vdavydov@virtuozzo.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Michal Hocko <mhocko@kernel.org>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2] mm: workingset: make workingset detection logic memcg aware
Date: Tue, 26 Jan 2016 11:27:56 +0300	[thread overview]
Message-ID: <20160126082756.GF11586@esperanza> (raw)
In-Reply-To: <20160125163907.GA29291@cmpxchg.org>

On Mon, Jan 25, 2016 at 11:39:07AM -0500, Johannes Weiner wrote:
> On Sun, Jan 24, 2016 at 07:56:16PM +0300, Vladimir Davydov wrote:
> > Currently, inactive_age is maintained per zone, which results in
> > unexpected file page activations in case memory cgroups are used. For
> > example, if the total number of active pages is big, a memory cgroup
> > might get every refaulted file page activated even if refault distance
> > is much greater than the number of active file pages in the cgroup. This
> > patch fixes this issue by making inactive_age per lruvec.
> 
> Argh!!
> 
> It's great that you're still interested in this and kept working on
> it. I just regret that I worked on the same stuff the last couple days
> without pinging you before picking it up. Oh well...

:-)

> 
> However, my patches are sufficiently different that I think it makes
> sense to discuss them both and figure out the best end result.  I have
> some comments below and will followup this email with my version.
> 
> > The patch is pretty straightforward and self-explaining, but there are
> > two things that should be noted:
> > 
> >  - workingset_{eviction,activation} need to get lruvec given a page.
> >    On the default hierarchy one can safely access page->mem_cgroup
> >    provided the page is pinned, but on the legacy hierarchy a page can
> >    be migrated from one cgroup to another at any moment, so extra care
> >    must be taken to assure page->mem_cgroup will stay put.
> > 
> >    workingset_eviction is passed a locked page, so it is safe to use
> >    page->mem_cgroup in this function. workingset_activation is trickier:
> >    it is called from mark_page_accessed, where the page is not
> >    necessarily locked. To protect it against page->mem_cgroup change, we
> >    move it to __activate_page, which is called by mark_page_accessed
> >    once there's enough pages on percpu pagevec. This function is called
> >    with zone->lru_lock held, which rules out page charge migration.
> 
> When a page moves to another cgroup at the same time it's activated,
> there really is no wrong lruvec to age. Both would be correct. The
> locking guarantees a stable answer, but we don't need it here. It's
> enough to take the rcu lock here to ensure page_memcg() isn't freed.

Yeah, that's true, but I think that taking rcu lock when memcg is
disabled or even compiled out is ugly. May be, we should introduce
helpers lock/unlock_page_memcg(), which would be no-op on the unified
hierarchy while on the legacy hierarchy they would act just like
mem_cgroup_begin/end_page_stat, i.e. we could just rename the latter
two. This would also simplify dropping legacy code once the legacy
hierarchy has passed away.

> 
> >  - To calculate refault distance correctly even in case a page is
> >    refaulted by a different cgroup, we need to store memcg id in shadow
> >    entry. There's no problem with it on 64-bit, but on 32-bit there's
> >    not much space left in radix tree slot after storing information
> >    about node, zone, and memory cgroup, so we can't just save eviction
> >    counter as is, because it would trim max refault distance making it
> >    unusable.
> > 
> >    To overcome this problem, we increase refault distance granularity,
> >    as proposed by Johannes Weiner. We disregard 10 least significant
> >    bits of eviction counter. This reduces refault distance accuracy to
> >    4MB, which is still fine. With the default NODE_SHIFT (3) this leaves
> >    us 9 bits for storing eviction counter, hence maximal refault
> >    distance will be 2GB, which should be enough for 32-bit systems.
> 
> If we limit it to 2G it becomes a clear-cut correctness issue once you
> have more memory. Instead, we should continue to stretch out the
> distance with an ever-increasing bucket size. The more memory you
> have, the less important the granularity becomes anyway. With 8G, an
> 8M granularity is still okay, and so forth. And once we get beyond a
> certain point, and it creates problems for people, it should be fair
> enough to recommend upgrading to 64 bit.

That's a great idea. I'm all for it.

> 
> > @@ -152,8 +152,72 @@
> >   * refault distance will immediately activate the refaulting page.
> >   */
> >  
> > -static void *pack_shadow(unsigned long eviction, struct zone *zone)
> > +#ifdef CONFIG_MEMCG
> > +/*
> > + * On 32-bit there is not much space left in radix tree slot after
> > + * storing information about node, zone, and memory cgroup, so we
> > + * disregard 10 least significant bits of eviction counter. This
> > + * reduces refault distance accuracy to 4MB, which is still fine.
> > + *
> > + * With the default NODE_SHIFT (3) this leaves us 9 bits for storing
> > + * eviction counter, hence maximal refault distance will be 2GB, which
> > + * should be enough for 32-bit systems.
> > + */
> > +#ifdef CONFIG_64BIT
> > +# define REFAULT_DISTANCE_GRANULARITY		0
> > +#else
> > +# define REFAULT_DISTANCE_GRANULARITY		10
> > +#endif
> > +
> > +static unsigned long pack_shadow_memcg(unsigned long eviction,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	if (mem_cgroup_disabled())
> > +		return eviction;
> > +
> > +	eviction >>= REFAULT_DISTANCE_GRANULARITY;
> > +	eviction = (eviction << MEM_CGROUP_ID_SHIFT) | mem_cgroup_id(memcg);
> > +	return eviction;
> > +}
> > +
> > +static unsigned long unpack_shadow_memcg(unsigned long entry,
> > +					 unsigned long *mask,
> > +					 struct mem_cgroup **memcg)
> > +{
> > +	if (mem_cgroup_disabled()) {
> > +		*memcg = NULL;
> > +		return entry;
> > +	}
> > +
> > +	rcu_read_lock();
> > +	*memcg = mem_cgroup_from_id(entry & MEM_CGROUP_ID_MAX);
> > +	rcu_read_unlock();
> > +
> > +	entry >>= MEM_CGROUP_ID_SHIFT;
> > +	entry <<= REFAULT_DISTANCE_GRANULARITY;
> > +	*mask >>= MEM_CGROUP_ID_SHIFT - REFAULT_DISTANCE_GRANULARITY;
> > +	return entry;
> > +}
> > +#else /* !CONFIG_MEMCG */
> > +static unsigned long pack_shadow_memcg(unsigned long eviction,
> > +				       struct mem_cgroup *memcg)
> > +{
> > +	return eviction;
> > +}
> > +
> > +static unsigned long unpack_shadow_memcg(unsigned long entry,
> > +					 unsigned long *mask,
> > +					 struct mem_cgroup **memcg)
> > +{
> > +	*memcg = NULL;
> > +	return entry;
> > +}
> > +#endif /* CONFIG_MEMCG */
> > +
> > +static void *pack_shadow(unsigned long eviction, struct zone *zone,
> > +			 struct mem_cgroup *memcg)
> >  {
> > +	eviction = pack_shadow_memcg(eviction, memcg);
> >  	eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
> >  	eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
> >  	eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
> 
> For !CONFIG_MEMCG, we can define MEMCG_ID_SHIFT to 0 and pass in a
> cssid of 0. That would save much of the special casing here.

But what about mem_cgroup_disabled() case? Do we want to waste 16 bits
in this case?

> 
> > @@ -213,10 +282,16 @@ static void unpack_shadow(void *shadow,
> >  void *workingset_eviction(struct address_space *mapping, struct page *page)
> >  {
> >  	struct zone *zone = page_zone(page);
> > +	struct mem_cgroup *memcg = page_memcg(page);
> > +	struct lruvec *lruvec;
> >  	unsigned long eviction;
> >  
> > -	eviction = atomic_long_inc_return(&zone->inactive_age);
> > -	return pack_shadow(eviction, zone);
> > +	if (!mem_cgroup_disabled())
> > +		mem_cgroup_get(memcg);
> > +
> > +	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +	eviction = atomic_long_inc_return(&lruvec->inactive_age);
> > +	return pack_shadow(eviction, zone, memcg);
> 
> I don't think we need to hold a reference to the memcg here, it should
> be enough to verify whether the cssid is still existent upon refault.
> 
> What could theoretically happen then is that the original memcg gets
> deleted, a new one will reuse the same id and then refault the same
> pages. However, there are two things that should make this acceptable:
> 1. a couple pages don't matter in this case, and sharing data between
> cgroups on a large scale already leads to weird accounting artifacts
> in other places; this wouldn't be much different. 2. from a system
> perspective, those pages were in fact recently evicted, so even if we
> activate them by accident in the new cgroup, it wouldn't be entirely
> unreasonable. The workload will shake out what the true active list
> frequency is on its own.
> 
> So I think we can just do away with the reference counting for now,
> and reconsider it should the false sharing in this case create new
> problems that are worse than existing consequences of false sharing.

Memory cgroup creation/destruction are rare events, so I agree that we
can close our eyes on this and not clutter the code with
mem_cgroup_get/put.

> 
> > @@ -230,13 +305,22 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
> >   */
> >  bool workingset_refault(void *shadow)
> >  {
> > -	unsigned long refault_distance;
> > +	unsigned long refault_distance, nr_active;
> >  	struct zone *zone;
> > +	struct mem_cgroup *memcg;
> > +	struct lruvec *lruvec;
> >  
> > -	unpack_shadow(shadow, &zone, &refault_distance);
> > +	unpack_shadow(shadow, &zone, &memcg, &refault_distance);
> >  	inc_zone_state(zone, WORKINGSET_REFAULT);
> >  
> > -	if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
> > +	if (!mem_cgroup_disabled()) {
> > +		lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> > +		nr_active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_FILE);
> > +		mem_cgroup_put(memcg);
> > +	} else
> > +		nr_active = zone_page_state(zone, NR_ACTIVE_FILE);
> 
> This is basically get_lru_size(), so I reused that instead.

Yeah. In the previous version I did the same.

>From quick glance, I think that in general, your patch set looks better.

Thanks,
Vladimir

next prev parent reply	other threads:[~2016-01-26  8:28 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-24 16:56 [PATCH v2] mm: workingset: make workingset detection logic memcg aware Vladimir Davydov
2016-01-24 16:56 ` Vladimir Davydov
2016-01-25 16:39 ` Johannes Weiner
2016-01-25 16:39   ` Johannes Weiner
2016-01-25 16:41   ` [PATCH 1/3] mm: workingset: eviction buckets for bigmem/lowbit machines Johannes Weiner
2016-01-25 16:41     ` Johannes Weiner
2016-01-25 16:41   ` [PATCH 2/3] mm: workingset: separate shadow unpacking and refault calculation Johannes Weiner
2016-01-25 16:41     ` Johannes Weiner
2016-01-25 16:42   ` [PATCH 3/3] mm: workingset: cgroup-aware Johannes Weiner
2016-01-25 16:42     ` Johannes Weiner
2016-01-26  8:27   ` Vladimir Davydov [this message]
2016-01-26  8:27     ` [PATCH v2] mm: workingset: make workingset detection logic memcg aware Vladimir Davydov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160126082756.GF11586@esperanza \
    --to=vdavydov@virtuozzo.com \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.