* [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-24 15:08 [patch 0/3] mm: memcontrol: performance fixlets for 3.18 Johannes Weiner
@ 2014-09-24 15:08 ` Johannes Weiner
2014-09-24 19:42 ` Andrew Morton
2014-09-24 15:08 ` [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit Johannes Weiner
2014-09-24 15:08 ` [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure Johannes Weiner
2 siblings, 1 reply; 19+ messages in thread
From: Johannes Weiner @ 2014-09-24 15:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
From: Michal Hocko <mhocko@suse.cz>
free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path but it completely
kills memcg uncharge batching which reduces res_counter spin_lock
contention. Dave has noticed this with his page fault scalability test
case on a large machine when the lock was basically dominating on all
CPUs:
80.18% 80.18% [kernel] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--66.59%-- res_counter_uncharge_until
| res_counter_uncharge
| uncharge_batch
| uncharge_list
| mem_cgroup_uncharge_list
| release_pages
| free_pages_and_swap_cache
| tlb_flush_mmu_free
| |
| |--90.12%-- unmap_single_vma
| | unmap_vmas
| | unmap_region
| | do_munmap
| | vm_munmap
| | sys_munmap
| | system_call_fastpath
| | __GI___munmap
| |
| --9.88%-- tlb_flush_mmu
| tlb_finish_mmu
| unmap_region
| do_munmap
| vm_munmap
| sys_munmap
| system_call_fastpath
| __GI___munmap
In his case the load was running in the root memcg and that part
has been handled by reverting 05b843012335 ("mm: memcontrol: use
root_mem_cgroup res_counter") because this is a clear regression,
but the problem remains inside dedicated memcgs.
There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock held times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.
Page reference count and LRU handling is moved to release_lru_pages and
that is run in PAGEVEC_SIZE batches.
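To illustrate the batching effect, here is a small standalone userspace
sketch (not kernel code; NR_PAGES is just an example mapping size and all
pages are assumed to belong to the same memcg) that counts how often the
res_counter lock would be taken for one free_pages_and_swap_cache() call
before and after this change:
#include <stdio.h>
#define PAGEVEC_SIZE	14	/* kernel value at the time of this patch */
#define NR_PAGES	512	/* example: pages freed by one munmap */
int main(void)
{
	/*
	 * Old path: release_pages() is called per PAGEVEC_SIZE chunk and
	 * each call flushes its own uncharge batch, i.e. one res_counter
	 * lock round-trip per chunk.
	 */
	int old_locks = (NR_PAGES + PAGEVEC_SIZE - 1) / PAGEVEC_SIZE;
	/*
	 * New path: all pages end up on a single pages_to_free list and
	 * mem_cgroup_uncharge_list() flushes one batch at the end, i.e.
	 * a single lock round-trip.
	 */
	int new_locks = 1;
	printf("res_counter lock acquisitions: old=%d new=%d\n",
	       old_locks, new_locks);
	return 0;
}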
Reported-by: Dave Hansen <dave@sr71.net>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/swap.c | 27 +++++++++++++++++++++------
mm/swap_state.c | 14 ++++----------
2 files changed, 25 insertions(+), 16 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..8af99dd68dd2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -888,9 +888,9 @@ void lru_add_drain_all(void)
}
/*
- * Batched page_cache_release(). Decrement the reference count on all the
- * passed pages. If it fell to zero then remove the page from the LRU and
- * free it.
+ * Batched lru release. Decrement the reference count on all the passed pages.
+ * If it fell to zero then remove the page from the LRU and add it to the given
+ * list to be freed by the caller.
*
* Avoid taking zone->lru_lock if possible, but if it is taken, retain it
* for the remainder of the operation.
@@ -900,10 +900,10 @@ void lru_add_drain_all(void)
* grabbed the page via the LRU. If it did, give up: shrink_inactive_list()
* will free it.
*/
-void release_pages(struct page **pages, int nr, bool cold)
+static void release_lru_pages(struct page **pages, int nr,
+ struct list_head *pages_to_free)
{
int i;
- LIST_HEAD(pages_to_free);
struct zone *zone = NULL;
struct lruvec *lruvec;
unsigned long uninitialized_var(flags);
@@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold)
/* Clear Active bit in case of parallel mark_page_accessed */
__ClearPageActive(page);
- list_add(&page->lru, &pages_to_free);
+ list_add(&page->lru, pages_to_free);
}
if (zone)
spin_unlock_irqrestore(&zone->lru_lock, flags);
+}
+/*
+ * Batched page_cache_release(). Frees and uncharges all given pages
+ * for which the reference count drops to 0.
+ */
+void release_pages(struct page **pages, int nr, bool cold)
+{
+ LIST_HEAD(pages_to_free);
+ while (nr) {
+ int batch = min(nr, PAGEVEC_SIZE);
+
+ release_lru_pages(pages, batch, &pages_to_free);
+ pages += batch;
+ nr -= batch;
+ }
mem_cgroup_uncharge_list(&pages_to_free);
free_hot_cold_page_list(&pages_to_free, cold);
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ef1f39139b71..154444918685 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page)
void free_pages_and_swap_cache(struct page **pages, int nr)
{
struct page **pagep = pages;
+ int i;
lru_add_drain();
- while (nr) {
- int todo = min(nr, PAGEVEC_SIZE);
- int i;
-
- for (i = 0; i < todo; i++)
- free_swap_cache(pagep[i]);
- release_pages(pagep, todo, false);
- pagep += todo;
- nr -= todo;
- }
+ for (i = 0; i < nr; i++)
+ free_swap_cache(pagep[i]);
+ release_pages(pagep, nr, false);
}
/*
--
2.1.0
* Re: [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-24 15:08 ` [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache Johannes Weiner
@ 2014-09-24 19:42 ` Andrew Morton
2014-09-24 21:03 ` Johannes Weiner
0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2014-09-24 19:42 UTC (permalink / raw)
To: Johannes Weiner
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
On Wed, 24 Sep 2014 11:08:56 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> From: Michal Hocko <mhocko@suse.cz>
>
> free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
> This is not a big deal for the normal release path but it completely
> kills memcg uncharge batching which reduces res_counter spin_lock
> contention. Dave has noticed this with his page fault scalability test
> case on a large machine when the lock was basically dominating on all
> CPUs:
>
> ...
>
> In his case the load was running in the root memcg and that part
> has been handled by reverting 05b843012335 ("mm: memcontrol: use
> root_mem_cgroup res_counter") because this is a clear regression,
> but the problem remains inside dedicated memcgs.
>
> There is no reason to limit release_pages to PAGEVEC_SIZE batches other
> than lru_lock held times. This logic, however, can be moved inside the
> function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
> hold any lock for the whole pages_to_free list so it is safe to call
> them in a single run.
>
> Page reference count and LRU handling is moved to release_lru_pages and
> that is run in PAGEVEC_SIZE batches.
Looks OK.
> --- a/mm/swap.c
> +++ b/mm/swap.c
>
> ...
>
> +}
> +/*
> + * Batched page_cache_release(). Frees and uncharges all given pages
> + * for which the reference count drops to 0.
> + */
> +void release_pages(struct page **pages, int nr, bool cold)
> +{
> + LIST_HEAD(pages_to_free);
>
> + while (nr) {
> + int batch = min(nr, PAGEVEC_SIZE);
> +
> + release_lru_pages(pages, batch, &pages_to_free);
> + pages += batch;
> + nr -= batch;
> + }
The use of PAGEVEC_SIZE here is pretty misleading - there are no
pagevecs in sight. SWAP_CLUSTER_MAX would be more appropriate.
afaict the only reason for this loop is to limit the hold duration for
lru_lock. And it does a suboptimal job of that because it treats all
lru_locks as one: if release_lru_pages() were to hold zoneA's lru_lock
for 8 pages and then were to drop that and hold zoneB's lru_lock for 8
pages, the logic would then force release_lru_pages() to drop the lock
and return to release_pages() even though it doesn't need to.
So I'm thinking it would be better to move the lock-busting logic into
release_lru_pages() itself. With a suitable comment, natch ;) Only
bust the lock in the case where we really did hold a particular lru_lock
for 16 consecutive pages. Then s/release_lru_pages/release_pages/ and
zap the old release_pages().
Obviously it's not very important - presumably the common case is that
the LRU contains lengthy sequences of pages from the same zone. Maybe.
* Re: [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-24 19:42 ` Andrew Morton
@ 2014-09-24 21:03 ` Johannes Weiner
2014-09-24 21:15 ` Andrew Morton
2014-09-25 13:44 ` Michal Hocko
0 siblings, 2 replies; 19+ messages in thread
From: Johannes Weiner @ 2014-09-24 21:03 UTC (permalink / raw)
To: Andrew Morton
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
On Wed, Sep 24, 2014 at 12:42:34PM -0700, Andrew Morton wrote:
> On Wed, 24 Sep 2014 11:08:56 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> > +}
> > +/*
> > + * Batched page_cache_release(). Frees and uncharges all given pages
> > + * for which the reference count drops to 0.
> > + */
> > +void release_pages(struct page **pages, int nr, bool cold)
> > +{
> > + LIST_HEAD(pages_to_free);
> >
> > + while (nr) {
> > + int batch = min(nr, PAGEVEC_SIZE);
> > +
> > + release_lru_pages(pages, batch, &pages_to_free);
> > + pages += batch;
> > + nr -= batch;
> > + }
>
> The use of PAGEVEC_SIZE here is pretty misleading - there are no
> pagevecs in sight. SWAP_CLUSTER_MAX would be more appropriate.
>
>
>
> afaict the only reason for this loop is to limit the hold duration for
> lru_lock. And it does a suboptimal job of that because it treats all
> lru_locks as one: if release_lru_pages() were to hold zoneA's lru_lock
> for 8 pages and then were to drop that and hold zoneB's lru_lock for 8
> pages, the logic would then force release_lru_pages() to drop the lock
> and return to release_pages() even though it doesn't need to.
>
> So I'm thinking it would be better to move the lock-busting logic into
> release_lru_pages() itself. With a suitable comment, natch ;) Only
> bust the lock in the case where we really did hold a particular lru_lock
> for 16 consecutive pages. Then s/release_lru_pages/release_pages/ and
> zap the old release_pages().
Yeah, that's better.
> Obviously it's not very important - presumably the common case is that
> the LRU contains lengthy sequences of pages from the same zone. Maybe.
Even then, the end result is more concise and busts the lock where
it's actually taken, making the whole thing a bit more obvious:
---
From 367f3dcf25141ff6c30400c00ec09cc3c5486f94 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 5 Sep 2014 11:16:17 +0200
Subject: [patch] mm: memcontrol: do not kill uncharge batching in
free_pages_and_swap_cache
free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path but it completely
kills memcg uncharge batching which reduces res_counter spin_lock
contention. Dave has noticed this with his page fault scalability test
case on a large machine when the lock was basically dominating on all
CPUs:
80.18% 80.18% [kernel] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--66.59%-- res_counter_uncharge_until
| res_counter_uncharge
| uncharge_batch
| uncharge_list
| mem_cgroup_uncharge_list
| release_pages
| free_pages_and_swap_cache
| tlb_flush_mmu_free
| |
| |--90.12%-- unmap_single_vma
| | unmap_vmas
| | unmap_region
| | do_munmap
| | vm_munmap
| | sys_munmap
| | system_call_fastpath
| | __GI___munmap
| |
| --9.88%-- tlb_flush_mmu
| tlb_finish_mmu
| unmap_region
| do_munmap
| vm_munmap
| sys_munmap
| system_call_fastpath
| __GI___munmap
In his case the load was running in the root memcg and that part
has been handled by reverting 05b843012335 ("mm: memcontrol: use
root_mem_cgroup res_counter") because this is a clear regression,
but the problem remains inside dedicated memcgs.
There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock held times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.
In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
pages, then remove the batching from free_pages_and_swap_cache.
Also update the grossly out-of-date release_pages documentation.
Reported-by: Dave Hansen <dave@sr71.net>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/swap.c | 31 ++++++++++++++++++++-----------
mm/swap_state.c | 14 ++++----------
2 files changed, 24 insertions(+), 21 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..39affa1932ce 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -887,18 +887,14 @@ void lru_add_drain_all(void)
mutex_unlock(&lock);
}
-/*
- * Batched page_cache_release(). Decrement the reference count on all the
- * passed pages. If it fell to zero then remove the page from the LRU and
- * free it.
- *
- * Avoid taking zone->lru_lock if possible, but if it is taken, retain it
- * for the remainder of the operation.
+/**
+ * release_pages - batched page_cache_release()
+ * @pages: array of pages to release
+ * @nr: number of pages
+ * @cold: whether the pages are cache cold
*
- * The locking in this function is against shrink_inactive_list(): we recheck
- * the page count inside the lock to see whether shrink_inactive_list()
- * grabbed the page via the LRU. If it did, give up: shrink_inactive_list()
- * will free it.
+ * Decrement the reference count on all the pages in @pages. If it
+ * fell to zero, remove the page from the LRU and free it.
*/
void release_pages(struct page **pages, int nr, bool cold)
{
@@ -907,6 +903,7 @@ void release_pages(struct page **pages, int nr, bool cold)
struct zone *zone = NULL;
struct lruvec *lruvec;
unsigned long uninitialized_var(flags);
+ unsigned int uninitialized_var(lock_batch);
for (i = 0; i < nr; i++) {
struct page *page = pages[i];
@@ -914,6 +911,7 @@ void release_pages(struct page **pages, int nr, bool cold)
if (unlikely(PageCompound(page))) {
if (zone) {
spin_unlock_irqrestore(&zone->lru_lock, flags);
+ lock_batch = 0;
zone = NULL;
}
put_compound_page(page);
@@ -930,6 +928,7 @@ void release_pages(struct page **pages, int nr, bool cold)
if (zone)
spin_unlock_irqrestore(&zone->lru_lock,
flags);
+ lock_batch = 0;
zone = pagezone;
spin_lock_irqsave(&zone->lru_lock, flags);
}
@@ -938,6 +937,16 @@ void release_pages(struct page **pages, int nr, bool cold)
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
+
+ /*
+ * Make sure the IRQ-safe lock-holding time
+ * does not get excessive with a continuous
+ * string of pages from the same zone.
+ */
+ if (++lock_batch == SWAP_CLUSTER_MAX) {
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+ zone = NULL;
+ }
}
/* Clear Active bit in case of parallel mark_page_accessed */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ef1f39139b71..154444918685 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page)
void free_pages_and_swap_cache(struct page **pages, int nr)
{
struct page **pagep = pages;
+ int i;
lru_add_drain();
- while (nr) {
- int todo = min(nr, PAGEVEC_SIZE);
- int i;
-
- for (i = 0; i < todo; i++)
- free_swap_cache(pagep[i]);
- release_pages(pagep, todo, false);
- pagep += todo;
- nr -= todo;
- }
+ for (i = 0; i < nr; i++)
+ free_swap_cache(pagep[i]);
+ release_pages(pagep, nr, false);
}
/*
--
2.1.0
* Re: [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-24 21:03 ` Johannes Weiner
@ 2014-09-24 21:15 ` Andrew Morton
2014-09-25 13:44 ` Michal Hocko
1 sibling, 0 replies; 19+ messages in thread
From: Andrew Morton @ 2014-09-24 21:15 UTC (permalink / raw)
To: Johannes Weiner
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
On Wed, 24 Sep 2014 17:03:22 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Obviously it's not very important - presumably the common case is that
> > the LRU contains lengthy sequences of pages from the same zone. Maybe.
>
> Even then, the end result is more concise and busts the lock where
> it's actually taken, making the whole thing a bit more obvious:
Yes, that did come out better.
> From: Michal Hocko <mhocko@suse.cz>
> Date: Fri, 5 Sep 2014 11:16:17 +0200
> Subject: [patch] mm: memcontrol: do not kill uncharge batching in
> free_pages_and_swap_cache
>
> free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
> This is not a big deal for the normal release path but it completely
> kills memcg uncharge batching which reduces res_counter spin_lock
> contention. Dave has noticed this with his page fault scalability test
> case on a large machine when the lock was basically dominating on all
> CPUs:
> 80.18% 80.18% [kernel] [k] _raw_spin_lock
> |
> --- _raw_spin_lock
> |
> |--66.59%-- res_counter_uncharge_until
> | res_counter_uncharge
> | uncharge_batch
> | uncharge_list
> | mem_cgroup_uncharge_list
> | release_pages
> | free_pages_and_swap_cache
> | tlb_flush_mmu_free
> | |
> | |--90.12%-- unmap_single_vma
> | | unmap_vmas
> | | unmap_region
> | | do_munmap
> | | vm_munmap
> | | sys_munmap
> | | system_call_fastpath
> | | __GI___munmap
> | |
> | --9.88%-- tlb_flush_mmu
> | tlb_finish_mmu
> | unmap_region
> | do_munmap
> | vm_munmap
> | sys_munmap
> | system_call_fastpath
> | __GI___munmap
>
> In his case the load was running in the root memcg and that part
> has been handled by reverting 05b843012335 ("mm: memcontrol: use
> root_mem_cgroup res_counter") because this is a clear regression,
> but the problem remains inside dedicated memcgs.
>
> There is no reason to limit release_pages to PAGEVEC_SIZE batches other
> than lru_lock held times. This logic, however, can be moved inside the
> function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
> hold any lock for the whole pages_to_free list so it is safe to call
> them in a single run.
>
> In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
> pages, then remove the batching from free_pages_and_swap_cache.
I beefed this paragraph up a bit:
: The release_pages() code was previously breaking the lru_lock each
: PAGEVEC_SIZE pages (ie, 14 pages). However this code has no usage of
: pagevecs so switch to breaking the lock at least every SWAP_CLUSTER_MAX
: (32) pages. This means that the lock acquisition frequency is
: approximately halved and the max hold times are approximately doubled.
:
: The now unneeded batching is removed from free_pages_and_swap_cache().
I doubt if the increased irq-off time will hurt anyone, but who knows...
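As a rough worked example, assuming a string of pages that all come from
one zone: for 448 pages the old code took and dropped the lru_lock
448/14 = 32 times, holding it across 14 pages each time, while with this
change it is taken 448/32 = 14 times and held across 32 pages each time,
i.e. about half the acquisitions at roughly twice the hold time.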
* Re: [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-24 21:03 ` Johannes Weiner
2014-09-24 21:15 ` Andrew Morton
@ 2014-09-25 13:44 ` Michal Hocko
2014-10-02 15:57 ` Johannes Weiner
1 sibling, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2014-09-25 13:44 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Greg Thelen, Vladimir Davydov, Dave Hansen,
linux-mm, cgroups, linux-kernel
On Wed 24-09-14 17:03:22, Johannes Weiner wrote:
[...]
> In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
> pages, then remove the batching from free_pages_and_swap_cache.
Actually I had something like that originally but then decided not to
change the break-out logic, to prevent strange and subtle regressions.
I focused only on the memcg batching POV and left the rest untouched.
I do agree that lru_lock batching can be improved as well. Your change
looks almost correct, but you should count all the pages while the lock
is held; otherwise you might end up holding the lock for too long just
because most pages are already off the LRU for some reason. At least
that is what my original attempt was doing. Something like the following
on top of the current patch:
---
diff --git a/mm/swap.c b/mm/swap.c
index 39affa1932ce..8a12b33936b4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -911,13 +911,22 @@ void release_pages(struct page **pages, int nr, bool cold)
if (unlikely(PageCompound(page))) {
if (zone) {
spin_unlock_irqrestore(&zone->lru_lock, flags);
- lock_batch = 0;
zone = NULL;
}
put_compound_page(page);
continue;
}
+ /*
+ * Make sure the IRQ-safe lock-holding time does not get
+ * excessive with a continuous string of pages from the
+ * same zone. The lock is held only if zone != NULL.
+ */
+ if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+ zone = NULL;
+ }
+
if (!put_page_testzero(page))
continue;
@@ -937,16 +946,6 @@ void release_pages(struct page **pages, int nr, bool cold)
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
-
- /*
- * Make sure the IRQ-safe lock-holding time
- * does not get excessive with a continuous
- * string of pages from the same zone.
- */
- if (++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- zone = NULL;
- }
}
/* Clear Active bit in case of parallel mark_page_accessed */
[...]
--
Michal Hocko
SUSE Labs
* Re: [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-25 13:44 ` Michal Hocko
@ 2014-10-02 15:57 ` Johannes Weiner
0 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2014-10-02 15:57 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, Greg Thelen, Vladimir Davydov, Dave Hansen,
linux-mm, cgroups, linux-kernel
On Thu, Sep 25, 2014 at 03:44:03PM +0200, Michal Hocko wrote:
> On Wed 24-09-14 17:03:22, Johannes Weiner wrote:
> [...]
> > In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
> > pages, then remove the batching from free_pages_and_swap_cache.
>
> Actually I had something like that originally but then decided not to
> change the break-out logic, to prevent strange and subtle regressions.
> I focused only on the memcg batching POV and left the rest untouched.
>
> I do agree that lru_lock batching can be improved as well. Your change
> looks almost correct, but you should count all the pages while the lock
> is held; otherwise you might end up holding the lock for too long just
> because most pages are already off the LRU for some reason. At least
> that is what my original attempt was doing. Something like the following
> on top of the current patch:
Yep, that makes sense.
Would you care to send it in so that Andrew can pick it up? Thanks!
* [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit
2014-09-24 15:08 [patch 0/3] mm: memcontrol: performance fixlets for 3.18 Johannes Weiner
2014-09-24 15:08 ` [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache Johannes Weiner
@ 2014-09-24 15:08 ` Johannes Weiner
2014-09-24 15:14 ` Vladimir Davydov
[not found] ` <1411571338-8178-3-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2014-09-24 15:08 ` [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure Johannes Weiner
2 siblings, 2 replies; 19+ messages in thread
From: Johannes Weiner @ 2014-09-24 15:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
When attempting to charge pages, we first charge the memory counter
and then the memory+swap counter. If one of the counters is at its
limit, we enter reclaim, but if it's the memory+swap counter, reclaim
shouldn't swap because that wouldn't change the situation. However,
if the counters have the same limits, we never get to the memory+swap
limit. To know whether reclaim should swap or not, there is a state
flag that indicates whether the limits are equal and whether hitting
the memory limit implies hitting the memory+swap limit.
Just try the memory+swap counter first.
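Concretely: charge the memory+swap counter first. If that charge fails,
reclaim must not swap, because swapping does not reduce memory+swap
usage. If it succeeds but the memory counter charge then fails, back
out the memory+swap charge and reclaim with swapping allowed. Equal
limits are handled naturally, since the memory+swap counter is then the
first one to be hit, which makes the memsw_is_minimum flag unnecessary.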
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/memcontrol.c | 47 +++++++++++++----------------------------------
1 file changed, 13 insertions(+), 34 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ec22bf380d0..89c920156c2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -315,9 +315,6 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
- /* set when res.limit == memsw.limit */
- bool memsw_is_minimum;
-
/* protect arrays of thresholds */
struct mutex thresholds_lock;
@@ -1804,8 +1801,6 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
noswap = true;
- if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && memcg->memsw_is_minimum)
- noswap = true;
for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
if (loop)
@@ -2543,16 +2538,17 @@ retry:
goto done;
size = batch * PAGE_SIZE;
- if (!res_counter_charge(&memcg->res, size, &fail_res)) {
- if (!do_swap_account)
+ if (!do_swap_account ||
+ !res_counter_charge(&memcg->memsw, size, &fail_res)) {
+ if (!res_counter_charge(&memcg->res, size, &fail_res))
goto done_restock;
- if (!res_counter_charge(&memcg->memsw, size, &fail_res))
- goto done_restock;
- res_counter_uncharge(&memcg->res, size);
+ if (do_swap_account)
+ res_counter_uncharge(&memcg->memsw, size);
+ mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ } else {
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
- } else
- mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ }
if (batch > nr_pages) {
batch = nr_pages;
@@ -3615,7 +3611,6 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
int retry_count;
- u64 memswlimit, memlimit;
int ret = 0;
int children = mem_cgroup_count_children(memcg);
u64 curusage, oldusage;
@@ -3642,24 +3637,16 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
* We have to guarantee memcg->res.limit <= memcg->memsw.limit.
*/
mutex_lock(&set_limit_mutex);
- memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- if (memswlimit < val) {
+ if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) {
ret = -EINVAL;
mutex_unlock(&set_limit_mutex);
break;
}
- memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- if (memlimit < val)
+ if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val)
enlarge = 1;
ret = res_counter_set_limit(&memcg->res, val);
- if (!ret) {
- if (memswlimit == val)
- memcg->memsw_is_minimum = true;
- else
- memcg->memsw_is_minimum = false;
- }
mutex_unlock(&set_limit_mutex);
if (!ret)
@@ -3684,7 +3671,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
int retry_count;
- u64 memlimit, memswlimit, oldusage, curusage;
+ u64 oldusage, curusage;
int children = mem_cgroup_count_children(memcg);
int ret = -EBUSY;
int enlarge = 0;
@@ -3703,22 +3690,14 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
* We have to guarantee memcg->res.limit <= memcg->memsw.limit.
*/
mutex_lock(&set_limit_mutex);
- memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- if (memlimit > val) {
+ if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) {
ret = -EINVAL;
mutex_unlock(&set_limit_mutex);
break;
}
- memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- if (memswlimit < val)
+ if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val)
enlarge = 1;
ret = res_counter_set_limit(&memcg->memsw, val);
- if (!ret) {
- if (memlimit == val)
- memcg->memsw_is_minimum = true;
- else
- memcg->memsw_is_minimum = false;
- }
mutex_unlock(&set_limit_mutex);
if (!ret)
--
2.1.0
* Re: [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit
2014-09-24 15:08 ` [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit Johannes Weiner
@ 2014-09-24 15:14 ` Vladimir Davydov
[not found] ` <1411571338-8178-3-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
1 sibling, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-24 15:14 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Greg Thelen, Dave Hansen, Michal Hocko, linux-mm,
cgroups, linux-kernel
On Wed, Sep 24, 2014 at 11:08:57AM -0400, Johannes Weiner wrote:
> When attempting to charge pages, we first charge the memory counter
> and then the memory+swap counter. If one of the counters is at its
> limit, we enter reclaim, but if it's the memory+swap counter, reclaim
> shouldn't swap because that wouldn't change the situation. However,
> if the counters have the same limits, we never get to the memory+swap
> limit. To know whether reclaim should swap or not, there is a state
> flag that indicates whether the limits are equal and whether hitting
> the memory limit implies hitting the memory+swap limit.
>
> Just try the memory+swap counter first.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
[parent not found: <1411571338-8178-3-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit
[not found] ` <1411571338-8178-3-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-09-25 15:27 ` Michal Hocko
0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2014-09-25 15:27 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Greg Thelen, Vladimir Davydov, Dave Hansen,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Wed 24-09-14 11:08:57, Johannes Weiner wrote:
> When attempting to charge pages, we first charge the memory counter
> and then the memory+swap counter. If one of the counters is at its
> limit, we enter reclaim, but if it's the memory+swap counter, reclaim
> shouldn't swap because that wouldn't change the situation. However,
> if the counters have the same limits, we never get to the memory+swap
> limit. To know whether reclaim should swap or not, there is a state
> flag that indicates whether the limits are equal and whether hitting
> the memory limit implies hitting the memory+swap limit.
>
> Just try the memory+swap counter first.
OK, this makes sense and makes the reclaim code a little bit more
readable. I would just add that the patch shouldn't have any visible
effects, because that is not apparent from the description.
>
> Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> ---
> mm/memcontrol.c | 47 +++++++++++++----------------------------------
> 1 file changed, 13 insertions(+), 34 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1ec22bf380d0..89c920156c2a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -315,9 +315,6 @@ struct mem_cgroup {
> /* OOM-Killer disable */
> int oom_kill_disable;
>
> - /* set when res.limit == memsw.limit */
> - bool memsw_is_minimum;
> -
> /* protect arrays of thresholds */
> struct mutex thresholds_lock;
>
> @@ -1804,8 +1801,6 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
>
> if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
> noswap = true;
> - if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && memcg->memsw_is_minimum)
> - noswap = true;
>
> for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> if (loop)
> @@ -2543,16 +2538,17 @@ retry:
> goto done;
>
> size = batch * PAGE_SIZE;
> - if (!res_counter_charge(&memcg->res, size, &fail_res)) {
> - if (!do_swap_account)
> + if (!do_swap_account ||
> + !res_counter_charge(&memcg->memsw, size, &fail_res)) {
> + if (!res_counter_charge(&memcg->res, size, &fail_res))
> goto done_restock;
> - if (!res_counter_charge(&memcg->memsw, size, &fail_res))
> - goto done_restock;
> - res_counter_uncharge(&memcg->res, size);
> + if (do_swap_account)
> + res_counter_uncharge(&memcg->memsw, size);
> + mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> + } else {
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> - } else
> - mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> + }
>
> if (batch > nr_pages) {
> batch = nr_pages;
> @@ -3615,7 +3611,6 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> unsigned long long val)
> {
> int retry_count;
> - u64 memswlimit, memlimit;
> int ret = 0;
> int children = mem_cgroup_count_children(memcg);
> u64 curusage, oldusage;
> @@ -3642,24 +3637,16 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
> */
> mutex_lock(&set_limit_mutex);
> - memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> - if (memswlimit < val) {
> + if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) {
> ret = -EINVAL;
> mutex_unlock(&set_limit_mutex);
> break;
> }
>
> - memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> - if (memlimit < val)
> + if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val)
> enlarge = 1;
>
> ret = res_counter_set_limit(&memcg->res, val);
> - if (!ret) {
> - if (memswlimit == val)
> - memcg->memsw_is_minimum = true;
> - else
> - memcg->memsw_is_minimum = false;
> - }
> mutex_unlock(&set_limit_mutex);
>
> if (!ret)
> @@ -3684,7 +3671,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> unsigned long long val)
> {
> int retry_count;
> - u64 memlimit, memswlimit, oldusage, curusage;
> + u64 oldusage, curusage;
> int children = mem_cgroup_count_children(memcg);
> int ret = -EBUSY;
> int enlarge = 0;
> @@ -3703,22 +3690,14 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
> */
> mutex_lock(&set_limit_mutex);
> - memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> - if (memlimit > val) {
> + if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) {
> ret = -EINVAL;
> mutex_unlock(&set_limit_mutex);
> break;
> }
> - memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> - if (memswlimit < val)
> + if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val)
> enlarge = 1;
> ret = res_counter_set_limit(&memcg->memsw, val);
> - if (!ret) {
> - if (memlimit == val)
> - memcg->memsw_is_minimum = true;
> - else
> - memcg->memsw_is_minimum = false;
> - }
> mutex_unlock(&set_limit_mutex);
>
> if (!ret)
> --
> 2.1.0
>
--
Michal Hocko
SUSE Labs
* [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure
2014-09-24 15:08 [patch 0/3] mm: memcontrol: performance fixlets for 3.18 Johannes Weiner
2014-09-24 15:08 ` [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache Johannes Weiner
2014-09-24 15:08 ` [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit Johannes Weiner
@ 2014-09-24 15:08 ` Johannes Weiner
2014-09-29 13:57 ` Michal Hocko
2 siblings, 1 reply; 19+ messages in thread
From: Johannes Weiner @ 2014-09-24 15:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
In a memcg with even just moderate cache pressure, success rates for
transparent huge page allocations drop to zero, wasting a lot of
effort that the allocator puts into assembling these pages.
The reason for this is that the memcg reclaim code was never designed
for higher-order charges. It reclaims in small batches until there is
room for at least one page. Huge page charges only succeed when these
batches add up over a series of huge faults, which is unlikely under
any significant load involving order-0 allocations in the group.
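For scale: a transparent huge page is 512 order-0 pages (2MB with 4kB
base pages), while the old mem_cgroup_reclaim() loop asked direct
reclaim for only SWAP_CLUSTER_MAX (32) pages per iteration and stopped
as soon as mem_cgroup_margin() reported any room at all, so a group
with concurrent order-0 charges rarely accumulates the full 512-page
margin.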
Remove that loop on the memcg side in favor of passing the actual
reclaim goal to direct reclaim, which is already set up and optimized
to meet higher-order goals efficiently.
This brings memcg's THP policy in line with the system policy: if the
allocator painstakingly assembles a hugepage, memcg will at least make
an honest effort to charge it. As a result, transparent hugepage
allocation rates amid cache activity are drastically improved:
                           vanilla                 patched
pgalloc             4717530.80 ( +0.00%)    4451376.40 (  -5.64%)
pgfault              491370.60 ( +0.00%)     225477.40 ( -54.11%)
pgmajfault                2.00 ( +0.00%)          1.80 (  -6.67%)
thp_fault_alloc           0.00 ( +0.00%)        531.60 (+100.00%)
thp_fault_fallback      749.00 ( +0.00%)        217.40 ( -70.88%)
[ Note: this may in turn increase memory consumption from internal
fragmentation, which is an inherent risk of transparent hugepages.
Some setups may have to adjust the memcg limits accordingly to
accommodate this - or, if the machine is already packed to capacity,
disable the transparent huge page feature. ]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
---
include/linux/swap.h | 6 +++--
mm/memcontrol.c | 69 +++++++++++++---------------------------------------
mm/vmscan.c | 7 +++---
3 files changed, 25 insertions(+), 57 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ea4f926e6b9b..37a585beef5c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages,
+ gfp_t gfp_mask,
+ bool may_swap);
extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89c920156c2a..c2c75262a209 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -478,14 +478,6 @@ enum res_type {
#define OOM_CONTROL (0)
/*
- * Reclaim flags for mem_cgroup_hierarchical_reclaim
- */
-#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
-#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
-#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
-#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
-
-/*
* The memcg_create_mutex will be held whenever a new cgroup is created.
* As a consequence, any change that needs to protect against new child cgroups
* appearing has to hold it as well.
@@ -1791,40 +1783,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
NULL, "Memory cgroup out of memory");
}
-static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
- gfp_t gfp_mask,
- unsigned long flags)
-{
- unsigned long total = 0;
- bool noswap = false;
- int loop;
-
- if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
- noswap = true;
-
- for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
- if (loop)
- drain_all_stock_async(memcg);
- total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
- /*
- * Allow limit shrinkers, which are triggered directly
- * by userspace, to catch signals and stop reclaim
- * after minimal progress, regardless of the margin.
- */
- if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
- break;
- if (mem_cgroup_margin(memcg))
- break;
- /*
- * If nothing was reclaimed after two attempts, there
- * may be no reclaimable pages in this hierarchy.
- */
- if (loop && !total)
- break;
- }
- return total;
-}
-
/**
* test_mem_cgroup_node_reclaimable
* @memcg: the target memcg
@@ -2527,8 +2485,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
struct mem_cgroup *mem_over_limit;
struct res_counter *fail_res;
unsigned long nr_reclaimed;
- unsigned long flags = 0;
unsigned long long size;
+ bool may_swap = true;
+ bool drained = false;
int ret = 0;
if (mem_cgroup_is_root(memcg))
@@ -2547,7 +2506,7 @@ retry:
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
} else {
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
- flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+ may_swap = false;
}
if (batch > nr_pages) {
@@ -2572,11 +2531,18 @@ retry:
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
- nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
+ nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
+ gfp_mask, may_swap);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
goto retry;
+ if (!drained) {
+ drain_all_stock_async(mem_over_limit);
+ drained = true;
+ goto retry;
+ }
+
if (gfp_mask & __GFP_NORETRY)
goto nomem;
/*
@@ -3652,8 +3618,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_SHRINK);
+ try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
+
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3703,9 +3669,8 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_NOSWAP |
- MEM_CGROUP_RECLAIM_SHRINK);
+ try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
+
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3954,8 +3919,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
if (signal_pending(current))
return -EINTR;
- progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
- false);
+ progress = try_to_free_mem_cgroup_pages(memcg, 1,
+ GFP_KERNEL, true);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06123f20a326..dcb47074ae03 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2759,21 +2759,22 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
}
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages,
gfp_t gfp_mask,
- bool noswap)
+ bool may_swap)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
int nid;
struct scan_control sc = {
- .nr_to_reclaim = SWAP_CLUSTER_MAX,
+ .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
.target_mem_cgroup = memcg,
.priority = DEF_PRIORITY,
.may_writepage = !laptop_mode,
.may_unmap = 1,
- .may_swap = !noswap,
+ .may_swap = may_swap,
};
/*
--
2.1.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure
2014-09-24 15:08 ` [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure Johannes Weiner
@ 2014-09-29 13:57 ` Michal Hocko
[not found] ` <20140929135707.GA25956-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2014-09-29 13:57 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Greg Thelen, Vladimir Davydov, Dave Hansen,
linux-mm, cgroups, linux-kernel
On Wed 24-09-14 11:08:58, Johannes Weiner wrote:
> In a memcg with even just moderate cache pressure, success rates for
> transparent huge page allocations drop to zero, wasting a lot of
> effort that the allocator puts into assembling these pages.
>
> The reason for this is that the memcg reclaim code was never designed
> for higher-order charges. It reclaims in small batches until there is
> room for at least one page. Huge page charges only succeed when these
> batches add up over a series of huge faults, which is unlikely under
> any significant load involving order-0 allocations in the group.
>
> Remove that loop on the memcg side in favor of passing the actual
> reclaim goal to direct reclaim, which is already set up and optimized
> to meet higher-order goals efficiently.
I had concerns the last time you posted a similar patch
(http://marc.info/?l=linux-mm&m=140803277013080&w=2) but I do not see
any of them mentioned or addressed here, especially the unexpectedly
long stalls and excessive swapout. A 512-page target for direct reclaim
is too much, especially for smaller memcgs, where we would need to drop
the priority considerably to even scan that many pages.
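For a sense of scale: memcg reclaim scans roughly lru_size >> priority
pages per pass, so a small group with ~16k pages on its LRUs (64MB)
scans only a handful of pages at DEF_PRIORITY (12), and the priority
has to fall to 5 or below before a single pass even looks at 512 pages.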
THP charges close to the limit are definitely a problem, but this is
way too risky a fix IMO. Maybe a better approach would be to limit the
reclaim to (clean) page cache (aka MEM_CGROUP_RECLAIM_NOSWAP). The risk
of long stalls should be much lower and excessive swapouts shouldn't
happen at all. What is the point of swapping out a large number of
pages just to allow a THP charge which may then be used very sparsely?
> This brings memcg's THP policy in line with the system policy: if the
> allocator painstakingly assembles a hugepage, memcg will at least make
> an honest effort to charge it. As a result, transparent hugepage
> allocation rates amid cache activity are drastically improved:
>
>                            vanilla                 patched
> pgalloc             4717530.80 ( +0.00%)    4451376.40 (  -5.64%)
> pgfault              491370.60 ( +0.00%)     225477.40 ( -54.11%)
> pgmajfault                2.00 ( +0.00%)          1.80 (  -6.67%)
> thp_fault_alloc           0.00 ( +0.00%)        531.60 (+100.00%)
> thp_fault_fallback      749.00 ( +0.00%)        217.40 ( -70.88%)
What is the load and configuration that you have measured?
> [ Note: this may in turn increase memory consumption from internal
> fragmentation, which is an inherent risk of transparent hugepages.
> Some setups may have to adjust the memcg limits accordingly to
> accommodate this - or, if the machine is already packed to capacity,
> disable the transparent huge page feature. ]
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
> include/linux/swap.h | 6 +++--
> mm/memcontrol.c | 69 +++++++++++++---------------------------------------
> mm/vmscan.c | 7 +++---
> 3 files changed, 25 insertions(+), 57 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ea4f926e6b9b..37a585beef5c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
> extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *mask);
> extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> - gfp_t gfp_mask, bool noswap);
> +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> + unsigned long nr_pages,
> + gfp_t gfp_mask,
> + bool may_swap);
> extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> gfp_t gfp_mask, bool noswap,
> struct zone *zone,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 89c920156c2a..c2c75262a209 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -478,14 +478,6 @@ enum res_type {
> #define OOM_CONTROL (0)
>
> /*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim
> - */
> -#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
> -#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> -#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
> -#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> -
> -/*
> * The memcg_create_mutex will be held whenever a new cgroup is created.
> * As a consequence, any change that needs to protect against new child cgroups
> * appearing has to hold it as well.
> @@ -1791,40 +1783,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> NULL, "Memory cgroup out of memory");
> }
>
> -static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
> - gfp_t gfp_mask,
> - unsigned long flags)
> -{
> - unsigned long total = 0;
> - bool noswap = false;
> - int loop;
> -
> - if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
> - noswap = true;
> -
> - for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> - if (loop)
> - drain_all_stock_async(memcg);
> - total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
> - /*
> - * Allow limit shrinkers, which are triggered directly
> - * by userspace, to catch signals and stop reclaim
> - * after minimal progress, regardless of the margin.
> - */
> - if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
> - break;
> - if (mem_cgroup_margin(memcg))
> - break;
> - /*
> - * If nothing was reclaimed after two attempts, there
> - * may be no reclaimable pages in this hierarchy.
> - */
> - if (loop && !total)
> - break;
> - }
> - return total;
> -}
> -
> /**
> * test_mem_cgroup_node_reclaimable
> * @memcg: the target memcg
> @@ -2527,8 +2485,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> struct mem_cgroup *mem_over_limit;
> struct res_counter *fail_res;
> unsigned long nr_reclaimed;
> - unsigned long flags = 0;
> unsigned long long size;
> + bool may_swap = true;
> + bool drained = false;
> int ret = 0;
>
> if (mem_cgroup_is_root(memcg))
> @@ -2547,7 +2506,7 @@ retry:
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> } else {
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> - flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> + may_swap = false;
> }
>
> if (batch > nr_pages) {
> @@ -2572,11 +2531,18 @@ retry:
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> - nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> + nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> + gfp_mask, may_swap);
>
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> goto retry;
>
> + if (!drained) {
> + drain_all_stock_async(mem_over_limit);
> + drained = true;
> + goto retry;
> + }
> +
> if (gfp_mask & __GFP_NORETRY)
> goto nomem;
> /*
> @@ -3652,8 +3618,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_reclaim(memcg, GFP_KERNEL,
> - MEM_CGROUP_RECLAIM_SHRINK);
> + try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
> +
> curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -3703,9 +3669,8 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_reclaim(memcg, GFP_KERNEL,
> - MEM_CGROUP_RECLAIM_NOSWAP |
> - MEM_CGROUP_RECLAIM_SHRINK);
> + try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
> +
> curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -3954,8 +3919,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> if (signal_pending(current))
> return -EINTR;
>
> - progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
> - false);
> + progress = try_to_free_mem_cgroup_pages(memcg, 1,
> + GFP_KERNEL, true);
> if (!progress) {
> nr_retries--;
> /* maybe some writeback is necessary */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 06123f20a326..dcb47074ae03 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2759,21 +2759,22 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
> }
>
> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> + unsigned long nr_pages,
> gfp_t gfp_mask,
> - bool noswap)
> + bool may_swap)
> {
> struct zonelist *zonelist;
> unsigned long nr_reclaimed;
> int nid;
> struct scan_control sc = {
> - .nr_to_reclaim = SWAP_CLUSTER_MAX,
> + .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
> .target_mem_cgroup = memcg,
> .priority = DEF_PRIORITY,
> .may_writepage = !laptop_mode,
> .may_unmap = 1,
> - .may_swap = !noswap,
> + .may_swap = may_swap,
> };
>
> /*
> --
> 2.1.0
>
--
Michal Hocko
SUSE Labs