* [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-24 15:08 [patch 0/3] mm: memcontrol: performance fixlets for 3.18 Johannes Weiner
@ 2014-09-24 15:08 ` Johannes Weiner
2014-09-24 19:42 ` Andrew Morton
2014-09-24 15:08 ` [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit Johannes Weiner
2014-09-24 15:08 ` [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure Johannes Weiner
2 siblings, 1 reply; 19+ messages in thread
From: Johannes Weiner @ 2014-09-24 15:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
From: Michal Hocko <mhocko@suse.cz>
free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path but it completely
kills memcg uncharge batching which reduces res_counter spin_lock
contention. Dave has noticed this with his page fault scalability test
case on a large machine when the lock was basically dominating on all
CPUs:
80.18% 80.18% [kernel] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--66.59%-- res_counter_uncharge_until
| res_counter_uncharge
| uncharge_batch
| uncharge_list
| mem_cgroup_uncharge_list
| release_pages
| free_pages_and_swap_cache
| tlb_flush_mmu_free
| |
| |--90.12%-- unmap_single_vma
| | unmap_vmas
| | unmap_region
| | do_munmap
| | vm_munmap
| | sys_munmap
| | system_call_fastpath
| | __GI___munmap
| |
| --9.88%-- tlb_flush_mmu
| tlb_finish_mmu
| unmap_region
| do_munmap
| vm_munmap
| sys_munmap
| system_call_fastpath
| __GI___munmap
In his case the load was running in the root memcg and that part
has been handled by reverting 05b843012335 ("mm: memcontrol: use
root_mem_cgroup res_counter") because this is a clear regression,
but the problem remains inside dedicated memcgs.
There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock held times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.
Page reference count and LRU handling is moved to release_lru_pages and
that is run in PAGEVEC_SIZE batches.
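To illustrate the batching effect, here is a small standalone userspace
sketch (not kernel code; NR_PAGES is just an example mapping size and all
pages are assumed to belong to the same memcg) that counts how often the
res_counter lock would be taken for one free_pages_and_swap_cache() call
before and after this change:
#include <stdio.h>
#define PAGEVEC_SIZE	14	/* kernel value at the time of this patch */
#define NR_PAGES	512	/* example: pages freed by one munmap */
int main(void)
{
	/*
	 * Old path: release_pages() is called per PAGEVEC_SIZE chunk and
	 * each call flushes its own uncharge batch, i.e. one res_counter
	 * lock round-trip per chunk.
	 */
	int old_locks = (NR_PAGES + PAGEVEC_SIZE - 1) / PAGEVEC_SIZE;
	/*
	 * New path: all pages end up on a single pages_to_free list and
	 * mem_cgroup_uncharge_list() flushes one batch at the end, i.e.
	 * a single lock round-trip.
	 */
	int new_locks = 1;
	printf("res_counter lock acquisitions: old=%d new=%d\n",
	       old_locks, new_locks);
	return 0;
}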
Reported-by: Dave Hansen <dave@sr71.net>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/swap.c | 27 +++++++++++++++++++++------
mm/swap_state.c | 14 ++++----------
2 files changed, 25 insertions(+), 16 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..8af99dd68dd2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -888,9 +888,9 @@ void lru_add_drain_all(void)
}
/*
- * Batched page_cache_release(). Decrement the reference count on all the
- * passed pages. If it fell to zero then remove the page from the LRU and
- * free it.
+ * Batched lru release. Decrement the reference count on all the passed pages.
+ * If it fell to zero then remove the page from the LRU and add it to the given
+ * list to be freed by the caller.
*
* Avoid taking zone->lru_lock if possible, but if it is taken, retain it
* for the remainder of the operation.
@@ -900,10 +900,10 @@ void lru_add_drain_all(void)
* grabbed the page via the LRU. If it did, give up: shrink_inactive_list()
* will free it.
*/
-void release_pages(struct page **pages, int nr, bool cold)
+static void release_lru_pages(struct page **pages, int nr,
+ struct list_head *pages_to_free)
{
int i;
- LIST_HEAD(pages_to_free);
struct zone *zone = NULL;
struct lruvec *lruvec;
unsigned long uninitialized_var(flags);
@@ -943,11 +943,26 @@ void release_pages(struct page **pages, int nr, bool cold)
/* Clear Active bit in case of parallel mark_page_accessed */
__ClearPageActive(page);
- list_add(&page->lru, &pages_to_free);
+ list_add(&page->lru, pages_to_free);
}
if (zone)
spin_unlock_irqrestore(&zone->lru_lock, flags);
+}
+/*
+ * Batched page_cache_release(). Frees and uncharges all given pages
+ * for which the reference count drops to 0.
+ */
+void release_pages(struct page **pages, int nr, bool cold)
+{
+ LIST_HEAD(pages_to_free);
+ while (nr) {
+ int batch = min(nr, PAGEVEC_SIZE);
+
+ release_lru_pages(pages, batch, &pages_to_free);
+ pages += batch;
+ nr -= batch;
+ }
mem_cgroup_uncharge_list(&pages_to_free);
free_hot_cold_page_list(&pages_to_free, cold);
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ef1f39139b71..154444918685 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page)
void free_pages_and_swap_cache(struct page **pages, int nr)
{
struct page **pagep = pages;
+ int i;
lru_add_drain();
- while (nr) {
- int todo = min(nr, PAGEVEC_SIZE);
- int i;
-
- for (i = 0; i < todo; i++)
- free_swap_cache(pagep[i]);
- release_pages(pagep, todo, false);
- pagep += todo;
- nr -= todo;
- }
+ for (i = 0; i < nr; i++)
+ free_swap_cache(pagep[i]);
+ release_pages(pagep, nr, false);
}
/*
--
2.1.0
* Re: [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-24 15:08 ` [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache Johannes Weiner
@ 2014-09-24 19:42 ` Andrew Morton
2014-09-24 21:03 ` Johannes Weiner
0 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2014-09-24 19:42 UTC (permalink / raw)
To: Johannes Weiner
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
On Wed, 24 Sep 2014 11:08:56 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> From: Michal Hocko <mhocko@suse.cz>
>
> free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
> This is not a big deal for the normal release path but it completely
> kills memcg uncharge batching which reduces res_counter spin_lock
> contention. Dave has noticed this with his page fault scalability test
> case on a large machine when the lock was basically dominating on all
> CPUs:
>
> ...
>
> In his case the load was running in the root memcg and that part
> has been handled by reverting 05b843012335 ("mm: memcontrol: use
> root_mem_cgroup res_counter") because this is a clear regression,
> but the problem remains inside dedicated memcgs.
>
> There is no reason to limit release_pages to PAGEVEC_SIZE batches other
> than lru_lock held times. This logic, however, can be moved inside the
> function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
> hold any lock for the whole pages_to_free list so it is safe to call
> them in a single run.
>
> Page reference count and LRU handling is moved to release_lru_pages and
> that is run in PAGEVEC_SIZE batches.
Looks OK.
> --- a/mm/swap.c
> +++ b/mm/swap.c
>
> ...
>
> +}
> +/*
> + * Batched page_cache_release(). Frees and uncharges all given pages
> + * for which the reference count drops to 0.
> + */
> +void release_pages(struct page **pages, int nr, bool cold)
> +{
> + LIST_HEAD(pages_to_free);
>
> + while (nr) {
> + int batch = min(nr, PAGEVEC_SIZE);
> +
> + release_lru_pages(pages, batch, &pages_to_free);
> + pages += batch;
> + nr -= batch;
> + }
The use of PAGEVEC_SIZE here is pretty misleading - there are no
pagevecs in sight. SWAP_CLUSTER_MAX would be more appropriate.
afaict the only reason for this loop is to limit the hold duration for
lru_lock. And it does a suboptimal job of that because it treats all
lru_locks as one: if release_lru_pages() were to hold zoneA's lru_lock
for 8 pages and then were to drop that and hold zoneB's lru_lock for 8
pages, the logic would then force release_lru_pages() to drop the lock
and return to release_pages() even though it doesn't need to.
So I'm thinking it would be better to move the lock-busting logic into
release_lru_pages() itself. With a suitable comment, natch ;) Only
bust the lock in the case where we really did hold a particular lru_lock
for 16 consecutive pages. Then s/release_lru_pages/release_pages/ and
zap the old release_pages().
Obviously it's not very important - presumably the common case is that
the LRU contains lengthy sequences of pages from the same zone. Maybe.
* Re: [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-24 19:42 ` Andrew Morton
@ 2014-09-24 21:03 ` Johannes Weiner
2014-09-24 21:15 ` Andrew Morton
2014-09-25 13:44 ` Michal Hocko
0 siblings, 2 replies; 19+ messages in thread
From: Johannes Weiner @ 2014-09-24 21:03 UTC (permalink / raw)
To: Andrew Morton
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
On Wed, Sep 24, 2014 at 12:42:34PM -0700, Andrew Morton wrote:
> On Wed, 24 Sep 2014 11:08:56 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> > +}
> > +/*
> > + * Batched page_cache_release(). Frees and uncharges all given pages
> > + * for which the reference count drops to 0.
> > + */
> > +void release_pages(struct page **pages, int nr, bool cold)
> > +{
> > + LIST_HEAD(pages_to_free);
> >
> > + while (nr) {
> > + int batch = min(nr, PAGEVEC_SIZE);
> > +
> > + release_lru_pages(pages, batch, &pages_to_free);
> > + pages += batch;
> > + nr -= batch;
> > + }
>
> The use of PAGEVEC_SIZE here is pretty misleading - there are no
> pagevecs in sight. SWAP_CLUSTER_MAX would be more appropriate.
>
>
>
> afaict the only reason for this loop is to limit the hold duration for
> lru_lock. And it does a suboptimal job of that because it treats all
> lru_locks as one: if release_lru_pages() were to hold zoneA's lru_lock
> for 8 pages and then were to drop that and hold zoneB's lru_lock for 8
> pages, the logic would then force release_lru_pages() to drop the lock
> and return to release_pages() even though it doesn't need to.
>
> So I'm thinking it would be better to move the lock-busting logic into
> release_lru_pages() itself. With a suitable comment, natch ;) Only
> bust the lock in the case where we really did hold a particular lru_lock
> for 16 consecutive pages. Then s/release_lru_pages/release_pages/ and
> zap the old release_pages().
Yeah, that's better.
> Obviously it's not very important - presumably the common case is that
> the LRU contains lengthy sequences of pages from the same zone. Maybe.
Even then, the end result is more concise and busts the lock where
it's actually taken, making the whole thing a bit more obvious:
---
From 367f3dcf25141ff6c30400c00ec09cc3c5486f94 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Fri, 5 Sep 2014 11:16:17 +0200
Subject: [patch] mm: memcontrol: do not kill uncharge batching in
free_pages_and_swap_cache
free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path but it completely
kills memcg uncharge batching which reduces res_counter spin_lock
contention. Dave has noticed this with his page fault scalability test
case on a large machine when the lock was basically dominating on all
CPUs:
80.18% 80.18% [kernel] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--66.59%-- res_counter_uncharge_until
| res_counter_uncharge
| uncharge_batch
| uncharge_list
| mem_cgroup_uncharge_list
| release_pages
| free_pages_and_swap_cache
| tlb_flush_mmu_free
| |
| |--90.12%-- unmap_single_vma
| | unmap_vmas
| | unmap_region
| | do_munmap
| | vm_munmap
| | sys_munmap
| | system_call_fastpath
| | __GI___munmap
| |
| --9.88%-- tlb_flush_mmu
| tlb_finish_mmu
| unmap_region
| do_munmap
| vm_munmap
| sys_munmap
| system_call_fastpath
| __GI___munmap
In his case the load was running in the root memcg and that part
has been handled by reverting 05b843012335 ("mm: memcontrol: use
root_mem_cgroup res_counter") because this is a clear regression,
but the problem remains inside dedicated memcgs.
There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock held times. This logic, however, can be moved inside the
function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call
them in a single run.
In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
pages, then remove the batching from free_pages_and_swap_cache.
Also update the grossly out-of-date release_pages documentation.
Reported-by: Dave Hansen <dave@sr71.net>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/swap.c | 31 ++++++++++++++++++++-----------
mm/swap_state.c | 14 ++++----------
2 files changed, 24 insertions(+), 21 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 6b2dc3897cd5..39affa1932ce 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -887,18 +887,14 @@ void lru_add_drain_all(void)
mutex_unlock(&lock);
}
-/*
- * Batched page_cache_release(). Decrement the reference count on all the
- * passed pages. If it fell to zero then remove the page from the LRU and
- * free it.
- *
- * Avoid taking zone->lru_lock if possible, but if it is taken, retain it
- * for the remainder of the operation.
+/**
+ * release_pages - batched page_cache_release()
+ * @pages: array of pages to release
+ * @nr: number of pages
+ * @cold: whether the pages are cache cold
*
- * The locking in this function is against shrink_inactive_list(): we recheck
- * the page count inside the lock to see whether shrink_inactive_list()
- * grabbed the page via the LRU. If it did, give up: shrink_inactive_list()
- * will free it.
+ * Decrement the reference count on all the pages in @pages. If it
+ * fell to zero, remove the page from the LRU and free it.
*/
void release_pages(struct page **pages, int nr, bool cold)
{
@@ -907,6 +903,7 @@ void release_pages(struct page **pages, int nr, bool cold)
struct zone *zone = NULL;
struct lruvec *lruvec;
unsigned long uninitialized_var(flags);
+ unsigned int uninitialized_var(lock_batch);
for (i = 0; i < nr; i++) {
struct page *page = pages[i];
@@ -914,6 +911,7 @@ void release_pages(struct page **pages, int nr, bool cold)
if (unlikely(PageCompound(page))) {
if (zone) {
spin_unlock_irqrestore(&zone->lru_lock, flags);
+ lock_batch = 0;
zone = NULL;
}
put_compound_page(page);
@@ -930,6 +928,7 @@ void release_pages(struct page **pages, int nr, bool cold)
if (zone)
spin_unlock_irqrestore(&zone->lru_lock,
flags);
+ lock_batch = 0;
zone = pagezone;
spin_lock_irqsave(&zone->lru_lock, flags);
}
@@ -938,6 +937,16 @@ void release_pages(struct page **pages, int nr, bool cold)
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
+
+ /*
+ * Make sure the IRQ-safe lock-holding time
+ * does not get excessive with a continuous
+ * string of pages from the same zone.
+ */
+ if (++lock_batch == SWAP_CLUSTER_MAX) {
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+ zone = NULL;
+ }
}
/* Clear Active bit in case of parallel mark_page_accessed */
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ef1f39139b71..154444918685 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -265,18 +265,12 @@ void free_page_and_swap_cache(struct page *page)
void free_pages_and_swap_cache(struct page **pages, int nr)
{
struct page **pagep = pages;
+ int i;
lru_add_drain();
- while (nr) {
- int todo = min(nr, PAGEVEC_SIZE);
- int i;
-
- for (i = 0; i < todo; i++)
- free_swap_cache(pagep[i]);
- release_pages(pagep, todo, false);
- pagep += todo;
- nr -= todo;
- }
+ for (i = 0; i < nr; i++)
+ free_swap_cache(pagep[i]);
+ release_pages(pagep, nr, false);
}
/*
--
2.1.0
* Re: [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-24 21:03 ` Johannes Weiner
@ 2014-09-24 21:15 ` Andrew Morton
2014-09-25 13:44 ` Michal Hocko
1 sibling, 0 replies; 19+ messages in thread
From: Andrew Morton @ 2014-09-24 21:15 UTC (permalink / raw)
To: Johannes Weiner
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
On Wed, 24 Sep 2014 17:03:22 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Obviously it's not very important - presumably the common case is that
> > the LRU contains lengthy sequences of pages from the same zone. Maybe.
>
> Even then, the end result is more concise and busts the lock where
> it's actually taken, making the whole thing a bit more obvious:
Yes, that did come out better.
> From: Michal Hocko <mhocko@suse.cz>
> Date: Fri, 5 Sep 2014 11:16:17 +0200
> Subject: [patch] mm: memcontrol: do not kill uncharge batching in
> free_pages_and_swap_cache
>
> free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
> This is not a big deal for the normal release path but it completely
> kills memcg uncharge batching which reduces res_counter spin_lock
> contention. Dave has noticed this with his page fault scalability test
> case on a large machine when the lock was basically dominating on all
> CPUs:
> 80.18% 80.18% [kernel] [k] _raw_spin_lock
> |
> --- _raw_spin_lock
> |
> |--66.59%-- res_counter_uncharge_until
> | res_counter_uncharge
> | uncharge_batch
> | uncharge_list
> | mem_cgroup_uncharge_list
> | release_pages
> | free_pages_and_swap_cache
> | tlb_flush_mmu_free
> | |
> | |--90.12%-- unmap_single_vma
> | | unmap_vmas
> | | unmap_region
> | | do_munmap
> | | vm_munmap
> | | sys_munmap
> | | system_call_fastpath
> | | __GI___munmap
> | |
> | --9.88%-- tlb_flush_mmu
> | tlb_finish_mmu
> | unmap_region
> | do_munmap
> | vm_munmap
> | sys_munmap
> | system_call_fastpath
> | __GI___munmap
>
> In his case the load was running in the root memcg and that part
> has been handled by reverting 05b843012335 ("mm: memcontrol: use
> root_mem_cgroup res_counter") because this is a clear regression,
> but the problem remains inside dedicated memcgs.
>
> There is no reason to limit release_pages to PAGEVEC_SIZE batches other
> than lru_lock held times. This logic, however, can be moved inside the
> function. mem_cgroup_uncharge_list and free_hot_cold_page_list do not
> hold any lock for the whole pages_to_free list so it is safe to call
> them in a single run.
>
> In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
> pages, then remove the batching from free_pages_and_swap_cache.
I beefed this paragraph up a bit:
: The release_pages() code was previously breaking the lru_lock each
: PAGEVEC_SIZE pages (ie, 14 pages). However this code has no usage of
: pagevecs so switch to breaking the lock at least every SWAP_CLUSTER_MAX
: (32) pages. This means that the lock acquisition frequency is
: approximately halved and the max hold times are approximately doubled.
:
: The now unneeded batching is removed from free_pages_and_swap_cache().
I doubt if the increased irq-off time will hurt anyone, but who knows...
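As a rough worked example, assuming a string of pages that all come from
one zone: for 448 pages the old code took and dropped the lru_lock
448/14 = 32 times, holding it across 14 pages each time, while with this
change it is taken 448/32 = 14 times and held across 32 pages each time,
i.e. about half the acquisitions at roughly twice the hold time.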
* Re: [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-24 21:03 ` Johannes Weiner
2014-09-24 21:15 ` Andrew Morton
@ 2014-09-25 13:44 ` Michal Hocko
2014-10-02 15:57 ` Johannes Weiner
1 sibling, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2014-09-25 13:44 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Greg Thelen, Vladimir Davydov, Dave Hansen,
linux-mm, cgroups, linux-kernel
On Wed 24-09-14 17:03:22, Johannes Weiner wrote:
[...]
> In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
> pages, then remove the batching from free_pages_and_swap_cache.
Actually I had something like that originally but then decided not to
change the break-out logic, to prevent strange and subtle regressions.
I focused only on the memcg batching POV and left the rest untouched.
I do agree that lru_lock batching can be improved as well. Your change
looks almost correct, but you should count all the pages while the lock
is held; otherwise you might end up holding the lock for too long just
because most pages are already off the LRU for some reason. At least
that is what my original attempt was doing. Something like the following
on top of the current patch:
---
diff --git a/mm/swap.c b/mm/swap.c
index 39affa1932ce..8a12b33936b4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -911,13 +911,22 @@ void release_pages(struct page **pages, int nr, bool cold)
if (unlikely(PageCompound(page))) {
if (zone) {
spin_unlock_irqrestore(&zone->lru_lock, flags);
- lock_batch = 0;
zone = NULL;
}
put_compound_page(page);
continue;
}
+ /*
+ * Make sure the IRQ-safe lock-holding time does not get
+ * excessive with a continuous string of pages from the
+ * same zone. The lock is held only if zone != NULL.
+ */
+ if (zone && ++lock_batch == SWAP_CLUSTER_MAX) {
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+ zone = NULL;
+ }
+
if (!put_page_testzero(page))
continue;
@@ -937,16 +946,6 @@ void release_pages(struct page **pages, int nr, bool cold)
VM_BUG_ON_PAGE(!PageLRU(page), page);
__ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
-
- /*
- * Make sure the IRQ-safe lock-holding time
- * does not get excessive with a continuous
- * string of pages from the same zone.
- */
- if (++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(&zone->lru_lock, flags);
- zone = NULL;
- }
}
/* Clear Active bit in case of parallel mark_page_accessed */
[...]
--
Michal Hocko
SUSE Labs
* Re: [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
2014-09-25 13:44 ` Michal Hocko
@ 2014-10-02 15:57 ` Johannes Weiner
0 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2014-10-02 15:57 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, Greg Thelen, Vladimir Davydov, Dave Hansen,
linux-mm, cgroups, linux-kernel
On Thu, Sep 25, 2014 at 03:44:03PM +0200, Michal Hocko wrote:
> On Wed 24-09-14 17:03:22, Johannes Weiner wrote:
> [...]
> > In release_pages, break the lock at least every SWAP_CLUSTER_MAX (32)
> > pages, then remove the batching from free_pages_and_swap_cache.
>
> Actually I had something like that originally but then decided not to
> change the break-out logic, to prevent strange and subtle regressions.
> I focused only on the memcg batching POV and left the rest untouched.
>
> I do agree that lru_lock batching can be improved as well. Your change
> looks almost correct, but you should count all the pages while the lock
> is held; otherwise you might end up holding the lock for too long just
> because most pages are already off the LRU for some reason. At least
> that is what my original attempt was doing. Something like the following
> on top of the current patch:
Yep, that makes sense.
Would you care to send it in so that Andrew can pick it up? Thanks!
* [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit
2014-09-24 15:08 [patch 0/3] mm: memcontrol: performance fixlets for 3.18 Johannes Weiner
2014-09-24 15:08 ` [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache Johannes Weiner
@ 2014-09-24 15:08 ` Johannes Weiner
2014-09-24 15:14 ` Vladimir Davydov
[not found] ` <1411571338-8178-3-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
2014-09-24 15:08 ` [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure Johannes Weiner
2 siblings, 2 replies; 19+ messages in thread
From: Johannes Weiner @ 2014-09-24 15:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
When attempting to charge pages, we first charge the memory counter
and then the memory+swap counter. If one of the counters is at its
limit, we enter reclaim, but if it's the memory+swap counter, reclaim
shouldn't swap because that wouldn't change the situation. However,
if the counters have the same limits, we never get to the memory+swap
limit. To know whether reclaim should swap or not, there is a state
flag that indicates whether the limits are equal and whether hitting
the memory limit implies hitting the memory+swap limit.
Just try the memory+swap counter first.
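Concretely: charge the memory+swap counter first. If that charge fails,
reclaim must not swap, because swapping does not reduce memory+swap
usage. If it succeeds but the memory counter charge then fails, back
out the memory+swap charge and reclaim with swapping allowed. Equal
limits are handled naturally, since the memory+swap counter is then the
first one to be hit, which makes the memsw_is_minimum flag unnecessary.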
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/memcontrol.c | 47 +++++++++++++----------------------------------
1 file changed, 13 insertions(+), 34 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ec22bf380d0..89c920156c2a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -315,9 +315,6 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
- /* set when res.limit == memsw.limit */
- bool memsw_is_minimum;
-
/* protect arrays of thresholds */
struct mutex thresholds_lock;
@@ -1804,8 +1801,6 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
noswap = true;
- if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && memcg->memsw_is_minimum)
- noswap = true;
for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
if (loop)
@@ -2543,16 +2538,17 @@ retry:
goto done;
size = batch * PAGE_SIZE;
- if (!res_counter_charge(&memcg->res, size, &fail_res)) {
- if (!do_swap_account)
+ if (!do_swap_account ||
+ !res_counter_charge(&memcg->memsw, size, &fail_res)) {
+ if (!res_counter_charge(&memcg->res, size, &fail_res))
goto done_restock;
- if (!res_counter_charge(&memcg->memsw, size, &fail_res))
- goto done_restock;
- res_counter_uncharge(&memcg->res, size);
+ if (do_swap_account)
+ res_counter_uncharge(&memcg->memsw, size);
+ mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ } else {
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
flags |= MEM_CGROUP_RECLAIM_NOSWAP;
- } else
- mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+ }
if (batch > nr_pages) {
batch = nr_pages;
@@ -3615,7 +3611,6 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
int retry_count;
- u64 memswlimit, memlimit;
int ret = 0;
int children = mem_cgroup_count_children(memcg);
u64 curusage, oldusage;
@@ -3642,24 +3637,16 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
* We have to guarantee memcg->res.limit <= memcg->memsw.limit.
*/
mutex_lock(&set_limit_mutex);
- memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- if (memswlimit < val) {
+ if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) {
ret = -EINVAL;
mutex_unlock(&set_limit_mutex);
break;
}
- memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- if (memlimit < val)
+ if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val)
enlarge = 1;
ret = res_counter_set_limit(&memcg->res, val);
- if (!ret) {
- if (memswlimit == val)
- memcg->memsw_is_minimum = true;
- else
- memcg->memsw_is_minimum = false;
- }
mutex_unlock(&set_limit_mutex);
if (!ret)
@@ -3684,7 +3671,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
unsigned long long val)
{
int retry_count;
- u64 memlimit, memswlimit, oldusage, curusage;
+ u64 oldusage, curusage;
int children = mem_cgroup_count_children(memcg);
int ret = -EBUSY;
int enlarge = 0;
@@ -3703,22 +3690,14 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
* We have to guarantee memcg->res.limit <= memcg->memsw.limit.
*/
mutex_lock(&set_limit_mutex);
- memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- if (memlimit > val) {
+ if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) {
ret = -EINVAL;
mutex_unlock(&set_limit_mutex);
break;
}
- memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
- if (memswlimit < val)
+ if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val)
enlarge = 1;
ret = res_counter_set_limit(&memcg->memsw, val);
- if (!ret) {
- if (memlimit == val)
- memcg->memsw_is_minimum = true;
- else
- memcg->memsw_is_minimum = false;
- }
mutex_unlock(&set_limit_mutex);
if (!ret)
--
2.1.0
* Re: [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit
2014-09-24 15:08 ` [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit Johannes Weiner
@ 2014-09-24 15:14 ` Vladimir Davydov
[not found] ` <1411571338-8178-3-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
1 sibling, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2014-09-24 15:14 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Greg Thelen, Dave Hansen, Michal Hocko, linux-mm,
cgroups, linux-kernel
On Wed, Sep 24, 2014 at 11:08:57AM -0400, Johannes Weiner wrote:
> When attempting to charge pages, we first charge the memory counter
> and then the memory+swap counter. If one of the counters is at its
> limit, we enter reclaim, but if it's the memory+swap counter, reclaim
> shouldn't swap because that wouldn't change the situation. However,
> if the counters have the same limits, we never get to the memory+swap
> limit. To know whether reclaim should swap or not, there is a state
> flag that indicates whether the limits are equal and whether hitting
> the memory limit implies hitting the memory+swap limit.
>
> Just try the memory+swap counter first.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
[parent not found: <1411571338-8178-3-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>]
* Re: [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit
[not found] ` <1411571338-8178-3-git-send-email-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
@ 2014-09-25 15:27 ` Michal Hocko
0 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2014-09-25 15:27 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Greg Thelen, Vladimir Davydov, Dave Hansen,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
On Wed 24-09-14 11:08:57, Johannes Weiner wrote:
> When attempting to charge pages, we first charge the memory counter
> and then the memory+swap counter. If one of the counters is at its
> limit, we enter reclaim, but if it's the memory+swap counter, reclaim
> shouldn't swap because that wouldn't change the situation. However,
> if the counters have the same limits, we never get to the memory+swap
> limit. To know whether reclaim should swap or not, there is a state
> flag that indicates whether the limits are equal and whether hitting
> the memory limit implies hitting the memory+swap limit.
>
> Just try the memory+swap counter first.
OK, this makes sense and makes the reclaim code a little bit more
readable. I would just add that the patch shouldn't have any visible
effects, because that is not apparent from the description.
>
> Signed-off-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> ---
> mm/memcontrol.c | 47 +++++++++++++----------------------------------
> 1 file changed, 13 insertions(+), 34 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1ec22bf380d0..89c920156c2a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -315,9 +315,6 @@ struct mem_cgroup {
> /* OOM-Killer disable */
> int oom_kill_disable;
>
> - /* set when res.limit == memsw.limit */
> - bool memsw_is_minimum;
> -
> /* protect arrays of thresholds */
> struct mutex thresholds_lock;
>
> @@ -1804,8 +1801,6 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
>
> if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
> noswap = true;
> - if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && memcg->memsw_is_minimum)
> - noswap = true;
>
> for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> if (loop)
> @@ -2543,16 +2538,17 @@ retry:
> goto done;
>
> size = batch * PAGE_SIZE;
> - if (!res_counter_charge(&memcg->res, size, &fail_res)) {
> - if (!do_swap_account)
> + if (!do_swap_account ||
> + !res_counter_charge(&memcg->memsw, size, &fail_res)) {
> + if (!res_counter_charge(&memcg->res, size, &fail_res))
> goto done_restock;
> - if (!res_counter_charge(&memcg->memsw, size, &fail_res))
> - goto done_restock;
> - res_counter_uncharge(&memcg->res, size);
> + if (do_swap_account)
> + res_counter_uncharge(&memcg->memsw, size);
> + mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> + } else {
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> - } else
> - mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> + }
>
> if (batch > nr_pages) {
> batch = nr_pages;
> @@ -3615,7 +3611,6 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> unsigned long long val)
> {
> int retry_count;
> - u64 memswlimit, memlimit;
> int ret = 0;
> int children = mem_cgroup_count_children(memcg);
> u64 curusage, oldusage;
> @@ -3642,24 +3637,16 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
> */
> mutex_lock(&set_limit_mutex);
> - memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> - if (memswlimit < val) {
> + if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val) {
> ret = -EINVAL;
> mutex_unlock(&set_limit_mutex);
> break;
> }
>
> - memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> - if (memlimit < val)
> + if (res_counter_read_u64(&memcg->res, RES_LIMIT) < val)
> enlarge = 1;
>
> ret = res_counter_set_limit(&memcg->res, val);
> - if (!ret) {
> - if (memswlimit == val)
> - memcg->memsw_is_minimum = true;
> - else
> - memcg->memsw_is_minimum = false;
> - }
> mutex_unlock(&set_limit_mutex);
>
> if (!ret)
> @@ -3684,7 +3671,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> unsigned long long val)
> {
> int retry_count;
> - u64 memlimit, memswlimit, oldusage, curusage;
> + u64 oldusage, curusage;
> int children = mem_cgroup_count_children(memcg);
> int ret = -EBUSY;
> int enlarge = 0;
> @@ -3703,22 +3690,14 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> * We have to guarantee memcg->res.limit <= memcg->memsw.limit.
> */
> mutex_lock(&set_limit_mutex);
> - memlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> - if (memlimit > val) {
> + if (res_counter_read_u64(&memcg->res, RES_LIMIT) > val) {
> ret = -EINVAL;
> mutex_unlock(&set_limit_mutex);
> break;
> }
> - memswlimit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> - if (memswlimit < val)
> + if (res_counter_read_u64(&memcg->memsw, RES_LIMIT) < val)
> enlarge = 1;
> ret = res_counter_set_limit(&memcg->memsw, val);
> - if (!ret) {
> - if (memlimit == val)
> - memcg->memsw_is_minimum = true;
> - else
> - memcg->memsw_is_minimum = false;
> - }
> mutex_unlock(&set_limit_mutex);
>
> if (!ret)
> --
> 2.1.0
>
--
Michal Hocko
SUSE Labs
* [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure
2014-09-24 15:08 [patch 0/3] mm: memcontrol: performance fixlets for 3.18 Johannes Weiner
2014-09-24 15:08 ` [patch 1/3] mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache Johannes Weiner
2014-09-24 15:08 ` [patch 2/3] mm: memcontrol: simplify detecting when the memory+swap limit is hit Johannes Weiner
@ 2014-09-24 15:08 ` Johannes Weiner
2014-09-29 13:57 ` Michal Hocko
2 siblings, 1 reply; 19+ messages in thread
From: Johannes Weiner @ 2014-09-24 15:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Greg Thelen, Vladimir Davydov, Dave Hansen, Michal Hocko,
linux-mm, cgroups, linux-kernel
In a memcg with even just moderate cache pressure, success rates for
transparent huge page allocations drop to zero, wasting a lot of
effort that the allocator puts into assembling these pages.
The reason for this is that the memcg reclaim code was never designed
for higher-order charges. It reclaims in small batches until there is
room for at least one page. Huge page charges only succeed when these
batches add up over a series of huge faults, which is unlikely under
any significant load involving order-0 allocations in the group.
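For scale: a transparent huge page is 512 order-0 pages (2MB with 4kB
base pages), while the old mem_cgroup_reclaim() loop asked direct
reclaim for only SWAP_CLUSTER_MAX (32) pages per iteration and stopped
as soon as mem_cgroup_margin() reported any room at all, so a group
with concurrent order-0 charges rarely accumulates the full 512-page
margin.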
Remove that loop on the memcg side in favor of passing the actual
reclaim goal to direct reclaim, which is already set up and optimized
to meet higher-order goals efficiently.
This brings memcg's THP policy in line with the system policy: if the
allocator painstakingly assembles a hugepage, memcg will at least make
an honest effort to charge it. As a result, transparent hugepage
allocation rates amid cache activity are drastically improved:
                           vanilla                 patched
pgalloc             4717530.80 ( +0.00%)    4451376.40 (  -5.64%)
pgfault              491370.60 ( +0.00%)     225477.40 ( -54.11%)
pgmajfault                2.00 ( +0.00%)          1.80 (  -6.67%)
thp_fault_alloc           0.00 ( +0.00%)        531.60 (+100.00%)
thp_fault_fallback      749.00 ( +0.00%)        217.40 ( -70.88%)
[ Note: this may in turn increase memory consumption from internal
fragmentation, which is an inherent risk of transparent hugepages.
Some setups may have to adjust the memcg limits accordingly to
accommodate this - or, if the machine is already packed to capacity,
disable the transparent huge page feature. ]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
---
include/linux/swap.h | 6 +++--
mm/memcontrol.c | 69 +++++++++++++---------------------------------------
mm/vmscan.c | 7 +++---
3 files changed, 25 insertions(+), 57 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ea4f926e6b9b..37a585beef5c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
-extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
- gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages,
+ gfp_t gfp_mask,
+ bool may_swap);
extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 89c920156c2a..c2c75262a209 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -478,14 +478,6 @@ enum res_type {
#define OOM_CONTROL (0)
/*
- * Reclaim flags for mem_cgroup_hierarchical_reclaim
- */
-#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
-#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
-#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
-#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
-
-/*
* The memcg_create_mutex will be held whenever a new cgroup is created.
* As a consequence, any change that needs to protect against new child cgroups
* appearing has to hold it as well.
@@ -1791,40 +1783,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
NULL, "Memory cgroup out of memory");
}
-static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
- gfp_t gfp_mask,
- unsigned long flags)
-{
- unsigned long total = 0;
- bool noswap = false;
- int loop;
-
- if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
- noswap = true;
-
- for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
- if (loop)
- drain_all_stock_async(memcg);
- total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
- /*
- * Allow limit shrinkers, which are triggered directly
- * by userspace, to catch signals and stop reclaim
- * after minimal progress, regardless of the margin.
- */
- if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
- break;
- if (mem_cgroup_margin(memcg))
- break;
- /*
- * If nothing was reclaimed after two attempts, there
- * may be no reclaimable pages in this hierarchy.
- */
- if (loop && !total)
- break;
- }
- return total;
-}
-
/**
* test_mem_cgroup_node_reclaimable
* @memcg: the target memcg
@@ -2527,8 +2485,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
struct mem_cgroup *mem_over_limit;
struct res_counter *fail_res;
unsigned long nr_reclaimed;
- unsigned long flags = 0;
unsigned long long size;
+ bool may_swap = true;
+ bool drained = false;
int ret = 0;
if (mem_cgroup_is_root(memcg))
@@ -2547,7 +2506,7 @@ retry:
mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
} else {
mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
- flags |= MEM_CGROUP_RECLAIM_NOSWAP;
+ may_swap = false;
}
if (batch > nr_pages) {
@@ -2572,11 +2531,18 @@ retry:
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
- nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
+ nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
+ gfp_mask, may_swap);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
goto retry;
+ if (!drained) {
+ drain_all_stock_async(mem_over_limit);
+ drained = true;
+ goto retry;
+ }
+
if (gfp_mask & __GFP_NORETRY)
goto nomem;
/*
@@ -3652,8 +3618,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_SHRINK);
+ try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
+
curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3703,9 +3669,8 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
if (!ret)
break;
- mem_cgroup_reclaim(memcg, GFP_KERNEL,
- MEM_CGROUP_RECLAIM_NOSWAP |
- MEM_CGROUP_RECLAIM_SHRINK);
+ try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
+
curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
/* Usage is reduced ? */
if (curusage >= oldusage)
@@ -3954,8 +3919,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
if (signal_pending(current))
return -EINTR;
- progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
- false);
+ progress = try_to_free_mem_cgroup_pages(memcg, 1,
+ GFP_KERNEL, true);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 06123f20a326..dcb47074ae03 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2759,21 +2759,22 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
}
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages,
gfp_t gfp_mask,
- bool noswap)
+ bool may_swap)
{
struct zonelist *zonelist;
unsigned long nr_reclaimed;
int nid;
struct scan_control sc = {
- .nr_to_reclaim = SWAP_CLUSTER_MAX,
+ .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
.target_mem_cgroup = memcg,
.priority = DEF_PRIORITY,
.may_writepage = !laptop_mode,
.may_unmap = 1,
- .may_swap = !noswap,
+ .may_swap = may_swap,
};
/*
--
2.1.0
^ permalink raw reply related [flat|nested] 19+ messages in thread* Re: [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure
2014-09-24 15:08 ` [patch 3/3] mm: memcontrol: fix transparent huge page allocations under pressure Johannes Weiner
@ 2014-09-29 13:57 ` Michal Hocko
[not found] ` <20140929135707.GA25956-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2014-09-29 13:57 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Greg Thelen, Vladimir Davydov, Dave Hansen,
linux-mm, cgroups, linux-kernel
On Wed 24-09-14 11:08:58, Johannes Weiner wrote:
> In a memcg with even just moderate cache pressure, success rates for
> transparent huge page allocations drop to zero, wasting a lot of
> effort that the allocator puts into assembling these pages.
>
> The reason for this is that the memcg reclaim code was never designed
> for higher-order charges. It reclaims in small batches until there is
> room for at least one page. Huge page charges only succeed when these
> batches add up over a series of huge faults, which is unlikely under
> any significant load involving order-0 allocations in the group.
>
> Remove that loop on the memcg side in favor of passing the actual
> reclaim goal to direct reclaim, which is already set up and optimized
> to meet higher-order goals efficiently.
I had concerns the last time you posted a similar patch
(http://marc.info/?l=linux-mm&m=140803277013080&w=2) but I do not see
any of them mentioned or addressed here, especially the unexpectedly
long stalls and excessive swapout. A 512-page target for direct reclaim
is too much, especially for smaller memcgs, where we would need to drop
the priority considerably to even scan that many pages.
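For a sense of scale: memcg reclaim scans roughly lru_size >> priority
pages per pass, so a small group with ~16k pages on its LRUs (64MB)
scans only a handful of pages at DEF_PRIORITY (12), and the priority
has to fall to 5 or below before a single pass even looks at 512 pages.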
THP charges close to the limit are definitely a problem, but this is
way too risky a fix IMO. Maybe a better approach would be to limit the
reclaim to (clean) page cache (aka MEM_CGROUP_RECLAIM_NOSWAP). The risk
of long stalls should be much lower and excessive swapouts shouldn't
happen at all. What is the point of swapping out a large number of
pages just to allow a THP charge which may then be used very sparsely?
> This brings memcg's THP policy in line with the system policy: if the
> allocator painstakingly assembles a hugepage, memcg will at least make
> an honest effort to charge it. As a result, transparent hugepage
> allocation rates amid cache activity are drastically improved:
>
>                            vanilla                 patched
> pgalloc             4717530.80 ( +0.00%)    4451376.40 (  -5.64%)
> pgfault              491370.60 ( +0.00%)     225477.40 ( -54.11%)
> pgmajfault                2.00 ( +0.00%)          1.80 (  -6.67%)
> thp_fault_alloc           0.00 ( +0.00%)        531.60 (+100.00%)
> thp_fault_fallback      749.00 ( +0.00%)        217.40 ( -70.88%)
What is the load and configuration that you have measured?
> [ Note: this may in turn increase memory consumption from internal
> fragmentation, which is an inherent risk of transparent hugepages.
> Some setups may have to adjust the memcg limits accordingly to
> accommodate this - or, if the machine is already packed to capacity,
> disable the transparent huge page feature. ]
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
> include/linux/swap.h | 6 +++--
> mm/memcontrol.c | 69 +++++++++++++---------------------------------------
> mm/vmscan.c | 7 +++---
> 3 files changed, 25 insertions(+), 57 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ea4f926e6b9b..37a585beef5c 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -327,8 +327,10 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
> extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *mask);
> extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> -extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
> - gfp_t gfp_mask, bool noswap);
> +extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> + unsigned long nr_pages,
> + gfp_t gfp_mask,
> + bool may_swap);
> extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> gfp_t gfp_mask, bool noswap,
> struct zone *zone,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 89c920156c2a..c2c75262a209 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -478,14 +478,6 @@ enum res_type {
> #define OOM_CONTROL (0)
>
> /*
> - * Reclaim flags for mem_cgroup_hierarchical_reclaim
> - */
> -#define MEM_CGROUP_RECLAIM_NOSWAP_BIT 0x0
> -#define MEM_CGROUP_RECLAIM_NOSWAP (1 << MEM_CGROUP_RECLAIM_NOSWAP_BIT)
> -#define MEM_CGROUP_RECLAIM_SHRINK_BIT 0x1
> -#define MEM_CGROUP_RECLAIM_SHRINK (1 << MEM_CGROUP_RECLAIM_SHRINK_BIT)
> -
> -/*
> * The memcg_create_mutex will be held whenever a new cgroup is created.
> * As a consequence, any change that needs to protect against new child cgroups
> * appearing has to hold it as well.
> @@ -1791,40 +1783,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> NULL, "Memory cgroup out of memory");
> }
>
> -static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
> - gfp_t gfp_mask,
> - unsigned long flags)
> -{
> - unsigned long total = 0;
> - bool noswap = false;
> - int loop;
> -
> - if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
> - noswap = true;
> -
> - for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> - if (loop)
> - drain_all_stock_async(memcg);
> - total += try_to_free_mem_cgroup_pages(memcg, gfp_mask, noswap);
> - /*
> - * Allow limit shrinkers, which are triggered directly
> - * by userspace, to catch signals and stop reclaim
> - * after minimal progress, regardless of the margin.
> - */
> - if (total && (flags & MEM_CGROUP_RECLAIM_SHRINK))
> - break;
> - if (mem_cgroup_margin(memcg))
> - break;
> - /*
> - * If nothing was reclaimed after two attempts, there
> - * may be no reclaimable pages in this hierarchy.
> - */
> - if (loop && !total)
> - break;
> - }
> - return total;
> -}
> -
> /**
> * test_mem_cgroup_node_reclaimable
> * @memcg: the target memcg
> @@ -2527,8 +2485,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> struct mem_cgroup *mem_over_limit;
> struct res_counter *fail_res;
> unsigned long nr_reclaimed;
> - unsigned long flags = 0;
> unsigned long long size;
> + bool may_swap = true;
> + bool drained = false;
> int ret = 0;
>
> if (mem_cgroup_is_root(memcg))
> @@ -2547,7 +2506,7 @@ retry:
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> } else {
> mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
> - flags |= MEM_CGROUP_RECLAIM_NOSWAP;
> + may_swap = false;
> }
>
> if (batch > nr_pages) {
> @@ -2572,11 +2531,18 @@ retry:
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> - nr_reclaimed = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
> + nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> + gfp_mask, may_swap);
>
> if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> goto retry;
>
> + if (!drained) {
> + drain_all_stock_async(mem_over_limit);
> + drained = true;
> + goto retry;
> + }
> +
> if (gfp_mask & __GFP_NORETRY)
> goto nomem;
> /*
> @@ -3652,8 +3618,8 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_reclaim(memcg, GFP_KERNEL,
> - MEM_CGROUP_RECLAIM_SHRINK);
> + try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true);
> +
> curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -3703,9 +3669,8 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> if (!ret)
> break;
>
> - mem_cgroup_reclaim(memcg, GFP_KERNEL,
> - MEM_CGROUP_RECLAIM_NOSWAP |
> - MEM_CGROUP_RECLAIM_SHRINK);
> + try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, false);
> +
> curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
> /* Usage is reduced ? */
> if (curusage >= oldusage)
> @@ -3954,8 +3919,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> if (signal_pending(current))
> return -EINTR;
>
> - progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
> - false);
> + progress = try_to_free_mem_cgroup_pages(memcg, 1,
> + GFP_KERNEL, true);
> if (!progress) {
> nr_retries--;
> /* maybe some writeback is necessary */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 06123f20a326..dcb47074ae03 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2759,21 +2759,22 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
> }
>
> unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> + unsigned long nr_pages,
> gfp_t gfp_mask,
> - bool noswap)
> + bool may_swap)
> {
> struct zonelist *zonelist;
> unsigned long nr_reclaimed;
> int nid;
> struct scan_control sc = {
> - .nr_to_reclaim = SWAP_CLUSTER_MAX,
> + .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
> .target_mem_cgroup = memcg,
> .priority = DEF_PRIORITY,
> .may_writepage = !laptop_mode,
> .may_unmap = 1,
> - .may_swap = !noswap,
> + .may_swap = may_swap,
> };
>
> /*
> --
> 2.1.0
>
--
Michal Hocko
SUSE Labs