Linux-mm Archive on lore.kernel.org
* Re: [RFC PATCH] mm: handle simple case in free_pcppages_bulk()
From: Minchan Kim @ 2011-02-10 13:38 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel
In-Reply-To: <1297343929.1449.3.camel@leonhard>

On Thu, Feb 10, 2011 at 10:18 PM, Namhyung Kim <namhyung@gmail.com> wrote:
> 2011-02-10 (Thu), 22:10 +0900, Minchan Kim:
>> Hello Namhyung,
>>
>
> Hi Minchan,
>
>
>> On Thu, Feb 10, 2011 at 8:46 PM, Namhyung Kim <namhyung@gmail.com> wrote:
>> > Now I'm seeing that there are some cases that free all pages in
>> > the pcp lists. In that case, just free all pages in the lists instead
>> > of bothering with round-robin list traversal.
>>
>> I thought about that but I didn't send the patch.
>> That's because most callers of free_pcppages_bulk(,
>> pcp->count,..) are in the slow path, so it adds comparison overhead to
>> the fast path while the gain is limited to the slow path.
>>
>
> Hmm.. How about adding unlikely() then? Doesn't it help much here?

Yes, it would help, but I am not sure how much.
AFAIR, when Mel submitted the patch, he tried to prove its effectiveness
with some experiments and profiler data.
If you really want this, I think we need some numbers.
I am not sure it's worth it.
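
For readers unfamiliar with the hint under discussion, here is a minimal
sketch of how unlikely() wraps a rarely-true test. The likely()/unlikely()
macros mirror the usual definition built on the GCC/Clang builtin
__builtin_expect; struct toy_pcp and toy_free_bulk() are hypothetical
stand-ins, not the code from the RFC patch. Note that the hint only tells the
compiler which branch to lay out as the straight-line path. The comparison
itself still executes, which is why measurements would be needed to judge
the cost.

```c
/* Branch hints in the usual style; __builtin_expect is a GCC/Clang builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for a per-cpu page list: only a counter. */
struct toy_pcp {
	int count;	/* pages currently held */
};

/*
 * Free to_free pages from the toy list.
 * Returns 1 when the "free everything" shortcut was taken, 0 otherwise.
 */
static int toy_free_bulk(struct toy_pcp *pcp, int to_free)
{
	/* Assumed to be the rare case, so it is marked unlikely(). */
	if (unlikely(to_free == pcp->count)) {
		pcp->count = 0;
		return 1;
	}

	/* Common case: partial drain. */
	pcp->count -= to_free;
	return 0;
}
```

Calling toy_free_bulk() with to_free equal to the current count takes the
shortcut; any smaller value takes the common path.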




-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 1/5] pagewalk: only split huge pages when necessary
From: Mel Gorman @ 2011-02-10 13:34 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Dave Hansen, linux-kernel, linux-mm, Michael J Wolf
In-Reply-To: <20110210131928.GV3347@random.random>

On Thu, Feb 10, 2011 at 02:19:28PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2011 at 11:11:25AM +0000, Mel Gorman wrote:
> > Before we goto this retry, there is a cond_resched(). Just to confirm,
> > we are depending on mmap_sem to prevent khugepaged promoting this back to
> > a hugepage, right? I don't see a problem with that but I want to be
> > sure.
> 
> Correct, and we depend on that everywhere as wait_split_huge_page has
> to run without holding spinlocks.
> 

In that case;

Acked-by: Mel Gorman <mel@csn.ul.ie>

Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
From: Mel Gorman @ 2011-02-10 13:33 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel
In-Reply-To: <20110210124838.GU3347@random.random>

On Thu, Feb 10, 2011 at 01:48:38PM +0100, Andrea Arcangeli wrote:
> On Thu, Feb 10, 2011 at 10:21:10AM +0000, Mel Gorman wrote:
> > We should not be ending up in a situation with the LRU list of only
> > page_evictable pages and that situation persisting causing excessive (or
> > infinite) looping. As unevictable pages are encountered on the LRU list,
> > they should be moved to the unevictable lists by putback_lru_page().  Are you
> > aware of a situation where this becomes broken?
> > 
> > I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and they
> > all get moved. In this case, nr_scanned is positive and we continue
> > to scan but this is expected and desirable: Reclaim/compaction needs more
> > pages to be freed before it starts compaction. If it stops scanning early,
> > then it would just fail the allocation later. This is what the "NOTE" is about.
> > 
> > I prefer Johannes' fix for the observed problem.
> 
> should_continue_reclaim is only needed for compaction. It tries to
> free enough pages so that compaction can succeed in its defrag
> attempt.

Correct.

> So breaking the loop faster isn't going to cause failures for
> 0 order pages.

Also true. I commented on this in the "NOTE" your patch deletes, along with a
suggestion that an alternative would be to break early unless GFP_REPEAT is set.

> My worry is that we loop too much in shrink_zone just
> for compaction even when we don't make any progress. shrink_zone would
> never scan more than SWAP_CLUSTER_MAX pages, before this change.

Sort of. Lumpy reclaim would have scanned more than SWAP_CLUSTER_MAX so
scanning was still pretty high. The other costs of lumpy reclaim would hide
it of course.

> Now
> it can loop over the whole lru as long as we're scanning stuff.

True, the alternative being failing the allocation. Returning sooner is of
course an option, but it would be preferable to see a case where the logic
after Johannes' patch is failing.

> Ok to
> overboost shrink_zone if we're making progress to allow compaction at
> the next round, but if we don't visibly make progress, I'm concerned
> that it may be too aggressive to scan the whole list. The performance
> benefit of having a hugepage isn't worth the cost of scanning all pages in
> the lru when before we would have broken the loop and declared failure
> after only SWAP_CLUSTER_MAX pages, and then we would have fallen back
> to an order-0 allocation.

What about other cases such as order-1 allocations for stack or order-3
allocations for those network cards using jumbo frames without
scatter/gather?

Don't get me wrong, I see your point but I'm wondering if there really are
cases where we routinely scan an entire LRU list of unevictable pages that
are somehow not being migrated properly to the unevictable lists. If
this is happening, we are also in trouble for reclaiming for order-0
pages, right?

> The fix may help of course, maybe it's enough
> for his case I don't know, but I don't see it making a whole lot of
> difference, except now it will stop when the lru is practically empty
> which clearly is an improvement. I think we shouldn't be so worried
> about succeeding compaction, the important thing is we don't waste
> time in compaction if there's not enough free memory but
> compaction_suitable used by both logics should be enough for that.
> 
> I'd rather prefer that if hugetlbfs has special needs it uses a __GFP_

It uses GFP_REPEAT. That is why I specifically mentioned it in the "NOTE"
as an alternative for how we could break early while still being aggressive
when required. The only reasons it's not that way now are that a) I didn't
consider an LRU mostly full of unevictable pages to be the normal case and
b) allocations such as order-3 are preferable not to fail.

> flag or similar that increases how compaction is strict in succeeding,
> up to scanning the whole lru in one go in order to make some free
> memory for compaction to succeed.
> 
> Going ahead with the scan until compaction_suitable is true instead
> makes sense when there's absence of memory pressure and nr_reclaimed
> is never zero.
> 
> Maybe we should try a bit more times than just nr_reclaim but going
> over the whole lru, sounds a bit extreme.
> 

Where should we draw the line? We could come up with a ratio of the lists
depending on priority but it'd be hard to measure the gain or loss
without having a profile of a problem case to look at.

> The issue isn't just for unevictable pages, that will be refiled
> during the scan but it will also happen in presence of lots of
> referenced pages. For example if we don't apply my fix, the current
> code can take down all young bits in all ptes in one go in the whole
> system before returning from shrink_zone, that is too much in my view,
> and losing all that information in one go (not to mention the cost
> associated with losing it) can hardly be offset by the improvement
> given by 1 more hugepage.
> 
> But please let me know if I've misread something...
> 

I don't think you have misread anything but if we're going to weaken
this logic, I'd at least like to see the GFP_REPEAT option tried - i.e.
preserve being aggressive if set. I'm also not convinced we routinely get
into a situation where the LRU consists of almost all unevictable pages
and if we are in this situation, that is a serious problem on its own. It
would also be preferable if we could get latency figures on alloc_pages for
hugepage-sized allocations and a count of how many are succeeding or failing
to measure the impact (if any).
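
To make the "GFP_REPEAT option" concrete, here is a minimal sketch of the
termination policy being discussed: stay aggressive only when the caller
asked for it, otherwise stop once a pass reclaims nothing. Everything in it
is an assumption for illustration. TOY_GFP_REPEAT is a made-up flag bit
standing in for __GFP_REPEAT, and toy_should_continue() is not the kernel's
should_continue_reclaim(), which also considers compaction_suitable() and
other state that is left out here.

```c
#include <stdbool.h>

/* Hypothetical flag bit standing in for __GFP_REPEAT. */
#define TOY_GFP_REPEAT 0x1u

/*
 * Decide whether another reclaim pass is worthwhile.
 * nr_reclaimed and nr_scanned are the totals from the pass just finished.
 */
static bool toy_should_continue(unsigned int gfp_flags,
				unsigned long nr_reclaimed,
				unsigned long nr_scanned)
{
	if (gfp_flags & TOY_GFP_REPEAT) {
		/* Aggressive: keep going while anything was scanned at all. */
		return nr_reclaimed > 0 || nr_scanned > 0;
	}

	/* Default: give up as soon as a pass frees nothing. */
	return nr_reclaimed > 0;
}
```

With this shape, a pass that scans pages but frees none ends the loop for an
ordinary caller and continues for a TOY_GFP_REPEAT caller.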

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/5] pagewalk: only split huge pages when necessary
From: Andrea Arcangeli @ 2011-02-10 13:19 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Dave Hansen, linux-kernel, linux-mm, Michael J Wolf
In-Reply-To: <20110210111125.GC17873@csn.ul.ie>

On Thu, Feb 10, 2011 at 11:11:25AM +0000, Mel Gorman wrote:
> Before we goto this retry, there is a cond_resched(). Just to confirm,
> we are depending on mmap_sem to prevent khugepaged promoting this back to
> a hugepage, right? I don't see a problem with that but I want to be
> sure.

Correct, and we depend on that everywhere as wait_split_huge_page has
to run without holding spinlocks.


* Re: [RFC PATCH] mm: handle simple case in free_pcppages_bulk()
From: Namhyung Kim @ 2011-02-10 13:18 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel
In-Reply-To: <AANLkTikEigbPsNMqqkmixYbCfD7Dz12YMcW2+GZbhUQq@mail.gmail.com>

2011-02-10 (Thu), 22:10 +0900, Minchan Kim:
> Hello Namhyung,
> 

Hi Minchan,


> On Thu, Feb 10, 2011 at 8:46 PM, Namhyung Kim <namhyung@gmail.com> wrote:
> > Now I'm seeing that there are some cases that free all pages in
> > the pcp lists. In that case, just free all pages in the lists instead
> > of bothering with round-robin list traversal.
> 
> I thought about that but I didn't send the patch.
> That's because most callers of free_pcppages_bulk(,
> pcp->count,..) are in the slow path, so it adds comparison overhead to
> the fast path while the gain is limited to the slow path.
> 

Hmm.. How about adding unlikely() then? Doesn't it help much here?


-- 
Regards,
Namhyung Kim



* Re: [RFC PATCH] mm: handle simple case in free_pcppages_bulk()
From: Minchan Kim @ 2011-02-10 13:10 UTC (permalink / raw)
  To: Namhyung Kim; +Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel
In-Reply-To: <1297338408-3590-1-git-send-email-namhyung@gmail.com>

Hello Namhyung,

On Thu, Feb 10, 2011 at 8:46 PM, Namhyung Kim <namhyung@gmail.com> wrote:
> Now I'm seeing that there are some cases that free all pages in
> the pcp lists. In that case, just free all pages in the lists instead
> of bothering with round-robin list traversal.

I thought about that but I didn't send the patch.
That's because most callers of free_pcppages_bulk(,
pcp->count,..) are in the slow path, so it adds comparison overhead to
the fast path while the gain is limited to the slow path.
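
The trade-off can be illustrated with a toy model. The sketch below is a
deliberate simplification and not the kernel's free_pcppages_bulk(): struct
toy_pcp, TOY_NR_LISTS and both helpers are hypothetical names, the lists are
reduced to counters, and the round-robin variant takes one page per visit
rather than batching. Both helpers return the number of list visits as a
rough cost measure.

```c
#define TOY_NR_LISTS 3

/* Hypothetical per-cpu structure: count must equal the sum of lists[]. */
struct toy_pcp {
	int count;			/* total pages across all lists */
	int lists[TOY_NR_LISTS];	/* pages held on each list */
};

/* Round-robin: visit the lists in turn, taking one page from each non-empty one. */
static int toy_free_round_robin(struct toy_pcp *pcp, int to_free)
{
	int visits = 0;
	int i = 0;

	while (to_free > 0 && pcp->count > 0) {
		visits++;
		if (pcp->lists[i] > 0) {
			pcp->lists[i]--;
			pcp->count--;
			to_free--;
		}
		i = (i + 1) % TOY_NR_LISTS;
	}
	return visits;
}

/* Simple case: everything is to be freed, so empty each list in one pass. */
static int toy_free_all(struct toy_pcp *pcp)
{
	int visits = 0;

	for (int i = 0; i < TOY_NR_LISTS; i++) {
		visits++;
		pcp->count -= pcp->lists[i];
		pcp->lists[i] = 0;
	}
	return visits;
}
```

Draining a full structure through toy_free_all() touches each list once,
while the round-robin helper revisits lists (including empty ones) until the
requested number of pages is gone. Choosing between the two requires the
extra comparison that the reply above is concerned about.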

-- 
Kind regards,
Minchan Kim


* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
From: Andrea Arcangeli @ 2011-02-10 12:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel
In-Reply-To: <20110210102109.GB17873@csn.ul.ie>

On Thu, Feb 10, 2011 at 10:21:10AM +0000, Mel Gorman wrote:
> We should not be ending up in a situation with the LRU list of only
> page_evictable pages and that situation persisting causing excessive (or
> infinite) looping. As unevictable pages are encountered on the LRU list,
> they should be moved to the unevictable lists by putback_lru_page().  Are you
> aware of a situation where this becomes broken?
> 
> I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and they
> all get moved. In this case, nr_scanned is positive and we continue
> to scan but this is expected and desirable: Reclaim/compaction needs more
> pages to be freed before it starts compaction. If it stops scanning early,
> then it would just fail the allocation later. This is what the "NOTE" is about.
> 
> I prefer Johannes' fix for the observed problem.

should_continue_reclaim is only needed for compaction. It tries to
free enough pages so that compaction can succeed in its defrag
attempt. So breaking the loop faster isn't going to cause failures for
0 order pages. My worry is that we loop too much in shrink_zone just
for compaction even when we don't make any progress. shrink_zone would
never scan more than SWAP_CLUSTER_MAX pages, before this change. Now
it can loop over the whole lru as long as we're scanning stuff. Ok to
overboost shrink_zone if we're making progress to allow compaction at
the next round, but if we don't visibly make progress, I'm concerned
that it may be too aggressive to scan the whole list. The performance
benefit of having a hugepage isn't worth the cost of scanning all pages in
the lru when before we would have broken the loop and declared failure
after only SWAP_CLUSTER_MAX pages, and then we would have fallen back
to an order-0 allocation. The fix may help of course, maybe it's enough
for his case I don't know, but I don't see it making a whole lot of
difference, except now it will stop when the lru is practically empty
which clearly is an improvement. I think we shouldn't be so worried
about succeeding compaction, the important thing is we don't waste
time in compaction if there's not enough free memory but
compaction_suitable used by both logics should be enough for that.

I'd rather prefer that if hugetlbfs has special needs it uses a __GFP_
flag or similar that increases how compaction is strict in succeeding,
up to scanning the whole lru in one go in order to make some free
memory for compaction to succeed.

Going ahead with the scan until compaction_suitable is true instead
makes sense when there's absence of memory pressure and nr_reclaimed
is never zero.

Maybe we should try a bit more times than just nr_reclaim but going
over the whole lru, sounds a bit extreme.

The issue isn't just for unevictable pages, that will be refiled
during the scan but it will also happen in presence of lots of
referenced pages. For example if we don't apply my fix, the current
code can take down all young bits in all ptes in one go in the whole
system before returning from shrink_zone, that is too much in my view,
and losing all that information in one go (not to mention the cost
associated with losing it) can hardly be offset by the improvement
given by 1 more hugepage.

But please let me know if I've misread something...

Thanks,
Andrea


* Re: [patch 0/4] memcg: operate on page quantities internally
From: Johannes Weiner @ 2011-02-10 12:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Daisuke Nishimura, Balbir Singh, linux-mm,
	linux-kernel
In-Reply-To: <20110210085034.a6c5d703.kamezawa.hiroyu@jp.fujitsu.com>

On Thu, Feb 10, 2011 at 08:50:34AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed,  9 Feb 2011 12:01:49 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> > If I did not miss anything, this should leave only res_counter and
> > user-visible stuff in bytes.  The ABI probably won't change, so next
> > up is converting res_counter to operate on page quantities.
> 
> Hmm, I think this should be done, but I think it should be postponed, too.
> Because, IIUC, some people will try to discuss charging against kernel objects
> at the next mm-summit. IMHO, it will be done against PAGE, not against
> Object, even if we do kernel object accounting. So this patch is okay for me.
> But I think it's better to go ahead after we confirm the way we go.
> What do you think?

That makes sense, let's leave res_counter alone until we have hashed
this out.

> Anyway, I welcome this patch.

Thanks for reviewing,

	Hannes


* Re: [patch 0/4] memcg: operate on page quantities internally
From: Johannes Weiner @ 2011-02-10 12:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, linux-mm,
	linux-kernel
In-Reply-To: <20110209133757.735b08ab.akpm@linux-foundation.org>

On Wed, Feb 09, 2011 at 01:37:57PM -0800, Andrew Morton wrote:
> On Wed,  9 Feb 2011 12:01:49 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Hi,
> > 
> > this patch set converts the memcg charge and uncharge paths to operate
> > on multiples of pages instead of bytes.  It already was a good idea
> > before, but with the merge of THP we made a real mess by specifying
> > huge pages alternatingly in bytes or in number of regular pages.
> > 
> > If I did not miss anything, this should leave only res_counter and
> > user-visible stuff in bytes.  The ABI probably won't change, so next
> > up is converting res_counter to operate on page quantities.
> > 
> 
> I worry that there will be unconverted code and we'll end up adding
> bugs.
> 
> A way to minimise the risk is to force compilation errors and warnings:
> rename fields and functions, reorder function arguments.  Did your
> patches do this as much as they could have?

I sent you fixes/replacements for 1/4 and 4/4. 2/4 and 3/4 adjusted
the names of changed structure members already.
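
As background on the unit mix-up the cover letter describes, here is a small
sketch of the two representations and the conversions between them. It
assumes 4 KiB pages (a shift of 12), which is common but architecture
dependent. The TOY_ names and helpers are hypothetical and are not the
kernel's PAGE_SHIFT or PAGE_SIZE macros.

```c
/* Assumption: 4 KiB pages. Real kernels take this from the architecture. */
#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Byte count to whole pages (truncating; callers pass page multiples). */
static unsigned long toy_bytes_to_pages(unsigned long long bytes)
{
	return (unsigned long)(bytes >> TOY_PAGE_SHIFT);
}

/* Page count to bytes, widened first so large counts do not overflow. */
static unsigned long long toy_pages_to_bytes(unsigned long nr_pages)
{
	return (unsigned long long)nr_pages << TOY_PAGE_SHIFT;
}
```

Under the same assumption, a batch of 32 pages is 131072 bytes and a 2 MiB
huge page is 512 regular pages. Passing one representation where the other
is expected is off by a factor of 4096, and both are plain integers, so the
compiler cannot catch it.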

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [patch 4/4] memcg: unify charge/uncharge quantities to units of pages
From: Johannes Weiner @ 2011-02-10 12:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, linux-mm,
	linux-kernel
In-Reply-To: <1297249313-23746-5-git-send-email-hannes@cmpxchg.org>

The update to 1/4 made one hunk of this one no longer apply, so to
save you the hassle, here is a complete replacement.

What I changed to visibly break new old-API users:

1. mem_cgroup_margin: sufficiently new to not have new users
developed that assume the return value to be in units of bytes, so I
left it alone

2. __mem_cgroup_do_charge: dropped the underscores

3. __mem_cgroup_try_charge: moved @nr_pages parameter so that using
the old function signature would complain about passing integers for
pointers and vice versa

4. __mem_cgroup_commit_charge: same as 3.

5. mem_cgroup_move_account: same as 4.

6. __do_uncharge: renamed to mem_cgroup_do_uncharge
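
The argument-reordering trick in items 3 to 5 can be sketched as follows.
This is an illustration with hypothetical names (struct toy_memcg,
toy_try_charge) rather than the memcg functions themselves. The point is
that once the integer page count sits where a pointer used to be, a caller
still written against the old argument order no longer matches the
prototype.

```c
#include <stddef.h>

/* Hypothetical stand-in for a memory cgroup: just a page counter. */
struct toy_memcg {
	unsigned long charged_pages;
};

/*
 * Old shape (size in bytes, passed last):
 *   int toy_try_charge(int flags, struct toy_memcg **memcg,
 *                      int oom, int page_size);
 *
 * New shape: the page count moves ahead of the pointer. A stale call like
 *   toy_try_charge(flags, &memcg, 1, 4096);
 * now passes a pointer where an integer is expected and an integer where a
 * pointer is expected, which compilers diagnose (as a warning or an error,
 * depending on compiler and flags).
 */
static int toy_try_charge(int flags, unsigned int nr_pages,
			  struct toy_memcg **memcg, int oom)
{
	(void)flags;	/* unused in this sketch */
	(void)oom;	/* unused in this sketch */

	if (memcg == NULL || *memcg == NULL)
		return -1;

	(*memcg)->charged_pages += nr_pages;
	return 0;
}
```

A call written for the new order, such as toy_try_charge(0, 512, &ptr, 1),
compiles cleanly and charges 512 pages to the toy group.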

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] memcg: unify charge/uncharge quantities to units of pages

There is no clear pattern when we pass a page count and when we pass a
byte count that is a multiple of PAGE_SIZE.

We never charge or uncharge subpage quantities, so convert it all to
page counts.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |  135 ++++++++++++++++++++++++++----------------------------
 1 files changed, 65 insertions(+), 70 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 63c65ab..78a79ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1092,16 +1092,16 @@ unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
  * @mem: the memory cgroup
  *
  * Returns the maximum amount of memory @mem can be charged with, in
- * bytes.
+ * pages.
  */
-static unsigned long long mem_cgroup_margin(struct mem_cgroup *mem)
+static unsigned long mem_cgroup_margin(struct mem_cgroup *mem)
 {
 	unsigned long long margin;
 
 	margin = res_counter_margin(&mem->res);
 	if (do_swap_account)
 		margin = min(margin, res_counter_margin(&mem->memsw));
-	return margin;
+	return margin >> PAGE_SHIFT;
 }
 
 static unsigned int get_swappiness(struct mem_cgroup *memcg)
@@ -1637,7 +1637,7 @@ EXPORT_SYMBOL(mem_cgroup_update_page_stat);
  * size of first charge trial. "32" comes from vmscan.c's magic value.
  * TODO: maybe necessary to use big numbers in big irons.
  */
-#define CHARGE_SIZE	(32 * PAGE_SIZE)
+#define CHARGE_BATCH	32U
 struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
@@ -1812,9 +1812,10 @@ enum {
 	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
 };
 
-static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
-				int csize, bool oom_check)
+static int mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
+				unsigned int nr_pages, bool oom_check)
 {
+	unsigned long csize = nr_pages * PAGE_SIZE;
 	struct mem_cgroup *mem_over_limit;
 	struct res_counter *fail_res;
 	unsigned long flags = 0;
@@ -1835,14 +1836,13 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 	} else
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 	/*
-	 * csize can be either a huge page (HPAGE_SIZE), a batch of
-	 * regular pages (CHARGE_SIZE), or a single regular page
-	 * (PAGE_SIZE).
+	 * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
+	 * of regular pages (CHARGE_BATCH), or a single regular page (1).
 	 *
 	 * Never reclaim on behalf of optional batching, retry with a
 	 * single page instead.
 	 */
-	if (csize == CHARGE_SIZE)
+	if (nr_pages == CHARGE_BATCH)
 		return CHARGE_RETRY;
 
 	if (!(gfp_mask & __GFP_WAIT))
@@ -1850,7 +1850,7 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 
 	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
 					      gfp_mask, flags);
-	if (mem_cgroup_margin(mem_over_limit) >= csize)
+	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
 	/*
 	 * Even though the limit is exceeded at this point, reclaim
@@ -1861,7 +1861,7 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 	 * unlikely to succeed so close to the limit, and we fall back
 	 * to regular pages anyway in case of failure.
 	 */
-	if (csize == PAGE_SIZE && ret)
+	if (nr_pages == 1 && ret)
 		return CHARGE_RETRY;
 
 	/*
@@ -1887,13 +1887,14 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
  */
 static int __mem_cgroup_try_charge(struct mm_struct *mm,
 				   gfp_t gfp_mask,
-				   struct mem_cgroup **memcg, bool oom,
-				   int page_size)
+				   unsigned int nr_pages,
+				   struct mem_cgroup **memcg,
+				   bool oom)
 {
+	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem = NULL;
 	int ret;
-	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
 
 	/*
 	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
@@ -1918,7 +1919,7 @@ again:
 		VM_BUG_ON(css_is_removed(&mem->css));
 		if (mem_cgroup_is_root(mem))
 			goto done;
-		if (page_size == PAGE_SIZE && consume_stock(mem))
+		if (nr_pages == 1 && consume_stock(mem))
 			goto done;
 		css_get(&mem->css);
 	} else {
@@ -1941,7 +1942,7 @@ again:
 			rcu_read_unlock();
 			goto done;
 		}
-		if (page_size == PAGE_SIZE && consume_stock(mem)) {
+		if (nr_pages == 1 && consume_stock(mem)) {
 			/*
 			 * It seems dagerous to access memcg without css_get().
 			 * But considering how consume_stok works, it's not
@@ -1976,13 +1977,12 @@ again:
 			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
 		}
 
-		ret = __mem_cgroup_do_charge(mem, gfp_mask, csize, oom_check);
-
+		ret = mem_cgroup_do_charge(mem, gfp_mask, batch, oom_check);
 		switch (ret) {
 		case CHARGE_OK:
 			break;
 		case CHARGE_RETRY: /* not in OOM situation but retry */
-			csize = page_size;
+			batch = nr_pages;
 			css_put(&mem->css);
 			mem = NULL;
 			goto again;
@@ -2003,8 +2003,8 @@ again:
 		}
 	} while (ret != CHARGE_OK);
 
-	if (csize > page_size)
-		refill_stock(mem, (csize - page_size) >> PAGE_SHIFT);
+	if (batch > nr_pages)
+		refill_stock(mem, batch - nr_pages);
 	css_put(&mem->css);
 done:
 	*memcg = mem;
@@ -2083,12 +2083,10 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 
 static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 				       struct page *page,
+				       unsigned int nr_pages,
 				       struct page_cgroup *pc,
-				       enum charge_type ctype,
-				       int page_size)
+				       enum charge_type ctype)
 {
-	int nr_pages = page_size >> PAGE_SHIFT;
-
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
@@ -2177,26 +2175,28 @@ void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
+ * @nr_pages: number of regular pages (>1 for huge pages)
  * @pc:	page_cgroup of the page.
  * @from: mem_cgroup which the page is moved from.
  * @to:	mem_cgroup which the page is moved to. @from != @to.
  * @uncharge: whether we should call uncharge and css_put against @from.
- * @charge_size: number of bytes to charge (regular or huge page)
  *
  * The caller must confirm following.
  * - page is not on LRU (isolate_page() is useful.)
- * - compound_lock is held when charge_size > PAGE_SIZE
+ * - compound_lock is held when nr_pages > 1
  *
  * This function doesn't do "charge" nor css_get to new cgroup. It should be
  * done by a caller(__mem_cgroup_try_charge would be usefull). If @uncharge is
  * true, this function does "uncharge" from old cgroup, but it doesn't if
  * @uncharge is false, so a caller should do "uncharge".
  */
-static int mem_cgroup_move_account(struct page *page, struct page_cgroup *pc,
-				   struct mem_cgroup *from, struct mem_cgroup *to,
-				   bool uncharge, int charge_size)
+static int mem_cgroup_move_account(struct page *page,
+				   unsigned int nr_pages,
+				   struct page_cgroup *pc,
+				   struct mem_cgroup *from,
+				   struct mem_cgroup *to,
+				   bool uncharge)
 {
-	int nr_pages = charge_size >> PAGE_SHIFT;
 	unsigned long flags;
 	int ret;
 
@@ -2209,7 +2209,7 @@ static int mem_cgroup_move_account(struct page *page, struct page_cgroup *pc,
 	 * hold it.
 	 */
 	ret = -EBUSY;
-	if (charge_size > PAGE_SIZE && !PageTransHuge(page))
+	if (nr_pages > 1 && !PageTransHuge(page))
 		goto out;
 
 	lock_page_cgroup(pc);
@@ -2267,7 +2267,7 @@ static int mem_cgroup_move_parent(struct page *page,
 	struct cgroup *cg = child->css.cgroup;
 	struct cgroup *pcg = cg->parent;
 	struct mem_cgroup *parent;
-	int page_size = PAGE_SIZE;
+	unsigned int nr_pages;
 	unsigned long flags;
 	int ret;
 
@@ -2281,23 +2281,21 @@ static int mem_cgroup_move_parent(struct page *page,
 	if (isolate_lru_page(page))
 		goto put;
 
-	if (PageTransHuge(page))
-		page_size = HPAGE_SIZE;
+	nr_pages = hpage_nr_pages(page);
 
 	parent = mem_cgroup_from_cont(pcg);
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask,
-				&parent, false, page_size);
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, nr_pages, &parent, false);
 	if (ret || !parent)
 		goto put_back;
 
-	if (page_size > PAGE_SIZE)
+	if (nr_pages > 1)
 		flags = compound_lock_irqsave(page);
 
-	ret = mem_cgroup_move_account(page, pc, child, parent, true, page_size);
+	ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent, true);
 	if (ret)
-		__mem_cgroup_cancel_charge(parent, page_size >> PAGE_SHIFT);
+		__mem_cgroup_cancel_charge(parent, nr_pages);
 
-	if (page_size > PAGE_SIZE)
+	if (nr_pages > 1)
 		compound_unlock_irqrestore(page, flags);
 put_back:
 	putback_lru_page(page);
@@ -2317,13 +2315,13 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask, enum charge_type ctype)
 {
 	struct mem_cgroup *mem = NULL;
-	int page_size = PAGE_SIZE;
+	unsigned int nr_pages = 1;
 	struct page_cgroup *pc;
 	bool oom = true;
 	int ret;
 
 	if (PageTransHuge(page)) {
-		page_size <<= compound_order(page);
+		nr_pages <<= compound_order(page);
 		VM_BUG_ON(!PageTransHuge(page));
 		/*
 		 * Never OOM-kill a process for a huge page.  The
@@ -2335,11 +2333,11 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	pc = lookup_page_cgroup(page);
 	BUG_ON(!pc); /* XXX: remove this and move pc lookup into commit */
 
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, oom, page_size);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, nr_pages, &mem, oom);
 	if (ret || !mem)
 		return ret;
 
-	__mem_cgroup_commit_charge(mem, page, pc, ctype, page_size);
+	__mem_cgroup_commit_charge(mem, page, nr_pages, pc, ctype);
 	return 0;
 }
 
@@ -2455,13 +2453,13 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (!mem)
 		goto charge_cur_mm;
 	*ptr = mem;
-	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, PAGE_SIZE);
+	ret = __mem_cgroup_try_charge(NULL, mask, 1, ptr, true);
 	css_put(&mem->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, ptr, true, PAGE_SIZE);
+	return __mem_cgroup_try_charge(mm, mask, 1, ptr, true);
 }
 
 static void
@@ -2477,7 +2475,7 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr,
 	cgroup_exclude_rmdir(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
-	__mem_cgroup_commit_charge(ptr, page, pc, ctype, PAGE_SIZE);
+	__mem_cgroup_commit_charge(ptr, page, 1, pc, ctype);
 	mem_cgroup_lru_add_after_commit_swapcache(page);
 	/*
 	 * Now swap is on-memory. This means this page may be
@@ -2529,12 +2527,13 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
 	__mem_cgroup_cancel_charge(mem, 1);
 }
 
-static void
-__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
-	      int page_size)
+static void mem_cgroup_do_uncharge(struct mem_cgroup *mem,
+				   unsigned int nr_pages,
+				   const enum charge_type ctype)
 {
 	struct memcg_batch_info *batch = NULL;
 	bool uncharge_memsw = true;
+
 	/* If swapout, usage of swap doesn't decrease */
 	if (!do_swap_account || ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
 		uncharge_memsw = false;
@@ -2558,7 +2557,7 @@ __do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
 	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
 		goto direct_uncharge;
 
-	if (page_size != PAGE_SIZE)
+	if (nr_pages > 1)
 		goto direct_uncharge;
 
 	/*
@@ -2574,9 +2573,9 @@ __do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
 		batch->memsw_nr_pages++;
 	return;
 direct_uncharge:
-	res_counter_uncharge(&mem->res, page_size);
+	res_counter_uncharge(&mem->res, nr_pages * PAGE_SIZE);
 	if (uncharge_memsw)
-		res_counter_uncharge(&mem->memsw, page_size);
+		res_counter_uncharge(&mem->memsw, nr_pages * PAGE_SIZE);
 	if (unlikely(batch->memcg != mem))
 		memcg_oom_recover(mem);
 	return;
@@ -2588,10 +2587,9 @@ direct_uncharge:
 static struct mem_cgroup *
 __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 {
-	int count;
-	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
-	int page_size = PAGE_SIZE;
+	unsigned int nr_pages = 1;
+	struct page_cgroup *pc;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -2600,11 +2598,9 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 		return NULL;
 
 	if (PageTransHuge(page)) {
-		page_size <<= compound_order(page);
+		nr_pages <<= compound_order(page);
 		VM_BUG_ON(!PageTransHuge(page));
 	}
-
-	count = page_size >> PAGE_SHIFT;
 	/*
 	 * Check if our page_cgroup is valid
 	 */
@@ -2637,7 +2633,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 		break;
 	}
 
-	mem_cgroup_charge_statistics(mem, PageCgroupCache(pc), -count);
+	mem_cgroup_charge_statistics(mem, PageCgroupCache(pc), -nr_pages);
 
 	ClearPageCgroupUsed(pc);
 	/*
@@ -2658,7 +2654,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 		mem_cgroup_get(mem);
 	}
 	if (!mem_cgroup_is_root(mem))
-		__do_uncharge(mem, ctype, page_size);
+		mem_cgroup_do_uncharge(mem, nr_pages, ctype);
 
 	return mem;
 
@@ -2850,8 +2846,8 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 int mem_cgroup_prepare_migration(struct page *page,
 	struct page *newpage, struct mem_cgroup **ptr, gfp_t gfp_mask)
 {
-	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
+	struct page_cgroup *pc;
 	enum charge_type ctype;
 	int ret = 0;
 
@@ -2907,7 +2903,7 @@ int mem_cgroup_prepare_migration(struct page *page,
 		return 0;
 
 	*ptr = mem;
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask, ptr, false, PAGE_SIZE);
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, 1, ptr, false);
 	css_put(&mem->css);/* drop extra refcnt */
 	if (ret || *ptr == NULL) {
 		if (PageAnon(page)) {
@@ -2934,7 +2930,7 @@ int mem_cgroup_prepare_migration(struct page *page,
 		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	else
 		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
-	__mem_cgroup_commit_charge(mem, page, pc, ctype, PAGE_SIZE);
+	__mem_cgroup_commit_charge(mem, page, 1, pc, ctype);
 	return ret;
 }
 
@@ -4591,8 +4587,7 @@ one_by_one:
 			batch_count = PRECHARGE_COUNT_AT_ONCE;
 			cond_resched();
 		}
-		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
-					      PAGE_SIZE);
+		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, 1, &mem, false);
 		if (ret || !mem)
 			/* mem_cgroup_clear_mc() will do uncharge later */
 			return -ENOMEM;
@@ -4937,8 +4932,8 @@ retry:
 			if (isolate_lru_page(page))
 				goto put;
 			pc = lookup_page_cgroup(page);
-			if (!mem_cgroup_move_account(page, pc,
-					mc.from, mc.to, false, PAGE_SIZE)) {
+			if (!mem_cgroup_move_account(page, 1, pc,
+						     mc.from, mc.to, false)) {
 				mc.precharge--;
 				/* we uncharge from mc.from later. */
 				mc.moved_charge++;
-- 
1.7.4

^ permalink raw reply related

* Re: [patch 1/4] memcg: keep only one charge cancelling function
From: Johannes Weiner @ 2011-02-10 12:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, Balbir Singh, linux-mm,
	linux-kernel
In-Reply-To: <1297249313-23746-2-git-send-email-hannes@cmpxchg.org>

Andrew, here is a fix for this patch that reverts to using the
underscored version of the cancel function, which already took a page
count.  Code developed in parallel will either use the underscore
version with the unchanged semantics or error out on the no longer
existing version that took a number of bytes.

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] memcg: keep only one charge cancelling function fix

Keep the underscore-version of the charge cancelling function which
took a page count, rather than silently changing the semantics of the
non-underscore-version.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 804e9fc..e600b55 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2020,8 +2020,8 @@ bypass:
  * This function is for that and do uncharge, put css's refcnt.
  * gotten by try_charge().
  */
-static void mem_cgroup_cancel_charge(struct mem_cgroup *mem,
-				     unsigned int nr_pages)
+static void __mem_cgroup_cancel_charge(struct mem_cgroup *mem,
+				       unsigned int nr_pages)
 {
 	if (!mem_cgroup_is_root(mem)) {
 		unsigned long bytes = nr_pages * PAGE_SIZE;
@@ -2090,7 +2090,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		mem_cgroup_cancel_charge(mem, nr_pages);
+		__mem_cgroup_cancel_charge(mem, nr_pages);
 		return;
 	}
 	/*
@@ -2228,7 +2228,7 @@ static int mem_cgroup_move_account(struct page *page, struct page_cgroup *pc,
 	mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
-		mem_cgroup_cancel_charge(from, nr_pages);
+		__mem_cgroup_cancel_charge(from, nr_pages);
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
@@ -2293,7 +2293,7 @@ static int mem_cgroup_move_parent(struct page *page,
 
 	ret = mem_cgroup_move_account(page, pc, child, parent, true, page_size);
 	if (ret)
-		mem_cgroup_cancel_charge(parent, page_size >> PAGE_SHIFT);
+		__mem_cgroup_cancel_charge(parent, page_size >> PAGE_SHIFT);
 
 	if (page_size > PAGE_SIZE)
 		compound_unlock_irqrestore(page, flags);
@@ -2524,7 +2524,7 @@ void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
 		return;
 	if (!mem)
 		return;
-	mem_cgroup_cancel_charge(mem, 1);
+	__mem_cgroup_cancel_charge(mem, 1);
 }
 
 static void
@@ -4803,7 +4803,7 @@ static void __mem_cgroup_clear_mc(void)
 
 	/* we must uncharge all the leftover precharges from mc.to */
 	if (mc.precharge) {
-		mem_cgroup_cancel_charge(mc.to, mc.precharge);
+		__mem_cgroup_cancel_charge(mc.to, mc.precharge);
 		mc.precharge = 0;
 	}
 	/*
@@ -4811,7 +4811,7 @@ static void __mem_cgroup_clear_mc(void)
 	 * we must uncharge here.
 	 */
 	if (mc.moved_charge) {
-		mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
+		__mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
 		mc.moved_charge = 0;
 	}
 	/* we must fixup refcnts and charges */
-- 
1.7.4

^ permalink raw reply related

* Re: [RFC PATCH 0/5] IO-less balance dirty pages
From: Boaz Harrosh @ 2011-02-10 12:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-mm, Wu Fengguang
In-Reply-To: <20110209233006.GC3064@quack.suse.cz>

On 02/10/2011 01:30 AM, Jan Kara wrote:
> On Sun 06-02-11 19:54:37, Boaz Harrosh wrote:
>> On 02/04/2011 03:38 AM, Jan Kara wrote:
>>> The basic idea (implemented in the third patch) is that processes throttled
>>> in balance_dirty_pages() wait for enough IO to complete. The waiting is
>>> implemented as follows: Whenever we decide to throttle a task in
>>> balance_dirty_pages(), task adds itself to a list of tasks that are throttled
>>> against that bdi and goes to sleep waiting to receive specified amount of page
>>> IO completions. Once in a while (currently HZ/10, in patch 5 the interval is
>>> autotuned based on observed IO rate), accumulated page IO completions are
>>> distributed equally among waiting tasks.
>>>
>>> This waiting scheme has been chosen so that waiting time in
>>> balance_dirty_pages() is proportional to
>>>   number_waited_pages * number_of_waiters.
>>> In particular it does not depend on the total number of pages being waited for,
> >>> thus possibly providing fairer results.
>>>
>>> I gave the patches some basic testing (multiple parallel dd's to a single
>>> drive) and they seem to work OK. The dd's get equal share of the disk
> >>> throughput (about 10.5 MB/s, which is a nice result given the disk can do
>>> about 87 MB/s when writing single-threaded), and dirty limit does not get
>>> exceeded. Of course much more testing needs to be done but I hope it's fine
>>> for the first posting :).
>>
>> So what is the disposition of Wu's patches in light of these ones?
> >> * Do they replace Wu's, or do Wu's just get rebased on top of these at a
> >>   later stage?
>   They are meant as a replacement.
> 
>> * Did you find any hard problems with Wu's patches that delay them for
>>   a long time?
>   Wu himself wrote that the current patchset probably won't fly because it
> fluctuates too much. So he decided to try to rewrite patches from per-bdi
> limits to global limits when he has time...
> 
>> * Some of the complicated stuff in Wu's patches are the statistics and
>>   rate control mechanics. Are these the troubled area? Because some of
>>   these are actually some things that I'm interested in, and that appeal
>>   to me the most.
>   Basically yes, this logic seems to be the problematic one.
> 
> 								Honza

Thanks, dear Jan, for your reply. I would love to talk about all this,
and other writeback issues, at LSF. Keep us posted with the results of
your investigations.

Cheers
Boaz

^ permalink raw reply

* [RFC PATCH] mm: handle simple case in free_pcppages_bulk()
From: Namhyung Kim @ 2011-02-10 11:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mel Gorman, linux-mm, linux-kernel

There are some cases where all pages in the pcp lists are freed.
In that case, just free all pages in the lists instead of
bothering with the round-robin list traversal.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
---
 mm/page_alloc.c |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e8b02771ccea..959c54450ddf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -596,6 +596,28 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
+	/* Simple case: Free all */
+	if (to_free == pcp->count) {
+		LIST_HEAD(freelist);
+
+		for (; migratetype < MIGRATE_PCPTYPES; migratetype++)
+			list_splice_init(&pcp->lists[migratetype],
+					 &freelist);
+
+		while (!list_empty(&freelist)) {
+			struct page *page;
+
+			page = list_first_entry(&freelist, struct page, lru);
+			/* must delete as __free_one_page list manipulates */
+			list_del(&page->lru);
+			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
+			__free_one_page(page, zone, 0, page_private(page));
+			trace_mm_page_pcpu_drain(page, 0, page_private(page));
+			to_free--;
+		}
+		VM_BUG_ON(to_free);
+	}
+
 	while (to_free) {
 		struct page *page;
 		struct list_head *list;
-- 
1.7.4

^ permalink raw reply related

* Re: [PATCH 5/5] have smaps show transparent huge pages
From: Mel Gorman @ 2011-02-10 11:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli,
	Johannes Weiner, David Rientjes
In-Reply-To: <20110209195413.6D3CB37F@kernel>

On Wed, Feb 09, 2011 at 11:54:13AM -0800, Dave Hansen wrote:
> 
> Now that the mere act of _looking_ at /proc/$pid/smaps will not
> destroy transparent huge pages, tell how much of the VMA is
> actually mapped with them.
> 
> This way, we can make sure that we're getting THPs where we
> expect to see them.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> Acked-by: David Rientjes <rientjes@google.com>
> ---
> 
>  linux-2.6.git-dave/fs/proc/task_mmu.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff -puN fs/proc/task_mmu.c~teach-smaps-thp fs/proc/task_mmu.c
> --- linux-2.6.git/fs/proc/task_mmu.c~teach-smaps-thp	2011-02-09 11:41:44.423556779 -0800
> +++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-02-09 11:41:52.611550670 -0800
> @@ -331,6 +331,7 @@ struct mem_size_stats {
>  	unsigned long private_dirty;
>  	unsigned long referenced;
>  	unsigned long anonymous;
> +	unsigned long anonymous_thp;
>  	unsigned long swap;
>  	u64 pss;
>  };
> @@ -394,6 +395,7 @@ static int smaps_pte_range(pmd_t *pmd, u
>  			spin_lock(&walk->mm->page_table_lock);
>  		} else {
>  			smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_SIZE, walk);
> +			mss->anonymous_thp += HPAGE_SIZE;

I should have thought of this for the previous patch but should this be
HPAGE_PMD_SIZE instead of HPAGE_SIZE? Right now, they are the same value
but they are not the same thing.

>  			return 0;
>  		}
>  	}
> @@ -435,6 +437,7 @@ static int show_smap(struct seq_file *m,
>  		   "Private_Dirty:  %8lu kB\n"
>  		   "Referenced:     %8lu kB\n"
>  		   "Anonymous:      %8lu kB\n"
> +		   "AnonHugePages:  %8lu kB\n"
>  		   "Swap:           %8lu kB\n"
>  		   "KernelPageSize: %8lu kB\n"
>  		   "MMUPageSize:    %8lu kB\n"
> @@ -448,6 +451,7 @@ static int show_smap(struct seq_file *m,
>  		   mss.private_dirty >> 10,
>  		   mss.referenced >> 10,
>  		   mss.anonymous >> 10,
> +		   mss.anonymous_thp >> 10,
>  		   mss.swap >> 10,
>  		   vma_kernel_pagesize(vma) >> 10,
>  		   vma_mmu_pagesize(vma) >> 10,
> _
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* Re: [PATCH 4/5] teach smaps_pte_range() about THP pmds
From: Mel Gorman @ 2011-02-10 11:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli,
	Johannes Weiner, David Rientjes
In-Reply-To: <20110209195411.816D55A7@kernel>

On Wed, Feb 09, 2011 at 11:54:11AM -0800, Dave Hansen wrote:
> 
> This adds code to explicitly detect  and handle
> pmd_trans_huge() pmds.  It then passes HPAGE_SIZE units
> in to the smaps_pte_entry() function instead of PAGE_SIZE.
> 
> This means that using /proc/$pid/smaps now will no longer
> cause THPs to be broken down in to small pages.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: David Rientjes <rientjes@google.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* Re: [PATCH 3/5] pass pte size argument in to smaps_pte_entry()
From: Mel Gorman @ 2011-02-10 11:16 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli,
	Johannes Weiner
In-Reply-To: <20110209195410.1C408075@kernel>

On Wed, Feb 09, 2011 at 11:54:10AM -0800, Dave Hansen wrote:
> 
> This patch adds an argument to the new smaps_pte_entry()
> function to let it account in things other than PAGE_SIZE
> units.  I changed all of the PAGE_SIZE sites, even though
> not all of them can be reached for transparent huge pages,
> just so this will continue to work without changes as THPs
> are improved.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* Re: [PATCH 2/5] break out smaps_pte_entry() from smaps_pte_range()
From: Mel Gorman @ 2011-02-10 11:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli,
	Johannes Weiner
In-Reply-To: <20110209195408.B08C04D3@kernel>

On Wed, Feb 09, 2011 at 11:54:08AM -0800, Dave Hansen wrote:
> 
> We will use smaps_pte_entry() in a moment to handle both small
> and transparent large pages.  But, we must break it out of
> smaps_pte_range() first.
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* Re: [PATCH 1/5] pagewalk: only split huge pages when necessary
From: Mel Gorman @ 2011-02-10 11:11 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, Michael J Wolf, Andrea Arcangeli
In-Reply-To: <20110209195407.2CE28EA0@kernel>

On Wed, Feb 09, 2011 at 11:54:07AM -0800, Dave Hansen wrote:
> 
> v2 - rework if() block, and remove  now redundant split_huge_page()
> 
> Right now, if a mm_walk has either ->pte_entry or ->pmd_entry
> set, it will unconditionally split any transparent huge pages
> it runs in to.  In practice, that means that anyone doing a
> 
> 	cat /proc/$pid/smaps
> 
> will unconditionally break down every huge page in the process
> and depend on khugepaged to re-collapse it later.  This is
> fairly suboptimal.
> 
> This patch changes that behavior.  It teaches each ->pmd_entry
> handler (there are five) that they must break down the THPs
> themselves.  Also, the _generic_ code will never break down
> a THP unless a ->pte_entry handler is actually set.
> 
> This means that the ->pmd_entry handlers can now choose to
> deal with THPs without breaking them down.
> 
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
> 
>  linux-2.6.git-dave/fs/proc/task_mmu.c |    6 ++++++
>  linux-2.6.git-dave/include/linux/mm.h |    3 +++
>  linux-2.6.git-dave/mm/memcontrol.c    |    5 +++--
>  linux-2.6.git-dave/mm/pagewalk.c      |   24 ++++++++++++++++++++----
>  4 files changed, 32 insertions(+), 6 deletions(-)
> 
> diff -puN fs/proc/task_mmu.c~pagewalk-dont-always-split-thp fs/proc/task_mmu.c
> --- linux-2.6.git/fs/proc/task_mmu.c~pagewalk-dont-always-split-thp	2011-02-09 11:41:42.299558364 -0800
> +++ linux-2.6.git-dave/fs/proc/task_mmu.c	2011-02-09 11:41:42.319558349 -0800
> @@ -343,6 +343,8 @@ static int smaps_pte_range(pmd_t *pmd, u
>  	struct page *page;
>  	int mapcount;
>  
> +	split_huge_page_pmd(walk->mm, pmd);
> +
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
> @@ -467,6 +469,8 @@ static int clear_refs_pte_range(pmd_t *p
>  	spinlock_t *ptl;
>  	struct page *page;
>  
> +	split_huge_page_pmd(walk->mm, pmd);
> +
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
> @@ -623,6 +627,8 @@ static int pagemap_pte_range(pmd_t *pmd,
>  	pte_t *pte;
>  	int err = 0;
>  
> +	split_huge_page_pmd(walk->mm, pmd);
> +
>  	/* find the first VMA at or above 'addr' */
>  	vma = find_vma(walk->mm, addr);
>  	for (; addr != end; addr += PAGE_SIZE) {
> diff -puN include/linux/mm.h~pagewalk-dont-always-split-thp include/linux/mm.h
> --- linux-2.6.git/include/linux/mm.h~pagewalk-dont-always-split-thp	2011-02-09 11:41:42.303558361 -0800
> +++ linux-2.6.git-dave/include/linux/mm.h	2011-02-09 11:41:42.323558346 -0800
> @@ -899,6 +899,9 @@ unsigned long unmap_vmas(struct mmu_gath
>   * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
>   * @pud_entry: if set, called for each non-empty PUD (2nd-level) entry
>   * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
> + * 	       this handler is required to be able to handle
> + * 	       pmd_trans_huge() pmds.  They may simply choose to
> + * 	       split_huge_page() instead of handling it explicitly.
>   * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
>   * @pte_hole: if set, called for each hole at all levels
>   * @hugetlb_entry: if set, called for each hugetlb entry
> diff -puN mm/memcontrol.c~pagewalk-dont-always-split-thp mm/memcontrol.c
> --- linux-2.6.git/mm/memcontrol.c~pagewalk-dont-always-split-thp	2011-02-09 11:41:42.311558355 -0800
> +++ linux-2.6.git-dave/mm/memcontrol.c	2011-02-09 11:41:42.327558343 -0800
> @@ -4737,7 +4737,8 @@ static int mem_cgroup_count_precharge_pt
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> -	VM_BUG_ON(pmd_trans_huge(*pmd));
> +	split_huge_page_pmd(walk->mm, pmd);
> +
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	for (; addr != end; pte++, addr += PAGE_SIZE)
>  		if (is_target_pte_for_mc(vma, addr, *pte, NULL))
> @@ -4899,8 +4900,8 @@ static int mem_cgroup_move_charge_pte_ra
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> +	split_huge_page_pmd(walk->mm, pmd);
>  retry:
> -	VM_BUG_ON(pmd_trans_huge(*pmd));
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	for (; addr != end; addr += PAGE_SIZE) {
>  		pte_t ptent = *(pte++);

Before we goto this retry, there is a cond_resched(). Just to confirm,
we are depending on mmap_sem to prevent khugepaged promoting this back to
a hugepage, right? I don't see a problem with that but I want to be
sure.

> diff -puN mm/pagewalk.c~pagewalk-dont-always-split-thp mm/pagewalk.c
> --- linux-2.6.git/mm/pagewalk.c~pagewalk-dont-always-split-thp	2011-02-09 11:41:42.315558352 -0800
> +++ linux-2.6.git-dave/mm/pagewalk.c	2011-02-09 11:41:42.331558340 -0800
> @@ -33,19 +33,35 @@ static int walk_pmd_range(pud_t *pud, un
>  
>  	pmd = pmd_offset(pud, addr);
>  	do {
> +	again:
>  		next = pmd_addr_end(addr, end);
> -		split_huge_page_pmd(walk->mm, pmd);
> -		if (pmd_none_or_clear_bad(pmd)) {
> +		if (pmd_none(*pmd)) {
>  			if (walk->pte_hole)
>  				err = walk->pte_hole(addr, next, walk);
>  			if (err)
>  				break;
>  			continue;
>  		}
> +		/*
> +		 * This implies that each ->pmd_entry() handler
> +		 * needs to know about pmd_trans_huge() pmds
> +		 */
>  		if (walk->pmd_entry)
>  			err = walk->pmd_entry(pmd, addr, next, walk);
> -		if (!err && walk->pte_entry)
> -			err = walk_pte_range(pmd, addr, next, walk);
> +		if (err)
> +			break;
> +
> +		/*
> +		 * Check this here so we only break down trans_huge
> +		 * pages when we _need_ to
> +		 */
> +		if (!walk->pte_entry)
> +			continue;
> +
> +		split_huge_page_pmd(walk->mm, pmd);
> +		if (pmd_none_or_clear_bad(pmd))
> +			goto again;
> +		err = walk_pte_range(pmd, addr, next, walk);
>  		if (err)
>  			break;
>  	} while (pmd++, addr = next, addr != end);

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply

* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
From: Michal Hocko @ 2011-02-10 10:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Johannes Weiner, Andrew Morton, Rik van Riel,
	Kent Overstreet, linux-mm, linux-kernel
In-Reply-To: <20110210102109.GB17873@csn.ul.ie>

On Thu 10-02-11 10:21:10, Mel Gorman wrote:
> On Wed, Feb 09, 2011 at 07:28:46PM +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 09, 2011 at 04:46:56PM +0000, Mel Gorman wrote:
> > > On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> > > > Hi,
> > > > 
> > > > I think this should fix the problem of processes getting stuck in
> > > > reclaim that has been reported several times.
> > > 
> > > I don't think it's the only source but I'm basing this on seeing
> > > constant looping in balance_pgdat() and calling congestion_wait() a few
> > > weeks ago that I haven't rechecked since. However, this looks like a
> > > real fix for a real problem.
> > 
> > Agreed. Just yesterday I spent some time on the lumpy compaction
> > changes after wondering about Michal's khugepaged 100% report, and I
> > expected some fix was needed in this area (as I couldn't find any bug
> > in khugepaged yet, so the lumpy compaction looked the next candidate
> > for bugs).
> > 
> 
> Michal did report that disabling defrag did not help but the stack trace
> also showed that it was stuck in shrink_zone() which is what Johannes'
> patch targets. It's not unreasonable to test if Johannes' patch solves
> Michal's problem. Michal, I know that your workload is a bit random and
> may not be reproducible but do you think it'd be possible to determine
> if Johannes' patch helps?

Sure, I can test it. Nevertheless, I haven't seen the problem again. I
have tried to make some memory pressure on the machine but no "luck".

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply

* Re: [PATCH R3 1/7] mm: Add add_registered_memory() to memory hotplug API
From: Daniel Kiper @ 2011-02-10 10:30 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Daniel Kiper, ian.campbell, akpm, andi.kleen, haicheng.li,
	fengguang.wu, jeremy, dan.magenheimer, v.tolstov, pasik, dave,
	wdauchy, rientjes, xen-devel, linux-kernel, linux-mm
In-Reply-To: <20110208232538.GB9857@dumpdata.com>

On Tue, Feb 08, 2011 at 06:25:38PM -0500, Konrad Rzeszutek Wilk wrote:
> On Thu, Feb 03, 2011 at 05:25:14PM +0100, Daniel Kiper wrote:
> > add_registered_memory() adds memory that was registered earlier
> > as a memory resource. It is required by memory hotplug
> > for Xen guests, but it could also be used by other
> > modules.
> > 
> > Signed-off-by: Daniel Kiper <dkiper@net-space.pl>
> > ---
> >  include/linux/memory_hotplug.h |    1 +
> >  mm/memory_hotplug.c            |   50 ++++++++++++++++++++++++++++++---------
> >  2 files changed, 39 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 8122018..fe63912 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -223,6 +223,7 @@ static inline int is_mem_section_removable(unsigned long pfn,
> >  #endif /* CONFIG_MEMORY_HOTREMOVE */
> >  
> >  extern int mem_online_node(int nid);
> > +extern int add_registered_memory(int nid, u64 start, u64 size);
> >  extern int add_memory(int nid, u64 start, u64 size);
> >  extern int arch_add_memory(int nid, u64 start, u64 size);
> >  extern int remove_memory(u64 start, u64 size);
> > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> > index 321fc74..7947bdf 100644
> > --- a/mm/memory_hotplug.c
> > +++ b/mm/memory_hotplug.c
> > @@ -532,20 +532,12 @@ out:
> >  }
> >  
> >  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
> > -int __ref add_memory(int nid, u64 start, u64 size)
> > +static int __ref __add_memory(int nid, u64 start, u64 size)
> >  {
> >  	pg_data_t *pgdat = NULL;
> >  	int new_pgdat = 0;
> > -	struct resource *res;
> >  	int ret;
> >  
> > -	lock_memory_hotplug();
> > -
> > -	res = register_memory_resource(start, size);
> > -	ret = -EEXIST;
> > -	if (!res)
> > -		goto out;
> > -
> >  	if (!node_online(nid)) {
> >  		pgdat = hotadd_new_pgdat(nid, start);
> >  		ret = -ENOMEM;
> > @@ -579,14 +571,48 @@ int __ref add_memory(int nid, u64 start, u64 size)
> >  	goto out;
> >  
> >  error:
> > -	/* rollback pgdat allocation and others */
> > +	/* rollback pgdat allocation */
> >  	if (new_pgdat)
> >  		rollback_node_hotadd(nid, pgdat);
> > -	if (res)
> > -		release_memory_resource(res);
> > +
> > +out:
> > +	return ret;
> > +}
> > +
> > +int add_registered_memory(int nid, u64 start, u64 size)
> > +{
> > +	int ret;
> > +
> > +	lock_memory_hotplug();
> > +	ret = __add_memory(nid, start, size);
> > +	unlock_memory_hotplug();
> 
> Isn't this a duplicate call to the mutex?
> The __add_memory does an unlock_memory_hotplug when it finishes
> and then you do another unlock_memory_hotplug here too.

No. Calls to lock_memory_hotplug()/unlock_memory_hotplug() were
moved from original add_memory() to add_registered_memory()
and new add_memory().

Daniel

^ permalink raw reply

* Re: [patch] vmscan: fix zone shrinking exit when scan work is done
From: Mel Gorman @ 2011-02-10 10:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, Michal Hocko,
	Kent Overstreet, linux-mm, linux-kernel
In-Reply-To: <20110209182846.GN3347@random.random>

On Wed, Feb 09, 2011 at 07:28:46PM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 09, 2011 at 04:46:56PM +0000, Mel Gorman wrote:
> > On Wed, Feb 09, 2011 at 04:46:06PM +0100, Johannes Weiner wrote:
> > > Hi,
> > > 
> > > I think this should fix the problem of processes getting stuck in
> > > reclaim that has been reported several times.
> > 
> > I don't think it's the only source but I'm basing this on seeing
> > constant looping in balance_pgdat() and calling congestion_wait() a few
> > weeks ago that I haven't rechecked since. However, this looks like a
> > real fix for a real problem.
> 
> Agreed. Just yesterday I spent some time on the lumpy compaction
> changes after wondering about Michal's khugepaged 100% report, and I
> expected some fix was needed in this area (as I couldn't find any bug
> in khugepaged yet, so the lumpy compaction looked the next candidate
> for bugs).
> 

Michal did report that disabling defrag did not help but the stack trace
also showed that it was stuck in shrink_zone() which is what Johannes'
patch targets. It's not unreasonable to test if Johannes' patch solves
Michal's problem. Michal, I know that your workload is a bit random and
may not be reproducible but do you think it'd be possible to determine
if Johannes' patch helps?

> I've also been wondering about the !nr_scanned check in
> should_continue_reclaim too but I didn't look too much into the caller
> (I was tempted to remove it all together). I don't see how checking
> nr_scanned can be safe even after we fix the caller to avoid passing
> non-zero values if "goto restart".
> 
> nr_scanned is incremented even for !page_evictable... so it's not
> really useful to insist, just because we scanned something, in my
> view. It looks bogus... So my proposal would be below.
> 

We should not be ending up in a situation where the LRU list holds only
!page_evictable pages and that situation persists, causing excessive (or
infinite) looping. As unevictable pages are encountered on the LRU list,
they should be moved to the unevictable lists by putback_lru_page().  Are you
aware of a situation where this becomes broken?

I recognise that SWAP_CLUSTER_MAX pages could all be unevictable and they
all get moved. In this case, nr_scanned is positive and we continue
to scan but this is expected and desirable: Reclaim/compaction needs more
pages to be freed before it starts compaction. If it stops scanning early,
then it would just fail the allocation later. This is what the "NOTE" is about.
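
To make the difference concrete, here is a minimal user-space sketch of
the two exit tests under discussion. It is not kernel code: the helper
names stop_current() and stop_proposed() are made up for illustration,
and only the two conditions themselves come from the patch quoted below.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: the exit test as it stands today in
 * should_continue_reclaim(). Reclaim/compaction gives up only when
 * nothing was reclaimed AND nothing was scanned.
 */
static bool stop_current(unsigned long nr_reclaimed, unsigned long nr_scanned)
{
	return !nr_reclaimed && !nr_scanned;
}

/*
 * Hypothetical helper: the exit test from the proposal quoted below.
 * Reclaim/compaction gives up as soon as a pass reclaims nothing.
 */
static bool stop_proposed(unsigned long nr_reclaimed)
{
	return !nr_reclaimed;
}
```

For a pass that scanned some pages but reclaimed none (for example a
batch that was entirely unevictable), stop_current() says keep going
while stop_proposed() says stop.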

I prefer Johannes' fix for the observed problem.

> ====
> Subject: mm: stop checking nr_scanned in should_continue_reclaim
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> nr_scanned is incremented even for !page_evictable... so it's not
> really useful to insist, just because we scanned something.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 148c6e6..9741884 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1831,7 +1831,6 @@ out:
>   */
>  static inline bool should_continue_reclaim(struct zone *zone,
>  					unsigned long nr_reclaimed,
> -					unsigned long nr_scanned,
>  					struct scan_control *sc)
>  {
>  	unsigned long pages_for_compaction;
> @@ -1841,15 +1840,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
>  	if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
>  		return false;
>  
> -	/*
> -	 * If we failed to reclaim and have scanned the full list, stop.
> -	 * NOTE: Checking just nr_reclaimed would exit reclaim/compaction far
> -	 *       faster but obviously would be less likely to succeed
> -	 *       allocation. If this is desirable, use GFP_REPEAT to decide
> -	 *       if both reclaimed and scanned should be checked or just
> -	 *       reclaimed
> -	 */
> -	if (!nr_reclaimed && !nr_scanned)
> +	/* If we failed to reclaim stop. */
> +	if (!nr_reclaimed)
>  		return false;
>  
>  	/*
> @@ -1884,7 +1876,6 @@ static void shrink_zone(int priority, struct zone *zone,
>  	enum lru_list l;
>  	unsigned long nr_reclaimed;
>  	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> -	unsigned long nr_scanned = sc->nr_scanned;
>  
>  restart:
>  	nr_reclaimed = 0;
> @@ -1923,8 +1914,7 @@ restart:
>  		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
>  
>  	/* reclaim/compaction might need reclaim to continue */
> -	if (should_continue_reclaim(zone, nr_reclaimed,
> -					sc->nr_scanned - nr_scanned, sc))
> +	if (should_continue_reclaim(zone, nr_reclaimed, sc))
>  		goto restart;
>  
>  	throttle_vm_writeout(sc->gfp_mask);
> 
> 

-- 
Mel Gorman


^ permalink raw reply

* Re: [PATCH] mm: batch-free pcp list if possible
From: Mel Gorman @ 2011-02-10  9:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Namhyung Kim, linux-mm, linux-kernel, Johannes Weiner
In-Reply-To: <20110209123803.4bb6291c.akpm@linux-foundation.org>

On Wed, Feb 09, 2011 at 12:38:03PM -0800, Andrew Morton wrote:
> On Wed,  9 Feb 2011 22:21:17 +0900
> Namhyung Kim <namhyung@gmail.com> wrote:
> 
> > free_pcppages_bulk() frees pages from pcp lists in a round-robin
> > fashion by keeping batch_free counter. But it doesn't need to spin
> > if there is only one non-empty list. This can be checked by
> > batch_free == MIGRATE_PCPTYPES.
> > 
> > Signed-off-by: Namhyung Kim <namhyung@gmail.com>
> > ---
> >  mm/page_alloc.c |    4 ++++
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a873e61e312e..470fb42e303c 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -614,6 +614,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  			list = &pcp->lists[migratetype];
> >  		} while (list_empty(list));
> >  
> > +		/* This is an only non-empty list. Free them all. */
> > +		if (batch_free == MIGRATE_PCPTYPES)
> > +			batch_free = to_free;
> > +
> >  		do {
> >  			page = list_entry(list->prev, struct page, lru);
> >  			/* must delete as __free_one_page list manipulates */
> 
> free_pcppages_bulk() hurts my brain.
> 

I vaguely recall trying to make it easier to understand. Each attempt
made it easier to read, but slower. At the time there were complaints
about the overhead of the page allocator so making it slower was not an
option. "Overhead" was what oprofile reported as the time spent in each
function.

> What is it actually trying to do, and why? It counts up the number of
> contiguous empty lists and then frees that number of pages from the
> first-encountered non-empty list and then advances onto the next list?
> 

Yes. This is potentially unfair because lists for one migratetype can get
drained more heavily than others. However, checking empty lists was showing up as
a reasonably significant cost according to profiles for allocator-intensive
workloads. I *think* the workload I was using was netperf-based.

> What's the point in that?  What relationship does the number of
> contiguous empty lists have with the number of pages to free from one
> list?
> 

The point is to avoid excessive checking of empty lists. There is no
relationship between the number of empty lists and the size of the next
list. The size of the lists is related to the workload and the resulting
allocator/free pattern.

> The comment "This is so more pages are freed off fuller lists instead
> of spinning excessively around empty lists" makes no sense - the only
> way this can be true is if the code knows the number of elements on
> each list, and it doesn't know that.
> 

batch_free gets preserved if a list empties so if batch_free was 2 but
there was only 1 page on the next list, more pages are taken off a
larger list. We know what the total size of all the lists is so there
are always pages to find. You're right in that we don't know the size of
individual lists because space in the pcp structure is tight.
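
As a rough illustration, the walk can be modelled in user space with
plain counters standing in for the lists. This is a simplified sketch,
not the kernel function: NR_LISTS stands in for MIGRATE_PCPTYPES,
drain() and the freed[] array are made-up names, and locking and the
actual page freeing are left out.

```c
#define NR_LISTS 3	/* stands in for MIGRATE_PCPTYPES */

/*
 * Take 'count' pages off the lists in round-robin order.
 * lists[i] is the number of pages on list i, freed[i] counts how many
 * pages were taken from list i. The caller must ensure that 'count'
 * does not exceed the total number of pages, otherwise the search for
 * a non-empty list never terminates.
 */
static void drain(int lists[NR_LISTS], int count, int freed[NR_LISTS])
{
	int idx = 0;
	int batch_free = 0;
	int to_free = count;

	while (to_free) {
		/* Each list visited, empty or not, bumps batch_free. */
		do {
			batch_free++;
			if (++idx == NR_LISTS)
				idx = 0;
		} while (lists[idx] == 0);

		/*
		 * Take up to batch_free pages. If the list empties first,
		 * the leftover batch_free carries over to the next list.
		 */
		do {
			lists[idx]--;
			freed[idx]++;
		} while (--to_free && --batch_free && lists[idx] != 0);
	}
}
```

With lists holding {5, 0, 1} pages and count = 3, the walk skips the
empty list (batch_free becomes 2), takes the single page from the last
list, and then takes two pages from the first list.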

> Also, the covering comments over free_pcppages_bulk() regarding the
> pages_scanned counter and the "all pages pinned" logic appear to be out
> of date.  Or, alternatively, those comments do reflect the desired
> design, but we broke it.
> 

This comment is really old.... heh, you introduced it back in 2.5.49
apparently.

The comment is referring to the clearing of all_unreclaimable. By clearing it,
kswapd will scan that zone again and set all_unreclaimable back if necessary
and that is still valid.

More importantly, if there is another process in direct reclaim and it failed
to reclaim any pages, the clearing of all_unreclaimable will avoid the direct
reclaimer entering OOM.

The comment could be better but it doesn't look wrong, just not
particularly helpful.

> Methinks that free_pcppages_bulk() is an area ripe for simplification
> and clarification.
> 

Probably, but any patch that simplifies it needs to be accompanied by
profiles of an allocator-intensive workload showing it's not worse as a result.

-- 
Mel Gorman


^ permalink raw reply

* Re: [RFC][PATCH v2] Controlling kexec behaviour when hardware error happened.
From: Borislav Petkov @ 2011-02-10  9:14 UTC (permalink / raw)
  To: Hidetoshi Seto
  Cc: Seiji Aguchi, hpa@zytor.com, andi@firstfloor.org,
	ebiederm@xmission.com, gregkh@suse.de, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	dle-develop@lists.sourceforge.net, amwang@redhat.com,
	Satoru Moriya
In-Reply-To: <4D53A3AA.5050908@jp.fujitsu.com>

On Thu, Feb 10, 2011 at 05:36:58PM +0900, Hidetoshi Seto wrote:
> (2011/02/10 1:35), Seiji Aguchi wrote:

[..]

> > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> > index d916183..e76b47b 100644
> > --- a/arch/x86/kernel/cpu/mcheck/mce.c
> > +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> > @@ -944,6 +944,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> >  
> >  	percpu_inc(mce_exception_count);
> >  
> > +	hwerr_flag = 1;
> > +
> >  	if (notify_die(DIE_NMI, "machine check", regs, error_code,
> >  			   18, SIGKILL) == NOTIFY_STOP)
> >  		goto out;
> 
> Now x86 supports some recoverable machine check, so setting
> flag here will prevent running kexec on systems that have
> encountered such recoverable machine check and recovered.
> 
> I think mce_panic() is proper place to set this flag "hwerr_flag".

I agree, it only becomes unsafe to run kexec once the error cannot be
recovered by software.

Also, hwerr_flag is really a bad naming choice. How about
"hwerr_unrecoverable" or "hw_compromised" or "recovery_futile" or
"hw_incurable", or simply say what happened: "pcc" = processor context
corrupt (and reliable restarting might not be possible). This could be
used by others too, besides kexec.

[..]

> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 0207c2f..0178f47 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -994,6 +994,8 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
> >  	int res;
> >  	unsigned int nr_pages;
> >  
> > +	hwerr_flag = 1;
> > +
> >  	if (!sysctl_memory_failure_recovery)
> >  		panic("Memory failure from trap %d on page %lx", trapno, pfn);
> >  
> 
> For similar reason, setting flag here is not good for
> systems working after isolating some poisoned memory page.
> 
> Why not:
>  if (!sysctl_memory_failure_recovery) {
>  	hwerr_flag = 1;
>  	panic("Memory failure from trap %d on page %lx", trapno, pfn);
>  }

Why do we need that in memory-failure.c at all? I mean, when we consume
the UC, we'll end up in mce_panic() anyway.

-- 
Regards/Gruss,
    Boris.


^ permalink raw reply

* Re: [RFC][PATCH v2] Controlling kexec behaviour when hardware error happened.
From: Hidetoshi Seto @ 2011-02-10  8:36 UTC (permalink / raw)
  To: Seiji Aguchi
  Cc: hpa@zytor.com, andi@firstfloor.org, ebiederm@xmission.com,
	bp@alien8.de, gregkh@suse.de, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org,
	dle-develop@lists.sourceforge.net, amwang@redhat.com,
	Satoru Moriya
In-Reply-To: <5C4C569E8A4B9B42A84A977CF070A35B2C1494DBE0@USINDEVS01.corp.hds.com>

(2011/02/10 1:35), Seiji Aguchi wrote:
> Hi,
> 
> I submitted a quite similar patch last December.
> 
> http://www.spinics.net/lists/linux-mm/msg13157.html
> 
> I retry it with different description of the purpose.
> 
> [Changelog]
> from v1:
>     - Change name of sysctl parameter ,kexec_on_mce, to kexec_on_hwerr. 
>     - Move variable declaration from <asm/mce.h> to <kernel/panic.h>.
>     - Remove CONFIG_X86_MCE in *.c files.
>     - Modify [Purpose]/[Patch Description].
> 
> [Purpose]
> There are some logging features of firmware/hardware, SEL,BMC, etc, in enterprise servers.
> We investigate the firmware/hardware logs first when MCE occurred and replace the broken hardware.
> So, memory dump is not necessary for detecting root cause of machine check.
> Also, we can reduce down-time by skipping kdump.
> 
> Of course, there are a lot of servers which don't have logging features of firmware/hardware.
> So, I proposed a option controlling kexec behaviour when hardware error occurred. 
> 
> [Patch Description]
> This patch adds a sysctl option ,kernel.kexec_on_hwerr, controlling kexec behaviour when hardware error occurred.
> 
>  - Permission
>   - 0644
>  - Value(default is "1")
>    - non-zero: Kexec is enabled regardless of hardware error.
>    - 0: Kexec is disabled when MCE occurred.
>    
> 
> Matrix of kernel.kexec_on_hwerr value ,hardware error and kexec
> 
> --------------------------------------------------
> kernel.kexec_on_hwerr| hardware error | kexec
> --------------------------------------------------
> non-zero             | occurred       | enabled
>                      -----------------------------
>                      | not occurred   | enabled
> --------------------------------------------------
> 0                    | occurred       | disabled
>                      |----------------------------
>                      | not occurred   | enabled
> --------------------------------------------------
> 
> 
> Any comments and suggestions are welcome.
> 
>  Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com>
> 
> ---
>  Documentation/sysctl/kernel.txt  |   11 +++++++++++
>  arch/x86/kernel/cpu/mcheck/mce.c |    2 ++
>  include/linux/kernel.h           |    2 ++
>  include/linux/sysctl.h           |    1 +
>  kernel/panic.c                   |   15 ++++++++++++++-
>  kernel/sysctl.c                  |    8 ++++++++
>  kernel/sysctl_binary.c           |    1 +
>  mm/memory-failure.c              |    2 ++
>  8 files changed, 41 insertions(+), 1 deletions(-)
> 
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 11d5ced..3159111 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -34,6 +34,7 @@ show up in /proc/sys/kernel:
>  - hotplug
>  - java-appletviewer           [ binfmt_java, obsolete ]
>  - java-interpreter            [ binfmt_java, obsolete ]
> +- kexec_on_hwerr              [ x86 only ]
>  - kptr_restrict
>  - kstack_depth_to_print       [ X86 only ]
>  - l2cr                        [ PPC only ]
> @@ -261,6 +262,16 @@ This flag controls the L2 cache of G3 processor boards. If
>  0, the cache is disabled. Enabled if nonzero.
>  
>  ==============================================================
> +kexec_on_hwerr: (X86 only)
> +
> +Controls the behaviour of kexec when panic occurred due to hardware 
> +error.
> +Default value is 1.
> +
> +0: Kexec is disabled.
> +non-zero: Kexec is enabled.
> +
> +==============================================================
>  
>  kptr_restrict:
>  
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index d916183..e76b47b 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -944,6 +944,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  
>  	percpu_inc(mce_exception_count);
>  
> +	hwerr_flag = 1;
> +
>  	if (notify_die(DIE_NMI, "machine check", regs, error_code,
>  			   18, SIGKILL) == NOTIFY_STOP)
>  		goto out;

x86 now supports some recoverable machine checks, so setting the
flag here will prevent kexec from running on systems that have
encountered such a recoverable machine check and recovered.

I think mce_panic() is the proper place to set this flag "hwerr_flag".

> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 2fe6e84..c2fba7c 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -242,6 +242,8 @@ extern void add_taint(unsigned flag);
>  extern int test_taint(unsigned flag);
>  extern unsigned long get_taint(void);
>  extern int root_mountflags;
> +extern int kexec_on_hwerr;
> +extern int hwerr_flag;
>  
>  extern bool early_boot_irqs_disabled;
>  
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index 7bb5cb6..8ae5bfe 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -153,6 +153,7 @@ enum
>  	KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
>  	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
>  	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
> +	KERN_KEXEC_ON_HWERR=77, /* int: behaviour of kexec for hardware error */
>  };
>  
>  
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 991bb87..84c1d2e 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -28,6 +28,8 @@
>  #define PANIC_BLINK_SPD 18
>  
>  int panic_on_oops;
> +int kexec_on_hwerr = 1;
> +int hwerr_flag;
>  static unsigned long tainted_mask;
>  static int pause_on_oops;
>  static int pause_on_oops_flag;
> @@ -45,6 +47,16 @@ static long no_blink(int state)
>  	return 0;
>  }
>  
> +static int kexec_should_skip(void)
> +{
> +	if (!kexec_on_hwerr && hwerr_flag) {
> +		printk(KERN_WARNING "Kexec is skipped because hardware error "
> +		       "occurred.\n");
> +		return 1;
> +	}
> +	return 0;
> +}
> +
>  /* Returns how long it waited in ms */
>  long (*panic_blink)(int state);
>  EXPORT_SYMBOL(panic_blink);
> @@ -86,7 +98,8 @@ NORET_TYPE void panic(const char * fmt, ...)
>  	 * everything else.
>  	 * Do we want to call this before we try to display a message?
>  	 */
> -	crash_kexec(NULL);
> +	if (!kexec_should_skip())
> +		crash_kexec(NULL);
>  
>  	kmsg_dump(KMSG_DUMP_PANIC);
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 0f1bd83..f78edd8 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -811,6 +811,14 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +	{
> +		.procname	= "kexec_on_hwerr",
> +		.data		= &kexec_on_hwerr,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +
>  #endif
>  #if defined(CONFIG_MMU)
>  	{
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index b875bed..8d572ca 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -137,6 +137,7 @@ static const struct bin_table bin_kern_table[] = {
>  	{ CTL_INT,	KERN_COMPAT_LOG,		"compat-log" },
>  	{ CTL_INT,	KERN_MAX_LOCK_DEPTH,		"max_lock_depth" },
>  	{ CTL_INT,	KERN_PANIC_ON_NMI,		"panic_on_unrecovered_nmi" },
> +	{ CTL_INT,	KERN_KEXEC_ON_HWERR,		"kexec_on_hwerr" },
>  	{}
>  };
>  
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 0207c2f..0178f47 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -994,6 +994,8 @@ int __memory_failure(unsigned long pfn, int trapno, int flags)
>  	int res;
>  	unsigned int nr_pages;
>  
> +	hwerr_flag = 1;
> +
>  	if (!sysctl_memory_failure_recovery)
>  		panic("Memory failure from trap %d on page %lx", trapno, pfn);
>  

For a similar reason, setting the flag here is not good for
systems that keep working after isolating a poisoned memory page.

Why not:
 if (!sysctl_memory_failure_recovery) {
 	hwerr_flag = 1;
 	panic("Memory failure from trap %d on page %lx", trapno, pfn);
 }
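
To spell out the intended behaviour, here is a small user-space model.
It is only a sketch under assumptions: the two globals and the condition
in kexec_should_skip() follow the patch above, while report_hw_error()
is a hypothetical helper that stands in for "set the flag only on the
unrecoverable path".

```c
#include <stdbool.h>

/* User-space stand-ins for the two globals the patch adds. */
static int kexec_on_hwerr = 1;	/* sysctl value, default 1 */
static int hwerr_flag;		/* set once a hardware error is fatal */

/* Same condition as kexec_should_skip() in the patch, minus the printk. */
static bool kexec_should_skip(void)
{
	return !kexec_on_hwerr && hwerr_flag;
}

/*
 * Hypothetical helper modelling the suggested placement: a recovered
 * error leaves the flag alone, only an unrecoverable one (the path
 * that ends in panic) sets it.
 */
static void report_hw_error(bool recovered)
{
	if (!recovered)
		hwerr_flag = 1;
}
```

With kexec_on_hwerr set to 0, a recovered error followed by an unrelated
panic still runs kexec, and only an unrecoverable hardware error skips it.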

Thanks,
H.Seto


^ permalink raw reply

* Re: [PATCH 3/3] Provide control over unmapped pages (v4)
From: Minchan Kim @ 2011-02-10  5:41 UTC (permalink / raw)
  To: balbir
  Cc: linux-mm, akpm, npiggin, kvm, linux-kernel, kosaki.motohiro, cl,
	kamezawa.hiroyu
In-Reply-To: <AANLkTin4JM6phwy0wuV6fV-i-3UwP_GGmXh1vN=Wz2u=@mail.gmail.com>

I don't know why part of the message gets deleted only when I send it to you.
Maybe it's a gmail bug.

I hope the mail goes through this time. :)

On Thu, Feb 10, 2011 at 2:33 PM, Minchan Kim <minchan.kim@gmail.com> wrote:
> Sorry for late response.
>
> On Fri, Jan 28, 2011 at 8:18 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
>> * MinChan Kim <minchan.kim@gmail.com> [2011-01-28 16:24:19]:
>>
>>> >
>>> > But the assumption for LRU order to change happens only if the page
>>> > cannot be successfully freed, which means it is in some way active..
>>> > and needs to be moved no?
>>>
>>> 1. holded page by someone
>>> 2. mapped pages
>>> 3. active pages
>>>
>>> 1 is rare so it isn't the problem.
>>> Of course, in case of 3, we have to activate it so no problem.
>>> The problem is 2.
>>>
>>
>> 2 is a problem, but due to the size aspects not a big one. Like you
>> said even lumpy reclaim affects it. May be the reclaim code could
>> honour may_unmap much earlier.
>
> Even if it is, it's a trade-off to get a big contiguous memory. I
> don't want to add new mess. (In addition, lumpy is weak by compaction
> as time goes by)
> What I have in mind for preventing LRU ignore is that put the page
> into original position instead of head of lru. Maybe it can help the
> situation both lumpy and your case. But it's another story.
>
> How about the idea?
>
> I borrow the idea from CFLRU[1]
> - PCFLRU(Page-Cache First LRU)
>
> When we allocates new page for page cache, we adds the page into LRU's tail.
> When we map the page cache into page table, we rotate the page into LRU's head.
>
> So, inactive list's result is following as.
>
> M.P : mapped page
> N.P : none-mapped page
>
> HEAD-M.P-M.P-M.P-M.P-N.P-N.P-N.P-N.P-N.P-TAIL
>
> Admin can set threshold window size which determines stop reclaiming
> none-mapped page contiguously.
>
> I think it needs some tweak of page cache/page mapping functions but
> we can use kswapd/direct reclaim without change.
>
> Also, it can change page reclaim policy totally but it's just what you
> want, I think.
>
> [1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.6188&rep=rep1&type=pdf
>
>>
>> --
>>        Three Cheers,
>>        Balbir
>>
>
>
>
> --
> Kind regards,
> Minchan Kim
>



-- 
Kind regards,
Minchan Kim


^ permalink raw reply

