From: Konstantin Khlebnikov <khlebnikov@openvz.org>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Hugh Dickins <hughd@google.com>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>
Subject: Re: [PATCH RFC 00/15] mm: memory book keeping and lru_lock splitting
Date: Sat, 18 Feb 2012 13:09:25 +0400
Message-ID: <4F3F6AC5.8060908@openvz.org>
In-Reply-To: <20120217085431.80daa020.kamezawa.hiroyu@jp.fujitsu.com>

KAMEZAWA Hiroyuki wrote:
> On Thu, 16 Feb 2012 15:02:27 +0400
> Konstantin Khlebnikov<khlebnikov@openvz.org>  wrote:
>
>> KAMEZAWA Hiroyuki wrote:
>>> On Thu, 16 Feb 2012 09:43:52 +0400
>>> Konstantin Khlebnikov<khlebnikov@openvz.org>   wrote:
>>>
>>>> KAMEZAWA Hiroyuki wrote:
>>>>> On Thu, 16 Feb 2012 02:57:04 +0400
>>>>> Konstantin Khlebnikov<khlebnikov@openvz.org>    wrote:
>>>
>>>>>> * optimize page to book translations, move it upper in the call stack,
>>>>>>      replace some struct zone arguments with struct book pointer.
>>>>>>
>>>>>
>>>>> a page->book translator from patch 2/15
>>>>>
>>>>> +struct book *page_book(struct page *page)
>>>>> +{
>>>>> +	struct mem_cgroup_per_zone *mz;
>>>>> +	struct page_cgroup *pc;
>>>>> +
>>>>> +	if (mem_cgroup_disabled())
>>>>> +		return &page_zone(page)->book;
>>>>> +
>>>>> +	pc = lookup_page_cgroup(page);
>>>>> +	if (!PageCgroupUsed(pc))
>>>>> +		return &page_zone(page)->book;
>>>>> +	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
>>>>> +	smp_rmb();
>>>>> +	mz = mem_cgroup_zoneinfo(pc->mem_cgroup,
>>>>> +			page_to_nid(page), page_zonenum(page));
>>>>> +	return &mz->book;
>>>>> +}
>>>>>
>>>>> What happens when pc->mem_cgroup is rewritten by move_account() ?
>>>>> Where is the guard for lockless access of this ?
>>>>
>>>> Initially this was supposed to be protected by lru_lock; in the final patch it is protected by RCU.
>>>
>>> Hmm, VM_BUG_ON(!PageLRU(page)) ?
>>
>> Where?
>>
>
> You said this is guarded by lru_lock. So, page should be on LRU.

Not exactly: in add_page_to_lru_list() the page is not on the LRU yet =)

Plus we can race with LRU isolation, so this BUG_ON would be invalid.
All callers must recheck PageLRU() after locking page->book->lru_lock.

My locking uses a combination of PageLRU(), RCU, and a little help from the memcg code:

The page -> book dereference is protected by RCU. If we know that the page has a valid mem_cgroup pointer,
we can pick it up, lock the book, and recheck the pointer in a loop. If PageLRU is set after that, it means we have
successfully secured the page on its LRU.

The stability of PageLRU() against a memcg change is guaranteed by spin_unlock_wait(old_book->lru_lock)
in mem_cgroup_move_account(), between assigning the new pointer and putting the page back onto the new LRU.
Otherwise someone could see PageLRU under the old lru_lock while the page has already been put onto the new LRU under another lru_lock.

If we do not know whether the page->book pointer is valid (at isolation via pfn in compaction/lumpy reclaim),
we must check PageLRU() first; if it is set, page->book must be valid. And as usual, PageLRU must be rechecked
after taking the LRU lock. The whole operation runs under one RCU read lock, which keeps all the pointers valid.

Plus, after the RCU grace period, memcg destroy waits for the lru_lock if it is locked before freeing this structure,
because every lock holder drops the RCU read lock only after securing the lru_lock.

This scheme looks pretty simple and complete.
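
A rough userspace model of the scheme described above may help to follow it. This is only a sketch under stated assumptions, not the code from the patch set: the spinlock is a C11 atomic_flag, the page -> book link and the PageLRU flag are plain atomics, RCU is not modelled at all, and catch_page(), catch_page_by_pfn() and move_page() are names invented for this illustration. The switcher here changes the pointer while holding the old lock, which is a simpler stand-in for the spin_unlock_wait() variant.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct book {
	atomic_flag lru_lock;		/* stand-in for book->lru_lock */
};

struct page {
	_Atomic(struct book *) book;	/* stand-in for the page -> book link */
	atomic_bool lru;		/* stand-in for the PageLRU flag */
};

static void lock_book(struct book *book)
{
	while (atomic_flag_test_and_set_explicit(&book->lru_lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static void unlock_book(struct book *book)
{
	atomic_flag_clear_explicit(&book->lru_lock, memory_order_release);
}

/*
 * Lock the book the page points to. The pointer can change between the
 * load and the lock, so recheck it under the lock and retry on mismatch.
 * In the kernel this would run inside an RCU read-side section.
 */
static struct book *lock_page_book(struct page *page)
{
	for (;;) {
		struct book *book = atomic_load(&page->book);

		lock_book(book);
		if (book == atomic_load(&page->book))
			return book;
		unlock_book(book);
	}
}

/* Catcher: returns the locked book, or NULL if the page is not on an LRU. */
static struct book *catch_page(struct page *page)
{
	struct book *book = lock_page_book(page);

	if (atomic_load(&page->lru))
		return book;
	unlock_book(book);
	return NULL;
}

/* Isolation via pfn: the link is trusted only if PageLRU is already set. */
static struct book *catch_page_by_pfn(struct page *page)
{
	if (!atomic_load(&page->lru))
		return NULL;
	return catch_page(page);	/* rechecks PageLRU under the lock */
}

/* Switcher: isolate, switch the link under the old lock, then put back. */
static void move_page(struct page *page, struct book *new_book)
{
	struct book *old_book = lock_page_book(page);

	assert(new_book != NULL);
	atomic_store(&page->lru, false);	/* isolate */
	atomic_store(&page->book, new_book);	/* switch */
	unlock_book(old_book);

	lock_book(new_book);
	atomic_store(&page->lru, true);		/* putback */
	unlock_book(new_book);
}
```

With this shape a catcher that locked the old book after the switch fails the pointer recheck and retries on the new book, so it never sees PageLRU under the wrong lock.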

>
>
>
>>>
>>> move_account() overwrites pc->mem_cgroup with isolating page from LRU.
>>> but it doesn't take lru_lock.
>>
>> There are three kinds of lock_page_book() users:
>> 1) The caller wants to catch the page on the LRU: it will lock either the old or the new book and
>>      recheck PageLRU() after locking; if the page is not on the LRU it does not touch anything.
>>      Some of these functions have a stable reference to the page, some do not.
>>    [ There is actually a small race here. I knew about it, but forgot to pick this chunk from the old code. See below. ]
>> 2) The page is isolated by the caller, which wants to put it back. The book link is stable. No problems.
>> 3) Page-release functions. The page counter is zero. No references, no problems.
>>
>> race for 1)
>>
>> catcher					switcher
>>
>> 					# isolate
>> 					old_book = lock_page_book(page)
>> 					ClearPageLRU(page)
>> 					unlock_book(old_book)				
>> 					# charge
>> old_book = lock_page_book(page)		
>> 					# switch
>> 					page->book = new_book
>> 					# putback
>> 					lock_book(new_book)
>> 					SetPageLRU(page)
>> 					unlock_book(new_book)
>> if (PageLRU(page))
>> 	oops, page actually in new_book
>> unlock_book(old_book)
>>
>>
>> I'll protect "switch" phase with old_book lru-lock:
>>
> In linux-next, pc->mem_cgroup is modified only when Page is on LRU.

I hope the page is isolated before this modification? =)

>
> When do we need to touch "book" if !PageLRU()?

Never, but at isolation via pfn we need to check it carefully.

>
>
>> lock_book(old_book)
>> page->book = new_book
>> unlock_book(old_book)
>>
>> The other option is to recheck the page's book in the "catcher" after PageLRU();
>> maybe some other variants exist.
>>
>>> BTW, how much is the performance benefit?
>>
>> It depends, but usually lru_lock is very, very hot.
>> This lock splitting can be used without cgroups and containers:
>> huge zones can now be easily sliced into arbitrary pieces, for example one book per 256MB.
>>
> I personally think reducing locking with pagevecs works well enough.
> So I want to see performance on a real machine with real apps.

I have separated the locking primitives into a small set of static inline functions,
so we can switch the locking scheme with a config option. This works without extra overhead:
if there is only one lock per zone, some of the static inline functions become lockless or empty.

There is another problem with the current per-cpu page vectors: after lru_lock splitting they will
drop and retake different locks more frequently.

Thus, for optimal scalability these vectors should be split too, or replaced with something else.
For example, lru_add_pvecs could be replaced with a per-book (lruvec) singly-linked list of pending
inserts with a cmpxchg-based atomic splice-insert. In combination with pfn-based zone-book interleaving
(the last patch on my github) it may be really faster than small per-cpu page vectors.
Adding a page to this pending list does not actually even require elevating its refcount:
if the refcount drops to zero while the page is still on the pending list, put_page can just commit this list into the LRU.
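
To illustrate the idea of switching the locking scheme at build time, here is a small userspace sketch. It is an assumption-laden illustration, not the patch code: CONFIG_SPLIT_LRU_LOCK is a made-up option name, struct spinlock is an atomic_flag stand-in, and book_lock_ptr() and relock_book() are hypothetical helpers. With the option off every book maps to its zone's lock, so a relock between books of the same zone does nothing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical option: define it to give every book its own lock. */
/* #define CONFIG_SPLIT_LRU_LOCK */

struct spinlock {
	atomic_flag locked;	/* userspace stand-in for spinlock_t */
};

struct zone;

struct book {
	struct zone *zone;
#ifdef CONFIG_SPLIT_LRU_LOCK
	struct spinlock lru_lock;
#endif
};

struct zone {
	struct spinlock lru_lock;
	struct book book;
};

static inline void spin_lock(struct spinlock *lock)
{
	while (atomic_flag_test_and_set_explicit(&lock->locked,
						 memory_order_acquire))
		;	/* spin until the lock is free */
}

static inline void spin_unlock(struct spinlock *lock)
{
	atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}

/* The only place that knows which lock protects a book. */
static inline struct spinlock *book_lock_ptr(struct book *book)
{
#ifdef CONFIG_SPLIT_LRU_LOCK
	return &book->lru_lock;
#else
	assert(book->zone != NULL);
	return &book->zone->lru_lock;
#endif
}

static inline void lock_book(struct book *book)
{
	spin_lock(book_lock_ptr(book));
}

static inline void unlock_book(struct book *book)
{
	spin_unlock(book_lock_ptr(book));
}

/*
 * Switch from holding the lock of 'from' to holding the lock of 'to'.
 * Returns 1 if a lock was really dropped and retaken, 0 if both books
 * share one lock and nothing had to be done.
 */
static inline int relock_book(struct book *from, struct book *to)
{
	if (book_lock_ptr(from) == book_lock_ptr(to))
		return 0;
	unlock_book(from);
	lock_book(to);
	return 1;
}
```

Callers only ever use lock_book(), unlock_book() and relock_book(), so the choice between one lock per zone and one lock per book stays inside book_lock_ptr().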
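
One possible shape for such a pending list, sketched in userspace C11 under my own assumptions (nothing like this exists in the posted patches): book_add_pending() and book_commit_pending() are invented names, the LRU is reduced to a singly-linked list, and the lru_lock that would protect it is left out. Pages are pushed with a compare-and-exchange loop, and the whole list is detached with one atomic exchange before being moved to the LRU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct page {
	struct page *next;	/* link in the pending list, later in the LRU */
	int id;
};

struct book {
	_Atomic(struct page *) pending;	/* lock-free list of pending inserts */
	struct page *lru;		/* stand-in for the LRU list */
	unsigned long nr_lru;
};

/* Queue a page for insertion without taking any lock. */
static void book_add_pending(struct book *book, struct page *page)
{
	struct page *head = atomic_load(&book->pending);

	do {
		/* on failure 'head' is reloaded by the compare-exchange */
		page->next = head;
	} while (!atomic_compare_exchange_weak(&book->pending, &head, page));
}

/*
 * Detach the whole pending list in one atomic operation and move its
 * pages to the LRU. In the kernel the caller would hold the book's
 * lru_lock here. Returns the number of pages committed.
 */
static unsigned long book_commit_pending(struct book *book)
{
	struct page *page = atomic_exchange(&book->pending,
					    (struct page *)NULL);
	unsigned long nr = 0;

	while (page) {
		struct page *next = page->next;

		page->next = book->lru;
		book->lru = page;
		book->nr_lru++;
		nr++;
		page = next;
	}
	return nr;
}
```

Because producers only push and the consumer always takes the entire list, the usual ABA problem of lock-free stacks does not come up in this sketch.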

>
>
>>
>> In my experience, one of the complicated things here is how to postpone destroying a "book"
>> while some of its pages are isolated. For example, lumpy reclaim and memory compaction isolate pages
>> from several books, and they want to put them back. Currently this can break if someone removes
>> the cgroup at the wrong moment. Funny races appear with three players: catcher, switcher and destroyer.
>
> Thank you for pointing out. Hmm... it can happen ? Currently, at cgroup destroying,
> force_empty() works
>
>    1. find a page from LRU
>    2. remove it from LRU
>    3. move it or reclaim it (you said "switcher")
>    4. if res.usage != 0 goto 1.
>
> I think "4" will finally keep cgroup from being destroyed.

Ok, seems so.

>
>
>> This can be fixed with some extra reference counting or some other sleepable synchronization.
>> In my rhel6-based implementation I use extra reference counting, and it looks ugly. So I want to invent something better.
>> Another option is to just never release books, and reuse them after an RCU grace period for rcu-list iterating.
>>
>
> Another reference counting is very very bad.

Ack)

>
>
>
> Thanks,
> -Kame
>
>
>
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email:<a href=mailto:"dont@kvack.org">  email@kvack.org</a>

