Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Balbir Singh <balbir@linux.vnet.ibm.com>
To: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: hugh@veritas.com, kamezawa.hiroyu@jp.fujitsu.com,
	linux-mm@kvack.org, yamamoto@valinux.co.jp, riel@redhat.com
Subject: Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.
Date: Wed, 20 Feb 2008 17:06:08 +0530	[thread overview]
Message-ID: <47BC10A8.4020508@linux.vnet.ibm.com> (raw)
In-Reply-To: <20080220.185821.61784723.taka@valinux.co.jp>

Hirokazu Takahashi wrote:
> Hi,
> 
>>>> I'd like to start from RFC.
>>>>
>>>> In following code
>>>> ==
>>>>   lock_page_cgroup(page);
>>>>   pc = page_get_page_cgroup(page);
>>>>   unlock_page_cgroup(page);
>>>>
>>>>   access 'pc' later..
>>>> == (See, page_cgroup_move_lists())
>>>>
>>>> There is a race because 'pc' is not a stable value without lock_page_cgroup().
>>>> (mem_cgroup_uncharge can free this 'pc').
>>>>
>>>> For example, page_cgroup_move_lists() access pc without lock.
>>>> There is a small race window, between page_cgroup_move_lists()
>>>> and mem_cgroup_uncharge(). At uncharge, page_cgroup struct is immedieately
>>>> freed but move_list can access it after taking lru_lock.
>>>> (*) mem_cgroup_uncharge_page() can be called without zone->lru lock.
>>>>
>>>> This is not good manner.
>>>> .....
>>>> There is no quick fix (maybe). Moreover, I hear some people around me said
>>>> current memcontrol.c codes are very complicated.
>>>> I agree ;( ..it's caued by my work.
>>>>
>>>> I'd like to fix problems in clean way.
>>>> (Note: current -rc2 codes works well under heavy pressure. but there
>>>>  is possibility of race, I think.)
>>> Yes, yes, indeed, I've been working away on this too.
>>>
>>> Ever since the VM_BUG_ON(page_get_page_cgroup(page)) went into
>>> free_hot_cold_page (at my own prompting), I've been hitting it
>>> just very occasionally in my kernel build testing.  Was unable
>>> to reproduce it over the New Year, but a week or two ago found
>>> one machine and config on which it is relatively reproducible,
>>> pretty sure to happen within 12 hours.
>>>
>>> And on Saturday evening at last identified the cause, exactly
>>> where you have: that unsafety in mem_cgroup_move_lists - which
>>> has the nice property of putting pages from the lru on to SLUB's
>>> freelist!
>>>
>>> Unlike the unsafeties of force_empty, this is liable to hit anyone
>>> running with MEM_CONT compiled in, they don't have to be consciously
>>> using mem_cgroups at all.
>>>
>>> (I consider that, by the way, quite a serious defect in the current
>>> mem_cgroup work: that a distro compiling it in for 1% of customers
>>> is then subjecting all to the mem_cgroup overhead - effectively
>>> doubled struct page size and unnecessary accounting overhead.  I
>>> believe there needs to be a way to opt out, a force_empty which
>>> sticks.  Yes, I know the page_cgroup which does that doubling of
>>> size is only allocated on demand, but every page cache page and
>>> every anonymous page is going to have one.  A kmem_cache for them
>>> will reduce the extra, but there still needs to be a way to opt
>>> out completely.)
>>>
>> I've been thinking along these lines as well
>>
>> 1. Have a boot option to turn on/off the memory controller
> 
> It will be much convenient if the memory controller can be turned on/off on
> demand. I think you can turn it off if there aren't any mem_cgroups except
> the root mem_cgroup, 
> 
>> 2. Have a separate cache for the page_cgroup structure. I sent this suggestion
>>    out just yesterday or so.
> 
> I think the policy that every mem_cgroup should be dynamically allocated and
> assigned to the proper page every time is causing some overheads and spinlock
> contentions.
> 
> What do you think if you allocate all page_cgroups and assign to all the pages
> when the memory controller gets turned on, which will allow you to remove
> most of the spinlocks.
> 
> And you may possibly have a chance to remove page->page_cgroup member
> if you allocate array of page_cgroups and attach them to the zone which
> the pages belong to.
> 

We thought of this as well. We dropped it, because we need to track only user
pages at the moment. Doing it for all pages means having the overhead for each
page on the system.

>                zone
>     page[]    +----+    page_cgroup[]
>     +----+<----    ---->+----+
>     |    |    |    |    |    |
>     +----+    |    |    +----+
>     |    |    +----+    |    |
>     +----+              +----+
>     |    |              |    |
>     +----+              +----+
>     |    |              |    |
>     +----+              +----+
>     |    |              |    |
>     +----+              +----+
> 
> 
>> I agree that these are necessary enhancements/changes.
> 
> Thank you,
> Hirokazu Takahashi.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2008-02-20 11:41 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-02-19 12:54 [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races KAMEZAWA Hiroyuki
2008-02-19 15:40 ` Hugh Dickins
2008-02-20  1:03   ` KAMEZAWA Hiroyuki
2008-02-20  4:14     ` Hugh Dickins
2008-02-20  4:37       ` KAMEZAWA Hiroyuki
2008-02-20  4:39         ` Hugh Dickins
2008-02-20  4:41           ` Balbir Singh
2008-02-20  6:40         ` Balbir Singh
2008-02-20  7:23           ` KAMEZAWA Hiroyuki
2008-02-20  3:14   ` KAMEZAWA Hiroyuki
2008-02-20  3:37     ` YAMAMOTO Takashi
2008-02-20  4:13       ` KAMEZAWA Hiroyuki
2008-02-20  4:32     ` Hugh Dickins
2008-02-20  5:57   ` Balbir Singh
2008-02-20  9:58     ` Hirokazu Takahashi
2008-02-20 10:06       ` Paul Menage
2008-02-20 10:11         ` Balbir Singh
2008-02-20 10:18           ` Paul Menage
2008-02-20 10:55             ` Balbir Singh
2008-02-20 11:21               ` KAMEZAWA Hiroyuki
2008-02-20 11:18                 ` Balbir Singh
2008-02-20 11:32                   ` KAMEZAWA Hiroyuki
2008-02-20 11:34                     ` Balbir Singh
2008-02-20 11:44                       ` Paul Menage
2008-02-20 11:41                   ` Paul Menage
2008-02-20 11:36       ` Balbir Singh [this message]
2008-02-20 11:55         ` Paul Menage
2008-02-21  2:49         ` Hirokazu Takahashi
2008-02-21  6:35           ` KAMEZAWA Hiroyuki
2008-02-21  9:07             ` Hirokazu Takahashi
2008-02-21  9:21               ` KAMEZAWA Hiroyuki
2008-02-21  9:28                 ` Balbir Singh
2008-02-21  9:44                   ` KAMEZAWA Hiroyuki
2008-02-22  3:31                     ` [RFC] Block I/O Cgroup Hirokazu Takahashi
2008-02-22  5:05                       ` KAMEZAWA Hiroyuki
2008-02-22  5:45                         ` Hirokazu Takahashi
2008-02-21  9:25               ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Balbir Singh
2008-02-20  6:27   ` Hirokazu Takahashi
2008-02-20  6:50     ` KAMEZAWA Hiroyuki
2008-02-20  8:32       ` Clean up force_empty (Was Re: [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races.) KAMEZAWA Hiroyuki
2008-02-20 10:07         ` Clean up force_empty Hirokazu Takahashi
2008-02-22  9:24       ` [RFC][PATCH] Clarify mem_cgroup lock handling and avoid races Hugh Dickins
2008-02-22 10:07         ` KAMEZAWA Hiroyuki
2008-02-22 10:25           ` Hugh Dickins
2008-02-22 10:34             ` KAMEZAWA Hiroyuki
2008-02-22 10:50         ` Hirokazu Takahashi
2008-02-22 11:14         ` Balbir Singh
2008-02-22 12:00           ` Hugh Dickins
2008-02-22 12:28             ` Balbir Singh
2008-02-22 12:53               ` Hugh Dickins
2008-02-25  3:18                 ` Balbir Singh
2008-02-19 15:54 ` kamezawa.hiroyu
2008-02-19 16:26   ` Hugh Dickins
2008-02-20  1:55     ` KAMEZAWA Hiroyuki
2008-02-20  2:05       ` YAMAMOTO Takashi
2008-02-20  2:15         ` KAMEZAWA Hiroyuki
2008-02-20  2:32           ` YAMAMOTO Takashi
2008-02-20  4:27             ` Hugh Dickins
2008-02-20  6:38     ` Balbir Singh
2008-02-20 11:00       ` Hugh Dickins
2008-02-20 11:32         ` Balbir Singh
2008-02-20 14:19           ` Hugh Dickins
2008-02-20  5:00 ` Balbir Singh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=47BC10A8.4020508@linux.vnet.ibm.com \
    --to=balbir@linux.vnet.ibm.com \
    --cc=hugh@veritas.com \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-mm@kvack.org \
    --cc=riel@redhat.com \
    --cc=taka@valinux.co.jp \
    --cc=yamamoto@valinux.co.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).