Re: [PATCH v3 3/3] proc: add kpageidle file

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Minchan Kim <minchan@kernel.org>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
	Michel Lespinasse <walken@google.com>,
	David Rientjes <rientjes@google.com>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Rik van Riel <riel@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file
Date: Mon, 4 May 2015 19:54:59 +0900	[thread overview]
Message-ID: <20150504105459.GA19384@blaptop> (raw)
In-Reply-To: <20150504094938.GB4197@esperanza>

On Mon, May 04, 2015 at 12:49:39PM +0300, Vladimir Davydov wrote:
> On Mon, May 04, 2015 at 12:17:22PM +0900, Minchan Kim wrote:
> > On Thu, Apr 30, 2015 at 05:50:55PM +0300, Vladimir Davydov wrote:
> > > On Thu, Apr 30, 2015 at 05:25:31PM +0900, Minchan Kim wrote:
> > > > On Wed, Apr 29, 2015 at 12:12:48PM +0300, Vladimir Davydov wrote:
> > > > > On Wed, Apr 29, 2015 at 01:35:36PM +0900, Minchan Kim wrote:
> > > > > > On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > > > > > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > > > > > +static struct page *kpageidle_get_page(unsigned long pfn)
> > > > > > > +{
> > > > > > > +	struct page *page;
> > > > > > > +
> > > > > > > +	if (!pfn_valid(pfn))
> > > > > > > +		return NULL;
> > > > > > > +	page = pfn_to_page(pfn);
> > > > > > > +	/*
> > > > > > > +	 * We are only interested in user memory pages, i.e. pages that are
> > > > > > > +	 * allocated and on an LRU list.
> > > > > > > +	 */
> > > > > > > +	if (!page || page_count(page) == 0 || !PageLRU(page))
> > > > > > > +		return NULL;
> > > > > > > +	if (!get_page_unless_zero(page))
> > > > > > > +		return NULL;
> > > > > > > +	if (unlikely(!PageLRU(page))) {
> > > > > > 
> > > > > > What lock protect the check PageLRU?
> > > > > > If it is racing ClearPageLRU, what happens?
> > > > > 
> > > > > If we hold a reference to a page and see that it's on an LRU list, it
> > > > > will surely remain a user memory page at least until we release the
> > > > > reference to it, so it must be safe to play with idle/young flags. If we
> > > > 
> > > > The problem is that you pass the page in rmap reverse logic(ie, page_referenced)
> > > > once you judge it's LRU page so if it is false-positive, what happens?
> > > > A question is SetPageLRU, PageLRU, ClearPageLRU keeps memory ordering?
> > > > IOW, all of fields from struct page rmap can acccess should be set up completely
> > > > before LRU checking. Otherwise, something will be broken.
> > > 
> > > So, basically you are concerned about the case when we encounter a
> > > freshly allocated page, which has PG_lru bit set and it's going to
> > > become anonymous, but it is still in the process of rmap initialization,
> > > i.e. its ->mapping or ->mapcount may still be uninitialized, right?
> > > 
> > > AFAICS, page_referenced should handle such pages fine. Look, it only
> > > needs ->index, ->mapping, and ->mapcount.
> > > 
> > > If ->mapping is unset, than it is NULL and rmap_walk_anon_lock ->
> > > page_lock_anon_vma_read will return NULL so that rmap_walk will be a
> > > no-op.
> > > 
> > > If ->index is not initialized, than at worst we will go to
> > > anon_vma_interval_tree_foreach over a wrong interval, in which case we
> > > will see that the page is actually not mapped in page_referenced_one ->
> > > page_check_address and again do nothing.
> > > 
> > > If ->mapcount is not initialized it is -1, and page_lock_anon_vma_read
> > > will return NULL, just as it does in case ->mapping = NULL.
> > > 
> > > For file pages, we always take PG_locked before checking ->mapping, so
> > > it must be valid.
> > > 
> > > Thanks,
> > > Vladimir
> > 
> > 
> > do_anonymous_page
> > page_add_new_anon_rmap
> > atomic_set(&page->_mapcount, 0);
> > __page_set_anon_rmap
> >    anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> >    page->mapping = (struct address_space *) anon_vma;
> >    page->index = linear_page_index(vma, address);
> > lru_cache_add 
> >   __pagevec_lru_add_fn
> >   SetPageLRU(page);
> > 
> > During the procedure, there is no lock to prevent race. Then, at least,
> > we need a write memory barrier to guarantee other fields set up before
> > SetPageLRU. (Of course, PageLRU should have read-memory barrier to work
> > well) But I can't find any barrier, either.
> > 
> > IOW, any fields you said could be out of order store without any lock or
> > memory barrier. You might argue atomic op is a barrier on x86 but it
> > doesn't guarantee other arches work like that so we need explict momory
> > barrier or lock.
> > 
> > Let's have a theoretical example.
> > 
> >         CPU 0                                                                           CPU 1
> > 
> > do_anonymous_page
> >   __page_set_anon_rmap
> >   /* out of order happened so SetPageLRU is done ahead */
> >   SetPageLRU(page)
> >   /* Compilr changed store operation like below */
> 
> But it couldn't. Quoting Documentation/atomic_ops.txt:
> 
> 	Properly aligned pointers, longs, ints, and chars (and unsigned
> 	equivalents) may be atomically loaded from and stored to in the same
> 	sense as described for atomic_read() and atomic_set().
> 
> __page_set_anon_rmap sets page->mapping using the following expression:
> 
>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
>  	page->mapping = (struct address_space *) anon_vma;
> 
> and it can't be split, i.e. if one concurrently reads page->mapping
> he/she will see either NULL or (anon_vma+PAGE_MAPPING_ANON), and there
> can't be any intermediate result in page->mapping, such as anon_vma or
> PAGE_MAPPING_ANON, because one doesn't expect
> 
> 	atomic_set(&p, a + b);
> 
> to behave like
> 
> 	atomic_set(&p, a);
> 	atomic_set(&p, atomic_read(&p) + b);

When I parsed the documentation, I understand it that each of words
store/load operation is atomically loaded or stored, not forcing to
preventing split.

Hmm, but it's really important part in this patchset's implementation
so I want to confirm during review process.

I don't have a worry even if I am not a expert about that part because
I know other experts. ;-) Ccing them.

What I want to know is as follows,

In do_anonymous_page, there is following peice of code.
page_add_new_anon_rmap
        __page_set_anon_rmap
                anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; 
                page->mapping = (struct address_space *) anon_vma;
lru_cache_add_active_or_unevictable
        __lru_cache_add
                __pagevec_lru_add_fn 
                        SetPageLRU(page);

From page->mapping to SetPageLRU, there is no explict lock and memory
barrier.

As counterpart, kpageidle can pass page in page_referenced once it judges
out the page has PG_lru with PageLRU check but my concern is that it
judges without any lock or barrier like below.

static struct page *kpageidle_get_page(unsigned long pfn)
{
       struct page *page;

       if (!pfn_valid(pfn))
               return NULL;
       page = pfn_to_page(pfn);
       /*
        * We are only interested in user memory pages, i.e. pages that are
        * allocated and on an LRU list.
        */
       if (!page || page_count(page) == 0 || !PageLRU(page))
               return NULL;
       if (!get_page_unless_zero(page))
               return NULL;
       if (unlikely(!PageLRU(page))) {
               put_page(page);
               return NULL;
       }
       return page;
}

So, I guess once below compiler optimization happens in __page_set_anon_rmap,
it could be corrupt in page_refernced.

__page_set_anon_rmap:
        page->mapping = (struct address_space *) anon_vma;
        page->mapping = (struct address_space *)((void *)page_mapping + PAGE_MAPPING_ANON);

Because page_referenced checks it with PageAnon which has no memory barrier.
So if above compiler optimization happens, page_referenced can pass the anon
page in rmap_walk_file, not ramp_walk_anon. It's my theory. :)

But Vladimir said above above compiler optimization cannot happen because
it's aligned pointer and Documentation/atomic_ops.txt said it couldn't happen.

Thanks for looking this.


> >   page->mapping = (struct address_space *) anon_vma;
> >   /* Big stall happens */
> >                                                                 /* idletacking judged it as LRU page so pass the page
> >                                                                    in page_reference */
> >                                                                 page_refernced
> >                                                                         page_rmapping return true because
> >                                                                         page->mapping has some vaule but not complete
> >                                                                         so it calls rmap_walk_file.
> >                                                                         it's okay to pass non-completed anon page in rmap_walk_file?
> > 
> >   page->mapping = (struct address_space *)
> >         ((void *)page_mapping + PAGE_MAPPING_ANON);
>

> 
> Thanks,
> Vladimir
> 
> >   page->mapping = (struct address_space *) anon_vma;
> >   /* Big stall happens */
> >                                                                 /* idletacking judged it as LRU page so pass the page
> >                                                                    in page_reference */
> >                                                                 page_refernced
> >                                                                         page_rmapping return true because
> >                                                                         page->mapping has some vaule but not complete
> >                                                                         so it calls rmap_walk_file.
> >                                                                         it's okay to pass non-completed anon page in rmap_walk_file?
> > 
> >   page->mapping = (struct address_space *)
> >         ((void *)page_mapping + PAGE_MAPPING_ANON);
> > 
> > It's too theoretical so it might be hard to happen in real practice.
> > My point is there is nothing to prevent explict race.
> > Even if there is no problem with other lock, it's fragile.
> > Do I miss something?
> > 
> > I think general way to handle PageLRU are ahead isolation or zone->lru_lock.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Minchan Kim <minchan@kernel.org>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
	Michel Lespinasse <walken@google.com>,
	David Rientjes <rientjes@google.com>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Rik van Riel <riel@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file
Date: Mon, 4 May 2015 19:54:59 +0900	[thread overview]
Message-ID: <20150504105459.GA19384@blaptop> (raw)
In-Reply-To: <20150504094938.GB4197@esperanza>

On Mon, May 04, 2015 at 12:49:39PM +0300, Vladimir Davydov wrote:
> On Mon, May 04, 2015 at 12:17:22PM +0900, Minchan Kim wrote:
> > On Thu, Apr 30, 2015 at 05:50:55PM +0300, Vladimir Davydov wrote:
> > > On Thu, Apr 30, 2015 at 05:25:31PM +0900, Minchan Kim wrote:
> > > > On Wed, Apr 29, 2015 at 12:12:48PM +0300, Vladimir Davydov wrote:
> > > > > On Wed, Apr 29, 2015 at 01:35:36PM +0900, Minchan Kim wrote:
> > > > > > On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > > > > > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > > > > > +static struct page *kpageidle_get_page(unsigned long pfn)
> > > > > > > +{
> > > > > > > +	struct page *page;
> > > > > > > +
> > > > > > > +	if (!pfn_valid(pfn))
> > > > > > > +		return NULL;
> > > > > > > +	page = pfn_to_page(pfn);
> > > > > > > +	/*
> > > > > > > +	 * We are only interested in user memory pages, i.e. pages that are
> > > > > > > +	 * allocated and on an LRU list.
> > > > > > > +	 */
> > > > > > > +	if (!page || page_count(page) == 0 || !PageLRU(page))
> > > > > > > +		return NULL;
> > > > > > > +	if (!get_page_unless_zero(page))
> > > > > > > +		return NULL;
> > > > > > > +	if (unlikely(!PageLRU(page))) {
> > > > > > 
> > > > > > What lock protect the check PageLRU?
> > > > > > If it is racing ClearPageLRU, what happens?
> > > > > 
> > > > > If we hold a reference to a page and see that it's on an LRU list, it
> > > > > will surely remain a user memory page at least until we release the
> > > > > reference to it, so it must be safe to play with idle/young flags. If we
> > > > 
> > > > The problem is that you pass the page in rmap reverse logic(ie, page_referenced)
> > > > once you judge it's LRU page so if it is false-positive, what happens?
> > > > A question is SetPageLRU, PageLRU, ClearPageLRU keeps memory ordering?
> > > > IOW, all of fields from struct page rmap can acccess should be set up completely
> > > > before LRU checking. Otherwise, something will be broken.
> > > 
> > > So, basically you are concerned about the case when we encounter a
> > > freshly allocated page, which has PG_lru bit set and it's going to
> > > become anonymous, but it is still in the process of rmap initialization,
> > > i.e. its ->mapping or ->mapcount may still be uninitialized, right?
> > > 
> > > AFAICS, page_referenced should handle such pages fine. Look, it only
> > > needs ->index, ->mapping, and ->mapcount.
> > > 
> > > If ->mapping is unset, than it is NULL and rmap_walk_anon_lock ->
> > > page_lock_anon_vma_read will return NULL so that rmap_walk will be a
> > > no-op.
> > > 
> > > If ->index is not initialized, than at worst we will go to
> > > anon_vma_interval_tree_foreach over a wrong interval, in which case we
> > > will see that the page is actually not mapped in page_referenced_one ->
> > > page_check_address and again do nothing.
> > > 
> > > If ->mapcount is not initialized it is -1, and page_lock_anon_vma_read
> > > will return NULL, just as it does in case ->mapping = NULL.
> > > 
> > > For file pages, we always take PG_locked before checking ->mapping, so
> > > it must be valid.
> > > 
> > > Thanks,
> > > Vladimir
> > 
> > 
> > do_anonymous_page
> > page_add_new_anon_rmap
> > atomic_set(&page->_mapcount, 0);
> > __page_set_anon_rmap
> >    anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> >    page->mapping = (struct address_space *) anon_vma;
> >    page->index = linear_page_index(vma, address);
> > lru_cache_add 
> >   __pagevec_lru_add_fn
> >   SetPageLRU(page);
> > 
> > During the procedure, there is no lock to prevent race. Then, at least,
> > we need a write memory barrier to guarantee other fields set up before
> > SetPageLRU. (Of course, PageLRU should have read-memory barrier to work
> > well) But I can't find any barrier, either.
> > 
> > IOW, any fields you said could be out of order store without any lock or
> > memory barrier. You might argue atomic op is a barrier on x86 but it
> > doesn't guarantee other arches work like that so we need explict momory
> > barrier or lock.
> > 
> > Let's have a theoretical example.
> > 
> >         CPU 0                                                                           CPU 1
> > 
> > do_anonymous_page
> >   __page_set_anon_rmap
> >   /* out of order happened so SetPageLRU is done ahead */
> >   SetPageLRU(page)
> >   /* Compilr changed store operation like below */
> 
> But it couldn't. Quoting Documentation/atomic_ops.txt:
> 
> 	Properly aligned pointers, longs, ints, and chars (and unsigned
> 	equivalents) may be atomically loaded from and stored to in the same
> 	sense as described for atomic_read() and atomic_set().
> 
> __page_set_anon_rmap sets page->mapping using the following expression:
> 
>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
>  	page->mapping = (struct address_space *) anon_vma;
> 
> and it can't be split, i.e. if one concurrently reads page->mapping
> he/she will see either NULL or (anon_vma+PAGE_MAPPING_ANON), and there
> can't be any intermediate result in page->mapping, such as anon_vma or
> PAGE_MAPPING_ANON, because one doesn't expect
> 
> 	atomic_set(&p, a + b);
> 
> to behave like
> 
> 	atomic_set(&p, a);
> 	atomic_set(&p, atomic_read(&p) + b);

When I parsed the documentation, I understand it that each of words
store/load operation is atomically loaded or stored, not forcing to
preventing split.

Hmm, but it's really important part in this patchset's implementation
so I want to confirm during review process.

I don't have a worry even if I am not a expert about that part because
I know other experts. ;-) Ccing them.

What I want to know is as follows,

In do_anonymous_page, there is following peice of code.
page_add_new_anon_rmap
        __page_set_anon_rmap
                anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; 
                page->mapping = (struct address_space *) anon_vma;
lru_cache_add_active_or_unevictable
        __lru_cache_add
                __pagevec_lru_add_fn 
                        SetPageLRU(page);

>From page->mapping to SetPageLRU, there is no explict lock and memory
barrier.

As counterpart, kpageidle can pass page in page_referenced once it judges
out the page has PG_lru with PageLRU check but my concern is that it
judges without any lock or barrier like below.

static struct page *kpageidle_get_page(unsigned long pfn)
{
       struct page *page;

       if (!pfn_valid(pfn))
               return NULL;
       page = pfn_to_page(pfn);
       /*
        * We are only interested in user memory pages, i.e. pages that are
        * allocated and on an LRU list.
        */
       if (!page || page_count(page) == 0 || !PageLRU(page))
               return NULL;
       if (!get_page_unless_zero(page))
               return NULL;
       if (unlikely(!PageLRU(page))) {
               put_page(page);
               return NULL;
       }
       return page;
}

So, I guess once below compiler optimization happens in __page_set_anon_rmap,
it could be corrupt in page_refernced.

__page_set_anon_rmap:
        page->mapping = (struct address_space *) anon_vma;
        page->mapping = (struct address_space *)((void *)page_mapping + PAGE_MAPPING_ANON);

Because page_referenced checks it with PageAnon which has no memory barrier.
So if above compiler optimization happens, page_referenced can pass the anon
page in rmap_walk_file, not ramp_walk_anon. It's my theory. :)

But Vladimir said above above compiler optimization cannot happen because
it's aligned pointer and Documentation/atomic_ops.txt said it couldn't happen.

Thanks for looking this.


> >   page->mapping = (struct address_space *) anon_vma;
> >   /* Big stall happens */
> >                                                                 /* idletacking judged it as LRU page so pass the page
> >                                                                    in page_reference */
> >                                                                 page_refernced
> >                                                                         page_rmapping return true because
> >                                                                         page->mapping has some vaule but not complete
> >                                                                         so it calls rmap_walk_file.
> >                                                                         it's okay to pass non-completed anon page in rmap_walk_file?
> > 
> >   page->mapping = (struct address_space *)
> >         ((void *)page_mapping + PAGE_MAPPING_ANON);
>

> 
> Thanks,
> Vladimir
> 
> >   page->mapping = (struct address_space *) anon_vma;
> >   /* Big stall happens */
> >                                                                 /* idletacking judged it as LRU page so pass the page
> >                                                                    in page_reference */
> >                                                                 page_refernced
> >                                                                         page_rmapping return true because
> >                                                                         page->mapping has some vaule but not complete
> >                                                                         so it calls rmap_walk_file.
> >                                                                         it's okay to pass non-completed anon page in rmap_walk_file?
> > 
> >   page->mapping = (struct address_space *)
> >         ((void *)page_mapping + PAGE_MAPPING_ANON);
> > 
> > It's too theoretical so it might be hard to happen in real practice.
> > My point is there is nothing to prevent explict race.
> > Even if there is no problem with other lock, it's fragile.
> > Do I miss something?
> > 
> > I think general way to handle PageLRU are ahead isolation or zone->lru_lock.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)

From: Minchan Kim <minchan@kernel.org>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
	Michel Lespinasse <walken@google.com>,
	David Rientjes <rientjes@google.com>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Rik van Riel <riel@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file
Date: Mon, 4 May 2015 19:54:59 +0900	[thread overview]
Message-ID: <20150504105459.GA19384@blaptop> (raw)
In-Reply-To: <20150504094938.GB4197@esperanza>

On Mon, May 04, 2015 at 12:49:39PM +0300, Vladimir Davydov wrote:
> On Mon, May 04, 2015 at 12:17:22PM +0900, Minchan Kim wrote:
> > On Thu, Apr 30, 2015 at 05:50:55PM +0300, Vladimir Davydov wrote:
> > > On Thu, Apr 30, 2015 at 05:25:31PM +0900, Minchan Kim wrote:
> > > > On Wed, Apr 29, 2015 at 12:12:48PM +0300, Vladimir Davydov wrote:
> > > > > On Wed, Apr 29, 2015 at 01:35:36PM +0900, Minchan Kim wrote:
> > > > > > On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > > > > > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > > > > > +static struct page *kpageidle_get_page(unsigned long pfn)
> > > > > > > +{
> > > > > > > +	struct page *page;
> > > > > > > +
> > > > > > > +	if (!pfn_valid(pfn))
> > > > > > > +		return NULL;
> > > > > > > +	page = pfn_to_page(pfn);
> > > > > > > +	/*
> > > > > > > +	 * We are only interested in user memory pages, i.e. pages that are
> > > > > > > +	 * allocated and on an LRU list.
> > > > > > > +	 */
> > > > > > > +	if (!page || page_count(page) == 0 || !PageLRU(page))
> > > > > > > +		return NULL;
> > > > > > > +	if (!get_page_unless_zero(page))
> > > > > > > +		return NULL;
> > > > > > > +	if (unlikely(!PageLRU(page))) {
> > > > > > 
> > > > > > What lock protect the check PageLRU?
> > > > > > If it is racing ClearPageLRU, what happens?
> > > > > 
> > > > > If we hold a reference to a page and see that it's on an LRU list, it
> > > > > will surely remain a user memory page at least until we release the
> > > > > reference to it, so it must be safe to play with idle/young flags. If we
> > > > 
> > > > The problem is that you pass the page in rmap reverse logic(ie, page_referenced)
> > > > once you judge it's LRU page so if it is false-positive, what happens?
> > > > A question is SetPageLRU, PageLRU, ClearPageLRU keeps memory ordering?
> > > > IOW, all of fields from struct page rmap can acccess should be set up completely
> > > > before LRU checking. Otherwise, something will be broken.
> > > 
> > > So, basically you are concerned about the case when we encounter a
> > > freshly allocated page, which has PG_lru bit set and it's going to
> > > become anonymous, but it is still in the process of rmap initialization,
> > > i.e. its ->mapping or ->mapcount may still be uninitialized, right?
> > > 
> > > AFAICS, page_referenced should handle such pages fine. Look, it only
> > > needs ->index, ->mapping, and ->mapcount.
> > > 
> > > If ->mapping is unset, than it is NULL and rmap_walk_anon_lock ->
> > > page_lock_anon_vma_read will return NULL so that rmap_walk will be a
> > > no-op.
> > > 
> > > If ->index is not initialized, than at worst we will go to
> > > anon_vma_interval_tree_foreach over a wrong interval, in which case we
> > > will see that the page is actually not mapped in page_referenced_one ->
> > > page_check_address and again do nothing.
> > > 
> > > If ->mapcount is not initialized it is -1, and page_lock_anon_vma_read
> > > will return NULL, just as it does in case ->mapping = NULL.
> > > 
> > > For file pages, we always take PG_locked before checking ->mapping, so
> > > it must be valid.
> > > 
> > > Thanks,
> > > Vladimir
> > 
> > 
> > do_anonymous_page
> > page_add_new_anon_rmap
> > atomic_set(&page->_mapcount, 0);
> > __page_set_anon_rmap
> >    anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> >    page->mapping = (struct address_space *) anon_vma;
> >    page->index = linear_page_index(vma, address);
> > lru_cache_add 
> >   __pagevec_lru_add_fn
> >   SetPageLRU(page);
> > 
> > During the procedure, there is no lock to prevent race. Then, at least,
> > we need a write memory barrier to guarantee other fields set up before
> > SetPageLRU. (Of course, PageLRU should have read-memory barrier to work
> > well) But I can't find any barrier, either.
> > 
> > IOW, any fields you said could be out of order store without any lock or
> > memory barrier. You might argue atomic op is a barrier on x86 but it
> > doesn't guarantee other arches work like that so we need explict momory
> > barrier or lock.
> > 
> > Let's have a theoretical example.
> > 
> >         CPU 0                                                                           CPU 1
> > 
> > do_anonymous_page
> >   __page_set_anon_rmap
> >   /* out of order happened so SetPageLRU is done ahead */
> >   SetPageLRU(page)
> >   /* Compilr changed store operation like below */
> 
> But it couldn't. Quoting Documentation/atomic_ops.txt:
> 
> 	Properly aligned pointers, longs, ints, and chars (and unsigned
> 	equivalents) may be atomically loaded from and stored to in the same
> 	sense as described for atomic_read() and atomic_set().
> 
> __page_set_anon_rmap sets page->mapping using the following expression:
> 
>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
>  	page->mapping = (struct address_space *) anon_vma;
> 
> and it can't be split, i.e. if one concurrently reads page->mapping
> he/she will see either NULL or (anon_vma+PAGE_MAPPING_ANON), and there
> can't be any intermediate result in page->mapping, such as anon_vma or
> PAGE_MAPPING_ANON, because one doesn't expect
> 
> 	atomic_set(&p, a + b);
> 
> to behave like
> 
> 	atomic_set(&p, a);
> 	atomic_set(&p, atomic_read(&p) + b);

When I parsed the documentation, I understand it that each of words
store/load operation is atomically loaded or stored, not forcing to
preventing split.

Hmm, but it's really important part in this patchset's implementation
so I want to confirm during review process.

I don't have a worry even if I am not a expert about that part because
I know other experts. ;-) Ccing them.

What I want to know is as follows,

In do_anonymous_page, there is following peice of code.
page_add_new_anon_rmap
        __page_set_anon_rmap
                anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; 
                page->mapping = (struct address_space *) anon_vma;
lru_cache_add_active_or_unevictable
        __lru_cache_add
                __pagevec_lru_add_fn 
                        SetPageLRU(page);

>From page->mapping to SetPageLRU, there is no explict lock and memory
barrier.

As counterpart, kpageidle can pass page in page_referenced once it judges
out the page has PG_lru with PageLRU check but my concern is that it
judges without any lock or barrier like below.

static struct page *kpageidle_get_page(unsigned long pfn)
{
       struct page *page;

       if (!pfn_valid(pfn))
               return NULL;
       page = pfn_to_page(pfn);
       /*
        * We are only interested in user memory pages, i.e. pages that are
        * allocated and on an LRU list.
        */
       if (!page || page_count(page) == 0 || !PageLRU(page))
               return NULL;
       if (!get_page_unless_zero(page))
               return NULL;
       if (unlikely(!PageLRU(page))) {
               put_page(page);
               return NULL;
       }
       return page;
}

So, I guess once below compiler optimization happens in __page_set_anon_rmap,
it could be corrupt in page_refernced.

__page_set_anon_rmap:
        page->mapping = (struct address_space *) anon_vma;
        page->mapping = (struct address_space *)((void *)page_mapping + PAGE_MAPPING_ANON);

Because page_referenced checks it with PageAnon which has no memory barrier.
So if above compiler optimization happens, page_referenced can pass the anon
page in rmap_walk_file, not ramp_walk_anon. It's my theory. :)

But Vladimir said above above compiler optimization cannot happen because
it's aligned pointer and Documentation/atomic_ops.txt said it couldn't happen.

Thanks for looking this.


> >   page->mapping = (struct address_space *) anon_vma;
> >   /* Big stall happens */
> >                                                                 /* idletacking judged it as LRU page so pass the page
> >                                                                    in page_reference */
> >                                                                 page_refernced
> >                                                                         page_rmapping return true because
> >                                                                         page->mapping has some vaule but not complete
> >                                                                         so it calls rmap_walk_file.
> >                                                                         it's okay to pass non-completed anon page in rmap_walk_file?
> > 
> >   page->mapping = (struct address_space *)
> >         ((void *)page_mapping + PAGE_MAPPING_ANON);
>

> 
> Thanks,
> Vladimir
> 
> >   page->mapping = (struct address_space *) anon_vma;
> >   /* Big stall happens */
> >                                                                 /* idletacking judged it as LRU page so pass the page
> >                                                                    in page_reference */
> >                                                                 page_refernced
> >                                                                         page_rmapping return true because
> >                                                                         page->mapping has some vaule but not complete
> >                                                                         so it calls rmap_walk_file.
> >                                                                         it's okay to pass non-completed anon page in rmap_walk_file?
> > 
> >   page->mapping = (struct address_space *)
> >         ((void *)page_mapping + PAGE_MAPPING_ANON);
> > 
> > It's too theoretical so it might be hard to happen in real practice.
> > My point is there is nothing to prevent explict race.
> > Even if there is no problem with other lock, it's fragile.
> > Do I miss something?
> > 
> > I think general way to handle PageLRU are ahead isolation or zone->lru_lock.

-- 
Kind regards,
Minchan Kim

next prev parent reply	other threads:[~2015-05-04 10:54 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-28 12:24 [PATCH v3 0/3] idle memory tracking Vladimir Davydov
2015-04-28 12:24 ` Vladimir Davydov
2015-04-28 12:24 ` [PATCH v3 1/3] memcg: add page_cgroup_ino helper Vladimir Davydov
2015-04-28 12:24   ` Vladimir Davydov
2015-04-28 12:24 ` [PATCH v3 2/3] proc: add kpagecgroup file Vladimir Davydov
2015-04-28 12:24   ` Vladimir Davydov
2015-04-28 12:24   ` Vladimir Davydov
2015-04-28 12:24 ` [PATCH v3 3/3] proc: add kpageidle file Vladimir Davydov
2015-04-28 12:24   ` Vladimir Davydov
2015-04-29  4:35   ` Minchan Kim
2015-04-29  4:35     ` Minchan Kim
2015-04-29  9:12     ` Vladimir Davydov
2015-04-29  9:12       ` Vladimir Davydov
2015-04-30  8:25       ` Minchan Kim
2015-04-30  8:25         ` Minchan Kim
2015-04-30 14:50         ` Vladimir Davydov
2015-04-30 14:50           ` Vladimir Davydov
2015-04-30 14:50           ` Vladimir Davydov
2015-05-04  3:17           ` Minchan Kim
2015-05-04  3:17             ` Minchan Kim
2015-05-04  9:49             ` Vladimir Davydov
2015-05-04  9:49               ` Vladimir Davydov
2015-05-04  9:49               ` Vladimir Davydov
2015-05-04 10:54               ` Minchan Kim [this message]
2015-05-04 10:54                 ` Minchan Kim
2015-05-04 10:54                 ` Minchan Kim
2015-05-08  9:56                 ` Vladimir Davydov
2015-05-08  9:56                   ` Vladimir Davydov
2015-05-08  9:56                   ` Vladimir Davydov
2015-05-09 15:12                   ` Minchan Kim
2015-05-09 15:12                     ` Minchan Kim
2015-05-10 10:34                     ` Vladimir Davydov
2015-05-10 10:34                       ` Vladimir Davydov
2015-05-10 10:34                       ` Vladimir Davydov
2015-05-12  9:41                       ` Vladimir Davydov
2015-05-12  9:41                         ` Vladimir Davydov
     [not found]   ` <4c24a6bf2c9711dd4dbb72a43a16eba6867527b7.1430217477.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2015-04-29  4:57     ` Minchan Kim
2015-04-29  4:57       ` Minchan Kim
2015-04-29  4:57       ` Minchan Kim
2015-04-29  8:31       ` Vladimir Davydov
2015-04-29  8:31         ` Vladimir Davydov
2015-04-29  8:31         ` Vladimir Davydov
2015-04-30  6:55         ` Minchan Kim
2015-04-30  6:55           ` Minchan Kim
2015-04-30  6:55           ` Minchan Kim
2015-04-29  3:57 ` [PATCH v3 0/3] idle memory tracking Minchan Kim
2015-04-29  3:57   ` Minchan Kim
2015-04-29  7:58   ` Vladimir Davydov
2015-04-29  7:58     ` Vladimir Davydov
     [not found] ` <cover.1430217477.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2015-04-29  5:02   ` Minchan Kim
2015-04-29  5:02     ` Minchan Kim
2015-04-29  5:02     ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150504105459.GA19384@blaptop \
    --to=minchan@kernel.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=cl@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=gorcunov@openvz.org \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.cz \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=vdavydov@parallels.com \
    --cc=walken@google.com \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.