linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Minchan Kim <minchan@kernel.org>
To: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>, Greg Thelen <gthelen@google.com>,
	Michel Lespinasse <walken@google.com>,
	David Rientjes <rientjes@google.com>,
	Pavel Emelyanov <xemul@parallels.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, Rik van Riel <riel@redhat.com>,
	Hugh Dickins <hughd@google.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>
Subject: Re: [PATCH v3 3/3] proc: add kpageidle file
Date: Mon, 4 May 2015 19:54:59 +0900	[thread overview]
Message-ID: <20150504105459.GA19384@blaptop> (raw)
In-Reply-To: <20150504094938.GB4197@esperanza>

On Mon, May 04, 2015 at 12:49:39PM +0300, Vladimir Davydov wrote:
> On Mon, May 04, 2015 at 12:17:22PM +0900, Minchan Kim wrote:
> > On Thu, Apr 30, 2015 at 05:50:55PM +0300, Vladimir Davydov wrote:
> > > On Thu, Apr 30, 2015 at 05:25:31PM +0900, Minchan Kim wrote:
> > > > On Wed, Apr 29, 2015 at 12:12:48PM +0300, Vladimir Davydov wrote:
> > > > > On Wed, Apr 29, 2015 at 01:35:36PM +0900, Minchan Kim wrote:
> > > > > > On Tue, Apr 28, 2015 at 03:24:42PM +0300, Vladimir Davydov wrote:
> > > > > > > +#ifdef CONFIG_IDLE_PAGE_TRACKING
> > > > > > > +static struct page *kpageidle_get_page(unsigned long pfn)
> > > > > > > +{
> > > > > > > +	struct page *page;
> > > > > > > +
> > > > > > > +	if (!pfn_valid(pfn))
> > > > > > > +		return NULL;
> > > > > > > +	page = pfn_to_page(pfn);
> > > > > > > +	/*
> > > > > > > +	 * We are only interested in user memory pages, i.e. pages that are
> > > > > > > +	 * allocated and on an LRU list.
> > > > > > > +	 */
> > > > > > > +	if (!page || page_count(page) == 0 || !PageLRU(page))
> > > > > > > +		return NULL;
> > > > > > > +	if (!get_page_unless_zero(page))
> > > > > > > +		return NULL;
> > > > > > > +	if (unlikely(!PageLRU(page))) {
> > > > > > 
> > > > > > What lock protect the check PageLRU?
> > > > > > If it is racing ClearPageLRU, what happens?
> > > > > 
> > > > > If we hold a reference to a page and see that it's on an LRU list, it
> > > > > will surely remain a user memory page at least until we release the
> > > > > reference to it, so it must be safe to play with idle/young flags. If we
> > > > 
> > > > The problem is that you pass the page in rmap reverse logic(ie, page_referenced)
> > > > once you judge it's LRU page so if it is false-positive, what happens?
> > > > A question is SetPageLRU, PageLRU, ClearPageLRU keeps memory ordering?
> > > > IOW, all of fields from struct page rmap can acccess should be set up completely
> > > > before LRU checking. Otherwise, something will be broken.
> > > 
> > > So, basically you are concerned about the case when we encounter a
> > > freshly allocated page, which has PG_lru bit set and it's going to
> > > become anonymous, but it is still in the process of rmap initialization,
> > > i.e. its ->mapping or ->mapcount may still be uninitialized, right?
> > > 
> > > AFAICS, page_referenced should handle such pages fine. Look, it only
> > > needs ->index, ->mapping, and ->mapcount.
> > > 
> > > If ->mapping is unset, than it is NULL and rmap_walk_anon_lock ->
> > > page_lock_anon_vma_read will return NULL so that rmap_walk will be a
> > > no-op.
> > > 
> > > If ->index is not initialized, than at worst we will go to
> > > anon_vma_interval_tree_foreach over a wrong interval, in which case we
> > > will see that the page is actually not mapped in page_referenced_one ->
> > > page_check_address and again do nothing.
> > > 
> > > If ->mapcount is not initialized it is -1, and page_lock_anon_vma_read
> > > will return NULL, just as it does in case ->mapping = NULL.
> > > 
> > > For file pages, we always take PG_locked before checking ->mapping, so
> > > it must be valid.
> > > 
> > > Thanks,
> > > Vladimir
> > 
> > 
> > do_anonymous_page
> > page_add_new_anon_rmap
> > atomic_set(&page->_mapcount, 0);
> > __page_set_anon_rmap
> >    anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
> >    page->mapping = (struct address_space *) anon_vma;
> >    page->index = linear_page_index(vma, address);
> > lru_cache_add 
> >   __pagevec_lru_add_fn
> >   SetPageLRU(page);
> > 
> > During the procedure, there is no lock to prevent race. Then, at least,
> > we need a write memory barrier to guarantee other fields set up before
> > SetPageLRU. (Of course, PageLRU should have read-memory barrier to work
> > well) But I can't find any barrier, either.
> > 
> > IOW, any fields you said could be out of order store without any lock or
> > memory barrier. You might argue atomic op is a barrier on x86 but it
> > doesn't guarantee other arches work like that so we need explict momory
> > barrier or lock.
> > 
> > Let's have a theoretical example.
> > 
> >         CPU 0                                                                           CPU 1
> > 
> > do_anonymous_page
> >   __page_set_anon_rmap
> >   /* out of order happened so SetPageLRU is done ahead */
> >   SetPageLRU(page)
> >   /* Compilr changed store operation like below */
> 
> But it couldn't. Quoting Documentation/atomic_ops.txt:
> 
> 	Properly aligned pointers, longs, ints, and chars (and unsigned
> 	equivalents) may be atomically loaded from and stored to in the same
> 	sense as described for atomic_read() and atomic_set().
> 
> __page_set_anon_rmap sets page->mapping using the following expression:
> 
>  	anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
>  	page->mapping = (struct address_space *) anon_vma;
> 
> and it can't be split, i.e. if one concurrently reads page->mapping
> he/she will see either NULL or (anon_vma+PAGE_MAPPING_ANON), and there
> can't be any intermediate result in page->mapping, such as anon_vma or
> PAGE_MAPPING_ANON, because one doesn't expect
> 
> 	atomic_set(&p, a + b);
> 
> to behave like
> 
> 	atomic_set(&p, a);
> 	atomic_set(&p, atomic_read(&p) + b);

When I parsed the documentation, I understand it that each of words
store/load operation is atomically loaded or stored, not forcing to
preventing split.

Hmm, but it's really important part in this patchset's implementation
so I want to confirm during review process.

I don't have a worry even if I am not a expert about that part because
I know other experts. ;-) Ccing them.

What I want to know is as follows,

In do_anonymous_page, there is following peice of code.
page_add_new_anon_rmap
        __page_set_anon_rmap
                anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON; 
                page->mapping = (struct address_space *) anon_vma;
lru_cache_add_active_or_unevictable
        __lru_cache_add
                __pagevec_lru_add_fn 
                        SetPageLRU(page);

>From page->mapping to SetPageLRU, there is no explict lock and memory
barrier.

As counterpart, kpageidle can pass page in page_referenced once it judges
out the page has PG_lru with PageLRU check but my concern is that it
judges without any lock or barrier like below.

static struct page *kpageidle_get_page(unsigned long pfn)
{
       struct page *page;

       if (!pfn_valid(pfn))
               return NULL;
       page = pfn_to_page(pfn);
       /*
        * We are only interested in user memory pages, i.e. pages that are
        * allocated and on an LRU list.
        */
       if (!page || page_count(page) == 0 || !PageLRU(page))
               return NULL;
       if (!get_page_unless_zero(page))
               return NULL;
       if (unlikely(!PageLRU(page))) {
               put_page(page);
               return NULL;
       }
       return page;
}

So, I guess once below compiler optimization happens in __page_set_anon_rmap,
it could be corrupt in page_refernced.

__page_set_anon_rmap:
        page->mapping = (struct address_space *) anon_vma;
        page->mapping = (struct address_space *)((void *)page_mapping + PAGE_MAPPING_ANON);

Because page_referenced checks it with PageAnon which has no memory barrier.
So if above compiler optimization happens, page_referenced can pass the anon
page in rmap_walk_file, not ramp_walk_anon. It's my theory. :)

But Vladimir said above above compiler optimization cannot happen because
it's aligned pointer and Documentation/atomic_ops.txt said it couldn't happen.

Thanks for looking this.


> >   page->mapping = (struct address_space *) anon_vma;
> >   /* Big stall happens */
> >                                                                 /* idletacking judged it as LRU page so pass the page
> >                                                                    in page_reference */
> >                                                                 page_refernced
> >                                                                         page_rmapping return true because
> >                                                                         page->mapping has some vaule but not complete
> >                                                                         so it calls rmap_walk_file.
> >                                                                         it's okay to pass non-completed anon page in rmap_walk_file?
> > 
> >   page->mapping = (struct address_space *)
> >         ((void *)page_mapping + PAGE_MAPPING_ANON);
>

> 
> Thanks,
> Vladimir
> 
> >   page->mapping = (struct address_space *) anon_vma;
> >   /* Big stall happens */
> >                                                                 /* idletacking judged it as LRU page so pass the page
> >                                                                    in page_reference */
> >                                                                 page_refernced
> >                                                                         page_rmapping return true because
> >                                                                         page->mapping has some vaule but not complete
> >                                                                         so it calls rmap_walk_file.
> >                                                                         it's okay to pass non-completed anon page in rmap_walk_file?
> > 
> >   page->mapping = (struct address_space *)
> >         ((void *)page_mapping + PAGE_MAPPING_ANON);
> > 
> > It's too theoretical so it might be hard to happen in real practice.
> > My point is there is nothing to prevent explict race.
> > Even if there is no problem with other lock, it's fragile.
> > Do I miss something?
> > 
> > I think general way to handle PageLRU are ahead isolation or zone->lru_lock.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2015-05-04 10:54 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-28 12:24 [PATCH v3 0/3] idle memory tracking Vladimir Davydov
2015-04-28 12:24 ` [PATCH v3 1/3] memcg: add page_cgroup_ino helper Vladimir Davydov
2015-04-28 12:24 ` [PATCH v3 2/3] proc: add kpagecgroup file Vladimir Davydov
2015-04-28 12:24 ` [PATCH v3 3/3] proc: add kpageidle file Vladimir Davydov
2015-04-29  4:35   ` Minchan Kim
2015-04-29  9:12     ` Vladimir Davydov
2015-04-30  8:25       ` Minchan Kim
2015-04-30 14:50         ` Vladimir Davydov
2015-05-04  3:17           ` Minchan Kim
2015-05-04  9:49             ` Vladimir Davydov
2015-05-04 10:54               ` Minchan Kim [this message]
2015-05-08  9:56                 ` Vladimir Davydov
2015-05-09 15:12                   ` Minchan Kim
2015-05-10 10:34                     ` Vladimir Davydov
2015-05-12  9:41                       ` Vladimir Davydov
     [not found]   ` <4c24a6bf2c9711dd4dbb72a43a16eba6867527b7.1430217477.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2015-04-29  4:57     ` Minchan Kim
2015-04-29  8:31       ` Vladimir Davydov
2015-04-30  6:55         ` Minchan Kim
2015-04-29  3:57 ` [PATCH v3 0/3] idle memory tracking Minchan Kim
2015-04-29  7:58   ` Vladimir Davydov
     [not found] ` <cover.1430217477.git.vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2015-04-29  5:02   ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150504105459.GA19384@blaptop \
    --to=minchan@kernel.org \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=cl@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=gorcunov@openvz.org \
    --cc=gthelen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=hughd@google.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.cz \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=vdavydov@parallels.com \
    --cc=walken@google.com \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).