From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Hugh Dickins <hughd@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Christoph Lameter <cl@gentwo.org>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Jerome Marchand <jmarchan@redhat.com>,
	Yang Shi <yang.shi@linaro.org>,
	Sasha Levin <sasha.levin@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
Date: Wed, 24 Feb 2016 01:58:26 +0300
Message-ID: <20160223225825.GA3651@node.shutemov.name>
In-Reply-To: <56C35EA8.2000407@intel.com>

On Tue, Feb 16, 2016 at 09:38:48AM -0800, Dave Hansen wrote:
> Sorry, fat-fingered the send on the last one.
> 
> On 02/16/2016 06:26 AM, Kirill A. Shutemov wrote:
> > On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
> >>> +	if (unlikely(pmd_none(*fe->pmd) &&
> >>> +			__pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)))
> >>> +		return VM_FAULT_OOM;
> >>
> >> Should we just move this pmd_none() check into __pte_alloc()?  You do
> >> this same-style check at least twice.
> > 
> > We have it there. The check here is speculative to avoid taking ptl.
> 
> OK, that's a performance optimization.  Why shouldn't all callers of
> __pte_alloc() get the same optimization?

I've sent a patch for this.
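
For reference, a minimal sketch of what folding the speculative check into
the allocation path could look like (assumed shape, not the actual patch):

	/*
	 * Give every caller the cheap pmd_none() test for free; only
	 * when the pmd is empty do we pay for __pte_alloc() and its
	 * locking. Evaluates to true on allocation failure.
	 */
	#define pte_alloc(mm, vma, pmd, address)		\
		(unlikely(pmd_none(*(pmd))) &&			\
		 __pte_alloc(mm, vma, pmd, address))

The call site above would then collapse to
"if (pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)) return VM_FAULT_OOM;".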

> >>> +	/* If an huge pmd materialized from under us just retry later */
> >>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> >>> +		return 0;
> >>
> >> Nit: please stop sprinkling unlikely() everywhere.  Is there some
> >> concrete benefit to doing it here?  I really doubt the compiler needs
> >> help putting the code for "return 0" out-of-line.
> >>
> >> Why is it important to abort here?  Is this a small-page-only path?
> > 
> > This unlikely() was moved from __handle_mm_fault(). I didn't put much
> > consideration in it.
> 
> OK, but separately from the unlikely()...  Why is it important to jump
> out of this code when we see a pmd_trans_huge() pmd?

The code below works at the pte level, so it expects the pmd to point to a
page table.

And the page fault has most likely been resolved anyway.
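
To illustrate (a sketch, not a quote from the patch): everything past that
check dereferences a pte table hanging off fe->pmd, along the lines of

	/* Only valid once we know *fe->pmd points to a pte page: */
	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
			&fe->ptl);

and calling pte_offset_map_lock() on a huge pmd would walk garbage.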

> >>> +static int pte_alloc_one_map(struct fault_env *fe)
> >>> +{
> >>> +	struct vm_area_struct *vma = fe->vma;
> >>> +
> >>> +	if (!pmd_none(*fe->pmd))
> >>> +		goto map_pte;
> >>
> >> So the calling convention here is...?  It looks like this has to be
> >> called with fe->pmd == pmd_none().  If not, we assume it's pointing to a
> >> pte page.  This can never be called on a huge pmd.  Right?
> > 
> > It's not under the ptl, so the pmd can be filled under us. There's a huge
> > pmd check in the 'map_pte' goto path.
> 
> OK, could we add some comments on that?  We expect to be called to
> ______, but if there is a race, we might also have to handle ______, etc...?

Ok.
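
Something along these lines, perhaps (assumed wording):

	/*
	 * pte_alloc_one_map() expects fe->pmd to be none (we populate
	 * it, preferring fe->prealloc_pte if one was preallocated) or
	 * to already point to a pte page. We are not holding the ptl
	 * yet, so the pmd can get populated under us -- possibly with
	 * a huge pmd; the pmd_trans_huge() check under 'map_pte'
	 * handles that race by returning VM_FAULT_NOPAGE so the fault
	 * is retried.
	 */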

> >>> +	if (fe->prealloc_pte) {
> >>> +		smp_wmb(); /* See comment in __pte_alloc() */
> >>
> >> Are we trying to make *this* cpu's write visible, or to see the write
> >> from __pte_alloc()?  It seems like we're trying to see the write.  Isn't
> >> smp_rmb() what we want for that?
> > 
> > See 362a61ad6119.
> 
> That patch explains that anyone allocating and initializing a page table
> page must ensure that all CPUs can see the initialization writes
> *before* the page can be linked into the page tables.  __pte_alloc()
> performs a smp_wmb() to ensure that other processors can see its writes.
> 
> That still doesn't answer my question though.  What does this barrier
> do?  What does it make visible to this processor?  __pte_alloc() already
> made its initialization visible, so what's the purpose *here*?

We don't call __pte_alloc() to allocate the page table for ->prealloc_pte;
we call pte_alloc_one(), which doesn't have the barrier.
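
In other words (a sketch of the ordering requirement from 362a61ad6119, not
the patch itself): pte_alloc_one() hands back a zeroed page table but
provides no barrier, so the caller has to order the initialization against
the link-in:

	fe->prealloc_pte = pte_alloc_one(vma->vm_mm, fe->address);
	...
	smp_wmb();	/* make the zeroed table visible to all CPUs... */
	pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
			/* ...before any of them can walk through this pmd */

__pte_alloc() does the same smp_wmb() internally, which is why its callers
don't need one.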

> >>> +		atomic_long_inc(&vma->vm_mm->nr_ptes);
> >>> +		pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
> >>> +		spin_unlock(fe->ptl);
> >>> +		fe->prealloc_pte = 0;
> >>> +	} else if (unlikely(__pte_alloc(vma->vm_mm, vma, fe->pmd,
> >>> +					fe->address))) {
> >>> +		return VM_FAULT_OOM;
> >>> +	}
> >>> +map_pte:
> >>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
> >>> +		return VM_FAULT_NOPAGE;
> >>
> >> I think I need a refresher on the locking rules.  pte_offset_map*() is
> >> unsafe to call on a huge pmd.  What in this context makes it impossible
> >> for the pmd to get promoted after the check?
> > 
> > Do you mean what stops the pte page table from being collapsed into a huge
> > pmd? The answer is mmap_sem. The collapse code takes the lock for write to
> > be able to retract the page table.
> 
> What I learned in this set is that pte_offset_map_lock() is dangerous to
> call unless THPs have been excluded somehow from the PMD it's being
> called on.
> 
> What I'm looking for is something to make sure that the context has been
> thought through and is thoroughly THP-free.
> 
> It sounds like you've thought through all the cases, but your thoughts
> aren't clear from the way the code is laid out currently.

Actually, I've discovered a race while looking into this code. Andrea has
fixed it in __handle_mm_fault(), and I will move the comment here.

> >>> + * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
> >>>   *
> >>>   * Target users are page handler itself and implementations of
> >>>   * vm_ops->map_pages.
> >>>   */
> >>> -void do_set_pte(struct fault_env *fe, struct page *page)
> >>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
> >>> +		struct page *page)
> >>>  {
> >>>  	struct vm_area_struct *vma = fe->vma;
> >>>  	bool write = fe->flags & FAULT_FLAG_WRITE;
> >>>  	pte_t entry;
> >>>  
> >>> +	if (!fe->pte) {
> >>> +		int ret = pte_alloc_one_map(fe);
> >>> +		if (ret)
> >>> +			return ret;
> >>> +	}
> >>> +
> >>> +	if (!pte_none(*fe->pte))
> >>> +		return VM_FAULT_NOPAGE;
> >>
> >> Oh, you've got to add another pte_none() check because you're deferring
> >> the acquisition of the ptl lock?
> > 
> > Yes, we need to re-check once ptl is taken.
> 
> Another good comment to add, I think. :)

Ok.
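
E.g. (assumed wording for the comment):

	/*
	 * Re-check under the ptl: a concurrent fault may have
	 * instantiated the pte between our earlier unlocked check and
	 * pte_alloc_one_map() taking the lock.
	 */
	if (!pte_none(*fe->pte))
		return VM_FAULT_NOPAGE;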

> >>> -	/* Check if it makes any sense to call ->map_pages */
> >>> -	fe->address = start_addr;
> >>> -	while (!pte_none(*fe->pte)) {
> >>> -		if (++start_pgoff > end_pgoff)
> >>> -			goto out;
> >>> -		fe->address += PAGE_SIZE;
> >>> -		if (fe->address >= fe->vma->vm_end)
> >>> -			goto out;
> >>> -		fe->pte++;
> >>> +	if (pmd_none(*fe->pmd))
> >>> +		fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
> >>> +	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
> >>> +	if (fe->prealloc_pte) {
> >>> +		pte_free(fe->vma->vm_mm, fe->prealloc_pte);
> >>> +		fe->prealloc_pte = 0;
> >>>  	}
> >>> +	if (!fe->pte)
> >>> +		goto out;
> >>
> >> What does !fe->pte *mean* here?  No pte page?  No pte present?  Huge pte
> >> present?
> > 
> > Huge pmd is mapped.
> > 
> > Comment added.
> 
> Huh, so in _some_ contexts, !fe->pte means that we've got a huge pmd.  I
> don't remember seeing that spelled out in the structure comments.

I'll change it to "if (pmd_trans_huge(*fe->pmd))".
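
That is, roughly (a sketch):

	-	if (!fe->pte)
	+	/* A huge pmd was faulted in: nothing to map at pte level. */
	+	if (pmd_trans_huge(*fe->pmd))
			goto out;

so the huge-pmd case is tested directly instead of being inferred from
fe->pte being NULL.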

> >>> +	if (unlikely(pmd_none(*fe->pmd))) {
> >>> +		/*
> >>> +		 * Leave __pte_alloc() until later: because vm_ops->fault may
> >>> +		 * want to allocate huge page, and if we expose page table
> >>> +		 * for an instant, it will be difficult to retract from
> >>> +		 * concurrent faults and from rmap lookups.
> >>> +		 */
> >>> +	} else {
> >>> +		/*
> >>> +		 * A regular pmd is established and it can't morph into a huge
> >>> +		 * pmd from under us anymore at this point because we hold the
> >>> +		 * mmap_sem read mode and khugepaged takes it in write mode.
> >>> +		 * So now it's safe to run pte_offset_map().
> >>> +		 */
> >>> +		fe->pte = pte_offset_map(fe->pmd, fe->address);
> >>> +
> >>> +		entry = *fe->pte;
> >>> +		barrier();
> >>
> >> Barrier because....?
> 
> Did you miss a response here, Kirill?

The comment below is about this barrier.
Isn't it sufficient?
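
For illustration (not from the patch): on a configuration like
ppc44x-defconfig, with 64-bit ptes on a 32-bit kernel,

	entry = *fe->pte;	/* may compile to two 32-bit loads */
	barrier();		/* later tests must use this snapshot */

the barrier() stops the compiler from re-reading *fe->pte in the ifs below,
which could otherwise mix halves of two different ptes. Real races are still
caught by the re-check under the ptl; the barrier only guarantees one
consistent snapshot for the unlocked tests.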

> >>> +		if (pte_none(entry)) {
> >>> +			pte_unmap(fe->pte);
> >>> +			fe->pte = NULL;
> >>> +		}
> >>> +	}
> >>> +
> >>>  	/*
> >>>  	 * some architectures can have larger ptes than wordsize,
> >>>  	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
> >>>  	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
> >>> -	 * The code below just needs a consistent view for the ifs and
> >>> +	 * The code above just needs a consistent view for the ifs and
> >>>  	 * we later double check anyway with the ptl lock held. So here
> >>>  	 * a barrier will do.
> >>>  	 */
> >>
> >> Looks like the barrier got moved, but not the comment.
> > 
> > Moved.
> > 
> >> Man, that's a lot of code.
> > 
> > Yeah. I don't see a sensible way to split it. :-/
> 
> Can you do the "postpone allocation" parts without adding additional THP
> code?  Or does the postponement just add all of the extra THP-handling
> spots?

I'll check. But I wouldn't expect moving the THP-handling out of the commit
to make it much smaller.

-- 
 Kirill A. Shutemov


Thread overview: 55+ messages
2016-02-11 14:21 [PATCHv2 00/28] huge tmpfs implementation using compound pages Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 01/28] thp, dax: do not try to withdraw pgtable from non-anon VMA Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 02/28] rmap: introduce rmap_walk_locked() Kirill A. Shutemov
2016-02-11 18:52   ` Andi Kleen
2016-02-16  9:36     ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 03/28] rmap: extend try_to_unmap() to be usable by split_huge_page() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 04/28] mm: make remove_migration_ptes() beyond mm/migration.c Kirill A. Shutemov
2016-02-12 16:54   ` Dave Hansen
2016-02-16  9:54     ` Kirill A. Shutemov
2016-02-16 15:29       ` Dave Hansen
2016-02-11 14:21 ` [PATCHv2 05/28] thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 06/28] mm: do not pass mm_struct into handle_mm_fault Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 07/28] mm: introduce fault_env Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte() Kirill A. Shutemov
2016-02-12 17:44   ` Dave Hansen
2016-02-16 14:26     ` Kirill A. Shutemov
2016-02-16 17:17       ` Dave Hansen
2016-02-23 13:05         ` Kirill A. Shutemov
2016-02-16 17:38       ` Dave Hansen
2016-02-23 22:58         ` Kirill A. Shutemov [this message]
2016-02-11 14:21 ` [PATCHv2 09/28] rmap: support file thp Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 10/28] mm: introduce do_set_pmd() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 11/28] mm, rmap: account file thp pages Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 12/28] thp, vmstats: add counters for huge file pages Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 13/28] thp: support file pages in zap_huge_pmd() Kirill A. Shutemov
2016-02-12 18:33   ` Dave Hansen
2016-02-16 10:00     ` Kirill A. Shutemov
2016-02-16 15:31       ` Dave Hansen
2016-02-18 12:19         ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 14/28] thp: handle file pages in split_huge_pmd() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 15/28] thp: handle file COW faults Kirill A. Shutemov
2016-02-12 18:36   ` Dave Hansen
2016-02-16 10:08     ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 16/28] thp: handle file pages in mremap() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 17/28] thp: skip file huge pmd on copy_huge_pmd() Kirill A. Shutemov
2016-02-12 18:42   ` Dave Hansen
2016-02-16 10:14     ` Kirill A. Shutemov
2016-02-16 15:46       ` Dave Hansen
2016-02-18 12:41         ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 18/28] thp: prepare change_huge_pmd() for file thp Kirill A. Shutemov
2016-02-12 18:48   ` Dave Hansen
2016-02-16 10:15     ` Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 19/28] thp: run vma_adjust_trans_huge() outside i_mmap_rwsem Kirill A. Shutemov
2016-02-12 18:50   ` Dave Hansen
2016-02-16 10:16     ` Kirill A. Shutemov
2016-02-16 15:49   ` Dave Hansen
2016-02-11 14:21 ` [PATCHv2 20/28] thp: file pages support for split_huge_page() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 21/28] vmscan: split file huge pages before paging them out Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 22/28] page-flags: relax policy for PG_mappedtodisk and PG_reclaim Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 23/28] radix-tree: implement radix_tree_maybe_preload_order() Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 24/28] filemap: prepare find and delete operations for huge pages Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 25/28] truncate: handle file thp Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 26/28] shmem: prepare huge=N mount option and /proc/sys/vm/shmem_huge Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 27/28] shmem: get_unmapped_area align huge page Kirill A. Shutemov
2016-02-11 14:21 ` [PATCHv2 28/28] shmem: add huge pages support Kirill A. Shutemov
