From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Dave Hansen <dave.hansen@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Hugh Dickins <hughd@google.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Christoph Lameter <cl@gentwo.org>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Jerome Marchand <jmarchan@redhat.com>,
	Yang Shi <yang.shi@linaro.org>,
	Sasha Levin <sasha.levin@oracle.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
Date: Tue, 23 Feb 2016 16:05:38 +0300
Message-ID: <20160223130538.GA21144@node.shutemov.name>
In-Reply-To: <56C3599D.3060106@intel.com>

On Tue, Feb 16, 2016 at 09:17:17AM -0800, Dave Hansen wrote:
> On 02/16/2016 06:26 AM, Kirill A. Shutemov wrote:
> > On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
> >> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>> index ca99c0ecf52e..172f4d8e798d 100644
> >>> --- a/include/linux/mm.h
> >>> +++ b/include/linux/mm.h
> >>> @@ -265,6 +265,7 @@ struct fault_env {
> >>>  	pmd_t *pmd;
> >>>  	pte_t *pte;
> >>>  	spinlock_t *ptl;
> >>> +	pgtable_t prealloc_pte;
> >>>  };
> >>
> >> If we're going to do this fault_env thing, we need some heavy-duty
> >> comments on what the different fields do and what they mean.  We don't
> >> want to get in to a situation where we're doing
> >>
> >> 	void fault_foo(struct fault_env *fe);..
> >>
> >> and then we change the internals of fault_foo() to manipulate a
> >> different set of fe->* variables, or change assumptions, then have
> >> callers randomly break.
> >>
> >> One _nice_ part of passing all the arguments explicitly is that it makes
> >> you go visit all the call sites and think about how the conventions change.
> >>
> >> It just makes me nervous.
> >>
> >> The semantics of having both a ->pte and ->pmd need to be very clearly
> >> spelled out too, please.
> > 
> > I've updated this to:
> > 
> > /*
> >  * Page fault context: passes through page fault handler instead of endless list
> >  * of function arguments.
> >  */
> > struct fault_env {
> > 	struct vm_area_struct *vma;	/* Target VMA */
> > 	unsigned long address;		/* Faulting virtual address */
> > 	unsigned int flags;		/* FAULT_FLAG_xxx flags */
> > 	pmd_t *pmd;			/* Pointer to pmd entry matching
> > 					 * the 'address'
> > 					 */
> 
> Is this just for huge PMDs, or does it also cover normal PMDs pointing
> to PTE pages?

Any.

> Is it populated every time we're at or below the PMD during a fault?

Yes.

> Is it always valid?

It points to the relevant entry. In general, nothing can be said about
the content of that entry.

> > 	pte_t *pte;			/* Pointer to pte entry matching
> > 					 * the 'address'. NULL if the page
> > 					 * table hasn't been allocated.
> > 					 */
> 
> What's the relationship between pmd and pte?  Can both be set at the
> same time, etc...?

If pte is set, pmd is set too. In that case pmd points to the entry for
the page table that pte is part of.

It's pretty straightforward.
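
To make the pmd/pte relationship concrete, here is a minimal user-space
sketch, not kernel code. The types pmd_t and pte_t are reduced stand-ins,
and struct fe_pair_sketch and fe_pair_sketch_consistent() are hypothetical
names introduced only for this illustration. It encodes the one rule stated
above: a pte pointer never appears without a pmd pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel types; hypothetical, for illustration only. */
typedef struct { unsigned long val; } pmd_t;
typedef struct { unsigned long val; } pte_t;

/* Reduced mock of struct fault_env holding only the two fields discussed. */
struct fe_pair_sketch {
	pmd_t *pmd;	/* pmd entry matching the address */
	pte_t *pte;	/* pte entry, or NULL if no page table is mapped */
};

/* Hypothetical helper: true if the pmd/pte pair is in an allowed state. */
static bool fe_pair_sketch_consistent(const struct fe_pair_sketch *fe)
{
	/* If pte is set, pmd must be set too. */
	if (fe->pte && !fe->pmd)
		return false;
	/* pmd alone (no page table yet, or a huge pmd) is allowed. */
	return true;
}
```

The helper says nothing about the content of the pmd entry, only about
which pointers may be set together.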

> 
> > 	spinlock_t *ptl;		/* Page table lock.
> > 					 * Protects pte page table if 'pte'
> > 					 * is not NULL, otherwise pmd.
> > 					 */
> 
> Are there any rules for callers when a callee puts a value in here?

Nothing in particular. In most cases we acquire and release ptl in the
same function, with a few exceptions: the write-protect fault path and
do_set_pte(). That's documented around these functions.

> > 	pgtable_t prealloc_pte;		/* Pre-allocated pte page table.
> > 					 * vm_ops->map_pages() calls
> > 					 * do_set_pte() from atomic context.
> > 					 * do_fault_around() pre-allocates
> > 					 * page table to avoid allocation from
> > 					 * atomic context.
> > 					 */
> > };
> 
> Who's responsible for freeing this and when?

do_fault_around() frees the page table if it wasn't used.
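
The ownership rule can be sketched in user space as follows. This is a
simplified model under stated assumptions, not the kernel implementation:
struct pgtable_sketch, struct fe_prealloc_sketch, map_pages_sketch() and
fault_around_sketch() are hypothetical names, calloc()/free() stand in for
the page table allocator, and the sketch always pre-allocates for
simplicity.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pte page table; not the kernel's pgtable_t. */
struct pgtable_sketch { unsigned long entries[512]; };

struct fe_prealloc_sketch {
	struct pgtable_sketch *pte_table;	/* installed table, or NULL */
	struct pgtable_sketch *prealloc_pte;	/* spare table, or NULL */
};

/* Models the atomic-context step: never allocates, may consume the spare. */
static void map_pages_sketch(struct fe_prealloc_sketch *fe)
{
	if (!fe->pte_table && fe->prealloc_pte) {
		fe->pte_table = fe->prealloc_pte;
		fe->prealloc_pte = NULL;	/* ownership moved to pte_table */
	}
}

/* Models the caller: allocate up front, free the spare if it went unused. */
static int fault_around_sketch(struct fe_prealloc_sketch *fe)
{
	fe->prealloc_pte = calloc(1, sizeof(*fe->prealloc_pte));
	if (!fe->prealloc_pte)
		return -1;

	map_pages_sketch(fe);

	/* Still set here means nobody consumed it; free(NULL) is a no-op. */
	free(fe->prealloc_pte);
	fe->prealloc_pte = NULL;
	return 0;
}
```

On the first call the spare table is consumed; on a later call with a table
already installed it is allocated and freed again by the same function, so
the callee never has to care who owns it.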

> >>>  /*
> >>> @@ -559,7 +560,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> >>>  	return pte;
> >>>  }
> >>>  
> >>> -void do_set_pte(struct fault_env *fe, struct page *page);
> >>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
> >>> +		struct page *page);
> >>>  #endif
> >>
> >> I think do_set_pte() might be due for a new name if it's going to be
> >> doing allocations internally.
> > 
> > Any suggestions?
> 
> alloc_set_pte() is probably fine.  Just make it clear early in some
> comments that the allocation is conditional.

Ok.

> >>> diff --git a/mm/filemap.c b/mm/filemap.c
> >>> index 28b3875969a8..ba8150d6dc33 100644
> >>> --- a/mm/filemap.c
> >>> +++ b/mm/filemap.c
> >>> @@ -2146,11 +2146,6 @@ void filemap_map_pages(struct fault_env *fe,
> >>>  			start_pgoff) {
> >>>  		if (iter.index > end_pgoff)
> >>>  			break;
> >>> -		fe->pte += iter.index - last_pgoff;
> >>> -		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
> >>> -		last_pgoff = iter.index;
> >>> -		if (!pte_none(*fe->pte))
> >>> -			goto next;
> >>>  repeat:
> >>>  		page = radix_tree_deref_slot(slot);
> >>>  		if (unlikely(!page))
> >>> @@ -2187,7 +2182,17 @@ repeat:
> >>>  
> >>>  		if (file->f_ra.mmap_miss > 0)
> >>>  			file->f_ra.mmap_miss--;
> >>> -		do_set_pte(fe, page);
> >>> +
> >>> +		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
> >>> +		if (fe->pte)
> >>> +			fe->pte += iter.index - last_pgoff;
> >>> +		last_pgoff = iter.index;
> >>> +		if (do_set_pte(fe, NULL, page)) {
> >>> +			/* failed to setup page table: giving up */
> >>> +			if (!fe->pte)
> >>> +				break;
> >>> +			goto unlock;
> >>> +		}
> >>
> >> What's the failure here, though?
> > 
> > At this point in patchset it never fails: allocation failure is not
> > possible as we pre-allocate page table for faultaround.
> > 
> > Later after do_set_pmd() is introduced, huge page can be mapped here. By
> > us or under us.
> > 
> > I'll update comment.
> 
> So why check the return value of do_set_pte()?  Why can it return nonzero?

Actually, this part is buggy (loops without result). I used to return
VM_FAULT_NOPAGE when a huge page is set up, but not anymore.

I'll replace it with this:

diff --git a/mm/filemap.c b/mm/filemap.c
index de3bb308f5a9..5f655220df69 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2269,12 +2269,12 @@ repeat:
 		if (fe->pte)
 			fe->pte += iter.index - last_pgoff;
 		last_pgoff = iter.index;
-		if (alloc_set_pte(fe, NULL, page)) {
-			/* Huge page is mapped? */
-			if (!fe->pte)
-				break;
-			goto unlock;
-		}
+		alloc_set_pte(fe, NULL, page);
+		/* Huge page is mapped? No need to proceed. */
+		if (pmd_trans_huge(*fe->pmd))
+			break;
+		/* Failed to setup page table? */
+		VM_BUG_ON(!fe->pte);
 		unlock_page(page);
 		goto next;
 unlock:
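
The control flow of the replacement hunk can be modelled in user space
roughly like this. It is only a sketch under assumptions: the enum, struct
fe_loop_sketch, alloc_set_pte_sketch() and map_pages_loop_sketch() are
hypothetical stand-ins, a plain enum replaces pmd_trans_huge(*fe->pmd), and
assert() replaces VM_BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum pmd_kind_sketch { PMD_SKETCH_NONE, PMD_SKETCH_TABLE, PMD_SKETCH_HUGE };

struct fe_loop_sketch {
	enum pmd_kind_sketch pmd;	/* models the state of *fe->pmd */
	bool have_pte;			/* models fe->pte != NULL */
	size_t mapped;			/* small pages mapped so far */
};

/* Mock of alloc_set_pte(): sets up the page table if needed, maps a page. */
static void alloc_set_pte_sketch(struct fe_loop_sketch *fe)
{
	if (fe->pmd == PMD_SKETCH_HUGE)
		return;			/* huge page mapped: nothing to do */
	fe->pmd = PMD_SKETCH_TABLE;	/* install table on first use */
	fe->have_pte = true;
	fe->mapped++;
}

/* Mirrors the loop shape of the hunk; returns the number of pages visited. */
static size_t map_pages_loop_sketch(struct fe_loop_sketch *fe, size_t nr_pages)
{
	size_t visited = 0;

	for (size_t i = 0; i < nr_pages; i++) {
		visited++;
		alloc_set_pte_sketch(fe);
		/* Huge page is mapped? No need to proceed. */
		if (fe->pmd == PMD_SKETCH_HUGE)
			break;
		/* Failed to set up page table? Must not happen. */
		assert(fe->have_pte);
	}
	return visited;
}
```

With no return value to check, the loop decides purely from the pmd state,
so it either maps every page or stops at the first sign of a huge pmd.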

-- 
 Kirill A. Shutemov


