linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-kernel@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Andreas Dilger <adilger.kernel@dilger.ca>,
	Andrew Morton <akpm@linux-foundation.org>,
	Christoph Hellwig <hch@lst.de>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.com>,
	Matthew Wilcox <mawilcox@microsoft.com>,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-nvdimm@lists.01.org,
	linux-xfs@vger.kernel.org
Subject: Re: [PATCH v5 15/17] dax: add struct iomap based DAX PMD support
Date: Wed, 12 Oct 2016 09:45:05 +0200	[thread overview]
Message-ID: <20161012074505.GC13896@quack2.suse.cz> (raw)
In-Reply-To: <20161011225130.GC32165@linux.intel.com>

On Tue 11-10-16 16:51:30, Ross Zwisler wrote:
> On Tue, Oct 11, 2016 at 10:31:52AM +0200, Jan Kara wrote:
> > On Fri 07-10-16 15:09:02, Ross Zwisler wrote:
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index ac3cd05..e51d51f 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -281,7 +281,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
> > >  	 * queue to the start of that PMD.  This ensures that all offsets in
> > >  	 * the range covered by the PMD map to the same bit lock.
> > >  	 */
> > > -	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > > +	if ((unsigned long)entry & RADIX_DAX_PMD)
> > >  		index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> > 
> > I agree with Christoph - helper for masking type bits would make this
> > nicer.
> 
> Fixed via a dax_flag_test() helper as I outlined in the mail to Christoph.  It
> seems clean to me, but if you or Christoph feel strongly that it would be
> cleaner as a local 'flags' variable, I'll make the change.

One idea I had is that you could have helpers like:

dax_is_pmd_entry()
dax_is_pte_entry()
dax_is_empty_entry()
dax_is_hole_entry()

And then you would use these helpers - all the flags would be hidden in the
helpers so even if we decide to change the flagging scheme to compress
things or so, it should be pretty local change.

> > > -		entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > > -			       RADIX_DAX_ENTRY_LOCK);
> > > +
> > > +		/*
> > > +		 * Besides huge zero pages the only other thing that gets
> > > +		 * downgraded are empty entries which don't need to be
> > > +		 * unmapped.
> > > +		 */
> > > +		if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > > +			unmap_mapping_range(mapping,
> > > +				(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > > +
> > >  		spin_lock_irq(&mapping->tree_lock);
> > > -		err = radix_tree_insert(&mapping->page_tree, index, entry);
> > > +
> > > +		if (pmd_downgrade) {
> > > +			radix_tree_delete(&mapping->page_tree, index);
> > > +			mapping->nrexceptional--;
> > > +			dax_wake_mapping_entry_waiter(mapping, index, entry,
> > > +					false);
> > 
> > You need to set 'wake_all' argument here to true. Otherwise there could be
> > waiters waiting for non-existent entry forever...
> 
> Interesting.   Fixed, but let me make sure I understand.  So is the issue that
> you could have say 2 tasks waiting on a PMD index that has been rounded down
> to the PMD index via dax_entry_waitqueue()?
> 
> The person holding the lock on the entry would remove the PMD, insert a PTE
> and wake just one of the PMD aligned waiters.  That waiter would wake up, do
> something PTE based (since the PMD space is now polluted with PTEs), and then
> wake any waiters on it's PTE index.  Meanwhile, the second waiter could sleep
> forever on the PMD aligned index.  Is this correct?

Yes.

> So, perhaps more succinctly:
> 
> Thread 1		Thread 2		Thread 3
> --------		--------		--------
> index 0x202, hold PMD lock 0x200
> 			index 0x203, sleep on 0x200
> 						index 0x204, sleep on 0x200
> downgrade, removing 0x200
> wake one waiter on 0x200
> insert PTE @ 0x202
> 			wake up, grab index 0x203
> 			...
> 			wake one waiter on index 0x203
> 
> 						... sleeps forever
> Right?
 
Exactly.

> > > @@ -608,22 +683,28 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> > >  		error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> > >  		if (error)
> > >  			return ERR_PTR(error);
> > > +	} else if (((unsigned long)entry & RADIX_DAX_HZP) &&
> > > +			!(flags & RADIX_DAX_HZP)) {
> > > +		/* replacing huge zero page with PMD block mapping */
> > > +		unmap_mapping_range(mapping,
> > > +			(vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > >  	}
> > >  
> > >  	spin_lock_irq(&mapping->tree_lock);
> > > -	new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > > -		       RADIX_DAX_ENTRY_LOCK);
> > > +	new_entry = dax_radix_entry(sector, flags);
> > > +
> > 
> > You've lost the RADIX_DAX_ENTRY_LOCK flag here?
> 
> Oh, nope, that's embedded in the dax_radix_entry() helper:
> 
> /* entries begin locked */
> static inline void *dax_radix_entry(sector_t sector, unsigned long flags)
> {
> 	return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
> 			((unsigned long)sector << RADIX_DAX_SHIFT) |
> 			RADIX_DAX_ENTRY_LOCK);
> }
> 
> I'll s/dax_radix_entry/dax_radix_locked_entry/ or something to make this
> clearer to the reader.

Yep, that would be better. Thanks!

> > >  	if (hole_fill) {
> > >  		__delete_from_page_cache(entry, NULL);
> > >  		/* Drop pagecache reference */
> > >  		put_page(entry);
> > > -		error = radix_tree_insert(page_tree, index, new_entry);
> > > +		error = __radix_tree_insert(page_tree, index,
> > > +				dax_radix_order(new_entry), new_entry);
> > >  		if (error) {
> > >  			new_entry = ERR_PTR(error);
> > >  			goto unlock;
> > >  		}
> > >  		mapping->nrexceptional++;
> > > -	} else {
> > > +	} else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> > >  		void **slot;
> > >  		void *ret;
> > 
> > Uh, why this condition need to change? Is it some protection so that we
> > don't replace a mapped PMD entry with PTE one?
> 
> Yea, the logic was that if we have a radix tree that has PMDs in it, and some
> new process comes along doing PTE faults, we can just leave the DAX mapped PMD
> entries in the tree.  Locking, dirtying and flushing will happen on PMD basis
> for all processes.

Yup, I got this and I agree with that.

> If you think this is dangerous or not worth the effort we can make it like the
> hole case where any PTE sized faults will result in an unmap and a downgrade
> to PTE sized entries in the radix tree.

Ok, so probably I'm somewhat surprised that the logic that mapped PMD entry
is not replaced by a PTE entry is handled down in
dax_insert_mapping_entry(). But now looking at this it is just how we work
in other cases as well so please just add a comment there.

Longer term we may just shortcut the fault if grab_mapping_entry() will
return us entry of type we can use. That can save us quite some work if
several processes are mapping the same file. But again that's an
optimization to do once the PMD, iomap, and page protection works settle.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2016-10-12  7:45 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-10-07 21:08 [PATCH v5 00/17] re-enable DAX PMD support Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 04/17] ext2: remove support for DAX PMD faults Ross Zwisler
2016-10-07 21:08 ` [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range Ross Zwisler
2016-10-10 15:46   ` Christoph Hellwig
2016-10-11  7:04   ` Jan Kara
2016-10-11 21:18     ` Ross Zwisler
     [not found] ` <1475874544-24842-1-git-send-email-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2016-10-07 21:08   ` [PATCH v5 01/17] ext4: allow DAX writeback for hole punch Ross Zwisler
2016-10-07 21:08   ` [PATCH v5 02/17] ext4: tell DAX the size of allocation holes Ross Zwisler
2016-10-07 21:08   ` [PATCH v5 03/17] dax: remove buffer_size_valid() Ross Zwisler
2016-10-07 21:08   ` [PATCH v5 05/17] ext2: return -EIO on ext2_iomap_end() failure Ross Zwisler
2016-10-11  6:48     ` Jan Kara
2016-10-07 21:08   ` [PATCH v5 06/17] dax: make 'wait_table' global variable static Ross Zwisler
2016-10-07 21:08   ` [PATCH v5 07/17] dax: remove the last BUG_ON() from fs/dax.c Ross Zwisler
     [not found]     ` <1475874544-24842-8-git-send-email-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2016-10-10 15:45       ` Christoph Hellwig
2016-10-11  6:50       ` Jan Kara
2016-10-07 21:08   ` [PATCH v5 08/17] dax: consistent variable naming for DAX entries Ross Zwisler
2016-10-07 21:08   ` [PATCH v5 10/17] dax: remove dax_pmd_fault() Ross Zwisler
2016-10-07 21:08   ` [PATCH v5 11/17] dax: correct dax iomap code namespace Ross Zwisler
     [not found]     ` <1475874544-24842-12-git-send-email-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2016-10-09 15:28       ` Christoph Hellwig
     [not found]         ` <20161009152803.GA20111-jcswGhMUV9g@public.gmane.org>
2016-10-10 19:04           ` [PATCH] " Ross Zwisler
     [not found]             ` <1476126244-27673-1-git-send-email-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2016-10-10 20:19               ` Dave Chinner
2016-10-07 21:08   ` [PATCH v5 12/17] dax: add dax_iomap_sector() helper function Ross Zwisler
     [not found]     ` <1475874544-24842-13-git-send-email-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2016-10-10 15:46       ` Christoph Hellwig
2016-10-11  7:06     ` Jan Kara
2016-10-07 21:09   ` [PATCH v5 15/17] dax: add struct iomap based DAX PMD support Ross Zwisler
     [not found]     ` <1475874544-24842-16-git-send-email-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2016-10-10 15:59       ` Christoph Hellwig
     [not found]         ` <20161010155917.GA19978-jcswGhMUV9g@public.gmane.org>
2016-10-10 22:06           ` Ross Zwisler
2016-10-11 21:48           ` Ross Zwisler
2016-10-11  8:31       ` Jan Kara
     [not found]         ` <20161011083152.GG6952-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>
2016-10-11 22:51           ` Ross Zwisler
2016-10-12  7:45             ` Jan Kara [this message]
2016-10-07 21:09   ` [PATCH v5 16/17] xfs: use struct iomap based DAX PMD fault path Ross Zwisler
2016-10-11  8:34     ` Jan Kara
2016-10-07 21:09   ` [PATCH v5 17/17] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
2016-10-07 21:09 ` [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end() Ross Zwisler
2016-10-10 15:50   ` Christoph Hellwig
     [not found]     ` <20161010155004.GD19343-jcswGhMUV9g@public.gmane.org>
2016-10-10 22:05       ` Ross Zwisler
     [not found]   ` <1475874544-24842-14-git-send-email-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2016-10-11  7:21     ` Jan Kara
2016-10-07 21:09 ` [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h Ross Zwisler
     [not found]   ` <1475874544-24842-15-git-send-email-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2016-10-10 15:50     ` Christoph Hellwig
2016-10-11  7:23     ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20161012074505.GC13896@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=adilger.kernel@dilger.ca \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=jack@suse.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mawilcox@microsoft.com \
    --cc=ross.zwisler@linux.intel.com \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).