linux-ext4.vger.kernel.org archive mirror
From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-mm@kvack.org, Ross Zwisler <ross.zwisler@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	linux-nvdimm@lists.01.org, Matthew Wilcox <willy@linux.intel.com>
Subject: Re: [PATCH 16/18] dax: New fault locking
Date: Thu, 5 May 2016 22:13:50 -0600
Message-ID: <20160506041350.GA29628@linux.intel.com>
In-Reply-To: <1461015341-20153-17-git-send-email-jack@suse.cz>

On Mon, Apr 18, 2016 at 11:35:39PM +0200, Jan Kara wrote:
> Currently DAX page fault locking is racy.
> 
> CPU0 (write fault)		CPU1 (read fault)
> 
> __dax_fault()			__dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
> 				  get_block(inode, block, &bh, 0)
> 				    -> not mapped
>   if (!buffer_mapped(&bh))
>     if (vmf->flags & FAULT_FLAG_WRITE)
>       get_block(inode, block, &bh, 1) -> allocates blocks
>   if (page) -> no
> 				  if (!buffer_mapped(&bh))
> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> 				    } else {
> 				      dax_load_hole();
> 				    }
>   dax_insert_mapping()
> 
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> 
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear dirty tag in the radix tree. We can always
> end up with clean radix tree and dirty data in CPU cache.
> 
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings acting very similarly to page lock and thus
> synchronizing properly faults against the same mapping index. The same
> lock can later be used to avoid races when clearing radix tree dirty
> tag.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> @@ -300,6 +324,259 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
>  EXPORT_SYMBOL_GPL(dax_do_io);
>  
>  /*
> + * DAX radix tree locking
> + */
> +struct exceptional_entry_key {
> +	struct radix_tree_root *root;
> +	unsigned long index;
> +};

I believe that we basically just need the struct exceptional_entry_key to
uniquely identify an entry, correct?  I agree that we get this with the pair
[struct radix_tree_root, index], but we also get it with
[struct address_space, index], and we might want to use the latter here since
that's the pair that is used to look up the wait queue in
dax_entry_waitqueue().  Functionally I don't think it matters (correct me if
I'm wrong), but it makes for a nicer symmetry.
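
To make the symmetry concrete, here is a minimal userspace sketch, not the
patch's code.  It assumes the key is switched to [struct address_space, index].
entry_wait_bucket(), entry_key_matches(), WAIT_TABLE_SIZE and the hash itself
are hypothetical names and choices for illustration, and struct address_space
here is only a stand-in for the kernel type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel type; only its address matters here. */
struct address_space { int unused; };

/* Key built from the same pair that selects the wait queue. */
struct exceptional_entry_key {
	struct address_space *mapping;
	unsigned long index;
};

#define WAIT_TABLE_SIZE 64	/* hypothetical table size, power of two */

/* Hypothetical hash: any function of (mapping, index) would do. */
static unsigned int entry_wait_bucket(const struct address_space *mapping,
				      unsigned long index)
{
	unsigned long h = (unsigned long)(uintptr_t)mapping ^
			  (index * 2654435761UL);

	return (unsigned int)(h & (WAIT_TABLE_SIZE - 1));
}

/* A waiter is woken only if both halves of the key match. */
static bool entry_key_matches(const struct exceptional_entry_key *waiter,
			      const struct exceptional_entry_key *wake)
{
	return waiter->mapping == wake->mapping &&
	       waiter->index == wake->index;
}
```

With this shape the wait queue lookup and the wake-up filter are both
expressed in terms of the same two values, which is all I'm suggesting.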

> +/*
> + * Find radix tree entry at given index. If it points to a page, return with
> + * the page locked. If it points to the exceptional entry, return with the
> + * radix tree entry locked. If the radix tree doesn't contain given index,
> + * create empty exceptional entry for the index and return with it locked.
> + *
> + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> + * persistent memory the benefit is doubtful. We can add that later if we can
> + * show it helps.
> + */
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +
> +restart:
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = get_unlocked_mapping_entry(mapping, index, &slot);
> +	/* No entry for given index? Make sure radix tree is big enough. */
> +	if (!ret) {
> +		int err;
> +
> +		spin_unlock_irq(&mapping->tree_lock);
> +		err = radix_tree_preload(
> +				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);

In the conversation about v2 of this series you said:

> Note that we take the hit for dropping the lock only if we really need to
> allocate new radix tree node so about once per 64 new entries. So it is not
> too bad.

I think this is incorrect.  We get here whenever we get a NULL return from
__radix_tree_lookup().  I believe that happens when we don't have a node, in
which case we do need an allocation, but I think it also happens when we do
have a node and the slot for this index in that node is simply NULL.

For the behavior you're looking for (only preload if you need to do an
allocation), you probably need to check the 'slot' we get back from
get_unlocked_mapping_entry(), yes?
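
Roughly what I have in mind, as a toy userspace model rather than real radix
tree code.  toy_tree, toy_node, toy_lookup() and toy_needs_preload() are all
hypothetical, and the model assumes the lookup leaves the slot pointer NULL
when no node covers the index; whether the real helper behaves that way on a
NULL return would need checking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 64	/* mirrors the "once per 64 new entries" figure */

/* Toy single-level tree: the leaf node may or may not exist yet. */
struct toy_node { void *slots[TOY_SLOTS]; };
struct toy_tree { struct toy_node *leaf; };

/*
 * Toy lookup: returns the entry and reports the slot address through
 * @slotp.  *slotp stays NULL only when no node covers @index.
 */
static void *toy_lookup(struct toy_tree *tree, unsigned long index,
			void ***slotp)
{
	*slotp = NULL;
	if (!tree->leaf || index >= TOY_SLOTS)
		return NULL;
	*slotp = &tree->leaf->slots[index];
	return **slotp;
}

/* Dropping the lock to preload is only needed when a node is missing. */
static bool toy_needs_preload(struct toy_tree *tree, unsigned long index)
{
	void **slot;
	void *entry = toy_lookup(tree, index, &slot);

	return !entry && !slot;
}
```

The point is only that "entry is NULL" and "a node must be allocated" are
different conditions, and the slot pointer is what tells them apart.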

> +/*
> + * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
> + * entry to get unlocked before deleting it.
> + */
> +int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *entry;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	/*
> +	 * Caller should make sure radix tree modifications don't race and
> +	 * we have seen exceptional entry here before.
> +	 */
> +	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {

dax_delete_mapping_entry() is only called from clear_exceptional_entry().
With this new code we've changed the behavior of that call path a little.

In the various places where clear_exceptional_entry() is called, the code
batches up a bunch of entries in a pvec via pagevec_lookup_entries().  We
don't hold the mapping->tree_lock between the time this lookup happens and the
time that the entry is passed to clear_exceptional_entry(). This is why the
old code did a verification that the entry passed in matched what was still
currently present in the radix tree.  This was done in the DAX case via
radix_tree_delete_item(), and it was open coded in clear_exceptional_entry()
for the page cache case.  In both cases if the entry didn't match what was
currently in the tree, we bailed without doing anything.

This new code doesn't verify against the 'entry' passed to
clear_exceptional_entry(), but instead makes sure it is an exceptional entry
before removing, and if not it does a WARN_ON_ONCE().

This changes things because:

a) If the exceptional entry changed, say from a plain lock entry to an actual
DAX entry, we wouldn't notice, and we would just clear the latter out.  My
guess is that this is fine; I just wanted to call it out.

b) If we have a non-exceptional entry here now, say because our lock entry has
been swapped out for a zero page, we will WARN_ON_ONCE() and return without a
removal.  I think we may want to silence the WARN_ON_ONCE(), as I believe this
could happen during normal operation and we don't want to scare anyone. :)
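
For reference, the "verify, then delete" behavior of the old path looks
something like the sketch below.  This is a toy userspace model, not kernel
code: flat_tree and flat_delete_item() are hypothetical names, and the flat
array stands in for the radix tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FLAT_SLOTS 8

/* Toy flat "tree": one slot per index, NULL means empty. */
struct flat_tree { void *slots[FLAT_SLOTS]; };

/*
 * Remove the entry at @index only if it is still @expected.
 * Returns true when something was removed; a mismatch is silently
 * ignored rather than treated as an error.
 */
static bool flat_delete_item(struct flat_tree *tree, unsigned long index,
			     void *expected)
{
	if (!expected || index >= FLAT_SLOTS ||
	    tree->slots[index] != expected)
		return false;
	tree->slots[index] = NULL;
	return true;
}
```

A caller that batched up entries without holding the lock passes in what it
saw at lookup time, and a changed slot simply results in no removal.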

> +/*
>   * The user has performed a load from a hole in the file.  Allocating
>   * a new page in the file would cause excessive storage usage for
>   * workloads with sparse files.  We allocate a page cache page instead.
> @@ -307,15 +584,24 @@ EXPORT_SYMBOL_GPL(dax_do_io);
>   * otherwise it will simply fall out of the page cache under memory
>   * pressure without ever having been dirtied.
>   */
> -static int dax_load_hole(struct address_space *mapping, struct page *page,
> -							struct vm_fault *vmf)
> +static int dax_load_hole(struct address_space *mapping, void *entry,
> +			 struct vm_fault *vmf)
>  {
> -	if (!page)
> -		page = find_or_create_page(mapping, vmf->pgoff,
> -						GFP_KERNEL | __GFP_ZERO);
> -	if (!page)
> -		return VM_FAULT_OOM;
> +	struct page *page;
> +
> +	/* Hole page already exists? Return it...  */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		vmf->page = entry;
> +		return VM_FAULT_LOCKED;
> +	}
>  
> +	/* This will replace locked radix tree entry with a hole page */
> +	page = find_or_create_page(mapping, vmf->pgoff,
> +				   vmf->gfp_mask | __GFP_ZERO);

This replacement happens via page_cache_tree_insert(), correct?  In this case,
who wakes up anyone waiting on the old lock entry that we just killed?  In the
non-hole case we would traverse through put_locked_mapping_entry(), but I
don't see that in the hole case.

> @@ -963,23 +1228,18 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
>  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	struct file *file = vma->vm_file;
> -	int error;
> -
> -	/*
> -	 * We pass NO_SECTOR to dax_radix_entry() because we expect that a
> -	 * RADIX_DAX_PTE entry already exists in the radix tree from a
> -	 * previous call to __dax_fault().  We just want to look up that PTE
> -	 * entry using vmf->pgoff and make sure the dirty tag is set.  This
> -	 * saves us from having to make a call to get_block() here to look
> -	 * up the sector.
> -	 */
> -	error = dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false,
> -			true);
> +	struct address_space *mapping = file->f_mapping;
> +	void *entry;
> +	pgoff_t index = vmf->pgoff;
>  
> -	if (error == -ENOMEM)
> -		return VM_FAULT_OOM;
> -	if (error)
> -		return VM_FAULT_SIGBUS;
> +	spin_lock_irq(&mapping->tree_lock);
> +	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +	if (!entry || !radix_tree_exceptional_entry(entry))
> +		goto out;
> +	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
> +	put_unlocked_mapping_entry(mapping, index, entry);

I really like how simple this function has become. :)
