From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
linux-kernel@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Andreas Dilger <adilger.kernel@dilger.ca>,
Andrew Morton <akpm@linux-foundation.org>,
Christoph Hellwig <hch@lst.de>,
Dan Williams <dan.j.williams@intel.com>,
Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.com>,
Matthew Wilcox <mawilcox@microsoft.com>,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, linux-nvdimm@lists.01.org,
linux-xfs@vger.kernel.org
Subject: Re: [PATCH v4 10/12] dax: add struct iomap based DAX PMD support
Date: Mon, 3 Oct 2016 15:05:57 -0600
Message-ID: <20161003210557.GA28177@linux.intel.com>
In-Reply-To: <20161003105949.GP6457@quack2.suse.cz>
On Mon, Oct 03, 2016 at 12:59:49PM +0200, Jan Kara wrote:
> On Thu 29-09-16 16:49:28, Ross Zwisler wrote:
> > @@ -420,15 +439,39 @@ restart:
> > mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> > if (err)
> > return ERR_PTR(err);
> > - entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
> > - RADIX_DAX_ENTRY_LOCK);
> > +
> > + /*
> > + * Besides huge zero pages the only other thing that gets
> > + * downgraded are empty entries which don't need to be
> > + * unmapped.
> > + */
> > + if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
> > + unmap_mapping_range(mapping,
> > + (index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > +
> > spin_lock_irq(&mapping->tree_lock);
> > - err = radix_tree_insert(&mapping->page_tree, index, entry);
> > +
> > + if (pmd_downgrade) {
> > + radix_tree_delete(&mapping->page_tree, index);
> > + mapping->nrexceptional--;
> > + dax_wake_mapping_entry_waiter(entry, mapping, index,
> > + false);
> > + }
>
> Hum, this looks really problematic. Once you have dropped tree_lock,
> anything could have happened with the radix tree - in particular the entry
> you've got from get_unlocked_mapping_entry() can be different by now. Also
> there's no guarantee that someone does not map the huge entry again just
> after your call to unmap_mapping_range() finished.
>
> So it seems you need to lock the entry (if you have one) before releasing
> tree_lock to stabilize it. That is enough also to block other mappings of
> that entry. Then once you reacquire the tree_lock, you can safely delete it
> and replace it with a different entry.
Yep, great catch. I'll lock the entry before I drop tree_lock.
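
For illustration only, the ordering Jan describes can be sketched in userspace C: set the lock bit on the entry while still holding the tree lock, drop the tree lock for the slow unmap, then retake it and replace the entry. This is not the kernel code. struct slot_tree, SLOT_LOCKED, slow_unmap() and downgrade_slot() are hypothetical names, a pthread mutex stands in for mapping->tree_lock, and where the kernel would sleep on the entry's wait queue this sketch simply gives up.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for RADIX_DAX_ENTRY_LOCK. */
#define SLOT_LOCKED 0x1UL

struct slot_tree {
	pthread_mutex_t lock;	/* plays the role of mapping->tree_lock */
	uintptr_t slot;		/* a single "radix tree" slot */
};

/* Slow step that must not run under the tree lock (think unmap_mapping_range()). */
static void slow_unmap(void)
{
}

/* Replace the slot contents with new_entry; returns false if the entry is busy. */
static bool downgrade_slot(struct slot_tree *t, uintptr_t new_entry)
{
	uintptr_t locked;

	pthread_mutex_lock(&t->lock);
	if (t->slot & SLOT_LOCKED) {
		/* Someone else owns the entry; the kernel would wait here. */
		pthread_mutex_unlock(&t->lock);
		return false;
	}
	/* Lock the entry BEFORE dropping the tree lock so it stays stable. */
	t->slot |= SLOT_LOCKED;
	locked = t->slot;
	pthread_mutex_unlock(&t->lock);

	slow_unmap();

	pthread_mutex_lock(&t->lock);
	/* The lock bit kept everyone else away, so the slot is unchanged. */
	assert(t->slot == locked);
	t->slot = new_entry;	/* new_entry is stored unlocked */
	pthread_mutex_unlock(&t->lock);
	return true;
}
```

A caller would initialize the mutex with PTHREAD_MUTEX_INITIALIZER and call downgrade_slot(&tree, new_entry); a second caller that finds SLOT_LOCKED set backs off instead of touching the slot.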
> > @@ -623,22 +672,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> > error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
> > if (error)
> > return ERR_PTR(error);
> > + } else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
> > + /* replacing huge zero page with PMD block mapping */
> > + unmap_mapping_range(mapping,
> > + (vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
> > }
> >
> > spin_lock_irq(&mapping->tree_lock);
> > - new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > - RADIX_DAX_ENTRY_LOCK);
> > + if (hzp)
> > + new_entry = RADIX_DAX_HZP_ENTRY();
> > + else
> > + new_entry = RADIX_DAX_ENTRY(sector, new_type);
> > +
> > if (hole_fill) {
> > __delete_from_page_cache(entry, NULL);
> > /* Drop pagecache reference */
> > put_page(entry);
> > - error = radix_tree_insert(page_tree, index, new_entry);
> > + error = __radix_tree_insert(page_tree, index,
> > + RADIX_DAX_ORDER(new_type), new_entry);
> > if (error) {
> > new_entry = ERR_PTR(error);
> > goto unlock;
> > }
> > mapping->nrexceptional++;
> > - } else {
> > + } else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
> > void **slot;
> > void *ret;
>
> Hum, I somewhat dislike how PTE and PMD paths differ here. But it's OK for
> now I guess. Long term we might be better off to do away with zero pages
> for PTEs as well and use exceptional entry and a single zero page like you
> do for PMD. Because the special cases these zero pages cause are a
> headache.
I've been thinking about this as well, and I do think we'd be better off with
a single zero page for PTEs, as we have with PMDs. It'd reduce the special
casing in the DAX code, and it'd also ensure that we don't waste a bunch of
time and memory creating read-only zero pages to service reads from holes.
I'll look into adding this for v5.
> > +int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > + pmd_t *pmd, unsigned int flags, struct iomap_ops *ops)
> > +{
> > + struct address_space *mapping = vma->vm_file->f_mapping;
> > + unsigned long pmd_addr = address & PMD_MASK;
> > + bool write = flags & FAULT_FLAG_WRITE;
> > + struct inode *inode = mapping->host;
> > + struct iomap iomap = { 0 };
> > + int error, result = 0;
> > + pgoff_t size, pgoff;
> > + struct vm_fault vmf;
> > + void *entry;
> > + loff_t pos;
> > +
> > + /* Fall back to PTEs if we're going to COW */
> > + if (write && !(vma->vm_flags & VM_SHARED)) {
> > + split_huge_pmd(vma, pmd, address);
> > + return VM_FAULT_FALLBACK;
> > + }
> > +
> > + /* If the PMD would extend outside the VMA */
> > + if (pmd_addr < vma->vm_start)
> > + return VM_FAULT_FALLBACK;
> > + if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> > + return VM_FAULT_FALLBACK;
> > +
> > + /*
> > + * Check whether offset isn't beyond end of file now. Caller is
> > + * supposed to hold locks serializing us with truncate / punch hole so
> > + * this is a reliable test.
> > + */
> > + pgoff = linear_page_index(vma, pmd_addr);
> > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > +
> > + if (pgoff >= size)
> > + return VM_FAULT_SIGBUS;
> > +
> > + /* If the PMD would extend beyond the file size */
> > + if ((pgoff | PG_PMD_COLOUR) >= size)
> > + return VM_FAULT_FALLBACK;
> > +
> > + /*
> > + * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
> > + * PMD or a HZP entry. If it can't (because a 4k page is already in
> > + * the tree, for instance), it will return -EEXIST and we just fall
> > + * back to 4k entries.
> > + */
> > + entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> > + if (IS_ERR(entry))
> > + return VM_FAULT_FALLBACK;
> > +
> > + /*
> > + * Note that we don't use iomap_apply here. We aren't doing I/O, only
> > + * setting up a mapping, so really we're using iomap_begin() as a way
> > + * to look up our filesystem block.
> > + */
> > + pos = (loff_t)pgoff << PAGE_SHIFT;
> > + error = ops->iomap_begin(inode, pos, PMD_SIZE, write ? IOMAP_WRITE : 0,
> > + &iomap);
>
> I'm not quite sure if it is OK to call ->iomap_begin() without ever calling
> ->iomap_end. Specifically the comment before iomap_apply() says:
>
> "It is assumed that the filesystems will lock whatever resources they
> require in the iomap_begin call, and release them in the iomap_end call."
>
> so what you do could result in unbalanced allocations / locks / whatever.
> Christoph?
I'll add the iomap_end() calls to both the PTE and PMD iomap fault handlers.
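
As a rough sketch of what balanced calls look like (userspace C, not the actual fault handlers): once the begin hook succeeds, the end hook runs on every exit path, including the "extent too short, fall back" path. struct demo_extent, struct demo_ops, demo_lookup() and the counting_* callbacks are hypothetical simplifications of struct iomap and struct iomap_ops, and treating the end hook as optional is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for struct iomap and struct iomap_ops. */
struct demo_extent {
	long long offset;
	long long length;
};

struct demo_ops {
	int (*begin)(void *ctx, long long pos, long long len,
		     struct demo_extent *ext);
	int (*end)(void *ctx, long long pos, long long len,
		   struct demo_extent *ext);	/* optional in this sketch */
};

/*
 * Returns 0 if [pos, pos + len) is covered by one extent, -1 if the caller
 * should fall back, or the error reported by begin()/end().
 */
static int demo_lookup(const struct demo_ops *ops, void *ctx,
		       long long pos, long long len)
{
	struct demo_extent ext = { 0, 0 };
	int ret, err;

	assert(ops && ops->begin);

	ret = ops->begin(ctx, pos, len, &ext);
	if (ret)
		return ret;	/* begin failed, nothing to release */

	if (ext.offset + ext.length < pos + len)
		ret = -1;	/* extent too short: fall back, but still call end */

	if (ops->end) {
		err = ops->end(ctx, pos, len, &ext);
		if (!ret)
			ret = err;
	}
	return ret;
}

/* Demo callbacks that count calls so the pairing can be checked. */
struct demo_counter {
	int begins;
	int ends;
	long long avail;	/* length of the extent begin() reports */
};

static int counting_begin(void *ctx, long long pos, long long len,
			  struct demo_extent *ext)
{
	struct demo_counter *c = ctx;

	(void)len;
	c->begins++;
	ext->offset = pos;
	ext->length = c->avail;
	return 0;
}

static int counting_end(void *ctx, long long pos, long long len,
			struct demo_extent *ext)
{
	struct demo_counter *c = ctx;

	(void)pos;
	(void)len;
	(void)ext;
	c->ends++;
	return 0;
}
```

With counting_begin() and counting_end() plugged into a struct demo_ops, the begins and ends counters stay equal whether demo_lookup() succeeds or asks for a fallback.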
> > + if (error)
> > + goto fallback;
> > + if (iomap.offset + iomap.length < pos + PMD_SIZE)
> > + goto fallback;
> > +
> > + vmf.pgoff = pgoff;
> > + vmf.flags = flags;
> > + vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
>
> I don't think you want __GFP_FS here - we have already gone through the
> filesystem's pmd_fault() handler which called dax_iomap_pmd_fault() and
> thus we hold various fs locks, freeze protection, ...
I copied this from __get_fault_gfp_mask() in mm/memory.c. That function is
used by do_page_mkwrite() and __do_fault(), and we eventually get this
vmf->gfp_mask in the PTE fault code. With the code as it is we get the same
vmf->gfp_mask in both dax_iomap_fault() and dax_iomap_pmd_fault(). It seems
like they should remain consistent - is it wrong to have __GFP_FS in
dax_iomap_fault()?
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index c4a51bb..cacff9e 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -8,8 +8,33 @@
> >
> > struct iomap_ops;
> >
> > -/* We use lowest available exceptional entry bit for locking */
> > +/*
> > + * We use lowest available bit in exceptional entry for locking, two bits for
> > + * the entry type (PMD & PTE), and two more for flags (HZP and empty). In
> > + * total five special bits.
> > + */
> > +#define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 5)
> > #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
> > +/* PTE and PMD types */
> > +#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> > +#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> > +/* huge zero page and empty entry flags */
> > +#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
> > +#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 4))
>
> I think we can do with just 2 bits for type instead of 4 but for now this
> is OK I guess.
I guess we could combine the PMD/PTE choice into the same bit (0=PTE, 1=PMD),
but we have three cases for the other types (zero page, empty entry just for
locking, real DAX based entry with storage), so we need at least 2 bits for
those.
Christoph also suggested some reworks to the "type" logic - I'll look at
simplifying the way the flags are used for DAX entries.
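
One possible shape for a more compact encoding, purely as a sketch and not what v5 will necessarily do: one lock bit, one size bit (0=PTE, 1=PMD), and a two-bit field for the three kinds of entry, four bits in total instead of five. All DEMO_* names and helpers are hypothetical, and DEMO_SHIFT is only a stand-in for RADIX_TREE_EXCEPTIONAL_SHIFT.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_SHIFT	2	/* stand-in for RADIX_TREE_EXCEPTIONAL_SHIFT */
#define DEMO_LOCK	(1UL << DEMO_SHIFT)
#define DEMO_PMD	(1UL << (DEMO_SHIFT + 1))	/* 0 = PTE, 1 = PMD */
#define DEMO_KIND_SHIFT	(DEMO_SHIFT + 2)
#define DEMO_KIND_MASK	(3UL << DEMO_KIND_SHIFT)

/* The three cases: real storage, zero page, empty entry used only for locking. */
enum demo_kind {
	DEMO_KIND_STORAGE = 0,
	DEMO_KIND_ZERO = 1,
	DEMO_KIND_EMPTY = 2,
};

/* Build an unlocked entry value from its size and kind. */
static unsigned long demo_make(bool pmd, enum demo_kind kind)
{
	assert(kind <= DEMO_KIND_EMPTY);
	return (pmd ? DEMO_PMD : 0) |
	       ((unsigned long)kind << DEMO_KIND_SHIFT);
}

static bool demo_is_pmd(unsigned long entry)
{
	return (entry & DEMO_PMD) != 0;
}

static enum demo_kind demo_kind_of(unsigned long entry)
{
	return (enum demo_kind)((entry & DEMO_KIND_MASK) >> DEMO_KIND_SHIFT);
}
```

For example, demo_make(true, DEMO_KIND_ZERO) describes a PMD-sized zero page entry, and DEMO_LOCK can be OR-ed in separately without disturbing the size or kind fields.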
Thank you for the review!