* [PATCH v2 1/9] ext4: allow DAX writeback for hole punch
2016-08-23 22:04 [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
@ 2016-08-23 22:04 ` Ross Zwisler
2016-09-21 15:22 ` Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 2/9] ext2: tell DAX the size of allocation holes Ross Zwisler
` (8 subsequent siblings)
9 siblings, 1 reply; 42+ messages in thread
From: Ross Zwisler @ 2016-08-23 22:04 UTC (permalink / raw)
To: linux-kernel
Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Matthew Wilcox,
Dave Chinner, stable, linux-mm, Andreas Dilger, Alexander Viro,
Jan Kara, linux-fsdevel, linux-ext4
Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
This is because the logic around filemap_write_and_wait_range() in
ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
not for dirty DAX exceptional entries.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
---
fs/ext4/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3131747..0900cb4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
}
/*
- * ext4_punch_hole: punches a hole in a file by releaseing the blocks
+ * ext4_punch_hole: punches a hole in a file by releasing the blocks
* associated with the given offset and length
*
* @inode: File inode
@@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
* Write out all dirty pages to avoid race conditions
* Then release them.
*/
- if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+ if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
ret = filemap_write_and_wait_range(mapping, offset,
offset + length - 1);
if (ret)
--
2.9.0
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply related [flat|nested] 42+ messages in thread* Re: [PATCH v2 1/9] ext4: allow DAX writeback for hole punch
2016-08-23 22:04 ` [PATCH v2 1/9] ext4: allow DAX writeback for hole punch Ross Zwisler
@ 2016-09-21 15:22 ` Ross Zwisler
2016-09-22 6:59 ` Jan Kara
2016-09-22 15:51 ` Theodore Ts'o
0 siblings, 2 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-09-21 15:22 UTC (permalink / raw)
To: Theodore Ts'o, Jan Kara
Cc: linux-nvdimm, Matthew Wilcox, Dave Chinner, linux-kernel, stable,
linux-mm, Andreas Dilger, Alexander Viro, Jan Kara, linux-fsdevel,
Andrew Morton, linux-ext4
On Tue, Aug 23, 2016 at 04:04:11PM -0600, Ross Zwisler wrote:
> Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
> This is because the logic around filemap_write_and_wait_range() in
> ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
> not for dirty DAX exceptional entries.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Cc: <stable@vger.kernel.org>
Ted & Jan,
I'm still working on the latest version of the PMD work which integrates with
the new struct iomap faults. At this point it doesn't look like I'm going to
make v4.9, but I think that this bug fix at least should probably go in alone?
Thanks,
- Ross
> ---
> fs/ext4/inode.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3131747..0900cb4 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> }
>
> /*
> - * ext4_punch_hole: punches a hole in a file by releaseing the blocks
> + * ext4_punch_hole: punches a hole in a file by releasing the blocks
> * associated with the given offset and length
> *
> * @inode: File inode
> @@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
> * Write out all dirty pages to avoid race conditions
> * Then release them.
> */
> - if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> + if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> ret = filemap_write_and_wait_range(mapping, offset,
> offset + length - 1);
> if (ret)
> --
> 2.9.0
>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 42+ messages in thread* Re: [PATCH v2 1/9] ext4: allow DAX writeback for hole punch
2016-09-21 15:22 ` Ross Zwisler
@ 2016-09-22 6:59 ` Jan Kara
2016-09-22 15:51 ` Theodore Ts'o
1 sibling, 0 replies; 42+ messages in thread
From: Jan Kara @ 2016-09-22 6:59 UTC (permalink / raw)
To: Ross Zwisler
Cc: Jan Kara, linux-nvdimm, Matthew Wilcox, Dave Chinner,
linux-kernel, stable, linux-mm, Andreas Dilger, Alexander Viro,
Jan Kara, linux-fsdevel, Theodore Ts'o, Andrew Morton,
linux-ext4
On Wed 21-09-16 09:22:44, Ross Zwisler wrote:
> On Tue, Aug 23, 2016 at 04:04:11PM -0600, Ross Zwisler wrote:
> > Currently when doing a DAX hole punch with ext4 we fail to do a writeback.
> > This is because the logic around filemap_write_and_wait_range() in
> > ext4_punch_hole() only looks for dirty page cache pages in the radix tree,
> > not for dirty DAX exceptional entries.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > Cc: <stable@vger.kernel.org>
>
> Ted & Jan,
>
> I'm still working on the latest version of the PMD work which integrates with
> the new struct iomap faults. At this point it doesn't look like I'm going to
> make v4.9, but I think that this bug fix at least should probably go in alone?
Yeah. Ted, feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
and merge this change. Thanks!
Honza
> > fs/ext4/inode.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 3131747..0900cb4 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3890,7 +3890,7 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> > }
> >
> > /*
> > - * ext4_punch_hole: punches a hole in a file by releaseing the blocks
> > + * ext4_punch_hole: punches a hole in a file by releasing the blocks
> > * associated with the given offset and length
> > *
> > * @inode: File inode
> > @@ -3919,7 +3919,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
> > * Write out all dirty pages to avoid race conditions
> > * Then release them.
> > */
> > - if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> > + if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> > ret = filemap_write_and_wait_range(mapping, offset,
> > offset + length - 1);
> > if (ret)
> > --
> > 2.9.0
> >
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 42+ messages in thread* Re: [PATCH v2 1/9] ext4: allow DAX writeback for hole punch
2016-09-21 15:22 ` Ross Zwisler
2016-09-22 6:59 ` Jan Kara
@ 2016-09-22 15:51 ` Theodore Ts'o
1 sibling, 0 replies; 42+ messages in thread
From: Theodore Ts'o @ 2016-09-22 15:51 UTC (permalink / raw)
To: Ross Zwisler, Jan Kara, linux-kernel, Alexander Viro,
Andreas Dilger, Andrew Morton, Dan Williams, Dave Chinner,
Jan Kara, linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm,
Matthew Wilcox, stable
On Wed, Sep 21, 2016 at 09:22:44AM -0600, Ross Zwisler wrote:
>
> Ted & Jan,
>
> I'm still working on the latest version of the PMD work which integrates with
> the new struct iomap faults. At this point it doesn't look like I'm going to
> make v4.9, but I think that this bug fix at least should probably go in alone?
Thanks, applied.
- Ted
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH v2 2/9] ext2: tell DAX the size of allocation holes
2016-08-23 22:04 [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 1/9] ext4: allow DAX writeback for hole punch Ross Zwisler
@ 2016-08-23 22:04 ` Ross Zwisler
[not found] ` <20160823220419.11717-3-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2016-09-15 20:09 ` Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 3/9] ext4: " Ross Zwisler
` (7 subsequent siblings)
9 siblings, 2 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-08-23 22:04 UTC (permalink / raw)
To: linux-kernel
Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Matthew Wilcox,
Dave Chinner, linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
linux-fsdevel, linux-ext4
When DAX calls ext2_get_block() and the file offset points to a hole we
currently don't set bh_result->b_size. When we re-enable PMD faults DAX
will need bh_result->b_size to tell it the size of the hole so it can
decide whether to fault in a 4 KiB zero page or a 2 MiB zero page.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
fs/ext2/inode.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index d5c7d09..dd55d74 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -773,6 +773,9 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_
if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
ret = 0;
+ } else if (ret == 0) {
+ /* hole case, need to fill in bh_result->b_size */
+ bh_result->b_size = 1 << inode->i_blkbits;
}
return ret;
--
2.9.0
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply related [flat|nested] 42+ messages in thread[parent not found: <20160823220419.11717-3-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>]
* Re: [PATCH v2 2/9] ext2: tell DAX the size of allocation holes
[not found] ` <20160823220419.11717-3-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2016-08-25 7:57 ` Christoph Hellwig
[not found] ` <20160825075728.GA11235-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2016-08-25 7:57 UTC (permalink / raw)
To: Ross Zwisler
Cc: Theodore Ts'o, Matthew Wilcox,
linux-nvdimm-y27Ovi1pjclAfugRpC6u6w, Dave Chinner,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andreas Dilger, Alexander Viro,
Jan Kara, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
linux-ext4-u79uwXL29TY76Z2rM5mHXA
Hi Ross,
can you take at my (fully working, but not fully cleaned up) version
of the iomap based DAX code here:
http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/iomap-dax
By using iomap we don't even have the size hole problem and totally
get out of the reverse-engineer what buffer_heads are trying to tell
us business. It also gets rid of the other warts of the DAX path
due to pretending to be like direct I/O, so this might be a better
way forward also for ext2/4.
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 2/9] ext2: tell DAX the size of allocation holes
2016-08-23 22:04 ` [PATCH v2 2/9] ext2: tell DAX the size of allocation holes Ross Zwisler
[not found] ` <20160823220419.11717-3-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2016-09-15 20:09 ` Ross Zwisler
1 sibling, 0 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-09-15 20:09 UTC (permalink / raw)
To: Jan Kara
Cc: Theodore Ts'o, linux-nvdimm, Matthew Wilcox, Dave Chinner,
linux-kernel, linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
linux-fsdevel, Andrew Morton, linux-ext4
On Tue, Aug 23, 2016 at 04:04:12PM -0600, Ross Zwisler wrote:
> When DAX calls ext2_get_block() and the file offset points to a hole we
> currently don't set bh_result->b_size. When we re-enable PMD faults DAX
> will need bh_result->b_size to tell it the size of the hole so it can
> decide whether to fault in a 4 KiB zero page or a 2 MiB zero page.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
> fs/ext2/inode.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index d5c7d09..dd55d74 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -773,6 +773,9 @@ int ext2_get_block(struct inode *inode, sector_t iblock, struct buffer_head *bh_
> if (ret > 0) {
> bh_result->b_size = (ret << inode->i_blkbits);
> ret = 0;
> + } else if (ret == 0) {
> + /* hole case, need to fill in bh_result->b_size */
> + bh_result->b_size = 1 << inode->i_blkbits;
> }
> return ret;
>
> --
> 2.9.0
>
Jan, is it possible for ext2 to return 2 MiB of contiguous space to us via
ext2_get_block()?
I ask because we have all the infrastructure in place for ext2 to handle PMD
faults (ext2_dax_pmd_fault(), etc.), but I don't think in my testing I've ever
seen this actually happen.
ext2 can obviously return multiple blocks from ext2_get_block(), but can it
actually satisfy a whole PMD's worth (512 contiguous blocks)? If so, what
steps do I need to take to get this to work in my testing?
If it can't happen, we should probably rip out ext2_dax_pmd_fault() so that we
don't have to keep falling back to PTEs via the PMD path.
Thanks,
- Ross
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 42+ messages in thread
* [PATCH v2 3/9] ext4: tell DAX the size of allocation holes
2016-08-23 22:04 [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 1/9] ext4: allow DAX writeback for hole punch Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 2/9] ext2: tell DAX the size of allocation holes Ross Zwisler
@ 2016-08-23 22:04 ` Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 4/9] dax: remove buffer_size_valid() Ross Zwisler
` (6 subsequent siblings)
9 siblings, 0 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-08-23 22:04 UTC (permalink / raw)
To: linux-kernel
Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Matthew Wilcox,
Dave Chinner, linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
linux-fsdevel, linux-ext4
When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size. When we re-enable PMD faults DAX will
need bh->b_size to tell it the size of the hole so it can decide whether to
fault in a 4 KiB zero page or a 2 MiB zero page.
_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/ext4/inode.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0900cb4..9075fac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
ext4_update_bh_state(bh, map.m_flags);
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
ret = 0;
+ } else if (ret == 0) {
+ /* hole case, need to fill in bh->b_size */
+ bh->b_size = inode->i_sb->s_blocksize * map.m_len;
}
return ret;
}
--
2.9.0
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply related [flat|nested] 42+ messages in thread* [PATCH v2 4/9] dax: remove buffer_size_valid()
2016-08-23 22:04 [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
` (2 preceding siblings ...)
2016-08-23 22:04 ` [PATCH v2 3/9] ext4: " Ross Zwisler
@ 2016-08-23 22:04 ` Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 5/9] dax: make 'wait_table' global variable static Ross Zwisler
` (5 subsequent siblings)
9 siblings, 0 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-08-23 22:04 UTC (permalink / raw)
To: linux-kernel
Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Matthew Wilcox,
Dave Chinner, linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
linux-fsdevel, linux-ext4
Now that all our supported filesystems (ext2, ext4 and XFS) all properly
set bh.b_size when we call get_block() for a hole, rely on that value and
remove the buffer_size_valid() sanity check.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
fs/dax.c | 22 +---------------------
1 file changed, 1 insertion(+), 21 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 993dc6f..8030f93 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -121,19 +121,6 @@ static bool buffer_written(struct buffer_head *bh)
return buffer_mapped(bh) && !buffer_unwritten(bh);
}
-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size. To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size. Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
- return bh->b_state != 0;
-}
-
-
static sector_t to_sector(const struct buffer_head *bh,
const struct inode *inode)
{
@@ -175,8 +162,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
rc = get_block(inode, block, bh, rw == WRITE);
if (rc)
break;
- if (!buffer_size_valid(bh))
- bh->b_size = 1 << blkbits;
bh_max = pos - first + bh->b_size;
bdev = bh->b_bdev;
/*
@@ -1010,12 +995,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
bdev = bh.b_bdev;
- /*
- * If the filesystem isn't willing to tell us the length of a hole,
- * just fall back to PTEs. Calling get_block 512 times in a loop
- * would be silly.
- */
- if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+ if (bh.b_size < PMD_SIZE) {
dax_pmd_dbg(&bh, address, "allocated block too small");
return VM_FAULT_FALLBACK;
}
--
2.9.0
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply related [flat|nested] 42+ messages in thread* [PATCH v2 5/9] dax: make 'wait_table' global variable static
2016-08-23 22:04 [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
` (3 preceding siblings ...)
2016-08-23 22:04 ` [PATCH v2 4/9] dax: remove buffer_size_valid() Ross Zwisler
@ 2016-08-23 22:04 ` Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 6/9] dax: consistent variable naming for DAX entries Ross Zwisler
` (4 subsequent siblings)
9 siblings, 0 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-08-23 22:04 UTC (permalink / raw)
To: linux-kernel
Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Matthew Wilcox,
Dave Chinner, linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
linux-fsdevel, linux-ext4
The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:
fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?
Make it static so it has scope local to fs/dax.c, and to make sparse happy.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
fs/dax.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index 8030f93..3066a9b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -50,7 +50,7 @@
#define DAX_WAIT_TABLE_BITS 12
#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
-wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
static int __init init_dax_wait_table(void)
{
--
2.9.0
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply related [flat|nested] 42+ messages in thread* [PATCH v2 6/9] dax: consistent variable naming for DAX entries
2016-08-23 22:04 [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
` (4 preceding siblings ...)
2016-08-23 22:04 ` [PATCH v2 5/9] dax: make 'wait_table' global variable static Ross Zwisler
@ 2016-08-23 22:04 ` Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 7/9] dax: coordinate locking for offsets in PMD range Ross Zwisler
` (3 subsequent siblings)
9 siblings, 0 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-08-23 22:04 UTC (permalink / raw)
To: linux-kernel
Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Matthew Wilcox,
Dave Chinner, linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
linux-fsdevel, linux-ext4
No functional change.
Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries. This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
fs/dax.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 3066a9b..0e3f462 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -355,7 +355,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
static void *get_unlocked_mapping_entry(struct address_space *mapping,
pgoff_t index, void ***slotp)
{
- void *ret, **slot;
+ void *entry, **slot;
struct wait_exceptional_entry_queue ewait;
wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
@@ -365,13 +365,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
ewait.key.index = index;
for (;;) {
- ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+ entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
&slot);
- if (!ret || !radix_tree_exceptional_entry(ret) ||
+ if (!entry || !radix_tree_exceptional_entry(entry) ||
!slot_locked(mapping, slot)) {
if (slotp)
*slotp = slot;
- return ret;
+ return entry;
}
prepare_to_wait_exclusive(wq, &ewait.wait,
TASK_UNINTERRUPTIBLE);
@@ -394,13 +394,13 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
*/
static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
{
- void *ret, **slot;
+ void *entry, **slot;
restart:
spin_lock_irq(&mapping->tree_lock);
- ret = get_unlocked_mapping_entry(mapping, index, &slot);
+ entry = get_unlocked_mapping_entry(mapping, index, &slot);
/* No entry for given index? Make sure radix tree is big enough. */
- if (!ret) {
+ if (!entry) {
int err;
spin_unlock_irq(&mapping->tree_lock);
@@ -408,10 +408,10 @@ restart:
mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
if (err)
return ERR_PTR(err);
- ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+ entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
RADIX_DAX_ENTRY_LOCK);
spin_lock_irq(&mapping->tree_lock);
- err = radix_tree_insert(&mapping->page_tree, index, ret);
+ err = radix_tree_insert(&mapping->page_tree, index, entry);
radix_tree_preload_end();
if (err) {
spin_unlock_irq(&mapping->tree_lock);
@@ -423,11 +423,11 @@ restart:
/* Good, we have inserted empty locked entry into the tree. */
mapping->nrexceptional++;
spin_unlock_irq(&mapping->tree_lock);
- return ret;
+ return entry;
}
/* Normal page in radix tree? */
- if (!radix_tree_exceptional_entry(ret)) {
- struct page *page = ret;
+ if (!radix_tree_exceptional_entry(entry)) {
+ struct page *page = entry;
get_page(page);
spin_unlock_irq(&mapping->tree_lock);
@@ -440,9 +440,9 @@ restart:
}
return page;
}
- ret = lock_slot(mapping, slot);
+ entry = lock_slot(mapping, slot);
spin_unlock_irq(&mapping->tree_lock);
- return ret;
+ return entry;
}
void dax_wake_mapping_entry_waiter(struct address_space *mapping,
@@ -467,11 +467,11 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
{
- void *ret, **slot;
+ void *entry, **slot;
spin_lock_irq(&mapping->tree_lock);
- ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
- if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+ entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+ if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
!slot_locked(mapping, slot))) {
spin_unlock_irq(&mapping->tree_lock);
return;
--
2.9.0
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply related [flat|nested] 42+ messages in thread* [PATCH v2 7/9] dax: coordinate locking for offsets in PMD range
2016-08-23 22:04 [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
` (5 preceding siblings ...)
2016-08-23 22:04 ` [PATCH v2 6/9] dax: consistent variable naming for DAX entries Ross Zwisler
@ 2016-08-23 22:04 ` Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 8/9] dax: re-enable DAX PMD support Ross Zwisler
` (2 subsequent siblings)
9 siblings, 0 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-08-23 22:04 UTC (permalink / raw)
To: linux-kernel
Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Matthew Wilcox,
Dave Chinner, linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
linux-fsdevel, linux-ext4
DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry. The 'mapping'
pointer is still used in the same way.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
fs/dax.c | 37 ++++++++++++++++++++++++-------------
include/linux/dax.h | 2 +-
mm/filemap.c | 2 +-
3 files changed, 26 insertions(+), 15 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 0e3f462..955e184 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -62,10 +62,17 @@ static int __init init_dax_wait_table(void)
}
fs_initcall(init_dax_wait_table);
+static pgoff_t dax_entry_start(pgoff_t index, void *entry)
+{
+ if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+ index &= (PMD_MASK >> PAGE_SHIFT);
+ return index;
+}
+
static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
- pgoff_t index)
+ pgoff_t entry_start)
{
- unsigned long hash = hash_long((unsigned long)mapping ^ index,
+ unsigned long hash = hash_long((unsigned long)mapping ^ entry_start,
DAX_WAIT_TABLE_BITS);
return wait_table + hash;
}
@@ -283,7 +290,7 @@ EXPORT_SYMBOL_GPL(dax_do_io);
*/
struct exceptional_entry_key {
struct address_space *mapping;
- unsigned long index;
+ pgoff_t entry_start;
};
struct wait_exceptional_entry_queue {
@@ -299,7 +306,7 @@ static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
container_of(wait, struct wait_exceptional_entry_queue, wait);
if (key->mapping != ewait->key.mapping ||
- key->index != ewait->key.index)
+ key->entry_start != ewait->key.entry_start)
return 0;
return autoremove_wake_function(wait, mode, sync, NULL);
}
@@ -357,12 +364,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
{
void *entry, **slot;
struct wait_exceptional_entry_queue ewait;
- wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+ wait_queue_head_t *wq;
init_wait(&ewait.wait);
ewait.wait.func = wake_exceptional_entry_func;
- ewait.key.mapping = mapping;
- ewait.key.index = index;
for (;;) {
entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -373,6 +378,11 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
*slotp = slot;
return entry;
}
+
+ wq = dax_entry_waitqueue(mapping,
+ dax_entry_start(index, entry));
+ ewait.key.mapping = mapping;
+ ewait.key.entry_start = dax_entry_start(index, entry);
prepare_to_wait_exclusive(wq, &ewait.wait,
TASK_UNINTERRUPTIBLE);
spin_unlock_irq(&mapping->tree_lock);
@@ -445,10 +455,11 @@ restart:
return entry;
}
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
+void dax_wake_mapping_entry_waiter(void *entry, struct address_space *mapping,
pgoff_t index, bool wake_all)
{
- wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+ wait_queue_head_t *wq = dax_entry_waitqueue(mapping,
+ dax_entry_start(index, entry));
/*
* Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -460,7 +471,7 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
struct exceptional_entry_key key;
key.mapping = mapping;
- key.index = index;
+ key.entry_start = dax_entry_start(index, entry);
__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
}
}
@@ -478,7 +489,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
}
unlock_slot(mapping, slot);
spin_unlock_irq(&mapping->tree_lock);
- dax_wake_mapping_entry_waiter(mapping, index, false);
+ dax_wake_mapping_entry_waiter(entry, mapping, index, false);
}
static void put_locked_mapping_entry(struct address_space *mapping,
@@ -503,7 +514,7 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
return;
/* We have to wake up next waiter for the radix tree entry lock */
- dax_wake_mapping_entry_waiter(mapping, index, false);
+ dax_wake_mapping_entry_waiter(entry, mapping, index, false);
}
/*
@@ -530,7 +541,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
radix_tree_delete(&mapping->page_tree, index);
mapping->nrexceptional--;
spin_unlock_irq(&mapping->tree_lock);
- dax_wake_mapping_entry_waiter(mapping, index, true);
+ dax_wake_mapping_entry_waiter(entry, mapping, index, true);
return 1;
}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9c6dc77..f6cab31 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -15,7 +15,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
-void dax_wake_mapping_entry_waiter(struct address_space *mapping,
+void dax_wake_mapping_entry_waiter(void *entry, struct address_space *mapping,
pgoff_t index, bool wake_all);
#ifdef CONFIG_FS_DAX
diff --git a/mm/filemap.c b/mm/filemap.c
index 8a287df..35e880d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,7 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
if (node)
workingset_node_pages_dec(node);
/* Wakeup waiters for exceptional entry lock */
- dax_wake_mapping_entry_waiter(mapping, page->index,
+ dax_wake_mapping_entry_waiter(p, mapping, page->index,
false);
}
}
--
2.9.0
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply related [flat|nested] 42+ messages in thread* [PATCH v2 8/9] dax: re-enable DAX PMD support
2016-08-23 22:04 [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
` (6 preceding siblings ...)
2016-08-23 22:04 ` [PATCH v2 7/9] dax: coordinate locking for offsets in PMD range Ross Zwisler
@ 2016-08-23 22:04 ` Ross Zwisler
2016-08-23 22:04 ` [PATCH v2 9/9] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
2016-08-30 23:01 ` [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
9 siblings, 0 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-08-23 22:04 UTC (permalink / raw)
To: linux-kernel
Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Matthew Wilcox,
Dave Chinner, linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
linux-fsdevel, linux-ext4
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking. This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.
There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries. The empty entries exist to provide locking for the duration of a
given page fault.
This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.
Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set. This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees. This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.
One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset. The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases. This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.
If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry. We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
fs/dax.c | 208 +++++++++++++++++++++++++++++++---------------------
include/linux/dax.h | 27 ++++++-
mm/filemap.c | 4 +-
3 files changed, 151 insertions(+), 88 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 955e184..d11c996 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,20 +32,6 @@
#include <linux/pfn_t.h>
#include <linux/sizes.h>
-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
- RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
- RADIX_TREE_EXCEPTIONAL_ENTRY))
-
/* We choose 4096 entries - same as per-zone page wait tables */
#define DAX_WAIT_TABLE_BITS 12
#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -402,15 +388,32 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
* persistent memory the benefit is doubtful. We can add that later if we can
* show it helps.
*/
-static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
+ unsigned long new_type)
{
+ bool pmd_downgrade = false;
void *entry, **slot;
restart:
spin_lock_irq(&mapping->tree_lock);
entry = get_unlocked_mapping_entry(mapping, index, &slot);
+
+ if (entry && new_type == RADIX_DAX_PMD) {
+ if (!radix_tree_exceptional_entry(entry) ||
+ RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
+ spin_unlock_irq(&mapping->tree_lock);
+ return ERR_PTR(-EEXIST);
+ }
+ } else if (entry && new_type == RADIX_DAX_PTE) {
+ if (radix_tree_exceptional_entry(entry) &&
+ RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD &&
+ (unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
+ pmd_downgrade = true;
+ }
+ }
+
/* No entry for given index? Make sure radix tree is big enough. */
- if (!entry) {
+ if (!entry || pmd_downgrade) {
int err;
spin_unlock_irq(&mapping->tree_lock);
@@ -418,15 +421,29 @@ restart:
mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
if (err)
return ERR_PTR(err);
- entry = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
- RADIX_DAX_ENTRY_LOCK);
+
+ if (pmd_downgrade && ((unsigned long)entry & RADIX_DAX_HZP))
+ unmap_mapping_range(mapping,
+ (index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
+
spin_lock_irq(&mapping->tree_lock);
- err = radix_tree_insert(&mapping->page_tree, index, entry);
+
+ if (pmd_downgrade) {
+ radix_tree_delete(&mapping->page_tree, index);
+ mapping->nrexceptional--;
+ dax_wake_mapping_entry_waiter(entry, mapping, index,
+ false);
+ }
+
+ entry = RADIX_DAX_EMPTY_ENTRY(new_type);
+
+ err = __radix_tree_insert(&mapping->page_tree, index,
+ RADIX_DAX_ORDER(new_type), entry);
radix_tree_preload_end();
if (err) {
spin_unlock_irq(&mapping->tree_lock);
/* Someone already created the entry? */
- if (err == -EEXIST)
+ if (err == -EEXIST && new_type == RADIX_DAX_PTE)
goto restart;
return ERR_PTR(err);
}
@@ -595,15 +612,15 @@ static int copy_user_bh(struct page *to, struct inode *inode,
return 0;
}
-#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-
static void *dax_insert_mapping_entry(struct address_space *mapping,
struct vm_fault *vmf,
- void *entry, sector_t sector)
+ void *entry, sector_t sector,
+ unsigned long new_type, bool hzp)
{
struct radix_tree_root *page_tree = &mapping->page_tree;
int error = 0;
bool hole_fill = false;
+ bool hzp_fill = false;
void *new_entry;
pgoff_t index = vmf->pgoff;
@@ -622,22 +639,30 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
if (error)
return ERR_PTR(error);
+ } else if ((unsigned long)entry & RADIX_DAX_HZP && !hzp) {
+ hzp_fill = true;
+ unmap_mapping_range(mapping,
+ (vmf->pgoff << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
}
spin_lock_irq(&mapping->tree_lock);
- new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
- RADIX_DAX_ENTRY_LOCK);
+ if (hzp)
+ new_entry = RADIX_DAX_HZP_ENTRY();
+ else
+ new_entry = RADIX_DAX_ENTRY(sector, new_type);
+
if (hole_fill) {
__delete_from_page_cache(entry, NULL);
/* Drop pagecache reference */
put_page(entry);
- error = radix_tree_insert(page_tree, index, new_entry);
+ error = __radix_tree_insert(page_tree, index,
+ RADIX_DAX_ORDER(new_type), new_entry);
if (error) {
new_entry = ERR_PTR(error);
goto unlock;
}
mapping->nrexceptional++;
- } else {
+ } else if ((unsigned long)entry & (RADIX_DAX_HZP|RADIX_DAX_EMPTY)) {
void **slot;
void *ret;
@@ -693,6 +718,18 @@ static int dax_writeback_one(struct block_device *bdev,
goto unlock;
}
+ if (WARN_ON_ONCE((unsigned long)entry & RADIX_DAX_EMPTY)) {
+ ret = -EIO;
+ goto unlock;
+ }
+
+ /*
+ * Even if dax_writeback_mapping_range() was given a wbc->range_start
+ * in the middle of a PMD, the 'index' we are given will be aligned to
+ * the start index of the PMD, as will the sector we pull from
+ * 'entry'. This allows us to flush for PMD_SIZE and not have to
+ * worry about partial PMD writebacks.
+ */
dax.sector = RADIX_DAX_SECTOR(entry);
dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
spin_unlock_irq(&mapping->tree_lock);
@@ -733,12 +770,11 @@ int dax_writeback_mapping_range(struct address_space *mapping,
struct block_device *bdev, struct writeback_control *wbc)
{
struct inode *inode = mapping->host;
- pgoff_t start_index, end_index, pmd_index;
+ pgoff_t start_index, end_index;
pgoff_t indices[PAGEVEC_SIZE];
struct pagevec pvec;
bool done = false;
int i, ret = 0;
- void *entry;
if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
return -EIO;
@@ -748,15 +784,6 @@ int dax_writeback_mapping_range(struct address_space *mapping,
start_index = wbc->range_start >> PAGE_SHIFT;
end_index = wbc->range_end >> PAGE_SHIFT;
- pmd_index = DAX_PMD_INDEX(start_index);
-
- rcu_read_lock();
- entry = radix_tree_lookup(&mapping->page_tree, pmd_index);
- rcu_read_unlock();
-
- /* see if the start of our range is covered by a PMD entry */
- if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
- start_index = pmd_index;
tag_pages_for_writeback(mapping, start_index, end_index);
@@ -802,7 +829,8 @@ static int dax_insert_mapping(struct address_space *mapping,
return PTR_ERR(dax.addr);
dax_unmap_atomic(bdev, &dax);
- ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+ ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector,
+ RADIX_DAX_PTE, false);
if (IS_ERR(ret))
return PTR_ERR(ret);
*entryp = ret;
@@ -849,7 +877,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
bh.b_bdev = inode->i_sb->s_bdev;
bh.b_size = PAGE_SIZE;
- entry = grab_mapping_entry(mapping, vmf->pgoff);
+ entry = grab_mapping_entry(mapping, vmf->pgoff, RADIX_DAX_PTE);
if (IS_ERR(entry)) {
error = PTR_ERR(entry);
goto out;
@@ -953,9 +981,11 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
bool write = flags & FAULT_FLAG_WRITE;
struct block_device *bdev;
pgoff_t size, pgoff;
+ struct vm_fault vmf;
sector_t block;
int result = 0;
- bool alloc = false;
+ void *entry, *ret;
+
/* dax pmd mappings require pfn_t_devmap() */
if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -977,6 +1007,11 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
return VM_FAULT_FALLBACK;
}
+ /*
+ * Check whether offset isn't beyond end of file now. Caller is supposed
+ * to hold locks serializing us with truncate / punch hole so this is
+ * a reliable test.
+ */
pgoff = linear_page_index(vma, pmd_addr);
size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (pgoff >= size)
@@ -994,37 +1029,45 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
bh.b_size = PMD_SIZE;
- if (get_block(inode, block, &bh, 0) != 0)
- return VM_FAULT_SIGBUS;
+ /*
+ * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+ * PMD or a HZP entry. If it can't (because a 4k page is already in
+ * the tree, for instance), it will return -EEXIST and we just fall
+ * back to 4k entries.
+ */
+ entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+ if (IS_ERR(entry))
+ return VM_FAULT_FALLBACK;
+
+ if (get_block(inode, block, &bh, 0) != 0) {
+ result = VM_FAULT_SIGBUS;
+ goto unlock_entry;
+ }
if (!buffer_mapped(&bh) && write) {
- if (get_block(inode, block, &bh, 1) != 0)
- return VM_FAULT_SIGBUS;
- alloc = true;
- WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+ if (get_block(inode, block, &bh, 1) != 0) {
+ result = VM_FAULT_SIGBUS;
+ goto unlock_entry;
+ }
}
+ /* Filesystem should not return unwritten buffers to us! */
+ WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
+
bdev = bh.b_bdev;
if (bh.b_size < PMD_SIZE) {
dax_pmd_dbg(&bh, address, "allocated block too small");
- return VM_FAULT_FALLBACK;
+ goto fallback;
}
- /*
- * If we allocated new storage, make sure no process has any
- * zero pages covering this hole
- */
- if (alloc) {
- loff_t lstart = pgoff << PAGE_SHIFT;
- loff_t lend = lstart + PMD_SIZE - 1; /* inclusive */
-
- truncate_pagecache_range(inode, lstart, lend);
- }
+ vmf.pgoff = pgoff;
+ vmf.flags = flags;
+ vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_FS | __GFP_IO;
if (!write && !buffer_mapped(&bh)) {
spinlock_t *ptl;
- pmd_t entry;
+ pmd_t pmd_entry;
struct page *zero_page = get_huge_zero_page();
if (unlikely(!zero_page)) {
@@ -1032,6 +1075,15 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
goto fallback;
}
+ ret = dax_insert_mapping_entry(mapping, &vmf, entry,
+ 0, RADIX_DAX_PMD, true);
+ if (IS_ERR(ret)) {
+ dax_pmd_dbg(&bh, address,
+ "PMD radix insertion failed");
+ goto fallback;
+ }
+ entry = ret;
+
ptl = pmd_lock(vma->vm_mm, pmd);
if (!pmd_none(*pmd)) {
spin_unlock(ptl);
@@ -1044,9 +1096,9 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
__func__, current->comm, address,
(unsigned long long) to_sector(&bh, inode));
- entry = mk_pmd(zero_page, vma->vm_page_prot);
- entry = pmd_mkhuge(entry);
- set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
+ pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
+ pmd_entry = pmd_mkhuge(pmd_entry);
+ set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
result = VM_FAULT_NOPAGE;
spin_unlock(ptl);
} else {
@@ -1078,27 +1130,14 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
}
dax_unmap_atomic(bdev, &dax);
- /*
- * For PTE faults we insert a radix tree entry for reads, and
- * leave it clean. Then on the first write we dirty the radix
- * tree entry via the dax_pfn_mkwrite() path. This sequence
- * allows the dax_pfn_mkwrite() call to be simpler and avoid a
- * call into get_block() to translate the pgoff to a sector in
- * order to be able to create a new radix tree entry.
- *
- * The PMD path doesn't have an equivalent to
- * dax_pfn_mkwrite(), though, so for a read followed by a
- * write we traverse all the way through dax_pmd_fault()
- * twice. This means we can just skip inserting a radix tree
- * entry completely on the initial read and just wait until
- * the write to insert a dirty entry.
- */
- if (write) {
- /*
- * We should insert radix-tree entry and dirty it here.
- * For now this is broken...
- */
+ ret = dax_insert_mapping_entry(mapping, &vmf, entry,
+ dax.sector, RADIX_DAX_PMD, false);
+ if (IS_ERR(ret)) {
+ dax_pmd_dbg(&bh, address,
+ "PMD radix insertion failed");
+ goto fallback;
}
+ entry = ret;
dev_dbg(part_to_dev(bdev->bd_part),
"%s: %s addr: %lx pfn: %lx sect: %llx\n",
@@ -1109,13 +1148,14 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
dax.pfn, write);
}
- out:
+ unlock_entry:
+ put_locked_mapping_entry(mapping, pgoff, entry);
return result;
fallback:
count_vm_event(THP_FAULT_FALLBACK);
result = VM_FAULT_FALLBACK;
- goto out;
+ goto unlock_entry;
}
EXPORT_SYMBOL_GPL(dax_pmd_fault);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f6cab31..bdf6064 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -6,8 +6,33 @@
#include <linux/radix-tree.h>
#include <asm/pgtable.h>
-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, two bits for
+ * the entry type (PMD & PTE), and two more for flags (HZP and empty). In
+ * total five special bits.
+ */
+#define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 5)
#define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+/* PTE and PMD types */
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+/* huge zero page and empty entry flags */
+#define RADIX_DAX_HZP (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 4))
+
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+
+/* entries begin locked */
+#define RADIX_DAX_ENTRY(sector, type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |\
+ type | (unsigned long)sector << RADIX_DAX_SHIFT | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_HZP_ENTRY() ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+ RADIX_DAX_PMD | RADIX_DAX_HZP | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+#define RADIX_DAX_EMPTY_ENTRY(type) ((void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | \
+ type | RADIX_DAX_EMPTY | RADIX_DAX_ENTRY_LOCK))
+
+#define RADIX_DAX_ORDER(type) (type == RADIX_DAX_PMD ? PMD_SHIFT-PAGE_SHIFT : 0)
ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *,
get_block_t, dio_iodone_t, int flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 35e880d..d9dd97e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -610,9 +610,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
workingset_node_shadows_dec(node);
} else {
/* DAX can replace empty locked entry with a hole */
- WARN_ON_ONCE(p !=
- (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
- RADIX_DAX_ENTRY_LOCK));
+ WARN_ON_ONCE(p != RADIX_DAX_EMPTY_ENTRY(RADIX_DAX_PTE));
/* DAX accounts exceptional entries as normal pages */
if (node)
workingset_node_pages_dec(node);
--
2.9.0
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply related [flat|nested] 42+ messages in thread* [PATCH v2 9/9] dax: remove "depends on BROKEN" from FS_DAX_PMD
2016-08-23 22:04 [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
` (7 preceding siblings ...)
2016-08-23 22:04 ` [PATCH v2 8/9] dax: re-enable DAX PMD support Ross Zwisler
@ 2016-08-23 22:04 ` Ross Zwisler
2016-08-30 23:01 ` [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
9 siblings, 0 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-08-23 22:04 UTC (permalink / raw)
To: linux-kernel
Cc: Theodore Ts'o, Andrew Morton, linux-nvdimm, Matthew Wilcox,
Dave Chinner, linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
linux-fsdevel, linux-ext4
Now that DAX PMD faults are once again working and are now participating in
DAX's radix tree locking scheme, allow their config option to be enabled.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
fs/Kconfig | 1 -
1 file changed, 1 deletion(-)
diff --git a/fs/Kconfig b/fs/Kconfig
index 2bc7ad7..b6f0fce 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -55,7 +55,6 @@ config FS_DAX_PMD
depends on FS_DAX
depends on ZONE_DEVICE
depends on TRANSPARENT_HUGEPAGE
- depends on BROKEN
endif # BLOCK
--
2.9.0
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply related [flat|nested] 42+ messages in thread* Re: [PATCH v2 0/9] re-enable DAX PMD support
2016-08-23 22:04 [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
` (8 preceding siblings ...)
2016-08-23 22:04 ` [PATCH v2 9/9] dax: remove "depends on BROKEN" from FS_DAX_PMD Ross Zwisler
@ 2016-08-30 23:01 ` Ross Zwisler
2016-08-31 20:20 ` Kani, Toshimitsu
9 siblings, 1 reply; 42+ messages in thread
From: Ross Zwisler @ 2016-08-30 23:01 UTC (permalink / raw)
To: Ross Zwisler
Cc: Theodore Ts'o, linux-nvdimm, Matthew Wilcox, Dave Chinner,
linux-kernel, linux-mm, Andreas Dilger, Alexander Viro, Jan Kara,
linux-fsdevel, Andrew Morton, linux-ext4
On Tue, Aug 23, 2016 at 04:04:10PM -0600, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking. This series allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled.
>
> Changes since v1:
> - PMD entry locking is now done based on the starting offset of the PMD
> entry, rather than on the radix tree slot which was unreliable. (Jan)
> - Fixed the one issue I could find with hole punch. As far as I can tell
> hole punch now works correctly for both PMD and PTE DAX entries, 4k zero
> pages and huge zero pages.
> - Fixed the way that ext2 returns the size of holes in ext2_get_block().
> (Jan)
> - Made the 'wait_table' global variable static in respnse to a sparse
> warning.
> - Fixed some more inconsitent usage between the names 'ret' and 'entry'
> for radix tree entry variables.
>
> Ross Zwisler (9):
> ext4: allow DAX writeback for hole punch
> ext2: tell DAX the size of allocation holes
> ext4: tell DAX the size of allocation holes
> dax: remove buffer_size_valid()
> dax: make 'wait_table' global variable static
> dax: consistent variable naming for DAX entries
> dax: coordinate locking for offsets in PMD range
> dax: re-enable DAX PMD support
> dax: remove "depends on BROKEN" from FS_DAX_PMD
>
> fs/Kconfig | 1 -
> fs/dax.c | 297 +++++++++++++++++++++++++++++-----------------------
> fs/ext2/inode.c | 3 +
> fs/ext4/inode.c | 7 +-
> include/linux/dax.h | 29 ++++-
> mm/filemap.c | 6 +-
> 6 files changed, 201 insertions(+), 142 deletions(-)
>
> --
> 2.9.0
Ping on this series? Any objections or comments?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 42+ messages in thread* Re: [PATCH v2 0/9] re-enable DAX PMD support
2016-08-30 23:01 ` [PATCH v2 0/9] re-enable DAX PMD support Ross Zwisler
@ 2016-08-31 20:20 ` Kani, Toshimitsu
2016-08-31 21:36 ` Ross Zwisler
0 siblings, 1 reply; 42+ messages in thread
From: Kani, Toshimitsu @ 2016-08-31 20:20 UTC (permalink / raw)
To: ross.zwisler@linux.intel.com
Cc: tytso@mit.edu, mawilcox@microsoft.com, linux-nvdimm@lists.01.org,
david@fromorbit.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, adilger.kernel@dilger.ca,
viro@zeniv.linux.org.uk, jack@suse.com,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
akpm@linux-foundation.org
On Tue, 2016-08-30 at 17:01 -0600, Ross Zwisler wrote:
> On Tue, Aug 23, 2016 at 04:04:10PM -0600, Ross Zwisler wrote:
> >
> > DAX PMDs have been disabled since Jan Kara introduced DAX radix
> > tree based locking. This series allows DAX PMDs to participate in
> > the DAX radix tree based locking scheme so that they can be re-
> > enabled.
> >
> > Changes since v1:
> > - PMD entry locking is now done based on the starting offset of
> > the PMD entry, rather than on the radix tree slot which was
> > unreliable. (Jan)
> > - Fixed the one issue I could find with hole punch. As far as I
> > can tell hole punch now works correctly for both PMD and PTE DAX
> > entries, 4k zero pages and huge zero pages.
> > - Fixed the way that ext2 returns the size of holes in
> > ext2_get_block(). (Jan)
> > - Made the 'wait_table' global variable static in respnse to a
> > sparse warning.
> > - Fixed some more inconsitent usage between the names 'ret' and
> > 'entry' for radix tree entry variables.
> >
> > Ross Zwisler (9):
> > ext4: allow DAX writeback for hole punch
> > ext2: tell DAX the size of allocation holes
> > ext4: tell DAX the size of allocation holes
> > dax: remove buffer_size_valid()
> > dax: make 'wait_table' global variable static
> > dax: consistent variable naming for DAX entries
> > dax: coordinate locking for offsets in PMD range
> > dax: re-enable DAX PMD support
> > dax: remove "depends on BROKEN" from FS_DAX_PMD
> >
> > fs/Kconfig | 1 -
> > fs/dax.c | 297 +++++++++++++++++++++++++++++-----------
> > ------------
> > fs/ext2/inode.c | 3 +
> > fs/ext4/inode.c | 7 +-
> > include/linux/dax.h | 29 ++++-
> > mm/filemap.c | 6 +-
> > 6 files changed, 201 insertions(+), 142 deletions(-)
> >
> > --
> > 2.9.0
>
> Ping on this series? Any objections or comments?
Hi Ross,
I am seeing a major performance loss in fio mmap test with this patch-
set applied. This happens with or without my patches [1] applied on
top of yours. Without my patches, dax_pmd_fault() falls back to the
pte handler since an mmap'ed address is not 2MB-aligned.
I have attached three test results.
o rc4.log - 4.8.0-rc4 (base)
o non-pmd.log - 4.8.0-rc4 + your patchset (fall back to pte)
o pmd.log - 4.8.0-rc4 + your patchset + my patchset (use pmd maps)
My test steps are as follows.
mkfs.ext4 -O bigalloc -C 2M /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem0
numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio test.fio
"test.fio"
---
[global]
bs=4k
size=2G
directory=/mnt/pmem0
ioengine=mmap
[randrw]
rw=randrw
---
Can you please take a look?
Thanks,
-Toshi
[1] https://lkml.org/lkml/2016/8/29/560
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 0/9] re-enable DAX PMD support
2016-08-31 20:20 ` Kani, Toshimitsu
@ 2016-08-31 21:36 ` Ross Zwisler
2016-08-31 22:08 ` Kani, Toshimitsu
0 siblings, 1 reply; 42+ messages in thread
From: Ross Zwisler @ 2016-08-31 21:36 UTC (permalink / raw)
To: Kani, Toshimitsu
Cc: tytso@mit.edu, akpm@linux-foundation.org, mawilcox@microsoft.com,
linux-nvdimm@lists.01.org, david@fromorbit.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
adilger.kernel@dilger.ca, viro@zeniv.linux.org.uk, jack@suse.com,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
On Wed, Aug 31, 2016 at 08:20:48PM +0000, Kani, Toshimitsu wrote:
> On Tue, 2016-08-30 at 17:01 -0600, Ross Zwisler wrote:
> > On Tue, Aug 23, 2016 at 04:04:10PM -0600, Ross Zwisler wrote:
> > >
> > > DAX PMDs have been disabled since Jan Kara introduced DAX radix
> > > tree based locking. This series allows DAX PMDs to participate in
> > > the DAX radix tree based locking scheme so that they can be re-
> > > enabled.
> > >
> > > Changes since v1:
> > > - PMD entry locking is now done based on the starting offset of
> > > the PMD entry, rather than on the radix tree slot which was
> > > unreliable. (Jan)
> > > - Fixed the one issue I could find with hole punch. As far as I
> > > can tell hole punch now works correctly for both PMD and PTE DAX
> > > entries, 4k zero pages and huge zero pages.
> > > - Fixed the way that ext2 returns the size of holes in
> > > ext2_get_block(). (Jan)
> > > - Made the 'wait_table' global variable static in respnse to a
> > > sparse warning.
> > > - Fixed some more inconsitent usage between the names 'ret' and
> > > 'entry' for radix tree entry variables.
> > >
> > > Ross Zwisler (9):
> > > ext4: allow DAX writeback for hole punch
> > > ext2: tell DAX the size of allocation holes
> > > ext4: tell DAX the size of allocation holes
> > > dax: remove buffer_size_valid()
> > > dax: make 'wait_table' global variable static
> > > dax: consistent variable naming for DAX entries
> > > dax: coordinate locking for offsets in PMD range
> > > dax: re-enable DAX PMD support
> > > dax: remove "depends on BROKEN" from FS_DAX_PMD
> > >
> > > fs/Kconfig | 1 -
> > > fs/dax.c | 297 +++++++++++++++++++++++++++++-----------
> > > ------------
> > > fs/ext2/inode.c | 3 +
> > > fs/ext4/inode.c | 7 +-
> > > include/linux/dax.h | 29 ++++-
> > > mm/filemap.c | 6 +-
> > > 6 files changed, 201 insertions(+), 142 deletions(-)
> > >
> > > --
> > > 2.9.0
> >
> > Ping on this series? Any objections or comments?
>
> Hi Ross,
>
> I am seeing a major performance loss in fio mmap test with this patch-
> set applied. This happens with or without my patches [1] applied on
> top of yours. Without my patches, dax_pmd_fault() falls back to the
> pte handler since an mmap'ed address is not 2MB-aligned.
>
> I have attached three test results.
> o rc4.log - 4.8.0-rc4 (base)
> o non-pmd.log - 4.8.0-rc4 + your patchset (fall back to pte)
> o pmd.log - 4.8.0-rc4 + your patchset + my patchset (use pmd maps)
>
> My test steps are as follows.
>
> mkfs.ext4 -O bigalloc -C 2M /dev/pmem0
> mount -o dax /dev/pmem0 /mnt/pmem0
> numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio test.fio
>
> "test.fio"
> ---
> [global]
> bs=4k
> size=2G
> directory=/mnt/pmem0
> ioengine=mmap
> [randrw]
> rw=randrw
> ---
>
> Can you please take a look?
Yep, thanks for the report.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 0/9] re-enable DAX PMD support
2016-08-31 21:36 ` Ross Zwisler
@ 2016-08-31 22:08 ` Kani, Toshimitsu
2016-09-01 16:21 ` Ross Zwisler
0 siblings, 1 reply; 42+ messages in thread
From: Kani, Toshimitsu @ 2016-08-31 22:08 UTC (permalink / raw)
To: ross.zwisler@linux.intel.com
Cc: tytso@mit.edu, mawilcox@microsoft.com, linux-nvdimm@lists.01.org,
david@fromorbit.com, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, adilger.kernel@dilger.ca,
viro@zeniv.linux.org.uk, jack@suse.com,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
akpm@linux-foundation.org
On Wed, 2016-08-31 at 15:36 -0600, Ross Zwisler wrote:
> On Wed, Aug 31, 2016 at 08:20:48PM +0000, Kani, Toshimitsu wrote:
> >
> > On Tue, 2016-08-30 at 17:01 -0600, Ross Zwisler wrote:
> > >
> > > On Tue, Aug 23, 2016 at 04:04:10PM -0600, Ross Zwisler wrote:
:
> > >
> > > Ping on this series? Any objections or comments?
> >
> > Hi Ross,
> >
> > I am seeing a major performance loss in fio mmap test with this
> > patch-set applied. This happens with or without my patches [1]
> > applied on top of yours. Without my patches, dax_pmd_fault() falls
> > back to the pte handler since an mmap'ed address is not 2MB-
> > aligned.
> >
> > I have attached three test results.
> > o rc4.log - 4.8.0-rc4 (base)
> > o non-pmd.log - 4.8.0-rc4 + your patchset (fall back to pte)
> > o pmd.log - 4.8.0-rc4 + your patchset + my patchset (use pmd maps)
> >
> > My test steps are as follows.
> >
> > mkfs.ext4 -O bigalloc -C 2M /dev/pmem0
> > mount -o dax /dev/pmem0 /mnt/pmem0
> > numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio
> > test.fio
> >
> > "test.fio"
> > ---
> > [global]
> > bs=4k
> > size=2G
> > directory=/mnt/pmem0
> > ioengine=mmap
> > [randrw]
> > rw=randrw
> > ---
> >
> > Can you please take a look?
>
> Yep, thanks for the report.
I have some more observations. It seems this issue is related with pmd
mappings after all. fio creates "randrw.0.0" file. In my setup, an
initial test run creates pmd mappings and hits this issue. Subsequent
test runs (i.e. randrw.0.0 exists), without my patches, fall back to
pte mappings and do not hit this issue. With my patches applied,
subsequent runs still create pmd mappings and hit this issue.
Thanks,
-Toshi
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 42+ messages in thread
* Re: [PATCH v2 0/9] re-enable DAX PMD support
2016-08-31 22:08 ` Kani, Toshimitsu
@ 2016-09-01 16:21 ` Ross Zwisler
0 siblings, 0 replies; 42+ messages in thread
From: Ross Zwisler @ 2016-09-01 16:21 UTC (permalink / raw)
To: Kani, Toshimitsu
Cc: tytso@mit.edu, akpm@linux-foundation.org, mawilcox@microsoft.com,
linux-nvdimm@lists.01.org, david@fromorbit.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
adilger.kernel@dilger.ca, viro@zeniv.linux.org.uk, jack@suse.com,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org
On Wed, Aug 31, 2016 at 10:08:59PM +0000, Kani, Toshimitsu wrote:
> On Wed, 2016-08-31 at 15:36 -0600, Ross Zwisler wrote:
> > On Wed, Aug 31, 2016 at 08:20:48PM +0000, Kani, Toshimitsu wrote:
> > >
> > > On Tue, 2016-08-30 at 17:01 -0600, Ross Zwisler wrote:
> > > >
> > > > On Tue, Aug 23, 2016 at 04:04:10PM -0600, Ross Zwisler wrote:
> :
> > > >
> > > > Ping on this series? Any objections or comments?
> > >
> > > Hi Ross,
> > >
> > > I am seeing a major performance loss in fio mmap test with this
> > > patch-set applied. This happens with or without my patches [1]
> > > applied on top of yours. Without my patches, dax_pmd_fault() falls
> > > back to the pte handler since an mmap'ed address is not 2MB-
> > > aligned.
> > >
> > > I have attached three test results.
> > > o rc4.log - 4.8.0-rc4 (base)
> > > o non-pmd.log - 4.8.0-rc4 + your patchset (fall back to pte)
> > > o pmd.log - 4.8.0-rc4 + your patchset + my patchset (use pmd maps)
> > >
> > > My test steps are as follows.
> > >
> > > mkfs.ext4 -O bigalloc -C 2M /dev/pmem0
> > > mount -o dax /dev/pmem0 /mnt/pmem0
> > > numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio
> > > test.fio
> > >
> > > "test.fio"
> > > ---
> > > [global]
> > > bs=4k
> > > size=2G
> > > directory=/mnt/pmem0
> > > ioengine=mmap
> > > [randrw]
> > > rw=randrw
> > > ---
> > >
> > > Can you please take a look?
> >
> > Yep, thanks for the report.
>
> I have some more observations. It seems this issue is related with pmd
> mappings after all. fio creates "randrw.0.0" file. In my setup, an
> initial test run creates pmd mappings and hits this issue. Subsequent
> test runs (i.e. randrw.0.0 exists), without my patches, fall back to
> pte mappings and do not hit this issue. With my patches applied,
> subsequent runs still create pmd mappings and hit this issue.
I've been able to reproduce this on my test setup, and I agree that it appears
to be related to the PMD mappings. Here's my performance with 4k mappings,
either before my set or without your patches:
READ: io=1022.7MB, aggrb=590299KB/s, minb=590299KB/s, maxb=590299KB/s, mint=1774msec, maxt=1774msec
WRITE: io=1025.4MB, aggrb=591860KB/s, minb=591860KB/s, maxb=591860KB/s, mint=1774msec, maxt=1774msec
And with 2 MiB pages:
READ: io=1022.7MB, aggrb=17931KB/s, minb=17931KB/s, maxb=17931KB/s, mint=58401msec, maxt=58401msec
WRITE: io=1025.4MB, aggrb=17978KB/s, minb=17978KB/s, maxb=17978KB/s, mint=58401msec, maxt=58401msec
Dan is seeing something similar with his device DAX code with 2MiB pages, so
our best guess right now is that it must be in the PMD MM code, since that's
really the only thing that the fs/dax and device/dax implementations share.
Interestingly, I'm getting the opposite results when testing in my VM. Here's
the performance with 4k pages:
READ: io=1022.7MB, aggrb=251728KB/s, minb=251728KB/s, maxb=251728KB/s, mint=4160msec, maxt=4160msec
WRITE: io=1025.4MB, aggrb=252394KB/s, minb=252394KB/s, maxb=252394KB/s, mint=4160msec, maxt=4160msec
And with 2MiB pages:
READ: io=1022.7MB, aggrb=902751KB/s, minb=902751KB/s, maxb=902751KB/s, mint=1160msec, maxt=1160msec
WRITE: io=1025.4MB, aggrb=905137KB/s, minb=905137KB/s, maxb=905137KB/s, mint=1160msec, maxt=1160msec
This is a totally different system, so the halved 4k performance in the VM
isn't comparable to my bare metal system, but it's interesting that the use of
PMDs over tripled the performance in my VM. Hmm...
We'll keep digging into this. Thanks again for the report. :)
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
^ permalink raw reply [flat|nested] 42+ messages in thread