* [RFC 0/1] Optimize ext4 DAX overwrites
@ 2020-08-20 11:36 Ritesh Harjani
2020-08-20 11:36 ` [RFC 1/1] ext4: " Ritesh Harjani
0 siblings, 1 reply; 4+ messages in thread
From: Ritesh Harjani @ 2020-08-20 11:36 UTC (permalink / raw)
To: linux-ext4; +Cc: jack, tytso, linux-fsdevel, linux-kernel, Ritesh Harjani
In case of DAX writes, we currently start a journal txn irrespective of whether
it's an overwrite or not. In case of an overwrite we don't need to start a
jbd2 txn, since the blocks are already allocated.
So this patch optimizes away the txn start in case of DAX overwrites.
This could significantly boost performance for multi-threaded random writes
(overwrites). The fio script used to collect the perf numbers is included below.
The numbers below were collected on a QEMU setup on a ppc64 box with a simulated
pmem device.
Performance numbers with different threads - (~10x improvement)
==========================================
vanilla_kernel (kIOPS)
[Bar chart: kIOPS (y-axis, 25 to 60) vs. Threads (x-axis: 1, 2, 4, 8, 12, 16)]
patched_kernel (kIOPS)
[Bar chart: kIOPS (y-axis, 0 to 600) vs. Threads (x-axis: 1, 2, 4, 8, 12, 16)]
fio script
==========
[global]
rw=randwrite
norandommap=1
invalidate=0
bs=4k
; numjobs was changed for the different thread counts
numjobs=16
time_based=1
ramp_time=30
runtime=60
group_reporting=1
ioengine=psync
direct=1
size=16G
filename=file1.0.0:file1.0.1:file1.0.2:file1.0.3:file1.0.4:file1.0.5:file1.0.6:file1.0.7:file1.0.8:file1.0.9:file1.0.10:file1.0.11:file1.0.12:file1.0.13:file1.0.14:file1.0.15:file1.0.16:file1.0.17:file1.0.18:file1.0.19:file1.0.20:file1.0.21:file1.0.22:file1.0.23:file1.0.24:file1.0.25:file1.0.26:file1.0.27:file1.0.28:file1.0.29:file1.0.30:file1.0.31
file_service_type=random
nrfiles=32
directory=/mnt/
[name]
directory=/mnt/
direct=1
NOTE:
======
1. Looking at the ~10x perf delta, I probed a bit deeper to understand what is
causing this scalability problem. It seems that when we start a jbd2 txn, the
slab alloc code sees serious contention on a spinlock.
The spinlock contention could be related to some other issue (looking into it
internally), but I could still see a perf improvement of close to 2x on a QEMU
setup on x86 with a simulated pmem device, comparing patched_kernel against
vanilla_kernel with the same fio workload.
perf report from vanilla_kernel (this is not seen with patched kernel) (ppc64)
=======================================================================
47.86% fio [kernel.vmlinux] [k] do_raw_spin_lock
|
---do_raw_spin_lock
|
|--19.43%--_raw_spin_lock
| |
| --19.31%--0
| |
| |--9.77%--deactivate_slab.isra.61
| | ___slab_alloc
| | __slab_alloc
| | kmem_cache_alloc
| | jbd2__journal_start
| | __ext4_journal_start_sb
<...>
2. Kept this as an RFC, since using ext4_iomap_overwrite_ops may be a better
fit here. We could check for overwrite in ext4_dax_write_iter(), like we do
for DIO writes. Thoughts?
3. This problem was reported by Dan Williams at [1]
Links
======
[1]: https://lore.kernel.org/linux-ext4/20190802144304.GP25064@quack2.suse.cz/T/
Ritesh Harjani (1):
ext4: Optimize ext4 DAX overwrites
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 2 +-
fs/ext4/inode.c | 8 +++++++-
3 files changed, 9 insertions(+), 2 deletions(-)
--
2.25.4
* [RFC 1/1] ext4: Optimize ext4 DAX overwrites
2020-08-20 11:36 [RFC 0/1] Optimize ext4 DAX overwrites Ritesh Harjani
@ 2020-08-20 11:36 ` Ritesh Harjani
2020-08-20 12:53 ` Jan Kara
0 siblings, 1 reply; 4+ messages in thread
From: Ritesh Harjani @ 2020-08-20 11:36 UTC (permalink / raw)
To: linux-ext4
Cc: jack, tytso, linux-fsdevel, linux-kernel, Ritesh Harjani, Dan Williams

Currently in case of DAX, we are starting a transaction
everytime for IOMAP_WRITE case. This can be optimized
away in case of an overwrite (where the blocks were already
allocated). This could give a significant performance boost
for multi-threaded random writes.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/ext4/ext4.h  | 1 +
 fs/ext4/file.c  | 2 +-
 fs/ext4/inode.c | 8 +++++++-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 42f5060f3cdf..9a2138afc751 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3232,6 +3232,7 @@ extern const struct dentry_operations ext4_dentry_ops;
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
+extern bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len);
 
 /* inline.c */
 extern int ext4_get_max_inline_size(struct inode *inode);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a01e31a032c..51cd92ac1758 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -188,7 +188,7 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
 }
 
 /* Is IO overwriting allocated and initialized blocks? */
-static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
+bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
 {
 	struct ext4_map_blocks map;
 	unsigned int blkbits = inode->i_blkbits;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10dd470876b3..f0ac0ee9e991 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3423,6 +3423,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	int ret;
 	struct ext4_map_blocks map;
 	u8 blkbits = inode->i_blkbits;
+	bool overwrite = false;
 
 	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
 		return -EINVAL;
@@ -3430,6 +3431,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
 		return -ERANGE;
 
+	if (IS_DAX(inode) && (flags & IOMAP_WRITE) &&
+	    ext4_overwrite_io(inode, offset, length))
+		overwrite = true;
 	/*
 	 * Calculate the first and last logical blocks respectively.
 	 */
@@ -3437,13 +3441,15 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
 			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
 
-	if (flags & IOMAP_WRITE)
+	if ((flags & IOMAP_WRITE) && !overwrite)
 		ret = ext4_iomap_alloc(inode, &map, flags);
 	else
 		ret = ext4_map_blocks(NULL, inode, &map, 0);
 
 	if (ret < 0)
 		return ret;
+	if (IS_DAX(inode) && overwrite)
+		WARN_ON(!(map.m_flags & EXT4_MAP_MAPPED));
 
 	ext4_set_iomap(inode, iomap, &map, offset, length);
 
-- 
2.25.4
* Re: [RFC 1/1] ext4: Optimize ext4 DAX overwrites
2020-08-20 11:36 ` [RFC 1/1] ext4: " Ritesh Harjani
@ 2020-08-20 12:53 ` Jan Kara
2020-08-20 13:09 ` Ritesh Harjani
0 siblings, 1 reply; 4+ messages in thread
From: Jan Kara @ 2020-08-20 12:53 UTC (permalink / raw)
To: Ritesh Harjani
Cc: linux-ext4, jack, tytso, linux-fsdevel, linux-kernel, Dan Williams

On Thu 20-08-20 17:06:28, Ritesh Harjani wrote:
> Currently in case of DAX, we are starting a transaction
> everytime for IOMAP_WRITE case. This can be optimized
> away in case of an overwrite (where the blocks were already
> allocated). This could give a significant performance boost
> for multi-threaded random writes.
>
> Reported-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>

Thanks for returning to this and I'm glad to see how much this helped :)
BTW, I'd suspect there could be also significant contention and cache line
bouncing on j_state_lock and transaction's atomic counters...

> ---
>  fs/ext4/ext4.h  | 1 +
>  fs/ext4/file.c  | 2 +-
>  fs/ext4/inode.c | 8 +++++++-
>  3 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 42f5060f3cdf..9a2138afc751 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3232,6 +3232,7 @@ extern const struct dentry_operations ext4_dentry_ops;
>  extern const struct inode_operations ext4_file_inode_operations;
>  extern const struct file_operations ext4_file_operations;
>  extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
> +extern bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len);
>
>  /* inline.c */
>  extern int ext4_get_max_inline_size(struct inode *inode);
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2a01e31a032c..51cd92ac1758 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -188,7 +188,7 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
>  }
>
>  /* Is IO overwriting allocated and initialized blocks? */
> -static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
> +bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
>  {
>  	struct ext4_map_blocks map;
>  	unsigned int blkbits = inode->i_blkbits;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 10dd470876b3..f0ac0ee9e991 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3423,6 +3423,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  	int ret;
>  	struct ext4_map_blocks map;
>  	u8 blkbits = inode->i_blkbits;
> +	bool overwrite = false;
>
>  	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>  		return -EINVAL;
> @@ -3430,6 +3431,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
>  		return -ERANGE;
>
> +	if (IS_DAX(inode) && (flags & IOMAP_WRITE) &&
> +	    ext4_overwrite_io(inode, offset, length))
> +		overwrite = true;

So the patch looks correct but using ext4_overwrite_io() seems a bit
foolish since under the hood it does ext4_map_blocks() only to be able to
decide whether to call ext4_map_blocks() once again with exactly the same
arguments :). So I'd rather slightly refactor the code in
ext4_iomap_begin() to avoid this double calling of ext4_map_blocks() for
the fast path.

								Honza

>  	/*
>  	 * Calculate the first and last logical blocks respectively.
>  	 */
> @@ -3437,13 +3441,15 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>  			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>
> -	if (flags & IOMAP_WRITE)
> +	if ((flags & IOMAP_WRITE) && !overwrite)
>  		ret = ext4_iomap_alloc(inode, &map, flags);
>  	else
>  		ret = ext4_map_blocks(NULL, inode, &map, 0);
>
>  	if (ret < 0)
>  		return ret;
> +	if (IS_DAX(inode) && overwrite)
> +		WARN_ON(!(map.m_flags & EXT4_MAP_MAPPED));
>
>  	ext4_set_iomap(inode, iomap, &map, offset, length);
>
> --
> 2.25.4
>
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [RFC 1/1] ext4: Optimize ext4 DAX overwrites
2020-08-20 12:53 ` Jan Kara
@ 2020-08-20 13:09 ` Ritesh Harjani
0 siblings, 0 replies; 4+ messages in thread
From: Ritesh Harjani @ 2020-08-20 13:09 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, tytso, linux-fsdevel, linux-kernel, Dan Williams

On 8/20/20 6:23 PM, Jan Kara wrote:
> On Thu 20-08-20 17:06:28, Ritesh Harjani wrote:
>> Currently in case of DAX, we are starting a transaction
>> everytime for IOMAP_WRITE case. This can be optimized
>> away in case of an overwrite (where the blocks were already
>> allocated). This could give a significant performance boost
>> for multi-threaded random writes.
>>
>> Reported-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>
> Thanks for returning to this and I'm glad to see how much this helped :)
> BTW, I'd suspect there could be also significant contention and cache line
> bouncing on j_state_lock and transaction's atomic counters...

ok, will try and profile to see if this happens.

>
>> ---
>>  fs/ext4/ext4.h  | 1 +
>>  fs/ext4/file.c  | 2 +-
>>  fs/ext4/inode.c | 8 +++++++-
>>  3 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 42f5060f3cdf..9a2138afc751 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -3232,6 +3232,7 @@ extern const struct dentry_operations ext4_dentry_ops;
>>  extern const struct inode_operations ext4_file_inode_operations;
>>  extern const struct file_operations ext4_file_operations;
>>  extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
>> +extern bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len);
>>
>>  /* inline.c */
>>  extern int ext4_get_max_inline_size(struct inode *inode);
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index 2a01e31a032c..51cd92ac1758 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -188,7 +188,7 @@ ext4_extending_io(struct inode *inode, loff_t offset, size_t len)
>>  }
>>
>>  /* Is IO overwriting allocated and initialized blocks? */
>> -static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
>> +bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
>>  {
>>  	struct ext4_map_blocks map;
>>  	unsigned int blkbits = inode->i_blkbits;
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 10dd470876b3..f0ac0ee9e991 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3423,6 +3423,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  	int ret;
>>  	struct ext4_map_blocks map;
>>  	u8 blkbits = inode->i_blkbits;
>> +	bool overwrite = false;
>>
>>  	if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
>>  		return -EINVAL;
>> @@ -3430,6 +3431,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
>>  		return -ERANGE;
>>
>> +	if (IS_DAX(inode) && (flags & IOMAP_WRITE) &&
>> +	    ext4_overwrite_io(inode, offset, length))
>> +		overwrite = true;
>
> So the patch looks correct but using ext4_overwrite_io() seems a bit
> foolish since under the hood it does ext4_map_blocks() only to be able to
> decide whether to call ext4_map_blocks() once again with exactly the same
> arguments :). So I'd rather slightly refactor the code in
> ext4_iomap_begin() to avoid this double calling of ext4_map_blocks() for
> the fast path.

Yes, agreed. Looking at the numbers I was excited to post out the RFC for
discussion. Will make above changes and post. :)

With DIO, we need to detect overwrite case early in ext4_dio_write_iter()
to determine whether we need shared or excl. locks - so probably for DIO
case we still need overwrite check in ext4_dio_write_iter()

Thanks for review!!
-ritesh

>
> 								Honza
>
>>  	/*
>>  	 * Calculate the first and last logical blocks respectively.
>>  	 */
>> @@ -3437,13 +3441,15 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>>  	map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
>>  			  EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>>
>> -	if (flags & IOMAP_WRITE)
>> +	if ((flags & IOMAP_WRITE) && !overwrite)
>>  		ret = ext4_iomap_alloc(inode, &map, flags);
>>  	else
>>  		ret = ext4_map_blocks(NULL, inode, &map, 0);
>>
>>  	if (ret < 0)
>>  		return ret;
>> +	if (IS_DAX(inode) && overwrite)
>> +		WARN_ON(!(map.m_flags & EXT4_MAP_MAPPED));
>>
>>  	ext4_set_iomap(inode, iomap, &map, offset, length);
>>
>> --
>> 2.25.4
>>