* [PATCH 0/3] ext4/quota: journalled quota optimizations
@ 2010-03-27 12:15 Dmitry Monakhov
2010-03-27 12:15 ` [PATCH 1/3] quota: optimize mark_dirty logic Dmitry Monakhov
0 siblings, 1 reply; 12+ messages in thread
From: Dmitry Monakhov @ 2010-03-27 12:15 UTC (permalink / raw)
To: linux-ext4; +Cc: linux-fsdevel, jack, Dmitry Monakhov
This patch set aims to get rid of the most annoying locks, which in
the journalled quota case are dqio_mutex and i_mutex.
The patches are simple and clean. They make the difference
between non-journalled and journalled quota modes less significant,
and allow us to switch to journalled quota by default without a
performance penalty.
There are many other optimizations possible, but they are
more intrusive and require wider discussion.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/3] quota: optimize mark_dirty logic
2010-03-27 12:15 [PATCH 0/3] ext4/quota: journalled quota optimizations Dmitry Monakhov
@ 2010-03-27 12:15 ` Dmitry Monakhov
2010-03-27 12:15 ` [PATCH 2/3] ext4: journalled quota optimization Dmitry Monakhov
2010-03-31 15:06 ` [PATCH 1/3] quota: optimize mark_dirty logic Jan Kara
0 siblings, 2 replies; 12+ messages in thread
From: Dmitry Monakhov @ 2010-03-27 12:15 UTC (permalink / raw)
To: linux-ext4; +Cc: linux-fsdevel, jack, Dmitry Monakhov
- Skip locking if the quota is already dirty.
- Return the old quota state to help fs-specific implementations optimize
the case where the quota was already dirty.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/quota/dquot.c | 13 +++++++++++--
1 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index e0b870f..88a8ef2 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -319,14 +319,23 @@ static inline int mark_dquot_dirty(struct dquot *dquot)
return dquot->dq_sb->dq_op->mark_dirty(dquot);
}
+/* Mark dquot dirty atomically and return its old dirty flag state */
int dquot_mark_dquot_dirty(struct dquot *dquot)
{
+ int ret = 1;
+
+ /* If quota is already dirty we don't have to acquire dq_list_lock */
+ if (test_bit(DQ_MOD_B, &dquot->dq_flags))
+ return 1;
+
spin_lock(&dq_list_lock);
- if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags))
+ if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags)) {
list_add(&dquot->dq_dirty, &sb_dqopt(dquot->dq_sb)->
info[dquot->dq_type].dqi_dirty_list);
+ ret = 0;
+ }
spin_unlock(&dq_list_lock);
- return 0;
+ return ret;
}
EXPORT_SYMBOL(dquot_mark_dquot_dirty);
--
1.6.6.1
* [PATCH 2/3] ext4: journalled quota optimization
2010-03-27 12:15 ` [PATCH 1/3] quota: optimize mark_dirty logic Dmitry Monakhov
@ 2010-03-27 12:15 ` Dmitry Monakhov
2010-03-27 12:15 ` [PATCH 3/3] ext4: optimize quota write locking Dmitry Monakhov
2010-04-01 9:12 ` [PATCH 2/3] ext4: journalled quota optimization Jan Kara
2010-03-31 15:06 ` [PATCH 1/3] quota: optimize mark_dirty logic Jan Kara
1 sibling, 2 replies; 12+ messages in thread
From: Dmitry Monakhov @ 2010-03-27 12:15 UTC (permalink / raw)
To: linux-ext4; +Cc: linux-fsdevel, jack, Dmitry Monakhov
Currently each quota modification results in write_dquot()
and later dquot_commit(). This means that each quota modification
function must wait for dqio_mutex, which is a *huge* performance
penalty on big SMP systems. AFAIU, the core idea of this implementation
is to guarantee that each quota modification will be written to the
journal atomically. But in fact this is not always true, because a dquot
may be changed after one modification, but before it is committed to disk.
| Task 1 | Task 2 |
| alloc_space(nr) | |
| ->spin_lock(dq_data_lock) | |
| ->curspace += nr | |
| ->spin_unlock(dq_data_lock) | |
| ->mark_dirty() | free_space(nr) |
| -->write_dquot() | ->spin_lock(dq_data_lock) |
| --->dquot_commit() | ->curspace -= nr |
| ---->commit_dqblk() | ->spin_unlock(dq_data_lock) |
| ----->spin_lock(dq_data_lock) | |
| ----->mem2disk_dqblk(ddq, dquot) | <<< Copy updated value |
| ----->spin_unlock(dq_data_lock) | |
| ----->quota_write() | |
Quota corruption does not happen only because the quota modification
caller has already started a transaction, and ext3/4 allow only one
running transaction at a time. Let's exploit this fact and avoid
writing the quota each time. Act similarly to dirty_for_io in the
generic writeback path of the page cache. If we find that another task
has already started copying and writing the dquot, then we skip the
actual quota_write stage and let that task do the job.
This patch fixes only the contention on dqio_mutex.
Side effect: a task which skips the real quota_write() will not get an
error (if any). But this is not a big deal because:
1) Any error has global consequences (RO remount, error printing, etc).
2) Real IO is deferred till the journal commit.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/super.c | 37 ++++++++++++++++++++++++++++++-------
1 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 29c6875..b7b5707 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3874,14 +3874,37 @@ static int ext4_release_dquot(struct dquot *dquot)
static int ext4_mark_dquot_dirty(struct dquot *dquot)
{
- /* Are we journaling quotas? */
- if (EXT4_SB(dquot->dq_sb)->s_qf_names[USRQUOTA] ||
- EXT4_SB(dquot->dq_sb)->s_qf_names[GRPQUOTA]) {
- dquot_mark_dquot_dirty(dquot);
- return ext4_write_dquot(dquot);
- } else {
+ int ret, err;
+ handle_t *handle;
+ struct inode *inode;
+
+ /* Are we not journaling quotas? */
+ if (!EXT4_SB(dquot->dq_sb)->s_qf_names[USRQUOTA] &&
+ !EXT4_SB(dquot->dq_sb)->s_qf_names[GRPQUOTA])
return dquot_mark_dquot_dirty(dquot);
- }
+
+ /* journaling quotas case */
+ inode = dquot_to_inode(dquot);
+ handle = ext4_journal_start(inode,
+ EXT4_QUOTA_TRANS_BLOCKS(dquot->dq_sb));
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ if (!dquot_mark_dquot_dirty(dquot))
+ ret = dquot_commit(dquot);
+ else
+ /*
+ * Dquot was already dirty. This means that another task has already
+ * started a transaction but has not cleared the dirty bit yet (see
+ * dquot_commit). Since only one running transaction is possible at a
+ * time, that task belongs to the same transaction, and we don't have
+ * to actually write the dquot changes because that task will write
+ * them for us.
+ */
+ ret = 0;
+ err = ext4_journal_stop(handle);
+ if (!ret)
+ ret = err;
+ return ret;
}
static int ext4_write_info(struct super_block *sb, int type)
--
1.6.6.1
* [PATCH 3/3] ext4: optimize quota write locking
2010-03-27 12:15 ` [PATCH 2/3] ext4: journalled quota optimization Dmitry Monakhov
@ 2010-03-27 12:15 ` Dmitry Monakhov
2010-04-01 9:14 ` Jan Kara
2010-04-01 9:12 ` [PATCH 2/3] ext4: journalled quota optimization Jan Kara
1 sibling, 1 reply; 12+ messages in thread
From: Dmitry Monakhov @ 2010-03-27 12:15 UTC (permalink / raw)
To: linux-ext4; +Cc: linux-fsdevel, jack, Dmitry Monakhov
The quota inode is append only. In most cases writes are performed
to already allocated blocks before i_size. Currently we grab i_mutex
for each quota write to protect i_size and i_mtime, but in most cases
i_size does not change, and i_mtime is not as important as for regular
files. So let's update them only when the quota file grows. This
allows us to avoid i_mutex in most cases.
Some day the quota file will be hidden completely, so the i_mtime hack
will disappear.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fs/ext4/super.c | 22 +++++++++++++++-------
1 files changed, 15 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b7b5707..5a378e3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4039,6 +4039,7 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
ext4_lblk_t blk = off >> EXT4_BLOCK_SIZE_BITS(sb);
int err = 0;
int offset = off & (sb->s_blocksize - 1);
+ int is_locked = 0;
int journal_quota = EXT4_SB(sb)->s_qf_names[type] != NULL;
struct buffer_head *bh;
handle_t *handle = journal_current_handle();
@@ -4059,8 +4060,14 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
(unsigned long long)off, (unsigned long long)len);
return -EIO;
}
-
- mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
+ /*
+ * The quota file is append only. It is not necessary to hold i_mutex
+ * if the quota block is already allocated. Optimize that case.
+ */
+ if (off + len > i_size_read(inode)) {
+ mutex_lock_nested(&inode->i_mutex, I_MUTEX_QUOTA);
+ is_locked = 1;
+ }
bh = ext4_bread(handle, inode, blk, 1, &err);
if (!bh)
goto out;
@@ -4085,16 +4092,17 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
brelse(bh);
out:
if (err) {
- mutex_unlock(&inode->i_mutex);
+ if (is_locked)
+ mutex_unlock(&inode->i_mutex);
return err;
}
- if (inode->i_size < off + len) {
+ if (is_locked) {
i_size_write(inode, off + len);
EXT4_I(inode)->i_disksize = inode->i_size;
+ inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+ ext4_mark_inode_dirty(handle, inode);
+ mutex_unlock(&inode->i_mutex);
}
- inode->i_mtime = inode->i_ctime = CURRENT_TIME;
- ext4_mark_inode_dirty(handle, inode);
- mutex_unlock(&inode->i_mutex);
return len;
}
--
1.6.6.1
* Re: [PATCH 1/3] quota: optimize mark_dirty logic
2010-03-27 12:15 ` [PATCH 1/3] quota: optimize mark_dirty logic Dmitry Monakhov
2010-03-27 12:15 ` [PATCH 2/3] ext4: journalled quota optimization Dmitry Monakhov
@ 2010-03-31 15:06 ` Jan Kara
1 sibling, 0 replies; 12+ messages in thread
From: Jan Kara @ 2010-03-31 15:06 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, linux-fsdevel, jack
Hi,
On Sat 27-03-10 15:15:38, Dmitry Monakhov wrote:
> - Skip locking if the quota is already dirty.
> - Return the old quota state to help fs-specific implementations optimize
> the case where the quota was already dirty.
>
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/quota/dquot.c | 13 +++++++++++--
> 1 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index e0b870f..88a8ef2 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -319,14 +319,23 @@ static inline int mark_dquot_dirty(struct dquot *dquot)
> return dquot->dq_sb->dq_op->mark_dirty(dquot);
> }
>
> +/* Mark dquot dirty atomically and return its old dirty flag state */
> int dquot_mark_dquot_dirty(struct dquot *dquot)
> {
> + int ret = 1;
> +
> + /* If quota is already dirty we don't have to acquire dq_list_lock */
> + if (test_bit(DQ_MOD_B, &dquot->dq_flags))
> + return 1;
> +
> spin_lock(&dq_list_lock);
> - if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags))
> + if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags)) {
> list_add(&dquot->dq_dirty, &sb_dqopt(dquot->dq_sb)->
> info[dquot->dq_type].dqi_dirty_list);
> + ret = 0;
> + }
> spin_unlock(&dq_list_lock);
> - return 0;
> + return ret;
> }
> EXPORT_SYMBOL(dquot_mark_dquot_dirty);
OK, this looks fine. Merged.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 2/3] ext4: journalled quota optimization
2010-03-27 12:15 ` [PATCH 2/3] ext4: journalled quota optimization Dmitry Monakhov
2010-03-27 12:15 ` [PATCH 3/3] ext4: optimize quota write locking Dmitry Monakhov
@ 2010-04-01 9:12 ` Jan Kara
2010-04-05 3:30 ` dmonakhov
1 sibling, 1 reply; 12+ messages in thread
From: Jan Kara @ 2010-04-01 9:12 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, linux-fsdevel, jack
Hi,
On Sat 27-03-10 15:15:39, Dmitry Monakhov wrote:
> Currently each quota modification results in write_dquot()
> and later dquot_commit(). This means that each quota modification
> function must wait for dqio_mutex, which is a *huge* performance
> penalty on big SMP systems. AFAIU, the core idea of this implementation
> is to guarantee that each quota modification will be written to the
> journal atomically. But in fact this is not always true, because a dquot
> may be changed after one modification, but before it is committed to disk.
We already discussed this when you last submitted the patch.
dqio_mutex has nothing to do with journaling. It is there so that two
writes to the quota file cannot happen in parallel, because that could
cause corruption. Without dqio_mutex, the following would be possible:
Task 1 Task 2
...
qtree_write_dquot()
...
info->dqi_ops->mem2disk_dqblk
modify dquot
mark_dquot_dirty
...
qtree_write_dquot()
- writes newer information
ret = sb->s_op->quota_write
- overwrites the new information
with an old one.
> | Task 1 | Task 2 |
> | alloc_space(nr) | |
> | ->spin_lock(dq_data_lock) | |
> | ->curspace += nr | |
> | ->spin_unlock(dq_data_lock) | |
> | ->mark_dirty() | free_space(nr) |
> | -->write_dquot() | ->spin_lock(dq_data_lock) |
> | --->dquot_commit() | ->curspace -= nr |
> | ---->commit_dqblk() | ->spin_unlock(dq_data_lock) |
> | ----->spin_lock(dq_data_lock) | |
> | ----->mem2disk_dqblk(ddq, dquot) | <<< Copy updated value |
> | ----->spin_unlock(dq_data_lock) | |
> | ----->quota_write() | |
> Quota corruption does not happen only because the quota modification
> caller has already started a transaction, and ext3/4 allow only one
> running transaction at a time.
That said, what you write above happens for other ext3/4 metadata as well.
> Let's exploit this fact and avoid writing the quota each time.
> Act similarly to dirty_for_io in the generic writeback path of the
> page cache. If we find that another task has already started copying
> and writing the dquot, then we skip the actual quota_write stage and
> let that task do the job. This patch fixes only the contention on
> dqio_mutex.
As I wrote last time, your patch helps the contention only a tiny bit -
we clear the dirty bit as the first thing after we acquire dqio_mutex. So
your patch helps only if one task happens to come while another task is
between
dquot_mark_dquot_dirty <---------- here
mutex_lock(&dqopt->dqio_mutex);
spin_lock(&dq_list_lock);
if (!clear_dquot_dirty(dquot)) { <----- and here
So I find this complicates the code without much merit. And if I
remember right, last time I also suggested that it might be much more
useful to optimize how the quota structure is written - i.e., get a
reference to a buffer in ext4_acquire_dquot and thus save the
ext4_bread in ext4_quota_write.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 3/3] ext4: optimize quota write locking
2010-03-27 12:15 ` [PATCH 3/3] ext4: optimize quota write locking Dmitry Monakhov
@ 2010-04-01 9:14 ` Jan Kara
2010-04-05 3:34 ` dmonakhov
0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2010-04-01 9:14 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: linux-ext4, linux-fsdevel, jack
On Sat 27-03-10 15:15:40, Dmitry Monakhov wrote:
> The quota inode is append only. In most cases writes are performed
> to already allocated blocks before i_size. Currently we grab i_mutex
> for each quota write to protect i_size and i_mtime, but in most cases
> i_size does not change, and i_mtime is not as important as for regular
> files. So let's update them only when the quota file grows. This
> allows us to avoid i_mutex in most cases.
But this won't help performance at all because dqio_mutex serializes
writes to the file already, right?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 2/3] ext4: journalled quota optimization
2010-04-01 9:12 ` [PATCH 2/3] ext4: journalled quota optimization Jan Kara
@ 2010-04-05 3:30 ` dmonakhov
2010-04-06 18:06 ` Jan Kara
0 siblings, 1 reply; 12+ messages in thread
From: dmonakhov @ 2010-04-05 3:30 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, linux-fsdevel
Jan Kara <jack@suse.cz> writes:
> Hi,
>
> On Sat 27-03-10 15:15:39, Dmitry Monakhov wrote:
>> Currently each quota modification results in write_dquot()
>> and later dquot_commit(). This means that each quota modification
>> function must wait for dqio_mutex, which is a *huge* performance
>> penalty on big SMP systems. AFAIU, the core idea of this implementation
>> is to guarantee that each quota modification will be written to the
>> journal atomically. But in fact this is not always true, because a dquot
>> may be changed after one modification, but before it is committed to disk.
> We already discussed this when you last submitted the patch.
> dqio_mutex has nothing to do with journaling. It is there so that two
> writes to the quota file cannot happen in parallel, because that could
> cause corruption. Without dqio_mutex, the following would be possible:
> Task 1 Task 2
> ...
> qtree_write_dquot()
> ...
> info->dqi_ops->mem2disk_dqblk
> modify dquot
> mark_dquot_dirty
> ...
> qtree_write_dquot()
> - writes newer information
> ret = sb->s_op->quota_write
> - overwrites the new information
> with an old one.
>
>
>> | Task 1 | Task 2 |
>> | alloc_space(nr) | |
>> | ->spin_lock(dq_data_lock) | |
>> | ->curspace += nr | |
>> | ->spin_unlock(dq_data_lock) | |
>> | ->mark_dirty() | free_space(nr) |
>> | -->write_dquot() | ->spin_lock(dq_data_lock) |
>> | --->dquot_commit() | ->curspace -= nr |
>> | ---->commit_dqblk() | ->spin_unlock(dq_data_lock) |
>> | ----->spin_lock(dq_data_lock) | |
>> | ----->mem2disk_dqblk(ddq, dquot) | <<< Copy updated value |
>> | ----->spin_unlock(dq_data_lock) | |
>> | ----->quota_write() | |
>> Quota corruption does not happen only because the quota modification
>> caller has already started a transaction, and ext3/4 allow only one
>> running transaction at a time.
> That said, what you write above happens for other ext3/4 metadata as well.
>
>> Let's exploit this fact and avoid writing the quota each time.
>> Act similarly to dirty_for_io in the generic writeback path of the
>> page cache. If we find that another task has already started copying
>> and writing the dquot, then we skip the actual quota_write stage and
>> let that task do the job. This patch fixes only the contention on
>> dqio_mutex.
> As I wrote last time, your patch helps the contention only a tiny bit -
> we clear the dirty bit as the first thing after we acquire dqio_mutex. So
> your patch helps only if one task happens to come while another task is
> between
>
Yes, absolutely right. But in fact two or more writer tasks are common,
because each quota modification results in a quota_write.
A mail server is just one example.
In my measurements the performance boost was about 2-3x.
> dquot_mark_dquot_dirty <---------- here
> mutex_lock(&dqopt->dqio_mutex);
> spin_lock(&dq_list_lock);
> if (!clear_dquot_dirty(dquot)) { <----- and here
>
> So I find this complicates the code without much merit. And if I
> remember right, last time I also suggested that it might be much more
> useful to optimize how the quota structure is written - i.e., get a
> reference to a buffer in ext4_acquire_dquot and thus save the
> ext4_bread in ext4_quota_write.
OK, indeed this makes the code more tricky. I just want to make the
future switch to journalled quota mode by default more painless, with
minimal code changes.
I'll go back to work on a long-term solution (according to your comments).
Right now I think it is reasonable to add a new field to struct dquot,
void *dq_private;, to let the filesystem store its private data.
Ext3/4 will store the journalled buffer here.
* Re: [PATCH 3/3] ext4: optimize quota write locking
2010-04-01 9:14 ` Jan Kara
@ 2010-04-05 3:34 ` dmonakhov
0 siblings, 0 replies; 12+ messages in thread
From: dmonakhov @ 2010-04-05 3:34 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, linux-fsdevel
Jan Kara <jack@suse.cz> writes:
> On Sat 27-03-10 15:15:40, Dmitry Monakhov wrote:
>> The quota inode is append only. In most cases writes are performed
>> to already allocated blocks before i_size. Currently we grab i_mutex
>> for each quota write to protect i_size and i_mtime, but in most cases
>> i_size does not change, and i_mtime is not as important as for regular
>> files. So let's update them only when the quota file grows. This
>> allows us to avoid i_mutex in most cases.
> But this won't help performance at all because dqio_mutex serializes
> writes to the file already, right?
Oops, this is true. It wasn't true in my private container tree's quota
implementation, so the patch is not valid for the common case. Please
ignore it.
>
> Honza
* Re: [PATCH 2/3] ext4: journalled quota optimization
2010-04-05 3:30 ` dmonakhov
@ 2010-04-06 18:06 ` Jan Kara
2010-04-07 8:12 ` Dmitry Monakhov
0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2010-04-06 18:06 UTC (permalink / raw)
To: dmonakhov; +Cc: Jan Kara, linux-ext4, linux-fsdevel
On Mon 05-04-10 07:30:37, dmonakhov@openvz.org wrote:
> Jan Kara <jack@suse.cz> writes:
>
> > Hi,
> >
> > On Sat 27-03-10 15:15:39, Dmitry Monakhov wrote:
> >> Currently each quota modification results in write_dquot()
> >> and later dquot_commit(). This means that each quota modification
> >> function must wait for dqio_mutex, which is a *huge* performance
> >> penalty on big SMP systems. AFAIU, the core idea of this implementation
> >> is to guarantee that each quota modification will be written to the
> >> journal atomically. But in fact this is not always true, because a dquot
> >> may be changed after one modification, but before it is committed to disk.
> > We already discussed this when you last submitted the patch.
> > dqio_mutex has nothing to do with journaling. It is there so that two
> > writes to the quota file cannot happen in parallel, because that could
> > cause corruption. Without dqio_mutex, the following would be possible:
> > Task 1 Task 2
> > ...
> > qtree_write_dquot()
> > ...
> > info->dqi_ops->mem2disk_dqblk
> > modify dquot
> > mark_dquot_dirty
> > ...
> > qtree_write_dquot()
> > - writes newer information
> > ret = sb->s_op->quota_write
> > - overwrites the new information
> > with an old one.
> >
> >
> >> | Task 1 | Task 2 |
> >> | alloc_space(nr) | |
> >> | ->spin_lock(dq_data_lock) | |
> >> | ->curspace += nr | |
> >> | ->spin_unlock(dq_data_lock) | |
> >> | ->mark_dirty() | free_space(nr) |
> >> | -->write_dquot() | ->spin_lock(dq_data_lock) |
> >> | --->dquot_commit() | ->curspace -= nr |
> >> | ---->commit_dqblk() | ->spin_unlock(dq_data_lock) |
> >> | ----->spin_lock(dq_data_lock) | |
> >> | ----->mem2disk_dqblk(ddq, dquot) | <<< Copy updated value |
> >> | ----->spin_unlock(dq_data_lock) | |
> >> | ----->quota_write() | |
> >> Quota corruption does not happen only because the quota modification
> >> caller has already started a transaction, and ext3/4 allow only one
> >> running transaction at a time.
> > That said, what you write above happens for other ext3/4 metadata as well.
> >
> >> Let's exploit this fact and avoid writing the quota each time.
> >> Act similarly to dirty_for_io in the generic writeback path of the
> >> page cache. If we find that another task has already started copying
> >> and writing the dquot, then we skip the actual quota_write stage and
> >> let that task do the job. This patch fixes only the contention on
> >> dqio_mutex.
> > As I wrote last time, your patch helps the contention only a tiny bit -
> > we clear the dirty bit as the first thing after we acquire dqio_mutex. So
> > your patch helps only if one task happens to come while another task is
> > between
> >
> Yes, absolutely right. But in fact two or more writer tasks are common,
> because each quota modification results in a quota_write.
> A mail server is just one example.
> In my measurements the performance boost was about 2-3x.
I have to admit I'm really surprised (almost suspicious) about those
results :). If I consider the length of the whole write path doing block
allocation and what a tiny part of the window you have to squeeze into,
it's hard for me to believe that you can achieve such a huge improvement.
So if the measurement is really correct, there must be something I'm
missing...
> > dquot_mark_dquot_dirty <---------- here
> > mutex_lock(&dqopt->dqio_mutex);
> > spin_lock(&dq_list_lock);
> > if (!clear_dquot_dirty(dquot)) { <----- and here
> >
>
> > So I find this complicates the code without much merit. And if I
> > remember right, last time I also suggested that it might be much more
> > useful to optimize how the quota structure is written - i.e., get a
> > reference to a buffer in ext4_acquire_dquot and thus save the
> > ext4_bread in ext4_quota_write.
> OK, indeed this makes the code more tricky. I just want to make the
> future switch to journalled quota mode by default more painless, with
> minimal code changes.
:-) We share this aim.
> I'll go back to work on a long-term solution (according to your comments).
> Right now I think it is reasonable to add a new field to struct dquot,
> void *dq_private;, to let the filesystem store its private data.
> Ext3/4 will store the journalled buffer here.
Quota uses the same approach for fs-private data as the VFS uses for
inodes - i.e., a filesystem can provide quota allocation and destruction
functions that allocate its private quota structure, which also contains
the generic quota structure. OCFS2 already uses this opportunity, so you
can check out how exactly it's implemented.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 2/3] ext4: journalled quota optimization
2010-04-06 18:06 ` Jan Kara
@ 2010-04-07 8:12 ` Dmitry Monakhov
2010-04-07 12:06 ` Jan Kara
0 siblings, 1 reply; 12+ messages in thread
From: Dmitry Monakhov @ 2010-04-07 8:12 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, linux-fsdevel
Jan Kara <jack@suse.cz> writes:
> On Mon 05-04-10 07:30:37, dmonakhov@openvz.org wrote:
>> Jan Kara <jack@suse.cz> writes:
>>
>> > Hi,
>> >
>> > On Sat 27-03-10 15:15:39, Dmitry Monakhov wrote:
>> >> Currently each quota modification results in write_dquot()
>> >> and later dquot_commit(). This means that each quota modification
>> >> function must wait for dqio_mutex, which is a *huge* performance
>> >> penalty on big SMP systems. AFAIU, the core idea of this implementation
>> >> is to guarantee that each quota modification will be written to the
>> >> journal atomically. But in fact this is not always true, because a dquot
>> >> may be changed after one modification, but before it is committed to disk.
>> > We already discussed this when you last submitted the patch.
>> > dqio_mutex has nothing to do with journaling. It is there so that two
>> > writes to the quota file cannot happen in parallel, because that could
>> > cause corruption. Without dqio_mutex, the following would be possible:
>> > Task 1 Task 2
>> > ...
>> > qtree_write_dquot()
>> > ...
>> > info->dqi_ops->mem2disk_dqblk
>> > modify dquot
>> > mark_dquot_dirty
>> > ...
>> > qtree_write_dquot()
>> > - writes newer information
>> > ret = sb->s_op->quota_write
>> > - overwrites the new information
>> > with an old one.
>> >
>> >
>> >> | Task 1 | Task 2 |
>> >> | alloc_space(nr) | |
>> >> | ->spin_lock(dq_data_lock) | |
>> >> | ->curspace += nr | |
>> >> | ->spin_unlock(dq_data_lock) | |
>> >> | ->mark_dirty() | free_space(nr) |
>> >> | -->write_dquot() | ->spin_lock(dq_data_lock) |
>> >> | --->dquot_commit() | ->curspace -= nr |
>> >> | ---->commit_dqblk() | ->spin_unlock(dq_data_lock) |
>> >> | ----->spin_lock(dq_data_lock) | |
>> >> | ----->mem2disk_dqblk(ddq, dquot) | <<< Copy updated value |
>> >> | ----->spin_unlock(dq_data_lock) | |
>> >> | ----->quota_write() | |
>> >> Quota corruption does not happen only because the quota modification
>> >> caller has already started a transaction, and ext3/4 allow only one
>> >> running transaction at a time.
>> > That said, what you write above happens for other ext3/4 metadata as well.
>> >
>> >> Let's exploit this fact and avoid writing the quota each time.
>> >> Act similarly to dirty_for_io in the generic writeback path of the
>> >> page cache. If we find that another task has already started copying
>> >> and writing the dquot, then we skip the actual quota_write stage and
>> >> let that task do the job. This patch fixes only the contention on
>> >> dqio_mutex.
>> > As I wrote last time, your patch helps the contention only a tiny bit -
>> > we clear the dirty bit as the first thing after we acquire dqio_mutex. So
>> > your patch helps only if one task happens to come while another task is
>> > between
>> >
>> Yes, absolutely right. But in fact two or more writer tasks are common,
>> because each quota modification results in a quota_write.
>> A mail server is just one example.
>> In my measurements the performance boost was about 2-3x.
> I have to admit I'm really surprised (almost suspicious) about those
> results :). If I consider the length of the whole write path doing block
> allocation and what a tiny part of the window you have to squeeze into,
> it's hard for me to believe that you can achieve such a huge improvement.
> So if the measurement is really correct, there must be something I'm
> missing...
These measurements were done more than a month ago, but I think they are
still correct:
http://download.openvz.org/~dmonakhov/docs/quota/STATUS.html
I was too lazy (I tell myself that I'm busy) to remeasure the results on
a bigger SMP host (now I have a host with 8 cores). I hope I'll do it
this week. The performance gain is easy to describe.
Assume you have 16 writer tasks (with the same uid for simplicity)
and the first lucky one is able to grab dqio_mutex. The other 15 have
modified the dquot and wait for dqio_mutex to write the exact same
value. But after the patch only two tasks are involved in IO: the one
which actually acquired the mutex and one other which waits for it.
So in my example only the first of the 15 will wait on the mutex, and
the others will return without waiting.
>
>> > dquot_mark_dquot_dirty <---------- here
>> > mutex_lock(&dqopt->dqio_mutex);
>> > spin_lock(&dq_list_lock);
>> > if (!clear_dquot_dirty(dquot)) { <----- and here
>> >
>>
>> > So I find this complicates the code without much merit. And if I
>> > remember right, last time I also suggested that it might be much more
>> > useful to optimize how the quota structure is written - i.e., get a
>> > reference to a buffer in ext4_acquire_dquot and thus save the
>> > ext4_bread in ext4_quota_write.
>> OK, indeed this makes the code more tricky. I just want to make the
>> future switch to journalled quota mode by default more painless, with
>> minimal code changes.
> :-) We share this aim.
>
>> I'll go back to work on a long-term solution (according to your comments).
>> Right now I think it is reasonable to add a new field to struct dquot,
>> void *dq_private;, to let the filesystem store its private data.
>> Ext3/4 will store the journalled buffer here.
> Quota uses the same approach for fs-private data as the VFS uses for
> inodes - i.e., a filesystem can provide quota allocation and destruction
> functions that allocate its private quota structure, which also contains
> the generic quota structure. OCFS2 already uses this opportunity, so you
> can check out how exactly it's implemented.
>
> Honza
* Re: [PATCH 2/3] ext4: journalled quota optimization
2010-04-07 8:12 ` Dmitry Monakhov
@ 2010-04-07 12:06 ` Jan Kara
0 siblings, 0 replies; 12+ messages in thread
From: Jan Kara @ 2010-04-07 12:06 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4, linux-fsdevel
On Wed 07-04-10 12:12:46, Dmitry Monakhov wrote:
> Jan Kara <jack@suse.cz> writes:
> > On Mon 05-04-10 07:30:37, dmonakhov@openvz.org wrote:
> >> Jan Kara <jack@suse.cz> writes:
> >>
> >> > Hi,
> >> >
> >> > On Sat 27-03-10 15:15:39, Dmitry Monakhov wrote:
> >> >> Currently each quota modification results in write_dquot()
> >> >> and later dquot_commit(). This means that each quota modification
> >> >> function must wait for dqio_mutex, which is a *huge* performance
> >> >> penalty on big SMP systems. AFAIU, the core idea of this implementation
> >> >> is to guarantee that each quota modification will be written to the
> >> >> journal atomically. But in fact this is not always true, because a dquot
> >> >> may be changed after one modification, but before it is committed to disk.
> >> > We already discussed this when you last submitted the patch.
> >> > dqio_mutex has nothing to do with journaling. It is there so that two
> >> > writes to the quota file cannot happen in parallel, because that could
> >> > cause corruption. Without dqio_mutex, the following would be possible:
> >> > Task 1 Task 2
> >> > ...
> >> > qtree_write_dquot()
> >> > ...
> >> > info->dqi_ops->mem2disk_dqblk
> >> > modify dquot
> >> > mark_dquot_dirty
> >> > ...
> >> > qtree_write_dquot()
> >> > - writes newer information
> >> > ret = sb->s_op->quota_write
> >> > - overwrites the new information
> >> > with an old one.
> >> >
> >> >
> >> >> | Task 1 | Task 2 |
> >> >> | alloc_space(nr) | |
> >> >> | ->spin_lock(dq_data_lock) | |
> >> >> | ->curspace += nr | |
> >> >> | ->spin_unlock(dq_data_lock) | |
> >> >> | ->mark_dirty() | free_space(nr) |
> >> >> | -->write_dquot() | ->spin_lock(dq_data_lock) |
> >> >> | --->dquot_commit() | ->curspace -= nr |
> >> >> | ---->commit_dqblk() | ->spin_unlock(dq_data_lock) |
> >> >> | ----->spin_lock(dq_data_lock) | |
> >> >> | ----->mem2disk_dqblk(ddq, dquot) | <<< Copy updated value |
> >> >> | ----->spin_unlock(dq_data_lock) | |
> >> >> | ----->quota_write() | |
> >> >> Quota corruption does not happen only because the caller of the quota
> >> >> modification has already started a journal transaction, and ext3/4
> >> >> allow only one running transaction at a time.
> >> > What you write above is true for other ext3/4 metadata as well.
> >> >
> >> >> Let's exploit this fact and avoid writing the quota each time.
> >> >> We act similarly to the dirty_for_io logic in the generic writeback
> >> >> path of the page cache: if we find that another task has already
> >> >> started to copy and write the dquot, we skip the actual quota_write
> >> >> stage and let that task do the job.
> >> >> This patch fixes only the contention on dqio_mutex.
> >> > As I wrote last time, your patch helps the contention only a tiny bit -
> >> > we clear the dirty bit as the first thing after we acquire dqio_mutex. So
> >> > your patch helps only if one task happens while another task is between
> >> >
> >> Yes, absolutely right. But in fact two or more writer tasks are common,
> >> because each quota modification results in a quota_write.
> >> A mail server is just an example.
> >> In my measurements the performance boost was about 2x-3x.
> > I have to admit I'm really surprised (almost suspicious) about those
> > results :). If I consider the length of the whole write path doing block
> > allocation and how tiny a part of the window you have to squeeze into, it's
> > hard for me to believe that you can achieve such a huge improvement. So if
> > the measurement is really correct, there must be something I miss...
> These measurements were done more than a month ago, but I think they are
> still correct:
> http://download.openvz.org/~dmonakhov/docs/quota/STATUS.html
> I was too lazy (I tell myself that I'm busy) to remeasure the results on
> a bigger SMP host (I now have a host with 8 cores). I hope I'll do it
> this week. The performance gain is easy to describe.
> Assume you have 16 writer tasks (with the same uid for simplicity),
> and the first lucky one is able to grab dqio_mutex.
> The other 15 have already modified the dquot and would each wait for
> dqio_mutex just to write the exact same value.
> But after the patch only two tasks are involved in IO:
> the one which actually acquired the mutex, and one other which waits
> for it.
> So in my example only the first of the 15 tasks will wait on the mutex,
> and the others will return without waiting.
Ah, OK, this was what I was missing. I was always imagining just two
tasks racing against each other, but your patch starts bringing benefits
when there are three or more tasks. Now your results make sense to me.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
Thread overview: 12+ messages
2010-03-27 12:15 [PATCH 0/3] ext4/quota: journalled quota optimizations Dmitry Monakhov
2010-03-27 12:15 ` [PATCH 1/3] quota: optimize mark_dirty logic Dmitry Monakhov
2010-03-27 12:15 ` [PATCH 2/3] ext4: journalled quota optimization Dmitry Monakhov
2010-03-27 12:15 ` [PATCH 3/3] ext4: optimize quota write locking Dmitry Monakhov
2010-04-01 9:14 ` Jan Kara
2010-04-05 3:34 ` dmonakhov
2010-04-01 9:12 ` [PATCH 2/3] ext4: journalled quota optimization Jan Kara
2010-04-05 3:30 ` dmonakhov
2010-04-06 18:06 ` Jan Kara
2010-04-07 8:12 ` Dmitry Monakhov
2010-04-07 12:06 ` Jan Kara
2010-03-31 15:06 ` [PATCH 1/3] quota: optimize mark_dirty logic Jan Kara