From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dmitry Monakhov Subject: Re: [PATCH] ext4: journalled quota seed-up Date: Thu, 17 Dec 2009 19:37:18 +0300 Message-ID: <874onp5zgh.fsf@openvz.org> References: <0KUR00I4YVKZG820@mail.2ka.mipt.ru> <1261012534-15634-1-git-send-email-dmonakhov@openvz.org> <20091217160654.GB3544@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; CHARSET=US-ASCII Content-Transfer-Encoding: 7BIT Cc: linux-fsdevel@vger.kernel.org To: Jan Kara Return-path: Received: from mail.2ka.mipt.ru ([194.85.80.4]:54733 "EHLO mail.2ka.mipt.ru" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1762237AbZLQQhY (ORCPT ); Thu, 17 Dec 2009 11:37:24 -0500 Received: from dmon-lp ([unknown] [10.55.93.124]) by mail.2ka.mipt.ru (Sun Java(tm) System Messaging Server 7u2-7.02 64bit (built Apr 16 2009)) with ESMTPA id <0KUT006BU2EE0870@mail.2ka.mipt.ru> for linux-fsdevel@vger.kernel.org; Thu, 17 Dec 2009 19:42:17 +0300 (MSK) In-reply-to: <20091217160654.GB3544@quack.suse.cz> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: Jan Kara writes: > Hi, > > On Thu 17-12-09 04:15:34, Dmitry Monakhov wrote: >> Currently each quota modification result in write_dquot() >> and later dquot_commit(). This means what each quota modification >> function must wait for dqio_mutex. Which is *huge* performance >> penalty on big SMP systems. ASAIU The core idea of this implementation >> is to guarantee that each quota modification will be written to journal >> atomically. But in fact this is not always true, because dquot may be >> changed after dquot modification, but before it was committed to disk. >> >> | Task 1 | Task 2 | >> | alloc_space(nr) | | >> | ->spin_lock(dq_data_lock) | | >> | ->curspace += nr | | >> | ->spin_unlock(dq_data_lock) | | >> | ->mark_dirty() | free_space(nr) | >> | -->write_dquot() | ->spin_lock(dq_data_lock) | >> | --->dquot_commit() | ->curspace -= nr | >> | ---->commit_dqblk() | ->spin_unlock(dq_data_lock) | >> | ----->spin_lock(dq_data_lock) | | >> | ----->mem2disk_dqblk(ddq, dquot) | <<< Copy updated value | >> | ----->spin_unlock(dq_data_lock) | | >> | ----->quota_write() | | >> Quota corruption not happens only because quota modification caller >> started journal already. And ext3/4 allow only one running quota >> at a time. > It was actually designed to work this way. Ext3/4 handles for example > inodes like this as well... You're right that it might be good to document > this fact as in future someone might come with a filesystem with a > different scheme. > >> Let's exploit this fact and avoid writing quota each time. >> Act similar to dirty_for_io in general write_back path in page-cache. >> If we have found what other task already started on copy and write the >> dquot then we skip actual quota_write stage. And let that task do the job. >> Side effect: Task which skip real quota_write() will not get an error >> (if any). But this is not big deal because: >> 1) Any error has global consequences (RO_remount, err_print, etc). >> 2) Real IO is differed till the journall_commit. > ... >> + if (!__dquot_mark_dquot_dirty(dquot)) >> + ret = dquot_commit(dquot); > Hmm, but dquot_commit() clears the dirty bit just after acquiring > dqio_mutex so if I get it right, you save something only if someone is > waiting contended on dqio_mutex. So isn't this just working around a > contention on that mutex (and in a not very nice way)? For case where dquot > is already active (DQ_ACTIVE_B set), we can certainly make the locking much > more lightweight - essentially just a per-dquot lock for the time until we > copy the data from dquot to buffer_head... Yes this patch is aimed to decrease contention on dqio_mutex only, because it is the biggest performance consumer. The patch is not a complete. This is just a proof of concept patch. And according to my measurements, quota overhead becomes approximate to (1 / nr_concurrent_task) which is not perfect. And off-course avoiding dqio_mutex at all is really good. > Also we could make writing dquot more lightweight (have ext4 specific > part of dquot and cache there buffer head which carries the dquot). Even > more, we could track look whether the dquot is already part of the > transaction and save get_write_access etc for that case (but that already > starts to be ugly)... When i've thought about this i come in to idea to introduce callback which dquot_write() may attach to the journal, and call it at the last journal_stop() :) But your idea about BH tracking is much better. > >> + else >> + /* >> + * Dquot was already dirty. This means that other task already >> + * started a transaction but not clear dirty bit yet (see >> + * dquot_commit). Since the only one running transaction is >> + * possible at a time. Then that task belongs to the same >> + * transaction. We don'n have to actually write dquot changes >> + * because that task will write it for for us. >> + */ >> + ret = 0; >> + err = ext4_journal_stop(handle); >> + if (!ret) >> + ret = err; >> + return ret; >> } >> >> static int ext4_write_info(struct super_block *sb, int type) >> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c >> index 6979224..e03eea0 100644 >> --- a/fs/quota/dquot.c >> +++ b/fs/quota/dquot.c >> @@ -311,14 +311,24 @@ static inline int mark_dquot_dirty(struct dquot *dquot) >> { >> return dquot->dq_sb->dq_op->mark_dirty(dquot); >> } >> - >> -int dquot_mark_dquot_dirty(struct dquot *dquot) >> +/* mark dquot dirty in atomic meaner, and return old dirty state */ >> +int __dquot_mark_dquot_dirty(struct dquot *dquot) >> { >> + int ret = 1; >> spin_lock(&dq_list_lock); >> - if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags)) >> + if (!test_and_set_bit(DQ_MOD_B, &dquot->dq_flags)) { >> list_add(&dquot->dq_dirty, &sb_dqopt(dquot->dq_sb)-> >> info[dquot->dq_type].dqi_dirty_list); >> + ret = 0; >> + } >> spin_unlock(&dq_list_lock); >> + return ret; >> +} >> +EXPORT_SYMBOL(__dquot_mark_dquot_dirty); >> + >> +int dquot_mark_dquot_dirty(struct dquot *dquot) >> +{ >> + __dquot_mark_dquot_dirty(dquot); >> return 0; >> } > Please just do the change do dquot_mark_dquot_dirty(). There's no need > for the new function. > > Honza