From: Josef Bacik <jbacik@redhat.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Josef Bacik <jbacik@redhat.com>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-ext4@vger.kernel.org, rwheeler@redhat.com
Subject: Re: [PATCH] improve jbd fsync batching
Date: Mon, 3 Nov 2008 15:24:43 -0500 [thread overview]
Message-ID: <20081103202442.GA25736@unused.rdu.redhat.com> (raw)
In-Reply-To: <20081103122729.60582692.akpm@linux-foundation.org>
On Mon, Nov 03, 2008 at 12:27:29PM -0800, Andrew Morton wrote:
> On Tue, 28 Oct 2008 16:16:15 -0400
> Josef Bacik <jbacik@redhat.com> wrote:
>
> > Hello,
> >
> > This is a rework of the patch I did a few months ago, taking into account some
> > comments from Andrew and using the new schedule_hrtimeout function (thanks
> > Arjan!).
> >
> > There is a flaw with the way jbd handles fsync batching. If we fsync() a file
> > and we were not the last person to run fsync() on this fs then we automatically
> > sleep for 1 jiffie in order to wait for new writers to join into the transaction
> > before forcing the commit. The problem with this is that with really fast
> > storage (ie a Clariion) the time it takes to commit a transaction to disk is way
> > faster than 1 jiffie in most cases, so sleeping means waiting longer with
> > nothing to do than if we just committed the transaction and kept going. Ric
> > Wheeler noticed this when using fs_mark with more than 1 thread, the throughput
> > would plummet as he added more threads.
> >
> > ...
> >
> > ...
> >
> > @@ -49,6 +50,7 @@ get_transaction(journal_t *journal, transaction_t *transaction)
> > {
> > transaction->t_journal = journal;
> > transaction->t_state = T_RUNNING;
> > + transaction->t_start_time = ktime_get();
> > transaction->t_tid = journal->j_transaction_sequence++;
> > transaction->t_expires = jiffies + journal->j_commit_interval;
> > spin_lock_init(&transaction->t_handle_lock);
> > @@ -1371,7 +1373,7 @@ int journal_stop(handle_t *handle)
> > {
> > transaction_t *transaction = handle->h_transaction;
> > journal_t *journal = transaction->t_journal;
> > - int old_handle_count, err;
> > + int err;
> > pid_t pid;
> >
> > J_ASSERT(journal_current_handle() == handle);
> > @@ -1407,11 +1409,26 @@ int journal_stop(handle_t *handle)
> > */
> > pid = current->pid;
> > if (handle->h_sync && journal->j_last_sync_writer != pid) {
>
> It would be nice to have a comment here explaining the overall design.
> it's a bit opaque working that out from the raw implementation.
>
> > + u64 commit_time, trans_time;
> > +
> > journal->j_last_sync_writer = pid;
> > - do {
> > - old_handle_count = transaction->t_handle_count;
> > - schedule_timeout_uninterruptible(1);
> > - } while (old_handle_count != transaction->t_handle_count);
> > +
> > + spin_lock(&journal->j_state_lock);
> > + commit_time = journal->j_average_commit_time;
> > + spin_unlock(&journal->j_state_lock);
>
> OK, the lock is needed on 32-bit machines, I guess.
>
>
> > + trans_time = ktime_to_ns(ktime_sub(ktime_get(),
> > + transaction->t_start_time));
> > +
> > + commit_time = min_t(u64, commit_time,
> > + 1000*jiffies_to_usecs(1));
>
> OK. The multiplication of an unsigned by 1000 could overflow, but only
> if HZ is less than 0.25. I don't think we need worry about that ;)
>
>
> > + if (trans_time < commit_time) {
> > + ktime_t expires = ktime_add_ns(ktime_get(),
> > + commit_time);
> > + set_current_state(TASK_UNINTERRUPTIBLE);
> > + schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
>
> We should have schedule_hrtimeout_uninterruptible(), but we don't.
>
> > + }
> > }
> >
> > current->journal_info = NULL;
> > diff --git a/include/linux/jbd.h b/include/linux/jbd.h
> > index 346e2b8..d842230 100644
> > --- a/include/linux/jbd.h
> > +++ b/include/linux/jbd.h
> > @@ -543,6 +543,11 @@ struct transaction_s
> > unsigned long t_expires;
> >
> > /*
> > + * When this transaction started, in nanoseconds [no locking]
> > + */
> > + ktime_t t_start_time;
> > +
> > + /*
> > * How many handles used this transaction? [t_handle_lock]
> > */
> > int t_handle_count;
> > @@ -800,6 +805,8 @@ struct journal_s
> >
> > pid_t j_last_sync_writer;
> >
> > + u64 j_average_commit_time;
>
> Every field in that structure is carefully documented (except for
> j_last_sync_writer - what vandal did that?)
>
> please fix.
I see you already pulled this into -mm, would you like me to repost with the
same changelog and the patch updated, or just reply to this with the updated
patch? Thanks,
Josef
next prev parent reply other threads:[~2008-11-03 20:24 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-10-28 20:16 [PATCH] improve jbd fsync batching Josef Bacik
2008-10-28 21:38 ` Andreas Dilger
2008-10-28 21:33 ` Josef Bacik
2008-10-28 21:44 ` Arjan van de Ven
2008-10-28 21:56 ` Ric Wheeler
2008-11-03 20:27 ` Andrew Morton
2008-11-03 20:24 ` Josef Bacik [this message]
2008-11-03 20:55 ` Andrew Morton
2008-11-04 15:41 ` Josef Bacik
2008-11-03 22:13 ` Theodore Tso
2008-11-04 5:24 ` Andreas Dilger
2008-11-04 9:12 ` Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20081103202442.GA25736@unused.rdu.redhat.com \
--to=jbacik@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=rwheeler@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).