From: Jan Kara <jack@suse.cz>
To: Mingming Cao <cmm@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>,
akpm@linux-foundation.org, linux-ext4@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH] jbd_commit_transaction() races with journal_try_to_drop_buffers() causing DIO failures
Date: Mon, 12 May 2008 17:54:19 +0200 [thread overview]
Message-ID: <20080512155419.GD15856@duck.suse.cz> (raw)
In-Reply-To: <1210372072.3639.52.camel@localhost.localdomain>
Hello,
On Fri 09-05-08 15:27:52, Mingming Cao wrote:
> > > I was able to reproduce the customer problem involving DIO
> > > (invalidate_inode_pages2) problem by writing simple testcase
> > > to keep writing to a file using buffered writes and DIO writes
> > > forever in a loop. I see DIO writes fail with -EIO.
> > >
> > > After a long debug, found 2 cases how this could happen.
> > > These are race conditions with journal_try_to_free_buffers()
> > > and journal_commit_transaction().
> > >
> > > 1) journal_submit_data_buffers() tries to get bh_state lock. If
> > > try lock fails, it drops the j_list_lock and sleeps for
> > > bh_state lock, while holding a reference on the buffer.
> > > In the meanwhile, journal_try_to_free_buffers() can clean up the
> > > journal head could call try_to_free_buffers(). try_to_free_buffers()
> > > would fail due to the reference held by journal_submit_data_buffers()
> > > - which in turn causes failues for DIO (invalidate_inode_pages2()).
> > >
> > > 2) When the buffer is on t_locked_list waiting for IO to finish,
> > > we hold a reference and give up the cpu, if we can't get
> > > bh_state lock. This causes try_to_free_buffers() to fail.
> > >
> > > Fix is to drop the reference on the buffer if we can't get
> > > bh_state lock, give up the cpu and re-try the whole operation -
> > > instead of waiting for the vh_state lock.
> > >
> > > Does this look like a resonable fix ?
> > As Mingming pointed out there are few other places where we could hold
> > the bh reference. Note also that we accumulate references to buffers in the
> > wbuf[] list and we need that for submit_bh() which consumes one bh
> > reference. Generally, it seems to me as a too fragile and impractical
> > rule "nobody can hold bh reference when not holding page lock" which is
> > basically what it comes down to if you really want to be sure that
> > journal_try_to_free_buffers() succeeds. And also note that in principle
> > there are other places which hold references to buffers without holding the
> > page lock - for example writepage() in ordered mode (although this one is
> > in practice hardly triggerable). So how we could fix at least the races
> > with commit code is to implement launder_page() callback for ext3/4 which
> > would wait for the previous transaction commit in case the page has buffers
> > that are part of that commit (I don't want this logic in
> > journal_try_to_free_buffers() as that is called also on memory-reclaim
> > path, but journal_launder_page() is fine with me).
>
> I am not sure how we are going to gurantee that by the time
> journal_try_to_free_buffers() get called, the page has buffers are not
> as part of the current transaction commit(which could be different than
> the one we waited in ext3_launder_page())?
Hmm, you are right. It is not enough to just wait in ext3_launder_page()
because we don't have a transaction for direct_IO started yet. But if we
actually released buffers from the page there, it should be fine.
> It seems more realistic to fix the races one by one to me.
Not to me, really. The scheme for buffer references you are trying to
impose is awkward to say the least. First, it is completely
counter-intuitive (at least to me ;), second, it is impractical as well.
For example in your scheme, you have no sensible way of locking ordered
data mode buffer - you cannot just do: get the reference and do
lock_buffer() because that violates your requirements. The only reasonable
way you could do that is to lock the page to make sure buffer won't go away
from you - but you cannot currently do that in journal commit code because
of lock ordering. So the only way I can see which is left is: get some jbd
spin lock to serialize with journal_try_to_free_buffers(), get the buffer
reference, try to lock buffer, if it fails, drop everything and restart.
And this is IMO no-go...
And BTW even if you fix such races, I think you'll still have races like:
CPU1: CPU2:
filemap_write_and_wait()
dirty a page
msync() (dirties buffers)
invalidate_inode_page2_range() -> -EIO
The code could historically always return EIO when mixing buffered and
unbuffered accesses and the question is, under which circumstances is this
acceptable? I agree that the current state when if you do "buffered write,
DIO write" in sequence and you'll possibly get EIO is bad and we should fix
it. But I'm not sure we should fix the EIO return under all possible
circumstances at all costs...
> There is still a window that journal_submit_data_buffers() already
> removed the jh from the bh (when found the buffers are already being
> synced), but still keep a reference to the buffer head.
> journal_try_to_free_buffers() could be called. In that case
> try_to_free_buffers() will be called since there is no jh related to
> this buffer, and failed due to journal_submit_data_buffers() hasn't
> finish the cleanup business yet.
>
> For this new race, we could just grab the j_list_lock when re-try
> try_to_free_buffers() to force waiting for journal_commit_transaction()
> to finish it flush work. But not sure if this is acceptable approach?
>
> Patch like this? Comments?
>
> ---------------------------------------------------------------------
> There are a few cases direct IO could race with kjournal flushing
> data buffers which could result direct IO return EIO error.
>
> 1) journal_submit_data_buffers() tries to get bh_state lock. If
> try lock fails, it drops the j_list_lock and sleeps for
> bh_state lock, while holding a reference on the buffer.
> In the meanwhile, journal_try_to_free_buffers() can clean up the
> journal head could call try_to_free_buffers(). try_to_free_buffers()
> would fail due to the reference held by journal_submit_data_buffers()
> - which in turn causes failues for DIO (invalidate_inode_pages2()).
>
> 2) When the buffer is on t_locked_list waiting for IO to finish,
> we hold a reference and give up the cpu, if we can't get
> bh_state lock. This causes try_to_free_buffers() to fail.
>
> 3) when journal_submit_data_buffers() saw the buffer is dirty but failed
> to lock the buffer bh1, journal_submit_data_buffers() released the
> j_list_lock and submit other buffers collected from previous check, with
> the reference to bh1 still hold. During this time
> journal_try_to_free_buffers() could clean up the journal head of bh1 and
> remove it from the t_syncdata_list. Then try_to_free_buffers() would
> fail because the reference held by journal_submit_data_buffers()
>
> 4) journal_submit_data_buffers() already removed the jh from the bh
> (when found the buffers are already being synced), but still keep a
> reference to the buffer head. journal_try_to_free_buffers() could be
> called. In that case try_to_free_buffers() will be called since there is
> no jh related to this buffer, and failed due to
> journal_submit_data_buffers() hasn't finish the cleanup business yet.
>
> Fix for first three races is to drop the reference on the buffer head
> when release the j_list_lock,
> give up the cpu and re-try the whole operation.
>
> This patch also fixes the race that data buffers has been
> flushed to disk and journal head is cleard
> by journal_submit_data_buffers() but did not get a chance to release
> buffer head reference before the journal_try_to_free_buffers() kicked in.
>
>
> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
> Signed-off-by: Mingming Cao <mcao@us.ibm.com>
> ---
> fs/jbd/commit.c | 21 ++++++++++++++++-----
> fs/jbd/transaction.c | 13 +++++++++++++
> 2 files changed, 29 insertions(+), 5 deletions(-)
>
> Index: linux-2.6.26-rc1/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.26-rc1.orig/fs/jbd/commit.c 2008-05-03 11:59:44.000000000 -0700
> +++ linux-2.6.26-rc1/fs/jbd/commit.c 2008-05-09 14:44:36.000000000 -0700
> @@ -79,12 +79,16 @@ nope:
>
> /*
> * Try to acquire jbd_lock_bh_state() against the buffer, when j_list_lock is
> - * held. For ranking reasons we must trylock. If we lose, schedule away and
> + * held. For ranking reasons we must trylock. If we lose, unlock the buffer
> + * if needed, drop the reference on the buffer, schedule away and
> * return 0. j_list_lock is dropped in this case.
> */
> -static int inverted_lock(journal_t *journal, struct buffer_head *bh)
> +static int inverted_lock(journal_t *journal, struct buffer_head *bh, int locked)
> {
> if (!jbd_trylock_bh_state(bh)) {
> + if (locked)
> + unlock_buffer(bh);
> + put_bh(bh);
> spin_unlock(&journal->j_list_lock);
> schedule();
> return 0;
> @@ -209,19 +213,24 @@ write_out_data:
> if (buffer_dirty(bh)) {
> if (test_set_buffer_locked(bh)) {
> BUFFER_TRACE(bh, "needs blocking lock");
> + put_bh(bh);
> spin_unlock(&journal->j_list_lock);
> /* Write out all data to prevent deadlocks */
> journal_do_submit_data(wbuf, bufs);
> bufs = 0;
> - lock_buffer(bh);
> spin_lock(&journal->j_list_lock);
> + continue;
^^^ Here you can see what I wrote above. Basically you just busy-loop
wait for buffer lock. You should at least put schedule() there so that you
don't lockup the CPU but it's ugly anyway.
> }
> locked = 1;
> }
> - /* We have to get bh_state lock. Again out of order, sigh. */
> - if (!inverted_lock(journal, bh)) {
> - jbd_lock_bh_state(bh);
> + /*
> + * We have to get bh_state lock. If the try lock fails,
> + * release the ref on the buffer, give up cpu and retry the
> + * whole operation.
> + */
> + if (!inverted_lock(journal, bh, locked)) {
> spin_lock(&journal->j_list_lock);
> + continue;
> }
^^^ And here you add a place where we are not guaranteed to make any
progress... If someone intensively spins on that buffer, commit code could
cycle here forever (or at least for quite a long time).
> /* Someone already cleaned up the buffer? */
> if (!buffer_jbd(bh)
> @@ -430,8 +439,7 @@ void journal_commit_transaction(journal_
> err = -EIO;
> spin_lock(&journal->j_list_lock);
> }
> - if (!inverted_lock(journal, bh)) {
> - put_bh(bh);
> + if (!inverted_lock(journal, bh, 0)) {
> spin_lock(&journal->j_list_lock);
> continue;
> }
> Index: linux-2.6.26-rc1/fs/jbd/transaction.c
> ===================================================================
> --- linux-2.6.26-rc1.orig/fs/jbd/transaction.c 2008-05-03 11:59:44.000000000 -0700
> +++ linux-2.6.26-rc1/fs/jbd/transaction.c 2008-05-09 09:53:57.000000000 -0700
> @@ -1714,6 +1714,19 @@ int journal_try_to_free_buffers(journal_
> goto busy;
> } while ((bh = bh->b_this_page) != head);
> ret = try_to_free_buffers(page);
> + if (ret == 0) {
> + /*
> + * it is possible that journal_submit_data_buffers()
> + * still holds the bh ref even if clears the jh
> + * after journal_remove_journal_head,
> + * which leads to try_to_free_buffers() failed
> + * let's wait for journal_submit_data_buffers()
> + * to finishing remove the bh from the sync_data_list
> + */
> + spin_lock(&journal->j_list_lock);
> + ret = try_to_free_buffers(page);
> + spin_unlock(&journal->j_list_lock);
> + }
> busy:
> return ret;
> }
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
next prev parent reply other threads:[~2008-05-12 15:54 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-03-06 17:42 [RFC] JBD ordered mode rewrite Jan Kara
2008-03-06 19:05 ` Josef Bacik
2008-03-10 16:30 ` Jan Kara
2008-03-06 23:53 ` Andrew Morton
2008-03-10 17:38 ` Jan Kara
2008-03-07 1:34 ` Mark Fasheh
2008-03-10 18:00 ` Jan Kara
2008-03-07 10:55 ` Mingming Cao
2008-03-10 18:29 ` Jan Kara
2008-03-07 23:52 ` Andreas Dilger
2008-03-08 0:08 ` Mingming Cao
2008-03-08 12:14 ` Christoph Hellwig
2008-03-10 19:54 ` Jan Kara
2008-03-10 21:37 ` Andreas Dilger
2008-04-25 23:38 ` Possible race between direct IO and JBD? Mingming Cao
2008-04-26 10:41 ` Andrew Morton
2008-04-28 12:26 ` Jan Kara
2008-04-28 17:11 ` Badari Pulavarty
2008-04-28 18:09 ` Jan Kara
2008-04-28 19:09 ` Mingming Cao
2008-04-29 12:43 ` Jan Kara
2008-04-29 17:49 ` Mingming Cao
2008-05-01 15:16 ` [PATCH] jbd_commit_transaction() races with journal_try_to_drop_buffers() causing DIO failures Badari Pulavarty
2008-05-01 22:08 ` Mingming Cao
2008-05-05 17:06 ` Jan Kara
2008-05-05 17:53 ` Mingming Cao
2008-05-06 0:10 ` Badari Pulavarty
2008-05-09 22:27 ` Mingming Cao
2008-05-09 22:39 ` [PATCH] JBD:need hold j_state_lock to updates to transaction t_state to T_COMMIT Mingming Cao
2008-05-12 9:34 ` Jan Kara
2008-05-12 15:54 ` Jan Kara [this message]
2008-05-12 19:23 ` [PATCH] jbd_commit_transaction() races with journal_try_to_drop_buffers() causing DIO failures Mingming Cao
2008-05-13 14:20 ` Jan Kara
2008-05-13 0:39 ` Mingming Cao
2008-05-13 14:54 ` Jan Kara
2008-05-13 16:37 ` Mingming Cao
2008-05-13 22:23 ` Mingming Cao
2008-05-14 17:08 ` Jan Kara
2008-05-14 17:41 ` Mingming Cao
2008-05-14 18:14 ` Jan Kara
2008-05-16 14:13 ` Mingming Cao
2008-05-16 14:14 ` [PATCH] Fix DIO EIO error caused by race between jbd_commit_transaction() and journal_try_to_drop_buffers() Mingming Cao
2008-05-16 15:01 ` Josef Bacik
2008-05-16 17:11 ` Mingming Cao
2008-05-16 17:17 ` Badari Pulavarty
2008-05-16 17:30 ` Mingming Cao
2008-05-16 17:12 ` Badari Pulavarty
2008-05-16 21:01 ` [PATCH] JBD: Fix DIO EIO error caused by race between free buffer and commit trasanction Mingming Cao
2008-05-18 22:37 ` Jan Kara
2008-05-19 19:59 ` Mingming Cao
2008-05-19 20:25 ` Andrew Morton
2008-05-19 22:07 ` Mingming Cao
2008-05-20 9:30 ` Jens Axboe
2008-05-20 17:47 ` Mingming Cao
2008-05-20 18:02 ` [PATCH-v2] JBD: Fix " Mingming Cao
2008-05-20 23:53 ` Jan Kara
2008-05-21 17:14 ` Mingming
2008-05-24 22:44 ` Jan Kara
2008-05-28 18:18 ` Mingming Cao
2008-05-28 18:55 ` Jan Kara
2008-05-29 0:15 ` Mingming Cao
2008-05-29 0:16 ` [PATCH][take 5] " Mingming Cao
2008-05-29 0:18 ` [PATCH][take 5] JBD2: " Mingming Cao
2008-05-30 6:24 ` Aneesh Kumar K.V
2008-05-30 15:17 ` Mingming Cao
2008-05-21 23:38 ` [PATCH 1/2][TAKE3] JBD: " Mingming
2008-05-22 5:57 ` Andrew Morton
2008-05-21 23:39 ` [PATCH 2/2][TAKE3] JBD2: " Mingming
2008-05-20 18:03 ` [PATCH -v2] JBD2: Fix race between journal " Mingming Cao
2008-05-16 21:01 ` [PATCH] JBD2: Fix DIO EIO error caused by race between " Mingming Cao
2008-05-09 22:39 ` [PATCH] JBD2:need hold j_state_lock to updates to transaction t_state to T_COMMIT Mingming Cao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080512155419.GD15856@duck.suse.cz \
--to=jack@suse.cz \
--cc=akpm@linux-foundation.org \
--cc=cmm@us.ibm.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=pbadari@us.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).