From: Tao Ma <tm@tao.ma>
To: Jan Kara <jack@suse.cz>
Cc: Jiaying Zhang <jiayingz@google.com>,
Michael Tokarev <mjt@tls.msk.ru>,
linux-ext4@vger.kernel.org, sandeen@redhat.com
Subject: Re: DIO process stuck apparently due to dioread_nolock (3.0)
Date: Tue, 16 Aug 2011 23:03:44 +0800
Message-ID: <4E4A86D0.2070300@tao.ma>
In-Reply-To: <20110816135325.GD23416@quack.suse.cz>

On 08/16/2011 09:53 PM, Jan Kara wrote:
> On Mon 15-08-11 16:53:34, Jiaying Zhang wrote:
>> On Mon, Aug 15, 2011 at 1:56 AM, Michael Tokarev <mjt@tls.msk.ru> wrote:
>>> 15.08.2011 12:00, Michael Tokarev wrote:
>>> [....]
>>>
>>> So, it looks like this (starting with cold cache):
>>>
>>> 1. rename the redologs and copy them over - this will
>>> make a hot copy of redologs
>>> 2. startup oracle - it will complain that the redologs aren't
>>> redologs, the header is corrupt
>>> 3. shut down oracle, start it up again - it will succeed.
>>>
>>> If between 1 and 2 you'll issue sync(1) everything will work.
>>> When shutting down, oracle calls fsync(), so that's like
>>> sync(1) again.
>>>
>>> If there will be some time between 1. and 2., everything
>>> will work too.
>>>
>>> Without dioread_nolock I can't trigger the problem no matter
>>> how I tried.
>>>
>>>
>>> A smaller test case. I used redo1.odf file (one of the
>>> redologs) as a test file, any will work.
>>>
>>> $ cp -p redo1.odf temp
>>> $ dd if=temp of=foo iflag=direct count=20
>> Isn't this the expected behavior here? When doing
>> 'cp -p redo1.odf temp', data is copied to temp through
>> buffer write, but there is no guarantee when data will be
>> actually written to disk. Then with 'dd if=temp of=foo
>> iflag=direct count=20', data is read directly from disk.
>> Very likely, the written data hasn't been flushed to disk
>> yet so ext4 returns zero in this case.
> No it's not. Buffered and direct IO is supposed to work correctly
> (although not fast) together. In this particular case we take care to flush
> dirty data from page cache before performing direct IO read... But
> something is broken in this path obviously.
>
> I don't have time to dig into this in detail now but what seems to be the
> problem is that with dioread_nolock option, we don't acquire i_mutex for
> direct IO reads anymore. Thus these reads can compete with
> ext4_end_io_nolock() called from ext4_end_io_work() (this is called under
> i_mutex so without dioread_nolock the race cannot happen).
>
> Hmm, the new writepages code seems to be broken in combination with direct
> IO. Direct IO code expects that when filemap_write_and_wait() finishes,
> data is on disk but with new bio submission code this is not true because
> we clear PageWriteback bit (which is what filemap_fdatawait() waits for) in
> ext4_end_io_buffer_write() but do extent conversion only after that in
> convert workqueue. So the race seems to be there all the time, just without
> dioread_nolock it's much smaller.
You are absolutely right. The real problem is that ext4_direct_IO
begins to work *after* we clear the page writeback flag and *before* we
convert the unwritten extent to a valid state. Some of my traces do show
that. I am working on it now.
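
To make the ordering concrete, here is a toy userspace model of the window
described above. It is only a sketch: every name in it (toy_extent,
toy_submit_write, toy_end_io, toy_convert, toy_direct_read) is made up for
illustration and none of it is kernel code. It assumes a direct read waits
only for the writeback flag and that an extent still marked unwritten reads
back as zeros.

```c
#include <stdbool.h>
#include <string.h>

#define TOY_LEN 8

/* Toy model only: these are not real kernel structures or symbols. */
struct toy_extent {
    bool page_writeback;  /* stands in for the PageWriteback bit */
    bool unwritten;       /* extent not yet converted to written state */
    char disk[TOY_LEN];   /* bytes that have reached the "disk" */
};

/* Buffered data is submitted for writeback into an unwritten extent. */
static void toy_submit_write(struct toy_extent *e, const char *data)
{
    e->page_writeback = true;
    e->unwritten = true;
    strncpy(e->disk, data, TOY_LEN - 1);
    e->disk[TOY_LEN - 1] = '\0';
}

/* I/O completion: the writeback flag is cleared, conversion is deferred. */
static void toy_end_io(struct toy_extent *e)
{
    e->page_writeback = false;
}

/* Deferred work item: the extent is finally marked as written. */
static void toy_convert(struct toy_extent *e)
{
    e->unwritten = false;
}

/*
 * Direct read: returns false while it would still be waiting on writeback.
 * Once writeback is clear it proceeds, and an unwritten extent yields zeros.
 */
static bool toy_direct_read(const struct toy_extent *e, char out[TOY_LEN])
{
    if (e->page_writeback)
        return false;
    if (e->unwritten)
        memset(out, 0, TOY_LEN);
    else
        memcpy(out, e->disk, TOY_LEN);
    return true;
}
```

Calling toy_direct_read() between toy_end_io() and toy_convert() returns
zeros even though the data is already in disk[]. After toy_convert() the
same read returns the data.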
>
> Fixing this is going to be non-trivial - I'm not sure we can really move
> clearing of PageWriteback bit to conversion workqueue. I think we already
> tried that once but it caused deadlocks for some reason...
I just did what you described and yes, I ran into another problem and
am trying to resolve it now. Once it is OK, I will send out the patch.
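
For checking a candidate fix, the smaller cp + dd iflag=direct test case
quoted above could be approximated in C roughly as follows. This is a
sketch under assumptions, not a guaranteed reproducer: the function name
buffered_then_direct, the 0x5a fill pattern and the 4096-byte alignment are
all made up here, and the alignment in particular is assumed rather than
queried from the device. It writes through the page cache without fsync(),
then immediately reads the same range back with O_DIRECT and compares.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096       /* assumed O_DIRECT alignment */

/*
 * Returns 0 if the direct read matches the buffered write, 1 if it differs,
 * and -1 if the check could not be run (for example, O_DIRECT unsupported).
 * The scratch file at `path` is truncated and removed afterwards.
 */
static int buffered_then_direct(const char *path, size_t len)
{
    assert(len > 0 && len % DIO_ALIGN == 0);

    void *wbuf = NULL, *rbuf = NULL;
    int fd = -1, ret = -1;

    if (posix_memalign(&wbuf, DIO_ALIGN, len) ||
        posix_memalign(&rbuf, DIO_ALIGN, len))
        goto out;
    memset(wbuf, 0x5a, len);
    memset(rbuf, 0, len);

    /* Step 1: buffered write, deliberately without fsync(). */
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        goto out;
    if (write(fd, wbuf, len) != (ssize_t)len)
        goto out;
    close(fd);

    /* Step 2: read the same range straight back with O_DIRECT. */
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        goto out;
    if (read(fd, rbuf, len) != (ssize_t)len)
        goto out;

    ret = memcmp(wbuf, rbuf, len) ? 1 : 0;
out:
    if (fd >= 0)
        close(fd);
    unlink(path);
    free(wbuf);
    free(rbuf);
    return ret;
}
```

A return value of 1 on a filesystem mounted with dioread_nolock would
correspond to the stale read seen with dd. A return of 0 does not prove the
race is gone, since the window is timing dependent.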
Thanks
Tao