From: Enrik Berkhan <Enrik.Berkhan@ge.com>
To: linux-ext4@vger.kernel.org
Subject: possible ext4 related deadlock
Date: Fri, 12 Feb 2010 13:49:34 +0100
Message-ID: <4B754E5E.603@ge.com>
Hi,
currently we're experiencing some process hangs that seem to be
ext4-related. (Kernel 2.6.28.10-Blackfin, i.e. with Analog Devices
patches including some memory management changes for NOMMU.)
The situation is as follows:
We have two threads writing to an ext4 filesystem. After several hours
and across about 20 systems, one hang occurs in which
(reconstructed from Alt-SysRq-W output):
1. pdflush waits in start_this_handle
2. kjournald2 waits in jbd2_journal_commit_transaction
3. thread 1 waits in start_this_handle
4. thread 2 waits in
   ext4_da_write_begin
     (start_this_handle succeeded)
     grab_cache_page_write_begin
       __alloc_pages_internal
         try_to_free_pages
           do_try_to_free_pages
             congestion_wait
Actually, thread 2 shouldn't be completely blocked, because
congestion_wait has a timeout if I understand the code correctly.
Unfortunately, I pressed Alt-SysRq-W only once when having a chance to
reproduce the problem on a test system with console access.
When the system is in this state, some external event like telnet login
or killing a monitoring process in an older telnet session by pressing
Ctrl-C makes it continue to work normally. I suspect that this triggers
some memory freeing which allows thread 2 in the example above to get
some pages and continue running.
I had a look at all the recent ext4/jbd2 changes since about 2.6.28 but
couldn't identify anything that would solve this problem. But maybe I
just couldn't identify the right thing.
What I have noticed is that the order of start_this_handle and
grab_cache_page_write_begin has changed between ext3 and ext4:
ext3_write_begin:
	...
	page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;
	*pagep = page;
	handle = ext3_journal_start(inode, needed_blocks);
	...
ext4_{da_}write_begin:
	...
	handle = ext4_journal_start(inode, needed_blocks);
	if (IS_ERR(handle)) {
		ret = PTR_ERR(handle);
		goto out;
	}
	/* We cannot recurse into the filesystem as the transaction is already
	 * started */
	flags |= AOP_FLAG_NOFS;
	page = grab_cache_page_write_begin(mapping, index, flags);
	...
As I understand it, the change in order is what makes AOP_FLAG_NOFS
necessary in the ext4 code.
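To make the concern concrete, here is a toy user-space model of the two
orderings. It is only a sketch and not kernel code: every toy_* name is
made up for illustration, and it assumes that an allocation without
filesystem recursion can reclaim clean pages but cannot write back dirty
filesystem pages. Under that assumption, the handle-first order can end
up holding an open handle while the page allocation makes no progress.

```c
/* Toy user-space model, NOT kernel code. All toy_* names are hypothetical. */
#include <stdbool.h>

enum { TOY_NOFS = 1 };      /* stands in for AOP_FLAG_NOFS */

struct toy_state {
	int free_pages;         /* pages available immediately */
	int clean_pages;        /* reclaimable without filesystem help */
	int dirty_fs_pages;     /* reclaimable only by writing back via the fs */
	bool handle_open;       /* caller currently holds a journal handle */
};

/* Try to reclaim one page; returns true if a page was freed. */
static bool toy_reclaim(struct toy_state *s, int flags)
{
	if (s->clean_pages > 0) {
		s->clean_pages--;
		s->free_pages++;
		return true;
	}
	/* Assumption: dirty fs pages are off limits when TOY_NOFS is set. */
	if (!(flags & TOY_NOFS) && s->dirty_fs_pages > 0) {
		s->dirty_fs_pages--;
		s->free_pages++;
		return true;
	}
	return false;
}

/* Allocate one page, reclaiming if needed. The model simply returns
 * false where a real allocator might wait and retry. */
static bool toy_grab_page(struct toy_state *s, int flags)
{
	if (s->free_pages == 0 && !toy_reclaim(s, flags))
		return false;
	s->free_pages--;
	return true;
}

/* ext3-style order: page first, then journal handle. */
static bool toy_write_begin_page_first(struct toy_state *s)
{
	if (!toy_grab_page(s, 0))
		return false;
	s->handle_open = true;
	return true;
}

/* ext4-style order: journal handle first, then page with TOY_NOFS. */
static bool toy_write_begin_handle_first(struct toy_state *s)
{
	s->handle_open = true;
	return toy_grab_page(s, TOY_NOFS);  /* on failure the handle stays open */
}
```

Starting from a state with no free pages, no clean pages and one dirty
filesystem page, toy_write_begin_page_first() succeeds, whereas
toy_write_begin_handle_first() returns false with handle_open still set.
That matches what thread 2 in the trace above looks like, although the
model says nothing about whether this is what really happens here.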
Might this be the reason for the deadlock? Would it be worth trying to
change the order back or is there a very good reason for the change
between ext3 and ext4?
Or am I looking in a completely wrong place?
Any help would be appreciated.
Enrik