From: "Jörn Engel" <joern@lazybastard.org>
To: linux-fsdevel@vger.kernel.org
Cc: Anton Altaparmakov <aia21@cam.ac.uk>, David Chinner <dgc@sgi.com>,
Dave Kleikamp <shaggy@linux.vnet.ibm.com>,
Al Viro <viro@ftp.linux.org.uk>,
Christoph Hellwig <hch@infradead.org>
Subject: [RFC] The many faces of the I_LOCK
Date: Wed, 21 Feb 2007 13:09:56 +0000 [thread overview]
Message-ID: <20070221130956.GB464@lazybastard.org> (raw)
1. Introduction
This lengthy investigation was caused by a deadlock problem in LogFS,
but uncovered a more general problem. It affects, at the least, all
filesystems that need to read inodes in their write path. To my
knowledge, that includes LogFS and NTFS, possibly also JFS and XFS.
Deadlock happens when two processes A and B (that may be identical) have
these two call chains:
Process A: Process B:
inode_wait [filesystem locking write path]
__wait_on_bit __writeback_single_inode
out_of_line_wait_on_bit
ifind_fast
[filesystem calling iget()]
[filesystem locking write path]
Process A will wait_on_inode() and block until process B proceeds
through __sync_single_inode(), which is called from
__writeback_single_inode() in Process B. Process B will block on the
lock of the filesystem write path, held by Process A.
After investigating several quick solutions - none of them worked - I
started asking the question of why ifind_fast() should wait for
__sync_single_inode() to finish. Does that make sense? So I tried to
describe what I_LOCK is used for. If I made any mistakes in my
investigations, please correct me.
2. The usage of inode_lock and I_LOCK
Almost all modifications of inodes are protected by the inode_lock, a
global spinlock. Some modifications, however, can block for various
reasons and require the inode_lock to get dropped temporarily. In the
meantime, the individual inode needs to get protected somehow. Usually
this happens through the use of I_LOCK.
But I_LOCK is not a simple mutex. It is a Janus-faced bit in the inode
that is used for several things, including mutual exclusion and
completion notification. Most users are open-coded, so it is not easy
to follow, but can be summarized in the table below.
In this table columns indicate events when I_LOCK is either set or
reset (or not reset but all waiters are notified anyway). Rows
indicate code that either checks for I_LOCK and changes behaviour
depending on its state or is waiting until I_LOCK gets reset (or is
waiting even if I_LOCK is not set).
__sync_single_inode
| get_new_inode[_fast]
| | unlock_new_inode
| | | dispose_list
| | | | generic_delete_inode
| | | | |
generic_forget_inode
lock v v | | | |
unlock/complete v v v v v comment
-------------------------------------------------------------------------------
__writeback_single_inodeX O O O O sync
write_inode_now X O O O O sync
clear_inode X O O O O sync
__mark_inode_dirty X O O O O lists
generic_osync_inode ? ? ? ? ? ?
get_new_inode[_fast] O X O O O mutex
find_inode[_fast] O O X X X
I_FREEING
ifind[_fast] O X O O O read
jfs txCommit ? ? ? ? ? ?
xfs_ichgtime[_fast] ? ? ? ? ? ?
Comments:
sync - wait for writeout to finish
lists - move inode to dirty list without racing against
__sync_single_inode
mutex - protect against two concurrent get_new_inode[_fast] creating two
inodes
I_FREEING - wait for inode to get freed, then repeat
read - don't return inode until it is read from medium
Now, the "X"s mark combinations where columns and rows are related.
"O"s mark combinations where afaiks columns and rows share no
relationship whatsoever except that both use either I_LOCK or
wake_up_inode()/wait_on_inode() or any other of the partially open-coded
variants. Three rows have not exposed their meaning to me yet, so I'd
gladly receive some insight here.
3. Seperating out sync notification
One of the results from the investigations in 2 appears to be that one
class of users in fs/fs-writeback.c is completely unrelated to another
class of users in fs/inode.c.
In particular, __sync_single_inode(), __writeback_single_inode(),
write_inode_now(), clear_inode(), __mark_inode_dirty() and (possibly?)
generic_osync_inode() seem to only need a completion event to
synchronize with. There is no reason why this group should share a lock
with iget() and any of its many permutations.
Now, if the group in fs/fs-writeback.c had a completion event that is
independent of anything in fs/inode.c, the deadlock scenario described
in section 1 goes away. As a further result, ilookup5_nowait() can get
removed as its only user, NTFS, only introduced it as a workaround for
the deadlock scenario.
Potentially the checks in JFS and XFS can also get removed, but I don't
claim to understand their meaning yet.
Comments?
Jörn
--
Prosperity makes friends, adversity tries them.
-- Publilius Syrus
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next reply other threads:[~2007-02-21 13:13 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-02-21 13:09 Jörn Engel [this message]
2007-02-21 18:29 ` [RFC] The many faces of the I_LOCK David Chinner
2007-02-21 18:47 ` Jörn Engel
2007-02-21 18:57 ` Jörn Engel
2007-02-21 18:54 ` [PATCH 1/3] Replace I_LOCK with I_SYNC for fs/fs-writeback.c uses Jörn Engel
2007-02-21 18:55 ` [PATCH] [NTFS] Remove ilookup5_nowait and convert users to ilookup5 Jörn Engel
2007-02-21 18:56 ` [PATCH 3/3] Replace I_LOCK with I_SYNC in XFS Jörn Engel
2007-02-22 11:40 ` [RFC] The many faces of the I_LOCK Jörn Engel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20070221130956.GB464@lazybastard.org \
--to=joern@lazybastard.org \
--cc=aia21@cam.ac.uk \
--cc=dgc@sgi.com \
--cc=hch@infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=shaggy@linux.vnet.ibm.com \
--cc=viro@ftp.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).