From: Greg KH <gregkh@suse.de>
To: linux-kernel@vger.kernel.org, stable@kernel.org
Cc: stable-review@kernel.org, torvalds@linux-foundation.org,
akpm@linux-foundation.org, alan@lxorguk.ukuu.org.uk,
Akira Fujita <a-fujita@rs.jp.nec.com>,
"Theodore Tso" <tytso@mit.edu>,
Greg Kroah-Hartman <gregkh@suse.de>
Subject: [60/90] ext4: fix lock order problem in ext4_move_extents()
Date: Thu, 10 Dec 2009 20:25:38 -0800
Message-ID: <20091211042805.985279951@linux.site>
In-Reply-To: <20091211043502.GA17916@kroah.com>
[-- Attachment #1: 0060-ext4-fix-lock-order-problem-in-ext4_move_extents.patch --]
[-- Type: text/plain, Size: 10472 bytes --]
2.6.31-stable review patch. If anyone has any objections, please let us know.
------------------
(cherry picked from commit fc04cb49a898c372a22b21fffc47f299d8710801)
ext4_move_extents() checks the logical block contiguity of the
original file with ext4_ext_find_extent() and mext_next_extent().
Therefore the extent that the ext4_ext_path structure points to
must not change between these two calls.
But in the current implementation there is no i_data_sem protection
between ext4_ext_find_extent() and mext_next_extent(), so the extent
that the ext4_ext_path structure points to may be overwritten by
delalloc. As a result, ext4_move_extents() will exchange the wrong
blocks between the original and donor files. I changed where
i_data_sem is acquired and released to solve this problem.
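
A minimal userspace sketch of the hazard described above, not code from
this patch. All demo_* names are hypothetical. The generation counter is
only a device to make a stale path visible in a single-threaded example.
The kernel code has no such check and relies on holding i_data_sem
across both calls instead.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical miniature of an extent list. */
struct demo_tree {
	unsigned int generation;   /* bumped on every modification */
	int extents[4];            /* start blocks, for illustration only */
	int count;                 /* number of valid entries in extents[] */
};

/* Hypothetical cached position, in the role of struct ext4_ext_path. */
struct demo_path {
	unsigned int generation;   /* tree generation when the path was built */
	int index;                 /* which extent the path points at */
};

/* Step 1: build a path to the extent at @index. */
static void demo_find_extent(const struct demo_tree *t, int index,
			     struct demo_path *p)
{
	p->generation = t->generation;
	p->index = index;
}

/* A writer changes the tree (delalloc, in the description above). */
static void demo_modify(struct demo_tree *t, int index, int new_start)
{
	t->extents[index] = new_start;
	t->generation++;
}

/* Step 2: advance to the next extent; refuses to use a stale path. */
static int demo_next_extent(const struct demo_tree *t, struct demo_path *p,
			    int *start)
{
	if (p->generation != t->generation)
		return -ESTALE;    /* tree changed between step 1 and step 2 */
	if (p->index + 1 >= t->count)
		return -ENODATA;   /* no further extent */
	p->index++;
	*start = t->extents[p->index];
	return 0;
}
```

If demo_modify() runs between demo_find_extent() and demo_next_extent(),
the second step reports -ESTALE. Holding one lock across both steps, as
this patch does with i_data_sem, rules that interleaving out.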
Moreover, I changed move_extent_per_page() to start the transaction
first and then acquire i_data_sem. Without this change, a deadlock
is possible between mmap() and ext4_move_extents():
* NOTE: "A", "B" and "C" are different processes
A-1: ext4_move_extents() acquires i_data_sem of the two inodes.
B: do_page_fault() starts transaction (T),
and then tries to acquire i_data_sem.
Process "A" already holds it, so "B" is kept waiting.
C: While "A" and "B" are running, kjournald2 tries to commit
transaction (T), but (T) is still being updated, so kjournald2
waits for it.
A-2: Calls ext4_journal_start() while holding i_data_sem,
but transaction (T) is locked.
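
The ordering rule behind the scenario above (start the transaction
first, take i_data_sem second) can be sketched as a tiny rank checker.
This is an illustrative userspace model, not kernel code. The demo_*
names and the numeric ranks are assumptions made for the example, and
the i_data_sem pair of both inodes is treated as a single acquisition.

```c
#include <assert.h>

/* Hypothetical lock ranks: a lower rank must be taken before a higher one. */
enum demo_rank {
	DEMO_RANK_JOURNAL = 1,   /* stands in for ext4_journal_start() */
	DEMO_RANK_DATA_SEM = 2   /* stands in for i_data_sem of both inodes */
};

#define DEMO_MAX_HELD 4

/* Per-task record of what is currently held, innermost last. */
struct demo_held {
	enum demo_rank rank[DEMO_MAX_HELD];
	int depth;
};

/* Returns 0 if taking @r respects the rank order, -1 if it would invert it. */
static int demo_acquire(struct demo_held *h, enum demo_rank r)
{
	if (h->depth == DEMO_MAX_HELD)
		return -1;
	if (h->depth > 0 && h->rank[h->depth - 1] >= r)
		return -1;
	h->rank[h->depth++] = r;
	return 0;
}

/* Drops the innermost held entry. */
static void demo_release(struct demo_held *h)
{
	assert(h->depth > 0);
	h->depth--;
}
```

Taking DEMO_RANK_JOURNAL and then DEMO_RANK_DATA_SEM succeeds. With
DEMO_RANK_DATA_SEM already held, demo_acquire() for DEMO_RANK_JOURNAL
returns -1, which corresponds to step A-2.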
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
fs/ext4/move_extent.c | 117 ++++++++++++++++++++++----------------------------
1 file changed, 53 insertions(+), 64 deletions(-)
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -77,12 +77,14 @@ static int
mext_next_extent(struct inode *inode, struct ext4_ext_path *path,
struct ext4_extent **extent)
{
+ struct ext4_extent_header *eh;
int ppos, leaf_ppos = path->p_depth;
ppos = leaf_ppos;
if (EXT_LAST_EXTENT(path[ppos].p_hdr) > path[ppos].p_ext) {
/* leaf block */
*extent = ++path[ppos].p_ext;
+ path[ppos].p_block = ext_pblock(path[ppos].p_ext);
return 0;
}
@@ -119,9 +121,18 @@ mext_next_extent(struct inode *inode, st
ext_block_hdr(path[cur_ppos+1].p_bh);
}
+ path[leaf_ppos].p_ext = *extent = NULL;
+
+ eh = path[leaf_ppos].p_hdr;
+ if (le16_to_cpu(eh->eh_entries) == 0)
+ /* empty leaf is found */
+ return -ENODATA;
+
/* leaf block */
path[leaf_ppos].p_ext = *extent =
EXT_FIRST_EXTENT(path[leaf_ppos].p_hdr);
+ path[leaf_ppos].p_block =
+ ext_pblock(path[leaf_ppos].p_ext);
return 0;
}
}
@@ -155,40 +166,15 @@ mext_check_null_inode(struct inode *inod
}
/**
- * mext_double_down_read - Acquire two inodes' read semaphore
- *
- * @orig_inode: original inode structure
- * @donor_inode: donor inode structure
- * Acquire read semaphore of the two inodes (orig and donor) by i_ino order.
- */
-static void
-mext_double_down_read(struct inode *orig_inode, struct inode *donor_inode)
-{
- struct inode *first = orig_inode, *second = donor_inode;
-
- /*
- * Use the inode number to provide the stable locking order instead
- * of its address, because the C language doesn't guarantee you can
- * compare pointers that don't come from the same array.
- */
- if (donor_inode->i_ino < orig_inode->i_ino) {
- first = donor_inode;
- second = orig_inode;
- }
-
- down_read(&EXT4_I(first)->i_data_sem);
- down_read(&EXT4_I(second)->i_data_sem);
-}
-
-/**
- * mext_double_down_write - Acquire two inodes' write semaphore
+ * double_down_write_data_sem - Acquire two inodes' write lock of i_data_sem
*
* @orig_inode: original inode structure
* @donor_inode: donor inode structure
- * Acquire write semaphore of the two inodes (orig and donor) by i_ino order.
+ * Acquire write lock of i_data_sem of the two inodes (orig and donor) by
+ * i_ino order.
*/
static void
-mext_double_down_write(struct inode *orig_inode, struct inode *donor_inode)
+double_down_write_data_sem(struct inode *orig_inode, struct inode *donor_inode)
{
struct inode *first = orig_inode, *second = donor_inode;
@@ -207,28 +193,14 @@ mext_double_down_write(struct inode *ori
}
/**
- * mext_double_up_read - Release two inodes' read semaphore
+ * double_up_write_data_sem - Release two inodes' write lock of i_data_sem
*
* @orig_inode: original inode structure to be released its lock first
* @donor_inode: donor inode structure to be released its lock second
- * Release read semaphore of two inodes (orig and donor).
+ * Release write lock of i_data_sem of two inodes (orig and donor).
*/
static void
-mext_double_up_read(struct inode *orig_inode, struct inode *donor_inode)
-{
- up_read(&EXT4_I(orig_inode)->i_data_sem);
- up_read(&EXT4_I(donor_inode)->i_data_sem);
-}
-
-/**
- * mext_double_up_write - Release two inodes' write semaphore
- *
- * @orig_inode: original inode structure to be released its lock first
- * @donor_inode: donor inode structure to be released its lock second
- * Release write semaphore of two inodes (orig and donor).
- */
-static void
-mext_double_up_write(struct inode *orig_inode, struct inode *donor_inode)
+double_up_write_data_sem(struct inode *orig_inode, struct inode *donor_inode)
{
up_write(&EXT4_I(orig_inode)->i_data_sem);
up_write(&EXT4_I(donor_inode)->i_data_sem);
@@ -688,8 +660,6 @@ mext_replace_branches(handle_t *handle,
int replaced_count = 0;
int dext_alen;
- mext_double_down_write(orig_inode, donor_inode);
-
/* Get the original extent for the block "orig_off" */
*err = get_ext_path(orig_inode, orig_off, &orig_path);
if (*err)
@@ -785,7 +755,6 @@ out:
kfree(donor_path);
}
- mext_double_up_write(orig_inode, donor_inode);
return replaced_count;
}
@@ -851,6 +820,11 @@ move_extent_per_page(struct file *o_filp
* Just swap data blocks between orig and donor.
*/
if (uninit) {
+ /*
+ * Protect extent trees against block allocations
+ * via delalloc
+ */
+ double_down_write_data_sem(orig_inode, donor_inode);
replaced_count = mext_replace_branches(handle, orig_inode,
donor_inode, orig_blk_offset,
block_len_in_page, err);
@@ -858,6 +832,7 @@ move_extent_per_page(struct file *o_filp
/* Clear the inode cache not to refer to the old data */
ext4_ext_invalidate_cache(orig_inode);
ext4_ext_invalidate_cache(donor_inode);
+ double_up_write_data_sem(orig_inode, donor_inode);
goto out2;
}
@@ -905,6 +880,8 @@ move_extent_per_page(struct file *o_filp
/* Release old bh and drop refs */
try_to_release_page(page, 0);
+ /* Protect extent trees against block allocations via delalloc */
+ double_down_write_data_sem(orig_inode, donor_inode);
replaced_count = mext_replace_branches(handle, orig_inode, donor_inode,
orig_blk_offset, block_len_in_page,
&err2);
@@ -913,14 +890,18 @@ move_extent_per_page(struct file *o_filp
block_len_in_page = replaced_count;
replaced_size =
block_len_in_page << orig_inode->i_blkbits;
- } else
+ } else {
+ double_up_write_data_sem(orig_inode, donor_inode);
goto out;
+ }
}
/* Clear the inode cache not to refer to the old data */
ext4_ext_invalidate_cache(orig_inode);
ext4_ext_invalidate_cache(donor_inode);
+ double_up_write_data_sem(orig_inode, donor_inode);
+
if (!page_has_buffers(page))
create_empty_buffers(page, 1 << orig_inode->i_blkbits, 0);
@@ -1236,16 +1217,16 @@ ext4_move_extents(struct file *o_filp, s
return -EINVAL;
}
- /* protect orig and donor against a truncate */
+ /* Protect orig and donor inodes against a truncate */
ret1 = mext_inode_double_lock(orig_inode, donor_inode);
if (ret1 < 0)
return ret1;
- mext_double_down_read(orig_inode, donor_inode);
+ /* Protect extent tree against block allocations via delalloc */
+ double_down_write_data_sem(orig_inode, donor_inode);
/* Check the filesystem environment whether move_extent can be done */
ret1 = mext_check_arguments(orig_inode, donor_inode, orig_start,
donor_start, &len, *moved_len);
- mext_double_up_read(orig_inode, donor_inode);
if (ret1)
goto out;
@@ -1308,6 +1289,10 @@ ext4_move_extents(struct file *o_filp, s
ext4_ext_get_actual_len(ext_cur), block_end + 1) -
max(le32_to_cpu(ext_cur->ee_block), block_start);
+ /* Discard preallocations of two inodes */
+ ext4_discard_preallocations(orig_inode);
+ ext4_discard_preallocations(donor_inode);
+
while (!last_extent && le32_to_cpu(ext_cur->ee_block) <= block_end) {
seq_blocks += add_blocks;
@@ -1359,14 +1344,14 @@ ext4_move_extents(struct file *o_filp, s
seq_start = le32_to_cpu(ext_cur->ee_block);
rest_blocks = seq_blocks;
- /* Discard preallocations of two inodes */
- down_write(&EXT4_I(orig_inode)->i_data_sem);
- ext4_discard_preallocations(orig_inode);
- up_write(&EXT4_I(orig_inode)->i_data_sem);
-
- down_write(&EXT4_I(donor_inode)->i_data_sem);
- ext4_discard_preallocations(donor_inode);
- up_write(&EXT4_I(donor_inode)->i_data_sem);
+ /*
+ * Up semaphore to avoid following problems:
+ * a. transaction deadlock among ext4_journal_start,
+ * ->write_begin via pagefault, and jbd2_journal_commit
+ * b. racing with ->readpage, ->write_begin, and ext4_get_block
+ * in move_extent_per_page
+ */
+ double_up_write_data_sem(orig_inode, donor_inode);
while (orig_page_offset <= seq_end_page) {
@@ -1381,14 +1366,14 @@ ext4_move_extents(struct file *o_filp, s
/* Count how many blocks we have exchanged */
*moved_len += block_len_in_page;
if (ret1 < 0)
- goto out;
+ break;
if (*moved_len > len) {
ext4_error(orig_inode->i_sb, __func__,
"We replaced blocks too much! "
"sum of replaced: %llu requested: %llu",
*moved_len, len);
ret1 = -EIO;
- goto out;
+ break;
}
orig_page_offset++;
@@ -1400,6 +1385,10 @@ ext4_move_extents(struct file *o_filp, s
block_len_in_page = rest_blocks;
}
+ double_down_write_data_sem(orig_inode, donor_inode);
+ if (ret1 < 0)
+ break;
+
/* Decrease buffer counter */
if (holecheck_path)
ext4_ext_drop_refs(holecheck_path);
@@ -1429,7 +1418,7 @@ out:
ext4_ext_drop_refs(holecheck_path);
kfree(holecheck_path);
}
-
+ double_up_write_data_sem(orig_inode, donor_inode);
ret2 = mext_inode_double_unlock(orig_inode, donor_inode);
if (ret1)