* [PATCH 0/5] xfs: fixes for 3.10-rc2
@ 2013-05-17 1:10 Dave Chinner
2013-05-17 1:10 ` [PATCH 1/5] xfs: fix sub-page blocksize data integrity writes Dave Chinner
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: Dave Chinner @ 2013-05-17 1:10 UTC (permalink / raw)
To: xfs
Hi Ben,
The following patches are fixing regressions, data integrity
problems and lockdep reports. They are all candidates for 3.10-rc2.
Please consider them for 3.10-rc2.
Cheers,
Dave.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 1/5] xfs: fix sub-page blocksize data integrity writes
2013-05-17 1:10 [PATCH 0/5] xfs: fixes for 3.10-rc2 Dave Chinner
@ 2013-05-17 1:10 ` Dave Chinner
2013-05-17 1:10 ` [PATCH 2/5] xfs: fix rounding in xfs_free_file_space Dave Chinner
` (5 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2013-05-17 1:10 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
FSX on 512 byte block size filesystems has been failing for some
time with corrupted data. The fault dates back to the change in
the writeback data integrity algorithm that uses a mark-and-sweep
approach to avoid data writeback livelocks.
Unfortunately, a side effect of this mark-and-sweep approach is that
each page will only be written once for a data integrity sync, and
there is a condition in writeback in XFS where a page may require
two writeback attempts to be fully written. As a result of the high
level change, we now only get a partial page writeback during the
integrity sync because the first pass through writeback clears the
mark left on the page index to tell writeback that the page needs
writeback....
The cause is writing a partial page in the clustering code. This can
happen when a mapping boundary falls in the middle of a page - we
end up writing back the first part of the page that the mapping
covers, but then never revisit the page to have the remainder mapped
and written.
The fix is simple - if the mapping boundary falls inside a page,
then simple abort clustering without touching the page. This means
that the next ->writepage entry that write_cache_pages() will make
is the page we aborted on, and xfs_vm_writepage() will map all
sections of the page correctly. This behaviour is also optimal for
non-data integrity writes, as it results in contiguous sequential
writeback of the file rather than missing small holes and having to
write them a "random" writes in a future pass.
With this fix, all the fsx tests in xfstests now pass on a 512 byte
block size filesystem on a 4k page machine.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_aops.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2b2691b..f04eceb 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -725,6 +725,25 @@ xfs_convert_page(
(xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT,
i_size_read(inode));
+ /*
+ * If the current map does not span the entire page we are about to try
+ * to write, then give up. The only way we can write a page that spans
+ * multiple mappings in a single writeback iteration is via the
+ * xfs_vm_writepage() function. Data integrity writeback requires the
+ * entire page to be written in a single attempt, otherwise the part of
+ * the page we don't write here doesn't get written as part of the data
+ * integrity sync.
+ *
+ * For normal writeback, we also don't attempt to write partial pages
+ * here as it simply means that write_cache_pages() will see it under
+ * writeback and ignore the page until some pointin the future, at which
+ * time this will be the only page inteh file that needs writeback.
+ * Hence for more optimal IO patterns, we should always avoid partial
+ * page writeback due to multiple mappings on a page here.
+ */
+ if (!xfs_imap_valid(inode, imap, end_offset))
+ goto fail_unlock_page;
+
len = 1 << inode->i_blkbits;
p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1),
PAGE_CACHE_SIZE);
--
1.7.10.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 2/5] xfs: fix rounding in xfs_free_file_space
2013-05-17 1:10 [PATCH 0/5] xfs: fixes for 3.10-rc2 Dave Chinner
2013-05-17 1:10 ` [PATCH 1/5] xfs: fix sub-page blocksize data integrity writes Dave Chinner
@ 2013-05-17 1:10 ` Dave Chinner
2013-05-17 1:10 ` [PATCH 3/5] xfs: Don't reference the EFI after it is freed Dave Chinner
` (4 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2013-05-17 1:10 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
The offset passed into xfs_free_file_space() needs to be rounded
down to a certain size, but the rounding mask is built by a 32 bit
variable. Hence the mask will always mask off the upper 32 bits of
the offset and lead to incorrect writeback and invalidation ranges.
This is not actually exposed as a bug because we writeback and
invalidate from the rounded offset to the end of the file, and hence
the offset we are actually punching a hole out of will always be
covered by the code. This needs fixing, however, if we ever want to
use exact ranges for writeback/invalidation here...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_vnodeops.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 1501f4f..0176bb2 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -1453,7 +1453,7 @@ xfs_free_file_space(
xfs_mount_t *mp;
int nimap;
uint resblks;
- uint rounding;
+ xfs_off_t rounding;
int rt;
xfs_fileoff_t startoffset_fsb;
xfs_trans_t *tp;
@@ -1482,7 +1482,7 @@ xfs_free_file_space(
inode_dio_wait(VFS_I(ip));
}
- rounding = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
+ rounding = max_t(xfs_off_t, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
ioffset = offset & ~(rounding - 1);
error = -filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
ioffset, -1);
--
1.7.10.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 3/5] xfs: Don't reference the EFI after it is freed
2013-05-17 1:10 [PATCH 0/5] xfs: fixes for 3.10-rc2 Dave Chinner
2013-05-17 1:10 ` [PATCH 1/5] xfs: fix sub-page blocksize data integrity writes Dave Chinner
2013-05-17 1:10 ` [PATCH 2/5] xfs: fix rounding in xfs_free_file_space Dave Chinner
@ 2013-05-17 1:10 ` Dave Chinner
2013-05-17 1:10 ` [PATCH 4/5] xfs: avoid nesting transactions in xfs_qm_scall_setqlim() Dave Chinner
` (3 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2013-05-17 1:10 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Checking the EFI for whether it is being released from recovery
after we've already released the known active reference is a mistake
worthy of a brown paper bag. Fix the (now) obvious use after free
that it can cause.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_extfree_item.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index c0f3750..452920a 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -305,11 +305,12 @@ xfs_efi_release(xfs_efi_log_item_t *efip,
{
ASSERT(atomic_read(&efip->efi_next_extent) >= nextents);
if (atomic_sub_and_test(nextents, &efip->efi_next_extent)) {
- __xfs_efi_release(efip);
-
/* recovery needs us to drop the EFI reference, too */
if (test_bit(XFS_EFI_RECOVERED, &efip->efi_flags))
__xfs_efi_release(efip);
+
+ __xfs_efi_release(efip);
+ /* efip may now have been freed, do not reference it again. */
}
}
--
1.7.10.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 4/5] xfs: avoid nesting transactions in xfs_qm_scall_setqlim()
2013-05-17 1:10 [PATCH 0/5] xfs: fixes for 3.10-rc2 Dave Chinner
` (2 preceding siblings ...)
2013-05-17 1:10 ` [PATCH 3/5] xfs: Don't reference the EFI after it is freed Dave Chinner
@ 2013-05-17 1:10 ` Dave Chinner
2013-05-17 1:10 ` [PATCH 5/5] xfs: fix missing KM_NOFS tags to keep lockdep happy Dave Chinner
` (2 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2013-05-17 1:10 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Lockdep reports:
=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #3 Not tainted
---------------------------------------------
setquota/28368 is trying to acquire lock:
(sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50
but task is already holding lock:
(sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50
from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be
allocated.
xfs_qm_scall_setqlim() is starting a transaction and then not
passing it into xfs_qm_dqet() and so it starts it's own transaction
when allocating the dquot. Splat!
Fix this by not allocating the dquot in xfs_qm_scall_setqlim()
inside the setqlim transaction. This requires getting the dquot
first (and allocating it if necessary) then dropping and relocking
the dquot before joining it to the setqlim transaction.
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_qm_syscalls.c | 35 ++++++++++++++++++++---------------
1 file changed, 20 insertions(+), 15 deletions(-)
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index c41190c..dfa5c05 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -489,31 +489,36 @@ xfs_qm_scall_setqlim(
if ((newlim->d_fieldmask & XFS_DQ_MASK) == 0)
return 0;
- tp = xfs_trans_alloc(mp, XFS_TRANS_QM_SETQLIM);
- error = xfs_trans_reserve(tp, 0, XFS_QM_SETQLIM_LOG_RES(mp),
- 0, 0, XFS_DEFAULT_LOG_COUNT);
- if (error) {
- xfs_trans_cancel(tp, 0);
- return (error);
- }
-
/*
* We don't want to race with a quotaoff so take the quotaoff lock.
- * (We don't hold an inode lock, so there's nothing else to stop
- * a quotaoff from happening). (XXXThis doesn't currently happen
- * because we take the vfslock before calling xfs_qm_sysent).
+ * We don't hold an inode lock, so there's nothing else to stop
+ * a quotaoff from happening.
*/
mutex_lock(&q->qi_quotaofflock);
/*
- * Get the dquot (locked), and join it to the transaction.
- * Allocate the dquot if this doesn't exist.
+ * Get the dquot (locked) before we start, as we need to do a
+ * transaction to allocate it if it doesn't exist. Once we have the
+ * dquot, unlock it so we can start the next transaction safely. We hold
+ * a reference to the dquot, so it's safe to do this unlock/lock without
+ * it being reclaimed in the mean time.
*/
- if ((error = xfs_qm_dqget(mp, NULL, id, type, XFS_QMOPT_DQALLOC, &dqp))) {
- xfs_trans_cancel(tp, XFS_TRANS_ABORT);
+ error = xfs_qm_dqget(mp, NULL, id, type, XFS_QMOPT_DQALLOC, &dqp);
+ if (error) {
ASSERT(error != ENOENT);
goto out_unlock;
}
+ xfs_dqunlock(dqp);
+
+ tp = xfs_trans_alloc(mp, XFS_TRANS_QM_SETQLIM);
+ error = xfs_trans_reserve(tp, 0, XFS_QM_SETQLIM_LOG_RES(mp),
+ 0, 0, XFS_DEFAULT_LOG_COUNT);
+ if (error) {
+ xfs_trans_cancel(tp, 0);
+ goto out_unlock;
+ }
+
+ xfs_dqlock(dqp);
xfs_trans_dqjoin(tp, dqp);
ddq = &dqp->q_core;
--
1.7.10.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 5/5] xfs: fix missing KM_NOFS tags to keep lockdep happy
2013-05-17 1:10 [PATCH 0/5] xfs: fixes for 3.10-rc2 Dave Chinner
` (3 preceding siblings ...)
2013-05-17 1:10 ` [PATCH 4/5] xfs: avoid nesting transactions in xfs_qm_scall_setqlim() Dave Chinner
@ 2013-05-17 1:10 ` Dave Chinner
2013-05-17 10:15 ` [PATCH 6/5] xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC Dave Chinner
2013-05-17 21:43 ` [PATCH 0/5] xfs: fixes for 3.10-rc2 Ben Myers
6 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2013-05-17 1:10 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
There are several places where we use KM_SLEEP allocation contexts
and use the fact that there are called from transaction context to
add KM_NOFS where appropriate. Unfortunately, there are several
places where the code makes this assumption but can be called from
outside transaction context but with filesystem locks held. These
places need explicit KM_NOFS annotations to avoid lockdep
complaining about reclaim contexts.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_buf.c | 2 +-
fs/xfs/xfs_da_btree.c | 6 ++++--
fs/xfs/xfs_dir2_leaf.c | 2 +-
fs/xfs/xfs_log_cil.c | 2 +-
4 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 82b70bd..0d25542 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1649,7 +1649,7 @@ xfs_alloc_buftarg(
{
xfs_buftarg_t *btp;
- btp = kmem_zalloc(sizeof(*btp), KM_SLEEP);
+ btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOFS);
btp->bt_mount = mp;
btp->bt_dev = bdev->bd_dev;
diff --git a/fs/xfs/xfs_da_btree.c b/fs/xfs/xfs_da_btree.c
index 9b26a99..41ea7e1 100644
--- a/fs/xfs/xfs_da_btree.c
+++ b/fs/xfs/xfs_da_btree.c
@@ -2464,7 +2464,8 @@ xfs_buf_map_from_irec(
ASSERT(nirecs >= 1);
if (nirecs > 1) {
- map = kmem_zalloc(nirecs * sizeof(struct xfs_buf_map), KM_SLEEP);
+ map = kmem_zalloc(nirecs * sizeof(struct xfs_buf_map),
+ KM_SLEEP | KM_NOFS);
if (!map)
return ENOMEM;
*mapp = map;
@@ -2520,7 +2521,8 @@ xfs_dabuf_map(
* Optimize the one-block case.
*/
if (nfsb != 1)
- irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP);
+ irecs = kmem_zalloc(sizeof(irec) * nfsb,
+ KM_SLEEP | KM_NOFS);
nirecs = nfsb;
error = xfs_bmapi_read(dp, (xfs_fileoff_t)bno, nfsb, irecs,
diff --git a/fs/xfs/xfs_dir2_leaf.c b/fs/xfs/xfs_dir2_leaf.c
index 721ba2f..da71a18 100644
--- a/fs/xfs/xfs_dir2_leaf.c
+++ b/fs/xfs/xfs_dir2_leaf.c
@@ -1336,7 +1336,7 @@ xfs_dir2_leaf_getdents(
mp->m_sb.sb_blocksize);
map_info = kmem_zalloc(offsetof(struct xfs_dir2_leaf_map_info, map) +
(length * sizeof(struct xfs_bmbt_irec)),
- KM_SLEEP);
+ KM_SLEEP | KM_NOFS);
map_info->map_size = length;
/*
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index e3d0b85..d0833b5 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -139,7 +139,7 @@ xlog_cil_prepare_log_vecs(
new_lv = kmem_zalloc(sizeof(*new_lv) +
niovecs * sizeof(struct xfs_log_iovec),
- KM_SLEEP);
+ KM_SLEEP|KM_NOFS);
/* The allocated iovec region lies beyond the log vector. */
new_lv->lv_iovecp = (struct xfs_log_iovec *)&new_lv[1];
--
1.7.10.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 6/5] xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC
2013-05-17 1:10 [PATCH 0/5] xfs: fixes for 3.10-rc2 Dave Chinner
` (4 preceding siblings ...)
2013-05-17 1:10 ` [PATCH 5/5] xfs: fix missing KM_NOFS tags to keep lockdep happy Dave Chinner
@ 2013-05-17 10:15 ` Dave Chinner
2013-05-17 21:43 ` [PATCH 0/5] xfs: fixes for 3.10-rc2 Ben Myers
6 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2013-05-17 10:15 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_da_btree.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/xfs/xfs_da_btree.c b/fs/xfs/xfs_da_btree.c
index 41ea7e1..0b8b2a1 100644
--- a/fs/xfs/xfs_da_btree.c
+++ b/fs/xfs/xfs_da_btree.c
@@ -270,6 +270,7 @@ xfs_da3_node_read_verify(
break;
return;
case XFS_ATTR_LEAF_MAGIC:
+ case XFS_ATTR3_LEAF_MAGIC:
bp->b_ops = &xfs_attr3_leaf_buf_ops;
bp->b_ops->verify_read(bp);
return;
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH 0/5] xfs: fixes for 3.10-rc2
2013-05-17 1:10 [PATCH 0/5] xfs: fixes for 3.10-rc2 Dave Chinner
` (5 preceding siblings ...)
2013-05-17 10:15 ` [PATCH 6/5] xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC Dave Chinner
@ 2013-05-17 21:43 ` Ben Myers
6 siblings, 0 replies; 8+ messages in thread
From: Ben Myers @ 2013-05-17 21:43 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
Hi Dave,
On Fri, May 17, 2013 at 11:10:24AM +1000, Dave Chinner wrote:
> The following patches are fixing regressions, data integrity
> problems and lockdep reports. They are all candidates for 3.10-rc2.
> Please consider them for 3.10-rc2.
I'll take a look, thanks.
-Ben
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2013-05-17 21:43 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-05-17 1:10 [PATCH 0/5] xfs: fixes for 3.10-rc2 Dave Chinner
2013-05-17 1:10 ` [PATCH 1/5] xfs: fix sub-page blocksize data integrity writes Dave Chinner
2013-05-17 1:10 ` [PATCH 2/5] xfs: fix rounding in xfs_free_file_space Dave Chinner
2013-05-17 1:10 ` [PATCH 3/5] xfs: Don't reference the EFI after it is freed Dave Chinner
2013-05-17 1:10 ` [PATCH 4/5] xfs: avoid nesting transactions in xfs_qm_scall_setqlim() Dave Chinner
2013-05-17 1:10 ` [PATCH 5/5] xfs: fix missing KM_NOFS tags to keep lockdep happy Dave Chinner
2013-05-17 10:15 ` [PATCH 6/5] xfs: xfs_da3_node_read_verify() doesn't handle XFS_ATTR3_LEAF_MAGIC Dave Chinner
2013-05-17 21:43 ` [PATCH 0/5] xfs: fixes for 3.10-rc2 Ben Myers
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox