* [PATCH 0/2] xfs: more bug fixes
@ 2014-03-05 1:11 Dave Chinner
2014-03-05 1:11 ` [PATCH 1/2] xfs: xfs_check_page_type buffer checks need help Dave Chinner
2014-03-05 1:11 ` [PATCH 2/2] xfs: inode log reservations are still too small Dave Chinner
0 siblings, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2014-03-05 1:11 UTC (permalink / raw)
To: xfs
Hi folks,
The first patch here fixes the issue that Brian first reported and
posted a candidate fix for here:
http://oss.sgi.com/pipermail/xfs/2014-February/034667.html
It fixes the problems with xfs_check_page_type() and
xfs_convert_page() as well.
The second is a fix for a transaction overrun problem that has been
showing up on RHEL7 kernels quite frequently of late. It has been
reproduced on upstream kernels as well, and the patch should address
the oversight made in previous fixes to the inode reservation size
calculation.
Cheers,
Dave.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/2] xfs: xfs_check_page_type buffer checks need help
2014-03-05 1:11 [PATCH 0/2] xfs: more bug fixes Dave Chinner
@ 2014-03-05 1:11 ` Dave Chinner
2014-03-05 17:06 ` Christoph Hellwig
2014-03-05 22:08 ` Brian Foster
2014-03-05 1:11 ` [PATCH 2/2] xfs: inode log reservations are still too small Dave Chinner
1 sibling, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2014-03-05 1:11 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
xfs_aops_discard_page() was introduced in the following commit:
xfs: truncate delalloc extents when IO fails in writeback
... to clean up left over delalloc ranges after I/O failure in
->writepage(). generic/224 tests for this scenario and occasionally
reproduces panics on sub-4k blocksize filesystems.
The cause of this is failure to clean up the delalloc range on a
page where the first buffer does not match one of the expected
states of xfs_check_page_type(). If a buffer is not unwritten,
delayed or dirty&mapped, xfs_check_page_type() stops and
immediately returns 0.
The stress test of generic/224 creates a scenario where the first
several buffers of a page with delayed buffers are mapped & uptodate
and some subsequent buffer is delayed. If the ->writepage() happens
to fail for this page, xfs_aops_discard_page() incorrectly skips
the entire page.
This then causes later failures either when direct IO maps the range
and finds the stale delayed buffer, or we evict the inode and find
that the inode still has a delayed block reservation accounted to
it.
We can easily fix this xfs_aops_discard_page() failure by making
xfs_check_page_type() check all buffers, but this breaks
xfs_convert_page() more than it is already broken. Indeed,
xfs_convert_page() wants xfs_check_page_type() to tell it if the
first buffers on the pages are of a type that can be aggregated into
the contiguous IO that is already being built.
xfs_convert_page() should not be writing random buffers out of a
page, but the current behaviour will cause it to do so if there are
buffers that don't match the current specification on the page.
Hence for xfs_convert_page() we need to:
a) return "not ok" if the first buffer on the page does not
match the specification provided to we don't write anything;
and
b) abort it's buffer-add-to-io loop the moment we come
across a buffer that does not match the specification.
Hence we need to fix both xfs_check_page_type() and
xfs_convert_page() to work correctly with pages that have mixed
buffer types, whilst allowing xfs_aops_discard_page() to scan all
buffers on the page for a type match.
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_aops.c | 81 ++++++++++++++++++++++++++++++++++---------------------
1 file changed, 50 insertions(+), 31 deletions(-)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index ef62c6b..98016b3 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -632,38 +632,46 @@ xfs_map_at_offset(
}
/*
- * Test if a given page is suitable for writing as part of an unwritten
- * or delayed allocate extent.
+ * Test if a given page contains at least one buffer of a given @type.
+ * If @check_all_buffers is true, then we walk all the buffers in the page to
+ * try to find one of the type passed in. If it is not set, then the caller only
+ * needs to check the first buffer on the page for a match.
*/
-STATIC int
+STATIC bool
xfs_check_page_type(
struct page *page,
- unsigned int type)
+ unsigned int type,
+ bool check_all_buffers)
{
- if (PageWriteback(page))
- return 0;
+ struct buffer_head *bh;
+ struct buffer_head *head;
- if (page->mapping && page_has_buffers(page)) {
- struct buffer_head *bh, *head;
- int acceptable = 0;
+ if (PageWriteback(page))
+ return false;
+ if (!page->mapping)
+ return false;
+ if (!page_has_buffers(page))
+ return false;
- bh = head = page_buffers(page);
- do {
- if (buffer_unwritten(bh))
- acceptable += (type == XFS_IO_UNWRITTEN);
- else if (buffer_delay(bh))
- acceptable += (type == XFS_IO_DELALLOC);
- else if (buffer_dirty(bh) && buffer_mapped(bh))
- acceptable += (type == XFS_IO_OVERWRITE);
- else
- break;
- } while ((bh = bh->b_this_page) != head);
+ bh = head = page_buffers(page);
+ do {
+ if (buffer_unwritten(bh)) {
+ if (type == XFS_IO_UNWRITTEN)
+ return true;
+ } else if (buffer_delay(bh)) {
+ if (type == XFS_IO_DELALLOC);
+ return true;
+ } else if (buffer_dirty(bh) && buffer_mapped(bh)) {
+ if (type == XFS_IO_OVERWRITE);
+ return true;
+ }
- if (acceptable)
- return 1;
- }
+ /* If we are only checking the first buffer, we are done now. */
+ if (!check_all_buffers)
+ break;
+ } while ((bh = bh->b_this_page) != head);
- return 0;
+ return false;
}
/*
@@ -697,7 +705,7 @@ xfs_convert_page(
goto fail_unlock_page;
if (page->mapping != inode->i_mapping)
goto fail_unlock_page;
- if (!xfs_check_page_type(page, (*ioendp)->io_type))
+ if (!xfs_check_page_type(page, (*ioendp)->io_type, false))
goto fail_unlock_page;
/*
@@ -742,6 +750,15 @@ xfs_convert_page(
p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE;
page_dirty = p_offset / len;
+ /*
+ * The moment we find a buffer that doesn't match our current type
+ * specification or can't be written, abort the loop and start
+ * writeback. As per the above xfs_imap_valid() check, only
+ * xfs_vm_writepage() can handle partial page writeback fully - we are
+ * limited here to the buffers that are contiguous with the current
+ * ioend, and hence a buffer we can't write breaks that contiguity and
+ * we have to defer the rest of the IO to xfs_vm_writepage().
+ */
bh = head = page_buffers(page);
do {
if (offset >= end_offset)
@@ -750,7 +767,7 @@ xfs_convert_page(
uptodate = 0;
if (!(PageUptodate(page) || buffer_uptodate(bh))) {
done = 1;
- continue;
+ break;
}
if (buffer_unwritten(bh) || buffer_delay(bh) ||
@@ -762,10 +779,11 @@ xfs_convert_page(
else
type = XFS_IO_OVERWRITE;
- if (!xfs_imap_valid(inode, imap, offset)) {
- done = 1;
- continue;
- }
+ /*
+ * imap should always be valid because of the above
+ * partial page end_offset check on the imap.
+ */
+ ASSERT(xfs_imap_valid(inode, imap, offset));
lock_buffer(bh);
if (type != XFS_IO_OVERWRITE)
@@ -777,6 +795,7 @@ xfs_convert_page(
count++;
} else {
done = 1;
+ break;
}
} while (offset += len, (bh = bh->b_this_page) != head);
@@ -868,7 +887,7 @@ xfs_aops_discard_page(
struct buffer_head *bh, *head;
loff_t offset = page_offset(page);
- if (!xfs_check_page_type(page, XFS_IO_DELALLOC))
+ if (!xfs_check_page_type(page, XFS_IO_DELALLOC, true))
goto out_invalidate;
if (XFS_FORCED_SHUTDOWN(ip->i_mount))
--
1.9.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/2] xfs: inode log reservations are still too small
2014-03-05 1:11 [PATCH 0/2] xfs: more bug fixes Dave Chinner
2014-03-05 1:11 ` [PATCH 1/2] xfs: xfs_check_page_type buffer checks need help Dave Chinner
@ 2014-03-05 1:11 ` Dave Chinner
2014-03-05 3:33 ` Eric Sandeen
` (2 more replies)
1 sibling, 3 replies; 11+ messages in thread
From: Dave Chinner @ 2014-03-05 1:11 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
Back in commit 23956703 ("xfs: inode log reservations are too
small"), the reservation size was increased to take into account the
difference in size between the in-memory BMBT block headers and the
on-disk BMDR headers. This solved a transaction overrun when logging
the inode size.
Recently, however, we've seen a number of these same overruns on
kernels with the above fix in it. All of them have been by 4 bytes,
so we must still not be accounting for something correctly.
Through inspection it turns out the above commit didn't take into
account everything it should have. That is, it only accounts for a
single log op_hdr structure, when it can actually require up to four
op_hdrs - one for each region (log iovec) that is formatted. These
regions are the inode log format header, the inode core, and the two
forks that can be held in the literal area of the inode.
This means we are not accounting for 36 bytes of log space that the
transaction can use, and hence when we get inodes in certain formats
with particular fragmentation patterns we can overrun the
transaction. Fix this by adding the correct accounting for log
op_headers in the transaction.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_trans_resv.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
index 8515b04..d2c8e4a 100644
--- a/fs/xfs/xfs_trans_resv.c
+++ b/fs/xfs/xfs_trans_resv.c
@@ -81,20 +81,28 @@ xfs_calc_buf_res(
* on disk. Hence we need an inode reservation function that calculates all this
* correctly. So, we log:
*
- * - log op headers for object
+ * - 4 log op headers for object
+ * - for the ilf, the inode core and 2 forks
* - inode log format object
- * - the entire inode contents (core + 2 forks)
- * - two bmap btree block headers
+ * - the inode core
+ * - two inode forks containing bmap btree root blocks.
+ * - the btree data contained by both forks will fit into the inode size,
+ * hence when combined with the inode core above, we have a total of the
+ * actual inode size.
+ * - the BMBT headers need to be accounted separately, as they are
+ * additional to the records and pointers that fit inside the inode
+ * forks.
*/
STATIC uint
xfs_calc_inode_res(
struct xfs_mount *mp,
uint ninodes)
{
- return ninodes * (sizeof(struct xlog_op_header) +
- sizeof(struct xfs_inode_log_format) +
- mp->m_sb.sb_inodesize +
- 2 * XFS_BMBT_BLOCK_LEN(mp));
+ return ninodes *
+ (4 * sizeof(struct xlog_op_header) +
+ sizeof(struct xfs_inode_log_format) +
+ mp->m_sb.sb_inodesize +
+ 2 * XFS_BMBT_BLOCK_LEN(mp));
}
/*
--
1.9.0
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] xfs: inode log reservations are still too small
2014-03-05 1:11 ` [PATCH 2/2] xfs: inode log reservations are still too small Dave Chinner
@ 2014-03-05 3:33 ` Eric Sandeen
2014-03-05 16:06 ` Brian Foster
2014-03-05 17:14 ` Christoph Hellwig
2 siblings, 0 replies; 11+ messages in thread
From: Eric Sandeen @ 2014-03-05 3:33 UTC (permalink / raw)
To: Dave Chinner, xfs
On 3/4/14, 7:11 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Back in commit 23956703 ("xfs: inode log reservations are too
> small"), the reservation size was increased to take into account the
> difference in size between the in-memory BMBT block headers and the
> on-disk BMDR headers. This solved a transaction overrun when logging
> the inode size.
>
> Recently, however, we've seen a number of these same overruns on
> kernels with the above fix in it. All of them have been by 4 bytes,
> so we must still not be accounting for something correctly.
>
> Through inspection it turns out the above commit didn't take into
> account everything it should have. That is, it only accounts for a
> single log op_hdr structure, when it can actually require up to four
> op_hdrs - one for each region (log iovec) that is formatted. These
> regions are the inode log format header, the inode core, and the two
> forks that can be held in the literal area of the inode.
>
> This means we are not accounting for 36 bytes of log space that the
> transaction can use, and hence when we get inodes in certain formats
> with particular fragmentation patterns we can overrun the
> transaction. Fix this by adding the correct accounting for log
> op_headers in the transaction.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Makes sense to me;
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
> ---
> fs/xfs/xfs_trans_resv.c | 22 +++++++++++++++-------
> 1 file changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
> index 8515b04..d2c8e4a 100644
> --- a/fs/xfs/xfs_trans_resv.c
> +++ b/fs/xfs/xfs_trans_resv.c
> @@ -81,20 +81,28 @@ xfs_calc_buf_res(
> * on disk. Hence we need an inode reservation function that calculates all this
> * correctly. So, we log:
> *
> - * - log op headers for object
> + * - 4 log op headers for object
> + * - for the ilf, the inode core and 2 forks
> * - inode log format object
> - * - the entire inode contents (core + 2 forks)
> - * - two bmap btree block headers
> + * - the inode core
> + * - two inode forks containing bmap btree root blocks.
> + * - the btree data contained by both forks will fit into the inode size,
> + * hence when combined with the inode core above, we have a total of the
> + * actual inode size.
> + * - the BMBT headers need to be accounted separately, as they are
> + * additional to the records and pointers that fit inside the inode
> + * forks.
> */
> STATIC uint
> xfs_calc_inode_res(
> struct xfs_mount *mp,
> uint ninodes)
> {
> - return ninodes * (sizeof(struct xlog_op_header) +
> - sizeof(struct xfs_inode_log_format) +
> - mp->m_sb.sb_inodesize +
> - 2 * XFS_BMBT_BLOCK_LEN(mp));
> + return ninodes *
> + (4 * sizeof(struct xlog_op_header) +
> + sizeof(struct xfs_inode_log_format) +
> + mp->m_sb.sb_inodesize +
> + 2 * XFS_BMBT_BLOCK_LEN(mp));
> }
>
> /*
>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] xfs: inode log reservations are still too small
2014-03-05 1:11 ` [PATCH 2/2] xfs: inode log reservations are still too small Dave Chinner
2014-03-05 3:33 ` Eric Sandeen
@ 2014-03-05 16:06 ` Brian Foster
2014-03-05 17:14 ` Christoph Hellwig
2 siblings, 0 replies; 11+ messages in thread
From: Brian Foster @ 2014-03-05 16:06 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Wed, Mar 05, 2014 at 12:11:33PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Back in commit 23956703 ("xfs: inode log reservations are too
> small"), the reservation size was increased to take into account the
> difference in size between the in-memory BMBT block headers and the
> on-disk BMDR headers. This solved a transaction overrun when logging
> the inode size.
>
> Recently, however, we've seen a number of these same overruns on
> kernels with the above fix in it. All of them have been by 4 bytes,
> so we must still not be accounting for something correctly.
>
> Through inspection it turns out the above commit didn't take into
> account everything it should have. That is, it only accounts for a
> single log op_hdr structure, when it can actually require up to four
> op_hdrs - one for each region (log iovec) that is formatted. These
> regions are the inode log format header, the inode core, and the two
> forks that can be held in the literal area of the inode.
>
> This means we are not accounting for 36 bytes of log space that the
> transaction can use, and hence when we get inodes in certain formats
> with particular fragmentation patterns we can overrun the
> transaction. Fix this by adding the correct accounting for log
> op_headers in the transaction.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
You can add:
Tested-by: Brian Foster <bfoster@redhat.com>
... to Eric's review as well. I'm still throwing a workload at it, but
this has survived a few hours of a reproducer that consistently caused
the overrun in 5-10 seconds. Thanks for catching this.
Brian
> fs/xfs/xfs_trans_resv.c | 22 +++++++++++++++-------
> 1 file changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/fs/xfs/xfs_trans_resv.c b/fs/xfs/xfs_trans_resv.c
> index 8515b04..d2c8e4a 100644
> --- a/fs/xfs/xfs_trans_resv.c
> +++ b/fs/xfs/xfs_trans_resv.c
> @@ -81,20 +81,28 @@ xfs_calc_buf_res(
> * on disk. Hence we need an inode reservation function that calculates all this
> * correctly. So, we log:
> *
> - * - log op headers for object
> + * - 4 log op headers for object
> + * - for the ilf, the inode core and 2 forks
> * - inode log format object
> - * - the entire inode contents (core + 2 forks)
> - * - two bmap btree block headers
> + * - the inode core
> + * - two inode forks containing bmap btree root blocks.
> + * - the btree data contained by both forks will fit into the inode size,
> + * hence when combined with the inode core above, we have a total of the
> + * actual inode size.
> + * - the BMBT headers need to be accounted separately, as they are
> + * additional to the records and pointers that fit inside the inode
> + * forks.
> */
> STATIC uint
> xfs_calc_inode_res(
> struct xfs_mount *mp,
> uint ninodes)
> {
> - return ninodes * (sizeof(struct xlog_op_header) +
> - sizeof(struct xfs_inode_log_format) +
> - mp->m_sb.sb_inodesize +
> - 2 * XFS_BMBT_BLOCK_LEN(mp));
> + return ninodes *
> + (4 * sizeof(struct xlog_op_header) +
> + sizeof(struct xfs_inode_log_format) +
> + mp->m_sb.sb_inodesize +
> + 2 * XFS_BMBT_BLOCK_LEN(mp));
> }
>
> /*
> --
> 1.9.0
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] xfs: xfs_check_page_type buffer checks need help
2014-03-05 1:11 ` [PATCH 1/2] xfs: xfs_check_page_type buffer checks need help Dave Chinner
@ 2014-03-05 17:06 ` Christoph Hellwig
2014-03-05 22:08 ` Brian Foster
1 sibling, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2014-03-05 17:06 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
This looks correct, but why don't we simply have a special cased
function looking or any delalloc page for xfs_aops_discard_page instead
of having one that is fairly over-general?
Reviewed-by: Christoph Hellwig <hch@lst.de>
if you want to go down this route anyway.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] xfs: inode log reservations are still too small
2014-03-05 1:11 ` [PATCH 2/2] xfs: inode log reservations are still too small Dave Chinner
2014-03-05 3:33 ` Eric Sandeen
2014-03-05 16:06 ` Brian Foster
@ 2014-03-05 17:14 ` Christoph Hellwig
2014-03-05 21:40 ` Dave Chinner
2 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2014-03-05 17:14 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
The fix looks correct, but care to remind me where we account for the
xlog_op_header for all non-buf, non-inode transaction?
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] xfs: inode log reservations are still too small
2014-03-05 17:14 ` Christoph Hellwig
@ 2014-03-05 21:40 ` Dave Chinner
2014-03-05 22:34 ` Christoph Hellwig
0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2014-03-05 21:40 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Wed, Mar 05, 2014 at 09:14:41AM -0800, Christoph Hellwig wrote:
> The fix looks correct, but care to remind me where we account for the
> xlog_op_header for all non-buf, non-inode transaction?
We don't. We need to, I just haven't had the time to audit the rest
of the reservations to fix them all up yet. AFAIA, only the quota
off transactions don't have any slack in them for xlog_op_header
space - all the others that are incorrect probably have enough
overhead from the buffer overhead roundup to cover the lack ophdr
accounting on things like dquots.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] xfs: xfs_check_page_type buffer checks need help
2014-03-05 1:11 ` [PATCH 1/2] xfs: xfs_check_page_type buffer checks need help Dave Chinner
2014-03-05 17:06 ` Christoph Hellwig
@ 2014-03-05 22:08 ` Brian Foster
2014-03-05 23:18 ` Dave Chinner
1 sibling, 1 reply; 11+ messages in thread
From: Brian Foster @ 2014-03-05 22:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Wed, Mar 05, 2014 at 12:11:32PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> xfs_aops_discard_page() was introduced in the following commit:
>
> xfs: truncate delalloc extents when IO fails in writeback
>
> ... to clean up left over delalloc ranges after I/O failure in
> ->writepage(). generic/224 tests for this scenario and occasionally
> reproduces panics on sub-4k blocksize filesystems.
>
> The cause of this is failure to clean up the delalloc range on a
> page where the first buffer does not match one of the expected
> states of xfs_check_page_type(). If a buffer is not unwritten,
> delayed or dirty&mapped, xfs_check_page_type() stops and
> immediately returns 0.
>
> The stress test of generic/224 creates a scenario where the first
> several buffers of a page with delayed buffers are mapped & uptodate
> and some subsequent buffer is delayed. If the ->writepage() happens
> to fail for this page, xfs_aops_discard_page() incorrectly skips
> the entire page.
>
> This then causes later failures either when direct IO maps the range
> and finds the stale delayed buffer, or we evict the inode and find
> that the inode still has a delayed block reservation accounted to
> it.
>
> We can easily fix this xfs_aops_discard_page() failure by making
> xfs_check_page_type() check all buffers, but this breaks
> xfs_convert_page() more than it is already broken. Indeed,
> xfs_convert_page() wants xfs_check_page_type() to tell it if the
> first buffers on the pages are of a type that can be aggregated into
> the contiguous IO that is already being built.
>
> xfs_convert_page() should not be writing random buffers out of a
> page, but the current behaviour will cause it to do so if there are
> buffers that don't match the current specification on the page.
> Hence for xfs_convert_page() we need to:
>
> a) return "not ok" if the first buffer on the page does not
> match the specification provided to we don't write anything;
> and
> b) abort it's buffer-add-to-io loop the moment we come
> across a buffer that does not match the specification.
>
> Hence we need to fix both xfs_check_page_type() and
> xfs_convert_page() to work correctly with pages that have mixed
> buffer types, whilst allowing xfs_aops_discard_page() to scan all
> buffers on the page for a type match.
>
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
This looks pretty good to me as well. I notice one little thing that
might not be a real problem, but worth a quick thought...
> fs/xfs/xfs_aops.c | 81 ++++++++++++++++++++++++++++++++++---------------------
> 1 file changed, 50 insertions(+), 31 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index ef62c6b..98016b3 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -632,38 +632,46 @@ xfs_map_at_offset(
> }
>
...
> /*
> @@ -697,7 +705,7 @@ xfs_convert_page(
> goto fail_unlock_page;
> if (page->mapping != inode->i_mapping)
> goto fail_unlock_page;
> - if (!xfs_check_page_type(page, (*ioendp)->io_type))
> + if (!xfs_check_page_type(page, (*ioendp)->io_type, false))
> goto fail_unlock_page;
>
> /*
> @@ -742,6 +750,15 @@ xfs_convert_page(
> p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE;
> page_dirty = p_offset / len;
>
> + /*
> + * The moment we find a buffer that doesn't match our current type
> + * specification or can't be written, abort the loop and start
> + * writeback. As per the above xfs_imap_valid() check, only
> + * xfs_vm_writepage() can handle partial page writeback fully - we are
> + * limited here to the buffers that are contiguous with the current
> + * ioend, and hence a buffer we can't write breaks that contiguity and
> + * we have to defer the rest of the IO to xfs_vm_writepage().
> + */
> bh = head = page_buffers(page);
> do {
> if (offset >= end_offset)
> @@ -750,7 +767,7 @@ xfs_convert_page(
> uptodate = 0;
> if (!(PageUptodate(page) || buffer_uptodate(bh))) {
> done = 1;
> - continue;
> + break;
> }
>
> if (buffer_unwritten(bh) || buffer_delay(bh) ||
> @@ -762,10 +779,11 @@ xfs_convert_page(
> else
> type = XFS_IO_OVERWRITE;
>
> - if (!xfs_imap_valid(inode, imap, offset)) {
> - done = 1;
> - continue;
> - }
> + /*
> + * imap should always be valid because of the above
> + * partial page end_offset check on the imap.
> + */
> + ASSERT(xfs_imap_valid(inode, imap, offset));
>
> lock_buffer(bh);
> if (type != XFS_IO_OVERWRITE)
> @@ -777,6 +795,7 @@ xfs_convert_page(
> count++;
> } else {
> done = 1;
> + break;
> }
> } while (offset += len, (bh = bh->b_this_page) != head);
>
The next couple lines after the loop are:
if (uptodate && bh == head)
SetPageUptodate(page);
Now that we can break out of the loop, the "bh == head" part of that
check might not necessarily mean what it used to mean. The uptodate
variable is initialized to 1 and we reset to 0 the moment we encounter a
!uptodate buffer. Do you think it's possible to get here on the first
buffer of the page, without having reset 'uptodate,' and potentially
incorrectly set the page uptodate?
Brian
> @@ -868,7 +887,7 @@ xfs_aops_discard_page(
> struct buffer_head *bh, *head;
> loff_t offset = page_offset(page);
>
> - if (!xfs_check_page_type(page, XFS_IO_DELALLOC))
> + if (!xfs_check_page_type(page, XFS_IO_DELALLOC, true))
> goto out_invalidate;
>
> if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> --
> 1.9.0
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/2] xfs: inode log reservations are still too small
2014-03-05 21:40 ` Dave Chinner
@ 2014-03-05 22:34 ` Christoph Hellwig
0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2014-03-05 22:34 UTC (permalink / raw)
To: Dave Chinner; +Cc: Christoph Hellwig, xfs
> We don't. We need to, I just haven't had the time to audit the rest
> of the reservations to fix them all up yet. AFAIA, only the quota
> off transactions don't have any slack in them for xlog_op_header
> space - all the others that are incorrect probably have enough
> overhead from the buffer overhead roundup to cover the lack ophdr
> accounting on things like dquots.
That's more or less what I expected..
Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] xfs: xfs_check_page_type buffer checks need help
2014-03-05 22:08 ` Brian Foster
@ 2014-03-05 23:18 ` Dave Chinner
0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2014-03-05 23:18 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Wed, Mar 05, 2014 at 05:08:20PM -0500, Brian Foster wrote:
> On Wed, Mar 05, 2014 at 12:11:32PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > xfs_aops_discard_page() was introduced in the following commit:
> >
> > xfs: truncate delalloc extents when IO fails in writeback
> >
> > ... to clean up left over delalloc ranges after I/O failure in
> > ->writepage(). generic/224 tests for this scenario and occasionally
> > reproduces panics on sub-4k blocksize filesystems.
> >
> > The cause of this is failure to clean up the delalloc range on a
> > page where the first buffer does not match one of the expected
> > states of xfs_check_page_type(). If a buffer is not unwritten,
> > delayed or dirty&mapped, xfs_check_page_type() stops and
> > immediately returns 0.
....
> > @@ -777,6 +795,7 @@ xfs_convert_page(
> > count++;
> > } else {
> > done = 1;
> > + break;
> > }
> > } while (offset += len, (bh = bh->b_this_page) != head);
> >
>
> The next couple lines after the loop are:
>
> if (uptodate && bh == head)
> SetPageUptodate(page);
>
> Now that we can break out of the loop, the "bh == head" part of that
> check might not necessarily mean what it used to mean. The uptodate
> variable is initialized to 1 and we reset to 0 the moment we encounter a
> !uptodate buffer. Do you think it's possible to get here on the first
> buffer of the page, without having reset 'uptodate,' and potentially
> incorrectly set the page uptodate?
Good question :)
I don't think this can happen because if the first buffer on the
page can't be written, xfs_check_page_type() will return false and
we won't get to the loop. By definition, buffer_unwritten() implies
buffer_uptodate(), as does buffer_delay() and buffer_dirty(). Hence
any of the types that will return acceptible will have the first
buffer uptodate.
As for the other breaks in the loop - the initial imap_valid check
ensures we have a map that covers the entire region of the page that
needs writing, and we know that offset < end_offset for the first
buffer on the page. Hence none of the loop breaks will trigger on
the first buffer, and so the above code should not trigger.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2014-03-05 23:21 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-05 1:11 [PATCH 0/2] xfs: more bug fixes Dave Chinner
2014-03-05 1:11 ` [PATCH 1/2] xfs: xfs_check_page_type buffer checks need help Dave Chinner
2014-03-05 17:06 ` Christoph Hellwig
2014-03-05 22:08 ` Brian Foster
2014-03-05 23:18 ` Dave Chinner
2014-03-05 1:11 ` [PATCH 2/2] xfs: inode log reservations are still too small Dave Chinner
2014-03-05 3:33 ` Eric Sandeen
2014-03-05 16:06 ` Brian Foster
2014-03-05 17:14 ` Christoph Hellwig
2014-03-05 21:40 ` Dave Chinner
2014-03-05 22:34 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).