public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: xfs@oss.sgi.com
Subject: [PATCH 001/102] xfs: don't serialise adjacent concurrent direct IO appending writes
Date: Thu, 23 Aug 2012 15:01:19 +1000	[thread overview]
Message-ID: <1345698180-13612-2-git-send-email-david@fromorbit.com> (raw)
In-Reply-To: <1345698180-13612-1-git-send-email-david@fromorbit.com>

Upstream commit: 7271d243f9d1b4106289e4cf876c8b1203de59ab

For append write workloads, extending the file requires a certain
amount of exclusive locking to be done up front to ensure sanity in
things like ensuring that we've zeroed any allocated regions
between the old EOF and the start of the new IO.

For single threads, this typically isn't a problem, and for large
IOs we don't serialise enough for it to be a problem for two
threads on really fast block devices. However for smaller IO and
larger thread counts we have a problem.

Take 4 concurrent sequential, single block sized and aligned IOs.
After the first IO is submitted but before it completes, we end up
with this state:

        IO 1    IO 2    IO 3    IO 4
      +-------+-------+-------+-------+
      ^       ^
      |       |
      |       |
      |       |
      |       \- ip->i_new_size
      \- ip->i_size

And the IO is done without exclusive locking because offset <=
ip->i_size. When we submit IO 2, we see offset > ip->i_size, and
grab the IO lock exclusive, because there is a chance we need to do
EOF zeroing. However, there is already an IO in progress that avoids
the need for IO zeroing because offset <= ip->i_new_size. hence we
could avoid holding the IO lock exlcusive for this. Hence after
submission of the second IO, we'd end up this state:

        IO 1    IO 2    IO 3    IO 4
      +-------+-------+-------+-------+
      ^               ^
      |               |
      |               |
      |               |
      |               \- ip->i_new_size
      \- ip->i_size

There is no need to grab the i_mutex of the IO lock in exclusive
mode if we don't need to invalidate the page cache. Taking these
locks on every direct IO effective serialises them as taking the IO
lock in exclusive mode has to wait for all shared holders to drop
the lock. That only happens when IO is complete, so effective it
prevents dispatch of concurrent direct IO writes to the same inode.

And so you can see that for the third concurrent IO, we'd avoid
exclusive locking for the same reason we avoided the exclusive lock
for the second IO.

Fixing this is a bit more complex than that, because we need to hold
a write-submission local value of ip->i_new_size to that clearing
the value is only done if no other thread has updated it before our
IO completes.....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
---
 fs/xfs/linux-2.6/xfs_file.c |   68 +++++++++++++++++++++++++++++++++----------
 1 file changed, 52 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index b679198..67d392a 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -410,11 +410,13 @@ xfs_aio_write_isize_update(
  */
 STATIC void
 xfs_aio_write_newsize_update(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	xfs_fsize_t		new_size)
 {
-	if (ip->i_new_size) {
+	if (new_size == ip->i_new_size) {
 		xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
-		ip->i_new_size = 0;
+		if (new_size == ip->i_new_size)
+			ip->i_new_size = 0;
 		if (ip->i_d.di_size > ip->i_size)
 			ip->i_d.di_size = ip->i_size;
 		xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
@@ -465,7 +467,7 @@ xfs_file_splice_write(
 	ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags);
 
 	xfs_aio_write_isize_update(inode, ppos, ret);
-	xfs_aio_write_newsize_update(ip);
+	xfs_aio_write_newsize_update(ip, new_size);
 	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 	return ret;
 }
@@ -662,6 +664,7 @@ xfs_file_aio_write_checks(
 	struct file		*file,
 	loff_t			*pos,
 	size_t			*count,
+	xfs_fsize_t		*new_sizep,
 	int			*iolock)
 {
 	struct inode		*inode = file->f_mapping->host;
@@ -670,6 +673,8 @@ xfs_file_aio_write_checks(
 	int			error = 0;
 
 	xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
+	*new_sizep = 0;
+restart:
 	error = generic_write_checks(file, pos, count, S_ISBLK(inode->i_mode));
 	if (error) {
 		xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);
@@ -677,20 +682,41 @@ xfs_file_aio_write_checks(
 		return error;
 	}
 
-	new_size = *pos + *count;
-	if (new_size > ip->i_size)
-		ip->i_new_size = new_size;
-
 	if (likely(!(file->f_mode & FMODE_NOCMTIME)))
 		file_update_time(file);
 
 	/*
 	 * If the offset is beyond the size of the file, we need to zero any
 	 * blocks that fall between the existing EOF and the start of this
-	 * write.
+	 * write. There is no need to issue zeroing if another in-flght IO ends
+	 * at or before this one If zeronig is needed and we are currently
+	 * holding the iolock shared, we need to update it to exclusive which
+	 * involves dropping all locks and relocking to maintain correct locking
+	 * order. If we do this, restart the function to ensure all checks and
+	 * values are still valid.
 	 */
-	if (*pos > ip->i_size)
+	if ((ip->i_new_size && *pos > ip->i_new_size) ||
+	    (!ip->i_new_size && *pos > ip->i_size)) {
+		if (*iolock == XFS_IOLOCK_SHARED) {
+			xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);
+			*iolock = XFS_IOLOCK_EXCL;
+			xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
+			goto restart;
+		}
 		error = -xfs_zero_eof(ip, *pos, ip->i_size);
+	}
+
+	/*
+	 * If this IO extends beyond EOF, we may need to update ip->i_new_size.
+	 * We have already zeroed space beyond EOF (if necessary).  Only update
+	 * ip->i_new_size if this IO ends beyond any other in-flight writes.
+	 */
+	new_size = *pos + *count;
+	if (new_size > ip->i_size) {
+		if (new_size > ip->i_new_size)
+			ip->i_new_size = new_size;
+		*new_sizep = new_size;
+	}
 
 	xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
 	if (error)
@@ -737,6 +763,7 @@ xfs_file_dio_aio_write(
 	unsigned long		nr_segs,
 	loff_t			pos,
 	size_t			ocount,
+	xfs_fsize_t		*new_size,
 	int			*iolock)
 {
 	struct file		*file = iocb->ki_filp;
@@ -757,13 +784,20 @@ xfs_file_dio_aio_write(
 	if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
 		unaligned_io = 1;
 
-	if (unaligned_io || mapping->nrpages || pos > ip->i_size)
+	/*
+	 * We don't need to take an exclusive lock unless there page cache needs
+	 * to be invalidated or unaligned IO is being executed. We don't need to
+	 * consider the EOF extension case here because
+	 * xfs_file_aio_write_checks() will relock the inode as necessary for
+	 * EOF zeroing cases and fill out the new inode size as appropriate.
+	 */
+	if (unaligned_io || mapping->nrpages)
 		*iolock = XFS_IOLOCK_EXCL;
 	else
 		*iolock = XFS_IOLOCK_SHARED;
 	xfs_rw_ilock(ip, *iolock);
 
-	ret = xfs_file_aio_write_checks(file, &pos, &count, iolock);
+	ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock);
 	if (ret)
 		return ret;
 
@@ -812,6 +846,7 @@ xfs_file_buffered_aio_write(
 	unsigned long		nr_segs,
 	loff_t			pos,
 	size_t			ocount,
+	xfs_fsize_t		*new_size,
 	int			*iolock)
 {
 	struct file		*file = iocb->ki_filp;
@@ -825,7 +860,7 @@ xfs_file_buffered_aio_write(
 	*iolock = XFS_IOLOCK_EXCL;
 	xfs_rw_ilock(ip, *iolock);
 
-	ret = xfs_file_aio_write_checks(file, &pos, &count, iolock);
+	ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock);
 	if (ret)
 		return ret;
 
@@ -865,6 +900,7 @@ xfs_file_aio_write(
 	ssize_t			ret;
 	int			iolock;
 	size_t			ocount = 0;
+	xfs_fsize_t		new_size = 0;
 
 	XFS_STATS_INC(xs_write_calls);
 
@@ -884,10 +920,10 @@ xfs_file_aio_write(
 
 	if (unlikely(file->f_flags & O_DIRECT))
 		ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos,
-						ocount, &iolock);
+						ocount, &new_size, &iolock);
 	else
 		ret = xfs_file_buffered_aio_write(iocb, iovp, nr_segs, pos,
-						ocount, &iolock);
+						ocount, &new_size, &iolock);
 
 	xfs_aio_write_isize_update(inode, &iocb->ki_pos, ret);
 
@@ -912,7 +948,7 @@ xfs_file_aio_write(
 	}
 
 out_unlock:
-	xfs_aio_write_newsize_update(ip);
+	xfs_aio_write_newsize_update(ip, new_size);
 	xfs_rw_iunlock(ip, iolock);
 	return ret;
 }
-- 
1.7.10

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

  reply	other threads:[~2012-08-23  5:02 UTC|newest]

Thread overview: 117+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-08-23  5:01 [RFC, PATCH 0/102]: xfs: 3.0.x stable kernel update Dave Chinner
2012-08-23  5:01 ` Dave Chinner [this message]
2012-08-23  5:01 ` [PATCH 002/102] xfs: remove dead ENODEV handling in xfs_destroy_ioend Dave Chinner
2012-08-23  5:01 ` [PATCH 003/102] xfs: defer AIO/DIO completions Dave Chinner
2012-08-23  5:01 ` [PATCH 004/102] xfs: reduce ioend latency Dave Chinner
2012-08-23  5:01 ` [PATCH 005/102] xfs: wait for I/O completion when writing out pages in xfs_setattr_size Dave Chinner
2012-08-23  5:01 ` [PATCH 006/102] xfs: improve ioend error handling Dave Chinner
2012-08-23  5:01 ` [PATCH 007/102] xfs: Check the return value of xfs_buf_get() Dave Chinner
2012-08-23  5:01 ` [PATCH 008/102] xfs: Check the return value of xfs_trans_get_buf() Dave Chinner
2012-08-23  5:01 ` [PATCH 009/102] xfs: dont ignore error code from xfs_bmbt_update Dave Chinner
2012-08-23  5:01 ` [PATCH 010/102] xfs: fix possible overflow in xfs_ioc_trim() Dave Chinner
2012-08-23  5:01 ` [PATCH 011/102] xfs: XFS_TRANS_SWAPEXT is not a valid flag for Dave Chinner
2012-08-23  5:01 ` [PATCH 012/102] xfs: Don't allocate new buffers on every call to _xfs_buf_find Dave Chinner
2012-08-23  5:01 ` [PATCH 013/102] xfs: reduce the number of log forces from tail pushing Dave Chinner
2012-08-23  5:01 ` [PATCH 014/102] xfs: optimize fsync on directories Dave Chinner
2012-08-23  5:01 ` [PATCH 015/102] xfs: clean up buffer allocation Dave Chinner
2012-08-23  5:01 ` [PATCH 016/102] xfs: clean up xfs_ioerror_alert Dave Chinner
2012-08-23  5:01 ` [PATCH 017/102] xfs: use xfs_ioerror_alert in xfs_buf_iodone_callbacks Dave Chinner
2012-08-23  5:01 ` [PATCH 018/102] xfs: do not flush data workqueues in xfs_flush_buftarg Dave Chinner
2012-08-23  5:01 ` [PATCH 019/102] xfs: add AIL pushing tracepoints Dave Chinner
2012-08-23  5:01 ` [PATCH 020/102] xfs: warn if direct reclaim tries to writeback pages Dave Chinner
2012-08-27 18:17   ` Christoph Hellwig
2012-09-05 11:32     ` Mel Gorman
2012-09-13 20:12       ` Mark Tinguely
2012-09-14  9:45         ` Mel Gorman
2012-09-14 12:56           ` Mark Tinguely
2012-09-14 16:18             ` Mark Tinguely
2012-08-23  5:01 ` [PATCH 021/102] xfs: fix force shutdown handling in xfs_end_io Dave Chinner
2012-08-23  5:01 ` [PATCH 022/102] xfs: fix allocation length overflow in xfs_bmapi_write() Dave Chinner
2012-08-23  5:01 ` [PATCH 023/102] xfs: fix the logspace waiting algorithm Dave Chinner
2012-08-23  5:01 ` [PATCH 024/102] xfs: untangle SYNC_WAIT and SYNC_TRYLOCK meanings for xfs_qm_dqflush Dave Chinner
2012-08-23  5:01 ` [PATCH 025/102] xfs: make sure to really flush all dquots in xfs_qm_quotacheck Dave Chinner
2012-08-23  5:01 ` [PATCH 026/102] xfs: simplify xfs_qm_detach_gdquots Dave Chinner
2012-08-23  5:01 ` [PATCH 027/102] xfs: mark the xfssyncd workqueue as non-reentrant Dave Chinner
2012-08-23  5:01 ` [PATCH 028/102] xfs: make i_flags an unsigned long Dave Chinner
2012-08-23  5:01 ` [PATCH 029/102] xfs: remove the i_size field in struct xfs_inode Dave Chinner
2012-08-23  5:01 ` [PATCH 030/102] xfs: remove the i_new_size " Dave Chinner
2012-08-23  5:01 ` [PATCH 031/102] xfs: always return with the iolock held from Dave Chinner
2012-08-23  5:01 ` [PATCH 032/102] xfs: cleanup xfs_file_aio_write Dave Chinner
2012-08-23  5:01 ` [PATCH 033/102] xfs: pass KM_SLEEP flag to kmem_realloc() in Dave Chinner
2012-08-23  5:01 ` [PATCH 034/102] xfs: show uuid when mount fails due to duplicate uuid Dave Chinner
2012-08-23  5:01 ` [PATCH 035/102] xfs: xfs_trans_add_item() - don't assign in ASSERT() when compare is intended Dave Chinner
2012-08-23  5:01 ` [PATCH 036/102] xfs: split tail_lsn assignments from log space wakeups Dave Chinner
2012-08-23  5:01 ` [PATCH 037/102] xfs: do exact log space wakeups in xlog_ungrant_log_space Dave Chinner
2012-08-23  5:01 ` [PATCH 038/102] xfs: remove xfs_trans_unlocked_item Dave Chinner
2012-08-23  5:01 ` [PATCH 039/102] xfs: cleanup xfs_log_space_wake Dave Chinner
2012-08-23  5:01 ` [PATCH 040/102] xfs: remove log space waitqueues Dave Chinner
2012-08-23  5:01 ` [PATCH 041/102] xfs: add the xlog_grant_head structure Dave Chinner
2012-08-23  5:02 ` [PATCH 042/102] xfs: add xlog_grant_head_init Dave Chinner
2012-08-23  5:02 ` [PATCH 043/102] xfs: add xlog_grant_head_wake_all Dave Chinner
2012-08-23  5:02 ` [PATCH 044/102] xfs: share code for grant head waiting Dave Chinner
2012-08-23  5:02 ` [PATCH 045/102] xfs: share code for grant head wakeups Dave Chinner
2012-08-23  5:02 ` [PATCH 046/102] xfs: share code for grant head availability checks Dave Chinner
2012-08-23  5:02 ` [PATCH 047/102] xfs: split and cleanup xfs_log_reserve Dave Chinner
2012-08-23  5:02 ` [PATCH 048/102] xfs: only take the ILOCK in xfs_reclaim_inode() Dave Chinner
2012-08-23  5:02 ` [PATCH 049/102] xfs: use per-filesystem I/O completion workqueues Dave Chinner
2012-08-23  5:02 ` [PATCH 050/102] xfs: do not require an ioend for new EOF calculation Dave Chinner
2012-08-23  5:02 ` [PATCH 054/102] xfs: make xfs_inode_item_size idempotent Dave Chinner
2012-08-23  5:02 ` [PATCH 055/102] xfs: split in-core and on-disk inode log item fields Dave Chinner
2012-08-23  5:02 ` [PATCH 056/102] xfs: reimplement fdatasync support Dave Chinner
2012-08-23  5:02 ` [PATCH 057/102] xfs: fallback to vmalloc for large buffers in xfs_attrmulti_attr_get Dave Chinner
2012-08-23  5:02 ` [PATCH 058/102] xfs: fallback to vmalloc for large buffers in xfs_getbmap Dave Chinner
2012-08-23  5:02 ` [PATCH 059/102] xfs: fix deadlock in xfs_rtfree_extent Dave Chinner
2012-08-23  5:02 ` [PATCH 060/102] xfs: Fix open flag handling in open_by_handle code Dave Chinner
2012-08-23  5:02 ` [PATCH 061/102] xfs: introduce an allocation workqueue Dave Chinner
2012-08-23  5:02 ` [PATCH 062/102] xfs: trace xfs_name strings correctly Dave Chinner
2012-08-23  5:02 ` [PATCH 063/102] xfs: Account log unmount transaction correctly Dave Chinner
2012-08-23  5:02 ` [PATCH 064/102] xfs: fix fstrim offset calculations Dave Chinner
2012-08-23  5:02 ` [PATCH 065/102] xfs: add lots of attribute trace points Dave Chinner
2012-08-23  5:02 ` [PATCH 066/102] xfs: don't fill statvfs with project quota for a directory Dave Chinner
2012-08-23  5:02 ` [PATCH 067/102] xfs: Ensure inode reclaim can run during quotacheck Dave Chinner
2012-08-23  5:02 ` [PATCH 068/102] xfs: avoid taking the ilock unnessecarily in xfs_qm_dqattach Dave Chinner
2012-08-23  5:02 ` [PATCH 069/102] xfs: reduce ilock hold times in xfs_file_aio_write_checks Dave Chinner
2012-08-23  5:02 ` [PATCH 070/102] xfs: reduce ilock hold times in xfs_setattr_size Dave Chinner
2012-08-23  5:02 ` [PATCH 071/102] xfs: push the ilock into xfs_zero_eof Dave Chinner
2012-08-23  5:02 ` [PATCH 072/102] xfs: use shared ilock mode for direct IO writes by default Dave Chinner
2012-08-23  5:02 ` [PATCH 073/102] xfs: punch all delalloc blocks beyond EOF on write failure Dave Chinner
2012-08-23  5:02 ` [PATCH 074/102] xfs: using GFP_NOFS for blkdev_issue_flush Dave Chinner
2012-08-23  5:02 ` [PATCH 075/102] xfs: page type check in writeback only checks last buffer Dave Chinner
2012-08-23  5:02 ` [PATCH 076/102] xfs: punch new delalloc blocks out of failed writes inside Dave Chinner
2012-08-23  5:02 ` [PATCH 077/102] xfs: prevent needless mount warning causing test failures Dave Chinner
2012-08-23  5:02 ` [PATCH 078/102] xfs: don't assert on delalloc regions beyond EOF Dave Chinner
2012-08-23  5:02 ` [PATCH 079/102] xfs: limit specualtive delalloc to maxioffset Dave Chinner
2012-08-23  5:02 ` [PATCH 080/102] xfs: Use preallocation for inodes with extsz hints Dave Chinner
2012-08-23  5:02 ` [PATCH 081/102] xfs: fix buffer lookup race on allocation failure Dave Chinner
2012-08-23  5:02 ` [PATCH 082/102] xfs: check for buffer errors before waiting Dave Chinner
2012-08-23  5:02 ` [PATCH 083/102] xfs: fix incorrect b_offset initialisation Dave Chinner
2012-08-23  5:02 ` [PATCH 084/102] xfs: use kmem_zone_zalloc for buffers Dave Chinner
2012-08-23  5:02 ` [PATCH 085/102] xfs: use iolock on XFS_IOC_ALLOCSP calls Dave Chinner
2012-08-23  5:02 ` [PATCH 086/102] xfs: Properly exclude IO type flags from buffer flags Dave Chinner
2012-08-23  5:02 ` [PATCH 087/102] xfs: flush outstanding buffers on log mount failure Dave Chinner
2012-08-23  5:02 ` [PATCH 088/102] xfs: protect xfs_sync_worker with s_umount semaphore Dave Chinner
2012-08-23  5:02 ` [PATCH 089/102] xfs: fix memory reclaim deadlock on agi buffer Dave Chinner
2012-08-23  5:02 ` [PATCH 090/102] xfs: add trace points for log forces Dave Chinner
2012-08-23  5:02 ` [PATCH 091/102] xfs: switch to proper __bitwise type for KM_... flags Dave Chinner
2012-08-23  5:02 ` [PATCH 092/102] xfs: xfs_vm_writepage clear iomap_valid when Dave Chinner
2012-08-23  5:02 ` [PATCH 093/102] xfs: fix debug_object WARN at xfs_alloc_vextent() Dave Chinner
2012-08-23  5:02 ` [PATCH 094/102] xfs: m_maxioffset is redundant Dave Chinner
2012-08-23  5:02 ` [PATCH 095/102] xfs: make largest supported offset less shouty Dave Chinner
2012-08-23  5:02 ` [PATCH 096/102] xfs: kill copy and paste segment checks in xfs_file_aio_read Dave Chinner
2012-08-23  5:02 ` [PATCH 097/102] xfs: fix allocbt cursor leak in xfs_alloc_ag_vextent_near Dave Chinner
2012-08-23  5:02 ` [PATCH 098/102] xfs: shutdown xfs_sync_worker before the log Dave Chinner
2012-08-23  5:02 ` [PATCH 099/102] xfs: really fix the cursor leak in xfs_alloc_ag_vextent_near Dave Chinner
2012-08-23  5:02 ` [PATCH 100/102] xfs: don't defer metadata allocation to the workqueue Dave Chinner
2012-08-23  5:02 ` [PATCH 101/102] xfs: prevent recursion in xfs_buf_iorequest Dave Chinner
2012-08-23  5:03 ` [PATCH 102/102] xfs: handle EOF correctly in xfs_vm_writepage Dave Chinner
2012-08-23 21:54 ` [RFC, PATCH 0/102]: xfs: 3.0.x stable kernel update Dave Chinner
2012-08-23 22:14   ` Ben Myers
2012-08-23 22:23   ` Matthias Schniedermeyer
2012-09-01 23:10 ` Christoph Hellwig
2012-09-03  6:04   ` Dave Chinner
2012-09-04 21:13     ` Ben Myers
2012-09-05  4:24       ` Dave Chinner
2012-09-13 18:32 ` Mark Tinguely
2012-09-18 13:59 ` Mark Tinguely
2012-09-18 23:50   ` Dave Chinner
2012-09-19 13:14     ` Mark Tinguely

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1345698180-13612-2-git-send-email-david@fromorbit.com \
    --to=david@fromorbit.com \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox