From: Suparna Bhattacharya <suparna@in.ibm.com>
To: linux-aio@kvack.org, linux-kernel@vger.kernel.org
Cc: linux-osdl@osdl.org
Subject: Re: [PATCH 16/22] Concurrent O_SYNC write support
Date: Fri, 2 Jul 2004 21:57:29 +0530 [thread overview]
Message-ID: <20040702162729.GF3450@in.ibm.com> (raw)
In-Reply-To: <20040702130030.GA4256@in.ibm.com>
On Fri, Jul 02, 2004 at 06:30:30PM +0530, Suparna Bhattacharya wrote:
> The patchset contains modifications and fixes to the AIO core
> to support the full retry model, an implementation of AIO
> support for buffered filesystem AIO reads and O_SYNC writes
> (the latter courtesy O_SYNC speedup changes from Andrew Morton),
> an implementation of AIO reads and writes to pipes (from
> Chris Mason) and AIO poll (again from Chris Mason).
>
> Full retry infrastructure and fixes
> [1] aio-retry.patch
> [2] 4g4g-aio-hang-fix.patch
> [3] aio-retry-elevated-refcount.patch
> [4] aio-splice-runlist.patch
>
> FS AIO read
> [5] aio-wait-page.patch
> [6] aio-fs_read.patch
> [7] aio-upfront-readahead.patch
>
> AIO for pipes
> [8] aio-cancel-fix.patch
> [9] aio-read-immediate.patch
> [10] aio-pipe.patch
> [11] aio-context-switch.patch
>
> Concurrent O_SYNC write speedups using radix-tree walks
> [12] writepages-range.patch
> [13] fix-writeback-range.patch
> [14] fix-writepages-range.patch
> [15] fdatawrite-range.patch
> [16] O_SYNC-speedup.patch
--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
------------------------------------------------------
From: Andrew Morton <akpm@osdl.org>
In databases it is common to have multiple threads or processes performing
O_SYNC writes against different parts of the same file.
Our performance at this is poor, because each writer blocks access to the
file by waiting on I/O completion while holding i_sem: everything is
serialised.
The patch improves things by moving the writing and waiting outside i_sem.
So other threads can get in and submit their I/O and permit the disk
scheduler to optimise the IO patterns better.
Also, the O_SYNC writer only writes and waits on the pages which he wrote,
rather than writing and waiting on all dirty pages in the file.
The reason we haven't been able to do this before is that the required walk
of the address_space page lists is easily livelockable without the i_sem
serialisation. But in this patch we perform the waiting via a radix-tree
walk of the affected pages. This cannot be livelocked.
The sync of the inode's metadata is still performed inside i_sem. This is
because it is list-based and is hence still livelockable. However it is
usually the case that databases are overwriting existing file blocks and
there will be no dirty buffers attached to the address_space anyway.
The code is careful to ensure that the IO for the pages and the IO for the
metadata are nonblockingly scheduled at the same time. This is am improvemtn
over the current code, which will issue two separate write-and-wait cycles:
one for metadata, one for pages.
Note from Suparna:
Reworked to use the tagged radix-tree based writeback infrastructure.
include/linux/buffer_head.h | 6 --
include/linux/fs.h | 5 ++
include/linux/writeback.h | 2
mm/filemap.c | 93 +++++++++++++++++++++++++++++++++++---------
4 files changed, 81 insertions(+), 25 deletions(-)
--- linux-2.6.7/include/linux/buffer_head.h 2004-06-15 22:18:55.000000000 -0700
+++ aio/include/linux/buffer_head.h 2004-06-18 15:20:59.761086776 -0700
@@ -202,12 +202,6 @@ int nobh_prepare_write(struct page*, uns
int nobh_commit_write(struct file *, struct page *, unsigned, unsigned);
int nobh_truncate_page(struct address_space *, loff_t);
-#define OSYNC_METADATA (1<<0)
-#define OSYNC_DATA (1<<1)
-#define OSYNC_INODE (1<<2)
-int generic_osync_inode(struct inode *, struct address_space *, int);
-
-
/*
* inline definitions
*/
--- linux-2.6.7/include/linux/fs.h 2004-06-15 22:19:02.000000000 -0700
+++ aio/include/linux/fs.h 2004-06-18 15:20:59.763086472 -0700
@@ -820,6 +820,11 @@ extern int vfs_rename(struct inode *, st
#define DT_SOCK 12
#define DT_WHT 14
+#define OSYNC_METADATA (1<<0)
+#define OSYNC_DATA (1<<1)
+#define OSYNC_INODE (1<<2)
+int generic_osync_inode(struct inode *, struct address_space *, int);
+
/*
* This is the "filldir" function type, used by readdir() to let
* the kernel specify what kind of dirent layout it wants to have.
--- linux-2.6.7/include/linux/writeback.h 2004-06-18 15:19:30.648633928 -0700
+++ aio/include/linux/writeback.h 2004-06-18 15:20:59.764086320 -0700
@@ -103,6 +103,8 @@ void page_writeback_init(void);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
+int sync_page_range(struct inode *inode, struct address_space *mapping,
+ loff_t pos, size_t count);
/* pdflush.c */
extern int nr_pdflush_threads; /* Global so it can be exported to sysctl
--- linux-2.6.7/mm/filemap.c 2004-06-18 15:19:36.095805832 -0700
+++ aio/mm/filemap.c 2004-06-18 15:20:59.770085408 -0700
@@ -239,6 +239,34 @@ static int wait_on_page_writeback_range(
return ret;
}
+
+/*
+ * Write and wait upon all the pages in the passed range. This is a "data
+ * integrity" operation. It waits upon in-flight writeout before starting and
+ * waiting upon new writeout. If there was an IO error, return it.
+ *
+ * We need to re-take i_sem during the generic_osync_inode list walk because
+ * it is otherwise livelockable.
+ */
+int sync_page_range(struct inode *inode, struct address_space *mapping,
+ loff_t pos, size_t count)
+{
+ pgoff_t start = pos >> PAGE_CACHE_SHIFT;
+ pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+ int ret;
+
+ if (mapping->backing_dev_info->memory_backed || !count)
+ return 0;
+ ret = filemap_fdatawrite_range(mapping, pos, pos + count - 1);
+ if (ret == 0) {
+ down(&inode->i_sem);
+ ret = generic_osync_inode(inode, mapping, OSYNC_METADATA);
+ up(&inode->i_sem);
+ }
+ if (ret == 0)
+ ret = wait_on_page_writeback_range(mapping, start, end);
+ return ret;
+}
/**
* filemap_fdatawait - walk the list of under-writeback pages of the given
* address space and wait for all of them.
@@ -2093,11 +2121,13 @@ generic_file_aio_write_nolock(struct kio
/*
* For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
*/
- if (status >= 0) {
- if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
- status = generic_osync_inode(inode, mapping,
- OSYNC_METADATA|OSYNC_DATA);
- }
+ if (likely(status >= 0)) {
+ if (unlikely((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+ if (!a_ops->writepage || !is_sync_kiocb(iocb))
+ status = generic_osync_inode(inode, mapping,
+ OSYNC_METADATA|OSYNC_DATA);
+ }
+ }
/*
* If we get here for O_DIRECT writes then we must have fallen through
@@ -2137,36 +2167,52 @@ ssize_t generic_file_aio_write(struct ki
size_t count, loff_t pos)
{
struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- ssize_t err;
- struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count };
+ struct address_space *mapping = file->f_mapping;
+ struct inode *inode = mapping->host;
+ ssize_t ret;
+ struct iovec local_iov = { .iov_base = (void __user *)buf,
+ .iov_len = count };
BUG_ON(iocb->ki_pos != pos);
down(&inode->i_sem);
- err = generic_file_aio_write_nolock(iocb, &local_iov, 1,
+ ret = generic_file_aio_write_nolock(iocb, &local_iov, 1,
&iocb->ki_pos);
up(&inode->i_sem);
- return err;
-}
+ if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+ ssize_t err;
+ err = sync_page_range(inode, mapping, pos, ret);
+ if (err < 0)
+ ret = err;
+ }
+ return ret;
+}
EXPORT_SYMBOL(generic_file_aio_write);
ssize_t generic_file_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
- struct inode *inode = file->f_mapping->host;
- ssize_t err;
- struct iovec local_iov = { .iov_base = (void __user *)buf, .iov_len = count };
+ struct address_space *mapping = file->f_mapping;
+ struct inode *inode = mapping->host;
+ ssize_t ret;
+ struct iovec local_iov = { .iov_base = (void __user *)buf,
+ .iov_len = count };
down(&inode->i_sem);
- err = generic_file_write_nolock(file, &local_iov, 1, ppos);
+ ret = generic_file_write_nolock(file, &local_iov, 1, ppos);
up(&inode->i_sem);
- return err;
-}
+ if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+ ssize_t err;
+ err = sync_page_range(inode, mapping, *ppos - ret, ret);
+ if (err < 0)
+ ret = err;
+ }
+ return ret;
+}
EXPORT_SYMBOL(generic_file_write);
ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
@@ -2185,14 +2231,23 @@ ssize_t generic_file_readv(struct file *
EXPORT_SYMBOL(generic_file_readv);
ssize_t generic_file_writev(struct file *file, const struct iovec *iov,
- unsigned long nr_segs, loff_t * ppos)
+ unsigned long nr_segs, loff_t *ppos)
{
- struct inode *inode = file->f_mapping->host;
+ struct address_space *mapping = file->f_mapping;
+ struct inode *inode = mapping->host;
ssize_t ret;
down(&inode->i_sem);
ret = generic_file_write_nolock(file, iov, nr_segs, ppos);
up(&inode->i_sem);
+
+ if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
+ int err;
+
+ err = sync_page_range(inode, mapping, *ppos - ret, ret);
+ if (err < 0)
+ ret = err;
+ }
return ret;
}
next prev parent reply other threads:[~2004-07-02 16:18 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2004-07-02 13:00 [PATCH 0/22] fsaio, pipe aio and aio poll upgraded to 2.6.7 Suparna Bhattacharya
2004-07-02 13:07 ` [PATCH 1/22] High-level AIO retry infrastructure and fixes Suparna Bhattacharya
2004-07-02 13:11 ` [PATCH 2/22] use_mm fix (helps AIO hangs on 4:4 split) Suparna Bhattacharya
2004-07-02 13:14 ` [PATCH 3/22] Refcounting fixes Suparna Bhattacharya
2004-07-02 13:15 ` [PATCH 4/22] Splice ioctx runlist for fairness Suparna Bhattacharya
2004-07-02 13:16 ` [PATCH 5/22] AIO wait on page support Suparna Bhattacharya
2004-07-02 13:18 ` [PATCH 6/22] FS AIO read Suparna Bhattacharya
2004-07-02 13:19 ` [PATCH 7/22] Upfront readahead to help streaming AIO reads Suparna Bhattacharya
2004-07-02 13:20 ` [PATCH 8/22] AIO cancellation fix Suparna Bhattacharya
2004-07-02 13:23 ` [PATCH 9/22] AIO immediate read (needed for AIO pipes & sockets) Suparna Bhattacharya
2004-07-02 13:23 ` [PATCH 10/22] AIO pipe support Suparna Bhattacharya
2004-07-02 13:26 ` [PATCH 11/22] Reduce AIO worker context switches Suparna Bhattacharya
2004-07-02 16:05 ` [PATCH 12/22] Writeback page range hint Suparna Bhattacharya
2004-07-02 16:18 ` [PATCH 13/22] Fix writeback page range to use exact limits Suparna Bhattacharya
2004-07-02 16:22 ` [PATCH 14/22] mpage writepages range limit fix Suparna Bhattacharya
2004-07-02 16:25 ` [PATCH 15/22] filemap_fdatawrite range interface Suparna Bhattacharya
2004-07-02 16:27 ` Suparna Bhattacharya [this message]
2004-07-02 16:31 ` [PATCH 17/22] AIO wait on writeback Suparna Bhattacharya
2004-07-02 16:33 ` [PATCH 18/22] AIO O_SYNC write Suparna Bhattacharya
2004-07-02 16:34 ` [PATCH 19/22] Fix math error in AIO wait on writeback Suparna Bhattacharya
2004-07-02 16:39 ` [PATCH 20/22] AIO poll Suparna Bhattacharya
2004-07-29 15:19 ` Jeff Moyer
2004-07-29 16:02 ` Avi Kivity
2004-07-29 16:16 ` Arjan van de Ven
2004-07-29 16:37 ` Benjamin LaHaise
2004-07-29 17:23 ` William Lee Irwin III
2004-07-29 17:10 ` William Lee Irwin III
2004-07-29 17:24 ` Avi Kivity
2004-07-29 17:26 ` William Lee Irwin III
2004-07-29 17:30 ` Avi Kivity
2004-07-29 17:32 ` William Lee Irwin III
2004-07-02 16:42 ` [PATCH 21/22] fix: flush workqueue on put_ioctx Suparna Bhattacharya
2004-07-02 16:44 ` [PATCH 22/22] Fix stalls with the AIO context switch patch Suparna Bhattacharya
2004-07-05 9:24 ` [PATCH 0/22] fsaio, pipe aio and aio poll upgraded to 2.6.7 Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20040702162729.GF3450@in.ibm.com \
--to=suparna@in.ibm.com \
--cc=linux-aio@kvack.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-osdl@osdl.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox