From: Andrew Morton <akpm@osdl.org>
To: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: linux@horizon.com, linux-kernel@vger.kernel.org, sct@redhat.com,
torvalds@osdl.org
Subject: Re: msync() behaviour broken for MS_ASYNC, revert patch?
Date: Thu, 9 Feb 2006 23:14:32 -0800
Message-ID: <20060209231432.03a09dee.akpm@osdl.org>
In-Reply-To: <43EC3961.3030904@yahoo.com.au>
Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Instead of
> LINUX_FADV_ASYNC_WRITE
> LINUX_FADV_WRITE_WAIT
>
> can we have something more consistent? Perhaps
> FADV_WRITE_ASYNC
> FADV_WRITE_SYNC
Nope, I had a bit of a think about this and decided that the two operations
which we need are:
From: Andrew Morton <akpm@osdl.org>
Add two new Linux-specific fadvise() extensions:
LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
offsets `offset' and `offset+len'. Any pages which are currently under
writeout are skipped, whether or not they are dirty.
LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
offsets `offset' and `offset+len'.
By combining these two operations the application may do several things:
LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently dirty
pages at the disk.
LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push all
of the currently dirty pages at the disk, wait until they have been written.
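
The three combinations above can be sketched from user space as follows. This is an illustration only, under stated assumptions: the two advice values are copied from the patch in this mail and are not defined by any released header, and `advise_fn', `push_mode', `push_dirty_range' and the dry-run stub are hypothetical names invented for the sketch. The advise call is passed in as a function pointer so the ordering can be exercised without a patched kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Values taken from the patch; assumed, not present in released headers. */
#define LINUX_FADV_ASYNC_WRITE 32
#define LINUX_FADV_WRITE_WAIT  33

/* Hypothetical callback: issues one fadvise-style call, returns 0 or an error. */
typedef int (*advise_fn)(int fd, off_t offset, off_t len, int advice);

enum push_mode {
	PUSH_SOME,		/* ASYNC_WRITE */
	PUSH_ALL,		/* WRITE_WAIT, ASYNC_WRITE */
	PUSH_ALL_AND_WAIT	/* WRITE_WAIT, ASYNC_WRITE, WRITE_WAIT */
};

/* Issue the advice sequence for `mode', stopping at the first error. */
static int push_dirty_range(advise_fn advise, int fd, off_t offset, off_t len,
			    enum push_mode mode)
{
	int ret;

	assert(advise != NULL);

	if (mode != PUSH_SOME) {
		/*
		 * Wait for in-flight writeout first, so that the following
		 * ASYNC_WRITE does not skip pages that are under writeout.
		 */
		ret = advise(fd, offset, len, LINUX_FADV_WRITE_WAIT);
		if (ret)
			return ret;
	}

	ret = advise(fd, offset, len, LINUX_FADV_ASYNC_WRITE);
	if (ret)
		return ret;

	if (mode == PUSH_ALL_AND_WAIT)
		ret = advise(fd, offset, len, LINUX_FADV_WRITE_WAIT);
	return ret;
}

/* Dry-run stub: records the advice values instead of calling the kernel. */
static int dry_run_log[8];
static size_t dry_run_count;

static int dry_run_advise(int fd, off_t offset, off_t len, int advice)
{
	(void)fd;
	(void)offset;
	(void)len;
	if (dry_run_count < sizeof(dry_run_log) / sizeof(dry_run_log[0]))
		dry_run_log[dry_run_count++] = advice;
	return 0;
}
```

On a kernel carrying this patch one would presumably pass a thin wrapper around posix_fadvise() in place of `dry_run_advise', assuming the C library forwards unrecognised advice values to the kernel unchanged.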
It should be noted that none of these operations write out the file's
metadata. So unless the application is strictly performing overwrites of
already-instantiated disk blocks, there are no guarantees here that the data
will be available after a crash.
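
To make the "already-instantiated disk blocks" condition concrete, here is a minimal sketch of how an application might instantiate a file's blocks up front, using only standard POSIX calls. The helper name `instantiate_blocks' is hypothetical. Whether later in-place overwrites really avoid metadata updates depends on the filesystem (a copy-on-write or log-structured filesystem may allocate new blocks on every write), so treat this as an assumption about conventional block-mapped filesystems, not a guarantee.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill [0, size) of `fd' with zeros, then fsync().
 * fsync() flushes both data and metadata, so the file size and block
 * allocations made here are on disk before any later overwrite.
 * Returns 0 on success, -1 on error.
 */
static int instantiate_blocks(int fd, off_t size)
{
	char zeros[4096];
	off_t done = 0;

	assert(size >= 0);
	memset(zeros, 0, sizeof(zeros));

	while (done < size) {
		size_t chunk = sizeof(zeros);
		ssize_t n;

		/* Clamp the final chunk so we never write past `size'. */
		if ((off_t)chunk > size - done)
			chunk = (size_t)(size - done);

		n = pwrite(fd, zeros, chunk, done);
		if (n <= 0)
			return -1;
		done += n;
	}

	return fsync(fd);
}
```

After this has succeeded, writes that stay within [0, size) are overwrites of existing blocks, which is the case in which the data-only operations described above are meaningful.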
To complete this suite of operations I guess we should have a "sync file
metadata only" operation. This gives applications access to all the building
blocks needed for all sorts of sync operations. But sync-metadata doesn't fit
well with the fadvise() interface. Probably it should be a new syscall:
sys_fmetadatasync().
The patch also diddles with the meaning of `endbyte' in sys_fadvise64_64().
It is made to represent the last affected byte in the file (i.e. it is
inclusive). Generally, all these byterange and pagerange functions are
inclusive so we can easily represent EOF with -1.
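
The inclusive-endbyte arithmetic can be checked in isolation with a small user-space sketch. The function names and `SKETCH_PAGE_SHIFT' (4096-byte pages) are assumptions made for illustration; the kernel code uses loff_t, pgoff_t and PAGE_CACHE_SHIFT. The addition is done in unsigned arithmetic here because signed overflow is undefined in portable C.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4096-byte pages */

/*
 * Last byte affected by (offset, len), inclusive.
 * Returns -1, meaning "through EOF", when len is 0 or offset+len overflows.
 */
static int64_t inclusive_endbyte(int64_t offset, int64_t len)
{
	int64_t endbyte;

	assert(offset >= 0 && len >= 0);

	/* Unsigned add wraps instead of invoking undefined behaviour. */
	endbyte = (int64_t)((uint64_t)offset + (uint64_t)len);
	if (len == 0 || endbyte < len)
		return -1;
	return endbyte - 1;
}

/* Index of the page containing `endbyte'; -1 maps to the largest index. */
static uint64_t last_page_index(int64_t endbyte)
{
	if (endbyte < 0)
		return UINT64_MAX;
	return (uint64_t)endbyte >> SKETCH_PAGE_SHIFT;
}
```

With an inclusive endbyte the last page index is simply endbyte >> shift, with no "-1" adjustment at the call site: a 4096-byte range starting at offset 0 ends at byte 4095 and so stays within page 0.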
Signed-off-by: Andrew Morton <akpm@osdl.org>
---
include/linux/fadvise.h | 6 ++++
include/linux/fs.h | 5 ++++
mm/fadvise.c | 46 +++++++++++++++++++++++++++++++++-----
mm/filemap.c | 10 ++++----
4 files changed, 57 insertions(+), 10 deletions(-)
diff -puN include/linux/fadvise.h~fadvise-async-write-commands include/linux/fadvise.h
--- devel/include/linux/fadvise.h~fadvise-async-write-commands 2006-02-09 22:29:36.000000000 -0800
+++ devel-akpm/include/linux/fadvise.h 2006-02-09 22:29:36.000000000 -0800
@@ -18,4 +18,10 @@
#define POSIX_FADV_NOREUSE 5 /* Data will be accessed once. */
#endif
+/*
+ * Linux-specific fadvise() extensions:
+ */
+#define LINUX_FADV_ASYNC_WRITE 32 /* Start writeout on range */
+#define LINUX_FADV_WRITE_WAIT 33 /* Wait upon writeout to range */
+
#endif /* FADVISE_H_INCLUDED */
diff -puN include/linux/fs.h~fadvise-async-write-commands include/linux/fs.h
--- devel/include/linux/fs.h~fadvise-async-write-commands 2006-02-09 22:29:36.000000000 -0800
+++ devel-akpm/include/linux/fs.h 2006-02-09 23:06:03.000000000 -0800
@@ -1473,6 +1473,11 @@ extern int filemap_fdatawait(struct addr
extern int filemap_write_and_wait(struct address_space *mapping);
extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
+extern int wait_on_page_writeback_range(struct address_space *mapping,
+ pgoff_t start, pgoff_t end);
+extern int __filemap_fdatawrite_range(struct address_space *mapping,
+ loff_t start, loff_t end, int sync_mode);
+
extern void sync_supers(void);
extern void sync_filesystems(int wait);
extern void emergency_sync(void);
diff -puN mm/fadvise.c~fadvise-async-write-commands mm/fadvise.c
--- devel/mm/fadvise.c~fadvise-async-write-commands 2006-02-09 22:29:36.000000000 -0800
+++ devel-akpm/mm/fadvise.c 2006-02-09 23:12:22.000000000 -0800
@@ -15,6 +15,7 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/fadvise.h>
+#include <linux/writeback.h>
#include <linux/syscalls.h>
#include <asm/unistd.h>
@@ -22,13 +23,36 @@
/*
* POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
* deactivate the pages and clear PG_Referenced.
+ *
+ * LINUX_FADV_ASYNC_WRITE: start async writeout of any dirty pages between file
+ * offsets `offset' and `offset+len' inclusive. Any pages which are currently
+ * under writeout are skipped, whether or not they are dirty.
+ *
+ * LINUX_FADV_WRITE_WAIT: wait upon writeout of any dirty pages between file
+ * offsets `offset' and `offset+len'.
+ *
+ * By combining these two operations the application may do several things:
+ *
+ * LINUX_FADV_ASYNC_WRITE: push some or all of the dirty pages at the disk.
+ *
+ * LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE: push all of the currently
+ * dirty pages at the disk.
+ *
+ * LINUX_FADV_WRITE_WAIT, LINUX_FADV_ASYNC_WRITE, LINUX_FADV_WRITE_WAIT: push
+ * all of the currently dirty pages at the disk, wait until they have been
+ * written.
+ *
+ * It should be noted that none of these operations write out the file's
+ * metadata. So unless the application is strictly performing overwrites of
+ * already-instantiated disk blocks, there are no guarantees here that the data
+ * will be available after a crash.
*/
asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
{
struct file *file = fget(fd);
struct address_space *mapping;
struct backing_dev_info *bdi;
- loff_t endbyte;
+ loff_t endbyte; /* inclusive */
pgoff_t start_index;
pgoff_t end_index;
unsigned long nrpages;
@@ -56,6 +80,8 @@ asmlinkage long sys_fadvise64_64(int fd,
endbyte = offset + len;
if (!len || endbyte < len)
endbyte = -1;
+ else
+ endbyte--; /* inclusive */
bdi = mapping->backing_dev_info;
@@ -78,7 +104,7 @@ asmlinkage long sys_fadvise64_64(int fd,
/* First and last PARTIAL page! */
start_index = offset >> PAGE_CACHE_SHIFT;
- end_index = (endbyte-1) >> PAGE_CACHE_SHIFT;
+ end_index = endbyte >> PAGE_CACHE_SHIFT;
/* Careful about overflow on the "+1" */
nrpages = end_index - start_index + 1;
@@ -96,11 +122,21 @@ asmlinkage long sys_fadvise64_64(int fd,
filemap_flush(mapping);
/* First and last FULL page! */
- start_index = (offset + (PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
+ start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
end_index = (endbyte >> PAGE_CACHE_SHIFT);
- if (end_index > start_index)
- invalidate_mapping_pages(mapping, start_index, end_index-1);
+ if (end_index >= start_index)
+ invalidate_mapping_pages(mapping, start_index,
+ end_index);
+ break;
+ case LINUX_FADV_ASYNC_WRITE:
+ ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
+ WB_SYNC_NONE);
+ break;
+ case LINUX_FADV_WRITE_WAIT:
+ ret = wait_on_page_writeback_range(mapping,
+ offset >> PAGE_CACHE_SHIFT,
+ endbyte >> PAGE_CACHE_SHIFT);
break;
default:
ret = -EINVAL;
diff -puN mm/filemap.c~fadvise-async-write-commands mm/filemap.c
--- devel/mm/filemap.c~fadvise-async-write-commands 2006-02-09 22:29:36.000000000 -0800
+++ devel-akpm/mm/filemap.c 2006-02-09 23:05:56.000000000 -0800
@@ -181,8 +181,8 @@ static int sync_page(void *word)
* these two operations is that if a dirty page/buffer is encountered, it must
* be waited upon, and not just skipped over.
*/
-static int __filemap_fdatawrite_range(struct address_space *mapping,
- loff_t start, loff_t end, int sync_mode)
+int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+ loff_t end, int sync_mode)
{
int ret;
struct writeback_control wbc = {
@@ -211,8 +211,8 @@ int filemap_fdatawrite(struct address_sp
}
EXPORT_SYMBOL(filemap_fdatawrite);
-static int filemap_fdatawrite_range(struct address_space *mapping,
- loff_t start, loff_t end)
+static int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
+ loff_t end)
{
return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
}
@@ -231,7 +231,7 @@ EXPORT_SYMBOL(filemap_flush);
* Wait for writeback to complete against pages indexed by start->end
* inclusive
*/
-static int wait_on_page_writeback_range(struct address_space *mapping,
+int wait_on_page_writeback_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
struct pagevec pvec;
_