sync_file_range(SYNC_FILE_RANGE

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
@ 2008-05-30 10:26 Pavel Machek
  2008-05-30 13:58 ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2008-05-30 10:26 UTC (permalink / raw)
  To: kernel list, Rafael J. Wysocki

Hi!

sync_file_range(SYNC_FILE_RANGE_WRITE) blocks ... which makes problem
for s2disk: there we want to start writeout as early as possible
(system is going to shut down after write, and we need the data on
disk).

Unfortuantely, sync_file_range(SYNC_FILE_RANGE_WRITE) blocks, which
does not work for us. Is there non-blocking variant? "Start writeout
on this fd, but don't block me"?

For now I'm doing:

static inline int start_writeout(int fd)
{
#ifdef SYS_sync_file_range
        if (fork())
                return 0;
        else {
                syscall(SYS_sync_file_range, fd,
                        (loff_t)0, (loff_t)0, SYNC_FILE_RANGE_WRITE);
                exit(0);
        }
#else
        errno = ENOSYS;
        return -1;
#endif
}

...but I'd prefer something more elegant...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-05-30 10:26 sync_file_range(SYNC_FILE_RANGE_WRITE) blocks? Pavel Machek
@ 2008-05-30 13:58 ` Hugh Dickins
  2008-05-30 20:43   ` Pavel Machek
  0 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2008-05-30 13:58 UTC (permalink / raw)
  To: Pavel Machek; +Cc: kernel list, Rafael J. Wysocki

On Fri, 30 May 2008, Pavel Machek wrote:
> 
> sync_file_range(SYNC_FILE_RANGE_WRITE) blocks ... which makes problem
> for s2disk: there we want to start writeout as early as possible
> (system is going to shut down after write, and we need the data on
> disk).
> 
> Unfortuantely, sync_file_range(SYNC_FILE_RANGE_WRITE) blocks, which
> does not work for us. Is there non-blocking variant? "Start writeout
> on this fd, but don't block me"?

I guess there are lots of reasons why it may block (get rescheduled)
briefly, but why would that matter to you?  Are you saying that its
whole design has got broken somehow, and now SYNC_FILE_RANGE_WRITE
is behaving as if SYNC_FILE_RANGE_WAIT_AFTER had been supplied too?

Hugh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-05-30 13:58 ` Hugh Dickins
@ 2008-05-30 20:43   ` Pavel Machek
  2008-05-31 18:44     ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2008-05-30 20:43 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: kernel list, Rafael J. Wysocki

Hi!

> > sync_file_range(SYNC_FILE_RANGE_WRITE) blocks ... which makes problem
> > for s2disk: there we want to start writeout as early as possible
> > (system is going to shut down after write, and we need the data on
> > disk).
> > 
> > Unfortuantely, sync_file_range(SYNC_FILE_RANGE_WRITE) blocks, which
> > does not work for us. Is there non-blocking variant? "Start writeout
> > on this fd, but don't block me"?
> 
> I guess there are lots of reasons why it may block (get rescheduled)
> briefly, but why would that matter to you?  Are you saying that its
> whole design has got broken somehow, and now SYNC_FILE_RANGE_WRITE
> is behaving as if SYNC_FILE_RANGE_WAIT_AFTER had been supplied too?

It appears to me like it includes WAIT_AFTER, yes.

I was not sure what the expected behaviour was... lets say we have a
lot of dirty data (like 40MB) and system with enough free memory. Is
it reasonable to expect SYNC_FILE_RANGE_WRITE to return pretty much
immediately? (like in less than 10msec)? Because it seems to take more
like a second here...

(Underlying 'file' is actually /dev/sda1 -- aka my swap partition, but
that should not matter --right?)
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-05-30 20:43   ` Pavel Machek
@ 2008-05-31 18:44     ` Hugh Dickins
  2008-06-01  0:39       ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2008-05-31 18:44 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Andrew Morton, kernel list, Rafael J. Wysocki

On Fri, 30 May 2008, Pavel Machek wrote:
> > > sync_file_range(SYNC_FILE_RANGE_WRITE) blocks ... which makes problem
> > > for s2disk: there we want to start writeout as early as possible
> > > (system is going to shut down after write, and we need the data on
> > > disk).
> > > 
> > > Unfortuantely, sync_file_range(SYNC_FILE_RANGE_WRITE) blocks, which
> > > does not work for us. Is there non-blocking variant? "Start writeout
> > > on this fd, but don't block me"?
> > 
> > I guess there are lots of reasons why it may block (get rescheduled)
> > briefly, but why would that matter to you?  Are you saying that its
> > whole design has got broken somehow, and now SYNC_FILE_RANGE_WRITE
> > is behaving as if SYNC_FILE_RANGE_WAIT_AFTER had been supplied too?
> 
> It appears to me like it includes WAIT_AFTER, yes.
> 
> I was not sure what the expected behaviour was... lets say we have a
> lot of dirty data (like 40MB) and system with enough free memory. Is
> it reasonable to expect SYNC_FILE_RANGE_WRITE to return pretty much
> immediately? (like in less than 10msec)? Because it seems to take more
> like a second here...
> 
> (Underlying 'file' is actually /dev/sda1 -- aka my swap partition, but
> that should not matter --right?)

Right (so long as you're not swapping to it at the same time!).
And it seems to be behaving the same way on a regular file.

All I can say so far is that I find the same as you do:
SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.

Which make the sync_file_range call pretty pointless: your usage seems
perfectly reasonable to me, but somehow we've broken its behaviour.
I'll be investigating ...

Hugh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-05-31 18:44     ` Hugh Dickins
@ 2008-06-01  0:39       ` Andrew Morton
  2008-06-01  7:23         ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2008-06-01  0:39 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Pavel Machek, kernel list, Rafael J. Wysocki

On Sat, 31 May 2008 19:44:49 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> On Fri, 30 May 2008, Pavel Machek wrote:
> > > > sync_file_range(SYNC_FILE_RANGE_WRITE) blocks ... which makes problem
> > > > for s2disk: there we want to start writeout as early as possible
> > > > (system is going to shut down after write, and we need the data on
> > > > disk).
> > > > 
> > > > Unfortuantely, sync_file_range(SYNC_FILE_RANGE_WRITE) blocks, which
> > > > does not work for us. Is there non-blocking variant? "Start writeout
> > > > on this fd, but don't block me"?
> > > 
> > > I guess there are lots of reasons why it may block (get rescheduled)
> > > briefly, but why would that matter to you?  Are you saying that its
> > > whole design has got broken somehow, and now SYNC_FILE_RANGE_WRITE
> > > is behaving as if SYNC_FILE_RANGE_WAIT_AFTER had been supplied too?
> > 
> > It appears to me like it includes WAIT_AFTER, yes.
> > 
> > I was not sure what the expected behaviour was... lets say we have a
> > lot of dirty data (like 40MB) and system with enough free memory. Is
> > it reasonable to expect SYNC_FILE_RANGE_WRITE to return pretty much
> > immediately? (like in less than 10msec)? Because it seems to take more
> > like a second here...
> > 
> > (Underlying 'file' is actually /dev/sda1 -- aka my swap partition, but
> > that should not matter --right?)
> 
> Right (so long as you're not swapping to it at the same time!).
> And it seems to be behaving the same way on a regular file.
> 
> All I can say so far is that I find the same as you do:
> SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
> more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.
> 
> Which make the sync_file_range call pretty pointless: your usage seems
> perfectly reasonable to me, but somehow we've broken its behaviour.
> I'll be investigating ...
> 

It will block on disk queue fullness - sysrq-W will tell.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01  0:39       ` Andrew Morton
@ 2008-06-01  7:23         ` Hugh Dickins
  2008-06-01  8:15           ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2008-06-01  7:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Pavel Machek, kernel list, Rafael J. Wysocki

On Sat, 31 May 2008, Andrew Morton wrote:
> On Sat, 31 May 2008 19:44:49 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:
> > 
> > All I can say so far is that I find the same as you do:
> > SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
> > more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.
> > 
> > Which make the sync_file_range call pretty pointless: your usage seems
> > perfectly reasonable to me, but somehow we've broken its behaviour.
> > I'll be investigating ...
> 
> It will block on disk queue fullness - sysrq-W will tell.

Ah, thank you.  What a disappointment, though it's understandable.
Doesn't that very severely limit the usefulness of the system call?

I admit the flag isn't called SYNC_FILE_RANGE_WRITE_WITHOUT_WAITING,
but I don't suppose Pavel and I are the only ones misled by it.

Hugh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01  7:23         ` Hugh Dickins
@ 2008-06-01  8:15           ` Andrew Morton
  2008-06-01 11:40             ` Pavel Machek
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2008-06-01  8:15 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Pavel Machek, kernel list, Rafael J. Wysocki

On Sun, 1 Jun 2008 08:23:43 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:

> On Sat, 31 May 2008, Andrew Morton wrote:
> > On Sat, 31 May 2008 19:44:49 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote:
> > > 
> > > All I can say so far is that I find the same as you do:
> > > SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
> > > more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.
> > > 
> > > Which make the sync_file_range call pretty pointless: your usage seems
> > > perfectly reasonable to me, but somehow we've broken its behaviour.
> > > I'll be investigating ...
> > 
> > It will block on disk queue fullness - sysrq-W will tell.
> 
> Ah, thank you.  What a disappointment, though it's understandable.
> Doesn't that very severely limit the usefulness of the system call?

A bit.  The request queue size is runtime tunable though.

I expect major users of this system call will be applications which do
small-sized overwrites into large files, mainly databases.  That is,
once the application developers discover its existence.  I'm still
getting expressions of wonder from people who I tell about the
five-year-old fadvise().

> I admit the flag isn't called SYNC_FILE_RANGE_WRITE_WITHOUT_WAITING,
> but I don't suppose Pavel and I are the only ones misled by it.

Yup, this caveat/restriction should be in the manpage.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01  8:15           ` Andrew Morton
@ 2008-06-01 11:40             ` Pavel Machek
  2008-06-01 20:37               ` Andrew Morton
  2008-06-03  8:01               ` Michael Kerrisk
  0 siblings, 2 replies; 25+ messages in thread
From: Pavel Machek @ 2008-06-01 11:40 UTC (permalink / raw)
  To: Andrew Morton, mtk.manpages; +Cc: Hugh Dickins, kernel list, Rafael J. Wysocki

Hi!

> > > > All I can say so far is that I find the same as you do:
> > > > SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
> > > > more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.
> > > > 
> > > > Which make the sync_file_range call pretty pointless: your usage seems
> > > > perfectly reasonable to me, but somehow we've broken its behaviour.
> > > > I'll be investigating ...
> > > 
> > > It will block on disk queue fullness - sysrq-W will tell.
> > 
> > Ah, thank you.  What a disappointment, though it's understandable.
> > Doesn't that very severely limit the usefulness of the system call?
> 
> A bit.  The request queue size is runtime tunable though.

Which /sys is that? What happens if I set the queue size to pretty
much infinity, will memory management die horribly?

> I expect major users of this system call will be applications which do
> small-sized overwrites into large files, mainly databases.  That is,
> once the application developers discover its existence.  I'm still
> getting expressions of wonder from people who I tell about the
> five-year-old fadvise().

Hey, you have one user now, its called s2disk. But for this call to be
useful, we'd need asynchronous variant... is there such thing?

Okay, I can fork and do the call from another process, but...

> > I admit the flag isn't called SYNC_FILE_RANGE_WRITE_WITHOUT_WAITING,
> > but I don't suppose Pavel and I are the only ones misled by it.
> 
> Yup, this caveat/restriction should be in the manpage.

Michael, this is something for you I guess?

And andrew, something for you:

--- 

SYNC_FILE_RANGE_WRITE may and will block. Document that.

Signed-off-by: Pavel Machek <pavel@suse.cz>

---
commit 5db78da3d8e6fa527bfe384ded2ff7c835592fe2
tree 4c405e07be12f0a2260492fb43d19802ff7ebab1
parent 0ea376de01be797f9563c2c2464149f8f0af6329
author Pavel <pavel@amd.ucw.cz> Sun, 01 Jun 2008 13:39:25 +0200
committer Pavel <pavel@amd.ucw.cz> Sun, 01 Jun 2008 13:39:25 +0200

 fs/sync.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index 228e17b..54e9f20 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -139,7 +139,8 @@ asmlinkage long sys_fdatasync(unsigned i
  * before performing the write.
  *
  * SYNC_FILE_RANGE_WRITE: initiate writeout of all those dirty pages in the
- * range which are not presently under writeback.
+ * range which are not presently under writeback. Notice that even this this
+ *  may and will block if you attempt to write more than request queue size.
  *
  * SYNC_FILE_RANGE_WAIT_AFTER: wait upon writeout of all pages in the range
  * after performing the write.


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01 11:40             ` Pavel Machek
@ 2008-06-01 20:37               ` Andrew Morton
  2008-06-01 22:00                 ` Rafael J. Wysocki
                                   ` (2 more replies)
  2008-06-03  8:01               ` Michael Kerrisk
  1 sibling, 3 replies; 25+ messages in thread
From: Andrew Morton @ 2008-06-01 20:37 UTC (permalink / raw)
  To: Pavel Machek; +Cc: mtk.manpages, Hugh Dickins, kernel list, Rafael J. Wysocki

On Sun, 1 Jun 2008 13:40:09 +0200 Pavel Machek <pavel@suse.cz> wrote:

> Hi!
> 
> > > > > All I can say so far is that I find the same as you do:
> > > > > SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
> > > > > more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.
> > > > > 
> > > > > Which make the sync_file_range call pretty pointless: your usage seems
> > > > > perfectly reasonable to me, but somehow we've broken its behaviour.
> > > > > I'll be investigating ...
> > > > 
> > > > It will block on disk queue fullness - sysrq-W will tell.
> > > 
> > > Ah, thank you.  What a disappointment, though it's understandable.
> > > Doesn't that very severely limit the usefulness of the system call?
> > 
> > A bit.  The request queue size is runtime tunable though.
> 
> Which /sys is that?

/sys/block/sda/queue/nr_requests

> What happens if I set the queue size to pretty
> much infinity, will memory management die horribly?

In theory, no - it's always caused problems when the VM/VFS/FS layer
has relied upon request-queue exhaustion for throttling.  Hence all
that code is supposed to work OK when there is no request-queue
blocking.  Of course, (theory/practice != 1.0).

> > I expect major users of this system call will be applications which do
> > small-sized overwrites into large files, mainly databases.  That is,
> > once the application developers discover its existence.  I'm still
> > getting expressions of wonder from people who I tell about the
> > five-year-old fadvise().
> 
> Hey, you have one user now, its called s2disk. But for this call to be
> useful, we'd need asynchronous variant... is there such thing?

Well if you're asking the syscall to shove more data into the block
layer than it can concurrently handle, sure, the block layer will
block.  It's tunable...

It can still block in places, of course - we might need to do
synchronous reads to get at metadata and we'll need to allocate memory.

> Okay, I can fork and do the call from another process, but...

I sense a strangeness.  What are you actually trying to do with all of this?

Bear in mind that sync_file_range() doesn't sync metadata (ie: indirect
blocks).  So if they weren't already known to have been written, the
data isn't safe.

> - * range which are not presently under writeback.
> + * range which are not presently under writeback. Notice that even this this
> + *  may and will block if you attempt to write more than request queue size.

um, OK. I'll fix the grammar a bit there.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01 20:37               ` Andrew Morton
@ 2008-06-01 22:00                 ` Rafael J. Wysocki
  2008-06-01 22:22                 ` Pavel Machek
  2008-06-02 11:43                 ` Jens Axboe
  2 siblings, 0 replies; 25+ messages in thread
From: Rafael J. Wysocki @ 2008-06-01 22:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Pavel Machek, mtk.manpages, Hugh Dickins, kernel list

On Sunday, 1 of June 2008, Andrew Morton wrote:
> On Sun, 1 Jun 2008 13:40:09 +0200 Pavel Machek <pavel@suse.cz> wrote:
> 
> > Hi!
> > 
> > > > > > All I can say so far is that I find the same as you do:
> > > > > > SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
> > > > > > more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.
> > > > > > 
> > > > > > Which make the sync_file_range call pretty pointless: your usage seems
> > > > > > perfectly reasonable to me, but somehow we've broken its behaviour.
> > > > > > I'll be investigating ...
> > > > > 
> > > > > It will block on disk queue fullness - sysrq-W will tell.
> > > > 
> > > > Ah, thank you.  What a disappointment, though it's understandable.
> > > > Doesn't that very severely limit the usefulness of the system call?
> > > 
> > > A bit.  The request queue size is runtime tunable though.
> > 
> > Which /sys is that?
> 
> /sys/block/sda/queue/nr_requests
> 
> > What happens if I set the queue size to pretty
> > much infinity, will memory management die horribly?
> 
> In theory, no - it's always caused problems when the VM/VFS/FS layer
> has relied upon request-queue exhaustion for throttling.  Hence all
> that code is supposed to work OK when there is no request-queue
> blocking.  Of course, (theory/practice != 1.0).
> 
> > > I expect major users of this system call will be applications which do
> > > small-sized overwrites into large files, mainly databases.  That is,
> > > once the application developers discover its existence.  I'm still
> > > getting expressions of wonder from people who I tell about the
> > > five-year-old fadvise().
> > 
> > Hey, you have one user now, its called s2disk. But for this call to be
> > useful, we'd need asynchronous variant... is there such thing?
> 
> Well if you're asking the syscall to shove more data into the block
> layer than it can concurrently handle, sure, the block layer will
> block.  It's tunable...
> 
> It can still block in places, of course - we might need to do
> synchronous reads to get at metadata and we'll need to allocate memory.
> 
> > Okay, I can fork and do the call from another process, but...
> 
> I sense a strangeness.  What are you actually trying to do with all of this?

Pavel is trying to avoid me doing multithreaded s2disk, more or less. ;-)

However, I have some numbers showing that multithreaded image saving actually
helps a lot in case the image is compressed and encrypted.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01 20:37               ` Andrew Morton
  2008-06-01 22:00                 ` Rafael J. Wysocki
@ 2008-06-01 22:22                 ` Pavel Machek
  2008-06-01 22:47                   ` Andrew Morton
  2008-06-02 11:43                 ` Jens Axboe
  2 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2008-06-01 22:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mtk.manpages, Hugh Dickins, kernel list, Rafael J. Wysocki

Hi!

> > > I expect major users of this system call will be applications which do
> > > small-sized overwrites into large files, mainly databases.  That is,
> > > once the application developers discover its existence.  I'm still
> > > getting expressions of wonder from people who I tell about the
> > > five-year-old fadvise().
> > 
> > Hey, you have one user now, its called s2disk. But for this call to be
> > useful, we'd need asynchronous variant... is there such thing?
> 
> Well if you're asking the syscall to shove more data into the block
> layer than it can concurrently handle, sure, the block layer will
> block.  It's tunable...

No, no, I don't want to overload block layer. All I want is ...

> > Okay, I can fork and do the call from another process, but...
> 
> I sense a strangeness.  What are you actually trying to do with all of this?

Okay, so I have around 400MB of data, I want it compressed, optionally
encrypted and written to partition.

Now, if I do it "naturally", I do writes, followed by fsync.

That's bad, because kernel does not start write out immediately, and
we waste time with idle disk. (If data compress really well, or
encryption is off, this is significant).

So we improve on this, by doing sync_file_range(SYNC_FILE_RANGE_WRITE)
periodically. That keeps the disk busy, but occassionaly blocks the
cpu... wasting time (which mostly hurts in compression+encryption
case).

So... how can I keep _both_ cpu and disk busy?

> Bear in mind that sync_file_range() doesn't sync metadata (ie: indirect
> blocks).  So if they weren't already known to have been written, the
> data isn't safe.

I'm not trying to use this for correctness; I'm optimizing for
speed. At the end, I do fsync() anyway.  

> > - * range which are not presently under writeback.
> > + * range which are not presently under writeback. Notice that even this this
> > + *  may and will block if you attempt to write more than request queue size.
> 
> um, OK. I'll fix the grammar a bit there.

Thanks.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01 22:22                 ` Pavel Machek
@ 2008-06-01 22:47                   ` Andrew Morton
  2008-06-01 23:00                     ` Pavel Machek
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2008-06-01 22:47 UTC (permalink / raw)
  To: Pavel Machek; +Cc: mtk.manpages, Hugh Dickins, kernel list, Rafael J. Wysocki

On Mon, 2 Jun 2008 00:22:02 +0200 Pavel Machek <pavel@suse.cz> wrote:

> Hi!
> 
> > > > I expect major users of this system call will be applications which do
> > > > small-sized overwrites into large files, mainly databases.  That is,
> > > > once the application developers discover its existence.  I'm still
> > > > getting expressions of wonder from people who I tell about the
> > > > five-year-old fadvise().
> > > 
> > > Hey, you have one user now, its called s2disk. But for this call to be
> > > useful, we'd need asynchronous variant... is there such thing?
> > 
> > Well if you're asking the syscall to shove more data into the block
> > layer than it can concurrently handle, sure, the block layer will
> > block.  It's tunable...
> 
> No, no, I don't want to overload block layer. All I want is ...
> 
> > > Okay, I can fork and do the call from another process, but...
> > 
> > I sense a strangeness.  What are you actually trying to do with all of this?
> 
> Okay, so I have around 400MB of data, I want it compressed, optionally
> encrypted and written to partition.
> 
> Now, if I do it "naturally", I do writes, followed by fsync.
> 
> That's bad, because kernel does not start write out immediately, and
> we waste time with idle disk. (If data compress really well, or
> encryption is off, this is significant).
> 
> So we improve on this, by doing sync_file_range(SYNC_FILE_RANGE_WRITE)
> periodically. That keeps the disk busy, but occassionaly blocks the
> cpu... wasting time (which mostly hurts in compression+encryption
> case).

yep.  That's another use of sync_file_range(): to allow smart userspace
to optimise the kernel's IO scheduling decisions.

> So... how can I keep _both_ cpu and disk busy?

pthread_create() ;)

How about this:

- Add a new SYNC_FILE_RANGE_NON_BLOCKING

- If userspace set that flag, turn on writeback_control.nonblocking
  in __filemap_fdatawrite_range().

- test it a lot.

It will be userspace's responsibility to avoid burning huge amounts of
CPU repeatedly calling sync_file_range() and having it not actually write
anything.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01 22:47                   ` Andrew Morton
@ 2008-06-01 23:00                     ` Pavel Machek
  2008-06-01 23:11                       ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Machek @ 2008-06-01 23:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mtk.manpages, Hugh Dickins, kernel list, Rafael J. Wysocki

On Sun 2008-06-01 15:47:36, Andrew Morton wrote:
> On Mon, 2 Jun 2008 00:22:02 +0200 Pavel Machek <pavel@suse.cz> wrote:

Hi!

> > > I sense a strangeness.  What are you actually trying to do with all of this?
> > 
> > Okay, so I have around 400MB of data, I want it compressed, optionally
> > encrypted and written to partition.
> > 
> > Now, if I do it "naturally", I do writes, followed by fsync.
> > 
> > That's bad, because kernel does not start write out immediately, and
> > we waste time with idle disk. (If data compress really well, or
> > encryption is off, this is significant).
> > 
> > So we improve on this, by doing sync_file_range(SYNC_FILE_RANGE_WRITE)
> > periodically. That keeps the disk busy, but occassionaly blocks the
> > cpu... wasting time (which mostly hurts in compression+encryption
> > case).
> 
> yep.  That's another use of sync_file_range(): to allow smart userspace
> to optimise the kernel's IO scheduling decisions.
> 
> > So... how can I keep _both_ cpu and disk busy?
> 
> pthread_create() ;)

Actually it is easy enough to do with fork(), but...

> How about this:
> 
> - Add a new SYNC_FILE_RANGE_NON_BLOCKING
> 
> - If userspace set that flag, turn on writeback_control.nonblocking
>   in __filemap_fdatawrite_range().
> 
> - test it a lot.

Works for me. Is the expectation that I code this? I can certainly
provide testing ;-).

> It will be userspace's responsibility to avoid burning huge amounts of
> CPU repeatedly calling sync_file_range() and having it not actually write
> anything.

Ok... I guess doing 10x sync_file_range() when writing 400MB of data
is not excessive?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01 23:00                     ` Pavel Machek
@ 2008-06-01 23:11                       ` Andrew Morton
  2008-06-02  8:43                         ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2008-06-01 23:11 UTC (permalink / raw)
  To: Pavel Machek; +Cc: mtk.manpages, Hugh Dickins, kernel list, Rafael J. Wysocki

On Mon, 2 Jun 2008 01:00:40 +0200 Pavel Machek <pavel@suse.cz> wrote:

> > How about this:
> > 
> > - Add a new SYNC_FILE_RANGE_NON_BLOCKING
> > 
> > - If userspace set that flag, turn on writeback_control.nonblocking
> >   in __filemap_fdatawrite_range().
> > 
> > - test it a lot.
> 
> Works for me. Is the expectation that I code this? I can certainly
> provide testing ;-).

Something like this:

 fs/sync.c          |    3 ++-
 include/linux/fs.h |    4 +++-
 mm/filemap.c       |    9 +++++----
 3 files changed, 10 insertions(+), 6 deletions(-)

diff -puN mm/filemap.c~a mm/filemap.c
--- a/mm/filemap.c~a
+++ a/mm/filemap.c
@@ -207,7 +207,7 @@ static int sync_page_killable(void *word
  * be waited upon, and not just skipped over.
  */
 int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
-				loff_t end, int sync_mode)
+				loff_t end, int sync_mode, int nonblocking)
 {
 	int ret;
 	struct writeback_control wbc = {
@@ -215,6 +215,7 @@ int __filemap_fdatawrite_range(struct ad
 		.nr_to_write = mapping->nrpages * 2,
 		.range_start = start,
 		.range_end = end,
+		.nonblocking = !!nonblocking,
 	};
 
 	if (!mapping_cap_writeback_dirty(mapping))
@@ -227,7 +228,7 @@ int __filemap_fdatawrite_range(struct ad
 static inline int __filemap_fdatawrite(struct address_space *mapping,
 	int sync_mode)
 {
-	return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode);
+	return __filemap_fdatawrite_range(mapping, 0, LLONG_MAX, sync_mode, 0);
 }
 
 int filemap_fdatawrite(struct address_space *mapping)
@@ -239,7 +240,7 @@ EXPORT_SYMBOL(filemap_fdatawrite);
 static int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end)
 {
-	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
+	return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL, 0);
 }
 
 /**
@@ -430,7 +431,7 @@ int filemap_write_and_wait_range(struct 
 
 	if (mapping->nrpages) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
-						 WB_SYNC_ALL);
+						 WB_SYNC_ALL, 0);
 		/* See comment of filemap_write_and_wait() */
 		if (err != -EIO) {
 			int err2 = wait_on_page_writeback_range(mapping,
diff -puN fs/sync.c~a fs/sync.c
--- a/fs/sync.c~a
+++ a/fs/sync.c
@@ -268,7 +268,8 @@ int do_sync_mapping_range(struct address
 
 	if (flags & SYNC_FILE_RANGE_WRITE) {
 		ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
-						WB_SYNC_NONE);
+					WB_SYNC_NONE,
+					flags & SYNC_FILE_RANGE_NONBLOCKING);
 		if (ret < 0)
 			goto out;
 	}
diff -puN include/linux/fs.h~a include/linux/fs.h
--- a/include/linux/fs.h~a
+++ a/include/linux/fs.h
@@ -268,6 +268,7 @@ extern int dir_notify_enable;
 #define SYNC_FILE_RANGE_WAIT_BEFORE	1
 #define SYNC_FILE_RANGE_WRITE		2
 #define SYNC_FILE_RANGE_WAIT_AFTER	4
+#define SYNC_FILE_RANGE_NONBLOCKING	8
 
 #ifdef __KERNEL__
 
@@ -1740,7 +1741,8 @@ extern int filemap_write_and_wait_range(
 extern int wait_on_page_writeback_range(struct address_space *mapping,
 				pgoff_t start, pgoff_t end);
 extern int __filemap_fdatawrite_range(struct address_space *mapping,
-				loff_t start, loff_t end, int sync_mode);
+				loff_t start, loff_t end, int sync_mode,
+				int nonblocking);
 
 extern long do_fsync(struct file *file, int datasync);
 extern void sync_supers(void);


But it needs comment updates and it really would be nice to combine the
`sync_mode' and `nonblocking' flags to __filemap_fdatawrite_range() -
it's all getting a bit silly there.

> > It will be userspace's responsibility to avoid burning huge amounts of
> > CPU repeatedly calling sync_file_range() and having it not actually write
> > anything.
> 
> Ok... I guess doing 10x sync_file_range() when writing 400MB of data
> is not excessive?

Sounds very sane.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01 23:11                       ` Andrew Morton
@ 2008-06-02  8:43                         ` Hugh Dickins
  2008-06-02 11:18                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2008-06-02  8:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Pavel Machek, mtk.manpages, kernel list, Rafael J. Wysocki

On Sun, 1 Jun 2008, Andrew Morton wrote:
> On Mon, 2 Jun 2008 01:00:40 +0200 Pavel Machek <pavel@suse.cz> wrote:
> > > How about this:
> > > 
> > > - Add a new SYNC_FILE_RANGE_NON_BLOCKING
> > > 
> > > - If userspace set that flag, turn on writeback_control.nonblocking
> > >   in __filemap_fdatawrite_range().
> > > 
> > > - test it a lot.
> > 
> > Works for me. Is the expectation that I code this? I can certainly
> > provide testing ;-).
> 
> Something like this:

Though this fits very easily into the current kernel implementation,
I don't think it's the right interface for userspace.

If we do go this kind of a way, then I'd say SYNC_FILE_RANGE_NON_BLOCKING
needs to tell the caller how far it got before giving up, rather than just
success or failure.  Why? um, um, because it feels right; and would help
the caller help the kernel by not overloading it with needlessly repeated
loop ranges - any stronger reasons?  But sync_file_range() was defined
to return int rather than ssize_t, so that becomes awkward.

Never mind, I don't think it is the right way anyway.  We don't need
additions to the existing sync_file_range() interface, we just need it
to behave as naive people like Pavel and I expected it to behave in the
first place: SYNC_FILE_RANGE_WRITE should be nonblocking (with respect
to queue congestion, and maybe page locking also).

I was imagining that where the existing nonblocking code just gives up,
the SYNC_FILE_RANGE_WRITE case should schedule the remaining work to be
done a little later: possibly by poking and/or leaving info for pdflush.

I guess there may be some resource fairness issues, with either approach:
it ought not to be unreasonable for a process to proceed by writing a
page then SYNC_FILE_RANGE_WRITing that page, page by page, should it?

But once we claim nonblocking at the user interface, I expect we'll
come up against the raciness in the current nonblocking treatment:
just because bdi is not congested when it's tested doesn't mean we
won't block when the write is submitted.  Perhaps a BIO_RW_NONBLOCK
could fix that up?

Hugh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-02  8:43                         ` Hugh Dickins
@ 2008-06-02 11:18                           ` Rafael J. Wysocki
  2008-06-02 12:11                             ` Hugh Dickins
  0 siblings, 1 reply; 25+ messages in thread
From: Rafael J. Wysocki @ 2008-06-02 11:18 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Andrew Morton, Pavel Machek, mtk.manpages, kernel list

[-- Attachment #1: Type: text/plain, Size: 2003 bytes --]

On Monday, 2 of June 2008, Hugh Dickins wrote:
> On Sun, 1 Jun 2008, Andrew Morton wrote:
> > On Mon, 2 Jun 2008 01:00:40 +0200 Pavel Machek <pavel@suse.cz> wrote:
> > > > How about this:
> > > > 
> > > > - Add a new SYNC_FILE_RANGE_NON_BLOCKING
> > > > 
> > > > - If userspace set that flag, turn on writeback_control.nonblocking
> > > >   in __filemap_fdatawrite_range().
> > > > 
> > > > - test it a lot.
> > > 
> > > Works for me. Is the expectation that I code this? I can certainly
> > > provide testing ;-).
> > 
> > Something like this:
> 
> Though this fits very easily into the current kernel implementation,
> I don't think it's the right interface for userspace.
> 
> If we do go this kind of a way, then I'd say SYNC_FILE_RANGE_NON_BLOCKING
> needs to tell the caller how far it got before giving up, rather than just
> success or failure.  Why? um, um, because it feels right; and would help
> the caller help the kernel by not overloading it with needlessly repeated
> loop ranges - any stronger reasons?  But sync_file_range() was defined
> to return int rather than ssize_t, so that becomes awkward.
> 
> Never mind, I don't think it is the right way anyway.  We don't need
> additions to the existing sync_file_range() interface, we just need it
> to behave as naive people like Pavel and I expected it to behave in the
> first place: SYNC_FILE_RANGE_WRITE should be nonblocking (with respect
> to queue congestion, and maybe page locking also).

Well, frankly, I'm not sure if we need anything better than we already have.
In fact my numbers show that SYNC_FILE_RANGE_WRITE works quite well -
please see
http://sourceforge.net/mailarchive/message.php?msg_name=200806020122.36193.rjw%40sisk.pl
"early writeout" means that SYNC_FILE_RANGE_WRITE is used and the file with
the results is attached for convenience.

My interpretation of the results is here:
http://sourceforge.net/mailarchive/forum.php?thread_name=200806021238.17100.rjw%40sisk.pl&forum_name=suspend-devel

Thanks,
Rafael

[-- Attachment #2: image_saving_data.txt --]
[-- Type: text/plain, Size: 1098 bytes --]

Labels:
W+ - early writeout enabled
W- - early writeout disabled
C+ - compression enabled
C- - compression disabled
E+ - encryption enabled
E- - encryption disabled
T+ - threads enabled
T- - threads disabled

The numbers are speed values in MB/s.


Box 1 (Single-core, Athlon 64 3000+ 1.8 GHz, 1 MB L2 cache, 1.5 GB RAM)

	W-C-E-T-	W+C-E-T-	W-C+E-T-	W+C+E-T-
Write:	26.2		27.4		46.5		78.6
Read:	27.5		27.5		83.5		83.8

	W-C-E+T-	W+C-E+T-	W-C+E+T-	W+C+E+T-
Write:	15.6		19.1		28.4		38.7
Read:	18.0		18.1		46.1		46.2

	W-C-E+T+	W+C-E+T+	W-C+E+T+	W+C+E+T+
Write:	N/A		N/A		27.9		37.5
Read:	N/A		N/A		46.4		46.4

For compressed images the compression ratio was approx. 0.30


Box 2 (Dual-core, Turion X2 TL-60 2 GHz, 2 x 512 KB L2 cache, 2 GB RAM)

	W-C-E-T-	W+C-E-T-	W-C+E-T-	W+C+E-T-
Write:	31.9		34.1		42.7		74.5
Read:	33.0		32.9		79.8		80.1

	W-C-E+T-	W+C-E+T-	W-C+E+T-	W+C+E+T-
Write:	16.7		22.0		23.9		34.0
Read:	22.2		22.2		47.6		47.9

	W-C-E+T+	W+C-E+T+	W-C+E+T+	W+C+E+T+
Write:	16.4		22.2		34.7		57.2
Read:	22.2		22.1		47.9		48.0

For compressed images the compression ratio was approx. 0.40


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01 20:37               ` Andrew Morton
  2008-06-01 22:00                 ` Rafael J. Wysocki
  2008-06-01 22:22                 ` Pavel Machek
@ 2008-06-02 11:43                 ` Jens Axboe
  2008-06-02 12:40                   ` Hugh Dickins
  2008-06-02 16:50                   ` Andrew Morton
  2 siblings, 2 replies; 25+ messages in thread
From: Jens Axboe @ 2008-06-02 11:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pavel Machek, mtk.manpages, Hugh Dickins, kernel list,
	Rafael J. Wysocki

On Sun, Jun 01 2008, Andrew Morton wrote:
> > > I expect major users of this system call will be applications which do
> > > small-sized overwrites into large files, mainly databases.  That is,
> > > once the application developers discover its existence.  I'm still
> > > getting expressions of wonder from people who I tell about the
> > > five-year-old fadvise().
> > 
> > Hey, you have one user now, its called s2disk. But for this call to be
> > useful, we'd need asynchronous variant... is there such thing?
> 
> Well if you're asking the syscall to shove more data into the block
> layer than it can concurrently handle, sure, the block layer will
> block.  It's tunable...

Ehm, lets get the history right, please :-)

The block layer pretty much doesn't care about how large the queue
size is, it's largely at 128 to prevent the vm from shitting itself
like it has done in the past (and continues to do I guess, though
your reply leaves me wondering).

So you think the vm will be fine with a huge number of requests?
It wont go nuts scanning and reclaiming, wasting oodles of CPU
cycles?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-02 11:18                           ` Rafael J. Wysocki
@ 2008-06-02 12:11                             ` Hugh Dickins
  0 siblings, 0 replies; 25+ messages in thread
From: Hugh Dickins @ 2008-06-02 12:11 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Andrew Morton, Pavel Machek, mtk.manpages, kernel list

On Mon, 2 Jun 2008, Rafael J. Wysocki wrote:
> 
> Well, frankly, I'm not sure if we need anything better than we already have.
> In fact my numbers show that SYNC_FILE_RANGE_WRITE works quite well -
> please see
> http://sourceforge.net/mailarchive/message.php?msg_name=200806020122.36193.rjw%40sisk.pl
> "early writeout" means that SYNC_FILE_RANGE_WRITE is used and the file with
> the results is attached for convenience.
> 
> My interpretation of the results is here:
> http://sourceforge.net/mailarchive/forum.php?thread_name=200806021238.17100.rjw%40sisk.pl&forum_name=suspend-devel

If the current SYNC_FILE_RANGE_WRITE is good enough for you, great,
keep on using it.  Forgive me, I'm not much concerned with your
particular use, I'll happily leave that to you and Pavel!  What
bothers me is Pavel's discovery that a WAIT|WRITE|WAIT interface,
designed to help callers by allowing those stages to be separated,
ends up doing a significant amount of WAITing in the WRITE stage.

Hugh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-02 11:43                 ` Jens Axboe
@ 2008-06-02 12:40                   ` Hugh Dickins
  2008-06-16 20:53                     ` Rik van Riel
  2008-06-02 16:50                   ` Andrew Morton
  1 sibling, 1 reply; 25+ messages in thread
From: Hugh Dickins @ 2008-06-02 12:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrew Morton, Rik van Riel, Lee Schermerhorn, Pavel Machek,
	mtk.manpages, kernel list, Rafael J. Wysocki

On Mon, 2 Jun 2008, Jens Axboe wrote:
> On Sun, Jun 01 2008, Andrew Morton wrote:
> > 
> > Well if you're asking the syscall to shove more data into the block
> > layer than it can concurrently handle, sure, the block layer will
> > block.  It's tunable...
> 
> Ehm, lets get the history right, please :-)
> 
> The block layer pretty much doesn't care about how large the queue
> size is, it's largely at 128 to prevent the vm from shitting itself
> like it has done in the past (and continues to do I guess, though
> your reply leaves me wondering).
> 
> So you think the vm will be fine with a huge number of requests?
> It wont go nuts scanning and reclaiming, wasting oodles of CPU
> cycles?

Interesting.  I wonder.  I may be quite wrong (Cc'ed Rik and Lee
who I think are currently most in touch with what goes on there),
but my impression is that whereas vmscan.c takes pages off LRU
while it's doing writeback on them, and arranges for them to go
back to the reclaimable tail of the LRU once writeback completes
(the rotate reclaimable business), pages under writeback which
it did not itself initiate will get put back on LRU at the head
(with write completion still pending).

If unbounded queues turn out to be viable in other respects
(which would be very attractive for SYNC_PAGE_RANGE_WRITE), even
if not, then maybe shrink_page_list needs to be able to take pages
already under Writeback off the LRU and set them Reclaimable.

Hugh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-02 11:43                 ` Jens Axboe
  2008-06-02 12:40                   ` Hugh Dickins
@ 2008-06-02 16:50                   ` Andrew Morton
  1 sibling, 0 replies; 25+ messages in thread
From: Andrew Morton @ 2008-06-02 16:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Machek, mtk.manpages, Hugh Dickins, kernel list,
	Rafael J. Wysocki

On Mon, 2 Jun 2008 13:43:20 +0200 Jens Axboe <jens.axboe@oracle.com> wrote:

> On Sun, Jun 01 2008, Andrew Morton wrote:
> > > > I expect major users of this system call will be applications which do
> > > > small-sized overwrites into large files, mainly databases.  That is,
> > > > once the application developers discover its existence.  I'm still
> > > > getting expressions of wonder from people who I tell about the
> > > > five-year-old fadvise().
> > > 
> > > Hey, you have one user now, its called s2disk. But for this call to be
> > > useful, we'd need asynchronous variant... is there such thing?
> > 
> > Well if you're asking the syscall to shove more data into the block
> > layer than it can concurrently handle, sure, the block layer will
> > block.  It's tunable...
> 
> Ehm, lets get the history right, please :-)
> 
> The block layer pretty much doesn't care about how large the queue
> size is, it's largely at 128 to prevent the vm from shitting itself
> like it has done in the past (and continues to do I guess, though
> your reply leaves me wondering).
> 
> So you think the vm will be fine with a huge number of requests?
> It wont go nuts scanning and reclaiming, wasting oodles of CPU
> cycles?

The VFS did screw up a couple of times with unbounded queues.  It did
get fixed and it is a design objective for the writeback code to _not_
depend upon request exhaustion for proper behaviour.

But it hasn't had a large amount of testing with unbounded queues and
there may still be problems in there.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-01 11:40             ` Pavel Machek
  2008-06-01 20:37               ` Andrew Morton
@ 2008-06-03  8:01               ` Michael Kerrisk
  2008-06-03  8:05                 ` Pavel Machek
  1 sibling, 1 reply; 25+ messages in thread
From: Michael Kerrisk @ 2008-06-03  8:01 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Andrew Morton, mtk.manpages, Hugh Dickins, kernel list,
	Rafael J. Wysocki

On Sun, Jun 1, 2008 at 1:40 PM, Pavel Machek <pavel@suse.cz> wrote:
> Hi!
>
>> > > > All I can say so far is that I find the same as you do:
>> > > > SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
>> > > > more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.
>> > > >
>> > > > Which make the sync_file_range call pretty pointless: your usage seems
>> > > > perfectly reasonable to me, but somehow we've broken its behaviour.
>> > > > I'll be investigating ...
>> > >
>> > > It will block on disk queue fullness - sysrq-W will tell.
>> >
>> > Ah, thank you.  What a disappointment, though it's understandable.
>> > Doesn't that very severely limit the usefulness of the system call?
>>
>> A bit.  The request queue size is runtime tunable though.
>
> Which /sys is that? What happens if I set the queue size to pretty
> much infinity, will memory management die horribly?
>
>> I expect major users of this system call will be applications which do
>> small-sized overwrites into large files, mainly databases.  That is,
>> once the application developers discover its existence.  I'm still
>> getting expressions of wonder from people who I tell about the
>> five-year-old fadvise().
>
> Hey, you have one user now, its called s2disk. But for this call to be
> useful, we'd need asynchronous variant... is there such thing?
>
> Okay, I can fork and do the call from another process, but...
>
>> > I admit the flag isn't called SYNC_FILE_RANGE_WRITE_WITHOUT_WAITING,
>> > but I don't suppose Pavel and I are the only ones misled by it.
>>
>> Yup, this caveat/restriction should be in the manpage.
>
> Michael, this is something for you I guess?

Pavel,

Just to confirm: you are meaning that the sentence

    Notice that even this this may and will block if you attempt
    to write more than request queue size.

should be added to the man page under the description of
SYNC_FILE_RANGE_WRITE, right?

Cheers,

Michael

>
> And andrew, something for you:
>
> ---
>
> SYNC_FILE_RANGE_WRITE may and will block. Document that.
>
> Signed-off-by: Pavel Machek <pavel@suse.cz>
>
> ---
> commit 5db78da3d8e6fa527bfe384ded2ff7c835592fe2
> tree 4c405e07be12f0a2260492fb43d19802ff7ebab1
> parent 0ea376de01be797f9563c2c2464149f8f0af6329
> author Pavel <pavel@amd.ucw.cz> Sun, 01 Jun 2008 13:39:25 +0200
> committer Pavel <pavel@amd.ucw.cz> Sun, 01 Jun 2008 13:39:25 +0200
>
>  fs/sync.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/fs/sync.c b/fs/sync.c
> index 228e17b..54e9f20 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -139,7 +139,8 @@ asmlinkage long sys_fdatasync(unsigned i
>  * before performing the write.
>  *
>  * SYNC_FILE_RANGE_WRITE: initiate writeout of all those dirty pages in the
> - * range which are not presently under writeback.
> + * range which are not presently under writeback. Notice that even this this
> + *  may and will block if you attempt to write more than request queue size.
>  *
>  * SYNC_FILE_RANGE_WAIT_AFTER: wait upon writeout of all pages in the range
>  * after performing the write.
>
>
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Found a bug? http://www.kernel.org/doc/man-pages/reporting_bugs.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-03  8:01               ` Michael Kerrisk
@ 2008-06-03  8:05                 ` Pavel Machek
  0 siblings, 0 replies; 25+ messages in thread
From: Pavel Machek @ 2008-06-03  8:05 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Andrew Morton, mtk.manpages, Hugh Dickins, kernel list,
	Rafael J. Wysocki

On Tue 2008-06-03 10:01:22, Michael Kerrisk wrote:
> On Sun, Jun 1, 2008 at 1:40 PM, Pavel Machek <pavel@suse.cz> wrote:
> > Hi!
> >
> >> > > > All I can say so far is that I find the same as you do:
> >> > > > SYNC_FILE_RANGE_WRITE (after writing) takes a significant amount of time,
> >> > > > more than half as long as when you add in SYNC_FILE_RANGE_WAIT_AFTER too.
> >> > > >
> >> > > > Which make the sync_file_range call pretty pointless: your usage seems
> >> > > > perfectly reasonable to me, but somehow we've broken its behaviour.
> >> > > > I'll be investigating ...
> >> > >
> >> > > It will block on disk queue fullness - sysrq-W will tell.
> >> >
> >> > Ah, thank you.  What a disappointment, though it's understandable.
> >> > Doesn't that very severely limit the usefulness of the system call?
> >>
> >> A bit.  The request queue size is runtime tunable though.
> >
> > Which /sys is that? What happens if I set the queue size to pretty
> > much infinity, will memory management die horribly?
> >
> >> I expect major users of this system call will be applications which do
> >> small-sized overwrites into large files, mainly databases.  That is,
> >> once the application developers discover its existence.  I'm still
> >> getting expressions of wonder from people who I tell about the
> >> five-year-old fadvise().
> >
> > Hey, you have one user now, its called s2disk. But for this call to be
> > useful, we'd need asynchronous variant... is there such thing?
> >
> > Okay, I can fork and do the call from another process, but...
> >
> >> > I admit the flag isn't called SYNC_FILE_RANGE_WRITE_WITHOUT_WAITING,
> >> > but I don't suppose Pavel and I are the only ones misled by it.
> >>
> >> Yup, this caveat/restriction should be in the manpage.
> >
> > Michael, this is something for you I guess?
> 
> Pavel,
> 
> Just to confirm: you are meaning that the sentence
> 
>     Notice that even this this may and will block if you attempt
>     to write more than request queue size.
> 
> should be added to the man page under the description of
> SYNC_FILE_RANGE_WRITE, right?

Yes.
							Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-02 12:40                   ` Hugh Dickins
@ 2008-06-16 20:53                     ` Rik van Riel
  2008-06-17  4:54                       ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Rik van Riel @ 2008-06-16 20:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jens Axboe, Andrew Morton, Lee Schermerhorn, Pavel Machek,
	mtk.manpages, kernel list, Rafael J. Wysocki

On Mon, 2 Jun 2008 13:40:28 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:
> On Mon, 2 Jun 2008, Jens Axboe wrote:
> > On Sun, Jun 01 2008, Andrew Morton wrote:
> > > 
> > > Well if you're asking the syscall to shove more data into the block
> > > layer than it can concurrently handle, sure, the block layer will
> > > block.  It's tunable...
> > 
> > Ehm, lets get the history right, please :-)
> > 
> > The block layer pretty much doesn't care about how large the queue
> > size is, it's largely at 128 to prevent the vm from shitting itself
> > like it has done in the past (and continues to do I guess, though
> > your reply leaves me wondering).
> > 
> > So you think the vm will be fine with a huge number of requests?
> > It wont go nuts scanning and reclaiming, wasting oodles of CPU
> > cycles?
> 
> Interesting.  I wonder.  I may be quite wrong (Cc'ed Rik and Lee
> who I think are currently most in touch with what goes on there),
> but my impression is that whereas vmscan.c takes pages off LRU
> while it's doing writeback on them, and arranges for them to go
> back to the reclaimable tail of the LRU

The problem with lots of CFQ requests is simpler.

If you look at balance_dirty_pages() you will see that it only
takes dirty and NFS unstable pages into account when checking
the dirty limit.  The pages that are in flight (under IO) are
not counted at all.

If you have 8192 CFQ requests and large streaming IO, most of
the IOs will be mergeable and it is possible to pin all of
memory in in-flight (PG_writeback - and other?) pages.

I suspect that balance_dirty_pages() will have to take the
writeback pages into account, though that may cause problems
with FUSE.  Maybe it should at least wait for some IO to
complete?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-16 20:53                     ` Rik van Riel
@ 2008-06-17  4:54                       ` Andrew Morton
  2008-06-17 13:38                         ` Rik van Riel
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2008-06-17  4:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Jens Axboe, Lee Schermerhorn, Pavel Machek,
	mtk.manpages, kernel list, Rafael J. Wysocki

On Mon, 16 Jun 2008 16:53:51 -0400 Rik van Riel <riel@redhat.com> wrote:

> On Mon, 2 Jun 2008 13:40:28 +0100 (BST)
> Hugh Dickins <hugh@veritas.com> wrote:
> > On Mon, 2 Jun 2008, Jens Axboe wrote:
> > > On Sun, Jun 01 2008, Andrew Morton wrote:
> > > > 
> > > > Well if you're asking the syscall to shove more data into the block
> > > > layer than it can concurrently handle, sure, the block layer will
> > > > block.  It's tunable...
> > > 
> > > Ehm, lets get the history right, please :-)
> > > 
> > > The block layer pretty much doesn't care about how large the queue
> > > size is, it's largely at 128 to prevent the vm from shitting itself
> > > like it has done in the past (and continues to do I guess, though
> > > your reply leaves me wondering).
> > > 
> > > So you think the vm will be fine with a huge number of requests?
> > > It wont go nuts scanning and reclaiming, wasting oodles of CPU
> > > cycles?
> > 
> > Interesting.  I wonder.  I may be quite wrong (Cc'ed Rik and Lee
> > who I think are currently most in touch with what goes on there),
> > but my impression is that whereas vmscan.c takes pages off LRU
> > while it's doing writeback on them, and arranges for them to go
> > back to the reclaimable tail of the LRU
> 
> The problem with lots of CFQ requests is simpler.
> 
> If you look at balance_dirty_pages() you will see that it only
> takes dirty and NFS unstable pages into account when checking
> the dirty limit.  The pages that are in flight (under IO) are
> not counted at all.

That would be totally busted.  All that nr_writeback logic in there is very
much supposed to handle under-writeback pages?

> If you have 8192 CFQ requests and large streaming IO, most of
> the IOs will be mergeable and it is possible to pin all of
> memory in in-flight (PG_writeback - and other?) pages.
> 
> I suspect that balance_dirty_pages() will have to take the
> writeback pages into account, though that may cause problems
> with FUSE.  Maybe it should at least wait for some IO to
> complete?

Worried.  One of us is missing something here.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: sync_file_range(SYNC_FILE_RANGE_WRITE) blocks?
  2008-06-17  4:54                       ` Andrew Morton
@ 2008-06-17 13:38                         ` Rik van Riel
  0 siblings, 0 replies; 25+ messages in thread
From: Rik van Riel @ 2008-06-17 13:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Jens Axboe, Lee Schermerhorn, Pavel Machek,
	mtk.manpages, kernel list, Rafael J. Wysocki

On Mon, 16 Jun 2008 21:54:06 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> > The problem with lots of CFQ requests is simpler.
> > 
> > If you look at balance_dirty_pages() you will see that it only
> > takes dirty and NFS unstable pages into account when checking
> > the dirty limit.  The pages that are in flight (under IO) are
> > not counted at all.
> 
> That would be totally busted.  All that nr_writeback logic in there is very
> much supposed to handle under-writeback pages?

Mmm, you are right.   It is counting the number of writeback
pages.

It just isn't waiting on them.  After moving enough pages from
dirty state to writeback state, it breaks out of the loop and
allows memory dirtying to continue.

                if (pages_written >= write_chunk)
                        break;          /* We've done our duty */

I wonder if we should wait for some IO to complete, or at least
do a simple congestion_wait() if the number of writeback pages
is very large?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2008-06-17 13:41 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-30 10:26 sync_file_range(SYNC_FILE_RANGE_WRITE) blocks? Pavel Machek
2008-05-30 13:58 ` Hugh Dickins
2008-05-30 20:43   ` Pavel Machek
2008-05-31 18:44     ` Hugh Dickins
2008-06-01  0:39       ` Andrew Morton
2008-06-01  7:23         ` Hugh Dickins
2008-06-01  8:15           ` Andrew Morton
2008-06-01 11:40             ` Pavel Machek
2008-06-01 20:37               ` Andrew Morton
2008-06-01 22:00                 ` Rafael J. Wysocki
2008-06-01 22:22                 ` Pavel Machek
2008-06-01 22:47                   ` Andrew Morton
2008-06-01 23:00                     ` Pavel Machek
2008-06-01 23:11                       ` Andrew Morton
2008-06-02  8:43                         ` Hugh Dickins
2008-06-02 11:18                           ` Rafael J. Wysocki
2008-06-02 12:11                             ` Hugh Dickins
2008-06-02 11:43                 ` Jens Axboe
2008-06-02 12:40                   ` Hugh Dickins
2008-06-16 20:53                     ` Rik van Riel
2008-06-17  4:54                       ` Andrew Morton
2008-06-17 13:38                         ` Rik van Riel
2008-06-02 16:50                   ` Andrew Morton
2008-06-03  8:01               ` Michael Kerrisk
2008-06-03  8:05                 ` Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox