* write-behind on streaming writes
       [not found] ` <CA+55aFxHt8q8+jQDuoaK=hObX+73iSBTa4bBWodCX3s-y4Q1GQ@mail.gmail.com>
@ 2012-05-29 15:57   ` Fengguang Wu
  2012-05-29 17:35     ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Fengguang Wu @ 2012-05-29 15:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Myklebust, Trond, linux-fsdevel, Linux Memory Management List

Hi Linus,

On Mon, May 28, 2012 at 10:09:56AM -0700, Linus Torvalds wrote:
> Ok, pulled.
>
> However, I have an independent question for you - have you looked at
> any kind of per-file write-behind kind of logic?

Yes, definitely. Especially for NFS, it pays to keep each file's dirty
page count low, because in NFS a simple stat() will require flushing
all the file's dirty pages before proceeding.

However, in general there have been no strong user requests for this
feature. I guess that's mainly because users still have the choice of
O_SYNC or O_DIRECT. Actually, O_SYNC is pretty close to the code below
for the purpose of limiting dirty and writeback pages, except that
it's not on by default and hence means nothing for normal users.

> The reason I ask is that pretty much every time I write some big file
> (usually when over-writing a harddisk), I tend to use my own hackish
> model, which looks like this:
>
>     #define BUFSIZE (8*1024*1024ul)
>     ...
>     for (..) {
>         ...
>         if (write(fd, buffer, BUFSIZE) != BUFSIZE)
>             break;
>         sync_file_range(fd, index*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WRITE);
>         if (index)
>             sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
>                 SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
>     ....
>
> and it tends to be *beautiful* for both disk IO performance and for
> system responsiveness while the big write is in progress.

It seems to be all about optimizing the 1-dd case for desktop users,
and the most beautiful thing about per-file write-behind is that it
keeps both the number of dirty and writeback pages low in the system
when there are only one or two sequential dirtier tasks, which is
good for responsiveness.

Note that the above user space code won't work well when there are 10+
dirtier tasks: it effectively creates 10+ IO submitters working on
different regions of the disk and thus creates lots of seeks. When
there are 10+ dirtier tasks, it's desirable not only to have one
single flusher thread submit all the IO, but also for the flusher to
work on the inodes with a large write chunk size.

I happen to have some numbers comparing the current adaptive
(write_bandwidth/2 = 50MB) and the old fixed 4MB write chunk sizes on
XFS (not choosing ext4 because it internally enforces a >=128MB chunk
size). It's basically a 4% performance drop in the 1-dd case and up to
a 20% drop in the 100-dd case:

      3.4.0-rc2          3.4.0-rc2-4M+
    -----------  ------------------------
         114.02        -4.2%       109.23  snb/thresh=8G/xfs-1dd-1-3.4.0-rc2
         102.25       -11.7%        90.24  snb/thresh=8G/xfs-10dd-1-3.4.0-rc2
         104.17       -17.5%        85.91  snb/thresh=8G/xfs-20dd-1-3.4.0-rc2
         104.94       -18.7%        85.28  snb/thresh=8G/xfs-30dd-1-3.4.0-rc2
         104.76       -21.9%        81.82  snb/thresh=8G/xfs-100dd-1-3.4.0-rc2

So we probably still want to keep the 0.5s worth of chunk size.

> And I'm wondering if we couldn't expose this kind of write-behind
> logic from the kernel. Sure, it only works for the "contiguous write
> of a single large file" model, but that model isn't actually all
> *that* unusual.
>
> Right now all the write-back logic is based on the
> balance_dirty_pages() model, which is more of a global dirty model.
> Which obviously is needed too - this isn't an "either or" kind of
> thing, it's more of a "maybe we could have a streaming detector *and*
> the 'random writes' code". So I was wondering if anybody had ever been
> looking more at an explicit write-behind model that uses the same kind
> of "per-file window" that the read-ahead code does.

I can imagine it being implemented in the kernel this way, with a
streaming write detector in balance_dirty_pages():

    if (not globally throttled &&
        is streaming writer &&
        it's crossing the N+1 boundary) {
            queue writeback work for chunk N to the flusher
            wait for work completion
    }

The good thing is that this doesn't look like a complex addition.
However, the potential problem is that the "wait for work completion"
part won't have a guaranteed completion time, especially when there
are multiple dd tasks. This could result in uncontrollable delays in
the write() syscall. So we may do this instead:

    - wait for work completion
    + sleep for (chunk_size/write_bandwidth)

To avoid long write() delays, we might further split the one big 0.5s
sleep into smaller sleeps. (A userspace approximation of this pacing
idea is sketched after this message.)

> (The above code only works well for known streaming writes, but the
> *model* of saying "ok, let's start writeout for the previous streaming
> block, and then wait for the writeout of the streaming block before
> that" really does tend to result in very smooth IO and minimal
> disruption of other processes..)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 18+ messages in thread
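
A minimal userspace approximation of the "sleep for
(chunk_size/write_bandwidth)" variant proposed above -- a sketch only,
with a made-up file name and a hardcoded 100MB/s bandwidth assumption
where a real implementation would estimate the bandwidth:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK (8*1024*1024UL)

    static char buf[CHUNK];

    int main(void)
    {
        int fd = open("write-behind-demo.tst", O_WRONLY|O_CREAT|O_TRUNC, 0644);
        double bw = 100e6;                  /* assumed bandwidth: 100MB/s */
        struct timespec pace = { 0, (long)(CHUNK / bw * 1e9) }; /* ~84ms/chunk */
        unsigned long i;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(buf, 'a', CHUNK);
        for (i = 0; i < 512; i++) {         /* 4GB total */
            if (write(fd, buf, CHUNK) != (ssize_t)CHUNK)
                break;
            /* queue chunk i for writeback, without waiting for it... */
            sync_file_range(fd, (off64_t)i * CHUNK, CHUNK,
                            SYNC_FILE_RANGE_WRITE);
            /* ...and pace ourselves at the assumed bandwidth instead */
            nanosleep(&pace, NULL);
        }
        close(fd);
        return 0;
    }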
* Re: write-behind on streaming writes
  2012-05-29 15:57 ` write-behind on streaming writes Fengguang Wu
@ 2012-05-29 17:35   ` Linus Torvalds
  2012-05-30  3:21     ` Fengguang Wu
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2012-05-29 17:35 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: LKML, Myklebust, Trond, linux-fsdevel, Linux Memory Management List

On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
>
> Actually, O_SYNC is pretty close to the code below for the purpose of
> limiting dirty and writeback pages, except that it's not on by default
> and hence means nothing for normal users.

Absolutely not.

O_SYNC syncs the *current* write, syncs your metadata, and just
generally makes your writer synchronous. It's just a f*cking moronic
idea. Nobody sane ever uses it, since you are much better off just
using fsync() if you want that kind of behavior. That's one of those
"stupid legacy flags" things that have no sane use. The whole point is
that doing that is never the right thing to do.

You want to sync *past* writes, and you never ever want to wait on
them unless you just sent more (newer) writes to the disk that you are
*not* waiting on - so that you always have more IO pending.

O_SYNC is the absolute antithesis of that kind of "multiple levels of
overlapping IO", because it requires that the IO is _done_ by the time
you start more, which is against the whole point.

> It seems to be all about optimizing the 1-dd case for desktop users,
> and the most beautiful thing about per-file write-behind is that it
> keeps both the number of dirty and writeback pages low in the system
> when there are only one or two sequential dirtier tasks, which is
> good for responsiveness.

Yes, but I don't think it's about a single-dd case - it's about just
trying to handle one common case (streaming writes) efficiently and
naturally. Try to get those out of the system so that you can then
worry about the *other* cases knowing that they don't have that kind
of big streaming behavior.

For example, right now our main top-level writeback logic is *not*
about streaming writes (just dirty counts), but then we try to "find"
the locality by making the lower-level writeback do the whole "write
back by chunking inodes" without really having any higher-level
information.

I just suspect that we'd be better off teaching upper levels about the
streaming. I know for a fact that if I do it by hand, system
responsiveness was *much* better, and IO throughput didn't go down at
all.

> Note that the above user space code won't work well when there are 10+
> dirtier tasks: it effectively creates 10+ IO submitters working on
> different regions of the disk and thus creates lots of seeks.

Not really much more than our current writeback code does. It
*schedules* data for writing, but doesn't wait for it until much
later. You seem to think it was synchronous. It's not. Look at the
second sync_file_range() thing, and the important part is the
"index-1". The fact that you confused this with O_SYNC seems to be the
same thing. This has absolutely *nothing* to do with O_SYNC.

The other important part is that the chunk size is fairly large. We do
read-ahead in 64k kind of things; to make sense, the write-behind
chunking needs to be in "multiple megabytes". 8MB is probably the
minimum size at which it makes sense. The write-behind would be for
things like people writing disk images and video files. Not for random
IO in smaller chunks.

                 Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-05-29 17:35 ` Linus Torvalds
@ 2012-05-30  3:21   ` Fengguang Wu
  2012-06-05  1:01     ` Dave Chinner
  2012-06-05 17:23     ` Vivek Goyal
  0 siblings, 2 replies; 18+ messages in thread
From: Fengguang Wu @ 2012-05-30 3:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Vivek Goyal

Linus,

On Tue, May 29, 2012 at 10:35:46AM -0700, Linus Torvalds wrote:
> On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
> >
> > Actually, O_SYNC is pretty close to the code below for the purpose of
> > limiting dirty and writeback pages, except that it's not on by default
> > and hence means nothing for normal users.
>
> Absolutely not.
>
> O_SYNC syncs the *current* write, syncs your metadata, and just
> generally makes your writer synchronous. It's just a f*cking moronic
> idea. Nobody sane ever uses it, since you are much better off just
> using fsync() if you want that kind of behavior. That's one of those
> "stupid legacy flags" things that have no sane use. The whole point is
> that doing that is never the right thing to do.
>
> You want to sync *past* writes, and you never ever want to wait on
> them unless you just sent more (newer) writes to the disk that you are
> *not* waiting on - so that you always have more IO pending.
>
> O_SYNC is the absolute antithesis of that kind of "multiple levels of
> overlapping IO", because it requires that the IO is _done_ by the time
> you start more, which is against the whole point.

Yeah, O_SYNC is not really the sane thing to use. Thanks for teaching
me this in such great detail!

> > It seems to be all about optimizing the 1-dd case for desktop users,
> > and the most beautiful thing about per-file write-behind is that it
> > keeps both the number of dirty and writeback pages low in the system
> > when there are only one or two sequential dirtier tasks, which is
> > good for responsiveness.
>
> Yes, but I don't think it's about a single-dd case - it's about just
> trying to handle one common case (streaming writes) efficiently and
> naturally. Try to get those out of the system so that you can then
> worry about the *other* cases knowing that they don't have that kind
> of big streaming behavior.
>
> For example, right now our main top-level writeback logic is *not*
> about streaming writes (just dirty counts), but then we try to "find"
> the locality by making the lower-level writeback do the whole "write
> back by chunking inodes" without really having any higher-level
> information.

Agreed. Streaming writes can be reliably detected in the same way as
readahead, and doing explicit write-behind for them may help make the
writeback more focused and well behaved.

For example, consider file A being sequentially written to by dd, and
another mmapped file B being randomly written to. In the current
global writeback, the two files will likely have a 1:1 share of the
dirty pages. With write-behind, we'll effectively limit file A's dirty
footprint to 2 chunk sizes, possibly leaving much more room for file B
and increasing the chances that it accumulates more adjacent dirty
pages by writeback time.

> I just suspect that we'd be better off teaching upper levels about the
> streaming. I know for a fact that if I do it by hand, system
> responsiveness was *much* better, and IO throughput didn't go down at
> all.

Your observation of better responsiveness may well stem from two
aspects:

1) lower dirty/writeback page counts
2) the async write IO queue being drained constantly

(1) is obvious. For a mem=4G desktop, the default dirty limit can be
up to (4096MB * 20% = 819MB), while your smart writer effectively
limits dirty/writeback pages to a dramatically lower 16MB.

(2) comes from the use of the _WAIT_ flags in

    sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);

Each sync_file_range() syscall will submit 8MB of write IO and wait
for completion. That means the async write IO queue constantly swings
between 0 and 8MB of fill, at a frequency of 100MB/s / 8MB = 12.5
times per second. So every 80ms the async IO queue runs empty, which
gives any pending read IO (from firefox etc.) a chance to be serviced.
Nice and sweet breaks!

I suspect (2) contributes *much more* than (1) to desktop
responsiveness, because in a desktop with heavy sequential writes and
sporadic reads, the 20% dirty/writeback pages can hardly reach the end
of the LRU lists to trigger waits in direct page reclaim.

On the other hand, it's a known problem that our IO scheduler is still
not well behaved enough to provide good read latency when the flusher
rightfully manages to keep the async IO queue 100% full all the time.
The IO scheduler is the right place to solve this issue. There's
nothing wrong with the flusher blindly filling the async IO queue:
it's the flusher's duty to avoid underruns of the async IO queue, and
the IO scheduler's duty to select the right queue to service (or to
idle). The IO scheduler *in theory* has all the information it needs
to decide to _not service_ requests from the flusher when reads have
been observed recently...

> > Note that the above user space code won't work well when there are
> > 10+ dirtier tasks: it effectively creates 10+ IO submitters working
> > on different regions of the disk and thus creates lots of seeks.
>
> Not really much more than our current writeback code does. It
> *schedules* data for writing, but doesn't wait for it until much
> later.
>
> You seem to think it was synchronous. It's not. Look at the second
> sync_file_range() thing, and the important part is the "index-1". The
> fact that you confused this with O_SYNC seems to be the same thing.
> This has absolutely *nothing* to do with O_SYNC.

Hmm, we should be sharing the same view here: it's not waiting for
"index", but it does wait for "index-1" to clear PG_writeback, by
using SYNC_FILE_RANGE_WAIT_AFTER. Or when there are 10+ writers
running, each submitting 8MB of data to the async IO queue, they may
well overrun the max IO queue size and get blocked at the earlier
stage of get_request_wait().

> The other important part is that the chunk size is fairly large. We do
> read-ahead in 64k kind of things; to make sense, the write-behind
> chunking needs to be in "multiple megabytes". 8MB is probably the
> minimum size at which it makes sense.

Yup. And we also need to make sure it's not 10 tasks each scheduling
50MB write IOs *concurrently*. sync_file_range() unfortunately does it
this way, sending IO requests to the async IO queue on its own rather
than delegating the work to the flusher and letting the single flusher
submit IO for the inodes one after the other.

Imagine the async IO queue can hold exactly 50MB of writeback pages.
You can see the obvious difference in the graph below, where the IO
queue is filled with dirty pages from (a) one single inode or (b) 10
different inodes. In the latter case, the IO scheduler will switch
between the inodes much more frequently and create lots more seeks.

A theoretic view of the async IO queue:

    +----------------+      +----------------+
    |                |      |    inode 1     |
    |                |      +----------------+
    |                |      |    inode 2     |
    |                |      +----------------+
    |                |      |    inode 3     |
    |                |      +----------------+
    |                |      |    inode 4     |
    |                |      +----------------+
    |    inode 1     |      |    inode 5     |
    |                |      +----------------+
    |                |      |    inode 6     |
    |                |      +----------------+
    |                |      |    inode 7     |
    |                |      +----------------+
    |                |      |    inode 8     |
    |                |      +----------------+
    |                |      |    inode 9     |
    |                |      +----------------+
    |                |      |    inode 10    |
    +----------------+      +----------------+

    (a) one single flusher      (b) 10 sync_file_range()
        submitting 50MB IO          submitting 50MB IO
        for each inode              for each inode
        *in turn*                   *in parallel*

So if parallel file syncs are a common usage, we'll need to make them
IO-less, too.

> The write-behind would be for things like people writing disk images
> and video files. Not for random IO in smaller chunks.

Yup.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-05-30  3:21 ` Fengguang Wu
@ 2012-06-05  1:01   ` Dave Chinner
  2012-06-05 17:18     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2012-06-05 1:01 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Vivek Goyal

On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> On Tue, May 29, 2012 at 10:35:46AM -0700, Linus Torvalds wrote:
> > I just suspect that we'd be better off teaching upper levels about
> > the streaming. I know for a fact that if I do it by hand, system
> > responsiveness was *much* better, and IO throughput didn't go down
> > at all.
>
> Your observation of better responsiveness may well stem from two
> aspects:
>
> 1) lower dirty/writeback page counts
> 2) the async write IO queue being drained constantly
>
> (1) is obvious. For a mem=4G desktop, the default dirty limit can be
> up to (4096MB * 20% = 819MB), while your smart writer effectively
> limits dirty/writeback pages to a dramatically lower 16MB.
>
> (2) comes from the use of the _WAIT_ flags in
>
>     sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
>
> Each sync_file_range() syscall will submit 8MB of write IO and wait
> for completion. That means the async write IO queue constantly swings
> between 0 and 8MB of fill, at a frequency of 100MB/s / 8MB = 12.5
> times per second. So every 80ms the async IO queue runs empty, which
> gives any pending read IO (from firefox etc.) a chance to be
> serviced. Nice and sweet breaks!
>
> I suspect (2) contributes *much more* than (1) to desktop
> responsiveness.

Almost certainly, especially with NCQ devices where even if the IO
scheduler preempts the write queue immediately, the device might
complete the outstanding 31 writes before servicing the read which is
issued as the 32nd command....

So NCQ depth is going to play a part here as well.

> Because in a desktop with heavy sequential writes and sporadic reads,
> the 20% dirty/writeback pages can hardly reach the end of the LRU
> lists to trigger waits in direct page reclaim.
>
> On the other hand, it's a known problem that our IO scheduler is
> still not well behaved enough to provide good read latency when the
> flusher rightfully manages to keep the async IO queue 100% full all
> the time.

Deep queues are the antithesis of low latency. If you want good IO
interactivity (i.e. low access latency) you cannot keep deep async IO
queues. If you want good throughput, you need deep queues to allow as
wide a scheduling window as possible and to keep the IO device as busy
as possible.

> The IO scheduler is the right place to solve this issue. There's
> nothing wrong with the flusher blindly filling the async IO queue:
> it's the flusher's duty to avoid underruns of the async IO queue, and
> the IO scheduler's duty to select the right queue to service (or to
> idle). The IO scheduler *in theory* has all the information it needs
> to decide to _not service_ requests from the flusher when reads have
> been observed recently...

That's my take on the issue, too.

Even if we decide that streaming writes should be synced immediately,
where should we draw the limit? I often write temporary files that
would qualify as large streaming writes (e.g. 1GB) and then
immediately remove them. I rely on the fact that they don't hit the
disk for performance (i.e. <1s to create, wait 2s, <1s to read, <1s to
unlink). If these are forced to disk rather than sitting in memory for
a short while, the create will now take ~10s per file, and I won't be
able to create 10 of them concurrently and have them all take <1s to
create....

IOWs, what might seem like an interactivity optimisation for one
workload can quite badly affect the performance of a different
workload. Optimising read latency vs write bandwidth is exactly what
we have IO schedulers for....

> Or when there are 10+ writers running, each submitting 8MB of data to
> the async IO queue, they may well overrun the max IO queue size and
> get blocked at the earlier stage of get_request_wait().

Yup, as soon as you have multiple IO submitters, we get back to the
old problem of thrashing the disks. This is *exactly* the throughput
problem we solved by moving to IO-less throttling. That is, having N
IO submitters is far less efficient than having a single, well
controlled IO submitter. That's exactly what we want to avoid...

> > The other important part is that the chunk size is fairly large. We
> > do read-ahead in 64k kind of things; to make sense, the write-behind
> > chunking needs to be in "multiple megabytes". 8MB is probably the
> > minimum size at which it makes sense.
>
> Yup. And we also need to make sure it's not 10 tasks each scheduling
> 50MB write IOs *concurrently*. sync_file_range() unfortunately does
> it this way, sending IO requests to the async IO queue on its own
> rather than delegating the work to the flusher and letting the single
> flusher submit IO for the inodes one after the other.

Yup, that's the thrashing we need to avoid ;)

> So if parallel file syncs are a common usage, we'll need to make them
> IO-less, too.

Or just tell people "don't do that".

> > The write-behind would be for things like people writing disk images
> > and video files. Not for random IO in smaller chunks.

Or you could just use async direct IO to achieve exactly the same
thing without modifying the kernel at all ;)

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread
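
For reference, a minimal sketch of the async direct IO alternative Dave
mentions -- under assumptions: libaio (link with -laio), a made-up file
name, 8MB chunks, and the same two-deep pipeline as the
sync_file_range() loop; nobody in this thread actually ran this:

    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (8*1024*1024UL)

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbp = &cb;  /* the kernel copies the iocb at submit time */
        struct io_event ev;
        void *buf;
        unsigned long i;

        /* O_DIRECT requires aligned buffers; 4096 covers common sector sizes */
        if (posix_memalign(&buf, 4096, CHUNK))
            return 1;
        memset(buf, 'a', CHUNK);

        int fd = open("aio-write-behind.tst",
                      O_WRONLY|O_CREAT|O_TRUNC|O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (io_setup(2, &ctx) < 0) {
            fprintf(stderr, "io_setup failed\n");
            return 1;
        }

        for (i = 0; i < 512; i++) {  /* 4GB total */
            /* submit chunk i without waiting for it to complete... */
            io_prep_pwrite(&cb, fd, buf, CHUNK, (long long)i * CHUNK);
            if (io_submit(ctx, 1, &cbp) != 1) {
                fprintf(stderr, "io_submit failed\n");
                break;
            }
            /* ...but, as with the "index-1" trick, reap one completion
               so at most two chunks are ever in flight */
            if (i)
                io_getevents(ctx, 1, 1, &ev, NULL);
        }
        io_getevents(ctx, 1, 1, &ev, NULL);  /* drain the last chunk */
        io_destroy(ctx);
        close(fd);
        return 0;
    }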
* Re: write-behind on streaming writes
  2012-06-05  1:01 ` Dave Chinner
@ 2012-06-05 17:18   ` Vivek Goyal
  0 siblings, 0 replies; 18+ messages in thread
From: Vivek Goyal @ 2012-06-05 17:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fengguang Wu, Linus Torvalds, LKML, Myklebust, Trond,
	linux-fsdevel, Linux Memory Management List

On Tue, Jun 05, 2012 at 11:01:48AM +1000, Dave Chinner wrote:
> On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> > [..]
> > I suspect (2) contributes *much more* than (1) to desktop
> > responsiveness.
>
> Almost certainly, especially with NCQ devices where even if the IO
> scheduler preempts the write queue immediately, the device might
> complete the outstanding 31 writes before servicing the read which is
> issued as the 32nd command....

CFQ does preempt the async queue once sync IO gets queued.

> So NCQ depth is going to play a part here as well.

Yes, NCQ depth does contribute, primarily to READ latencies, in the
presence of async IO. I think disk drivers and disk firmware should
also participate in prioritizing READs over pending WRITEs to improve
the situation. The IO scheduler can only do so much.

CFQ already tries hard to keep the pending async queue depth low, and
that results in lower throughput many a time (as compared to
deadline). In fact, CFQ tries so hard to prioritize SYNC IO over async
IO that I have often heard of cases of WRITEs being starved and people
facing "task blocked for 120 seconds" warnings.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-05-30  3:21 ` Fengguang Wu
  2012-06-05  1:01   ` Dave Chinner
@ 2012-06-05 17:23   ` Vivek Goyal
  2012-06-05 17:41     ` Vivek Goyal
  1 sibling, 1 reply; 18+ messages in thread
From: Vivek Goyal @ 2012-06-05 17:23 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List

On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:

[..]
> (2) comes from the use of the _WAIT_ flags in
>
>     sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
>
> Each sync_file_range() syscall will submit 8MB of write IO and wait
> for completion. That means the async write IO queue constantly swings
> between 0 and 8MB of fill, at a frequency of 100MB/s / 8MB = 12.5
> times per second. So every 80ms the async IO queue runs empty, which
> gives any pending read IO (from firefox etc.) a chance to be
> serviced. Nice and sweet breaks!

I doubt that the async IO queue runs empty like that. We wait for the
previous range (index-1) to finish, but have already started the IO on
the next 8MB of pages. So effectively that should keep 8MB of async IO
in the queue (unless there are delays on the user space side). So the
reason for the latency improvement might be something else, and not
that the async IO queue is empty for some of the time.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-05 17:23 ` Vivek Goyal
@ 2012-06-05 17:41   ` Vivek Goyal
  2012-06-05 18:48     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Vivek Goyal @ 2012-06-05 17:41 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List

On Tue, Jun 05, 2012 at 01:23:02PM -0400, Vivek Goyal wrote:
> On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> > [..]
>
> I doubt that the async IO queue runs empty like that. We wait for the
> previous range (index-1) to finish, but have already started the IO on
> the next 8MB of pages. So effectively that should keep 8MB of async IO
> in the queue (unless there are delays on the user space side). So the
> reason for the latency improvement might be something else, and not
> that the async IO queue is empty for some of the time.

With the sync_file_range() test we can have 8MB of IO in flight;
without it, I think we can have more at times, and that might be the
reason for the latency improvement.

I see that CFQ has code to allow deeper NCQ depth if there is only a
single writer. So once a reader comes along, it might find tons of
async IO already in flight. sync_file_range() will limit that
in-flight IO, hence the latency improvement. And if we have multiple
dd tasks doing sync_file_range(), this latency improvement should
probably go away.

I will run some tests to verify whether my understanding of the deeper
queue depths in the single-writer case is correct.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-05 17:41 ` Vivek Goyal
@ 2012-06-05 18:48   ` Vivek Goyal
  2012-06-05 20:10     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Vivek Goyal @ 2012-06-05 18:48 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List

On Tue, Jun 05, 2012 at 01:41:57PM -0400, Vivek Goyal wrote:
> [..]
> I see that CFQ has code to allow deeper NCQ depth if there is only a
> single writer. So once a reader comes along, it might find tons of
> async IO already in flight. sync_file_range() will limit that
> in-flight IO, hence the latency improvement. And if we have multiple
> dd tasks doing sync_file_range(), this latency improvement should
> probably go away.
>
> I will run some tests to verify whether my understanding of the deeper
> queue depths in the single-writer case is correct.

So I did run some tests, and I can confirm that on average there are
more in-flight requests *without* sync_file_range(), and that's
probably why the sync_file_range() test shows better latency.

I can see that with "dd if=/dev/zero of=zerofile bs=1M count=1024" we
are driving deeper queue depths (up to 32), and in the later stages
the number of in-flight requests is constantly high.

With sync_file_range(), the number of in-flight requests fluctuates a
lot, between 1 and 32. Many a time it is just 1, or up to 16, and a
few times it went up to 32. So the sync_file_range() test keeps fewer
requests in flight on average, hence the better latencies.

It might not produce a throughput drop on SATA disks, but it might
have some effect on storage array LUNs. Will give it a try.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
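
One crude way to watch in-flight request counts like these is to
sample the block layer's inflight counters from sysfs. A sketch,
assuming the disk is sda (adjust the path for your device; the file
holds two counters, in-flight reads then writes):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/sys/block/sda/inflight";  /* assumed device */

        for (;;) {
            unsigned long reads, writes;
            FILE *f = fopen(path, "r");

            if (!f || fscanf(f, "%lu %lu", &reads, &writes) != 2) {
                perror(path);
                return 1;
            }
            fclose(f);
            printf("in-flight: %5lu reads  %5lu writes\n", reads, writes);
            sleep(1);
        }
    }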
* Re: write-behind on streaming writes
  2012-06-05 18:48 ` Vivek Goyal
@ 2012-06-05 20:10   ` Vivek Goyal
  2012-06-06  2:57     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Vivek Goyal @ 2012-06-05 20:10 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 02:48:53PM -0400, Vivek Goyal wrote:

[..]
> So the sync_file_range() test keeps fewer requests in flight on
> average, hence the better latencies. It might not produce a throughput
> drop on SATA disks, but it might have some effect on storage array
> LUNs. Will give it a try.

Well, I ran the dd and sync_file_range tests on a storage array LUN.
I wrote a file of size 4G on ext4 and got about 300MB/s write speed.
In fact, when I measured time using "time", the sync_file_range test
finished a little faster.

Then I started looking at the blktrace output. The sync_file_range()
test initially (for about 8 seconds) drives a shallow queue depth
(about 16), but after 8 seconds somehow the flusher gets involved and
starts submitting lots of requests, and we start driving a much higher
queue depth (more than 100). Not sure why the flusher should get
involved. Is everything working as expected? I thought that as we wait
for the last 8MB of IO to finish before starting a new one, we should
have at most 16MB of IO in flight. Fengguang?

Anyway, this makes the speed comparison invalid, as the flusher gets
involved after some time and we start driving more in-flight requests.
I guess I should hard code the maximum number of requests in flight to
see the effect of request queue depth on throughput.

I am also attaching the sync_file_range() test Linus mentioned. Did I
write it right?

Thanks
Vivek

#define _GNU_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>

#define BUFSIZE (8*1024*1024)

char buf[BUFSIZE];

int main()
{
	int fd, index = 0;

	fd = open("sync-file-range-tester.tst-file", O_WRONLY|O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	memset(buf, 'a', BUFSIZE);

	while (1) {
		if (write(fd, buf, BUFSIZE) != BUFSIZE)
			break;

		sync_file_range(fd, index*BUFSIZE, BUFSIZE,
				SYNC_FILE_RANGE_WRITE);
		if (index) {
			sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
					SYNC_FILE_RANGE_WAIT_BEFORE|
					SYNC_FILE_RANGE_WRITE|
					SYNC_FILE_RANGE_WAIT_AFTER);
		}
		index++;
		if (index >= 512)
			break;
	}
}

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-05 20:10 ` Vivek Goyal
@ 2012-06-06  2:57   ` Vivek Goyal
  2012-06-06  3:14     ` Linus Torvalds
  2012-06-06 14:08     ` Fengguang Wu
  0 siblings, 2 replies; 18+ messages in thread
From: Vivek Goyal @ 2012-06-06 2:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 04:10:45PM -0400, Vivek Goyal wrote:
> [..]
> Then I started looking at the blktrace output. The sync_file_range()
> test initially (for about 8 seconds) drives a shallow queue depth
> (about 16), but after 8 seconds somehow the flusher gets involved and
> starts submitting lots of requests, and we start driving a much higher
> queue depth (more than 100). Not sure why the flusher should get
> involved. Is everything working as expected? I thought that as we wait
> for the last 8MB of IO to finish before starting a new one, we should
> have at most 16MB of IO in flight. Fengguang?

Ok, found it. I was using "int index", which causes a signed integer
overflow: once "index" crosses 255, index*BUFSIZE overflows, the
64-bit offset gets sign-extended, and the offsets are screwed. So past
2G of file size, sync_file_range() effectively stops working, leaving
dirty pages behind that are then cleaned up by the flusher. That
explains why the flusher was kicking in during my tests. Change "int"
to "unsigned int" and the problem is fixed.

Now I ran the sync_file_range() test and another program which writes
a 4GB file and does an fdatasync() at the end, and compared total
execution times. The first one takes around 12.5 seconds while the
latter takes around 12.0 seconds. So sync_file_range() is just a
little slower on this SAN LUN.

I had expected a bigger difference, as sync_file_range() is driving a
max queue depth of 32 (16MB of IO in flight in total), while the
flushers drive queue depths up to 140 or so. So in this particular
test, driving much deeper queue depths is not really helping much. (I
have seen higher throughputs with higher queue depths in the past. Not
sure why we don't see that here.)

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-06  2:57 ` Vivek Goyal
@ 2012-06-06  3:14   ` Linus Torvalds
  2012-06-06 12:14     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2012-06-06 3:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>
> I had expected a bigger difference, as sync_file_range() is driving a
> max queue depth of 32 (16MB of IO in flight in total), while the
> flushers drive queue depths up to 140 or so. So in this particular
> test, driving much deeper queue depths is not really helping much. (I
> have seen higher throughputs with higher queue depths in the past. Not
> sure why we don't see that here.)

How did interactivity feel?

Because quite frankly, if the throughput difference is 12.5 vs 12
seconds, I suspect the interactivity thing is what dominates.

And from my memory, the interactivity difference was absolutely
*huge*. Even back when I used rotational media, I basically couldn't
even notice the background write with the sync_file_range() approach,
while the regular writeback without the write-behind had absolutely
*huge* pauses if you used something like firefox that uses fsync()
etc. And starting new applications that weren't cached was noticeably
worse too - and with sync_file_range it wasn't even all that
noticeable.

NOTE! For the real "firefox + fsync" test, I suspect you'd need to do
the writeback on the same filesystem (and obviously disk) as your home
directory is. If the big write is to another filesystem and another
disk, I think you won't see the same issues.

Admittedly, I have not really touched anything with a rotational disk
for the last few years, nor do I ever want to see those rotating
pieces of high-tech rust ever again. And maybe your SAN has such good
latency even under load that it doesn't really matter. I remember it
mattering a lot back when..

Of course, back when I did that testing and had rotational media, we
didn't have the per-bdi writeback logic with the smart speed-dependent
depths etc, so it may be that we're just so much better at writeback
these days that it's not nearly as noticeable any more.

                 Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-06  3:14 ` Linus Torvalds
@ 2012-06-06 12:14   ` Vivek Goyal
  2012-06-06 14:00     ` Fengguang Wu
  2012-06-06 16:15     ` Vivek Goyal
  0 siblings, 2 replies; 18+ messages in thread
From: Vivek Goyal @ 2012-06-06 12:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fengguang Wu, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote:
> How did interactivity feel?
>
> Because quite frankly, if the throughput difference is 12.5 vs 12
> seconds, I suspect the interactivity thing is what dominates.
>
> And from my memory, the interactivity difference was absolutely
> *huge*. [..]
>
> NOTE! For the real "firefox + fsync" test, I suspect you'd need to do
> the writeback on the same filesystem (and obviously disk) as your home
> directory is. If the big write is to another filesystem and another
> disk, I think you won't see the same issues.

Ok, I did the following test on my single SATA disk, and my root
filesystem is on this disk.

I dropped caches, launched firefox, and monitored the time it takes
for firefox to start (cache cold).

And my results are the reverse of what you have been seeing. With
sync_file_range() running, firefox takes roughly 30 seconds to start,
and with the flusher in operation it takes roughly 20 seconds to start
(I have approximated the average of 3 runs for simplicity).

I think this is happening because sync_file_range() will send all the
writes as SYNC, and they will compete with firefox's IO. On the other
hand, the flusher's IO will show up as ASYNC, CFQ will penalize it
heavily, and firefox's IO will be prioritized. And this effect should
just get worse as more processes do sync_file_range().

So write-behind should provide better interactivity if the writes
submitted are ASYNC and not SYNC.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-06 12:14 ` Vivek Goyal
@ 2012-06-06 14:00   ` Fengguang Wu
  2012-06-06 17:04     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Fengguang Wu @ 2012-06-06 14:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote:
> On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote:
> > [..]
>
> Ok, I did the following test on my single SATA disk, and my root
> filesystem is on this disk.
>
> I dropped caches, launched firefox, and monitored the time it takes
> for firefox to start (cache cold).
>
> And my results are the reverse of what you have been seeing. With
> sync_file_range() running, firefox takes roughly 30 seconds to start,
> and with the flusher in operation it takes roughly 20 seconds to start
> (I have approximated the average of 3 runs for simplicity).
>
> I think this is happening because sync_file_range() will send all the
> writes as SYNC, and they will compete with firefox's IO. On the other
> hand, the flusher's IO will show up as ASYNC, CFQ will penalize it
> heavily, and firefox's IO will be prioritized. And this effect should
> just get worse as more processes do sync_file_range().
>
> So write-behind should provide better interactivity if the writes
> submitted are ASYNC and not SYNC.

Hi Vivek, thanks for testing all of these out! The result is
definitely interesting and a surprise: we overlooked the SYNC nature
of sync_file_range().

I'd suggest using these calls to achieve the write-and-drop-behind
behavior, *with* WB_SYNC_NONE:

    posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_AFTER);

The caveat is, the bdi_write_congested() test below will never
evaluate to true, since we are only filling the request queue with 8MB
of data:

SYSCALL_DEFINE(fadvise64_64):

	case POSIX_FADV_DONTNEED:
		if (!bdi_write_congested(mapping->backing_dev_info))
			__filemap_fdatawrite_range(mapping, offset, endbyte,
						   WB_SYNC_NONE);

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 18+ messages in thread
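
To make the suggestion concrete, here is one way the two calls above
might slot into Linus's write loop. A sketch only -- the second
POSIX_FADV_DONTNEED on the completed chunk (to actually drop its
now-clean pages) is an extrapolation beyond what is written above, and
the file name is made up:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define CHUNK (8*1024*1024UL)

    static char buf[CHUNK];

    int main(void)
    {
        int fd = open("drop-behind-demo.tst", O_WRONLY|O_CREAT|O_TRUNC, 0644);
        unsigned int index;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(buf, 'a', CHUNK);

        for (index = 0; index < 512; index++) {  /* 4GB total */
            loff_t off = (loff_t)index * CHUNK;

            if (write(fd, buf, CHUNK) != (ssize_t)CHUNK)
                break;
            /* kick off WB_SYNC_NONE writeback on the chunk just written */
            posix_fadvise(fd, off, CHUNK, POSIX_FADV_DONTNEED);
            if (index) {
                loff_t prev = off - CHUNK;

                /* wait for the previous chunk's writeback to complete... */
                sync_file_range(fd, prev, CHUNK, SYNC_FILE_RANGE_WAIT_AFTER);
                /* ...then drop its now-clean pages (extrapolated step:
                   DONTNEED only drops pages that are already clean) */
                posix_fadvise(fd, prev, CHUNK, POSIX_FADV_DONTNEED);
            }
        }
        close(fd);
        return 0;
    }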
* Re: write-behind on streaming writes
  2012-06-06 14:00 ` Fengguang Wu
@ 2012-06-06 17:04   ` Vivek Goyal
  2012-06-07  9:45     ` Jan Kara
  0 siblings, 1 reply; 18+ messages in thread
From: Vivek Goyal @ 2012-06-06 17:04 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Wed, Jun 06, 2012 at 10:00:58PM +0800, Fengguang Wu wrote:
> [..]
> Hi Vivek, thanks for testing all of these out! The result is
> definitely interesting and a surprise: we overlooked the SYNC nature
> of sync_file_range().
>
> I'd suggest using these calls to achieve the write-and-drop-behind
> behavior, *with* WB_SYNC_NONE:
>
>     posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
>     sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_AFTER);
>
> The caveat is, the bdi_write_congested() test below will never
> evaluate to true, since we are only filling the request queue with 8MB
> of data:
>
> SYSCALL_DEFINE(fadvise64_64):
>
> 	case POSIX_FADV_DONTNEED:
> 		if (!bdi_write_congested(mapping->backing_dev_info))
> 			__filemap_fdatawrite_range(mapping, offset, endbyte,
> 						   WB_SYNC_NONE);

Hi Fengguang,

Instead of the above, I modified sync_file_range() to call
__filemap_fdatawrite_range(WB_SYNC_NONE), and I do now see ASYNC
writes showing up at the elevator.

With 4 processes doing sync_file_range(), the firefox start time test
now clocks around 18-19 seconds, which is better than the 30-35
seconds with 4 processes doing buffered writes. And the system looks
pretty good from an interactivity point of view.

Thanks
Vivek

The following is the patch I applied for testing:

---
 fs/sync.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c	2012-06-06 00:12:33.000000000 -0400
+++ linux-2.6/fs/sync.c	2012-06-06 23:11:17.050691776 -0400
@@ -342,7 +342,7 @@ SYSCALL_DEFINE(sync_file_range)(int fd,
 	}
 
 	if (flags & SYNC_FILE_RANGE_WRITE) {
-		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
+		ret = __filemap_fdatawrite_range(mapping, offset, endbyte, WB_SYNC_NONE);
 		if (ret < 0)
 			goto out_put;
 	}

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes 2012-06-06 17:04 ` Vivek Goyal @ 2012-06-07 9:45 ` Jan Kara 2012-06-07 19:06 ` Vivek Goyal 0 siblings, 1 reply; 18+ messages in thread From: Jan Kara @ 2012-06-07 9:45 UTC (permalink / raw) To: Vivek Goyal Cc: Fengguang Wu, Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel, Linux Memory Management List, Jens Axboe On Wed 06-06-12 13:04:28, Vivek Goyal wrote: > On Wed, Jun 06, 2012 at 10:00:58PM +0800, Fengguang Wu wrote: > > On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote: > > > On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote: > > > > On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > > > > > I had expected a bigger difference as sync_file_range() is just driving > > > > > max queue depth of 32 (total 16MB IO in flight), while flushers are > > > > > driving queue depths up to 140 or so. So in this paritcular test, driving > > > > > much deeper queue depths is not really helping much. (I have seen higher > > > > > throughputs with higher queue depths in the past. Now sure why don't we > > > > > see it here). > > > > > > > > How did interactivity feel? > > > > > > > > Because quite frankly, if the throughput difference is 12.5 vs 12 > > > > seconds, I suspect the interactivity thing is what dominates. > > > > > > > > And from my memory of the interactivity different was absolutely > > > > *huge*. Even back when I used rotational media, I basically couldn't > > > > even notice the background write with the sync_file_range() approach. > > > > While the regular writeback without the writebehind had absolutely > > > > *huge* pauses if you used something like firefox that uses fsync() > > > > etc. And starting new applications that weren't cached was noticeably > > > > worse too - and then with sync_file_range it wasn't even all that > > > > noticeable. > > > > > > > > NOTE! For the real "firefox + fsync" test, I suspect you'd need to do > > > > the writeback on the same filesystem (and obviously disk) as your home > > > > directory is. If the big write is to another filesystem and another > > > > disk, I think you won't see the same issues. > > > > > > Ok, I did following test on my single SATA disk and my root filesystem > > > is on this disk. > > > > > > I dropped caches and launched firefox and monitored the time it takes > > > for firefox to start. (cache cold). > > > > > > And my results are reverse of what you have been seeing. With > > > sync_file_range() running, firefox takes roughly 30 seconds to start and > > > with flusher in operation, it takes roughly 20 seconds to start. (I have > > > approximated the average of 3 runs for simplicity). > > > > > > I think it is happening because sync_file_range() will send all > > > the writes as SYNC and it will compete with firefox IO. On the other > > > hand, flusher's IO will show up as ASYNC and CFQ will be penalize it > > > heavily and firefox's IO will be prioritized. And this effect should > > > just get worse as more processes do sync_file_range(). > > > > > > So write-behind should provide better interactivity if writes submitted > > > are ASYNC and not SYNC. > > > > Hi Vivek, thanks for testing all of these out! The result is > > definitely interesting and a surprise: we overlooked the SYNC nature > > of sync_file_range(). 
> >
> > I'd suggest to use these calls to achieve the write-and-drop-behind
> > behavior, *with* WB_SYNC_NONE:
> >
> > 	posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
> > 	sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_AFTER);
> >
> > The caveat is, the below bdi_write_congested() will never evaluate to
> > true since we are only filling the request queue with 8MB of data.
> >
> > SYSCALL_DEFINE(fadvise64_64):
> >
> >         case POSIX_FADV_DONTNEED:
> >                 if (!bdi_write_congested(mapping->backing_dev_info))
> >                         __filemap_fdatawrite_range(mapping, offset, endbyte,
> >                                                    WB_SYNC_NONE);
>
> Hi Fengguang,
>
> Instead of above, I modified sync_file_range() to call
> __filemap_fdatawrite_range(WB_SYNC_NONE) and I now see ASYNC writes
> showing up at the elevator.
>
> With 4 processes doing sync_file_range() now, the firefox start time test
> clocks around 18-19 seconds, which is better than the 30-35 seconds of 4
> processes doing buffered writes. And the system looks pretty good from
> an interactivity point of view.
  So do you have any idea why that is? Do we drive shallower queues? Also,
how does the speed of the writers compare to the speed with normal buffered
writes + fsync (you'd need fsync for the sync_file_range writers as well to
make the comparison fair)?

								Honza

> ---
>  fs/sync.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/fs/sync.c
> ===================================================================
> --- linux-2.6.orig/fs/sync.c	2012-06-06 00:12:33.000000000 -0400
> +++ linux-2.6/fs/sync.c	2012-06-06 23:11:17.050691776 -0400
> @@ -342,7 +342,7 @@ SYSCALL_DEFINE(sync_file_range)(int fd,
> 	}
> 
> 	if (flags & SYNC_FILE_RANGE_WRITE) {
> -		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
> +		ret = __filemap_fdatawrite_range(mapping, offset, endbyte, WB_SYNC_NONE);
> 		if (ret < 0)
> 			goto out_put;
> 	}

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
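Fengguang's suggested write-and-drop-behind sequence, as a minimal
userspace sketch (illustrative only and untested; the per-chunk write
loop and error handling are left out):

	#define _GNU_SOURCE
	#include <fcntl.h>

	/*
	 * For each chunk that has already been written:
	 * POSIX_FADV_DONTNEED kicks off non-blocking WB_SYNC_NONE
	 * writeback (when the request queue is not congested) and drops
	 * pages that are already clean; SYNC_FILE_RANGE_WAIT_AFTER then
	 * blocks until the chunk is on disk, so dirty pages cannot pile
	 * up behind a slow disk.
	 */
	static void drop_behind_chunk(int fd, off_t offset, off_t len)
	{
		posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
		sync_file_range(fd, offset, len,
				SYNC_FILE_RANGE_WAIT_AFTER);
	}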
* Re: write-behind on streaming writes
  2012-06-07  9:45               ` Jan Kara
@ 2012-06-07 19:06                 ` Vivek Goyal
  0 siblings, 0 replies; 18+ messages in thread
From: Vivek Goyal @ 2012-06-07 19:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Linus Torvalds, LKML, Myklebust, Trond,
      linux-fsdevel, Linux Memory Management List, Jens Axboe

On Thu, Jun 07, 2012 at 11:45:04AM +0200, Jan Kara wrote:

[..]
> > Instead of above, I modified sync_file_range() to call
> > __filemap_fdatawrite_range(WB_SYNC_NONE) and I now see ASYNC writes
> > showing up at the elevator.
> >
> > With 4 processes doing sync_file_range() now, the firefox start time test
> > clocks around 18-19 seconds, which is better than the 30-35 seconds of 4
> > processes doing buffered writes. And the system looks pretty good from
> > an interactivity point of view.
>   So do you have any idea why that is? Do we drive shallower queues? Also,
> how does the speed of the writers compare to the speed with normal buffered
> writes + fsync (you'd need fsync for the sync_file_range writers as well to
> make the comparison fair)?

Ok, I did more tests, and noticed a few odd things:

- Results are varying a lot. Sometimes firefox also launched fast with
  the write+flush workload, so now it is hard to conclude things.

- For some reason I had nr_requests at 16K on my root drive. I have no
  idea what set it. Once I set it to 128, firefox with the write+flush
  workload performs much better and launch times are similar to
  sync_file_range().

- I tried to open new windows in firefox, browse the web, and load new
  websites. I would say sync_file_range() feels a little better, but I
  don't have any logical explanation and can't conclude anything yet by
  looking at the traces. I am continuing to stare, though.

So in summary, at this point of time I really can't conclude that using
sync_file_range() with ASYNC requests is providing better latencies in
my setup. I will keep at it though, and if I notice something new, will
write back.

Thanks
Vivek
* Re: write-behind on streaming writes
  2012-06-06 12:14             ` Vivek Goyal
  2012-06-06 14:00               ` Fengguang Wu
@ 2012-06-06 16:15               ` Vivek Goyal
  1 sibling, 0 replies; 18+ messages in thread
From: Vivek Goyal @ 2012-06-06 16:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fengguang Wu, LKML, Myklebust, Trond, linux-fsdevel,
      Linux Memory Management List, Jens Axboe

On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote:

[..]
> I think it is happening because sync_file_range() will send all
> the writes as SYNC and they will compete with firefox's IO. On the
> other hand, the flusher's IO will show up as ASYNC, so CFQ will
> penalize it heavily and firefox's IO will be prioritized. And this
> effect should just get worse as more processes do sync_file_range().

Ok, this time I tried the same test again, but with 4 processes doing
writes in parallel to 4 different files. And with sync_file_range(),
things turned ugly. Interactivity was very poor.

The firefox launch test took around 1m45s with sync_file_range(), while
it took only about 35 seconds with the regular flusher threads. So
sending writeback IO synchronously wreaks havoc.

Thanks
Vivek
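The SYNC/ASYNC tagging Vivek keeps coming back to is decided by the
writeback mode at submission time. Roughly the pattern the ->writepage
paths of this era use to pick the request type (a sketch from memory,
not any one filesystem's exact code):

	/*
	 * WB_SYNC_ALL means someone is waiting (fsync, the unpatched
	 * sync_file_range), so the request goes out as WRITE_SYNC and
	 * CFQ dispatches it promptly, competing with reads.
	 * WB_SYNC_NONE is background flusher-style IO, which goes out
	 * as a plain (async) WRITE and gets parked behind interactive
	 * readers.
	 */
	int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;

With 4 writers all submitting WRITE_SYNC, there is four times as much
"sync" IO competing head-to-head with firefox's reads, which is
consistent with the much worse launch times above.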
* Re: write-behind on streaming writes
  2012-06-06  2:57           ` Vivek Goyal
  2012-06-06  3:14             ` Linus Torvalds
@ 2012-06-06 14:08             ` Fengguang Wu
  1 sibling, 0 replies; 18+ messages in thread
From: Fengguang Wu @ 2012-06-06 14:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
      Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 10:57:30PM -0400, Vivek Goyal wrote:
> On Tue, Jun 05, 2012 at 04:10:45PM -0400, Vivek Goyal wrote:
> > On Tue, Jun 05, 2012 at 02:48:53PM -0400, Vivek Goyal wrote:
> >
> > [..]
> > > So the sync_file_range() test keeps fewer in-flight requests on
> > > average, hence the better latencies. It might not produce a
> > > throughput drop on SATA disks but might have some effect on storage
> > > array LUNs. Will give it a try.
> >
> > Well, I ran the dd and sync_file_range tests on a storage array LUN.
> > Wrote a file of size 4G on ext4 and got about 300MB/s write speed. In
> > fact, when I measured time using "time", the sync_file_range test
> > finished a little faster.
> >
> > Then I started looking at the blktrace output. The sync_file_range()
> > test initially (for about 8 seconds) drives a shallow queue depth
> > (about 16), but after 8 seconds the flusher somehow gets involved and
> > starts submitting lots of requests, and we start driving a much higher
> > queue depth (up to more than 100). Not sure why the flusher should get
> > involved. Is everything working as expected? I thought that as we wait
> > for the last 8MB of IO to finish before we start a new one, we should
> > have at most 16MB of IO in flight. Fengguang?
>
> Ok, found it. I am using "int index", which in turn caused sign
> extension of (i*BUFSIZE). Once "i" crosses 255, integer overflow
> happens, the 64-bit offset is sign-extended, and the offsets are
> screwed. So after 2G of file size, sync_file_range() effectively stops
> working, leaving dirty pages which are cleaned up by the flusher. That
> explains why the flusher was kicking in during my tests. Change "int"
> to "unsigned int" and the problem is fixed.

Good catch! Besides that, I do see a small chance for the flusher
thread to kick in: at the time when the inode dirty expires after 30s.
Just a kind reminder, because I don't see how it can impact this
workload in any noticeable way.

Thanks,
Fengguang
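The overflow in isolation -- a sketch, assuming the test program's
BUFSIZE was defined without the 'ul' suffix so the multiply happened in
32-bit int (signed overflow is formally undefined in C; on x86 it wraps
in practice):

	#include <stdio.h>
	#include <sys/types.h>

	#define BUFSIZE (8 * 1024 * 1024)	/* plain int, no 'ul' */

	int main(void)
	{
		int i = 256;			/* 256 * 8MB == 2GB */
		off_t bad  = i * BUFSIZE;	/* int*int wraps to INT_MIN,
						   then sign-extends to
						   64 bits */
		off_t good = (off_t)i * BUFSIZE;/* widen first; making i
						   unsigned, as Vivek did,
						   also works up to 4GB */

		printf("bad=%lld good=%lld\n",
		       (long long)bad, (long long)good);
		return 0;
	}

This is why sync_file_range() silently stopped doing useful work once
the file offset crossed 2GB, leaving the tail of the file to the
flusher.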
end of thread, other threads:[~2012-06-07 19:06 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
  [not found] <20120528114124.GA6813@localhost>
  [not found] ` <CA+55aFxHt8q8+jQDuoaK=hObX+73iSBTa4bBWodCX3s-y4Q1GQ@mail.gmail.com>
  2012-05-29 15:57 ` write-behind on streaming writes -- Fengguang Wu
  2012-05-29 17:35 ` Linus Torvalds
  2012-05-30  3:21 ` Fengguang Wu
  2012-06-05  1:01 ` Dave Chinner
  2012-06-05 17:18 ` Vivek Goyal
  2012-06-05 17:23 ` Vivek Goyal
  2012-06-05 17:41 ` Vivek Goyal
  2012-06-05 18:48 ` Vivek Goyal
  2012-06-05 20:10 ` Vivek Goyal
  2012-06-06  2:57 ` Vivek Goyal
  2012-06-06  3:14 ` Linus Torvalds
  2012-06-06 12:14 ` Vivek Goyal
  2012-06-06 14:00 ` Fengguang Wu
  2012-06-06 17:04 ` Vivek Goyal
  2012-06-07  9:45 ` Jan Kara
  2012-06-07 19:06 ` Vivek Goyal
  2012-06-06 16:15 ` Vivek Goyal
  2012-06-06 14:08 ` Fengguang Wu