write-behind on streaming writes

From: Fengguang Wu <fengguang.wu@intel.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	"Myklebust, Trond" <Trond.Myklebust@netapp.com>,
	linux-fsdevel@vger.kernel.org,
	Linux Memory Management List <linux-mm@kvack.org>
Subject: write-behind on streaming writes
Date: Tue, 29 May 2012 23:57:59 +0800
Message-ID: <20120529155759.GA11326@localhost>
In-Reply-To: <CA+55aFxHt8q8+jQDuoaK=hObX+73iSBTa4bBWodCX3s-y4Q1GQ@mail.gmail.com>

Hi Linus,

On Mon, May 28, 2012 at 10:09:56AM -0700, Linus Torvalds wrote:
> Ok, pulled.
> 
> However, I have an independent question for you - have you looked at
> any kind of per-file write-behind kind of logic?

Yes, definitely.  Especially for NFS, it helps to keep the number of
dirty pages per file low, because in NFS a simple stat() requires
flushing all of the file's dirty pages before proceeding.

However, in general there are no strong user requests for this feature.
I guess that's mainly because users still have the choice of using
O_SYNC or O_DIRECT.

Actually O_SYNC is pretty close to the code below for the purpose of
limiting the dirty and writeback pages, except that it's not on by
default and hence does nothing for normal users.

> The reason I ask is that pretty much every time I write some big file
> (usually when over-writing a harddisk), I tend to use my own hackish
> model, which looks like this:
> 
> #define BUFSIZE (8*1024*1024ul)
> 
>         ...
>         for (..) {
>                 ...
>                 if (write(fd, buffer, BUFSIZE) != BUFSIZE)
>                         break;
>                 sync_file_range(fd, index*BUFSIZE, BUFSIZE,
>                                 SYNC_FILE_RANGE_WRITE);
>                 if (index)
>                         sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
>                                         SYNC_FILE_RANGE_WAIT_BEFORE|
>                                         SYNC_FILE_RANGE_WRITE|
>                                         SYNC_FILE_RANGE_WAIT_AFTER);
>                 ....
> 
> and it tends to be *beautiful* for both disk IO performance and for
> system responsiveness while the big write is in progress.

To me this is all about optimizing the 1-dd case for desktop users.
The most beautiful thing about per-file write-behind is that it keeps
the number of both dirty and writeback pages in the system low when
there are only one or two sequential dirtier tasks, which is good for
responsiveness.
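
For reference, the quoted loop can be fleshed out roughly as below. This
is only a sketch under a few assumptions: the helper name stream_write()
and its chunk/nchunks parameters are made up for illustration, the file
offset is assumed to be 0 on entry, and the sync_file_range() return
values are ignored just as in the quoted snippet.

```c
#define _GNU_SOURCE             /* for sync_file_range() and off64_t */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write nchunks chunks of `chunk` bytes from buf,
 * keeping at most about two chunks dirty or under writeback at a time.
 * Assumes the file offset of fd is 0 on entry.
 * Returns the number of chunks fully written.
 */
unsigned long stream_write(int fd, const void *buf, size_t chunk,
                           unsigned long nchunks)
{
        unsigned long index;

        assert(chunk > 0);
        for (index = 0; index < nchunks; index++) {
                if (write(fd, buf, chunk) != (ssize_t)chunk)
                        break;

                /* Start asynchronous writeout of the chunk just written. */
                (void)sync_file_range(fd, (off64_t)index * chunk, chunk,
                                      SYNC_FILE_RANGE_WRITE);

                /* Wait until the previous chunk has reached the disk. */
                if (index)
                        (void)sync_file_range(fd,
                                              (off64_t)(index - 1) * chunk,
                                              chunk,
                                              SYNC_FILE_RANGE_WAIT_BEFORE |
                                              SYNC_FILE_RANGE_WRITE |
                                              SYNC_FILE_RANGE_WAIT_AFTER);
        }
        return index;
}
```

A caller would pass chunk = 8MB to match the quoted BUFSIZE. A real tool
should probably check the sync_file_range() results, since the waiting
call is where write errors would show up.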

Note that the above user space code won't work well when there are 10+
dirtier tasks. It effectively creates 10+ IO submitters working on
different regions of the disk and thus creates lots of seeks. With 10+
dirtier tasks, it's desirable not only to have one single flusher
thread submit all IO, but also for the flusher to work on each inode
with a large write chunk size.

I happen to have some numbers comparing the current adaptive
(write_bandwidth/2=50MB) and the old fixed 4MB write chunk sizes on
XFS (not choosing ext4 because it internally enforces >=128MB chunk
size).  It's basically a 4% performance drop in the 1-dd case and
about 20% in the 100-dd case.

  3.4.0-rc2             3.4.0-rc2-4M+
-----------  ------------------------  
     114.02        -4.2%       109.23  snb/thresh=8G/xfs-1dd-1-3.4.0-rc2
     102.25       -11.7%        90.24  snb/thresh=8G/xfs-10dd-1-3.4.0-rc2
     104.17       -17.5%        85.91  snb/thresh=8G/xfs-20dd-1-3.4.0-rc2
     104.94       -18.7%        85.28  snb/thresh=8G/xfs-30dd-1-3.4.0-rc2
     104.76       -21.9%        81.82  snb/thresh=8G/xfs-100dd-1-3.4.0-rc2

So we probably still want to keep the 0.5s worth of chunk size.
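
The arithmetic behind these numbers can be written down as a small
sketch. The function names are made up for illustration, and the
100MB/s bandwidth used in the example is only what
write_bandwidth/2=50MB implies, not a separately measured value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Chunk size worth interval_ms of writeout at the given bandwidth.
 * With 100MB/s and 500ms this gives the 50MB adaptive chunk size.
 */
uint64_t chunk_bytes(uint64_t write_bw_bytes_per_sec, uint64_t interval_ms)
{
        return write_bw_bytes_per_sec * interval_ms / 1000;
}

/*
 * Relative change in percent between a baseline and a comparison
 * value, as shown in the middle column of the table above.
 */
double percent_change(double base, double other)
{
        assert(base != 0.0);
        return (other - base) / base * 100.0;
}
```

For example, percent_change(114.02, 109.23) reproduces the -4.2% of the
1-dd row, and chunk_bytes(100 << 20, 500) comes out at 50 << 20 bytes.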

> And I'm wondering if we couldn't expose this kind of write-behind
> logic from the kernel. Sure, it only works for the "contiguous write
> of a single large file" model, but that model isn't actually all
> *that* unusual.
> 
> Right now all the write-back logic is based on the
> balance_dirty_pages() model, which is more of a global dirty model.
> Which obviously is needed too - this isn't an "either or" kind of
> thing, it's more of a "maybe we could have a streaming detector *and*
> the 'random writes' code". So I was wondering if anybody had ever been
> looking more at an explicit write-behind model that uses the same kind
> of "per-file window" that the read-ahead code does.

I can imagine it being implemented in kernel this way:

streaming write detector in balance_dirty_pages():

        if (not globally throttled &&
            is streaming writer &&
            it's crossing the N+1 boundary) {
                queue writeback work for chunk N to the flusher
                wait for work completion
        }
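
The detection step alone could be modeled in user space roughly like
this. It is only a sketch, not kernel code: struct stream_state,
note_write() and their fields are hypothetical names, and "streaming"
is assumed to mean "each write starts where the previous one ended and
the sequential run covers the whole of chunk N".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file state, zero-initialized for a new file. */
struct stream_state {
        uint64_t run_start;     /* offset where the sequential run began */
        uint64_t next_offset;   /* offset a sequential write would start at */
};

/*
 * Record a write of len bytes at offset off.  Returns true and stores N
 * in *chunk_to_flush when a streaming writer crosses the N+1 chunk
 * boundary, i.e. when chunk N has just been completely dirtied.
 * A write spanning several chunks reports only the last completed one.
 */
bool note_write(struct stream_state *st, uint64_t off, uint64_t len,
                uint64_t chunk_size, bool globally_throttled,
                uint64_t *chunk_to_flush)
{
        uint64_t end = off + len;
        bool sequential = (off == st->next_offset);
        uint64_t n;

        assert(chunk_size > 0);
        if (!sequential)
                st->run_start = off;    /* a seek starts a new run */
        st->next_offset = end;

        if (globally_throttled || !sequential)
                return false;
        if (end / chunk_size == off / chunk_size)
                return false;           /* no chunk boundary crossed */

        n = end / chunk_size - 1;
        if (st->run_start > n * chunk_size)
                return false;           /* chunk N only partly written */

        *chunk_to_flush = n;
        return true;
}
```

When note_write() returns true, the caller would queue writeback work
for the reported chunk to the flusher.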

The good thing is that this does not look like a complex addition. The
potential problem, however, is that the "wait for work completion" part
has no guaranteed completion time, especially when there are multiple
dd tasks. This could result in uncontrollable delays in the write()
syscall. So we may do this instead:

-               wait for work completion
+               sleep for (chunk_size/write_bandwidth)

To avoid long write() delays, we might further split the one big 0.5s
sleep into smaller sleeps.
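
As a user-space illustration of that pacing, the pause can be computed
from chunk_size/write_bandwidth and then taken in slices. This is a
sketch only: pause_ns() and paced_sleep() are hypothetical helpers, and
nanosleep() merely stands in for whatever sleep primitive the kernel
side would use.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for nanosleep() */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ull

/*
 * Time needed to write out one chunk at the given bandwidth.
 * The multiplication can overflow for chunks above roughly 17GB.
 */
uint64_t pause_ns(uint64_t chunk_size, uint64_t write_bw_bytes_per_sec)
{
        assert(write_bw_bytes_per_sec > 0);
        return chunk_size * NSEC_PER_SEC / write_bw_bytes_per_sec;
}

/*
 * Sleep for total_ns in slices of at most slice_ns each, so that no
 * single sleep is long.  Returns the number of slices taken.
 * Early wakeups by signals are ignored in this sketch.
 */
unsigned paced_sleep(uint64_t total_ns, uint64_t slice_ns)
{
        unsigned slices = 0;

        assert(slice_ns > 0);
        while (total_ns > 0) {
                uint64_t step = total_ns < slice_ns ? total_ns : slice_ns;
                struct timespec ts = {
                        .tv_sec = (time_t)(step / NSEC_PER_SEC),
                        .tv_nsec = (long)(step % NSEC_PER_SEC),
                };

                nanosleep(&ts, NULL);
                total_ns -= step;
                slices++;
        }
        return slices;
}
```

With a 50MB chunk and 100MB/s bandwidth, pause_ns() gives the 0.5s
pause, which paced_sleep() could take as, say, five 100ms slices.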

> (The above code only works well for known streaming writes, but the
> *model* of saying "ok, let's start writeout for the previous streaming
> block, and then wait for the writeout of the streaming block before
> that" really does tend to result in very smooth IO and minimal
> disruption of other processes..)

Thanks,
Fengguang
