* write-behind on streaming writes
       [not found] ` <CA+55aFxHt8q8+jQDuoaK=hObX+73iSBTa4bBWodCX3s-y4Q1GQ@mail.gmail.com>
@ 2012-05-29 15:57   ` Fengguang Wu
  2012-05-29 17:35     ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Fengguang Wu @ 2012-05-29 15:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Myklebust, Trond, linux-fsdevel, Linux Memory Management List

Hi Linus,

On Mon, May 28, 2012 at 10:09:56AM -0700, Linus Torvalds wrote:
> Ok, pulled.
>
> However, I have an independent question for you - have you looked at
> any kind of per-file write-behind kind of logic?

Yes, definitely. Especially for NFS, it pays to keep each file's dirty
page count low, because in NFS a simple stat() will require flushing
all the file's dirty pages before proceeding.

However, in general there have been no strong user requests for this
feature. I guess that's mainly because users still have the choice of
O_SYNC or O_DIRECT. Actually, O_SYNC is pretty close to the code below
for the purpose of limiting dirty and writeback pages, except that
it's not on by default and hence means nothing for normal users.

> The reason I ask is that pretty much every time I write some big file
> (usually when over-writing a harddisk), I tend to use my own hackish
> model, which looks like this:
>
>     #define BUFSIZE (8*1024*1024ul)
>     ...
>     for (..) {
>         ...
>         if (write(fd, buffer, BUFSIZE) != BUFSIZE)
>             break;
>         sync_file_range(fd, index*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WRITE);
>         if (index)
>             sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
>                 SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
>     ....
>
> and it tends to be *beautiful* for both disk IO performance and for
> system responsiveness while the big write is in progress.

It seems to be all about optimizing the 1-dd case for desktop users,
and the most beautiful thing about per-file write-behind is that it
keeps both the number of dirty and writeback pages low in the system
when there are only one or two sequential dirtier tasks, which is
good for responsiveness.

Note that the above user space code won't work well when there are 10+
dirtier tasks: it effectively creates 10+ IO submitters working on
different regions of the disk and thus creates lots of seeks. When
there are 10+ dirtier tasks, it's desirable not only to have one
single flusher thread submit all the IO, but also for the flusher to
work on the inodes with a large write chunk size.

I happen to have some numbers comparing the current adaptive
(write_bandwidth/2 = 50MB) and the old fixed 4MB write chunk sizes on
XFS (not choosing ext4 because it internally enforces a >=128MB chunk
size). It's basically a 4% performance drop in the 1-dd case and up to
a 20% drop in the 100-dd case:

      3.4.0-rc2          3.4.0-rc2-4M+
    -----------  ------------------------
         114.02        -4.2%       109.23  snb/thresh=8G/xfs-1dd-1-3.4.0-rc2
         102.25       -11.7%        90.24  snb/thresh=8G/xfs-10dd-1-3.4.0-rc2
         104.17       -17.5%        85.91  snb/thresh=8G/xfs-20dd-1-3.4.0-rc2
         104.94       -18.7%        85.28  snb/thresh=8G/xfs-30dd-1-3.4.0-rc2
         104.76       -21.9%        81.82  snb/thresh=8G/xfs-100dd-1-3.4.0-rc2

So we probably still want to keep the 0.5s worth of chunk size.

> And I'm wondering if we couldn't expose this kind of write-behind
> logic from the kernel. Sure, it only works for the "contiguous write
> of a single large file" model, but that model isn't actually all
> *that* unusual.
>
> Right now all the write-back logic is based on the
> balance_dirty_pages() model, which is more of a global dirty model.
> Which obviously is needed too - this isn't an "either or" kind of
> thing, it's more of a "maybe we could have a streaming detector *and*
> the 'random writes' code". So I was wondering if anybody had ever been
> looking more at an explicit write-behind model that uses the same kind
> of "per-file window" that the read-ahead code does.

I can imagine it being implemented in the kernel this way, with a
streaming write detector in balance_dirty_pages():

    if (not globally throttled &&
        is streaming writer &&
        it's crossing the N+1 boundary) {
            queue writeback work for chunk N to the flusher
            wait for work completion
    }

The good thing is that this doesn't look like a complex addition.
However, the potential problem is that the "wait for work completion"
part won't have a guaranteed completion time, especially when there
are multiple dd tasks. This could result in uncontrollable delays in
the write() syscall. So we may do this instead:

    - wait for work completion
    + sleep for (chunk_size/write_bandwidth)

To avoid long write() delays, we might further split the one big 0.5s
sleep into smaller sleeps. (A userspace approximation of this pacing
idea is sketched after this message.)

> (The above code only works well for known streaming writes, but the
> *model* of saying "ok, let's start writeout for the previous streaming
> block, and then wait for the writeout of the streaming block before
> that" really does tend to result in very smooth IO and minimal
> disruption of other processes..)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 18+ messages in thread
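
A minimal userspace approximation of the "sleep for
(chunk_size/write_bandwidth)" variant proposed above -- a sketch only,
with a made-up file name and a hardcoded 100MB/s bandwidth assumption
where a real implementation would estimate the bandwidth:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK (8*1024*1024UL)

    static char buf[CHUNK];

    int main(void)
    {
        int fd = open("write-behind-demo.tst", O_WRONLY|O_CREAT|O_TRUNC, 0644);
        double bw = 100e6;                  /* assumed bandwidth: 100MB/s */
        struct timespec pace = { 0, (long)(CHUNK / bw * 1e9) }; /* ~84ms/chunk */
        unsigned long i;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(buf, 'a', CHUNK);
        for (i = 0; i < 512; i++) {         /* 4GB total */
            if (write(fd, buf, CHUNK) != (ssize_t)CHUNK)
                break;
            /* queue chunk i for writeback, without waiting for it... */
            sync_file_range(fd, (off64_t)i * CHUNK, CHUNK,
                            SYNC_FILE_RANGE_WRITE);
            /* ...and pace ourselves at the assumed bandwidth instead */
            nanosleep(&pace, NULL);
        }
        close(fd);
        return 0;
    }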
* Re: write-behind on streaming writes
  2012-05-29 15:57 ` write-behind on streaming writes Fengguang Wu
@ 2012-05-29 17:35   ` Linus Torvalds
  2012-05-30  3:21     ` Fengguang Wu
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2012-05-29 17:35 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: LKML, Myklebust, Trond, linux-fsdevel, Linux Memory Management List

On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
>
> Actually, O_SYNC is pretty close to the code below for the purpose of
> limiting dirty and writeback pages, except that it's not on by default
> and hence means nothing for normal users.

Absolutely not.

O_SYNC syncs the *current* write, syncs your metadata, and just
generally makes your writer synchronous. It's just a f*cking moronic
idea. Nobody sane ever uses it, since you are much better off just
using fsync() if you want that kind of behavior. That's one of those
"stupid legacy flags" things that have no sane use. The whole point is
that doing that is never the right thing to do.

You want to sync *past* writes, and you never ever want to wait on
them unless you just sent more (newer) writes to the disk that you are
*not* waiting on - so that you always have more IO pending.

O_SYNC is the absolute antithesis of that kind of "multiple levels of
overlapping IO", because it requires that the IO is _done_ by the time
you start more, which is against the whole point.

> It seems to be all about optimizing the 1-dd case for desktop users,
> and the most beautiful thing about per-file write-behind is that it
> keeps both the number of dirty and writeback pages low in the system
> when there are only one or two sequential dirtier tasks, which is
> good for responsiveness.

Yes, but I don't think it's about a single-dd case - it's about just
trying to handle one common case (streaming writes) efficiently and
naturally. Try to get those out of the system so that you can then
worry about the *other* cases knowing that they don't have that kind
of big streaming behavior.

For example, right now our main top-level writeback logic is *not*
about streaming writes (just dirty counts), but then we try to "find"
the locality by making the lower-level writeback do the whole "write
back by chunking inodes" without really having any higher-level
information.

I just suspect that we'd be better off teaching upper levels about the
streaming. I know for a fact that if I do it by hand, system
responsiveness was *much* better, and IO throughput didn't go down at
all.

> Note that the above user space code won't work well when there are 10+
> dirtier tasks: it effectively creates 10+ IO submitters working on
> different regions of the disk and thus creates lots of seeks.

Not really much more than our current writeback code does. It
*schedules* data for writing, but doesn't wait for it until much
later. You seem to think it was synchronous. It's not. Look at the
second sync_file_range() thing, and the important part is the
"index-1". The fact that you confused this with O_SYNC seems to be the
same thing. This has absolutely *nothing* to do with O_SYNC.

The other important part is that the chunk size is fairly large. We do
read-ahead in 64k kind of things; to make sense, the write-behind
chunking needs to be in "multiple megabytes". 8MB is probably the
minimum size at which it makes sense. The write-behind would be for
things like people writing disk images and video files. Not for random
IO in smaller chunks.

                 Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-05-29 17:35 ` Linus Torvalds
@ 2012-05-30  3:21   ` Fengguang Wu
  2012-06-05  1:01     ` Dave Chinner
  2012-06-05 17:23     ` Vivek Goyal
  0 siblings, 2 replies; 18+ messages in thread
From: Fengguang Wu @ 2012-05-30 3:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Vivek Goyal

Linus,

On Tue, May 29, 2012 at 10:35:46AM -0700, Linus Torvalds wrote:
> On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
> >
> > Actually, O_SYNC is pretty close to the code below for the purpose of
> > limiting dirty and writeback pages, except that it's not on by default
> > and hence means nothing for normal users.
>
> Absolutely not.
>
> O_SYNC syncs the *current* write, syncs your metadata, and just
> generally makes your writer synchronous. It's just a f*cking moronic
> idea. Nobody sane ever uses it, since you are much better off just
> using fsync() if you want that kind of behavior. That's one of those
> "stupid legacy flags" things that have no sane use. The whole point is
> that doing that is never the right thing to do.
>
> You want to sync *past* writes, and you never ever want to wait on
> them unless you just sent more (newer) writes to the disk that you are
> *not* waiting on - so that you always have more IO pending.
>
> O_SYNC is the absolute antithesis of that kind of "multiple levels of
> overlapping IO", because it requires that the IO is _done_ by the time
> you start more, which is against the whole point.

Yeah, O_SYNC is not really the sane thing to use. Thanks for teaching
me this in such great detail!

> > It seems to be all about optimizing the 1-dd case for desktop users,
> > and the most beautiful thing about per-file write-behind is that it
> > keeps both the number of dirty and writeback pages low in the system
> > when there are only one or two sequential dirtier tasks, which is
> > good for responsiveness.
>
> Yes, but I don't think it's about a single-dd case - it's about just
> trying to handle one common case (streaming writes) efficiently and
> naturally. Try to get those out of the system so that you can then
> worry about the *other* cases knowing that they don't have that kind
> of big streaming behavior.
>
> For example, right now our main top-level writeback logic is *not*
> about streaming writes (just dirty counts), but then we try to "find"
> the locality by making the lower-level writeback do the whole "write
> back by chunking inodes" without really having any higher-level
> information.

Agreed. Streaming writes can be reliably detected in the same way as
readahead, and doing explicit write-behind for them may help make the
writeback more focused and well behaved.

For example, consider file A being sequentially written to by dd, and
another mmapped file B being randomly written to. In the current
global writeback, the two files will likely have a 1:1 share of the
dirty pages. With write-behind, we'll effectively limit file A's dirty
footprint to 2 chunk sizes, possibly leaving much more room for file B
and increasing the chances that it accumulates more adjacent dirty
pages by writeback time.

> I just suspect that we'd be better off teaching upper levels about the
> streaming. I know for a fact that if I do it by hand, system
> responsiveness was *much* better, and IO throughput didn't go down at
> all.

Your observation of better responsiveness may well stem from two
aspects:

1) lower dirty/writeback page counts
2) the async write IO queue being drained constantly

(1) is obvious. For a mem=4G desktop, the default dirty limit can be
up to (4096MB * 20% = 819MB), while your smart writer effectively
limits dirty/writeback pages to a dramatically lower 16MB.

(2) comes from the use of the _WAIT_ flags in

    sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);

Each sync_file_range() syscall will submit 8MB of write IO and wait
for completion. That means the async write IO queue constantly swings
between 0 and 8MB of fill, at a frequency of 100MB/s / 8MB = 12.5
times per second. So every 80ms the async IO queue runs empty, which
gives any pending read IO (from firefox etc.) a chance to be serviced.
Nice and sweet breaks!

I suspect (2) contributes *much more* than (1) to desktop
responsiveness, because in a desktop with heavy sequential writes and
sporadic reads, the 20% dirty/writeback pages can hardly reach the end
of the LRU lists to trigger waits in direct page reclaim.

On the other hand, it's a known problem that our IO scheduler is still
not well behaved enough to provide good read latency when the flusher
rightfully manages to keep the async IO queue 100% full all the time.
The IO scheduler is the right place to solve this issue. There's
nothing wrong with the flusher blindly filling the async IO queue:
it's the flusher's duty to avoid underruns of the async IO queue, and
the IO scheduler's duty to select the right queue to service (or to
idle). The IO scheduler *in theory* has all the information it needs
to decide to _not service_ requests from the flusher when reads have
been observed recently...

> > Note that the above user space code won't work well when there are
> > 10+ dirtier tasks: it effectively creates 10+ IO submitters working
> > on different regions of the disk and thus creates lots of seeks.
>
> Not really much more than our current writeback code does. It
> *schedules* data for writing, but doesn't wait for it until much
> later.
>
> You seem to think it was synchronous. It's not. Look at the second
> sync_file_range() thing, and the important part is the "index-1". The
> fact that you confused this with O_SYNC seems to be the same thing.
> This has absolutely *nothing* to do with O_SYNC.

Hmm, we should be sharing the same view here: it's not waiting for
"index", but it does wait for "index-1" to clear PG_writeback, by
using SYNC_FILE_RANGE_WAIT_AFTER. Or when there are 10+ writers
running, each submitting 8MB of data to the async IO queue, they may
well overrun the max IO queue size and get blocked at the earlier
stage of get_request_wait().

> The other important part is that the chunk size is fairly large. We do
> read-ahead in 64k kind of things; to make sense, the write-behind
> chunking needs to be in "multiple megabytes". 8MB is probably the
> minimum size at which it makes sense.

Yup. And we also need to make sure it's not 10 tasks each scheduling
50MB write IOs *concurrently*. sync_file_range() unfortunately does it
this way, sending IO requests to the async IO queue on its own rather
than delegating the work to the flusher and letting the single flusher
submit IO for the inodes one after the other.

Imagine the async IO queue can hold exactly 50MB of writeback pages.
You can see the obvious difference in the graph below, where the IO
queue is filled with dirty pages from (a) one single inode or (b) 10
different inodes. In the latter case, the IO scheduler will switch
between the inodes much more frequently and create lots more seeks.

A theoretic view of the async IO queue:

    +----------------+      +----------------+
    |                |      |    inode 1     |
    |                |      +----------------+
    |                |      |    inode 2     |
    |                |      +----------------+
    |                |      |    inode 3     |
    |                |      +----------------+
    |                |      |    inode 4     |
    |                |      +----------------+
    |    inode 1     |      |    inode 5     |
    |                |      +----------------+
    |                |      |    inode 6     |
    |                |      +----------------+
    |                |      |    inode 7     |
    |                |      +----------------+
    |                |      |    inode 8     |
    |                |      +----------------+
    |                |      |    inode 9     |
    |                |      +----------------+
    |                |      |    inode 10    |
    +----------------+      +----------------+

    (a) one single flusher      (b) 10 sync_file_range()
        submitting 50MB IO          submitting 50MB IO
        for each inode              for each inode
        *in turn*                   *in parallel*

So if parallel file syncs are a common usage, we'll need to make them
IO-less, too.

> The write-behind would be for things like people writing disk images
> and video files. Not for random IO in smaller chunks.

Yup.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-05-30  3:21 ` Fengguang Wu
@ 2012-06-05  1:01   ` Dave Chinner
  2012-06-05 17:18     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2012-06-05 1:01 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Vivek Goyal

On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> On Tue, May 29, 2012 at 10:35:46AM -0700, Linus Torvalds wrote:
> > I just suspect that we'd be better off teaching upper levels about
> > the streaming. I know for a fact that if I do it by hand, system
> > responsiveness was *much* better, and IO throughput didn't go down
> > at all.
>
> Your observation of better responsiveness may well stem from two
> aspects:
>
> 1) lower dirty/writeback page counts
> 2) the async write IO queue being drained constantly
>
> (1) is obvious. For a mem=4G desktop, the default dirty limit can be
> up to (4096MB * 20% = 819MB), while your smart writer effectively
> limits dirty/writeback pages to a dramatically lower 16MB.
>
> (2) comes from the use of the _WAIT_ flags in
>
>     sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
>
> Each sync_file_range() syscall will submit 8MB of write IO and wait
> for completion. That means the async write IO queue constantly swings
> between 0 and 8MB of fill, at a frequency of 100MB/s / 8MB = 12.5
> times per second. So every 80ms the async IO queue runs empty, which
> gives any pending read IO (from firefox etc.) a chance to be
> serviced. Nice and sweet breaks!
>
> I suspect (2) contributes *much more* than (1) to desktop
> responsiveness.

Almost certainly, especially with NCQ devices where even if the IO
scheduler preempts the write queue immediately, the device might
complete the outstanding 31 writes before servicing the read which is
issued as the 32nd command....

So NCQ depth is going to play a part here as well.

> Because in a desktop with heavy sequential writes and sporadic reads,
> the 20% dirty/writeback pages can hardly reach the end of the LRU
> lists to trigger waits in direct page reclaim.
>
> On the other hand, it's a known problem that our IO scheduler is
> still not well behaved enough to provide good read latency when the
> flusher rightfully manages to keep the async IO queue 100% full all
> the time.

Deep queues are the antithesis of low latency. If you want good IO
interactivity (i.e. low access latency) you cannot keep deep async IO
queues. If you want good throughput, you need deep queues to allow as
wide a scheduling window as possible and to keep the IO device as busy
as possible.

> The IO scheduler is the right place to solve this issue. There's
> nothing wrong with the flusher blindly filling the async IO queue:
> it's the flusher's duty to avoid underruns of the async IO queue, and
> the IO scheduler's duty to select the right queue to service (or to
> idle). The IO scheduler *in theory* has all the information it needs
> to decide to _not service_ requests from the flusher when reads have
> been observed recently...

That's my take on the issue, too.

Even if we decide that streaming writes should be synced immediately,
where should we draw the limit? I often write temporary files that
would qualify as large streaming writes (e.g. 1GB) and then
immediately remove them. I rely on the fact that they don't hit the
disk for performance (i.e. <1s to create, wait 2s, <1s to read, <1s to
unlink). If these are forced to disk rather than sitting in memory for
a short while, the create will now take ~10s per file, and I won't be
able to create 10 of them concurrently and have them all take <1s to
create....

IOWs, what might seem like an interactivity optimisation for one
workload can quite badly affect the performance of a different
workload. Optimising read latency vs write bandwidth is exactly what
we have IO schedulers for....

> Or when there are 10+ writers running, each submitting 8MB of data to
> the async IO queue, they may well overrun the max IO queue size and
> get blocked at the earlier stage of get_request_wait().

Yup, as soon as you have multiple IO submitters, we get back to the
old problem of thrashing the disks. This is *exactly* the throughput
problem we solved by moving to IO-less throttling. That is, having N
IO submitters is far less efficient than having a single, well
controlled IO submitter. That's exactly what we want to avoid...

> > The other important part is that the chunk size is fairly large. We
> > do read-ahead in 64k kind of things; to make sense, the write-behind
> > chunking needs to be in "multiple megabytes". 8MB is probably the
> > minimum size at which it makes sense.
>
> Yup. And we also need to make sure it's not 10 tasks each scheduling
> 50MB write IOs *concurrently*. sync_file_range() unfortunately does
> it this way, sending IO requests to the async IO queue on its own
> rather than delegating the work to the flusher and letting the single
> flusher submit IO for the inodes one after the other.

Yup, that's the thrashing we need to avoid ;)

> So if parallel file syncs are a common usage, we'll need to make them
> IO-less, too.

Or just tell people "don't do that".

> > The write-behind would be for things like people writing disk images
> > and video files. Not for random IO in smaller chunks.

Or you could just use async direct IO to achieve exactly the same
thing without modifying the kernel at all ;)

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread
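
For reference, a minimal sketch of the async direct IO alternative Dave
mentions -- under assumptions: libaio (link with -laio), a made-up file
name, 8MB chunks, and the same two-deep pipeline as the
sync_file_range() loop; nobody in this thread actually ran this:

    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (8*1024*1024UL)

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbp = &cb;  /* the kernel copies the iocb at submit time */
        struct io_event ev;
        void *buf;
        unsigned long i;

        /* O_DIRECT requires aligned buffers; 4096 covers common sector sizes */
        if (posix_memalign(&buf, 4096, CHUNK))
            return 1;
        memset(buf, 'a', CHUNK);

        int fd = open("aio-write-behind.tst",
                      O_WRONLY|O_CREAT|O_TRUNC|O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (io_setup(2, &ctx) < 0) {
            fprintf(stderr, "io_setup failed\n");
            return 1;
        }

        for (i = 0; i < 512; i++) {  /* 4GB total */
            /* submit chunk i without waiting for it to complete... */
            io_prep_pwrite(&cb, fd, buf, CHUNK, (long long)i * CHUNK);
            if (io_submit(ctx, 1, &cbp) != 1) {
                fprintf(stderr, "io_submit failed\n");
                break;
            }
            /* ...but, as with the "index-1" trick, reap one completion
               so at most two chunks are ever in flight */
            if (i)
                io_getevents(ctx, 1, 1, &ev, NULL);
        }
        io_getevents(ctx, 1, 1, &ev, NULL);  /* drain the last chunk */
        io_destroy(ctx);
        close(fd);
        return 0;
    }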
* Re: write-behind on streaming writes
  2012-06-05  1:01 ` Dave Chinner
@ 2012-06-05 17:18   ` Vivek Goyal
  0 siblings, 0 replies; 18+ messages in thread
From: Vivek Goyal @ 2012-06-05 17:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fengguang Wu, Linus Torvalds, LKML, Myklebust, Trond,
	linux-fsdevel, Linux Memory Management List

On Tue, Jun 05, 2012 at 11:01:48AM +1000, Dave Chinner wrote:
> On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> > [..]
> > I suspect (2) contributes *much more* than (1) to desktop
> > responsiveness.
>
> Almost certainly, especially with NCQ devices where even if the IO
> scheduler preempts the write queue immediately, the device might
> complete the outstanding 31 writes before servicing the read which is
> issued as the 32nd command....

CFQ does preempt the async queue once sync IO gets queued.

> So NCQ depth is going to play a part here as well.

Yes, NCQ depth does contribute, primarily to READ latencies, in the
presence of async IO. I think disk drivers and disk firmware should
also participate in prioritizing READs over pending WRITEs to improve
the situation. The IO scheduler can only do so much.

CFQ already tries hard to keep the pending async queue depth low, and
that results in lower throughput many a time (as compared to
deadline). In fact, CFQ tries so hard to prioritize SYNC IO over async
IO that I have often heard of cases of WRITEs being starved and people
facing "task blocked for 120 seconds" warnings.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-05-30  3:21 ` Fengguang Wu
  2012-06-05  1:01   ` Dave Chinner
@ 2012-06-05 17:23   ` Vivek Goyal
  2012-06-05 17:41     ` Vivek Goyal
  1 sibling, 1 reply; 18+ messages in thread
From: Vivek Goyal @ 2012-06-05 17:23 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List

On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:

[..]
> (2) comes from the use of the _WAIT_ flags in
>
>     sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
>
> Each sync_file_range() syscall will submit 8MB of write IO and wait
> for completion. That means the async write IO queue constantly swings
> between 0 and 8MB of fill, at a frequency of 100MB/s / 8MB = 12.5
> times per second. So every 80ms the async IO queue runs empty, which
> gives any pending read IO (from firefox etc.) a chance to be
> serviced. Nice and sweet breaks!

I doubt that the async IO queue runs empty like that. We wait for the
previous range (index-1) to finish, but have already started the IO on
the next 8MB of pages. So effectively that should keep 8MB of async IO
in the queue (unless there are delays on the user space side). So the
reason for the latency improvement might be something else, and not
that the async IO queue is empty for some of the time.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-05 17:23 ` Vivek Goyal
@ 2012-06-05 17:41   ` Vivek Goyal
  2012-06-05 18:48     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Vivek Goyal @ 2012-06-05 17:41 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List

On Tue, Jun 05, 2012 at 01:23:02PM -0400, Vivek Goyal wrote:
> On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> > [..]
>
> I doubt that the async IO queue runs empty like that. We wait for the
> previous range (index-1) to finish, but have already started the IO on
> the next 8MB of pages. So effectively that should keep 8MB of async IO
> in the queue (unless there are delays on the user space side). So the
> reason for the latency improvement might be something else, and not
> that the async IO queue is empty for some of the time.

With the sync_file_range() test we can have 8MB of IO in flight;
without it, I think we can have more at times, and that might be the
reason for the latency improvement.

I see that CFQ has code to allow deeper NCQ depth if there is only a
single writer. So once a reader comes along, it might find tons of
async IO already in flight. sync_file_range() will limit that
in-flight IO, hence the latency improvement. And if we have multiple
dd tasks doing sync_file_range(), this latency improvement should
probably go away.

I will run some tests to verify whether my understanding of the deeper
queue depths in the single-writer case is correct.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-05 17:41 ` Vivek Goyal
@ 2012-06-05 18:48   ` Vivek Goyal
  2012-06-05 20:10     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Vivek Goyal @ 2012-06-05 18:48 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List

On Tue, Jun 05, 2012 at 01:41:57PM -0400, Vivek Goyal wrote:
> [..]
> I see that CFQ has code to allow deeper NCQ depth if there is only a
> single writer. So once a reader comes along, it might find tons of
> async IO already in flight. sync_file_range() will limit that
> in-flight IO, hence the latency improvement. And if we have multiple
> dd tasks doing sync_file_range(), this latency improvement should
> probably go away.
>
> I will run some tests to verify whether my understanding of the deeper
> queue depths in the single-writer case is correct.

So I did run some tests, and I can confirm that on average there are
more in-flight requests *without* sync_file_range(), and that's
probably why the sync_file_range() test shows better latency.

I can see that with "dd if=/dev/zero of=zerofile bs=1M count=1024" we
are driving deeper queue depths (up to 32), and in the later stages
the number of in-flight requests is constantly high.

With sync_file_range(), the number of in-flight requests fluctuates a
lot, between 1 and 32. Many a time it is just 1, or up to 16, and a
few times it went up to 32. So the sync_file_range() test keeps fewer
requests in flight on average, hence the better latencies.

It might not produce a throughput drop on SATA disks, but it might
have some effect on storage array LUNs. Will give it a try.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
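
One crude way to watch in-flight request counts like these is to
sample the block layer's inflight counters from sysfs. A sketch,
assuming the disk is sda (adjust the path for your device; the file
holds two counters, in-flight reads then writes):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/sys/block/sda/inflight";  /* assumed device */

        for (;;) {
            unsigned long reads, writes;
            FILE *f = fopen(path, "r");

            if (!f || fscanf(f, "%lu %lu", &reads, &writes) != 2) {
                perror(path);
                return 1;
            }
            fclose(f);
            printf("in-flight: %5lu reads  %5lu writes\n", reads, writes);
            sleep(1);
        }
    }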
* Re: write-behind on streaming writes
  2012-06-05 18:48 ` Vivek Goyal
@ 2012-06-05 20:10   ` Vivek Goyal
  2012-06-06  2:57     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Vivek Goyal @ 2012-06-05 20:10 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 02:48:53PM -0400, Vivek Goyal wrote:

[..]
> So the sync_file_range() test keeps fewer requests in flight on
> average, hence the better latencies. It might not produce a throughput
> drop on SATA disks, but it might have some effect on storage array
> LUNs. Will give it a try.

Well, I ran the dd and sync_file_range tests on a storage array LUN.
I wrote a file of size 4G on ext4 and got about 300MB/s write speed.
In fact, when I measured time using "time", the sync_file_range test
finished a little faster.

Then I started looking at the blktrace output. The sync_file_range()
test initially (for about 8 seconds) drives a shallow queue depth
(about 16), but after 8 seconds somehow the flusher gets involved and
starts submitting lots of requests, and we start driving a much higher
queue depth (more than 100). Not sure why the flusher should get
involved. Is everything working as expected? I thought that as we wait
for the last 8MB of IO to finish before starting a new one, we should
have at most 16MB of IO in flight. Fengguang?

Anyway, this makes the speed comparison invalid, as the flusher gets
involved after some time and we start driving more in-flight requests.
I guess I should hard code the maximum number of requests in flight to
see the effect of request queue depth on throughput.

I am also attaching the sync_file_range() test Linus mentioned. Did I
write it right?

Thanks
Vivek

#define _GNU_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>

#define BUFSIZE (8*1024*1024)

char buf[BUFSIZE];

int main()
{
	int fd, index = 0;

	fd = open("sync-file-range-tester.tst-file", O_WRONLY|O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	memset(buf, 'a', BUFSIZE);

	while (1) {
		if (write(fd, buf, BUFSIZE) != BUFSIZE)
			break;

		sync_file_range(fd, index*BUFSIZE, BUFSIZE,
				SYNC_FILE_RANGE_WRITE);
		if (index) {
			sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
					SYNC_FILE_RANGE_WAIT_BEFORE|
					SYNC_FILE_RANGE_WRITE|
					SYNC_FILE_RANGE_WAIT_AFTER);
		}
		index++;
		if (index >= 512)
			break;
	}
}

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-05 20:10 ` Vivek Goyal
@ 2012-06-06  2:57   ` Vivek Goyal
  2012-06-06  3:14     ` Linus Torvalds
  2012-06-06 14:08     ` Fengguang Wu
  0 siblings, 2 replies; 18+ messages in thread
From: Vivek Goyal @ 2012-06-06 2:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 04:10:45PM -0400, Vivek Goyal wrote:
> [..]
> Then I started looking at the blktrace output. The sync_file_range()
> test initially (for about 8 seconds) drives a shallow queue depth
> (about 16), but after 8 seconds somehow the flusher gets involved and
> starts submitting lots of requests, and we start driving a much higher
> queue depth (more than 100). Not sure why the flusher should get
> involved. Is everything working as expected? I thought that as we wait
> for the last 8MB of IO to finish before starting a new one, we should
> have at most 16MB of IO in flight. Fengguang?

Ok, found it. I was using "int index", which causes a signed integer
overflow: once "index" crosses 255, index*BUFSIZE overflows, the
64-bit offset gets sign-extended, and the offsets are screwed. So past
2G of file size, sync_file_range() effectively stops working, leaving
dirty pages behind that are then cleaned up by the flusher. That
explains why the flusher was kicking in during my tests. Change "int"
to "unsigned int" and the problem is fixed.

Now I ran the sync_file_range() test and another program which writes
a 4GB file and does an fdatasync() at the end, and compared total
execution times. The first one takes around 12.5 seconds while the
latter takes around 12.0 seconds. So sync_file_range() is just a
little slower on this SAN LUN.

I had expected a bigger difference, as sync_file_range() is driving a
max queue depth of 32 (16MB of IO in flight in total), while the
flushers drive queue depths up to 140 or so. So in this particular
test, driving much deeper queue depths is not really helping much. (I
have seen higher throughputs with higher queue depths in the past. Not
sure why we don't see that here.)

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-06  2:57 ` Vivek Goyal
@ 2012-06-06  3:14   ` Linus Torvalds
  2012-06-06 12:14     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2012-06-06 3:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>
> I had expected a bigger difference, as sync_file_range() is driving a
> max queue depth of 32 (16MB of IO in flight in total), while the
> flushers drive queue depths up to 140 or so. So in this particular
> test, driving much deeper queue depths is not really helping much. (I
> have seen higher throughputs with higher queue depths in the past. Not
> sure why we don't see that here.)

How did interactivity feel?

Because quite frankly, if the throughput difference is 12.5 vs 12
seconds, I suspect the interactivity thing is what dominates.

And from my memory, the interactivity difference was absolutely
*huge*. Even back when I used rotational media, I basically couldn't
even notice the background write with the sync_file_range() approach,
while the regular writeback without the write-behind had absolutely
*huge* pauses if you used something like firefox that uses fsync()
etc. And starting new applications that weren't cached was noticeably
worse too - and with sync_file_range it wasn't even all that
noticeable.

NOTE! For the real "firefox + fsync" test, I suspect you'd need to do
the writeback on the same filesystem (and obviously disk) as your home
directory is. If the big write is to another filesystem and another
disk, I think you won't see the same issues.

Admittedly, I have not really touched anything with a rotational disk
for the last few years, nor do I ever want to see those rotating
pieces of high-tech rust ever again. And maybe your SAN has such good
latency even under load that it doesn't really matter. I remember it
mattering a lot back when..

Of course, back when I did that testing and had rotational media, we
didn't have the per-bdi writeback logic with the smart speed-dependent
depths etc, so it may be that we're just so much better at writeback
these days that it's not nearly as noticeable any more.

                 Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-06  3:14 ` Linus Torvalds
@ 2012-06-06 12:14   ` Vivek Goyal
  2012-06-06 14:00     ` Fengguang Wu
  2012-06-06 16:15     ` Vivek Goyal
  0 siblings, 2 replies; 18+ messages in thread
From: Vivek Goyal @ 2012-06-06 12:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fengguang Wu, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote:
> How did interactivity feel?
>
> Because quite frankly, if the throughput difference is 12.5 vs 12
> seconds, I suspect the interactivity thing is what dominates.
>
> And from my memory, the interactivity difference was absolutely
> *huge*. [..]
>
> NOTE! For the real "firefox + fsync" test, I suspect you'd need to do
> the writeback on the same filesystem (and obviously disk) as your home
> directory is. If the big write is to another filesystem and another
> disk, I think you won't see the same issues.

Ok, I did the following test on my single SATA disk, and my root
filesystem is on this disk.

I dropped caches, launched firefox, and monitored the time it takes
for firefox to start (cache cold).

And my results are the reverse of what you have been seeing. With
sync_file_range() running, firefox takes roughly 30 seconds to start,
and with the flusher in operation it takes roughly 20 seconds to start
(I have approximated the average of 3 runs for simplicity).

I think this is happening because sync_file_range() will send all the
writes as SYNC, and they will compete with firefox's IO. On the other
hand, the flusher's IO will show up as ASYNC, CFQ will penalize it
heavily, and firefox's IO will be prioritized. And this effect should
just get worse as more processes do sync_file_range().

So write-behind should provide better interactivity if the writes
submitted are ASYNC and not SYNC.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes
  2012-06-06 12:14 ` Vivek Goyal
@ 2012-06-06 14:00   ` Fengguang Wu
  2012-06-06 17:04     ` Vivek Goyal
  0 siblings, 1 reply; 18+ messages in thread
From: Fengguang Wu @ 2012-06-06 14:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote:
> On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote:
> > [..]
>
> Ok, I did the following test on my single SATA disk, and my root
> filesystem is on this disk.
>
> I dropped caches, launched firefox, and monitored the time it takes
> for firefox to start (cache cold).
>
> And my results are the reverse of what you have been seeing. With
> sync_file_range() running, firefox takes roughly 30 seconds to start,
> and with the flusher in operation it takes roughly 20 seconds to start
> (I have approximated the average of 3 runs for simplicity).
>
> I think this is happening because sync_file_range() will send all the
> writes as SYNC, and they will compete with firefox's IO. On the other
> hand, the flusher's IO will show up as ASYNC, CFQ will penalize it
> heavily, and firefox's IO will be prioritized. And this effect should
> just get worse as more processes do sync_file_range().
>
> So write-behind should provide better interactivity if the writes
> submitted are ASYNC and not SYNC.

Hi Vivek, thanks for testing all of these out! The result is
definitely interesting and a surprise: we overlooked the SYNC nature
of sync_file_range().

I'd suggest using these calls to achieve the write-and-drop-behind
behavior, *with* WB_SYNC_NONE:

    posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_AFTER);

The caveat is, the bdi_write_congested() test below will never
evaluate to true, since we are only filling the request queue with 8MB
of data:

SYSCALL_DEFINE(fadvise64_64):

	case POSIX_FADV_DONTNEED:
		if (!bdi_write_congested(mapping->backing_dev_info))
			__filemap_fdatawrite_range(mapping, offset, endbyte,
						   WB_SYNC_NONE);

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 18+ messages in thread
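
To make the suggestion concrete, here is one way the two calls above
might slot into Linus's write loop. A sketch only -- the second
POSIX_FADV_DONTNEED on the completed chunk (to actually drop its
now-clean pages) is an extrapolation beyond what is written above, and
the file name is made up:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define CHUNK (8*1024*1024UL)

    static char buf[CHUNK];

    int main(void)
    {
        int fd = open("drop-behind-demo.tst", O_WRONLY|O_CREAT|O_TRUNC, 0644);
        unsigned int index;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        memset(buf, 'a', CHUNK);

        for (index = 0; index < 512; index++) {  /* 4GB total */
            loff_t off = (loff_t)index * CHUNK;

            if (write(fd, buf, CHUNK) != (ssize_t)CHUNK)
                break;
            /* kick off WB_SYNC_NONE writeback on the chunk just written */
            posix_fadvise(fd, off, CHUNK, POSIX_FADV_DONTNEED);
            if (index) {
                loff_t prev = off - CHUNK;

                /* wait for the previous chunk's writeback to complete... */
                sync_file_range(fd, prev, CHUNK, SYNC_FILE_RANGE_WAIT_AFTER);
                /* ...then drop its now-clean pages (extrapolated step:
                   DONTNEED only drops pages that are already clean) */
                posix_fadvise(fd, prev, CHUNK, POSIX_FADV_DONTNEED);
            }
        }
        close(fd);
        return 0;
    }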
* Re: write-behind on streaming writes
  2012-06-06 14:00 ` Fengguang Wu
@ 2012-06-06 17:04   ` Vivek Goyal
  2012-06-07  9:45     ` Jan Kara
  0 siblings, 1 reply; 18+ messages in thread
From: Vivek Goyal @ 2012-06-06 17:04 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Wed, Jun 06, 2012 at 10:00:58PM +0800, Fengguang Wu wrote:
> [..]
> Hi Vivek, thanks for testing all of these out! The result is
> definitely interesting and a surprise: we overlooked the SYNC nature
> of sync_file_range().
>
> I'd suggest using these calls to achieve the write-and-drop-behind
> behavior, *with* WB_SYNC_NONE:
>
>     posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
>     sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_AFTER);
>
> The caveat is, the bdi_write_congested() test below will never
> evaluate to true, since we are only filling the request queue with 8MB
> of data:
>
> SYSCALL_DEFINE(fadvise64_64):
>
> 	case POSIX_FADV_DONTNEED:
> 		if (!bdi_write_congested(mapping->backing_dev_info))
> 			__filemap_fdatawrite_range(mapping, offset, endbyte,
> 						   WB_SYNC_NONE);

Hi Fengguang,

Instead of the above, I modified sync_file_range() to call
__filemap_fdatawrite_range(WB_SYNC_NONE), and I do now see ASYNC
writes showing up at the elevator.

With 4 processes doing sync_file_range(), the firefox start time test
now clocks around 18-19 seconds, which is better than the 30-35
seconds with 4 processes doing buffered writes. And the system looks
pretty good from an interactivity point of view.

Thanks
Vivek

The following is the patch I applied for testing:

---
 fs/sync.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c	2012-06-06 00:12:33.000000000 -0400
+++ linux-2.6/fs/sync.c	2012-06-06 23:11:17.050691776 -0400
@@ -342,7 +342,7 @@ SYSCALL_DEFINE(sync_file_range)(int fd,
 	}
 
 	if (flags & SYNC_FILE_RANGE_WRITE) {
-		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
+		ret = __filemap_fdatawrite_range(mapping, offset, endbyte, WB_SYNC_NONE);
 		if (ret < 0)
 			goto out_put;
 	}

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: write-behind on streaming writes 2012-06-06 17:04 ` Vivek Goyal @ 2012-06-07 9:45 ` Jan Kara 2012-06-07 19:06 ` Vivek Goyal 0 siblings, 1 reply; 18+ messages in thread From: Jan Kara @ 2012-06-07 9:45 UTC (permalink / raw) To: Vivek Goyal Cc: Fengguang Wu, Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel, Linux Memory Management List, Jens Axboe On Wed 06-06-12 13:04:28, Vivek Goyal wrote: > On Wed, Jun 06, 2012 at 10:00:58PM +0800, Fengguang Wu wrote: > > On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote: > > > On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote: > > > > On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > > > > > I had expected a bigger difference as sync_file_range() is just driving > > > > > max queue depth of 32 (total 16MB IO in flight), while flushers are > > > > > driving queue depths up to 140 or so. So in this paritcular test, driving > > > > > much deeper queue depths is not really helping much. (I have seen higher > > > > > throughputs with higher queue depths in the past. Now sure why don't we > > > > > see it here). > > > > > > > > How did interactivity feel? > > > > > > > > Because quite frankly, if the throughput difference is 12.5 vs 12 > > > > seconds, I suspect the interactivity thing is what dominates. > > > > > > > > And from my memory of the interactivity different was absolutely > > > > *huge*. Even back when I used rotational media, I basically couldn't > > > > even notice the background write with the sync_file_range() approach. > > > > While the regular writeback without the writebehind had absolutely > > > > *huge* pauses if you used something like firefox that uses fsync() > > > > etc. And starting new applications that weren't cached was noticeably > > > > worse too - and then with sync_file_range it wasn't even all that > > > > noticeable. > > > > > > > > NOTE! For the real "firefox + fsync" test, I suspect you'd need to do > > > > the writeback on the same filesystem (and obviously disk) as your home > > > > directory is. If the big write is to another filesystem and another > > > > disk, I think you won't see the same issues. > > > > > > Ok, I did following test on my single SATA disk and my root filesystem > > > is on this disk. > > > > > > I dropped caches and launched firefox and monitored the time it takes > > > for firefox to start. (cache cold). > > > > > > And my results are reverse of what you have been seeing. With > > > sync_file_range() running, firefox takes roughly 30 seconds to start and > > > with flusher in operation, it takes roughly 20 seconds to start. (I have > > > approximated the average of 3 runs for simplicity). > > > > > > I think it is happening because sync_file_range() will send all > > > the writes as SYNC and it will compete with firefox IO. On the other > > > hand, flusher's IO will show up as ASYNC and CFQ will be penalize it > > > heavily and firefox's IO will be prioritized. And this effect should > > > just get worse as more processes do sync_file_range(). > > > > > > So write-behind should provide better interactivity if writes submitted > > > are ASYNC and not SYNC. > > > > Hi Vivek, thanks for testing all of these out! The result is > > definitely interesting and a surprise: we overlooked the SYNC nature > > of sync_file_range(). 
> >
> > I'd suggest to use these calls to achieve the write-and-drop-behind
> > behavior, *with* WB_SYNC_NONE:
> >
> > 	posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
> > 	sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_AFTER);
> >
> > The caveat is, the below bdi_write_congested() will never evaluate to
> > true since we are only filling the request queue with 8MB of data.
> >
> > SYSCALL_DEFINE(fadvise64_64):
> >
> >         case POSIX_FADV_DONTNEED:
> >                 if (!bdi_write_congested(mapping->backing_dev_info))
> >                         __filemap_fdatawrite_range(mapping, offset, endbyte,
> >                                                    WB_SYNC_NONE);
>
> Hi Fengguang,
>
> Instead of above, I modified sync_file_range() to call
> __filemap_fdatawrite_range(WB_SYNC_NONE) and I now see ASYNC writes
> showing up at the elevator.
>
> With 4 processes doing sync_file_range() now, the firefox start time test
> clocks around 18-19 seconds, which is better than the 30-35 seconds of 4
> processes doing buffered writes. And the system looks pretty good from
> an interactivity point of view.
  So do you have any idea why that is? Do we drive shallower queues? Also,
how does the speed of the writers compare to the speed with normal buffered
writes + fsync (you'd need fsync for the sync_file_range writers as well to
make the comparison fair)?

								Honza

> ---
>  fs/sync.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/fs/sync.c
> ===================================================================
> --- linux-2.6.orig/fs/sync.c	2012-06-06 00:12:33.000000000 -0400
> +++ linux-2.6/fs/sync.c	2012-06-06 23:11:17.050691776 -0400
> @@ -342,7 +342,7 @@ SYSCALL_DEFINE(sync_file_range)(int fd,
> 	}
> 
> 	if (flags & SYNC_FILE_RANGE_WRITE) {
> -		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
> +		ret = __filemap_fdatawrite_range(mapping, offset, endbyte, WB_SYNC_NONE);
> 		if (ret < 0)
> 			goto out_put;
> 	}

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
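Fengguang's suggested write-and-drop-behind sequence, as a minimal
userspace sketch (illustrative only and untested; the per-chunk write
loop and error handling are left out):

	#define _GNU_SOURCE
	#include <fcntl.h>

	/*
	 * For each chunk that has already been written:
	 * POSIX_FADV_DONTNEED kicks off non-blocking WB_SYNC_NONE
	 * writeback (when the request queue is not congested) and drops
	 * pages that are already clean; SYNC_FILE_RANGE_WAIT_AFTER then
	 * blocks until the chunk is on disk, so dirty pages cannot pile
	 * up behind a slow disk.
	 */
	static void drop_behind_chunk(int fd, off_t offset, off_t len)
	{
		posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
		sync_file_range(fd, offset, len,
				SYNC_FILE_RANGE_WAIT_AFTER);
	}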
* Re: write-behind on streaming writes
  2012-06-07  9:45               ` Jan Kara
@ 2012-06-07 19:06                 ` Vivek Goyal
  0 siblings, 0 replies; 18+ messages in thread
From: Vivek Goyal @ 2012-06-07 19:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Linus Torvalds, LKML, Myklebust, Trond,
      linux-fsdevel, Linux Memory Management List, Jens Axboe

On Thu, Jun 07, 2012 at 11:45:04AM +0200, Jan Kara wrote:

[..]
> > Instead of above, I modified sync_file_range() to call
> > __filemap_fdatawrite_range(WB_SYNC_NONE) and I now see ASYNC writes
> > showing up at the elevator.
> >
> > With 4 processes doing sync_file_range() now, the firefox start time test
> > clocks around 18-19 seconds, which is better than the 30-35 seconds of 4
> > processes doing buffered writes. And the system looks pretty good from
> > an interactivity point of view.
>   So do you have any idea why that is? Do we drive shallower queues? Also,
> how does the speed of the writers compare to the speed with normal buffered
> writes + fsync (you'd need fsync for the sync_file_range writers as well to
> make the comparison fair)?

Ok, I did more tests, and noticed a few odd things:

- Results are varying a lot. Sometimes firefox also launched fast with
  the write+flush workload, so now it is hard to conclude things.

- For some reason I had nr_requests at 16K on my root drive. I have no
  idea what set it. Once I set it to 128, firefox with the write+flush
  workload performs much better and launch times are similar to
  sync_file_range().

- I tried to open new windows in firefox, browse the web, and load new
  websites. I would say sync_file_range() feels a little better, but I
  don't have any logical explanation and can't conclude anything yet by
  looking at the traces. I am continuing to stare, though.

So in summary, at this point of time I really can't conclude that using
sync_file_range() with ASYNC requests is providing better latencies in
my setup. I will keep at it though, and if I notice something new, will
write back.

Thanks
Vivek
* Re: write-behind on streaming writes
  2012-06-06 12:14             ` Vivek Goyal
  2012-06-06 14:00               ` Fengguang Wu
@ 2012-06-06 16:15               ` Vivek Goyal
  1 sibling, 0 replies; 18+ messages in thread
From: Vivek Goyal @ 2012-06-06 16:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fengguang Wu, LKML, Myklebust, Trond, linux-fsdevel,
      Linux Memory Management List, Jens Axboe

On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote:

[..]
> I think it is happening because sync_file_range() will send all
> the writes as SYNC and they will compete with firefox's IO. On the
> other hand, the flusher's IO will show up as ASYNC, so CFQ will
> penalize it heavily and firefox's IO will be prioritized. And this
> effect should just get worse as more processes do sync_file_range().

Ok, this time I tried the same test again, but with 4 processes doing
writes in parallel to 4 different files. And with sync_file_range(),
things turned ugly. Interactivity was very poor.

The firefox launch test took around 1m45s with sync_file_range(), while
it took only about 35 seconds with the regular flusher threads. So
sending writeback IO synchronously wreaks havoc.

Thanks
Vivek
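The SYNC/ASYNC tagging Vivek keeps coming back to is decided by the
writeback mode at submission time. Roughly the pattern the ->writepage
paths of this era use to pick the request type (a sketch from memory,
not any one filesystem's exact code):

	/*
	 * WB_SYNC_ALL means someone is waiting (fsync, the unpatched
	 * sync_file_range), so the request goes out as WRITE_SYNC and
	 * CFQ dispatches it promptly, competing with reads.
	 * WB_SYNC_NONE is background flusher-style IO, which goes out
	 * as a plain (async) WRITE and gets parked behind interactive
	 * readers.
	 */
	int rw = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC : WRITE;

With 4 writers all submitting WRITE_SYNC, there is four times as much
"sync" IO competing head-to-head with firefox's reads, which is
consistent with the much worse launch times above.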
* Re: write-behind on streaming writes
  2012-06-06  2:57           ` Vivek Goyal
  2012-06-06  3:14             ` Linus Torvalds
@ 2012-06-06 14:08             ` Fengguang Wu
  1 sibling, 0 replies; 18+ messages in thread
From: Fengguang Wu @ 2012-06-06 14:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
      Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 10:57:30PM -0400, Vivek Goyal wrote:
> On Tue, Jun 05, 2012 at 04:10:45PM -0400, Vivek Goyal wrote:
> > On Tue, Jun 05, 2012 at 02:48:53PM -0400, Vivek Goyal wrote:
> >
> > [..]
> > > So the sync_file_range() test keeps fewer in-flight requests on
> > > average, hence the better latencies. It might not produce a
> > > throughput drop on SATA disks but might have some effect on storage
> > > array LUNs. Will give it a try.
> >
> > Well, I ran the dd and sync_file_range tests on a storage array LUN.
> > Wrote a file of size 4G on ext4 and got about 300MB/s write speed. In
> > fact, when I measured time using "time", the sync_file_range test
> > finished a little faster.
> >
> > Then I started looking at the blktrace output. The sync_file_range()
> > test initially (for about 8 seconds) drives a shallow queue depth
> > (about 16), but after 8 seconds the flusher somehow gets involved and
> > starts submitting lots of requests, and we start driving a much higher
> > queue depth (up to more than 100). Not sure why the flusher should get
> > involved. Is everything working as expected? I thought that as we wait
> > for the last 8MB of IO to finish before we start a new one, we should
> > have at most 16MB of IO in flight. Fengguang?
>
> Ok, found it. I am using "int index", which in turn caused sign
> extension of (i*BUFSIZE). Once "i" crosses 255, integer overflow
> happens, the 64-bit offset is sign-extended, and the offsets are
> screwed. So after 2G of file size, sync_file_range() effectively stops
> working, leaving dirty pages which are cleaned up by the flusher. That
> explains why the flusher was kicking in during my tests. Change "int"
> to "unsigned int" and the problem is fixed.

Good catch! Besides that, I do see a small chance for the flusher
thread to kick in: at the time when the inode dirty expires after 30s.
Just a kind reminder, because I don't see how it can impact this
workload in any noticeable way.

Thanks,
Fengguang
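The overflow in isolation -- a sketch, assuming the test program's
BUFSIZE was defined without the 'ul' suffix so the multiply happened in
32-bit int (signed overflow is formally undefined in C; on x86 it wraps
in practice):

	#include <stdio.h>
	#include <sys/types.h>

	#define BUFSIZE (8 * 1024 * 1024)	/* plain int, no 'ul' */

	int main(void)
	{
		int i = 256;			/* 256 * 8MB == 2GB */
		off_t bad  = i * BUFSIZE;	/* int*int wraps to INT_MIN,
						   then sign-extends to
						   64 bits */
		off_t good = (off_t)i * BUFSIZE;/* widen first; making i
						   unsigned, as Vivek did,
						   also works up to 4GB */

		printf("bad=%lld good=%lld\n",
		       (long long)bad, (long long)good);
		return 0;
	}

This is why sync_file_range() silently stopped doing useful work once
the file offset crossed 2GB, leaving the tail of the file to the
flusher.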
end of thread, other threads:[~2012-06-07 19:06 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
  [not found] <20120528114124.GA6813@localhost>
  [not found] ` <CA+55aFxHt8q8+jQDuoaK=hObX+73iSBTa4bBWodCX3s-y4Q1GQ@mail.gmail.com>
  2012-05-29 15:57 ` write-behind on streaming writes -- Fengguang Wu
  2012-05-29 17:35 ` Linus Torvalds
  2012-05-30  3:21 ` Fengguang Wu
  2012-06-05  1:01 ` Dave Chinner
  2012-06-05 17:18 ` Vivek Goyal
  2012-06-05 17:23 ` Vivek Goyal
  2012-06-05 17:41 ` Vivek Goyal
  2012-06-05 18:48 ` Vivek Goyal
  2012-06-05 20:10 ` Vivek Goyal
  2012-06-06  2:57 ` Vivek Goyal
  2012-06-06  3:14 ` Linus Torvalds
  2012-06-06 12:14 ` Vivek Goyal
  2012-06-06 14:00 ` Fengguang Wu
  2012-06-06 17:04 ` Vivek Goyal
  2012-06-07  9:45 ` Jan Kara
  2012-06-07 19:06 ` Vivek Goyal
  2012-06-06 16:15 ` Vivek Goyal
  2012-06-06 14:08 ` Fengguang Wu