* A question about NCQ
@ 2006-05-16 10:01 zhao, forrest
  2006-05-16 10:49 ` Tejun Heo
  2006-05-17 14:31 ` Mark Lord
  0 siblings, 2 replies; 11+ messages in thread
From: zhao, forrest @ 2006-05-16 10:01 UTC (permalink / raw)
  To: htejun; +Cc: linux-ide

Hi, Tejun

Since your NCQ patches were pushed into #upstream, I decided to compare
the performance with and without NCQ enabled.

But initial test results from running iozone with the O_DIRECT option
turned on didn't show a visible performance gain with NCQ. In certain
cases, NCQ even performed worse than without NCQ.

So my question is: in what usage cases can we observe a performance gain
with NCQ?

Thanks,
Forrest

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: A question about NCQ
  2006-05-16 10:01 A question about NCQ zhao, forrest
@ 2006-05-16 10:49 ` Tejun Heo
  2006-05-17  2:21   ` zhao, forrest
  2006-05-17 14:31 ` Mark Lord
  1 sibling, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2006-05-16 10:49 UTC (permalink / raw)
  To: zhao, forrest; +Cc: linux-ide

zhao, forrest wrote:
> Hi, Tejun
>
> Since your NCQ patches were pushed into #upstream, I decided to compare
> the performance with and without NCQ enabled.
>
> But initial test results from running iozone with the O_DIRECT option
> turned on didn't show a visible performance gain with NCQ. In certain
> cases, NCQ even performed worse than without NCQ.
>
> So my question is: in what usage cases can we observe a performance
> gain with NCQ?

I don't know the workload of iozone, but NCQ shines when there are many
concurrent IOs in progress. A good real-world example would be a busy
file-serving web server. It generally helps if there are multiple IO
requests. If iozone is single-threaded (IO-wise), try running multiple
copies of it and compare the results.

Also, you need to pay attention to the IO scheduler in use. IIRC, as and
cfq are heavily optimized for single-queued devices and might not show
the best performance depending on workload. For functionality tests, I
usually use deadline. It's simpler and usually doesn't get in the way,
which, BTW, may or may not translate into better performance.

--
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread
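A minimal sketch of the "run multiple copies" suggestion above (the
device, mount point, and file names here are assumptions; adjust for the
machine under test):

    #!/bin/sh
    # Launch four independent O_DIRECT iozone writers against the same
    # disk so its queue actually sees concurrent requests, then wait for
    # all of them before reading the per-run results.
    for i in 1 2 3 4; do
        ./iozone -i 0 -s 2000 -r 1 -I -f /mnt/test/file$i > result.$i &
    done
    wait    # all writers done; compare result.1 .. result.4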
* Re: A question about NCQ
  2006-05-16 10:49 ` Tejun Heo
@ 2006-05-17  2:21   ` zhao, forrest
  2006-05-17  2:37     ` Tejun Heo
  2006-05-17  3:19     ` Jeff Garzik
  0 siblings, 2 replies; 11+ messages in thread
From: zhao, forrest @ 2006-05-17 2:21 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide

[-- Attachment #1: Type: text/plain, Size: 1235 bytes --]

On Tue, 2006-05-16 at 19:49 +0900, Tejun Heo wrote:
> I don't know the workload of iozone, but NCQ shines when there are many
> concurrent IOs in progress. A good real-world example would be a busy
> file-serving web server. It generally helps if there are multiple IO
> requests. If iozone is single-threaded (IO-wise), try running multiple
> copies of it and compare the results.
>
> Also, you need to pay attention to the IO scheduler in use. IIRC, as
> and cfq are heavily optimized for single-queued devices and might not
> show the best performance depending on workload. For functionality
> tests, I usually use deadline. It's simpler and usually doesn't get in
> the way, which, BTW, may or may not translate into better performance.

Tejun,

I ran iozone with 8 concurrent threads. From my understanding, NCQ
should provide at least the same throughput as non-NCQ, but the attached
test results show that NCQ has lower throughput than non-NCQ.

The IO scheduler is anticipatory.
The kernel without NCQ is 2.6.16-rc6; the kernel with NCQ is #upstream.

The current problem is that I don't know where the bottleneck is: the
block I/O layer, the SCSI layer, the device driver layer, or a hardware
problem...

Thanks,
Forrest

[-- Attachment #2: testresult.dio.8thread.NCQ --]
[-- Type: text/plain, Size: 1727 bytes --]

        Iozone: Performance Test of File I/O
                Version $Revision: 3.263 $
                Compiled for 32 bit mode.
                Build: linux

        Contributors: William Norcott, Don Capps, Isom Crawford,
                      Kirby Collins, Al Slater, Scott Rhine, Mike Wisner,
                      Ken Goss, Steve Landherr, Brad Smith, Mark Kelly,
                      Dr. Alain CYR, Randy Dunlap, Mark Montague,
                      Dan Million, Jean-Marc Zucconi, Jeff Blomberg,
                      Erik Habbinga, Kris Strecker, Walter Wong.

        Run began: Wed May 17 10:06:21 2006

        File size set to 2000 KB
        Record Size 1 KB
        O_DIRECT feature enabled
        Command line used: ./iozone -l 8 -u 8 -F hello1.data hello2.data
            hello3.data hello4.data hello5.data hello6.data hello7.data
            hello8.data -i 0 -s 2000 -r 1 -I
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Min process = 8
        Max process = 8
        Throughput test with 8 processes
        Each process writes a 2000 Kbyte file in 1 Kbyte records

        Children see throughput for 8 initial writers = 1862.25 KB/sec
        Parent sees throughput for 8 initial writers  =  753.66 KB/sec
        Min throughput per process                    =    4.11 KB/sec
        Max throughput per process                    =  588.33 KB/sec
        Avg throughput per process                    =  232.78 KB/sec
        Min xfer                                      =   14.00 KB

        Children see throughput for 8 rewriters       = 1582.49 KB/sec
        Parent sees throughput for 8 rewriters        = 1576.26 KB/sec
        Min throughput per process                    =    2.49 KB/sec
        Max throughput per process                    =  384.88 KB/sec
        Avg throughput per process                    =  197.81 KB/sec
        Min xfer                                      =   13.00 KB

        iozone test complete.

[-- Attachment #3: testresult.dio.8thread.nonNCQ --]
[-- Type: text/plain, Size: 1727 bytes --]

        Iozone: Performance Test of File I/O
                Version $Revision: 3.263 $
                Compiled for 32 bit mode.
                Build: linux

        Contributors: William Norcott, Don Capps, Isom Crawford,
                      Kirby Collins, Al Slater, Scott Rhine, Mike Wisner,
                      Ken Goss, Steve Landherr, Brad Smith, Mark Kelly,
                      Dr. Alain CYR, Randy Dunlap, Mark Montague,
                      Dan Million, Jean-Marc Zucconi, Jeff Blomberg,
                      Erik Habbinga, Kris Strecker, Walter Wong.

        Run began: Wed May 17 10:01:50 2006

        File size set to 2000 KB
        Record Size 1 KB
        O_DIRECT feature enabled
        Command line used: ./iozone -l 8 -u 8 -F hello1.data hello2.data
            hello3.data hello4.data hello5.data hello6.data hello7.data
            hello8.data -i 0 -s 2000 -r 1 -I
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.
        Min process = 8
        Max process = 8
        Throughput test with 8 processes
        Each process writes a 2000 Kbyte file in 1 Kbyte records

        Children see throughput for 8 initial writers = 2263.96 KB/sec
        Parent sees throughput for 8 initial writers  =  640.26 KB/sec
        Min throughput per process                    =    2.93 KB/sec
        Max throughput per process                    =  985.75 KB/sec
        Avg throughput per process                    =  283.00 KB/sec
        Min xfer                                      =    6.00 KB

        Children see throughput for 8 rewriters       = 2656.53 KB/sec
        Parent sees throughput for 8 rewriters        = 2602.82 KB/sec
        Min throughput per process                    =    3.80 KB/sec
        Max throughput per process                    = 1923.40 KB/sec
        Avg throughput per process                    =  332.07 KB/sec
        Min xfer                                      =    4.00 KB

        iozone test complete.

^ permalink raw reply	[flat|nested] 11+ messages in thread
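For a quick side-by-side check, the headline numbers in the two
attachments can be pulled out with grep (filenames as attached above):

    # Show the aggregate throughput lines from both reports; grep
    # prefixes each match with the file it came from.
    grep "throughput for 8" testresult.dio.8thread.NCQ \
                            testresult.dio.8thread.nonNCQ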
* Re: A question about NCQ
  2006-05-17  2:21 ` zhao, forrest
@ 2006-05-17  2:37   ` Tejun Heo
  2006-05-17  3:24     ` zhao, forrest
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2006-05-17 2:37 UTC (permalink / raw)
  To: zhao, forrest; +Cc: linux-ide

zhao, forrest wrote:
> On Tue, 2006-05-16 at 19:49 +0900, Tejun Heo wrote:
>> I don't know the workload of iozone, but NCQ shines when there are
>> many concurrent IOs in progress. A good real-world example would be a
>> busy file-serving web server. It generally helps if there are multiple
>> IO requests. If iozone is single-threaded (IO-wise), try running
>> multiple copies of it and compare the results.
>>
>> Also, you need to pay attention to the IO scheduler in use. IIRC, as
>> and cfq are heavily optimized for single-queued devices and might not
>> show the best performance depending on workload. For functionality
>> tests, I usually use deadline. It's simpler and usually doesn't get in
>> the way, which, BTW, may or may not translate into better performance.
>
> Tejun,
>
> I ran iozone with 8 concurrent threads. From my understanding, NCQ
> should provide at least the same throughput as non-NCQ, but the
> attached test results show that NCQ has lower throughput than non-NCQ.
>
> The IO scheduler is anticipatory.
> The kernel without NCQ is 2.6.16-rc6; the kernel with NCQ is #upstream.
>
> The current problem is that I don't know where the bottleneck is: the
> block I/O layer, the SCSI layer, the device driver layer, or a hardware
> problem...

AFAIK, anticipatory doesn't interact very well with queued devices. Can
you try with deadline?

--
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread
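For reference, on 2.6 kernels the elevator can be switched per-disk at
runtime through sysfs, so trying deadline needs no reboot (sda is an
assumed device name here):

    # The active scheduler is shown in brackets, e.g.:
    #   noop [anticipatory] deadline cfq
    cat /sys/block/sda/queue/scheduler
    # Switch this disk to the deadline elevator:
    echo deadline > /sys/block/sda/queue/scheduler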
* Re: A question about NCQ
  2006-05-17  2:37 ` Tejun Heo
@ 2006-05-17  3:24   ` zhao, forrest
  2006-05-17  3:54     ` Tejun Heo
  0 siblings, 1 reply; 11+ messages in thread
From: zhao, forrest @ 2006-05-17 3:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide

On Wed, 2006-05-17 at 11:37 +0900, Tejun Heo wrote:
> zhao, forrest wrote:
> > On Tue, 2006-05-16 at 19:49 +0900, Tejun Heo wrote:
> > > I don't know the workload of iozone, but NCQ shines when there are
> > > many concurrent IOs in progress. A good real-world example would be
> > > a busy file-serving web server. It generally helps if there are
> > > multiple IO requests. If iozone is single-threaded (IO-wise), try
> > > running multiple copies of it and compare the results.
> > >
> > > Also, you need to pay attention to the IO scheduler in use. IIRC,
> > > as and cfq are heavily optimized for single-queued devices and
> > > might not show the best performance depending on workload. For
> > > functionality tests, I usually use deadline. It's simpler and
> > > usually doesn't get in the way, which, BTW, may or may not
> > > translate into better performance.
> >
> > Tejun,
> >
> > I ran iozone with 8 concurrent threads. From my understanding, NCQ
> > should provide at least the same throughput as non-NCQ, but the
> > attached test results show that NCQ has lower throughput than
> > non-NCQ.
> >
> > The IO scheduler is anticipatory.
> > The kernel without NCQ is 2.6.16-rc6; the kernel with NCQ is
> > #upstream.
> >
> > The current problem is that I don't know where the bottleneck is:
> > the block I/O layer, the SCSI layer, the device driver layer, or a
> > hardware problem...
>
> AFAIK, anticipatory doesn't interact very well with queued devices.
> Can you try with deadline?

By using deadline, we observed the performance gain from NCQ. To avoid
jitter, I recorded six "Avg throughput per process" results for NCQ and
non-NCQ:

NCQ:
write    : 192 204 193 190 187 197
re-write :  64  64  51  61  55  72

non-NCQ:
write    : 192 188 206 201 189 200
re-write :  36  37  40  39  37  41

Here we observed that NCQ has better re-write performance than non-NCQ.

But when using anticipatory, the test results were:

NCQ:
write    : 233
re-write : 197

non-NCQ:
write    : 283
re-write : 332

Here we observed that anticipatory performs better than deadline; in
particular, re-write under anticipatory performs much better than under
deadline. So why do users use deadline instead of anticipatory?

Thanks,
Forrest

^ permalink raw reply	[flat|nested] 11+ messages in thread
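A loop along these lines (a sketch, assuming bash and the iozone
invocation from the attachments above) is one way to collect such
repeated samples:

    #!/bin/bash
    # Repeat the 8-thread O_DIRECT run six times, keeping only the
    # "Avg throughput per process" lines (one for the initial writers
    # and one for the rewriters per run) to smooth out jitter.
    for run in 1 2 3 4 5 6; do
        ./iozone -l 8 -u 8 -F hello{1..8}.data -i 0 -s 2000 -r 1 -I \
            | grep "Avg throughput" >> samples.txt
    done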
* Re: A question about NCQ
  2006-05-17  3:24 ` zhao, forrest
@ 2006-05-17  3:54   ` Tejun Heo
  2006-05-17  4:04     ` Nick Piggin
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2006-05-17 3:54 UTC (permalink / raw)
  To: zhao, forrest; +Cc: linux-ide, Jens Axboe, nickpiggin

zhao, forrest wrote:
> On Wed, 2006-05-17 at 11:37 +0900, Tejun Heo wrote:
>> zhao, forrest wrote:
>>> On Tue, 2006-05-16 at 19:49 +0900, Tejun Heo wrote:
>>>> I don't know the workload of iozone, but NCQ shines when there are
>>>> many concurrent IOs in progress. A good real-world example would be
>>>> a busy file-serving web server. It generally helps if there are
>>>> multiple IO requests. If iozone is single-threaded (IO-wise), try
>>>> running multiple copies of it and compare the results.
>>>>
>>>> Also, you need to pay attention to the IO scheduler in use. IIRC,
>>>> as and cfq are heavily optimized for single-queued devices and
>>>> might not show the best performance depending on workload. For
>>>> functionality tests, I usually use deadline. It's simpler and
>>>> usually doesn't get in the way, which, BTW, may or may not
>>>> translate into better performance.
>>>
>>> Tejun,
>>>
>>> I ran iozone with 8 concurrent threads. From my understanding, NCQ
>>> should provide at least the same throughput as non-NCQ, but the
>>> attached test results show that NCQ has lower throughput than
>>> non-NCQ.
>>>
>>> The IO scheduler is anticipatory.
>>> The kernel without NCQ is 2.6.16-rc6; the kernel with NCQ is
>>> #upstream.
>>>
>>> The current problem is that I don't know where the bottleneck is:
>>> the block I/O layer, the SCSI layer, the device driver layer, or a
>>> hardware problem...
>>
>> AFAIK, anticipatory doesn't interact very well with queued devices.
>> Can you try with deadline?
>
> By using deadline, we observed the performance gain from NCQ. To avoid
> jitter, I recorded six "Avg throughput per process" results for NCQ
> and non-NCQ:
>
> NCQ:
> write    : 192 204 193 190 187 197
> re-write :  64  64  51  61  55  72
>
> non-NCQ:
> write    : 192 188 206 201 189 200
> re-write :  36  37  40  39  37  41
>
> Here we observed that NCQ has better re-write performance than
> non-NCQ.
>
> But when using anticipatory, the test results were:
>
> NCQ:
> write    : 233
> re-write : 197
>
> non-NCQ:
> write    : 283
> re-write : 332
>
> Here we observed that anticipatory performs better than deadline; in
> particular, re-write under anticipatory performs much better than
> under deadline. So why do users use deadline instead of anticipatory?

[CC'ing Jens and Nick, hi!]

Big difference; I didn't expect that. AS seems really good at what it
does. I think the result would be highly dependent on the workload. The
name "anticipatory" originates from the fact that it delays pending IOs
in anticipation of yet-to-be-requested IOs. When it hits (as it usually
does on a lot of workloads), the induced delay is easily offset by the
reduction in seeking. When it misses, it has just wasted time waiting
for something that never came.

I use cfq on my production system, mainly to listen to mp3s better, and
deadline on test systems, as it gives more deterministic behavior. I
guess static web serving with a large dataset would benefit from NCQ +
deadline.

Anyway, anticipatory rules, great. I don't know why NCQ is showing worse
performance with AS, but I'm pretty sure it's something that can be
fixed. Jens, Nick, any ideas?

--
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: A question about NCQ
  2006-05-17  3:54 ` Tejun Heo
@ 2006-05-17  4:04   ` Nick Piggin
  0 siblings, 0 replies; 11+ messages in thread
From: Nick Piggin @ 2006-05-17 4:04 UTC (permalink / raw)
  To: Tejun Heo; +Cc: zhao, forrest, linux-ide, Jens Axboe

Tejun Heo wrote:
>
> Anyway, anticipatory rules, great. I don't know why NCQ is showing
> worse performance with AS, but I'm pretty sure it's something that can
> be fixed. Jens, Nick, any ideas?

Thanks for the numbers, interesting. Anticipatory basically tries pretty
hard to control disk command queues because they can result in
starvation... not exactly sure why it is _worse_ with NCQ than without;
maybe the drive isn't too smart, or there is a bad interaction with AS.
I don't see any trivial bugs in AS that would cause this, but there may
be one...

It's unfortunate that we don't have a grand unified IO scheduler that
does everything well (except perhaps noop functionality). It is
something I guess Jens and I (or maybe someone completely different)
should get together on and try to make progress on one day.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: A question about NCQ
  2006-05-17  2:21 ` zhao, forrest
  2006-05-17  2:37   ` Tejun Heo
@ 2006-05-17  3:19   ` Jeff Garzik
  2006-05-17  3:50     ` zhao, forrest
  1 sibling, 1 reply; 11+ messages in thread
From: Jeff Garzik @ 2006-05-17 3:19 UTC (permalink / raw)
  To: zhao, forrest; +Cc: Tejun Heo, linux-ide

zhao, forrest wrote:
> On Tue, 2006-05-16 at 19:49 +0900, Tejun Heo wrote:
>> I don't know the workload of iozone, but NCQ shines when there are
>> many concurrent IOs in progress. A good real-world example would be a
>> busy file-serving web server. It generally helps if there are multiple
>> IO requests. If iozone is single-threaded (IO-wise), try running
>> multiple copies of it and compare the results.
>>
>> Also, you need to pay attention to the IO scheduler in use. IIRC, as
>> and cfq are heavily optimized for single-queued devices and might not
>> show the best performance depending on workload. For functionality
>> tests, I usually use deadline. It's simpler and usually doesn't get in
>> the way, which, BTW, may or may not translate into better performance.
>
> Tejun,
>
> I ran iozone with 8 concurrent threads. From my understanding, NCQ
> should provide at least the same throughput as non-NCQ, but the
> attached test results show that NCQ has lower throughput than non-NCQ.
>
> The IO scheduler is anticipatory.
> The kernel without NCQ is 2.6.16-rc6; the kernel with NCQ is #upstream.
>
> The current problem is that I don't know where the bottleneck is: the
> block I/O layer, the SCSI layer, the device driver layer, or a hardware
> problem...

Can you verify that /sys/bus/scsi/devices/<device>/queue_depth is
greater than 1?

	Jeff

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: A question about NCQ
  2006-05-17  3:19 ` Jeff Garzik
@ 2006-05-17  3:50   ` zhao, forrest
  0 siblings, 0 replies; 11+ messages in thread
From: zhao, forrest @ 2006-05-17 3:50 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Tejun Heo, linux-ide

On Tue, 2006-05-16 at 23:19 -0400, Jeff Garzik wrote:
> zhao, forrest wrote:
> > On Tue, 2006-05-16 at 19:49 +0900, Tejun Heo wrote:
> > > I don't know the workload of iozone, but NCQ shines when there are
> > > many concurrent IOs in progress. A good real-world example would be
> > > a busy file-serving web server. It generally helps if there are
> > > multiple IO requests. If iozone is single-threaded (IO-wise), try
> > > running multiple copies of it and compare the results.
> > >
> > > Also, you need to pay attention to the IO scheduler in use. IIRC,
> > > as and cfq are heavily optimized for single-queued devices and
> > > might not show the best performance depending on workload. For
> > > functionality tests, I usually use deadline. It's simpler and
> > > usually doesn't get in the way, which, BTW, may or may not
> > > translate into better performance.
> >
> > Tejun,
> >
> > I ran iozone with 8 concurrent threads. From my understanding, NCQ
> > should provide at least the same throughput as non-NCQ, but the
> > attached test results show that NCQ has lower throughput than
> > non-NCQ.
> >
> > The IO scheduler is anticipatory.
> > The kernel without NCQ is 2.6.16-rc6; the kernel with NCQ is
> > #upstream.
> >
> > The current problem is that I don't know where the bottleneck is:
> > the block I/O layer, the SCSI layer, the device driver layer, or a
> > hardware problem...
>
> Can you verify that /sys/bus/scsi/devices/<device>/queue_depth is
> greater than 1?
>
> 	Jeff

Booted with the kernel supporting NCQ:

[root@napa-sdv1 ~]# cat /sys/bus/scsi/devices/0\:0\:0\:0/queue_depth
31

Booted with the kernel not supporting NCQ:

[root@napa-sdv1 ~]# cat /sys/bus/scsi/devices/0\:0\:0\:0/queue_depth
1

Forrest

^ permalink raw reply	[flat|nested] 11+ messages in thread
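If the low-level driver supports changing the depth at runtime (an
assumption; not every driver of this era did), the same kernel can also
be tested both ways by writing to that sysfs attribute:

    # Throttle the device to one outstanding command (NCQ effectively off):
    echo 1 > /sys/bus/scsi/devices/0:0:0:0/queue_depth
    # Restore the full 31-deep queue afterwards:
    echo 31 > /sys/bus/scsi/devices/0:0:0:0/queue_depth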
* Re: A question about NCQ
  2006-05-16 10:01 A question about NCQ zhao, forrest
  2006-05-16 10:49 ` Tejun Heo
@ 2006-05-17 14:31 ` Mark Lord
  2006-05-18  1:56   ` Tejun Heo
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Lord @ 2006-05-17 14:31 UTC (permalink / raw)
  To: zhao, forrest; +Cc: htejun, linux-ide

zhao, forrest wrote:
..
> But initial test results from running iozone with the O_DIRECT option
> turned on didn't show a visible performance gain with NCQ. In certain
> cases, NCQ even performed worse than without NCQ.
>
> So my question is: in what usage cases can we observe a performance
> gain with NCQ?

That's something I've been wondering about for a couple of years, ever
since implementing full NCQ/TCQ Linux drivers for several devices (most
notably the very fast qstor.c driver).

The observation with all of these was that Linux already does a
reasonably good job of scheduling I/O, so tagged queuing rarely seems to
help, at least with any benchmark/test tools we've found to try (note
that the opposite results are obtained when using non-Linux kernels,
e.g. winxp).

With some drives, the use of tagged commands triggers different firmware
algorithms that trade throughput for better random-seek capability --
but since the disk scheduling already minimizes the randomness of
seeking (very few back-and-forth flurries), this combination often ends
up slower than without NCQ (on Linux).

Cheers

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: A question about NCQ
  2006-05-17 14:31 ` Mark Lord
@ 2006-05-18  1:56   ` Tejun Heo
  0 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2006-05-18 1:56 UTC (permalink / raw)
  To: Mark Lord; +Cc: zhao, forrest, linux-ide

Mark Lord wrote:
> zhao, forrest wrote:
> ..
>> But initial test results from running iozone with the O_DIRECT option
>> turned on didn't show a visible performance gain with NCQ. In certain
>> cases, NCQ even performed worse than without NCQ.
>>
>> So my question is: in what usage cases can we observe a performance
>> gain with NCQ?
>
> That's something I've been wondering about for a couple of years, ever
> since implementing full NCQ/TCQ Linux drivers for several devices
> (most notably the very fast qstor.c driver).
>
> The observation with all of these was that Linux already does a
> reasonably good job of scheduling I/O, so tagged queuing rarely seems
> to help, at least with any benchmark/test tools we've found to try
> (note that the opposite results are obtained when using non-Linux
> kernels, e.g. winxp).
>
> With some drives, the use of tagged commands triggers different
> firmware algorithms that trade throughput for better random-seek
> capability -- but since the disk scheduling already minimizes the
> randomness of seeking (very few back-and-forth flurries), this
> combination often ends up slower than without NCQ (on Linux).

At this point, NCQ doesn't look that attractive, as it shows _worse_
performance in many cases. Maybe libata shouldn't enable it
automatically for the time being, but I think that if drives handle NCQ
reasonably well, there are things to be gained from NCQ by making IO
schedulers more aware of queued devices. Things that come to mind are...

* Control the movement of the head closely, but send adjacent requests
  together to allow the drive to optimize at a smaller scale.

* Reduce plugging/wait latency. As we can send more than one command at
  a time, we don't have to wait for adjacent requests which might arrive
  soon. Once it's determined that the head can move to a certain area,
  issue the command ASAP. If adjacent requests arrive, we can merge them
  while the head is moving, thus reducing latency.

--
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2006-05-18  2:50 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2006-05-16 10:01 A question about NCQ zhao, forrest
2006-05-16 10:49 ` Tejun Heo
2006-05-17  2:21   ` zhao, forrest
2006-05-17  2:37     ` Tejun Heo
2006-05-17  3:24       ` zhao, forrest
2006-05-17  3:54         ` Tejun Heo
2006-05-17  4:04           ` Nick Piggin
2006-05-17  3:19     ` Jeff Garzik
2006-05-17  3:50       ` zhao, forrest
2006-05-17 14:31 ` Mark Lord
2006-05-18  1:56   ` Tejun Heo