* ext4 data=writeback performs worse than data=ordered now
@ 2011-12-14 13:34 Wu Fengguang
       [not found] ` <20111214140025.GA19650@localhost>
  2011-12-14 14:30 ` Ted Ts'o
  0 siblings, 2 replies; 15+ messages in thread
From: Wu Fengguang @ 2011-12-14 13:34 UTC (permalink / raw)
  To: linux-ext4@vger.kernel.org
  Cc: Jan Kara, Li Shaohua, LKML, linux-fsdevel@vger.kernel.org, Theodore Ts'o

Hi,

Shaohua recently found that ext4 writeback mode could perform worse
than ordered mode in some cases. It may not be a big problem, however
we'd like to share some information on our findings.

I tested both 3.2 and 3.1 kernels on normal SATA disks and a USB key.
The interesting thing is, data=writeback used to run a bit faster
than data=ordered, however the situation got inverted, presumably by
the IO-less dirty throttling.

The worst case happens for the USB key, where both old/new kernels
see ~10% worse performance for data=writeback.

wfg@bee /export/writeback% ./compare -g ext4 -c fs snb/JBOD*/*-3.2.0-rc3-pause6+
      ext4              ext4:wb
----------  -------------------
    216.50    -0.5%      215.50  snb/JBOD-4HDD-thresh=100M/ext4-1dd-1-3.2.0-rc3-pause6+
    210.34    -0.5%      209.36  snb/JBOD-4HDD-thresh=1G/ext4-10dd-1-3.2.0-rc3-pause6+
    218.92    -0.5%      217.85  snb/JBOD-4HDD-thresh=1G/ext4-1dd-1-3.2.0-rc3-pause6+
    218.03    -0.1%      217.72  snb/JBOD-4HDD-thresh=8G/ext4-10dd-1-3.2.0-rc3-pause6+
    221.19    -2.0%      216.82  snb/JBOD-4HDD-thresh=8G/ext4-1dd-1-3.2.0-rc3-pause6+
   1084.98    -0.7%     1077.26  TOTAL write_bw

wfg@bee /export/writeback% ./compare -g ext4 -c fs fat/*/*-3.2.0-rc3-pause6+
      ext4              ext4:wb
----------  -------------------
     46.87    -1.9%       45.96  fat/UKEY-HDD/ext4-100dd-1-3.2.0-rc3-pause6+
     57.40    -4.9%       54.61  fat/UKEY-HDD/ext4-10dd-1-3.2.0-rc3-pause6+
     62.13    -1.2%       61.41  fat/UKEY-HDD/ext4-1dd-1-3.2.0-rc3-pause6+
      2.46    -1.0%        2.44  fat/UKEY-thresh=100M/ext4-100dd-1-3.2.0-rc3-pause6+
      4.52    -4.3%        4.33  fat/UKEY-thresh=100M/ext4-10dd-1-3.2.0-rc3-pause6+
      6.20   -10.6%        5.54  fat/UKEY-thresh=100M/ext4-1dd-1-3.2.0-rc3-pause6+
      2.55    +8.7%        2.77  fat/fio/ext4-fio_fat_mmap_randwrite_4k-1-3.2.0-rc3-pause6+
      9.60    -4.0%        9.21  fat/fio/ext4-fio_fat_mmap_randwrite_64k-1-3.2.0-rc3-pause6+
     53.57    -3.6%       51.61  fat/fio/ext4-fio_fat_rates-1-3.2.0-rc3-pause6+
     49.16    -1.3%       48.51  fat/thresh=1000M/ext4-100dd-1-3.2.0-rc3-pause6+
     56.20    -1.4%       55.40  fat/thresh=1000M/ext4-10dd-1-3.2.0-rc3-pause6+
     57.86    -1.4%       57.07  fat/thresh=1000M/ext4-1dd-1-3.2.0-rc3-pause6+
     50.36    -3.2%       48.75  fat/thresh=1000M:990M/ext4-100dd-1-3.2.0-rc3-pause6+
     56.46    -1.4%       55.69  fat/thresh=1000M:990M/ext4-10dd-1-3.2.0-rc3-pause6+
     57.51    -0.9%       56.97  fat/thresh=1000M:990M/ext4-1dd-1-3.2.0-rc3-pause6+
     50.02    -0.8%       49.60  fat/thresh=1000M:999M/ext4-100dd-1-3.2.0-rc3-pause6+
     55.56    -1.3%       54.84  fat/thresh=1000M:999M/ext4-10dd-1-3.2.0-rc3-pause6+
     56.88    -0.6%       56.52  fat/thresh=1000M:999M/ext4-1dd-1-3.2.0-rc3-pause6+
     32.03    -3.3%       30.98  fat/thresh=100M/ext4-100dd-1-3.2.0-rc3-pause6+
     46.63    -2.5%       45.47  fat/thresh=100M/ext4-10dd-1-3.2.0-rc3-pause6+
     56.67    -2.3%       55.34  fat/thresh=100M/ext4-1dd-1-3.2.0-rc3-pause6+
     36.16    -0.9%       35.84  fat/thresh=10M/ext4-10dd-1-3.2.0-rc3-pause6+
     56.01    -0.1%       55.98  fat/thresh=10M/ext4-1dd-1-3.2.0-rc3-pause6+
     31.45    +0.2%       31.51  fat/thresh=1M/ext4-10dd-1-3.2.0-rc3-pause6+
     52.83    -2.3%       51.62  fat/thresh=1M/ext4-1dd-1-3.2.0-rc3-pause6+
   1047.06    -1.8%     1027.98  TOTAL write_bw

wfg@bee /export/writeback% ./compare -g ext4 -c fs fat/*/*-3.1.0+
      ext4              ext4:wb
----------  -------------------
     45.91    +2.2%       46.90  fat/UKEY-HDD/ext4-100dd-1-3.1.0+
     54.53    +7.4%       58.54  fat/UKEY-HDD/ext4-10dd-1-3.1.0+
     62.18    -3.8%       59.83  fat/UKEY-HDD/ext4-1dd-1-3.1.0+
      2.41   -10.7%        2.15  fat/UKEY-thresh=100M/ext4-100dd-1-3.1.0+
      4.24    -3.0%        4.11  fat/UKEY-thresh=100M/ext4-10dd-1-3.1.0+
      6.25   -11.6%        5.53  fat/UKEY-thresh=100M/ext4-1dd-1-3.1.0+
      2.20    +0.6%        2.22  fat/fio/ext4-fio_fat_mmap_randwrite_4k-1-3.1.0+
      8.76    -4.2%        8.40  fat/fio/ext4-fio_fat_mmap_randwrite_64k-1-3.1.0+
     50.95    +0.4%       51.17  fat/fio/ext4-fio_fat_rates-1-3.1.0+
     47.44    +4.1%       49.40  fat/thresh=1000M/ext4-100dd-1-3.1.0+
     53.30    +4.3%       55.60  fat/thresh=1000M/ext4-10dd-1-3.1.0+
     56.02    +0.8%       56.47  fat/thresh=1000M/ext4-1dd-1-3.1.0+
     47.99    +1.3%       48.61  fat/thresh=1000M:990M/ext4-100dd-1-3.1.0+
     52.82    +0.3%       53.00  fat/thresh=1000M:990M/ext4-10dd-1-3.1.0+
     54.73    +1.9%       55.74  fat/thresh=1000M:990M/ext4-1dd-1-3.1.0+
     47.91    -0.6%       47.62  fat/thresh=1000M:999M/ext4-100dd-1-3.1.0+
     51.51    +3.0%       53.05  fat/thresh=1000M:999M/ext4-10dd-1-3.1.0+
     52.88    +1.6%       53.71  fat/thresh=1000M:999M/ext4-1dd-1-3.1.0+
     34.56    -2.3%       33.76  fat/thresh=100M/ext4-100dd-1-3.1.0+
     46.44    -1.3%       45.86  fat/thresh=100M/ext4-10dd-1-3.1.0+
     54.76    +3.5%       56.65  fat/thresh=100M/ext4-1dd-1-3.1.0+
     37.43    +3.4%       38.69  fat/thresh=10M/ext4-10dd-1-3.1.0+
     55.21    -0.5%       54.95  fat/thresh=10M/ext4-1dd-1-3.1.0+
     40.36    -1.3%       39.83  fat/thresh=1M/ext4-10dd-1-3.1.0+
     55.66    -0.1%       55.61  fat/thresh=1M/ext4-1dd-1-3.1.0+
   1026.44    +1.1%     1037.40  TOTAL write_bw

Here is the comparison between kernels. As you can see, the ordered
mode is improved slightly by 2% w/ IO-less, while data=writeback sees
a 0.9% drop.

wfg@bee /export/writeback% ./compare -g ext4- fat/*/*-3.1.0+ fat/*/*-3.2.0-rc3-pause6+
    3.1.0+    3.2.0-rc3-pause6+
----------  -------------------
     45.91    +2.1%       46.87  fat/UKEY-HDD/ext4-100dd-1-3.1.0+
     54.53    +5.3%       57.40  fat/UKEY-HDD/ext4-10dd-1-3.1.0+
     62.18    -0.1%       62.13  fat/UKEY-HDD/ext4-1dd-1-3.1.0+
      2.41    +2.1%        2.46  fat/UKEY-thresh=100M/ext4-100dd-1-3.1.0+
      4.24    +6.6%        4.52  fat/UKEY-thresh=100M/ext4-10dd-1-3.1.0+
      6.25    -0.9%        6.20  fat/UKEY-thresh=100M/ext4-1dd-1-3.1.0+
      2.20   +15.6%        2.55  fat/fio/ext4-fio_fat_mmap_randwrite_4k-1-3.1.0+
      8.76    +9.5%        9.60  fat/fio/ext4-fio_fat_mmap_randwrite_64k-1-3.1.0+
     50.95    +5.1%       53.57  fat/fio/ext4-fio_fat_rates-1-3.1.0+
     47.44    +3.6%       49.16  fat/thresh=1000M/ext4-100dd-1-3.1.0+
     53.30    +5.4%       56.20  fat/thresh=1000M/ext4-10dd-1-3.1.0+
     56.02    +3.3%       57.86  fat/thresh=1000M/ext4-1dd-1-3.1.0+
     47.99    +4.9%       50.36  fat/thresh=1000M:990M/ext4-100dd-1-3.1.0+
     52.82    +6.9%       56.46  fat/thresh=1000M:990M/ext4-10dd-1-3.1.0+
     54.73    +5.1%       57.51  fat/thresh=1000M:990M/ext4-1dd-1-3.1.0+
     47.91    +4.4%       50.02  fat/thresh=1000M:999M/ext4-100dd-1-3.1.0+
     51.51    +7.9%       55.56  fat/thresh=1000M:999M/ext4-10dd-1-3.1.0+
     52.88    +7.6%       56.88  fat/thresh=1000M:999M/ext4-1dd-1-3.1.0+
     34.56    -7.3%       32.03  fat/thresh=100M/ext4-100dd-1-3.1.0+
     46.44    +0.4%       46.63  fat/thresh=100M/ext4-10dd-1-3.1.0+
     54.76    +3.5%       56.67  fat/thresh=100M/ext4-1dd-1-3.1.0+
     37.43    -3.4%       36.16  fat/thresh=10M/ext4-10dd-1-3.1.0+
     55.21    +1.5%       56.01  fat/thresh=10M/ext4-1dd-1-3.1.0+
     40.36   -22.1%       31.45  fat/thresh=1M/ext4-10dd-1-3.1.0+
     55.66    -5.1%       52.83  fat/thresh=1M/ext4-1dd-1-3.1.0+
   1026.44    +2.0%     1047.06  TOTAL write_bw

wfg@bee /export/writeback% ./compare -g ext4:wb fat/*/*-3.1.0+ fat/*/*-3.2.0-rc3-pause6+
    3.1.0+    3.2.0-rc3-pause6+
----------  -------------------
     46.90    -2.0%       45.96  fat/UKEY-HDD/ext4:wb-100dd-1-3.1.0+
     58.54    -6.7%       54.61  fat/UKEY-HDD/ext4:wb-10dd-1-3.1.0+
     59.83    +2.7%       61.41  fat/UKEY-HDD/ext4:wb-1dd-1-3.1.0+
      2.15   +13.3%        2.44  fat/UKEY-thresh=100M/ext4:wb-100dd-1-3.1.0+
      4.11    +5.2%        4.33  fat/UKEY-thresh=100M/ext4:wb-10dd-1-3.1.0+
      5.53    +0.2%        5.54  fat/UKEY-thresh=100M/ext4:wb-1dd-1-3.1.0+
      2.22   +24.8%        2.77  fat/fio/ext4:wb-fio_fat_mmap_randwrite_4k-1-3.1.0+
      8.40    +9.6%        9.21  fat/fio/ext4:wb-fio_fat_mmap_randwrite_64k-1-3.1.0+
     51.17    +0.9%       51.61  fat/fio/ext4:wb-fio_fat_rates-1-3.1.0+
     49.40    -1.8%       48.51  fat/thresh=1000M/ext4:wb-100dd-1-3.1.0+
     55.60    -0.3%       55.40  fat/thresh=1000M/ext4:wb-10dd-1-3.1.0+
     56.47    +1.1%       57.07  fat/thresh=1000M/ext4:wb-1dd-1-3.1.0+
     48.61    +0.3%       48.75  fat/thresh=1000M:990M/ext4:wb-100dd-1-3.1.0+
     53.00    +5.1%       55.69  fat/thresh=1000M:990M/ext4:wb-10dd-1-3.1.0+
     55.74    +2.2%       56.97  fat/thresh=1000M:990M/ext4:wb-1dd-1-3.1.0+
     47.62    +4.2%       49.60  fat/thresh=1000M:999M/ext4:wb-100dd-1-3.1.0+
     53.05    +3.4%       54.84  fat/thresh=1000M:999M/ext4:wb-10dd-1-3.1.0+
     53.71    +5.2%       56.52  fat/thresh=1000M:999M/ext4:wb-1dd-1-3.1.0+
     33.76    -8.3%       30.98  fat/thresh=100M/ext4:wb-100dd-1-3.1.0+
     45.86    -0.8%       45.47  fat/thresh=100M/ext4:wb-10dd-1-3.1.0+
     56.65    -2.3%       55.34  fat/thresh=100M/ext4:wb-1dd-1-3.1.0+
     38.69    -7.4%       35.84  fat/thresh=10M/ext4:wb-10dd-1-3.1.0+
     54.95    +1.9%       55.98  fat/thresh=10M/ext4:wb-1dd-1-3.1.0+
     39.83   -20.9%       31.51  fat/thresh=1M/ext4:wb-10dd-1-3.1.0+
     55.61    -7.2%       51.62  fat/thresh=1M/ext4:wb-1dd-1-3.1.0+
   1037.40    -0.9%     1027.98  TOTAL write_bw

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 15+ messages in thread
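A note on reading the compare tables in the first message: the middle column appears to be the relative change of the right-hand value with respect to the left-hand one. The sketch below is not the `compare` script (its source is not shown in this thread); `rel_change` is a hypothetical helper, applied only to the TOTAL write_bw rows quoted in that message. Individual rows may differ in the last digit when recomputed this way, presumably because the tables print rounded bandwidths.

```python
def rel_change(old: float, new: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# TOTAL write_bw rows copied from the first message:
# (label, left column, right column)
TOTALS = [
    ("snb 3.2.0-rc3: ordered -> writeback", 1084.98, 1077.26),
    ("fat 3.2.0-rc3: ordered -> writeback", 1047.06, 1027.98),
    ("fat 3.1.0:     ordered -> writeback", 1026.44, 1037.40),
    ("fat ordered:   3.1.0 -> 3.2.0-rc3", 1026.44, 1047.06),
    ("fat writeback: 3.1.0 -> 3.2.0-rc3", 1037.40, 1027.98),
]

if __name__ == "__main__":
    for label, old, new in TOTALS:
        # Same one-decimal signed format as the tables use.
        print(f"{label:38s} {rel_change(old, new):+.1f}%")
```

Under this reading, the last two rows correspond to the "2%" improvement for ordered mode and the "0.9% drop" for data=writeback mentioned in the message.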
[parent not found: <20111214140025.GA19650@localhost>]
* Re: ext4 data=writeback performs worse than data=ordered now
       [not found] ` <20111214140025.GA19650@localhost>
@ 2011-12-14 14:03 ` Wu Fengguang
  0 siblings, 0 replies; 15+ messages in thread
From: Wu Fengguang @ 2011-12-14 14:03 UTC (permalink / raw)
  To: linux-ext4@vger.kernel.org
  Cc: Jan Kara, Li Shaohua, LKML, linux-fsdevel@vger.kernel.org, Theodore Ts'o

[-- Attachment #1: Type: text/plain, Size: 686 bytes --]

On Wed, Dec 14, 2011 at 10:00:25PM +0800, Wu Fengguang wrote:
> > The worst case happens for the USB key, where both old/new kernels
> > see ~10% worse performance for data=writeback.
>
> > ext4 ext4:wb
> > ------------------------ ------------------------
> > 6.20 -10.6% 5.54 fat/UKEY-thresh=100M/ext4-1dd-1-3.2.0-rc3-pause6+
>
> Some more comparison numbers for the above worst case.
>
> I don't see obvious differences from the balance_dirty_pages graphs,

Ah there seem to be many more blocks in write_begin(), indicated by
the more negative pause times in the attached second graph.

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-pause.png --]
[-- Type: image/png, Size: 38991 bytes --]

[-- Attachment #3: balance_dirty_pages-pause.png --]
[-- Type: image/png, Size: 53241 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: ext4 data=writeback performs worse than data=ordered now
  2011-12-14 13:34 ext4 data=writeback performs worse than data=ordered now Wu Fengguang
       [not found] ` <20111214140025.GA19650@localhost>
@ 2011-12-14 14:30 ` Ted Ts'o
  2011-12-14 14:49   ` Wu Fengguang
                     ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Ted Ts'o @ 2011-12-14 14:30 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-ext4@vger.kernel.org, Jan Kara, Li Shaohua, LKML, linux-fsdevel@vger.kernel.org

On Wed, Dec 14, 2011 at 09:34:00PM +0800, Wu Fengguang wrote:
> Hi,
>
> Shaohua recently found that ext4 writeback mode could perform worse
> than ordered mode in some cases. It may not be a big problem, however
> we'd like to share some information on our findings.
>
> I tested both 3.2 and 3.1 kernels on normal SATA disks and USB key.
> The interesting thing is, data=writeback used to run a bit faster
> than data=ordered, however situation get inverted presumably by the
> IO-less dirty throttling.

Interesting. What sort of workloads are you using to do these
measurements? How many writer threads; I assume you are doing
sequential writes which are extending one or more files, etc?

I suspect it's due to the throttling meaning that each thread is
getting to send less data to the disk, and so there is more seeking
going on with data=writeback, whereas with data=ordered, at each
journal commit we are forcing all of the dirty pages out to disk, one
inode at a time, and this is resulting in a more efficient writeback
compared to when the writeback code is getting to make its own choices
about how much each inode gets to write out at a time.

It would be interesting to see what would happen if in
ext4_da_writepages(), we completely ignore how many pages are
requested to be written back by the writeback code, and just simply
write back all of the dirty pages, and see if that brings the
performance back.

						- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-14 14:30 ` Ted Ts'o @ 2011-12-14 14:49 ` Wu Fengguang 2011-12-14 14:52 ` Tao Ma 2011-12-15 1:02 ` Shaohua Li 2 siblings, 0 replies; 15+ messages in thread From: Wu Fengguang @ 2011-12-14 14:49 UTC (permalink / raw) To: Ted Ts'o, linux-ext4@vger.kernel.org, Jan Kara, Li Shaohua, LKML, linux-fsdevel@vger.kernel.org On Wed, Dec 14, 2011 at 10:30:14PM +0800, Theodore Ts'o wrote: > On Wed, Dec 14, 2011 at 09:34:00PM +0800, Wu Fengguang wrote: > > Hi, > > > > Shaohua recently found that ext4 writeback mode could perform worse > > than ordered mode in some cases. It may not be a big problem, however > > we'd like to share some information on our findings. > > > > I tested both 3.2 and 3.1 kernels on normal SATA disks and USB key. > > The interesting thing is, data=writeback used to run a bit faster > > than data=ordered, however situation get inverted presumably by the > > IO-less dirty throttling. > > Interesting. What sort of workloads are you using to do these > measurements? How many writer threads; I assume you are doing > sequential writes which are extending one or more files, etc? Yes it's mostly simple dd's, and some fio workloads. 

The test scripts and fio jobs can be found in

        https://github.com/fengguang/writeback-tests

For example, the run_dd() in

        https://github.com/fengguang/writeback-tests/blob/master/dd-common.sh

and some fio jobs:

        https://github.com/fengguang/writeback-tests/blob/master/fio_fat_rates
        https://github.com/fengguang/writeback-tests/blob/master/fio_fat_mmap_randwrite_4k
        https://github.com/fengguang/writeback-tests/blob/master/fio_fat_mmap_randwrite_64k

The meaning of each part of the directory names:

hostname         dirty_background_bytes
|    dirty_bytes |    FS   data=writeback
|          |     |    |    |  # of dd tasks
|          |     |    |    |  |       kernel version
fat/thresh=1000M:999M/ext4:wb-100dd-1-3.1.0+
                                    |
                                    1st test run (each test can be repeated several times)

> I suspect it's due to the throttling meaning that each thread is
> getting to send less data to the disk, and so there is more seeking
> going on with data=writeback, where as with data=ordered, at each
> journal commit we are forcing all of the dirty pages out to disk, one
> inode at a time, and this is resulting in a more efficient writeback
> compared to when the writeback code is getting to make its own choices
> about how much each inode gets to write out at at time.
>
> It would be interesting to see what would happen if in
> ext4_da_writepages(), we completely ignore how many pages are
> requested to be written back by the writeback code, and just simply
> write back all of the dirty pages, and see if that brings the
> performance back.

I can provide more tracing data or test patches on your request.
But for now, I have to go to bed :-)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 15+ messages in thread
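The directory naming scheme described in the message above can be decoded mechanically. The following is a minimal sketch, not part of the writeback-tests repository: `RunName` and `parse_run_name` are hypothetical names, and the pattern only covers the plain `thresh=` dd runs, so the `fio` and `UKEY-*` directories seen in the tables are deliberately left unparsed.

```python
import re
from typing import NamedTuple, Optional


class RunName(NamedTuple):
    hostname: str
    dirty_bytes: str
    dirty_background_bytes: Optional[str]  # absent in e.g. "thresh=100M"
    fs: str
    mount_option: Optional[str]            # "wb" stands for data=writeback
    dd_tasks: int
    run: int
    kernel: str


# hostname/thresh=DIRTY[:BACKGROUND]/FS[:OPT]-<N>dd-<RUN>-<KERNEL>
_PATTERN = re.compile(
    r"^(?P<hostname>[^/]+)/"
    r"thresh=(?P<dirty>[^:/]+)(?::(?P<background>[^/]+))?/"
    r"(?P<fs>[A-Za-z0-9]+)(?::(?P<opt>[A-Za-z0-9]+))?-"
    r"(?P<ndd>\d+)dd-(?P<run>\d+)-(?P<kernel>.+)$"
)


def parse_run_name(path: str) -> Optional[RunName]:
    """Split a result directory name into its fields, or return None
    if the name does not follow the plain thresh=/dd layout."""
    m = _PATTERN.match(path)
    if m is None:
        return None
    return RunName(
        hostname=m.group("hostname"),
        dirty_bytes=m.group("dirty"),
        dirty_background_bytes=m.group("background"),
        fs=m.group("fs"),
        mount_option=m.group("opt"),
        dd_tasks=int(m.group("ndd")),
        run=int(m.group("run")),
        kernel=m.group("kernel"),
    )


if __name__ == "__main__":
    print(parse_run_name("fat/thresh=1000M:999M/ext4:wb-100dd-1-3.1.0+"))
```

Returning None for names outside the covered layout, rather than guessing, keeps the sketch honest about the directory forms whose layout the message does not spell out.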
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-14 14:30 ` Ted Ts'o 2011-12-14 14:49 ` Wu Fengguang @ 2011-12-14 14:52 ` Tao Ma 2011-12-14 15:02 ` Wu Fengguang 2011-12-15 1:02 ` Shaohua Li 2 siblings, 1 reply; 15+ messages in thread From: Tao Ma @ 2011-12-14 14:52 UTC (permalink / raw) To: Ted Ts'o, Wu Fengguang, linux-ext4@vger.kernel.org, Jan Kara, Li Shaohua, LKML, linux-fsdevel@vger.kernel.org Hi Ted/Fengguang, On 12/14/2011 10:30 PM, Ted Ts'o wrote: > On Wed, Dec 14, 2011 at 09:34:00PM +0800, Wu Fengguang wrote: >> Hi, >> >> Shaohua recently found that ext4 writeback mode could perform worse >> than ordered mode in some cases. It may not be a big problem, however >> we'd like to share some information on our findings. >> >> I tested both 3.2 and 3.1 kernels on normal SATA disks and USB key. >> The interesting thing is, data=writeback used to run a bit faster >> than data=ordered, however situation get inverted presumably by the >> IO-less dirty throttling. > > Interesting. What sort of workloads are you using to do these > measurements? How many writer threads; I assume you are doing > sequential writes which are extending one or more files, etc? > > I suspect it's due to the throttling meaning that each thread is > getting to send less data to the disk, and so there is more seeking > going on with data=writeback, where as with data=ordered, at each > journal commit we are forcing all of the dirty pages out to disk, one > inode at a time, and this is resulting in a more efficient writeback > compared to when the writeback code is getting to make its own choices > about how much each inode gets to write out at at time. > > It would be interesting to see what would happen if in > ext4_da_writepages(), we completely ignore how many pages are > requested to be written back by the writeback code, and just simply > write back all of the dirty pages, and see if that brings the > performance back. 

I guess Fengguang's test is a buffered-write dd test. Here we have found
some performance regression from 18 because of delayed allocation.
With delayed allocation, we create the extent tree during writepages,
which can delay the write: ext4_da_write_begin takes i_data_sem with
down_read to map the block, while writepages takes it with down_write,
so we have seen some severe delays in ext4_da_write_begin (around 3s).
And instead of increasing the number of pages written by every
writepages call, some tests show that decreasing it improves
performance. I will dive into it soon to see what's going on there.

So Fengguang, would you please keep the page count passed to
ext4_da_writepages by writeback (instead of bumping it) and check
the result?

Thanks
Tao

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-14 14:52 ` Tao Ma @ 2011-12-14 15:02 ` Wu Fengguang 0 siblings, 0 replies; 15+ messages in thread From: Wu Fengguang @ 2011-12-14 15:02 UTC (permalink / raw) To: Tao Ma Cc: Ted Ts'o, linux-ext4@vger.kernel.org, Jan Kara, Li, Shaohua, LKML, linux-fsdevel@vger.kernel.org On Wed, Dec 14, 2011 at 10:52:00PM +0800, Tao Ma wrote: > Hi Ted/Fengguang, > On 12/14/2011 10:30 PM, Ted Ts'o wrote: > > On Wed, Dec 14, 2011 at 09:34:00PM +0800, Wu Fengguang wrote: > >> Hi, > >> > >> Shaohua recently found that ext4 writeback mode could perform worse > >> than ordered mode in some cases. It may not be a big problem, however > >> we'd like to share some information on our findings. > >> > >> I tested both 3.2 and 3.1 kernels on normal SATA disks and USB key. > >> The interesting thing is, data=writeback used to run a bit faster > >> than data=ordered, however situation get inverted presumably by the > >> IO-less dirty throttling. > > > > Interesting. What sort of workloads are you using to do these > > measurements? How many writer threads; I assume you are doing > > sequential writes which are extending one or more files, etc? > > > > I suspect it's due to the throttling meaning that each thread is > > getting to send less data to the disk, and so there is more seeking > > going on with data=writeback, where as with data=ordered, at each > > journal commit we are forcing all of the dirty pages out to disk, one > > inode at a time, and this is resulting in a more efficient writeback > > compared to when the writeback code is getting to make its own choices > > about how much each inode gets to write out at at time. > > > > It would be interesting to see what would happen if in > > ext4_da_writepages(), we completely ignore how many pages are > > requested to be written back by the writeback code, and just simply > > write back all of the dirty pages, and see if that brings the > > performance back. 
> I guess fengguang's test is a buffer write dd test. Here we have found > some performance regression from 18 because of the delayed allocation. > In case of delayed allocation, we will create the extent tree during > writepages which would delay the write because ext4_da_write_begin would > down_read the i_data_sem to map the block while writepages would > down_write it so we have seen some severe delay in ext4_da_write_begin > (around 3s). And instead of increasing the page numbers of every > writepages, some tests shows that the decrease makes the performance > increase. I will dive into it soon to see what's going on there. > > So Fengguang, would you please keep the page number in > ext4_da_writepages passed by writeback(instead of the bumping) and check > the result? Sure, can you provide a patch for me to test? Thanks, Fengguang ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-14 14:30 ` Ted Ts'o 2011-12-14 14:49 ` Wu Fengguang 2011-12-14 14:52 ` Tao Ma @ 2011-12-15 1:02 ` Shaohua Li 2011-12-15 1:00 ` Wu Fengguang 2011-12-15 1:20 ` Darrick J. Wong 2 siblings, 2 replies; 15+ messages in thread From: Shaohua Li @ 2011-12-15 1:02 UTC (permalink / raw) To: Ted Ts'o Cc: Wu, Fengguang, linux-ext4@vger.kernel.org, Jan Kara, LKML, linux-fsdevel@vger.kernel.org On Wed, 2011-12-14 at 22:30 +0800, Ted Ts'o wrote: > On Wed, Dec 14, 2011 at 09:34:00PM +0800, Wu Fengguang wrote: > > Hi, > > > > Shaohua recently found that ext4 writeback mode could perform worse > > than ordered mode in some cases. It may not be a big problem, however > > we'd like to share some information on our findings. > > > > I tested both 3.2 and 3.1 kernels on normal SATA disks and USB key. > > The interesting thing is, data=writeback used to run a bit faster > > than data=ordered, however situation get inverted presumably by the > > IO-less dirty throttling. > > Interesting. What sort of workloads are you using to do these > measurements? How many writer threads; I assume you are doing > sequential writes which are extending one or more files, etc? > > I suspect it's due to the throttling meaning that each thread is > getting to send less data to the disk, and so there is more seeking > going on with data=writeback, where as with data=ordered, at each > journal commit we are forcing all of the dirty pages out to disk, one > inode at a time, and this is resulting in a more efficient writeback > compared to when the writeback code is getting to make its own choices > about how much each inode gets to write out at at time. > > It would be interesting to see what would happen if in > ext4_da_writepages(), we completely ignore how many pages are > requested to be written back by the writeback code, and just simply > write back all of the dirty pages, and see if that brings the > performance back. 

I saw the issue on a machine with an LSI 1068e HBA card and 12 disks.
There is about a 20% performance regression with data=writeback comparing
3.1 and 3.2-rc. With data=ordered, there is a small regression too.
Reverting the writeback changes recovers the regression in both cases.

My investigation shows the block size written to disk isn't changed with
data=writeback. The block size is still very big, 256k IIRC, which is
the max block size for the disks. And I just have one thread for each
disk, so seeking definitely isn't a problem in my workload.

I found that sometimes one disk has no requests in flight, but we can't
send requests to that disk because the scsi host's resource (the queue
depth) is used up. It looks like we send too many requests for the other
disks and leave some disks starved. The resource imbalance in scsi isn't
a new problem, even 3.1 has this issue, so I'd think writeback introduces
a new imbalance between the 12 disks. In fact, if I limit each disk's
queue depth to 10, so that the 12 disks do not impact each other in the
scsi layer, the performance regression fully disappears for both
writeback and ordered mode.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 15+ messages in thread
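The starvation argument in the message above can be checked with simple arithmetic. This is a minimal sketch under stated assumptions: a single host-wide limit on outstanding commands shared by all disks, plus a per-disk queue depth cap. The host limit of 128 and the uncapped per-disk depth of 32 are illustrative values only; neither number appears in this thread, and `others_can_fill_host_queue` is a hypothetical helper, not kernel code.

```python
def others_can_fill_host_queue(host_depth: int,
                               per_disk_depth: int,
                               n_disks: int) -> bool:
    """True if the other (n_disks - 1) disks, each at its own depth
    limit, can occupy every slot of the shared host queue and thereby
    leave one idle disk unable to receive a request."""
    return per_disk_depth * (n_disks - 1) >= host_depth


N_DISKS = 12        # from the message: 12 disks behind one HBA
HOST_DEPTH = 128    # hypothetical host-wide limit (assumption)

if __name__ == "__main__":
    for depth in (32, 10):  # 32 is hypothetical; 10 is the cap from the message
        starved = others_can_fill_host_queue(HOST_DEPTH, depth, N_DISKS)
        print(f"per-disk depth {depth}: one disk can be starved = {starved}")
```

In this model, a cap of 10 lets the other 11 disks hold at most 110 commands, so with the assumed host limit some slots always remain for an idle disk. Whether the real controller's limit makes that hold is not stated in the message.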
* Re: ext4 data=writeback performs worse than data=ordered now
  2011-12-15 1:02 ` Shaohua Li
@ 2011-12-15 1:00 ` Wu Fengguang
  2011-12-15 1:27 ` NeilBrown
  2011-12-15 1:20 ` Darrick J. Wong
  1 sibling, 1 reply; 15+ messages in thread
From: Wu Fengguang @ 2011-12-15 1:00 UTC (permalink / raw)
  To: Li, Shaohua
  Cc: Ted Ts'o, linux-ext4@vger.kernel.org, Jan Kara, LKML, linux-fsdevel@vger.kernel.org, NeilBrown, linux-raid, Jens Axboe

> I found sometimes one disk hasn't any request inflight, but we can't
> send request to the disk, because the scsi host's resource (the queue
> depth) is used out, looks we send too many requests from other disks and
> leave some disks starved. The resource imbalance in scsi isn't a new
> problem, even 3.1 has such issue, so I'd think writeback introduces new
> imbalance between the 12 disks. In fact, if I limit disk's queue depth
> to 10, in this way the 12 disks will not impact each other in scsi
> layer, the performance regression fully disappears for both writeback
> and order mode.

I observe a similar issue in MD. The default

        q->nr_requests = BLKDEV_MAX_RQ;

is too small for large arrays, and I end up doing

        echo 1280 > /sys/block/md0/queue/nr_requests

in my tests.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-15 1:00 ` Wu Fengguang @ 2011-12-15 1:27 ` NeilBrown 2011-12-15 1:34 ` Wu Fengguang 2011-12-15 5:02 ` Wu Fengguang 0 siblings, 2 replies; 15+ messages in thread From: NeilBrown @ 2011-12-15 1:27 UTC (permalink / raw) To: Wu Fengguang Cc: Li, Shaohua, Ted Ts'o, linux-ext4@vger.kernel.org, Jan Kara, LKML, linux-fsdevel@vger.kernel.org, linux-raid, Jens Axboe [-- Attachment #1: Type: text/plain, Size: 1142 bytes --] On Thu, 15 Dec 2011 09:00:10 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote: > > I found sometimes one disk hasn't any request inflight, but we can't > > send request to the disk, because the scsi host's resource (the queue > > depth) is used out, looks we send too many requests from other disks and > > leave some disks starved. The resource imbalance in scsi isn't a new > > problem, even 3.1 has such issue, so I'd think writeback introduces new > > imbalance between the 12 disks. In fact, if I limit disk's queue depth > > to 10, in this way the 12 disks will not impact each other in scsi > > layer, the performance regression fully disappears for both writeback > > and order mode. > > I observe similar issue in MD. The default > > q->nr_requests = BLKDEV_MAX_RQ; > > is too small for large arrays, and I end up doing > > echo 1280 > /sys/block/md0/queue/nr_requests > > in my tests. And you find this makes a difference? That is very surprising because md devices don't use requests (and really use the 'queue' at all) and definitely don't make use of nr_requests. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-15 1:27 ` NeilBrown @ 2011-12-15 1:34 ` Wu Fengguang 2011-12-15 5:02 ` Wu Fengguang 1 sibling, 0 replies; 15+ messages in thread From: Wu Fengguang @ 2011-12-15 1:34 UTC (permalink / raw) To: NeilBrown Cc: Li, Shaohua, Ted Ts'o, linux-ext4@vger.kernel.org, Jan Kara, LKML, linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org, Jens Axboe On Thu, Dec 15, 2011 at 09:27:59AM +0800, NeilBrown wrote: > On Thu, 15 Dec 2011 09:00:10 +0800 Wu Fengguang <fengguang.wu@intel.com> > wrote: > > > > I found sometimes one disk hasn't any request inflight, but we can't > > > send request to the disk, because the scsi host's resource (the queue > > > depth) is used out, looks we send too many requests from other disks and > > > leave some disks starved. The resource imbalance in scsi isn't a new > > > problem, even 3.1 has such issue, so I'd think writeback introduces new > > > imbalance between the 12 disks. In fact, if I limit disk's queue depth > > > to 10, in this way the 12 disks will not impact each other in scsi > > > layer, the performance regression fully disappears for both writeback > > > and order mode. > > > > I observe similar issue in MD. The default > > > > q->nr_requests = BLKDEV_MAX_RQ; > > > > is too small for large arrays, and I end up doing > > > > echo 1280 > /sys/block/md0/queue/nr_requests > > > > in my tests. > > And you find this makes a difference? > > That is very surprising because md devices don't use requests (and really use > the 'queue' at all) and definitely don't make use of nr_requests. Ah OK. Hope that I was wrong. I've just kicked off the tests to make sure. Thanks, Fengguang ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-15 1:27 ` NeilBrown 2011-12-15 1:34 ` Wu Fengguang @ 2011-12-15 5:02 ` Wu Fengguang 1 sibling, 0 replies; 15+ messages in thread From: Wu Fengguang @ 2011-12-15 5:02 UTC (permalink / raw) To: NeilBrown Cc: Li, Shaohua, Ted Ts'o, linux-ext4@vger.kernel.org, Jan Kara, LKML, linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org, Jens Axboe On Thu, Dec 15, 2011 at 09:27:59AM +0800, NeilBrown wrote: > On Thu, 15 Dec 2011 09:00:10 +0800 Wu Fengguang <fengguang.wu@intel.com> > wrote: > > > > I found sometimes one disk hasn't any request inflight, but we can't > > > send request to the disk, because the scsi host's resource (the queue > > > depth) is used out, looks we send too many requests from other disks and > > > leave some disks starved. The resource imbalance in scsi isn't a new > > > problem, even 3.1 has such issue, so I'd think writeback introduces new > > > imbalance between the 12 disks. In fact, if I limit disk's queue depth > > > to 10, in this way the 12 disks will not impact each other in scsi > > > layer, the performance regression fully disappears for both writeback > > > and order mode. > > > > I observe similar issue in MD. The default > > > > q->nr_requests = BLKDEV_MAX_RQ; > > > > is too small for large arrays, and I end up doing > > > > echo 1280 > /sys/block/md0/queue/nr_requests > > > > in my tests. > > And you find this makes a difference? > > That is very surprising because md devices don't use requests (and really use > the 'queue' at all) and definitely don't make use of nr_requests. Yes it is: /sys/block/md0/queue/nr_requests cannot be modified at all... Sorry for the noise! Fengguang ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-15 1:02 ` Shaohua Li 2011-12-15 1:00 ` Wu Fengguang @ 2011-12-15 1:20 ` Darrick J. Wong 2011-12-15 1:42 ` Shaohua Li 1 sibling, 1 reply; 15+ messages in thread From: Darrick J. Wong @ 2011-12-15 1:20 UTC (permalink / raw) To: Shaohua Li Cc: Ted Ts'o, Wu, Fengguang, linux-ext4@vger.kernel.org, Jan Kara, LKML, linux-fsdevel@vger.kernel.org On Thu, Dec 15, 2011 at 09:02:57AM +0800, Shaohua Li wrote: > On Wed, 2011-12-14 at 22:30 +0800, Ted Ts'o wrote: > > On Wed, Dec 14, 2011 at 09:34:00PM +0800, Wu Fengguang wrote: > > > Hi, > > > > > > Shaohua recently found that ext4 writeback mode could perform worse > > > than ordered mode in some cases. It may not be a big problem, however > > > we'd like to share some information on our findings. > > > > > > I tested both 3.2 and 3.1 kernels on normal SATA disks and USB key. > > > The interesting thing is, data=writeback used to run a bit faster > > > than data=ordered, however situation get inverted presumably by the > > > IO-less dirty throttling. > > > > Interesting. What sort of workloads are you using to do these > > measurements? How many writer threads; I assume you are doing > > sequential writes which are extending one or more files, etc? > > > > I suspect it's due to the throttling meaning that each thread is > > getting to send less data to the disk, and so there is more seeking > > going on with data=writeback, where as with data=ordered, at each > > journal commit we are forcing all of the dirty pages out to disk, one > > inode at a time, and this is resulting in a more efficient writeback > > compared to when the writeback code is getting to make its own choices > > about how much each inode gets to write out at at time. 
> > > > It would be interesting to see what would happen if in > > ext4_da_writepages(), we completely ignore how many pages are > > requested to be written back by the writeback code, and just simply > > write back all of the dirty pages, and see if that brings the > > performance back. > I saw the issue in a machine with a LSI 1068e HBA card and 12 disks. > there is about 20% performance regression with data=writeback comparing > 3.1 and 3.2-rc. with data=order, there is small regression too. > Reverting writeback changes recover the regression for both cases. > > My investigation shows the block size writing to disk isn't changed with > data=writeback. The block size is still very big, 256k IIRC, which is > the max block size in the disks. And I just have one thread for each > disk, so seek definitely isn't a problem in my workload. > > I found sometimes one disk hasn't any request inflight, but we can't > send request to the disk, because the scsi host's resource (the queue > depth) is used out, looks we send too many requests from other disks and > leave some disks starved. The resource imbalance in scsi isn't a new I wonder, does the patch in: http://lkml.indiana.edu/hypermail/linux/kernel/1105.3/02339.html help with this starvation problem? I noticed a similar problem and sent a patch, but LSI folks never responded. Maybe two complaining users can change that. The biggest MaxQ I've seen on LSI SAS is 511, and the driver clamps the value it passes to the SCSI layer to whatever the controller reports as its MaxQ (in /proc/mpt/summary). --D > problem, even 3.1 has such issue, so I'd think writeback introduces new > imbalance between the 12 disks. In fact, if I limit disk's queue depth > to 10, in this way the 12 disks will not impact each other in scsi > layer, the performance regression fully disappears for both writeback > and order mode. 
> > Thanks, > Shaohua
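For anyone wanting to repeat the experiment quoted above (capping each disk's queue depth at 10 so that the 12 disks cannot exhaust the host queue between them), here is a minimal Python sketch. It assumes the disks are SCSI devices exposing a writable `queue_depth` attribute under `/sys/block/<dev>/device/`. The helper names, the `sysfs_root` parameter and the device names are made up for illustration and do not come from the thread. Writing the attribute needs root.

```python
from pathlib import Path


def set_queue_depth(dev, depth, sysfs_root="/sys/block"):
    """Write `depth` to <sysfs_root>/<dev>/device/queue_depth.

    Returns the value read back afterwards, since the driver may clamp it.
    """
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    attr = Path(sysfs_root) / dev / "device" / "queue_depth"
    attr.write_text(f"{depth}\n")
    return int(attr.read_text().strip())


def cap_all(devs, depth, sysfs_root="/sys/block"):
    """Apply the same per-device cap to every device name in `devs`."""
    return {dev: set_queue_depth(dev, depth, sysfs_root) for dev in devs}
```

Something like `cap_all(["sdb", "sdc"], 10)` run as root would apply the cap of 10 used in the experiment; the device names are placeholders for whatever the 12 disks are called on the test machine.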
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-15 1:20 ` Darrick J. Wong @ 2011-12-15 1:42 ` Shaohua Li 2011-12-15 18:10 ` Darrick J. Wong 0 siblings, 1 reply; 15+ messages in thread From: Shaohua Li @ 2011-12-15 1:42 UTC (permalink / raw) To: djwong@us.ibm.com Cc: Ted Ts'o, Wu, Fengguang, linux-ext4@vger.kernel.org, Jan Kara, LKML, linux-fsdevel@vger.kernel.org On Thu, 2011-12-15 at 09:20 +0800, Darrick J. Wong wrote: > On Thu, Dec 15, 2011 at 09:02:57AM +0800, Shaohua Li wrote: > > On Wed, 2011-12-14 at 22:30 +0800, Ted Ts'o wrote: > > > On Wed, Dec 14, 2011 at 09:34:00PM +0800, Wu Fengguang wrote: > > > > Hi, > > > > > > > > Shaohua recently found that ext4 writeback mode could perform worse > > > > than ordered mode in some cases. It may not be a big problem, however > > > > we'd like to share some information on our findings. > > > > > > > > I tested both 3.2 and 3.1 kernels on normal SATA disks and USB key. > > > > The interesting thing is, data=writeback used to run a bit faster > > > > than data=ordered, however situation get inverted presumably by the > > > > IO-less dirty throttling. > > > > > > Interesting. What sort of workloads are you using to do these > > > measurements? How many writer threads; I assume you are doing > > > sequential writes which are extending one or more files, etc? > > > > > > I suspect it's due to the throttling meaning that each thread is > > > getting to send less data to the disk, and so there is more seeking > > > going on with data=writeback, where as with data=ordered, at each > > > journal commit we are forcing all of the dirty pages out to disk, one > > > inode at a time, and this is resulting in a more efficient writeback > > > compared to when the writeback code is getting to make its own choices > > > about how much each inode gets to write out at at time. 
> > > > > > It would be interesting to see what would happen if in > > > ext4_da_writepages(), we completely ignore how many pages are > > > requested to be written back by the writeback code, and just simply > > > write back all of the dirty pages, and see if that brings the > > > performance back. > > I saw the issue in a machine with a LSI 1068e HBA card and 12 disks. > > there is about 20% performance regression with data=writeback comparing > > 3.1 and 3.2-rc. with data=order, there is small regression too. > > Reverting writeback changes recover the regression for both cases. > > > > My investigation shows the block size writing to disk isn't changed with > > data=writeback. The block size is still very big, 256k IIRC, which is > > the max block size in the disks. And I just have one thread for each > > disk, so seek definitely isn't a problem in my workload. > > > > I found sometimes one disk hasn't any request inflight, but we can't > > send request to the disk, because the scsi host's resource (the queue > > depth) is used out, looks we send too many requests from other disks and > > leave some disks starved. The resource imbalance in scsi isn't a new > > I wonder, does the patch in: > http://lkml.indiana.edu/hypermail/linux/kernel/1105.3/02339.html > help with this starvation problem? I noticed a similar problem and sent a > patch, but LSI folks never responded. Maybe two complaining users can change > that. The biggest MaxQ I've seen on LSI SAS is 511, and the driver clamps the > value it passes to the SCSI layer to whatever the controller reports as its > MaxQ (in /proc/mpt/summary). this should recover the regression too. But I'm afraid it's just a workaround and will hide some issues. what if I have 120 disks instead of 12 disks? I observed one disk can burst 20 requests while the total the scsi host queue depth is 127, leaving other disks starved. I'm hoping to understand why there is such imbalance. 
Thanks, Shaohua
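The "what if I have 120 disks" concern can be put into numbers. The sketch below is not kernel code. It only assumes that every in-flight request occupies one slot of a single shared host queue, and it asks how large a per-disk cap can be before the disks are able to exhaust that queue. `fair_cap` is a name invented here. 127 is the host queue depth reported in this message, and 511 is the largest MaxQ mentioned earlier in the thread, used purely as an example.

```python
def fair_cap(host_depth, n_disks):
    """Largest per-disk queue depth such that n_disks * cap <= host_depth."""
    if n_disks < 1:
        raise ValueError("need at least one disk")
    return host_depth // n_disks


for host_depth in (127, 511):
    for n_disks in (12, 120):
        cap = fair_cap(host_depth, n_disks)
        print(f"host depth {host_depth}, {n_disks} disks: cap {cap} per disk")
```

With 12 disks and a host depth of 127 this gives 10, which matches the cap that made the regression disappear. With 120 disks the same host depth leaves a single slot per disk, and even a host depth of 511 leaves only 4, which is one way to see why a larger host queue or a fixed per-disk cap postpones the imbalance rather than explaining it.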
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-15 1:42 ` Shaohua Li @ 2011-12-15 18:10 ` Darrick J. Wong 2011-12-16 1:47 ` Shaohua Li 0 siblings, 1 reply; 15+ messages in thread From: Darrick J. Wong @ 2011-12-15 18:10 UTC (permalink / raw) To: Shaohua Li Cc: Ted Ts'o, Wu, Fengguang, linux-ext4@vger.kernel.org, Jan Kara, LKML, linux-fsdevel@vger.kernel.org On Thu, Dec 15, 2011 at 09:42:25AM +0800, Shaohua Li wrote: > On Thu, 2011-12-15 at 09:20 +0800, Darrick J. Wong wrote: > > On Thu, Dec 15, 2011 at 09:02:57AM +0800, Shaohua Li wrote: > > > On Wed, 2011-12-14 at 22:30 +0800, Ted Ts'o wrote: > > > > On Wed, Dec 14, 2011 at 09:34:00PM +0800, Wu Fengguang wrote: > > > > > Hi, > > > > > > > > > > Shaohua recently found that ext4 writeback mode could perform worse > > > > > than ordered mode in some cases. It may not be a big problem, however > > > > > we'd like to share some information on our findings. > > > > > > > > > > I tested both 3.2 and 3.1 kernels on normal SATA disks and USB key. > > > > > The interesting thing is, data=writeback used to run a bit faster > > > > > than data=ordered, however situation get inverted presumably by the > > > > > IO-less dirty throttling. > > > > > > > > Interesting. What sort of workloads are you using to do these > > > > measurements? How many writer threads; I assume you are doing > > > > sequential writes which are extending one or more files, etc? > > > > > > > > I suspect it's due to the throttling meaning that each thread is > > > > getting to send less data to the disk, and so there is more seeking > > > > going on with data=writeback, where as with data=ordered, at each > > > > journal commit we are forcing all of the dirty pages out to disk, one > > > > inode at a time, and this is resulting in a more efficient writeback > > > > compared to when the writeback code is getting to make its own choices > > > > about how much each inode gets to write out at at time. 
> > > > > > > > It would be interesting to see what would happen if in > > > > ext4_da_writepages(), we completely ignore how many pages are > > > > requested to be written back by the writeback code, and just simply > > > > write back all of the dirty pages, and see if that brings the > > > > performance back. > > > I saw the issue in a machine with a LSI 1068e HBA card and 12 disks. > > > there is about 20% performance regression with data=writeback comparing > > > 3.1 and 3.2-rc. with data=order, there is small regression too. > > > Reverting writeback changes recover the regression for both cases. > > > > > > My investigation shows the block size writing to disk isn't changed with > > > data=writeback. The block size is still very big, 256k IIRC, which is > > > the max block size in the disks. And I just have one thread for each > > > disk, so seek definitely isn't a problem in my workload. > > > > > > I found sometimes one disk hasn't any request inflight, but we can't > > > send request to the disk, because the scsi host's resource (the queue > > > depth) is used out, looks we send too many requests from other disks and > > > leave some disks starved. The resource imbalance in scsi isn't a new > > > > I wonder, does the patch in: > > http://lkml.indiana.edu/hypermail/linux/kernel/1105.3/02339.html > > help with this starvation problem? I noticed a similar problem and sent a > > patch, but LSI folks never responded. Maybe two complaining users can change > > that. The biggest MaxQ I've seen on LSI SAS is 511, and the driver clamps the > > value it passes to the SCSI layer to whatever the controller reports as its > > MaxQ (in /proc/mpt/summary). > this should recover the regression too. But I'm afraid it's just a > workaround and will hide some issues. what if I have 120 disks instead > of 12 disks? I observed one disk can burst 20 requests while the total > the scsi host queue depth is 127, leaving other disks starved. 
I'm > hoping to understand why there is such imbalance. <shrug> I didn't say it would /fix/ the imbalanced-starvation problem, but we might as well take full advantage of the hardware. Even if all it does is enable the user to plug in more disks before things get whacky, I was hoping that someone else could at least give it a spin and say "Yes, this does what it's alleged to do, and without breaking things". :) afaict SCSI doesn't try to balance requests heading towards the HBA; it's all FCFS. --D > > Thanks, > Shaohua
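To make the FCFS remark concrete, here is a toy Python model of first-come-first-served admission into a shared host queue. It is a deliberate simplification and not how the SCSI midlayer is implemented: requests are admitted strictly in arrival order until `host_depth` slots are taken, nothing completes in the meantime, and the arrival pattern (each of 12 disks submitting 20 requests back to back) is an assumption chosen to resemble the burst described in the thread. The function name `admit` is made up.

```python
from collections import Counter


def admit(arrivals, host_depth, per_disk_cap=None):
    """Admit requests in arrival order until the host queue is full.

    `arrivals` is a sequence of disk ids, one entry per request.
    Returns a Counter of in-flight requests per disk.
    """
    inflight = Counter()
    for disk in arrivals:
        if sum(inflight.values()) >= host_depth:
            break  # host queue exhausted, every later request has to wait
        if per_disk_cap is not None and inflight[disk] >= per_disk_cap:
            continue  # this disk is at its cap, leave the slot for others
        inflight[disk] += 1
    return inflight


# 12 disks, each submitting a burst of 20 requests back to back.
arrivals = [disk for disk in range(12) for _ in range(20)]

fcfs = admit(arrivals, host_depth=127)
capped = admit(arrivals, host_depth=127, per_disk_cap=10)

print(sum(1 for d in range(12) if fcfs[d] == 0))    # 5 disks have nothing in flight
print(sum(1 for d in range(12) if capped[d] == 0))  # 0 disks have nothing in flight
```

In this model the first six disks take 20 slots each and the seventh takes the remaining 7, so five disks are left with nothing in flight. With a per-disk cap of 10, all 12 disks get 10 slots and 7 host slots stay free.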
* Re: ext4 data=writeback performs worse than data=ordered now 2011-12-15 18:10 ` Darrick J. Wong @ 2011-12-16 1:47 ` Shaohua Li 0 siblings, 0 replies; 15+ messages in thread From: Shaohua Li @ 2011-12-16 1:47 UTC (permalink / raw) To: djwong@us.ibm.com Cc: Ted Ts'o, Wu, Fengguang, linux-ext4@vger.kernel.org, Jan Kara, LKML, linux-fsdevel@vger.kernel.org On Fri, 2011-12-16 at 02:10 +0800, Darrick J. Wong wrote: > On Thu, Dec 15, 2011 at 09:42:25AM +0800, Shaohua Li wrote: > > On Thu, 2011-12-15 at 09:20 +0800, Darrick J. Wong wrote: > > > On Thu, Dec 15, 2011 at 09:02:57AM +0800, Shaohua Li wrote: > > > > On Wed, 2011-12-14 at 22:30 +0800, Ted Ts'o wrote: > > > > > On Wed, Dec 14, 2011 at 09:34:00PM +0800, Wu Fengguang wrote: > > > > > > Hi, > > > > > > > > > > > > Shaohua recently found that ext4 writeback mode could perform worse > > > > > > than ordered mode in some cases. It may not be a big problem, however > > > > > > we'd like to share some information on our findings. > > > > > > > > > > > > I tested both 3.2 and 3.1 kernels on normal SATA disks and USB key. > > > > > > The interesting thing is, data=writeback used to run a bit faster > > > > > > than data=ordered, however situation get inverted presumably by the > > > > > > IO-less dirty throttling. > > > > > > > > > > Interesting. What sort of workloads are you using to do these > > > > > measurements? How many writer threads; I assume you are doing > > > > > sequential writes which are extending one or more files, etc? 
> > > > > > > > > > I suspect it's due to the throttling meaning that each thread is > > > > > getting to send less data to the disk, and so there is more seeking > > > > > going on with data=writeback, where as with data=ordered, at each > > > > > journal commit we are forcing all of the dirty pages out to disk, one > > > > > inode at a time, and this is resulting in a more efficient writeback > > > > > compared to when the writeback code is getting to make its own choices > > > > > about how much each inode gets to write out at at time. > > > > > > > > > > It would be interesting to see what would happen if in > > > > > ext4_da_writepages(), we completely ignore how many pages are > > > > > requested to be written back by the writeback code, and just simply > > > > > write back all of the dirty pages, and see if that brings the > > > > > performance back. > > > > I saw the issue in a machine with a LSI 1068e HBA card and 12 disks. > > > > there is about 20% performance regression with data=writeback comparing > > > > 3.1 and 3.2-rc. with data=order, there is small regression too. > > > > Reverting writeback changes recover the regression for both cases. > > > > > > > > My investigation shows the block size writing to disk isn't changed with > > > > data=writeback. The block size is still very big, 256k IIRC, which is > > > > the max block size in the disks. And I just have one thread for each > > > > disk, so seek definitely isn't a problem in my workload. > > > > > > > > I found sometimes one disk hasn't any request inflight, but we can't > > > > send request to the disk, because the scsi host's resource (the queue > > > > depth) is used out, looks we send too many requests from other disks and > > > > leave some disks starved. The resource imbalance in scsi isn't a new > > > > > > I wonder, does the patch in: > > > http://lkml.indiana.edu/hypermail/linux/kernel/1105.3/02339.html > > > help with this starvation problem? 
I noticed a similar problem and sent a > > > patch, but LSI folks never responded. Maybe two complaining users can change > > > that. The biggest MaxQ I've seen on LSI SAS is 511, and the driver clamps the > > > value it passes to the SCSI layer to whatever the controller reports as its > > > MaxQ (in /proc/mpt/summary). > > this should recover the regression too. But I'm afraid it's just a > > workaround and will hide some issues. what if I have 120 disks instead > > of 12 disks? I observed one disk can burst 20 requests while the total > > the scsi host queue depth is 127, leaving other disks starved. I'm > > hoping to understand why there is such imbalance. > > <shrug> I didn't say it would /fix/ the imbalanced-starvation problem, but we > might as well take full advantage of the hardware. Even if all it does is > enable the user to plug in more disks before things get whacky, I was hoping > that someone else could at least give it a spin and say "Yes, this does what > it's alleged to do, and without breaking things". :) Ok, I tested your patch, it works. So next time you repost the patch, you can add my Tested-by: Shaohua Li <shaohua.li@intel.com> > afaict SCSI doesn't try to balance requests heading towards the HBA; it's all > FCFS. The scsi starvation list tries to do the balance, but apparently not enough. Thanks, Shaohua