* [PATCH 00/13] IO-less dirty throttling
@ 2010-11-17 3:58 Wu Fengguang
2010-11-17 7:25 ` Dave Chinner
0 siblings, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2010-11-17 3:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Theodore Ts'o,
Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, Wu Fengguang, linux-mm, linux-fsdevel, LKML
Andrew,
This is a revised subset of "[RFC] soft and dynamic dirty throttling limits"
<http://thread.gmane.org/gmane.linux.kernel.mm/52966>.
The basic idea is to introduce a small region under the bdi dirty threshold.
The task will be throttled gently when stepping into the bottom of the region,
and get throttled more and more aggressively as the bdi dirty+writeback pages
climb towards the top of the region. At some point the application will be
throttled at just the right bandwidth to balance against the device write
bandwidth. (The first patch and the documentation have more details.)
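To give a feel for the shape of the control, here is a minimal userspace
sketch of the idea. It is an illustration only, not the code in the patch:
the region size (the last 1/8 below the bdi threshold) and the numbers in
main() are made up.

/*
 * Model of the soft throttle region: below the bottom of the region the
 * task is effectively unthrottled; inside the region its allowed dirty
 * bandwidth falls linearly, reaching zero at the bdi dirty threshold.
 */
#include <stdio.h>

/* pages/s the task may dirty at the given dirty+writeback level */
static double throttle_bandwidth(double dirty, double thresh, double write_bw)
{
        double bottom = thresh * 7 / 8;         /* assumed region size */

        if (dirty <= bottom)
                return write_bw * 8;            /* effectively unthrottled */
        if (dirty >= thresh)
                return 0;                       /* hard stop at the threshold */
        /* scale down linearly inside the region */
        return write_bw * (thresh - dirty) / (thresh - bottom);
}

int main(void)
{
        double thresh = 100000, write_bw = 25000;       /* pages, pages/s */
        double dirty;

        for (dirty = 85000; dirty <= 100000; dirty += 2500)
                printf("dirty=%6.0f  allowed=%8.0f pages/s\n",
                       dirty, throttle_bandwidth(dirty, thresh, write_bw));
        return 0;
}

With many tasks dirtying at once, each one settles at the pause length that
makes its own dirty rate match its share of the allowed bandwidth.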
Changes from initial RFC:
- adaptive ratelimiting, to reduce overheads when under the throttle threshold (a rough sketch of the idea follows this list)
- prevent overrunning dirty limit on lots of concurrent dirtiers
- add Documentation/filesystems/writeback-throttling-design.txt
- lower max pause time from 200ms to 100ms; min pause time from 10ms to 1 jiffy
- don't drop the laptop mode code
- update and comment the trace event
- benchmarks on concurrent dd and fs_mark covering both large and tiny files
- bdi->write_bandwidth updates should be rate limited on concurrent dirtiers,
otherwise it will drift fast and fluctuate
- don't call balance_dirty_pages_ratelimited() when writing to already-dirtied
pages, otherwise the task will be throttled too much
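As mentioned in the "adaptive ratelimiting" item above, each task only enters
balance_dirty_pages() once every "ratelimit" dirtied pages, and the ratelimit
is kept large while well below the throttle threshold, so the common case stays
cheap. A rough sketch of that idea (illustration only; the names and numbers
are made up, the real logic is in the per-task rate limit patch):

#include <stdbool.h>

struct dirtier {
        unsigned long nr_dirtied;       /* pages dirtied since the last check */
        unsigned long ratelimit;        /* enter the slow path every N pages */
};

static bool should_balance(struct dirtier *t, unsigned long nr_pages,
                           bool near_dirty_threshold)
{
        /* adapt: check rarely when far from the threshold, often when near */
        t->ratelimit = near_dirty_threshold ? 8 : 1024;

        t->nr_dirtied += nr_pages;
        if (t->nr_dirtied < t->ratelimit)
                return false;           /* fast path: no throttling work */

        t->nr_dirtied = 0;
        return true;                    /* time to call balance_dirty_pages() */
}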
The patches are based on 2.6.37-rc2 and Jan's sync livelock patches. For easier
access I put them in
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v2
Wu Fengguang (12):
writeback: IO-less balance_dirty_pages()
writeback: consolidate variable names in balance_dirty_pages()
writeback: per-task rate limit on balance_dirty_pages()
writeback: prevent duplicate balance_dirty_pages_ratelimited() calls
writeback: bdi write bandwidth estimation
writeback: show bdi write bandwidth in debugfs
writeback: quit throttling when bdi dirty pages dropped
writeback: reduce per-bdi dirty threshold ramp up time
writeback: make reasonable gap between the dirty/background thresholds
writeback: scale down max throttle bandwidth on concurrent dirtiers
writeback: add trace event for balance_dirty_pages()
writeback: make nr_to_write a per-file limit
Jan Kara (1):
writeback: account per-bdi accumulated written pages
.../filesystems/writeback-throttling-design.txt | 210 +++++++++++++
fs/fs-writeback.c | 16 +
include/linux/backing-dev.h | 3 +
include/linux/sched.h | 7 +
include/linux/writeback.h | 14 +
include/trace/events/writeback.h | 61 ++++-
mm/backing-dev.c | 29 +-
mm/filemap.c | 5 +-
mm/memory_hotplug.c | 3 -
mm/page-writeback.c | 320 +++++++++++---------
10 files changed, 511 insertions(+), 157 deletions(-)
It runs smoothly on typical configurations. On small-memory systems the pause
time will fluctuate much more due to the limited range available for soft
throttling.
The soft dirty threshold is now lowered to (background + dirty)/2 = 15% (with
the default 10%/20% ratios). So applications will be throttled a bit earlier,
which may be perceived by end users as a performance "slow down" if their
application happens to dirty a bit more than 15%. Note that the vanilla kernel
also has this limit right after boot: it starts checking bdi limits when
exceeding the global 15%, but the bdi limit ramps up pretty slowly in common
configurations, so the task is throttled immediately anyway.
The task's think time is not yet taken into account when computing the pause
time, so an "scp" over the network will be throttled much harder than a local
"cp". Once we take the user-space think time into account and ensure an
accurate throttle bandwidth, we will effectively have a simple write I/O
bandwidth controller.
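To make the think-time idea concrete, here is a purely hypothetical sketch of
such compensation; it is not part of this series, and the 100ms cap simply
mirrors the max pause time above.

#include <stdio.h>

/*
 * Hypothetical think-time compensation: the pause that
 * balance_dirty_pages() would request is reduced by the time the task
 * already spent away from dirtying (its "think time"), so a slow
 * dirtier like scp pays little or no extra sleep.
 */
static long compensated_pause(long period_ms, long think_ms)
{
        long pause = period_ms - think_ms;

        if (pause < 0)
                pause = 0;      /* the task already "paid" its pause */
        if (pause > 100)
                pause = 100;    /* cap at the max pause time */
        return pause;
}

int main(void)
{
        /* local cp: dirties continuously, think time ~0 */
        printf("cp : pause %ld ms\n", compensated_pause(100, 0));
        /* scp over a slow network: long gaps between writes */
        printf("scp: pause %ld ms\n", compensated_pause(100, 90));
        return 0;
}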
On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
improves IO throughput from 38MB/s to 42MB/s.
The fs_mark benchmark is interesting. The CPU overheads are reduced by almost
half. Before the patches the benchmark is actually CPU bound; after the patches
it's IO bound, but strangely the throughput becomes slightly lower.
# ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 -d /mnt/scratch/8 -d /mnt/scratch/9 -d /mnt/scratch/10 -d /mnt/scratch/11
# Version 3.3, 12 thread(s) starting at Thu Nov 11 21:01:36 2010
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 1 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
#
2.6.36
FSUse% Count Size Files/sec App Overhead
0 1200000 1 1261.7 524762513
0 2400000 1 1195.3 537844546
0 3600000 1 1231.9 496441566
1 4800000 1 1175.8 552421522
1 6000000 1 1191.6 558529735
1 7200000 1 1165.3 551178395
2 8400000 1 1175.0 533209632
2 9600000 1 1200.6 534862246
2 10800000 1 1181.2 540616486
2 12000000 1 1137.4 554551797
3 13200000 1 1143.7 563319651
3 14400000 1 1169.0 519527533
3 15600000 1 1184.0 533550370
4 16800000 1 1161.3 534358727
4 18000000 1 1193.4 521610050
4 19200000 1 1177.6 524117437
5 20400000 1 1172.6 506166634
5 21600000 1 1172.3 515725633
avg 1182.761 533488581.833
2.6.36+
FSUse% Count Size Files/sec App Overhead
0 1200000 1 1125.0 357885976
0 2400000 1 1155.6 288103795
0 3600000 1 1172.4 296521755
1 4800000 1 1136.0 301718887
1 6000000 1 1156.7 303605077
1 7200000 1 1102.9 288852150
2 8400000 1 1140.9 294894485
2 9600000 1 1148.0 314394450
2 10800000 1 1099.7 296365560
2 12000000 1 1153.6 316283083
3 13200000 1 1087.9 339988006
3 14400000 1 1183.9 270836344
3 15600000 1 1122.7 276400918
4 16800000 1 1132.1 285272223
4 18000000 1 1154.8 283424055
4 19200000 1 1202.5 294558877
5 20400000 1 1158.1 293971332
5 21600000 1 1159.4 287720335
5 22800000 1 1150.1 282987509
5 24000000 1 1150.7 283870613
6 25200000 1 1123.8 288094185
6 26400000 1 1152.1 296984323
6 27600000 1 1190.7 282403174
7 28800000 1 1088.6 290493643
7 30000000 1 1144.1 290311419
7 31200000 1 1186.0 290021271
7 32400000 1 1213.9 279465138
8 33600000 1 1117.3 275745401
avg 1146.768 294684785.143
I noticed that
1) BdiWriteback can grow very large. For example, bdi 8:16 has 72960 kB of
writeback pages, while the disk IO queue can hold at most
nr_requests * max_sectors_kb = 128 * 512 kB = 64 MB of writeback pages. Maybe
xfs manages to create perfectly sequential layouts and writes, and the other
~8 MB of writeback pages are flying inside the disk?
root@wfg-ne02 /cc/fs_mark-3.3/ne02-2.6.36+# g BdiWriteback /debug/bdi/8:*/*
/debug/bdi/8:0/stats:BdiWriteback: 0 kB
/debug/bdi/8:112/stats:BdiWriteback: 68352 kB
/debug/bdi/8:128/stats:BdiWriteback: 62336 kB
/debug/bdi/8:144/stats:BdiWriteback: 61824 kB
/debug/bdi/8:160/stats:BdiWriteback: 67328 kB
/debug/bdi/8:16/stats:BdiWriteback: 72960 kB
/debug/bdi/8:176/stats:BdiWriteback: 57984 kB
/debug/bdi/8:192/stats:BdiWriteback: 71936 kB
/debug/bdi/8:32/stats:BdiWriteback: 68352 kB
/debug/bdi/8:48/stats:BdiWriteback: 56704 kB
/debug/bdi/8:64/stats:BdiWriteback: 50304 kB
/debug/bdi/8:80/stats:BdiWriteback: 68864 kB
/debug/bdi/8:96/stats:BdiWriteback: 2816 kB
2) The 12 disks are not all 100% utilized. Not even close: sdd, sdf, sdh and
sdj are almost idle at the moment. Dozens of seconds later, some other disks
become idle. This happens both before and after the patches. There may be some
hidden bugs (unrelated to this patchset).
avg-cpu: %user %nice %system %iowait %steal %idle
0.17 0.00 97.87 1.08 0.00 0.88
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 63.00 0.00 125.00 0.00 1909.33 30.55 3.88 31.65 6.57 82.13
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 19.00 0.00 112.00 0.00 1517.17 27.09 3.95 35.33 8.00 89.60
sdg 0.00 92.67 0.33 126.00 2.67 1773.33 28.12 14.83 120.78 7.73 97.60
sdf 0.00 32.33 0.00 91.67 0.00 1408.17 30.72 4.84 52.97 7.72 70.80
sdh 0.00 17.67 0.00 5.00 0.00 124.00 49.60 0.07 13.33 9.60 4.80
sdi 0.00 44.67 0.00 5.00 0.00 253.33 101.33 0.15 29.33 10.93 5.47
sdl 0.00 168.00 0.00 135.67 0.00 2216.33 32.67 6.41 45.42 5.75 78.00
sdk 0.00 225.00 0.00 123.00 0.00 2355.83 38.31 9.50 73.03 6.94 85.33
sdj 0.00 1.00 0.00 2.33 0.00 26.67 22.86 0.01 2.29 1.71 0.40
sdb 0.00 14.33 0.00 101.67 0.00 1278.00 25.14 2.02 19.95 7.16 72.80
sdm 0.00 150.33 0.00 144.33 0.00 2344.50 32.49 5.43 33.94 5.39 77.73
avg-cpu: %user %nice %system %iowait %steal %idle
0.12 0.00 98.63 0.83 0.00 0.42
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 105.67 0.00 127.33 0.00 1810.17 28.43 4.39 32.43 6.67 84.93
sdd 0.00 5.33 0.00 10.67 0.00 128.00 24.00 0.03 2.50 1.25 1.33
sde 0.00 180.33 0.33 107.67 2.67 2109.33 39.11 8.11 73.93 8.99 97.07
sdg 0.00 7.67 0.00 63.67 0.00 1387.50 43.59 1.45 24.29 11.08 70.53
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 62.67 0.00 94.67 0.00 1743.50 36.83 3.28 34.68 8.52 80.67
sdl 0.00 162.00 0.00 141.67 0.00 2295.83 32.41 7.09 51.79 6.14 86.93
sdk 0.00 34.33 0.00 143.67 0.00 1910.17 26.59 5.07 38.90 6.26 90.00
sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 195.00 0.00 96.67 0.00 1949.50 40.33 5.54 57.23 8.39 81.07
sdm 0.00 155.00 0.00 143.00 0.00 2357.50 32.97 5.21 39.98 5.71 81.60
Thanks,
Fengguang
* Re: [PATCH 00/13] IO-less dirty throttling
2010-11-17 3:58 [PATCH 00/13] IO-less dirty throttling Wu Fengguang
@ 2010-11-17 7:25 ` Dave Chinner
2010-11-17 10:06 ` Wu Fengguang
0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-11-17 7:25 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Theodore Ts'o,
Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML
On Wed, Nov 17, 2010 at 11:58:21AM +0800, Wu Fengguang wrote:
> Andrew,
>
> This is a revised subset of "[RFC] soft and dynamic dirty throttling limits"
> <http://thread.gmane.org/gmane.linux.kernel.mm/52966>.
>
> The basic idea is to introduce a small region under the bdi dirty threshold.
> The task will be throttled gently when stepping into the bottom of region,
> and get throttled more and more aggressively as bdi dirty+writeback pages
> goes up closer to the top of region. At some point the application will be
> throttled at the right bandwidth that balances with the device write bandwidth.
> (the first patch and documentation has more details)
>
> Changes from initial RFC:
>
> - adaptive ratelimiting, to reduce overheads when under throttle threshold
> - prevent overrunning dirty limit on lots of concurrent dirtiers
> - add Documentation/filesystems/writeback-throttling-design.txt
> - lower max pause time from 200ms to 100ms; min pause time from 10ms to 1jiffy
> - don't drop the laptop mode code
> - update and comment the trace event
> - benchmarks on concurrent dd and fs_mark covering both large and tiny files
> - bdi->write_bandwidth updates should be rate limited on concurrent dirtiers,
> otherwise it will drift fast and fluctuate
> - don't call balance_dirty_pages_ratelimit() when writing to already dirtied
> pages, otherwise the task will be throttled too much
>
> The patches are based on 2.6.37-rc2 and Jan's sync livelock patches. For easier
> access I put them in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v2
Great - just pulled it down and I'll start running some tests.
The tree that I'm testing has the vfs inode lock breakup, the
inode cache SLAB_DESTROY_BY_RCU series, a large bunch of XFS lock
breakup patches, and now the above branch. It's here:
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git working
> On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
> improves IO throughput from 38MB/s to 42MB/s.
Excellent - I suspect that the reduction in contention on the inode
writeback locks is responsible for dropping the CPU usage right down.
I'm seeing throughput for a _single_ large dd (100GB) increase from ~650MB/s
to 700MB/s with your series. For other numbers of dd's:
                                           ctx switches
 # dd processes   total throughput        total      per proc
        1             700MB/s               400/s      100/s
        2             700MB/s               500/s      100/s
        4             700MB/s               700/s      100/s
        8             690MB/s             1,100/s      100/s
       16             675MB/s             2,000/s      110/s
       32             675MB/s             5,000/s      150/s
      100             650MB/s            22,000/s      210/s
     1000             600MB/s           160,000/s      160/s
A couple of things I noticed - firstly, the number of context
switches scales roughly with the number of writing processes - is
there any reason for waking every writer 100-200 times a second? At
the thousand writer mark, we reach a context switch rate of more
than one per page we complete IO on. Any idea on whether this can be
improved at all?
Also, the system CPU usage while throttling stayed quite low but not
constant. The more writing processes, the lower the system CPU usage
(despite the increase in context switches). Further, if the dd's
didn't all start at the same time, then system CPU usage would
roughly double when the first dd's complete and cpu usage stayed
high until all the writers completed. So there's some trigger when
writers finish/exit there that is changing throttle behaviour.
Increasing the number of writers does not seem to have any adverse
effects.
BTW, killing a thousand dd's all stuck on the throttle is near
instantaneous. ;)
> The fs_mark benchmark is interesting. The CPU overheads are almost reduced by
> half. Before patch the benchmark is actually bounded by CPU. After patch it's
> IO bound, but strangely the throughput becomes slightly slower.
The "App Overhead" that is measured by fs_mark is the time it spends
doing stuff in userspace rather than in syscalls. Changes in the app
overhead typically imply a change in syscall CPU cache footprint. A
substantial reduction in app overhead for the same amount of work
is good. :)
[cut-n-paste from your comment about being io bound below]
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.17 0.00 97.87 1.08 0.00 0.88
That looks CPU bound, not IO bound.
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdc 0.00 63.00 0.00 125.00 0.00 1909.33 30.55 3.88 31.65 6.57 82.13
> sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sde 0.00 19.00 0.00 112.00 0.00 1517.17 27.09 3.95 35.33 8.00 89.60
> sdg 0.00 92.67 0.33 126.00 2.67 1773.33 28.12 14.83 120.78 7.73 97.60
> sdf 0.00 32.33 0.00 91.67 0.00 1408.17 30.72 4.84 52.97 7.72 70.80
> sdh 0.00 17.67 0.00 5.00 0.00 124.00 49.60 0.07 13.33 9.60 4.80
> sdi 0.00 44.67 0.00 5.00 0.00 253.33 101.33 0.15 29.33 10.93 5.47
> sdl 0.00 168.00 0.00 135.67 0.00 2216.33 32.67 6.41 45.42 5.75 78.00
> sdk 0.00 225.00 0.00 123.00 0.00 2355.83 38.31 9.50 73.03 6.94 85.33
> sdj 0.00 1.00 0.00 2.33 0.00 26.67 22.86 0.01 2.29 1.71 0.40
> sdb 0.00 14.33 0.00 101.67 0.00 1278.00 25.14 2.02 19.95 7.16 72.80
> sdm 0.00 150.33 0.00 144.33 0.00 2344.50 32.49 5.43 33.94 5.39 77.73
And that's totalling ~1000 iops during the workload - you're right
in that it doesn't look at all well balanced. The device my test
filesystem is on is running at ~15,000 iops and 120MB/s for the same
workload, but there is another layer of reordering on the host as
well as 512MB of BBWC between the host and the spindles, so maybe
you won't be able to get near that number with your setup....
[.....]
> avg 1182.761 533488581.833
>
> 2.6.36+
> FSUse% Count Size Files/sec App Overhead
....
> avg 1146.768 294684785.143
The difference between the files/s numbers is pretty much within
typical variation of the benchmark. I tend to time the running of
the entire benchmark because the files/s output does not include the
"App Overhead" time and hence you can improve files/s but increase
the app overhead and the overall wall time can be significantly
slower...
FWIW, I'd consider the throughput (1200 files/s) to be quite low for 12
disks and a number of CPUs being active. I'm not sure how you
configured the storage/filesystem, but you should configure the
filesystem with at least 2x as many AGs as there are CPUs, and run
one create thread per CPU rather than one per disk. Also, making
sure you have a largish log (512MB in this case) is helpful, too.
For example, I've got a simple RAID0 of 12 disks that is 1.1TB in
size when I stripe the outer 10% of the drives together (or 18TB if
I stripe the larger inner partitions on the disks). The way I
normally run it (on an 8p/4GB RAM VM) is:
In the host:
$ cat dmtab.fast.12drive
0 2264924160 striped 12 1024 /dev/sdb1 0 /dev/sdc1 0 /dev/sdd1 0 /dev/sde1 0 /dev/sdf1 0 /dev/sdg1 0 /dev/sdh1 0 /dev/sdi1 0 /dev/sdj1 0 /dev/sdk1 0 /dev/sdl1 0 /dev/sdm1 0
$ sudo dmsetup create fast dmtab.fast.12drive
$ sudo mount -o nobarrier,logbsize=262144,delaylog,inode64 /dev/mapper/fast /mnt/fast
[VM creation script uses fallocate to preallocate 1.1TB file as raw
disk image inside /mnt/fast, appears to guest as /dev/vdb]
In the VM:
# mkfs.xfs -f -l size=131072b -d agcount=16 /dev/vdb
....
# mount -o nobarrier,inode64,delaylog,logbsize=262144 /dev/vdb /mnt/scratch
# /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 \
> -d /mnt/scratch/0 -d /mnt/scratch/1 \
> -d /mnt/scratch/2 -d /mnt/scratch/3 \
> -d /mnt/scratch/4 -d /mnt/scratch/5 \
> -d /mnt/scratch/6 -d /mnt/scratch/7
# ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
# Version 3.3, 8 thread(s) starting at Wed Nov 17 15:27:33 2010
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 1 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead
0 800000 1 27825.7 11686554
0 1600000 1 22650.2 13199876
1 2400000 1 23606.3 12297973
1 3200000 1 23060.5 12474339
1 4000000 1 22677.4 12731120
2 4800000 1 23095.7 12142813
2 5600000 1 22639.2 12813812
2 6400000 1 23447.1 12330158
3 7200000 1 22775.8 12548811
3 8000000 1 22766.5 12169732
3 8800000 1 21685.5 12546771
4 9600000 1 22899.5 12544273
4 10400000 1 22950.7 12894856
.....
The above numbers are without your patch series. The following
numbers are with your patch series:
FSUse% Count Size Files/sec App Overhead
0 800000 1 26163.6 10492957
0 1600000 1 21960.4 10431605
1 2400000 1 22099.2 10971110
1 3200000 1 22052.1 10470168
1 4000000 1 21264.4 10398188
2 4800000 1 21815.3 10445699
2 5600000 1 21557.6 10504866
2 6400000 1 21856.0 10421309
3 7200000 1 21853.5 10613164
3 8000000 1 21309.4 10642358
3 8800000 1 22130.8 10457972
.....
Ok, so throughput is also down by ~5% from ~23k files/s to ~22k
files/s. On the plus side:
avg-cpu: %user %nice %system %iowait %steal %idle
1.91 0.00 43.45 46.56 0.00 8.08
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
vda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
vdb 0.00 12022.20 1.60 11431.60 0.01 114.09 20.44 32.34 2.82 0.08 94.64
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The number of write IOs has dropped significantly and CPU usage is
more than halved - this was running at ~98% system time! So for a
~5% throughput reduction, CPU usage has dropped by ~55% and the
number of write IOs has dropped by ~25%. That's a pretty good
result - it's the single biggest drop in CPU usage as a result of
preventing lock contention I've seen on an 8p machine in the past 6
months. Very promising - I guess it's time to look at the code again. :)
Hmmm - looks like the probable bottleneck is that the flusher thread
is close to CPU bound:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2215 root 20 0 0 0 0 R 86 0.0 2:16.43 flush-253:16
samples pcnt function DSO
_______ _____ _______________________________ _________________
32436.00 5.8% _xfs_buf_find [kernel.kallsyms]
26119.00 4.7% kmem_cache_alloc [kernel.kallsyms]
17700.00 3.2% __ticket_spin_lock [kernel.kallsyms]
14592.00 2.6% xfs_log_commit_cil [kernel.kallsyms]
14341.00 2.6% _raw_spin_unlock_irqrestore [kernel.kallsyms]
12537.00 2.2% __kmalloc [kernel.kallsyms]
12098.00 2.2% writeback_single_inode [kernel.kallsyms]
12078.00 2.2% xfs_iunlock [kernel.kallsyms]
10712.00 1.9% redirty_tail [kernel.kallsyms]
10706.00 1.9% __make_request [kernel.kallsyms]
10469.00 1.9% bit_waitqueue [kernel.kallsyms]
10107.00 1.8% kfree [kernel.kallsyms]
10028.00 1.8% _cond_resched [kernel.kallsyms]
9244.00 1.7% xfs_fs_write_inode [kernel.kallsyms]
8759.00 1.6% xfs_iflush_cluster [kernel.kallsyms]
7944.00 1.4% queue_io [kernel.kallsyms]
7924.00 1.4% radix_tree_gang_lookup_tag_slot [kernel.kallsyms]
7468.00 1.3% kmem_cache_free [kernel.kallsyms]
7454.00 1.3% xfs_bmapi [kernel.kallsyms]
7149.00 1.3% writeback_sb_inodes [kernel.kallsyms]
5882.00 1.1% xfs_btree_lookup [kernel.kallsyms]
5811.00 1.0% __memcpy [kernel.kallsyms]
5446.00 1.0% xfs_alloc_ag_vextent_near [kernel.kallsyms]
5346.00 1.0% xfs_trans_buf_item_match [kernel.kallsyms]
4704.00 0.8% xfs_perag_get [kernel.kallsyms]
That's looking like it's XFS overhead flushing inodes, so that's not
an issue caused by this patch. Indeed, I'm used to seeing 30-40% of
the CPU time here in __ticket_spin_lock, so it certainly appears
that most of the CPU time saving comes from the removal of
contention on the inode_wb_list_lock. I guess it's time for me to
start looking at multiple bdi-flusher threads again....
> I noticed that
>
> 1) BdiWriteback can grow very large. For example, bdi 8:16 has 72960KB
> writeback pages, however the disk IO queue can hold at most
> nr_request*max_sectors_kb=128*512kb=64MB writeback pages. Maybe xfs manages
> to create perfect sequential layouts and writes, and the other 8MB writeback
> pages are flying inside the disk?
There's a pretty good chance that this is exactly what is happening.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 00/13] IO-less dirty throttling
2010-11-17 7:25 ` Dave Chinner
@ 2010-11-17 10:06 ` Wu Fengguang
2010-11-18 1:40 ` Dave Chinner
0 siblings, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2010-11-17 10:06 UTC (permalink / raw)
To: Dave Chinner
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Theodore Ts'o,
Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, linux-mm, linux-fsdevel@vger.kernel.org, LKML
On Wed, Nov 17, 2010 at 03:25:38PM +0800, Dave Chinner wrote:
> On Wed, Nov 17, 2010 at 11:58:21AM +0800, Wu Fengguang wrote:
> > Andrew,
> >
> > This is a revised subset of "[RFC] soft and dynamic dirty throttling limits"
> > <http://thread.gmane.org/gmane.linux.kernel.mm/52966>.
> >
> > The basic idea is to introduce a small region under the bdi dirty threshold.
> > The task will be throttled gently when stepping into the bottom of region,
> > and get throttled more and more aggressively as bdi dirty+writeback pages
> > goes up closer to the top of region. At some point the application will be
> > throttled at the right bandwidth that balances with the device write bandwidth.
> > (the first patch and documentation has more details)
> >
> > Changes from initial RFC:
> >
> > - adaptive ratelimiting, to reduce overheads when under throttle threshold
> > - prevent overrunning dirty limit on lots of concurrent dirtiers
> > - add Documentation/filesystems/writeback-throttling-design.txt
> > - lower max pause time from 200ms to 100ms; min pause time from 10ms to 1jiffy
> > - don't drop the laptop mode code
> > - update and comment the trace event
> > - benchmarks on concurrent dd and fs_mark covering both large and tiny files
> > - bdi->write_bandwidth updates should be rate limited on concurrent dirtiers,
> > otherwise it will drift fast and fluctuate
> > - don't call balance_dirty_pages_ratelimit() when writing to already dirtied
> > pages, otherwise the task will be throttled too much
> >
> > The patches are based on 2.6.37-rc2 and Jan's sync livelock patches. For easier
> > access I put them in
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v2
>
> Great - just pulled it down and I'll start running some tests.
>
> The tree that I'm testing has the vfs inode lock breakup in it, the
> inode cache SLAB_DESTROY_BY_RCU series, a large bunch of XFS lock
> breakup patches and now the above branch in it. It's here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git working
>
> > On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
> > improves IO throughput from 38MB/s to 42MB/s.
>
> Excellent - I suspect that the reduction in contention on the inode
> writeback locks is responsible for dropping the CPU usage right down.
>
> I'm seeing throughput for a _single_ large dd (100GB) increase from ~650MB/s
> to 700MB/s with your series. For other numbers of dd's:
Great! I didn't expect it to improve the _throughput_ of the single dd case.
I did notice the reduction of CPU time for the single dd case, perhaps
because there is no longer contention between the dd and the flusher thread.
One big advantage of this IO-less implementation is that it does the
work without introducing any _extra_ bookkeeping data structures or
coordination, and hence is very scalable.
> ctx switches
> # dd processes total throughput total per proc
> 1 700MB/s 400/s 100/s
> 2 700MB/s 500/s 100/s
> 4 700MB/s 700/s 100/s
> 8 690MB/s 1,100/s 100/s
> 16 675MB/s 2,000/s 110/s
> 32 675MB/s 5,000/s 150/s
> 100 650MB/s 22,000/s 210/s
> 1000 600MB/s 160,000/s 160/s
>
> A couple of things I noticed - firstly, the number of context
> switches scales roughly with the number of writing processes - is
> there any reason for waking every writer 100-200 times a second? At
> the thousand writer mark, we reach a context switch rate of more
> than one per page we complete IO on. Any idea on whether this can be
> improved at all?
It's simple to have the pause time stabilize at larger values. I can
even easily detect that there are lots of concurrent dirtiers, and in
such cases adaptively enlarge it to no more than 200ms. Does that
value sound reasonable?
Precisely controlling pause time is the major capability pursued by
this implementation (compared to the earlier attempts to wait on
write completions).
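To make that concrete, something along these lines (a sketch only; the exact
scaling is made up, and the 200ms ceiling is the value proposed above):

/*
 * Sketch: grow the target pause with the number of concurrent dirtiers,
 * so each task is woken less often but sleeps longer per wakeup.
 */
static unsigned int target_pause_ms(unsigned int base_pause_ms,
                                    unsigned int nr_dirtiers)
{
        unsigned int pause = base_pause_ms;

        if (nr_dirtiers > 1)
                pause *= nr_dirtiers;   /* fewer, longer sleeps */
        if (pause > 200)
                pause = 200;            /* proposed ceiling */
        return pause;
}

With 100 dd tasks the pause would then sit at the 200ms ceiling, cutting the
wakeup rate to about 5 per second per task.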
> Also, the system CPU usage while throttling stayed quite low but not
> constant. The more writing processes, the lower the system CPU usage
> (despite the increase in context switches). Further, if the dd's
> didn't all start at the same time, then system CPU usage would
> roughly double when the first dd's complete and cpu usage stayed
> high until all the writers completed. So there's some trigger when
> writers finish/exit there that is changing throttle behaviour.
> Increasing the number of writers does not seem to have any adverse
> affects.
Depending on various conditions, the pause time will stabilize at
different points in the range [1 jiffy, 100 ms]. This is a very big
range and I made no attempt (although it is possible) to further control it.
The smaller the pause time, the more overhead from context switches _as
well as_ global_page_state() costs (mainly cacheline bouncing) in
balance_dirty_pages().
I wonder whether the majority of the context switches each indicate a
corresponding invocation of balance_dirty_pages()?
> BTW, killing a thousand dd's all stuck on the throttle is near
> instantaneous. ;)
Because the dd's no longer get stuck in D state in get_request_wait()
:)
> > The fs_mark benchmark is interesting. The CPU overheads are almost reduced by
> > half. Before patch the benchmark is actually bounded by CPU. After patch it's
> > IO bound, but strangely the throughput becomes slightly slower.
>
> The "App Overhead" that is measured by fs_mark is the time it spends
> doing stuff in userspace rather than in syscalls. Changes in the app
> overhead typically implies a change in syscall CPU cache footprint. A
> substantial reduction in app overhead for the same amount of work
> is good. :)
Got it :) This is an extra bonus, maybe because balance_dirty_pages()
no longer calls into the complex IO stack to write out pages.
> [cut-n-paste from your comment about being io bound below]
>
> > avg-cpu: %user %nice %system %iowait %steal %idle
> > 0.17 0.00 97.87 1.08 0.00 0.88
>
> That looks CPU bound, not IO bound.
Yes, it was collected on the vanilla kernel.
> > Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
> > sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> > sdc 0.00 63.00 0.00 125.00 0.00 1909.33 30.55 3.88 31.65 6.57 82.13
> > sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> > sde 0.00 19.00 0.00 112.00 0.00 1517.17 27.09 3.95 35.33 8.00 89.60
> > sdg 0.00 92.67 0.33 126.00 2.67 1773.33 28.12 14.83 120.78 7.73 97.60
> > sdf 0.00 32.33 0.00 91.67 0.00 1408.17 30.72 4.84 52.97 7.72 70.80
> > sdh 0.00 17.67 0.00 5.00 0.00 124.00 49.60 0.07 13.33 9.60 4.80
> > sdi 0.00 44.67 0.00 5.00 0.00 253.33 101.33 0.15 29.33 10.93 5.47
> > sdl 0.00 168.00 0.00 135.67 0.00 2216.33 32.67 6.41 45.42 5.75 78.00
> > sdk 0.00 225.00 0.00 123.00 0.00 2355.83 38.31 9.50 73.03 6.94 85.33
> > sdj 0.00 1.00 0.00 2.33 0.00 26.67 22.86 0.01 2.29 1.71 0.40
> > sdb 0.00 14.33 0.00 101.67 0.00 1278.00 25.14 2.02 19.95 7.16 72.80
> > sdm 0.00 150.33 0.00 144.33 0.00 2344.50 32.49 5.43 33.94 5.39 77.73
>
> And that's totalling ~1000 iops during the workload - you're right
> in that it doesn't look at all well balanced. The device my test
> filesystem is on is running at ~15,000 iops and 120MB/s for the same
> workload, but there is another layer of reordering on the host as
> well as 512MB of BBWC between the host and the spindles, so maybe
> you won't be able to get near that number with your setup....
OK.
> [.....]
>
> > avg 1182.761 533488581.833
> >
> > 2.6.36+
> > FSUse% Count Size Files/sec App Overhead
> ....
> > avg 1146.768 294684785.143
>
> The difference between the files/s numbers is pretty much within
> typical variation of the benchmark. I tend to time the running of
> the entire benchmark because the files/s output does not include the
> "App Overhead" time and hence you can improve files/s but increase
> the app overhead and the overall wall time can be significantly
> slower...
Got it.
> FWIW, I'd consider the throughput (1200 files/s) to quite low for 12
> disks and a number of CPUs being active. I'm not sure how you
> configured the storage/filesystem, but you should configure the
> filesystem with at least 2x as many AGs as there are CPUs, and run
> one create thread per CPU rather than one per disk. Also, making
> sure you have a largish log (512MB in this case) is helpful, too.
The test machine has 16 CPUs and 12 disks. I used plain simple mkfs
commands. I don't have access to the test box now (it's running LKP
for the just-released -rc2). I'll check out the xfs configuration and
recreate it with more AGs and a larger log. And yeah, it's a good idea to
increase the number of threads, with "-t 16"? BTW, is it a must to run
the test for one whole day? If not, which option can be decreased?
"-L 64"?
> For example, I've got a simple RAID0 of 12 disks that is 1.1TB in
> size when I stripe the outer 10% of the drives together (or 18TB if
> I stripe the larger inner partitions on the disks). The way I
> normally run it (on an 8p/4GB RAM VM) is:
>
> In the host:
>
> $ cat dmtab.fast.12drive
> 0 2264924160 striped 12 1024 /dev/sdb1 0 /dev/sdc1 0 /dev/sdd1 0 /dev/sde1 0 /dev/sdf1 0 /dev/sdg1 0 /dev/sdh1 0 /dev/sdi1 0 /dev/sdj1 0 /dev/sdk1 0 /dev/sdl1 0 /dev/sdm1 0
> $ sudo dmsetup create fast dmtab.fast.12drive
> $ sudo mount -o nobarrier,logbsize=262144,delaylog,inode64 /dev/mapper/fast /mnt/fast
>
> [VM creation script uses fallocate to preallocate 1.1TB file as raw
> disk image inside /mnt/fast, appears to guest as /dev/vdb]
>
> In the VM:
>
> # mkfs.xfs -f -l size=131072b -d agcount=16 /dev/vdb
> ....
> # mount -o nobarrier,inode64,delaylog,logbsize=262144 /dev/vdb /mnt/scratch
> # /usr/bin/time ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 \
> > -d /mnt/scratch/0 -d /mnt/scratch/1 \
> > -d /mnt/scratch/2 -d /mnt/scratch/3 \
> > -d /mnt/scratch/4 -d /mnt/scratch/5 \
> > -d /mnt/scratch/6 -d /mnt/scratch/7
>
> # ./fs_mark -D 10000 -S0 -n 100000 -s 1 -L 63 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> # Version 3.3, 8 thread(s) starting at Wed Nov 17 15:27:33 2010
> # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> # Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
> # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> # Files info: size 1 bytes, written with an IO size of 16384 bytes per write
> # App overhead is time in microseconds spent in the test not doing file writing related system calls.
>
> FSUse% Count Size Files/sec App Overhead
> 0 800000 1 27825.7 11686554
> 0 1600000 1 22650.2 13199876
> 1 2400000 1 23606.3 12297973
> 1 3200000 1 23060.5 12474339
> 1 4000000 1 22677.4 12731120
> 2 4800000 1 23095.7 12142813
> 2 5600000 1 22639.2 12813812
> 2 6400000 1 23447.1 12330158
> 3 7200000 1 22775.8 12548811
> 3 8000000 1 22766.5 12169732
> 3 8800000 1 21685.5 12546771
> 4 9600000 1 22899.5 12544273
> 4 10400000 1 22950.7 12894856
> .....
>
> The above numbers are without your patch series. The following
> numbers are with your patch series:
>
> FSUse% Count Size Files/sec App Overhead
> 0 800000 1 26163.6 10492957
> 0 1600000 1 21960.4 10431605
> 1 2400000 1 22099.2 10971110
> 1 3200000 1 22052.1 10470168
> 1 4000000 1 21264.4 10398188
> 2 4800000 1 21815.3 10445699
> 2 5600000 1 21557.6 10504866
> 2 6400000 1 21856.0 10421309
> 3 7200000 1 21853.5 10613164
> 3 8000000 1 21309.4 10642358
> 3 8800000 1 22130.8 10457972
> .....
>
> Ok, so throughput is also down by ~5% from ~23k files/s to ~22k
> files/s.
Hmm. The bad thing is that I have no idea how to avoid that.
balance_dirty_pages() is not doing IO any more, so what can I do to
influence the IO throughput? ;)
Maybe there are unnecessary sleep points in the writeout path? Or maybe
even one flusher thread is not enough _now_? Anyway, that seems not to be
a flaw of _this_ patchset, but a problem exposed and unfortunately made
more imminent by it.
BTW, do you have the total elapsed time before/after the patches? As you
said, that's the final criterion :)
> On the plus side:
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 1.91 0.00 43.45 46.56 0.00 8.08
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
> vda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> vdb 0.00 12022.20 1.60 11431.60 0.01 114.09 20.44 32.34 2.82 0.08 94.64
> sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
>
> The number of write IOs has dropped significantly and CPU usage is
> more than halved - this was running at ~98% system time! So for a
> ~5% throughput reduction, CPU usage has dropped by ~55% and the
> number of write IOs have dropped by ~25%. That's a pretty good
> result - it's the single biggest drop in CPU usage as a result of
> preventing lock contention I've seen on an 8p machine in the past 6
> months. Very promising - I guess it's time to look at the code again. :)
Thanks. The code has been vastly rewritten; fortunately you didn't read it
before. I have a good feeling about the current code. In V3 it may be
rebased onto the memcg work by Greg, but the basic algorithms will
remain the same.
> Hmmm - looks like the probably bottleneck is that the flusher thread
> is close to CPU bound:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 2215 root 20 0 0 0 0 R 86 0.0 2:16.43 flush-253:16
>
> samples pcnt function DSO
> _______ _____ _______________________________ _________________
>
> 32436.00 5.8% _xfs_buf_find [kernel.kallsyms]
> 26119.00 4.7% kmem_cache_alloc [kernel.kallsyms]
> 17700.00 3.2% __ticket_spin_lock [kernel.kallsyms]
> 14592.00 2.6% xfs_log_commit_cil [kernel.kallsyms]
> 14341.00 2.6% _raw_spin_unlock_irqrestore [kernel.kallsyms]
> 12537.00 2.2% __kmalloc [kernel.kallsyms]
> 12098.00 2.2% writeback_single_inode [kernel.kallsyms]
> 12078.00 2.2% xfs_iunlock [kernel.kallsyms]
> 10712.00 1.9% redirty_tail [kernel.kallsyms]
> 10706.00 1.9% __make_request [kernel.kallsyms]
> 10469.00 1.9% bit_waitqueue [kernel.kallsyms]
> 10107.00 1.8% kfree [kernel.kallsyms]
> 10028.00 1.8% _cond_resched [kernel.kallsyms]
> 9244.00 1.7% xfs_fs_write_inode [kernel.kallsyms]
> 8759.00 1.6% xfs_iflush_cluster [kernel.kallsyms]
> 7944.00 1.4% queue_io [kernel.kallsyms]
> 7924.00 1.4% radix_tree_gang_lookup_tag_slot [kernel.kallsyms]
> 7468.00 1.3% kmem_cache_free [kernel.kallsyms]
> 7454.00 1.3% xfs_bmapi [kernel.kallsyms]
> 7149.00 1.3% writeback_sb_inodes [kernel.kallsyms]
> 5882.00 1.1% xfs_btree_lookup [kernel.kallsyms]
> 5811.00 1.0% __memcpy [kernel.kallsyms]
> 5446.00 1.0% xfs_alloc_ag_vextent_near [kernel.kallsyms]
> 5346.00 1.0% xfs_trans_buf_item_match [kernel.kallsyms]
> 4704.00 0.8% xfs_perag_get [kernel.kallsyms]
>
> That's looking like it's XFS overhead flushing inodes, so that's not
> an issue caused by this patch. Indeed, I'm used to seeing 30-40% of
> the CPU time here in __ticket_spin_lock, so it certainly appears
> that most of the CPU time saving comes from the removal of
> contention on the inode_wb_list_lock. I guess it's time for me to
> start looking at multiple bdi-flusher threads again....
Heh.
> > I noticed that
> >
> > 1) BdiWriteback can grow very large. For example, bdi 8:16 has 72960KB
> > writeback pages, however the disk IO queue can hold at most
> > nr_request*max_sectors_kb=128*512kb=64MB writeback pages. Maybe xfs manages
> > to create perfect sequential layouts and writes, and the other 8MB writeback
> > pages are flying inside the disk?
>
> There's a pretty good chance that this is exactly what is happening.
That's amazing! It's definitely hitting the ultimate optimization
goal (for the sequential part).
Thanks,
Fengguang
* Re: [PATCH 00/13] IO-less dirty throttling
2010-11-17 10:06 ` Wu Fengguang
@ 2010-11-18 1:40 ` Dave Chinner
2010-11-18 1:59 ` Andrew Morton
0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-11-18 1:40 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Theodore Ts'o,
Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, linux-mm, linux-fsdevel@vger.kernel.org, LKML
On Wed, Nov 17, 2010 at 06:06:55PM +0800, Wu Fengguang wrote:
> On Wed, Nov 17, 2010 at 03:25:38PM +0800, Dave Chinner wrote:
> > On Wed, Nov 17, 2010 at 11:58:21AM +0800, Wu Fengguang wrote:
> > > Andrew,
> > >
> > > This is a revised subset of "[RFC] soft and dynamic dirty throttling limits"
> > > <http://thread.gmane.org/gmane.linux.kernel.mm/52966>.
> > >
> > > The basic idea is to introduce a small region under the bdi dirty threshold.
> > > The task will be throttled gently when stepping into the bottom of region,
> > > and get throttled more and more aggressively as bdi dirty+writeback pages
> > > goes up closer to the top of region. At some point the application will be
> > > throttled at the right bandwidth that balances with the device write bandwidth.
> > > (the first patch and documentation has more details)
> > >
> > > Changes from initial RFC:
> > >
> > > - adaptive ratelimiting, to reduce overheads when under throttle threshold
> > > - prevent overrunning dirty limit on lots of concurrent dirtiers
> > > - add Documentation/filesystems/writeback-throttling-design.txt
> > > - lower max pause time from 200ms to 100ms; min pause time from 10ms to 1jiffy
> > > - don't drop the laptop mode code
> > > - update and comment the trace event
> > > - benchmarks on concurrent dd and fs_mark covering both large and tiny files
> > > - bdi->write_bandwidth updates should be rate limited on concurrent dirtiers,
> > > otherwise it will drift fast and fluctuate
> > > - don't call balance_dirty_pages_ratelimit() when writing to already dirtied
> > > pages, otherwise the task will be throttled too much
> > >
> > > The patches are based on 2.6.37-rc2 and Jan's sync livelock patches. For easier
> > > access I put them in
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v2
> >
> > Great - just pulled it down and I'll start running some tests.
> >
> > The tree that I'm testing has the vfs inode lock breakup in it, the
> > inode cache SLAB_DESTROY_BY_RCU series, a large bunch of XFS lock
> > breakup patches and now the above branch in it. It's here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev.git working
> >
> > > On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%, and
> > > improves IO throughput from 38MB/s to 42MB/s.
> >
> > Excellent - I suspect that the reduction in contention on the inode
> > writeback locks is responsible for dropping the CPU usage right down.
> >
> > I'm seeing throughput for a _single_ large dd (100GB) increase from ~650MB/s
> > to 700MB/s with your series. For other numbers of dd's:
>
> Great! I didn't expect it to improve _throughput_ of single dd case.
At 650MB/s without your series, the dd process is CPU bound. With
your series the dd process now only consumes ~65% of a CPU, so I
suspect that if I were running on a faster block device it'd go even
faster. Removing the writeback path from the write() path certainly
helps in this regard.
> > # dd processes total throughput total per proc
> > 1 700MB/s 400/s 100/s
> > 2 700MB/s 500/s 100/s
> > 4 700MB/s 700/s 100/s
> > 8 690MB/s 1,100/s 100/s
> > 16 675MB/s 2,000/s 110/s
> > 32 675MB/s 5,000/s 150/s
> > 100 650MB/s 22,000/s 210/s
> > 1000 600MB/s 160,000/s 160/s
> >
> > A couple of things I noticed - firstly, the number of context
> > switches scales roughly with the number of writing processes - is
> > there any reason for waking every writer 100-200 times a second? At
> > the thousand writer mark, we reach a context switch rate of more
> > than one per page we complete IO on. Any idea on whether this can be
> > improved at all?
>
> It's simple to have the pause time stabilize at larger values. I can
> even easily detect that there are lots of concurrent dirtiers, and in
> such cases adaptively enlarge it to no more than 200ms. Does that
> value sound reasonable?
Certainly. I think that the more concurrent dirtiers, the less
frequently each individual dirtier should be woken. There's no point
waking a dirtier if all they can do is write a single page before
they are throttled again - IO is most efficient when done in larger
batches...
> Precisely controlling pause time is the major capability pursued by
> this implementation (comparing to the earlier attempts to wait on
> write completions).
>
> > Also, the system CPU usage while throttling stayed quite low but not
> > constant. The more writing processes, the lower the system CPU usage
> > (despite the increase in context switches). Further, if the dd's
> > didn't all start at the same time, then system CPU usage would
> > roughly double when the first dd's complete and cpu usage stayed
> > high until all the writers completed. So there's some trigger when
> > writers finish/exit there that is changing throttle behaviour.
> > Increasing the number of writers does not seem to have any adverse
> > affects.
>
> Depending on various conditions, the pause time will be stabilizing at
> different point in the range [1 jiffy, 100 ms]. This is a very big
> range and I made no attempt (although possible) to further control it.
>
> The smaller pause time, the more overheads in context switches _as
> well as_ global_page_state() costs (mainly cacheline bouncing) in
> balance_dirty_pages().
I didn't notice any change in context switches when the CPU usage
changed, so perhaps it was more cacheline bouncing in
global_page_state(). I think more investigation is needed, though.
> I wonder whether or not the majority context switches indicate a
> corresponding invocation of balance_dirty_pages()?
/me needs to run with writeback tracing turned on
> > FWIW, I'd consider the throughput (1200 files/s) to quite low for 12
> > disks and a number of CPUs being active. I'm not sure how you
> > configured the storage/filesystem, but you should configure the
> > filesystem with at least 2x as many AGs as there are CPUs, and run
> > one create thread per CPU rather than one per disk. Also, making
> > sure you have a largish log (512MB in this case) is helpful, too.
>
> The test machine has 16 CPUs and 12 disks. I used plain simple mkfs
> commands. I don't have access to the test box now (it's running LKP
> for the just released -rc2). I'll checkout the xfs configuration and
> recreate it with more AGs and log.
Cool.
> And yeah it's a good idea to
> increase the number of threads, with "-t 16"?
No, that just increases the number of threads working on a specific
directory. Creates are serialised by the directory i_mutex, so
there's no point running multiple threads per directory.
That's why I use multiple "-d <dir>" options - you get a thread per
directory that way, and they don't serialise with each other given
enough AGs...
> btw, is it a must to run
> the test for one whole day? If not, which optarg can be decreased?
> "-L 64"?
Yeah, -L controls the number of iterations (there's one line of
output per iteration). Generally, for sanity checking, I'll just run
a few iterations. I only ever run the full 50M inode runs when I've
got something I want to compare. Mind you, it generally only takes
an hour on my system, so that's not so bad...
> > Ok, so throughput is also down by ~5% from ~23k files/s to ~22k
> > files/s.
>
> Hmm. The bad thing is I have no idea on how to avoid that. It's not
> doing IO any more, so what can I do to influence the IO throughput? ;)
That's now a problem of writeback optimisation - where it should be
dealt with ;)
> Maybe there are unnecessary sleep points in the writeout path?
It sleeps on congestion, but otherwise shouldn't be blocking
anywhere.
> Or even one flusher thread is not enough _now_?
It hasn't been enough for XFS on really large systems doing high
bandwidth IO for a long time. It's only since 2.6.35 and the
introduction of the delaylog mount option that XFS has really been
able to drive small file IO this hard.
> Anyway that seems not
> the flaw of _this_ patchset, but problems exposed and unfortunately
> made more imminent by it.
Agreed.
> btw, do you have the total elapsed time before/after patch? As you
> said it's the final criterion :)
Yeah, sorry, should have posted them - I didn't because I snapped
the numbers before the run had finished. Without series:
373.19user 14940.49system 41:42.17elapsed 612%CPU (0avgtext+0avgdata 82560maxresident)k
0inputs+0outputs (403major+2599763minor)pagefaults 0swaps
With your series:
359.64user 5559.32system 40:53.23elapsed 241%CPU (0avgtext+0avgdata 82496maxresident)k
0inputs+0outputs (312major+2598798minor)pagefaults 0swaps
So the wall time with your series is lower, and system CPU time is
way down (as I've already noted) for this workload on XFS.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 00/13] IO-less dirty throttling
2010-11-18 1:40 ` Dave Chinner
@ 2010-11-18 1:59 ` Andrew Morton
2010-11-18 2:50 ` Wu Fengguang
2010-11-19 2:28 ` Dave Chinner
0 siblings, 2 replies; 8+ messages in thread
From: Andrew Morton @ 2010-11-18 1:59 UTC (permalink / raw)
To: Dave Chinner
Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Theodore Ts'o,
Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, linux-mm, linux-fsdevel@vger.kernel.org, LKML
On Thu, 18 Nov 2010 12:40:51 +1100 Dave Chinner <david@fromorbit.com> wrote:
>
> There's no point
> waking a dirtier if all they can do is write a single page before
> they are throttled again - IO is most efficient when done in larger
> batches...
That assumes the process was about to do another write. That's
reasonable on average, but a bit sad for interactive/rtprio tasks. At
some stage those scheduler things should be brought into the equation.
>
> ...
>
> Yeah, sorry, should have posted them - I didn't because I snapped
> the numbers before the run had finished. Without series:
>
> 373.19user 14940.49system 41:42.17elapsed 612%CPU (0avgtext+0avgdata 82560maxresident)k
> 0inputs+0outputs (403major+2599763minor)pagefaults 0swaps
>
> With your series:
>
> 359.64user 5559.32system 40:53.23elapsed 241%CPU (0avgtext+0avgdata 82496maxresident)k
> 0inputs+0outputs (312major+2598798minor)pagefaults 0swaps
>
> So the wall time with your series is lower, and system CPU time is
> way down (as I've already noted) for this workload on XFS.
How much of that benefit is an accounting artifact, moving work away
from the calling process's CPU and into kernel threads?
* Re: [PATCH 00/13] IO-less dirty throttling
2010-11-18 1:59 ` Andrew Morton
@ 2010-11-18 2:50 ` Wu Fengguang
2010-11-18 3:19 ` Wu Fengguang
2010-11-19 2:28 ` Dave Chinner
1 sibling, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2010-11-18 2:50 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Chinner, Jan Kara, Christoph Hellwig, Theodore Ts'o,
Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, linux-mm, linux-fsdevel@vger.kernel.org, LKML
On Thu, Nov 18, 2010 at 09:59:00AM +0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 12:40:51 +1100 Dave Chinner <david@fromorbit.com> wrote:
>
> >
> > There's no point
> > waking a dirtier if all they can do is write a single page before
> > they are throttled again - IO is most efficient when done in larger
> > batches...
>
> That assumes the process was about to do another write. That's
> reasonable on average, but a bit sad for interactive/rtprio tasks. At
> some stage those scheduler things should be brought into the equation.
The interactive/rtprio tasks are given a 1/4 bonus in
global_dirty_limits(). So when there are lots of heavy dirtiers,
the interactive/rtprio tasks will get soft throttled at
(6~8)*bdi_bandwidth. We can increase that to (12~16)*bdi_bandwidth
or whatever.
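For reference, the bonus is just a 25% bump of both computed limits, roughly
like the simplified sketch below (from memory, so treat the details as
approximate; the real code is global_dirty_limits() in mm/page-writeback.c):

/*
 * Interactive/rtprio tasks get 25% more headroom in both the
 * background and the dirty threshold.
 */
static void apply_dirty_limit_bonus(unsigned long *background,
                                    unsigned long *dirty)
{
        *background += *background / 4;
        *dirty += *dirty / 4;
}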
> >
> > ...
> >
> > Yeah, sorry, should have posted them - I didn't because I snapped
> > the numbers before the run had finished. Without series:
> >
> > 373.19user 14940.49system 41:42.17elapsed 612%CPU (0avgtext+0avgdata 82560maxresident)k
> > 0inputs+0outputs (403major+2599763minor)pagefaults 0swaps
> >
> > With your series:
> >
> > 359.64user 5559.32system 40:53.23elapsed 241%CPU (0avgtext+0avgdata 82496maxresident)k
> > 0inputs+0outputs (312major+2598798minor)pagefaults 0swaps
> >
> > So the wall time with your series is lower, and system CPU time is
> > way down (as I've already noted) for this workload on XFS.
>
> How much of that benefit is an accounting artifact, moving work away
> from the calling process's CPU and into kernel threads?
The elapsed time won't cheat, and it's going down from 41:42 to 40:53.
For the CPU time, I have system-wide numbers collected from iostat.
Citing from the changelog of the first patch:
- 1 dirtier case: the same
- 10 dirtiers case: CPU system time is reduced to 50%
- 100 dirtiers case: CPU system time is reduced to 10%; IO size and throughput increase by 10%
                      2.6.37-rc2                    2.6.37-rc1-next-20101115+
         ----------------------------------    ----------------------------------
         %system      wkB/s      avgrq-sz      %system      wkB/s      avgrq-sz
100dd     30.916    37843.000     748.670        3.079    41654.853     822.322
100dd     30.501    37227.521     735.754        3.744    41531.725     820.360
10dd      39.442    47745.021     900.935       20.756    47951.702     901.006
10dd      39.204    47484.616     899.330       20.550    47970.093     900.247
1dd       13.046    57357.468     910.659       13.060    57632.715     909.212
1dd       12.896    56433.152     909.861       12.467    56294.440     909.644
Those are real CPU savings :)
Thanks,
Fengguang
* Re: [PATCH 00/13] IO-less dirty throttling
2010-11-18 2:50 ` Wu Fengguang
@ 2010-11-18 3:19 ` Wu Fengguang
0 siblings, 0 replies; 8+ messages in thread
From: Wu Fengguang @ 2010-11-18 3:19 UTC (permalink / raw)
To: Andrew Morton
Cc: Dave Chinner, Jan Kara, Christoph Hellwig, Theodore Ts'o,
Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, linux-mm, linux-fsdevel@vger.kernel.org, LKML
On Thu, Nov 18, 2010 at 10:50:39AM +0800, Wu Fengguang wrote:
> On Thu, Nov 18, 2010 at 09:59:00AM +0800, Andrew Morton wrote:
> > On Thu, 18 Nov 2010 12:40:51 +1100 Dave Chinner <david@fromorbit.com> wrote:
> >
> > >
> > > There's no point
> > > waking a dirtier if all they can do is write a single page before
> > > they are throttled again - IO is most efficient when done in larger
> > > batches...
> >
> > That assumes the process was about to do another write. That's
> > reasonable on average, but a bit sad for interactive/rtprio tasks. At
> > some stage those scheduler things should be brought into the equation.
>
> The interactive/rtprio tasks are given 1/4 bonus in
> global_dirty_limits(). So when there are lots of heavy dirtiers,
> the interactive/rtprio tasks will get soft throttled at
> (6~8)*bdi_bandwidth. We can increase that to (12~16)*bdi_bandwidth
> or whatever.
Even better :) It seems that this break in balance_dirty_pages() will
make them throttle-free, unless they themselves generate dirty data
faster than the disk can write:
if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
break;
Thanks,
Fengguang
* Re: [PATCH 00/13] IO-less dirty throttling
2010-11-18 1:59 ` Andrew Morton
2010-11-18 2:50 ` Wu Fengguang
@ 2010-11-19 2:28 ` Dave Chinner
1 sibling, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2010-11-19 2:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Theodore Ts'o,
Chris Mason, Peter Zijlstra, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, linux-mm, linux-fsdevel@vger.kernel.org, LKML
On Wed, Nov 17, 2010 at 05:59:00PM -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 12:40:51 +1100 Dave Chinner <david@fromorbit.com> wrote:
> > Yeah, sorry, should have posted them - I didn't because I snapped
> > the numbers before the run had finished. Without series:
> >
> > 373.19user 14940.49system 41:42.17elapsed 612%CPU (0avgtext+0avgdata 82560maxresident)k
> > 0inputs+0outputs (403major+2599763minor)pagefaults 0swaps
> >
> > With your series:
> >
> > 359.64user 5559.32system 40:53.23elapsed 241%CPU (0avgtext+0avgdata 82496maxresident)k
> > 0inputs+0outputs (312major+2598798minor)pagefaults 0swaps
> >
> > So the wall time with your series is lower, and system CPU time is
> > way down (as I've already noted) for this workload on XFS.
>
> How much of that benefit is an accounting artifact, moving work away
> from the calling process's CPU and into kernel threads?
As I spelled out in my original results, the sustained CPU usage for
the unmodified kernel is ~780% (620% fs_mark, 80% bdi-flusher, 80%
kswapd), i.e. completely CPU bound on the 8p test VM. With this
series, the sustained CPU usage is about 380% (250% fs_mark, 80%
bdi-flusher, 50% kswapd).
IOWs, this series _halved_ the total sustained CPU usage even after
taking into account all the kernel threads. With wall time also
being reduced and the number of IOs issued dropping by 25%, I find
it hard to classify the result as anything other than spectacular...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com