From: Wu Fengguang
Subject: Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()
Date: Mon, 22 Nov 2010 10:01:45 +0800
Message-ID: <20101122020145.GB10126@localhost>
References: <20101117042720.033773013@intel.com> <20101117042849.410279291@intel.com>
To: Minchan Kim
Cc: Andrew Morton, Jan Kara, Chris Mason, Dave Chinner, Peter Zijlstra,
 Jens Axboe, Christoph Hellwig, Theodore Ts'o, Mel Gorman, Rik van Riel,
 KOSAKI Motohiro, linux-mm, linux-fsdevel@vger.kernel.org, LKML

Hi Minchan,

On Wed, Nov 17, 2010 at 06:34:26PM +0800, Minchan Kim wrote:
> Hi Wu,
>
> As you know, I am not an expert in this area. So I hope my review can
> help other newbies like me understand it, and help make this document
> clearer. :)
> I didn't look into the code yet; before that, I would like to get your
> concept clear.

Yeah, it's quite a big change of "concept" :)

Sorry for the late reply, as I'm still tuning things and some details
may change as a result. The biggest challenge now is the stability of
the control algorithms. Everything is floating around and I'm trying
to keep the fluctuations down by borrowing some equations from optimal
control theory.

> On Wed, Nov 17, 2010 at 1:27 PM, Wu Fengguang wrote:
> > As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> > inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> > time to throttle the dirtying task. In the meanwhile, kick off the
> > per-bdi flusher thread to do background writeback IO.
> >
> > This patch introduces the basic framework, which will be further
> > consolidated by the next patches.
> >
> > RATIONALE
> > =========
> >
> > The current balance_dirty_pages() is rather IO inefficient.
> >
> > - concurrent writeback of multiple inodes (Dave Chinner)
> >
> >   If every thread doing writes and being throttled starts foreground
> >   writeback, it leads to N IO submitters from at least N different
> >   inodes at the same time, ending up with N different sets of IO being
> >   issued with potentially zero locality to each other, resulting in
> >   much lower elevator sort/merge efficiency, and hence we seek the disk
> >   all over the place to service the different sets of IO.
> >   OTOH, if there is only one submission thread, it doesn't jump between
> >   inodes in the same way when congestion clears - it keeps writing to
> >   the same inode, resulting in large related chunks of sequential IOs
> >   being issued to the disk. This is more efficient than the above
> >   foreground writeback because the elevator works better and the disk
> >   seeks less.
> >
> > - IO size too small for fast arrays and too large for slow USB sticks
> >
> >   The write_chunk used by the current balance_dirty_pages() cannot be
> >   directly set to some large value (eg. 128MB) for better IO efficiency,
> >   because that could lead to user perceivable stalls of more than 1
> >   second. Even the current 4MB write size may be too large for slow USB
> >   sticks. The fact that balance_dirty_pages() starts IO itself couples
> >   the IO size to the wait time, which makes it hard to use a suitable
> >   IO size while keeping the wait time under control.
> >
> > For the above two reasons, it's much better to shift the IO to the flusher
> > threads and let balance_dirty_pages() just wait for enough time or progress.
> >
> > Jan Kara, Dave Chinner and I explored the scheme of letting
> > balance_dirty_pages() wait for enough writeback IO completions to
> > safeguard the dirty limit. However, it was found to have two problems:
> >
> > - in large NUMA systems, the per-cpu counters may have big accounting
> >   errors, leading to big throttle wait times and jitter.
> >
> > - NFS may kill a large number of unstable pages with one single COMMIT.
> >   Because the NFS server serves COMMIT with expensive fsync() IOs, it is
> >   desirable to delay and reduce the number of COMMITs. So such bursty IO
> >   completions are not likely to be optimized away, and neither are the
> >   resulting large (and tiny) stall times they would cause in
> >   IO-completion-based throttling.
> >
> > So here is a pause time oriented approach, which tries to control the
> > pause time of each balance_dirty_pages() invocation by controlling
> > the number of pages dirtied before calling balance_dirty_pages(), for
> > smooth and efficient dirty throttling:
> >
> > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > - avoid too small pause times (less than 10ms, which burns CPU power)
> > - avoid too large pause times (more than 100ms, which hurts responsiveness)
> > - avoid big fluctuations of pause times
> >
> > For example, when doing a simple cp on ext4 with mem=4G HZ=250:
> >
> > before patch, the pause time fluctuates from 0 to 324ms
> > (and the stall time may grow very large for slow devices)
> >
> > [ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
> > [ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> > [ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> > [ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> > [ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> > [ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> > [ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
> > [ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
> > [ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
> > [ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
> > [ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
> > [ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
> >
> > after patch, the pause time remains stable around 32ms
> >
> > cp-2687  [002]  1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
> > cp-2687  [002]  1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
> > cp-2687  [006]  1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
> > cp-2687  [006]  1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
> > cp-2687  [006]  1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
> > cp-2687  [002]  1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
> > cp-2687  [002]  1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8
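
[ Aside, to make the pause time idea above concrete. The sketch below is
  for discussion only and is not the patch code; throttle_sketch(),
  nr_dirtied and throttle_bw are made-up names. The dirtier counts the
  pages it has dirtied, and balance_dirty_pages() then sleeps long enough
  to "pay them back" at the allowed throttle bandwidth, with the pause
  clamped into the 10ms..100ms comfort range mentioned above. ]

#include <linux/kernel.h>
#include <linux/jiffies.h>
#include <linux/sched.h>

/* Sketch only: sleep to pay back nr_dirtied pages at throttle_bw pages/s. */
static void throttle_sketch(unsigned long nr_dirtied, unsigned long throttle_bw)
{
	unsigned long pause;

	if (!throttle_bw)
		throttle_bw = 1;

	/* jiffies needed to write back nr_dirtied pages at throttle_bw */
	pause = nr_dirtied * HZ / throttle_bw;

	/* keep the pause in the [10ms, 100ms] range discussed above */
	pause = clamp(pause, msecs_to_jiffies(10), msecs_to_jiffies(100));

	schedule_timeout_interruptible(pause);
}
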
> >
> > CONTROL SYSTEM
> > ==============
> >
> > The current task_dirty_limit() adjusts bdi_dirty_limit to get
> > task_dirty_limit according to the dirty "weight" of the current task,
> > which is the percentage of pages recently dirtied by the task. If 100%
> > of the pages were recently dirtied by the task, it will lower
> > bdi_dirty_limit by 1/8. If only 1% of the pages were dirtied by the
> > task, it will return an almost unmodified bdi_dirty_limit. In this way,
> > a heavy dirtier will get blocked at
> > task_dirty_limit = (bdi_dirty_limit - bdi_dirty_limit/8) while a light
> > dirtier is allowed to progress (the latter won't be blocked because
> > R << B in fig.1).
> >
> > Fig.1 before patch, a heavy dirtier and a light dirtier
> >                                                 R
> > ----------------------------------------------+-o---------------------------*--|
> >                                                L A                           B  T
> >   T: bdi_dirty_limit, as returned by bdi_dirty_limit()
> >   L: T - T/8
> >
> >   R: bdi_reclaimable + bdi_writeback
> >
> >   A: task_dirty_limit for a heavy dirtier ~= R ~= L
> >   B: task_dirty_limit for a light dirtier ~= T
> >
> > Since each process has its own dirty limit, we reuse the names A/B for
> > the tasks as well as for their dirty limits.
> >
> > If B is a newly started heavy dirtier, then it will slowly gain weight
> > and A will lose weight.  The task_dirty_limit for A and B will approach
> > the center of the region (L, T) and eventually stabilize there.
> >
> > Fig.2 before patch, two heavy dirtiers converging to the same threshold
> >                                                               R
> > ----------------------------------------------+--------------o-*---------------|
> >                                                L              A B               T
>
> Seems good so far.
> So, what's the problem if two heavy dirtiers have the same threshold?

That's not a problem. Converging is the proper behavior for two "dd"s.
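
For reference, the per-task limit described above boils down to roughly
the following relation. This is only a sketch for discussion, not the
actual task_dirty_limit() code; task_limit_sketch() and weight_percent
(the task's recent dirty share, 0..100) are made-up names:

/* Sketch: the bdi limit lowered by up to 1/8, in proportion to the weight. */
static unsigned long task_limit_sketch(unsigned long bdi_dirty_limit,
				       unsigned long weight_percent)
{
	/* weight 100% -> T - T/8, weight 1% -> almost unmodified T */
	return bdi_dirty_limit - (bdi_dirty_limit / 8) * weight_percent / 100;
}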

> > Fig.3 after patch, one heavy dirtier
> >                                                 |
> >     throttle_bandwidth ~= bdi_bandwidth  =>     o
> >                                                 | o
> >                                                 |   o
> >                                                 |     o
> >                                                 |       o
> >                                                 |         o
> >                                               La|           o
> > ----------------------------------------------+-+-------------o----------------|
> >                                                R A                              T
> >   T: bdi_dirty_limit
> >   A: task_dirty_limit      = T - Wa * T/16
> >   La: task_throttle_thresh = A - A/16
> >
> >   R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La
> >
> > Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
> > way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
> > this region, the task may be throttled for J jiffies on every N pages it
> > dirties. Let's call (N/J) the "throttle bandwidth". It is computed by the
> > following formula:
> >
> >        throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
> > where
> >        A  = T - Wa * T/16
> >        La = A - A/16
> > and Wa is the task weight for A. It is 0 for a very light dirtier and 1
> > for the one heavy dirtier (that consumes 100% of the bdi write
> > bandwidth).  The task weight will be updated independently by
> > task_dirty_inc() at set_page_dirty() time.
>
> Dumb question.
>
> I can't see the difference between old and new:
> La depends on A.
> A depends on Wa.
> T is constant?

T is the bdi's share of the global dirty limit. It's stable in normal
conditions, and here we use it as the reference point for per-bdi dirty
throttling.

> Then, throttle_bandwidth depends on Wa.

Sure, each task will be throttled at a different bandwidth if their "Wa"
are different.

> Wa depends on the number of pages dirtied during some interval.
> So if a light dirtier becomes heavy, eventually the light dirtier and
> the heavy dirtier will have the same weight.
> That means their throttle_bandwidth is the same, which is the same as
> the old result.

Yeah, though Wa and throttle_bandwidth are changing over time.
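
To make the quoted formula concrete, here is a small sketch (for
discussion only, not the patch code) of how the throttle bandwidth falls
off linearly across the soft region (La, A). throttle_bw_sketch() and
wa_percent (Wa scaled to 0..100 for integer math) are made-up names:

#include <linux/math64.h>

/*
 * Sketch of: throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
 * with A = T - Wa * T/16 and La = A - A/16.  Full bdi bandwidth below
 * La, zero at/above A, linear in between.
 */
static unsigned long throttle_bw_sketch(unsigned long bdi_bandwidth, /* pages/s */
					unsigned long T, /* bdi_dirty_limit */
					unsigned long R, /* bdi dirty + writeback */
					unsigned long wa_percent)
{
	unsigned long A  = T - (T / 16) * wa_percent / 100;
	unsigned long La = A - A / 16;

	if (R <= La)
		return bdi_bandwidth;
	if (R >= A)
		return 0;

	return div64_u64((u64)bdi_bandwidth * (A - R), A - La);
}
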
> Please, open my eyes. :)

You got the dynamics right :)

> Thanks for the great work.

Thanks,
Fengguang