From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o3P3VJIH224508 for ; Sat, 24 Apr 2010 22:31:19 -0500 Received: from thunker.thunk.org (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 5CB072ED9FD for ; Sat, 24 Apr 2010 20:33:21 -0700 (PDT) Received: from thunker.thunk.org (thunk.org [69.25.196.29]) by cuda.sgi.com with ESMTP id 2CRFnkhmYv4APUI5 for ; Sat, 24 Apr 2010 20:33:21 -0700 (PDT) Date: Sat, 24 Apr 2010 23:33:15 -0400 From: tytso@mit.edu Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages Message-ID: <20100425033315.GC667@thunk.org> References: <1271731314-5893-1-git-send-email-david@fromorbit.com> <1271731314-5893-4-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <1271731314-5893-4-git-send-email-david@fromorbit.com> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Sender: xfs-bounces@oss.sgi.com Errors-To: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com T24gVHVlLCBBcHIgMjAsIDIwMTAgYXQgMTI6NDE6NTNQTSArMTAwMCwgRGF2ZSBDaGlubmVyIHdy b3RlOgo+IEZyb206IERhdmUgQ2hpbm5lciA8ZGNoaW5uZXJAcmVkaGF0LmNvbT4KPiAKPiBJZiBh IGZpbGVzeXN0ZW0gd3JpdGVzIG1vcmUgdGhhbiBvbmUgcGFnZSBpbiAtPndyaXRlcGFnZSwgd3Jp dGVfY2FjaGVfcGFnZXMKPiBmYWlscyB0byBub3RpY2UgdGhpcyBhbmQgY29udGludWVzIHRvIGF0 dGVtcHQgd3JpdGViYWNrIHdoZW4gd2JjLT5ucl90b193cml0ZQo+IGhhcyBnb25lIG5lZ2F0aXZl IC0gdGhpcyB0cmFjZSB3YXMgY2FwdHVyZWQgZnJvbSBYRlM6Cj4gCj4gCj4gICAgIHdiY193cml0 ZWJhY2tfc3RhcnQ6IHRvd3J0PTEwMjQKPiAgICAgd2JjX3dyaXRlcGFnZTogdG93cnQ9MTAyNAo+ ICAgICB3YmNfd3JpdGVwYWdlOiB0b3dydD0wCj4gICAgIHdiY193cml0ZXBhZ2U6IHRvd3J0PS0x Cj4gICAgIHdiY193cml0ZXBhZ2U6IHRvd3J0PS01Cj4gICAgIHdiY193cml0ZXBhZ2U6IHRvd3J0 PS0yMQo+ICAgICB3YmNfd3JpdGVwYWdlOiB0b3dydD0tODUKPiAKPiBUaGlzIGhhcyBhZHZlcnNl IGVmZmVjdHMgb24gZmlsZXN5c3RlbSB3cml0ZWJhY2sgYmVoYXZpb3VyLiB3cml0ZV9jYWNoZV9w YWdlcygpCj4gbmVlZHMgdG8gdGVybWluYXRlIGFmdGVyIGEgY2VydGFpbiBudW1iZXIgb2YgcGFn ZXMgYXJlIHdyaXR0ZW4sIG5vdCBhZnRlciBhCj4gY2VydGFpbiBudW1iZXIgb2YgY2FsbHMgdG8g LT53cml0ZXBhZ2UgYXJlIG1hZGUuIE1ha2UgaXQgb2JzZXJ2ZSB0aGUgY3VycmVudAo+IHZhbHVl IG9mIHdiYy0+bnJfdG9fd3JpdGUgYW5kIHRyZWF0IGEgdmFsdWUgb2YgPD0gMCBhcyB0aG91Z2gg aXQgaXMgYSBlaXRoZXIgYQo+IHRlcm1pbmF0aW9uIGNvbmRpdGlvbiBvciBhIHRyaWdnZXIgdG8g cmVzZXQgdG8gTUFYX1dSSVRF4biGQUNLX1BBR0VTIGZvciBkYXRhCj4gaW50ZWdyaXR5IHN5bmNz LgoKQmUgY2FyZWZ1bCBoZXJlLiAgSWYgeW91IGFyZSBnb2luZyB0byB3cml0ZSBtb3JlIHBhZ2Vz IHRoYW4gd2hhdCB0aGUKd3JpdGViYWNrIGNvZGUgaGFzIHJlcXVlc3RlZCAodGhlIHN0dXBpZCBu byBtb3JlIHRoYW4gMTAyNCBwYWdlcwpyZXN0cmljdGlvbiBpbiB0aGUgd3JpdGViYWNrIGNvZGUg YmVmb3JlIGl0IGp1bXBzIHRvIHN0YXJ0IHdyaXRpbmcKc29tZSBvdGhlciBpbm9kZSksIHlvdSBh Y3R1YWxseSBuZWVkIHRvIGxldCB0aGUgcmV0dXJuZWQKd2JjLT5ucl90b193cml0ZSBnbyBuZWdh dGl2ZSwgc28gdGhhdCB3Yl93cml0ZWJhY2soKSBrbm93cyBob3cgbWFueQpwYWdlcyBpdCBoYXMg d3JpdHRlbi4KCkluIG90aGVyIHdvcmRzLCB0aGUgd3JpdGViYWNrIGNvZGUgYXNzdW1lcyB0aGF0 IAoKICA8b3JpZ25hbCB2YWx1ZSBvZiBucl90b193cml0ZT4gLSA8cmV0dXJuZWQgd2JjLT5ucl90 b193cml0ZT4KCmlzCgogIDxudW1iZXIgb2YgcGFnZXMgYWN0dWFsbHkgd3JpdHRlbj4KCklmIHlv dSBkb24ndCBsZXQgd2JjLT5ucl90b193cml0ZSBnbyBuZWdhdGl2ZSwgdGhlIHdyaXRlYmFjayBj b2RlIHdpbGwKYmUgY29uZnVzZWQgYWJvdXQgaG93IG1hbnkgcGFnZXMgd2VyZSBfYWN0dWFsbHlf IHdyaXR0ZW4sIGFuZCB0aGUKd3JpdGViYWNrIGNvZGUgZW5kcyB1cCB3cml0aW5nIHRvbyBtdWNo LiAgU2VlIGNvbW1pdCAyZmFmMmUxLgoKQWxsIG9mIHRoaXMgaXMgYSBjcm9jayBvZiBjb3Vyc2Uu ICBUaGUgZmlsZSBzeXN0ZW0gc2hvdWxkbid0IGJlCnNlY29uZC1ndWVzc2luZyB0aGUgd3JpdGVi YWNrIGNvZGUuICBJbnN0ZWFkIHRoZSB3cml0ZWJhY2sgY29kZSBzaG91bGQKYmUgYWRhcHRpdmVs eSBtZWFzdXJpbmcgaG93IGxvbmcgaXQgdGFrZXMgdG8gd2VyZSB3cml0dGVuIG91dCBOIHBhZ2Vz CnRvIGEgcGFydGljdWxhciBibG9jayBkZXZpY2UsIGFuZCB0aGVuIGRlY2lkZSB3aGF0J3MgdGhl IGFwcHJvcHJpYXRlCnNldHRpbmcgZm9yIG5yX3RvX3dyaXRlLiAgV2hhdCBtYWtlcyBzZW5zZSBm b3IgYSBVU0Igc3RpY2ssIG9yIGEgNDIwMApSUE0gbGFwdG9wIGRyaXZlLCBtYXkgbm90IG1ha2Ug c2Vuc2UgZm9yIGEgbWFzc2l2ZSBSQUlEIGFycmF5Li4uLgoKQnV0IHNpbmNlIHdlIGRvbid0IGhh dmUgdGhhdCwgYm90aCBYRlMgYW5kIGV4dDQgaGF2ZSB3b3JrYXJvdW5kcyBmb3IKYnJhaW4tZGFt YWdlZCB3cml0ZWJhY2sgYmVoYXZpb3VyLiAgKEkgZGlkIHNvbWUgdGVzdGluZywgYW5kIGV2ZW4g Zm9yCnN0YW5kYXJkIGxhcHRvcCBkcml2ZXMgdGhlIGNhcCBvZiAxMDI0IHBhZ2VzIGlzIGp1c3Qg V2F5IFRvbyBTbWFsbDsKdGhhdCBsaW1pdCB3YXMgc2V0IHNvbWV0aGluZyBsaWtlIGEgZGVjYWRl IGFnbywgYW5kIGV2ZXJ5b25lIGhhcyBiZWVuCmFmcmFpZCB0byBjaGFuZ2UgaXQsIGV2ZW4gdGhv dWdoIGRpc2tzIGhhdmUgZ290dGVuIGEgd2VlIGJpdCBmYXN0ZXIKc2luY2UgdGhvc2UgZGF5cy4p CgogICAgCSAgIAkgICAgICAJICAgICAgIAkgICAgIAkgLSBUZWQKCl9fX19fX19fX19fX19fX19f X19fX19fX19fX19fX19fX19fX19fX19fX19fX19fCnhmcyBtYWlsaW5nIGxpc3QKeGZzQG9zcy5z Z2kuY29tCmh0dHA6Ly9vc3Muc2dpLmNvbS9tYWlsbWFuL2xpc3RpbmZvL3hmcwo= From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753181Ab0DYDd0 (ORCPT ); Sat, 24 Apr 2010 23:33:26 -0400 Received: from THUNK.ORG ([69.25.196.29]:33844 "EHLO thunker.thunk.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751783Ab0DYDdT (ORCPT ); Sat, 24 Apr 2010 23:33:19 -0400 Date: Sat, 24 Apr 2010 23:33:15 -0400 From: tytso@mit.edu To: Dave Chinner Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages Message-ID: <20100425033315.GC667@thunk.org> Mail-Followup-To: tytso@mit.edu, Dave Chinner , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com References: <1271731314-5893-1-git-send-email-david@fromorbit.com> <1271731314-5893-4-git-send-email-david@fromorbit.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <1271731314-5893-4-git-send-email-david@fromorbit.com> User-Agent: Mutt/1.5.20 (2009-06-14) X-SA-Exim-Connect-IP: X-SA-Exim-Mail-From: tytso@thunk.org X-SA-Exim-Scanned: No (on thunker.thunk.org); SAEximRunCond expanded to false Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 20, 2010 at 12:41:53PM +1000, Dave Chinner wrote: > From: Dave Chinner > > If a filesystem writes more than one page in ->writepage, write_cache_pages > fails to notice this and continues to attempt writeback when wbc->nr_to_write > has gone negative - this trace was captured from XFS: > > > wbc_writeback_start: towrt=1024 > wbc_writepage: towrt=1024 > wbc_writepage: towrt=0 > wbc_writepage: towrt=-1 > wbc_writepage: towrt=-5 > wbc_writepage: towrt=-21 > wbc_writepage: towrt=-85 > > This has adverse effects on filesystem writeback behaviour. write_cache_pages() > needs to terminate after a certain number of pages are written, not after a > certain number of calls to ->writepage are made. Make it observe the current > value of wbc->nr_to_write and treat a value of <= 0 as though it is a either a > termination condition or a trigger to reset to MAX_WRITEḆACK_PAGES for data > integrity syncs. Be careful here. If you are going to write more pages than what the writeback code has requested (the stupid no more than 1024 pages restriction in the writeback code before it jumps to start writing some other inode), you actually need to let the returned wbc->nr_to_write go negative, so that wb_writeback() knows how many pages it has written. In other words, the writeback code assumes that - nr_to_write> is If you don't let wbc->nr_to_write go negative, the writeback code will be confused about how many pages were _actually_ written, and the writeback code ends up writing too much. See commit 2faf2e1. All of this is a crock of course. The file system shouldn't be second-guessing the writeback code. Instead the writeback code should be adaptively measuring how long it takes to were written out N pages to a particular block device, and then decide what's the appropriate setting for nr_to_write. What makes sense for a USB stick, or a 4200 RPM laptop drive, may not make sense for a massive RAID array.... But since we don't have that, both XFS and ext4 have workarounds for brain-damaged writeback behaviour. (I did some testing, and even for standard laptop drives the cap of 1024 pages is just Way Too Small; that limit was set something like a decade ago, and everyone has been afraid to change it, even though disks have gotten a wee bit faster since those days.) - Ted From mboxrd@z Thu Jan 1 00:00:00 1970 From: tytso@mit.edu Subject: Re: [PATCH 3/4] writeback: pay attention to wbc->nr_to_write in write_cache_pages Date: Sat, 24 Apr 2010 23:33:15 -0400 Message-ID: <20100425033315.GC667@thunk.org> References: <1271731314-5893-1-git-send-email-david@fromorbit.com> <1271731314-5893-4-git-send-email-david@fromorbit.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com To: Dave Chinner Return-path: Received: from THUNK.ORG ([69.25.196.29]:33844 "EHLO thunker.thunk.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751783Ab0DYDdT (ORCPT ); Sat, 24 Apr 2010 23:33:19 -0400 Content-Disposition: inline In-Reply-To: <1271731314-5893-4-git-send-email-david@fromorbit.com> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Tue, Apr 20, 2010 at 12:41:53PM +1000, Dave Chinner wrote: > From: Dave Chinner >=20 > If a filesystem writes more than one page in ->writepage, write_cache= _pages > fails to notice this and continues to attempt writeback when wbc->nr_= to_write > has gone negative - this trace was captured from XFS: >=20 >=20 > wbc_writeback_start: towrt=3D1024 > wbc_writepage: towrt=3D1024 > wbc_writepage: towrt=3D0 > wbc_writepage: towrt=3D-1 > wbc_writepage: towrt=3D-5 > wbc_writepage: towrt=3D-21 > wbc_writepage: towrt=3D-85 >=20 > This has adverse effects on filesystem writeback behaviour. write_cac= he_pages() > needs to terminate after a certain number of pages are written, not a= fter a > certain number of calls to ->writepage are made. Make it observe the = current > value of wbc->nr_to_write and treat a value of <=3D 0 as though it is= a either a > termination condition or a trigger to reset to MAX_WRITE=E1=B8=86ACK_= PAGES for data > integrity syncs. Be careful here. If you are going to write more pages than what the writeback code has requested (the stupid no more than 1024 pages restriction in the writeback code before it jumps to start writing some other inode), you actually need to let the returned wbc->nr_to_write go negative, so that wb_writeback() knows how many pages it has written. In other words, the writeback code assumes that=20 - nr_to_write> is If you don't let wbc->nr_to_write go negative, the writeback code will be confused about how many pages were _actually_ written, and the writeback code ends up writing too much. See commit 2faf2e1. All of this is a crock of course. The file system shouldn't be second-guessing the writeback code. Instead the writeback code should be adaptively measuring how long it takes to were written out N pages to a particular block device, and then decide what's the appropriate setting for nr_to_write. What makes sense for a USB stick, or a 4200 RPM laptop drive, may not make sense for a massive RAID array.... But since we don't have that, both XFS and ext4 have workarounds for brain-damaged writeback behaviour. (I did some testing, and even for standard laptop drives the cap of 1024 pages is just Way Too Small; that limit was set something like a decade ago, and everyone has been afraid to change it, even though disks have gotten a wee bit faster since those days.) - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel= " in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html