From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cuda.sgi.com (cuda1.sgi.com [192.48.157.11]) by oss.sgi.com (8.14.3/8.14.3/SuSE Linux 0.8) with ESMTP id o4S5H7Oj196346 for ; Fri, 28 May 2010 00:17:07 -0500 Received: from mx2.suse.de (localhost [127.0.0.1]) by cuda.sgi.com (Spam Firewall) with ESMTP id 84098130C1A0 for ; Thu, 27 May 2010 22:21:22 -0700 (PDT) Received: from mx2.suse.de (cantor2.suse.de [195.135.220.15]) by cuda.sgi.com with ESMTP id 8IlAKx3OwH4b0pAq for ; Thu, 27 May 2010 22:21:22 -0700 (PDT) Date: Fri, 28 May 2010 15:19:24 +1000 From: Nick Piggin Subject: Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Message-ID: <20100528051924.GZ22536@laptop> References: <1274777588-21494-1-git-send-email-david@fromorbit.com> <1274777588-21494-4-git-send-email-david@fromorbit.com> <20100527063523.GJ22536@laptop> <20100527224034.GO12087@dastard> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20100527224034.GO12087@dastard> List-Id: XFS Filesystem from SGI List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Sender: xfs-bounces@oss.sgi.com Errors-To: xfs-bounces@oss.sgi.com To: Dave Chinner Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, xfs@oss.sgi.com T24gRnJpLCBNYXkgMjgsIDIwMTAgYXQgMDg6NDA6MzRBTSArMTAwMCwgRGF2ZSBDaGlubmVyIHdy b3RlOgo+IE9uIFRodSwgTWF5IDI3LCAyMDEwIGF0IDA0OjM1OjIzUE0gKzEwMDAsIE5pY2sgUGln Z2luIHdyb3RlOgo+ID4gQnV0IHdlIGNhbiB0aGluayBvZiBpbm9kZXMgdGhhdCBhcmUgb25seSBp biB1c2UgYnkgdW51c2VkIChhbmQgYWdlZCkKPiA+IGRlbnRyaWVzIGFzIGVmZmVjdGl2ZWx5IHVu dXNlZCB0aGVtc2VsdmVzLiBTbyB0aGlzIHNlcXVlbmNlIHVuZGVyCj4gPiBlc3RpbWF0ZXMgaG93 IG1hbnkgaW5vZGVzIHRvIHNjYW4uIFRoaXMgY291bGQgYmlhcyBwcmVzc3VyZSBhZ2FpbnN0Cj4g PiBkY2FjaGUgSSdkIHRoaW5rLCBlc3BlY2lhbGx5IGNvbnNpZGVyaW5nIGlub2RlcyBhcmUgZmFy IGxhcmdlciB0aGFuCj4gPiBkZW50cmllcy4gTWF5YmUgcmVxdWlyZSAyIHBhc3NlcyB0byBnZXQg dGhlIGlub2RlcyB1bnVzZWQgaW50aGUKPiA+IGZpcnN0IHBhc3MuCj4gCj4gSXQncyBzZWxmLWJh bGFuY2luZyAtIGl0IHRyZW5kcyB0b3dhcmRzIGFuIGVxdWFsIG51bWJlciBvZiB1bnVzZWQKPiBk ZW50cmllcyBhbmQgaW5vZGVzIGluIHRoZSBjYWNoZXMuIFllcywgaXQgd2lsbCB0ZWFyIGRvd24g bW9yZQo+IGRlbnRyaWVzIGF0IGZpcnN0LCBidXQgd2UgbmVlZCB0byBkbyB0aGF0IHRvIGJlIGFi bGUgdG8gcmVjbGFpbQo+IGlub2Rlcy4KCkJ1dCB0aGVuIGl0IGRvZXNuJ3Qgc2NhbiBlbm91Z2gg aW5vZGVzIG9uIHRoZSBpbm9kZSBwYXNzLgoKCj4g4oSrcyByZWNsYWltIHByb2dyZXNzZXMgdGhl IHByb3BvdGlvbiBvZiBpbm9kZXMgaW5jcmVhc2VzLCBzbwo+IHRoZSBhbW91bnQgb2YgaW5vZGVz IHJlY2xhaW1lZCBpbmNyZWFzZXMuIAo+IAo+IEJhc2ljYWxseSB0aGlzIGlzIGEgcmVjb2duaXRp b24gdGhhdCB0aGUgaW1wb3J0YW50IGNhY2hlIGZvcgo+IGF2b2lkaW5nIElPIGlzIHRoZSBpbm9k ZSBjYWNoZSwgbm90IGhlIGRlbnRyeSBjYWNoZS4gT25jZSB0aGUgaW5vZGUKCllvdSBjYW4gYmlh cyBhZ2FpbnN0IHRoZSBkY2FjaGUgdXNpbmcgbXVsdGlwbGllcnMuCgoKPiBjYWNoZSBpcyBmcmVl ZCB0aGF0IHdlIG5lZWQgdG8gZG8gSU8gdG8gcmVwb3B1bGF0ZSBpdCwgYnV0Cj4gcmVidWlsZGlu ZyBkZW50cmllcyBmcm9tdGVoIGlub2RlIGNhY2hlIG9ubHkgY29zdHMgQ1BVIHRpbWUuIEhlbmNl Cj4gdW5kZXIgbGlnaHQgcmVjbGFpbSwgaW5vZGVzIGFyZSBtb3N0bHkgbGVmdCBpbiBjYWNoZSBi dXQgd2UgZnJlZSB1cAo+IG1lbW9yeSB0aGF0IG9ubHkgY29zdHMgQ1BVIHRvIHJlYnVpbGQuIFVu ZGVyIGhlYXZ5LCBzdXN0YWluZWQKPiByZWNsYWltLCB3ZSB0cmVuZCB0b3dhcmRzIGZyZWVpbmcg ZXF1YWwgYW1vdW50cyBvZiBvYmplY3RzIGZyb20gYm90aAo+IGNhY2hlcy4KCkkgZG9uJ3Qga25v dyBpZiB5b3UndmUgZ290IG51bWJlcnMgb3IgcGF0dGVybnMgdG8ganVzdGlmeSB0aGF0LgpNeSBw b2ludCBpcyB0aGF0IHRoaW5ncyBzaG91bGQgc3RheSBhcyBjbG9zZSB0byB0aGUgb2xkIGNvZGUg YXMKcG9zc2libGUgd2l0aG91dCBnb29kIHJlYXNvbi4KCiAKPiBUaGlzIGlzIHByZXR0eSBtdWNo IHdoYXQgdGhlIGN1cnJlbnQgY29kZSBhdHRlbXB0cyB0byBkbyAtIGZyZWUgYQo+IGxvdCBvZiBk ZW50cmllcywgdGhlbiBmcmVlIGEgc21hbGxlciBhbW91bnQgb2YgdGhlIGlub2RlcyB0aGF0IHdl cmUKPiB1c2VkIGJ5IHRoZSBmcmVlZCBkZW50cmllcy4gT25jZSBhZ2FpbiBpdCBpcyBhIGRpcmVj dCBlbmNvZGluZyBvZgo+IHdoYXQgaXMgY3VycmVudGx5IGFuIGltcGxpY2l0IGRlc2lnbiBmZWF0 dXJlIC0gaXQgbWFrZXMgaXQgKm9idmlvdXMqCj4gaG93IHdlIGFyZSB0cnlpbmcgdG8gYmFsYW5j ZSB0aGUgY2FjaGVzLgoKV2l0aCB5b3VyIHBhdGNoZXMsIGlmIHRoZXJlIGFyZSBubyBpbm9kZXMg ZnJlZSB5b3Ugd291bGQgbmVlZCB0byB0YWtlCjIgcGFzc2VzIGF0IGZyZWVpbmcgdGhlIGRlbnRy eSBjYWNoZS4gTXkgc3VnZ2VzdGlvbiBpcyBjbG9zZXIgdG8gdGhlCmN1cnJlbnQgY29kZS4KIAoK PiBBbm90aGVyIHJlYXNvbiBmb3IgdGhpcyBpcyB0aGF0IHRoZSBjYWxjdWxhdGlvbiBjaGFuZ2Vz IGFnYWluIHRvCj4gYWxsb3cgZmlsZXN5c3RlbSBjYWNoZXMgdG8gbW9kaXkgdGhpcyBwcm9wb3J0 aW9uaW5nIGluIHRoZSBuZXh0Cj4gcGF0Y2guLi4uCj4gCj4gRldJVywgdGhpcyBhbHNvIG1ha2Vz IHdvcmtsb2FkcyB0aGF0IGdlbmVyYXRlIGh1bmRyZWRzIG9mIHRob3VzYW5kcwo+IG9mIG5ldmVy LXRvLWJlLXVzZWQgYWdhaW4gbmVnYXRpdmUgZGVudHJpZXMgZnJlZSBkY2FjaGUgbWVtb3J5IHJl YWxseQo+IHF1aWNrbHkgb24gbWVtb3J5IHByZXNzdXJlLi4uCgpUaGF0IHdvdWxkIHN0aWxsIGJl IHRoZSBjYXNlIGJlY2F1c2UgdXNlZCBpbm9kZXMgYXJlbid0IGdldHRpbmcgdGhlaXIKZGVudHJp ZXMgZnJlZWQgc28gbGl0dGxlIGlub2RlIHNjYW5uaW5nIHdpbGwgb2NjdXIuCgo+IAo+ID4gUGFy dCBvZiB0aGUgcHJvYmxlbSBpcyB0aGUgZnVubnkgc2hyaW5rZXIgQVBJLgo+ID4gCj4gPiBUaGUg cmlnaHQgd2F5IHRvIGRvIGl0IGlzIHRvIGNoYW5nZSB0aGUgc2hyaW5rZXIgQVBJIHNvIHRoYXQg aXQgcGFzc2VzCj4gPiBkb3duIHRoZSBscnVfcGFnZXMgYW5kIHNjYW5uZWQgaW50byB0aGUgY2Fs bGJhY2suIEZyb20gdGhlcmUsIHRoZQo+ID4gc2hyaW5rZXJzIGNhbiBjYWxjdWxhdGUgdGhlIGFw cHJvcHJpYXRlIHJhdGlvIG9mIG9iamVjdHMgdG8gc2Nhbi4KPiA+IE5vIG5lZWQgZm9yIDItY2Fs bCBzY2hlbWUsIG5vIG5lZWQgZm9yIHNocmlua2VyLT5zZWVrcywgYW5kIHRoZQo+ID4gYWJpbGl0 eSB0byBjYWxjdWxhdGUgYW4gYXBwcm9wcmlhdGUgcmF0aW8gZmlyc3QgZm9yIGRjYWNoZSwgYW5k ICp0aGVuKgo+ID4gZm9yIGljYWNoZS4KPiAKPiBNeSBvbmx5IGNvbmNlcm4gYWJvdXQgdGhpcyBp cyB0aGF0IGV4cG9zZXMgdGhlIGlubmVyIHdvcmtpbmdzIG9mIHRoZQo+IHNocmlua2VyIGFuZCBt bSBzdWJzeXN0ZW0gdG8gY29kZSB0aGF0IHNpbXBseSBkb2Vzbid0IG5lZWQgdG8ga25vdwo+IGFi b3V0IGl0LgoKSXQncyBqdXN0IHByb3ZpZGluZyBhIHJhdGlvLiBUaGUgc2hyaW5rZXJzIGFsbHJl YWR5IGtub3cgdGhleSBhcmUKc2Nhbm5pbmcgYmFzZWQgb24gYSByYXRpbyBvZiBwYWdlY2FjaGUg c2Nhbm5lZC4KCgo+ID4gQSBoZWxwZXIgb2YgY291cnNlIGNhbiBkbyB0aGUgY2FsY3VsYXRpb24g KGNvbnNpZGVyaW5nIHRoYXQgZXZlcnkKPiA+IGRyaXZlciBhbmQgdGhlaXIgZG9nIHdpbGwgZG8g dGhlIHdyb25nIHRoaW5nIGlmIHdlIGxldCB0aGVtIDopKS4KPiA+IAo+ID4gdW5zaWduZWQgbG9u ZyBzaHJpbmtlcl9zY2FuKHVuc2lnbmVkIGxvbmcgbHJ1X3BhZ2VzLAo+ID4gCQkJdW5zaWduZWQg bG9uZyBscnVfc2Nhbm5lZCwKPiA+IAkJCXVuc2lnbmVkIGxvbmcgbnJfb2JqZWN0cywKPiA+IAkJ CXVuc2lnbmVkIGxvbmcgc2Nhbl9yYXRpbykKPiA+IHsKPiA+IAl1bnNpZ25lZCBsb25nIGxvbmcg dG1wID0gbnJfb2JqZWN0czsKPiA+IAo+ID4gCXRtcCAqPSBscnVfc2Nhbm5lZCAqIDEwMDsKPiA+ IAlkb19kaXYodG1wLCAobHJ1X3BhZ2VzICogc2Nhbl9yYXRpbykgKyAxKTsKPiA+IAo+ID4gCXJl dHVybiAodW5zaWduZWQgbG9uZyl0bXA7Cj4gPiB9Cj4gPiAKPiA+IFRoZW4gdGhlIHNocmlua2Vy IGNhbGxiYWNrIHdpbGwgZ286Cj4gPiAJc2ItPnNfbnJfZGVudHJ5X3NjYW4gKz0gc2hyaW5rZXJf c2NhbihscnVfcGFnZXMsIGxydV9zY2FubmVkLAo+ID4gCQkJCXNiLT5zX25yX2RlbnRyeV91bnVz ZWQsCj4gPiAJCQkJdmZzX2NhY2hlX3ByZXNzdXJlICogU0VFS1NfUEVSX0RFTlRSWSk7Cj4gPiAJ aWYgKHNiLT5zX25yX2RlbnRyeV9zY2FuID4gU0hSSU5LX0JBVENIKQo+ID4gCQlwcnVuZV9kY2Fj aGUoKQo+ID4gCj4gPiAJc2ItPnNfbnJfaW5vZGVfc2NhbiArPSBzaHJpbmtlcl9zY2FuKGxydV9w YWdlcywgbHJ1X3NjYW5uZWQsCj4gPiAJCQkJc2ItPnNfbnJfaW5vZGVzX3VudXNlZCwKPiA+IAkJ CQl2ZnNfY2FjaGVfcHJlc3N1cmUgKiBTRUVLU19QRVJfSU5PREUpOwo+ID4gCS4uLgo+ID4gCj4g PiBXaGF0IGRvIHlvdSB0aGluayBvZiB0aGF0PyBTZWVpbmcgYXMgd2UncmUgY2hhbmdpbmcgdGhl IHNocmlua2VyIEFQSQo+ID4gYW55d2F5LCBJJ2QgdGhpbmsgaXQgaXMgaGlnaCB0aW1lIHRvIGRv IHNvbXRoaW5nIGxpa2UgdGhpcy4KPiAKPiBJZ25vcmluZyB0aGUgZGNhY2hlL2ljYWNoZSByZWNs YWltIHJhdGlvIGlzc3VlcywgSSdkIHByZWZlciBhIHR3bwoKV2VsbCBpZiBpdCBpcyBhbiBpc3N1 ZSwgaXQgc2hvdWxkIGJlIGNoYW5nZWQgaW4gYSBkaWZmZXJlbnQgcGF0Y2gKSSB0aGluayAod2l0 aCBudW1iZXJzKS4KCgo+IGNhbGwgQVBJIHRoYXQgbWF0Y2hlcyB0aGUgY3VycmVudCBiZWhhdmlv dXIsIGxlYXZpbmcgdGhlIGNhY2x1bGF0aW9uCj4gb2YgaG93IG11Y2ggdG8gcmVjbGFpbSBpbiBz aHJpbmtfc2xhYigpLiBFbmNvZGluZyBpdCB0aGlzIHdheSBtYWtlcwo+IGl0IG1vcmUgZGlmZmlj dWx0IHRvIGNoYW5nZSB0aGUgaGlnaCBsZXZlbCBiZWhhdmlvdXIgZS5nLiBpZiB3ZSB3YW50Cj4g dG8gbW9kaWZ5IHRoZSBhbW91bnQgb2Ygc2xhYiByZWNsYWltIGJhc2VkIG9uIHJlY2xhaW0gcHJp b3JpdHksIHdlJ2QKPiBoYXZlIHRvIGNhaG5nZSBldmVyeSBzaHJpbmtlciBpbnN0ZWFkIG9mIGp1 c3Qgc2hyaW5rX3NsYWIoKS4KCldlIGNhbiBtb2RpZml5IHRoZSByYXRpb3MgYmVmb3JlIGNhbGxp bmcgaWYgbmVlZGVkLCBvciBoYXZlIGEgZGVmYXVsdApyYXRpbyBkZWZpbmUgdG8gbXVsdGlwbHkg d2l0aCBhcyB3ZWxsLgoKQnV0IHNocmlua2VycyBhcmUgdmVyeSBzdWJzeXN0ZW0gc3BlY2lmaWMu CgpfX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fXwp4ZnMgbWFp bGluZyBsaXN0Cnhmc0Bvc3Muc2dpLmNvbQpodHRwOi8vb3NzLnNnaS5jb20vbWFpbG1hbi9saXN0 aW5mby94ZnMK From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754009Ab0E1FT3 (ORCPT ); Fri, 28 May 2010 01:19:29 -0400 Received: from cantor2.suse.de ([195.135.220.15]:48576 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751260Ab0E1FT2 (ORCPT ); Fri, 28 May 2010 01:19:28 -0400 Date: Fri, 28 May 2010 15:19:24 +1000 From: Nick Piggin To: Dave Chinner Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, xfs@oss.sgi.com Subject: Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Message-ID: <20100528051924.GZ22536@laptop> References: <1274777588-21494-1-git-send-email-david@fromorbit.com> <1274777588-21494-4-git-send-email-david@fromorbit.com> <20100527063523.GJ22536@laptop> <20100527224034.GO12087@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20100527224034.GO12087@dastard> User-Agent: Mutt/1.5.20 (2009-06-14) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote: > On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote: > > But we can think of inodes that are only in use by unused (and aged) > > dentries as effectively unused themselves. So this sequence under > > estimates how many inodes to scan. This could bias pressure against > > dcache I'd think, especially considering inodes are far larger than > > dentries. Maybe require 2 passes to get the inodes unused inthe > > first pass. > > It's self-balancing - it trends towards an equal number of unused > dentries and inodes in the caches. Yes, it will tear down more > dentries at first, but we need to do that to be able to reclaim > inodes. But then it doesn't scan enough inodes on the inode pass. > Ås reclaim progresses the propotion of inodes increases, so > the amount of inodes reclaimed increases. > > Basically this is a recognition that the important cache for > avoiding IO is the inode cache, not he dentry cache. Once the inode You can bias against the dcache using multipliers. > cache is freed that we need to do IO to repopulate it, but > rebuilding dentries fromteh inode cache only costs CPU time. Hence > under light reclaim, inodes are mostly left in cache but we free up > memory that only costs CPU to rebuild. Under heavy, sustained > reclaim, we trend towards freeing equal amounts of objects from both > caches. I don't know if you've got numbers or patterns to justify that. My point is that things should stay as close to the old code as possible without good reason. > This is pretty much what the current code attempts to do - free a > lot of dentries, then free a smaller amount of the inodes that were > used by the freed dentries. Once again it is a direct encoding of > what is currently an implicit design feature - it makes it *obvious* > how we are trying to balance the caches. With your patches, if there are no inodes free you would need to take 2 passes at freeing the dentry cache. My suggestion is closer to the current code. > Another reason for this is that the calculation changes again to > allow filesystem caches to modiy this proportioning in the next > patch.... > > FWIW, this also makes workloads that generate hundreds of thousands > of never-to-be-used again negative dentries free dcache memory really > quickly on memory pressure... That would still be the case because used inodes aren't getting their dentries freed so little inode scanning will occur. > > > Part of the problem is the funny shrinker API. > > > > The right way to do it is to change the shrinker API so that it passes > > down the lru_pages and scanned into the callback. From there, the > > shrinkers can calculate the appropriate ratio of objects to scan. > > No need for 2-call scheme, no need for shrinker->seeks, and the > > ability to calculate an appropriate ratio first for dcache, and *then* > > for icache. > > My only concern about this is that exposes the inner workings of the > shrinker and mm subsystem to code that simply doesn't need to know > about it. It's just providing a ratio. The shrinkers allready know they are scanning based on a ratio of pagecache scanned. > > A helper of course can do the calculation (considering that every > > driver and their dog will do the wrong thing if we let them :)). > > > > unsigned long shrinker_scan(unsigned long lru_pages, > > unsigned long lru_scanned, > > unsigned long nr_objects, > > unsigned long scan_ratio) > > { > > unsigned long long tmp = nr_objects; > > > > tmp *= lru_scanned * 100; > > do_div(tmp, (lru_pages * scan_ratio) + 1); > > > > return (unsigned long)tmp; > > } > > > > Then the shrinker callback will go: > > sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned, > > sb->s_nr_dentry_unused, > > vfs_cache_pressure * SEEKS_PER_DENTRY); > > if (sb->s_nr_dentry_scan > SHRINK_BATCH) > > prune_dcache() > > > > sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned, > > sb->s_nr_inodes_unused, > > vfs_cache_pressure * SEEKS_PER_INODE); > > ... > > > > What do you think of that? Seeing as we're changing the shrinker API > > anyway, I'd think it is high time to do somthing like this. > > Ignoring the dcache/icache reclaim ratio issues, I'd prefer a two Well if it is an issue, it should be changed in a different patch I think (with numbers). > call API that matches the current behaviour, leaving the caclulation > of how much to reclaim in shrink_slab(). Encoding it this way makes > it more difficult to change the high level behaviour e.g. if we want > to modify the amount of slab reclaim based on reclaim priority, we'd > have to cahnge every shrinker instead of just shrink_slab(). We can modifiy the ratios before calling if needed, or have a default ratio define to multiply with as well. But shrinkers are very subsystem specific. From mboxrd@z Thu Jan 1 00:00:00 1970 From: Nick Piggin Subject: Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Date: Fri, 28 May 2010 15:19:24 +1000 Message-ID: <20100528051924.GZ22536@laptop> References: <1274777588-21494-1-git-send-email-david@fromorbit.com> <1274777588-21494-4-git-send-email-david@fromorbit.com> <20100527063523.GJ22536@laptop> <20100527224034.GO12087@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, xfs@oss.sgi.com To: Dave Chinner Return-path: Content-Disposition: inline In-Reply-To: <20100527224034.GO12087@dastard> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote: > On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote: > > But we can think of inodes that are only in use by unused (and aged) > > dentries as effectively unused themselves. So this sequence under > > estimates how many inodes to scan. This could bias pressure against > > dcache I'd think, especially considering inodes are far larger than > > dentries. Maybe require 2 passes to get the inodes unused inthe > > first pass. >=20 > It's self-balancing - it trends towards an equal number of unused > dentries and inodes in the caches. Yes, it will tear down more > dentries at first, but we need to do that to be able to reclaim > inodes. But then it doesn't scan enough inodes on the inode pass. > =E2=84=ABs reclaim progresses the propotion of inodes increases, so > the amount of inodes reclaimed increases.=20 >=20 > Basically this is a recognition that the important cache for > avoiding IO is the inode cache, not he dentry cache. Once the inode You can bias against the dcache using multipliers. > cache is freed that we need to do IO to repopulate it, but > rebuilding dentries fromteh inode cache only costs CPU time. Hence > under light reclaim, inodes are mostly left in cache but we free up > memory that only costs CPU to rebuild. Under heavy, sustained > reclaim, we trend towards freeing equal amounts of objects from both > caches. I don't know if you've got numbers or patterns to justify that. My point is that things should stay as close to the old code as possible without good reason. =20 > This is pretty much what the current code attempts to do - free a > lot of dentries, then free a smaller amount of the inodes that were > used by the freed dentries. Once again it is a direct encoding of > what is currently an implicit design feature - it makes it *obvious* > how we are trying to balance the caches. With your patches, if there are no inodes free you would need to take 2 passes at freeing the dentry cache. My suggestion is closer to the current code. =20 > Another reason for this is that the calculation changes again to > allow filesystem caches to modiy this proportioning in the next > patch.... >=20 > FWIW, this also makes workloads that generate hundreds of thousands > of never-to-be-used again negative dentries free dcache memory really > quickly on memory pressure... That would still be the case because used inodes aren't getting their dentries freed so little inode scanning will occur. >=20 > > Part of the problem is the funny shrinker API. > >=20 > > The right way to do it is to change the shrinker API so that it passe= s > > down the lru_pages and scanned into the callback. From there, the > > shrinkers can calculate the appropriate ratio of objects to scan. > > No need for 2-call scheme, no need for shrinker->seeks, and the > > ability to calculate an appropriate ratio first for dcache, and *then= * > > for icache. >=20 > My only concern about this is that exposes the inner workings of the > shrinker and mm subsystem to code that simply doesn't need to know > about it. It's just providing a ratio. The shrinkers allready know they are scanning based on a ratio of pagecache scanned. > > A helper of course can do the calculation (considering that every > > driver and their dog will do the wrong thing if we let them :)). > >=20 > > unsigned long shrinker_scan(unsigned long lru_pages, > > unsigned long lru_scanned, > > unsigned long nr_objects, > > unsigned long scan_ratio) > > { > > unsigned long long tmp =3D nr_objects; > >=20 > > tmp *=3D lru_scanned * 100; > > do_div(tmp, (lru_pages * scan_ratio) + 1); > >=20 > > return (unsigned long)tmp; > > } > >=20 > > Then the shrinker callback will go: > > sb->s_nr_dentry_scan +=3D shrinker_scan(lru_pages, lru_scanned, > > sb->s_nr_dentry_unused, > > vfs_cache_pressure * SEEKS_PER_DENTRY); > > if (sb->s_nr_dentry_scan > SHRINK_BATCH) > > prune_dcache() > >=20 > > sb->s_nr_inode_scan +=3D shrinker_scan(lru_pages, lru_scanned, > > sb->s_nr_inodes_unused, > > vfs_cache_pressure * SEEKS_PER_INODE); > > ... > >=20 > > What do you think of that? Seeing as we're changing the shrinker API > > anyway, I'd think it is high time to do somthing like this. >=20 > Ignoring the dcache/icache reclaim ratio issues, I'd prefer a two Well if it is an issue, it should be changed in a different patch I think (with numbers). > call API that matches the current behaviour, leaving the caclulation > of how much to reclaim in shrink_slab(). Encoding it this way makes > it more difficult to change the high level behaviour e.g. if we want > to modify the amount of slab reclaim based on reclaim priority, we'd > have to cahnge every shrinker instead of just shrink_slab(). We can modifiy the ratios before calling if needed, or have a default ratio define to multiply with as well. But shrinkers are very subsystem specific. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 2B4596B01BC for ; Fri, 28 May 2010 01:19:30 -0400 (EDT) Date: Fri, 28 May 2010 15:19:24 +1000 From: Nick Piggin Subject: Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Message-ID: <20100528051924.GZ22536@laptop> References: <1274777588-21494-1-git-send-email-david@fromorbit.com> <1274777588-21494-4-git-send-email-david@fromorbit.com> <20100527063523.GJ22536@laptop> <20100527224034.GO12087@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20100527224034.GO12087@dastard> Sender: owner-linux-mm@kvack.org To: Dave Chinner Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, xfs@oss.sgi.com List-ID: On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote: > On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote: > > But we can think of inodes that are only in use by unused (and aged) > > dentries as effectively unused themselves. So this sequence under > > estimates how many inodes to scan. This could bias pressure against > > dcache I'd think, especially considering inodes are far larger than > > dentries. Maybe require 2 passes to get the inodes unused inthe > > first pass. > > It's self-balancing - it trends towards an equal number of unused > dentries and inodes in the caches. Yes, it will tear down more > dentries at first, but we need to do that to be able to reclaim > inodes. But then it doesn't scan enough inodes on the inode pass. > a?< the amount of inodes reclaimed increases. > > Basically this is a recognition that the important cache for > avoiding IO is the inode cache, not he dentry cache. Once the inode You can bias against the dcache using multipliers. > cache is freed that we need to do IO to repopulate it, but > rebuilding dentries fromteh inode cache only costs CPU time. Hence > under light reclaim, inodes are mostly left in cache but we free up > memory that only costs CPU to rebuild. Under heavy, sustained > reclaim, we trend towards freeing equal amounts of objects from both > caches. I don't know if you've got numbers or patterns to justify that. My point is that things should stay as close to the old code as possible without good reason. > This is pretty much what the current code attempts to do - free a > lot of dentries, then free a smaller amount of the inodes that were > used by the freed dentries. Once again it is a direct encoding of > what is currently an implicit design feature - it makes it *obvious* > how we are trying to balance the caches. With your patches, if there are no inodes free you would need to take 2 passes at freeing the dentry cache. My suggestion is closer to the current code. > Another reason for this is that the calculation changes again to > allow filesystem caches to modiy this proportioning in the next > patch.... > > FWIW, this also makes workloads that generate hundreds of thousands > of never-to-be-used again negative dentries free dcache memory really > quickly on memory pressure... That would still be the case because used inodes aren't getting their dentries freed so little inode scanning will occur. > > > Part of the problem is the funny shrinker API. > > > > The right way to do it is to change the shrinker API so that it passes > > down the lru_pages and scanned into the callback. From there, the > > shrinkers can calculate the appropriate ratio of objects to scan. > > No need for 2-call scheme, no need for shrinker->seeks, and the > > ability to calculate an appropriate ratio first for dcache, and *then* > > for icache. > > My only concern about this is that exposes the inner workings of the > shrinker and mm subsystem to code that simply doesn't need to know > about it. It's just providing a ratio. The shrinkers allready know they are scanning based on a ratio of pagecache scanned. > > A helper of course can do the calculation (considering that every > > driver and their dog will do the wrong thing if we let them :)). > > > > unsigned long shrinker_scan(unsigned long lru_pages, > > unsigned long lru_scanned, > > unsigned long nr_objects, > > unsigned long scan_ratio) > > { > > unsigned long long tmp = nr_objects; > > > > tmp *= lru_scanned * 100; > > do_div(tmp, (lru_pages * scan_ratio) + 1); > > > > return (unsigned long)tmp; > > } > > > > Then the shrinker callback will go: > > sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned, > > sb->s_nr_dentry_unused, > > vfs_cache_pressure * SEEKS_PER_DENTRY); > > if (sb->s_nr_dentry_scan > SHRINK_BATCH) > > prune_dcache() > > > > sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned, > > sb->s_nr_inodes_unused, > > vfs_cache_pressure * SEEKS_PER_INODE); > > ... > > > > What do you think of that? Seeing as we're changing the shrinker API > > anyway, I'd think it is high time to do somthing like this. > > Ignoring the dcache/icache reclaim ratio issues, I'd prefer a two Well if it is an issue, it should be changed in a different patch I think (with numbers). > call API that matches the current behaviour, leaving the caclulation > of how much to reclaim in shrink_slab(). Encoding it this way makes > it more difficult to change the high level behaviour e.g. if we want > to modify the amount of slab reclaim based on reclaim priority, we'd > have to cahnge every shrinker instead of just shrink_slab(). We can modifiy the ratios before calling if needed, or have a default ratio define to multiply with as well. But shrinkers are very subsystem specific. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org