linux-fsdevel.vger.kernel.org archive mirror
From: Wendy Cheng <wcheng@redhat.com>
To: Andrew Morton <akpm@osdl.org>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] prune_icache_sb
Date: Tue, 28 Nov 2006 16:41:07 -0500	[thread overview]
Message-ID: <456CACF3.7030200@redhat.com> (raw)
In-Reply-To: <20061127165239.9616cbc9.akpm@osdl.org>

Andrew Morton wrote:
> On Mon, 27 Nov 2006 18:52:58 -0500
> Wendy Cheng <wcheng@redhat.com> wrote:
>
>   
>> Not sure about walking thru sb->s_inodes for several reasons....
>>
>> 1. First, the changes made are mostly for file server setup with large 
>> fs size - the entry count in sb->s_inodes may not be shorter than the
>> inode_unused list.
>>     
>
> umm, that's the best-case.  We also care about worst-case.  Think:
> 1,000,000 inodes on inode_unused, of which a randomly-sprinkled 10,000 are
> from the being-unmounted filesystem.  The code as-proposed will do 100x more
> work than it needs to do.  All under a global spinlock.
>   
Walking through sb->s_inodes would still require taking inode_lock and 
iprune_mutex (?), since we are purging inodes from the system - or, 
specifically, removing them from the inode_unused list. That is really 
not much different from the current prune_icache() logic. What is 
proposed here is simply *exporting* the existing prune_icache() code so 
that a filesystem can trim (purge a small percentage of) its 
potentially-unused per-mount inodes for *latency* reasons.

I made a mistake by using the "page dirty ratio" to explain the problem 
(sorry - I was not thinking clearly in the previous write-up), which 
could mislead you into thinking this is a VM issue. It is not so much a 
low-on-free-pages (and/or memory fragmentation) issue, though 
fragmentation is normally part of the symptoms. What the (external) 
kernel module does is tie its cluster-wide file lock to the in-memory 
inode obtained at file look-up time. The lock is removed from the 
machine when:

1. the lock is granted to another (cluster) machine; or
2. the in-memory inode is purged from the system.

One of the clusters with this latency issue runs an IP/TV application 
that "rsync"s with a main station server (over a long geographical 
distance) every 15 minutes. This constantly generates a large number of 
inodes (and locks) that then hang around. When other nodes within the 
same cluster, serving as FTP servers, serve those files, DLM has to 
wade through a huge number of lock entries to decide whether each lock 
request can be granted. That is where the latency issue shows up. Our 
profiling data show that when cluster performance drops into 
unacceptable ranges, DLM can spend 40% of CPU cycles in its 
lock-searching logic. From the VM's point of view, the system has no 
memory shortage, so it never needs to kick off prune_icache().

This issue could also be fixed in several different ways - maybe by a 
better DLM hash function, or maybe by asking IT people to umount the 
filesystem after each rsync, which unconditionally purges *all* 
per-mount inodes (but defeats the purpose of caching inodes and, in our 
case, the locks), etc. But I do think the proposed patch is the most 
sensible way to fix this issue, and I believe it is one of those 
functions that, if you export it, people will find good uses for. It 
also helps with memory fragmentation and/or shortage *before* they 
become a problem. I certainly understand and respect a maintainer's 
daunting job of deciding whether to take or reject a patch - let me 
know what you think so I can start working on other solutions if 
required.

-- Wendy


Thread overview: 16+ messages
2006-11-22 21:35 [PATCH] prune_icache_sb Wendy Cheng
2006-11-22 23:36 ` Andrew Morton
2006-11-27 23:52   ` Wendy Cheng
2006-11-28  0:52     ` Andrew Morton
2006-11-28 21:41       ` Wendy Cheng [this message]
2006-11-29  0:21         ` Andrew Morton
2006-11-29  6:02           ` Wendy Cheng
2006-11-30 16:05             ` Wendy Cheng
2006-11-30 19:31               ` Nate Diller
2006-12-01 21:23               ` Andrew Morton
2006-12-03 17:49                 ` Wendy Cheng
2006-12-03 20:47                   ` Andrew Morton
2006-12-04  5:57                     ` Wendy Cheng
2006-12-04  6:28                       ` Andrew Morton
2006-12-04 16:51                   ` Russell Cattelan
2006-12-04 20:46                     ` Wendy Cheng
