* [PATCH] prune_icache_sb
@ 2006-11-22 21:35 Wendy Cheng
2006-11-22 23:36 ` Andrew Morton
0 siblings, 1 reply; 25+ messages in thread
From: Wendy Cheng @ 2006-11-22 21:35 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
[-- Attachment #1: Type: text/plain, Size: 1692 bytes --]
There seems to be a need to prune inode cache entries for specific
mount points (per VFS superblock) due to performance issues found after
some I/O intensive commands ("rsync" for example). The problem is
particularly serious for one of our kernel modules, which caches its
(cluster) locks based on the VFS inode implementation. These locks are
created by the inode creation call and get purged when s_op->clear_inode()
is invoked. On larger servers equipped with plenty of memory, the page
dirty ratio may not pass the threshold that triggers the VM reclaim logic,
but the accumulated inode count (and the associated cluster locks) can
cause unacceptable performance degradation for latency-sensitive
applications.
After adding the uploaded inode trimming patch, together with
shrink_dcache_sb(), we are able to keep the latency for one real-world
application within a satisfactory bound (consistently within 5 seconds,
compared to the original fluctuation between 5 and 16 seconds). The
calls are placed in one of our kernel daemons that wakes up at a tunable
interval to do the trimming work, as shown in the following code
segment. We would appreciate it if this patch could be accepted into the
mainline kernel.
	i_percent = sdp->sd_tune.gt_inoded_purge;
	if (i_percent) {
		if (i_percent > 100)
			i_percent = 100;
		a_count = atomic_read(&sdp->sd_inode_count);
		i_count = a_count * i_percent / 100;
		(void) shrink_dcache_sb(sdp->sd_vfs);
		(void) prune_icache_sb(i_count, sdp->sd_vfs);
	}
-- Wendy
[-- Attachment #2: inode_prune_sb.patch --]
[-- Type: text/x-patch, Size: 2144 bytes --]
Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
fs/inode.c | 14 +++++++++++---
include/linux/fs.h | 3 ++-
2 files changed, 13 insertions(+), 4 deletions(-)
--- linux-2.6.18/include/linux/fs.h	2006-09-19 23:42:06.000000000 -0400
+++ ups-kernel/include/linux/fs.h	2006-11-22 13:55:55.000000000 -0500
@@ -1625,7 +1625,8 @@ extern void remove_inode_hash(struct ino
 static inline void insert_inode_hash(struct inode *inode) {
 	__insert_inode_hash(inode, inode->i_ino);
 }
-
+extern void prune_icache_sb(int nr_to_scan, struct super_block *sb);
+
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
 extern void file_kill(struct file *f);
--- linux-2.6.18/fs/inode.c	2006-09-19 23:42:06.000000000 -0400
+++ ups-kernel/fs/inode.c	2006-11-22 14:12:28.000000000 -0500
@@ -411,7 +411,7 @@ static int can_unuse(struct inode *inode
  * If the inode has metadata buffers attached to mapping->private_list then
  * try to remove them.
  */
-static void prune_icache(int nr_to_scan)
+static void __prune_icache(int nr_to_scan, struct super_block *sb)
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
@@ -428,7 +428,8 @@ static void prune_icache(int nr_to_scan)
 
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
-		if (inode->i_state || atomic_read(&inode->i_count)) {
+		if (inode->i_state || atomic_read(&inode->i_count)
+			|| (sb && (inode->i_sb != sb))) {
 			list_move(&inode->i_list, &inode_unused);
 			continue;
 		}
@@ -461,6 +462,13 @@ static void prune_icache(int nr_to_scan)
 	mutex_unlock(&iprune_mutex);
 }
 
+void prune_icache_sb(int nr_to_scan, struct super_block *sb)
+{
+	__prune_icache(nr_to_scan, sb);
+}
+
+EXPORT_SYMBOL(prune_icache_sb);
+
 /*
  * shrink_icache_memory() will attempt to reclaim some unused inodes.  Here,
  * "unused" means that no dentries are referring to the inodes: the files are
@@ -480,7 +488,7 @@ static int shrink_icache_memory(int nr,
 		 */
 		if (!(gfp_mask & __GFP_FS))
 			return -1;
-		prune_icache(nr);
+		__prune_icache(nr, NULL);
 	}
 	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
 }
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] prune_icache_sb
  2006-11-22 21:35 [PATCH] prune_icache_sb Wendy Cheng
@ 2006-11-22 23:36 ` Andrew Morton
  2006-11-27 23:52   ` Wendy Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-11-22 23:36 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: linux-kernel, linux-fsdevel

On Wed, 22 Nov 2006 16:35:07 -0500
Wendy Cheng <wcheng@redhat.com> wrote:

> There seems to be a need to prune inode cache entries for specific
> mount points (per VFS superblock) due to performance issues found after
> some I/O intensive commands ("rsync" for example). The problem is
> particularly serious for one of our kernel modules, which caches its
> (cluster) locks based on the VFS inode implementation. These locks are
> created by the inode creation call and get purged when
> s_op->clear_inode() is invoked. On larger servers equipped with plenty
> of memory, the page dirty ratio may not pass the threshold that
> triggers the VM reclaim logic, but the accumulated inode count (and
> the associated cluster locks) can cause unacceptable performance
> degradation for latency-sensitive applications.
>
> After adding the uploaded inode trimming patch, together with
> shrink_dcache_sb(), we are able to keep the latency for one real-world
> application within a satisfactory bound (consistently within 5 seconds,
> compared to the original fluctuation between 5 and 16 seconds). The
> calls are placed in one of our kernel daemons that wakes up at a
> tunable interval to do the trimming work, as shown in the following
> code segment. We would appreciate it if this patch could be accepted
> into the mainline kernel.
>
>	i_percent = sdp->sd_tune.gt_inoded_purge;
>	if (i_percent) {
>		if (i_percent > 100)
>			i_percent = 100;
>		a_count = atomic_read(&sdp->sd_inode_count);
>		i_count = a_count * i_percent / 100;
>		(void) shrink_dcache_sb(sdp->sd_vfs);
>		(void) prune_icache_sb(i_count, sdp->sd_vfs);
>	}
>
> ...
>
> --- linux-2.6.18/include/linux/fs.h	2006-09-19 23:42:06.000000000 -0400
> +++ ups-kernel/include/linux/fs.h	2006-11-22 13:55:55.000000000 -0500
> @@ -1625,7 +1625,8 @@ extern void remove_inode_hash(struct ino
>  static inline void insert_inode_hash(struct inode *inode) {
>  	__insert_inode_hash(inode, inode->i_ino);
>  }
> -
> +extern void prune_icache_sb(int nr_to_scan, struct super_block *sb);
> +
>  extern struct file * get_empty_filp(void);
>  extern void file_move(struct file *f, struct list_head *list);
>  extern void file_kill(struct file *f);
> --- linux-2.6.18/fs/inode.c	2006-09-19 23:42:06.000000000 -0400
> +++ ups-kernel/fs/inode.c	2006-11-22 14:12:28.000000000 -0500
> @@ -411,7 +411,7 @@ static int can_unuse(struct inode *inode
>   * If the inode has metadata buffers attached to mapping->private_list then
>   * try to remove them.
>   */
> -static void prune_icache(int nr_to_scan)
> +static void __prune_icache(int nr_to_scan, struct super_block *sb)
>  {
>  	LIST_HEAD(freeable);
>  	int nr_pruned = 0;
> @@ -428,7 +428,8 @@ static void prune_icache(int nr_to_scan)
>
>  		inode = list_entry(inode_unused.prev, struct inode, i_list);
>
> -		if (inode->i_state || atomic_read(&inode->i_count)) {
> +		if (inode->i_state || atomic_read(&inode->i_count)
> +			|| (sb && (inode->i_sb != sb))) {
>  			list_move(&inode->i_list, &inode_unused);
>  			continue;

This search is potentially inefficient.  It would be better to walk
sb->s_inodes.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-22 23:36 ` Andrew Morton
@ 2006-11-27 23:52   ` Wendy Cheng
  2006-11-28  0:52     ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Wendy Cheng @ 2006-11-27 23:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

Andrew Morton wrote:

> This search is potentially inefficient.  It would be better to walk
> sb->s_inodes.

Not sure about walking through sb->s_inodes, for several reasons:

1. The changes made are mostly for file server setups with large
filesystems - the entry count in sb->s_inodes may not be shorter than
the inode_unused list.

2. Unlike calls such as drop_pagecache_sb() (which doesn't remove list
entries), we are walking through the list to dispose of its entries.
This implies walking through one list (sb->s_inodes) to remove entries
from another list (inode_unused). This feels awkward.

3. The new code would be very similar to the current prune_icache() with
a few differences - e.g., we really don't want to list_move() within the
sb->s_inodes list itself (as prune_icache() does when it moves the
examined entry to the tail of the inode_unused list). We would have to
either duplicate the code or clutter the current prune_icache() routine.

Pruning based on sb->s_inodes *does* have its advantages, but a simple
and plain patch as shown in the previous post (which has been well
tested in two large-scale production systems) could be equally
effective. Does that make sense?

-- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-27 23:52 ` Wendy Cheng
@ 2006-11-28  0:52   ` Andrew Morton
  2006-11-28 21:41     ` Wendy Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-11-28 0:52 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: linux-kernel, linux-fsdevel

On Mon, 27 Nov 2006 18:52:58 -0500
Wendy Cheng <wcheng@redhat.com> wrote:

> Andrew Morton wrote:
> > This search is potentially inefficient.  It would be better to walk
> > sb->s_inodes.
>
> Not sure about walking through sb->s_inodes, for several reasons:
>
> 1. The changes made are mostly for file server setups with large
> filesystems - the entry count in sb->s_inodes may not be shorter than
> the inode_unused list.

umm, that's the best case.  We also care about the worst case.  Think:
1,000,000 inodes on inode_unused, of which a randomly-sprinkled 10,000
are from the being-unmounted filesystem.  The code as proposed will do
100x more work than it needs to do.  All under a global spinlock.

> 2. Unlike calls such as drop_pagecache_sb() (which doesn't remove list
> entries), we are walking through the list to dispose of its entries.
> This implies walking through one list (sb->s_inodes) to remove entries
> from another list (inode_unused). This feels awkward.
>
> 3. The new code would be very similar to the current prune_icache()
> with a few differences - e.g., we really don't want to list_move()
> within the sb->s_inodes list itself (as prune_icache() does when it
> moves the examined entry to the tail of the inode_unused list). We
> would have to either duplicate the code or clutter the current
> prune_icache() routine.
>
> Pruning based on sb->s_inodes *does* have its advantages, but a simple
> and plain patch as shown in the previous post (which has been well
> tested in two large-scale production systems) could be equally
> effective. Does that make sense?

I also worry about the whole thing:

> There seems to be a need to prune inode cache entries for specific
> mount points (per VFS superblock) due to performance issues found
> after some I/O intensive commands ("rsync" for example). The problem
> is particularly serious for one of our kernel modules, which caches
> its (cluster) locks based on the VFS inode implementation. These locks
> are created by the inode creation call and get purged when
> s_op->clear_inode() is invoked. On larger servers equipped with plenty
> of memory, the page dirty ratio may not pass the threshold that
> triggers the VM reclaim logic, but the accumulated inode count (and
> the associated cluster locks) can cause unacceptable performance
> degradation for latency-sensitive applications.

What's this about "the page dirty ratio may not pass the threshold to
trigger VM reclaim logic"?  Page reclaim isn't triggered by a dirty
page ratio.  Page reclaim is triggered by a shortage of free pages.

And page reclaim is supposed to reclaim unused inodes in an orderly and
balanced fashion.  It appears that it's not doing so in your case, and
we'd need to see more details (please) so we can understand why it is
not working.

You're proposing that we not do any of that and that the filesystem be
able to call into a VM memory reclaim function for not-clearly-understood
reasons.  This is a workaround.  Please help us understand what has gone
wrong with inode reclaim.

And please see if you can find time to help us with this rather than
adding some RH-specific fix and then forgetting about it (sensible
though that approach would be...)

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-28  0:52 ` Andrew Morton
@ 2006-11-28 21:41   ` Wendy Cheng
  2006-11-29  0:21     ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Wendy Cheng @ 2006-11-28 21:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

Andrew Morton wrote:
> On Mon, 27 Nov 2006 18:52:58 -0500
> Wendy Cheng <wcheng@redhat.com> wrote:
>
>> Not sure about walking through sb->s_inodes, for several reasons:
>>
>> 1. The changes made are mostly for file server setups with large
>> filesystems - the entry count in sb->s_inodes may not be shorter than
>> the inode_unused list.
>
> umm, that's the best case.  We also care about the worst case.  Think:
> 1,000,000 inodes on inode_unused, of which a randomly-sprinkled 10,000
> are from the being-unmounted filesystem.  The code as proposed will do
> 100x more work than it needs to do.  All under a global spinlock.

By walking through sb->s_inodes, we also need to take inode_lock and
iprune_mutex (?), since we're purging the inodes from the system - or
specifically, removing them from the inode_unused list. There is really
not much difference from the current prune_icache() logic. What's been
proposed here is simply *exporting* the prune_icache() kernel code to
allow filesystems to trim (purge a small percentage of) their
potentially unused per-mount inodes for *latency* considerations.

I made a mistake by using the "page dirty ratio" to explain the problem
(sorry! I was not thinking well in the previous write-up), which could
mislead you into thinking this is a VM issue. This is not so much a
low-on-free-pages (and/or memory fragmentation) issue (though
fragmentation is normally part of the symptoms). What the (external)
kernel module does is tie its cluster-wide file lock to the in-memory
inode that is obtained at file lookup time. The lock is removed from
the machine when

1. the lock is granted to another (cluster) machine; or
2. the in-memory inode is purged from the system.

One of the clusters that has this latency issue is an IP/TV application
that "rsync"s with the main station server (over a long geographical
distance) every 15 minutes. It subsequently (and constantly) generates
a large number of inodes (and locks) hanging around. When other nodes
within the same cluster, serving as FTP servers, are serving the files,
DLM has to wade through a huge number of lock entries to know whether
the lock requests can be granted. That's where this latency issue pops
up. Our profiling data shows that when the cluster performance drops
into unacceptable ranges, DLM can hog 40% of CPU cycles in its lock
searching logic. From the VM point of view, the system does not have a
memory shortage, so it has no need to kick off the prune_icache() call.

This issue could also be fixed in several different ways - maybe by a
better DLM hash function, maybe by asking IT people to umount the
filesystem, whereby *all* per-mount inodes are unconditionally purged
(but that defeats the purpose of caching inodes and, in our case, the
locks) after each rsync, ...., etc. But I do think the proposed patch
is the most sensible way to fix this issue, and I believe it will be
one of those functions that, if you export it, people will find a good
use for. It also helps with memory fragmentation and/or shortage
*before* it becomes a problem. I certainly understand and respect a
maintainer's daunting job of deciding whether to take or reject a patch
- let me know what you think so I can start to work on other solutions
if required.

-- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-28 21:41 ` Wendy Cheng
@ 2006-11-29  0:21   ` Andrew Morton
  2006-11-29  6:02     ` Wendy Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-11-29 0:21 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: linux-kernel, linux-fsdevel

On Tue, 28 Nov 2006 16:41:07 -0500
Wendy Cheng <wcheng@redhat.com> wrote:

> By walking through sb->s_inodes, we also need to take inode_lock and
> iprune_mutex (?), since we're purging the inodes from the system - or
> specifically, removing them from the inode_unused list.  There is
> really not much difference from the current prune_icache() logic.

There's quite a bit of difference.  The change you're proposing will
perform poorly if it is used in the scenario which I describe above.
It will waste CPU cycles and will destroy the inode_unused LRU ordering
(for what that's worth, which isn't much).

Trust me, every single time we've had an inefficient search in core
kernel, someone has gone and done something which hits it and causes
general meltdown in their workload.  So we've had to make significant
changes to remove the O(n) or higher search complexity.  And in this
case we *already have* the data structures in place to make it O(1).

> What's been proposed here is simply *exporting* the prune_icache()
> kernel code to allow filesystems to trim (purge a small percentage
> of) their potentially unused per-mount inodes for *latency*
> considerations.

It just happens to work in your setup.  If you have a large machine
with two filesystems and you run rsync on both filesystems and run FTP
against one of them, it might not work so well.  Because the proposed
prune_icache_sb() might need to chew through 500,000 inodes from the
wrong superblock before reclaiming any of the inodes which you want to
reclaim.  Or something like that.

> I made a mistake by using the "page dirty ratio" to explain the
> problem (sorry! I was not thinking well in the previous write-up),
> which could mislead you into thinking this is a VM issue.  This is
> not so much a low-on-free-pages (and/or memory fragmentation) issue
> (though fragmentation is normally part of the symptoms).  What the
> (external) kernel module does is tie its cluster-wide file lock to
> the in-memory inode that is obtained at file lookup time.  The lock
> is removed from the machine when
>
> 1. the lock is granted to another (cluster) machine; or
> 2. the in-memory inode is purged from the system.

It seems peculiar to be tying the lifetime of a DLM lock to the
system's memory size and current memory pressure?

> One of the clusters that has this latency issue is an IP/TV
> application that "rsync"s with the main station server (over a long
> geographical distance) every 15 minutes.  It subsequently (and
> constantly) generates a large number of inodes (and locks) hanging
> around.  When other nodes within the same cluster, serving as FTP
> servers, are serving the files, DLM has to wade through a huge number
> of lock entries to know whether the lock requests can be granted.
> That's where this latency issue pops up.  Our profiling data shows
> that when the cluster performance drops into unacceptable ranges, DLM
> can hog 40% of CPU cycles in its lock searching logic.  From the VM
> point of view, the system does not have a memory shortage, so it has
> no need to kick off the prune_icache() call.

OK..

> This issue could also be fixed in several different ways - maybe by a
> better DLM hash function,

It does sound like the lock lookup is broken.  I assume there's some
reason for keeping these things floating about in memory, so there must
be a downside to artificially pruning them in this manner?  If so, a
(much) faster lookup would seem to be the best fix.

> maybe by asking IT people to umount the filesystem, whereby *all*
> per-mount inodes are unconditionally purged (but that defeats the
> purpose of caching inodes and, in our case, the locks) after each
> rsync, ...., etc.  But I do think the proposed patch is the most
> sensible way to fix this issue, and I believe it will be one of those
> functions that, if you export it, people will find a good use for.
> It also helps with memory fragmentation and/or shortage *before* it
> becomes a problem.  I certainly understand and respect a maintainer's
> daunting job of deciding whether to take or reject a patch - let me
> know what you think so I can start to work on other solutions if
> required.

We shouldn't export this particular implementation to modules because
it has bad failure modes.  There might be a case for exposing an
i_sb_list-based API or, perhaps better, a max-unused-inodes mount
option.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-29  0:21 ` Andrew Morton
@ 2006-11-29  6:02   ` Wendy Cheng
  2006-11-30 16:05     ` Wendy Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Wendy Cheng @ 2006-11-29 6:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

Andrew Morton wrote:

> We shouldn't export this particular implementation to modules because
> it has bad failure modes.  There might be a case for exposing an
> i_sb_list-based API or, perhaps better, a max-unused-inodes mount
> option.

Ok, thanks for looking into this - it is appreciated. I'll try to
figure out something else.

-- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-29  6:02 ` Wendy Cheng
@ 2006-11-30 16:05   ` Wendy Cheng
  2006-11-30 19:31     ` Nate Diller
  2006-12-01 21:23     ` Andrew Morton
  0 siblings, 2 replies; 25+ messages in thread
From: Wendy Cheng @ 2006-11-30 16:05 UTC (permalink / raw)
  To: wcheng, Andrew Morton; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1715 bytes --]

How about a simple and plain change with this uploaded patch....

The idea is that, instead of unconditionally dropping every buffer
associated with a particular mount point (which defeats the purpose of
page caching), the base kernel exports the drop_pagecache_sb() call so
that the page cache can be trimmed. More importantly, it is changed to
offer the choice of not purging buffers at random, but only the ones
that seem to be unused (i_state is NULL and i_count is zero). This will
encourage filesystems to proactively respond to VM memory shortage if
they choose to do so.

From our end (cluster locks are expensive - that's why we cache them),
one of our kernel daemons will invoke this newly exported call based on
a set of pre-defined tunables. It is then followed by a lock reclaim
logic that trims the locks by checking the page cache associated with
the inode (that each cluster lock was created for). If nothing is
attached to the inode (based on the i_mapping->nrpages count), we know
it is a good candidate for trimming and will subsequently drop this
lock (instead of waiting until the end of the VFS inode's life cycle).

Note that I could do invalidate_inode_pages() within our kernel modules
to accomplish what drop_pagecache_sb() does (without coming here to bug
people), but I don't have access to inode_lock as an external kernel
module. So either EXPORT_SYMBOL(inode_lock) or this patch?

The end result (of this change) should encourage filesystems to
actively avoid depleting too much memory, and we'll encourage our
applications to understand clustering locality issues.

Haven't tested this out though - would appreciate some comments before
spending more effort on this direction.

-- Wendy

[-- Attachment #2: inode_prune_sb.patch --]
[-- Type: text/x-patch, Size: 1723 bytes --]

--- linux-2.6.18/include/linux/fs.h	2006-09-19 23:42:06.000000000 -0400
+++ ups-kernel/include/linux/fs.h	2006-11-30 02:16:34.000000000 -0500
@@ -1625,7 +1625,8 @@ extern void remove_inode_hash(struct ino
 static inline void insert_inode_hash(struct inode *inode) {
 	__insert_inode_hash(inode, inode->i_ino);
 }
-
+extern void drop_pagecache_sb(struct super_block *sb, int nr_goal);
+
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
 extern void file_kill(struct file *f);
--- linux-2.6.18/fs/drop_caches.c	2006-09-19 23:42:06.000000000 -0400
+++ ups-kernel/fs/drop_caches.c	2006-11-30 03:36:11.000000000 -0500
@@ -12,13 +12,20 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb)
+void drop_pagecache_sb(struct super_block *sb, int nr_goal)
 {
 	struct inode *inode;
+	int nr_count = 0;
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE))
+		if (nr_goal) {
+			if (nr_goal == nr_count)
+				break;
+			if (inode->i_state || atomic_read(&inode->i_count))
+				continue;
+			nr_count++;
+		} else if (inode->i_state & (I_FREEING|I_WILL_FREE))
 			continue;
 		invalidate_inode_pages(inode->i_mapping);
 	}
@@ -36,7 +43,7 @@ restart:
 		spin_unlock(&sb_lock);
 		down_read(&sb->s_umount);
 		if (sb->s_root)
-			drop_pagecache_sb(sb);
+			drop_pagecache_sb(sb, 0);
 		up_read(&sb->s_umount);
 		spin_lock(&sb_lock);
 		if (__put_super_and_need_restart(sb))
@@ -66,3 +73,6 @@ int drop_caches_sysctl_handler(ctl_table
 	}
 	return 0;
 }
+
+EXPORT_SYMBOL(drop_pagecache_sb);
+

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-30 16:05 ` Wendy Cheng
@ 2006-11-30 19:31   ` Nate Diller
  0 siblings, 0 replies; 25+ messages in thread
From: Nate Diller @ 2006-11-30 19:31 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

On 11/30/06, Wendy Cheng <wcheng@redhat.com> wrote:
> How about a simple and plain change with this uploaded patch....
>
> The idea is that, instead of unconditionally dropping every buffer
> associated with a particular mount point (which defeats the purpose
> of page caching), the base kernel exports the drop_pagecache_sb()
> call so that the page cache can be trimmed. More importantly, it is
> changed to offer the choice of not purging buffers at random, but
> only the ones that seem to be unused (i_state is NULL and i_count is
> zero). This will encourage filesystems to proactively respond to VM
> memory shortage if they choose to do so.
>
> From our end (cluster locks are expensive - that's why we cache
> them), one of our kernel daemons will invoke this newly exported call
> based on a set of pre-defined tunables. It is then followed by a lock
> reclaim logic that trims the locks by checking the page cache
> associated with the inode (that each cluster lock was created for).
> If nothing is attached to the inode (based on the i_mapping->nrpages
> count), we know it is a good candidate for trimming and will
> subsequently drop this lock (instead of waiting until the end of the
> VFS inode's life cycle).

I have a patch that is a more comprehensive version of this idea, but
it is not fully debugged and has suffered some bitrot in the past
couple of months. This turns out to be a good performance improvement
in the general case too, but it is more complex than your idea because
there are real locking changes needed to avoid deadlocks. I can send
you a copy of the patch if you are interested.

> Note that I could do invalidate_inode_pages() within our kernel
> modules to accomplish what drop_pagecache_sb() does (without coming
> here to bug people), but I don't have access to inode_lock as an
> external kernel module. So either EXPORT_SYMBOL(inode_lock) or this
> patch?

Like I said above, you have to be careful when touching inode_lock,
dcache_lock, and the mapping's tree_lock, because of potential
deadlocks. The mapping's lock can be taken from softirq context, but
the inode and dcache locks cannot.

NATE

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb 2006-11-30 16:05 ` Wendy Cheng 2006-11-30 19:31 ` Nate Diller @ 2006-12-01 21:23 ` Andrew Morton 2006-12-03 17:49 ` Wendy Cheng 1 sibling, 1 reply; 25+ messages in thread From: Andrew Morton @ 2006-12-01 21:23 UTC (permalink / raw) To: Wendy Cheng; +Cc: linux-kernel, linux-fsdevel On Thu, 30 Nov 2006 11:05:32 -0500 Wendy Cheng <wcheng@redhat.com> wrote: > How about a simple and plain change with this uploaded patch .... > > The idea is, instead of unconditionally dropping every buffer associated > with the particular mount point (that defeats the purpose of page > caching), base kernel exports the "drop_pagecache_sb()" call that allows > page cache to be trimmed. More importantly, it is changed to offer the > choice of not randomly purging any buffer but the ones that seem to be > unused (i_state is NULL and i_count is zero). This will encourage > filesystem(s) to pro actively response to vm memory shortage if they > choose so. argh. In Linux a filesystem is a dumb layer which sits between the VFS and the I/O layer and provides dumb services such as reading/writing inodes, reading/writing directory entries, mapping pagecache offsets to disk blocks, etc. (This model is to varying degrees incorrect for every post-ext2 filesystem, but that's the way it is). We do not want to go "encouraging" filesystems to play games tuning and trimming VFS caches and things like that. If a patch doing that were to turn up it would be heartily shouted at and the originator would be asked to go off and implement the functionality in core VFS so that a) all filesystems can immediately utilise it and b) other filesystems aren't tempted to go off and privately implement something similar. So please bear this philosophy in mind, and think about this feature from that perspective. One approach might be to add a per-superblock upper-bound on the number of cached dentries and/or inodes. Typically that would be controlled by a (re)mount option. 
Although we could perhaps discuss a sysfs representation of this (and, presumably, other mount options). But I'd expect such a proposal to have a hard time, because we'd need to know why such a thing is needed: we prefer auto-tuning, and that's what we have now, so what's gone wrong with it and how can we fix it, rather than adding a manual override? > From our end (cluster locks are expensive - that's why we cache them), > one of our kernel daemons will invoke this newly exported call based on > a set of pre-defined tunables. It is then followed by a lock reclaim > logic to trim the locks by checking the page cache associated with the > inode (that this cluster lock is created for). If nothing is attached to > the inode (based on i_mapping->nrpages count), we know it is a good > candidate for trimming and will subsequently drop this lock (instead of > waiting until the end of vfs inode life cycle). Again, I don't understand why you're tying the lifetime of these locks to the VFS inode reclaim mechanisms. Seems odd. If you want to put an upper bound on the number of in-core locks, why not string them on a list and throw away the old ones when the upper bound is reached? Did you look at improving that lock-lookup algorithm, btw? Core kernel has no problem maintaining millions of cached VFS objects - is there any reason why your lock lookup cannot be similarly efficient? > Note that I could do invalidate_inode_pages() within our kernel modules > to accomplish what drop_pagecache_sb() does (without coming here to bug > people) but I don't have access to inode_lock as an external kernel > module. So either EXPORT_SYMBOL(inode_lock) or this patch ? > > The end result (of this change) should encourage filesystem to actively > avoid depleting too much memory That is _not_ a filesytem responsibility! inode cache is owned and maintained by the VFS. > and we'll encourage our applications to > understand clustering locality issues. ? 
> Haven't tested this out yet, though - I would appreciate some comments
> before spending more effort on this direction.
>
> -- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-12-01 21:23 ` Andrew Morton
@ 2006-12-03 17:49 ` Wendy Cheng
  2006-12-03 20:47 ` Andrew Morton
  2006-12-04 16:51 ` [PATCH] prune_icache_sb Russell Cattelan
  0 siblings, 2 replies; 25+ messages in thread
From: Wendy Cheng @ 2006-12-03 17:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

Andrew Morton wrote:

>On Thu, 30 Nov 2006 11:05:32 -0500
>Wendy Cheng <wcheng@redhat.com> wrote:
>
>>The idea is, instead of unconditionally dropping every buffer associated
>>with the particular mount point (which defeats the purpose of page
>>caching), the base kernel exports the "drop_pagecache_sb()" call that
>>allows the page cache to be trimmed. More importantly, it is changed to
>>offer the choice of not randomly purging any buffer but only the ones
>>that seem to be unused (i_state is NULL and i_count is zero). This will
>>encourage filesystem(s) to proactively respond to vm memory shortage if
>>they choose to do so.
>
>argh.

I read this as "It is ok to give system admin(s) commands (which is what
this drop_pagecache_sb() call is all about) to drop the page cache. It
is, however, not ok to give filesystem developer(s) this very same
function to trim their own page cache if the filesystems choose to do
so"?

>In Linux a filesystem is a dumb layer which sits between the VFS and the
>I/O layer and provides dumb services such as reading/writing inodes,
>reading/writing directory entries, mapping pagecache offsets to disk
>blocks, etc. (This model is to varying degrees incorrect for every
>post-ext2 filesystem, but that's the way it is).

The Linux kernel, particularly the VFS layer, is starting to show signs
of inadequacy as the software components built upon it keep growing. I
have doubts that it can keep up and handle this complexity with a
development policy like the one you just described (a filesystem is a
dumb layer?).
Aren't these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a
perfect example of why trying to do too many things inside the vfs layer
for so many filesystems is a bad idea?

By the way, since we're on this subject, could we discuss the vfs rename
call a little bit (or I can start another new discussion thread)?

Note that Linux's do_rename() starts with the usual lookup logic,
followed by "lock_rename", then a final round of dentry lookup, and
finally comes to the filesystem's i_op->rename call. Since lock_rename()
only takes vfs-layer locks that are local to this particular machine, for
a cluster filesystem there exists a huge window between the final lookup
and the filesystem's i_op->rename calls such that the file could get
deleted from another node before the fs can do anything about it. Is it
possible that we could get a new function pointer (lock_rename) in the
inode_operations structure so a cluster filesystem can do proper locking?

>>From our end (cluster locks are expensive - that's why we cache them),
>>one of our kernel daemons will invoke this newly exported call based on
>>a set of pre-defined tunables. It is then followed by a lock reclaim
>>logic to trim the locks by checking the page cache associated with the
>>inode (that this cluster lock is created for). If nothing is attached to
>>the inode (based on i_mapping->nrpages count), we know it is a good
>>candidate for trimming and will subsequently drop this lock (instead of
>>waiting until the end of vfs inode life cycle).
>
>Again, I don't understand why you're tying the lifetime of these locks to
>the VFS inode reclaim mechanisms. Seems odd.

Cluster locks are expensive because:

1. Every node in the cluster has to agree about it upon granting the
   request (communication overhead).
2. It involves disk flushing if bouncing between nodes. Say one node
   requests a read lock after another node's write...
before the read lock can be granted, the write node needs to flush the
data to the disk (disk io overhead).

For optimization purposes, we want to refrain from flushing the disk
after writes, and hope (and encourage) the next requester of the lock to
be on the very same node (to take advantage of the OS write-back logic).
That's why the locks are cached on the very same node. They will not get
removed unless necessary. What could be better than building the lock
caching on top of the existing inode cache logic - since these are the
objects that the cluster locks are created for in the first place?

>If you want to put an upper bound on the number of in-core locks, why not
>string them on a list and throw away the old ones when the upper bound is
>reached?

Don't get me wrong. DLM *has* a tunable to set the max lock count. We do
drop the locks, but to drop the right locks we need a little bit of help
from the VFS layer. The latency requirement is difficult to manage.

>Did you look at improving that lock-lookup algorithm, btw? Core kernel has
>no problem maintaining millions of cached VFS objects - is there any reason
>why your lock lookup cannot be similarly efficient?

Don't be so confident. I did see some complaints from ext3-based mail
servers in the past - when the storage size was large enough, people had
to explicitly umount the filesystem from time to time to rescue their
performance. I don't recall the details at this moment though. For us,
with this particular customer, it is a 15TB storage.

-- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb 2006-12-03 17:49 ` Wendy Cheng @ 2006-12-03 20:47 ` Andrew Morton 2006-12-04 5:57 ` Wendy Cheng 2006-12-04 16:51 ` [PATCH] prune_icache_sb Russell Cattelan 1 sibling, 1 reply; 25+ messages in thread From: Andrew Morton @ 2006-12-03 20:47 UTC (permalink / raw) To: wcheng; +Cc: linux-kernel, linux-fsdevel On Sun, 03 Dec 2006 12:49:42 -0500 Wendy Cheng <wcheng@redhat.com> wrote: > Andrew Morton wrote: > > >On Thu, 30 Nov 2006 11:05:32 -0500 > >Wendy Cheng <wcheng@redhat.com> wrote: > > > > > > > >> > >>The idea is, instead of unconditionally dropping every buffer associated > >>with the particular mount point (that defeats the purpose of page > >>caching), base kernel exports the "drop_pagecache_sb()" call that allows > >>page cache to be trimmed. More importantly, it is changed to offer the > >>choice of not randomly purging any buffer but the ones that seem to be > >>unused (i_state is NULL and i_count is zero). This will encourage > >>filesystem(s) to pro actively response to vm memory shortage if they > >>choose so. > >> > >> > > > >argh. > > > > > I read this as "It is ok to give system admin(s) commands (that this > "drop_pagecache_sb() call" is all about) to drop page cache. It is, > however, not ok to give filesystem developer(s) this very same function > to trim their own page cache if the filesystems choose to do so" ? If you're referring to /proc/sys/vm/drop_pagecache then no, that isn't for administrators - it's a convenience thing for developers, to get repeatable benchmarks. Attempts to make it a per-numa-node control for admin purposes have been rejected. > >In Linux a filesystem is a dumb layer which sits between the VFS and the > >I/O layer and provides dumb services such as reading/writing inodes, > >reading/writing directory entries, mapping pagecache offsets to disk > >blocks, etc. (This model is to varying degrees incorrect for every > >post-ext2 filesystem, but that's the way it is). 
> > > > > Linux kernel, particularly the VFS layer, is starting to show signs of > inadequacy as the software components built upon it keep growing. I have > doubts that it can keep up and handle this complexity with a development > policy like you just described (filesystem is a dumb layer ?). Aren't > these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a perfect > example why trying to do too many things inside vfs layer for so many > filesystems is a bad idea ? That's not a very well-chosen example, but yes, the old ext2-based model has needed to be extended as new filesystems come along. > By the way, since we're on this subject, > could we discuss a little bit about vfs rename call (or I can start > another new discussion thread) ? > > Note that linux do_rename() starts with the usual lookup logic, followed > by "lock_rename", then a final round of dentry lookup, and finally comes > to filesystem's i_op->rename call. Since lock_rename() only calls for > vfs layer locks that are local to this particular machine, for a cluster > filesystem, there exists a huge window between the final lookup and > filesystem's i_op->rename calls such that the file could get deleted > from another node before fs can do anything about it. Is it possible > that we could get a new function pointer (lock_rename) in > inode_operations structure so a cluster filesystem can do proper locking ? That would need a new thread, and probably (at least pseudo-) code, and cc's to the appropriate maintainers (although that part of the kernel isn't really maintained any more - it has fallen into the patch-and-run model). > >>From our end (cluster locks are expensive - that's why we cache them), > >>one of our kernel daemons will invoke this newly exported call based on > >>a set of pre-defined tunables. It is then followed by a lock reclaim > >>logic to trim the locks by checking the page cache associated with the > >>inode (that this cluster lock is created for). 
If nothing is attached to > >>the inode (based on i_mapping->nrpages count), we know it is a good > >>candidate for trimming and will subsequently drop this lock (instead of > >>waiting until the end of vfs inode life cycle). > >> > >> > > > >Again, I don't understand why you're tying the lifetime of these locks to > >the VFS inode reclaim mechanisms. Seems odd. > > > > > Cluster locks are expensive because: > > 1. Every node in the cluster has to agree about it upon granting the > request (communication overhead). > 2. It involves disk flushing if bouncing between nodes. Say one node > requests a read lock after another node's write... before the read lock > can be granted, the write node needs to flush the data to the disk (disk > io overhead). > > For optimization purpose, we want to refrain the disk flush after writes > and hope (and encourage) the next person who requests the lock to be on > the very same node (to take the advantage of OS write-back logic). > That's why the locks are cached on the very same node. It will not get > removed unless necessary. > What would be better to build the lock caching on top of the existing > inode cache logic - since these are the objects that the cluster locks > are created for in the first place. hmm, I suppose that makes sense. Are there dentries associated with these locks? > >If you want to put an upper bound on the number of in-core locks, why not > >string them on a list and throw away the old ones when the upper bound is > >reached? > > > > > Don't take me wrong. DLM *has* a tunable to set the max lock counts. We > do drop the locks but to drop the right locks, we need a little bit help > from VFS layer. Latency requirement is difficult to manage. > > >Did you look at improving that lock-lookup algorithm, btw? Core kernel has > >no problem maintaining millions of cached VFS objects - is there any reason > >why your lock lookup cannot be similarly efficient? > > > > > Don't be so confident. 
> I did see some complaints from ext3-based mail
> servers in the past - when the storage size was large enough, people had
> to explicitly umount the filesystem from time to time to rescue their
> performance. I don't recall the details at this moment though.

People have had plenty of problems with oversized inode caches in the
past, but I think they were due to memory consumption, not lookup
inefficiency.

My question _still_ remains unanswered. Third time: is it possible to
speed up this lock-lookup code? Perhaps others can take a look at it -
where is it?

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-12-03 20:47 ` Andrew Morton
@ 2006-12-04  5:57 ` Wendy Cheng
  2006-12-04  6:28 ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Wendy Cheng @ 2006-12-04 5:57 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

Andrew Morton wrote:

>On Sun, 03 Dec 2006 12:49:42 -0500
>Wendy Cheng <wcheng@redhat.com> wrote:
>
>>I read this as "It is ok to give system admin(s) commands (that this
>>"drop_pagecache_sb() call" is all about) to drop page cache. It is,
>>however, not ok to give filesystem developer(s) this very same function
>>to trim their own page cache if the filesystems choose to do so" ?
>
>If you're referring to /proc/sys/vm/drop_pagecache then no, that isn't for
>administrators - it's a convenience thing for developers, to get repeatable
>benchmarks. Attempts to make it a per-numa-node control for admin purposes
>have been rejected.

Just saw you suggested the "next door" LKML thread ("la la la la ...
swappiness") use "-o sync"? Well, now I do see you're determined ...
Anyway, I think I got my stuff working without this kernel patch ...
still under testing though. The rename post will be done first thing
tomorrow morning.

>>[snip] .......
>
>hmm, I suppose that makes sense.
>
>Are there dentries associated with these locks?

Yes, dentries are part of the logic (during lookup time) but the
book-keeping (reference count, reclaim, delete, etc.) is all done through
the inode structures.

>>>Did you look at improving that lock-lookup algorithm, btw? Core kernel has
>>>no problem maintaining millions of cached VFS objects - is there any reason
>>>why your lock lookup cannot be similarly efficient?

Yes, I just found that the new DLM uses the "jhash" call
(include/linux/jhash.h). I'm on an older version of DLM that uses the
FNV hash algorithm (http://www.isthe.com/chongo/tech/comp/fnv/). Will do
some performance test runs to compare these two methods.
A final note on this subject - I may not agree with you (about various
things) but your comments and amazingly quick responses are very much
appreciated!

-- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-12-04  5:57 ` Wendy Cheng
@ 2006-12-04  6:28 ` Andrew Morton
  2006-12-04 16:41 ` [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-12-04 6:28 UTC (permalink / raw)
To: wcheng; +Cc: linux-kernel, linux-fsdevel

On Mon, 04 Dec 2006 00:57:50 -0500 Wendy Cheng <wcheng@redhat.com> wrote:

> >>>Did you look at improving that lock-lookup algorithm, btw? Core kernel has
> >>>no problem maintaining millions of cached VFS objects - is there any reason
> >>>why your lock lookup cannot be similarly efficient?
>
> Yes, just found the new DLM uses "jhash" call (include/linux/jhash.h).
> I'm on an older version of DLM that uses FNV hash algorithm
> (http://www.isthe.com/chongo/tech/comp/fnv/). Will do some performance
> test runs to compare these two methods.

I'd be surprised if the choice of hash algorithm itself makes much
difference. But we can't say much about it unless we can see the code
(ie: your code).

Is it a simple old hash-to-find-the-bucket-then-walk-a-list
implementation? If so, what does the bucket count distribution look like?
What is the average walk length? Does it use a single lock, or hashed
locking, or a lock-per-bucket? etc.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04  6:28 ` Andrew Morton
@ 2006-12-04 16:41 ` Eric Dumazet
  2006-12-04 16:55 ` Christoph Lameter
  2006-12-05 14:42 ` Pavel Machek
  0 siblings, 2 replies; 25+ messages in thread
From: Eric Dumazet @ 2006-12-04 16:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: David Miller, linux-kernel, Christoph Lameter

[-- Attachment #1: Type: text/plain, Size: 1259 bytes --]

When some objects are allocated by one CPU but freed by another CPU we can
consume a lot of cycles doing divides in obj_to_index().

(Typical load on a dual processor machine where network interrupts are
handled by one particular CPU (allocating skbufs), and the other CPU is
running the application (consuming and freeing skbufs).)

Here on one production server (dual-core AMD Opteron 285), I noticed this
divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opterons are
quite modern CPUs, and the divide is much more expensive on older
architectures: on a 200 MHz sparcv9 machine, the division takes 64 cycles
instead of 1 cycle for a multiply.

Doing some math, we can use a reciprocal multiplication instead of a
divide.

If we want to compute V = (A / B) (A and B being u32 quantities) we can
instead use:

	V = ((u64)A * RECIPROCAL(B)) >> 32 ;

where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Note: I wrote pure C code for clarity.
gcc output for i386 is not optimal but acceptable:

	mull   0x14(%ebx)
	mov    %edx,%eax      // part of the >> 32
	xor    %edx,%edx      // useless
	mov    %eax,(%esp)    // could be avoided
	mov    %edx,0x4(%esp) // useless
	mov    (%esp),%ebx

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: slab_avoid_divides.patch --]
[-- Type: text/plain, Size: 2164 bytes --]

--- linux-2.6.19/mm/slab.c	2006-12-04 11:50:19.000000000 +0100
+++ linux-2.6.19-ed/mm/slab.c	2006-12-04 17:25:02.000000000 +0100
@@ -371,6 +371,19 @@ static void kmem_list3_init(struct kmem_
 	} while (0)
 
 /*
+ * Define the reciprocal value of B so that
+ * ((u32)A / (u32)B) can be replaced by :
+ * (((u64)A * RECIPROCAL_VALUE(B)) >> 32)
+ * If RECIPROCAL_VALUE(B) is precalculated, we change a divide by a multiply
+ */
+static inline u32 reciprocal_value(unsigned int k)
+{
+	u64 val = (1LL << 32) + (k - 1);
+	do_div(val, k);
+	return (u32)val;
+}
+
+/*
  * struct kmem_cache
  *
  * manages a cache.
@@ -385,6 +398,7 @@ struct kmem_cache {
 	unsigned int shared;
 
 	unsigned int buffer_size;
+	unsigned int reciprocal_buffer_size;
 /* 3) touched by every alloc & free from the backend */
 
 	struct kmem_list3 *nodelists[MAX_NUMNODES];
@@ -626,10 +640,17 @@ static inline void *index_to_obj(struct 
 	return slab->s_mem + cache->buffer_size * idx;
 }
 
+/*
+ * We want to avoid an expensive divide : (offset / cache->buffer_size)
+ * Using the fact that buffer_size is a constant for a particular cache,
+ * we can replace (offset / cache->buffer_size) by
+ * ((u64)offset * cache->reciprocal_buffer_size) >> 32
+ */
 static inline unsigned int obj_to_index(struct kmem_cache *cache,
 					struct slab *slab, void *obj)
 {
-	return (unsigned)(obj - slab->s_mem) / cache->buffer_size;
+	unsigned int offset = (obj - slab->s_mem);
+	return (u32)(((u64)offset * cache->reciprocal_buffer_size) >> 32);
 }
 
 /*
@@ -1400,6 +1421,8 @@ void __init kmem_cache_init(void)
 
 	cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
 					cache_line_size());
+	cache_cache.reciprocal_buffer_size =
+		reciprocal_value(cache_cache.buffer_size);
 
 	for (order = 0; order < MAX_ORDER; order++) {
 		cache_estimate(order, cache_cache.buffer_size,
@@ -2297,6 +2320,7 @@ kmem_cache_create (const char *name, siz
 	if (flags & SLAB_CACHE_DMA)
 		cachep->gfpflags |= GFP_DMA;
 	cachep->buffer_size = size;
+	cachep->reciprocal_buffer_size = reciprocal_value(size);
 
 	if (flags & CFLGS_OFF_SLAB) {
 		cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() 2006-12-04 16:41 ` [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() Eric Dumazet @ 2006-12-04 16:55 ` Christoph Lameter 2006-12-04 18:18 ` Eric Dumazet 2006-12-05 14:42 ` Pavel Machek 1 sibling, 1 reply; 25+ messages in thread From: Christoph Lameter @ 2006-12-04 16:55 UTC (permalink / raw) To: Eric Dumazet; +Cc: Andrew Morton, David Miller, linux-kernel On Mon, 4 Dec 2006, Eric Dumazet wrote: > Doing some math, we can use a reciprocal multiplication instead of a divide. Could you generalize the reciprocal thingy so that the division can be used from other parts of the kernel as well? It would be useful to separately get some cycle counts on a regular division compared with your division. If that shows benefit then we may think about using it in the kernel. I am a bit surprised that this is still an issue on modern cpus. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 16:55 ` Christoph Lameter
@ 2006-12-04 18:18 ` Eric Dumazet
  2006-12-04 19:49 ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2006-12-04 18:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, David Miller, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2255 bytes --]

On Monday 04 December 2006 17:55, Christoph Lameter wrote:
> Could you generalize the reciprocal thingy so that the division
> can be used from other parts of the kernel as well? It would be useful to
> separately get some cycle counts on a regular division compared with your
> division. If that shows benefit then we may think about using it in the
> kernel. I am a bit surprised that this is still an issue on modern cpus.

OK, I added a new include file; I am not sure it is the best way...

Well, AFAIK this particular divide is the only one that hurts performance
on my machines. Do you have in mind another spot in the kernel where we
could use this reciprocal divide as well?

Yes, divide complexity is still an issue with modern CPUs. Elapsed time
for 10^9 loops on a 1.6 GHz Pentium M:

	24 s  for the version using divides
	3.8 s for the version using multiplies

[PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

When some objects are allocated by one CPU but freed by another CPU we can
consume a lot of cycles doing divides in obj_to_index().

(Typical load on a dual processor machine where network interrupts are
handled by one particular CPU (allocating skbufs), and the other CPU is
running the application (consuming and freeing skbufs).)

Here on one production server (dual-core AMD Opteron 285), I noticed this
divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opterons are
quite modern CPUs, and the divide is much more expensive on older
architectures: on a 200 MHz sparcv9 machine, the division takes 64 cycles
instead of 1 cycle for a multiply.
Doing some math, we can use a reciprocal multiplication instead of a
divide.

If we want to compute V = (A / B) (A and B being u32 quantities) we can
instead use:

	V = ((u64)A * RECIPROCAL(B)) >> 32 ;

where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Note: I wrote pure C code for clarity.

gcc output for i386 is not optimal but acceptable:

	mull   0x14(%ebx)
	mov    %edx,%eax      // part of the >> 32
	xor    %edx,%edx      // useless
	mov    %eax,(%esp)    // could be avoided
	mov    %edx,0x4(%esp) // useless
	mov    (%esp),%ebx

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: slab_avoid_divides.patch --]
[-- Type: text/plain, Size: 3013 bytes --]

--- linux-2.6.19/include/linux/reciprocal_div.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.19-ed/include/linux/reciprocal_div.h	2006-12-04 19:01:44.000000000 +0100
@@ -0,0 +1,30 @@
+#ifndef _LINUX_RECIPROCAL_DIV_H
+#define _LINUX_RECIPROCAL_DIV_H
+
+/*
+ * Define the reciprocal value of B so that
+ * ((u32)A / (u32)B) can be replaced by :
+ * (((u64)A * RECIPROCAL_VALUE(B)) >> 32)
+ * If RECIPROCAL_VALUE(B) is precalculated, we change a divide by a multiply
+ */
+#define RECIPROCAL_VALUE(B) (u32)(((1LL << 32) + ((B) - 1))/ (B))
+
+static inline u32 reciprocal_value(unsigned int k)
+{
+	if (__builtin_constant_p(k))
+		return RECIPROCAL_VALUE(k);
+	else {
+		u64 val = (1LL << 32) + (k - 1);
+		do_div(val, k);
+		return (u32)val;
+	}
+}
+
+/*
+ * We want to avoid an expensive divide : (A / B)
+ * If B is known in advance, its reciprocal R can be precalculated/stored.
+ * then (A / B) = (u32)(((u64)(A) * (R)) >> 32)
+ */
+#define RECIPROCAL_DIVIDE(A, R) (u32)(((u64)(A) * (R)) >> 32)
+
+#endif
--- linux-2.6.19/mm/slab.c	2006-12-04 11:50:19.000000000 +0100
+++ linux-2.6.19-ed/mm/slab.c	2006-12-04 19:03:42.000000000 +0100
@@ -107,6 +107,7 @@
 #include	<linux/mempolicy.h>
 #include	<linux/mutex.h>
 #include	<linux/rtmutex.h>
+#include	<linux/reciprocal_div.h>
 
 #include	<asm/uaccess.h>
 #include	<asm/cacheflush.h>
@@ -385,6 +386,7 @@ struct kmem_cache {
 	unsigned int shared;
 
 	unsigned int buffer_size;
+	unsigned int reciprocal_buffer_size;
 /* 3) touched by every alloc & free from the backend */
 
 	struct kmem_list3 *nodelists[MAX_NUMNODES];
@@ -626,10 +628,17 @@ static inline void *index_to_obj(struct 
 	return slab->s_mem + cache->buffer_size * idx;
 }
 
-static inline unsigned int obj_to_index(struct kmem_cache *cache,
-					struct slab *slab, void *obj)
+/*
+ * We want to avoid an expensive divide : (offset / cache->buffer_size)
+ * Using the fact that buffer_size is a constant for a particular cache,
+ * we can replace (offset / cache->buffer_size) by
+ * RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size)
+ */
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+					const struct slab *slab, void *obj)
 {
-	return (unsigned)(obj - slab->s_mem) / cache->buffer_size;
+	unsigned int offset = (obj - slab->s_mem);
+	return RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size);
 }
 
 /*
@@ -1400,6 +1409,8 @@ void __init kmem_cache_init(void)
 
 	cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
 					cache_line_size());
+	cache_cache.reciprocal_buffer_size =
+		reciprocal_value(cache_cache.buffer_size);
 
 	for (order = 0; order < MAX_ORDER; order++) {
 		cache_estimate(order, cache_cache.buffer_size,
@@ -2297,6 +2308,7 @@ kmem_cache_create (const char *name, siz
 	if (flags & SLAB_CACHE_DMA)
 		cachep->gfpflags |= GFP_DMA;
 	cachep->buffer_size = size;
+	cachep->reciprocal_buffer_size = reciprocal_value(size);
 
 	if (flags & CFLGS_OFF_SLAB) {
 		cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() 2006-12-04 18:18 ` Eric Dumazet @ 2006-12-04 19:49 ` Andrew Morton 2006-12-04 19:55 ` Christoph Lameter 2006-12-04 21:34 ` Eric Dumazet 0 siblings, 2 replies; 25+ messages in thread From: Andrew Morton @ 2006-12-04 19:49 UTC (permalink / raw) To: Eric Dumazet; +Cc: Christoph Lameter, David Miller, linux-kernel On Mon, 4 Dec 2006 19:18:29 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > On Monday 04 December 2006 17:55, Christoph Lameter wrote: > > Could you generalize the reciprocal thingy so that the division > > can be used from other parts of the kernel as well? It would be useful to > > separately get some cycle counts on a regular division compared with your > > division. If that shows benefit then we may think about using it in the > > kernel. I am a bit surprised that this is still an issue on modern cpus. > > OK I added a new include file, I am not sure it is the best way... > > Well, AFAIK this particular divide is the only one that hurts performance on > my machines. > > Do you have in mind another spot in kernel where we could use this reciprocal > divide as well ? > > Yes divide complexity is still an issue with modern CPUS : > elapsed time for 10^9 loops on Pentium M 1.6 Ghz > 24 s for the version using divides > 3.8 s for the version using multiplies > > [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() > > When some objects are allocated by one CPU but freed by another CPU we can > consume lot of cycles doing divides in obj_to_index(). > > (Typical load on a dual processor machine where network interrupts are handled > by one particular CPU (allocating skbufs), and the other CPU is running the > application (consuming and freeing skbufs)) > > Here on one production server (dual-core AMD Opteron 285), I noticed this > divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. 
> But Opteron are
> quite modern cpus and the divide is much more expensive on oldest
> architectures :

Yes, I've seen that divide hurting too. I suspect it was with unusual
everything-in-cache workloads, but whatever.

> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
> for a multiply.
>
> Doing some math, we can use a reciprocal multiplication instead of a divide.
>
> If we want to compute V = (A / B) (A and B being u32 quantities)
> we can instead use :
>
> V = ((u64)A * RECIPROCAL(B)) >> 32 ;
>
> where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B
>
> Note :
>
> I wrote pure C code for clarity. gcc output for i386 is not optimal but
> acceptable :
>
>	mull   0x14(%ebx)
>	mov    %edx,%eax      // part of the >> 32
>	xor    %edx,%edx      // useless
>	mov    %eax,(%esp)    // could be avoided
>	mov    %edx,0x4(%esp) // useless
>	mov    (%esp),%ebx
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

> --- linux-2.6.19/include/linux/reciprocal_div.h	1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.19-ed/include/linux/reciprocal_div.h	2006-12-04 19:01:44.000000000 +0100
> @@ -0,0 +1,30 @@
> +#ifndef _LINUX_RECIPROCAL_DIV_H
> +#define _LINUX_RECIPROCAL_DIV_H
> +
> +/*
> + * Define the reciprocal value of B so that
> + * ((u32)A / (u32)B) can be replaced by :
> + * (((u64)A * RECIPROCAL_VALUE(B)) >> 32)
> + * If RECIPROCAL_VALUE(B) is precalculated, we change a divide by a multiply

"to a multiply".

> + */
> +#define RECIPROCAL_VALUE(B) (u32)(((1LL << 32) + ((B) - 1))/ (B))

Does this have to be a macro? I worry that people might try to throw
random types at this code and it'll fail. Are you prepared to support s8,
u8, s16, u16, s32, u32, s64 and u64? I think not, so perhaps we should be
documenting what we _do_ accept, and adding typecheck() calls in there
somewhere to enforce that. (In which case it would need to be a macro).
> +static inline u32 reciprocal_value(unsigned int k) > +{ > + if (__builtin_constant_p(k)) > + return RECIPROCAL_VALUE(k); > + else { > + u64 val = (1LL << 32) + (k - 1); > + do_div(val, k); > + return (u32)val; > + } > +} We should clearly document that this function is for once-off setup operations - we'd hate for people to call this with any frequency. It should be uninlined if poss, too. > +/* > + * We want to avoid an expensive divide : (A / B) > + * If B is known in advance, its reciprocal R can be precalculated/stored. > + * then (A / B) = (u32)(((u64)(A) * (R)) >> 32) > + */ > +#define RECIPROCAL_DIVIDE(A, R) (u32)(((u64)(A) * (R)) >> 32) And again, depending upon our decision regarding what types this code will support, this perhaps should become an inlined C function. > +#endif > --- linux-2.6.19/mm/slab.c 2006-12-04 11:50:19.000000000 +0100 > +++ linux-2.6.19-ed/mm/slab.c 2006-12-04 19:03:42.000000000 +0100 > @@ -107,6 +107,7 @@ > #include <linux/mempolicy.h> > #include <linux/mutex.h> > #include <linux/rtmutex.h> > +#include <linux/reciprocal_div.h> > > #include <asm/uaccess.h> > #include <asm/cacheflush.h> > @@ -385,6 +386,7 @@ struct kmem_cache { > unsigned int shared; > > unsigned int buffer_size; > + unsigned int reciprocal_buffer_size; If we decide to support only u32 for this operation, this should become u32, for clarity. 
> /* 3) touched by every alloc & free from the backend */ > struct kmem_list3 *nodelists[MAX_NUMNODES]; > > @@ -626,10 +628,17 @@ static inline void *index_to_obj(struct > return slab->s_mem + cache->buffer_size * idx; > } > > -static inline unsigned int obj_to_index(struct kmem_cache *cache, > - struct slab *slab, void *obj) > +/* > + * We want to avoid an expensive divide : (offset / cache->buffer_size) > + * Using the fact that buffer_size is a constant for a particular cache, > + * we can replace (offset / cache->buffer_size) by > + * RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size) > + */ > +static inline unsigned int obj_to_index(const struct kmem_cache *cache, > + const struct slab *slab, void *obj) > { > - return (unsigned)(obj - slab->s_mem) / cache->buffer_size; > + unsigned int offset = (obj - slab->s_mem); > + return RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size); > } > > /* > @@ -1400,6 +1409,8 @@ void __init kmem_cache_init(void) > > cache_cache.buffer_size = ALIGN(cache_cache.buffer_size, > cache_line_size()); > + cache_cache.reciprocal_buffer_size = > + reciprocal_value(cache_cache.buffer_size); > > for (order = 0; order < MAX_ORDER; order++) { > cache_estimate(order, cache_cache.buffer_size, > @@ -2297,6 +2308,7 @@ kmem_cache_create (const char *name, siz > if (flags & SLAB_CACHE_DMA) > cachep->gfpflags |= GFP_DMA; > cachep->buffer_size = size; > + cachep->reciprocal_buffer_size = reciprocal_value(size); > > if (flags & CFLGS_OFF_SLAB) { > cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u); > > > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 19:49 ` Andrew Morton
@ 2006-12-04 19:55 ` Christoph Lameter
  2006-12-04 21:34 ` Eric Dumazet
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Lameter @ 2006-12-04 19:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: Eric Dumazet, David Miller, linux-kernel

This is similar stuff to asm-generic/div64.h, right? The divide overhead
depends on the platform? Maybe it would be better to place it in
asm-generic/div.h and then have platform specific functions?
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 19:49 ` Andrew Morton
  2006-12-04 19:55 ` Christoph Lameter
@ 2006-12-04 21:34 ` Eric Dumazet
  2006-12-04 21:56 ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2006-12-04 21:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christoph Lameter, David Miller, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1585 bytes --]

Thank you Andrew for your comments, here is a new version of the patch
that should take them into account.

Maybe it should be split into two different patches ? One introducing
include/linux/reciprocal_div.h and lib/reciprocal_div.c, one using them
in slab ?

[PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

When some objects are allocated by one CPU but freed by another CPU we
can consume a lot of cycles doing divides in obj_to_index().

(Typical load on a dual processor machine where network interrupts are
handled by one particular CPU (allocating skbufs), and the other CPU is
running the application (consuming and freeing skbufs))

Here on one production server (dual-core AMD Opteron 285), I noticed
this divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But
Opterons are quite modern CPUs and the divide is much more expensive on
older architectures:

On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of
1 cycle for a multiply.

Doing some math, we can use a reciprocal multiplication instead of a
divide.

If we want to compute V = (A / B)  (A and B being u32 quantities)
we can instead use :

V = ((u64)A * RECIPROCAL(B)) >> 32 ;

where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Note : I wrote pure C code for clarity.
gcc output for i386 is not optimal but acceptable :

	mull   0x14(%ebx)
	mov    %edx,%eax	// part of the >> 32
	xor    %edx,%edx	// useless
	mov    %eax,(%esp)	// could be avoided
	mov    %edx,0x4(%esp)	// useless
	mov    (%esp),%ebx

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: reciprocal_division.patch --]
[-- Type: text/plain, Size: 3834 bytes --]

--- linux-2.6.19/include/linux/reciprocal_div.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.19-ed/include/linux/reciprocal_div.h	2006-12-04 23:12:34.000000000 +0100
@@ -0,0 +1,30 @@
+#ifndef _LINUX_RECIPROCAL_DIV_H
+#define _LINUX_RECIPROCAL_DIV_H
+
+/*
+ * This file describes reciprocal division.
+ *
+ * This optimizes the (A/B) problem, when A and B are two u32
+ * and B is a known value (but not known at compile time)
+ *
+ * The math principle used is :
+ * Let R = RECIPROCAL_VALUE(B) be (((1LL << 32) + (B - 1)) / B)
+ * Then A / B = (u32)(((u64)(A) * (R)) >> 32)
+ *
+ * This replaces a divide by a multiply (and a shift), and
+ * is generally less expensive in CPU cycles.
+ */
+
+/*
+ * Computes the reciprocal value (R) for the value B of the divisor.
+ * Should not be called before each reciprocal_divide(),
+ * or else the performance is slower than a normal divide.
+ */
+extern u32 reciprocal_value(u32 B);
+
+
+static inline u32 reciprocal_divide(u32 A, u32 R)
+{
+	return (u32)(((u64)A * R) >> 32);
+}
+#endif
--- linux-2.6.19/lib/reciprocal_div.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.19-ed/lib/reciprocal_div.c	2006-12-04 23:18:54.000000000 +0100
@@ -0,0 +1,10 @@
+#include <linux/types.h>
+#include <asm/div64.h>
+#include <linux/reciprocal_div.h>
+
+u32 reciprocal_value(u32 k)
+{
+	u64 val = (1LL << 32) + (k - 1);
+	do_div(val, k);
+	return (u32)val;
+}
--- linux-2.6.19/lib/Makefile	2006-12-04 23:12:30.000000000 +0100
+++ linux-2.6.19-ed/lib/Makefile	2006-12-04 23:15:14.000000000 +0100
@@ -5,7 +5,7 @@
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
	 bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \
	 idr.o div64.o int_sqrt.o bitmap.o extable.o prio_tree.o \
-	 sha1.o irq_regs.o
+	 sha1.o irq_regs.o reciprocal_div.o

 lib-$(CONFIG_MMU)	+= ioremap.o
 lib-$(CONFIG_SMP)	+= cpumask.o
--- linux-2.6.19/mm/slab.c	2006-12-04 22:51:42.000000000 +0100
+++ linux-2.6.19-ed/mm/slab.c	2006-12-04 23:13:42.000000000 +0100
@@ -107,6 +107,7 @@
 #include <linux/mempolicy.h>
 #include <linux/mutex.h>
 #include <linux/rtmutex.h>
+#include <linux/reciprocal_div.h>

 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -385,6 +386,7 @@ struct kmem_cache {
	unsigned int shared;

	unsigned int buffer_size;
+	u32 reciprocal_buffer_size;

	/* 3) touched by every alloc & free from the backend */
	struct kmem_list3 *nodelists[MAX_NUMNODES];

@@ -626,10 +628,17 @@ static inline void *index_to_obj(struct
	return slab->s_mem + cache->buffer_size * idx;
 }

-static inline unsigned int obj_to_index(struct kmem_cache *cache,
-					struct slab *slab, void *obj)
+/*
+ * We want to avoid an expensive divide : (offset / cache->buffer_size)
+ * Using the fact that buffer_size is a constant for a particular cache,
+ * we can replace (offset / cache->buffer_size) by
+ * reciprocal_divide(offset, cache->reciprocal_buffer_size)
+ */
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+					const struct slab *slab, void *obj)
 {
-	return (unsigned)(obj - slab->s_mem) / cache->buffer_size;
+	unsigned int offset = (obj - slab->s_mem);
+	return reciprocal_divide(offset, cache->reciprocal_buffer_size);
 }

 /*
@@ -1400,6 +1409,8 @@ void __init kmem_cache_init(void)

	cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
					cache_line_size());
+	cache_cache.reciprocal_buffer_size =
+		reciprocal_value(cache_cache.buffer_size);

	for (order = 0; order < MAX_ORDER; order++) {
		cache_estimate(order, cache_cache.buffer_size,
@@ -2297,6 +2308,7 @@ kmem_cache_create (const char *name, siz
	if (flags & SLAB_CACHE_DMA)
		cachep->gfpflags |= GFP_DMA;
	cachep->buffer_size = size;
+	cachep->reciprocal_buffer_size = reciprocal_value(size);

	if (flags & CFLGS_OFF_SLAB) {
		cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 21:34 ` Eric Dumazet
@ 2006-12-04 21:56 ` David Miller
  2006-12-04 22:45 ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2006-12-04 21:56 UTC (permalink / raw)
To: dada1; +Cc: akpm, clameter, linux-kernel

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 04 Dec 2006 22:34:29 +0100

> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
> for a multiply.

For UltraSPARC I and II (which is what this 200mhz guy probably is),
it's 4 cycle latency for a multiply (32-bit or 64-bit) and 68 cycles
for a 64-bit divide (32-bit divide is 37 cycles).

UltraSPARC-III and IV are worse, 6 cycles for multiply and 40/71
cycles (32/64-bit) for integer divides.

Niagara is even worse :-) 11 cycle integer multiply and a 72 cycle
integer divide (regardless of 32-bit or 64-bit).

(more details in gcc/config/sparc/sparc.c:{ultrasparc,ultrasparc3,niagara}_cost).

So this change has tons of merit for sparc64 chips at least :-)

Also, the multiply can parallelize with other operations but it
seems that integer divide stalls the pipe for most of the duration
of the calculation.  So this makes the divide even worse.
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 21:56 ` David Miller
@ 2006-12-04 22:45 ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2006-12-04 22:45 UTC (permalink / raw)
To: David Miller; +Cc: akpm, clameter, linux-kernel

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Mon, 04 Dec 2006 22:34:29 +0100
>
>> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
>> for a multiply.
>
> For UltraSPARC I and II (which is what this 200mhz guy probably is),
> it's 4 cycle latency for a multiply (32-bit or 64-bit) and 68 cycles
> for a 64-bit divide (32-bit divide is 37 cycles).

I must have an Ultra-2 (running Solaris :( )

	for (ui = 0 ; ui < 100000000 ; ui++)
		val += reciprocal_divide(ui, reciproc);

100000cb0:	83 31 20 00	srl   %g4, 0, %g1
100000cb4:	82 48 40 12	mulx  %g1, %l2, %g1
100000cb8:	83 30 70 20	srlx  %g1, 0x20, %g1
100000cbc:	88 01 20 01	inc   %g4
100000cc0:	80 a1 00 05	cmp   %g4, %g5
100000cc4:	08 4f ff fb	bleu  %icc, 100000cb0
100000cc8:	b0 06 00 01	add   %i0, %g1, %i0

I confirm that this block uses 20 cycles/iteration, while the next one
uses 72 cycles/iteration:

	for (ui = 0 ; ui < 100000000 ; ui++)
		val += ui / value;

100000ca8:	83 31 20 00	srl   %g4, 0, %g1
100000cac:	82 68 40 11	udivx %g1, %l1, %g1
100000cb0:	88 01 20 01	inc   %g4
100000cb4:	80 a1 00 05	cmp   %g4, %g5
100000cb8:	08 4f ff fc	bleu  %icc, 100000ca8
100000cbc:	b0 06 00 01	add   %i0, %g1, %i0

> UltraSPARC-III and IV are worse, 6 cycles for multiply and 40/71
> cycles (32/64-bit) for integer divides.
>
> Niagara is even worse :-) 11 cycle integer multiply and a 72 cycle
> integer divide (regardless of 32-bit or 64-bit).
>
> (more details in gcc/config/sparc/sparc.c:{ultrasparc,ultrasparc3,niagara}_cost).
> So this change has tons of merit for sparc64 chips at least :-)
>
> Also, the multiply can parallelize with other operations but it
> seems that integer divide stalls the pipe for most of the duration
> of the calculation.  So this makes the divide even worse.
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 16:41 ` [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() Eric Dumazet
  2006-12-04 16:55 ` Christoph Lameter
@ 2006-12-05 14:42 ` Pavel Machek
  1 sibling, 0 replies; 25+ messages in thread
From: Pavel Machek @ 2006-12-05 14:42 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andrew Morton, David Miller, linux-kernel, Christoph Lameter

Hi!

> When some objects are allocated by one CPU but freed by another CPU we can
> consume lot of cycles doing divides in obj_to_index().
>
> (Typical load on a dual processor machine where network interrupts are handled
> by one particular CPU (allocating skbufs), and the other CPU is running the
> application (consuming and freeing skbufs))
>
> Here on one production server (dual-core AMD Opteron 285), I noticed this
> divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opteron are
> quite modern cpus and the divide is much more expensive on oldest
> architectures :
>
> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
> for a multiply.
>
> Doing some math, we can use a reciprocal multiplication instead of a divide.
>
> If we want to compute V = (A / B) (A and B being u32 quantities)
> we can instead use :
>
> V = ((u64)A * RECIPROCAL(B)) >> 32 ;
>
> where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Well, I guess it should be gcc doing this optimization, not us by
hand. And I believe gcc *is* smart enough to do it in some cases...
								pavel
--
Thanks for all the (sleeping) penguins.
* Re: [PATCH] prune_icache_sb
  2006-12-03 17:49 ` Wendy Cheng
  2006-12-03 20:47 ` Andrew Morton
@ 2006-12-04 16:51 ` Russell Cattelan
  2006-12-04 20:46 ` Wendy Cheng
  1 sibling, 1 reply; 25+ messages in thread
From: Russell Cattelan @ 2006-12-04 16:51 UTC (permalink / raw)
To: wcheng; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

Wendy Cheng wrote:

> Andrew Morton wrote:
>
>> On Thu, 30 Nov 2006 11:05:32 -0500
>> Wendy Cheng <wcheng@redhat.com> wrote:
>>
>>> The idea is, instead of unconditionally dropping every buffer
>>> associated with the particular mount point (that defeats the purpose
>>> of page caching), base kernel exports the "drop_pagecache_sb()" call
>>> that allows page cache to be trimmed. More importantly, it is
>>> changed to offer the choice of not randomly purging any buffer but
>>> the ones that seem to be unused (i_state is NULL and i_count is
>>> zero). This will encourage filesystem(s) to proactively respond to
>>> vm memory shortage if they choose to.
>>
>> argh.
>
> I read this as "It is ok to give system admin(s) commands (that this
> "drop_pagecache_sb() call" is all about) to drop page cache. It is,
> however, not ok to give filesystem developer(s) this very same
> function to trim their own page cache if the filesystems choose to do
> so" ?
>
>> In Linux a filesystem is a dumb layer which sits between the VFS and the
>> I/O layer and provides dumb services such as reading/writing inodes,
>> reading/writing directory entries, mapping pagecache offsets to disk
>> blocks, etc.  (This model is to varying degrees incorrect for every
>> post-ext2 filesystem, but that's the way it is).
>
> Linux kernel, particularly the VFS layer, is starting to show signs of
> inadequacy as the software components built upon it keep growing. I
> have doubts that it can keep up and handle this complexity with a
> development policy like you just described (filesystem is a dumb layer
> ?).
> Aren't these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a
> perfect example of why trying to do too many things inside the vfs layer
> for so many filesystems is a bad idea ? By the way, since we're on this
> subject, could we discuss a little bit about the vfs rename call (or I
> can start another new discussion thread) ?
>
> Note that linux do_rename() starts with the usual lookup logic,
> followed by "lock_rename", then a final round of dentry lookup, and
> finally comes to the filesystem's i_op->rename call. Since lock_rename()
> only calls for vfs layer locks that are local to this particular
> machine, for a cluster filesystem, there exists a huge window between
> the final lookup and the filesystem's i_op->rename calls such that the
> file could get deleted from another node before the fs can do anything
> about it. Is it possible that we could get a new function pointer
> (lock_rename) in the inode_operations structure so a cluster filesystem
> can do proper locking ?

It looks like the ocfs2 guys have a similar problem?

http://ftp.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/ocfs2_git_patches/ocfs2-upstream-linus-20060924/0009-PATCH-Allow-file-systems-to-manually-d_move-inside-of-rename.txt

Does this change help fix the gfs lock ordering problem as well?

-Russell Cattelan
cattelan@xfs.org
* Re: [PATCH] prune_icache_sb
  2006-12-04 16:51 ` [PATCH] prune_icache_sb Russell Cattelan
@ 2006-12-04 20:46 ` Wendy Cheng
  0 siblings, 0 replies; 25+ messages in thread
From: Wendy Cheng @ 2006-12-04 20:46 UTC (permalink / raw)
To: Russell Cattelan; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

Russell Cattelan wrote:

> Wendy Cheng wrote:
>
>> Linux kernel, particularly the VFS layer, is starting to show signs
>> of inadequacy as the software components built upon it keep growing.
>> I have doubts that it can keep up and handle this complexity with a
>> development policy like you just described (filesystem is a dumb
>> layer ?). Aren't these DIO_xxx_LOCKING flags inside
>> __blockdev_direct_IO() a perfect example of why trying to do too many
>> things inside the vfs layer for so many filesystems is a bad idea ?
>> By the way, since we're on this subject, could we discuss a little
>> bit about the vfs rename call (or I can start another new discussion
>> thread) ?
>>
>> Note that linux do_rename() starts with the usual lookup logic,
>> followed by "lock_rename", then a final round of dentry lookup, and
>> finally comes to the filesystem's i_op->rename call. Since
>> lock_rename() only calls for vfs layer locks that are local to this
>> particular machine, for a cluster filesystem, there exists a huge
>> window between the final lookup and the filesystem's i_op->rename
>> calls such that the file could get deleted from another node before
>> the fs can do anything about it. Is it possible that we could get a
>> new function pointer (lock_rename) in the inode_operations structure
>> so a cluster filesystem can do proper locking ?
>
> It looks like the ocfs2 guys have a similar problem?
>
> http://ftp.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/ocfs2_git_patches/ocfs2-upstream-linus-20060924/0009-PATCH-Allow-file-systems-to-manually-d_move-inside-of-rename.txt

Thanks for the pointer.
Same as ocfs2, under the current VFS code, both GFS1/2 also need the
FS_ODD_RENAME flag for the rename problem - I have an ugly ~200 line
draft patch ready for GFS1 (and am looking into GFS2 at this moment).
The issue here is, for GFS, if vfs lock_rename() could call into the
filesystem, this complication would be greatly reduced. Will start
another thread to see whether the wish can be granted.

-- Wendy
end of thread, other threads:[~2006-12-05 15:12 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-22 21:35 [PATCH] prune_icache_sb Wendy Cheng
2006-11-22 23:36 ` Andrew Morton
2006-11-27 23:52 ` Wendy Cheng
2006-11-28  0:52 ` Andrew Morton
2006-11-28 21:41 ` Wendy Cheng
2006-11-29  0:21 ` Andrew Morton
2006-11-29  6:02 ` Wendy Cheng
2006-11-30 16:05 ` Wendy Cheng
2006-11-30 19:31 ` Nate Diller
2006-12-01 21:23 ` Andrew Morton
2006-12-03 17:49 ` Wendy Cheng
2006-12-03 20:47 ` Andrew Morton
2006-12-04  5:57 ` Wendy Cheng
2006-12-04  6:28 ` Andrew Morton
2006-12-04 16:41 ` [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() Eric Dumazet
2006-12-04 16:55 ` Christoph Lameter
2006-12-04 18:18 ` Eric Dumazet
2006-12-04 19:49 ` Andrew Morton
2006-12-04 19:55 ` Christoph Lameter
2006-12-04 21:34 ` Eric Dumazet
2006-12-04 21:56 ` David Miller
2006-12-04 22:45 ` Eric Dumazet
2006-12-05 14:42 ` Pavel Machek
2006-12-04 16:51 ` [PATCH] prune_icache_sb Russell Cattelan
2006-12-04 20:46 ` Wendy Cheng