* [PATCH] prune_icache_sb
@ 2006-11-22 21:35 Wendy Cheng
2006-11-22 23:36 ` Andrew Morton
0 siblings, 1 reply; 25+ messages in thread
From: Wendy Cheng @ 2006-11-22 21:35 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
[-- Attachment #1: Type: text/plain, Size: 1692 bytes --]
There seems to be a need to prune inode cache entries for specific
mount points (per VFS superblock) due to performance issues found after
some I/O intensive commands ("rsync" for example). The problem is
particularly serious for one of our kernel modules, which caches its
(cluster) locks based on the VFS inode implementation. These locks are
created by the inode creation call and get purged when s_op->clear_inode()
is invoked. On larger servers equipped with plenty of memory, the page
dirty ratio may not pass the threshold that triggers the VM reclaim logic,
but the accumulated inode count (and the associated cluster locks) can
cause unacceptable performance degradation for latency-sensitive
applications.
After adding the uploaded inode trimming patch, together with
shrink_dcache_sb(), we are able to keep the latency for one real-world
application within a satisfactory bound (consistently within 5 seconds,
compared to the original fluctuation between 5 and 16 seconds). The
calls are placed in one of our kernel daemons that wakes up at a tunable
interval to do the trimming work, as shown in the following code
segment. We would appreciate it if this patch could be accepted into the
mainline kernel.
	i_percent = sdp->sd_tune.gt_inoded_purge;
	if (i_percent) {
		if (i_percent > 100)
			i_percent = 100;
		a_count = atomic_read(&sdp->sd_inode_count);
		i_count = a_count * i_percent / 100;
		(void) shrink_dcache_sb(sdp->sd_vfs);
		(void) prune_icache_sb(i_count, sdp->sd_vfs);
	}
-- Wendy
[-- Attachment #2: inode_prune_sb.patch --]
[-- Type: text/x-patch, Size: 2144 bytes --]
Signed-off-by: S. Wendy Cheng <wcheng@redhat.com>
fs/inode.c | 14 +++++++++++---
include/linux/fs.h | 3 ++-
2 files changed, 13 insertions(+), 4 deletions(-)
--- linux-2.6.18/include/linux/fs.h	2006-09-19 23:42:06.000000000 -0400
+++ ups-kernel/include/linux/fs.h	2006-11-22 13:55:55.000000000 -0500
@@ -1625,7 +1625,8 @@ extern void remove_inode_hash(struct ino
 static inline void insert_inode_hash(struct inode *inode) {
 	__insert_inode_hash(inode, inode->i_ino);
 }
-
+extern void prune_icache_sb(int nr_to_scan, struct super_block *sb);
+
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
 extern void file_kill(struct file *f);
--- linux-2.6.18/fs/inode.c	2006-09-19 23:42:06.000000000 -0400
+++ ups-kernel/fs/inode.c	2006-11-22 14:12:28.000000000 -0500
@@ -411,7 +411,7 @@ static int can_unuse(struct inode *inode
  * If the inode has metadata buffers attached to mapping->private_list then
  * try to remove them.
  */
-static void prune_icache(int nr_to_scan)
+static void __prune_icache(int nr_to_scan, struct super_block *sb)
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
@@ -428,7 +428,8 @@ static void prune_icache(int nr_to_scan)
 
 		inode = list_entry(inode_unused.prev, struct inode, i_list);
 
-		if (inode->i_state || atomic_read(&inode->i_count)) {
+		if (inode->i_state || atomic_read(&inode->i_count)
+			|| (sb && (inode->i_sb != sb))) {
 			list_move(&inode->i_list, &inode_unused);
 			continue;
 		}
@@ -461,6 +462,13 @@ static void prune_icache(int nr_to_scan)
 	mutex_unlock(&iprune_mutex);
 }
 
+void prune_icache_sb(int nr_to_scan, struct super_block *sb)
+{
+	__prune_icache(nr_to_scan, sb);
+}
+
+EXPORT_SYMBOL(prune_icache_sb);
+
 /*
  * shrink_icache_memory() will attempt to reclaim some unused inodes.  Here,
  * "unused" means that no dentries are referring to the inodes: the files are
@@ -480,7 +488,7 @@ static int shrink_icache_memory(int nr,
 		 */
 		if (!(gfp_mask & __GFP_FS))
 			return -1;
-		prune_icache(nr);
+		__prune_icache(nr, NULL);
 	}
 	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
 }
^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] prune_icache_sb
  2006-11-22 21:35 [PATCH] prune_icache_sb Wendy Cheng
@ 2006-11-22 23:36 ` Andrew Morton
  2006-11-27 23:52   ` Wendy Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-11-22 23:36 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: linux-kernel, linux-fsdevel

On Wed, 22 Nov 2006 16:35:07 -0500
Wendy Cheng <wcheng@redhat.com> wrote:

> There seems to be a need to prune inode cache entries for specific
> mount points (per VFS superblock) due to performance issues found after
> some I/O intensive commands ("rsync" for example). The problem is
> particularly serious for one of our kernel modules, which caches its
> (cluster) locks based on the VFS inode implementation. These locks are
> created by the inode creation call and get purged when
> s_op->clear_inode() is invoked. On larger servers equipped with plenty
> of memory, the page dirty ratio may not pass the threshold that
> triggers the VM reclaim logic, but the accumulated inode count (and
> the associated cluster locks) can cause unacceptable performance
> degradation for latency-sensitive applications.
>
> After adding the uploaded inode trimming patch, together with
> shrink_dcache_sb(), we are able to keep the latency for one real-world
> application within a satisfactory bound (consistently within 5 seconds,
> compared to the original fluctuation between 5 and 16 seconds). The
> calls are placed in one of our kernel daemons that wakes up at a
> tunable interval to do the trimming work, as shown in the following
> code segment. We would appreciate it if this patch could be accepted
> into the mainline kernel.
>
>	i_percent = sdp->sd_tune.gt_inoded_purge;
>	if (i_percent) {
>		if (i_percent > 100)
>			i_percent = 100;
>		a_count = atomic_read(&sdp->sd_inode_count);
>		i_count = a_count * i_percent / 100;
>		(void) shrink_dcache_sb(sdp->sd_vfs);
>		(void) prune_icache_sb(i_count, sdp->sd_vfs);
>	}
>
> ...
>
> --- linux-2.6.18/include/linux/fs.h	2006-09-19 23:42:06.000000000 -0400
> +++ ups-kernel/include/linux/fs.h	2006-11-22 13:55:55.000000000 -0500
> @@ -1625,7 +1625,8 @@ extern void remove_inode_hash(struct ino
>  static inline void insert_inode_hash(struct inode *inode) {
>  	__insert_inode_hash(inode, inode->i_ino);
>  }
> -
> +extern void prune_icache_sb(int nr_to_scan, struct super_block *sb);
> +
>  extern struct file * get_empty_filp(void);
>  extern void file_move(struct file *f, struct list_head *list);
>  extern void file_kill(struct file *f);
> --- linux-2.6.18/fs/inode.c	2006-09-19 23:42:06.000000000 -0400
> +++ ups-kernel/fs/inode.c	2006-11-22 14:12:28.000000000 -0500
> @@ -411,7 +411,7 @@ static int can_unuse(struct inode *inode
>   * If the inode has metadata buffers attached to mapping->private_list then
>   * try to remove them.
>   */
> -static void prune_icache(int nr_to_scan)
> +static void __prune_icache(int nr_to_scan, struct super_block *sb)
>  {
>  	LIST_HEAD(freeable);
>  	int nr_pruned = 0;
> @@ -428,7 +428,8 @@ static void prune_icache(int nr_to_scan)
>
>  		inode = list_entry(inode_unused.prev, struct inode, i_list);
>
> -		if (inode->i_state || atomic_read(&inode->i_count)) {
> +		if (inode->i_state || atomic_read(&inode->i_count)
> +			|| (sb && (inode->i_sb != sb))) {
>  			list_move(&inode->i_list, &inode_unused);
>  			continue;

This search is potentially inefficient.  It would be better to walk
sb->s_inodes.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-22 23:36 ` Andrew Morton
@ 2006-11-27 23:52   ` Wendy Cheng
  2006-11-28  0:52     ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Wendy Cheng @ 2006-11-27 23:52 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

Andrew Morton wrote:

> This search is potentially inefficient.  It would be better to walk
> sb->s_inodes.

Not sure about walking through sb->s_inodes, for several reasons:

1. The changes made are mostly for file server setups with large
filesystems - the entry count in sb->s_inodes may not be shorter than
the inode_unused list.

2. Unlike calls such as drop_pagecache_sb() (which doesn't remove list
entries), we are walking through the list to dispose of its entries.
This implies walking through one list (sb->s_inodes) to remove entries
from another list (inode_unused). This feels awkward.

3. The new code would be very similar to the current prune_icache() with
a few differences - e.g., we really don't want to list_move() within the
sb->s_inodes list itself (as prune_icache() does when it moves the
examined entry to the tail of the inode_unused list). We would have to
either duplicate the code or clutter the current prune_icache() routine.

Pruning based on sb->s_inodes *does* have its advantages, but a simple
and plain patch as shown in the previous post (which has been well
tested in two large-scale production systems) could be equally
effective. Does that make sense?

-- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-27 23:52 ` Wendy Cheng
@ 2006-11-28  0:52   ` Andrew Morton
  2006-11-28 21:41     ` Wendy Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-11-28 0:52 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: linux-kernel, linux-fsdevel

On Mon, 27 Nov 2006 18:52:58 -0500
Wendy Cheng <wcheng@redhat.com> wrote:

> Andrew Morton wrote:
> > This search is potentially inefficient.  It would be better to walk
> > sb->s_inodes.
>
> Not sure about walking through sb->s_inodes, for several reasons:
>
> 1. The changes made are mostly for file server setups with large
> filesystems - the entry count in sb->s_inodes may not be shorter than
> the inode_unused list.

umm, that's the best case.  We also care about the worst case.  Think:
1,000,000 inodes on inode_unused, of which a randomly-sprinkled 10,000
are from the being-unmounted filesystem.  The code as proposed will do
100x more work than it needs to do.  All under a global spinlock.

> 2. Unlike calls such as drop_pagecache_sb() (which doesn't remove list
> entries), we are walking through the list to dispose of its entries.
> This implies walking through one list (sb->s_inodes) to remove entries
> from another list (inode_unused). This feels awkward.
>
> 3. The new code would be very similar to the current prune_icache()
> with a few differences - e.g., we really don't want to list_move()
> within the sb->s_inodes list itself (as prune_icache() does when it
> moves the examined entry to the tail of the inode_unused list). We
> would have to either duplicate the code or clutter the current
> prune_icache() routine.
>
> Pruning based on sb->s_inodes *does* have its advantages, but a simple
> and plain patch as shown in the previous post (which has been well
> tested in two large-scale production systems) could be equally
> effective. Does that make sense?

I also worry about the whole thing:

> There seems to be a need to prune inode cache entries for specific
> mount points (per VFS superblock) due to performance issues found
> after some I/O intensive commands ("rsync" for example). The problem
> is particularly serious for one of our kernel modules, which caches
> its (cluster) locks based on the VFS inode implementation. These locks
> are created by the inode creation call and get purged when
> s_op->clear_inode() is invoked. On larger servers equipped with plenty
> of memory, the page dirty ratio may not pass the threshold that
> triggers the VM reclaim logic, but the accumulated inode count (and
> the associated cluster locks) can cause unacceptable performance
> degradation for latency-sensitive applications.

What's this about "the page dirty ratio may not pass the threshold to
trigger VM reclaim logic"?  Page reclaim isn't triggered by a dirty
page ratio.  Page reclaim is triggered by a shortage of free pages.

And page reclaim is supposed to reclaim unused inodes in an orderly and
balanced fashion.  It appears that it's not doing so in your case, and
we'd need to see more details (please) so we can understand why it is
not working.

You're proposing that we not do any of that and that the filesystem be
able to call into a VM memory reclaim function for not-clearly-understood
reasons.  This is a workaround.  Please help us understand what has gone
wrong with inode reclaim.

And please see if you can find time to help us with this rather than
adding some RH-specific fix and then forgetting about it (sensible
though that approach would be...)

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-28  0:52 ` Andrew Morton
@ 2006-11-28 21:41   ` Wendy Cheng
  2006-11-29  0:21     ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Wendy Cheng @ 2006-11-28 21:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

Andrew Morton wrote:
> On Mon, 27 Nov 2006 18:52:58 -0500
> Wendy Cheng <wcheng@redhat.com> wrote:
>
>> Not sure about walking through sb->s_inodes, for several reasons:
>>
>> 1. The changes made are mostly for file server setups with large
>> filesystems - the entry count in sb->s_inodes may not be shorter than
>> the inode_unused list.
>
> umm, that's the best case.  We also care about the worst case.  Think:
> 1,000,000 inodes on inode_unused, of which a randomly-sprinkled 10,000
> are from the being-unmounted filesystem.  The code as proposed will do
> 100x more work than it needs to do.  All under a global spinlock.

By walking through sb->s_inodes, we also need to take inode_lock and
iprune_mutex (?), since we're purging the inodes from the system - or
specifically, removing them from the inode_unused list. There is really
not much difference from the current prune_icache() logic. What's been
proposed here is simply *exporting* the prune_icache() kernel code to
allow filesystems to trim (purge a small percentage of) their
potentially unused per-mount inodes for *latency* considerations.

I made a mistake by using the "page dirty ratio" to explain the problem
(sorry! I was not thinking well in the previous write-up), which could
mislead you into thinking this is a VM issue. This is not so much a
low-on-free-pages (and/or memory fragmentation) issue (though
fragmentation is normally part of the symptoms). What the (external)
kernel module does is tie its cluster-wide file lock to the in-memory
inode that is obtained at file lookup time. The lock is removed from
the machine when

1. the lock is granted to another (cluster) machine; or
2. the in-memory inode is purged from the system.

One of the clusters that has this latency issue is an IP/TV application
that "rsync"s with the main station server (over a long geographical
distance) every 15 minutes. It subsequently (and constantly) generates
a large number of inodes (and locks) hanging around. When other nodes
within the same cluster, serving as FTP servers, are serving the files,
DLM has to wade through a huge number of lock entries to know whether
the lock requests can be granted. That's where this latency issue pops
up. Our profiling data shows that when the cluster performance drops
into unacceptable ranges, DLM can hog 40% of CPU cycles in its lock
searching logic. From the VM point of view, the system does not have a
memory shortage, so it has no need to kick off the prune_icache() call.

This issue could also be fixed in several different ways - maybe by a
better DLM hash function, maybe by asking IT people to umount the
filesystem, whereby *all* per-mount inodes are unconditionally purged
(but that defeats the purpose of caching inodes and, in our case, the
locks) after each rsync, ...., etc. But I do think the proposed patch
is the most sensible way to fix this issue, and I believe it will be
one of those functions that, if you export it, people will find a good
use for. It also helps with memory fragmentation and/or shortage
*before* it becomes a problem. I certainly understand and respect a
maintainer's daunting job of deciding whether to take or reject a patch
- let me know what you think so I can start to work on other solutions
if required.

-- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-28 21:41 ` Wendy Cheng
@ 2006-11-29  0:21   ` Andrew Morton
  2006-11-29  6:02     ` Wendy Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-11-29 0:21 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: linux-kernel, linux-fsdevel

On Tue, 28 Nov 2006 16:41:07 -0500
Wendy Cheng <wcheng@redhat.com> wrote:

> By walking through sb->s_inodes, we also need to take inode_lock and
> iprune_mutex (?), since we're purging the inodes from the system - or
> specifically, removing them from the inode_unused list.  There is
> really not much difference from the current prune_icache() logic.

There's quite a bit of difference.  The change you're proposing will
perform poorly if it is used in the scenario which I describe above.
It will waste CPU cycles and will destroy the inode_unused LRU ordering
(for what that's worth, which isn't much).

Trust me, every single time we've had an inefficient search in core
kernel, someone has gone and done something which hits it and causes
general meltdown in their workload.  So we've had to make significant
changes to remove the O(n) or higher search complexity.  And in this
case we *already have* the data structures in place to make it O(1).

> What's been proposed here is simply *exporting* the prune_icache()
> kernel code to allow filesystems to trim (purge a small percentage
> of) their potentially unused per-mount inodes for *latency*
> considerations.

It just happens to work in your setup.  If you have a large machine
with two filesystems and you run rsync on both filesystems and run FTP
against one of them, it might not work so well.  Because the proposed
prune_icache_sb() might need to chew through 500,000 inodes from the
wrong superblock before reclaiming any of the inodes which you want to
reclaim.  Or something like that.

> I made a mistake by using the "page dirty ratio" to explain the
> problem (sorry! I was not thinking well in the previous write-up),
> which could mislead you into thinking this is a VM issue.  This is
> not so much a low-on-free-pages (and/or memory fragmentation) issue
> (though fragmentation is normally part of the symptoms).  What the
> (external) kernel module does is tie its cluster-wide file lock to
> the in-memory inode that is obtained at file lookup time.  The lock
> is removed from the machine when
>
> 1. the lock is granted to another (cluster) machine; or
> 2. the in-memory inode is purged from the system.

It seems peculiar to be tying the lifetime of a DLM lock to the
system's memory size and current memory pressure?

> One of the clusters that has this latency issue is an IP/TV
> application that "rsync"s with the main station server (over a long
> geographical distance) every 15 minutes.  It subsequently (and
> constantly) generates a large number of inodes (and locks) hanging
> around.  When other nodes within the same cluster, serving as FTP
> servers, are serving the files, DLM has to wade through a huge number
> of lock entries to know whether the lock requests can be granted.
> That's where this latency issue pops up.  Our profiling data shows
> that when the cluster performance drops into unacceptable ranges, DLM
> can hog 40% of CPU cycles in its lock searching logic.  From the VM
> point of view, the system does not have a memory shortage, so it has
> no need to kick off the prune_icache() call.

OK..

> This issue could also be fixed in several different ways - maybe by a
> better DLM hash function,

It does sound like the lock lookup is broken.  I assume there's some
reason for keeping these things floating about in memory, so there must
be a downside to artificially pruning them in this manner?  If so, a
(much) faster lookup would seem to be the best fix.

> maybe by asking IT people to umount the filesystem, whereby *all*
> per-mount inodes are unconditionally purged (but that defeats the
> purpose of caching inodes and, in our case, the locks) after each
> rsync, ...., etc.  But I do think the proposed patch is the most
> sensible way to fix this issue, and I believe it will be one of those
> functions that, if you export it, people will find a good use for.
> It also helps with memory fragmentation and/or shortage *before* it
> becomes a problem.  I certainly understand and respect a maintainer's
> daunting job of deciding whether to take or reject a patch - let me
> know what you think so I can start to work on other solutions if
> required.

We shouldn't export this particular implementation to modules because
it has bad failure modes.  There might be a case for exposing an
i_sb_list-based API or, perhaps better, a max-unused-inodes mount
option.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-29  0:21 ` Andrew Morton
@ 2006-11-29  6:02   ` Wendy Cheng
  2006-11-30 16:05     ` Wendy Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Wendy Cheng @ 2006-11-29 6:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

Andrew Morton wrote:

> We shouldn't export this particular implementation to modules because
> it has bad failure modes.  There might be a case for exposing an
> i_sb_list-based API or, perhaps better, a max-unused-inodes mount
> option.

Ok, thanks for looking into this - it is appreciated. I'll try to
figure out something else.

-- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-29  6:02 ` Wendy Cheng
@ 2006-11-30 16:05   ` Wendy Cheng
  2006-11-30 19:31     ` Nate Diller
  2006-12-01 21:23     ` Andrew Morton
  0 siblings, 2 replies; 25+ messages in thread
From: Wendy Cheng @ 2006-11-30 16:05 UTC (permalink / raw)
  To: wcheng, Andrew Morton; +Cc: linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1715 bytes --]

How about a simple and plain change with this uploaded patch....

The idea is that, instead of unconditionally dropping every buffer
associated with a particular mount point (which defeats the purpose of
page caching), the base kernel exports the drop_pagecache_sb() call so
that the page cache can be trimmed. More importantly, it is changed to
offer the choice of not purging buffers at random, but only the ones
that seem to be unused (i_state is NULL and i_count is zero). This will
encourage filesystems to proactively respond to VM memory shortage if
they choose to do so.

From our end (cluster locks are expensive - that's why we cache them),
one of our kernel daemons will invoke this newly exported call based on
a set of pre-defined tunables. It is then followed by a lock reclaim
logic that trims the locks by checking the page cache associated with
the inode (that each cluster lock was created for). If nothing is
attached to the inode (based on the i_mapping->nrpages count), we know
it is a good candidate for trimming and will subsequently drop this
lock (instead of waiting until the end of the VFS inode's life cycle).

Note that I could do invalidate_inode_pages() within our kernel modules
to accomplish what drop_pagecache_sb() does (without coming here to bug
people), but I don't have access to inode_lock as an external kernel
module. So either EXPORT_SYMBOL(inode_lock) or this patch?

The end result (of this change) should encourage filesystems to
actively avoid depleting too much memory, and we'll encourage our
applications to understand clustering locality issues.

Haven't tested this out though - would appreciate some comments before
spending more effort on this direction.

-- Wendy

[-- Attachment #2: inode_prune_sb.patch --]
[-- Type: text/x-patch, Size: 1723 bytes --]

--- linux-2.6.18/include/linux/fs.h	2006-09-19 23:42:06.000000000 -0400
+++ ups-kernel/include/linux/fs.h	2006-11-30 02:16:34.000000000 -0500
@@ -1625,7 +1625,8 @@ extern void remove_inode_hash(struct ino
 static inline void insert_inode_hash(struct inode *inode) {
 	__insert_inode_hash(inode, inode->i_ino);
 }
-
+extern void drop_pagecache_sb(struct super_block *sb, int nr_goal);
+
 extern struct file * get_empty_filp(void);
 extern void file_move(struct file *f, struct list_head *list);
 extern void file_kill(struct file *f);
--- linux-2.6.18/fs/drop_caches.c	2006-09-19 23:42:06.000000000 -0400
+++ ups-kernel/fs/drop_caches.c	2006-11-30 03:36:11.000000000 -0500
@@ -12,13 +12,20 @@
 /* A global variable is a bit ugly, but it keeps the code simple */
 int sysctl_drop_caches;
 
-static void drop_pagecache_sb(struct super_block *sb)
+void drop_pagecache_sb(struct super_block *sb, int nr_goal)
 {
 	struct inode *inode;
+	int nr_count = 0;
 
 	spin_lock(&inode_lock);
 	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-		if (inode->i_state & (I_FREEING|I_WILL_FREE))
+		if (nr_goal) {
+			if (nr_goal == nr_count)
+				break;
+			if (inode->i_state || atomic_read(&inode->i_count))
+				continue;
+			nr_count++;
+		} else if (inode->i_state & (I_FREEING|I_WILL_FREE))
 			continue;
 		invalidate_inode_pages(inode->i_mapping);
 	}
@@ -36,7 +43,7 @@ restart:
 		spin_unlock(&sb_lock);
 		down_read(&sb->s_umount);
 		if (sb->s_root)
-			drop_pagecache_sb(sb);
+			drop_pagecache_sb(sb, 0);
 		up_read(&sb->s_umount);
 		spin_lock(&sb_lock);
 		if (__put_super_and_need_restart(sb))
@@ -66,3 +73,6 @@ int drop_caches_sysctl_handler(ctl_table
 	}
 	return 0;
 }
+
+EXPORT_SYMBOL(drop_pagecache_sb);
+

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-11-30 16:05 ` Wendy Cheng
@ 2006-11-30 19:31   ` Nate Diller
  0 siblings, 0 replies; 25+ messages in thread
From: Nate Diller @ 2006-11-30 19:31 UTC (permalink / raw)
  To: Wendy Cheng; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

On 11/30/06, Wendy Cheng <wcheng@redhat.com> wrote:
> How about a simple and plain change with this uploaded patch....
>
> The idea is that, instead of unconditionally dropping every buffer
> associated with a particular mount point (which defeats the purpose
> of page caching), the base kernel exports the drop_pagecache_sb()
> call so that the page cache can be trimmed. More importantly, it is
> changed to offer the choice of not purging buffers at random, but
> only the ones that seem to be unused (i_state is NULL and i_count is
> zero). This will encourage filesystems to proactively respond to VM
> memory shortage if they choose to do so.
>
> From our end (cluster locks are expensive - that's why we cache
> them), one of our kernel daemons will invoke this newly exported call
> based on a set of pre-defined tunables. It is then followed by a lock
> reclaim logic that trims the locks by checking the page cache
> associated with the inode (that each cluster lock was created for).
> If nothing is attached to the inode (based on the i_mapping->nrpages
> count), we know it is a good candidate for trimming and will
> subsequently drop this lock (instead of waiting until the end of the
> VFS inode's life cycle).

I have a patch that is a more comprehensive version of this idea, but
it is not fully debugged and has suffered some bitrot in the past
couple of months. This turns out to be a good performance improvement
in the general case too, but it is more complex than your idea because
there are real locking changes needed to avoid deadlocks. I can send
you a copy of the patch if you are interested.

> Note that I could do invalidate_inode_pages() within our kernel
> modules to accomplish what drop_pagecache_sb() does (without coming
> here to bug people), but I don't have access to inode_lock as an
> external kernel module. So either EXPORT_SYMBOL(inode_lock) or this
> patch?

Like I said above, you have to be careful when touching inode_lock,
dcache_lock, and the mapping's tree_lock, because of potential
deadlocks. The mapping's lock can be taken from softirq context, but
the inode and dcache locks cannot.

NATE

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb 2006-11-30 16:05 ` Wendy Cheng 2006-11-30 19:31 ` Nate Diller @ 2006-12-01 21:23 ` Andrew Morton 2006-12-03 17:49 ` Wendy Cheng 1 sibling, 1 reply; 25+ messages in thread From: Andrew Morton @ 2006-12-01 21:23 UTC (permalink / raw) To: Wendy Cheng; +Cc: linux-kernel, linux-fsdevel On Thu, 30 Nov 2006 11:05:32 -0500 Wendy Cheng <wcheng@redhat.com> wrote: > How about a simple and plain change with this uploaded patch .... > > The idea is, instead of unconditionally dropping every buffer associated > with the particular mount point (that defeats the purpose of page > caching), base kernel exports the "drop_pagecache_sb()" call that allows > page cache to be trimmed. More importantly, it is changed to offer the > choice of not randomly purging any buffer but the ones that seem to be > unused (i_state is NULL and i_count is zero). This will encourage > filesystem(s) to pro actively response to vm memory shortage if they > choose so. argh. In Linux a filesystem is a dumb layer which sits between the VFS and the I/O layer and provides dumb services such as reading/writing inodes, reading/writing directory entries, mapping pagecache offsets to disk blocks, etc. (This model is to varying degrees incorrect for every post-ext2 filesystem, but that's the way it is). We do not want to go "encouraging" filesystems to play games tuning and trimming VFS caches and things like that. If a patch doing that were to turn up it would be heartily shouted at and the originator would be asked to go off and implement the functionality in core VFS so that a) all filesystems can immediately utilise it and b) other filesystems aren't tempted to go off and privately implement something similar. So please bear this philosophy in mind, and think about this feature from that perspective. One approach might be to add a per-superblock upper-bound on the number of cached dentries and/or inodes. Typically that would be controlled by a (re)mount option. 
Although we could perhaps discuss a sysfs representation of this (and, presumably, other mount options). But I'd expect such a proposal to have a hard time, because we'd need to know why such a thing is needed: we prefer auto-tuning, and that's what we have now, so what's gone wrong with it and how can we fix it, rather than adding a manual override? > From our end (cluster locks are expensive - that's why we cache them), > one of our kernel daemons will invoke this newly exported call based on > a set of pre-defined tunables. It is then followed by a lock reclaim > logic to trim the locks by checking the page cache associated with the > inode (that this cluster lock is created for). If nothing is attached to > the inode (based on i_mapping->nrpages count), we know it is a good > candidate for trimming and will subsequently drop this lock (instead of > waiting until the end of vfs inode life cycle). Again, I don't understand why you're tying the lifetime of these locks to the VFS inode reclaim mechanisms. Seems odd. If you want to put an upper bound on the number of in-core locks, why not string them on a list and throw away the old ones when the upper bound is reached? Did you look at improving that lock-lookup algorithm, btw? Core kernel has no problem maintaining millions of cached VFS objects - is there any reason why your lock lookup cannot be similarly efficient? > Note that I could do invalidate_inode_pages() within our kernel modules > to accomplish what drop_pagecache_sb() does (without coming here to bug > people) but I don't have access to inode_lock as an external kernel > module. So either EXPORT_SYMBOL(inode_lock) or this patch ? > > The end result (of this change) should encourage filesystem to actively > avoid depleting too much memory That is _not_ a filesytem responsibility! inode cache is owned and maintained by the VFS. > and we'll encourage our applications to > understand clustering locality issues. ? 
> Haven't tested this out yet, though - I would appreciate some comments
> before spending more effort on this direction.
>
> -- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-12-01 21:23 ` Andrew Morton
@ 2006-12-03 17:49 ` Wendy Cheng
  2006-12-03 20:47 ` Andrew Morton
  2006-12-04 16:51 ` [PATCH] prune_icache_sb Russell Cattelan
  0 siblings, 2 replies; 25+ messages in thread
From: Wendy Cheng @ 2006-12-03 17:49 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

Andrew Morton wrote:

>On Thu, 30 Nov 2006 11:05:32 -0500
>Wendy Cheng <wcheng@redhat.com> wrote:
>
>>The idea is, instead of unconditionally dropping every buffer associated
>>with the particular mount point (which defeats the purpose of page
>>caching), the base kernel exports the "drop_pagecache_sb()" call that
>>allows the page cache to be trimmed. More importantly, it is changed to
>>offer the choice of not randomly purging any buffer but only the ones
>>that seem to be unused (i_state is NULL and i_count is zero). This will
>>encourage filesystem(s) to proactively respond to vm memory shortage if
>>they choose to do so.
>
>argh.

I read this as "It is ok to give system admin(s) commands (which is what
this drop_pagecache_sb() call is all about) to drop the page cache. It
is, however, not ok to give filesystem developer(s) this very same
function to trim their own page cache if the filesystems choose to do
so"?

>In Linux a filesystem is a dumb layer which sits between the VFS and the
>I/O layer and provides dumb services such as reading/writing inodes,
>reading/writing directory entries, mapping pagecache offsets to disk
>blocks, etc. (This model is to varying degrees incorrect for every
>post-ext2 filesystem, but that's the way it is).

The Linux kernel, particularly the VFS layer, is starting to show signs
of inadequacy as the software components built upon it keep growing. I
have doubts that it can keep up and handle this complexity with a
development policy like the one you just described (a filesystem is a
dumb layer?).
Aren't these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a
perfect example of why trying to do too many things inside the vfs layer
for so many filesystems is a bad idea?

By the way, since we're on this subject, could we discuss the vfs rename
call a little bit (or I can start another new discussion thread)?

Note that Linux's do_rename() starts with the usual lookup logic,
followed by "lock_rename", then a final round of dentry lookup, and
finally comes to the filesystem's i_op->rename call. Since lock_rename()
only takes vfs-layer locks that are local to this particular machine, for
a cluster filesystem there exists a huge window between the final lookup
and the filesystem's i_op->rename calls such that the file could get
deleted from another node before the fs can do anything about it. Is it
possible that we could get a new function pointer (lock_rename) in the
inode_operations structure so a cluster filesystem can do proper locking?

>>From our end (cluster locks are expensive - that's why we cache them),
>>one of our kernel daemons will invoke this newly exported call based on
>>a set of pre-defined tunables. It is then followed by a lock reclaim
>>logic to trim the locks by checking the page cache associated with the
>>inode (that this cluster lock is created for). If nothing is attached to
>>the inode (based on i_mapping->nrpages count), we know it is a good
>>candidate for trimming and will subsequently drop this lock (instead of
>>waiting until the end of vfs inode life cycle).
>
>Again, I don't understand why you're tying the lifetime of these locks to
>the VFS inode reclaim mechanisms. Seems odd.

Cluster locks are expensive because:

1. Every node in the cluster has to agree about it upon granting the
   request (communication overhead).
2. It involves disk flushing if bouncing between nodes. Say one node
   requests a read lock after another node's write...
before the read lock can be granted, the write node needs to flush the
data to the disk (disk io overhead).

For optimization purposes, we want to refrain from flushing the disk
after writes, and hope (and encourage) the next requester of the lock to
be on the very same node (to take advantage of the OS write-back logic).
That's why the locks are cached on the very same node. They will not get
removed unless necessary. What could be better than building the lock
caching on top of the existing inode cache logic - since these are the
objects that the cluster locks are created for in the first place?

>If you want to put an upper bound on the number of in-core locks, why not
>string them on a list and throw away the old ones when the upper bound is
>reached?

Don't get me wrong. DLM *has* a tunable to set the max lock count. We do
drop the locks, but to drop the right locks we need a little bit of help
from the VFS layer. The latency requirement is difficult to manage.

>Did you look at improving that lock-lookup algorithm, btw? Core kernel has
>no problem maintaining millions of cached VFS objects - is there any reason
>why your lock lookup cannot be similarly efficient?

Don't be so confident. I did see some complaints from ext3-based mail
servers in the past - when the storage size was large enough, people had
to explicitly umount the filesystem from time to time to rescue their
performance. I don't recall the details at this moment though. For us,
with this particular customer, it is a 15TB storage.

-- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb 2006-12-03 17:49 ` Wendy Cheng @ 2006-12-03 20:47 ` Andrew Morton 2006-12-04 5:57 ` Wendy Cheng 2006-12-04 16:51 ` [PATCH] prune_icache_sb Russell Cattelan 1 sibling, 1 reply; 25+ messages in thread From: Andrew Morton @ 2006-12-03 20:47 UTC (permalink / raw) To: wcheng; +Cc: linux-kernel, linux-fsdevel On Sun, 03 Dec 2006 12:49:42 -0500 Wendy Cheng <wcheng@redhat.com> wrote: > Andrew Morton wrote: > > >On Thu, 30 Nov 2006 11:05:32 -0500 > >Wendy Cheng <wcheng@redhat.com> wrote: > > > > > > > >> > >>The idea is, instead of unconditionally dropping every buffer associated > >>with the particular mount point (that defeats the purpose of page > >>caching), base kernel exports the "drop_pagecache_sb()" call that allows > >>page cache to be trimmed. More importantly, it is changed to offer the > >>choice of not randomly purging any buffer but the ones that seem to be > >>unused (i_state is NULL and i_count is zero). This will encourage > >>filesystem(s) to pro actively response to vm memory shortage if they > >>choose so. > >> > >> > > > >argh. > > > > > I read this as "It is ok to give system admin(s) commands (that this > "drop_pagecache_sb() call" is all about) to drop page cache. It is, > however, not ok to give filesystem developer(s) this very same function > to trim their own page cache if the filesystems choose to do so" ? If you're referring to /proc/sys/vm/drop_pagecache then no, that isn't for administrators - it's a convenience thing for developers, to get repeatable benchmarks. Attempts to make it a per-numa-node control for admin purposes have been rejected. > >In Linux a filesystem is a dumb layer which sits between the VFS and the > >I/O layer and provides dumb services such as reading/writing inodes, > >reading/writing directory entries, mapping pagecache offsets to disk > >blocks, etc. (This model is to varying degrees incorrect for every > >post-ext2 filesystem, but that's the way it is). 
> > > > > Linux kernel, particularly the VFS layer, is starting to show signs of > inadequacy as the software components built upon it keep growing. I have > doubts that it can keep up and handle this complexity with a development > policy like you just described (filesystem is a dumb layer ?). Aren't > these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a perfect > example why trying to do too many things inside vfs layer for so many > filesystems is a bad idea ? That's not a very well-chosen example, but yes, the old ext2-based model has needed to be extended as new filesystems come along. > By the way, since we're on this subject, > could we discuss a little bit about vfs rename call (or I can start > another new discussion thread) ? > > Note that linux do_rename() starts with the usual lookup logic, followed > by "lock_rename", then a final round of dentry lookup, and finally comes > to filesystem's i_op->rename call. Since lock_rename() only calls for > vfs layer locks that are local to this particular machine, for a cluster > filesystem, there exists a huge window between the final lookup and > filesystem's i_op->rename calls such that the file could get deleted > from another node before fs can do anything about it. Is it possible > that we could get a new function pointer (lock_rename) in > inode_operations structure so a cluster filesystem can do proper locking ? That would need a new thread, and probably (at least pseudo-) code, and cc's to the appropriate maintainers (although that part of the kernel isn't really maintained any more - it has fallen into the patch-and-run model). > >>From our end (cluster locks are expensive - that's why we cache them), > >>one of our kernel daemons will invoke this newly exported call based on > >>a set of pre-defined tunables. It is then followed by a lock reclaim > >>logic to trim the locks by checking the page cache associated with the > >>inode (that this cluster lock is created for). 
If nothing is attached to > >>the inode (based on i_mapping->nrpages count), we know it is a good > >>candidate for trimming and will subsequently drop this lock (instead of > >>waiting until the end of vfs inode life cycle). > >> > >> > > > >Again, I don't understand why you're tying the lifetime of these locks to > >the VFS inode reclaim mechanisms. Seems odd. > > > > > Cluster locks are expensive because: > > 1. Every node in the cluster has to agree about it upon granting the > request (communication overhead). > 2. It involves disk flushing if bouncing between nodes. Say one node > requests a read lock after another node's write... before the read lock > can be granted, the write node needs to flush the data to the disk (disk > io overhead). > > For optimization purpose, we want to refrain the disk flush after writes > and hope (and encourage) the next person who requests the lock to be on > the very same node (to take the advantage of OS write-back logic). > That's why the locks are cached on the very same node. It will not get > removed unless necessary. > What would be better to build the lock caching on top of the existing > inode cache logic - since these are the objects that the cluster locks > are created for in the first place. hmm, I suppose that makes sense. Are there dentries associated with these locks? > >If you want to put an upper bound on the number of in-core locks, why not > >string them on a list and throw away the old ones when the upper bound is > >reached? > > > > > Don't take me wrong. DLM *has* a tunable to set the max lock counts. We > do drop the locks but to drop the right locks, we need a little bit help > from VFS layer. Latency requirement is difficult to manage. > > >Did you look at improving that lock-lookup algorithm, btw? Core kernel has > >no problem maintaining millions of cached VFS objects - is there any reason > >why your lock lookup cannot be similarly efficient? > > > > > Don't be so confident. 
> I did see some complaints from ext3-based mail
> servers in the past - when the storage size was large enough, people had
> to explicitly umount the filesystem from time to time to rescue their
> performance. I don't recall the details at this moment though.

People have had plenty of problems with oversized inode caches in the
past, but I think they were due to memory consumption, not lookup
inefficiency.

My question _still_ remains unanswered. Third time: is it possible to
speed up this lock-lookup code? Perhaps others can take a look at it -
where is it?

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-12-03 20:47 ` Andrew Morton
@ 2006-12-04  5:57 ` Wendy Cheng
  2006-12-04  6:28 ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Wendy Cheng @ 2006-12-04 5:57 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel

Andrew Morton wrote:

>On Sun, 03 Dec 2006 12:49:42 -0500
>Wendy Cheng <wcheng@redhat.com> wrote:
>
>>I read this as "It is ok to give system admin(s) commands (that this
>>"drop_pagecache_sb() call" is all about) to drop page cache. It is,
>>however, not ok to give filesystem developer(s) this very same function
>>to trim their own page cache if the filesystems choose to do so" ?
>
>If you're referring to /proc/sys/vm/drop_pagecache then no, that isn't for
>administrators - it's a convenience thing for developers, to get repeatable
>benchmarks. Attempts to make it a per-numa-node control for admin purposes
>have been rejected.

Just saw you suggested the "next door" LKML thread ("la la la la ...
swappiness") use "-o sync"? Well, now I do see you're determined ...
Anyway, I think I got my stuff working without this kernel patch ...
still under testing though. The rename post will be done first thing
tomorrow morning.

>>[snip] .......
>
>hmm, I suppose that makes sense.
>
>Are there dentries associated with these locks?

Yes, dentries are part of the logic (during lookup time) but the
book-keeping (reference count, reclaim, delete, etc.) is all done through
the inode structures.

>>>Did you look at improving that lock-lookup algorithm, btw? Core kernel has
>>>no problem maintaining millions of cached VFS objects - is there any reason
>>>why your lock lookup cannot be similarly efficient?

Yes, I just found that the new DLM uses the "jhash" call
(include/linux/jhash.h). I'm on an older version of DLM that uses the
FNV hash algorithm (http://www.isthe.com/chongo/tech/comp/fnv/). Will do
some performance test runs to compare these two methods.
A final note on this subject - I may not agree with you (about various
things) but your comments and amazingly quick responses are very much
appreciated!

-- Wendy

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] prune_icache_sb
  2006-12-04  5:57 ` Wendy Cheng
@ 2006-12-04  6:28 ` Andrew Morton
  2006-12-04 16:41 ` [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-12-04 6:28 UTC (permalink / raw)
To: wcheng; +Cc: linux-kernel, linux-fsdevel

On Mon, 04 Dec 2006 00:57:50 -0500 Wendy Cheng <wcheng@redhat.com> wrote:

> >>>Did you look at improving that lock-lookup algorithm, btw? Core kernel has
> >>>no problem maintaining millions of cached VFS objects - is there any reason
> >>>why your lock lookup cannot be similarly efficient?
>
> Yes, just found the new DLM uses "jhash" call (include/linux/jhash.h).
> I'm on an older version of DLM that uses FNV hash algorithm
> (http://www.isthe.com/chongo/tech/comp/fnv/). Will do some performance
> test runs to compare these two methods.

I'd be surprised if the choice of hash algorithm itself makes much
difference. But we can't say much about it unless we can see the code
(ie: your code).

Is it a simple old hash-to-find-the-bucket-then-walk-a-list
implementation? If so, what does the bucket count distribution look like?
What is the average walk length? Does it use a single lock, or hashed
locking, or a lock-per-bucket? etc.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04  6:28 ` Andrew Morton
@ 2006-12-04 16:41 ` Eric Dumazet
  2006-12-04 16:55 ` Christoph Lameter
  2006-12-05 14:42 ` Pavel Machek
  0 siblings, 2 replies; 25+ messages in thread
From: Eric Dumazet @ 2006-12-04 16:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: David Miller, linux-kernel, Christoph Lameter

[-- Attachment #1: Type: text/plain, Size: 1259 bytes --]

When some objects are allocated by one CPU but freed by another CPU we can
consume a lot of cycles doing divides in obj_to_index().

(Typical load on a dual processor machine where network interrupts are
handled by one particular CPU (allocating skbufs), and the other CPU is
running the application (consuming and freeing skbufs).)

Here on one production server (dual-core AMD Opteron 285), I noticed this
divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opterons are
quite modern CPUs, and the divide is much more expensive on older
architectures: on a 200 MHz sparcv9 machine, the division takes 64 cycles
instead of 1 cycle for a multiply.

Doing some math, we can use a reciprocal multiplication instead of a
divide.

If we want to compute V = (A / B) (A and B being u32 quantities) we can
instead use:

	V = ((u64)A * RECIPROCAL(B)) >> 32 ;

where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Note: I wrote pure C code for clarity.
gcc output for i386 is not optimal but acceptable:

	mull   0x14(%ebx)
	mov    %edx,%eax      // part of the >> 32
	xor    %edx,%edx      // useless
	mov    %eax,(%esp)    // could be avoided
	mov    %edx,0x4(%esp) // useless
	mov    (%esp),%ebx

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: slab_avoid_divides.patch --]
[-- Type: text/plain, Size: 2164 bytes --]

--- linux-2.6.19/mm/slab.c	2006-12-04 11:50:19.000000000 +0100
+++ linux-2.6.19-ed/mm/slab.c	2006-12-04 17:25:02.000000000 +0100
@@ -371,6 +371,19 @@ static void kmem_list3_init(struct kmem_
 	} while (0)
 
 /*
+ * Define the reciprocal value of B so that
+ * ((u32)A / (u32)B) can be replaced by :
+ * (((u64)A * RECIPROCAL_VALUE(B)) >> 32)
+ * If RECIPROCAL_VALUE(B) is precalculated, we change a divide by a multiply
+ */
+static inline u32 reciprocal_value(unsigned int k)
+{
+	u64 val = (1LL << 32) + (k - 1);
+	do_div(val, k);
+	return (u32)val;
+}
+
+/*
  * struct kmem_cache
  *
  * manages a cache.
@@ -385,6 +398,7 @@ struct kmem_cache {
 	unsigned int shared;
 
 	unsigned int buffer_size;
+	unsigned int reciprocal_buffer_size;
 /* 3) touched by every alloc & free from the backend */
 
 	struct kmem_list3 *nodelists[MAX_NUMNODES];
@@ -626,10 +640,17 @@ static inline void *index_to_obj(struct 
 	return slab->s_mem + cache->buffer_size * idx;
 }
 
+/*
+ * We want to avoid an expensive divide : (offset / cache->buffer_size)
+ * Using the fact that buffer_size is a constant for a particular cache,
+ * we can replace (offset / cache->buffer_size) by
+ * ((u64)offset * cache->reciprocal_buffer_size) >> 32
+ */
 static inline unsigned int obj_to_index(struct kmem_cache *cache,
 					struct slab *slab, void *obj)
 {
-	return (unsigned)(obj - slab->s_mem) / cache->buffer_size;
+	unsigned int offset = (obj - slab->s_mem);
+	return (u32)(((u64)offset * cache->reciprocal_buffer_size) >> 32);
 }
 
 /*
@@ -1400,6 +1421,8 @@ void __init kmem_cache_init(void)
 
 	cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
 					cache_line_size());
+	cache_cache.reciprocal_buffer_size =
+		reciprocal_value(cache_cache.buffer_size);
 
 	for (order = 0; order < MAX_ORDER; order++) {
 		cache_estimate(order, cache_cache.buffer_size,
@@ -2297,6 +2320,7 @@ kmem_cache_create (const char *name, siz
 	if (flags & SLAB_CACHE_DMA)
 		cachep->gfpflags |= GFP_DMA;
 	cachep->buffer_size = size;
+	cachep->reciprocal_buffer_size = reciprocal_value(size);
 
 	if (flags & CFLGS_OFF_SLAB) {
 		cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() 2006-12-04 16:41 ` [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() Eric Dumazet @ 2006-12-04 16:55 ` Christoph Lameter 2006-12-04 18:18 ` Eric Dumazet 2006-12-05 14:42 ` Pavel Machek 1 sibling, 1 reply; 25+ messages in thread From: Christoph Lameter @ 2006-12-04 16:55 UTC (permalink / raw) To: Eric Dumazet; +Cc: Andrew Morton, David Miller, linux-kernel On Mon, 4 Dec 2006, Eric Dumazet wrote: > Doing some math, we can use a reciprocal multiplication instead of a divide. Could you generalize the reciprocal thingy so that the division can be used from other parts of the kernel as well? It would be useful to separately get some cycle counts on a regular division compared with your division. If that shows benefit then we may think about using it in the kernel. I am a bit surprised that this is still an issue on modern cpus. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 16:55 ` Christoph Lameter
@ 2006-12-04 18:18 ` Eric Dumazet
  2006-12-04 19:49 ` Andrew Morton
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2006-12-04 18:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andrew Morton, David Miller, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2255 bytes --]

On Monday 04 December 2006 17:55, Christoph Lameter wrote:
> Could you generalize the reciprocal thingy so that the division
> can be used from other parts of the kernel as well? It would be useful to
> separately get some cycle counts on a regular division compared with your
> division. If that shows benefit then we may think about using it in the
> kernel. I am a bit surprised that this is still an issue on modern cpus.

OK, I added a new include file; I am not sure it is the best way...

Well, AFAIK this particular divide is the only one that hurts performance
on my machines. Do you have in mind another spot in the kernel where we
could use this reciprocal divide as well?

Yes, divide complexity is still an issue with modern CPUs. Elapsed time
for 10^9 loops on a 1.6 GHz Pentium M:

	24 s  for the version using divides
	3.8 s for the version using multiplies

[PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

When some objects are allocated by one CPU but freed by another CPU we can
consume a lot of cycles doing divides in obj_to_index().

(Typical load on a dual processor machine where network interrupts are
handled by one particular CPU (allocating skbufs), and the other CPU is
running the application (consuming and freeing skbufs).)

Here on one production server (dual-core AMD Opteron 285), I noticed this
divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opterons are
quite modern CPUs, and the divide is much more expensive on older
architectures: on a 200 MHz sparcv9 machine, the division takes 64 cycles
instead of 1 cycle for a multiply.
Doing some math, we can use a reciprocal multiplication instead of a
divide.

If we want to compute V = (A / B) (A and B being u32 quantities) we can
instead use:

	V = ((u64)A * RECIPROCAL(B)) >> 32 ;

where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Note: I wrote pure C code for clarity.

gcc output for i386 is not optimal but acceptable:

	mull   0x14(%ebx)
	mov    %edx,%eax      // part of the >> 32
	xor    %edx,%edx      // useless
	mov    %eax,(%esp)    // could be avoided
	mov    %edx,0x4(%esp) // useless
	mov    (%esp),%ebx

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: slab_avoid_divides.patch --]
[-- Type: text/plain, Size: 3013 bytes --]

--- linux-2.6.19/include/linux/reciprocal_div.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.19-ed/include/linux/reciprocal_div.h	2006-12-04 19:01:44.000000000 +0100
@@ -0,0 +1,30 @@
+#ifndef _LINUX_RECIPROCAL_DIV_H
+#define _LINUX_RECIPROCAL_DIV_H
+
+/*
+ * Define the reciprocal value of B so that
+ * ((u32)A / (u32)B) can be replaced by :
+ * (((u64)A * RECIPROCAL_VALUE(B)) >> 32)
+ * If RECIPROCAL_VALUE(B) is precalculated, we change a divide by a multiply
+ */
+#define RECIPROCAL_VALUE(B) (u32)(((1LL << 32) + ((B) - 1))/ (B))
+
+static inline u32 reciprocal_value(unsigned int k)
+{
+	if (__builtin_constant_p(k))
+		return RECIPROCAL_VALUE(k);
+	else {
+		u64 val = (1LL << 32) + (k - 1);
+		do_div(val, k);
+		return (u32)val;
+	}
+}
+
+/*
+ * We want to avoid an expensive divide : (A / B)
+ * If B is known in advance, its reciprocal R can be precalculated/stored.
+ * then (A / B) = (u32)(((u64)(A) * (R)) >> 32)
+ */
+#define RECIPROCAL_DIVIDE(A, R) (u32)(((u64)(A) * (R)) >> 32)
+
+#endif
--- linux-2.6.19/mm/slab.c	2006-12-04 11:50:19.000000000 +0100
+++ linux-2.6.19-ed/mm/slab.c	2006-12-04 19:03:42.000000000 +0100
@@ -107,6 +107,7 @@
 #include	<linux/mempolicy.h>
 #include	<linux/mutex.h>
 #include	<linux/rtmutex.h>
+#include	<linux/reciprocal_div.h>
 
 #include	<asm/uaccess.h>
 #include	<asm/cacheflush.h>
@@ -385,6 +386,7 @@ struct kmem_cache {
 	unsigned int shared;
 
 	unsigned int buffer_size;
+	unsigned int reciprocal_buffer_size;
 /* 3) touched by every alloc & free from the backend */
 
 	struct kmem_list3 *nodelists[MAX_NUMNODES];
@@ -626,10 +628,17 @@ static inline void *index_to_obj(struct 
 	return slab->s_mem + cache->buffer_size * idx;
 }
 
-static inline unsigned int obj_to_index(struct kmem_cache *cache,
-					struct slab *slab, void *obj)
+/*
+ * We want to avoid an expensive divide : (offset / cache->buffer_size)
+ * Using the fact that buffer_size is a constant for a particular cache,
+ * we can replace (offset / cache->buffer_size) by
+ * RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size)
+ */
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+					const struct slab *slab, void *obj)
 {
-	return (unsigned)(obj - slab->s_mem) / cache->buffer_size;
+	unsigned int offset = (obj - slab->s_mem);
+	return RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size);
 }
 
 /*
@@ -1400,6 +1409,8 @@ void __init kmem_cache_init(void)
 
 	cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
 					cache_line_size());
+	cache_cache.reciprocal_buffer_size =
+		reciprocal_value(cache_cache.buffer_size);
 
 	for (order = 0; order < MAX_ORDER; order++) {
 		cache_estimate(order, cache_cache.buffer_size,
@@ -2297,6 +2308,7 @@ kmem_cache_create (const char *name, siz
 	if (flags & SLAB_CACHE_DMA)
 		cachep->gfpflags |= GFP_DMA;
 	cachep->buffer_size = size;
+	cachep->reciprocal_buffer_size = reciprocal_value(size);
 
 	if (flags & CFLGS_OFF_SLAB) {
 		cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() 2006-12-04 18:18 ` Eric Dumazet @ 2006-12-04 19:49 ` Andrew Morton 2006-12-04 19:55 ` Christoph Lameter 2006-12-04 21:34 ` Eric Dumazet 0 siblings, 2 replies; 25+ messages in thread From: Andrew Morton @ 2006-12-04 19:49 UTC (permalink / raw) To: Eric Dumazet; +Cc: Christoph Lameter, David Miller, linux-kernel On Mon, 4 Dec 2006 19:18:29 +0100 Eric Dumazet <dada1@cosmosbay.com> wrote: > On Monday 04 December 2006 17:55, Christoph Lameter wrote: > > Could you generalize the reciprocal thingy so that the division > > can be used from other parts of the kernel as well? It would be useful to > > separately get some cycle counts on a regular division compared with your > > division. If that shows benefit then we may think about using it in the > > kernel. I am a bit surprised that this is still an issue on modern cpus. > > OK I added a new include file, I am not sure it is the best way... > > Well, AFAIK this particular divide is the only one that hurts performance on > my machines. > > Do you have in mind another spot in kernel where we could use this reciprocal > divide as well ? > > Yes divide complexity is still an issue with modern CPUS : > elapsed time for 10^9 loops on Pentium M 1.6 Ghz > 24 s for the version using divides > 3.8 s for the version using multiplies > > [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() > > When some objects are allocated by one CPU but freed by another CPU we can > consume lot of cycles doing divides in obj_to_index(). > > (Typical load on a dual processor machine where network interrupts are handled > by one particular CPU (allocating skbufs), and the other CPU is running the > application (consuming and freeing skbufs)) > > Here on one production server (dual-core AMD Opteron 285), I noticed this > divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. 
> But Opteron are
> quite modern cpus and the divide is much more expensive on oldest
> architectures :

Yes, I've seen that divide hurting too. I suspect it was with unusual
everything-in-cache workloads, but whatever.

> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
> for a multiply.
>
> Doing some math, we can use a reciprocal multiplication instead of a divide.
>
> If we want to compute V = (A / B) (A and B being u32 quantities)
> we can instead use :
>
> V = ((u64)A * RECIPROCAL(B)) >> 32 ;
>
> where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B
>
> Note :
>
> I wrote pure C code for clarity. gcc output for i386 is not optimal but
> acceptable :
>
>	mull   0x14(%ebx)
>	mov    %edx,%eax      // part of the >> 32
>	xor    %edx,%edx      // useless
>	mov    %eax,(%esp)    // could be avoided
>	mov    %edx,0x4(%esp) // useless
>	mov    (%esp),%ebx
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

> --- linux-2.6.19/include/linux/reciprocal_div.h	1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.19-ed/include/linux/reciprocal_div.h	2006-12-04 19:01:44.000000000 +0100
> @@ -0,0 +1,30 @@
> +#ifndef _LINUX_RECIPROCAL_DIV_H
> +#define _LINUX_RECIPROCAL_DIV_H
> +
> +/*
> + * Define the reciprocal value of B so that
> + * ((u32)A / (u32)B) can be replaced by :
> + * (((u64)A * RECIPROCAL_VALUE(B)) >> 32)
> + * If RECIPROCAL_VALUE(B) is precalculated, we change a divide by a multiply

"to a multiply".

> + */
> +#define RECIPROCAL_VALUE(B) (u32)(((1LL << 32) + ((B) - 1))/ (B))

Does this have to be a macro? I worry that people might try to throw
random types at this code and it'll fail. Are you prepared to support s8,
u8, s16, u16, s32, u32, s64 and u64? I think not, so perhaps we should be
documenting what we _do_ accept, and adding typecheck() calls in there
somewhere to enforce that. (In which case it would need to be a macro).
> +static inline u32 reciprocal_value(unsigned int k) > +{ > + if (__builtin_constant_p(k)) > + return RECIPROCAL_VALUE(k); > + else { > + u64 val = (1LL << 32) + (k - 1); > + do_div(val, k); > + return (u32)val; > + } > +} We should clearly document that this function is for once-off setup operations - we'd hate for people to call this with any frequency. It should be uninlined if poss, too. > +/* > + * We want to avoid an expensive divide : (A / B) > + * If B is known in advance, its reciprocal R can be precalculated/stored. > + * then (A / B) = (u32)(((u64)(A) * (R)) >> 32) > + */ > +#define RECIPROCAL_DIVIDE(A, R) (u32)(((u64)(A) * (R)) >> 32) And again, depending upon our decision regarding what types this code will support, this perhaps should become an inlined C function. > +#endif > --- linux-2.6.19/mm/slab.c 2006-12-04 11:50:19.000000000 +0100 > +++ linux-2.6.19-ed/mm/slab.c 2006-12-04 19:03:42.000000000 +0100 > @@ -107,6 +107,7 @@ > #include <linux/mempolicy.h> > #include <linux/mutex.h> > #include <linux/rtmutex.h> > +#include <linux/reciprocal_div.h> > > #include <asm/uaccess.h> > #include <asm/cacheflush.h> > @@ -385,6 +386,7 @@ struct kmem_cache { > unsigned int shared; > > unsigned int buffer_size; > + unsigned int reciprocal_buffer_size; If we decide to support only u32 for this operation, this should become u32, for clarity. 
> /* 3) touched by every alloc & free from the backend */ > struct kmem_list3 *nodelists[MAX_NUMNODES]; > > @@ -626,10 +628,17 @@ static inline void *index_to_obj(struct > return slab->s_mem + cache->buffer_size * idx; > } > > -static inline unsigned int obj_to_index(struct kmem_cache *cache, > - struct slab *slab, void *obj) > +/* > + * We want to avoid an expensive divide : (offset / cache->buffer_size) > + * Using the fact that buffer_size is a constant for a particular cache, > + * we can replace (offset / cache->buffer_size) by > + * RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size) > + */ > +static inline unsigned int obj_to_index(const struct kmem_cache *cache, > + const struct slab *slab, void *obj) > { > - return (unsigned)(obj - slab->s_mem) / cache->buffer_size; > + unsigned int offset = (obj - slab->s_mem); > + return RECIPROCAL_DIVIDE(offset, cache->reciprocal_buffer_size); > } > > /* > @@ -1400,6 +1409,8 @@ void __init kmem_cache_init(void) > > cache_cache.buffer_size = ALIGN(cache_cache.buffer_size, > cache_line_size()); > + cache_cache.reciprocal_buffer_size = > + reciprocal_value(cache_cache.buffer_size); > > for (order = 0; order < MAX_ORDER; order++) { > cache_estimate(order, cache_cache.buffer_size, > @@ -2297,6 +2308,7 @@ kmem_cache_create (const char *name, siz > if (flags & SLAB_CACHE_DMA) > cachep->gfpflags |= GFP_DMA; > cachep->buffer_size = size; > + cachep->reciprocal_buffer_size = reciprocal_value(size); > > if (flags & CFLGS_OFF_SLAB) { > cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u); > > > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 19:49 ` Andrew Morton
@ 2006-12-04 19:55 ` Christoph Lameter
  2006-12-04 21:34 ` Eric Dumazet
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Lameter @ 2006-12-04 19:55 UTC (permalink / raw)
To: Andrew Morton; +Cc: Eric Dumazet, David Miller, linux-kernel

This is similar stuff to asm-generic/div64.h, right? The divide overhead
depends on the platform? Maybe it would be better to place it in
asm-generic/div.h and then have platform specific functions?
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 19:49 ` Andrew Morton
  2006-12-04 19:55 ` Christoph Lameter
@ 2006-12-04 21:34 ` Eric Dumazet
  2006-12-04 21:56 ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2006-12-04 21:34 UTC (permalink / raw)
To: Andrew Morton; +Cc: Christoph Lameter, David Miller, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1585 bytes --]

Thank you Andrew for your comments, here is a new version of the patch
that should take them into account.

Maybe it should be split into two different patches ? One introducing
include/linux/reciprocal_div.h and lib/reciprocal_div.c, one using them
in slab ?

[PATCH] SLAB : use a multiply instead of a divide in obj_to_index()

When some objects are allocated by one CPU but freed by another CPU we
can consume a lot of cycles doing divides in obj_to_index().

(Typical load on a dual processor machine where network interrupts are
handled by one particular CPU (allocating skbufs), and the other CPU is
running the application (consuming and freeing skbufs))

Here on one production server (dual-core AMD Opteron 285), I noticed
this divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But
Opterons are quite modern CPUs and the divide is much more expensive on
older architectures:

On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of
1 cycle for a multiply.

Doing some math, we can use a reciprocal multiplication instead of a
divide.

If we want to compute V = (A / B)  (A and B being u32 quantities)
we can instead use :

V = ((u64)A * RECIPROCAL(B)) >> 32 ;

where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Note : I wrote pure C code for clarity.
gcc output for i386 is not optimal but acceptable :

	mull   0x14(%ebx)
	mov    %edx,%eax	// part of the >> 32
	xor    %edx,%edx	// useless
	mov    %eax,(%esp)	// could be avoided
	mov    %edx,0x4(%esp)	// useless
	mov    (%esp),%ebx

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: reciprocal_division.patch --]
[-- Type: text/plain, Size: 3834 bytes --]

--- linux-2.6.19/include/linux/reciprocal_div.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.19-ed/include/linux/reciprocal_div.h	2006-12-04 23:12:34.000000000 +0100
@@ -0,0 +1,30 @@
+#ifndef _LINUX_RECIPROCAL_DIV_H
+#define _LINUX_RECIPROCAL_DIV_H
+
+/*
+ * This file describes reciprocal division.
+ *
+ * This optimizes the (A/B) problem, when A and B are two u32
+ * and B is a known value (but not known at compile time)
+ *
+ * The math principle used is :
+ * Let R = RECIPROCAL_VALUE(B) be (((1LL << 32) + (B - 1)) / B)
+ * Then A / B = (u32)(((u64)(A) * (R)) >> 32)
+ *
+ * This replaces a divide by a multiply (and a shift), and
+ * is generally less expensive in CPU cycles.
+ */
+
+/*
+ * Computes the reciprocal value (R) for the value B of the divisor.
+ * Should not be called before each reciprocal_divide(),
+ * or else the performance is slower than a normal divide.
+ */
+extern u32 reciprocal_value(u32 B);
+
+
+static inline u32 reciprocal_divide(u32 A, u32 R)
+{
+	return (u32)(((u64)A * R) >> 32);
+}
+#endif
--- linux-2.6.19/lib/reciprocal_div.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.19-ed/lib/reciprocal_div.c	2006-12-04 23:18:54.000000000 +0100
@@ -0,0 +1,10 @@
+#include <linux/types.h>
+#include <asm/div64.h>
+#include <linux/reciprocal_div.h>
+
+u32 reciprocal_value(u32 k)
+{
+	u64 val = (1LL << 32) + (k - 1);
+	do_div(val, k);
+	return (u32)val;
+}
--- linux-2.6.19/lib/Makefile	2006-12-04 23:12:30.000000000 +0100
+++ linux-2.6.19-ed/lib/Makefile	2006-12-04 23:15:14.000000000 +0100
@@ -5,7 +5,7 @@
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
	 bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \
	 idr.o div64.o int_sqrt.o bitmap.o extable.o prio_tree.o \
-	 sha1.o irq_regs.o
+	 sha1.o irq_regs.o reciprocal_div.o

 lib-$(CONFIG_MMU)	+= ioremap.o
 lib-$(CONFIG_SMP)	+= cpumask.o
--- linux-2.6.19/mm/slab.c	2006-12-04 22:51:42.000000000 +0100
+++ linux-2.6.19-ed/mm/slab.c	2006-12-04 23:13:42.000000000 +0100
@@ -107,6 +107,7 @@
 #include <linux/mempolicy.h>
 #include <linux/mutex.h>
 #include <linux/rtmutex.h>
+#include <linux/reciprocal_div.h>

 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -385,6 +386,7 @@ struct kmem_cache {
	unsigned int shared;

	unsigned int buffer_size;
+	u32 reciprocal_buffer_size;

	/* 3) touched by every alloc & free from the backend */
	struct kmem_list3 *nodelists[MAX_NUMNODES];

@@ -626,10 +628,17 @@ static inline void *index_to_obj(struct
	return slab->s_mem + cache->buffer_size * idx;
 }

-static inline unsigned int obj_to_index(struct kmem_cache *cache,
-					struct slab *slab, void *obj)
+/*
+ * We want to avoid an expensive divide : (offset / cache->buffer_size)
+ * Using the fact that buffer_size is a constant for a particular cache,
+ * we can replace (offset / cache->buffer_size) by
+ * reciprocal_divide(offset, cache->reciprocal_buffer_size)
+ */
+static inline unsigned int obj_to_index(const struct kmem_cache *cache,
+					const struct slab *slab, void *obj)
 {
-	return (unsigned)(obj - slab->s_mem) / cache->buffer_size;
+	unsigned int offset = (obj - slab->s_mem);
+	return reciprocal_divide(offset, cache->reciprocal_buffer_size);
 }

 /*
@@ -1400,6 +1409,8 @@ void __init kmem_cache_init(void)

	cache_cache.buffer_size = ALIGN(cache_cache.buffer_size,
					cache_line_size());
+	cache_cache.reciprocal_buffer_size =
+		reciprocal_value(cache_cache.buffer_size);

	for (order = 0; order < MAX_ORDER; order++) {
		cache_estimate(order, cache_cache.buffer_size,
@@ -2297,6 +2308,7 @@ kmem_cache_create (const char *name, siz
	if (flags & SLAB_CACHE_DMA)
		cachep->gfpflags |= GFP_DMA;
	cachep->buffer_size = size;
+	cachep->reciprocal_buffer_size = reciprocal_value(size);

	if (flags & CFLGS_OFF_SLAB) {
		cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 21:34 ` Eric Dumazet
@ 2006-12-04 21:56 ` David Miller
  2006-12-04 22:45 ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2006-12-04 21:56 UTC (permalink / raw)
To: dada1; +Cc: akpm, clameter, linux-kernel

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 04 Dec 2006 22:34:29 +0100

> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
> for a multiply.

For UltraSPARC I and II (which is what this 200mhz guy probably is),
it's 4 cycle latency for a multiply (32-bit or 64-bit) and 68 cycles
for a 64-bit divide (32-bit divide is 37 cycles).

UltraSPARC-III and IV are worse, 6 cycles for multiply and 40/71
cycles (32/64-bit) for integer divides.

Niagara is even worse :-) 11 cycle integer multiply and a 72 cycle
integer divide (regardless of 32-bit or 64-bit).

(more details in gcc/config/sparc/sparc.c:{ultrasparc,ultrasparc3,niagara}_cost).

So this change has tons of merit for sparc64 chips at least :-)

Also, the multiply can parallelize with other operations but it
seems that integer divide stalls the pipe for most of the duration
of the calculation.  So this makes the divide even worse.
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 21:56 ` David Miller
@ 2006-12-04 22:45 ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2006-12-04 22:45 UTC (permalink / raw)
To: David Miller; +Cc: akpm, clameter, linux-kernel

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Mon, 04 Dec 2006 22:34:29 +0100
>
>> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
>> for a multiply.
>
> For UltraSPARC I and II (which is what this 200mhz guy probably is),
> it's 4 cycle latency for a multiply (32-bit or 64-bit) and 68 cycles
> for a 64-bit divide (32-bit divide is 37 cycles).

I must have an Ultra-2 (running Solaris :( )

	for (ui = 0 ; ui < 100000000 ; ui++)
		val += reciprocal_divide(ui, reciproc);

100000cb0:	83 31 20 00	srl   %g4, 0, %g1
100000cb4:	82 48 40 12	mulx  %g1, %l2, %g1
100000cb8:	83 30 70 20	srlx  %g1, 0x20, %g1
100000cbc:	88 01 20 01	inc   %g4
100000cc0:	80 a1 00 05	cmp   %g4, %g5
100000cc4:	08 4f ff fb	bleu  %icc, 100000cb0
100000cc8:	b0 06 00 01	add   %i0, %g1, %i0

I confirm that this block uses 20 cycles/iteration, while the next one
uses 72 cycles/iteration:

	for (ui = 0 ; ui < 100000000 ; ui++)
		val += ui / value;

100000ca8:	83 31 20 00	srl   %g4, 0, %g1
100000cac:	82 68 40 11	udivx %g1, %l1, %g1
100000cb0:	88 01 20 01	inc   %g4
100000cb4:	80 a1 00 05	cmp   %g4, %g5
100000cb8:	08 4f ff fc	bleu  %icc, 100000ca8
100000cbc:	b0 06 00 01	add   %i0, %g1, %i0

> UltraSPARC-III and IV are worse, 6 cycles for multiply and 40/71
> cycles (32/64-bit) for integer divides.
>
> Niagara is even worse :-) 11 cycle integer multiply and a 72 cycle
> integer divide (regardless of 32-bit or 64-bit).
>
> (more details in gcc/config/sparc/sparc.c:{ultrasparc,ultrasparc3,niagara}_cost).
> So this change has tons of merit for sparc64 chips at least :-)
>
> Also, the multiply can parallelize with other operations but it
> seems that integer divide stalls the pipe for most of the duration
> of the calculation.  So this makes the divide even worse.
* Re: [PATCH] SLAB : use a multiply instead of a divide in obj_to_index()
  2006-12-04 16:41 ` [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() Eric Dumazet
  2006-12-04 16:55 ` Christoph Lameter
@ 2006-12-05 14:42 ` Pavel Machek
  1 sibling, 0 replies; 25+ messages in thread
From: Pavel Machek @ 2006-12-05 14:42 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Andrew Morton, David Miller, linux-kernel, Christoph Lameter

Hi!

> When some objects are allocated by one CPU but freed by another CPU we can
> consume lot of cycles doing divides in obj_to_index().
>
> (Typical load on a dual processor machine where network interrupts are handled
> by one particular CPU (allocating skbufs), and the other CPU is running the
> application (consuming and freeing skbufs))
>
> Here on one production server (dual-core AMD Opteron 285), I noticed this
> divide took 1.20 % of CPU_CLK_UNHALTED events in kernel. But Opteron are
> quite modern cpus and the divide is much more expensive on oldest
> architectures :
>
> On a 200 MHz sparcv9 machine, the division takes 64 cycles instead of 1 cycle
> for a multiply.
>
> Doing some math, we can use a reciprocal multiplication instead of a divide.
>
> If we want to compute V = (A / B) (A and B being u32 quantities)
> we can instead use :
>
> V = ((u64)A * RECIPROCAL(B)) >> 32 ;
>
> where RECIPROCAL(B) is precalculated to ((1LL << 32) + (B - 1)) / B

Well, I guess it should be gcc doing this optimization, not us by
hand. And I believe gcc *is* smart enough to do it in some cases...
								pavel
--
Thanks for all the (sleeping) penguins.
* Re: [PATCH] prune_icache_sb
  2006-12-03 17:49 ` Wendy Cheng
  2006-12-03 20:47 ` Andrew Morton
@ 2006-12-04 16:51 ` Russell Cattelan
  2006-12-04 20:46 ` Wendy Cheng
  1 sibling, 1 reply; 25+ messages in thread
From: Russell Cattelan @ 2006-12-04 16:51 UTC (permalink / raw)
To: wcheng; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

Wendy Cheng wrote:

> Andrew Morton wrote:
>
>> On Thu, 30 Nov 2006 11:05:32 -0500
>> Wendy Cheng <wcheng@redhat.com> wrote:
>>
>>> The idea is, instead of unconditionally dropping every buffer
>>> associated with the particular mount point (that defeats the purpose
>>> of page caching), base kernel exports the "drop_pagecache_sb()" call
>>> that allows page cache to be trimmed. More importantly, it is
>>> changed to offer the choice of not randomly purging any buffer but
>>> the ones that seem to be unused (i_state is NULL and i_count is
>>> zero). This will encourage filesystem(s) to proactively respond to
>>> vm memory shortage if they choose to.
>>
>> argh.
>
> I read this as "It is ok to give system admin(s) commands (that this
> "drop_pagecache_sb() call" is all about) to drop page cache. It is,
> however, not ok to give filesystem developer(s) this very same
> function to trim their own page cache if the filesystems choose to do
> so" ?
>
>> In Linux a filesystem is a dumb layer which sits between the VFS and the
>> I/O layer and provides dumb services such as reading/writing inodes,
>> reading/writing directory entries, mapping pagecache offsets to disk
>> blocks, etc.  (This model is to varying degrees incorrect for every
>> post-ext2 filesystem, but that's the way it is).
>
> Linux kernel, particularly the VFS layer, is starting to show signs of
> inadequacy as the software components built upon it keep growing. I
> have doubts that it can keep up and handle this complexity with a
> development policy like you just described (filesystem is a dumb layer
> ?).
> Aren't these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a
> perfect example of why trying to do too many things inside the vfs layer
> for so many filesystems is a bad idea ? By the way, since we're on this
> subject, could we discuss a little bit about the vfs rename call (or I
> can start another new discussion thread) ?
>
> Note that linux do_rename() starts with the usual lookup logic,
> followed by "lock_rename", then a final round of dentry lookup, and
> finally comes to the filesystem's i_op->rename call. Since lock_rename()
> only calls for vfs layer locks that are local to this particular
> machine, for a cluster filesystem, there exists a huge window between
> the final lookup and the filesystem's i_op->rename calls such that the
> file could get deleted from another node before the fs can do anything
> about it. Is it possible that we could get a new function pointer
> (lock_rename) in the inode_operations structure so a cluster filesystem
> can do proper locking ?

It looks like the ocfs2 guys have a similar problem?

http://ftp.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/ocfs2_git_patches/ocfs2-upstream-linus-20060924/0009-PATCH-Allow-file-systems-to-manually-d_move-inside-of-rename.txt

Does this change help fix the gfs lock ordering problem as well?

-Russell Cattelan
cattelan@xfs.org
* Re: [PATCH] prune_icache_sb
  2006-12-04 16:51 ` [PATCH] prune_icache_sb Russell Cattelan
@ 2006-12-04 20:46 ` Wendy Cheng
  0 siblings, 0 replies; 25+ messages in thread
From: Wendy Cheng @ 2006-12-04 20:46 UTC (permalink / raw)
To: Russell Cattelan; +Cc: Andrew Morton, linux-kernel, linux-fsdevel

Russell Cattelan wrote:

> Wendy Cheng wrote:
>
>> Linux kernel, particularly the VFS layer, is starting to show signs
>> of inadequacy as the software components built upon it keep growing.
>> I have doubts that it can keep up and handle this complexity with a
>> development policy like you just described (filesystem is a dumb
>> layer ?). Aren't these DIO_xxx_LOCKING flags inside
>> __blockdev_direct_IO() a perfect example of why trying to do too many
>> things inside the vfs layer for so many filesystems is a bad idea ?
>> By the way, since we're on this subject, could we discuss a little
>> bit about the vfs rename call (or I can start another new discussion
>> thread) ?
>>
>> Note that linux do_rename() starts with the usual lookup logic,
>> followed by "lock_rename", then a final round of dentry lookup, and
>> finally comes to the filesystem's i_op->rename call. Since
>> lock_rename() only calls for vfs layer locks that are local to this
>> particular machine, for a cluster filesystem, there exists a huge
>> window between the final lookup and the filesystem's i_op->rename
>> calls such that the file could get deleted from another node before
>> the fs can do anything about it. Is it possible that we could get a
>> new function pointer (lock_rename) in the inode_operations structure
>> so a cluster filesystem can do proper locking ?
>
> It looks like the ocfs2 guys have a similar problem?
>
> http://ftp.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/ocfs2_git_patches/ocfs2-upstream-linus-20060924/0009-PATCH-Allow-file-systems-to-manually-d_move-inside-of-rename.txt

Thanks for the pointer.
Same as ocfs2, under the current VFS code, both GFS1/2 also need the
FS_ODD_RENAME flag for the rename problem - I have an ugly ~200 line
draft patch ready for GFS1 (and am looking into GFS2 at this moment).
The issue here is, for GFS, if vfs lock_rename() could call into the
filesystem, this complication would be greatly reduced. Will start
another thread to see whether the wish can be granted.

-- Wendy
end of thread, other threads:[~2006-12-05 15:12 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-22 21:35 [PATCH] prune_icache_sb Wendy Cheng
2006-11-22 23:36 ` Andrew Morton
2006-11-27 23:52 ` Wendy Cheng
2006-11-28  0:52 ` Andrew Morton
2006-11-28 21:41 ` Wendy Cheng
2006-11-29  0:21 ` Andrew Morton
2006-11-29  6:02 ` Wendy Cheng
2006-11-30 16:05 ` Wendy Cheng
2006-11-30 19:31 ` Nate Diller
2006-12-01 21:23 ` Andrew Morton
2006-12-03 17:49 ` Wendy Cheng
2006-12-03 20:47 ` Andrew Morton
2006-12-04  5:57 ` Wendy Cheng
2006-12-04  6:28 ` Andrew Morton
2006-12-04 16:41 ` [PATCH] SLAB : use a multiply instead of a divide in obj_to_index() Eric Dumazet
2006-12-04 16:55 ` Christoph Lameter
2006-12-04 18:18 ` Eric Dumazet
2006-12-04 19:49 ` Andrew Morton
2006-12-04 19:55 ` Christoph Lameter
2006-12-04 21:34 ` Eric Dumazet
2006-12-04 21:56 ` David Miller
2006-12-04 22:45 ` Eric Dumazet
2006-12-05 14:42 ` Pavel Machek
2006-12-04 16:51 ` [PATCH] prune_icache_sb Russell Cattelan
2006-12-04 20:46 ` Wendy Cheng