* [PATCH] fs: pipe/sockets/anon dentries should not have a parent
From: Eric Dumazet @ 2008-11-21 15:13 UTC
To: David Miller, mingo-X9Un+BFzKDI
Cc: cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rjw-KKrjLPT3xs0,
  linux-kernel-u79uwXL29TY76Z2rM5mHXA,
  kernel-testers-u79uwXL29TY76Z2rM5mHXA, efault-Mmb7MZpHnFY,
  a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Linux Netdev List

Eric Dumazet wrote:
> David Miller wrote:
>> From: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
>> Date: Fri, 21 Nov 2008 09:51:32 +0100
>>
>>> Now, I wish sockets and pipes not going through dcache, not tbench
>>> affair of course but real workloads...
>>>
>>> running 8 processes on a 8 way machine doing a
>>> for (;;)
>>>     close(socket(AF_INET, SOCK_STREAM, 0));
>>>
>>> is slow as hell, we hit so many contended cache lines ...
>>>
>>> ticket spin locks are slower in this case (dcache_lock for example
>>> is taken twice when we allocate a socket(), once in d_alloc(),
>>> another one in d_instantiate())
>>
>> As you of course know, this used to be a ton worse. At least now
>> these things are unhashed. :)
>
> Well, this is dust compared to what we currently have.
>
> To allocate a socket we :
> 0) Do the usual file manipulation (pretty scalable these days)
>    (but recent drop_file_write_access() and co slow down a bit)
> 1) allocate an inode with new_inode()
>    This function :
>     - locks inode_lock,
>     - dirties nr_inodes counter
>     - dirties inode_in_use list (for sockets, I doubt it is useful)
>     - dirties superblock s_inodes.
>     - dirties last_ino counter
>    All these are in different cache lines of course.
> 2) allocate a dentry
>    d_alloc() takes dcache_lock,
>    insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
>    dirties nr_dentry
> 3) d_instantiate() dentry (dcache_lock taken again)
> 4) init_file() -> atomic_inc on sock_mnt->refcount (in case we want to
>    umount this vfs ...)
>
> At close() time, we must undo the things. It's even more expensive because
> of the _atomic_dec_and_lock() that is stressed a lot, and because of two
> cache lines that are touched when an element is deleted from a list.
>
> for (i = 0; i < 1000*1000; i++)
>     close(socket(AF_INET, SOCK_STREAM, 0));
>
> Cost if run on one cpu :
>
> real 0m1.561s
> user 0m0.092s
> sys 0m1.469s
>
> If run on 8 CPUS :
>
> real 0m27.496s
> user 0m0.657s
> sys 3m39.092s
>

[PATCH] fs: pipe/sockets/anon dentries should not have a parent

Linking pipe/sockets/anon dentries to one root 'parent' has no functional
impact at all, but a scalability one.

We can avoid touching a cache line at allocation stage (inside d_alloc(),
no need to touch root->d_count), but also at freeing time (in d_kill,
decrementing d_count). We avoid an expensive atomic_dec_and_lock() call
on the root dentry.

If we correct dnotify_parent() and inotify_d_instantiate() to take into
account a NULL d_parent, we can call d_alloc() with a NULL parent instead
of root dentry.

Before patch, time to run 8 million close(socket()) calls on 8 CPUS was :

real 0m27.496s
user 0m0.657s
sys 3m39.092s

After patch :

real 0m23.997s
user 0m0.682s
sys 3m11.193s

Old oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %   symbol name
164257   164257        11.0245  11.0245  init_file
155488   319745        10.4359  21.4604  d_alloc
151887   471632        10.1942  31.6547  _atomic_dec_and_lock
91620    563252         6.1493  37.8039  inet_create
74245    637497         4.9831  42.7871  kmem_cache_alloc
46702    684199         3.1345  45.9216  dentry_iput
46186    730385         3.0999  49.0215  tcp_close
42824    773209         2.8742  51.8957  kmem_cache_free
37275    810484         2.5018  54.3975  wake_up_inode
36553    847037         2.4533  56.8508  tcp_v4_init_sock
35661    882698         2.3935  59.2443  inotify_d_instantiate
32998    915696         2.2147  61.4590  sysenter_past_esp
31442    947138         2.1103  63.5693  d_instantiate
31303    978441         2.1010  65.6703  generic_forget_inode
27533    1005974        1.8479  67.5183  vfs_dq_drop
24237    1030211        1.6267  69.1450  sock_attach_fd
19290    1049501        1.2947  70.4397  __copy_from_user_ll

New oprofile :
CPU: Core 2, speed 3000.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %   symbol name
147287   147287        10.3984  10.3984  new_inode
144884   292171        10.2287  20.6271  inet_create
93670    385841         6.6131  27.2402  init_file
89852    475693         6.3435  33.5837  wake_up_inode
80910    556603         5.7122  39.2959  kmem_cache_alloc
53588    610191         3.7833  43.0792  _atomic_dec_and_lock
44341    654532         3.1305  46.2096  generic_forget_inode
38710    693242         2.7329  48.9425  kmem_cache_free
37605    730847         2.6549  51.5974  tcp_v4_init_sock
37228    768075         2.6283  54.2257  d_alloc
34085    802160         2.4064  56.6321  tcp_close
32550    834710         2.2980  58.9301  sysenter_past_esp
25931    860641         1.8307  60.7608  vfs_dq_drop
24458    885099         1.7267  62.4875  d_kill
22015    907114         1.5542  64.0418  dentry_iput
18877    925991         1.3327  65.3745  __copy_from_user_ll
17873    943864         1.2618  66.6363  mwait_idle

Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
---
 fs/anon_inodes.c |    2 +-
 fs/dnotify.c     |    2 +-
 fs/inotify.c     |    2 +-
 fs/pipe.c        |    2 +-
 net/socket.c     |    2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

[-- Attachment #2: null_parent.patch --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..22cce87 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -92,7 +92,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc(NULL, &this);
 	if (!dentry)
 		goto err_put_unused_fd;
diff --git a/fs/dnotify.c b/fs/dnotify.c
index 676073b..66066a3 100644
--- a/fs/dnotify.c
+++ b/fs/dnotify.c
@@ -173,7 +173,7 @@ void dnotify_parent(struct dentry *dentry, unsigned long event)
 	spin_lock(&dentry->d_lock);
 	parent = dentry->d_parent;
-	if (parent->d_inode->i_dnotify_mask & event) {
+	if (parent && parent->d_inode->i_dnotify_mask & event) {
 		dget(parent);
 		spin_unlock(&dentry->d_lock);
 		__inode_dir_notify(parent->d_inode, event);
diff --git a/fs/inotify.c b/fs/inotify.c
index 7bbed1b..9f051bb 100644
--- a/fs/inotify.c
+++ b/fs/inotify.c
@@ -270,7 +270,7 @@ void inotify_d_instantiate(struct dentry *entry, struct inode *inode)
 	spin_lock(&entry->d_lock);
 	parent = entry->d_parent;
-	if (parent->d_inode && inotify_inode_watched(parent->d_inode))
+	if (parent && parent->d_inode && inotify_inode_watched(parent->d_inode))
 		entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
 	spin_unlock(&entry->d_lock);
 }
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4b961bc 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -926,7 +926,7 @@ struct file *create_write_pipe(int flags)
 		goto err;
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc(NULL, &name);
 	if (!dentry)
 		goto err_inode;
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b84de7d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -373,7 +373,7 @@ static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc(NULL, &name);
 	if (unlikely(!dentry))
 		return -ENOMEM;
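Editorial sketch: the close(socket()) loop quoted in the message above can be wrapped in a small multi-process driver for anyone wanting to repeat the measurement. This is only a minimal sketch under stated assumptions, not the program that produced the timings in this thread. The helper names socket_churn() and run_workers() are hypothetical, and the number of workers and iterations are parameters rather than the 8 processes and 1000*1000 iterations mentioned in the thread.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * One worker: open and immediately close a TCP socket `iterations` times.
 * Returns the number of socket()/close() pairs that succeeded.
 * (Hypothetical helper, not taken from the thread.)
 */
static long socket_churn(long iterations)
{
	long ok = 0;

	for (long i = 0; i < iterations; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			continue;	/* socket() failed; skip this iteration */
		if (close(fd) == 0)
			ok++;
	}
	return ok;
}

/*
 * Fork `nproc` workers that each run socket_churn(), then wait for them.
 * Returns the number of workers that exited with status 0.
 * (Hypothetical helper, not taken from the thread.)
 */
static int run_workers(int nproc, long iterations)
{
	int started = 0, clean = 0;

	assert(nproc >= 0 && iterations >= 0);
	for (int p = 0; p < nproc; p++) {
		pid_t pid = fork();

		if (pid == 0) {
			/* child: do the work and leave without returning */
			socket_churn(iterations);
			_exit(0);
		}
		if (pid > 0)
			started++;
	}
	while (started-- > 0) {
		int status;

		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			clean++;
	}
	return clean;
}
```

Calling run_workers(8, 1000000) from a main() and timing the binary with time(1) would correspond to the 8-CPU case, and run_workers(1, 1000000) to the single-CPU case. Absolute numbers will differ from those posted here, since they depend on the kernel version and hardware.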
* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent
From: Ingo Molnar @ 2008-11-21 15:21 UTC
To: Eric Dumazet
Cc: David Miller, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rjw-KKrjLPT3xs0,
  linux-kernel-u79uwXL29TY76Z2rM5mHXA,
  kernel-testers-u79uwXL29TY76Z2rM5mHXA, efault-Mmb7MZpHnFY,
  a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Linux Netdev List

* Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> wrote:

> Before patch, time to run 8 million close(socket()) calls on 8
> CPUS was :
>
> real 0m27.496s
> user 0m0.657s
> sys 3m39.092s
>
> After patch :
>
> real 0m23.997s
> user 0m0.682s
> sys 3m11.193s

cool :-)

What would it take to get it down to:

>> Cost if run on one cpu :
>>
>> real 0m1.561s
>> user 0m0.092s
>> sys 0m1.469s

i guess asking for a wall-clock cost of 1.561/8 would be too much? :)

	Ingo
* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent
From: Eric Dumazet @ 2008-11-21 15:28 UTC
To: Ingo Molnar
Cc: David Miller, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rjw-KKrjLPT3xs0,
  linux-kernel-u79uwXL29TY76Z2rM5mHXA,
  kernel-testers-u79uwXL29TY76Z2rM5mHXA, efault-Mmb7MZpHnFY,
  a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Linux Netdev List

Ingo Molnar wrote:
> * Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> wrote:
>
>> Before patch, time to run 8 million close(socket()) calls on 8
>> CPUS was :
>>
>> real 0m27.496s
>> user 0m0.657s
>> sys 3m39.092s
>>
>> After patch :
>>
>> real 0m23.997s
>> user 0m0.682s
>> sys 3m11.193s
>
> cool :-)
>
> What would it take to get it down to:
>
>>> Cost if run on one cpu :
>>>
>>> real 0m1.561s
>>> user 0m0.092s
>>> sys 0m1.469s
>
> i guess asking for a wall-clock cost of 1.561/8 would be too much? :)
>

It might be possible, depending on the level of hackery I am allowed
to inject in fs/dcache.c and fs/inode.c :)

wall cost of 1.56 (each cpu runs one loop of one million iterations)
* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent
From: Ingo Molnar @ 2008-11-21 15:34 UTC
To: Eric Dumazet
Cc: David Miller, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rjw-KKrjLPT3xs0,
  linux-kernel-u79uwXL29TY76Z2rM5mHXA,
  kernel-testers-u79uwXL29TY76Z2rM5mHXA, efault-Mmb7MZpHnFY,
  a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Linux Netdev List

* Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> wrote:

> Ingo Molnar wrote:
>> * Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> wrote:
>>
>>> Before patch, time to run 8 million close(socket()) calls on 8
>>> CPUS was :
>>>
>>> real 0m27.496s
>>> user 0m0.657s
>>> sys 3m39.092s
>>>
>>> After patch :
>>>
>>> real 0m23.997s
>>> user 0m0.682s
>>> sys 3m11.193s
>>
>> cool :-)
>>
>> What would it take to get it down to:
>>
>>>> Cost if run on one cpu :
>>>>
>>>> real 0m1.561s
>>>> user 0m0.092s
>>>> sys 0m1.469s
>>
>> i guess asking for a wall-clock cost of 1.561/8 would be too much? :)
>>
>
> It might be possible, depending on the level of hackery I am allowed
> to inject in fs/dcache.c and fs/inode.c :)

I think being able to open+close sockets in a scalable way is an
undisputed prime-time workload on Linux. The numbers you showed look
horrible.

Once you can show how much faster it could go via hacks, it should
only be a matter of time to achieve that safely and cleanly.

> wall cost of 1.56 (each cpu runs one loop of one million iterations)

(indeed.)

	Ingo
* [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
From: Eric Dumazet @ 2008-11-26 23:27 UTC
To: Ingo Molnar
Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers,
  Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter,
  Christoph Hellwig

Hi all

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(From 27.5 seconds to 1.6 seconds)

Long version :

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days,
   but would be faster if 'struct files' were using SLAB_DESTROY_BY_RCU
   and avoided the call_rcu() cache killer)

1) allocate an inode with new_inode()
   This function :
    - locks inode_lock,
    - dirties nr_inodes counter
    - dirties inode_in_use list (for sockets/pipes, this is useless)
    - dirties superblock s_inodes.
    - dirties last_ino counter
   All these are in different cache lines unfortunately.

2) allocate a dentry
   d_alloc() takes dcache_lock,
   insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
   dirties nr_dentry

3) d_instantiate() dentry (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount

At close() time, we must undo the things. It's even more expensive because
of the _atomic_dec_and_lock() that is stressed a lot, and because of two
cache lines that are touched when an element is deleted from a list
(previous and next items)

This is really bad, since sockets/pipes don't need to be visible in dcache
or an inode list per super block.

This patch series gets rid of all contended cache lines for sockets, pipes
and anonymous fd (signalfd, timerfd, ...)
Sample program :

for (i = 0; i < 1000000; i++)
    close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program :

real 1.561s
user 0.092s
sys 1.469s

Cost if 8 processes are launched on a 8 CPU machine
(benchmark named socket8) :

real 27.496s <<<< !!!! >>>>
user 0.657s
sys 3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %   symbol name
3347352  3347352       28.0232  28.0232  _atomic_dec_and_lock
3301428  6648780       27.6388  55.6620  d_instantiate
2971130  9619910       24.8736  80.5355  d_alloc
241318   9861228        2.0203  82.5558  init_file
146190   10007418       1.2239  83.7797  __slab_free
144149   10151567       1.2068  84.9864  inotify_d_instantiate
143971   10295538       1.2053  86.1917  inet_create
137168   10432706       1.1483  87.3401  new_inode
117549   10550255       0.9841  88.3242  add_partial
110795   10661050       0.9275  89.2517  generic_drop_inode
107137   10768187       0.8969  90.1486  kmem_cache_alloc
94029    10862216       0.7872  90.9358  tcp_close
82837    10945053       0.6935  91.6293  dput
67486    11012539       0.5650  92.1943  dentry_iput
57751    11070290       0.4835  92.6778  iput
54327    11124617       0.4548  93.1326  tcp_v4_init_sock
49921    11174538       0.4179  93.5505  sysenter_past_esp
47616    11222154       0.3986  93.9491  kmem_cache_free
30792    11252946       0.2578  94.2069  clear_inode
27540    11280486       0.2306  94.4375  copy_from_user
26509    11306995       0.2219  94.6594  init_timer
26363    11333358       0.2207  94.8801  discard_slab
25284    11358642       0.2117  95.0918  __fput
22482    11381124       0.1882  95.2800  __percpu_counter_add
20369    11401493       0.1705  95.4505  sock_alloc
18501    11419994       0.1549  95.6054  inet_csk_destroy_sock
17923    11437917       0.1500  95.7555  sys_close

This patch series avoids all contended cache lines and makes this "bench"
pretty fast.
New cost if run on one cpu :

real 1.325s (instead of 1.561s)
user 0.091s
sys 1.234s

If run on 8 CPUS :

real 2.229s <<<< instead of 27.496s >>>
user 0.695s
sys 16.903s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 2999.74 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %   symbol name
143791   143791        11.7849  11.7849  __slab_free
128404   272195        10.5238  22.3087  add_partial
99150    371345         8.1262  30.4349  kmem_cache_alloc
52031    423376         4.2644  34.6993  sysenter_past_esp
47752    471128         3.9137  38.6130  kmem_cache_free
47429    518557         3.8872  42.5002  tcp_close
34376    552933         2.8174  45.3176  __percpu_counter_add
29046    581979         2.3806  47.6982  copy_from_user
28249    610228         2.3152  50.0134  init_timer
26220    636448         2.1490  52.1624  __slab_alloc
23402    659850         1.9180  54.0803  discard_slab
20560    680410         1.6851  55.7654  __call_rcu
18288    698698         1.4989  57.2643  d_alloc
16425    715123         1.3462  58.6104  get_empty_filp
16237    731360         1.3308  59.9412  __fput
15729    747089         1.2891  61.2303  alloc_fd
15021    762110         1.2311  62.4614  alloc_inode
14690    776800         1.2040  63.6654  sys_close
14666    791466         1.2020  64.8674  inet_create
13638    805104         1.1178  65.9852  dput
12503    817607         1.0247  67.0099  iput_special
12231    829838         1.0024  68.0123  lock_sock_nested
12210    842048         1.0007  69.0130  fd_install
12137    854185         0.9947  70.0078  d_alloc_special
12058    866243         0.9883  70.9960  sock_init_data
11200    877443         0.9179  71.9140  release_sock
11114    888557         0.9109  72.8248  inotify_d_instantiate

The last point is about SLUB being hit hard, unless we
use slub_min_order=3 at boot, or we use Christoph Lameter
patch (struct file RCU optimizations)
http://thread.gmane.org/gmane.linux.kernel/418615

If we boot machine with slub_min_order=3, SLUB overhead disappears.
New cost if run on one cpu :

real 1.307s
user 0.094s
sys 1.214s

If run on 8 CPUS :

real 1.625s <<<< instead of 27.496s or 2.229s >>>
user 0.771s
sys 12.061s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.05 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %   symbol name
108005   108005        11.0758  11.0758  kmem_cache_alloc
52023    160028         5.3349  16.4107  sysenter_past_esp
47363    207391         4.8570  21.2678  tcp_close
45430    252821         4.6588  25.9266  kmem_cache_free
36566    289387         3.7498  29.6764  __percpu_counter_add
36085    325472         3.7005  33.3769  __slab_free
29185    354657         2.9929  36.3698  copy_from_user
28210    382867         2.8929  39.2627  init_timer
25663    408530         2.6317  41.8944  d_alloc_special
22360    430890         2.2930  44.1874  cap_file_alloc_security
19237    450127         1.9727  46.1601  __call_rcu
19097    469224         1.9584  48.1185  d_alloc
16962    486186         1.7394  49.8580  alloc_fd
16315    502501         1.6731  51.5311  __fput
16102    518603         1.6512  53.1823  get_empty_filp
14954    533557         1.5335  54.7158  inet_create
14468    548025         1.4837  56.1995  alloc_inode
14198    562223         1.4560  57.6555  sys_close
13905    576128         1.4259  59.0814  dput
12262    588390         1.2575  60.3389  lock_sock_nested
12203    600593         1.2514  61.5903  sock_attach_fd
12147    612740         1.2457  62.8360  iput_special
12049    624789         1.2356  64.0716  fd_install
12033    636822         1.2340  65.3056  sock_init_data
11999    648821         1.2305  66.5361  release_sock
11231    660052         1.1517  67.6878  inotify_d_instantiate
11068    671120         1.1350  68.8228  inet_csk_destroy_sock

This patch series contains 6 patches, against net-next-2.6 tree
(because this tree already contains network improvement on this
subject, but should apply on other trees)

[PATCH 1/6] fs: Introduce a per_cpu nr_dentry

Adding a per_cpu nr_dentry avoids cache line ping pongs between
cpus to maintain this metric.

We centralize decrements of nr_dentry in d_free(),
and increments in d_alloc().
d_alloc() can avoid taking dcache_lock if parent is NULL

[PATCH 2/6] fs: Introduce special dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into dcache,
since there is no way someone can lookup such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SPECIAL flag, to mark a dentry as
a special one (for sockets, pipes, anonymous fd), and a new
d_alloc_special(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_special() for
special dentries.

Differences between a special dentry and a normal one are :

1) Special dentry has the DCACHE_SPECIAL flag
2) A special dentry is its own parent.
   This avoids taking a reference on the 'root' dentry, shared
   by too many dentries.
3) They are not hashed into global hash table
4) Their d_alias list is empty

Internally, dput() can avoid an expensive atomic_dec_and_lock()
for special dentries.

(socket8 bench result : from 27.5s to 25.5s)

[PATCH 3/6] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get inode numbers.

Solve this problem by providing to each cpu a per_cpu variable,
fed by the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino.

Note : last_ino_get() method must be called with preemption disabled.
(socket8 bench result : 25.5s to 25s, almost no difference, but
this is because inode_lock cost is too heavy for the moment)

[PATCH 4/6] fs: Introduce a per_cpu nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes don't need inode_lock anymore.

(socket8 bench result : 25s to 20.5s)

[PATCH 5/6] fs: Introduce special inodes

Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
inodes allocation/freeing.

In new_inode(), we test if super block has MS_SPECIAL flag set.
If yes, we don't put inode in "inode_in_use" list nor "sb->s_inodes" list.
As inode_lock was taken only to protect these lists, we avoid it as well.

Using iput_special() from dput_special() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying of
three contended cache lines in new_inode(), and five cache lines
in iput()

Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we
really need a different flag.

(socket8 bench result : from 20.5s to 2.94s)

[PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
refcounting on permanent system vfs.
Use this function for sockets, pipes, anonymous fds.

(socket8 bench result : from 2.94s to 2.23s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
Overall diffstat :

 fs/anon_inodes.c       |   19 +-----
 fs/dcache.c            |  106 ++++++++++++++++++++++++++++++++-------
 fs/fs-writeback.c      |    2
 fs/inode.c             |  101 +++++++++++++++++++++++++++++++------
 fs/pipe.c              |   28 +---------
 fs/super.c             |    9 +++
 include/linux/dcache.h |    2
 include/linux/fs.h     |    8 ++
 include/linux/mount.h  |    5 +
 kernel/sysctl.c        |    6 +-
 mm/page-writeback.c    |    2
 net/socket.c           |   27 +--------
 12 files changed, 212 insertions(+), 103 deletions(-)
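Editorial sketch: two of the per-CPU ideas in the patch list above (the batched last_ino allocator of patch 3/6 and the per-CPU counters of patches 1/6 and 4/6) can be illustrated in userspace. This is a hedged approximation, not the code from the series. Thread-local variables stand in for the kernel's per_cpu variables, so the "preemption disabled" requirement has no equivalent here. All names ending in _sketch are hypothetical, COUNTER_BATCH is an arbitrary value, and the flush-on-threshold counter is modelled on the kernel's percpu_counter style, which is an assumption since the cover letter only says "per_cpu nr_dentry".

```c
#include <assert.h>
#include <stdatomic.h>

#define INO_BATCH	1024u	/* refill interval named in patch 3/6 */
#define COUNTER_BATCH	32	/* flush threshold; arbitrary for this sketch */

/* Shared state: the only cache lines that every thread may touch. */
static atomic_uint shared_last_ino;
static atomic_long nr_dentry_global;

/* Thread-local state, standing in for the kernel's per_cpu variables. */
static _Thread_local unsigned int local_last_ino;
static _Thread_local long nr_dentry_local;

/* Hand out inode numbers, reserving INO_BATCH of them at a time. */
static unsigned int last_ino_get_sketch(void)
{
	unsigned int res = local_last_ino;

	/* First call, or the local batch is used up: reserve a new one. */
	if ((res & (INO_BATCH - 1)) == 0)
		res = atomic_fetch_add(&shared_last_ino, INO_BATCH);
	local_last_ino = ++res;
	return res;
}

/* Add to the counter locally; fold into the shared total only rarely. */
static void nr_dentry_add_sketch(long amount)
{
	long v = nr_dentry_local + amount;

	if (v >= COUNTER_BATCH || v <= -COUNTER_BATCH) {
		atomic_fetch_add(&nr_dentry_global, v);
		v = 0;
	}
	nr_dentry_local = v;
}

/* Approximate read: ignores deltas still pending in the threads. */
static long nr_dentry_read_sketch(void)
{
	return atomic_load(&nr_dentry_global);
}
```

In both helpers the shared variable is written once per batch instead of once per call, which is the property the patch descriptions rely on to avoid cache line ping pongs. The price in this sketch is that the counter read is only approximate and that inode numbers from different threads are not globally ordered.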
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
From: Christoph Hellwig @ 2008-11-27 9:39 UTC
To: Eric Dumazet
Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
  kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List,
  Christoph Lameter, Christoph Hellwig

As I told you before, you absolutely must include the fsdevel list and
the VFS maintainer for a patchset like this.
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
From: Ingo Molnar @ 2008-11-28 18:03 UTC
To: Eric Dumazet
Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers,
  Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter,
  Christoph Hellwig

* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Hi all
>
> Short summary : Nice speedups for allocation/deallocation of sockets/pipes
> (From 27.5 seconds to 1.6 seconds)

Wow, that's incredibly impressive! :-)

	Ingo
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
From: Peter Zijlstra @ 2008-11-28 18:47 UTC
To: Ingo Molnar
Cc: Eric Dumazet, David Miller, Rafael J. Wysocki,
  linux-kernel-u79uwXL29TY76Z2rM5mHXA,
  kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith,
  Linux Netdev List, Christoph Lameter, Christoph Hellwig

On Fri, 2008-11-28 at 19:03 +0100, Ingo Molnar wrote:
> * Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> wrote:
>
> > Hi all
> >
> > Short summary : Nice speedups for allocation/deallocation of sockets/pipes
> > (From 27.5 seconds to 1.6 seconds)
>
> Wow, that's incredibly impressive! :-)

Yeah, we got a similar speedup on -rt by pushing those super-block files
list into per-cpu lists and doing crazy locking on them.

Of course avoiding them altogether, like done here, is a nicer option
but is sadly not a possibility for regular files (until hch gets around
to removing the need for the list).
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
From: Christoph Hellwig @ 2008-11-29 6:38 UTC
To: Peter Zijlstra
Cc: Ingo Molnar, Eric Dumazet, David Miller, Rafael J. Wysocki,
  linux-kernel-u79uwXL29TY76Z2rM5mHXA,
  kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith,
  Linux Netdev List, Christoph Lameter, Christoph Hellwig

On Fri, Nov 28, 2008 at 07:47:56PM +0100, Peter Zijlstra wrote:
> > Wow, that's incredibly impressive! :-)
>
> Yeah, we got a similar speedup on -rt by pushing those super-block files
> list into per-cpu lists and doing crazy locking on them.
>
> Of course avoiding them altogether, like done here, is a nicer option
> but is sadly not a possibility for regular files (until hch gets around
> to removing the need for the list).

We should have finished this long ago, thanks for the reminder.
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
From: Eric Dumazet @ 2008-11-29 8:07 UTC
To: Christoph Hellwig
Cc: Peter Zijlstra, Ingo Molnar, David Miller, Rafael J. Wysocki,
  linux-kernel-u79uwXL29TY76Z2rM5mHXA,
  kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith,
  Linux Netdev List, Christoph Lameter

Christoph Hellwig wrote:
> On Fri, Nov 28, 2008 at 07:47:56PM +0100, Peter Zijlstra wrote:
>>> Wow, that's incredibly impressive! :-)
>> Yeah, we got a similar speedup on -rt by pushing those super-block files
>> list into per-cpu lists and doing crazy locking on them.
>>
>> Of course avoiding them altogether, like done here, is a nicer option
>> but is sadly not a possibility for regular files (until hch gets around
>> to removing the need for the list).
>
> We should have finished this long ago, thanks for the reminder.
>

inode_in_use could be percpu, at least.

Or just zap it, since we never have to scan it.
* [PATCH v2 0/5] fs: Scalability of sockets/pipes allocation/deallocation on SMP
From: Eric Dumazet @ 2008-11-29 8:43 UTC
To: Ingo Molnar, Christoph Hellwig
Cc: David Miller, Rafael J. Wysocki, linux-kernel,
  kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith,
  Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel,
  Al Viro

Hi all

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(From 27.5 seconds to 2.9 seconds (2.3 seconds with SLUB tweaks))

Long version :

For this second version, I removed the mntput()/mntget() optimization
since most reviewers are not convinced it is useful.
This is a four-line patch that can be reconsidered later.

I chose the name SINGLE instead of SPECIAL to name
isolated dentries (for sockets, pipes, anonymous fd) that
have no parent and no relationship in the vfs.

Thanks all

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days,
   but would be faster if 'struct files' were using SLAB_DESTROY_BY_RCU
   and avoided the call_rcu() cache killer)

1) allocate an inode with new_inode()
   This function :
    - locks inode_lock,
    - dirties nr_inodes counter
    - dirties inode_in_use list (for sockets/pipes, this is useless)
    - dirties superblock s_inodes.
    - dirties last_ino counter
   All these are in different cache lines unfortunately.
2) allocate a dentry
   d_alloc() takes dcache_lock,
   insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
   dirties nr_dentry

3) d_instantiate() dentry (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount

At close() time, we must undo the things. It's even more expensive because
of the _atomic_dec_and_lock() that is stressed a lot, and because of two
cache lines that are touched when an element is deleted from a list
(previous and next items)

This is really bad, since sockets/pipes don't need to be visible in dcache
or an inode list per super block.

This patch series gets rid of all but one contended cache line for
sockets, pipes and anonymous fd (signalfd, timerfd, ...)

Sample program :

for (i = 0; i < 1000000; i++)
    close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program :

real 1.561s
user 0.092s
sys 1.469s

Cost if 8 processes are launched on a 8 CPU machine
(benchmark named socket8) :

real 27.496s <<<< !!!! >>>>
user 0.657s
sys 3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %   symbol name
3347352  3347352       28.0232  28.0232  _atomic_dec_and_lock
3301428  6648780       27.6388  55.6620  d_instantiate
2971130  9619910       24.8736  80.5355  d_alloc
241318   9861228        2.0203  82.5558  init_file
146190   10007418       1.2239  83.7797  __slab_free
144149   10151567       1.2068  84.9864  inotify_d_instantiate
143971   10295538       1.2053  86.1917  inet_create
137168   10432706       1.1483  87.3401  new_inode
117549   10550255       0.9841  88.3242  add_partial
110795   10661050       0.9275  89.2517  generic_drop_inode
107137   10768187       0.8969  90.1486  kmem_cache_alloc
94029    10862216       0.7872  90.9358  tcp_close
82837    10945053       0.6935  91.6293  dput
67486    11012539       0.5650  92.1943  dentry_iput
57751    11070290       0.4835  92.6778  iput
54327    11124617       0.4548  93.1326  tcp_v4_init_sock
49921    11174538       0.4179  93.5505  sysenter_past_esp
47616    11222154       0.3986  93.9491  kmem_cache_free
30792    11252946       0.2578  94.2069  clear_inode
27540    11280486       0.2306  94.4375  copy_from_user
26509    11306995       0.2219  94.6594  init_timer
26363    11333358       0.2207  94.8801  discard_slab
25284    11358642       0.2117  95.0918  __fput
22482    11381124       0.1882  95.2800  __percpu_counter_add
20369    11401493       0.1705  95.4505  sock_alloc
18501    11419994       0.1549  95.6054  inet_csk_destroy_sock
17923    11437917       0.1500  95.7555  sys_close

This patch series avoids all contended cache lines and makes this "bench"
pretty fast.

New cost if run on one cpu :

real 1.325s (instead of 1.561s)
user 0.091s
sys 1.234s

If run on 8 CPUS :

real 0m2.971s
user 0m0.726s
sys 0m21.310s

CPU: Core 2, speed 3000.04 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100 000
samples  cum. samples  %        cum. %   symbol name
189772   189772        12.7205  12.7205  _atomic_dec_and_lock
140467   330239         9.4155  22.1360  __slab_free
128210   458449         8.5940  30.7300  add_partial
121578   580027         8.1494  38.8794  kmem_cache_alloc
72626    652653         4.8681  43.7475  init_file
62720    715373         4.2041  47.9517  __percpu_counter_add
51632    767005         3.4609  51.4126  sysenter_past_esp
49196    816201         3.2976  54.7102  tcp_close
47933    864134         3.2130  57.9231  kmem_cache_free
29628    893762         1.9860  59.9091  copy_from_user
28443    922205         1.9065  61.8157  init_timer
25602    947807         1.7161  63.5318  __slab_alloc
22139    969946         1.4840  65.0158  discard_slab
20428    990374         1.3693  66.3851  __call_rcu
18174    1008548        1.2182  67.6033  alloc_fd
17643    1026191        1.1826  68.7859  __fput
17374    1043565        1.1646  69.9505  d_alloc
17196    1060761        1.1527  71.1031  sys_close
17024    1077785        1.1411  72.2442  inet_create
15208    1092993        1.0194  73.2636  alloc_inode
12201    1105194        0.8178  74.0815  fd_install
12167    1117361        0.8156  74.8970  lock_sock_nested
12123    1129484        0.8126  75.7096  get_empty_filp
11648    1141132        0.7808  76.4904  release_sock
11509    1152641        0.7715  77.2619  dput
11335    1163976        0.7598  78.0216  sock_init_data
11038    1175014        0.7399  78.7615  inet_csk_destroy_sock
10880    1185894        0.7293  79.4908  drop_file_write_access
10083    1195977        0.6759  80.1667  inotify_d_instantiate
9216     1205193        0.6178  80.7844  local_bh_enable_ip
8881     1214074        0.5953  81.3797  sysenter_do_call
8759     1222833        0.5871  81.9668  setup_object
8489     1231322        0.5690  82.5359  iput_single

So we now hit mntput()/mntget() and SLUB.

The last point is about SLUB being hit hard, unless we
use slub_min_order=3 (or slub_min_objects=45) at boot,
or we use Christoph Lameter patch (struct file RCU optimizations)
http://thread.gmane.org/gmane.linux.kernel/418615

If we boot machine with slub_min_order=3, SLUB overhead disappears.

If run on 8 CPUS :

real 0m2.315s
user 0m0.752s
sys 0m17.324s

CPU: Core 2, speed 3000.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %   symbol name
199409   199409        15.6440  15.6440  _atomic_dec_and_lock (mntput())
141606   341015        11.1092  26.7532  kmem_cache_alloc
76071    417086         5.9679  32.7211  init_file
70595    487681         5.5383  38.2595  __percpu_counter_add
51595    539276         4.0477  42.3072  sysenter_past_esp
49313    588589         3.8687  46.1759  tcp_close
45503    634092         3.5698  49.7457  kmem_cache_free
41413    675505         3.2489  52.9946  __slab_free
29911    705416         2.3466  55.3412  copy_from_user
28979    734395         2.2735  57.6146  init_timer
22251    756646         1.7456  59.3602  get_empty_filp
19942    776588         1.5645  60.9247  __call_rcu
18348    794936         1.4394  62.3642  __fput
18328    813264         1.4379  63.8020  alloc_fd
17395    830659         1.3647  65.1667  sys_close
17301    847960         1.3573  66.5240  d_alloc
16570    864530         1.2999  67.8239  inet_create
15522    880052         1.2177  69.0417  alloc_inode
13185    893237         1.0344  70.0761  setup_object
12359    905596         0.9696  71.0456  fd_install
12275    917871         0.9630  72.0086  lock_sock_nested
11924    929795         0.9355  72.9441  release_sock
11790    941585         0.9249  73.8690  sock_init_data
11310    952895         0.8873  74.7563  dput
10924    963819         0.8570  75.6133  drop_file_write_access
10903    974722         0.8554  76.4687  inet_csk_destroy_sock
10184    984906         0.7990  77.2676  inotify_d_instantiate
9372     994278         0.7353  78.0029  local_bh_enable_ip
8901     1003179        0.6983  78.7012  sysenter_do_call
8569     1011748        0.6723  79.3735  iput_single
8194     1019942        0.6428  80.0163  inet_release

This patch series contains 5 patches, against net-next-2.6 tree
(because this tree already contains network improvement on this
subject, but should apply on other trees)

[PATCH 1/5] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

(socket8 bench result : 27.5s to 25s)

[PATCH 2/5] fs: Use a percpu_counter to track nr_inodes
Avoids cache line ping pongs between cpus and prepare next patch, because updates of nr_inodes dont need inode_lock anymore. (socket8 bench result : no difference at this point) [PATCH 3/5] fs: Introduce a per_cpu last_ino allocator new_inode() dirties a contended cache line to get increasing inode numbers. Solve this problem by providing to each cpu a per_cpu variable, feeded by the shared last_ino, but once every 1024 allocations. This reduce contention on the shared last_ino, and give same spreading ino numbers than before. (same wraparound after 2^32 allocations) (socket8 bench result : no difference) [PATCH 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Sockets, pipes and anonymous fds have interesting properties. Like other files, they use a dentry and an inode. But dentries for these kind of files are not hashed into dcache, since there is no way someone can lookup such a file in the vfs tree. (/proc/{pid}/fd/{number} uses a different mechanism) Still, allocating and freeing such dentries are expensive processes, because we currently take dcache_lock inside d_alloc(), d_instantiate(), and dput(). This lock is very contended on SMP machines. This patch defines a new DCACHE_SINGLE flag, to mark a dentry as a single one (for sockets, pipes, anonymous fd), and a new d_alloc_single(const struct qstr *name, struct inode *inode) method, called by the three subsystems. Internally, dput() can take a fast path to dput_single() for SINGLE dentries. No more atomic_dec_and_lock() for such dentries. Differences betwen an SINGLE dentry and a normal one are : 1) SINGLE dentry has the DCACHE_SINGLE flag 2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED) This to avoid taking a reference on sb 'root' dentry, shared by too many dentries. 
3) They are not hashed into global hash table (DCACHE_UNHASHED) 4) Their d_alias list is empty (socket8 bench result : from 25s to 19.9s) [PATCH 5/5] fs: new_inode_single() and iput_single() Goal of this patch is to not touch inode_lock for socket/pipes/anonfd inodes allocation/freeing. SINGLE dentries are attached to inodes that dont need to be linked in a list of inodes, being "inode_in_use" or "sb->s_inodes" As inode_lock was taken only to protect these lists, we avoid taking it as well. Using iput_single() from dput_single() avoids taking inode_lock at freeing time. This patch has a very noticeable effect, because we avoid dirtying of three contended cache lines in new_inode(), and five cache lines in iput() (socket8 bench result : from 19.9s to 2.3s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- Overall diffstat : fs/anon_inodes.c | 18 ------ fs/dcache.c | 100 ++++++++++++++++++++++++++++++-------- fs/fs-writeback.c | 2 fs/inode.c | 101 +++++++++++++++++++++++++++++++-------- fs/pipe.c | 25 +-------- include/linux/dcache.h | 9 +++ include/linux/fs.h | 17 ++++++ kernel/sysctl.c | 6 +- mm/page-writeback.c | 2 net/socket.c | 26 +--------- 10 files changed, 200 insertions(+), 106 deletions(-) ^ permalink raw reply [flat|nested] 75+ messages in thread
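As an illustration of the batching idea described for the per_cpu last_ino allocator (patch 3/5), here is a minimal userspace sketch in C11. It is not the kernel code from the patch: the names `next_ino`, `shared_last_ino` and `local_last_ino` are hypothetical, and a thread-local variable stands in for the kernel's per-cpu variable. Only the batch size of 1024 comes from the patch description.

```c
#include <stdatomic.h>

#define INO_BATCH 1024u   /* batch size quoted in the patch description */

/* Shared counter: each thread touches it once per INO_BATCH allocations. */
static atomic_uint shared_last_ino;

/* Thread-local state, standing in for the kernel's per-cpu variable. */
static _Thread_local unsigned int local_last_ino;

/* Hypothetical helper: returns the next inode number for this thread. */
static unsigned int next_ino(void)
{
    unsigned int res = local_last_ino;

    if ((res & (INO_BATCH - 1)) == 0) {
        /* Local batch exhausted: reserve INO_BATCH numbers at once. */
        res = atomic_fetch_add(&shared_last_ino, INO_BATCH);
    }
    local_last_ino = ++res;   /* unsigned arithmetic wraps after 2^32 */
    return res;
}
```

With this layout the contended cache line holding the shared counter is dirtied once per 1024 allocations instead of on every allocation, while the numbers handed out still wrap around after 2^32 allocations in total.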
* [PATCH v3 0/7] fs: Scalability of sockets/pipes allocation/deallocation on SMP 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet @ 2008-12-11 22:38 ` Eric Dumazet 2008-12-11 22:38 ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet ` (6 subsequent siblings) 7 siblings, 0 replies; 75+ messages in thread From: Eric Dumazet @ 2008-12-11 22:38 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Hi Andrew

Take v2 of this patch series got no new feedback; maybe it's time for inclusion in mm for a while?

In this third version I added the last two patches, one initially from Christoph Lameter, and one to avoid dirtying mnt->mnt_count on hardwired fs.

Many thanks to Christoph and Paul for this SLAB_DESTROY_BY_RCU work done on "struct file".

Thank you

Short summary : Nice speedups for allocation/deallocation of sockets/pipes (from 27.5 seconds to 1.62 s, on an 8-cpu machine)

Long version :

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days, but it would be faster if 'struct file' used SLAB_DESTROY_BY_RCU and avoided the call_rcu() cache killer). This point is addressed by the 6th patch.

1) allocate an inode with new_inode()
   This function :
   - locks inode_lock,
   - dirties nr_inodes counter
   - dirties inode_in_use list (for sockets/pipes, this is useless)
   - dirties superblock s_inodes.
   - dirties last_ino counter
   All these are in different cache lines, unfortunately.

2) allocate a dentry
   d_alloc() takes dcache_lock, inserts the dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root), and dirties nr_dentry

3) d_instantiate() dentry (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount

At close() time, we must undo the things.
It's even more expensive because of the _atomic_dec_and_lock() calls, which are hit very hard, and because two cache lines are touched when an element is deleted from a list (previous and next items).

This is really bad, since sockets/pipes don't need to be visible in dcache or in a per-superblock inode list.

This patch series gets rid of all but one of the contended cache lines for sockets, pipes and anonymous fds (signalfd, timerfd, ...)

socketallocbench is a very simple program (attached to this mail) that runs a loop :

for (i = 0; i < 1000000; i++)
        close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program :

real 1.561s
user 0.092s
sys 1.469s

Cost if 8 processes are launched on an 8 CPU machine (socketallocbench -n 8) :

real 27.496s <<<< !!!! >>>>
user 0.657s
sys 3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples cum. samples % cum.
% symbol name 3347352 3347352 28.0232 28.0232 _atomic_dec_and_lock 3301428 6648780 27.6388 55.6620 d_instantiate 2971130 9619910 24.8736 80.5355 d_alloc 241318 9861228 2.0203 82.5558 init_file 146190 10007418 1.2239 83.7797 __slab_free 144149 10151567 1.2068 84.9864 inotify_d_instantiate 143971 10295538 1.2053 86.1917 inet_create 137168 10432706 1.1483 87.3401 new_inode 117549 10550255 0.9841 88.3242 add_partial 110795 10661050 0.9275 89.2517 generic_drop_inode 107137 10768187 0.8969 90.1486 kmem_cache_alloc 94029 10862216 0.7872 90.9358 tcp_close 82837 10945053 0.6935 91.6293 dput 67486 11012539 0.5650 92.1943 dentry_iput 57751 11070290 0.4835 92.6778 iput 54327 11124617 0.4548 93.1326 tcp_v4_init_sock 49921 11174538 0.4179 93.5505 sysenter_past_esp 47616 11222154 0.3986 93.9491 kmem_cache_free 30792 11252946 0.2578 94.2069 clear_inode 27540 11280486 0.2306 94.4375 copy_from_user 26509 11306995 0.2219 94.6594 init_timer 26363 11333358 0.2207 94.8801 discard_slab 25284 11358642 0.2117 95.0918 __fput 22482 11381124 0.1882 95.2800 __percpu_counter_add 20369 11401493 0.1705 95.4505 sock_alloc 18501 11419994 0.1549 95.6054 inet_csk_destroy_sock 17923 11437917 0.1500 95.7555 sys_close This patch serie avoids all contented cache lines and makes this "bench" pretty fast. New cost if run on one cpu : real 1.245s (instead of 1.561s) user 0.074s sys 1.161s If run on 8 CPUS : real 1.624s user 0.580s sys 12.296s On oprofile, we finally can see network stuff coming at the front of expensive stuff. (with the exception of kmem_cache_[z]alloc(), because it has to clear 192 bytes of file structures, this takes half of the time) CPU: Core 2, speed 3000.09 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100 000 samples cum. samples % cum. 
% symbol name 176586 176586 10.9376 10.9376 kmem_cache_alloc 169838 346424 10.5196 21.4572 tcp_close 105331 451755 6.5241 27.9813 tcp_v4_init_sock 105146 556901 6.5126 34.4939 tcp_v4_destroy_sock 83307 640208 5.1600 39.6539 sysenter_past_esp 80241 720449 4.9701 44.6239 inet_csk_destroy_sock 74263 794712 4.5998 49.2237 kmem_cache_free 56806 851518 3.5185 52.7422 __percpu_counter_add 48619 900137 3.0114 55.7536 copy_from_user 44803 944940 2.7751 58.5287 init_timer 28539 973479 1.7677 60.2964 d_alloc 27795 1001274 1.7216 62.0180 alloc_fd 26747 1028021 1.6567 63.6747 __fput 24312 1052333 1.5059 65.1805 sys_close 24205 1076538 1.4992 66.6798 inet_create 22409 1098947 1.3880 68.0677 alloc_inode 21359 1120306 1.3230 69.3907 release_sock 19865 1140171 1.2304 70.6211 fd_install 19472 1159643 1.2061 71.8272 lock_sock_nested 18956 1178599 1.1741 73.0013 sock_init_data 17301 1195900 1.0716 74.0729 drop_file_write_access 17113 1213013 1.0600 75.1329 inotify_d_instantiate 16384 1229397 1.0148 76.1477 dput 15173 1244570 0.9398 77.0875 local_bh_enable_ip 15017 1259587 0.9301 78.0176 local_bh_enable 13354 1272941 0.8271 78.8448 __sock_create 13139 1286080 0.8138 79.6586 inet_release 13062 1299142 0.8090 80.4676 sysenter_do_call 11935 1311077 0.7392 81.2069 iput_single This patch serie contains 7 patches, against linux-2.6 tree, plus one patch in mm (fs: filp_cachep can be static in fs/file_table.c) [PATCH 1/7] fs: Use a percpu_counter to track nr_dentry Adding a percpu_counter nr_dentry avoids cache line ping pongs between cpus to maintain this metric, and dcache_lock is no more needed to protect dentry_stat.nr_dentry We centralize nr_dentry updates at the right place : - increments in d_alloc() - decrements in d_free() d_alloc() can avoid taking dcache_lock if parent is NULL ("socketallocbench -n 8" bench result : 27.5s to 25s) [PATCH 2/7] fs: Use a percpu_counter to track nr_inodes Avoids cache line ping pongs between cpus and prepare next patch, because updates of nr_inodes dont 
need inode_lock anymore.

("socketallocbench -n 8" bench result : no difference at this point)

[PATCH 3/7] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing inode numbers.

Solve this problem by giving each cpu a per_cpu variable, fed from the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino, and gives the same spread of inode numbers as before. (same wraparound after 2^32 allocations)

("socketallocbench -n 8" result : no difference)

[PATCH 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties. Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into dcache, since there is no way someone can look up such a file in the vfs tree. (/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries is expensive, because we currently take dcache_lock inside d_alloc(), d_instantiate(), and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as a single one (for sockets, pipes, anonymous fd), and a new d_alloc_single(const struct qstr *name, struct inode *inode) method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for SINGLE dentries. No more atomic_dec_and_lock() for such dentries.

Differences between a SINGLE dentry and a normal one are :

1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
   This is to avoid taking a reference on the sb 'root' dentry, shared by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)

[PATCH 5/7] fs: new_inode_single() and iput_single()

Goal of this patch is to not touch inode_lock for socket/pipes/anonfd inodes allocation/freeing.
SINGLE dentries are attached to inodes that don't need to be linked in a list of inodes, be it "inode_in_use" or "sb->s_inodes". As inode_lock was taken only to protect these lists, we avoid taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock at freeing time.

This patch has a very noticeable effect, because we avoid dirtying three contended cache lines in new_inode(), and five cache lines in iput()

("socketallocbench -n 8" result : from 19.9s to 3.01s)

[PATCH 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

From: Christoph Lameter <cl@linux-foundation.org>

Currently we schedule RCU frees for each file we free separately. That has several drawbacks against the earlier file handling (in 2.6.5, for example), which did not require RCU callbacks:

1. An excessive number of RCU callbacks can be generated, causing long RCU queues that in turn cause long latencies. We hit SLUB page allocation more often than necessary.

2. The cache hot object is not preserved between free and realloc. A close followed by another open is very fast with the RCUless approach because the last freed object is returned by the slab allocator that is still cache hot. RCU free means that the object is not immediately available again. The new object is cache cold and therefore open/close performance tests show a significant degradation with the RCU implementation.

One solution to this problem is to move the RCU freeing into the Slab allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation time. The slab allocator will do RCU frees only when it is necessary to dispose of slabs of objects (rare). So with that approach we can cut out the RCU overhead significantly.

However, the slab allocator may return the object for another use even before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means there is the (unlikely) possibility that the object is going to be switched under us in sections protected by rcu_read_lock() and rcu_read_unlock().
So we need to verify that we have acquired the correct object after establishing a stable object reference (incrementing the refcounter does that).

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

("socketallocbench -n 8" result : from 3.01s to 2.20s)

[PATCH 7/7] fs: MS_NOREFCOUNT

Some fs are hardwired into kernel, and mntput()/mntget() hit a contended cache line. We define a new superblock flag, MS_NOREFCOUNT, that is set on socket, pipes and anonymous fd superblocks. mntput()/mntget() become null ops on these fs.

("socketallocbench -n 8" result : from 2.20s to 1.64s)

cat socketallocbench.c

/*
 * socketallocbench benchmark
 *
 * Usage : socket [-n procs] [-l loops]
 */
#include <sys/socket.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/wait.h>

void dowork(int loops)
{
	int i;

	for (i = 0; i < loops; i++)
		close(socket(AF_INET, SOCK_STREAM, 0));
}

int main(int argc, char *argv[])
{
	int i;
	int n = 1;
	int loops = 1000000;
	pid_t *pidtable;

	while ((i = getopt(argc, argv, "n:l:")) != EOF) {
		if (i == 'n')
			n = atoi(optarg);
		if (i == 'l')
			loops = atoi(optarg);
	}
	pidtable = malloc(n * sizeof(pid_t));
	for (i = 1; i < n; i++) {
		pidtable[i] = fork();
		if (pidtable[i] == 0) {
			dowork(loops);
			_exit(0);
		}
		if (pidtable[i] == -1) {
			perror("fork");
			n = i;
			break;
		}
	}
	dowork(loops);
	for (i = 1; i < n; i++) {
		int status;

		wait(&status);
	}
	return 0;
}

^ permalink raw reply [flat|nested] 75+ messages in thread
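The verification step described for patch 6/7 (take a reference, then check that the object is still the one we looked up) can be sketched in userspace C11 as follows. This is only an illustration under stated assumptions, not the kernel code: `struct obj`, `slot`, `obj_get_unless_zero`, `obj_put` and `lookup` are hypothetical names, a single atomic pointer stands in for one fd table entry, and the sketch assumes the memory behind a `struct obj` always remains a `struct obj` (the property SLAB_DESTROY_BY_RCU provides inside an RCU read-side section).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
    atomic_long refcnt;   /* 0 means the object is free or being freed */
};

/* Published pointer, standing in for one fd table slot. */
static _Atomic(struct obj *) slot;

/* Take a reference only if the object is still live (refcnt != 0). */
static bool obj_get_unless_zero(struct obj *o)
{
    long c = atomic_load(&o->refcnt);

    while (c != 0) {
        /* On failure, c is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&o->refcnt, &c, c + 1))
            return true;
    }
    return false;
}

static void obj_put(struct obj *o)
{
    atomic_fetch_sub(&o->refcnt, 1);
}

/* Lockless lookup: the reference makes the object stable, the second
 * load of the slot verifies it was not recycled in the meantime. */
static struct obj *lookup(void)
{
    for (;;) {
        struct obj *o = atomic_load(&slot);

        if (o == NULL)
            return NULL;
        if (!obj_get_unless_zero(o))
            return NULL;            /* object is being freed */
        if (atomic_load(&slot) == o)
            return o;               /* same object: reference is valid */
        obj_put(o);                 /* recycled under us: drop and retry */
    }
}
```

The second load is the price of immediate reuse: without it, a reader could end up holding a reference on an object that was freed and reallocated for a different file between the lookup and the refcount increment.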
* [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet 2008-12-11 22:38 ` [PATCH v3 0/7] " Eric Dumazet @ 2008-12-11 22:38 ` Eric Dumazet 2007-07-24 1:24 ` Nick Piggin [not found] ` <49419680.8010409-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2008-12-11 22:39 ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet ` (5 subsequent siblings) 7 siblings, 2 replies; 75+ messages in thread From: Eric Dumazet @ 2008-12-11 22:38 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney Adding a percpu_counter nr_dentry avoids cache line ping pongs between cpus to maintain this metric, and dcache_lock is no more needed to protect dentry_stat.nr_dentry We centralize nr_dentry updates at the right place : - increments in d_alloc() - decrements in d_free() d_alloc() can avoid taking dcache_lock if parent is NULL ("socketallocbench -n8" result : 27.5s to 25s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/dcache.c | 49 +++++++++++++++++++++++++------------------ include/linux/fs.h | 2 + kernel/sysctl.c | 2 - 3 files changed, 32 insertions(+), 21 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index fa1ba03..f463a81 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly; static unsigned int d_hash_mask __read_mostly; static unsigned int d_hash_shift __read_mostly; static struct hlist_head *dentry_hashtable __read_mostly; +static struct percpu_counter nr_dentry; /* Statistics gathering. 
*/ struct dentry_stat_t dentry_stat = { .age_limit = 45, }; +/* + * Handle nr_dentry sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry); + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void __d_free(struct dentry *dentry) { WARN_ON(!list_empty(&dentry->d_alias)); @@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head) } /* - * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry - * inside dcache_lock. + * no dcache_lock, please. */ static void d_free(struct dentry *dentry) { @@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry) __d_free(dentry); else call_rcu(&dentry->d_u.d_rcu, d_callback); + percpu_counter_dec(&nr_dentry); } /* @@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry) struct dentry *parent; list_del(&dentry->d_u.d_child); - dentry_stat.nr_dentry--; /* For d_free, below */ /*drops the locks, at that point nobody can reach this dentry */ dentry_iput(dentry); if (IS_ROOT(dentry)) @@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb) static void shrink_dcache_for_umount_subtree(struct dentry *dentry) { struct dentry *parent; - unsigned detached = 0; BUG_ON(!IS_ROOT(dentry)); @@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) } list_del(&dentry->d_u.d_child); - detached++; inode = dentry->d_inode; if (inode) { @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) * otherwise we ascend to the parent and move to the * next sibling if there is one */ if (!parent) - goto out; + return; dentry = parent; @@ -705,11 +721,6 @@ static void 
shrink_dcache_for_umount_subtree(struct dentry *dentry) dentry = list_entry(dentry->d_subdirs.next, struct dentry, d_u.d_child); } -out: - /* several dentries were freed, need to correct nr_dentry */ - spin_lock(&dcache_lock); - dentry_stat.nr_dentry -= detached; - spin_unlock(&dcache_lock); } /* @@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) dentry->d_flags = DCACHE_UNHASHED; spin_lock_init(&dentry->d_lock); dentry->d_inode = NULL; - dentry->d_parent = NULL; - dentry->d_sb = NULL; dentry->d_op = NULL; dentry->d_fsdata = NULL; dentry->d_mounted = 0; @@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) if (parent) { dentry->d_parent = dget(parent); dentry->d_sb = parent->d_sb; + spin_lock(&dcache_lock); + list_add(&dentry->d_u.d_child, &parent->d_subdirs); + spin_unlock(&dcache_lock); } else { + dentry->d_parent = NULL; + dentry->d_sb = NULL; INIT_LIST_HEAD(&dentry->d_u.d_child); } - - spin_lock(&dcache_lock); - if (parent) - list_add(&dentry->d_u.d_child, &parent->d_subdirs); - dentry_stat.nr_dentry++; - spin_unlock(&dcache_lock); - + percpu_counter_inc(&nr_dentry); return dentry; } @@ -2282,6 +2290,7 @@ static void __init dcache_init(void) { int loop; + percpu_counter_init(&nr_dentry, 0); /* * A constructor could be added for stable state like the lists, * but it is probably not worth it because of the cache nature diff --git a/include/linux/fs.h b/include/linux/fs.h index 4a853ef..114cb65 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata) struct ctl_table; int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 
3d56fe7..777bee7 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = { .data = &dentry_stat, .maxlen = 6*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_dentry, }, { .ctl_name = FS_OVERFLOWUID, ^ permalink raw reply related [flat|nested] 75+ messages in thread
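To show why a per-cpu counter removes the cache line ping pong described in this patch, here is a simplified userspace sketch in C11. It is an assumption-laden illustration, not the kernel's percpu_counter implementation (which also keeps a global count and folds per-cpu deltas into it in batches): `NSLOTS`, `struct slot_counter`, `counter_add` and `counter_sum_positive` are hypothetical names, and the caller passes a slot index where the kernel would use the current cpu.

```c
#include <stdatomic.h>

#define NSLOTS 8   /* assumed number of cpus, for illustration only */

/* One counter per slot, each aligned to its own 64-byte cache line so
 * that updates from different cpus never dirty the same line. */
struct slot_counter {
    _Alignas(64) atomic_long count;
};

static struct slot_counter nr_dentry_sketch[NSLOTS];

/* Fast path (d_alloc()/d_free() in the patch): touch only our own slot. */
static void counter_add(int slot, long delta)
{
    atomic_fetch_add_explicit(&nr_dentry_sketch[slot].count, delta,
                              memory_order_relaxed);
}

/* Slow path (the sysctl read in the patch): sum every slot and clamp at
 * zero, since individual slots may legitimately be negative. */
static long counter_sum_positive(void)
{
    long sum = 0;

    for (int i = 0; i < NSLOTS; i++)
        sum += atomic_load_explicit(&nr_dentry_sketch[i].count,
                                    memory_order_relaxed);
    return sum < 0 ? 0 : sum;
}
```

A dentry allocated on one cpu and freed on another leaves +1 in one slot and -1 in the other; only the sum is meaningful, which is why the patch computes it on demand in proc_nr_dentry() rather than keeping dentry_stat.nr_dentry exact at all times.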
* Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry 2008-12-11 22:38 ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet @ 2007-07-24 1:24 ` Nick Piggin [not found] ` <49419680.8010409-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 1 sibling, 0 replies; 75+ messages in thread From: Nick Piggin @ 2007-07-24 1:24 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

On Friday 12 December 2008 09:38, Eric Dumazet wrote:
> Adding a percpu_counter nr_dentry avoids cache line ping pongs
> between cpus to maintain this metric, and dcache_lock is
> no more needed to protect dentry_stat.nr_dentry
>
> We centralize nr_dentry updates at the right place :
> - increments in d_alloc()
> - decrements in d_free()
>
> d_alloc() can avoid taking dcache_lock if parent is NULL
>
> ("socketallocbench -n8" result : 27.5s to 25s)

Seems like a good idea.

> @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  		 * otherwise we ascend to the parent and move to the
>  		 * next sibling if there is one */
>  		if (!parent)
> -			goto out;
> +			return;
>
>  		dentry = parent;
>

Andrew doesn't like return from middle of function.

^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <49419680.8010409-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>]
* Re: [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry [not found] ` <49419680.8010409-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> @ 2008-12-16 21:04 ` Paul E. McKenney 0 siblings, 0 replies; 75+ messages in thread From: Paul E. McKenney @ 2008-12-16 21:04 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro On Thu, Dec 11, 2008 at 11:38:56PM +0100, Eric Dumazet wrote: > Adding a percpu_counter nr_dentry avoids cache line ping pongs > between cpus to maintain this metric, and dcache_lock is > no more needed to protect dentry_stat.nr_dentry > > We centralize nr_dentry updates at the right place : > - increments in d_alloc() > - decrements in d_free() > > d_alloc() can avoid taking dcache_lock if parent is NULL > > ("socketallocbench -n8" result : 27.5s to 25s) Looks good! (At least once I realised that nr_dentry was global rather than per-dentry!!!) Reviewed-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> > --- > fs/dcache.c | 49 +++++++++++++++++++++++++------------------ > include/linux/fs.h | 2 + > kernel/sysctl.c | 2 - > 3 files changed, 32 insertions(+), 21 deletions(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index fa1ba03..f463a81 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly; > static unsigned int d_hash_mask __read_mostly; > static unsigned int d_hash_shift __read_mostly; > static struct hlist_head *dentry_hashtable __read_mostly; > +static struct percpu_counter nr_dentry; > > /* Statistics gathering. 
*/ > struct dentry_stat_t dentry_stat = { > .age_limit = 45, > }; > > +/* > + * Handle nr_dentry sysctl > + */ > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) > +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos) > +{ > + dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry); > + return proc_dointvec(table, write, filp, buffer, lenp, ppos); > +} > +#else > +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos) > +{ > + return -ENOSYS; > +} > +#endif > + > static void __d_free(struct dentry *dentry) > { > WARN_ON(!list_empty(&dentry->d_alias)); > @@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head) > } > > /* > - * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry > - * inside dcache_lock. > + * no dcache_lock, please. > */ > static void d_free(struct dentry *dentry) > { > @@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry) > __d_free(dentry); > else > call_rcu(&dentry->d_u.d_rcu, d_callback); > + percpu_counter_dec(&nr_dentry); > } > > /* > @@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry) > struct dentry *parent; > > list_del(&dentry->d_u.d_child); > - dentry_stat.nr_dentry--; /* For d_free, below */ > /*drops the locks, at that point nobody can reach this dentry */ > dentry_iput(dentry); > if (IS_ROOT(dentry)) > @@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb) > static void shrink_dcache_for_umount_subtree(struct dentry *dentry) > { > struct dentry *parent; > - unsigned detached = 0; > > BUG_ON(!IS_ROOT(dentry)); > > @@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) > } > > list_del(&dentry->d_u.d_child); > - detached++; > > inode = dentry->d_inode; > if (inode) { > @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) > * otherwise we ascend to the parent and 
move to the > * next sibling if there is one */ > if (!parent) > - goto out; > + return; > > dentry = parent; > > @@ -705,11 +721,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) > dentry = list_entry(dentry->d_subdirs.next, > struct dentry, d_u.d_child); > } > -out: > - /* several dentries were freed, need to correct nr_dentry */ > - spin_lock(&dcache_lock); > - dentry_stat.nr_dentry -= detached; > - spin_unlock(&dcache_lock); > } > > /* > @@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) > dentry->d_flags = DCACHE_UNHASHED; > spin_lock_init(&dentry->d_lock); > dentry->d_inode = NULL; > - dentry->d_parent = NULL; > - dentry->d_sb = NULL; > dentry->d_op = NULL; > dentry->d_fsdata = NULL; > dentry->d_mounted = 0; > @@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) > if (parent) { > dentry->d_parent = dget(parent); > dentry->d_sb = parent->d_sb; > + spin_lock(&dcache_lock); > + list_add(&dentry->d_u.d_child, &parent->d_subdirs); > + spin_unlock(&dcache_lock); > } else { > + dentry->d_parent = NULL; > + dentry->d_sb = NULL; > INIT_LIST_HEAD(&dentry->d_u.d_child); > } > - > - spin_lock(&dcache_lock); > - if (parent) > - list_add(&dentry->d_u.d_child, &parent->d_subdirs); > - dentry_stat.nr_dentry++; > - spin_unlock(&dcache_lock); > - > + percpu_counter_inc(&nr_dentry); > return dentry; > } > > @@ -2282,6 +2290,7 @@ static void __init dcache_init(void) > { > int loop; > > + percpu_counter_init(&nr_dentry, 0); > /* > * A constructor could be added for stable state like the lists, > * but it is probably not worth it because of the cache nature > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 4a853ef..114cb65 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2217,6 +2217,8 @@ static inline void free_secdata(void *secdata) > struct ctl_table; > int proc_nr_files(struct ctl_table *table, int write, struct file *filp, > void __user 
*buffer, size_t *lenp, loff_t *ppos); > +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos); > > int get_filesystem_list(char * buf); > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 3d56fe7..777bee7 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1246,7 +1246,7 @@ static struct ctl_table fs_table[] = { > .data = &dentry_stat, > .maxlen = 6*sizeof(int), > .mode = 0444, > - .proc_handler = &proc_dointvec, > + .proc_handler = &proc_nr_dentry, > }, > { > .ctl_name = FS_OVERFLOWUID, ^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet 2008-12-11 22:38 ` [PATCH v3 0/7] " Eric Dumazet 2008-12-11 22:38 ` [PATCH v3 1/7] fs: Use a percpu_counter to track nr_dentry Eric Dumazet @ 2008-12-11 22:39 ` Eric Dumazet [not found] ` <4941968E.3020201-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2008-12-11 22:39 ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet ` (4 subsequent siblings) 7 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney Avoids cache line ping pongs between cpus and prepare next patch, because updates of nr_inodes dont need inode_lock anymore. (socket8 bench result : no difference at this point) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/fs-writeback.c | 2 +- fs/inode.c | 39 +++++++++++++++++++++++++++++++-------- include/linux/fs.h | 3 +++ kernel/sysctl.c | 4 ++-- mm/page-writeback.c | 2 +- 5 files changed, 38 insertions(+), 12 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index d0ff0b8..b591cdd 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait) unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS); wbc.nr_to_write = nr_dirty + nr_unstable + - (inodes_stat.nr_inodes - inodes_stat.nr_unused) + + (get_nr_inodes() - inodes_stat.nr_unused) + nr_dirty + nr_unstable; wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ sync_sb_inodes(sb, &wbc); diff --git a/fs/inode.c b/fs/inode.c index 0487ddb..f94f889 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex); * Statistics gathering.. 
*/ struct inodes_stat_t inodes_stat; +static struct percpu_counter nr_inodes; static struct kmem_cache * inode_cachep __read_mostly; +int get_nr_inodes(void) +{ + return percpu_counter_sum_positive(&nr_inodes); +} + +/* + * Handle nr_dentry sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + inodes_stat.nr_inodes = get_nr_inodes(); + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void wake_up_inode(struct inode *inode) { /* @@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head) destroy_inode(inode); nr_disposed++; } - spin_lock(&inode_lock); - inodes_stat.nr_inodes -= nr_disposed; - spin_unlock(&inode_lock); + percpu_counter_sub(&nr_inodes, nr_disposed); } /* @@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb) inode = alloc_inode(sb); if (inode) { + percpu_counter_inc(&nr_inodes); spin_lock(&inode_lock); - inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); inode->i_ino = ++last_ino; @@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h if (set(inode, data)) goto set_failed; - inodes_stat.nr_inodes++; + percpu_counter_inc(&nr_inodes); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); @@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he old = find_inode_fast(sb, head, ino); if (!old) { inode->i_ino = ino; - inodes_stat.nr_inodes++; + percpu_counter_inc(&nr_inodes); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); @@ -1042,8 
+1064,8 @@ void generic_delete_inode(struct inode *inode) list_del_init(&inode->i_list); list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; spin_unlock(&inode_lock); + percpu_counter_dec(&nr_inodes); security_inode_delete(inode); @@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode) list_del_init(&inode->i_list); list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; spin_unlock(&inode_lock); + percpu_counter_dec(&nr_inodes); if (inode->i_data.nrpages) truncate_inode_pages(&inode->i_data, 0); clear_inode(inode); @@ -1394,6 +1416,7 @@ void __init inode_init(void) { int loop; + percpu_counter_init(&nr_inodes, 0); /* inode slab cache */ inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode), diff --git a/include/linux/fs.h b/include/linux/fs.h index 114cb65..a789346 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -47,6 +47,7 @@ struct inodes_stat_t { int dummy[5]; /* padding for sysctl ABI compatibility */ }; extern struct inodes_stat_t inodes_stat; +extern int get_nr_inodes(void); extern int leases_enable, lease_break_time; @@ -2219,6 +2220,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 777bee7..b705f3a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1205,7 +1205,7 @@ static struct ctl_table fs_table[] = { .data = &inodes_stat, .maxlen = 2*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_inodes, }, { .ctl_name = FS_STATINODE, @@ -1213,7 +1213,7 @@ static struct ctl_table fs_table[] = { 
.data = &inodes_stat, .maxlen = 7*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_inodes, }, { .procname = "file-nr", diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2970e35..a71a922 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg) next_jif = start_jif + dirty_writeback_interval; nr_to_write = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS) + - (inodes_stat.nr_inodes - inodes_stat.nr_unused); + (get_nr_inodes() - inodes_stat.nr_unused); while (nr_to_write > 0) { wbc.more_io = 0; wbc.encountered_congestion = 0; ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes
  [not found] ` <4941968E.3020201-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2007-07-24 1:30 ` Nick Piggin
  [not found] ` <200707241130.56767.nickpiggin-/E1597aS9LT0CCvOHzKKcA@public.gmane.org>
  2008-12-16 21:10 ` Paul E. McKenney
  1 sibling, 1 reply; 75+ messages in thread
From: Nick Piggin @ 2007-07-24 1:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro, Paul E. McKenney

On Friday 12 December 2008 09:39, Eric Dumazet wrote:
> Avoids cache line ping pongs between cpus and prepare next patch,
> because updates of nr_inodes dont need inode_lock anymore.
>
> (socket8 bench result : no difference at this point)

Looks good.

But.... If we never actually need fast access to the approximate
total, (which seems to apply to this and the previous patch) we
could use something much simpler which does not have the spinlock
or all this batching stuff that percpu counters have. I'd prefer
that because it will be faster in a straight line...

(BTW. percpu counters can't be used in interrupt context? That's
nice.)

^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes
  [not found] ` <200707241130.56767.nickpiggin-/E1597aS9LT0CCvOHzKKcA@public.gmane.org>
@ 2008-12-12 5:11 ` Eric Dumazet
  0 siblings, 0 replies; 75+ messages in thread
From: Eric Dumazet @ 2008-12-12 5:11 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro, Paul E. McKenney

Nick Piggin wrote:
> On Friday 12 December 2008 09:39, Eric Dumazet wrote:
>> Avoids cache line ping pongs between cpus and prepare next patch,
>> because updates of nr_inodes dont need inode_lock anymore.
>>
>> (socket8 bench result : no difference at this point)
>
> Looks good.
>
> But.... If we never actually need fast access to the approximate
> total, (which seems to apply to this and the previous patch) we
> could use something much simpler which does not have the spinlock
> or all this batching stuff that percpu counters have. I'd prefer
> that because it will be faster in a straight line...

Well, using a non-batching mode could be really easy, just call

__percpu_counter_add(&counter, inc, 1<<30);

Or define a new percpu_counter_fastadd(&counter, inc);

percpu_counters are nice because they handle the CPU hotplug problem,
if we want to use for_each_online_cpu() instead of
for_each_possible_cpu().

>
> (BTW. percpu counters can't be used in interrupt context? That's
> nice.)
>
>

Not sure why you said this. I would like to have an irqsafe
percpu_counter; I was preparing such a patch because we need it
for net-next.

^ permalink raw reply [flat|nested] 75+ messages in thread
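The batching trade-off debated in this exchange can be modelled in userspace. What follows is only a sketch under stated assumptions, not the kernel implementation: the names (model_counter, MODEL_NR_CPUS and the model_counter_* helpers) are hypothetical, the CPU index is passed explicitly, and the code is single-threaded with no locking. It folds a per-CPU delta into the shared count only once the delta reaches the batch value, so a very large batch such as 1<<30 in practice never writes the shared cache line, and the shared count then stays stale until an exact sum is requested.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; NOT the kernel's struct percpu_counter. */
#define MODEL_NR_CPUS 4

struct model_counter {
	int64_t count;                 /* shared, approximate total */
	int32_t delta[MODEL_NR_CPUS];  /* per-CPU pending updates */
};

static void model_counter_init(struct model_counter *c, int64_t value)
{
	c->count = value;
	for (int i = 0; i < MODEL_NR_CPUS; i++)
		c->delta[i] = 0;
}

/*
 * Add 'amount' on behalf of 'cpu'. The shared count is written only
 * when the local delta reaches +batch or -batch.
 */
static void model_counter_add(struct model_counter *c, int cpu,
			      int32_t amount, int32_t batch)
{
	int32_t d;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	d = c->delta[cpu] + amount;

	if (d >= batch || d <= -batch) {
		c->count += d;      /* a real implementation needs a lock here */
		c->delta[cpu] = 0;
	} else {
		c->delta[cpu] = d;
	}
}

/* Exact total: walk every per-CPU slot (slow path, e.g. a sysctl read). */
static int64_t model_counter_sum(const struct model_counter *c)
{
	int64_t sum = c->count;

	for (int i = 0; i < MODEL_NR_CPUS; i++)
		sum += c->delta[i];
	return sum;
}

/* Same as above, but never report a negative total. */
static int64_t model_counter_sum_positive(const struct model_counter *c)
{
	int64_t sum = model_counter_sum(c);

	return sum < 0 ? 0 : sum;
}
```

With a batch of 32 and 25 increments per CPU, the shared count in this model is never written, yet the exact sum still reports 100. That is the property the sysctl handlers in the patch rely on when they call the summing helper only at read time.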
* Re: [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes [not found] ` <4941968E.3020201-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2007-07-24 1:30 ` Nick Piggin @ 2008-12-16 21:10 ` Paul E. McKenney 1 sibling, 0 replies; 75+ messages in thread From: Paul E. McKenney @ 2008-12-16 21:10 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro On Thu, Dec 11, 2008 at 11:39:10PM +0100, Eric Dumazet wrote: > Avoids cache line ping pongs between cpus and prepare next patch, > because updates of nr_inodes dont need inode_lock anymore. > > (socket8 bench result : no difference at this point) I do like this per-CPU counter infrastructure! One small comment change noted below. Other than that: Reviewed-by: Paul E. 
McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> > --- > fs/fs-writeback.c | 2 +- > fs/inode.c | 39 +++++++++++++++++++++++++++++++-------- > include/linux/fs.h | 3 +++ > kernel/sysctl.c | 4 ++-- > mm/page-writeback.c | 2 +- > 5 files changed, 38 insertions(+), 12 deletions(-) > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index d0ff0b8..b591cdd 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait) > unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS); > > wbc.nr_to_write = nr_dirty + nr_unstable + > - (inodes_stat.nr_inodes - inodes_stat.nr_unused) + > + (get_nr_inodes() - inodes_stat.nr_unused) + > nr_dirty + nr_unstable; > wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ > sync_sb_inodes(sb, &wbc); > diff --git a/fs/inode.c b/fs/inode.c > index 0487ddb..f94f889 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex); > * Statistics gathering.. > */ > struct inodes_stat_t inodes_stat; > +static struct percpu_counter nr_inodes; > > static struct kmem_cache * inode_cachep __read_mostly; > > +int get_nr_inodes(void) > +{ > + return percpu_counter_sum_positive(&nr_inodes); > +} > + > +/* > + * Handle nr_dentry sysctl That would be "nr_inode", right? 
> + */ > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) > +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos) > +{ > + inodes_stat.nr_inodes = get_nr_inodes(); > + return proc_dointvec(table, write, filp, buffer, lenp, ppos); > +} > +#else > +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos) > +{ > + return -ENOSYS; > +} > +#endif > + > static void wake_up_inode(struct inode *inode) > { > /* > @@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head) > destroy_inode(inode); > nr_disposed++; > } > - spin_lock(&inode_lock); > - inodes_stat.nr_inodes -= nr_disposed; > - spin_unlock(&inode_lock); > + percpu_counter_sub(&nr_inodes, nr_disposed); > } > > /* > @@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb) > > inode = alloc_inode(sb); > if (inode) { > + percpu_counter_inc(&nr_inodes); > spin_lock(&inode_lock); > - inodes_stat.nr_inodes++; > list_add(&inode->i_list, &inode_in_use); > list_add(&inode->i_sb_list, &sb->s_inodes); > inode->i_ino = ++last_ino; > @@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h > if (set(inode, data)) > goto set_failed; > > - inodes_stat.nr_inodes++; > + percpu_counter_inc(&nr_inodes); > list_add(&inode->i_list, &inode_in_use); > list_add(&inode->i_sb_list, &sb->s_inodes); > hlist_add_head(&inode->i_hash, head); > @@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he > old = find_inode_fast(sb, head, ino); > if (!old) { > inode->i_ino = ino; > - inodes_stat.nr_inodes++; > + percpu_counter_inc(&nr_inodes); > list_add(&inode->i_list, &inode_in_use); > list_add(&inode->i_sb_list, &sb->s_inodes); > hlist_add_head(&inode->i_hash, head); > @@ -1042,8 +1064,8 @@ void generic_delete_inode(struct inode *inode) > list_del_init(&inode->i_list); > list_del_init(&inode->i_sb_list); > 
inode->i_state |= I_FREEING; > - inodes_stat.nr_inodes--; > spin_unlock(&inode_lock); > + percpu_counter_dec(&nr_inodes); > > security_inode_delete(inode); > > @@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode) > list_del_init(&inode->i_list); > list_del_init(&inode->i_sb_list); > inode->i_state |= I_FREEING; > - inodes_stat.nr_inodes--; > spin_unlock(&inode_lock); > + percpu_counter_dec(&nr_inodes); > if (inode->i_data.nrpages) > truncate_inode_pages(&inode->i_data, 0); > clear_inode(inode); > @@ -1394,6 +1416,7 @@ void __init inode_init(void) > { > int loop; > > + percpu_counter_init(&nr_inodes, 0); > /* inode slab cache */ > inode_cachep = kmem_cache_create("inode_cache", > sizeof(struct inode), > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 114cb65..a789346 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -47,6 +47,7 @@ struct inodes_stat_t { > int dummy[5]; /* padding for sysctl ABI compatibility */ > }; > extern struct inodes_stat_t inodes_stat; > +extern int get_nr_inodes(void); > > extern int leases_enable, lease_break_time; > > @@ -2219,6 +2220,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp, > void __user *buffer, size_t *lenp, loff_t *ppos); > int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, > void __user *buffer, size_t *lenp, loff_t *ppos); > +int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp, > + void __user *buffer, size_t *lenp, loff_t *ppos); > > int get_filesystem_list(char * buf); > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 777bee7..b705f3a 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1205,7 +1205,7 @@ static struct ctl_table fs_table[] = { > .data = &inodes_stat, > .maxlen = 2*sizeof(int), > .mode = 0444, > - .proc_handler = &proc_dointvec, > + .proc_handler = &proc_nr_inodes, > }, > { > .ctl_name = FS_STATINODE, > @@ -1213,7 +1213,7 @@ static struct ctl_table fs_table[] = { > .data 
= &inodes_stat, > .maxlen = 7*sizeof(int), > .mode = 0444, > - .proc_handler = &proc_dointvec, > + .proc_handler = &proc_nr_inodes, > }, > { > .procname = "file-nr", > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 2970e35..a71a922 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg) > next_jif = start_jif + dirty_writeback_interval; > nr_to_write = global_page_state(NR_FILE_DIRTY) + > global_page_state(NR_UNSTABLE_NFS) + > - (inodes_stat.nr_inodes - inodes_stat.nr_unused); > + (get_nr_inodes() - inodes_stat.nr_unused); > while (nr_to_write > 0) { > wbc.more_io = 0; > wbc.encountered_congestion = 0; ^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet ` (2 preceding siblings ...) 2008-12-11 22:39 ` [PATCH v3 2/7] fs: Use a percpu_counter to track nr_inodes Eric Dumazet @ 2008-12-11 22:39 ` Eric Dumazet 2007-07-24 1:34 ` Nick Piggin 2008-12-16 21:26 ` Paul E. McKenney 2008-12-11 22:39 ` [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet ` (3 subsequent siblings) 7 siblings, 2 replies; 75+ messages in thread From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney new_inode() dirties a contended cache line to get increasing inode numbers. Solve this problem by giving each cpu a per_cpu variable, fed from the shared last_ino only once every 1024 allocations. This reduces contention on the shared last_ino and gives the same spread of inode numbers as before (same wraparound after 2^32 allocations). Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/inode.c | 35 ++++++++++++++++++++++++++++++++--- 1 files changed, 32 insertions(+), 3 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index f94f889..dc8e72a 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -556,6 +556,36 @@ repeat: return node ? inode : NULL; } +#ifdef CONFIG_SMP +/* + * Each cpu owns a range of 1024 numbers. + * 'shared_last_ino' is dirtied only once out of 1024 allocations, + * to renew the exhausted range.
+ */ +static DEFINE_PER_CPU(int, last_ino); + +static int last_ino_get(void) +{ + static atomic_t shared_last_ino; + int *p = &get_cpu_var(last_ino); + int res = *p; + + if (unlikely((res & 1023) == 0)) + res = atomic_add_return(1024, &shared_last_ino) - 1024; + + *p = ++res; + put_cpu_var(last_ino); + return res; +} +#else +static int last_ino_get(void) +{ + static int last_ino; + + return ++last_ino; +} +#endif + /** * new_inode - obtain an inode * @sb: superblock @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb) * error if st_ino won't fit in target struct field. Use 32bit counter * here to attempt to avoid that. */ - static unsigned int last_ino; struct inode * inode; spin_lock_prefetch(&inode_lock); @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb) inode = alloc_inode(sb); if (inode) { percpu_counter_inc(&nr_inodes); + inode->i_state = 0; + inode->i_ino = last_ino_get(); spin_lock(&inode_lock); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); - inode->i_ino = ++last_ino; - inode->i_state = 0; spin_unlock(&inode_lock); } return inode; ^ permalink raw reply related [flat|nested] 75+ messages in thread
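The range-reservation idea in this patch can be tried outside the kernel. The following is a minimal single-threaded sketch, assuming a fixed number of CPUs whose index is passed explicitly; the names model_last_ino_get, model_shared_last_ino, MODEL_NR_CPUS and MODEL_INO_BATCH are hypothetical and stand in for the per-CPU variable and atomic counter of the patch. Unsigned arithmetic is used so that the wraparound is well defined in plain C.

```c
#include <assert.h>

/* Hypothetical userspace model of the per-CPU range allocator. */
#define MODEL_NR_CPUS   4
#define MODEL_INO_BATCH 1024u   /* must be a power of two */

static unsigned int model_shared_last_ino;          /* shared, rarely written */
static unsigned int model_last_ino[MODEL_NR_CPUS];  /* one slot per CPU */

static unsigned int model_last_ino_get(int cpu)
{
	unsigned int res;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	res = model_last_ino[cpu];

	/* Local range exhausted (or never filled): reserve a fresh one. */
	if ((res & (MODEL_INO_BATCH - 1)) == 0) {
		res = model_shared_last_ino;
		model_shared_last_ino += MODEL_INO_BATCH;
	}

	model_last_ino[cpu] = ++res;
	return res;
}
```

In this model the shared variable is written once per MODEL_INO_BATCH allocations on a given CPU; every other call touches only that CPU's own slot.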
* Re: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator 2008-12-11 22:39 ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet @ 2007-07-24 1:34 ` Nick Piggin 2008-12-16 21:26 ` Paul E. McKenney 1 sibling, 0 replies; 75+ messages in thread From: Nick Piggin @ 2007-07-24 1:34 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney On Friday 12 December 2008 09:39, Eric Dumazet wrote: > new_inode() dirties a contended cache line to get increasing > inode numbers. > > Solve this problem by providing to each cpu a per_cpu variable, > feeded by the shared last_ino, but once every 1024 allocations. > > This reduce contention on the shared last_ino, and give same > spreading ino numbers than before. > (same wraparound after 2^32 allocations) I don't suppose this would cause any filesystems to do silly things? Seems like a good idea, if you could just add a #define instead of 1024. > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > fs/inode.c | 35 ++++++++++++++++++++++++++++++++--- > 1 files changed, 32 insertions(+), 3 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index f94f889..dc8e72a 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -556,6 +556,36 @@ repeat: > return node ? inode : NULL; > } > > +#ifdef CONFIG_SMP > +/* > + * Each cpu owns a range of 1024 numbers. > + * 'shared_last_ino' is dirtied only once out of 1024 allocations, > + * to renew the exhausted range. 
> + */ > +static DEFINE_PER_CPU(int, last_ino); > + > +static int last_ino_get(void) > +{ > + static atomic_t shared_last_ino; > + int *p = &get_cpu_var(last_ino); > + int res = *p; > + > + if (unlikely((res & 1023) == 0)) > + res = atomic_add_return(1024, &shared_last_ino) - 1024; > + > + *p = ++res; > + put_cpu_var(last_ino); > + return res; > +} > +#else > +static int last_ino_get(void) > +{ > + static int last_ino; > + > + return ++last_ino; > +} > +#endif > + > /** > * new_inode - obtain an inode > * @sb: superblock > @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb) > * error if st_ino won't fit in target struct field. Use 32bit counter > * here to attempt to avoid that. > */ > - static unsigned int last_ino; > struct inode * inode; > > spin_lock_prefetch(&inode_lock); > @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb) > inode = alloc_inode(sb); > if (inode) { > percpu_counter_inc(&nr_inodes); > + inode->i_state = 0; > + inode->i_ino = last_ino_get(); > spin_lock(&inode_lock); > list_add(&inode->i_list, &inode_in_use); > list_add(&inode->i_sb_list, &sb->s_inodes); > - inode->i_ino = ++last_ino; > - inode->i_state = 0; > spin_unlock(&inode_lock); > } > return inode; ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator 2008-12-11 22:39 ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet 2007-07-24 1:34 ` Nick Piggin @ 2008-12-16 21:26 ` Paul E. McKenney 1 sibling, 0 replies; 75+ messages in thread From: Paul E. McKenney @ 2008-12-16 21:26 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro On Thu, Dec 11, 2008 at 11:39:18PM +0100, Eric Dumazet wrote: > new_inode() dirties a contended cache line to get increasing > inode numbers. > > Solve this problem by providing to each cpu a per_cpu variable, > feeded by the shared last_ino, but once every 1024 allocations. > > This reduce contention on the shared last_ino, and give same > spreading ino numbers than before. > (same wraparound after 2^32 allocations) One question below, but just a clarification. Works correctly as is, though a bit strangely. Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > fs/inode.c | 35 ++++++++++++++++++++++++++++++++--- > 1 files changed, 32 insertions(+), 3 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index f94f889..dc8e72a 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -556,6 +556,36 @@ repeat: > return node ? inode : NULL; > } > > +#ifdef CONFIG_SMP > +/* > + * Each cpu owns a range of 1024 numbers. > + * 'shared_last_ino' is dirtied only once out of 1024 allocations, > + * to renew the exhausted range. 
> + */ > +static DEFINE_PER_CPU(int, last_ino); > + > +static int last_ino_get(void) > +{ > + static atomic_t shared_last_ino; > + int *p = &get_cpu_var(last_ino); > + int res = *p; > + > + if (unlikely((res & 1023) == 0)) > + res = atomic_add_return(1024, &shared_last_ino) - 1024; > + > + *p = ++res; So the first CPU gets the range [1:1024], the second [1025:2048], and so on, eventually wrapping to [4294966273:0]. Is that the intent? (I don't see a problem with this, just seems a bit strange.) > + put_cpu_var(last_ino); > + return res; > +} > +#else > +static int last_ino_get(void) > +{ > + static int last_ino; > + > + return ++last_ino; > +} > +#endif > + > /** > * new_inode - obtain an inode > * @sb: superblock > @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb) > * error if st_ino won't fit in target struct field. Use 32bit counter > * here to attempt to avoid that. > */ > - static unsigned int last_ino; > struct inode * inode; > > spin_lock_prefetch(&inode_lock); > @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb) > inode = alloc_inode(sb); > if (inode) { > percpu_counter_inc(&nr_inodes); > + inode->i_state = 0; > + inode->i_ino = last_ino_get(); > spin_lock(&inode_lock); > list_add(&inode->i_list, &inode_in_use); > list_add(&inode->i_sb_list, &sb->s_inodes); > - inode->i_ino = ++last_ino; > - inode->i_state = 0; > spin_unlock(&inode_lock); > } > return inode; ^ permalink raw reply [flat|nested] 75+ messages in thread
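The ranges described in the review comment above can be checked with a little arithmetic. This is a sketch only: the helper names model_range_first and model_range_last and the constant MODEL_INO_BATCH are hypothetical, and the code assumes a 32-bit counter with ranges reserved in order, numbered from 0.

```c
#include <stdint.h>

#define MODEL_INO_BATCH 1024u   /* range size used in the patch */

/* First inode number handed out from the n-th reserved range (n from 0). */
static uint32_t model_range_first(uint32_t n)
{
	return n * MODEL_INO_BATCH + 1u;
}

/* Last number of the n-th range; wraps to 0 for the final range. */
static uint32_t model_range_last(uint32_t n)
{
	return (n + 1u) * MODEL_INO_BATCH;
}
```

For n = 0 and n = 1 these give [1:1024] and [1025:2048]. There are 2^32 / 1024 = 4194304 ranges, so the final one is n = 4194303, which gives [4294966273:0].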
* [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet ` (3 preceding siblings ...) 2008-12-11 22:39 ` [PATCH v3 3/7] fs: Introduce a per_cpu last_ino allocator Eric Dumazet @ 2008-12-11 22:39 ` Eric Dumazet [not found] ` <494196AA.6080002-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2008-12-11 22:40 ` [PATCH v3 5/7] fs: new_inode_single() and iput_single() Eric Dumazet ` (2 subsequent siblings) 7 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-12-11 22:39 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney Sockets, pipes and anonymous fds have interesting properties. Like other files, they use a dentry and an inode. But dentries for these kinds of files are not hashed into dcache, since there is no way someone can look up such a file in the vfs tree. (/proc/{pid}/fd/{number} uses a different mechanism) Still, allocating and freeing such dentries is expensive, because we currently take dcache_lock inside d_alloc(), d_instantiate(), and dput(). This lock is heavily contended on SMP machines. This patch defines a new DCACHE_SINGLE flag, to mark a dentry as a single one (for sockets, pipes, anonymous fd), and a new d_alloc_single(const struct qstr *name, struct inode *inode) method, called by the three subsystems. Internally, dput() can take a fast path to dput_single() for SINGLE dentries. No more atomic_dec_and_lock() for such dentries. Differences between a SINGLE dentry and a normal one are : 1) SINGLE dentry has the DCACHE_SINGLE flag 2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED). This avoids taking a reference on the sb 'root' dentry, shared by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

("socketallocbench -n 8" bench result : from 25s to 19.9s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |   16 ------------
 fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
 fs/pipe.c              |   23 +----------------
 include/linux/dcache.h |    9 ++++++
 net/socket.c           |   24 +-----------------
 5 files changed, 65 insertions(+), 58 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..8bf83cb 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
 			     mnt);
 }
 
-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * We faked vfs to believe the dentry was hashed when we created it.
-	 * Now we restore the flag so that dput() will work correctly.
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 1;
-}
-
 static struct file_system_type anon_inode_fs_type = {
 	.name		= "anon_inodefs",
 	.get_sb		= anon_inodefs_get_sb,
 	.kill_sb	= kill_anon_super,
 };
 static struct dentry_operations anon_inodefs_dentry_operations = {
-	.d_delete	= anon_inodefs_delete_dentry,
 };
 
 /**
@@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_single(&this, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;
 
@@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);
 
 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);
 
 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index f463a81..af3bfb3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry)
  */
 
 /*
+ * special version of dput() for pipes/sockets/anon.
+ * These dentries are not present in hash table, we can avoid
+ * taking/dirtying dcache_lock
+ */
+static void dput_single(struct dentry *dentry)
+{
+	struct inode *inode;
+
+	if (!atomic_dec_and_test(&dentry->d_count))
+		return;
+	inode = dentry->d_inode;
+	if (inode)
+		iput(inode);
+	d_free(dentry);
+}
+
+/*
  * dput - release a dentry
  * @dentry: dentry to release
  *
@@ -234,6 +251,11 @@ void dput(struct dentry *dentry)
 {
 	if (!dentry)
 		return;
+	/*
+	 * single dentries (sockets/pipes/anon) fast path
+	 */
+	if (dentry->d_flags & DCACHE_SINGLE)
+		return dput_single(dentry);
 
 repeat:
 	if (atomic_read(&dentry->d_count) == 1)
@@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }
 
+/**
+ * d_alloc_single - allocate SINGLE dentry
+ * @name: dentry name, given in a qstr structure
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate an SINGLE dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory.
+ * - SINGLE dentries have themselves as a parent.
+ * - SINGLE dentries are not hashed into global hash table
+ * - their d_alias list is empty
+ */
+struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
+{
+	struct dentry *entry;
+
+	entry = d_alloc(NULL, name);
+	if (entry) {
+		entry->d_sb = inode->i_sb;
+		entry->d_parent = entry;
+		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
+		entry->d_inode = inode;
+		fsnotify_d_instantiate(entry, inode);
+		security_d_instantiate(entry, inode);
+	}
+	return entry;
+}
+
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4de6dd5 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode)
 }
 
 static struct vfsmount *pipe_mnt __read_mostly;
-static int pipefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
 
 /*
  * pipefs_dname() is called from d_path().
@@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations pipefs_dentry_operations = {
-	.d_delete	= pipefs_delete_dentry,
 	.d_dname	= pipefs_dname,
 };
 
@@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, inode);
 	if (!dentry)
 		goto err_inode;
 
 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);
 
 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..ca8d269 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -176,6 +176,14 @@ d_iput: no no no yes
 #define DCACHE_UNHASHED		0x0010
 
 #define DCACHE_INOTIFY_PARENT_WATCHED	0x0020 /* Parent inode is watched */
+#define DCACHE_SINGLE		0x0040
+	/*
+	 * socket, pipe or anonymous fd dentry
+	 * - SINGLE dentries have themselves as a parent.
+	 * - SINGLE dentries are not hashed into global hash table
+	 * - Their d_alias list is empty
+	 * - They dont need dcache_lock synchronization
+	 */
 
 extern spinlock_t dcache_lock;
 extern seqlock_t rename_lock;
@@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *);
 extern void shrink_dcache_parent(struct dentry *);
 extern void shrink_dcache_for_umount(struct super_block *);
 extern int d_invalidate(struct dentry *);
+extern struct dentry *d_alloc_single(const struct qstr *, struct inode *);
 
 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
diff --git a/net/socket.c b/net/socket.c
index 92764d8..353c928 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -308,18 +308,6 @@ static struct file_system_type sock_fs_type = {
 	.kill_sb = kill_anon_super,
 };
 
-static int sockfs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * At creation time, we pretended this dentry was hashed
-	 * (by clearing DCACHE_UNHASHED bit in d_flags)
-	 * At delete time, we restore the truth : not hashed.
-	 * (so that dput() can proceed correctly)
-	 */
-	dentry->d_flags |= DCACHE_UNHASHED;
-	return 0;
-}
-
 /*
  * sockfs_dname() is called from d_path().
  */
@@ -330,7 +318,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen)
 }
 
 static struct dentry_operations sockfs_dentry_operations = {
-	.d_delete = sockfs_delete_dentry,
 	.d_dname = sockfs_dname,
 };
 
@@ -372,20 +359,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };
+	static const struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_single(&name, SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;
 
 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));
 
 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related [flat|nested] 75+ messages in thread
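For readers following the SINGLE dentry idea in the patch above, here is a minimal user-space sketch of the release fast path, assuming C11 atomics in place of the kernel's atomic_t. It is not kernel code: toy_inode, toy_dentry, toy_alloc_single(), toy_dput_single() and toy_single_demo() are hypothetical names invented for this illustration. It only shows the two properties the patch relies on: the dentry is its own parent, and the last put is a plain atomic decrement with no global lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct inode and struct dentry. */
struct toy_inode {
	atomic_int i_count;
};

struct toy_dentry {
	atomic_int d_count;
	struct toy_dentry *d_parent;
	struct toy_inode *d_inode;
};

/* Demo bookkeeping only; plain ints, so the demo is single-threaded. */
static int toy_inodes_freed;
static int toy_dentries_freed;

/* Drop one inode reference; free the inode when the count hits zero. */
static void toy_iput(struct toy_inode *inode)
{
	if (atomic_fetch_sub(&inode->i_count, 1) == 1) {
		free(inode);
		toy_inodes_freed++;
	}
}

/* Like d_alloc_single(): the dentry is its own parent, so no shared
 * root dentry refcount is touched at allocation time. */
static struct toy_dentry *toy_alloc_single(struct toy_inode *inode)
{
	struct toy_dentry *d = malloc(sizeof(*d));

	if (!d)
		return NULL;
	atomic_init(&d->d_count, 1);
	d->d_parent = d;
	d->d_inode = inode;
	return d;
}

/* Like dput_single(): one atomic decrement, no lock, no parent walk. */
static void toy_dput_single(struct toy_dentry *d)
{
	if (atomic_fetch_sub(&d->d_count, 1) != 1)
		return;		/* other holders remain */
	if (d->d_inode)
		toy_iput(d->d_inode);
	free(d);
	toy_dentries_freed++;
}

/* Returns 1 when the model behaves as described, 0 otherwise. */
int toy_single_demo(void)
{
	struct toy_inode *inode = malloc(sizeof(*inode));
	struct toy_dentry *d;

	toy_inodes_freed = 0;
	toy_dentries_freed = 0;
	if (!inode)
		return 0;
	atomic_init(&inode->i_count, 1);
	d = toy_alloc_single(inode);
	if (!d) {
		free(inode);
		return 0;
	}
	if (d->d_parent != d)
		return 0;
	atomic_fetch_add(&d->d_count, 1);	/* a second holder */
	toy_dput_single(d);			/* 2 -> 1: nothing freed */
	if (toy_dentries_freed != 0 || toy_inodes_freed != 0)
		return 0;
	toy_dput_single(d);			/* 1 -> 0: dentry and inode freed */
	return toy_dentries_freed == 1 && toy_inodes_freed == 1;
}
```

Calling toy_single_demo() from a small main() exercises both the "other holders remain" path and the final release path.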
* Re: [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd
       [not found] ` <494196AA.6080002-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-12-16 21:40 ` Paul E. McKenney
  0 siblings, 0 replies; 75+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

On Thu, Dec 11, 2008 at 11:39:38PM +0100, Eric Dumazet wrote:
> Sockets, pipes and anonymous fds have interesting properties.
>
> Like other files, they use a dentry and an inode.
>
> But dentries for these kind of files are not hashed into dcache,
> since there is no way someone can lookup such a file in the vfs tree.
> (/proc/{pid}/fd/{number} uses a different mechanism)
>
> Still, allocating and freeing such dentries are expensive processes,
> because we currently take dcache_lock inside d_alloc(), d_instantiate(),
> and dput(). This lock is very contended on SMP machines.
>
> This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
> a single one (for sockets, pipes, anonymous fd), and a new
> d_alloc_single(const struct qstr *name, struct inode *inode)
> method, called by the three subsystems.
>
> Internally, dput() can take a fast path to dput_single() for
> SINGLE dentries. No more atomic_dec_and_lock()
> for such dentries.
>
>
> Differences betwen an SINGLE dentry and a normal one are :
>
> 1) SINGLE dentry has the DCACHE_SINGLE flag
> 2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
>    This to avoid taking a reference on sb 'root' dentry, shared
>    by too many dentries.
> 3) They are not hashed into global hash table (DCACHE_UNHASHED)
> 4) Their d_alias list is empty
>
> ("socketallocbench -n 8" bench result : from 25s to 19.9s)

Acked-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

> Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>

^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH v3 5/7] fs: new_inode_single() and iput_single()
  2008-11-29  8:43 ` [PATCH v2 0/5] " Eric Dumazet
                   ` (4 preceding siblings ...)
  2008-12-11 22:39 ` [PATCH v3 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet
@ 2008-12-11 22:40 ` Eric Dumazet
  2008-12-16 21:41   ` Paul E. McKenney
       [not found] ` <493100B0.6090104-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-12-11 22:41 ` [PATCH v3 7/7] fs: MS_NOREFCOUNT Eric Dumazet
  7 siblings, 1 reply; 75+ messages in thread
From: Eric Dumazet @ 2008-12-11 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki,
	linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney

Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
inodes allocation/freeing.

SINGLE dentries are attached to inodes that dont need to be linked
in a list of inodes, being "inode_in_use" or "sb->s_inodes"
As inode_lock was taken only to protect these lists, we avoid taking it
as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying of
three contended cache lines in new_inode(), and five cache lines in iput()

("socketallocbench -n 8" result : from 19.9s to 3.01s)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c   |    2 +-
 fs/dcache.c        |    2 +-
 fs/inode.c         |   29 ++++++++++++++++++++---------
 fs/pipe.c          |    2 +-
 include/linux/fs.h |   12 +++++++++++-
 net/socket.c       |    2 +-
 6 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 8bf83cb..89fd36d 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd);
  */
 static struct inode *anon_inode_mkinode(void)
 {
-	struct inode *inode = new_inode(anon_inode_mnt->mnt_sb);
+	struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb);
 
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
diff --git a/fs/dcache.c b/fs/dcache.c
index af3bfb3..3363853 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry)
 		return;
 	inode = dentry->d_inode;
 	if (inode)
-		iput(inode);
+		iput_single(inode);
 	d_free(dentry);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index dc8e72a..0fdfe1b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode)
 		kmem_cache_free(inode_cachep, (inode));
 }
 
+void iput_single(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_count)) {
+		destroy_inode(inode);
+		percpu_counter_dec(&nr_inodes);
+	}
+}
 
 /*
  * These are initializations that only need to be done
@@ -587,8 +594,9 @@ static int last_ino_get(void)
 #endif
 
 /**
- * new_inode - obtain an inode
+ * __new_inode - obtain an inode
  * @sb: superblock
+ * @single: if true, dont link new inode in a list
  *
  * Allocates a new inode for given superblock. The default gfp_mask
  * for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
@@ -598,7 +606,7 @@ static int last_ino_get(void)
  * newly created inode's mapping
  *
  */
-struct inode *new_inode(struct super_block *sb)
+struct inode *__new_inode(struct super_block *sb, int single)
 {
 	/*
 	 * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
@@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb)
 	 */
 	struct inode * inode;
 
-	spin_lock_prefetch(&inode_lock);
-
 	inode = alloc_inode(sb);
 	if (inode) {
 		percpu_counter_inc(&nr_inodes);
 		inode->i_state = 0;
 		inode->i_ino = last_ino_get();
-		spin_lock(&inode_lock);
-		list_add(&inode->i_list, &inode_in_use);
-		list_add(&inode->i_sb_list, &sb->s_inodes);
-		spin_unlock(&inode_lock);
+		if (single) {
+			INIT_LIST_HEAD(&inode->i_list);
+			INIT_LIST_HEAD(&inode->i_sb_list);
+		} else {
+			spin_lock(&inode_lock);
+			list_add(&inode->i_list, &inode_in_use);
+			list_add(&inode->i_sb_list, &sb->s_inodes);
+			spin_unlock(&inode_lock);
+		}
 	}
 	return inode;
 }
 
-EXPORT_SYMBOL(new_inode);
+EXPORT_SYMBOL(__new_inode);
 
 void unlock_new_inode(struct inode *inode)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 4de6dd5..8c51a0d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = {
 
 static struct inode * get_pipe_inode(void)
 {
-	struct inode *inode = new_inode(pipe_mnt->mnt_sb);
+	struct inode *inode = new_inode_single(pipe_mnt->mnt_sb);
 	struct pipe_inode_info *pipe;
 
 	if (!inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a789346..a702d81 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1899,7 +1899,17 @@ extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void clear_inode(struct inode *);
 extern void destroy_inode(struct inode *);
-extern struct inode *new_inode(struct super_block *);
+extern struct inode *__new_inode(struct super_block *, int);
+static inline struct inode *new_inode(struct super_block *sb)
+{
+	return __new_inode(sb, 0);
+}
+static inline struct inode *new_inode_single(struct super_block *sb)
+{
+	return __new_inode(sb, 1);
+}
+extern void iput_single(struct inode *);
+
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_suid(struct file *);
 
diff --git a/net/socket.c b/net/socket.c
index 353c928..4017409 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -464,7 +464,7 @@ static struct socket *sock_alloc(void)
 	struct inode *inode;
 	struct socket *sock;
 
-	inode = new_inode(sock_mnt->mnt_sb);
+	inode = new_inode_single(sock_mnt->mnt_sb);
 	if (!inode)
 		return NULL;
 

^ permalink raw reply related [flat|nested] 75+ messages in thread
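The split between new_inode() and new_inode_single() in the patch above can be pictured with a small user-space model. This is only a sketch under stated assumptions: toy_list, toy_inode, toy_new_inode() and toy_new_inode_demo() are hypothetical names, a C11 atomic_flag spinlock stands in for inode_lock, and a single global list stands in for inode_in_use and sb->s_inodes. The point it illustrates is that the "single" path initialises the list heads to point at themselves and never takes the shared lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Minimal circular doubly linked list, in the spirit of list_head. */
struct toy_list {
	struct toy_list *next, *prev;
};

static void toy_list_init(struct toy_list *l)
{
	l->next = l;
	l->prev = l;
}

static void toy_list_add(struct toy_list *entry, struct toy_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static void toy_list_del(struct toy_list *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

static int toy_list_empty(const struct toy_list *l)
{
	return l->next == l;
}

/* Stand-ins for inode_lock and the global inode list (assumptions). */
static atomic_flag toy_inode_lock = ATOMIC_FLAG_INIT;
static unsigned long toy_lock_taken;	/* written only under the lock */
static struct toy_list toy_inode_in_use = { &toy_inode_in_use, &toy_inode_in_use };

struct toy_inode {
	struct toy_list i_list;
};

/* single != 0: self-linked list head, shared lock never touched. */
struct toy_inode *toy_new_inode(int single)
{
	struct toy_inode *inode = malloc(sizeof(*inode));

	if (!inode)
		return NULL;
	if (single) {
		toy_list_init(&inode->i_list);
	} else {
		while (atomic_flag_test_and_set_explicit(&toy_inode_lock,
							 memory_order_acquire))
			;	/* spin */
		toy_lock_taken++;
		toy_list_add(&inode->i_list, &toy_inode_in_use);
		atomic_flag_clear_explicit(&toy_inode_lock, memory_order_release);
	}
	return inode;
}

/* Returns 1 if only the non-single allocation took the lock. */
int toy_new_inode_demo(void)
{
	unsigned long before = toy_lock_taken;
	struct toy_inode *s = toy_new_inode(1);
	struct toy_inode *n;
	int ok;

	if (!s)
		return 0;
	ok = toy_lock_taken == before && toy_list_empty(&s->i_list);
	n = toy_new_inode(0);
	if (!n) {
		free(s);
		return 0;
	}
	ok = ok && toy_lock_taken == before + 1 &&
	     !toy_list_empty(&toy_inode_in_use);
	toy_list_del(&n->i_list);	/* demo cleanup, single-threaded */
	ok = ok && toy_list_empty(&toy_inode_in_use);
	free(n);
	free(s);
	return ok;
}
```

Because a self-linked list head is a valid empty list, generic code that later unlinks such an inode would still operate on well-formed pointers, which is presumably why the patch uses INIT_LIST_HEAD() rather than leaving the fields uninitialised.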
* Re: [PATCH v3 5/7] fs: new_inode_single() and iput_single()
  2008-12-11 22:40 ` [PATCH v3 5/7] fs: new_inode_single() and iput_single() Eric Dumazet
@ 2008-12-16 21:41 ` Paul E. McKenney
  0 siblings, 0 replies; 75+ messages in thread
From: Paul E. McKenney @ 2008-12-16 21:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller,
	Rafael J. Wysocki, linux-kernel,
	kernel-testers@vger.kernel.org >> Kernel Testers List,
	Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, linux-fsdevel, Al Viro

On Thu, Dec 11, 2008 at 11:40:07PM +0100, Eric Dumazet wrote:
> Goal of this patch is to not touch inode_lock for socket/pipes/anonfd
> inodes allocation/freeing.
>
> SINGLE dentries are attached to inodes that dont need to be linked
> in a list of inodes, being "inode_in_use" or "sb->s_inodes"
> As inode_lock was taken only to protect these lists, we avoid taking it
> as well.
>
> Using iput_single() from dput_single() avoids taking inode_lock
> at freeing time.
>
> This patch has a very noticeable effect, because we avoid dirtying of
> three contended cache lines in new_inode(), and five cache lines in iput()
>
> ("socketallocbench -n 8" result : from 19.9s to 3.01s)

Nice!

Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU [not found] ` <493100B0.6090104-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> @ 2008-12-11 22:40 ` Eric Dumazet 2007-07-24 1:13 ` Nick Piggin 0 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-12-11 22:40 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro, Paul E. McKenney From: Christoph Lameter <cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU Currently we schedule RCU frees for each file we free separately. That has several drawbacks against the earlier file handling (in 2.6.5 f.e.), which did not require RCU callbacks: 1. Excessive number of RCU callbacks can be generated causing long RCU queues that in turn cause long latencies. We hit SLUB page allocation more often than necessary. 2. The cache hot object is not preserved between free and realloc. A close followed by another open is very fast with the RCUless approach because the last freed object is returned by the slab allocator that is still cache hot. RCU free means that the object is not immediately available again. The new object is cache cold and therefore open/close performance tests show a significant degradation with the RCU implementation. One solution to this problem is to move the RCU freeing into the Slab allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation time. The slab allocator will do RCU frees only when it is necessary to dispose of slabs of objects (rare). So with that approach we can cut out the RCU overhead significantly. 
However, the slab allocator may return the object for another use even
before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
there is the (unlikely) possibility that the object is going to be
switched under us in sections protected by rcu_read_lock() and
rcu_read_unlock(). So we need to verify that we have acquired the correct
object after establishing a stable object reference (incrementing the
refcounter does that).

Signed-off-by: Christoph Lameter <cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
Signed-off-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 Documentation/filesystems/files.txt |   21 ++++++++++++++--
 fs/file_table.c                     |   33 ++++++++++++++++++--------
 include/linux/fs.h                  |    5 ---
 3 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt
index ac2facc..6916baa 100644
--- a/Documentation/filesystems/files.txt
+++ b/Documentation/filesystems/files.txt
@@ -78,13 +78,28 @@ the fdtable structure -
    that look-up may race with the last put() operation on the
    file structure. This is avoided using atomic_long_inc_not_zero()
    on ->f_count :
+   As file structures are allocated with SLAB_DESTROY_BY_RCU,
+   they can also be freed before a RCU grace period, and reused,
+   but still as a struct file.
+   It is necessary to check again after getting
+   a stable reference (ie after atomic_long_inc_not_zero()),
+   that fcheck_files(files, fd) points to the same file.
 
 	rcu_read_lock();
 	file = fcheck_files(files, fd);
 	if (file) {
-		if (atomic_long_inc_not_zero(&file->f_count))
+		if (atomic_long_inc_not_zero(&file->f_count)) {
 			*fput_needed = 1;
-		else
+			/*
+			 * Now we have a stable reference to an object.
+			 * Check if other threads freed file and reallocated it.
+			 */
+			if (file != fcheck_files(files, fd)) {
+				*fput_needed = 0;
+				put_filp(file);
+				file = NULL;
+			}
+		} else
 		/* Didn't get the reference, someone's freed */
 			file = NULL;
 	}
@@ -95,6 +110,8 @@ the fdtable structure -
    atomic_long_inc_not_zero() detects if refcounts is already zero or
    goes to zero during increment. If it does, we fail
    fget()/fget_light().
+   The second call to fcheck_files(files, fd) checks that this filp
+   was not freed, then reused by an other thread.
 
 6. Since both fdtable and file structures can be looked up
    lock-free, they must be installed using rcu_assign_pointer()
diff --git a/fs/file_table.c b/fs/file_table.c
index a46e880..3e9259d 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
 
 static struct percpu_counter nr_files __cacheline_aligned_in_smp;
 
-static inline void file_free_rcu(struct rcu_head *head)
-{
-	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
-	kmem_cache_free(filp_cachep, f);
-}
-
 static inline void file_free(struct file *f)
 {
 	percpu_counter_dec(&nr_files);
 	file_check_state(f);
-	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+	kmem_cache_free(filp_cachep, f);
 }
 
 /*
@@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
 			rcu_read_unlock();
 			return NULL;
 		}
+		/*
+		 * Now we have a stable reference to an object.
+		 * Check if other threads freed file and re-allocated it.
+		 */
+		if (unlikely(file != fcheck_files(files, fd))) {
+			put_filp(file);
+			file = NULL;
+		}
 	}
 	rcu_read_unlock();
 
@@ -333,9 +335,19 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
 		rcu_read_lock();
 		file = fcheck_files(files, fd);
 		if (file) {
-			if (atomic_long_inc_not_zero(&file->f_count))
+			if (atomic_long_inc_not_zero(&file->f_count)) {
 				*fput_needed = 1;
-			else
+				/*
+				 * Now we have a stable reference to an object.
+				 * Check if other threads freed this file and
+				 * re-allocated it.
+				 */
+				if (unlikely(file != fcheck_files(files, fd))) {
+					*fput_needed = 0;
+					put_filp(file);
+					file = NULL;
+				}
+			} else
 				/* Didn't get the reference, someone's freed */
 				file = NULL;
 		}
@@ -402,7 +414,8 @@ void __init files_init(unsigned long mempages)
 	int n;
 
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			SLAB_HWCACHE_ALIGN | SLAB_DESTROY_BY_RCU | SLAB_PANIC,
+			NULL);
 
 	/*
 	 * One file with associated inode and dcache is very roughly 1K.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a702d81..a1f56d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -811,13 +811,8 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
 #define FILE_MNT_WRITE_RELEASED	2
 
 struct file {
-	/*
-	 * fu_list becomes invalid after file_free is called and queued via
-	 * fu_rcuhead for RCU freeing
-	 */
 	union {
 		struct list_head	fu_list;
-		struct rcu_head 	fu_rcuhead;
 	} f_u;
 	struct path		f_path;
 #define f_dentry	f_path.dentry
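The load, pin, re-check sequence that this patch adds to fget() and fget_light() can be sketched in userspace C. This is only an illustration under stated assumptions: `struct obj`, `slot_table`, `pool`, `lookup()` and the `race` hook are hypothetical names, C11 atomics stand in for rcu_dereference() and atomic_long_inc_not_zero(), and the "other thread" is simulated by a callback rather than real concurrency. It is not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the reference count matters. */
struct obj {
	atomic_long refcount;	/* 0 means free; the memory may be reused */
};

#define NSLOTS 4
static _Atomic(struct obj *) slot_table[NSLOTS];	/* stand-in for fdt->fd[] */
static struct obj pool[1];				/* stand-in for the slab */

/* Userspace analogue of atomic_long_inc_not_zero(). */
static bool inc_not_zero(atomic_long *cnt)
{
	long old = atomic_load(cnt);

	while (old != 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(cnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference. A real implementation would free the object at zero. */
static void put_obj(struct obj *o)
{
	atomic_fetch_sub(&o->refcount, 1);
}

/*
 * Load, pin, then re-check the slot. 'race' is a test hook that runs between
 * the load and the pin, standing in for another thread.
 */
static struct obj *lookup(unsigned int idx, void (*race)(void))
{
	struct obj *o;

	if (idx >= NSLOTS)
		return NULL;
	o = atomic_load(&slot_table[idx]);
	if (!o)
		return NULL;
	if (race)
		race();
	if (!inc_not_zero(&o->refcount))
		return NULL;	/* object already free: fail the lookup */
	if (o != atomic_load(&slot_table[idx])) {
		put_obj(o);	/* pinned a recycled object: back off */
		return NULL;
	}
	return o;
}

/* Another thread closes slot 0, then the same memory is reused for slot 1. */
static void close_and_reuse(void)
{
	atomic_store(&slot_table[0], (struct obj *)NULL);
	atomic_store(&pool[0].refcount, 0);	/* freed ... */
	atomic_store(&pool[0].refcount, 1);	/* ... and handed out again */
	atomic_store(&slot_table[1], &pool[0]);
}

static void demo(void)
{
	atomic_store(&pool[0].refcount, 1);
	atomic_store(&slot_table[0], &pool[0]);

	/* No race: the lookup pins the object it found. */
	assert(lookup(0, NULL) == &pool[0]);
	put_obj(&pool[0]);

	/* Race: the pin succeeds on the recycled object, the re-check rejects it. */
	assert(lookup(0, close_and_reuse) == NULL);
	assert(atomic_load(&pool[0].refcount) == 1);
}
```

Calling demo() walks through both cases. The point of the second one is that inc_not_zero() alone cannot detect reuse, because the recycled object has a non-zero count again; only the second read of the slot does.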
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2008-12-11 22:40 ` [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU Eric Dumazet @ 2007-07-24 1:13 ` Nick Piggin 2008-12-12 2:50 ` Nick Piggin 2008-12-12 4:45 ` Eric Dumazet 0 siblings, 2 replies; 75+ messages in thread From: Nick Piggin @ 2007-07-24 1:13 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney On Friday 12 December 2008 09:40, Eric Dumazet wrote: > From: Christoph Lameter <cl@linux-foundation.org> > > [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU > > Currently we schedule RCU frees for each file we free separately. That has > several drawbacks against the earlier file handling (in 2.6.5 f.e.), which > did not require RCU callbacks: > > 1. Excessive number of RCU callbacks can be generated causing long RCU > queues that in turn cause long latencies. We hit SLUB page allocation > more often than necessary. > > 2. The cache hot object is not preserved between free and realloc. A close > followed by another open is very fast with the RCUless approach because > the last freed object is returned by the slab allocator that is > still cache hot. RCU free means that the object is not immediately > available again. The new object is cache cold and therefore open/close > performance tests show a significant degradation with the RCU > implementation. > > One solution to this problem is to move the RCU freeing into the Slab > allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation > time. The slab allocator will do RCU frees only when it is necessary > to dispose of slabs of objects (rare). So with that approach we can cut > out the RCU overhead significantly. 
> > However, the slab allocator may return the object for another use even > before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means > there is the (unlikely) possibility that the object is going to be > switched under us in sections protected by rcu_read_lock() and > rcu_read_unlock(). So we need to verify that we have acquired the correct > object after establishing a stable object reference (incrementing the > refcounter does that). > > > Signed-off-by: Christoph Lameter <cl@linux-foundation.org> > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > --- > Documentation/filesystems/files.txt | 21 ++++++++++++++-- > fs/file_table.c | 33 ++++++++++++++++++-------- > include/linux/fs.h | 5 --- > 3 files changed, 42 insertions(+), 17 deletions(-) > > diff --git a/Documentation/filesystems/files.txt > b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644 > --- a/Documentation/filesystems/files.txt > +++ b/Documentation/filesystems/files.txt > @@ -78,13 +78,28 @@ the fdtable structure - > that look-up may race with the last put() operation on the > file structure. This is avoided using atomic_long_inc_not_zero() > on ->f_count : > + As file structures are allocated with SLAB_DESTROY_BY_RCU, > + they can also be freed before a RCU grace period, and reused, > + but still as a struct file. > + It is necessary to check again after getting > + a stable reference (ie after atomic_long_inc_not_zero()), > + that fcheck_files(files, fd) points to the same file. > > rcu_read_lock(); > file = fcheck_files(files, fd); > if (file) { > - if (atomic_long_inc_not_zero(&file->f_count)) > + if (atomic_long_inc_not_zero(&file->f_count)) { > *fput_needed = 1; > - else > + /* > + * Now we have a stable reference to an object. > + * Check if other threads freed file and reallocated it. 
> +			 */
> +			if (file != fcheck_files(files, fd)) {
> +				*fput_needed = 0;
> +				put_filp(file);
> +				file = NULL;
> +			}
> +		} else
>  		/* Didn't get the reference, someone's freed */
>  			file = NULL;
>  	}
> @@ -95,6 +110,8 @@ the fdtable structure -
>     atomic_long_inc_not_zero() detects if refcounts is already zero or
>     goes to zero during increment. If it does, we fail
>     fget()/fget_light().
> +   The second call to fcheck_files(files, fd) checks that this filp
> +   was not freed, then reused by an other thread.
>
>  6. Since both fdtable and file structures can be looked up
>     lock-free, they must be installed using rcu_assign_pointer()
> diff --git a/fs/file_table.c b/fs/file_table.c
> index a46e880..3e9259d 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
>
>  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
>
> -static inline void file_free_rcu(struct rcu_head *head)
> -{
> -	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
> -	kmem_cache_free(filp_cachep, f);
> -}
> -
>  static inline void file_free(struct file *f)
>  {
>  	percpu_counter_dec(&nr_files);
>  	file_check_state(f);
> -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> +	kmem_cache_free(filp_cachep, f);
>  }
>
>  /*
> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
>  			rcu_read_unlock();
>  			return NULL;
>  		}
> +		/*
> +		 * Now we have a stable reference to an object.
> +		 * Check if other threads freed file and re-allocated it.
> +		 */
> +		if (unlikely(file != fcheck_files(files, fd))) {
> +			put_filp(file);
> +			file = NULL;
> +		}

This is a non-trivial change, because that put_filp may drop the last
reference to the file. So now we have the case where we free the file
from a context in which it had never been allocated.

From a quick glance through the callchains, I can't see an obvious
problem. But it needs to have documentation in put_filp, or at least
a mention in the changelog, and it should also be cc'ed to the security
lists.

Also, it adds code and cost to the get/put path in return for
improvement in the free path. get/put is the more common path, but
it is a small loss for a big improvement. So it might be worth it. But
it is not justified by your microbenchmark. Do we have a more useful
case that it helps?
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2007-07-24 1:13 ` Nick Piggin @ 2008-12-12 2:50 ` Nick Piggin 2008-12-12 4:45 ` Eric Dumazet 1 sibling, 0 replies; 75+ messages in thread From: Nick Piggin @ 2008-12-12 2:50 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney On Tuesday 24 July 2007 11:13, Nick Piggin wrote: > On Friday 12 December 2008 09:40, Eric Dumazet wrote: > > From: Christoph Lameter <cl@linux-foundation.org> > > > > [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU > > > > Currently we schedule RCU frees for each file we free separately. That > > has several drawbacks against the earlier file handling (in 2.6.5 f.e.), > > which did not require RCU callbacks: > > > > 1. Excessive number of RCU callbacks can be generated causing long RCU > > queues that in turn cause long latencies. We hit SLUB page allocation > > more often than necessary. > > > > 2. The cache hot object is not preserved between free and realloc. A > > close followed by another open is very fast with the RCUless approach > > because the last freed object is returned by the slab allocator that is > > still cache hot. RCU free means that the object is not immediately > > available again. The new object is cache cold and therefore open/close > > performance tests show a significant degradation with the RCU > > implementation. > > > > One solution to this problem is to move the RCU freeing into the Slab > > allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation > > time. The slab allocator will do RCU frees only when it is necessary > > to dispose of slabs of objects (rare). So with that approach we can cut > > out the RCU overhead significantly. 
> > > > However, the slab allocator may return the object for another use even > > before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means > > there is the (unlikely) possibility that the object is going to be > > switched under us in sections protected by rcu_read_lock() and > > rcu_read_unlock(). So we need to verify that we have acquired the correct > > object after establishing a stable object reference (incrementing the > > refcounter does that). > > > > > > Signed-off-by: Christoph Lameter <cl@linux-foundation.org> > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > --- > > Documentation/filesystems/files.txt | 21 ++++++++++++++-- > > fs/file_table.c | 33 ++++++++++++++++++-------- > > include/linux/fs.h | 5 --- > > 3 files changed, 42 insertions(+), 17 deletions(-) > > > > diff --git a/Documentation/filesystems/files.txt > > b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644 > > --- a/Documentation/filesystems/files.txt > > +++ b/Documentation/filesystems/files.txt > > @@ -78,13 +78,28 @@ the fdtable structure - > > that look-up may race with the last put() operation on the > > file structure. This is avoided using atomic_long_inc_not_zero() > > on ->f_count : > > + As file structures are allocated with SLAB_DESTROY_BY_RCU, > > + they can also be freed before a RCU grace period, and reused, > > + but still as a struct file. > > + It is necessary to check again after getting > > + a stable reference (ie after atomic_long_inc_not_zero()), > > + that fcheck_files(files, fd) points to the same file. > > > > rcu_read_lock(); > > file = fcheck_files(files, fd); > > if (file) { > > - if (atomic_long_inc_not_zero(&file->f_count)) > > + if (atomic_long_inc_not_zero(&file->f_count)) { > > *fput_needed = 1; > > - else > > + /* > > + * Now we have a stable reference to an object. > > + * Check if other threads freed file and reallocated it. 
> > + */ > > + if (file != fcheck_files(files, fd)) { > > + *fput_needed = 0; > > + put_filp(file); > > + file = NULL; > > + } > > + } else > > /* Didn't get the reference, someone's freed */ > > file = NULL; > > } > > @@ -95,6 +110,8 @@ the fdtable structure - > > atomic_long_inc_not_zero() detects if refcounts is already zero or > > goes to zero during increment. If it does, we fail > > fget()/fget_light(). > > + The second call to fcheck_files(files, fd) checks that this filp > > + was not freed, then reused by an other thread. > > > > 6. Since both fdtable and file structures can be looked up > > lock-free, they must be installed using rcu_assign_pointer() > > diff --git a/fs/file_table.c b/fs/file_table.c > > index a46e880..3e9259d 100644 > > --- a/fs/file_table.c > > +++ b/fs/file_table.c > > @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly; > > > > static struct percpu_counter nr_files __cacheline_aligned_in_smp; > > > > -static inline void file_free_rcu(struct rcu_head *head) > > -{ > > - struct file *f = container_of(head, struct file, f_u.fu_rcuhead); > > - kmem_cache_free(filp_cachep, f); > > -} > > - > > static inline void file_free(struct file *f) > > { > > percpu_counter_dec(&nr_files); > > file_check_state(f); > > - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); > > + kmem_cache_free(filp_cachep, f); > > } > > > > /* > > @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd) > > rcu_read_unlock(); > > return NULL; > > } > > + /* > > + * Now we have a stable reference to an object. > > + * Check if other threads freed file and re-allocated it. > > + */ > > + if (unlikely(file != fcheck_files(files, fd))) { > > + put_filp(file); > > + file = NULL; > > + } > > This is a non-trivial change, because that put_filp may drop the last > reference to the file. So now we have the case where we free the file > from a context in which it had never been allocated. 
> > From a quick glance though the callchains, I can't seen an obvious > problem. But it needs to have documentation in put_filp, or at least > a mention in the changelog, and also cc'ed to the security lists. > > Also, it adds code and cost to the get/put path in return for > improvement in the free path. get/put is the more common path, but > it is a small loss for a big improvement. So it might be worth it. But > it is not justified by your microbenchmark. Do we have a more useful > case that it helps? Sorry, my clock screwed up and I didn't notice :( ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU 2007-07-24 1:13 ` Nick Piggin 2008-12-12 2:50 ` Nick Piggin @ 2008-12-12 4:45 ` Eric Dumazet [not found] ` <4941EC65.5040903-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-12-12 4:45 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney Nick Piggin a écrit : > On Friday 12 December 2008 09:40, Eric Dumazet wrote: >> From: Christoph Lameter <cl@linux-foundation.org> >> >> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU >> >> Currently we schedule RCU frees for each file we free separately. That has >> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which >> did not require RCU callbacks: >> >> 1. Excessive number of RCU callbacks can be generated causing long RCU >> queues that in turn cause long latencies. We hit SLUB page allocation >> more often than necessary. >> >> 2. The cache hot object is not preserved between free and realloc. A close >> followed by another open is very fast with the RCUless approach because >> the last freed object is returned by the slab allocator that is >> still cache hot. RCU free means that the object is not immediately >> available again. The new object is cache cold and therefore open/close >> performance tests show a significant degradation with the RCU >> implementation. >> >> One solution to this problem is to move the RCU freeing into the Slab >> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation >> time. The slab allocator will do RCU frees only when it is necessary >> to dispose of slabs of objects (rare). So with that approach we can cut >> out the RCU overhead significantly. 
>> >> However, the slab allocator may return the object for another use even >> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means >> there is the (unlikely) possibility that the object is going to be >> switched under us in sections protected by rcu_read_lock() and >> rcu_read_unlock(). So we need to verify that we have acquired the correct >> object after establishing a stable object reference (incrementing the >> refcounter does that). >> >> >> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> >> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> >> --- >> Documentation/filesystems/files.txt | 21 ++++++++++++++-- >> fs/file_table.c | 33 ++++++++++++++++++-------- >> include/linux/fs.h | 5 --- >> 3 files changed, 42 insertions(+), 17 deletions(-) >> >> diff --git a/Documentation/filesystems/files.txt >> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644 >> --- a/Documentation/filesystems/files.txt >> +++ b/Documentation/filesystems/files.txt >> @@ -78,13 +78,28 @@ the fdtable structure - >> that look-up may race with the last put() operation on the >> file structure. This is avoided using atomic_long_inc_not_zero() >> on ->f_count : >> + As file structures are allocated with SLAB_DESTROY_BY_RCU, >> + they can also be freed before a RCU grace period, and reused, >> + but still as a struct file. >> + It is necessary to check again after getting >> + a stable reference (ie after atomic_long_inc_not_zero()), >> + that fcheck_files(files, fd) points to the same file. >> >> rcu_read_lock(); >> file = fcheck_files(files, fd); >> if (file) { >> - if (atomic_long_inc_not_zero(&file->f_count)) >> + if (atomic_long_inc_not_zero(&file->f_count)) { >> *fput_needed = 1; >> - else >> + /* >> + * Now we have a stable reference to an object. >> + * Check if other threads freed file and reallocated it. 
>> +			 */
>> +			if (file != fcheck_files(files, fd)) {
>> +				*fput_needed = 0;
>> +				put_filp(file);
>> +				file = NULL;
>> +			}
>> +		} else
>>  		/* Didn't get the reference, someone's freed */
>>  			file = NULL;
>>  	}
>> @@ -95,6 +110,8 @@ the fdtable structure -
>>     atomic_long_inc_not_zero() detects if refcounts is already zero or
>>     goes to zero during increment. If it does, we fail
>>     fget()/fget_light().
>> +   The second call to fcheck_files(files, fd) checks that this filp
>> +   was not freed, then reused by an other thread.
>>
>>  6. Since both fdtable and file structures can be looked up
>>     lock-free, they must be installed using rcu_assign_pointer()
>> diff --git a/fs/file_table.c b/fs/file_table.c
>> index a46e880..3e9259d 100644
>> --- a/fs/file_table.c
>> +++ b/fs/file_table.c
>> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly;
>>
>>  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
>>
>> -static inline void file_free_rcu(struct rcu_head *head)
>> -{
>> -	struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
>> -	kmem_cache_free(filp_cachep, f);
>> -}
>> -
>>  static inline void file_free(struct file *f)
>>  {
>>  	percpu_counter_dec(&nr_files);
>>  	file_check_state(f);
>> -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
>> +	kmem_cache_free(filp_cachep, f);
>>  }
>>
>>  /*
>> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd)
>>  			rcu_read_unlock();
>>  			return NULL;
>>  		}
>> +		/*
>> +		 * Now we have a stable reference to an object.
>> +		 * Check if other threads freed file and re-allocated it.
>> +		 */
>> +		if (unlikely(file != fcheck_files(files, fd))) {
>> +			put_filp(file);
>> +			file = NULL;
>> +		}
>
> This is a non-trivial change, because that put_filp may drop the last
> reference to the file. So now we have the case where we free the file
> from a context in which it had never been allocated.

If we get to this point, then:

We found a non-NULL pointer in our fd table.
Another thread came and closed the file before we had added our reference.
The file was freed (kmem_cache_free(filp_cachep, file)).
The file was reused and inserted into another thread's fd table.
We added our reference to the refcount.
We checked whether this file is still ours (in our fd table).
We found that it is no longer the file we wanted.

Calling put_filp() here is our only choice to safely remove the reference on
a truly allocated file. At this point the file is a truly allocated file,
but it is no longer ours. Unfortunately we added a reference to it, so we
must release it. If the other thread already called put_filp() because it
wanted to close its new file, we must see f_count going to zero, and we must
call __fput() to perform all the relevant file cleanup ourselves.

> From a quick glance though the callchains, I can't seen an obvious
> problem. But it needs to have documentation in put_filp, or at least
> a mention in the changelog, and also cc'ed to the security lists.

I see your point. But currently, any thread can be "releasing the last
reference on a file". That is not always the thread that called close(fd).

We extend this to "any thread of any process", so it might have a security
effect; you are absolutely right.

> Also, it adds code and cost to the get/put path in return for
> improvement in the free path. get/put is the more common path, but
> it is a small loss for a big improvement. So it might be worth it. But
> it is not justified by your microbenchmark. Do we have a more useful
> case that it helps?

Any real-world program that opens and closes files, or better said, that
closes and opens files :)

sizeof(struct file) is 192 bytes. That's three cache lines.
Being able to reuse a hot "struct file" avoids three cache line misses.
That's about 120 ns.
Then, using call_rcu() is also a latency killer, since we explicitly say:
I don't want to free this file right now, I delegate this job to another
layer in two or three milliseconds (or more).

A final point is that SLUB doesn't need to allocate or free a slab in many
cases. (This is probably why Christoph needed this patch in 2006 :) )

In my case, I need all these patches to speed up http servers.
They obviously open and close many files per second.

The added code has a cost of less than 3 ns, but I suspect we can cut it to
less than 1 ns.

Christoph, Paul and I preferred to keep the patch as short as possible, to
focus on the essential points.

               :c0287656:       mov    -0x14(%ebp),%esi
               :c0287659:       mov    -0x24(%ebp),%edi
               :c028765c:       mov    0x4(%esi),%eax
               :c028765f:       cmp    (%eax),%edi
               :c0287661:       jb     c0287678 <fget+0xc8>
               :c0287663:       mov    %ebx,%eax
               :c0287665:       xor    %ebx,%ebx
               :c0287667:       call   c0287450 <put_filp>
               :c028766c:       jmp    c02875ec <fget+0x3c>
               :c0287671:       lea    0x0(%esi,%eiz,1),%esi
               :c0287678:       mov    0x4(%eax),%edi
               :c028767b:       add    %edi,-0x10(%ebp)
               :c028767e:       mov    -0x10(%ebp),%edx
     1 8.8e-05 :c0287681:       mov    (%edx),%eax
               :c0287683:       cmp    %eax,%ebx
               :c0287685:       je     c02875ec <fget+0x3c>
               :c028768b:       jmp    c0287663 <fget+0xb3>

We could avoid doing the full test, because there is no way that
files->max_fds could become lower under us, or even fdt itself, and fdt->fd.

So instead of calling this function twice:

static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd)
{
	struct file * file = NULL;
	struct fdtable *fdt = files_fdtable(files);

	if (fd < fdt->max_fds)
		file = rcu_dereference(fdt->fd[fd]);
	return file;
}

we could use the attached patch.

This becomes a matter of three instructions, including a 99.99% predicted
branch:

c0287646:       8b 03                   mov    (%ebx),%eax
c0287648:       39 45 e4                cmp    %eax,-0x1c(%ebp)
c028764b:       74 a1                   je     c02875ee <fget+0x3e>
c028764d:       8b 45 e4                mov    -0x1c(%ebp),%eax
c0287650:       e8 fb fd ff ff          call   c0287450 <put_filp>
c0287655:       31 c0                   xor    %eax,%eax
c0287657:       eb 98                   jmp    c02875f1 <fget+0x41>

At the time Christoph sent his patch (in 2006), nobody cared, because
we had no benchmark or real-world workload that demonstrated the gain
of his patch, only intuitions. We had too many contended cache lines
that slowed down the whole process.

SLAB_DESTROY_BY_RCU is a must on current hardware, where the cost of
cache line misses becomes really problematic. This patch series
clearly demonstrates it.

Thanks Nick for your feedback and comments.

Eric

[PATCH] fs: optimize fget() & fget_light()

Instead of calling fcheck_files() a second time, we can take into account
we already did part of the job, in a rcu read locked section.

We need a struct file **filp pointer so that we only dereference it
a second time.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/file_table.c |   23 +++++++++++++++++------
 1 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 3e9259d..4bc019f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -289,11 +289,16 @@ void __fput(struct file *file)
 
 struct file *fget(unsigned int fd)
 {
-	struct file *file;
+	struct file *file = NULL, **filp;
 	struct files_struct *files = current->files;
+	struct fdtable *fdt;
 
 	rcu_read_lock();
-	file = fcheck_files(files, fd);
+	fdt = files_fdtable(files);
+	if (likely(fd < fdt->max_fds)) {
+		filp = &fdt->fd[fd];
+		file = rcu_dereference(*filp);
+	}
 	if (file) {
 		if (!atomic_long_inc_not_zero(&file->f_count)) {
 			/* File object ref couldn't be taken */
@@ -304,7 +309,7 @@ struct file *fget(unsigned int fd)
 		 * Now we have a stable reference to an object.
 		 * Check if other threads freed file and re-allocated it.
 		 */
-		if (unlikely(file != fcheck_files(files, fd))) {
+		if (unlikely(file != rcu_dereference(*filp))) {
 			put_filp(file);
 			file = NULL;
 		}
@@ -325,15 +330,21 @@ EXPORT_SYMBOL(fget);
  */
 struct file *fget_light(unsigned int fd, int *fput_needed)
 {
-	struct file *file;
+	struct file *file, **filp;
 	struct files_struct *files = current->files;
+	struct fdtable *fdt;
 
 	*fput_needed = 0;
 	if (likely((atomic_read(&files->count) == 1))) {
 		file = fcheck_files(files, fd);
 	} else {
 		rcu_read_lock();
-		file = fcheck_files(files, fd);
+		fdt = files_fdtable(files);
+		file = NULL;
+		if (likely(fd < fdt->max_fds)) {
+			filp = &fdt->fd[fd];
+			file = rcu_dereference(*filp);
+		}
 		if (file) {
 			if (atomic_long_inc_not_zero(&file->f_count)) {
 				*fput_needed = 1;
@@ -342,7 +353,7 @@ struct file *fget_light(unsigned int fd, int *fput_needed)
 				 * Check if other threads freed this file and
 				 * re-allocated it.
 				 */
-				if (unlikely(file != fcheck_files(files, fd))) {
+				if (unlikely(file != rcu_dereference(*filp))) {
 					*fput_needed = 0;
 					put_filp(file);
 					file = NULL;
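The cache-line estimate in the message above (192 bytes, three cache lines, about 120 ns) can be checked with a few lines of C. This is a back-of-the-envelope sketch, not a measurement: the 64-byte line size is an assumption (it is what "192 bytes = three cache lines" implies), the object is assumed to start on a line boundary, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, as implied by "192 bytes = three lines". */
#define CACHE_LINE_BYTES 64

/* Number of cache lines touched by an object that starts on a line boundary. */
static size_t cache_lines_spanned(size_t object_bytes)
{
	return (object_bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* Per-miss cost implied by a total stall time spread over every line. */
static double ns_per_miss(double total_ns, size_t object_bytes)
{
	return total_ns / (double)cache_lines_spanned(object_bytes);
}

static void check_estimate(void)
{
	/* sizeof(struct file) as quoted in the message. */
	assert(cache_lines_spanned(192) == 3);
	/* 120 ns over three misses is 40 ns per miss. */
	assert(ns_per_miss(120.0, 192) == 40.0);
}
```

Read this way, the quoted figures imply roughly 40 ns per miss, which is the number to weigh against the "less than 3 ns" attributed to the extra re-check.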
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU [not found] ` <4941EC65.5040903-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> @ 2008-12-12 16:48 ` Eric Dumazet [not found] ` <494295C6.2020906-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2008-12-13 1:41 ` Christoph Lameter 1 sibling, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-12-12 16:48 UTC (permalink / raw) To: Christoph Lameter, Paul E. McKenney Cc: Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro Eric Dumazet a écrit : > Nick Piggin a écrit : >> On Friday 12 December 2008 09:40, Eric Dumazet wrote: >>> From: Christoph Lameter <cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> >>> >>> [PATCH] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU >>> >>> Currently we schedule RCU frees for each file we free separately. That has >>> several drawbacks against the earlier file handling (in 2.6.5 f.e.), which >>> did not require RCU callbacks: >>> >>> 1. Excessive number of RCU callbacks can be generated causing long RCU >>> queues that in turn cause long latencies. We hit SLUB page allocation >>> more often than necessary. >>> >>> 2. The cache hot object is not preserved between free and realloc. A close >>> followed by another open is very fast with the RCUless approach because >>> the last freed object is returned by the slab allocator that is >>> still cache hot. RCU free means that the object is not immediately >>> available again. The new object is cache cold and therefore open/close >>> performance tests show a significant degradation with the RCU >>> implementation. 
>>> >>> One solution to this problem is to move the RCU freeing into the Slab >>> allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation >>> time. The slab allocator will do RCU frees only when it is necessary >>> to dispose of slabs of objects (rare). So with that approach we can cut >>> out the RCU overhead significantly. >>> >>> However, the slab allocator may return the object for another use even >>> before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means >>> there is the (unlikely) possibility that the object is going to be >>> switched under us in sections protected by rcu_read_lock() and >>> rcu_read_unlock(). So we need to verify that we have acquired the correct >>> object after establishing a stable object reference (incrementing the >>> refcounter does that). >>> >>> >>> Signed-off-by: Christoph Lameter <cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> >>> Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> >>> Signed-off-by: Paul E. McKenney <paulmck-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> >>> --- >>> Documentation/filesystems/files.txt | 21 ++++++++++++++-- >>> fs/file_table.c | 33 ++++++++++++++++++-------- >>> include/linux/fs.h | 5 --- >>> 3 files changed, 42 insertions(+), 17 deletions(-) >>> >>> diff --git a/Documentation/filesystems/files.txt >>> b/Documentation/filesystems/files.txt index ac2facc..6916baa 100644 >>> --- a/Documentation/filesystems/files.txt >>> +++ b/Documentation/filesystems/files.txt >>> @@ -78,13 +78,28 @@ the fdtable structure - >>> that look-up may race with the last put() operation on the >>> file structure. This is avoided using atomic_long_inc_not_zero() >>> on ->f_count : >>> + As file structures are allocated with SLAB_DESTROY_BY_RCU, >>> + they can also be freed before a RCU grace period, and reused, >>> + but still as a struct file. 
>>> + It is necessary to check again after getting >>> + a stable reference (ie after atomic_long_inc_not_zero()), >>> + that fcheck_files(files, fd) points to the same file. >>> >>> rcu_read_lock(); >>> file = fcheck_files(files, fd); >>> if (file) { >>> - if (atomic_long_inc_not_zero(&file->f_count)) >>> + if (atomic_long_inc_not_zero(&file->f_count)) { >>> *fput_needed = 1; >>> - else >>> + /* >>> + * Now we have a stable reference to an object. >>> + * Check if other threads freed file and reallocated it. >>> + */ >>> + if (file != fcheck_files(files, fd)) { >>> + *fput_needed = 0; >>> + put_filp(file); >>> + file = NULL; >>> + } >>> + } else >>> /* Didn't get the reference, someone's freed */ >>> file = NULL; >>> } >>> @@ -95,6 +110,8 @@ the fdtable structure - >>> atomic_long_inc_not_zero() detects if refcounts is already zero or >>> goes to zero during increment. If it does, we fail >>> fget()/fget_light(). >>> + The second call to fcheck_files(files, fd) checks that this filp >>> + was not freed, then reused by an other thread. >>> >>> 6. 
Since both fdtable and file structures can be looked up >>> lock-free, they must be installed using rcu_assign_pointer() >>> diff --git a/fs/file_table.c b/fs/file_table.c >>> index a46e880..3e9259d 100644 >>> --- a/fs/file_table.c >>> +++ b/fs/file_table.c >>> @@ -37,17 +37,11 @@ static struct kmem_cache *filp_cachep __read_mostly; >>> >>> static struct percpu_counter nr_files __cacheline_aligned_in_smp; >>> >>> -static inline void file_free_rcu(struct rcu_head *head) >>> -{ >>> - struct file *f = container_of(head, struct file, f_u.fu_rcuhead); >>> - kmem_cache_free(filp_cachep, f); >>> -} >>> - >>> static inline void file_free(struct file *f) >>> { >>> percpu_counter_dec(&nr_files); >>> file_check_state(f); >>> - call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); >>> + kmem_cache_free(filp_cachep, f); >>> } >>> >>> /* >>> @@ -306,6 +300,14 @@ struct file *fget(unsigned int fd) >>> rcu_read_unlock(); >>> return NULL; >>> } >>> + /* >>> + * Now we have a stable reference to an object. >>> + * Check if other threads freed file and re-allocated it. >>> + */ >>> + if (unlikely(file != fcheck_files(files, fd))) { >>> + put_filp(file); >>> + file = NULL; >>> + } >> This is a non-trivial change, because that put_filp may drop the last >> reference to the file. So now we have the case where we free the file >> from a context in which it had never been allocated. > > If we got at this point, we : > > Found a non NULL pointer in our fd table. > Then, another thread came, closed the file while we not yet added our reference. > This file was freed (kmem_cache_free(filp_cachep, file)) > This file was reused and inserted on another thread fd table. > We added our reference on refcount. > We checked if this file is still ours (in our fd tab). > We found this file is not anymore the file we wanted. > Calling put_filp() here is our only choice to safely remove the reference on > a truly allocated file. At this point the file is > a truly allocated file but not anymore ours. 
> Unfortunately we added a reference on it : we must release it.
> If the other thread already called put_filp() because it wanted to close its new file,
> we must see f_refcnt going to zero, and we must call __fput(), to perform
> all the relevant file cleanup ourselves.

Reading this mail again, I realise we call put_filp(file), while this should
be fput(file) or put_filp(file); we don't know which.

Damned, this patch is wrong as is.

Christoph, Paul, do you see the problem?

In fget()/fget_light() we don't know if the other thread (the one that re-allocated the file,
and tried to close it while we held a reference on the file) had to call put_filp() or fput()
to release its own reference. So we call atomic_long_dec_and_test() and cannot
take the appropriate action (calling the full __fput() version or the small one,
that some systems use to 'close' a file that was never really opened).

void put_filp(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		security_file_free(file);
		file_kill(file);
		file_free(file);
	}
}

void fput(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count))
		__fput(file);
}

I believe put_filp() is only called on the slow path (error cases).

Should we just zap it and always call fput()?

^ permalink raw reply [flat|nested] 75+ messages in thread
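The lookup sequence discussed in this message (take a reference only if the count is non-zero, then re-check the table slot) can be sketched in userspace with C11 atomics. This is only a model under stated assumptions: `struct obj`, `struct slot`, `inc_not_zero()`, `put_ref()`, `revalidate()` and `lookup()` are hypothetical names, not kernel code, and the RCU read-side section and slab reuse are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the refcount matters here. */
struct obj {
    atomic_long count;          /* 0 means "free, may be reused" */
};

/* Hypothetical stand-in for one fd table slot. */
struct slot {
    _Atomic(struct obj *) ptr;
};

/* Take a reference only if the count is not already zero. */
static bool inc_not_zero(struct obj *o)
{
    long old = atomic_load(&o->count);

    while (old != 0) {
        /* On failure the CAS reloads 'old', so the loop re-tests it. */
        if (atomic_compare_exchange_weak(&o->count, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool put_ref(struct obj *o)
{
    return atomic_fetch_sub(&o->count, 1) == 1;
}

/*
 * Second half of the lookup: we already hold a reference on 'o'.
 * If the slot no longer points at 'o', the object was freed and reused
 * behind our back, so the reference must be given back. '*last' tells
 * the caller whether it now owns the final cleanup.
 */
static struct obj *revalidate(struct slot *s, struct obj *o, bool *last)
{
    *last = false;
    if (o != atomic_load(&s->ptr)) {
        *last = put_ref(o);
        return NULL;
    }
    return o;
}

/* Full lookup: load the slot, pin the object, then re-check the slot. */
static struct obj *lookup(struct slot *s, bool *last)
{
    struct obj *o = atomic_load(&s->ptr);

    *last = false;
    if (!o || !inc_not_zero(o))
        return NULL;
    return revalidate(s, o, last);
}
```

When `*last` comes back true, the model has no way to tell which teardown the previous owner would have run, which is the fput()/put_filp() ambiguity raised above.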
[parent not found: <494295C6.2020906-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>]
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU [not found] ` <494295C6.2020906-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> @ 2008-12-13 2:07 ` Christoph Lameter [not found] ` <Pine.LNX.4.64.0812121958470.15781-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Christoph Lameter @ 2008-12-13 2:07 UTC (permalink / raw) To: Eric Dumazet Cc: Paul E. McKenney, Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro On Fri, 12 Dec 2008, Eric Dumazet wrote: > > a truly allocated file. At this point the file is > > a truly allocated file but not anymore ours. Its a valid file. Does ownership matter here? > Reading again this mail I realise we call put_filp(file), while this should > be fput(file) or put_filp(file), we dont know. > > Damned, this patch is wrong as is. > > Christoph, Paul, do you see the problem ? Yes. > In fget()/fget_light() we dont know if the other thread (the one who re-allocated the file, > and tried to close it while we got a reference on file) had to call put_filp() or fput() > to release its own reference. So we call atomic_long_dec_and_test() and cannot > take the appropriate action (calling the full __fput() version or the small one, > that some systems use to 'close' an not really opened file. The difference is mainly that fput() does full processing whereas put_filp() is used when we know that the file was not fully operational. If the checks in __fput are able to handle the put_filp() situation by not releasing resources that were not allocated then we should be fine. > I believe put_filp() is only called on slowpath (error cases). Looks like it. It seems to assume that no dentry is associated. 
> Should we just zap it and always call fput() ?

Only if fput() can handle partially set up files.

^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <Pine.LNX.4.64.0812121958470.15781-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU [not found] ` <Pine.LNX.4.64.0812121958470.15781-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org> @ 2008-12-17 20:25 ` Eric Dumazet 0 siblings, 0 replies; 75+ messages in thread From: Eric Dumazet @ 2008-12-17 20:25 UTC (permalink / raw) To: Christoph Lameter Cc: Paul E. McKenney, Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro Christoph Lameter a écrit : > On Fri, 12 Dec 2008, Eric Dumazet wrote: > >>> a truly allocated file. At this point the file is >>> a truly allocated file but not anymore ours. > > Its a valid file. Does ownership matter here? > >> Reading again this mail I realise we call put_filp(file), while this should >> be fput(file) or put_filp(file), we dont know. >> >> Damned, this patch is wrong as is. >> >> Christoph, Paul, do you see the problem ? > > Yes. > >> In fget()/fget_light() we dont know if the other thread (the one who re-allocated the file, >> and tried to close it while we got a reference on file) had to call put_filp() or fput() >> to release its own reference. So we call atomic_long_dec_and_test() and cannot >> take the appropriate action (calling the full __fput() version or the small one, >> that some systems use to 'close' an not really opened file. > > The difference is mainly that fput() does full processing whereas > put_filp() is used when we know that the file was not fully operational. > If the checks in __fput are able to handle the put_filp() situation by not > releasing resources that were not allocated then we should be fine. > >> I believe put_filp() is only called on slowpath (error cases). > > Looks like it. It seems to assume that no dentry is associated. 
>
>> Should we just zap it and always call fput() ?
>
> Only if fput() can handle partially setup files.

It can do that if we add a check for NULL dentry in __fput(),
so put_filp() can disappear.

But there is a remaining point where we do an atomic_long_dec_and_test(&...->f_count),
in fs/aio.c, function __aio_put_req(). This one is tricky :(

^ permalink raw reply [flat|nested] 75+ messages in thread
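The idea of a single release path that tolerates partially set up files can be sketched as follows. This is a single-threaded model, not the kernel's __fput(): `struct file_model`, `struct dentry_model`, `release_file()` and `put_file()` are hypothetical names, and a plain `long` replaces the atomic f_count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry; 'released' records the teardown. */
struct dentry_model {
    int released;
};

/* Hypothetical stand-in for struct file. */
struct file_model {
    long count;                   /* plain counter, no atomics in this model */
    struct dentry_model *dentry;  /* NULL if the file was never fully opened */
    int freed;
};

/* One release routine for both cases: skip what was never set up. */
static void release_file(struct file_model *f)
{
    if (f->dentry)
        f->dentry->released = 1;  /* full teardown only when a dentry exists */
    f->freed = 1;                 /* the object itself is always freed */
}

/* Drop one reference; returns 1 if this call released the file. */
static int put_file(struct file_model *f)
{
    assert(f->count > 0);
    if (--f->count != 0)
        return 0;
    release_file(f);
    return 1;
}
```

With one entry point, a caller that drops the last reference does not need to know how far the previous owner got when setting the file up.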
* Re: [PATCH v3 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU [not found] ` <4941EC65.5040903-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2008-12-12 16:48 ` Eric Dumazet @ 2008-12-13 1:41 ` Christoph Lameter 1 sibling, 0 replies; 75+ messages in thread From: Christoph Lameter @ 2008-12-13 1:41 UTC (permalink / raw) To: Eric Dumazet Cc: Nick Piggin, Andrew Morton, Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro, Paul E. McKenney On Fri, 12 Dec 2008, Eric Dumazet wrote: > > This is a non-trivial change, because that put_filp may drop the last > > reference to the file. So now we have the case where we free the file > > from a context in which it had never been allocated. > > If we got at this point, we : > > Found a non NULL pointer in our fd table. > Then, another thread came, closed the file while we not yet added our reference. > This file was freed (kmem_cache_free(filp_cachep, file)) > This file was reused and inserted on another thread fd table. > We added our reference on refcount. > We checked if this file is still ours (in our fd tab). > We found this file is not anymore the file we wanted. > Calling put_filp() here is our only choice to safely remove the reference on > a truly allocated file. At this point the file is > a truly allocated file but not anymore ours. > Unfortunatly we added a reference on it : we must release it. > If the other thread already called put_filp() because it wanted to close its new file, > we must see f_refcnt going to zero, and we must call __fput(), to perform > all the relevant file cleanup ourself. Correct. That was the idea. > A final point is that SLUB doesnt need to allocate or free a slab in many cases. 
> (This is probably why Christoph needed this patch in 2006 :) )

We needed this patch in 2006 because the AIM9 creat-clo test showed
regressions after the rcu free was put in (discovered during SLES11
verification cycle).

All slab allocators do at least defer frees until all objects in the page
are freed if not longer.

> In my case, I need all these patches to speed up http servers.
> They obviously open and close many files per second.

Run AIM9 creat-close tests....

> SLAB_DESTROY_BY_RCU is a must on current hardware, where memory cache line
> miss costs become really problematic. This patch series clearly demonstrates
> it.

Well the issue becomes more severe as accesses to cold memory become more
extensive. Thanks for your work on this.

^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH v3 7/7] fs: MS_NOREFCOUNT 2008-11-29 8:43 ` [PATCH v2 0/5] " Eric Dumazet ` (6 preceding siblings ...) [not found] ` <493100B0.6090104-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> @ 2008-12-11 22:41 ` Eric Dumazet 7 siblings, 0 replies; 75+ messages in thread From: Eric Dumazet @ 2008-12-11 22:41 UTC (permalink / raw) To: Andrew Morton Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro, Paul E. McKenney Some fs are hardwired into kernel, and mntput()/mntget() hit a contended cache line. We define a new superblock flag, MS_NOREFCOUNT, that is set on socket, pipes and anonymous fd superblocks. mntput()/mntget() become null ops on these fs. ("socketallocbench -n 8" result : from 2.20s to 1.64s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/anon_inodes.c | 1 + fs/pipe.c | 3 ++- include/linux/fs.h | 2 ++ include/linux/mount.h | 8 +++----- net/socket.c | 1 + 5 files changed, 9 insertions(+), 6 deletions(-) diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 89fd36d..de0ec3b 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -158,6 +158,7 @@ static int __init anon_inode_init(void) error = PTR_ERR(anon_inode_mnt); goto err_unregister_filesystem; } + anon_inode_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT; anon_inode_inode = anon_inode_mkinode(); if (IS_ERR(anon_inode_inode)) { error = PTR_ERR(anon_inode_inode); diff --git a/fs/pipe.c b/fs/pipe.c index 8c51a0d..f547432 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1078,7 +1078,8 @@ static int __init init_pipe_fs(void) if (IS_ERR(pipe_mnt)) { err = PTR_ERR(pipe_mnt); unregister_filesystem(&pipe_fs_type); - } + } else + pipe_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT; } return err; } diff --git a/include/linux/fs.h b/include/linux/fs.h index a1f56d4..11b0452 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -137,6 +137,8 @@ extern 
int dir_notify_enable; #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */ #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */ #define MS_I_VERSION (1<<23) /* Update inode I_version field */ + +#define MS_NOREFCOUNT (1<<29) /* kernel static mnt : no refcounting needed */ #define MS_ACTIVE (1<<30) #define MS_NOUSER (1<<31) diff --git a/include/linux/mount.h b/include/linux/mount.h index cab2a85..51418b5 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -14,10 +14,8 @@ #include <linux/nodemask.h> #include <linux/spinlock.h> #include <asm/atomic.h> +#include <linux/fs.h> -struct super_block; -struct vfsmount; -struct dentry; struct mnt_namespace; #define MNT_NOSUID 0x01 @@ -73,7 +71,7 @@ struct vfsmount { static inline struct vfsmount *mntget(struct vfsmount *mnt) { - if (mnt) + if (mnt && !(mnt->mnt_sb->s_flags & MS_NOREFCOUNT)) atomic_inc(&mnt->mnt_count); return mnt; } @@ -87,7 +85,7 @@ extern int __mnt_is_readonly(struct vfsmount *mnt); static inline void mntput(struct vfsmount *mnt) { - if (mnt) { + if (mnt && !(mnt->mnt_sb->s_flags & MS_NOREFCOUNT)) { mnt->mnt_expiry_mark = 0; mntput_no_expire(mnt); } diff --git a/net/socket.c b/net/socket.c index 4017409..2534dbc 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2206,6 +2206,7 @@ static int __init sock_init(void) init_inodecache(); register_filesystem(&sock_fs_type); sock_mnt = kern_mount(&sock_fs_type); + sock_mnt->mnt_sb->s_flags |= MS_NOREFCOUNT; /* The real protocol initialization is performed in later initcalls. */ ^ permalink raw reply related [flat|nested] 75+ messages in thread
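The effect of the flag test in mntget()/mntput() can be modelled in userspace as below. This is a sketch only: `MODEL_NOREFCOUNT`, `struct sb_model`, `struct mnt_model`, `mnt_get_model()` and `mnt_put_model()` are hypothetical names, the bit value is copied from the patch, and the expiry handling done by the real mntput() is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Same bit position as MS_NOREFCOUNT in the patch; the name is hypothetical. */
#define MODEL_NOREFCOUNT (1u << 29)

/* Hypothetical stand-ins for struct super_block and struct vfsmount. */
struct sb_model {
    unsigned int s_flags;
};

struct mnt_model {
    struct sb_model *mnt_sb;
    atomic_long mnt_count;
};

/* Take a reference unless the superblock is marked as never going away. */
static struct mnt_model *mnt_get_model(struct mnt_model *mnt)
{
    if (mnt && !(mnt->mnt_sb->s_flags & MODEL_NOREFCOUNT))
        atomic_fetch_add(&mnt->mnt_count, 1);
    return mnt;
}

/* Drop a reference under the same condition, so get/put stay balanced. */
static void mnt_put_model(struct mnt_model *mnt)
{
    if (mnt && !(mnt->mnt_sb->s_flags & MODEL_NOREFCOUNT)) {
        long old = atomic_fetch_sub(&mnt->mnt_count, 1);
        assert(old > 0);
    }
}
```

For a flagged mount the shared counter is never written, so the cache line holding it is no longer bounced between CPUs on every socket or pipe allocation.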
[parent not found: <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>]
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
       [not found] ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-27 1:37 ` Christoph Lameter
       [not found] ` <Pine.LNX.4.64.0811261935330.31159-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org>
  2008-11-29 8:43 ` [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry Eric Dumazet
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 75+ messages in thread
From: Christoph Lameter @ 2008-11-27 1:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Hellwig

On Thu, 27 Nov 2008, Eric Dumazet wrote:

> The last point is about SLUB being hit hard, unless we
> use slub_min_order=3 at boot, or we use Christoph Lameter
> patch (struct file RCU optimizations)
> http://thread.gmane.org/gmane.linux.kernel/418615
>
> If we boot machine with slub_min_order=3, SLUB overhead disappears.

I'd rather not be that drastic. Did you try increasing slub_min_objects
instead? Try 40-100. If we find the right number then we should update
the tuning to make sure that it picks the right slab page sizes.

^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <Pine.LNX.4.64.0811261935330.31159-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
       [not found] ` <Pine.LNX.4.64.0811261935330.31159-dRBSpnHQED8AvxtiuMwx3w@public.gmane.org>
@ 2008-11-27 6:27 ` Eric Dumazet
       [not found] ` <492E3DEF.8030602-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  0 siblings, 1 reply; 75+ messages in thread
From: Eric Dumazet @ 2008-11-27 6:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Hellwig

Christoph Lameter wrote:
> On Thu, 27 Nov 2008, Eric Dumazet wrote:
>
>> The last point is about SLUB being hit hard, unless we
>> use slub_min_order=3 at boot, or we use Christoph Lameter
>> patch (struct file RCU optimizations)
>> http://thread.gmane.org/gmane.linux.kernel/418615
>>
>> If we boot machine with slub_min_order=3, SLUB overhead disappears.
>
>
> I'd rather not be that drastic. Did you try increasing slub_min_objects
> instead? Try 40-100. If we find the right number then we should update
> the tuning to make sure that it picks the right slab page sizes.
>
>

4096/192 = 21

with slub_min_objects=22 :
# cat /sys/kernel/slab/filp/order
1
# time ./socket8
real 0m1.725s
user 0m0.685s
sys 0m12.955s

with slub_min_objects=45 :
# cat /sys/kernel/slab/filp/order
2
# time ./socket8
real 0m1.652s
user 0m0.694s
sys 0m12.367s

with slub_min_objects=80 :
# cat /sys/kernel/slab/filp/order
3
# time ./socket8
real 0m1.642s
user 0m0.719s
sys 0m12.315s

I would say slub_min_objects=45 is the optimal value on 32bit arches to
get acceptable performance on this workload (order=2 for filp kmem_cache)

Note : SLAB here is disastrous, but you already knew that :)

real 0m8.128s
user 0m0.748s
sys 1m3.467s

^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <492E3DEF.8030602-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>]
* Re: [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP
       [not found] ` <492E3DEF.8030602-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-27 14:44 ` Christoph Lameter
  0 siblings, 0 replies; 75+ messages in thread
From: Christoph Lameter @ 2008-11-27 14:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Hellwig, Pekka Enberg

On Thu, 27 Nov 2008, Eric Dumazet wrote:

> with slub_min_objects=45 :
>
> # cat /sys/kernel/slab/filp/order
> 2
> # time ./socket8
> real 0m1.652s
> user 0m0.694s
> sys 0m12.367s

That may be a good value. How many processors do you have? Look at
calculate_order() in mm/slub.c:

	if (!min_objects)
		min_objects = 4 * (fls(nr_cpu_ids) + 1);

We could increase the scaling factor there or start with a minimum of 20
objects?

Try

	min_objects = 20 + 4 * (fls(nr_cpu_ids) + 1);

> I would say slub_min_objects=45 is the optimal value on 32bit arches to
> get acceptable performance on this workload (order=2 for filp kmem_cache)
>
> Note : SLAB here is disastrous, but you already knew that :)

It's good though to have examples where the queue management gets in the
way of performance.

^ permalink raw reply [flat|nested] 75+ messages in thread
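The arithmetic behind the two min_objects formulas can be worked through with a small model. This is a sketch, not mm/slub.c: `fls_model()` mirrors the usual fls() convention (1-based index of the highest set bit, 0 for 0), the helper names are hypothetical, and the 4096-byte page and 192-byte filp object are taken from the "4096/192 = 21" figure quoted in the thread. The real calculate_order() also weighs wasted space, so this does not try to predict the orders reported above.

```c
#include <assert.h>

/* 1-based index of the most significant set bit; 0 when x == 0. */
static int fls_model(unsigned int x)
{
    int bit = 0;

    while (x) {
        bit++;
        x >>= 1;
    }
    return bit;
}

/* Default quoted from calculate_order(): 4 * (fls(nr_cpu_ids) + 1). */
static unsigned int default_min_objects(unsigned int nr_cpu_ids)
{
    return 4u * (unsigned int)(fls_model(nr_cpu_ids) + 1);
}

/* Variant suggested in the reply: the same scaling plus a floor of 20. */
static unsigned int proposed_min_objects(unsigned int nr_cpu_ids)
{
    return 20u + default_min_objects(nr_cpu_ids);
}

/* Objects that fit in a slab of 2^order pages, assuming 4096-byte pages. */
static unsigned int objects_per_slab(unsigned int order, unsigned int size)
{
    assert(size > 0);
    return (4096u << order) / size;
}
```

With nr_cpu_ids = 8, fls is 4, so the default gives 20 objects and the suggested variant gives 40; an order-0 slab of 192-byte objects holds 21.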
* [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry [not found] ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2008-11-27 1:37 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Christoph Lameter @ 2008-11-29 8:43 ` Eric Dumazet 2008-11-29 8:43 ` [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes Eric Dumazet ` (2 subsequent siblings) 4 siblings, 0 replies; 75+ messages in thread From: Eric Dumazet @ 2008-11-29 8:43 UTC (permalink / raw) To: Ingo Molnar, Christoph Hellwig Cc: David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro [-- Attachment #1: Type: text/plain, Size: 632 bytes --] Adding a percpu_counter nr_dentry avoids cache line ping pongs between cpus to maintain this metric, and dcache_lock is no more needed to protect dentry_stat.nr_dentry We centralize nr_dentry updates at the right place : - increments in d_alloc() - decrements in d_free() d_alloc() can avoid taking dcache_lock if parent is NULL (socket8 bench result : 27.5s to 25s) Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> --- fs/dcache.c | 49 +++++++++++++++++++++++++------------------ include/linux/fs.h | 2 + kernel/sysctl.c | 2 - 3 files changed, 32 insertions(+), 21 deletions(-) [-- Attachment #2: nr_dentry.patch --] [-- Type: text/plain, Size: 4891 bytes --] diff --git a/fs/dcache.c b/fs/dcache.c index a1d86c7..46d5d1e 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -61,12 +61,31 @@ static struct kmem_cache *dentry_cache __read_mostly; static unsigned int d_hash_mask __read_mostly; static unsigned int d_hash_shift __read_mostly; static struct hlist_head *dentry_hashtable __read_mostly; +static struct percpu_counter nr_dentry; /* Statistics gathering. 
*/ struct dentry_stat_t dentry_stat = { .age_limit = 45, }; +/* + * Handle nr_dentry sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + dentry_stat.nr_dentry = percpu_counter_sum_positive(&nr_dentry); + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void __d_free(struct dentry *dentry) { WARN_ON(!list_empty(&dentry->d_alias)); @@ -82,8 +101,7 @@ static void d_callback(struct rcu_head *head) } /* - * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry - * inside dcache_lock. + * no dcache_lock, please. */ static void d_free(struct dentry *dentry) { @@ -94,6 +112,7 @@ static void d_free(struct dentry *dentry) __d_free(dentry); else call_rcu(&dentry->d_u.d_rcu, d_callback); + percpu_counter_dec(&nr_dentry); } /* @@ -172,7 +191,6 @@ static struct dentry *d_kill(struct dentry *dentry) struct dentry *parent; list_del(&dentry->d_u.d_child); - dentry_stat.nr_dentry--; /* For d_free, below */ /*drops the locks, at that point nobody can reach this dentry */ dentry_iput(dentry); if (IS_ROOT(dentry)) @@ -619,7 +637,6 @@ void shrink_dcache_sb(struct super_block * sb) static void shrink_dcache_for_umount_subtree(struct dentry *dentry) { struct dentry *parent; - unsigned detached = 0; BUG_ON(!IS_ROOT(dentry)); @@ -678,7 +695,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) } list_del(&dentry->d_u.d_child); - detached++; inode = dentry->d_inode; if (inode) { @@ -696,7 +712,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) * otherwise we ascend to the parent and move to the * next sibling if there is one */ if (!parent) - goto out; + return; dentry = parent; @@ -705,11 +721,6 @@ static void 
shrink_dcache_for_umount_subtree(struct dentry *dentry) dentry = list_entry(dentry->d_subdirs.next, struct dentry, d_u.d_child); } -out: - /* several dentries were freed, need to correct nr_dentry */ - spin_lock(&dcache_lock); - dentry_stat.nr_dentry -= detached; - spin_unlock(&dcache_lock); } /* @@ -943,8 +954,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) dentry->d_flags = DCACHE_UNHASHED; spin_lock_init(&dentry->d_lock); dentry->d_inode = NULL; - dentry->d_parent = NULL; - dentry->d_sb = NULL; dentry->d_op = NULL; dentry->d_fsdata = NULL; dentry->d_mounted = 0; @@ -959,16 +968,15 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) if (parent) { dentry->d_parent = dget(parent); dentry->d_sb = parent->d_sb; + spin_lock(&dcache_lock); + list_add(&dentry->d_u.d_child, &parent->d_subdirs); + spin_unlock(&dcache_lock); } else { + dentry->d_parent = NULL; + dentry->d_sb = NULL; INIT_LIST_HEAD(&dentry->d_u.d_child); } - - spin_lock(&dcache_lock); - if (parent) - list_add(&dentry->d_u.d_child, &parent->d_subdirs); - dentry_stat.nr_dentry++; - spin_unlock(&dcache_lock); - + percpu_counter_inc(&nr_dentry); return dentry; } @@ -2282,6 +2290,7 @@ static void __init dcache_init(void) { int loop; + percpu_counter_init(&nr_dentry, 0); /* * A constructor could be added for stable state like the lists, * but it is probably not worth it because of the cache nature diff --git a/include/linux/fs.h b/include/linux/fs.h index 0dcdd94..c5e7aa5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2216,6 +2216,8 @@ static inline void free_secdata(void *secdata) struct ctl_table; int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 
9d048fa..eebddef 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1243,7 +1243,7 @@ static struct ctl_table fs_table[] = { .data = &dentry_stat, .maxlen = 6*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_dentry, }, { .ctl_name = FS_OVERFLOWUID, ^ permalink raw reply related [flat|nested] 75+ messages in thread
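The batching idea behind a per-cpu counter can be sketched as below. This is a single-threaded model under stated assumptions, not the kernel's percpu_counter: `struct pcpu_counter_model`, `pc_add()`, `pc_sum_positive()`, `NR_SLOTS` and `BATCH` are hypothetical, the slot index stands in for the current CPU, and the lock that protects the shared total is omitted.

```c
#include <assert.h>

#define NR_SLOTS 8   /* hypothetical number of CPUs in the model */
#define BATCH    32  /* hypothetical fold threshold */

struct pcpu_counter_model {
    long global;            /* shared total, touched rarely */
    long local[NR_SLOTS];   /* per-CPU deltas, touched on every update */
};

/* Add 'n' on behalf of 'cpu'; fold into the shared total once per batch. */
static void pc_add(struct pcpu_counter_model *c, int cpu, long n)
{
    long v;

    assert(cpu >= 0 && cpu < NR_SLOTS);
    v = c->local[cpu] + n;
    if (v >= BATCH || v <= -BATCH) {
        c->global += v;     /* the only write to the shared field */
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = v;
    }
}

/* Exact sum over all slots, clamped at zero, as a sysctl reader would want. */
static long pc_sum_positive(const struct pcpu_counter_model *c)
{
    long sum = c->global;
    int i;

    for (i = 0; i < NR_SLOTS; i++)
        sum += c->local[i];
    return sum < 0 ? 0 : sum;
}
```

Updates stay in the caller's own slot until a batch is full, which is why the counter stops being a contended cache line; the price is that an exact read has to walk every slot.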
* [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes [not found] ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2008-11-27 1:37 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Christoph Lameter 2008-11-29 8:43 ` [PATCH v2 1/5] fs: Use a percpu_counter to track nr_dentry Eric Dumazet @ 2008-11-29 8:43 ` Eric Dumazet 2008-11-29 8:44 ` [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet 2008-11-29 8:45 ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet 4 siblings, 0 replies; 75+ messages in thread From: Eric Dumazet @ 2008-11-29 8:43 UTC (permalink / raw) To: Ingo Molnar, Christoph Hellwig Cc: David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro [-- Attachment #1: Type: text/plain, Size: 507 bytes --] Avoids cache line ping pongs between cpus and prepare next patch, because updates of nr_inodes dont need inode_lock anymore. 
(socket8 bench result : no difference at this point) Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> --- fs/fs-writeback.c | 2 +- fs/inode.c | 39 +++++++++++++++++++++++++++++++-------- include/linux/fs.h | 3 +++ kernel/sysctl.c | 4 ++-- mm/page-writeback.c | 2 +- 5 files changed, 38 insertions(+), 12 deletions(-) [-- Attachment #2: nr_inodes.patch --] [-- Type: text/plain, Size: 5626 bytes --] diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index d0ff0b8..b591cdd 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait) unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS); wbc.nr_to_write = nr_dirty + nr_unstable + - (inodes_stat.nr_inodes - inodes_stat.nr_unused) + + (get_nr_inodes() - inodes_stat.nr_unused) + nr_dirty + nr_unstable; wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ sync_sb_inodes(sb, &wbc); diff --git a/fs/inode.c b/fs/inode.c index 0487ddb..f94f889 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -96,9 +96,33 @@ static DEFINE_MUTEX(iprune_mutex); * Statistics gathering.. 
*/ struct inodes_stat_t inodes_stat; +static struct percpu_counter nr_inodes; static struct kmem_cache * inode_cachep __read_mostly; +int get_nr_inodes(void) +{ + return percpu_counter_sum_positive(&nr_inodes); +} + +/* + * Handle nr_dentry sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + inodes_stat.nr_inodes = get_nr_inodes(); + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void wake_up_inode(struct inode *inode) { /* @@ -306,9 +330,7 @@ static void dispose_list(struct list_head *head) destroy_inode(inode); nr_disposed++; } - spin_lock(&inode_lock); - inodes_stat.nr_inodes -= nr_disposed; - spin_unlock(&inode_lock); + percpu_counter_sub(&nr_inodes, nr_disposed); } /* @@ -560,8 +582,8 @@ struct inode *new_inode(struct super_block *sb) inode = alloc_inode(sb); if (inode) { + percpu_counter_inc(&nr_inodes); spin_lock(&inode_lock); - inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); inode->i_ino = ++last_ino; @@ -622,7 +644,7 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h if (set(inode, data)) goto set_failed; - inodes_stat.nr_inodes++; + percpu_counter_inc(&nr_inodes); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); @@ -671,7 +693,7 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he old = find_inode_fast(sb, head, ino); if (!old) { inode->i_ino = ino; - inodes_stat.nr_inodes++; + percpu_counter_inc(&nr_inodes); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); @@ -1042,8 
+1064,8 @@ void generic_delete_inode(struct inode *inode) list_del_init(&inode->i_list); list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; spin_unlock(&inode_lock); + percpu_counter_dec(&nr_inodes); security_inode_delete(inode); @@ -1093,8 +1115,8 @@ static void generic_forget_inode(struct inode *inode) list_del_init(&inode->i_list); list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; spin_unlock(&inode_lock); + percpu_counter_dec(&nr_inodes); if (inode->i_data.nrpages) truncate_inode_pages(&inode->i_data, 0); clear_inode(inode); @@ -1394,6 +1416,7 @@ void __init inode_init(void) { int loop; + percpu_counter_init(&nr_inodes, 0); /* inode slab cache */ inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode), diff --git a/include/linux/fs.h b/include/linux/fs.h index c5e7aa5..2482977 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -47,6 +47,7 @@ struct inodes_stat_t { int dummy[5]; /* padding for sysctl ABI compatibility */ }; extern struct inodes_stat_t inodes_stat; +extern int get_nr_inodes(void); extern int leases_enable, lease_break_time; @@ -2218,6 +2219,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index eebddef..eebed01 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1202,7 +1202,7 @@ static struct ctl_table fs_table[] = { .data = &inodes_stat, .maxlen = 2*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_inodes, }, { .ctl_name = FS_STATINODE, @@ -1210,7 +1210,7 @@ static struct ctl_table fs_table[] = { 
.data = &inodes_stat, .maxlen = 7*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_inodes, }, { .procname = "file-nr", diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2970e35..a71a922 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg) next_jif = start_jif + dirty_writeback_interval; nr_to_write = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS) + - (inodes_stat.nr_inodes - inodes_stat.nr_unused); + (get_nr_inodes() - inodes_stat.nr_unused); while (nr_to_write > 0) { wbc.more_io = 0; wbc.encountered_congestion = 0; ^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd [not found] ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> ` (2 preceding siblings ...) 2008-11-29 8:43 ` [PATCH v2 2/5] fs: Use a percpu_counter to track nr_inodes Eric Dumazet @ 2008-11-29 8:44 ` Eric Dumazet [not found] ` <493100E7.3030907-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2008-11-29 8:45 ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet 4 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-11-29 8:44 UTC (permalink / raw) To: Ingo Molnar, Christoph Hellwig Cc: David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro [-- Attachment #1: Type: text/plain, Size: 1628 bytes --] Sockets, pipes and anonymous fds have interesting properties. Like other files, they use a dentry and an inode. But dentries for these kind of files are not hashed into dcache, since there is no way someone can lookup such a file in the vfs tree. (/proc/{pid}/fd/{number} uses a different mechanism) Still, allocating and freeing such dentries are expensive processes, because we currently take dcache_lock inside d_alloc(), d_instantiate(), and dput(). This lock is very contended on SMP machines. This patch defines a new DCACHE_SINGLE flag, to mark a dentry as a single one (for sockets, pipes, anonymous fd), and a new d_alloc_single(const struct qstr *name, struct inode *inode) method, called by the three subsystems. Internally, dput() can take a fast path to dput_single() for SINGLE dentries. No more atomic_dec_and_lock() for such dentries. 
Differences between a SINGLE dentry and a normal one are :

1) SINGLE dentry has the DCACHE_SINGLE flag
2) SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
   This avoids taking a reference on the sb 'root' dentry, which is
   shared by too many dentries.
3) They are not hashed into global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)

Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
---
 fs/anon_inodes.c       |   16 ------------
 fs/dcache.c            |   51 +++++++++++++++++++++++++++++++++++++++
 fs/pipe.c              |   23 +----------------
 include/linux/dcache.h |    9 ++++++
 net/socket.c           |   24 +-----------------
 5 files changed, 65 insertions(+), 58 deletions(-)

[-- Attachment #2: dcache_single.patch --]
[-- Type: text/plain, Size: 7886 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..8bf83cb 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -33,23 +33,12 @@ static int anon_inodefs_get_sb(struct file_system_type *fs_type, int flags,
 			   mnt);
 }
 
-static int anon_inodefs_delete_dentry(struct dentry *dentry)
-{
-	/*
-	 * We faked vfs to believe the dentry was hashed when we created it.
-	 * Now we restore the flag so that dput() will work correctly.
- */ - dentry->d_flags |= DCACHE_UNHASHED; - return 1; -} - static struct file_system_type anon_inode_fs_type = { .name = "anon_inodefs", .get_sb = anon_inodefs_get_sb, .kill_sb = kill_anon_super, }; static struct dentry_operations anon_inodefs_dentry_operations = { - .d_delete = anon_inodefs_delete_dentry, }; /** @@ -92,7 +81,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops, this.name = name; this.len = strlen(name); this.hash = 0; - dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this); + dentry = d_alloc_single(&this, anon_inode_inode); if (!dentry) goto err_put_unused_fd; @@ -104,9 +93,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops, atomic_inc(&anon_inode_inode->i_count); dentry->d_op = &anon_inodefs_dentry_operations; - /* Do not publish this dentry inside the global dentry hash table */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, anon_inode_inode); error = -ENFILE; file = alloc_file(anon_inode_mnt, dentry, diff --git a/fs/dcache.c b/fs/dcache.c index 46d5d1e..35d4a25 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -219,6 +219,23 @@ static struct dentry *d_kill(struct dentry *dentry) */ /* + * special version of dput() for pipes/sockets/anon. 
+ * These dentries are not present in hash table, we can avoid + * taking/dirtying dcache_lock + */ +static void dput_single(struct dentry *dentry) +{ + struct inode *inode; + + if (!atomic_dec_and_test(&dentry->d_count)) + return; + inode = dentry->d_inode; + if (inode) + iput(inode); + d_free(dentry); +} + +/* * dput - release a dentry * @dentry: dentry to release * @@ -234,6 +251,11 @@ void dput(struct dentry *dentry) { if (!dentry) return; + /* + * single dentries (sockets/pipes/anon) fast path + */ + if (dentry->d_flags & DCACHE_SINGLE) + return dput_single(dentry); repeat: if (atomic_read(&dentry->d_count) == 1) @@ -1119,6 +1141,35 @@ struct dentry * d_alloc_root(struct inode * root_inode) return res; } +/** + * d_alloc_single - allocate SINGLE dentry + * @name: dentry name, given in a qstr structure + * @inode: inode to allocate the dentry for + * + * Allocate an SINGLE dentry for the inode given. The inode is + * instantiated and returned. %NULL is returned if there is insufficient + * memory. + * - SINGLE dentries have themselves as a parent. 
+ * - SINGLE dentries are not hashed into global hash table + * - their d_alias list is empty + */ +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode) +{ + struct dentry *entry; + + entry = d_alloc(NULL, name); + if (entry) { + entry->d_sb = inode->i_sb; + entry->d_parent = entry; + entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED; + entry->d_inode = inode; + fsnotify_d_instantiate(entry, inode); + security_d_instantiate(entry, inode); + } + return entry; +} + + static inline struct hlist_head *d_hash(struct dentry *parent, unsigned long hash) { diff --git a/fs/pipe.c b/fs/pipe.c index 7aea8b8..4de6dd5 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -849,17 +849,6 @@ void free_pipe_info(struct inode *inode) } static struct vfsmount *pipe_mnt __read_mostly; -static int pipefs_delete_dentry(struct dentry *dentry) -{ - /* - * At creation time, we pretended this dentry was hashed - * (by clearing DCACHE_UNHASHED bit in d_flags) - * At delete time, we restore the truth : not hashed. - * (so that dput() can proceed correctly) - */ - dentry->d_flags |= DCACHE_UNHASHED; - return 0; -} /* * pipefs_dname() is called from d_path(). @@ -871,7 +860,6 @@ static char *pipefs_dname(struct dentry *dentry, char *buffer, int buflen) } static struct dentry_operations pipefs_dentry_operations = { - .d_delete = pipefs_delete_dentry, .d_dname = pipefs_dname, }; @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags) struct inode *inode; struct file *f; struct dentry *dentry; - struct qstr name = { .name = "" }; + static const struct qstr name = { .name = "" }; err = -ENFILE; inode = get_pipe_inode(); @@ -926,18 +914,11 @@ struct file *create_write_pipe(int flags) goto err; err = -ENOMEM; - dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name); + dentry = d_alloc_single(&name, inode); if (!dentry) goto err_inode; dentry->d_op = &pipefs_dentry_operations; - /* - * We dont want to publish this dentry into global dentry hash table. 
- * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED - * This permits a working /proc/$pid/fd/XXX on pipes - */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, inode); err = -ENFILE; f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops); diff --git a/include/linux/dcache.h b/include/linux/dcache.h index a37359d..ca8d269 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -176,6 +176,14 @@ d_iput: no no no yes #define DCACHE_UNHASHED 0x0010 #define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched */ +#define DCACHE_SINGLE 0x0040 + /* + * socket, pipe or anonymous fd dentry + * - SINGLE dentries have themselves as a parent. + * - SINGLE dentries are not hashed into global hash table + * - Their d_alias list is empty + * - They dont need dcache_lock synchronization + */ extern spinlock_t dcache_lock; extern seqlock_t rename_lock; @@ -235,6 +243,7 @@ extern void shrink_dcache_sb(struct super_block *); extern void shrink_dcache_parent(struct dentry *); extern void shrink_dcache_for_umount(struct super_block *); extern int d_invalidate(struct dentry *); +extern struct dentry *d_alloc_single(const struct qstr *, struct inode *); /* only used at mount-time */ extern struct dentry * d_alloc_root(struct inode *); diff --git a/net/socket.c b/net/socket.c index e9d65ea..231cd66 100644 --- a/net/socket.c +++ b/net/socket.c @@ -307,18 +307,6 @@ static struct file_system_type sock_fs_type = { .kill_sb = kill_anon_super, }; -static int sockfs_delete_dentry(struct dentry *dentry) -{ - /* - * At creation time, we pretended this dentry was hashed - * (by clearing DCACHE_UNHASHED bit in d_flags) - * At delete time, we restore the truth : not hashed. - * (so that dput() can proceed correctly) - */ - dentry->d_flags |= DCACHE_UNHASHED; - return 0; -} - /* * sockfs_dname() is called from d_path(). 
*/ @@ -329,7 +317,6 @@ static char *sockfs_dname(struct dentry *dentry, char *buffer, int buflen) } static struct dentry_operations sockfs_dentry_operations = { - .d_delete = sockfs_delete_dentry, .d_dname = sockfs_dname, }; @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags) static int sock_attach_fd(struct socket *sock, struct file *file, int flags) { struct dentry *dentry; - struct qstr name = { .name = "" }; + static const struct qstr name = { .name = "" }; - dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name); + dentry = d_alloc_single(&name, SOCK_INODE(sock)); if (unlikely(!dentry)) return -ENOMEM; dentry->d_op = &sockfs_dentry_operations; - /* - * We dont want to push this dentry into global dentry hash table. - * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED - * This permits a working /proc/$pid/fd/XXX on sockets - */ - dentry->d_flags &= ~DCACHE_UNHASHED; - d_instantiate(dentry, SOCK_INODE(sock)); sock->file = file; init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE, ^ permalink raw reply related [flat|nested] 75+ messages in thread
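The fast path this patch adds to dput() (test a flag, then drop the reference without going near the shared lock) can be illustrated outside the kernel. What follows is only a userspace sketch under stated assumptions: it uses C11 atomics, and `struct obj`, `OBJ_SINGLE`, `obj_put()` and `obj_put_slow()` are hypothetical names invented for the illustration, not kernel APIs. The slow path is reduced to a call counter standing in for the dcache_lock-protected path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define OBJ_SINGLE 0x1u		/* hypothetical flag, plays the role of DCACHE_SINGLE */

struct obj {
	atomic_int refcount;
	unsigned int flags;
	bool freed;		/* stands in for d_free() having run */
};

/* Counts how often the expensive path was taken (illustration only). */
static int slow_path_calls;

/* Stand-in for the regular path, which in the kernel takes dcache_lock. */
static void obj_put_slow(struct obj *o)
{
	slow_path_calls++;
	if (atomic_fetch_sub(&o->refcount, 1) == 1)
		o->freed = true;
}

/* Release one reference; SINGLE objects never reach the slow path. */
static void obj_put(struct obj *o)
{
	if (!o)
		return;
	if (o->flags & OBJ_SINGLE) {
		/* Plain atomic decrement: no shared lock is touched. */
		if (atomic_fetch_sub(&o->refcount, 1) == 1)
			o->freed = true;
		return;
	}
	obj_put_slow(o);
}
```

The point of the design is that the flag test is a read of data the caller already owns, so objects that were never published in a shared structure pay only for one atomic decrement on release.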
* Re: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd
  [not found] ` <493100E7.3030907-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-29 10:38 ` Jörn Engel
  [not found] ` <20081129103836.GA11959-PCqxUs/MD9bYtjvyW6yDsg@public.gmane.org>
  0 siblings, 1 reply; 75+ messages in thread
From: Jörn Engel @ 2008-11-29 10:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

On Sat, 29 November 2008 09:44:23 +0100, Eric Dumazet wrote:
>
> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
> +{
> +	struct dentry *entry;
> +
> +	entry = d_alloc(NULL, name);
> +	if (entry) {
> +		entry->d_sb = inode->i_sb;
> +		entry->d_parent = entry;
> +		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
> +		entry->d_inode = inode;
> +		fsnotify_d_instantiate(entry, inode);
> +		security_d_instantiate(entry, inode);
> +	}
> +	return entry;

Calling the struct dentry entry had me confused a bit.  I believe
everyone else (including the code you removed) uses dentry.

> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
>  	struct inode *inode;
>  	struct file *f;
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };
>
>  	err = -ENFILE;
>  	inode = get_pipe_inode();
...
> @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
>  static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
>  {
>  	struct dentry *dentry;
> -	struct qstr name = { .name = "" };
> +	static const struct qstr name = { .name = "" };

These two could even be combined.

And of course I realize that I comment on absolute trivialities.  On the
whole, I couldn't spot a real problem in your patches.

Jörn

-- 
Public Domain  - Free as in Beer
General Public - Free as in Speech
BSD License    - Free as in Enterprise
Shared Source  - Free as in "Work will make you..."
^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd
  [not found] ` <20081129103836.GA11959-PCqxUs/MD9bYtjvyW6yDsg@public.gmane.org>
@ 2008-11-29 11:14 ` Eric Dumazet
  0 siblings, 0 replies; 75+ messages in thread
From: Eric Dumazet @ 2008-11-29 11:14 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro

Jörn Engel wrote:
> On Sat, 29 November 2008 09:44:23 +0100, Eric Dumazet wrote:
>> +struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
>> +{
>> +	struct dentry *entry;
>> +
>> +	entry = d_alloc(NULL, name);
>> +	if (entry) {
>> +		entry->d_sb = inode->i_sb;
>> +		entry->d_parent = entry;
>> +		entry->d_flags |= DCACHE_SINGLE | DCACHE_DISCONNECTED;
>> +		entry->d_inode = inode;
>> +		fsnotify_d_instantiate(entry, inode);
>> +		security_d_instantiate(entry, inode);
>> +	}
>> +	return entry;
>
> Calling the struct dentry entry had me confused a bit.  I believe
> everyone else (including the code you removed) uses dentry.

Ah yes, it seems I took it from d_instantiate(). I guess a cleanup
patch would be nice.

>
>> @@ -918,7 +906,7 @@ struct file *create_write_pipe(int flags)
>>  	struct inode *inode;
>>  	struct file *f;
>>  	struct dentry *dentry;
>> -	struct qstr name = { .name = "" };
>> +	static const struct qstr name = { .name = "" };
>>
>>  	err = -ENFILE;
>>  	inode = get_pipe_inode();
> ...
>> @@ -371,20 +358,13 @@ static int sock_alloc_fd(struct file **filep, int flags)
>>  static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
>>  {
>>  	struct dentry *dentry;
>> -	struct qstr name = { .name = "" };
>> +	static const struct qstr name = { .name = "" };
>
> These two could even be combined.
>
> And of course I realize that I comment on absolute trivialities.  On the
> whole, I couldn't spot a real problem in your patches.

Well, at least you reviewed it, that is the important point!

Thanks Jörn
^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH v2 5/5] fs: new_inode_single() and iput_single() [not found] ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> ` (3 preceding siblings ...) 2008-11-29 8:44 ` [PATCH v2 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd Eric Dumazet @ 2008-11-29 8:45 ` Eric Dumazet 2008-11-29 11:14 ` Jörn Engel 4 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-11-29 8:45 UTC (permalink / raw) To: Ingo Molnar, Christoph Hellwig Cc: David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA@public.gmane.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro [-- Attachment #1: Type: text/plain, Size: 931 bytes --] Goal of this patch is to not touch inode_lock for socket/pipes/anonfd inodes allocation/freeing. SINGLE dentries are attached to inodes that dont need to be linked in a list of inodes, being "inode_in_use" or "sb->s_inodes" As inode_lock was taken only to protect these lists, we avoid taking it as well. Using iput_single() from dput_single() avoids taking inode_lock at freeing time. 
This patch has a very noticeable effect, because we avoid dirtying of three contended cache lines in new_inode(), and five cache lines in iput() (socket8 bench result : from 19.9s to 2.3s) Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> --- fs/anon_inodes.c | 2 +- fs/dcache.c | 2 +- fs/inode.c | 29 ++++++++++++++++++++--------- fs/pipe.c | 2 +- include/linux/fs.h | 12 +++++++++++- net/socket.c | 2 +- 6 files changed, 35 insertions(+), 14 deletions(-) [-- Attachment #2: new_inode_single.patch --] [-- Type: text/plain, Size: 4080 bytes --] diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 8bf83cb..89fd36d 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -125,7 +125,7 @@ EXPORT_SYMBOL_GPL(anon_inode_getfd); */ static struct inode *anon_inode_mkinode(void) { - struct inode *inode = new_inode(anon_inode_mnt->mnt_sb); + struct inode *inode = new_inode_single(anon_inode_mnt->mnt_sb); if (!inode) return ERR_PTR(-ENOMEM); diff --git a/fs/dcache.c b/fs/dcache.c index 35d4a25..3aa9ed5 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -231,7 +231,7 @@ static void dput_single(struct dentry *dentry) return; inode = dentry->d_inode; if (inode) - iput(inode); + iput_single(inode); d_free(dentry); } diff --git a/fs/inode.c b/fs/inode.c index dc8e72a..0fdfe1b 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -221,6 +221,13 @@ void destroy_inode(struct inode *inode) kmem_cache_free(inode_cachep, (inode)); } +void iput_single(struct inode *inode) +{ + if (atomic_dec_and_test(&inode->i_count)) { + destroy_inode(inode); + percpu_counter_dec(&nr_inodes); + } +} /* * These are initializations that only need to be done @@ -587,8 +594,9 @@ static int last_ino_get(void) #endif /** - * new_inode - obtain an inode + * __new_inode - obtain an inode * @sb: superblock + * @single: if true, dont link new inode in a list * * Allocates a new inode for given superblock. The default gfp_mask * for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE. 
@@ -598,7 +606,7 @@ static int last_ino_get(void) * newly created inode's mapping * */ -struct inode *new_inode(struct super_block *sb) +struct inode *__new_inode(struct super_block *sb, int single) { /* * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW @@ -607,22 +615,25 @@ struct inode *new_inode(struct super_block *sb) */ struct inode * inode; - spin_lock_prefetch(&inode_lock); - inode = alloc_inode(sb); if (inode) { percpu_counter_inc(&nr_inodes); inode->i_state = 0; inode->i_ino = last_ino_get(); - spin_lock(&inode_lock); - list_add(&inode->i_list, &inode_in_use); - list_add(&inode->i_sb_list, &sb->s_inodes); - spin_unlock(&inode_lock); + if (single) { + INIT_LIST_HEAD(&inode->i_list); + INIT_LIST_HEAD(&inode->i_sb_list); + } else { + spin_lock(&inode_lock); + list_add(&inode->i_list, &inode_in_use); + list_add(&inode->i_sb_list, &sb->s_inodes); + spin_unlock(&inode_lock); + } } return inode; } -EXPORT_SYMBOL(new_inode); +EXPORT_SYMBOL(__new_inode); void unlock_new_inode(struct inode *inode) { diff --git a/fs/pipe.c b/fs/pipe.c index 4de6dd5..8c51a0d 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -865,7 +865,7 @@ static struct dentry_operations pipefs_dentry_operations = { static struct inode * get_pipe_inode(void) { - struct inode *inode = new_inode(pipe_mnt->mnt_sb); + struct inode *inode = new_inode_single(pipe_mnt->mnt_sb); struct pipe_inode_info *pipe; if (!inode) diff --git a/include/linux/fs.h b/include/linux/fs.h index 2482977..b3daffc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1898,7 +1898,17 @@ extern void __iget(struct inode * inode); extern void iget_failed(struct inode *); extern void clear_inode(struct inode *); extern void destroy_inode(struct inode *); -extern struct inode *new_inode(struct super_block *); +extern struct inode *__new_inode(struct super_block *, int); +static inline struct inode *new_inode(struct super_block *sb) +{ + return __new_inode(sb, 0); +} +static inline struct inode 
*new_inode_single(struct super_block *sb) +{ + return __new_inode(sb, 1); +} +extern void iput_single(struct inode *); + extern int should_remove_suid(struct dentry *); extern int file_remove_suid(struct file *); diff --git a/net/socket.c b/net/socket.c index 231cd66..f1e656c 100644 --- a/net/socket.c +++ b/net/socket.c @@ -463,7 +463,7 @@ static struct socket *sock_alloc(void) struct inode *inode; struct socket *sock; - inode = new_inode(sock_mnt->mnt_sb); + inode = new_inode_single(sock_mnt->mnt_sb); if (!inode) return NULL; ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH v2 5/5] fs: new_inode_single() and iput_single()
  2008-11-29 8:45 ` [PATCH v2 5/5] fs: new_inode_single() and iput_single() Eric Dumazet
@ 2008-11-29 11:14 ` Jörn Engel
  0 siblings, 0 replies; 75+ messages in thread
From: Jörn Engel @ 2008-11-29 11:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Christoph Hellwig, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro

On Sat, 29 November 2008 09:45:09 +0100, Eric Dumazet wrote:
>
> +void iput_single(struct inode *inode)
> +{
> +	if (atomic_dec_and_test(&inode->i_count)) {
> +		destroy_inode(inode);
> +		percpu_counter_dec(&nr_inodes);
> +	}
> +}

I wonder if it is possible to avoid the atomic_dec_and_test() here, at
least in the common case, and combine it with the atomic_dec_and_test()
of the dentry.  A quick look at fs/inode.c indicates that inode->i_count
may never get changed for a SINGLE inode, except during creation or
deletion.

It might be worth to
- remove the conditional from iput_single() and measure that it makes a
  difference,
- poison SINGLE inodes with some value and
- put a BUG_ON() in __iget() that checks for the poison value.

I _think_ the BUG_ON() is unnecessary, but at least my brain is not
sufficient to convince me.  Can inotify somehow get a hold of a socket?
Or dquot (how insane would that be?)

Jörn

-- 
Mac is for working,
Linux is for Networking,
Windows is for Solitaire!
-- stolen from dc
^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH v2 3/5] fs: Introduce a per_cpu last_ino allocator
  2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet
  ` (3 preceding siblings ...)
  [not found] ` <492DDB6A.8090806-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-29 8:44 ` Eric Dumazet
  4 siblings, 0 replies; 75+ messages in thread
From: Eric Dumazet @ 2008-11-29 8:44 UTC (permalink / raw)
  To: Ingo Molnar, Christoph Hellwig
  Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers@vger.kernel.org >> Kernel Testers List, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, linux-fsdevel, Al Viro

[-- Attachment #1: Type: text/plain, Size: 505 bytes --]

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by giving each cpu a per_cpu variable, fed
from the shared last_ino only once every 1024 allocations.

This reduces contention on the shared last_ino, and gives the same
spread of inode numbers as before (same wraparound after 2^32
allocations).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/inode.c |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)

[-- Attachment #2: last_ino.patch --]
[-- Type: text/plain, Size: 1511 bytes --]

diff --git a/fs/inode.c b/fs/inode.c
index f94f889..dc8e72a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -556,6 +556,36 @@ repeat:
 	return node ? inode : NULL;
 }
 
+#ifdef CONFIG_SMP
+/*
+ * Each cpu owns a range of 1024 numbers.
+ * 'shared_last_ino' is dirtied only once out of 1024 allocations,
+ * to renew the exhausted range.
+ */ +static DEFINE_PER_CPU(int, last_ino); + +static int last_ino_get(void) +{ + static atomic_t shared_last_ino; + int *p = &get_cpu_var(last_ino); + int res = *p; + + if (unlikely((res & 1023) == 0)) + res = atomic_add_return(1024, &shared_last_ino) - 1024; + + *p = ++res; + put_cpu_var(last_ino); + return res; +} +#else +static int last_ino_get(void) +{ + static int last_ino; + + return ++last_ino; +} +#endif + /** * new_inode - obtain an inode * @sb: superblock @@ -575,7 +605,6 @@ struct inode *new_inode(struct super_block *sb) * error if st_ino won't fit in target struct field. Use 32bit counter * here to attempt to avoid that. */ - static unsigned int last_ino; struct inode * inode; spin_lock_prefetch(&inode_lock); @@ -583,11 +612,11 @@ struct inode *new_inode(struct super_block *sb) inode = alloc_inode(sb); if (inode) { percpu_counter_inc(&nr_inodes); + inode->i_state = 0; + inode->i_ino = last_ino_get(); spin_lock(&inode_lock); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); - inode->i_ino = ++last_ino; - inode->i_state = 0; spin_unlock(&inode_lock); } return inode; ^ permalink raw reply related [flat|nested] 75+ messages in thread
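The block-reservation idea in the v2 last_ino_get() above can be tried in userspace. The sketch below is an approximation under stated assumptions, not the kernel code: a C11 `_Thread_local` variable stands in for the per-cpu variable (so no preemption handling is needed), `atomic_fetch_add()` replaces `atomic_add_return(1024, ...) - 1024`, and the names `INO_BLOCK`, `local_last_ino` and `last_ino_get_sketch()` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define INO_BLOCK 1024u	/* numbers reserved per refill, as in the patch */

/* Shared counter: written once per INO_BLOCK allocations per thread. */
static atomic_uint shared_last_ino;

/* Per-thread cursor; stands in for the kernel's per-cpu variable. */
static _Thread_local unsigned int local_last_ino;

static unsigned int last_ino_get_sketch(void)
{
	unsigned int res = local_last_ino;

	/* Range exhausted (or never filled): reserve a fresh block. */
	if ((res & (INO_BLOCK - 1)) == 0)
		res = atomic_fetch_add(&shared_last_ino, INO_BLOCK);

	/* Hand out the next number of the current block. */
	local_last_ino = ++res;
	return res;
}
```

With a single thread the numbers come out strictly increasing, exactly as with a plain `++last_ino`. With several threads each one walks its own block of 1024, so the shared counter is only dirtied on refill.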
* [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator 2008-11-21 15:34 ` Ingo Molnar 2008-11-26 23:27 ` [PATCH 0/6] fs: Scalability of sockets/pipes allocation/deallocation on SMP Eric Dumazet @ 2008-11-26 23:32 ` Eric Dumazet [not found] ` <492DDC88.2050305-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> [not found] ` <20081121153453.GA23713-X9Un+BFzKDI@public.gmane.org> 2008-11-26 23:32 ` [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs Eric Dumazet 3 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 565 bytes --] new_inode() dirties a contended cache line to get inode numbers. Solve this problem by providing to each cpu a per_cpu variable, feeded by the shared last_ino, but once every 1024 allocations. This reduce contention on the shared last_ino. Note : last_ino_get() method must be called with preemption disabled on SMP. (socket8 bench result : no differences, but this is because inode_lock cost is too heavy) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/inode.c | 27 +++++++++++++++++++++++++-- 1 files changed, 25 insertions(+), 2 deletions(-) [-- Attachment #2: last_ino.patch --] [-- Type: text/plain, Size: 1308 bytes --] diff --git a/fs/inode.c b/fs/inode.c index 0487ddb..d850050 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -534,6 +534,30 @@ repeat: return node ? inode : NULL; } +#ifdef CONFIG_SMP +/* + * each cpu owns a block of 1024 numbers. 
+ * The global 'last_ino' is dirtied once every 1024 allocations + */ +static DEFINE_PER_CPU(int, cpu_ino_alloc) = {0}; +static int last_ino_get(void) +{ + static atomic_t last_ino; + int *ptr = &__raw_get_cpu_var(cpu_ino_alloc); + + if (unlikely((*ptr & 1023) == 0)) + *ptr = atomic_add_return(1024, &last_ino); + return --(*ptr); +} +#else +static int last_ino_get(void) +{ + static int last_ino; + + return ++last_ino; +} +#endif + /** * new_inode - obtain an inode * @sb: superblock @@ -553,7 +577,6 @@ struct inode *new_inode(struct super_block *sb) * error if st_ino won't fit in target struct field. Use 32bit counter * here to attempt to avoid that. */ - static unsigned int last_ino; struct inode * inode; spin_lock_prefetch(&inode_lock); @@ -564,7 +587,7 @@ struct inode *new_inode(struct super_block *sb) inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); - inode->i_ino = ++last_ino; + inode->i_ino = last_ino_get(); inode->i_state = 0; spin_unlock(&inode_lock); } ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH 3/6] fs: Introduce a per_cpu last_ino allocator
  [not found] ` <492DDC88.2050305-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-27 9:46 ` Christoph Hellwig
  0 siblings, 0 replies; 75+ messages in thread
From: Christoph Hellwig @ 2008-11-27 9:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig

On Thu, Nov 27, 2008 at 12:32:24AM +0100, Eric Dumazet wrote:
> new_inode() dirties a contended cache line to get inode numbers.
>
> Solve this problem by providing to each cpu a per_cpu variable,
> feeded by the shared last_ino, but once every 1024 allocations.
>
> This reduce contention on the shared last_ino.
>
> Note : last_ino_get() method must be called with preemption
> disabled on SMP.

Looks a little clumsy.  One idea might be to have a special slab for
synthetic inodes using new_inode, and only assign the inode number on
the first allocation, then re-use it after that.
^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH 1/6] fs: Introduce a per_cpu nr_dentry [not found] ` <20081121153453.GA23713-X9Un+BFzKDI@public.gmane.org> @ 2008-11-26 23:30 ` Eric Dumazet [not found] ` <492DDC0B.8060804-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2008-11-26 23:32 ` [PATCH 4/6] fs: Introduce a per_cpu nr_inodes Eric Dumazet 2008-11-26 23:32 ` [PATCH 5/6] fs: Introduce special inodes Eric Dumazet 2 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-11-26 23:30 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 495 bytes --] Adding a per_cpu nr_dentry avoids cache line ping pongs between cpus to maintain this metric. We centralize decrements of nr_dentry in d_free(), and increments in d_alloc(). d_alloc() can avoid taking dcache_lock if parent is NULL Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> --- fs/dcache.c | 55 ++++++++++++++++++++++++++++--------------- include/linux/fs.h | 2 + kernel/sysctl.c | 2 - 3 files changed, 40 insertions(+), 19 deletions(-) [-- Attachment #2: per_cpu_nr_dentry.patch --] [-- Type: text/plain, Size: 4782 bytes --] diff --git a/fs/dcache.c b/fs/dcache.c index a1d86c7..42ed9fc 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -61,12 +61,38 @@ static struct kmem_cache *dentry_cache __read_mostly; static unsigned int d_hash_mask __read_mostly; static unsigned int d_hash_shift __read_mostly; static struct hlist_head *dentry_hashtable __read_mostly; +static DEFINE_PER_CPU(int, nr_dentry); /* Statistics gathering. 
*/ struct dentry_stat_t dentry_stat = { .age_limit = 45, }; +/* + * Handle nr_dentry sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + int cpu; + int counter = 0; + + for_each_possible_cpu(cpu) + counter += per_cpu(nr_dentry, cpu); + if (counter < 0) + counter = 0; + dentry_stat.nr_dentry = counter; + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_dentry(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void __d_free(struct dentry *dentry) { WARN_ON(!list_empty(&dentry->d_alias)); @@ -82,8 +108,7 @@ static void d_callback(struct rcu_head *head) } /* - * no dcache_lock, please. The caller must decrement dentry_stat.nr_dentry - * inside dcache_lock. + * no dcache_lock, please. */ static void d_free(struct dentry *dentry) { @@ -94,6 +119,8 @@ static void d_free(struct dentry *dentry) __d_free(dentry); else call_rcu(&dentry->d_u.d_rcu, d_callback); + get_cpu_var(nr_dentry)--; + put_cpu_var(nr_dentry); } /* @@ -172,7 +199,6 @@ static struct dentry *d_kill(struct dentry *dentry) struct dentry *parent; list_del(&dentry->d_u.d_child); - dentry_stat.nr_dentry--; /* For d_free, below */ /*drops the locks, at that point nobody can reach this dentry */ dentry_iput(dentry); if (IS_ROOT(dentry)) @@ -619,7 +645,6 @@ void shrink_dcache_sb(struct super_block * sb) static void shrink_dcache_for_umount_subtree(struct dentry *dentry) { struct dentry *parent; - unsigned detached = 0; BUG_ON(!IS_ROOT(dentry)); @@ -678,7 +703,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) } list_del(&dentry->d_u.d_child); - detached++; inode = dentry->d_inode; if (inode) { @@ -696,7 +720,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) * otherwise we ascend to the parent and move to the * 
next sibling if there is one */ if (!parent) - goto out; + return; dentry = parent; @@ -705,11 +729,6 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry) dentry = list_entry(dentry->d_subdirs.next, struct dentry, d_u.d_child); } -out: - /* several dentries were freed, need to correct nr_dentry */ - spin_lock(&dcache_lock); - dentry_stat.nr_dentry -= detached; - spin_unlock(&dcache_lock); } /* @@ -943,8 +962,6 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) dentry->d_flags = DCACHE_UNHASHED; spin_lock_init(&dentry->d_lock); dentry->d_inode = NULL; - dentry->d_parent = NULL; - dentry->d_sb = NULL; dentry->d_op = NULL; dentry->d_fsdata = NULL; dentry->d_mounted = 0; @@ -959,15 +976,17 @@ struct dentry *d_alloc(struct dentry * parent, const struct qstr *name) if (parent) { dentry->d_parent = dget(parent); dentry->d_sb = parent->d_sb; + spin_lock(&dcache_lock); + list_add(&dentry->d_u.d_child, &parent->d_subdirs); + spin_unlock(&dcache_lock); } else { + dentry->d_parent = NULL; + dentry->d_sb = NULL; INIT_LIST_HEAD(&dentry->d_u.d_child); } - spin_lock(&dcache_lock); - if (parent) - list_add(&dentry->d_u.d_child, &parent->d_subdirs); - dentry_stat.nr_dentry++; - spin_unlock(&dcache_lock); + get_cpu_var(nr_dentry)++; + put_cpu_var(nr_dentry); return dentry; } diff --git a/include/linux/fs.h b/include/linux/fs.h index 0dcdd94..c5e7aa5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2216,6 +2216,8 @@ static inline void free_secdata(void *secdata) struct ctl_table; int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 9d048fa..eebddef 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1243,7 +1243,7 @@ static struct ctl_table 
fs_table[] = { .data = &dentry_stat, .maxlen = 6*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_dentry, }, { .ctl_name = FS_OVERFLOWUID, ^ permalink raw reply related [flat|nested] 75+ messages in thread
[parent not found: <492DDC0B.8060804-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>]
* Re: [PATCH 1/6] fs: Introduce a per_cpu nr_dentry [not found] ` <492DDC0B.8060804-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> @ 2008-11-27 9:41 ` Christoph Hellwig 0 siblings, 0 replies; 75+ messages in thread From: Christoph Hellwig @ 2008-11-27 9:41 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig Looks good modulo the exact version of the for_each_cpu loops that the experts in that area can help with. Same for the per_cpu nr_inodes patch. ^ permalink raw reply [flat|nested] 75+ messages in thread
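For readers following along, the per-CPU counting scheme of patch 1/6 can be sketched in plain userspace C. This is only an illustration under stated assumptions: NR_SLOTS, CACHE_LINE, counter_add() and counter_read() are hypothetical names, a "slot" stands in for a CPU, and the preemption handling that get_cpu_var()/put_cpu_var() provide in the kernel is not modeled.

```c
#define NR_SLOTS   8	/* hypothetical stand-in for the possible CPUs */
#define CACHE_LINE 64	/* assumed cache line size */

/* One counter per slot, padded so two slots never share a cache line. */
struct slot_counter {
	int count;
	char pad[CACHE_LINE - sizeof(int)];
};

static struct slot_counter nr_objects[NR_SLOTS];

/* Fast path: touch only the caller's own slot, no global lock. */
static void counter_add(int slot, int delta)
{
	nr_objects[slot].count += delta;
}

/*
 * Slow path: sum every slot. A single slot may be negative when an
 * object is allocated on one slot and freed on another, so only the
 * total is clamped at zero.
 */
static int counter_read(void)
{
	int sum = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += nr_objects[i].count;
	return sum < 0 ? 0 : sum;
}
```

The trade is the one the patch makes: updates become cheap and local, while a read costs one pass over all slots.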
* [PATCH 4/6] fs: Introduce a per_cpu nr_inodes [not found] ` <20081121153453.GA23713-X9Un+BFzKDI@public.gmane.org> 2008-11-26 23:30 ` [PATCH 1/6] fs: Introduce a per_cpu nr_dentry Eric Dumazet @ 2008-11-26 23:32 ` Eric Dumazet 2008-11-27 9:32 ` Peter Zijlstra 2008-11-26 23:32 ` [PATCH 5/6] fs: Introduce special inodes Eric Dumazet 2 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 499 bytes --] Avoids cache line ping pongs between cpus and prepare next patch, because updates of nr_inodes metric dont need inode_lock anymore. (socket8 bench result : 25s to 20.5s) Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> --- fs/fs-writeback.c | 2 - fs/inode.c | 51 +++++++++++++++++++++++++++++++++++------- include/linux/fs.h | 3 ++ kernel/sysctl.c | 4 +-- mm/page-writeback.c | 2 - 5 files changed, 50 insertions(+), 12 deletions(-) [-- Attachment #2: per_cpu_nr_inodes.patch --] [-- Type: text/plain, Size: 5705 bytes --] diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index d0ff0b8..b591cdd 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -608,7 +608,7 @@ void sync_inodes_sb(struct super_block *sb, int wait) unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS); wbc.nr_to_write = nr_dirty + nr_unstable + - (inodes_stat.nr_inodes - inodes_stat.nr_unused) + + (get_nr_inodes() - inodes_stat.nr_unused) + nr_dirty + nr_unstable; wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */ sync_sb_inodes(sb, &wbc); diff --git a/fs/inode.c b/fs/inode.c index d850050..8d8d40e 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex); * Statistics gathering.. 
*/ struct inodes_stat_t inodes_stat; +static DEFINE_PER_CPU(int, nr_inodes); static struct kmem_cache * inode_cachep __read_mostly; +int get_nr_inodes(void) +{ + int cpu; + int counter = 0; + + for_each_possible_cpu(cpu) + counter += per_cpu(nr_inodes, cpu); + if (counter < 0) + counter = 0; + return counter; +} + +/* + * Handle nr_dentry sysctl + */ +#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + inodes_stat.nr_inodes = get_nr_inodes(); + return proc_dointvec(table, write, filp, buffer, lenp, ppos); +} +#else +int proc_nr_inodes(ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + return -ENOSYS; +} +#endif + static void wake_up_inode(struct inode *inode) { /* @@ -306,9 +337,8 @@ static void dispose_list(struct list_head *head) destroy_inode(inode); nr_disposed++; } - spin_lock(&inode_lock); - inodes_stat.nr_inodes -= nr_disposed; - spin_unlock(&inode_lock); + get_cpu_var(nr_inodes) -= nr_disposed; + put_cpu_var(nr_inodes); } /* @@ -584,10 +614,11 @@ struct inode *new_inode(struct super_block *sb) inode = alloc_inode(sb); if (inode) { spin_lock(&inode_lock); - inodes_stat.nr_inodes++; list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); + get_cpu_var(nr_inodes)--; inode->i_ino = last_ino_get(); + put_cpu_var(nr_inodes); inode->i_state = 0; spin_unlock(&inode_lock); } @@ -645,7 +676,8 @@ static struct inode * get_new_inode(struct super_block *sb, struct hlist_head *h if (set(inode, data)) goto set_failed; - inodes_stat.nr_inodes++; + get_cpu_var(nr_inodes)++; + put_cpu_var(nr_inodes); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); @@ -694,7 +726,8 @@ static struct inode * get_new_inode_fast(struct super_block *sb, struct hlist_he old = find_inode_fast(sb, head, ino); if (!old) { 
inode->i_ino = ino; - inodes_stat.nr_inodes++; + get_cpu_var(nr_inodes)++; + put_cpu_var(nr_inodes); list_add(&inode->i_list, &inode_in_use); list_add(&inode->i_sb_list, &sb->s_inodes); hlist_add_head(&inode->i_hash, head); @@ -1065,8 +1098,9 @@ void generic_delete_inode(struct inode *inode) list_del_init(&inode->i_list); list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; spin_unlock(&inode_lock); + get_cpu_var(nr_inodes)--; + put_cpu_var(nr_inodes); security_inode_delete(inode); @@ -1116,8 +1150,9 @@ static void generic_forget_inode(struct inode *inode) list_del_init(&inode->i_list); list_del_init(&inode->i_sb_list); inode->i_state |= I_FREEING; - inodes_stat.nr_inodes--; spin_unlock(&inode_lock); + get_cpu_var(nr_inodes)--; + put_cpu_var(nr_inodes); if (inode->i_data.nrpages) truncate_inode_pages(&inode->i_data, 0); clear_inode(inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index c5e7aa5..2482977 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -47,6 +47,7 @@ struct inodes_stat_t { int dummy[5]; /* padding for sysctl ABI compatibility */ }; extern struct inodes_stat_t inodes_stat; +extern int get_nr_inodes(void); extern int leases_enable, lease_break_time; @@ -2218,6 +2219,8 @@ int proc_nr_files(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); int proc_nr_dentry(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos); +int proc_nr_inodes(struct ctl_table *table, int write, struct file *filp, + void __user *buffer, size_t *lenp, loff_t *ppos); int get_filesystem_list(char * buf); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index eebddef..eebed01 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1202,7 +1202,7 @@ static struct ctl_table fs_table[] = { .data = &inodes_stat, .maxlen = 2*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_inodes, }, { 
.ctl_name = FS_STATINODE, @@ -1210,7 +1210,7 @@ static struct ctl_table fs_table[] = { .data = &inodes_stat, .maxlen = 7*sizeof(int), .mode = 0444, - .proc_handler = &proc_dointvec, + .proc_handler = &proc_nr_inodes, }, { .procname = "file-nr", diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2970e35..a71a922 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -705,7 +705,7 @@ static void wb_kupdate(unsigned long arg) next_jif = start_jif + dirty_writeback_interval; nr_to_write = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS) + - (inodes_stat.nr_inodes - inodes_stat.nr_unused); + (get_nr_inodes() - inodes_stat.nr_unused); while (nr_to_write > 0) { wbc.more_io = 0; wbc.encountered_congestion = 0; ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-26 23:32 ` [PATCH 4/6] fs: Introduce a per_cpu nr_inodes Eric Dumazet @ 2008-11-27 9:32 ` Peter Zijlstra 2008-11-27 9:39 ` Peter Zijlstra ` (3 more replies) 0 siblings, 4 replies; 75+ messages in thread From: Peter Zijlstra @ 2008-11-27 9:32 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig, travis On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote: > Avoids cache line ping pongs between cpus and prepare next patch, > because updates of nr_inodes metric dont need inode_lock anymore. > > (socket8 bench result : 25s to 20.5s) > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex); > * Statistics gathering.. > */ > struct inodes_stat_t inodes_stat; > +static DEFINE_PER_CPU(int, nr_inodes); > > static struct kmem_cache * inode_cachep __read_mostly; > > +int get_nr_inodes(void) > +{ > + int cpu; > + int counter = 0; > + > + for_each_possible_cpu(cpu) > + counter += per_cpu(nr_inodes, cpu); > + if (counter < 0) > + counter = 0; > + return counter; > +} It would be good to get a cpu hotplug handler here and move to for_each_online_cpu(). People are wanting distro's to be build with NR_CPUS=4096. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-27 9:32 ` Peter Zijlstra @ 2008-11-27 9:39 ` Peter Zijlstra 2008-11-27 9:48 ` Christoph Hellwig 2008-11-27 10:01 ` Eric Dumazet ` (2 subsequent siblings) 3 siblings, 1 reply; 75+ messages in thread From: Peter Zijlstra @ 2008-11-27 9:39 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig, travis On Thu, 2008-11-27 at 10:33 +0100, Peter Zijlstra wrote: > On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote: > > Avoids cache line ping pongs between cpus and prepare next patch, > > because updates of nr_inodes metric dont need inode_lock anymore. > > > > (socket8 bench result : 25s to 20.5s) > > > > Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> > > --- > > > @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex); > > * Statistics gathering.. > > */ > > struct inodes_stat_t inodes_stat; > > +static DEFINE_PER_CPU(int, nr_inodes); > > > > static struct kmem_cache * inode_cachep __read_mostly; > > > > +int get_nr_inodes(void) > > +{ > > + int cpu; > > + int counter = 0; > > + > > + for_each_possible_cpu(cpu) > > + counter += per_cpu(nr_inodes, cpu); > > + if (counter < 0) > > + counter = 0; > > + return counter; > > +} > > It would be good to get a cpu hotplug handler here and move to > for_each_online_cpu(). People are wanting distro's to be build with > NR_CPUS=4096. Also, this trade-off between global vs per_cpu only works if get_nr_inodes() is called significantly less than nr_inodes is changed. With it being called from writeback that might not be true for all workloads. One thing you can do about it is use the regular per-cpu counter stuff, which allows you to do an approximation of the global number (it also does all the hotplug stuff for you already). 
^ permalink raw reply [flat|nested] 75+ messages in thread
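The approximation Peter mentions can be sketched as a batched counter. This is a userspace illustration loosely modeled on the idea behind the kernel's percpu_counter, not its code: the names batched_counter, bc_add(), bc_read_approx() and bc_sum_exact(), and the BATCH value, are assumptions, and the locking a real implementation needs around the shared total is left out.

```c
#define NR_SLOTS 8	/* hypothetical stand-in for the possible CPUs */
#define BATCH    32	/* assumed fold threshold */

struct batched_counter {
	long global;		/* approximate total, shared */
	int local[NR_SLOTS];	/* per-slot deltas not yet folded in */
};

/* Accumulate locally; fold into the shared total once the delta is large. */
static void bc_add(struct batched_counter *c, int slot, int delta)
{
	int v = c->local[slot] + delta;

	if (v >= BATCH || v <= -BATCH) {
		c->global += v;		/* the only write to shared state */
		c->local[slot] = 0;
	} else {
		c->local[slot] = v;
	}
}

/* Cheap read: off by less than NR_SLOTS * BATCH in either direction. */
static long bc_read_approx(const struct batched_counter *c)
{
	return c->global;
}

/* Exact read: walks every slot, as get_nr_inodes() does in the patch. */
static long bc_sum_exact(const struct batched_counter *c)
{
	long sum = c->global;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += c->local[i];
	return sum;
}
```

A caller that only needs a rough figure, such as a writeback heuristic, can use the cheap read and never walk the slots.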
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-27 9:39 ` Peter Zijlstra @ 2008-11-27 9:48 ` Christoph Hellwig 0 siblings, 0 replies; 75+ messages in thread From: Christoph Hellwig @ 2008-11-27 9:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Eric Dumazet, Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig, travis On Thu, Nov 27, 2008 at 10:39:31AM +0100, Peter Zijlstra wrote: > With it being called from writeback that might not be true for all > workloads. One thing you can do about it is use the regular per-cpu > counter stuff, which allows you to do an approximation of the global > number (it also does all the hotplug stuff for you already). The way it's used in writeback is utterly stupid and should be fixed :) But otherwise agreed. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-27 9:32 ` Peter Zijlstra 2008-11-27 9:39 ` Peter Zijlstra @ 2008-11-27 10:01 ` Eric Dumazet 2008-11-27 10:07 ` Andi Kleen 2008-11-27 14:46 ` Christoph Lameter 3 siblings, 0 replies; 75+ messages in thread From: Eric Dumazet @ 2008-11-27 10:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig, travis Peter Zijlstra wrote: > On Thu, 2008-11-27 at 00:32 +0100, Eric Dumazet wrote: >> Avoids cache line ping pongs between cpus and prepare next patch, >> because updates of nr_inodes metric dont need inode_lock anymore. >> >> (socket8 bench result : 25s to 20.5s) >> >> Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> >> --- > >> @@ -96,9 +96,40 @@ static DEFINE_MUTEX(iprune_mutex); >> * Statistics gathering.. >> */ >> struct inodes_stat_t inodes_stat; >> +static DEFINE_PER_CPU(int, nr_inodes); >> >> static struct kmem_cache * inode_cachep __read_mostly; >> >> +int get_nr_inodes(void) >> +{ >> + int cpu; >> + int counter = 0; >> + >> + for_each_possible_cpu(cpu) >> + counter += per_cpu(nr_inodes, cpu); >> + if (counter < 0) >> + counter = 0; >> + return counter; >> +} > > It would be good to get a cpu hotplug handler here and move to > for_each_online_cpu(). People are wanting distro's to be build with > NR_CPUS=4096. Hum, I guess we can use regular percpu_counter for this... ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-27 9:32 ` Peter Zijlstra 2008-11-27 9:39 ` Peter Zijlstra 2008-11-27 10:01 ` Eric Dumazet @ 2008-11-27 10:07 ` Andi Kleen 2008-11-27 14:46 ` Christoph Lameter 3 siblings, 0 replies; 75+ messages in thread From: Andi Kleen @ 2008-11-27 10:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Eric Dumazet, Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Lameter, Christoph Hellwig, travis Peter Zijlstra <a.p.zijlstra@chello.nl> writes: >> >> +int get_nr_inodes(void) >> +{ >> + int cpu; >> + int counter = 0; >> + >> + for_each_possible_cpu(cpu) >> + counter += per_cpu(nr_inodes, cpu); >> + if (counter < 0) >> + counter = 0; >> + return counter; >> +} > > It would be good to get a cpu hotplug handler here and move to > for_each_online_cpu(). People are wanting distro's to be build with > NR_CPUS=4096. Doesn't matter, possible cpus is always only set to what the machine supports. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 4/6] fs: Introduce a per_cpu nr_inodes 2008-11-27 9:32 ` Peter Zijlstra ` (2 preceding siblings ...) 2008-11-27 10:07 ` Andi Kleen @ 2008-11-27 14:46 ` Christoph Lameter 3 siblings, 0 replies; 75+ messages in thread From: Christoph Lameter @ 2008-11-27 14:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Eric Dumazet, Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Linux Netdev List, Christoph Hellwig, travis On Thu, 27 Nov 2008, Peter Zijlstra wrote: > It would be good to get a cpu hotplug handler here and move to > for_each_online_cpu(). People are wanting distro's to be build with > NR_CPUS=4096. NR_CPUS=4096 does not necessarily increase the number of possible cpus. ^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH 5/6] fs: Introduce special inodes [not found] ` <20081121153453.GA23713-X9Un+BFzKDI@public.gmane.org> 2008-11-26 23:30 ` [PATCH 1/6] fs: Introduce a per_cpu nr_dentry Eric Dumazet 2008-11-26 23:32 ` [PATCH 4/6] fs: Introduce a per_cpu nr_inodes Eric Dumazet @ 2008-11-26 23:32 ` Eric Dumazet [not found] ` <492DDC99.5060106-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 2 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 1021 bytes --] Goal of this patch is to not touch inode_lock for socket/pipes/anonfd inodes allocation/freeing. In new_inode(), we test if super block has MS_SPECIAL flag set. If yes, we dont put inode in "inode_in_use" list nor "sb->s_inodes" list As inode_lock was taken only to protect these lists, we avoid it as well Using iput_special() from dput_special() avoids taking inode_lock at freeing time. This patch has a very noticeable effect, because we avoid dirtying of three contended cache lines in new_inode(), and five cache lines in iput() Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we really need a different flag. 
(socket8 bench result : from 20.5s to 2.94s) Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> --- fs/anon_inodes.c | 1 + fs/dcache.c | 2 +- fs/inode.c | 25 ++++++++++++++++++------- fs/pipe.c | 3 ++- include/linux/fs.h | 2 ++ net/socket.c | 1 + 6 files changed, 25 insertions(+), 9 deletions(-) [-- Attachment #2: special_inodes.patch --] [-- Type: text/plain, Size: 3551 bytes --] diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index 4f20d48..a0212b3 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -158,6 +158,7 @@ static int __init anon_inode_init(void) error = PTR_ERR(anon_inode_mnt); goto err_unregister_filesystem; } + anon_inode_mnt->mnt_sb->s_flags |= MS_SPECIAL; anon_inode_inode = anon_inode_mkinode(); if (IS_ERR(anon_inode_inode)) { error = PTR_ERR(anon_inode_inode); diff --git a/fs/dcache.c b/fs/dcache.c index d73763b..bade7d7 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -239,7 +239,7 @@ static void dput_special(struct dentry *dentry) return; inode = dentry->d_inode; if (inode) - iput(inode); + iput_special(inode); d_free(dentry); } diff --git a/fs/inode.c b/fs/inode.c index 8d8d40e..1bb6553 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -228,6 +228,14 @@ void destroy_inode(struct inode *inode) kmem_cache_free(inode_cachep, (inode)); } +void iput_special(struct inode *inode) +{ + if (atomic_dec_and_test(&inode->i_count)) { + destroy_inode(inode); + get_cpu_var(nr_inodes)--; + put_cpu_var(nr_inodes); + } +} /* * These are initializations that only need to be done @@ -609,18 +617,21 @@ struct inode *new_inode(struct super_block *sb) */ struct inode * inode; - spin_lock_prefetch(&inode_lock); - inode = alloc_inode(sb); if (inode) { - spin_lock(&inode_lock); - list_add(&inode->i_list, &inode_in_use); - list_add(&inode->i_sb_list, &sb->s_inodes); + inode->i_state = 0; + if (sb->s_flags & MS_SPECIAL) { + INIT_LIST_HEAD(&inode->i_list); + INIT_LIST_HEAD(&inode->i_sb_list); + } else { + spin_lock(&inode_lock); + 
list_add(&inode->i_list, &inode_in_use); + list_add(&inode->i_sb_list, &sb->s_inodes); + spin_unlock(&inode_lock); + } get_cpu_var(nr_inodes)--; inode->i_ino = last_ino_get(); put_cpu_var(nr_inodes); - inode->i_state = 0; - spin_unlock(&inode_lock); } return inode; } diff --git a/fs/pipe.c b/fs/pipe.c index 5cc132a..6fca681 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1078,7 +1078,8 @@ static int __init init_pipe_fs(void) if (IS_ERR(pipe_mnt)) { err = PTR_ERR(pipe_mnt); unregister_filesystem(&pipe_fs_type); - } + } else + pipe_mnt->mnt_sb->s_flags |= MS_SPECIAL; } return err; } diff --git a/include/linux/fs.h b/include/linux/fs.h index 2482977..dd0e8a5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -136,6 +136,7 @@ extern int dir_notify_enable; #define MS_RELATIME (1<<21) /* Update atime relative to mtime/ctime. */ #define MS_KERNMOUNT (1<<22) /* this is a kern_mount call */ #define MS_I_VERSION (1<<23) /* Update inode I_version field */ +#define MS_SPECIAL (1<<24) /* special fs (inodes not in sb->s_inodes) */ #define MS_ACTIVE (1<<30) #define MS_NOUSER (1<<31) @@ -1898,6 +1899,7 @@ extern void __iget(struct inode * inode); extern void iget_failed(struct inode *); extern void clear_inode(struct inode *); extern void destroy_inode(struct inode *); +extern void iput_special(struct inode *inode); extern struct inode *new_inode(struct super_block *); extern int should_remove_suid(struct dentry *); extern int file_remove_suid(struct file *); diff --git a/net/socket.c b/net/socket.c index f41b6c6..4177456 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2205,6 +2205,7 @@ static int __init sock_init(void) init_inodecache(); register_filesystem(&sock_fs_type); sock_mnt = kern_mount(&sock_fs_type); + sock_mnt->mnt_sb->s_flags |= MS_SPECIAL; /* The real protocol initialization is performed in later initcalls. */ ^ permalink raw reply related [flat|nested] 75+ messages in thread
[parent not found: <492DDC99.5060106-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>]
* Re: [PATCH 5/6] fs: Introduce special inodes [not found] ` <492DDC99.5060106-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> @ 2008-11-27 8:20 ` David Miller 0 siblings, 0 replies; 75+ messages in thread From: David Miller @ 2008-11-27 8:20 UTC (permalink / raw) To: dada1-fPLkHRcR87vqlBn2x/YWAg Cc: mingo-X9Un+BFzKDI, rjw-KKrjLPT3xs0, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, efault-Mmb7MZpHnFY, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, netdev-u79uwXL29TY76Z2rM5mHXA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, hch-wEGCiKHe2LqWVfeAwA7xHQ From: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> Date: Thu, 27 Nov 2008 00:32:41 +0100 > Goal of this patch is to not touch inode_lock for socket/pipes/anonfd > inodes allocation/freeing. > > In new_inode(), we test if super block has MS_SPECIAL flag set. > If yes, we dont put inode in "inode_in_use" list nor "sb->s_inodes" list > As inode_lock was taken only to protect these lists, we avoid it as well > > Using iput_special() from dput_special() avoids taking inode_lock > at freeing time. > > This patch has a very noticeable effect, because we avoid dirtying of three contended cache lines in new_inode(), and five cache lines > in iput() > > Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we > really need a different flag. > > (socket8 bench result : from 20.5s to 2.94s) > > Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> No problem with networking part: Acked-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> ^ permalink raw reply [flat|nested] 75+ messages in thread
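The allocation-side idea of patch 5/6 (skip the global lists, and therefore the global lock, when the superblock is flagged) can be modeled in a few lines of userspace C. Everything here is a hypothetical stand-in: SB_UNTRACKED plays the role of MS_SPECIAL, lock_acquisitions merely counts where spin_lock(&inode_lock) would be taken, and the list helpers only imitate the kernel's list_head.

```c
/* Minimal circular doubly linked list, in the style of list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

static void list_insert(struct list_node *n, struct list_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

#define SB_UNTRACKED 0x1	/* hypothetical stand-in for MS_SPECIAL */

struct object {
	struct list_node link;
};

static struct list_node in_use = { &in_use, &in_use };
static int lock_acquisitions;	/* counts would-be global lock acquisitions */

static void object_register(struct object *o, int sb_flags)
{
	if (sb_flags & SB_UNTRACKED) {
		/* Self-linked, so a later list removal stays harmless. */
		list_init(&o->link);
		return;
	}
	lock_acquisitions++;	/* spin_lock() in the real code */
	list_insert(&o->link, &in_use);
	/* spin_unlock() in the real code */
}
```

The flagged path touches only the object's own memory, which is where the benchmark gain reported for this patch comes from.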
* [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-21 15:34 ` Ingo Molnar ` (2 preceding siblings ...) [not found] ` <20081121153453.GA23713-X9Un+BFzKDI@public.gmane.org> @ 2008-11-26 23:32 ` Eric Dumazet 2008-11-27 9:53 ` Christoph Hellwig [not found] ` <492DDCAB.1070204-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 3 siblings, 2 replies; 75+ messages in thread From: Eric Dumazet @ 2008-11-26 23:32 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 511 bytes --] This function arms a flag (MNT_SPECIAL) on the vfs, to avoid refcounting on permanent system vfs. Use this function for sockets, pipes, anonymous fds. (socket8 bench result : from 2.94s to 2.23s) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- fs/anon_inodes.c | 2 +- fs/pipe.c | 2 +- fs/super.c | 9 +++++++++ include/linux/fs.h | 1 + include/linux/mount.h | 5 +++-- net/socket.c | 2 +- 6 files changed, 16 insertions(+), 5 deletions(-) [-- Attachment #2: mnt_special.patch --] [-- Type: text/plain, Size: 3352 bytes --] diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index a0212b3..42dfe28 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -153,7 +153,7 @@ static int __init anon_inode_init(void) error = register_filesystem(&anon_inode_fs_type); if (error) goto err_exit; - anon_inode_mnt = kern_mount(&anon_inode_fs_type); + anon_inode_mnt = kern_mount_special(&anon_inode_fs_type); if (IS_ERR(anon_inode_mnt)) { error = PTR_ERR(anon_inode_mnt); goto err_unregister_filesystem; diff --git a/fs/pipe.c b/fs/pipe.c index 6fca681..391d4fe 100644 --- a/fs/pipe.c +++ b/fs/pipe.c @@ -1074,7 +1074,7 @@ static int __init init_pipe_fs(void) int err = register_filesystem(&pipe_fs_type); if (!err) { - pipe_mnt = kern_mount(&pipe_fs_type); + pipe_mnt = kern_mount_special(&pipe_fs_type); if (IS_ERR(pipe_mnt)) { err 
= PTR_ERR(pipe_mnt); unregister_filesystem(&pipe_fs_type); diff --git a/fs/super.c b/fs/super.c index 400a760..a8e14f7 100644 --- a/fs/super.c +++ b/fs/super.c @@ -982,3 +982,12 @@ struct vfsmount *kern_mount_data(struct file_system_type *type, void *data) } EXPORT_SYMBOL_GPL(kern_mount_data); + +struct vfsmount *kern_mount_special(struct file_system_type *type) +{ + struct vfsmount *res = kern_mount_data(type, NULL); + + if (!IS_ERR(res)) + res->mnt_flags |= MNT_SPECIAL; + return res; +} diff --git a/include/linux/fs.h b/include/linux/fs.h index dd0e8a5..a92544a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1591,6 +1591,7 @@ extern int register_filesystem(struct file_system_type *); extern int unregister_filesystem(struct file_system_type *); extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data); #define kern_mount(type) kern_mount_data(type, NULL) +extern struct vfsmount *kern_mount_special(struct file_system_type *); extern int may_umount_tree(struct vfsmount *); extern int may_umount(struct vfsmount *); extern long do_mount(char *, char *, char *, unsigned long, void *); diff --git a/include/linux/mount.h b/include/linux/mount.h index cab2a85..cb4fa90 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -30,6 +30,7 @@ struct mnt_namespace; #define MNT_SHRINKABLE 0x100 #define MNT_IMBALANCED_WRITE_COUNT 0x200 /* just for debugging */ +#define MNT_SPECIAL 0x400 /* special mount (pipes,sockets,...) 
*/ #define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */ #define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */ @@ -73,7 +74,7 @@ struct vfsmount { static inline struct vfsmount *mntget(struct vfsmount *mnt) { - if (mnt) + if (mnt && !(mnt->mnt_flags & MNT_SPECIAL)) atomic_inc(&mnt->mnt_count); return mnt; } @@ -87,7 +88,7 @@ extern int __mnt_is_readonly(struct vfsmount *mnt); static inline void mntput(struct vfsmount *mnt) { - if (mnt) { + if (mnt && !(mnt->mnt_flags & MNT_SPECIAL)) { mnt->mnt_expiry_mark = 0; mntput_no_expire(mnt); } diff --git a/net/socket.c b/net/socket.c index 4177456..2857d70 100644 --- a/net/socket.c +++ b/net/socket.c @@ -2204,7 +2204,7 @@ static int __init sock_init(void) init_inodecache(); register_filesystem(&sock_fs_type); - sock_mnt = kern_mount(&sock_fs_type); + sock_mnt = kern_mount_special(&sock_fs_type); sock_mnt->mnt_sb->s_flags |= MS_SPECIAL; /* The real protocol initialization is performed in later initcalls. ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-26 23:32 ` [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs Eric Dumazet @ 2008-11-27 9:53 ` Christoph Hellwig [not found] ` <20081127095321.GE13860-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> [not found] ` <492DDCAB.1070204-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Christoph Hellwig @ 2008-11-27 9:53 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: > This function arms a flag (MNT_SPECIAL) on the vfs, to avoid > refcounting on permanent system vfs. > Use this function for sockets, pipes, anonymous fds. special is not a useful name for a flag, by definition everything that needs a flag is special compared to the version that doesn't need a flag. The general idea of skipping the writer counts makes sense, but please give it a descriptive name that explains the not unmountable thing. And please kill your kern_mount wrapper and just set the flag manually. Also I think it should be a superblock flag, not a mount flag as you don't want these to differ for multiple mounts of the same filesystem. ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20081127095321.GE13860-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>]
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs [not found] ` <20081127095321.GE13860-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> @ 2008-11-27 10:04 ` Eric Dumazet [not found] ` <492E70B6.70108-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-11-27 10:04 UTC (permalink / raw) To: Christoph Hellwig Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter Christoph Hellwig wrote: > On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: >> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid >> refcounting on permanent system vfs. >> Use this function for sockets, pipes, anonymous fds. > > special is not a useful name for a flag, by definition everything that > needs a flag is special compared to the version that doesn't need a > flag. > > The general idea of skipping the writer counts makes sense, but please > give it a descriptive name that explains the not unmountable thing. > And please kill your kern_mount wrapper and just set the flag manually. > > Also I think it should be a superblock flag, not a mount flag as you > don't want these to differ for multiple mounts of the same filesystem. > > Hum.. we have a superblock flag already, but testing it in mntput()/mntget() is going to be a little bit expensive if we add a dereference? if (mnt && mnt->mnt_sb->s_flags & MS_SPECIAL) { ... } ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <492E70B6.70108-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>]
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs [not found] ` <492E70B6.70108-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> @ 2008-11-27 10:10 ` Christoph Hellwig 0 siblings, 0 replies; 75+ messages in thread From: Christoph Hellwig @ 2008-11-27 10:10 UTC (permalink / raw) To: Eric Dumazet Cc: Christoph Hellwig, Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter On Thu, Nov 27, 2008 at 11:04:38AM +0100, Eric Dumazet wrote: > Hum.. we have a superblock flag already, but testing it in mntput()/mntget() > is going to be a litle bit expensive if we add a derefence ? > > if (mnt && mnt->mnt_sb->s_flags & MS_SPECIAL) { > ... > } Well, run a benchmark to see if it makes any difference. And when it does please always set the mount flag from the common mount code when it's set on the superblock, and document that this is the only valid way to set it. ^ permalink raw reply [flat|nested] 75+ messages in thread
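The refcount bypass under discussion can be sketched with C11 atomics. This is a userspace illustration only: struct mount_ref, MNT_PERMANENT, mount_init(), mount_get() and mount_put() are hypothetical names (the flag name is chosen here just to be descriptive), and the flag is kept on the mount object itself, which is the variant that avoids the extra dereference through the superblock.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MNT_PERMANENT 0x400	/* hypothetical: mount is never torn down */

struct mount_ref {
	atomic_int count;
	int flags;		/* set once at mount time, read-only afterwards */
};

static void mount_init(struct mount_ref *m, int count, int flags)
{
	atomic_init(&m->count, count);
	m->flags = flags;
}

/* Take a reference, unless the mount is permanent. */
static struct mount_ref *mount_get(struct mount_ref *m)
{
	if (m && !(m->flags & MNT_PERMANENT))
		atomic_fetch_add(&m->count, 1);
	return m;
}

/* Drop a reference; returns 1 when the caller dropped the last one. */
static int mount_put(struct mount_ref *m)
{
	if (!m || (m->flags & MNT_PERMANENT))
		return 0;
	return atomic_fetch_sub(&m->count, 1) == 1;
}
```

For a flagged mount, get and put reduce to a read of a field that never changes, so the shared counter's cache line is no longer written from every CPU.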
[parent not found: <492DDCAB.1070204-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>]
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs [not found] ` <492DDCAB.1070204-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> @ 2008-11-27 8:21 ` David Miller 2008-11-28 9:26 ` Al Viro 1 sibling, 0 replies; 75+ messages in thread From: David Miller @ 2008-11-27 8:21 UTC (permalink / raw) To: dada1-fPLkHRcR87vqlBn2x/YWAg Cc: mingo-X9Un+BFzKDI, rjw-KKrjLPT3xs0, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, efault-Mmb7MZpHnFY, a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, netdev-u79uwXL29TY76Z2rM5mHXA, cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, hch-wEGCiKHe2LqWVfeAwA7xHQ From: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> Date: Thu, 27 Nov 2008 00:32:59 +0100 > This function arms a flag (MNT_SPECIAL) on the vfs, to avoid > refcounting on permanent system vfs. > Use this function for sockets, pipes, anonymous fds. > > (socket8 bench result : from 2.94s to 2.23s) > > Signed-off-by: Eric Dumazet <dada1-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> For networking bits: Acked-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
       [not found] ` <492DDCAB.1070204-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
  2008-11-27  8:21 ` David Miller
@ 2008-11-28  9:26 ` Al Viro
       [not found]   ` <20081128092604.GL28946-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
  2008-11-28 22:37   ` Eric Dumazet
  1 sibling, 2 replies; 75+ messages in thread
From: Al Viro @ 2008-11-28 9:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith,
	Peter Zijlstra, Linux Netdev List, Christoph Lameter,
	Christoph Hellwig, rth-hL46jP5Bxq7R7s880joybQ,
	ink-biIs/Y0ymYJMZLIVYojuPNP0rXTJTi09

On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
> refcounting on permanent system vfs.
> Use this function for sockets, pipes, anonymous fds.

IMO that's pushing it past the point of usefulness; unless you can show
that this really gives considerable win on pipes et.al. *AND* that it
doesn't hurt other loads...

dput() part: again, I want to see what happens on other loads; it's probably
fine (and win is certainly more than from mntput() change), but...  The
thing is, atomic_dec_and_lock() in there is often done on dentries with
d_count > 1 and that's fairly cheap (and doesn't involve contention on
dcache_lock on sane targets).

FWIW, unless there's a really good reason to do alpha atomic_dec_and_lock()
in a special way, I'd try to compare with

	if (atomic_add_unless(&dentry->d_count, -1, 1))
		return;
	if (your flag)
		sod off to special
	spin_lock(&dcache_lock);
	if (atomic_dec_and_test(&dentry->d_count)) {
		spin_unlock(&dcache_lock);
		return;
	}
	the rest as usual

As for the alpha...  unless I'm misreading the assembler in
arch/alpha/lib/dec_and_lock.c, it looks like we have essentially an
implementation of atomic_add_unless() in there and one that just might be
better than what we've got in arch/alpha/include/asm/atomic.h.  How about

1:	ldl_l	x, addr
	cmpne	x, u, y		/* y = x != u */
	beq	y, 3f		/* if !y -> bugger off, return 0 */
	addl	x, a, y
	stl_c	y, addr		/* y <- *addr has not changed since ldl_l */
	beq	y, 2f
3:	/* return value is in y */
	.subsection 2		/* out of the way */
2:	br	1b
	.previous

for atomic_add_unless() guts?  With that we are rid of HAVE_DEC_LOCK and
get a uniform implementation of atomic_dec_and_lock() for all targets...

AFAICS, that would be

static __inline__ int atomic_add_unless(atomic_t *v, int a, int u)
{
	unsigned long temp, res;
	__asm__ __volatile__(
	"1:	ldl_l	%0,%1\n"
	"	cmpne	%0,%4,%2\n"
	"	beq	%4,3f\n"
	"	addl	%0,%3,%4\n"
	"	stl_c	%2,%1\n"
	"	beq	%2,2f\n"
	"3:\n"
	".subsection 2\n"
	"2:	br	1b\n"
	".previous"
	:"=&r" (temp), "=m" (v->counter), "=&r" (res)
	:"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
	smp_mb();
	return res;
}

static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
{
	unsigned long temp, res;
	__asm__ __volatile__(
	"1:	ldq_l	%0,%1\n"
	"	cmpne	%0,%4,%2\n"
	"	beq	%4,3f\n"
	"	addq	%0,%3,%4\n"
	"	stq_c	%2,%1\n"
	"	beq	%2,2f\n"
	"3:\n"
	".subsection 2\n"
	"2:	br	1b\n"
	".previous"
	:"=&r" (temp), "=m" (v->counter), "=&r" (res)
	:"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
	smp_mb();
	return res;
}

Comments?

^ permalink raw reply	[flat|nested] 75+ messages in thread
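For readers without an alpha at hand, the idea of building a dec-and-lock primitive on top of an add-unless primitive can be modelled portably with C11 atomics. The sketch below is a userspace illustration only, under stated assumptions: the model_* names and the toy spinlock are invented for the example, a compare-and-swap loop stands in for the ldl_l/stl_c pair, and kernel memory-barrier details are left to the default sequentially consistent C11 operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy spinlock, standing in for a kernel spinlock such as dcache_lock. */
typedef struct {
	atomic_flag held;
} model_spinlock;

static void model_spin_lock(model_spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void model_spin_unlock(model_spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Add a to *v unless *v == u. Returns true if the add was performed. */
static bool model_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	assert(v != NULL);
	while (old != u) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/* Drop one reference. Returns true, with the lock held, only when the
 * count reached zero; otherwise returns false without holding the lock. */
static bool model_dec_and_lock(atomic_int *count, model_spinlock *lock)
{
	if (model_add_unless(count, -1, 1))
		return false;		/* fast path: count was above 1 */

	model_spin_lock(lock);
	if (atomic_fetch_sub(count, 1) == 1)
		return true;		/* last reference: keep the lock */
	model_spin_unlock(lock);
	return false;
}
```

The fast path never touches the lock when the count is above 1, which is the "fairly cheap" case described in the message above.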
[parent not found: <20081128092604.GL28946-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>]
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
       [not found]   ` <20081128092604.GL28946-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2008-11-28  9:34     ` Al Viro
  2008-11-28 18:02     ` Ingo Molnar
  1 sibling, 0 replies; 75+ messages in thread
From: Al Viro @ 2008-11-28 9:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith,
	Peter Zijlstra, Linux Netdev List, Christoph Lameter,
	Christoph Hellwig, rth-hL46jP5Bxq7R7s880joybQ,
	ink-biIs/Y0ymYJMZLIVYojuPNP0rXTJTi09

On Fri, Nov 28, 2008 at 09:26:04AM +0000, Al Viro wrote:

gyah...  That would be

> static __inline__ int atomic_add_unless(atomic_t *v, int a, int u)
> {
> 	unsigned long temp, res;
> 	__asm__ __volatile__(
> 	"1:	ldl_l	%0,%1\n"
> 	"	cmpne	%0,%4,%2\n"
	"	beq	%2,3f\n"
	"	addl	%0,%3,%2\n"
> 	"	stl_c	%2,%1\n"
> 	"	beq	%2,2f\n"
> 	"3:\n"
> 	".subsection 2\n"
> 	"2:	br	1b\n"
> 	".previous"
> 	:"=&r" (temp), "=m" (v->counter), "=&r" (res)
> 	:"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
> 	smp_mb();
> 	return res;
> }
>
> static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
> {
> 	unsigned long temp, res;
> 	__asm__ __volatile__(
> 	"1:	ldq_l	%0,%1\n"
> 	"	cmpne	%0,%4,%2\n"
	"	beq	%2,3f\n"
	"	addq	%0,%3,%2\n"
> 	"	stq_c	%2,%1\n"
> 	"	beq	%2,2f\n"
> 	"3:\n"
> 	".subsection 2\n"
> 	"2:	br	1b\n"
> 	".previous"
> 	:"=&r" (temp), "=m" (v->counter), "=&r" (res)
> 	:"Ir" (a), "Ir" (u), "m" (v->counter) : "memory");
> 	smp_mb();
> 	return res;
> }
>
> Comments?
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs [not found] ` <20081128092604.GL28946-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 2008-11-28 9:34 ` Al Viro @ 2008-11-28 18:02 ` Ingo Molnar 2008-11-28 18:58 ` Ingo Molnar [not found] ` <20081128180220.GK10487-X9Un+BFzKDI@public.gmane.org> 1 sibling, 2 replies; 75+ messages in thread From: Ingo Molnar @ 2008-11-28 18:02 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth-hL46jP5Bxq7R7s880joybQ, ink-biIs/Y0ymYJMZLIVYojuPNP0rXTJTi09 * Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote: > On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: > > This function arms a flag (MNT_SPECIAL) on the vfs, to avoid > > refcounting on permanent system vfs. > > Use this function for sockets, pipes, anonymous fds. > > IMO that's pushing it past the point of usefulness; unless you can show > that this really gives considerable win on pipes et.al. *AND* that it > doesn't hurt other loads... The numbers look pretty convincing: > > (socket8 bench result : from 2.94s to 2.23s) And i wouldnt expect it to hurt real-filesystem workloads. Here's the contemporary trace of a typical ext3- sys_open(): 0) | sys_open() { 0) | do_sys_open() { 0) | getname() { 0) 0.367 us | kmem_cache_alloc(); 0) | strncpy_from_user(); { 0) | _cond_resched() { 0) | need_resched() { 0) 0.363 us | constant_test_bit(); 0) 1. 47 us | } 0) 1.815 us | } 0) 2.587 us | } 0) 4. 22 us | } 0) | alloc_fd() { 0) 0.480 us | _spin_lock(); 0) 0.487 us | expand_files(); 0) 2.356 us | } 0) | do_filp_open() { 0) | path_lookup_open() { 0) | get_empty_filp() { 0) 0.439 us | kmem_cache_alloc(); 0) | security_file_alloc() { 0) 0.316 us | cap_file_alloc_security(); 0) 1. 
87 us | } 0) 3.189 us | } 0) | do_path_lookup() { 0) 0.366 us | _read_lock(); 0) | path_walk() { 0) | __link_path_walk() { 0) | inode_permission() { 0) | ext3_permission() { 0) 0.441 us | generic_permission(); 0) 1.247 us | } 0) | security_inode_permission() { 0) 0.411 us | cap_inode_permission(); 0) 1.186 us | } 0) 3.555 us | } 0) | do_lookup() { 0) | __d_lookup() { 0) 0.486 us | _spin_lock(); 0) 1.369 us | } 0) 0.442 us | __follow_mount(); 0) 3. 14 us | } 0) | path_to_nameidata() { 0) 0.476 us | dput(); 0) 1.235 us | } 0) | inode_permission() { 0) | ext3_permission() { 0) | generic_permission() { 0) | in_group_p() { 0) 0.410 us | groups_search(); 0) 1.172 us | } 0) 1.994 us | } 0) 2.789 us | } 0) | security_inode_permission() { 0) 0.454 us | cap_inode_permission(); 0) 1.238 us | } 0) 5.262 us | } 0) | do_lookup() { 0) | __d_lookup() { 0) 0.480 us | _spin_lock(); 0) 1.621 us | } 0) 0.456 us | __follow_mount(); 0) 3.215 us | } 0) | path_to_nameidata() { 0) 0.420 us | dput(); 0) 1.193 us | } 0) + 23.551 us | } 0) | path_put() { 0) 0.420 us | dput(); 0) | mntput() { 0) 0.359 us | mntput_no_expire(); 0) 1. 
50 us | } 0) 2.544 us | } 0) + 27.253 us | } 0) + 28.850 us | } 0) + 33.217 us | } 0) | may_open() { 0) | inode_permission() { 0) | ext3_permission() { 0) 0.480 us | generic_permission(); 0) 1.229 us | } 0) | security_inode_permission() { 0) 0.405 us | cap_inode_permission(); 0) 1.196 us | } 0) 3.589 us | } 0) 4.600 us | } 0) | nameidata_to_filp() { 0) | __dentry_open() { 0) | file_move() { 0) 0.470 us | _spin_lock(); 0) 1.243 us | } 0) | security_dentry_open() { 0) 0.344 us | cap_dentry_open(); 0) 1.139 us | } 0) 0.412 us | generic_file_open(); 0) 0.561 us | file_ra_state_init(); 0) 5.714 us | } 0) 6.483 us | } 0) + 46.494 us | } 0) 0.453 us | inotify_dentry_parent_queue_event(); 0) 0.403 us | inotify_inode_queue_event(); 0) | fd_install() { 0) 0.440 us | _spin_lock(); 0) 1.247 us | } 0) | putname() { 0) | kmem_cache_free() { 0) | virt_to_head_page() { 0) 0.369 us | constant_test_bit(); 0) 1. 23 us | } 0) 1.738 us | } 0) 2.422 us | } 0) + 60.560 us | } 0) + 61.368 us | } and here's a sys_close(): 0) | sys_close() { 0) 0.540 us | _spin_lock(); 0) | filp_close() { 0) 0.437 us | dnotify_flush(); 0) 0.401 us | locks_remove_posix(); 0) 0.349 us | fput(); 0) 2.679 us | } 0) 4.452 us | } i'd be surprised to see a flag to show up in that codepath. Eric, does your testing confirm that? Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-28 18:02 ` Ingo Molnar @ 2008-11-28 18:58 ` Ingo Molnar [not found] ` <20081128180220.GK10487-X9Un+BFzKDI@public.gmane.org> 1 sibling, 0 replies; 75+ messages in thread From: Ingo Molnar @ 2008-11-28 18:58 UTC (permalink / raw) To: Al Viro Cc: Eric Dumazet, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth, ink * Ingo Molnar <mingo@elte.hu> wrote: > And i wouldnt expect it to hurt real-filesystem workloads. > > Here's the contemporary trace of a typical ext3- sys_open(): here's a sys_open() that has to touch atime: 0) | sys_open() { 0) | do_sys_open() { 0) | getname() { 0) 0.377 us | kmem_cache_alloc(); 0) | strncpy_from_user() { 0) | _cond_resched() { 0) | need_resched() { 0) 0.353 us | constant_test_bit(); 0) 1. 45 us | } 0) 1.739 us | } 0) 2.492 us | } 0) 3.934 us | } 0) | alloc_fd() { 0) 0.374 us | _spin_lock(); 0) 0.447 us | expand_files(); 0) 2.124 us | } 0) | do_filp_open() { 0) | path_lookup_open() { 0) | get_empty_filp() { 0) 0.689 us | kmem_cache_alloc(); 0) | security_file_alloc() { 0) 0.327 us | cap_file_alloc_security(); 0) 1. 
71 us | } 0) 2.869 us | } 0) | do_path_lookup() { 0) 0.460 us | _read_lock(); 0) | path_walk() { 0) | __link_path_walk() { 0) | inode_permission() { 0) | ext3_permission() { 0) 0.434 us | generic_permission(); 0) 1.191 us | } 0) | security_inode_permission() { 0) 0.400 us | cap_inode_permission(); 0) 1.130 us | } 0) 3.453 us | } 0) | do_lookup() { 0) | __d_lookup() { 0) 0.489 us | _spin_lock(); 0) 1.525 us | } 0) 0.449 us | __follow_mount(); 0) 3.115 us | } 0) | path_to_nameidata() { 0) 0.422 us | dput(); 0) 1.204 us | } 0) | inode_permission() { 0) | ext3_permission() { 0) 0.391 us | generic_permission(); 0) 1.223 us | } 0) | security_inode_permission() { 0) 0.406 us | cap_inode_permission(); 0) 1.189 us | } 0) 3.565 us | } 0) | do_lookup() { 0) | __d_lookup() { 0) 0.527 us | _spin_lock(); 0) 1.633 us | } 0) 0.440 us | __follow_mount(); 0) 3.223 us | } 0) | do_follow_link() { 0) | _cond_resched() { 0) | need_resched() { 0) 0.361 us | constant_test_bit(); 0) 1. 64 us | } 0) 1.749 us | } 0) | security_inode_follow_link() { 0) 0.390 us | cap_inode_follow_link(); 0) 1.260 us | } 0) | touch_atime() { 0) | mnt_want_write() { 0) 0.360 us | _spin_lock(); 0) 1.137 us | } 0) | mnt_drop_write() { 0) 0.348 us | _spin_lock(); 0) 1.102 us | } 0) 3.402 us | } 0) 0.446 us | ext3_follow_link(); 0) | __link_path_walk() { 0) | inode_permission() { 0) | ext3_permission() { 0) | generic_permission() { 0) 4.481 us | } 0) | security_inode_permission() { 0) 0.402 us | cap_inode_permission(); 0) 1.127 us | } 0) 6.747 us | } 0) | do_lookup() { 0) | __d_lookup() { 0) 0.547 us | _spin_lock(); 0) 1.758 us | } 0) 0.465 us | __follow_mount(); 0) 3.368 us | } 0) | path_to_nameidata() { 0) 0.419 us | dput(); 0) 1.203 us | } 0) + 13. 40 us | } 0) | path_put() { 0) 0.429 us | dput(); 0) | mntput() { 0) 0.367 us | mntput_no_expire(); 0) 1.130 us | } 0) 2.660 us | } 0) | path_put() { 0) | dput() { 0) | _cond_resched() { 0) | need_resched() { 0) 0.382 us | constant_test_bit(); 0) 1. 
67 us | } 0) 1.808 us | } 0) 0.399 us | _spin_lock(); 0) 0.452 us | _spin_lock(); 0) 4.270 us | } 0) | mntput() { 0) 0.375 us | mntput_no_expire(); 0) 1. 62 us | } 0) 6.547 us | } 0) + 32.702 us | } 0) + 50.413 us | } 0) | path_put() { 0) 0.421 us | dput(); 0) | mntput() { 0) 0.364 us | mntput_no_expire(); 0) 1. 64 us | } 0) 2.545 us | } 0) + 54.147 us | } 0) + 55.780 us | } 0) + 59.714 us | } 0) | may_open() { 0) | inode_permission() { 0) | ext3_permission() { 0) 0.406 us | generic_permission(); 0) 1.189 us | } 0) | security_inode_permission() { 0) 0.388 us | cap_inode_permission(); 0) 1.175 us | } 0) 3.498 us | } 0) 4.328 us | } 0) | nameidata_to_filp() { 0) | __dentry_open() { 0) | file_move() { 0) 0.361 us | _spin_lock(); 0) 1.102 us | } 0) | security_dentry_open() { 0) 0.356 us | cap_dentry_open(); 0) 1.121 us | } 0) 0.400 us | generic_file_open(); 0) 0.544 us | file_ra_state_init(); 0) 5. 11 us | } 0) 5.709 us | } 0) + 71.181 us | } 0) 0.453 us | inotify_dentry_parent_queue_event(); 0) 0.403 us | inotify_inode_queue_event(); 0) | fd_install() { 0) 0.411 us | _spin_lock(); 0) 1.217 us | } 0) | putname() { 0) | kmem_cache_free() { 0) | virt_to_head_page() { 0) 0.371 us | constant_test_bit(); 0) 1. 47 us | } 0) 1.752 us | } 0) 2.446 us | } 0) + 84.676 us | } 0) + 85.365 us | } Ingo ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20081128180220.GK10487-X9Un+BFzKDI@public.gmane.org>]
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs [not found] ` <20081128180220.GK10487-X9Un+BFzKDI@public.gmane.org> @ 2008-11-28 22:20 ` Eric Dumazet 0 siblings, 0 replies; 75+ messages in thread From: Eric Dumazet @ 2008-11-28 22:20 UTC (permalink / raw) To: Ingo Molnar Cc: Al Viro, David Miller, Rafael J. Wysocki, linux-kernel-u79uwXL29TY76Z2rM5mHXA, kernel-testers-u79uwXL29TY76Z2rM5mHXA, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth-hL46jP5Bxq7R7s880joybQ, ink-biIs/Y0ymYJMZLIVYojuPNP0rXTJTi09 Ingo Molnar a écrit : > * Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote: > >> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: >>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid >>> refcounting on permanent system vfs. >>> Use this function for sockets, pipes, anonymous fds. >> IMO that's pushing it past the point of usefulness; unless you can show >> that this really gives considerable win on pipes et.al. *AND* that it >> doesn't hurt other loads... > > The numbers look pretty convincing: > >>> (socket8 bench result : from 2.94s to 2.23s) > > And i wouldnt expect it to hurt real-filesystem workloads. > > Here's the contemporary trace of a typical ext3- sys_open(): > > 0) | sys_open() { > 0) | do_sys_open() { > 0) | getname() { > 0) 0.367 us | kmem_cache_alloc(); > 0) | strncpy_from_user(); { > 0) | _cond_resched() { > 0) | need_resched() { > 0) 0.363 us | constant_test_bit(); > 0) 1. 47 us | } > 0) 1.815 us | } > 0) 2.587 us | } > 0) 4. 22 us | } > 0) | alloc_fd() { > 0) 0.480 us | _spin_lock(); > 0) 0.487 us | expand_files(); > 0) 2.356 us | } > 0) | do_filp_open() { > 0) | path_lookup_open() { > 0) | get_empty_filp() { > 0) 0.439 us | kmem_cache_alloc(); > 0) | security_file_alloc() { > 0) 0.316 us | cap_file_alloc_security(); > 0) 1. 
87 us | } > 0) 3.189 us | } > 0) | do_path_lookup() { > 0) 0.366 us | _read_lock(); > 0) | path_walk() { > 0) | __link_path_walk() { > 0) | inode_permission() { > 0) | ext3_permission() { > 0) 0.441 us | generic_permission(); > 0) 1.247 us | } > 0) | security_inode_permission() { > 0) 0.411 us | cap_inode_permission(); > 0) 1.186 us | } > 0) 3.555 us | } > 0) | do_lookup() { > 0) | __d_lookup() { > 0) 0.486 us | _spin_lock(); > 0) 1.369 us | } > 0) 0.442 us | __follow_mount(); > 0) 3. 14 us | } > 0) | path_to_nameidata() { > 0) 0.476 us | dput(); > 0) 1.235 us | } > 0) | inode_permission() { > 0) | ext3_permission() { > 0) | generic_permission() { > 0) | in_group_p() { > 0) 0.410 us | groups_search(); > 0) 1.172 us | } > 0) 1.994 us | } > 0) 2.789 us | } > 0) | security_inode_permission() { > 0) 0.454 us | cap_inode_permission(); > 0) 1.238 us | } > 0) 5.262 us | } > 0) | do_lookup() { > 0) | __d_lookup() { > 0) 0.480 us | _spin_lock(); > 0) 1.621 us | } > 0) 0.456 us | __follow_mount(); > 0) 3.215 us | } > 0) | path_to_nameidata() { > 0) 0.420 us | dput(); > 0) 1.193 us | } > 0) + 23.551 us | } > 0) | path_put() { > 0) 0.420 us | dput(); > 0) | mntput() { > 0) 0.359 us | mntput_no_expire(); > 0) 1. 
50 us | } > 0) 2.544 us | } > 0) + 27.253 us | } > 0) + 28.850 us | } > 0) + 33.217 us | } > 0) | may_open() { > 0) | inode_permission() { > 0) | ext3_permission() { > 0) 0.480 us | generic_permission(); > 0) 1.229 us | } > 0) | security_inode_permission() { > 0) 0.405 us | cap_inode_permission(); > 0) 1.196 us | } > 0) 3.589 us | } > 0) 4.600 us | } > 0) | nameidata_to_filp() { > 0) | __dentry_open() { > 0) | file_move() { > 0) 0.470 us | _spin_lock(); > 0) 1.243 us | } > 0) | security_dentry_open() { > 0) 0.344 us | cap_dentry_open(); > 0) 1.139 us | } > 0) 0.412 us | generic_file_open(); > 0) 0.561 us | file_ra_state_init(); > 0) 5.714 us | } > 0) 6.483 us | } > 0) + 46.494 us | } > 0) 0.453 us | inotify_dentry_parent_queue_event(); > 0) 0.403 us | inotify_inode_queue_event(); > 0) | fd_install() { > 0) 0.440 us | _spin_lock(); > 0) 1.247 us | } > 0) | putname() { > 0) | kmem_cache_free() { > 0) | virt_to_head_page() { > 0) 0.369 us | constant_test_bit(); > 0) 1. 23 us | } > 0) 1.738 us | } > 0) 2.422 us | } > 0) + 60.560 us | } > 0) + 61.368 us | } > > and here's a sys_close(): > > 0) | sys_close() { > 0) 0.540 us | _spin_lock(); > 0) | filp_close() { > 0) 0.437 us | dnotify_flush(); > 0) 0.401 us | locks_remove_posix(); > 0) 0.349 us | fput(); > 0) 2.679 us | } > 0) 4.452 us | } > > i'd be surprised to see a flag to show up in that codepath. Eric, does > your testing confirm that? On a socket/pipe, definitly no, because inode->i_sb->s_flags is not contended. But on a shared inode, it might hurt : offsetof(struct inode, i_count)=0x24 offsetof(struct inode, i_lock)=0x70 offsetof(struct inode, i_sb)=0x9c offsetof(struct inode, i_writecount)=0x144 So i_sb sits in a probably contended cache line I wonder why i_writecount sits so far from i_count, that doesnt make sense. ^ permalink raw reply [flat|nested] 75+ messages in thread
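To make the offsets quoted above easier to reason about, the following small sketch maps each one to a cache-line index. It assumes 64-byte cache lines and a structure that starts on a line boundary; neither assumption is stated in the message, and the model_* and OFF_* names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size; the message does not state one. */
#define MODEL_CACHE_LINE 64u

/* Offsets of struct inode members as quoted in the message above. */
enum {
	OFF_I_COUNT      = 0x24,
	OFF_I_LOCK       = 0x70,
	OFF_I_SB         = 0x9c,
	OFF_I_WRITECOUNT = 0x144,
};

/* Index of the cache line holding a member, assuming the structure
 * itself starts on a cache-line boundary. */
static size_t model_cache_line(size_t offset)
{
	return offset / MODEL_CACHE_LINE;
}

/* Do two members share a cache line under the same assumption? */
static int model_same_line(size_t a, size_t b)
{
	return model_cache_line(a) == model_cache_line(b);
}

/* Self-check of the arithmetic for the quoted offsets. */
static void model_check_offsets(void)
{
	assert(model_cache_line(OFF_I_COUNT) == 0);
	assert(model_cache_line(OFF_I_LOCK) == 1);
	assert(model_cache_line(OFF_I_SB) == 2);
	assert(model_cache_line(OFF_I_WRITECOUNT) == 5);
}
```

Under these assumptions i_count and i_writecount sit five lines apart, which matches the remark that they are far from each other. Which frequently written members share the line with i_sb is not stated in the message, so the sketch says nothing about that.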
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs
  2008-11-28  9:26 ` Al Viro
       [not found]   ` <20081128092604.GL28946-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
@ 2008-11-28 22:37   ` Eric Dumazet
  2008-11-28 22:43     ` Eric Dumazet
  1 sibling, 1 reply; 75+ messages in thread
From: Eric Dumazet @ 2008-11-28 22:37 UTC (permalink / raw)
  To: Al Viro
  Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel,
	kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List,
	Christoph Lameter, Christoph Hellwig, rth, ink

Al Viro wrote:
> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote:
>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid
>> refcounting on permanent system vfs.
>> Use this function for sockets, pipes, anonymous fds.
>
> IMO that's pushing it past the point of usefulness; unless you can show
> that this really gives considerable win on pipes et.al. *AND* that it
> doesn't hurt other loads...

Well, if this is the last cache line that might be shared, then yes,
numbers can talk.
But coming from 10 to 1 instead of 0 is OK I guess

>
> dput() part: again, I want to see what happens on other loads; it's probably
> fine (and win is certainly more than from mntput() change), but...  The
> thing is, atomic_dec_and_lock() in there is often done on dentries with
> d_count > 1 and that's fairly cheap (and doesn't involve contention on
> dcache_lock on sane targets).
>
> FWIW, unless there's a really good reason to do alpha atomic_dec_and_lock()
> in a special way, I'd try to compare with

> 	if (atomic_add_unless(&dentry->d_count, -1, 1))
> 		return;

I don't know, but *reading* d_count before trying to write it is expensive
on modern cpus. Oprofile clearly shows that on Intel Core2.

Then, *testing* the flag before doing the atomic_something() has the same
problem. Or we should put the flag in a different cache line.

I am lazy (time for a sleep here), maybe we are smart here and use a trick
like that already ?

atomic_t atomic_read_with_write_intent(atomic_t *v)
{
	int val = 0;
	/*
	 * No LOCK prefix here, we only give a write intent hint to cpu
	 */
	asm volatile("xaddl %0, %1"
		     : "+r" (val), "+m" (v->counter)
		     : : "memory");
	return val;
}

> 	if (your flag)
> 		sod off to special
> 	spin_lock(&dcache_lock);
> 	if (atomic_dec_and_test(&dentry->d_count)) {
> 		spin_unlock(&dcache_lock);
> 		return;
> 	}
> 	the rest as usual
>

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs 2008-11-28 22:37 ` Eric Dumazet @ 2008-11-28 22:43 ` Eric Dumazet 0 siblings, 0 replies; 75+ messages in thread From: Eric Dumazet @ 2008-11-28 22:43 UTC (permalink / raw) To: Al Viro Cc: Ingo Molnar, David Miller, Rafael J. Wysocki, linux-kernel, kernel-testers, Mike Galbraith, Peter Zijlstra, Linux Netdev List, Christoph Lameter, Christoph Hellwig, rth, ink Eric Dumazet a écrit : > Al Viro a écrit : >> On Thu, Nov 27, 2008 at 12:32:59AM +0100, Eric Dumazet wrote: >>> This function arms a flag (MNT_SPECIAL) on the vfs, to avoid >>> refcounting on permanent system vfs. >>> Use this function for sockets, pipes, anonymous fds. >> >> IMO that's pushing it past the point of usefulness; unless you can show >> that this really gives considerable win on pipes et.al. *AND* that it >> doesn't hurt other loads... > > Well, if this is the last cache line that might be shared, then yes, > numbers can talk. > But coming from 10 to 1 instead of 0 is OK I guess > >> >> dput() part: again, I want to see what happens on other loads; it's >> probably >> fine (and win is certainly more than from mntput() change), but... The >> thing is, atomic_dec_and_lock() in there is often done on dentries with >> d_count > 1 and that's fairly cheap (and doesn't involve contention on >> dcache_lock on sane targets). >> >> FWIW, unless there's a really good reason to do alpha >> atomic_dec_and_lock() >> in a special way, I'd try to compare with > >> if (atomic_add_unless(&dentry->d_count, -1, 1)) >> return; > > I dont know, but *reading* d_count before trying to write it is expensive > on modern cpus. Oprofile clearly show that on Intel Core2. > > Then, *testing* the flag before doing the atomic_something() has the same > problem. Or we should put flag in a different cache line. > > I am lazy (time for a sleep here), maybe we are smart here and use a > trick like that already ? 
> > atomic_t atomic_read_with_write_intent(atomic_t *v) > { > int val = 0; > /* > * No LOCK prefix here, we only give a write intent hint to cpu > */ > asm volatile("xaddl %0, %1" > : "+r" (val), "+m" (v->counter) > : : "memory"); > return val; > } Forget it, its wrong... I really need to sleep :) ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [PATCH] fs: pipe/sockets/anon dentries should not have a parent
  2008-11-21 15:13 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Eric Dumazet
       [not found]   ` <4926D022.5060008-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-21 15:36   ` Christoph Hellwig
  2008-11-21 17:58     ` [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Eric Dumazet
  1 sibling, 1 reply; 75+ messages in thread
From: Christoph Hellwig @ 2008-11-21 15:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, mingo, cl, rjw, linux-kernel, kernel-testers,
	efault, a.p.zijlstra, Linux Netdev List, viro, linux-fsdevel

On Fri, Nov 21, 2008 at 04:13:38PM +0100, Eric Dumazet wrote:
> [PATCH] fs: pipe/sockets/anon dentries should not have a parent
>
> Linking pipe/sockets/anon dentries to one root 'parent' has no functional
> impact at all, but a scalability one.
>
> We can avoid touching a cache line at allocation stage (inside d_alloc(), no need
> to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count)
> We avoid an expensive atomic_dec_and_lock() call on the root dentry.
>
> If we correct dnotify_parent() and inotify_d_instantiate() to take into account
> a NULL d_parent, we can call d_alloc() with a NULL parent instead of root dentry.

Sorry folks, but a NULL d_parent is a no-go from the VFS perspective,
but you can set d_parent to the dentry itself which is the magic used
for root of tree dentries.  They should also be marked
DCACHE_DISCONNECTED to make sure this is not unexpected.

And this kind of stuff really needs to go through -fsdevel.

^ permalink raw reply	[flat|nested] 75+ messages in thread
* [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent 2008-11-21 15:36 ` [PATCH] fs: pipe/sockets/anon dentries should not have a parent Christoph Hellwig @ 2008-11-21 17:58 ` Eric Dumazet [not found] ` <4926F6C5.9030108-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric Dumazet @ 2008-11-21 17:58 UTC (permalink / raw) To: Christoph Hellwig Cc: David Miller, mingo, cl, rjw, linux-kernel, kernel-testers, efault, a.p.zijlstra, Linux Netdev List, viro, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 5101 bytes --] Christoph Hellwig a écrit : > On Fri, Nov 21, 2008 at 04:13:38PM +0100, Eric Dumazet wrote: >> [PATCH] fs: pipe/sockets/anon dentries should not have a parent >> >> Linking pipe/sockets/anon dentries to one root 'parent' has no functional >> impact at all, but a scalability one. >> >> We can avoid touching a cache line at allocation stage (inside d_alloc(), no need >> to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count) >> We avoid an expensive atomic_dec_and_lock() call on the root dentry. >> >> If we correct dnotify_parent() and inotify_d_instantiate() to take into account >> a NULL d_parent, we can call d_alloc() with a NULL parent instead of root dentry. > > Sorry folks, but a NULL d_parent is a no-go from the VFS perspective, > but you can set d_parent to the dentry itself which is the magic used > for root of tree dentries. They should also be marked > DCACHE_DISCONNECTED to make sure this is not unexpected. > > And this kind of stuff really needs to go through -fsdevel. Thanks Christoph for your review, sorry for fsdevel being forgotten. d_alloc_root() is not an option here, since we also want such dentries to be unhashed. So here is a second version, with the introduction of a new helper, d_alloc_unhashed(), to be used by pipes, sockets and anon I got even better numbers, probably because dnotify/inotify dont have the NULL d_parent test anymore. 
[PATCH] fs: pipe/sockets/anon dentries should have themselves as parent Linking pipe/sockets/anon dentries to one root 'parent' has no functional impact at all, but a scalability one. We can avoid touching a cache line at allocation stage (inside d_alloc(), no need to touch root->d_count), but also at freeing time (in d_kill, decrementing d_count) We avoid an expensive atomic_dec_and_lock() call on the root dentry. We add d_alloc_unhashed(const char *name, struct inode *inode) helper to be used by pipes/socket/anon. This function is about the same as d_alloc_root() but for unhashed entries. Before patch, time to run 8 * 1 million of close(socket()) calls on 8 CPUS was : real 0m27.496s user 0m0.657s sys 3m39.092s After patch : real 0m23.843s user 0m0.616s sys 3m9.732s Old oprofile : CPU: Core 2, speed 3000.11 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. % symbol name 164257 164257 11.0245 11.0245 init_file 155488 319745 10.4359 21.4604 d_alloc 151887 471632 10.1942 31.6547 _atomic_dec_and_lock 91620 563252 6.1493 37.8039 inet_create 74245 637497 4.9831 42.7871 kmem_cache_alloc 46702 684199 3.1345 45.9216 dentry_iput 46186 730385 3.0999 49.0215 tcp_close 42824 773209 2.8742 51.8957 kmem_cache_free 37275 810484 2.5018 54.3975 wake_up_inode 36553 847037 2.4533 56.8508 tcp_v4_init_sock 35661 882698 2.3935 59.2443 inotify_d_instantiate 32998 915696 2.2147 61.4590 sysenter_past_esp 31442 947138 2.1103 63.5693 d_instantiate 31303 978441 2.1010 65.6703 generic_forget_inode 27533 1005974 1.8479 67.5183 vfs_dq_drop 24237 1030211 1.6267 69.1450 sock_attach_fd 19290 1049501 1.2947 70.4397 __copy_from_user_ll New oprofile : CPU: Core 2, speed 3000.11 MHz (estimated) Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000 samples cum. samples % cum. 
% symbol name
148703   148703        10.8581  10.8581    inet_create
116680   265383        8.5198   19.3779    new_inode
108912   374295        7.9526   27.3306    init_file
82911    457206        6.0541   33.3846    kmem_cache_alloc
65690    522896        4.7966   38.1812    wake_up_inode
53286    576182        3.8909   42.0721    _atomic_dec_and_lock
43814    619996        3.1992   45.2713    generic_forget_inode
41993    661989        3.0663   48.3376    d_alloc
41244    703233        3.0116   51.3492    kmem_cache_free
39244    742477        2.8655   54.2148    tcp_v4_init_sock
37402    779879        2.7310   56.9458    tcp_close
33336    813215        2.4342   59.3800    sysenter_past_esp
28596    841811        2.0880   61.4680    inode_has_buffers
25769    867580        1.8816   63.3496    d_kill
22606    890186        1.6507   65.0003    dentry_iput
20224    910410        1.4767   66.4770    vfs_dq_drop
19800    930210        1.4458   67.9228    __copy_from_user_ll

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |    9 +--------
 fs/dcache.c            |   31 +++++++++++++++++++++++++++++++
 fs/pipe.c              |   10 +---------
 include/linux/dcache.h |    1 +
 net/socket.c           |   10 +---------
 5 files changed, 35 insertions(+), 26 deletions(-)

[-- Attachment #2: d_alloc_unhashed.patch --]
[-- Type: text/plain, Size: 4728 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..9fd0515 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -71,7 +71,6 @@ static struct dentry_operations anon_inodefs_dentry_operations = {
 int anon_inode_getfd(const char *name, const struct file_operations *fops,
 		     void *priv, int flags)
 {
-	struct qstr this;
 	struct dentry *dentry;
 	struct file *file;
 	int error, fd;
@@ -89,10 +88,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	 * using the inode sequence number.
 	 */
 	error = -ENOMEM;
-	this.name = name;
-	this.len = strlen(name);
-	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_unhashed(name, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;

@@ -104,9 +100,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);

 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);

 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..a5477fd 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1111,6 +1111,37 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }

+/**
+ * d_alloc_unhashed - allocate unhashed dentry
+ * @inode: inode to allocate the dentry for
+ * @name: dentry name
+ *
+ * Allocate an unhashed dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory. Unhashed dentries have themselves as a parent.
+ */
+
+struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
+{
+	struct qstr q = { .name = name, .len = strlen(name) };
+	struct dentry *res;
+
+	res = d_alloc(NULL, &q);
+	if (res) {
+		res->d_sb = inode->i_sb;
+		res->d_parent = res;
+		/*
+		 * We dont want to push this dentry into global dentry hash table.
+		 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
+		 * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
+		 */
+		res->d_flags &= ~DCACHE_UNHASHED;
+		res->d_flags |= DCACHE_DISCONNECTED;
+		d_instantiate(res, inode);
+	}
+	return res;
+}
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..29fcac2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -918,7 +918,6 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };

 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +925,11 @@ struct file *create_write_pipe(int flags)
 		goto err;

 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", inode);
 	if (!dentry)
 		goto err_inode;

 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);

 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..12438d6 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -238,6 +238,7 @@ extern int d_invalidate(struct dentry *);

 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
+extern struct dentry * d_alloc_unhashed(const char *, struct inode *);

 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b659b5d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -371,20 +371,12 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };

-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;

 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));

 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 75+ messages in thread
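The close(socket()) loop timed in the patch description can be reproduced from user space. What follows is only a minimal sketch, assuming a Linux/POSIX system; the helper names socket_churn() and time_socket_churn() are made up for illustration and are not part of the patch. Running one copy per CPU, as in the 8-CPU figures, is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <netinet/in.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate and immediately release `iterations`
 * TCP sockets. Each pass exercises the inode/dentry/file allocation
 * and teardown paths discussed in the thread.
 * Returns the number of socket() calls that succeeded.
 */
long socket_churn(long iterations)
{
	long ok = 0;

	for (long i = 0; i < iterations; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);

		if (fd < 0)
			continue;	/* count only successful allocations */
		close(fd);
		ok++;
	}
	return ok;
}

/*
 * Hypothetical helper: wall-clock seconds spent in socket_churn().
 * The success count is stored through `ok`.
 */
double time_socket_churn(long iterations, long *ok)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	*ok = socket_churn(iterations);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_socket_churn(1000 * 1000, &ok) from main() in one process per CPU would approximate the workload described above; the numbers it yields depend on the kernel and hardware and are not comparable to the figures quoted in the patch.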
* Re: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent
       [not found] ` <4926F6C5.9030108-fPLkHRcR87vqlBn2x/YWAg@public.gmane.org>
@ 2008-11-21 18:43 ` Matthew Wilcox
  2008-11-23  3:53   ` Eric Dumazet
  0 siblings, 1 reply; 75+ messages in thread
From: Matthew Wilcox @ 2008-11-21 18:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Hellwig, David Miller, mingo-X9Un+BFzKDI,
	cl-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, rjw-KKrjLPT3xs0,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-testers-u79uwXL29TY76Z2rM5mHXA, efault-Mmb7MZpHnFY,
	a.p.zijlstra-/NLkJaSkS4VmR6Xm/wNWPw, Linux Netdev List,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Fri, Nov 21, 2008 at 06:58:29PM +0100, Eric Dumazet wrote:
> +/**
> + * d_alloc_unhashed - allocate unhashed dentry
> + * @inode: inode to allocate the dentry for
> + * @name: dentry name

It's normal to list the parameters in the order they're passed to the
function.  Not sure if we have a tool that checks for this or not --
Randy?

> + *
> + * Allocate an unhashed dentry for the inode given. The inode is
> + * instantiated and returned. %NULL is returned if there is insufficient
> + * memory. Unhashed dentries have themselves as a parent.
> + */
> +
> +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
> +{
> +	struct qstr q = { .name = name, .len = strlen(name) };
> +	struct dentry *res;
> +
> +	res = d_alloc(NULL, &q);
> +	if (res) {
> +		res->d_sb = inode->i_sb;
> +		res->d_parent = res;
> +		/*
> +		 * We dont want to push this dentry into global dentry hash table.
> +		 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
> +		 * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
> +		 */

Line length ... as checkpatch would have warned you ;-)

And there are several other grammatical nitpicks with this comment.  Try
this:

		/*
		 * We don't want to put this dentry in the global dentry
		 * hash table, so we pretend the dentry is already hashed
		 * by unsetting DCACHE_UNHASHED.  This permits
		 * /proc/$pid/fd/XXX to work for sockets, pipes and
		 * anonymous files (signalfd, timerfd, etc).
		 */

> +		res->d_flags &= ~DCACHE_UNHASHED;
> +		res->d_flags |= DCACHE_DISCONNECTED;

Is this really better than:

		res->d_flags = res->d_flags & ~DCACHE_UNHASHED |
				DCACHE_DISCONNECTED;

Anyway, nice cleanup.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 75+ messages in thread
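The two spellings of the flag update discussed above compute the same value, because in C the binary & operator binds more tightly than |. A small user-space sketch of that equivalence follows; the numeric flag values are illustrative assumptions (the real definitions live in include/linux/dcache.h), and the function names are hypothetical.

```c
/* Illustrative values only; see include/linux/dcache.h for the real ones. */
#define DCACHE_DISCONNECTED 0x0004u
#define DCACHE_UNHASHED     0x0010u

/* Two-statement form, as written in the patch. */
unsigned int flags_two_steps(unsigned int flags)
{
	flags &= ~DCACHE_UNHASHED;	/* pretend the dentry is hashed */
	flags |= DCACHE_DISCONNECTED;	/* mark it as having no real parent */
	return flags;
}

/*
 * Single-expression form from the review. The parentheses are added
 * here for readability; without them the expression parses the same way.
 */
unsigned int flags_one_expr(unsigned int flags)
{
	return (flags & ~DCACHE_UNHASHED) | DCACHE_DISCONNECTED;
}
```

Both functions clear one bit and set another, so the choice between them is a matter of readability rather than behaviour.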
* Re: [PATCH] fs: pipe/sockets/anon dentries should have themselves as parent
  2008-11-21 18:43 ` Matthew Wilcox
@ 2008-11-23  3:53   ` Eric Dumazet
  0 siblings, 0 replies; 75+ messages in thread
From: Eric Dumazet @ 2008-11-23  3:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, David Miller, mingo, cl, rjw, linux-kernel,
	kernel-testers, efault, a.p.zijlstra, Linux Netdev List, viro,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 5733 bytes --]

Matthew Wilcox wrote:
> On Fri, Nov 21, 2008 at 06:58:29PM +0100, Eric Dumazet wrote:
>> +/**
>> + * d_alloc_unhashed - allocate unhashed dentry
>> + * @inode: inode to allocate the dentry for
>> + * @name: dentry name
>
> It's normal to list the parameters in the order they're passed to the
> function.  Not sure if we have a tool that checks for this or not --
> Randy?

Yes, no problem, better to have the same order.

>
>> + *
>> + * Allocate an unhashed dentry for the inode given. The inode is
>> + * instantiated and returned. %NULL is returned if there is insufficient
>> + * memory. Unhashed dentries have themselves as a parent.
>> + */
>> +
>> +struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
>> +{
>> +	struct qstr q = { .name = name, .len = strlen(name) };
>> +	struct dentry *res;
>> +
>> +	res = d_alloc(NULL, &q);
>> +	if (res) {
>> +		res->d_sb = inode->i_sb;
>> +		res->d_parent = res;
>> +		/*
>> +		 * We dont want to push this dentry into global dentry hash table.
>> +		 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
>> +		 * This permits a working /proc/$pid/fd/XXX on sockets,pipes,anon
>> +		 */
>
> Line length ... as checkpatch would have warned you ;-)
>
> And there are several other grammatical nitpicks with this comment.  Try
> this:
>
> 		/*
> 		 * We don't want to put this dentry in the global dentry
> 		 * hash table, so we pretend the dentry is already hashed
> 		 * by unsetting DCACHE_UNHASHED.  This permits
> 		 * /proc/$pid/fd/XXX to work for sockets, pipes and
> 		 * anonymous files (signalfd, timerfd, etc).
> 		 */

Yes, this is better.

>
>> +		res->d_flags &= ~DCACHE_UNHASHED;
>> +		res->d_flags |= DCACHE_DISCONNECTED;
>
> Is this really better than:
>
> 		res->d_flags = res->d_flags & ~DCACHE_UNHASHED |
> 				DCACHE_DISCONNECTED;

Well, I personally prefer the two lines, intention is more readable :)

>
> Anyway, nice cleanup.
>

Thanks Matthew, here is an updated version of the patch.

[PATCH] fs: pipe/sockets/anon dentries should have themselves as parent

Linking pipe/sockets/anon dentries to one root 'parent' has no functional
impact at all, but a scalability one.

We can avoid touching a cache line at allocation stage (inside d_alloc(),
no need to touch root->d_count), but also at freeing time (in d_kill,
decrementing d_count).
We avoid an expensive atomic_dec_and_lock() call on the root dentry.

We add d_alloc_unhashed(const char *name, struct inode *inode) helper
to be used by pipes/socket/anon. This function is about the same as
d_alloc_root() but for unhashed entries.

Before patch, time to run 8 * 1 million of close(socket()) calls on 8 CPUS was :

real    0m27.496s
user    0m0.657s
sys     3m39.092s

After patch :

real    0m23.843s
user    0m0.616s
sys     3m9.732s

Old oprofile :

CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
164257   164257        11.0245  11.0245    init_file
155488   319745        10.4359  21.4604    d_alloc
151887   471632        10.1942  31.6547    _atomic_dec_and_lock
91620    563252        6.1493   37.8039    inet_create
74245    637497        4.9831   42.7871    kmem_cache_alloc
46702    684199        3.1345   45.9216    dentry_iput
46186    730385        3.0999   49.0215    tcp_close
42824    773209        2.8742   51.8957    kmem_cache_free
37275    810484        2.5018   54.3975    wake_up_inode
36553    847037        2.4533   56.8508    tcp_v4_init_sock
35661    882698        2.3935   59.2443    inotify_d_instantiate
32998    915696        2.2147   61.4590    sysenter_past_esp
31442    947138        2.1103   63.5693    d_instantiate
31303    978441        2.1010   65.6703    generic_forget_inode
27533    1005974       1.8479   67.5183    vfs_dq_drop
24237    1030211       1.6267   69.1450    sock_attach_fd
19290    1049501       1.2947   70.4397    __copy_from_user_ll

New oprofile :

CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
148703   148703        10.8581  10.8581    inet_create
116680   265383        8.5198   19.3779    new_inode
108912   374295        7.9526   27.3306    init_file
82911    457206        6.0541   33.3846    kmem_cache_alloc
65690    522896        4.7966   38.1812    wake_up_inode
53286    576182        3.8909   42.0721    _atomic_dec_and_lock
43814    619996        3.1992   45.2713    generic_forget_inode
41993    661989        3.0663   48.3376    d_alloc
41244    703233        3.0116   51.3492    kmem_cache_free
39244    742477        2.8655   54.2148    tcp_v4_init_sock
37402    779879        2.7310   56.9458    tcp_close
33336    813215        2.4342   59.3800    sysenter_past_esp
28596    841811        2.0880   61.4680    inode_has_buffers
25769    867580        1.8816   63.3496    d_kill
22606    890186        1.6507   65.0003    dentry_iput
20224    910410        1.4767   66.4770    vfs_dq_drop
19800    930210        1.4458   67.9228    __copy_from_user_ll

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c       |    9 +--------
 fs/dcache.c            |   33 +++++++++++++++++++++++++++++++++
 fs/pipe.c              |   10 +---------
 include/linux/dcache.h |    1 +
 net/socket.c           |   10 +---------
 5 files changed, 37 insertions(+), 26 deletions(-)

[-- Attachment #2: d_alloc_unhashed2.patch --]
[-- Type: text/plain, Size: 4788 bytes --]

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..9fd0515 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -71,7 +71,6 @@ static struct dentry_operations anon_inodefs_dentry_operations = {
 int anon_inode_getfd(const char *name, const struct file_operations *fops,
 		     void *priv, int flags)
 {
-	struct qstr this;
 	struct dentry *dentry;
 	struct file *file;
 	int error, fd;
@@ -89,10 +88,7 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	 * using the inode sequence number.
 	 */
 	error = -ENOMEM;
-	this.name = name;
-	this.len = strlen(name);
-	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc_unhashed(name, anon_inode_inode);
 	if (!dentry)
 		goto err_put_unused_fd;

@@ -104,9 +100,6 @@ int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	atomic_inc(&anon_inode_inode->i_count);

 	dentry->d_op = &anon_inodefs_dentry_operations;
-	/* Do not publish this dentry inside the global dentry hash table */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, anon_inode_inode);

 	error = -ENFILE;
 	file = alloc_file(anon_inode_mnt, dentry,
diff --git a/fs/dcache.c b/fs/dcache.c
index a1d86c7..43ef88d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1111,6 +1111,39 @@ struct dentry * d_alloc_root(struct inode * root_inode)
 	return res;
 }

+/**
+ * d_alloc_unhashed - allocate unhashed dentry
+ * @name: dentry name
+ * @inode: inode to allocate the dentry for
+ *
+ * Allocate an unhashed dentry for the inode given. The inode is
+ * instantiated and returned. %NULL is returned if there is insufficient
+ * memory. Unhashed dentries have themselves as a parent.
+ */
+
+struct dentry * d_alloc_unhashed(const char *name, struct inode *inode)
+{
+	struct qstr q = { .name = name, .len = strlen(name) };
+	struct dentry *res;
+
+	res = d_alloc(NULL, &q);
+	if (res) {
+		res->d_sb = inode->i_sb;
+		res->d_parent = res;
+		/*
+		 * We dont want to push this dentry into global dentry
+		 * hash table, so we pretend the dentry is already hashed
+		 * by unsetting DCACHE_UNHASHED. This permits
+		 * /proc/$pid/fd/XXX to work for sockets, pipes, and
+		 * anonymous files (signalfd, timerfd, ...)
+		 */
+		res->d_flags &= ~DCACHE_UNHASHED;
+		res->d_flags |= DCACHE_DISCONNECTED;
+		d_instantiate(res, inode);
+	}
+	return res;
+}
+
 static inline struct hlist_head *d_hash(struct dentry *parent,
 					unsigned long hash)
 {
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..29fcac2 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -918,7 +918,6 @@ struct file *create_write_pipe(int flags)
 	struct inode *inode;
 	struct file *f;
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };

 	err = -ENFILE;
 	inode = get_pipe_inode();
@@ -926,18 +925,11 @@ struct file *create_write_pipe(int flags)
 		goto err;

 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", inode);
 	if (!dentry)
 		goto err_inode;

 	dentry->d_op = &pipefs_dentry_operations;
-	/*
-	 * We dont want to publish this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on pipes
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, inode);

 	err = -ENFILE;
 	f = alloc_file(pipe_mnt, dentry, FMODE_WRITE, &write_pipefifo_fops);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index a37359d..12438d6 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -238,6 +238,7 @@ extern int d_invalidate(struct dentry *);

 /* only used at mount-time */
 extern struct dentry * d_alloc_root(struct inode *);
+extern struct dentry * d_alloc_unhashed(const char *, struct inode *);

 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b659b5d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -371,20 +371,12 @@ static int sock_alloc_fd(struct file **filep, int flags)
 static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 {
 	struct dentry *dentry;
-	struct qstr name = { .name = "" };

-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc_unhashed("", SOCK_INODE(sock));
 	if (unlikely(!dentry))
 		return -ENOMEM;

 	dentry->d_op = &sockfs_dentry_operations;
-	/*
-	 * We dont want to push this dentry into global dentry hash table.
-	 * We pretend dentry is already hashed, by unsetting DCACHE_UNHASHED
-	 * This permits a working /proc/$pid/fd/XXX on sockets
-	 */
-	dentry->d_flags &= ~DCACHE_UNHASHED;
-	d_instantiate(dentry, SOCK_INODE(sock));

 	sock->file = file;
 	init_file(file, sock_mnt, dentry, FMODE_READ | FMODE_WRITE,

^ permalink raw reply related	[flat|nested] 75+ messages in thread
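The scalability argument of the patch (a self-parented dentry never writes to the shared root's reference count) can be modelled in user space. This is only a toy sketch under stated assumptions: struct toy_dentry and both allocators are hypothetical stand-ins, a plain int replaces the kernel's atomic d_count, and only the self-comparison in TOY_IS_ROOT() mirrors what the kernel's IS_ROOT() macro tests.

```c
#include <stdlib.h>

/* Toy stand-in for struct dentry; only the fields this sketch needs. */
struct toy_dentry {
	struct toy_dentry *d_parent;
	int d_count;		/* plain int here; atomic in the kernel */
};

/* A dentry that is its own parent is treated as a root. */
#define TOY_IS_ROOT(d) ((d) == (d)->d_parent)

/*
 * Old scheme: every pipe/socket/anon dentry hangs off one shared root,
 * so each allocation writes to the root's counter. On SMP that counter
 * sits in a cache line every CPU keeps dirtying.
 */
struct toy_dentry *toy_alloc_child(struct toy_dentry *root)
{
	struct toy_dentry *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->d_count = 1;
	d->d_parent = root;
	root->d_count++;	/* the shared write the patch removes */
	return d;
}

/*
 * New scheme: the dentry is its own parent, so allocation touches
 * only memory private to the new object.
 */
struct toy_dentry *toy_alloc_self_parent(void)
{
	struct toy_dentry *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->d_count = 1;
	d->d_parent = d;
	return d;
}
```

The freeing side is symmetric: with a shared root, teardown has to drop the root's count again, which is where the atomic_dec_and_lock() cost named in the changelog comes from.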