* [PATCH]: Brown paper bag in fs/file.c?

From: David S. Miller @ 2005-09-14 18:31 UTC
To: linux-kernel; +Cc: torvalds, akpm

While tracking down a sparc64 bug, I spotted what seems to be a bug in
__free_fdtable().

(The sparc64 bug is that a user page gets corrupted, and it's full of
kernel filp pointers. I've verified that the page mapped there is no
longer in the SLAB cache, but the filps are still active in the SLAB.
I've also ruled out cache aliasing and TLB flushing bugs.)

The bug is that free_fd_array() takes a "num" argument, but when
calling it from __free_fdtable() we're instead passing in the size in
bytes (ie. "num * sizeof(struct file *)").

How come this doesn't crash things for people? Perhaps I'm missing
something. mm/vmalloc.c should bark very loudly if we call it with a
non-vmalloc area address, since that is what would happen if we pass a
kmalloc() SLAB object address to vfree().

I think I know what might be happening. If the miscalculation means
that we kfree() the embedded fdarray, that would actually work just
fine, and free up the fdtable. I guess if the SLAB redzone stuff were
enabled for these caches, it would trigger when something like this
happens.

BTW, in mm/slab.c for FORCED_DEBUG we have:

	if ((size < 4096 || fls(size-1) == fls(size-1+3*BYTES_PER_WORD)))
		flags |= SLAB_RED_ZONE|SLAB_STORE_USER;

Isn't PAGE_SIZE intended here, not the constant "4096"? Just curious.

I'm about to reboot to see if this fixes the sparc64 problem, with my
luck it probably won't :-)

diff --git a/fs/file.c b/fs/file.c
--- a/fs/file.c
+++ b/fs/file.c
@@ -75,7 +75,7 @@ static void __free_fdtable(struct fdtabl
 	fdarray_size = fdt->max_fds * sizeof(struct file *);
 	free_fdset(fdt->open_fds, fdset_size);
 	free_fdset(fdt->close_on_exec, fdset_size);
-	free_fd_array(fdt->fd, fdarray_size);
+	free_fd_array(fdt->fd, fdt->max_fds);
 	kfree(fdt);
 }
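The failure mode David describes can be sketched in userspace. This is not the kernel code: `mock_free_fd_array`, `used_vfree`, and the page-size threshold are stand-ins, mimicking only the pattern of a free helper that chooses kfree() vs vfree() from the *entry count*. Passing a byte count where the entry count belongs inflates the argument by sizeof(pointer) and can flip a kmalloc'ed array onto the vfree() path:

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical sketch, not fs/file.c: the helper decides which
 * deallocator to use based on the number of entries, like
 * free_fd_array() deciding based on "num". */
#define MOCK_PAGE_SIZE 4096

int used_vfree; /* records which path the last call took */

void mock_free_fd_array(void **array, size_t num)
{
	if (num * sizeof(void *) <= MOCK_PAGE_SIZE)
		used_vfree = 0;	/* small enough: kfree() path */
	else
		used_vfree = 1;	/* large: vfree() path */
	free(array);
}
```

Calling it with `num * sizeof(struct file *)` instead of `num` makes an array that was small enough for the kfree() path look vfree()-sized, which is exactly the mismatch the patch removes.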
* Re: [PATCH]: Brown paper bag in fs/file.c?

From: David S. Miller @ 2005-09-14 18:48 UTC
To: linux-kernel; +Cc: torvalds, akpm

From: "David S. Miller" <davem@davemloft.net>
Date: Wed, 14 Sep 2005 11:31:33 -0700 (PDT)

> I'm about to reboot to see if this fixes the sparc64 problem, with my
> luck it probably won't :-)

It doesn't, but I think the patch is correct regardless.
* Re: [PATCH]: Brown paper bag in fs/file.c?

From: David S. Miller @ 2005-09-14 19:05 UTC
To: linux-kernel; +Cc: torvalds, akpm

From: "David S. Miller" <davem@davemloft.net>
Date: Wed, 14 Sep 2005 11:48:21 -0700 (PDT)

> > I'm about to reboot to see if this fixes the sparc64 problem, with my
> > luck it probably won't :-)
>
> It doesn't, but I think the patch is correct regardless.

Studying fs/file.c a bit more, there seem to be several of these
"size vs nr" errors in the allocation and freeing code. What's going
on here?
* Re: [PATCH]: Brown paper bag in fs/file.c?

From: Dipankar Sarma @ 2005-09-14 19:18 UTC
To: David S. Miller; +Cc: linux-kernel, torvalds, akpm

On Wed, Sep 14, 2005 at 11:31:33AM -0700, David S. Miller wrote:
> The bug is that free_fd_array() takes a "num" argument, but when
> calling it from __free_fdtable() we're instead passing in the size in
> bytes (ie. "num * sizeof(struct file *)").

Yes, it is a bug. I think I messed it up while merging newer changes
with an older version where I was using the size in bytes to optimize.

> How come this doesn't crash things for people? Perhaps I'm missing
> something.
>
> I think I know what might be happening. If the miscalculation means
> that we kfree() the embedded fdarray, that would actually work just
> fine, and free up the fdtable.

__free_fdtable() is used only when the fdarray/fdsets are vmalloced
(use of the workqueue) or there is a race between two expand_files().
That might be why we haven't seen this cause any explicit problem so
far.

This would be an appropriate patch (untested). I will update as soon
as testing is done.

Thanks
Dipankar

Fixes the fdtable freeing in the case of vmalloced fdsets/arrays.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>

 fs/file.c | 10 +++-------
 1 files changed, 3 insertions(+), 7 deletions(-)

diff -puN fs/file.c~files-fix-fdtable-free fs/file.c
--- linux-2.6.14-rc1-fd/fs/file.c~files-fix-fdtable-free	2005-09-15 00:36:03.000000000 +0530
+++ linux-2.6.14-rc1-fd-dipankar/fs/file.c	2005-09-15 00:39:46.000000000 +0530
@@ -69,13 +69,9 @@ void free_fd_array(struct file **array,
 static void __free_fdtable(struct fdtable *fdt)
 {
-	int fdset_size, fdarray_size;
-
-	fdset_size = fdt->max_fdset / 8;
-	fdarray_size = fdt->max_fds * sizeof(struct file *);
-	free_fdset(fdt->open_fds, fdset_size);
-	free_fdset(fdt->close_on_exec, fdset_size);
-	free_fd_array(fdt->fd, fdarray_size);
+	free_fdset(fdt->open_fds, fdt->max_fdset);
+	free_fdset(fdt->close_on_exec, fdt->max_fdset);
+	free_fd_array(fdt->fd, fdt->max_fds);
 	kfree(fdt);
 }
* Re: [PATCH]: Brown paper bag in fs/file.c?

From: David S. Miller @ 2005-09-14 19:57 UTC
To: dipankar; +Cc: linux-kernel, torvalds, akpm

From: Dipankar Sarma <dipankar@in.ibm.com>
Date: Thu, 15 Sep 2005 00:48:42 +0530

> __free_fdtable() is used only when the fdarray/fdsets are vmalloced
> (use of the workqueue) or there is a race between two expand_files().
> That might be why we haven't seen this cause any explicit problem
> so far.
>
> This would be an appropriate patch (untested). I will update
> as soon as testing is done.

Thanks.

I still can't figure out what causes my sparc64 bug. Somehow a
kmalloc() chunk of file pointers gets freed too early, the SLAB is
shrunk due to memory pressure so the page containing that object gets
freed, that page ends up as an anonymous page in userspace, but filp
writes from the older usage occur and corrupt the page.

I wonder if we simply leave a stale pointer around to the older fd
array in some case.
* Re: [PATCH]: Brown paper bag in fs/file.c?

From: Dipankar Sarma @ 2005-09-14 20:15 UTC
To: David S. Miller; +Cc: linux-kernel, torvalds, akpm

On Wed, Sep 14, 2005 at 12:57:50PM -0700, David S. Miller wrote:
> I still can't figure out what causes my sparc64 bug. Somehow a
> kmalloc() chunk of file pointers gets freed too early, the SLAB is
> shrunk due to memory pressure so the page containing that object gets
> freed, that page ends up as an anonymous page in userspace, but filp
> writes from the older usage occur and corrupt the page.
>
> I wonder if we simply leave a stale pointer around to the older
> fd array in some case.

Are you running with preemption enabled? If so, FYI, I had sent out a
patch earlier that fixes locking for preemption.

Also, what triggers this on your machine? I can try to reproduce it,
albeit on a non-sparc64 box.

Thanks
Dipankar
* Re: [PATCH]: Brown paper bag in fs/file.c?

From: David S. Miller @ 2005-09-14 20:29 UTC
To: dipankar; +Cc: linux-kernel, torvalds, akpm

From: Dipankar Sarma <dipankar@in.ibm.com>
Date: Thu, 15 Sep 2005 01:45:50 +0530

> Are you running with preemption enabled? If so, FYI, I had sent
> out a patch earlier that fixes locking for preemption.
> Also, what triggers this on your machine? I can try to reproduce
> it, albeit on a non-sparc64 box.

No PREEMPT enabled. I believe this bug has been around since before
your RCU changes; I've been seeing it for about half a year.

My test case is, on a uniprocessor with 1GB of RAM, run the following
in parallel:

1) In one window, do a plain kernel build of 2.6.x

2) In another window, in a temporary directory, run this shell code:

	while [ 1 ]
	do
		rm -rf .git *
		cp -a ../linux-2.6/.git .
		git-checkout-cache -a
		git-update-cache --refresh
		echo "Finished one pass..."
	done

Adjust the path to a vanilla 2.6.x ".git" directory in #2 as needed.

Sometimes this can trigger in less than a minute. Other times it
takes 5 or 6 minutes, but other than this variance it is rather
reliable.

I think the key is to make sure you don't have any more than 1GB of
RAM for this on a 64-bit box, so that there is sufficient memory
pressure from all of the ext3 writebacks and other fs activity to
shrink the file table SLABs and thus liberate the pages.
* [PATCH] reorder struct files_struct

From: Eric Dumazet @ 2005-09-14 21:17 UTC
To: dipankar; +Cc: David S. Miller, linux-kernel, torvalds, akpm

Hi

Browsing (and using) the excellent RCU infrastructure for files that
was adopted for 2.6.14-rc1, I noticed that the file_lock spinlock sits
close to the mostly-read fields of 'struct files_struct'.

In SMP (and NUMA) environments, each time a thread wants to open or
close a file, it has to acquire the spinlock, thus invalidating the
cache line containing this spinlock on other CPUs. So other threads
doing read()/write()/... calls that use RCU to access the file table
are going to need further memory (possibly NUMA) transactions to read
this memory line again.

Please consider applying this patch. It moves the spinlock to another
cache line, so that concurrent threads can share the cache line
containing the 'count' and 'fdt' fields.

Thank you

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

--- linux-2.6.14-rc1/include/linux/file.h	2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6.14-rc1-ed/include/linux/file.h	2005-09-15 01:09:13.000000000 +0200
@@ -34,12 +34,12 @@
  */
 struct files_struct {
 	atomic_t count;
-	spinlock_t file_lock;	/* Protects all the below members. Nests inside tsk->alloc_lock */
 	struct fdtable *fdt;
 	struct fdtable fdtab;
 	fd_set close_on_exec_init;
 	fd_set open_fds_init;
 	struct file * fd_array[NR_OPEN_DEFAULT];
+	spinlock_t file_lock;	/* Protects concurrent writers. Nests inside tsk->alloc_lock */
 };

 #define files_fdtable(files) (rcu_dereference((files)->fdt))
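Whether the move really separates the lock from the read-mostly fields can be checked mechanically with offsetof(). The sketch below is a userspace stand-in, not the kernel struct: the field types, the 40-byte fdtable body, NR_OPEN_DEFAULT and the 64-byte line size are all assumed values for illustration:

```c
#include <stddef.h>

/* Hypothetical reduced layout mirroring the patched files_struct;
 * sizes here are stand-ins, not kernel values. */
#define MOCK_NR_OPEN_DEFAULT 32
#define MOCK_CACHE_LINE 64

struct mock_files_struct {
	int count;			/* atomic_t stand-in */
	void *fdt;
	char fdtab[40];			/* struct fdtable stand-in */
	unsigned long close_on_exec_init;
	unsigned long open_fds_init;
	void *fd_array[MOCK_NR_OPEN_DEFAULT];
	int file_lock;			/* spinlock_t stand-in, moved last */
};

/* Two fields share an L1 line iff their offsets fall into the same
 * MOCK_CACHE_LINE-sized bucket. */
int same_line(size_t a, size_t b)
{
	return a / MOCK_CACHE_LINE == b / MOCK_CACHE_LINE;
}
```

With the lock placed after the fd_array, `same_line(offsetof(..., count), offsetof(..., file_lock))` comes out false, which is the whole point of the reorder.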
* Re: [PATCH] reorder struct files_struct

From: Peter Staubach @ 2005-09-14 21:35 UTC
To: Eric Dumazet; +Cc: dipankar, David S. Miller, linux-kernel, torvalds, akpm

Eric Dumazet wrote:
> In SMP (and NUMA) environments, each time a thread wants to open or
> close a file, it has to acquire the spinlock, thus invalidating the
> cache line containing this spinlock on other CPUs. So other threads
> doing read()/write()/... calls that use RCU to access the file table
> are going to need further memory (possibly NUMA) transactions to read
> this memory line again.
>
> Please consider applying this patch. It moves the spinlock to another
> cache line, so that concurrent threads can share the cache line
> containing the 'count' and 'fdt' fields.

How was the performance impact of this change measured?

Thanx...

	ps
* Re: [PATCH] reorder struct files_struct

From: Dipankar Sarma @ 2005-09-14 22:02 UTC
To: Eric Dumazet; +Cc: David S. Miller, linux-kernel, torvalds, akpm

On Wed, Sep 14, 2005 at 11:17:42PM +0200, Eric Dumazet wrote:
> Please consider applying this patch. It moves the spinlock to another
> cache line, so that concurrent threads can share the cache line
> containing the 'count' and 'fdt' fields.

For most apps without too many open fds, the embedded fd_sets are
going to be used. Wouldn't that mean that open()/close() will
invalidate the cache line containing fdt and fdtab by updating the
fd_sets? If so, your optimization really doesn't help.

Thanks
Dipankar
* Re: [PATCH] reorder struct files_struct

From: Andrew Morton @ 2005-09-14 22:17 UTC
To: dipankar; +Cc: dada1, davem, linux-kernel, torvalds

Dipankar Sarma <dipankar@in.ibm.com> wrote:
> For most apps without too many open fds, the embedded fd_sets
> are going to be used. Wouldn't that mean that open()/close() will
> invalidate the cache line containing fdt and fdtab by updating
> the fd_sets? If so, your optimization really doesn't help.

Guys, this is benchmarkable. fget() is astonishingly high in some
profiles - it's worth investigating.
* Re: [PATCH] reorder struct files_struct

From: Eric Dumazet @ 2005-09-14 22:42 UTC
To: dipankar; +Cc: David S. Miller, linux-kernel, torvalds, akpm

Dipankar Sarma wrote:
> For most apps without too many open fds, the embedded fd_sets
> are going to be used. Wouldn't that mean that open()/close() will
> invalidate the cache line containing fdt and fdtab by updating
> the fd_sets? If so, your optimization really doesn't help.

If the embedded struct fdtable is used, then the only touched field
is 'next_fd', so we could also move this field to the end of 'struct
fdtable'.

But I wonder if 'next_fd' really has to be in 'struct fdtable'; maybe
it could be moved to 'struct files_struct', close to file_lock? If
yes, the whole embedded struct fdtable is read-only.

Eric
* Re: [PATCH] reorder struct files_struct

From: Dipankar Sarma @ 2005-09-14 22:50 UTC
To: Eric Dumazet; +Cc: David S. Miller, linux-kernel, torvalds, akpm

On Thu, Sep 15, 2005 at 12:42:03AM +0200, Eric Dumazet wrote:
> If the embedded struct fdtable is used, then the only touched field
> is 'next_fd', so we could also move this field to the end of 'struct
> fdtable'.

Not just the embedded fdtable, but also the embedded fdsets. I would
expect count, fdt, fdtab and the fdsets to fit into one cache line on
some archs.

> But I wonder if 'next_fd' really has to be in 'struct fdtable'; maybe
> it could be moved to 'struct files_struct', close to file_lock?

next_fd has to be in struct fdtable. It needs to be consistent with
whichever fdtable a lock-free reader sees.

> If yes, the whole embedded struct fdtable is read-only.

But not close_on_exec_init or open_fds_init. We would update them on
open/close.

Some benchmarking would be useful here.

Thanks
Dipankar
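The next_fd field being debated is a search hint into the open-fds bitmap, consulted when allocating the lowest free descriptor. A minimal single-word sketch of that writer-side pattern (the helper names are made up, the real code operates on multi-word bitmaps, and the caller is assumed to hold file_lock):

```c
#include <limits.h>

#define MOCK_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* naive find-first-zero-bit at or after 'start' in a one-word bitmap */
static int find_next_zero(unsigned long map, unsigned int start)
{
	unsigned int i;

	for (i = start; i < MOCK_BITS_PER_LONG; i++)
		if (!(map & (1UL << i)))
			return (int)i;
	return -1;
}

/* Allocate the lowest free fd >= *next_fd, mark it open, and advance
 * the hint. In the kernel this runs under file_lock; readers using
 * RCU never consult next_fd on this sketch's model. */
int mock_get_unused_fd(unsigned long *open_fds, unsigned int *next_fd)
{
	int fd = find_next_zero(*open_fds, *next_fd);

	if (fd < 0)
		return -1;		/* table full (EMFILE-ish) */
	*open_fds |= 1UL << fd;
	*next_fd = fd + 1;
	return fd;
}
```

Since only this allocation path reads or writes the hint, where next_fd lives determines which cache line the writer dirties, which is exactly the placement question in this subthread.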
* Re: [PATCH] reorder struct files_struct

From: Eric Dumazet @ 2005-09-14 23:19 UTC
To: dipankar; +Cc: David S. Miller, linux-kernel, torvalds, akpm

Dipankar Sarma wrote:
> next_fd has to be in struct fdtable. It needs to be consistent
> with whichever fdtable a lock-free reader sees.

I may be wrong, but I think a reader never looks at next_fd. next_fd
is only used (read/written) by a writer, with file_lock held, in
locate_fd(), copy_fdtable(), get_unused_fd() and __put_unused_fd().

> > If yes, the whole embedded struct fdtable is read-only.
>
> But not close_on_exec_init or open_fds_init. We would update them
> on open/close.

Yes, sure, but those fields are not part of the embedded struct
fdtable.

> Some benchmarking would be useful here.

This simple bench can be used. I got good results on a dual Opteron
machine. As Opterons have pretty good NUMA links (HyperTransport), I
suspect older hardware should obtain even better results.

$ gcc -O2 -o bench bench.c -lpthread
$ ./bench -t 2 -l 10	# run the bench with 2 threads for 10 seconds
2 threads, 10 seconds, work_done=1721627

To run it with a small fdset (no more than 3 + (nbthreads) files
opened):

$ ./bench -s -t 2 -l 10
2 threads, 10 seconds, work_done=1709716

Unfortunately I cannot boot the dual Opterons at the moment to try
moving next_fd close to file_lock.

Eric

[bench.c]

/*
 * Bench program to exercise multiple threads using
 * open()/close()/read()/lseek() calls.
 * Usage:
 *	bench [-t XX] [-l len]
 *	XX  : number of threads
 *	len : bench time in seconds
 *	-s  : small fdset: try to use the embedded struct fdtable
 */
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;
char *file = "/etc/passwd";
int sflag;	/* small fdset */
int end_prog;
unsigned long work_done;

void *perform_work(void *arg)
{
	int fd, i;
	unsigned long units = 0;
	char c;

	/* force this program to open more than 64 fds */
	if (!sflag)
		for (i = 0; i < 64; i++)
			open("/dev/null", O_RDONLY);
	while (!end_prog) {
		fd = open(file, O_RDONLY);
		read(fd, &c, 1);
		lseek(fd, 10, SEEK_SET);
		read(fd, &c, 1);
		lseek(fd, 20, SEEK_SET);
		read(fd, &c, 1);
		lseek(fd, 30, SEEK_SET);
		read(fd, &c, 1);
		lseek(fd, 40, SEEK_SET);
		read(fd, &c, 1);
		close(fd);
		units++;
	}
	pthread_mutex_lock(&mut);
	work_done += units;
	pthread_mutex_unlock(&mut);
	return 0;
}

void usage(int code)
{
	fprintf(stderr, "Usage : bench [-s] [-t threads] [-l duration]\n");
	exit(code);
}

int main(int argc, char *argv[])
{
	int i, c;
	int nbthreads = 2;
	unsigned int length = 10;
	pthread_t *tid;

	while ((c = getopt(argc, argv, "st:l:")) != -1) {
		if (c == 't')
			nbthreads = atoi(optarg);
		else if (c == 'l')
			length = atoi(optarg);
		else if (c == 's')
			sflag = 1;
		else
			usage(1);
	}
	tid = malloc(nbthreads * sizeof(pthread_t));
	for (i = 0; i < nbthreads; i++)
		pthread_create(tid + i, NULL, perform_work, NULL);
	sleep(length);
	end_prog = 1;
	for (i = 0; i < nbthreads; i++)
		pthread_join(tid[i], NULL);
	pthread_mutex_lock(&mut);
	printf("%d threads, %u seconds, work_done=%lu\n",
	       nbthreads, length, work_done);
	pthread_mutex_unlock(&mut);
	return 0;
}
* Re: [PATCH] reorder struct files_struct

From: Dipankar Sarma @ 2005-09-15 4:54 UTC
To: Eric Dumazet; +Cc: David S. Miller, linux-kernel, torvalds, akpm

On Thu, Sep 15, 2005 at 01:19:47AM +0200, Eric Dumazet wrote:
> > But not close_on_exec_init or open_fds_init. We would update them
> > on open/close.
>
> Yes, sure, but those fields are not part of the embedded struct
> fdtable.

Those fdsets would share a cache line with fdt and fdtab, which would
be invalidated on open/close. So, what is the point in moving
file_lock?

Thanks
Dipankar
* Re: [PATCH] reorder struct files_struct

From: Eric Dumazet @ 2005-09-15 6:17 UTC
To: dipankar; +Cc: David S. Miller, linux-kernel, torvalds, akpm

Dipankar Sarma wrote:
> Those fdsets would share a cache line with fdt and fdtab, which would
> be invalidated on open/close. So, what is the point in moving
> file_lock?

The point is that we gain nothing in this case for 32-bit platforms,
but we gain something on 64-bit platforms. And for apps using more
than NR_OPEN_DEFAULT files we definitely win on both 32 bits and 64
bits.

Maybe moving file_lock just before close_on_exec_init would be a
better choice, so that small apps on 32-bit platforms touch one cache
line instead of two.

sizeof(struct fdtable) = 40/0x28 on 32 bits, 72/0x48 on 64 bits

struct files_struct {
	/* mostly read */
	atomic_t count;			/* offset 0x00 */
	struct fdtable *fdt;		/* offset 0x04 (0x08 on 64 bits) */
	struct fdtable fdtab;		/* offset 0x08 (0x10 on 64 bits) */

	/* read/written for apps using a small number of files */
	fd_set close_on_exec_init;	/* offset 0x30 (0x58 on 64 bits) */
	fd_set open_fds_init;		/* offset 0x34 (0x60 on 64 bits) */
	struct file * fd_array[NR_OPEN_DEFAULT];
					/* offset 0x38 (0x68 on 64 bits) */
	spinlock_t file_lock;		/* 0xB8 (0x268 on 64 bits) */
};	/* size = 0xBC (0x270 on 64 bits) */

Moving next_fd from 'struct fdtable' to 'struct files_struct' is also
a win for 64-bit platforms, since sizeof(struct fdtable) becomes 64: a
nice power of two, so 64 bytes are allocated instead of 128.

Eric
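Eric's power-of-two argument rests on how the general-purpose allocator sizes its caches. A rough model (the 32-byte minimum bucket is an assumption, and real kmalloc size classes vary by architecture and config):

```c
#include <stddef.h>

/* Rough model of kmalloc size classes: a request is served from the
 * smallest power-of-two cache that fits it, so a 72-byte object
 * occupies a 128-byte slot while a 64-byte object fits exactly. */
static size_t kmalloc_bucket(size_t size)
{
	size_t bucket = 32;	/* assumed smallest general cache */

	while (bucket < size)
		bucket <<= 1;
	return bucket;
}
```

Under this model, shrinking struct fdtable from 72 to 64 bytes halves its per-object footprint, which is the win Eric is pointing at.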
* Re: [PATCH] reorder struct files_struct

From: Dipankar Sarma @ 2005-09-15 9:35 UTC
To: Eric Dumazet; +Cc: David S. Miller, linux-kernel, torvalds, akpm

On Thu, Sep 15, 2005 at 08:17:40AM +0200, Eric Dumazet wrote:
> The point is that we gain nothing in this case for 32-bit platforms,
> but we gain something on 64-bit platforms. And for apps using more
> than NR_OPEN_DEFAULT files we definitely win on both 32 bits and 64
> bits.

I am not sure about that. IIRC, x86_64 has a 128-byte L1 cache line.
So, count, fdt, fdtab, close_on_exec_init and open_fds_init would all
fit into one cache line. And close_on_exec_init will get updated on
open(). Also, most apps will likely not have more than the default #
of fds; it might not be a good idea to optimize for that case.

> Moving next_fd from 'struct fdtable' to 'struct files_struct' is also
> a win for 64-bit platforms, since sizeof(struct fdtable) becomes 64:
> a nice power of two, so 64 bytes are allocated instead of 128.

Can you benchmark this on a higher-end SMP/NUMA system?

Thanks
Dipankar
* Re: [PATCH] reorder struct files_struct 2005-09-15 9:35 ` Dipankar Sarma @ 2005-09-15 17:11 ` Eric Dumazet 0 siblings, 0 replies; 19+ messages in thread From: Eric Dumazet @ 2005-09-15 17:11 UTC (permalink / raw) To: dipankar; +Cc: David S. Miller, linux-kernel, torvalds, akpm [-- Attachment #1: Type: text/plain, Size: 5949 bytes --] Dipankar Sarma a écrit : > On Thu, Sep 15, 2005 at 08:17:40AM +0200, Eric Dumazet wrote: >>The point is that we gain nothing in this case for 32 bits platforms, but >>we gain something on 64 bits platform. And for apps using more than > > > I am not sure about that. IIRC, x86_64 has a 128-byte L1 cacheline. > So, count, fdt, fdtab, close_on_exec_init and open_fds_init would > all fit into one cache line. And close_on_exec_init will get updated > on open(). Also, most apps will not likely have more than the > default # of fds, it might not be a good idea to optimize for > that case. x86_64 has 64-bytes L1 cache line, at least for AMD cpus. SMP P3 are quite common and they have 32-bytes L1 cache line. *** WARNING *** BUG **** : In the mean time I discovered that sizeof(fd_set) was 128 bytes ! I thought it was sizeof(long) only. This is a huge waste of 248 bytes for most apps (apps that open less than 32 or 64 files) >> >>Moving next_fd from 'struct fdtable' to 'struct files_struct' is also a win >>for 64bits platforms since sizeof(struct fdtable) become 64 : a nice power >>of two, so 64 bytes are allocated instead of 128. > > > Can you benchmark this on a higher end SMP/NUMA system ? Well, I can benchmark on a dual Xeon 2GHz machine. (Dell Poweredge 1600SC) Thats 2 physical cpus, four logical cpus (thanks to HyperThreading) Not exactly higher end NUMA systems unfortunatly. 
Here are the results:

linux-2.6.14-rc1

# ./bench -t 1 -l 100 -s      # one thread, small fdset
1 threads, 100.005352 seconds, work_done=8429730
# ./bench -t 1 -l 100         # one thread, big fdset
1 threads, 100.006087 seconds, work_done=8343664
# ./bench -t 1 -l 100 -s -i   # one thread plus one idle thread to force locks
1 threads, 100.007956 seconds, work_done=6786008
# ./bench -t 1 -l 100 -i      # one thread + idle, big fdset
1 threads, 100.005044 seconds, work_done=6791259
# ./bench -t 2 -l 100 -s      # two threads, small fdset
2 threads, 100.002860 seconds, work_done=11034805
# ./bench -t 2 -l 100         # two threads, big fdset
2 threads, 100.006046 seconds, work_done=11063804
# ./bench -t 2 -l 100 -s -a 5 # force affinity to two phys CPUs
2 threads, 100.004547 seconds, work_done=10825310
# ./bench -t 2 -l 100 -a 5
2 threads, 100.006288 seconds, work_done=11273778
# ./bench -t 4 -l 100 -s      # four threads, small fdset
4 threads, 100.007234 seconds, work_done=15061795
# ./bench -t 4 -l 100         # four threads, big fdset
4 threads, 100.007620 seconds, work_done=14811832

linux-2.6.14-rc1 + patch:

# ./bench -t 1 -l 100 -s      # one thread, small fdset
1 threads, 100.005759 seconds, work_done=8406981   (~same)
# ./bench -t 1 -l 100         # one thread, big fdset
1 threads, 100.006887 seconds, work_done=8350681   (~same)
# ./bench -t 1 -l 100 -s -i   # one thread plus one idle thread to force locks
1 threads, 100.005829 seconds, work_done=6858520   (1% better)
# ./bench -t 1 -l 100 -i      # one thread + idle, big fdset
1 threads, 100.007902 seconds, work_done=6847941   (~same)
# ./bench -t 2 -l 100 -s      # two threads, small fdset
2 threads, 100.005877 seconds, work_done=11257165  (2% better)
# ./bench -t 2 -l 100         # two threads, big fdset
2 threads, 100.005561 seconds, work_done=11520262  (4% better)
# ./bench -t 2 -l 100 -s -a 5 # force affinity to two phys CPUs
2 threads, 100.006744 seconds, work_done=11505449  (6% better)
# ./bench -t 2 -l 100 -a 5
2 threads, 100.006706 seconds, work_done=11688051  (3% better)
# ./bench -t 4 -l 100 -s      # four threads, small fdset
4 threads, 100.007496 seconds, work_done=15556770  (3% better)
# ./bench -t 4 -l 100         # four threads, big fdset
4 threads, 100.009882 seconds, work_done=16145618  (9% better)

linux-2.6.14-rc1 + patch + two embedded fd_sets replaced by a long
(this change should also speed up fork()):

struct fdtable {
        unsigned int max_fds;
        int max_fdset;
        struct file **fd;       /* current fd array */
        fd_set *close_on_exec;
        fd_set *open_fds;
        struct rcu_head rcu;
        struct files_struct *free_files;
        struct fdtable *next;
};

/*
 * Open file table structure
 */
struct files_struct {
        /* read mostly part */
        atomic_t count;
        struct fdtable *fdt;
        struct fdtable fdtab;

        /* written part */
        spinlock_t file_lock ____cacheline_aligned_in_smp;
        int next_fd;
        unsigned long close_on_exec_init;
        unsigned long open_fds_init;
        struct file *fd_array[NR_OPEN_DEFAULT];
};

# grep files_cache /proc/slabinfo
files_cache      53    195    256   15    1 : tunables  120   60    8 : slabdata  13  13  0

(256 bytes used instead of 512 for files_cache objects)

# ./bench -t 1 -l 100 -s      # one thread, small fdset
1 threads, 100.007298 seconds, work_done=8413167   (~same)
# ./bench -t 1 -l 100         # one thread, big fdset
1 threads, 100.006007 seconds, work_done=8441197   (1% better)
# ./bench -t 1 -l 100 -s -i   # one thread plus one idle thread to force locks
1 threads, 100.005101 seconds, work_done=6870893   (1% better)
# ./bench -t 1 -l 100 -i      # one thread + idle, big fdset
1 threads, 100.005285 seconds, work_done=6852314   (~same)
# ./bench -t 2 -l 100 -s      # two threads, small fdset
2 threads, 100.007029 seconds, work_done=11424646  (3.5% better)
# ./bench -t 2 -l 100         # two threads, big fdset
2 threads, 100.006128 seconds, work_done=11634769  (5% better)
# ./bench -t 2 -l 100 -s -a 5 # force affinity to two phys CPUs
2 threads, 100.008100 seconds, work_done=11408030  (5% better)
# ./bench -t 2 -l 100 -a 5
2 threads, 100.004221 seconds, work_done=11686082  (3% better)
# ./bench -t 4 -l 100 -s      # four threads, small fdset
4 threads, 100.008243 seconds, work_done=15818419  (5% better)
# ./bench -t 4 -l 100         # four threads, big fdset
4 threads, 100.008279 seconds, work_done=16352921  (10% better)

I suspect that NUMA machines will get more interesting results...

Eric

Attached bench source code.
Compile: gcc -O2 -o bench bench.c -lpthread

[-- Attachment #2: bench.c --]
[-- Type: text/plain, Size: 3304 bytes --]

/*
 * Bench program to exercise multiple threads using
 * open()/close()/read()/lseek() calls.
 *
 * Usage:
 *      bench [-t XX] [-l len]
 *      XX  : number of threads
 *      len : bench time in seconds
 *      -s  : small fdset: try to use the embedded struct fdtable
 */
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>
#include <sched.h>
#include <errno.h>
#include <signal.h>

static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int sflag;       /* small fdset */
static int end_prog;
static unsigned long work_done;
static char onekilo[1024];

static void catch_alarm(int sig)
{
        end_prog = 1;
}

static void *perform_work(void *arg)
{
        int fd, i;
        unsigned long units = 0;
        int fds[64];
        char filename[64];
        char c;

        strcpy(filename, "/tmp/benchXXXXXX");
        fd = mkstemp(filename);
        write(fd, onekilo, sizeof(onekilo));
        close(fd);

        /* force this program to open more than 64 fds */
        if (!sflag)
                for (i = 0; i < 64; i++)
                        fds[i] = open("/dev/null", O_RDONLY);

        while (!end_prog) {
                fd = open(filename, O_RDONLY);
                read(fd, &c, 1);
                lseek(fd, 10, SEEK_SET);
                read(fd, &c, 1);
                lseek(fd, 20, SEEK_SET);
                read(fd, &c, 1);
                lseek(fd, 30, SEEK_SET);
                read(fd, &c, 1);
                lseek(fd, 40, SEEK_SET);
                read(fd, &c, 1);
                close(fd);
                units++;
        }
        unlink(filename);
        if (!sflag)
                for (i = 0; i < 64; i++)
                        close(fds[i]);
        pthread_mutex_lock(&mut);
        work_done += units;
        pthread_mutex_unlock(&mut);
        return 0;
}

static void *idle_thread(void *arg)
{
        pthread_mutex_lock(&mut);
        while (!end_prog)
                pthread_cond_wait(&cond, &mut);
        pthread_mutex_unlock(&mut);
        return 0;       /* was missing: a thread function must return a value */
}

static void usage(int code)
{
        fprintf(stderr,
                "Usage : bench [-i] [-s] [-a affinity_mask] [-t threads] [-l duration]\n");
        exit(code);
}

int main(int argc, char *argv[])
{
        int i, c;
        int nbthreads = 2;
        int iflag = 0;
        unsigned int length = 10;
        pthread_t *tid;
        struct sigaction sg;
        struct timeval t0, t1;
        long mask = 0;

        while ((c = getopt(argc, argv, "sit:l:a:")) != -1) {
                if (c == 't')
                        nbthreads = atoi(optarg);
                else if (c == 'l')
                        length = atoi(optarg);
                else if (c == 's')
                        sflag = 1;
                else if (c == 'i')
                        iflag = 1;
                else if (c == 'a')
                        sscanf(optarg, "%li", &mask);
                else
                        usage(1);
        }
        if (mask != 0) {
                /* two-argument sched_setaffinity() of 2005-era glibc */
                int res = sched_setaffinity(0, &mask);
                if (res != 0)
                        fprintf(stderr, "sched_affinity(0x%lx)->%d errno=%d\n",
                                mask, res, errno);
        }
        tid = malloc(nbthreads * sizeof(pthread_t));
        gettimeofday(&t0, NULL);
        for (i = 1; i < nbthreads; i++)
                pthread_create(tid + i, NULL, perform_work, NULL);
        if (iflag)
                pthread_create(tid, NULL, idle_thread, NULL);
        memset(&sg, 0, sizeof(sg));
        sg.sa_handler = catch_alarm;
        sigaction(SIGALRM, &sg, NULL);
        alarm(length);
        perform_work(NULL);
        if (iflag) {
                pthread_cond_signal(&cond);
                pthread_join(tid[0], NULL);
        }
        for (i = 1; i < nbthreads; i++)
                pthread_join(tid[i], NULL);
        gettimeofday(&t1, NULL);
        t1.tv_sec -= t0.tv_sec;
        t1.tv_usec -= t0.tv_usec;
        if (t1.tv_usec < 0) {
                t1.tv_usec += 1000000;
                t1.tv_sec--;
        }
        pthread_mutex_lock(&mut);
        printf("%d threads, %d.%06d seconds, work_done=%lu\n",
               nbthreads, (int)t1.tv_sec, (int)t1.tv_usec, work_done);
        pthread_mutex_unlock(&mut);
        return 0;
}
* Re: [PATCH]: Brown paper bag in fs/file.c?
  2005-09-14 20:15 ` Dipankar Sarma
  2005-09-14 20:29 ` David S. Miller
@ 2005-09-15 21:06 ` David S. Miller
  1 sibling, 0 replies; 19+ messages in thread
From: David S. Miller @ 2005-09-15 21:06 UTC (permalink / raw)
To: dipankar; +Cc: linux-kernel, torvalds, akpm

From: Dipankar Sarma <dipankar@in.ibm.com>
Date: Thu, 15 Sep 2005 01:45:50 +0530

> Are you running with preemption enabled ? If so, fyi, I had sent
> out a patch earlier that fixes locking for preemption.
> Also, what triggers this in your machine ? I can try to reproduce
> this albeit on a non-sparc64 box.

Dipankar, don't spend too much effort trying to reproduce my problem.
I'm back to thinking I might have some kind of sparc64-specific bug
on my hands.

Thanks for all of your help.
end of thread, other threads: [~2005-09-15 21:06 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)

2005-09-14 18:31 [PATCH]: Brown paper bag in fs/file.c? David S. Miller
2005-09-14 18:48 ` David S. Miller
2005-09-14 19:05 ` David S. Miller
2005-09-14 19:18 ` Dipankar Sarma
2005-09-14 19:57 ` David S. Miller
2005-09-14 20:15 ` Dipankar Sarma
2005-09-14 20:29 ` David S. Miller
2005-09-14 21:17 ` [PATCH] reorder struct files_struct Eric Dumazet
2005-09-14 21:35 ` Peter Staubach
2005-09-14 22:02 ` Dipankar Sarma
2005-09-14 22:17 ` Andrew Morton
2005-09-14 22:42 ` Eric Dumazet
2005-09-14 22:50 ` Dipankar Sarma
2005-09-14 23:19 ` Eric Dumazet
2005-09-15  4:54 ` Dipankar Sarma
2005-09-15  6:17 ` Eric Dumazet
2005-09-15  9:35 ` Dipankar Sarma
2005-09-15 17:11 ` Eric Dumazet
2005-09-15 21:06 ` [PATCH]: Brown paper bag in fs/file.c? David S. Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox