* VFS: file-max limit 50044 reached
@ 2005-10-15 13:19 Serge Belyshev
  2005-10-15 17:53 ` Serge Belyshev
  0 siblings, 1 reply; 48+ messages in thread
From: Serge Belyshev @ 2005-10-15 13:19 UTC (permalink / raw)
To: linux-kernel
This program:
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
int main (void)
{
	int f, j;

	j = 0;
	while (1) {
		f = open ("/dev/null", O_RDONLY);
		if (f == -1) {
			fprintf (stderr, "open (%i): %s\n", j, strerror (errno));
			abort ();
		}
		close (f);
		j++;
	}
	return 0;
}
fails on 2.6.14-rc4 kernel with this message:
$ ./a.out
VFS: file-max limit 50044 reached
open (55499): Too many open files in system
Aborted
$
This problem was reproduced on i386 and amd64 with
kernels 2.6.14-rc1 .. 2.6.14-rc4-git4
^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: VFS: file-max limit 50044 reached
  2005-10-15 13:19 VFS: file-max limit 50044 reached Serge Belyshev
@ 2005-10-15 17:53 ` Serge Belyshev
  2005-10-16 16:23   ` Dipankar Sarma
  0 siblings, 1 reply; 48+ messages in thread
From: Serge Belyshev @ 2005-10-15 17:53 UTC (permalink / raw)
To: linux-kernel

>This problem was reproduced on i386 and amd64 with
>kernels 2.6.14-rc1 .. 2.6.14-rc4-git4

Caused by this change:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ab2af1f5005069321c5d130f09cce577b03f43ef
or
http://tinyurl.com/cyrou

aka "[PATCH] files: files struct with RCU"

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-15 17:53 ` Serge Belyshev
@ 2005-10-16 16:23   ` Dipankar Sarma
  2005-10-16 18:51     ` Serge Belyshev
  2005-10-17  2:34     ` Linus Torvalds
  0 siblings, 2 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-16 16:23 UTC (permalink / raw)
To: Serge Belyshev
Cc: linux-kernel, Linus Torvalds, khali, Andrew Morton, Manfred Spraul

On Sat, Oct 15, 2005 at 09:53:14PM +0400, Serge Belyshev wrote:
> 
> >This problem was reproduced on i386 and amd64 with
> >kernels 2.6.14-rc1 .. 2.6.14-rc4-git4
> 
> Caused by this change:
> 
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ab2af1f5005069321c5d130f09cce577b03f43ef
> or
> http://tinyurl.com/cyrou
> 
> aka "[PATCH] files: files struct with RCU"

Linus, I don't think this has anything to do with RCU grace periods
like we discussed previously. I measured on my 3.6GHz x86_64 and found
that an open()/close() pair on /dev/null takes about 45500 cycles or
12 microseconds. [Does that sound reasonable?] So, assuming 100 HZ, we
can't really queue more than 1660 file structs to free per 2 timer
ticks. In fact, I looked at the filp slabinfo and we were indeed
returning file structures to slab.

I think this is a known issue I was looking at earlier - the way we do
file struct accounting is not very suitable for batched freeing. For
scalability reasons, file accounting was constructor/destructor based.
This meant that nr_files was decremented only when the object was
removed from the slab cache. This is susceptible to slab
fragmentation. With the RCU-based file structure, the consequent
batched freeing and a test program like Serge's, we just speed this up
and end up with a very fragmented slab -

llm22:~ # cat /proc/sys/fs/file-nr
587730  0       758844

At the same time, I see only 2000+ objects in the filp cache. To
verify this theory, I tried the following experimental patch I had
from before, and it fixes this problem. However, I ran into my old
"bad page state" problem that I have been seeing since 2.6.9-rc2 on
that machine. That needs a separate investigation.

Serge, could you please try the following experimental patch just to
see if file counting is indeed the problem. The patch is definitely
*not* meant for inclusion. Yet. Manfred told me a while ago that
global filp counting caused scalability problems in some benchmarks -
something I haven't been able to verify.

Thanks
Dipankar


This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate atomic_t, nr_files, for now and all accesses
to it are through the get_nr_files() api. In the sysctl handler for
nr_files, we populate files_stat.nr_files before returning to user.
Counting files as and when they are created and destroyed (as opposed
to inside slab) allows us to correctly count open files with RCU.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
---

diff -puN fs/dcache.c~files-scale-file-counting fs/dcache.c
--- linux-2.6.14-rc1-test/fs/dcache.c~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/fs/dcache.c	2005-10-16 14:03:25.000000000 -0700
@@ -1730,7 +1730,7 @@ void __init vfs_caches_init(unsigned lon
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
 
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC, filp_ctor, filp_dtor);
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
 
 	dcache_init(mempages);
 	inode_init(mempages);
diff -puN fs/file_table.c~files-scale-file-counting fs/file_table.c
--- linux-2.6.14-rc1-test/fs/file_table.c~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/fs/file_table.c	2005-10-16 14:07:20.000000000 -0700
@@ -5,6 +5,7 @@
  *  Copyright (C) 1997 David S. Miller (davem@caip.rutgers.edu)
  */
 
+#include <linux/config.h>
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/file.h>
@@ -18,52 +19,67 @@
 #include <linux/mount.h>
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
+#include <linux/sysctl.h>
+#include <asm/atomic.h>
 
 /* sysctl tunables... */
 struct files_stat_struct files_stat = {
 	.max_files = NR_FILE
 };
 
-EXPORT_SYMBOL(files_stat); /* Needed by unix.o */
-
 /* public. Not pretty! */
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
 
-static DEFINE_SPINLOCK(filp_count_lock);
+static atomic_t nr_files __cacheline_aligned_in_smp;
 
-/* slab constructors and destructors are called from arbitrary
- * context and must be fully threaded - use a local spinlock
- * to protect files_stat.nr_files
- */
-void filp_ctor(void * objp, struct kmem_cache_s *cachep, unsigned long cflags)
+static inline void file_free_rcu(struct rcu_head *head)
 {
-	if ((cflags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
-	    SLAB_CTOR_CONSTRUCTOR) {
-		unsigned long flags;
-		spin_lock_irqsave(&filp_count_lock, flags);
-		files_stat.nr_files++;
-		spin_unlock_irqrestore(&filp_count_lock, flags);
-	}
+	struct file *f = container_of(head, struct file, f_rcuhead);
+	kmem_cache_free(filp_cachep, f);
 }
 
-void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags)
+static inline void file_free(struct file *f)
 {
-	unsigned long flags;
-	spin_lock_irqsave(&filp_count_lock, flags);
-	files_stat.nr_files--;
-	spin_unlock_irqrestore(&filp_count_lock, flags);
+	atomic_dec(&nr_files);
+	call_rcu(&f->f_rcuhead, file_free_rcu);
 }
 
-static inline void file_free_rcu(struct rcu_head *head)
+/*
+ * Return the total number of open files in the system
+ */
+int get_nr_files(void)
 {
-	struct file *f = container_of(head, struct file, f_rcuhead);
-	kmem_cache_free(filp_cachep, f);
+	return atomic_read(&nr_files);
 }
 
-static inline void file_free(struct file *f)
+/*
+ * Return the maximum number of open files in the system
+ */
+int get_max_files(void)
 {
-	call_rcu(&f->f_rcuhead, file_free_rcu);
+	return files_stat.max_files;
+}
+
+EXPORT_SYMBOL(get_nr_files);
+EXPORT_SYMBOL(get_max_files);
+
+/*
+ * Handle nr_files sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_files(ctl_table *table, int write, struct file *filp,
+                     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	files_stat.nr_files = get_nr_files();
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_files(ctl_table *table, int write, struct file *filp,
+                     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
 }
+#endif
 
 /* Find an unused file structure and return a pointer to it.
  * Returns NULL, if there are no more free file structures or
@@ -77,7 +93,7 @@ struct file *get_empty_filp(void)
 	/*
 	 * Privileged users can go above max_files
	 */
-	if (files_stat.nr_files >= files_stat.max_files &&
+	if (get_nr_files() >= files_stat.max_files &&
 				!capable(CAP_SYS_ADMIN))
 		goto over;
 
@@ -96,11 +112,12 @@ struct file *get_empty_filp(void)
 	rwlock_init(&f->f_owner.lock);
 	/* f->f_version: 0 */
 	INIT_LIST_HEAD(&f->f_list);
+	atomic_inc(&nr_files);
 	return f;
 
 over:
 	/* Ran out of filps - report that */
-	if (files_stat.nr_files > old_max) {
+	if (get_nr_files() > old_max) {
 		printk(KERN_INFO "VFS: file-max limit %d reached\n",
 					files_stat.max_files);
 		old_max = files_stat.nr_files;
diff -puN fs/xfs/linux-2.6/xfs_linux.h~files-scale-file-counting fs/xfs/linux-2.6/xfs_linux.h
--- linux-2.6.14-rc1-test/fs/xfs/linux-2.6/xfs_linux.h~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/fs/xfs/linux-2.6/xfs_linux.h	2005-10-16 14:03:25.000000000 -0700
@@ -88,6 +88,7 @@
 #include <linux/proc_fs.h>
 #include <linux/version.h>
 #include <linux/sort.h>
+#include <linux/fs.h>
 
 #include <asm/page.h>
 #include <asm/div64.h>
@@ -242,7 +243,7 @@ static inline void set_buffer_unwritten_
 
 /* IRIX uses the current size of the name cache to guess a good value */
 /* - this isn't the same but is a good enough starting point for now. */
-#define DQUOT_HASH_HEURISTIC files_stat.nr_files
+#define DQUOT_HASH_HEURISTIC get_nr_files()
 
 /* IRIX inodes maintain the project ID also, zero this field on Linux */
 #define DEFAULT_PROJID	0
diff -puN include/linux/file.h~files-scale-file-counting include/linux/file.h
--- linux-2.6.14-rc1-test/include/linux/file.h~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/linux/file.h	2005-10-16 14:03:25.000000000 -0700
@@ -60,8 +60,6 @@ extern void put_filp(struct file *);
 extern int get_unused_fd(void);
 extern void FASTCALL(put_unused_fd(unsigned int fd));
 struct kmem_cache_s;
-extern void filp_ctor(void * objp, struct kmem_cache_s *cachep, unsigned long cflags);
-extern void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags);
 extern struct file ** alloc_fd_array(int);
 extern void free_fd_array(struct file **, int);
diff -puN include/linux/fs.h~files-scale-file-counting include/linux/fs.h
--- linux-2.6.14-rc1-test/include/linux/fs.h~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/linux/fs.h	2005-10-16 14:03:25.000000000 -0700
@@ -36,6 +36,8 @@ struct files_stat_struct {
 	int max_files;		/* tunable */
 };
 extern struct files_stat_struct files_stat;
+extern int get_nr_files(void);
+extern int get_max_files(void);
 
 struct inodes_stat_t {
 	int nr_inodes;
diff -puN kernel/sysctl.c~files-scale-file-counting kernel/sysctl.c
--- linux-2.6.14-rc1-test/kernel/sysctl.c~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/kernel/sysctl.c	2005-10-16 14:03:25.000000000 -0700
@@ -50,6 +50,9 @@
 #include <linux/nfs_fs.h>
 #endif
 
+extern int proc_nr_files(ctl_table *table, int write, struct file *filp,
+                     void __user *buffer, size_t *lenp, loff_t *ppos);
+
 #if defined(CONFIG_SYSCTL)
 
 /* External variables not in a header file. */
@@ -879,7 +882,7 @@ static ctl_table fs_table[] = {
 		.data		= &files_stat,
 		.maxlen		= 3*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_files,
 	},
 	{
 		.ctl_name	= FS_MAXFILE,
diff -puN net/unix/af_unix.c~files-scale-file-counting net/unix/af_unix.c
--- linux-2.6.14-rc1-test/net/unix/af_unix.c~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/net/unix/af_unix.c	2005-10-16 14:03:25.000000000 -0700
@@ -547,7 +547,7 @@ static struct sock * unix_create1(struct
 	struct sock *sk = NULL;
 	struct unix_sock *u;
 
-	if (atomic_read(&unix_nr_socks) >= 2*files_stat.max_files)
+	if (atomic_read(&unix_nr_socks) >= 2*get_max_files())
 		goto out;
 
 	sk = sk_alloc(PF_UNIX, GFP_KERNEL, &unix_proto, 1);
_

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-16 16:23 ` Dipankar Sarma
@ 2005-10-16 18:51   ` Serge Belyshev
  2005-10-16 18:56     ` Dipankar Sarma
  2005-10-17  2:34   ` Linus Torvalds
  1 sibling, 1 reply; 48+ messages in thread
From: Serge Belyshev @ 2005-10-16 18:51 UTC (permalink / raw)
To: Dipankar Sarma
Cc: linux-kernel, Linus Torvalds, khali, Andrew Morton, Manfred Spraul

Dipankar Sarma <dipankar@in.ibm.com> writes:

> Serge, could you please try the following experimental patch
> just to see if file counting is indeed the problem. The patch

I ran my test program with this patch applied on top of 2.6.14-rc4-git4
and it worked.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-16 18:51 ` Serge Belyshev
@ 2005-10-16 18:56   ` Dipankar Sarma
  2005-10-17  2:19     ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-16 18:56 UTC (permalink / raw)
To: Serge Belyshev
Cc: linux-kernel, Linus Torvalds, khali, Andrew Morton, Manfred Spraul

On Sun, Oct 16, 2005 at 10:51:12PM +0400, Serge Belyshev wrote:
> Dipankar Sarma <dipankar@in.ibm.com> writes:
> 
> > Serge, could you please try the following experimental patch
> > just to see if file counting is indeed the problem. The patch
> 
> I ran my test program with this patch applied on top of 2.6.14-rc4-git4
> and it worked.

Serge, thanks for the test. The issue is however far from resolved. We
need to find out about potential scalability problems with this
approach.

Secondly, on subsequent repeated tests, I saw a very large number of
allocated objects (600000+) in the filp cache. That does point to
either the RCU grace period not happening or my syscall measurements
being completely wrong. I did run with the following patch that adds
syscall exit as a quiescent state, but it didn't help. I am going to
have to instrument RCU to see what is really happening.

Thanks
Dipankar


It turns out that under some really heavy RCU updates under simulated
conditions, a syscall-bound task that doesn't block may prevent RCU
from happening during its entire timeslice, and that window may be big
enough to generate out-of-memory situations for RCU-protected objects.
This patch starts counting completion of syscalls as a quiescent state
in order to prevent the above situation from happening. It introduces
a new field in thread_info called rcu_qs which stores the RCU
quiescent state counter pointer for the cpu on which the thread runs.
We increment the counter on every syscall completion to move rcu
forward. This patch adds that support to the i386 and x86_64 archs,
but it doesn't break other arches. As and when support for rcu_qs is
added to the thread_info structs of other arches, we need to define
ARCH_HAS_RCU_QS for that arch.

Not-Yet-Signed-Off-By: Dipankar Sarma <dipankar@in.ibm.com>

diff -puN arch/i386/kernel/entry.S~rcu-syscall-quiescent arch/i386/kernel/entry.S
--- linux-2.6.14-rc1-test/arch/i386/kernel/entry.S~rcu-syscall-quiescent	2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/i386/kernel/entry.S	2005-10-16 11:25:10.000000000 -0700
@@ -239,6 +239,8 @@ syscall_exit:
 	cli				# make sure we don't miss an interrupt
 					# setting need_resched or sigpending
 					# between sampling and the iret
+	movl TI_rcu_qs(%ebp), %ecx	# Update RCU quiescent state flag
+	movl $1,(%ecx)
 	movl TI_flags(%ebp), %ecx
 	testw $_TIF_ALLWORK_MASK, %cx	# current->work
 	jne syscall_exit_work
diff -puN include/asm-i386/thread_info.h~rcu-syscall-quiescent include/asm-i386/thread_info.h
--- linux-2.6.14-rc1-test/include/asm-i386/thread_info.h~rcu-syscall-quiescent	2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/asm-i386/thread_info.h	2005-10-16 11:20:37.000000000 -0700
@@ -17,6 +17,8 @@
 #include <asm/processor.h>
 #endif
 
+#define ARCH_HAS_RCU_QS
+
 /*
  * low level task data that entry.S needs immediate access to
  * - this struct should fit entirely inside of one cache line
@@ -39,6 +41,7 @@ struct thread_info {
 						0-0xFFFFFFFF for kernel-thread
 					*/
 	struct restart_block    restart_block;
+	int			*rcu_qs;	/* RCU quiescent state flag */
 
 	unsigned long           previous_esp;   /* ESP of the previous stack in
 						   case of nested (IRQ) stacks
diff -puN include/linux/rcupdate.h~rcu-syscall-quiescent include/linux/rcupdate.h
--- linux-2.6.14-rc1-test/include/linux/rcupdate.h~rcu-syscall-quiescent	2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/linux/rcupdate.h	2005-10-16 12:38:56.000000000 -0700
@@ -41,6 +41,7 @@
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/seqlock.h>
+#include <linux/thread_info.h>
 
 /**
  * struct rcu_head - callback structure for use with RCU
@@ -271,6 +272,16 @@ static inline int rcu_pending(int cpu)
  */
 #define synchronize_sched() synchronize_rcu()
 
+#ifdef ARCH_HAS_RCU_QS
+static inline void rcu_set_qs(struct thread_info *ti, int cpu)
+{
+	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+	ti->rcu_qs = &rdp->passed_quiesc;
+}
+#else
+static inline void rcu_set_qs(struct thread_info *ti, int cpu) { }
+#endif
+
 extern void rcu_init(void);
 extern void rcu_check_callbacks(int cpu, int user);
 extern void rcu_restart_cpu(int cpu);
diff -puN init/main.c~rcu-syscall-quiescent init/main.c
--- linux-2.6.14-rc1-test/init/main.c~rcu-syscall-quiescent	2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/init/main.c	2005-10-16 12:43:19.000000000 -0700
@@ -671,6 +671,9 @@ static int init(void * unused)
 	 */
 	child_reaper = current;
 
+	/* Set up rcu quiescent state counter before making any syscall */
+	rcu_set_qs(current_thread_info(), smp_processor_id());
+
 	/* Sets up cpus_possible() */
 	smp_prepare_cpus(max_cpus);
 
diff -puN kernel/sched.c~rcu-syscall-quiescent kernel/sched.c
--- linux-2.6.14-rc1-test/kernel/sched.c~rcu-syscall-quiescent	2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/kernel/sched.c	2005-10-16 12:43:53.000000000 -0700
@@ -3006,6 +3006,7 @@ switch_tasks:
 	rq->nr_switches++;
 	rq->curr = next;
 	++*switch_count;
+	rcu_set_qs(next->thread_info, task_cpu(prev));
 
 	prepare_task_switch(rq, next);
 	prev = context_switch(rq, prev, next);
diff -puN arch/i386/kernel/asm-offsets.c~rcu-syscall-quiescent arch/i386/kernel/asm-offsets.c
--- linux-2.6.14-rc1-test/arch/i386/kernel/asm-offsets.c~rcu-syscall-quiescent	2005-10-16 11:35:28.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/i386/kernel/asm-offsets.c	2005-10-16 11:36:15.000000000 -0700
@@ -53,6 +53,7 @@ void foo(void)
 	OFFSET(TI_preempt_count, thread_info, preempt_count);
 	OFFSET(TI_addr_limit, thread_info, addr_limit);
 	OFFSET(TI_restart_block, thread_info, restart_block);
+	OFFSET(TI_rcu_qs, thread_info, rcu_qs);
 	BLANK();
 
 	OFFSET(EXEC_DOMAIN_handler, exec_domain, handler);
diff -puN arch/x86_64/kernel/entry.S~rcu-syscall-quiescent arch/x86_64/kernel/entry.S
--- linux-2.6.14-rc1-test/arch/x86_64/kernel/entry.S~rcu-syscall-quiescent	2005-10-16 11:48:27.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/x86_64/kernel/entry.S	2005-10-16 12:03:01.000000000 -0700
@@ -214,6 +214,8 @@ ret_from_sys_call:
 sysret_check:
 	GET_THREAD_INFO(%rcx)
 	cli
+	movq threadinfo_rcu_qs(%rcx),%rdx
+	movq $1,(%rdx)
 	movl threadinfo_flags(%rcx),%edx
 	andl %edi,%edx
 	CFI_REMEMBER_STATE
@@ -310,6 +312,8 @@ ENTRY(int_ret_from_sys_call)
 	/* edi:	mask to check */
 int_with_check:
 	GET_THREAD_INFO(%rcx)
+	movq threadinfo_rcu_qs(%rcx),%rdx
+	movl $1,(%rdx)
 	movl threadinfo_flags(%rcx),%edx
 	andl %edi,%edx
 	jnz  int_careful
diff -puN include/asm-x86_64/thread_info.h~rcu-syscall-quiescent include/asm-x86_64/thread_info.h
--- linux-2.6.14-rc1-test/include/asm-x86_64/thread_info.h~rcu-syscall-quiescent	2005-10-16 11:50:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/asm-x86_64/thread_info.h	2005-10-16 11:54:47.000000000 -0700
@@ -23,6 +23,8 @@ struct task_struct;
 struct exec_domain;
 #include <asm/mmsegment.h>
 
+#define ARCH_HAS_RCU_QS
+
 struct thread_info {
 	struct task_struct	*task;		/* main task structure */
 	struct exec_domain	*exec_domain;	/* execution domain */
@@ -33,6 +35,7 @@ struct thread_info {
 	mm_segment_t		addr_limit;
 	struct restart_block    restart_block;
+	int			*rcu_qs;
 };
 
 #endif
diff -puN arch/x86_64/kernel/asm-offsets.c~rcu-syscall-quiescent arch/x86_64/kernel/asm-offsets.c
--- linux-2.6.14-rc1-test/arch/x86_64/kernel/asm-offsets.c~rcu-syscall-quiescent	2005-10-16 11:52:13.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/x86_64/kernel/asm-offsets.c	2005-10-16 11:53:14.000000000 -0700
@@ -33,6 +33,7 @@ int main(void)
 	ENTRY(flags);
 	ENTRY(addr_limit);
 	ENTRY(preempt_count);
+	ENTRY(rcu_qs);
 	BLANK();
 #undef ENTRY
 #define ENTRY(entry) DEFINE(pda_ ## entry, offsetof(struct x8664_pda, entry))
_

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-16 18:56 ` Dipankar Sarma
@ 2005-10-17  2:19   ` Linus Torvalds
  2005-10-17  4:43     ` Serge Belyshev
  2005-10-17  8:32     ` Jean Delvare
  0 siblings, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2005-10-17  2:19 UTC (permalink / raw)
To: Dipankar Sarma
Cc: Serge Belyshev, linux-kernel, khali, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Dipankar Sarma wrote:
> 
> Secondly, on subsequent repeated tests, I saw a very large number
> of allocated objects (600000+) in filp cache. That does point to either RCU
> grace period not happening or my syscall measurements completely
> wrong. I did run with the following patch that adds syscall
> exit as a quiescent state, but it didn't help. I am going
> to have to instrument RCU to see what is really happening.

I would _really_ prefer not to do this in the system call hot-path by
default. That is unquestionably the hottest path in the kernel by far.

It would be _much_ better to set one of the TIF_WORK flags when there's a
lot of RCU stuff, and do this all in the not-quite-so-hot path of
do_notify_resume() (on x86; I think others call it other things) instead.

If you use the same kind of "set the TIF flag every 1000 rcu events"
approach that my failed patch had, you'd be much better off. In fact, in
that path you could even do a full "rcu_process_callbacks()". After all,
this is not that different from signal handling.

Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
since -rc4.

Maybe this isn't that serious in practice right now? Serge, how did you
notice it?

		Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  2:19 ` Linus Torvalds
@ 2005-10-17  4:43   ` Serge Belyshev
  2005-10-17  8:32   ` Jean Delvare
  1 sibling, 0 replies; 48+ messages in thread
From: Serge Belyshev @ 2005-10-17  4:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, khali, Andrew Morton, Manfred Spraul

[resend, sorry for sending mail off-list]

Linus Torvalds <torvalds@osdl.org> writes:

> Serge, does this alternate patch work for you?

Yes, this patch works too.

> Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
> since -rc4.
>
> Maybe this isn't that serious in practice right now? Serge, how did you
> notice it?

This bug causes random failures when building the kernel with make -j4,
all with "Too many open files in system" messages from gcc.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  2:19 ` Linus Torvalds
  2005-10-17  4:43   ` Serge Belyshev
@ 2005-10-17  8:32   ` Jean Delvare
  2005-10-17  8:46     ` Dipankar Sarma
  1 sibling, 1 reply; 48+ messages in thread
From: Jean Delvare @ 2005-10-17  8:32 UTC (permalink / raw)
To: torvalds, dipankar; +Cc: Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Hi Linus, Dipankar, all,

On 2005-10-17, Linus Torvalds wrote:
> I would _really_ prefer not to do this in the system call hot-path by
> default. That is unquestionably the hottest path in the kernel by far.
> 
> It would be _much_ better to set one of the TIF_WORK flags when there's a
> lot of RCU stuff, and do this all in the not-quite-so-hot path of
> do_notify_resume() (on x86; I think others call it other things) instead.
> 
> If you use the same kind of "set the TIF flag every 1000 rcu events"
> approach that my failed patch had, you'd be much better off.
> 
> In fact, in that path you could even do a full "rcu_process_callbacks()".
> After all, this is not that different from signal handling.
> 
> Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
> since -rc4.

Isn't reverting the original change an option? 2.6.13 was working OK if
I'm not mistaken.

Thanks,
-- 
Jean Delvare

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  8:32 ` Jean Delvare
@ 2005-10-17  8:46   ` Dipankar Sarma
  2005-10-17  9:10     ` Eric Dumazet
  0 siblings, 1 reply; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17  8:46 UTC (permalink / raw)
To: Jean Delvare
Cc: torvalds, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 10:32:47AM +0200, Jean Delvare wrote:
> 
> > In fact, in that path you could even do a full "rcu_process_callbacks()".
> > After all, this is not that different from signal handling.
> > 
> > Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
> > since -rc4.
> 
> Isn't reverting the original change an option? 2.6.13 was working OK if
> I'm not mistaken.

IMO, putting the file accounting in slab ctors/dtors is not very
reliable because it depends on the slab not getting fragmented.
Batched freeing in RCU is just an extreme case of it. We needed to fix
file counting anyway.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  8:46 ` Dipankar Sarma
@ 2005-10-17  9:10   ` Eric Dumazet
  2005-10-17  9:14     ` Christoph Hellwig
  2005-10-17 10:32     ` Dipankar Sarma
  0 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17  9:10 UTC (permalink / raw)
To: dipankar
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 10:32:47AM +0200, Jean Delvare wrote:
> 
>>> In fact, in that path you could even do a full "rcu_process_callbacks()".
>>> After all, this is not that different from signal handling.
>>>
>>> Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
>>> since -rc4.
>>
>> Isn't reverting the original change an option? 2.6.13 was working OK if
>> I'm not mistaken.
> 
> IMO, putting the file accounting in slab ctor/dtors is not very
> reliable because it depends on slab not getting fragmented.
> Batched freeing in RCU is just an extreme case of it. We needed
> to fix file counting anyway.
> 
> Thanks
> Dipankar

But isn't this file counting a small problem?

This small program can eat all available memory.

Fixing the 'file count' won't fix the real problem: batched freeing is
good, but it should be limited so that we don't end up with *billions*
of file structs queued for deletion.

Don't take me wrong: I really *need* the file RCU stuff added in
2.6.14. I believe we can find a solution, even if it might delay
2.6.14 because Linus would have to release an rc5.

Eric

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  9:10 ` Eric Dumazet
@ 2005-10-17  9:14   ` Christoph Hellwig
  2005-10-17  9:25     ` Eric Dumazet
  2005-10-17 10:32   ` Dipankar Sarma
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2005-10-17  9:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: dipankar, Jean Delvare, torvalds, Serge Belyshev, LKML,
	Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
> Don't take me wrong: I really *need* the file RCU stuff added in 2.6.14.

How so? And why should we care? I'd rather see a 2.6.14 soon with the
changes backed out, so we can have a proper release that more or less
sticks to the release schedule we agreed on at the kernel summit.
You'll have four weeks to sort out the issue afterwards.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  9:14 ` Christoph Hellwig
@ 2005-10-17  9:25   ` Eric Dumazet
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17  9:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dipankar, Jean Delvare, torvalds, Serge Belyshev, LKML,
	Andrew Morton, Manfred Spraul

Christoph Hellwig wrote:
> On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
> 
>> Don't take me wrong: I really *need* the file RCU stuff added in 2.6.14.
> 
> How so? And why should we care? I'd rather see a 2.6.14 soon with
> the changes backed out, so we can have a proper release that more or
> less sticks to the release schedule we agreed on at the kernel summit.
> You'll have four weeks to sort out the issue afterwards.

Christoph,

You can try to hide the forest by killing some trees. Are you sure
that the RCU 'file structs' change is the only problem lying around?

For instance, I think other RCU freeing problems are dormant (see
maxbatch=10 and think about the number of routes a busy router (or a
DOS attack) can handle...)

Of course, a 'test program' is more difficult to write than

	while (1)
		close(open("/dev/null", 3));

Eric

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  9:10 ` Eric Dumazet
  2005-10-17  9:14   ` Christoph Hellwig
@ 2005-10-17 10:32   ` Dipankar Sarma
  2005-10-17 12:10     ` [RCU problem] was " Eric Dumazet
  2005-10-17 15:42     ` Linus Torvalds
  1 sibling, 2 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 10:32 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
> Dipankar Sarma wrote:
> >
> > IMO, putting the file accounting in slab ctor/dtors is not very
> > reliable because it depends on slab not getting fragmented.
> > Batched freeing in RCU is just an extreme case of it. We needed
> > to fix file counting anyway.
> >
> > Thanks
> > Dipankar
> 
> But isn't this file counting a small problem?
> 
> This small program can eat all available memory.
> 
> Fixing the 'file count' won't fix the real problem: batched freeing is
> good, but it should be limited so that we don't end up with *billions*
> of file structs queued for deletion.

Agreed. It is not designed to work that way, so there must be a bug
somewhere and I am trying to track it down. It could very well be that
at maxbatch=10 we are just queueing at a rate far too high compared to
the processing rate.

> I believe we can find a solution, even if it might delay 2.6.14
> because Linus would have to release an rc5.

That I am not sure about - it is Linus' call. I am just trying to do
the right thing - fix the real problem.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 48+ messages in thread
* [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 10:32 ` Dipankar Sarma
@ 2005-10-17 12:10   ` Eric Dumazet
  2005-10-17 12:31     ` linux-os (Dick Johnson)
  2005-10-17 12:36     ` Dipankar Sarma
  0 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 12:10 UTC (permalink / raw)
To: dipankar
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

[-- Attachment #1: Type: text/plain, Size: 860 bytes --]

Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
>>
>> Fixing the 'file count' won't fix the real problem: batched freeing is
>> good, but it should be limited so that we don't end up with *billions*
>> of file structs queued for deletion.
> 
> Agreed. It is not designed to work that way, so there must be
> a bug somewhere and I am trying to track it down. It could very well
> be that at maxbatch=10 we are just queueing at a rate far too high
> compared to processing.

I can freeze my test machine with a program that 'only' uses dentries,
no files.

No message, no panic, but the machine becomes totally unresponsive
after a few seconds.

Just grepping for call_rcu in the kernel sources gave me another
call_rcu() use from syscalls. And yes, 2.6.13 has the same problem.

Here is the killer on my HT Xeon machine (2GB ram).

Eric

[-- Attachment #2: stress2.c --]
[-- Type: text/plain, Size: 409 bytes --]

#include <unistd.h>
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

int main(void)
{
	int i, rc;
	struct stat st;
	char name[1024];

	memset(name, 'a', sizeof(name));
	for (i = 0; i < 1000000000; i++) {
		sprintf(name + 220, "%d", i);
		rc = stat(name, &st);
		if (rc == -1 && errno != ENOENT)
			perror(name);
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 12:10             ` [RCU problem] was " Eric Dumazet
@ 2005-10-17 12:31               ` linux-os (Dick Johnson)
  2005-10-17 12:36               ` Dipankar Sarma
  1 sibling, 0 replies; 48+ messages in thread
From: linux-os (Dick Johnson) @ 2005-10-17 12:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: dipankar, Jean Delvare, torvalds, Serge Belyshev, LKML,
	Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Eric Dumazet wrote:

> Dipankar Sarma wrote:
>> On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
>>>
>>> Fixing the 'file count' won't fix the real problem: batched freeing
>>> is good, but it should be limited so that not more than *billions*
>>> of file structs are queued for deletion.
>>
>> Agreed. It is not designed to work that way, so there must be
>> a bug somewhere and I am trying to track it down. It could very well
>> be that at maxbatch=10 we are just queueing at a rate far too high
>> compared to processing.
>
> I can freeze my test machine with a program that 'only' uses dentries,
> no files.
>
> No message, no panic, but the machine becomes totally unresponsive
> after a few seconds.
>
> Just grepping for call_rcu in the kernel sources gave me another
> call_rcu() use reachable from syscalls. And yes, 2.6.13 has the same
> problem.
>
> Here is the killer, on my HT Xeon machine (2GB RAM).
>
> Eric
>

No problem with linux-2.6.13.4 and an ext3 file-system:

F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY       TIME COMMAND
4     0     1     0  16   0  1544  408 -      S    ?         0:00 init [5]
[SNIPPED....]
1     0 16017     6  15   0     0    0 pdflus SW   ?         0:00 [pdflush]
4   666 16406  5273  16   0  4464 1004 wait   S    tty2      0:00 -bash
0   666 16501 16406  18   0  1324  240 -      R    tty2      9:46 ./xxx
4     0 16502  5223  15   0  4204 1248 wait   S    tty1      0:00 -bash
0     0 16563 16502  16   0  2276  584 -      R    tty1      0:00 ps laxw

I just put 9:46 of CPU time on your program and everything is fine.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.46 BogoMips).
Warning : 98.36% of all statistics are fiction.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 12:10             ` [RCU problem] was " Eric Dumazet
  2005-10-17 12:31               ` linux-os (Dick Johnson)
@ 2005-10-17 12:36               ` Dipankar Sarma
  2005-10-17 13:28                 ` Eric Dumazet
  1 sibling, 1 reply; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 12:36 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
> Dipankar Sarma wrote:
> >On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
> >
> >Agreed. It is not designed to work that way, so there must be
> >a bug somewhere and I am trying to track it down. It could very well
> >be that at maxbatch=10 we are just queueing at a rate far too high
> >compared to processing.
>
> I can freeze my test machine with a program that 'only' uses dentries,
> no files.
>
> No message, no panic, but the machine becomes totally unresponsive
> after a few seconds.
>
> Just grepping for call_rcu in the kernel sources gave me another
> call_rcu() use reachable from syscalls. And yes, 2.6.13 has the same
> problem.

Can you try it with rcupdate.maxbatch set to 10000 on the boot
command line?

FWIW, the open/close test problem goes away if I set maxbatch to
10000. I had introduced this limit some time ago to curtail the
effect long-running softirq handlers have on scheduling latencies,
which now conflicts with OOM avoidance requirements.

Thanks
Dipankar
^ permalink raw reply	[flat|nested] 48+ messages in thread
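[For readers wanting to repeat the experiment: `rcupdate.maxbatch` is passed on the kernel command line at boot. A hypothetical GRUB (legacy) stanza is shown below — the kernel image path and root device are illustrative, not taken from the thread.]

```
# /boot/grub/menu.lst -- paths and root device are examples only
title Linux 2.6.14-rc4 (rcu maxbatch raised)
    kernel /boot/vmlinuz-2.6.14-rc4 root=/dev/hda1 ro rcupdate.maxbatch=10000
```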
* Re: [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 12:36               ` Dipankar Sarma
@ 2005-10-17 13:28                 ` Eric Dumazet
  2005-10-17 13:33                   ` Dipankar Sarma
  2005-10-17 14:54                   ` Eric Dumazet
  0 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 13:28 UTC (permalink / raw)
To: dipankar
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
>
>>Dipankar Sarma wrote:
>>
>>>On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
>>>
>>>Agreed. It is not designed to work that way, so there must be
>>>a bug somewhere and I am trying to track it down. It could very well
>>>be that at maxbatch=10 we are just queueing at a rate far too high
>>>compared to processing.
>>>
>>
>>I can freeze my test machine with a program that 'only' uses dentries,
>>no files.
>>
>>No message, no panic, but the machine becomes totally unresponsive
>>after a few seconds.
>>
>>Just grepping for call_rcu in the kernel sources gave me another
>>call_rcu() use reachable from syscalls. And yes, 2.6.13 has the same
>>problem.
>
>
> Can you try it with rcupdate.maxbatch set to 10000 on the boot
> command line?
>

Changing maxbatch from 10 to 10000 cures the problem.
Maybe we could initialize maxbatch to (10000000/HZ), considering no
current CPU is able to queue more than 10,000,000 items per second
into a list.

> FWIW, the open/close test problem goes away if I set maxbatch to
> 10000. I had introduced this limit some time ago to curtail the
> effect long-running softirq handlers have on scheduling latencies,
> which now conflicts with OOM avoidance requirements.

Yes, and OOM avoidance probably has a higher priority than latencies
in DoS situations...

Eric
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 13:28                 ` Eric Dumazet
@ 2005-10-17 13:33                   ` Dipankar Sarma
  0 siblings, 0 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 13:33 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, Oct 17, 2005 at 03:28:22PM +0200, Eric Dumazet wrote:
> Dipankar Sarma wrote:
> >On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
> >
> >
> >Can you try it with rcupdate.maxbatch set to 10000 on the boot
> >command line?
> >
>
> Changing maxbatch from 10 to 10000 cures the problem.
> Maybe we could initialize maxbatch to (10000000/HZ), considering no
> current CPU is able to queue more than 10,000,000 items per second
> into a list.

I don't know; maybe I can look at more adaptive heuristics.

> >
> >FWIW, the open/close test problem goes away if I set maxbatch to
> >10000. I had introduced this limit some time ago to curtail the
> >effect long-running softirq handlers have on scheduling latencies,
> >which now conflicts with OOM avoidance requirements.
>
> Yes, and OOM avoidance probably has a higher priority than latencies
> in DoS situations...

Yes, one would think. But the audio guys would chew my head for this :)

Thanks
Dipankar
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 13:28                 ` Eric Dumazet
  2005-10-17 13:33                   ` Dipankar Sarma
@ 2005-10-17 14:54                   ` Eric Dumazet
  1 sibling, 0 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 14:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: dipankar, Jean Delvare, torvalds, Serge Belyshev, LKML,
	Andrew Morton, Manfred Spraul

Eric Dumazet wrote:
> Dipankar Sarma wrote:
>
>> On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
>>
>>> I can freeze my test machine with a program that 'only' uses
>>> dentries, no files.
>>>
>>> No message, no panic, but the machine becomes totally unresponsive
>>> after a few seconds.
>>>
>>> Just grepping for call_rcu in the kernel sources gave me another
>>> call_rcu() use reachable from syscalls. And yes, 2.6.13 has the
>>> same problem.
>>
>> Can you try it with rcupdate.maxbatch set to 10000 on the boot
>> command line?
>>
> Changing maxbatch from 10 to 10000 cures the problem.
> Maybe we could initialize maxbatch to (10000000/HZ), considering no
> current CPU is able to queue more than 10,000,000 items per second
> into a list.

Well... after 90 minutes of stress, I got an OOM even with
maxbatch=10000:

Out of Memory: Killed process 1759 (mysqld)

Maybe it is because on this HT machine all (timer and network)
interrupts are taken by CPU0. So if the user program is bound to CPU1,
maybe that CPU only performs syscalls and sees no RCU state change at
all.

Oct 17 18:24:25 localhost kernel: oom-killer: gfp_mask=0xd0, order=0
Oct 17 18:24:25 localhost kernel: Mem-info:
Oct 17 18:24:25 localhost kernel: DMA per-cpu:
Oct 17 18:24:25 localhost kernel: cpu 0 hot: low 2, high 6, batch 1 used:5
Oct 17 18:24:25 localhost kernel: cpu 0 cold: low 0, high 2, batch 1 used:1
Oct 17 18:24:25 localhost kernel: cpu 1 hot: low 2, high 6, batch 1 used:2
Oct 17 18:24:25 localhost kernel: cpu 1 cold: low 0, high 2, batch 1 used:0
Oct 17 18:24:25 localhost kernel: Normal per-cpu:
Oct 17 18:24:25 localhost kernel: cpu 0 hot: low 62, high 186, batch 31 used:168
Oct 17 18:24:25 localhost kernel: cpu 0 cold: low 0, high 62, batch 31 used:55
Oct 17 18:24:25 localhost kernel: cpu 1 hot: low 62, high 186, batch 31 used:95
Oct 17 18:24:25 localhost kernel: cpu 1 cold: low 0, high 62, batch 31 used:33
Oct 17 18:24:25 localhost kernel: HighMem per-cpu:
Oct 17 18:26:17 localhost kernel: cpu 0 hot: low 62, high 186, batch 31 used:166
Oct 17 18:26:17 localhost kernel: cpu 0 cold: low 0, high 62, batch 31 used:29
Oct 17 18:26:17 localhost kernel: cpu 1 hot: low 62, high 186, batch 31 used:176
Oct 17 18:26:17 localhost kernel: cpu 1 cold: low 0, high 62, batch 31 used:13
Oct 17 18:26:17 localhost kernel: Free pages:     1136620kB (1129392kB HighMem)
Oct 17 18:26:17 localhost kernel: Active:8040 inactive:3876 dirty:1 writeback:0 unstable:0 free:284155 slab:218548 mapped:8064 pagetables:130
Oct 17 18:26:17 localhost kernel: DMA free:3588kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16384kB pages_scanned:246 all_unreclaimable? no
Oct 17 18:26:17 localhost kernel: lowmem_reserve[]: 0 880 2031
Oct 17 18:26:17 localhost kernel: Normal free:3640kB min:3756kB low:4692kB high:5632kB active:76kB inactive:24kB present:901120kB pages_scanned:8581 all_unreclaimable? no
Oct 17 18:26:17 localhost kernel: lowmem_reserve[]: 0 0 9215
Oct 17 18:26:17 localhost kernel: HighMem free:1129392kB min:512kB low:640kB high:768kB active:32084kB inactive:15480kB present:1179520kB pages_scanned:0 all_unreclaimable? no
Oct 17 18:26:17 localhost kernel: lowmem_reserve[]: 0 0 0
Oct 17 18:26:17 localhost kernel: DMA: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
Oct 17 18:26:17 localhost kernel: Normal: 0*4kB 1*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3640kB
Oct 17 18:26:17 localhost kernel: HighMem: 518*4kB 301*8kB 119*16kB 54*32kB 22*64kB 13*128kB 6*256kB 1*512kB 0*1024kB 1*2048kB 272*4096kB = 1129392kB
Oct 17 18:26:17 localhost kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Oct 17 18:26:17 localhost kernel: Free swap  = 1012016kB
Oct 17 18:26:17 localhost kernel: Total swap = 1012016kB
Oct 17 18:26:17 localhost kernel: Free swap:       1012016kB
Oct 17 18:26:17 localhost kernel: 524256 pages of RAM
Oct 17 18:26:17 localhost kernel: 294880 pages of HIGHMEM
Oct 17 18:26:17 localhost kernel: 5472 reserved pages
Oct 17 18:26:17 localhost kernel: 11361 pages shared
Oct 17 18:26:18 localhost kernel: 0 pages swap cached
Oct 17 18:26:18 localhost kernel: 1 pages dirty
Oct 17 18:26:18 localhost kernel: 0 pages writeback
Oct 17 18:26:18 localhost kernel: 8064 pages mapped
Oct 17 18:26:18 localhost kernel: 218548 pages slab
Oct 17 18:26:18 localhost kernel: 130 pages pagetables
Oct 17 18:26:18 localhost kernel: Out of Memory: Killed process 1759 (mysqld).

Eric
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 10:32           ` Dipankar Sarma
  2005-10-17 12:10             ` [RCU problem] was " Eric Dumazet
@ 2005-10-17 15:42             ` Linus Torvalds
  2005-10-17 16:01               ` Eric Dumazet
  2005-10-17 16:20               ` Dipankar Sarma
  1 sibling, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2005-10-17 15:42 UTC (permalink / raw)
To: Dipankar Sarma
Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, 17 Oct 2005, Dipankar Sarma wrote:
>
> Agreed. It is not designed to work that way, so there must be
> a bug somewhere and I am trying to track it down. It could very well
> be that at maxbatch=10 we are just queueing at a rate far too high
> compared to processing.

That sounds sane. I suspect that the real fix for 2.6.14 might be to
update maxbatch to be much higher by default.

The thing is, that batching really is fundamentally wrong. If we have
a thousand things to free, we can't just free ten of them and leave
the 990 others to wait for next time. I realize people want real-time,
but if it's INCORRECT, then real-time isn't real-time.

I just checked: increasing "maxbatch" from 10 to 10000 does fix the
problem.

> This I am not sure, it is Linus' call. I am just trying to do the
> right thing - fix the real problem.

It sure looks like the batch limiter is the fundamental problem.
Instead of limiting the batching, we should likely try to avoid the
RCU lists getting huge in the first place - ie do the RCU callback
processing more often if the list is getting longer.

So I suspect that the _real_ fix is:

 - for 2.6.14: remove the batching limit (or just make it much higher
   for now)

 - post-14: work on making sure RCU callbacks are done in a more
   timely manner when the RCU queue gets long. This would involve
   TIF_RCUPENDING and whatever else to make sure that we have timely
   quiescent periods, and we do the RCU callback tasklet more often if
   the queue is long.

Hmm?

		Linus
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 15:42             ` Linus Torvalds
@ 2005-10-17 16:01               ` Eric Dumazet
  2005-10-17 16:16                 ` Linus Torvalds
                                   ` (2 more replies)
  2005-10-17 16:20               ` Dipankar Sarma
  1 sibling, 3 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 16:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dipankar Sarma, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

Linus Torvalds wrote:
> So I suspect that the _real_ fix is:
>
>  - for 2.6.14: remove the batching limit (or just make it much higher
>    for now)

I would just remove it. If the limit is wrong, we crash again. And the
realtime guys are already pissed off by batch=10000 anyway.

>  - post-14: work on making sure RCU callbacks are done in a more
>    timely manner when the RCU queue gets long. This would involve
>    TIF_RCUPENDING and whatever else to make sure that we have timely
>    quiescent periods, and we do the RCU callback tasklet more often if
>    the queue is long.

Absolutely. Keeping a count of (per-cpu) queued items is basically free
if it is kept in the cache line used by the list head, so 'queue length
on this cpu' is a cheap metric.

A 'realtime refinement' would be to use a different maxbatch limit
depending on the caller's priority: let a softirq thread have a lower
batch count than a regular user thread.

Eric
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 16:01               ` Eric Dumazet
@ 2005-10-17 16:16                 ` Linus Torvalds
  2005-10-17 16:29                   ` Dipankar Sarma
  2005-10-17 16:23                 ` Dipankar Sarma
  2005-10-17 16:31                 ` Lee Revell
  2 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2005-10-17 16:16 UTC (permalink / raw)
To: Eric Dumazet
Cc: Dipankar Sarma, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
> I would just remove it. If the limit is wrong, we crash again. And the
> realtime guys are already pissed off by batch=10000 anyway.

Normally I would too, but I'm still hoping I could do a 2.6.14 tonight.
I guess that's unreasonable (swtlb issues etc), but for now I just
committed the one-liner.

> Absolutely. Keeping a count of (per-cpu) queued items is basically free
> if it is kept in the cache line used by the list head, so 'queue length
> on this cpu' is a cheap metric.

Yes. I did something broken like that before Dipankar pointed me at
batching.

The only downside to TIF_RCUUPDATE is that those damn TIF flags are
per-architecture (probably largely unnecessarily, but while most
architectures don't care at all, others seem to have optimized their
layout so that they can test the work bits more efficiently). So it's a
matter of each architecture being updated with its TIF_xyz flag and its
work function.

Anybody willing to try? Dipankar apparently has a lot on his plate, and
this _should_ be fairly straightforward. Eric?

		Linus
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 16:16                 ` Linus Torvalds
@ 2005-10-17 16:29                   ` Dipankar Sarma
  2005-10-17 18:01                     ` Eric Dumazet
                                       ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 16:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
>
> > Absolutely. Keeping a count of (per-cpu) queued items is basically
> > free if it is kept in the cache line used by the list head, so
> > 'queue length on this cpu' is a cheap metric.
>
> The only downside to TIF_RCUUPDATE is that those damn TIF flags are
> per-architecture (probably largely unnecessarily, but while most
> architectures don't care at all, others seem to have optimized their
> layout so that they can test the work bits more efficiently). So it's a
> matter of each architecture being updated with its TIF_xyz flag and its
> work function.
>
> Anybody willing to try? Dipankar apparently has a lot on his plate, and
> this _should_ be fairly straightforward. Eric?

I *had*, when this hit me :) It was one of those spurt things. I am
going to look at this, but I think we will need to do it with some
careful benchmarking.

At the moment, however, I do have another concern - open/close taking
too much time, as I mentioned in an earlier email. It is nearly 4 times
slower than 2.6.13. So that is first up in my list of things to do at
the moment.

Thanks
Dipankar
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 16:29                   ` Dipankar Sarma
@ 2005-10-17 18:01                     ` Eric Dumazet
  2005-10-17 18:31                       ` Dipankar Sarma
                                         ` (2 more replies)
  2005-10-17 18:15                     ` Dipankar Sarma
  2005-10-17 18:40                     ` Linus Torvalds
  2 siblings, 3 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 18:01 UTC (permalink / raw)
To: dipankar
Cc: Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

[-- Attachment #1: Type: text/plain, Size: 1482 bytes --]

Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
>
>>>Absolutely. Keeping a count of (per-cpu) queued items is basically
>>>free if it is kept in the cache line used by the list head, so
>>>'queue length on this cpu' is a cheap metric.
>>
>>The only downside to TIF_RCUUPDATE is that those damn TIF flags are
>>per-architecture (probably largely unnecessarily, but while most
>>architectures don't care at all, others seem to have optimized their
>>layout so that they can test the work bits more efficiently). So it's a
>>matter of each architecture being updated with its TIF_xyz flag and its
>>work function.
>>
>>Anybody willing to try? Dipankar apparently has a lot on his plate, and
>>this _should_ be fairly straightforward. Eric?
>
> I *had*, when this hit me :) It was one of those spurt things. I am
> going to look at this, but I think we will need to do it with some
> careful benchmarking.
>
> At the moment, however, I do have another concern - open/close taking
> too much time, as I mentioned in an earlier email. It is nearly 4 times
> slower than 2.6.13. So that is first up in my list of things to do at
> the moment.

<lazy_mode=ON>
Do we really need a TIF_RCUUPDATE flag, or could we just ask for a
resched?
</lazy_mode>

This patch only takes care of call_rcu(); I'm unsure of what can be
done inside call_rcu_bh().

The two stress programs don't hit OOM anymore with this patch applied
(even with maxbatch=10).

Eric

[-- Attachment #2: rcu_patch.1 --]
[-- Type: text/plain, Size: 1250 bytes --]

--- linux-2.6.14-rc4/kernel/rcupdate.c	2005-10-11 03:19:19.000000000 +0200
+++ linux-2.6.14-rc4-ed/kernel/rcupdate.c	2005-10-17 21:52:18.000000000 +0200
@@ -109,6 +109,10 @@
 	rdp = &__get_cpu_var(rcu_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
+
+	if (unlikely(++rdp->count > 10000))
+		set_need_resched();
+
 	local_irq_restore(flags);
 }
@@ -140,6 +144,12 @@
 	rdp = &__get_cpu_var(rcu_bh_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
+	rdp->count++;
+/*
+ * Should we directly call rcu_do_batch() here ?
+ * if (unlikely(rdp->count > 10000))
+ *	rcu_do_batch(rdp);
+ */
 	local_irq_restore(flags);
 }
@@ -157,6 +167,7 @@
 		next = rdp->donelist = list->next;
 		list->func(list);
 		list = next;
+		rdp->count--;
 		if (++count >= maxbatch)
 			break;
 	}
--- linux-2.6.14-rc4/include/linux/rcupdate.h	2005-10-11 03:19:19.000000000 +0200
+++ linux-2.6.14-rc4-ed/include/linux/rcupdate.h	2005-10-17 21:02:25.000000000 +0200
@@ -94,6 +94,7 @@
 	long            batch;           /* Batch # for current RCU batch */
 	struct rcu_head *nxtlist;
 	struct rcu_head **nxttail;
+	long            count;           /* # of queued items */
 	struct rcu_head *curlist;
 	struct rcu_head **curtail;
 	struct rcu_head *donelist;

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 18:01                     ` Eric Dumazet
@ 2005-10-17 18:31                       ` Dipankar Sarma
  2005-10-17 19:00                         ` Linus Torvalds
  2005-10-17 18:37                       ` Linus Torvalds
  2005-10-17 22:59                       ` Paul E. McKenney
  2 siblings, 1 reply; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 18:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, Oct 17, 2005 at 08:01:21PM +0200, Eric Dumazet wrote:
> Dipankar Sarma wrote:
> >On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
> >
>
> <lazy_mode=ON>
> Do we really need a TIF_RCUUPDATE flag, or could we just ask for a
> resched?
> </lazy_mode>

I think the theory was that we have to process the callbacks, not
just force the grace period by setting need_resched. That is what
TIF_RCUUPDATE indicates - RCUs to process.

> This patch only takes care of call_rcu(); I'm unsure of what can be
> done inside call_rcu_bh().
>
> The two stress programs don't hit OOM anymore with this patch applied
> (even with maxbatch=10).

Hmm.. I am surprised that maxbatch=10 still allowed you to keep up
with a continuously queueing CPU. OK, I will look at this.

Thanks
Dipankar
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 18:31                       ` Dipankar Sarma
@ 2005-10-17 19:00                         ` Linus Torvalds
  0 siblings, 0 replies; 48+ messages in thread
From: Linus Torvalds @ 2005-10-17 19:00 UTC (permalink / raw)
To: Dipankar Sarma
Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1905 bytes --]

On Tue, 18 Oct 2005, Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 08:01:21PM +0200, Eric Dumazet wrote:
> > Dipankar Sarma wrote:
> > >On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
> > >
> >
> > <lazy_mode=ON>
> > Do we really need a TIF_RCUUPDATE flag, or could we just ask for a
> > resched?
> > </lazy_mode>
>
> I think the theory was that we have to process the callbacks, not
> just force the grace period by setting need_resched. That is what
> TIF_RCUUPDATE indicates - RCUs to process.

I'm having second thoughts about that, since the problem (in SMP) is
that even if the currently active process tries to handle RCU events
more proactively rather than just setting the grace period, in order
to do that you'd still need to wait for the other CPUs to have their
quiescent phase. So the RCU queues can grow long, if only because the
other CPUs won't necessarily do the same.

So we probably cannot throttle RCU queues down, and they will
inevitably have to be able to grow pretty long.

> Hmm.. I am surprised that maxbatch=10 still allowed you to keep up
> with a continuously queueing CPU. OK, I will look at this.

I think it's just because it ends up rescheduling a lot, and thus
waking up softirqd. The RCU thing is done as a tasklet, which means
that

 - it starts out as a "synchronous" softirq event, at which point it
   gets called at most X times (MAX_SOFTIRQ_RESTART, defaults to 10)

 - after that, we end up saying "uhhuh, this is using too much softirq
   time" and instead just run the softirq as a kernel thread.

 - setting TIF_NEEDRESCHED whenever the RCU lists are long will keep
   on rescheduling to the softirq thread much more aggressively.

See __do_softirq() for some of this softirq (and thus tasklet)
handling. I suspect it's _very_ inefficient, but maybe the bad case
triggers so seldom that we don't really need to care.

		Linus
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 18:01                     ` Eric Dumazet
  2005-10-17 18:31                       ` Dipankar Sarma
@ 2005-10-17 18:37                       ` Linus Torvalds
  2005-10-17 19:12                         ` Eric Dumazet
  2005-10-17 22:59                       ` Paul E. McKenney
  2 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2005-10-17 18:37 UTC (permalink / raw)
To: Eric Dumazet
Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
> <lazy_mode=ON>
> Do we really need a TIF_RCUUPDATE flag, or could we just ask for a
> resched?
> </lazy_mode>

Hmm.. Your patch looks very much like one I tried already, the big
difference being that I just cleared the count when doing the RCU
callback. That was because I hadn't realized the importance of the
maxbatch thing (so it didn't work for me, like it did for you).

Still - the actual RCU callback will only be called at the next timer
tick or whatever, as far as I can tell, so the first time you'll still
have a _long_ RCU queue (and thus bad latency).

I guess that's inevitable - and TIF_RCUUPDATE wouldn't even help,
because we still need to wait for the _other_ CPUs to get to their RCU
quiescent event.

However, that leaves us with the nasty situation that we'll be very
inefficient: we'll do "maxbatch" RCU entries, then return, and then
force a whole reschedule. That just can't be good.

How about instead of depending on "maxbatch", we'd depend on
"need_resched()"? Maybe the "maxbatch" could be a _minbatch_ thing,
and then once we've done the minimum amount we _need_ to do (or
emptied the RCU queue) we start honoring need_resched(), and return
early if we do?

That, together with your patch, should work, without causing ludicrous
"reschedule every ten system calls" behaviour..

Hmm?

		Linus
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 18:37                       ` Linus Torvalds
@ 2005-10-17 19:12                         ` Eric Dumazet
  2005-10-17 19:30                           ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 19:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

Linus Torvalds wrote:
>
> On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
>><lazy_mode=ON>
>>Do we really need a TIF_RCUUPDATE flag, or could we just ask for a
>>resched?
>></lazy_mode>
>
> However, that leaves us with the nasty situation that we'll be very
> inefficient: we'll do "maxbatch" RCU entries, then return, and then
> force a whole reschedule. That just can't be good.

That's strange, because on my tests it seems that I don't have one
reschedule per 'maxbatch' items. Doing 'grep filp /proc/slabinfo' it
seems I have one 'schedule' and then the filp count goes back to 1000.
vmstat shows about 150 context switches per second.

(This machine does 1,000,000 pairs of open/close in 4.88 seconds.)

oprofile data shows very little schedule overhead:

CPU: P4 / Xeon with 2 hyper-threads, speed 1993.83 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000

samples  %        symbol name
132578   11.3301  path_lookup
104788    8.9551  __d_lookup
 85220    7.2829  link_path_walk
 63013    5.3851  sysenter_past_esp
 53287    4.5539  _atomic_dec_and_lock
 45825    3.9162  chrdev_open
 43105    3.6837  get_unused_fd
 39948    3.4139  kmem_cache_alloc
 38308    3.2738  strncpy_from_user
 35738    3.0542  rcu_do_batch
 31850    2.7219  __link_path_walk
 31355    2.6796  get_empty_filp
 25941    2.2169  kmem_cache_free
 24455    2.0899  __fput
 24422    2.0871  sys_close
 19814    1.6933  filp_dtor
 19616    1.6764  free_block
 19000    1.6237  open_namei
 18214    1.5566  fput
 15991    1.3666  fd_install
 14394    1.2301  file_kill
 14365    1.2276  call_rcu
 14338    1.2253  kref_put
 13679    1.1690  file_move
 13646    1.1662  schedule
 13456    1.1499  getname
 13019    1.1126  kref_get

> How about instead of depending on "maxbatch", we'd depend on
> "need_resched()"? Maybe the "maxbatch" could be a _minbatch_ thing,
> and then once we've done the minimum amount we _need_ to do (or
> emptied the RCU queue) we start honoring need_resched(), and return
> early if we do?
>
> That, together with your patch, should work, without causing ludicrous
> "reschedule every ten system calls" behaviour..
>
> Hmm?
>
> Linus
>
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 19:12 ` Eric Dumazet @ 2005-10-17 19:30 ` Linus Torvalds 2005-10-17 19:39 ` Eric Dumazet 0 siblings, 1 reply; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 19:30 UTC (permalink / raw) To: Eric Dumazet Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 17 Oct 2005, Eric Dumazet wrote: > > Thats strange, because on my tests it seems that I dont have one reschedule > for 'maxbatch' items. Doing 'grep filp /proc/slabinfo' it seems I have one > 'schedule' then filp count goes back to 1000. Hmm. I think you're right, but for all the wrong reasons. "maxbatch" ends up not actually having any real effect in the end: after the tasklet ends up running in softirqd, softirqd will actually keep on calling the tasklet code until it doesn't get rescheduled any more ;) So it will do "maxbatch" RCU entries, reschedule itself, return, and immediately get called again. Heh. The _good_ news is that since it ends up running in softirqd (after the first ten times - the softirq code in kernel/softirq.c will start off calling it ten times _first_), it can be scheduled away, so it actually ends up helping latency. Which means that we actually end up doing exactly the right thing, although for what appears to be the wrong reasons (or very lucky ones). The _bad_ news is that softirqd is running at nice +19, so I suspect that with some unlucky patterns it's probably pretty easy to make sure that ksoftirqd doesn't actually run very often at all! Gaah. So close, yet so far. I'm _almost_ willing to just undo my "make maxbatch huge" patch, and apply your patch, because now that I see how it all happens to work together I'm convinced that it _almost_ works. Even if it seems to be mostly by luck(*) rather than anything else. Linus (*) Not strictly true. 
It may not be by design of the RCU code itself, but it's definitely by design of the softirqs being designed to be robust and have good latency behaviour. So it does work by design, but it works by softirq design rather than RCU design ;)

^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 19:30 ` Linus Torvalds @ 2005-10-17 19:39 ` Eric Dumazet 2005-10-17 20:14 ` Linus Torvalds 0 siblings, 1 reply; 48+ messages in thread From: Eric Dumazet @ 2005-10-17 19:39 UTC (permalink / raw) To: Linus Torvalds Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul Linus Torvalds a écrit : > > On Mon, 17 Oct 2005, Eric Dumazet wrote: > >>Thats strange, because on my tests it seems that I dont have one reschedule >>for 'maxbatch' items. Doing 'grep filp /proc/slabinfo' it seems I have one >>'schedule' then filp count goes back to 1000. > > > Hmm. > > I think you're right, but for all the wrong reasons. > > "maxbatch" ends up not actually having any real effect in the end: after > the tasklet ends up running in softirqd, softirqd will actually keep on > calling the tasklet code until it doesn't get rescheduled any more ;) > > So it will do "maxbatch" RCU entries, reschedule itself, return, and > immediately get called again. > > Heh. > > The _good_ news is that since it ends up running in softirqd (after the > first ten times - the softirq code in kernel/softirq.c will start off > calling it ten times _first_), it can be scheduled away, so it actually > ends up helping latency. > > Which means that we actually end up doing exactly the right thing, > although for what appears to be the wrong reasons (or very lucky ones). > > The _bad_ news is that softirqd is running at nice +19, so I suspect that > with some unlucky patterns it's probably pretty easy to make sure that > ksoftirqd doesn't actually run very often at all! > > Gaah. So close, yet so far. I'm _almost_ willing to just undo my "make > maxbatch huge" patch, and apply your patch, because now that I see how it > all happens to work together I'm convinced that it _almost_ works. Even if > it seems to be mostly by luck(*) rather than anything else. > :) What about call_rcu_bh() which I left unchanged ? 
At least one of my production machines cannot live very long unless I have maxbatch = 300, because of an insanely large tcp route cache (and one of its CPUs almost filled by softirq NIC processing)

> Linus
>
> (*) Not strictly true. It may not be by design of the RCU code itself, but
> it's definitely by design of the softirqs being designed to be robust and
> have good latency behaviour. So it does work by design, but it works by
> softirq design rather than RCU design ;)

^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 19:39 ` Eric Dumazet @ 2005-10-17 20:14 ` Linus Torvalds 2005-10-17 20:25 ` Christopher Friesen ` (2 more replies) 0 siblings, 3 replies; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 20:14 UTC (permalink / raw) To: Eric Dumazet Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 17 Oct 2005, Eric Dumazet wrote: > > What about call_rcu_bh() which I left unchanged ? At least one of my > production machine cannot live very long unless I have maxbatch = 300, because > of an insane large tcp route cache (and one of its CPU almost filled by > softirq NIC processing) I think we'll have to release 2.6.14 with maxbatch at the high value (10000). Yes, it may screw up some latency stuff, but quite frankly, even with your patch and even ignoring the call_rcu_bh case, I'm convinced you can easily get into the situation where softirqd just doesn't run soon enough. But at least I think I understand _why_ rcu processing was delayed. I think a real fix might have to involve more explicit knowledge of tasklet behaviour and softirq interaction. Linus ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:14 ` Linus Torvalds @ 2005-10-17 20:25 ` Christopher Friesen 2005-10-17 20:24 ` Dipankar Sarma 2005-10-17 20:38 ` Linus Torvalds 2005-10-17 20:33 ` Dipankar Sarma 2005-10-17 22:40 ` Linus Torvalds 2 siblings, 2 replies; 48+ messages in thread From: Christopher Friesen @ 2005-10-17 20:25 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul Linus Torvalds wrote: > Yes, it may screw up some latency stuff, but quite frankly, even with your > patch and even ignoring the call_rcu_bh case, I'm convinced you can easily > get into the situation where softirqd just doesn't run soon enough. > > But at least I think I understand _why_ rcu processing was delayed. Could this be related to the "rename14 LTP test with /tmp as tmpfs and HIGHMEM causes OOM-killer invocation due to zone normal exhaustion" issue? Chris ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:25 ` Christopher Friesen @ 2005-10-17 20:24 ` Dipankar Sarma 2005-10-18 15:55 ` Christopher Friesen 2005-10-17 20:38 ` Linus Torvalds 1 sibling, 1 reply; 48+ messages in thread From: Dipankar Sarma @ 2005-10-17 20:24 UTC (permalink / raw) To: Christopher Friesen Cc: Linus Torvalds, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, Oct 17, 2005 at 02:25:17PM -0600, Christopher Friesen wrote: > Linus Torvalds wrote: > > >Yes, it may screw up some latency stuff, but quite frankly, even with your > >patch and even ignoring the call_rcu_bh case, I'm convinced you can easily > >get into the situation where softirqd just doesn't run soon enough. > > > >But at least I think I understand _why_ rcu processing was delayed. > > Could this be related to the "rename14 LTP test with /tmp as tmpfs and > HIGHMEM causes OOM-killer invocation due to zone normal exhaustion" issue? Could very well be. Chris, could you please try booting with rcupdate.maxbatch=10000 and see if the problem goes away ? Thanks Dipankar ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:24 ` Dipankar Sarma @ 2005-10-18 15:55 ` Christopher Friesen 0 siblings, 0 replies; 48+ messages in thread From: Christopher Friesen @ 2005-10-18 15:55 UTC (permalink / raw) To: dipankar Cc: Linus Torvalds, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul Dipankar Sarma wrote: > On Mon, Oct 17, 2005 at 02:25:17PM -0600, Christopher Friesen wrote: >>Could this be related to the "rename14 LTP test with /tmp as tmpfs and >>HIGHMEM causes OOM-killer invocation due to zone normal exhaustion" issue? > Could very well be. Chris, could you please try booting > with rcupdate.maxbatch=10000 and see if the problem goes away ? And sure enough, that fixes it. The dcache slab usage maxes out at around 11MB rather than consuming all of zone normal. Is there any downside to this option? Chris ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:25 ` Christopher Friesen 2005-10-17 20:24 ` Dipankar Sarma @ 2005-10-17 20:38 ` Linus Torvalds 1 sibling, 0 replies; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 20:38 UTC (permalink / raw) To: Christopher Friesen Cc: Eric Dumazet, dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 17 Oct 2005, Christopher Friesen wrote: > > Could this be related to the "rename14 LTP test with /tmp as tmpfs and HIGHMEM > causes OOM-killer invocation due to zone normal exhaustion" issue? Yes. You can try the current git tree, or just change "maxbatch" from 10 to 10000 in your own tree, and see if it makes a difference. I would not be surprised at all if this turns out to be the exact same issue, for the exact same reason. Eric's patch is also likely to fix it (if the "maxbatch" change fixes it), since I suspect that under _practical_ load Eric's patch works fine. The advantage of Eric's patch is that it shouldn't have any latency downsides, so Eric's is in many ways preferable to just increasing maxbatch. I just can't convince myself that it's really always going to fix the problem. If somebody else can, holler. Linus ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:14 ` Linus Torvalds 2005-10-17 20:25 ` Christopher Friesen @ 2005-10-17 20:33 ` Dipankar Sarma 2005-10-17 22:40 ` Linus Torvalds 2 siblings, 0 replies; 48+ messages in thread From: Dipankar Sarma @ 2005-10-17 20:33 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, Oct 17, 2005 at 01:14:20PM -0700, Linus Torvalds wrote: > > > On Mon, 17 Oct 2005, Eric Dumazet wrote: > > > > What about call_rcu_bh() which I left unchanged ? At least one of my > > production machine cannot live very long unless I have maxbatch = 300, because > > of an insane large tcp route cache (and one of its CPU almost filled by > > softirq NIC processing) > > I think we'll have to release 2.6.14 with maxbatch at the high value > (10000). Is 10000 enough ? Eric seemed to find a problem even with this after 90 minutes ? > Yes, it may screw up some latency stuff, but quite frankly, even with your > patch and even ignoring the call_rcu_bh case, I'm convinced you can easily > get into the situation where softirqd just doesn't run soon enough. > > But at least I think I understand _why_ rcu processing was delayed. > > I think a real fix might have to involve more explicit knowledge of > tasklet behaviour and softirq interaction. Agreed. I am now looking at characterizing the corner cases that can get us into trouble and checking what pattern of processing is appropriate to cover them all. It will take some time to sort this out making sure that it satisfies most requirements reasonably. Thanks Dipankar ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:14 ` Linus Torvalds 2005-10-17 20:25 ` Christopher Friesen 2005-10-17 20:33 ` Dipankar Sarma @ 2005-10-17 22:40 ` Linus Torvalds 2 siblings, 0 replies; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 22:40 UTC (permalink / raw) To: Eric Dumazet Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 17 Oct 2005, Linus Torvalds wrote: > On Mon, 17 Oct 2005, Eric Dumazet wrote: > > > > What about call_rcu_bh() which I left unchanged ? At least one of my > > production machine cannot live very long unless I have maxbatch = 300, because > > of an insane large tcp route cache (and one of its CPU almost filled by > > softirq NIC processing) > > I think we'll have to release 2.6.14 with maxbatch at the high value > (10000). Btw, I'm going to apply your patch in _addition_ to the bigger maxbatch value. It might help latency a bit, but more importantly, on one of my machines (but only one - it probably depends on how much memory you have etc), I can re-create the out-of-file-descriptors thing even with a maxbatch of a million. Probably what happens is that the rcu callbacks just grow fast enough without any quiescent period that the maxbatch thing just never matters: we simply run out of file descriptors because we haven't even gotten around to trying to free them yet. I'm compiling with your patch on that machine to verify that it does actually help keep the queues down. Just doing a while : ; do cat /proc/slabinfo | grep filp; sleep 1; done while running the test programs gives some alarming numbers as-is. Your patch keeps the numbers _much_ more stable. Regardless, keeping track of the number of rcu callback events we have will almost inevitably be part of whatever future strategy we take, so your patch is definitely a step in the right direction, even if we have to tweak it later. Linus ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 18:01 ` Eric Dumazet 2005-10-17 18:31 ` Dipankar Sarma 2005-10-17 18:37 ` Linus Torvalds @ 2005-10-17 22:59 ` Paul E. McKenney 2005-10-18 9:46 ` Eric Dumazet 2 siblings, 1 reply; 48+ messages in thread From: Paul E. McKenney @ 2005-10-17 22:59 UTC (permalink / raw) To: Eric Dumazet Cc: dipankar, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, Oct 17, 2005 at 08:01:21PM +0200, Eric Dumazet wrote: > Dipankar Sarma a écrit : > >On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote: > > > >>>Absolutely. Keeping a count of (percpu) queued items is basically free > >>>if kept > >>>in the cache line used by list head, so the 'queue length on this cpu' > >>>is a > >>>cheap metric. > >> > >>The only downside to TIF_RCUUPDATE is that those damn TIF-flags are > >>per-architecture (probably largely unnecessary, but while most > >>architectures don't care at all, others seem to have optimized their > >>layout so that they can test the work bits more efficiently). So it's a > >>matter of each architecture being updated with its TIF_xyz flag and their > >>work function. > >> > >>Anybody willing to try? Dipankar apparently has a lot on his plate, this > >>_should_ be fairly straightforward. Eric? > > > > > >I *had*, when this hit me :) It was one those spurt things. I am going to > >look at this, but I think we will need to do this with some careful > >benchmarking. > > > >At the moment however I do have another concern - open/close taking too > >much time as I mentioned in an earlier email. It is nearly 4 times > >slower than 2.6.13. So, that is first up in my list of things to > >do at the moment. > > > > <lazy_mode=ON> > Do we really need a TIF_RCUUPDATE flag, or could we just ask for a resched ? 
> </lazy_mode> > > This patch only take care of call_rcu(), I'm unsure of what can be done > inside call_rcu_bh() > > The two stress program dont hit OOM anymore with this patch applied (even > with maxbatch=10) Keeping the per-CPU count of queued callbacks seems eminently reasonable to me, as does the set_need_resched(). But the proposed (but fortunately commented out) call of rcu_do_batch() from call_rcu() does have deadlock issues. > Eric > > --- linux-2.6.14-rc4/kernel/rcupdate.c 2005-10-11 03:19:19.000000000 +0200 > +++ linux-2.6.14-rc4-ed/kernel/rcupdate.c 2005-10-17 21:52:18.000000000 +0200 > @@ -109,6 +109,10 @@ > rdp = &__get_cpu_var(rcu_data); > *rdp->nxttail = head; > rdp->nxttail = &head->next; > + > + if (unlikely(++rdp->count > 10000)) > + set_need_resched(); > + > local_irq_restore(flags); > } > > @@ -140,6 +144,12 @@ > rdp = &__get_cpu_var(rcu_bh_data); > *rdp->nxttail = head; > rdp->nxttail = &head->next; > + rdp->count++; Really need an "rdp->count++" in call_rcu_bh() as well, otherwise the _bh struct rcu_data will have a steadily decreasing count field. Strictly speaking, this is harmless, since call_rcu_bh() cheerfully ignores this field, but this situation is bound to cause severe confusion at some point. > +/* > + * Should we directly call rcu_do_batch() here ? > + * if (unlikely(rdp->count > 10000)) > + * rcu_do_batch(rdp); > + */ Good thing that the above is commented out! ;-) Doing this can result in self-deadlock, for example with the following: spin_lock(&mylock); /* do some stuff. */ call_rcu(&p->rcu_head, my_rcu_callback); /* do some more stuff. */ spin_unlock(&mylock); void my_rcu_callback(struct rcu_head *p) { spin_lock(&mylock); /* self-deadlock via call_rcu() via rcu_do_batch()!!! 
*/ spin_unlock(&mylock); } Thanx, Paul > } > > @@ -157,6 +167,7 @@ > next = rdp->donelist = list->next; > list->func(list); > list = next; > + rdp->count--; > if (++count >= maxbatch) > break; > } > --- linux-2.6.14-rc4/include/linux/rcupdate.h 2005-10-11 03:19:19.000000000 +0200 > +++ linux-2.6.14-rc4-ed/include/linux/rcupdate.h 2005-10-17 21:02:25.000000000 +0200 > @@ -94,6 +94,7 @@ > long batch; /* Batch # for current RCU batch */ > struct rcu_head *nxtlist; > struct rcu_head **nxttail; > + long count; /* # of queued items */ > struct rcu_head *curlist; > struct rcu_head **curtail; > struct rcu_head *donelist; ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
2005-10-17 22:59 ` Paul E. McKenney
@ 2005-10-18 9:46 ` Eric Dumazet
2005-10-18 16:22 ` Paul E. McKenney
0 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2005-10-18 9:46 UTC (permalink / raw)
To: paulmck
Cc: dipankar, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Paul E. McKenney a écrit :
>
>>+/*
>>+ * Should we directly call rcu_do_batch() here ?
>>+ * if (unlikely(rdp->count > 10000))
>>+ * rcu_do_batch(rdp);
>>+ */
>
> Good thing that the above is commented out! ;-)
>
> Doing this can result in self-deadlock, for example with the following:
>
> spin_lock(&mylock);
> /* do some stuff. */
> call_rcu(&p->rcu_head, my_rcu_callback);
> /* do some more stuff. */
> spin_unlock(&mylock);
>
> void my_rcu_callback(struct rcu_head *p)
> {
> 	spin_lock(&mylock);
> 	/* self-deadlock via call_rcu() via rcu_do_batch()!!! */
> 	spin_unlock(&mylock);
> }
>
> Thanx, Paul

Thanks Paul for reminding us that call_rcu() should never call the callback function, as very well documented in Documentation/RCU/UP.txt (Example 3: Death by Deadlock)

But is the same true for call_rcu_bh() ?

I intentionally wrote the comment to remind readers that a low maxbatch can trigger OOM in case a CPU is filled by some kind of DoS (network IRQ flood for example, targeting the IP dst cache)

To solve this problem, maybe we could add a requirement to call_rcu_bh() callback functions: if they have to lock a spinlock, only use a spin_trylock() and make them return a status (0: successful callback, 1: please requeue me)

As most callback functions just kfree() some memory, most OOM cases would be cleared.

int my_rcu_callback(struct rcu_head *p)
{
	if (!spin_trylock(&mylock))
		return 1; /* please call me later */
	/* do something here */
	...
	spin_unlock(&mylock);
	return 0;
}

(Changes to rcu_do_batch() are left as an exercise :) )

Eric

^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-18 9:46 ` Eric Dumazet @ 2005-10-18 16:22 ` Paul E. McKenney 0 siblings, 0 replies; 48+ messages in thread From: Paul E. McKenney @ 2005-10-18 16:22 UTC (permalink / raw) To: Eric Dumazet Cc: dipankar, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Tue, Oct 18, 2005 at 11:46:30AM +0200, Eric Dumazet wrote: > Paul E. McKenney a écrit : > > > > > >>+/* > >>+ * Should we directly call rcu_do_batch() here ? > >>+ * if (unlikely(rdp->count > 10000)) > >>+ * rcu_do_batch(rdp); > >>+ */ > > > > > >Good thing that the above is commented out! ;-) > > > >Doing this can result in self-deadlock, for example with the following: > > > > spin_lock(&mylock); > > /* do some stuff. */ > > call_rcu(&p->rcu_head, my_rcu_callback); > > /* do some more stuff. */ > > spin_unlock(&mylock); > > > >void my_rcu_callback(struct rcu_head *p) > >{ > > spin_lock(&mylock); > > /* self-deadlock via call_rcu() via rcu_do_batch()!!! */ > > spin_unlock(&mylock); > >} > > > > > > Thanx, Paul > > Thanks Paul for reminding us that call_rcu() should not ever call the > callback function, as very well documented in Documentation/RCU/UP.txt > (Example 3: Death by Deadlock) > > But is the same true for call_rcu_bh() ? Yes, same rules for this aspect of call_rcu_bh() and call_rcu(). > I intentionally wrote the comment to remind readers that a low maxbatch can > trigger OOM in case a CPU is filled by some kind of DOS (network IRQ flood > for example, targeting the IP dst cache) > > To solve this problem, may be we could add a requirement to > call_rcu_bh/callback functions : If they have to lock a spinlock, only use > a spin_trylock() and make them returns a status (0 : sucessfull callback, > 1: please requeue me) > > As most callback functions just kfree() some memory, most of OOM would be > cleared. 
> int my_rcu_callback(struct rcu_head *p)
> {
> 	if (!spin_trylock(&mylock))
> 		return 1; /* please call me later */
> 	/* do something here */
> 	...
> 	spin_unlock(&mylock);
> 	return 0;
> }
>
> (Changes to rcu_do_batch() are left as an exercise :) )

Another approach that would keep the current easier-to-use semantics would be to schedule a tasklet or workqueue to process the callbacks in a safe context.

Thanx, Paul

^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 16:29 ` Dipankar Sarma 2005-10-17 18:01 ` Eric Dumazet @ 2005-10-17 18:15 ` Dipankar Sarma 2005-10-17 18:40 ` Linus Torvalds 2 siblings, 0 replies; 48+ messages in thread From: Dipankar Sarma @ 2005-10-17 18:15 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, Oct 17, 2005 at 09:59:30PM +0530, Dipankar Sarma wrote: > On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote: > > At the moment however I do have another concern - open/close taking too > much time as I mentioned in an earlier email. It is nearly 4 times > slower than 2.6.13. So, that is first up in my list of things to > do at the moment. Please ignore this. This is a big Doh! slab debugging snuck into my config file because I was trying to track down the "bad page state" problem again. Without it, open/close in 2.6.14-rc1 is just as fast as 2.6.13 - ~3 microseconds per pair. Thanks Dipankar ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 16:29 ` Dipankar Sarma 2005-10-17 18:01 ` Eric Dumazet 2005-10-17 18:15 ` Dipankar Sarma @ 2005-10-17 18:40 ` Linus Torvalds 2 siblings, 0 replies; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 18:40 UTC (permalink / raw) To: Dipankar Sarma Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 17 Oct 2005, Dipankar Sarma wrote: > > At the moment however I do have another concern - open/close taking too > much time as I mentioned in an earlier email. It is nearly 4 times > slower than 2.6.13. So, that is first up in my list of things to > do at the moment. It's not slower for me. For me, lmbench shows open/close as being pretty stable at least since 2.6.12. Are you sure that your dentry cache tests haven't just filled up the dentry lists so much that when you compare open/close performance after the dentry tests, they seem much slower than your numbers from before? If you run something that fills up the dentry cache, open/close will be slower just because the open part will have to traverse longer hash chains. Linus ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 16:01 ` Eric Dumazet 2005-10-17 16:16 ` Linus Torvalds @ 2005-10-17 16:23 ` Dipankar Sarma 2005-10-17 16:31 ` Lee Revell 2 siblings, 0 replies; 48+ messages in thread From: Dipankar Sarma @ 2005-10-17 16:23 UTC (permalink / raw) To: Eric Dumazet Cc: Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, Oct 17, 2005 at 06:01:31PM +0200, Eric Dumazet wrote: > Linus Torvalds a écrit : > > > > > - post-14: work on making sure rcu callbacks are done in a more timely > > manner when the rcu queue gets long. This would involve TIF_RCUPENDING > > and whatever else to make sure that we have timely quiescent periods, > > and we do the RCU callback tasklet more often if the queue is long. > > > > Absolutely. Keeping a count of (percpu) queued items is basically free if > kept in the cache line used by list head, so the 'queue length on this cpu' > is a cheap metric. Or 'sudden increase in queue length on this cpu' :) > A 'realtime refinement' would be to use a different maxbatch limit > depending on the caller's priority : Let a softirq thread have a lower > batch count than a regular user thread. Yes, would be interesting. Thanks Dipankar ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 16:01 ` Eric Dumazet 2005-10-17 16:16 ` Linus Torvalds 2005-10-17 16:23 ` Dipankar Sarma @ 2005-10-17 16:31 ` Lee Revell 2 siblings, 0 replies; 48+ messages in thread From: Lee Revell @ 2005-10-17 16:31 UTC (permalink / raw) To: Eric Dumazet Cc: Linus Torvalds, Dipankar Sarma, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 2005-10-17 at 18:01 +0200, Eric Dumazet wrote: > A 'realtime refinement' would be to use a different maxbatch limit depending > on the caller's priority : Let a softirq thread have a lower batch count than > a regular user thread. Or just make the whole thing preemptible like in the -rt tree and forget about it. Lee ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
2005-10-17 15:42 ` Linus Torvalds
2005-10-17 16:01 ` Eric Dumazet
@ 2005-10-17 16:20 ` Dipankar Sarma
1 sibling, 0 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 16:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 08:42:05AM -0700, Linus Torvalds wrote:
> On Mon, 17 Oct 2005, Dipankar Sarma wrote:
> >
> > This I am not sure, it is Linus' call. I am just trying to do the
> > right thing - fix the real problem.
>
> It sure looks like the batch limiter is the fundamental problem.
>
> Instead of limiting the batching, we should likely try to avoid the RCU
> lists getting huge in the first place - ie do the RCU callback processing
> more often if the list is getting longer.
>
> So I suspect that the _real_ fix is:
>
> - for 2.6.14: remove the batching limit (or just make it much higher for
> now)

You can remove the batching limit by making maxbatch = 0 by default. Just a one-line patch.

> - post-14: work on making sure rcu callbacks are done in a more timely
> manner when the rcu queue gets long. This would involve TIF_RCUPENDING
> and whatever else to make sure that we have timely quiescent periods,
> and we do the RCU callback tasklet more often if the queue is long.

Yes, I am already looking at this. There are a number of approaches to this, including adaptive algorithms to cater to naughty corner cases and/or adding different ways to handle RCU as in tree. I hope to experiment with these incrementally after 2.6.14 over a period of time and see what works best for most people.

Thanks
Dipankar

^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-16 16:23 ` Dipankar Sarma 2005-10-16 18:51 ` Serge Belyshev @ 2005-10-17 2:34 ` Linus Torvalds 2005-10-17 3:54 ` Roland Dreier 2005-10-17 11:54 ` Dipankar Sarma 1 sibling, 2 replies; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 2:34 UTC (permalink / raw) To: Dipankar Sarma Cc: Serge Belyshev, linux-kernel, khali, Andrew Morton, Manfred Spraul On Sun, 16 Oct 2005, Dipankar Sarma wrote: > > Linus, I don't think this has anything to do with RCU grace periods > like we discussed previously. I measured on my 3.6GHz x86_64 and > found that open()/close() pair on /dev/null takes about 45500 > cycles or 12 microseconds. [Does that sound resonable?]. That sounds very slow. I can do a million open/close pairs in 4 seconds on a 2.5GHz G5. Maybe you tested a cold-cache case? Of course, a P4 is just about the worst architecture to test system call performance on, so ... Still, that's 4us. I'm pretty sure some machines will do it in 3 or less (in fact, lmbench says 3.17us on another machine of mine for open/close). Still, that's only four times faster, so 2 timer ticks should be less than 5000 file structs to free. I suspect this patch is worth it for the 2.6.14 timeframe, but I'll wait for confirmation. In fact, for 2.6.14, I'd almost do an even more minimal one. I agree with your changing the file counter to an atomic, but I'd rather keep that change for later. Serge, does this alternate patch work for you? [ cache constructors and destructors are _stupid_. They act exactly the wrong way from a cache standpoint. 
] Linus --- diff --git a/fs/dcache.c b/fs/dcache.c index fb10386..40aaa90 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1731,7 +1731,7 @@ void __init vfs_caches_init(unsigned lon SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, - SLAB_HWCACHE_ALIGN|SLAB_PANIC, filp_ctor, filp_dtor); + SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); dcache_init(mempages); inode_init(mempages); diff --git a/fs/file_table.c b/fs/file_table.c index 86ec8ae..fbda480 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -39,21 +39,9 @@ void filp_ctor(void * objp, struct kmem_ { if ((cflags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) == SLAB_CTOR_CONSTRUCTOR) { - unsigned long flags; - spin_lock_irqsave(&filp_count_lock, flags); - files_stat.nr_files++; - spin_unlock_irqrestore(&filp_count_lock, flags); } } -void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags) -{ - unsigned long flags; - spin_lock_irqsave(&filp_count_lock, flags); - files_stat.nr_files--; - spin_unlock_irqrestore(&filp_count_lock, flags); -} - static inline void file_free_rcu(struct rcu_head *head) { struct file *f = container_of(head, struct file, f_rcuhead); @@ -62,6 +50,13 @@ static inline void file_free_rcu(struct static inline void file_free(struct file *f) { + unsigned long flags; + + /* Stupid. Use atomics */ + spin_lock_irqsave(&filp_count_lock, flags); + files_stat.nr_files--; + spin_unlock_irqrestore(&filp_count_lock, flags); + call_rcu(&f->f_rcuhead, file_free_rcu); } @@ -73,6 +68,7 @@ struct file *get_empty_filp(void) { static int old_max; struct file * f; + unsigned long flags; /* * Privileged users can go above max_files @@ -85,6 +81,11 @@ struct file *get_empty_filp(void) if (f == NULL) goto fail; + /* Stupid. 
Use atomics */ + spin_lock_irqsave(&filp_count_lock, flags); + files_stat.nr_files++; + spin_unlock_irqrestore(&filp_count_lock, flags); + memset(f, 0, sizeof(*f)); if (security_file_alloc(f)) goto fail_sec; diff --git a/include/linux/file.h b/include/linux/file.h index f5bbd4c..55f0572 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -60,8 +60,6 @@ extern void put_filp(struct file *); extern int get_unused_fd(void); extern void FASTCALL(put_unused_fd(unsigned int fd)); struct kmem_cache_s; -extern void filp_ctor(void * objp, struct kmem_cache_s *cachep, unsigned long cflags); -extern void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags); extern struct file ** alloc_fd_array(int); extern void free_fd_array(struct file **, int); ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 2:34 ` Linus Torvalds @ 2005-10-17 3:54 ` Roland Dreier 2005-10-17 11:54 ` Dipankar Sarma 1 sibling, 0 replies; 48+ messages in thread From: Roland Dreier @ 2005-10-17 3:54 UTC (permalink / raw) To: Linus Torvalds Cc: Dipankar Sarma, Serge Belyshev, linux-kernel, khali, Andrew Morton, Manfred Spraul > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -39,21 +39,9 @@ void filp_ctor(void * objp, struct kmem_ > { > if ((cflags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) == > SLAB_CTOR_CONSTRUCTOR) { > - unsigned long flags; > - spin_lock_irqsave(&filp_count_lock, flags); > - files_stat.nr_files++; > - spin_unlock_irqrestore(&filp_count_lock, flags); > } > } Am I missing something? Why not delete the whole filp_ctor() function rather than just the then clause of the if()? - R. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  2:34         ` Linus Torvalds
  2005-10-17  3:54           ` Roland Dreier
@ 2005-10-17 11:54           ` Dipankar Sarma
  1 sibling, 0 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 11:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Serge Belyshev, linux-kernel, khali, Andrew Morton, Manfred Spraul

On Sun, Oct 16, 2005 at 07:34:24PM -0700, Linus Torvalds wrote:
> On Sun, 16 Oct 2005, Dipankar Sarma wrote:
> >
> > Linus, I don't think this has anything to do with RCU grace periods
> > like we discussed previously. I measured on my 3.6GHz x86_64 and
> > found that open()/close() pair on /dev/null takes about 45500
> > cycles or 12 microseconds. [Does that sound reasonable?].
>
> That sounds very slow. I can do a million open/close pairs in 4 seconds on
> a 2.5GHz G5. Maybe you tested a cold-cache case?

I measured after warming up for about a 100 times or so. It is not
a cold-cache case. I think we have a bigger problem in hand here.
I measured this with 2.6.13 and saw that I could do the same in
~3 microseconds per iteration. It balloons to 12 microseconds in
2.6.14-rc1. I am looking at this right now apart from the other
problems.

> I suspect this patch is worth it for the 2.6.14 timeframe, but I'll wait
> for confirmation.
>
> In fact, for 2.6.14, I'd almost do an even more minimal one. I agree with
> your changing the file counter to an atomic, but I'd rather keep that
> change for later.

Even beyond the file counter issue, we do need to address the DoS
and the open/close slowdown issue.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 48+ messages in thread
end of thread, other threads:[~2005-10-18 16:21 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-10-15 13:19 VFS: file-max limit 50044 reached Serge Belyshev
2005-10-15 17:53 ` Serge Belyshev
2005-10-16 16:23 ` Dipankar Sarma
2005-10-16 18:51 ` Serge Belyshev
2005-10-16 18:56 ` Dipankar Sarma
2005-10-17  2:19 ` Linus Torvalds
2005-10-17  4:43 ` Serge Belyshev
2005-10-17  8:32 ` Jean Delvare
2005-10-17  8:46 ` Dipankar Sarma
2005-10-17  9:10 ` Eric Dumazet
2005-10-17  9:14 ` Christoph Hellwig
2005-10-17  9:25 ` Eric Dumazet
2005-10-17 10:32 ` Dipankar Sarma
2005-10-17 12:10 ` [RCU problem] was " Eric Dumazet
2005-10-17 12:31 ` linux-os (Dick Johnson)
2005-10-17 12:36 ` Dipankar Sarma
2005-10-17 13:28 ` Eric Dumazet
2005-10-17 13:33 ` Dipankar Sarma
2005-10-17 14:54 ` Eric Dumazet
2005-10-17 15:42 ` Linus Torvalds
2005-10-17 16:01 ` Eric Dumazet
2005-10-17 16:16 ` Linus Torvalds
2005-10-17 16:29 ` Dipankar Sarma
2005-10-17 18:01 ` Eric Dumazet
2005-10-17 18:31 ` Dipankar Sarma
2005-10-17 19:00 ` Linus Torvalds
2005-10-17 18:37 ` Linus Torvalds
2005-10-17 19:12 ` Eric Dumazet
2005-10-17 19:30 ` Linus Torvalds
2005-10-17 19:39 ` Eric Dumazet
2005-10-17 20:14 ` Linus Torvalds
2005-10-17 20:25 ` Christopher Friesen
2005-10-17 20:24 ` Dipankar Sarma
2005-10-18 15:55 ` Christopher Friesen
2005-10-17 20:38 ` Linus Torvalds
2005-10-17 20:33 ` Dipankar Sarma
2005-10-17 22:40 ` Linus Torvalds
2005-10-17 22:59 ` Paul E. McKenney
2005-10-18  9:46 ` Eric Dumazet
2005-10-18 16:22 ` Paul E. McKenney
2005-10-17 18:15 ` Dipankar Sarma
2005-10-17 18:40 ` Linus Torvalds
2005-10-17 16:23 ` Dipankar Sarma
2005-10-17 16:31 ` Lee Revell
2005-10-17 16:20 ` Dipankar Sarma
2005-10-17  2:34 ` Linus Torvalds
2005-10-17  3:54 ` Roland Dreier
2005-10-17 11:54 ` Dipankar Sarma
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox