* VFS: file-max limit 50044 reached
@ 2005-10-15 13:19 Serge Belyshev
  2005-10-15 17:53 ` Serge Belyshev
  0 siblings, 1 reply; 48+ messages in thread
From: Serge Belyshev @ 2005-10-15 13:19 UTC (permalink / raw)
To: linux-kernel
This program:
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
int main (void)
{
	int f, j;

	j = 0;
	while (1) {
		f = open ("/dev/null", O_RDONLY);
		if (f == -1) {
			fprintf (stderr, "open (%i): %s\n", j, strerror (errno));
			abort ();
		}
		close (f);
		j++;
	}
	return 0;
}
fails on 2.6.14-rc4 kernel with this message:
$ ./a.out
VFS: file-max limit 50044 reached
open (55499): Too many open files in system
Aborted
$
This problem was reproduced on i386 and amd64 with
kernels 2.6.14-rc1 .. 2.6.14-rc4-git4
^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: VFS: file-max limit 50044 reached
  2005-10-15 13:19 VFS: file-max limit 50044 reached Serge Belyshev
@ 2005-10-15 17:53 ` Serge Belyshev
  2005-10-16 16:23   ` Dipankar Sarma
  0 siblings, 1 reply; 48+ messages in thread
From: Serge Belyshev @ 2005-10-15 17:53 UTC (permalink / raw)
To: linux-kernel

>This problem was reproduced on i386 and amd64 with
>kernels 2.6.14-rc1 .. 2.6.14-rc4-git4

Caused by this change:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ab2af1f5005069321c5d130f09cce577b03f43ef
or
http://tinyurl.com/cyrou

aka "[PATCH] files: files struct with RCU"

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-15 17:53 ` Serge Belyshev
@ 2005-10-16 16:23   ` Dipankar Sarma
  2005-10-16 18:51     ` Serge Belyshev
  2005-10-17  2:34     ` Linus Torvalds
  0 siblings, 2 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-16 16:23 UTC (permalink / raw)
To: Serge Belyshev
Cc: linux-kernel, Linus Torvalds, khali, Andrew Morton, Manfred Spraul

On Sat, Oct 15, 2005 at 09:53:14PM +0400, Serge Belyshev wrote:
> 
> >This problem was reproduced on i386 and amd64 with
> >kernels 2.6.14-rc1 .. 2.6.14-rc4-git4
> 
> Caused by this change:
> 
> http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ab2af1f5005069321c5d130f09cce577b03f43ef
> or
> http://tinyurl.com/cyrou
> 
> aka "[PATCH] files: files struct with RCU"

Linus, I don't think this has anything to do with RCU grace periods
like we discussed previously. I measured on my 3.6GHz x86_64 and found
that an open()/close() pair on /dev/null takes about 45500 cycles or
12 microseconds. [Does that sound reasonable?] So, assuming 100 HZ, we
can't really queue more than 1660 file structs to free per 2 timer
ticks. In fact, I looked at the filp slabinfo and we were indeed
returning file structures to slab.

I think this is a known issue I was looking at earlier - the way we do
file struct accounting is not very suitable for batched freeing. For
scalability reasons, file accounting was constructor/destructor based.
This meant that nr_files was decremented only when the object was
removed from the slab cache. This is susceptible to slab
fragmentation. With the RCU-based file structure, the consequent
batched freeing and a test program like Serge's, we just speed this up
and end up with a very fragmented slab -

llm22:~ # cat /proc/sys/fs/file-nr
587730  0       758844

At the same time, I see only 2000+ objects in the filp cache. To
verify this theory, I tried the following experimental patch I had
from before, and it fixes this problem. However, I ran into my old
"bad page state" problem that I have been seeing since 2.6.9-rc2 on
that machine. That needs a separate investigation.

Serge, could you please try the following experimental patch just to
see if file counting is indeed the problem. The patch is definitely
*not* meant for inclusion. Yet. Manfred told me a while ago that
global filp counting caused scalability problems in some benchmarks -
something I haven't been able to verify.

Thanks
Dipankar


This patch changes the file counting by removing the filp_count_lock.
Instead we use a separate atomic_t, nr_files, for now and all accesses
to it are through the get_nr_files() api. In the sysctl handler for
nr_files, we populate files_stat.nr_files before returning to user.
Counting files as and when they are created and destroyed (as opposed
to inside slab) allows us to correctly count open files with RCU.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
---

diff -puN fs/dcache.c~files-scale-file-counting fs/dcache.c
--- linux-2.6.14-rc1-test/fs/dcache.c~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/fs/dcache.c	2005-10-16 14:03:25.000000000 -0700
@@ -1730,7 +1730,7 @@ void __init vfs_caches_init(unsigned lon
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
 
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC, filp_ctor, filp_dtor);
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
 
 	dcache_init(mempages);
 	inode_init(mempages);
diff -puN fs/file_table.c~files-scale-file-counting fs/file_table.c
--- linux-2.6.14-rc1-test/fs/file_table.c~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/fs/file_table.c	2005-10-16 14:07:20.000000000 -0700
@@ -5,6 +5,7 @@
  *  Copyright (C) 1997 David S. Miller (davem@caip.rutgers.edu)
  */
 
+#include <linux/config.h>
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/file.h>
@@ -18,52 +19,67 @@
 #include <linux/mount.h>
 #include <linux/cdev.h>
 #include <linux/fsnotify.h>
+#include <linux/sysctl.h>
+#include <asm/atomic.h>
 
 /* sysctl tunables... */
 struct files_stat_struct files_stat = {
 	.max_files = NR_FILE
 };
 
-EXPORT_SYMBOL(files_stat); /* Needed by unix.o */
-
 /* public. Not pretty! */
 __cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
 
-static DEFINE_SPINLOCK(filp_count_lock);
+static atomic_t nr_files __cacheline_aligned_in_smp;
 
-/* slab constructors and destructors are called from arbitrary
- * context and must be fully threaded - use a local spinlock
- * to protect files_stat.nr_files
- */
-void filp_ctor(void * objp, struct kmem_cache_s *cachep, unsigned long cflags)
+static inline void file_free_rcu(struct rcu_head *head)
 {
-	if ((cflags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
-	    SLAB_CTOR_CONSTRUCTOR) {
-		unsigned long flags;
-		spin_lock_irqsave(&filp_count_lock, flags);
-		files_stat.nr_files++;
-		spin_unlock_irqrestore(&filp_count_lock, flags);
-	}
+	struct file *f = container_of(head, struct file, f_rcuhead);
+	kmem_cache_free(filp_cachep, f);
 }
 
-void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags)
+static inline void file_free(struct file *f)
 {
-	unsigned long flags;
-	spin_lock_irqsave(&filp_count_lock, flags);
-	files_stat.nr_files--;
-	spin_unlock_irqrestore(&filp_count_lock, flags);
+	atomic_dec(&nr_files);
+	call_rcu(&f->f_rcuhead, file_free_rcu);
 }
 
-static inline void file_free_rcu(struct rcu_head *head)
+/*
+ * Return the total number of open files in the system
+ */
+int get_nr_files(void)
 {
-	struct file *f = container_of(head, struct file, f_rcuhead);
-	kmem_cache_free(filp_cachep, f);
+	return atomic_read(&nr_files);
 }
 
-static inline void file_free(struct file *f)
+/*
+ * Return the maximum number of open files in the system
+ */
+int get_max_files(void)
 {
-	call_rcu(&f->f_rcuhead, file_free_rcu);
+	return files_stat.max_files;
+}
+
+EXPORT_SYMBOL(get_nr_files);
+EXPORT_SYMBOL(get_max_files);
+
+/*
+ * Handle nr_files sysctl
+ */
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+int proc_nr_files(ctl_table *table, int write, struct file *filp,
+                     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	files_stat.nr_files = get_nr_files();
+	return proc_dointvec(table, write, filp, buffer, lenp, ppos);
+}
+#else
+int proc_nr_files(ctl_table *table, int write, struct file *filp,
+                     void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	return -ENOSYS;
 }
+#endif
 
 /* Find an unused file structure and return a pointer to it.
  * Returns NULL, if there are no more free file structures or
@@ -77,7 +93,7 @@ struct file *get_empty_filp(void)
 	/*
 	 * Privileged users can go above max_files
	 */
-	if (files_stat.nr_files >= files_stat.max_files &&
+	if (get_nr_files() >= files_stat.max_files &&
 				!capable(CAP_SYS_ADMIN))
 		goto over;
 
@@ -96,11 +112,12 @@ struct file *get_empty_filp(void)
 	rwlock_init(&f->f_owner.lock);
 	/* f->f_version: 0 */
 	INIT_LIST_HEAD(&f->f_list);
+	atomic_inc(&nr_files);
 	return f;
 
 over:
 	/* Ran out of filps - report that */
-	if (files_stat.nr_files > old_max) {
+	if (get_nr_files() > old_max) {
 		printk(KERN_INFO "VFS: file-max limit %d reached\n",
 					files_stat.max_files);
 		old_max = files_stat.nr_files;
diff -puN fs/xfs/linux-2.6/xfs_linux.h~files-scale-file-counting fs/xfs/linux-2.6/xfs_linux.h
--- linux-2.6.14-rc1-test/fs/xfs/linux-2.6/xfs_linux.h~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/fs/xfs/linux-2.6/xfs_linux.h	2005-10-16 14:03:25.000000000 -0700
@@ -88,6 +88,7 @@
 #include <linux/proc_fs.h>
 #include <linux/version.h>
 #include <linux/sort.h>
+#include <linux/fs.h>
 
 #include <asm/page.h>
 #include <asm/div64.h>
@@ -242,7 +243,7 @@ static inline void set_buffer_unwritten_
 
 /* IRIX uses the current size of the name cache to guess a good value */
 /* - this isn't the same but is a good enough starting point for now. */
-#define DQUOT_HASH_HEURISTIC files_stat.nr_files
+#define DQUOT_HASH_HEURISTIC get_nr_files()
 
 /* IRIX inodes maintain the project ID also, zero this field on Linux */
 #define DEFAULT_PROJID	0
diff -puN include/linux/file.h~files-scale-file-counting include/linux/file.h
--- linux-2.6.14-rc1-test/include/linux/file.h~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/linux/file.h	2005-10-16 14:03:25.000000000 -0700
@@ -60,8 +60,6 @@ extern void put_filp(struct file *);
 extern int get_unused_fd(void);
 extern void FASTCALL(put_unused_fd(unsigned int fd));
 struct kmem_cache_s;
-extern void filp_ctor(void * objp, struct kmem_cache_s *cachep, unsigned long cflags);
-extern void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags);
 extern struct file ** alloc_fd_array(int);
 extern void free_fd_array(struct file **, int);
diff -puN include/linux/fs.h~files-scale-file-counting include/linux/fs.h
--- linux-2.6.14-rc1-test/include/linux/fs.h~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/linux/fs.h	2005-10-16 14:03:25.000000000 -0700
@@ -36,6 +36,8 @@ struct files_stat_struct {
 	int max_files;		/* tunable */
 };
 extern struct files_stat_struct files_stat;
+extern int get_nr_files(void);
+extern int get_max_files(void);
 
 struct inodes_stat_t {
 	int nr_inodes;
diff -puN kernel/sysctl.c~files-scale-file-counting kernel/sysctl.c
--- linux-2.6.14-rc1-test/kernel/sysctl.c~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/kernel/sysctl.c	2005-10-16 14:03:25.000000000 -0700
@@ -50,6 +50,9 @@
 #include <linux/nfs_fs.h>
 #endif
 
+extern int proc_nr_files(ctl_table *table, int write, struct file *filp,
+                     void __user *buffer, size_t *lenp, loff_t *ppos);
+
 #if defined(CONFIG_SYSCTL)
 
 /* External variables not in a header file. */
@@ -879,7 +882,7 @@ static ctl_table fs_table[] = {
 		.data		= &files_stat,
 		.maxlen		= 3*sizeof(int),
 		.mode		= 0444,
-		.proc_handler	= &proc_dointvec,
+		.proc_handler	= &proc_nr_files,
 	},
 	{
 		.ctl_name	= FS_MAXFILE,
diff -puN net/unix/af_unix.c~files-scale-file-counting net/unix/af_unix.c
--- linux-2.6.14-rc1-test/net/unix/af_unix.c~files-scale-file-counting	2005-10-16 14:03:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/net/unix/af_unix.c	2005-10-16 14:03:25.000000000 -0700
@@ -547,7 +547,7 @@ static struct sock * unix_create1(struct
 	struct sock *sk = NULL;
 	struct unix_sock *u;
 
-	if (atomic_read(&unix_nr_socks) >= 2*files_stat.max_files)
+	if (atomic_read(&unix_nr_socks) >= 2*get_max_files())
 		goto out;
 
 	sk = sk_alloc(PF_UNIX, GFP_KERNEL, &unix_proto, 1);
_

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-16 16:23 ` Dipankar Sarma
@ 2005-10-16 18:51   ` Serge Belyshev
  2005-10-16 18:56     ` Dipankar Sarma
  2005-10-17  2:34   ` Linus Torvalds
  1 sibling, 1 reply; 48+ messages in thread
From: Serge Belyshev @ 2005-10-16 18:51 UTC (permalink / raw)
To: Dipankar Sarma
Cc: linux-kernel, Linus Torvalds, khali, Andrew Morton, Manfred Spraul

Dipankar Sarma <dipankar@in.ibm.com> writes:

> Serge, could you please try the following experimental patch
> just to see if file counting is indeed the problem. The patch

I ran my test program with this patch applied on top of 2.6.14-rc4-git4
and it worked.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-16 18:51 ` Serge Belyshev
@ 2005-10-16 18:56   ` Dipankar Sarma
  2005-10-17  2:19     ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-16 18:56 UTC (permalink / raw)
To: Serge Belyshev
Cc: linux-kernel, Linus Torvalds, khali, Andrew Morton, Manfred Spraul

On Sun, Oct 16, 2005 at 10:51:12PM +0400, Serge Belyshev wrote:
> Dipankar Sarma <dipankar@in.ibm.com> writes:
> 
> > Serge, could you please try the following experimental patch
> > just to see if file counting is indeed the problem. The patch
> 
> I ran my test program with this patch applied on top of 2.6.14-rc4-git4
> and it worked.

Serge, thanks for the test. The issue is however far from resolved. We
need to find out about potential scalability problems with this
approach.

Secondly, on subsequent repeated tests, I saw a very large number of
allocated objects (600000+) in the filp cache. That does point to
either the RCU grace period not happening or my syscall measurements
being completely wrong. I did run with the following patch that adds
syscall exit as a quiescent state, but it didn't help. I am going to
have to instrument RCU to see what is really happening.

Thanks
Dipankar


It turns out that under some really heavy RCU updates under simulated
conditions, a syscall-bound task that doesn't block may prevent RCU
from happening during its entire timeslice, and that window may be big
enough to generate out-of-memory situations for RCU-protected objects.
This patch starts counting completion of syscalls as a quiescent state
in order to prevent the above situation from happening. It introduces
a new field in thread_info called rcu_qs which stores the RCU
quiescent state counter pointer for the cpu on which the thread runs.
We increment the counter on every syscall completion to move rcu
forward. This patch adds that support to the i386 and x86_64 archs,
but it doesn't break other arches. As and when support for rcu_qs is
added to the thread_info structs of other arches, we need to define
ARCH_HAS_RCU_QS for that arch.

Not-Yet-Signed-Off-By: Dipankar Sarma <dipankar@in.ibm.com>

diff -puN arch/i386/kernel/entry.S~rcu-syscall-quiescent arch/i386/kernel/entry.S
--- linux-2.6.14-rc1-test/arch/i386/kernel/entry.S~rcu-syscall-quiescent	2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/i386/kernel/entry.S	2005-10-16 11:25:10.000000000 -0700
@@ -239,6 +239,8 @@ syscall_exit:
 	cli				# make sure we don't miss an interrupt
 					# setting need_resched or sigpending
 					# between sampling and the iret
+	movl TI_rcu_qs(%ebp), %ecx	# Update RCU quiescent state flag
+	movl $1,(%ecx)
 	movl TI_flags(%ebp), %ecx
 	testw $_TIF_ALLWORK_MASK, %cx	# current->work
 	jne syscall_exit_work
diff -puN include/asm-i386/thread_info.h~rcu-syscall-quiescent include/asm-i386/thread_info.h
--- linux-2.6.14-rc1-test/include/asm-i386/thread_info.h~rcu-syscall-quiescent	2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/asm-i386/thread_info.h	2005-10-16 11:20:37.000000000 -0700
@@ -17,6 +17,8 @@
 #include <asm/processor.h>
 #endif
 
+#define ARCH_HAS_RCU_QS
+
 /*
  * low level task data that entry.S needs immediate access to
  * - this struct should fit entirely inside of one cache line
@@ -39,6 +41,7 @@ struct thread_info {
 						0-0xFFFFFFFF for kernel-thread
 					*/
 	struct restart_block    restart_block;
+	int			*rcu_qs;	/* RCU quiescent state flag */
 
 	unsigned long           previous_esp;   /* ESP of the previous stack in
 						   case of nested (IRQ) stacks
diff -puN include/linux/rcupdate.h~rcu-syscall-quiescent include/linux/rcupdate.h
--- linux-2.6.14-rc1-test/include/linux/rcupdate.h~rcu-syscall-quiescent	2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/linux/rcupdate.h	2005-10-16 12:38:56.000000000 -0700
@@ -41,6 +41,7 @@
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/seqlock.h>
+#include <linux/thread_info.h>
 
 /**
  * struct rcu_head - callback structure for use with RCU
@@ -271,6 +272,16 @@ static inline int rcu_pending(int cpu)
  */
 #define synchronize_sched() synchronize_rcu()
 
+#ifdef ARCH_HAS_RCU_QS
+static inline void rcu_set_qs(struct thread_info *ti, int cpu)
+{
+	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+	ti->rcu_qs = &rdp->passed_quiesc;
+}
+#else
+static inline void rcu_set_qs(struct thread_info *ti, int cpu) { }
+#endif
+
 extern void rcu_init(void);
 extern void rcu_check_callbacks(int cpu, int user);
 extern void rcu_restart_cpu(int cpu);
diff -puN init/main.c~rcu-syscall-quiescent init/main.c
--- linux-2.6.14-rc1-test/init/main.c~rcu-syscall-quiescent	2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/init/main.c	2005-10-16 12:43:19.000000000 -0700
@@ -671,6 +671,9 @@ static int init(void * unused)
 	 */
 	child_reaper = current;
 
+	/* Set up rcu quiescent state counter before making any syscall */
+	rcu_set_qs(current_thread_info(), smp_processor_id());
+
 	/* Sets up cpus_possible() */
 	smp_prepare_cpus(max_cpus);
 
diff -puN kernel/sched.c~rcu-syscall-quiescent kernel/sched.c
--- linux-2.6.14-rc1-test/kernel/sched.c~rcu-syscall-quiescent	2005-10-16 11:01:35.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/kernel/sched.c	2005-10-16 12:43:53.000000000 -0700
@@ -3006,6 +3006,7 @@ switch_tasks:
 	rq->nr_switches++;
 	rq->curr = next;
 	++*switch_count;
+	rcu_set_qs(next->thread_info, task_cpu(prev));
 
 	prepare_task_switch(rq, next);
 	prev = context_switch(rq, prev, next);
diff -puN arch/i386/kernel/asm-offsets.c~rcu-syscall-quiescent arch/i386/kernel/asm-offsets.c
--- linux-2.6.14-rc1-test/arch/i386/kernel/asm-offsets.c~rcu-syscall-quiescent	2005-10-16 11:35:28.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/i386/kernel/asm-offsets.c	2005-10-16 11:36:15.000000000 -0700
@@ -53,6 +53,7 @@ void foo(void)
 	OFFSET(TI_preempt_count, thread_info, preempt_count);
 	OFFSET(TI_addr_limit, thread_info, addr_limit);
 	OFFSET(TI_restart_block, thread_info, restart_block);
+	OFFSET(TI_rcu_qs, thread_info, rcu_qs);
 	BLANK();
 
 	OFFSET(EXEC_DOMAIN_handler, exec_domain, handler);
diff -puN arch/x86_64/kernel/entry.S~rcu-syscall-quiescent arch/x86_64/kernel/entry.S
--- linux-2.6.14-rc1-test/arch/x86_64/kernel/entry.S~rcu-syscall-quiescent	2005-10-16 11:48:27.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/x86_64/kernel/entry.S	2005-10-16 12:03:01.000000000 -0700
@@ -214,6 +214,8 @@ ret_from_sys_call:
 sysret_check:
 	GET_THREAD_INFO(%rcx)
 	cli
+	movq threadinfo_rcu_qs(%rcx),%rdx
+	movq $1,(%rdx)
 	movl threadinfo_flags(%rcx),%edx
 	andl %edi,%edx
 	CFI_REMEMBER_STATE
@@ -310,6 +312,8 @@ ENTRY(int_ret_from_sys_call)
 	/* edi:	mask to check */
 int_with_check:
 	GET_THREAD_INFO(%rcx)
+	movq threadinfo_rcu_qs(%rcx),%rdx
+	movl $1,(%rdx)
 	movl threadinfo_flags(%rcx),%edx
 	andl %edi,%edx
 	jnz  int_careful
diff -puN include/asm-x86_64/thread_info.h~rcu-syscall-quiescent include/asm-x86_64/thread_info.h
--- linux-2.6.14-rc1-test/include/asm-x86_64/thread_info.h~rcu-syscall-quiescent	2005-10-16 11:50:25.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/include/asm-x86_64/thread_info.h	2005-10-16 11:54:47.000000000 -0700
@@ -23,6 +23,8 @@ struct task_struct;
 struct exec_domain;
 #include <asm/mmsegment.h>
 
+#define ARCH_HAS_RCU_QS
+
 struct thread_info {
 	struct task_struct	*task;		/* main task structure */
 	struct exec_domain	*exec_domain;	/* execution domain */
@@ -33,6 +35,7 @@ struct thread_info {
 	mm_segment_t		addr_limit;
 	struct restart_block    restart_block;
+	int			*rcu_qs;
 };
 
 #endif
diff -puN arch/x86_64/kernel/asm-offsets.c~rcu-syscall-quiescent arch/x86_64/kernel/asm-offsets.c
--- linux-2.6.14-rc1-test/arch/x86_64/kernel/asm-offsets.c~rcu-syscall-quiescent	2005-10-16 11:52:13.000000000 -0700
+++ linux-2.6.14-rc1-test-dipankar/arch/x86_64/kernel/asm-offsets.c	2005-10-16 11:53:14.000000000 -0700
@@ -33,6 +33,7 @@ int main(void)
 	ENTRY(flags);
 	ENTRY(addr_limit);
 	ENTRY(preempt_count);
+	ENTRY(rcu_qs);
 	BLANK();
 #undef ENTRY
 #define ENTRY(entry) DEFINE(pda_ ## entry, offsetof(struct x8664_pda, entry))
_

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-16 18:56 ` Dipankar Sarma
@ 2005-10-17  2:19   ` Linus Torvalds
  2005-10-17  4:43     ` Serge Belyshev
  2005-10-17  8:32     ` Jean Delvare
  0 siblings, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2005-10-17  2:19 UTC (permalink / raw)
To: Dipankar Sarma
Cc: Serge Belyshev, linux-kernel, khali, Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Dipankar Sarma wrote:
> 
> Secondly, on subsequent repeated tests, I saw a very large number
> of allocated objects (600000+) in filp cache. That does point to either RCU
> grace period not happening or my syscall measurements completely
> wrong. I did run with the following patch that adds syscall
> exit as a quiescent state, but it didn't help. I am going
> to have to instrument RCU to see what is really happening.

I would _really_ prefer not to do this in the system call hot-path by
default. That is unquestionably the hottest path in the kernel by far.

It would be _much_ better to set one of the TIF_WORK flags when there's a
lot of RCU stuff, and do this all in the not-quite-so-hot path of
do_notify_resume() (on x86; I think others call it other things) instead.

If you use the same kind of "set the TIF flag every 1000 rcu events"
approach that my failed patch had, you'd be much better off. In fact, in
that path you could even do a full "rcu_process_callbacks()". After all,
this is not that different from signal handling.

Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
since -rc4.

Maybe this isn't that serious in practice right now? Serge, how did you
notice it?

		Linus

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  2:19 ` Linus Torvalds
@ 2005-10-17  4:43   ` Serge Belyshev
  2005-10-17  8:32   ` Jean Delvare
  1 sibling, 0 replies; 48+ messages in thread
From: Serge Belyshev @ 2005-10-17  4:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, khali, Andrew Morton, Manfred Spraul

[resend, sorry for sending mail off-list]

Linus Torvalds <torvalds@osdl.org> writes:

> Serge, does this alternate patch work for you?

Yes, this patch works too.

> Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
> since -rc4.
>
> Maybe this isn't that serious in practice right now? Serge, how did you
> notice it?

This bug causes random failures when building the kernel with make -j4,
all with "Too many open files in system" messages from gcc.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  2:19 ` Linus Torvalds
  2005-10-17  4:43   ` Serge Belyshev
@ 2005-10-17  8:32   ` Jean Delvare
  2005-10-17  8:46     ` Dipankar Sarma
  1 sibling, 1 reply; 48+ messages in thread
From: Jean Delvare @ 2005-10-17  8:32 UTC (permalink / raw)
To: torvalds, dipankar; +Cc: Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Hi Linus, Dipankar, all,

On 2005-10-17, Linus Torvalds wrote:
> I would _really_ prefer not to do this in the system call hot-path by
> default. That is unquestionably the hottest path in the kernel by far.
> 
> It would be _much_ better to set one of the TIF_WORK flags when there's a
> lot of RCU stuff, and do this all in the not-quite-so-hot path of
> do_notify_resume() (on x86; I think others call it other things) instead.
> 
> If you use the same kind of "set the TIF flag every 1000 rcu events"
> approach that my failed patch had, you'd be much better off.
> 
> In fact, in that path you could even do a full "rcu_process_callbacks()".
> After all, this is not that different from signal handling.
> 
> Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
> since -rc4.

Isn't reverting the original change an option? 2.6.13 was working OK if
I'm not mistaken.

Thanks,
-- 
Jean Delvare

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  8:32 ` Jean Delvare
@ 2005-10-17  8:46   ` Dipankar Sarma
  2005-10-17  9:10     ` Eric Dumazet
  0 siblings, 1 reply; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17  8:46 UTC (permalink / raw)
To: Jean Delvare
Cc: torvalds, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 10:32:47AM +0200, Jean Delvare wrote:
> 
> > In fact, in that path you could even do a full "rcu_process_callbacks()".
> > After all, this is not that different from signal handling.
> > 
> > Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
> > since -rc4.
> 
> Isn't reverting the original change an option? 2.6.13 was working OK if
> I'm not mistaken.

IMO, putting the file accounting in slab ctors/dtors is not very
reliable because it depends on the slab not getting fragmented.
Batched freeing in RCU is just an extreme case of it. We needed to fix
file counting anyway.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  8:46 ` Dipankar Sarma
@ 2005-10-17  9:10   ` Eric Dumazet
  2005-10-17  9:14     ` Christoph Hellwig
  2005-10-17 10:32     ` Dipankar Sarma
  0 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17  9:10 UTC (permalink / raw)
To: dipankar
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 10:32:47AM +0200, Jean Delvare wrote:
> 
>>> In fact, in that path you could even do a full "rcu_process_callbacks()".
>>> After all, this is not that different from signal handling.
>>>
>>> Gaah. I had really hoped to release 2.6.14 tomorrow. It's been a week
>>> since -rc4.
>>
>> Isn't reverting the original change an option? 2.6.13 was working OK if
>> I'm not mistaken.
> 
> IMO, putting the file accounting in slab ctor/dtors is not very
> reliable because it depends on slab not getting fragmented.
> Batched freeing in RCU is just an extreme case of it. We needed
> to fix file counting anyway.
> 
> Thanks
> Dipankar

But isn't this file counting a small problem?

This small program can eat all available memory.

Fixing the 'file count' won't fix the real problem: batched freeing is
good, but it should be limited so that we don't end up with *billions*
of file structs queued for deletion.

Don't take me wrong: I really *need* the file RCU stuff added in
2.6.14. I believe we can find a solution, even if it might delay
2.6.14 because Linus would have to release an rc5.

Eric

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  9:10 ` Eric Dumazet
@ 2005-10-17  9:14   ` Christoph Hellwig
  2005-10-17  9:25     ` Eric Dumazet
  2005-10-17 10:32   ` Dipankar Sarma
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2005-10-17  9:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: dipankar, Jean Delvare, torvalds, Serge Belyshev, LKML,
	Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
> Don't take me wrong: I really *need* the file RCU stuff added in 2.6.14.

How so? And why should we care? I'd rather see a 2.6.14 soon with the
changes backed out, so we can have a proper release that more or less
sticks to the release schedule we agreed on at the kernel summit.
You'll have four weeks to sort out the issue afterwards.

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  9:14 ` Christoph Hellwig
@ 2005-10-17  9:25   ` Eric Dumazet
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17  9:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dipankar, Jean Delvare, torvalds, Serge Belyshev, LKML,
	Andrew Morton, Manfred Spraul

Christoph Hellwig wrote:
> On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
> 
>> Don't take me wrong: I really *need* the file RCU stuff added in 2.6.14.
> 
> How so? And why should we care? I'd rather see a 2.6.14 soon with
> the changes backed out, so we can have a proper release that more or
> less sticks to the release schedule we agreed on at the kernel summit.
> You'll have four weeks to sort out the issue afterwards.

Christoph,

You can try to hide the forest by killing some trees. Are you sure
that the RCU 'file structs' change is the only problem lying around?

For instance, I think other RCU freeing problems are dormant (see
maxbatch=10 and think about the number of routes a busy router (or a
DOS attack) can handle...)

Of course, a 'test program' is more difficult to write than

	while (1)
		close(open("/dev/null", 3));

Eric

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  9:10 ` Eric Dumazet
  2005-10-17  9:14   ` Christoph Hellwig
@ 2005-10-17 10:32   ` Dipankar Sarma
  2005-10-17 12:10     ` [RCU problem] was " Eric Dumazet
  2005-10-17 15:42     ` Linus Torvalds
  1 sibling, 2 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 10:32 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
> Dipankar Sarma wrote:
> >
> > IMO, putting the file accounting in slab ctor/dtors is not very
> > reliable because it depends on slab not getting fragmented.
> > Batched freeing in RCU is just an extreme case of it. We needed
> > to fix file counting anyway.
> >
> > Thanks
> > Dipankar
> 
> But isn't this file counting a small problem?
> 
> This small program can eat all available memory.
> 
> Fixing the 'file count' won't fix the real problem: batched freeing is
> good, but it should be limited so that we don't end up with *billions*
> of file structs queued for deletion.

Agreed. It is not designed to work that way, so there must be a bug
somewhere and I am trying to track it down. It could very well be that
at maxbatch=10 we are just queueing at a rate far too high compared to
the processing rate.

> I believe we can find a solution, even if it might delay 2.6.14
> because Linus would have to release an rc5.

That I am not sure about - it is Linus' call. I am just trying to do
the right thing - fix the real problem.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 48+ messages in thread
* [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 10:32 ` Dipankar Sarma
@ 2005-10-17 12:10   ` Eric Dumazet
  2005-10-17 12:31     ` linux-os (Dick Johnson)
  2005-10-17 12:36     ` Dipankar Sarma
  0 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 12:10 UTC (permalink / raw)
To: dipankar
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

[-- Attachment #1: Type: text/plain, Size: 860 bytes --]

Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
>>
>> Fixing the 'file count' won't fix the real problem: batched freeing is
>> good, but it should be limited so that we don't end up with *billions*
>> of file structs queued for deletion.
> 
> Agreed. It is not designed to work that way, so there must be
> a bug somewhere and I am trying to track it down. It could very well
> be that at maxbatch=10 we are just queueing at a rate far too high
> compared to processing.

I can freeze my test machine with a program that 'only' uses dentries,
no files.

No message, no panic, but the machine becomes totally unresponsive
after a few seconds.

Just grepping for call_rcu in the kernel sources gave me another
call_rcu() use from syscalls. And yes, 2.6.13 has the same problem.

Here is the killer on my HT Xeon machine (2GB ram).

Eric

[-- Attachment #2: stress2.c --]
[-- Type: text/plain, Size: 409 bytes --]

#include <unistd.h>
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

int main(void)
{
	int i, rc;
	struct stat st;
	char name[1024];

	memset(name, 'a', sizeof(name));
	for (i = 0; i < 1000000000; i++) {
		sprintf(name + 220, "%d", i);
		rc = stat(name, &st);
		if (rc == -1 && errno != ENOENT)
			perror(name);
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 12:10             ` [RCU problem] was " Eric Dumazet
@ 2005-10-17 12:31               ` linux-os (Dick Johnson)
  2005-10-17 12:36               ` Dipankar Sarma
  1 sibling, 0 replies; 48+ messages in thread
From: linux-os (Dick Johnson) @ 2005-10-17 12:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: dipankar, Jean Delvare, torvalds, Serge Belyshev, LKML,
	Andrew Morton, Manfred Spraul

On Mon, 17 Oct 2005, Eric Dumazet wrote:

> Dipankar Sarma wrote:
>> On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
>>>
>>> Fixing the 'file count' won't fix the real problem: batched freeing
>>> is good, but it should be limited so that not more than *billions*
>>> of file structs are queued for deletion.
>>
>> Agreed. It is not designed to work that way, so there must be
>> a bug somewhere and I am trying to track it down. It could very well
>> be that at maxbatch=10 we are just queueing at a rate far too high
>> compared to processing.
>
> I can freeze my test machine with a program that 'only' uses dentries,
> no files.
>
> No message, no panic, but the machine becomes totally unresponsive
> after a few seconds.
>
> Just grepping for call_rcu in the kernel sources gave me another
> call_rcu() use reachable from syscalls. And yes, 2.6.13 has the same
> problem.
>
> Here is the killer, on my HT Xeon machine (2GB RAM).
>
> Eric
>

No problem with linux-2.6.13.4 and an ext3 file-system:

F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY       TIME COMMAND
4     0     1     0  16   0  1544  408 -      S    ?         0:00 init [5]
[SNIPPED....]
1     0 16017     6  15   0     0    0 pdflus SW   ?         0:00 [pdflush]
4   666 16406  5273  16   0  4464 1004 wait   S    tty2      0:00 -bash
0   666 16501 16406  18   0  1324  240 -      R    tty2      9:46 ./xxx
4     0 16502  5223  15   0  4204 1248 wait   S    tty1      0:00 -bash
0     0 16563 16502  16   0  2276  584 -      R    tty1      0:00 ps laxw

I just put 9:46 of CPU time on your program and everything is fine.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.46 BogoMips).
Warning : 98.36% of all statistics are fiction.
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 12:10             ` [RCU problem] was " Eric Dumazet
  2005-10-17 12:31               ` linux-os (Dick Johnson)
@ 2005-10-17 12:36               ` Dipankar Sarma
  2005-10-17 13:28                 ` Eric Dumazet
  1 sibling, 1 reply; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 12:36 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
> Dipankar Sarma wrote:
> >On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
> >
> >Agreed. It is not designed to work that way, so there must be
> >a bug somewhere and I am trying to track it down. It could very well
> >be that at maxbatch=10 we are just queueing at a rate far too high
> >compared to processing.
>
> I can freeze my test machine with a program that 'only' uses dentries,
> no files.
>
> No message, no panic, but the machine becomes totally unresponsive
> after a few seconds.
>
> Just grepping for call_rcu in the kernel sources gave me another
> call_rcu() use reachable from syscalls. And yes, 2.6.13 has the same
> problem.

Can you try it with rcupdate.maxbatch set to 10000 on the boot
command line?

FWIW, the open/close test problem goes away if I set maxbatch to
10000. I had introduced this limit some time ago to curtail the
effect long-running softirq handlers have on scheduling latencies,
which now conflicts with OOM avoidance requirements.

Thanks
Dipankar
^ permalink raw reply	[flat|nested] 48+ messages in thread
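[For readers wanting to repeat the experiment: `rcupdate.maxbatch` is passed on the kernel command line at boot. A hypothetical GRUB (legacy) stanza is shown below — the kernel image path and root device are illustrative, not taken from the thread.]

```
# /boot/grub/menu.lst -- paths and root device are examples only
title Linux 2.6.14-rc4 (rcu maxbatch raised)
    kernel /boot/vmlinuz-2.6.14-rc4 root=/dev/hda1 ro rcupdate.maxbatch=10000
```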
* Re: [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 12:36               ` Dipankar Sarma
@ 2005-10-17 13:28                 ` Eric Dumazet
  2005-10-17 13:33                   ` Dipankar Sarma
  2005-10-17 14:54                   ` Eric Dumazet
  0 siblings, 2 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 13:28 UTC (permalink / raw)
To: dipankar
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
>
>>Dipankar Sarma wrote:
>>
>>>On Mon, Oct 17, 2005 at 11:10:04AM +0200, Eric Dumazet wrote:
>>>
>>>Agreed. It is not designed to work that way, so there must be
>>>a bug somewhere and I am trying to track it down. It could very well
>>>be that at maxbatch=10 we are just queueing at a rate far too high
>>>compared to processing.
>>>
>>
>>I can freeze my test machine with a program that 'only' uses dentries,
>>no files.
>>
>>No message, no panic, but the machine becomes totally unresponsive
>>after a few seconds.
>>
>>Just grepping for call_rcu in the kernel sources gave me another
>>call_rcu() use reachable from syscalls. And yes, 2.6.13 has the same
>>problem.
>
>
> Can you try it with rcupdate.maxbatch set to 10000 on the boot
> command line?
>

Changing maxbatch from 10 to 10000 cures the problem.
Maybe we could initialize maxbatch to (10000000/HZ), considering no
current CPU is able to queue more than 10,000,000 items per second
into a list.

> FWIW, the open/close test problem goes away if I set maxbatch to
> 10000. I had introduced this limit some time ago to curtail the
> effect long-running softirq handlers have on scheduling latencies,
> which now conflicts with OOM avoidance requirements.

Yes, and OOM avoidance probably has a higher priority than latencies
in DoS situations...

Eric
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 13:28                 ` Eric Dumazet
@ 2005-10-17 13:33                   ` Dipankar Sarma
  0 siblings, 0 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 13:33 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jean Delvare, torvalds, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, Oct 17, 2005 at 03:28:22PM +0200, Eric Dumazet wrote:
> Dipankar Sarma wrote:
> >On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
> >
> >
> >Can you try it with rcupdate.maxbatch set to 10000 on the boot
> >command line?
> >
>
> Changing maxbatch from 10 to 10000 cures the problem.
> Maybe we could initialize maxbatch to (10000000/HZ), considering no
> current CPU is able to queue more than 10,000,000 items per second
> into a list.

I don't know; maybe I can look at more adaptive heuristics.

> >
> >FWIW, the open/close test problem goes away if I set maxbatch to
> >10000. I had introduced this limit some time ago to curtail the
> >effect long-running softirq handlers have on scheduling latencies,
> >which now conflicts with OOM avoidance requirements.
>
> Yes, and OOM avoidance probably has a higher priority than latencies
> in DoS situations...

Yes, one would think. But the audio guys would chew my head for this :)

Thanks
Dipankar
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RCU problem] was VFS: file-max limit 50044 reached
  2005-10-17 13:28                 ` Eric Dumazet
  2005-10-17 13:33                   ` Dipankar Sarma
@ 2005-10-17 14:54                   ` Eric Dumazet
  1 sibling, 0 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 14:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: dipankar, Jean Delvare, torvalds, Serge Belyshev, LKML,
	Andrew Morton, Manfred Spraul

Eric Dumazet wrote:
> Dipankar Sarma wrote:
>
>> On Mon, Oct 17, 2005 at 02:10:09PM +0200, Eric Dumazet wrote:
>>
>>> I can freeze my test machine with a program that 'only' uses
>>> dentries, no files.
>>>
>>> No message, no panic, but the machine becomes totally unresponsive
>>> after a few seconds.
>>>
>>> Just grepping for call_rcu in the kernel sources gave me another
>>> call_rcu() use reachable from syscalls. And yes, 2.6.13 has the
>>> same problem.
>>
>> Can you try it with rcupdate.maxbatch set to 10000 on the boot
>> command line?
>>
> Changing maxbatch from 10 to 10000 cures the problem.
> Maybe we could initialize maxbatch to (10000000/HZ), considering no
> current CPU is able to queue more than 10,000,000 items per second
> into a list.

Well... after 90 minutes of stress, I got an OOM even with
maxbatch=10000:

Out of Memory: Killed process 1759 (mysqld)

Maybe it is because on this HT machine all (timer and network)
interrupts are taken by CPU0. So if the user program is bound to CPU1,
maybe that CPU only performs syscalls and sees no RCU state change at
all.

Oct 17 18:24:25 localhost kernel: oom-killer: gfp_mask=0xd0, order=0
Oct 17 18:24:25 localhost kernel: Mem-info:
Oct 17 18:24:25 localhost kernel: DMA per-cpu:
Oct 17 18:24:25 localhost kernel: cpu 0 hot: low 2, high 6, batch 1 used:5
Oct 17 18:24:25 localhost kernel: cpu 0 cold: low 0, high 2, batch 1 used:1
Oct 17 18:24:25 localhost kernel: cpu 1 hot: low 2, high 6, batch 1 used:2
Oct 17 18:24:25 localhost kernel: cpu 1 cold: low 0, high 2, batch 1 used:0
Oct 17 18:24:25 localhost kernel: Normal per-cpu:
Oct 17 18:24:25 localhost kernel: cpu 0 hot: low 62, high 186, batch 31 used:168
Oct 17 18:24:25 localhost kernel: cpu 0 cold: low 0, high 62, batch 31 used:55
Oct 17 18:24:25 localhost kernel: cpu 1 hot: low 62, high 186, batch 31 used:95
Oct 17 18:24:25 localhost kernel: cpu 1 cold: low 0, high 62, batch 31 used:33
Oct 17 18:24:25 localhost kernel: HighMem per-cpu:
Oct 17 18:26:17 localhost kernel: cpu 0 hot: low 62, high 186, batch 31 used:166
Oct 17 18:26:17 localhost kernel: cpu 0 cold: low 0, high 62, batch 31 used:29
Oct 17 18:26:17 localhost kernel: cpu 1 hot: low 62, high 186, batch 31 used:176
Oct 17 18:26:17 localhost kernel: cpu 1 cold: low 0, high 62, batch 31 used:13
Oct 17 18:26:17 localhost kernel: Free pages:     1136620kB (1129392kB HighMem)
Oct 17 18:26:17 localhost kernel: Active:8040 inactive:3876 dirty:1 writeback:0 unstable:0 free:284155 slab:218548 mapped:8064 pagetables:130
Oct 17 18:26:17 localhost kernel: DMA free:3588kB min:68kB low:84kB high:100kB active:0kB inactive:0kB present:16384kB pages_scanned:246 all_unreclaimable? no
Oct 17 18:26:17 localhost kernel: lowmem_reserve[]: 0 880 2031
Oct 17 18:26:17 localhost kernel: Normal free:3640kB min:3756kB low:4692kB high:5632kB active:76kB inactive:24kB present:901120kB pages_scanned:8581 all_unreclaimable? no
Oct 17 18:26:17 localhost kernel: lowmem_reserve[]: 0 0 9215
Oct 17 18:26:17 localhost kernel: HighMem free:1129392kB min:512kB low:640kB high:768kB active:32084kB inactive:15480kB present:1179520kB pages_scanned:0 all_unreclaimable? no
Oct 17 18:26:17 localhost kernel: lowmem_reserve[]: 0 0 0
Oct 17 18:26:17 localhost kernel: DMA: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
Oct 17 18:26:17 localhost kernel: Normal: 0*4kB 1*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3640kB
Oct 17 18:26:17 localhost kernel: HighMem: 518*4kB 301*8kB 119*16kB 54*32kB 22*64kB 13*128kB 6*256kB 1*512kB 0*1024kB 1*2048kB 272*4096kB = 1129392kB
Oct 17 18:26:17 localhost kernel: Swap cache: add 0, delete 0, find 0/0, race 0+0
Oct 17 18:26:17 localhost kernel: Free swap  = 1012016kB
Oct 17 18:26:17 localhost kernel: Total swap = 1012016kB
Oct 17 18:26:17 localhost kernel: Free swap:       1012016kB
Oct 17 18:26:17 localhost kernel: 524256 pages of RAM
Oct 17 18:26:17 localhost kernel: 294880 pages of HIGHMEM
Oct 17 18:26:17 localhost kernel: 5472 reserved pages
Oct 17 18:26:17 localhost kernel: 11361 pages shared
Oct 17 18:26:18 localhost kernel: 0 pages swap cached
Oct 17 18:26:18 localhost kernel: 1 pages dirty
Oct 17 18:26:18 localhost kernel: 0 pages writeback
Oct 17 18:26:18 localhost kernel: 8064 pages mapped
Oct 17 18:26:18 localhost kernel: 218548 pages slab
Oct 17 18:26:18 localhost kernel: 130 pages pagetables
Oct 17 18:26:18 localhost kernel: Out of Memory: Killed process 1759 (mysqld).

Eric
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 10:32           ` Dipankar Sarma
  2005-10-17 12:10             ` [RCU problem] was " Eric Dumazet
@ 2005-10-17 15:42             ` Linus Torvalds
  2005-10-17 16:01               ` Eric Dumazet
  2005-10-17 16:20               ` Dipankar Sarma
  1 sibling, 2 replies; 48+ messages in thread
From: Linus Torvalds @ 2005-10-17 15:42 UTC (permalink / raw)
To: Dipankar Sarma
Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, 17 Oct 2005, Dipankar Sarma wrote:
>
> Agreed. It is not designed to work that way, so there must be
> a bug somewhere and I am trying to track it down. It could very well
> be that at maxbatch=10 we are just queueing at a rate far too high
> compared to processing.

That sounds sane. I suspect that the real fix for 2.6.14 might be to
update maxbatch to be much higher by default.

The thing is, that batching really is fundamentally wrong. If we have
a thousand things to free, we can't just free ten of them and leave
the 990 others to wait for next time. I realize people want real-time,
but if it's INCORRECT, then real-time isn't real-time.

I just checked: increasing "maxbatch" from 10 to 10000 does fix the
problem.

> This I am not sure, it is Linus' call. I am just trying to do the
> right thing - fix the real problem.

It sure looks like the batch limiter is the fundamental problem.
Instead of limiting the batching, we should likely try to avoid the
RCU lists getting huge in the first place - ie do the RCU callback
processing more often if the list is getting longer.

So I suspect that the _real_ fix is:

 - for 2.6.14: remove the batching limit (or just make it much higher
   for now)

 - post-14: work on making sure RCU callbacks are done in a more
   timely manner when the RCU queue gets long. This would involve
   TIF_RCUPENDING and whatever else to make sure that we have timely
   quiescent periods, and we do the RCU callback tasklet more often if
   the queue is long.

Hmm?

		Linus
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 15:42             ` Linus Torvalds
@ 2005-10-17 16:01               ` Eric Dumazet
  2005-10-17 16:16                 ` Linus Torvalds
                                   ` (2 more replies)
  2005-10-17 16:20               ` Dipankar Sarma
  1 sibling, 3 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 16:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dipankar Sarma, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

Linus Torvalds wrote:
> So I suspect that the _real_ fix is:
>
>  - for 2.6.14: remove the batching limit (or just make it much higher
>    for now)

I would just remove it. If the limit is wrong, we crash again. And the
realtime guys are already pissed off by batch=10000 anyway.

>  - post-14: work on making sure RCU callbacks are done in a more
>    timely manner when the RCU queue gets long. This would involve
>    TIF_RCUPENDING and whatever else to make sure that we have timely
>    quiescent periods, and we do the RCU callback tasklet more often if
>    the queue is long.

Absolutely. Keeping a count of (per-cpu) queued items is basically free
if it is kept in the cache line used by the list head, so 'queue length
on this cpu' is a cheap metric.

A 'realtime refinement' would be to use a different maxbatch limit
depending on the caller's priority: let a softirq thread have a lower
batch count than a regular user thread.

Eric
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 16:01               ` Eric Dumazet
@ 2005-10-17 16:16                 ` Linus Torvalds
  2005-10-17 16:29                   ` Dipankar Sarma
  2005-10-17 16:23                 ` Dipankar Sarma
  2005-10-17 16:31                 ` Lee Revell
  2 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2005-10-17 16:16 UTC (permalink / raw)
To: Eric Dumazet
Cc: Dipankar Sarma, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
> I would just remove it. If the limit is wrong, we crash again. And the
> realtime guys are already pissed off by batch=10000 anyway.

Normally I would too, but I'm still hoping I could do a 2.6.14 tonight.
I guess that's unreasonable (swtlb issues etc), but for now I just
committed the one-liner.

> Absolutely. Keeping a count of (per-cpu) queued items is basically free
> if it is kept in the cache line used by the list head, so 'queue length
> on this cpu' is a cheap metric.

Yes. I did something broken like that before Dipankar pointed me at
batching.

The only downside to TIF_RCUUPDATE is that those damn TIF flags are
per-architecture (probably largely unnecessarily, but while most
architectures don't care at all, others seem to have optimized their
layout so that they can test the work bits more efficiently). So it's a
matter of each architecture being updated with its TIF_xyz flag and its
work function.

Anybody willing to try? Dipankar apparently has a lot on his plate, and
this _should_ be fairly straightforward. Eric?

		Linus
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 16:16                 ` Linus Torvalds
@ 2005-10-17 16:29                   ` Dipankar Sarma
  2005-10-17 18:01                     ` Eric Dumazet
                                       ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 16:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
>
> > Absolutely. Keeping a count of (per-cpu) queued items is basically
> > free if it is kept in the cache line used by the list head, so
> > 'queue length on this cpu' is a cheap metric.
>
> The only downside to TIF_RCUUPDATE is that those damn TIF flags are
> per-architecture (probably largely unnecessarily, but while most
> architectures don't care at all, others seem to have optimized their
> layout so that they can test the work bits more efficiently). So it's a
> matter of each architecture being updated with its TIF_xyz flag and its
> work function.
>
> Anybody willing to try? Dipankar apparently has a lot on his plate, and
> this _should_ be fairly straightforward. Eric?

I *had*, when this hit me :) It was one of those spurt things. I am
going to look at this, but I think we will need to do it with some
careful benchmarking.

At the moment, however, I do have another concern - open/close taking
too much time, as I mentioned in an earlier email. It is nearly 4 times
slower than 2.6.13. So that is first up in my list of things to do at
the moment.

Thanks
Dipankar
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 16:29                   ` Dipankar Sarma
@ 2005-10-17 18:01                     ` Eric Dumazet
  2005-10-17 18:31                       ` Dipankar Sarma
                                         ` (2 more replies)
  2005-10-17 18:15                     ` Dipankar Sarma
  2005-10-17 18:40                     ` Linus Torvalds
  2 siblings, 3 replies; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 18:01 UTC (permalink / raw)
To: dipankar
Cc: Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

[-- Attachment #1: Type: text/plain, Size: 1482 bytes --]

Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
>
>>>Absolutely. Keeping a count of (per-cpu) queued items is basically
>>>free if it is kept in the cache line used by the list head, so
>>>'queue length on this cpu' is a cheap metric.
>>
>>The only downside to TIF_RCUUPDATE is that those damn TIF flags are
>>per-architecture (probably largely unnecessarily, but while most
>>architectures don't care at all, others seem to have optimized their
>>layout so that they can test the work bits more efficiently). So it's a
>>matter of each architecture being updated with its TIF_xyz flag and its
>>work function.
>>
>>Anybody willing to try? Dipankar apparently has a lot on his plate, and
>>this _should_ be fairly straightforward. Eric?
>
> I *had*, when this hit me :) It was one of those spurt things. I am
> going to look at this, but I think we will need to do it with some
> careful benchmarking.
>
> At the moment, however, I do have another concern - open/close taking
> too much time, as I mentioned in an earlier email. It is nearly 4 times
> slower than 2.6.13. So that is first up in my list of things to do at
> the moment.

<lazy_mode=ON>
Do we really need a TIF_RCUUPDATE flag, or could we just ask for a
resched?
</lazy_mode>

This patch only takes care of call_rcu(); I'm unsure of what can be
done inside call_rcu_bh().

The two stress programs don't hit OOM anymore with this patch applied
(even with maxbatch=10).

Eric

[-- Attachment #2: rcu_patch.1 --]
[-- Type: text/plain, Size: 1250 bytes --]

--- linux-2.6.14-rc4/kernel/rcupdate.c	2005-10-11 03:19:19.000000000 +0200
+++ linux-2.6.14-rc4-ed/kernel/rcupdate.c	2005-10-17 21:52:18.000000000 +0200
@@ -109,6 +109,10 @@
 	rdp = &__get_cpu_var(rcu_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
+
+	if (unlikely(++rdp->count > 10000))
+		set_need_resched();
+
 	local_irq_restore(flags);
 }
@@ -140,6 +144,12 @@
 	rdp = &__get_cpu_var(rcu_bh_data);
 	*rdp->nxttail = head;
 	rdp->nxttail = &head->next;
+	rdp->count++;
+/*
+ * Should we directly call rcu_do_batch() here ?
+ * if (unlikely(rdp->count > 10000))
+ *	rcu_do_batch(rdp);
+ */
 	local_irq_restore(flags);
 }
@@ -157,6 +167,7 @@
 		next = rdp->donelist = list->next;
 		list->func(list);
 		list = next;
+		rdp->count--;
 		if (++count >= maxbatch)
 			break;
 	}
--- linux-2.6.14-rc4/include/linux/rcupdate.h	2005-10-11 03:19:19.000000000 +0200
+++ linux-2.6.14-rc4-ed/include/linux/rcupdate.h	2005-10-17 21:02:25.000000000 +0200
@@ -94,6 +94,7 @@
 	long            batch;           /* Batch # for current RCU batch */
 	struct rcu_head *nxtlist;
 	struct rcu_head **nxttail;
+	long            count;           /* # of queued items */
 	struct rcu_head *curlist;
 	struct rcu_head **curtail;
 	struct rcu_head *donelist;

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 18:01                     ` Eric Dumazet
@ 2005-10-17 18:31                       ` Dipankar Sarma
  2005-10-17 19:00                         ` Linus Torvalds
  2005-10-17 18:37                       ` Linus Torvalds
  2005-10-17 22:59                       ` Paul E. McKenney
  2 siblings, 1 reply; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 18:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, Oct 17, 2005 at 08:01:21PM +0200, Eric Dumazet wrote:
> Dipankar Sarma wrote:
> >On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
> >
>
> <lazy_mode=ON>
> Do we really need a TIF_RCUUPDATE flag, or could we just ask for a
> resched?
> </lazy_mode>

I think the theory was that we have to process the callbacks, not
just force the grace period by setting need_resched. That is what
TIF_RCUUPDATE indicates - RCUs to process.

> This patch only takes care of call_rcu(); I'm unsure of what can be
> done inside call_rcu_bh().
>
> The two stress programs don't hit OOM anymore with this patch applied
> (even with maxbatch=10).

Hmm.. I am surprised that maxbatch=10 still allowed you to keep up
with a continuously queueing CPU. OK, I will look at this.

Thanks
Dipankar
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 18:31                       ` Dipankar Sarma
@ 2005-10-17 19:00                         ` Linus Torvalds
  0 siblings, 0 replies; 48+ messages in thread
From: Linus Torvalds @ 2005-10-17 19:00 UTC (permalink / raw)
To: Dipankar Sarma
Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1905 bytes --]

On Tue, 18 Oct 2005, Dipankar Sarma wrote:
> On Mon, Oct 17, 2005 at 08:01:21PM +0200, Eric Dumazet wrote:
> > Dipankar Sarma wrote:
> > >On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote:
> > >
> >
> > <lazy_mode=ON>
> > Do we really need a TIF_RCUUPDATE flag, or could we just ask for a
> > resched?
> > </lazy_mode>
>
> I think the theory was that we have to process the callbacks, not
> just force the grace period by setting need_resched. That is what
> TIF_RCUUPDATE indicates - RCUs to process.

I'm having second thoughts about that, since the problem (in SMP) is
that even if the currently active process tries to handle RCU events
more proactively rather than just setting the grace period, in order
to do that you'd still need to wait for the other CPUs to have their
quiescent phase. So the RCU queues can grow long, if only because the
other CPUs won't necessarily do the same.

So we probably cannot throttle RCU queues down, and they will
inevitably have to be able to grow pretty long.

> Hmm.. I am surprised that maxbatch=10 still allowed you to keep up
> with a continuously queueing CPU. OK, I will look at this.

I think it's just because it ends up rescheduling a lot, and thus
waking up softirqd. The RCU thing is done as a tasklet, which means
that

 - it starts out as a "synchronous" softirq event, at which point it
   gets called at most X times (MAX_SOFTIRQ_RESTART, defaults to 10)

 - after that, we end up saying "uhhuh, this is using too much softirq
   time" and instead just run the softirq as a kernel thread.

 - setting TIF_NEEDRESCHED whenever the RCU lists are long will keep
   on rescheduling to the softirq thread much more aggressively.

See __do_softirq() for some of this softirq (and thus tasklet)
handling. I suspect it's _very_ inefficient, but maybe the bad case
triggers so seldom that we don't really need to care.

		Linus
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 18:01                     ` Eric Dumazet
  2005-10-17 18:31                       ` Dipankar Sarma
@ 2005-10-17 18:37                       ` Linus Torvalds
  2005-10-17 19:12                         ` Eric Dumazet
  2005-10-17 22:59                       ` Paul E. McKenney
  2 siblings, 1 reply; 48+ messages in thread
From: Linus Torvalds @ 2005-10-17 18:37 UTC (permalink / raw)
To: Eric Dumazet
Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
> <lazy_mode=ON>
> Do we really need a TIF_RCUUPDATE flag, or could we just ask for a
> resched?
> </lazy_mode>

Hmm.. Your patch looks very much like one I tried already, the big
difference being that I just cleared the count when doing the RCU
callback. That was because I hadn't realized the importance of the
maxbatch thing (so it didn't work for me, like it did for you).

Still - the actual RCU callback will only be called at the next timer
tick or whatever, as far as I can tell, so the first time you'll still
have a _long_ RCU queue (and thus bad latency).

I guess that's inevitable - and TIF_RCUUPDATE wouldn't even help,
because we still need to wait for the _other_ CPUs to get to their RCU
quiescent event.

However, that leaves us with the nasty situation that we'll be very
inefficient: we'll do "maxbatch" RCU entries, then return, and then
force a whole reschedule. That just can't be good.

How about instead of depending on "maxbatch", we'd depend on
"need_resched()"? Maybe the "maxbatch" could be a _minbatch_ thing,
and then once we've done the minimum amount we _need_ to do (or
emptied the RCU queue) we start honoring need_resched(), and return
early if we do?

That, together with your patch, should work, without causing ludicrous
"reschedule every ten system calls" behaviour..

Hmm?

		Linus
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17 18:37                       ` Linus Torvalds
@ 2005-10-17 19:12                         ` Eric Dumazet
  2005-10-17 19:30                           ` Linus Torvalds
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2005-10-17 19:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton,
	Manfred Spraul

Linus Torvalds wrote:
>
> On Mon, 17 Oct 2005, Eric Dumazet wrote:
>
>><lazy_mode=ON>
>>Do we really need a TIF_RCUUPDATE flag, or could we just ask for a
>>resched?
>></lazy_mode>
>
> However, that leaves us with the nasty situation that we'll be very
> inefficient: we'll do "maxbatch" RCU entries, then return, and then
> force a whole reschedule. That just can't be good.

That's strange, because on my tests it seems that I don't have one
reschedule per 'maxbatch' items. Doing 'grep filp /proc/slabinfo' it
seems I have one 'schedule' and then the filp count goes back to 1000.
vmstat shows about 150 context switches per second.

(This machine does 1,000,000 pairs of open/close in 4.88 seconds.)

oprofile data shows very little schedule overhead:

CPU: P4 / Xeon with 2 hyper-threads, speed 1993.83 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000

samples  %        symbol name
132578   11.3301  path_lookup
104788    8.9551  __d_lookup
 85220    7.2829  link_path_walk
 63013    5.3851  sysenter_past_esp
 53287    4.5539  _atomic_dec_and_lock
 45825    3.9162  chrdev_open
 43105    3.6837  get_unused_fd
 39948    3.4139  kmem_cache_alloc
 38308    3.2738  strncpy_from_user
 35738    3.0542  rcu_do_batch
 31850    2.7219  __link_path_walk
 31355    2.6796  get_empty_filp
 25941    2.2169  kmem_cache_free
 24455    2.0899  __fput
 24422    2.0871  sys_close
 19814    1.6933  filp_dtor
 19616    1.6764  free_block
 19000    1.6237  open_namei
 18214    1.5566  fput
 15991    1.3666  fd_install
 14394    1.2301  file_kill
 14365    1.2276  call_rcu
 14338    1.2253  kref_put
 13679    1.1690  file_move
 13646    1.1662  schedule
 13456    1.1499  getname
 13019    1.1126  kref_get

> How about instead of depending on "maxbatch", we'd depend on
> "need_resched()"? Maybe the "maxbatch" could be a _minbatch_ thing,
> and then once we've done the minimum amount we _need_ to do (or
> emptied the RCU queue) we start honoring need_resched(), and return
> early if we do?
>
> That, together with your patch, should work, without causing ludicrous
> "reschedule every ten system calls" behaviour..
>
> Hmm?
>
> Linus
>
^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 19:12 ` Eric Dumazet @ 2005-10-17 19:30 ` Linus Torvalds 2005-10-17 19:39 ` Eric Dumazet 0 siblings, 1 reply; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 19:30 UTC (permalink / raw) To: Eric Dumazet Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 17 Oct 2005, Eric Dumazet wrote: > > Thats strange, because on my tests it seems that I dont have one reschedule > for 'maxbatch' items. Doing 'grep filp /proc/slabinfo' it seems I have one > 'schedule' then filp count goes back to 1000. Hmm. I think you're right, but for all the wrong reasons. "maxbatch" ends up not actually having any real effect in the end: after the tasklet ends up running in softirqd, softirqd will actually keep on calling the tasklet code until it doesn't get rescheduled any more ;) So it will do "maxbatch" RCU entries, reschedule itself, return, and immediately get called again. Heh. The _good_ news is that since it ends up running in softirqd (after the first ten times - the softirq code in kernel/softirq.c will start off calling it ten times _first_), it can be scheduled away, so it actually ends up helping latency. Which means that we actually end up doing exactly the right thing, although for what appears to be the wrong reasons (or very lucky ones). The _bad_ news is that softirqd is running at nice +19, so I suspect that with some unlucky patterns it's probably pretty easy to make sure that ksoftirqd doesn't actually run very often at all! Gaah. So close, yet so far. I'm _almost_ willing to just undo my "make maxbatch huge" patch, and apply your patch, because now that I see how it all happens to work together I'm convinced that it _almost_ works. Even if it seems to be mostly by luck(*) rather than anything else. Linus (*) Not strictly true. 
It may not be by design of the RCU code itself, but it's definitely by design of the softirqs being designed to be robust and have good latency behaviour. So it does work by design, but it works by softirq design rather than RCU design ;)

^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 19:30 ` Linus Torvalds @ 2005-10-17 19:39 ` Eric Dumazet 2005-10-17 20:14 ` Linus Torvalds 0 siblings, 1 reply; 48+ messages in thread From: Eric Dumazet @ 2005-10-17 19:39 UTC (permalink / raw) To: Linus Torvalds Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul Linus Torvalds a écrit : > > On Mon, 17 Oct 2005, Eric Dumazet wrote: > >>Thats strange, because on my tests it seems that I dont have one reschedule >>for 'maxbatch' items. Doing 'grep filp /proc/slabinfo' it seems I have one >>'schedule' then filp count goes back to 1000. > > > Hmm. > > I think you're right, but for all the wrong reasons. > > "maxbatch" ends up not actually having any real effect in the end: after > the tasklet ends up running in softirqd, softirqd will actually keep on > calling the tasklet code until it doesn't get rescheduled any more ;) > > So it will do "maxbatch" RCU entries, reschedule itself, return, and > immediately get called again. > > Heh. > > The _good_ news is that since it ends up running in softirqd (after the > first ten times - the softirq code in kernel/softirq.c will start off > calling it ten times _first_), it can be scheduled away, so it actually > ends up helping latency. > > Which means that we actually end up doing exactly the right thing, > although for what appears to be the wrong reasons (or very lucky ones). > > The _bad_ news is that softirqd is running at nice +19, so I suspect that > with some unlucky patterns it's probably pretty easy to make sure that > ksoftirqd doesn't actually run very often at all! > > Gaah. So close, yet so far. I'm _almost_ willing to just undo my "make > maxbatch huge" patch, and apply your patch, because now that I see how it > all happens to work together I'm convinced that it _almost_ works. Even if > it seems to be mostly by luck(*) rather than anything else. > :) What about call_rcu_bh() which I left unchanged ? 
At least one of my production machines cannot live very long unless I have maxbatch = 300, because of an insanely large tcp route cache (and one of its CPUs almost filled by softirq NIC processing)

> Linus
>
> (*) Not strictly true. It may not be by design of the RCU code itself, but
> it's definitely by design of the softirqs being designed to be robust and
> have good latency behaviour. So it does work by design, but it works by
> softirq design rather than RCU design ;)

^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 19:39 ` Eric Dumazet @ 2005-10-17 20:14 ` Linus Torvalds 2005-10-17 20:25 ` Christopher Friesen ` (2 more replies) 0 siblings, 3 replies; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 20:14 UTC (permalink / raw) To: Eric Dumazet Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 17 Oct 2005, Eric Dumazet wrote: > > What about call_rcu_bh() which I left unchanged ? At least one of my > production machine cannot live very long unless I have maxbatch = 300, because > of an insane large tcp route cache (and one of its CPU almost filled by > softirq NIC processing) I think we'll have to release 2.6.14 with maxbatch at the high value (10000). Yes, it may screw up some latency stuff, but quite frankly, even with your patch and even ignoring the call_rcu_bh case, I'm convinced you can easily get into the situation where softirqd just doesn't run soon enough. But at least I think I understand _why_ rcu processing was delayed. I think a real fix might have to involve more explicit knowledge of tasklet behaviour and softirq interaction. Linus ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:14 ` Linus Torvalds @ 2005-10-17 20:25 ` Christopher Friesen 2005-10-17 20:24 ` Dipankar Sarma 2005-10-17 20:38 ` Linus Torvalds 2005-10-17 20:33 ` Dipankar Sarma 2005-10-17 22:40 ` Linus Torvalds 2 siblings, 2 replies; 48+ messages in thread From: Christopher Friesen @ 2005-10-17 20:25 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul Linus Torvalds wrote: > Yes, it may screw up some latency stuff, but quite frankly, even with your > patch and even ignoring the call_rcu_bh case, I'm convinced you can easily > get into the situation where softirqd just doesn't run soon enough. > > But at least I think I understand _why_ rcu processing was delayed. Could this be related to the "rename14 LTP test with /tmp as tmpfs and HIGHMEM causes OOM-killer invocation due to zone normal exhaustion" issue? Chris ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:25 ` Christopher Friesen @ 2005-10-17 20:24 ` Dipankar Sarma 2005-10-18 15:55 ` Christopher Friesen 2005-10-17 20:38 ` Linus Torvalds 1 sibling, 1 reply; 48+ messages in thread From: Dipankar Sarma @ 2005-10-17 20:24 UTC (permalink / raw) To: Christopher Friesen Cc: Linus Torvalds, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, Oct 17, 2005 at 02:25:17PM -0600, Christopher Friesen wrote: > Linus Torvalds wrote: > > >Yes, it may screw up some latency stuff, but quite frankly, even with your > >patch and even ignoring the call_rcu_bh case, I'm convinced you can easily > >get into the situation where softirqd just doesn't run soon enough. > > > >But at least I think I understand _why_ rcu processing was delayed. > > Could this be related to the "rename14 LTP test with /tmp as tmpfs and > HIGHMEM causes OOM-killer invocation due to zone normal exhaustion" issue? Could very well be. Chris, could you please try booting with rcupdate.maxbatch=10000 and see if the problem goes away ? Thanks Dipankar ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:24 ` Dipankar Sarma @ 2005-10-18 15:55 ` Christopher Friesen 0 siblings, 0 replies; 48+ messages in thread From: Christopher Friesen @ 2005-10-18 15:55 UTC (permalink / raw) To: dipankar Cc: Linus Torvalds, Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul Dipankar Sarma wrote: > On Mon, Oct 17, 2005 at 02:25:17PM -0600, Christopher Friesen wrote: >>Could this be related to the "rename14 LTP test with /tmp as tmpfs and >>HIGHMEM causes OOM-killer invocation due to zone normal exhaustion" issue? > Could very well be. Chris, could you please try booting > with rcupdate.maxbatch=10000 and see if the problem goes away ? And sure enough, that fixes it. The dcache slab usage maxes out at around 11MB rather than consuming all of zone normal. Is there any downside to this option? Chris ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:25 ` Christopher Friesen 2005-10-17 20:24 ` Dipankar Sarma @ 2005-10-17 20:38 ` Linus Torvalds 1 sibling, 0 replies; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 20:38 UTC (permalink / raw) To: Christopher Friesen Cc: Eric Dumazet, dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 17 Oct 2005, Christopher Friesen wrote: > > Could this be related to the "rename14 LTP test with /tmp as tmpfs and HIGHMEM > causes OOM-killer invocation due to zone normal exhaustion" issue? Yes. You can try the current git tree, or just change "maxbatch" from 10 to 10000 in your own tree, and see if it makes a difference. I would not be surprised at all if this turns out to be the exact same issue, for the exact same reason. Eric's patch is also likely to fix it (if the "maxbatch" change fixes it), since I suspect that under _practical_ load Eric's patch works fine. The advantage of Eric's patch is that it shouldn't have any latency downsides, so Eric's is in many ways preferable to just increasing maxbatch. I just can't convince myself that it's really always going to fix the problem. If somebody else can, holler. Linus ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:14 ` Linus Torvalds 2005-10-17 20:25 ` Christopher Friesen @ 2005-10-17 20:33 ` Dipankar Sarma 2005-10-17 22:40 ` Linus Torvalds 2 siblings, 0 replies; 48+ messages in thread From: Dipankar Sarma @ 2005-10-17 20:33 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, Oct 17, 2005 at 01:14:20PM -0700, Linus Torvalds wrote: > > > On Mon, 17 Oct 2005, Eric Dumazet wrote: > > > > What about call_rcu_bh() which I left unchanged ? At least one of my > > production machine cannot live very long unless I have maxbatch = 300, because > > of an insane large tcp route cache (and one of its CPU almost filled by > > softirq NIC processing) > > I think we'll have to release 2.6.14 with maxbatch at the high value > (10000). Is 10000 enough ? Eric seemed to find a problem even with this after 90 minutes ? > Yes, it may screw up some latency stuff, but quite frankly, even with your > patch and even ignoring the call_rcu_bh case, I'm convinced you can easily > get into the situation where softirqd just doesn't run soon enough. > > But at least I think I understand _why_ rcu processing was delayed. > > I think a real fix might have to involve more explicit knowledge of > tasklet behaviour and softirq interaction. Agreed. I am now looking at characterizing the corner cases that can get us into trouble and checking what pattern of processing is appropriate to cover them all. It will take some time to sort this out making sure that it satisfies most requirements reasonably. Thanks Dipankar ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 20:14 ` Linus Torvalds 2005-10-17 20:25 ` Christopher Friesen 2005-10-17 20:33 ` Dipankar Sarma @ 2005-10-17 22:40 ` Linus Torvalds 2 siblings, 0 replies; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 22:40 UTC (permalink / raw) To: Eric Dumazet Cc: dipankar, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 17 Oct 2005, Linus Torvalds wrote: > On Mon, 17 Oct 2005, Eric Dumazet wrote: > > > > What about call_rcu_bh() which I left unchanged ? At least one of my > > production machine cannot live very long unless I have maxbatch = 300, because > > of an insane large tcp route cache (and one of its CPU almost filled by > > softirq NIC processing) > > I think we'll have to release 2.6.14 with maxbatch at the high value > (10000). Btw, I'm going to apply your patch in _addition_ to the bigger maxbatch value. It might help latency a bit, but more importantly, on one of my machines (but only one - it probably depends on how much memory you have etc), I can re-create the out-of-file-descriptors thing even with a maxbatch of a million. Probably what happens is that the rcu callbacks just grow fast enough without any quiescent period that the maxbatch thing just never matters: we simply run out of file descriptors because we haven't even gotten around to trying to free them yet. I'm compiling with your patch on that machine to verify that it does actually help keep the queues down. Just doing a while : ; do cat /proc/slabinfo | grep filp; sleep 1; done while running the test programs gives some alarming numbers as-is. Your patch keeps the numbers _much_ more stable. Regardless, keeping track of the number of rcu callback events we have will almost inevitably be part of whatever future strategy we take, so your patch is definitely a step in the right direction, even if we have to tweak it later. Linus ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 18:01 ` Eric Dumazet 2005-10-17 18:31 ` Dipankar Sarma 2005-10-17 18:37 ` Linus Torvalds @ 2005-10-17 22:59 ` Paul E. McKenney 2005-10-18 9:46 ` Eric Dumazet 2 siblings, 1 reply; 48+ messages in thread From: Paul E. McKenney @ 2005-10-17 22:59 UTC (permalink / raw) To: Eric Dumazet Cc: dipankar, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, Oct 17, 2005 at 08:01:21PM +0200, Eric Dumazet wrote: > Dipankar Sarma a écrit : > >On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote: > > > >>>Absolutely. Keeping a count of (percpu) queued items is basically free > >>>if kept > >>>in the cache line used by list head, so the 'queue length on this cpu' > >>>is a > >>>cheap metric. > >> > >>The only downside to TIF_RCUUPDATE is that those damn TIF-flags are > >>per-architecture (probably largely unnecessary, but while most > >>architectures don't care at all, others seem to have optimized their > >>layout so that they can test the work bits more efficiently). So it's a > >>matter of each architecture being updated with its TIF_xyz flag and their > >>work function. > >> > >>Anybody willing to try? Dipankar apparently has a lot on his plate, this > >>_should_ be fairly straightforward. Eric? > > > > > >I *had*, when this hit me :) It was one those spurt things. I am going to > >look at this, but I think we will need to do this with some careful > >benchmarking. > > > >At the moment however I do have another concern - open/close taking too > >much time as I mentioned in an earlier email. It is nearly 4 times > >slower than 2.6.13. So, that is first up in my list of things to > >do at the moment. > > > > <lazy_mode=ON> > Do we really need a TIF_RCUUPDATE flag, or could we just ask for a resched ? 
> </lazy_mode> > > This patch only take care of call_rcu(), I'm unsure of what can be done > inside call_rcu_bh() > > The two stress program dont hit OOM anymore with this patch applied (even > with maxbatch=10) Keeping the per-CPU count of queued callbacks seems eminently reasonable to me, as does the set_need_resched(). But the proposed (but fortunately commented out) call of rcu_do_batch() from call_rcu() does have deadlock issues. > Eric > > --- linux-2.6.14-rc4/kernel/rcupdate.c 2005-10-11 03:19:19.000000000 +0200 > +++ linux-2.6.14-rc4-ed/kernel/rcupdate.c 2005-10-17 21:52:18.000000000 +0200 > @@ -109,6 +109,10 @@ > rdp = &__get_cpu_var(rcu_data); > *rdp->nxttail = head; > rdp->nxttail = &head->next; > + > + if (unlikely(++rdp->count > 10000)) > + set_need_resched(); > + > local_irq_restore(flags); > } > > @@ -140,6 +144,12 @@ > rdp = &__get_cpu_var(rcu_bh_data); > *rdp->nxttail = head; > rdp->nxttail = &head->next; > + rdp->count++; Really need an "rdp->count++" in call_rcu_bh() as well, otherwise the _bh struct rcu_data will have a steadily decreasing count field. Strictly speaking, this is harmless, since call_rcu_bh() cheerfully ignores this field, but this situation is bound to cause severe confusion at some point. > +/* > + * Should we directly call rcu_do_batch() here ? > + * if (unlikely(rdp->count > 10000)) > + * rcu_do_batch(rdp); > + */ Good thing that the above is commented out! ;-) Doing this can result in self-deadlock, for example with the following: spin_lock(&mylock); /* do some stuff. */ call_rcu(&p->rcu_head, my_rcu_callback); /* do some more stuff. */ spin_unlock(&mylock); void my_rcu_callback(struct rcu_head *p) { spin_lock(&mylock); /* self-deadlock via call_rcu() via rcu_do_batch()!!! 
*/ spin_unlock(&mylock); } Thanx, Paul > } > > @@ -157,6 +167,7 @@ > next = rdp->donelist = list->next; > list->func(list); > list = next; > + rdp->count--; > if (++count >= maxbatch) > break; > } > --- linux-2.6.14-rc4/include/linux/rcupdate.h 2005-10-11 03:19:19.000000000 +0200 > +++ linux-2.6.14-rc4-ed/include/linux/rcupdate.h 2005-10-17 21:02:25.000000000 +0200 > @@ -94,6 +94,7 @@ > long batch; /* Batch # for current RCU batch */ > struct rcu_head *nxtlist; > struct rcu_head **nxttail; > + long count; /* # of queued items */ > struct rcu_head *curlist; > struct rcu_head **curtail; > struct rcu_head *donelist; ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
2005-10-17 22:59 ` Paul E. McKenney
@ 2005-10-18 9:46 ` Eric Dumazet
2005-10-18 16:22 ` Paul E. McKenney
0 siblings, 1 reply; 48+ messages in thread
From: Eric Dumazet @ 2005-10-18 9:46 UTC (permalink / raw)
To: paulmck
Cc: dipankar, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

Paul E. McKenney a écrit :
>
>>+/*
>>+ * Should we directly call rcu_do_batch() here ?
>>+ * if (unlikely(rdp->count > 10000))
>>+ * rcu_do_batch(rdp);
>>+ */
>
> Good thing that the above is commented out! ;-)
>
> Doing this can result in self-deadlock, for example with the following:
>
> spin_lock(&mylock);
> /* do some stuff. */
> call_rcu(&p->rcu_head, my_rcu_callback);
> /* do some more stuff. */
> spin_unlock(&mylock);
>
> void my_rcu_callback(struct rcu_head *p)
> {
> 	spin_lock(&mylock);
> 	/* self-deadlock via call_rcu() via rcu_do_batch()!!! */
> 	spin_unlock(&mylock);
> }
>
> Thanx, Paul

Thanks Paul for reminding us that call_rcu() should never call the callback function, as very well documented in Documentation/RCU/UP.txt (Example 3: Death by Deadlock)

But is the same true for call_rcu_bh() ?

I intentionally wrote the comment to remind readers that a low maxbatch can trigger OOM in case a CPU is filled by some kind of DoS (network IRQ flood for example, targeting the IP dst cache)

To solve this problem, maybe we could add a requirement to call_rcu_bh() callback functions: if they have to lock a spinlock, only use a spin_trylock() and make them return a status (0: successful callback, 1: please requeue me)

As most callback functions just kfree() some memory, most OOM cases would be cleared.

int my_rcu_callback(struct rcu_head *p)
{
	if (!spin_trylock(&mylock))
		return 1; /* please call me later */
	/* do something here */
	...
	spin_unlock(&mylock);
	return 0;
}

(Changes to rcu_do_batch() are left as an exercise :) )

Eric

^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-18 9:46 ` Eric Dumazet @ 2005-10-18 16:22 ` Paul E. McKenney 0 siblings, 0 replies; 48+ messages in thread From: Paul E. McKenney @ 2005-10-18 16:22 UTC (permalink / raw) To: Eric Dumazet Cc: dipankar, Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Tue, Oct 18, 2005 at 11:46:30AM +0200, Eric Dumazet wrote: > Paul E. McKenney a écrit : > > > > > >>+/* > >>+ * Should we directly call rcu_do_batch() here ? > >>+ * if (unlikely(rdp->count > 10000)) > >>+ * rcu_do_batch(rdp); > >>+ */ > > > > > >Good thing that the above is commented out! ;-) > > > >Doing this can result in self-deadlock, for example with the following: > > > > spin_lock(&mylock); > > /* do some stuff. */ > > call_rcu(&p->rcu_head, my_rcu_callback); > > /* do some more stuff. */ > > spin_unlock(&mylock); > > > >void my_rcu_callback(struct rcu_head *p) > >{ > > spin_lock(&mylock); > > /* self-deadlock via call_rcu() via rcu_do_batch()!!! */ > > spin_unlock(&mylock); > >} > > > > > > Thanx, Paul > > Thanks Paul for reminding us that call_rcu() should not ever call the > callback function, as very well documented in Documentation/RCU/UP.txt > (Example 3: Death by Deadlock) > > But is the same true for call_rcu_bh() ? Yes, same rules for this aspect of call_rcu_bh() and call_rcu(). > I intentionally wrote the comment to remind readers that a low maxbatch can > trigger OOM in case a CPU is filled by some kind of DOS (network IRQ flood > for example, targeting the IP dst cache) > > To solve this problem, may be we could add a requirement to > call_rcu_bh/callback functions : If they have to lock a spinlock, only use > a spin_trylock() and make them returns a status (0 : sucessfull callback, > 1: please requeue me) > > As most callback functions just kfree() some memory, most of OOM would be > cleared. 
> int my_rcu_callback(struct rcu_head *p)
> {
> 	if (!spin_trylock(&mylock))
> 		return 1; /* please call me later */
> 	/* do something here */
> 	...
> 	spin_unlock(&mylock);
> 	return 0;
> }
>
> (Changes to rcu_do_batch() are left as an exercise :) )

Another approach that would keep the current easier-to-use semantics would be to schedule a tasklet or workqueue to process the callbacks in a safe context.

Thanx, Paul

^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 16:29 ` Dipankar Sarma 2005-10-17 18:01 ` Eric Dumazet @ 2005-10-17 18:15 ` Dipankar Sarma 2005-10-17 18:40 ` Linus Torvalds 2 siblings, 0 replies; 48+ messages in thread From: Dipankar Sarma @ 2005-10-17 18:15 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, Oct 17, 2005 at 09:59:30PM +0530, Dipankar Sarma wrote: > On Mon, Oct 17, 2005 at 09:16:25AM -0700, Linus Torvalds wrote: > > At the moment however I do have another concern - open/close taking too > much time as I mentioned in an earlier email. It is nearly 4 times > slower than 2.6.13. So, that is first up in my list of things to > do at the moment. Please ignore this. This is a big Doh! slab debugging snuck into my config file because I was trying to track down the "bad page state" problem again. Without it, open/close in 2.6.14-rc1 is just as fast as 2.6.13 - ~3 microseconds per pair. Thanks Dipankar ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 16:29 ` Dipankar Sarma 2005-10-17 18:01 ` Eric Dumazet 2005-10-17 18:15 ` Dipankar Sarma @ 2005-10-17 18:40 ` Linus Torvalds 2 siblings, 0 replies; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 18:40 UTC (permalink / raw) To: Dipankar Sarma Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 17 Oct 2005, Dipankar Sarma wrote: > > At the moment however I do have another concern - open/close taking too > much time as I mentioned in an earlier email. It is nearly 4 times > slower than 2.6.13. So, that is first up in my list of things to > do at the moment. It's not slower for me. For me, lmbench shows open/close as being pretty stable at least since 2.6.12. Are you sure that your dentry cache tests haven't just filled up the dentry lists so much that when you compare open/close performance after the dentry tests, they seem much slower than your numbers from before? If you run something that fills up the dentry cache, open/close will be slower just because the open part will have to traverse longer hash chains. Linus ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 16:01 ` Eric Dumazet 2005-10-17 16:16 ` Linus Torvalds @ 2005-10-17 16:23 ` Dipankar Sarma 2005-10-17 16:31 ` Lee Revell 2 siblings, 0 replies; 48+ messages in thread From: Dipankar Sarma @ 2005-10-17 16:23 UTC (permalink / raw) To: Eric Dumazet Cc: Linus Torvalds, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, Oct 17, 2005 at 06:01:31PM +0200, Eric Dumazet wrote: > Linus Torvalds a écrit : > > > > > - post-14: work on making sure rcu callbacks are done in a more timely > > manner when the rcu queue gets long. This would involve TIF_RCUPENDING > > and whatever else to make sure that we have timely quiescent periods, > > and we do the RCU callback tasklet more often if the queue is long. > > > > Absolutely. Keeping a count of (percpu) queued items is basically free if > kept in the cache line used by list head, so the 'queue length on this cpu' > is a cheap metric. Or 'sudden increase in queue length on this cpu' :) > A 'realtime refinement' would be to use a different maxbatch limit > depending on the caller's priority : Let a softirq thread have a lower > batch count than a regular user thread. Yes, would be interesting. Thanks Dipankar ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 16:01 ` Eric Dumazet 2005-10-17 16:16 ` Linus Torvalds 2005-10-17 16:23 ` Dipankar Sarma @ 2005-10-17 16:31 ` Lee Revell 2 siblings, 0 replies; 48+ messages in thread From: Lee Revell @ 2005-10-17 16:31 UTC (permalink / raw) To: Eric Dumazet Cc: Linus Torvalds, Dipankar Sarma, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul On Mon, 2005-10-17 at 18:01 +0200, Eric Dumazet wrote: > A 'realtime refinement' would be to use a different maxbatch limit depending > on the caller's priority : Let a softirq thread have a lower batch count than > a regular user thread. Or just make the whole thing preemptible like in the -rt tree and forget about it. Lee ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
2005-10-17 15:42 ` Linus Torvalds
2005-10-17 16:01 ` Eric Dumazet
@ 2005-10-17 16:20 ` Dipankar Sarma
1 sibling, 0 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 16:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Eric Dumazet, Jean Delvare, Serge Belyshev, LKML, Andrew Morton, Manfred Spraul

On Mon, Oct 17, 2005 at 08:42:05AM -0700, Linus Torvalds wrote:
> On Mon, 17 Oct 2005, Dipankar Sarma wrote:
> >
> > This I am not sure, it is Linus' call. I am just trying to do the
> > right thing - fix the real problem.
>
> It sure looks like the batch limiter is the fundamental problem.
>
> Instead of limiting the batching, we should likely try to avoid the RCU
> lists getting huge in the first place - ie do the RCU callback processing
> more often if the list is getting longer.
>
> So I suspect that the _real_ fix is:
>
> - for 2.6.14: remove the batching limit (or just make it much higher for
> now)

You can remove the batching limit by making maxbatch = 0 by default. Just a one-line patch.

> - post-14: work on making sure rcu callbacks are done in a more timely
> manner when the rcu queue gets long. This would involve TIF_RCUPENDING
> and whatever else to make sure that we have timely quiescent periods,
> and we do the RCU callback tasklet more often if the queue is long.

Yes, I am already looking at this. There are a number of approaches to this, including adaptive algorithms to cater to naughty corner cases and/or adding different ways to handle RCU as in tree. I hope to experiment with these incrementally after 2.6.14 over a period of time and see what works best for most people.

Thanks
Dipankar

^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-16 16:23 ` Dipankar Sarma 2005-10-16 18:51 ` Serge Belyshev @ 2005-10-17 2:34 ` Linus Torvalds 2005-10-17 3:54 ` Roland Dreier 2005-10-17 11:54 ` Dipankar Sarma 1 sibling, 2 replies; 48+ messages in thread From: Linus Torvalds @ 2005-10-17 2:34 UTC (permalink / raw) To: Dipankar Sarma Cc: Serge Belyshev, linux-kernel, khali, Andrew Morton, Manfred Spraul On Sun, 16 Oct 2005, Dipankar Sarma wrote: > > Linus, I don't think this has anything to do with RCU grace periods > like we discussed previously. I measured on my 3.6GHz x86_64 and > found that open()/close() pair on /dev/null takes about 45500 > cycles or 12 microseconds. [Does that sound resonable?]. That sounds very slow. I can do a million open/close pairs in 4 seconds on a 2.5GHz G5. Maybe you tested a cold-cache case? Of course, a P4 is just about the worst architecture to test system call performance on, so ... Still, that's 4us. I'm pretty sure some machines will do it in 3 or less (in fact, lmbench says 3.17us on another machine of mine for open/close). Still, that's only four times faster, so 2 timer ticks should be less than 5000 file structs to free. I suspect this patch is worth it for the 2.6.14 timeframe, but I'll wait for confirmation. In fact, for 2.6.14, I'd almost do an even more minimal one. I agree with your changing the file counter to an atomic, but I'd rather keep that change for later. Serge, does this alternate patch work for you? [ cache constructors and destructors are _stupid_. They act exactly the wrong way from a cache standpoint. 
] Linus --- diff --git a/fs/dcache.c b/fs/dcache.c index fb10386..40aaa90 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1731,7 +1731,7 @@ void __init vfs_caches_init(unsigned lon SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, - SLAB_HWCACHE_ALIGN|SLAB_PANIC, filp_ctor, filp_dtor); + SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL); dcache_init(mempages); inode_init(mempages); diff --git a/fs/file_table.c b/fs/file_table.c index 86ec8ae..fbda480 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -39,21 +39,9 @@ void filp_ctor(void * objp, struct kmem_ { if ((cflags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) == SLAB_CTOR_CONSTRUCTOR) { - unsigned long flags; - spin_lock_irqsave(&filp_count_lock, flags); - files_stat.nr_files++; - spin_unlock_irqrestore(&filp_count_lock, flags); } } -void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags) -{ - unsigned long flags; - spin_lock_irqsave(&filp_count_lock, flags); - files_stat.nr_files--; - spin_unlock_irqrestore(&filp_count_lock, flags); -} - static inline void file_free_rcu(struct rcu_head *head) { struct file *f = container_of(head, struct file, f_rcuhead); @@ -62,6 +50,13 @@ static inline void file_free_rcu(struct static inline void file_free(struct file *f) { + unsigned long flags; + + /* Stupid. Use atomics */ + spin_lock_irqsave(&filp_count_lock, flags); + files_stat.nr_files--; + spin_unlock_irqrestore(&filp_count_lock, flags); + call_rcu(&f->f_rcuhead, file_free_rcu); } @@ -73,6 +68,7 @@ struct file *get_empty_filp(void) { static int old_max; struct file * f; + unsigned long flags; /* * Privileged users can go above max_files @@ -85,6 +81,11 @@ struct file *get_empty_filp(void) if (f == NULL) goto fail; + /* Stupid. 
Use atomics */ + spin_lock_irqsave(&filp_count_lock, flags); + files_stat.nr_files++; + spin_unlock_irqrestore(&filp_count_lock, flags); + memset(f, 0, sizeof(*f)); if (security_file_alloc(f)) goto fail_sec; diff --git a/include/linux/file.h b/include/linux/file.h index f5bbd4c..55f0572 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -60,8 +60,6 @@ extern void put_filp(struct file *); extern int get_unused_fd(void); extern void FASTCALL(put_unused_fd(unsigned int fd)); struct kmem_cache_s; -extern void filp_ctor(void * objp, struct kmem_cache_s *cachep, unsigned long cflags); -extern void filp_dtor(void * objp, struct kmem_cache_s *cachep, unsigned long dflags); extern struct file ** alloc_fd_array(int); extern void free_fd_array(struct file **, int); ^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached 2005-10-17 2:34 ` Linus Torvalds @ 2005-10-17 3:54 ` Roland Dreier 2005-10-17 11:54 ` Dipankar Sarma 1 sibling, 0 replies; 48+ messages in thread From: Roland Dreier @ 2005-10-17 3:54 UTC (permalink / raw) To: Linus Torvalds Cc: Dipankar Sarma, Serge Belyshev, linux-kernel, khali, Andrew Morton, Manfred Spraul > --- a/fs/file_table.c > +++ b/fs/file_table.c > @@ -39,21 +39,9 @@ void filp_ctor(void * objp, struct kmem_ > { > if ((cflags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) == > SLAB_CTOR_CONSTRUCTOR) { > - unsigned long flags; > - spin_lock_irqsave(&filp_count_lock, flags); > - files_stat.nr_files++; > - spin_unlock_irqrestore(&filp_count_lock, flags); > } > } Am I missing something? Why not delete the whole filp_ctor() function rather than just the then clause of the if()? - R. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: VFS: file-max limit 50044 reached
  2005-10-17  2:34         ` Linus Torvalds
  2005-10-17  3:54           ` Roland Dreier
@ 2005-10-17 11:54           ` Dipankar Sarma
  1 sibling, 0 replies; 48+ messages in thread
From: Dipankar Sarma @ 2005-10-17 11:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Serge Belyshev, linux-kernel, khali, Andrew Morton, Manfred Spraul

On Sun, Oct 16, 2005 at 07:34:24PM -0700, Linus Torvalds wrote:
> On Sun, 16 Oct 2005, Dipankar Sarma wrote:
> >
> > Linus, I don't think this has anything to do with RCU grace periods
> > like we discussed previously. I measured on my 3.6GHz x86_64 and
> > found that open()/close() pair on /dev/null takes about 45500
> > cycles or 12 microseconds. [Does that sound reasonable?].
>
> That sounds very slow. I can do a million open/close pairs in 4 seconds on
> a 2.5GHz G5. Maybe you tested a cold-cache case?

I measured after warming up for about a 100 times or so. It is not
a cold-cache case. I think we have a bigger problem in hand here.
I measured this with 2.6.13 and saw that I could do the same in
~3 microseconds per iteration. It balloons to 12 microseconds in
2.6.14-rc1. I am looking at this right now apart from the other
problems.

> I suspect this patch is worth it for the 2.6.14 timeframe, but I'll wait
> for confirmation.
>
> In fact, for 2.6.14, I'd almost do an even more minimal one. I agree with
> your changing the file counter to an atomic, but I'd rather keep that
> change for later.

Even beyond the file counter issue, we do need to address the DoS
and the open/close slowdown issue.

Thanks
Dipankar

^ permalink raw reply	[flat|nested] 48+ messages in thread
end of thread, other threads:[~2005-10-18 16:21 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-10-15 13:19 VFS: file-max limit 50044 reached Serge Belyshev
2005-10-15 17:53 ` Serge Belyshev
2005-10-16 16:23 ` Dipankar Sarma
2005-10-16 18:51 ` Serge Belyshev
2005-10-16 18:56 ` Dipankar Sarma
2005-10-17  2:19 ` Linus Torvalds
2005-10-17  4:43 ` Serge Belyshev
2005-10-17  8:32 ` Jean Delvare
2005-10-17  8:46 ` Dipankar Sarma
2005-10-17  9:10 ` Eric Dumazet
2005-10-17  9:14 ` Christoph Hellwig
2005-10-17  9:25 ` Eric Dumazet
2005-10-17 10:32 ` Dipankar Sarma
2005-10-17 12:10 ` [RCU problem] was " Eric Dumazet
2005-10-17 12:31 ` linux-os (Dick Johnson)
2005-10-17 12:36 ` Dipankar Sarma
2005-10-17 13:28 ` Eric Dumazet
2005-10-17 13:33 ` Dipankar Sarma
2005-10-17 14:54 ` Eric Dumazet
2005-10-17 15:42 ` Linus Torvalds
2005-10-17 16:01 ` Eric Dumazet
2005-10-17 16:16 ` Linus Torvalds
2005-10-17 16:29 ` Dipankar Sarma
2005-10-17 18:01 ` Eric Dumazet
2005-10-17 18:31 ` Dipankar Sarma
2005-10-17 19:00 ` Linus Torvalds
2005-10-17 18:37 ` Linus Torvalds
2005-10-17 19:12 ` Eric Dumazet
2005-10-17 19:30 ` Linus Torvalds
2005-10-17 19:39 ` Eric Dumazet
2005-10-17 20:14 ` Linus Torvalds
2005-10-17 20:25 ` Christopher Friesen
2005-10-17 20:24 ` Dipankar Sarma
2005-10-18 15:55 ` Christopher Friesen
2005-10-17 20:38 ` Linus Torvalds
2005-10-17 20:33 ` Dipankar Sarma
2005-10-17 22:40 ` Linus Torvalds
2005-10-17 22:59 ` Paul E. McKenney
2005-10-18  9:46 ` Eric Dumazet
2005-10-18 16:22 ` Paul E. McKenney
2005-10-17 18:15 ` Dipankar Sarma
2005-10-17 18:40 ` Linus Torvalds
2005-10-17 16:23 ` Dipankar Sarma
2005-10-17 16:31 ` Lee Revell
2005-10-17 16:20 ` Dipankar Sarma
2005-10-17  2:34 ` Linus Torvalds
2005-10-17  3:54 ` Roland Dreier
2005-10-17 11:54 ` Dipankar Sarma
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox