public inbox for linux-kernel@vger.kernel.org
* [0/2] filtered wakeups
@ 2004-05-03  2:17 William Lee Irwin III
  2004-05-03  2:23 ` [0.5/2] scheduler caller profiling William Lee Irwin III
  2004-05-03  2:46 ` [0/2] filtered wakeups William Lee Irwin III
  0 siblings, 2 replies; 7+ messages in thread
From: William Lee Irwin III @ 2004-05-03  2:17 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel

The thundering herd issue in waitqueue hashing has been seen in
practice. In order to preserve the space footprint reduction while
improving performance, I wrote "filtered wakeups", which discriminate
between waiters based on a key.

The following patch series, vs. 2.6.6-rc3-mm1, drastically reduces the
kernel cpu consumption of tiobench --threads 512 --size 16384 (fed to
tiotest by hand since apparently the perl script is buggy) on a 6x336MHz
UltraSPARC III Sun Enterprise 3000 with 3.5GB RAM, ESP-366HME HBA,
10x10Krpm 18GB U160 SCSI disks configured for dm as follows:
0 355655680 striped 10 64 /dev/sda 0 /dev/sdb 0 /dev/sdc 0 /dev/sdd 0 \
	/dev/sde 0 /dev/sdf 0 /dev/sdg 0 /dev/sdh 0 /dev/sdi 0 /dev/sdj 0
This was mkfs'd freshly to a single 171GB ext2 fs.

1/2, filtered page waitqueues, resolves the thundering herd issue with
	hashed page waitqueues.
2/2, filtered buffer_head waitqueues, resolves the thundering herd issue
	with hashed buffer_head waitqueues.
Futexes appear to have their own solution to this issue, necessarily
different from this one, as they need to discriminate based on a longer
key. The two could in principle be consolidated by passing a comparator
instead of comparing a key field (or some similar strategy), at the
cost of indirect function calls.

I furthermore instrumented the calls to schedule(), including indirect
ones, in patch 0.5/2 of the series. That patch isn't necessarily meant
to be applied to anything; it merely shows how I collected some of the
information in the runtime logs, which for space reasons I've posted as
URLs instead of including inline.
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/vm/filtered_wakeup/virgin_mm.log.tar.bz2
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/vm/filtered_wakeup/filtered_wakeup.log.tar.bz2

Here "cpusec" represents one second of cpu actually consumed, counting
both user and kernel time. Apart from regular sampling of profile data,
no other load was running on the machine.

before:
Tiotest results for 512 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write       16384 MBs | 1118.1 s |  14.654 MB/s |   1.6 %  | 280.9 % |
| Random Write 2000 MBs |  336.2 s |   5.950 MB/s |   0.8 %  |  20.4 % |
| Read        16384 MBs | 1717.1 s |   9.542 MB/s |   1.4 %  |  31.8 % |
| Random Read  2000 MBs |  465.2 s |   4.300 MB/s |   1.1 %  |  36.1 % |
`----------------------------------------------------------------------'

Throughput scaled by %cpu:
Write:            5.1873MB/cpusec
Random Write:    28.0660MB/cpusec
Read:            28.7410MB/cpusec
Random Read:     11.5591MB/cpusec

top 10 kernel cpu consumers:
 21733 finish_task_switch                       113.1927
 11976 __wake_up                                187.1250
 11433 generic_file_aio_write_nolock              5.0321
  9730 read_sched_profile                        43.4375
  9606 file_read_actor                           42.8839
  9116 __do_softirq                              31.6528
  8682 do_anonymous_page                         19.3795
  3635 prepare_to_wait                           28.3984
  2159 kmem_cache_free                           16.8672
  1944 buffered_rmqueue                           3.3750

top 10 callers of scheduling functions:
9391185 wait_on_page_bit                         32608.2812
7280055 cpu_idle                                 37916.9531
1458446 __lock_page                              5064.0486
258142 __handle_preemption                      16133.8750
134815 worker_thread                            247.8217
 45989 __wait_on_buffer                         205.3080
 22294 do_exit                                   21.7715
 22187 generic_file_aio_write_nolock              9.7654
 14932 sys_wait4                                 25.9236
 14652 shrink_list                                7.8944


after:
Tiotest results for 512 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write       16384 MBs | 1099.5 s |  14.901 MB/s |   2.2 %  | 279.3 % |
| Random Write 2000 MBs |  333.8 s |   5.991 MB/s |   1.0 %  |  14.9 % |
| Read        16384 MBs | 1706.3 s |   9.602 MB/s |   1.4 %  |  19.1 % |
| Random Read  2000 MBs |  460.3 s |   4.345 MB/s |   1.1 %  |  14.8 % |
`----------------------------------------------------------------------'

Throughput scaled by %cpu:
Write:            5.2934MB/cpusec
Random Write:    37.6792MB/cpusec
Read:            46.8390MB/cpusec
Random Read:     27.3270MB/cpusec

top 10 kernel cpu consumers:
 11873 generic_file_aio_write_nolock              5.2258
 10245 file_read_actor                           45.7366
 10212 read_sched_profile                        45.5893
 10135 finish_task_switch                        52.7865
  9171 do_anonymous_page                         20.4710
  8619 __do_softirq                              29.9271
  2905 wake_up_filtered                          18.1562
  2325 __get_page_state                          10.3795
  2278 del_timer_sync                             5.0848
  2033 buffered_rmqueue                           3.5295

top 10 callers of scheduling functions:
3985424 cpu_idle                                 20757.4167
2396754 wait_on_page_bit                         7489.8562
209453 __handle_preemption                      13090.8125
164071 worker_thread                            301.6011
 24321 do_exit                                   23.7510
 21272 generic_file_aio_write_nolock              9.3627
 16271 sys_wait4                                 28.2483
 11080 pipe_wait                                 86.5625
  9634 compat_sys_nanosleep                      25.0885
  7742 shrink_list                                4.1713


-- wli

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [0.5/2] scheduler caller profiling
  2004-05-03  2:17 [0/2] filtered wakeups William Lee Irwin III
@ 2004-05-03  2:23 ` William Lee Irwin III
  2004-05-03  2:29   ` William Lee Irwin III
  2004-05-03 18:51   ` [0.5/2] scheduler caller profiling David Mosberger
  2004-05-03  2:46 ` [0/2] filtered wakeups William Lee Irwin III
  1 sibling, 2 replies; 7+ messages in thread
From: William Lee Irwin III @ 2004-05-03  2:23 UTC (permalink / raw)
  To: akpm, linux-kernel

On Sun, May 02, 2004 at 07:17:09PM -0700, William Lee Irwin III wrote:
> The thundering herd issue in waitqueue hashing has been seen in
> practice. In order to preserve the space footprint reduction while
> improving performance, I wrote "filtered wakeups", which discriminate
> between waiters based on a key.

This patch was used to collect the data on the offending callers into
the scheduler. It creates a profile buffer handled completely
analogously to /proc/profile, but registers profile ticks at calls to
the various scheduler entry points instead of at timer ticks, and
rearranges scheduler code so this is accounted properly. It does not
report meaningful statistics in the presence of CONFIG_PREEMPT.

This patch is posted to disclose how I obtained the scheduling
statistics reported in the first post. It is not intended for
inclusion.


-- wli


Index: wli-2.6.6-rc3-mm1/include/linux/sched.h
===================================================================
--- wli-2.6.6-rc3-mm1.orig/include/linux/sched.h	2004-04-30 15:06:48.000000000 -0700
+++ wli-2.6.6-rc3-mm1/include/linux/sched.h	2004-04-30 15:55:34.000000000 -0700
@@ -189,7 +189,11 @@
 
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
+extern signed long FASTCALL(__schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);
+asmlinkage void __schedule(void);
+void __sched_profile(void *);
+#define sched_profile()		__sched_profile(__builtin_return_address(0))
 
 struct namespace;
 
Index: wli-2.6.6-rc3-mm1/include/linux/profile.h
===================================================================
--- wli-2.6.6-rc3-mm1.orig/include/linux/profile.h	2004-04-03 19:37:06.000000000 -0800
+++ wli-2.6.6-rc3-mm1/include/linux/profile.h	2004-04-30 16:05:35.000000000 -0700
@@ -13,6 +13,7 @@
 
 /* init basic kernel profiler */
 void __init profile_init(void);
+void schedprof_init(void);
 
 extern unsigned int * prof_buffer;
 extern unsigned long prof_len;
Index: wli-2.6.6-rc3-mm1/kernel/sched.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/kernel/sched.c	2004-04-30 15:06:49.000000000 -0700
+++ wli-2.6.6-rc3-mm1/kernel/sched.c	2004-05-01 11:48:46.000000000 -0700
@@ -2312,7 +2312,7 @@
 /*
  * schedule() is the main scheduler function.
  */
-asmlinkage void __sched schedule(void)
+asmlinkage void __sched __schedule(void)
 {
 	long *switch_count;
 	task_t *prev, *next;
@@ -2451,6 +2451,11 @@
 		goto need_resched;
 }
 
+asmlinkage void __sched schedule(void)
+{
+	sched_profile();
+	__schedule();
+}
 EXPORT_SYMBOL(schedule);
 
 #ifdef CONFIG_PREEMPT
@@ -2472,7 +2477,8 @@
 
 need_resched:
 	ti->preempt_count = PREEMPT_ACTIVE;
-	schedule();
+	sched_profile();
+	__schedule();
 	ti->preempt_count = 0;
 
 	/* we could miss a preemption opportunity between schedule and now */
@@ -2609,7 +2615,8 @@
 		do {
 			__set_current_state(TASK_UNINTERRUPTIBLE);
 			spin_unlock_irq(&x->wait.lock);
-			schedule();
+			sched_profile();
+			__schedule();
 			spin_lock_irq(&x->wait.lock);
 		} while (!x->done);
 		__remove_wait_queue(&x->wait, &wait);
@@ -2641,7 +2648,8 @@
 	current->state = TASK_INTERRUPTIBLE;
 
 	SLEEP_ON_HEAD
-	schedule();
+	sched_profile();
+	__schedule();
 	SLEEP_ON_TAIL
 }
 
@@ -2654,7 +2662,8 @@
 	current->state = TASK_INTERRUPTIBLE;
 
 	SLEEP_ON_HEAD
-	timeout = schedule_timeout(timeout);
+	sched_profile();
+	timeout = __schedule_timeout(timeout);
 	SLEEP_ON_TAIL
 
 	return timeout;
@@ -2669,7 +2678,8 @@
 	current->state = TASK_UNINTERRUPTIBLE;
 
 	SLEEP_ON_HEAD
-	schedule();
+	sched_profile();
+	__schedule();
 	SLEEP_ON_TAIL
 }
 
@@ -2682,7 +2692,8 @@
 	current->state = TASK_UNINTERRUPTIBLE;
 
 	SLEEP_ON_HEAD
-	timeout = schedule_timeout(timeout);
+	sched_profile();
+	timeout = __schedule_timeout(timeout);
 	SLEEP_ON_TAIL
 
 	return timeout;
@@ -3127,7 +3138,7 @@
  * to the expired array. If there are no other threads running on this
  * CPU then this function will return.
  */
-asmlinkage long sys_sched_yield(void)
+static long sched_yield(void)
 {
 	runqueue_t *rq = this_rq_lock();
 	prio_array_t *array = current->array;
@@ -3154,15 +3165,22 @@
 	_raw_spin_unlock(&rq->lock);
 	preempt_enable_no_resched();
 
-	schedule();
+	__schedule();
 
 	return 0;
 }
 
+asmlinkage long sys_sched_yield(void)
+{
+	__sched_profile(sys_sched_yield);
+	return sched_yield();
+}
+
 void __sched __cond_resched(void)
 {
 	set_current_state(TASK_RUNNING);
-	schedule();
+	sched_profile();
+	__schedule();
 }
 
 EXPORT_SYMBOL(__cond_resched);
@@ -3176,7 +3194,8 @@
 void __sched yield(void)
 {
 	set_current_state(TASK_RUNNING);
-	sys_sched_yield();
+	sched_profile();
+	sched_yield();
 }
 
 EXPORT_SYMBOL(yield);
@@ -3193,7 +3212,8 @@
 	struct runqueue *rq = this_rq();
 
 	atomic_inc(&rq->nr_iowait);
-	schedule();
+	sched_profile();
+	__schedule();
 	atomic_dec(&rq->nr_iowait);
 }
 
@@ -3205,7 +3225,8 @@
 	long ret;
 
 	atomic_inc(&rq->nr_iowait);
-	ret = schedule_timeout(timeout);
+	sched_profile();
+	ret = __schedule_timeout(timeout);
 	atomic_dec(&rq->nr_iowait);
 	return ret;
 }
@@ -4161,3 +4182,93 @@
 
 EXPORT_SYMBOL(__preempt_write_lock);
 #endif /* defined(CONFIG_SMP) && defined(CONFIG_PREEMPT) */
+
+static atomic_t *schedprof_buf;
+static int sched_profiling;
+static unsigned long schedprof_len;
+
+#include <linux/bootmem.h>
+#include <asm/sections.h>
+
+void __sched_profile(void *__pc)
+{
+	if (schedprof_buf) {
+		unsigned long pc = (unsigned long)__pc;
+		pc -= min(pc, (unsigned long)_stext);
+		atomic_inc(&schedprof_buf[min(pc, schedprof_len)]);
+	}
+}
+
+static int __init schedprof_setup(char *s)
+{
+	int n;
+	if (get_option(&s, &n))
+		sched_profiling = 1;
+	return 1;
+}
+__setup("schedprof=", schedprof_setup);
+
+void __init schedprof_init(void)
+{
+	if (!sched_profiling)
+		return;
+	schedprof_len = (unsigned long)(_etext - _stext) + 1;
+	schedprof_buf = alloc_bootmem(schedprof_len*sizeof(atomic_t));
+	printk(KERN_INFO "Scheduler call profiling enabled\n");
+}
+
+#ifdef CONFIG_PROC_FS
+#include <linux/proc_fs.h>
+
+static ssize_t
+read_sched_profile(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+	unsigned long p = *ppos;
+	ssize_t read;
+	char * pnt;
+	unsigned int sample_step = 1;
+
+	if (p >= (schedprof_len+1)*sizeof(atomic_t))
+		return 0;
+	if (count > (schedprof_len+1)*sizeof(atomic_t) - p)
+		count = (schedprof_len+1)*sizeof(atomic_t) - p;
+	read = 0;
+
+	while (p < sizeof(atomic_t) && count > 0) {
+		put_user(*((char *)(&sample_step)+p),buf);
+		buf++; p++; count--; read++;
+	}
+	pnt = (char *)schedprof_buf + p - sizeof(atomic_t);
+	if (copy_to_user(buf,(void *)pnt,count))
+		return -EFAULT;
+	read += count;
+	*ppos += read;
+	return read;
+}
+
+static ssize_t write_sched_profile(struct file *file, const char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	memset(schedprof_buf, 0, sizeof(atomic_t)*schedprof_len);
+	return count;
+}
+
+static struct file_operations sched_profile_operations = {
+	.read		= read_sched_profile,
+	.write		= write_sched_profile,
+};
+
+static int proc_schedprof_init(void)
+{
+	struct proc_dir_entry *entry;
+	if (!sched_profiling)
+		return 1;
+	entry = create_proc_entry("schedprof", S_IWUSR | S_IRUGO, NULL);
+	if (entry) {
+		entry->proc_fops = &sched_profile_operations;
+		entry->size = sizeof(atomic_t)*(schedprof_len + 1);
+	}
+	return !!entry;
+}
+module_init(proc_schedprof_init);
+#endif
Index: wli-2.6.6-rc3-mm1/kernel/timer.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/kernel/timer.c	2004-04-30 15:05:53.000000000 -0700
+++ wli-2.6.6-rc3-mm1/kernel/timer.c	2004-04-30 17:35:43.000000000 -0700
@@ -1065,7 +1065,7 @@
  *
  * In all cases the return value is guaranteed to be non-negative.
  */
-fastcall signed long __sched schedule_timeout(signed long timeout)
+fastcall signed long __sched __schedule_timeout(signed long timeout)
 {
 	struct timer_list timer;
 	unsigned long expire;
@@ -1080,7 +1080,7 @@
 		 * but I' d like to return a valid offset (>=0) to allow
 		 * the caller to do everything it want with the retval.
 		 */
-		schedule();
+		__schedule();
 		goto out;
 	default:
 		/*
@@ -1108,7 +1108,7 @@
 	timer.function = process_timeout;
 
 	add_timer(&timer);
-	schedule();
+	__schedule();
 	del_timer_sync(&timer);
 
 	timeout = expire - jiffies;
@@ -1117,6 +1117,11 @@
 	return timeout < 0 ? 0 : timeout;
 }
 
+fastcall signed long __sched schedule_timeout(signed long timeout)
+{
+	sched_profile();
+	return __schedule_timeout(timeout);
+}
 EXPORT_SYMBOL(schedule_timeout);
 
 /* Thread ID - the internal kernel "pid" */
Index: wli-2.6.6-rc3-mm1/arch/alpha/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/alpha/kernel/semaphore.c	2004-04-30 15:05:31.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/alpha/kernel/semaphore.c	2004-04-30 15:14:54.000000000 -0700
@@ -66,7 +66,6 @@
 {
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
-
 #ifdef CONFIG_DEBUG_SEMAPHORE
 	printk("%s(%d): down failed(%p)\n",
 	       tsk->comm, tsk->pid, sem);
@@ -83,7 +82,8 @@
 	 * that we are asleep, and then sleep.
 	 */
 	while (__sem_update_count(sem, -1) <= 0) {
-		schedule();
+		sched_profile();
+		__schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 	remove_wait_queue(&sem->wait, &wait);
@@ -108,7 +108,6 @@
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
 	long ret = 0;
-
 #ifdef CONFIG_DEBUG_SEMAPHORE
 	printk("%s(%d): down failed(%p)\n",
 	       tsk->comm, tsk->pid, sem);
@@ -129,7 +128,8 @@
 			ret = -EINTR;
 			break;
 		}
-		schedule();
+		sched_profile();
+		__schedule();
 		set_task_state(tsk, TASK_INTERRUPTIBLE);
 	}
 
Index: wli-2.6.6-rc3-mm1/arch/arm/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/arm/kernel/semaphore.c	2004-04-30 15:05:31.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/arm/kernel/semaphore.c	2004-04-30 15:15:12.000000000 -0700
@@ -77,8 +77,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irq(&semaphore_lock);
-
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_UNINTERRUPTIBLE;
 		spin_lock_irq(&semaphore_lock);
 	}
@@ -127,8 +127,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irq(&semaphore_lock);
-
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_INTERRUPTIBLE;
 		spin_lock_irq(&semaphore_lock);
 	}
Index: wli-2.6.6-rc3-mm1/arch/arm26/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/arm26/kernel/semaphore.c	2004-04-30 15:05:31.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/arm26/kernel/semaphore.c	2004-04-30 15:15:22.000000000 -0700
@@ -79,8 +79,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irq(&semaphore_lock);
-
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_UNINTERRUPTIBLE;
 		spin_lock_irq(&semaphore_lock);
 	}
@@ -129,8 +129,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irq(&semaphore_lock);
-
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_INTERRUPTIBLE;
 		spin_lock_irq(&semaphore_lock);
 	}
Index: wli-2.6.6-rc3-mm1/arch/cris/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/cris/kernel/semaphore.c	2004-04-30 15:05:31.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/cris/kernel/semaphore.c	2004-04-30 15:15:34.000000000 -0700
@@ -101,7 +101,8 @@
 	DOWN_HEAD(TASK_UNINTERRUPTIBLE)
 	if (waking_non_zero(sem))
 		break;
-	schedule();
+	sched_profile();
+	__schedule();
 	DOWN_TAIL(TASK_UNINTERRUPTIBLE)
 }
 
@@ -119,7 +120,8 @@
 			ret = 0;
 		break;
 	}
-	schedule();
+	sched_profile();
+	__schedule();
 	DOWN_TAIL(TASK_INTERRUPTIBLE)
 	return ret;
 }
Index: wli-2.6.6-rc3-mm1/arch/h8300/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/h8300/kernel/semaphore.c	2004-04-30 15:05:31.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/h8300/kernel/semaphore.c	2004-04-30 15:15:42.000000000 -0700
@@ -103,7 +103,8 @@
 	DOWN_HEAD(TASK_UNINTERRUPTIBLE)
 	if (waking_non_zero(sem))
 		break;
-	schedule();
+	sched_profile();
+	__schedule();
 	DOWN_TAIL(TASK_UNINTERRUPTIBLE)
 }
 
@@ -122,7 +123,8 @@
 			ret = 0;
 		break;
 	}
-	schedule();
+	sched_profile();
+	__schedule();
 	DOWN_TAIL(TASK_INTERRUPTIBLE)
 	return ret;
 }
Index: wli-2.6.6-rc3-mm1/arch/i386/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/i386/kernel/semaphore.c	2004-04-30 15:05:32.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/i386/kernel/semaphore.c	2004-04-30 15:16:52.000000000 -0700
@@ -79,8 +79,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irqrestore(&sem->wait.lock, flags);
-
-		schedule();
+		sched_profile();
+		__schedule();
 
 		spin_lock_irqsave(&sem->wait.lock, flags);
 		tsk->state = TASK_UNINTERRUPTIBLE;
@@ -132,8 +132,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irqrestore(&sem->wait.lock, flags);
-
-		schedule();
+		sched_profile();
+		__schedule();
 
 		spin_lock_irqsave(&sem->wait.lock, flags);
 		tsk->state = TASK_INTERRUPTIBLE;
Index: wli-2.6.6-rc3-mm1/arch/ia64/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/ia64/kernel/semaphore.c	2004-04-30 15:05:32.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/ia64/kernel/semaphore.c	2004-04-30 15:16:58.000000000 -0700
@@ -70,8 +70,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irqrestore(&sem->wait.lock, flags);
-
-		schedule();
+		sched_profile();
+		__schedule();
 
 		spin_lock_irqsave(&sem->wait.lock, flags);
 		tsk->state = TASK_UNINTERRUPTIBLE;
@@ -123,8 +123,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irqrestore(&sem->wait.lock, flags);
-
-		schedule();
+		sched_profile();
+		__schedule();
 
 		spin_lock_irqsave(&sem->wait.lock, flags);
 		tsk->state = TASK_INTERRUPTIBLE;
Index: wli-2.6.6-rc3-mm1/arch/m68k/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/m68k/kernel/semaphore.c	2004-04-30 15:05:32.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/m68k/kernel/semaphore.c	2004-04-30 15:17:09.000000000 -0700
@@ -103,7 +103,8 @@
 	DOWN_HEAD(TASK_UNINTERRUPTIBLE)
 	if (waking_non_zero(sem))
 		break;
-	schedule();
+	sched_profile();
+	__schedule();
 	DOWN_TAIL(TASK_UNINTERRUPTIBLE)
 }
 
@@ -122,7 +123,8 @@
 			ret = 0;
 		break;
 	}
-	schedule();
+	sched_profile();
+	__schedule();
 	DOWN_TAIL(TASK_INTERRUPTIBLE)
 	return ret;
 }
Index: wli-2.6.6-rc3-mm1/arch/m68knommu/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/m68knommu/kernel/semaphore.c	2004-04-30 15:05:32.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/m68knommu/kernel/semaphore.c	2004-04-30 15:17:13.000000000 -0700
@@ -104,7 +104,8 @@
 	DOWN_HEAD(TASK_UNINTERRUPTIBLE)
 	if (waking_non_zero(sem))
 		break;
-	schedule();
+	sched_profile();
+	__schedule();
 	DOWN_TAIL(TASK_UNINTERRUPTIBLE)
 }
 
@@ -123,7 +124,8 @@
 			ret = 0;
 		break;
 	}
-	schedule();
+	sched_profile();
+	__schedule();
 	DOWN_TAIL(TASK_INTERRUPTIBLE)
 	return ret;
 }
Index: wli-2.6.6-rc3-mm1/arch/mips/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/mips/kernel/semaphore.c	2004-04-30 15:05:33.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/mips/kernel/semaphore.c	2004-04-30 15:17:25.000000000 -0700
@@ -132,7 +132,8 @@
 	for (;;) {
 		if (waking_non_zero(sem))
 			break;
-		schedule();
+		sched_profile();
+		__schedule();
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 	}
 	__set_current_state(TASK_RUNNING);
@@ -261,7 +262,8 @@
 				ret = 0;
 			break;
 		}
-		schedule();
+		sched_profile();
+		__schedule();
 		__set_current_state(TASK_INTERRUPTIBLE);
 	}
 	__set_current_state(TASK_RUNNING);
Index: wli-2.6.6-rc3-mm1/arch/parisc/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/parisc/kernel/semaphore.c	2004-04-30 15:05:34.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/parisc/kernel/semaphore.c	2004-04-30 15:17:31.000000000 -0700
@@ -68,7 +68,8 @@
 		/* we can _read_ this without the sentry */
 		if (sem->count != -1)
 			break;
- 		schedule();
+		sched_profile();
+ 		__schedule();
  	}
 
 	DOWN_TAIL
@@ -89,7 +90,8 @@
 			ret = -EINTR;
 			break;
 		}
-		schedule();
+		sched_profile();
+		__schedule();
 	}
 
 	DOWN_TAIL
Index: wli-2.6.6-rc3-mm1/arch/ppc/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/ppc/kernel/semaphore.c	2004-04-30 15:05:34.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/ppc/kernel/semaphore.c	2004-04-30 15:17:36.000000000 -0700
@@ -86,7 +86,8 @@
 	 * that we are asleep, and then sleep.
 	 */
 	while (__sem_update_count(sem, -1) <= 0) {
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_UNINTERRUPTIBLE;
 	}
 	remove_wait_queue(&sem->wait, &wait);
@@ -121,7 +122,8 @@
 			retval = -EINTR;
 			break;
 		}
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_INTERRUPTIBLE;
 	}
 	tsk->state = TASK_RUNNING;
Index: wli-2.6.6-rc3-mm1/arch/ppc64/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/ppc64/kernel/semaphore.c	2004-04-30 15:05:34.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/ppc64/kernel/semaphore.c	2004-04-30 15:17:40.000000000 -0700
@@ -86,7 +86,8 @@
 	 * that we are asleep, and then sleep.
 	 */
 	while (__sem_update_count(sem, -1) <= 0) {
-		schedule();
+		sched_profile();
+		__schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 	remove_wait_queue(&sem->wait, &wait);
@@ -120,7 +121,8 @@
 			retval = -EINTR;
 			break;
 		}
-		schedule();
+		sched_profile();
+		__schedule();
 		set_task_state(tsk, TASK_INTERRUPTIBLE);
 	}
 	remove_wait_queue(&sem->wait, &wait);
Index: wli-2.6.6-rc3-mm1/arch/s390/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/s390/kernel/semaphore.c	2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/s390/kernel/semaphore.c	2004-04-30 15:17:43.000000000 -0700
@@ -69,7 +69,8 @@
 	__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	add_wait_queue_exclusive(&sem->wait, &wait);
 	while (__sem_update_count(sem, -1) <= 0) {
-		schedule();
+		sched_profile();
+		__schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 	remove_wait_queue(&sem->wait, &wait);
@@ -97,7 +98,8 @@
 			retval = -EINTR;
 			break;
 		}
-		schedule();
+		sched_profile();
+		__schedule();
 		set_task_state(tsk, TASK_INTERRUPTIBLE);
 	}
 	remove_wait_queue(&sem->wait, &wait);
Index: wli-2.6.6-rc3-mm1/arch/sh/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/sh/kernel/semaphore.c	2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/sh/kernel/semaphore.c	2004-04-30 15:17:52.000000000 -0700
@@ -110,7 +110,8 @@
 	DOWN_HEAD(TASK_UNINTERRUPTIBLE)
 	if (waking_non_zero(sem))
 		break;
-	schedule();
+	sched_profile();
+	__schedule();
 	DOWN_TAIL(TASK_UNINTERRUPTIBLE)
 }
 
@@ -128,7 +129,8 @@
 			ret = 0;
 		break;
 	}
-	schedule();
+	sched_profile();
+	__schedule();
 	DOWN_TAIL(TASK_INTERRUPTIBLE)
 	return ret;
 }
Index: wli-2.6.6-rc3-mm1/arch/sparc/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/sparc/kernel/semaphore.c	2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/sparc/kernel/semaphore.c	2004-04-30 15:18:02.000000000 -0700
@@ -68,8 +68,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irq(&semaphore_lock);
-
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_UNINTERRUPTIBLE;
 		spin_lock_irq(&semaphore_lock);
 	}
@@ -118,8 +118,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irq(&semaphore_lock);
-
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_INTERRUPTIBLE;
 		spin_lock_irq(&semaphore_lock);
 	}
Index: wli-2.6.6-rc3-mm1/arch/sparc64/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/sparc64/kernel/semaphore.c	2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/sparc64/kernel/semaphore.c	2004-04-30 15:18:10.000000000 -0700
@@ -100,7 +100,8 @@
 	add_wait_queue_exclusive(&sem->wait, &wait);
 
 	while (__sem_update_count(sem, -1) <= 0) {
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_UNINTERRUPTIBLE;
 	}
 	remove_wait_queue(&sem->wait, &wait);
@@ -208,7 +209,8 @@
 			retval = -EINTR;
 			break;
 		}
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_INTERRUPTIBLE;
 	}
 	tsk->state = TASK_RUNNING;
Index: wli-2.6.6-rc3-mm1/arch/v850/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/v850/kernel/semaphore.c	2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/v850/kernel/semaphore.c	2004-04-30 15:18:21.000000000 -0700
@@ -79,8 +79,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irq(&semaphore_lock);
-
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_UNINTERRUPTIBLE;
 		spin_lock_irq(&semaphore_lock);
 	}
@@ -129,8 +129,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irq(&semaphore_lock);
-
-		schedule();
+		sched_profile();
+		__schedule();
 		tsk->state = TASK_INTERRUPTIBLE;
 		spin_lock_irq(&semaphore_lock);
 	}
Index: wli-2.6.6-rc3-mm1/arch/x86_64/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/x86_64/kernel/semaphore.c	2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/x86_64/kernel/semaphore.c	2004-04-30 15:18:28.000000000 -0700
@@ -80,8 +80,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irqrestore(&sem->wait.lock, flags);
-
-		schedule();
+		sched_profile();
+		__schedule();
 
 		spin_lock_irqsave(&sem->wait.lock, flags);
 		tsk->state = TASK_UNINTERRUPTIBLE;
@@ -133,8 +133,8 @@
 		}
 		sem->sleepers = 1;	/* us - see -1 above */
 		spin_unlock_irqrestore(&sem->wait.lock, flags);
-
-		schedule();
+		sched_profile();
+		__schedule();
 
 		spin_lock_irqsave(&sem->wait.lock, flags);
 		tsk->state = TASK_INTERRUPTIBLE;
Index: wli-2.6.6-rc3-mm1/lib/rwsem.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/lib/rwsem.c	2004-04-30 15:06:49.000000000 -0700
+++ wli-2.6.6-rc3-mm1/lib/rwsem.c	2004-04-30 15:13:46.000000000 -0700
@@ -150,7 +150,7 @@
 	for (;;) {
 		if (!waiter->flags)
 			break;
-		schedule();
+		__schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 
@@ -165,7 +165,7 @@
 struct rw_semaphore fastcall __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
 {
 	struct rwsem_waiter waiter;
-
+	sched_profile();
 	rwsemtrace(sem,"Entering rwsem_down_read_failed");
 
 	waiter.flags = RWSEM_WAITING_FOR_READ;
@@ -181,7 +181,7 @@
 struct rw_semaphore fastcall __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
 {
 	struct rwsem_waiter waiter;
-
+	sched_profile();
 	rwsemtrace(sem,"Entering rwsem_down_write_failed");
 
 	waiter.flags = RWSEM_WAITING_FOR_WRITE;
Index: wli-2.6.6-rc3-mm1/init/main.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/init/main.c	2004-04-30 15:06:48.000000000 -0700
+++ wli-2.6.6-rc3-mm1/init/main.c	2004-04-30 16:04:58.000000000 -0700
@@ -486,6 +486,7 @@
 	if (panic_later)
 		panic(panic_later, panic_param);
 	profile_init();
+	schedprof_init();
 	local_irq_enable();
 #ifdef CONFIG_BLK_DEV_INITRD
 	if (initrd_start && !initrd_below_start_ok &&


* Re: [0.5/2] scheduler caller profiling
  2004-05-03  2:23 ` [0.5/2] scheduler caller profiling William Lee Irwin III
@ 2004-05-03  2:29   ` William Lee Irwin III
  2004-05-03  2:32     ` [2/2] filtered buffer_head wakeups William Lee Irwin III
  2004-05-03 18:51   ` [0.5/2] scheduler caller profiling David Mosberger
  1 sibling, 1 reply; 7+ messages in thread
From: William Lee Irwin III @ 2004-05-03  2:29 UTC (permalink / raw)
  To: akpm, linux-kernel

On Sun, May 02, 2004 at 07:23:46PM -0700, William Lee Irwin III wrote:
> This patch was used to collect the data on the offending callers into
> the scheduler. It creates a profile buffer completely analogous to its

This patch creates a new scheduling entrypoint, wake_up_filtered(), and
uses it in page waitqueue hashing to discriminate between the waiters
on various pages. Page waitqueue hashing was identified a priori as one
source of the thundering herds, and this was confirmed empirically
using the scheduler caller profiling patch.


-- wli

Index: wli-2.6.6-rc3-mm1/include/linux/wait.h
===================================================================
--- wli-2.6.6-rc3-mm1.orig/include/linux/wait.h	2004-04-03 19:37:07.000000000 -0800
+++ wli-2.6.6-rc3-mm1/include/linux/wait.h	2004-04-30 19:50:33.000000000 -0700
@@ -28,6 +28,11 @@
 	struct list_head task_list;
 };
 
+struct filtered_wait_queue {
+	void *key;
+	wait_queue_t wait;
+};
+
 struct __wait_queue_head {
 	spinlock_t lock;
 	struct list_head task_list;
@@ -104,6 +109,7 @@
 	list_del(&old->task_list);
 }
 
+void FASTCALL(wake_up_filtered(wait_queue_head_t *, void *));
 extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
 extern void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int mode));
 extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
@@ -257,6 +263,16 @@
 		wait->func = autoremove_wake_function;			\
 		INIT_LIST_HEAD(&wait->task_list);			\
 	} while (0)
+
+#define DEFINE_FILTERED_WAIT(name, p)					\
+	struct filtered_wait_queue name = {				\
+		.key	= p,						\
+		.wait	=	{					\
+			.task	= current,				\
+			.func	= autoremove_wake_function,		\
+			.task_list = LIST_HEAD_INIT(name.wait.task_list),\
+		},							\
+	}
 	
 #endif /* __KERNEL__ */
 
Index: wli-2.6.6-rc3-mm1/kernel/sched.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/kernel/sched.c	2004-04-30 16:13:32.000000000 -0700
+++ wli-2.6.6-rc3-mm1/kernel/sched.c	2004-04-30 19:50:33.000000000 -0700
@@ -2524,6 +2524,19 @@
 	}
 }
 
+void fastcall wake_up_filtered(wait_queue_head_t *q, void *key)
+{
+	unsigned long flags;
+	unsigned int mode = TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE;
+	struct filtered_wait_queue *wait, *save;
+	spin_lock_irqsave(&q->lock, flags);
+	list_for_each_entry_safe(wait, save, &q->task_list, wait.task_list) {
+		if (wait->key == key)
+			wait->wait.func(&wait->wait, mode, 0);
+	}
+	spin_unlock_irqrestore(&q->lock, flags);
+}
+
 /**
  * __wake_up - wake up threads blocked on a waitqueue.
  * @q: the waitqueue
Index: wli-2.6.6-rc3-mm1/mm/filemap.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/mm/filemap.c	2004-04-30 15:06:49.000000000 -0700
+++ wli-2.6.6-rc3-mm1/mm/filemap.c	2004-04-30 19:50:33.000000000 -0700
@@ -307,16 +307,16 @@
 void fastcall wait_on_page_bit(struct page *page, int bit_nr)
 {
 	wait_queue_head_t *waitqueue = page_waitqueue(page);
-	DEFINE_WAIT(wait);
+	DEFINE_FILTERED_WAIT(wait, page);
 
 	do {
-		prepare_to_wait(waitqueue, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(waitqueue, &wait.wait, TASK_UNINTERRUPTIBLE);
 		if (test_bit(bit_nr, &page->flags)) {
 			sync_page(page);
 			io_schedule();
 		}
 	} while (test_bit(bit_nr, &page->flags));
-	finish_wait(waitqueue, &wait);
+	finish_wait(waitqueue, &wait.wait);
 }
 
 EXPORT_SYMBOL(wait_on_page_bit);
@@ -344,7 +344,7 @@
 		BUG();
 	smp_mb__after_clear_bit(); 
 	if (waitqueue_active(waitqueue))
-		wake_up_all(waitqueue);
+		wake_up_filtered(waitqueue, page);
 }
 
 EXPORT_SYMBOL(unlock_page);
@@ -363,7 +363,7 @@
 		smp_mb__after_clear_bit();
 	}
 	if (waitqueue_active(waitqueue))
-		wake_up_all(waitqueue);
+		wake_up_filtered(waitqueue, page);
 }
 
 EXPORT_SYMBOL(end_page_writeback);
@@ -379,16 +379,16 @@
 void fastcall __lock_page(struct page *page)
 {
 	wait_queue_head_t *wqh = page_waitqueue(page);
-	DEFINE_WAIT(wait);
+	DEFINE_FILTERED_WAIT(wait, page);
 
 	while (TestSetPageLocked(page)) {
-		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(wqh, &wait.wait, TASK_UNINTERRUPTIBLE);
 		if (PageLocked(page)) {
 			sync_page(page);
 			io_schedule();
 		}
 	}
-	finish_wait(wqh, &wait);
+	finish_wait(wqh, &wait.wait);
 }
 
 EXPORT_SYMBOL(__lock_page);


* [2/2] filtered buffer_head wakeups
  2004-05-03  2:29   ` William Lee Irwin III
@ 2004-05-03  2:32     ` William Lee Irwin III
  0 siblings, 0 replies; 7+ messages in thread
From: William Lee Irwin III @ 2004-05-03  2:32 UTC (permalink / raw)
  To: akpm, linux-kernel

On Sun, May 02, 2004 at 07:29:36PM -0700, William Lee Irwin III wrote:
> This patch creates a new scheduling entrypoint, wake_up_filtered(), and
> uses it in page waitqueue hashing to discriminate between the waiters
> on various pages. One of the sources of the thundering herds was
> identified as the page waitqueue hashing by a priori methods and
> empirically confirmed using the scheduler caller profiling patch.

It turned out there were still thundering herds after the page
waitqueue hashtable was dealt with. The scheduler caller profiling
patch identified the buffer_head waitqueue hashtable as the remaining
source. This patch teaches the buffer_head waitqueue hashing to use
filtered wakeups.


-- wli

Index: wli-2.6.6-rc3-mm1/fs/buffer.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/fs/buffer.c	2004-04-30 15:06:46.000000000 -0700
+++ wli-2.6.6-rc3-mm1/fs/buffer.c	2004-04-30 19:51:25.000000000 -0700
@@ -74,7 +74,7 @@
 
 	smp_mb();
 	if (waitqueue_active(wq))
-		wake_up_all(wq);
+		wake_up_filtered(wq, bh);
 }
 EXPORT_SYMBOL(wake_up_buffer);
 
@@ -93,10 +93,10 @@
 void __wait_on_buffer(struct buffer_head * bh)
 {
 	wait_queue_head_t *wqh = bh_waitq_head(bh);
-	DEFINE_WAIT(wait);
+	DEFINE_FILTERED_WAIT(wait, bh);
 
 	do {
-		prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(wqh, &wait.wait, TASK_UNINTERRUPTIBLE);
 		if (buffer_locked(bh)) {
 			struct block_device *bd;
 			smp_mb();
@@ -106,7 +106,7 @@
 			io_schedule();
 		}
 	} while (buffer_locked(bh));
-	finish_wait(wqh, &wait);
+	finish_wait(wqh, &wait.wait);
 }
 
 static void
Index: wli-2.6.6-rc3-mm1/fs/jbd/transaction.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/fs/jbd/transaction.c	2004-04-30 15:06:46.000000000 -0700
+++ wli-2.6.6-rc3-mm1/fs/jbd/transaction.c	2004-04-30 19:51:25.000000000 -0700
@@ -638,7 +638,7 @@
 			jbd_unlock_bh_state(bh);
 			/* commit wakes up all shadow buffers after IO */
 			wqh = bh_waitq_head(jh2bh(jh));
-			wait_event(*wqh, (jh->b_jlist != BJ_Shadow));
+			wait_event_filtered(*wqh, jh2bh(jh), (jh->b_jlist != BJ_Shadow));
 			goto repeat;
 		}
 
Index: wli-2.6.6-rc3-mm1/include/linux/wait.h
===================================================================
--- wli-2.6.6-rc3-mm1.orig/include/linux/wait.h	2004-04-30 19:50:33.000000000 -0700
+++ wli-2.6.6-rc3-mm1/include/linux/wait.h	2004-04-30 19:51:25.000000000 -0700
@@ -146,7 +146,6 @@
 		break;							\
 	__wait_event(wq, condition);					\
 } while (0)
-
 #define __wait_event_interruptible(wq, condition, ret)			\
 do {									\
 	wait_queue_t __wait;						\
@@ -273,7 +272,28 @@
 			.task_list = LIST_HEAD_INIT(name.wait.task_list),\
 		},							\
 	}
-	
+
+#define __wait_event_filtered(wq, key, condition) 			\
+do {									\
+	DEFINE_FILTERED_WAIT(__wait, key);				\
+	add_wait_queue(&(wq), &__wait.wait);				\
+	for (;;) {							\
+		set_current_state(TASK_UNINTERRUPTIBLE);		\
+		if (condition)						\
+			break;						\
+		schedule();						\
+	}								\
+	current->state = TASK_RUNNING;					\
+	remove_wait_queue(&(wq), &__wait.wait);				\
+} while (0)
+
+
+#define wait_event_filtered(wq, key, condition)				\
+do {									\
+	if (!(condition))						\
+		__wait_event_filtered(wq, key, condition);		\
+} while (0)
+
 #endif /* __KERNEL__ */
 
 #endif


* Re: [0/2] filtered wakeups
  2004-05-03  2:17 [0/2] filtered wakeups William Lee Irwin III
  2004-05-03  2:23 ` [0.5/2] scheduler caller profiling William Lee Irwin III
@ 2004-05-03  2:46 ` William Lee Irwin III
  1 sibling, 0 replies; 7+ messages in thread
From: William Lee Irwin III @ 2004-05-03  2:46 UTC (permalink / raw)
  To: akpm, linux-kernel

On Sun, May 02, 2004 at 07:17:09PM -0700, William Lee Irwin III wrote:
> before:
> Tiotest results for 512 concurrent io threads:

Parting shot: I also used time(1):

before:
tiotest -t 512 -f 32 -b 4096 -d .  14337.17s user 3931.52s system 301% cpu 1:40:51.08 total

after:
tiotest -t 512 -f 32 -b 4096 -d .  10985.23s user 3524.50s system 266% cpu 1:30:48.80 total

i.e. it sped up the run by 10 minutes, or 10% of the total execution time.


-- wli


* Re: [0.5/2] scheduler caller profiling
@ 2004-05-03 12:18 Oleg Nesterov
  0 siblings, 0 replies; 7+ messages in thread
From: Oleg Nesterov @ 2004-05-03 12:18 UTC (permalink / raw)
  To: linux-kernel, William Lee Irwin III

Hello.

William Lee Irwin III wrote:
> This patch creates a new scheduling entrypoint, wake_up_filtered(), and
> uses it in page waitqueue hashing to discriminate between the waiters
> on various pages. One of the sources of the thundering herds was
> identified as the page waitqueue hashing by a priori methods and
> empirically confirmed using the scheduler caller profiling patch.

How about this (untested, of course) idea:

struct wait_bit_queue {
	unsigned long *flags;
	int bit_nr;
	wait_queue_t wait;
};

#define DEFINE_WAIT_BIT(name, flags, bit_nr)					\
	struct wait_bit_queue name = {						\
		.flags	= flags,						\
		.bit_nr	= bit_nr,						\
		.wait	= {							\
			.task = current,					\
			.func = wake_bit_function,				\
			.task_list = LIST_HEAD_INIT(name.wait.task_list),	\
		},								\
	}

int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync)
{
	struct wait_bit_queue *wait_bit =
		container_of(wait, struct wait_bit_queue, wait);

	if (test_bit(wait_bit->bit_nr, wait_bit->flags))
		return 0;

	return autoremove_wake_function(wait, mode, sync);
}

This way only waiters must be modified:

void fastcall wait_on_page_bit(struct page *page, int bit_nr)
{
	wait_queue_head_t *waitqueue = page_waitqueue(page);
	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

	prepare_to_wait(waitqueue, &wait.wait, TASK_UNINTERRUPTIBLE);

	if (test_bit(bit_nr, &page->flags)) {
		sync_page(page);
		io_schedule();
	}

	finish_wait(waitqueue, &wait.wait);
}

__wait_on_buffer() can use DEFINE_WAIT_BIT(wait, &bh->b_state, BH_Lock)

Oleg.


* Re: [0.5/2] scheduler caller profiling
  2004-05-03  2:23 ` [0.5/2] scheduler caller profiling William Lee Irwin III
  2004-05-03  2:29   ` William Lee Irwin III
@ 2004-05-03 18:51   ` David Mosberger
  1 sibling, 0 replies; 7+ messages in thread
From: David Mosberger @ 2004-05-03 18:51 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: akpm, linux-kernel


  Bill> On Sun, May 02, 2004 at 07:17:09PM -0700, William Lee Irwin
  Bill> III wrote:
  >> The thundering herd issue in waitqueue hashing has been seen in
  >> practice. In order to preserve the space footprint reduction
  >> while improving performance, I wrote "filtered wakeups", which
  >> discriminate between waiters based on a key.

  Bill> This patch was used to collect the data on the offending
  Bill> callers into the scheduler. It creates a profile buffer
  Bill> completely analogous to its handling to /proc/profile, but
  Bill> registers profile ticks at calls to the various scheduler
  Bill> entry points instead of during timer ticks and rearranges
  Bill> scheduler code for this to be accounted properly. It does not
  Bill> report meaningful statistics in the presence of
  Bill> CONFIG_PREEMPT.

  Bill> Posting this patch is in order to disclose how I obtained the
  Bill> scheduling statistics reported in the first post. This patch
  Bill> is not intended for inclusion.

Note that on ia64, you can use q-syscollect/q-view to collect
call-counts statistically (with zero intrusion to the monitored
program, so it's safe for the kernel).  While the call-graph/counts
won't be perfectly accurate, this has proven to work extremely well in
practice.  In fact, it would be nice if other arches could support the
same.  All you really need for this to work is the ability to count N
call (or return) instructions and record the source and destination
address of the N-th call somewhere (registers or memory).  I looked at
the P4 performance-monitor briefly but I couldn't quite figure out
whether it supports the required functionality (it seemed like it
could only record _all_ branches, which would be a problem).  If a
P4-expert is interested in pursuing this, let me know.  I'd be happy
to help/advise with some of the more subtle issues that need to be
addressed to get this to work correctly.

	--david

