* [0/2] filtered wakeups
@ 2004-05-03 2:17 William Lee Irwin III
From: William Lee Irwin III @ 2004-05-03 2:17 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel
The thundering herd issue in waitqueue hashing has been seen in
practice. In order to preserve the space footprint reduction while
improving performance, I wrote "filtered wakeups", which discriminate
between waiters based on a key.
The following patch series, against 2.6.6-rc3-mm1, drastically reduces the
kernel cpu consumption of tiobench --threads 512 --size 16384 (fed to
tiotest by hand, since the perl wrapper script appears to be buggy) on a
6x336MHz UltraSPARC III Sun Enterprise 3000 with 3.5GB RAM, an ESP-366HME
HBA, and 10x10Krpm 18GB U160 SCSI disks configured for device-mapper
striping as follows:
0 355655680 striped 10 64 /dev/sda 0 /dev/sdb 0 /dev/sdc 0 /dev/sdd 0 \
/dev/sde 0 /dev/sdf 0 /dev/sdg 0 /dev/sdh 0 /dev/sdi 0 /dev/sdj 0
This was then freshly mkfs'd as a single 171GB ext2 filesystem.
1/2, filtered page waitqueues, resolves the thundering herd issue with
hashed page waitqueues.
2/2, filtered buffer_head waitqueues, resolves the thundering herd issue
with hashed buffer_head waitqueues.
Futexes appear to have their own solution to this issue, necessarily
different from this one since they need to discriminate based on a longer
key. The two could in principle be consolidated, for instance by passing a
comparator instead of comparing a fixed key field, at the cost of indirect
function calls.
Patch 0.5/2 of the series instruments the calls to schedule(), including
indirect ones. It isn't necessarily meant to be applied to anything; it
merely shows how I collected some of the information in the runtime logs,
which for space reasons I've posted as URLs instead of including inline:
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/vm/filtered_wakeup/virgin_mm.log.tar.bz2
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/vm/filtered_wakeup/filtered_wakeup.log.tar.bz2
Here "cpusec" represents 1 second of actual cpu time consumed, counting
both user and kernel consumption. Apart from regular sampling of profile
data, no other load was running on the machine.
before:
Tiotest results for 512 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 16384 MBs | 1118.1 s | 14.654 MB/s | 1.6 % | 280.9 % |
| Random Write 2000 MBs | 336.2 s | 5.950 MB/s | 0.8 % | 20.4 % |
| Read 16384 MBs | 1717.1 s | 9.542 MB/s | 1.4 % | 31.8 % |
| Random Read 2000 MBs | 465.2 s | 4.300 MB/s | 1.1 % | 36.1 % |
`----------------------------------------------------------------------'
Throughput scaled by %cpu:
Write: 5.1873MB/cpusec
Random Write: 28.0660MB/cpusec
Read: 28.7410MB/cpusec
Random Read: 11.5591MB/cpusec
top 10 kernel cpu consumers:
21733 finish_task_switch 113.1927
11976 __wake_up 187.1250
11433 generic_file_aio_write_nolock 5.0321
9730 read_sched_profile 43.4375
9606 file_read_actor 42.8839
9116 __do_softirq 31.6528
8682 do_anonymous_page 19.3795
3635 prepare_to_wait 28.3984
2159 kmem_cache_free 16.8672
1944 buffered_rmqueue 3.3750
top 10 callers of scheduling functions:
9391185 wait_on_page_bit 32608.2812
7280055 cpu_idle 37916.9531
1458446 __lock_page 5064.0486
258142 __handle_preemption 16133.8750
134815 worker_thread 247.8217
45989 __wait_on_buffer 205.3080
22294 do_exit 21.7715
22187 generic_file_aio_write_nolock 9.7654
14932 sys_wait4 25.9236
14652 shrink_list 7.8944
after:
Tiotest results for 512 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 16384 MBs | 1099.5 s | 14.901 MB/s | 2.2 % | 279.3 % |
| Random Write 2000 MBs | 333.8 s | 5.991 MB/s | 1.0 % | 14.9 % |
| Read 16384 MBs | 1706.3 s | 9.602 MB/s | 1.4 % | 19.1 % |
| Random Read 2000 MBs | 460.3 s | 4.345 MB/s | 1.1 % | 14.8 % |
`----------------------------------------------------------------------'
Throughput scaled by %cpu:
Write: 5.2934MB/cpusec
Random Write: 37.6792MB/cpusec
Read: 46.8390MB/cpusec
Random Read: 27.3270MB/cpusec
top 10 kernel cpu consumers:
11873 generic_file_aio_write_nolock 5.2258
10245 file_read_actor 45.7366
10212 read_sched_profile 45.5893
10135 finish_task_switch 52.7865
9171 do_anonymous_page 20.4710
8619 __do_softirq 29.9271
2905 wake_up_filtered 18.1562
2325 __get_page_state 10.3795
2278 del_timer_sync 5.0848
2033 buffered_rmqueue 3.5295
top 10 callers of scheduling functions:
3985424 cpu_idle 20757.4167
2396754 wait_on_page_bit 7489.8562
209453 __handle_preemption 13090.8125
164071 worker_thread 301.6011
24321 do_exit 23.7510
21272 generic_file_aio_write_nolock 9.3627
16271 sys_wait4 28.2483
11080 pipe_wait 86.5625
9634 compat_sys_nanosleep 25.0885
7742 shrink_list 4.1713
-- wli
* [0.5/2] scheduler caller profiling
From: William Lee Irwin III @ 2004-05-03 2:23 UTC (permalink / raw)
To: akpm, linux-kernel
On Sun, May 02, 2004 at 07:17:09PM -0700, William Lee Irwin III wrote:
> The thundering herd issue in waitqueue hashing has been seen in
> practice. In order to preserve the space footprint reduction while
> improving performance, I wrote "filtered wakeups", which discriminate
> between waiters based on a key.
This patch was used to collect the data on the offending callers into
the scheduler. It creates a profile buffer handled completely analogously
to /proc/profile, but registers profile ticks at calls to the various
scheduler entry points instead of at timer ticks, and rearranges scheduler
code so that these are accounted properly. It does not report meaningful
statistics in the presence of CONFIG_PREEMPT.
This patch is posted to disclose how I obtained the scheduling statistics
reported in the first post; it is not intended for inclusion.
-- wli
Index: wli-2.6.6-rc3-mm1/include/linux/sched.h
===================================================================
--- wli-2.6.6-rc3-mm1.orig/include/linux/sched.h 2004-04-30 15:06:48.000000000 -0700
+++ wli-2.6.6-rc3-mm1/include/linux/sched.h 2004-04-30 15:55:34.000000000 -0700
@@ -189,7 +189,11 @@
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
+extern signed long FASTCALL(__schedule_timeout(signed long timeout));
asmlinkage void schedule(void);
+asmlinkage void __schedule(void);
+void __sched_profile(void *);
+#define sched_profile() __sched_profile(__builtin_return_address(0))
struct namespace;
Index: wli-2.6.6-rc3-mm1/include/linux/profile.h
===================================================================
--- wli-2.6.6-rc3-mm1.orig/include/linux/profile.h 2004-04-03 19:37:06.000000000 -0800
+++ wli-2.6.6-rc3-mm1/include/linux/profile.h 2004-04-30 16:05:35.000000000 -0700
@@ -13,6 +13,7 @@
/* init basic kernel profiler */
void __init profile_init(void);
+void schedprof_init(void);
extern unsigned int * prof_buffer;
extern unsigned long prof_len;
Index: wli-2.6.6-rc3-mm1/kernel/sched.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/kernel/sched.c 2004-04-30 15:06:49.000000000 -0700
+++ wli-2.6.6-rc3-mm1/kernel/sched.c 2004-05-01 11:48:46.000000000 -0700
@@ -2312,7 +2312,7 @@
/*
* schedule() is the main scheduler function.
*/
-asmlinkage void __sched schedule(void)
+asmlinkage void __sched __schedule(void)
{
long *switch_count;
task_t *prev, *next;
@@ -2451,6 +2451,11 @@
goto need_resched;
}
+asmlinkage void __sched schedule(void)
+{
+ sched_profile();
+ __schedule();
+}
EXPORT_SYMBOL(schedule);
#ifdef CONFIG_PREEMPT
@@ -2472,7 +2477,8 @@
need_resched:
ti->preempt_count = PREEMPT_ACTIVE;
- schedule();
+ sched_profile();
+ __schedule();
ti->preempt_count = 0;
/* we could miss a preemption opportunity between schedule and now */
@@ -2609,7 +2615,8 @@
do {
__set_current_state(TASK_UNINTERRUPTIBLE);
spin_unlock_irq(&x->wait.lock);
- schedule();
+ sched_profile();
+ __schedule();
spin_lock_irq(&x->wait.lock);
} while (!x->done);
__remove_wait_queue(&x->wait, &wait);
@@ -2641,7 +2648,8 @@
current->state = TASK_INTERRUPTIBLE;
SLEEP_ON_HEAD
- schedule();
+ sched_profile();
+ __schedule();
SLEEP_ON_TAIL
}
@@ -2654,7 +2662,8 @@
current->state = TASK_INTERRUPTIBLE;
SLEEP_ON_HEAD
- timeout = schedule_timeout(timeout);
+ sched_profile();
+ timeout = __schedule_timeout(timeout);
SLEEP_ON_TAIL
return timeout;
@@ -2669,7 +2678,8 @@
current->state = TASK_UNINTERRUPTIBLE;
SLEEP_ON_HEAD
- schedule();
+ sched_profile();
+ __schedule();
SLEEP_ON_TAIL
}
@@ -2682,7 +2692,8 @@
current->state = TASK_UNINTERRUPTIBLE;
SLEEP_ON_HEAD
- timeout = schedule_timeout(timeout);
+ sched_profile();
+ timeout = __schedule_timeout(timeout);
SLEEP_ON_TAIL
return timeout;
@@ -3127,7 +3138,7 @@
* to the expired array. If there are no other threads running on this
* CPU then this function will return.
*/
-asmlinkage long sys_sched_yield(void)
+static long sched_yield(void)
{
runqueue_t *rq = this_rq_lock();
prio_array_t *array = current->array;
@@ -3154,15 +3165,22 @@
_raw_spin_unlock(&rq->lock);
preempt_enable_no_resched();
- schedule();
+ __schedule();
return 0;
}
+asmlinkage long sys_sched_yield(void)
+{
+ __sched_profile(sys_sched_yield);
+ return sched_yield();
+}
+
void __sched __cond_resched(void)
{
set_current_state(TASK_RUNNING);
- schedule();
+ sched_profile();
+ __schedule();
}
EXPORT_SYMBOL(__cond_resched);
@@ -3176,7 +3194,8 @@
void __sched yield(void)
{
set_current_state(TASK_RUNNING);
- sys_sched_yield();
+ sched_profile();
+ sched_yield();
}
EXPORT_SYMBOL(yield);
@@ -3193,7 +3212,8 @@
struct runqueue *rq = this_rq();
atomic_inc(&rq->nr_iowait);
- schedule();
+ sched_profile();
+ __schedule();
atomic_dec(&rq->nr_iowait);
}
@@ -3205,7 +3225,8 @@
long ret;
atomic_inc(&rq->nr_iowait);
- ret = schedule_timeout(timeout);
+ sched_profile();
+ ret = __schedule_timeout(timeout);
atomic_dec(&rq->nr_iowait);
return ret;
}
@@ -4161,3 +4182,93 @@
EXPORT_SYMBOL(__preempt_write_lock);
#endif /* defined(CONFIG_SMP) && defined(CONFIG_PREEMPT) */
+
+static atomic_t *schedprof_buf;
+static int sched_profiling;
+static unsigned long schedprof_len;
+
+#include <linux/bootmem.h>
+#include <asm/sections.h>
+
+void __sched_profile(void *__pc)
+{
+ if (schedprof_buf) {
+ unsigned long pc = (unsigned long)__pc;
+ pc -= min(pc, (unsigned long)_stext);
+ atomic_inc(&schedprof_buf[min(pc, schedprof_len)]);
+ }
+}
+
+static int __init schedprof_setup(char *s)
+{
+ int n;
+ if (get_option(&s, &n))
+ sched_profiling = 1;
+ return 1;
+}
+__setup("schedprof=", schedprof_setup);
+
+void __init schedprof_init(void)
+{
+ if (!sched_profiling)
+ return;
+ schedprof_len = (unsigned long)(_etext - _stext) + 1;
+ schedprof_buf = alloc_bootmem(schedprof_len*sizeof(atomic_t));
+ printk(KERN_INFO "Scheduler call profiling enabled\n");
+}
+
+#ifdef CONFIG_PROC_FS
+#include <linux/proc_fs.h>
+
+static ssize_t
+read_sched_profile(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+ unsigned long p = *ppos;
+ ssize_t read;
+ char * pnt;
+ unsigned int sample_step = 1;
+
+ if (p >= (schedprof_len+1)*sizeof(atomic_t))
+ return 0;
+ if (count > (schedprof_len+1)*sizeof(atomic_t) - p)
+ count = (schedprof_len+1)*sizeof(atomic_t) - p;
+ read = 0;
+
+ while (p < sizeof(atomic_t) && count > 0) {
+ put_user(*((char *)(&sample_step)+p),buf);
+ buf++; p++; count--; read++;
+ }
+ pnt = (char *)schedprof_buf + p - sizeof(atomic_t);
+ if (copy_to_user(buf,(void *)pnt,count))
+ return -EFAULT;
+ read += count;
+ *ppos += read;
+ return read;
+}
+
+static ssize_t write_sched_profile(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ memset(schedprof_buf, 0, sizeof(atomic_t)*schedprof_len);
+ return count;
+}
+
+static struct file_operations sched_profile_operations = {
+ .read = read_sched_profile,
+ .write = write_sched_profile,
+};
+
+static int proc_schedprof_init(void)
+{
+ struct proc_dir_entry *entry;
+ if (!sched_profiling)
+ return 1;
+ entry = create_proc_entry("schedprof", S_IWUSR | S_IRUGO, NULL);
+ if (entry) {
+ entry->proc_fops = &sched_profile_operations;
+ entry->size = sizeof(atomic_t)*(schedprof_len + 1);
+ }
+ return !!entry;
+}
+module_init(proc_schedprof_init);
+#endif
Index: wli-2.6.6-rc3-mm1/kernel/timer.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/kernel/timer.c 2004-04-30 15:05:53.000000000 -0700
+++ wli-2.6.6-rc3-mm1/kernel/timer.c 2004-04-30 17:35:43.000000000 -0700
@@ -1065,7 +1065,7 @@
*
* In all cases the return value is guaranteed to be non-negative.
*/
-fastcall signed long __sched schedule_timeout(signed long timeout)
+fastcall signed long __sched __schedule_timeout(signed long timeout)
{
struct timer_list timer;
unsigned long expire;
@@ -1080,7 +1080,7 @@
* but I' d like to return a valid offset (>=0) to allow
* the caller to do everything it want with the retval.
*/
- schedule();
+ __schedule();
goto out;
default:
/*
@@ -1108,7 +1108,7 @@
timer.function = process_timeout;
add_timer(&timer);
- schedule();
+ __schedule();
del_timer_sync(&timer);
timeout = expire - jiffies;
@@ -1117,6 +1117,11 @@
return timeout < 0 ? 0 : timeout;
}
+fastcall signed long __sched schedule_timeout(signed long timeout)
+{
+ sched_profile();
+ return __schedule_timeout(timeout);
+}
EXPORT_SYMBOL(schedule_timeout);
/* Thread ID - the internal kernel "pid" */
Index: wli-2.6.6-rc3-mm1/arch/alpha/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/alpha/kernel/semaphore.c 2004-04-30 15:05:31.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/alpha/kernel/semaphore.c 2004-04-30 15:14:54.000000000 -0700
@@ -66,7 +66,6 @@
{
struct task_struct *tsk = current;
DECLARE_WAITQUEUE(wait, tsk);
-
#ifdef CONFIG_DEBUG_SEMAPHORE
printk("%s(%d): down failed(%p)\n",
tsk->comm, tsk->pid, sem);
@@ -83,7 +82,8 @@
* that we are asleep, and then sleep.
*/
while (__sem_update_count(sem, -1) <= 0) {
- schedule();
+ sched_profile();
+ __schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
remove_wait_queue(&sem->wait, &wait);
@@ -108,7 +108,6 @@
struct task_struct *tsk = current;
DECLARE_WAITQUEUE(wait, tsk);
long ret = 0;
-
#ifdef CONFIG_DEBUG_SEMAPHORE
printk("%s(%d): down failed(%p)\n",
tsk->comm, tsk->pid, sem);
@@ -129,7 +128,8 @@
ret = -EINTR;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
set_task_state(tsk, TASK_INTERRUPTIBLE);
}
Index: wli-2.6.6-rc3-mm1/arch/arm/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/arm/kernel/semaphore.c 2004-04-30 15:05:31.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/arm/kernel/semaphore.c 2004-04-30 15:15:12.000000000 -0700
@@ -77,8 +77,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irq(&semaphore_lock);
-
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_UNINTERRUPTIBLE;
spin_lock_irq(&semaphore_lock);
}
@@ -127,8 +127,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irq(&semaphore_lock);
-
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_INTERRUPTIBLE;
spin_lock_irq(&semaphore_lock);
}
Index: wli-2.6.6-rc3-mm1/arch/arm26/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/arm26/kernel/semaphore.c 2004-04-30 15:05:31.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/arm26/kernel/semaphore.c 2004-04-30 15:15:22.000000000 -0700
@@ -79,8 +79,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irq(&semaphore_lock);
-
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_UNINTERRUPTIBLE;
spin_lock_irq(&semaphore_lock);
}
@@ -129,8 +129,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irq(&semaphore_lock);
-
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_INTERRUPTIBLE;
spin_lock_irq(&semaphore_lock);
}
Index: wli-2.6.6-rc3-mm1/arch/cris/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/cris/kernel/semaphore.c 2004-04-30 15:05:31.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/cris/kernel/semaphore.c 2004-04-30 15:15:34.000000000 -0700
@@ -101,7 +101,8 @@
DOWN_HEAD(TASK_UNINTERRUPTIBLE)
if (waking_non_zero(sem))
break;
- schedule();
+ sched_profile();
+ __schedule();
DOWN_TAIL(TASK_UNINTERRUPTIBLE)
}
@@ -119,7 +120,8 @@
ret = 0;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
DOWN_TAIL(TASK_INTERRUPTIBLE)
return ret;
}
Index: wli-2.6.6-rc3-mm1/arch/h8300/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/h8300/kernel/semaphore.c 2004-04-30 15:05:31.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/h8300/kernel/semaphore.c 2004-04-30 15:15:42.000000000 -0700
@@ -103,7 +103,8 @@
DOWN_HEAD(TASK_UNINTERRUPTIBLE)
if (waking_non_zero(sem))
break;
- schedule();
+ sched_profile();
+ __schedule();
DOWN_TAIL(TASK_UNINTERRUPTIBLE)
}
@@ -122,7 +123,8 @@
ret = 0;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
DOWN_TAIL(TASK_INTERRUPTIBLE)
return ret;
}
Index: wli-2.6.6-rc3-mm1/arch/i386/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/i386/kernel/semaphore.c 2004-04-30 15:05:32.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/i386/kernel/semaphore.c 2004-04-30 15:16:52.000000000 -0700
@@ -79,8 +79,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irqrestore(&sem->wait.lock, flags);
-
- schedule();
+ sched_profile();
+ __schedule();
spin_lock_irqsave(&sem->wait.lock, flags);
tsk->state = TASK_UNINTERRUPTIBLE;
@@ -132,8 +132,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irqrestore(&sem->wait.lock, flags);
-
- schedule();
+ sched_profile();
+ __schedule();
spin_lock_irqsave(&sem->wait.lock, flags);
tsk->state = TASK_INTERRUPTIBLE;
Index: wli-2.6.6-rc3-mm1/arch/ia64/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/ia64/kernel/semaphore.c 2004-04-30 15:05:32.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/ia64/kernel/semaphore.c 2004-04-30 15:16:58.000000000 -0700
@@ -70,8 +70,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irqrestore(&sem->wait.lock, flags);
-
- schedule();
+ sched_profile();
+ __schedule();
spin_lock_irqsave(&sem->wait.lock, flags);
tsk->state = TASK_UNINTERRUPTIBLE;
@@ -123,8 +123,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irqrestore(&sem->wait.lock, flags);
-
- schedule();
+ sched_profile();
+ __schedule();
spin_lock_irqsave(&sem->wait.lock, flags);
tsk->state = TASK_INTERRUPTIBLE;
Index: wli-2.6.6-rc3-mm1/arch/m68k/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/m68k/kernel/semaphore.c 2004-04-30 15:05:32.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/m68k/kernel/semaphore.c 2004-04-30 15:17:09.000000000 -0700
@@ -103,7 +103,8 @@
DOWN_HEAD(TASK_UNINTERRUPTIBLE)
if (waking_non_zero(sem))
break;
- schedule();
+ sched_profile();
+ __schedule();
DOWN_TAIL(TASK_UNINTERRUPTIBLE)
}
@@ -122,7 +123,8 @@
ret = 0;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
DOWN_TAIL(TASK_INTERRUPTIBLE)
return ret;
}
Index: wli-2.6.6-rc3-mm1/arch/m68knommu/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/m68knommu/kernel/semaphore.c 2004-04-30 15:05:32.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/m68knommu/kernel/semaphore.c 2004-04-30 15:17:13.000000000 -0700
@@ -104,7 +104,8 @@
DOWN_HEAD(TASK_UNINTERRUPTIBLE)
if (waking_non_zero(sem))
break;
- schedule();
+ sched_profile();
+ __schedule();
DOWN_TAIL(TASK_UNINTERRUPTIBLE)
}
@@ -123,7 +124,8 @@
ret = 0;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
DOWN_TAIL(TASK_INTERRUPTIBLE)
return ret;
}
Index: wli-2.6.6-rc3-mm1/arch/mips/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/mips/kernel/semaphore.c 2004-04-30 15:05:33.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/mips/kernel/semaphore.c 2004-04-30 15:17:25.000000000 -0700
@@ -132,7 +132,8 @@
for (;;) {
if (waking_non_zero(sem))
break;
- schedule();
+ sched_profile();
+ __schedule();
__set_current_state(TASK_UNINTERRUPTIBLE);
}
__set_current_state(TASK_RUNNING);
@@ -261,7 +262,8 @@
ret = 0;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
__set_current_state(TASK_INTERRUPTIBLE);
}
__set_current_state(TASK_RUNNING);
Index: wli-2.6.6-rc3-mm1/arch/parisc/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/parisc/kernel/semaphore.c 2004-04-30 15:05:34.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/parisc/kernel/semaphore.c 2004-04-30 15:17:31.000000000 -0700
@@ -68,7 +68,8 @@
/* we can _read_ this without the sentry */
if (sem->count != -1)
break;
- schedule();
+ sched_profile();
+ __schedule();
}
DOWN_TAIL
@@ -89,7 +90,8 @@
ret = -EINTR;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
}
DOWN_TAIL
Index: wli-2.6.6-rc3-mm1/arch/ppc/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/ppc/kernel/semaphore.c 2004-04-30 15:05:34.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/ppc/kernel/semaphore.c 2004-04-30 15:17:36.000000000 -0700
@@ -86,7 +86,8 @@
* that we are asleep, and then sleep.
*/
while (__sem_update_count(sem, -1) <= 0) {
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_UNINTERRUPTIBLE;
}
remove_wait_queue(&sem->wait, &wait);
@@ -121,7 +122,8 @@
retval = -EINTR;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_INTERRUPTIBLE;
}
tsk->state = TASK_RUNNING;
Index: wli-2.6.6-rc3-mm1/arch/ppc64/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/ppc64/kernel/semaphore.c 2004-04-30 15:05:34.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/ppc64/kernel/semaphore.c 2004-04-30 15:17:40.000000000 -0700
@@ -86,7 +86,8 @@
* that we are asleep, and then sleep.
*/
while (__sem_update_count(sem, -1) <= 0) {
- schedule();
+ sched_profile();
+ __schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
remove_wait_queue(&sem->wait, &wait);
@@ -120,7 +121,8 @@
retval = -EINTR;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
set_task_state(tsk, TASK_INTERRUPTIBLE);
}
remove_wait_queue(&sem->wait, &wait);
Index: wli-2.6.6-rc3-mm1/arch/s390/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/s390/kernel/semaphore.c 2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/s390/kernel/semaphore.c 2004-04-30 15:17:43.000000000 -0700
@@ -69,7 +69,8 @@
__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
add_wait_queue_exclusive(&sem->wait, &wait);
while (__sem_update_count(sem, -1) <= 0) {
- schedule();
+ sched_profile();
+ __schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
remove_wait_queue(&sem->wait, &wait);
@@ -97,7 +98,8 @@
retval = -EINTR;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
set_task_state(tsk, TASK_INTERRUPTIBLE);
}
remove_wait_queue(&sem->wait, &wait);
Index: wli-2.6.6-rc3-mm1/arch/sh/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/sh/kernel/semaphore.c 2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/sh/kernel/semaphore.c 2004-04-30 15:17:52.000000000 -0700
@@ -110,7 +110,8 @@
DOWN_HEAD(TASK_UNINTERRUPTIBLE)
if (waking_non_zero(sem))
break;
- schedule();
+ sched_profile();
+ __schedule();
DOWN_TAIL(TASK_UNINTERRUPTIBLE)
}
@@ -128,7 +129,8 @@
ret = 0;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
DOWN_TAIL(TASK_INTERRUPTIBLE)
return ret;
}
Index: wli-2.6.6-rc3-mm1/arch/sparc/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/sparc/kernel/semaphore.c 2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/sparc/kernel/semaphore.c 2004-04-30 15:18:02.000000000 -0700
@@ -68,8 +68,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irq(&semaphore_lock);
-
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_UNINTERRUPTIBLE;
spin_lock_irq(&semaphore_lock);
}
@@ -118,8 +118,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irq(&semaphore_lock);
-
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_INTERRUPTIBLE;
spin_lock_irq(&semaphore_lock);
}
Index: wli-2.6.6-rc3-mm1/arch/sparc64/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/sparc64/kernel/semaphore.c 2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/sparc64/kernel/semaphore.c 2004-04-30 15:18:10.000000000 -0700
@@ -100,7 +100,8 @@
add_wait_queue_exclusive(&sem->wait, &wait);
while (__sem_update_count(sem, -1) <= 0) {
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_UNINTERRUPTIBLE;
}
remove_wait_queue(&sem->wait, &wait);
@@ -208,7 +209,8 @@
retval = -EINTR;
break;
}
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_INTERRUPTIBLE;
}
tsk->state = TASK_RUNNING;
Index: wli-2.6.6-rc3-mm1/arch/v850/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/v850/kernel/semaphore.c 2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/v850/kernel/semaphore.c 2004-04-30 15:18:21.000000000 -0700
@@ -79,8 +79,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irq(&semaphore_lock);
-
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_UNINTERRUPTIBLE;
spin_lock_irq(&semaphore_lock);
}
@@ -129,8 +129,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irq(&semaphore_lock);
-
- schedule();
+ sched_profile();
+ __schedule();
tsk->state = TASK_INTERRUPTIBLE;
spin_lock_irq(&semaphore_lock);
}
Index: wli-2.6.6-rc3-mm1/arch/x86_64/kernel/semaphore.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/arch/x86_64/kernel/semaphore.c 2004-04-30 15:05:35.000000000 -0700
+++ wli-2.6.6-rc3-mm1/arch/x86_64/kernel/semaphore.c 2004-04-30 15:18:28.000000000 -0700
@@ -80,8 +80,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irqrestore(&sem->wait.lock, flags);
-
- schedule();
+ sched_profile();
+ __schedule();
spin_lock_irqsave(&sem->wait.lock, flags);
tsk->state = TASK_UNINTERRUPTIBLE;
@@ -133,8 +133,8 @@
}
sem->sleepers = 1; /* us - see -1 above */
spin_unlock_irqrestore(&sem->wait.lock, flags);
-
- schedule();
+ sched_profile();
+ __schedule();
spin_lock_irqsave(&sem->wait.lock, flags);
tsk->state = TASK_INTERRUPTIBLE;
Index: wli-2.6.6-rc3-mm1/lib/rwsem.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/lib/rwsem.c 2004-04-30 15:06:49.000000000 -0700
+++ wli-2.6.6-rc3-mm1/lib/rwsem.c 2004-04-30 15:13:46.000000000 -0700
@@ -150,7 +150,7 @@
for (;;) {
if (!waiter->flags)
break;
- schedule();
+ __schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
@@ -165,7 +165,7 @@
struct rw_semaphore fastcall __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
{
struct rwsem_waiter waiter;
-
+ sched_profile();
rwsemtrace(sem,"Entering rwsem_down_read_failed");
waiter.flags = RWSEM_WAITING_FOR_READ;
@@ -181,7 +181,7 @@
struct rw_semaphore fastcall __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
{
struct rwsem_waiter waiter;
-
+ sched_profile();
rwsemtrace(sem,"Entering rwsem_down_write_failed");
waiter.flags = RWSEM_WAITING_FOR_WRITE;
Index: wli-2.6.6-rc3-mm1/init/main.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/init/main.c 2004-04-30 15:06:48.000000000 -0700
+++ wli-2.6.6-rc3-mm1/init/main.c 2004-04-30 16:04:58.000000000 -0700
@@ -486,6 +486,7 @@
if (panic_later)
panic(panic_later, panic_param);
profile_init();
+ schedprof_init();
local_irq_enable();
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && !initrd_below_start_ok &&
* Re: [0.5/2] scheduler caller profiling
From: William Lee Irwin III @ 2004-05-03 2:29 UTC (permalink / raw)
To: akpm, linux-kernel
On Sun, May 02, 2004 at 07:23:46PM -0700, William Lee Irwin III wrote:
> This patch was used to collect the data on the offending callers into
> the scheduler. It creates a profile buffer completely analogous to its
This patch creates a new scheduling entry point, wake_up_filtered(), and
uses it in page waitqueue hashing to discriminate between the waiters on
different pages. Page waitqueue hashing was identified a priori as one of
the sources of the thundering herds and was confirmed empirically using
the scheduler caller profiling patch.
-- wli
Index: wli-2.6.6-rc3-mm1/include/linux/wait.h
===================================================================
--- wli-2.6.6-rc3-mm1.orig/include/linux/wait.h 2004-04-03 19:37:07.000000000 -0800
+++ wli-2.6.6-rc3-mm1/include/linux/wait.h 2004-04-30 19:50:33.000000000 -0700
@@ -28,6 +28,11 @@
struct list_head task_list;
};
+struct filtered_wait_queue {
+ void *key;
+ wait_queue_t wait;
+};
+
struct __wait_queue_head {
spinlock_t lock;
struct list_head task_list;
@@ -104,6 +109,7 @@
list_del(&old->task_list);
}
+void FASTCALL(wake_up_filtered(wait_queue_head_t *, void *));
extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(__wake_up_locked(wait_queue_head_t *q, unsigned int mode));
extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
@@ -257,6 +263,16 @@
wait->func = autoremove_wake_function; \
INIT_LIST_HEAD(&wait->task_list); \
} while (0)
+
+#define DEFINE_FILTERED_WAIT(name, p) \
+ struct filtered_wait_queue name = { \
+ .key = p, \
+ .wait = { \
+ .task = current, \
+ .func = autoremove_wake_function, \
+ .task_list = LIST_HEAD_INIT(name.wait.task_list),\
+ }, \
+ }
#endif /* __KERNEL__ */
Index: wli-2.6.6-rc3-mm1/kernel/sched.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/kernel/sched.c 2004-04-30 16:13:32.000000000 -0700
+++ wli-2.6.6-rc3-mm1/kernel/sched.c 2004-04-30 19:50:33.000000000 -0700
@@ -2524,6 +2524,19 @@
}
}
+void fastcall wake_up_filtered(wait_queue_head_t *q, void *key)
+{
+ unsigned long flags;
+ unsigned int mode = TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE;
+ struct filtered_wait_queue *wait, *save;
+ spin_lock_irqsave(&q->lock, flags);
+ list_for_each_entry_safe(wait, save, &q->task_list, wait.task_list) {
+ if (wait->key == key)
+ wait->wait.func(&wait->wait, mode, 0);
+ }
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
/**
* __wake_up - wake up threads blocked on a waitqueue.
* @q: the waitqueue
Index: wli-2.6.6-rc3-mm1/mm/filemap.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/mm/filemap.c 2004-04-30 15:06:49.000000000 -0700
+++ wli-2.6.6-rc3-mm1/mm/filemap.c 2004-04-30 19:50:33.000000000 -0700
@@ -307,16 +307,16 @@
void fastcall wait_on_page_bit(struct page *page, int bit_nr)
{
wait_queue_head_t *waitqueue = page_waitqueue(page);
- DEFINE_WAIT(wait);
+ DEFINE_FILTERED_WAIT(wait, page);
do {
- prepare_to_wait(waitqueue, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait(waitqueue, &wait.wait, TASK_UNINTERRUPTIBLE);
if (test_bit(bit_nr, &page->flags)) {
sync_page(page);
io_schedule();
}
} while (test_bit(bit_nr, &page->flags));
- finish_wait(waitqueue, &wait);
+ finish_wait(waitqueue, &wait.wait);
}
EXPORT_SYMBOL(wait_on_page_bit);
@@ -344,7 +344,7 @@
BUG();
smp_mb__after_clear_bit();
if (waitqueue_active(waitqueue))
- wake_up_all(waitqueue);
+ wake_up_filtered(waitqueue, page);
}
EXPORT_SYMBOL(unlock_page);
@@ -363,7 +363,7 @@
smp_mb__after_clear_bit();
}
if (waitqueue_active(waitqueue))
- wake_up_all(waitqueue);
+ wake_up_filtered(waitqueue, page);
}
EXPORT_SYMBOL(end_page_writeback);
@@ -379,16 +379,16 @@
void fastcall __lock_page(struct page *page)
{
wait_queue_head_t *wqh = page_waitqueue(page);
- DEFINE_WAIT(wait);
+ DEFINE_FILTERED_WAIT(wait, page);
while (TestSetPageLocked(page)) {
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait(wqh, &wait.wait, TASK_UNINTERRUPTIBLE);
if (PageLocked(page)) {
sync_page(page);
io_schedule();
}
}
- finish_wait(wqh, &wait);
+ finish_wait(wqh, &wait.wait);
}
EXPORT_SYMBOL(__lock_page);
* [2/2] filtered buffer_head wakeups
2004-05-03 2:29 ` William Lee Irwin III
@ 2004-05-03 2:32 ` William Lee Irwin III
0 siblings, 0 replies; 6+ messages in thread
From: William Lee Irwin III @ 2004-05-03 2:32 UTC (permalink / raw)
To: akpm, linux-kernel
On Sun, May 02, 2004 at 07:29:36PM -0700, William Lee Irwin III wrote:
> This patch creates a new scheduling entrypoint, wake_up_filtered(), and
> uses it in page waitqueue hashing to discriminate between the waiters
> on various pages. One of the sources of the thundering herds was
> identified as the page waitqueue hashing by a priori methods and
> empirically confirmed using the scheduler caller profiling patch.
It turned out there were still thundering herds even after fixing the
page waitqueue hashtable. The scheduler caller profiling patch
identified the buffer_head waitqueue hashtable as the remaining source.
This patch teaches the buffer_head waitqueue hashing to use filtered
wakeups as well.
-- wli
Index: wli-2.6.6-rc3-mm1/fs/buffer.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/fs/buffer.c 2004-04-30 15:06:46.000000000 -0700
+++ wli-2.6.6-rc3-mm1/fs/buffer.c 2004-04-30 19:51:25.000000000 -0700
@@ -74,7 +74,7 @@
smp_mb();
if (waitqueue_active(wq))
- wake_up_all(wq);
+ wake_up_filtered(wq, bh);
}
EXPORT_SYMBOL(wake_up_buffer);
@@ -93,10 +93,10 @@
void __wait_on_buffer(struct buffer_head * bh)
{
wait_queue_head_t *wqh = bh_waitq_head(bh);
- DEFINE_WAIT(wait);
+ DEFINE_FILTERED_WAIT(wait, bh);
do {
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ prepare_to_wait(wqh, &wait.wait, TASK_UNINTERRUPTIBLE);
if (buffer_locked(bh)) {
struct block_device *bd;
smp_mb();
@@ -106,7 +106,7 @@
io_schedule();
}
} while (buffer_locked(bh));
- finish_wait(wqh, &wait);
+ finish_wait(wqh, &wait.wait);
}
static void
Index: wli-2.6.6-rc3-mm1/fs/jbd/transaction.c
===================================================================
--- wli-2.6.6-rc3-mm1.orig/fs/jbd/transaction.c 2004-04-30 15:06:46.000000000 -0700
+++ wli-2.6.6-rc3-mm1/fs/jbd/transaction.c 2004-04-30 19:51:25.000000000 -0700
@@ -638,7 +638,7 @@
jbd_unlock_bh_state(bh);
/* commit wakes up all shadow buffers after IO */
wqh = bh_waitq_head(jh2bh(jh));
- wait_event(*wqh, (jh->b_jlist != BJ_Shadow));
+ wait_event_filtered(*wqh, jh2bh(jh), (jh->b_jlist != BJ_Shadow));
goto repeat;
}
Index: wli-2.6.6-rc3-mm1/include/linux/wait.h
===================================================================
--- wli-2.6.6-rc3-mm1.orig/include/linux/wait.h 2004-04-30 19:50:33.000000000 -0700
+++ wli-2.6.6-rc3-mm1/include/linux/wait.h 2004-04-30 19:51:25.000000000 -0700
@@ -146,7 +146,6 @@
break; \
__wait_event(wq, condition); \
} while (0)
-
#define __wait_event_interruptible(wq, condition, ret) \
do { \
wait_queue_t __wait; \
@@ -273,7 +272,28 @@
.task_list = LIST_HEAD_INIT(name.wait.task_list),\
}, \
}
-
+
+#define __wait_event_filtered(wq, key, condition) \
+do { \
+ DEFINE_FILTERED_WAIT(__wait, key); \
+ add_wait_queue(&(wq), &__wait.wait); \
+ for (;;) { \
+ set_current_state(TASK_UNINTERRUPTIBLE); \
+ if (condition) \
+ break; \
+ schedule(); \
+ } \
+ current->state = TASK_RUNNING; \
+ remove_wait_queue(&(wq), &__wait.wait); \
+} while (0)
+
+
+#define wait_event_filtered(wq, key, condition) \
+do { \
+ if (!(condition)) \
+ __wait_event_filtered(wq, key, condition); \
+} while (0)
+
#endif /* __KERNEL__ */
#endif
* Re: [0/2] filtered wakeups
2004-05-03 2:17 [0/2] filtered wakeups William Lee Irwin III
2004-05-03 2:23 ` [0.5/2] scheduler caller profiling William Lee Irwin III
@ 2004-05-03 2:46 ` William Lee Irwin III
1 sibling, 0 replies; 6+ messages in thread
From: William Lee Irwin III @ 2004-05-03 2:46 UTC (permalink / raw)
To: akpm, linux-kernel
On Sun, May 02, 2004 at 07:17:09PM -0700, William Lee Irwin III wrote:
> before:
> Tiotest results for 512 concurrent io threads:
Parting shot: I also used time(1):
before:
tiotest -t 512 -f 32 -b 4096 -d . 14337.17s user 3931.52s system 301% cpu 1:40:51.08 total
after:
tiotest -t 512 -f 32 -b 4096 -d . 10985.23s user 3524.50s system 266% cpu 1:30:48.80 total
i.e. it sped up the run by 10 minutes, or 10% of the total execution time.
-- wli
* Re: [0.5/2] scheduler caller profiling
2004-05-03 2:23 ` [0.5/2] scheduler caller profiling William Lee Irwin III
2004-05-03 2:29 ` William Lee Irwin III
@ 2004-05-03 18:51 ` David Mosberger
1 sibling, 0 replies; 6+ messages in thread
From: David Mosberger @ 2004-05-03 18:51 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: akpm, linux-kernel
Bill> On Sun, May 02, 2004 at 07:17:09PM -0700, William Lee Irwin
Bill> III wrote:
>> The thundering herd issue in waitqueue hashing has been seen in
>> practice. In order to preserve the space footprint reduction
>> while improving performance, I wrote "filtered wakeups", which
>> discriminate between waiters based on a key.
Bill> This patch was used to collect the data on the offending
Bill> callers into the scheduler. It creates a profile buffer handled
Bill> completely analogously to /proc/profile, but registers profile
Bill> ticks at calls to the various scheduler entry points instead of
Bill> during timer ticks, and rearranges scheduler code so these are
Bill> accounted properly. It does not
Bill> report meaningful statistics in the presence of
Bill> CONFIG_PREEMPT.
Bill> Posting this patch is in order to disclose how I obtained the
Bill> scheduling statistics reported in the first post. This patch
Bill> is not intended for inclusion.
Note that on ia64, you can use q-syscollect/q-view to collect
call-counts statistically (with zero intrusion to the monitored
program, so it's safe for the kernel). While the call-graph/counts
won't be perfectly accurate, this has proven to work extremely well in
practice. In fact, it would be nice if other arches could support the
same. All you really need for this to work is the ability to count N
call (or return) instructions and record the source and destination
address of the N-th call somewhere (registers or memory). I looked at
the P4 performance-monitor briefly but I couldn't quite figure out
whether it supports the required functionality (it seemed like it
could only record _all_ branches, which would be a problem). If a
P4-expert is interested in pursuing this, let me know. I'd be happy
to help/advise with some of the more subtle issues that need to be
addressed to get this to work correctly.
--david