public inbox for linux-kernel@vger.kernel.org
* 2.6.15-rc5-rt2 slowness
@ 2005-12-16 11:30 Gunter Ohrner
  2005-12-16 11:42 ` Gunter Ohrner
                   ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Gunter Ohrner @ 2005-12-16 11:30 UTC (permalink / raw)
  To: linux-kernel

Hi!

Thanks to Steven's Kconfig fixes I was able to compile 2.6.15-rc5 with
Ingo's rt2-patch just fine.

I have two small problems with it, however. One is the Oops I just posted; the
other is high system load and poor responsiveness of the system as a whole. I
didn't even bother to measure timer/sleep jitter as the system was dog slow
and the fans started to run at full speed almost immediately.

We observed this on two different systems; one of them uses the config attached
to my mail with the Oops/backtrace.

I'll try to recompile the kernel with some debug options. If anyone knows of
something specific I should look for, that would be helpful.

Greetings,

  Gunter


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.6.15-rc5-rt2 slowness
  2005-12-16 11:30 2.6.15-rc5-rt2 slowness Gunter Ohrner
@ 2005-12-16 11:42 ` Gunter Ohrner
  2005-12-16 12:04   ` Gunter Ohrner
  2005-12-16 12:34   ` Steven Rostedt
  2005-12-16 12:32 ` Steven Rostedt
  2005-12-17  3:33 ` Steven Rostedt
  2 siblings, 2 replies; 56+ messages in thread
From: Gunter Ohrner @ 2005-12-16 11:42 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 474 bytes --]

Gunter Ohrner wrote:
> the other is high system load and poor responsiveness of the system as a
> whole. I didn't even bother to measure timer/sleep jitter as the system
> was dog slow and the fans started to run at full speed almost immediately.

Ok, I recompiled the kernel with some debug options and attached
a /proc/latency_trace capture, hoping it helps track down the
problem...
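
As a side note, this is roughly how the attached log was produced, plus a
quick way to pull out the worst-case figure (a minimal sketch; it assumes the
"latency: N us" header line emitted by the v1.1.5 tracer, as in the
attachment):

```shell
# Save the current latency trace, if the tracer is compiled in.
cat /proc/latency_trace > lat_trace.log 2>/dev/null || true

# Extract the reported worst-case latency from the trace header
# (the "latency: 45 us, ..." line in the attached example).
grep -m1 -o 'latency: *[0-9]* us' lat_trace.log
```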

Please tell me if there's anything else I should do.

Greetings,

  Gunter

[-- Attachment #2: lat_trace.log --]
[-- Type: text/plain, Size: 6646 bytes --]

preemption latency trace v1.1.5 on 2.6.15-rc5-rt2.zb.20051216.1
--------------------------------------------------------------------
 latency: 45 us, #94/94, CPU#0 | (M:rt VP:0, KP:0, SP:1 HP:1)
    -----------------
    | task: IRQ 12-733 (uid:0 nice:-5 policy:1 rt_prio:47)
    -----------------

                 _------=> CPU#            
                / _-----=> irqs-off        
               | / _----=> need-resched    
               || / _---=> hardirq/softirq 
               ||| / _--=> preempt-depth   
               |||| /                      
               |||||     delay             
   cmd     pid ||||| time  |   caller      
      \   /    |||||   \   |   /           
konquero-4635  0D.h3    0us : __trace_start_sched_wakeup (try_to_wake_up)
konquero-4635  0D.h3    1us : __trace_start_sched_wakeup <<...>-733> (34 0)
konquero-4635  0Dnh2    1us : try_to_wake_up <<...>-733> (34 74)
konquero-4635  0Dnh2    1us : check_raw_flags (try_to_wake_up)
konquero-4635  0Dnh1    1us : preempt_schedule (try_to_wake_up)
konquero-4635  0Dnh1    1us : wake_up_process (redirect_hardirq)
konquero-4635  0Dnh.    2us : preempt_schedule (__do_IRQ)
konquero-4635  0Dnh.    2us : irq_exit (do_IRQ)
konquero-4635  0Dn..    2us : __schedule (work_resched)
konquero-4635  0Dn..    3us : profile_hit (__schedule)
konquero-4635  0Dn.1    3us : sched_clock (__schedule)
konquero-4635  0D..2    5us : trace_array (__schedule)
konquero-4635  0D..2    6us : trace_array <<...>-733> (34 34)
konquero-4635  0D..2    6us : trace_array <konquero-4635> (74 78)
konquero-4635  0D..2    7us : trace_array <<...>-4458> (74 78)
konquero-4635  0D..2    7us : trace_array <<...>-4621> (75 78)
konquero-4635  0D..2    7us : trace_array <<...>-5894> (76 78)
konquero-4635  0D..2    8us : trace_array <<...>-5892> (77 78)
konquero-4635  0D..2    8us+: trace_array (__schedule)
   <...>-733   0D..2   11us : __switch_to (__schedule)
   <...>-733   0D..2   11us : __schedule <konquero-4635> (74 34)
   <...>-733   0D.h2   13us : do_IRQ (c030af09 0 0)
   <...>-733   0D.h3   13us+: mask_and_ack_8259A (__do_IRQ)
   <...>-733   0D.h4   16us : check_raw_flags (mask_and_ack_8259A)
   <...>-733   0D.h3   16us : redirect_hardirq (__do_IRQ)
   <...>-733   0D.h2   16us : handle_IRQ_event (__do_IRQ)
   <...>-733   0D.h2   17us : timer_interrupt (handle_IRQ_event)
   <...>-733   0D.h2   18us : handle_nextevent_tick_update (timer_interrupt)
   <...>-733   0D.h2   18us : hrtimer_interrupt (handle_nextevent_tick_update)
   <...>-733   0D.h2   19us : get_monotonic_clock (hrtimer_interrupt)
   <...>-733   0D.h2   19us : acpi_pm_read (get_monotonic_clock)
   <...>-733   0D.h2   20us : get_check_value (get_monotonic_clock)
   <...>-733   0D.h3   21us : check_raw_flags (get_check_value)
   <...>-733   0D.h2   21us : check_monotonic_clock (get_monotonic_clock)
   <...>-733   0D.h3   21us : check_raw_flags (check_monotonic_clock)
   <...>-733   0D.h2   22us : clockevents_set_next_event (hrtimer_interrupt)
   <...>-733   0D.h2   22us : get_monotonic_clock (clockevents_set_next_event)
   <...>-733   0D.h2   23us : acpi_pm_read (get_monotonic_clock)
   <...>-733   0D.h2   23us : get_check_value (get_monotonic_clock)
   <...>-733   0D.h3   24us : check_raw_flags (get_check_value)
   <...>-733   0D.h2   24us : check_monotonic_clock (get_monotonic_clock)
   <...>-733   0D.h3   24us : check_raw_flags (check_monotonic_clock)
   <...>-733   0D.h2   25us+: pit_next_event (clockevents_set_next_event)
   <...>-733   0D.h3   29us : check_raw_flags (pit_next_event)
   <...>-733   0D.h2   29us : handle_tick (handle_nextevent_tick_update)
   <...>-733   0D.h3   29us : do_timer (handle_tick)
   <...>-733   0D.h2   30us : handle_update (handle_nextevent_tick_update)
   <...>-733   0D.h2   30us : update_process_times (handle_update)
   <...>-733   0D.h2   31us : account_system_time (update_process_times)
   <...>-733   0D.h2   31us : run_local_timers (update_process_times)
   <...>-733   0D.h2   31us : raise_softirq (run_local_timers)
   <...>-733   0D.h2   32us : wakeup_softirqd (raise_softirq)
   <...>-733   0D.h2   32us : wake_up_process (wakeup_softirqd)
   <...>-733   0D.h2   32us : check_preempt_wakeup (wake_up_process)
   <...>-733   0D.h2   33us : try_to_wake_up (wake_up_process)
   <...>-733   0D.h3   33us : activate_task (try_to_wake_up)
   <...>-733   0D.h3   33us : sched_clock (activate_task)
   <...>-733   0D.h3   33us : activate_task <<...>-3> (62 6)
   <...>-733   0D.h3   33us : enqueue_task (activate_task)
   <...>-733   0D.h3   34us : check_raw_flags (try_to_wake_up)
   <...>-733   0D.h2   34us : wake_up_process (wakeup_softirqd)
   <...>-733   0D.h2   34us : check_raw_flags (raise_softirq)
   <...>-733   0D.h2   34us : rcu_pending (update_process_times)
   <...>-733   0D.h2   35us : rcu_check_callbacks (update_process_times)
   <...>-733   0D.h2   35us : rcu_try_flip (rcu_check_callbacks)
   <...>-733   0D.h3   36us : check_raw_flags (rcu_try_flip)
   <...>-733   0D.h3   36us : __rcu_advance_callbacks (rcu_check_callbacks)
   <...>-733   0D.h3   36us : check_raw_flags (rcu_check_callbacks)
   <...>-733   0D.h2   36us : __tasklet_schedule (rcu_check_callbacks)
   <...>-733   0D.h2   37us : wakeup_softirqd (__tasklet_schedule)
   <...>-733   0D.h2   37us : wake_up_process (wakeup_softirqd)
   <...>-733   0D.h2   37us : check_preempt_wakeup (wake_up_process)
   <...>-733   0D.h2   38us : try_to_wake_up (wake_up_process)
   <...>-733   0D.h3   38us : activate_task (try_to_wake_up)
   <...>-733   0D.h3   38us : sched_clock (activate_task)
   <...>-733   0D.h3   38us : activate_task <<...>-7> (62 7)
   <...>-733   0D.h3   38us : enqueue_task (activate_task)
   <...>-733   0D.h3   39us : check_raw_flags (try_to_wake_up)
   <...>-733   0D.h2   39us : wake_up_process (wakeup_softirqd)
   <...>-733   0D.h2   39us : check_raw_flags (__tasklet_schedule)
   <...>-733   0D.h2   39us : scheduler_tick (update_process_times)
   <...>-733   0D.h2   39us : sched_clock (scheduler_tick)
   <...>-733   0D.h2   40us : softlockup_tick (update_process_times)
   <...>-733   0D.h2   40us : touch_light_softlockup_watchdog (softlockup_tick)
   <...>-733   0D.h3   41us : note_interrupt (__do_IRQ)
   <...>-733   0D.h3   41us : enable_8259A_irq (__do_IRQ)
   <...>-733   0D.h4   42us : check_raw_flags (enable_8259A_irq)
   <...>-733   0D.h2   43us : irq_exit (do_IRQ)
   <...>-733   0D..2   43us < (608)
   <...>-733   0...1   43us : trace_stop_sched_switched (__schedule)
   <...>-733   0D..2   44us : trace_stop_sched_switched <<...>-733> (34 0)
   <...>-733   0D..2   45us : trace_stop_sched_switched (__schedule)




* Re: 2.6.15-rc5-rt2 slowness
  2005-12-16 11:42 ` Gunter Ohrner
@ 2005-12-16 12:04   ` Gunter Ohrner
  2005-12-16 12:34   ` Steven Rostedt
  1 sibling, 0 replies; 56+ messages in thread
From: Gunter Ohrner @ 2005-12-16 12:04 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 812 bytes --]

Hi!

Some further info:

,----[ cat rcuctrs rcugp rcuptrs rcustats schedstat ]
| CPU last cur
|   0    0   0
| ggp = 36340
| oldggp=36343  newggp=36354
| nl=c07fd218/c96abee8 nt=c94eae1c
|  wl=c07fd220/00000000 wt=c07fd220 dl=c07fd228/00000000 dt=c07fd228
| ggp=36368 lgp=36368 rcc=36368
| na=135275 nl=3 wa=135272 wl=0 da=135272 dl=0 dr=135272 di=135272
| rtf1=36624 rtf2=36624 rtf3=36368 rtfe1=0 rtfe2=0 rtfe3=256
| version 12
| timestamp 80698
| cpu0 0 0 4 4 736 1123111 85725 0 0 92271 332649 1037386
`----

.config of the currently running kernel is attached.

The system I'm trying this on is a Celeron-M 1.5 GHz notebook with an i865
chipset. The CPU was forced to the ACPI C0 state:

# cat /sys/module/processor/parameters/max_cstate
0

cpufreq scaling governor is set to "performance".
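
For reference, both knobs were inspected and set through the usual 2.6-era
interfaces (a sketch, not verified on other setups; the sysfs paths assume
cpu0 and the ACPI processor driver being loaded):

```shell
# C-state limit: 0 keeps the CPU in C0 (no deeper ACPI sleep states).
# This is normally forced at boot, e.g. with processor.max_cstate=0
# on the kernel command line.
cat /sys/module/processor/parameters/max_cstate

# cpufreq governor for CPU 0 (writing requires root and the
# "performance" governor being compiled in):
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```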

Greetings,

  Gunter

[-- Attachment #2: config-2.6.15-rc5-rt2.zb.20051216.2 --]
[-- Type: text/plain, Size: 39538 bytes --]

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.15-rc5-rt2.zb.20051216.2
# Fri Dec 16 12:16:58 2005
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_UID16=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_CLEAN_COMPILE=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
CONFIG_SYSCTL=y
# CONFIG_AUDIT is not set
CONFIG_HOTPLUG=y
CONFIG_KOBJECT_UEVENT=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_EMBEDDED is not set
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_CC_ALIGN_FUNCTIONS=0
CONFIG_CC_ALIGN_LABELS=0
CONFIG_CC_ALIGN_LOOPS=0
CONFIG_CC_ALIGN_JUMPS=0
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_SLOB=y

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_OBSOLETE_MODPARM=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y

#
# Block layer
#
# CONFIG_LBD is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
CONFIG_MPENTIUMIII=y
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_HIGH_RES_RESOLUTION=1000
# CONFIG_SMP is not set
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT_DESKTOP is not set
CONFIG_PREEMPT_RT=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_SOFTIRQS=y
CONFIG_PREEMPT_HARDIRQS=y
CONFIG_PREEMPT_BKL=y
CONFIG_PREEMPT_RCU=y
CONFIG_RCU_STATS=y
CONFIG_ASM_SEMAPHORES=y
CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
# CONFIG_X86_MCE_P4THERMAL is not set
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
CONFIG_MICROCODE=m
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m

#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_EFI is not set
CONFIG_SECCOMP=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_PHYSICAL_START=0x100000
# CONFIG_KEXEC is not set

#
# Power management options (ACPI, APM)
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set
# CONFIG_SOFTWARE_SUSPEND is not set

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
# CONFIG_ACPI_HOTKEY is not set
CONFIG_ACPI_FAN=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_IBM is not set
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y

#
# APM (Advanced Power Management) BIOS Support
#

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# CPUFreq processor drivers
#
# CONFIG_X86_ACPI_CPUFREQ is not set
# CONFIG_X86_POWERNOW_K6 is not set
# CONFIG_X86_POWERNOW_K7 is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_GX_SUSPMOD is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_SPEEDSTEP_ICH=y
# CONFIG_X86_SPEEDSTEP_SMI is not set
CONFIG_X86_P4_CLOCKMOD=y
# CONFIG_X86_CPUFREQ_NFORCE2 is not set
# CONFIG_X86_LONGRUN is not set
# CONFIG_X86_LONGHAUL is not set

#
# shared options
#
CONFIG_X86_SPEEDSTEP_LIB=y
CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK=y

#
# Bus options (PCI, PCMCIA, EISA, MCA, ISA)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
# CONFIG_PCIEPORTBUS is not set
# CONFIG_PCI_MSI is not set
# CONFIG_PCI_LEGACY_PROC is not set
# CONFIG_PCI_DEBUG is not set
CONFIG_ISA_DMA_API=y
CONFIG_ISA=y
# CONFIG_EISA is not set
# CONFIG_MCA is not set
# CONFIG_SCx200 is not set

#
# PCCARD (PCMCIA/CardBus) support
#
CONFIG_PCCARD=m
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=m
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_PCMCIA_IOCTL=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=m
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
# CONFIG_I82365 is not set
# CONFIG_TCIC is not set
CONFIG_PCMCIA_PROBE=y
CONFIG_PCCARD_NONSTATIC=m

#
# PCI Hotplug Support
#
# CONFIG_HOTPLUG_PCI is not set

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
# CONFIG_BINFMT_AOUT is not set
CONFIG_BINFMT_MISC=y

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
CONFIG_NET_KEY=y
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=y
CONFIG_NET_IPGRE=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=y
CONFIG_INET_ESP=y
CONFIG_INET_IPCOMP=y
CONFIG_INET_TUNNEL=y
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_BIC=y

#
# IP: Virtual Server Configuration
#
# CONFIG_IP_VS is not set
# CONFIG_IPV6 is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m

#
# IP: Netfilter Configuration
#
CONFIG_IP_NF_CONNTRACK=m
CONFIG_IP_NF_CT_ACCT=y
CONFIG_IP_NF_CONNTRACK_MARK=y
# CONFIG_IP_NF_CONNTRACK_EVENTS is not set
CONFIG_IP_NF_CONNTRACK_NETLINK=m
CONFIG_IP_NF_CT_PROTO_SCTP=m
CONFIG_IP_NF_FTP=m
CONFIG_IP_NF_IRC=m
CONFIG_IP_NF_NETBIOS_NS=m
CONFIG_IP_NF_TFTP=m
CONFIG_IP_NF_AMANDA=m
CONFIG_IP_NF_PPTP=m
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_LIMIT=m
CONFIG_IP_NF_MATCH_IPRANGE=m
CONFIG_IP_NF_MATCH_MAC=m
CONFIG_IP_NF_MATCH_PKTTYPE=m
CONFIG_IP_NF_MATCH_MARK=m
CONFIG_IP_NF_MATCH_MULTIPORT=m
CONFIG_IP_NF_MATCH_TOS=m
CONFIG_IP_NF_MATCH_RECENT=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_DSCP=m
CONFIG_IP_NF_MATCH_AH_ESP=m
CONFIG_IP_NF_MATCH_LENGTH=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_MATCH_TCPMSS=m
CONFIG_IP_NF_MATCH_HELPER=m
CONFIG_IP_NF_MATCH_STATE=m
CONFIG_IP_NF_MATCH_CONNTRACK=m
CONFIG_IP_NF_MATCH_OWNER=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_MATCH_REALM=m
CONFIG_IP_NF_MATCH_SCTP=m
CONFIG_IP_NF_MATCH_DCCP=m
CONFIG_IP_NF_MATCH_COMMENT=m
CONFIG_IP_NF_MATCH_CONNMARK=m
CONFIG_IP_NF_MATCH_CONNBYTES=m
CONFIG_IP_NF_MATCH_HASHLIMIT=m
CONFIG_IP_NF_MATCH_STRING=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_IP_NF_TARGET_TCPMSS=m
CONFIG_IP_NF_TARGET_NFQUEUE=m
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_SAME=m
CONFIG_IP_NF_NAT_SNMP_BASIC=m
CONFIG_IP_NF_NAT_IRC=m
CONFIG_IP_NF_NAT_FTP=m
CONFIG_IP_NF_NAT_TFTP=m
CONFIG_IP_NF_NAT_AMANDA=m
CONFIG_IP_NF_NAT_PPTP=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_TOS=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_DSCP=m
CONFIG_IP_NF_TARGET_MARK=m
CONFIG_IP_NF_TARGET_CLASSIFY=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_TARGET_CONNMARK=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_TARGET_NOTRACK=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

#
# DCCP Configuration (EXPERIMENTAL)
#
CONFIG_IP_DCCP=m
CONFIG_INET_DCCP_DIAG=m

#
# DCCP CCIDs Configuration (EXPERIMENTAL)
#
CONFIG_IP_DCCP_CCID3=m
CONFIG_IP_DCCP_TFRC_LIB=m

#
# DCCP Kernel Hacking
#
# CONFIG_IP_DCCP_DEBUG is not set
# CONFIG_IP_DCCP_UNLOAD_HACK is not set

#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_NET_DIVERT is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set

#
# QoS and/or fair queueing
#
CONFIG_NET_SCHED=y
# CONFIG_NET_SCH_CLK_JIFFIES is not set
# CONFIG_NET_SCH_CLK_GETTIMEOFDAY is not set
CONFIG_NET_SCH_CLK_CPU=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
CONFIG_NET_SCH_HTB=m
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_RED is not set
CONFIG_NET_SCH_SFQ=m
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_INGRESS is not set

#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=m
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_CLS_U32_PERF=y
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
CONFIG_NET_EMATCH_CMP=m
CONFIG_NET_EMATCH_NBYTE=m
CONFIG_NET_EMATCH_U32=m
CONFIG_NET_EMATCH_META=m
CONFIG_NET_EMATCH_TEXT=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=m
CONFIG_NET_ACT_GACT=m
CONFIG_GACT_PROB=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_IPT=m
CONFIG_NET_ACT_PEDIT=m
CONFIG_NET_ACT_SIMP=m
# CONFIG_NET_CLS_IND is not set
CONFIG_NET_ESTIMATOR=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
CONFIG_BT=m
CONFIG_BT_L2CAP=m
CONFIG_BT_SCO=m
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
# CONFIG_BT_BNEP_MC_FILTER is not set
# CONFIG_BT_BNEP_PROTO_FILTER is not set
CONFIG_BT_HIDP=m

#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=m
CONFIG_BT_HCIUSB_SCO=y
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIDTL1 is not set
# CONFIG_BT_HCIBT3C is not set
# CONFIG_BT_HCIBLUECARD is not set
# CONFIG_BT_HCIBTUART is not set
# CONFIG_BT_HCIVHCI is not set
CONFIG_IEEE80211=m
# CONFIG_IEEE80211_DEBUG is not set
CONFIG_IEEE80211_CRYPT_WEP=m
CONFIG_IEEE80211_CRYPT_CCMP=m
CONFIG_IEEE80211_CRYPT_TKIP=m

#
# Device Drivers
#

#
# Generic Driver Options
#
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set
CONFIG_FW_LOADER=y
# CONFIG_DEBUG_DRIVER is not set

#
# Connector - unified userspace <-> kernelspace linker
#
CONFIG_CONNECTOR=m

#
# Memory Technology Devices (MTD)
#
# CONFIG_MTD is not set

#
# Parallel port support
#
# CONFIG_PARPORT is not set

#
# Plug and Play support
#
CONFIG_PNP=y
CONFIG_PNP_DEBUG=y

#
# Protocols
#
CONFIG_ISAPNP=y
# CONFIG_PNPBIOS is not set
CONFIG_PNPACPI=y

#
# Block devices
#
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_XD is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
# CONFIG_BLK_DEV_RAM is not set
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
# CONFIG_ATA_OVER_ETH is not set

#
# ATA/ATAPI/MFM/RLL support
#
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_IDE_SATA is not set
# CONFIG_BLK_DEV_HD_IDE is not set
CONFIG_BLK_DEV_IDEDISK=y
# CONFIG_IDEDISK_MULTI_MODE is not set
# CONFIG_BLK_DEV_IDECS is not set
CONFIG_BLK_DEV_IDECD=m
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEFLOPPY is not set
# CONFIG_BLK_DEV_IDESCSI is not set
# CONFIG_IDE_TASK_IOCTL is not set

#
# IDE chipset support/bugfixes
#
CONFIG_IDE_GENERIC=y
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set
CONFIG_BLK_DEV_IDEPCI=y
# CONFIG_IDEPCI_SHARE_IRQ is not set
# CONFIG_BLK_DEV_OFFBOARD is not set
CONFIG_BLK_DEV_GENERIC=y
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
CONFIG_IDEDMA_PCI_AUTO=y
# CONFIG_IDEDMA_ONLYDISK is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_CS5535 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_SC1200 is not set
CONFIG_BLK_DEV_PIIX=y
# CONFIG_BLK_DEV_IT821X is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_IDE_ARM is not set
# CONFIG_IDE_CHIPSETS is not set
CONFIG_BLK_DEV_IDEDMA=y
# CONFIG_IDEDMA_IVB is not set
CONFIG_IDEDMA_AUTO=y
# CONFIG_BLK_DEV_HD is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=m
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=m
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
# CONFIG_BLK_DEV_SR is not set
# CONFIG_CHR_DEV_SG is not set
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
# CONFIG_SCSI_MULTI_LUN is not set
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set

#
# SCSI Transport Attributes
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set

#
# SCSI low-level drivers
#
# CONFIG_ISCSI_TCP is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_7000FASST is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AHA152X is not set
# CONFIG_SCSI_AHA1542 is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_IN2000 is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_SATA is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_PSI240I is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
# CONFIG_SCSI_QLOGIC_FC is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA2XXX=m
# CONFIG_SCSI_QLA21XX is not set
# CONFIG_SCSI_QLA22XX is not set
# CONFIG_SCSI_QLA2300 is not set
# CONFIG_SCSI_QLA2322 is not set
# CONFIG_SCSI_QLA6312 is not set
# CONFIG_SCSI_QLA24XX is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set

#
# PCMCIA SCSI adapter support
#
# CONFIG_PCMCIA_AHA152X is not set
# CONFIG_PCMCIA_FDOMAIN is not set
# CONFIG_PCMCIA_NINJA_SCSI is not set
# CONFIG_PCMCIA_QLOGIC is not set
# CONFIG_PCMCIA_SYM53C500 is not set

#
# Old CD-ROM drivers (not SCSI, not IDE)
#
# CONFIG_CD_NO_IDESCSI is not set

#
# Multi-device support (RAID and LVM)
#
# CONFIG_MD is not set

#
# Fusion MPT device support
#
# CONFIG_FUSION is not set
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set

#
# IEEE 1394 (FireWire) support
#
CONFIG_IEEE1394=y

#
# Subsystem Options
#
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
# CONFIG_IEEE1394_OUI_DB is not set
CONFIG_IEEE1394_EXTRA_CONFIG_ROMS=y
CONFIG_IEEE1394_CONFIG_ROM_IP1394=y
# CONFIG_IEEE1394_EXPORT_FULL_API is not set

#
# Device Drivers
#

#
# Texas Instruments PCILynx requires I2C
#
CONFIG_IEEE1394_OHCI1394=m

#
# Protocol Drivers
#
CONFIG_IEEE1394_VIDEO1394=m
CONFIG_IEEE1394_SBP2=m
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
CONFIG_IEEE1394_ETH1394=m
CONFIG_IEEE1394_DV1394=m
CONFIG_IEEE1394_RAWIO=m
CONFIG_IEEE1394_CMP=m
CONFIG_IEEE1394_AMDTP=m

#
# I2O device support
#
# CONFIG_I2O is not set

#
# Network device support
#
CONFIG_NETDEVICES=y
CONFIG_DUMMY=y
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
CONFIG_TUN=y
# CONFIG_NET_SB1000 is not set

#
# ARCnet devices
#
# CONFIG_ARCNET is not set

#
# PHY device support
#
# CONFIG_PHYLIB is not set

#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_LANCE is not set
# CONFIG_NET_VENDOR_SMC is not set
# CONFIG_NET_VENDOR_RACAL is not set

#
# Tulip family network device support
#
# CONFIG_NET_TULIP is not set
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_HP100 is not set
# CONFIG_NET_ISA is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_CS89x0 is not set
# CONFIG_DGRS is not set
CONFIG_EEPRO100=m
CONFIG_E100=m
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
CONFIG_8139TOO=y
CONFIG_8139TOO_PIO=y
# CONFIG_8139TOO_TUNE_TWISTER is not set
# CONFIG_8139TOO_8129 is not set
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set

#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SK98LIN is not set
# CONFIG_VIA_VELOCITY is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set

#
# Ethernet (10000 Mbit)
#
# CONFIG_CHELSIO_T1 is not set
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set

#
# Token Ring devices
#
# CONFIG_TR is not set

#
# Wireless LAN (non-hamradio)
#
CONFIG_NET_RADIO=y

#
# Obsolete Wireless cards support (pre-802.11)
#
# CONFIG_STRIP is not set
# CONFIG_ARLAN is not set
# CONFIG_WAVELAN is not set
# CONFIG_PCMCIA_WAVELAN is not set
# CONFIG_PCMCIA_NETWAVE is not set

#
# Wireless 802.11 Frequency Hopping cards support
#
# CONFIG_PCMCIA_RAYCS is not set

#
# Wireless 802.11b ISA/PCI cards support
#
# CONFIG_IPW2100 is not set
# CONFIG_IPW2200 is not set
# CONFIG_AIRO is not set
# CONFIG_HERMES is not set
# CONFIG_ATMEL is not set

#
# Wireless 802.11b Pcmcia/Cardbus cards support
#
# CONFIG_AIRO_CS is not set
# CONFIG_PCMCIA_WL3501 is not set

#
# Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support
#
CONFIG_PRISM54=m
# CONFIG_HOSTAP is not set
CONFIG_NET_WIRELESS=y

#
# PCMCIA network device support
#
# CONFIG_NET_PCMCIA is not set

#
# Wan interfaces
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
CONFIG_PPP=m
# CONFIG_PPP_MULTILINK is not set
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_BSDCOMP=m
# CONFIG_PPP_MPPE is not set
CONFIG_PPPOE=m
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set

#
# ISDN subsystem
#
# CONFIG_ISDN is not set

#
# Telephony Support
#
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_SERIAL=m
# CONFIG_MOUSE_INPORT is not set
# CONFIG_MOUSE_LOGIBM is not set
# CONFIG_MOUSE_PC110PAD is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
CONFIG_INPUT_PCSPKR=y
# CONFIG_INPUT_WISTRON_BTNS is not set
CONFIG_INPUT_UINPUT=m

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=m
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_SERIAL_NONSTANDARD is not set

#
# Serial drivers
#
# CONFIG_SERIAL_8250 is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256

#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set

#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
# CONFIG_HW_RANDOM is not set
CONFIG_NVRAM=m
CONFIG_RTC=y
# CONFIG_RTC_HISTOGRAM is not set
# CONFIG_BLOCKER is not set
# CONFIG_LPPTEST is not set
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set

#
# Ftape, the floppy tape device driver
#
# CONFIG_FTAPE is not set
CONFIG_AGP=y
# CONFIG_AGP_ALI is not set
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=y
# CONFIG_AGP_NVIDIA is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_AGP_EFFICEON is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_I810 is not set
CONFIG_DRM_I830=m
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_MWAVE is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_HPET_MMAP=y
# CONFIG_HANGCHECK_TIMER is not set

#
# TPM devices
#
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set

#
# I2C support
#
# CONFIG_I2C is not set

#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set

#
# Hardware Monitoring support
#
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Misc devices
#
# CONFIG_IBM_ASM is not set

#
# Multimedia Capabilities Port drivers
#

#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set

#
# Digital Video Broadcasting Devices
#
# CONFIG_DVB is not set

#
# Graphics support
#
# CONFIG_FB is not set
CONFIG_VIDEO_SELECT=y

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_MDA_CONSOLE is not set
CONFIG_DUMMY_CONSOLE=y

#
# Sound
#
CONFIG_SOUND=y

#
# Advanced Linux Sound Architecture
#
CONFIG_SND=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_AC97_BUS=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_SEQUENCER=m
# CONFIG_SND_SEQ_DUMMY is not set
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_RTCTIMER=m
CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set

#
# Generic devices
#
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_VIRMIDI is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set

#
# ISA devices
#
# CONFIG_SND_AD1816A is not set
# CONFIG_SND_AD1848 is not set
# CONFIG_SND_CS4231 is not set
# CONFIG_SND_CS4232 is not set
# CONFIG_SND_CS4236 is not set
# CONFIG_SND_ES968 is not set
# CONFIG_SND_ES1688 is not set
# CONFIG_SND_ES18XX is not set
# CONFIG_SND_GUSCLASSIC is not set
# CONFIG_SND_GUSEXTREME is not set
# CONFIG_SND_GUSMAX is not set
# CONFIG_SND_INTERWAVE is not set
# CONFIG_SND_INTERWAVE_STB is not set
# CONFIG_SND_OPTI92X_AD1848 is not set
# CONFIG_SND_OPTI92X_CS4231 is not set
# CONFIG_SND_OPTI93X is not set
# CONFIG_SND_SB8 is not set
# CONFIG_SND_SB16 is not set
# CONFIG_SND_SBAWE is not set
# CONFIG_SND_WAVEFRONT is not set
# CONFIG_SND_ALS100 is not set
# CONFIG_SND_AZT2320 is not set
# CONFIG_SND_CMI8330 is not set
# CONFIG_SND_DT019X is not set
# CONFIG_SND_OPL3SA2 is not set
# CONFIG_SND_SGALAXY is not set
# CONFIG_SND_SSCAPE is not set

#
# PCI devices
#
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_YMFPCI is not set
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_FM801 is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=m
CONFIG_SND_INTEL8X0M=m
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_HDA_INTEL is not set

#
# USB devices
#
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set

#
# PCMCIA devices
#
# CONFIG_SND_VXPOCKET is not set
# CONFIG_SND_PDAUDIOCF is not set

#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set

#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB=m
# CONFIG_USB_DEBUG is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
CONFIG_USB_BANDWIDTH=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set

#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=m
# CONFIG_USB_EHCI_SPLIT_ISO is not set
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_OHCI_HCD is not set
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_SL811_HCD is not set

#
# USB Device Class drivers
#
# CONFIG_OBSOLETE_OSS_USB_DRIVER is not set
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=m

#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#

#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set

#
# USB Input Devices
#
CONFIG_USB_HID=m
CONFIG_USB_HIDINPUT=y
# CONFIG_HID_FF is not set
CONFIG_USB_HIDDEV=y

#
# USB HID Boot Protocol drivers
#
# CONFIG_USB_KBD is not set
# CONFIG_USB_MOUSE is not set
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_ACECAD is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_MTOUCH is not set
# CONFIG_USB_ITMTOUCH is not set
# CONFIG_USB_EGALAX is not set
# CONFIG_USB_YEALINK is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set
# CONFIG_USB_KEYSPAN_REMOTE is not set
# CONFIG_USB_APPLETOUCH is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB Multimedia devices
#
# CONFIG_USB_DABUSB is not set

#
# Video4Linux support is needed for USB Multimedia device support
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_ZD1201 is not set
# CONFIG_USB_MON is not set

#
# USB port drivers
#

#
# USB Serial Converter support
#
CONFIG_USB_SERIAL=m
# CONFIG_USB_SERIAL_GENERIC is not set
# CONFIG_USB_SERIAL_AIRPRIME is not set
# CONFIG_USB_SERIAL_ANYDATA is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP2101 is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
# CONFIG_USB_SERIAL_FTDI_SIO is not set
CONFIG_USB_SERIAL_VISOR=m
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_HP4X is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OMNINET is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGETKIT is not set
# CONFIG_USB_PHIDGETSERVO is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TEST is not set

#
# USB DSL modem support
#

#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set

#
# MMC/SD Card support
#
CONFIG_MMC=m
# CONFIG_MMC_DEBUG is not set
CONFIG_MMC_BLOCK=m
CONFIG_MMC_WBSD=m

#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set

#
# SN Devices
#

#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
# CONFIG_EXT3_FS is not set
# CONFIG_JBD is not set
CONFIG_REISERFS_FS=m
# CONFIG_REISERFS_CHECK is not set
# CONFIG_REISERFS_PROC_INFO is not set
# CONFIG_REISERFS_FS_XATTR is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_INOTIFY=y
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_ZISOFS_FS=m
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
# CONFIG_MSDOS_FS is not set
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-15"
CONFIG_NTFS_FS=m
# CONFIG_NTFS_DEBUG is not set
# CONFIG_NTFS_RW is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
CONFIG_RELAYFS_FS=m

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set

#
# Network File Systems
#
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
# CONFIG_NFS_V3_ACL is not set
CONFIG_NFS_V4=y
# CONFIG_NFS_DIRECTIO is not set
CONFIG_NFSD=m
CONFIG_NFSD_V3=y
# CONFIG_NFSD_V3_ACL is not set
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_RPCSEC_GSS_KRB5=m
# CONFIG_RPCSEC_GSS_SPKM3 is not set
CONFIG_SMB_FS=m
CONFIG_SMB_NLS_DEFAULT=y
CONFIG_SMB_NLS_REMOTE="iso8859-15"
CONFIG_CIFS=m
# CONFIG_CIFS_STATS is not set
# CONFIG_CIFS_XATTR is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_9P_FS is not set

#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y

#
# Native Language Support
#
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-15"
CONFIG_NLS_CODEPAGE_437=m
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
CONFIG_NLS_CODEPAGE_850=m
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
CONFIG_NLS_CODEPAGE_1250=m
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=m
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
CONFIG_NLS_ISO8859_15=y
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=m

#
# Instrumentation Support
#
# CONFIG_PROFILING is not set
CONFIG_PROFILE_NMI=y
# CONFIG_KPROBES is not set

#
# Kernel hacking
#
# CONFIG_PRINTK_TIME is not set
# CONFIG_PRINTK_IGNORE_LOGLEVEL is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_MAGIC_SYSRQ is not set
CONFIG_LOG_BUF_SHIFT=14
CONFIG_PARANOID_GENERIC_TIME=y
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_SCHEDSTATS=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_DEBUG_IRQ_FLAGS=y
CONFIG_WAKEUP_TIMING=y
# CONFIG_WAKEUP_LATENCY_HIST is not set
CONFIG_PREEMPT_TRACE=y
CONFIG_CRITICAL_PREEMPT_TIMING=y
# CONFIG_PREEMPT_OFF_HIST is not set
CONFIG_CRITICAL_IRQSOFF_TIMING=y
# CONFIG_INTERRUPT_OFF_HIST is not set
CONFIG_CRITICAL_TIMING=y
CONFIG_LATENCY_TIMING=y
CONFIG_LATENCY_TRACE=y
CONFIG_MCOUNT=y
CONFIG_DEBUG_DEADLOCKS=y
# CONFIG_DEBUG_RT_LOCKING_MODE is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_FS is not set
# CONFIG_DEBUG_VM is not set
CONFIG_FRAME_POINTER=y
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_EARLY_PRINTK=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACK_USAGE=y
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_4KSTACKS is not set
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set

#
# Cryptographic options
#
CONFIG_CRYPTO=y
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_TGR192 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_SERPENT is not set
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_586=y
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_TEA is not set
CONFIG_CRYPTO_ARC4=m
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_MICHAEL_MIC=m
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_TEST is not set

#
# Hardware crypto devices
#
# CONFIG_CRYPTO_DEV_PADLOCK is not set

#
# Library routines
#
CONFIG_CRC_CCITT=y
# CONFIG_CRC16 is not set
CONFIG_CRC32=y
# CONFIG_LIBCRC32C is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_KTIME_SCALAR=y

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.6.15-rc5-rt2 slowness
  2005-12-16 11:30 2.6.15-rc5-rt2 slowness Gunter Ohrner
  2005-12-16 11:42 ` Gunter Ohrner
@ 2005-12-16 12:32 ` Steven Rostedt
  2005-12-16 22:58   ` john stultz
  2005-12-17  3:33 ` Steven Rostedt
  2 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2005-12-16 12:32 UTC (permalink / raw)
  To: G.Ohrner; +Cc: john stultz, Thomas Gleixner, Ingo Molnar, linux-kernel

On Fri, 2005-12-16 at 12:30 +0100, Gunter Ohrner wrote:
> Hi!
> 
> Thanks to Steven's Kconfig fixes I was able to compile 2.6.15-rc5 with
> Ingo's rt2-patch just fine.
> 
> I have two small problems with it, however. One is the Oops just posted, the
> other is a high system load and bad responsiveness of the system as a
> whole. I didn't even bother to measure timer/sleep jitters as the system
> was dog slow and the fans started to run at full speed nearly immediately.
> 
> We observed this on two different systems, one with the config attached to
> my mail with the Oops/backtrace.
> 
> I'll try to recompile the kernel with some debug options; if anyone knows of
> something I should specifically look for, that would be helpful.

I'll look into your oops later (or maybe Ingo has some time), but I've
also noticed the slowness of 2.6.15-rc5-rt2, and I'm investigating it
now.

Thanks,

-- Steve



* Re: 2.6.15-rc5-rt2 slowness
  2005-12-16 11:42 ` Gunter Ohrner
  2005-12-16 12:04   ` Gunter Ohrner
@ 2005-12-16 12:34   ` Steven Rostedt
  1 sibling, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-16 12:34 UTC (permalink / raw)
  To: G.Ohrner; +Cc: Ingo Molnar, linux-kernel

On Fri, 2005-12-16 at 12:42 +0100, Gunter Ohrner wrote:
> Gunter Ohrner wrote:
> > the other is a high system load and bad responsiveness of the system as a
> > whole. I didn't even bother to measure timer/sleep jitters as the system
> > was dog slow and the fans started to run at full speed nearly immediately.
> 
> Ok, I recompiled the kernel with some debug options and attached
> a /proc/latency_trace output, hoping it will be helpful to track down the
> problem...
> 
> Please tell me if there's anything else I should do.

Sorry, your latency trace doesn't help.  The 45 usec is not really a
latency (well, not a bad one).

-- Steve
	



* Re: 2.6.15-rc5-rt2 slowness
  2005-12-16 12:32 ` Steven Rostedt
@ 2005-12-16 22:58   ` john stultz
  2005-12-17  0:22     ` Gunter Ohrner
  2005-12-17  3:51     ` Steven Rostedt
  0 siblings, 2 replies; 56+ messages in thread
From: john stultz @ 2005-12-16 22:58 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: G.Ohrner, Thomas Gleixner, Ingo Molnar, linux-kernel

On Fri, 2005-12-16 at 07:32 -0500, Steven Rostedt wrote:
> I'll look into your oops later (or maybe Ingo has some time), but I've
> also noticed the slowness of 2.6.15-rc5-rt2, and I'm investigating it
> now.

Hey Steven,
	Do check that the slowness you're seeing isn't related to the
CONFIG_PARANOID_GENERIC_TIME option being enabled. It is expected that
the extra checks made by that config option would slow things down a
bit.

thanks
-john



* Re: 2.6.15-rc5-rt2 slowness
  2005-12-16 22:58   ` john stultz
@ 2005-12-17  0:22     ` Gunter Ohrner
  2005-12-17  3:51     ` Steven Rostedt
  1 sibling, 0 replies; 56+ messages in thread
From: Gunter Ohrner @ 2005-12-17  0:22 UTC (permalink / raw)
  To: linux-kernel

john stultz wrote:
> Do check that the slowness you're seeing isn't related to the
> CONFIG_PARANOID_GENERIC_TIME option being enabled. It is expected that
> the extra checks made by that config option would slow things down a
> bit.

The first kernel I built which showed this behaviour had no debugging
options enabled.

It happens when the system is mostly idle: in this state "top" shows a
kernel CPU usage of 20%-50%, and as soon as something CPU-intensive is
started, the whole system becomes extremely unresponsive.

Greetings,

  Gunter



* Re: 2.6.15-rc5-rt2 slowness
  2005-12-16 11:30 2.6.15-rc5-rt2 slowness Gunter Ohrner
  2005-12-16 11:42 ` Gunter Ohrner
  2005-12-16 12:32 ` Steven Rostedt
@ 2005-12-17  3:33 ` Steven Rostedt
  2005-12-17 22:57   ` Steven Rostedt
  2 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2005-12-17  3:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: john stultz, Gunter Ohrner, linux-kernel

Ingo,

After searching all over to find out where the slowness is, I finally
discovered it.  It's the SLOB!

I noticed that a make install of the kernel over NFS took ~26 seconds
to complete on 2.6.15-rc5, and almost 2 minutes on 2.6.15-rc5-rt2 for
the same operation on the same machine.

I added my logdev device to record lots of output, and it found the
place that is taking the longest:

In 2.6.15-rc5-rt2:

[  789.171773] cpu:0 kfree_skbmem:291 in
[  789.171873] cpu:0 kfree_skbmem:295 1
[  789.172357] cpu:0 kfree_skbmem:320 out

in 2.6.15-rc5:

[  343.253988] cpu:0 kfree_skbmem:291 in
[  343.253990] cpu:0 kfree_skbmem:295 1
[  343.253991] cpu:0 kfree_skbmem:320 out

Here's the code for both systems (they are identical here):

void kfree_skbmem(struct sk_buff *skb)
{
	struct sk_buff *other;
	atomic_t *fclone_ref;

	edprint("in");
	skb_release_data(skb);
	switch (skb->fclone) {
	case SKB_FCLONE_UNAVAILABLE:
	edprint("1");
		kmem_cache_free(skbuff_head_cache, skb);
		break;

	case SKB_FCLONE_ORIG:
	edprint("2");
		fclone_ref = (atomic_t *) (skb + 2);
		if (atomic_dec_and_test(fclone_ref))
			kmem_cache_free(skbuff_fclone_cache, skb);
		break;

	case SKB_FCLONE_CLONE:
	edprint("3");
		fclone_ref = (atomic_t *) (skb + 1);
		other = skb - 1;

		/* The clone portion is available for
		 * fast-cloning again.
		 */
		skb->fclone = SKB_FCLONE_UNAVAILABLE;

		if (atomic_dec_and_test(fclone_ref))
			kmem_cache_free(skbuff_fclone_cache, other);
		break;
	};
	edprint("out");
}

My edprint records into a ring buffer (much like relayfs), and produces
the above output.  The time in brackets is in seconds.  The gap between
"1" and "out" differs greatly between the two kernels.  (Note: I have an
edprint in all interrupts, so I would know if one was taken, and these
times are not a one-time fluke; they show up like this every time.)

So for 2.6.15-rc5 the time difference is 343.253991 - 343.253990 or
1 usec, whereas the time for 2.6.15-rc5-rt2 is 789.172357 - 789.171873
or 484 usecs!  We're talking about a 48,400% increase here!

The difference here is that kmem_cache_free in -rt goes to the SLOB, whereas
the vanilla kernel still uses the SLAB.  What's the rationale for the SLOB now?

The patches used are here:

For the logdev device:
http://www.kihontech.com/logdev/logdev-2.6.15-rc5-rt2.patch
http://www.kihontech.com/logdev/logdev-2.6.15-rc5.patch

For debugging (on top of logdev):
http://www.kihontech.com/logdev/debug-2.6.15-rc5-rt2.patch
http://www.kihontech.com/logdev/debug-2.6.15-rc5.patch

(I also added my patches previously posted to get it to compile and
handle the softirq hrtimer problems).

Thanks,

-- Steve




* Re: 2.6.15-rc5-rt2 slowness
  2005-12-16 22:58   ` john stultz
  2005-12-17  0:22     ` Gunter Ohrner
@ 2005-12-17  3:51     ` Steven Rostedt
  1 sibling, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-17  3:51 UTC (permalink / raw)
  To: john stultz; +Cc: G.Ohrner, Thomas Gleixner, Ingo Molnar, linux-kernel

On Fri, 2005-12-16 at 14:58 -0800, john stultz wrote:
> On Fri, 2005-12-16 at 07:32 -0500, Steven Rostedt wrote:
> > I'll look into your oops later (or maybe Ingo has some time), but I've
> > also noticed the slowness of 2.6.15-rc5-rt2, and I'm investigating it
> > now.
> 
> Hey Steven,
> 	Do check that the slowness you're seeing isn't related to the
> CONFIG_PARANOID_GENERIC_TIME option being enabled. It is expected that
> the extra checks made by that config option would slow things down a
> bit.
> 

Hi John,

Thanks for the suggestion, but I've been running my tests with that
turned off.  Actually, I've turned off pretty much all debugging, and
added my own logdev device.  As mentioned in my previous email, that I
CC you on, I found the culprit.

Seems the SLOB is not as fast as the SLAB. I'll look more into this
tomorrow. (My wife is taking my daughter out of town for gymnastics, so
I get to stay home and work :-/ )

-- Steve




* Re: 2.6.15-rc5-rt2 slowness
  2005-12-17  3:33 ` Steven Rostedt
@ 2005-12-17 22:57   ` Steven Rostedt
  2005-12-18 16:05     ` K.R. Foley
  2005-12-20 13:32     ` Ingo Molnar
  0 siblings, 2 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-17 22:57 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Gunter Ohrner, john stultz

Ingo,

I ported your old 2.6.14-rt22 changes to mm/slab.c over to
2.6.15-rc5-rt2 and tried them out.  I believe this confirms that the
SLOB _is_ the cause of the slowness.  Booting with this slab patch gives
the old speeds that we used to have.

Now, is the solution to bring the SLOB up to par with the SLAB, or to
make the SLAB as close as possible to mainline (why remove NUMA?) and
keep it for PREEMPT_RT?

Below is the port of the slab changes if anyone else would like to see
if this speeds things up for them.

-- Steve

Index: linux-2.6.15-rc5-rt2/init/Kconfig
===================================================================
--- linux-2.6.15-rc5-rt2.orig/init/Kconfig	2005-12-17 14:09:22.000000000 -0500
+++ linux-2.6.15-rc5-rt2/init/Kconfig	2005-12-17 14:09:41.000000000 -0500
@@ -402,7 +402,7 @@
 	default y
 	bool "Use full SLAB allocator" if EMBEDDED
 	# we switch to the SLOB on PREEMPT_RT
-	depends on !PREEMPT_RT
+#	depends on !PREEMPT_RT
 	help
 	  Disabling this replaces the advanced SLAB allocator and
 	  kmalloc support with the drastically simpler SLOB allocator.
Index: linux-2.6.15-rc5-rt2/mm/slab.c
===================================================================
--- linux-2.6.15-rc5-rt2.orig/mm/slab.c	2005-12-17 16:44:10.000000000 -0500
+++ linux-2.6.15-rc5-rt2/mm/slab.c	2005-12-17 17:27:30.000000000 -0500
@@ -75,15 +75,6 @@
  *
  *	At present, each engine can be growing a cache.  This should be blocked.
  *
- * 15 March 2005. NUMA slab allocator.
- *	Shai Fultheim <shai@scalex86.org>.
- *	Shobhit Dayal <shobhit@calsoftinc.com>
- *	Alok N Kataria <alokk@calsoftinc.com>
- *	Christoph Lameter <christoph@lameter.com>
- *
- *	Modified the slab allocator to be node aware on NUMA systems.
- *	Each node has its own list of partial, free and full slabs.
- *	All object allocations for a node occur from node specific slab lists.
  */
 
 #include	<linux/config.h>
@@ -102,7 +93,6 @@
 #include	<linux/module.h>
 #include	<linux/rcupdate.h>
 #include	<linux/string.h>
-#include	<linux/nodemask.h>
 
 #include	<asm/uaccess.h>
 #include	<asm/cacheflush.h>
@@ -222,7 +212,6 @@
 	void			*s_mem;		/* including colour offset */
 	unsigned int		inuse;		/* num of objs active in slab */
 	kmem_bufctl_t		free;
-	unsigned short          nodeid;
 };
 
 /*
@@ -250,6 +239,7 @@
 /*
  * struct array_cache
  *
+ * Per cpu structures
  * Purpose:
  * - LIFO ordering, to hand out cache-warm objects from _alloc
  * - reduce the number of linked list operations
@@ -264,13 +254,6 @@
 	unsigned int limit;
 	unsigned int batchcount;
 	unsigned int touched;
-	spinlock_t lock;
-	void *entry[0];		/*
-				 * Must have this definition in here for the proper
-				 * alignment of array_cache. Also simplifies accessing
-				 * the entries.
-				 * [0] is for gcc 2.95. It should really be [].
-				 */
 };
 
 /* bootstrap: The caches do not work without cpuarrays anymore,
@@ -283,84 +266,34 @@
 };
 
 /*
- * The slab lists for all objects.
+ * The slab lists of all objects.
+ * Hopefully reduce the internal fragmentation
+ * NUMA: The spinlock could be moved from the kmem_cache_t
+ * into this structure, too. Figure out what causes
+ * fewer cross-node spinlock operations.
  */
 struct kmem_list3 {
 	struct list_head	slabs_partial;	/* partial list first, better asm code */
 	struct list_head	slabs_full;
 	struct list_head	slabs_free;
 	unsigned long	free_objects;
-	unsigned long	next_reap;
 	int		free_touched;
-	unsigned int 	free_limit;
-	spinlock_t      list_lock;
-	struct array_cache	*shared;	/* shared per node */
-	struct array_cache	**alien;	/* on other nodes */
+	unsigned long	next_reap;
+	struct array_cache	*shared;
 };
 
-/*
- * Need this for bootstrapping a per node allocator.
- */
-#define NUM_INIT_LISTS (2 * MAX_NUMNODES + 1)
-struct kmem_list3 __initdata initkmem_list3[NUM_INIT_LISTS];
-#define	CACHE_CACHE 0
-#define	SIZE_AC 1
-#define	SIZE_L3 (1 + MAX_NUMNODES)
-
-/*
- * This function must be completely optimized away if
- * a constant is passed to it. Mostly the same as
- * what is in linux/slab.h except it returns an
- * index.
- */
-static __always_inline int index_of(const size_t size)
-{
-	if (__builtin_constant_p(size)) {
-		int i = 0;
-
-#define CACHE(x) \
-	if (size <=x) \
-		return i; \
-	else \
-		i++;
-#include "linux/kmalloc_sizes.h"
-#undef CACHE
-		{
-			extern void __bad_size(void);
-			__bad_size();
-		}
-	} else
-		BUG();
-	return 0;
-}
-
-#define INDEX_AC index_of(sizeof(struct arraycache_init))
-#define INDEX_L3 index_of(sizeof(struct kmem_list3))
-
-static inline void kmem_list3_init(struct kmem_list3 *parent)
-{
-	INIT_LIST_HEAD(&parent->slabs_full);
-	INIT_LIST_HEAD(&parent->slabs_partial);
-	INIT_LIST_HEAD(&parent->slabs_free);
-	parent->shared = NULL;
-	parent->alien = NULL;
-	spin_lock_init(&parent->list_lock);
-	parent->free_objects = 0;
-	parent->free_touched = 0;
-}
-
-#define MAKE_LIST(cachep, listp, slab, nodeid)	\
-	do {	\
-		INIT_LIST_HEAD(listp);		\
-		list_splice(&(cachep->nodelists[nodeid]->slab), listp); \
-	} while (0)
-
-#define	MAKE_ALL_LISTS(cachep, ptr, nodeid)			\
-	do {					\
-	MAKE_LIST((cachep), (&(ptr)->slabs_full), slabs_full, nodeid);	\
-	MAKE_LIST((cachep), (&(ptr)->slabs_partial), slabs_partial, nodeid); \
-	MAKE_LIST((cachep), (&(ptr)->slabs_free), slabs_free, nodeid);	\
-	} while (0)
+#define LIST3_INIT(parent) \
+	{ \
+		.slabs_full	= LIST_HEAD_INIT(parent.slabs_full), \
+		.slabs_partial	= LIST_HEAD_INIT(parent.slabs_partial), \
+		.slabs_free	= LIST_HEAD_INIT(parent.slabs_free) \
+	}
+#define list3_data(cachep) \
+	(&(cachep)->lists)
+
+/* NUMA: per-node */
+#define list3_data_ptr(cachep, ptr) \
+		list3_data(cachep)
 
 /*
  * kmem_cache_t
@@ -373,12 +306,13 @@
 	struct array_cache	*array[NR_CPUS];
 	unsigned int		batchcount;
 	unsigned int		limit;
-	unsigned int 		shared;
-	unsigned int		objsize;
 /* 2) touched by every alloc & free from the backend */
-	struct kmem_list3	*nodelists[MAX_NUMNODES];
+	struct kmem_list3	lists;
+	/* NUMA: kmem_list3_t	*nodelists[MAX_NUMNODES] */
+	unsigned int		objsize;
 	unsigned int	 	flags;	/* constant flags */
 	unsigned int		num;	/* # of objs per slab */
+	unsigned int		free_limit; /* upper limit of objects in the lists */
 	spinlock_t		spinlock;
 
 /* 3) cache_grow/shrink */
@@ -415,7 +349,6 @@
 	unsigned long 		errors;
 	unsigned long		max_freeable;
 	unsigned long		node_allocs;
-	unsigned long		node_frees;
 	atomic_t		allochit;
 	atomic_t		allocmiss;
 	atomic_t		freehit;
@@ -451,7 +384,6 @@
 				} while (0)
 #define	STATS_INC_ERR(x)	((x)->errors++)
 #define	STATS_INC_NODEALLOCS(x)	((x)->node_allocs++)
-#define	STATS_INC_NODEFREES(x)	((x)->node_frees++)
 #define	STATS_SET_FREEABLE(x, i) \
 				do { if ((x)->max_freeable < i) \
 					(x)->max_freeable = i; \
@@ -470,7 +402,6 @@
 #define	STATS_SET_HIGH(x)	do { } while (0)
 #define	STATS_INC_ERR(x)	do { } while (0)
 #define	STATS_INC_NODEALLOCS(x)	do { } while (0)
-#define	STATS_INC_NODEFREES(x)	do { } while (0)
 #define	STATS_SET_FREEABLE(x, i) \
 				do { } while (0)
 
@@ -618,9 +549,9 @@
 
 /* internal cache of cache description objs */
 static kmem_cache_t cache_cache = {
+	.lists		= LIST3_INIT(cache_cache.lists),
 	.batchcount	= 1,
 	.limit		= BOOT_CPUCACHE_ENTRIES,
-	.shared		= 1,
 	.objsize	= sizeof(kmem_cache_t),
 	.flags		= SLAB_NO_REAP,
 	.spinlock	= SPIN_LOCK_UNLOCKED(cache_cache.spinlock),
@@ -641,6 +572,7 @@
  * SLAB_RECLAIM_ACCOUNT turns this on per-slab
  */
 atomic_t slab_reclaim_pages;
+EXPORT_SYMBOL(slab_reclaim_pages);
 
 /*
  * chicken and egg problem: delay the per-cpu array allocation
@@ -648,24 +580,28 @@
  */
 static enum {
 	NONE,
-	PARTIAL_AC,
-	PARTIAL_L3,
+	PARTIAL,
 	FULL
 } g_cpucache_up;
 
 static DEFINE_PER_CPU(struct work_struct, reap_work);
 
-static void free_block(kmem_cache_t* cachep, void** objpp, int len, int node);
+static void free_block(kmem_cache_t* cachep, void** objpp, int len);
 static void enable_cpucache (kmem_cache_t *cachep);
 static void cache_reap (void *unused);
-static int __node_shrink(kmem_cache_t *cachep, int node);
 
-static inline struct array_cache *ac_data(kmem_cache_t *cachep)
+static inline void **ac_entry(struct array_cache *ac)
 {
-	return cachep->array[smp_processor_id()];
+	return (void**)(ac+1);
 }
 
-static inline kmem_cache_t *__find_general_cachep(size_t size, gfp_t gfpflags)
+static inline struct array_cache *ac_data(kmem_cache_t *cachep, int cpu)
+{
+	return cachep->array[cpu];
+}
+
+static inline kmem_cache_t *__find_general_cachep(size_t size,
+						unsigned int __nocast gfpflags)
 {
 	struct cache_sizes *csizep = malloc_sizes;
 
@@ -674,13 +610,13 @@
  	* kmem_cache_create(), or __kmalloc(), before
  	* the generic caches are initialized.
  	*/
-	BUG_ON(malloc_sizes[INDEX_AC].cs_cachep == NULL);
+	BUG_ON(csizep->cs_cachep == NULL);
 #endif
 	while (size > csizep->cs_size)
 		csizep++;
 
 	/*
-	 * Really subtle: The last entry with cs->cs_size==ULONG_MAX
+	 * Really subtle: The last entry with cs->cs_size==ULONG_MAX
 	 * has cs_{dma,}cachep==NULL. Thus no special case
 	 * for large kmalloc calls required.
 	 */
@@ -689,7 +625,8 @@
 	return csizep->cs_cachep;
 }
 
-kmem_cache_t *kmem_find_general_cachep(size_t size, gfp_t gfpflags)
+kmem_cache_t *kmem_find_general_cachep(size_t size,
+		unsigned int __nocast gfpflags)
 {
 	return __find_general_cachep(size, gfpflags);
 }
@@ -754,160 +691,48 @@
 	}
 }
 
-static struct array_cache *alloc_arraycache(int node, int entries,
+static struct array_cache *alloc_arraycache(int cpu, int entries,
 						int batchcount)
 {
 	int memsize = sizeof(void*)*entries+sizeof(struct array_cache);
 	struct array_cache *nc = NULL;
 
-	nc = kmalloc_node(memsize, GFP_KERNEL, node);
+	if (cpu == -1)
+		nc = kmalloc(memsize, GFP_KERNEL);
+	else
+		nc = kmalloc_node(memsize, GFP_KERNEL, cpu_to_node(cpu));
+
 	if (nc) {
 		nc->avail = 0;
 		nc->limit = entries;
 		nc->batchcount = batchcount;
 		nc->touched = 0;
-		spin_lock_init(&nc->lock);
 	}
 	return nc;
 }
 
-#ifdef CONFIG_NUMA
-static inline struct array_cache **alloc_alien_cache(int node, int limit)
-{
-	struct array_cache **ac_ptr;
-	int memsize = sizeof(void*)*MAX_NUMNODES;
-	int i;
-
-	if (limit > 1)
-		limit = 12;
-	ac_ptr = kmalloc_node(memsize, GFP_KERNEL, node);
-	if (ac_ptr) {
-		for_each_node(i) {
-			if (i == node || !node_online(i)) {
-				ac_ptr[i] = NULL;
-				continue;
-			}
-			ac_ptr[i] = alloc_arraycache(node, limit, 0xbaadf00d);
-			if (!ac_ptr[i]) {
-				for (i--; i <=0; i--)
-					kfree(ac_ptr[i]);
-				kfree(ac_ptr);
-				return NULL;
-			}
-		}
-	}
-	return ac_ptr;
-}
-
-static inline void free_alien_cache(struct array_cache **ac_ptr)
-{
-	int i;
-
-	if (!ac_ptr)
-		return;
-
-	for_each_node(i)
-		kfree(ac_ptr[i]);
-
-	kfree(ac_ptr);
-}
-
-static inline void __drain_alien_cache(kmem_cache_t *cachep, struct array_cache *ac, int node)
-{
-	struct kmem_list3 *rl3 = cachep->nodelists[node];
-
-	if (ac->avail) {
-		spin_lock(&rl3->list_lock);
-		free_block(cachep, ac->entry, ac->avail, node);
-		ac->avail = 0;
-		spin_unlock(&rl3->list_lock);
-	}
-}
-
-static void drain_alien_cache(kmem_cache_t *cachep, struct kmem_list3 *l3)
-{
-	int i=0;
-	struct array_cache *ac;
-	unsigned long flags;
-
-	for_each_online_node(i) {
-		ac = l3->alien[i];
-		if (ac) {
-			spin_lock_irqsave(&ac->lock, flags);
-			__drain_alien_cache(cachep, ac, i);
-			spin_unlock_irqrestore(&ac->lock, flags);
-		}
-	}
-}
-#else
-#define alloc_alien_cache(node, limit) do { } while (0)
-#define free_alien_cache(ac_ptr) do { } while (0)
-#define drain_alien_cache(cachep, l3) do { } while (0)
-#endif
-
 static int __devinit cpuup_callback(struct notifier_block *nfb,
 				  unsigned long action, void *hcpu)
 {
 	long cpu = (long)hcpu;
 	kmem_cache_t* cachep;
-	struct kmem_list3 *l3 = NULL;
-	int node = cpu_to_node(cpu);
-	int memsize = sizeof(struct kmem_list3);
-	struct array_cache *nc = NULL;
 
 	switch (action) {
 	case CPU_UP_PREPARE:
 		down(&cache_chain_sem);
-		/* we need to do this right in the beginning since
-		 * alloc_arraycache's are going to use this list.
-		 * kmalloc_node allows us to add the slab to the right
-		 * kmem_list3 and not this cpu's kmem_list3
-		 */
-
 		list_for_each_entry(cachep, &cache_chain, next) {
-			/* setup the size64 kmemlist for cpu before we can
-			 * begin anything. Make sure some other cpu on this
-			 * node has not already allocated this
-			 */
-			if (!cachep->nodelists[node]) {
-				if (!(l3 = kmalloc_node(memsize,
-						GFP_KERNEL, node)))
-					goto bad;
-				kmem_list3_init(l3);
-				l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
-				  ((unsigned long)cachep)%REAPTIMEOUT_LIST3;
-
-				cachep->nodelists[node] = l3;
-			}
-
-			spin_lock_irq(&cachep->nodelists[node]->list_lock);
-			cachep->nodelists[node]->free_limit =
-				(1 + nr_cpus_node(node)) *
-				cachep->batchcount + cachep->num;
-			spin_unlock_irq(&cachep->nodelists[node]->list_lock);
-		}
+			struct array_cache *nc;
 
-		/* Now we can go ahead with allocating the shared array's
-		  & array cache's */
-		list_for_each_entry(cachep, &cache_chain, next) {
-			nc = alloc_arraycache(node, cachep->limit,
-					cachep->batchcount);
+			nc = alloc_arraycache(cpu, cachep->limit, cachep->batchcount);
 			if (!nc)
 				goto bad;
+
+			spin_lock_irq(&cachep->spinlock);
 			cachep->array[cpu] = nc;
+			cachep->free_limit = (1+num_online_cpus())*cachep->batchcount
+						+ cachep->num;
+			spin_unlock_irq(&cachep->spinlock);
 
-			l3 = cachep->nodelists[node];
-			BUG_ON(!l3);
-			if (!l3->shared) {
-				if (!(nc = alloc_arraycache(node,
-					cachep->shared*cachep->batchcount,
-					0xbaadf00d)))
-					goto  bad;
-
-				/* we are serialised from CPU_DEAD or
-				  CPU_UP_CANCELLED by the cpucontrol lock */
-				l3->shared = nc;
-			}
 		}
 		up(&cache_chain_sem);
 		break;
@@ -922,51 +747,13 @@
 
 		list_for_each_entry(cachep, &cache_chain, next) {
 			struct array_cache *nc;
-			cpumask_t mask;
 
-			mask = node_to_cpumask(node);
 			spin_lock_irq(&cachep->spinlock);
 			/* cpu is dead; no one can alloc from it. */
 			nc = cachep->array[cpu];
 			cachep->array[cpu] = NULL;
-			l3 = cachep->nodelists[node];
-
-			if (!l3)
-				goto unlock_cache;
-
-			spin_lock(&l3->list_lock);
-
-			/* Free limit for this kmem_list3 */
-			l3->free_limit -= cachep->batchcount;
-			if (nc)
-				free_block(cachep, nc->entry, nc->avail, node);
-
-			if (!cpus_empty(mask)) {
-                                spin_unlock(&l3->list_lock);
-                                goto unlock_cache;
-                        }
-
-			if (l3->shared) {
-				free_block(cachep, l3->shared->entry,
-						l3->shared->avail, node);
-				kfree(l3->shared);
-				l3->shared = NULL;
-			}
-			if (l3->alien) {
-				drain_alien_cache(cachep, l3);
-				free_alien_cache(l3->alien);
-				l3->alien = NULL;
-			}
-
-			/* free slabs belonging to this node */
-			if (__node_shrink(cachep, node)) {
-				cachep->nodelists[node] = NULL;
-				spin_unlock(&l3->list_lock);
-				kfree(l3);
-			} else {
-				spin_unlock(&l3->list_lock);
-			}
-unlock_cache:
+			cachep->free_limit -= cachep->batchcount;
+			free_block(cachep, ac_entry(nc), nc->avail);
 			spin_unlock_irq(&cachep->spinlock);
 			kfree(nc);
 		}
@@ -982,25 +769,6 @@
 
 static struct notifier_block cpucache_notifier = { &cpuup_callback, NULL, 0 };
 
-/*
- * swap the static kmem_list3 with kmalloced memory
- */
-static void init_list(kmem_cache_t *cachep, struct kmem_list3 *list,
-		int nodeid)
-{
-	struct kmem_list3 *ptr;
-
-	BUG_ON(cachep->nodelists[nodeid] != list);
-	ptr = kmalloc_node(sizeof(struct kmem_list3), GFP_KERNEL, nodeid);
-	BUG_ON(!ptr);
-
-	local_irq_disable();
-	memcpy(ptr, list, sizeof(struct kmem_list3));
-	MAKE_ALL_LISTS(cachep, ptr, nodeid);
-	cachep->nodelists[nodeid] = ptr;
-	local_irq_enable();
-}
-
 /* Initialisation.
  * Called after the gfp() functions have been enabled, and before smp_init().
  */
@@ -1009,13 +777,6 @@
 	size_t left_over;
 	struct cache_sizes *sizes;
 	struct cache_names *names;
-	int i;
-
-	for (i = 0; i < NUM_INIT_LISTS; i++) {
-		kmem_list3_init(&initkmem_list3[i]);
-		if (i < MAX_NUMNODES)
-			cache_cache.nodelists[i] = NULL;
-	}
 
 	/*
 	 * Fragmentation resistance on low memory - only use bigger
@@ -1024,24 +785,21 @@
 	if (num_physpages > (32 << 20) >> PAGE_SHIFT)
 		slab_break_gfp_order = BREAK_GFP_ORDER_HI;
 
+
 	/* Bootstrap is tricky, because several objects are allocated
 	 * from caches that do not exist yet:
 	 * 1) initialize the cache_cache cache: it contains the kmem_cache_t
 	 *    structures of all caches, except cache_cache itself: cache_cache
 	 *    is statically allocated.
-	 *    Initially an __init data area is used for the head array and the
-	 *    kmem_list3 structures, it's replaced with a kmalloc allocated
-	 *    array at the end of the bootstrap.
+	 *    Initially an __init data area is used for the head array, it's
+	 *    replaced with a kmalloc allocated array at the end of the bootstrap.
 	 * 2) Create the first kmalloc cache.
-	 *    The kmem_cache_t for the new cache is allocated normally.
-	 *    An __init data area is used for the head array.
-	 * 3) Create the remaining kmalloc caches, with minimally sized
-	 *    head arrays.
+	 *    The kmem_cache_t for the new cache is allocated normally. An __init
+	 *    data area is used for the head array.
+	 * 3) Create the remaining kmalloc caches, with minimally sized head arrays.
 	 * 4) Replace the __init data head arrays for cache_cache and the first
 	 *    kmalloc cache with kmalloc allocated arrays.
-	 * 5) Replace the __init data for kmem_list3 for cache_cache and
-	 *    the other cache's with kmalloc allocated memory.
-	 * 6) Resize the head arrays of the kmalloc caches to their final sizes.
+	 * 5) Resize the head arrays of the kmalloc caches to their final sizes.
 	 */
 
 	/* 1) create the cache_cache */
@@ -1050,7 +808,6 @@
 	list_add(&cache_cache.next, &cache_chain);
 	cache_cache.colour_off = cache_line_size();
 	cache_cache.array[smp_processor_id()] = &initarray_cache.cache;
-	cache_cache.nodelists[numa_node_id()] = &initkmem_list3[CACHE_CACHE];
 
 	cache_cache.objsize = ALIGN(cache_cache.objsize, cache_line_size());
 
@@ -1068,33 +825,15 @@
 	sizes = malloc_sizes;
 	names = cache_names;
 
-	/* Initialize the caches that provide memory for the array cache
-	 * and the kmem_list3 structures first.
-	 * Without this, further allocations will bug
-	 */
-
-	sizes[INDEX_AC].cs_cachep = kmem_cache_create(names[INDEX_AC].name,
-				sizes[INDEX_AC].cs_size, ARCH_KMALLOC_MINALIGN,
-				(ARCH_KMALLOC_FLAGS | SLAB_PANIC), NULL, NULL);
-
-	if (INDEX_AC != INDEX_L3)
-		sizes[INDEX_L3].cs_cachep =
-			kmem_cache_create(names[INDEX_L3].name,
-				sizes[INDEX_L3].cs_size, ARCH_KMALLOC_MINALIGN,
-				(ARCH_KMALLOC_FLAGS | SLAB_PANIC), NULL, NULL);
-
 	while (sizes->cs_size != ULONG_MAX) {
-		/*
-		 * For performance, all the general caches are L1 aligned.
+		/* For performance, all the general caches are L1 aligned.
 		 * This should be particularly beneficial on SMP boxes, as it
 		 * eliminates "false sharing".
 		 * Note for systems short on memory removing the alignment will
-		 * allow tighter packing of the smaller caches.
-		 */
-		if(!sizes->cs_cachep)
-			sizes->cs_cachep = kmem_cache_create(names->name,
-				sizes->cs_size, ARCH_KMALLOC_MINALIGN,
-				(ARCH_KMALLOC_FLAGS | SLAB_PANIC), NULL, NULL);
+		 * allow tighter packing of the smaller caches. */
+		sizes->cs_cachep = kmem_cache_create(names->name,
+			sizes->cs_size, ARCH_KMALLOC_MINALIGN,
+			(ARCH_KMALLOC_FLAGS | SLAB_PANIC), NULL, NULL);
 
 		/* Inc off-slab bufctl limit until the ceiling is hit. */
 		if (!(OFF_SLAB(sizes->cs_cachep))) {
@@ -1113,47 +852,25 @@
 	/* 4) Replace the bootstrap head arrays */
 	{
 		void * ptr;
+		int cpu = smp_processor_id();
 
 		ptr = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL);
-
-		local_irq_disable();
-		BUG_ON(ac_data(&cache_cache) != &initarray_cache.cache);
-		memcpy(ptr, ac_data(&cache_cache),
-				sizeof(struct arraycache_init));
-		cache_cache.array[smp_processor_id()] = ptr;
-		local_irq_enable();
+		local_irq_disable_nort();
+		BUG_ON(ac_data(&cache_cache, cpu) != &initarray_cache.cache);
+		memcpy(ptr, ac_data(&cache_cache, cpu), sizeof(struct arraycache_init));
+		cache_cache.array[cpu] = ptr;
+		local_irq_enable_nort();
 
 		ptr = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL);
-
-		local_irq_disable();
-		BUG_ON(ac_data(malloc_sizes[INDEX_AC].cs_cachep)
-				!= &initarray_generic.cache);
-		memcpy(ptr, ac_data(malloc_sizes[INDEX_AC].cs_cachep),
+		local_irq_disable_nort();
+		BUG_ON(ac_data(malloc_sizes[0].cs_cachep, cpu) != &initarray_generic.cache);
+		memcpy(ptr, ac_data(malloc_sizes[0].cs_cachep, cpu),
 				sizeof(struct arraycache_init));
-		malloc_sizes[INDEX_AC].cs_cachep->array[smp_processor_id()] =
-						ptr;
-		local_irq_enable();
-	}
-	/* 5) Replace the bootstrap kmem_list3's */
-	{
-		int node;
-		/* Replace the static kmem_list3 structures for the boot cpu */
-		init_list(&cache_cache, &initkmem_list3[CACHE_CACHE],
-				numa_node_id());
-
-		for_each_online_node(node) {
-			init_list(malloc_sizes[INDEX_AC].cs_cachep,
-					&initkmem_list3[SIZE_AC+node], node);
-
-			if (INDEX_AC != INDEX_L3) {
-				init_list(malloc_sizes[INDEX_L3].cs_cachep,
-						&initkmem_list3[SIZE_L3+node],
-						node);
-			}
-		}
+		malloc_sizes[0].cs_cachep->array[cpu] = ptr;
+		local_irq_enable_nort();
 	}
 
-	/* 6) resize the head arrays to their final sizes */
+	/* 5) resize the head arrays to their final sizes */
 	{
 		kmem_cache_t *cachep;
 		down(&cache_chain_sem);
@@ -1170,6 +887,7 @@
 	 */
 	register_cpu_notifier(&cpucache_notifier);
 
+
 	/* The reap timers are started later, with a module init call:
 	 * That part of the kernel is not yet operational.
 	 */
@@ -1183,8 +901,10 @@
 	 * Register the timers that return unneeded
 	 * pages to gfp.
 	 */
-	for_each_online_cpu(cpu)
-		start_cpu_timer(cpu);
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		if (cpu_online(cpu))
+			start_cpu_timer(cpu);
+	}
 
 	return 0;
 }
@@ -1198,7 +918,7 @@
  * did not request dmaable memory, we might get it, but that
  * would be relatively rare and ignorable.
  */
-static void *kmem_getpages(kmem_cache_t *cachep, gfp_t flags, int nodeid)
+static void *kmem_getpages(kmem_cache_t *cachep, unsigned int __nocast flags, int nodeid)
 {
 	struct page *page;
 	void *addr;
@@ -1268,7 +988,7 @@
 
 	*addr++=0x12345678;
 	*addr++=caller;
-	*addr++=smp_processor_id();
+	*addr++=raw_smp_processor_id();
 	size -= 3*sizeof(unsigned long);
 	{
 		unsigned long *sptr = &caller;
@@ -1459,20 +1179,6 @@
 	}
 }
 
-/* For setting up all the kmem_list3s for cache whose objsize is same
-   as size of kmem_list3. */
-static inline void set_up_list3s(kmem_cache_t *cachep, int index)
-{
-	int node;
-
-	for_each_online_node(node) {
-		cachep->nodelists[node] = &initkmem_list3[index+node];
-		cachep->nodelists[node]->next_reap = jiffies +
-			REAPTIMEOUT_LIST3 +
-			((unsigned long)cachep)%REAPTIMEOUT_LIST3;
-	}
-}
-
 /**
  * kmem_cache_create - Create a cache.
  * @name: A string which is used in /proc/slabinfo to identify this cache.
@@ -1514,6 +1220,7 @@
 	size_t left_over, slab_size, ralign;
 	kmem_cache_t *cachep = NULL;
 	struct list_head *p;
+	int cpu = raw_smp_processor_id();
 
 	/*
 	 * Sanity checks... these are all serious usage bugs.
@@ -1656,7 +1363,7 @@
 		size += BYTES_PER_WORD;
 	}
 #if FORCED_DEBUG && defined(CONFIG_DEBUG_PAGEALLOC)
-	if (size >= malloc_sizes[INDEX_L3+1].cs_size && cachep->reallen > cache_line_size() && size < PAGE_SIZE) {
+	if (size > 128 && cachep->reallen > cache_line_size() && size < PAGE_SIZE) {
 		cachep->dbghead += PAGE_SIZE - size;
 		size = PAGE_SIZE;
 	}
@@ -1758,9 +1465,13 @@
 		cachep->gfpflags |= GFP_DMA;
 	spin_lock_init(&cachep->spinlock);
 	cachep->objsize = size;
+	/* NUMA */
+	INIT_LIST_HEAD(&cachep->lists.slabs_full);
+	INIT_LIST_HEAD(&cachep->lists.slabs_partial);
+	INIT_LIST_HEAD(&cachep->lists.slabs_free);
 
 	if (flags & CFLGS_OFF_SLAB)
-		cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0u);
+		cachep->slabp_cache = kmem_find_general_cachep(slab_size, 0);
 	cachep->ctor = ctor;
 	cachep->dtor = dtor;
 	cachep->name = name;
@@ -1776,52 +1487,25 @@
 			 * the cache that's used by kmalloc(24), otherwise
 			 * the creation of further caches will BUG().
 			 */
-			cachep->array[smp_processor_id()] =
-				&initarray_generic.cache;
-
-			/* If the cache that's used by
-			 * kmalloc(sizeof(kmem_list3)) is the first cache,
-			 * then we need to set up all its list3s, otherwise
-			 * the creation of further caches will BUG().
-			 */
-			set_up_list3s(cachep, SIZE_AC);
-			if (INDEX_AC == INDEX_L3)
-				g_cpucache_up = PARTIAL_L3;
-			else
-				g_cpucache_up = PARTIAL_AC;
+			cachep->array[cpu] = &initarray_generic.cache;
+			g_cpucache_up = PARTIAL;
 		} else {
-			cachep->array[smp_processor_id()] =
-				kmalloc(sizeof(struct arraycache_init),
-						GFP_KERNEL);
-
-			if (g_cpucache_up == PARTIAL_AC) {
-				set_up_list3s(cachep, SIZE_L3);
-				g_cpucache_up = PARTIAL_L3;
-			} else {
-				int node;
-				for_each_online_node(node) {
-
-					cachep->nodelists[node] =
-						kmalloc_node(sizeof(struct kmem_list3),
-								GFP_KERNEL, node);
-					BUG_ON(!cachep->nodelists[node]);
-					kmem_list3_init(cachep->nodelists[node]);
-				}
-			}
+			cachep->array[cpu] = kmalloc(sizeof(struct arraycache_init), GFP_KERNEL);
 		}
-		cachep->nodelists[numa_node_id()]->next_reap =
-			jiffies + REAPTIMEOUT_LIST3 +
-			((unsigned long)cachep)%REAPTIMEOUT_LIST3;
-
-		BUG_ON(!ac_data(cachep));
-		ac_data(cachep)->avail = 0;
-		ac_data(cachep)->limit = BOOT_CPUCACHE_ENTRIES;
-		ac_data(cachep)->batchcount = 1;
-		ac_data(cachep)->touched = 0;
+		BUG_ON(!ac_data(cachep, cpu));
+		ac_data(cachep, cpu)->avail = 0;
+		ac_data(cachep, cpu)->limit = BOOT_CPUCACHE_ENTRIES;
+		ac_data(cachep, cpu)->batchcount = 1;
+		ac_data(cachep, cpu)->touched = 0;
 		cachep->batchcount = 1;
 		cachep->limit = BOOT_CPUCACHE_ENTRIES;
+		cachep->free_limit = (1+num_online_cpus())*cachep->batchcount
+					+ cachep->num;
 	} 
 
+	cachep->lists.next_reap = jiffies + REAPTIMEOUT_LIST3 +
+					((unsigned long)cachep)%REAPTIMEOUT_LIST3;
+
 	/* cache setup completed, link it into the list */
 	list_add(&cachep->next, &cache_chain);
 	unlock_cpu_hotplug();
@@ -1837,35 +1521,27 @@
 #if DEBUG
 static void check_irq_off(void)
 {
-	BUG_ON(!irqs_disabled());
+#ifndef CONFIG_PREEMPT_RT
+	BUG_ON(!raw_irqs_disabled());
+#endif
 }
 
 static void check_irq_on(void)
 {
-	BUG_ON(irqs_disabled());
+	BUG_ON(raw_irqs_disabled());
 }
 
 static void check_spinlock_acquired(kmem_cache_t *cachep)
 {
 #ifdef CONFIG_SMP
 	check_irq_off();
-	assert_spin_locked(&cachep->nodelists[numa_node_id()]->list_lock);
-#endif
-}
-
-static inline void check_spinlock_acquired_node(kmem_cache_t *cachep, int node)
-{
-#ifdef CONFIG_SMP
-	check_irq_off();
-	assert_spin_locked(&cachep->nodelists[node]->list_lock);
+	BUG_ON(spin_trylock(&cachep->spinlock));
 #endif
 }
-
 #else
 #define check_irq_off()	do { } while(0)
 #define check_irq_on()	do { } while(0)
 #define check_spinlock_acquired(x) do { } while(0)
-#define check_spinlock_acquired_node(x, y) do { } while(0)
 #endif
 
 /*
@@ -1876,9 +1552,9 @@
 	check_irq_on();
 	preempt_disable();
 
-	local_irq_disable();
+	raw_local_irq_disable();
 	func(arg);
-	local_irq_enable();
+	raw_local_irq_enable();
 
 	if (smp_call_function(func, arg, 1, 1))
 		BUG();
@@ -1887,92 +1563,85 @@
 }
 
 static void drain_array_locked(kmem_cache_t* cachep,
-				struct array_cache *ac, int force, int node);
+				struct array_cache *ac, int force);
 
-static void do_drain(void *arg)
+static void do_drain_cpu(kmem_cache_t *cachep, int cpu)
 {
-	kmem_cache_t *cachep = (kmem_cache_t*)arg;
 	struct array_cache *ac;
-	int node = numa_node_id();
 
 	check_irq_off();
-	ac = ac_data(cachep);
-	spin_lock(&cachep->nodelists[node]->list_lock);
-	free_block(cachep, ac->entry, ac->avail, node);
-	spin_unlock(&cachep->nodelists[node]->list_lock);
+
+	spin_lock(&cachep->spinlock);
+	ac = ac_data(cachep, cpu);
+	free_block(cachep, &ac_entry(ac)[0], ac->avail);
 	ac->avail = 0;
+	spin_unlock(&cachep->spinlock);
 }
 
-static void drain_cpu_caches(kmem_cache_t *cachep)
+#ifndef CONFIG_PREEMPT_RT
+/*
+ * Executes in an IRQ context:
+ */
+static void do_drain(void *arg)
 {
-	struct kmem_list3 *l3;
-	int node;
+	do_drain_cpu((kmem_cache_t*)arg, smp_processor_id());
+}
+#endif
 
+static void drain_cpu_caches(kmem_cache_t *cachep)
+{
+#ifndef CONFIG_PREEMPT_RT
 	smp_call_function_all_cpus(do_drain, cachep);
+#else
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		do_drain_cpu(cachep, cpu);
+#endif
 	check_irq_on();
 	spin_lock_irq(&cachep->spinlock);
-	for_each_online_node(node)  {
-		l3 = cachep->nodelists[node];
-		if (l3) {
-			spin_lock(&l3->list_lock);
-			drain_array_locked(cachep, l3->shared, 1, node);
-			spin_unlock(&l3->list_lock);
-			if (l3->alien)
-				drain_alien_cache(cachep, l3);
-		}
-	}
+	if (cachep->lists.shared)
+		drain_array_locked(cachep, cachep->lists.shared, 1);
 	spin_unlock_irq(&cachep->spinlock);
 }
 
-static int __node_shrink(kmem_cache_t *cachep, int node)
+
+/* NUMA shrink all list3s */
+static int __cache_shrink(kmem_cache_t *cachep)
 {
 	struct slab *slabp;
-	struct kmem_list3 *l3 = cachep->nodelists[node];
 	int ret;
 
-	for (;;) {
+	drain_cpu_caches(cachep);
+
+	check_irq_on();
+	spin_lock_irq(&cachep->spinlock);
+
+	for(;;) {
 		struct list_head *p;
 
-		p = l3->slabs_free.prev;
-		if (p == &l3->slabs_free)
+		p = cachep->lists.slabs_free.prev;
+		if (p == &cachep->lists.slabs_free)
 			break;
 
-		slabp = list_entry(l3->slabs_free.prev, struct slab, list);
+		slabp = list_entry(cachep->lists.slabs_free.prev, struct slab, list);
 #if DEBUG
 		if (slabp->inuse)
 			BUG();
 #endif
 		list_del(&slabp->list);
 
-		l3->free_objects -= cachep->num;
-		spin_unlock_irq(&l3->list_lock);
+		cachep->lists.free_objects -= cachep->num;
+		spin_unlock_irq(&cachep->spinlock);
 		slab_destroy(cachep, slabp);
-		spin_lock_irq(&l3->list_lock);
+		spin_lock_irq(&cachep->spinlock);
 	}
-	ret = !list_empty(&l3->slabs_full) ||
-		!list_empty(&l3->slabs_partial);
+	ret = !list_empty(&cachep->lists.slabs_full) ||
+		!list_empty(&cachep->lists.slabs_partial);
+	spin_unlock_irq(&cachep->spinlock);
 	return ret;
 }
 
-static int __cache_shrink(kmem_cache_t *cachep)
-{
-	int ret = 0, i = 0;
-	struct kmem_list3 *l3;
-
-	drain_cpu_caches(cachep);
-
-	check_irq_on();
-	for_each_online_node(i) {
-		l3 = cachep->nodelists[i];
-		if (l3) {
-			spin_lock_irq(&l3->list_lock);
-			ret += __node_shrink(cachep, i);
-			spin_unlock_irq(&l3->list_lock);
-		}
-	}
-	return (ret ? 1 : 0);
-}
-
 /**
  * kmem_cache_shrink - Shrink a cache.
  * @cachep: The cache to shrink.
@@ -2009,7 +1678,6 @@
 int kmem_cache_destroy(kmem_cache_t * cachep)
 {
 	int i;
-	struct kmem_list3 *l3;
 
 	if (!cachep || in_interrupt())
 		BUG();
@@ -2037,17 +1705,15 @@
 	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
 		synchronize_rcu();
 
-	for_each_online_cpu(i)
+	/* no cpu_online check required here since we clear the percpu
+	 * array on cpu offline and set this to NULL.
+	 */
+	for (i = 0; i < NR_CPUS; i++)
 		kfree(cachep->array[i]);
 
 	/* NUMA: free the list3 structures */
-	for_each_online_node(i) {
-		if ((l3 = cachep->nodelists[i])) {
-			kfree(l3->shared);
-			free_alien_cache(l3->alien);
-			kfree(l3);
-		}
-	}
+	kfree(cachep->lists.shared);
+	cachep->lists.shared = NULL;
 	kmem_cache_free(&cache_cache, cachep);
 
 	unlock_cpu_hotplug();
@@ -2057,8 +1723,8 @@
 EXPORT_SYMBOL(kmem_cache_destroy);
 
 /* Get the memory for a slab management obj. */
-static struct slab* alloc_slabmgmt(kmem_cache_t *cachep, void *objp,
-			int colour_off, gfp_t local_flags)
+static struct slab* alloc_slabmgmt(kmem_cache_t *cachep,
+			void *objp, int colour_off, unsigned int __nocast local_flags)
 {
 	struct slab *slabp;
 	
@@ -2089,7 +1755,7 @@
 	int i;
 
 	for (i = 0; i < cachep->num; i++) {
-		void *objp = slabp->s_mem+cachep->objsize*i;
+		void* objp = slabp->s_mem+cachep->objsize*i;
 #if DEBUG
 		/* need to poison the objs? */
 		if (cachep->flags & SLAB_POISON)
@@ -2159,14 +1825,13 @@
  * Grow (by 1) the number of slabs within a cache.  This is called by
  * kmem_cache_alloc() when there are no active objs left in a cache.
  */
-static int cache_grow(kmem_cache_t *cachep, gfp_t flags, int nodeid)
+static int cache_grow(kmem_cache_t *cachep, unsigned int __nocast flags, int nodeid)
 {
 	struct slab	*slabp;
 	void		*objp;
 	size_t		 offset;
 	gfp_t	 	 local_flags;
 	unsigned long	 ctor_flags;
-	struct kmem_list3 *l3;
 
 	/* Be lazy and only check for valid flags here,
  	 * keeping it out of the critical path in kmem_cache_alloc().
@@ -2198,9 +1863,8 @@
 
 	spin_unlock(&cachep->spinlock);
 
-	check_irq_off();
 	if (local_flags & __GFP_WAIT)
-		local_irq_enable();
+		local_irq_enable_nort();
 
 	/*
 	 * The test for missing atomic flag is performed here, rather than
@@ -2210,9 +1874,8 @@
 	 */
 	kmem_flagcheck(cachep, flags);
 
-	/* Get mem for the objs.
-	 * Attempt to allocate a physical page from 'nodeid',
-	 */
+
+	/* Get mem for the objs. */
 	if (!(objp = kmem_getpages(cachep, flags, nodeid)))
 		goto failed;
 
@@ -2220,28 +1883,26 @@
 	if (!(slabp = alloc_slabmgmt(cachep, objp, offset, local_flags)))
 		goto opps1;
 
-	slabp->nodeid = nodeid;
 	set_slab_attr(cachep, slabp, objp);
 
 	cache_init_objs(cachep, slabp, ctor_flags);
 
 	if (local_flags & __GFP_WAIT)
-		local_irq_disable();
+		local_irq_disable_nort();
 	check_irq_off();
-	l3 = cachep->nodelists[nodeid];
-	spin_lock(&l3->list_lock);
+	spin_lock(&cachep->spinlock);
 
 	/* Make slab active. */
-	list_add_tail(&slabp->list, &(l3->slabs_free));
+	list_add_tail(&slabp->list, &(list3_data(cachep)->slabs_free));
 	STATS_INC_GROWN(cachep);
-	l3->free_objects += cachep->num;
-	spin_unlock(&l3->list_lock);
+	list3_data(cachep)->free_objects += cachep->num;
+	spin_unlock(&cachep->spinlock);
 	return 1;
 opps1:
 	kmem_freepages(cachep, objp);
 failed:
 	if (local_flags & __GFP_WAIT)
-		local_irq_disable();
+		local_irq_disable_nort();
 	return 0;
 }
 
@@ -2341,6 +2002,7 @@
 	kmem_bufctl_t i;
 	int entries = 0;
 	
+	check_spinlock_acquired(cachep);
 	/* Check slab's freelist to see if this obj is there. */
 	for (i = slabp->free; i != BUFCTL_END; i = slab_bufctl(slabp)[i]) {
 		entries++;
@@ -2366,14 +2028,14 @@
 #define check_slabp(x,y) do { } while(0)
 #endif
 
-static void *cache_alloc_refill(kmem_cache_t *cachep, gfp_t flags)
+static void *cache_alloc_refill(kmem_cache_t *cachep, unsigned int __nocast flags, int cpu)
 {
 	int batchcount;
 	struct kmem_list3 *l3;
 	struct array_cache *ac;
 
 	check_irq_off();
-	ac = ac_data(cachep);
+	ac = ac_data(cachep, cpu);
 retry:
 	batchcount = ac->batchcount;
 	if (!ac->touched && batchcount > BATCHREFILL_LIMIT) {
@@ -2383,11 +2045,10 @@
 		 */
 		batchcount = BATCHREFILL_LIMIT;
 	}
-	l3 = cachep->nodelists[numa_node_id()];
-
-	BUG_ON(ac->avail > 0 || !l3);
-	spin_lock(&l3->list_lock);
+	l3 = list3_data(cachep);
 
+	BUG_ON(ac->avail > 0);
+	spin_lock_nort(&cachep->spinlock);
 	if (l3->shared) {
 		struct array_cache *shared_array = l3->shared;
 		if (shared_array->avail) {
@@ -2395,9 +2056,8 @@
 				batchcount = shared_array->avail;
 			shared_array->avail -= batchcount;
 			ac->avail = batchcount;
-			memcpy(ac->entry,
-				&(shared_array->entry[shared_array->avail]),
-				sizeof(void*)*batchcount);
+			memcpy(ac_entry(ac), &ac_entry(shared_array)[shared_array->avail],
+					sizeof(void*)*batchcount);
 			shared_array->touched = 1;
 			goto alloc_done;
 		}
@@ -2424,8 +2084,7 @@
 			STATS_SET_HIGH(cachep);
 
 			/* get obj pointer */
-			ac->entry[ac->avail++] = slabp->s_mem +
-				slabp->free*cachep->objsize;
+			ac_entry(ac)[ac->avail++] = slabp->s_mem + slabp->free*cachep->objsize;
 
 			slabp->inuse++;
 			next = slab_bufctl(slabp)[slabp->free];
@@ -2448,14 +2107,17 @@
 must_grow:
 	l3->free_objects -= ac->avail;
 alloc_done:
-	spin_unlock(&l3->list_lock);
+	spin_unlock_nort(&cachep->spinlock);
 
 	if (unlikely(!ac->avail)) {
 		int x;
-		x = cache_grow(cachep, flags, numa_node_id());
+		spin_unlock_rt(&cachep->spinlock);
+		x = cache_grow(cachep, flags, -1);
 
+		spin_lock_rt(&cachep->spinlock);
 		// cache_grow can reenable interrupts, then ac could change.
-		ac = ac_data(cachep);
+		cpu = smp_processor_id_rt(cpu);
+		ac = ac_data(cachep, cpu);
 		if (!x && ac->avail == 0)	// no objects in sight? abort
 			return NULL;
 
@@ -2463,11 +2125,11 @@
 			goto retry;
 	}
 	ac->touched = 1;
-	return ac->entry[--ac->avail];
+	return ac_entry(ac)[--ac->avail];
 }
 
 static inline void
-cache_alloc_debugcheck_before(kmem_cache_t *cachep, gfp_t flags)
+cache_alloc_debugcheck_before(kmem_cache_t *cachep, unsigned int __nocast flags)
 {
 	might_sleep_if(flags & __GFP_WAIT);
 #if DEBUG
@@ -2478,7 +2140,7 @@
 #if DEBUG
 static void *
 cache_alloc_debugcheck_after(kmem_cache_t *cachep,
-			gfp_t flags, void *objp, void *caller)
+			unsigned int __nocast flags, void *objp, void *caller)
 {
 	if (!objp)	
 		return objp;
@@ -2521,118 +2183,47 @@
 #define cache_alloc_debugcheck_after(a,b,objp,d) (objp)
 #endif
 
-static inline void *____cache_alloc(kmem_cache_t *cachep, gfp_t flags)
+
+static inline void *__cache_alloc(kmem_cache_t *cachep, unsigned int __nocast flags)
 {
+	int cpu;
+	unsigned long save_flags;
 	void* objp;
 	struct array_cache *ac;
 
-	check_irq_off();
-	ac = ac_data(cachep);
+	cache_alloc_debugcheck_before(cachep, flags);
+
+	local_irq_save_nort(save_flags);
+	spin_lock_rt(&cachep->spinlock);
+	cpu = raw_smp_processor_id();
+	ac = ac_data(cachep, cpu);
 	if (likely(ac->avail)) {
 		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
-		objp = ac->entry[--ac->avail];
+		objp = ac_entry(ac)[--ac->avail];
 	} else {
 		STATS_INC_ALLOCMISS(cachep);
-		objp = cache_alloc_refill(cachep, flags);
+		objp = cache_alloc_refill(cachep, flags, cpu);
 	}
+	spin_unlock_rt(&cachep->spinlock);
+	local_irq_restore_nort(save_flags);
+	objp = cache_alloc_debugcheck_after(cachep, flags, objp, __builtin_return_address(0));
 	return objp;
 }
 
-static inline void *__cache_alloc(kmem_cache_t *cachep, gfp_t flags)
-{
-	unsigned long save_flags;
-	void* objp;
-
-	cache_alloc_debugcheck_before(cachep, flags);
-
-	local_irq_save(save_flags);
-	objp = ____cache_alloc(cachep, flags);
-	local_irq_restore(save_flags);
-	objp = cache_alloc_debugcheck_after(cachep, flags, objp,
-					__builtin_return_address(0));
-	prefetchw(objp);
-	return objp;
-}
-
-#ifdef CONFIG_NUMA
 /*
- * A interface to enable slab creation on nodeid
+ * NUMA: different approach needed if the spinlock is moved into
+ * the l3 structure
  */
-static void *__cache_alloc_node(kmem_cache_t *cachep, gfp_t flags, int nodeid)
-{
-	struct list_head *entry;
- 	struct slab *slabp;
- 	struct kmem_list3 *l3;
- 	void *obj;
- 	kmem_bufctl_t next;
- 	int x;
 
- 	l3 = cachep->nodelists[nodeid];
- 	BUG_ON(!l3);
-
-retry:
- 	spin_lock(&l3->list_lock);
- 	entry = l3->slabs_partial.next;
- 	if (entry == &l3->slabs_partial) {
- 		l3->free_touched = 1;
- 		entry = l3->slabs_free.next;
- 		if (entry == &l3->slabs_free)
- 			goto must_grow;
- 	}
-
- 	slabp = list_entry(entry, struct slab, list);
- 	check_spinlock_acquired_node(cachep, nodeid);
- 	check_slabp(cachep, slabp);
-
- 	STATS_INC_NODEALLOCS(cachep);
- 	STATS_INC_ACTIVE(cachep);
- 	STATS_SET_HIGH(cachep);
-
- 	BUG_ON(slabp->inuse == cachep->num);
-
- 	/* get obj pointer */
- 	obj =  slabp->s_mem + slabp->free*cachep->objsize;
- 	slabp->inuse++;
- 	next = slab_bufctl(slabp)[slabp->free];
-#if DEBUG
- 	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
-#endif
- 	slabp->free = next;
- 	check_slabp(cachep, slabp);
- 	l3->free_objects--;
- 	/* move slabp to correct slabp list: */
- 	list_del(&slabp->list);
-
- 	if (slabp->free == BUFCTL_END) {
- 		list_add(&slabp->list, &l3->slabs_full);
- 	} else {
- 		list_add(&slabp->list, &l3->slabs_partial);
- 	}
-
- 	spin_unlock(&l3->list_lock);
- 	goto done;
-
-must_grow:
- 	spin_unlock(&l3->list_lock);
- 	x = cache_grow(cachep, flags, nodeid);
-
- 	if (!x)
- 		return NULL;
-
- 	goto retry;
-done:
- 	return obj;
-}
-#endif
-
-/*
- * Caller needs to acquire correct kmem_list's list_lock
- */
-static void free_block(kmem_cache_t *cachep, void **objpp, int nr_objects, int node)
+static void free_block(kmem_cache_t *cachep, void **objpp, int nr_objects)
 {
 	int i;
-	struct kmem_list3 *l3;
+
+	check_spinlock_acquired(cachep);
+
+	/* NUMA: move add into loop */
+	cachep->lists.free_objects += nr_objects;
 
 	for (i = 0; i < nr_objects; i++) {
 		void *objp = objpp[i];
@@ -2640,19 +2231,16 @@
 		unsigned int objnr;
 
 		slabp = page_get_slab(virt_to_page(objp));
-		l3 = cachep->nodelists[node];
 		list_del(&slabp->list);
 		objnr = (objp - slabp->s_mem) / cachep->objsize;
-		check_spinlock_acquired_node(cachep, node);
 		check_slabp(cachep, slabp);
-
 #if DEBUG
 		/* Verify that the slab belongs to the intended node */
 		WARN_ON(slabp->nodeid != node);
 
 		if (slab_bufctl(slabp)[objnr] != BUFCTL_FREE) {
-			printk(KERN_ERR "slab: double free detected in cache "
-					"'%s', objp %p\n", cachep->name, objp);
+			printk(KERN_ERR "slab: double free detected in cache '%s', objp %p.\n",
+						cachep->name, objp);
 			BUG();
 		}
 #endif
@@ -2660,23 +2248,24 @@
 		slabp->free = objnr;
 		STATS_DEC_ACTIVE(cachep);
 		slabp->inuse--;
-		l3->free_objects++;
 		check_slabp(cachep, slabp);
 
 		/* fixup slab chains */
 		if (slabp->inuse == 0) {
-			if (l3->free_objects > l3->free_limit) {
-				l3->free_objects -= cachep->num;
+			if (cachep->lists.free_objects > cachep->free_limit) {
+				cachep->lists.free_objects -= cachep->num;
 				slab_destroy(cachep, slabp);
 			} else {
-				list_add(&slabp->list, &l3->slabs_free);
+				list_add(&slabp->list,
+				&list3_data_ptr(cachep, objp)->slabs_free);
 			}
 		} else {
 			/* Unconditionally move a slab to the end of the
 			 * partial list on free - maximum time for the
 			 * other objects to be freed, too.
 			 */
-			list_add_tail(&slabp->list, &l3->slabs_partial);
+			list_add_tail(&slabp->list,
+				&list3_data_ptr(cachep, objp)->slabs_partial);
 		}
 	}
 }
@@ -2684,39 +2273,36 @@
 static void cache_flusharray(kmem_cache_t *cachep, struct array_cache *ac)
 {
 	int batchcount;
-	struct kmem_list3 *l3;
-	int node = numa_node_id();
 
 	batchcount = ac->batchcount;
 #if DEBUG
 	BUG_ON(!batchcount || batchcount > ac->avail);
 #endif
 	check_irq_off();
-	l3 = cachep->nodelists[node];
-	spin_lock(&l3->list_lock);
-	if (l3->shared) {
-		struct array_cache *shared_array = l3->shared;
+	spin_lock_nort(&cachep->spinlock);
+	if (cachep->lists.shared) {
+		struct array_cache *shared_array = cachep->lists.shared;
 		int max = shared_array->limit-shared_array->avail;
 		if (max) {
 			if (batchcount > max)
 				batchcount = max;
-			memcpy(&(shared_array->entry[shared_array->avail]),
-					ac->entry,
+			memcpy(&ac_entry(shared_array)[shared_array->avail],
+					&ac_entry(ac)[0],
 					sizeof(void*)*batchcount);
 			shared_array->avail += batchcount;
 			goto free_done;
 		}
 	}
 
-	free_block(cachep, ac->entry, batchcount, node);
+	free_block(cachep, &ac_entry(ac)[0], batchcount);
 free_done:
 #if STATS
 	{
 		int i = 0;
 		struct list_head *p;
 
-		p = l3->slabs_free.next;
-		while (p != &(l3->slabs_free)) {
+		p = list3_data(cachep)->slabs_free.next;
+		while (p != &(list3_data(cachep)->slabs_free)) {
 			struct slab *slabp;
 
 			slabp = list_entry(p, struct slab, list);
@@ -2728,13 +2314,12 @@
 		STATS_SET_FREEABLE(cachep, i);
 	}
 #endif
-	spin_unlock(&l3->list_lock);
+	spin_unlock_nort(&cachep->spinlock);
 	ac->avail -= batchcount;
-	memmove(ac->entry, &(ac->entry[batchcount]),
+	memmove(&ac_entry(ac)[0], &ac_entry(ac)[batchcount],
 			sizeof(void*)*ac->avail);
 }
 
-
 /*
  * __cache_free
  * Release an obj back to its cache. If the obj has a constructed
@@ -2744,52 +2329,24 @@
  */
 static inline void __cache_free(kmem_cache_t *cachep, void *objp)
 {
-	struct array_cache *ac = ac_data(cachep);
+	int cpu;
+	struct array_cache *ac;
 
 	check_irq_off();
 	objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
 
-	/* Make sure we are not freeing a object from another
-	 * node to the array cache on this cpu.
-	 */
-#ifdef CONFIG_NUMA
-	{
-		struct slab *slabp;
-		slabp = page_get_slab(virt_to_page(objp));
-		if (unlikely(slabp->nodeid != numa_node_id())) {
-			struct array_cache *alien = NULL;
-			int nodeid = slabp->nodeid;
-			struct kmem_list3 *l3 = cachep->nodelists[numa_node_id()];
-
-			STATS_INC_NODEFREES(cachep);
-			if (l3->alien && l3->alien[nodeid]) {
-				alien = l3->alien[nodeid];
-				spin_lock(&alien->lock);
-				if (unlikely(alien->avail == alien->limit))
-					__drain_alien_cache(cachep,
-							alien, nodeid);
-				alien->entry[alien->avail++] = objp;
-				spin_unlock(&alien->lock);
-			} else {
-				spin_lock(&(cachep->nodelists[nodeid])->
-						list_lock);
-				free_block(cachep, &objp, 1, nodeid);
-				spin_unlock(&(cachep->nodelists[nodeid])->
-						list_lock);
-			}
-			return;
-		}
-	}
-#endif
+	spin_lock_rt(&cachep->spinlock);
+	cpu = raw_smp_processor_id();
+	ac = ac_data(cachep, cpu);
 	if (likely(ac->avail < ac->limit)) {
 		STATS_INC_FREEHIT(cachep);
-		ac->entry[ac->avail++] = objp;
-		return;
+		ac_entry(ac)[ac->avail++] = objp;
 	} else {
 		STATS_INC_FREEMISS(cachep);
 		cache_flusharray(cachep, ac);
-		ac->entry[ac->avail++] = objp;
+		ac_entry(ac)[ac->avail++] = objp;
 	}
+	spin_unlock_rt(&cachep->spinlock);
 }
 
 /**
@@ -2800,7 +2357,7 @@
  * Allocate an object from this cache.  The flags are only relevant
  * if the cache has no available objects.
  */
-void *kmem_cache_alloc(kmem_cache_t *cachep, gfp_t flags)
+void *kmem_cache_alloc(kmem_cache_t *cachep, unsigned int __nocast flags)
 {
 	return __cache_alloc(cachep, flags);
 }
@@ -2858,37 +2415,85 @@
  * Identical to kmem_cache_alloc, except that this function is slow
  * and can sleep. And it will allocate memory on the given node, which
  * can improve the performance for cpu bound structures.
- * New and improved: it will now make sure that the object gets
- * put on the correct node list so that there is no false sharing.
  */
-void *kmem_cache_alloc_node(kmem_cache_t *cachep, gfp_t flags, int nodeid)
+void *kmem_cache_alloc_node(kmem_cache_t *cachep, unsigned int __nocast flags, int nodeid)
 {
-	unsigned long save_flags;
-	void *ptr;
+	int loop;
+	void *objp;
+	struct slab *slabp;
+	kmem_bufctl_t next;
 
 	if (nodeid == -1)
-		return __cache_alloc(cachep, flags);
+		return kmem_cache_alloc(cachep, flags);
 
-	if (unlikely(!cachep->nodelists[nodeid])) {
-		/* Fall back to __cache_alloc if we run into trouble */
-		printk(KERN_WARNING "slab: not allocating in inactive node %d for cache %s\n", nodeid, cachep->name);
-		return __cache_alloc(cachep,flags);
+	for (loop = 0;;loop++) {
+		struct list_head *q;
+
+		objp = NULL;
+		check_irq_on();
+		spin_lock_irq(&cachep->spinlock);
+		/* walk through all partial and empty slab and find one
+		 * from the right node */
+		list_for_each(q,&cachep->lists.slabs_partial) {
+			slabp = list_entry(q, struct slab, list);
+
+			if (page_to_nid(virt_to_page(slabp->s_mem)) == nodeid ||
+					loop > 2)
+				goto got_slabp;
+		}
+		list_for_each(q, &cachep->lists.slabs_free) {
+			slabp = list_entry(q, struct slab, list);
+
+			if (page_to_nid(virt_to_page(slabp->s_mem)) == nodeid ||
+					loop > 2)
+				goto got_slabp;
+		}
+		spin_unlock_irq(&cachep->spinlock);
+
+		local_irq_disable_nort();
+		if (!cache_grow(cachep, flags, nodeid)) {
+			local_irq_enable_nort();
+			return NULL;
+		}
+		local_irq_enable_nort();
 	}
+got_slabp:
+	/* found one: allocate object */
+	check_slabp(cachep, slabp);
+	check_spinlock_acquired(cachep);
 
-	cache_alloc_debugcheck_before(cachep, flags);
-	local_irq_save(save_flags);
-	if (nodeid == numa_node_id())
-		ptr = ____cache_alloc(cachep, flags);
+	STATS_INC_ALLOCED(cachep);
+	STATS_INC_ACTIVE(cachep);
+	STATS_SET_HIGH(cachep);
+	STATS_INC_NODEALLOCS(cachep);
+
+	objp = slabp->s_mem + slabp->free*cachep->objsize;
+
+	slabp->inuse++;
+	next = slab_bufctl(slabp)[slabp->free];
+#if DEBUG
+	slab_bufctl(slabp)[slabp->free] = BUFCTL_FREE;
+#endif
+	slabp->free = next;
+	check_slabp(cachep, slabp);
+
+	/* move slabp to correct slabp list: */
+	list_del(&slabp->list);
+	if (slabp->free == BUFCTL_END)
+		list_add(&slabp->list, &cachep->lists.slabs_full);
 	else
-		ptr = __cache_alloc_node(cachep, flags, nodeid);
-	local_irq_restore(save_flags);
-	ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, __builtin_return_address(0));
+		list_add(&slabp->list, &cachep->lists.slabs_partial);
+
+	list3_data(cachep)->free_objects--;
+	spin_unlock_irq(&cachep->spinlock);
 
-	return ptr;
+	objp = cache_alloc_debugcheck_after(cachep, GFP_KERNEL, objp,
+					__builtin_return_address(0));
+	return objp;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node);
 
-void *kmalloc_node(size_t size, gfp_t flags, int node)
+void *kmalloc_node(size_t size, unsigned int __nocast flags, int node)
 {
 	kmem_cache_t *cachep;
 
@@ -2921,7 +2526,7 @@
  * platforms.  For example, on i386, it means that the memory must come
  * from the first 16MB.
  */
-void *__kmalloc(size_t size, gfp_t flags)
+void *__kmalloc(size_t size, unsigned int __nocast flags)
 {
 	kmem_cache_t *cachep;
 
@@ -2954,18 +2559,11 @@
 	if (!pdata)
 		return NULL;
 
-	/*
-	 * Cannot use for_each_online_cpu since a cpu may come online
-	 * and we have no way of figuring out how to fix the array
-	 * that we have allocated then....
-	 */
-	for_each_cpu(i) {
-		int node = cpu_to_node(i);
-
-		if (node_online(node))
-			pdata->ptrs[i] = kmalloc_node(size, GFP_KERNEL, node);
-		else
-			pdata->ptrs[i] = kmalloc(size, GFP_KERNEL);
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_possible(i))
+			continue;
+		pdata->ptrs[i] = kmalloc_node(size, GFP_KERNEL,
+						cpu_to_node(i));
 
 		if (!pdata->ptrs[i])
 			goto unwind_oom;
@@ -2999,18 +2597,31 @@
 {
 	unsigned long flags;
 
-	local_irq_save(flags);
+	local_irq_save_nort(flags);
 	__cache_free(cachep, objp);
-	local_irq_restore(flags);
+	local_irq_restore_nort(flags);
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
+#ifdef CONFIG_DEBUG_DEADLOCKS
+static size_t cache_size(kmem_cache_t *c)
+{
+	struct cache_sizes *csizep = malloc_sizes;
+
+	for ( ; csizep->cs_size; csizep++) {
+		if (csizep->cs_cachep == c)
+			return csizep->cs_size;
+		if (csizep->cs_dmacachep == c)
+			return csizep->cs_size;
+	}
+	return 0;
+}
+#endif
+
 /**
  * kfree - free previously allocated memory
  * @objp: pointer returned by kmalloc.
  *
- * If @objp is NULL, no operation is performed.
- *
  * Don't free memory not originally allocated by kmalloc()
  * or you will run into trouble.
  */
@@ -3021,11 +2632,16 @@
 
 	if (unlikely(!objp))
 		return;
-	local_irq_save(flags);
+	local_irq_save_nort(flags);
 	kfree_debugcheck(objp);
 	c = page_get_cache(virt_to_page(objp));
+#ifdef CONFIG_DEBUG_DEADLOCKS
+	if (check_no_locks_freed(objp, objp+cache_size(c)))
+		printk("slab %s[%p] (%d), obj: %p\n",
+			c->name, c, c->objsize, objp);
+#endif
 	__cache_free(c, (void*)objp);
-	local_irq_restore(flags);
+	local_irq_restore_nort(flags);
 }
 EXPORT_SYMBOL(kfree);
 
@@ -3043,11 +2659,11 @@
 	int i;
 	struct percpu_data *p = (struct percpu_data *) (~(unsigned long) objp);
 
-	/*
-	 * We allocate for all cpus so we cannot use for online cpu here.
-	 */
-	for_each_cpu(i)
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_possible(i))
+			continue;
 		kfree(p->ptrs[i]);
+	}
 	kfree(p);
 }
 EXPORT_SYMBOL(free_percpu);
@@ -3065,76 +2681,21 @@
 }
 EXPORT_SYMBOL_GPL(kmem_cache_name);
 
-/*
- * This initializes kmem_list3 for all nodes.
- */
-static int alloc_kmemlist(kmem_cache_t *cachep)
-{
-	int node;
-	struct kmem_list3 *l3;
-	int err = 0;
-
-	for_each_online_node(node) {
-		struct array_cache *nc = NULL, *new;
-		struct array_cache **new_alien = NULL;
-#ifdef CONFIG_NUMA
-		if (!(new_alien = alloc_alien_cache(node, cachep->limit)))
-			goto fail;
-#endif
-		if (!(new = alloc_arraycache(node, (cachep->shared*
-				cachep->batchcount), 0xbaadf00d)))
-			goto fail;
-		if ((l3 = cachep->nodelists[node])) {
-
-			spin_lock_irq(&l3->list_lock);
-
-			if ((nc = cachep->nodelists[node]->shared))
-				free_block(cachep, nc->entry,
-							nc->avail, node);
-
-			l3->shared = new;
-			if (!cachep->nodelists[node]->alien) {
-				l3->alien = new_alien;
-				new_alien = NULL;
-			}
-			l3->free_limit = (1 + nr_cpus_node(node))*
-				cachep->batchcount + cachep->num;
-			spin_unlock_irq(&l3->list_lock);
-			kfree(nc);
-			free_alien_cache(new_alien);
-			continue;
-		}
-		if (!(l3 = kmalloc_node(sizeof(struct kmem_list3),
-						GFP_KERNEL, node)))
-			goto fail;
-
-		kmem_list3_init(l3);
-		l3->next_reap = jiffies + REAPTIMEOUT_LIST3 +
-			((unsigned long)cachep)%REAPTIMEOUT_LIST3;
-		l3->shared = new;
-		l3->alien = new_alien;
-		l3->free_limit = (1 + nr_cpus_node(node))*
-			cachep->batchcount + cachep->num;
-		cachep->nodelists[node] = l3;
-	}
-	return err;
-fail:
-	err = -ENOMEM;
-	return err;
-}
-
 struct ccupdate_struct {
 	kmem_cache_t *cachep;
 	struct array_cache *new[NR_CPUS];
 };
 
+/*
+ * Executes in IRQ context:
+ */
 static void do_ccupdate_local(void *info)
 {
 	struct ccupdate_struct *new = (struct ccupdate_struct *)info;
 	struct array_cache *old;
 
 	check_irq_off();
-	old = ac_data(new->cachep);
+	old = ac_data(new->cachep, smp_processor_id());
 
 	new->cachep->array[smp_processor_id()] = new->new[smp_processor_id()];
 	new->new[smp_processor_id()] = old;
@@ -3145,14 +2706,19 @@
 				int shared)
 {
 	struct ccupdate_struct new;
-	int i, err;
+	struct array_cache *new_shared;
+	int i;
 
 	memset(&new.new,0,sizeof(new.new));
-	for_each_online_cpu(i) {
-		new.new[i] = alloc_arraycache(cpu_to_node(i), limit, batchcount);
-		if (!new.new[i]) {
-			for (i--; i >= 0; i--) kfree(new.new[i]);
-			return -ENOMEM;
+	for (i = 0; i < NR_CPUS; i++) {
+		if (cpu_online(i)) {
+			new.new[i] = alloc_arraycache(i, limit, batchcount);
+			if (!new.new[i]) {
+				for (i--; i >= 0; i--) kfree(new.new[i]);
+				return -ENOMEM;
+			}
+		} else {
+			new.new[i] = NULL;
 		}
 	}
 	new.cachep = cachep;
@@ -3163,25 +2729,31 @@
 	spin_lock_irq(&cachep->spinlock);
 	cachep->batchcount = batchcount;
 	cachep->limit = limit;
-	cachep->shared = shared;
+	cachep->free_limit = (1+num_online_cpus())*cachep->batchcount + cachep->num;
 	spin_unlock_irq(&cachep->spinlock);
 
-	for_each_online_cpu(i) {
+	for (i = 0; i < NR_CPUS; i++) {
 		struct array_cache *ccold = new.new[i];
 		if (!ccold)
 			continue;
-		spin_lock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock);
-		free_block(cachep, ccold->entry, ccold->avail, cpu_to_node(i));
-		spin_unlock_irq(&cachep->nodelists[cpu_to_node(i)]->list_lock);
+		spin_lock_irq(&cachep->spinlock);
+		free_block(cachep, ac_entry(ccold), ccold->avail);
+		spin_unlock_irq(&cachep->spinlock);
 		kfree(ccold);
 	}
-
-	err = alloc_kmemlist(cachep);
-	if (err) {
-		printk(KERN_ERR "alloc_kmemlist failed for %s, error %d.\n",
-				cachep->name, -err);
-		BUG();
+	new_shared = alloc_arraycache(-1, batchcount*shared, 0xbaadf00d);
+	if (new_shared) {
+		struct array_cache *old;
+
+		spin_lock_irq(&cachep->spinlock);
+		old = cachep->lists.shared;
+		cachep->lists.shared = new_shared;
+		if (old)
+			free_block(cachep, ac_entry(old), old->avail);
+		spin_unlock_irq(&cachep->spinlock);
+		kfree(old);
 	}
+
 	return 0;
 }
 
@@ -3232,6 +2804,10 @@
 	if (limit > 32)
 		limit = 32;
 #endif
+#ifdef CONFIG_PREEMPT
+	if (limit > 16)
+		limit = 16;
+#endif
 	err = do_tune_cpucache(cachep, limit, (limit+1)/2, shared);
 	if (err)
 		printk(KERN_ERR "enable_cpucache failed for %s, error %d.\n",
@@ -3239,11 +2815,11 @@
 }
 
 static void drain_array_locked(kmem_cache_t *cachep,
-				struct array_cache *ac, int force, int node)
+				struct array_cache *ac, int force)
 {
 	int tofree;
 
-	check_spinlock_acquired_node(cachep, node);
+	check_spinlock_acquired(cachep);
 	if (ac->touched && !force) {
 		ac->touched = 0;
 	} else if (ac->avail) {
@@ -3251,9 +2827,9 @@
 		if (tofree > ac->avail) {
 			tofree = (ac->avail+1)/2;
 		}
-		free_block(cachep, ac->entry, tofree, node);
+		free_block(cachep, ac_entry(ac), tofree);
 		ac->avail -= tofree;
-		memmove(ac->entry, &(ac->entry[tofree]),
+		memmove(&ac_entry(ac)[0], &ac_entry(ac)[tofree],
 					sizeof(void*)*ac->avail);
 	}
 }
@@ -3272,12 +2848,14 @@
  */
 static void cache_reap(void *unused)
 {
+	int cpu;
 	struct list_head *walk;
-	struct kmem_list3 *l3;
 
 	if (down_trylock(&cache_chain_sem)) {
 		/* Give up. Setup the next iteration. */
-		schedule_delayed_work(&__get_cpu_var(reap_work), REAPTIMEOUT_CPUC);
+next_iteration:
+		cpu = raw_smp_processor_id();
+		schedule_delayed_work(&per_cpu(reap_work, cpu), REAPTIMEOUT_CPUC + cpu);
 		return;
 	}
 
@@ -3294,32 +2872,28 @@
 
 		check_irq_on();
 
-		l3 = searchp->nodelists[numa_node_id()];
-		if (l3->alien)
-			drain_alien_cache(searchp, l3);
-		spin_lock_irq(&l3->list_lock);
+		spin_lock_irq(&searchp->spinlock);
+		cpu = raw_smp_processor_id();
 
-		drain_array_locked(searchp, ac_data(searchp), 0,
-				numa_node_id());
+		drain_array_locked(searchp, ac_data(searchp, cpu), 0);
 
-		if (time_after(l3->next_reap, jiffies))
+		if(time_after(searchp->lists.next_reap, jiffies))
 			goto next_unlock;
 
-		l3->next_reap = jiffies + REAPTIMEOUT_LIST3;
+		searchp->lists.next_reap = jiffies + REAPTIMEOUT_LIST3;
 
-		if (l3->shared)
-			drain_array_locked(searchp, l3->shared, 0,
-				numa_node_id());
+		if (searchp->lists.shared)
+			drain_array_locked(searchp, searchp->lists.shared, 0);
 
-		if (l3->free_touched) {
-			l3->free_touched = 0;
+		if (searchp->lists.free_touched) {
+			searchp->lists.free_touched = 0;
 			goto next_unlock;
 		}
 
-		tofree = (l3->free_limit+5*searchp->num-1)/(5*searchp->num);
+		tofree = (searchp->free_limit+5*searchp->num-1)/(5*searchp->num);
 		do {
-			p = l3->slabs_free.next;
-			if (p == &(l3->slabs_free))
+			p = list3_data(searchp)->slabs_free.next;
+			if (p == &(list3_data(searchp)->slabs_free))
 				break;
 
 			slabp = list_entry(p, struct slab, list);
@@ -3332,13 +2906,13 @@
 			 * searchp cannot disappear, we hold
 			 * cache_chain_lock
 			 */
-			l3->free_objects -= searchp->num;
-			spin_unlock_irq(&l3->list_lock);
+			searchp->lists.free_objects -= searchp->num;
+			spin_unlock_irq(&searchp->spinlock);
 			slab_destroy(searchp, slabp);
-			spin_lock_irq(&l3->list_lock);
+			spin_lock_irq(&searchp->spinlock);
 		} while(--tofree > 0);
 next_unlock:
-		spin_unlock_irq(&l3->list_lock);
+		spin_unlock_irq(&searchp->spinlock);
 next:
 		cond_resched();
 	}
@@ -3346,7 +2920,7 @@
 	up(&cache_chain_sem);
 	drain_remote_pages();
 	/* Setup the next iteration */
-	schedule_delayed_work(&__get_cpu_var(reap_work), REAPTIMEOUT_CPUC);
+	goto next_iteration;
 }
 
 #ifdef CONFIG_PROC_FS
@@ -3372,7 +2946,7 @@
 		seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
 #if STATS
 		seq_puts(m, " : globalstat <listallocs> <maxobjs> <grown> <reaped>"
-				" <error> <maxfreeable> <nodeallocs> <remotefrees>");
+				" <error> <maxfreeable> <freelimit> <nodeallocs>");
 		seq_puts(m, " : cpustat <allochit> <allocmiss> <freehit> <freemiss>");
 #endif
 		seq_putc(m, '\n');
@@ -3407,53 +2981,39 @@
 	unsigned long	active_objs;
 	unsigned long	num_objs;
 	unsigned long	active_slabs = 0;
-	unsigned long	num_slabs, free_objects = 0, shared_avail = 0;
+	unsigned long	num_slabs;
 	const char *name;
 	char *error = NULL;
-	int node;
-	struct kmem_list3 *l3;
 
 	check_irq_on();
 	spin_lock_irq(&cachep->spinlock);
 	active_objs = 0;
 	num_slabs = 0;
-	for_each_online_node(node) {
-		l3 = cachep->nodelists[node];
-		if (!l3)
-			continue;
-
-		spin_lock(&l3->list_lock);
-
-		list_for_each(q,&l3->slabs_full) {
-			slabp = list_entry(q, struct slab, list);
-			if (slabp->inuse != cachep->num && !error)
-				error = "slabs_full accounting error";
-			active_objs += cachep->num;
-			active_slabs++;
-		}
-		list_for_each(q,&l3->slabs_partial) {
-			slabp = list_entry(q, struct slab, list);
-			if (slabp->inuse == cachep->num && !error)
-				error = "slabs_partial inuse accounting error";
-			if (!slabp->inuse && !error)
-				error = "slabs_partial/inuse accounting error";
-			active_objs += slabp->inuse;
-			active_slabs++;
-		}
-		list_for_each(q,&l3->slabs_free) {
-			slabp = list_entry(q, struct slab, list);
-			if (slabp->inuse && !error)
-				error = "slabs_free/inuse accounting error";
-			num_slabs++;
-		}
-		free_objects += l3->free_objects;
-		shared_avail += l3->shared->avail;
-
-		spin_unlock(&l3->list_lock);
+	list_for_each(q,&cachep->lists.slabs_full) {
+		slabp = list_entry(q, struct slab, list);
+		if (slabp->inuse != cachep->num && !error)
+			error = "slabs_full accounting error";
+		active_objs += cachep->num;
+		active_slabs++;
+	}
+	list_for_each(q,&cachep->lists.slabs_partial) {
+		slabp = list_entry(q, struct slab, list);
+		if (slabp->inuse == cachep->num && !error)
+			error = "slabs_partial inuse accounting error";
+		if (!slabp->inuse && !error)
+			error = "slabs_partial/inuse accounting error";
+		active_objs += slabp->inuse;
+		active_slabs++;
+	}
+	list_for_each(q,&cachep->lists.slabs_free) {
+		slabp = list_entry(q, struct slab, list);
+		if (slabp->inuse && !error)
+			error = "slabs_free/inuse accounting error";
+		num_slabs++;
 	}
 	num_slabs+=active_slabs;
 	num_objs = num_slabs*cachep->num;
-	if (num_objs - active_objs != free_objects && !error)
+	if (num_objs - active_objs != cachep->lists.free_objects && !error)
 		error = "free_objects accounting error";
 
 	name = cachep->name; 
@@ -3465,9 +3025,9 @@
 		cachep->num, (1<<cachep->gfporder));
 	seq_printf(m, " : tunables %4u %4u %4u",
 			cachep->limit, cachep->batchcount,
-			cachep->shared);
-	seq_printf(m, " : slabdata %6lu %6lu %6lu",
-			active_slabs, num_slabs, shared_avail);
+			cachep->lists.shared->limit/cachep->batchcount);
+	seq_printf(m, " : slabdata %6lu %6lu %6u",
+			active_slabs, num_slabs, cachep->lists.shared->avail);
 #if STATS
 	{	/* list3 stats */
 		unsigned long high = cachep->high_mark;
@@ -3476,13 +3036,12 @@
 		unsigned long reaped = cachep->reaped;
 		unsigned long errors = cachep->errors;
 		unsigned long max_freeable = cachep->max_freeable;
+		unsigned long free_limit = cachep->free_limit;
 		unsigned long node_allocs = cachep->node_allocs;
-		unsigned long node_frees = cachep->node_frees;
 
-		seq_printf(m, " : globalstat %7lu %6lu %5lu %4lu \
-				%4lu %4lu %4lu %4lu",
+		seq_printf(m, " : globalstat %7lu %6lu %5lu %4lu %4lu %4lu %4lu %4lu",
 				allocs, high, grown, reaped, errors,
-				max_freeable, node_allocs, node_frees);
+				max_freeable, free_limit, node_allocs);
 	}
 	/* cpu stats */
 	{
@@ -3561,10 +3120,9 @@
 			    batchcount < 1 ||
 			    batchcount > limit ||
 			    shared < 0) {
-				res = 0;
+				res = -EINVAL;
 			} else {
-				res = do_tune_cpucache(cachep, limit,
-							batchcount, shared);
+				res = do_tune_cpucache(cachep, limit, batchcount, shared);
 			}
 			break;
 		}
@@ -3576,22 +3134,18 @@
 }
 #endif
 
-/**
- * ksize - get the actual amount of memory allocated for a given object
- * @objp: Pointer to the object
- *
- * kmalloc may internally round up allocations and return more memory
- * than requested. ksize() can be used to determine the actual amount of
- * memory allocated. The caller may use this additional memory, even though
- * a smaller amount of memory was initially specified with the kmalloc call.
- * The caller must guarantee that objp points to a valid object previously
- * allocated with either kmalloc() or kmem_cache_alloc(). The object
- * must not be freed during the duration of the call.
- */
 unsigned int ksize(const void *objp)
 {
-	if (unlikely(objp == NULL))
-		return 0;
+	kmem_cache_t *c;
+	unsigned long flags;
+	unsigned int size = 0;
+
+	if (likely(objp != NULL)) {
+		local_irq_save_nort(flags);
+		c = page_get_cache(virt_to_page(objp));
+		size = kmem_cache_size(c);
+		local_irq_restore_nort(flags);
+	}
 
-	return obj_reallen(page_get_cache(virt_to_page(objp)));
+	return size;
 }



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.6.15-rc5-rt2 slowness
  2005-12-17 22:57   ` Steven Rostedt
@ 2005-12-18 16:05     ` K.R. Foley
  2005-12-20 13:32     ` Ingo Molnar
  1 sibling, 0 replies; 56+ messages in thread
From: K.R. Foley @ 2005-12-18 16:05 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: Ingo Molnar, linux-kernel, Gunter Ohrner, john stultz

Steven Rostedt wrote:
> Ingo,
> 
> I ported your old changes of 2.6.14-rt22 of mm/slab.c to 2.6.15-rc5-rt2
> and tried it out.  I believe that this confirms that the SLOB _is_ the
> problem in the slowness.  Booting with this slab patch, gives the old
> speeds that we use to have.
> 
> Now, is the solution to bring the SLOB up to par with the SLAB, or to
> make the SLAB as close as possible to the mainline (why remove NUMA?)
> and keep it for PREEMPT_RT?
> 
> Below is the port of the slab changes if anyone else would like to see
> if this speeds things up for them.
> 
> -- Steve
> 

This drastically improves performance on my slower uniprocessor system.
2.6.15-rc5-rt2 still doesn't boot on my dual 933 box, with or without
this patch. I will try to dig into that a bit more today.

-- 
   kr

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.6.15-rc5-rt2 slowness
  2005-12-17 22:57   ` Steven Rostedt
  2005-12-18 16:05     ` K.R. Foley
@ 2005-12-20 13:32     ` Ingo Molnar
  2005-12-20 13:38       ` Steven Rostedt
  1 sibling, 1 reply; 56+ messages in thread
From: Ingo Molnar @ 2005-12-20 13:32 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel, Gunter Ohrner, john stultz


* Steven Rostedt <rostedt@goodmis.org> wrote:

> I ported your old changes of 2.6.14-rt22 of mm/slab.c to 
> 2.6.15-rc5-rt2 and tried it out.  I believe that this confirms that 
> the SLOB _is_ the problem in the slowness.  Booting with this slab 
> patch, gives the old speeds that we used to have.
> 
> Now, is the solution to bring the SLOB up to par with the SLAB, or to 
> make the SLAB as close as possible to the mainline (why remove NUMA?) 
> and keep it for PREEMPT_RT?
> 
> Below is the port of the slab changes if anyone else would like to see 
> if this speeds things up for them.

ok, i've added this back in - but we really need a cleaner port of SLAB 
...

	Ingo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.6.15-rc5-rt2 slowness
  2005-12-20 13:32     ` Ingo Molnar
@ 2005-12-20 13:38       ` Steven Rostedt
  2005-12-20 13:57         ` Ingo Molnar
  0 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 13:38 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Gunter Ohrner, john stultz


On Tue, 20 Dec 2005, Ingo Molnar wrote:
>
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > I ported your old changes of 2.6.14-rt22 of mm/slab.c to
> > 2.6.15-rc5-rt2 and tried it out.  I believe that this confirms that
> > the SLOB _is_ the problem in the slowness.  Booting with this slab
> > patch, gives the old speeds that we use to have.
> >
> > Now, is the solution to bring the SLOB up to par with the SLAB, or to
> > make the SLAB as close as possible to the mainline (why remove NUMA?)
> > and keep it for PREEMPT_RT?
> >
> > Below is the port of the slab changes if anyone else would like to see
> > if this speeds things up for them.
>
> ok, i've added this back in - but we really need a cleaner port of SLAB
> ...
>

Actually, how much do you want that SLOB code?  For the last couple of
days I've been working on different approaches that can speed it up.
Right now I have one that takes advantage of the different caches.  But
unfortunately, I'm dealing with a bad pointer some where that keeps
making it bug. Argh!

-- Steve


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.6.15-rc5-rt2 slowness
  2005-12-20 13:38       ` Steven Rostedt
@ 2005-12-20 13:57         ` Ingo Molnar
  2005-12-20 14:04           ` Steven Rostedt
                             ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Ingo Molnar @ 2005-12-20 13:57 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: linux-kernel, Gunter Ohrner, john stultz


* Steven Rostedt <rostedt@goodmis.org> wrote:

> > > Now, is the solution to bring the SLOB up to par with the SLAB, or to
> > > make the SLAB as close as possible to the mainline (why remove NUMA?)
> > > and keep it for PREEMPT_RT?
> > >
> > > Below is the port of the slab changes if anyone else would like to see
> > > if this speeds things up for them.
> >
> > ok, i've added this back in - but we really need a cleaner port of SLAB
> > ...
> >
> 
> Actually, how much do you want that SLOB code?  For the last couple of 
> days I've been working on different approaches that can speed it up. 
> Right now I have one that takes advantage of the different caches.  
> But unfortunately, I'm dealing with a bad pointer somewhere that 
> keeps making it bug. Argh!

well, the SLOB is mainly about being simple and small. So as long as 
those speedups are SMP-only, they ought to be fine. The problems are 
mainly SMP related, correct?

	Ingo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: 2.6.15-rc5-rt2 slowness
  2005-12-20 13:57         ` Ingo Molnar
@ 2005-12-20 14:04           ` Steven Rostedt
  2005-12-20 14:33             ` Steven Rostedt
                               ` (3 more replies)
  2005-12-20 14:07           ` 2.6.15-rc5-rt2 slowness Steven Rostedt
  2005-12-20 15:26           ` K.R. Foley
  2 siblings, 4 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 14:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Gunter Ohrner, john stultz

On Tue, 20 Dec 2005, Ingo Molnar wrote:
>
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > > > Now, is the solution to bring the SLOB up to par with the SLAB, or to
> > > > make the SLAB as close as possible to the mainline (why remove NUMA?)
> > > > and keep it for PREEMPT_RT?
> > > >
> > > > Below is the port of the slab changes if anyone else would like to see
> > > > if this speeds things up for them.
> > >
> > > ok, i've added this back in - but we really need a cleaner port of SLAB
> > > ...
> > >
> >
> > Actually, how much do you want that SLOB code?  For the last couple of
> > days I've been working on different approaches that can speed it up.
> > Right now I have one that takes advantage of the different caches.
> > But unfortunately, I'm dealing with a bad pointer somewhere that
> > keeps making it bug. Argh!
>
> well, the SLOB is mainly about being simple and small. So as long as
> those speedups are SMP-only, they ought to be fine. The problems are
> mainly SMP related, correct?

Actually, no.  My test is to do a make install over NFS of a kernel that
has already been built.

The time I'm getting for the SLAB is ~26 seconds; the time for the SLOB
is 1 minute 32 seconds.  So you're looking at a >300% slowdown here.  The
test bed is a UP machine.  (I test there first before looking into SMP.)

I'm still trying to keep the SLOB simple.  It's the lack of sleep that is
making it hard ;)

-- Steve



* Re: 2.6.15-rc5-rt2 slowness
  2005-12-20 13:57         ` Ingo Molnar
  2005-12-20 14:04           ` Steven Rostedt
@ 2005-12-20 14:07           ` Steven Rostedt
  2005-12-20 15:26           ` K.R. Foley
  2 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 14:07 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, Gunter Ohrner, john stultz


On Tue, 20 Dec 2005, Ingo Molnar wrote:
>
> well, the SLOB is mainly about being simple and small. So as long as
> those speedups are SMP-only, they ought to be fine. The problems are
> mainly SMP related, correct?

Oh, but if this does work out, it _will_ improve SMP performance greatly!

-- Steve



* Re: 2.6.15-rc5-rt2 slowness
  2005-12-20 14:04           ` Steven Rostedt
@ 2005-12-20 14:33             ` Steven Rostedt
  2005-12-20 15:07               ` Ingo Molnar
  2005-12-20 15:44             ` [PATCH RT 00/02] SLOB optimizations Steven Rostedt
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 14:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: john stultz, Gunter Ohrner, linux-kernel

On Tue, 2005-12-20 at 09:04 -0500, Steven Rostedt wrote:

> 
> Actually, no.  My test is to do a make install over NFS of a kernel that
> has already been built.
> 
> The time I'm getting for the SLAB is ~26 seconds; the time for the SLOB
> is 1 minute 32 seconds.  So you're looking at a >300% slowdown here.  The
> test bed is a UP machine.  (I test there first before looking into SMP.)
> 
> I'm still trying to keep the SLOB simple.  It's the lack of sleep that is
> making it hard ;)
> 

Amazing what you see after a few hours of sleep.  The bug I was looking
for all day yesterday turned out to be a < where it should have been a
<=.

OK, it boots and runs.  Now I just need to clean it up and prepare to
ship!  My test is simply to run make install on a prebuilt kernel over
NFS.  The test box is:

bert:/home/rostedt/work/ernie/linux-2.6.15-rc5-rt2# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 3
cpu MHz         : 736.045
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips        : 1472.74

The box the kernel tree is NFS-mounted from is running 2.6.15-rc4:

rostedt@gandalf:~/work/ernie/linux-2.6.15-rc5-rt2$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 43
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4200+
stepping        : 1
cpu MHz         : 2210.221
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 4424.61
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

processor       : 1
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 43
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4200+
stepping        : 1
cpu MHz         : 2210.221
cache size      : 512 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 4419.74
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp



Here's the times for "time make install":  Three runs for each.

rt with slab:

run 1:
  real    0m27.327s
  user    0m15.151s
  sys     0m3.149s

run 2:
  real    0m26.952s
  user    0m15.171s
  sys     0m3.178s

run 3:
  real    0m27.269s
  user    0m15.175s
  sys     0m3.226s

rt with slob (plain):

run 1:
  real    1m26.845s
  user    0m16.173s
  sys     0m29.558s

run 2:
  real    1m27.895s
  user    0m16.532s
  sys     0m30.460s

run 3:
  real    1m25.645s
  user    0m16.468s
  sys     0m30.973s

rt with slob (new):

run 1:
  real    0m28.740s
  user    0m15.364s
  sys     0m3.866s

run 2:
  real    0m27.782s
  user    0m15.409s
  sys     0m3.885s

run 3:
  real    0m27.576s
  user    0m15.193s
  sys     0m3.933s

As you see, the new SLOB code runs almost as fast as the SLAB code.
With some more improvements, I'm sure it can get even faster.

I'll send out the patch real soon (after I wash and dry it).

Note: After I send out the patch, I'll give it a try on SMP.

-- Steve




* Re: 2.6.15-rc5-rt2 slowness
  2005-12-20 14:33             ` Steven Rostedt
@ 2005-12-20 15:07               ` Ingo Molnar
  2005-12-20 15:16                 ` Steven Rostedt
  0 siblings, 1 reply; 56+ messages in thread
From: Ingo Molnar @ 2005-12-20 15:07 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: john stultz, Gunter Ohrner, linux-kernel


* Steven Rostedt <rostedt@goodmis.org> wrote:

> As you see, the new SLOB code runs almost as fast as the SLAB code. 
> With some more improvements, I'm sure it can get even faster.

cool, the numbers are really impressive! I'm wondering where the biggest 
hit comes from - perhaps the SLOB does linear list walking when 
allocating?

	Ingo


* Re: 2.6.15-rc5-rt2 slowness
  2005-12-20 15:07               ` Ingo Molnar
@ 2005-12-20 15:16                 ` Steven Rostedt
  0 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 15:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: john stultz, Gunter Ohrner, linux-kernel


On Tue, 20 Dec 2005, Ingo Molnar wrote:
>
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > As you see, the new SLOB code runs almost as fast as the SLAB code.
> > With some more improvements, I'm sure it can get even faster.
>
> cool, the numbers are really impressive! I'm wondering where the biggest
> hit comes from - perhaps the SLOB does linear list walking when
> allocating?
>

Yeah, I think that's the biggest hit.  The SLOB does the old K&R memory
management, basically right from the book.  But it is slow and can
fragment very easily.
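For reference, here's a rough user-space sketch of the K&R-style
first-fit scheme being discussed.  The names and the fixed-size pool are
made up for illustration, and freeing/coalescing (and the empty-list
edge case) are omitted, but it shows why every allocation is a linear
walk over the free list:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rough user-space model of a K&R-style first-fit allocator, the
 * scheme the stock SLOB uses.  Everything here is illustrative, not
 * the kernel's actual code.
 */

typedef struct unit {
	size_t units;		/* size of this free block, in units */
	struct unit *next;	/* next block on the circular free list */
} unit_t;

#define POOL_UNITS 1024
static unit_t pool[POOL_UNITS];
static unit_t *freelist;

static void pool_init(void)
{
	pool[0].units = POOL_UNITS;	/* one big free block ...  */
	pool[0].next = &pool[0];	/* ... on a circular list   */
	freelist = &pool[0];
}

/*
 * First fit: walk the free list until a block with enough units turns
 * up.  Cost is O(number of free blocks), which grows as the pool
 * fragments -- the behavior being blamed for the slowdown.
 */
static void *pool_alloc(size_t nunits)
{
	unit_t *prev = freelist, *cur = prev->next;

	for (;;) {
		if (cur->units >= nunits) {
			if (cur->units == nunits) {	/* exact fit: unlink */
				prev->next = cur->next;
			} else {			/* split the tail off */
				cur->units -= nunits;
				cur += cur->units;
				cur->units = nunits;
			}
			freelist = prev;
			return cur;
		}
		if (cur == freelist)
			return NULL;	/* wrapped around: no fit anywhere */
		prev = cur;
		cur = cur->next;
	}
}
```

Every kmalloc-sized request pays for that walk, and every split lengthens
the list, which is why fragmentation and search time feed each other.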

I have a little more to do on this patch (I don't yet perform the
correct cleanup in kmem_cache_destroy), but I'll send it to you anyway
within the next couple of minutes, just so you can take a look and try
it out.

-- Steve


* Re: 2.6.15-rc5-rt2 slowness
  2005-12-20 13:57         ` Ingo Molnar
  2005-12-20 14:04           ` Steven Rostedt
  2005-12-20 14:07           ` 2.6.15-rc5-rt2 slowness Steven Rostedt
@ 2005-12-20 15:26           ` K.R. Foley
  2 siblings, 0 replies; 56+ messages in thread
From: K.R. Foley @ 2005-12-20 15:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Steven Rostedt, linux-kernel, Gunter Ohrner, john stultz

Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>>>> Now, is the solution to bring the SLOB up to par with the SLAB, or to
>>>> make the SLAB as close as possible to the mainline (why remove NUMA?)
>>>> and keep it for PREEMPT_RT?
>>>>
>>>> Below is the port of the slab changes if anyone else would like to see
>>>> if this speeds things up for them.
>>> ok, i've added this back in - but we really need a cleaner port of SLAB
>>> ...
>>>
>> Actually, how much do you want that SLOB code?  For the last couple of 
>> days I've been working on different approaches that can speed it up. 
>> Right now I have one that takes advantage of the different caches.  
>> But unfortunately, I'm dealing with a bad pointer somewhere that 
>> keeps making it bug. Argh!
> 
> well, the SLOB is mainly about being simple and small. So as long as 
> those speedups are SMP-only, they ought to be fine. The problems are 
> mainly SMP related, correct?
> 
> 	Ingo

No.  I experienced horrible performance running the original patch with
the SLOB on my uniprocessor system vs. the patch with Steven's SLAB
port applied on the same system.  In fact, I am currently running the
latter on that system.  With the original patch the system is really
unusable.

-- 
   kr


* [PATCH RT 00/02] SLOB optimizations
  2005-12-20 14:04           ` Steven Rostedt
  2005-12-20 14:33             ` Steven Rostedt
@ 2005-12-20 15:44             ` Steven Rostedt
  2005-12-20 15:56               ` Steven Rostedt
                                 ` (2 more replies)
  2005-12-20 15:44             ` [PATCH RT 01/02] SLOB - remove bigblock list Steven Rostedt
  2005-12-20 15:44             ` [PATCH RT 02/02] SLOB - break SLOB up by caches Steven Rostedt
  3 siblings, 3 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 15:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Matt Mackall, john stultz, Gunter Ohrner,
	linux-kernel

(Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)

OK Ingo, here it is.

The old SLOB did old-style K&R memory allocation.

It had a global linked list, "slobfree".  When it needed memory it would
search this list linearly to find the first spot that fit and then
return it.  Memory was broken up into SLOB_UNITs, where a unit is the
number of bytes needed to hold a slob_t.

Since the sizes of the allocations fluctuated greatly, the chance of
fragmentation was very high.  Fragmentation in turn made the search for
free locations longer, since the number of free blocks kept growing.

It also had one global spinlock for ALL allocations.  This would
obviously kill SMP performance.

For large blocks, greater than PAGE_SIZE, it would just use the buddy
system.  I didn't change this; in fact, I made it use the buddy system
for blocks greater than PAGE_SIZE >> 1, but I'll explain that below.

The problem with this scheme was that it kept yet another global linked
list to hold all of these large blocks, which meant it needed another
global spinlock to manage that list.

When any block was freed via kfree, it would first search all the big
blocks to see if it was a large allocation, and if not, it would then
search the slobfree list to find where the block belongs, taking both
global spinlocks in the process!
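To make that cost concrete, here is a hypothetical user-space model of
the old kfree() path just described, with pthread mutexes standing in
for the kernel's block_lock and slob_lock.  The structure mirrors the
code the first patch removes; everything else (names, counters, the
elided free-list walk) is simplified for illustration:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Simplified model of the old double-lock kfree() path. */

struct bigblock {
	void *pages;
	struct bigblock *next;
};

static struct bigblock *bigblocks;
static pthread_mutex_t block_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t slob_lock  = PTHREAD_MUTEX_INITIALIZER;

static int freed_big, freed_small;	/* stand-ins for the real work */

static void model_kfree(const void *block)
{
	struct bigblock *bb, **last = &bigblocks;

	/* First global lock: linear scan of every big allocation. */
	pthread_mutex_lock(&block_lock);
	for (bb = bigblocks; bb; last = &bb->next, bb = bb->next) {
		if (bb->pages == block) {
			*last = bb->next;
			pthread_mutex_unlock(&block_lock);
			freed_big++;	/* real code: free_pages() etc. */
			return;
		}
	}
	pthread_mutex_unlock(&block_lock);

	/* Second global lock: walk the free list to reinsert (elided). */
	pthread_mutex_lock(&slob_lock);
	freed_small++;
	pthread_mutex_unlock(&slob_lock);
}
```

Every small free serializes all CPUs behind both locks, which is why the
per-cache locks introduced below help so much.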


Here's what I've done to solve this.

First things first: the first patch gets rid of the bigblock list.  I
simply used the method of SLAB, storing a pointer to the bigblock
descriptor (which has the information needed to free the block) in the
lru list field of the corresponding page.  This got rid of the bigblock
linked list and its global spinlock.

The next patch is the big improvement, with the largest changes.  I
took advantage of the kmem_cache usage the way SLAB does: I created a
memory pool like the global one, but one per cache, for every cache
with a size up to PAGE_SIZE >> 1.

[ Note: I picked PAGE_SIZE >> 1 since the caches didn't seem to make
much difference for anything greater, as a full page would be used for
just one allocation anyway.  I can play with this more, but it still
seems to be a waste. ]

I use lru.next of the pages that back the allocations for the bigblock
descriptors (as described above), and now I use lru.prev to point to
the cache that the items in the pool belong to.  So I removed the need
to hold the descriptor in the pool.

I also created general caches, as SLAB has, for kmalloc and kfree, for
sizes 32 through PAGE_SIZE >> 1.  All greater allocations will use the
buddy-system backend.
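A small sketch of the size-class dispatch this describes, modeled in
user space with an assumed 4096-byte page (so NR_SLOB_CACHES would be
PAGE_SHIFT - 5 = 7, covering sizes 32 through 2048).  The helper name is
made up, but the loop mirrors the one the second patch adds to
kmalloc():

```c
#include <assert.h>
#include <stddef.h>

/* User-space model: which general cache serves a kmalloc of `size`? */

#define MODEL_PAGE_SIZE		4096
#define MAX_SLOB_CACHE_SIZE	(MODEL_PAGE_SIZE >> 1)
#define NR_SLOB_CACHES		7	/* 32, 64, ..., 2048 on a 4K page */

/*
 * Return the index of the general cache serving `size`, or -1 when the
 * request is large enough to fall through to the buddy allocator.
 */
static int slob_cache_index(size_t size)
{
	size_t order = 32;
	int i;

	if (size > MAX_SLOB_CACHE_SIZE)
		return -1;
	for (i = 0; i < NR_SLOB_CACHES; i++, order <<= 1)
		if (size <= order)
			return i;
	return -1;	/* unreachable for valid sizes */
}
```

With a 4K page this gives seven fixed-size pools; the -1 case is the
"just allocate the necessary pages" path the cover letter mentions.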

Tests:
=====

To test this, I used what showed the problem most clearly: doing a make
install over NFS.  So on my 733MHz UP machine, I ran "time make install"
on the -rt kernel with the old SLAB, on the -rt kernel with the
default (old) SLOB, and then with the SLOB with these patches (three
runs each).  Here are the results:

rt with slab:

run 1:
  real    0m27.327s
  user    0m15.151s
  sys     0m3.149s

run 2:
  real    0m26.952s
  user    0m15.171s
  sys     0m3.178s

run 3:
  real    0m27.269s
  user    0m15.175s
  sys     0m3.226s

rt with slob (plain):

run 1:
  real    1m26.845s
  user    0m16.173s
  sys     0m29.558s

run 2:
  real    1m27.895s
  user    0m16.532s
  sys     0m30.460s

run 3:
  real    1m25.645s
  user    0m16.468s
  sys     0m30.973s

rt with slob (new):

run 1:
  real    0m28.740s
  user    0m15.364s
  sys     0m3.866s

run 2:
  real    0m27.782s
  user    0m15.409s
  sys     0m3.885s

run 3:
  real    0m27.576s
  user    0m15.193s
  sys     0m3.933s


So I have improved the speed of SLOB to almost that of SLAB!

TODO:  IMPORTANT!!!

1) I haven't cleaned up kmem_cache_destroy yet, so every time it is
called, there's a memory leak.

2) I need to test on SMP.

-- Steve




* [PATCH RT 01/02] SLOB - remove bigblock list
  2005-12-20 14:04           ` Steven Rostedt
  2005-12-20 14:33             ` Steven Rostedt
  2005-12-20 15:44             ` [PATCH RT 00/02] SLOB optimizations Steven Rostedt
@ 2005-12-20 15:44             ` Steven Rostedt
  2005-12-20 15:44             ` [PATCH RT 02/02] SLOB - break SLOB up by caches Steven Rostedt
  3 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 15:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gunter Ohrner, john stultz, Matt Mackall,
	Andrew Morton

This patch uses the mem_map pages to find the bigblock descriptor for
large allocations.

-- Steve

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-2.6.15-rc5-rt2/mm/slob.c
===================================================================
--- linux-2.6.15-rc5-rt2.orig/mm/slob.c	2005-12-19 10:45:55.000000000 -0500
+++ linux-2.6.15-rc5-rt2/mm/slob.c	2005-12-19 14:12:08.000000000 -0500
@@ -50,15 +50,42 @@
 struct bigblock {
 	int order;
 	void *pages;
-	struct bigblock *next;
 };
 typedef struct bigblock bigblock_t;
 
 static slob_t arena = { .next = &arena, .units = 1 };
 static slob_t *slobfree = &arena;
-static bigblock_t *bigblocks;
 static DEFINE_SPINLOCK(slob_lock);
-static DEFINE_SPINLOCK(block_lock);
+
+#define __get_slob_block(b) ((unsigned long)(b) & ~(PAGE_SIZE-1))
+
+static inline struct page *get_slob_page(const void *mem)
+{
+	void *virt = (void*)__get_slob_block(mem);
+
+	return virt_to_page(virt);
+}
+
+static inline void zero_slob_block(const void *b)
+{
+	struct page *page;
+	page = get_slob_page(b);
+	memset(&page->lru, 0, sizeof(page->lru));
+}
+
+static inline void *get_slob_block(const void *b)
+{
+	struct page *page;
+	page = get_slob_page(b);
+	return page->lru.next;
+}
+
+static inline void set_slob_block(const void *b, void *data)
+{
+	struct page *page;
+	page = get_slob_page(b);
+	page->lru.next = data;
+}
 
 static void slob_free(void *b, int size);
 
@@ -108,6 +135,7 @@
 			if (!cur)
 				return 0;
 
+			zero_slob_block(cur);
 			slob_free(cur, PAGE_SIZE);
 			spin_lock_irqsave(&slob_lock, flags);
 			cur = slobfree;
@@ -162,7 +190,6 @@
 {
 	slob_t *m;
 	bigblock_t *bb;
-	unsigned long flags;
 
 	if (size < PAGE_SIZE - SLOB_UNIT) {
 		m = slob_alloc(size + SLOB_UNIT, gfp, 0);
@@ -177,10 +204,7 @@
 	bb->pages = (void *)__get_free_pages(gfp, bb->order);
 
 	if (bb->pages) {
-		spin_lock_irqsave(&block_lock, flags);
-		bb->next = bigblocks;
-		bigblocks = bb;
-		spin_unlock_irqrestore(&block_lock, flags);
+		set_slob_block(bb->pages, bb);
 		return bb->pages;
 	}
 
@@ -192,25 +216,16 @@
 
 void kfree(const void *block)
 {
-	bigblock_t *bb, **last = &bigblocks;
-	unsigned long flags;
+	bigblock_t *bb;
 
 	if (!block)
 		return;
 
-	if (!((unsigned long)block & (PAGE_SIZE-1))) {
-		/* might be on the big block list */
-		spin_lock_irqsave(&block_lock, flags);
-		for (bb = bigblocks; bb; last = &bb->next, bb = bb->next) {
-			if (bb->pages == block) {
-				*last = bb->next;
-				spin_unlock_irqrestore(&block_lock, flags);
-				free_pages((unsigned long)block, bb->order);
-				slob_free(bb, sizeof(bigblock_t));
-				return;
-			}
-		}
-		spin_unlock_irqrestore(&block_lock, flags);
+	bb = get_slob_block(block);
+	if (bb) {
+		free_pages((unsigned long)block, bb->order);
+		slob_free(bb, sizeof(bigblock_t));
+		return;
 	}
 
 	slob_free((slob_t *)block - 1, 0);
@@ -222,20 +237,13 @@
 unsigned int ksize(const void *block)
 {
 	bigblock_t *bb;
-	unsigned long flags;
 
 	if (!block)
 		return 0;
 
-	if (!((unsigned long)block & (PAGE_SIZE-1))) {
-		spin_lock_irqsave(&block_lock, flags);
-		for (bb = bigblocks; bb; bb = bb->next)
-			if (bb->pages == block) {
-				spin_unlock_irqrestore(&slob_lock, flags);
-				return PAGE_SIZE << bb->order;
-			}
-		spin_unlock_irqrestore(&block_lock, flags);
-	}
+	bb = get_slob_block(block);
+	if (bb)
+		return PAGE_SIZE << bb->order;
 
 	return ((slob_t *)block - 1)->units * SLOB_UNIT;
 }




* [PATCH RT 02/02] SLOB - break SLOB up by caches
  2005-12-20 14:04           ` Steven Rostedt
                               ` (2 preceding siblings ...)
  2005-12-20 15:44             ` [PATCH RT 01/02] SLOB - remove bigblock list Steven Rostedt
@ 2005-12-20 15:44             ` Steven Rostedt
  3 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 15:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gunter Ohrner, john stultz, Matt Mackall,
	Andrew Morton

This patch breaks the SLOB up by caches, and also uses the mem_map
pages to find the cache descriptor.


Once again:

TODO:  IMPORTANT!!!

1) I haven't cleaned up kmem_cache_destroy yet, so every time it is
called, there's a memory leak.

2) I need to test on SMP.

-- Steve

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-2.6.15-rc5-rt2/mm/slob.c
===================================================================
--- linux-2.6.15-rc5-rt2.orig/mm/slob.c	2005-12-19 18:00:01.000000000 -0500
+++ linux-2.6.15-rc5-rt2/mm/slob.c	2005-12-20 10:16:39.000000000 -0500
@@ -27,6 +27,20 @@
  * are allocated by calling __get_free_pages. As SLAB objects know
  * their size, no separate size bookkeeping is necessary and there is
  * essentially no allocation space overhead.
+ *
+ * Modified by: Steven Rostedt <rostedt@goodmis.org> 12/20/05
+ *
+ * Now we take advantage of the kmem_cache usage.  I've removed
+ * the global slobfree, and created one for every cache.
+ *
+ * For kmalloc/kfree I've reintroduced the usage of cache_sizes,
+ * but only for sizes 32 through PAGE_SIZE >> 1 by order of 2.
+ *
+ * Having the SLOB alloc per size of the cache should speed things up
+ * greatly, not only by making the search paths smaller, but also by
+ * keeping all the caches of similar units.  This way the fragmentation
+ * should not be as big of a problem.
+ *
  */
 
 #include <linux/config.h>
@@ -37,6 +51,8 @@
 #include <linux/module.h>
 #include <linux/timer.h>
 
+#undef DEBUG_CACHE
+
 struct slob_block {
 	int units;
 	struct slob_block *next;
@@ -53,17 +69,66 @@
 };
 typedef struct bigblock bigblock_t;
 
-static slob_t arena = { .next = &arena, .units = 1 };
-static slob_t *slobfree = &arena;
-static DEFINE_SPINLOCK(slob_lock);
+struct kmem_cache {
+	unsigned int size, align;
+	const char *name;
+	slob_t *slobfree;
+	slob_t arena;
+	spinlock_t lock;
+	void (*ctor)(void *, struct kmem_cache *, unsigned long);
+	void (*dtor)(void *, struct kmem_cache *, unsigned long);
+	atomic_t items;
+	unsigned int free;
+	struct list_head list;
+};
 
-#define __get_slob_block(b) ((unsigned long)(b) & ~(PAGE_SIZE-1))
+#define NR_SLOB_CACHES ((PAGE_SHIFT) - 5) /* 32 to PAGE_SIZE-1 by order of 2 */
+#define MAX_SLOB_CACHE_SIZE (PAGE_SIZE >> 1)
 
-static inline struct page *get_slob_page(const void *mem)
+static struct kmem_cache *cache_sizes[NR_SLOB_CACHES];
+static struct kmem_cache *bb_cache;
+
+static struct semaphore	cache_chain_sem;
+static struct list_head cache_chain;
+
+#ifdef DEBUG_CACHE
+static void test_cache(kmem_cache_t *c)
 {
-	void *virt = (void*)__get_slob_block(mem);
+	slob_t *cur = c->slobfree;
+	unsigned int x = -1 >> 2;
 
-	return virt_to_page(virt);
+	do {
+		BUG_ON(!cur->next);
+		cur = cur->next;
+	} while (cur != c->slobfree && --x);
+	BUG_ON(!x);
+}
+#else
+#define test_cache(x) do {} while(0)
+#endif
+
+/*
+ * Here we take advantage of the lru field of the pages that
+ * map to the pages we use in the SLOB.  This is done similar
+ * to what is done with SLAB.
+ *
+ * The lru.next field is used to get the bigblock descriptor
+ *    for large blocks larger than PAGE_SIZE >> 1.
+ *
+ * Set and retrieved by set_slob_block and get_slob_block
+ * respectively.
+ *
+ * The lru.prev field is used to find the cache descriptor
+ *   for small blocks smaller than or equal to PAGE_SIZE >> 1.
+ *
+ * Set and retrieved by set_slob_ptr and get_slob_ptr
+ * respectively.
+ *
+ * The use of lru.next tells us in kmalloc that the page is large.
+ */
+static inline struct page *get_slob_page(const void *mem)
+{
+	return virt_to_page(mem);
 }
 
 static inline void zero_slob_block(const void *b)
@@ -87,18 +152,39 @@
 	page->lru.next = data;
 }
 
-static void slob_free(void *b, int size);
+static inline void *get_slob_ptr(const void *b)
+{
+	struct page *page;
+	page = get_slob_page(b);
+	return page->lru.prev;
+}
+
+static inline void set_slob_ptr(const void *b, void *data)
+{
+	struct page *page;
+	page = get_slob_page(b);
+	page->lru.prev = data;
+}
+
+static void slob_free(kmem_cache_t *cachep, void *b, int size);
 
-static void *slob_alloc(size_t size, gfp_t gfp, int align)
+static void *slob_alloc(kmem_cache_t *cachep, gfp_t gfp, int align)
 {
+	size_t size;
 	slob_t *prev, *cur, *aligned = 0;
-	int delta = 0, units = SLOB_UNITS(size);
+	int delta = 0, units;
 	unsigned long flags;
 
-	spin_lock_irqsave(&slob_lock, flags);
-	prev = slobfree;
+	size = cachep->size;
+	units = SLOB_UNITS(size);
+	BUG_ON(!units);
+
+	spin_lock_irqsave(&cachep->lock, flags);
+	prev = cachep->slobfree;
 	for (cur = prev->next; ; prev = cur, cur = cur->next) {
 		if (align) {
+			while (align < SLOB_UNIT)
+				align <<= 1;
 			aligned = (slob_t *)ALIGN((unsigned long)cur, align);
 			delta = aligned - cur;
 		}
@@ -121,12 +207,16 @@
 				cur->units = units;
 			}
 
-			slobfree = prev;
-			spin_unlock_irqrestore(&slob_lock, flags);
+			cachep->slobfree = prev;
+			test_cache(cachep);
+			if (prev < prev->next)
+				BUG_ON(cur + cur->units > prev->next);
+			spin_unlock_irqrestore(&cachep->lock, flags);
 			return cur;
 		}
-		if (cur == slobfree) {
-			spin_unlock_irqrestore(&slob_lock, flags);
+		if (cur == cachep->slobfree) {
+			test_cache(cachep);
+			spin_unlock_irqrestore(&cachep->lock, flags);
 
 			if (size == PAGE_SIZE) /* trying to shrink arena? */
 				return 0;
@@ -136,14 +226,15 @@
 				return 0;
 
 			zero_slob_block(cur);
-			slob_free(cur, PAGE_SIZE);
-			spin_lock_irqsave(&slob_lock, flags);
-			cur = slobfree;
+			set_slob_ptr(cur, cachep);
+			slob_free(cachep, cur, PAGE_SIZE);
+			spin_lock_irqsave(&cachep->lock, flags);
+			cur = cachep->slobfree;
 		}
 	}
 }
 
-static void slob_free(void *block, int size)
+static void slob_free(kmem_cache_t *cachep, void *block, int size)
 {
 	slob_t *cur, *b = (slob_t *)block;
 	unsigned long flags;
@@ -155,26 +246,29 @@
 		b->units = SLOB_UNITS(size);
 
 	/* Find reinsertion point */
-	spin_lock_irqsave(&slob_lock, flags);
-	for (cur = slobfree; !(b > cur && b < cur->next); cur = cur->next)
+	spin_lock_irqsave(&cachep->lock, flags);
+	for (cur = cachep->slobfree; !(b > cur && b < cur->next); cur = cur->next)
 		if (cur >= cur->next && (b > cur || b < cur->next))
 			break;
 
 	if (b + b->units == cur->next) {
 		b->units += cur->next->units;
 		b->next = cur->next->next;
+		BUG_ON(cur->next == &cachep->arena);
 	} else
 		b->next = cur->next;
 
 	if (cur + cur->units == b) {
 		cur->units += b->units;
 		cur->next = b->next;
+		BUG_ON(b == &cachep->arena);
 	} else
 		cur->next = b;
 
-	slobfree = cur;
+	cachep->slobfree = cur;
 
-	spin_unlock_irqrestore(&slob_lock, flags);
+	test_cache(cachep);
+	spin_unlock_irqrestore(&cachep->lock, flags);
 }
 
 static int FASTCALL(find_order(int size));
@@ -188,15 +282,24 @@
 
 void *kmalloc(size_t size, gfp_t gfp)
 {
-	slob_t *m;
 	bigblock_t *bb;
 
-	if (size < PAGE_SIZE - SLOB_UNIT) {
-		m = slob_alloc(size + SLOB_UNIT, gfp, 0);
-		return m ? (void *)(m + 1) : 0;
+	/*
+	 * If the size is less than PAGE_SIZE >> 1 then
+	 * we use the generic caches.  Otherwise, we
+	 * just allocate the necessary pages.
+	 */
+	if (size <= MAX_SLOB_CACHE_SIZE) {
+		int i;
+		int order;
+		for (i=0, order=32; i < NR_SLOB_CACHES; i++, order <<= 1)
+			if (size <= order)
+				break;
+		BUG_ON(i == NR_SLOB_CACHES);
+		return kmem_cache_alloc(cache_sizes[i], gfp);
 	}
 
-	bb = slob_alloc(sizeof(bigblock_t), gfp, 0);
+	bb = slob_alloc(bb_cache, gfp, 0);
 	if (!bb)
 		return 0;
 
@@ -208,7 +311,7 @@
 		return bb->pages;
 	}
 
-	slob_free(bb, sizeof(bigblock_t));
+	slob_free(bb_cache, bb, sizeof(bigblock_t));
 	return 0;
 }
 
@@ -216,19 +319,26 @@
 
 void kfree(const void *block)
 {
+	kmem_cache_t *c;
 	bigblock_t *bb;
 
 	if (!block)
 		return;
 
+	/*
+	 * look into the page of the allocated block to
+	 * see if this is a big allocation or not.
+	 */
 	bb = get_slob_block(block);
 	if (bb) {
 		free_pages((unsigned long)block, bb->order);
-		slob_free(bb, sizeof(bigblock_t));
+		slob_free(bb_cache, bb, sizeof(bigblock_t));
 		return;
 	}
 
-	slob_free((slob_t *)block - 1, 0);
+	c = get_slob_ptr(block);
+	kmem_cache_free(c, (void *)block);
+
 	return;
 }
 
@@ -237,6 +347,7 @@
 unsigned int ksize(const void *block)
 {
 	bigblock_t *bb;
+	kmem_cache_t *c;
 
 	if (!block)
 		return 0;
@@ -245,14 +356,16 @@
 	if (bb)
 		return PAGE_SIZE << bb->order;
 
-	return ((slob_t *)block - 1)->units * SLOB_UNIT;
+	c = get_slob_ptr(block);
+	return c->size;
 }
 
-struct kmem_cache {
-	unsigned int size, align;
-	const char *name;
-	void (*ctor)(void *, struct kmem_cache *, unsigned long);
-	void (*dtor)(void *, struct kmem_cache *, unsigned long);
+static slob_t cache_arena = { .next = &cache_arena, .units = 0 };
+struct kmem_cache cache_cache = {
+	.name = "cache",
+	.slobfree = &cache_cache.arena,
+	.arena = { .next = &cache_cache.arena, .units = 0 },
+	.lock = SPIN_LOCK_UNLOCKED(cache_cache.lock)
 };
 
 struct kmem_cache *kmem_cache_create(const char *name, size_t size,
@@ -261,8 +374,22 @@
 	void (*dtor)(void*, struct kmem_cache *, unsigned long))
 {
 	struct kmem_cache *c;
+	void *p;
+
+	c = slob_alloc(&cache_cache, flags, 0);
+
+	memset(c, 0, sizeof(*c));
 
-	c = slob_alloc(sizeof(struct kmem_cache), flags, 0);
+	c->size = PAGE_SIZE;
+	c->arena.next = &c->arena;
+	c->arena.units = 0;
+	c->slobfree = &c->arena;
+	atomic_set(&c->items, 0);
+	spin_lock_init(&c->lock);
+
+	p = slob_alloc(c, 0, PAGE_SIZE-1);
+	if (p)
+		free_page((unsigned long)p);
 
 	if (c) {
 		c->name = name;
@@ -274,6 +401,9 @@
 		if (c->align < align)
 			c->align = align;
 	}
+	down(&cache_chain_sem);
+	list_add_tail(&c->list, &cache_chain);
+	up(&cache_chain_sem);
 
 	return c;
 }
@@ -281,7 +411,17 @@
 
 int kmem_cache_destroy(struct kmem_cache *c)
 {
-	slob_free(c, sizeof(struct kmem_cache));
+	down(&cache_chain_sem);
+	list_del(&c->list);
+	up(&cache_chain_sem);
+
+	BUG_ON(atomic_read(&c->items));
+
+	/*
+	 * WARNING!!! Memory leak!
+	 */
+	printk("FIX ME: need to free memory\n");
+	slob_free(&cache_cache, c, sizeof(struct kmem_cache));
 	return 0;
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
@@ -290,11 +430,16 @@
 {
 	void *b;
 
-	if (c->size < PAGE_SIZE)
-		b = slob_alloc(c->size, flags, c->align);
+	atomic_inc(&c->items);
+
+	if (c->size <= MAX_SLOB_CACHE_SIZE)
+		b = slob_alloc(c, flags, c->align);
 	else
 		b = (void *)__get_free_pages(flags, find_order(c->size));
 
+	if (!b)
+		return 0;
+
 	if (c->ctor)
 		c->ctor(b, c, SLAB_CTOR_CONSTRUCTOR);
 
@@ -304,11 +449,13 @@
 
 void kmem_cache_free(struct kmem_cache *c, void *b)
 {
+	atomic_dec(&c->items);
+
 	if (c->dtor)
 		c->dtor(b, c, 0);
 
-	if (c->size < PAGE_SIZE)
-		slob_free(b, c->size);
+	if (c->size <= MAX_SLOB_CACHE_SIZE)
+		slob_free(c, b, c->size);
 	else
 		free_pages((unsigned long)b, find_order(c->size));
 }
@@ -326,22 +473,62 @@
 }
 EXPORT_SYMBOL(kmem_cache_name);
 
-static struct timer_list slob_timer = TIMER_INITIALIZER(
-	(void (*)(unsigned long))kmem_cache_init, 0, 0);
+static char cache_names[NR_SLOB_CACHES][15];
 
 void kmem_cache_init(void)
 {
-	void *p = slob_alloc(PAGE_SIZE, 0, PAGE_SIZE-1);
+	static int done;
+	void *p;
 
-	if (p)
-		free_page((unsigned long)p);
-
-	mod_timer(&slob_timer, jiffies + HZ);
+	if (!done) {
+		int i;
+		int size = 32;
+		done = 1;
+
+		init_MUTEX(&cache_chain_sem);
+		INIT_LIST_HEAD(&cache_chain);
+
+		cache_cache.size = PAGE_SIZE;
+		p = slob_alloc(&cache_cache, 0, PAGE_SIZE-1);
+		if (p)
+			free_page((unsigned long)p);
+		cache_cache.size = sizeof(struct kmem_cache);
+
+		bb_cache = kmem_cache_create("bb_cache",sizeof(bigblock_t), 0,
+					     GFP_KERNEL, NULL, NULL);
+		for (i=0; i < NR_SLOB_CACHES; i++, size <<= 1)
+			cache_sizes[i] = kmem_cache_create(cache_names[i], size, 0,
+							   GFP_KERNEL, NULL, NULL);
+	}
 }
 
 atomic_t slab_reclaim_pages = ATOMIC_INIT(0);
 EXPORT_SYMBOL(slab_reclaim_pages);
 
+static void test_slob(slob_t *s)
+{
+	slob_t *p;
+	long x = 0;
+
+	for (p=s->next; p != s && x < 10000; p = p->next, x++)
+		printk(".");
+}
+
+void print_slobs(void)
+{
+	struct list_head *curr;
+
+	list_for_each(curr, &cache_chain) {
+		kmem_cache_t *c = list_entry(curr, struct kmem_cache, list);
+
+		printk("%s items:%d",
+		       c->name?:"<none>",
+		       atomic_read(&c->items));
+		test_slob(&c->arena);
+		printk("\n");
+	}
+}
+
 #ifdef CONFIG_SMP
 
 void *__alloc_percpu(size_t size, size_t align)



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 15:44             ` [PATCH RT 00/02] SLOB optimizations Steven Rostedt
@ 2005-12-20 15:56               ` Steven Rostedt
  2005-12-20 15:58                 ` Ingo Molnar
  2005-12-20 16:13               ` Ingo Molnar
  2005-12-20 18:19               ` Matt Mackall
  2 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 15:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Gunter Ohrner, john stultz, Matt Mackall,
	Andrew Morton

On Tue, 2005-12-20 at 10:44 -0500, Steven Rostedt wrote:
> (Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)
> 
> OK Ingo, here it is.

I just tested it out on SMP (2x), and it boots. Ingo, do you have a good
memory test that I can do benchmarks with?  Something better that my
"make install" test.

Thanks,

-- Steve



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 15:56               ` Steven Rostedt
@ 2005-12-20 15:58                 ` Ingo Molnar
  0 siblings, 0 replies; 56+ messages in thread
From: Ingo Molnar @ 2005-12-20 15:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Gunter Ohrner, john stultz, Matt Mackall,
	Andrew Morton


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 2005-12-20 at 10:44 -0500, Steven Rostedt wrote:
> > (Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)
> > 
> > OK Ingo, here it is.
> 
> I just tested it out on SMP (2x), and it boots. Ingo, do you have a 
> good memory test that I can do benchmarks with?  Something better than
> my "make install" test.

networking is the most SLAB-intensive, so your test over NFS ought to be 
pretty good already.

	Ingo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 15:44             ` [PATCH RT 00/02] SLOB optimizations Steven Rostedt
  2005-12-20 15:56               ` Steven Rostedt
@ 2005-12-20 16:13               ` Ingo Molnar
  2005-12-20 16:29                 ` Steven Rostedt
  2005-12-20 18:19               ` Matt Mackall
  2 siblings, 1 reply; 56+ messages in thread
From: Ingo Molnar @ 2005-12-20 16:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andrew Morton, Matt Mackall, john stultz, Gunter Ohrner,
	linux-kernel


* Steven Rostedt <rostedt@goodmis.org> wrote:

> Tests:
> =====

could you also post the output of 'size mm/slob.o', with and without 
these patches, with CONFIG_EMBEDDED and CONFIG_CC_OPTIMIZE_FOR_SIZE 
enabled? (and with all debugging options disabled) Both the UP and the 
SMP overhead would be interesting to see.

	Ingo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 16:13               ` Ingo Molnar
@ 2005-12-20 16:29                 ` Steven Rostedt
  2005-12-20 16:39                   ` Steven Rostedt
  0 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 16:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Matt Mackall, john stultz, Gunter Ohrner,
	linux-kernel

On Tue, 20 Dec 2005, Ingo Molnar wrote:
> > Tests:
> > =====
>
> could you also post the output of 'size mm/slob.o', with and without
> these patches, with CONFIG_EMBEDDED and CONFIG_CC_OPTIMIZE_FOR_SIZE
> enabled? (and with all debugging options disabled) Both the UP and the
> SMP overhead would be interesting to see.
>

Well, there is definitely a hit there:

rt (slob new):
size mm/slob.o
   text    data     bss     dec     hex filename
   2051     112     233    2396     95c mm/slob.o

without
size mm/slob.o
   text    data     bss     dec     hex filename
   1331     120       8    1459     5b3 mm/slob.o

rt smp (slob new)
size mm/slob.o
   text    data     bss     dec     hex filename
   2297     120     233    2650     a5a mm/slob.o

without
size mm/slob.o
   text    data     bss     dec     hex filename
   1573     140       8    1721     6b9 mm/slob.o


So, should this be a third memory management system?  A fast_slob?


Just for kicks here's slab.o:

rt:
size mm/slab.o
   text    data     bss     dec     hex filename
   8896     556     144    9596    257c mm/slab.o

rt smp:
size mm/slab.o
   text    data     bss     dec     hex filename
   9679     640      84   10403    28a3 mm/slab.o

So there's still a great improvement on that (maybe not the bss though).

-- Steve


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 16:29                 ` Steven Rostedt
@ 2005-12-20 16:39                   ` Steven Rostedt
  0 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 16:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Matt Mackall, john stultz, Gunter Ohrner,
	linux-kernel

On Tue, 2005-12-20 at 11:29 -0500, Steven Rostedt wrote:
> On Tue, 20 Dec 2005, Ingo Molnar wrote:
> > > Tests:
> > > =====
> >
> > could you also post the output of 'size mm/slob.o', with and without
> > these patches, with CONFIG_EMBEDDED and CONFIG_CC_OPTIMIZE_FOR_SIZE
> > enabled? (and with all debugging options disabled) Both the UP and the
> > SMP overhead would be interesting to see.
> >

> Well, there is definitely a hit there:

There's also cruft that can be removed from my patch, like the
cache_chain code I started.  For example, just by adding:

#if 0
static void test_slob(slob_t *s)
{

[...]

}

void print_slobs(void)
{

[...]

}
#endif

I get:
rt:
size mm/slob.o
   text    data     bss     dec     hex filename
   1889     112     233    2234     8ba mm/slob.o

rt smp:
size mm/slob.o
   text    data     bss     dec     hex filename
   2135     120     233    2488     9b8 mm/slob.o

So, I probably need to add stuff for CONFIG_CC_OPTIMIZE_FOR_SIZE.

-- Steve

> 
> rt (slob new):
> size mm/slob.o
>    text    data     bss     dec     hex filename
>    2051     112     233    2396     95c mm/slob.o
> 
> without
> size mm/slob.o
>    text    data     bss     dec     hex filename
>    1331     120       8    1459     5b3 mm/slob.o
> 
> rt smp (slob new)
> size mm/slob.o
>    text    data     bss     dec     hex filename
>    2297     120     233    2650     a5a mm/slob.o
> 
> without
> size mm/slob.o
>    text    data     bss     dec     hex filename
>    1573     140       8    1721     6b9 mm/slob.o
> 
> 
> So, should this be a third memory management system?  A fast_slob?
> 
> 
> Just for kicks here's slab.o:
> 
> rt:
> size mm/slab.o
>    text    data     bss     dec     hex filename
>    8896     556     144    9596    257c mm/slab.o
> 
> rt smp:
> size mm/slab.o
>    text    data     bss     dec     hex filename
>    9679     640      84   10403    28a3 mm/slab.o
> 
> So there's still a great improvement on that (maybe not the bss though).
> 
> -- Steve
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-- 
Steven Rostedt
Senior Programmer
Kihon Technologies
(607)786-4830


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 15:44             ` [PATCH RT 00/02] SLOB optimizations Steven Rostedt
  2005-12-20 15:56               ` Steven Rostedt
  2005-12-20 16:13               ` Ingo Molnar
@ 2005-12-20 18:19               ` Matt Mackall
  2005-12-20 19:15                 ` Steven Rostedt
  2 siblings, 1 reply; 56+ messages in thread
From: Matt Mackall @ 2005-12-20 18:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Andrew Morton, john stultz, Gunter Ohrner,
	linux-kernel

On Tue, Dec 20, 2005 at 10:44:20AM -0500, Steven Rostedt wrote:
> (Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)
> 
> OK Ingo, here it is.
> 
> The old SLOB did the old K&R memory allocations.
> 
> It had a global linked list, "slobfree".  When it needed memory it would
> search this list linearly to find the first spot that fit, and then
> return it.  Memory was broken up into SLOB_UNITs, which is the number of
> bytes needed to hold a slob_t.
> 
> Since the sizes of the allocations would greatly fluctuate, the chances
> of fragmentation were very high.  This also made the search for free
> locations take longer, since fragmentation increased the number of
> free blocks to walk.

On the target systems for the original SLOB design, we have less than
16MB of memory, so the linked list walking is pretty well bounded.
 
> It also had one global spinlock for ALL allocations.  This would
> obviously kill SMP performance.

And again, the locking primarily exists for PREEMPT and small dual-core.
So I'm still a bit amused that you guys are using it for -RT.

> When any block was freed via kfree, it would first search all the big
> blocks to see if it was a large allocation, and if not, then it would
> search the slobfree list to find where it goes.  Both taking two global
> spinlocks!

I don't think this is correct, or else indicates a bug. We should only
scan the big block list when the freed block was page-aligned.

> First things first: the first patch was to get rid of the bigblock list.
> I simply used the method of SLAB, using the lru list field of the
> corresponding page to store the pointer to the bigblock descriptor, which
> holds the information to free it.  This got rid of the bigblock linked
> list and its global spinlock.

This I like a lot. I'd like to see a size/performance measurement of
this by itself. I suspect it's an unambiguous win in both categories.
 
> The next patch was the big improvement, with the largest changes.  I
> took advantage of the kmem_cache usage that SLAB also takes
> advantage of.  I created a memory pool like the global one, but for
> every cache with a size less than PAGE_SIZE >> 1.

Hmm. By every size, I assume you mean powers of two. Which negates
some of the fine-grained allocation savings that current SLOB provides.

[...]
> So I have improved the speed of SLOB to almost that of SLAB!

Nice.

For what it's worth, I think we really ought to consider a generalized
allocator approach like Sun's VMEM, with various removable pieces.

Currently we've got something like this:

 get_free_pages     boot_mem         idr    resource_*   vmalloc ...
        |
      slab
        |
  per_cpu/node
        |
  kmem_cache_alloc
        |
     kmalloc

We could take it in a direction like this:

 generic range allocator          (completely agnostic)
          |
  optional size buckets           (reduced fragmentation, O(1))
          |    
    optional slab                 (cache-friendly, pre-initialized)
          |
 optional per cpu/node caches     (cache-hot and lockless)
          |
 kmalloc / kmem_cache_alloc / boot_mem / idr / resource_* / vmalloc / ...

(You read that right, the top level allocator can replace all the
different allocators that hand back integers or non-overlapping ranges.)

Each user of, say, kmem_create() could then pass in flags to specify
which caching layers ought to be bypassed. IDR, for example, would
probably disable all the layers and specify a first-fit policy.

And then depending on our global size and performance requirements, we
could globally disable some layers like SLAB, buckets, or per_cpu
caches. With all the optional layers disabled, we'd end up with
something much like SLOB (but underneath get_free_page!).

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 18:19               ` Matt Mackall
@ 2005-12-20 19:15                 ` Steven Rostedt
  2005-12-20 19:43                   ` Matt Mackall
                                     ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 19:15 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Andrew Morton, john stultz, Gunter Ohrner,
	linux-kernel

On Tue, 2005-12-20 at 12:19 -0600, Matt Mackall wrote:
> On Tue, Dec 20, 2005 at 10:44:20AM -0500, Steven Rostedt wrote:
> > (Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)
> > 
> > OK Ingo, here it is.
> > 
> > The old SLOB did the old K&R memory allocations.
> > 
> > It had a global linked list, "slobfree".  When it needed memory it would
> > search this list linearly to find the first spot that fit, and then
> > return it.  Memory was broken up into SLOB_UNITs, which is the number of
> > bytes needed to hold a slob_t.
> > 
> > Since the sizes of the allocations would greatly fluctuate, the chances
> > of fragmentation were very high.  This also made the search for free
> > locations take longer, since fragmentation increased the number of
> > free blocks to walk.
> 
> On the target systems for the original SLOB design, we have less than
> 16MB of memory, so the linked list walking is pretty well bounded.

I bet after a while of running, your performance will still suffer due
to fragmentation.  The more fragmented it is, the more space you lose
and the more steps you need to walk.

Remember, because of the small stack, kmalloc and kfree are used an
awful lot.  And if you slow those down, you will start to take a big hit
in performance.

>  
> > It also had one global spinlock for ALL allocations.  This would
> > obviously kill SMP performance.
> 
> And again, the locking primarily exists for PREEMPT and small dual-core.
> So I'm still a bit amused that you guys are using it for -RT.

I think this is due to the complexity of the current SLAB.  With slab.c
unmodified, the RT kernel doesn't boot.  And it's getting more complex,
so making the proper changes to have it run under a fully preemptible
kernel takes knowing all the details of the SLAB.

Ingo can answer this better himself, but I have a feeling he jumped to
your SLOB system just because of the simplicity.

> 
> > When any block was freed via kfree, it would first search all the big
> > blocks to see if it was a large allocation, and if not, then it would
> > search the slobfree list to find where it goes.  Both taking two global
> > spinlocks!
> 
> I don't think this is correct, or else indicates a bug. We should only
> scan the big block list when the freed block was page-aligned.

Yep, you're right here.  I forgot about that since updating the bigblock
list was the first thing I did, and I didn't need that check anymore.
So, I was wrong here, yours does _only_ scan the bigblock list if the
block is page aligned.

> 
> > First things first: the first patch was to get rid of the bigblock list.
> > I simply used the method of SLAB, using the lru list field of the
> > corresponding page to store the pointer to the bigblock descriptor, which
> > holds the information to free it.  This got rid of the bigblock linked
> > list and its global spinlock.
> 
> This I like a lot. I'd like to see a size/performance measurement of
> this by itself. I suspect it's an unambiguous win in both categories.

Actually the performance gain was disappointingly small.  As it was a
separate patch, I thought it would gain a lot.  But IIRC, it only
increased the speed by a second or two (of the 1 minute 27 seconds).
That's why I spent so much time on the next approach.

>  
> > The next patch was the big improvement, with the largest changes.  I
> > took advantage of the kmem_cache usage that SLAB also takes
> > advantage of.  I created a memory pool like the global one, but for
> > every cache with a size less than PAGE_SIZE >> 1.
> 
> Hmm. By every size, I assume you mean powers of two. Which negates
> some of the fine-grained allocation savings that current SLOB provides.

Yeah, it's the same as what the slabs use.  But I would like to take
measurements of a running system between the two approaches.  After a
day of heavy network traffic, see what the fragmentation is like and how
much is wasted.  This would require me finishing my cache_chain work,
and adding something similar to your SLOB.

But the powers of two are only for kmalloc, which is a known
behavior of the current system.  So it <should> only be used for things
that alloc and free within a short time (like for things you would
like to put on a stack but can't), or whose size is close to (less than
or equal to) a power of two.  Otherwise a kmem_cache is made which is
the size of the expected object (off by UNIT_SIZE).

Oh, this reminds me, I probably still need to add a shrink cache
algorithm.  Which would be very hard to do in the current SLOB.

Also note, I don't need to save the descriptor with each kmalloc as
the current SLOB does.  Since each memory pool is of a fixed size, I
again use the mem_map pages to store the location of the descriptor.  So
I save on memory that way.

> 
> [...]
> > So I have improved the speed of SLOB to almost that of SLAB!
> 
> Nice.
> 
> For what it's worth, I think we really ought to consider a generalized
> allocator approach like Sun's VMEM, with various removable pieces.

Interesting, I don't know how Sun's VMEM works.  Do you have links to
some documentation?

> 
> Currently we've got something like this:
> 
>  get_free_pages     boot_mem         idr    resource_*   vmalloc ...
>         |
>       slab
>         |
>   per_cpu/node
>         |
>   kmem_cache_alloc
>         |
>      kmalloc
> 
> We could take it in a direction like this:
> 
>  generic range allocator          (completely agnostic)
>           |
>   optional size buckets           (reduced fragmentation, O(1))
>           |    
>     optional slab                 (cache-friendly, pre-initialized)
>           |
>  optional per cpu/node caches     (cache-hot and lockless)
>           |
>  kmalloc / kmem_cache_alloc / boot_mem / idr / resource_* / vmalloc / ...
> 
> (You read that right, the top level allocator can replace all the
> different allocators that hand back integers or non-overlapping ranges.)
> 
> Each user of, say, kmem_create() could then pass in flags to specify
> which caching layers ought to be bypassed. IDR, for example, would
> probably disable all the layers and specify a first-fit policy.
> 
> And then depending on our global size and performance requirements, we
> could globally disable some layers like SLAB, buckets, or per_cpu
> caches. With all the optional layers disabled, we'd end up with
> something much like SLOB (but underneath get_free_page!).

That looks like quite an undertaking, but may be well worth it.  I think
Linux's memory management is starting to show its age.  It's been
through a few transformations, and maybe it's time to go through
another.  The work being done by the NUMA folks should be taken into
account, and maybe we can come up with a way to make things easier
and less complex without losing performance.

BTW, the NUMA code in the slabs was the main killer for the RT
conversion.


Thanks for the input,

-- Steve



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 19:15                 ` Steven Rostedt
@ 2005-12-20 19:43                   ` Matt Mackall
  2005-12-20 20:06                     ` Steven Rostedt
  2005-12-20 20:15                   ` Pekka Enberg
  2005-12-21  2:30                   ` Nick Piggin
  2 siblings, 1 reply; 56+ messages in thread
From: Matt Mackall @ 2005-12-20 19:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Andrew Morton, john stultz, Gunter Ohrner,
	linux-kernel

On Tue, Dec 20, 2005 at 02:15:24PM -0500, Steven Rostedt wrote:
> On Tue, 2005-12-20 at 12:19 -0600, Matt Mackall wrote:
> > On Tue, Dec 20, 2005 at 10:44:20AM -0500, Steven Rostedt wrote:
> > > (Andrew, I'm CC'ing you and Matt to see if you would like this in -mm)
> > > 
> > > OK Ingo, here it is.
> > > 
> > > The old SLOB did the old K&R memory allocations.
> > > 
> > > It had a global linked list, "slobfree".  When it needed memory it would
> > > search this list linearly to find the first spot that fit, and then
> > > return it.  Memory was broken up into SLOB_UNITs, which is the number of
> > > bytes needed to hold a slob_t.
> > > 
> > > Since the sizes of the allocations would greatly fluctuate, the chances
> > > of fragmentation were very high.  This also made the search for free
> > > locations take longer, since fragmentation increased the number of
> > > free blocks to walk.
> > 
> > On the target systems for the original SLOB design, we have less than
> > 16MB of memory, so the linked list walking is pretty well bounded.
> 
> I bet after a while of running, your performance will still suffer due
> to fragmentation.  The more fragmented it is, the more space you lose
> and the more steps you need to walk.
> 
> Remember, because of the small stack, kmalloc and kfree are used an
> awful lot.  And if you slow those down, you will start to take a big hit
> in performance.

True, with the exception that the improved packing may be the
difference between fitting the working set in memory and
thrashing/OOMing for some applications. Not running at all =
infinitely bad performance.

And the fragmentation is really not all that bad. Remember, Linux and
other legacy systems used similar allocators for ages.
 
> Ingo can answer this better himself, but I have a feeling he jumped to
> your SLOB system just because of the simplicity.

And only a config switch away..

> > This I like a lot. I'd like to see a size/performance measurement of
> > this by itself. I suspect it's an unambiguous win in both categories.
> 
> Actually the performance gain was disappointingly small.  As it was a
> separate patch, I thought it would gain a lot.  But IIRC, it only
> increased the speed by a second or two (of the 1 minute 27 seconds).
> That's why I spent so much time on the next approach.

Still, if it's a size win, it definitely makes sense to merge.
Removing the big block list lock is also a good thing and might make a
bigger difference on SMP.
 
> > > The next patch was the big improvement, with the largest changes.  I
> > > took advantage of the kmem_cache usage that SLAB also takes
> > > advantage of.  I created a memory pool like the global one, but for
> > > every cache with a size less than PAGE_SIZE >> 1.
> > 
> > Hmm. By every size, I assume you mean powers of two. Which negates
> > some of the fine-grained allocation savings that current SLOB provides.
> 
> Yeah, it's the same as what the slabs use.  But I would like to take
> measurements of a running system between the two approaches.  After a
> day of heavy network traffic, see what the fragmentation is like and how
> much is wasted.  This would require me finishing my cache_chain work,
> and adding something similar to your SLOB.
> 
> But the powers of two are only for kmalloc, which is a known
> behavior of the current system.  So it <should> only be used for things
> that alloc and free within a short time (like for things you would
> like to put on a stack but can't), or whose size is close to (less than
> or equal to) a power of two.  Otherwise a kmem_cache is made which is
> the size of the expected object (off by UNIT_SIZE).

There are a fair number of long-lived kmalloc objects. You might try
playing with the kmalloc accounting patch in -tiny to see what's out
there.

http://www.selenic.com/repo/tiny?f=bbcd48f1d9c1;file=kmalloc-accounting.patch;style=raw

> Oh, this reminds me, I probably still need to add a shrink cache
> algorithm.  Which would be very hard to do in the current SLOB.

Hmmm? It already has one.

> > For what it's worth, I think we really ought to consider a generalized
> > allocator approach like Sun's VMEM, with various removable pieces.
> 
> Interesting, I don't know how Sun's VMEM works.  Do you have links to
> some documentation?

http://citeseer.ist.psu.edu/bonwick01magazines.html

> That looks like quite an undertaking, but may be well worth it.  I think
> Linux's memory management is starting to show its age.  It's been
> through a few transformations, and maybe it's time to go through
> another.  The work being done by the NUMA folks should be taken into
> account, and maybe we can come up with a way to make things easier
> and less complex without losing performance.

Fortunately, it can be done completely piecemeal. 

> BTW, the NUMA code in the slabs was the main killer for the RT
> conversion.

I think the VMEM scheme avoids that problem to some degree, but I
might be wrong.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 19:43                   ` Matt Mackall
@ 2005-12-20 20:06                     ` Steven Rostedt
  0 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 20:06 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Andrew Morton, john stultz, Gunter Ohrner,
	linux-kernel

On Tue, 2005-12-20 at 13:43 -0600, Matt Mackall wrote:
> > 
> > I bet after a while of running, your performance will still suffer due
> > to fragmentation.  The more fragmented it is, the more space you lose
> > and the more steps you need to walk.
> > 
> > Remember, because of the small stack, kmalloc and kfree are used an
> > awful lot.  And if you slow those down, you will start to take a big hit
> > in performance.
> 
> True, with the exception that the improved packing may be the
> difference between fitting the working set in memory and
> thrashing/OOMing for some applications. Not running at all =
> infinitely bad performance.

Well the best way to see, is to try it out with real applications on
small machines.  I guess I need to pull out my IBM Thinkpad 75c (32
megs, I'll need to only allocate half) and try out the two and see how
far I can push it.  Unfortunately, this test may need to wait, since I
have a ton of other things to push out first.

If someone else (perhaps yourself) would like to give my patches a try,
I would really appreciate it. :)

> 
> And the fragmentation is really not all that bad. Remember, Linux and
> other legacy systems used similar allocators for ages.

But the performance was greatly reduced, and the system just booted up.

>  
> > Ingo can answer this better himself, but I have a feeling he jumped to
> > your SLOB system just because of the simplicity.
> 
> And only a config switch away..
> 
> > > This I like a lot. I'd like to see a size/performance measurement of
> > > this by itself. I suspect it's an unambiguous win in both categories.
> > 
> > Actually the performance gain was disappointingly small.  As it was a
> > separate patch, I thought it would gain a lot.  But IIRC, it only
> > increased the speed by a second or two (of the 1 minute 27 seconds).
> > That's why I spent so much time on the next approach.
> 
> Still, if it's a size win, it definitely makes sense to merge.
> Removing the big block list lock is also a good thing and might make a
> bigger difference on SMP.

Well, I guess I can check out the -mm branch and at least port the first
patch over.

>  
> > > > The next patch was the big improvement, with the largest changes.  I
> > > > took advantage of the kmem_cache usage that SLAB also takes
> > > > advantage of.  I created a memory pool like the global one, but for
> > > > every cache with a size less than PAGE_SIZE >> 1.
> > > 
> > > Hmm. By every size, I assume you mean powers of two. Which negates
> > > some of the fine-grained allocation savings that current SLOB provides.
> > 
> > Yeah, it's the same as what the slabs use.  But I would like to take
> > measurements of a running system between the two approaches.  After a
> > day of heavy network traffic, see what the fragmentation is like and how
> > much is wasted.  This would require me finishing my cache_chain work,
> > and adding something similar to your SLOB.
> > 
> > But the powers of two are only for kmalloc, which is a known
> > behavior of the current system.  So it <should> only be used for things
> > that alloc and free within a short time (like for things you would
> > like to put on a stack but can't), or whose size is close to (less than
> > or equal to) a power of two.  Otherwise a kmem_cache is made which is
> > the size of the expected object (off by UNIT_SIZE).
> 
> There are a fair number of long-lived kmalloc objects. You might try
> playing with the kmalloc accounting patch in -tiny to see what's out
> there.
> 
> http://www.selenic.com/repo/tiny?f=bbcd48f1d9c1;file=kmalloc-accounting.patch;style=raw

I'll have to try this out too. Thanks for the link.
> 
> > Oh, this reminds me, I probably still need to add a shrink cache
> > algorithm.  Which would be very hard to do in the current SLOB.
> 
> Hmmm? It already has one.

The current version in Ingo's 2.6.15-rc5-rt2 didn't have one.

> 
> > > For what it's worth, I think we really ought to consider a generalized
> > > allocator approach like Sun's VMEM, with various removable pieces.
> > 
> > Interesting, I don't know how Sun's VMEM works.  Do you have links to
> > some documentation?
> 
> http://citeseer.ist.psu.edu/bonwick01magazines.html

Thanks, I'll read up on this.

> 
> > That looks like quite an undertaking, but may be well worth it.  I think
> > Linux's memory management is starting to show its age.  It's been
> > through a few transformations, and maybe it's time to go through
> > another.  The work being done by the NUMA folks should be taken into
> > account, and maybe we can come up with a way to make things easier
> > and less complex without losing performance.
> 
> Fortunately, it can be done completely piecemeal. 

If you would like me to test any code, I'd be happy to when I have time.
And maybe even add a few patches myself.

-- Steve



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 19:15                 ` Steven Rostedt
  2005-12-20 19:43                   ` Matt Mackall
@ 2005-12-20 20:15                   ` Pekka Enberg
  2005-12-20 21:42                     ` Steven Rostedt
  2005-12-21  2:30                   ` Nick Piggin
  2 siblings, 1 reply; 56+ messages in thread
From: Pekka Enberg @ 2005-12-20 20:15 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Matt Mackall, Ingo Molnar, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1028 bytes --]

Hi Steve and Matt,

On 12/20/05, Steven Rostedt <rostedt@goodmis.org> wrote:
> That looks like quite an undertaking, but may be well worth it.  I think
> Linux's memory management is starting to show its age.  It's been
> through a few transformations, and maybe it's time to go through
> another.  The work being done by the NUMA folks should be taken into
> account, and maybe we can come up with a way to make things easier
> and less complex without losing performance.

The slab allocator is indeed complex, messy, and hard to understand.
In case you're interested, I have included a replacement I started out
a while ago.  It follows the design of a magazine allocator described
by Bonwick.  It's not a complete replacement, but it should boot (well,
it did at some point, anyway).  I have also included a user-space test
harness I am using to smoke-test it.

If there's enough interest, I would be more than glad to help write a
replacement for mm/slab.c :-)

                                        Pekka

[-- Attachment #2: magazine-slab.patch --]
[-- Type: text/x-patch, Size: 85541 bytes --]

Index: 2.6/mm/kmalloc.c
===================================================================
--- /dev/null
+++ 2.6/mm/kmalloc.c
@@ -0,0 +1,170 @@
+/*
+ * mm/kmalloc.c - A general purpose memory allocator.
+ *
+ * Copyright (C) 1996, 1997 Mark Hemment
+ * Copyright (C) 1999 Andrea Arcangeli
+ * Copyright (C) 2000, 2002 Manfred Spraul
+ * Copyright (C) 2005 Shai Fultheim
+ * Copyright (C) 2005 Shobhit Dayal
+ * Copyright (C) 2005 Alok N Kataria
+ * Copyright (C) 2005 Christoph Lameter
+ * Copyright (C) 2005 Pekka Enberg
+ *
+ * This file is released under the GPLv2.
+ */
+
+#include <linux/kernel.h>
+#include <linux/kmem.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/string.h>
+
+struct cache_sizes malloc_sizes[] = {
+#define CACHE(x) { .cs_size = (x) },
+#include <linux/kmalloc_sizes.h>
+	{ .cs_size = ULONG_MAX }
+#undef CACHE
+};
+EXPORT_SYMBOL(malloc_sizes);
+
+struct cache_names {
+	char *name;
+	char *name_dma;
+};
+
+static struct cache_names cache_names[] = {
+#define CACHE(x) { .name = "size-" #x, .name_dma = "size-" #x "(DMA)" },
+#include <linux/kmalloc_sizes.h>
+	{ NULL, }
+#undef CACHE
+};
+
+void kmalloc_init(void)
+{
+	struct cache_sizes *sizes = malloc_sizes;
+	struct cache_names *names = cache_names;
+
+	while (sizes->cs_size != ULONG_MAX) {
+		sizes->cs_cache = kmem_cache_create(names->name,
+						    sizes->cs_size, 0, 0,
+						    NULL, NULL);
+		sizes->cs_dma_cache = kmem_cache_create(names->name_dma,
+							sizes->cs_size, 0, 0,
+							NULL, NULL);
+		sizes++;
+		names++;
+	}
+}
+
+static struct kmem_cache *find_general_cache(size_t size, gfp_t flags)
+{
+	struct cache_sizes *sizes = malloc_sizes;
+
+	while (size > sizes->cs_size)
+		sizes++;
+
+	if (unlikely(flags & GFP_DMA))
+		return sizes->cs_dma_cache;
+	return sizes->cs_cache;
+}
+
+/**
+ * kmalloc - allocate memory
+ * @size: how many bytes of memory are required.
+ * @flags: the type of memory to allocate.
+ *
+ * kmalloc is the normal method of allocating memory
+ * in the kernel.
+ *
+ * The @flags argument may be one of:
+ *
+ * %GFP_USER - Allocate memory on behalf of user.  May sleep.
+ *
+ * %GFP_KERNEL - Allocate normal kernel ram.  May sleep.
+ *
+ * %GFP_ATOMIC - Allocation will not sleep.  Use inside interrupt handlers.
+ *
+ * Additionally, the %GFP_DMA flag may be set to indicate the memory
+ * must be suitable for DMA.  This can mean different things on different
+ * platforms.  For example, on i386, it means that the memory must come
+ * from the first 16MB.
+ */
+void *__kmalloc(size_t size, gfp_t flags)
+{
+	struct kmem_cache *cache = find_general_cache(size, flags);
+	if (unlikely(cache == NULL))
+		return NULL;
+	return kmem_cache_alloc(cache, flags);
+}
+EXPORT_SYMBOL(__kmalloc);
+
+void *kmalloc_node(size_t size, unsigned int __nocast flags, int node)
+{
+	return __kmalloc(size, flags);
+}
+EXPORT_SYMBOL(kmalloc_node);
+
+/**
+ * kzalloc - allocate memory. The memory is set to zero.
+ * @size: how many bytes of memory are required.
+ * @flags: the type of memory to allocate.
+ */
+void *kzalloc(size_t size, gfp_t flags)
+{
+	void *ret = kmalloc(size, flags);
+	if (ret)
+		memset(ret, 0, size);
+	return ret;
+}
+EXPORT_SYMBOL(kzalloc);
+
+/**
+ * kstrdup - allocate space for and copy an existing string
+ *
+ * @s: the string to duplicate
+ * @gfp: the GFP mask used in the kmalloc() call when allocating memory
+ */
+char *kstrdup(const char *s, gfp_t gfp)
+{
+	size_t len;
+	char *buf;
+
+	if (!s)
+		return NULL;
+
+	len = strlen(s) + 1;
+	buf = kmalloc(len, gfp);
+	if (buf)
+		memcpy(buf, s, len);
+	return buf;
+}
+EXPORT_SYMBOL(kstrdup);
+
+/* FIXME: duplicate of the helper in mm/kmem.c */
+static struct kmem_cache *page_get_cache(struct page *page)
+{
+	return (struct kmem_cache *)page->lru.next;
+}
+
+/**
+ * kfree - free previously allocated memory
+ * @objp: pointer returned by kmalloc.
+ *
+ * If @objp is NULL, no operation is performed.
+ *
+ * Don't free memory not originally allocated by kmalloc()
+ * or you will run into trouble.
+ */
+void kfree(const void *obj)
+{
+	struct page *page;
+	struct kmem_cache *cache;
+
+	if (unlikely(!obj))
+		return;
+
+	page = virt_to_page(obj);
+	cache = page_get_cache(page);
+	kmem_cache_free(cache, (void *)obj);
+}
+EXPORT_SYMBOL(kfree);
Index: 2.6/mm/kmem.c
===================================================================
--- /dev/null
+++ 2.6/mm/kmem.c
@@ -0,0 +1,1203 @@
+/*
+ * mm/kmem.c - An object-caching memory allocator.
+ *
+ * Copyright (C) 1996, 1997 Mark Hemment
+ * Copyright (C) 1999 Andrea Arcangeli
+ * Copyright (C) 2000, 2002 Manfred Spraul
+ * Copyright (C) 2005 Shai Fultheim
+ * Copyright (C) 2005 Shobhit Dayal
+ * Copyright (C) 2005 Alok N Kataria
+ * Copyright (C) 2005 Christoph Lameter
+ * Copyright (C) 2005 Pekka Enberg
+ *
+ * This file is released under the GPLv2.
+ *
+ * The design of this allocator is based on the following papers:
+ *
+ * Jeff Bonwick.  The Slab Allocator: An Object-Caching Kernel Memory
+ * 	Allocator. 1994.
+ *
+ * Jeff Bonwick, Jonathan Adams.  Magazines and Vmem: Extending the Slab
+ * 	Allocator to Many CPUs and Arbitrary Resources. 2001.
+ *
+ * TODO:
+ *
+ *   - Shrinking
+ *   - Alignment
+ *   - Coloring
+ *   - Per node slab lists and depots
+ *   - Compatible procfs
+ *   - Red zoning
+ *   - Poisoning
+ *   - Use after free
+ *   - Adaptive magazine size?
+ *   - Batching for freeing of wrong-node objects?
+ *   - Lock-less magazines?
+ *   - Disable magazine layer for UP?
+ *   - sysfs?
+ */
+
+#include <linux/kmem.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include <linux/string.h>
+#include <linux/percpu.h>
+#include <linux/workqueue.h>
+
+#include <asm/semaphore.h>
+#include <asm/uaccess.h>
+
+
+/* Guard access to the cache-chain. */
+static struct semaphore	cache_chain_sem;
+static struct list_head cache_chain;
+
+atomic_t slab_reclaim_pages;
+
+static DEFINE_PER_CPU(struct work_struct, reap_work);
+
+#define REAP_TIMEOUT_CPU_CACHES (2*HZ)
+
+
+/*
+ *	Internal Caches
+ */
+
+static void kmem_cache_ctor(void *, struct kmem_cache *, unsigned long);
+static void kmem_magazine_ctor(void *, struct kmem_cache *, unsigned long);
+
+static struct kmem_cache cache_cache = {
+	.name = "cache-cache",
+	.objsize = sizeof(struct kmem_cache),
+	.ctor = kmem_cache_ctor
+};
+
+static struct kmem_cache slab_cache = {
+	.name = "slab-cache",
+	.objsize = sizeof(struct kmem_slab)
+};
+
+static struct kmem_cache magazine_cache = {
+	.name = "magazine-cache",
+	.objsize = sizeof(struct kmem_magazine),
+	.ctor = kmem_magazine_ctor
+};
+
+
+/*
+ *	The following functions are used to find the cache and slab an object
+ *	belongs to. They are used when we want to free an object.
+ */
+
+static void page_set_cache(struct page *page, struct kmem_cache *cache)
+{
+	page->lru.next = (struct list_head *)cache;
+}
+
+static struct kmem_cache *page_get_cache(struct page *page)
+{
+	return (struct kmem_cache *)page->lru.next;
+}
+
+static void page_set_slab(struct page *page, struct kmem_slab *slab)
+{
+	page->lru.prev = (struct list_head *)slab;
+}
+
+static struct kmem_slab *page_get_slab(struct page *page)
+{
+	return (struct kmem_slab *)page->lru.prev;
+}
+
+
+/*
+ *	Cache Statistics
+ */
+
+static inline void stats_inc_grown(struct kmem_cache *cache)
+{
+	cache->stats.grown++;
+}
+
+static inline void stats_inc_reaped(struct kmem_cache *cache)
+{
+	cache->stats.reaped++;
+}
+
+
+/*
+ *	Magazines, CPU Caches, and Depots
+ */
+
+static void init_magazine(struct kmem_magazine *mag)
+{
+	memset(mag, 0, sizeof(*mag));
+	INIT_LIST_HEAD(&mag->list);
+}
+
+static void kmem_magazine_ctor(void *obj, struct kmem_cache *cache,
+			       unsigned long flags)
+{
+	struct kmem_magazine *mag = obj;
+	if (cache != &magazine_cache)
+		BUG();
+	init_magazine(mag);
+}
+
+static int magazine_is_empty(struct kmem_magazine *mag)
+{
+	return mag->rounds == 0;
+}
+
+static int magazine_is_full(struct kmem_magazine *mag)
+{
+	return mag->rounds == MAX_ROUNDS;
+}
+
+static void *magazine_get(struct kmem_magazine *mag)
+{
+	BUG_ON(magazine_is_empty(mag));
+	return mag->objs[--mag->rounds];
+}
+
+static void magazine_put(struct kmem_magazine *mag, void *obj)
+{
+	BUG_ON(magazine_is_full(mag));
+	mag->objs[mag->rounds++] = obj;
+}
+
+static struct kmem_cpu_cache *__cpu_cache_get(struct kmem_cache *cache,
+					      unsigned long cpu)
+{
+	return &cache->cpu_cache[cpu];
+}
+
+static struct kmem_cpu_cache *cpu_cache_get(struct kmem_cache *cache)
+{
+	return __cpu_cache_get(cache, smp_processor_id());
+}
+
+static void depot_put_full(struct kmem_cache *cache,
+			   struct kmem_magazine *magazine)
+{
+	BUG_ON(!magazine_is_full(magazine));
+	list_add(&magazine->list, &cache->full_magazines);
+}
+
+static struct kmem_magazine *depot_get_full(struct kmem_cache *cache)
+{
+	struct kmem_magazine *ret = list_entry(cache->full_magazines.next,
+					       struct kmem_magazine, list);
+	list_del(&ret->list);
+	BUG_ON(!magazine_is_full(ret));
+	return ret;
+}
+
+static void depot_put_empty(struct kmem_cache *cache,
+			    struct kmem_magazine *magazine)
+{
+	BUG_ON(!magazine_is_empty(magazine));
+	list_add(&magazine->list, &cache->empty_magazines);
+}
+
+static struct kmem_magazine *depot_get_empty(struct kmem_cache *cache)
+{
+	struct kmem_magazine *ret = list_entry(cache->empty_magazines.next,
+					       struct kmem_magazine, list);
+	list_del(&ret->list);
+	BUG_ON(!magazine_is_empty(ret));
+	return ret;
+}
+
+
+/*
+ *	Object Caches and Slabs
+ */
+
+const char *kmem_cache_name(struct kmem_cache *cache)
+{
+	return cache->name;
+}
+EXPORT_SYMBOL_GPL(kmem_cache_name);
+
+static inline struct kmem_bufctl *obj_to_bufctl(struct kmem_cache *cache,
+						struct kmem_slab *slab,
+						void *ptr)
+{
+	return ptr + (cache->objsize) - sizeof(struct kmem_bufctl);
+}
+
+static void init_cache(struct kmem_cache *cache)
+{
+	spin_lock_init(&cache->lists_lock);
+	INIT_LIST_HEAD(&cache->full_slabs);
+	INIT_LIST_HEAD(&cache->partial_slabs);
+	INIT_LIST_HEAD(&cache->empty_slabs);
+	INIT_LIST_HEAD(&cache->full_magazines);
+	INIT_LIST_HEAD(&cache->empty_magazines);
+}
+
+static void kmem_cache_ctor(void *obj, struct kmem_cache *cache,
+			    unsigned long flags)
+{
+	struct kmem_cache *cachep = obj;
+	if (cache != &cache_cache)
+		BUG();
+	init_cache(cachep);
+}
+
+#define MAX_WASTAGE (PAGE_SIZE/8)
+
+static inline int mgmt_in_slab(struct kmem_cache *cache)
+{
+	return cache->objsize < MAX_WASTAGE;
+}
+
+static inline size_t order_to_size(unsigned int order)
+{
+	return (1UL << order) * PAGE_SIZE;
+}
+
+static inline size_t slab_size(struct kmem_cache *cache)
+{
+	return order_to_size(cache->cache_order);
+}
+
+static inline unsigned int slab_capacity(struct kmem_cache *cache)
+{
+	unsigned long mgmt_size = 0;
+	if (mgmt_in_slab(cache))
+		mgmt_size = sizeof(struct kmem_slab);
+
+	return (slab_size(cache) - mgmt_size) / cache->objsize;
+}
+
+static void *obj_at(struct kmem_cache *cache, struct kmem_slab *slab,
+		    unsigned long idx)
+{
+	return slab->mem + idx * cache->objsize;
+}
+
+static void init_slab_bufctl(struct kmem_cache *cache, struct kmem_slab *slab)
+{
+	unsigned long i;
+	struct kmem_bufctl *bufctl;
+	void *obj;
+
+	for (i = 0; i < cache->slab_capacity-1; i++) {
+		obj = obj_at(cache, slab, i);
+		bufctl = obj_to_bufctl(cache, slab, obj);
+		bufctl->addr = obj;
+		bufctl->next = obj_to_bufctl(cache, slab, obj+cache->objsize);
+	}
+	obj = obj_at(cache, slab, cache->slab_capacity-1);
+	bufctl = obj_to_bufctl(cache, slab, obj);
+	bufctl->addr = obj;
+	bufctl->next = NULL;
+
+	slab->free = obj_to_bufctl(cache, slab, slab->mem);
+}
+
+static struct kmem_slab *create_slab(struct kmem_cache *cache, gfp_t gfp_flags)
+{
+	struct page *page;
+	void *addr;
+	struct kmem_slab *slab;
+	int nr_pages;
+
+	page = alloc_pages(cache->gfp_flags, cache->cache_order);
+	if (!page)
+		return NULL;
+
+	addr = page_address(page);
+
+	if (mgmt_in_slab(cache))
+		slab = addr + slab_size(cache) - sizeof(*slab);
+	else {
+		slab = kmem_cache_alloc(&slab_cache, gfp_flags);
+		if (!slab)
+			goto failed;
+	}
+
+	INIT_LIST_HEAD(&slab->list);
+	slab->nr_available = cache->slab_capacity;
+	slab->mem = addr;
+	init_slab_bufctl(cache, slab);
+
+	nr_pages = 1 << cache->cache_order;
+	add_page_state(nr_slab, nr_pages);
+
+	while (nr_pages--) {
+		SetPageSlab(page);
+		page_set_cache(page, cache);
+		page_set_slab(page, slab);
+		page++;
+	}
+
+	cache->free_objects += cache->slab_capacity;
+
+	return slab;
+
+  failed:
+	free_pages((unsigned long)addr, cache->cache_order);
+	return NULL;
+}
+
+static void construct_object(void *obj, struct kmem_cache *cache,
+			     gfp_t gfp_flags)
+{
+	unsigned long ctor_flags = SLAB_CTOR_CONSTRUCTOR;
+
+	if (!cache->ctor)
+		return;
+
+	if (!(gfp_flags & __GFP_WAIT))
+		ctor_flags |= SLAB_CTOR_ATOMIC;
+
+	cache->ctor(obj, cache, ctor_flags);
+}
+
+static inline void destruct_object(void *obj, struct kmem_cache *cache)
+{
+	if (unlikely(cache->dtor))
+		cache->dtor(obj, cache, 0);
+}
+
+static void destroy_slab(struct kmem_cache *cache, struct kmem_slab *slab)
+{
+	unsigned long addr = (unsigned long)slab->mem;
+	struct page *page = virt_to_page(addr);
+	unsigned long nr_pages;
+
+	BUG_ON(slab->nr_available != cache->slab_capacity);
+
+	if (!mgmt_in_slab(cache))
+		kmem_cache_free(&slab_cache, slab);
+
+	nr_pages = 1 << cache->cache_order;
+
+	sub_page_state(nr_slab, nr_pages);
+
+	while (nr_pages--) {
+		if (!TestClearPageSlab(page))
+			BUG();
+		page++;
+	}
+	free_pages(addr, cache->cache_order);
+	cache->free_objects -= cache->slab_capacity;
+
+	stats_inc_reaped(cache);
+}
+
+static struct kmem_slab *expand_cache(struct kmem_cache *cache, gfp_t gfp_flags)
+{
+	struct kmem_slab *slab = create_slab(cache, gfp_flags);
+	if (!slab)
+		return NULL;
+
+	list_add_tail(&slab->list, &cache->full_slabs);
+	stats_inc_grown(cache);
+
+	return slab;
+}
+
+static struct kmem_slab *find_slab(struct kmem_cache *cache)
+{
+	struct kmem_slab *slab;
+	struct list_head *list = NULL;
+	
+	if (!list_empty(&cache->partial_slabs))
+		list = &cache->partial_slabs;
+	else if (!list_empty(&cache->full_slabs))
+		list = &cache->full_slabs;
+	else
+		BUG();
+
+	slab = list_entry(list->next, struct kmem_slab, list);
+	BUG_ON(!slab->nr_available);
+	return slab;
+}
+
+static void *alloc_obj(struct kmem_cache *cache, struct kmem_slab *slab)
+{
+	void *obj = slab->free->addr;
+	slab->free = slab->free->next;
+	slab->nr_available--;
+	cache->free_objects--;
+	return obj;
+}
+
+/* The caller must hold cache->lists_lock.  */
+static void *slab_alloc(struct kmem_cache *cache, gfp_t gfp_flags)
+{
+	struct kmem_slab *slab;
+	void *ret;
+
+	if (list_empty(&cache->partial_slabs) &&
+	    list_empty(&cache->full_slabs) &&
+	    !expand_cache(cache, gfp_flags))
+		return NULL;
+
+	slab = find_slab(cache);
+	if (slab->nr_available == cache->slab_capacity)
+		list_move(&slab->list, &cache->partial_slabs);
+
+	ret = alloc_obj(cache, slab);
+	if (!slab->nr_available)
+		list_move(&slab->list, &cache->empty_slabs);
+
+	return ret;
+}
+
+static void swap_magazines(struct kmem_cpu_cache *cpu_cache)
+{
+	struct kmem_magazine *tmp = cpu_cache->loaded;
+	cpu_cache->loaded = cpu_cache->prev;
+	cpu_cache->prev = tmp;
+}
+
+/**
+ * kmem_ptr_validate - check if an untrusted pointer might
+ *	be a slab entry.
+ * @cachep: the cache we're checking against
+ * @ptr: pointer to validate
+ *
+ * This verifies that the untrusted pointer looks sane: it is _not_ a
+ * guarantee that the pointer is actually part of the slab cache in
+ * question, but it at least validates that the pointer can be
+ * dereferenced and looks half-way sane.
+ *
+ * Currently only used for dentry validation.
+ */
+int fastcall kmem_ptr_validate(struct kmem_cache *cache, void *ptr)
+{
+	unsigned long addr = (unsigned long) ptr;
+	unsigned long min_addr = PAGE_OFFSET;
+	unsigned long size = cache->objsize;
+	struct page *page;
+
+	if (unlikely(addr < min_addr))
+		goto out;
+	if (unlikely(addr > (unsigned long)high_memory - size))
+		goto out;
+	if (unlikely(!kern_addr_valid(addr)))
+		goto out;
+	if (unlikely(!kern_addr_valid(addr + size - 1)))
+		goto out;
+	page = virt_to_page(ptr);
+	if (unlikely(!PageSlab(page)))
+		goto out;
+	if (unlikely(page_get_cache(page) != cache))
+		goto out;
+	return 1;
+  out:
+	return 0;
+}
+
+/**
+ * kmem_cache_alloc - Allocate an object
+ * @cachep: The cache to allocate from.
+ * @flags: See kmalloc().
+ *
+ * This function can be called from interrupt and process contexts.
+ *
+ * Allocate an object from this cache.  The flags are only relevant
+ * if the cache has no available objects.
+ */
+void *kmem_cache_alloc(struct kmem_cache *cache, gfp_t gfp_flags)
+{
+	void *ret = NULL;
+	unsigned long flags;
+	struct kmem_cpu_cache *cpu_cache = cpu_cache_get(cache);
+
+	spin_lock_irqsave(&cpu_cache->lock, flags);
+
+	while (1) {
+		if (likely(!magazine_is_empty(cpu_cache->loaded))) {
+			ret = magazine_get(cpu_cache->loaded);
+			break;
+		} else if (magazine_is_full(cpu_cache->prev)) {
+			swap_magazines(cpu_cache);
+			continue;
+		}
+
+		spin_lock(&cache->lists_lock);
+
+		if (list_empty(&cache->full_magazines)) {
+			ret = slab_alloc(cache, gfp_flags);
+			spin_unlock(&cache->lists_lock);
+			if (ret)
+				construct_object(ret, cache, gfp_flags);
+			break;
+		}
+		depot_put_empty(cache, cpu_cache->prev);
+		cpu_cache->prev = cpu_cache->loaded;
+		cpu_cache->loaded = depot_get_full(cache);
+
+		spin_unlock(&cache->lists_lock);
+	}
+
+	spin_unlock_irqrestore(&cpu_cache->lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL(kmem_cache_alloc);
+
+void *kmem_cache_alloc_node(struct kmem_cache *cache, unsigned int __nocast flags, int nodeid)
+{
+	return kmem_cache_alloc(cache, flags);
+}
+EXPORT_SYMBOL(kmem_cache_alloc_node);
+
+static void free_obj(struct kmem_cache *cache, struct kmem_slab *slab,
+		     void *obj)
+{
+	struct kmem_bufctl *bufctl;
+
+	bufctl = obj_to_bufctl(cache, slab, obj);
+	bufctl->next = slab->free;
+	bufctl->addr = obj;
+
+	slab->free = bufctl;
+	slab->nr_available++;
+	cache->free_objects++;
+}
+
+static void slab_free(struct kmem_cache *cache, void *obj)
+{
+	struct page *page = virt_to_page(obj);
+	struct kmem_slab *slab = page_get_slab(page);
+
+	if (page_get_cache(page) != cache)
+		BUG();
+
+	if (slab->nr_available == 0)
+		list_move(&slab->list, &cache->partial_slabs);
+
+	free_obj(cache, slab, obj);
+
+	if (slab->nr_available == cache->slab_capacity)
+		list_move(&slab->list, &cache->full_slabs);
+}
+
+/**
+ * kmem_cache_free - Deallocate an object
+ * @cachep: The cache the allocation was from.
+ * @objp: The previously allocated object.
+ *
+ * This function can be called from interrupt and process contexts.
+ *
+ * Free an object which was previously allocated from this
+ * cache.
+ */
+void kmem_cache_free(struct kmem_cache *cache, void *obj)
+{
+	unsigned long flags;
+	struct kmem_cpu_cache *cpu_cache = cpu_cache_get(cache);
+
+	if (!obj)
+		return;
+
+	spin_lock_irqsave(&cpu_cache->lock, flags);
+
+	while (1) {
+		if (!magazine_is_full(cpu_cache->loaded)) {
+			magazine_put(cpu_cache->loaded, obj);
+			break;
+		}
+
+		if (magazine_is_empty(cpu_cache->prev)) {
+			swap_magazines(cpu_cache);
+			continue;
+		}
+	
+		spin_lock(&cache->lists_lock);
+		if (unlikely(list_empty(&cache->empty_magazines))) {
+			struct kmem_magazine *magazine;
+
+			spin_unlock(&cache->lists_lock);
+			magazine = kmem_cache_alloc(&magazine_cache,
+						    GFP_KERNEL);
+			if (magazine) {
+				depot_put_empty(cache, magazine);
+				continue;
+			}
+			destruct_object(obj, cache);
+			spin_lock(&cache->lists_lock);
+			slab_free(cache, obj);
+			spin_unlock(&cache->lists_lock);
+			break;
+		}
+		depot_put_full(cache, cpu_cache->prev);
+		cpu_cache->prev = cpu_cache->loaded;
+		cpu_cache->loaded = depot_get_empty(cache);
+		spin_unlock(&cache->lists_lock);
+	}
+
+	spin_unlock_irqrestore(&cpu_cache->lock, flags);
+}
+
+EXPORT_SYMBOL(kmem_cache_free);
+
+static void free_slab_list(struct kmem_cache *cache, struct list_head *slab_list)
+{
+	struct kmem_slab *slab, *tmp;
+
+	list_for_each_entry_safe(slab, tmp, slab_list, list) {
+		list_del(&slab->list);
+		destroy_slab(cache, slab);
+	}
+}
+
+static void free_cache_slabs(struct kmem_cache *cache)
+{
+	free_slab_list(cache, &cache->full_slabs);
+	free_slab_list(cache, &cache->partial_slabs);
+	free_slab_list(cache, &cache->empty_slabs);
+}
+
+static void purge_magazine(struct kmem_cache *cache,
+			   struct kmem_magazine *mag)
+{
+	while (!magazine_is_empty(mag)) {
+		void *obj = magazine_get(mag);
+		destruct_object(obj, cache);
+		spin_lock(&cache->lists_lock);
+		slab_free(cache, obj);
+		spin_unlock(&cache->lists_lock);
+	}
+}
+
+static void destroy_magazine(struct kmem_cache *cache,
+			     struct kmem_magazine *mag)
+{
+	if (!mag)
+		return;
+
+	purge_magazine(cache, mag);
+	kmem_cache_free(&magazine_cache, mag);
+}
+
+static void free_cpu_caches(struct kmem_cache *cache)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		struct kmem_cpu_cache *cpu_cache = __cpu_cache_get(cache, i);
+		destroy_magazine(cache, cpu_cache->loaded);
+		destroy_magazine(cache, cpu_cache->prev);
+	}
+}
+
+static int init_cpu_cache(struct kmem_cpu_cache *cpu_cache)
+{
+	int err = 0;
+
+	spin_lock_init(&cpu_cache->lock);
+
+	cpu_cache->loaded = kmem_cache_alloc(&magazine_cache, GFP_KERNEL);
+	if (!cpu_cache->loaded)
+		goto failed;
+
+	cpu_cache->prev = kmem_cache_alloc(&magazine_cache, GFP_KERNEL);
+	if (!cpu_cache->prev)
+		goto failed;
+
+  out:
+	return err;
+
+  failed:
+	kmem_cache_free(&magazine_cache, cpu_cache->loaded);
+	err = -ENOMEM;
+	goto out;
+}
+
+static int init_cpu_caches(struct kmem_cache *cache)
+{
+	int i;
+	int ret = 0;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		struct kmem_cpu_cache *cpu_cache = __cpu_cache_get(cache, i);
+		ret = init_cpu_cache(cpu_cache);
+		if (ret)
+			break;
+	}
+
+	if (ret)
+		free_cpu_caches(cache);
+
+	return ret;
+}
+
+static unsigned long wastage(struct kmem_cache *cache, unsigned long order)
+{
+	unsigned long size = order_to_size(order);
+	return size % cache->objsize;
+}
+
+static long cache_order(struct kmem_cache *cache)
+{
+	long prev, order;
+
+	prev = order = 0;
+
+	/*
+	 * First find the first order in which the objects fit.
+	 */ 
+	while (1) {
+		if (cache->objsize <= order_to_size(order))
+			break;
+		if (++order >= MAX_ORDER) {
+			order = -1;
+			goto out;
+		}
+	}
+
+	/*
+	 * Then see if we can find a better one.
+	 */
+	while (order < MAX_ORDER-1) {
+		unsigned long prev_wastage, current_wastage;
+
+		prev = order;
+		prev_wastage = wastage(cache, prev);
+		current_wastage = wastage(cache, ++order);
+
+		if (prev_wastage < current_wastage ||
+		    prev_wastage-current_wastage < MAX_WASTAGE) {
+			order = prev;
+			break;
+		}
+	}
+
+  out:
+	return order;
+}
+
+/**
+ * kmem_cache_create - Create a cache.
+ * @name: A string which is used in /proc/slabinfo to identify this cache.
+ * @size: The size of objects to be created in this cache.
+ * @align: The required alignment for the objects.
+ * @flags: SLAB flags
+ * @ctor: A constructor for the objects.
+ * @dtor: A destructor for the objects.
+ *
+ * This function must not be called from interrupt context.
+ *
+ * Returns a ptr to the cache on success, NULL on failure.  Cannot be
+ * called within a int, but can be interrupted.  The @ctor is run when
+ * new pages are allocated by the cache and the @dtor is run before
+ * the pages are handed back.
+ *
+ * @name must be valid until the cache is destroyed. This implies that
+ * the module calling this has to destroy the cache before getting
+ * unloaded.
+ *
+ * The flags are
+ *
+ * %SLAB_POISON - Poison the slab with a known test pattern (a5a5a5a5)
+ * to catch references to uninitialised memory.
+ *
+ * %SLAB_RED_ZONE - Insert `Red' zones around the allocated memory to
+ * check for buffer overruns.
+ *
+ * %SLAB_NO_REAP - Don't automatically reap this cache when we're
+ * under memory pressure.
+ *
+ * %SLAB_HWCACHE_ALIGN - Align the objects in this cache to a hardware
+ * cacheline.  This can be beneficial if you're counting cycles as
+ * closely as davem.
+ */
+struct kmem_cache *kmem_cache_create(const char *name, size_t objsize,
+				     size_t align, unsigned long flags,
+				     kmem_ctor_fn ctor, kmem_dtor_fn dtor)
+{
+	struct kmem_cache *cache = kmem_cache_alloc(&cache_cache, GFP_KERNEL);
+	if (!cache)
+		return NULL;
+
+	cache->name = name;
+	cache->objsize = objsize;
+	cache->ctor = ctor;
+	cache->dtor = dtor;
+	cache->free_objects = 0;
+
+	cache->cache_order = cache_order(cache);
+	if (cache->cache_order < 0)
+		goto failed;
+
+	cache->slab_capacity = slab_capacity(cache);
+
+	memset(&cache->stats, 0, sizeof(struct kmem_cache_statistics));
+
+	if (init_cpu_caches(cache))
+		goto failed;
+
+	down(&cache_chain_sem);
+	list_add(&cache->next, &cache_chain);
+	up(&cache_chain_sem);
+
+	return cache;
+
+  failed:
+	kmem_cache_free(&cache_cache, cache);
+	return NULL;
+}
+
+EXPORT_SYMBOL(kmem_cache_create);
+
+static void free_depot_magazines(struct kmem_cache *cache)
+{
+	struct kmem_magazine *magazine, *tmp;
+
+	list_for_each_entry_safe(magazine, tmp, &cache->empty_magazines, list) {
+		list_del(&magazine->list);
+		destroy_magazine(cache, magazine);
+	}
+
+	list_for_each_entry_safe(magazine, tmp, &cache->full_magazines, list) {
+		list_del(&magazine->list);
+		destroy_magazine(cache, magazine);
+	}
+}
+
+/**
+ * kmem_cache_destroy - delete a cache
+ * @cache: the cache to destroy
+ *
+ * This function must not be called from interrupt context.
+ *
+ * Remove a kmem_cache from the slab cache.
+ *
+ * It is expected this function will be called by a module when it is
+ * unloaded.  This will remove the cache completely, and avoid a
+ * duplicate cache being allocated each time a module is loaded and
+ * unloaded, if the module doesn't have persistent in-kernel storage
+ * across loads and unloads.
+ *
+ * The cache must be empty before calling this function.
+ *
+ * The caller must guarantee that no one will allocate memory from the
+ * cache during the kmem_cache_destroy().
+ */
+int kmem_cache_destroy(struct kmem_cache *cache)
+{
+	unsigned long flags;
+
+	down(&cache_chain_sem);
+	list_del(&cache->next);
+	up(&cache_chain_sem);
+
+	free_cpu_caches(cache);
+	free_depot_magazines(cache);
+	spin_lock_irqsave(&cache->lists_lock, flags);
+	free_cache_slabs(cache);
+	spin_unlock_irqrestore(&cache->lists_lock, flags);
+	kmem_cache_free(&cache_cache, cache);
+
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_destroy);
+
+int kmem_cache_shrink(struct kmem_cache *cache)
+{
+	unsigned long flags;
+	struct kmem_cpu_cache *cpu_cache = cpu_cache_get(cache);
+
+	purge_magazine(cache, cpu_cache->loaded);
+	purge_magazine(cache, cpu_cache->prev);
+	free_depot_magazines(cache);
+
+	spin_lock_irqsave(&cache->lists_lock, flags);
+	free_slab_list(cache, &cache->full_slabs);
+	spin_unlock_irqrestore(&cache->lists_lock, flags);
+	return 0;
+}
+EXPORT_SYMBOL(kmem_cache_shrink);
+
+
+/*
+ *	Cache Reaping
+ */
+
+/**
+ * cache_reap - Reclaim memory from caches.
+ * @unused: unused parameter
+ *
+ * Called from workqueue/eventd every few seconds.
+ * Purpose:
+ * - clear the per-cpu caches for this CPU.
+ * - return freeable pages to the main free memory pool.
+ *
+ * If we cannot acquire the cache chain semaphore then just give up - we'll
+ * try again on the next iteration.
+ */
+static void cache_reap(void *unused)
+{
+	struct list_head *walk;
+
+	if (down_trylock(&cache_chain_sem))
+		goto out;
+
+	list_for_each(walk, &cache_chain) {
+		struct kmem_cache *cache = list_entry(walk, struct kmem_cache,
+						      next);
+		kmem_cache_shrink(cache);
+	}
+
+	up(&cache_chain_sem);
+  out:
+	/* Setup the next iteration */
+	schedule_delayed_work(&__get_cpu_var(reap_work),
+			      REAP_TIMEOUT_CPU_CACHES);
+}
+
+/*
+ * Initiate the reap timer running on the target CPU.  We run at around 1 to 2Hz
+ * via the workqueue/eventd.
+ * Add the CPU number into the expiration time to minimize the possibility of
+ * the CPUs getting into lockstep and contending for the global cache chain
+ * lock.
+ */
+static void __devinit start_cpu_timer(int cpu)
+{
+	struct work_struct *reap_work = &per_cpu(reap_work, cpu);
+
+	/*
+	 * When this gets called from do_initcalls via cpucache_init(),
+	 * init_workqueues() has already run, so keventd will be setup
+	 * at that time.
+	 */
+	if (keventd_up() && reap_work->func == NULL) {
+		INIT_WORK(reap_work, cache_reap, NULL);
+		schedule_delayed_work_on(cpu, reap_work, HZ + 3 * cpu);
+	}
+}
+
+
+/*
+ *	Proc FS
+ */
+
+#ifdef CONFIG_PROC_FS
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t n = *pos;
+	struct list_head *p;
+
+	down(&cache_chain_sem);
+	if (!n) {
+		/*
+		 * Output format version, so at least we can change it
+		 * without _too_ many complaints.
+		 */
+		seq_puts(m, "slabinfo - version: 2.1\n");
+		seq_puts(m, "# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>");
+		seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+		seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+		seq_putc(m, '\n');
+	}
+	p = cache_chain.next;
+	while (n--) {
+		p = p->next;
+		if (p == &cache_chain)
+			return NULL;
+	}
+	return list_entry(p, struct kmem_cache, next);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	struct kmem_cache *cache = p;
+	++*pos;
+	return cache->next.next == &cache_chain ? NULL
+		: list_entry(cache->next.next, struct kmem_cache, next);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+	up(&cache_chain_sem);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct kmem_cache *cache = p;
+	struct list_head *q;
+	struct kmem_slab *slab;
+	unsigned long active_objs;
+	unsigned long num_objs;
+	unsigned long active_slabs = 0;
+	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
+	const char *name;
+	char *error = NULL;
+
+	spin_lock_irq(&cache->lists_lock);
+
+	active_objs = 0;
+	num_slabs = 0;
+
+	list_for_each(q, &cache->full_slabs) {
+		slab = list_entry(q, struct kmem_slab, list);
+		active_slabs++;
+		active_objs += cache->slab_capacity - slab->nr_available;
+	}
+
+	list_for_each(q, &cache->partial_slabs) {
+		slab = list_entry(q, struct kmem_slab, list);
+		active_slabs++;
+		active_objs += cache->slab_capacity - slab->nr_available;
+	}
+
+	list_for_each(q, &cache->empty_slabs) {
+		slab = list_entry(q, struct kmem_slab, list);
+		active_slabs++;
+		active_objs += cache->slab_capacity - slab->nr_available;
+	}
+
+	num_slabs += active_slabs;
+	num_objs = num_slabs * cache->slab_capacity;
+	free_objects = cache->free_objects;
+
+	if (num_objs - active_objs != free_objects && !error)
+		error = "free_objects accounting error";
+
+	name = cache->name;
+	if (error)
+		printk(KERN_ERR "slab: cache %s error: %s\n", name, error);
+
+	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
+		name, active_objs, num_objs, cache->objsize,
+		cache->slab_capacity, (1 << cache->cache_order));
+	seq_printf(m, " : slabdata %6lu %6lu %6lu",
+			active_slabs, num_slabs, shared_avail);
+	seq_putc(m, '\n');
+
+	spin_unlock_irq(&cache->lists_lock);
+	return 0;
+}
+
+/*
+ * slabinfo_op - iterator that generates /proc/slabinfo
+ *
+ * Output layout:
+ * cache-name
+ * num-active-objs
+ * total-objs
+ * object size
+ * num-active-slabs
+ * total-slabs
+ * num-pages-per-slab
+ * + further values on SMP and with statistics enabled
+ */
+
+struct seq_operations slabinfo_op = {
+	.start	= s_start,
+	.next	= s_next,
+	.stop	= s_stop,
+	.show	= s_show,
+};
+
+ssize_t slabinfo_write(struct file *file, const char __user *buffer,
+		       size_t count, loff_t *ppos)
+{
+	return -EFAULT;
+}
+#endif
+
+
+/*
+ *	Memory Allocator Initialization
+ */
+
+static int bootstrap_cpu_caches(struct kmem_cache *cache)
+{
+	int i, err = 0;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		struct kmem_cpu_cache *cpu_cache = __cpu_cache_get(cache, i);
+		spin_lock_init(&cpu_cache->lock);
+
+		spin_lock(&cache->lists_lock);
+		cpu_cache->loaded = slab_alloc(cache, GFP_KERNEL);
+		spin_unlock(&cache->lists_lock);
+		if (!cpu_cache->loaded) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		init_magazine(cpu_cache->loaded);
+
+		spin_lock(&cache->lists_lock);
+		cpu_cache->prev = slab_alloc(cache, GFP_KERNEL);
+		spin_unlock(&cache->lists_lock);
+		if (!cpu_cache->prev) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		init_magazine(cpu_cache->prev);
+	}
+
+  out:
+	return err;
+}
+
+void kmem_cache_init(void)
+{
+	init_MUTEX(&cache_chain_sem);
+	INIT_LIST_HEAD(&cache_chain);
+
+	cache_cache.cache_order = cache_order(&cache_cache);
+	cache_cache.slab_capacity = slab_capacity(&cache_cache);
+	slab_cache.cache_order = cache_order(&slab_cache);
+	slab_cache.slab_capacity = slab_capacity(&slab_cache);
+	magazine_cache.cache_order = cache_order(&magazine_cache);
+	magazine_cache.slab_capacity = slab_capacity(&magazine_cache);
+
+	init_cache(&cache_cache);
+	init_cache(&slab_cache);
+	init_cache(&magazine_cache);
+
+	if (bootstrap_cpu_caches(&magazine_cache))
+		goto failed;
+
+	if (init_cpu_caches(&cache_cache))
+		goto failed;
+
+	if (init_cpu_caches(&slab_cache))
+		goto failed;
+
+	list_add(&cache_cache.next, &cache_chain);
+	list_add(&slab_cache.next, &cache_chain);
+	list_add(&magazine_cache.next, &cache_chain);
+
+	kmalloc_init();
+
+	return;
+
+  failed:
+	panic("slab allocator init failed");
+}
+
+static int __init cpucache_init(void)
+{
+	int cpu;
+
+	/*
+	 * Register the per-CPU timers that return unneeded
+	 * cached pages to the page allocator.
+	 */
+	for_each_online_cpu(cpu)
+		start_cpu_timer(cpu);
+
+	return 0;
+}
+
+__initcall(cpucache_init);
+
+void kmem_cache_release(void)
+{
+}
Index: 2.6/test/CuTest.c
===================================================================
--- /dev/null
+++ 2.6/test/CuTest.c
@@ -0,0 +1,331 @@
+#include <assert.h>
+#include <setjmp.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <math.h>
+
+#include "CuTest.h"
+
+/*-------------------------------------------------------------------------*
+ * CuStr
+ *-------------------------------------------------------------------------*/
+
+char* CuStrAlloc(int size)
+{
+	char* newStr = (char*) malloc( sizeof(char) * (size) );
+	return newStr;
+}
+
+char* CuStrCopy(const char* old)
+{
+	int len = strlen(old);
+	char* newStr = CuStrAlloc(len + 1);
+	strcpy(newStr, old);
+	return newStr;
+}
+
+/*-------------------------------------------------------------------------*
+ * CuString
+ *-------------------------------------------------------------------------*/
+
+void CuStringInit(CuString* str)
+{
+	str->length = 0;
+	str->size = STRING_MAX;
+	str->buffer = (char*) malloc(sizeof(char) * str->size);
+	str->buffer[0] = '\0';
+}
+
+CuString* CuStringNew(void)
+{
+	CuString* str = (CuString*) malloc(sizeof(CuString));
+	str->length = 0;
+	str->size = STRING_MAX;
+	str->buffer = (char*) malloc(sizeof(char) * str->size);
+	str->buffer[0] = '\0';
+	return str;
+}
+
+void CuStringDelete(CuString* str)
+{
+	free(str->buffer);
+	free(str);
+}
+
+void CuStringResize(CuString* str, int newSize)
+{
+	str->buffer = (char*) realloc(str->buffer, sizeof(char) * newSize);
+	str->size = newSize;
+}
+
+void CuStringAppend(CuString* str, const char* text)
+{
+	int length;
+
+	if (text == NULL) {
+		text = "NULL";
+	}
+
+	length = strlen(text);
+	if (str->length + length + 1 >= str->size)
+		CuStringResize(str, str->length + length + 1 + STRING_INC);
+	str->length += length;
+	strcat(str->buffer, text);
+}
+
+void CuStringAppendChar(CuString* str, char ch)
+{
+	char text[2];
+	text[0] = ch;
+	text[1] = '\0';
+	CuStringAppend(str, text);
+}
+
+void CuStringAppendFormat(CuString* str, const char* format, ...)
+{
+	va_list argp;
+	char buf[HUGE_STRING_LEN];
+	va_start(argp, format);
+	vsnprintf(buf, HUGE_STRING_LEN, format, argp);
+	va_end(argp);
+	CuStringAppend(str, buf);
+}
+
+void CuStringInsert(CuString* str, const char* text, int pos)
+{
+	int length = strlen(text);
+	if (pos > str->length)
+		pos = str->length;
+	if (str->length + length + 1 >= str->size)
+		CuStringResize(str, str->length + length + 1 + STRING_INC);
+	memmove(str->buffer + pos + length, str->buffer + pos, (str->length - pos) + 1);
+	str->length += length;
+	memcpy(str->buffer + pos, text, length);
+}
+
+/*-------------------------------------------------------------------------*
+ * CuTest
+ *-------------------------------------------------------------------------*/
+
+void CuTestInit(CuTest* t, const char* name, TestFunction function)
+{
+	t->name = CuStrCopy(name);
+	t->failed = 0;
+	t->ran = 0;
+	t->message = NULL;
+	t->function = function;
+	t->jumpBuf = NULL;
+}
+
+CuTest* CuTestNew(const char* name, TestFunction function)
+{
+	CuTest* tc = malloc(sizeof(*tc));
+	CuTestInit(tc, name, function);
+	return tc;
+}
+
+void CuTestDelete(CuTest *ct)
+{
+	free((char *)ct->name);
+	free(ct);
+}
+
+void CuTestRun(CuTest* tc)
+{
+	jmp_buf buf;
+	tc->jumpBuf = &buf;
+	if (setjmp(buf) == 0)
+	{
+		tc->ran = 1;
+		(tc->function)(tc);
+	}
+	tc->jumpBuf = 0;
+}
+
+static void CuFailInternal(CuTest* tc, const char* file, int line, CuString* string)
+{
+	char buf[HUGE_STRING_LEN];
+
+	sprintf(buf, "%s:%d: ", file, line);
+	CuStringInsert(string, buf, 0);
+
+	tc->failed = 1;
+	tc->message = string->buffer;
+	if (tc->jumpBuf != 0) longjmp(*(tc->jumpBuf), 0);
+}
+
+void CuFail_Line(CuTest* tc, const char* file, int line, const char* message2, const char* message)
+{
+	CuString string;
+
+	CuStringInit(&string);
+	if (message2 != NULL)
+	{
+		CuStringAppend(&string, message2);
+		CuStringAppend(&string, ": ");
+	}
+	CuStringAppend(&string, message);
+	CuFailInternal(tc, file, line, &string);
+}
+
+void CuAssert_Line(CuTest* tc, const char* file, int line, const char* message, int condition)
+{
+	if (condition) return;
+	CuFail_Line(tc, file, line, NULL, message);
+}
+
+void CuAssertStrEquals_LineMsg(CuTest* tc, const char* file, int line, const char* message,
+	const char* expected, const char* actual)
+{
+	CuString string;
+	if ((expected == NULL && actual == NULL) ||
+	    (expected != NULL && actual != NULL &&
+	     strcmp(expected, actual) == 0))
+	{
+		return;
+	}
+
+	CuStringInit(&string);
+	if (message != NULL)
+	{
+		CuStringAppend(&string, message);
+		CuStringAppend(&string, ": ");
+	}
+	CuStringAppend(&string, "expected <");
+	CuStringAppend(&string, expected);
+	CuStringAppend(&string, "> but was <");
+	CuStringAppend(&string, actual);
+	CuStringAppend(&string, ">");
+	CuFailInternal(tc, file, line, &string);
+}
+
+void CuAssertIntEquals_LineMsg(CuTest* tc, const char* file, int line, const char* message,
+	int expected, int actual)
+{
+	char buf[STRING_MAX];
+	if (expected == actual) return;
+	sprintf(buf, "expected <%d> but was <%d>", expected, actual);
+	CuFail_Line(tc, file, line, message, buf);
+}
+
+void CuAssertDblEquals_LineMsg(CuTest* tc, const char* file, int line, const char* message,
+	double expected, double actual, double delta)
+{
+	char buf[STRING_MAX];
+	if (fabs(expected - actual) <= delta) return;
+	sprintf(buf, "expected <%lf> but was <%lf>", expected, actual);
+	CuFail_Line(tc, file, line, message, buf);
+}
+
+void CuAssertPtrEquals_LineMsg(CuTest* tc, const char* file, int line, const char* message,
+	void* expected, void* actual)
+{
+	char buf[STRING_MAX];
+	if (expected == actual) return;
+	sprintf(buf, "expected pointer <%p> but was <%p>", expected, actual);
+	CuFail_Line(tc, file, line, message, buf);
+}
+
+
+/*-------------------------------------------------------------------------*
+ * CuSuite
+ *-------------------------------------------------------------------------*/
+
+void CuSuiteInit(CuSuite* testSuite)
+{
+	testSuite->count = 0;
+	testSuite->failCount = 0;
+}
+
+CuSuite* CuSuiteNew(void)
+{
+	CuSuite* testSuite = malloc(sizeof(*testSuite));
+	CuSuiteInit(testSuite);
+	return testSuite;
+}
+
+void CuSuiteDelete(CuSuite *testSuite)
+{
+	int i;
+	for (i = 0 ; i < testSuite->count ; ++i)
+	{
+		CuTestDelete(testSuite->list[i]);
+	}
+	free(testSuite);
+}
+
+void CuSuiteAdd(CuSuite* testSuite, CuTest *testCase)
+{
+	assert(testSuite->count < MAX_TEST_CASES);
+	testSuite->list[testSuite->count] = testCase;
+	testSuite->count++;
+}
+
+void CuSuiteAddSuite(CuSuite* testSuite, CuSuite* testSuite2)
+{
+	int i;
+	for (i = 0 ; i < testSuite2->count ; ++i)
+	{
+		CuTest* testCase = testSuite2->list[i];
+		CuSuiteAdd(testSuite, testCase);
+	}
+}
+
+void CuSuiteRun(CuSuite* testSuite)
+{
+	int i;
+	for (i = 0 ; i < testSuite->count ; ++i)
+	{
+		CuTest* testCase = testSuite->list[i];
+		CuTestRun(testCase);
+		if (testCase->failed) { testSuite->failCount += 1; }
+	}
+}
+
+void CuSuiteSummary(CuSuite* testSuite, CuString* summary)
+{
+	int i;
+	for (i = 0 ; i < testSuite->count ; ++i)
+	{
+		CuTest* testCase = testSuite->list[i];
+		CuStringAppend(summary, testCase->failed ? "F" : ".");
+	}
+	CuStringAppend(summary, "\n\n");
+}
+
+void CuSuiteDetails(CuSuite* testSuite, CuString* details)
+{
+	int i;
+	int failCount = 0;
+
+	if (testSuite->failCount == 0)
+	{
+		int passCount = testSuite->count - testSuite->failCount;
+		const char* testWord = passCount == 1 ? "test" : "tests";
+		CuStringAppendFormat(details, "OK (%d %s)\n", passCount, testWord);
+	}
+	else
+	{
+		if (testSuite->failCount == 1)
+			CuStringAppend(details, "There was 1 failure:\n");
+		else
+			CuStringAppendFormat(details, "There were %d failures:\n", testSuite->failCount);
+
+		for (i = 0 ; i < testSuite->count ; ++i)
+		{
+			CuTest* testCase = testSuite->list[i];
+			if (testCase->failed)
+			{
+				failCount++;
+				CuStringAppendFormat(details, "%d) %s: %s\n",
+					failCount, testCase->name, testCase->message);
+			}
+		}
+		CuStringAppend(details, "\n!!!FAILURES!!!\n");
+
+		CuStringAppendFormat(details, "Runs: %d ",   testSuite->count);
+		CuStringAppendFormat(details, "Passes: %d ", testSuite->count - testSuite->failCount);
+		CuStringAppendFormat(details, "Fails: %d\n",  testSuite->failCount);
+	}
+}
Index: 2.6/test/Makefile
===================================================================
--- /dev/null
+++ 2.6/test/Makefile
@@ -0,0 +1,18 @@
+all: test
+
+gen:
+	sh make-tests.sh mm/*.c > test-runner.c
+
+compile:
+	gcc -O2 -g -Wall -Iinclude -I../include -D__KERNEL__=1 ../mm/kmalloc.c mm/kmalloc-test.c ../mm/kmem.c mm/kmem-test.c mm/page_alloc.c kernel/panic.c kernel/workqueue.c kernel/timer.c CuTest.c test-runner.c -o test-runner
+
+run:
+	./test-runner
+
+test: gen compile run
+
+valgrind: gen compile
+	valgrind --leak-check=full ./test-runner
+
+clean:
+	rm -f *.o tags test-runner test-runner.c
Index: 2.6/test/include/CuTest.h
===================================================================
--- /dev/null
+++ 2.6/test/include/CuTest.h
@@ -0,0 +1,116 @@
+#ifndef CU_TEST_H
+#define CU_TEST_H
+
+#include <setjmp.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+/* CuString */
+
+char* CuStrAlloc(int size);
+char* CuStrCopy(const char* old);
+
+#define CU_ALLOC(TYPE)		((TYPE*) malloc(sizeof(TYPE)))
+
+#define HUGE_STRING_LEN	8192
+#define STRING_MAX		256
+#define STRING_INC		256
+
+typedef struct
+{
+	int length;
+	int size;
+	char* buffer;
+} CuString;
+
+void CuStringInit(CuString* str);
+CuString* CuStringNew(void);
+void CuStringDelete(CuString *str);
+void CuStringRead(CuString* str, const char* path);
+void CuStringAppend(CuString* str, const char* text);
+void CuStringAppendChar(CuString* str, char ch);
+void CuStringAppendFormat(CuString* str, const char* format, ...);
+void CuStringInsert(CuString* str, const char* text, int pos);
+void CuStringResize(CuString* str, int newSize);
+
+/* CuTest */
+
+typedef struct CuTest CuTest;
+
+typedef void (*TestFunction)(CuTest *);
+
+struct CuTest
+{
+	const char* name;
+	TestFunction function;
+	int failed;
+	int ran;
+	const char* message;
+	jmp_buf *jumpBuf;
+};
+
+void CuTestInit(CuTest* t, const char* name, TestFunction function);
+CuTest* CuTestNew(const char* name, TestFunction function);
+void CuTestDelete(CuTest *tc);
+void CuTestRun(CuTest* tc);
+
+/* Internal versions of assert functions -- use the public versions */
+void CuFail_Line(CuTest* tc, const char* file, int line, const char* message2, const char* message);
+void CuAssert_Line(CuTest* tc, const char* file, int line, const char* message, int condition);
+void CuAssertStrEquals_LineMsg(CuTest* tc,
+	const char* file, int line, const char* message,
+	const char* expected, const char* actual);
+void CuAssertIntEquals_LineMsg(CuTest* tc,
+	const char* file, int line, const char* message,
+	int expected, int actual);
+void CuAssertDblEquals_LineMsg(CuTest* tc,
+	const char* file, int line, const char* message,
+	double expected, double actual, double delta);
+void CuAssertPtrEquals_LineMsg(CuTest* tc,
+	const char* file, int line, const char* message,
+	void* expected, void* actual);
+
+/* public assert functions */
+
+#define CuFail(tc, ms)                        CuFail_Line(  (tc), __FILE__, __LINE__, NULL, (ms))
+#define CuAssert(tc, ms, cond)                CuAssert_Line((tc), __FILE__, __LINE__, (ms), (cond))
+#define CuAssertTrue(tc, cond)                CuAssert_Line((tc), __FILE__, __LINE__, "assert failed", (cond))
+
+#define CuAssertStrEquals(tc,ex,ac)           CuAssertStrEquals_LineMsg((tc),__FILE__,__LINE__,NULL,(ex),(ac))
+#define CuAssertStrEquals_Msg(tc,ms,ex,ac)    CuAssertStrEquals_LineMsg((tc),__FILE__,__LINE__,(ms),(ex),(ac))
+#define CuAssertIntEquals(tc,ex,ac)           CuAssertIntEquals_LineMsg((tc),__FILE__,__LINE__,NULL,(ex),(ac))
+#define CuAssertIntEquals_Msg(tc,ms,ex,ac)    CuAssertIntEquals_LineMsg((tc),__FILE__,__LINE__,(ms),(ex),(ac))
+#define CuAssertDblEquals(tc,ex,ac,dl)        CuAssertDblEquals_LineMsg((tc),__FILE__,__LINE__,NULL,(ex),(ac),(dl))
+#define CuAssertDblEquals_Msg(tc,ms,ex,ac,dl) CuAssertDblEquals_LineMsg((tc),__FILE__,__LINE__,(ms),(ex),(ac),(dl))
+#define CuAssertPtrEquals(tc,ex,ac)           CuAssertPtrEquals_LineMsg((tc),__FILE__,__LINE__,NULL,(ex),(ac))
+#define CuAssertPtrEquals_Msg(tc,ms,ex,ac)    CuAssertPtrEquals_LineMsg((tc),__FILE__,__LINE__,(ms),(ex),(ac))
+
+#define CuAssertPtrNotNull(tc,p)        CuAssert_Line((tc),__FILE__,__LINE__,"null pointer unexpected",(p != NULL))
+#define CuAssertPtrNotNullMsg(tc,msg,p) CuAssert_Line((tc),__FILE__,__LINE__,(msg),(p != NULL))
+
+/* CuSuite */
+
+#define MAX_TEST_CASES	1024
+
+#define SUITE_ADD_TEST(SUITE,TEST)	CuSuiteAdd(SUITE, CuTestNew(#TEST, TEST))
+
+typedef struct
+{
+	int count;
+	CuTest* list[MAX_TEST_CASES];
+	int failCount;
+
+} CuSuite;
+
+
+void CuSuiteInit(CuSuite* testSuite);
+CuSuite* CuSuiteNew(void);
+void CuSuiteDelete(CuSuite *);
+void CuSuiteAdd(CuSuite* testSuite, CuTest *testCase);
+void CuSuiteAddSuite(CuSuite* testSuite, CuSuite* testSuite2);
+void CuSuiteRun(CuSuite* testSuite);
+void CuSuiteSummary(CuSuite* testSuite, CuString* summary);
+void CuSuiteDetails(CuSuite* testSuite, CuString* details);
+
+#endif /* CU_TEST_H */
Index: 2.6/test/make-tests.sh
===================================================================
--- /dev/null
+++ 2.6/test/make-tests.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+
+# Auto generate single AllTests file for CuTest.
+# Searches through all *.c files in the current directory.
+# Prints to stdout.
+# Author: Asim Jalis
+# Date: 01/08/2003
+
+if test $# -eq 0 ; then FILES=*.c ; else FILES=$* ; fi
+
+echo '
+
+/* This is auto-generated code. Edit at your own peril. */
+
+#include "CuTest.h"
+
+'
+
+cat $FILES | grep '^void test' |
+    sed -e 's/(.*$//' \
+        -e 's/$/(CuTest*);/' \
+        -e 's/^/extern /'
+
+echo \
+'
+
+void RunAllTests(void)
+{
+    CuString *output = CuStringNew();
+    CuSuite* suite = CuSuiteNew();
+
+'
+cat $FILES | grep '^void test' |
+    sed -e 's/^void //' \
+        -e 's/(.*$//' \
+        -e 's/^/    SUITE_ADD_TEST(suite, /' \
+        -e 's/$/);/'
+
+echo \
+'
+    CuSuiteRun(suite);
+    CuSuiteSummary(suite, output);
+    CuSuiteDetails(suite, output);
+    printf("%s\n", output->buffer);
+    CuSuiteDelete(suite);
+    CuStringDelete(output);
+}
+
+int main(void)
+{
+    RunAllTests();
+    return 0;
+}
+'
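The generator works by pattern: every test function definition starting with `void test` is turned once into an extern declaration and once into a SUITE_ADD_TEST registration by two sed passes. A quick standalone check of that transformation (the sample file name is made up for illustration):

```shell
# Feed one sample test definition through the same grep/sed
# pipelines make-tests.sh uses and inspect both outputs.
printf 'void test_example(CuTest* tc)\n{\n}\n' > /tmp/sample-test.c

decl=$(cat /tmp/sample-test.c | grep '^void test' |
    sed -e 's/(.*$//' -e 's/$/(CuTest*);/' -e 's/^/extern /')

reg=$(cat /tmp/sample-test.c | grep '^void test' |
    sed -e 's/^void //' -e 's/(.*$//' \
        -e 's/^/    SUITE_ADD_TEST(suite, /' -e 's/$/);/')

echo "$decl"
echo "$reg"
```

Note the scheme depends on definitions starting in column 0 with the exact prefix `void test`; an indented or differently named test function is silently skipped.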
Index: 2.6/test/mm/kmalloc-test.c
===================================================================
--- /dev/null
+++ 2.6/test/mm/kmalloc-test.c
@@ -0,0 +1,21 @@
+#include <CuTest.h>
+#include <linux/kmem.h>
+
+void test_kmalloc_returns_from_slab(CuTest *ct)
+{
+	kmem_cache_init();
+	void *obj1 = kmalloc(10, GFP_KERNEL);
+	void *obj2 = kmalloc(10, GFP_KERNEL);
+	CuAssertIntEquals(ct, (unsigned long)obj1+32, (unsigned long)obj2);
+	kmem_cache_release();
+}
+
+void test_kzalloc_zeros_memory(CuTest *ct)
+{
+	int i;
+	kmem_cache_init();
+	char *obj = kzalloc(10, GFP_KERNEL);
+	for (i = 0; i < 10; i++)
+		CuAssertIntEquals(ct, 0, obj[i]);
+	kmem_cache_release();
+}
Index: 2.6/test/mm/kmem-test.c
===================================================================
--- /dev/null
+++ 2.6/test/mm/kmem-test.c
@@ -0,0 +1,239 @@
+#include <CuTest.h>
+#include <linux/kmem.h>
+#include <linux/string.h>
+#include <linux/mm.h>
+
+#define DEFAULT_OBJSIZE (PAGE_SIZE/2)
+#define MAX_OBJS (100)
+
+void test_retains_cache_name(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("object_cache", 512, 0, 0, NULL, NULL);
+	CuAssertStrEquals(ct, "object_cache", cache->name);
+	kmem_cache_destroy(cache);
+	kmem_cache_release();
+}
+
+void test_alloc_grows_cache(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", DEFAULT_OBJSIZE, 0, 0, NULL, NULL);
+	CuAssertIntEquals(ct, 0, cache->stats.grown);
+	void *obj = kmem_cache_alloc(cache, GFP_KERNEL);
+	CuAssertIntEquals(ct, 1, cache->stats.grown);
+	kmem_cache_free(cache, obj);
+	kmem_cache_destroy(cache);
+	kmem_cache_release();
+}
+
+static void alloc_objs(struct kmem_cache *cache, void *objs[], size_t nr_objs)
+{
+	int i;
+	for (i = 0; i < nr_objs; i++) {
+		objs[i] = kmem_cache_alloc(cache, GFP_KERNEL);
+	}
+}
+
+static void free_objs(struct kmem_cache *cache, void *objs[], size_t nr_objs)
+{
+	int i;
+	for (i = 0; i < nr_objs; i++) {
+		kmem_cache_free(cache, objs[i]);
+	}
+}
+
+void test_destroying_cache_reaps_slabs(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", DEFAULT_OBJSIZE, 0, 0, NULL, NULL);
+	void *objs[MAX_OBJS];
+	alloc_objs(cache, objs, MAX_OBJS);
+	free_objs(cache, objs, MAX_OBJS);
+	kmem_cache_destroy(cache);
+	CuAssertIntEquals(ct, 1, list_empty(&cache->full_slabs));
+	CuAssertIntEquals(ct, 1, list_empty(&cache->partial_slabs));
+	CuAssertIntEquals(ct, 1, list_empty(&cache->empty_slabs));
+	kmem_cache_release();
+}
+
+void test_multiple_objects_within_one_page(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", DEFAULT_OBJSIZE, 0, 0, NULL, NULL);
+	void *objs[MAX_OBJS];
+	alloc_objs(cache, objs, MAX_OBJS);
+	CuAssertIntEquals(ct, (MAX_OBJS*DEFAULT_OBJSIZE/PAGE_SIZE), cache->stats.grown);
+	free_objs(cache, objs, MAX_OBJS);
+	kmem_cache_destroy(cache);
+	kmem_cache_release();
+}
+
+void test_allocates_from_magazine_when_available(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", DEFAULT_OBJSIZE, 0, 0, NULL, NULL);
+	void *obj1 = kmem_cache_alloc(cache, GFP_KERNEL);
+	kmem_cache_free(cache, obj1);
+	void *obj2 = kmem_cache_alloc(cache, GFP_KERNEL);
+	kmem_cache_free(cache, obj2);
+	CuAssertPtrEquals(ct, obj1, obj2);
+	kmem_cache_destroy(cache);
+	kmem_cache_release();
+}
+
+void test_allocated_objects_are_from_same_slab(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", DEFAULT_OBJSIZE, 0, 0, NULL, NULL);
+	void *obj1 = kmem_cache_alloc(cache, GFP_KERNEL);
+	void *obj2 = kmem_cache_alloc(cache, GFP_KERNEL);
+	CuAssertPtrEquals(ct, obj1+(DEFAULT_OBJSIZE), obj2);
+	kmem_cache_destroy(cache);
+	kmem_cache_release();
+}
+
+static unsigned long nr_ctor_dtor_called;
+static struct kmem_cache *cache_passed_to_ctor_dtor;
+static unsigned long flags_passed_to_ctor_dtor;
+
+static void ctor_dtor(void *obj, struct kmem_cache *cache, unsigned long flags)
+{
+	nr_ctor_dtor_called++;
+	cache_passed_to_ctor_dtor = cache;
+	flags_passed_to_ctor_dtor = flags;
+}
+
+static void reset_ctor_dtor(void)
+{
+	nr_ctor_dtor_called = 0;
+	cache_passed_to_ctor_dtor = NULL;
+	flags_passed_to_ctor_dtor = 0;
+}
+
+void test_constructor_is_called_for_allocated_objects(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", DEFAULT_OBJSIZE,
+						     0, 0, ctor_dtor, NULL);
+	reset_ctor_dtor();
+	void *obj = kmem_cache_alloc(cache, GFP_KERNEL);
+	CuAssertIntEquals(ct, 1, nr_ctor_dtor_called);
+	CuAssertPtrEquals(ct, cache, cache_passed_to_ctor_dtor);
+	CuAssertIntEquals(ct, SLAB_CTOR_CONSTRUCTOR,
+			  flags_passed_to_ctor_dtor);
+	kmem_cache_free(cache, obj);
+	kmem_cache_release();
+}
+
+void test_atomic_flag_is_passed_to_constructor(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", DEFAULT_OBJSIZE,
+						     0, 0, ctor_dtor, NULL);
+	reset_ctor_dtor();
+	void *obj = kmem_cache_alloc(cache, GFP_ATOMIC);
+	CuAssertIntEquals(ct, SLAB_CTOR_CONSTRUCTOR|SLAB_CTOR_ATOMIC,
+			  flags_passed_to_ctor_dtor);
+	kmem_cache_free(cache, obj);
+	kmem_cache_destroy(cache);
+	kmem_cache_release();
+}
+
+void test_destructor_is_called_for_allocated_objects(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", DEFAULT_OBJSIZE,
+						     0, 0, NULL, ctor_dtor);
+	reset_ctor_dtor();
+	void *obj = kmem_cache_alloc(cache, GFP_KERNEL);
+	kmem_cache_free(cache, obj);
+	CuAssertIntEquals(ct, 0, nr_ctor_dtor_called);
+	kmem_cache_destroy(cache);
+	CuAssertIntEquals(ct, 1, nr_ctor_dtor_called);
+	CuAssertPtrEquals(ct, cache, cache_passed_to_ctor_dtor);
+	CuAssertIntEquals(ct, 0, flags_passed_to_ctor_dtor);
+	kmem_cache_release();
+}
+
+#define PATTERN 0x7D
+
+static void memset_ctor(void *obj, struct kmem_cache *cache, unsigned long flags)
+{
+	memset(obj, PATTERN, cache->objsize);
+}
+
+static void memcmp_dtor(void *obj, struct kmem_cache *cache, unsigned long flags)
+{
+	int i;
+	char *array = obj;
+
+	for (i = 0; i < cache->objsize; i++) {
+		if (array[i] != PATTERN)
+			BUG();
+	}
+}
+
+void test_object_is_preserved_until_destructed(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", DEFAULT_OBJSIZE,
+						     0, 0, memset_ctor,
+						     memcmp_dtor);
+	reset_ctor_dtor();
+	void *obj = kmem_cache_alloc(cache, GFP_KERNEL);
+	kmem_cache_free(cache, obj);
+	kmem_cache_destroy(cache);
+	kmem_cache_release();
+}
+
+static void assert_num_objs_and_cache_order(CuTest *ct,
+					    unsigned long expected_num_objs,
+					    unsigned int expected_order,
+					    unsigned long objsize)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", objsize,
+						     0, 0, NULL, NULL);
+	CuAssertIntEquals(ct, expected_num_objs, cache->slab_capacity);
+	CuAssertIntEquals(ct, expected_order, cache->cache_order);
+	kmem_cache_destroy(cache);
+	kmem_cache_release();
+}
+
+void test_slab_order_grows_with_object_size(CuTest *ct)
+{
+	assert_num_objs_and_cache_order(ct, 127, 0, 32);
+	assert_num_objs_and_cache_order(ct, 63, 0, 64);
+	assert_num_objs_and_cache_order(ct, 31, 0, 128);
+	assert_num_objs_and_cache_order(ct, 15, 0, 256);
+	assert_num_objs_and_cache_order(ct,  8, 0, 512);
+	assert_num_objs_and_cache_order(ct,  4, 0, 1024);
+	assert_num_objs_and_cache_order(ct,  2, 0, 2048);
+	assert_num_objs_and_cache_order(ct,  1, 0, 4096);
+	assert_num_objs_and_cache_order(ct,  1, 1, 8192);
+	assert_num_objs_and_cache_order(ct,  1, 2, 16384);
+	assert_num_objs_and_cache_order(ct,  1, 3, 32768);
+	assert_num_objs_and_cache_order(ct,  1, 11, (1<<MAX_ORDER)*PAGE_SIZE);
+}
+
+void test_find_best_order_for_worst_fitting_objects(CuTest *ct)
+{
+	assert_num_objs_and_cache_order(ct, 5, 0, 765);
+	assert_num_objs_and_cache_order(ct, 1, 1, PAGE_SIZE+1);
+	assert_num_objs_and_cache_order(ct, 7, 3, PAGE_SIZE+512);
+}
+
+void test_shrinking_cache_purges_magazines(CuTest *ct)
+{
+	kmem_cache_init();
+	struct kmem_cache *cache = kmem_cache_create("cache", PAGE_SIZE, 0, 0, NULL, NULL);
+	void *obj = kmem_cache_alloc(cache, GFP_KERNEL);
+	kmem_cache_free(cache, obj);
+	CuAssertIntEquals(ct, 0, cache->stats.reaped);
+	kmem_cache_shrink(cache);
+	CuAssertIntEquals(ct, 1, list_empty(&cache->full_slabs));
+	CuAssertIntEquals(ct, 1, cache->stats.reaped);
+	kmem_cache_destroy(cache);
+	kmem_cache_release();
+}
Index: 2.6/test/include/linux/gfp.h
===================================================================
--- /dev/null
+++ 2.6/test/include/linux/gfp.h
@@ -0,0 +1,60 @@
+#ifndef __LINUX_GFP_H
+#define __LINUX_GFP_H
+
+/*
+ * GFP bitmasks..
+ */
+/* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low two bits) */
+#define __GFP_DMA	((__force gfp_t)0x01u)
+#define __GFP_HIGHMEM	((__force gfp_t)0x02u)
+
+/*
+ * Action modifiers - doesn't change the zoning
+ *
+ * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
+ * _might_ fail.  This depends upon the particular VM implementation.
+ *
+ * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
+ * cannot handle allocation failures.
+ *
+ * __GFP_NORETRY: The VM implementation must not retry indefinitely.
+ */
+#define __GFP_WAIT	((__force gfp_t)0x10u)	/* Can wait and reschedule? */
+#define __GFP_HIGH	((__force gfp_t)0x20u)	/* Should access emergency pools? */
+#define __GFP_IO	((__force gfp_t)0x40u)	/* Can start physical IO? */
+#define __GFP_FS	((__force gfp_t)0x80u)	/* Can call down to low-level FS? */
+#define __GFP_COLD	((__force gfp_t)0x100u)	/* Cache-cold page required */
+#define __GFP_NOWARN	((__force gfp_t)0x200u)	/* Suppress page allocation failure warning */
+#define __GFP_REPEAT	((__force gfp_t)0x400u)	/* Retry the allocation.  Might fail */
+#define __GFP_NOFAIL	((__force gfp_t)0x800u)	/* Retry for ever.  Cannot fail */
+#define __GFP_NORETRY	((__force gfp_t)0x1000u)/* Do not retry.  Might fail */
+#define __GFP_NO_GROW	((__force gfp_t)0x2000u)/* Slab internal usage */
+#define __GFP_COMP	((__force gfp_t)0x4000u)/* Add compound page metadata */
+#define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
+#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
+#define __GFP_NORECLAIM  ((__force gfp_t)0x20000u) /* No zone reclaim during allocation */
+#define __GFP_HARDWALL   ((__force gfp_t)0x40000u) /* Enforce hardwall cpuset memory allocs */
+
+#define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
+#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
+
+/* if you forget to add the bitmask here kernel will crash, period */
+#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
+			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
+			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
+			__GFP_NOMEMALLOC|__GFP_NORECLAIM|__GFP_HARDWALL)
+
+#define GFP_ATOMIC	(__GFP_HIGH)
+#define GFP_NOIO	(__GFP_WAIT)
+#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
+#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
+#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
+#define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
+			 __GFP_HIGHMEM)
+
+/* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
+   platforms, used as appropriate on others */
+
+#define GFP_DMA		__GFP_DMA
+
+#endif
Index: 2.6/test/include/asm/processor.h
===================================================================
--- /dev/null
+++ 2.6/test/include/asm/processor.h
@@ -0,0 +1,4 @@
+#ifndef __LINUX_PROCESSOR_H
+#define __LINUX_PROCESSOR_H
+
+#endif
Index: 2.6/test/include/linux/compiler-gcc3.h
===================================================================
--- /dev/null
+++ 2.6/test/include/linux/compiler-gcc3.h
@@ -0,0 +1,30 @@
+/* Never include this file directly.  Include <linux/compiler.h> instead.  */
+
+/* These definitions are for GCC v3.x.  */
+#include <linux/compiler-gcc.h>
+
+#if __GNUC_MINOR__ >= 1
+# define inline		inline		__attribute__((always_inline))
+# define __inline__	__inline__	__attribute__((always_inline))
+# define __inline	__inline	__attribute__((always_inline))
+#endif
+
+#if __GNUC_MINOR__ > 0
+# define __deprecated		__attribute__((deprecated))
+#endif
+
+#if __GNUC_MINOR__ >= 3
+# define __attribute_used__	__attribute__((__used__))
+#else
+# define __attribute_used__	__attribute__((__unused__))
+#endif
+
+#define __attribute_const__	__attribute__((__const__))
+
+#if __GNUC_MINOR__ >= 1
+#define  noinline		__attribute__((noinline))
+#endif
+
+#if __GNUC_MINOR__ >= 4
+#define __must_check		__attribute__((warn_unused_result))
+#endif
+
Index: 2.6/test/include/asm/system.h
===================================================================
--- /dev/null
+++ 2.6/test/include/asm/system.h
@@ -0,0 +1,7 @@
+#ifndef __LINUX_SYSTEM_H
+#define __LINUX_SYSTEM_H
+
+#define smp_wmb() do { } while (0)
+#define cmpxchg(ptr,o,n)
+
+#endif
Index: 2.6/test/include/asm/bug.h
===================================================================
--- /dev/null
+++ 2.6/test/include/asm/bug.h
@@ -0,0 +1,13 @@
+#ifndef _I386_BUG_H
+#define _I386_BUG_H
+
+#include <linux/config.h>
+#include <assert.h>
+
+#define HAVE_ARCH_BUG
+#define BUG() assert(!"bug")
+#define HAVE_ARCH_BUG_ON
+#define BUG_ON(cond) assert(!(cond))
+
+#include <asm-generic/bug.h>
+#endif
Index: 2.6/test/include/linux/mm.h
===================================================================
--- /dev/null
+++ 2.6/test/include/linux/mm.h
@@ -0,0 +1,41 @@
+#ifndef __MM_H
+#define __MM_H
+
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/list.h>
+#include <linux/mmzone.h>
+#include <linux/errno.h>
+#include <linux/sched.h>
+#include <asm/pgtable.h>
+
+struct page {
+	unsigned long flags;
+	void *virtual;
+	struct list_head lru;
+	struct list_head memory_map;
+	unsigned int order;
+};
+
+#define high_memory (~0UL)
+
+#define PageSlab(page) (page->flags & 0x01)
+#define SetPageSlab(page) do { page->flags |= 0x01; } while (0)
+#define ClearPageSlab(page) do { page->flags &= ~0x01; } while (0)
+
+#define add_page_state(member,delta)
+#define sub_page_state(member,delta)
+
+static inline int TestClearPageSlab(struct page *page)
+{
+	int ret = PageSlab(page);
+	ClearPageSlab(page);
+	return ret;
+}
+
+#define page_address(page) (page->virtual)
+
+extern struct page *alloc_pages(gfp_t, unsigned int);
+extern void free_pages(unsigned long, unsigned int);
+
+#endif
Index: 2.6/include/linux/kmem.h
===================================================================
--- /dev/null
+++ 2.6/include/linux/kmem.h
@@ -0,0 +1,242 @@
+/*
+ * include/linux/kmem.h - An object-caching memory allocator.
+ *
+ * Copyright (C) 2005 Pekka Enberg
+ *
+ * This file is released under the GPLv2.
+ */
+
+#ifndef __LINUX_KMEM_H
+#define __LINUX_KMEM_H
+
+#include <linux/config.h>
+#include <linux/kernel.h>
+#include <linux/gfp.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+#include <asm/cache.h>
+#include <asm/page.h>
+
+/*
+ *	Object-Caching Allocator
+ */
+
+struct kmem_bufctl {
+	void *addr;
+	void *next;
+};
+
+/**
+ * struct kmem_slab - contiguous memory carved up into equal-sized chunks.
+ *
+ * @list: List head used by object cache slab lists.
+ * @mem: Pointer to the beginning of a contiguous memory block.
+ * @nr_available: Number of available objects.
+ * @free: A pointer to bufctl of next free object.
+ *
+ * A slab consists of one or more pages of contiguous memory carved up into
+ * equal-sized chunks.
+ */
+struct kmem_slab {
+	struct list_head list;
+	void *mem;
+	size_t nr_available;
+	struct kmem_bufctl *free;
+};
+
+enum { MAX_ROUNDS = 10 };
+
+/**
+ * struct kmem_magazine - a stack of objects.
+ *
+ * @rounds: Number of objects available for allocation.
+ * @objs: Objects in this magazine.
+ * @list: List head used by object cache depot magazine lists.
+ *
+ * A magazine contains a stack of objects. It is used as a per-CPU data
+ * structure that can satisfy allocations without the need for a global
+ * lock.
+ */
+struct kmem_magazine {
+	size_t rounds;
+	void *objs[MAX_ROUNDS];
+	struct list_head list;
+};
+
+struct kmem_cpu_cache {
+	spinlock_t lock;
+	struct kmem_magazine *loaded;
+	struct kmem_magazine *prev;
+};
+
+struct kmem_cache_statistics {
+	unsigned long grown;
+	unsigned long reaped;
+};
+
+struct kmem_cache;
+
+typedef void (*kmem_ctor_fn)(void *, struct kmem_cache *, unsigned long);
+typedef void (*kmem_dtor_fn)(void *, struct kmem_cache *, unsigned long);
+
+/**
+ * An object cache for equal-sized objects. A cache consists of per-CPU
+ * magazines, a depot, and a list of slabs.
+ *
+ * @lists_lock: A lock protecting the slab and magazine lists below.
+ * @full_slabs: List of slabs whose buffers are all free.
+ * @partial_slabs: List of slabs that contain some free buffers.
+ * @empty_slabs: List of slabs that do not contain any free buffers.
+ * @full_magazines: List of magazines that contain objects.
+ * @empty_magazines: List of magazines that do not contain any objects.
+ */
+struct kmem_cache {
+	struct kmem_cpu_cache cpu_cache[NR_CPUS];
+	size_t objsize;
+	gfp_t gfp_flags;
+	unsigned int slab_capacity;
+	unsigned int cache_order;
+	spinlock_t lists_lock;
+	struct list_head full_slabs;
+	struct list_head partial_slabs;
+	struct list_head empty_slabs;
+	struct list_head full_magazines;
+	struct list_head empty_magazines;
+	struct kmem_cache_statistics stats;
+	kmem_ctor_fn ctor;
+	kmem_dtor_fn dtor;
+	const char *name;
+	struct list_head next;
+	unsigned long active_objects;
+	unsigned long free_objects;
+};
+
+typedef struct kmem_cache kmem_cache_t;
+
+extern void kmem_cache_init(void);
+extern void kmem_cache_release(void);
+extern struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
+					    unsigned long, kmem_ctor_fn,
+					    kmem_dtor_fn);
+extern int kmem_cache_destroy(struct kmem_cache *);
+extern int kmem_cache_shrink(struct kmem_cache *);
+extern const char *kmem_cache_name(struct kmem_cache *cache);
+extern void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
+extern void *kmem_cache_alloc_node(kmem_cache_t *, gfp_t, int);
+extern void kmem_cache_free(struct kmem_cache *, void *);
+
+/* Flags passed to kmem_cache_alloc().  */
+#define	SLAB_NOFS	GFP_NOFS
+#define	SLAB_NOIO	GFP_NOIO
+#define	SLAB_ATOMIC	GFP_ATOMIC
+#define	SLAB_USER	GFP_USER
+#define	SLAB_KERNEL	GFP_KERNEL
+#define	SLAB_DMA	GFP_DMA
+
+/* Flags passed to kmem_cache_create(). The first three are only valid when
+ * the allocator has been built with SLAB_DEBUG_SUPPORT.
+ */
+#define	SLAB_DEBUG_FREE		0x00000100UL	/* Perform (expensive) checks on free */
+#define	SLAB_DEBUG_INITIAL	0x00000200UL	/* Call constructor (as verifier) */
+#define	SLAB_RED_ZONE		0x00000400UL	/* Red zone objs in a cache */
+#define	SLAB_POISON		0x00000800UL	/* Poison objects */
+#define	SLAB_NO_REAP		0x00001000UL	/* never reap from the cache */
+#define	SLAB_HWCACHE_ALIGN	0x00002000UL	/* align objs on h/w cache lines */
+#define SLAB_CACHE_DMA		0x00004000UL	/* use GFP_DMA memory */
+#define SLAB_MUST_HWCACHE_ALIGN	0x00008000UL	/* force alignment */
+#define SLAB_STORE_USER		0x00010000UL	/* store the last owner for bug hunting */
+#define SLAB_RECLAIM_ACCOUNT	0x00020000UL	/* track pages allocated to indicate
+						   what is reclaimable later*/
+#define SLAB_PANIC		0x00040000UL	/* panic if kmem_cache_create() fails */
+#define SLAB_DESTROY_BY_RCU	0x00080000UL	/* defer freeing pages to RCU */
+
+/* Flags passed to a constructor function.  */
+#define	SLAB_CTOR_CONSTRUCTOR	0x001UL		/* if not set, then destructor */
+#define SLAB_CTOR_ATOMIC	0x002UL		/* tell constructor it can't sleep */
+#define	SLAB_CTOR_VERIFY	0x004UL		/* tell constructor it's a verify call */
+
+extern int FASTCALL(kmem_ptr_validate(struct kmem_cache *cachep, void *ptr));
+
+
+/*
+ *	General purpose allocator
+ */
+
+extern void kmalloc_init(void);
+
+struct cache_sizes {
+	size_t cs_size;
+	struct kmem_cache *cs_cache, *cs_dma_cache;
+};
+
+extern struct cache_sizes malloc_sizes[];
+
+extern void *kmalloc_node(size_t size, gfp_t flags, int node);
+extern void *__kmalloc(size_t, gfp_t);
+
+static inline void *kmalloc(size_t size, gfp_t flags)
+{
+	if (__builtin_constant_p(size)) {
+		int i = 0;
+#define CACHE(x) \
+		if (size <= x) \
+			goto found; \
+		else \
+			i++;
+#include <linux/kmalloc_sizes.h>
+#undef CACHE
+		{
+			extern void __you_cannot_kmalloc_that_much(void);
+			__you_cannot_kmalloc_that_much();
+		}
+found:
+		return kmem_cache_alloc((flags & GFP_DMA) ?
+			malloc_sizes[i].cs_dma_cache :
+			malloc_sizes[i].cs_cache, flags);
+	}
+	return __kmalloc(size, flags);
+}
+
+extern void *kzalloc(size_t, gfp_t);
+
+/**
+ * kcalloc - allocate memory for an array. The memory is set to zero.
+ * @n: number of elements.
+ * @size: element size.
+ * @flags: the type of memory to allocate.
+ */
+static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
+{
+	if (n != 0 && size > INT_MAX / n)
+		return NULL;
+	return kzalloc(n * size, flags);
+}
+
+extern void kfree(const void *);
+extern unsigned int ksize(const void *);
+
+
+/*
+ *	System wide caches
+ */
+
+extern struct kmem_cache *vm_area_cachep;
+extern struct kmem_cache *names_cachep;
+extern struct kmem_cache *files_cachep;
+extern struct kmem_cache *filp_cachep;
+extern struct kmem_cache *fs_cachep;
+extern struct kmem_cache *signal_cachep;
+extern struct kmem_cache *sighand_cachep;
+extern struct kmem_cache *bio_cachep;
+
+
+/*
+ * 	???
+ */
+
+extern atomic_t slab_reclaim_pages;
+
+#endif
Index: 2.6/mm/Makefile
===================================================================
--- 2.6.orig/mm/Makefile
+++ 2.6/mm/Makefile
@@ -9,7 +9,7 @@ mmu-$(CONFIG_MMU)	:= fremap.o highmem.o 
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   page_alloc.o page-writeback.o pdflush.o \
-			   readahead.o slab.o swap.o truncate.o vmscan.o \
+			   readahead.o kmem.o kmalloc.o swap.o truncate.o vmscan.o \
 			   prio_tree.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o thrash.o
Index: 2.6/test/include/stdlib.h
===================================================================
--- /dev/null
+++ 2.6/test/include/stdlib.h
@@ -0,0 +1,11 @@
+#ifndef __STDLIB_H
+#define __STDLIB_H
+
+#include <stddef.h>
+
+extern void *malloc(size_t);
+extern void *calloc(size_t, size_t);
+extern void free(void *);
+extern void *realloc(void *, size_t);
+
+#endif
Index: 2.6/test/mm/page_alloc.c
===================================================================
--- /dev/null
+++ 2.6/test/mm/page_alloc.c
@@ -0,0 +1,44 @@
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/string.h>
+#include <linux/list.h>
+
+#include <asm/page.h>
+
+#include <stdlib.h>
+
+static LIST_HEAD(pages);
+
+struct page *__virt_to_page(unsigned long addr)
+{
+	struct page *page;
+
+	list_for_each_entry(page, &pages, memory_map) {
+		unsigned long virtual = (unsigned long) page->virtual;
+
+		if (virtual <= addr && addr < virtual+(1<<page->order)*PAGE_SIZE)
+			return page;
+	}
+	return NULL;
+}
+
+struct page *alloc_pages(gfp_t flags, unsigned int order)
+{
+	unsigned long nr_pages = 1<<order;
+	struct page *page = malloc(sizeof(*page));
+	memset(page, 0, sizeof(*page));
+	page->order = order;
+	page->virtual = calloc(nr_pages, PAGE_SIZE);
+	INIT_LIST_HEAD(&page->memory_map);
+	list_add(&page->memory_map, &pages);
+	return page;
+}
+
+void free_pages(unsigned long addr, unsigned int order)
+{
+	struct page *page = virt_to_page(addr);
+	free(page->virtual);
+	free(page);
+}
+
+
Index: 2.6/include/linux/slab.h
===================================================================
--- 2.6.orig/include/linux/slab.h
+++ 2.6/include/linux/slab.h
@@ -1,151 +1,6 @@
-/*
- * linux/mm/slab.h
- * Written by Mark Hemment, 1996.
- * (markhe@nextd.demon.co.uk)
- */
-
 #ifndef _LINUX_SLAB_H
 #define	_LINUX_SLAB_H
 
-#if	defined(__KERNEL__)
-
-typedef struct kmem_cache kmem_cache_t;
-
-#include	<linux/config.h>	/* kmalloc_sizes.h needs CONFIG_ options */
-#include	<linux/gfp.h>
-#include	<linux/init.h>
-#include	<linux/types.h>
-#include	<asm/page.h>		/* kmalloc_sizes.h needs PAGE_SIZE */
-#include	<asm/cache.h>		/* kmalloc_sizes.h needs L1_CACHE_BYTES */
-
-/* flags for kmem_cache_alloc() */
-#define	SLAB_NOFS		GFP_NOFS
-#define	SLAB_NOIO		GFP_NOIO
-#define	SLAB_ATOMIC		GFP_ATOMIC
-#define	SLAB_USER		GFP_USER
-#define	SLAB_KERNEL		GFP_KERNEL
-#define	SLAB_DMA		GFP_DMA
-
-#define SLAB_LEVEL_MASK		GFP_LEVEL_MASK
-
-#define	SLAB_NO_GROW		__GFP_NO_GROW	/* don't grow a cache */
-
-/* flags to pass to kmem_cache_create().
- * The first 3 are only valid when the allocator as been build
- * SLAB_DEBUG_SUPPORT.
- */
-#define	SLAB_DEBUG_FREE		0x00000100UL	/* Peform (expensive) checks on free */
-#define	SLAB_DEBUG_INITIAL	0x00000200UL	/* Call constructor (as verifier) */
-#define	SLAB_RED_ZONE		0x00000400UL	/* Red zone objs in a cache */
-#define	SLAB_POISON		0x00000800UL	/* Poison objects */
-#define	SLAB_NO_REAP		0x00001000UL	/* never reap from the cache */
-#define	SLAB_HWCACHE_ALIGN	0x00002000UL	/* align objs on a h/w cache lines */
-#define SLAB_CACHE_DMA		0x00004000UL	/* use GFP_DMA memory */
-#define SLAB_MUST_HWCACHE_ALIGN	0x00008000UL	/* force alignment */
-#define SLAB_STORE_USER		0x00010000UL	/* store the last owner for bug hunting */
-#define SLAB_RECLAIM_ACCOUNT	0x00020000UL	/* track pages allocated to indicate
-						   what is reclaimable later*/
-#define SLAB_PANIC		0x00040000UL	/* panic if kmem_cache_create() fails */
-#define SLAB_DESTROY_BY_RCU	0x00080000UL	/* defer freeing pages to RCU */
-
-/* flags passed to a constructor func */
-#define	SLAB_CTOR_CONSTRUCTOR	0x001UL		/* if not set, then deconstructor */
-#define SLAB_CTOR_ATOMIC	0x002UL		/* tell constructor it can't sleep */
-#define	SLAB_CTOR_VERIFY	0x004UL		/* tell constructor it's a verify call */
-
-/* prototypes */
-extern void __init kmem_cache_init(void);
-
-extern kmem_cache_t *kmem_cache_create(const char *, size_t, size_t, unsigned long,
-				       void (*)(void *, kmem_cache_t *, unsigned long),
-				       void (*)(void *, kmem_cache_t *, unsigned long));
-extern int kmem_cache_destroy(kmem_cache_t *);
-extern int kmem_cache_shrink(kmem_cache_t *);
-extern void *kmem_cache_alloc(kmem_cache_t *, gfp_t);
-extern void kmem_cache_free(kmem_cache_t *, void *);
-extern unsigned int kmem_cache_size(kmem_cache_t *);
-extern const char *kmem_cache_name(kmem_cache_t *);
-extern kmem_cache_t *kmem_find_general_cachep(size_t size, gfp_t gfpflags);
-
-/* Size description struct for general caches. */
-struct cache_sizes {
-	size_t		 cs_size;
-	kmem_cache_t	*cs_cachep;
-	kmem_cache_t	*cs_dmacachep;
-};
-extern struct cache_sizes malloc_sizes[];
-extern void *__kmalloc(size_t, gfp_t);
-
-static inline void *kmalloc(size_t size, gfp_t flags)
-{
-	if (__builtin_constant_p(size)) {
-		int i = 0;
-#define CACHE(x) \
-		if (size <= x) \
-			goto found; \
-		else \
-			i++;
-#include "kmalloc_sizes.h"
-#undef CACHE
-		{
-			extern void __you_cannot_kmalloc_that_much(void);
-			__you_cannot_kmalloc_that_much();
-		}
-found:
-		return kmem_cache_alloc((flags & GFP_DMA) ?
-			malloc_sizes[i].cs_dmacachep :
-			malloc_sizes[i].cs_cachep, flags);
-	}
-	return __kmalloc(size, flags);
-}
-
-extern void *kzalloc(size_t, gfp_t);
-
-/**
- * kcalloc - allocate memory for an array. The memory is set to zero.
- * @n: number of elements.
- * @size: element size.
- * @flags: the type of memory to allocate.
- */
-static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
-{
-	if (n != 0 && size > INT_MAX / n)
-		return NULL;
-	return kzalloc(n * size, flags);
-}
-
-extern void kfree(const void *);
-extern unsigned int ksize(const void *);
-
-#ifdef CONFIG_NUMA
-extern void *kmem_cache_alloc_node(kmem_cache_t *, gfp_t flags, int node);
-extern void *kmalloc_node(size_t size, gfp_t flags, int node);
-#else
-static inline void *kmem_cache_alloc_node(kmem_cache_t *cachep, gfp_t flags, int node)
-{
-	return kmem_cache_alloc(cachep, flags);
-}
-static inline void *kmalloc_node(size_t size, gfp_t flags, int node)
-{
-	return kmalloc(size, flags);
-}
-#endif
-
-extern int FASTCALL(kmem_cache_reap(int));
-extern int FASTCALL(kmem_ptr_validate(kmem_cache_t *cachep, void *ptr));
-
-/* System wide caches */
-extern kmem_cache_t	*vm_area_cachep;
-extern kmem_cache_t	*names_cachep;
-extern kmem_cache_t	*files_cachep;
-extern kmem_cache_t	*filp_cachep;
-extern kmem_cache_t	*fs_cachep;
-extern kmem_cache_t	*signal_cachep;
-extern kmem_cache_t	*sighand_cachep;
-extern kmem_cache_t	*bio_cachep;
-
-extern atomic_t slab_reclaim_pages;
-
-#endif	/* __KERNEL__ */
+#include <linux/kmem.h>
 
 #endif	/* _LINUX_SLAB_H */
Index: 2.6/test/include/asm/page.h
===================================================================
--- /dev/null
+++ 2.6/test/include/asm/page.h
@@ -0,0 +1,15 @@
+#ifndef __LINUX_PAGE_H
+#define __LINUX_PAGE_H
+
+#include <linux/mm.h>
+
+#define PAGE_OFFSET 0
+#define PAGE_SHIFT 12
+#define PAGE_SIZE 4096
+#define PAGE_MASK    (~(PAGE_SIZE-1))
+
+#define virt_to_page(addr) __virt_to_page((unsigned long) addr)
+
+extern struct page *__virt_to_page(unsigned long);
+
+#endif
Index: 2.6/test/include/linux/spinlock.h
===================================================================
--- /dev/null
+++ 2.6/test/include/linux/spinlock.h
@@ -0,0 +1,14 @@
+#ifndef __LINUX_SPINLOCK_H
+#define __LINUX_SPINLOCK_H
+
+#include <asm/atomic.h>
+
+typedef int spinlock_t;
+
+#define spin_lock_init(x)
+#define spin_lock_irqsave(x, y) (y = 1)
+#define spin_unlock_irqrestore(x, y) (y = 0)
+#define spin_lock(x)
+#define spin_unlock(x)
+
+#endif
Index: 2.6/test/include/linux/mmzone.h
===================================================================
--- /dev/null
+++ 2.6/test/include/linux/mmzone.h
@@ -0,0 +1,8 @@
+#ifndef __LINUX_MMZONE_H
+#define __LINUX_MMZONE_H
+
+#include <linux/threads.h>
+
+#define MAX_ORDER 11
+
+#endif
Index: 2.6/test/include/linux/threads.h
===================================================================
--- /dev/null
+++ 2.6/test/include/linux/threads.h
@@ -0,0 +1,6 @@
+#ifndef __LINUX_THREADS_H
+#define __LINUX_THREADS_H
+
+#define NR_CPUS 1
+
+#endif
Index: 2.6/test/include/linux/module.h
===================================================================
--- /dev/null
+++ 2.6/test/include/linux/module.h
@@ -0,0 +1,7 @@
+#ifndef __LINUX_MODULE_H
+#define __LINUX_MODULE_H
+
+#define EXPORT_SYMBOL(x)
+#define EXPORT_SYMBOL_GPL(x)
+
+#endif
Index: 2.6/test/kernel/panic.c
===================================================================
--- /dev/null
+++ 2.6/test/kernel/panic.c
@@ -0,0 +1,6 @@
+extern void abort(void);
+
+void panic(const char * fmt, ...)
+{
+	abort();
+}
Index: 2.6/test/include/asm/pgtable.h
===================================================================
--- /dev/null
+++ 2.6/test/include/asm/pgtable.h
@@ -0,0 +1,6 @@
+#ifndef __ASM_PGTABLE_H
+#define __ASM_PGTABLE_H
+
+#define kern_addr_valid(addr)    (1)
+
+#endif
Index: 2.6/test/include/asm/semaphore.h
===================================================================
--- /dev/null
+++ 2.6/test/include/asm/semaphore.h
@@ -0,0 +1,24 @@
+#ifndef __ASM_SEMAPHORE_H
+#define __ASM_SEMAPHORE_H
+
+struct semaphore {
+};
+
+static inline void init_MUTEX(struct semaphore *sem)
+{
+}
+
+static inline void up(struct semaphore *sem)
+{
+}
+
+static inline void down(struct semaphore *sem)
+{
+}
+
+static inline int down_trylock(struct semaphore *sem)
+{
+	return 1;
+}
+
+#endif
Index: 2.6/test/include/asm/uaccess.h
===================================================================
--- /dev/null
+++ 2.6/test/include/asm/uaccess.h
@@ -0,0 +1,4 @@
+#ifndef __ASM_UACCESS_H
+#define __ASM_UACCESS_H
+
+#endif
Index: 2.6/test/include/linux/config.h
===================================================================
--- /dev/null
+++ 2.6/test/include/linux/config.h
@@ -0,0 +1,8 @@
+#ifndef __LINUX_CONFIG_H
+#define __LINUX_CONFIG_H
+
+#include <linux/autoconf.h>
+
+#undef CONFIG_PROC_FS
+
+#endif
Index: 2.6/test/include/linux/seq_file.h
===================================================================
--- /dev/null
+++ 2.6/test/include/linux/seq_file.h
@@ -0,0 +1,4 @@
+#ifndef __LINUX_SEQFILE_H
+#define __LINUX_SEQFILE_H
+
+#endif
Index: 2.6/test/include/asm/param.h
===================================================================
--- /dev/null
+++ 2.6/test/include/asm/param.h
@@ -0,0 +1,6 @@
+#ifndef __ASM_PARAM_H
+#define __ASM_PARAM_H
+
+#define HZ 100
+
+#endif
Index: 2.6/test/include/asm/percpu.h
===================================================================
--- /dev/null
+++ 2.6/test/include/asm/percpu.h
@@ -0,0 +1,6 @@
+#ifndef __ARCH_I386_PERCPU__
+#define __ARCH_I386_PERCPU__
+
+#include <asm-generic/percpu.h>
+
+#endif /* __ARCH_I386_PERCPU__ */
Index: 2.6/test/include/linux/sched.h
===================================================================
--- /dev/null
+++ 2.6/test/include/linux/sched.h
@@ -0,0 +1,7 @@
+#ifndef __LINUX_SCHED_H
+#define __LINUX_SCHED_H
+
+#include <linux/cpumask.h>
+#include <asm/param.h>
+
+#endif
Index: 2.6/test/kernel/timer.c
===================================================================
--- /dev/null
+++ 2.6/test/kernel/timer.c
@@ -0,0 +1,5 @@
+#include <linux/timer.h>
+
+void fastcall init_timer(struct timer_list *timer)
+{
+}
Index: 2.6/test/kernel/workqueue.c
===================================================================
--- /dev/null
+++ 2.6/test/kernel/workqueue.c
@@ -0,0 +1,17 @@
+#include <linux/workqueue.h>
+
+int keventd_up(void)
+{
+	return 1;
+}
+
+int fastcall schedule_delayed_work(struct work_struct *work, unsigned long delay)
+{
+	return 1;
+}
+
+int schedule_delayed_work_on(int cpu,
+			struct work_struct *work, unsigned long delay)
+{
+	return 1;
+}
Index: 2.6/test/include/asm/thread_info.h
===================================================================
--- /dev/null
+++ 2.6/test/include/asm/thread_info.h
@@ -0,0 +1,13 @@
+#ifndef __ASM_THREADINFO_H
+#define __ASM_THREADINFO_H
+
+#include <linux/config.h>
+#include <linux/compiler.h>
+#include <asm/page.h>
+#include <asm/processor.h>
+
+struct thread_info {
+	unsigned long flags;
+};
+
+#endif
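The kmalloc() inline in kmem.h above relies on the preprocessor expanding
one CACHE(x) test per size class, so that for a constant size the compiler
folds the whole chain into a single cache lookup. The equivalent table
walk can be sketched at run time like this (a hedged illustration with
made-up sizes standing in for the kmalloc_sizes.h table; kmalloc_index()
is a hypothetical name, not an API from the patch):

```c
#include <assert.h>
#include <stddef.h>

/* A cut-down stand-in for malloc_sizes[]: each general-purpose cache
 * serves all requests up to its size in bytes.  Sizes are made up. */
static const size_t cache_sizes[] = { 32, 64, 128, 256, 512, 1024 };

/* Walk the table the same way the unrolled CACHE(x) chain does and
 * return the index of the first cache large enough, or -1 if none is. */
static int kmalloc_index(size_t size)
{
	size_t i;

	for (i = 0; i < sizeof(cache_sizes) / sizeof(cache_sizes[0]); i++)
		if (size <= cache_sizes[i])
			return (int) i;
	return -1;	/* analogue of __you_cannot_kmalloc_that_much() */
}
```

For a compile-time constant size the kernel version never executes this
loop; the if/else chain collapses to a direct malloc_sizes[i] reference.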




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 20:15                   ` Pekka Enberg
@ 2005-12-20 21:42                     ` Steven Rostedt
  2005-12-20 21:52                       ` Christoph Lameter
  2005-12-21  6:56                       ` Ingo Molnar
  0 siblings, 2 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 21:42 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Alok N Kataria, Shobhit Dayal, Shai Fultheim,
	Matt Mackall, Ingo Molnar, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel

On Tue, 2005-12-20 at 22:15 +0200, Pekka Enberg wrote:
> Hi Steve and Matt,
> 
> On 12/20/05, Steven Rostedt <rostedt@goodmis.org> wrote:
> > That looks like quite an undertaking, but may be well worth it.  I think
> > Linux's memory management is starting to show its age.  It's been
> > through a few transformations, and maybe it's time to go through
> > another.  The work being done by the NUMA folks, should be taking into
> > account, and maybe we can come up with a way that can make things easier
> > and less complex without losing performance.
> 
> The slab allocator is indeed complex, messy, and hard to understand.
> In case you're interested, I have included a replacement I started out
> a while a go. It follows the design of a magazine allocator described
> by Bonwick. It's not a complete replacement but should boot (well, did
> anyway at some point). I have also included a user space test harness
> I am using to smoke-test it.
> 
> If there's enough interest, I would be more than glad to help write a
> replacement for mm/slab.c :-)

Hi Pekka,

What other interest have you pulled up on this?  I mean, have others
shown interest in pushing something like this?  Today's slab system is
starting to become like the IDE code, where nobody but a select few
sado-masochists dares to venture in. (I've CC'd them ;)  Perhaps it would
make the addition of NUMA easier.

Maybe putting this into RT would be a way to get it tested, and would
help us with memory management in a fully preemptible kernel.

-- Steve

For those just coming in, Pekka posted this:

http://marc.theaimsgroup.com/?l=linux-kernel&m=113510997009883&w=2



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 21:42                     ` Steven Rostedt
@ 2005-12-20 21:52                       ` Christoph Lameter
  2005-12-20 22:11                         ` Steven Rostedt
  2005-12-21  6:56                       ` Ingo Molnar
  1 sibling, 1 reply; 56+ messages in thread
From: Christoph Lameter @ 2005-12-20 21:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Pekka Enberg, Alok N Kataria, Shobhit Dayal, Shai Fultheim,
	Matt Mackall, Ingo Molnar, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel

On Tue, 20 Dec 2005, Steven Rostedt wrote:

> What other interest have you pulled up on this?  I mean, have others
> shown interest in pushing something like this?  Today's slab system is
> starting to become like the IDE code, where nobody but a select few
> sado-masochists dares to venture in. (I've CC'd them ;)  Perhaps it would
> make the addition of NUMA easier.

Hmm. The basics of the SLAB allocator are rather simple. 

I'd be interested in seeing an alternate approach. There is the danger
that you will end up with the same complexity as before.

> http://marc.theaimsgroup.com/?l=linux-kernel&m=113510997009883&w=2

Quite a long list of unsupported features. These academic papers
usually only focus on one thing. The SLAB allocator has to work
for a variety of situations though.

It would help to explain what ultimately will be better in the new slab 
allocator. The complexity could be taken care of by reorganizing the code.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 21:52                       ` Christoph Lameter
@ 2005-12-20 22:11                         ` Steven Rostedt
  2005-12-21  6:36                           ` Ingo Molnar
  0 siblings, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2005-12-20 22:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Alok N Kataria, Shobhit Dayal, Shai Fultheim,
	Matt Mackall, Ingo Molnar, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel

On Tue, 2005-12-20 at 13:52 -0800, Christoph Lameter wrote:
> On Tue, 20 Dec 2005, Steven Rostedt wrote:
> 
> > What other interest have you pulled up on this?  I mean, have others
> > shown interest in pushing something like this?  Today's slab system is
> > starting to become like the IDE code, where nobody but a select few
> > sado-masochists dares to venture in. (I've CC'd them ;)  Perhaps it would
> > make the addition of NUMA easier.
> 
> Hmm. The basics of the SLAB allocator are rather simple. 
> 
> I'd be interested in seeing an alternate approach. There is the danger
> that you will end up with the same complexity as before.

True.  I understand the basics of the SLAB allocator very well, but I
stumble over the slab.c code quite a bit.  This topic came up when Ingo
replaced slab with slob in the rt patch and it killed the performance.
I introduced a cross between the slab and the slob that sped the
system up to nearly the performance of the current slab.

Matt Mackall needs memory management that uses the smallest amount of
memory, to handle embedded systems, and brought up the approach
described in the paper by Bonwick.

> 
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=113510997009883&w=2
> 
> Quite a long list of unsupported features. These academic papers
> usually only focus on one thing. The SLAB allocator has to work
> for a variety of situations though.
> 
> It would help to explain what ultimately will be better in the new slab 
> allocator. The complexity could be taken care of by reorganizing the code.
> 

Honestly, what I would like is a simpler solution, whether we go with a
new approach or reorganize the current slab.  I need to get -rt working,
and the code in slab is stretching my resources further than they can
go. I'm capable of converting slab as it is today for RT, but it will
probably take longer than I can afford.

Yes, if we go with a new system, it would not have all the features that
the slab has today, but I can live with that, and if I'm involved in the
work as it grows, I'll understand it better.  The problem is, I wasn't
involved in the evolution of slab, and I have to deal with what it grew
into, without being there to see why it does what it does today.

-- Steve



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 19:15                 ` Steven Rostedt
  2005-12-20 19:43                   ` Matt Mackall
  2005-12-20 20:15                   ` Pekka Enberg
@ 2005-12-21  2:30                   ` Nick Piggin
  2005-12-21  2:41                     ` Steven Rostedt
  2 siblings, 1 reply; 56+ messages in thread
From: Nick Piggin @ 2005-12-21  2:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Matt Mackall, Ingo Molnar, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel

Steven Rostedt wrote:

> That looks like quite an undertaking, but may be well worth it.  I think
> Linux's memory management is starting to show its age.  It's been

What do you mean by this? ie. what parts of it are a problem, and why?

I think that replacing the buddy allocator probably wouldn't be a good
idea because it is really fast and simple for page sized allocations which
are the most common, and it is good at naturally avoiding external
fragmentation. Internal fragmentation is not much of a problem because it
is handled by slab.
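
[The speed and anti-fragmentation properties mentioned here fall out of
the buddy system's address arithmetic. A hedged sketch (hypothetical
helpers illustrating the idea, not the kernel's page_alloc.c): the buddy
of a block is found with a single XOR, which is why splitting on
allocation and coalescing on free are both cheap:

```c
#include <assert.h>

/* The buddy of the order-'order' block starting at page frame number
 * 'pfn' differs from it only in bit 'order', so the merge partner is a
 * single XOR away. */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

/* Two order-'order' blocks can coalesce into one order+1 block iff
 * they are each other's buddy. */
static int can_merge(unsigned long a, unsigned long b, unsigned int order)
{
	return buddy_pfn(a, order) == b;
}
```

Because freed blocks always merge back with their unique buddy, free
memory tends to re-form large contiguous runs instead of fragmenting.
-Ed.]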

I can't see how replacing the buddy allocator with a completely agnostic
range allocator could be a win at all.

Perhaps it would make more sense for bootmem, resources, vmalloc, etc., and
I guess that is what Matt is suggesting.

-- 
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  2:30                   ` Nick Piggin
@ 2005-12-21  2:41                     ` Steven Rostedt
  0 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-21  2:41 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Matt Mackall, Ingo Molnar, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel


On Wed, 21 Dec 2005, Nick Piggin wrote:

> Steven Rostedt wrote:
>
> > That looks like quite an undertaking, but may be well worth it.  I think
> > Linux's memory management is starting to show its age.  It's been
>
> What do you mean by this? ie. what parts of it are a problem, and why?
>
> I think that replacing the buddy allocator probably wouldn't be a good
> idea because it is really fast and simple for page sized allocations which
> are the most common, and it is good at naturally avoiding external
> fragmentation. Internal fragmentation is not much of a problem because it
> is handled by slab.

Actually, I wasn't talking about the buddy allocator, since it is probably
the best backend allocator to have.  I actually like it a lot and it
doesn't seem to have a problem.

But the slab code has gotten more complex, and is probably too feature-full.
And I'm afraid that Christoph Lameter may be right, in that we could go to
another allocation scheme and after adding all the features that the slab
has, we would be just as complex.


>
> I can't see how replacing the buddy allocator with a completely agnostic
> range allocator could be a win at all.

That part I didn't agree with (replacing the buddy system I mean).

>
> Perhaps it would make more sense for bootmem, resources, vmalloc, etc. and
> I guess that is what Matt is suggesting.

I'd still add slab there, but as I said above, anything else may become
too complex.  That said, playing with this magazine thingy is starting to
look interesting!

-- Steve


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 22:11                         ` Steven Rostedt
@ 2005-12-21  6:36                           ` Ingo Molnar
  2005-12-21 12:50                             ` Steven Rostedt
  0 siblings, 1 reply; 56+ messages in thread
From: Ingo Molnar @ 2005-12-21  6:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Christoph Lameter, Pekka Enberg, Alok N Kataria, Shobhit Dayal,
	Shai Fultheim, Matt Mackall, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel


* Steven Rostedt <rostedt@goodmis.org> wrote:

> > > http://marc.theaimsgroup.com/?l=linux-kernel&m=113510997009883&w=2
> > 
> > Quite a long list of unsupported features. These academic papers
> > usually only focus on one thing. The SLAB allocator has to work
> > for a variety of situations though.
> > 
> > It would help to explain what ultimately will be better in the new slab 
> > allocator. The complexity could be taken care of by reorganizing the code.
> 
> Honestly, what I would like is a simpler solution, whether we go with
> a new approach or reorganize the current slab.  I need to get -rt
> working, and the code in slab is stretching my resources further than
> they can go. I'm capable of converting slab as it is today for RT,
> but it will probably take longer than I can afford.

please, lets let the -rt tree out of the equation. The SLAB code is fine 
on upstream, and it was a purely practical maintenance decision to go for 
SLOB in the -rt tree. Yes, the SLAB code is complex, but i could hardly 
list any complexity in it that isnt justified with a performance 
argument. _Maybe_ the SLAB code could be further cleaned up, maybe it 
could even be replaced, but we'd have to see the patches first. In any 
case, the -rt tree is not an argument that matters.

	Ingo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-20 21:42                     ` Steven Rostedt
  2005-12-20 21:52                       ` Christoph Lameter
@ 2005-12-21  6:56                       ` Ingo Molnar
  2005-12-21  7:16                         ` Pekka J Enberg
                                           ` (2 more replies)
  1 sibling, 3 replies; 56+ messages in thread
From: Ingo Molnar @ 2005-12-21  6:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Pekka Enberg, Christoph Lameter, Alok N Kataria, Shobhit Dayal,
	Shai Fultheim, Matt Mackall, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel


* Steven Rostedt <rostedt@goodmis.org> wrote:

> [...] Today's slab system is starting to become like the IDE code, where
> nobody but a select few sado-masochists dares to venture in. (I've CC'd
> them ;) [...]

while it could possibly be cleaned up a bit, it's one of the 
best-optimized subsystems Linux has. Most of the "unnecessary 
complexity" in SLAB is related to a performance or a debugging feature.  
Many times i have looked at the SLAB code in a disassembler, right next 
to profile output from some hot workload, and have concluded: 'I couldnt 
do this any better even with hand-coded assembly'.

SLAB-bashing has become somewhat fashionable, but i really challenge 
everyone to improve the SLAB code first (to make it more modular, easier 
to read, etc.), before suggesting replacements.

the SLOB is nice because it gives us a simple option at the other end of 
the complexity spectrum. The SLOB should remain there. (but it certainly 
makes sense to make it faster, within certain limits, so i'm not 
opposing your SLOB patches per se.)

	Ingo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  6:56                       ` Ingo Molnar
@ 2005-12-21  7:16                         ` Pekka J Enberg
  2005-12-21  7:50                           ` Ingo Molnar
  2005-12-21 13:13                           ` Steven Rostedt
  2005-12-21  7:20                         ` [PATCH RT 00/02] SLOB optimizations Eric Dumazet
  2005-12-21 13:02                         ` Steven Rostedt
  2 siblings, 2 replies; 56+ messages in thread
From: Pekka J Enberg @ 2005-12-21  7:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Christoph Lameter, Alok N Kataria, Shobhit Dayal,
	Shai Fultheim, Matt Mackall, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel

Hi Ingo,

Steven Rostedt <rostedt@goodmis.org> wrote:
> > [...] Today's slab system is starting to become like the IDE where 
> > nobody, but a select few sado-masochis, dare to venture in. (I've CC'd 
> > them ;) [...]

On Wed, 21 Dec 2005, Ingo Molnar wrote:
> while it could possibly be cleaned up a bit, it's one of the 
> best-optimized subsystems Linux has. Most of the "unnecessary 
> complexity" in SLAB is related to a performance or a debugging feature.  
> Many times i have looked at the SLAB code in a disassembler, right next 
> to profile output from some hot workload, and have concluded: 'I couldnt 
> do this any better even with hand-coded assembly'.
> 
> SLAB-bashing has become somewhat fashionable, but i really challenge 
> everyone to improve the SLAB code first (to make it more modular, easier 
> to read, etc.), before suggesting replacements.

I dropped working on the replacement because I wanted to do just that. I 
sent my patch only because Matt and Steve talked about writing a 
replacement and I thought they would be interested to see it.

I am all for gradual improvements, but after taking a stab at it I 
started to think rewriting would be easier, simply because the slab 
allocator has been clean-up resistant for so long.

			Pekka

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  6:56                       ` Ingo Molnar
  2005-12-21  7:16                         ` Pekka J Enberg
@ 2005-12-21  7:20                         ` Eric Dumazet
  2005-12-21  7:43                           ` Ingo Molnar
  2005-12-21 13:02                         ` Steven Rostedt
  2 siblings, 1 reply; 56+ messages in thread
From: Eric Dumazet @ 2005-12-21  7:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Pekka Enberg, Christoph Lameter, Alok N Kataria,
	Shobhit Dayal, Shai Fultheim, Matt Mackall, Andrew Morton,
	john stultz, Gunter Ohrner, linux-kernel

Ingo Molnar a écrit :
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> 
>>[...] Today's slab system is starting to become like the IDE where 
>>nobody, but a select few sado-masochis, dare to venture in. (I've CC'd 
>>them ;) [...]
> 
> 
> while it could possibly be cleaned up a bit, it's one of the 
> best-optimized subsystems Linux has. Most of the "unnecessary 
> complexity" in SLAB is related to a performance or a debugging feature.  
> Many times i have looked at the SLAB code in a disassembler, right next 
> to profile output from some hot workload, and have concluded: 'I couldnt 
> do this any better even with hand-coded assembly'.

Well, what I miss is a version of kmem_cache_alloc()/kmem_cache_free() that 
won't play with IRQ masking.

The local_irq_save()/local_irq_restore() pair is quite expensive and could be 
avoided for several caches that are exclusively used in process context.

(Not speaking of general caches of course, but caches like dentry_cache, filp, 
...)
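
Eric's suggestion can be sketched in miniature: give each cache a flag saying it is only ever used in process context, and take a cheap preempt-disable on the fast path instead of the pushfq/cli/popfq sequence. Everything below is invented for illustration (the TOY_* names and the stubs, which just count which path ran); the real 2.6.15 kmem_cache_alloc() always masks IRQs and has no such flag.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PROCESS_CTX 0x1	/* cache never touched from IRQ context */
#define TOY_NOBJS 8

/* Stubs standing in for the real primitives; they only count calls so
 * we can observe which protection the fast path used. */
static int irq_ops, preempt_ops;
static void toy_local_irq_save(void)    { irq_ops++; }
static void toy_local_irq_restore(void) { irq_ops++; }
static void toy_preempt_disable(void)   { preempt_ops++; }
static void toy_preempt_enable(void)    { preempt_ops++; }

struct toy_cache {
	unsigned int flags;
	unsigned int avail;	/* objects in the per-CPU array */
	void *entry[TOY_NOBJS];
};

static void *toy_cache_alloc(struct toy_cache *c)
{
	void *obj = NULL;

	if (c->flags & TOY_PROCESS_CTX)
		toy_preempt_disable();	/* cheap: no flags save, no cli */
	else
		toy_local_irq_save();	/* safe against IRQ-context users */

	if (c->avail)
		obj = c->entry[--c->avail];	/* pop the LIFO fast path */

	if (c->flags & TOY_PROCESS_CTX)
		toy_preempt_enable();
	else
		toy_local_irq_restore();
	return obj;
}

static void toy_cache_free(struct toy_cache *c, void *obj)
{
	if (c->flags & TOY_PROCESS_CTX)
		toy_preempt_disable();
	else
		toy_local_irq_save();

	if (c->avail < TOY_NOBJS)
		c->entry[c->avail++] = obj;	/* push back onto the array */

	if (c->flags & TOY_PROCESS_CTX)
		toy_preempt_enable();
	else
		toy_local_irq_restore();
}
```

The design cost is the one Ingo raises next: a second pair of entry points (or a per-cache branch, as here) adds complexity, and the flag is only safe if the cache really is never touched from interrupt context.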

Eric


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  7:20                         ` [PATCH RT 00/02] SLOB optimizations Eric Dumazet
@ 2005-12-21  7:43                           ` Ingo Molnar
  2005-12-21  8:02                             ` Eric Dumazet
  0 siblings, 1 reply; 56+ messages in thread
From: Ingo Molnar @ 2005-12-21  7:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Steven Rostedt, Pekka Enberg, Christoph Lameter, Alok N Kataria,
	Shobhit Dayal, Shai Fultheim, Matt Mackall, Andrew Morton,
	john stultz, Gunter Ohrner, linux-kernel


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> >while it could possibly be cleaned up a bit, it's one of the 
> >best-optimized subsystems Linux has. Most of the "unnecessary 
> >complexity" in SLAB is related to a performance or a debugging feature.  
> >Many times i have looked at the SLAB code in a disassembler, right next 
> >to profile output from some hot workload, and have concluded: 'I couldnt 
> >do this any better even with hand-coded assembly'.
> 
> Well, I miss a version of kmem_cache_alloc()/kmem_cache_free() that 
> wont play with IRQ masking.

Sure, but adding this sure won't reduce complexity ;)

> The local_irq_save()/local_irq_restore() pair is quite expensive and 
> could be avoided for several caches that are exclusively used in 
> process context.

In any case, on sane platforms (i386, x86_64) an irq-disable is 
well-optimized in hardware, and is just as fast as a preempt_disable().

Combined with the fact that CLI/STI has no register side-effects, it can 
even be faster/cheaper, on x86 at least.

	Ingo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  7:16                         ` Pekka J Enberg
@ 2005-12-21  7:50                           ` Ingo Molnar
  2005-12-21 13:13                           ` Steven Rostedt
  1 sibling, 0 replies; 56+ messages in thread
From: Ingo Molnar @ 2005-12-21  7:50 UTC (permalink / raw)
  To: Pekka J Enberg
  Cc: Steven Rostedt, Christoph Lameter, Alok N Kataria, Shobhit Dayal,
	Shai Fultheim, Matt Mackall, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel


* Pekka J Enberg <penberg@cs.Helsinki.FI> wrote:

> > SLAB-bashing has become somewhat fashionable, but i really challenge 
> > everyone to improve the SLAB code first (to make it more modular, easier 
> > to read, etc.), before suggesting replacements.
> 
> I dropped working on the replacement because I wanted to do just that. 
> I sent my patch only because Matt and Steve talked about writing a 
> replacement and thought they would be interested to see it.
> 
> I am all for gradual improvements but after taking a stab at it, I 
> starting to think rewriting would be easier, simply because the slab 
> allocator has been clean-up resistant for so long.

I'd suggest trying harder, unless you think the _fundamentals_ of the 
SLAB allocator are wrong. (which you are entitled to believe, but we 
also have to admit that the SLAB has been around for many years, and 
works pretty well)

most of the ugliness in slab.c comes from:

1) debugging. There are no easy solutions here, but it could be improved. 

2) bootstrapping. Bootstrapping an allocator in a generic way is hard.
   E.g. what if cache_cache gets larger than 1 page?

3) cache-footprint tricks and lockless fastpath. SLAB does things all 
   the right way, even that ugly memmove is the right thing. Maybe it 
   could be cleaned up, but the fundamental complexity will likely 
   remain.

	Ingo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  7:43                           ` Ingo Molnar
@ 2005-12-21  8:02                             ` Eric Dumazet
  2005-12-22 18:02                               ` Zwane Mwaikambo
  2005-12-22 21:11                               ` Ingo Molnar
  0 siblings, 2 replies; 56+ messages in thread
From: Eric Dumazet @ 2005-12-21  8:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Pekka Enberg, Christoph Lameter, Alok N Kataria,
	Shobhit Dayal, Shai Fultheim, Matt Mackall, Andrew Morton,
	john stultz, Gunter Ohrner, linux-kernel

Ingo Molnar a écrit :
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> 
>>>while it could possibly be cleaned up a bit, it's one of the 
>>>best-optimized subsystems Linux has. Most of the "unnecessary 
>>>complexity" in SLAB is related to a performance or a debugging feature.  
>>>Many times i have looked at the SLAB code in a disassembler, right next 
>>>to profile output from some hot workload, and have concluded: 'I couldnt 
>>>do this any better even with hand-coded assembly'.
>>
>>Well, I miss a version of kmem_cache_alloc()/kmem_cache_free() that 
>>wont play with IRQ masking.
> 
> 
> sure, but adding this sure wont reduce complexity ;)
> 
> 
>>The local_irq_save()/local_irq_restore() pair is quite expensive and 
>>could be avoided for several caches that are exclusively used in 
>>process context.
> 
> 
> in any case, on sane platforms (i386, x86_64) an irq-disable is 
> well-optimized in hardware, and is just as fast as a preempt_disable().
> 

I'm afraid it's not the case on current hardware.

The IRQ enable/disable pair accounts for more than 50% of the CPU time 
spent in kmem_cache_alloc()/kmem_cache_free()/kfree().

oprofile results on a dual Opteron 246:

You can see the high profile numbers right after the cli and popf (sti) 
instructions, popf being VERY expensive.

CPU: Hammer, speed 1993.39 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit 
mask of 0x00 (No unit mask) count 50000

29993     1.9317  kfree
18654     1.2014  kmem_cache_alloc
12962     0.8348  kmem_cache_free

ffffffff8015c370 <kfree>: /* kfree total:  30334  1.9335 */
    770  0.0491 :ffffffff8015c370:       push   %rbp
   2477  0.1579 :ffffffff8015c371:       mov    %rdi,%rbp
                :ffffffff8015c374:       push   %rbx
                :ffffffff8015c375:       sub    $0x8,%rsp
   1792  0.1142 :ffffffff8015c379:       test   %rdi,%rdi
                :ffffffff8015c37c:       je     ffffffff8015c452 <kfree+0xe2>
    122  0.0078 :ffffffff8015c382:       pushfq
   1001  0.0638 :ffffffff8015c383:       popq   (%rsp)
   1456  0.0928 :ffffffff8015c386:       cli
   2489  0.1586 :ffffffff8015c387:       mov    $0xffffffff7fffffff,%rax    <<

...
     72  0.0046 :ffffffff8015c44e:       pushq  (%rsp)
   1080  0.0688 :ffffffff8015c451:       popfq
  13934  0.8882 :ffffffff8015c452:       add    $0x8,%rsp      << HERE >>
    290  0.0185 :ffffffff8015c456:       pop    %rbx
                :ffffffff8015c457:       pop    %rbp
    124  0.0079 :ffffffff8015c458:       retq


ffffffff8015c460 <kmem_cache_free>: /* kmem_cache_free total:  13084  0.8340 */
    388  0.0247 :ffffffff8015c460:       sub    $0x18,%rsp
    365  0.0233 :ffffffff8015c464:       mov    %rbp,0x10(%rsp)
                :ffffffff8015c469:       mov    %rbx,0x8(%rsp)
    121  0.0077 :ffffffff8015c46e:       mov    %rsi,%rbp
    262  0.0167 :ffffffff8015c471:       pushfq
    549  0.0350 :ffffffff8015c472:       popq   (%rsp)
    351  0.0224 :ffffffff8015c475:       cli
   2478  0.1579 :ffffffff8015c476:       mov    %gs:0x34,%eax
    592  0.0377 :ffffffff8015c47e:       cltq
                :ffffffff8015c480:       mov    (%rdi,%rax,8),%rbx
      7 4.5e-04 :ffffffff8015c484:       mov    (%rbx),%eax
    200  0.0127 :ffffffff8015c486:       cmp    0x4(%rbx),%eax
                :ffffffff8015c489:       jae    ffffffff8015c48f 
<kmem_cache_free+0x2f>
                :ffffffff8015c48b:       mov    %eax,%eax
    766  0.0488 :ffffffff8015c48d:       jmp    ffffffff8015c4a0 
<kmem_cache_free+0x40>
                :ffffffff8015c48f:       mov    %rbx,%rsi
     71  0.0045 :ffffffff8015c492:       callq  ffffffff8015c810 
<cache_flusharray>
                :ffffffff8015c497:       mov    (%rbx),%eax
      1 6.4e-05 :ffffffff8015c499:       data16
                :ffffffff8015c49a:       data16
                :ffffffff8015c49b:       data16
                :ffffffff8015c49c:       nop
                :ffffffff8015c49d:       data16
                :ffffffff8015c49e:       data16
                :ffffffff8015c49f:       nop
                :ffffffff8015c4a0:       mov    %rbp,0x10(%rbx,%rax,8)
     20  0.0013 :ffffffff8015c4a5:       incl   (%rbx)
    176  0.0112 :ffffffff8015c4a7:       pushq  (%rsp)
      7 4.5e-04 :ffffffff8015c4aa:       popfq
   6187  0.3944 :ffffffff8015c4ab:       mov    0x8(%rsp),%rbx << HERE>>
    543  0.0346 :ffffffff8015c4b0:       mov    0x10(%rsp),%rbp
                :ffffffff8015c4b5:       add    $0x18,%rsp
                :ffffffff8015c4b9:       retq


ffffffff8015bd70 <kmem_cache_alloc>: /* kmem_cache_alloc total:  18803  1.1985 */
    549  0.0350 :ffffffff8015bd70:       sub    $0x8,%rsp
    700  0.0446 :ffffffff8015bd74:       pushfq
   1427  0.0910 :ffffffff8015bd75:       popq   (%rsp)
    226  0.0144 :ffffffff8015bd78:       cli
   2399  0.1529 :ffffffff8015bd79:       mov    %gs:0x34,%eax  <<HERE>>
    416  0.0265 :ffffffff8015bd81:       cltq
                :ffffffff8015bd83:       mov    (%rdi,%rax,8),%rdx
     21  0.0013 :ffffffff8015bd87:       mov    (%rdx),%eax
    172  0.0110 :ffffffff8015bd89:       test   %eax,%eax
                :ffffffff8015bd8b:       je     ffffffff8015bda1 
<kmem_cache_alloc+0x31>
      8 5.1e-04 :ffffffff8015bd8d:       dec    %eax
   1338  0.0853 :ffffffff8015bd8f:       movl   $0x1,0xc(%rdx)
      9 5.7e-04 :ffffffff8015bd96:       mov    %eax,(%rdx)
      9 5.7e-04 :ffffffff8015bd98:       mov    %eax,%eax
   1146  0.0730 :ffffffff8015bd9a:       mov    0x10(%rdx,%rax,8),%rax
      4 2.5e-04 :ffffffff8015bd9f:       jmp    ffffffff8015bda6 
<kmem_cache_alloc+0x36>
                :ffffffff8015bda1:       callq  ffffffff8015c160 
<cache_alloc_refill>
    154  0.0098 :ffffffff8015bda6:       pushq  (%rsp)
    241  0.0154 :ffffffff8015bda9:       popfq
   9222  0.5878 :ffffffff8015bdaa:       prefetchw (%rax) <<HERE>>
    758  0.0483 :ffffffff8015bdad:       add    $0x8,%rsp
      4 2.5e-04 :ffffffff8015bdb1:       retq
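
For readers mapping the dump back to C: `mov %gs:0x34,%eax` is smp_processor_id() picking this CPU's array_cache, `(%rdx)` is ac->avail, `movl $0x1,0xc(%rdx)` sets ac->touched, and `mov 0x10(%rdx,%rax,8)` is the pop ac->entry[--ac->avail]. Here is a plain-C model of just that fast path; the field offsets are chosen to match the dump, while the refill/flush slow paths are stubbed and the irq save/restore is elided, so this is a reading aid, not 2.6.15 mm/slab.c itself.

```c
#include <assert.h>
#include <stddef.h>

struct toy_array_cache {
	unsigned int avail;	/* offset 0x0: (%rdx) in the dump */
	unsigned int limit;	/* offset 0x4: checked on the free path */
	unsigned int batchcount;
	unsigned int touched;	/* offset 0xc: set to 1 on each alloc hit */
	void *entry[16];	/* offset 0x10: per-CPU object stack */
};

static void *toy_refill(struct toy_array_cache *ac)
{
	(void)ac;
	return NULL;		/* stand-in for cache_alloc_refill() */
}

static void *toy_fast_alloc(struct toy_array_cache *ac)
{
	if (ac->avail) {
		ac->touched = 1;
		return ac->entry[--ac->avail];	/* the common-case pop */
	}
	return toy_refill(ac);	/* empty array: take the slow path */
}

static int toy_fast_free(struct toy_array_cache *ac, void *obj)
{
	if (ac->avail < ac->limit) {
		ac->entry[ac->avail++] = obj;	/* incl (%rbx) in the dump */
		return 1;
	}
	return 0;		/* full: would call cache_flusharray() */
}
```

Seen this way, the fast path is three or four memory operations; which is why, in the profile above, the flag save/restore around it dominates the cost.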

Eric

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  6:36                           ` Ingo Molnar
@ 2005-12-21 12:50                             ` Steven Rostedt
  0 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-21 12:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Lameter, Pekka Enberg, Alok N Kataria, Shobhit Dayal,
	Shai Fultheim, Matt Mackall, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel


On Wed, 21 Dec 2005, Ingo Molnar wrote:

>
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > > > http://marc.theaimsgroup.com/?l=linux-kernel&m=113510997009883&w=2
> > >
> > > Quite a long list of unsupported features. These academic papers
> > > usually only focus on one thing. The SLAB allocator has to work
> > > for a variety of situations though.
> > >
> > > It would help to explain what ultimately will be better in the new slab
> > > allocator. The complexity could be taken care of by reorganizing the code.
> >
> > Honestly, what I would like is a simpler solution, whether we go with
> > a new approach or reorganize the current slab.  I need to get -rt
> > working, and the code in slab is pulling my resources more than they
> > can extend. I'm capable to convert slab today as it is for RT but it
> > will probably take longer than I can afford.
>
> please, lets let the -rt tree out of the equation. The SLAB code is fine
> on upstream, and it was a pure practical maintainance decision to go for
> SLOB in the -rt tree. Yes, the SLAB code is complex, but i could hardly
> list any complexity in it tht isnt justified with a performance
> argument. _Maybe_ the SLAB code could be further cleaned up, maybe it
> could even be replaced, but we'd have to see the patches first. In any
> case, the -rt tree is not an argument that matters.

You're right about the -rt tree not being an argument for upstream.  I
used it as an example of the complexities.  This is not limited to -rt,
but applies to any other changes as well.  Years ago I tried changing the
slab to run on a small embedded device with very little memory, and I was
pretty much overwhelmed.

Now I see that people are converting it for NUMA. I give them a lot of
credit, since they must be smarter than I ;)

-- Steve


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  6:56                       ` Ingo Molnar
  2005-12-21  7:16                         ` Pekka J Enberg
  2005-12-21  7:20                         ` [PATCH RT 00/02] SLOB optimizations Eric Dumazet
@ 2005-12-21 13:02                         ` Steven Rostedt
  2 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-21 13:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Pekka Enberg, Christoph Lameter, Alok N Kataria, Shobhit Dayal,
	Shai Fultheim, Matt Mackall, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel


On Wed, 21 Dec 2005, Ingo Molnar wrote:
>
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > [...] Today's slab system is starting to become like the IDE where
> > nobody, but a select few sado-masochis, dare to venture in. (I've CC'd
> > them ;) [...]
>
> while it could possibly be cleaned up a bit, it's one of the
> best-optimized subsystems Linux has. Most of the "unnecessary
> complexity" in SLAB is related to a performance or a debugging feature.
> Many times i have looked at the SLAB code in a disassembler, right next
> to profile output from some hot workload, and have concluded: 'I couldnt
> do this any better even with hand-coded assembly'.

Exactly my point!  The complexity of SLAB keeps it in the "I could not do
it better myself" category.  This wasn't supposed to be a bash; it was
actually a compliment.  But things in the "I could not do it better
myself" category are usually very hard to modify, because unless you are
at the level of genius of those who wrote it, you may easily break it, or
drop it to the level of "Ha, I can do this better".

>
> SLAB-bashing has become somewhat fashionable, but i really challenge
> everyone to improve the SLAB code first (to make it more modular, easier
> to read, etc.), before suggesting replacements.

I perfectly agree with this statement.  As I mentioned earlier, it might
have been different if I had been a part of the changes that were made.
But I wasn't, and that leaves me the task of figuring out why things were
done the way they were.  Before changes can be made, one must have a full
understanding of why things exist as they do.

Don't get me wrong, my comments are more a frustration with myself at
having trouble understanding all that's in SLAB.  I understand the
SLAB concept, but I'm having trouble understanding the current
implementation.  That's _my_ problem.  But I will continue to work at it,
and maybe I will be able to produce some cleanup patches once I do
understand.

>
> the SLOB is nice because it gives us a simple option at the other end of
> the complexity spectrum. The SLOB should remain there. (but it certainly
> makes sense to make it faster, within certain limits, so i'm not
> opposing your SLOB patches per se.)
>

I like the SLOB code, because it was simple enough for my mortal mind.  I
actually started to play with it to get a better understanding of the way
the SLAB works.  It has actually helped in that regard.

-- Steve


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  7:16                         ` Pekka J Enberg
  2005-12-21  7:50                           ` Ingo Molnar
@ 2005-12-21 13:13                           ` Steven Rostedt
  2005-12-21 15:34                             ` [PATCH] SLAB - have index_of bug at compile time Steven Rostedt
  1 sibling, 1 reply; 56+ messages in thread
From: Steven Rostedt @ 2005-12-21 13:13 UTC (permalink / raw)
  To: Pekka J Enberg
  Cc: Ingo Molnar, Christoph Lameter, Alok N Kataria, Shobhit Dayal,
	Shai Fultheim, Matt Mackall, Andrew Morton, john stultz,
	Gunter Ohrner, linux-kernel


On Wed, 21 Dec 2005, Pekka J Enberg wrote:
> Hi Ingo,
>
> Steven Rostedt <rostedt@goodmis.org> wrote:
> > > [...] Today's slab system is starting to become like the IDE where
> > > nobody, but a select few sado-masochis, dare to venture in. (I've CC'd
> > > them ;) [...]
>
> On Wed, 21 Dec 2005, Ingo Molnar wrote:
> > while it could possibly be cleaned up a bit, it's one of the
> > best-optimized subsystems Linux has. Most of the "unnecessary
> > complexity" in SLAB is related to a performance or a debugging feature.
> > Many times i have looked at the SLAB code in a disassembler, right next
> > to profile output from some hot workload, and have concluded: 'I couldnt
> > do this any better even with hand-coded assembly'.
> >
> > SLAB-bashing has become somewhat fashionable, but i really challenge
> > everyone to improve the SLAB code first (to make it more modular, easier
> > to read, etc.), before suggesting replacements.
>
> I dropped working on the replacement because I wanted to do just that. I
> sent my patch only because Matt and Steve talked about writing a
> replacement and thought they would be interested to see it.
>
> I am all for gradual improvements but after taking a stab at it, I
> starting to think rewriting would be easier, simply because the slab
> allocator has been clean-up resistant for so long.

And I think that what was done to SLAB is excellent.  But as with code I've
written, I've often thought: if I wrote it again, I would do it cleaner,
since I learned so much in doing it.

So the only way I feel I can actually improve the current system is to
write one from scratch (or start with one that is simple) and try to make
it as good as the current system.  But by the time I got it there, it
would be just as complex as it is today.  Only then could I rewrite it to
be better, since I would have learned why things were done the way they
were, and could keep that in mind as I rewrite.  So that means writing it
twice!

Unfortunately, it is probably the case that those who wrote slab.c are
too busy doing other things (or probably just don't want to) to rewrite
slab.c with the prior knowledge of what they wrote.

For the short term, I could just force myself to study the code and play
with it to see what I break, and figure out "Oh, that's why that was
done!".

-- Steve

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH] SLAB - have index_of bug at compile time.
  2005-12-21 13:13                           ` Steven Rostedt
@ 2005-12-21 15:34                             ` Steven Rostedt
  0 siblings, 0 replies; 56+ messages in thread
From: Steven Rostedt @ 2005-12-21 15:34 UTC (permalink / raw)
  To: LKML
  Cc: Pekka J Enberg, Andrew Morton, Gunter Ohrner, john stultz,
	Andrew Morton, Matt Mackall, Shai Fultheim, Shobhit Dayal,
	Alok N Kataria, Christoph Lameter, Ingo Molnar

Hi,  after all the talk about SLAB and SLOB I decided to make myself
useful, and I'm trying very hard to understand the Linux implementation
of SLAB.  So I'm going through every line of code and examining it
thoroughly; when I find something that could be improved, whether
performance-wise (highly doubtful), cleanup-wise, documentation-wise, or
enhancement-wise, or if I just have a question, I'll make myself known.

This email is enhancement wise. ;)

I noticed the code for index_of is a creative way of finding the cache
index, using the compiler to optimize it down to a single hard-coded
number.  But I couldn't help noticing that it uses two methods to let you
know that someone used it wrong: one at compile time (the correct way),
and the other at run time (not good).

OK, this isn't really an enhancement since the code already works, but
this change can help those who later do real enhancements to SLAB.

-- Steve

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Index: linux-2.6.15-rc6/mm/slab.c
===================================================================
--- linux-2.6.15-rc6.orig/mm/slab.c	2005-12-20 16:47:05.000000000 -0500
+++ linux-2.6.15-rc6/mm/slab.c	2005-12-21 10:20:03.000000000 -0500
@@ -315,6 +315,8 @@
  */
 static __always_inline int index_of(const size_t size)
 {
+	extern void __bad_size(void);
+
 	if (__builtin_constant_p(size)) {
 		int i = 0;
 
@@ -325,12 +327,9 @@
 		i++;
 #include "linux/kmalloc_sizes.h"
 #undef CACHE
-		{
-			extern void __bad_size(void);
-			__bad_size();
-		}
+		__bad_size();
 	} else
-		BUG();
+		__bad_size();
 	return 0;
 }
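
The trick the patch relies on, in miniature: declare an extern function that is never defined anywhere. When the compiler can constant-fold the size (the real index_of is __always_inline and the kernel builds with -O2), the call in the impossible branch is discarded and the kernel links; any misuse leaves the call in, and the build fails at link time instead of BUG()ing at run time. The sketch below keeps the same shape but defines a stand-in __bad_size() that aborts, purely so it stays linkable and runnable outside a kernel build; the CACHE() list is a stand-in for kmalloc_sizes.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* In slab.c this is declared but deliberately never defined, so any
 * call surviving constant folding breaks the final link.  Defined here
 * (as abort()) only to keep the sketch linkable; delete the body to
 * reproduce the link-time failure. */
static void __bad_size(void)
{
	abort();
}

static int toy_index_of(const size_t size)
{
	int i = 0;

	/* mirrors the CACHE(x) list pulled in from kmalloc_sizes.h */
#define CACHE(x) if (size <= (x)) return i; i++;
	CACHE(32)
	CACHE(64)
	CACHE(128)
	CACHE(256)
#undef CACHE
	__bad_size();	/* no matching cache: build-time error in slab.c */
	return 0;
}
```

The point of the patch is simply to route both misuse cases (non-constant size and out-of-range size) through the same undefined symbol, so neither can survive to run time.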
 



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  8:02                             ` Eric Dumazet
@ 2005-12-22 18:02                               ` Zwane Mwaikambo
  2005-12-22 21:11                               ` Ingo Molnar
  1 sibling, 0 replies; 56+ messages in thread
From: Zwane Mwaikambo @ 2005-12-22 18:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Steven Rostedt, Pekka Enberg, Christoph Lameter,
	Alok N Kataria, Shobhit Dayal, Shai Fultheim, Matt Mackall,
	Andrew Morton, john stultz, Gunter Ohrner, linux-kernel


On Wed, 21 Dec 2005, Eric Dumazet wrote:

> Ingo Molnar a écrit :
> > * Eric Dumazet <dada1@cosmosbay.com> wrote:
> > 
> > in any case, on sane platforms (i386, x86_64) an irq-disable is
> > well-optimized in hardware, and is just as fast as a preempt_disable().
> > 
> 
> I'm afraid its not the case on current hardware.
> 
> The irq enable/disable pair count for more than 50% the cpu time spent in
> kmem_cache_alloc()/kmem_cache_free()/kfree()
> 
> oprofile results on a dual Opteron 246 :
> 
> You can see the high profile numbers right after cli and popf(sti)
> instructions, popf being VERY expensive.
> 
> CPU: Hammer, speed 1993.39 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
> mask of 0x00 (No unit mask) count 50000
> 
> 29993     1.9317  kfree
> 18654     1.2014  kmem_cache_alloc
> 12962     0.8348  kmem_cache_free
> 
> ffffffff8015c370 <kfree>: /* kfree total:  30334  1.9335 */
>    770  0.0491 :ffffffff8015c370:       push   %rbp
>   2477  0.1579 :ffffffff8015c371:       mov    %rdi,%rbp
>                :ffffffff8015c374:       push   %rbx
>                :ffffffff8015c375:       sub    $0x8,%rsp
>   1792  0.1142 :ffffffff8015c379:       test   %rdi,%rdi
>                :ffffffff8015c37c:       je     ffffffff8015c452 <kfree+0xe2>
>    122  0.0078 :ffffffff8015c382:       pushfq
>   1001  0.0638 :ffffffff8015c383:       popq   (%rsp)
>   1456  0.0928 :ffffffff8015c386:       cli
>   2489  0.1586 :ffffffff8015c387:       mov    $0xffffffff7fffffff,%rax    <<
> 
> ...
>     72  0.0046 :ffffffff8015c44e:       pushq  (%rsp)
>   1080  0.0688 :ffffffff8015c451:       popfq
>  13934  0.8882 :ffffffff8015c452:       add    $0x8,%rsp      << HERE >>

Isn't that due to taking an interrupt?

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-21  8:02                             ` Eric Dumazet
  2005-12-22 18:02                               ` Zwane Mwaikambo
@ 2005-12-22 21:11                               ` Ingo Molnar
  2005-12-22 21:39                                 ` Eric Dumazet
                                                   ` (2 more replies)
  1 sibling, 3 replies; 56+ messages in thread
From: Ingo Molnar @ 2005-12-22 21:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Steven Rostedt, Pekka Enberg, Christoph Lameter, Alok N Kataria,
	Shobhit Dayal, Shai Fultheim, Matt Mackall, Andrew Morton,
	john stultz, Gunter Ohrner, linux-kernel


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> >in any case, on sane platforms (i386, x86_64) an irq-disable is 
> >well-optimized in hardware, and is just as fast as a preempt_disable().
> 
> I'm afraid its not the case on current hardware.
> 
> The irq enable/disable pair count for more than 50% the cpu time spent 
> in kmem_cache_alloc()/kmem_cache_free()/kfree()

Because you are not using NMI-based profiling?

> oprofile results on a dual Opteron 246 :
> 
> You can see the high profile numbers right after cli and popf(sti) 
> instructions, popf being VERY expensive.

That's just the profiling interrupt hitting them. You should not analyze 
irq-safe code with a non-NMI profiling interrupt.

CLI/STI is extremely fast. (In fact in the -rt tree I'm using them 
within mutexes instead of preempt_enable()/preempt_disable(), because 
they are faster and generate fewer register side-effects.)

	Ingo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-22 21:11                               ` Ingo Molnar
@ 2005-12-22 21:39                                 ` Eric Dumazet
  2005-12-22 21:44                                 ` George Anzinger
  2005-12-22 22:08                                 ` Eric Dumazet
  2 siblings, 0 replies; 56+ messages in thread
From: Eric Dumazet @ 2005-12-22 21:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Pekka Enberg, Christoph Lameter, Alok N Kataria,
	Shobhit Dayal, Shai Fultheim, Matt Mackall, Andrew Morton,
	john stultz, Gunter Ohrner, linux-kernel

Ingo Molnar a écrit :
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>>> in any case, on sane platforms (i386, x86_64) an irq-disable is 
>>> well-optimized in hardware, and is just as fast as a preempt_disable().
>> I'm afraid its not the case on current hardware.
>>
>> The irq enable/disable pair count for more than 50% the cpu time spent 
>> in kmem_cache_alloc()/kmem_cache_free()/kfree()
> 
> because you are not using NMI based profiling?
> 
>> oprofile results on a dual Opteron 246 :
>>
>> You can see the high profile numbers right after cli and popf(sti) 
>> instructions, popf being VERY expensive.
> 
> that's just the profiling interrupt hitting them. You should not analyze 
> irq-safe code with a non-NMI profiling interrupt.
> 

I'm using oprofile on Opteron, and AFAIK it's NMI based.

# grep NMI /proc/interrupts ; sleep 1 ; grep NMI /proc/interrupts
NMI:  391352095 2867983903
NMI:  391359678 2867998498

That's 7583 and 14595 NMIs/second on cpu0 and cpu1 respectively in this sample.

> CLI/STI is extremely fast. (In fact in the -rt tree i'm using them 
> within mutexes instead of preempt_enable()/preempt_disable(), because 
> they are faster and generate less register side-effect.)
> 
> 	Ingo
> 
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-22 21:11                               ` Ingo Molnar
  2005-12-22 21:39                                 ` Eric Dumazet
@ 2005-12-22 21:44                                 ` George Anzinger
  2005-12-22 22:00                                   ` Eric Dumazet
  2005-12-22 22:08                                 ` Eric Dumazet
  2 siblings, 1 reply; 56+ messages in thread
From: George Anzinger @ 2005-12-22 21:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Steven Rostedt, Pekka Enberg, Christoph Lameter,
	Alok N Kataria, Shobhit Dayal, Shai Fultheim, Matt Mackall,
	Andrew Morton, john stultz, Gunter Ohrner, linux-kernel



> that's just the profiling interrupt hitting them. You should not analyze 
> irq-safe code with a non-NMI profiling interrupt.
> 
> CLI/STI is extremely fast. (In fact in the -rt tree i'm using them 
> within mutexes instead of preempt_enable()/preempt_disable(), because 
> they are faster and generate less register side-effect.)
> 
Hm... I rather thought that the cli would cause a rather large hit on 
the pipeline and certainly on out-of-order execution.  Is your observation 
based on any particular instruction stream?  Sti, on the other hand, 
should be fast...
-- 
George Anzinger   george@mvista.com
HRT (High-res-timers):  http://sourceforge.net/projects/high-res-timers/

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-22 21:44                                 ` George Anzinger
@ 2005-12-22 22:00                                   ` Eric Dumazet
  0 siblings, 0 replies; 56+ messages in thread
From: Eric Dumazet @ 2005-12-22 22:00 UTC (permalink / raw)
  To: george
  Cc: Ingo Molnar, Steven Rostedt, Pekka Enberg, Christoph Lameter,
	Alok N Kataria, Shobhit Dayal, Shai Fultheim, Matt Mackall,
	Andrew Morton, john stultz, Gunter Ohrner, linux-kernel

George Anzinger wrote:
> 
> 
>> that's just the profiling interrupt hitting them. You should not 
>> analyze irq-safe code with a non-NMI profiling interrupt.
>>
>> CLI/STI is extremely fast. (In fact in the -rt tree i'm using them 
>> within mutexes instead of preempt_enable()/preempt_disable(), because 
>> they are faster and generate less register side-effect.)
>>
> Hm... I rather thought that the cli would cause a rather large hit on 
> the pipeline and certainly on OOE.  Is your observation based on any 
> particular instruction stream?  Sti, on the other hand, should be fast...

Just to be exact, the 'cli' is coded as 3 instructions:

pushfq
popq (%rsp)
cli

and the 'sti' is coded as 2 instructions:
pushq (%rsp)
popfq

And 'popfq' seems to be expensive, at least on Opteron machines and if 
oprofile is not completely wrong...

Eric


* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-22 21:11                               ` Ingo Molnar
  2005-12-22 21:39                                 ` Eric Dumazet
  2005-12-22 21:44                                 ` George Anzinger
@ 2005-12-22 22:08                                 ` Eric Dumazet
  2005-12-23 19:22                                   ` Zwane Mwaikambo
  2 siblings, 1 reply; 56+ messages in thread
From: Eric Dumazet @ 2005-12-22 22:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Pekka Enberg, Christoph Lameter, Alok N Kataria,
	Shobhit Dayal, Shai Fultheim, Matt Mackall, Andrew Morton,
	john stultz, Gunter Ohrner, linux-kernel

Ingo Molnar wrote:
> 
> CLI/STI is extremely fast. (In fact in the -rt tree i'm using them 
> within mutexes instead of preempt_enable()/preempt_disable(), because 
> they are faster and generate less register side-effect.)
> 

Yes, but most of my machines have a !CONFIG_PREEMPT kernel, so 
preempt_enable()/preempt_disable() are empty, thus faster than CLI/STI for sure :)

Then maybe the patch that moves 'current' into a dedicated x86_64 register may 
help to lower the cost of preempt_enable()/preempt_disable() on a 
CONFIG_PREEMPT kernel?

Eric


* Re: [PATCH RT 00/02] SLOB optimizations
  2005-12-22 22:08                                 ` Eric Dumazet
@ 2005-12-23 19:22                                   ` Zwane Mwaikambo
  0 siblings, 0 replies; 56+ messages in thread
From: Zwane Mwaikambo @ 2005-12-23 19:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, Steven Rostedt, Pekka Enberg, Christoph Lameter,
	Alok N Kataria, Shobhit Dayal, Shai Fultheim, Matt Mackall,
	Andrew Morton, john stultz, Gunter Ohrner, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 763 bytes --]

On Thu, 22 Dec 2005, Eric Dumazet wrote:

> Ingo Molnar wrote:
> > 
> > CLI/STI is extremely fast. (In fact in the -rt tree i'm using them within
> > mutexes instead of preempt_enable()/preempt_disable(), because they are
> > faster and generate less register side-effect.)
> > 
> 
> Yes, but most of my machines have a ! CONFIG_PREEMPT kernel, so
> preempt_enable()/preempt_disable() are empty, thus faster than CLI/STI for
> sure :)
> 
> Then maybe the patch that moves 'current' into a dedicated x86_64 register may
> help to lower the cost of preempt_enable()/preempt_disable() on a
> CONFIG_PREEMPT kernel ?

I'm not sure it'll make much of a difference over:

mov    %gs:offset,%reg

So 'current' is already fairly fast on x86_64.


end of thread, other threads:[~2005-12-23 19:16 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-12-16 11:30 2.6.15-rc5-rt2 slowness Gunter Ohrner
2005-12-16 11:42 ` Gunter Ohrner
2005-12-16 12:04   ` Gunter Ohrner
2005-12-16 12:34   ` Steven Rostedt
2005-12-16 12:32 ` Steven Rostedt
2005-12-16 22:58   ` john stultz
2005-12-17  0:22     ` Gunter Ohrner
2005-12-17  3:51     ` Steven Rostedt
2005-12-17  3:33 ` Steven Rostedt
2005-12-17 22:57   ` Steven Rostedt
2005-12-18 16:05     ` K.R. Foley
2005-12-20 13:32     ` Ingo Molnar
2005-12-20 13:38       ` Steven Rostedt
2005-12-20 13:57         ` Ingo Molnar
2005-12-20 14:04           ` Steven Rostedt
2005-12-20 14:33             ` Steven Rostedt
2005-12-20 15:07               ` Ingo Molnar
2005-12-20 15:16                 ` Steven Rostedt
2005-12-20 15:44             ` [PATCH RT 00/02] SLOB optimizations Steven Rostedt
2005-12-20 15:56               ` Steven Rostedt
2005-12-20 15:58                 ` Ingo Molnar
2005-12-20 16:13               ` Ingo Molnar
2005-12-20 16:29                 ` Steven Rostedt
2005-12-20 16:39                   ` Steven Rostedt
2005-12-20 18:19               ` Matt Mackall
2005-12-20 19:15                 ` Steven Rostedt
2005-12-20 19:43                   ` Matt Mackall
2005-12-20 20:06                     ` Steven Rostedt
2005-12-20 20:15                   ` Pekka Enberg
2005-12-20 21:42                     ` Steven Rostedt
2005-12-20 21:52                       ` Christoph Lameter
2005-12-20 22:11                         ` Steven Rostedt
2005-12-21  6:36                           ` Ingo Molnar
2005-12-21 12:50                             ` Steven Rostedt
2005-12-21  6:56                       ` Ingo Molnar
2005-12-21  7:16                         ` Pekka J Enberg
2005-12-21  7:50                           ` Ingo Molnar
2005-12-21 13:13                           ` Steven Rostedt
2005-12-21 15:34                             ` [PATCH] SLAB - have index_of bug at compile time Steven Rostedt
2005-12-21  7:20                         ` [PATCH RT 00/02] SLOB optimizations Eric Dumazet
2005-12-21  7:43                           ` Ingo Molnar
2005-12-21  8:02                             ` Eric Dumazet
2005-12-22 18:02                               ` Zwane Mwaikambo
2005-12-22 21:11                               ` Ingo Molnar
2005-12-22 21:39                                 ` Eric Dumazet
2005-12-22 21:44                                 ` George Anzinger
2005-12-22 22:00                                   ` Eric Dumazet
2005-12-22 22:08                                 ` Eric Dumazet
2005-12-23 19:22                                   ` Zwane Mwaikambo
2005-12-21 13:02                         ` Steven Rostedt
2005-12-21  2:30                   ` Nick Piggin
2005-12-21  2:41                     ` Steven Rostedt
2005-12-20 15:44             ` [PATCH RT 01/02] SLOB - remove bigblock list Steven Rostedt
2005-12-20 15:44             ` [PATCH RT 02/02] SLOB - break SLOB up by caches Steven Rostedt
2005-12-20 14:07           ` 2.6.15-rc5-rt2 slowness Steven Rostedt
2005-12-20 15:26           ` K.R. Foley

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox