* [PATCH]: scatter list are supposed to have a int length
From: Juan Quintela @ 2002-12-20 10:27 UTC
  To: Ralf Baechle, mipslist


Hi
        
        length is int in: alpha, sparc64, ppc64 and s390x.

Later, Juan.

Index: include/asm-mips64/scatterlist.h
===================================================================
RCS file: /home/cvs/linux/include/asm-mips64/scatterlist.h,v
retrieving revision 1.4.2.5
diff -u -r1.4.2.5 scatterlist.h
--- include/asm-mips64/scatterlist.h	28 Sep 2002 18:51:41 -0000	1.4.2.5
+++ include/asm-mips64/scatterlist.h	20 Dec 2002 09:55:13 -0000
@@ -7,7 +7,7 @@
 	struct page * page;	/* Location for highmem page, if any */
 	unsigned int offset;
 	dma_addr_t dma_address;
-	unsigned long length;
+	unsigned int length;
 };
 
 #define ISA_DMA_THRESHOLD (0x00ffffff)
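A minimal sketch of what the narrower field means on an LP64 target,
where long is 64 bits and int is 32.  The struct below is a stand-in
built from fixed-width integers (the name sg_model, the helper
length_fits() and the 64-bit dma_address are assumptions for
illustration), not the kernel's definition:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the patched layout on a 64-bit target. */
struct sg_model {
	uint64_t page;         /* models struct page * */
	uint32_t offset;
	uint64_t dma_address;  /* models dma_addr_t (assumed 64-bit here) */
	uint32_t length;       /* was a 64-bit unsigned long before the patch */
};

/* Returns 1 if a byte count survives being stored in the 32-bit field. */
static int length_fits(uint64_t nbytes)
{
	struct sg_model sg = { 0, 0, 0, (uint32_t)nbytes };

	return sg.length == nbytes;
}
```

Any segment length below 4 GiB round-trips unchanged, so the narrower
type only matters for a single segment of 4 GiB or more.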


-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy


* Re: 2.4.20: Broken AGP initialization for i845G chipset [patch]
From: Dave Jones @ 2002-12-20 10:27 UTC
  To: Michael Milligan; +Cc: linux-kernel
In-Reply-To: <3E025858.4000404@acmeps.com>

On Thu, Dec 19, 2002 at 04:38:00PM -0700, Michael Milligan wrote:
 > 
 > Patch below.  Calls the 845 initialization function instead of the 830MP,
 > and a small formatting cleanup.  This is verified working.

Verified with testgart, or with some other AGP-using app?
 
It looks totally logical. I'm just wondering if it was a cut-n-paste
accident, or someone had a genuine reason for doing that in the
first place.

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs


* Re: [PATCH]Timer list init is done AFTER use
From: Andrew Morton @ 2002-12-20 10:26 UTC
  To: george anzinger; +Cc: linux-kernel@vger.kernel.org, Linus Torvalds
In-Reply-To: <3E02D81F.13A5A59D@mvista.com>

george anzinger wrote:
> 
> On SMP systems the timer list init is done by way of a
> cpu_notifier call.  This has two problems:
> 
> 1.) Timers are started WAY before the cpu_notifier call
> chain is executed.  In particular the console blanking timer
> is deleted and inserted every time printk() is called.  That
> this does not fail is only because the kernel has yet to
> protect location zero.

But init_timers() directly calls timer_cpu_notify(), which directly
calls init_timers_cpu().

So your patch appears to be a no-op for the boot CPU.
 
> 2.) This notifier is called when a cpu comes up.  I suspect
> that initializing the timer list when a hot swap of a cpu is
> done is NOT the right thing to do.  In any case, if this is
> a desired action, the list still needs to be initialized
> prior to its use.

It should be OK as-is?  The CPU_UP_PREPARE callout is performed
before the secondary starts doing things.  Its timers are initialised.
 
> The attached patch initializes all the timer lists at
> init_timers time and does not put code in the notify list.

But the patch assumes that the per-cpu data exists for all CPUs - even
the !cpu_possible() ones.

This is true at present.  But the intent here is that the per-cpu
storage be allocated as the CPUs come up, and in their node-local
memory.  That saves memory and presumably having the cpu-local timers
in the cpu-local memory is a good thing.

I have working code which did all that, but it sort-of got lost
because there was a lot going on at the time.


Have you actually observed any problem?
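
The call chain described above can be modeled in a few lines of
user-space C.  This is only a sketch: every *_model name, the
timers_ready[] array and the action value are stand-ins, and it
assumes init_timers() passes the CPU_UP_PREPARE action for the boot
CPU.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4

enum { MODEL_CPU_UP_PREPARE = 1 };   /* stand-in for the notifier action */

static int timers_ready[MODEL_NR_CPUS];

/* Models init_timers_cpu(): set up one CPU's timer lists. */
static void init_timers_cpu_model(int cpu)
{
	timers_ready[cpu] = 1;
}

/* Models timer_cpu_notify(): runs before a CPU starts doing work. */
static void timer_cpu_notify_model(int action, int cpu)
{
	if (action == MODEL_CPU_UP_PREPARE)
		init_timers_cpu_model(cpu);
}

/* Models init_timers(): calls the notifier directly for the boot CPU. */
static void init_timers_model(int boot_cpu)
{
	timer_cpu_notify_model(MODEL_CPU_UP_PREPARE, boot_cpu);
}
```

In this model the boot CPU's lists are ready as soon as
init_timers_model() returns, and a secondary CPU's lists are set up
only when its own notification arrives.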


* Re: 2.5.52: agp, drm, i810 problem
From: Dave Jones @ 2002-12-20 10:25 UTC
  To: Felix Seeger; +Cc: linux-kernel
In-Reply-To: <200212200034.16969.felix.seeger@gmx.de>

On Fri, Dec 20, 2002 at 12:34:12AM +0100, Felix Seeger wrote:

 > I am running 2.5.52 (I must say that it is the best kernel I ever had, first 
 > time acpi and sony jog dial are working, great)
 > But I have some agp problems at the moment:
 > 
 > $ modprobe i810
 > FATAL: Error inserting i810 
 > (/lib/modules/2.5.52/kernel/drivers/char/drm/i810.ko): Cannot allocate memory

That one was my fault. I think I may have this fixed now.
bk://linux-dj.bkbits.net/agpgart has a bunch of fixes for numerous agp
related problems. I'll generate a GNU patch later today, and ask
Linus to pull what I have so far.

		Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs


* Re: PTRACE_GET_THREAD_AREA
From: Christoph Hellwig @ 2002-12-20 10:24 UTC
  To: Roland McGrath; +Cc: linux-kernel, Ingo Molnar
In-Reply-To: <200212200832.gBK8Wfg29816@magilla.sf.frob.com>

On Fri, Dec 20, 2002 at 12:32:41AM -0800, Roland McGrath wrote:
> This patch vs 2.5.51 (should apply fine to 2.5.52) adds two new ptrace
> requests for i386, PTRACE_GET_THREAD_AREA and PTRACE_SET_THREAD_AREA.
> These let another process using ptrace do the equivalent of performing
> get_thread_area and set_thread_area system calls for another thread.

I don't think ptrace is the right interface for this.  Just change
get_thread_area/set_thread_area to take a new first pid_t argument.

Of course you might have to check privileges if the first argument is
non-null (i.e. not yourself).
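
A rough sketch of the target resolution and privilege check such an
interface would need.  Nothing here is an existing kernel API: the
function names, the caller_is_privileged flag and the "0 means
yourself" convention are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Hypothetical permission hook: may `caller` act on `target`? */
static int may_access(pid_t caller, pid_t target, int caller_is_privileged)
{
	return caller == target || caller_is_privileged;
}

/*
 * Hypothetical front end for a pid-taking get/set_thread_area.
 * pid == 0 means "the calling thread".  Returns the resolved
 * target pid, or -EPERM if the caller may not touch it.
 */
static long resolve_thread_area_target(pid_t caller, pid_t pid,
				       int caller_is_privileged)
{
	pid_t target = pid ? pid : caller;

	if (!may_access(caller, target, caller_is_privileged))
		return -EPERM;
	return target;
}
```

The existing self-only behaviour stays the pid == 0 case, so callers
that only ever act on themselves never hit the privilege check.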



* [PATCH]: test_bit returns int in all the architectures
From: Juan Quintela @ 2002-12-20 10:20 UTC
  To: Ralf Baechle, mipslist


Hi
        to be consistent with everybody else, test_bit should return an
        int.  Note that it only returns 0/1, so this is not a big deal.

Later, Juan.

Index: include/asm-mips64/bitops.h
===================================================================
RCS file: /home/cvs/linux/include/asm-mips64/bitops.h,v
retrieving revision 1.15.2.10
diff -u -r1.15.2.10 bitops.h
--- include/asm-mips64/bitops.h	5 Dec 2002 03:25:20 -0000	1.15.2.10
+++ include/asm-mips64/bitops.h	20 Dec 2002 09:55:13 -0000
@@ -302,7 +302,7 @@
  * @nr: bit number to test
  * @addr: Address to start counting from
  */
-static inline unsigned long test_bit(int nr, volatile void * addr)
+static inline unsigned int test_bit(int nr, volatile void * addr)
 {
 	return 1UL & (((volatile unsigned long *) addr)[nr >> 6] >> (nr & 0x3f));
 }
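A portable model of the patched routine, to show that the result is
always 0 or 1 and so fits any integer return type.  The name
test_bit_model and the use of uint64_t in place of a 64-bit unsigned
long are assumptions for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Test bit nr in a bitmap stored as an array of 64-bit words. */
static inline unsigned int test_bit_model(int nr, const volatile uint64_t *addr)
{
	/* nr >> 6 selects the word, nr & 0x3f the bit inside that word. */
	return (unsigned int)(1u & (addr[nr >> 6] >> (nr & 0x3f)));
}
```

Because the value is masked with 1 before the conversion, narrowing
the return type from 64 to 32 bits cannot lose information.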


-- 
In theory, practice and theory are the same, but in practice they 
are different -- Larry McVoy


* Re: linuxppc_2_4_devel patch: board config updates
From: Wolfgang Denk @ 2002-12-20 10:04 UTC
  To: Tom Rini; +Cc: linuxppc-embedded

[-- Attachment #1: Type: text/plain, Size: 553 bytes --]

I wrote:

> this is a patch against linuxppc_2_4_devel BK Changeset 1.1197
> (trini@kernel.crashing.org|ChangeSet|20021219180614|11718)
>
> It updates the configuration files for a couple of  boards  supported
> by DENX.

Resent (compressed) because of list's attachment size limits.


Wolfgang Denk

--
Software Engineering:  Embedded and Realtime Systems,  Embedded Linux
Phone: (+49)-8142-4596-87  Fax: (+49)-8142-4596-88  Email: wd@denx.de
The IQ of the group is the lowest IQ of a member of the group divided
by the number of people in the group.


[-- Attachment #2: board_configs.patch.gz --]
[-- Type: application/x-gzip , Size: 10945 bytes --]


* Re: Intel P6 vs P7 system call performance
From: Ulrich Drepper @ 2002-12-20 10:08 UTC
  To: Linus Torvalds
  Cc: bart, davej, lk, hpa, terje.eggestad, matti.aarnio, hugh, mingo,
	linux-kernel
In-Reply-To: <Pine.LNX.4.44.0212191134180.2731-100000@penguin.transmeta.com>

Linus Torvalds wrote:

> For _zero_ gain.  The jump to the library address has to be indirect 
> anyway, and glibc has several places to put the information without any 
> mmap's or anything like that.

Correct.  The current implementation is optimal.

It is necessary to have indirection since the target address can change.

I'm never going to use self-modifying code.

And it's a simple, one-instruction change.

  int $0x80  ->  call *%gs:0x18


That's it.  It's all implemented and tested.  The results are in the
latest NPTL source drop.  The code won't be available in LinuxThreads
since it requires a kernel with TLS support.

As far as I'm concerned the discussion is over.  I'm happy with what I
have now.  The additional overhead for the case when AT_SYSINFO is not
available is negligible (and can be compiled out completely if one
really wants), and in case AT_SYSINFO is available the code really is
the fastest possible given the constraints mentioned above.
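
The indirection can be modeled in plain C as a call through a pointer
slot that is filled in once at startup.  This is a sketch only: all
names are hypothetical, the slot stands in for the word at %gs:0x18,
and the two entry functions return different values purely so the
paths can be told apart.

```c
#include <assert.h>

typedef long (*syscall_entry_t)(long nr);

/* Hypothetical stand-ins for the two entry paths. */
static long legacy_entry(long nr) { return nr; }         /* models int $0x80 */
static long fast_entry(long nr)   { return nr + 1000; }  /* models the AT_SYSINFO entry */

/* The slot every call site goes through; defaults to the legacy path. */
static syscall_entry_t sysinfo_slot = legacy_entry;

/* Called once at startup: switch paths only if an entry was supplied. */
static void setup_sysinfo(syscall_entry_t from_auxv)
{
	if (from_auxv)
		sysinfo_slot = from_auxv;
}

/* A call site: identical code no matter which target is installed. */
static long do_syscall(long nr)
{
	return sysinfo_slot(nr);
}
```

Since the target is read from the slot on every call, it can change
without any call site being rewritten, which is what makes
self-modifying code unnecessary.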

-- 
--------------.                        ,-.            444 Castro Street
Ulrich Drepper \    ,-----------------'   \ Mountain View, CA 94041 USA
Red Hat         `--' drepper at redhat.com `---------------------------



* Re: New style dpalloc/hostalloc routines (diff).
From: Pantelis Antoniou @ 2002-12-20  9:57 UTC
  To: Paul Mackerras; +Cc: linuxppc-embedded
In-Reply-To: <15874.23846.177956.192840@argo.ozlabs.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 1527 bytes --]

Paul Mackerras wrote:

>Pantelis Antoniou writes:
>
>
>
>>+# Support new type of routines, usable from modules
>>+bool 'Use new type dpalloc routines()' CONFIG_NEW_DPALLOC
>>+bool 'Use new type hostalloc routines()' CONFIG_NEW_HOSTALLOC
>>+if [ "$CONFIG_NEW_DPALLOC" = "y" -o "$CONFIG_NEW_HOSTALLOC" = "y" ]; then
>>+  define_bool CONFIG_CPM_RHEAP y
>>+fi
>>
>>
>
>I don't want to see config options that select between different
>internal implementations of the same thing.  Either your new routines
>are better, and we'll use them, or they are worse, and we'll use the
>old ones.  Having a config option just leads to tons of ifdefs
>throughout the code, which makes it harder to read and understand.
>Having two implementations of the same thing is just bloat.
>
>Similarly, I don't like the way all your new routines have a "new_"
>prefix on the name.  You should be thinking of replacing the existing
>routines rather than providing an alternative implementation with a
>different name.  Where you have changed the API, either fix the
>drivers or provide a compatibility routine.
>
>The way it looks at the moment, it seems that you don't really have
>the conviction that your code is better than what is there already.
>Please redo your patch so that it just replaces the old routines.  And
>please don't send it as a bkpatch since they are impossible to read, a
>plain diff -u is much better.
>
>Paul.
>
>
>
>
>
>
OK

Here is the updated patch in diff format as per your suggestions.

Any more suggestions?

Pantelis




[-- Attachment #2: linuxppc_2_4_devel-dpalloc.patch.gz --]
[-- Type: application/x-gunzip, Size: 8109 bytes --]


* EDE-0.0.5a
From: Neil Holmes @ 2002-12-20  9:53 UTC
  To: Linux 8086

OK. I found my "typo" in my upgrade code so I have uploaded something that's
called 0.0.5a. The only difference between that and 0.0.5 is that it will
upgrade an existing type 80 partition. The upgrade is achieved by overwriting
the kernel and binaries. I have not built in any backup yet. It's on my list
of things to do.

The "upgrade" of an existing ELKS installation is a lot smoother because a
valid partition already exists.

Incidentally, an EDE 0.0.5/5a install that has partition problems throws up
a raft of ugly errors at the moment. Nothing to panic about as no damage is
done. The other thing that I am working on, besides backup on upgrade, is
some code to trap the errors and stop the install. That should see the
light of day in 0.0.6. Hopefully the filesystem guys will have the
answers to all my prayers before then and it becomes something of minor
importance.

So, for the record, I have 2 areas of work in progress at the moment :-

1. Backup binaries and kernel on upgrade
2. Trap mkfs errors and stop the install

Other work will come with the 0.1.2 kernel and updated elkscmd in due course.

Have Fun !

Neil




* Re: [PATCH ] POSIX clocks & timers take 17 (NOT HIGH RES)
From: george anzinger @ 2002-12-20  9:52 UTC
  To: Linus Torvalds; +Cc: linux-kernel@vger.kernel.org, Randy.Dunlap
In-Reply-To: <Pine.LNX.4.44.0212050904390.27298-100000@home.transmeta.com>

[-- Attachment #1: Type: text/plain, Size: 1810 bytes --]

Now for 2.5.52-bk4.

Changes since last time:

Fix to kernel make file to accommodate a new source.

-----------

The changes for the new sys_call restart now allow one
restart function to handle both nanosleep and
clock_nanosleep.  Saves a bit of code, nice.

All the requested changes and Lindent too :).

I also broke clock_nanosleep() apart much the same way
nanosleep() was with the 2.5.50-bk5 changes.  

This is still this way.  Should be easy to do the compat
stuff.

George


Linus Torvalds wrote:
> 
> Ok, finally starting to look at merging this, however:
> 
> This must go (we already have a timespec, there's no way it should be
> here in <asm/signal.h>):
> 
>         +#ifndef _STRUCT_TIMESPEC
>         +#define _STRUCT_TIMESPEC
>         +struct timespec {
>         +       time_t  tv_sec;         /* seconds */
>         +       long    tv_nsec;        /* nanoseconds */
>         +};
>         +#endif /* _STRUCT_TIMESPEC */

OK.
> 
> and you have things like
> 
>         +       if ((flags & TIMER_ABSTIME) &&
>         +           (clock->clock_get != do_posix_clock_monotonic_gettime)) {
>         +       }else{
>         +       }

A holdover from the high-res code; I will remove the empty
else.
> 
> and
> 
>         +if (!p) {
>         +printk("in sub_remove for id=%d called with null pointer.\n", id);
>         +return(0);
>         +}

That is in there!?  I will look into it and fix it.
> 
> and obviously the "nanosleep()" thing and the CLOCK_NANOSLEEP_ENTRY()
> stuff has been discussed in the unrelated thread (ie it doesn't work for
> alpha or other architectures).

Right!  I am merging this now.

-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

[-- Attachment #2: hrtimers-posix-2.5.52-bk4.1.0.patch --]
[-- Type: text/plain, Size: 71299 bytes --]

diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/arch/i386/Kconfig linux/arch/i386/Kconfig
--- linux-2.5.52-bk4-kb/arch/i386/Kconfig	Thu Dec 19 12:36:01 2002
+++ linux/arch/i386/Kconfig	Thu Dec 19 13:54:04 2002
@@ -323,6 +323,107 @@
 
 	  If you don't know what to do here, say N.
 
+config HIGH_RES_TIMERS
+	bool "Configure High-Resolution-Timers"
+	help
+	  POSIX timers are available by default.  This option enables
+	  high resolution POSIX timers.  With this option the resolution
+	  is at least 1 micro second.  High resolution is not free.  If
+	  enabled this option will add a small overhead each time a
+	  timer expires that is not on a 1/HZ tick boundry.  If no such
+	  timers are used the overhead is nil.
+
+	  This option enables two additional POSIX CLOCKS,
+	  CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR.  Note that this
+	  option does not change the resolution of CLOCK_REALTIME or
+	  CLOCK_MONOTONIC which remain at 1/HZ resolution.
+
+choice
+	prompt "Clock source?"
+	depends on HIGH_RES_TIMERS
+	default HIGH_RES_TIMER_TSC
+	help 
+	  This option allows you to choose the wall clock timer for your
+	  system.  With high resolution timers on the x86 platforms it
+	  is best to keep the interrupt generating timer separate from
+	  the time keeping timer.  On x86 platforms there are three
+	  possible sources implemented for the wall clock.  These are:
+ 
+  	  <timer>				<resolution>
+ 	  ACPI power management (pm) timer	~280 nano seconds
+  	  TSC (Time Stamp Counter)		1/CPU clock
+ 	  PIT (Programmable Interrupt Timer)	~838 nano seconds
+
+	  The PIT is always used to generate clock interrupts but, in
+	  SMP systems the APIC timers are used to drive the timer list
+	  code.  This means that, in SMP systems the PIT will not be
+	  programmed to generate sub jiffie events and can give
+	  reasonable service as the clock interrupt. In non SMP (UP)
+	  systems it will be programmed to interrupt when the next timer
+	  is to expire or on the next 1/HZ tick.  For this reason it is
+	  best to not use this timer as the wall clock timer in UP
+	  systems.  This timer has a resolution of 838 nano seconds.  IN
+	  UP SYSTEMS THIS OPTION SHOULD ONLY BE USED IF BOTH ACPI AND
+	  TSC ARE NOT AVAILABLE.
+
+	  The TSC runs at the cpu clock rate (i.e. its resolution is
+	  1/CPU clock) and it has a very low access time.  However, it
+	  is subject, in some (incorrect) processors, to throttling to
+	  cool the cpu, and to other slow downs during power management.
+	  If your system has power managment code active these changes
+	  are tracked by the TSC timer code.  If your cpu is correct and
+	  does not change the TSC frequency for throttling or power
+	  management outside of the power managment kernel code, this is
+	  the best clock timer.
+
+	  The ACPI pm timer is available on systems with Advanced
+	  Configuration and Power Interface support.  The pm timer is
+	  available on these systems even if you don't use or enable
+	  ACPI in the software or the BIOS (but see Default ACPI pm
+	  timer address).  The timer has a resolution of about 280
+	  nanoseconds, however, the access time is a bit higher than
+	  that of the TSC.  Since it is part of ACPI it is intended to
+	  keep track of time while the system is under power management,
+	  thus it is not subject to the power management problems of the
+	  TSC.
+
+	  If you enable the ACPI pm timer and it can not be found, it is
+	  possible that your BIOS is not producing the ACPI table or
+	  that your machine does not support ACPI.  In the former case,
+	  see "Default ACPI pm timer address".  If the timer is not
+	  found the boot will fail when trying to calibrate the 'delay'
+	  loop.
+
+config HIGH_RES_TIMER_ACPI_PM
+	bool "ACPI-pm-timer"
+	
+config HIGH_RES_TIMER_TSC
+	bool "Time-stamp-counter/TSC"
+	depends on X86_TSC
+
+config HIGH_RES_TIMER_PIT
+	bool "Programable-interrupt-timer/PIT"
+	  
+endchoice	  
+
+config HIGH_RES_TIMER_ACPI_PM_ADD
+	int "Default ACPI pm timer address"
+	depends on HIGH_RES_TIMER_ACPI_PM
+	default 0
+	help
+	  This option is available for use on systems where the BIOS
+	  does not generate the ACPI tables if ACPI is not enabled.  For
+	  example some BIOSes will not generate the ACPI tables if APM
+	  is enabled.  The ACPI pm timer is still available but can not
+	  be found by the software.  This option allows you to supply
+	  the needed address.  When the high resolution timers code
+	  finds a valid ACPI pm timer address it reports it in the boot
+	  messages log (look for lines that begin with
+	  "High-res-timers:").  You can turn on the ACPI support in the
+	  BIOS, boot the system and find this value.  You can then enter
+	  it at configure time.  Both the report and the entry are in
+	  decimal.
+
 config PREEMPT
 	bool "Preemptible Kernel"
 	help
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S
--- linux-2.5.52-bk4-kb/arch/i386/kernel/entry.S	Thu Dec 19 12:14:48 2002
+++ linux/arch/i386/kernel/entry.S	Thu Dec 19 14:44:24 2002
@@ -41,7 +41,6 @@
  */
 
 #include <linux/config.h>
-#include <linux/sys.h>
 #include <linux/linkage.h>
 #include <asm/thread_info.h>
 #include <asm/errno.h>
@@ -273,7 +272,7 @@
 	pushl %eax
 	SAVE_ALL
 	GET_THREAD_INFO(%ebx)
-	cmpl $(NR_syscalls), %eax
+	cmpl $(nr_syscalls), %eax
 	jae syscall_badsys
 
 	testb $_TIF_SYSCALL_TRACE,TI_FLAGS(%ebx)
@@ -297,7 +296,7 @@
 	pushl %eax			# save orig_eax
 	SAVE_ALL
 	GET_THREAD_INFO(%ebx)
-	cmpl $(NR_syscalls), %eax
+	cmpl $(nr_syscalls), %eax
 	jae syscall_badsys
 					# system call tracing in operation
 	testb $_TIF_SYSCALL_TRACE,TI_FLAGS(%ebx)
@@ -373,7 +372,7 @@
 	xorl %edx,%edx
 	call do_syscall_trace
 	movl ORIG_EAX(%esp), %eax
-	cmpl $(NR_syscalls), %eax
+	cmpl $(nr_syscalls), %eax
 	jnae syscall_call
 	jmp syscall_exit
 
@@ -827,8 +826,15 @@
 	.long sys_epoll_wait
  	.long sys_remap_file_pages
  	.long sys_set_tid_address
-
-
-	.rept NR_syscalls-(.-sys_call_table)/4
-		.long sys_ni_syscall
-	.endr
+ 	.long sys_timer_create
+ 	.long sys_timer_settime		/* 260 */
+ 	.long sys_timer_gettime
+ 	.long sys_timer_getoverrun
+ 	.long sys_timer_delete
+ 	.long sys_clock_settime
+ 	.long sys_clock_gettime		/* 265 */
+ 	.long sys_clock_getres
+ 	.long sys_clock_nanosleep
+ 
+ 
+nr_syscalls=(.-sys_call_table)/4
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/arch/i386/kernel/time.c linux/arch/i386/kernel/time.c
--- linux-2.5.52-bk4-kb/arch/i386/kernel/time.c	Tue Nov 12 12:39:37 2002
+++ linux/arch/i386/kernel/time.c	Thu Dec 19 12:16:00 2002
@@ -132,6 +132,7 @@
 	time_maxerror = NTP_PHASE_LIMIT;
 	time_esterror = NTP_PHASE_LIMIT;
 	write_unlock_irq(&xtime_lock);
+	clock_was_set();
 }
 
 /*
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/fs/exec.c linux/fs/exec.c
--- linux-2.5.52-bk4-kb/fs/exec.c	Thu Dec 19 12:13:11 2002
+++ linux/fs/exec.c	Thu Dec 19 12:16:00 2002
@@ -779,6 +779,7 @@
 			
 	flush_signal_handlers(current);
 	flush_old_files(current->files);
+	exit_itimers(current);
 
 	return 0;
 
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/asm-generic/siginfo.h linux/include/asm-generic/siginfo.h
--- linux-2.5.52-bk4-kb/include/asm-generic/siginfo.h	Wed Oct 30 22:45:08 2002
+++ linux/include/asm-generic/siginfo.h	Thu Dec 19 12:16:00 2002
@@ -43,8 +43,11 @@
 
 		/* POSIX.1b timers */
 		struct {
-			unsigned int _timer1;
-			unsigned int _timer2;
+			timer_t _tid;		/* timer id */
+			int _overrun;		/* overrun count */
+			char _pad[sizeof( __ARCH_SI_UID_T) - sizeof(int)];
+			sigval_t _sigval;	/* same as below */
+			int _sys_private;       /* not to be passed to user */
 		} _timer;
 
 		/* POSIX.1b signals */
@@ -86,8 +89,9 @@
  */
 #define si_pid		_sifields._kill._pid
 #define si_uid		_sifields._kill._uid
-#define si_timer1	_sifields._timer._timer1
-#define si_timer2	_sifields._timer._timer2
+#define si_tid		_sifields._timer._tid
+#define si_overrun	_sifields._timer._overrun
+#define si_sys_private  _sifields._timer._sys_private
 #define si_status	_sifields._sigchld._status
 #define si_utime	_sifields._sigchld._utime
 #define si_stime	_sifields._sigchld._stime
@@ -221,6 +225,7 @@
 #define SIGEV_SIGNAL	0	/* notify via signal */
 #define SIGEV_NONE	1	/* other notification: meaningless */
 #define SIGEV_THREAD	2	/* deliver via thread creation */
+#define SIGEV_THREAD_ID 4	/* deliver to thread */
 
 #define SIGEV_MAX_SIZE	64
 #ifndef SIGEV_PAD_SIZE
@@ -235,6 +240,7 @@
 	int sigev_notify;
 	union {
 		int _pad[SIGEV_PAD_SIZE];
+		 int _tid;
 
 		struct {
 			void (*_function)(sigval_t);
@@ -247,10 +253,12 @@
 
 #define sigev_notify_function	_sigev_un._sigev_thread._function
 #define sigev_notify_attributes	_sigev_un._sigev_thread._attribute
+#define sigev_notify_thread_id	 _sigev_un._tid
 
 #ifdef __KERNEL__
 
 struct siginfo;
+void do_schedule_next_timer(struct siginfo *info);
 
 #ifndef HAVE_ARCH_COPY_SIGINFO
 
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/asm-i386/posix_types.h linux/include/asm-i386/posix_types.h
--- linux-2.5.52-bk4-kb/include/asm-i386/posix_types.h	Mon Sep  9 10:35:18 2002
+++ linux/include/asm-i386/posix_types.h	Thu Dec 19 12:16:00 2002
@@ -22,6 +22,8 @@
 typedef long		__kernel_time_t;
 typedef long		__kernel_suseconds_t;
 typedef long		__kernel_clock_t;
+typedef int		__kernel_timer_t;
+typedef int		__kernel_clockid_t;
 typedef int		__kernel_daddr_t;
 typedef char *		__kernel_caddr_t;
 typedef unsigned short	__kernel_uid16_t;
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/asm-i386/signal.h linux/include/asm-i386/signal.h
--- linux-2.5.52-bk4-kb/include/asm-i386/signal.h	Wed Dec 11 06:25:28 2002
+++ linux/include/asm-i386/signal.h	Thu Dec 19 12:16:00 2002
@@ -3,6 +3,7 @@
 
 #include <linux/types.h>
 #include <linux/linkage.h>
+#include <linux/time.h>
 
 /* Avoid too many header ordering problems.  */
 struct siginfo;
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/asm-i386/unistd.h linux/include/asm-i386/unistd.h
--- linux-2.5.52-bk4-kb/include/asm-i386/unistd.h	Wed Dec 11 06:25:29 2002
+++ linux/include/asm-i386/unistd.h	Thu Dec 19 12:16:00 2002
@@ -264,6 +264,15 @@
 #define __NR_epoll_wait		256
 #define __NR_remap_file_pages	257
 #define __NR_set_tid_address	258
+#define __NR_timer_create	259
+#define __NR_timer_settime	(__NR_timer_create+1)
+#define __NR_timer_gettime	(__NR_timer_create+2)
+#define __NR_timer_getoverrun	(__NR_timer_create+3)
+#define __NR_timer_delete	(__NR_timer_create+4)
+#define __NR_clock_settime	(__NR_timer_create+5)
+#define __NR_clock_gettime	(__NR_timer_create+6)
+#define __NR_clock_getres	(__NR_timer_create+7)
+#define __NR_clock_nanosleep	(__NR_timer_create+8)
 
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/linux/id_reuse.h linux/include/linux/id_reuse.h
--- linux-2.5.52-bk4-kb/include/linux/id_reuse.h	Wed Dec 31 16:00:00 1969
+++ linux/include/linux/id_reuse.h	Thu Dec 19 12:16:00 2002
@@ -0,0 +1,119 @@
+/*
+ * include/linux/id.h
+ * 
+ * 2002-10-18  written by Jim Houston jim.houston@ccur.com
+ *	Copyright (C) 2002 by Concurrent Computer Corporation
+ *	Distributed under the GNU GPL license version 2.
+ *
+ * Small id to pointer translation service avoiding fixed sized
+ * tables.
+ */
+
+#define IDR_BITS 5
+#define IDR_MASK ((1 << IDR_BITS)-1)
+#define IDR_FULL ((int)((1ULL << (1 << IDR_BITS))-1))
+
+/* Number of id_layer structs to leave in free list */
+#define IDR_FREE_MAX 6
+
+struct idr_layer {
+	unsigned long	        bitmap;
+	struct idr_layer	*ary[1<<IDR_BITS];
+};
+
+struct idr {
+	int		layers;
+	int		last;
+	int		count;
+	struct idr_layer *top;
+	spinlock_t      id_slock;
+};
+
+void *idr_find(struct idr *idp, int id);
+void *idr_find_nolock(struct idr *idp, int id);
+int idr_get_new(struct idr *idp, void *ptr);
+void idr_remove(struct idr *idp, int id);
+void idr_init(struct idr *idp);
+void idr_lock(struct idr *idp);
+void idr_unlock(struct idr *idp);
+
+extern inline void update_bitmap(struct idr_layer *p, int bit)
+{
+	if (p->ary[bit] && p->ary[bit]->bitmap == IDR_FULL)
+		__set_bit(bit, &p->bitmap);
+	else
+		__clear_bit(bit, &p->bitmap);
+}
+
+extern inline void update_bitmap_set(struct idr_layer *p, int bit)
+{
+	if (p->ary[bit] && p->ary[bit]->bitmap == IDR_FULL)
+		__set_bit(bit, &p->bitmap);
+}
+
+extern inline void update_bitmap_clear(struct idr_layer *p, int bit)
+{
+	if (p->ary[bit] && p->ary[bit]->bitmap == IDR_FULL)
+		;
+	else
+		__clear_bit(bit, &p->bitmap);
+}
+
+extern inline void idr_lock(struct idr *idp)
+{
+	spin_lock(&idp->id_slock);
+}
+
+extern inline void idr_unlock(struct idr *idp)
+{
+	spin_unlock(&idp->id_slock);
+}
+
+extern inline void *idr_find(struct idr *idp, int id)
+{
+	int n;
+	struct idr_layer *p;
+
+	id--;
+	idr_lock(idp);
+	n = idp->layers * IDR_BITS;
+	p = idp->top;
+	if ((unsigned)id >= (1 << n)) { // unsigned catches <=0 input
+		idr_unlock(idp);
+		return(NULL);
+	}
+
+	while (n > 0 && p) {
+		n -= IDR_BITS;
+		p = p->ary[(id >> n) & IDR_MASK];
+	}
+	idr_unlock(idp);
+	return((void *)p);
+}
+/*
+ * caller calls idr_lock/ unlock around this one.  Allows
+ * additional code to be protected.
+ */
+extern inline void *idr_find_nolock(struct idr *idp, int id)
+{
+	int n;
+	struct idr_layer *p;
+
+	id--;
+	n = idp->layers * IDR_BITS;
+	p = idp->top;
+	if ((unsigned)id >= (1 << n)) { // unsigned catches <=0 input
+		return(NULL);
+	}
+
+	while (n > 0 && p) {
+		n -= IDR_BITS;
+		p = p->ary[(id >> n) & IDR_MASK];
+	}
+	return((void *)p);
+}
+
+
+
+extern kmem_cache_t *idr_layer_cache;
+
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/linux/init_task.h linux/include/linux/init_task.h
--- linux-2.5.52-bk4-kb/include/linux/init_task.h	Thu Oct  3 10:42:11 2002
+++ linux/include/linux/init_task.h	Thu Dec 19 12:16:00 2002
@@ -93,6 +93,7 @@
 	.sig		= &init_signals,				\
 	.pending	= { NULL, &tsk.pending.head, {{0}}},		\
 	.blocked	= {{0}},					\
+	 .posix_timers	 = LIST_HEAD_INIT(tsk.posix_timers),		   \
 	.alloc_lock	= SPIN_LOCK_UNLOCKED,				\
 	.switch_lock	= SPIN_LOCK_UNLOCKED,				\
 	.journal_info	= NULL,						\
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/linux/posix-timers.h linux/include/linux/posix-timers.h
--- linux-2.5.52-bk4-kb/include/linux/posix-timers.h	Wed Dec 31 16:00:00 1969
+++ linux/include/linux/posix-timers.h	Thu Dec 19 12:16:00 2002
@@ -0,0 +1,30 @@
+#ifndef _linux_POSIX_TIMERS_H
+#define _linux_POSIX_TIMERS_H
+
+struct k_clock {
+	int res;		/* in nano seconds */
+	int (*clock_set) (struct timespec * tp);
+	int (*clock_get) (struct timespec * tp);
+	int (*nsleep) (int flags,
+		       struct timespec * new_setting,
+		       struct itimerspec * old_setting);
+	int (*timer_set) (struct k_itimer * timr, int flags,
+			  struct itimerspec * new_setting,
+			  struct itimerspec * old_setting);
+	int (*timer_del) (struct k_itimer * timr);
+	void (*timer_get) (struct k_itimer * timr,
+			   struct itimerspec * cur_setting);
+};
+struct now_struct {
+	unsigned long jiffies;
+};
+
+#define posix_get_now(now) (now)->jiffies = jiffies;
+#define posix_time_before(timer, now) \
+                      time_before((timer)->expires, (now)->jiffies)
+
+#define posix_bump_timer(timr) do { \
+                        (timr)->it_timer.expires += (timr)->it_incr; \
+                        (timr)->it_overrun++;               \
+                       }while (0)
+#endif
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.52-bk4-kb/include/linux/sched.h	Thu Dec 19 12:13:17 2002
+++ linux/include/linux/sched.h	Thu Dec 19 12:16:00 2002
@@ -276,6 +276,25 @@
 typedef struct prio_array prio_array_t;
 struct backing_dev_info;
 
+/* POSIX.1b interval timer structure. */
+struct k_itimer {
+	struct list_head list;		 /* free/ allocate list */
+	spinlock_t it_lock;
+	clockid_t it_clock;		/* which timer type */
+	timer_t it_id;			/* timer id */
+	int it_overrun;			/* overrun on pending signal  */
+	int it_overrun_last;		 /* overrun on last delivered signal */
+	int it_requeue_pending;          /* waiting to requeue this timer */
+	int it_sigev_notify;		 /* notify word of sigevent struct */
+	int it_sigev_signo;		 /* signo word of sigevent struct */
+	sigval_t it_sigev_value;	 /* value word of sigevent struct */
+	unsigned long it_incr;		/* interval specified in jiffies */
+	struct task_struct *it_process;	/* process to send signal to */
+	struct timer_list it_timer;
+};
+
+
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -339,6 +358,7 @@
 	unsigned long it_real_value, it_prof_value, it_virt_value;
 	unsigned long it_real_incr, it_prof_incr, it_virt_incr;
 	struct timer_list real_timer;
+	struct list_head posix_timers; /* POSIX.1b Interval Timers */
 	unsigned long utime, stime, cutime, cstime;
 	unsigned long start_time;
 /* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
@@ -578,6 +598,7 @@
 extern void exit_files(struct task_struct *);
 extern void exit_sighand(struct task_struct *);
 extern void __exit_sighand(struct task_struct *);
+extern void exit_itimers(struct task_struct *);
 
 extern void reparent_to_init(void);
 extern void daemonize(void);
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/linux/signal.h linux/include/linux/signal.h
--- linux-2.5.52-bk4-kb/include/linux/signal.h	Wed Dec 11 06:25:32 2002
+++ linux/include/linux/signal.h	Thu Dec 19 12:16:00 2002
@@ -224,7 +224,7 @@
 struct pt_regs;
 extern int get_signal_to_deliver(siginfo_t *info, struct pt_regs *regs);
 #endif
-
+#define FOLD_NANO_SLEEP_INTO_CLOCK_NANO_SLEEP
 #endif /* __KERNEL__ */
 
 #endif /* _LINUX_SIGNAL_H */
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/linux/sys.h linux/include/linux/sys.h
--- linux-2.5.52-bk4-kb/include/linux/sys.h	Wed Oct 30 22:46:36 2002
+++ linux/include/linux/sys.h	Thu Dec 19 12:16:00 2002
@@ -2,9 +2,8 @@
 #define _LINUX_SYS_H
 
 /*
- * system call entry points ... but not all are defined
+ * This file is no longer used or needed
  */
-#define NR_syscalls 260
 
 /*
  * These are system calls that will be removed at some time
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/linux/time.h linux/include/linux/time.h
--- linux-2.5.52-bk4-kb/include/linux/time.h	Wed Dec 11 06:25:33 2002
+++ linux/include/linux/time.h	Thu Dec 19 12:16:00 2002
@@ -40,6 +40,19 @@
  */
 #define MAX_JIFFY_OFFSET ((~0UL >> 1)-1)
 
+/* Parameters used to convert the timespec values */
+#ifndef USEC_PER_SEC
+#define USEC_PER_SEC (1000000L)
+#endif
+
+#ifndef NSEC_PER_SEC
+#define NSEC_PER_SEC (1000000000L)
+#endif
+
+#ifndef NSEC_PER_USEC
+#define NSEC_PER_USEC (1000L)
+#endif
+
 static __inline__ unsigned long
 timespec_to_jiffies(struct timespec *value)
 {
@@ -138,6 +151,8 @@
 #ifdef __KERNEL__
 extern void do_gettimeofday(struct timeval *tv);
 extern void do_settimeofday(struct timeval *tv);
+extern int do_sys_settimeofday(struct timeval *tv, struct timezone *tz);
+extern void clock_was_set(void); // call whenever the clock is set
 extern long do_nanosleep(struct timespec *t);
 extern long do_utimes(char * filename, struct timeval * times);
 #endif
@@ -165,5 +180,25 @@
 	struct	timeval it_interval;	/* timer interval */
 	struct	timeval it_value;	/* current value */
 };
+
+
+/*
+ * The IDs of the various system clocks (for POSIX.1b interval timers).
+ */
+#define CLOCK_REALTIME		  0
+#define CLOCK_MONOTONIC	  1
+#define CLOCK_PROCESS_CPUTIME_ID 2
+#define CLOCK_THREAD_CPUTIME_ID	 3
+#define CLOCK_REALTIME_HR	 4
+#define CLOCK_MONOTONIC_HR	  5
+
+#define MAX_CLOCKS 6
+
+/*
+ * The various flags for setting POSIX.1b interval timers.
+ */
+
+#define TIMER_ABSTIME 0x01
+
 
 #endif
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/include/linux/types.h linux/include/linux/types.h
--- linux-2.5.52-bk4-kb/include/linux/types.h	Tue Oct 15 15:43:06 2002
+++ linux/include/linux/types.h	Thu Dec 19 12:16:00 2002
@@ -23,6 +23,8 @@
 typedef __kernel_daddr_t	daddr_t;
 typedef __kernel_key_t		key_t;
 typedef __kernel_suseconds_t	suseconds_t;
+typedef __kernel_timer_t	timer_t;
+typedef __kernel_clockid_t	clockid_t;
 
 #ifdef __KERNEL__
 typedef __kernel_uid32_t	uid_t;
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/kernel/Makefile linux/kernel/Makefile
--- linux-2.5.52-bk4-kb/kernel/Makefile	Thu Dec 19 12:13:18 2002
+++ linux/kernel/Makefile	Thu Dec 19 12:18:00 2002
@@ -10,7 +10,8 @@
 	    exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
 	    signal.o sys.o kmod.o workqueue.o futex.o platform.o pid.o \
-	    rcupdate.o intermodule.o extable.o params.o
+	    rcupdate.o intermodule.o extable.o params.o \
+	    posix-timers.o id_reuse.o
 
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
 obj-$(CONFIG_SMP) += cpu.o
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/kernel/exit.c linux/kernel/exit.c
--- linux-2.5.52-bk4-kb/kernel/exit.c	Wed Dec 11 06:25:33 2002
+++ linux/kernel/exit.c	Thu Dec 19 12:16:01 2002
@@ -411,6 +411,7 @@
 	mmdrop(active_mm);
 }
 
+
 /*
  * Turn us into a lazy TLB process if we
  * aren't already..
@@ -659,6 +660,7 @@
 	__exit_files(tsk);
 	__exit_fs(tsk);
 	exit_namespace(tsk);
+	exit_itimers(tsk);
 	exit_thread();
 
 	if (current->leader)
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/kernel/fork.c linux/kernel/fork.c
--- linux-2.5.52-bk4-kb/kernel/fork.c	Thu Dec 19 12:13:18 2002
+++ linux/kernel/fork.c	Thu Dec 19 12:16:01 2002
@@ -812,6 +812,7 @@
 		goto bad_fork_cleanup_files;
 	if (copy_sighand(clone_flags, p))
 		goto bad_fork_cleanup_fs;
+	INIT_LIST_HEAD(&p->posix_timers);
 	if (copy_mm(clone_flags, p))
 		goto bad_fork_cleanup_sighand;
 	if (copy_namespace(clone_flags, p))
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/kernel/id_reuse.c linux/kernel/id_reuse.c
--- linux-2.5.52-bk4-kb/kernel/id_reuse.c	Wed Dec 31 16:00:00 1969
+++ linux/kernel/id_reuse.c	Thu Dec 19 12:16:01 2002
@@ -0,0 +1,194 @@
+/*
+ * linux/kernel/id_reuse.c
+ *
+ * 2002-10-18  written by Jim Houston jim.houston@ccur.com
+ *	Copyright (C) 2002 by Concurrent Computer Corporation
+ *	Distributed under the GNU GPL license version 2.
+ *
+ * Small id to pointer translation service.  
+ *
+ * It uses a radix tree like structure as a sparse array indexed 
+ * by the id to obtain the pointer.  The bitmap makes allocating
+ * a new id quick.
+
+ * Modified by George Anzinger to reuse immediately and to use
+ * find bit instructions.  Also removed _irq on spinlocks.
+ */
+
+
+#include <linux/slab.h>
+#include <linux/id_reuse.h>
+#include <linux/init.h>
+#include <linux/string.h>
+
+static kmem_cache_t *idr_layer_cache;
+
+/*
+ * Since we can't allocate memory with spinlock held and dropping the
+ * lock to allocate gets ugly keep a free list which will satisfy the
+ * worst case allocation.
+
+ * Hm?  Looks like the free list is shared with all users... I guess
+ * that is ok, think of it as an extension of alloc.
+ */
+
+static struct idr_layer *id_free;
+static int id_free_cnt;
+
+static inline struct idr_layer *alloc_layer(void)
+{
+	struct idr_layer *p;
+
+	if (!(p = id_free))
+		BUG();
+	id_free = p->ary[0];
+	id_free_cnt--;
+	p->ary[0] = 0;
+	return(p);
+}
+
+static inline void free_layer(struct idr_layer *p)
+{
+	/*
+	 * Depends on the return element being zeroed.
+	 */
+	p->ary[0] = id_free;
+	id_free = p;
+	id_free_cnt++;
+}
+
+static int sub_alloc(struct idr_layer *p, int shift, void *ptr)
+{
+	int bitmap = p->bitmap;
+	int v, n;
+
+	n = ffz(bitmap);
+	if (shift == 0) {
+		p->ary[n] = (struct idr_layer *)ptr;
+		__set_bit(n, &p->bitmap);
+		return(n);
+	}
+	if (!p->ary[n])
+		p->ary[n] = alloc_layer();
+	v = sub_alloc(p->ary[n], shift-IDR_BITS, ptr);
+	update_bitmap_set(p, n);
+	return(v + (n << shift));
+}
+
+int idr_get_new(struct idr *idp, void *ptr)
+{
+	int n, v;
+	
+	idr_lock(idp);
+	n = idp->layers * IDR_BITS;
+	/*
+	 * Since we can't allocate memory with spinlock held and dropping the
+	 * lock to allocate gets ugly keep a free list which will satisfy the
+	 * worst case allocation.
+	 */
+	while (id_free_cnt < n+1) {
+		struct idr_layer *new;
+		idr_unlock(idp);
+		new = kmem_cache_alloc(idr_layer_cache, GFP_KERNEL);
+		if(new == NULL)
+			return (0);
+		memset(new, 0, sizeof(struct idr_layer));
+		idr_lock(idp);
+		free_layer(new);
+	}
+	/*
+	 * Add a new layer if the array is full 
+	 */
+	if (idp->top->bitmap == IDR_FULL){
+		struct idr_layer *new = alloc_layer();
+		++idp->layers;
+		n += IDR_BITS;
+		new->ary[0] = idp->top;
+		idp->top = new;
+		update_bitmap_set(new, 0);
+	}
+	v = sub_alloc(idp->top, n-IDR_BITS, ptr);
+	idp->last = v;
+	idp->count++;
+	idr_unlock(idp);
+	return(v+1);
+}
+/*
+ * At this time we only free leaf nodes.  It would take another bitmap
+ * or, better, an in use counter to correctly free higher nodes.
+ */
+
+static int sub_remove(struct idr_layer *p, int shift, int id)
+{
+	int n = (id >> shift) & IDR_MASK;
+	
+	if (shift != 0) {
+		if (sub_remove(p->ary[n], shift-IDR_BITS, id)) {
+			free_layer(p->ary[n]);
+			p->ary[n] = NULL;
+		}
+		__clear_bit(n, &p->bitmap);
+		return (0);      // for now, prune only at 0
+	} else {
+		p->ary[n] = NULL;
+		__clear_bit(n, &p->bitmap);
+	} 
+	return (! p->bitmap);
+}
+
+void idr_remove(struct idr *idp, int id)
+{
+	struct idr_layer *p;
+
+	if (id <= 0)
+		return;
+	id--;
+	idr_lock(idp);
+	sub_remove(idp->top, (idp->layers-1)*IDR_BITS, id);
+#if 0
+	/*
+	 * To do this correctly we really need a bit map or counter that
+	 * indicates if any are allocated, not the current one that
+	 * indicates if any are free.  Something to do...
+	 * This is not too bad as we do prune the leaf nodes. So for a 
+	 * three layer tree we will only be left with 33 nodes when 
+	 * empty
+	 */
+	if(idp->top->bitmap == 1 && idp->layers > 1 ){  // We can drop a layer
+		p = idp->top->ary[0];
+		free_layer(idp->top);
+		idp->top = p;
+		--idp->layers;
+	}
+#endif
+	idp->count--;
+	if (id_free_cnt >= IDR_FREE_MAX) {
+		
+		p = alloc_layer();
+		idr_unlock(idp);
+		kmem_cache_free(idr_layer_cache, p);
+		return;
+	}
+	idr_unlock(idp);
+}
+
+static  __init int init_id_cache(void)
+{
+	if (!idr_layer_cache)
+		idr_layer_cache = kmem_cache_create("idr_layer_cache", 
+			sizeof(struct idr_layer), 0, 0, 0, 0);
+	return 0;
+}
+
+void idr_init(struct idr *idp)
+{
+	init_id_cache();
+	idp->count = 0;
+	idp->last = 0;
+	idp->layers = 1;
+	idp->top = kmem_cache_alloc(idr_layer_cache, GFP_KERNEL);
+	memset(idp->top, 0, sizeof(struct idr_layer));
+	spin_lock_init(&idp->id_slock);
+}
+
+__initcall(init_id_cache);
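
For readers following the id_reuse.c hunk above: the leaf-level step (find the lowest clear bit, store the pointer, set the bit, hand back a 1-based id so that 0 can mean failure) can be tried in user space. What follows is only a rough sketch of that idea under stated assumptions, not the code in the patch. It models a single 32-slot layer with no locking and no free list, and every name in it (id_layer, first_zero, id_get_new, id_find, id_remove, SLOT_BITS) is hypothetical. first_zero() is a portable stand-in for the kernel's ffz().

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_BITS 5			/* hypothetical; stands in for IDR_BITS */
#define SLOT_COUNT (1 << SLOT_BITS)	/* 32 slots in this single layer */
#define SLOT_FULL 0xffffffffu		/* every slot in use */

struct id_layer {
	unsigned int bitmap;		/* bit n set => slot n is in use */
	void *ary[SLOT_COUNT];		/* pointer stored for each slot */
};

/* Portable stand-in for ffz(): index of the lowest clear bit. */
static int first_zero(unsigned int word)
{
	int n = 0;

	assert(word != SLOT_FULL);	/* caller must rule out a full layer */
	while (word & 1u) {
		word >>= 1;
		n++;
	}
	return n;
}

/* Returns a 1-based id, or 0 if the layer is full. */
static int id_get_new(struct id_layer *p, void *ptr)
{
	int n;

	if (p->bitmap == SLOT_FULL)
		return 0;
	n = first_zero(p->bitmap);
	p->ary[n] = ptr;
	p->bitmap |= 1u << n;
	return n + 1;
}

/* Returns the stored pointer, or NULL for an unknown or released id. */
static void *id_find(const struct id_layer *p, int id)
{
	if (id <= 0 || id > SLOT_COUNT)
		return NULL;
	return p->ary[id - 1];
}

/* Releases the id; the slot becomes the next candidate if it is lowest. */
static void id_remove(struct id_layer *p, int id)
{
	if (id <= 0 || id > SLOT_COUNT)
		return;
	p->ary[id - 1] = NULL;
	p->bitmap &= ~(1u << (id - 1));
}
```

Because the lowest clear bit always wins, an id released with id_remove() is handed out again by the very next id_get_new(). That is the "reuse immediately" behaviour mentioned in the file header, and it is why lock_timer() also compares it_id against the id it was asked for.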
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/kernel/posix-timers.c linux/kernel/posix-timers.c
--- linux-2.5.52-bk4-kb/kernel/posix-timers.c	Wed Dec 31 16:00:00 1969
+++ linux/kernel/posix-timers.c	Thu Dec 19 12:16:01 2002
@@ -0,0 +1,1311 @@
+/*
+ * linux/kernel/posix-timers.c
+ *
+ * 
+ * 2002-10-15  Posix Clocks & timers by George Anzinger
+ *			     Copyright (C) 2002 by MontaVista Software.
+ */
+
+/* These are all the functions necessary to implement 
+ * POSIX clocks & timers
+ */
+
+#include <linux/mm.h>
+#include <linux/smp_lock.h>
+#include <linux/interrupt.h>
+#include <linux/slab.h>
+#include <linux/time.h>
+
+#include <asm/uaccess.h>
+#include <asm/semaphore.h>
+#include <linux/list.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/id_reuse.h>
+#include <linux/posix-timers.h>
+
+#ifndef div_long_long_rem
+#include <asm/div64.h>
+
+#define div_long_long_rem(dividend,divisor,remainder) ({ \
+		       u64 result = dividend;		\
+		       *remainder = do_div(result,divisor); \
+		       result; })
+
+#endif				/* ifndef div_long_long_rem */
+
+/*
+ * Management arrays for POSIX timers.	 Timers are kept in slab memory
+ * Timer ids are allocated by an external routine that keeps track of the
+ * id and the timer.  The external interface is:
+ *
+ *void *idr_find(struct idr *idp, int id);           to find timer_id <id>
+ *int idr_get_new(struct idr *idp, void *ptr);       to get a new id and 
+ *                                                  relate it to <ptr>
+ *void idr_remove(struct idr *idp, int id);          to release <id>
+ *void idr_init(struct idr *idp);                    to initialize <idp>
+ *                                                  which we supply.
+ * The idr_get_new *may* call slab for more memory so it must not be
+ * called under a spin lock.  Likewise idr_remove may release memory
+ * (but it may be ok to do this under a lock...).
+ * idr_find is just a memory look up and is quite fast.  A zero return
+ * indicates that the requested id does not exist.
+
+ */
+/*
+ * Let's keep our timers in a slab cache :-)
+ */
+static kmem_cache_t *posix_timers_cache;
+struct idr posix_timers_id;
+
+/*
+ * Just because the timer is not in the timer list does NOT mean it is
+ * inactive.  It could be in the "fire" routine getting a new expire time.
+ */
+#define TIMER_INACTIVE 1
+#define TIMER_RETRY 1
+#ifdef CONFIG_SMP
+#define timer_active(tmr) (tmr->it_timer.entry.prev != (void *)TIMER_INACTIVE)
+#define set_timer_inactive(tmr) tmr->it_timer.entry.prev = (void *)TIMER_INACTIVE
+#else
+#define timer_active(tmr) BARFY	// error to use outside of SMP
+#define set_timer_inactive(tmr)
+#endif
+/*
+ * The timer ID is turned into a timer address by idr_find().
+ * Verifying a valid ID consists of:
+ * 
+ * a) checking that idr_find() returns other than zero.
+ * b) checking that the timer id matches the one in the timer itself.
+ * c) that the timer owner is in the caller's thread group.
+ */
+
+extern rwlock_t xtime_lock;
+
+/* 
+ * CLOCKs: The POSIX standard calls for a couple of clocks and allows us
+ *	    to implement others.  This structure defines the various
+ *	    clocks and allows the possibility of adding others.	 We
+ *	    provide an interface to add clocks to the table and expect
+ *	    the "arch" code to add at least one clock that is high
+ *	    resolution.	 Here we define the standard CLOCK_REALTIME as a
+ *	    1/HZ resolution clock.
+
+ * CPUTIME & THREAD_CPUTIME: We are not, at this time, defining these
+ *	    two clocks (and the other process related clocks (Std
+ *	    1003.1d-1999)).  The way these should be supported, we think,
+ *	    is to use large negative numbers for the two clocks that are
+ *	    pinned to the executing process and to use -pid for clocks
+ *	    pinned to particular pids.	Calls which supported these clock
+ *	    ids would split early in the function.
+ 
+ * RESOLUTION: Clock resolution is used to round up timer and interval
+ *	    times, NOT to report clock times, which are reported with as
+ *	    much resolution as the system can muster.  In some cases this
+ *	    resolution may depend on the underlying clock hardware and
+ *	    may not be quantifiable until run time, and only then if the
+ *	    necessary code is written.	The standard says we should say
+ *	    something about this issue in the documentation...
+
+ * FUNCTIONS: The CLOCKs structure defines possible functions to handle
+ *	    various clock functions.  For clocks that use the standard
+ *	    system timer code these entries should be NULL.  This will
+ *	    allow dispatch without the overhead of indirect function
+ *	    calls.  CLOCKS that depend on other sources (e.g. WWV or GPS)
+ *	    must supply functions here, even if the function just returns
+ *	    ENOSYS.  The standard POSIX timer management code assumes the
+ *	    following: 1.) The k_itimer struct (sched.h) is used for the
+ *	    timer.  2.) The list, it_lock, it_clock, it_id and it_process
+ *	    fields are not modified by timer code. 
+ *
+ *          At this time all functions EXCEPT clock_nanosleep can be
+ *          redirected by the CLOCKS structure.  Clock_nanosleep is in
+ *          there, but the code ignores it.
+ *
+ * Permissions: It is assumed that the clock_settime() function defined
+ *	    for each clock will take care of permission checks.	 Some
+ *	    clocks may be settable by any user (i.e. local process
+ *	    clocks), others not.  Currently the only settable clock we
+ *	    have is CLOCK_REALTIME and its high res counterpart, both of
+ *	    which we beg off on and pass to do_sys_settimeofday().
+ */
+
+struct k_clock posix_clocks[MAX_CLOCKS];
+
+#define if_clock_do(clock_fun, alt_fun,parms)	(! clock_fun)? alt_fun parms :\
+							      clock_fun parms
+
+#define p_timer_get( clock,a,b) if_clock_do((clock)->timer_get, \
+					     do_timer_gettime,	 \
+					     (a,b))
+
+#define p_nsleep( clock,a,b,c) if_clock_do((clock)->nsleep,   \
+					    do_nsleep,	       \
+					    (a,b,c))
+
+#define p_timer_del( clock,a) if_clock_do((clock)->timer_del, \
+					   do_timer_delete,    \
+					   (a))
+
+void register_posix_clock(int clock_id, struct k_clock *new_clock);
+
+static int do_posix_gettime(struct k_clock *clock, struct timespec *tp);
+
+int do_posix_clock_monotonic_gettime(struct timespec *tp);
+
+int do_posix_clock_monotonic_settime(struct timespec *tp);
+static struct k_itimer *lock_timer(timer_t timer_id, long *flags);
+static inline void unlock_timer(struct k_itimer *timr, long flags);
+
+/* 
+ * Initialize everything, well, just everything in Posix clocks/timers ;)
+ */
+
+static __init int
+init_posix_timers(void)
+{
+	struct k_clock clock_realtime = {.res = NSEC_PER_SEC / HZ };
+	struct k_clock clock_monotonic = {.res = NSEC_PER_SEC / HZ,
+		.clock_get = do_posix_clock_monotonic_gettime,
+		.clock_set = do_posix_clock_monotonic_settime
+	};
+
+	register_posix_clock(CLOCK_REALTIME, &clock_realtime);
+	register_posix_clock(CLOCK_MONOTONIC, &clock_monotonic);
+
+	posix_timers_cache = kmem_cache_create("posix_timers_cache",
+					       sizeof (struct k_itimer), 0, 0,
+					       0, 0);
+	idr_init(&posix_timers_id);
+	return 0;
+}
+
+__initcall(init_posix_timers);
+
+static inline int
+tstojiffie(struct timespec *tp, int res, unsigned long *jiff)
+{
+	unsigned long sec = tp->tv_sec;
+	long nsec = tp->tv_nsec + res - 1;
+
+	if (nsec > NSEC_PER_SEC) {
+		sec++;
+		nsec -= NSEC_PER_SEC;
+	}
+
+	/*
+	 * A note on jiffy overflow: It is possible for the system to
+	 * have been up long enough for the jiffies quantity to overflow.
+	 * In order for correct timer evaluations we require that the
+	 * specified time be somewhere between now and now + (max
+	 * unsigned int/2).  Times beyond this will be truncated back to
+	 * this value.   This is done in the absolute adjustment code,
+	 * below.  Here it is enough to just discard the high order
+	 * bits.  
+	 */
+	*jiff = HZ * sec;
+	/*
+	 * Do the res thing. (Don't forget the add in the declaration of nsec) 
+	 */
+	nsec -= nsec % res;
+	/*
+	 * Split to jiffie and sub jiffie
+	 */
+	*jiff += nsec / (NSEC_PER_SEC / HZ);
+	/*
+	 * We trust that the optimizer will use the remainder from the 
+	 * above div in the following operation as long as they are close. 
+	 */
+	return 0;
+}
+static void
+tstotimer(struct itimerspec *time, struct k_itimer *timer)
+{
+	int res = posix_clocks[timer->it_clock].res;
+	tstojiffie(&time->it_value, res, &timer->it_timer.expires);
+	tstojiffie(&time->it_interval, res, &timer->it_incr);
+}
+
+static void
+schedule_next_timer(struct k_itimer *timr)
+{
+	struct now_struct now;
+
+	/* Set up the timer for the next interval (if there is one) */
+	if (timr->it_incr == 0) {
+		{
+			set_timer_inactive(timr);
+			return;
+		}
+	}
+	posix_get_now(&now);
+	while (posix_time_before(&timr->it_timer, &now)) {
+		posix_bump_timer(timr);
+	};
+	timr->it_overrun_last = timr->it_overrun;
+	timr->it_overrun = -1;
+	add_timer(&timr->it_timer);
+}
+
+/*
+
+ * This function is exported for use by the signal deliver code.  It is
+ * called just prior to the info block being released and passes that
+ * block to us.  Its function is to update the overrun entry AND to
+ * restart the timer.  It should only be called if the timer is to be
+ * restarted (i.e. we have flagged this in the sys_private entry of the
+ * info block).
+ *
+ * To protect against the timer going away while the interrupt is queued,
+ * we require that the it_requeue_pending flag be set.
+
+ */
+void
+do_schedule_next_timer(struct siginfo *info)
+{
+
+	struct k_itimer *timr;
+	long flags;
+
+	timr = lock_timer(info->si_tid, &flags);
+
+	if (!timr || !timr->it_requeue_pending)
+		goto exit;
+
+	schedule_next_timer(timr);
+	info->si_overrun = timr->it_overrun_last;
+      exit:
+	if (timr)
+		unlock_timer(timr, flags);
+}
+
+/* 
+
+ * Notify the task and set up the timer for the next expiration (if
+ * applicable).  This function requires that the k_itimer structure
+ * it_lock is taken.  This code will requeue the timer only if we get
+ * either an error return or a flag (ret > 0) from send_sig_info
+ * indicating that the signal was either not queued or was queued
+ * without an info block.  In this case, we will not get a call back to
+ * do_schedule_next_timer() so we do it here.  This should be rare...
+
+ */
+
+static void
+timer_notify_task(struct k_itimer *timr)
+{
+	struct siginfo info;
+	int ret;
+
+	memset(&info, 0, sizeof (info));
+
+	/* Send signal to the process that owns this timer. */
+	info.si_signo = timr->it_sigev_signo;
+	info.si_errno = 0;
+	info.si_code = SI_TIMER;
+	info.si_tid = timr->it_id;
+	info.si_value = timr->it_sigev_value;
+	if (timr->it_incr == 0) {
+		set_timer_inactive(timr);
+	} else {
+		timr->it_requeue_pending = info.si_sys_private = 1;
+	}
+	ret = send_sig_info(info.si_signo, &info, timr->it_process);
+	switch (ret) {
+
+	default:
+		/*
+		 * Signal was not sent.  May or may not need to
+		 * restart the timer.
+		 */
+		printk(KERN_WARNING "sending signal failed: %d\n", ret);
+	case 1:
+		/*
+		 * signal was not sent because of sig_ignor or,
+		 * possibly no queue memory OR will be sent but,
+		 * we will not get a call back to restart it AND
+		 * it should be restarted. 
+		 */
+		schedule_next_timer(timr);
+	case 0:
+		/* 
+		 * all's well new signal queued
+		 */
+		break;
+	}
+}
+
+/*
+
+ * This function gets called when a POSIX.1b interval timer expires.  It
+ * is used as a callback from the kernel internal timer.  The
+ * run_timer_list code ALWAYS calls with interrupts on.
+
+ */
+static void
+posix_timer_fn(unsigned long __data)
+{
+	struct k_itimer *timr = (struct k_itimer *) __data;
+	long flags;
+
+	spin_lock_irqsave(&timr->it_lock, flags);
+	timer_notify_task(timr);
+	unlock_timer(timr, flags);
+}
+
+/*
+ * For some reason mips/mips64 define the SIGEV constants plus 128.  
+ * Here we define a mask to get rid of the common bits.	 The 
+ * optimizer should make this costless to all but mips.
+ */
+#if defined(__mips__)
+#define MIPS_SIGEV ~(SIGEV_NONE & \
+		      SIGEV_SIGNAL & \
+		      SIGEV_THREAD &  \
+		      SIGEV_THREAD_ID)
+#else
+#define MIPS_SIGEV (int)-1
+#endif
+
+static inline struct task_struct *
+good_sigevent(sigevent_t * event)
+{
+	struct task_struct *rtn = current;
+
+	if (event->sigev_notify & SIGEV_THREAD_ID & MIPS_SIGEV) {
+		if (!(rtn =
+		      find_task_by_pid(event->sigev_notify_thread_id)) ||
+		    rtn->tgid != current->tgid) {
+			return NULL;
+		}
+	}
+	if (event->sigev_notify & SIGEV_SIGNAL & MIPS_SIGEV) {
+		if ((unsigned) event->sigev_signo > SIGRTMAX)
+			return NULL;
+	}
+	if (event->sigev_notify & ~(SIGEV_SIGNAL | SIGEV_THREAD_ID)) {
+		return NULL;
+	}
+	return rtn;
+}
+
+void
+register_posix_clock(int clock_id, struct k_clock *new_clock)
+{
+	if ((unsigned) clock_id >= MAX_CLOCKS) {
+		printk("POSIX clock register failed for clock_id %d\n",
+		       clock_id);
+		return;
+	}
+	posix_clocks[clock_id] = *new_clock;
+}
+
+static struct k_itimer *
+alloc_posix_timer(void)
+{
+	struct k_itimer *tmr = kmem_cache_alloc(posix_timers_cache, GFP_KERNEL);
+	if (tmr)	/* allocation can fail; the caller checks for NULL */
+		memset(tmr, 0, sizeof (struct k_itimer));
+	return (tmr);
+}
+
+static void
+release_posix_timer(struct k_itimer *tmr)
+{
+	if (tmr->it_id > 0)
+		idr_remove(&posix_timers_id, tmr->it_id);
+	kmem_cache_free(posix_timers_cache, tmr);
+}
+
+/* Create a POSIX.1b interval timer. */
+
+asmlinkage int
+sys_timer_create(clockid_t which_clock,
+		 struct sigevent *timer_event_spec, timer_t * created_timer_id)
+{
+	int error = 0;
+	struct k_itimer *new_timer = NULL;
+	timer_t new_timer_id;
+	struct task_struct *process = 0;
+	sigevent_t event;
+
+	if ((unsigned) which_clock >= MAX_CLOCKS ||
+	    !posix_clocks[which_clock].res) return -EINVAL;
+
+	new_timer = alloc_posix_timer();
+	if (new_timer == NULL)
+		return -EAGAIN;
+
+	spin_lock_init(&new_timer->it_lock);
+	new_timer_id = (timer_t) idr_get_new(&posix_timers_id,
+					     (void *) new_timer);
+	new_timer->it_id = new_timer_id;
+	if (new_timer_id == 0) {
+		error = -EAGAIN;
+		goto out;
+	}
+	/*
+	 * return the timer_id now.  The next step is hard to 
+	 * back out if there is an error.
+	 */
+	if (copy_to_user(created_timer_id,
+			 &new_timer_id, sizeof (new_timer_id))) {
+		error = -EFAULT;
+		goto out;
+	}
+	if (timer_event_spec) {
+		if (copy_from_user(&event, timer_event_spec, sizeof (event))) {
+			error = -EFAULT;
+			goto out;
+		}
+		read_lock(&tasklist_lock);
+		if ((process = good_sigevent(&event))) {
+			/*
+
+			 * We may be setting up this process for another
+			 * thread.  It may be exiting.  To catch this
+			 * case we check the PF_EXITING flag.  If
+			 * the flag is not set, the task_lock will catch
+			 * him before it is too late (in exit_itimers).
+
+			 * The exec case is a bit more involved but easy
+			 * to code.  If the process is in our thread
+			 * group (and it must be or we would not allow
+			 * it here) and is doing an exec, it will cause
+			 * us to be killed.  In this case it will wait
+			 * for us to die which means we can finish this
+			 * linkage with our last gasp. I.e. no code :)
+
+			 */
+			task_lock(process);
+			if (!(process->flags & PF_EXITING)) {
+				list_add(&new_timer->list,
+					 &process->posix_timers);
+				task_unlock(process);
+			} else {
+				task_unlock(process);
+				process = 0;
+			}
+		}
+		read_unlock(&tasklist_lock);
+		if (!process) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_timer->it_sigev_notify = event.sigev_notify;
+		new_timer->it_sigev_signo = event.sigev_signo;
+		new_timer->it_sigev_value = event.sigev_value;
+	} else {
+		new_timer->it_sigev_notify = SIGEV_SIGNAL;
+		new_timer->it_sigev_signo = SIGALRM;
+		new_timer->it_sigev_value.sival_int = new_timer->it_id;
+		process = current;
+		task_lock(process);
+		list_add(&new_timer->list, &process->posix_timers);
+		task_unlock(process);
+	}
+
+	new_timer->it_clock = which_clock;
+	new_timer->it_incr = 0;
+	new_timer->it_overrun = -1;
+	init_timer(&new_timer->it_timer);
+	new_timer->it_timer.expires = 0;
+	new_timer->it_timer.data = (unsigned long) new_timer;
+	new_timer->it_timer.function = posix_timer_fn;
+	set_timer_inactive(new_timer);
+
+	/*
+	 * Once we set the process, it can be found so do it last...
+	 */
+	new_timer->it_process = process;
+
+      out:
+	if (error) {
+		release_posix_timer(new_timer);
+	}
+	return error;
+}
+
+/*
+ * good_timespec
+ *
+ * This function checks the elements of a timespec structure.
+ *
+ * Arguments:
+ * ts	     : Pointer to the timespec structure to check
+ *
+ * Return value: 
+ * If a NULL pointer was passed in, or the tv_nsec field was less than 0
+ * or not less than NSEC_PER_SEC, or the tv_sec field was less than 0,
+ * this function returns 0. Otherwise it returns 1.
+
+ */
+
+static int
+good_timespec(const struct timespec *ts)
+{
+	if ((ts == NULL) ||
+	    (ts->tv_sec < 0) ||
+	    ((unsigned) ts->tv_nsec >= NSEC_PER_SEC)) return 0;
+	return 1;
+}
+
+static inline void
+unlock_timer(struct k_itimer *timr, long flags)
+{
+	spin_unlock_irqrestore(&timr->it_lock, flags);
+}
+
+/*
+
+ * Locking issues: We need to protect the result of the id look up until
+ * we get the timer locked down so it is not deleted under us.  The
+ * removal is done under the idr spinlock so we use that here to bridge
+ * the find to the timer lock.  To avoid a dead lock, the timer id MUST
+ * be released without holding the timer lock.
+
+ */
+static struct k_itimer *
+lock_timer(timer_t timer_id, long *flags)
+{
+	struct k_itimer *timr;
+
+	idr_lock(&posix_timers_id);
+	timr = (struct k_itimer *) idr_find_nolock(&posix_timers_id,
+						   (int) timer_id);
+	if (timr) {
+		spin_lock_irqsave(&timr->it_lock, *flags);
+		idr_unlock(&posix_timers_id);
+
+		if (timr->it_id != timer_id) {
+			BUG();
+		}
+		if (!(timr->it_process) ||
+		    timr->it_process->tgid != current->tgid) {
+			unlock_timer(timr, *flags);
+			timr = NULL;
+		}
+	} else {
+		idr_unlock(&posix_timers_id);
+	}
+
+	return timr;
+}
+
+/* 
+
+ * Get the time remaining on a POSIX.1b interval timer.  This function
+ * is ALWAYS called with spin_lock_irq on the timer, thus it must not
+ * mess with irq.
+
+ * We have a couple of messes to clean up here.  First there is the case
+ * of a timer that has a requeue pending.  These timers should appear to
+ * be in the timer list with an expiry as if we were to requeue them
+ * now.
+
+ * The second issue is the SIGEV_NONE timer which may be active but is
+ * not really ever put in the timer list (to save system resources).
+ * This timer may be expired, and if so, we will do it here.  Otherwise
+ * it is the same as a requeue pending timer WRT to what we should
+ * report.
+
+ */
+void inline
+do_timer_gettime(struct k_itimer *timr, struct itimerspec *cur_setting)
+{
+	long sub_expires;
+	unsigned long expires;
+	struct now_struct now;
+
+	do {
+		expires = timr->it_timer.expires;
+	} while ((volatile long) (timr->it_timer.expires) != expires);
+
+	posix_get_now(&now);
+
+	if (expires && (timr->it_sigev_notify & SIGEV_NONE) && !timr->it_incr) {
+		if (posix_time_before(&timr->it_timer, &now)) {
+			timr->it_timer.expires = expires = 0;
+		}
+	}
+	if (expires) {
+		if (timr->it_requeue_pending ||
+		    (timr->it_sigev_notify & SIGEV_NONE)) {
+			while (posix_time_before(&timr->it_timer, &now)) {
+				posix_bump_timer(timr);
+			};
+		} else {
+			if (!timer_pending(&timr->it_timer)) {
+				sub_expires = expires = 0;
+			}
+		}
+		if (expires) {
+			expires -= now.jiffies;
+		}
+	}
+	jiffies_to_timespec(expires, &cur_setting->it_value);
+	jiffies_to_timespec(timr->it_incr, &cur_setting->it_interval);
+
+	if (cur_setting->it_value.tv_sec < 0) {
+		cur_setting->it_value.tv_nsec = 1;
+		cur_setting->it_value.tv_sec = 0;
+	}
+}
+/* Get the time remaining on a POSIX.1b interval timer. */
+asmlinkage int
+sys_timer_gettime(timer_t timer_id, struct itimerspec *setting)
+{
+	struct k_itimer *timr;
+	struct itimerspec cur_setting;
+	long flags;
+
+	timr = lock_timer(timer_id, &flags);
+	if (!timr)
+		return -EINVAL;
+
+	p_timer_get(&posix_clocks[timr->it_clock], timr, &cur_setting);
+
+	unlock_timer(timr, flags);
+
+	if (copy_to_user(setting, &cur_setting, sizeof (cur_setting)))
+		return -EFAULT;
+
+	return 0;
+}
+/*
+
+ * Get the number of overruns of a POSIX.1b interval timer.  This is to
+ * be the overrun of the timer last delivered.  At the same time we are
+ * accumulating overruns on the next timer.  The overrun is frozen when
+ * the signal is delivered, either at the notify time (if the info block
+ * is not queued) or at the actual delivery time (as we are informed by
+ * the call back to do_schedule_next_timer()).  So all we need to do is
+ * to pick up the frozen overrun.
+
+ */
+
+asmlinkage int
+sys_timer_getoverrun(timer_t timer_id)
+{
+	struct k_itimer *timr;
+	int overrun;
+	long flags;
+
+	timr = lock_timer(timer_id, &flags);
+	if (!timr)
+		return -EINVAL;
+
+	overrun = timr->it_overrun_last;
+	unlock_timer(timr, flags);
+
+	return overrun;
+}
+/* Adjust for absolute time */
+/*
+ * If absolute time is given and it is not CLOCK_MONOTONIC, we need to
+ * adjust for the offset between the timer clock (CLOCK_MONOTONIC) and
+ * whatever clock the caller is using.
+ *
+ * If it is relative time, we need to add the current (CLOCK_MONOTONIC)
+ * time to it to get the proper time for the timer.
+ */
+static int
+adjust_abs_time(struct k_clock *clock, struct timespec *tp, int abs)
+{
+	struct timespec now;
+	struct timespec oc;
+	do_posix_clock_monotonic_gettime(&now);
+
+	if (abs &&
+	    (posix_clocks[CLOCK_MONOTONIC].clock_get == clock->clock_get)) {
+	} else {
+
+		if (abs) {
+			do_posix_gettime(clock, &oc);
+		} else {
+			oc.tv_nsec = oc.tv_sec = 0;
+		}
+		tp->tv_sec += now.tv_sec - oc.tv_sec;
+		tp->tv_nsec += now.tv_nsec - oc.tv_nsec;
+
+		/* 
+		 * Normalize...
+		 */
+		if ((tp->tv_nsec - NSEC_PER_SEC) >= 0) {
+			tp->tv_nsec -= NSEC_PER_SEC;
+			tp->tv_sec++;
+		}
+		if ((tp->tv_nsec) < 0) {
+			tp->tv_nsec += NSEC_PER_SEC;
+			tp->tv_sec--;
+		}
+	}
+	/*
+	 * Check if the requested time is prior to now (if so set now) or
+	 * is more than the timer code can handle (if so we error out).
+	 * The (unsigned) catches the case of prior to "now" with the same
+	 * test.  Only on failure do we sort out what happened, and then
+	 * we use the (unsigned) to error out negative seconds.
+	 */
+	if ((unsigned) (tp->tv_sec - now.tv_sec) > (MAX_JIFFY_OFFSET / HZ)) {
+		if ((unsigned) tp->tv_sec < now.tv_sec) {
+			tp->tv_sec = now.tv_sec;
+			tp->tv_nsec = now.tv_nsec;
+		} else {
+			// tp->tv_sec = now.tv_sec + (MAX_JIFFY_OFFSET / HZ);
+			/*
+			 * This is a considered response, not exactly in
+			 * line with the standard (in fact it is silent on
+			 * possible overflows).  We assume such a large 
+			 * value is ALMOST always a programming error and
+			 * try not to compound it by setting a really dumb
+			 * value.
+			 */
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+/* Set a POSIX.1b interval timer. */
+/* timr->it_lock is taken. */
+static inline int
+do_timer_settime(struct k_itimer *timr, int flags,
+		 struct itimerspec *new_setting, struct itimerspec *old_setting)
+{
+	struct k_clock *clock = &posix_clocks[timr->it_clock];
+
+	if (old_setting) {
+		do_timer_gettime(timr, old_setting);
+	}
+
+	/* disable the timer */
+	timr->it_incr = 0;
+	/* 
+	 * careful here.  If smp we could be in the "fire" routine which will
+	 * be spinning as we hold the lock.  But this is ONLY an SMP issue.
+	 */
+#ifdef CONFIG_SMP
+	if (timer_active(timr) && !del_timer(&timr->it_timer)) {
+		/*
+		 * It can only be active if on another cpu.  Since
+		 * we have cleared the interval stuff above, it should
+		 * clear once we release the spin lock.  Of course once
+		 * we do that anything could happen, including the 
+		 * complete melt down of the timer.  So return with 
+		 * a "retry" exit status.
+		 */
+		return TIMER_RETRY;
+	}
+	set_timer_inactive(timr);
+#else
+	del_timer(&timr->it_timer);
+#endif
+	timr->it_requeue_pending = 0;
+	timr->it_overrun_last = 0;
+	timr->it_overrun = -1;
+	/* 
+	 * Switch off the timer when it_value is zero.
+	 */
+	if ((new_setting->it_value.tv_sec == 0) &&
+	    (new_setting->it_value.tv_nsec == 0)) {
+		timr->it_timer.expires = 0;
+		return 0;
+	}
+
+	if ((flags & TIMER_ABSTIME) &&
+	    (clock->clock_get != do_posix_clock_monotonic_gettime)) {
+	}
+	if (adjust_abs_time(clock,
+			    &new_setting->it_value, flags & TIMER_ABSTIME)) {
+		return -EINVAL;
+	}
+	tstotimer(new_setting, timr);
+
+	/*
+	 * For some reason the timer does not fire immediately if expires is
+	 * equal to jiffies, so the timer notify function is called directly.
+	 * We do not even queue SIGEV_NONE timers!
+	 */
+	if (!(timr->it_sigev_notify & SIGEV_NONE)) {
+		if (timr->it_timer.expires == jiffies) {
+			timer_notify_task(timr);
+		} else
+			add_timer(&timr->it_timer);
+	}
+	return 0;
+}
+
+/* Set a POSIX.1b interval timer */
+asmlinkage int
+sys_timer_settime(timer_t timer_id, int flags,
+		  const struct itimerspec *new_setting,
+		  struct itimerspec *old_setting)
+{
+	struct k_itimer *timr;
+	struct itimerspec new_spec, old_spec;
+	int error = 0;
+	long flag;
+	struct itimerspec *rtn = old_setting ? &old_spec : NULL;
+
+	if (new_setting == NULL) {
+		return -EINVAL;
+	}
+
+	if (copy_from_user(&new_spec, new_setting, sizeof (new_spec))) {
+		return -EFAULT;
+	}
+
+	if ((!good_timespec(&new_spec.it_interval)) ||
+	    (!good_timespec(&new_spec.it_value))) {
+		return -EINVAL;
+	}
+      retry:
+	timr = lock_timer(timer_id, &flag);
+	if (!timr)
+		return -EINVAL;
+
+	if (!posix_clocks[timr->it_clock].timer_set) {
+		error = do_timer_settime(timr, flags, &new_spec, rtn);
+	} else {
+		error = posix_clocks[timr->it_clock].timer_set(timr,
+							       flags,
+							       &new_spec, rtn);
+	}
+	unlock_timer(timr, flag);
+	if (error == TIMER_RETRY) {
+		rtn = NULL;	// We already got the old time...
+		goto retry;
+	}
+
+	if (old_setting && !error) {
+		if (copy_to_user(old_setting, &old_spec, sizeof (old_spec))) {
+			error = -EFAULT;
+		}
+	}
+
+	return error;
+}
+
+static inline int
+do_timer_delete(struct k_itimer *timer)
+{
+	timer->it_incr = 0;
+#ifdef CONFIG_SMP
+	if (timer_active(timer) &&
+	    !del_timer(&timer->it_timer) && !timer->it_requeue_pending) {
+		/*
+		 * It can only be active if on another cpu.  Since
+		 * we have cleared the interval stuff above, it should
+		 * clear once we release the spin lock.  Of course once
+		 * we do that anything could happen, including the 
+		 * complete melt down of the timer.  So return with 
+		 * a "retry" exit status.
+		 */
+		return TIMER_RETRY;
+	}
+#else
+	del_timer(&timer->it_timer);
+#endif
+	return 0;
+}
+
+/* Delete a POSIX.1b interval timer. */
+asmlinkage int
+sys_timer_delete(timer_t timer_id)
+{
+	struct k_itimer *timer;
+	long flags;
+
+#ifdef CONFIG_SMP
+	int error;
+      retry_delete:
+#endif
+
+	timer = lock_timer(timer_id, &flags);
+	if (!timer)
+		return -EINVAL;
+
+#ifdef CONFIG_SMP
+	error = p_timer_del(&posix_clocks[timer->it_clock], timer);
+
+	if (error == TIMER_RETRY) {
+		unlock_timer(timer, flags);
+		goto retry_delete;
+	}
+#else
+	p_timer_del(&posix_clocks[timer->it_clock], timer);
+#endif
+
+	task_lock(timer->it_process);
+
+	list_del(&timer->list);
+
+	task_unlock(timer->it_process);
+
+	/*
+	 * This keeps any tasks waiting on the spin lock from thinking
+	 * they got something (see the lock code above).
+	 */
+	timer->it_process = NULL;
+	unlock_timer(timer, flags);
+	release_posix_timer(timer);
+	return 0;
+}
+/*
+ * return  timer owned by the process, used by exit_itimers
+ */
+static inline void
+itimer_delete(struct k_itimer *timer)
+{
+	if (sys_timer_delete(timer->it_id)) {
+		BUG();
+	}
+}
+/*
+ * This is exported to exit and exec
+ */
+void
+exit_itimers(struct task_struct *tsk)
+{
+	struct k_itimer *tmr;
+
+	task_lock(tsk);
+	while (!list_empty(&tsk->posix_timers)) {
+		tmr = list_entry(tsk->posix_timers.next, struct k_itimer, list);
+		task_unlock(tsk);
+		itimer_delete(tmr);
+		task_lock(tsk);
+	}
+	task_unlock(tsk);
+}
+
+/*
+ * And now for the "clock" calls
+
+ * These functions are called both from timer functions (with the timer
+ * spin_lock_irq() held) and from clock calls with no locking.  They must
+ * use the save flags versions of locks.
+ */
+static int
+do_posix_gettime(struct k_clock *clock, struct timespec *tp)
+{
+
+	if (clock->clock_get) {
+		return clock->clock_get(tp);
+	}
+
+	do_gettimeofday((struct timeval *) tp);
+	tp->tv_nsec *= NSEC_PER_USEC;
+	return 0;
+}
+
+/*
+ * We do ticks here to avoid the irq lock ( they take sooo long)
+ * Note also that the while loop assures that the sub_jiff_offset
+ * will be less than a jiffie, thus no need to normalize the result.
+ * Well, not really, if called with ints off :(
+ */
+
+int
+do_posix_clock_monotonic_gettime(struct timespec *tp)
+{
+	long sub_sec;
+	u64 jiffies_64_f;
+
+#if (BITS_PER_LONG > 32)
+
+	jiffies_64_f = jiffies_64;
+
+#elif defined(CONFIG_SMP)
+
+	/* Tricks don't work here, must take the lock.   Remember, called
+	 * above from both timer and clock system calls => save flags.
+	 */
+	{
+		unsigned long flags;
+		read_lock_irqsave(&xtime_lock, flags);
+		jiffies_64_f = jiffies_64;
+
+		read_unlock_irqrestore(&xtime_lock, flags);
+	}
+#elif ! defined(CONFIG_SMP) && (BITS_PER_LONG < 64)
+	unsigned long jiffies_f;
+	do {
+		jiffies_f = jiffies;
+		barrier();
+		jiffies_64_f = jiffies_64;
+	} while (unlikely(jiffies_f != jiffies));
+
+#endif
+	tp->tv_sec = div_long_long_rem(jiffies_64_f, HZ, &sub_sec);
+
+	tp->tv_nsec = sub_sec * (NSEC_PER_SEC / HZ);
+	return 0;
+}
+
+int
+do_posix_clock_monotonic_settime(struct timespec *tp)
+{
+	return -EINVAL;
+}
+
+asmlinkage int
+sys_clock_settime(clockid_t which_clock, const struct timespec *tp)
+{
+	struct timespec new_tp;
+
+	if ((unsigned) which_clock >= MAX_CLOCKS ||
+	    !posix_clocks[which_clock].res) return -EINVAL;
+	if (copy_from_user(&new_tp, tp, sizeof (*tp)))
+		return -EFAULT;
+	if (posix_clocks[which_clock].clock_set) {
+		return posix_clocks[which_clock].clock_set(&new_tp);
+	}
+	new_tp.tv_nsec /= NSEC_PER_USEC;
+	return do_sys_settimeofday((struct timeval *) &new_tp, NULL);
+}
+asmlinkage int
+sys_clock_gettime(clockid_t which_clock, struct timespec *tp)
+{
+	struct timespec rtn_tp;
+	int error = 0;
+
+	if ((unsigned) which_clock >= MAX_CLOCKS ||
+	    !posix_clocks[which_clock].res) return -EINVAL;
+
+	error = do_posix_gettime(&posix_clocks[which_clock], &rtn_tp);
+
+	if (!error) {
+		if (copy_to_user(tp, &rtn_tp, sizeof (rtn_tp))) {
+			error = -EFAULT;
+		}
+	}
+	return error;
+
+}
+asmlinkage int
+sys_clock_getres(clockid_t which_clock, struct timespec *tp)
+{
+	struct timespec rtn_tp;
+
+	if ((unsigned) which_clock >= MAX_CLOCKS ||
+	    !posix_clocks[which_clock].res) return -EINVAL;
+
+	rtn_tp.tv_sec = 0;
+	rtn_tp.tv_nsec = posix_clocks[which_clock].res;
+	if (tp) {
+		if (copy_to_user(tp, &rtn_tp, sizeof (rtn_tp))) {
+			return -EFAULT;
+		}
+	}
+	return 0;
+
+}
+static void
+nanosleep_wake_up(unsigned long __data)
+{
+	struct task_struct *p = (struct task_struct *) __data;
+
+	wake_up_process(p);
+}
+
+/*
+ * The standard says that an absolute nanosleep call MUST wake up at
+ * the requested time in spite of clock settings.  Here is what we do:
+ * For each nanosleep call that needs it (only absolute and not on 
+ * CLOCK_MONOTONIC* (as it can not be set)) we thread a little structure
+ * into the "nanosleep_abs_list".  All we need is the task_struct pointer.
+ * Whenever the clock is set we just wake up all those tasks.  The rest
+ * is done by the while loop in clock_nanosleep().
+
+ * On locking, clock_was_set() is called from update_wall_clock which 
+ * holds (or has held for it) a write_lock_irq( xtime_lock) and is 
+ * called from the timer bh code.  Thus we need the irq save locks.
+ */
+spinlock_t nanosleep_abs_list_lock = SPIN_LOCK_UNLOCKED;
+
+struct list_head nanosleep_abs_list = LIST_HEAD_INIT(nanosleep_abs_list);
+
+struct abs_struct {
+	struct list_head list;
+	struct task_struct *t;
+};
+
+void
+clock_was_set(void)
+{
+	struct list_head *pos;
+	unsigned long flags;
+
+	spin_lock_irqsave(&nanosleep_abs_list_lock, flags);
+	list_for_each(pos, &nanosleep_abs_list) {
+		wake_up_process(list_entry(pos, struct abs_struct, list)->t);
+	}
+	spin_unlock_irqrestore(&nanosleep_abs_list_lock, flags);
+}
+
+long clock_nanosleep_restart(struct restart_block *restart_block);
+
+extern long do_clock_nanosleep(clockid_t which_clock, int flags, 
+			       struct timespec *t);
+
+#ifdef FOLD_NANO_SLEEP_INTO_CLOCK_NANO_SLEEP
+
+asmlinkage long
+sys_nanosleep(struct timespec *rqtp, struct timespec *rmtp)
+{
+	struct timespec t;
+	long ret;
+
+	if (copy_from_user(&t, rqtp, sizeof (t)))
+		return -EFAULT;
+
+	ret = do_clock_nanosleep(CLOCK_REALTIME, 0, &t);
+
+	if (ret == -ERESTART_RESTARTBLOCK && rmtp && 
+	    copy_to_user(rmtp, &t, sizeof (t)))
+			return -EFAULT;
+	return ret;
+}
+#endif				// FOLD_NANO_SLEEP_INTO_CLOCK_NANO_SLEEP
+
+asmlinkage long
+sys_clock_nanosleep(clockid_t which_clock, int flags,
+		    const struct timespec *rqtp, struct timespec *rmtp)
+{
+	struct timespec t;
+	int ret;
+
+	if ((unsigned) which_clock >= MAX_CLOCKS ||
+	    !posix_clocks[which_clock].res) return -EINVAL;
+
+	if (copy_from_user(&t, rqtp, sizeof (struct timespec)))
+		return -EFAULT;
+
+	if ((unsigned) t.tv_nsec >= NSEC_PER_SEC || t.tv_sec < 0)
+		return -EINVAL;
+
+	ret = do_clock_nanosleep(which_clock, flags, &t);
+
+	if ((ret == -ERESTART_RESTARTBLOCK) && rmtp && 
+	    copy_to_user(rmtp, &t, sizeof (t)))
+			return -EFAULT;
+	return ret;
+
+}
+
+long
+do_clock_nanosleep(clockid_t which_clock, int flags, struct timespec *tsave)
+{
+	struct timespec t;
+	struct timer_list new_timer;
+	struct abs_struct abs_struct = { list:{next:0} };
+	int abs;
+	int rtn = 0;
+	int active;
+	struct restart_block *restart_block =
+	    &current_thread_info()->restart_block;
+
+	init_timer(&new_timer);
+	new_timer.expires = 0;
+	new_timer.data = (unsigned long) current;
+	new_timer.function = nanosleep_wake_up;
+	abs = flags & TIMER_ABSTIME;
+
+	if (restart_block->fn == clock_nanosleep_restart) {
+		/*
+		 * Interrupted by a non-delivered signal, pick up remaining
+		 * time and continue.
+		 */
+		restart_block->fn = do_no_restart_syscall;
+		if (!restart_block->arg2)
+			return -EINTR;
+
+		new_timer.expires = restart_block->arg2;
+		if (time_before(new_timer.expires, jiffies))
+			return 0;
+	}
+
+	if (abs && (posix_clocks[which_clock].clock_get !=
+		    posix_clocks[CLOCK_MONOTONIC].clock_get)) {
+		spin_lock_irq(&nanosleep_abs_list_lock);
+		list_add(&abs_struct.list, &nanosleep_abs_list);
+		abs_struct.t = current;
+		spin_unlock_irq(&nanosleep_abs_list_lock);
+	}
+	do {
+		t = *tsave;
+		if ((abs || !new_timer.expires) &&
+		    !(rtn = adjust_abs_time(&posix_clocks[which_clock],
+					    &t, abs))) {
+			/*
+			 * On error, we don't set up the timer, so
+			 * it is never armed and
+			 * del_timer_sync() will return 0, thus
+			 * active is zero... and so it goes.
+			 */
+
+			tstojiffie(&t,
+				   posix_clocks[which_clock].res,
+				   &new_timer.expires);
+		}
+		if (new_timer.expires) {
+			current->state = TASK_INTERRUPTIBLE;
+			add_timer(&new_timer);
+
+			schedule();
+		}
+	}
+	while ((active = del_timer_sync(&new_timer)) &&
+	       !test_thread_flag(TIF_SIGPENDING));
+
+	if (abs_struct.list.next) {
+		spin_lock_irq(&nanosleep_abs_list_lock);
+		list_del(&abs_struct.list);
+		spin_unlock_irq(&nanosleep_abs_list_lock);
+	}
+	if (active) {
+		unsigned long jiffies_f = jiffies;
+
+		/*
+		 * Always restart abs calls from scratch to pick up any
+		 * clock shifting that happened while we are away.
+		 */
+		if (abs)
+			return -ERESTARTNOHAND;
+
+		jiffies_to_timespec(new_timer.expires - jiffies_f, tsave);
+
+		while (tsave->tv_nsec < 0) {
+			tsave->tv_nsec += NSEC_PER_SEC;
+			tsave->tv_sec--;
+		}
+		if (tsave->tv_sec < 0) {
+			tsave->tv_sec = 0;
+			tsave->tv_nsec = 1;
+		}
+		restart_block->fn = clock_nanosleep_restart;
+		restart_block->arg0 = which_clock;
+		restart_block->arg1 = (unsigned long)tsave;
+		restart_block->arg2 = new_timer.expires;
+		return -ERESTART_RESTARTBLOCK;
+	}
+
+	return rtn;
+}
+/*
+ * This will restart either nanosleep or clock_nanosleep
+ */
+long
+clock_nanosleep_restart(struct restart_block *restart_block)
+{
+	struct timespec t;
+	int ret = do_clock_nanosleep(restart_block->arg0, 0, &t);
+
+	if ((ret == -ERESTART_RESTARTBLOCK) && restart_block->arg1 && 
+	    copy_to_user((struct timespec *)(restart_block->arg1), &t, 
+			 sizeof (t)))
+		return -EFAULT;
+	return ret;
+}
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/kernel/signal.c linux/kernel/signal.c
--- linux-2.5.52-bk4-kb/kernel/signal.c	Thu Dec 19 12:13:18 2002
+++ linux/kernel/signal.c	Thu Dec 19 12:16:01 2002
@@ -457,8 +457,6 @@
 		if (!collect_signal(sig, pending, info))
 			sig = 0;
 				
-		/* XXX: Once POSIX.1b timers are in, if si_code == SI_TIMER,
-		   we need to xchg out the timer overrun values.  */
 	}
 	recalc_sigpending();
 
@@ -473,6 +471,7 @@
  */
 int dequeue_signal(sigset_t *mask, siginfo_t *info)
 {
+	int ret;
 	/*
 	 * Here we handle shared pending signals. To implement the full
 	 * semantics we need to unqueue and resend them. It will likely
@@ -483,7 +482,13 @@
 		if (signr)
 			__send_sig_info(signr, info, current);
 	}
-	return __dequeue_signal(&current->pending, mask, info);
+	ret = __dequeue_signal(&current->pending, mask, info);
+	if ( ret &&
+	     ((info->si_code & __SI_MASK) == __SI_TIMER) &&
+	     info->si_sys_private){
+		do_schedule_next_timer(info);
+	}
+	return ret;
 }
 
 static int rm_from_queue(int sig, struct sigpending *s)
@@ -622,6 +627,7 @@
 static int send_signal(int sig, struct siginfo *info, struct sigpending *signals)
 {
 	struct sigqueue * q = NULL;
+	int ret = 0;
 
 	/*
 	 * fast-pathed signals for kernel-internal things like SIGSTOP
@@ -665,17 +671,26 @@
 				copy_siginfo(&q->info, info);
 				break;
 		}
-	} else if (sig >= SIGRTMIN && info && (unsigned long)info != 1
+	} else {
+		if (sig >= SIGRTMIN && info && (unsigned long)info != 1
 		   && info->si_code != SI_USER)
 		/*
 		 * Queue overflow, abort.  We may abort if the signal was rt
 		 * and sent by user using something other than kill().
 		 */
-		return -EAGAIN;
+			return -EAGAIN;
+
+		if (((unsigned long)info > 1) && (info->si_code == SI_TIMER))
+			/*
+			 * Set up a return to indicate that we dropped 
+			 * the signal.
+			 */
+			ret = info->si_sys_private;
+	}
 
 out_set:
 	sigaddset(&signals->signal, sig);
-	return 0;
+	return ret;
 }
 
 /*
@@ -715,7 +730,7 @@
 {
 	int retval = send_signal(sig, info, &t->pending);
 
-	if (!retval && !sigismember(&t->blocked, sig))
+	if ((retval >= 0) && !sigismember(&t->blocked, sig))
 		signal_wake_up(t);
 
 	return retval;
@@ -751,6 +766,12 @@
 
 	handle_stop_signal(sig, t);
 
+	if (((unsigned long)info > 2) && (info->si_code == SI_TIMER))
+		/*
+		 * Set up a return to indicate that we dropped the signal.
+		 */
+		ret = info->si_sys_private;
+
 	/* Optimize away the signal, if it's a signal that can be
 	   handled immediately (ie non-blocked and untraced) and
 	   that is ignored (either explicitly or by default).  */
@@ -1478,8 +1499,9 @@
 		err |= __put_user(from->si_uid, &to->si_uid);
 		break;
 	case __SI_TIMER:
-		err |= __put_user(from->si_timer1, &to->si_timer1);
-		err |= __put_user(from->si_timer2, &to->si_timer2);
+		 err |= __put_user(from->si_tid, &to->si_tid);
+		 err |= __put_user(from->si_overrun, &to->si_overrun);
+		 err |= __put_user(from->si_ptr, &to->si_ptr);
 		break;
 	case __SI_POLL:
 		err |= __put_user(from->si_band, &to->si_band);
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-kb/kernel/timer.c linux/kernel/timer.c
--- linux-2.5.52-bk4-kb/kernel/timer.c	Thu Dec 19 12:13:18 2002
+++ linux/kernel/timer.c	Thu Dec 19 12:16:01 2002
@@ -49,12 +49,11 @@
 	struct list_head vec[TVR_SIZE];
 } tvec_root_t;
 
-typedef struct timer_list timer_t;
 
 struct tvec_t_base_s {
 	spinlock_t lock;
 	unsigned long timer_jiffies;
-	timer_t *running_timer;
+	struct timer_list *running_timer;
 	tvec_root_t tv1;
 	tvec_t tv2;
 	tvec_t tv3;
@@ -67,7 +66,7 @@
 /* Fake initialization */
 static DEFINE_PER_CPU(tvec_base_t, tvec_bases) = { SPIN_LOCK_UNLOCKED };
 
-static void check_timer_failed(timer_t *timer)
+static void check_timer_failed(struct timer_list *timer)
 {
 	static int whine_count;
 	if (whine_count < 16) {
@@ -85,13 +84,13 @@
 	timer->magic = TIMER_MAGIC;
 }
 
-static inline void check_timer(timer_t *timer)
+static inline void check_timer(struct timer_list *timer)
 {
 	if (timer->magic != TIMER_MAGIC)
 		check_timer_failed(timer);
 }
 
-static inline void internal_add_timer(tvec_base_t *base, timer_t *timer)
+static inline void internal_add_timer(tvec_base_t *base, struct timer_list *timer)
 {
 	unsigned long expires = timer->expires;
 	unsigned long idx = expires - base->timer_jiffies;
@@ -143,7 +142,7 @@
  * Timers with an ->expired field in the past will be executed in the next
  * timer tick. It's illegal to add an already pending timer.
  */
-void add_timer(timer_t *timer)
+void add_timer(struct timer_list *timer)
 {
 	int cpu = get_cpu();
 	tvec_base_t *base = &per_cpu(tvec_bases, cpu);
@@ -201,7 +200,7 @@
  * (ie. mod_timer() of an inactive timer returns 0, mod_timer() of an
  * active timer returns 1.)
  */
-int mod_timer(timer_t *timer, unsigned long expires)
+int mod_timer(struct timer_list *timer, unsigned long expires)
 {
 	tvec_base_t *old_base, *new_base;
 	unsigned long flags;
@@ -278,7 +277,7 @@
  * (ie. del_timer() of an inactive timer returns 0, del_timer() of an
  * active timer returns 1.)
  */
-int del_timer(timer_t *timer)
+int del_timer(struct timer_list *timer)
 {
 	unsigned long flags;
 	tvec_base_t *base;
@@ -317,7 +316,7 @@
  *
  * The function returns whether it has deactivated a pending timer or not.
  */
-int del_timer_sync(timer_t *timer)
+int del_timer_sync(struct timer_list *timer)
 {
 	tvec_base_t *base;
 	int i, ret = 0;
@@ -360,9 +359,9 @@
 	 * detach them individually, just clear the list afterwards.
 	 */
 	while (curr != head) {
-		timer_t *tmp;
+		struct timer_list *tmp;
 
-		tmp = list_entry(curr, timer_t, entry);
+		tmp = list_entry(curr, struct timer_list, entry);
 		if (tmp->base != base)
 			BUG();
 		next = curr->next;
@@ -401,9 +400,9 @@
 		if (curr != head) {
 			void (*fn)(unsigned long);
 			unsigned long data;
-			timer_t *timer;
+			struct timer_list *timer;
 
-			timer = list_entry(curr, timer_t, entry);
+			timer = list_entry(curr, struct timer_list, entry);
  			fn = timer->function;
  			data = timer->data;
 
@@ -505,6 +504,7 @@
 	if (xtime.tv_sec % 86400 == 0) {
 	    xtime.tv_sec--;
 	    time_state = TIME_OOP;
+	    clock_was_set();
 	    printk(KERN_NOTICE "Clock: inserting leap second 23:59:60 UTC\n");
 	}
 	break;
@@ -513,6 +513,7 @@
 	if ((xtime.tv_sec + 1) % 86400 == 0) {
 	    xtime.tv_sec++;
 	    time_state = TIME_WAIT;
+	    clock_was_set();
 	    printk(KERN_NOTICE "Clock: deleting leap second 23:59:59 UTC\n");
 	}
 	break;
@@ -965,7 +966,7 @@
  */
 signed long schedule_timeout(signed long timeout)
 {
-	timer_t timer;
+	struct timer_list timer;
 	unsigned long expire;
 
 	switch (timeout)
@@ -1020,6 +1021,7 @@
 {
 	return current->pid;
 }
+#ifndef FOLD_NANO_SLEEP_INTO_CLOCK_NANO_SLEEP
 
 static long nanosleep_restart(struct restart_block *restart)
 {
@@ -1078,6 +1080,7 @@
 	}
 	return ret;
 }
+#endif // ! FOLD_NANO_SLEEP_INTO_CLOCK_NANO_SLEEP
 
 /*
  * sys_sysinfo - fill in sysinfo struct
Binary files linux-2.5.52-bk4-kb/scripts/kallsyms and linux/scripts/kallsyms differ
Binary files linux-2.5.52-bk4-kb/scripts/lxdialog/lxdialog and linux/scripts/lxdialog/lxdialog differ
Binary files linux-2.5.52-bk4-kb/usr/gen_init_cpio and linux/usr/gen_init_cpio differ
Binary files linux-2.5.52-bk4-kb/usr/initramfs_data.cpio.gz and linux/usr/initramfs_data.cpio.gz differ
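
For readers following adjust_abs_time() in the patch above,
here is a minimal user-space sketch of the same rebase and
normalize step.  The helper name rebase_timespec() is made
up for illustration and is not part of the patch; it also
assumes all three timespecs are already normalized, which is
why a single carry or borrow is sufficient.

```c
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Hypothetical helper (not part of the patch): move *tp from the
 * caller's clock onto the timer's clock by adding (now - oc), then
 * fold tv_nsec back into [0, NSEC_PER_SEC).
 */
static void rebase_timespec(struct timespec *tp,
			    const struct timespec *now,
			    const struct timespec *oc)
{
	/* All three inputs are assumed to be normalized already. */
	assert(tp->tv_nsec >= 0 && tp->tv_nsec < NSEC_PER_SEC);
	assert(now->tv_nsec >= 0 && now->tv_nsec < NSEC_PER_SEC);
	assert(oc->tv_nsec >= 0 && oc->tv_nsec < NSEC_PER_SEC);

	tp->tv_sec += now->tv_sec - oc->tv_sec;
	tp->tv_nsec += now->tv_nsec - oc->tv_nsec;

	/* The sum lies in (-NSEC_PER_SEC, 2 * NSEC_PER_SEC): one step each way. */
	if (tp->tv_nsec >= NSEC_PER_SEC) {
		tp->tv_nsec -= NSEC_PER_SEC;
		tp->tv_sec++;
	}
	if (tp->tv_nsec < 0) {
		tp->tv_nsec += NSEC_PER_SEC;
		tp->tv_sec--;
	}
}
```

For a relative time the caller would pass a zeroed oc, which
matches the "oc.tv_nsec = oc.tv_sec = 0" branch in the patch.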

^ permalink raw reply

* [PATCH 2/3] High-res-timers part 2 (x86 platform code) take 21
From: george anzinger @ 2002-12-20  9:52 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2019 bytes --]

This is the platform part of the high-res timers for the
x86.

Changes since last time:
Added a CONFIG dependency so that code needed only when
CONFIG_HIGH_RES_TIMERS=y is not turned on otherwise.
----------

This patch, in conjunction with the "core" high-res-timers
patch, implements high resolution timers on the i386
platforms.  The high-res-timers use the periodic interrupt
to "remind" the system to look at the clock.  The clock
should be relatively high resolution (1 microsecond or
better).  This patch allows configuring of three possible
clocks: the TSC, the ACPI pm timer, or the Programmable
Interrupt Timer (PIT).  Most of the changes in this patch
are in the arch/i386/kernel/timer/* code.
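
As a quick check of the "1 microsecond or better" figure,
the per-count resolution of the two fixed-rate sources can
be worked out from their nominal input frequencies.  This is
only a sketch: the frequencies below are the commonly quoted
hardware values, not constants taken from this patch, and
tick_ns() is a made-up helper name.

```c
#include <assert.h>

/* Nominal input frequencies in Hz (assumed, not taken from the patch). */
#define PM_TIMER_HZ 3579545UL	/* ACPI pm timer */
#define PIT_HZ      1193182UL	/* Programmable Interrupt Timer */

/* Length of one counter tick in nanoseconds, rounded to nearest. */
static unsigned long tick_ns(unsigned long hz)
{
	assert(hz != 0);
	return (1000000000UL + hz / 2) / hz;
}
```

tick_ns(PM_TIMER_HZ) and tick_ns(PIT_HZ) both come out well
under 1000 ns.  The TSC resolution is 1/CPU clock, so it
depends on the processor.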

This patch uses (if available) the APIC timer(s) to generate
1/HZ ticks and sub-1/HZ ticks as needed.  The PIT still
interrupts but, if the APIC timer is available, it only
triggers the wall clock update.  No attempt is made to make
this interrupt happen on jiffie boundaries; however, the
APIC timers are disciplined to expire on 1/HZ boundaries to
give consistent timer latencies with respect to the system
time.
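
To illustrate what "expire on 1/HZ boundaries" means, here
is a small sketch that computes how long a timer would have
to run to land on the next boundary.  It is not the patch's
discipline_timer(); the helper name, the nanosecond units
and the EXAMPLE_HZ value are all assumptions made for the
example.

```c
#include <assert.h>

#define EXAMPLE_HZ    1000UL		/* assumed tick rate */
#define NSEC_PER_SEC  1000000000UL
#define NSEC_PER_TICK (NSEC_PER_SEC / EXAMPLE_HZ)

/*
 * Hypothetical helper: given the current offset into the second (in
 * nanoseconds), return the time remaining until the next 1/HZ boundary.
 * When already on a boundary, a full tick is returned.
 */
static unsigned long ns_to_next_tick(unsigned long nsec_in_second)
{
	assert(nsec_in_second < NSEC_PER_SEC);
	return NSEC_PER_TICK - (nsec_in_second % NSEC_PER_TICK);
}
```

Reloading the timer with this interval, rather than a fixed
1/HZ from whenever it happened to fire, is what keeps the
expiry aligned with system time in this simplified picture.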

With this patch applied and enabled (at config time in the
processor feature section), the system clock will be the
specified clock.  The PIT is not used to keep track of time,
but only to remind the system to look at the clock.  Sub
jiffies are kept and available for code that knows how to
use them.
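
One way to picture "sub jiffies" is as the remainder left
over after a time is split into whole jiffies.  The sketch
below does that split in nanoseconds.  The struct and helper
names are made up for illustration, and the real code may
well keep the sub-jiffie part in other units (for example
clock counts), so treat the units here as an assumption.

```c
#include <assert.h>

#define EXAMPLE_HZ      1000ULL		/* assumed tick rate */
#define NSEC_PER_SEC    1000000000ULL
#define NSEC_PER_JIFFIE (NSEC_PER_SEC / EXAMPLE_HZ)

/* Hypothetical expiry time: whole jiffies plus a sub-jiffie remainder. */
struct hr_expiry {
	unsigned long long jiffies;
	unsigned long sub_jiffie_ns;
};

/* Split a nanosecond interval into jiffies and the sub-jiffie part. */
static struct hr_expiry ns_to_hr_expiry(unsigned long long ns)
{
	struct hr_expiry e;

	e.jiffies = ns / NSEC_PER_JIFFIE;
	e.sub_jiffie_ns = (unsigned long)(ns % NSEC_PER_JIFFIE);
	assert(e.sub_jiffie_ns < NSEC_PER_JIFFIE);
	return e;
}
```

A timer whose sub-jiffie part is zero falls on a normal 1/HZ
tick and needs no extra interrupt.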

Depends on the core high res timers patch.

Patch is against 2.5.52-bk4

This patch as well as the POSIX clocks & timers patch is
available on the project site:
http://sourceforge.net/projects/high-res-timers/

The 3 parts to the high res timers are:
 core		The core kernel (i.e. platform independent) changes
*i386		The high-res changes for the i386 (x86) platform
 hrposix	The changes to the POSIX clocks & timers patch to
		use high-res timers

Please apply.
-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

[-- Attachment #2: hrtimers-i386-2.5.52-bk4.1.0.patch --]
[-- Type: text/plain, Size: 89838 bytes --]

diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/Kconfig linux/arch/i386/Kconfig
--- linux-2.5.52-bk4-core/arch/i386/Kconfig	Thu Dec 19 12:36:01 2002
+++ linux/arch/i386/Kconfig	Thu Dec 19 13:59:36 2002
@@ -323,6 +323,107 @@
 
 	  If you don't know what to do here, say N.
 
+config HIGH_RES_TIMERS
+	bool "Configure High-Resolution-Timers"
+	help
+	  POSIX timers are available by default.  This option enables
+	  high resolution POSIX timers.  With this option the resolution
+	  is at least 1 microsecond.  High resolution is not free.  If
+	  enabled this option will add a small overhead each time a
+	  timer expires that is not on a 1/HZ tick boundary.  If no such
+	  timers are used the overhead is nil.
+
+	  This option enables two additional POSIX CLOCKS,
+	  CLOCK_REALTIME_HR and CLOCK_MONOTONIC_HR.  Note that this
+	  option does not change the resolution of CLOCK_REALTIME or
+	  CLOCK_MONOTONIC which remain at 1/HZ resolution.
+
+choice
+	prompt "Clock source?"
+	depends on HIGH_RES_TIMERS
+ 	default HIGH_RES_TIMER_TSC
+	help 
+	  This option allows you to choose the wall clock timer for your
+	  system.  With high resolution timers on the x86 platforms it
+	  is best to keep the interrupt generating timer separate from
+	  the time keeping timer.  On x86 platforms there are three
+	  possible sources implemented for the wall clock.  These are:
+ 
+  	  <timer>				<resolution>
+ 	  ACPI power management (pm) timer	~280 nano seconds
+  	  TSC (Time Stamp Counter)		1/CPU clock
+ 	  PIT (Programmable Interrupt Timer)	~838 nano seconds
+
+	  The PIT is always used to generate clock interrupts but, in
+	  SMP systems the APIC timers are used to drive the timer list
+	  code.  This means that, in SMP systems the PIT will not be
+	  programmed to generate sub jiffie events and can give
+	  reasonable service as the clock interrupt. In non SMP (UP)
+	  systems it will be programmed to interrupt when the next timer
+	  is to expire or on the next 1/HZ tick.  For this reason it is
+	  best to not use this timer as the wall clock timer in UP
+	  systems.  This timer has a resolution of 838 nano seconds.  IN
+	  UP SYSTEMS THIS OPTION SHOULD ONLY BE USED IF BOTH ACPI AND
+	  TSC ARE NOT AVAILABLE.
+
+	  The TSC runs at the cpu clock rate (i.e. its resolution is
+	  1/CPU clock) and it has a very low access time.  However, it
+	  is subject, in some (incorrect) processors, to throttling to
+	  cool the cpu, and to other slow downs during power management.
+	  If your system has power management code active these changes
+	  are tracked by the TSC timer code.  If your cpu is correct and
+	  does not change the TSC frequency for throttling or power
+	  management outside of the power management kernel code, this is
+	  the best clock timer.
+
+	  The ACPI pm timer is available on systems with Advanced
+	  Configuration and Power Interface support.  The pm timer is
+	  available on these systems even if you don't use or enable
+	  ACPI in the software or the BIOS (but see Default ACPI pm
+	  timer address).  The timer has a resolution of about 280
+	  nanoseconds, however, the access time is a bit higher than
+	  that of the TSC.  Since it is part of ACPI it is intended to
+	  keep track of time while the system is under power management,
+	  thus it is not subject to the power management problems of the
+	  TSC.
+
+	  If you enable the ACPI pm timer and it can not be found, it is
+	  possible that your BIOS is not producing the ACPI table or
+	  that your machine does not support ACPI.  In the former case,
+	  see "Default ACPI pm timer address".  If the timer is not
+	  found the boot will fail when trying to calibrate the 'delay'
+	  loop.
+
+config HIGH_RES_TIMER_ACPI_PM
+	bool "ACPI-pm-timer"
+	
+config HIGH_RES_TIMER_TSC
+	bool "Time-stamp-counter/TSC"
+	depends on X86_TSC
+
+config HIGH_RES_TIMER_PIT
+	bool "Programmable-interrupt-timer/PIT"
+	  
+endchoice	  
+
+config HIGH_RES_TIMER_ACPI_PM_ADD
+	int "Default ACPI pm timer address"
+	depends on HIGH_RES_TIMER_ACPI_PM
+	default 0
+	help
+	  This option is available for use on systems where the BIOS
+	  does not generate the ACPI tables if ACPI is not enabled.  For
+	  example some BIOSes will not generate the ACPI tables if APM
+	  is enabled.  The ACPI pm timer is still available but can not
+	  be found by the software.  This option allows you to supply
+	  the needed address.  When the high resolution timers code
+	  finds a valid ACPI pm timer address it reports it in the boot
+	  messages log (look for lines that begin with
+	  "High-res-timers:").  You can turn on the ACPI support in the
+	  BIOS, boot the system and find this value.  You can then enter
+	  it at configure time.  Both the report and the entry are in
+	  decimal.
+
 config PREEMPT
 	bool "Preemptible Kernel"
 	help
@@ -1564,6 +1665,14 @@
 	  If you say Y here, various routines which may sleep will become very
 	  noisy if they are called with a spinlock held.	
 
+#config FRAME_POINTER
+#	bool "Compile the kernel with frame pointers"
+#	help
+#	  If you say Y here the resulting kernel image will be slightly larger
+#	  and slower, but it will give very useful debugging information.
+#	  If you don't debug the kernel, you can say N, but we may not be able
+#	  to solve problems without frame pointers.
+
 config KGDB
 	bool "Include kgdb kernel debugger"
 	depends on DEBUG_KERNEL
@@ -1748,7 +1857,7 @@
 
 config FRAME_POINTER
 	bool "Compile the kernel with frame pointers"
-	default KGDB
+	default y if KGDB
 	help
 	  If you say Y here the resulting kernel image will be slightly larger
 	  and slower, but it will give very useful debugging information.
@@ -1757,8 +1866,7 @@
 
 config MAGIC_SYSRQ
 	bool
-	depends on KGDB_SYSRQ
-	default y
+	default y if KGDB_SYSRQ
 
 config X86_EXTRA_IRQS
 	bool
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/kernel/apic.c linux/arch/i386/kernel/apic.c
--- linux-2.5.52-bk4-core/arch/i386/kernel/apic.c	Wed Oct 16 00:17:47 2002
+++ linux/arch/i386/kernel/apic.c	Thu Dec 19 13:00:26 2002
@@ -23,6 +23,7 @@
 #include <linux/interrupt.h>
 #include <linux/mc146818rtc.h>
 #include <linux/kernel_stat.h>
+#include <linux/hrtime.h>
 
 #include <asm/atomic.h>
 #include <asm/smp.h>
@@ -800,7 +801,7 @@
  * P5 APIC double write bug.
  */
 
-#define APIC_DIVISOR 16
+#define APIC_DIVISOR 1
 
 void __setup_APIC_LVTT(unsigned int clocks)
 {
@@ -811,12 +812,12 @@
 	apic_write_around(APIC_LVTT, lvtt1_value);
 
 	/*
-	 * Divide PICLK by 16
+	 * Divide PICLK by 1
 	 */
 	tmp_value = apic_read(APIC_TDCR);
 	apic_write_around(APIC_TDCR, (tmp_value
 				& ~(APIC_TDR_DIV_1 | APIC_TDR_DIV_TMBASE))
-				| APIC_TDR_DIV_16);
+				| APIC_TDR_DIV_1);
 
 	apic_write_around(APIC_TMICT, clocks/APIC_DIVISOR);
 }
@@ -914,7 +915,7 @@
 	return result;
 }
 
-static unsigned int calibration_result;
+static unsigned int calibration_result = 1000;
 
 int dont_use_local_apic_timer __initdata = 0;
 
@@ -935,6 +936,8 @@
 	 */
 	setup_APIC_timer(calibration_result);
 
+	compute_latch(calibration_result);
+
 	local_irq_enable();
 }
 
@@ -1025,6 +1028,8 @@
 			__setup_APIC_LVTT(calibration_result/prof_counter[cpu]);
 			prof_old_multiplier[cpu] = prof_counter[cpu];
 		}
+
+		discipline_timer(cpu);
 
 #ifdef CONFIG_SMP
 		update_process_times(user_mode(regs));
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/kernel/time.c linux/arch/i386/kernel/time.c
--- linux-2.5.52-bk4-core/arch/i386/kernel/time.c	Thu Dec 19 12:16:00 2002
+++ linux/arch/i386/kernel/time.c	Thu Dec 19 13:00:26 2002
@@ -29,7 +29,10 @@
  *	Fixed a xtime SMP race (we need the xtime_lock rw spinlock to
  *	serialize accesses to xtime/lost_ticks).
  */
-
+/* 2002-8-13 George Anzinger  Modified for High res timers: 
+ *                            Copyright (C) 2002 MontaVista Software
+*/
+#define _INCLUDED_FROM_TIME_C
 #include <linux/errno.h>
 #include <linux/sched.h>
 #include <linux/kernel.h>
@@ -59,6 +62,7 @@
 #include <linux/config.h>
 
 #include <asm/arch_hooks.h>
+#include <linux/hrtime.h>
 
 extern spinlock_t i8259A_lock;
 int pit_latch_buggy;              /* extern */
@@ -72,7 +76,23 @@
 extern rwlock_t xtime_lock;
 extern unsigned long wall_jiffies;
 
+
+#ifndef CONFIG_HIGH_RES_TIMERS
+
+/* Number of usecs that the last interrupt was delayed */
+static int delay_at_last_interrupt;
+
+#endif  /* CONFIG_HIGH_RES_TIMERS */
+
 spinlock_t rtc_lock = SPIN_LOCK_UNLOCKED;
+/*
+ * We have three of these do_xxx_gettimeoffset() routines:
+ * do_fast_gettimeoffset(void) for TSC systems without high-res-timers
+ * do_slow_gettimeoffset(void) for non-TSC systems without high-res-timers
+ * do_highres_gettimeoffset(void) for systems with high-res-timers
+ *
+ * Pick the desired one at compile time...
+ */
 
 spinlock_t i8253_lock = SPIN_LOCK_UNLOCKED;
 EXPORT_SYMBOL(i8253_lock);
@@ -91,16 +111,25 @@
 	read_lock_irqsave(&xtime_lock, flags);
 	usec = timer->get_offset();
 	{
+                /*
+                 * FIXME: Due to adjtime and such,
+                 * this should be changed to actually update
+                 * wall time using the proper routine.
+                 * Otherwise we run the risk of time moving
+                 * backward due to different interpretations
+                 * of the jiffie, i.e. jiffie != 1/HZ
+                 * (but it is close).
+                 */
 		unsigned long lost = jiffies - wall_jiffies;
 		if (lost)
-			usec += lost * (1000000 / HZ);
+			usec += lost * (USEC_PER_SEC / HZ);
 	}
 	sec = xtime.tv_sec;
 	usec += (xtime.tv_nsec / 1000);
 	read_unlock_irqrestore(&xtime_lock, flags);
 
-	while (usec >= 1000000) {
-		usec -= 1000000;
+	while (usec >= USEC_PER_SEC) {
+		usec -= USEC_PER_SEC;
 		sec++;
 	}
 
@@ -212,7 +241,7 @@
  * timer_interrupt() needs to keep up the real-time clock,
  * as well as call the "do_timer()" routine every clocktick
  */
-static inline void do_timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
+static inline void do_timer_interrupt(int irq, struct pt_regs *regs)
 {
 #ifdef CONFIG_X86_IO_APIC
 	if (timer_ack) {
@@ -232,36 +261,29 @@
 
 	do_timer_interrupt_hook(regs);
 
-	/*
+        /* 
+         * This is dumb for two reasons:
+         * 1.) it is based on wall time which has not yet been updated.
+         * 2.) it is checked each tick for something that happens every
+         *     ~11 min.  Why not use a timer for it?  Much lower overhead,
+         *     in fact, zero if STA_UNSYNC is set.
+         */
+        /*
 	 * If we have an externally synchronized Linux clock, then update
 	 * CMOS clock accordingly every ~11 minutes. Set_rtc_mmss() has to be
 	 * called as close as possible to 500 ms before the new second starts.
 	 */
 	if ((time_status & STA_UNSYNC) == 0 &&
 	    xtime.tv_sec > last_rtc_update + 660 &&
-	    (xtime.tv_nsec / 1000) >= 500000 - ((unsigned) TICK_SIZE) / 2 &&
-	    (xtime.tv_nsec / 1000) <= 500000 + ((unsigned) TICK_SIZE) / 2) {
+	    (xtime.tv_nsec ) >= 500000000 - ((unsigned) tick_nsec) / 2 &&
+	    (xtime.tv_nsec ) <= 500000000 + ((unsigned) tick_nsec) / 2) {
 		if (set_rtc_mmss(xtime.tv_sec) == 0)
 			last_rtc_update = xtime.tv_sec;
 		else
-			last_rtc_update = xtime.tv_sec - 600; /* do it again in 60 s */
+                        /* do it again in 60 s */	
+			last_rtc_update = xtime.tv_sec - 600; 
 	}
 	    
-#ifdef CONFIG_MCA
-	if( MCA_bus ) {
-		/* The PS/2 uses level-triggered interrupts.  You can't
-		turn them off, nor would you want to (any attempt to
-		enable edge-triggered interrupts usually gets intercepted by a
-		special hardware circuit).  Hence we have to acknowledge
-		the timer interrupt.  Through some incredibly stupid
-		design idea, the reset for IRQ 0 is done by setting the
-		high bit of the PPI port B (0x61).  Note that some PS/2s,
-		notably the 55SX, work fine if this is removed.  */
-
-		irq = inb_p( 0x61 );	/* read the current state */
-		outb_p( irq|0x80, 0x61 );	/* reset the IRQ */
-	}
-#endif
 }
 
 /*
@@ -278,16 +300,66 @@
 	 * the irq version of write_lock because as just said we have irq
 	 * locally disabled. -arca
 	 */
+	discipline_timer(smp_processor_id());
 	write_lock(&xtime_lock);
 
 	timer->mark_offset();
  
-	do_timer_interrupt(irq, NULL, regs);
+	do_timer_interrupt(irq, regs);
 
 	write_unlock(&xtime_lock);
 
 }
+#ifdef CONFIG_HIGH_RES_TIMERS
+/*
 
+ * We always continue to provide interrupts even if they are not
+ * serviced.  To do this, we leave the chip in periodic mode programmed
+ * to interrupt every jiffie.  This is done by, for short intervals,
+ * programming a short time, waiting till it is loaded and then
+ * programming the 1/HZ.  The chip will not load the 1/HZ count till the
+ * short count expires.  If the last interrupt was programmed to be
+ * short, we need to program another short to cover the remaining part
+ * of the jiffie and can then just leave the chip alone.  Note that it
+ * is also a low overhead way of doing things as we do not have to mess
+ * with the chip MOST of the time. 
+ 
+  */
+
+int _schedule_next_int(unsigned long jiffie_f,long sub_jiffie_in, int always)
+{
+        long sub_jiff_offset; 
+	int * last_was_long = &__last_was_long;
+	if ((sub_jiffie_in == -1) && *last_was_long) return 0;
+        /* 
+         * First figure where we are in time. 
+         * A note on locking.  We are under the timerlist_lock here.  This
+         * means that interrupts are off already, so don't use irq versions.
+         */
+        IF_SMP( read_lock(&xtime_lock));
+
+        sub_jiff_offset = quick_update_jiffies_sub(jiffie_f);
+
+        IF_SMP( read_unlock(&xtime_lock));
+
+
+        if (( *last_was_long = (sub_jiffie_in == -1 ))) {
+
+                sub_jiff_offset = cycles_per_jiffies - sub_jiff_offset;
+        }else{
+		sub_jiff_offset = sub_jiffie_in - sub_jiff_offset;
+        }
+        /*
+         * If time is already passed, just return saying so.
+         */
+        if (! always && (sub_jiff_offset < 0)){
+                *last_was_long = 0;
+                return 1;
+        }
+        reload_timer_chip(sub_jiff_offset);
+        return 0;
+}
+#endif
 /* not static: needed by APM */
 unsigned long get_cmos_time(void)
 {
@@ -352,6 +424,7 @@
 	
 	xtime.tv_sec = get_cmos_time();
 	xtime.tv_nsec = 0;
+        IF_HIGH_RES(tick_nsec = NSEC_PER_SEC / HZ);
 
 
 	timer = select_timer();
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/kernel/timers/Makefile linux/arch/i386/kernel/timers/Makefile
--- linux-2.5.52-bk4-core/arch/i386/kernel/timers/Makefile	Thu Dec 19 12:12:53 2002
+++ linux/arch/i386/kernel/timers/Makefile	Thu Dec 19 13:03:35 2002
@@ -7,3 +7,7 @@
 obj-y += timer_tsc.o
 obj-y += timer_pit.o
 obj-$(CONFIG_X86_CYCLONE)   += timer_cyclone.o
+obj-$(CONFIG_HIGH_RES_TIMER_ACPI_PM) += hrtimer_pm.o
+obj-$(CONFIG_HIGH_RES_TIMER_ACPI_PM) += high-res-tbxfroot.o
+obj-$(CONFIG_HIGH_RES_TIMER_TSC) += hrtimer_tsc.o
+obj-$(CONFIG_HIGH_RES_TIMER_PIT) += hrtimer_pit.o
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/kernel/timers/high-res-tbxfroot.c linux/arch/i386/kernel/timers/high-res-tbxfroot.c
--- linux-2.5.52-bk4-core/arch/i386/kernel/timers/high-res-tbxfroot.c	Wed Dec 31 16:00:00 1969
+++ linux/arch/i386/kernel/timers/high-res-tbxfroot.c	Thu Dec 19 13:00:26 2002
@@ -0,0 +1,263 @@
+/******************************************************************************
+ *
+ * Module Name: tbxfroot - Find the root ACPI table (RSDT)
+ *              $Revision: 49 $
+ *
+ *****************************************************************************/
+
+/*
+ *  Copyright (C) 2000, 2001 R. Byron Moore
+
+ *  This code purloined and modified by George Anzinger
+ *                          Copyright (C) 2002 by MontaVista Software.
+ *  It is part of the high-res-timers ACPI option and its sole purpose is
+ *  to find the darn timer.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+/* This is most annoying!  We want to find the address of the pm timer in the
+ * ACPI hardware package.  We know there is one if ACPI is available at all 
+ * as it is part of the basic ACPI hardware set. 
+ * However, the powers that be have conspired to make it a real
+ * pain to find the address.  We have written a minimal search routine
+ * that we use only once on boot up.  We try to cover all the bases including
+ * checksum, and version.  We will try to get some constants and structures
+ * from the ACPI code in an attempt to follow it, but darn, what a mess.
+ *
+ * First problem, the include files are in the driver package....
+ * and what a mess they are.  We pick up the kernel string and types first.
+
+ * But then there is the COMPILER_DEPENDENT_UINT64 ...
+ */
+#define ACPI_MACHINE_WIDTH	BITS_PER_LONG
+#define COMPILER_DEPENDENT_UINT64   unsigned long long
+#define COMPILER_DEPENDENT_INT64   long long
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <../drivers/acpi/include/actypes.h>
+#include <../drivers/acpi/include/actbl.h>
+#include <../drivers/acpi/include/acconfig.h>
+#include <linux/init.h>
+#include <asm/page.h>
+
+#define STRNCMP(d,s,n)  strncmp((d), (s), (NATIVE_INT)(n))
+#define RSDP_CHECKSUM_LENGTH 20
+
+#ifndef CONFIG_ACPI
+/*******************************************************************************
+ *
+ * FUNCTION:    hrt_acpi_checksum
+ *
+ * PARAMETERS:  Buffer              - Buffer to checksum
+ *              Length              - Size of the buffer
+ *
+ * RETURNS      8 bit checksum of buffer
+ *
+ * DESCRIPTION: Computes an 8 bit checksum of the buffer(length) and returns it.
+ *
+ ******************************************************************************/
+static __init u8
+hrt_acpi_checksum(void *buffer, u32 length)
+{
+	u8 *limit;
+	u8 *rover;
+	u8 sum = 0;
+
+	if (buffer && length) {
+		/*  Buffer and Length are valid   */
+
+		limit = (u8 *) buffer + length;
+
+		for (rover = buffer; rover < limit; rover++) {
+			sum = (u8) (sum + *rover);
+		}
+	}
+
+	return (sum);
+}
+
+/*******************************************************************************
+ *
+ * FUNCTION:    hrt_acpi_scan_memory_for_rsdp
+ *
+ * PARAMETERS:  Start_address       - Starting pointer for search
+ *              Length              - Maximum length to search
+ *
+ * RETURN:      Pointer to the RSDP if found, otherwise NULL.
+ *
+ * DESCRIPTION: Search a block of memory for the RSDP signature
+ *
+ ******************************************************************************/
+static __init u8 *
+hrt_acpi_scan_memory_for_rsdp(u8 * start_address, u32 length)
+{
+	u32 offset;
+	u8 *mem_rover;
+
+	/* Search from given start addr for the requested length  */
+
+	for (offset = 0, mem_rover = start_address;
+	     offset < length;
+	     offset += RSDP_SCAN_STEP, mem_rover += RSDP_SCAN_STEP) {
+
+		/* The signature and checksum must both be correct */
+
+		if (STRNCMP((NATIVE_CHAR *) mem_rover,
+			    RSDP_SIG, sizeof (RSDP_SIG) - 1) == 0 &&
+		    hrt_acpi_checksum(mem_rover, RSDP_CHECKSUM_LENGTH) == 0) {
+			/* If so, we have found the RSDP */
+
+			;
+			return (mem_rover);
+		}
+	}
+
+	/* Searched entire block, no RSDP was found */
+
+	return (NULL);
+}
+
+/*******************************************************************************
+ *
+ * FUNCTION:    hrt_find_acpi_rsdp
+ *
+ * PARAMETERS:  None
+ *
+ * RETURN:      Logical address of rsdp
+ *
+ * DESCRIPTION: Search lower 1 MB of memory for the root system descriptor
+ *              pointer structure.  If it is found, return its address,
+ *              else return 0.
+ *
+ *              NOTE: The RSDP must be either in the first 1 KB of the Extended
+ *              BIOS Data Area or between E0000 and FFFFF (ACPI 1.0 section
+ *              5.2.2; assertion #421).
+ *
+ ******************************************************************************/
+/* Constants used in searching for the RSDP in low memory */
+
+#define LO_RSDP_WINDOW_BASE         0	/* Physical Address */
+#define HI_RSDP_WINDOW_BASE         0xE0000	/* Physical Address */
+#define LO_RSDP_WINDOW_SIZE         0x400
+#define HI_RSDP_WINDOW_SIZE         0x20000
+#define RSDP_SCAN_STEP              16
+
+static __init RSDP_DESCRIPTOR *
+hrt_find_acpi_rsdp(void)
+{
+	u8 *mem_rover;
+
+	/*
+	 * 1) Search EBDA (low memory) paragraphs
+	 */
+	mem_rover =
+	    hrt_acpi_scan_memory_for_rsdp((u8 *) __va(LO_RSDP_WINDOW_BASE),
+					  LO_RSDP_WINDOW_SIZE);
+
+	if (!mem_rover) {
+		/*
+		 * 2) Search upper memory:
+		 *    16-byte boundaries in E0000h-FFFFFh
+		 */
+		mem_rover =
+		    hrt_acpi_scan_memory_for_rsdp((u8 *)
+						  __va(HI_RSDP_WINDOW_BASE),
+						  HI_RSDP_WINDOW_SIZE);
+	}
+
+	if (mem_rover) {
+		/* Found it, return the logical address */
+
+		return (RSDP_DESCRIPTOR *) mem_rover;
+	}
+	return (RSDP_DESCRIPTOR *) 0;
+}
+
+__init u32 hrt_get_acpi_pm_ptr(void)
+{
+	fadt_descriptor_rev2 *fadt;
+	RSDT_DESCRIPTOR_REV2 *rsdt;
+	XSDT_DESCRIPTOR_REV2 *xsdt;
+	RSDP_DESCRIPTOR *rsdp = hrt_find_acpi_rsdp();
+
+	if (!rsdp) {
+		printk("ACPI: System description tables not found\n");
+		return 0;
+	}
+	/*
+	 * Now that we have that problem out of the way, let's set up this
+	 * timer.  We need to figure the addresses based on the revision
+	 * of ACPI, which is in this here table we just found.
+	 * We will not check the RSDT checksum, but we will check the FADT's.
+	 */
+	if (rsdp->revision == 2) {
+		xsdt =
+		    (XSDT_DESCRIPTOR_REV2 *) __va(rsdp->xsdt_physical_address);
+		fadt =
+		    (fadt_descriptor_rev2 *) __va(xsdt->table_offset_entry[0]);
+	} else {
+		rsdt =
+		    (RSDT_DESCRIPTOR_REV2 *) __va(rsdp->rsdt_physical_address);
+		fadt =
+		    (fadt_descriptor_rev2 *) __va(rsdt->table_offset_entry[0]);
+	}
+	/*
+	 * Verify the signature and the checksum
+	 */
+	if (STRNCMP((NATIVE_CHAR *) fadt->header.signature,
+		    FADT_SIG, sizeof (FADT_SIG) - 1) == 0 &&
+	    hrt_acpi_checksum((NATIVE_CHAR *) fadt, fadt->header.length) == 0) {
+		/*
+		 * looks good.  Again, based on revision,
+		 * pluck the addresses we want and get out.
+		 */
+		if (rsdp->revision == 2) {
+			return (u32) fadt->Xpm_tmr_blk.address;
+		} else {
+			return (u32) fadt->V1_pm_tmr_blk;
+		}
+	}
+	printk("ACPI: Signature or checksum failed on FADT\n");
+	return 0;
+}
+
+#else
+int acpi_get_firmware_table(acpi_string signature,
+			    u32 instance,
+			    u32 flags, acpi_table_header ** table_pointer);
+
+extern fadt_descriptor_rev2 acpi_fadt;
+__init u32 hrt_get_acpi_pm_ptr(void)
+{
+	fadt_descriptor_rev2 *fadt = &acpi_fadt;
+	fadt_descriptor_rev2 local_fadt;
+
+	if (!fadt || !fadt->header.signature[0]) {
+		fadt = &local_fadt;
+		acpi_get_firmware_table("FACP", 1, 0,
+					(acpi_table_header **) & fadt);
+	}
+	if (!fadt || !fadt->header.signature[0]) {
+		printk("ACPI: Could not find the ACPI pm timer.\n");
+		return 0;	/* caller treats 0 as "no timer found" */
+	}
+	if (fadt->header.revision == 2) {
+		return (u32) fadt->Xpm_tmr_blk.address;
+	} else {
+		return (u32) fadt->V1_pm_tmr_blk;
+	}
+}
+#endif
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/kernel/timers/hrtimer_pit.c linux/arch/i386/kernel/timers/hrtimer_pit.c
--- linux-2.5.52-bk4-core/arch/i386/kernel/timers/hrtimer_pit.c	Wed Dec 31 16:00:00 1969
+++ linux/arch/i386/kernel/timers/hrtimer_pit.c	Thu Dec 19 13:00:26 2002
@@ -0,0 +1,165 @@
+/*
+ * This code largely moved from arch/i386/kernel/time.c.
+ * See comments there for proper credits.
+ */
+
+#include <linux/spinlock.h>
+#include <linux/init.h>
+#include <linux/timex.h>
+#include <linux/errno.h>
+#include <linux/cpufreq.h>
+#include <linux/hrtime.h>
+
+#include <asm/timer.h>
+#include <asm/io.h>
+
+
+
+/* Cached *multiplier* to convert TSC counts to microseconds.
+ * (see the equation below).
+ * Equal to 2^32 * (1 / (clocks per usec) ).
+ * Initialized in time_init.
+ */
+extern unsigned long fast_gettimeoffset_quotient;
+
+extern unsigned long do_highres_gettimeoffset_pit(void)
+{
+        /*
+         * We are under the xtime_lock here.
+         */
+        long tmp = quick_get_cpuctr();
+        long rtn = arch_cycles_to_usec(tmp + sub_jiffie());
+	return rtn;
+}
+
+static void high_res_mark_offset_pit(void)
+{
+	return;
+}
+
+
+/* ------ Calibrate the TSC ------- 
+ * Return 2^32 * (1 / (TSC clocks per usec)) for do_fast_gettimeoffset().
+ * Too much 64-bit arithmetic here to do this cleanly in C, and for
+ * accuracy's sake we want to keep the overhead on the CTC speaker (channel 2)
+ * output busy loop as low as possible. We avoid reading the CTC registers
+ * directly because of the awkward 8-bit access mechanism of the 82C54
+ * device.
+ */
+
+#define CAL_JIFS 5
+#define CALIBRATE_LATCH	(((CAL_JIFS * CLOCK_TICK_RATE) + HZ/2)/HZ)
+#define CALIBRATE_TIME	((CAL_JIFS * USEC_PER_SEC)/HZ)
+#define CALIBRATE_TIME_NSEC (CAL_JIFS * (NSEC_PER_SEC/HZ))
+
+
+static unsigned long __init calibrate_tsc(void)
+{
+       /* Set the Gate high, disable speaker */
+	outb((inb(0x61) & ~0x02) | 0x01, 0x61);
+
+	/*
+	 * Now let's take care of CTC channel 2
+	 *
+	 * Set the Gate high, program CTC channel 2 for mode 0,
+	 * (interrupt on terminal count mode), binary count,
+	 * load 5 * LATCH count, (LSB and MSB) to begin countdown.
+	 */
+	outb(0xb0, 0x43);			/* binary, mode 0, LSB/MSB, Ch 2 */
+	outb(CALIBRATE_LATCH & 0xff, 0x42);	/* LSB of count */
+	outb(CALIBRATE_LATCH >> 8, 0x42);	/* MSB of count */
+
+	{
+		unsigned long startlow, starthigh;
+		unsigned long endlow, endhigh;
+		unsigned long count;
+
+		rdtsc(startlow,starthigh);
+		count = 0;
+		do {
+			count++;
+		} while ((inb(0x61) & 0x20) == 0);
+		rdtsc(endlow,endhigh);
+
+		/* Error: ECTCNEVERSET */
+		if (count <= 1)
+			goto bad_ctc;
+
+		/* 64-bit subtract - gcc just messes up with long longs */
+		__asm__("subl %2,%0\n\t"
+			"sbbl %3,%1"
+			:"=a" (endlow), "=d" (endhigh)
+			:"g" (startlow), "g" (starthigh),
+			 "0" (endlow), "1" (endhigh));
+
+		/* Error: ECPUTOOFAST */
+		if (endhigh)
+			goto bad_ctc;
+
+		/* Error: ECPUTOOSLOW */
+		if (endlow <= CALIBRATE_TIME)
+			goto bad_ctc;
+
+                /*
+                 * endlow at this point is CAL_JIFS * arch clocks
+                 * per jiffie.  Set up the value for 
+                 * high_res use. Note: keep the whole
+                 * value for now, we will do
+                 * the divide later (want that precision).
+                 */
+
+		__asm__("divl %2"
+			:"=a" (endlow), "=d" (endhigh)
+			:"r" (endlow), "0" (0), "1" (CALIBRATE_TIME));
+
+		return endlow;
+	}
+
+	/*
+	 * The CTC wasn't reliable: we got a hit on the very first read,
+	 * or the CPU was so fast/slow that the quotient wouldn't fit in
+	 * 32 bits..
+	 */
+bad_ctc:
+        printk("******************** TSC calibrate failed!\n");
+	return 0;
+}
+
+
+
+#include <asm/kgdb.h>
+
+static int high_res_init_pit(void)
+{
+
+	//breakpoint();
+
+
+	/* report CPU clock rate in Hz.
+	 * The formula is:
+	 * (10^6 * 2^32) / (2^32 * 1 / (clocks/us)) =
+	 * clock/second. Our precision is about 100 ppm.
+	 */
+	if (cpu_has_tsc) {
+		unsigned long tsc_quotient = calibrate_tsc();
+		if(tsc_quotient){
+			fast_gettimeoffset_quotient = tsc_quotient;
+			cpu_khz = div_sc32( 1000, tsc_quotient);
+			{	
+				printk("Detected %lu.%03lu MHz processor.\n", 
+				       cpu_khz / 1000, cpu_khz % 1000);
+			}
+		}
+	}
+	start_PIT();
+	return 0;
+}
+
+/************************************************************/
+
+/* PIT timer_opts struct */
+struct timer_opts hrtimer_pit = {
+	.init =		high_res_init_pit,
+	.mark_offset =	high_res_mark_offset_pit, 
+	.get_offset =	do_highres_gettimeoffset_pit,
+};
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/kernel/timers/hrtimer_pm.c linux/arch/i386/kernel/timers/hrtimer_pm.c
--- linux-2.5.52-bk4-core/arch/i386/kernel/timers/hrtimer_pm.c	Wed Dec 31 16:00:00 1969
+++ linux/arch/i386/kernel/timers/hrtimer_pm.c	Thu Dec 19 13:00:26 2002
@@ -0,0 +1,198 @@
+/*
+ * This code largely moved from arch/i386/kernel/time.c.
+ * See comments there for proper credits.
+ */
+
+#include <linux/spinlock.h>
+#include <linux/init.h>
+#include <linux/timex.h>
+#include <linux/errno.h>
+#include <linux/cpufreq.h>
+#include <linux/hrtime.h>
+
+#include <asm/timer.h>
+#include <asm/io.h>
+
+
+
+/* Cached *multiplier* to convert TSC counts to microseconds.
+ * (see the equation below).
+ * Equal to 2^32 * (1 / (clocks per usec) ).
+ * Initialized in time_init.
+ */
+extern unsigned long fast_gettimeoffset_quotient;
+
+extern unsigned long do_highres_gettimeoffset_pm(void)
+{
+        /*
+         * We are under the xtime_lock here.
+         */
+        long tmp = quick_get_cpuctr();
+        long rtn = arch_cycles_to_usec(tmp + sub_jiffie());
+	return rtn;
+}
+
+static void high_res_mark_offset_pm(void)
+{
+	return;
+}
+
+
+/* ------ Calibrate the TSC ------- 
+ * Return 2^32 * (1 / (TSC clocks per usec)) for do_fast_gettimeoffset().
+ * Too much 64-bit arithmetic here to do this cleanly in C, and for
+ * accuracy's sake we want to keep the overhead on the CTC speaker (channel 2)
+ * output busy loop as low as possible. We avoid reading the CTC registers
+ * directly because of the awkward 8-bit access mechanism of the 82C54
+ * device.
+ */
+
+#define CAL_JIFS 5
+#define CALIBRATE_LATCH	(((CAL_JIFS * CLOCK_TICK_RATE) + HZ/2)/HZ)
+#define CALIBRATE_TIME	((CAL_JIFS * USEC_PER_SEC)/HZ)
+#define CALIBRATE_TIME_NSEC (CAL_JIFS * (NSEC_PER_SEC/HZ))
+
+static __initdata unsigned long tsc_cycles_per_5_jiffies;
+
+static unsigned long __init calibrate_tsc(void)
+{
+       /* Set the Gate high, disable speaker */
+	outb((inb(0x61) & ~0x02) | 0x01, 0x61);
+
+	/*
+	 * Now let's take care of CTC channel 2
+	 *
+	 * Set the Gate high, program CTC channel 2 for mode 0,
+	 * (interrupt on terminal count mode), binary count,
+	 * load 5 * LATCH count, (LSB and MSB) to begin countdown.
+	 */
+	outb(0xb0, 0x43);			/* binary, mode 0, LSB/MSB, Ch 2 */
+	outb(CALIBRATE_LATCH & 0xff, 0x42);	/* LSB of count */
+	outb(CALIBRATE_LATCH >> 8, 0x42);	/* MSB of count */
+
+	{
+		unsigned long startlow, starthigh;
+		unsigned long endlow, endhigh;
+		unsigned long count;
+
+		rdtsc(startlow,starthigh);
+		count = 0;
+		do {
+			count++;
+		} while ((inb(0x61) & 0x20) == 0);
+		rdtsc(endlow,endhigh);
+
+		/* Error: ECTCNEVERSET */
+		if (count <= 1)
+			goto bad_ctc;
+
+		/* 64-bit subtract - gcc just messes up with long longs */
+		__asm__("subl %2,%0\n\t"
+			"sbbl %3,%1"
+			:"=a" (endlow), "=d" (endhigh)
+			:"g" (startlow), "g" (starthigh),
+			 "0" (endlow), "1" (endhigh));
+
+		/* Error: ECPUTOOFAST */
+		if (endhigh)
+			goto bad_ctc;
+
+		/* Error: ECPUTOOSLOW */
+		if (endlow <= CALIBRATE_TIME)
+			goto bad_ctc;
+
+                /*
+                 * endlow at this point is CAL_JIFS * arch clocks
+                 * per jiffie.  Set up the value for 
+                 * high_res use. Note: keep the whole
+                 * value for now, we will do
+                 * the divide later (want that precision).
+                 */
+                tsc_cycles_per_5_jiffies = endlow;	/* used by hrt_udelay() */
+		__asm__("divl %2"
+			:"=a" (endlow), "=d" (endhigh)
+			:"r" (endlow), "0" (0), "1" (CALIBRATE_TIME));
+
+		return endlow;
+	}
+
+	/*
+	 * The CTC wasn't reliable: we got a hit on the very first read,
+	 * or the CPU was so fast/slow that the quotient wouldn't fit in
+	 * 32 bits..
+	 */
+bad_ctc:
+        printk("******************** TSC calibrate failed!\n");
+	return 0;
+}
+
+
+static inline __init void hrt_udelay(int usec)
+{
+        long now,end;
+        rdtscl(end);
+        end += (usec * tsc_cycles_per_5_jiffies) / (USEC_PER_JIFFIES * 5);
+        do {rdtscl(now);} while((end - now) > 0);
+
+}
+
+
+
+static int high_res_init_pm(void)
+{
+
+
+	/* report CPU clock rate in Hz.
+	 * The formula is:
+	 * (10^6 * 2^32) / (2^32 * 1 / (clocks/us)) =
+	 * clock/second. Our precision is about 100 ppm.
+	 */
+	if (cpu_has_tsc) {
+		unsigned long tsc_quotient = calibrate_tsc();
+		if(tsc_quotient){
+			fast_gettimeoffset_quotient = tsc_quotient;
+			cpu_khz = div_sc32( 1000, tsc_quotient);
+			{	
+				printk("Detected %lu.%03lu MHz processor.\n", 
+				       cpu_khz / 1000, cpu_khz % 1000);
+			}
+		}
+	}
+	start_PIT();
+        acpi_pm_tmr_address = hrt_get_acpi_pm_ptr(); 
+        if (!acpi_pm_tmr_address){                    
+                printk(message,default_pm_add);
+                if ( (acpi_pm_tmr_address = default_pm_add)){
+                        last_update +=  quick_get_cpuctr();
+                        hrt_udelay(4);
+			if (!quick_get_cpuctr()){
+                                printk("High-res-timers: No ACPI pm timer found at %d.\n",
+                                       acpi_pm_tmr_address);
+                                acpi_pm_tmr_address = 0;
+                        } 
+                } 
+        }else{
+                if (default_pm_add != acpi_pm_tmr_address) {
+                        printk("High-res-timers: Ignoring supplied default ACPI pm timer address.\n"); 
+                }
+                last_update +=  quick_get_cpuctr();
+        }
+        if (!acpi_pm_tmr_address){
+                printk(fail_message);
+		return -EINVAL;
+        }else{
+                printk("High-res-timers: Found ACPI pm timer at %d\n",
+                       acpi_pm_tmr_address);
+        }
+	return 0;
+}
+
+/************************************************************/
+
+/* ACPI pm timer timer_opts struct */
+struct timer_opts hrtimer_pm = {
+	.init =		high_res_init_pm,
+	.mark_offset =	high_res_mark_offset_pm, 
+	.get_offset =	do_highres_gettimeoffset_pm,
+};
+
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/kernel/timers/hrtimer_tsc.c linux/arch/i386/kernel/timers/hrtimer_tsc.c
--- linux-2.5.52-bk4-core/arch/i386/kernel/timers/hrtimer_tsc.c	Wed Dec 31 16:00:00 1969
+++ linux/arch/i386/kernel/timers/hrtimer_tsc.c	Thu Dec 19 13:00:26 2002
@@ -0,0 +1,289 @@
+/*
+ * This code largely moved from arch/i386/kernel/time.c.
+ * See comments there for proper credits.
+ */
+
+#include <linux/spinlock.h>
+#include <linux/init.h>
+#include <linux/timex.h>
+#include <linux/errno.h>
+#include <linux/cpufreq.h>
+#include <linux/hrtime.h>
+
+#include <asm/timer.h>
+#include <asm/io.h>
+
+extern int x86_udelay_tsc;
+extern spinlock_t i8253_lock;
+
+
+
+/* Cached *multiplier* to convert TSC counts to microseconds.
+ * (see the equation below).
+ * Equal to 2^32 * (1 / (clocks per usec) ).
+ * Initialized in time_init.
+ */
+extern unsigned long fast_gettimeoffset_quotient;
+
+static unsigned long do_highres_gettimeoffset(void)
+{
+        /*
+         * We are under the xtime_lock here.
+         */
+        long tmp = quick_get_cpuctr();
+        long rtn = arch_cycles_to_usec(tmp + sub_jiffie());
+	return rtn;
+}
+
+static void high_res_mark_offset_tsc(void)
+{
+	return;
+}
+
+
+/* ------ Calibrate the TSC ------- 
+ * Return 2^32 * (1 / (TSC clocks per usec)) for do_fast_gettimeoffset().
+ * Too much 64-bit arithmetic here to do this cleanly in C, and for
+ * accuracy's sake we want to keep the overhead on the CTC speaker (channel 2)
+ * output busy loop as low as possible. We avoid reading the CTC registers
+ * directly because of the awkward 8-bit access mechanism of the 82C54
+ * device.
+ */
+
+#define CAL_JIFS 5
+#define CALIBRATE_LATCH	(((CAL_JIFS * CLOCK_TICK_RATE) + HZ/2)/HZ)
+#define CALIBRATE_TIME	((CAL_JIFS * USEC_PER_SEC)/HZ)
+#define CALIBRATE_TIME_NSEC (CAL_JIFS * (NSEC_PER_SEC/HZ))
+
+static __initdata unsigned long tsc_cycles_per_5_jiffies;
+
+static unsigned long __init calibrate_tsc(void)
+{
+       /* Set the Gate high, disable speaker */
+	outb((inb(0x61) & ~0x02) | 0x01, 0x61);
+
+	/*
+	 * Now let's take care of CTC channel 2
+	 *
+	 * Set the Gate high, program CTC channel 2 for mode 0,
+	 * (interrupt on terminal count mode), binary count,
+	 * load 5 * LATCH count, (LSB and MSB) to begin countdown.
+	 */
+	outb(0xb0, 0x43);			/* binary, mode 0, LSB/MSB, Ch 2 */
+	outb(CALIBRATE_LATCH & 0xff, 0x42);	/* LSB of count */
+	outb(CALIBRATE_LATCH >> 8, 0x42);	/* MSB of count */
+
+	{
+		unsigned long startlow, starthigh;
+		unsigned long endlow, endhigh;
+		unsigned long count;
+
+		rdtsc(startlow,starthigh);
+		count = 0;
+		do {
+			count++;
+		} while ((inb(0x61) & 0x20) == 0);
+		rdtsc(endlow,endhigh);
+
+
+		/* Error: ECTCNEVERSET */
+		if (count <= 1)
+			goto bad_ctc;
+
+		/* 64-bit subtract - gcc just messes up with long longs */
+		__asm__("subl %2,%0\n\t"
+			"sbbl %3,%1"
+			:"=a" (endlow), "=d" (endhigh)
+			:"g" (startlow), "g" (starthigh),
+			 "0" (endlow), "1" (endhigh));
+
+		/* Error: ECPUTOOFAST */
+		if (endhigh)
+			goto bad_ctc;
+
+		/* Error: ECPUTOOSLOW */
+		if (endlow <= CALIBRATE_TIME)
+			goto bad_ctc;
+
+                /*
+                 * endlow at this point is CAL_JIFS * arch clocks
+                 * per jiffie.  Set up the value for 
+                 * high_res use. Note: keep the whole
+                 * value for now, we will do
+                 * the divide later (want that precision).
+                 */
+                tsc_cycles_per_5_jiffies = endlow;
+
+		__asm__("divl %2"
+			:"=a" (endlow), "=d" (endhigh)
+			:"r" (endlow), "0" (0), "1" (CALIBRATE_TIME));
+
+		return endlow;
+	}
+
+	/*
+	 * The CTC wasn't reliable: we got a hit on the very first read,
+	 * or the CPU was so fast/slow that the quotient wouldn't fit in
+	 * 32 bits..
+	 */
+bad_ctc:
+        printk("******************** TSC calibrate failed!\n");
+	return 0;
+}
+
+
+#ifdef CONFIG_CPU_FREQ
+
+static int
+time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
+		       void *data)
+{
+	struct cpufreq_freqs *freq = data;
+	unsigned int i;
+
+	if (!cpu_has_tsc)
+		return 0;
+
+	if((val == CPUFREQ_PRECHANGE && (freq->old < freq->new)) ||
+	   (val == CPUFREQ_POSTCHANGE && (freq->old > freq->new))){
+		if((freq->cpu == CPUFREQ_ALL_CPUS) || (freq->cpu == 0)){
+
+			cpu_khz = cpufreq_scale(cpu_khz, freq->old, freq->new);
+
+		        arch_to_usec = 
+				fast_gettimeoffset_quotient = 
+				cpufreq_scale(fast_gettimeoffset_quotient, 
+					      freq->new, freq->old);
+			arch_to_latch = 
+				cpufreq_scale(arch_to_latch, 
+					      freq->new, freq->old);
+			arch_to_nsec =
+				cpufreq_scale(arch_to_nsec, 
+					      freq->new, freq->old);
+			nsec_to_arch =
+				cpufreq_scale(nsec_to_arch, 
+					      freq->old, freq->new);
+			usec_to_arch =
+				cpufreq_scale(usec_to_arch, 
+					      freq->old, freq->new);
+			cycles_per_jiffies =
+				cpufreq_scale(cycles_per_jiffies, 
+					      freq->old, freq->new);
+		}
+		for (i=0; i<NR_CPUS; i++)
+			if ((freq->cpu == CPUFREQ_ALL_CPUS) || (freq->cpu == i))
+				cpu_data[i].loops_per_jiffy = 
+					cpufreq_scale(
+						cpu_data[i].loops_per_jiffy, 
+						freq->old, freq->new);
+	}
+
+	return 0;
+}
+
+static struct notifier_block time_cpufreq_notifier_block = {
+	.notifier_call = time_cpufreq_notifier,
+};
+#endif
+
+
+static int high_res_init_tsc(void)
+{
+	/*
+	 * If we have APM enabled or the CPU clock speed is variable
+	 * (CPU stops clock on HLT or slows clock to save power)
+	 * then the TSC timestamps may diverge by up to 1 jiffy from
+	 * 'real time' but nothing will break.
+	 * The most frequent case is that the CPU is "woken" from a halt
+	 * state by the timer interrupt itself, so we get 0 error. In the
+	 * rare cases where a driver would "wake" the CPU and request a
+	 * timestamp, the maximum error is < 1 jiffy. But timestamps are
+	 * still perfectly ordered.
+	 * Note that the TSC counter will be reset if APM suspends
+	 * to disk; this won't break the kernel, though, 'cuz we're
+	 * smart.  See arch/i386/kernel/apm.c.
+	 */
+ 	/*
+ 	 *	Firstly we have to do a CPU check for chips with
+ 	 * 	a potentially buggy TSC. At this point we haven't run
+ 	 *	the ident/bugs checks so we must run this hook as it
+ 	 *	may turn off the TSC flag.
+ 	 *
+ 	 *	NOTE: this doesn't yet handle SMP 486 machines where only
+ 	 *	some CPUs have a TSC. That's never worked and nobody has
+ 	 *	moaned; if you have the only one in the world - you fix it!
+ 	 */
+ 
+ 	dodgy_tsc();
+ 	
+	if (cpu_has_tsc) {
+		unsigned long tsc_quotient = calibrate_tsc();
+		if (tsc_quotient) {
+			fast_gettimeoffset_quotient = tsc_quotient;
+			/*
+			 *	We could be more selective here I suspect
+			 *	and just enable this for the next intel chips ?
+			 */
+			x86_udelay_tsc = 1;
+
+                        /*
+                         * Kick off the high res timers
+                         */
+			/*
+			 * The init_hrtimers macro is in the chosen
+			 * support package depending on the clock
+			 * source, PIT, TSC, or ACPI pm timer.
+			 */
+			arch_to_usec = fast_gettimeoffset_quotient;
+ 
+			arch_to_latch = div_ll_X_l(
+				mpy_l_X_l_ll(fast_gettimeoffset_quotient,
+					     CLOCK_TICK_RATE),
+				(USEC_PER_SEC));
+
+			arch_to_nsec = div_sc_n(HR_TIME_SCALE_NSEC,
+						CALIBRATE_TIME * NSEC_PER_USEC,
+						tsc_cycles_per_5_jiffies);
+
+			nsec_to_arch = div_sc_n(HR_TIME_SCALE_NSEC,
+						tsc_cycles_per_5_jiffies,
+						CALIBRATE_TIME * NSEC_PER_USEC);
+
+			usec_to_arch = div_sc_n(HR_TIME_SCALE_USEC,
+						tsc_cycles_per_5_jiffies,
+						CALIBRATE_TIME );
+
+			cycles_per_jiffies = tsc_cycles_per_5_jiffies / 
+				CAL_JIFS;  
+
+			start_PIT();
+
+			/* report CPU clock rate in kHz.
+			 * The formula is:
+			 * (10^3 * 2^32) / (2^32 * 1 / (clocks/us)) =
+			 * clocks/ms = kHz. Our precision is about 100 ppm.
+			 */
+			cpu_khz = div_sc32( 1000, tsc_quotient);
+			{	
+				printk("Detected %lu.%03lu MHz processor.\n", 
+				       cpu_khz / 1000, cpu_khz % 1000);
+			}
+#ifdef CONFIG_CPU_FREQ
+			cpufreq_register_notifier(&time_cpufreq_notifier_block,
+						  CPUFREQ_TRANSITION_NOTIFIER);
+#endif
+			return 0;
+		}
+	}
+	return -ENODEV;
+}
+
+/************************************************************/
+
+/* tsc timer_opts struct */
+struct timer_opts hrtimer_tsc = {
+	.init =		high_res_init_tsc,
+	.mark_offset =	high_res_mark_offset_tsc, 
+	.get_offset =	do_highres_gettimeoffset,
+};
+
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/kernel/timers/timer.c linux/arch/i386/kernel/timers/timer.c
--- linux-2.5.52-bk4-core/arch/i386/kernel/timers/timer.c	Tue Oct 15 15:42:24 2002
+++ linux/arch/i386/kernel/timers/timer.c	Thu Dec 19 13:00:26 2002
@@ -1,15 +1,32 @@
 #include <linux/kernel.h>
+#include <linux/hrtime.h>
 #include <asm/timer.h>
-
+/*
+ * export this here so it can be used by more than one clock source
+ */
+unsigned long fast_gettimeoffset_quotient;
 /* list of externed timers */
 extern struct timer_opts timer_pit;
 extern struct timer_opts timer_tsc;
+extern struct timer_opts hrtimer_tsc;
+extern struct timer_opts hrtimer_pm;
+extern struct timer_opts hrtimer_pit;
 
 /* list of timers, ordered by preference, NULL terminated */
 static struct timer_opts* timers[] = {
+#ifdef CONFIG_HIGH_RES_TIMERS
+#ifdef CONFIG_HIGH_RES_TIMER_ACPI_PM
+	&hrtimer_pm,
+#elif  CONFIG_HIGH_RES_TIMER_TSC
+	&hrtimer_tsc,
+#elif  CONFIG_HIGH_RES_TIMER_PIT
+	&hrtimer_pit,
+#endif
+#else
 	&timer_tsc,
 #ifndef CONFIG_X86_TSC
 	&timer_pit,
+#endif
 #endif
 	NULL,
 };
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/kernel/timers/timer_pit.c linux/arch/i386/kernel/timers/timer_pit.c
--- linux-2.5.52-bk4-core/arch/i386/kernel/timers/timer_pit.c	Mon Nov  4 15:58:24 2002
+++ linux/arch/i386/kernel/timers/timer_pit.c	Thu Dec 19 13:00:26 2002
@@ -10,6 +10,7 @@
 #include <asm/mpspec.h>
 #include <asm/timer.h>
 #include <asm/io.h>
+#include <linux/hrtime.h>
 
 extern spinlock_t i8259A_lock;
 extern spinlock_t i8253_lock;
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/kernel/timers/timer_tsc.c linux/arch/i386/kernel/timers/timer_tsc.c
--- linux-2.5.52-bk4-core/arch/i386/kernel/timers/timer_tsc.c	Mon Nov 18 12:30:46 2002
+++ linux/arch/i386/kernel/timers/timer_tsc.c	Thu Dec 19 13:00:26 2002
@@ -26,7 +26,7 @@
  * Equal to 2^32 * (1 / (clocks per usec) ).
  * Initialized in time_init.
  */
-unsigned long fast_gettimeoffset_quotient;
+extern unsigned long fast_gettimeoffset_quotient;
 
 static unsigned long get_offset_tsc(void)
 {
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/arch/i386/mach-generic/do_timer.h linux/arch/i386/mach-generic/do_timer.h
--- linux-2.5.52-bk4-core/arch/i386/mach-generic/do_timer.h	Tue Nov 12 12:39:37 2002
+++ linux/arch/i386/mach-generic/do_timer.h	Thu Dec 19 13:00:26 2002
@@ -16,6 +16,11 @@
 static inline void do_timer_interrupt_hook(struct pt_regs *regs)
 {
 	do_timer(regs);
+        IF_HIGH_RES(
+                if (!(new_jiffie() & 1))
+                        return;
+                jiffies_intr = 0;
+                )
 /*
  * In the SMP case we use the local APIC timer interrupt to do the
  * profiling, except when we simulate SMP mode on a uniprocessor
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/include/asm-i386/hrtime-M386.h linux/include/asm-i386/hrtime-M386.h
--- linux-2.5.52-bk4-core/include/asm-i386/hrtime-M386.h	Wed Dec 31 16:00:00 1969
+++ linux/include/asm-i386/hrtime-M386.h	Thu Dec 19 13:00:26 2002
@@ -0,0 +1,272 @@
+/*
+ *
+ * File: include/asm-i386/hrtime-M386.h
+ * Copyright (C) 1999 by the University of Kansas Center for Research, Inc.
+ * Copyright (C) 2001 by MontaVista Software.
+ *
+ * This software was developed by the Information and
+ * Telecommunication Technology Center (ITTC) at the University of
+ * Kansas.  Partial funding for this project was provided by Sprint. This
+ * software may be used and distributed according to the terms of the GNU
+ * Public License, incorporated herein by reference.  Neither ITTC nor
+ * Sprint accept any liability whatsoever for this product.
+ *
+ * This project was developed under the direction of Dr. Douglas Niehaus.
+ * 
+ * Authors: Balaji S., Raghavan Menon
+ *	    Furquan Ansari, Jason Keimig, Apurva Sheth
+ *
+ * Thanx to Michael Barabanov for helping me with the non-pentium code.
+ *
+ * Please send bug-reports/suggestions/comments to utime@ittc.ukans.edu
+ * 
+ * Further details about this project can be obtained at
+ *    http://hegel.ittc.ukans.edu/projects/utime/ 
+ *    or in the file Documentation/utime.txt
+ */
+/* This is in case it's not a Pentium or a PPro.
+ * We don't have access to the cycle counters.
+ */
+/* 
+ * This code swiped from the utime project to support high res timers
+ * Principal thief George Anzinger george@mvista.com
+ */
+#ifndef _ASM_HRTIME_M386_H
+#define _ASM_HRTIME_M386_H
+
+#ifdef __KERNEL__
+
+extern int base_c0, base_c0_offset;
+#define timer_latch_reset(x) _timer_latch_reset = x
+extern int _timer_latch_reset;
+
+/*
+ * Never call this routine with local ints on.
+ * update_jiffies_sub()
+ */
+
+extern inline unsigned int
+read_timer_chip(void)
+{
+	unsigned int next_intr;
+
+	LATCH_CNT0();
+	READ_CNT0(next_intr);
+	return next_intr;
+}
+
+#define HR_SCALE_ARCH_NSEC 20
+#define HR_SCALE_ARCH_USEC 30
+#define HR_SCALE_NSEC_ARCH 32
+#define HR_SCALE_USEC_ARCH 29
+
+#define cf_arch_to_usec (SC_n(HR_SCALE_ARCH_USEC,1000000)/ \
+                           (long long)CLOCK_TICK_RATE)
+
+extern inline int
+arch_cycles_to_usec(long update)
+{
+	return (mpy_sc_n(HR_SCALE_ARCH_USEC, update, arch_to_usec));
+}
+#define cf_arch_to_nsec (SC_n(HR_SCALE_ARCH_NSEC,1000000000)/ \
+                           (long long)CLOCK_TICK_RATE)
+
+extern inline int
+arch_cycles_to_nsec(long update)
+{
+	return mpy_sc_n(HR_SCALE_ARCH_NSEC, update, arch_to_nsec);
+}
+/* 
+ * And the other way...
+ */
+#define cf_usec_to_arch (SC_n( HR_SCALE_USEC_ARCH,CLOCK_TICK_RATE)/ \
+                                            (long long)1000000)
+extern inline int
+usec_to_arch_cycles(unsigned long usec)
+{
+	return mpy_sc_n(HR_SCALE_USEC_ARCH, usec, usec_to_arch);
+}
+#define cf_nsec_to_arch (SC_n( HR_SCALE_NSEC_ARCH,CLOCK_TICK_RATE)/ \
+                                            (long long)1000000000)
+extern inline int
+nsec_to_arch_cycles(long nsec)
+{
+	return (mpy_sc32(nsec, nsec_to_arch));
+}
+#ifndef CONFIG_SMP
+/*
+ * If this is defined otherwise to allow NTP adjusting, it should
+ * be scaled by about 16 bits (or so) to allow small percentage
+ * changes
+ */
+#define arch_cycles_to_latch(x) x
+
+#else
+/*
+ * APIC clocks run from a low of 33 MHz to, say, 200 MHz.  The PIT timer
+ * runs at about 1.2 MHz.  We want to scale so that ( APIC << scale )/PIT
+ * is less than 2 ^ 32.  Let's use 2 ^ 19, which leaves plenty of room.
+ */
+#define HR_SCALE_ARCH_LATCH 19
+
+#define compute_latch(APIC_clocks_jiffie) arch_to_latch = div_sc_n(   \
+                                                    HR_SCALE_ARCH_LATCH,   \
+				                    APIC_clocks_jiffie,   \
+				                    cycles_per_jiffies);
+extern inline int
+arch_cycles_to_latch(unsigned long update)
+{
+	return (mpy_sc_n(HR_SCALE_ARCH_LATCH, update, arch_to_latch));
+}
+
+#endif
+/*
+ * This function updates base_c0
+ * This function is always called under the write_lock_irq(&xtime_lock)
+ * It returns the number of "clocks" since the last call to it.
+ *
+ * There is a problem having a counter that has a period the same as the
+ * interval at which it is interrogated, i.e. did it just roll over or has
+ * a very short time really elapsed?  (One of the reasons one should not
+ * use the PIT for both ints and time.)  We will take the occurrence of an
+ * interrupt since last time to indicate that the counter has reset.  This
+ * will work for the get_cpuctr() code but is flawed for the
+ * quick_get_cpuctr() as it is called whenever time is requested.  For
+ * that code, we make sure that we never move backward in time.
+ */
+extern inline unsigned long
+get_cpuctr(void)
+{
+	int c0;
+	long rtn;
+
+	spin_lock(&i8253_lock);
+	c0 = read_timer_chip();
+
+	rtn = base_c0 - c0 + _timer_latch_reset;
+
+//      if (rtn < 0) {
+//                rtn += _timer_latch_reset;
+//        }
+	base_c0 = c0;
+	base_c0_offset = 0;
+	spin_unlock(&i8253_lock);
+
+	return rtn;
+}
+/*
+ * In an SMP system this is called under the read_lock_irq(xtime_lock)
+ * In a UP system it is also called with this lock (PIT case only)
+ * It returns the number of "clocks" since the last call to get_cpuctr (above).
+ */
+extern inline unsigned long
+quick_get_cpuctr(void)
+{
+	register int c0;
+	long rtn;
+
+	spin_lock(&i8253_lock);
+	c0 = read_timer_chip();
+	/*
+	 * If the new count is greater than 
+	 * the last one (base_c0) the chip has just rolled and an 
+	 * interrupt is pending.  To get the time right, we need to add
+	 * _timer_latch_reset to the answer.  All this is true if only
+	 * one roll is involved, but base_c0 should be updated at least
+	 * every 1/HZ.
+	 */
+	rtn = base_c0 - c0;
+	if (rtn < base_c0_offset) {
+		rtn += _timer_latch_reset;
+	}
+	base_c0_offset = rtn;
+	spin_unlock(&i8253_lock);
+	return rtn;
+}
+
+#ifdef _INCLUDED_FROM_TIME_C
+int base_c0 = 0;
+int base_c0_offset = 0;
+struct timer_conversion_bits timer_conversion_bits = {
+	_cycles_per_jiffies:(LATCH),
+	_nsec_to_arch:cf_nsec_to_arch,
+	_usec_to_arch:cf_usec_to_arch,
+	_arch_to_nsec:cf_arch_to_nsec,
+	_arch_to_usec:cf_arch_to_usec,
+	_arch_to_latch:1
+};
+EXTERN int _timer_latch_reset = LATCH;
+
+#define set_last_timer_cc() (void)(1)
+
+/* This returns the correct cycles_per_sec from a calibrated one
+ */
+#define arch_hrtime_init(x) (CLOCK_TICK_RATE)
+
+/*
+ * The reload_timer_chip routine is called under the timerlist lock (irq off)
+ * and, in SMP, the xtime_lock.  We also take the i8253_lock for the chip access
+ */
+#ifndef CONFIG_X86_LOCAL_APIC
+
+extern inline void
+reload_timer_chip(int new_latch_value)
+{
+	int c1, c1new, delta;
+	unsigned char pit_status;
+	/*
+	 * Input value is in timer units for the 386 platform.
+	 * We must be called with irq disabled.
+	 */
+	spin_lock(&i8253_lock);
+	/*
+	 * we need to get this last value of the timer chip
+	 */
+	LATCH_CNT0_AND_CNT1();
+	READ_CNT0(delta);
+	READ_CNT1(c1);
+	base_c0 -= delta;
+
+	new_latch_value = arch_cycles_to_latch(new_latch_value);
+	if (new_latch_value < TIMER_DELTA) {
+		new_latch_value = TIMER_DELTA;
+	}
+	outb_p(PIT0_PERIODIC, PIT_COMMAND);
+	outb_p(new_latch_value & 0xff, PIT0);	/* LSB */
+	outb(new_latch_value >> 8, PIT0);	/* MSB */
+	do {
+		outb_p(PIT0_LATCH_STATUS, PIT_COMMAND);
+		pit_status = inb(PIT0);
+	} while (pit_status & PIT_NULL_COUNT);
+	do {
+		LATCH_CNT0_AND_CNT1();
+		READ_CNT0(delta);
+		READ_CNT1(c1new);
+	} while (!(((new_latch_value - delta) & 0xffff) < 15));
+
+	outb_p(LATCH & 0xff, PIT0);	/* LSB */
+	outb(LATCH >> 8, PIT0);	/* MSB */
+
+	/*
+	 * This assumes that counter one is latched with
+	 * 18 as the value.
+	 * Most BIOSes do this, I guess....
+	 */
+	c1 -= c1new;
+	base_c0 += ((c1 < 0) ? (c1 + 18) : (c1)) + delta;
+	if (base_c0 < 0) {
+		base_c0 += _timer_latch_reset;
+	}
+	spin_unlock(&i8253_lock);
+	return;
+}
+#endif
+/*
+ * No run time conversion factors need to be set up as the PIT has a fixed
+ * speed.
+ */
+#define init_hrtimers()
+
+#endif				/* _INCLUDED_FROM_TIME_C */
+#endif				/* __KERNEL__ */
+#endif				/* _ASM_HRTIME_M386_H */
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/include/asm-i386/hrtime-M586.h linux/include/asm-i386/hrtime-M586.h
--- linux-2.5.52-bk4-core/include/asm-i386/hrtime-M586.h	Wed Dec 31 16:00:00 1969
+++ linux/include/asm-i386/hrtime-M586.h	Thu Dec 19 13:00:26 2002
@@ -0,0 +1,170 @@
+/*
+ * UTIME: On-demand Microsecond Resolution Timers
+ * ----------------------------------------------
+ *
+ * File: include/asm-i386/hrtime-M586.h
+ * Copyright (C) 1999 by the University of Kansas Center for Research, Inc.
+ * Copyright (C) 2001 by MontaVista Software.
+ *
+ * This software was developed by the Information and
+ * Telecommunication Technology Center (ITTC) at the University of
+ * Kansas.  Partial funding for this project was provided by Sprint. This
+ * software may be used and distributed according to the terms of the GNU
+ * Public License, incorporated herein by reference.  Neither ITTC nor
+ * Sprint accept any liability whatsoever for this product.
+ *
+ * This project was developed under the direction of Dr. Douglas Niehaus.
+ * 
+ * Authors: Balaji S., Raghavan Menon
+ *	    Furquan Ansari, Jason Keimig, Apurva Sheth
+ *
+ * Please send bug-reports/suggestions/comments to utime@ittc.ukans.edu
+ * 
+ * Further details about this project can be obtained at
+ *    http://hegel.ittc.ukans.edu/projects/utime/ 
+ *    or in the file Documentation/utime.txt
+ */
+/* 
+ * This code swiped from the utime project to support high res timers
+ * Principal thief George Anzinger george@mvista.com
+ */
+#include <asm/msr.h>
+#ifndef _ASM_HRTIME_M586_H
+#define _ASM_HRTIME_M586_H
+
+#ifdef __KERNEL__
+
+#ifdef _INCLUDED_FROM_TIME_C
+/*
+ * This gets redefined when we calibrate the TSC
+ */
+struct timer_conversion_bits timer_conversion_bits = {
+	_cycles_per_jiffies:LATCH
+};
+#endif
+
+/*
+ * This define avoids an ugly ifdef in time.c
+ */
+#define get_cpuctr_from_timer_interrupt()
+#define timer_latch_reset(s)
+
+/* NOTE: When trying to port this to other architectures define
+ * this to be (void)(1) (ie. #define set_last_timer_cc() (void)(1))
+ * otherwise sched.c would give an undefined reference
+ */
+
+// think this is old cruft... extern void set_last_timer_cc(void);
+/*
+ * These are specific to the pentium counters
+ */
+extern inline unsigned long
+get_cpuctr(void)
+{
+	/*
+	 * We are interested only in deltas so we just use the low bits;
+	 * at 1 GHz this should be good for 4.2 seconds, at 100 GHz 42 ms.
+	 */
+	unsigned long old = last_update;
+	rdtscl(last_update);
+	return last_update - old;
+}
+extern inline unsigned long
+quick_get_cpuctr(void)
+{
+	unsigned long value;
+	rdtscl(value);
+	return value - last_update;
+}
+#define arch_hrtime_init(x) (x)
+
+extern unsigned long long base_cpuctr;
+extern unsigned long base_jiffies;
+/* 
+ * We use various scaling.  The sc32 scales by 2**32, sc_n by the first parm.
+ * When working with constants, choose a scale such that x/n->(32-scale)< 1/2.
+ * So 1/3 < 1/2 gives a scale of 32, whereas 3/1 must be shifted 3 times (3/8)
+ * to be less than 1/2, so its scale should be 29.
+ *
+ * The principal high end is when we can no longer keep 1/HZ worth of arch
+ * time (TSC counts) in an integer.  This will happen somewhere between 40GHz and
+ * 50GHz with HZ set to 100.  For now we are cool and the scale of 24 works for
+ * the nanosecond to arch from 2MHz to 40+GHz.
+ */
+#define HR_TIME_SCALE_NSEC 22
+#define HR_TIME_SCALE_USEC 14
+extern inline int
+arch_cycles_to_usec(unsigned long update)
+{
+	return (mpy_sc32(update, arch_to_usec));
+}
+/*
+ * We use the same scale for both the pit and the APIC
+ */
+extern inline int
+arch_cycles_to_latch(unsigned long update)
+{
+	return (mpy_sc32(update, arch_to_latch));
+}
+#define compute_latch(APIC_clocks_jiffie) arch_to_latch = \
+                                             div_sc32(APIC_clocks_jiffie, \
+				                      cycles_per_jiffies);
+
+extern inline int
+arch_cycles_to_nsec(long update)
+{
+	return mpy_sc_n(HR_TIME_SCALE_NSEC, update, arch_to_nsec);
+}
+/* 
+ * And the other way...
+ */
+extern inline int
+usec_to_arch_cycles(unsigned long usec)
+{
+	return mpy_sc_n(HR_TIME_SCALE_USEC, usec, usec_to_arch);
+}
+extern inline int
+nsec_to_arch_cycles(unsigned long nsec)
+{
+	return mpy_sc_n(HR_TIME_SCALE_NSEC, nsec, nsec_to_arch);
+}
+
+EXTERN int pit_pgm_correction;
+
+#ifdef _INCLUDED_FROM_TIME_C
+
+#include <asm/io.h>
+
+#ifndef USEC_PER_SEC
+#define USEC_PER_SEC 1000000
+#endif
+	/*
+	 * Code for runtime calibration of high res timers
+	 * Watch out, cycles_per_sec will overflow when we
+	 * get a ~ 2.14 GHz machine...
+	 * We are starting with tsc_cycles_per_5_jiffies set to 
+	 * 5 times the actual value (as set by 
+	 * calibrate_tsc() ).
+	 */
+#define init_hrtimers() \
+        arch_to_usec = fast_gettimeoffset_quotient; \
+ \
+        arch_to_latch = div_ll_X_l(mpy_l_X_l_ll(fast_gettimeoffset_quotient, \
+                                                CLOCK_TICK_RATE),           \
+                                   (USEC_PER_SEC));          \
+\
+        arch_to_nsec = div_sc_n(HR_TIME_SCALE_NSEC, \
+                               CALIBRATE_TIME * NSEC_PER_USEC, \
+                               tsc_cycles_per_5_jiffies); \
+ \
+        nsec_to_arch = div_sc_n(HR_TIME_SCALE_NSEC, \
+                                tsc_cycles_per_5_jiffies, \
+                                CALIBRATE_TIME * NSEC_PER_USEC); \
+        usec_to_arch = div_sc_n(HR_TIME_SCALE_USEC, \
+                                tsc_cycles_per_5_jiffies, \
+                                CALIBRATE_TIME ); \
+        cycles_per_jiffies = tsc_cycles_per_5_jiffies / CAL_JIFS;
+
+#endif				/* _INCLUDED_FROM_TIME_C */
+#endif				/* __KERNEL__ */
+#endif				/* _ASM_HRTIME_M586_H */
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/include/asm-i386/hrtime-Macpi.h linux/include/asm-i386/hrtime-Macpi.h
--- linux-2.5.52-bk4-core/include/asm-i386/hrtime-Macpi.h	Wed Dec 31 16:00:00 1969
+++ linux/include/asm-i386/hrtime-Macpi.h	Thu Dec 19 13:00:26 2002
@@ -0,0 +1,190 @@
+/*
+ *
+ * File: include/asm-i386/hrtime-Macpi.h 
+ * Copyright (C) 2001 by MontaVista Software,
+
+ * This software may be used and distributed according to the terms of
+ * the GNU Public License, incorporated herein by reference.
+
+ */
+#include <asm/msr.h>
+#include <asm/io.h>
+#ifndef _ASM_HRTIME_Macpi_H
+#define _ASM_HRTIME_Macpi_H
+
+#ifdef __KERNEL__
+
+/*
+ * This define avoids an ugly ifdef in time.c
+ */
+#define timer_latch_reset(s)
+
+/* NOTE: When trying to port this to other architectures define
+ * this to be (void)(1) (ie. #define set_last_timer_cc() (void)(1))
+ * otherwise sched.c would give an undefined reference
+ */
+
+extern void set_last_timer_cc(void);
+/*
+ * These are specific to the ACPI pm counter
+ * The spec says the counter can be either 32 or 24 bits wide.  We treat them
+ * both as 24 bits.  It's faster than doing the test.
+ */
+#define SIZE_MASK 0xffffff
+
+extern int acpi_pm_tmr_address;
+
+extern inline unsigned long
+get_cpuctr(void)
+{
+	static long old;
+
+	old = last_update;
+	last_update = inl(acpi_pm_tmr_address);
+	return (last_update - old) & SIZE_MASK;
+}
+extern inline unsigned long
+quick_get_cpuctr(void)
+{
+	return (inl(acpi_pm_tmr_address) - last_update) & SIZE_MASK;
+}
+#define arch_hrtime_init(x) (x)
+
+/* 
+ * We use various scaling.  The sc32 scales by 2**32, sc_n by the first parm.
+ * When working with constants, choose a scale such that x/n->(32-scale)< 1/2.
+ * So 1/3 < 1/2 gives a scale of 32, whereas 3/1 must be shifted 3 times (3/8)
+ * to be less than 1/2, so its scale should be 29.
+ *
+ */
+#define HR_SCALE_ARCH_NSEC 22
+#define HR_SCALE_ARCH_USEC 32
+#define HR_SCALE_NSEC_ARCH 32
+#define HR_SCALE_USEC_ARCH 29
+
+#ifndef  PM_TIMER_FREQUENCY
+#define PM_TIMER_FREQUENCY  3579545	/* 3579545.45 counts per second */
+#endif
+#define PM_TIMER_FREQUENCY_x_100  357954545	/* counts per second * 100 */
+
+#define cf_arch_to_usec (SC_32(100000000)/(long long)PM_TIMER_FREQUENCY_x_100)
+extern inline int
+arch_cycles_to_usec(unsigned long update)
+{
+	return (mpy_sc32(update, arch_to_usec));
+}
+/* 
+ * Note: In the SMP case this value will be overwritten when the 
+ * APIC clocks are figured out using the "compute_latch" function below.
+ * If the system is not SMP, the PIT is the ticker and this is the 
+ * conversion for that.
+ */
+#define cf_arch_to_latch SC_32(CLOCK_TICK_RATE)/(long long)(CLOCK_TICK_RATE * 3)
+
+#ifndef CONFIG_SMP
+/*
+ * We need to take 1/3 of the presented value (or, more exactly,
+ * CLOCK_TICK_RATE / PM_TIMER_FREQUENCY).  Note that these two timers
+ * are on the same crystal so the ratio will be EXACTLY 1/3.
+ */
+extern inline int
+arch_cycles_to_latch(unsigned long update)
+{
+	return (mpy_sc32(update, arch_to_latch));
+}
+#else
+/*
+ * APIC clocks run from a low of 33 MHz to, say, 200 MHz.  The PM timer
+ * runs at about 3.5 MHz.  We want to scale so that ( APIC << scale )/PM
+ * is less than 2 ^ 32.  Let's use 2 ^ 19, which leaves plenty of room.
+ */
+#define HR_SCALE_ARCH_LATCH 19
+
+#define compute_latch(APIC_clocks_jiffie) arch_to_latch = div_sc_n(   \
+                                                    HR_SCALE_ARCH_LATCH,   \
+				                    APIC_clocks_jiffie,   \
+				                    cycles_per_jiffies);
+extern inline int
+arch_cycles_to_latch(unsigned long update)
+{
+	return (mpy_sc_n(HR_SCALE_ARCH_LATCH, update, arch_to_latch));
+}
+
+#endif
+
+#define cf_arch_to_nsec (SC_n(HR_SCALE_ARCH_NSEC,100000000000LL)/ \
+                           (long long)PM_TIMER_FREQUENCY_x_100)
+
+extern inline int
+arch_cycles_to_nsec(long update)
+{
+	return mpy_sc_n(HR_SCALE_ARCH_NSEC, update, arch_to_nsec);
+}
+/* 
+ * And the other way...
+ */
+#define cf_usec_to_arch (SC_n( HR_SCALE_USEC_ARCH,PM_TIMER_FREQUENCY_x_100)/ \
+                                            (long long)100000000)
+extern inline int
+usec_to_arch_cycles(unsigned long usec)
+{
+	return mpy_sc_n(HR_SCALE_USEC_ARCH, usec, usec_to_arch);
+}
+#define cf_nsec_to_arch (SC_n( HR_SCALE_NSEC_ARCH,PM_TIMER_FREQUENCY)/ \
+                                            (long long)1000000000)
+extern inline int
+nsec_to_arch_cycles(unsigned long nsec)
+{
+	return mpy_sc32(nsec, nsec_to_arch);
+}
+
+extern int hrt_get_acpi_pm_ptr(void);
+//EXTERN int pit_pgm_correction;
+
+#ifdef _INCLUDED_FROM_TIME_C
+
+#include <asm/io.h>
+struct timer_conversion_bits timer_conversion_bits = {
+	_cycles_per_jiffies:((PM_TIMER_FREQUENCY + HZ / 2) / HZ),
+	_nsec_to_arch:cf_nsec_to_arch,
+	_usec_to_arch:cf_usec_to_arch,
+	_arch_to_nsec:cf_arch_to_nsec,
+	_arch_to_usec:cf_arch_to_usec,
+	_arch_to_latch:cf_arch_to_latch
+};
+int acpi_pm_tmr_address;
+
+#endif				/* _INCLUDED_FROM_TIME_C_ */
+
+/*
+ * No run time conversion factors need to be set up as the pm timer has a fixed
+ * speed.
+ */
+/*
+ * Here we have a local udelay for our init use only.  The system delay
+ * has not yet been calibrated when we use this, however, we do know
+ * tsc_cycles_per_5_jiffies...
+ */
+
+#if defined( CONFIG_HIGH_RES_TIMER_ACPI_PM_ADD) && CONFIG_HIGH_RES_TIMER_ACPI_PM_ADD > 0
+#define default_pm_add CONFIG_HIGH_RES_TIMER_ACPI_PM_ADD
+#define message "High-res-timers: ACPI pm timer not found.  Trying specified address %d\n"
+#else
+#define default_pm_add 0
+#define message \
+        "High-res-timers: ACPI pm timer not found(%d) and no backup."\
+        "\nCheck BIOS settings or supply a backup.  See configure documentation.\n"
+#endif
+#define fail_message \
+"High-res-timers: >-<--><-->-<-->-<-->-<--><-->-<-->-<-->-<-->-<-->-<-->-<-->-<\n"\
+"High-res-timers: >Failed to find the ACPI pm timer                           <\n"\
+"High-res-timers: >-<--><-->-<-->-<-->-<-->Boot will fail in Calibrate Delay  <\n"\
+"High-res-timers: >Supply a valid default pm timer address                    <\n"\
+"High-res-timers: >or get your BIOS to turn on ACPI support.                  <\n"\
+"High-res-timers: >See CONFIGURE help for more information.                   <\n"\
+"High-res-timers: >-<--><-->-<-->-<-->-<--><-->-<-->-<-->-<-->-<-->-<-->-<-->-<\n"
+/*
+ * After we get the address, we set last_update to the current timer value
+ */
+#endif				/* __KERNEL__ */
+#endif				/* _ASM_HRTIME_Macpi_H */
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/include/asm-i386/hrtime.h linux/include/asm-i386/hrtime.h
--- linux-2.5.52-bk4-core/include/asm-i386/hrtime.h	Wed Dec 31 16:00:00 1969
+++ linux/include/asm-i386/hrtime.h	Thu Dec 19 13:00:26 2002
@@ -0,0 +1,523 @@
+/*
+ *
+ * File: include/asm-i386/hrtime.h
+ * Copyright (C) 1999 by the University of Kansas Center for Research, Inc.  
+ * Copyright (C) 2001 by MontaVista Software.
+ *
+ * This software was developed by the Information and
+ * Telecommunication Technology Center (ITTC) at the University of
+ * Kansas.  Partial funding for this project was provided by Sprint. This
+ * software may be used and distributed according to the terms of the GNU
+ * Public License, incorporated herein by reference.  Neither ITTC nor
+ * Sprint accept any liability whatsoever for this product.
+ *
+ * This project was developed under the direction of Dr. Douglas Niehaus.
+ * 
+ * Authors: Balaji S., Raghavan Menon
+ *	    Furquan Ansari, Jason Keimig, Apurva Sheth
+ *
+ * Please send bug-reports/suggestions/comments to utime@ittc.ukans.edu
+ * 
+ * Further details about this project can be obtained at
+ *    http://hegel.ittc.ukans.edu/projects/utime/ 
+ *    or in the file Documentation/high-res-timers/
+ */
+/*
+ * This code purloined from the utime project for high res timers.
+ * Principal modifier George Anzinger george@mvista.com
+ */
+#ifndef _I386_HRTIME_H
+#define _I386_HRTIME_H
+#ifdef __KERNEL__
+
+#include <linux/config.h>	/* for CONFIG_APM etc... */
+#include <asm/types.h>		/* for u16s */
+#include <asm/io.h>
+#include <asm/sc_math.h>	/* scaling math routines */
+#include <asm/delay.h>
+#include <asm/smp.h>
+#include <linux/timex.h>	/* for LATCH */
+/*
+
+ * We always want the timer, if not touched otherwise, to give periodic
+ * 1/HZ interrupts.  This is done by programming the interrupt we want
+ * and, once it is loaded, (in the case of the PIT) dropping a 1/HZ
+ * program on top of it.  For other timers, other strategies are used,
+ * such as programming a 1/HZ interval on interrupt.  The PIT will
+ * give us the desired interrupt and, at interrupt time, load the 1/HZ
+ * program.  So...
+
+ * If no sub 1/HZ ticks are needed AND we are aligned with the 1/HZ 
+ * boundary, we don't need to touch the PIT.  Otherwise we do the above.
+
+ * There are two reasons to keep this:
+ * 1. The NMI watchdog uses the timer interrupt to generate the NMI interrupts.
+ * 2. We don't have to touch the PIT unless we have a sub jiffie event in
+ *    the next 1/HZ interval (unless we drift away from the 1/HZ boundary).
+ */
+
+/*
+ * The high-res-timers option is set up to self configure with different 
+ * platforms.  It is up to the platform to provide certain macros which
+ * override the default macros defined in systems without (or with disabled)
+ * high-res-timers.
+ *
+ * To do high-res-timers at some fundamental level the timer interrupt must
+ * be separated from the time keeping tick.  A tick can still be generated
+ * by the timer interrupt, but it may be surrounded by non-tick interrupts.
+ * It is up to the platform to determine if a particular interrupt is a tick,
+ * and up to the timer code (in timer.c) to determine what time events have
+ * expired.
+ *
+ * Macros:
+ * update_jiffies()  This macro is to compute the new value of jiffie and 
+ *                   sub_jiffie.  If high-res-timers are not available it
+ *                   may be assumed that this macro will be called once
+ *                   every 1/HZ and so should reduce to:
+ *
+ * 	(*(u64 *)&jiffies_64)++;
+ *
+ * sub_jiffie, in this case will always be zero, and need not be addressed.
+ * It is assumed that the sub_jiffie is in platform defined units and runs
+ * from 0 to a value which represents 1/HZ on that platform.  (See conversion
+ * macro requirements below.)
+ * If high-res-timers are available, this macro will be called each timer
+ * interrupt which may be more often than 1/HZ.  It is up to the code to 
+ * determine if a new jiffie has just started and pass this info to:
+ *
+ * new_jiffie() which should return true if the last call to update_jiffies()
+ *              moved the jiffie count (as opposed to just the sub_jiffie).
+ *              For systems without high-res-timers the kernel will predefine
+ *              this to be 0 which will allow the compiler to optimize the code
+ *              for this case.  In SMP systems this should be set to all 1's
+ *              as it is used in a per cpu fashion to indicate that a particular
+ *              cpu needs to run the accounting code.  It should result
+ *              in a variable that can be cast to a volatile long and of
+ *              which the address can be taken.
+ *
+ * schedule_next_int(jiffie_f,sub_jiffie_v,always) is a macro that the 
+ *                                 platform should 
+ *                                 provide that will set up the timer interrupt 
+ *                                 hardware to interrupt at the absolute time
+ *                                 defined by jiffie_f,sub_jiffie_v where the 
+ *                                 units are 1/HZ and the platform defined 
+ *                                 sub_jiffie unit.  This function must 
+ *                                 determine the actual current time and the 
+ *                                 requested offset and act accordingly.  A 
+ *                                 sub_jiffie_v value of -1 should be 
+ *                                 understood to mean the next even jiffie 
+ *                                 regardless of the jiffie_f value.  If 
+ *                                 the current jiffie is not jiffie_f, it 
+ *                                 may be assumed that the requested time 
+ *                                 has passed and an immediate interrupt
+ *                                 should be taken.  If high-res-timers are 
+ *                                 not available, this macro should evaluate 
+ *                                 to nil.  This macro may return 1 if always
+ *                                 is false AND the requested time has passed.
+ *                                 "Always" indicates that an interrupt is
+ *                                 required even if the time has already passed.
+ */
+
+/*
+ * no of usecs less than which events cannot be scheduled
+ */
+#define TIMER_DELTA  5
+#ifdef _INCLUDED_FROM_TIME_C
+#define EXTERN
+int timer_delta = TIMER_DELTA;
+#else
+#define EXTERN  extern
+extern int timer_delta;
+#endif
+
+/*
+
+ * Interrupt generators need to be disciplined to generate the interrupt
+ * on the 1/HZ boundary (assuming we don't need sub_jiffie interrupts) if
+ * the timer clock is other than the interrupt generator clock.  In the
+ * I386 case this includes the PIT and TSC or pm combinations and the
+ * apic and TSC or pm combinations, i.e. all but the PIT/PIT
+ * combination.
+
+ */
+#if defined(CONFIG_X86_LOCAL_APIC) || !defined(CONFIG_HIGH_RES_TIMER_PIT)
+#define TIMER_NEEDS_DISCIPLINE
+#define IF_DISCIPLINE(x) x
+EXTERN int timer_discipline_diff;
+EXTERN int min_hz_sub_jiffie;
+EXTERN int max_hz_sub_jiffie;
+EXTERN int _last_was_long[NR_CPUS];
+#define __last_was_long  _last_was_long[smp_processor_id()]
+#else
+#define IF_DISCIPLINE(x)
+EXTERN int _last_was_long;
+#define __last_was_long  _last_was_long
+#endif
+
+#define CONFIG_HIGH_RES_RESOLUTION 1000	// nanosecond resolution
+					   // we will use for high res.
+
+#define USEC_PER_JIFFIES  (1000000/HZ)
+/*
+ * This is really: x*(CLOCK_TICK_RATE+HZ/2)/1000000
+ * Note that we can not figure the constant part at
+ * compile time because we would lose precision.
+ */
+#define PIT0_LATCH_STATUS 0xc2
+#define PIT0 0x40
+#define PIT1 0x41
+#define PIT_COMMAND 0x43
+#define PIT0_ONE_SHOT 0x38
+#define PIT0_PERIODIC 0x34
+#define PIT0_LATCH_COUNT 0xd2
+#define PIT01_LATCH_COUNT 0xd6
+#define PIT_NULL_COUNT 0x40
+#define READ_CNT0(varr) {varr = inb(PIT0);varr += (inb(PIT0))<<8;}
+#define READ_CNT1(var) { var = inb(PIT1); }
+#define LATCH_CNT0() { outb(PIT0_LATCH_COUNT,PIT_COMMAND); }
+#define LATCH_CNT0_AND_CNT1() { outb(PIT01_LATCH_COUNT,PIT_COMMAND); }
+
+#define TO_LATCH(x) (((x)*LATCH)/USEC_PER_JIFFIES)
+
+#define sub_jiffie() _sub_jiffie
+#define schedule_next_int(a,b,c)  _schedule_next_int(a,b,c)
+
+#define update_jiffies() update_jiffies_sub()
+#define new_jiffie() _new_jiffie
+
+extern unsigned long next_intr;
+extern spinlock_t i8253_lock;
+extern rwlock_t xtime_lock;
+extern volatile unsigned long jiffies;
+extern u64 jiffies_64;
+
+extern int _schedule_next_int(unsigned long jiffie_f, long sub_jiffie_in,
+			      int always);
+
+extern unsigned int volatile latch_reload;
+
+EXTERN int jiffies_intr;
+EXTERN long volatile _new_jiffie;
+EXTERN int _sub_jiffie;
+EXTERN unsigned long volatile last_update;
+EXTERN int high_res_test_val;
+
+extern inline void
+start_PIT(void)
+{
+	spin_lock(&i8253_lock);
+	outb_p(PIT0_PERIODIC, PIT_COMMAND);
+	outb_p(LATCH & 0xff, PIT0);	/* LSB */
+	outb(LATCH >> 8, PIT0);	/* MSB */
+	spin_unlock(&i8253_lock);
+}
+
+/*
+ * Now go ahead and include the clock specific file 586/386/acpi
+ * These asm files have extern inline functions to do a lot of
+ * stuff as well as the conversion routines.
+ */
+#ifdef CONFIG_HIGH_RES_TIMER_ACPI_PM
+#include <asm/hrtime-Macpi.h>
+#elif defined(CONFIG_HIGH_RES_TIMER_PIT)
+#include <asm/hrtime-M386.h>
+#elif defined(CONFIG_HIGH_RES_TIMER_TSC)
+#include <asm/hrtime-M586.h>
+#else
+#error "Need one of: CONFIG_HIGH_RES_TIMER_ACPI_PM CONFIG_HIGH_RES_TIMER_TSC CONFIG_HIGH_RES_TIMER_PIT"
+#endif
+
+extern unsigned long long jiffiesll;
+
+/*
+ * We stole this routine from the Utime code, but there it
+ * calculated microseconds and here we calculate sub_jiffies
+ * which have (in this case) units of TSC count.  (If there
+ * is no TSC, see hrtime-M386.h where a different unit
+ * is used.)  This allows the more expensive math (to get
+ * standard units) to be done only when needed.  Also this
+ * makes it as easy (and as efficient) to calculate nano
+ * as well as micro seconds.
+ */
+
+extern inline void
+arch_update_jiffies(unsigned long update)
+{
+	/*
+	 * update is the delta in sub_jiffies
+	 */
+	_sub_jiffie += update;
+	while ((unsigned long) _sub_jiffie > cycles_per_jiffies) {
+		_sub_jiffie -= cycles_per_jiffies;
+		_new_jiffie = ~0;
+		jiffies_intr++;
+		jiffies_64++;
+	}
+}
+
+#define SC_32_TO_USEC (SC_32(1000000)/ (long long)CLOCK_TICK_RATE)
+
+	/*
+	 * In the ALL_PERIODIC mode we program the PIT to give periodic
+	 * interrupts and, if no sub_jiffie timers are due, leave it alone.
+	 * This means that it can drift WRT the clock (TSC or pm timer).
+	 * What we are trying to do is to program the next interrupt to
+	 * occur on exactly the requested time.  If we are not doing 
+	 * sub HZ interrupts we expect to find a small excess of time
+	 * beyond the 1/HZ, i.e. _sub_jiffie will have some small value. 
+	 * This value will drift AND may jump upward from time to time. 
+	 * The drift is due to not having precise tracking between the 
+	 * two timers (the PIT and either the TSC or the PM timer) and
+	 * the jump is caused by interrupt delays, cache misses etc. 
+	 * We need to correct for the drift.  To correct all we need to 
+	 * do is to set "last_was_long" to zero and a new timer program 
+	 * will be started to "do the right thing".
+
+	 * Detecting the need to do this correction is another issue. 
+	 * Here is what we do:
+	 * Each interrupt where last_was_long is !=0 (indicates the
+	 * interrupt should be on a 1/HZ boundary) we check the resulting 
+	 * _sub_jiffie.  If it is smaller than some MIN value, we do
+	 * the correction.  (Note that drift that makes the value  
+	 * smaller is the easy one.)  We also require that
+	 * _sub_jiffie <= some max at least once over a period of 1 second. 
+	 * I.e.  with HZ = 100, we will allow up to 99 "late" interrupts
+	 * before we do a correction.
+
+	 * The values we use for min_hz_sub_jiffie and max_hz_sub_jiffie 
+	 * depend on the units and we will start by, during boot,
+	 * observing what MIN appears to be.  We will set max_hz_sub_jiffie
+	 * to be about 100 machine cycles more than this.
+
+	 * Note that with  min_hz_sub_jiffie and max_hz_sub_jiffie
+	 * set to 0, this code will reset the PIT every HZ.
+	 */
+#ifdef TIMER_NEEDS_DISCIPLINE
+extern inline void
+discipline_timer(int cpu)
+{
+	int *last_was_long = &_last_was_long[cpu];
+
+	if (!*last_was_long)
+		return;
+
+	timer_discipline_diff = quick_get_cpuctr() + _sub_jiffie;
+	while (timer_discipline_diff > cycles_per_jiffies) {
+		timer_discipline_diff -= cycles_per_jiffies;
+	}
+	if (timer_discipline_diff < min_hz_sub_jiffie) {
+		*last_was_long = 0;
+		return;
+	}
+	if (timer_discipline_diff <= max_hz_sub_jiffie) {
+		*last_was_long = 1;
+		return;
+	}
+	if (++*last_was_long > HZ) {
+		*last_was_long = 0;
+		return;
+	}
+}
+#else
+#define discipline_timer(a)
+#endif
+/*
+ * This routine is always called under the write_lockirq(xtime_lock)
+ */
+extern inline void
+update_jiffies_sub(void)
+{
+	unsigned long cycles_update;
+
+	cycles_update = get_cpuctr();
+
+	arch_update_jiffies(cycles_update);
+}
+
+/*
+ * quick_update_jiffies_sub returns the sub_jiffie offset of 
+ * current time from the "ref_jiff" jiffie value.  We do this
+ * without updating any memory values and thus do not need to
+ * take any locks, if we are careful.
+ *
+ * I don't know how to eliminate the lock in the SMP case, so..
+ * Oh, and also the PIT case requires a lock anyway, so..
+ */
+#if defined (CONFIG_SMP) || defined(CONFIG_HIGH_RES_TIMER_PIT)
+static inline void
+get_rat_jiffies(unsigned long *jiffies_f,
+		long *_sub_jiffie_f, unsigned long *update)
+{
+	unsigned long flags;
+
+	read_lock_irqsave(&xtime_lock, flags);
+	*jiffies_f = jiffies;
+	*_sub_jiffie_f = _sub_jiffie;
+	*update = quick_get_cpuctr();
+	read_unlock_irqrestore(&xtime_lock, flags);
+}
+
+#else
+static inline void
+get_rat_jiffies(unsigned long *jiffies_f, long *_sub_jiffie_f,
+		unsigned long *update)
+{
+	unsigned long last_update_f;
+	do {
+		*jiffies_f = jiffies;
+		last_update_f = last_update;
+		barrier();
+		*_sub_jiffie_f = _sub_jiffie;
+		*update = quick_get_cpuctr();
+		barrier();
+	} while (*jiffies_f != jiffies || last_update_f != last_update);
+}
+#endif				/* CONFIG_SMP */
+
+/*
+ * If smp, this must be called with the read_lockirq(&xtime_lock) held.
+ * No lock is needed if not SMP.
+ */
+
+extern inline long
+quick_update_jiffies_sub(unsigned long ref_jiff)
+{
+	unsigned long update;
+	unsigned long rtn;
+	unsigned long jiffies_f;
+	long _sub_jiffie_f;
+
+	get_rat_jiffies(&jiffies_f, &_sub_jiffie_f, &update);
+
+	rtn = _sub_jiffie_f + (unsigned long) update;
+	rtn += (jiffies_f - ref_jiff) * cycles_per_jiffies;
+	return rtn;
+
+}
+#ifdef CONFIG_X86_LOCAL_APIC
+#include <asm/apic.h>
+/*
+ * If we have a local APIC, we will use its counter to get the needed 
+ * interrupts.  Here is where we program it.
+ */
+extern int prof_counter[NR_CPUS];
+
+extern void __setup_APIC_LVTT(unsigned int);
+
+extern inline void
+reload_timer_chip(int new_latch_value)
+{
+	int new_latch = arch_cycles_to_latch(new_latch_value);
+	/*
+	 * We may want to do more in line code for speed here.
+	 * For now, however...
+
+	 * Note: The interrupt routine presets the counter for 1/HZ
+	 * each interrupt so we only deal with requested shorter times
+	 * either due to timer requests or drift.
+	 */
+	if (new_latch < timer_delta)
+		new_latch = timer_delta;
+	/*
+	 * The profile counter may be set causing us to ignore (or 
+	 * really just profile) the interrupt.  Force it to roll over
+	 * and give us the interrupt.  This may cause a hiccup in
+	 * the profile, but it will resume on the next tick.
+	 * There are, clearly, more complicated ways to deal with
+	 * profiling.
+	 */
+	prof_counter[smp_processor_id()] = 1;
+	apic_write_around(APIC_TMICT, new_latch);
+}
+
+#endif
+#ifndef CONFIG_HIGH_RES_TIMER_PIT
+#ifndef CONFIG_X86_LOCAL_APIC
+extern inline void
+reload_timer_chip(int new_latch_value)
+{
+	unsigned char pit_status;
+	/*
+	 * The input value is in arch cycles
+	 * We must be called with irq disabled.
+	 */
+
+	new_latch_value = arch_cycles_to_latch(new_latch_value);
+	if (new_latch_value < TIMER_DELTA) {
+		new_latch_value = TIMER_DELTA;
+	}
+	spin_lock(&i8253_lock);
+	outb_p(PIT0_PERIODIC, PIT_COMMAND);
+	outb_p(new_latch_value & 0xff, PIT0);	/* LSB */
+	outb(new_latch_value >> 8, PIT0);	/* MSB */
+	do {
+		outb_p(PIT0_LATCH_STATUS, PIT_COMMAND);
+		pit_status = inb(PIT0);
+	} while (pit_status & PIT_NULL_COUNT);
+	outb_p(LATCH & 0xff, PIT0);	/* LSB */
+	outb(LATCH >> 8, PIT0);	/* MSB */
+	spin_unlock(&i8253_lock);
+	return;
+}
+#endif				//  ! CONFIG_X86_LOCAL_APIC
+#endif				//  ! CONFIG_HIGH_RES_TIMER_PIT
+/*
+ * Time out for a discussion.  Because the PIT and TSC (or the PIT and
+ * pm timer) may drift WRT each other, we need a way to get the jiffie
+ * interrupt to happen as near to the jiffie roll as possible.  This
+ * ensures that we will get the interrupt when the timer is to be
+ * delivered, not before (we would not deliver) or later, making the
+ * jiffie timers different from the sub_jiffie deliveries.  We would
+ * also like any latency between a "requested" interrupt and the
+ * automatic jiffie interrupts from the PIT to be the same.  Since it
+ * takes some time to set up the PIT, we assume that requested
+ * interrupts may be a bit late when compared to the automatic
+ * interrupts.  When we request a jiffie interrupt, we want the
+ * interrupt to happen at the requested time, which will be a bit before
+ * we get to the jiffies update code. 
+ *
+ * What we want to determine here is a.) how long it takes (min) to get
+ * from a requested interrupt to the jiffies update code and b.) how
+ * long it takes when the interrupt is automatic (i.e. from the PIT
+ * reset logic).  When we set "last_was_long" to zero, the next tick
+ * setup code will "request" a jiffies interrupt (as long as we do not
+ * have any sub jiffie timers pending).  The interrupt after the
+ * requested one will be automatic.  Ignoring drift over this 2/HZ time
+ * we then get two latency values, the requested latency and the
+ * automatic latency.  We set up the difference to correct the requested
+ * time and the second one as the center of a window which we will use
+ * to detect the need to resync the PIT.  We do this for HZ ticks and
+ * take the min.
+ */
+#ifdef TIMER_NEEDS_DISCIPLINE
+#define NANOSEC_SYNC_LIMIT 2000	// Try for 2 usec. max drift
+#define final_clock_init() \
+        { unsigned long end = jiffies + HZ + HZ; \
+          int min_a =  cycles_per_jiffies, min_b =  cycles_per_jiffies;  \
+          long flags;                         \
+          int * last_was_long = &_last_was_long[smp_processor_id()];   \
+          while (time_before(jiffies,end)){ \
+               unsigned long f_jiffies = jiffies;     \
+               while (jiffies == f_jiffies); \
+               *last_was_long = 0;            \
+               while (jiffies == f_jiffies + 1); \
+               read_lock_irqsave(&xtime_lock, flags); \
+               if (  timer_discipline_diff < min_a) \
+                     min_a =   timer_discipline_diff; \
+               read_unlock_irqrestore(&xtime_lock, flags); \
+               while (jiffies == f_jiffies + 2); \
+               read_lock_irqsave(&xtime_lock, flags); \
+               if (  timer_discipline_diff < min_b) \
+                     min_b =   timer_discipline_diff; \
+               read_unlock_irqrestore(&xtime_lock, flags); \
+          }                             \
+         min_hz_sub_jiffie = min_b -  nsec_to_arch_cycles(NANOSEC_SYNC_LIMIT);\
+          if( min_hz_sub_jiffie < 0)  min_hz_sub_jiffie = 0; \
+          max_hz_sub_jiffie = min_b +  nsec_to_arch_cycles(NANOSEC_SYNC_LIMIT);\
+       timer_delta = arch_cycles_to_latch(usec_to_arch_cycles(TIMER_DELTA)); \
+       }
+#else
+#define final_clock_init()
+#endif				// TIMER_NEEDS_DISCIPLINE
+#endif				/* __KERNEL__ */
+#endif				/* _I386_HRTIME_H */
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-core/include/asm-i386/sc_math.h linux/include/asm-i386/sc_math.h
--- linux-2.5.52-bk4-core/include/asm-i386/sc_math.h	Wed Dec 31 16:00:00 1969
+++ linux/include/asm-i386/sc_math.h	Thu Dec 19 13:00:26 2002
@@ -0,0 +1,139 @@
+#ifndef SC_MATH
+#define SC_MATH
+#define MATH_STR(X) #X
+#define MATH_NAME(X) X
+
+/*
+ * Pre scaling defines
+ */
+#define SC_32(x) ((long long)x<<32)
+#define SC_n(n,x) (((long long)x)<<n)
+/*
+ * This routine performs the following calculation:
+ *
+ * X = (a*b)>>32
+ * we could, (but don't) also get the part shifted out.
+ */
+extern inline long
+mpy_sc32(long a, long b)
+{
+	long edx;
+      __asm__("imull %2":"=a"(a), "=d"(edx)
+      :	"rm"(b), "0"(a));
+	return edx;
+}
+/*
+ * X = (a/b)<<32 or more precisely x = (a<<32)/b
+ */
+
+extern inline long
+div_sc32(long a, long b)
+{
+	long dum;
+      __asm__("divl %2":"=a"(b), "=d"(dum)
+      :	"r"(b), "0"(0), "1"(a));
+
+	return b;
+}
+/*
+ * X = (a*b)>>24
+ * we could, (but don't) also get the part shifted out.
+ */
+
+#define mpy_ex24(a,b) mpy_sc_n(24,a,b)
+/*
+ * X = (a/b)<<24 or more precisely x = (a<<24)/b
+ */
+#define div_ex24(a,b) div_sc_n(24,a,b)
+
+/*
+ * The routines allow you to do x = (a/b) << N and
+ * x=(a*b)>>N for values of N from 1 to 32.
+ *
+ * These are handy to have to do scaled math.
+ * Scaled math has two nice features:
+ * A.) A great deal more precision can be maintained by
+ *     keeping more significant bits.
+ * B.) Often an in line div can be replaced with a mpy
+ *     which is a LOT faster.
+ */
+
+#define mpy_sc_n(N,aa,bb) ({long edx,a=aa,b=bb; \
+	__asm__("imull %2\n\t" \
+                "shldl $(32-"MATH_STR(N)"),%0,%1"    \
+		:"=a" (a), "=d" (edx)\
+		:"rm" (b),            \
+		 "0" (a)); edx;})
+
+#define div_sc_n(N,aa,bb) ({long dum=aa,dum2,b=bb; \
+        __asm__("shrdl $(32-"MATH_STR(N)"),%4,%3\n\t"  \
+                "sarl $(32-"MATH_STR(N)"),%4\n\t"      \
+                "divl %2"              \
+                :"=a" (dum2), "=d" (dum)      \
+                :"rm" (b), "0" (0), "1" (dum)); dum2;})
+
+/*
+ * (long)X = ((long long)divs) / (long)div
+ * (long)rem = ((long long)divs) % (long)div
+ *
+ * Warning, this will do an exception if X overflows.
+ */
+#define div_long_long_rem(a,b,c) div_ll_X_l_rem(a,b,c)
+
+extern inline long
+div_ll_X_l_rem(long long divs, long div, long *rem)
+{
+	long dum2;
+      __asm__("divl %2":"=a"(dum2), "=d"(*rem)
+      :	"rm"(div), "A"(divs));
+
+	return dum2;
+
+}
+/*
+ * same as above, but no remainder
+ */
+extern inline long
+div_ll_X_l(long long divs, long div)
+{
+	long dum;
+	return div_ll_X_l_rem(divs, div, &dum);
+}
+/*
+ * (long)X = (((long)divh<<32) | (long)divl) / (long)div
+ * (long)rem = (((long)divh<<32) | (long)divl) % (long)div
+ *
+ * Warning, this will do an exception if X overflows.
+ */
+extern inline long
+div_h_or_l_X_l_rem(long divh, long divl, long div, long *rem)
+{
+	long dum2;
+      __asm__("divl %2":"=a"(dum2), "=d"(*rem)
+      :	"rm"(div), "0"(divl), "1"(divh));
+
+	return dum2;
+
+}
+extern inline long long
+mpy_l_X_l_ll(long mpy1, long mpy2)
+{
+	long long eax;
+      __asm__("imull %1\n\t":"=A"(eax)
+      :	"rm"(mpy2), "a"(mpy1));
+
+	return eax;
+
+}
+extern inline long
+mpy_1_X_1_h(long mpy1, long mpy2, long *hi)
+{
+	long eax;
+      __asm__("imull %2\n\t":"=a"(eax), "=d"(*hi)
+      :	"rm"(mpy2), "0"(mpy1));
+
+	return eax;
+
+}
+
+#endif
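
The scaled-math comments in sc_math.h above describe X = (a*b)>>32 and
X = (a<<32)/b.  A portable sketch of the same arithmetic, not the inline
assembly from the patch: the names mpy_sc32_portable and div_sc32_portable
are hypothetical, and the sketch assumes unsigned operands throughout
(the patch uses a signed imull and an unsigned divl).

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) >> 32: keep only the high word of the 64-bit product. */
static inline uint32_t mpy_sc32_portable(uint32_t a, uint32_t b)
{
	return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}

/*
 * (a << 32) / b: the quotient must fit in 32 bits, so a < b is required.
 * The divl in the patch would raise a divide exception otherwise; this
 * sketch asserts instead.
 */
static inline uint32_t div_sc32_portable(uint32_t a, uint32_t b)
{
	assert(b != 0 && a < b);
	return (uint32_t)(((uint64_t)a << 32) / b);
}
```

For example, a divide by 1000 can be replaced by a multiply with the
scaled reciprocal div_sc32_portable(1, 1000).  Because the reciprocal is
truncated, the result can come out one lower than an exact divide:
1000000 scaled this way gives 999, not 1000.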

^ permalink raw reply

* Re: What do you think, is compatible locking solvable at all?
From: Michal Samek @ 2002-12-20  9:45 UTC (permalink / raw)
  To: Linux-MSDOS Mailing list

On Thu, 2002-12-19 at 22:07, Bart Oldeman wrote:
> On 19 Dec 2002, Michal Samek wrote:
> 

> maybe it's better to ask the Samba and smbfs developers - DOSEMU
> basically acts like just another Linux app and I know almost nothing
> (only basic things) about Samba.
> 
> Bart

Oh, I must say a bit more about it; I've mentioned samba because it's
the target environment but the problem really starts somewhere between
dosemu and linux fs differences. Locking isn't working even on
lredired linux local ext2/3 fs for me and my clipper app, not only on
smbfs mounted lredired. 

-- 

Michal Samek <webmaster@tony.cz>

-
To unsubscribe from this list: send the line "unsubscribe linux-msdos" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH 3/3] High-res-timers part 3 (posix to hrposix) take 21
From: george anzinger @ 2002-12-20  9:52 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel@vger.kernel.org
In-Reply-To: <3DB9A314.6CECA1AC@mvista.com>

[-- Attachment #1: Type: text/plain, Size: 1852 bytes --]


And this finishes the high res timers code.

Changes since last time:
 <none>
-----------

I had to add arg3 to the restart_block to handle the two
word restart time...

This patch adds the two POSIX clocks CLOCK_REALTIME_HR and
CLOCK_MONOTONIC_HR to the posix clocks & timers package.  A
small change is made in sched.h and the rest of the patch is
against .../kernel/posix_timers.c and
.../include/linux/posix_timers.h


This patch takes advantage of the timer storm protection
features of the POSIX clock and timers patch.

This patch fixes the high resolution timer resolution at 1
microsecond.  Should this number be a CONFIG option?
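
To make the 1 microsecond resolution concrete, here is a user-space
sketch of how a requested interval could be rounded to that resolution
and split into whole jiffies plus a sub-jiffie remainder.  Everything
in it is an assumption for illustration: HZ is taken as 100, the
remainder is kept in nanoseconds (the patch keeps it in arch cycles),
rounding is upward so a timer is never early, and the SKETCH_* names
and split_interval() do not appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_HZ             100L         /* assumed tick rate */
#define SKETCH_NSEC_PER_SEC   1000000000L
#define SKETCH_RES_NSEC       1000L        /* 1 microsecond resolution */
#define SKETCH_NSEC_PER_JIFFY (SKETCH_NSEC_PER_SEC / SKETCH_HZ)

struct split_time {
	uint64_t jiffies;	/* whole 1/HZ ticks */
	long sub_nsec;		/* remainder within the last tick, in ns */
};

static struct split_time split_interval(int64_t sec, long nsec)
{
	struct split_time out;
	uint64_t total;

	assert(sec >= 0 && nsec >= 0 && nsec < SKETCH_NSEC_PER_SEC);

	/* Round up to the clock resolution (assumed policy). */
	nsec = ((nsec + SKETCH_RES_NSEC - 1) / SKETCH_RES_NSEC)
	       * SKETCH_RES_NSEC;

	total = (uint64_t)sec * SKETCH_NSEC_PER_SEC + (uint64_t)nsec;
	out.jiffies = total / SKETCH_NSEC_PER_JIFFY;
	out.sub_nsec = (long)(total % SKETCH_NSEC_PER_JIFFY);
	return out;
}
```

With these assumptions a request of 12345678 ns becomes 1 jiffie plus
2346000 ns.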

I think it would be a "good thing" to move the NTP stuff to
the jiffies clock.  This would allow the wall clock/ jiffies
clock difference to be a "fixed value" so that code that
needed this would not have to read two clocks.  Setting the
wall clock would then just be an adjustment to this "fixed
value".  It would also eliminate the problem of asking for a
wall clock offset and getting a jiffies clock offset.  This
issue is what causes the current 2.5.46 system to fail the
simple:

time sleep 60

test (any value less than 60 seconds violates the standard
in that it implies a timer expired early).
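
The same "never early" check can be sketched from user space in C.
This is only an illustration, not part of the patch: it measures a
sleep against CLOCK_MONOTONIC using the standard POSIX clock_gettime()
and nanosleep() calls, and the helper names timespec_diff_nsec and
sleep_overshoot_nsec are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed from *start to *end. */
static int64_t timespec_diff_nsec(const struct timespec *start,
				  const struct timespec *end)
{
	return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
	       + (int64_t)(end->tv_nsec - start->tv_nsec);
}

/*
 * Sleep for request_nsec and return (elapsed - requested) in ns.
 * A negative result means the sleep returned early.
 * Returns INT64_MIN if a clock or sleep call fails.
 */
static int64_t sleep_overshoot_nsec(int64_t request_nsec)
{
	struct timespec req, start, end;

	assert(request_nsec >= 0);
	req.tv_sec = (time_t)(request_nsec / 1000000000LL);
	req.tv_nsec = (long)(request_nsec % 1000000000LL);

	if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
		return INT64_MIN;
	/* On EINTR nanosleep() stores the unslept remainder in req. */
	while (nanosleep(&req, &req) != 0) {
		if (errno != EINTR)
			return INT64_MIN;
	}
	if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
		return INT64_MIN;

	return timespec_diff_nsec(&start, &end) - request_nsec;
}
```

Calling sleep_overshoot_nsec(60LL * 1000000000LL) corresponds to the
60 second test; any negative return value would indicate the early
expiry described above.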

Patch is against 2.5.52-bk4

These patches as well as the POSIX clocks & timers patch are
available on the project site:
http://sourceforge.net/projects/high-res-timers/

The 3 parts to the high res timers are:
 core      The core kernel (i.e. platform independent)
 i386      The high-res changes for the i386 (x86) platform
*hrposix   The changes to the POSIX clocks & timers patch to
           use high-res timers

Please apply.
-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

[-- Attachment #2: hrtimers-hrposix-2.5.52-bk4.1.0.patch --]
[-- Type: text/plain, Size: 12638 bytes --]

diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-i386/include/linux/posix-timers.h linux/include/linux/posix-timers.h
--- linux-2.5.52-bk4-i386/include/linux/posix-timers.h	Thu Dec 19 12:16:00 2002
+++ linux/include/linux/posix-timers.h	Thu Dec 19 13:04:32 2002
@@ -15,6 +15,39 @@
 	void (*timer_get) (struct k_itimer * timr,
 			   struct itimerspec * cur_setting);
 };
+
+#ifdef CONFIG_HIGH_RES_TIMERS
+struct now_struct {
+	unsigned long jiffies;
+	long sub_jiffie;
+};
+static inline void
+posix_get_now(struct now_struct *now)
+{
+	(now)->jiffies = jiffies;
+	(now)->sub_jiffie = quick_update_jiffies_sub((now)->jiffies);
+	while (unlikely(((now)->sub_jiffie - cycles_per_jiffies) > 0)) {
+		(now)->sub_jiffie = (now)->sub_jiffie - cycles_per_jiffies;
+		(now)->jiffies++;
+	}
+}
+
+#define posix_time_before(timer, now) \
+         ( {long diff = (long)(timer)->expires - (long)(now)->jiffies;  \
+           (diff < 0) ||                                      \
+	   ((diff == 0) && ((timer)->sub_expires < (now)->sub_jiffie)); })
+
+#define posix_bump_timer(timr) do { \
+          (timr)->it_timer.expires += (timr)->it_incr; \
+          (timr)->it_timer.sub_expires += (timr)->it_sub_incr; \
+          if (((timr)->it_timer.sub_expires - cycles_per_jiffies) >= 0){ \
+		  (timr)->it_timer.sub_expires -= cycles_per_jiffies; \
+		  (timr)->it_timer.expires++; \
+	  }                                 \
+          (timr)->it_overrun++;               \
+        }while (0)
+
+#else
 struct now_struct {
 	unsigned long jiffies;
 };
@@ -27,4 +60,5 @@
                         (timr)->it_timer.expires += (timr)->it_incr; \
                         (timr)->it_overrun++;               \
                        }while (0)
+#endif				// CONFIG_HIGH_RES_TIMERS
 #endif
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-i386/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.52-bk4-i386/include/linux/sched.h	Thu Dec 19 12:16:00 2002
+++ linux/include/linux/sched.h	Thu Dec 19 13:04:32 2002
@@ -289,6 +289,9 @@
 	int it_sigev_signo;		 /* signo word of sigevent struct */
 	sigval_t it_sigev_value;	 /* value word of sigevent struct */
 	unsigned long it_incr;		/* interval specified in jiffies */
+#ifdef CONFIG_HIGH_RES_TIMERS
+        int it_sub_incr;                /* sub jiffie part of interval */
+#endif
 	struct task_struct *it_process;	/* process to send signal to */
 	struct timer_list it_timer;
 };
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-i386/include/linux/thread_info.h linux/include/linux/thread_info.h
--- linux-2.5.52-bk4-i386/include/linux/thread_info.h	Wed Dec 11 06:25:32 2002
+++ linux/include/linux/thread_info.h	Thu Dec 19 13:04:32 2002
@@ -12,7 +12,7 @@
  */
 struct restart_block {
 	long (*fn)(struct restart_block *);
-	unsigned long arg0, arg1, arg2;
+	unsigned long arg0, arg1, arg2, arg3;
 };
 
 extern long do_no_restart_syscall(struct restart_block *parm);
Only in linux-2.5.52-bk4-i386/kernel: linux
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-i386/kernel/posix-timers.c linux/kernel/posix-timers.c
--- linux-2.5.52-bk4-i386/kernel/posix-timers.c	Thu Dec 19 12:16:01 2002
+++ linux/kernel/posix-timers.c	Thu Dec 19 13:04:32 2002
@@ -22,6 +22,7 @@
 #include <linux/init.h>
 #include <linux/compiler.h>
 #include <linux/id_reuse.h>
+#include <linux/hrtime.h>
 #include <linux/posix-timers.h>
 
 #ifndef div_long_long_rem
@@ -177,6 +178,14 @@
 					       sizeof (struct k_itimer), 0, 0,
 					       0, 0);
 	idr_init(&posix_timers_id);
+	IF_HIGH_RES(clock_realtime.res = CONFIG_HIGH_RES_RESOLUTION;
+		    register_posix_clock(CLOCK_REALTIME_HR, &clock_realtime);
+		    clock_monotonic.res = CONFIG_HIGH_RES_RESOLUTION;
+		    register_posix_clock(CLOCK_MONOTONIC_HR,
+					 &clock_monotonic);;);
+#ifdef	 final_clock_init
+	final_clock_init();	// defined by arch header file
+#endif
 	return 0;
 }
 
@@ -216,8 +225,23 @@
 	 * We trust that the optimizer will use the remainder from the 
 	 * above div in the following operation as long as they are close. 
 	 */
-	return 0;
+	return (nsec_to_arch_cycles(nsec % (NSEC_PER_SEC / HZ)));
 }
+#ifdef CONFIG_HIGH_RES_TIMERS
+static void
+tstotimer(struct itimerspec *time, struct k_itimer *timer)
+{
+	int res = posix_clocks[timer->it_clock].res;
+
+	timer->it_timer.sub_expires = tstojiffie(&time->it_value,
+						 res, &timer->it_timer.expires);
+	timer->it_sub_incr = tstojiffie(&time->it_interval,
+					res, (unsigned long *) &timer->it_incr);
+	if ((unsigned long) timer->it_incr > MAX_JIFFY_OFFSET)
+		timer->it_incr = MAX_JIFFY_OFFSET;
+}
+
+#else
 static void
 tstotimer(struct itimerspec *time, struct k_itimer *timer)
 {
@@ -226,6 +250,8 @@
 	tstojiffie(&time->it_interval, res, &timer->it_incr);
 }
 
+#endif
+
 static void
 schedule_next_timer(struct k_itimer *timr)
 {
@@ -233,7 +259,7 @@
 
 	/* Set up the timer for the next interval (if there is one) */
 	if (timr->it_incr == 0) {
-		{
+		IF_HIGH_RES(if (timr->it_sub_incr == 0)) {
 			set_timer_inactive(timr);
 			return;
 		}
@@ -305,7 +331,7 @@
 	info.si_code = SI_TIMER;
 	info.si_tid = timr->it_id;
 	info.si_value = timr->it_sigev_value;
-	if (timr->it_incr == 0) {
+	if ((timr->it_incr == 0) IF_HIGH_RES(&&(timr->it_sub_incr == 0))) {
 		set_timer_inactive(timr);
 	} else {
 		timr->it_requeue_pending = info.si_sys_private = 1;
@@ -619,13 +645,15 @@
 
 	do {
 		expires = timr->it_timer.expires;
+		IF_HIGH_RES(sub_expires = timr->it_timer.sub_expires);
 	} while ((volatile long) (timr->it_timer.expires) != expires);
 
 	posix_get_now(&now);
 
 	if (expires && (timr->it_sigev_notify & SIGEV_NONE) && !timr->it_incr) {
 		if (posix_time_before(&timr->it_timer, &now)) {
-			timr->it_timer.expires = expires = 0;
+			IF_HIGH_RES(timr->it_timer.sub_expires =)
+			    timr->it_timer.expires = expires = 0;
 		}
 	}
 	if (expires) {
@@ -641,11 +669,26 @@
 		}
 		if (expires) {
 			expires -= now.jiffies;
+			IF_HIGH_RES(sub_expires -= now.sub_jiffie);
 		}
 	}
 	jiffies_to_timespec(expires, &cur_setting->it_value);
 	jiffies_to_timespec(timr->it_incr, &cur_setting->it_interval);
 
+	IF_HIGH_RES(cur_setting->it_value.tv_nsec +=
+		    arch_cycles_to_nsec(sub_expires);
+		    if (cur_setting->it_value.tv_nsec < 0) {
+		    cur_setting->it_value.tv_nsec += NSEC_PER_SEC;
+		    cur_setting->it_value.tv_sec--;}
+		    if ((cur_setting->it_value.tv_nsec - NSEC_PER_SEC) >= 0) {
+		    cur_setting->it_value.tv_nsec -= NSEC_PER_SEC;
+		    cur_setting->it_value.tv_sec++;}
+		    cur_setting->it_interval.tv_nsec +=
+		    arch_cycles_to_nsec(timr->it_sub_incr);
+		    if ((cur_setting->it_interval.tv_nsec - NSEC_PER_SEC) >= 0) {
+		    cur_setting->it_interval.tv_nsec -= NSEC_PER_SEC;
+		    cur_setting->it_interval.tv_sec++;}
+	) ;
 	if (cur_setting->it_value.tv_sec < 0) {
 		cur_setting->it_value.tv_nsec = 1;
 		cur_setting->it_value.tv_sec = 0;
@@ -781,6 +824,7 @@
 
 	/* disable the timer */
 	timr->it_incr = 0;
+	IF_HIGH_RES(timr->it_sub_incr = 0);
 	/* 
 	 * careful here.  If smp we could be in the "fire" routine which will
 	 * be spinning as we hold the lock.  But this is ONLY an SMP issue.
@@ -810,6 +854,7 @@
 	if ((new_setting->it_value.tv_sec == 0) &&
 	    (new_setting->it_value.tv_nsec == 0)) {
 		timr->it_timer.expires = 0;
+		IF_HIGH_RES(timr->it_timer.sub_expires = 0);
 		return 0;
 	}
 
@@ -823,14 +868,19 @@
 	tstotimer(new_setting, timr);
 
 	/*
-	 * For some reason the timer does not fire immediately if expires is
-	 * equal to jiffies, so the timer notify function is called directly.
+
+	 * For some reason the timer does not fire immediately if
+	 * expires is equal to jiffies and the old cascade timer list,
+	 * so the timer notify function is called directly. 
 	 * We do not even queue SIGEV_NONE timers!
+
 	 */
 	if (!(timr->it_sigev_notify & SIGEV_NONE)) {
+#ifndef	 CONFIG_HIGH_RES_TIMERS
 		if (timr->it_timer.expires == jiffies) {
 			timer_notify_task(timr);
 		} else
+#endif
 			add_timer(&timr->it_timer);
 	}
 	return 0;
@@ -891,6 +941,7 @@
 do_timer_delete(struct k_itimer *timer)
 {
 	timer->it_incr = 0;
+	IF_HIGH_RES(timer->it_sub_incr = 0);
 #ifdef CONFIG_SMP
 	if (timer_active(timer) &&
 	    !del_timer(&timer->it_timer) && !timer->it_requeue_pending) {
@@ -994,9 +1045,25 @@
 	if (clock->clock_get) {
 		return clock->clock_get(tp);
 	}
-
+#ifdef CONFIG_HIGH_RES_TIMERS
+	{
+		unsigned long flags;
+		write_lock_irqsave(&xtime_lock, flags);
+		update_jiffies_sub();
+		update_real_wall_time();
+		tp->tv_sec = xtime.tv_sec;
+		tp->tv_nsec = xtime.tv_nsec;
+		tp->tv_nsec += arch_cycles_to_nsec(sub_jiffie());
+		write_unlock_irqrestore(&xtime_lock, flags);
+		if (tp->tv_nsec > NSEC_PER_SEC) {
+			tp->tv_nsec -= NSEC_PER_SEC;
+			tp->tv_sec++;
+		}
+	}
+#else
 	do_gettimeofday((struct timeval *) tp);
 	tp->tv_nsec *= NSEC_PER_USEC;
+#endif
 	return 0;
 }
 
@@ -1012,10 +1079,10 @@
 {
 	long sub_sec;
 	u64 jiffies_64_f;
-
-#if (BITS_PER_LONG > 32)
-
-	jiffies_64_f = jiffies_64;
+	IF_HIGH_RES(long sub_jiff_offset;
+	    )
+#if (BITS_PER_LONG > 32) && !defined(CONFIG_HIGH_RES_TIMERS)
+	    jiffies_64_f = jiffies_64;
 
 #elif defined(CONFIG_SMP)
 
@@ -1027,6 +1094,9 @@
 		read_lock_irqsave(&xtime_lock, flags);
 		jiffies_64_f = jiffies_64;
 
+		IF_HIGH_RES(sub_jiff_offset =
+			    quick_update_jiffies_sub(jiffies));
+
 		read_unlock_irqrestore(&xtime_lock, flags);
 	}
 #elif ! defined(CONFIG_SMP) && (BITS_PER_LONG < 64)
@@ -1034,13 +1104,30 @@
 	do {
 		jiffies_f = jiffies;
 		barrier();
+		IF_HIGH_RES(sub_jiff_offset =
+			    quick_update_jiffies_sub(jiffies_f));
 		jiffies_64_f = jiffies_64;
 	} while (unlikely(jiffies_f != jiffies));
 
+#else				/* 64 bit long and high-res but no SMP if I did the Venn right */
+	    do {
+		jiffies_64_f = jiffies_64;
+		barrier();
+		sub_jiff_offset = quick_update_jiffies_sub(jiffies_64_f);
+	} while (unlikely(jiffies_64_f != jiffies_64));
+
 #endif
-	tp->tv_sec = div_long_long_rem(jiffies_64_f, HZ, &sub_sec);
+	/*
+	 * Remember that quick_update_jiffies_sub() can return more
+	 * than a jiffies worth of cycles...
+	 */
+	IF_HIGH_RES(while (unlikely(sub_jiff_offset > cycles_per_jiffies)) {
+		    sub_jiff_offset -= cycles_per_jiffies; jiffies_64_f++;}
+	)
+		tp->tv_sec = div_long_long_rem(jiffies_64_f, HZ, &sub_sec);
 
 	tp->tv_nsec = sub_sec * (NSEC_PER_SEC / HZ);
+	IF_HIGH_RES(tp->tv_nsec += arch_cycles_to_nsec(sub_jiff_offset));
 	return 0;
 }
 
@@ -1223,6 +1310,7 @@
 			return -EINTR;
 
 		new_timer.expires = restart_block->arg2;
+		IF_HIGH_RES(new_timer.sub_expires = restart_block->arg3);
 		if (time_before(new_timer.expires, jiffies))
 			return 0;
 	}
@@ -1236,7 +1324,9 @@
 	}
 	do {
 		t = *tsave;
-		if ((abs || !new_timer.expires) &&
+		if ((abs ||
+		     !(new_timer.expires
+		       IF_HIGH_RES(|new_timer.sub_expires))) &&
 		    !(rtn = adjust_abs_time(&posix_clocks[which_clock],
 					    &t, abs))) {
 			/*
@@ -1245,12 +1335,14 @@
 			 * del_timer_sync() will return 0, thus
 			 * active is zero... and so it goes.
 			 */
+			IF_HIGH_RES(new_timer.sub_expires =)
 
-			tstojiffie(&t,
-				   posix_clocks[which_clock].res,
-				   &new_timer.expires);
+			    tstojiffie(&t,
+				       posix_clocks[which_clock].res,
+				       &new_timer.expires);
 		}
-		if (new_timer.expires) {
+		if (new_timer.expires
+		    IF_HIGH_RES(|new_timer.sub_expires)) {
 			current->state = TASK_INTERRUPTIBLE;
 			add_timer(&new_timer);
 
@@ -1268,6 +1360,8 @@
 	if (active) {
 		unsigned long jiffies_f = jiffies;
 
+		IF_HIGH_RES(long sub_jiff =
+			    quick_update_jiffies_sub(jiffies_f));
 		/*
 		 * Always restart abs calls from scratch to pick up any
 		 * clock shifting that happened while we are away.
@@ -1277,6 +1371,9 @@
 
 		jiffies_to_timespec(new_timer.expires - jiffies_f, tsave);
 
+		IF_HIGH_RES(tsave->tv_nsec +=
+			    arch_cycles_to_nsec(new_timer.sub_expires -
+						sub_jiff));
 		while (tsave->tv_nsec < 0) {
 			tsave->tv_nsec += NSEC_PER_SEC;
 			tsave->tv_sec--;
@@ -1289,6 +1386,7 @@
 		restart_block->arg0 = which_clock;
 		restart_block->arg1 = (int)tsave;
 		restart_block->arg2 = new_timer.expires;
+		IF_HIGH_RES(restart_block->arg3 = new_timer.sub_expires);
 		return -ERESTART_RESTARTBLOCK;
 	}
 
Binary files linux-2.5.52-bk4-i386/scripts/kallsyms and linux/scripts/kallsyms differ
Binary files linux-2.5.52-bk4-i386/scripts/lxdialog/lxdialog and linux/scripts/lxdialog/lxdialog differ
Binary files linux-2.5.52-bk4-i386/usr/gen_init_cpio and linux/usr/gen_init_cpio differ
Binary files linux-2.5.52-bk4-i386/usr/initramfs_data.cpio.gz and linux/usr/initramfs_data.cpio.gz differ

^ permalink raw reply

* [PATCH 1/3] High-res-timers part 1 (core) take 21
From: george anzinger @ 2002-12-20  9:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 2513 bytes --]

Just in case you might like high res timers...

Changes since last time:
Added stub macros to include/linux/hrtime.h to handle the no
high res case.  Without them, building a kernel with
CONFIG_HIGH_RES_TIMERS=n failed with undefined symbols.

Removed the SMP optimization in timer.c (folks objected to
eliminating unneeded spin locks in favor of letting the
compiler do it).
-------
This patch supplies the core changes to implement high
resolution timers.  Mostly it changes the timer list from
the multi stage hash (or cascade) list to a single stage
hash list.  This change makes it easy to configure the list
size for those who are concerned with performance.  It also
eliminates the "time out" for the cascade operation every
512 jiffies, thus eliminating possibly long preemption
times.  On input from Stephen Hemminger <shemminger@osdl.org>
the configuration of the timer list size is no longer
presented as a configure option.  The code can still be
changed (one line) to use larger or smaller lists.
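
The single stage list described above can be pictured with a small
sketch.  This is not code from the patch: the names sketch_timer,
sketch_wheel, wheel_add and wheel_pop_due are made up for illustration,
and the 512 entry size only mirrors the patch's default NEW_TVEC_SIZE.
Each timer goes into the bucket picked by the low bits of its expiry,
and each bucket is kept sorted so that timers a full lap or more away
simply wait behind the ones that are due.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only; the size must be a power of two for the mask to work. */
#define LIST_SIZE 512UL
#define LIST_MASK (LIST_SIZE - 1)

struct sketch_timer {
	unsigned long expires;		/* expiry time in ticks */
	struct sketch_timer *next;
};

struct sketch_wheel {
	unsigned long now;
	struct sketch_timer *bucket[LIST_SIZE];
};

/* The bucket is chosen by the low bits of the expiry time. */
static size_t bucket_index(unsigned long expires)
{
	return expires & LIST_MASK;
}

/* Insert in ascending expiry order; equal expiries stay FIFO. */
static void wheel_add(struct sketch_wheel *w, struct sketch_timer *t)
{
	struct sketch_timer **pp;

	assert(w != NULL && t != NULL);
	pp = &w->bucket[bucket_index(t->expires)];
	while (*pp && (long)((*pp)->expires - t->expires) <= 0)
		pp = &(*pp)->next;
	t->next = *pp;
	*pp = t;
}

/*
 * Remove and return the first timer in the bucket for tick "now" if it
 * is due.  Timers one or more laps out share the bucket but are left
 * in place, so no cascade step is ever needed.
 */
static struct sketch_timer *wheel_pop_due(struct sketch_wheel *w,
					  unsigned long now)
{
	struct sketch_timer **head = &w->bucket[bucket_index(now)];
	struct sketch_timer *t = *head;

	if (t && (long)(now - t->expires) >= 0) {
		*head = t->next;
		t->next = NULL;
		return t;
	}
	return NULL;
}
```

The trade-off in this kind of layout is that insertion may have to walk
a bucket, in exchange for never having to re-hash a whole level of
timers at once.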

It also adds a sub jiffie word to the timer structure to
allow timers to exist between jiffies.  However, to support
the sub jiffie timers, work needs to be done in the platform
code for each arch.  The platform work for the i386 arch
follows in part 2.  To avoid references to platform code
that does not exist yet, the sub jiffie parts of this
patch are disabled with the IF_HIGH_RES() macro, which
depends on CONFIG_HIGH_RES_TIMERS.  Each platform will
define that option once it supplies the needed code.

With this patch applied, the system should boot and run much
as it does prior to the patch.  This patch depends on the
POSIX clocks & timers patch in that it assumes the changes
that patch made to timer.c to remove timer_t.  This
dependency can be removed if needed.

Patch is against 2.5.52-bk4

This patch as well as the POSIX clocks & timers patch is
available on the project site:
http://sourceforge.net/projects/high-res-timers/

For those who want a change log, this version is changed
only to follow the changes in the posix patch, which was
changed to use the new syscall restart stuff.

The 3 parts to the high res timers are:
*core		The core kernel (i.e. platform independent) changes
 i386		The high-res changes for the i386 (x86) platform
 hrposix	The changes to the POSIX clocks & timers patch to
		use high-res timers

Please apply.
-- 
George Anzinger   george@mvista.com
High-res-timers: 
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

[-- Attachment #2: hrtimers-core-2.5.52-bk4.1.0.patch --]
[-- Type: text/plain, Size: 26569 bytes --]

diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-posix/include/linux/hrtime.h linux/include/linux/hrtime.h
--- linux-2.5.52-bk4-posix/include/linux/hrtime.h	Wed Dec 31 16:00:00 1969
+++ linux/include/linux/hrtime.h	Thu Dec 19 14:55:51 2002
@@ -0,0 +1,155 @@
+#ifndef _HRTIME_H
+#define _HRTIME_H
+
+/*
+ * This file is the glue to bring in the platform stuff.
+ * We make it all depend on the CONFIG option so all archs
+ * will work as long as the CONFIG is not set.	Once an 
+ * arch defines the CONFIG, it had better have the 
+ * asm/hrtime.h file in place.
+ */
+
+/*
+ * This gets filled in at init time, either static or dynamic.
+ * Someday this will be what NTP fiddles with.
+ * Do we need the scale here?  I don't think so, as long as we
+ * do percentage offsets for NTP.
+ */
+struct timer_conversion_bits {
+	unsigned long _arch_to_usec;
+	unsigned long _arch_to_nsec;
+	unsigned long _usec_to_arch;
+	unsigned long _nsec_to_arch;
+	long _cycles_per_jiffies;
+	unsigned long _arch_to_latch;
+};
+extern struct timer_conversion_bits timer_conversion_bits;
+/*
+ * The following four values are not used for machines 
+ * without a TSC.  For machines with a TSC they
+ * are calculated at boot time. They are used to 
+ * calculate "cycles" to jiffies or usec.  Don't get
+ * confused into thinking they are simple multiples or
+ * divisors, however.  
+ */
+#define arch_to_usec timer_conversion_bits._arch_to_usec
+#define arch_to_nsec timer_conversion_bits._arch_to_nsec
+#define usec_to_arch timer_conversion_bits._usec_to_arch
+#define nsec_to_arch timer_conversion_bits._nsec_to_arch
+#define cycles_per_jiffies timer_conversion_bits._cycles_per_jiffies
+#define arch_to_latch timer_conversion_bits._arch_to_latch
+
+#include <linux/config.h>
+#ifdef CONFIG_HIGH_RES_TIMERS
+#include <asm/hrtime.h>
+/*
+ * The schedule_next_int function is to be defined by the "arch" code
+ * when an "arch" is implementing the high-res part of POSIX timers.
+ * The actual function will be called with the offset in "arch" (parm 2)
+ * defined sub_jiffie units from the reference jiffie boundary (parm 1) to
+ * the next required sub_jiffie timer interrupt. This value will be -1
+ * if the next timer interrupt should be the next jiffie value.	 The
+ * "arch" code must determine how far out the interrupt is, based on
+ * current jiffie, sub_jiffie time and set up the hardware to interrupt
+ * at that time.  It is possible that the time will already have passed,
+ * in which case the function should return true (no interrupt is
+ * needed), otherwise the return should be 0.  The third parameter is the
+ * "always" flag which says that the code needs an interrupt, even if the
+ * time has passed.  In this case a "close" in time should be used to 
+ * generate the required interrupt.  The sub_jiffie interrupt
+ * should just call do_timer(). If the interrupt code usually does stuff
+ * each jiffie, a flag should be kept by the jiffies update code to
+ * indicate that a new jiffie has started.  This flag is to keep this code 
+ * from being executed on the sub jiffie interrupt.
+ */
+#ifndef schedule_next_int
+#define schedule_next_int(s,d,a) 0
+#undef CONFIG_HIGH_RES_TIMERS
+#endif				// schedule_next_int
+/*
+ * The sub_jiffie() macro should return the current time offset from the latest
+ * jiffie.  This will be in "arch" defined units and is used to determine if
+ * a timer has expired.	 Since no sub_expire value will be used if "arch" 
+ * has not defined the high-res package, 0 will work well here.
+ *
+ * In addition, to save time if there is no high-res package (or it is not
+ * configured), we define the sub expression for the run_timer_list.
+ */
+
+#ifndef sub_jiffie
+#undef CONFIG_HIGH_RES_TIMERS
+#define sub_jiffie() 0
+#endif				// sub_jiffie
+
+/*
+ * The high_res_test() macro should set up a test mode that will do a
+ * worst case timer interrupt.	I.e. it may be that a call to 
+ * schedule_next_int() could return -1 indicating that the time has
+ * already expired.  This macro says to set it so that schedule_next_int()
+ * will always set up a timer interrupt.  This is used during init to
+ * calculate the worst case loop time from timer set up to int to 
+ * the signal code.
+
+ * high_res_end_test() cancels the above state and allows the no
+ * interrupt return from schedule_next_int()
+ */
+#ifndef high_res_test
+#define high_res_test()
+#define high_res_end_test()
+#endif
+
+#define IF_HIGH_RES(a) a
+
+#else				/*  CONFIG_HIGH_RES_TIMERS */
+#define IF_HIGH_RES(a)
+#define nsec_to_arch_cycles(a) 0
+
+#define discipline_timer(cpu)
+#define compute_latch(calibration_result)
+#endif				/*  CONFIG_HIGH_RES_TIMERS */
+
+/*
+ * Here is an SMP helping macro...
+ */
+#ifdef CONFIG_SMP
+#define IF_SMP(a) a
+#else
+#define IF_SMP(a)
+#endif
+/*
+ * These should have been defined in the platform hrtimers.h
+ * If not (or HIGH_RES_TIMERS not configured) define the default.
+ */
+#ifndef update_jiffies
+extern u64 jiffies_64;
+#define update_jiffies() (*(u64 *)&jiffies_64)++
+#endif
+#ifndef new_jiffie
+#define new_jiffie() 0
+#endif
+#ifndef schedule_next_int
+#define schedule_next_int(a,b,c)
+#endif
+/*
+ * If we included a high-res file, we may have gotten a more efficient 
+ * u64/u32, u64%u32 routine.  The one in div64.h actually handles a 
+ * u64 result, something we don't need, and, since it is more expensive
+ * arch porters are encouraged to implement the div_long_long_rem().
+ *
+ * int div_long_long_rem(u64 dividend,int divisor,int* remainder)
+ * which returns dividend/divisor.
+ * 
+ * Here we provide default code for those who, for what ever reason,
+ * have not provided the above.
+ */
+#ifndef div_long_long_rem
+#include <asm/div64.h>
+
+#define div_long_long_rem(dividend,divisor,remainder) ({ \
+		       u64 result = dividend;		\
+		       *remainder = do_div(result,divisor); \
+		       result; })
+
+#endif				/* ifndef div_long_long_rem */
+
+#endif				/* _HRTIME_H  */
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-posix/include/linux/time.h linux/include/linux/time.h
--- linux-2.5.52-bk4-posix/include/linux/time.h	Thu Dec 19 12:16:00 2002
+++ linux/include/linux/time.h	Thu Dec 19 12:43:58 2002
@@ -1,7 +1,7 @@
 #ifndef _LINUX_TIME_H
 #define _LINUX_TIME_H
 
-#include <asm/param.h>
+#include <linux/param.h>
 #include <linux/types.h>
 
 #ifndef _STRUCT_TIMESPEC
@@ -53,6 +53,19 @@
 #define NSEC_PER_USEC (1000L)
 #endif
 
+/* Parameters used to convert the timespec values */
+#ifndef USEC_PER_SEC
+#define USEC_PER_SEC (1000000L)
+#endif
+
+#ifndef NSEC_PER_SEC
+#define NSEC_PER_SEC (1000000000L)
+#endif
+
+#ifndef NSEC_PER_USEC
+#define NSEC_PER_USEC (1000L)
+#endif
+
 static __inline__ unsigned long
 timespec_to_jiffies(struct timespec *value)
 {
@@ -61,16 +74,16 @@
 
 	if (sec >= (MAX_JIFFY_OFFSET / HZ))
 		return MAX_JIFFY_OFFSET;
-	nsec += 1000000000L / HZ - 1;
-	nsec /= 1000000000L / HZ;
+	nsec += NSEC_PER_SEC / HZ - 1;
+	nsec /= NSEC_PER_SEC / HZ;
 	return HZ * sec + nsec;
 }
 
 static __inline__ void
-jiffies_to_timespec(unsigned long jiffies, struct timespec *value)
+jiffies_to_timespec(unsigned long _jiffies, struct timespec *value)
 {
-	value->tv_nsec = (jiffies % HZ) * (1000000000L / HZ);
-	value->tv_sec = jiffies / HZ;
+	value->tv_nsec = (_jiffies % HZ) * (NSEC_PER_SEC / HZ);
+	value->tv_sec = _jiffies / HZ;
 }
 
 /* Same for "timeval" */
@@ -171,9 +184,9 @@
 #define	ITIMER_VIRTUAL	1
 #define	ITIMER_PROF	2
 
-struct  itimerspec {
-        struct  timespec it_interval;    /* timer period */
-        struct  timespec it_value;       /* timer expiration */
+struct	itimerspec {
+	struct	timespec it_interval;	 /* timer period */
+	struct	timespec it_value;	 /* timer expiration */
 };
 
 struct	itimerval {
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-posix/include/linux/timer.h linux/include/linux/timer.h
--- linux-2.5.52-bk4-posix/include/linux/timer.h	Tue Nov 12 12:40:04 2002
+++ linux/include/linux/timer.h	Thu Dec 19 12:43:58 2002
@@ -18,6 +18,7 @@
 	unsigned long data;
 
 	struct tvec_t_base_s *base;
+	long sub_expires;
 };
 
 #define TIMER_MAGIC	0x4b87ad6e
@@ -29,6 +30,7 @@
 		.base = NULL,					\
 		.magic = TIMER_MAGIC,				\
 		.lock = SPIN_LOCK_UNLOCKED,			\
+		.sub_expires = 0,			        \
 	}
 
 /***
@@ -41,6 +43,7 @@
 static inline void init_timer(struct timer_list * timer)
 {
 	timer->base = NULL;
+	timer->sub_expires = 0;
 	timer->magic = TIMER_MAGIC;
 	spin_lock_init(&timer->lock);
 }
@@ -64,6 +67,7 @@
 extern void add_timer_on(struct timer_list *timer, int cpu);
 extern int del_timer(struct timer_list * timer);
 extern int mod_timer(struct timer_list *timer, unsigned long expires);
+extern void update_real_wall_time(void);
   
 #if CONFIG_SMP
   extern int del_timer_sync(struct timer_list * timer);
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-posix/kernel/ksyms.c linux/kernel/ksyms.c
--- linux-2.5.52-bk4-posix/kernel/ksyms.c	Thu Dec 19 12:13:48 2002
+++ linux/kernel/ksyms.c	Thu Dec 19 12:43:58 2002
@@ -56,6 +56,7 @@
 #include <linux/mount.h>
 #include <linux/ptrace.h>
 #include <asm/checksum.h>
+#include <linux/hrtime.h>
 
 #if defined(CONFIG_PROC_FS)
 #include <linux/proc_fs.h>
@@ -477,6 +478,9 @@
 #endif
 EXPORT_SYMBOL(jiffies);
 EXPORT_SYMBOL(jiffies_64);
+#ifdef CONFIG_HIGH_RES_TIMERS
+EXPORT_SYMBOL(timer_conversion_bits);
+#endif
 EXPORT_SYMBOL(xtime);
 EXPORT_SYMBOL(do_gettimeofday);
 EXPORT_SYMBOL(do_settimeofday);
diff -urP -I \$Id:.*Exp \$ -X /usr/src/patch.exclude linux-2.5.52-bk4-posix/kernel/timer.c linux/kernel/timer.c
--- linux-2.5.52-bk4-posix/kernel/timer.c	Thu Dec 19 12:16:01 2002
+++ linux/kernel/timer.c	Fri Dec 20 01:14:49 2002
@@ -17,6 +17,8 @@
  *  2000-10-05  Implemented scalable SMP per-CPU timer handling.
  *                              Copyright (C) 2000, 2001, 2002  Ingo Molnar
  *              Designed by David S. Miller, Alexey Kuznetsov and Ingo Molnar
+ *  2002-10-01	High res timers code by George Anzinger 
+ *		    Copyright (C)2002 by MontaVista Software.
  */
 
 #include <linux/kernel_stat.h>
@@ -27,38 +29,25 @@
 #include <linux/notifier.h>
 #include <linux/thread_info.h>
 
+#include <linux/hrtime.h>
+#include <linux/compiler.h>
+#include <asm/signal.h>
 #include <asm/uaccess.h>
 
+#ifndef CONFIG_NEW_TIMER_LISTSIZE
+#define CONFIG_NEW_TIMER_LISTSIZE 512
+#endif
+#define NEW_TVEC_SIZE CONFIG_NEW_TIMER_LISTSIZE
+#define NEW_TVEC_MASK (NEW_TVEC_SIZE - 1)
 /*
  * per-CPU timer vector definitions:
  */
-#define TVN_BITS 6
-#define TVR_BITS 8
-#define TVN_SIZE (1 << TVN_BITS)
-#define TVR_SIZE (1 << TVR_BITS)
-#define TVN_MASK (TVN_SIZE - 1)
-#define TVR_MASK (TVR_SIZE - 1)
-
-typedef struct tvec_s {
-	int index;
-	struct list_head vec[TVN_SIZE];
-} tvec_t;
-
-typedef struct tvec_root_s {
-	int index;
-	struct list_head vec[TVR_SIZE];
-} tvec_root_t;
-
 
 struct tvec_t_base_s {
 	spinlock_t lock;
 	unsigned long timer_jiffies;
-	struct timer_list *running_timer;
-	tvec_root_t tv1;
-	tvec_t tv2;
-	tvec_t tv3;
-	tvec_t tv4;
-	tvec_t tv5;
+ 	volatile struct timer_list * volatile running_timer;
+ 	struct list_head tv[NEW_TVEC_SIZE];
 } ____cacheline_aligned_in_smp;
 
 typedef struct tvec_t_base_s tvec_base_t;
@@ -90,42 +79,101 @@
 		check_timer_failed(timer);
 }
 
-static inline void internal_add_timer(tvec_base_t *base, struct timer_list *timer)
-{
-	unsigned long expires = timer->expires;
-	unsigned long idx = expires - base->timer_jiffies;
-	struct list_head *vec;
-
-	if (idx < TVR_SIZE) {
-		int i = expires & TVR_MASK;
-		vec = base->tv1.vec + i;
-	} else if (idx < 1 << (TVR_BITS + TVN_BITS)) {
-		int i = (expires >> TVR_BITS) & TVN_MASK;
-		vec = base->tv2.vec + i;
-	} else if (idx < 1 << (TVR_BITS + 2 * TVN_BITS)) {
-		int i = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
-		vec = base->tv3.vec + i;
-	} else if (idx < 1 << (TVR_BITS + 3 * TVN_BITS)) {
-		int i = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
-		vec = base->tv4.vec + i;
-	} else if ((signed long) idx < 0) {
-		/*
-		 * Can happen if you add a timer with expires == jiffies,
-		 * or you set a timer to go off in the past
-		 */
-		vec = base->tv1.vec + base->tv1.index;
-	} else if (idx <= 0xffffffffUL) {
-		int i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
-		vec = base->tv5.vec + i;
-	} else {
-		/* Can only get here on architectures with 64-bit jiffies */
-		INIT_LIST_HEAD(&timer->entry);
-		return;
-	}
-	/*
-	 * Timers are FIFO:
-	 */
-	list_add_tail(&timer->entry, vec);
+static inline void internal_add_timer(tvec_base_t *base, 
+ 				      struct timer_list *timer)
+  {
+ 	/*
+ 	 * must be cli-ed when calling this
+ 	 */
+  	unsigned long expires = timer->expires;
+ 	IF_HIGH_RES(int sub_expires = timer->sub_expires;)
+ 		int indx;
+ 	struct list_head *pos,*list_start;
+  
+ 	if ( time_before(expires, base->timer_jiffies) ){
+ 		/*
+ 		 * already expired, schedule for next tick 
+ 		 * would like to do better here
+ 		 * Actually this now works just fine with the
+ 		 * back up of timer_jiffies in "run_timer_list".
+ 		 * Note that this puts the timer on a list other
+ 		 * than the one it idexes to.  We don't want to
+ 		 * change the expires value in the timer as it is
+ 		 * used by the repeat code in setitimer and the
+ 		 * POSIX timers code.
+ 			 */
+ 		expires = base->timer_jiffies;
+ 		IF_HIGH_RES(sub_expires = 0);
+ 	}
+ 			
+ 	indx =	expires & NEW_TVEC_MASK;
+ 	if ((expires - base->timer_jiffies) <= NEW_TVEC_SIZE) {
+#ifdef CONFIG_HIGH_RES_TIMERS
+ 		unsigned long jiffies_f;
+  		/*
+ 		 * The high diff bits are the same, goes to the head of 
+ 		 * the list, sort on sub_expire.
+  		 */
+ 		for (pos = (list_start = &base->tv[indx])->next; 
+ 		     pos != list_start; 
+ 		     pos = pos->next){
+ 			struct timer_list *tmr = 
+ 				list_entry(pos,
+ 					   struct timer_list,
+ 					   entry);
+ 			if ((tmr->sub_expires >= sub_expires) ||
+ 			    (tmr->expires != expires)){
+ 				break;
+ 			}
+ 		}
+ 		list_add_tail(&timer->entry, pos);
+ 		/*
+ 		 * Notes to me.	 Use jiffies here instead of 
+ 		 * timer_jiffies to prevent adding unneeded interrupts.
+ 		 * Running_timer is NULL if we are NOT currently 
+ 		 * activly dispatching timers.	Since we are under
+ 		 * the same spin lock, it being false means that 
+ 		 * it has dropped the spinlock to call the timer
+ 		 * function, which could well be who called us.
+ 		 * In any case, we don't need a new interrupt as
+ 		 * the timer dispach code (run_timer_list) will
+ 		 * pick this up when the function it is calling 
+ 		 * returns.
+ 		 */
+ 		if ( expires == (jiffies_f = base->timer_jiffies) && 
+ 		     list_start->next == &timer->entry &&
+ 		     (base->running_timer == NULL)) {
+ 			schedule_next_int(jiffies_f, sub_expires,1);
+ 		}
+#else
+ 		pos = (&base->tv[indx])->next;
+ 		list_add_tail(&timer->entry, pos);
+#endif
+ 	}else{
+ 		/*
+ 		 * The high diff bits differ, search from the tail
+ 		 * The for will pick up an empty list.
+ 		 */
+ 		for (pos = (list_start = &base->tv[indx])->prev; 
+ 		     pos != list_start; 
+ 		     pos = pos->prev){
+ 			struct timer_list *tmr = 
+ 				list_entry(pos,
+ 					   struct timer_list,
+ 					   entry);
+ 			if (time_after(tmr->expires, expires)){
+ 				continue;
+ 			}
+ 			IF_HIGH_RES(
+ 				if ((tmr->expires != expires) ||
+ 				    (tmr->sub_expires < sub_expires)) {
+ 					break;
+ 				}
+ 				);
+ 		}
+ 		list_add(&timer->entry, pos);
+ 	}
+ 				
 }
 
 /***
@@ -200,7 +248,14 @@
  * (ie. mod_timer() of an inactive timer returns 0, mod_timer() of an
  * active timer returns 1.)
  */
-int mod_timer(struct timer_list *timer, unsigned long expires)
+#ifdef CONFIG_HIGH_RES_TIMERS
+int mod_timer_hr(struct timer_list *timer, 
+		 unsigned long expires, 
+		 long sub_expires)
+#else
+int mod_timer(struct timer_list *timer, 
+		 unsigned long expires)
+#endif
 {
 	tvec_base_t *old_base, *new_base;
 	unsigned long flags;
@@ -255,6 +310,7 @@
 		ret = 1;
 	}
 	timer->expires = expires;
+	IF_HIGH_RES(timer->sub_expires = sub_expires);
 	internal_add_timer(new_base, timer);
 	timer->base = new_base;
 
@@ -265,6 +321,12 @@
 
 	return ret;
 }
+#ifdef CONFIG_HIGH_RES_TIMERS
+int mod_timer(struct timer_list *timer, unsigned long expires)
+{
+	return mod_timer_hr(timer, expires, timer->sub_expires);
+}
+#endif
 
 /***
  * del_timer - deactive a timer.
@@ -347,55 +409,42 @@
 #endif
 
 
-static int cascade(tvec_base_t *base, tvec_t *tv)
-{
-	/* cascade all the timers from tv up one level */
-	struct list_head *head, *curr, *next;
-
-	head = tv->vec + tv->index;
-	curr = head->next;
-	/*
-	 * We are removing _all_ timers from the list, so we don't  have to
-	 * detach them individually, just clear the list afterwards.
-	 */
-	while (curr != head) {
-		struct timer_list *tmp;
-
-		tmp = list_entry(curr, struct timer_list, entry);
-		if (tmp->base != base)
-			BUG();
-		next = curr->next;
-		internal_add_timer(base, tmp);
-		curr = next;
-	}
-	INIT_LIST_HEAD(head);
-
-	return tv->index = (tv->index + 1) & TVN_MASK;
-}
-
-/***
- * __run_timers - run all expired timers (if any) on this CPU.
- * @base: the timer vector to be processed.
- *
- * This function cascades all vectors and executes all expired timer
- * vectors.
+/*
+ * run_timer_list is ALWAYS called from softirq which calls with irq enabled.
+ * We may assume this and not save the flags.
  */
-static inline void __run_timers(tvec_base_t *base)
+  
+  
+static void __run_timers(tvec_base_t *base)
 {
+	IF_HIGH_RES( unsigned long jiffies_f;
+		     long sub_jiff = -1;
+		     long sub_jiffie_f);
 	spin_lock_irq(&base->lock);
+#ifdef CONFIG_HIGH_RES_TIMERS
+	read_lock(&xtime_lock);
+	jiffies_f = jiffies;
+	sub_jiffie_f = sub_jiffie() + quick_get_cpuctr();
+	read_unlock(&xtime_lock);
+	while ( unlikely(sub_jiffie_f >= cycles_per_jiffies)){
+		sub_jiffie_f -= cycles_per_jiffies;
+		jiffies_f++;
+	}
+	while ((long)(jiffies_f - base->timer_jiffies) >= 0) {
+#else
 	while ((long)(jiffies - base->timer_jiffies) >= 0) {
+#endif
+ 
 		struct list_head *head, *curr;
-
+		head = base->tv + 
+			(base->timer_jiffies	& NEW_TVEC_MASK);
 		/*
-		 * Cascade timers:
+		 * Note that we never move "head" but continue to
+		 * pick the first entry from it.  This allows new
+		 * entries to be inserted while we unlock for the
+		 * function call below.
 		 */
-		if (!base->tv1.index &&
-			(cascade(base, &base->tv2) == 1) &&
-				(cascade(base, &base->tv3) == 1) &&
-					cascade(base, &base->tv4) == 1)
-			cascade(base, &base->tv5);
 repeat:
-		head = base->tv1.vec + base->tv1.index;
 		curr = head->next;
 		if (curr != head) {
 			void (*fn)(unsigned long);
@@ -403,28 +452,67 @@
 			struct timer_list *timer;
 
 			timer = list_entry(curr, struct timer_list, entry);
- 			fn = timer->function;
- 			data = timer->data;
-
-			list_del(&timer->entry);
-			timer->base = NULL;
-#if CONFIG_SMP
-			base->running_timer = timer;
+#ifdef CONFIG_HIGH_RES_TIMERS
+			/*
+			 * This would be simpler if we never got behind
+			 * i.e. if timer_jiffies == jiffies, we could
+			 * drop one of the tests.  Since we may get 
+			 * behind, (in fact we don't up date until
+			 * we are behind to allow sub_jiffie entries)
+			 * we need a way to negate the sub_jiffie
+			 * test in that case...
+			 */
+			if (time_before(timer->expires, jiffies_f)||
+			    ((timer->expires == jiffies_f) &&
+			     timer->sub_expires <= sub_jiffie_f))
+#else
+			if (time_before_eq(timer->expires, jiffies))
 #endif
-			spin_unlock_irq(&base->lock);
-			if (!fn)
-				printk("Bad: timer %p has NULL fn. (data: %08lx)\n", timer, data);
-			else
+				{fn = timer->function;
+				data= timer->data;
+
+				list_del(&timer->entry);
+				timer->base = NULL;
+				timer->entry.next = timer->entry.prev = NULL;
+				base->running_timer = timer;
+				spin_unlock_irq(&base->lock);
 				fn(data);
-			spin_lock_irq(&base->lock);
-			goto repeat;
+				spin_lock_irq(&base->lock);
+				goto repeat;
+			}
+#ifdef CONFIG_HIGH_RES_TIMERS
+			else{
+				/*
+				 * The new timer list is not always emptied
+				 * here as it contains:
+				 * a.) entries (list size)^N*jiffies out and
+				 * b.) entries that match in jiffies, but have
+				 *     sub_expire times further out than now.
+				 */
+				 if (timer->expires == jiffies_f ){
+					sub_jiff = timer->sub_expires;
+				}
+			}
+#endif
 		}
 		++base->timer_jiffies; 
-		base->tv1.index = (base->tv1.index + 1) & TVR_MASK;
 	}
-#if CONFIG_SMP
+	/*
+	 * It is faster to back out the last bump, than to prevent it.
+	 * This allows zero time inserts as well as sub_jiffie values in
+	 * the current jiffie.
+	 */
+	--base->timer_jiffies;
 	base->running_timer = NULL;
-#endif
+
+	IF_HIGH_RES(if (schedule_next_int( jiffies_f, sub_jiff, 0)){
+		/*
+		 * If schedule_next_int says the time has passed
+		 * bump the tasklet lock so we go round again
+		 */
+		run_local_timers();
+		});
+
 	spin_unlock_irq(&base->lock);
 }
 
@@ -704,15 +792,37 @@
 /*
  * Called from the timer interrupt handler to charge one tick to the current 
  * process.  user_tick is 1 if the tick is user time, 0 for system.
+ *
+ * Here is where we need to sort out the sub-jiffie interrupts from the 
+ * jiffie ones and make sure we only do accounting once per jiffie per cpu.
+ * We do this by using new_jiffie as a bit per cpu. All ops are atomic.
+ */
+/*
+ * This read-write spinlock protects us from races in SMP while
+ * playing with xtime and avenrun.
  */
+rwlock_t xtime_lock __cacheline_aligned_in_smp = RW_LOCK_UNLOCKED;
+
 void update_process_times(int user_tick)
 {
 	struct task_struct *p = current;
 	int cpu = smp_processor_id(), system = user_tick ^ 1;
 
-	update_one_process(p, user_tick, system, cpu);
+	/*
+	 * always run the timer list to pick up sub-jiffie timers
+	 */
 	run_local_timers();
-	scheduler_tick(user_tick, system);
+
+	/* 
+         * If high-res, we come here more often than 1/HZ.  Don't pass
+	 * the extra calls to those who only want the 1/HZ call.
+         */
+#ifdef CONFIG_HIGH_RES_TIMERS
+	if (test_and_clear_bit(cpu, (volatile unsigned long *)&new_jiffie()))
+#endif	 
+	{	update_one_process(p, user_tick, system, cpu);
+		scheduler_tick(user_tick, system);
+	}
 }
 
 /*
@@ -731,35 +841,38 @@
  *
  * Requires xtime_lock to access.
  */
-unsigned long avenrun[3];
 
 /*
- * calc_load - given tick count, update the avenrun load estimates.
- * This is called while holding a write_lock on xtime_lock.
+ * calc_load - (runs on above timer), update the avenrun load estimates.
+ * This is called from soft_irq context, ints on, bh locked.
  */
-static inline void calc_load(unsigned long ticks)
+unsigned long avenrun[3];
+static inline void calc_load(void);
+
+struct timer_list calc_load_timer = {
+	.expires = LOAD_FREQ,
+	.function = (void (*)(unsigned long))calc_load,
+	.entry = {0,0} };
+				       
+static inline void calc_load(void)
 {
 	unsigned long active_tasks; /* fixed-point */
-	static int count = LOAD_FREQ;
 
-	count -= ticks;
-	if (count < 0) {
-		count += LOAD_FREQ;
-		active_tasks = count_active_tasks();
-		CALC_LOAD(avenrun[0], EXP_1, active_tasks);
-		CALC_LOAD(avenrun[1], EXP_5, active_tasks);
-		CALC_LOAD(avenrun[2], EXP_15, active_tasks);
-	}
+	active_tasks = count_active_tasks();
+	write_lock_irq(&xtime_lock);
+	CALC_LOAD(avenrun[0], EXP_1, active_tasks);
+	CALC_LOAD(avenrun[1], EXP_5, active_tasks);
+	CALC_LOAD(avenrun[2], EXP_15, active_tasks);
+	write_unlock_irq(&xtime_lock);
+
+	calc_load_timer.expires = jiffies + LOAD_FREQ;
+	add_timer(&calc_load_timer);
 }
 
+
 /* jiffies at the most recent update of wall time */
 unsigned long wall_jiffies;
 
-/*
- * This read-write spinlock protects us from races in SMP while
- * playing with xtime and avenrun.
- */
-rwlock_t xtime_lock __cacheline_aligned_in_smp = RW_LOCK_UNLOCKED;
 unsigned long last_time_offset;
 
 /*
@@ -769,8 +882,7 @@
 {
 	tvec_base_t *base = &per_cpu(tvec_bases, smp_processor_id());
 
-	if ((long)(jiffies - base->timer_jiffies) >= 0)
-		__run_timers(base);
+	__run_timers(base);
 }
 
 /*
@@ -795,8 +907,25 @@
 		update_wall_time(ticks);
 	}
 	last_time_offset = 0;
-	calc_load(ticks);
 }
+#ifdef CONFIG_HIGH_RES_TIMERS
+void update_real_wall_time(void)
+{
+	unsigned long ticks;
+       /*
+	 * To get the time of day really right, we need to make sure 
+	 * every one is on the same jiffie. (Because of adj_time, etc.)
+	 * So we provide this for the high res code.  Must be called 
+	 * under the write(xtime_lock).	 (External locking allows the
+	 * caller to include sub jiffies in the lock region.)
+	 */
+	ticks = jiffies - wall_jiffies;
+	if (ticks) {
+		wall_jiffies += ticks;
+		update_wall_time(ticks);
+	}
+}
+#endif
   
 /*
  * The 64-bit jiffies value is not atomic - you MUST NOT read it
@@ -806,10 +935,11 @@
 
 void do_timer(struct pt_regs *regs)
 {
-	jiffies_64++;
+	update_jiffies();
+	/* 
+	 * SMP process accounting uses the local APIC timer 
+	 */
 #ifndef CONFIG_SMP
-	/* SMP process accounting uses the local APIC timer */
-
 	update_process_times(user_mode(regs));
 #endif
 	update_times();
@@ -820,7 +950,7 @@
 extern int do_setitimer(int, struct itimerval *, struct itimerval *);
 
 /*
- * For backwards compatibility?  This can be done in libc so Alpha
+ * For backwards compatibility?	 This can be done in libc so Alpha
  * and all newer ports shouldn't need it.
  */
 asmlinkage unsigned long sys_alarm(unsigned int seconds)
@@ -928,7 +1058,7 @@
 asmlinkage long sys_getegid(void)
 {
 	/* Only we change this so SMP safe */
-	return  current->egid;
+	return	current->egid;
 }
 
 #endif
@@ -1160,14 +1290,8 @@
        
 	base = &per_cpu(tvec_bases, cpu);
 	spin_lock_init(&base->lock);
-	for (j = 0; j < TVN_SIZE; j++) {
-		INIT_LIST_HEAD(base->tv5.vec + j);
-		INIT_LIST_HEAD(base->tv4.vec + j);
-		INIT_LIST_HEAD(base->tv3.vec + j);
-		INIT_LIST_HEAD(base->tv2.vec + j);
-	}
-	for (j = 0; j < TVR_SIZE; j++)
-		INIT_LIST_HEAD(base->tv1.vec + j);
+	for (j = 0; j < NEW_TVEC_SIZE; j++)
+		INIT_LIST_HEAD(base->tv + j);
 }
 	
 static int __devinit timer_cpu_notify(struct notifier_block *self, 
@@ -1194,5 +1318,6 @@
 	timer_cpu_notify(&timers_nb, (unsigned long)CPU_UP_PREPARE,
 				(void *)(long)smp_processor_id());
 	register_cpu_notifier(&timers_nb);
+	calc_load();
 	open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);
 }
Only in linux-2.5.52-bk4-posix/scripts: kallsyms
Binary files linux-2.5.52-bk4-posix/scripts/lxdialog/lxdialog and linux/scripts/lxdialog/lxdialog differ
Only in linux-2.5.52-bk4-posix/usr: gen_init_cpio
Only in linux-2.5.52-bk4-posix/usr: initramfs_data.cpio.gz

^ permalink raw reply

* Re: [lvm-devel] [PATCH] add kobject to struct mapped_device
From: Joe Thornber @ 2002-12-20  9:44 UTC (permalink / raw)
  To: lvm-devel; +Cc: linux-kernel
In-Reply-To: <20021220083149.GA10484@kroah.com>

On Fri, Dec 20, 2002 at 12:31:50AM -0800, Greg KH wrote:
> Here's an ascii picture which probably makes more sense:
> /sys/block/
> |-- fd0
> |   |-- dev
> |   |-- range
> |   |-- size
> |   `-- stat
> |-- dm-1
> |   |-- dev
> |   |-- dm
> |   |   |-- device0 -> ../../devices/pci0/00:02.5/ide0/0.0
> |   |   |-- device1 -> ../../devices/pci0/00:02.5/ide1/1.0
> |   |   |-- status
> |   |   |-- suspend
> |   |   `-- table
> |   |-- range
> |   |-- size
> |   `-- stat
> 
> Look reasonable?

Yes, it looks promising.  Some worries:

i) The 'status' and 'table' files do not contain a single value.
Splitting them up into single values would be ungainly to say the least,
e.g.

  dm
  |-- table
      |-- target1
      |   |-- sector_start
      |   |-- sector_len
      |   |-- target_type
      |   `-- target args
      |
      |-- target2
          |-- sector_start

  hmm ... maybe that's not too bad.

ii) If the table files are not split up then we have the problem that
    they can be larger than a single page, which sysfs can't handle
    (is this still true ?).

iii) We need to be able to poll on the status file so that people can
     block until there is a change of status.  eg, a snapshot uses up
     another 5% of its COW storage, a mirror completes its initial
     build, a path fails in the multipath target.

> Ok.  I can place all of the sysfs specific functions in dm.c, just like
> drivers/block/genhd.c has, or if we place struct mapped_device into
> dm.h, they can live in their own file.  Doesn't bother me either way.

Either put it in dm.c, or define some extra access functions (like
dm_suspended() and dm_kdev()) to get the information you need.  I
would prefer the latter, but we can always move things later.

- Joe

^ permalink raw reply

* Re: PATCH 2.5.x disable BAR when sizing
From: David Mosberger @ 2002-12-20  9:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel
In-Reply-To: <atubg3$699$1@penguin.transmeta.com>

>>>>> On Fri, 20 Dec 2002 05:57:23 +0000 (UTC), torvalds@transmeta.com (Linus Torvalds) said:

  Linus> DO NOT DO THIS. It locks up some machines at
  Linus> bootup. Hard. Total bus lockup if you have legacy USB enabled
  Linus> (or anything else that does DMA, for that matter) at the same
  Linus> time as probing the northbridge with this.

  Linus> Trust me.  If you have some new silly ia64-specific bug, the
  Linus> fix is _not_ to break real and existing hardware out there.

Could you please stop this ia64 paranoia and instead explain to me why
it's OK to relocate a PCI device to (0x100000000-PCI_dev_size)
temporarily?  That just seems horribly unsafe to me.  The PCI spec
seems to say the same as it says pretty clearly that memory decoding
should be disabled during BAR-sizing.  If certain bridges cause
problems, perhaps those need to be special-cased?

	--david

^ permalink raw reply

* Re: Dedicated kernel bug database
From: John Bradford @ 2002-12-20  9:48 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: linux-kernel
In-Reply-To: <79780000.1040355621@titus>

[CC list trimmed]

> > I've got loads of ideas about how we could build a better bug database
> 
> Go ahead, knock yourself out. Come back when you're done.

Not sure what you mean.  I do intend to start coding a new bug
database system today, and I'll announce it on the list when it's
ready.  If nobody likes it, I wasted my time.

> > - for example, we have categories at the moment in Bugzilla.  Why?  We
> > already have a MAINTAINERS file, so say somebody looks up the relevant
> > maintainer in that list, finds them, then goes to enter a bug in
> > Bugzilla.  Now they have to assign it to a category, and different
> > people may well assign the same bug to different categories -
> > immediately making duplicate detection more difficult.
> 
> Have you actually looked at the maintainers file?

Yes.

> It's a twisted mess of outdated information,

Then it should be updated, that is nothing to do with Bugzilla.

> in no well formated order.

Looks easy enough to parse with regular expressions to me.

> The category list in Bugzilla was an attempt to bring some sanity to
> the structure,

By adding an extra layer of abstraction.  I don't agree that that
helps.

> though I won't claim it's perfect. We really need a 3-level tree,
> but that's a fair amount of work to code.

I disagree, (that we need a 3-level tree).

John.

^ permalink raw reply

* How I can do correct uninstall of IPTables?
From: Anatoliy Borisov @ 2002-12-20  9:25 UTC (permalink / raw)
  To: Netfilter general discussion and user questions

[-- Attachment #1: Type: text/plain, Size: 332 bytes --]

Hello.

I have installed the iptables userland package from source (make install ....). How can I uninstall it correctly, so that my system returns to the state it was in before the installation?

environment:
- OS Debian GNU/Linux 3.0
- IPTables 1.2.7a

Anatoliy Borisov  ( borisov@ukrpack.net )
Engineer, "INFOCOM" JV

[-- Attachment #2: Type: text/html, Size: 792 bytes --]

^ permalink raw reply

* Re: [drm:drm_init] *ERROR* Cannot initialize the agpgart module.
From: Dave Jones @ 2002-12-20  9:33 UTC (permalink / raw)
  To: Randy.Dunlap
  Cc: Matt Bernstein, Ed Tomlinson, Paul P Komkoff Jr, linux-kernel,
	Rusty Russell
In-Reply-To: <Pine.LNX.4.33L2.0212191734240.30841-100000@dragon.pdx.osdl.net>

On Thu, Dec 19, 2002 at 05:43:14PM -0800, Randy.Dunlap wrote:
 > |
 > | I get a very similar oops, but with amd_k7_agp (2.5.52-mm2). I'm not
 > | bk-savvy as yet, but if pointed at a diff, would be happy to verify it.
 > 
 > 2.5.zz kernel diff snapshots (from bk) are available at
 >   http://www.kernel.org/pub/linux/kernel/v2.5/snapshots/
 > e.g., latest is:
 >   http://www.kernel.org/pub/linux/kernel/v2.5/snapshots/patch-2.5.52-bk4.bz2

Latest AGP bits aren't in Linus tree yet. A few more bits to nail
down, and then I'll ask him to pull again.

        Dave

-- 
| Dave Jones.        http://www.codemonkey.org.uk
| SuSE Labs

^ permalink raw reply

* [LARTC] Simple pfifo_fast question
From: Thomas Jalsovsky @ 2002-12-20  9:19 UTC (permalink / raw)
  To: lartc


Hello,

	I use HTB on IMQ, and I would like to use pfifo_fast on my HTB
leafs.
	Do I need to add a pfifo_fast qdisc to my HTB class, or this is
the default one?
	I tried to add one and I got:
RTNETLINK answers: Invalid argument
It looks like pfifo_fast qdisc is not supported on HTB leaf...

	Any comments?

	Regards,
		Thomas

_______________________________________________
LARTC mailing list / LARTC@mailman.ds9a.nl
http://mailman.ds9a.nl/mailman/listinfo/lartc HOWTO: http://lartc.org/

^ permalink raw reply

* Re: [ISN] Music file flaws could threaten traders
From: Tom @ 2002-12-20  9:15 UTC (permalink / raw)
  To: Russell Coker; +Cc: selinux
In-Reply-To: <200212192307.20386.russell@coker.com.au>

On Thu, Dec 19, 2002 at 11:07:20PM +0100, Russell Coker wrote:
> The problem about doing the same for audio/video programs such as players for 
> avi, mp3, and vob files is that their typical use involves downloading files 
> from the net to play immediately so that denying them read access to 
> user_home_t files will give a large decrease in functionality.  I believe 
> that there are two major categories of SE Linux users, those who will never 
> run such A/V programs on Linux, and those who won't use any security software 
> that gets in the way of their entertainment.

My preferred solution would be to handle this much like in old BBS
days. Any file downloaded from the net, no matter how, should be
labeled with a special "untrusted download" type first. It could then
either be relabeled after checking (for virii, content correctness or
whatever) or the player could be run in a domain that allows absolute
minimum access only.

This can be implemented in one of two ways:

a) by modifying any programs downloading files
b) by making downloads only  to a special download directory
   and using file_auto_trans there.
   

The better/easier b) requires a little discipline from the user since
netscape et al will need write access to other directories (e.g. /tmp)
so he _could_ in theory save his stuff there.


-- 
PGP/GPG key: http://web.lemuria.org/pubkey.html
pub  1024D/2D7A04F5 2002-05-16 Tom Vogt <tom@lemuria.org>
     Key fingerprint = C731 64D1 4BCF 4C20 48A4  29B2 BF01 9FA1 2D7A 04F5

--
This message was distributed to subscribers of the selinux mailing list.
If you no longer wish to subscribe, send mail to majordomo@tycho.nsa.gov with
the words "unsubscribe selinux" without quotes as the message.

^ permalink raw reply

* [BENCHMARK] 2.5.52-mm2 with contest
From: Con Kolivas @ 2002-12-20  9:22 UTC (permalink / raw)
  To: linux kernel mailing list; +Cc: Andrew Morton

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Here are contest (http://contest.kolivas.net) benchmarks using the osdl 
(http://www.osdl.org) hardware for 2.5.52-mm2 in both UniProcessor and SMP 
mode:

For the uniprocessor results you need your filter glasses on to compare 2.5.52 
with the -mm results as the baseline changed for complicated reasons. So apart 
from the trend, you can only compare the absolute results between mm1 and 
mm2.

UP:
noload:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [3]              70.2    96      0       0       1.05
2.5.52-mm1 [7]          74.7    96      0       0       1.12
2.5.52-mm2 [7]          74.6    96      0       0       1.12

cacherun:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [3]              67.5    99      0       0       1.01
2.5.52-mm1 [7]          71.9    99      0       0       1.08
2.5.52-mm2 [7]          72.0    99      0       0       1.08

process_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [3]              84.4    79      17      19      1.26
2.5.52-mm1 [7]          91.0    79      18      19      1.36
2.5.52-mm2 [7]          90.3    79      18      19      1.35

dbench_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [3]              222.3   36      2       53      3.33
2.5.52-mm1 [7]          226.4   37      2       51      3.39
2.5.52-mm2 [7]          229.6   36      2       50      3.44

ctar_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [3]              109.8   81      2       8       1.64
2.5.52-mm1 [7]          112.2   81      3       9       1.68
2.5.52-mm2 [7]          109.6   81      2       9       1.64

xtar_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [3]              161.4   69      3       8       2.42
2.5.52-mm1 [7]          127.9   70      2       7       1.92
2.5.52-mm2 [7]          125.0   70      2       7       1.87

io_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              120.9   60      13      12      1.81
2.5.52-mm1 [7]          143.9   55      18      13      2.16
2.5.52-mm2 [7]          129.3   61      14      12      1.94

io_other:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              94.9    76      7       10      1.42
2.5.52-mm1 [7]          115.5   67      11      11      1.73
2.5.52-mm2 [7]          93.3    79      7       9       1.40

read_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [3]              88.1    80      15      7       1.32
2.5.52-mm1 [7]          97.0    78      15      6       1.45
2.5.52-mm2 [7]          93.6    80      15      6       1.40

list_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [3]              81.0    86      0       9       1.21
2.5.52-mm1 [7]          86.8    85      0       9       1.30
2.5.52-mm2 [7]          86.3    85      0       9       1.29

mem_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [3]              100.0   78      45      2       1.50
2.5.52-mm1 [7]          117.5   69      45      1       1.76
2.5.52-mm2 [7]          108.0   77      46      2       1.62


SMP:
noload:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              39.3    181     0       0       1.09
2.5.52-mm1 [8]          39.7    180     0       0       1.10
2.5.52-mm2 [7]          39.2    181     0       0       1.08

cacherun:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              36.5    194     0       0       1.01
2.5.52-mm1 [7]          36.9    194     0       0       1.02
2.5.52-mm2 [7]          36.5    194     0       0       1.01

process_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              48.7    144     10      49      1.34
2.5.52-mm1 [7]          49.0    144     10      50      1.35
2.5.52-mm2 [7]          46.5    152     8       41      1.28

dbench_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio

ctar_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              56.1    161     1       10      1.55
2.5.52-mm1 [7]          55.5    156     1       10      1.53
2.5.52-mm2 [7]          52.8    154     1       10      1.46

xtar_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              83.1    138     1       9       2.29
2.5.52-mm1 [7]          77.4    122     1       8       2.14
2.5.52-mm2 [7]          76.1    124     1       8       2.10

io_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              73.1    111     10      19      2.02
2.5.52-mm1 [7]          80.5    108     10      19      2.22
2.5.52-mm2 [7]          74.5    112     11      20      2.06

io_other:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              75.1    120     10      21      2.07
2.5.52-mm1 [7]          60.1    131     7       18      1.66
2.5.52-mm2 [7]          59.9    134     6       18      1.65

read_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              49.4    151     5       7       1.36
2.5.52-mm1 [7]          49.9    149     5       6       1.38
2.5.52-mm2 [7]          50.5    147     5       6       1.39

list_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              43.2    167     0       9       1.19
2.5.52-mm1 [7]          43.8    167     0       9       1.21
2.5.52-mm2 [7]          43.7    167     0       9       1.21

mem_load:
Kernel [runs]           Time    CPU%    Loads   LCPU%   Ratio
2.5.52 [7]              63.5    148     38      3       1.75
2.5.52-mm1 [7]          71.1    123     36      2       1.96
2.5.52-mm2 [7]          66.0    141     39      3       1.82

Slight shift in the balance in both SMP and UP results towards lower times for 
io_load, io_other and mem_load. Note also the interesting rise in mem_loads 
done despite the shorter time (a marked improvement therefore).

Con
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.0 (GNU/Linux)

iD8DBQE+AuFSF6dfvkL3i1gRAosVAJ0VgSaPJurexvoCR7wRnA1+wJtWLwCgqu9u
OKFw2P3E8MHYPMfAhWyKEyQ=
=6+VY
-----END PGP SIGNATURE-----

^ permalink raw reply

* Re: Problems while starting 1.1.4 with other user than root
From: Bart Oldeman @ 2002-12-20  9:05 UTC (permalink / raw)
  To: Peter Forst; +Cc: DosEmu Mailinglist
In-Reply-To: <20021220085355.0c672c42.pforst@firemail.de>

On Fri, 20 Dec 2002, Peter Forst wrote:

> After upgrading from 1.1.3-7 to 1.1.4, any other user, than root
> can not start anymore the dosemu. Error-message:
>
> -- Running unpriviledged in low feature mode
> Linux kernel 2.4.16; CPU speed is 1460485000 Hz
> Dosemu-1.1.4.0 Running on CPU=586, FPU=1
> ERROR: can't open /dev/mem: errno=13, Keine Berechtigung

right, this is a bug because this shouldn't happen in the first
place, but please adapt your dosemu.conf or ~/.dosemurc and comment out
(at least) $_dualmon, like this:

#$_dualmon = (0)        # if you have one vga _plus_ one hgc (2 monitors)

-- a workaround that fixes your problem.

> An ls -l shows following:
> crw-r-----    1 root     kmem       1,   1 May 12  2001 /dev/mem
>
> Also an chmod to 777
> crwxrwxrwx    1 root     kmem       1,   1 May 12  2001 /dev/mem
>
> Takes no effect.

you should move that back to what it was to plug a big security hole ;)

Bart


^ permalink raw reply

* [PATCH] IRQ distribution in the 2.5.52  kernel
From: Kamble, Nitin A @ 2002-12-20  9:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Kamble, Nitin A

[-- Attachment #1: Type: text/plain, Size: 19718 bytes --]

Hello All,

  We were looking at the performance impact of the IRQ routing from the 2.5.52 Linux kernel. This email includes some of our findings about the way the interrupts are getting moved in the 2.5.52 kernel. Also there is discussion and a patch for a new implementation. Let me know what you think at nitin.a.kamble@intel.com
 
Current implementation:
======================
We have found that the existing implementation works well on IA32 SMP systems with light load of interrupts. Also we noticed that it is not working that well under heavy interrupt load conditions on these SMP systems. The observations are:
 
* The interrupt load of each IRQ is balanced across CPUs independently of the load of the other IRQs, and the current implementation moves the IRQs randomly. This works well when the interrupt load is light, but with multiple heavy interrupt sources we start seeing an imbalance: frequently several heavily loaded IRQs get moved to a single CPU while the other CPUs stay very lightly loaded. To achieve a good interrupt load balance, it is important to consider the load of all the interrupts together.
    This can be explained further with an example of 4 CPUs and 4 heavy interrupt sources. With the existing random movement approach, the chance of each of these heavy interrupt sources moving to a separate CPU is (4/4)*(3/4)*(2/4)*(1/4) = 3/32. That means 29/32 = 90.625% of the time some CPUs are very lightly loaded while others are loaded with multiple heavy interrupts. This causes the interrupt load imbalance and results in lower performance. With 2 CPUs and 2 heavily loaded interrupt sources, this imbalance happens 1/2 = 50% of the time. The issue becomes more severe as the number of heavy interrupt sources increases.
 
* Another interesting observation is: We cannot see the imbalance of the interrupt load from /proc/interrupts. (/proc/interrupts shows the cumulative load of interrupts on all CPUs.) If the interrupt load is imbalanced and this imbalance is getting rotated among CPUs continuously, then /proc/interrupts will still show that the interrupt load is going to processors very evenly. Currently at the frequency (HZ/50) at which IRQs are moved across CPUs, it is not possible to see any interrupt load imbalance happening.
 
* We have also found that in certain cases static IRQ binding performs better than the existing kernel distribution of interrupt load. The reason is that when the interrupt load is already well balanced, the interrupts are still moved across CPUs frequently and unnecessarily. This adds extra overhead and also loses the benefit of warm CPU caches.
  This came out from the performance measurements done on a 4-way HT (8 logical processors) Pentium 4 Xeon system running 8 copies of netperf. The 4 NICs in the system taking different IRQs generated sizable interrupt load with the help of connected clients.
 
Here the netperf transactions/sec throughput numbers observed are:
 
IRQs nicely manually bound to CPUs: 56.20K 
The current kernel implementation of IRQ movement: 50.05K
 -----------------------
 The static binding of IRQs has performed 12.28% better than the current IRQ movement implemented in the kernel.
 
* The current implementation does not distinguish sibling logical CPUs on HT (Hyper-Threading(tm)) enabled processors. It would be beneficial to balance the interrupt load across processor packages first, and then among the logical CPUs inside each processor package.
  For example, with 2 heavy interrupt sources and 2 processor packages (4 logical CPUs), assigning the two heavy interrupt sources to different processor packages is better, because they then use the execution resources of different packages.
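
The 4-CPU example in the first observation above generalizes. If each of n heavy interrupt sources lands independently and uniformly at random on one of n CPUs, the chance that they all end up on separate CPUs is the product (n/n)*((n-1)/n)*...*(1/n). Below is a minimal user-space sketch in C, assuming exactly that uniform random placement model. The function names are made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Probability that n heavy interrupt sources, each placed independently
 * and uniformly at random on one of n CPUs, all land on different CPUs:
 * (n/n) * ((n-1)/n) * ... * (1/n).
 */
double all_distinct_probability(int n)
{
	double p = 1.0;
	int k;

	assert(n > 0);
	for (k = 0; k < n; k++)
		p *= (double)(n - k) / (double)n;
	return p;
}

/* Hypothetical helper: print the balanced and imbalanced share for n CPUs. */
void print_imbalance_chance(int n)
{
	double p = all_distinct_probability(n);

	printf("%d CPUs: balanced %.5f, imbalanced %.3f%%\n",
	       n, p, (1.0 - p) * 100.0);
}
```

Calling print_imbalance_chance(4) evaluates the product from the 4-CPU example; one minus the result is the fraction of time at least two heavy sources share a CPU under this model.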
 
 

New revised implementation:
==========================
We also have been working on a new implementation. The following points are in main focus.
 
* At any moment heavily loaded IRQs are distributed to different CPUs to achieve as much balance as possible. 
 
* Lightly loaded interrupt sources are ignored from the load balancing, as they do not cause considerable imbalance.
 
* When the heavy interrupt sources are balanced, they are not moved around. This also helps in keeping the CPU caches warm.
 
* It has been made HT aware. While distributing the load, the load on the processor package to which a logical CPU belongs is also considered.
 
* When there are few (fewer than num_cpus) heavy interrupt sources, it is not possible to balance them evenly. In that case the existing code is reused to move the interrupts, with the randomness of the original code removed.
 
* The time interval for redistribution has been made flexible. It varies as the system interrupt load changes.
 
* A new kernel_thread is introduced to do the load balancing calculations for all the interrupt sources. It keeps the balance_maps ready for the interrupt handlers, keeping the overhead in interrupt handling to a minimum.
 
* It allows the disabling of the IRQ distribution from the boot loader command line, if anybody wants to do it for any reason. 
 
* The algorithm also takes into account the static binding of interrupts to CPUs that user imposes from the /proc/irq/{n}/smp_affinity interface.
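
The flexible redistribution interval mentioned in the list above can be sketched in user space as follows. The four constants mirror the ones defined in the patch. HZ is set to 1000 purely as an assumption for illustration, and next_interval() is a hypothetical helper that does not appear in the patch. The sketch compares before subtracting, because the interval is an unsigned long and a plain subtraction below zero would wrap around.

```c
#include <assert.h>

#define HZ 1000UL	/* assumption for illustration; the kernel supplies the real HZ */

#define MAX_BALANCED_IRQ_INTERVAL	(5 * HZ)
#define MIN_BALANCED_IRQ_INTERVAL	(HZ / 2)
#define BALANCED_IRQ_MORE_DELTA		(HZ / 10)
#define BALANCED_IRQ_LESS_DELTA		(HZ)

/*
 * Return the next rebalance interval in ticks.
 * moved_irq != 0: an IRQ was just retargeted, so check again sooner.
 * moved_irq == 0: nothing was worth moving, so back off gradually.
 * The result always stays within [MIN, MAX].
 */
unsigned long next_interval(unsigned long interval, int moved_irq)
{
	assert(interval >= MIN_BALANCED_IRQ_INTERVAL);
	assert(interval <= MAX_BALANCED_IRQ_INTERVAL);

	if (moved_irq) {
		/* Compare first so the unsigned subtraction cannot wrap. */
		if (interval < MIN_BALANCED_IRQ_INTERVAL + BALANCED_IRQ_LESS_DELTA)
			return MIN_BALANCED_IRQ_INTERVAL;
		return interval - BALANCED_IRQ_LESS_DELTA;
	}
	if (interval + BALANCED_IRQ_MORE_DELTA > MAX_BALANCED_IRQ_INTERVAL)
		return MAX_BALANCED_IRQ_INTERVAL;
	return interval + BALANCED_IRQ_MORE_DELTA;
}
```

The asymmetry is the point of the design as described: a successful move shortens the interval by a full HZ so that further imbalance is noticed quickly, while a quiet pass lengthens it by only HZ/10.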
 
 
Throughput numbers with the netperf setup for the new implementation:
 
Current kernel IRQ balance implementation: 50.02K transactions/sec
The new IRQ balance implementation: 56.01K transactions/sec
 ---------------------
  The performance improvement on P4 Xeon of 11.9% is observed.
 
The new IRQ balance implementation also shows little performance improvement on P6 (Pentium II, III) systems.
 
On a P6 system the netperf throughput numbers are:
Current kernel IRQ balance implementation: 36.96K transactions/sec
The new IRQ balance implementation: 37.65K transactions/sec
 ---------------------
  Here the performance improvement on P6 system of about 2% is observed.
 
 
Thanks,
Nitin

diff -Naru 2.5.52/Documentation/kernel-parameters.txt kirqb/Documentation/kernel-parameters.txt
--- 2.5.52/Documentation/kernel-parameters.txt	Tue Dec 17 15:35:57 2002
+++ kirqb/Documentation/kernel-parameters.txt	Tue Dec 17 15:37:29 2002
@@ -352,6 +352,8 @@
 
 	hugepages=	[HW,IA-32] Maximal number of HugeTLB pages
 
+	noirqbalance	[IA-32,SMP,KNL] Disable kernel irq balancing
+
 	i8042_direct	[HW] Non-translated mode
 	i8042_dumbkbd
 	i8042_noaux
diff -Naru 2.5.52/arch/i386/kernel/io_apic.c kirqb/arch/i386/kernel/io_apic.c
--- 2.5.52/arch/i386/kernel/io_apic.c	Tue Dec 17 15:35:26 2002
+++ kirqb/arch/i386/kernel/io_apic.c	Fri Dec 20 01:23:15 2002
@@ -206,19 +206,37 @@
 	spin_unlock_irqrestore(&ioapic_lock, flags);
 }
 
-#if CONFIG_SMP
+#if defined(CONFIG_SMP)
+# include <asm/processor.h>	/* kernel_thread() */
+# include <linux/kernel_stat.h>	/* kstat */
+# include <linux/slab.h>		/* kmalloc() */
+# include <linux/timer.h>	/* time_after() */
+ 
+# if CONFIG_BALANCED_IRQ_DEBUG
+#  define TDprintk(x...) do { printk("<%ld:%s:%d>: ", jiffies, __FILE__, __LINE__); printk(x); } while (0)
+#  define Dprintk(x...) do { TDprintk(x); } while (0)
+# else
+#  define TDprintk(x...) 
+#  define Dprintk(x...) 
+# endif
 
-typedef struct {
-	unsigned int cpu;
-	unsigned long timestamp;
-} ____cacheline_aligned irq_balance_t;
-
-static irq_balance_t irq_balance[NR_IRQS] __cacheline_aligned
-			= { [ 0 ... NR_IRQS-1 ] = { 0, 0 } };
+# define MIN(a,b) (((a) < (b)) ? (a) : (b))
+# define MAX(a,b) (((a) > (b)) ? (a) : (b))
 
 extern unsigned long irq_affinity [NR_IRQS];
-
-#endif
+unsigned long __cacheline_aligned irq_balance_mask [NR_IRQS];
+static int irqbalance_disabled __initdata = 0;
+static int physical_balance = 0;
+
+struct irq_cpu_info {
+	unsigned long * last_irq;
+	unsigned long * irq_delta;
+	unsigned long irq;
+} irq_cpu_data[NR_CPUS];
+
+#define CPU_IRQ(cpu)		(irq_cpu_data[cpu].irq)
+#define LAST_CPU_IRQ(cpu,irq)   (irq_cpu_data[cpu].last_irq[irq])
+#define IRQ_DELTA(cpu,irq) 	(irq_cpu_data[cpu].irq_delta[irq])
 
 #define IDLE_ENOUGH(cpu,now) \
 		(idle_cpu(cpu) && ((now) - irq_stat[(cpu)].idle_timestamp > 1))
@@ -226,10 +244,224 @@
 #define IRQ_ALLOWED(cpu,allowed_mask) \
 		((1 << cpu) & (allowed_mask))
 
-#if CONFIG_SMP
+#define CPU_TO_PACKAGEINDEX(i) \
+		((physical_balance && i > cpu_sibling_map[i]) ? cpu_sibling_map[i] : i)
+
+#define MAX_BALANCED_IRQ_INTERVAL	(5*HZ)
+#define MIN_BALANCED_IRQ_INTERVAL	(HZ/2)
+#define BALANCED_IRQ_MORE_DELTA		(HZ/10)
+#define BALANCED_IRQ_LESS_DELTA		(HZ)
+
+unsigned long balanced_irq_interval = MAX_BALANCED_IRQ_INTERVAL;
+					 
+static inline void balance_irq(int cpu, int irq);
+
+static inline void rotate_irqs_among_cpus(unsigned long useful_load_threshold)
+{
+	int i, j;
+	Dprintk("Rotating IRQs among CPUs.\n");
+	for (i = 0; i < NR_CPUS; i++) {
+		for (j = 0; cpu_online(i) && (j < NR_IRQS); j++) {
+			if (!irq_desc[j].action)
+				continue;
+			/* Is it a significant load ?  */
+			if (IRQ_DELTA(CPU_TO_PACKAGEINDEX(i),j) < useful_load_threshold)
+				continue;
+			balance_irq(i, j);
+		}
+	}
+	balanced_irq_interval = MAX(MIN_BALANCED_IRQ_INTERVAL,
+		balanced_irq_interval - BALANCED_IRQ_LESS_DELTA);	
+	return;
+}
+
+static void do_irq_balance(void)
+{
+	int i, j;
+	unsigned long max_cpu_irq = 0, min_cpu_irq = (~0);
+	unsigned long move_this_load = 0;
+	int max_loaded = 0, min_loaded = 0;
+	unsigned long useful_load_threshold = balanced_irq_interval + 10;
+	int selected_irq;
+	int tmp_loaded, first_attempt = 1;
+	unsigned long tmp_cpu_irq;
+	unsigned long imbalance = 0;
+	unsigned long allowed_mask;
+	unsigned long target_cpu_mask;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		int package_index;
+		CPU_IRQ(i) = 0;
+		if (!cpu_online(i))
+			continue;
+		package_index = CPU_TO_PACKAGEINDEX(i);
+		for (j = 0; j < NR_IRQS; j++) {
+			unsigned long value_now, delta;
+			/* Is this an active IRQ? */
+			if (!irq_desc[j].action)
+				continue;
+			if (package_index == i)
+				IRQ_DELTA(package_index,j) = 0;
+			/* Determine the total count per processor per IRQ */
+			value_now = (unsigned long) kstat_cpu(i).irqs[j];
+
+			/* Determine the activity per processor per IRQ */
+			delta = value_now - LAST_CPU_IRQ(i,j);
+
+			/* Update last_cpu_irq[][] for the next time */
+			LAST_CPU_IRQ(i,j) = value_now;
+
+			/* Ignore IRQs whose rate is less than the clock */
+			if (delta < useful_load_threshold)
+				continue;
+			/* update the load for the processor or package total */
+			IRQ_DELTA(package_index,j) += delta;
+
+			/* Keep track of the higher numbered sibling as well */
+			if (i != package_index)
+				CPU_IRQ(i) += delta;
+			/*
+			 * We have sibling A and sibling B in the package
+			 *
+			 * cpu_irq[A] = load for cpu A + load for cpu B
+			 * cpu_irq[B] = load for cpu B
+			 */
+			CPU_IRQ(package_index) += delta;
+		}
+	}
+	/* Find the least loaded processor package */
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+		if (physical_balance && i > cpu_sibling_map[i])
+			continue;
+		if (min_cpu_irq > CPU_IRQ(i)) {
+			min_cpu_irq = CPU_IRQ(i);
+			min_loaded = i;
+		}
+	}
+	max_cpu_irq = ULONG_MAX;
+
+tryanothercpu:
+	/* Look for heaviest loaded processor.
+	 * We may come back to get the next heaviest loaded processor.
+	 * Skip processors with trivial loads.
+	 */
+	tmp_cpu_irq = 0;
+	tmp_loaded = -1;
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+		if (physical_balance && i > cpu_sibling_map[i])
+			continue;
+		if (max_cpu_irq <= CPU_IRQ(i)) 
+			continue;
+		if (tmp_cpu_irq < CPU_IRQ(i)) {
+			tmp_cpu_irq = CPU_IRQ(i);
+			tmp_loaded = i;
+		}
+	}
+
+	if (tmp_loaded == -1) {
+ 	 /* In the case of small number of heavy interrupt sources, 
+	  * loading some of the cpus too much. We use Ingo's original 
+	  * approach to rotate them around.
+	  */
+		if (!first_attempt && imbalance >= useful_load_threshold) {
+			rotate_irqs_among_cpus(useful_load_threshold);
+			return;
+		}
+		goto not_worth_the_effort;
+	}
+	
+	first_attempt = 0;		/* heaviest search */
+	max_cpu_irq = tmp_cpu_irq;	/* load */
+	max_loaded = tmp_loaded;	/* processor */
+	imbalance = (max_cpu_irq - min_cpu_irq) / 2;
+	
+	Dprintk("max_loaded cpu = %d\n", max_loaded);
+	Dprintk("min_loaded cpu = %d\n", min_loaded);
+	Dprintk("max_cpu_irq load = %ld\n", max_cpu_irq);
+	Dprintk("min_cpu_irq load = %ld\n", min_cpu_irq);
+	Dprintk("load imbalance = %lu\n", imbalance);
+
+	/* if imbalance is less than approx 12.5% (1/8) of max load, then
+	 * observe diminishing returns action. - quit
+	 */
+	if (imbalance < (max_cpu_irq >> 3)) {
+		Dprintk("Imbalance too trivial\n");
+		goto not_worth_the_effort;
+	}
+
+tryanotherirq:
+	/* if we select an IRQ to move that can't go where we want, then
+	 * see if there is another one to try.
+	 */
+	move_this_load = 0;
+	selected_irq = -1;
+	for (j = 0; j < NR_IRQS; j++) {
+		/* Is this an active IRQ? */
+		if (!irq_desc[j].action)
+			continue;
+		if (imbalance <= IRQ_DELTA(max_loaded,j))
+			continue;
+		/* Try to find the IRQ that is closest to the imbalance
+		 * without going over.
+		 */
+		if (move_this_load < IRQ_DELTA(max_loaded,j)) {
+			move_this_load = IRQ_DELTA(max_loaded,j);
+			selected_irq = j;
+		}
+	}
+	if (selected_irq == -1) {
+		goto tryanothercpu;
+	}
 
-#define IRQ_BALANCE_INTERVAL (HZ/50)
+	imbalance = move_this_load;
 	
+	/* For physical_balance case, we accumulated both load
+	 * values in the one of the siblings cpu_irq[],
+	 * to use the same code for physical and logical processors
+	 * as much as possible. 
+	 *
+	 * NOTE: the cpu_irq[] array holds the sum of the load for
+	 * sibling A and sibling B in the slot for the lowest numbered
+	 * sibling (A), _AND_ the load for sibling B in the slot for
+	 * the higher numbered sibling.
+	 *
+	 * We seek the least loaded sibling by making the comparison
+	 * (A+B)/2 vs B
+	 */
+	if (physical_balance && (CPU_IRQ(min_loaded) >> 1) > CPU_IRQ(cpu_sibling_map[min_loaded]))
+		min_loaded = cpu_sibling_map[min_loaded];
+
+	allowed_mask = cpu_online_map & irq_affinity[selected_irq];
+	target_cpu_mask = 1 << min_loaded;
+
+	if (target_cpu_mask & allowed_mask) {
+		irq_desc_t *desc = irq_desc + selected_irq;
+		Dprintk("irq = %d moved to cpu = %d\n", selected_irq, min_loaded);
+		/* mark for change destination */
+		spin_lock(&desc->lock);
+		irq_balance_mask[selected_irq] = target_cpu_mask;
+		spin_unlock(&desc->lock);
+		/* Since we made a change, come back sooner to 
+		 * check for more variation.
+		 */
+		balanced_irq_interval = MAX(MIN_BALANCED_IRQ_INTERVAL,
+			balanced_irq_interval - BALANCED_IRQ_LESS_DELTA);	
+		return;
+	}
+	goto tryanotherirq;
+
+not_worth_the_effort:
+	/* if we did not find an IRQ to move, then adjust the time interval upward */
+	balanced_irq_interval = MIN(MAX_BALANCED_IRQ_INTERVAL,
+		balanced_irq_interval + BALANCED_IRQ_MORE_DELTA);	
+	Dprintk("IRQ worth rotating not found\n");
+	return;
+}
+
 static unsigned long move(int curr_cpu, unsigned long allowed_mask, unsigned long now, int direction)
 {
 	int search_idle = 1;
@@ -256,34 +488,112 @@
 	return cpu;
 }
 
-static inline void balance_irq(int irq)
+static inline void balance_irq (int cpu, int irq)
 {
-	irq_balance_t *entry = irq_balance + irq;
 	unsigned long now = jiffies;
-
+	unsigned long allowed_mask;
+	unsigned int new_cpu;
+		
 	if (clustered_apic_mode)
 		return;
 
-	if (unlikely(time_after(now, entry->timestamp + IRQ_BALANCE_INTERVAL))) {
-		unsigned long allowed_mask;
-		unsigned int new_cpu;
-		int random_number;
-
-		rdtscl(random_number);
-		random_number &= 1;
-
-		allowed_mask = cpu_online_map & irq_affinity[irq];
-		entry->timestamp = now;
-		new_cpu = move(entry->cpu, allowed_mask, now, random_number);
-		if (entry->cpu != new_cpu) {
-			entry->cpu = new_cpu;
-			set_ioapic_affinity(irq, 1 << new_cpu);
+	allowed_mask = cpu_online_map & irq_affinity[irq];
+	new_cpu = move(cpu, allowed_mask, now, 1);
+	if (cpu != new_cpu) {
+		irq_desc_t *desc = irq_desc + irq;
+		spin_lock(&desc->lock);
+		irq_balance_mask[irq] = 1 << new_cpu;
+		spin_unlock(&desc->lock);
+	}
+}
+
+int balanced_irq(void *unused)
+{
+	int i;
+	unsigned long prev_balance_time = jiffies;
+	long time_remaining = balanced_irq_interval;
+	daemonize();
+	sigfillset(&current->blocked);
+	sprintf(current->comm, "balanced_irq");
+	
+	/* push everything to CPU 0 to give us a starting point.  */
+	for (i = 0 ; i < NR_IRQS ; i++)
+		irq_balance_mask[i] = 1 << 0;
+	for (;;) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		time_remaining = schedule_timeout(time_remaining);
+		if (time_after(jiffies, prev_balance_time+balanced_irq_interval)) {
+			Dprintk("balanced_irq: calling do_irq_balance() %lu\n", jiffies);
+			do_irq_balance();
+			prev_balance_time = jiffies;
+			time_remaining = balanced_irq_interval;
 		}
+	}
+}
+
+static int __init balanced_irq_init(void)
+{
+	int i;
+	struct cpuinfo_x86 *c;
+	c = &boot_cpu_data;
+	if (irqbalance_disabled)
+		return 0;
+	/* Enable physical balance only if more than 
+	 * one physical processor package is present */
+	if (smp_num_siblings > 1 && cpu_online_map >> 2)
+		physical_balance = 1;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		if (!cpu_online(i))
+			continue;
+		irq_cpu_data[i].irq_delta = kmalloc(sizeof(unsigned long) * NR_IRQS, GFP_KERNEL);
+		irq_cpu_data[i].last_irq = kmalloc(sizeof(unsigned long) * NR_IRQS, GFP_KERNEL);
+		if (irq_cpu_data[i].irq_delta == NULL || irq_cpu_data[i].last_irq == NULL) {
+			printk(KERN_ERR "balanced_irq_init: out of memory\n");
+			goto failed;
+		}
+		memset(irq_cpu_data[i].irq_delta,0,sizeof(unsigned long) * NR_IRQS);
+		memset(irq_cpu_data[i].last_irq,0,sizeof(unsigned long) * NR_IRQS);
+	}
+	
+	printk(KERN_INFO "Starting balanced_irq\n");
+	if (kernel_thread(balanced_irq, NULL, CLONE_KERNEL) >= 0) 
+		return 0;
+	else 
+		printk(KERN_ERR "balanced_irq_init: failed to spawn balanced_irq\n");
+failed:
+	for (i = 0; i < NR_CPUS; i++) {
+		if (irq_cpu_data[i].irq_delta)
+			kfree(irq_cpu_data[i].irq_delta);
+		if (irq_cpu_data[i].last_irq)
+			kfree(irq_cpu_data[i].last_irq);
 	}
+	return 0;
 }
-#else /* !SMP */
-static inline void balance_irq(int irq) { }
-#endif
+
+static int __init irqbalance_disable(char *str)
+{
+	irqbalance_disabled = 1;
+	return 0;
+}
+
+__setup("noirqbalance", irqbalance_disable);
+
+static void set_ioapic_affinity (unsigned int irq, unsigned long mask);
+
+static inline void move_irq(int irq)
+{
+	/* note - we hold the desc->lock */
+	if (unlikely(irq_balance_mask[irq])) {
+		set_ioapic_affinity(irq, irq_balance_mask[irq]);
+		irq_balance_mask[irq] = 0;
+	}
+}
+
+__initcall(balanced_irq_init);
+
+#endif /* defined(CONFIG_SMP) */
+
 
 /*
  * support for broken MP BIOSs, enables hand-redirection of PIRQ0-7 to
@@ -1308,7 +1618,7 @@
  */
 static void ack_edge_ioapic_irq(unsigned int irq)
 {
-	balance_irq(irq);
+	move_irq(irq);
 	if ((irq_desc[irq].status & (IRQ_PENDING | IRQ_DISABLED))
 					== (IRQ_PENDING | IRQ_DISABLED))
 		mask_IO_APIC_irq(irq);
@@ -1348,7 +1658,7 @@
 	unsigned long v;
 	int i;
 
-	balance_irq(irq);
+	move_irq(irq);
 /*
  * It appears there is an erratum which affects at least version 0x11
  * of I/O APIC (that's the 82093AA and cores integrated into various
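
For anyone reading the interval logic in do_irq_balance() above: a minimal user-space sketch of the back-off, not the kernel code itself. The four constants are hypothetical tick values picked for illustration (the real MIN/MAX_BALANCED_IRQ_INTERVAL and *_DELTA definitions are not in this part of the patch), and next_interval() is a made-up helper name.

```c
#include <assert.h>

/* Hypothetical tick values, for illustration only. */
#define MIN_INTERVAL   50L	/* stands in for MIN_BALANCED_IRQ_INTERVAL */
#define MAX_INTERVAL  500L	/* stands in for MAX_BALANCED_IRQ_INTERVAL */
#define LESS_DELTA    200L	/* stands in for BALANCED_IRQ_LESS_DELTA */
#define MORE_DELTA     10L	/* stands in for BALANCED_IRQ_MORE_DELTA */

static long max_l(long a, long b) { return a > b ? a : b; }
static long min_l(long a, long b) { return a < b ? a : b; }

/*
 * Return the next polling interval.
 * moved != 0: an IRQ was rebalanced, so check again sooner.
 * moved == 0: nothing worth moving, so back off.
 * The result is always clamped to [MIN_INTERVAL, MAX_INTERVAL].
 */
long next_interval(long interval, int moved)
{
	assert(interval >= MIN_INTERVAL && interval <= MAX_INTERVAL);
	if (moved)
		return max_l(MIN_INTERVAL, interval - LESS_DELTA);
	return min_l(MAX_INTERVAL, interval + MORE_DELTA);
}
```

With these assumed values the interval shrinks quickly after a move and grows slowly when idle, so a bursty load is rechecked often while a quiet system polls rarely.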


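The other change worth spelling out is that balance_irq() no longer touches the IO-APIC directly; it only records a target in irq_balance_mask[], and move_irq() applies it from the ack path. A rough single-threaded model of that hand-off follows. NR_IRQS_SKETCH, pending_mask, hw_affinity, request_move() and apply_pending_move() are all hypothetical names, and the desc->lock that the patch holds on both sides is deliberately left out.

```c
#include <assert.h>

#define NR_IRQS_SKETCH 16	/* hypothetical table size */

static unsigned long pending_mask[NR_IRQS_SKETCH];	/* models irq_balance_mask[] */
static unsigned long hw_affinity[NR_IRQS_SKETCH];	/* models the programmed IO-APIC routing */

/* Balancer side: record the wanted CPU, do not program the hardware. */
void request_move(int irq, int new_cpu)
{
	assert(irq >= 0 && irq < NR_IRQS_SKETCH);
	pending_mask[irq] = 1UL << new_cpu;
}

/*
 * Interrupt side: apply a pending request, if any, then clear it.
 * Returns 1 if the affinity was changed, 0 if nothing was pending.
 */
int apply_pending_move(int irq)
{
	assert(irq >= 0 && irq < NR_IRQS_SKETCH);
	if (!pending_mask[irq])
		return 0;
	hw_affinity[irq] = pending_mask[irq];
	pending_mask[irq] = 0;
	return 1;
}
```

The point of the split, as far as this sketch goes, is that the routing entry is only rewritten while the interrupt in question is being serviced on its current CPU.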

[-- Attachment #2: kirqb_2.5.52.ZIP --]
[-- Type: application/x-zip-compressed, Size: 4484 bytes --]

^ permalink raw reply
