public inbox for linux-ia64@vger.kernel.org
 help / color / mirror / Atom feed
* [Linux-ia64] Replacements for local_irq_xxx()
@ 2001-05-10  8:55 Keith Owens
  2001-05-10 14:26 ` David Mosberger
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Keith Owens @ 2001-05-10  8:55 UTC (permalink / raw)
  To: linux-ia64

Existing local_irq_xxx() routines clear psr.i which masks all
interrupts, including NMI.  Then the only way to get the attention of a
cpu in a disabled spin loop is via INIT which is too drastic.  How
about these alternatives which set cr.tpr.mmi instead, masking all
external interrupts except NMI?

This is not complete: I still have to do the debugging versions, check
the callers of ia64_{get,set}_tpr, and resolve any conflicts.  Before I
do any more work, is it worth continuing with this, or do we live with
psr.i and accept that the only remedy for a cpu hung in a disabled spin
loop is to reboot the entire machine?

/* Nothing changes cr.tpr except local_irq_xxx() and they should always
 * leave tpr in the same state that they found it, so there is no need
 * to serialize access to tpr.  If this is incorrect, uncomment the
 * references to psr and the following srlz.d and add r15 to the clobbers.
 */

# define local_irq_save(x)						\
do {									\
	__asm__ __volatile__ (						\
	";;\n\t"							\
	"# mov r15=psr;\n\t"						\
	"mov r17=(1 << 16);\n\t"					\
	"# ;; rsm psr.i;;\n\t"						\
	"mov %0=cr.tpr;;\n\t"						\
	"or r17=r17,%0;;\n\t"						\
	"mov cr.tpr=r17;;\n\t"						\
	"srlz.d;;\n\t"							\
	"# mov psr.l=r15;;\n\t"						\
	"# srlz.d;;\n\t"						\
	: "=r" (x) : : "r16", "r17");					\
} while (0)

# define local_irq_disable()						\
do {									\
	__asm__ __volatile__ (						\
	";;\n\t"							\
	"# mov r15=psr;\n\t"						\
	"mov r17=(1 << 16);\n\t"					\
	"# ;; rsm psr.i;;\n\t"						\
	"mov r16=cr.tpr;;\n\t"						\
	"or r17=r17,r16;;\n\t"						\
	"mov cr.tpr=r17;;\n\t"						\
	"srlz.d;;\n\t"							\
	"# mov psr.l=r15;;\n\t"						\
	"# srlz.d;;\n\t"						\
	: : : "r16", "r17");						\
} while (0)

# define local_irq_restore(x)						\
do {									\
	__asm__ __volatile__ (						\
	";;\n\t"							\
	"# mov r15=psr;\n\t"						\
	"# rsm psr.i;;\n\t"						\
	"mov cr.tpr=%0;;\n\t"						\
	"srlz.d;;\n\t"							\
	"# mov psr.l=r15;;\n\t"						\
	"# srlz.d;;\n\t"						\
	: : "r" (x) );							\
} while (0)



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Linux-ia64] Replacements for local_irq_xxx()
  2001-05-10  8:55 [Linux-ia64] Replacements for local_irq_xxx() Keith Owens
@ 2001-05-10 14:26 ` David Mosberger
  2001-05-10 15:39 ` David Mosberger
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: David Mosberger @ 2001-05-10 14:26 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Thu, 10 May 2001 18:55:13 +1000, Keith Owens <kaos@ocs.com.au> said:

  Keith> Existing local_irq_xxx() routines clear psr.i which masks all
  Keith> interrupts, including NMI.  Then the only way to get the
  Keith> attention of a cpu in a disabled spin loop is via INIT which
  Keith> is too drastic.  How about these alternatives which set
  Keith> cr.tpr.mmi instead, masking all external interrupts except
  Keith> NMI?

I'd rather not do that.  Accessing tpr is slow and requires explicit
serialization.

	--david



* Re: [Linux-ia64] Replacements for local_irq_xxx()
  2001-05-10  8:55 [Linux-ia64] Replacements for local_irq_xxx() Keith Owens
  2001-05-10 14:26 ` David Mosberger
@ 2001-05-10 15:39 ` David Mosberger
  2001-05-14 12:52 ` Keith Owens
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: David Mosberger @ 2001-05-10 15:39 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Thu, 10 May 2001 18:55:13 +1000, Keith Owens <kaos@ocs.com.au> said:

  Keith> Before I do any more work, is it worth continuing with this
  Keith> or do we live with psr.i and accept that the only remedy for
  Keith> a cpu that is hung in a disabled spin loop is to reboot the
  Keith> entire machine?

Can't you use the INIT based approach that has been suggested earlier?

	--david



* Re: [Linux-ia64] Replacements for local_irq_xxx()
  2001-05-10  8:55 [Linux-ia64] Replacements for local_irq_xxx() Keith Owens
  2001-05-10 14:26 ` David Mosberger
  2001-05-10 15:39 ` David Mosberger
@ 2001-05-14 12:52 ` Keith Owens
  2001-05-14 17:22 ` DE-DINECHIN,CHRISTOPHE (HP-Cupertino,ex1)
  2001-05-15 15:11 ` David Mosberger
  4 siblings, 0 replies; 6+ messages in thread
From: Keith Owens @ 2001-05-14 12:52 UTC (permalink / raw)
  To: linux-ia64

On Thu, 10 May 2001 07:26:38 -0700, 
David Mosberger <davidm@hpl.hp.com> wrote:
>>>>>> On Thu, 10 May 2001 18:55:13 +1000, Keith Owens <kaos@ocs.com.au> said:
>
>  Keith> Existing local_irq_xxx() routines clear psr.i which masks all
>  Keith> interrupts, including NMI.  Then the only way to get the
>  Keith> attention of a cpu in a disabled spin loop is via INIT which
>  Keith> is too drastic.  How about these alternatives which set
>  Keith> cr.tpr.mmi instead, masking all external interrupts except
>  Keith> NMI?
>
>I'd rather not do that.  Accessing tpr is slow and requires explicit
>serialization.

I have measured the slowdown and I believe that it is acceptable,
especially when the benefit is far better debugging and the ability to
use an NMI watchdog.  The module below measures the cost of the
existing method (rsm psr.i) and my replacement method using cr.tpr.mmi.
Using psr.i takes 8 cycles while using cr.tpr.mmi takes 109 cycles to
disable then reenable interrupts.  Typical values on a BigSur dual B3
@ 700MHz, build 99.

psr.i                   8.18
tpr.mmi               109.03
mov from tpr           35.99
mov to tpr              4.06
srlz.d                 32.01

The first two figures are the important ones, the others were for my
curiosity.

That seems like a large difference, but the real cost depends on how
often the local_irq_xxx() routines are called, so I instrumented
local_irq_save() and local_irq_disable().  Starting from a freshly
booted machine, make oldconfig; make dep; make -j4 vmlinux modules on a
dual B3 @ 700MHz shows approx. 14,000,000 calls to those routines.
14,000,000 calls * 100 extra cycles @ 700MHz is approx. 2 extra seconds
over the span of a 15 minute compile.

Using an INIT interrupt goes through PAL and SAL before it gets to the
OS; any side effects of INIT are going to be platform dependent.  It is
bad enough maintaining kdb for multiple architectures, I do not want to
handle multiple platforms as well.  Also INIT is far too expensive to
use as a watchdog.

Note that I only want to replace the code in local_irq_xxx() routines.
Interrupt handlers and switch_to() will still use psr.i, either because
they have to or because those routines very rarely lock up.

David, your choice.

1. Current method.  No debugging of disabled lockups, no NMI watchdog.

2. Use cr.tpr.mmi.  Can debug disabled machines, can use NMI watchdog.
   A kernel compile is 0.001% slower.

3. Use INIT interrupt.  Platform dependent side effects, too expensive
   for a watchdog.

4. Use psr.i for normal kernels, use cr.tpr.mmi for kdb kernels.  This
   has a risk of Heisenbugs, and will not work well for binary only
   kernel modules loaded into a kdb kernel.

5. Any other ideas?


/* Example module to measure extra cost of cr.tpr.mmi */
#include <linux/config.h>
#include <linux/kernel.h>
#include <linux/module.h>

#define TEST_LOOPS 1000

#define timeit(n,code)				\
	__asm__ __volatile__ (			\
	";;"					\
	"mov ar.lc=%2;"				\
	"mov %0=ar.itc;;"			\
	"1:;"					\
	code					\
	"br.cloop.sptk.few 1b;;"		\
	"mov %1=ar.itc;;"			\
	: "=r" (start##n), "=r" (end##n) :	\
	"r" (TEST_LOOPS) :			\
	"r16", "r17" );

#define var(n) long start##n, end##n, cost##n

#define calc_cost(n, expr)			\
	cost##n = (expr) < 0 ? 0 : (expr)

#define print_cost(n, text)			\
	printk("%-20.20s %4ld.%02ld\n", text,	\
	(cost##n)/TEST_LOOPS,			\
	(((cost##n)%TEST_LOOPS)*100)/TEST_LOOPS)

int init_module(void)
{
	var(0);
	var(1);
	var(2);
	var(3);
	var(4);
	var(5);

	timeit(0, "");

	timeit(1,
		"rsm psr.i;;"
		"ssm psr.i;; srlz.d;;");

	/* Must wrap tpr timing code in rsm/ssm psr.i to avoid
	 * race with tpr changes in ia64_handle_irq.  Probably not
	 * needed for the final code, after changing ia64_handle_irq.
	 */
	timeit(2,
		"rsm psr.i;;"
		"mov r17=(1 << 16); mov r16=cr.tpr;;"
		"or r17=r17,r16;; mov cr.tpr=r17;; srlz.d;;"
		"mov cr.tpr=r16;; srlz.d;;"
		"ssm psr.i;; srlz.d;;");

	timeit(3, "mov r16=cr.tpr;;");

	timeit(4, "mov r16=cr.tpr;; mov cr.tpr=r16;;");

	timeit(5, "mov r16=cr.tpr;; mov cr.tpr=r16;; srlz.d;;");

	/* These calculations assume no variation in loop timings.
	 * Of course that is not true, interrupts will disturb the
	 * counters so take the results with a big pinch of salt.
	 * Load and unload the module several times and manually
	 * discard values that are "obviously" wrong.  In particular
	 * any zero costs indicate jitter in the results.
	 */

	calc_cost(0, end0-start0);			/* empty loop */
	calc_cost(1, end1-start1-cost0);		/* extra cost for psr.i */
	calc_cost(2, end2-start2-(end1-start1));	/* extra cost for tpr.mmi */
	calc_cost(3, end3-start3-cost0);		/* extra cost for mov r16=cr.tpr */
	calc_cost(4, end4-start4-(end3-start3));	/* extra cost for mov cr.tpr=r16 */
	calc_cost(5, end5-start5-(end4-start4));	/* extra cost for srlz.d */

	print_cost(1, "psr.i");
	print_cost(2, "tpr.mmi");
	print_cost(3, "mov from tpr");
	print_cost(4, "mov to tpr");
	print_cost(5, "srlz.d");

	return 0;
}




* RE: [Linux-ia64] Replacements for local_irq_xxx()
  2001-05-10  8:55 [Linux-ia64] Replacements for local_irq_xxx() Keith Owens
                   ` (2 preceding siblings ...)
  2001-05-14 12:52 ` Keith Owens
@ 2001-05-14 17:22 ` DE-DINECHIN,CHRISTOPHE (HP-Cupertino,ex1)
  2001-05-15 15:11 ` David Mosberger
  4 siblings, 0 replies; 6+ messages in thread
From: DE-DINECHIN,CHRISTOPHE (HP-Cupertino,ex1) @ 2001-05-14 17:22 UTC (permalink / raw)
  To: linux-ia64

Keith,


Dave and I discussed another method: use a per-CPU global that
represents TPR, and check that global for the external interrupt
vectors.  If an interrupt is received, the interrupt handler syncs the
cached TPR with the actual TPR value.  Re-enabling interrupts still
requires TPR to be accessed directly, however.

It is unclear whether this represents a real improvement in practice.
In benchmarks where the interrupt paths are critical, it will hurt;
your kernel compilation example is obviously not such a case.  You
would want some I/O intensive benchmark to really see the bad effect of
TPR accesses; a kernel compile is probably still too CPU-intensive a
task.


Regards,
Christophe

-----Original Message-----
From: Keith Owens
Sent: Monday, May 14, 2001 5:53 AM
To: linux-ia64@linuxia64.org
Subject: Re: [Linux-ia64] Replacements for local_irq_xxx()

[Keith's message of 2001-05-14 12:52, quoted in full above, trimmed]



* RE: [Linux-ia64] Replacements for local_irq_xxx()
  2001-05-10  8:55 [Linux-ia64] Replacements for local_irq_xxx() Keith Owens
                   ` (3 preceding siblings ...)
  2001-05-14 17:22 ` DE-DINECHIN,CHRISTOPHE (HP-Cupertino,ex1)
@ 2001-05-15 15:11 ` David Mosberger
  4 siblings, 0 replies; 6+ messages in thread
From: David Mosberger @ 2001-05-15 15:11 UTC (permalink / raw)
  To: linux-ia64

  Keith> I have measured the slowdown and I believe that it is
  Keith> acceptable, especially when the benefit is far better
  Keith> debugging and the ability to use an NMI watchdog.  The module
  Keith> below measures the cost of the existing method (rsm psr.i)
  Keith> and my replacement method using cr.tpr.mmi.  Using psr.i
  Keith> takes 8 cycles while using cr.tpr.mmi takes 109 cycles to
  Keith> disable then reenable interrupts.  Typical values on a BigSur
  Keith> dual B3 @ 700MHz, build 99.

Slowing down local_irq_xxx() by more than 13 times is certainly not
acceptable.  Definitely not for kdb.  Let me make this clear: ia64
linux is designed and optimized for the case of running ia64
applications on native hardware.  It's NOT optimized for running IA-32
binaries (though we do want to do that well, too), it's NOT optimized
for running in a virtual machine environment, it's NOT optimized for
running KDB, etc., etc.  I hope you get the idea: writing a lean and
mean kernel is the goal here.

Note that your performance assessment pretty much ignored that a TPR
based interrupt masking scheme would result in significant code bloat.
This could easily make the difference between a tight loop fitting or
exceeding the i-cache size.  Also, I really don't like the argument
that just because kernel compiles slow down by "only" X seconds, it's
somehow OK to make a performance critical operation so much slower.  I
am not sure which macro-level benchmark would show the most impact,
but I do know that if interrupt masking takes >100 cycles, people WILL
come up with software optimizations which will make the kernel more
complicated and will introduce more variability (many papers have been
written on this subject, btw).  I'm glad that IA-64 is one of the few
architectures that can efficiently (few instructions & at CPU speeds)
mask interrupts.  I'm certainly not going to give that up without a
VERY good reason.

  Keith> Using an INIT interrupt goes through PAL and SAL before it
  Keith> gets to the OS, any side effects of INIT are going to be
  Keith> platform dependent.  It is bad enough maintaining kdb for
  Keith> multiple architectures, I do not want to handle multiple
  Keith> platforms as well.

Have you tried this?  PAL and SAL are quite explicit about the events
that happen in response to an INIT.  I don't think there is much room
for platform dependent side effects.  Note that we already have an OS
INIT handler in arch/ia64/kernel/mca.c.

  Keith> Also INIT is far too expensive to use as a watchdog.

  Keith> 3. Use INIT interrupt.  Platform dependent side effects, too
  Keith> expensive for a watchdog.

This sounds like it's at least worth trying.

	--david



end of thread, other threads:[~2001-05-15 15:11 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2001-05-10  8:55 [Linux-ia64] Replacements for local_irq_xxx() Keith Owens
2001-05-10 14:26 ` David Mosberger
2001-05-10 15:39 ` David Mosberger
2001-05-14 12:52 ` Keith Owens
2001-05-14 17:22 ` DE-DINECHIN,CHRISTOPHE (HP-Cupertino,ex1)
2001-05-15 15:11 ` David Mosberger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox