public inbox for linux-ia64@vger.kernel.org
* [Linux-ia64] floating-point error
@ 2003-04-17 19:21 Bruno Cornec
  2003-04-17 19:52 ` Jesse Barnes
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Bruno Cornec @ 2003-04-17 19:21 UTC (permalink / raw)
  To: linux-ia64

Hello,

We're using Oracle 9iRAC on Linux I2 for a proof of concept and are having
one issue.

Sometimes the Oracle process dies with the following error message:

Apr 17 20:24:48 rx1 kernel: oracle(7148): floating-point assist fault at ip 40000000048b4562

The ip address is always the same. This happens on all four of our nodes,
seemingly at random. I do not have other debug info, as this is the only
message printed. Sometimes the message is printed up to 4 times for the
same process.

Kernel used: 2.4.18-e.25 (RedHat update) based on 2.4.18 + a lot of patches
making it difficult to match with David's releases.

The error message is a printk warning in traps.c.

Any idea what could cause it, and how we could debug the context further?
Thanks in advance,

Bruno.
-- 
Linux Solution Consultant         Tel: +33 476 143 278 - Fax: +33 476 146 105
HP/Intel Solution Center http://hpintelco.net Hewlett-Packard Grenoble/France
Info about Linux?     http://www.HyPer-Linux.org      http://www.hp.com/linux
Early music?          http://www.musique-ancienne.org http://www.medieval.org



* Re: [Linux-ia64] floating-point error
  2003-04-17 19:21 [Linux-ia64] floating-point error Bruno Cornec
@ 2003-04-17 19:52 ` Jesse Barnes
  2003-04-17 20:06 ` Bjorn Helgaas
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Jesse Barnes @ 2003-04-17 19:52 UTC (permalink / raw)
  To: linux-ia64

On Thu, Apr 17, 2003 at 09:21:48PM +0200, Bruno Cornec wrote:
> Sometimes the Oracle process dies with the following error message:
> 
> Apr 17 20:24:48 rx1 kernel: oracle(7148): floating-point assist fault at ip 40000000048b4562
> 
> The ip address is always the same. This happens on all of our 4 nodes
> as it seems randomly. I do not have other debug info as this is the only
> message printed. Some times for the same process the message is printed 
> up to 4 times.
> 
> Kernel used: 2.4.18-e.25 (RedHat update) based on 2.4.18 + a lot of patches
> making it difficult to match with David's releases.
> 
> The error message is a printk warning in traps.c.
> 
> Any idea on what can cause it, and how we could debug more the context ?
> Thanks in advance,

It would be helpful to apply the patch that Martin Hicks sent out
a while ago, which prints the isr as well as the IP.  You can then
decode the isr to figure out exactly what's causing the fault in your
app.

Jesse



* Re: [Linux-ia64] floating-point error
  2003-04-17 19:21 [Linux-ia64] floating-point error Bruno Cornec
  2003-04-17 19:52 ` Jesse Barnes
@ 2003-04-17 20:06 ` Bjorn Helgaas
  2003-04-17 20:25 ` David Mosberger
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Bjorn Helgaas @ 2003-04-17 20:06 UTC (permalink / raw)
  To: linux-ia64

On Thursday 17 April 2003 1:52 pm, Jesse Barnes wrote:
> It would be helpful to apply the patch that Martin Hicks sent out
> awhile ago which will print the isr as well as the IP.  You can then
> decode the isr to figure out exactly what's causing the fault in your
> app.

I couldn't find this in the archives, but I got it out of David's
2.5 tree, and applied it for 2.4 as well.  Here it is if you
want to try it:

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
#	           ChangeSet	1.889.308.20 -> 1.889.308.21
#	arch/ia64/kernel/traps.c	1.26    -> 1.26.1.1
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/03/18	mort@wildopensource.com	1.889.308.21
# [PATCH] ia64: print ISR for FPSWA faults
# 
# Here is a simple patch to also print isr during the handling of a
# floating point assist fault.
# --------------------------------------------
#
diff -Nru a/arch/ia64/kernel/traps.c b/arch/ia64/kernel/traps.c
--- a/arch/ia64/kernel/traps.c	Thu Apr 17 13:41:11 2003
+++ b/arch/ia64/kernel/traps.c	Thu Apr 17 13:41:11 2003
@@ -336,8 +336,8 @@
 		fpu_swa_count = 0;
 	if ((++fpu_swa_count < 5) && !(current->thread.flags & IA64_THREAD_FPEMU_NOPRINT)) {
 		last_time = jiffies;
-		printk(KERN_WARNING "%s(%d): floating-point assist fault at ip %016lx\n",
-		       current->comm, current->pid, regs->cr_iip + ia64_psr(regs)->ri);
+		printk(KERN_WARNING "%s(%d): floating-point assist fault at ip %016lx, isr %016lx\n",
+		       current->comm, current->pid, regs->cr_iip + ia64_psr(regs)->ri, isr);
 	}
 
 	exception = fp_emulate(fp_fault, bundle, &regs->cr_ipsr, &regs->ar_fpsr, &isr, &regs->pr,
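
A note on reading the logged address: the printk above prints
`regs->cr_iip + ia64_psr(regs)->ri`, that is, the address of the 16-byte
instruction bundle plus the slot number within it. A minimal sketch of
splitting a logged value back into those two parts follows; the function
names are hypothetical and are not taken from the kernel sources.

```c
#include <stdint.h>

/* Hypothetical helper names, for illustration only. */

/* Address of the 16-byte-aligned instruction bundle containing the logged ip. */
uint64_t bundle_addr(uint64_t logged_ip)
{
    return logged_ip & ~(uint64_t)0xf;
}

/* Slot number (0..2) within that bundle, i.e. the psr.ri value that was added. */
unsigned slot_index(uint64_t logged_ip)
{
    return (unsigned)(logged_ip & 0x3);
}
```

For the address in the original report, 0x40000000048b4562, this gives
bundle 0x40000000048b4560 and slot 2, which is where to look when
disassembling the binary.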




* Re: [Linux-ia64] floating-point error
  2003-04-17 19:21 [Linux-ia64] floating-point error Bruno Cornec
  2003-04-17 19:52 ` Jesse Barnes
  2003-04-17 20:06 ` Bjorn Helgaas
@ 2003-04-17 20:25 ` David Mosberger
  2003-04-17 20:27 ` Jesse Barnes
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: David Mosberger @ 2003-04-17 20:25 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Thu, 17 Apr 2003 21:21:48 +0200, Bruno Cornec <Bruno.Cornec@hp.com> said:

  Bruno> Hello,
  Bruno> We're using Oracle 9iRAC on Linux I2 for a proof of concept and are having
  Bruno> one issue.

  Bruno> Sometimes the Oracle process dies with the following error message:

  Bruno> Apr 17 20:24:48 rx1 kernel: oracle(7148): floating-point assist fault at ip 40000000048b4562

  Bruno> The ip address is always the same. This happens on all of our
  Bruno> 4 nodes as it seems randomly. I do not have other debug info
  Bruno> as this is the only message printed. Some times for the same
  Bruno> process the message is printed up to 4 times.

  Bruno> [snip...]

  Bruno> Any idea on what can cause it, and how we could debug more
  Bruno> the context ?

The message is informational.  There is probably nothing wrong with
the application, at least if the message occurs only rarely (fewer
than about 4 messages every 10 seconds).  If the message occurs all
the time, then there could be a performance problem with the
application.
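
The limit on how often the message appears comes from the rate-limiting
pattern visible in the patch context earlier in the thread
(`++fpu_swa_count < 5` and `last_time = jiffies`). A minimal user-space
sketch of that pattern is shown here. The struct and function names are
hypothetical, and the window length is passed in by the caller because
the thread does not show the value the kernel uses.

```c
/* Hypothetical names; a user-space sketch, not the kernel code. */
struct fp_ratelimit {
    unsigned long last_time;  /* time of the last message actually printed */
    int count;                /* faults seen since the counter was last reset */
};

/* Returns 1 if a message should be printed for a fault at time `now`.
 * At most 4 messages are allowed per burst; the counter resets once
 * more than `window` time units have passed since the last printed one. */
int fp_should_print(struct fp_ratelimit *rl, unsigned long now,
                    unsigned long window)
{
    if (now - rl->last_time > window)
        rl->count = 0;
    if (++rl->count < 5) {
        rl->last_time = now;
        return 1;
    }
    return 0;
}
```

This matches the observation in the original report that the message
shows up at most 4 times in a row for a process: the fifth and later
faults in a burst are still emulated, just not logged.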

You should be able to turn off the message with "prctl --fpemul=silent".

	--david
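
For a process that wants to make the same request for itself rather than
through the prctl command, the underlying system call is prctl(2) with
PR_SET_FPEMU and PR_FPEMU_NOPRINT, both defined in <linux/prctl.h> and
supported only on ia64. A minimal sketch; the wrapper name is hypothetical,
and the assumption that the command-line tool ends up making this call is
mine rather than something stated in the thread.

```c
#include <sys/prctl.h>

/* Hypothetical wrapper name, for illustration only.
 * Asks the kernel not to log floating-point assist faults for the
 * calling process. Returns 0 on success and -1 on failure, for example
 * on a non-ia64 kernel where PR_SET_FPEMU is not supported. */
int silence_fp_assist_messages(void)
{
    return prctl(PR_SET_FPEMU, PR_FPEMU_NOPRINT, 0, 0, 0);
}
```

This only hides the log line; the faults are still taken and emulated,
so it does not change any performance cost.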



* Re: [Linux-ia64] floating-point error
  2003-04-17 19:21 [Linux-ia64] floating-point error Bruno Cornec
                   ` (2 preceding siblings ...)
  2003-04-17 20:25 ` David Mosberger
@ 2003-04-17 20:27 ` Jesse Barnes
  2003-04-17 20:40 ` Luck, Tony
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Jesse Barnes @ 2003-04-17 20:27 UTC (permalink / raw)
  To: linux-ia64

Great, thanks Bjorn.

Jesse




* RE: [Linux-ia64] floating-point error
  2003-04-17 19:21 [Linux-ia64] floating-point error Bruno Cornec
                   ` (3 preceding siblings ...)
  2003-04-17 20:27 ` Jesse Barnes
@ 2003-04-17 20:40 ` Luck, Tony
  2003-04-17 20:40 ` Bruno Cornec
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Luck, Tony @ 2003-04-17 20:40 UTC (permalink / raw)
  To: linux-ia64

> Sometimes the Oracle process dies with the following error message:
> 
> Apr 17 20:24:48 rx1 kernel: oracle(7148): floating-point assist fault at ip 40000000048b4562
> 
> The ip address is always the same. This happens on all of our 4 nodes
> as it seems randomly. I do not have other debug info as this is the only
> message printed. Some times for the same process the message is printed 
> up to 4 times.

Are you certain that the message is related to the death of the process?

This message is a warning to let you know that your application has run into
one of the corner cases of IEEE floating point that is not implemented in
hardware by the processor (typically operations involving denormalized numbers
will cause this, but there may be other cases). There is rate limiting code in
the kernel to prevent this message from flooding the logs (and from becoming
even more of a performance drag than taking a trap and emulating in s/w).

It is relatively normal to see this message (even multiple times from the
same process), and it usually isn't fatal.

-Tony Luck
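
To make the denormal case concrete, here is a minimal sketch that produces
a denormalized double and checks for one using only <float.h>. The function
names are hypothetical. Whether this exact operation raises an assist fault
on a given Itanium system is an assumption; the thread only says that
operations on denormalized numbers typically do.

```c
#include <float.h>

/* Hypothetical names, for illustration only. */

/* Halve the smallest normal double; the result is a denormal. */
double make_denormal(void)
{
    volatile double x = DBL_MIN;  /* volatile keeps the division at run time */
    return x / 2.0;
}

/* True if v is nonzero but smaller in magnitude than the smallest normal. */
int is_denormal(double v)
{
    double mag = (v < 0.0) ? -v : v;
    return v != 0.0 && mag < DBL_MIN;
}
```

Values like this usually arise from underflow in long-running
computations, which would fit a fault that appears only occasionally and
always at the same instruction.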



* Re: [Linux-ia64] floating-point error
  2003-04-17 19:21 [Linux-ia64] floating-point error Bruno Cornec
                   ` (4 preceding siblings ...)
  2003-04-17 20:40 ` Luck, Tony
@ 2003-04-17 20:40 ` Bruno Cornec
  2003-04-17 21:12 ` Chen, Kenneth W
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Bruno Cornec @ 2003-04-17 20:40 UTC (permalink / raw)
  To: linux-ia64

Hello,

Thanks a lot David for your quick reply (as well as others).

David Mosberger (davidm@napali.hpl.hp.com) said:
> 
>   Bruno> Apr 17 20:24:48 rx1 kernel: oracle(7148): floating-point assist fault at ip 40000000048b4562
> 
> The message is informational.  There is probably nothing wrong with
> the application, at least if the message occurs only rarely (at an
> interval less than 4 messages every 10 seconds or so).  

Which is the case.
But on the other hand we're losing Oracle connections. So I suppose
that the message means something in that case.

> If the message
> occurs all the time, then there could be a performance problem with
> the application.
> 
> You should be able to turn of the message with "prctl --fpemul=silent".

Ok, will try that tomorrow. 

Best regards,
Bruno.

PS: We may also try to recompile our own kernel based on your latest patch
+ the one given in the other messages of the thread to see if it changes
anything.
-- 
Linux Solution Consultant         Tel: +33 476 143 278 - Fax: +33 476 146 105
HP/Intel Solution Center http://hpintelco.net Hewlett-Packard Grenoble/France
Info about Linux?     http://www.HyPer-Linux.org      http://www.hp.com/linux
Early music?          http://www.musique-ancienne.org http://www.medieval.org



* RE: [Linux-ia64] floating-point error
  2003-04-17 19:21 [Linux-ia64] floating-point error Bruno Cornec
                   ` (5 preceding siblings ...)
  2003-04-17 20:40 ` Bruno Cornec
@ 2003-04-17 21:12 ` Chen, Kenneth W
  2003-04-17 23:57 ` martin sepulveda
  2003-04-17 23:59 ` martin sepulveda
  8 siblings, 0 replies; 10+ messages in thread
From: Chen, Kenneth W @ 2003-04-17 21:12 UTC (permalink / raw)
  To: linux-ia64

Oracle has its own exception handler.  It usually dumps out more
information in its alert file.  Look in $ORACLE_HOME/rdbms/log/*.log for
more clues.

- Ken

-----Original Message-----
From: Luck, Tony 
Sent: Thursday, April 17, 2003 1:40 PM
To: Bruno Cornec; linux-ia64@linuxia64.org
Cc: martin@ojf.com; klaus.grupe@hp.com; emmanuel.avrillon@hp.com
Subject: RE: [Linux-ia64] floating-point error

> Sometimes the Oracle process dies with the following error message:
> 
> Apr 17 20:24:48 rx1 kernel: oracle(7148): floating-point assist fault at ip 40000000048b4562
> 
> The ip address is always the same. This happens on all of our 4 nodes
> as it seems randomly. I do not have other debug info as this is the only
> message printed. Some times for the same process the message is printed 
> up to 4 times.

Are you certain that the message is related to the death of the process?

This message is a warning to let you know that your application has run into
one of the corner cases of IEEE floating point that is not implemented in
hardware by the processor (typically operations involving denormalized numbers
will cause this, but there may be other cases). There is rate limiting code in
the kernel to prevent this message from flooding the logs (and from becoming
even more of a performance drag than taking a trap and emulating in s/w).

It is relatively normal to see this message (even multiple times from the
same process), and it usually isn't fatal.

-Tony Luck




* Re: [Linux-ia64] floating-point error
  2003-04-17 19:21 [Linux-ia64] floating-point error Bruno Cornec
                   ` (6 preceding siblings ...)
  2003-04-17 21:12 ` Chen, Kenneth W
@ 2003-04-17 23:57 ` martin sepulveda
  2003-04-17 23:59 ` martin sepulveda
  8 siblings, 0 replies; 10+ messages in thread
From: martin sepulveda @ 2003-04-17 23:57 UTC (permalink / raw)
  To: linux-ia64

We wrote a small shell script to watch the kernel messages and run 'ps ax' on
the pid, and saw that the oracle processes were no longer running after the
message was printk'ed (at least not for long).
The floating-point assist fault was only triggered by oracle processes, but
oracle was the only application heavily used on these machines during the test.
By the way, it happens on all four nodes we're running. The firmware is up to
date and includes the FPSWA, but the fault might be affected by system load:
in some cases it affects about 10% of the oracle processes, while in other
tests it affects fewer than 1%.

(I'm not on the list)

m.

On Thu, 17 Apr 2003 13:40:28 -0700
"Luck, Tony" <tony.luck@intel.com> wrote:

> > Sometimes the Oracle process dies with the following error message:
> > 
> > Apr 17 20:24:48 rx1 kernel: oracle(7148): floating-point assist fault at ip 40000000048b4562
> > 
> > The ip address is always the same. This happens on all of our 4 nodes
> > as it seems randomly. I do not have other debug info as this is the only
> > message printed. Some times for the same process the message is printed 
> > up to 4 times.
> 
> Are you certain that the message is related to the death of the process?
> 
> This message is a warning to let you know that your application has run into
> one of the corner cases of IEEE floating point that is not implemented in
> hardware by the processor (typically operations involving denormalized numbers
> will cause this, but there may be other cases). There is rate limiting code in
> the kernel to prevent this message from flooding the logs (and from becoming
> even more of a performance drag than taking a trap and emulating in s/w).
> 
> It is relatively normal to see this message (even multiple times from the
> same process), and it usually isn't fatal.
> 
> -Tony Luck
> 



* Re: [Linux-ia64] floating-point error
  2003-04-17 19:21 [Linux-ia64] floating-point error Bruno Cornec
                   ` (7 preceding siblings ...)
  2003-04-17 23:57 ` martin sepulveda
@ 2003-04-17 23:59 ` martin sepulveda
  8 siblings, 0 replies; 10+ messages in thread
From: martin sepulveda @ 2003-04-17 23:59 UTC (permalink / raw)
  To: linux-ia64

It does not write anything about this to the alert log, nor to a trace file.


m.

On Thu, 17 Apr 2003 14:12:40 -0700
"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:

> Oracle has its own exception handler.  It usually dumps out more
> information in its alert file.  Look in $ORACLE_HOME/rdbms/log/*.log for
> more clue.
> 
> - Ken
> 


