[0/2] netxen: bug fix and diagnostics for possible (hardware?) bug

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug
@ 2013-12-17  5:22 David Gibson
  2013-12-17  5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: David Gibson @ 2013-12-17  5:22 UTC (permalink / raw)
  To: manish.chopra, sony.chacko, rajesh.borundia
  Cc: netdev, snagarka, tcamuso, vdasgupt

At Red Hat, we've hit a couple of customer cases with crashes in the
netxen driver due to list corruption.  This seems to be very rarely
triggered, and unfortunately the dumps we have don't have enough
information to be certain of the cause, although we have a possible
theory.

I'm suggesting, therefore a patch to add some sanity checking which
should help to at least localize and mitigate the problem when someone
hits it in future.  Please let me know if there's a better approach to
doing this.

That's 2/2.  1/2 is a fix for a clear bug I spotted along the way, but
not one that could cause the symptoms we've seen.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] netxen: Correct off-by-one error in bounds check
  2013-12-17  5:22 [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug David Gibson
@ 2013-12-17  5:22 ` David Gibson
  2013-12-17  6:37   ` Jitendra Kalsaria
  2013-12-19 11:51   ` Manish Chopra
  2013-12-17  5:22 ` [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware David Gibson
  2013-12-17 21:50 ` [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug Manish Chopra
  2 siblings, 2 replies; 12+ messages in thread
From: David Gibson @ 2013-12-17  5:22 UTC (permalink / raw)
  To: manish.chopra, sony.chacko, rajesh.borundia
  Cc: netdev, snagarka, tcamuso, vdasgupt, David Gibson

After retrieving an Rx buffer index from the hardware, the netxen driver
bounds checks it against its array of receive buffers.  However in the case
of LRO packets, there's an off-by-one error in the check - it uses >
instead of the >= it uses for the normal receive path in
netxen_process_rcv().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
index 7692dfd..68658b8 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
@@ -1610,7 +1610,7 @@ netxen_process_lro(struct netxen_adapter *adapter,
 	rds_ring = &recv_ctx->rds_rings[ring];
 
 	index = netxen_get_lro_sts_refhandle(sts_data0);
-	if (unlikely(index > rds_ring->num_desc))
+	if (unlikely(index >= rds_ring->num_desc))
 		return NULL;
 
 	buffer = &rds_ring->rx_buf_arr[index];
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware
  2013-12-17  5:22 [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug David Gibson
  2013-12-17  5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson
@ 2013-12-17  5:22 ` David Gibson
  2013-12-19 20:05   ` David Miller
  2013-12-17 21:50 ` [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug Manish Chopra
  2 siblings, 1 reply; 12+ messages in thread
From: David Gibson @ 2013-12-17  5:22 UTC (permalink / raw)
  To: manish.chopra, sony.chacko, rajesh.borundia
  Cc: netdev, snagarka, tcamuso, vdasgupt, David Gibson

At Red Hat we've seen a couple of customer cases where the kernel has
crashed due to corruption of the Rx buffer free lists in the netxen driver.

>From the information we've had so far, we haven't been able to locate the
bug.  Our working theory is that due to either a driver, hardware or
firmware bug, we very occasionally process Rx descriptors which have the
wrong buffer ref_handle in them.  This is consistent with all the details
we've seen of the two crashes we've seen.

More specifically here's what we suspect happened:

Case 1:
  netxen_process_rcv_ring()
    netxen_process_rcv() on buffer X (correctly)
      list_add_tail(buffer X, sds_ring free list)

    netxen_process_rcv() on buffer Y (correctly)
      list_add_tail(buffer Y, sds_ring free list)

    netxen_process_rcv() on buffer X            <- wrong index
      list_add_tail(buffer X, sds_ring free list) <- list corrupted

    netxen_merge_rx_buffers()
      list_splice() sds_ring free list to rds_ring free_list

    netxen_post_rx_buffers_nodb()
      BUG on list_del() due to bad prev pointer in buffer X

Case 2:
  CPU0                      CPU1
  netxen_process_rcv_list() on sds_ring A
    netxen_process_rcv() on buffer X (correctly)
    list_add_tail(buffer X, sds_ring A free_list)

                            netxen_process_rcv_list() on sds_ring B
                              netxen_process_rcv() buffer X   <- wrong
                              list_add_tail(buffer X, sds_ring B list)
                            [...]
                              netxen_post_rx_buffers_nodb()
                                list_del(buffer X)   <- buf X poisoned

    for_each on sds_ring A free list <- crash due to dereferencing
                                        poisoned pointers in buffer X

Because the bug is triggered very rarely, and we have no known
reproducer, it's difficult to pin down exactly what's going wrong.
This patch tries to improve the situation by adding some (fairly cheap)
sanity checks when we process an Rx buffer to make sure it's not
already on someone's free list.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 .../net/ethernet/qlogic/netxen/netxen_nic_init.c   | 37 ++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
index 68658b8..edbdf4b 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
@@ -1531,6 +1531,39 @@ no_skb:
 	return skb;
 }
 
+static int netxen_check_rx_buf(int index, struct netxen_rx_buffer *rxbuf)
+{
+	int ret = 0;
+
+	if ((rxbuf->list.next != LIST_POISON1)
+	    || (rxbuf->list.prev != LIST_POISON2)) {
+		printk(KERN_ERR "netxen: Rx buffer is already on free list "
+		       "(next=%p, prev=%p)\n", rxbuf->list.next,
+		       rxbuf->list.prev);
+		ret = -1;
+	}
+
+	if (rxbuf->ref_handle != index) {
+		printk(KERN_ERR "netxen: Rx buffer has bad ref_handle "
+		       "(%d instead of %d)\n", rxbuf->ref_handle, index);
+		ret = -1;
+	}
+
+	if (rxbuf->state != NETXEN_BUFFER_BUSY) {
+		printk(KERN_ERR "netxen: Rx buffer has bad state %d\n",
+		       rxbuf->state);
+		ret = -1;
+	}
+
+	if (!rxbuf->skb) {
+		printk(KERN_ERR "netxen: Rx buffer has no skb\n",
+		       rxbuf->skb);
+		ret = -1;
+	}
+
+	return ret;
+}
+
 static struct netxen_rx_buffer *
 netxen_process_rcv(struct netxen_adapter *adapter,
 		struct nx_host_sds_ring *sds_ring,
@@ -1553,6 +1586,8 @@ netxen_process_rcv(struct netxen_adapter *adapter,
 		return NULL;
 
 	buffer = &rds_ring->rx_buf_arr[index];
+	if (WARN_ON(netxen_check_rx_buf(index, buffer) != 0))
+		return NULL;
 
 	length = netxen_get_sts_totallength(sts_data0);
 	cksum  = netxen_get_sts_status(sts_data0);
@@ -1614,6 +1649,8 @@ netxen_process_lro(struct netxen_adapter *adapter,
 		return NULL;
 
 	buffer = &rds_ring->rx_buf_arr[index];
+	if (WARN_ON(netxen_check_rx_buf(index, buffer) != 0))
+		return NULL;
 
 	timestamp = netxen_get_lro_sts_timestamp(sts_data0);
 	lro_length = netxen_get_lro_sts_length(sts_data0);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] netxen: Correct off-by-one error in bounds check
  2013-12-17  5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson
@ 2013-12-17  6:37   ` Jitendra Kalsaria
  2013-12-19 11:51   ` Manish Chopra
  1 sibling, 0 replies; 12+ messages in thread
From: Jitendra Kalsaria @ 2013-12-17  6:37 UTC (permalink / raw)
  To: David Gibson, Manish Chopra, Sony Chacko, Rajesh Borundia
  Cc: netdev, snagarka@redhat.com, tcamuso@redhat.com,
	vdasgupt@redhat.com

David,

You may also want to include fix for receive rings.

        if (unlikely(ring > adapter->max_rds_rings))
		return NULL;


Jiten

On 12/16/13 9:22 PM, "David Gibson" <david@gibson.dropbear.id.au> wrote:

>After retrieving an Rx buffer index from the hardware, the netxen driver
>bounds checks it against its array of receive buffers.  However in the
>case
>of LRO packets, there's an off-by-one error in the check - it uses >
>instead of the >= it uses for the normal receive path in
>netxen_process_rcv().
>
>Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>---
> drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
>b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
>index 7692dfd..68658b8 100644
>--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
>+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c
>@@ -1610,7 +1610,7 @@ netxen_process_lro(struct netxen_adapter *adapter,
> 	rds_ring = &recv_ctx->rds_rings[ring];
> 
> 	index = netxen_get_lro_sts_refhandle(sts_data0);
>-	if (unlikely(index > rds_ring->num_desc))
>+	if (unlikely(index >= rds_ring->num_desc))
> 		return NULL;
> 
> 	buffer = &rds_ring->rx_buf_arr[index];
>-- 
>1.8.3.1
>
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug
  2013-12-17  5:22 [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug David Gibson
  2013-12-17  5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson
  2013-12-17  5:22 ` [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware David Gibson
@ 2013-12-17 21:50 ` Manish Chopra
  2013-12-18  6:22   ` David Gibson
  2 siblings, 1 reply; 12+ messages in thread
From: Manish Chopra @ 2013-12-17 21:50 UTC (permalink / raw)
  To: David Gibson, Sony Chacko, Rajesh Borundia
  Cc: netdev, snagarka@redhat.com, tcamuso@redhat.com,
	vdasgupt@redhat.com

>-----Original Message-----
>From: David Gibson [mailto:david@gibson.dropbear.id.au]
>Sent: Tuesday, December 17, 2013 10:53 AM
>To: Manish Chopra; Sony Chacko; Rajesh Borundia
>Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com;
>vdasgupt@redhat.com
>Subject: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug
>
>At Red Hat, we've hit a couple of customer cases with crashes in the netxen driver
>due to list corruption.  This seems to be very rarely triggered, and unfortunately
>the dumps we have don't have enough information to be certain of the cause,
>although we have a possible theory.
>
>I'm suggesting, therefore a patch to add some sanity checking which should help
>to at least localize and mitigate the problem when someone hits it in future.
>Please let me know if there's a better approach to doing this.
>
>That's 2/2.  1/2 is a fix for a clear bug I spotted along the way, but not one that
>could cause the symptoms we've seen.

David,

Having these checks in data path(Rx path) may have some performance impact. It's better to root cause it instead of putting some sanity checks.
We will get back to you on this.


Thanks
Manish

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug
  2013-12-17 21:50 ` [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug Manish Chopra
@ 2013-12-18  6:22   ` David Gibson
  2013-12-19  9:11     ` Manish Chopra
  0 siblings, 1 reply; 12+ messages in thread
From: David Gibson @ 2013-12-18  6:22 UTC (permalink / raw)
  To: Manish Chopra
  Cc: Sony Chacko, Rajesh Borundia, netdev, snagarka@redhat.com,
	tcamuso@redhat.com, vdasgupt@redhat.com

[-- Attachment #1: Type: text/plain, Size: 1776 bytes --]

On Tue, Dec 17, 2013 at 09:50:52PM +0000, Manish Chopra wrote:
> >-----Original Message-----
> >From: David Gibson [mailto:david@gibson.dropbear.id.au]
> >Sent: Tuesday, December 17, 2013 10:53 AM
> >To: Manish Chopra; Sony Chacko; Rajesh Borundia
> >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com;
> >vdasgupt@redhat.com
> >Subject: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug
> >
> >At Red Hat, we've hit a couple of customer cases with crashes in the netxen driver
> >due to list corruption.  This seems to be very rarely triggered, and unfortunately
> >the dumps we have don't have enough information to be certain of the cause,
> >although we have a possible theory.
> >
> >I'm suggesting, therefore a patch to add some sanity checking which should help
> >to at least localize and mitigate the problem when someone hits it in future.
> >Please let me know if there's a better approach to doing this.
> >
> >That's 2/2.  1/2 is a fix for a clear bug I spotted along the way, but not one that
> >could cause the symptoms we've seen.
> 
> David,
> 
> Having these checks in data path(Rx path) may have some performance
> impact. It's better to root cause it instead of putting some sanity
> checks.

Obviously, but this was the best way I could think of to try narrowing
down the root cause (at least trying to eliminate driver vs. firmware
bug).

> We will get back to you on this.

If you have a better idea for locating the root cause, please let me
know.  I have access to a vmcore which I can poke around in.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug
  2013-12-18  6:22   ` David Gibson
@ 2013-12-19  9:11     ` Manish Chopra
  2014-01-24  6:44       ` David Gibson
  0 siblings, 1 reply; 12+ messages in thread
From: Manish Chopra @ 2013-12-19  9:11 UTC (permalink / raw)
  To: David Gibson
  Cc: Sony Chacko, Rajesh Borundia, netdev, snagarka@redhat.com,
	tcamuso@redhat.com, vdasgupt@redhat.com

>> >From: David Gibson [mailto:david@gibson.dropbear.id.au]
>> >Sent: Tuesday, December 17, 2013 10:53 AM
>> >To: Manish Chopra; Sony Chacko; Rajesh Borundia
>> >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com;
>> >vdasgupt@redhat.com
>> >Subject: [0/2] netxen: bug fix and diagnostics for possible
>> >(hardware?) bug
>> >
>> >At Red Hat, we've hit a couple of customer cases with crashes in the
>> >netxen driver due to list corruption.  This seems to be very rarely
>> >triggered, and unfortunately the dumps we have don't have enough
>> >information to be certain of the cause, although we have a possible theory.
>> >
>> >I'm suggesting, therefore a patch to add some sanity checking which
>> >should help to at least localize and mitigate the problem when someone hits it
>in future.
>> >Please let me know if there's a better approach to doing this.
>> >
>> >That's 2/2.  1/2 is a fix for a clear bug I spotted along the way,
>> >but not one that could cause the symptoms we've seen.
>>
>> David,
>>
>> Having these checks in data path(Rx path) may have some performance
>> impact. It's better to root cause it instead of putting some sanity
>> checks.
>
>Obviously, but this was the best way I could think of to try narrowing down the
>root cause (at least trying to eliminate driver vs. firmware bug).

David, Instead of making permanent changes in driver, can you please run your modified driver in selective customer environment where this issues is seen?
Which may give some data point that what's the issue exactly and then we go by that.

>
>> We will get back to you on this.
>
>If you have a better idea for locating the root cause, please let me know.  I have
>access to a vmcore which I can poke around in.

We will also try to reproduce the problem in our environment and debug this.
Can you please give some details?

1) what's the driver and firmware version used?
2) which operating system and kernel version?
3) please send the vmcore also with backtrace if available which can give some idea what can trigger this issue.
4) Test case details:- what type of test is running on the system? Just to make sure we also try the same test cases in our environment.
5) Server details (Number of CPus, memory etc.) if available.

Thanks,
Manish
  

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH 1/2] netxen: Correct off-by-one error in bounds check
  2013-12-17  5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson
  2013-12-17  6:37   ` Jitendra Kalsaria
@ 2013-12-19 11:51   ` Manish Chopra
  2013-12-20  4:11     ` David Gibson
  1 sibling, 1 reply; 12+ messages in thread
From: Manish Chopra @ 2013-12-19 11:51 UTC (permalink / raw)
  To: David Gibson, Sony Chacko, Rajesh Borundia
  Cc: netdev, snagarka@redhat.com, tcamuso@redhat.com,
	vdasgupt@redhat.com

>-----Original Message-----
>From: David Gibson [mailto:david@gibson.dropbear.id.au]
>Sent: Tuesday, December 17, 2013 10:53 AM
>To: Manish Chopra; Sony Chacko; Rajesh Borundia
>Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com;
>vdasgupt@redhat.com; David Gibson
>Subject: [PATCH 1/2] netxen: Correct off-by-one error in bounds check
>
>After retrieving an Rx buffer index from the hardware, the netxen driver bounds
>checks it against its array of receive buffers.  However in the case of LRO packets,
>there's an off-by-one error in the check - it uses > instead of the >= it uses for the
>normal receive path in netxen_process_rcv().
>
>Signed-off-by: David Gibson <david@gibson.dropbear.id.au>

Acked-by: Manish Chopra <manish.chopra@qlogic.com>

Thanks David. This is really a bug fix. You can re-submit this single patch.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware
  2013-12-17  5:22 ` [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware David Gibson
@ 2013-12-19 20:05   ` David Miller
  2014-01-24  5:21     ` David Gibson
  0 siblings, 1 reply; 12+ messages in thread
From: David Miller @ 2013-12-19 20:05 UTC (permalink / raw)
  To: david
  Cc: manish.chopra, sony.chacko, rajesh.borundia, netdev, snagarka,
	tcamuso, vdasgupt

From: David Gibson <david@gibson.dropbear.id.au>
Date: Tue, 17 Dec 2013 16:22:33 +1100

> +static int netxen_check_rx_buf(int index, struct netxen_rx_buffer *rxbuf)
> +{
> +	int ret = 0;
> +
> +	if ((rxbuf->list.next != LIST_POISON1)
> +	    || (rxbuf->list.prev != LIST_POISON2)) {
> +		printk(KERN_ERR "netxen: Rx buffer is already on free list "
> +		       "(next=%p, prev=%p)\n", rxbuf->list.next,
> +		       rxbuf->list.prev);
> +		ret = -1;
> +	}

First, you should be using netdev_err() or similar.

Second, doing this unconditionally for every receive packet is
not really reasonable.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] netxen: Correct off-by-one error in bounds check
  2013-12-19 11:51   ` Manish Chopra
@ 2013-12-20  4:11     ` David Gibson
  0 siblings, 0 replies; 12+ messages in thread
From: David Gibson @ 2013-12-20  4:11 UTC (permalink / raw)
  To: Manish Chopra
  Cc: Sony Chacko, Rajesh Borundia, netdev, snagarka@redhat.com,
	tcamuso@redhat.com, vdasgupt@redhat.com

[-- Attachment #1: Type: text/plain, Size: 1202 bytes --]

On Thu, Dec 19, 2013 at 11:51:37AM +0000, Manish Chopra wrote:
> >-----Original Message-----
> >From: David Gibson [mailto:david@gibson.dropbear.id.au]
> >Sent: Tuesday, December 17, 2013 10:53 AM
> >To: Manish Chopra; Sony Chacko; Rajesh Borundia
> >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com;
> >vdasgupt@redhat.com; David Gibson
> >Subject: [PATCH 1/2] netxen: Correct off-by-one error in bounds check
> >
> >After retrieving an Rx buffer index from the hardware, the netxen driver bounds
> >checks it against its array of receive buffers.  However in the case of LRO packets,
> >there's an off-by-one error in the check - it uses > instead of the >= it uses for the
> >normal receive path in netxen_process_rcv().
> >
> >Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> 
> Acked-by: Manish Chopra <manish.chopra@qlogic.com>
> 
> Thanks David. This is really a bug fix. You can re-submit this single patch.

Ok, I added the extra fix Jiten suggested and resent.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware
  2013-12-19 20:05   ` David Miller
@ 2014-01-24  5:21     ` David Gibson
  0 siblings, 0 replies; 12+ messages in thread
From: David Gibson @ 2014-01-24  5:21 UTC (permalink / raw)
  To: David Miller
  Cc: manish.chopra, sony.chacko, rajesh.borundia, netdev, snagarka,
	tcamuso, vdasgupt

[-- Attachment #1: Type: text/plain, Size: 1315 bytes --]

On Thu, Dec 19, 2013 at 03:05:27PM -0500, David Miller wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
> Date: Tue, 17 Dec 2013 16:22:33 +1100
> 
> > +static int netxen_check_rx_buf(int index, struct netxen_rx_buffer *rxbuf)
> > +{
> > +	int ret = 0;
> > +
> > +	if ((rxbuf->list.next != LIST_POISON1)
> > +	    || (rxbuf->list.prev != LIST_POISON2)) {
> > +		printk(KERN_ERR "netxen: Rx buffer is already on free list "
> > +		       "(next=%p, prev=%p)\n", rxbuf->list.next,
> > +		       rxbuf->list.prev);
> > +		ret = -1;
> > +	}
> 
> First, you should be using netdev_err() or similar.

Well, I was just matching the form used in the rest of the driver.

> Second, doing this unconditionally for every receive packet is
> not really reasonable.

Yeah, trouble is I don't know what the alternative is.  Whatever is
causing this problem, it's triggered very rarely - I'm only seeing a
pattern by looking across everything reported by Red Hat customers.
Just adding diagnostics for one customer is unlikely to be useful,
because the chances are they'll never hit the bug again.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug
  2013-12-19  9:11     ` Manish Chopra
@ 2014-01-24  6:44       ` David Gibson
  0 siblings, 0 replies; 12+ messages in thread
From: David Gibson @ 2014-01-24  6:44 UTC (permalink / raw)
  To: Manish Chopra
  Cc: Sony Chacko, Rajesh Borundia, netdev, snagarka@redhat.com,
	tcamuso@redhat.com, vdasgupt@redhat.com

[-- Attachment #1: Type: text/plain, Size: 6205 bytes --]

On Thu, Dec 19, 2013 at 09:11:33AM +0000, Manish Chopra wrote:
> >> >From: David Gibson [mailto:david@gibson.dropbear.id.au]
> >> >Sent: Tuesday, December 17, 2013 10:53 AM
> >> >To: Manish Chopra; Sony Chacko; Rajesh Borundia
> >> >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com;
> >> >vdasgupt@redhat.com
> >> >Subject: [0/2] netxen: bug fix and diagnostics for possible
> >> >(hardware?) bug
> >> >
> >> >At Red Hat, we've hit a couple of customer cases with crashes in the
> >> >netxen driver due to list corruption.  This seems to be very rarely
> >> >triggered, and unfortunately the dumps we have don't have enough
> >> >information to be certain of the cause, although we have a possible theory.
> >> >
> >> >I'm suggesting, therefore a patch to add some sanity checking which
> >> >should help to at least localize and mitigate the problem when someone hits it
> >in future.
> >> >Please let me know if there's a better approach to doing this.
> >> >
> >> >That's 2/2.  1/2 is a fix for a clear bug I spotted along the way,
> >> >but not one that could cause the symptoms we've seen.
> >>
> >> David,
> >>
> >> Having these checks in data path(Rx path) may have some performance
> >> impact. It's better to root cause it instead of putting some sanity
> >> checks.
> >
> >Obviously, but this was the best way I could think of to try narrowing down the
> >root cause (at least trying to eliminate driver vs. firmware bug).
> 
> David, Instead of making permanent changes in driver, can you please
> run your modified driver in selective customer environment where
> this issues is seen?

Yeah, the problem with that is that the problem has never triggered
twice for a single customer.  Well, technically there is one customer
that's hit it twice, but I'm pretty sure it's on entirely unrelated
systems in different sections of a large customer.  The only reason I
can see enough cases to suspect a pattern to these problems is from
looking across Red Hat's whole case history.

> Which may give some data point that what's the issue exactly and then we go by that.
> 
> >
> >> We will get back to you on this.
> >
> >If you have a better idea for locating the root cause, please let me know.  I have
> >access to a vmcore which I can poke around in.
> 
> We will also try to reproduce the problem in our environment and debug this.
> Can you please give some details?

Apologies for the long delay, I'd been hoping for some more
confirmation of things, but it hasn't happened.  I'll give you what I
can.

> 1) what's the driver and firmware version used?

I'm not sure what the most useful way ot giving a driver version.
I've given kernel version below, but it's an RH kernel, so I'm not
sure how much has been backported.

As to firmware, the driver reports:

netxen_nic 0000:04:00.0: Gen2 strapping detected
netxen_nic 0000:04:00.0: using 64-bit dma mask
netxen_nic: NX3031 Gigabit Ethernet Board S/N
<FF><FF><FF><FF><FF><FF><FF><FF><FF>
<FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF>
<FF><FF><FF>NX3031 Gigabit Ethernet  Chip rev 0x42
netxen_nic 0000:04:00.0: firmware v4.0.585 [legacy]

> 2) which operating system and kernel version?

RHEL5, 

Linux hostname 2.6.18-308.el5 #1 SMP Fri Jan 27 17:17:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

> 3) please send the vmcore also with backtrace if available which can
> give some idea what can trigger this issue.

I can't send the vmcore itself, since it will include customer data.
I can give you the backtrace below, and look up specific things if you
can give me an idea of what you need:

crash> bt
PID: 0      TASK: ffff81207f8bd7e0  CPU: 44  COMMAND: "swapper"
 #0 [ffff81107fd33b40] crash_kexec at ffffffff800b0938
 #1 [ffff81107fd33c00] __die at ffffffff80065137
 #2 [ffff81107fd33c40] die at ffffffff8006c789
 #3 [ffff81107fd33c70] do_invalid_op at ffffffff8006cd49
 #4 [ffff81107fd33d30] error_exit at ffffffff8005dde9
    [exception RIP: list_del+71]
    RIP: ffffffff8015a793  RSP: ffff81107fd33de0  RFLAGS: 00010286
    RAX: 0000000000000058  RBX: 0000000000000427  RCX: ffffffff80323028
    RDX: ffffffff80323028  RSI: 0000000000000000  RDI: ffffffff80323020
    RBP: ffff81407f4e8680   R8: ffffffff80323028   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffc200104494a0
    R13: 0000000000000002  R14: ffff81107a2cf500  R15: ffff81407e1bf400
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffff81107fd33de8] netxen_process_rcv_ring at ffffffff8830b050 [netxen_nic]
 #6 [ffff81107fd33eb8] netxen_nic_poll at ffffffff88306e71 [netxen_nic]
 #7 [ffff81107fd33ef8] net_rx_action at ffffffff8000c9b9
 #8 [ffff81107fd33f38] __do_softirq at ffffffff80012551
 #9 [ffff81107fd33f68] call_softirq at ffffffff8005e2fc
#10 [ffff81107fd33f80] do_softirq at ffffffff8006d646
#11 [ffff81107fd33f90] do_IRQ at ffffffff8006d4d6
--- <IRQ stack> ---
#12 [ffff81307fe2fe38] ret_from_intr at ffffffff8005d615
    [exception RIP: mwait_idle_with_hints+102]
    RIP: ffffffff8006b9cf  RSP: ffff81307fe2fee8  RFLAGS: 00000246
    RAX: 0000000000000000  RBX: 00000000000000ff  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: 00007f319bfb6f27   R8: ffff81307fe2e000   R9: 0000000000000013
    R10: ffff8110b8288510  R11: 00000000ffffffff  R12: ffff81306273d040
    R13: ffff81207f8bd7e0  R14: 0000000000000001  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff64  CS: 0010  SS: 0018
#13 [ffff81307fe2fee8] mwait_idle at ffffffff80056c65
#14 [ffff81307fe2fef0] cpu_idle at ffffffff80048f92

> 4) Test case details:- what type of test is running on the system?
> Just to make sure we also try the same test cases in our
> environment.

No particular type of test, it's an Oracle server in production.

> 5) Server details (Number of CPus, memory etc.) if available.

64 x Xeon X7550 CPUs, 256G RAM

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-01-24  6:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-17  5:22 [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug David Gibson
2013-12-17  5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson
2013-12-17  6:37   ` Jitendra Kalsaria
2013-12-19 11:51   ` Manish Chopra
2013-12-20  4:11     ` David Gibson
2013-12-17  5:22 ` [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware David Gibson
2013-12-19 20:05   ` David Miller
2014-01-24  5:21     ` David Gibson
2013-12-17 21:50 ` [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug Manish Chopra
2013-12-18  6:22   ` David Gibson
2013-12-19  9:11     ` Manish Chopra
2014-01-24  6:44       ` David Gibson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).