* [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug @ 2013-12-17 5:22 David Gibson 2013-12-17 5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: David Gibson @ 2013-12-17 5:22 UTC (permalink / raw) To: manish.chopra, sony.chacko, rajesh.borundia Cc: netdev, snagarka, tcamuso, vdasgupt At Red Hat, we've hit a couple of customer cases with crashes in the netxen driver due to list corruption. This seems to be very rarely triggered, and unfortunately the dumps we have don't have enough information to be certain of the cause, although we have a possible theory. I'm suggesting, therefore a patch to add some sanity checking which should help to at least localize and mitigate the problem when someone hits it in future. Please let me know if there's a better approach to doing this. That's 2/2. 1/2 is a fix for a clear bug I spotted along the way, but not one that could cause the symptoms we've seen. ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/2] netxen: Correct off-by-one error in bounds check 2013-12-17 5:22 [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug David Gibson @ 2013-12-17 5:22 ` David Gibson 2013-12-17 6:37 ` Jitendra Kalsaria 2013-12-19 11:51 ` Manish Chopra 2013-12-17 5:22 ` [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware David Gibson 2013-12-17 21:50 ` [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug Manish Chopra 2 siblings, 2 replies; 12+ messages in thread From: David Gibson @ 2013-12-17 5:22 UTC (permalink / raw) To: manish.chopra, sony.chacko, rajesh.borundia Cc: netdev, snagarka, tcamuso, vdasgupt, David Gibson After retrieving an Rx buffer index from the hardware, the netxen driver bounds checks it against its array of receive buffers. However in the case of LRO packets, there's an off-by-one error in the check - it uses > instead of the >= it uses for the normal receive path in netxen_process_rcv(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c index 7692dfd..68658b8 100644 --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c @@ -1610,7 +1610,7 @@ netxen_process_lro(struct netxen_adapter *adapter, rds_ring = &recv_ctx->rds_rings[ring]; index = netxen_get_lro_sts_refhandle(sts_data0); - if (unlikely(index > rds_ring->num_desc)) + if (unlikely(index >= rds_ring->num_desc)) return NULL; buffer = &rds_ring->rx_buf_arr[index]; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] netxen: Correct off-by-one error in bounds check 2013-12-17 5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson @ 2013-12-17 6:37 ` Jitendra Kalsaria 2013-12-19 11:51 ` Manish Chopra 1 sibling, 0 replies; 12+ messages in thread From: Jitendra Kalsaria @ 2013-12-17 6:37 UTC (permalink / raw) To: David Gibson, Manish Chopra, Sony Chacko, Rajesh Borundia Cc: netdev, snagarka@redhat.com, tcamuso@redhat.com, vdasgupt@redhat.com David, You may also want to include fix for receive rings. if (unlikely(ring > adapter->max_rds_rings)) return NULL; Jiten On 12/16/13 9:22 PM, "David Gibson" <david@gibson.dropbear.id.au> wrote: >After retrieving an Rx buffer index from the hardware, the netxen driver >bounds checks it against its array of receive buffers. However in the >case >of LRO packets, there's an off-by-one error in the check - it uses > >instead of the >= it uses for the normal receive path in >netxen_process_rcv(). > >Signed-off-by: David Gibson <david@gibson.dropbear.id.au> >--- > drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c >b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c >index 7692dfd..68658b8 100644 >--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c >+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c >@@ -1610,7 +1610,7 @@ netxen_process_lro(struct netxen_adapter *adapter, > rds_ring = &recv_ctx->rds_rings[ring]; > > index = netxen_get_lro_sts_refhandle(sts_data0); >- if (unlikely(index > rds_ring->num_desc)) >+ if (unlikely(index >= rds_ring->num_desc)) > return NULL; > > buffer = &rds_ring->rx_buf_arr[index]; >-- >1.8.3.1 > >-- >To unsubscribe from this list: send the line "unsubscribe netdev" in >the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: [PATCH 1/2] netxen: Correct off-by-one error in bounds check 2013-12-17 5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson 2013-12-17 6:37 ` Jitendra Kalsaria @ 2013-12-19 11:51 ` Manish Chopra 2013-12-20 4:11 ` David Gibson 1 sibling, 1 reply; 12+ messages in thread From: Manish Chopra @ 2013-12-19 11:51 UTC (permalink / raw) To: David Gibson, Sony Chacko, Rajesh Borundia Cc: netdev, snagarka@redhat.com, tcamuso@redhat.com, vdasgupt@redhat.com >-----Original Message----- >From: David Gibson [mailto:david@gibson.dropbear.id.au] >Sent: Tuesday, December 17, 2013 10:53 AM >To: Manish Chopra; Sony Chacko; Rajesh Borundia >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com; >vdasgupt@redhat.com; David Gibson >Subject: [PATCH 1/2] netxen: Correct off-by-one error in bounds check > >After retrieving an Rx buffer index from the hardware, the netxen driver bounds >checks it against its array of receive buffers. However in the case of LRO packets, >there's an off-by-one error in the check - it uses > instead of the >= it uses for the >normal receive path in netxen_process_rcv(). > >Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Manish Chopra <manish.chopra@qlogic.com> Thanks David. This is really a bug fix. You can re-submit this single patch. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] netxen: Correct off-by-one error in bounds check 2013-12-19 11:51 ` Manish Chopra @ 2013-12-20 4:11 ` David Gibson 0 siblings, 0 replies; 12+ messages in thread From: David Gibson @ 2013-12-20 4:11 UTC (permalink / raw) To: Manish Chopra Cc: Sony Chacko, Rajesh Borundia, netdev, snagarka@redhat.com, tcamuso@redhat.com, vdasgupt@redhat.com [-- Attachment #1: Type: text/plain, Size: 1202 bytes --] On Thu, Dec 19, 2013 at 11:51:37AM +0000, Manish Chopra wrote: > >-----Original Message----- > >From: David Gibson [mailto:david@gibson.dropbear.id.au] > >Sent: Tuesday, December 17, 2013 10:53 AM > >To: Manish Chopra; Sony Chacko; Rajesh Borundia > >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com; > >vdasgupt@redhat.com; David Gibson > >Subject: [PATCH 1/2] netxen: Correct off-by-one error in bounds check > > > >After retrieving an Rx buffer index from the hardware, the netxen driver bounds > >checks it against its array of receive buffers. However in the case of LRO packets, > >there's an off-by-one error in the check - it uses > instead of the >= it uses for the > >normal receive path in netxen_process_rcv(). > > > >Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > > Acked-by: Manish Chopra <manish.chopra@qlogic.com> > > Thanks David. This is really a bug fix. You can re-submit this single patch. Ok, I added the extra fix Jiten suggested and resent. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware 2013-12-17 5:22 [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug David Gibson 2013-12-17 5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson @ 2013-12-17 5:22 ` David Gibson 2013-12-19 20:05 ` David Miller 2013-12-17 21:50 ` [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug Manish Chopra 2 siblings, 1 reply; 12+ messages in thread From: David Gibson @ 2013-12-17 5:22 UTC (permalink / raw) To: manish.chopra, sony.chacko, rajesh.borundia Cc: netdev, snagarka, tcamuso, vdasgupt, David Gibson At Red Hat we've seen a couple of customer cases where the kernel has crashed due to corruption of the Rx buffer free lists in the netxen driver. >From the information we've had so far, we haven't been able to locate the bug. Our working theory is that due to either a driver, hardware or firmware bug, we very occasionally process Rx descriptors which have the wrong buffer ref_handle in them. This is consistent with all the details we've seen of the two crashes we've seen. More specifically here's what we suspect happened: Case 1: netxen_process_rcv_ring() netxen_process_rcv() on buffer X (correctly) list_add_tail(buffer X, sds_ring free list) netxen_process_rcv() on buffer Y (correctly) list_add_tail(buffer Y, sds_ring free list) netxen_process_rcv() on buffer X <- wrong index list_add_tail(buffer X, sds_ring free list) <- list corrupted netxen_merge_rx_buffers() list_splice() sds_ring free list to rds_ring free_list netxen_post_rx_buffers_nodb() BUG on list_del() due to bad prev pointer in buffer X Case 2: CPU0 CPU1 netxen_process_rcv_list() on sds_ring A netxen_process_rcv() on buffer X (correctly) list_add_tail(buffer X, sds_ring A free_list) netxen_process_rcv_list() on sds_ring B netxen_process_rcv() buffer X <- wrong list_add_tail(buffer X, sds_ring B list) [...] netxen_post_rx_buffers_nodb() list_del(buffer X) <- buf X poisoned for_each on sds_ring A free list <- crash due to dereferencing poisoned pointers in buffer X Because the bug is triggered very rarely, and we have no known reproducer, it's difficult to pin down exactly what's going wrong. This patch tries to improve the situation by adding some (fairly cheap) sanity checks when we process an Rx buffer to make sure it's not already on someone's free list. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- .../net/ethernet/qlogic/netxen/netxen_nic_init.c | 37 ++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c index 68658b8..edbdf4b 100644 --- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c +++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_init.c @@ -1531,6 +1531,39 @@ no_skb: return skb; } +static int netxen_check_rx_buf(int index, struct netxen_rx_buffer *rxbuf) +{ + int ret = 0; + + if ((rxbuf->list.next != LIST_POISON1) + || (rxbuf->list.prev != LIST_POISON2)) { + printk(KERN_ERR "netxen: Rx buffer is already on free list " + "(next=%p, prev=%p)\n", rxbuf->list.next, + rxbuf->list.prev); + ret = -1; + } + + if (rxbuf->ref_handle != index) { + printk(KERN_ERR "netxen: Rx buffer has bad ref_handle " + "(%d instead of %d)\n", rxbuf->ref_handle, index); + ret = -1; + } + + if (rxbuf->state != NETXEN_BUFFER_BUSY) { + printk(KERN_ERR "netxen: Rx buffer has bad state %d\n", + rxbuf->state); + ret = -1; + } + + if (!rxbuf->skb) { + printk(KERN_ERR "netxen: Rx buffer has no skb\n", + rxbuf->skb); + ret = -1; + } + + return ret; +} + static struct netxen_rx_buffer * netxen_process_rcv(struct netxen_adapter *adapter, struct nx_host_sds_ring *sds_ring, @@ -1553,6 +1586,8 @@ netxen_process_rcv(struct netxen_adapter *adapter, return NULL; buffer = &rds_ring->rx_buf_arr[index]; + if (WARN_ON(netxen_check_rx_buf(index, buffer) != 0)) + return NULL; length = netxen_get_sts_totallength(sts_data0); cksum = netxen_get_sts_status(sts_data0); @@ -1614,6 +1649,8 @@ netxen_process_lro(struct netxen_adapter *adapter, return NULL; buffer = &rds_ring->rx_buf_arr[index]; + if (WARN_ON(netxen_check_rx_buf(index, buffer) != 0)) + return NULL; timestamp = netxen_get_lro_sts_timestamp(sts_data0); lro_length = netxen_get_lro_sts_length(sts_data0); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware 2013-12-17 5:22 ` [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware David Gibson @ 2013-12-19 20:05 ` David Miller 2014-01-24 5:21 ` David Gibson 0 siblings, 1 reply; 12+ messages in thread From: David Miller @ 2013-12-19 20:05 UTC (permalink / raw) To: david Cc: manish.chopra, sony.chacko, rajesh.borundia, netdev, snagarka, tcamuso, vdasgupt From: David Gibson <david@gibson.dropbear.id.au> Date: Tue, 17 Dec 2013 16:22:33 +1100 > +static int netxen_check_rx_buf(int index, struct netxen_rx_buffer *rxbuf) > +{ > + int ret = 0; > + > + if ((rxbuf->list.next != LIST_POISON1) > + || (rxbuf->list.prev != LIST_POISON2)) { > + printk(KERN_ERR "netxen: Rx buffer is already on free list " > + "(next=%p, prev=%p)\n", rxbuf->list.next, > + rxbuf->list.prev); > + ret = -1; > + } First, you should be using netdev_err() or similar. Second, doing this unconditionally for every receive packet is not really reasonable. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware 2013-12-19 20:05 ` David Miller @ 2014-01-24 5:21 ` David Gibson 0 siblings, 0 replies; 12+ messages in thread From: David Gibson @ 2014-01-24 5:21 UTC (permalink / raw) To: David Miller Cc: manish.chopra, sony.chacko, rajesh.borundia, netdev, snagarka, tcamuso, vdasgupt [-- Attachment #1: Type: text/plain, Size: 1315 bytes --] On Thu, Dec 19, 2013 at 03:05:27PM -0500, David Miller wrote: > From: David Gibson <david@gibson.dropbear.id.au> > Date: Tue, 17 Dec 2013 16:22:33 +1100 > > > +static int netxen_check_rx_buf(int index, struct netxen_rx_buffer *rxbuf) > > +{ > > + int ret = 0; > > + > > + if ((rxbuf->list.next != LIST_POISON1) > > + || (rxbuf->list.prev != LIST_POISON2)) { > > + printk(KERN_ERR "netxen: Rx buffer is already on free list " > > + "(next=%p, prev=%p)\n", rxbuf->list.next, > > + rxbuf->list.prev); > > + ret = -1; > > + } > > First, you should be using netdev_err() or similar. Well, I was just matching the form used in the rest of the driver. > Second, doing this unconditionally for every receive packet is > not really reasonable. Yeah, trouble is I don't know what the alternative is. Whatever is causing this problem, it's triggered very rarely - I'm only seeing a pattern by looking across everything reported by Red Hat customers. Just adding diagnostics for one customer is unlikely to be useful, because the chances are they'll never hit the bug again. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug 2013-12-17 5:22 [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug David Gibson 2013-12-17 5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson 2013-12-17 5:22 ` [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware David Gibson @ 2013-12-17 21:50 ` Manish Chopra 2013-12-18 6:22 ` David Gibson 2 siblings, 1 reply; 12+ messages in thread From: Manish Chopra @ 2013-12-17 21:50 UTC (permalink / raw) To: David Gibson, Sony Chacko, Rajesh Borundia Cc: netdev, snagarka@redhat.com, tcamuso@redhat.com, vdasgupt@redhat.com >-----Original Message----- >From: David Gibson [mailto:david@gibson.dropbear.id.au] >Sent: Tuesday, December 17, 2013 10:53 AM >To: Manish Chopra; Sony Chacko; Rajesh Borundia >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com; >vdasgupt@redhat.com >Subject: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug > >At Red Hat, we've hit a couple of customer cases with crashes in the netxen driver >due to list corruption. This seems to be very rarely triggered, and unfortunately >the dumps we have don't have enough information to be certain of the cause, >although we have a possible theory. > >I'm suggesting, therefore a patch to add some sanity checking which should help >to at least localize and mitigate the problem when someone hits it in future. >Please let me know if there's a better approach to doing this. > >That's 2/2. 1/2 is a fix for a clear bug I spotted along the way, but not one that >could cause the symptoms we've seen. David, Having these checks in data path(Rx path) may have some performance impact. It's better to root cause it instead of putting some sanity checks. We will get back to you on this. Thanks Manish ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug 2013-12-17 21:50 ` [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug Manish Chopra @ 2013-12-18 6:22 ` David Gibson 2013-12-19 9:11 ` Manish Chopra 0 siblings, 1 reply; 12+ messages in thread From: David Gibson @ 2013-12-18 6:22 UTC (permalink / raw) To: Manish Chopra Cc: Sony Chacko, Rajesh Borundia, netdev, snagarka@redhat.com, tcamuso@redhat.com, vdasgupt@redhat.com [-- Attachment #1: Type: text/plain, Size: 1776 bytes --] On Tue, Dec 17, 2013 at 09:50:52PM +0000, Manish Chopra wrote: > >-----Original Message----- > >From: David Gibson [mailto:david@gibson.dropbear.id.au] > >Sent: Tuesday, December 17, 2013 10:53 AM > >To: Manish Chopra; Sony Chacko; Rajesh Borundia > >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com; > >vdasgupt@redhat.com > >Subject: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug > > > >At Red Hat, we've hit a couple of customer cases with crashes in the netxen driver > >due to list corruption. This seems to be very rarely triggered, and unfortunately > >the dumps we have don't have enough information to be certain of the cause, > >although we have a possible theory. > > > >I'm suggesting, therefore a patch to add some sanity checking which should help > >to at least localize and mitigate the problem when someone hits it in future. > >Please let me know if there's a better approach to doing this. > > > >That's 2/2. 1/2 is a fix for a clear bug I spotted along the way, but not one that > >could cause the symptoms we've seen. > > David, > > Having these checks in data path(Rx path) may have some performance > impact. It's better to root cause it instead of putting some sanity > checks. Obviously, but this was the best way I could think of to try narrowing down the root cause (at least trying to eliminate driver vs. firmware bug). > We will get back to you on this. If you have a better idea for locating the root cause, please let me know. I have access to a vmcore which I can poke around in. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug 2013-12-18 6:22 ` David Gibson @ 2013-12-19 9:11 ` Manish Chopra 2014-01-24 6:44 ` David Gibson 0 siblings, 1 reply; 12+ messages in thread From: Manish Chopra @ 2013-12-19 9:11 UTC (permalink / raw) To: David Gibson Cc: Sony Chacko, Rajesh Borundia, netdev, snagarka@redhat.com, tcamuso@redhat.com, vdasgupt@redhat.com >> >From: David Gibson [mailto:david@gibson.dropbear.id.au] >> >Sent: Tuesday, December 17, 2013 10:53 AM >> >To: Manish Chopra; Sony Chacko; Rajesh Borundia >> >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com; >> >vdasgupt@redhat.com >> >Subject: [0/2] netxen: bug fix and diagnostics for possible >> >(hardware?) bug >> > >> >At Red Hat, we've hit a couple of customer cases with crashes in the >> >netxen driver due to list corruption. This seems to be very rarely >> >triggered, and unfortunately the dumps we have don't have enough >> >information to be certain of the cause, although we have a possible theory. >> > >> >I'm suggesting, therefore a patch to add some sanity checking which >> >should help to at least localize and mitigate the problem when someone hits it >in future. >> >Please let me know if there's a better approach to doing this. >> > >> >That's 2/2. 1/2 is a fix for a clear bug I spotted along the way, >> >but not one that could cause the symptoms we've seen. >> >> David, >> >> Having these checks in data path(Rx path) may have some performance >> impact. It's better to root cause it instead of putting some sanity >> checks. > >Obviously, but this was the best way I could think of to try narrowing down the >root cause (at least trying to eliminate driver vs. firmware bug). David, Instead of making permanent changes in driver, can you please run your modified driver in selective customer environment where this issues is seen? Which may give some data point that what's the issue exactly and then we go by that. > >> We will get back to you on this. > >If you have a better idea for locating the root cause, please let me know. I have >access to a vmcore which I can poke around in. We will also try to reproduce the problem in our environment and debug this. Can you please give some details? 1) what's the driver and firmware version used? 2) which operating system and kernel version? 3) please send the vmcore also with backtrace if available which can give some idea what can trigger this issue. 4) Test case details:- what type of test is running on the system? Just to make sure we also try the same test cases in our environment. 5) Server details (Number of CPus, memory etc.) if available. Thanks, Manish ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug 2013-12-19 9:11 ` Manish Chopra @ 2014-01-24 6:44 ` David Gibson 0 siblings, 0 replies; 12+ messages in thread From: David Gibson @ 2014-01-24 6:44 UTC (permalink / raw) To: Manish Chopra Cc: Sony Chacko, Rajesh Borundia, netdev, snagarka@redhat.com, tcamuso@redhat.com, vdasgupt@redhat.com [-- Attachment #1: Type: text/plain, Size: 6205 bytes --] On Thu, Dec 19, 2013 at 09:11:33AM +0000, Manish Chopra wrote: > >> >From: David Gibson [mailto:david@gibson.dropbear.id.au] > >> >Sent: Tuesday, December 17, 2013 10:53 AM > >> >To: Manish Chopra; Sony Chacko; Rajesh Borundia > >> >Cc: netdev; snagarka@redhat.com; tcamuso@redhat.com; > >> >vdasgupt@redhat.com > >> >Subject: [0/2] netxen: bug fix and diagnostics for possible > >> >(hardware?) bug > >> > > >> >At Red Hat, we've hit a couple of customer cases with crashes in the > >> >netxen driver due to list corruption. This seems to be very rarely > >> >triggered, and unfortunately the dumps we have don't have enough > >> >information to be certain of the cause, although we have a possible theory. > >> > > >> >I'm suggesting, therefore a patch to add some sanity checking which > >> >should help to at least localize and mitigate the problem when someone hits it > >in future. > >> >Please let me know if there's a better approach to doing this. > >> > > >> >That's 2/2. 1/2 is a fix for a clear bug I spotted along the way, > >> >but not one that could cause the symptoms we've seen. > >> > >> David, > >> > >> Having these checks in data path(Rx path) may have some performance > >> impact. It's better to root cause it instead of putting some sanity > >> checks. > > > >Obviously, but this was the best way I could think of to try narrowing down the > >root cause (at least trying to eliminate driver vs. firmware bug). > > David, Instead of making permanent changes in driver, can you please > run your modified driver in selective customer environment where > this issues is seen? Yeah, the problem with that is that the problem has never triggered twice for a single customer. Well, technically there is one customer that's hit it twice, but I'm pretty sure it's on entirely unrelated systems in different sections of a large customer. The only reason I can see enough cases to suspect a pattern to these problems is from looking across Red Hat's whole case history. > Which may give some data point that what's the issue exactly and then we go by that. > > > > >> We will get back to you on this. > > > >If you have a better idea for locating the root cause, please let me know. I have > >access to a vmcore which I can poke around in. > > We will also try to reproduce the problem in our environment and debug this. > Can you please give some details? Apologies for the long delay, I'd been hoping for some more confirmation of things, but it hasn't happened. I'll give you what I can. > 1) what's the driver and firmware version used? I'm not sure what the most useful way ot giving a driver version. I've given kernel version below, but it's an RH kernel, so I'm not sure how much has been backported. As to firmware, the driver reports: netxen_nic 0000:04:00.0: Gen2 strapping detected netxen_nic 0000:04:00.0: using 64-bit dma mask netxen_nic: NX3031 Gigabit Ethernet Board S/N <FF><FF><FF><FF><FF><FF><FF><FF><FF> <FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF><FF> <FF><FF><FF>NX3031 Gigabit Ethernet Chip rev 0x42 netxen_nic 0000:04:00.0: firmware v4.0.585 [legacy] > 2) which operating system and kernel version? RHEL5, Linux hostname 2.6.18-308.el5 #1 SMP Fri Jan 27 17:17:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux > 3) please send the vmcore also with backtrace if available which can > give some idea what can trigger this issue. I can't send the vmcore itself, since it will include customer data. I can give you the backtrace below, and look up specific things if you can give me an idea of what you need: crash> bt PID: 0 TASK: ffff81207f8bd7e0 CPU: 44 COMMAND: "swapper" #0 [ffff81107fd33b40] crash_kexec at ffffffff800b0938 #1 [ffff81107fd33c00] __die at ffffffff80065137 #2 [ffff81107fd33c40] die at ffffffff8006c789 #3 [ffff81107fd33c70] do_invalid_op at ffffffff8006cd49 #4 [ffff81107fd33d30] error_exit at ffffffff8005dde9 [exception RIP: list_del+71] RIP: ffffffff8015a793 RSP: ffff81107fd33de0 RFLAGS: 00010286 RAX: 0000000000000058 RBX: 0000000000000427 RCX: ffffffff80323028 RDX: ffffffff80323028 RSI: 0000000000000000 RDI: ffffffff80323020 RBP: ffff81407f4e8680 R8: ffffffff80323028 R9: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffffc200104494a0 R13: 0000000000000002 R14: ffff81107a2cf500 R15: ffff81407e1bf400 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffff81107fd33de8] netxen_process_rcv_ring at ffffffff8830b050 [netxen_nic] #6 [ffff81107fd33eb8] netxen_nic_poll at ffffffff88306e71 [netxen_nic] #7 [ffff81107fd33ef8] net_rx_action at ffffffff8000c9b9 #8 [ffff81107fd33f38] __do_softirq at ffffffff80012551 #9 [ffff81107fd33f68] call_softirq at ffffffff8005e2fc #10 [ffff81107fd33f80] do_softirq at ffffffff8006d646 #11 [ffff81107fd33f90] do_IRQ at ffffffff8006d4d6 --- <IRQ stack> --- #12 [ffff81307fe2fe38] ret_from_intr at ffffffff8005d615 [exception RIP: mwait_idle_with_hints+102] RIP: ffffffff8006b9cf RSP: ffff81307fe2fee8 RFLAGS: 00000246 RAX: 0000000000000000 RBX: 00000000000000ff RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 00007f319bfb6f27 R8: ffff81307fe2e000 R9: 0000000000000013 R10: ffff8110b8288510 R11: 00000000ffffffff R12: ffff81306273d040 R13: ffff81207f8bd7e0 R14: 0000000000000001 R15: 0000000000000000 ORIG_RAX: ffffffffffffff64 CS: 0010 SS: 0018 #13 [ffff81307fe2fee8] mwait_idle at ffffffff80056c65 #14 [ffff81307fe2fef0] cpu_idle at ffffffff80048f92 > 4) Test case details:- what type of test is running on the system? > Just to make sure we also try the same test cases in our > environment. No particular type of test, it's an Oracle server in production. > 5) Server details (Number of CPus, memory etc.) if available. 64 x Xeon X7550 CPUs, 256G RAM -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson [-- Attachment #2: Type: application/pgp-signature, Size: 819 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2014-01-24 6:44 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-12-17 5:22 [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug David Gibson 2013-12-17 5:22 ` [PATCH 1/2] netxen: Correct off-by-one error in bounds check David Gibson 2013-12-17 6:37 ` Jitendra Kalsaria 2013-12-19 11:51 ` Manish Chopra 2013-12-20 4:11 ` David Gibson 2013-12-17 5:22 ` [PATCH 2/2] netxen: Add sanity checks for Rx buffers returning from hardware David Gibson 2013-12-19 20:05 ` David Miller 2014-01-24 5:21 ` David Gibson 2013-12-17 21:50 ` [0/2] netxen: bug fix and diagnostics for possible (hardware?) bug Manish Chopra 2013-12-18 6:22 ` David Gibson 2013-12-19 9:11 ` Manish Chopra 2014-01-24 6:44 ` David Gibson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).