* [BUG] mellanox IB driver fails to load on large config
@ 2015-07-10 19:15 andrew banman
2015-07-11 20:20 ` Or Gerlitz
0 siblings, 1 reply; 11+ messages in thread
From: andrew banman @ 2015-07-10 19:15 UTC (permalink / raw)
To: linux-kernel
Cc: Doug Ledford, Sean Hefty, Hal Rosenstock, Or Gerlitz,
David S. Miller, Roland Dreier, Matan Barak, Moni Shoua,
Jack Morgenstein, Yishai Hadas, Eran Ben Elisha, Ira Weiny,
linux-rdma
I'm seeing a large number of allocation errors originating from the Mellanox IB
driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
8<---
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 64; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 65; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 66; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 67; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 68; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 69; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 70; reverting to legacy
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 71; reverting to legacy
......
<mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 123; reverting to legacy
--->8
Where the failing function is in drivers/infiniband/hw/mlx4/main.c:
8<---
static void mlx4_ib_alloc_eqs(struct mlx4_dev *dev, struct mlx4_ib_dev *ibdev)
...
		/* Set IRQ for specific name (per ring) */
		if (mlx4_assign_eq(dev, name, NULL,
				   &ibdev->eq_table[eq])) {
			/* Use legacy (same as mlx4_en driver) */
			pr_warn("Can't allocate EQ %d; reverting to legacy\n", eq);
			ibdev->eq_table[eq] =
				(eq % dev->caps.num_comp_vectors);
		}
--->8
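For readers following the fallback branch in the snippet, here is a minimal sketch of what "reverting to legacy" does to the EQ table. It is not the driver code: `fake_pool`, `fake_assign_eq`, and `fill_eq_table` are hypothetical stand-ins, and the pool size is arbitrary. Only the modulo expression mirrors the quoted source.

```c
#include <assert.h>

/* Hypothetical stand-in for the dedicated-vector pool behind
 * mlx4_assign_eq(); the real driver's bookkeeping is not shown in the thread. */
struct fake_pool {
    int next_free;  /* next dedicated vector to hand out */
    int limit;      /* one past the last dedicated vector */
};

/* Hands out dedicated vectors until the pool is exhausted, then fails. */
static int fake_assign_eq(struct fake_pool *pool, int *vector)
{
    if (pool->next_free >= pool->limit)
        return -1;  /* pool exhausted */
    *vector = pool->next_free++;
    return 0;
}

/* Mirrors the fallback in the quoted snippet: when no dedicated vector is
 * available, EQ `eq` shares legacy vector (eq % num_comp_vectors). */
static void fill_eq_table(struct fake_pool *pool, int *eq_table,
                          int num_eqs, int num_comp_vectors)
{
    assert(num_comp_vectors > 0);
    for (int eq = 0; eq < num_eqs; eq++) {
        if (fake_assign_eq(pool, &eq_table[eq]))
            eq_table[eq] = eq % num_comp_vectors;
    }
}
```

Under this model every EQ past the end of the pool still gets a usable vector, just a shared one, which would be consistent with the warning being non-fatal.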
The problem doesn't appear to be fatal. At this point I am unsure if this is
actually expected behavior, so I'm looking for some insight into the issue.
At first we believed the problem to be with request_irq, but after adding
some debug code we found that mlx4_assign_eq returned -28 (-ENOSPC),
indicating that vec was never assigned:
8<---
@@ -1401,6 +1402,7 @@ int mlx4_assign_eq(struct mlx4_dev *dev, char *name, struct cpu_rmap *rmap,
 	if (vec) {
 		*vector = vec;
 	} else {
+		pr_crit("!!! debug: mlx4_assign_eq - last err %d\n", err);
 		*vector = 0;
 		err = (i == dev->caps.comp_pool) ? -ENOSPC : err;
 	}
--->8
8<---
[ 1565.416273] !!! debug: mlx4_assign_eq - last err 0
[ 1565.416275] <mlx4_ib> mlx4_ib_alloc_eqs: !!! debug: mlx4_assign_eq returned -28
[ 1565.416277] <mlx4_ib> mlx4_ib_alloc_eqs: Can't allocate EQ 64; reverting to legacy
--->8
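The debug output shows "last err 0" followed by a return value of -28, which can look contradictory. A simplified sketch of the tail of the function may help. The scan loop and the `in_use` array are assumptions made for illustration (the thread only shows the final if/else), and `sketch_assign_eq` is a hypothetical name, not the driver's function.

```c
#include <assert.h>
#include <errno.h>

/* Simplified, hypothetical model of the end of mlx4_assign_eq().
 * in_use[i] marks pool slot i as taken. Only the final if/else mirrors
 * the hunk quoted in the thread. */
static int sketch_assign_eq(const int *in_use, int comp_pool, int *vector)
{
    int err = 0;
    int vec = 0;
    int i;

    assert(comp_pool >= 0);
    for (i = 0; i < comp_pool; i++) {
        if (!in_use[i]) {
            vec = i + 1;    /* nonzero means a vector was claimed */
            break;
        }
    }

    if (vec) {
        *vector = vec;
    } else {
        *vector = 0;
        /* Scanning the whole pool without a hit yields -ENOSPC, even
         * though no individual step failed and err is still 0 here. */
        err = (i == comp_pool) ? -ENOSPC : err;
    }
    return err;
}
```

In this model a fully occupied pool returns -ENOSPC (28 on Linux) without any earlier call having failed, which matches the "last err 0" line printed just before the -28.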
Any help would be greatly appreciated!
Andrew Banman
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] mellanox IB driver fails to load on large config
2015-07-10 19:15 [BUG] mellanox IB driver fails to load on large config andrew banman
@ 2015-07-11 20:20 ` Or Gerlitz
2015-07-14 18:22 ` andrew banman
0 siblings, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2015-07-11 20:20 UTC (permalink / raw)
To: andrew banman
Cc: Linux Kernel, Doug Ledford, Sean Hefty, Hal Rosenstock, Or Gerlitz,
David S. Miller, Roland Dreier, Matan Barak, Moni Shoua,
Jack Morgenstein, Yishai Hadas, Eran Ben Elisha, Ira Weiny,
linux-rdma@vger.kernel.org

On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
> I'm seeing a large number of allocation errors originating from the Mellanox IB
> driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:

Just to make sure, mlx4 works fine on this small (...) system with 4.1
and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
config?

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] mellanox IB driver fails to load on large config
2015-07-11 20:20 ` Or Gerlitz
@ 2015-07-14 18:22 ` andrew banman
2015-07-14 18:48 ` Alex Thorlton
0 siblings, 1 reply; 11+ messages in thread
From: andrew banman @ 2015-07-14 18:22 UTC (permalink / raw)
To: Or Gerlitz
Cc: andrew banman, Linux Kernel, Doug Ledford, Sean Hefty,
Hal Rosenstock, Or Gerlitz, David S. Miller, Roland Dreier,
Matan Barak, Moni Shoua, Jack Morgenstein, Yishai Hadas,
Eran Ben Elisha, Ira Weiny, linux-rdma@vger.kernel.org, athorlton

On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote:
> On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
> > I'm seeing a large number of allocation errors originating from the Mellanox IB
> > driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
>
> Just to make sure, mlx4 works fine on this small (...) system with 4.1
> and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
> config?

I'll let Alex comment on that, he did some testing on that.

-Andrew

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] mellanox IB driver fails to load on large config
2015-07-14 18:22 ` andrew banman
@ 2015-07-14 18:48 ` Alex Thorlton
2015-07-14 20:06 ` Or Gerlitz
0 siblings, 1 reply; 11+ messages in thread
From: Alex Thorlton @ 2015-07-14 18:48 UTC (permalink / raw)
To: andrew banman
Cc: Or Gerlitz, Linux Kernel, Doug Ledford, Sean Hefty, Hal Rosenstock,
Or Gerlitz, David S. Miller, Roland Dreier, Matan Barak, Moni Shoua,
Jack Morgenstein, Yishai Hadas, Eran Ben Elisha, Ira Weiny,
linux-rdma@vger.kernel.org, athorlton

On Tue, Jul 14, 2015 at 01:22:34PM -0500, andrew banman wrote:
> On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote:
> > On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
> > > I'm seeing a large number of allocation errors originating from the Mellanox IB
> > > driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
> >
> > Just to make sure, mlx4 works fine on this small (...) system with 4.1
> > and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
> > config?
>
> I'll let Alex comment on that, he did some testing on that.

I started seeing this on a 4.1-rc8 kernel, so it's been around for a
little while. It may have been around before 4.1-rc8, but I haven't run
any kernels older than that on the big machine for some time.

- Alex

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] mellanox IB driver fails to load on large config
2015-07-14 18:48 ` Alex Thorlton
@ 2015-07-14 20:06 ` Or Gerlitz
2015-07-14 20:28 ` Alex Thorlton
0 siblings, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2015-07-14 20:06 UTC (permalink / raw)
To: Alex Thorlton
Cc: andrew banman, Linux Kernel, Doug Ledford, Sean Hefty,
Hal Rosenstock, Or Gerlitz, David S. Miller, Roland Dreier,
Matan Barak, Moni Shoua, Jack Morgenstein, Yishai Hadas,
Eran Ben Elisha, Ira Weiny, linux-rdma@vger.kernel.org

On Tue, Jul 14, 2015 at 9:48 PM, Alex Thorlton <athorlton@sgi.com> wrote:
> On Tue, Jul 14, 2015 at 01:22:34PM -0500, andrew banman wrote:
>> On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote:
>> > On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
>> > > I'm seeing a large number of allocation errors originating from the Mellanox IB
>> > > driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
>> >
>> > Just to make sure, mlx4 works fine on this small (...) system with 4.1
>> > and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
>> > config?
>>
>> I'll let Alex comment on that, he did some testing on that.
>
> I started seeing this on a 4.1-rc8 kernel, so it's been around for a
> little while. It may have been around before 4.1-rc8, but I haven't run
> any kernels older than that on the big machine for some time.

To make sure I am correctly following, on 4.1-rc8 you also see
something, right? are these the same messages or different ones? if the
latter send to us.

Or.

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] mellanox IB driver fails to load on large config
2015-07-14 20:06 ` Or Gerlitz
@ 2015-07-14 20:28 ` Alex Thorlton
2015-07-15 11:33 ` Matan Barak
2015-07-16 6:25 ` Or Gerlitz
0 siblings, 2 replies; 11+ messages in thread
From: Alex Thorlton @ 2015-07-14 20:28 UTC (permalink / raw)
To: Or Gerlitz
Cc: Alex Thorlton, andrew banman, Linux Kernel, Doug Ledford,
Sean Hefty, Hal Rosenstock, Or Gerlitz, David S. Miller,
Roland Dreier, Matan Barak, Moni Shoua, Jack Morgenstein,
Yishai Hadas, Eran Ben Elisha, Ira Weiny, linux-rdma@vger.kernel.org

On Tue, Jul 14, 2015 at 11:06:26PM +0300, Or Gerlitz wrote:
> On Tue, Jul 14, 2015 at 9:48 PM, Alex Thorlton <athorlton@sgi.com> wrote:
> > On Tue, Jul 14, 2015 at 01:22:34PM -0500, andrew banman wrote:
> >> On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote:
> >> > On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
> >> > > I'm seeing a large number of allocation errors originating from the Mellanox IB
> >> > > driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
> >> >
> >> > Just to make sure, mlx4 works fine on this small (...) system with 4.1
> >> > and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
> >> > config?
> >>
> >> I'll let Alex comment on that, he did some testing on that.
> >
> > I started seeing this on a 4.1-rc8 kernel, so it's been around for a
> > little while. It may have been around before 4.1-rc8, but I haven't run
> > any kernels older than that on the big machine for some time.
>
> To make sure I am correctly following, on 4.1-rc8 you also see
> something, right?

Yes, that's correct.

> are these the same messages or different ones? if the latter send to us.

We see the same exact messages on 4.1-rc8.

Thanks for looking into this!

- Alex

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] mellanox IB driver fails to load on large config
2015-07-14 20:28 ` Alex Thorlton
@ 2015-07-15 11:33 ` Matan Barak
2015-07-16 6:25 ` Or Gerlitz
1 sibling, 0 replies; 11+ messages in thread
From: Matan Barak @ 2015-07-15 11:33 UTC (permalink / raw)
To: Alex Thorlton, Or Gerlitz
Cc: andrew banman, Linux Kernel, Doug Ledford, Sean Hefty,
Hal Rosenstock, Or Gerlitz, David S. Miller, Roland Dreier,
Moni Shoua, Jack Morgenstein, Yishai Hadas, Eran Ben Elisha,
Ira Weiny, linux-rdma@vger.kernel.org

On 7/14/2015 11:28 PM, Alex Thorlton wrote:
> On Tue, Jul 14, 2015 at 11:06:26PM +0300, Or Gerlitz wrote:
>> On Tue, Jul 14, 2015 at 9:48 PM, Alex Thorlton <athorlton@sgi.com> wrote:
>>> On Tue, Jul 14, 2015 at 01:22:34PM -0500, andrew banman wrote:
>>>> On Sat, Jul 11, 2015 at 11:20:19PM +0300, Or Gerlitz wrote:
>>>>> On Fri, Jul 10, 2015 at 10:15 PM, andrew banman <abanman@sgi.com> wrote:
>>>>>> I'm seeing a large number of allocation errors originating from the Mellanox IB
>>>>>> driver when booting the 4.2-rc1 kernel on a 4096cpu 32TB memory system:
>>>>>
>>>>> Just to make sure, mlx4 works fine on this small (...) system with 4.1
>>>>> and 4.2-rc1 breaks, or 4.2-rc1 is the 1st time you're trying that
>>>>> config?
>>>>
>>>> I'll let Alex comment on that, he did some testing on that.
>>>
>>> I started seeing this on a 4.1-rc8 kernel, so it's been around for a
>>> little while. It may have been around before 4.1-rc8, but I haven't run
>>> any kernels older than that on the big machine for some time.
>>
>> To make sure I am correctly following, on 4.1-rc8 you also see
>> something, right?
>
> Yes, that's correct.
>
>> are these the same messages or different ones? if the latter send to us.
>
> We see the same exact messages on 4.1-rc8.

Hi,

We don't recall getting those errors with 32cpu machines, but we'll try
to reproduce this issue.

Matan

>
> Thanks for looking into this!
>
> - Alex
>

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] mellanox IB driver fails to load on large config
2015-07-14 20:28 ` Alex Thorlton
2015-07-15 11:33 ` Matan Barak
@ 2015-07-16 6:25 ` Or Gerlitz
2015-07-20 16:28 ` Alex Thorlton
1 sibling, 1 reply; 11+ messages in thread
From: Or Gerlitz @ 2015-07-16 6:25 UTC (permalink / raw)
To: Alex Thorlton
Cc: Or Gerlitz, andrew banman, Linux Kernel, Doug Ledford, Sean Hefty,
Hal Rosenstock, David S. Miller, Roland Dreier, Matan Barak,
Moni Shoua, Jack Morgenstein, Yishai Hadas, Eran Ben Elisha,
Ira Weiny, linux-rdma@vger.kernel.org

On 7/14/2015 11:28 PM, Alex Thorlton wrote:
>
> We see the same exact messages on 4.1-rc8.
>
>

Does this solve the problem?

diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index ad31e47..c8ae3b9 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -45,7 +45,7 @@
 #include <linux/timecounter.h>
 
 #define MAX_MSIX_P_PORT 17
-#define MAX_MSIX 64
+#define MAX_MSIX 1024
 #define MIN_MSIX_P_PORT 5
 #define MLX4_IS_LEGACY_EQ_MODE(dev_cap) ((dev_cap).num_comp_vectors < \
 				(dev_cap).num_ports * MIN_MSIX_P_PORT)
--

^ permalink raw reply related [flat|nested] 11+ messages in thread
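The patch context above includes the `MLX4_IS_LEGACY_EQ_MODE` macro, which compares the number of completion vectors against `num_ports * MIN_MSIX_P_PORT`. A small sketch of how that predicate evaluates follows. The `sketch_caps` struct and `is_legacy_eq_mode` helper are hypothetical, cut down to the two fields the macro reads, and the vector and port counts in the usage note are illustrative values, not figures from the reported system.

```c
#include <assert.h>

#define MIN_MSIX_P_PORT 5

/* Hypothetical, reduced capability struct holding only the two fields
 * the macro reads; the real struct in device.h is much larger. */
struct sketch_caps {
    int num_comp_vectors;
    int num_ports;
};

/* Same expression as the macro shown in the patch context. */
#define MLX4_IS_LEGACY_EQ_MODE(dev_cap) ((dev_cap).num_comp_vectors < \
                                         (dev_cap).num_ports * MIN_MSIX_P_PORT)

/* Function wrapper so the predicate can be called and tested directly. */
static int is_legacy_eq_mode(struct sketch_caps caps)
{
    assert(caps.num_ports > 0);
    return MLX4_IS_LEGACY_EQ_MODE(caps);
}
```

With two ports the threshold is 10 vectors, so 8 completion vectors would count as legacy mode and 64 would not. Note that `MAX_MSIX` does not appear in this expression.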
* Re: [BUG] mellanox IB driver fails to load on large config
2015-07-16 6:25 ` Or Gerlitz
@ 2015-07-20 16:28 ` Alex Thorlton
2015-07-21 2:56 ` Alex Thorlton
0 siblings, 1 reply; 11+ messages in thread
From: Alex Thorlton @ 2015-07-20 16:28 UTC (permalink / raw)
To: Or Gerlitz
Cc: Alex Thorlton, Or Gerlitz, andrew banman, Linux Kernel,
Doug Ledford, Sean Hefty, Hal Rosenstock, David S. Miller,
Roland Dreier, Matan Barak, Moni Shoua, Jack Morgenstein,
Yishai Hadas, Eran Ben Elisha, Ira Weiny, linux-rdma@vger.kernel.org

On Thu, Jul 16, 2015 at 09:25:37AM +0300, Or Gerlitz wrote:
> On 7/14/2015 11:28 PM, Alex Thorlton wrote:
>>
>> We see the same exact messages on 4.1-rc8.
>>
>>
>
> does this solves the problem?
>
>
> diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
> index ad31e47..c8ae3b9 100644
> --- a/include/linux/mlx4/device.h
> +++ b/include/linux/mlx4/device.h
> @@ -45,7 +45,7 @@
> #include <linux/timecounter.h>
>
> #define MAX_MSIX_P_PORT 17
> -#define MAX_MSIX 64
> +#define MAX_MSIX 1024
> #define MIN_MSIX_P_PORT 5
> #define MLX4_IS_LEGACY_EQ_MODE(dev_cap) ((dev_cap).num_comp_vectors < \
> (dev_cap).num_ports * MIN_MSIX_P_PORT)
> --
>

I've got some time on the large machine later today. I'll give this a
try then.

Thanks, Or!

- Alex

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [BUG] mellanox IB driver fails to load on large config
2015-07-20 16:28 ` Alex Thorlton
@ 2015-07-21 2:56 ` Alex Thorlton
2015-07-21 14:21 ` Matan Barak
0 siblings, 1 reply; 11+ messages in thread
From: Alex Thorlton @ 2015-07-21 2:56 UTC (permalink / raw)
To: Alex Thorlton
Cc: Or Gerlitz, Or Gerlitz, andrew banman, Linux Kernel, Doug Ledford,
Sean Hefty, Hal Rosenstock, David S. Miller, Roland Dreier,
Matan Barak, Moni Shoua, Jack Morgenstein, Yishai Hadas,
Eran Ben Elisha, Ira Weiny, linux-rdma@vger.kernel.org

On Mon, Jul 20, 2015 at 11:28:03AM -0500, Alex Thorlton wrote:
> I've got some time on the large machine later today. I'll give this a
> try then.

I ran a boot with this patch applied:

diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 83e80ab..c84aea0 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -45,7 +45,7 @@
 #include <linux/timecounter.h>
 
 #define MAX_MSIX_P_PORT 17
-#define MAX_MSIX 64
+#define MAX_MSIX 8192
 #define MSIX_LEGACY_SZ 4
 #define MIN_MSIX_P_PORT 5

I went for a max of 8192, since I was actually booting the machine with
6144 cores (not 4096) for this run. It doesn't look like this fixed the
problem. I still saw the same errors during boot.

FWIW, the module does appear to still successfully load:

8<---
# lsmod | grep mlx
mlx4_ib               151552  0
ib_sa                  32768  1 mlx4_ib
ib_mad                 49152  2 ib_sa,mlx4_ib
ib_core               102400  3 ib_sa,mlx4_ib,ib_mad
mlx4_core             278528  1 mlx4_ib
--->8

If the module loading is good enough, and we should just ignore the
errors, then I'm fine with that. Just wanting to make sure that
everything is behaving correctly.

- Alex

^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [BUG] mellanox IB driver fails to load on large config
2015-07-21 2:56 ` Alex Thorlton
@ 2015-07-21 14:21 ` Matan Barak
0 siblings, 0 replies; 11+ messages in thread
From: Matan Barak @ 2015-07-21 14:21 UTC (permalink / raw)
To: Alex Thorlton
Cc: Or Gerlitz, Or Gerlitz, andrew banman, Linux Kernel, Doug Ledford,
Sean Hefty, Hal Rosenstock, David S. Miller, Roland Dreier,
Matan Barak, Moni Shoua, Jack Morgenstein, Yishai Hadas,
Eran Ben Elisha, Ira Weiny, linux-rdma@vger.kernel.org

On Tue, Jul 21, 2015 at 5:56 AM, Alex Thorlton <athorlton@sgi.com> wrote:
> On Mon, Jul 20, 2015 at 11:28:03AM -0500, Alex Thorlton wrote:
>> I've got some time on the large machine later today. I'll give this a
>> try then.
>
> I ran a boot with this patch applied:
>
> diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
> index 83e80ab..c84aea0 100644
> --- a/include/linux/mlx4/device.h
> +++ b/include/linux/mlx4/device.h
> @@ -45,7 +45,7 @@
> #include <linux/timecounter.h>
>
> #define MAX_MSIX_P_PORT 17
> -#define MAX_MSIX 64
> +#define MAX_MSIX 8192
> #define MSIX_LEGACY_SZ 4
> #define MIN_MSIX_P_PORT 5
>
> I went for a max of 8192, since I was actually booting the machine with
> 6144 cores (not 4096) for this run. It doesn't look like this fixed the
> problem. I still saw the same errors during boot.
>
> FWIW, the module does appear to still successfully load:
>
> 8<---
> # lsmod | grep mlx
> mlx4_ib 151552 0
> ib_sa 32768 1 mlx4_ib
> ib_mad 49152 2 ib_sa,mlx4_ib
> ib_core 102400 3 ib_sa,mlx4_ib,ib_mad
> mlx4_core 278528 1 mlx4_ib
> --->8
>
> If the module loading is good enough, and we should just ignore the
> errors, then I'm fine with that. Just wanting to make sure that
> everything is behaving correctly.

It shouldn't be a problem, as all unused/erroneous EQs get "-1".
We'll try to reproduce the problem here, it might take awhile though.

Thanks for checking this,
Matan

>
> - Alex
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 11+ messages in thread