public inbox for linux-rdma@vger.kernel.org
* Upstream mlx4 driver very broken (when using SRIOV)
@ 2015-06-13  5:35 Doug Ledford
From: Doug Ledford @ 2015-06-13  5:35 UTC (permalink / raw)
  To: Or Gerlitz, Amir Vadai, linux-rdma


I ran across a problem today when I went to run some tests of my
for-4.2 tree.  For a second there, I was about to seriously have a
conniption fit.  But, after about 6 hours of work bisecting and
debugging, I've come to find that I wasn't so crazy after all.

When I went to install my for-4.2 tree, IPoIB was totally busted, as in
DOA.  I knew I had checked the 4.1 code I submitted to Linus, but I
wanted a good starting point for a bisection, so I compiled a
kernel from my for-4.1-rc branch.  And it was DOA too.  That seriously
unnerved me because I knew I tested that code.  I did a number of manual
checkouts at possibly suspicious code points, and none of them resolved
the problem.  Then I started doing some debugging on
both the afflicted machine and on the opensm server.  I finally saw that
the afflicted machine was claiming that it was attempting to join the
multicast group, but was reporting error 110 (ETIMEDOUT).  The opensm
server was not seeing the requests at all.

Long story short, I did my testing in the 4.1 merge window and rc phase
on machines without SRIOV enabled, but when you enable SRIOV in the mlx4
driver, the current driver seems to have broken QP0/QP1 multiplexing
support because the host becomes unable to join the IPoIB multicast
groups.  In addition, with SRIOV enabled, mlx4_en throws corruption
errors on reboot and requires that the machine be power cycled as
opposed to rebooting cleanly.  From what I can tell, the 4.0 release
kernel has this problem too, and it still exists at least as far as
4.1-rc7 + all of my queued up -next patches.

From my /etc/modprobe.d/mlx4.conf file if you want to try and duplicate:

options mlx4_core probe_vf=0 num_vfs=7 port_type_array=1,2
options mlx4_en pfctx=0x28 pfcrx=0x28
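
For anyone trying to reproduce this, here is a minimal sketch for
sanity-checking those option lines before loading the modules.  It only
parses text in the "options <module> key=value" form shown above.  The
helper names (parse_modprobe_options, priorities_in_mask) are
hypothetical, and reading pfctx/pfcrx as a per-priority bitmask is an
assumption about how mlx4_en interprets those parameters, not something
stated in this report.

```python
def parse_modprobe_options(text):
    """Parse 'options <module> key=value ...' lines into {module: {key: value}}."""
    result = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) < 2 or fields[0] != "options":
            continue
        params = result.setdefault(fields[1], {})
        for item in fields[2:]:
            key, _, value = item.partition("=")
            params[key] = value
    return result


def priorities_in_mask(mask):
    """Return the bit positions (0-7) set in an 8-bit mask.

    Assumption: each bit of pfctx/pfcrx corresponds to one priority.
    """
    return [bit for bit in range(8) if mask & (1 << bit)]


# The two lines quoted from /etc/modprobe.d/mlx4.conf.
CONF = """\
options mlx4_core probe_vf=0 num_vfs=7 port_type_array=1,2
options mlx4_en pfctx=0x28 pfcrx=0x28
"""

opts = parse_modprobe_options(CONF)
num_vfs = int(opts["mlx4_core"]["num_vfs"])
port_types = opts["mlx4_core"]["port_type_array"].split(",")
pfc_tx = priorities_in_mask(int(opts["mlx4_en"]["pfctx"], 16))

print("VFs requested:", num_vfs)
print("port types:", port_types)
print("PFC TX bits set:", pfc_tx)
```

The two port_type_array values differ, which matches the mixed IB/Eth
setup described below.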

And I'm guessing that your internal regression tests must not have a
machine in IB/Eth SRIOV mode as a standard config.  I would consider
adding it to the mix.  I have it myself, but only on a few machines and
I don't always use them for initial testing.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD




end of thread, other threads:[~2015-06-19 10:41 UTC | newest]

Thread overview: 6+ messages
2015-06-13  5:35 Upstream mlx4 driver very broken (when using SRIOV) Doug Ledford
2015-06-13  7:18   ` Or Gerlitz
2015-06-13 21:02         ` Or Gerlitz
2015-06-14 14:31   ` Or Gerlitz
2015-06-19  0:57       ` Doug Ledford
2015-06-19 10:41           ` Or Gerlitz
