linux-rdma.vger.kernel.org archive mirror
From: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Christoph Lameter <cl-vYTEC60ixJUAvxtiuMwx3w@public.gmane.org>,
	Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Re: [PATCH v1 for-next 0/7] Add support for multicast loopback prevention to mlx4
Date: Wed, 7 Oct 2015 11:28:12 -0400
Message-ID: <56153A0C.4040006@redhat.com>
In-Reply-To: <CAJ3xEMiNk2ZXgg_Ji5dg+ahL_D5_RAvEzEEhMsPDjC-PJYDu7Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 10/06/2015 05:49 PM, Or Gerlitz wrote:
> On Wed, Oct 7, 2015 at 12:26 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
>> Nothing so simple unfortunately.  And it isn't an IB/RoCE cluster, it's
>> IB/IB/OPA/RoCE/IWARP cluster.  Regardless though, that's not my problem
>> and what I'm chasing.
> 
> To be precise no two transports out of IB/RoCE/iWARP/OPA are
> inter-operable, so these are "just" different cards/transports under
> the same IB core on this cluster.

Except that some machines have links to as many as four of the different
fabrics and so a problem in one can affect testing of others.

>> Yes, I know how to do DOA testing.
> 
> So what's dead in your env after (say) 59m of examination?

It's not dead after 59m, it's DOA immediately.  And it's iSER.

But the details are much more complex than iSER is DOA.  It was DOA when
running a rhel7 kernel (internal, for next kernel, not a release
kernel).  That kernel is pretty close to upstream.  When I went to put
an upstream kernel on there to see if it had the same issue, the
upstream kernel on that machine oopses on boot.  It oopses in list_add,
but the backtrace doesn't list any usable information about who called
list_add with bogus data.  However, reliably, right before the oops, the
csiostor driver fails to load properly, so I'm going with that being the
likely culprit.  But each iteration is slow because when the rhel7
kernel iSER does its thing, it causes a hung reboot, but it also
crashes the iDRAC in the machine (errant drivers crashing a baseboard
management controller is never a good sign), so the reboot must be done
via a hard power cycle.  When the upstream kernel oopses on boot, at
least the iDRAC is still working.  As a result, each test iteration is
pretty slow.

There we go, bootup on a 4.3-rc4 kernel with cxgb4 FCoE driver disabled
succeeded.  A hurdle passed.  Now I can test upstream iSER.

With an upstream kernel, the drive is still read-only with iSER (it's
not configured that way to the best of my knowledge, but I'm using
auto-generated ACLs; I'm getting ready to switch the system to specific
ACLs instead), but the thread isn't stuck in D state, so that's an
improvement.

However, the machine is still crashing the iDRAC on reboot.  I can't be
certain whether it's the SRP target or the iSER target causing this, as
both were brought up live at the same time, and reboot cycles without
either of them work fine.  So I have more investigation to do before I
know exactly what's going on.  And as I pointed out, each iteration is
slow :-/

>>> What we do know that needs fixing for 4.3-rc
>>> --> RoCE, you need the patch re-posted by Haggai few hours ago
>>> "IB/cma: Accept connection without a valid netdev on RoCE" -- without
>>> it, RoCE isn't working.
> 
>> I have that already.  It's available on both github and k.o and just
>> waiting for a pull request.
> 
> Maybe wait to get the fixes for the non-default pkey on mlx5 (see more below)?
> 
> Did you actually note that before Haggai posted the patch?!

No.

> Once I realized how deep the breakage was, I became sort of very
> worried re your testing env not shouting hard at us that something is
> broken even before 4.3-rc1

My test environment has been down for upgrades.  In the last little bit
we've brought a second rack online, added 10 new machines, 3 new
switches, and moved existing machines around between the two racks in
order to more evenly balance the need for each port type across switches
in the two racks.  There's been more than that going on behind the
scenes here too, but it's not really worth getting into all of it.
Suffice it to say I've been working on A) expanding the cluster, B)
expanding the things the cluster is configured to do and therefore able
to test, and C) finding a way to get upstream code into this testing
framework since it was previously all rhel/fedora centric.

And this test infrastructure goes down by COB Thursday of this week and
won't be back for a week because it's being used for NFSoRDMA testing at
this fall's Bake-a-thon.

>>> --> **mlx5** devices and non-default IB pkeys, Haggai and Co are
>>> working on a fix since this hasn't been working since 4.3-rc1. I told
>>> them we need it by rc5.5 (i.e. a few days before rc6), and if not, we
>>> will have to revert some 4.3-rc1 bits.
> 
>> I already have one patch related to this in my repo as well.  The 0day
>> testing just came back and it's all good.
> 
> I suspect that you don't...

I meant build tests passed, not run tests.

> do you have rping up and running between
> mlx4 and mlx5 on non default pkey? the breakage is a bit tricky and
> you might not see it if you run mlx5 against mlx5, BTW which patch is
> that?


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



