From mboxrd@z Thu Jan  1 00:00:00 1970
From: swise@opengridcomputing.com (Steve Wise)
Date: Thu, 9 Jun 2016 09:29:33 -0500
Subject: nvme-fabrics: crash at nvme connect-all
In-Reply-To: <006a01d1c25a$5b23c0d0$116b4270$@opengridcomputing.com>
References: <53708289.31891804.1465463883806.JavaMail.zimbra@kalray.eu>
 <575936F0.9000600@lightbits.io>
 <574056153.32082017.1465466832847.JavaMail.zimbra@kalray.eu>
 <57594E81.9060302@lightbits.io>
 <1218382158.32228335.1465474321289.JavaMail.zimbra@kalray.eu>
 <5759614D.5080703@lightbits.io>
 <004901d1c252$b5978d10$20c6a730$@opengridcomputing.com>
 <005701d1c253$f9590550$ec0b0ff0$@opengridcomputing.com>
 <575973A4.9080001@lightbits.io>
 <006501d1c258$835dacc0$8a190640$@opengridcomputing.com>
 <006a01d1c25a$5b23c0d0$116b4270$@opengridcomputing.com>
Message-ID: <007501d1c25b$574f43c0$05edcb40$@opengridcomputing.com>

> > >
> > > >>> Steve, did you see this before? I'm wondering if we need some sort
> > > >>> of logic handling with resource limitation in iWARP (global mrs
> > > >>> pool...)
> > > >>
> > > >> Haven't seen this.  Does 'cat /sys/kernel/debug/iw_cxgb4/blah/stats'
> > > >> show anything interesting?  Where/why is it crashing?
> > > >>
> > > >
> > > > So this is the failure:
> > > >
> > > > [  703.239462] rdma_rw_init_mrs: failed to allocated 128 MRs
> > > > [  703.239498] failed to init MR pool ret= -12
> > > > [  703.239541] nvmet_rdma: failed to create_qp ret= -12
> > > > [  703.239582] nvmet_rdma: nvmet_rdma_alloc_queue: creating RDMA queue failed (-12).
> > > >
> > > > Not sure why it would fail.  I would think my setup would be
> > > > allocating more given I have 16 cores on the host and target.  The
> > > > debugfs "stats" file I mentioned above should show us something if
> > > > we're running out of adapter resources for MR or PBL records.
> > >
> > > Note that Marta ran both the host and the target on the same machine.
> > > So, 8 (cores) x 128 (queue entries) x 2 (host and target) gives 2048
> > > MRs...
> > >
> > > What is the T5 limitation?
> >
> > It varies based on a config file that gets loaded when cxgb4 loads.  Note
> > the error has nothing to do with the low fastreg sg depth limit of T5.  If
> > we were hitting that then we would be seeing EINVAL and not ENOMEM.
> > Looking at c4iw_alloc_mr(), the ENOMEM paths are either failures from
> > kzalloc() or dma_alloc_coherent(), or failures to allocate adapter
> > resources for MR and PBL records.  Each MR takes a 32B record in adapter
> > mem, and the PBL takes whatever based on the max sg depth (roughly
> > sg_depth * 8 + some rounding up).  The debugfs "stats" file will show us
> > what is being exhausted and how much adapter mem is available for these
> > resources.
> >
> > Also, the amount of available adapter mem depends on the type of T5 adapter.
> > The T5 adapter info should be in the dmesg log when cxgb4 is loaded.
> >
> > Steve
> 
> Here is an example of the iw_cxgb4 debugfs "stats" output.  This is for a
> T580-CR with the "default" configuration, which means there is no config file
> named t5-config.txt in /lib/firmware/cxgb4/.
> 
> [root@stevo1 linux-2.6]# cat /sys/kernel/debug/iw_cxgb4/0000\:82\:00.4/stats
>    Object:      Total    Current        Max       Fail
>      PDID:      65536          0          0          0
>       QID:      24576          0          0          0
>    TPTMEM:   36604800          0          0          0
>    PBLMEM:   91512064          0          0          0
>    RQTMEM:  128116864          0          0          0
>   OCQPMEM:          0          0          0          0
>   DB FULL:          0
>  DB EMPTY:          0
>   DB DROP:          0
>  DB State: NORMAL Transitions 0 FC Interruptions 0
> TCAM_FULL:          0
> ACT_OFLD_CONN_FAILS:          0
> PAS_OFLD_CONN_FAILS:          0
> NEG_ADV_RCVD:          0
> AVAILABLE IRD:     589824
> 
> Note it shows the total, currently allocated, max ever allocated, and failures
> for each rdma resource, most of which are tied to HW resources.  So if we see
> failures, then we know the adapter resources were exhausted.
> 
> TPTMEM is the available adapter memory for MR records.  Each record is 32B.
> So a total of 1143900 MRs (TPTMEM / 32) can be created.  The PBLMEM resource
> is for holding the dma addresses for all pages in a MR, so each MR uses some
> number depending on the sg depth passed in when allocating a FRMR.  So if we
> allocate 128 deep page lists, we should be able to allocate 89367 PBLs
> (PBLMEM / 8 / 128).
> 
> Seems like we shouldn't be exhausting the adapter resources with 2048 MRs...
> 
> Steve
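
As a sanity check on the arithmetic quoted above, here is a minimal Python
sketch.  It only uses the numbers from the quoted "stats" output and the
8 x 128 x 2 estimate from earlier in the thread; the 128-deep page list is
an assumption, and the "some rounding up" of each PBL is ignored, so treat
the results as rough upper bounds rather than exact adapter limits.

```python
# Rough capacity check using the "Total" column of the quoted stats output.
TPTMEM_BYTES = 36604800   # adapter memory for MR records (TPTMEM total)
PBLMEM_BYTES = 91512064   # adapter memory for page lists (PBLMEM total)
MR_RECORD_BYTES = 32      # each MR takes a 32B record
PBL_ENTRY_BYTES = 8       # one DMA address per page
SG_DEPTH = 128            # assumed page-list depth per MR

# Upper bounds; per-PBL rounding is ignored here.
max_mrs = TPTMEM_BYTES // MR_RECORD_BYTES
max_pbls = PBLMEM_BYTES // (PBL_ENTRY_BYTES * SG_DEPTH)

# Estimate from the thread: 8 cores x 128 queue entries x 2 (host + target).
needed_mrs = 8 * 128 * 2

print("max MRs:   ", max_mrs)
print("max PBLs:  ", max_pbls)
print("needed MRs:", needed_mrs)
```

With those inputs the needed count is well under both bounds, which is why
exhaustion looks unlikely with the default configuration.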

I don't see this on my 16 core/64GB memory node.  I successfully did a
discover/connect-all with the target/host on the same node with 7 target
devices w/o any errors.  Note I'm using the nvmf-all.2 branch Christoph set up
yesterday.

Marta, I need to learn more about your T5 setup and the "stats" file output.
Thanks!
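
For reference, a rough sketch of how the four-column rows of that "stats"
file could be scanned for a non-zero Fail count.  The helper names
(parse_stats, exhausted) are hypothetical, and the parsing assumes the layout
shown in the example output quoted above; other driver versions may format
the file differently.

```python
import re

# Excerpt of the example output quoted above (T580-CR, default config).
SAMPLE = """\
   Object:      Total    Current        Max       Fail
     PDID:      65536          0          0          0
      QID:      24576          0          0          0
   TPTMEM:   36604800          0          0          0
   PBLMEM:   91512064          0          0          0
   RQTMEM:  128116864          0          0          0
  OCQPMEM:          0          0          0          0
  DB FULL:          0
"""

ROW = re.compile(r"^([A-Z]+):\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")

def parse_stats(text):
    """Return {object: (total, current, max, fail)} for four-column rows."""
    rows = {}
    for line in text.splitlines():
        # Tolerate mail quoting ("> ") and leading padding.
        m = ROW.match(line.lstrip("> ").strip())
        if m:
            rows[m.group(1)] = tuple(int(v) for v in m.groups()[1:])
    return rows

def exhausted(rows):
    """Names of resources whose Fail counter is non-zero."""
    return [name for name, (_t, _c, _m, fail) in rows.items() if fail > 0]

rows = parse_stats(SAMPLE)
print(exhausted(rows) or "no allocation failures recorded")
```

Run against the file after a failed connect-all, any name it prints (for
example TPTMEM or PBLMEM) would point at the adapter resource that ran out.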

Steve.