Linux RDMA and InfiniBand development


* Re: NFSD generic R/W API (sendto path) performance results
From: Chuck Lever @ 2016-11-17 20:20 UTC (permalink / raw)
  To: Steve Wise; +Cc: Christoph Hellwig, Sagi Grimberg, List Linux RDMA Mailing
In-Reply-To: <01f001d2410d$b16c97f0$1445c7d0$@opengridcomputing.com>


> On Nov 17, 2016, at 3:03 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
> 
>>> 
>>> On Nov 17, 2016, at 10:04 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
>>> 
>>>> On Nov 17, 2016, at 7:46 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
>>>> Also did you try to always register for > max_sge
>>>> calls?  The code can already register all segments with the
>>>> rdma_rw_force_mr module option, so it would only need a small tweak for
>>>> that behavior.
>>> 
>>> For various reasons I decided the design should build one WR chain for
>>> each RDMA segment provided by the client. Good clients expose just
>>> one RDMA segment for the whole NFS READ payload.
>>> 
>>> Does force_mr make the generic API use FRWR with RDMA Write? I had
>>> assumed it changed only the behavior with RDMA Read. I'll try that
>>> too, if RDMA Write can easily be made to use FRWR.
>> 
>> Unfortunately, some RPC replies are formed from two or three
>> discontiguous buffers. The gap test in ib_sg_to_pages returns
>> a smaller number than sg_nents in this case, and rdma_rw_init_ctx
>> fails.
>> 
>> Thus with my current prototype I'm not able to test with FRWR.
>> 
>> I could fix this in my prototype, but it would be nicer for me if
>> rdma_rw_init_ctx handled this case the same for FRWR as it does
>> for physical addressing, which doesn't seem to have any problem
>> with a discontiguous SGL.
> 
> Just to make sure I'm understanding you, for rdma-rw to handle this,  it would
> have to use multiple REG_MR registrations, one for each contiguous area in the
> scatter list.  
> 
> Right?

Right, that's the approach the NFS client takes. See
net/sunrpc/xprtrdma/frwr_ops.c :: frwr_op_map.

If the passed-in memory list isn't contiguous, frwr_op_map stops
registering and returns to the caller, who allocates another
MR and calls in again with the remaining part of the list.
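
A minimal userspace sketch of that loop, for illustration only. It is not
the code in frwr_ops.c or ib_sg_to_pages: struct seg, mappable_prefix()
and mrs_needed() are names invented here, and the gap rule is a simplified
model in which an entry may follow its predecessor in the same registration
only if the boundary between them is page-aligned on both sides or the two
entries are adjacent in address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct seg {
	uint64_t addr;	/* DMA address of the first byte */
	uint32_t len;	/* length in bytes */
};

/*
 * Return how many leading entries of segs[] fit into one page-based
 * registration. Mapping stops at the first "gap": a boundary where the
 * previous entry does not end on a page boundary, or the next entry
 * does not start on one, and the two are not adjacent in address.
 */
static size_t mappable_prefix(const struct seg *segs, size_t n,
			      uint64_t page_size)
{
	uint64_t off_mask = page_size - 1;
	size_t i;

	/* page_size must be a non-zero power of two */
	assert(page_size && (page_size & off_mask) == 0);

	for (i = 1; i < n; i++) {
		uint64_t prev_end = segs[i - 1].addr + segs[i - 1].len;
		int prev_end_aligned = (prev_end & off_mask) == 0;
		int start_aligned = (segs[i].addr & off_mask) == 0;

		if ((!prev_end_aligned || !start_aligned) &&
		    prev_end != segs[i].addr)
			break;	/* gap: the rest needs another MR */
	}
	return n ? i : 0;
}

/*
 * Caller side: keep registering the remaining part of the list with a
 * fresh MR until every entry is covered. Returns the number of MRs used.
 */
static size_t mrs_needed(const struct seg *segs, size_t n,
			 uint64_t page_size)
{
	size_t mrs = 0;

	while (n) {
		size_t done = mappable_prefix(segs, n, page_size);

		segs += done;
		n -= done;
		mrs++;
	}
	return mrs;
}
```

Under this model, with 4 KB pages, a reply built from a short head, a
page-aligned payload and an unaligned tail needs three registrations,
while a page-aligned payload on its own needs one.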

I think this would not apply to SG_GAP MRs, which should
already be able to handle discontiguous SGLs?

Note this doesn't apply to most NFS READs, where just the data
payload is going via RDMA Write, and the payload is already in
a contiguous piece of memory. But Reply chunks, which are used
for READDIRs and other requests, can be built from discontiguous
memory.

I haven't looked closely at the RDMA Read logic, but I think
it always reads into a contiguous set of pages, then builds
the xdr_buf out of that. It shouldn't have the same problem
(and it is already known to work with FRWR ;-).


--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: NFSD generic R/W API (sendto path) performance results
From: Sagi Grimberg @ 2016-11-17 20:20 UTC (permalink / raw)
  To: Chuck Lever, Christoph Hellwig; +Cc: Steve Wise, List Linux RDMA Mailing
In-Reply-To: <676323E9-2F30-4DB0-AEF8-CDE38E8A0715-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>


>>> Also did you try to always register for > max_sge
>>> calls?  The code can already register all segments with the
>>> rdma_rw_force_mr module option, so it would only need a small tweak for
>>> that behavior.
>>
>> For various reasons I decided the design should build one WR chain for
>> each RDMA segment provided by the client. Good clients expose just
>> one RDMA segment for the whole NFS READ payload.
>>
>> Does force_mr make the generic API use FRWR with RDMA Write? I had
>> assumed it changed only the behavior with RDMA Read. I'll try that
>> too, if RDMA Write can easily be made to use FRWR.
>
> Unfortunately, some RPC replies are formed from two or three
> discontiguous buffers. The gap test in ib_sg_to_pages returns
> a smaller number than sg_nents in this case, and rdma_rw_init_ctx
> fails.
>
> Thus with my current prototype I'm not able to test with FRWR.
>
> I could fix this in my prototype, but it would be nicer for me if
> rdma_rw_init_ctx handled this case the same for FRWR as it does
> for physical addressing, which doesn't seem to have any problem
> with a discontiguous SGL.
>
>
>> But I'd like a better explanation for this result. Could be a bug
>> in my implementation, my design, or in the driver. Continuing to
>> investigate.

Hi Chuck, sorry for the late reply (have been busy lately..)

I think that the Call-to-first-Write phenomenon you are seeing makes
perfect sense. The question is: is the latency of QD=1 1M transfers that
interesting? Did you see a positive effect on small (say 4k) transfers?
Both latency and IOPS scalability should improve, especially
when serving multiple clients.

If indeed you feel that this is an interesting workload to optimize, I
think we can come up with something.

About the First-to-last-Write, that's weird, and sounds like a bug
somewhere. Maybe Mellanox folks can tell us if splitting 1M into multiple
writes works better (although I cannot comprehend why).

Question, are the send and receive cqs still in IB_POLL_SOFTIRQ mode?

^ permalink raw reply

* RE: NFSD generic R/W API (sendto path) performance results
From: Steve Wise @ 2016-11-17 20:03 UTC (permalink / raw)
  To: 'Chuck Lever', 'Christoph Hellwig'
  Cc: 'Sagi Grimberg', 'List Linux RDMA Mailing'
In-Reply-To: <676323E9-2F30-4DB0-AEF8-CDE38E8A0715-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

> 
> > On Nov 17, 2016, at 10:04 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> >
> >> On Nov 17, 2016, at 7:46 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
> >> Also did you try to always register for > max_sge
> >> calls?  The code can already register all segments with the
> >> rdma_rw_force_mr module option, so it would only need a small tweak for
> >> that behavior.
> >
> > For various reasons I decided the design should build one WR chain for
> > each RDMA segment provided by the client. Good clients expose just
> > one RDMA segment for the whole NFS READ payload.
> >
> > Does force_mr make the generic API use FRWR with RDMA Write? I had
> > assumed it changed only the behavior with RDMA Read. I'll try that
> > too, if RDMA Write can easily be made to use FRWR.
> 
> Unfortunately, some RPC replies are formed from two or three
> discontiguous buffers. The gap test in ib_sg_to_pages returns
> a smaller number than sg_nents in this case, and rdma_rw_init_ctx
> fails.
> 
> Thus with my current prototype I'm not able to test with FRWR.
> 
> I could fix this in my prototype, but it would be nicer for me if
> rdma_rw_init_ctx handled this case the same for FRWR as it does
> for physical addressing, which doesn't seem to have any problem
> with a discontiguous SGL.

Just to make sure I'm understanding you, for rdma-rw to handle this,  it would
have to use multiple REG_MR registrations, one for each contiguous area in the
scatter list.  

Right?


^ permalink raw reply

* Re: [PULL REQUEST] Please pull rdma.git
From: Leon Romanovsky @ 2016-11-17 20:02 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma
In-Reply-To: <582E089A.3040106-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1237 bytes --]

On Thu, Nov 17, 2016 at 02:44:26PM -0500, Doug Ledford wrote:
> On 11/17/16 1:49 PM, Leon Romanovsky wrote:
> > On Thu, Nov 17, 2016 at 07:13:54AM -0500, Doug Ledford wrote:
> >> Hi Linus,
> >>
> >> Due to various issues, I've been away and couldn't send a pull request
> >> for about three weeks.  There were a number of -rc patches that built up
> >> in the meantime (some were there already from the early -rc stages).
> >> Obviously, there were way too many to send now, so I tried to pare the
> >> list down to the more important patches for the -rc cycle.
> >
> > Hi Doug,
> > Are you adding the rest to your for-next branch? We would like to have
> > enough time to check that nothing is lost.
> >
> > Thanks
> >
>
> Yes, it's already there in the mlx-next branch on github.

Thanks,
Do I understand correctly that this is not the final version and that you
haven't uploaded the general branch yet?

There is a patch for the MAD which would be great to see in the list too.
https://patchwork.kernel.org/patch/9382745/
http://marc.info/?l=linux-rdma&m=147681123708587&w=2

Thanks

>
> --
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>    GPG Key ID: 0E572FDD
>   Red Hat, Inc.
>   100 E. Davie St
>   Raleigh, NC 27601 USA
>




[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PULL REQUEST] Please pull rdma.git
From: Doug Ledford @ 2016-11-17 19:44 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: linux-rdma
In-Reply-To: <20161117184950.GP4240-2ukJVAZIZ/Y@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 866 bytes --]

On 11/17/16 1:49 PM, Leon Romanovsky wrote:
> On Thu, Nov 17, 2016 at 07:13:54AM -0500, Doug Ledford wrote:
>> Hi Linus,
>>
>> Due to various issues, I've been away and couldn't send a pull request
>> for about three weeks.  There were a number of -rc patches that built up
>> in the meantime (some were there already from the early -rc stages).
>> Obviously, there were way too many to send now, so I tried to pare the
>> list down to the more important patches for the -rc cycle.
> 
> Hi Doug,
> Are you adding the rest to your for-next branch? We would like to have
> enough time to check that nothing is lost.
> 
> Thanks
> 

Yes, it's already there in the mlx-next branch on github.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>    GPG Key ID: 0E572FDD
  Red Hat, Inc.
  100 E. Davie St
  Raleigh, NC 27601 USA


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 907 bytes --]

^ permalink raw reply

* Re: NFSD generic R/W API (sendto path) performance results
From: Chuck Lever @ 2016-11-17 19:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Steve Wise, Sagi Grimberg, List Linux RDMA Mailing
In-Reply-To: <84B43CFF-EBF7-4758-8751-8C97102C5BCF-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>


> On Nov 17, 2016, at 10:04 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> 
>> On Nov 17, 2016, at 7:46 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
>> Also did you try to always register for > max_sge
>> calls?  The code can already register all segments with the
>> rdma_rw_force_mr module option, so it would only need a small tweak for
>> that behavior.
> 
> For various reasons I decided the design should build one WR chain for
> each RDMA segment provided by the client. Good clients expose just
> one RDMA segment for the whole NFS READ payload.
> 
> Does force_mr make the generic API use FRWR with RDMA Write? I had
> assumed it changed only the behavior with RDMA Read. I'll try that
> too, if RDMA Write can easily be made to use FRWR.

Unfortunately, some RPC replies are formed from two or three
discontiguous buffers. The gap test in ib_sg_to_pages returns
a smaller number than sg_nents in this case, and rdma_rw_init_ctx
fails.

Thus with my current prototype I'm not able to test with FRWR.

I could fix this in my prototype, but it would be nicer for me if
rdma_rw_init_ctx handled this case the same for FRWR as it does
for physical addressing, which doesn't seem to have any problem
with a discontiguous SGL.


> But I'd like a better explanation for this result. Could be a bug
> in my implementation, my design, or in the driver. Continuing to
> investigate.


--
Chuck Lever




^ permalink raw reply

* Re: [PULL REQUEST] Please pull rdma.git
From: Leon Romanovsky @ 2016-11-17 18:49 UTC (permalink / raw)
  To: Doug Ledford; +Cc: linux-rdma
In-Reply-To: <58466423-c87e-3921-101e-bffab8989fd8-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 562 bytes --]

On Thu, Nov 17, 2016 at 07:13:54AM -0500, Doug Ledford wrote:
> Hi Linus,
>
> Due to various issues, I've been away and couldn't send a pull request
> for about three weeks.  There were a number of -rc patches that built up
> in the meantime (some were there already from the early -rc stages).
> Obviously, there were way too many to send now, so I tried to pare the
> list down to the more important patches for the -rc cycle.

Hi Doug,
Are you adding the rest to your for-next branch? We would like to have
enough time to check that nothing is lost.

Thanks

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* RE: rdma-core release process questions
From: Nikolova, Tatyana E @ 2016-11-17 18:21 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161117102846.GJ4240-2ukJVAZIZ/Y@public.gmane.org>

Hi Leon and Jason,

Thank you for the information.

Tatyana

-----Original Message-----
From: Leon Romanovsky [mailto:leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org] 
Sent: Thursday, November 17, 2016 4:29 AM
To: Nikolova, Tatyana E <tatyana.e.nikolova-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>; Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Subject: Re: rdma-core release process questions

On Thu, Nov 17, 2016 at 04:56:45AM +0000, Nikolova, Tatyana E wrote:
> Hi,
>
> We are submitting patches to the kernel space driver i40iw and to the user space plugin libi40iw, which is currently part of rdma-core. Some of the changes need to be coordinated so that they appear in both kernel space and user space in corresponding releases. We have some questions regarding the process about submitting patches to rdma-core which have dependencies on kernel patches.
>
> 1) Can user space patches target a for-next rdma-core release, if the corresponding kernel patches are queued for the next kernel?

I don't see any problem with that. Once the patches are accepted into -next by Doug, they can be accepted into rdma-core too. In any case, these changes should remain compatible with older kernels that lack the new feature.

> 2) How are ABI changes handled in rdma-core?

Do you have a specific thing in mind?
Generally speaking, send them to the ML, pass review, and we will apply them.

>
> 3) Could you explain the release process for rdma-core?

The process as agreed will be something like this:
a. Review/accept/decline patches within a 1-2 week time frame.
b. Once the kernel is released, stop accepting new features.
c. Wait 1-2 weeks to see that no one complains. This is just to be on the safe side, because the library is always ready for release and is checked constantly.
d. Create a new tag and push the release.

>
> 4) Is each rdma-core release going to correspond to a specific kernel version?

As Jason wrote, it will be aligned in release time with the kernel, but the library should remain backward compatible.

>
> Thank you,
> Tatyana

^ permalink raw reply

* Re: [PATCH rdma-next 0/4] Add packet pacing support for IB verbs
From: Leon Romanovsky @ 2016-11-17 18:15 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1477909297-14491-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 2661 bytes --]

On Mon, Oct 31, 2016 at 12:21:33PM +0200, Leon Romanovsky wrote:
> When sending from a 10G host to a 1G host, it is easy to overrun the receiver,
> leading to packet loss and traffic backing off. Similar problems occur when
> a 10G host sends data to a sub-10G virtual circuit, or a 40G host sends
> to a 10G host. Packet pacing can control the packet injection rate and reduce
> network congestion to maximize throughput & minimize network latency.
>
> Packet pacing is rate limiting and shaping for a QP (SQ for RAW QP); the rate
> is set and changed by modifying the QP. This patch series makes the
> following high level changes:
>  1. Report rate limit capabilities through user data. Reported capabilities
>     include: The maximum and minimum rate limit in kbps supported by packet
>     pacing; Bitmap showing which QP types are supported by packet pacing
>     operation.
>  2. Extend modify QP interface for growing attributes. Add rate limit support
>     to the extended interface.
>  3. Enable mlx5-based hardware to be able to update the rate limit for
>     RAW QP packet.
>
> Available in the "topic/packet_pacing" topic branch of this git repo:
> git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git
>
> Or for browsing:
> https://git.kernel.org/cgit/linux/kernel/git/leon/linux-rdma.git/log/?h=topic/packet_pacing
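
The cover letter above describes reported capabilities as a minimum and
maximum rate in kbps plus a bitmap of QP types that support pacing. A
rough sketch of how a caller might validate a requested rate against such
capabilities follows. Everything in it is hypothetical: struct pacing_caps,
its field names and pacing_rate_ok() are invented for illustration and are
not taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability record; field names are illustrative only. */
struct pacing_caps {
	uint32_t rate_min_kbps;		/* lowest supported rate limit */
	uint32_t rate_max_kbps;		/* highest supported rate limit */
	uint32_t supported_qp_types;	/* bit n set: QP type n can be paced */
};

/*
 * Return true if rate_kbps may be requested for a QP of type qp_type:
 * the type's bit must be set in the bitmap and the rate must lie in
 * the inclusive range [rate_min_kbps, rate_max_kbps].
 */
static bool pacing_rate_ok(const struct pacing_caps *caps,
			   unsigned int qp_type, uint32_t rate_kbps)
{
	assert(caps != NULL);

	if (qp_type >= 32)
		return false;	/* cannot be represented in the bitmap */
	if (!(caps->supported_qp_types & (UINT32_C(1) << qp_type)))
		return false;	/* this QP type does not support pacing */

	return rate_kbps >= caps->rate_min_kbps &&
	       rate_kbps <= caps->rate_max_kbps;
}
```

Checking the capabilities before modifying the QP lets the caller reject
an unsupported request early instead of relying on the modify call to fail.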

Hi Doug,

Please drop this patch series; we discovered an issue with the proposed
modify_qp implementation and will respin it.

Sorry for the inconvenience.

Thanks

>
> Thanks,
>   Bodong & Leon
>
> Bodong Wang (4):
>   IB/mlx5: Report mlx5 packet pacing capabilities when querying device
>   IB/core: Support rate limit for packet pacing
>   IB/uverbs: Extend modify_qp and support packet pacing
>   IB/mlx5: Update the rate limit according to user setting for RAW QP
>
>  drivers/infiniband/core/uverbs.h      |   1 +
>  drivers/infiniband/core/uverbs_cmd.c  | 178 +++++++++++++++++++++-------------
>  drivers/infiniband/core/uverbs_main.c |   1 +
>  drivers/infiniband/core/verbs.c       |   2 +
>  drivers/infiniband/hw/mlx5/main.c     |  16 ++-
>  drivers/infiniband/hw/mlx5/mlx5_ib.h  |   1 +
>  drivers/infiniband/hw/mlx5/qp.c       |  71 ++++++++++++--
>  include/rdma/ib_verbs.h               |   2 +
>  include/uapi/rdma/ib_user_verbs.h     |  12 +++
>  include/uapi/rdma/mlx5-abi.h          |  13 +++
>  10 files changed, 219 insertions(+), 78 deletions(-)
>
> --
> 2.7.4
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [patch] IB/rxe: Remove unneeded cast in rxe_srq_from_attr()
From: Yuval Shaia @ 2016-11-17 16:38 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Moni Shoua, Doug Ledford, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161117121554.GA4292-Hxa29pjIrETlQW142y8m19+IiqhCXseY@public.gmane.org>

Besides the soft-aggressive commit message -:)
Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
 
On Thu, Nov 17, 2016 at 02:00:05PM +0300, Dan Carpenter wrote:
> It makes me nervous when we cast pointer parameters.  I would estimate
> that around 50% of the time, it indicates a bug.  Here the cast is not
> needed because u32 and unsigned int are the same thing.  Removing the
> cast makes the code more robust and future proof in case any of the
> types change.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
> index 2a6e3cd..efc832a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_srq.c
> +++ b/drivers/infiniband/sw/rxe/rxe_srq.c
> @@ -169,7 +169,7 @@ int rxe_srq_from_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
>  			}
>  		}
>  
> -		err = rxe_queue_resize(q, (unsigned int *)&attr->max_wr,
> +		err = rxe_queue_resize(q, &attr->max_wr,
>  				       rcv_wqe_size(srq->rq.max_sge),
>  				       srq->rq.queue->ip ?
>  						srq->rq.queue->ip->context :

^ permalink raw reply

* Re: NFSD generic R/W API (sendto path) performance results
From: Chuck Lever @ 2016-11-17 15:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Steve Wise, Sagi Grimberg, List Linux RDMA Mailing
In-Reply-To: <20161117124602.GA25821-jcswGhMUV9g@public.gmane.org>


> On Nov 17, 2016, at 7:46 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
> 
> On Wed, Nov 16, 2016 at 02:45:33PM -0500, Chuck Lever wrote:
>> Out of curiosity, I hacked up my NFS client to limit the size of RDMA
>> segments to 30 pages (the server HCA's max_sge).
>> 
>> A 1MB NFS READ now takes 9 segments. That forces the after-conversion
>> server to build single-Write chains and use 9 post_send calls to
>> transmit the READ payload, just like the before-conversion server.
>> 
>> Performance of before- and after-conversion servers is now equivalent.
>> 
>>              kB  reclen    write  rewrite    read    reread
>>         2097152    1024  1061237  1141614  1961410  2000223                                                                                  
> 
> What HCA is this, btw?

ConnectX-3 Pro, f/w 2.31.5050


> Also did you try to always register for > max_sge
> calls?  The code can already register all segments with the
> rdma_rw_force_mr module option, so it would only need a small tweak for
> that behavior.

For various reasons I decided the design should build one WR chain for
each RDMA segment provided by the client. Good clients expose just
one RDMA segment for the whole NFS READ payload.

Does force_mr make the generic API use FRWR with RDMA Write? I had
assumed it changed only the behavior with RDMA Read. I'll try that
too, if RDMA Write can easily be made to use FRWR.

But I'd like a better explanation for this result. Could be a bug
in my implementation, my design, or in the driver. Continuing to
investigate.


--
Chuck Lever




^ permalink raw reply

* Re: [patch] IB/rxe: Remove unneeded cast in rxe_srq_from_attr()
From: Moni Shoua @ 2016-11-17 13:51 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Doug Ledford, Sean Hefty, Hal Rosenstock, linux-rdma,
	kernel-janitors
In-Reply-To: <20161117110005.GB32143@mwanda>

On Thu, Nov 17, 2016 at 1:00 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> It makes me nervous when we cast pointer parameters.  I would estimate
> that around 50% of the time, it indicates a bug.  Here the cast is not
> needed because u32 and unsigned int are the same thing.  Removing the
> cast makes the code more robust and future proof in case any of the
> types change.
>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Moni Shoua <monis@mellanox.com>
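
The reasoning in the commit message can be illustrated with a small
standalone sketch. The names below (struct srq_attr, queue_resize(),
srq_set_max_wr(), QUEUE_MAX_ELEMS) are hypothetical stand-ins rather than
the rxe code; the only thing carried over is that u32 is a typedef of
unsigned int, so a pointer to a u32 field can be passed where an
unsigned int pointer is expected without any cast.

```c
#include <assert.h>

typedef unsigned int u32;	/* same underlying type as the kernel's u32 */

/* Hypothetical attribute structure with a u32 field. */
struct srq_attr {
	u32 max_wr;
};

#define QUEUE_MAX_ELEMS 4096u	/* illustrative limit, not from the driver */

/*
 * Hypothetical callee: validates the requested element count, clamps it
 * to the limit, and writes the granted count back through the pointer.
 */
static int queue_resize(unsigned int *num_elem)
{
	assert(num_elem);

	if (*num_elem == 0)
		return -1;
	if (*num_elem > QUEUE_MAX_ELEMS)
		*num_elem = QUEUE_MAX_ELEMS;
	return 0;
}

static int srq_set_max_wr(struct srq_attr *attr)
{
	/*
	 * No cast here. If max_wr were ever changed to a 16-bit or
	 * 64-bit type, the compiler would diagnose this call. With an
	 * explicit (unsigned int *) cast the mismatch would compile
	 * silently and the callee would access the wrong number of bytes.
	 */
	return queue_resize(&attr->max_wr);
}
```

Leaving the cast out keeps the compiler's type check in force at the call
site, which is the "future proof" property the commit message refers to.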

^ permalink raw reply

* Re: NFSD generic R/W API (sendto path) performance results
From: Christoph Hellwig @ 2016-11-17 12:46 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Steve Wise, Christoph Hellwig, Sagi Grimberg,
	List Linux RDMA Mailing
In-Reply-To: <BA9DC9F7-C893-428B-AFE5-EFCCD13C9F25-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

On Wed, Nov 16, 2016 at 02:45:33PM -0500, Chuck Lever wrote:
> Out of curiosity, I hacked up my NFS client to limit the size of RDMA
> segments to 30 pages (the server HCA's max_sge).
> 
> A 1MB NFS READ now takes 9 segments. That forces the after-conversion
> server to build single-Write chains and use 9 post_send calls to
> transmit the READ payload, just like the before-conversion server.
> 
> Performance of before- and after-conversion servers is now equivalent.
> 
>               kB  reclen    write  rewrite    read    reread
>          2097152    1024  1061237  1141614  1961410  2000223                                                                                  
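
The "9 segments" figure quoted above follows from simple arithmetic,
sketched here under the assumption of 4 KB pages and a page-aligned
payload (neither is stated in the thread). segments_needed() is a
hypothetical helper written for this illustration: 1 MB is 256 pages,
and 256 pages at no more than 30 pages per segment rounds up to 9.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of RDMA segments needed to cover payload_bytes when each
 * segment may span at most max_pages pages of page_size bytes.
 * Assumes the payload starts on a page boundary.
 */
static size_t segments_needed(size_t payload_bytes, size_t page_size,
			      size_t max_pages)
{
	size_t pages;

	assert(page_size > 0 && max_pages > 0);

	/* round up to whole pages, then to whole segments */
	pages = (payload_bytes + page_size - 1) / page_size;
	return (pages + max_pages - 1) / max_pages;
}
```

With one Write chain per segment, the same count gives the number of
post_send calls the server issues for the READ payload in that experiment.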

What HCA is this, btw?  Also did you try to always register for > max_sge
calls?  The code can already register all segments with the
rdma_rw_force_mr module option, so it would only need a small tweak for
that behavior.

^ permalink raw reply

* Re: [patch] IB/rxe: Remove unneeded cast in rxe_srq_from_attr()
From: Yuval Shaia @ 2016-11-17 12:16 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Moni Shoua, Doug Ledford, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161117110005.GB32143@mwanda>

Besides the soft-aggressive commit message -:)
Reviewed-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

On Thu, Nov 17, 2016 at 02:00:05PM +0300, Dan Carpenter wrote:
> It makes me nervous when we cast pointer parameters.  I would estimate
> that around 50% of the time, it indicates a bug.  Here the cast is not
> needed because u32 and unsigned int are the same thing.  Removing the
> cast makes the code more robust and future proof in case any of the
> types change.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
> index 2a6e3cd..efc832a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_srq.c
> +++ b/drivers/infiniband/sw/rxe/rxe_srq.c
> @@ -169,7 +169,7 @@ int rxe_srq_from_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
>  			}
>  		}
>  
> -		err = rxe_queue_resize(q, (unsigned int *)&attr->max_wr,
> +		err = rxe_queue_resize(q, &attr->max_wr,
>  				       rcv_wqe_size(srq->rq.max_sge),
>  				       srq->rq.queue->ip ?
>  						srq->rq.queue->ip->context :

^ permalink raw reply

* [PULL REQUEST] Please pull rdma.git
From: Doug Ledford @ 2016-11-17 12:13 UTC (permalink / raw)
  To: Torvalds, Linus, linux-rdma


[-- Attachment #1.1: Type: text/plain, Size: 6015 bytes --]

Hi Linus,

Due to various issues, I've been away and couldn't send a pull request
for about three weeks.  There were a number of -rc patches that built up
in the meantime (some were there already from the early -rc stages).
Obviously, there were way too many to send now, so I tried to pare the
list down to the more important patches for the -rc cycle.  Most of the
code has had plenty of soak time at the various vendors' testing setups,
so I doubt there will be another -rc pull request this cycle.  I also
tried to limit the patches to those with smaller footprints, so even
though the shortlog is longer than I would like, the actual diffstat is
mostly very small with the exception of just three files that had more
changes, and a couple files with pure removals.  Here's the boilerplate:

The following changes since commit a909d3e636995ba7c349e2ca5dbb528154d4ac30:

  Linux 4.9-rc3 (2016-10-29 13:52:02 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git tags/for-linus

for you to fetch changes up to 5c6b2aaf9316fd0983c0c999d920306ddc65bd2d:

  iw_cxgb4: invalidate the mr when posting a read_w_inv wr (2016-11-16 20:10:36 -0500)

----------------------------------------------------------------
First round of -rc fixes

- Misc Intel hfi1 fixes
- Misc Mellanox mlx4, mlx5, and rxe fixes
- A couple cxgb4 fixes

----------------------------------------------------------------
Daniel Jurgens (2):
      IB/mlx5: Use cache line size to select CQE stride
      IB/mlx4: Check gid_index return value

Dasaratharaman Chandramouli (1):
      IB/hfi1: Fix ECN processing in prescan_rxq

Dennis Dalessandro (3):
      IB/rdmavt: rdmavt can handle non aligned page maps
      IB/hfi1: Remove leftover snoop references
      IB/hfi1: Remove incorrect IS_ERR check

Doug Ledford (1):
      Merge branches 'hfi1' and 'mlx' into k.o/for-4.9-rc

Easwar Hariharan (2):
      IB/hfi1: Clean up unused argument
      IB/hfi1: Delete unused lock

Eli Cohen (2):
      IB/mlx5: Fix fatal error dispatching
      IB/mlx5: Fix NULL pointer dereference on debug print

Ira Weiny (1):
      IB/hfi1: Fix rnr_timer addition

Jakub Pawlak (2):
      IB/hfi1: Fix integrity check flags default values
      IB/hfi1: Fix status error code for unsupported packets

Jianxin Xiong (2):
      IB/hfi1: Fix a potential memory leak in hfi1_create_ctxts()
      IB/hfi1: Prevent hardware counter names from being cut off

Krzysztof Blaszkowski (2):
      IB/hfi1: Return ENODEV for unsupported PCI device ids.
      IB/hfi1: Relocate rcvhdrcnt module parameter check.

Leon Romanovsky (1):
      IB/core: Set routable RoCE gid type for ipv4/ipv6 networks

Majd Dibbiny (1):
      IB/mlx5: Fix memory leak in query device

Maor Gottlieb (1):
      IB/mlx5: Validate requested RQT size

Mark Bloch (3):
      IB/cm: Mark stale CM id's whenever the mad agent was unregistered
      IB/core: Add missing check for addr_resolve callback return value
      IB/core: Avoid unsigned int overflow in sg_alloc_table

Matan Barak (1):
      IB/mlx4: Fix create CQ error flow

Moshe Lazer (1):
      IB/mlx5: Resolve soft lock on massive reg MRs

Steve Wise (2):
      iw_cxgb4: set *bad_wr for post_send/post_recv errors
      iw_cxgb4: invalidate the mr when posting a read_w_inv wr

Tadeusz Struk (2):
      IB/hfi1: Remove redundant sysfs irq affinity entry
      IB/hfi1: Fix an Oops on pci device force remove

Tariq Toukan (1):
      IB/uverbs: Fix leak of XRC target QPs

Yonatan Cohen (4):
      IB/rxe: Fix kernel panic in UDP tunnel with GRO and RX checksum
      IB/rxe: Fix handling of erroneous WR
      IB/rxe: Clear queue buffer when modifying QP to reset
      IB/rxe: Update qp state for user query

 drivers/infiniband/core/addr.c         |  11 ++-
 drivers/infiniband/core/cm.c           | 126 ++++++++++++++++++++++++++++-----
 drivers/infiniband/core/cma.c          |  21 +++++-
 drivers/infiniband/core/umem.c         |   2 +-
 drivers/infiniband/core/uverbs_main.c  |   7 +-
 drivers/infiniband/hw/cxgb4/cq.c       |  17 +----
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |   2 +-
 drivers/infiniband/hw/cxgb4/mem.c      |  12 ++++
 drivers/infiniband/hw/cxgb4/qp.c       |  20 +++---
 drivers/infiniband/hw/hfi1/affinity.c  |  72 -------------------
 drivers/infiniband/hw/hfi1/affinity.h  |   4 --
 drivers/infiniband/hw/hfi1/chip.c      |  27 +++----
 drivers/infiniband/hw/hfi1/chip.h      |   3 +
 drivers/infiniband/hw/hfi1/driver.c    |  37 +++++++---
 drivers/infiniband/hw/hfi1/file_ops.c  |  19 ++++-
 drivers/infiniband/hw/hfi1/hfi.h       |  89 +++++++++--------------
 drivers/infiniband/hw/hfi1/init.c      | 104 ++++++++++++++++-----------
 drivers/infiniband/hw/hfi1/pcie.c      |   3 +-
 drivers/infiniband/hw/hfi1/pio.c       |  13 +---
 drivers/infiniband/hw/hfi1/rc.c        |   2 +-
 drivers/infiniband/hw/hfi1/sdma.c      |  19 +----
 drivers/infiniband/hw/hfi1/sysfs.c     |  25 -------
 drivers/infiniband/hw/hfi1/trace_rx.h  |  60 ----------------
 drivers/infiniband/hw/hfi1/user_sdma.c |   2 +-
 drivers/infiniband/hw/mlx4/ah.c        |   5 +-
 drivers/infiniband/hw/mlx4/cq.c        |   5 +-
 drivers/infiniband/hw/mlx5/cq.c        |   3 +-
 drivers/infiniband/hw/mlx5/main.c      |  11 +--
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |   2 +
 drivers/infiniband/hw/mlx5/mr.c        |   6 +-
 drivers/infiniband/hw/mlx5/qp.c        |  12 +++-
 drivers/infiniband/sw/rdmavt/dma.c     |   3 -
 drivers/infiniband/sw/rxe/rxe_net.c    |   8 +--
 drivers/infiniband/sw/rxe/rxe_qp.c     |   2 +
 drivers/infiniband/sw/rxe/rxe_queue.c  |   9 +++
 drivers/infiniband/sw/rxe/rxe_queue.h  |   2 +
 drivers/infiniband/sw/rxe/rxe_req.c    |  21 +++---
 37 files changed, 391 insertions(+), 395 deletions(-)

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: 0E572FDD


^ permalink raw reply

* Re: [patch] IB/rxe: Remove unneeded cast in rxe_srq_from_attr()
From: Leon Romanovsky @ 2016-11-17 11:49 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Moni Shoua, Doug Ledford, Sean Hefty, Hal Rosenstock, linux-rdma,
	kernel-janitors
In-Reply-To: <20161117110005.GB32143@mwanda>

On Thu, Nov 17, 2016 at 02:00:05PM +0300, Dan Carpenter wrote:
> It makes me nervous when we cast pointer parameters.  I would estimate
> that around 50% of the time, it indicates a bug.  Here the cast is not
> needed because u32 and unsigned int are the same thing.  Removing the
> cast makes the code more robust and future proof in case any of the
> types change.
>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Thanks,
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

>
> diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
> index 2a6e3cd..efc832a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_srq.c
> +++ b/drivers/infiniband/sw/rxe/rxe_srq.c
> @@ -169,7 +169,7 @@ int rxe_srq_from_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
>  			}
>  		}
>
> -		err = rxe_queue_resize(q, (unsigned int *)&attr->max_wr,
> +		err = rxe_queue_resize(q, &attr->max_wr,
>  				       rcv_wqe_size(srq->rq.max_sge),
>  				       srq->rq.queue->ip ?
>  						srq->rq.queue->ip->context :
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [patch] IB/rxe: Remove unneeded cast in rxe_srq_from_attr()
From: Dan Carpenter @ 2016-11-17 11:00 UTC (permalink / raw)
  To: Moni Shoua
  Cc: Doug Ledford, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

It makes me nervous when we cast pointer parameters.  I would estimate
that around 50% of the time, it indicates a bug.  Here the cast is not
needed because u32 and unsigned int are the same thing.  Removing the
cast makes the code more robust and future proof in case any of the
types change.

Signed-off-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
index 2a6e3cd..efc832a 100644
--- a/drivers/infiniband/sw/rxe/rxe_srq.c
+++ b/drivers/infiniband/sw/rxe/rxe_srq.c
@@ -169,7 +169,7 @@ int rxe_srq_from_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
 			}
 		}
 
-		err = rxe_queue_resize(q, (unsigned int *)&attr->max_wr,
+		err = rxe_queue_resize(q, &attr->max_wr,
 				       rcv_wqe_size(srq->rq.max_sge),
 				       srq->rq.queue->ip ?
 						srq->rq.queue->ip->context :
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related
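
A minimal user-space sketch of the point made in the patch description
above: with no cast at the call site, the compiler checks the pointer type
for us. The names queue_resize(), struct srq_attr and resize_from_attr()
are hypothetical stand-ins, not the rxe code, and the clamping behaviour is
invented for illustration. The u32 typedef mirrors the kernel's definition
(unsigned int).

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int u32;	/* same underlying type the kernel uses for u32 */

/* Hypothetical attribute block with a u32 field, like attr->max_wr. */
struct srq_attr {
	u32 max_wr;
};

/* Hypothetical resize helper: clamps *num_elem to limit, in place. */
int queue_resize(unsigned int *num_elem, unsigned int limit)
{
	assert(num_elem != NULL);
	if (*num_elem > limit)
		*num_elem = limit;
	return 0;
}

int resize_from_attr(struct srq_attr *attr, unsigned int limit)
{
	/*
	 * No cast: u32 * and unsigned int * are the same type, so this
	 * compiles cleanly. If max_wr were later changed to a wider type,
	 * the compiler would diagnose an incompatible pointer type here.
	 * Writing (unsigned int *)&attr->max_wr would silence that
	 * diagnostic and hide the mismatch.
	 */
	return queue_resize(&attr->max_wr, limit);
}
```

The design choice is simply to let the type system do the checking: the
cast adds nothing today and removes a diagnostic tomorrow.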

* Re: rdma-core release process questions
From: Leon Romanovsky @ 2016-11-17 10:28 UTC (permalink / raw)
  To: Nikolova, Tatyana E
  Cc: Jason Gunthorpe, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <13AA599688F47243B14FCFCCC2C803BB10AB79D5-96pTJSsuoYQ64kNsxIetb7fspsVTdybXVpNB7YpNyf8@public.gmane.org>

On Thu, Nov 17, 2016 at 04:56:45AM +0000, Nikolova, Tatyana E wrote:
> Hi,
>
> We are submitting patches to the kernel space driver i40iw and to the user space plugin libi40iw, which is currently part of rdma-core. Some of the changes need to be coordinated so that they appear in both kernel space and user space in corresponding releases. We have some questions regarding the process about submitting patches to rdma-core which have dependencies on kernel patches.
>
> 1) Can user space patches target a for-next rdma-core release, if the corresponding kernel patches are queued for the next kernel?

I don't see any problem with that. Once the patches are accepted into -next by
Doug, they can be accepted into rdma-core too. In any case, these changes should
remain compatible with older kernels that lack the new feature.

> 2) How are ABI changes handled in rdma-core?

Do you have a specific thing in mind?
Generally speaking: send the patches to the ML, pass review, and we will apply them.

>
> 3) Could you explain the release process for rdma-core?

The process, as agreed, will be something like this:
a. Review/accept/decline patches within a 1-2 week time frame.
b. Once the kernel is released, stop accepting new features.
c. Wait 1-2 weeks to see that no one complains. This is just to be on the safe
side, because the library is always ready for release and is checked constantly.
d. Create a new tag and push the release.

>
> 4) Is each rdma-core release going to correspond to a specific kernel version?

As Jason wrote, it will be aligned with the kernel in release time,
but the library should remain backward compatible.

>
> Thank you,
> Tatyana

^ permalink raw reply

* Re: mlx4 BUG_ON in probe path
From: Yishai Hadas @ 2016-11-17 10:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yishai Hadas, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Johannes Thumshirn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161116182527.GC26600-1RhO1Y9PlrlHTL0Zs8A6p5iNqAH0jzoTYJqu5kTmcBRl57MIdRCFDg@public.gmane.org>

On 11/16/2016 8:25 PM, Bjorn Helgaas wrote:
> Hi Yishai,
>
> Johannes has been working on an mlx4 initialization problem on an
> IBM x3850 X6.  The underlying problem is a PCI core issue -- we're
> setting RCB in the Mellanox device, which means it thinks it can
> generate 128-byte Completions, even though the Root Port above it
> can't handle them.  That issue is
> https://bugzilla.kernel.org/show_bug.cgi?id=187781
>
> The machine crashed when this happened, apparently not because of any
> error reported via AER, but because mlx4 contains a BUG_ON, probably
> the one in mlx4_enter_error_state().
>
> That one happens if pci_channel_offline() returns false.  Is this
> telling us about a problem in PCI error handling, or is it just a case
> where mlx4 isn't as smart as it could be?

Yes, at that step we expect a problem/bug in the PCI layer that should
be fixed (e.g. the channel is reported as online when it is really
offline). Can you please evaluate and confirm that?


^ permalink raw reply

* Re: [PATCH rdma-core] ccan: Add likely implementation
From: Leon Romanovsky @ 2016-11-17  8:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	Tatyana.E.Nikolova-ral2JQCrhuEAvxtiuMwx3w,
	oulijun-hv44wF8Li93QT0dZR+AlfA, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20161116201343.GB19593-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Wed, Nov 16, 2016 at 01:13:43PM -0700, Jason Gunthorpe wrote:
> On Wed, Nov 16, 2016 at 07:40:11PM +0200, Leon Romanovsky wrote:
>
> > index b5de515..153426f 100644
> > +++ b/ccan/CMakeLists.txt
> > @@ -6,11 +6,13 @@ publish_internal_headers(ccan
> >    minmax.h
> >    str.h
> >    str_debug.h
> > +  likely.h
> >    )
> >
> >  set(C_FILES
> >    list.c
> >    str.c
> > +  likely.c
> >    )
>
> Keep these lists sorted please

Sure

>
> > +++ b/ccan/likely.c
> > @@ -0,0 +1,136 @@
> > +/* CC0 (Public domain) - see LICENSE file for details. */
> > +#ifdef CCAN_LIKELY_DEBUG
> > +#include <ccan/likely/likely.h>
> > +#include <ccan/hash/hash.h>
> > +#include <ccan/htable/htable_type.h>
>
> Hmm, this isn't going to compile if the debug is set, maybe drop the
> .c - but this seems really interesting to see if likely is being used
> sensibly....
>
> > +#ifndef CCAN_LIKELY_DEBUG
> > +#if HAVE_BUILTIN_EXPECT
>
> You need to add '#define HAVE_BUILTIN_EXPECT 1' to
> buildlib/config.h.in - or this doesn't work at all.

This is exactly what I wanted to discuss on the ML.

On the one hand, I wanted to keep the ccan files close to the
official ones, so that upgrading to new versions will be seamless.

On the other hand, I don't see a real use for the likely/unlikely debug
facilities.

So my approach was to add these files as is, but not to hook up the debug
functionality.

>
> Jason

^ permalink raw reply

* Re: rdma-core release process questions
From: Jason Gunthorpe @ 2016-11-17  5:14 UTC (permalink / raw)
  To: Nikolova, Tatyana E
  Cc: Doug Ledford, leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <13AA599688F47243B14FCFCCC2C803BB10AB79D5-96pTJSsuoYQ64kNsxIetb7fspsVTdybXVpNB7YpNyf8@public.gmane.org>

On Thu, Nov 17, 2016 at 04:56:45AM +0000, Nikolova, Tatyana E wrote:

> We are submitting patches to the kernel space driver i40iw and to
> the user space plugin libi40iw, which is currently part of
> rdma-core. Some of the changes need to be coordinated so that they
> appear in both kernel space and user space in corresponding
> releases. We have some questions regarding the process about
> submitting patches to rdma-core which have dependencies on kernel
> patches.

Incompatible changes are very strongly discouraged, just don't do it.

> 4) Is each rdma-core release going to correspond to a specific kernel version?

No. All versions must work with all kernels.

Jason

^ permalink raw reply

* rdma-core release process questions
From: Nikolova, Tatyana E @ 2016-11-17  4:56 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford,
	leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hi,

We are submitting patches to the kernel space driver i40iw and to the user space plugin libi40iw, which is currently part of rdma-core. Some of the changes need to be coordinated so that they appear in both kernel space and user space in corresponding releases. We have some questions regarding the process about submitting patches to rdma-core which have dependencies on kernel patches.

1) Can user space patches target a for-next rdma-core release, if the corresponding kernel patches are queued for the next kernel? 

2) How are ABI changes handled in rdma-core?

3) Could you explain the release process for rdma-core? 

4) Is each rdma-core release going to correspond to a specific kernel version?

Thank you,
Tatyana

^ permalink raw reply

* Re: [PATCH rdma-core] ccan: Add likely implementation
From: Jason Gunthorpe @ 2016-11-16 20:13 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	Tatyana.E.Nikolova-ral2JQCrhuEAvxtiuMwx3w,
	oulijun-hv44wF8Li93QT0dZR+AlfA, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1479318011-26878-1-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

On Wed, Nov 16, 2016 at 07:40:11PM +0200, Leon Romanovsky wrote:

> index b5de515..153426f 100644
> +++ b/ccan/CMakeLists.txt
> @@ -6,11 +6,13 @@ publish_internal_headers(ccan
>    minmax.h
>    str.h
>    str_debug.h
> +  likely.h
>    )
>  
>  set(C_FILES
>    list.c
>    str.c
> +  likely.c
>    )

Keep these lists sorted please

> +++ b/ccan/likely.c
> @@ -0,0 +1,136 @@
> +/* CC0 (Public domain) - see LICENSE file for details. */
> +#ifdef CCAN_LIKELY_DEBUG
> +#include <ccan/likely/likely.h>
> +#include <ccan/hash/hash.h>
> +#include <ccan/htable/htable_type.h>

Hmm, this isn't going to compile if the debug is set, maybe drop the
.c - but this seems really interesting to see if likely is being used
sensibly....

> +#ifndef CCAN_LIKELY_DEBUG
> +#if HAVE_BUILTIN_EXPECT

You need to add '#define HAVE_BUILTIN_EXPECT 1' to
buildlib/config.h.in - or this doesn't work at all.

Jason

^ permalink raw reply
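
A minimal sketch of why the HAVE_BUILTIN_EXPECT define matters, assuming a
GCC-compatible compiler. In an #if expression an undefined identifier
evaluates to 0, so without the define the fallback branch is taken silently
and the hints do nothing. The macros below follow the usual likely/unlikely
pattern rather than reproducing the ccan header, and checked_div() is a
hypothetical caller. In a real tree the define would come from the
generated config header, not from compiler detection as done here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stand-in for a value normally provided by config.h. */
#ifndef HAVE_BUILTIN_EXPECT
#if defined(__GNUC__) || defined(__clang__)
#define HAVE_BUILTIN_EXPECT 1
#else
#define HAVE_BUILTIN_EXPECT 0
#endif
#endif

#if HAVE_BUILTIN_EXPECT
/* !! normalises any truthy value to 1 before it is compared with the hint. */
#define likely(cond)   __builtin_expect(!!(cond), 1)
#define unlikely(cond) __builtin_expect(!!(cond), 0)
#else
/* Fallback: same truth value, no branch-prediction hint. */
#define likely(cond)   (!!(cond))
#define unlikely(cond) (!!(cond))
#endif

/* Hypothetical hot-path helper: the zero divisor is the rare case. */
int checked_div(int num, int den, int *out)
{
	assert(out != NULL);
	if (unlikely(den == 0))
		return -1;
	*out = num / den;
	return 0;
}
```

Either branch of the #if yields the same results; only the generated code
layout differs, which is why a missing define is easy to overlook.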

* Re: NFSD generic R/W API (sendto path) performance results
From: Chuck Lever @ 2016-11-16 19:45 UTC (permalink / raw)
  To: Steve Wise; +Cc: Christoph Hellwig, Sagi Grimberg, List Linux RDMA Mailing
In-Reply-To: <024601d23f7f$cef62500$6ce26f00$@opengridcomputing.com>


> On Nov 15, 2016, at 3:35 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
> 
>> 
>> I've built a prototype conversion of the in-kernel NFS server's sendto
>> path to use the new generic R/W API. This path handles NFS Replies, so
>> it is responsible for building and sending RDMA Writes carrying NFS
>> READ payloads, and for transmitting all NFS Replies.
>> 
>> I've published the prototype (against my for-4.10 server series) here:
>> 
>> 
>> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=shortlog;h=refs/heads/nfsd-rdma-rw-api
>> 
>> It's the very last patch in the series.
>> 
>> 
>> "iozone -i0 -i1 -s2g -r1m -I" with NFSv3, sec=sys, CX-3 on both sides,
>> FDR fabric, share is a tmpfs. This test writes and reads a 2GB file with
>> 1MB direct writes and reads.
>> 
>> The client forms NFS requests with a single 1MB RDMA segment to catch
>> the NFS READ payload. Before the conversion, the server posts a series
>> of single Write WRs with 30 pages each, for each RDMA segment written
>> to the client. After the conversion, the server posts a single chain
>> of 30-page Write WRs for each RDMA segment written to the client.
>> 
>> Before the API conversion: rdma_stat_post_send = 45097
>> 
>> After the API conversion: rdma_stat_post_send = 16411
>> 
>> That's what I expected to see. This shows the number of ib_post_send
>> calls is significantly lower after the conversion.
>> 
>> 
>> Unfortunately the throughput and latency numbers are worse (ignore
>> the write/rewrite numbers for now). Output is in kBytes/sec.
>> 
>> Before conversion, one iozone run:
>> 
>>              kB  reclen    write  rewrite    read    reread
>>         2097152    1024   772835   931267  1895922  1927848
>> 
>> READ:
>>    4098 ops (49%)
>>    avg bytes sent per op: 140    avg bytes received per op: 1048704
>>    backlog wait: 0.006345     RTT: 0.321132     total execute time: 0.332113
>> 
>> After conversion:
>> 
>>              kB  reclen    write  rewrite    read    reread
>>         2097152    1024   703850   913824  1561682  1441448
>> 
>> READ:
>>    4098 ops (49%)
>>    avg bytes sent per op: 140    avg bytes received per op: 1048704
>>    backlog wait: 0.010737     RTT: 0.469497     total execute time: 0.488043
>> 
>> That's 140us worse RTT per READ, in this run. The gap between before and
>> after was roughly the same for all runs.
>> 
>> 
>> To partially explain this, I captured traffic on the server using ibdump
>> during a similar iozone test. This removes fabric and client HCA latencies
>> from the picture.
>> 
>> This is a QD=1 test, so it's easy to analyze individual NFS READ operations
>> in each capture. I computed three latency numbers per READ transaction
>> based on the timestamps in the capture file, which should be accurate to
>> 1 microsecond:
>> 
>> 1. Call took: the time between when the server i/f sees the incoming RDMA
>> Send carrying the NFS READ Call, and when the server i/f sees the outgoing
>> RDMA Send carrying the NFS READ Reply.
>> 
>> 2. Call-to-first-Write: the time between when the server i/f sees the
>> incoming RDMA Send carrying the NFS READ Call, and when the server i/f
>> sees the first outgoing RDMA Write request. Roughly how long it takes
>> the server to prepare and post the RDMA Writes.
>> 
>> 3. First-to-last-Write: the time between when the server i/f sees the
>> first outgoing RDMA Write request, and when the server i/f sees the
>> last outgoing RDMA Write request. Roughly how long it takes the HCA
>> to transmit the RDMA Writes.
>> 
>> 
>> Averages over 5 NFS READ calls chosen at random, before conversion:
>> Call took 414us. Call-to-first-Write 85us. First-to-last-Write 327us
>> 
>> Averages over 5 NFS READ calls chosen at random, after conversion:
>> Call took 521us. Call-to-first-Write 160us. First-to-last-Write 360us
>> 
>> The gap between before and after results was 100% consistent with
>> the average results across the individual NFS READ operations.
>> 
>> 
> 
> Good work here! 
> 
>> There are two stories here:
>> 
>> 1. Call-to-first-Write takes longer. My first guess is that the server
>> takes longer to build and DMA map a long Write WR chain than it does
>> to build, map, and post a single Write WR. The HCA can get started
>> transmitting Writes sooner, and the server continues working on
>> posting Write WRs in parallel with the on-the-wire activity.
>> 
> 
> So perhaps the RDMA R/W API can have a threshold where it will dump a list of
> WRs once it exceeds the threshold, and continue chunking?  That threshold, by
> the way, is probably device-specific.
> 
>> 2. First-to-last-Write takes longer. I don't have any explanation
>> for the HCA taking 10% longer to transmit the full 1MB payload.
>> 
> 
> Perhaps the single WR posts are hitting the device's fast path and lowering latency
> vs a long chain post that must be DMAed by the device?  I'm not sure exactly how
> the MLX devices work, but they do have a fast path that utilizes the CPU's
> write-combining logic to send a WR over the bus as a single PCIE transaction.
> But your WRs are probably large since they have 30 pages in the SGE.  I'm not
> sure what the threshold is for this fastpath logic for mlx.  For cxgb, it's 64B,
> so the WR would have to fit in 64B to take advantage.

Out of curiosity, I hacked up my NFS client to limit the size of RDMA
segments to 30 pages (the server HCA's max_sge).

A 1MB NFS READ now takes 9 segments. That forces the after-conversion
server to build single-Write chains and use 9 post_send calls to
transmit the READ payload, just like the before-conversion server.

Performance of before- and after-conversion servers is now equivalent.

              kB  reclen    write  rewrite    read    reread
         2097152    1024  1061237  1141614  1961410  2000223

READ:
    4098 ops (49%) 
    avg bytes sent per op: 140    avg bytes received per op: 1048704
    backlog wait: 0.006345     RTT: 0.314300     total execute time: 0.325037

At 60-page segments (2 Write WRs per chain), I see about the same
throughput, and RT latency is a touch higher.

At 61-page segments (3 Write WRs per chain), throughput drops
significantly:

              kB  reclen    write  rewrite    read    reread
         2097152    1024   932665   976784  1627842  1627169

READ:
    4098 ops (49%)
    avg bytes sent per op: 140    avg bytes received per op: 1048704
    backlog wait: 0.009761     RTT: 0.383358     total execute time: 0.398731

A couple of random samples of an ibdump capture show that most of the
latency increase is in the Call-to-first-Write gap (1. above).


--
Chuck Lever




^ permalink raw reply
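
The segment and WR counts quoted in the message above can be reproduced
with round-up division. This is a sketch under one assumption that the
message does not state: 4KB pages, so a 1MB payload is 256 pages. The
function names are made up for illustration; max_sge is the 30-page limit
mentioned for the server HCA.

```c
#include <assert.h>

/* Round-up integer division. */
unsigned int div_round_up(unsigned int n, unsigned int d)
{
	assert(d > 0);
	return (n + d - 1) / d;
}

/* RDMA segments the client needs to expose for one READ payload. */
unsigned int segments_per_read(unsigned int payload_pages,
			       unsigned int seg_pages)
{
	return div_round_up(payload_pages, seg_pages);
}

/* Write WRs the server chains together for one full-sized segment. */
unsigned int wrs_per_segment(unsigned int seg_pages, unsigned int max_sge)
{
	return div_round_up(seg_pages, max_sge);
}
```

With 256 payload pages and max_sge = 30: 30-page segments give 9 segments
of 1 WR each, 60-page segments give 2 WRs per chain, and 61-page segments
tip over to 3 WRs per chain, which is where the throughput drop is
reported.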

* mlx4 BUG_ON in probe path
From: Bjorn Helgaas @ 2016-11-16 18:25 UTC (permalink / raw)
  To: Yishai Hadas; +Cc: netdev, linux-rdma, Johannes Thumshirn, linux-kernel

Hi Yishai,

Johannes has been working on an mlx4 initialization problem on an
IBM x3850 X6.  The underlying problem is a PCI core issue -- we're
setting RCB in the Mellanox device, which means it thinks it can
generate 128-byte Completions, even though the Root Port above it
can't handle them.  That issue is
https://bugzilla.kernel.org/show_bug.cgi?id=187781

The machine crashed when this happened, apparently not because of any
error reported via AER, but because mlx4 contains a BUG_ON, probably
the one in mlx4_enter_error_state().

That one happens if pci_channel_offline() returns false.  Is this
telling us about a problem in PCI error handling, or is it just a case
where mlx4 isn't as smart as it could be?

Ideally, if mlx4 can't initialize the device, it should just return an
error from the probe function instead of crashing the whole machine.

Here's the crash (the entire dmesg log is in the bugzilla above):

  mlx4_core 0000:41:00.0: command 0xfff timed out (go bit not cleared)
  mlx4_core 0000:41:00.0: device is going to be reset
  mlx4_core 0000:41:00.0: Failed to obtain HW semaphore, aborting
  mlx4_core 0000:41:00.0: Fail to reset HCA
  ------------[ cut here ]------------
  kernel BUG at drivers/net/ethernet/mellanox/mlx4/catas.c:193!
  invalid opcode: 0000 [#1] SMP 
  Modules linked in: sr_mod(E) cdrom(E) uas(E) usb_storage(E) mlx4_core(E+) cdc_ether(E) usbnet(E) mii(E) joydev(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) drbg(E) ansi_cprng(E) aesni_intel(E) iTCO_wdt(E) aes_x86_64(E) igb(E) ipmi_devintf(E) iTCO_vendor_support(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) ptp(E) cryptd(E) pps_core(E) sb_edac(E) pcspkr(E) lpc_ich(E) ipmi_ssif(E) ioatdma(E) edac_core(E) shpchp(E) mfd_core(E) dca(E) wmi(E) ipmi_si(E) ipmi_msghandler(E) fjes(E) button(E) processor(E) acpi_pad(E) hid_generic(E) usbhid(E) ext4(E) crc16(E) jbd2(E) mbcache(E) sd_mod(E) mgag200(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) xhci_pci(E) sysfillrect(E) ehci_pci(E) sysimgblt(E) fb_sys_fops(E) xhci_hcd(E) ehci_hcd(E) ttm(E) usbcore(E) drm(E) usb_common(E) megaraid_sas(E) dm_mirror(E) dm_region_hash(E) dm_log(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E)
  Supported: Yes
  CPU: 27 PID: 2867 Comm: modprobe Tainted: G            E      4.4.21-default #6
  Hardware name: IBM x3850 X6 -[3837Z7P]-/00FN772, BIOS -[A8E120CUS-1.30]- 08/22/2016
  task: ffff881fb2ff9280 ti: ffff881fbd3c4000 task.ti: ffff881fbd3c4000
  RIP: 0010:[<ffffffffa0446740>]  [<ffffffffa0446740>] mlx4_enter_error_state+0x240/0x320 [mlx4_core]
  RSP: 0018:ffff881fbd3c79a0  EFLAGS: 00010246
  RAX: ffff8820b2486e00 RBX: ffff883fbe240000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff881fbf63b000
  RBP: ffff8820b2486e60 R08: 0000000000000029 R09: ffff88803feda50f
  R10: 00000000000d1b50 R11: 0000000000000000 R12: 0000000000000000
  R13: 0000000000000000 R14: ffff883fbe240460 R15: 00000000fffffffb
  FS:  00007f7c55203700(0000) GS:ffff883fbf900000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f1813c88000 CR3: 0000003fbe637000 CR4: 00000000001406e0
  Stack:
   15b30000c0000100 ffff883fbe240000 0000000000000fff 0000000000000000
   ffffffffa0447d54 000000000000ffff ffffffff00000000 000000000000ea60
   0000000000000000 000000000000ea60 ffffc90031dba680 ffff883fbe240000
  Call Trace:
   [<ffffffffa0447d54>] __mlx4_cmd+0x594/0x8a0 [mlx4_core]
   [<ffffffffa045191b>] mlx4_map_cmd+0x2ab/0x3c0 [mlx4_core]
   [<ffffffffa045a855>] mlx4_load_one+0x515/0x1220 [mlx4_core]
   [<ffffffffa045bb69>] mlx4_init_one+0x4e9/0x6a0 [mlx4_core]
   [<ffffffff8135626f>] local_pci_probe+0x3f/0xa0
   [<ffffffff81357694>] pci_device_probe+0xd4/0x120
   [<ffffffff8144d0b7>] driver_probe_device+0x1f7/0x420
   [<ffffffff8144d35b>] __driver_attach+0x7b/0x80
   [<ffffffff8144afc8>] bus_for_each_dev+0x58/0x90
   [<ffffffff8144c519>] bus_add_driver+0x1c9/0x280
   [<ffffffff8144dccb>] driver_register+0x5b/0xd0
   [<ffffffffa03f911a>] mlx4_init+0x11a/0x1000 [mlx4_core]
   [<ffffffff81002138>] do_one_initcall+0xc8/0x1f0
   [<ffffffff81182a08>] do_init_module+0x5a/0x1d7
   [<ffffffff81103726>] load_module+0x1366/0x1c50
   [<ffffffff811041c0>] SYSC_finit_module+0x70/0xa0
   [<ffffffff815e14ae>] entry_SYSCALL_64_fastpath+0x12/0x71

^ permalink raw reply
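
A user-space sketch of the behaviour asked for in the message above: when
recovery fails during probe, hand an error back to the caller instead of
asserting. Everything here is hypothetical (struct demo_dev,
demo_enter_error_state(), demo_probe() and the choice of error codes) and
does not reflect how mlx4 is actually structured. It only illustrates the
control flow.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state; not mlx4's real structures. */
struct demo_dev {
	bool reset_ok;		/* would a reset succeed? */
	bool channel_offline;	/* what the bus layer reports */
	bool in_error_state;
};

/* Try to recover; report failure to the caller rather than asserting. */
int demo_enter_error_state(struct demo_dev *dev)
{
	dev->in_error_state = true;
	if (dev->reset_ok)
		return 0;
	/* Reset failed: return an error whether or not the channel is offline. */
	return dev->channel_offline ? -ENODEV : -EIO;
}

/* Probe path: a timed-out command leads to an error return, not a crash. */
int demo_probe(struct demo_dev *dev, bool cmd_timed_out)
{
	assert(dev != NULL);
	if (cmd_timed_out) {
		int err = demo_enter_error_state(dev);

		if (err)
			return err;	/* probe fails, the machine keeps running */
	}
	return 0;
}
```

In this shape the "reset failed but the channel claims to be online" case
is still distinguishable through the error code, so it can be logged and
investigated without taking the system down.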
