From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bart Van Assche
Subject: Re: generic RDMA READ/WRITE API V6
Date: Mon, 2 May 2016 15:14:34 -0700
Message-ID: <5727D14A.8040600@sandisk.com>
References: <1460410360-13104-1-git-send-email-hch@lst.de>
 <571AA5C8.4080502@sandisk.com>
 <20160502151535.GA520@lst.de>
 <5727A5C7.1090009@sandisk.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <5727A5C7.1090009@sandisk.com>
Sender: target-devel-owner@vger.kernel.org
To: Christoph Hellwig
Cc: "dledford@redhat.com", "swise@opengridcomputing.com",
 "sagi@grimberg.me", "linux-rdma@vger.kernel.org",
 "target-devel@vger.kernel.org"
List-Id: linux-rdma@vger.kernel.org

On 05/02/2016 12:08 PM, Bart Van Assche wrote:
> On 05/02/2016 08:15 AM, Christoph Hellwig wrote:
>> On Fri, Apr 22, 2016 at 03:29:28PM -0700, Bart Van Assche wrote:
>>> On 04/11/2016 02:32 PM, Christoph Hellwig wrote:
>>>> git://git.infradead.org/users/hch/rdma.git rdma-rw-api
>>>
>>> Hello Christoph,
>>>
>>> Is the version that has been pushed on April 18 the latest and greatest
>>> version of this patch series ?
>>
>> Should be. I've pushed out a new version, but the only changes are
>> in response to your small review comments, and a no-op rebase to Doug's
>> latest tree.
>>
>>> I'm asking because with that version I see
>>> error messages appearing that I hadn't seen with the previous version:
>>>
>>> ib_srpt:srpt_qp_event: ib_srpt QP event 16 on cm_id=ffff8801713d5628
>>> sess_name=0x0000000000000000e41d2d03000a85b1 state=1
>>> ib_srpt:srpt_qp_event: ib_srpt 0x0000000000000000e41d2d03000a85b1-522,
>>> state live: received Last WQE event.
>>> ib_srpt RDMA_READ for ioctx 0xffff8804593092a8 failed with status 4
>>>
>>> This test was run with the force_mr=Y:
>>>
>>> $ cat /etc/modprobe.d/ib_core.conf
>>> options ib_core force_mr=Y
>>
>> I haven't been able to reproduce this with my usual xfstests run
>> on mlx4 hardware. What did you do to reproduce the issue, and what
>> hardware were you using?
>
> After having disabled CONFIG_SLUB_DEBUG_ON I don't see the "QP event"
> message anymore. But running xfstests triggered the following (mlx4
> hardware; SRP initiator and LIO target running on the same server and
> communicating over loopback):
>
> WARNING: CPU: 11 PID: 9224 at drivers/infiniband/ulp/srpt/ib_srpt.c:1209
> srpt_rdma_read_done+0xc7/0x110 [ib_srpt]
> Call Trace:
> [] dump_stack+0x67/0x92
> [] __warn+0xc1/0xe0
> [] warn_slowpath_null+0x18/0x20
> [] srpt_rdma_read_done+0xc7/0x110 [ib_srpt]
> [] __ib_process_cq+0x4b/0xd0 [ib_core]
> [] ib_cq_poll_work+0x1b/0x60 [ib_core]
> [] process_one_work+0x19a/0x490
> [] ? process_one_work+0x13a/0x490
> [] worker_thread+0x49/0x490
> [] ? process_one_work+0x490/0x490
> [] kthread+0xea/0x100
> [] ret_from_fork+0x22/0x40

(replying to my own e-mail)

I just noticed that ib_comp_wq is created as follows:

	ib_comp_wq = alloc_workqueue("ib-comp-wq",
			WQ_UNBOUND | WQ_HIGHPRI | WQ_MEM_RECLAIM,
			WQ_UNBOUND_MAX_ACTIVE);

I think this breaks the locking guarantees for completion handlers. A
quote from Documentation/infiniband/core_locking.txt:

"The driver must guarantee that only one CQ event handler for a given CQ
is running at a time."

The ib_srpt driver assumes that completion handler invocations are
serialized such that no locking is needed to access wait_list from
inside a completion handler.

Bart.
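
To make the assumption above concrete, here is a minimal userspace
sketch (pthreads, not kernel code) of what a handler would need if
invocations for one channel could overlap: every access to the wait
list goes through a lock. All names in it (struct channel,
completion_handler, run_handlers) are hypothetical and are not taken
from ib_srpt, and the mutex only stands in for whatever lock a driver
would use. It does not show that ib_comp_wq actually runs handlers for
the same CQ concurrently; that remains the open question here.

```c
/* Userspace illustration only; every identifier here is hypothetical. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct wait_entry {
	struct wait_entry *next;
};

struct channel {
	pthread_mutex_t lock;		/* stands in for a driver-side lock */
	struct wait_entry *wait_list;	/* shared between handler invocations */
};

struct handler_arg {
	struct channel *ch;
	int n;				/* entries each handler queues */
};

/* Simulated completion handler: pushes n entries onto the wait list. */
static void *completion_handler(void *p)
{
	struct handler_arg *arg = p;

	for (int i = 0; i < arg->n; i++) {
		struct wait_entry *e = malloc(sizeof(*e));

		if (!e)
			break;
		/* Without this lock, concurrent pushes could lose entries. */
		pthread_mutex_lock(&arg->ch->lock);
		e->next = arg->ch->wait_list;
		arg->ch->wait_list = e;
		pthread_mutex_unlock(&arg->ch->lock);
	}
	return NULL;
}

/* Run nthreads handlers concurrently; return how many entries were queued. */
int run_handlers(int nthreads, int per_thread)
{
	struct channel ch = { .wait_list = NULL };
	struct handler_arg arg = { &ch, per_thread };
	pthread_t *tid = malloc(nthreads * sizeof(*tid));
	int walked = 0;

	assert(tid);
	pthread_mutex_init(&ch.lock, NULL);
	for (int i = 0; i < nthreads; i++) {
		int rc = pthread_create(&tid[i], NULL, completion_handler, &arg);

		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);

	/* All handlers have finished; walk and free the list without the lock. */
	while (ch.wait_list) {
		struct wait_entry *e = ch.wait_list;

		ch.wait_list = e->next;
		free(e);
		walked++;
	}
	pthread_mutex_destroy(&ch.lock);
	free(tid);
	return walked;
}
```

Calling run_handlers(4, 1000) should walk 4000 entries as long as no
allocation fails. If handlers for a given CQ are in fact serialized, as
core_locking.txt requires, the lock is unnecessary and the lock-free
access in ib_srpt is fine.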