From mboxrd@z Thu Jan 1 00:00:00 1970
From: Steve Wise
Subject: rds cq event handler issue
Date: Tue, 06 Mar 2012 12:05:25 -0600
Message-ID: <4F5651E5.1020005@opengridcomputing.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: linux-rdma, Netdev, Vipul Pandya
To: Venkat Venkatsubra
Return-path:
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: netdev.vger.kernel.org

Hey Venkat,

I think I see a bug in the RDS RDMA module where RDS is not adhering to
the RDMA locking context. From the kernel tree
Documentation/infiniband/core_locking.txt:

---
  The context in which completion event and asynchronous event
  callbacks run is not defined.  Depending on the low-level driver, it
  may be process context, softirq context, or interrupt context.
  Upper level protocol consumers may not sleep in a callback.
---

So RDMA ULPs cannot assume any certain context for their callback
functions. Yet I get a BUG_ON() when running RDS with iw_cxgb3 where RDS
is bugging in rds_rdma_free_op():

---
        /* Mark page dirty if it was possibly modified, which
         * is the case for a RDMA_READ which copies from remote
         * to local memory */
        if (!ro->op_write) {
                BUG_ON(irqs_disabled());
                set_page_dirty(page);
        }
---

And rds_rdma_free_op() can be called in the cq callback path.
Here's a stack trace when it bugged:

---
Call Trace:
 [] :rds:rds_message_purge+0x54/0x79
 [] :rds:rds_message_put+0x41/0x4c
 [] :rds_rdma:rds_iw_send_unmap_rm+0xe2/0xf2
 [] :rds_rdma:rds_iw_send_cq_comp_handler+0x193/0x2e5
 [] :iw_cxgb3:iwch_ev_dispatch+0x1df/0x2b1
 [] :iw_cxgb3:cxio_hal_ev_handler+0x6b/0xb4
 [] :cxgb3:process_rx+0x3d/0xa0
 [] :cxgb3:process_responses+0x120c/0x1350
---

iwch_ev_dispatch() explicitly disables irqs to ensure proper
serialization:

---
        spin_lock_irqsave(&chp->comp_handler_lock, flag);
        (*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context);
        spin_unlock_irqrestore(&chp->comp_handler_lock, flag);
---

I'm not sure if that BUG_ON() in rds_rdma_free_op() is valid or not. If
it is valid, then RDS needs to run this logic in a safe context, not in
the context of the CQ callback. If the BUG_ON() is not valid, we can
remove it :).

Can you comment?

Thanks,

Steve.