From: Vu Pham
Subject: Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Date: Wed, 14 Oct 2009 13:37:21 -0700
Message-ID: <4AD63681.6080901@mellanox.com>
References: <4AD3B453.3030109@mellanox.com>
To: Roland Dreier
Cc: Linux RDMA list
List-Id: linux-rdma@vger.kernel.org

Roland Dreier wrote:
>  > +static int srp_dev_loss_tmo = 60;
>
> I don't think the name needs to be this abbreviated.  We don't
> necessarily need the srp_ prefix, but probably "device_loss_timeout" is
> much clearer without being too much longer.
>

OK

>  > +
>  > +module_param(srp_dev_loss_tmo, int, 0444);
>  > +MODULE_PARM_DESC(srp_dev_loss_tmo,
>  > +		 "Default number of seconds that srp transport should \
>  > +		  insulate the lost of a remote port (default is 60 secs");
>
> I can't understand this description.  What does "insulate the lost" of a
> port mean?
>

I should change "remote port" to just "port". It means that the
multipath driver won't learn about a port offline event (pulled cable,
power-cycled switch or target, ...) and won't fail over, because srp
won't return an error code until this timeout has expired.

>  > +static void srp_reconnect_work(struct work_struct *work)
>  > +{
>  > +	struct srp_target_port *target =
>  > +		container_of(work, struct srp_target_port, work);
>  > +
>  > +	srp_reconnect_target(target);
>  > +	target->work_in_progress = 0;
>
> surely this is racy...  isn't it possible for a context to see
> work_in_progress as 1, decide not to schedule the work, and then have it
> set to 0 immediately afterwards by the workqueue context?
>

Yes, it is racy. It should be done with the scsi host_lock held
(spin_lock_irq).

>  > +	target->qp_err_timer.expires = time * HZ + jiffies;
>
> given that this is only with 1 second resolution, probably makes sense
> to either make it a deferrable timer or round the timeout to avoid extra
> wakeups.
>

OK - I'll round the timeout.

>  > +	add_timer(&target->qp_err_timer);
>
> I don't see anywhere that this is canceled on module unload etc?
>

My mistake. Bart also pointed it out. I'll fix this.

>  > +			srp_qp_err_add_timer(target,
>  > +					     srp_dev_loss_tmo - 55);
>
>  > +	if (srp_dev_loss_tmo < 60)
>  > +		srp_dev_loss_tmo = 60;
>
> I don't understand the 55 and the 60 here... what are these magic
> numbers?  Wouldn't it make sense for the user to specify the actual
> timeout that is used (value - 55) rather than the value and then
> secretly subtracting 55?
>
>  - R.
>

First, it does not make sense for the user to set it below 60;
therefore, it is forced to be 60 or above.

With the async event handler, srp can detect a local port going offline
and set the timer to exactly device_loss_timeout; however, it has no
mechanism to detect a remote port going offline (srp_daemon would need
to register for traps and communicate remote ports entering/leaving the
fabric down to the srp driver).

I should just add a timer of X seconds instead of (device_loss_tmo - 55)
when receiving a cqe error and/or a connection close event.

-vu
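
To make the timeout handling discussed above concrete, here is a minimal
userspace sketch in C. It is not the driver code. It assumes the floor
of 60 seconds for device_loss_timeout described in the reply, uses a
fixed QP-error delay in place of (srp_dev_loss_tmo - 55), and rounds the
expiry up to a whole second so that timers can share a wakeup. HZ,
QP_ERR_TIMEOUT_SECS (5 here, i.e. the old default 60 - 55) and all
function names are illustrative assumptions; in the kernel,
round_jiffies() would be the usual helper for the rounding step.

```c
#include <assert.h>

#define HZ                       250UL  /* assumed tick rate, ticks per second */
#define DEVICE_LOSS_TIMEOUT_MIN  60     /* floor mentioned in the thread */
#define QP_ERR_TIMEOUT_SECS      5      /* hypothetical "X seconds" (60 - 55) */

/* Force the user-supplied timeout to be at least the documented minimum. */
static int clamp_device_loss_timeout(int secs)
{
	return secs < DEVICE_LOSS_TIMEOUT_MIN ? DEVICE_LOSS_TIMEOUT_MIN : secs;
}

/* Round a tick count up to the next whole-second boundary. */
static unsigned long round_up_to_second(unsigned long ticks)
{
	unsigned long rem = ticks % HZ;

	return rem ? ticks + (HZ - rem) : ticks;
}

/*
 * Compute the expiry for a timer armed at tick "now" that should fire
 * "secs" seconds later, rounded so it lands on a second boundary.
 */
static unsigned long timer_expires(unsigned long now, int secs)
{
	assert(secs >= 0);
	return round_up_to_second(now + (unsigned long)secs * HZ);
}

/* Expiry used after a CQE error or connection close: a fixed delay. */
static unsigned long qp_err_timer_expires(unsigned long now)
{
	return timer_expires(now, QP_ERR_TIMEOUT_SECS);
}
```

With this shape the module parameter holds the value the user actually
asked for (after clamping), and the shorter QP-error delay is a named
constant rather than a subtraction hidden at the call site.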