* [LSF/MM TOPIC] Reducing the SRP initiator failover time
@ 2013-02-01 13:43 Bart Van Assche
[not found] ` <CAJZOPZJeCdkJ0xfK0kxic9jfz5A5ddw7TSWXe51yuO6bYTk4ag@mail.gmail.com>
0 siblings, 1 reply; 4+ messages in thread
From: Bart Van Assche @ 2013-02-01 13:43 UTC (permalink / raw)
To: lsf-pc, linux-scsi, linux-rdma@vger.kernel.org, David Dillow
The upstream SRP initiator is known to take about two to three minutes
to fail over from a failed path to a working path. This is not only
longer than is considered acceptable but also longer than other Linux
SCSI initiators need (e.g. iSCSI and FC). Progress so far with
improving SRP initiator failover has been slow. This is because the
discussion about candidate patches occurred at two different levels: not
only the patches themselves were discussed but also the approach that
should be followed. That last aspect is easier to discuss in a meeting
than over a mailing list. Hence the proposal to discuss SRP initiator
failover behavior during the LSF/MM summit. The topics that need further
discussion are:
* If a path fails, should the entire SCSI host be removed, or should
  the SCSI host be preserved and only the SCSI devices associated
  with that host be removed?
* Which software component should test the state of a path and
  reconnect to an SRP target once a path is restored? Should that be
  done by the user space process srp_daemon or by the SRP initiator
  kernel module?
* How should the SRP initiator behave after a path failure has been
  detected? Should the behavior be similar to that of the FC
  initiator with its fast_io_fail_tmo and dev_loss_tmo parameters?
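
For reference, the FC-style behavior mentioned in the last item can be
sketched roughly as below. This is only an illustrative model in Python,
not the FC transport class code and not a proposal for the SRP
implementation: the names PathState and path_state and the timeout
values in the usage loop are assumptions made up for this sketch. The
idea it shows is that I/O is held back right after a path failure, is
failed quickly once fast_io_fail_tmo has elapsed so that multipath can
switch paths, and that the devices are removed once dev_loss_tmo has
elapsed.

```python
from enum import Enum


class PathState(Enum):
    """Hypothetical states of one path, loosely following the FC model."""
    RUNNING = "running"      # path is up, I/O flows normally
    BLOCKED = "blocked"      # path lost, I/O is held back
    FAIL_FAST = "fail-fast"  # I/O is failed quickly so multipath can switch
    LOST = "lost"            # path given up, SCSI devices are removed


def path_state(seconds_since_failure, fast_io_fail_tmo, dev_loss_tmo):
    """Return the state of a path some seconds after it failed.

    seconds_since_failure is None while the path is healthy. Both
    timeouts are in seconds, counted from the moment of the failure.
    """
    if fast_io_fail_tmo >= dev_loss_tmo:
        raise ValueError("fast_io_fail_tmo must be smaller than dev_loss_tmo")
    if seconds_since_failure is None:
        return PathState.RUNNING
    if seconds_since_failure >= dev_loss_tmo:
        return PathState.LOST
    if seconds_since_failure >= fast_io_fail_tmo:
        return PathState.FAIL_FAST
    return PathState.BLOCKED


if __name__ == "__main__":
    # Assumed example values: fail I/O after 5 s, drop devices after 30 s.
    for t in (None, 3, 10, 60):
        print(t, path_state(t, fast_io_fail_tmo=5, dev_loss_tmo=30).value)
```

In this model the failover time seen by multipath is bounded by the
failure detection time plus fast_io_fail_tmo, which is why the choice of
detection mechanism matters as much as the timeout values.
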
Dave, if this topic gets accepted, I really hope you will be able to
attend the LSF/MM summit.
Bart.
* Re: [LSF/MM TOPIC] Reducing the SRP initiator failover time
@ 2013-02-07 22:42 Vu Pham
0 siblings, 1 reply; 4+ messages in thread
From: Vu Pham @ 2013-02-07 22:42 UTC (permalink / raw)
To: Bart Van Assche
Cc: lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-scsi, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, David Dillow, Oren Duer, Sagi Grimberg
> It is known that it takes about two to three minutes before the
> upstream SRP initiator fails over from a failed path to a working
> path. This is not only considered longer than acceptable but is also
> longer than other Linux SCSI initiators (e.g. iSCSI and FC). Progress
> so far with improving the fail-over SRP initiator has been slow. This
> is because the discussion about candidate patches occurred at two
> different levels: not only the patches itself were discussed but also
> the approach that should be followed. That last aspect is easier to
> discuss in a meeting than over a mailing list. Hence the proposal to
> discuss SRP initiator failover behavior during the LSF/MM summit. The
> topics that need further discussion are:
> * If a path fails, remove the entire SCSI host or preserve the SCSI
>   host and only remove the SCSI devices associated with that host ?
> * Which software component should test the state of a path and should
>   reconnect to an SRP target if a path is restored ? Should that be
>   done by the user space process srp_daemon or by the SRP initiator
>   kernel module ?
> * How should the SRP initiator behave after a path failure has been
>   detected ? Should the behavior be similar to the FC initiator with
>   its fast_io_fail_tmo and dev_loss_tmo parameters ?
>
> Dave, if this topic gets accepted, I really hope you will be able to
> attend the LSF/MM summit.
>
> Bart.

Hello Bart,

Thank you for taking the initiative.
Mellanox think that this should be discussed. We'd be happy to attend.

We also would like to discuss:
* How and how fast does SRP detect a path failure besides RC error?
* Role of srp_daemon, how often srp_daemon scan fabric for new/old
  targets, how-to scale srp_daemon discovery, traps.

-vu
* Re: [LSF/MM TOPIC] Reducing the SRP initiator failover time
@ 2013-02-08 9:24 Sagi Grimberg
0 siblings, 1 reply; 4+ messages in thread
From: Sagi Grimberg @ 2013-02-08 9:24 UTC (permalink / raw)
To: Bart Van Assche
Cc: Vu Pham, lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-scsi, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, David Dillow, Oren Duer
On 2/8/2013 12:42 AM, Vu Pham wrote:
>> It is known that it takes about two to three minutes before the
>> upstream SRP initiator fails over from a failed path to a working
>> path. This is not only considered longer than acceptable but is also
>> longer than other Linux SCSI initiators (e.g. iSCSI and FC). Progress
>> so far with improving the fail-over SRP initiator has been slow. This
>> is because the discussion about candidate patches occurred at two
>> different levels: not only the patches itself were discussed but also
>> the approach that should be followed. That last aspect is easier to
>> discuss in a meeting than over a mailing list. Hence the proposal to
>> discuss SRP initiator failover behavior during the LSF/MM summit. The
>> topics that need further discussion are:
>> * If a path fails, remove the entire SCSI host or preserve the SCSI
>>   host and only remove the SCSI devices associated with that host ?
>> * Which software component should test the state of a path and should
>>   reconnect to an SRP target if a path is restored ? Should that be
>>   done by the user space process srp_daemon or by the SRP initiator
>>   kernel module ?
>> * How should the SRP initiator behave after a path failure has been
>>   detected ? Should the behavior be similar to the FC initiator with
>>   its fast_io_fail_tmo and dev_loss_tmo parameters ?
>>
>> Dave, if this topic gets accepted, I really hope you will be able to
>> attend the LSF/MM summit.
>>
>> Bart.
>>
> Hello Bart,
>
> Thank you for taking the initiative.
> Mellanox think that this should be discussed. We'd be happy to attend.
>
> We also would like to discuss:
> * How and how fast does SRP detect a path failure besides RC error?
> * Role of srp_daemon, how often srp_daemon scan fabric for new/old
>   targets, how-to scale srp_daemon discovery, traps.
>
> -vu

Hey Bart,

I agree with Vu that this issue should be discussed. We'd be happy to
attend.

--
Sagi
* Re: [LSF/MM TOPIC] Reducing the SRP initiator failover time
@ 2013-02-08 11:38 Sebastian Riemer
0 siblings, 0 replies; 4+ messages in thread
From: Sebastian Riemer @ 2013-02-08 11:38 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Bart Van Assche, Vu Pham, lsf-pc, linux-scsi, linux-rdma@vger.kernel.org, David Dillow, Oren Duer
On 08.02.2013 10:24, Sagi Grimberg wrote:
> On 2/8/2013 12:42 AM, Vu Pham wrote:
>> Hello Bart,
>>
>> Thank you for taking the initiative.
>> Mellanox think that this should be discussed. We'd be happy to attend.
>>
>> We also would like to discuss:
>> * How and how fast does SRP detect a path failure besides RC error?
>> * Role of srp_daemon, how often srp_daemon scan fabric for new/old
>>   targets, how-to scale srp_daemon discovery, traps.
>>
>> -vu
> Hey Bart,
>
> I agree with Vu that this issue should be discussed. We'd be happy to
> attend.
>
> --
> Sagi

Wow, also thanks to Mellanox for spending resources on SRP as well!
Last year in June we came across a very different situation.

Cheers,
Sebastian and the ProfitBricks storage team