[LSF/MM TOPIC] Reducing the SRP initiator failover time

* [LSF/MM TOPIC] Reducing the SRP initiator failover time
@ 2013-02-01 13:43 Bart Van Assche
From: Bart Van Assche @ 2013-02-01 13:43 UTC (permalink / raw)
  To: lsf-pc, linux-scsi, linux-rdma@vger.kernel.org, David Dillow

It is known that the upstream SRP initiator takes about two to three
minutes to fail over from a failed path to a working path. This is not
only longer than is considered acceptable but also longer than other
Linux SCSI initiators need (e.g. iSCSI and FC). Progress with improving
SRP initiator failover has been slow so far, because the discussion of
candidate patches took place at two levels: not only the patches
themselves were discussed but also the approach that should be
followed. The latter is easier to discuss in a meeting than on a
mailing list. Hence the proposal to discuss SRP initiator failover
behavior during the LSF/MM summit. The topics that need further
discussion are:
* If a path fails, should the entire SCSI host be removed, or should
   the SCSI host be preserved and only the SCSI devices associated
   with that host be removed?
* Which software component should test the state of a path and
   reconnect to an SRP target once a path is restored? Should that be
   done by the user space process srp_daemon or by the SRP initiator
   kernel module?
* How should the SRP initiator behave after a path failure has been
   detected? Should the behavior be similar to that of the FC
   initiator with its fast_io_fail_tmo and dev_loss_tmo parameters?
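
For reference, the FC-style behavior mentioned in the last question can
be pictured as a timeline driven by the two timeouts. The sketch below
is a simplified model, not kernel code: the function name, the state
names and the example values (5 s and 600 s) are assumptions made for
illustration. It assumes that I/O is blocked right after a path
failure, that I/O is failed quickly once fast_io_fail_tmo expires so
that a multipath layer can retry on another path, and that the devices
are removed once dev_loss_tmo expires.

```python
from enum import Enum
from typing import Optional


class PathState(Enum):
    """Hypothetical state names, chosen for this sketch only."""
    RUNNING = "running"      # path is healthy
    BLOCKED = "blocked"      # I/O is held back, waiting for the path to return
    FAIL_FAST = "fail-fast"  # I/O is failed at once so multipath can fail over
    LOST = "lost"            # devices behind the path have been removed


def path_state(seconds_since_failure: Optional[float],
               fast_io_fail_tmo: Optional[float],
               dev_loss_tmo: float) -> PathState:
    """Return the modeled state of a path.

    seconds_since_failure is None while the path is healthy.
    fast_io_fail_tmo is None when fast I/O failing is switched off.
    """
    if seconds_since_failure is None:
        return PathState.RUNNING
    if seconds_since_failure >= dev_loss_tmo:
        return PathState.LOST
    if fast_io_fail_tmo is not None and seconds_since_failure >= fast_io_fail_tmo:
        return PathState.FAIL_FAST
    return PathState.BLOCKED


if __name__ == "__main__":
    # Example values only: fail I/O after 5 s, remove devices after 600 s.
    for t in (None, 2, 5, 30, 600):
        print(t, path_state(t, fast_io_fail_tmo=5, dev_loss_tmo=600).value)
```

In this model, and with these example values, a multipath layer would
see I/O errors about five seconds after the failure rather than after
minutes.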

Dave, if this topic gets accepted, I really hope you will be able to 
attend the LSF/MM summit.

Bart.

* Re: [LSF/MM TOPIC] Reducing the SRP initiator failover time
@ 2013-02-07 22:42       ` Vu Pham
From: Vu Pham @ 2013-02-07 22:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: lsf-pc, linux-scsi, linux-rdma@vger.kernel.org, David Dillow,
	Oren Duer, Sagi Grimberg


Hello Bart,

Thank you for taking the initiative.
Mellanox thinks that this should be discussed. We'd be happy to attend.

We would also like to discuss:
* How, and how quickly, does SRP detect a path failure other than
  through an RC error?
* The role of srp_daemon: how often srp_daemon scans the fabric for
  new/old targets, how to scale srp_daemon discovery, and traps.

-vu
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [LSF/MM TOPIC] Reducing the SRP initiator failover time
@ 2013-02-08  9:24           ` Sagi Grimberg
From: Sagi Grimberg @ 2013-02-08  9:24 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Vu Pham, lsf-pc, linux-scsi, linux-rdma@vger.kernel.org,
	David Dillow, Oren Duer

Hey Bart,

I agree with Vu that this issue should be discussed. We'd be happy to 
attend.

--
Sagi

* Re: [LSF/MM TOPIC] Reducing the SRP initiator failover time
  2013-02-08  9:24           ` Sagi Grimberg
@ 2013-02-08 11:38             ` Sebastian Riemer
From: Sebastian Riemer @ 2013-02-08 11:38 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, Vu Pham, lsf-pc, linux-scsi,
	linux-rdma@vger.kernel.org, David Dillow, Oren Duer

On 08.02.2013 10:24, Sagi Grimberg wrote:
> On 2/8/2013 12:42 AM, Vu Pham wrote:
>> Hello Bart,
>>
>> Thank you for taking the initiative.
>> Mellanox thinks that this should be discussed. We'd be happy to attend.
>>
>> We would also like to discuss:
>> * How, and how quickly, does SRP detect a path failure other than
>>   through an RC error?
>> * The role of srp_daemon: how often srp_daemon scans the fabric for
>>   new/old targets, how to scale srp_daemon discovery, and traps.
>>
>> -vu
> Hey Bart,
> 
> I agree with Vu that this issue should be discussed. We'd be happy to
> attend.
> 
> -- 
> Sagi

Wow, thanks also to Mellanox for spending resources on SRP! Last year
in June we came across a very different situation.

Cheers,
Sebastian and the ProfitBricks storage team
