* [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux
@ 2018-01-23 15:11 Johannes Thumshirn
  2018-01-23 16:09 ` Bart Van Assche
  2018-01-24 18:42 ` Sagi Grimberg
  0 siblings, 2 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2018-01-23 15:11 UTC (permalink / raw)

In NVMe over Fabrics we currently perform target discovery by running either
'nvme discover' or 'nvme connect-all' (with or without an appropriate
/etc/nvme/discovery.conf).

This is well suited for the RDMA transport, which has no idea of the
underlying fabric and its connections. To automatically connect to an RDMA
target, Sagi proposed a systemd one-shot service in [1].

The Fibre Channel transport, on the other hand, already knows its mapping of
rports to lports and thus could automatically connect to the target (with a
little help from udev), as shown in [2].

Unfortunately the FC method is not possible with RDMA, and the currently used
'nvme discover/connect/connect-all' method is extremely cumbersome with Fibre
Channel, especially as no special setup was or is needed for SCSI devices over
Fibre Channel, so administrators expect the same for NVMe.

Other downsides of the "RDMA version" are: 1) once the network topology and
thus /etc/nvme/discovery.conf changes, the initrd has to be rebuilt if NVMe is
to be started from the initrd, and 2) if we use the one-shot systemd service
there is no way to automatically retry the discovery/connect.

I'm hoping we have developers from the RDMA and Fibre Channel transports,
seasoned storage developers with SCSI Fibre Channel and RDMA knowledge, and
distribution maintainers around to discuss a way to address this problem in a
user-friendly way.

Byte,
	Johannes

[1] http://lists.infradead.org/pipermail/linux-nvme/2017-September/012976.html
[2] http://lists.infradead.org/pipermail/linux-nvme/2017-December/014324.html

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
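For concreteness, here is a minimal sketch of what the current mechanism looks
like in practice. The transport addresses and the unit name are placeholders,
and the one-shot unit is only an approximation of the service proposed in [1],
not its actual contents:

  # /etc/nvme/discovery.conf -- one line of nvme-cli options per discovery
  # controller; 'nvme discover' and 'nvme connect-all' read this file when
  # run without arguments (example values only)
  --transport=rdma --traddr=192.168.100.10 --trsvcid=4420
  --transport=rdma --traddr=192.168.100.11 --trsvcid=4420

  # nvmf-autoconnect.service -- sketch of a one-shot unit along the lines of [1]
  [Unit]
  Description=Connect NVMe-oF subsystems listed in /etc/nvme/discovery.conf
  After=network-online.target
  Requires=network-online.target

  [Service]
  Type=oneshot
  ExecStart=/usr/sbin/nvme connect-all

  [Install]
  WantedBy=default.target

Downside 2) above is visible directly in the sketch: Type=oneshot runs exactly
once at boot, so a discovery controller that is unreachable at that moment is
never retried.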
* [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux
  2018-01-23 15:11 [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux Johannes Thumshirn
@ 2018-01-23 16:09 ` Bart Van Assche
  2018-01-24  8:26   ` Hannes Reinecke
  2018-01-24 18:42 ` Sagi Grimberg
  1 sibling, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2018-01-23 16:09 UTC (permalink / raw)

On 01/23/18 07:11, Johannes Thumshirn wrote:
> In NVMe over Fabrics we currently perform target discovery by running either
> 'nvme discover' or 'nvme connect-all' (with or without an appropriate
> /etc/nvme/discovery.conf).
>
> This is well suited for the RDMA transport, which has no idea of the
> underlying fabric and its connections. To automatically connect to an RDMA
> target, Sagi proposed a systemd one-shot service in [1].
>
> The Fibre Channel transport, on the other hand, already knows its mapping of
> rports to lports and thus could automatically connect to the target (with a
> little help from udev), as shown in [2].
>
> Unfortunately the FC method is not possible with RDMA, and the currently used
> 'nvme discover/connect/connect-all' method is extremely cumbersome with Fibre
> Channel, especially as no special setup was or is needed for SCSI devices over
> Fibre Channel, so administrators expect the same for NVMe.
>
> Other downsides of the "RDMA version" are: 1) once the network topology and
> thus /etc/nvme/discovery.conf changes, the initrd has to be rebuilt if NVMe is
> to be started from the initrd, and 2) if we use the one-shot systemd service
> there is no way to automatically retry the discovery/connect.
>
> I'm hoping we have developers from the RDMA and Fibre Channel transports,
> seasoned storage developers with SCSI Fibre Channel and RDMA knowledge, and
> distribution maintainers around to discuss a way to address this problem in a
> user-friendly way.
>
> Byte,
>     Johannes
>
> [1] http://lists.infradead.org/pipermail/linux-nvme/2017-September/012976.html
> [2] http://lists.infradead.org/pipermail/linux-nvme/2017-December/014324.html

Hello Johannes,

Can you have a look at the SSDP and SLP protocols and see whether one of
these protocols or an alternative is appropriate? See also
https://en.wikipedia.org/wiki/Simple_Service_Discovery_Protocol and
https://en.wikipedia.org/wiki/Service_Location_Protocol.

Thanks,

Bart.
* [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux
  2018-01-23 16:09 ` Bart Van Assche
@ 2018-01-24  8:26   ` Hannes Reinecke
  2018-01-24 17:17     ` James Smart
  0 siblings, 1 reply; 9+ messages in thread
From: Hannes Reinecke @ 2018-01-24 8:26 UTC (permalink / raw)

On 01/23/2018 05:09 PM, Bart Van Assche wrote:
> On 01/23/18 07:11, Johannes Thumshirn wrote:
>> In NVMe over Fabrics we currently perform target discovery by running
>> either 'nvme discover' or 'nvme connect-all' (with or without an
>> appropriate /etc/nvme/discovery.conf).
>>
>> This is well suited for the RDMA transport, which has no idea of the
>> underlying fabric and its connections. To automatically connect to an
>> RDMA target, Sagi proposed a systemd one-shot service in [1].
>>
>> The Fibre Channel transport, on the other hand, already knows its mapping
>> of rports to lports and thus could automatically connect to the target
>> (with a little help from udev), as shown in [2].
>>
>> Unfortunately the FC method is not possible with RDMA, and the currently
>> used 'nvme discover/connect/connect-all' method is extremely cumbersome
>> with Fibre Channel, especially as no special setup was or is needed for
>> SCSI devices over Fibre Channel, so administrators expect the same for
>> NVMe.
>>
>> Other downsides of the "RDMA version" are: 1) once the network topology
>> and thus /etc/nvme/discovery.conf changes, the initrd has to be rebuilt
>> if NVMe is to be started from the initrd, and 2) if we use the one-shot
>> systemd service there is no way to automatically retry the
>> discovery/connect.
>>
>> I'm hoping we have developers from the RDMA and Fibre Channel transports,
>> seasoned storage developers with SCSI Fibre Channel and RDMA knowledge,
>> and distribution maintainers around to discuss a way to address this
>> problem in a user-friendly way.
>>
>> Byte,
>>     Johannes
>>
>> [1] http://lists.infradead.org/pipermail/linux-nvme/2017-September/012976.html
>> [2] http://lists.infradead.org/pipermail/linux-nvme/2017-December/014324.html
>
> Hello Johannes,
>
> Can you have a look at the SSDP and SLP protocols and see whether one of
> these protocols or an alternative is appropriate? See also
> https://en.wikipedia.org/wiki/Simple_Service_Discovery_Protocol and
> https://en.wikipedia.org/wiki/Service_Location_Protocol.
>
Partially beside the point.

The problem currently is that FC-NVMe is the only transport which implements
dev_loss_tmo, causing connections to be dropped completely after a certain
time. After that the user has to manually re-establish the connection via
nvme-cli, or one has to create some udev/systemd interaction (cf. the thread
"nvme/fc: add 'discovery' sysfs attribute to fc transport devices" and
others).

The other transports just keep the reconnection loop running, and the user
has to manually _disconnect_ here.

So we have a difference in user experience, which should be reconciled.

Also, a user-space based rediscovery/reconnect will get tricky during path
failover, as one might end up with all connections down and no way of ever
being _able_ to call nvme-cli because the root fs is inaccessible. But that
might be another topic.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                     Teamlead Storage & Networking
hare at suse.de                                    +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
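To make the udev/systemd interaction Hannes mentions concrete: the approach
discussed in [2] and in the 'discovery' sysfs attribute thread boils down to
the FC transport emitting a uevent when it (re)discovers an NVMe-capable
remote port, and a udev rule reacting to it. A rough sketch follows; the event
name and environment variables (FC_EVENT, NVMEFC_HOST_TRADDR, NVMEFC_TRADDR)
were still under discussion at the time and are assumptions here, not the
final interface:

  # /etc/udev/rules.d/70-nvmefc-autoconnect.rules
  # (illustrative only; the rule is a single physical line)
  ACTION=="change", SUBSYSTEM=="fc", ENV{FC_EVENT}=="nvmediscovery", ENV{NVMEFC_HOST_TRADDR}=="?*", ENV{NVMEFC_TRADDR}=="?*", RUN+="/usr/sbin/nvme connect-all --transport=fc --host-traddr=$env{NVMEFC_HOST_TRADDR} --traddr=$env{NVMEFC_TRADDR}"

Running long-lived commands from RUN+= is generally discouraged; a real rule
would more likely hand the work off to a templated systemd service, but the
flow is the same.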
* [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux
  2018-01-24 8:26 ` Hannes Reinecke
@ 2018-01-24 17:17   ` James Smart
  2018-01-24 18:46     ` Sagi Grimberg
  0 siblings, 1 reply; 9+ messages in thread
From: James Smart @ 2018-01-24 17:17 UTC (permalink / raw)

On 1/24/2018 12:26 AM, Hannes Reinecke wrote:
> Partially beside the point.
>
> The problem currently is that FC-NVMe is the only transport which
> implements dev_loss_tmo, causing connections to be dropped completely
> after a certain time. After that the user has to manually re-establish
> the connection via nvme-cli, or one has to create some udev/systemd
> interaction (cf. the thread "nvme/fc: add 'discovery' sysfs attribute to
> fc transport devices" and others).
>
> The other transports just keep the reconnection loop running, and the
> user has to manually _disconnect_ here.
>
> So we have a difference in user experience, which should be reconciled.

This is incorrect. RDMA (and FC too) have the per-controller ctrl_loss_tmo /
reconnect_delay settings that cap how long the reconnection loop will run
before the controller is deleted. In FC's case we know the state of the node,
which may have multiple controllers connected via it, and we have inherited
the SCSI semantics for how long to wait for connectivity to a node before
giving up - thus FC's reconnect window is capped at min(controller
ctrl_loss_tmo, FC node dev_loss_tmo).

So it is the same experience - at least in termination behavior. It is
possibly the same in recovery after this full termination/deletion of the
controller as well: if there is connectivity but the controller's
ctrl_loss_tmo has expired, FC needs the manual reconnect action just like
RDMA. However, FC does support the case where connectivity is lost and later
regained: it can automatically reconnect back to the storage - granted the
controller may have a different /dev name at that point.

As stated by Johannes, the real difference in behavior is establishing the
initial connectivity, as well as those auto-reconnect behaviors where
connectivity was lost and later regained.

> Also, a user-space based rediscovery/reconnect will get tricky during path
> failover, as one might end up with all connections down and no way of ever
> being _able_ to call nvme-cli because the root fs is inaccessible. But that
> might be another topic.
>
> Cheers,
>
> Hannes

I don't disagree with this, and I do believe that to cope with low-memory
situations during failover and reconnect (as we've seen in the past), as well
as to make booting easier (stop futzing with the ramdisk), it will likely
require some amount of NVMe discovery engine within the kernel. It's
difficult, as without things like SSDP or SLP, RDMA and TCP can't really do
this without admin help. But as you say - this can be a later topic.

-- james
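For reference, the per-controller knobs James refers to are exposed as
nvme-cli connect options. A sketch (the address and NQN are placeholders;
check the installed nvme-cli for the exact option spellings):

  # --reconnect-delay : seconds between reconnect attempts
  # --ctrl-loss-tmo   : total seconds to keep retrying before the controller
  #                     is removed; on FC the effective window is additionally
  #                     bounded by the remote port's dev_loss_tmo
  nvme connect -t rdma -a 192.168.100.10 -s 4420 \
      -n nqn.2018-01.io.example:subsys1 \
      --reconnect-delay=10 --ctrl-loss-tmo=600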
* [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux
  2018-01-24 17:17 ` James Smart
@ 2018-01-24 18:46   ` Sagi Grimberg
  0 siblings, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2018-01-24 18:46 UTC (permalink / raw)

>> Partially beside the point.
>>
>> The problem currently is that FC-NVMe is the only transport which
>> implements dev_loss_tmo, causing connections to be dropped completely
>> after a certain time. After that the user has to manually re-establish
>> the connection via nvme-cli, or one has to create some udev/systemd
>> interaction (cf. the thread "nvme/fc: add 'discovery' sysfs attribute to
>> fc transport devices" and others).
>>
>> The other transports just keep the reconnection loop running, and the
>> user has to manually _disconnect_ here.
>>
>> So we have a difference in user experience, which should be reconciled.
>
> This is incorrect. RDMA (and FC too) have the per-controller ctrl_loss_tmo /
> reconnect_delay settings that cap how long the reconnection loop will run
> before the controller is deleted.

echo that...

> As stated by Johannes, the real difference in behavior is establishing the
> initial connectivity, as well as those auto-reconnect behaviors where
> connectivity was lost and later regained.

I agree that this is a problem for IP based transports. So until we have
discovery enhancements we probably need to poll predefined discovery
subsystems periodically and cross-check against what we are already connected
to, like the iSCSI discovery daemon does..

>> Also, a user-space based rediscovery/reconnect will get tricky during path
>> failover, as one might end up with all connections down and no way of ever
>> being _able_ to call nvme-cli because the root fs is inaccessible. But that
>> might be another topic.

Again, polling discovery subsystems every X time will overcome that...

> I don't disagree with this, and I do believe that to cope with low-memory
> situations during failover and reconnect (as we've seen in the past), as
> well as to make booting easier (stop futzing with the ramdisk), it will
> likely require some amount of NVMe discovery engine within the kernel.
> It's difficult, as without things like SSDP or SLP, RDMA and TCP can't
> really do this without admin help. But as you say - this can be a later
> topic.

I hope fabrics 1.1 discovery enhancements will tackle all that...
* [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux
  2018-01-23 15:11 [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux Johannes Thumshirn
  2018-01-23 16:09 ` Bart Van Assche
@ 2018-01-24 18:42 ` Sagi Grimberg
  2018-01-24 18:51   ` James Smart
  2018-01-29 13:05   ` Johannes Thumshirn
  1 sibling, 2 replies; 9+ messages in thread
From: Sagi Grimberg @ 2018-01-24 18:42 UTC (permalink / raw)

Hi Johannes,

> In NVMe over Fabrics we currently perform target discovery by running either
> 'nvme discover' or 'nvme connect-all' (with or without an appropriate
> /etc/nvme/discovery.conf).
>
> This is well suited for the RDMA transport, which has no idea of the
> underlying fabric and its connections. To automatically connect to an RDMA
> target, Sagi proposed a systemd one-shot service in [1].
>
> The Fibre Channel transport, on the other hand, already knows its mapping of
> rports to lports and thus could automatically connect to the target (with a
> little help from udev), as shown in [2].
>
> Unfortunately the FC method is not possible with RDMA, and the currently used
> 'nvme discover/connect/connect-all' method is extremely cumbersome with Fibre
> Channel, especially as no special setup was or is needed for SCSI devices over
> Fibre Channel, so administrators expect the same for NVMe.
>
> Other downsides of the "RDMA version" are: 1) once the network topology and
> thus /etc/nvme/discovery.conf changes, the initrd has to be rebuilt if NVMe is
> to be started from the initrd, and 2) if we use the one-shot systemd service
> there is no way to automatically retry the discovery/connect.
>
> I'm hoping we have developers from the RDMA and Fibre Channel transports,
> seasoned storage developers with SCSI Fibre Channel and RDMA knowledge, and
> distribution maintainers around to discuss a way to address this problem in a
> user-friendly way.

Discovery enhancements is a subject the NVMe TWG will be working on in the
near future, and "discovery of the discovery service" is indeed a sub-topic
IIRC. I'm not sure LSF would be the appropriate forum for this.

What we do need to have is a way to support existing devices. I think it's
acceptable that FC and Ethernet based transports diverge in their
implementations for this.

For Ethernet based transports we could follow the open-iscsi model, which has
a discovery daemon that periodically polls predefined addresses. As for
updating the initramfs, maybe we can live with this limitation for the time
being?

FC can keep doing its own thing...
* [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux
  2018-01-24 18:42 ` Sagi Grimberg
@ 2018-01-24 18:51   ` James Smart
  2018-01-24 18:59     ` Sagi Grimberg
  0 siblings, 1 reply; 9+ messages in thread
From: James Smart @ 2018-01-24 18:51 UTC (permalink / raw)

On 1/24/2018 10:42 AM, Sagi Grimberg wrote:
>
> FC can keep doing its own thing...
>

:) Funny, I see it as the Ethernet solutions doing their own thing with each
new protocol. FC keeps doing what it has done for years.

-- james
* [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux
  2018-01-24 18:51 ` James Smart
@ 2018-01-24 18:59   ` Sagi Grimberg
  0 siblings, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2018-01-24 18:59 UTC (permalink / raw)

>> FC can keep doing its own thing...
>>
>
> :) Funny, I see it as the Ethernet solutions doing their own thing with
> each new protocol. FC keeps doing what it has done for years.

I meant "its own mature, well-known thing that works" :)
* [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux
  2018-01-24 18:42 ` Sagi Grimberg
  2018-01-24 18:51   ` James Smart
@ 2018-01-29 13:05   ` Johannes Thumshirn
  1 sibling, 0 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2018-01-29 13:05 UTC (permalink / raw)

Sagi Grimberg <sagi at grimberg.me> writes:
> Discovery enhancements is a subject the NVMe TWG will be working on in the
> near future, and "discovery of the discovery service" is indeed a sub-topic
> IIRC. I'm not sure LSF would be the appropriate forum for this.
>
> What we do need to have is a way to support existing devices. I think it's
> acceptable that FC and Ethernet based transports diverge in their
> implementations for this.
>
> For Ethernet based transports we could follow the open-iscsi model, which
> has a discovery daemon that periodically polls predefined addresses. As for
> updating the initramfs, maybe we can live with this limitation for the time
> being?
>
> FC can keep doing its own thing...

Sure, general discovery enhancements are out of LSF's scope, but discussing
the implementation of an nvme-discoveryd (or something similar) for Fabrics
1.0 based systems would make LSF the right forum IMHO. This model still needs
/etc/nvme/discovery.conf updates mirrored into the initrd etc...

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
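The "mirrored into the initrd" step Johannes refers to is, on dracut-based
distributions, roughly the following (a sketch; whether the file must be
listed explicitly depends on which initramfs modules the distribution ships):

  # after editing /etc/nvme/discovery.conf, refresh the copy inside the
  # initramfs so connects made from early boot see the new topology
  dracut --force --install /etc/nvme/discovery.conf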
end of thread, other threads:[~2018-01-29 13:05 UTC | newest]

Thread overview: 9+ messages
2018-01-23 15:11 [LSF/MM TOPIC] NVMe over Fabrics auto-discovery in Linux Johannes Thumshirn
2018-01-23 16:09 ` Bart Van Assche
2018-01-24  8:26   ` Hannes Reinecke
2018-01-24 17:17     ` James Smart
2018-01-24 18:46       ` Sagi Grimberg
2018-01-24 18:42 ` Sagi Grimberg
2018-01-24 18:51   ` James Smart
2018-01-24 18:59     ` Sagi Grimberg
2018-01-29 13:05   ` Johannes Thumshirn