* status of spdk @ 2016-11-08 23:31 Yehuda Sadeh-Weinraub 2016-11-08 23:40 ` Sage Weil 2016-11-09 4:45 ` Haomai Wang 0 siblings, 2 replies; 18+ messages in thread From: Yehuda Sadeh-Weinraub @ 2016-11-08 23:31 UTC (permalink / raw) To: Wang, Haomai, Weil, Sage; +Cc: ceph-devel I just started looking at spdk, and have a few comments and questions. First, it's not clear to me how we should handle the build. At the moment the spdk code resides as a submodule in the ceph tree, but it depends on dpdk, which currently needs to be downloaded separately. We can add it as a submodule (upstream is here: git://dpdk.org/dpdk). That being said, getting it to build was a bit tricky and I think it might be broken with cmake. In order to get it working I resorted to building a system library and using that. The way to currently configure an osd to use bluestore with spdk is by creating a symbolic link that replaces the bluestore 'block' device and points to a file whose name is prefixed with 'spdk:'. Originally I assumed that the suffix would be the nvme device id, but it seems that it's not really needed; however, the file itself needs to contain the device id (see https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of minor fixes). As I understand it, in order to support multiple osds on the same NVMe device we have a few options. We can leverage NVMe namespaces, but that's not supported on all devices. We can configure bluestore to only use part of the device (device sharding? not sure if it supports it). I think it's best if we could keep bluestore out of the loop there and have the NVMe driver abstract multiple partitions of the NVMe device. The idea is to be able to define multiple partitions on the device (e.g., each partition will be defined by the offset, size, and namespace), and have the osd set to use a specific partition. We'll probably need a special tool to manage it, and potentially keep the partition table information on the device itself.
The tool could also manage the creation of the block link. We should probably rethink how the link is structured and what it points at. Any thoughts? Yehuda ^ permalink raw reply [flat|nested] 18+ messages in thread
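The 'block' link setup described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions: the thread says the link target's file name is prefixed with 'spdk:' and that the file contains the device id, but the part of the name after the prefix (`label`), the one-line contents format, and the example device id are all made up here.

```python
import os
import tempfile


def link_block_to_spdk(osd_dir: str, device_id: str, label: str = "nvme0") -> str:
    """Point <osd_dir>/block at a file named 'spdk:<label>' that holds device_id.

    `label` and the one-line file format are hypothetical; the thread only
    states that the name starts with 'spdk:' and the file holds the device id.
    """
    target = os.path.join(osd_dir, "spdk:" + label)
    with open(target, "w") as f:
        f.write(device_id + "\n")

    link = os.path.join(osd_dir, "block")
    if os.path.lexists(link):
        # Replace whatever 'block' currently is (file or older symlink).
        os.remove(link)
    os.symlink(target, link)
    return link


if __name__ == "__main__":
    # A temporary directory stands in for the real OSD data directory.
    with tempfile.TemporaryDirectory() as osd_dir:
        link = link_block_to_spdk(osd_dir, "EXAMPLE-DEVICE-ID")
        print(os.path.basename(os.readlink(link)))
```

A management tool like the one proposed above could wrap a helper of this kind, but whether the final link format keeps the 'spdk:' prefix is exactly what the message asks to rethink.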
* Re: status of spdk 2016-11-08 23:31 status of spdk Yehuda Sadeh-Weinraub @ 2016-11-08 23:40 ` Sage Weil 2016-11-09 0:06 ` Yehuda Sadeh-Weinraub 2016-11-09 4:49 ` Haomai Wang 2016-11-09 4:45 ` Haomai Wang 1 sibling, 2 replies; 18+ messages in thread From: Sage Weil @ 2016-11-08 23:40 UTC (permalink / raw) To: Yehuda Sadeh-Weinraub; +Cc: Wang, Haomai, ceph-devel On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote: > I just started looking at spdk, and have a few comments and questions. > > First, it's not clear to me how we should handle build. At the moment > the spdk code resides as a submodule in the ceph tree, but it depends > on dpdk, which currently needs to be downloaded separately. We can add > it as a submodule (upstream is here: git://dpdk.org/dpdk). That been > said, getting it to build was a bit tricky and I think it might be > broken with cmake. In order to get it working I resorted to building a > system library and use that. Note that this PR is about to merge https://github.com/ceph/ceph/pull/10748 which adds the DPDK submodule, so hopefully this issue will go away when that is merged or with a follow-on cleanup. > The way to currently configure an osd to use bluestore with spdk is by > creating a symbolic link that replaces the bluestore 'block' device to > point to a file that has a name that is prefixed with 'spdk:'. > Originally I assumed that the suffix would be the nvme device id, but > it seems that it's not really needed, however, the file itself needs > to contain the device id (see > https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of > minor fixes). Open a PR for those? > As I understand it, in order to support multiple osds on the same NVMe > device we have a few options. We can leverage NVMe namespaces, but > that's not supported on all devices. We can configure bluestore to > only use part of the device (device sharding? not sure if it supports > it).
I think it's best if we could keep bluestore out of the loop > there and have the NVMe driver abstract multiple partitions of the > NVMe device. The idea is to be able to define multiple partitions on > the device (e.g., each partition will be defined by the offset, size, > and namespace), and have the osd set to use a specific partition. > We'll probably need a special tool to manage it, and potentially keep > the partition table information on the device itself. The tool could > also manage the creation of the block link. We should probably rethink > how the link is structure and what it points at. I agree that bluestore shouldn't get involved. Are NVMe namespaces meant to support multiple processes sharing the same hardware device? Also, if you do that, is it possible to give one of the namespaces to the kernel? That might solve the bootstrapping problem we currently have where we have nowhere to put the $osd_data filesystem with the device metadata. (This is admittedly not necessarily a blocking issue. Putting those dirs on / wouldn't be the end of the world; it just means cards can't be easily moved between boxes.) sage ^ permalink raw reply [flat|nested] 18+ messages in thread
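To make the partition-table idea in the quoted proposal concrete, here is a minimal sketch of what such a table and its sanity check might look like. Only the three fields (offset, size, namespace) come from the thread; the class name, the `validate` helper, the byte units, and the `namespace_sizes` argument are assumptions for illustration, not an existing Ceph or SPDK interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Partition:
    namespace: int  # NVMe namespace id the partition lives in
    offset: int     # start of the partition, in bytes (assumed unit)
    size: int       # length of the partition, in bytes (assumed unit)

    @property
    def end(self) -> int:
        return self.offset + self.size


def validate(table: List[Partition], namespace_sizes: Dict[int, int]) -> None:
    """Raise ValueError if a partition is out of bounds or overlaps another."""
    by_ns: Dict[int, List[Partition]] = {}
    for p in table:
        if p.offset < 0 or p.size <= 0:
            raise ValueError(f"bad geometry: {p}")
        if p.namespace not in namespace_sizes:
            raise ValueError(f"unknown namespace: {p}")
        if p.end > namespace_sizes[p.namespace]:
            raise ValueError(f"partition exceeds namespace: {p}")
        by_ns.setdefault(p.namespace, []).append(p)

    # Within each namespace, sorted neighbours must not overlap.
    for parts in by_ns.values():
        parts.sort(key=lambda p: p.offset)
        for a, b in zip(parts, parts[1:]):
            if a.end > b.offset:
                raise ValueError(f"overlap between {a} and {b}")


if __name__ == "__main__":
    gib = 1 << 30
    table = [Partition(1, 0, 100 * gib), Partition(1, 100 * gib, 100 * gib)]
    validate(table, {1: 400 * gib})
    print("table ok, osd.0 would use", table[0])
```

Each osd would then be configured with an index into this table, which keeps bluestore unaware of how the device is split.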
* Re: status of spdk 2016-11-08 23:40 ` Sage Weil @ 2016-11-09 0:06 ` Yehuda Sadeh-Weinraub 2016-11-09 0:21 ` LIU, Fei 2016-11-09 4:49 ` Haomai Wang 1 sibling, 1 reply; 18+ messages in thread From: Yehuda Sadeh-Weinraub @ 2016-11-09 0:06 UTC (permalink / raw) To: Sage Weil; +Cc: Wang, Haomai, ceph-devel On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote: > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote: >> I just started looking at spdk, and have a few comments and questions. >> >> First, it's not clear to me how we should handle build. At the moment >> the spdk code resides as a submodule in the ceph tree, but it depends >> on dpdk, which currently needs to be downloaded separately. We can add >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been >> said, getting it to build was a bit tricky and I think it might be >> broken with cmake. In order to get it working I resorted to building a >> system library and use that. > > Note that this PR is about to merge > > https://github.com/ceph/ceph/pull/10748 > > which adds the DPDK submodule, so hopefully this issue will go away when > that merged or with a follow-on cleanup. > >> The way to currently configure an osd to use bluestore with spdk is by >> creating a symbolic link that replaces the bluestore 'block' device to >> point to a file that has a name that is prefixed with 'spdk:'. >> Originally I assumed that the suffix would be the nvme device id, but >> it seems that it's not really needed, however, the file itself needs >> to contain the device id (see >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of >> minor fixes). > > Open a PR for those? Sure > >> As I understand it, in order to support multiple osds on the same NVMe >> device we have a few options. We can leverage NVMe namespaces, but >> that's not supported on all devices. We can configure bluestore to >> only use part of the device (device sharding? not sure if it supports >> it). 
I think it's best if we could keep bluestore out of the loop >> there and have the NVMe driver abstract multiple partitions of the >> NVMe device. The idea is to be able to define multiple partitions on >> the device (e.g., each partition will be defined by the offset, size, >> and namespace), and have the osd set to use a specific partition. >> We'll probably need a special tool to manage it, and potentially keep >> the partition table information on the device itself. The tool could >> also manage the creation of the block link. We should probably rethink >> how the link is structure and what it points at. > > I agree that bluestore shouldn't get involved. > > Is the NVMe namespaces meant to support multiple processes sharing the > same hardware device? More of a partitioning solution, but yes (as far as I understand). > > Also, if you do that, is it possible to give one of the namespaces to the > kernel? That might solve the bootstrapping problem we currently have Theoretically, but not right now (or ever?). See here: https://lists.01.org/pipermail/spdk/2016-July/000073.html > where we have nowhere to put the $osd_data filesystem with the device > metadata. (This is admittedly not necessarily a blocking issue. Putting > those dirs on / wouldn't be the end of the world; it just means cards > can't be easily moved between boxes.) > Maybe we can use bluestore for these too ;) That being said, there might be some kind of loopback solution that could work, but I'm not sure it wouldn't create major bottlenecks that we'd want to avoid. Yehuda ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: status of spdk 2016-11-09 0:06 ` Yehuda Sadeh-Weinraub @ 2016-11-09 0:21 ` LIU, Fei 2016-11-09 2:45 ` Dong Wu 2016-11-09 4:59 ` Haomai Wang 0 siblings, 2 replies; 18+ messages in thread From: LIU, Fei @ 2016-11-09 0:21 UTC (permalink / raw) To: Yehuda Sadeh-Weinraub, Sage Weil; +Cc: Wang, Haomai, ceph-devel Hi Yehuda and Haomai, So the issue is that a drive driven by SPDK cannot be shared by multiple OSDs the way a kernel NVMe drive can, since a device owned by an SPDK process so far cannot be shared across multiple processes such as OSDs, right? Regards, James ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: status of spdk 2016-11-09 0:21 ` LIU, Fei @ 2016-11-09 2:45 ` Dong Wu 2016-11-09 20:53 ` Moreno, Orlando 2016-11-09 4:59 ` Haomai Wang 1 sibling, 1 reply; 18+ messages in thread From: Dong Wu @ 2016-11-09 2:45 UTC (permalink / raw) To: LIU, Fei; +Cc: Yehuda Sadeh-Weinraub, Sage Weil, Wang, Haomai, ceph-devel Hi, Yehuda and Haomai, The DPDK backend may have the same problem. I tried to use Haomai's PR, https://github.com/ceph/ceph/pull/10748, to test the DPDK backend, but failed to start multiple OSDs on a host with only one network card. I also read about DPDK multi-process support, http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did not find any config option to enable it. Am I doing something wrong, or has multi-process support not been implemented? 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>: > Hi Yehuda and Haomai, > The issue of drives driven by SPDK is not able to be shared by multiple OSDs as kernel NVMe drive since SPDK as a process so far can not be shared across multiple processes like OSDs, right? > > Regards, > James ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: status of spdk 2016-11-09 2:45 ` Dong Wu @ 2016-11-09 20:53 ` Moreno, Orlando 2016-11-09 20:58 ` Sage Weil 0 siblings, 1 reply; 18+ messages in thread From: Moreno, Orlando @ 2016-11-09 20:53 UTC (permalink / raw) To: Dong Wu, LIU, Fei Cc: Yehuda Sadeh-Weinraub, Sage Weil, Wang, Haomai, ceph-devel, 'ifedotov@mirantis.com' Hi all, Multiple DPDK/SPDK instances on a single host do not work because the current implementation in Ceph does not support it. This issue is tracked here: http://tracker.ceph.com/issues/16966 . There is multi-process support in DPDK, but you must configure the EAL correctly for it to work. I have been working on a patch, https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user to configure multiple BlueStore OSDs backed by SPDK. Though this patch works, I think it needs a few additions to actually make it performant. This is just to get the case of one OSD process per NVMe device working. A multi-OSD per NVMe solution will probably require more work, as described in this thread. Thanks, Orlando -----Original Message----- From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu Sent: Tuesday, November 8, 2016 7:45 PM To: LIU, Fei <james.liu@alibaba-inc.com> Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel <ceph-devel@vger.kernel.org> Subject: Re: status of spdk Hi, Yehuda and Haomai, DPDK backend may have the same problem. I had tried to use haomai's PR: https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed to start multiple OSDs on the host with only one network card, alse i read about the dpdk multi-process support: http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did not find any config to set multi-process support. Anything wrong or multi-process support not been implemented?
^ permalink raw reply [flat|nested] 18+ messages in thread
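As a rough illustration of what "configure the EAL correctly" can involve when several independent DPDK/SPDK processes run side by side on one host, the sketch below builds one EAL argument list per OSD. The options used (`-l` for the core list, `-m` for memory in MB, `--file-prefix` for separate hugepage file names) are standard DPDK EAL command-line options; everything else, including the function name, the `osd<N>` prefix scheme, the default memory size, and how Ceph would actually hand these arguments to DPDK, is an assumption and is not taken from the wip-16966 patch.

```python
from typing import Dict, List


def eal_args_per_osd(osd_cores: Dict[int, List[int]],
                     mem_mb: int = 512) -> Dict[int, List[str]]:
    """Build hypothetical per-OSD EAL arguments with disjoint cores.

    osd_cores maps an OSD id to the CPU cores dedicated to it. Each OSD gets
    its own --file-prefix so the processes do not collide on hugepage files.
    """
    used = set()
    result: Dict[int, List[str]] = {}
    for osd_id, cores in sorted(osd_cores.items()):
        if not cores:
            raise ValueError(f"osd.{osd_id} has no cores assigned")
        clash = used.intersection(cores)
        if clash:
            raise ValueError(f"osd.{osd_id} reuses cores {sorted(clash)}")
        used.update(cores)
        result[osd_id] = [
            "-l", ",".join(str(c) for c in cores),  # core list for this process
            "-m", str(mem_mb),                      # memory to reserve, in MB
            "--file-prefix", f"osd{osd_id}",        # keep hugepage files apart
        ]
    return result


if __name__ == "__main__":
    for osd_id, args in eal_args_per_osd({0: [1, 2], 1: [3, 4]}).items():
        print(f"osd.{osd_id}:", " ".join(args))
```

This only keeps the processes from stepping on each other's cores and hugepage files; it does not by itself let two processes share one NIC or one NVMe device, which is the harder problem discussed in this thread.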
* RE: status of spdk 2016-11-09 20:53 ` Moreno, Orlando @ 2016-11-09 20:58 ` Sage Weil 2016-11-09 21:00 ` Gohad, Tushar 2016-11-09 21:10 ` Gohad, Tushar 0 siblings, 2 replies; 18+ messages in thread From: Sage Weil @ 2016-11-09 20:58 UTC (permalink / raw) To: Moreno, Orlando Cc: Dong Wu, LIU, Fei, Yehuda Sadeh-Weinraub, Wang, Haomai, ceph-devel, 'ifedotov@mirantis.com' On Wed, 9 Nov 2016, Moreno, Orlando wrote: > Hi all, > > Multiple DPDK/SPDK instances on a single host does not work because the > current implementation in Ceph does not support it. This issue is > tracked here: http://tracker.ceph.com/issues/16966 There is > multi-process support in DPDK, but you must configure the EAL correctly > for it to work. I have been working on a patch, > https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user to > configure multiple BlueStore OSDs backed by SPDK. Though this patch > works, I think it needs a few additions to actually make it performant. > > This is just to get the 1 OSD process per NVMe case working. A multi-OSD > per NVMe solution will probably require more work as described in this > thread. TBH I'm not sure how important the multi-osd per NVMe case is. The only reason to do that would be performance bottlenecks within the OSD itself, and I'd rather focus our efforts on eliminating those than on enabling a bandaid solution. As I understand it, the scenarios that are most interesting are 1- sharing the same network device among multiple osds with DPDK (this will presumably be pretty common unless/until we combine many OSDs into a single process), and 2- sharing a tiny portion of the NVMe device for the osd_data (usually a few MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id). I'm not sure whether this will be feasible. As I think Haomai mentioned, the next barrier is probably the requirements around the DPDK event loop and dedicated core?
sage ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: status of spdk
  2016-11-09 20:58 ` Sage Weil
@ 2016-11-09 21:00 ` Gohad, Tushar
  2016-11-09 21:10 ` Gohad, Tushar
  1 sibling, 0 replies; 18+ messages in thread
From: Gohad, Tushar @ 2016-11-09 21:00 UTC (permalink / raw)
  To: Sage Weil, Moreno, Orlando
  Cc: Dong Wu, LIU, Fei, Yehuda Sadeh-Weinraub, Wang, Haomai, ceph-devel, 'ifedotov@mirantis.com'

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, November 9, 2016 1:59 PM
To: Moreno, Orlando <orlando.moreno@intel.com>
Cc: Dong Wu <archer.wudong@gmail.com>; LIU, Fei <james.liu@alibaba-inc.com>; Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel <ceph-devel@vger.kernel.org>; 'ifedotov@mirantis.com' <ifedotov@mirantis.com>
Subject: RE: status of spdk

On Wed, 9 Nov 2016, Moreno, Orlando wrote:
> Hi all,
>
> Multiple DPDK/SPDK instances on a single host do not work because
> the current implementation in Ceph does not support it. This issue is
> tracked here: http://tracker.ceph.com/issues/16966. There is
> multi-process support in DPDK, but you must configure the EAL
> correctly for it to work. I have been working on a patch,
> https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user
> to configure multiple BlueStore OSDs backed by SPDK. Though this patch
> works, I think it needs a few additions to actually make it performant.
>
> This is just to get the 1 OSD process per NVMe case working. A
> multi-OSD per NVMe solution will probably require more work as
> described in this thread.

TBH I'm not sure how important the multi-osd per NVMe case is. The only reason to do that would be performance bottlenecks within the OSD itself, and I'd rather focus our efforts on eliminating those than on enabling a bandaid solution.
As I understand it the scenarios that are most interesting are

1- sharing the same network device to multiple osds with DPDK (this will presumably be pretty common unless/until we combine many OSDs into a single process), and

2- sharing a tiny portion of the NVMe device for the osd_data (usually a few MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id). Not sure this will be feasible or not.

As I think Haomai mentioned, the next barrier is probably the requirements around the DPDK event loop and dedicated core?

sage

>
> Thanks,
> Orlando
>
>
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org
> [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu
> Sent: Tuesday, November 8, 2016 7:45 PM
> To: LIU, Fei <james.liu@alibaba-inc.com>
> Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil
> <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel
> <ceph-devel@vger.kernel.org>
> Subject: Re: status of spdk
>
> Hi, Yehuda and Haomai,
> The DPDK backend may have the same problem. I tried to use Haomai's PR
> (https://github.com/ceph/ceph/pull/10748) to test the DPDK backend, but
> failed to start multiple OSDs on a host with only one network card. I also
> read about DPDK multi-process support
> (http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html), but did not
> find any config option to enable it. Am I doing something wrong, or has
> multi-process support not been implemented?
>
> 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>:
> > Hi Yehuda and Haomai,
> > The issue is that a drive driven by SPDK cannot be shared by multiple
> > OSDs the way a kernel NVMe drive can, since SPDK so far cannot be shared
> > across multiple processes such as OSDs, right?

^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: status of spdk
  2016-11-09 20:58 ` Sage Weil
  2016-11-09 21:00 ` Gohad, Tushar
@ 2016-11-09 21:10 ` Gohad, Tushar
  2016-11-10 22:39 ` Walker, Benjamin
  1 sibling, 1 reply; 18+ messages in thread
From: Gohad, Tushar @ 2016-11-09 21:10 UTC (permalink / raw)
  To: Sage Weil, Moreno, Orlando
  Cc: Dong Wu, LIU, Fei, Yehuda Sadeh-Weinraub, Wang, Haomai, ceph-devel, 'ifedotov@mirantis.com'

>> Multiple DPDK/SPDK instances on a single host do not work because
>> the current implementation in Ceph does not support it. This issue is
>> tracked here: http://tracker.ceph.com/issues/16966. There is
>> multi-process support in DPDK, but you must configure the EAL
>> correctly for it to work. I have been working on a patch,
>> https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user
>> to configure multiple BlueStore OSDs backed by SPDK. Though this patch
>> works, I think it needs a few additions to actually make it performant.
>> This is just to get the 1 OSD process per NVMe case working. A
>> multi-OSD per NVMe solution will probably require more work as
>> described in this thread.

> TBH I'm not sure how important the multi-osd per NVMe case is.
> The only reason to do that would be performance bottlenecks within the
> OSD itself, and I'd rather focus our efforts on eliminating those than on
> enabling a bandaid solution.

Completely agree here.

> As I understand it the scenarios that are most interesting are
> 1- sharing the same network device to multiple osds with DPDK (this will presumably be pretty common unless/until we combine many OSDs into a single process), and
> 2- sharing a tiny portion of the NVMe device for the osd_data (usually a few MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id).
> Not sure this will be feasible or not.

Unfortunately, this is not feasible without some form of partitioning support in SPDK (namespace/GPT, or in the form of a new LVM-like layer on top of SPDK bdev) - the latter is under development at the moment.

The limitation that Orlando identified is not being able to launch multiple SPDK-based OSDs on a node today. Igor and Orlando's PR (16966) adds a config option to limit the number of hugepages assigned to an OSD via an EAL switch. The other limitation today is not being able to specify the CPU mask assigned to each OSD, which would require a config addition.

Tushar

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: status of spdk
  2016-11-09 21:10 ` Gohad, Tushar
@ 2016-11-10 22:39 ` Walker, Benjamin
  2016-11-10 22:59 ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Walker, Benjamin @ 2016-11-10 22:39 UTC (permalink / raw)
  To: Gohad, Tushar, sage@newdream.net, Moreno, Orlando
  Cc: haomaiwang@gmail.com, archer.wudong@gmail.com, ifedotov@mirantis.com, james.liu@alibaba-inc.com, ceph-devel@vger.kernel.org, yehuda@redhat.com

On Wed, 2016-11-09 at 21:10 +0000, Gohad, Tushar wrote:
> > > Multiple DPDK/SPDK instances on a single host do not work because
> > > the current implementation in Ceph does not support it. This issue is
> > > tracked here: http://tracker.ceph.com/issues/16966. There is
> > > multi-process support in DPDK, but you must configure the EAL
> > > correctly for it to work. I have been working on a patch,
> > > https://github.com/ommoreno/ceph/tree/wip-16966, that allows the user
> > > to configure multiple BlueStore OSDs backed by SPDK. Though this patch
> > > works, I think it needs a few additions to actually make it performant.
> > > This is just to get the 1 OSD process per NVMe case working. A
> > > multi-OSD per NVMe solution will probably require more work as
> > > described in this thread.
> >
> > TBH I'm not sure how important the multi-osd per NVMe case is.
> > The only reason to do that would be performance bottlenecks within the
> > OSD itself, and I'd rather focus our efforts on eliminating those than on
> > enabling a bandaid solution.
>
> Completely agree here.

I'm not a Ceph expert (I'm the technical lead for SPDK), but I echo this sentiment 1000x. Even the fastest NVMe device can only do <1M 4k IOPS, which is a very modest number in terms of CPU time, so there is no technical reason that a single OSD can't saturate that. I understand that the OSDs of today aren't able to achieve that level of performance, but I'm optimistic a concerted long-term effort involving the experts could make it happen.
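
To put the "<1M 4k IOPS is modest in terms of CPU time" remark into numbers, here is a back-of-the-envelope sketch. The 3 GHz clock and the idea of one core dedicated to I/O are assumptions made purely for illustration; they are not figures from this thread.

```python
# Rough per-I/O CPU budget if a single core had to drive a whole NVMe device.
# Assumption (not from the thread): one 3.0 GHz core dedicated to I/O.

def cycles_per_io(core_hz: float, iops: float) -> float:
    """CPU cycles available for each I/O at the given request rate."""
    return core_hz / iops


def bandwidth_gib_per_s(iops: float, io_bytes: int) -> float:
    """Data rate implied by a request rate and a fixed I/O size."""
    return iops * io_bytes / 2**30


CORE_HZ = 3.0e9   # assumed clock rate
IOPS = 1.0e6      # upper bound quoted in the message above
IO_BYTES = 4096   # 4k I/O size

budget = cycles_per_io(CORE_HZ, IOPS)            # 3000 cycles per I/O
rate = bandwidth_gib_per_s(IOPS, IO_BYTES)       # a little under 4 GiB/s
```

Under these assumptions a single core has about 3000 cycles to spend on each request, which is the sense in which the I/O rate itself is not the limiting factor.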

I'd also like to explain a few things about NVMe to clear up some confusion I saw earlier in this thread. NVMe devices are composed of three major primitives - a singleton controller, some set of namespaces, and a set of queues. The namespaces are constructs on the SSD itself, and they're basically contiguous sets of logical blocks within a single NVMe controller. The vast majority of SSDs support exactly 1 namespace and I don't expect that to change going forward. The singular NVMe controller is what the NVMe driver is loaded against, so you can either have SPDK loaded or the kernel - you can't mix and match or split namespaces, etc.

NVMe also exposes a set of queues on which I/O requests can be submitted. These queues can submit an I/O request to any namespace on the device and there is no way to enforce particular queues mapping to particular namespaces. Therefore, namespaces aren't that valuable as a mechanism for sharing the drive - you basically still have to have a software/driver layer verifying that everyone is keeping their requests separate (the namespace mechanism is there so that the media can be formatted in different ways - i.e. different block sizes, additional metadata, etc.).

SPDK exposes these queues to the user so that applications can submit I/O on each queue entirely locklessly and with no coordination. Unfortunately, the version of SPDK currently in use by BlueStore is ancient and the queues are all implicit still. It probably doesn't matter for performance, since the BlueStore SPDK backend only sends I/O from a single thread, which means it is using just a single queue. NVMe devices almost universally can get their full performance using a single queue, so multiple queues are only useful for the application software to submit I/O from many threads simultaneously without locking (which BlueStore is not doing).
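
The "one queue per submitting thread, no locking" idea can be illustrated with a toy model. This is only a sketch: QueuePair, worker, and run are hypothetical names invented for the example and are not SPDK APIs, and a Python list stands in for a hardware submission queue.

```python
import threading


class QueuePair:
    """Toy stand-in for one NVMe I/O queue owned by exactly one thread."""

    def __init__(self, qid):
        self.qid = qid
        self.submitted = []  # touched only by the owning thread

    def submit(self, nsid, lba, nblocks):
        # No lock is taken: the queue is private to its thread, so there
        # is nothing to coordinate with other submitters.
        self.submitted.append((nsid, lba, nblocks))


def worker(qpair, n_requests):
    """Submit n_requests sequential 8-block requests on a private queue."""
    for i in range(n_requests):
        qpair.submit(nsid=1, lba=i * 8, nblocks=8)


def run(n_threads=4, n_requests=1000):
    """Give each thread its own queue and let all of them submit at once."""
    qpairs = [QueuePair(qid) for qid in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(qp, n_requests))
               for qp in qpairs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return qpairs
```

With a single submitting thread, as in the BlueStore backend described above, this collapses to one queue, which is why the extra queues would not buy anything there.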

The SPDK NVMe driver unbinds the nvme driver in the kernel, then maps the NVMe controller registers into a userspace process, so only that process has access to the device. We're currently modifying the driver to allocate the critical structures in shared memory so certain parts can be mapped by secondary processes. This does allow for some level of multi-process support. We mostly intended this for use with management tools like nvme-cli - they can attach to the main process and send some management commands and then detach. I'm not sure this is a great solution for sharing an NVMe device across multiple primary OSD processes though. We can definitely do something in this space to create a daemon process that owns the device and allows other processes to attach and allocate queues, but like I said above I think the effort is best spent on making the OSD faster.

Further, given what the NVMe hardware is actually capable of, I think the right solution for sharing an NVMe device within a process is to write a partition layer in software based on standard GPT partitioning. That could sit on top of the NVMe driver and do the enforcement of which parts can write to which logical blocks on the SSD. This would be the best way forward if the Ceph community pursues multiple OSDs in a single process (again, I think the time should be spent making one OSD fast enough to saturate one SSD instead).

> > As I understand it the scenarios that are most interesting are
> >
> > 1- sharing the same network device to multiple osds with DPDK (this will
> > presumably be pretty common unless/until we combine many OSDs into a single
> > process), and

I believe the best path forward here is SR-IOV hardware support in the NICs. I don't know what the state of the hardware on the NIC side is here, but I think SR-IOV is commonly available already on the network side.
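
Going back to the software partition layer suggested above (a layer on top of the NVMe driver that enforces which logical blocks each user may touch), a minimal sketch of the enforcement step might look like the following. It does not parse a real GPT; Partition and PartitionTable are hypothetical names made up for the example, and the table is simply built in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    first_lba: int   # first device LBA belonging to this partition
    num_blocks: int  # partition length in logical blocks

    def translate(self, lba, nblocks):
        """Map a partition-relative request to device LBAs.

        Raises ValueError if the request would leave the partition.
        """
        if lba < 0 or nblocks <= 0 or lba + nblocks > self.num_blocks:
            raise ValueError(f"{self.name}: request outside partition")
        return self.first_lba + lba, nblocks


class PartitionTable:
    """In-memory table that refuses overlapping or out-of-range entries."""

    def __init__(self, device_blocks):
        self.device_blocks = device_blocks
        self.parts = {}

    def add(self, name, first_lba, num_blocks):
        end = first_lba + num_blocks
        if name in self.parts:
            raise ValueError(f"{name}: duplicate partition name")
        if first_lba < 0 or num_blocks <= 0 or end > self.device_blocks:
            raise ValueError(f"{name}: partition does not fit on device")
        for p in self.parts.values():
            # Two half-open ranges overlap if each starts before the other ends.
            if first_lba < p.first_lba + p.num_blocks and p.first_lba < end:
                raise ValueError(f"{name}: overlaps {p.name}")
        part = Partition(name, first_lba, num_blocks)
        self.parts[name] = part
        return part
```

Each OSD would be handed one Partition and would call translate() before every submission, so the bounds check lives in one place rather than in BlueStore.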

> > 2- sharing a tiny portion of the NVMe device for the osd_data (usually a few
> > MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id).
> > Not sure this will be feasible or not.
>
> Unfortunately, this is not feasible without some form of partitioning support
> in SPDK (namespace/GPT or in form of a new LVM-like layer on top of SPDK bdev)
> - the latter is under development at the moment.

We are currently developing both a persistent block allocator and a very lightweight, minimally featured filesystem (no directories, no permissions, no times). The original target for these is as the backing store of RocksDB, but they can be easily expanded to store other data. It isn't necessarily our primary aim to incorporate this into Ceph, but it clearly fits into the Ceph internals very well. We haven't provided a timeline on open sourcing this, but we're actively writing the code now.

> The limitation that Orlando identified is not being able to launch multiple
> SPDK-based OSDs on a node today. Igor and Orlando's PR (16966) is to add a
> config option to limit the number of hugepages assigned to an OSD via an EAL
> switch. The other limitation today is not being able to specify the CPU mask
> assigned to each OSD which would require config addition.

To clarify, DPDK requires a process to declare the amount of memory (hugepages) and which CPU cores the process will use up front. If you want to run multiple DPDK-based processes on the same system, you just have to make sure there are enough hugepages and that the cores you specify don't overlap. That PR is just making it so you can configure these values. I just wanted to clarify that there isn't any deeper technical problem with running multiple DPDK processes on the same system - you all probably knew that but it's best to be clear.

Also, DPDK uses hugepages because it's the only good way to get "pinned" memory in userspace that userspace drivers can DMA into and out of.
That's because the kernel doesn't page out or move around hugepages (hugepages also happen to be more efficient TLB-wise given that the data buffers are often large transfers). There is some work on vfio-pci in the kernel that may provide a better solution in the long term, but I'm not totally up to speed on that.

Because data must reside in hugepages currently, all buffers sent to the SPDK backend for BlueStore are copied from wherever they are into a buffer from a pool allocated out of hugepages. It would be better if all data buffers were originally allocated from hugepage memory, but that's a bigger change to Ceph of course. Note that incoming packets from DPDK will also reside in hugepages upon DMA from the NIC, which would be convenient except that almost all NVMe devices today don't support fully flexible scatter-gather specification of buffers and you end up forced to copy simply to satisfy the alignment requirements of the DMA engine. Some day though!

Sorry to be so long-winded, but I'm happy to help with SPDK.

Thanks,
Ben

^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: status of spdk 2016-11-10 22:39 ` Walker, Benjamin @ 2016-11-10 22:59 ` Sage Weil 2016-11-10 23:54 ` Walker, Benjamin 0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2016-11-10 22:59 UTC (permalink / raw)
To: Walker, Benjamin
Cc: Gohad, Tushar, Moreno, Orlando, haomaiwang@gmail.com, archer.wudong@gmail.com, ifedotov@mirantis.com, james.liu@alibaba-inc.com, ceph-devel@vger.kernel.org, yehuda@redhat.com

[-- Attachment #1: Type: TEXT/PLAIN, Size: 13289 bytes --]

Hi-

Thanks, Ben-- this is super helpful!

On Thu, 10 Nov 2016, Walker, Benjamin wrote:
> > > 2- sharing a tiny portion of the NVMe device for the osd_data (usually a few
> > > MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id).
> > > Not sure this will be feasible or not.
> >
> > Unfortunately, this is not feasible without some form of partitioning support
> > in SPDK (namespace/GPT or in form of a new LVM-like layer on top of SPDK bdev)
> > - the latter is under development at the moment.
>
> We are currently developing both a persistent block allocator and a very
> lightweight, minimally featured filesystem (no directories, no permissions, no
> times). The original target for these are as the backing store of RocksDB, but
> they can be easily expanded to store other data. It isn't necessarily our
> primary aim to incorporate this into Ceph, but it clearly fits into the Ceph
> internals very well. We haven't provided a timeline on open sourcing this, but
> we're actively writing the code now.

This may not actually be that helpful. It's basically what BlueFS is already doing (it's a rocksdb::Env that implements a minimal "file system" and shares the device with the rest of BlueStore).

What this point is really about is more operational than anything. Currently disks (HDDs or SSDs) can be easily swapped between machines because they have GPT partition labels and udev rules to run 'ceph-disk trigger' on them.
That basically mounts the tagged partition to a temporary location, figures out which OSD it is, bind mounts it to the appropriate /var/lib/ceph/osd/* directory, and then starts up the process. With BlueStore there are just a handful of metadata/bootstrap files here to get the OSD started:

-rw-r--r-- 1 sage sage  2 Nov  9 11:40 bluefs
-rw-r--r-- 1 sage sage 37 Nov  9 11:40 ceph_fsid
-rw-r--r-- 1 sage sage 37 Nov  9 11:40 fsid
-rw------- 1 sage sage 56 Nov  9 11:40 keyring
-rw-r--r-- 1 sage sage  8 Nov  9 11:40 kv_backend
-rw-r--r-- 1 sage sage 21 Nov  9 11:40 magic
-rw-r--r-- 1 sage sage  4 Nov  9 11:40 mkfs_done
-rw-r--r-- 1 sage sage  6 Nov  9 11:40 ready
-rw-r--r-- 1 sage sage 10 Nov  9 11:40 type
-rw-r--r-- 1 sage sage  2 Nov  9 11:40 whoami

plus symlinks for block, block.db, and block.wal to the other partitions or devices with the actual block data.

With SPDK, we can't carve out a partition or label it, and certainly can't mount it, so we'll need to rethink the bootstrapping process. Fortunately that can be wrapped up reasonably neatly in the 'ceph-disk activate' function, but eventually we'll need to decide how to store/manage this metadata about the device.

Or just forget about easy hot swapping and stick these files on the host's root partition.

> > The limitation that Orlando identified is, not being able to launch multiple
> > SPDK-based OSDs on a node today. Igor and Orlando's PR (16966) is to add a
> > config option to limit the number of hugepages assigned to an OSD via an EAL
> > switch. The other limitation today is being able to specify the CPU mask
> > assigned to each OSD which would require config addition.
> >
>
> To clarify, DPDK requires a process to declare the amount of memory (hugepages)
> and which CPU cores the process will use up front. If you want to run multiple
> DPDK-based processes on the same system, you just have to make sure there are
> enough hugepages and that the cores you specify don't overlap.
That PR is just > making it so you can configure these values. I just wanted to clarify that there > isn't any deeper technical problem with running multiple DPDK processes on the > same system - you all probably knew that but it's best to be clear. > > Also, DPDK uses hugepages because it's the only good way to get "pinned" memory > in userspace that userspace drivers can DMA into and out of. That's because the > kernel doesn't page out or move around hugepages (hugepages also happen to be > more efficient TLB-wise given that the data buffers are often large transfers). > There is some work on vfio-pci in the kernel that may provide a better solution > in the long term, but I'm not totally up to speed on that. Because data must > reside in hugepages currently, all buffers sent to the SPDK backend for > BlueStore are copied from wherever they are into a buffer from a pool allocated > out of hugepages. It would be better if all data buffers were originally > allocated from hugepage memory, but that's a bigger change to Ceph of course. > Note that incoming packets from DPDK will also reside in hugepages upon DMA from > the NIC, which would be convenient except that almost all NVMe devices today > don't support fully flexible scatter-gather specification of buffers and you end > up forced to copy simply to satisfy the alignment requirements of the DMA > engine. Some day though! Yeah, we definitely want to get there eventually. When Ceph sends data over the wire it is preceded by a header that includes an alignment so that (with TCP currently) we read data off the socket into properly aligned memory. That way we can eventually do O_DIRECT writes with it. If it's possible to direct what memory the DPDK data comes into we can hopefully do something similar here... The rest of Ceph's bufferlist library should be flexible enough to enable zero-copy. > Sorry to be so long-winded, but I'm happy to help with SPDK. That's great to hear--this was very helpful for me! 
Thanks- sage > > Thanks, > Ben > > > Tushar > > > > > > > > > > > > > -----Original Message----- > > > From: ceph-devel-owner@vger.kernel.org > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu > > > Sent: Tuesday, November 8, 2016 7:45 PM > > > To: LIU, Fei <james.liu@alibaba-inc.com> > > > Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil > > > <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel > > > <ceph-devel@vger.kernel.org> > > > Subject: Re: status of spdk > > > > > > Hi, Yehuda and Haomai, > > > DPDK backend may have the same problem. I had tried to use haomai's PR: > > > https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed to > > > start multiple OSDs on the host with only one network card, alse i read > > > about the dpdk multi-process support: > > > http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did not > > > find any config to set multi-process support. Anything wrong or multi- > > > process support not been implemented? > > > > > > 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>: > > > > > > > > Hi Yehuda and Haomai, > > > > The issue of drives driven by SPDK is not able to be shared by multiple > > > > OSDs as kernel NVMe drive since SPDK as a process so far can not be shared > > > > across multiple processes like OSDs, right? > > > > > > > > Regards, > > > > James > > > > > > > > > > > > > > > > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel > > > > .org on behalf of yehuda@redhat.com> wrote: > > > > > > > > On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote: > > > > > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote: > > > > >> I just started looking at spdk, and have a few comments and > > > > questions. > > > > >> > > > > >> First, it's not clear to me how we should handle build. 
At the > > > > moment > > > > >> the spdk code resides as a submodule in the ceph tree, but it > > > > depends > > > > >> on dpdk, which currently needs to be downloaded separately. We can > > > > add > > > > >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That > > > > been > > > > >> said, getting it to build was a bit tricky and I think it might be > > > > >> broken with cmake. In order to get it working I resorted to > > > > building a > > > > >> system library and use that. > > > > > > > > > > Note that this PR is about to merge > > > > > > > > > > https://github.com/ceph/ceph/pull/10748 > > > > > > > > > > which adds the DPDK submodule, so hopefully this issue will go away > > > > when > > > > > that merged or with a follow-on cleanup. > > > > > > > > > >> The way to currently configure an osd to use bluestore with spdk is > > > > by > > > > >> creating a symbolic link that replaces the bluestore 'block' device > > > > to > > > > >> point to a file that has a name that is prefixed with 'spdk:'. > > > > >> Originally I assumed that the suffix would be the nvme device id, > > > > but > > > > >> it seems that it's not really needed, however, the file itself > > > > needs > > > > >> to contain the device id (see > > > > >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple > > > > of > > > > >> minor fixes). > > > > > > > > > > Open a PR for those? > > > > > > > > Sure > > > > > > > > > > > > > >> As I understand it, in order to support multiple osds on the same > > > > NVMe > > > > >> device we have a few options. We can leverage NVMe namespaces, but > > > > >> that's not supported on all devices. We can configure bluestore to > > > > >> only use part of the device (device sharding? not sure if it > > > > supports > > > > >> it). I think it's best if we could keep bluestore out of the loop > > > > >> there and have the NVMe driver abstract multiple partitions of the > > > > >> NVMe device. 
The idea is to be able to define multiple partitions > > > > on > > > > >> the device (e.g., each partition will be defined by the offset, > > > > size, > > > > >> and namespace), and have the osd set to use a specific partition. > > > > >> We'll probably need a special tool to manage it, and potentially > > > > keep > > > > >> the partition table information on the device itself. The tool > > > > could > > > > >> also manage the creation of the block link. We should probably > > > > rethink > > > > >> how the link is structure and what it points at. > > > > > > > > > > I agree that bluestore shouldn't get involved. > > > > > > > > > > Is the NVMe namespaces meant to support multiple processes sharing > > > > the > > > > > same hardware device? > > > > > > > > More of a partitioning solution, but yes (as far as I undestand). > > > > > > > > > > > > > > Also, if you do that, is it possible to give one of the namespaces > > > > to the > > > > > kernel? That might solve the bootstrapping problem we > > > > currently have > > > > > > > > Theoretically, but not right now (or ever?). See here: > > > > > > > > https://lists.01.org/pipermail/spdk/2016-July/000073.html > > > > > > > > > where we have nowhere to put the $osd_data filesystem with the > > > > device > > > > > metadata. (This is admittedly not necessarily a blocking > > > > issue. Putting > > > > > those dirs on / wouldn't be the end of the world; it just means > > > > cards > > > > > can't be easily moved between boxes.) > > > > > > > > > > > > > Maybe we can use bluestore for these too ;) that been said, there > > > > might be some kind of a loopback solution that could work, but not > > > > sure if it won't create major bottlenecks that we'd want to avoid. 
> > > > > > > > Yehuda ^ permalink raw reply [flat|nested] 18+ messages in thread
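For reference, the step Sage describes as "figures out which OSD it is" could look roughly like the sketch below, which reads the small bootstrap files from the directory listing in his message. It assumes each of those files holds a single line of text, which their sizes suggest but the thread does not confirm. `read_bootstrap` and `osd_mount_point` are hypothetical helper names, not ceph-disk functions.

```python
from pathlib import Path
from typing import Dict

# File names taken from the directory listing in the message above.
# The keyring is left out because it is not a single-value file.
BOOTSTRAP_FILES = (
    "bluefs", "ceph_fsid", "fsid", "kv_backend", "magic",
    "mkfs_done", "ready", "type", "whoami",
)


def read_bootstrap(osd_data: str) -> Dict[str, str]:
    """Return the contents of the small text files found in an OSD data dir.

    Assumption: each file holds one line of text. Missing files are skipped.
    """
    values = {}
    for name in BOOTSTRAP_FILES:
        path = Path(osd_data) / name
        if path.is_file():
            values[name] = path.read_text().strip()
    return values


def osd_mount_point(osd_data: str, cluster: str = "ceph") -> str:
    """Build the /var/lib/ceph/osd/$cluster-$id path for this OSD."""
    info = read_bootstrap(osd_data)
    if "whoami" not in info:
        raise FileNotFoundError("no whoami file in " + osd_data)
    return f"/var/lib/ceph/osd/{cluster}-{int(info['whoami'])}"
```

With SPDK the open question in the thread is where these files live, since there is no mountable partition to read them from; the lookup itself stays this simple.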
* Re: status of spdk 2016-11-10 22:59 ` Sage Weil @ 2016-11-10 23:54 ` Walker, Benjamin 0 siblings, 0 replies; 18+ messages in thread From: Walker, Benjamin @ 2016-11-10 23:54 UTC (permalink / raw) To: sage@newdream.net Cc: haomaiwang@gmail.com, james.liu@alibaba-inc.com, archer.wudong@gmail.com, Moreno, Orlando, yehuda@redhat.com, Gohad, Tushar, ifedotov@mirantis.com, ceph-devel@vger.kernel.org On Thu, 2016-11-10 at 22:59 +0000, Sage Weil wrote: > Hi- > > Thanks, Ben-- this is super helpful! > > On Thu, 10 Nov 2016, Walker, Benjamin wrote: > > > > > > > > > > > > > 2- sharing a tiny portion of the NVMe device for the osd_data (usually a > > > > few > > > > MB of metadata that gets mounted at /var/lib/ceph/osd/$cluster-$id). > > > > Not sure this will be feasible or not. > > > > > > Unfortunately, this is not feasible without some form of partitioning > > > support > > > in SPDK (namespace/GPT or in form of a new LVM-like layer on top of SPDK > > > bdev) > > > - the latter is under development at the moment. > > > > We are currently developing both a persistent block allocator and a very > > lightweight, minimally featured filesystem (no directories, no permissions, > > no > > times). The original target for these are as the backing store of RocksDB, > > but > > they can be easily expanded to store other data. It isn't necessarily our > > primary aim to incorporate this into Ceph, but it clearly fits into the Ceph > > internals very well. We haven't provided a timeline on open sourcing this, > > but > > we're actively writing the code now. > > This may not actually be that helpful. It's basically what BlueFS is > already doing (it's a rocksdb::Env that implements minimal "file system" > and shares the device with teh rest of BlueStore). Understood - this is pretty much BlueFS + BlueStore in a standalone format. 
On the surface it seems like it's duplicated work, but we are very heavily focused on solid state media (particularly, next generation media beyond NAND) and that has led us to diverge quite a bit in design from BlueStore. If our work ends up benefiting Ceph in some way in the longer term, that's great, but I understand Ceph already has code doing somewhat similar things.

>
> What this point is really about is more operational than anything.
> Currently disks (HDDs or SSDs) can be easily swapped between machines
> because they have GPT partition labels and udev rules to run 'ceph-disk
> trigger' on them. That basically mounts the tagged partition to a
> temporary location, figures out which OSD it is, bind mounts it to the
> appropriate /var/lib/ceph/osd/* directory, and then starts up the process.
> With BlueStore there are just a handful of metadata/bootstrap files here
> to get the OSD started:
>
> -rw-r--r-- 1 sage sage  2 Nov  9 11:40 bluefs
> -rw-r--r-- 1 sage sage 37 Nov  9 11:40 ceph_fsid
> -rw-r--r-- 1 sage sage 37 Nov  9 11:40 fsid
> -rw------- 1 sage sage 56 Nov  9 11:40 keyring
> -rw-r--r-- 1 sage sage  8 Nov  9 11:40 kv_backend
> -rw-r--r-- 1 sage sage 21 Nov  9 11:40 magic
> -rw-r--r-- 1 sage sage  4 Nov  9 11:40 mkfs_done
> -rw-r--r-- 1 sage sage  6 Nov  9 11:40 ready
> -rw-r--r-- 1 sage sage 10 Nov  9 11:40 type
> -rw-r--r-- 1 sage sage  2 Nov  9 11:40 whoami
>
> plus a symlink for block, block.db, and block.wal to the other partitions
> or devices with the actual block data.
>
> With SPDK, we can't carve out a partition or label it, and certainly can't
> mount it, so we'll need to rethink the bootstrapping process. Fortunately
> that can be wrapped up reasonably neatly in the 'ceph-disk activate'
> function, but eventually we'll need to decide how to store/manage this
> metadata about the device.

This sounds like a solvable problem to me. An OSD using BlueStore uses a block device that has one GPT partition with a filesystem (XFS?)
that contains the above bootstrapping data, plus some number of other GPT partitions with no filesystems that are used for everything else, right? I think there are two changes that could be made here. First, the bootstrap partition needs to contain a BlueStore/Ceph-specific formatted data layout instead of using a kernel filesystem. Maybe it could even be simpler and just have a flat binary layout containing the above files sequentially or something. Second, the BlueStore SPDK backend needs to comprehend real GPT partition metadata (this part is not particularly hard - GPT is simple). That way, the disk format between OSDs using SPDK and those using the kernel are identical and SPDK respects the partitions and can locate them by partition label. Once they're identical, Ceph can simply load using the GPT partition label and udev mechanism as it does today, then dynamically unbind the kernel nvme driver from the device (you just write to sysfs) and load SPDK in its place. Because the SPDK backend is expecting the same disk format as the kernel, it will load without issue. I think this probably solves a few of the other pain points of using Ceph with SPDK too around configuration. With this strategy all you have to do is flag the OSD to use SPDK with no other configuration changes (well, maybe the number of hugepages and which cores are allowed). This is because most of the configuration for the disks is around specifying which data is where, and that seems to be done by GPT partition label which the SPDK backend would now comprehend. > > Or just forget about easy hot swapping and stick these files on the > hosts root partition. > > > > > > > > > The limitation that Orlando identified is, not being able to launch > > > multiple > > > SPDK-based OSDs on a node today. Igor and Orlando's PR (16966) is to add > > > a > > > config option to limit the number of hugepages assigned to an OSD via an > > > EAL > > > switch. 
The other limitation today is being able to specify the CPU mask > > > assigned to each OSD which would require config addition. > > > > > > > To clarify, DPDK requires a process to declare the amount of memory > > (hugepages) > > and which CPU cores the process will use up front. If you want to run > > multiple > > DPDK-based processes on the same system, you just have to make sure there > > are > > enough hugepages and that the cores you specify don't overlap. That PR is > > just > > making it so you can configure these values. I just wanted to clarify that > > there > > isn't any deeper technical problem with running multiple DPDK processes on > > the > > same system - you all probably knew that but it's best to be clear. > > > > Also, DPDK uses hugepages because it's the only good way to get "pinned" > > memory > > in userspace that userspace drivers can DMA into and out of. That's because > > the > > kernel doesn't page out or move around hugepages (hugepages also happen to > > be > > more efficient TLB-wise given that the data buffers are often large > > transfers). > > There is some work on vfio-pci in the kernel that may provide a better > > solution > > in the long term, but I'm not totally up to speed on that. Because data must > > reside in hugepages currently, all buffers sent to the SPDK backend for > > BlueStore are copied from wherever they are into a buffer from a pool > > allocated > > out of hugepages. It would be better if all data buffers were originally > > allocated from hugepage memory, but that's a bigger change to Ceph of > > course. > > Note that incoming packets from DPDK will also reside in hugepages upon DMA > > from > > the NIC, which would be convenient except that almost all NVMe devices today > > don't support fully flexible scatter-gather specification of buffers and you > > end > > up forced to copy simply to satisfy the alignment requirements of the DMA > > engine. Some day though! 
> > Yeah, we definitely want to get there eventually. When Ceph sends data > over the wire it is preceded by a header that includes an alignment > so that (with TCP currently) we read data off the socket into > properly aligned memory. That way we can eventually do O_DIRECT writes > with it. If it's possible to direct what memory the DPDK data comes into > we can hopefully do something similar here... The rest of Ceph's > bufferlist library should be flexible enough to enable zero-copy. > > > > > Sorry to be so long-winded, but I'm happy to help with SPDK. > > That's great to hear--this was very helpful for me! > > Thanks- > sage > > > > > > > > > > Thanks, > > Ben > > > > > > > > Tushar > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > From: ceph-devel-owner@vger.kernel.org > > > > [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Dong Wu > > > > Sent: Tuesday, November 8, 2016 7:45 PM > > > > To: LIU, Fei <james.liu@alibaba-inc.com> > > > > Cc: Yehuda Sadeh-Weinraub <yehuda@redhat.com>; Sage Weil > > > > <sweil@redhat.com>; Wang, Haomai <haomaiwang@gmail.com>; ceph-devel > > > > <ceph-devel@vger.kernel.org> > > > > Subject: Re: status of spdk > > > > > > > > Hi, Yehuda and Haomai, > > > > DPDK backend may have the same problem. I had tried to use > > > > haomai's PR: > > > > https://github.com/ceph/ceph/pull/10748 to test dpdk backend, but failed > > > > to > > > > start multiple OSDs on the host with only one network card, alse i read > > > > about the dpdk multi-process support: > > > > http://dpdk.org/doc/guides/prog_guide/multi_proc_support.html, but did > > > > not > > > > find any config to set multi-process support. Anything wrong or multi- > > > > process support not been implemented? 
> > > > > > > > 2016-11-09 8:21 GMT+08:00 LIU, Fei <james.liu@alibaba-inc.com>: > > > > > > > > > > > > > > > Hi Yehuda and Haomai, > > > > > The issue of drives driven by SPDK is not able to be shared by > > > > > multiple > > > > > OSDs as kernel NVMe drive since SPDK as a process so far can not be > > > > > shared > > > > > across multiple processes like OSDs, right? > > > > > > > > > > Regards, > > > > > James > > > > > > > > > > > > > > > > > > > > On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.ke > > > > > rnel > > > > > .org on behalf of yehuda@redhat.com> wrote: > > > > > > > > > > On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> > > > > > wrote: > > > > > > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote: > > > > > >> I just started looking at spdk, and have a few comments and > > > > > questions. > > > > > >> > > > > > >> First, it's not clear to me how we should handle build. At the > > > > > moment > > > > > >> the spdk code resides as a submodule in the ceph tree, but it > > > > > depends > > > > > >> on dpdk, which currently needs to be downloaded separately. We > > > > > can > > > > > add > > > > > >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That > > > > > been > > > > > >> said, getting it to build was a bit tricky and I think it might > > > > > be > > > > > >> broken with cmake. In order to get it working I resorted to > > > > > building a > > > > > >> system library and use that. > > > > > > > > > > > > Note that this PR is about to merge > > > > > > > > > > > > https://github.com/ceph/ceph/pull/10748 > > > > > > > > > > > > which adds the DPDK submodule, so hopefully this issue will go > > > > > away > > > > > when > > > > > > that merged or with a follow-on cleanup. 
> > > > > > > > > > > >> The way to currently configure an osd to use bluestore with > > > > > spdk is > > > > > by > > > > > >> creating a symbolic link that replaces the bluestore 'block' > > > > > device > > > > > to > > > > > >> point to a file that has a name that is prefixed with 'spdk:'. > > > > > >> Originally I assumed that the suffix would be the nvme device > > > > > id, > > > > > but > > > > > >> it seems that it's not really needed, however, the file itself > > > > > needs > > > > > >> to contain the device id (see > > > > > >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a > > > > > couple > > > > > of > > > > > >> minor fixes). > > > > > > > > > > > > Open a PR for those? > > > > > > > > > > Sure > > > > > > > > > > > > > > > > >> As I understand it, in order to support multiple osds on the > > > > > same > > > > > NVMe > > > > > >> device we have a few options. We can leverage NVMe namespaces, > > > > > but > > > > > >> that's not supported on all devices. We can configure bluestore > > > > > to > > > > > >> only use part of the device (device sharding? not sure if it > > > > > supports > > > > > >> it). I think it's best if we could keep bluestore out of the > > > > > loop > > > > > >> there and have the NVMe driver abstract multiple partitions of > > > > > the > > > > > >> NVMe device. The idea is to be able to define multiple > > > > > partitions > > > > > on > > > > > >> the device (e.g., each partition will be defined by the offset, > > > > > size, > > > > > >> and namespace), and have the osd set to use a specific > > > > > partition. > > > > > >> We'll probably need a special tool to manage it, and > > > > > potentially > > > > > keep > > > > > >> the partition table information on the device itself. The tool > > > > > could > > > > > >> also manage the creation of the block link. We should probably > > > > > rethink > > > > > >> how the link is structure and what it points at. 
> > > > > > > > > > > > I agree that bluestore shouldn't get involved. > > > > > > > > > > > > Is the NVMe namespaces meant to support multiple processes > > > > > sharing > > > > > the > > > > > > same hardware device? > > > > > > > > > > More of a partitioning solution, but yes (as far as I undestand). > > > > > > > > > > > > > > > > > Also, if you do that, is it possible to give one of the > > > > > namespaces > > > > > to the > > > > > > kernel? That might solve the bootstrapping problem we > > > > > currently have > > > > > > > > > > Theoretically, but not right now (or ever?). See here: > > > > > > > > > > https://lists.01.org/pipermail/spdk/2016-July/000073.html > > > > > > > > > > > where we have nowhere to put the $osd_data filesystem with the > > > > > device > > > > > > metadata. (This is admittedly not necessarily a blocking > > > > > issue. Putting > > > > > > those dirs on / wouldn't be the end of the world; it just means > > > > > cards > > > > > > can't be easily moved between boxes.) > > > > > > > > > > > > > > > > Maybe we can use bluestore for these too ;) that been said, there > > > > > might be some kind of a loopback solution that could work, but not > > > > > sure if it won't create major bottlenecks that we'd want to avoid. 
> > > > > > > > > > Yehuda ^ permalink raw reply [flat|nested] 18+ messages in thread
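Ben's suggestion of "a flat binary layout containing the above files sequentially" could be prototyped along these lines. The on-disk format here is invented purely for illustration (the `MAGIC` value, field widths, and function names are all assumptions); the thread does not define one.

```python
import struct
from typing import Dict

MAGIC = b"OSDBOOT1"               # hypothetical 8-byte signature
_HEADER = struct.Struct("<8sI")   # magic, number of entries
_ENTRY = struct.Struct("<HI")     # name length, data length


def pack_bootstrap(files: Dict[str, bytes]) -> bytes:
    """Serialize {file name: contents} into one flat blob."""
    out = [_HEADER.pack(MAGIC, len(files))]
    for name, data in sorted(files.items()):
        raw_name = name.encode("utf-8")
        out.append(_ENTRY.pack(len(raw_name), len(data)))
        out.append(raw_name)
        out.append(data)
    return b"".join(out)


def unpack_bootstrap(blob: bytes) -> Dict[str, bytes]:
    """Inverse of pack_bootstrap; raises ValueError on a malformed blob."""
    if len(blob) < _HEADER.size:
        raise ValueError("blob too short for header")
    magic, count = _HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("bad magic, not a bootstrap blob")
    pos = _HEADER.size
    files = {}
    for _ in range(count):
        if pos + _ENTRY.size > len(blob):
            raise ValueError("truncated entry header")
        name_len, data_len = _ENTRY.unpack_from(blob, pos)
        pos += _ENTRY.size
        if pos + name_len + data_len > len(blob):
            raise ValueError("truncated entry body")
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        files[name] = blob[pos:pos + data_len]
        pos += data_len
    return files
```

The ten files in the listing total under 200 bytes, so a blob like this would fit comfortably in a single block of a small bootstrap partition.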
* Re: status of spdk 2016-11-09 0:21 ` LIU, Fei 2016-11-09 2:45 ` Dong Wu @ 2016-11-09 4:59 ` Haomai Wang 2016-11-09 5:02 ` LIU, Fei 1 sibling, 1 reply; 18+ messages in thread
From: Haomai Wang @ 2016-11-09 4:59 UTC (permalink / raw)
To: LIU, Fei; +Cc: Yehuda Sadeh-Weinraub, Sage Weil, ceph-devel

On Wed, Nov 9, 2016 at 8:21 AM, LIU, Fei <james.liu@alibaba-inc.com> wrote:
> Hi Yehuda and Haomai,
> The issue of drives driven by SPDK is not able to be shared by multiple OSDs as kernel NVMe drive since SPDK as a process so far can not be shared across multiple processes like OSDs, right?

Multi-process support in the SPDK NVMe driver is a feature currently under development in SPDK; it will be implemented via shared memory among the processes.

>
> Regards,
> James
>
>
>
> On 11/8/16, 4:06 PM, "Yehuda Sadeh-Weinraub" <ceph-devel-owner@vger.kernel.org on behalf of yehuda@redhat.com> wrote:
>
> On Tue, Nov 8, 2016 at 3:40 PM, Sage Weil <sweil@redhat.com> wrote:
> > On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
> >> I just started looking at spdk, and have a few comments and questions.
> >>
> >> First, it's not clear to me how we should handle build. At the moment
> >> the spdk code resides as a submodule in the ceph tree, but it depends
> >> on dpdk, which currently needs to be downloaded separately. We can add
> >> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
> >> said, getting it to build was a bit tricky and I think it might be
> >> broken with cmake. In order to get it working I resorted to building a
> >> system library and use that.
> >
> > Note that this PR is about to merge
> >
> > https://github.com/ceph/ceph/pull/10748
> >
> > which adds the DPDK submodule, so hopefully this issue will go away when
> > that merged or with a follow-on cleanup.
> >
> >> The way to currently configure an osd to use bluestore with spdk is by
> >> creating a symbolic link that replaces the bluestore 'block' device to
> >> point to a file that has a name that is prefixed with 'spdk:'.
> >> Originally I assumed that the suffix would be the nvme device id, but > >> it seems that it's not really needed, however, the file itself needs > >> to contain the device id (see > >> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of > >> minor fixes). > > > > Open a PR for those? > > Sure > > > > >> As I understand it, in order to support multiple osds on the same NVMe > >> device we have a few options. We can leverage NVMe namespaces, but > >> that's not supported on all devices. We can configure bluestore to > >> only use part of the device (device sharding? not sure if it supports > >> it). I think it's best if we could keep bluestore out of the loop > >> there and have the NVMe driver abstract multiple partitions of the > >> NVMe device. The idea is to be able to define multiple partitions on > >> the device (e.g., each partition will be defined by the offset, size, > >> and namespace), and have the osd set to use a specific partition. > >> We'll probably need a special tool to manage it, and potentially keep > >> the partition table information on the device itself. The tool could > >> also manage the creation of the block link. We should probably rethink > >> how the link is structure and what it points at. > > > > I agree that bluestore shouldn't get involved. > > > > Is the NVMe namespaces meant to support multiple processes sharing the > > same hardware device? > > More of a partitioning solution, but yes (as far as I undestand). > > > > > Also, if you do that, is it possible to give one of the namespaces to the > > kernel? That might solve the bootstrapping problem we currently have > > Theoretically, but not right now (or ever?). See here: > > https://lists.01.org/pipermail/spdk/2016-July/000073.html > > > where we have nowhere to put the $osd_data filesystem with the device > > metadata. (This is admittedly not necessarily a blocking issue. 
Putting > > those dirs on / wouldn't be the end of the world; it just means cards > > can't be easily moved between boxes.) > > > > Maybe we can use bluestore for these too ;) that been said, there > might be some kind of a loopback solution that could work, but not > sure if it won't create major bottlenecks that we'd want to avoid. > > Yehuda > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- Best Regards, Wheat ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: status of spdk
From: LIU, Fei @ 2016-11-09 5:02 UTC
To: Haomai Wang, Liu, Changpeng; +Cc: Yehuda Sadeh-Weinraub, Sage Weil, ceph-devel

Haomai,
Thanks a lot.

Regards,
James

Hi Changpeng,
Would you mind updating us on the status of multi-process support in SPDK?

Regards,
James
* RE: status of spdk
From: Liu, Changpeng @ 2016-11-09 5:09 UTC
To: LIU, Fei, Haomai Wang
Cc: Yehuda Sadeh-Weinraub, Sage Weil, ceph-devel, Cao, Gang, Yang, Ziye, Dai, Qihua, Harris, James R

Hi James,

Yes, multi-process support in SPDK is under development; Gang is the developer of this feature. We are targeting the SPDK 16.12 release (WW50) for it.
* Re: status of spdk
From: LIU, Fei @ 2016-11-09 5:23 UTC
To: Liu, Changpeng, Haomai Wang
Cc: Yehuda Sadeh-Weinraub, Sage Weil, ceph-devel, Cao, Gang, Yang, Ziye, Dai, Qihua, Harris, James R

Hi Changpeng,
Thanks a lot for your update.

Regards,
James
* Re: status of spdk
From: Haomai Wang @ 2016-11-09 4:49 UTC
To: Sage Weil; +Cc: Yehuda Sadeh-Weinraub, ceph-devel

On Wed, Nov 9, 2016 at 7:40 AM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 8 Nov 2016, Yehuda Sadeh-Weinraub wrote:
>> I just started looking at spdk, and have a few comments and questions.
>>
>> First, it's not clear to me how we should handle build. At the moment
>> the spdk code resides as a submodule in the ceph tree, but it depends
>> on dpdk, which currently needs to be downloaded separately. We can add
>> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
>> said, getting it to build was a bit tricky and I think it might be
>> broken with cmake. In order to get it working I resorted to building a
>> system library and use that.
>
> Note that this PR is about to merge
>
> https://github.com/ceph/ceph/pull/10748
>
> which adds the DPDK submodule, so hopefully this issue will go away when
> that merged or with a follow-on cleanup.

I rebased it, and I think we can merge now.

>
>> The way to currently configure an osd to use bluestore with spdk is by
>> creating a symbolic link that replaces the bluestore 'block' device to
>> point to a file that has a name that is prefixed with 'spdk:'.
>> Originally I assumed that the suffix would be the nvme device id, but
>> it seems that it's not really needed, however, the file itself needs
>> to contain the device id (see
>> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
>> minor fixes).
>
> Open a PR for those?

Yep!

>
>> As I understand it, in order to support multiple osds on the same NVMe
>> device we have a few options. We can leverage NVMe namespaces, but
>> that's not supported on all devices. We can configure bluestore to
>> only use part of the device (device sharding? not sure if it supports
>> it). I think it's best if we could keep bluestore out of the loop
>> there and have the NVMe driver abstract multiple partitions of the
>> NVMe device. The idea is to be able to define multiple partitions on
>> the device (e.g., each partition will be defined by the offset, size,
>> and namespace), and have the osd set to use a specific partition.
>> We'll probably need a special tool to manage it, and potentially keep
>> the partition table information on the device itself. The tool could
>> also manage the creation of the block link. We should probably rethink
>> how the link is structure and what it points at.
>
> I agree that bluestore shouldn't get involved.
>
> Is the NVMe namespaces meant to support multiple processes sharing the
> same hardware device?

Sure.

>
> Also, if you do that, is it possible to give one of the namespaces to the
> kernel? That might solve the bootstrapping problem we currently have
> where we have nowhere to put the $osd_data filesystem with the device
> metadata. (This is admittedly not necessarily a blocking issue. Putting
> those dirs on / wouldn't be the end of the world; it just means cards
> can't be easily moved between boxes.)

The SPDK community is adding SPDK backend support to nvme-cli. By default nvme-cli can only operate on the kernel NVMe module, but Intel is working on making SPDK-managed devices operable through nvme-cli, which will be much more convenient for users.

>
> sage

--
Best Regards,
Wheat
* Re: status of spdk
From: Haomai Wang @ 2016-11-09 4:45 UTC
To: Yehuda Sadeh-Weinraub; +Cc: Weil, Sage, ceph-devel

On Wed, Nov 9, 2016 at 7:31 AM, Yehuda Sadeh-Weinraub <yehuda@redhat.com> wrote:
> I just started looking at spdk, and have a few comments and questions.
>
> First, it's not clear to me how we should handle build. At the moment
> the spdk code resides as a submodule in the ceph tree, but it depends
> on dpdk, which currently needs to be downloaded separately. We can add
> it as a submodule (upstream is here: git://dpdk.org/dpdk). That been
> said, getting it to build was a bit tricky and I think it might be
> broken with cmake. In order to get it working I resorted to building a
> system library and use that.

Yes. Because we expect the DPDK submodule to merge soon, we left this aside. For now the easiest way to complete the build is `yum install dpdk-devel`, rather than cloning the DPDK repo separately.

>
> The way to currently configure an osd to use bluestore with spdk is by
> creating a symbolic link that replaces the bluestore 'block' device to
> point to a file that has a name that is prefixed with 'spdk:'.
> Originally I assumed that the suffix would be the nvme device id, but
> it seems that it's not really needed, however, the file itself needs
> to contain the device id (see
> https://github.com/yehudasa/ceph/tree/wip-yehuda-spdk for a couple of
> minor fixes).

Hmm, I commented on this in config_opts.h:

// If you want to use spdk driver, you need to specify NVMe serial number here
// with "spdk:" prefix.
// Users can use 'lspci -vvv -d 8086:0953 | grep "Device Serial Number"' to
// get the serial number of Intel(R) Fultondale NVMe controllers.
// Example:
// bluestore_block_path = spdk:55cd2e404bd73932

We don't need to create the symbolic link by hand; it can be done in the bluestore code.
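The `spdk:` convention quoted from the config comment can be illustrated with a short Python sketch. This is only an illustration of the prefix rule, not Ceph's actual code: the function name `parse_block_path` and its return values are hypothetical, and only the `spdk:` prefix and the example serial number come from the message above.

```python
SPDK_PREFIX = "spdk:"  # prefix named in the quoted config comment


def parse_block_path(path):
    """Classify a bluestore_block_path value (hypothetical helper).

    Returns ("spdk", serial) when the value carries the "spdk:" prefix,
    otherwise ("kernel", path) for an ordinary block device or file path.
    """
    if path.startswith(SPDK_PREFIX):
        serial = path[len(SPDK_PREFIX):]
        if not serial:
            raise ValueError("missing NVMe serial number after 'spdk:'")
        return ("spdk", serial)
    return ("kernel", path)


if __name__ == "__main__":
    print(parse_block_path("spdk:55cd2e404bd73932"))
    print(parse_block_path("/dev/nvme0n1"))
```

Keeping the serial number in the option value itself, as the comment suggests, would mean no hand-made symbolic link is needed to select the device.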
>
> As I understand it, in order to support multiple osds on the same NVMe
> device we have a few options. We can leverage NVMe namespaces, but
> that's not supported on all devices. We can configure bluestore to
> only use part of the device (device sharding? not sure if it supports
> it). I think it's best if we could keep bluestore out of the loop
> there and have the NVMe driver abstract multiple partitions of the
> NVMe device. The idea is to be able to define multiple partitions on
> the device (e.g., each partition will be defined by the offset, size,
> and namespace), and have the osd set to use a specific partition.
> We'll probably need a special tool to manage it, and potentially keep
> the partition table information on the device itself. The tool could
> also manage the creation of the block link. We should probably rethink
> how the link is structure and what it points at.

I discussed multiple namespaces with Intel; SPDK will embed multi-namespace management. But until a single ceph-osd process can support multiple OSD instances, I think we need to handle offset/length on the application side.

Besides these problems, the most important thing is getting rid of SPDK's dependence on DPDK. Until the multi-OSD-in-a-single-process feature is done, we can't afford multiple polling threads each consuming 100% CPU time.

>
> Any thoughts?
>
> Yehuda

--
Best Regards,
Wheat
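The offset/length approach discussed in this message (each partition defined by offset, size, and namespace, with the translation done on the application side) might look roughly like the Python sketch below. Everything here is an assumption made for illustration: the `Partition` class, the `split_evenly` helper, and the 4096-byte alignment are hypothetical and do not correspond to any existing Ceph or SPDK interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """A contiguous byte range of one NVMe namespace assigned to one OSD."""
    namespace: int
    offset: int   # byte offset of the partition on the device
    length: int   # partition size in bytes

    def to_device_offset(self, osd_offset, io_length):
        """Translate an OSD-relative I/O into an absolute device offset."""
        if osd_offset < 0 or io_length < 0 or osd_offset + io_length > self.length:
            raise ValueError("I/O falls outside the partition")
        return self.offset + osd_offset


def split_evenly(device_size, count, namespace=1, align=4096):
    """Carve a device into `count` equal, aligned, non-overlapping partitions."""
    per_part = (device_size // count) // align * align
    if per_part == 0:
        raise ValueError("device too small for requested partition count")
    return [Partition(namespace, i * per_part, per_part) for i in range(count)]


if __name__ == "__main__":
    # Four OSDs sharing a 1 GiB device; OSD 1 issues a 4 KiB write at its offset 4096.
    parts = split_evenly(1 << 30, 4)
    print(parts[1])
    print(parts[1].to_device_offset(4096, 4096))
```

Bounds-checking every I/O against the partition length is what would keep one OSD from touching another OSD's range when the device itself offers no namespace isolation.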
end of thread, ~2016-11-10 23:55 UTC

Thread overview: 18+ messages
2016-11-08 23:31 status of spdk Yehuda Sadeh-Weinraub
2016-11-08 23:40 ` Sage Weil
2016-11-09 0:06 ` Yehuda Sadeh-Weinraub
2016-11-09 0:21 ` LIU, Fei
2016-11-09 2:45 ` Dong Wu
2016-11-09 20:53 ` Moreno, Orlando
2016-11-09 20:58 ` Sage Weil
2016-11-09 21:00 ` Gohad, Tushar
2016-11-09 21:10 ` Gohad, Tushar
2016-11-10 22:39 ` Walker, Benjamin
2016-11-10 22:59 ` Sage Weil
2016-11-10 23:54 ` Walker, Benjamin
2016-11-09 4:59 ` Haomai Wang
2016-11-09 5:02 ` LIU, Fei
2016-11-09 5:09 ` Liu, Changpeng
2016-11-09 5:23 ` LIU, Fei
2016-11-09 4:49 ` Haomai Wang
2016-11-09 4:45 ` Haomai Wang