* Re: [SPDK] SPDK Dynamic Threading Model
@ 2018-05-25 19:03 Walker, Benjamin
From: Walker, Benjamin @ 2018-05-25 19:03 UTC
To: spdk
I've been doing my best to think this through over the last few days, as have a
number of other community members, and some things are beginning to look a bit
clearer now.
SPDK was always intended to be a composable set of libraries as opposed to a
framework. By that, I mean that SPDK is intended to be integrated into other
applications as opposed to existing code being integrated into SPDK. The
community has done a lot of work to attempt to make that happen, with varying
degrees of success. The challenges are primarily centered on two things. First,
SPDK requires special memory management operations to allocate DMA-safe memory.
This stems from the strict requirement to avoid data copies. The problem would
essentially go away if SPDK instead internally allocated DMA-safe memory and
copied user data into those buffers, but the performance would take a big hit.
Second, SPDK avoids locks by instead passing messages between threads. That
means that many components (although not all) within SPDK imply that the
application is using a certain threading model. Specifically, the threading
model needs to look like cooperative multi-tasking, or futures and promises, or
event loops, etc. So far the consensus seems to be that it is acceptable to
assume there is some threading model that is conducive to message passing, but
we don't want to specifically pick a single model or framework.
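
To make the message-passing idea concrete, here is a minimal sketch in Python.
It is an illustration only, not SPDK code: the Thread class, send(), poll()
and increment() are hypothetical names. Each thread owns its state and an
inbox; other threads never touch that state directly, they enqueue a function
for the owner to run on its next poll.

```python
from collections import deque

class Thread:
    """A cooperative thread that owns its state and a message queue."""
    def __init__(self, name):
        self.name = name
        self.inbox = deque()      # messages are plain callables
        self.counter = 0          # state touched only by the owning thread

    def send(self, fn):
        # Called by other threads: enqueue work instead of taking a lock.
        self.inbox.append(fn)

    def poll(self):
        # Called only by the owner: run every message that has arrived.
        ran = 0
        while self.inbox:
            self.inbox.popleft()(self)
            ran += 1
        return ran

def increment(thread):
    thread.counter += 1

a, b = Thread("a"), Thread("b")
# "a" wants to update state owned by "b": it sends messages rather than
# writing b.counter directly.
for _ in range(3):
    b.send(increment)
before = b.counter
handled = b.poll()
```

The state changes only when the owner polls, which is why the surrounding
application has to provide some kind of loop or scheduler that calls poll()
regularly. In a real multi-threaded program the inbox itself would need to be
a thread-safe queue.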
The problem that John, Madhu, and the others at NetApp have identified is that
SPDK currently makes entirely too many assumptions about and places too many
strict requirements on the mechanics of the threading model in an application. I
think there is a strong consensus that fixing this is important and should be
high priority. The fix, ultimately, will be better abstractions around the
underlying application's threading model. I hope we can design something that
will enable people to plug SPDK into all sorts of frameworks - green threading
frameworks, DPDK lthreads, Seastar, coroutine frameworks, etc. The more people
we can get participating in this work, the better the abstractions will be, so
please everyone chime in with requirements and ideas.
The current set of patches breaks the 1:1 mapping between reactors and cores.
Instead, reactors are stored on a global list. Each core iterates over this
global list: it pulls the next reactor, processes any waiting events, executes
its pollers, and then places the reactor back on the list. I'm concerned about
three things with this design:
* Since the reactors now potentially execute on a different core each time
through their loop, the CPU cache is going to be badly thrashed. I suspect the
performance hit here is very large and continues to grow as additional threads
are added. SPDK is designed to scale linearly with the addition of CPU cores as
much as possible, and I think it would be a mistake to move away from that.
* All NUMA-awareness has been lost. Placing the processing of I/O on the same
NUMA node as the NIC or SSD is critical to achieving high performance, so the
code needs to remain NUMA-aware.
* All threads are polling a single queue of reactors, so the atomic variables
controlling the head and tail of that queue are going to be highly contended and
become more contended as the number of threads increases.
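
The first and third concerns can be illustrated with a small lockstep model in
Python. This is a sketch under simplifying assumptions, not the patch itself:
cores take turns in a fixed order (real cores would race), and simulate() and
its counters are hypothetical names. It counts how often a reactor runs on a
different core than last time, and how many operations hit the one shared
queue.

```python
from collections import deque

def simulate(num_reactors, num_cores, rounds):
    """Each core in turn pops the next reactor from a shared queue, runs one
    iteration of it, and pushes it back."""
    shared = deque(range(num_reactors))   # the single global reactor list
    last_core = {}                        # reactor id -> core that ran it last
    migrations = 0
    queue_ops = 0                         # pops + pushes on the shared queue
    for _ in range(rounds):
        for core in range(num_cores):
            reactor = shared.popleft()
            queue_ops += 1
            if reactor in last_core and last_core[reactor] != core:
                migrations += 1           # reactor state must move caches
            last_core[reactor] = core
            shared.append(reactor)
            queue_ops += 1
    return migrations, queue_ops

equal_counts = simulate(num_reactors=4, num_cores=4, rounds=10)
more_reactors = simulate(num_reactors=5, num_cores=4, rounds=10)
```

In this model, four reactors on four cores happen to stay put, while five
reactors on four cores change core on every iteration after the first. In both
cases every iteration costs two operations on the same queue head and tail, so
the number of contended operations grows with the number of cores.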
I hope this is just the beginning of a larger discussion. I'll let the patch
review settle into next week and see if solutions begin to emerge.
Thanks,
Ben
On Thu, 2018-05-24 at 02:24 +0000, Meneghini, John wrote:
> Hi Frank.
>
> Thanks for your suggestion.
>
> In our implementation/application, we don’t use DPDK. This is why the first
> set of changes we proposed last year was to abstract out the dependencies on
> DPDK. I think I still have a copy of the old pull request around for reference.
>
> https://github.com/spdk/spdk/pull/152
>
> We are actually running SPDK in a completely different execution environment,
> and we need a “native” SPDK dynamic threading model that can be supported on
> any platform, without DPDK.
>
> A second RFC patch has been pushed up to GerritHub for review. Please see
> the commit messages of these two patches for a complete description of the
> proposed change.
>
> https://review.gerrithub.io/#/c/spdk/spdk/+/412277/
>
> https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
>
> /John
>
> 40.5. The L-thread subsystem
> The L-thread subsystem resides in the examples/performance-thread/common
> directory and is built and linked automatically when building the
> l3fwd-thread example.
>
> The subsystem provides a simple cooperative scheduler to enable arbitrary
> functions to run as cooperative threads within a single EAL thread. The
> subsystem provides a pthread-like API that is intended to assist in reuse of
> legacy code written for POSIX pthreads.
>
> The following sections provide some detail on the features, constraints,
> performance and porting considerations when using L-threads.
>
>
>
> From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Huang Frank
> <kinzent(a)hotmail.com>
> Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Date: Wednesday, May 23, 2018 at 9:46 PM
> To: Storage Performance Development Kit <spdk(a)lists.01.org>
> Subject: [SPDK] Re: SPDK Dynamic Threading Model
>
> Hi,
>
> Why not consider using the lthread subsystem provided by DPDK?
> http://dpdk.org/doc/guides-16.04/sample_app_ug/performance_thread.html#lthread-subsystem
>
>
>
> Frank Huang
>
>
> From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Meneghini, John
> <John.Meneghini(a)netapp.com>
> Sent: May 23, 2018 4:12
> To: Storage Performance Development Kit
> Subject: [SPDK] RFC: SPDK Dynamic Threading Model
>
> As discussed during the Summit last week, we believe SPDK needs support for a
> dynamic threading model. An RFC patch has been pushed upstream for review.
>
> https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
>
> This patch is a beginning point for our proposed changes. Improvements will be
> made with subsequent patches.
>
> The description below is taken from https://github.com/spdk/spdk/issues/308
> SPDK needs to support a dynamic threading model where reactors are NOT bound
> to lcores.
>
> Many applications need SPDK to support a threading model that:
> * Does not assume a static number of threads
> * Does not bind threads to cores (this burns up cores)
> * Does not assume all threads use the same polling model
>
> Removing these assumptions from the SPDK libraries will allow:
> * Different applications to share the SPDK libraries on the same platform
>   E.g. FC-NVMe, RDMA-NVMe, and NVMe
> * Different platforms to support the same applications with the same libraries
>   E.g. a 4 core platform and a 128 core platform, a PowerPC and NFS traffic
> * Different workloads at different scales
>   E.g. 1 NVMF Host with 1 Subsystem and 1 Namespace, or 16 NVMF Hosts with 100
>   Subsystems and 1,000 namespaces.
>
> In particular, in SPDK, NVMF threads need to come and go depending upon the
> “NVMF load”.
>
> More Dynamic Use Cases Coming
> With the advent of FC-NVMe (which uses NPIV to virtualize FC ports), NVMF
> Subsystem Ports and Host Ports are not static. Different Hosts and Subsystems
> can have a different number of Ports, and Ports can be dynamically added and
> removed from the configuration. This means:
> * The same platform may end up having a different number of Subsystem ports at
>   various points in its lifecycle
> * The SPDK FC-NVMe application does NOT know up front how many ports it will
>   have.
>
> Expected Behavior
> * SPDK libraries should not assume a static number of threads
> * SPDK libraries should bind threads to cores only optionally - supporting both
>   static and dynamic threading models
> * SPDK libraries should support a Hybrid polling model (modified run to
>   completion)
>
> Current Behavior
> * SPDK libraries assume a static number of threads
> * SPDK libraries bind threads to cores
> * SPDK libraries assume all threads use the same polling model
>
> Possible Solution
> Proposal to solve above Use Cases:
> * Use the spdk_nvmf_poll_group (PG) as the unit of threading abstraction
> * Use PG as the fundamental unit on which a thread operates
> * The spdk_thread will be a “virtual” thread that gets tied into a PG (1-1
>   relationship)
> * Create PGs as and when hardware ports (and associated queue-pairs) come to
>   life.
> * No dependency between a PG and a “real” thread.
> * A PG can be picked up by any “real” thread and worked upon. The PG contains
>   everything needed for IO handling.
> * PG continues to contain spdk_thread. spdk_thread continues same mechanisms for
>   IO channels to different NS etc. etc.
> * PG contains vendor data. E.g. a “ring” for depositing asynchronous callback
>   events from the backend OR management events that come from external modules.
> * spdk_thread contains thread_context that points to a PG instead of a reactor.
>   So messages from the library get routed to the PG “ring” instead of a
>   thread/reactor event ring.spdk_bdev_get_io
>
> Understanding the intent of the event library, it is believed this is the
> place for customization. However, the current event library assumes a
> threading model that's a part of the util library. Moreover, many of the other
> SPDK core libraries assume the same threading model as the util library. If
> the SPDK util library can be modified to support these dynamic threading
> use cases, all applications would be able to use the SPDK framework more
> effectively.
>
> Steps to Reproduce
> This is an enhancement. There is no bug.
> Context (Environment including OS version, SPDK version, etc.)
> Would like to provide these enhancements in V18.07.
>
>
>
>
> _______________________________________________
> SPDK mailing list
> SPDK(a)lists.01.org
> https://lists.01.org/mailman/listinfo/spdk
* Re: [SPDK] SPDK Dynamic Threading Model
@ 2018-05-25 20:31 Pai, Madhu
From: Pai, Madhu @ 2018-05-25 20:31 UTC
To: spdk
Ben,
Thank you very much for the detailed analysis and mail. I agree with the points you are making here and the design goals for SPDK. I'll try to talk some more about the design for the patch, the advantages, and see if we can improve on it. I'll also try to answer the three valid concerns you have raised below.
1) Design principle:
Breaking the 1:1 mapping between reactors and cores will give an app better flexibility in terms of the threading model. At a very abstract level, this can be viewed as similar to a green threading framework. I'm no expert in green threading, but based on my reading the similarities would be:
- The "spdk_thread" is the virtual thread.
- The reactor is the "cache" of this virtual thread (i.e. they have a 1-1 relationship).
- The bare metal thread is the DPDK thread in this model.
So in this design, the bare metal thread takes on the properties of the virtual thread when it is running on a reactor. The implementation-specific part of that was changing the thread-local (TLS) _lcore value to the reactor id.
2) Advantages:
This was the first iteration of what can become a more solidified solution as we go along. There are gaps in this approach and I think we can mitigate some of those concerns.
- It gives applications a switch. They may choose to not do this at all and then a virtual thread would map to a bare metal thread and stay that way.
- The other advantage is that the core library, APIs and the design do not change.
- I think it helps apps to seamlessly adapt to a dynamic threading model with their current code base.
- It allows apps to dynamically increase the number of reactors. The granularity of a QP to a reactor could be decided by the app. So we could, in a way, solve the problem of trying to move a QP from one thread to another for load balancing.
- It gives apps certain QoS capabilities. Apps can create "gold", "silver" and "bronze" reactor rings. The number of bare-metal threads working on each ring may be different depending on how quickly certain QPs have to be serviced.
3) Concerns:
Looking at the three concerns you raised -
3a) Cache thrashing: Agreed. In fact, I wrote about this in the commit message as a valid concern. But I think we can mitigate this. For applications that are concerned about cache thrashing, the option is to run the reactor on a bare metal thread for a longer period of time. In the patch I showed a crude way, where after 100 usec, the reactor is switched out. But that does not absolutely have to be the case. A reactor could run for a much longer period of time (tens of msec), allowing the benefits of CPU caching to be used. The other way to mitigate this is to make sure that the bare metal threads run on the same socket. Thus even when reactors are switched out, the cache at the socket level is not invalidated.
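
As a rough worked example of the time-slice argument, the Python sketch below
counts how many times per second a reactor would be switched out for a given
slice length. The function name is hypothetical, and it assumes the reactor is
always busy and that the switch itself takes no time.

```python
def handoffs_per_second(slice_usec):
    """Number of times a reactor is switched out per second if it holds a
    native thread for slice_usec microseconds at a time."""
    return 1_000_000 // slice_usec

short_slice = handoffs_per_second(100)       # 100 usec, as in the RFC patch
long_slice = handoffs_per_second(10_000)     # 10 msec
```

Under these assumptions, going from a 100 usec slice to a 10 msec slice cuts
the number of hand-offs, and so the number of chances to land on a cold cache,
by a factor of 100.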
3b) NUMA: I believe NUMA-awareness can be built in. It depends on where the bare metal threads run. If we have NICs and SSDs spread across the two sockets, a more elegant solution can probably be designed where we create reactor rings per socket. Then we would have the capability to add the QPs to the right reactor (in the right ring) based on the NIC. That is an extension of the current design IMO.
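
The per-socket ring idea could look roughly like the following Python sketch.
It is a simplified model, not part of the patch: ReactorRings, the nic_socket
table and the reactor names are all hypothetical, and a reactor is returned to
its ring immediately rather than being held while it runs.

```python
from collections import deque

class ReactorRings:
    """One ring of reactors per socket (NUMA node) instead of one global ring."""
    def __init__(self, num_sockets):
        self.rings = [deque() for _ in range(num_sockets)]

    def add_reactor(self, reactor, socket):
        self.rings[socket].append(reactor)

    def next_reactor(self, socket):
        # A native thread running on `socket` only takes work from its own ring.
        ring = self.rings[socket]
        reactor = ring.popleft()
        ring.append(reactor)
        return reactor

# Hypothetical topology: which socket each NIC is attached to.
nic_socket = {"nic0": 0, "nic1": 1}

rings = ReactorRings(num_sockets=2)
rings.add_reactor("reactor-a", nic_socket["nic0"])
rings.add_reactor("reactor-b", nic_socket["nic1"])
rings.add_reactor("reactor-c", nic_socket["nic1"])

seen_on_socket1 = [rings.next_reactor(1) for _ in range(4)]
seen_on_socket0 = [rings.next_reactor(0) for _ in range(2)]
```

Threads on socket 1 only ever see the reactors whose QPs belong to the NIC on
socket 1, and contention is split across one ring per socket.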
3c) Global Reactor Ring bottleneck: The number of reactors and the number of threads are not high. Also, the idea here is to run on a reactor for "some extended period of time". Given that the number of producers/consumers of this ring will be limited, I don't think the reactor ring will be a bottleneck. Compare and contrast this global reactor ring with the event queue ring that exists today. We use events for all callback events from the backend during IO. We definitely do not want the reactors to be swapped in and out at the rate of IO, but to hold on to a reactor for a somewhat larger period of time. When the application-specific metrics show that these threads are doing more "useful work" versus "idle polling", we just add more threads. Eventually at high loads the number of threads will be the same as the number of reactors and the design thus falls back to the traditional SPDK model.
This would be something that is decided by the ecosystem that runs SPDK.
Looking forward to discussing this more.
Thanks,
Madhu
* Re: [SPDK] SPDK Dynamic Threading Model
@ 2018-05-25 21:26 Walker, Benjamin
From: Walker, Benjamin @ 2018-05-25 21:26 UTC
To: spdk
On Fri, 2018-05-25 at 20:31 +0000, Pai, Madhu wrote:
> Ben,
>
> Thank you very much for the detailed analysis and mail. I agree with the
> points you are making here and the design goals for SPDK. I'll try to talk
> some more about the design for the patch, the advantages, and see if we can
> improve on it. I'll also try to answer the three valid concerns
> you have raised below.
>
> 1) Design principle:
> Breaking the 1:1 mapping between reactors and cores will give an app better
> flexibility in terms of the threading model. At a very abstract level, this
> can be viewed as similar to a green threading framework. I'm no expert in
> green threading, but based on my reading the similarities would be:
> - The "spdk_thread" is the virtual thread.
> - The reactor is the "cache" of this virtual thread (i.e. they have a 1-1
> relationship).
> - The bare metal thread is the DPDK thread in this model.
> So in this design, the bare metal thread takes on the properties of the virtual
> thread when it is running on a reactor. The implementation-specific part of
> that was changing the thread-local (TLS) _lcore value to the reactor id.
I think we'll need to iterate on what the basic primitives are (especially their
names), but I'm generally leaving that discussion for slightly later on in the
design. For now, I agree with the direction above. I'm going to temporarily use
the words "spdk_thread" for the virtual thread and "DPDK thread" for the native
thread.
>
> 2) Advantages:
> This was the first iteration of what can become a more solidified solution as
> we go along. There are gaps in this approach and I think we can mitigate some
> of those concerns.
> - It gives applications a switch. They may choose to not do this at all and
> then a virtual thread would map to a bare metal thread and stay that way.
> - The other advantage is that the core library, APIs and the design do not
> change.
I generally don't like to have switches that control behavior this fundamental,
where possible. It would be ideal to have the old behavior fall out of the new
behavior whenever you happen to create one spdk_thread per DPDK thread.
> - I think it helps apps to seamlessly adapt to a dynamic threading model with
> their current code base.
> - It allows apps to dynamically increase the number of reactors. The
> granularity of a QP to a reactor could be decided by the app. So we could, in
> a way, solve the problem of trying to move a QP from one thread to another
> for load balancing.
> - It gives apps certain QoS capabilities. Apps can create "gold", "silver" and
> "bronze" reactor rings. The number of bare-metal threads working on each ring
> may be different depending on how quickly certain QPs have to be serviced.
>
> 3) Concerns:
> Looking at the three concerns you raised -
>
> 3a) Cache thrashing: Agreed. In fact, I wrote about this in the commit message
> as a valid concern. But I think we can mitigate this. For applications that
> are concerned about cache thrashing, the option is to run the reactor on a
> bare metal thread for a longer period of time. In the patch I showed a crude
> way, where after 100 usec, the reactor is switched out. But that does not
> absolutely have to be the case. A reactor could run for a much longer period
> of time (tens of msec), allowing the benefits of CPU caching to be used. The
> other way to mitigate this is to make sure that the bare metal threads run on
> the same socket. Thus even when reactors are switched out, the cache at the
> socket level is not invalidated.
Instead of automatically moving spdk_thread/reactor objects between DPDK
threads, what if moving an spdk_thread to a new DPDK thread was an explicit
operation performed by the application periodically?
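
A minimal Python sketch of what an explicit move could look like, under the
assumption that each native thread simply owns a list of virtual threads.
NativeThread, run_once() and move() are hypothetical names for illustration,
not an existing or proposed SPDK API.

```python
class NativeThread:
    """Stands in for a DPDK thread: it runs whatever virtual threads it owns."""
    def __init__(self, name):
        self.name = name
        self.virtual_threads = []

    def run_once(self):
        # One pass of the loop: poll every virtual thread currently assigned.
        return [(self.name, vt) for vt in self.virtual_threads]

def move(vt, src, dst):
    """Explicit, application-driven migration of one virtual thread."""
    src.virtual_threads.remove(vt)
    dst.virtual_threads.append(vt)

t0, t1 = NativeThread("dpdk0"), NativeThread("dpdk1")
t0.virtual_threads += ["vt-a", "vt-b"]

before = t0.run_once() + t1.run_once()
move("vt-b", t0, t1)          # the application decides when and where
after = t0.run_once() + t1.run_once()
```

Between moves, each virtual thread stays on one native thread, so there is no
shared queue on the hot path and the cache stays warm. Creating exactly one
virtual thread per native thread and never calling move() gives the current
static behavior.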
>
> 3b) NUMA: I believe NUMA-awareness can be built in. It depends on where the
> bare metal threads run. If we have NICs and SSDs spread across the two
> sockets, a more elegant solution can probably be designed where we create
> reactor rings per socket. Then we would have the capability to add the QPs to
> the right reactor (in the right ring) based on the NIC. That is an extension
> of the current design IMO.
It's not always clear what the right NUMA node to run on actually is. That's
because an spdk_thread has a set of I/O channels (queue pairs) that talk to
different devices. Sometimes you want to be on the same NUMA node as a
particular NIC, but other times as a particular SSD. Making the movement of
spdk_threads between DPDK threads an explicit operation performed by the
application would push this decision up into the application/user code, where it
knows best.
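
As one example of a policy an application might apply, the Python sketch below
picks the NUMA node that carries the most I/O for a given virtual thread. The
function, the (node, I/O rate) input format and the tie-break rule are all
assumptions made for illustration; a real application could weigh NICs and
SSDs quite differently.

```python
from collections import Counter

def preferred_node(channels):
    """Pick the NUMA node that carries the most I/O for one virtual thread.

    `channels` is a list of (numa_node, io_per_second) pairs, one per I/O
    channel. Ties go to the lowest node id (an arbitrary choice).
    """
    load = Counter()
    for node, iops in channels:
        load[node] += iops
    return min(load, key=lambda node: (-load[node], node))

# A virtual thread talking to a NIC on node 0 and two SSDs on node 1.
nic_heavy = preferred_node([(0, 900), (1, 200), (1, 300)])
ssd_heavy = preferred_node([(0, 100), (1, 200), (1, 300)])
```

The same set of devices yields a different answer depending on where the
traffic actually goes, which is the kind of information only the application
has.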
>
> 3c) Global Reactor Ring bottleneck: The number of reactors and the number of
> threads are not high. Also, the idea here is to run on a reactor for "some
> extended period of time". Given that the number of producers/consumers of
> this ring will be limited, I don't think the reactor ring will be a
> bottleneck. Compare and contrast this global reactor ring with the event queue
> ring that exists today. We use events for all callback events from the backend
> during IO. We definitely do not want the reactors to be swapped in and out at
> the rate of IO, but to hold on to a reactor for a somewhat larger period of
> time. When the application-specific metrics show that these threads are doing
> more "useful work" versus "idle polling", we just add more threads. Eventually
> at high loads the number of threads will be the same as the number of reactors
> and the design thus falls back to the traditional SPDK model.
> This would be something that is decided by the ecosystem that runs SPDK.
>
> Looking forward to discussing this more.
>
> Thanks,
> Madhu
>
>
> -----Original Message-----
> From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
> Sent: Friday, May 25, 2018 3:03 PM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] SPDK Dynamic Threading Model
>
> I've been doing my best to think this through over the last few days, as have
> a number of other community members, and some things are beginning to look a
> bit clearer now.
>
> SPDK was always intended to be a composable set of libraries as opposed to a
> framework. By that, I mean that SPDK is intended to be integrated into other
> applications as opposed to existing code being integrated into SPDK. The
> community has done a lot of work to attempt to make that happen, with varying
> degrees of success. The challenges are primarily centered on two things.
> First, SPDK requires special memory management operations to allocate DMA-safe
> memory.
> This stems from the strict requirement to avoid data copies. The problem would
> essentially go away if SPDK instead internally allocated DMA-safe memory and
> copied user data into those buffers, but the performance would take a big hit.
> Second, SPDK avoids locks by instead passing messages between threads. That
> means that many components (although not all) within SPDK imply that the
> application is using a certain threading model. Specifically, the threading
> model needs to look like cooperative multi-tasking, or futures and promises,
> or event loops, etc. So far the consensus seems to be that it is acceptable to
> assume there is some threading model that is conducive to message passing, but
> we don't want to specifically pick a single model or framework.
>
> The problem that John, Madhu, and the others at NetApp have identified is that
> SPDK currently makes entirely too many assumptions about and places too many
> strict requirements on the mechanics of the threading model in an application.
> I think there is a strong consensus that fixing this is important and should
> be high priority. The fix, ultimately, will be better abstractions around the
> underlying application's threading model. I hope we can design something that
> will enable people to plug SPDK into all sorts of frameworks - green threading
> frameworks, DPDK lthreads, Seastar, coroutine frameworks, etc. The more people
> we can get participating in this work, the better the abstractions will be, so
> please everyone chime in with requirements and ideas.
>
> The current set of patches break the 1:1 mapping between reactors and cores.
> Instead, reactors are stored on a global list. Each core iterates on this
> global list and pulls the next reactor and processes any waiting events and
> executes pollers, then places the reactor back on the list. I'm concerned
> about three things with this design:
>
> * Since the reactors now potentially execute on a different core each time
> through their loop, the CPU cache is going to be badly thrashed. I suspect the
> performance hit here is very large and continues to grow as additional threads
> are added. SPDK is designed to scale linearly with the addition of CPU cores
> as much as possible, and I think it would be a mistake to move away from
> that.
> * All NUMA-awareness has been lost. Placing the processing of I/O on the same
> NUMA node as the NIC or SSD is critical to achieving high performance, so the
> code needs to remain NUMA-aware.
> * All threads are polling a single queue of reactors, so the atomic variables
> controlling the head and tail of that queue are going to be highly contended
> and become more contended as the number of threads increases.
>
> I hope this is just the beginning of a larger discussion. I'll let the patch
> review settle into next week and see if solutions begin to emerge.
>
> Thanks,
> Ben
>
> On Thu, 2018-05-24 at 02:24 +0000, Meneghini, John wrote:
> > Hi Frank.
> >
> > Thanks for your suggestion.
> >
> > In our implementation/application, we don’t use DPDK. This is why the
> > first set of changes we proposed last year was to abstract out the
> > dependencies on DPDK. I think I still have a copy of the old pull request
> > around for reference.
> >
> > https://github.com/spdk/spdk/pull/152
> >
> > We are actually running SPDK in a completely different execution
> > environment, and we need a “native” SPDK dynamic threading model that
> > can be supported on any platform, without DPDK.
> >
> > A second RFC patch has been pushed up to GerritHub for review.
> > Please see the commit messages of these two patches for a complete
> > description of the proposed change.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412277/
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > /John
> >
> > 40.5. The L-thread subsystem
> > The L-thread subsystem resides in the
> > examples/performance-thread/common directory and is built and linked
> > automatically when building the l3fwd-thread example.
> >
> > The subsystem provides a simple cooperative scheduler to enable
> > arbitrary functions to run as cooperative threads within a single EAL
> > thread. The subsystem provides a pthread like API that is intended to
> > assist in reuse of legacy code written for POSIX pthreads.
> >
> > The following sections provide some detail on the features,
> > constraints, performance and porting considerations when using L-threads.
> >
> >
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Huang Frank
> > <kinzent(a)hotmail.com>
> > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Date: Wednesday, May 23, 2018 at 9:46 PM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: [SPDK] Re: SPDK Dynamic Threading Model
> >
> > Hi,
> >
> > Why not consider using the lthread subsystem provided by DPDK?
> > http://dpdk.org/doc/guides-16.04/sample_app_ug/performance_thread.html#lthread-subsystem
> >
> >
> >
> > Frank Huang
> >
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Meneghini, John
> > <John.Meneghini(a)netapp.com>
> > Sent: May 23, 2018 4:12
> > To: Storage Performance Development Kit
> > Subject: [SPDK] RFC: SPDK Dynamic Threading Model
> >
> > As discussed during the Summit last week, we believe SPDK needs
> > support for a dynamic threading model. An RFC patch has been pushed
> > upstream for review.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > This patch is a beginning point for our proposed changes. Improvements
> > will be made with subsequent patches.
> >
> > The description below is taken from
> > https://github.com/spdk/spdk/issues/308
> > SPDK needs to support a dynamic threading model where reactors are NOT
> > bound to lcores.
> >
> > Many applications need SPDK to support a threading model that:
> > - Does not assume a static number of threads
> > - Does not bind threads to cores (this burns up cores)
> > - Does not assume all threads use the same polling model
> >
> > Removing these assumptions from the SPDK libraries will allow:
> > - Different applications to share the SPDK libraries on the same
> >   platform, e.g. FC-NVMe, RDMA-NVMe, and NVMe
> > - Different platforms to support the same applications with the same
> >   libraries, e.g. a 4 core platform and a 128 core platform, a PowerPC
> >   and NFS traffic
> > - Different workloads at different scales, e.g. 1 NVMF Host with 1
> >   Subsystem and 1 Namespace, or 16 NVMF Hosts with 100 Subsystems and
> >   1,000 namespaces
> >
> > In particular, in SPDK, NVMF threads need to come and go depending
> > upon the “NVMF load”.
> >
> > More Dynamic Use Cases Coming
> > With the advent of FC-NVMe (which uses NPIV to virtualize FC ports)
> > NVMF Subsystem Ports and Host Ports are not static. Different Hosts
> > and Subsystems can have a different number of Ports, and Ports can be
> > dynamically added and removed from the configuration. This means:
> > - The same platform may end up having a different number of Subsystem
> >   ports at various points in its lifecycle
> > - The SPDK FC-NVMe application does NOT know up front how many ports
> >   it will have
> >
> > Expected Behavior
> > - SPDK libraries should not assume a static number of threads
> > - SPDK libraries should bind threads to cores only optionally,
> >   supporting both static and dynamic threading models
> > - SPDK libraries should support a Hybrid polling model (modified run
> >   to completion)
> >
> > Current Behavior
> > - SPDK libraries assume a static number of threads
> > - SPDK libraries bind threads to cores
> > - SPDK libraries assume all threads use the same polling model
> >
> > Possible Solution
> > Proposal to solve above Use Cases:
> > - Use the spdk_nvmf_poll_group (PG) as the unit of threading
> >   abstraction
> > - Use PG as the fundamental unit on which a thread operates
> > - The spdk_thread will be a “virtual” thread that gets tied into a PG
> >   (1-1 relationship)
> > - Create PGs as and when hardware ports (and associated queue-pairs)
> >   come to life.
> > - No dependency between a PG and a “real” thread.
> > - A PG can be picked up by any “real” thread and worked upon. The PG
> >   contains everything needed for IO handling.
> > - PG continues to contain spdk_thread. spdk_thread continues same
> >   mechanisms for IO channels to different NS etc.
> > - PG contains vendor data, e.g. a “ring” for depositing asynchronous
> >   callback events from the backend OR management events that come from
> >   external modules.
> > - spdk_thread contains thread_context that points to a PG instead of a
> >   reactor.
> > - So messages from the library get routed to the PG “ring” instead of
> >   a thread/reactor event ring.spdk_bdev_get_io
> >
> > Understanding the intent of the event library, it is believed this is
> > the place for customization. However, the current event library
> > assumes a threading model that's a part of the util library. Moreover,
> > many of the other SPDK core libraries assume the same threading model
> > as the util library. If the SPDK util library can be modified to
> > support these dynamic threading use cases, all applications would be
> > able to use the SPDK framework more effectively.
> >
> > Steps to Reproduce
> > This is an enhancement. There is no bug.
> >
> > Context (Environment including OS version, SPDK version, etc.)
> > Would like to provide these enhancements in V18.07.
> >
> >
> >
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [SPDK] SPDK Dynamic Threading Model
@ 2018-05-25 22:32 Pai, Madhu
0 siblings, 0 replies; 10+ messages in thread
From: Pai, Madhu @ 2018-05-25 22:32 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 19156 bytes --]
Comments inline marked by [MP].
BTW, is it worthwhile to discuss this in more detail in one of the SPDK community meetings?
Thanks,
Madhu
-----Original Message-----
From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
Sent: Friday, May 25, 2018 5:26 PM
To: spdk(a)lists.01.org
Subject: Re: [SPDK] SPDK Dynamic Threading Model
On Fri, 2018-05-25 at 20:31 +0000, Pai, Madhu wrote:
> Ben,
>
> Thank you very much for the detailed analysis and mail. I agree with
> the points you are making here and the design goals for SPDK. I'll
> try to talk some more about the design for the patch, the advantages,
> and see if we can improve on this. I'll also try to answer
> the three valid concerns you have raised below.
>
> 1) Design principle:
> Breaking the 1:1 mapping between reactors and cores will give an app
> better flexibility in terms of the threading model. At a very high
> level of abstraction, this could be viewed as similar to a green
> threading framework.
> I'm no expert in the green threading framework, but, based on my
> reading the similarity would be:
> - The "spdk_thread" is the virtual thread.
> - The reactor is the "cache" of this virtual thread (i.e. they have a
> 1-1 relationship).
> - The bare metal thread is the DPDK thread in this model.
> So in this design, the bare metal thread takes on the properties of the
> virtual thread when it is running on a reactor. The implementation-specific
> part of that was changing the thread-local (TLS) _lcore value to the reactor id.
I think we'll need to iterate on what the basic primitives are (especially their names), but I'm generally leaving that discussion for slightly later on in the design. For now, I agree with the direction above. I'm going to temporarily use the words "spdk_thread" for the virtual thread and "DPDK thread" for the native thread.
[MP]: Agreed. The intent above was to draw a parallel to a model that may already exist out there. We can decide on the names and the basic primitives later on.
>
> 2) Advantages:
> This was the first iteration of what can become a more solidified
> solution as we go along. There are gaps in this approach and I think
> we can mitigate some of those concerns.
> - It gives applications a switch. They may choose to not do this at
> all and then a virtual thread would map to a bare metal thread and stay that way.
> - The other advantage is that the core library, APIs and the design
> do not change.
I generally don't like to have switches that control behavior this fundamental, where possible. It would be ideal to have the old behavior fall out of the new behavior whenever you happen to create one spdk_thread per DPDK thread.
[MP]: Ok. I'll think some more about this. In the current patch, I used the "number-of-reactors" in the config file as the mechanism to determine new versus old behavior. In some ways that falls into what you mention above. If the number-of-lcores does NOT match the number-of-reactors, one goes to the new way of doing things; otherwise, it falls back to the legacy behavior. This was a prototype that I built over the last week after our discussion during the SPDK conference. There are iterations and improvements needed here. But, yes, at this point there is a switch here.
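As a rough illustration of the switch described above (this is not code from the patch), the check could be sketched in Python as follows. The function name and the mode labels are hypothetical; only the comparison of the lcore count with the reactor count comes from the discussion.

```python
def select_threading_mode(num_lcores: int, num_reactors: int) -> str:
    """Pick the behavior from the two counts (hypothetical helper).

    'legacy'  : one reactor per lcore, reactors stay where they are.
    'dynamic' : counts differ, so reactors are shared among native threads.
    """
    if num_lcores <= 0 or num_reactors <= 0:
        raise ValueError("both counts must be positive")
    return "legacy" if num_lcores == num_reactors else "dynamic"
```

With matching counts the old behavior is simply the degenerate case, which is close to what Ben asks for: no separate flag, just a configuration in which nothing ever needs to move.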
> - I think it helps apps to seamlessly adapt to a dynamic threading
> model with their current code base.
> - It allows apps to dynamically increase the number of reactors. The
> granularity of a QP to a reactor could be decided by the app. So, we
> could, in one way solve the problem of trying to move a QP from one
> thread to another for load balancing.
> - It gives Apps certain QoS capabilities. Apps can create "gold",
> "silver" and "bronze" reactor rings. The number of bare-metal threads
> working on each ring may be different depending on how quickly certain QP's have to be serviced.
>
> 3) Concerns:
> Looking at the three concerns you raised -
>
> 3a) Cache thrashing: Agreed. In fact, I write about this in the commit
> patch as a valid concern. But, I think we can mitigate this. For
> applications that are concerned about cache thrashing - the option is
> to run the reactor on a bare metal thread for a longer period of time.
> In the patch I showed a crude way, where after 100 usec, the reactor
> is switched out. But, that does not absolutely have to be the case. A
> reactor could run for a much longer period of time (10s of msec)
> allowing the benefits of the CPU caching to be used. The other way to
> mitigate this is to make sure that the bare metal threads run on the
> same socket. Thus even when reactors are switched out, the cache at the socket layer is not invalidated.
Instead of automatically moving spdk_thread/reactor objects between DPDK threads, what if moving an spdk_thread to a new DPDK thread was an explicit operation performed by the application periodically?
[MP]: Yes, but isn't that precisely something that needs to be left to the application? In this patch, a simple MP-MC polling ring was used as a way to break up binding. The application may use something other than a ring to manage the spdk_threads and how it binds to a DPDK thread?
I'll think some more to see how to do this in an explicit operation instead of periodically, as done in the patch. I'll have to look into the DPDK thread pipes and the m2s/s2m IPC mechanism because the app most likely will have to communicate directly with the DPDK thread to achieve this.
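To make the "explicit operation" idea concrete, here is a minimal Python model, assuming a simple ownership table. All class and method names are hypothetical and none of this is SPDK or DPDK API. The point is that a native thread only polls the virtual threads it currently owns, and ownership changes only when the application calls move().

```python
from collections import defaultdict


class SpdkThread:
    """Hypothetical stand-in for a virtual thread (not the real spdk_thread)."""

    def __init__(self, name):
        self.name = name
        self.polls = 0  # number of times this virtual thread was polled

    def poll(self):
        self.polls += 1


class Placement:
    """Tracks which native thread currently owns each virtual thread."""

    def __init__(self):
        self._owner = {}                    # virtual thread name -> native id
        self._assigned = defaultdict(list)  # native id -> [SpdkThread]

    def attach(self, vthread, native_id):
        self._owner[vthread.name] = native_id
        self._assigned[native_id].append(vthread)

    def move(self, vthread, new_native_id):
        """Explicit, application-driven migration; nothing moves on its own."""
        old = self._owner[vthread.name]
        self._assigned[old].remove(vthread)
        self.attach(vthread, new_native_id)

    def owner(self, vthread):
        return self._owner[vthread.name]

    def run_once(self, native_id):
        """One loop iteration of a native thread: poll only what it owns."""
        for vthread in list(self._assigned[native_id]):
            vthread.poll()


placement = Placement()
a, b = SpdkThread("a"), SpdkThread("b")
placement.attach(a, 0)   # one virtual thread per native thread: legacy layout
placement.attach(b, 1)
placement.run_once(0)    # native thread 0 polls only "a"
placement.move(b, 0)     # the application decides to consolidate onto thread 0
placement.run_once(0)    # native thread 0 now polls "a" and "b"
```

If the application never calls move(), each virtual thread stays on its original native thread, so the existing static model falls out without a separate mode.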
>
> 3b) NUMA: I believe NUMA-awareness can be built in. It depends on
> where the bare metal threads run. If we have NICs and SSDs spread
> across the two sockets, a more elegant solution can probably be
> designed where we create reactor rings per socket. Then we would have
> the capability to add the QP's to the right reactor (in the right
> ring) based on the NIC. That is an extension of the current design IMO.
It's not always clear what the right NUMA node to run on actually is. That's because an spdk_thread has a set of I/O channels (queue pairs) that talk to different devices. Sometimes you want to be on the same NUMA node as a particular NIC, but other times as a particular SSD. Making the movement of spdk_threads between DPDK threads an explicit operation performed by the application would push this decision up into the application/user code, where it knows best.
[MP]: Agreed. The design allows this to happen by allowing the creation of multiple reactor rings (one per socket?). Applications can decide to move the reactors using custom messages from one ring to another. This goes back to your previous comment about having an explicit mechanism to move the reactor from one DPDK thread to another.
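A small Python sketch of the per-socket ring idea, under the assumption that each socket has its own queue of reactors and that a native thread only pulls from the ring of the socket it runs on. The class and method names are hypothetical and a deque stands in for the MP-MC ring used in the patch.

```python
from collections import deque


class SocketRings:
    """One reactor ring per NUMA socket (hypothetical model)."""

    def __init__(self, sockets):
        self.rings = {s: deque() for s in sockets}

    def add_reactor(self, reactor, socket):
        # The application places a reactor on the ring local to its NIC or SSD.
        self.rings[socket].append(reactor)

    def next_reactor(self, socket):
        """Called by a native thread on `socket`; returns None if nothing waits."""
        ring = self.rings[socket]
        return ring.popleft() if ring else None

    def release(self, reactor, socket):
        # Put the reactor back once its time slice on this socket is over.
        self.rings[socket].append(reactor)

    def migrate(self, reactor, src, dst):
        # Explicit cross-socket move, decided by the application.
        self.rings[src].remove(reactor)
        self.rings[dst].append(reactor)
```

Because threads on socket 0 never touch the ring for socket 1, a reactor stays NUMA-local unless the application calls migrate(), and contention on any one ring is limited to the threads of a single socket.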
>
> 3c) Global Reactor Ring bottleneck: The number of reactors and the
> number of threads are not high. Also, the idea here is to run on a
> reactor for "some extended period of time". Given that the number of
> producer/consumers from this ring will be limited, I don't think the
> reactor ring will be a bottleneck. Compare and contrast this global
> reactor ring with the event queue ring that exists today. We use
> events for all callback events from the backend during IO. We
> definitely do not want the reactors to be swapped in and out at the
> rate of IO, but to hold on to a reactor for a somewhat larger period
> of time. When the application specific metrics show that these threads
> are doing more "useful work" versus "idle polling", we just add more
> threads. Eventually at high loads the number of threads will be the same as the number of reactors and thus falls back to the traditional SPDK model.
> This would be something that is decided by the ecosystem that runs SPDK.
>
> Looking forward to discussing this more.
>
> Thanks,
> Madhu
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [SPDK] SPDK Dynamic Threading Model
@ 2018-05-29 16:24 Meneghini, John
0 siblings, 0 replies; 10+ messages in thread
From: Meneghini, John @ 2018-05-29 16:24 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 22619 bytes --]
Notes from our Community meeting discussion this morning:
1. Break the physical connection between lcore and thread
   * This was agreed upon as a needed abstraction
   * We’ll use the words "spdk_thread" for the virtual thread and "DPDK thread" for the native thread
2. No way to specify the “threading policy”
   * There is no NUMA awareness in the current design/implementation; this will lead to cache thrashing
   * Instead of automatically moving spdk_thread/reactor objects between DPDK threads, what if moving an spdk_thread to a new DPDK thread was an explicit operation performed by the application periodically?
      i. Will this work for NetApp’s use case?
   * Madhu agreed this was something that could be done, and he is willing to create another patch to do this.
      i. Would creating such a patch be helpful?
NOTE: the meeting ended abruptly at this point because the meeting host dropped off.
Editor’s note:
I think the important question that didn’t get answered is: will eliminating the “automatic threading policy” in the current design (as seen in Madhu’s patches) work for NetApp’s use case?
/John
On 5/25/18, 6:32 PM, "SPDK on behalf of Pai, Madhu" <spdk-bounces(a)lists.01.org on behalf of Madhusudan.Pai(a)netapp.com> wrote:
Comments inline marked by [MP].
BTW, is it worthwhile to discuss this in more detail in one of the SPDK community meetings?
Thanks,
Madhu
-----Original Message-----
From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
Sent: Friday, May 25, 2018 5:26 PM
To: spdk(a)lists.01.org
Subject: Re: [SPDK] SPDK Dynamic Threading Model
On Fri, 2018-05-25 at 20:31 +0000, Pai, Madhu wrote:
> Ben,
>
> Thank you very much for the detailed analysis and mail. I agree with
> the points you are making here and the design goals for SPDK. I'll
> try to talk some more about the design for the patch, the advantages,
> and see if we can improve on this. I'll also try to answer
> the three valid concerns you have raised below.
>
> 1) Design principle:
> Breaking the 1:1 mapping between reactors and cores will give an app
> better flexibility in terms of the threading model. At a very high
> level of abstraction, this could be viewed as similar to a green
> threading framework.
> I'm no expert in the green threading framework, but, based on my
> reading the similarity would be:
> - The "spdk_thread" is the virtual thread.
> - The reactor is the "cache" of this virtual thread (i.e. they have a
> 1-1 relationship).
> - The bare metal thread is the DPDK thread in this model.
> So in this design, the bare metal thread takes on the properties of the
> virtual thread when it is running on a reactor. The implementation-specific
> part of that was changing the thread-local (TLS) _lcore value to the reactor id.
I think we'll need to iterate on what the basic primitives are (especially their names), but I'm generally leaving that discussion for slightly later on in the design. For now, I agree with the direction above. I'm going to temporarily use the words "spdk_thread" for the virtual thread and "DPDK thread" for the native thread.
[MP]: Agreed. The intent above was to draw a parallel to a model that may already exist out there. We can decide on the names and the basic primitives later on.
>
> 2) Advantages:
> This was the first iteration of what can become a more solidified
> solution as we go along. There are gaps in this approach and I think
> we can mitigate some of those concerns.
> - It gives applications a switch. They may choose to not do this at
> all and then a virtual thread would map to a bare metal thread and stay that way.
> - The other advantage is that the core library, APIs, and the design
> do not change.
I generally don't like to have switches that control behavior this fundamental, where possible. It would be ideal to have the old behavior fall out of the new behavior whenever you happen to create one spdk_thread per DPDK thread.
[MP]: Ok. I'll think some more about this. In the current patch, I used the "number-of-reactors" in the config file as the mechanism to determine new versus old behavior. In some ways that falls into what you mention above. If the number of lcores does NOT match the number of reactors, the new behavior is used; otherwise, it falls back to the legacy behavior. This was a prototype that I built over the last week after our discussion during the SPDK conference. There are iterations and improvements needed here. But, yes, at this point there is a switch here.
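One way to picture the old behavior "falling out" of the new one without a switch is a single placement rule that happens to yield a 1:1 mapping whenever the counts match. The following is only a sketch under that assumption; place_virt_thread and count_placements are hypothetical helpers, and round-robin is just one possible rule.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical placement rule: virtual thread i runs on native thread
 * i % n_native. There is no mode flag; the same rule is always used.
 */
static size_t place_virt_thread(size_t virt_index, size_t n_native)
{
    assert(n_native > 0);
    return virt_index % n_native;
}

/* Fill counts[j] with the number of virtual threads placed on native thread j. */
static void count_placements(size_t n_virt, size_t n_native, size_t *counts)
{
    for (size_t j = 0; j < n_native; j++) {
        counts[j] = 0;
    }
    for (size_t i = 0; i < n_virt; i++) {
        counts[place_virt_thread(i, n_native)]++;
    }
}
```

With four virtual threads on four native threads every count is 1, which is the legacy mapping; with six on four, the first two native threads simply host two each.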
> - I think it helps apps to seamlessly adapt to a dynamic threading
> model with their current code base.
> - It allows apps to dynamically increase the number of reactors. The
> granularity of a QP to a reactor could be decided by the app. So, we
> could, in one way solve the problem of trying to move a QP from one
> thread to another for load balancing.
> - It gives Apps certain QoS capabilities. Apps can create "gold",
> "silver" and "bronze" reactor rings. The number of bare-metal threads
> working on each ring may be different depending on how quickly certain QPs have to be serviced.
>
> 3) Concerns:
> Looking at the three concerns you raised -
>
> 3a) Cache thrashing: Agreed. In fact, I wrote about this in the commit
> message of the patch as a valid concern. But, I think we can mitigate this. For
> applications that are concerned about cache thrashing - the option is
> to run the reactor on a bare metal thread for a longer period of time.
> In the patch I showed a crude way, where after 100 usec, the reactor
> is switched out. But, that does not absolutely have to be the case. A
> reactor could run for a much longer period of time (10s of msec)
> allowing the benefits of the CPU caching to be used. The other way to
> mitigate this is to make sure that the bare metal threads run on the
> same socket. Thus even when reactors are switched out, the cache at the socket layer is not invalidated.
Instead of automatically moving spdk_thread/reactor objects between DPDK threads, what if moving an spdk_thread to a new DPDK thread was an explicit operation performed by the application periodically?
[MP]: Yes, but isn't that precisely something that needs to be left to the application? In this patch, a simple multi-producer/multi-consumer (MP-MC) polling ring was used as a way to break up the binding. The application may use something other than a ring to manage the spdk_threads and how they bind to DPDK threads.
I'll think some more to see how to do this as an explicit operation instead of periodically, as done in the patch. I'll have to look into the DPDK thread pipes and the m2s/s2m IPC mechanism because the app most likely will have to communicate directly with the DPDK thread to achieve this.
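For reference, the ring-based scheme discussed above can be sketched roughly as follows. This is a simplified, single-threaded illustration, not the patch itself: the ring is a plain FIFO rather than an MP-MC ring, the time slice is a poll budget instead of 100 usec so the behavior is deterministic, and every name (vt_ring, native_thread_turn, and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8 /* one slot stays empty to tell "full" from "empty" */

/* Hypothetical stand-in for an spdk_thread/reactor. */
struct virt_thread {
    int id;
    unsigned long polls;
};

/* Plain FIFO ring of runnable virtual threads (not thread-safe). */
struct vt_ring {
    struct virt_thread *slots[RING_SIZE];
    size_t head; /* next slot to dequeue */
    size_t tail; /* next slot to enqueue */
};

static int ring_enqueue(struct vt_ring *r, struct virt_thread *vt)
{
    size_t next = (r->tail + 1) % RING_SIZE;
    if (next == r->head) {
        return -1; /* ring is full */
    }
    r->slots[r->tail] = vt;
    r->tail = next;
    return 0;
}

static struct virt_thread *ring_dequeue(struct vt_ring *r)
{
    if (r->head == r->tail) {
        return NULL; /* ring is empty */
    }
    struct virt_thread *vt = r->slots[r->head];
    r->head = (r->head + 1) % RING_SIZE;
    return vt;
}

/*
 * One scheduling turn of a native thread: take the next virtual thread,
 * poll it `budget` times (the stand-in for a time slice), then put it
 * back at the tail. Returns the id that was run, or -1 if the ring was empty.
 */
static int native_thread_turn(struct vt_ring *r, unsigned budget)
{
    struct virt_thread *vt = ring_dequeue(r);
    if (vt == NULL) {
        return -1;
    }
    for (unsigned i = 0; i < budget; i++) {
        vt->polls++; /* stand-in for running events and pollers */
    }
    int rc = ring_enqueue(r, vt);
    assert(rc == 0); /* cannot fail: a slot was just freed */
    (void)rc;
    return vt->id;
}
```

A larger budget corresponds to holding a reactor longer on one native thread, which is the cache-thrashing mitigation mentioned in 3a; an explicit, application-driven move would replace the automatic re-enqueue at the end of each turn.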
>
> 3b) NUMA: I believe NUMA-awareness can be built in. It depends on
> where the bare metal threads run. If we have NICs and SSDs spread
> across the two sockets, a more elegant solution can probably be
> designed where we create reactor rings per socket. Then we would have
> the capability to add the QPs to the right reactor (in the right
> ring) based on the NIC. That is an extension of the current design IMO.
It's not always clear what the right NUMA node to run on actually is. That's because an spdk_thread has a set of I/O channels (queue pairs) that talk to different devices. Sometimes you want to be on the same NUMA node as a particular NIC, but other times as a particular SSD. Making the movement of spdk_threads between DPDK threads an explicit operation performed by the application would push this decision up into the application/user code, where it knows best.
[MP]: Agreed. The design allows this to happen by allowing the creation of multiple reactor rings (one per socket?). Applications can decide to move the reactors using custom messages from one ring to another. This goes back to your previous comment about having an explicit mechanism to move the reactor from one DPDK thread to another.
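As a rough illustration of per-socket grouping with an explicit move, consider the sketch below. It assumes two sockets and small fixed-size sets instead of real rings, and it performs the move as a direct call, whereas a real design would presumably send a message to the owning thread. socket_set, socket_add and virt_thread_move are hypothetical names, not SPDK or DPDK APIs.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SOCKETS 2
#define MAX_PER_SOCKET 8

/* Hypothetical stand-in for an spdk_thread/reactor. */
struct virt_thread {
    int id;
    int socket; /* socket it is currently placed on, or -1 if unplaced */
};

/* The virtual threads currently assigned to one socket. */
struct socket_set {
    struct virt_thread *members[MAX_PER_SOCKET];
    size_t count;
};

static struct socket_set g_sockets[MAX_SOCKETS];

static int socket_add(int socket, struct virt_thread *vt)
{
    assert(socket >= 0 && socket < MAX_SOCKETS);
    struct socket_set *s = &g_sockets[socket];
    if (s->count == MAX_PER_SOCKET) {
        return -1; /* no room on this socket */
    }
    s->members[s->count++] = vt;
    vt->socket = socket;
    return 0;
}

static int socket_remove(struct virt_thread *vt)
{
    if (vt->socket < 0) {
        return -1;
    }
    struct socket_set *s = &g_sockets[vt->socket];
    for (size_t i = 0; i < s->count; i++) {
        if (s->members[i] == vt) {
            s->members[i] = s->members[--s->count]; /* swap-remove */
            vt->socket = -1;
            return 0;
        }
    }
    return -1;
}

/*
 * Explicit, application-initiated move. The application decides when and
 * where to move, for example to sit next to a particular NIC or SSD.
 */
static int virt_thread_move(struct virt_thread *vt, int dst_socket)
{
    assert(dst_socket >= 0 && dst_socket < MAX_SOCKETS);
    if (vt->socket == dst_socket) {
        return 0; /* already there */
    }
    if (g_sockets[dst_socket].count == MAX_PER_SOCKET) {
        return -1; /* check the destination before detaching */
    }
    if (vt->socket >= 0 && socket_remove(vt) != 0) {
        return -1;
    }
    return socket_add(dst_socket, vt);
}
```

Nothing in the sketch decides which socket is "right"; that policy stays in the application, which is the point made above.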
>
> 3c) Global Reactor Ring bottleneck: The number of reactors and the
> number of threads are not high. Also, the idea here is to run on a
> reactor for "some extended period of time". Given that the number of
> producer/consumers from this ring will be limited, I don't think the
> reactor ring will be a bottleneck. Compare and contrast this global
> reactor ring with the event queue ring that exists today. We use
> events for all callback events from the backend during IO. We
> definitely do not want the reactors to be swapped in and out at the
> rate of IO, but to hold on to a reactor for a somewhat larger period
> of time. When the application specific metrics show that these threads
> are doing more "useful work" versus "idle polling", we just add more
> threads. Eventually, at high loads, the number of threads will be the same as the number of reactors, and the design thus falls back to the traditional SPDK model.
> This would be something that is decided by the ecosystem that runs SPDK.
>
> Looking forward to discussing this more.
>
> Thanks,
> Madhu
>
>
> -----Original Message-----
> From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
> Sent: Friday, May 25, 2018 3:03 PM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] SPDK Dynamic Threading Model
>
> I've been doing my best to think this through over the last few days,
> as have a number of other community members, and some things are
> beginning to look a bit clearer now.
>
> SPDK was always intended to be a composable set of libraries as
> opposed to a framework. By that, I mean that SPDK is intended to be
> integrated into other applications as opposed to existing code being
> integrated into SPDK. The community has done a lot of work to attempt
> to make that happen, with varying degrees of success. The challenges are primarily centered on two things.
> First, SPDK requires special memory management operations to allocate
> DMA-safe memory.
> This stems from the strict requirement to avoid data copies. The
> problem would essentially go away if SPDK instead internally allocated
> DMA-safe memory and copied user data into those buffers, but the performance would take a big hit.
> Second, SPDK avoids locks by instead passing messages between threads.
> That means that many components (although not all) within SPDK imply
> that the application is using a certain threading model. Specifically,
> the threading model needs to look like cooperative multi-tasking, or
> futures and promises, or event loops, etc. So far the consensus seems
> to be that it is acceptable to assume there is some threading model
> that is conducive to message passing, but we don't want to specifically pick a single model or framework.
>
> The problem that John, Madhu, and the others at NetApp have identified
> is that SPDK currently makes entirely too many assumptions about and
> places too many strict requirements on the mechanics of the threading model in an application.
> I think there is a strong consensus that fixing this is important and
> should be high priority. The fix, ultimately, will be better
> abstractions around the underlying application's threading model. I
> hope we can design something that will enable people to plug SPDK into
> all sorts of frameworks - green threading frameworks, DPDK lthreads,
> Seastar, coroutine frameworks, etc. The more people we can get
> participating in this work, the better the abstractions will be, so please everyone chime in with requirements and ideas.
>
> The current set of patches break the 1:1 mapping between reactors and cores.
> Instead, reactors are stored on a global list. Each core iterates on
> this global list and pulls the next reactor and processes any waiting
> events and executes pollers, then places the reactor back on the list.
> I'm concerned about three things with this design:
>
> * Since the reactors now potentially execute on a different core each
> time through their loop, the CPU cache is going to be badly thrashed.
> I suspect the performance hit here is very large and continues to grow
> as additional threads are added. SPDK is designed to scale linearly
> with the addition of CPU cores as much as possible, and I think it
> would be a mistake to move away from that.
> * All NUMA-awareness has been lost. Placing the processing of I/O on
> the same NUMA node as the NIC or SSD is critical to achieving high
> performance, so the code needs to remain NUMA-aware.
> * All threads are polling a single queue of reactors, so the atomic
> variables controlling the head and tail of that queue are going to be
> highly contended and become more contended as the number of threads increases.
>
> I hope this is just the beginning of a larger discussion. I'll let the
> patch review settle into next week and see if solutions begin to emerge.
>
> Thanks,
> Ben
>
> On Thu, 2018-05-24 at 02:24 +0000, Meneghini, John wrote:
> > Hi Frank.
> >
> > Thanks for your suggestion.
> >
> > In our implementation/application, we don’t use DPDK. This is why
> > the first set of changes we proposed last year were to abstract out
> > the dependencies on DPDK. I think I still have a copy of the old pull
> > request around for reference.
> >
> > https://github.com/spdk/spdk/pull/152
> >
> > We are actually running SPDK in a completely different execution
> > environment, and we need a “native” SPDK dynamic threading model
> > that can be supported on any platform, without DPDK.
> >
> > A second RFC patch has been pushed up to GerritHub for review.
> > Please see the commit message of these two patches for a complete
> > description of the proposed change.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412277/
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > /John
> >
> > 40.5. The L-thread subsystem
> > The L-thread subsystem resides in the
> > examples/performance-thread/common
> > directory and is built and linked automatically when building the
> > l3fwd-thread example.
> >
> > The subsystem provides a simple cooperative scheduler to enable
> > arbitrary functions to run as cooperative threads within a single
> > EAL thread. The subsystem provides a pthread like API that is
> > intended to assist in reuse of legacy code written for POSIX pthreads.
> >
> > The following sections provide some detail on the features,
> > constraints, performance and porting considerations when using L-threads.
> >
> >
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Huang Frank
> > <kinzent(a)hotmail.com>
> > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Date: Wednesday, May 23, 2018 at 9:46 PM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: [SPDK] Re: SPDK Dynamic Threading Model
> >
> > Hi,
> >
> > Why not consider using the lthread subsystem provided by DPDK?
> > http://dpdk.org/doc/guides-16.04/sample_app_ug/performance_thread.html#lthread-subsystem
> >
> >
> >
> > Frank Huang
> >
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Meneghini, John
> > <John.Meneghini(a)netapp.com>
> > Sent: May 23, 2018 4:12
> > To: Storage Performance Development Kit
> > Subject: [SPDK] RFC: SPDK Dynamic Threading Model
> >
> > As discussed during the Summit last week, we believe SPDK needs
> > support for a dynamic threading model. An RFC patch has been pushed
> > upstream for review.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > This patch is a beginning point for our proposed changes.
> > Improvements will be made with subsequent patches.
> >
> > The description below is taken from
> > https://github.com/spdk/spdk/issues/308
> > SPDK needs to support a dynamic threading model where reactors are
> > NOT bound to lcores.
> >
> > Many applications need SPDK to support a threading model that:
> > - Does not assume a static number of threads
> > - Does not bind threads to cores (this burns up cores)
> > - Does not assume all threads use the same polling model
> >
> > Removing these assumptions from the SPDK libraries will allow:
> > - Different applications to share the SPDK libraries on the same
> >   platform. E.g. FC-NVMe, RDMA-NVMe, and NVMe
> > - Different platforms to support the same applications with the same
> >   libraries. E.g. a 4 core platform and a 128 core platform, a PowerPC
> >   and NFS traffic
> > - Different workloads at different scales. E.g. 1 NVMF Host with 1
> >   Subsystem and 1 Namespace, or 16 NVMF Hosts with 100 Subsystems and
> >   1,000 namespaces.
> >
> > In particular, in SPDK, NVMF threads need to come and go depending
> > upon the “NVMF load”.
> >
> > More Dynamic Use Cases Coming
> > With the advent of FC-NVMe (which uses NPIV to virtualize FC ports)
> > NVMF Subsystem Ports and Host Ports are not static. Different Hosts
> > and Subsystems can have a different number of Ports, and Ports can
> > be dynamically added and removed from the configuration. This means:
> > - The same platform may end up having a different number of Subsystem
> >   ports at various points in its lifecycle
> > - The SPDK FC-NVMe application does NOT know up front how many ports
> >   it will have.
> >
> > Expected Behavior
> > - SPDK libraries should not assume a static number of threads
> > - SPDK libraries should bind threads to cores only optionally,
> >   supporting both static and dynamic threading models
> > - SPDK libraries should support a Hybrid polling model (modified run
> >   to completion)
> >
> > Current Behavior
> > - SPDK libraries assume a static number of threads
> > - SPDK libraries bind threads to cores
> > - SPDK libraries assume all threads use the same polling model
> >
> > Possible Solution
> > Proposal to solve above Use Cases:
> > - Use the spdk_nvmf_poll_group (PG) as the unit of threading
> >   abstraction
> > - Use PG as the fundamental unit on which a thread operates
> > - The spdk_thread will be a “virtual” thread that gets tied into a PG
> >   (1-1 relationship)
> > - Create PGs as and when hardware ports (and associated queue-pairs)
> >   come to life.
> > - No dependency between a PG and a “real” thread.
> > - A PG can be picked up by any “real” thread and worked upon. The PG
> >   contains everything needed for IO handling.
> > - PG continues to contain spdk_thread. spdk_thread continues to use the
> >   same mechanisms for IO channels to different NS, etc.
> > - PG contains vendor data. E.g. a “ring” for depositing asynchronous
> >   callback events from the backend OR management events that come from
> >   external modules.
> > - spdk_thread contains thread_context that points to a PG instead of a
> >   reactor.
> > - So messages from the library get routed to the PG “ring” instead of
> >   a thread/reactor event ring.
> >
> > Understanding the intent of the event library, it is believed this is
> > the place for customization. However, the current event library
> > assumes a threading model that's a part of the util library. Moreover,
> > many of the other SPDK core libraries assume the same threading model
> > as the util library. If the SPDK util library can be modified to
> > support these dynamic threading use cases, all applications would be
> > able to use the SPDK framework more effectively.
> >
> > Steps to Reproduce
> > This is an enhancement. There is no bug.
> >
> > Context (Environment including OS version, SPDK version, etc.)
> > Would like to provide these enhancements in V18.07.
> >
> >
> >
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
>
[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 50910 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [SPDK] SPDK Dynamic Threading Model
@ 2018-05-29 19:26 Meneghini, John
0 siblings, 0 replies; 10+ messages in thread
From: Meneghini, John @ 2018-05-29 19:26 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 25104 bytes --]
I’ve scheduled an Ad-Hoc Meeting to discuss this subject.
Agenda:
1. Discuss open issues in the proposed dynamic threading design.
   * No way to specify the “threading policy”
   * There is no NUMA awareness in the current design/implementation; this will lead to cache thrashing
2. Instead of automatically moving spdk_thread/reactor objects between DPDK threads,
   what if moving an spdk_thread to a new DPDK thread was an explicit operation performed by the application periodically?
   * Will this work for NetApp’s use case?
John Meneghini has invited you to join a meeting on the Web, using WebEx
Topic: SPDK Dynamic Threading Model
Date: Wednesday, May 30, 2018
Time: 12:00 pm, Eastern Daylight Time (New York, GMT-04:00)
Meeting number: 929 139 003
Meeting password: spdk
Please click the link below to see more information, or to join the meeting.
NOTE: You do not need to register for a WebEx Account to attend this meeting. Just click on the link below and use the Meeting password: spdk
https://netapp-meeting.webex.com/netapp-meeting/j.php?MTID=m534ae151fad1018b517d74e0736d503e
Teleconference: To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
US Toll Free: +1-844-655-1728
US Toll: +1-408-990-0022
Access code: 929 139 003
Global call-in numbers: https://netapp-meeting.webex.com/netapp-meeting/globalcallin.php?serviceType=MC&ED=674045167&tollFree=1
Toll-free dialing restrictions: https://www.webex.com/pdf/tollfree_restrictions.pdf
-------------------------------------------------------
To join the meeting on iPhone
-------------------------------------------------------
Go to wbx://netapp-meeting.webex.com/netapp-meeting?MK=929139003&MTGTK=SDJTSwAAAASLYBrb0Z2wjEeOVCuCKAiTKGQ3u5dqWWqeEvEa2V-6sg2&r2sec=1
Don't have the iPhone WebEx application yet?
Go to http://itunes.apple.com/app/cisco-webex-meetings/id298844386
To contact John Meneghini, call 1-781-768-5324 or
send a message to this address: john.meneghini(a)netapp.com
From: SPDK <spdk-bounces(a)lists.01.org> on behalf of John Meneghini <John.Meneghini(a)netapp.com>
Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
Date: Tuesday, May 29, 2018 at 12:24 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] SPDK Dynamic Threading Model
Notes from our Community meeting discussion this morning:
1. Break the physical connection between lcore and thread
   * This was agreed upon as a needed abstraction
   * We’ll use the words "spdk_thread" for the virtual thread and "DPDK thread" for the native thread
2. No way to specify the “threading policy”
   * There is no NUMA awareness in the current design/implementation; this will lead to cache thrashing
   * Instead of automatically moving spdk_thread/reactor objects between DPDK threads,
     what if moving an spdk_thread to a new DPDK thread was an explicit operation performed by the application periodically?
     i. Will this work for NetApp’s use case?
   * Madhu agreed this was something that could be done, and he is willing to create another patch to do this.
     i. Would creating such a patch be helpful?
NOTE: the meeting ended abruptly at this point because the meeting host dropped off.
Editor’s note:
I think the important question that didn’t get answered is: will eliminating the “automatic threading policy” in the current design (as seen in Madhu’s patches) work for NetApp’s use case?
/John
On 5/25/18, 6:32 PM, "SPDK on behalf of Pai, Madhu" <spdk-bounces(a)lists.01.org on behalf of Madhusudan.Pai(a)netapp.com> wrote:
Comments inline marked by [MP].
BTW, is it worthwhile to discuss this in more detail in one of the SPDK community meetings?
Thanks,
Madhu
-----Original Message-----
From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
Sent: Friday, May 25, 2018 5:26 PM
To: spdk(a)lists.01.org
Subject: Re: [SPDK] SPDK Dynamic Threading Model
On Fri, 2018-05-25 at 20:31 +0000, Pai, Madhu wrote:
> Ben,
>
> Thank you very much for the detailed analysis and mail. I agree with
> the points you are making here and the design goals for SPDK. I'll
> try to talk some more about the design for the patch, the advantages,
> and see if we can improve on this. I'll also try to answer
> the three valid concerns you have raised below.
>
> 1) Design principle:
> Breaking the 1:1 mapping between reactors and cores will give an app
> better flexibility in terms of the threading model. At a very high
> level of abstraction, this could be seen as similar to a green threading framework.
> I'm no expert in green threading frameworks, but, based on my
> reading the similarity would be:
> - The "spdk_thread" is the virtual thread.
> - The reactor is the "cache" of this virtual thread (i.e. they have a
> 1-1 relationship).
> - The bare metal thread is the DPDK thread in this model.
> So in this design, the bare metal thread takes on the properties of the
> virtual thread when it is running on a reactor. The implementation-specific
> part of that was changing the thread-local (TLS) _lcore value to the reactor id.
I think we'll need to iterate on what the basic primitives are (especially their names), but I'm generally leaving that discussion for slightly later on in the design. For now, I agree with the direction above. I'm going to temporarily use the words "spdk_thread" for the virtual thread and "DPDK thread" for the native thread.
[MP]: Agreed. The intent above was to draw a parallel to a model that may be existing out there. We can decide on the names and the basic primitives later on.
>
> 2) Advantages:
> This was the first iteration of what can become a more solidified
> solution as we go along. There are gaps in this approach and I think
> we can mitigate some of those concerns.
> - It gives applications a switch. They may choose to not do this at
> all and then a virtual thread would map to a bare metal thread and stay that way.
> - The other advantage is that the core library, APIs, and the design
> do not change.
I generally don't like to have switches that control behavior this fundamental, where possible. It would be ideal to have the old behavior fall out of the new behavior whenever you happen to create one spdk_thread per DPDK thread.
[MP]: Ok. I'll think some more about this. In the current patch, I used the "number-of-reactors" in the config file as the mechanism to determine new versus old behavior. In some ways that falls into what you mention above. If the number of lcores does NOT match the number of reactors, the new behavior is used; otherwise, it falls back to the legacy behavior. This was a prototype that I built over the last week after our discussion during the SPDK conference. There are iterations and improvements needed here. But, yes, at this point there is a switch here.
> - I think it helps apps to seamlessly adapt to a dynamic threading
> model with their current code base.
> - It allows apps to dynamically increase the number of reactors. The
> granularity of a QP to a reactor could be decided by the app. So, we
> could, in one way solve the problem of trying to move a QP from one
> thread to another for load balancing.
> - It gives Apps certain QoS capabilities. Apps can create "gold",
> "silver" and "bronze" reactor rings. The number of bare-metal threads
> working on each ring may be different depending on how quickly certain QPs have to be serviced.
>
> 3) Concerns:
> Looking at the three concerns you raised -
>
> 3a) Cache thrashing: Agreed. In fact, I wrote about this in the commit
> message of the patch as a valid concern. But, I think we can mitigate this. For
> applications that are concerned about cache thrashing - the option is
> to run the reactor on a bare metal thread for a longer period of time.
> In the patch I showed a crude way, where after 100 usec, the reactor
> is switched out. But, that does not absolutely have to be the case. A
> reactor could run for a much longer period of time (10s of msec)
> allowing the benefits of the CPU caching to be used. The other way to
> mitigate this is to make sure that the bare metal threads run on the
> same socket. Thus even when reactors are switched out, the cache at the socket layer is not invalidated.
Instead of automatically moving spdk_thread/reactor objects between DPDK threads, what if moving an spdk_thread to a new DPDK thread was an explicit operation performed by the application periodically?
[MP]: Yes, but isn't that precisely something that needs to be left to the application? In this patch, a simple multi-producer/multi-consumer (MP-MC) polling ring was used as a way to break up the binding. The application may use something other than a ring to manage the spdk_threads and how they bind to DPDK threads.
I'll think some more to see how to do this as an explicit operation instead of periodically, as done in the patch. I'll have to look into the DPDK thread pipes and the m2s/s2m IPC mechanism because the app most likely will have to communicate directly with the DPDK thread to achieve this.
>
> 3b) NUMA: I believe NUMA-awareness can be built in. It depends on
> where the bare metal threads run. If we have NICs and SSDs spread
> across the two sockets, a more elegant solution can probably be
> designed where we create reactor rings per socket. Then we would have
> the capability to add the QPs to the right reactor (in the right
> ring) based on the NIC. That is an extension of the current design IMO.
It's not always clear what the right NUMA node to run on actually is. That's because an spdk_thread has a set of I/O channels (queue pairs) that talk to different devices. Sometimes you want to be on the same NUMA node as a particular NIC, but other times as a particular SSD. Making the movement of spdk_threads between DPDK threads an explicit operation performed by the application would push this decision up into the application/user code, where it knows best.
[MP]: Agreed. The design allows this to happen by allowing the creation of multiple reactor rings (one per socket?). Applications can decide to move the reactors using custom messages from one ring to another. This goes back to your previous comment about having an explicit mechanism to move the reactor from one DPDK thread to another.
>
> 3c) Global Reactor Ring bottleneck: The number of reactors and the
> number of threads are not high. Also, the idea here is to run on a
> reactor for "some extended period of time". Given that the number of
> producer/consumers from this ring will be limited, I don't think the
> reactor ring will be a bottleneck. Compare and contrast this global
> reactor ring with the event queue ring that exists today. We use
> events for all callback events from the backend during IO. We
> definitely do not want the reactors to be swapped in and out at the
> rate of IO, but to hold on to a reactor for a somewhat larger period
> of time. When the application specific metrics show that these threads
> are doing more "useful work" versus "idle polling", we just add more
> threads. Eventually, at high loads, the number of threads will be the same as the number of reactors, and the design thus falls back to the traditional SPDK model.
> This would be something that is decided by the ecosystem that runs SPDK.
>
> Looking forward to discussing this more.
>
> Thanks,
> Madhu
>
>
> -----Original Message-----
> From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
> Sent: Friday, May 25, 2018 3:03 PM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] SPDK Dynamic Threading Model
>
> I've been doing my best to think this through over the last few days,
> as have a number of other community members, and some things are
> beginning to look a bit clearer now.
>
> SPDK was always intended to be a composable set of libraries as
> opposed to a framework. By that, I mean that SPDK is intended to be
> integrated into other applications as opposed to existing code being
> integrated into SPDK. The community has done a lot of work to attempt
> to make that happen, with varying degrees of success. The challenges are primarily centered on two things.
> First, SPDK requires special memory management operations to allocate
> DMA-safe memory.
> This stems from the strict requirement to avoid data copies. The
> problem would essentially go away if SPDK instead internally allocated
> DMA-safe memory and copied user data into those buffers, but the performance would take a big hit.
> Second, SPDK avoids locks by instead passing messages between threads.
> That means that many components (although not all) within SPDK imply
> that the application is using a certain threading model. Specifically,
> the threading model needs to look like cooperative multi-tasking, or
> futures and promises, or event loops, etc. So far the consensus seems
> to be that it is acceptable to assume there is some threading model
> that is conducive to message passing, but we don't want to specifically pick a single model or framework.
>
> The problem that John, Madhu, and the others at NetApp have identified
> is that SPDK currently makes entirely too many assumptions about and
> places too many strict requirements on the mechanics of the threading model in an application.
> I think there is a strong consensus that fixing this is important and
> should be high priority. The fix, ultimately, will be better
> abstractions around the underlying application's threading model. I
> hope we can design something that will enable people to plug SPDK into
> all sorts of frameworks - green threading frameworks, DPDK lthreads,
> Seastar, coroutine frameworks, etc. The more people we can get
> participating in this work, the better the abstractions will be, so please everyone chime in with requirements and ideas.
>
> The current set of patches break the 1:1 mapping between reactors and cores.
> Instead, reactors are stored on a global list. Each core iterates on
> this global list and pulls the next reactor and processes any waiting
> events and executes pollers, then places the reactor back on the list.
> I'm concerned about three things with this design:
>
> * Since the reactors now potentially execute on a different core each
> time through their loop, the CPU cache is going to be badly thrashed.
> I suspect the performance hit here is very large and continues to grow
> as additional threads are added. SPDK is designed to scale linearly
> with the addition of CPU cores as much as possible, and I think it
> would be a mistake to move away from that.
> * All NUMA-awareness has been lost. Placing the processing of I/O on
> the same NUMA node as the NIC or SSD is critical to achieving high
> performance, so the code needs to remain NUMA-aware.
> * All threads are polling a single queue of reactors, so the atomic
> variables controlling the head and tail of that queue are going to be
> highly contended and become more contended as the number of threads increases.
>
> I hope this is just the beginning of a larger discussion. I'll let the
> patch review settle into next week and see if solutions begin to emerge.
>
> Thanks,
> Ben
>
> On Thu, 2018-05-24 at 02:24 +0000, Meneghini, John wrote:
> > Hi Frank.
> >
> > Thanks for your suggestion.
> >
> > In our implementation/application, we don’t use DPDK. This is why
> > the first set of changes we proposed last year was to abstract out
> > the dependencies on DPDK. I think I still have a copy of the old pull
> > request around for reference.
> >
> > https://github.com/spdk/spdk/pull/152
> >
> > We are actually running SPDK in a completely different execution
> > environment, and we need a “native” SPDK dynamic threading model
> > that can be supported on any platform, without DPDK.
> >
> > A second RFC patch has been pushed up to GerritHub for review.
> > Please see the commit message of these two patches for a complete
> > description of the proposed change.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412277/
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > /John
> >
> > 40.5. The L-thread subsystem
> > The L-thread subsystem resides in the
> > examples/performance-thread/common
> > directory and is built and linked automatically when building the
> > l3fwd-thread example.
> >
> > The subsystem provides a simple cooperative scheduler to enable
> > arbitrary functions to run as cooperative threads within a single
> > EAL thread. The subsystem provides a pthread like API that is
> > intended to assist in reuse of legacy code written for POSIX pthreads.
> >
> > The following sections provide some detail on the features,
> > constraints, performance and porting considerations when using L-threads.
> >
> >
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Huang Frank
> > <kinzent(a)hotmail.com>
> > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Date: Wednesday, May 23, 2018 at 9:46 PM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: [SPDK] Re: SPDK Dynamic Threading Model
> >
> > Hi,
> >
> > Why not consider using the lthread subsystem provided by DPDK?
> > http://dpdk.org/doc/guides-16.04/sample_app_ug/performance_thread.html#lthread-subsystem
> >
> >
> >
> > Frank Huang
> >
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Meneghini, John
> > <John.Meneghini(a)netapp.com>
> > Sent: May 23, 2018 4:12
> > To: Storage Performance Development Kit
> > Subject: [SPDK] RFC: SPDK Dynamic Threading Model
> >
> > As discussed during the Summit last week, we believe SPDK needs
> > support for a dynamic threading model. An RFC patch has been pushed
> > upstream for review.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > This patch is a beginning point for our proposed changes.
> > Improvements will be made with subsequent patches.
> >
> > The description below is taken from
> > https://github.com/spdk/spdk/issues/308
> > SPDK needs to support a dynamic threading model where reactors are
> > NOT bound to lcores.
> >
> > Many applications need SPDK to support a threading model that:
> > * Does not assume a static number of threads
> > * Does not bind threads to cores (this burns up cores)
> > * Does not assume all threads use the same polling model
> >
> > Removing these assumptions from the SPDK libraries will allow:
> > * Different applications to share the SPDK libraries on the same
> >   platform, e.g. FC-NVMe, RDMA-NVMe, and NVMe
> > * Different platforms to support the same applications with the same
> >   libraries, e.g. a 4 core platform and a 128 core platform, a
> >   PowerPC and NFS traffic
> > * Different workloads at different scales, e.g. 1 NVMF Host with
> >   1 Subsystem and 1 Namespace, or 16 NVMF Hosts with 100 Subsystems
> >   and 1,000 namespaces.
> >
> > In particular, in SPDK, NVMF threads need to come and go depending
> > upon the “NVMF load”.
> >
> > More Dynamic Use Cases Coming
> >
> > With the advent of FC-NVMe (which uses NPIV to virtualize FC ports)
> > NVMF Subsystem Ports and Host Ports are not static. Different Hosts
> > and Subsystems can have a different number of Ports, and Ports can
> > be dynamically added and removed from the configuration. This means:
> > * The same platform may end up having a different number of Subsystem
> >   ports at various points in its lifecycle
> > * The SPDK FC-NVMe application does NOT know up front how many ports
> >   it will have.
> >
> > Expected Behavior
> >
> > * SPDK libraries should not assume a static number of threads
> > * SPDK libraries should bind threads to cores only optionally,
> >   supporting both static and dynamic threading models
> > * SPDK libraries should support a Hybrid polling model (modified run
> >   to completion)
> >
> > Current Behavior
> >
> > * SPDK libraries assume a static number of threads
> > * SPDK libraries bind threads to cores
> > * SPDK libraries assume all threads use the same polling model
> >
> > Possible Solution
> >
> > Proposal to solve above Use Cases:
> > * Use the spdk_nvmf_poll_group (PG) as the unit of threading
> >   abstraction
> > * Use PG as the fundamental unit on which a thread operates
> > * The spdk_thread will be a “virtual” thread that gets tied into a
> >   PG (1-1 relationship)
> > * Create PGs as and when hardware ports (and associated queue-pairs)
> >   come to life.
> > * No dependency between a PG and a “real” thread.
> > * A PG can be picked up by any “real” thread and worked upon. The PG
> >   contains everything needed for IO handling.
> > * PG continues to contain spdk_thread. spdk_thread continues to use
> >   the same mechanisms for IO channels to different NS etc.
> > * PG contains vendor data, e.g. a “ring” for depositing asynchronous
> >   callback events from the backend OR management events that come
> >   from external modules.
> > * spdk_thread contains thread_context that points to a PG instead of
> >   a reactor.
> > * So messages from the library get routed to the PG “ring” instead
> >   of a thread/reactor event ring.spdk_bdev_get_io
> >
> > Understanding the intent of the event library, it is believed this
> > is the place for customization. However, the current event library
> > assumes a threading model that's a part of the util library.
> > Moreover, many of the other SPDK core libraries assume the same
> > threading model as the util library. If the SPDK util library can be
> > modified to support these dynamic threading use cases, all
> > applications would be able to use the SPDK framework more
> > effectively.
> >
> > Steps to Reproduce
> >
> > This is an enhancement. There is no bug.
> >
> > Context (Environment including OS version, SPDK version, etc.)
> >
> > Would like to provide these enhancements in V18.07.
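
The poll-group proposal quoted above routes library messages to a ring owned by the PG, so that whichever "real" thread currently holds the PG can drain it. A minimal sketch of that idea in C follows. All names here (struct poll_group, pg_send_msg, pg_poll, count_io, PG_RING_SIZE) are hypothetical and are not SPDK APIs. The ring is single-threaded for brevity; an actual implementation would need a multi-producer ring or equivalent synchronization.

```c
#include <stddef.h>

#define PG_RING_SIZE 8

typedef void (*pg_msg_fn)(void *ctx);

struct pg_msg {
    pg_msg_fn fn; /* callback to run on the poll group */
    void *ctx;    /* argument passed to the callback */
};

/* Hypothetical stand-in for a poll group that owns its own message ring. */
struct poll_group {
    struct pg_msg ring[PG_RING_SIZE];
    size_t head;  /* next slot to consume */
    size_t count; /* number of queued messages */
};

/* Queue a message on the poll group. Returns 0, or -1 if the ring is full. */
int pg_send_msg(struct poll_group *pg, pg_msg_fn fn, void *ctx)
{
    size_t tail;

    if (pg->count == PG_RING_SIZE) {
        return -1;
    }
    tail = (pg->head + pg->count) % PG_RING_SIZE;
    pg->ring[tail].fn = fn;
    pg->ring[tail].ctx = ctx;
    pg->count++;
    return 0;
}

/*
 * Drain every queued message. The caller is whichever real thread has
 * picked up this poll group. Returns the number of messages processed.
 */
int pg_poll(struct poll_group *pg)
{
    int processed = 0;

    while (pg->count > 0) {
        struct pg_msg msg = pg->ring[pg->head];

        pg->head = (pg->head + 1) % PG_RING_SIZE;
        pg->count--;
        msg.fn(msg.ctx);
        processed++;
    }
    return processed;
}

/* Example handler: count a completed I/O in the caller's counter. */
void count_io(void *ctx)
{
    (*(int *)ctx)++;
}
```

Because the ring lives in the poll group rather than in a reactor, nothing in pg_poll depends on which thread calls it. That independence is the property the proposal relies on.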
> >
> >
> >
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
>
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk
[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 81159 bytes --]
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: attachment.ics --]
[-- Type: text/calendar, Size: 28256 bytes --]
Meeting request: Re: [SPDK] SPDK Dynamic Threading Model
Organizer: Meneghini, John <John.Meneghini@netapp.com>
Attendees: Storage Performance Development Kit <spdk@lists.01.org>,
  Pai, Madhu <Madhusudan.Pai@netapp.com>,
  Rodriguez, Edwin <Ed.Rodriguez@netapp.com>,
  Kaligotla, Srikanth <Srikanth.Kaligotla@netapp.com>
When: Wednesday, May 30, 2018, 12:00 pm to 1:00 pm, Eastern Daylight Time (New York, GMT-04:00)
Location: WebEx Meeting
Status: Confirmed, reminder 15 minutes before the start

I’ve scheduled an Ad-Hoc Meeting to discuss this subject.

Agenda:

1. Discuss open issues in the proposed dynamic threading design.
   * No way to specify the “threading policy”
   * There is no NUMA awareness in the current design/implementation, this will lead to cache thrashing
2. Instead of automatically moving spdk_thread/reactor objects between DPDK threads,
   what if moving an spdk_thread to a new DPDK thread was an explicit operation performed by the application periodically?
   * Will this work for NetApp’s use case?

John Meneghini has invited you to join a meeting on the Web, using WebEx

Topic: SPDK Dynamic Threading Model
Date: Wednesday, May 30, 2018
Time: 12:00 pm, Eastern Daylight Time (New York, GMT-04:00)
Meeting number: 929 139 003
Meeting password: spdk

Please click the link below to see more information, or to join the meeting.

NOTE: You do not need to register for a WebEx Account to attend this meeting. Just click on the link below and use the Meeting password: spdk

https://netapp-meeting.webex.com/netapp-meeting/j.php?MTID=m534ae151fad1018b517d74e0736d503e

Teleconference: To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
US Toll Free: +1-844-655-1728
US Toll: +1-408-990-0022
Access code: 929 139 003
Global call-in numbers: https://netapp-meeting.webex.com/netapp-meeting/globalcallin.php?serviceType=MC&ED=674045167&tollFree=1
Toll-free dialing restrictions: https://www.webex.com/pdf/tollfree_restrictions.pdf

To join the meeting on iPhone, go to
wbx://netapp-meeting.webex.com/netapp-meeting?MK=929139003&MTGTK=SDJTSwAAAASLYBrb0Z2wjEeOVCuCKAiTKGQ3u5dqWWqeEvEa2V-6sg2&r2sec=1
Don't have the iPhone WebEx application yet? Go to http://itunes.apple.com/app/cisco-webex-meetings/id298844386

To contact John Meneghini, call 1-781-768-5324 or send a message to this address: john.meneghini@netapp.com
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [SPDK] SPDK Dynamic Threading Model
@ 2018-05-30 17:31 Pai, Madhu
0 siblings, 0 replies; 10+ messages in thread
From: Pai, Madhu @ 2018-05-30 17:31 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 24877 bytes --]
Thanks all for attending the meeting. Minutes below.
Please add if I’ve missed or incorrectly stated anything.
Thanks,
Madhu
1. The event library is the place for customization. This is where applications can do their custom environment changes.
2. The dynamic threading idea is good. But, the implementation and high level approach needs work. Specifically:
* In the open source code, the DPDK thread should stay tied to a single reactor.
* Make the “spdk_thread” the unit of abstraction (instead of the reactor as done in the patch today).
* Allow the spdk_thread to be moved between reactors
* The spdk_thread data structure would need to be enhanced. It would at a minimum need a poller and an event ring.
* The DPDK thread would poll and find spdk_threads to work on, in its poller ring.
* In the first iteration (in the open source code), keep a 1-1 relationship between a spdk_thread and a reactor. That essentially means that the DPDK thread works on one spdk_thread. This would work as today.
* Individual implementations may create/destroy spdk_threads and move them as needed.
* A NVMF Poller Group would go on the polling ring in the spdk_thread.
* Use TLS to set thread_id’s as appropriate. Goal is to NOT change any of the util api’s including _get_thread.
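The structure outlined in these minutes could be sketched roughly as follows. This is only an illustration, not SPDK code: `sketch_thread`, `sketch_reactor`, and every function name below are hypothetical, and fixed-size arrays stand in for the real rings. It shows a virtual thread that owns its own pollers and event queue, and a reactor (one native thread) that makes a pass over the virtual threads assigned to it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POLLERS 8
#define MAX_EVENTS  16   /* power of two, so the wrapping indices stay consistent */
#define MAX_THREADS 4

typedef int (*poller_fn)(void *arg);   /* returns the amount of work done */
typedef void (*event_fn)(void *arg);

struct sketch_event { event_fn fn; void *arg; };

/* Hypothetical stand-in for an spdk_thread that owns its pollers and events. */
struct sketch_thread {
    struct { poller_fn fn; void *arg; } pollers[MAX_POLLERS];
    size_t num_pollers;
    struct sketch_event events[MAX_EVENTS];  /* plain FIFO, not a lock-free ring */
    size_t ev_head, ev_tail;
};

/* Hypothetical stand-in for a reactor, i.e. one native (DPDK) thread. */
struct sketch_reactor {
    struct sketch_thread *threads[MAX_THREADS];
    size_t num_threads;
};

static int sketch_thread_add_poller(struct sketch_thread *t, poller_fn fn, void *arg)
{
    assert(t != NULL && fn != NULL);
    if (t->num_pollers == MAX_POLLERS) return -1;
    t->pollers[t->num_pollers].fn = fn;
    t->pollers[t->num_pollers].arg = arg;
    t->num_pollers++;
    return 0;
}

static int sketch_thread_send_event(struct sketch_thread *t, event_fn fn, void *arg)
{
    assert(t != NULL && fn != NULL);
    if (t->ev_tail - t->ev_head == MAX_EVENTS) return -1;   /* queue full */
    t->events[t->ev_tail % MAX_EVENTS] = (struct sketch_event){ fn, arg };
    t->ev_tail++;
    return 0;
}

/* One pass of the reactor loop: for each thread, drain events, then run pollers. */
static int sketch_reactor_run_once(struct sketch_reactor *r)
{
    int work = 0;
    for (size_t i = 0; i < r->num_threads; i++) {
        struct sketch_thread *t = r->threads[i];
        while (t->ev_head != t->ev_tail) {
            struct sketch_event ev = t->events[t->ev_head % MAX_EVENTS];
            t->ev_head++;
            ev.fn(ev.arg);
            work++;
        }
        for (size_t p = 0; p < t->num_pollers; p++)
            work += t->pollers[p].fn(t->pollers[p].arg);
    }
    return work;
}

/* Example callbacks used only to exercise the sketch. */
static int sketch_count_poller(void *arg) { (*(int *)arg)++; return 1; }
static void sketch_set_flag(void *arg) { *(int *)arg = 1; }
```

With one thread per reactor this degenerates to today's behavior, which matches the "keep a 1-1 relationship in the first iteration" point above.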
In terms of next steps, I’ll look in more detail into the above design and start working on a patch.
From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Meneghini, John
Sent: Tuesday, May 29, 2018 12:24 PM
To: Storage Performance Development Kit <spdk(a)lists.01.org>
Subject: Re: [SPDK] SPDK Dynamic Threading Model
Notes from our Community meeting discussion this morning:
1. Break the physical connection between lcore and thread
* This was agreed upon as a needed abstraction
* We’ll use the words "spdk_thread" for the virtual thread and "DPDK thread" for the native thread
2. No way to specify the “threading policy”
* There is no NUMA awareness in the current design/implementation; this will lead to cache thrashing
* Instead of automatically moving spdk_thread/reactor objects between DPDK threads,
what if moving an spdk_thread to a new DPDK thread was an explicit operation performed by the application periodically?
i. Will this work for NetApp’s use case?
* Madhu agreed this was something that could be done, and he is willing to create another patch to do this.
i. Would creating such a patch be helpful?
NOTE: the meeting ended abruptly at this point because the meeting host dropped off and the meeting ended.
Editor’s note:
I think the important question that didn’t get answered is: will eliminating the “automatic threading policy” in the current design (as seen in Madhu’s patches) work for NetApp’s use case?
/John
On 5/25/18, 6:32 PM, "SPDK on behalf of Pai, Madhu" <spdk-bounces(a)lists.01.org on behalf of Madhusudan.Pai(a)netapp.com> wrote:
Comments inline marked by [MP].
BTW, is it worthwhile to discuss this in more detail in one of the SPDK community meetings?
Thanks,
Madhu
-----Original Message-----
From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
Sent: Friday, May 25, 2018 5:26 PM
To: spdk(a)lists.01.org
Subject: Re: [SPDK] SPDK Dynamic Threading Model
On Fri, 2018-05-25 at 20:31 +0000, Pai, Madhu wrote:
> Ben,
>
> Thank you very much for the detailed analysis and mail. I agree with
> the points you are making here and the design goals for SPDK. I'll
> try to talk some more about the design for the patch, the advantages,
> and see if we can improvise and better this. I'll also try to answer
> the three valid concerns you have raised below.
>
> 1) Design principle:
> Breaking the 1:1 mapping between reactors and core will give an app
> better flexibility in terms of the threading model. At a very high
> abstract level this could be looked at, as being similar to the green threading framework.
> I'm no expert in the green threading framework, but, based on my
> reading the similarity would be:
> - The "spdk_thread" is the virtual thread.
> - The reactor is the "cache" of this virtual thread (i.e. they have a
> 1-1 relationship).
> - The bare metal thread is the DPDK thread in this model.
> So in this design, the bare metal thread inherits the properties of the
> virtual thread when it is running on a reactor. The implementation
> specific part of that was the change of the TLS of the _lcore value to the reactor id.
I think we'll need to iterate on what the basic primitives are (especially their names), but I'm generally leaving that discussion for slightly later on in the design. For now, I agree with the direction above. I'm going to temporarily use the words "spdk_thread" for the virtual thread and "DPDK thread" for the native thread.
[MP]: Agreed. The intent above was to draw a parallel to a model that may be existing out there. We can decide on the names and the basic primitives later on.
>
> 2) Advantages:
> This was the first iteration of what can become a more solidified
> solution as we go along. There are gaps in this approach and I think
> we can mitigate some of those concerns.
> - It gives applications a switch. They may choose to not do this at
> all and then a virtual thread would map to a bare metal thread and stay that way.
> - The other advantage is that the core library, APIs and the design
> do not change.
I generally don't like to have switches that control behavior this fundamental, where possible. It would be ideal to have the old behavior fall out of the new behavior whenever you happen to create one spdk_thread per DPDK thread.
[MP]: Ok. I'll think some more about this. In the current patch, I used the "number-of-reactors" in the config file as the mechanism to determine new versus old behavior. In some ways that falls into what you mention above. If the number of lcores does NOT match the number of reactors, the code goes to the new way of doing things; otherwise, it falls back to the legacy behavior. This was a prototype that I built over the last week after our discussion during the SPDK conference. There are iterations and improvements needed here. But, yes, at this point there is a switch here.
> - I think it helps apps to seamlessly adapt to a dynamic threading
> model with their current code base.
> - It allows apps to dynamically increase the number of reactors. The
> granularity of a QP to a reactor could be decided by the app. So, we
> could, in one way solve the problem of trying to move a QP from one
> thread to another for load balancing.
> - It gives Apps certain QoS capabilities. Apps can create "gold",
> "silver" and "bronze" reactor rings. The number of bare-metal threads
> working on each ring may be different depending on how quickly certain QP's have to be serviced.
>
> 3) Concerns:
> Looking at the three concerns you raised -
>
> 3a) Cache thrashing: Agreed. In fact, I write about this in the commit
> patch as a valid concern. But, I think we can mitigate this. For
> applications that are concerned about cache thrashing - the option is
> to run the reactor on a bare metal thread for a longer period of time.
> In the patch I showed a crude way, where after 100 usec, the reactor
> is switched out. But, that does not absolutely have to be the case. A
> reactor could run for a much longer period of time (10s of msec)
> allowing the benefits of the CPU caching to be used. The other way to
> mitigate this is to make sure that the bare metal threads run on the
> same socket. Thus even when reactors are switched out, the cache at the socket layer is not invalidated.
Instead of automatically moving spdk_thread/reactor objects between DPDK threads, what if moving an spdk_thread to a new DPDK thread was an explicit operation performed by the application periodically?
[MP]: Yes, but isn't that precisely something that needs to be left to the application? In this patch, a simple MP-MC polling ring was used as a way to break up binding. The application may use something other than a ring to manage the spdk_threads and how it binds to a DPDK thread?
I'll think some more to see how to do this in an explicit operation instead of periodically, as done in the patch. I'll have to look into the DPDK thread pipes and the m2s/s2m IPC mechanism because the app most likely will have to communicate directly with the DPDK thread to achieve this.
>
> 3b) NUMA: I believe NUMA-awareness can be built in. It depends on
> where the bare metal threads run. If we have NICs and SSDs spread
> across the two sockets, a more elegant solution can probably be
> designed where we create reactor rings per socket. Then we would have
> the capability to add the QP's to the right reactor (in the right
> ring) based on the NIC. That is an extension of the current design IMO.
It's not always clear what the right NUMA node to run on actually is. That's because an spdk_thread has a set of I/O channels (queue pairs) that talk to different devices. Sometimes you want to be on the same NUMA node as a particular NIC, but other times as a particular SSD. Making the movement of spdk_threads between DPDK threads an explicit operation performed by the application would push this decision up into the application/user code, where it knows best.
[MP]: Agreed. The design allows this to happen by allowing the creation of multiple reactor rings (one per socket?). Applications can decide to move the reactors using custom messages from one ring to another. This goes back to your previous comment about having an explicit mechanism to move the reactor from one DPDK thread to another.
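An explicit, application-driven move might look like the sketch below. Everything in it is hypothetical (`sketch_reactor`, `sketch_move_thread`, `sketch_pick_reactor`, the `socket_id` field); it only illustrates the idea discussed here, that the application rather than the library chooses the destination, for example by NUMA socket. Synchronization between native threads is deliberately left out; a real implementation would presumably hand the thread over by message rather than by touching another reactor's list directly.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 8

struct sketch_thread { int id; };

/* Hypothetical reactor: one native thread pinned somewhere on a NUMA socket. */
struct sketch_reactor {
    int socket_id;
    struct sketch_thread *threads[MAX_THREADS];
    size_t num_threads;
};

/* Remove t from src and append it to dst. Returns 0 on success, -1 on failure. */
static int sketch_move_thread(struct sketch_reactor *src, struct sketch_reactor *dst,
                              struct sketch_thread *t)
{
    assert(src != NULL && dst != NULL);
    if (dst->num_threads == MAX_THREADS) return -1;
    for (size_t i = 0; i < src->num_threads; i++) {
        if (src->threads[i] == t) {
            /* Fill the hole with the last entry, then hand t to dst. */
            src->threads[i] = src->threads[--src->num_threads];
            dst->threads[dst->num_threads++] = t;
            return 0;
        }
    }
    return -1;   /* t was not running on src */
}

/* Example application policy: least-loaded reactor on the requested socket. */
static struct sketch_reactor *sketch_pick_reactor(struct sketch_reactor *reactors,
                                                  size_t n, int socket_id)
{
    struct sketch_reactor *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (reactors[i].socket_id != socket_id) continue;
        if (best == NULL || reactors[i].num_threads < best->num_threads)
            best = &reactors[i];
    }
    return best;
}
```

Because the policy function lives in the application, it can prefer the socket of a NIC in one deployment and the socket of an SSD in another without any change to the move primitive.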
>
> 3c) Global Reactor Ring bottleneck: The number of reactors and the
> number of threads are not high. Also, the idea here is to run on a
> reactor for "some extended period of time". Given that the number of
> producer/consumers from this ring will be limited, I don't think the
> reactor ring will be a bottleneck. Compare and contrast this global
> reactor ring with the event queue ring that exists today. We use
> events for all callback events from the backend during IO. We
> definitely do not want the reactors to be swapped in and out at the
> rate of IO, but to hold on to a reactor for a somewhat larger period
> of time. When the application specific metrics show that these threads
> are doing more "useful work" versus "idle polling", we just add more
> threads. Eventually at high loads the number of threads will be the same as the number of reactors and thus falls back to the traditional SPDK model.
> This would be something that is decided by the ecosystem that runs SPDK.
>
> Looking forward to discussing this more.
>
> Thanks,
> Madhu
>
>
> -----Original Message-----
> From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
> Sent: Friday, May 25, 2018 3:03 PM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] SPDK Dynamic Threading Model
>
> I've been doing my best to think this through over the last few days,
> as have a number of other community members, and some things are
> beginning to look a bit clearer now.
>
> SPDK was always intended to be a composable set of libraries as
> opposed to a framework. By that, I mean that SPDK is intended to be
> integrated into other applications as opposed to existing code being
> integrated into SPDK. The community has done a lot of work to attempt
> to make that happen, with varying degrees of success. The challenges are primarily centered on two things.
> First, SPDK requires special memory management operations to allocate
> DMA-safe memory.
> This stems from the strict requirement to avoid data copies. The
> problem would essentially go away if SPDK instead internally allocated
> DMA-safe memory and copied user data into those buffers, but the performance would take a big hit.
> Second, SPDK avoids locks by instead passing messages between threads.
> That means that many components (although not all) within SPDK imply
> that the application is using a certain threading model. Specifically,
> the threading model needs to look like cooperative multi-tasking, or
> futures and promises, or event loops, etc. So far the consensus seems
> to be that it is acceptable to assume there is some threading model
> that is conducive to message passing, but we don't want to specifically pick a single model or framework.
>
> The problem that John, Madhu, and the others at NetApp have identified
> is that SPDK currently makes entirely too many assumptions about and
> places too many strict requirements on the mechanics of the threading model in an application.
> I think there is a strong consensus that fixing this is important and
> should be high priority. The fix, ultimately, will be better
> abstractions around the underlying application's threading model. I
> hope we can design something that will enable people to plug SPDK into
> all sorts of frameworks - green threading frameworks, DPDK lthreads,
> Seastar, coroutine frameworks, etc. The more people we can get
> participating in this work, the better the abstractions will be, so please everyone chime in with requirements and ideas.
>
> The current set of patches break the 1:1 mapping between reactors and cores.
> Instead, reactors are stored on a global list. Each core iterates on
> this global list and pulls the next reactor and processes any waiting
> events and executes pollers, then places the reactor back on the list.
> I'm concerned about three things with this design:
>
> * Since the reactors now potentially execute on a different core each
> time through their loop, the CPU cache is going to be badly thrashed.
> I suspect the performance hit here is very large and continues to grow
> as additional threads are added. SPDK is designed to scale linearly
> with the addition of CPU cores as much as possible, and I think it
> would be a mistake to move away from that.
> * All NUMA-awareness has been lost. Placing the processing of I/O on
> the same NUMA node as the NIC or SSD is critical to achieving high
> performance, so the code needs to remain NUMA-aware.
> * All threads are polling a single queue of reactors, so the atomic
> variables controlling the head and tail of that queue are going to be
> highly contended and become more contended as the number of threads increases.
>
> I hope this is just the beginning of a larger discussion. I'll let the
> patch review settle into next week and see if solutions begin to emerge.
>
> Thanks,
> Ben
>
> On Thu, 2018-05-24 at 02:24 +0000, Meneghini, John wrote:
> > Hi Frank.
> >
> > Thanks for your suggestion.
> >
> > In our implementation/application, we don’t use DPDK. This is why
> > the first set of changes we proposed last year was to abstract out
> > the dependencies on DPDK. I think I still have a copy of the old pull
> > request around for reference.
> >
> > https://github.com/spdk/spdk/pull/152
> >
> > We are actually running SPDK in a completely different execution
> > environment, and we need a “native” SPDK dynamic threading model
> > that can be supported on any platform, without DPDK.
> >
> > A second RFC patch has been pushed up to GerritHub for review.
> > Please see the commit message of these two patches for a complete
> > description of the proposed change.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412277/
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > /John
> >
> > 40.5. The L-thread subsystem
> > The L-thread subsystem resides in the
> > examples/performance-thread/common
> > directory and is built and linked automatically when building the
> > l3fwd- thread example.
> >
> > The subsystem provides a simple cooperative scheduler to enable
> > arbitrary functions to run as cooperative threads within a single
> > EAL thread. The subsystem provides a pthread like API that is
> > intended to assist in reuse of legacy code written for POSIX pthreads.
> >
> > The following sections provide some detail on the features,
> > constraints, performance and porting considerations when using L-threads.
> >
> >
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Huang Frank
> > <kinzent(a)hotmail.com>
> > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Date: Wednesday, May 23, 2018 at 9:46 PM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: [SPDK] Re: SPDK Dynamic Threading Model
> >
> > Hi,
> >
> > Why not consider using the lthread subsystem provided by DPDK?
> > http://dpdk.org/doc/guides-16.04/sample_app_ug/performance_thread.html#lthread-subsystem
> >
> >
> >
> > Frank Huang
> >
> >
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Meneghini, John
> > <John.Meneghini(a)netapp.com>
> > Sent: May 23, 2018 4:12
> > To: Storage Performance Development Kit
> > Subject: [SPDK] RFC: SPDK Dynamic Threading Model
> >
> > As discussed during the Summit last week, we believe SPDK needs
> > support for a dynamic threading model. An RFC patch has been pushed
> > upstream for review.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > This patch is a beginning point for our proposed changes.
> > Improvements will be made with subsequent patches.
> >
> > The description below is taken from
> > https://github.com/spdk/spdk/issues/308
> > SPDK needs to support a dynamic threading model where reactors are
> > NOT bound to lcores.
> >
> > Many applications need SPDK to support a threading model that:
> > * Does not assume a static number of threads
> > * Does not bind threads to cores (this burns up cores)
> > * Does not assume all threads use the same polling model
> >
> > Removing these assumptions from the SPDK libraries will allow:
> > * Different applications to share the SPDK libraries on the same
> >   platform
> >   E.g. FC-NVMe, RDMA-NVMe, and NVMe
> > * Different platforms to support the same applications with the same
> >   libraries
> >   E.g. a 4 core platform and a 128 core platform, a PowerPC and NFS
> >   traffic
> > * Different workloads at different scales
> >   E.g. 1 NVMF Host with 1 Subsystem and 1 Namespace, or 16 NVMF Hosts
> >   with 100 Subsystems and 1,000 namespaces.
> >
> > In particular, in SPDK, NVMF threads need to come and go depending
> > upon the “NVMF load”.
> >
> > More Dynamic Use Cases Coming
> >
> > With the advent of FC-NVMe (which uses NPIV to virtualize FC ports)
> > NVMF Subsystem Ports and Host Ports are not static. Different Hosts
> > and Subsystems can have a different number of Ports, and Ports can
> > be dynamically added and removed from the configuration. This means:
> > * The same platform may end up having a different number of Subsystem
> >   ports at various points in its lifecycle
> > * The SPDK FC-NVMe application does NOT know up front how many ports
> >   it will have.
> >
> > Expected Behavior
> >
> > * SPDK libraries should not assume a static number of threads
> > * SPDK libraries should bind threads to cores only optionally,
> >   supporting both static and dynamic threading models
> > * SPDK libraries should support a Hybrid polling model (modified run
> >   to completion)
> >
> > Current Behavior
> >
> > * SPDK libraries assume a static number of threads
> > * SPDK libraries bind threads to cores
> > * SPDK libraries assume all threads use the same polling model
> >
> > Possible Solution
> >
> > Proposal to solve above Use Cases:
> > * Use the spdk_nvmf_poll_group (PG) as the unit of threading
> >   abstraction
> > * Use PG as the fundamental unit on which a thread operates
> > * The spdk_thread will be a “virtual” thread that gets tied into a PG
> >   (1-1 relationship)
> > * Create PGs as and when hardware ports (and associated queue-pairs)
> >   come to life.
> > * No dependency between a PG and a “real” thread.
> > * A PG can be picked up by any “real” thread and worked upon. The PG
> >   contains everything needed for IO handling.
> > * PG continues to contain spdk_thread. spdk_thread continues same
> >   mechanisms for IO channels to different NS etc. etc.
> > * PG contains vendor data. E.g. a “ring” for depositing asynchronous
> >   callback events from the backend OR management events that come from
> >   external modules.
> > * spdk_thread contains thread_context that points to a PG instead of a
> >   reactor.
> > * So messages from the library get routed to the PG “ring” instead of
> >   a thread/reactor event ring.
> >
> > Understanding the intent of the event library, it is believed this
> > is the place for customization. However, the current event library
> > assumes a threading model that's a part of the util library.
> > Moreover, many of the other SPDK core libraries assume the same
> > threading model as the util library. If the SPDK util library can be
> > modified to support these dynamic threading use cases, all
> > applications would be able to use the SPDK framework more effectively.
> >
> > Steps to Reproduce
> >
> > This is an enhancement. There is no bug.
> >
> > Context (Environment including OS version, SPDK version, etc.)
> >
> > Would like to provide these enhancements in V18.07.
> >
> >
> >
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
[-- Attachment #2: attachment.html --]
[-- Type: text/html, Size: 59361 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [SPDK] SPDK Dynamic Threading Model
@ 2018-05-30 18:08 Walker, Benjamin
0 siblings, 0 replies; 10+ messages in thread
From: Walker, Benjamin @ 2018-05-30 18:08 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 3627 bytes --]
Thank you for all of your effort on this Madhu (and everyone else involved)! I'm
just providing a little more detail on each point below. I have high hopes that
this not only solves your specific problems, but also makes it much easier to
integrate SPDK into other cooperative multitasking frameworks.
On Wed, 2018-05-30 at 17:31 +0000, Pai, Madhu wrote:
> 1. The event library is the place for customization. This is where
> applications can do their custom environment changes.
This was the intent, but it isn't clear enough in the code today in my opinion.
I think we need to make notes as we go about strategies for clarifying what is
an abstraction around the user application's framework and what is an SPDK
deliverable. We've done that a bit with the environment abstraction library, but
the abstractions around threading models are not at all clearly delineated
today. This won't affect any of the patches you make in the immediate future,
but it's something to keep in mind.
> 2. The dynamic threading idea is good. But, the implementation and high level
> approach needs work. Specifically:
> a. In the open source code, the DPDK thread should stay tied to a single
> reactor.
> b. Make the “spdk_thread” the unit of abstraction (instead of the
> reactor as done in the patch today).
To summarize, the spdk_reactor becomes the abstraction for a "native" thread on
the system, and spdk_thread becomes the abstraction for the
"user/green/virtual/lightweight" thread mapped on top of it. In the first set of
changes, I think we keep the names as they are as much as possible and minimize
changes. After the initial changes are done, I think we need to revisit some of
the names of these objects to make a few concepts a bit clearer.
> c. Allow the spdk_thread to be moved between reactors
The function to do the movement should be in lib/event.
> d. The spdk_thread data structure would need to be enhanced. It would at
> a minimum need a poller and an event ring.
It needs the active_pollers, timer_pollers, and events data members from struct
spdk_reactor.
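As a rough illustration of what moving those members implies, the sketch below distinguishes pollers that run on every pass from pollers that run on a period. The types, the field names, and the tick-based clock are hypothetical simplifications and do not reproduce the fields of struct spdk_reactor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*poller_fn)(void *arg);

/* Hypothetical poller record owned by a virtual thread. */
struct sketch_poller {
    poller_fn fn;
    void *arg;
    uint64_t period_ticks;   /* 0 means "active": run on every pass */
    uint64_t next_run_tick;  /* only meaningful for timed pollers */
};

/* Run every poller that is due at time `now`; return how many ran. */
static int sketch_run_pollers(struct sketch_poller *pollers, size_t n, uint64_t now)
{
    int ran = 0;
    assert(pollers != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct sketch_poller *p = &pollers[i];
        if (p->period_ticks == 0) {
            p->fn(p->arg);
            ran++;
        } else if (now >= p->next_run_tick) {
            p->fn(p->arg);
            p->next_run_tick = now + p->period_ticks;   /* re-arm the timer */
            ran++;
        }
    }
    return ran;
}

/* Example callback used only to exercise the sketch. */
static void sketch_count(void *arg) { (*(int *)arg)++; }
```

Keeping both kinds of poller inside the virtual thread means they travel with it when it is placed on a different reactor.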
> e. The DPDK thread would poll and find spdk_threads to work on, in its
> poller ring.
The spdk_reactor structure would have a separate ring data structure (one per
reactor) that holds spdk_thread objects.
> f. In the first iteration (in the open source code), keep a 1-1
> relationship between a spdk_thread and a reactor. That essentially means that
> the DPDK thread works on one spdk_thread. This would work as today.
> g. Individual implementations may create/destroy spdk_threads and move
> them as needed.
Implementations may also dynamically create spdk_reactors (i.e. new native
threads) at run time as needed. DPDK can handle this case - it just isn't coded
up in SPDK yet.
> h. A NVMF Poller Group would go on the polling ring in the spdk_thread.
The abstractions that spdk_thread deals with are spdk_poller, spdk_thread_fn
(which is a message), and spdk_io_channel. An NVMe-oF poll group is an unrelated
thing sitting at a much higher level of abstraction (in lib/nvmf). The NVMe-oF
poll group will not change or be used any differently as these changes are
made - it will still be mapped 1:1 to an spdk_thread. The spdk_thread will
contain a list of spdk_poller objects that it needs to process.
> i. Use TLS to set thread_id’s as appropriate. Goal is to NOT change any
> of the util api’s including _get_thread.
Using TLS is definitely the easiest way to make this happen, but do you have TLS
available in your environment?
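For readers unfamiliar with the approach, a minimal sketch of the TLS idea follows. It assumes a C11 compiler (`_Thread_local`); the names `sketch_get_thread` and `sketch_run_thread` are hypothetical and are not the SPDK util API. The point is that a lookup function with an unchanged signature can keep working if the native thread records which virtual thread it is currently servicing.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_thread { const char *name; };

/* One slot per native thread: the virtual thread it is currently running. */
static _Thread_local struct sketch_thread *g_current_thread;

/* Lookup with no arguments, so existing callers would not need to change. */
static struct sketch_thread *sketch_get_thread(void)
{
    return g_current_thread;
}

/* Run `body` in the context of virtual thread t, restoring the slot afterwards. */
static void sketch_run_thread(struct sketch_thread *t, void (*body)(void *), void *arg)
{
    struct sketch_thread *prev = g_current_thread;
    assert(t != NULL && body != NULL);
    g_current_thread = t;
    body(arg);
    g_current_thread = prev;
}

/* Example body: record which virtual thread the lookup reports. */
static void sketch_record_current(void *arg)
{
    *(struct sketch_thread **)arg = sketch_get_thread();
}
```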
Thanks,
Ben
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [SPDK] SPDK Dynamic Threading Model
@ 2018-05-30 20:26 Pai, Madhu
0 siblings, 0 replies; 10+ messages in thread
From: Pai, Madhu @ 2018-05-30 20:26 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 4524 bytes --]
Thanks for the additional context Ben. Agree on all the comments. There were a couple of points that I needed clarification on. I have marked them inline [MP].
Thanks,
Madhu
-----Original Message-----
From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
Sent: Wednesday, May 30, 2018 2:09 PM
To: spdk(a)lists.01.org
Subject: Re: [SPDK] SPDK Dynamic Threading Model
Thank you for all of your effort on this Madhu (and everyone else involved)! I'm just providing a little more detail on each point below. I have high hopes that this not only solves your specific problems, but also makes it much easier to integrate SPDK into other cooperative multitasking frameworks.
On Wed, 2018-05-30 at 17:31 +0000, Pai, Madhu wrote:
> 1. The event library is the place for customization. This is where
> applications can do their custom environment changes.
This was the intent, but it isn't clear enough in the code today in my opinion.
I think we need to make notes as we go about strategies for clarifying what is an abstraction around the user application's framework and what is an SPDK deliverable. We've done that a bit with the environment abstraction library, but the abstractions around threading models are not at all clearly delineated today. This won't affect any of the patches you make in the immediate future, but it's something to keep in mind.
> 2. The dynamic threading idea is good. But, the implementation and
> high level approach needs work. Specifically:
> a. In the open source code, the DPDK thread should stay tied to a
> single reactor.
> b. Make the “spdk_thread” the unit of abstraction (instead of the
> reactor as done in the patch today).
To summarize, the spdk_reactor becomes the abstraction for a "native" thread on the system, and spdk_thread becomes the abstraction for the "user/green/virtual/lightweight" thread mapped on top of it. In the first set of changes, I think we keep the names as they are as much as possible and minimize changes. After the initial changes are done, I think we need to revisit some of the names of these objects to make a few concepts a bit clearer.
> c. Allow the spdk_thread to be moved between reactors
The function to do the movement should be in lib/event.
> d. The spdk_thread data structure would need to be enhanced. It would
> at a minimum need a poller and an event ring.
It needs the active_pollers, timer_pollers, and events data members from struct spdk_reactor.
> e. The DPDK thread would poll and find spdk_threads to work on, in
> its poller ring.
The spdk_reactor structure would have a separate ring data structure (one per
reactor) that holds spdk_thread objects.
[MP]: Would we need a separate new ring data structure in the reactor? Could we not use the active_pollers and add a polling function that would take the spdk_thread as an argument? This polling function would then run through the spdk_threads rings (active_pollers, timer_pollers and events).
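The alternative raised in this question can be sketched as follows: the reactor keeps only its ordinary poller list, and each entry is a polling function whose argument is a virtual thread. All names here are hypothetical and the list is a plain array; this is an illustration of the idea, not the patch itself.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POLLERS 8

typedef int (*poller_fn)(void *arg);   /* returns the amount of work done */

struct sketch_poller { poller_fn fn; void *arg; };

struct sketch_poller_list {
    struct sketch_poller items[MAX_POLLERS];
    size_t count;
};

/* Both levels reuse the same list type; no extra ring is added to the reactor. */
struct sketch_thread  { struct sketch_poller_list active_pollers; };
struct sketch_reactor { struct sketch_poller_list active_pollers; };

static int sketch_register(struct sketch_poller_list *l, poller_fn fn, void *arg)
{
    assert(l != NULL && fn != NULL);
    if (l->count == MAX_POLLERS) return -1;
    l->items[l->count].fn = fn;
    l->items[l->count].arg = arg;
    l->count++;
    return 0;
}

static int sketch_run_list(struct sketch_poller_list *l)
{
    int work = 0;
    for (size_t i = 0; i < l->count; i++)
        work += l->items[i].fn(l->items[i].arg);
    return work;
}

/* Reactor-level poller: its argument is the virtual thread to service. */
static int sketch_thread_poll(void *arg)
{
    struct sketch_thread *t = arg;
    return sketch_run_list(&t->active_pollers);
}

/* Example I/O poller used only to exercise the sketch. */
static int sketch_io_poller(void *arg) { (*(int *)arg)++; return 1; }
```

Under this arrangement, moving a virtual thread amounts to unregistering its `sketch_thread_poll` entry from one reactor's list and registering it on another.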
> f. In the first iteration (in the open source code), keep a 1-1
> relationship between a spdk_thread and a reactor. That essentially
> means that the DPDK thread works on one spdk_thread. This would work as today.
> g. Individual implementations may create/destroy spdk_threads and
> move them as needed.
Implementations may also dynamically create spdk_reactors (i.e. new native
threads) at run time as needed. DPDK can handle this case - it just isn't coded up in SPDK yet.
> h. A NVMF Poller Group would go on the polling ring in the spdk_thread.
The abstractions that spdk_thread deals with are spdk_poller, spdk_thread_fn (which is a message), and spdk_io_channel. An NVMe-oF poll group is an unrelated thing sitting at a much higher level of abstraction (in lib/nvmf). The NVMe-oF poll group will not change or be used any differently as these changes are made - it will still be mapped 1:1 to an spdk_thread. The spdk_thread will contain a list of spdk_poller objects that it needs to process.
[MP]: Ok. Agreed.
> i. Use TLS to set thread_id’s as appropriate. Goal is to NOT change
> any of the util api’s including _get_thread.
Using TLS is definitely the easiest way to make this happen, but do you have TLS available in your environment?
[MP]: Yes, we have TLS available in our environment.
Thanks,
Ben
* Re: [SPDK] SPDK Dynamic Threading Model
@ 2018-06-08 1:18 Meneghini, John
0 siblings, 0 replies; 10+ messages in thread
From: Meneghini, John @ 2018-06-08 1:18 UTC (permalink / raw)
To: spdk
[-- Attachment #1: Type: text/plain, Size: 7244 bytes --]
There is a new patch available with a proposed change to the threading model at:
https://review.gerrithub.io/#/c/spdk/spdk/+/414293/
The following is from the commit message:
Infrastructure for the virtualization of spdk_thread.
This patch contains the base infrastructure for the virtualization of spdk_thread.
The design is to embed pollers/timers/event rings in the spdk_thread infrastructure. The native threads (i.e. DPDK threads) use
the reactor polling loop. When this functionality is enabled, the reactors contain the virtual thread polling functions. No other
pollers run on the reactor. This allows an app to run a virtual thread on any reactor and move it around. It also allows an app
to create new virtual threads (depending on load/ports) and "assign" it to any reactor.
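As a rough illustration of the design described above, the sketch below embeds pollers and an event ring in a virtual thread object and exposes one polling function that a reactor could run. It is not taken from the patch: every name here (struct vthread, vthread_add_poller, vthread_send_event, vthread_poll, vt_count) is hypothetical, timers are left out, and the ring is a plain single-threaded array rather than the lock-free ring a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>

#define VT_MAX_POLLERS 8
#define VT_RING_SIZE   16   /* must be a power of two for the index mask below */

typedef void (*vt_fn)(void *ctx);

struct vt_event {
    vt_fn fn;
    void *ctx;
};

/* Hypothetical stand-in for an spdk_thread that owns its pollers and events. */
struct vthread {
    vt_fn  pollers[VT_MAX_POLLERS];
    void  *poller_ctx[VT_MAX_POLLERS];
    size_t num_pollers;
    struct vt_event ring[VT_RING_SIZE];
    size_t head;    /* count of events dequeued so far */
    size_t tail;    /* count of events enqueued so far */
};

/* Register a function that runs on every polling pass. Returns -1 if full. */
int vthread_add_poller(struct vthread *t, vt_fn fn, void *ctx)
{
    if (t->num_pollers == VT_MAX_POLLERS) {
        return -1;
    }
    t->pollers[t->num_pollers] = fn;
    t->poller_ctx[t->num_pollers] = ctx;
    t->num_pollers++;
    return 0;
}

/* Queue a one-shot message for the virtual thread. Returns -1 if the ring is full. */
int vthread_send_event(struct vthread *t, vt_fn fn, void *ctx)
{
    if (t->tail - t->head == VT_RING_SIZE) {
        return -1;
    }
    t->ring[t->tail & (VT_RING_SIZE - 1)] = (struct vt_event){ fn, ctx };
    t->tail++;
    return 0;
}

/* The single polling function a reactor would run for this virtual thread. */
void vthread_poll(void *arg)
{
    struct vthread *t = arg;
    /* Drain only the events that were queued before this pass started. */
    size_t pending = t->tail - t->head;

    while (pending-- > 0) {
        struct vt_event ev = t->ring[t->head & (VT_RING_SIZE - 1)];
        t->head++;
        ev.fn(ev.ctx);
    }
    /* Then run every registered poller once. */
    for (size_t i = 0; i < t->num_pollers; i++) {
        t->pollers[i](t->poller_ctx[i]);
    }
}

/* Example callback: increments the int that ctx points to. */
void vt_count(void *ctx)
{
    ++*(int *)ctx;
}
```

Because vthread_poll takes only a pointer to the virtual thread, whichever reactor currently holds that pointer can run it, which is the property that would let an application assign a virtual thread to any reactor.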
The main changes are:
1. Create a new TLS to identify the virtual thread.
2. Additions to the spdk_thread data structure.
3. Accessor functions around the spdk_thread data structure.
4. UT code changes for the util library changes.
5. New UT suite to test the dynamic threading.
6. Introduced a flag for dynamic threading in the config file.
7. Reactor/APP changes to use new mechanism.
8. Fixed poll groups for NVMF.
9. Updated CIT for NVMf to use dynamic threading flag.
All CITs passed in the NetApp environment with and without the flag. [We don't run all the CITs that Intel runs, e.g. vhost tests are not run.]
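Item 1 in the list above (a TLS slot that identifies the virtual thread) might look roughly like the following. This is a sketch under assumptions, not the code in the patch: struct lw_thread, lw_thread_get_current, lw_reactor_run, and lw_check_self are hypothetical names, and C11 _Thread_local is assumed to be available.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical virtual thread; only a name is needed for this sketch. */
struct lw_thread {
    const char *name;
};

/* One slot per native thread: the virtual thread it is currently running. */
static _Thread_local struct lw_thread *g_current_lw_thread;

/* Accessor in the spirit of a "get thread" call: no argument, answer comes from TLS. */
struct lw_thread *lw_thread_get_current(void)
{
    return g_current_lw_thread;
}

typedef void (*lw_body_fn)(struct lw_thread *t);

/* A reactor sets the TLS slot before running a virtual thread and restores it after. */
void lw_reactor_run(struct lw_thread *t, lw_body_fn body)
{
    struct lw_thread *prev = g_current_lw_thread;

    g_current_lw_thread = t;
    body(t);
    g_current_lw_thread = prev;
}

/* Example body: records whether the TLS lookup matches the thread being run. */
int lw_seen_self;

void lw_check_self(struct lw_thread *t)
{
    lw_seen_self = (lw_thread_get_current() == t);
}
```

Keeping the lookup behind an accessor is what would allow existing callers of a "get thread" style function to stay unchanged while the reactor decides which virtual thread is current.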
Next steps:
1. Other apps need to be modified to work with this logic.
2. Remove switch once approved, thus making code flow simpler.
3. Add logic in apps that shows how threads can be moved between reactors (possibly need a thread state to allow threads to break from the polling loop).
Known Issues:
1. During startup, there exists a race condition where the threads have not completed initialization before the app tries adding events to the event ring.
2. Currently threads will run on only one reactor. The infrastructure to break a thread from a reactor has not been added. (This is part of "Next steps").
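The thread state mentioned under "Next steps" does not exist yet, so the following is only one guess at how it could work. A move request flips a state flag, and the owning reactor notices it on its next pass and hands the thread over. All names (mv_state, mv_thread, mv_reactor, mv_request_move, mv_reactor_poll_once) are hypothetical, and the handoff is simulated in a single native thread; across real reactors it would presumably be done by passing a message rather than writing the destination's field directly.

```c
#include <assert.h>
#include <stddef.h>

enum mv_state {
    MV_RUNNING,
    MV_MOVE_REQUESTED
};

struct mv_reactor;

/* Hypothetical virtual thread with a state used to leave the polling loop. */
struct mv_thread {
    enum mv_state state;
    struct mv_reactor *target;  /* destination reactor while a move is pending */
    int iterations;             /* stands in for real polling work */
};

/* Hypothetical reactor holding at most one virtual thread (the 1:1 case). */
struct mv_reactor {
    struct mv_thread *thread;
};

/* Ask for the thread to be moved; the owning reactor acts on it later. */
void mv_request_move(struct mv_thread *t, struct mv_reactor *dst)
{
    t->target = dst;
    t->state = MV_MOVE_REQUESTED;
}

/* One pass of a reactor loop. Returns 1 if it held a thread, 0 if it was idle. */
int mv_reactor_poll_once(struct mv_reactor *r)
{
    struct mv_thread *t = r->thread;

    if (t == NULL) {
        return 0;
    }
    /* Hand the thread over only if the destination is free; otherwise keep polling it here. */
    if (t->state == MV_MOVE_REQUESTED && t->target->thread == NULL) {
        r->thread = NULL;
        t->target->thread = t;
        t->target = NULL;
        t->state = MV_RUNNING;
        return 1;
    }
    t->iterations++;
    return 1;
}
```

Checking the flag only at the top of a pass means the thread is never moved while one of its pollers or events is mid-execution.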
- Madhu Pai
On 5/30/18, 4:27 PM, "SPDK on behalf of Pai, Madhu" <spdk-bounces(a)lists.01.org on behalf of Madhusudan.Pai(a)netapp.com> wrote:
Thanks for the additional context Ben. Agree on all the comments. There were a couple of points that I needed clarification on. I have marked them inline [MP].
Thanks,
Madhu
-----Original Message-----
From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
Sent: Wednesday, May 30, 2018 2:09 PM
To: spdk(a)lists.01.org
Subject: Re: [SPDK] SPDK Dynamic Threading Model
Thank you for all of your effort on this Madhu (and everyone else involved)! I'm just providing a little more detail on each point below. I have high hopes that this not only solves your specific problems, but also makes it much easier to integrate SPDK into other cooperative multitasking frameworks.
On Wed, 2018-05-30 at 17:31 +0000, Pai, Madhu wrote:
> 1. The event library is the place for customization. This is where
> applications can do their custom environment changes.
This was the intent, but it isn't clear enough in the code today in my opinion.
I think we need to make notes as we go about strategies for clarifying what is an abstraction around the user application's framework and what is an SPDK deliverable. We've done that a bit with the environment abstraction library, but the abstractions around threading models are not at all clearly delineated today. This won't affect any of the patches you make in the immediate future, but it's something to keep in mind.
> 2. The dynamic threading idea is good. But, the implementation and
> high level approach needs work. Specifically:
> a. In the open source code, the DPDK thread should stay tied to a
> single reactor.
> b. Make the “spdk_thread” the unit of abstraction (instead of the
> reactor as done in the patch today).
To summarize, the spdk_reactor becomes the abstraction for a "native" thread on the system, and spdk_thread becomes the abstraction for the "user/green/virtual/lightweight" thread mapped on top of it. In the first set of changes, I think we keep the names as they are as much as possible and minimize changes. After the initial changes are done, I think we need to revisit some of the names of these objects to make a few concepts a bit clearer.
> c. Allow the spdk_thread to be moved between reactors
The function to do the movement should be in lib/event.
> d. The spdk_thread data structure would need to be enhanced. It would
> at a minimum need a poller and an event ring.
It needs the active_pollers, timer_pollers, and events data members from struct spdk_reactor.
> e. The DPDK thread would poll and find spdk_threads to work on, in
> its poller ring.
The spdk_reactor structure would have a separate ring data structure (one per
reactor) that holds spdk_thread objects.
[MP]: Would we need a separate new ring data structure in the reactor? Could we not use the active_pollers and add a polling function that would take the spdk_thread as an argument? This polling function would then run through the spdk_thread's rings (active_pollers, timer_pollers and events).
> f. In the first iteration (in the open source code), keep a 1-1
> relationship between a spdk_thread and a reactor. That essentially
> means that the DPDK thread works on one spdk_thread. This would work as today.
> g. Individual implementations may create/destroy spdk_threads and
> move them as needed.
Implementations may also dynamically create spdk_reactors (i.e. new native
threads) at run time as needed. DPDK can handle this case - it just isn't coded up in SPDK yet.
> h. A NVMF Poller Group would go on the polling ring in the spdk_thread.
The abstractions that spdk_thread deals with are spdk_poller, spdk_thread_fn (which is a message), and spdk_io_channel. An NVMe-oF poll group is an unrelated thing sitting at a much higher level of abstraction (in lib/nvmf). The NVMe-oF poll group will not change or be used any differently as these changes are made; it will still be mapped 1:1 to an spdk_thread. The spdk_thread will contain a list of spdk_poller objects that it needs to process.
[MP]: Ok. Agreed.
> i. Use TLS to set thread_id’s as appropriate. Goal is to NOT change
> any of the util api’s including _get_thread.
Using TLS is definitely the easiest way to make this happen, but do you have TLS available in your environment?
[MP]: Yes, we have TLS available in our environment.
Thanks,
Ben
_______________________________________________
SPDK mailing list
SPDK(a)lists.01.org
https://lists.01.org/mailman/listinfo/spdk
end of thread
Thread overview: 10+ messages
2018-05-30 18:08 [SPDK] SPDK Dynamic Threading Model Walker, Benjamin
-- strict thread matches above, loose matches on Subject: below --
2018-06-08 1:18 Meneghini, John
2018-05-30 20:26 Pai, Madhu
2018-05-30 17:31 Pai, Madhu
2018-05-29 19:26 Meneghini, John
2018-05-29 16:24 Meneghini, John
2018-05-25 22:32 Pai, Madhu
2018-05-25 21:26 Walker, Benjamin
2018-05-25 20:31 Pai, Madhu
2018-05-25 19:03 Walker, Benjamin