From: Walker, Benjamin
Subject: Re: [SPDK] SPDK Dynamic Threading Model
Date: Fri, 25 May 2018 21:26:03 +0000
Message-ID: <1527283562.55770.73.camel@intel.com>
In-Reply-To: MWHPR06MB2558CED974739A13B11CA842E5690@MWHPR06MB2558.namprd06.prod.outlook.com
To: spdk@lists.01.org

On Fri, 2018-05-25 at 20:31 +0000, Pai, Madhu wrote:
> Ben,
>
> Thank you very much for the detailed analysis and mail. I agree with the
> points you are making here and the design goals for SPDK. I'll try to talk
> some more about the design for the patch and its advantages, and see if we
> can improve on it. I'll also try to answer the three valid concerns
> you have raised below.
>
> 1) Design principle:
> Breaking the 1:1 mapping between reactors and cores will give an app better
> flexibility in terms of the threading model. At a very high abstract level
> this could be looked at as being similar to a green threading framework.
> I'm no expert in green threading frameworks, but, based on my reading, the
> similarity would be:
> - The "spdk_thread" is the virtual thread.
> - The reactor is the "cache" of this virtual thread (i.e. they have a 1-1
> relationship).
> - The bare metal thread is the DPDK thread in this model.
> So in this design, the bare metal thread inherits the properties of the
> virtual thread when it is running on a reactor. The implementation-specific
> part of that was changing the _lcore value in TLS to the reactor id.

I think we'll need to iterate on what the basic primitives are (especially
their names), but I'm generally leaving that discussion for slightly later on
in the design. For now, I agree with the direction above.
I'm going to temporarily use the words "spdk_thread" for the virtual thread
and "DPDK thread" for the native thread.

>
> 2) Advantages:
> This was the first iteration of what can become a more solidified solution
> as we go along. There are gaps in this approach and I think we can mitigate
> some of those concerns.
> - It gives applications a switch. They may choose to not do this at all and
> then a virtual thread would map to a bare metal thread and stay that way.
> - The other advantage is that the core library, APIs and the design do not
> change.

I generally don't like to have switches that control behavior this
fundamental, where possible. It would be ideal to have the old behavior fall
out of the new behavior whenever you happen to create one spdk_thread per DPDK
thread.

> - I think it helps apps to seamlessly adapt to a dynamic threading model
> with their current code base.
> - It allows apps to dynamically increase the number of reactors. The
> granularity of a QP to a reactor could be decided by the app. So we could,
> in one way, solve the problem of trying to move a QP from one thread to
> another for load balancing.
> - It gives apps certain QoS capabilities. Apps can create "gold", "silver"
> and "bronze" reactor rings. The number of bare-metal threads working on each
> ring may be different depending on how quickly certain QPs have to be
> serviced.
>
> 3) Concerns:
> Looking at the three concerns you raised -
>
> 3a) Cache thrashing: Agreed. In fact, I write about this in the commit patch
> as a valid concern. But I think we can mitigate this. For applications that
> are concerned about cache thrashing, the option is to run the reactor on a
> bare metal thread for a longer period of time. In the patch I showed a crude
> way, where after 100 usec, the reactor is switched out. But that does not
> absolutely have to be the case.
> A reactor could run for a much longer period of time (10s of msec),
> allowing the benefits of CPU caching to be used. The other way to mitigate
> this is to make sure that the bare metal threads run on the same socket.
> Thus even when reactors are switched out, the cache at the socket layer is
> not invalidated.

Instead of automatically moving spdk_thread/reactor objects between DPDK
threads, what if moving an spdk_thread to a new DPDK thread was an explicit
operation performed by the application periodically?

>
> 3b) NUMA: I believe NUMA-awareness can be built in. It depends on where the
> bare metal threads run. If we have NICs and SSDs spread across the two
> sockets, a more elegant solution can probably be designed where we create
> reactor rings per socket. Then we would have the capability to add the QPs
> to the right reactor (in the right ring) based on the NIC. That is an
> extension of the current design IMO.

It's not always clear what the right NUMA node to run on actually is. That's
because an spdk_thread has a set of I/O channels (queue pairs) that talk to
different devices. Sometimes you want to be on the same NUMA node as a
particular NIC, but other times as a particular SSD. Making the movement of
spdk_threads between DPDK threads an explicit operation performed by the
application would push this decision up into the application/user code, where
it knows best.

>
> 3c) Global Reactor Ring bottleneck: The number of reactors and the number of
> threads are not high. Also, the idea here is to run on a reactor for "some
> extended period of time". Given that the number of producers/consumers on
> this ring will be limited, I don't think the reactor ring will be a
> bottleneck. Compare and contrast this global reactor ring with the event
> queue ring that exists today. We use events for all callback events from the
> backend during IO.
> We definitely do not want the reactors to be swapped in and out at the rate
> of IO, but to hold on to a reactor for a somewhat larger period of time.
> When the application-specific metrics show that these threads are doing
> more "useful work" versus "idle polling", we just add more threads.
> Eventually, at high loads, the number of threads will be the same as the
> number of reactors, and the design thus falls back to the traditional SPDK
> model.
> This would be something that is decided by the ecosystem that runs SPDK.
>
> Looking forward to discussing this more.
>
> Thanks,
> Madhu
>
>
> -----Original Message-----
> From: SPDK On Behalf Of Walker, Benjamin
> Sent: Friday, May 25, 2018 3:03 PM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] SPDK Dynamic Threading Model
>
> I've been doing my best to think this through over the last few days, as
> have a number of other community members, and some things are beginning to
> look a bit clearer now.
>
> SPDK was always intended to be a composable set of libraries as opposed to a
> framework. By that, I mean that SPDK is intended to be integrated into other
> applications as opposed to existing code being integrated into SPDK. The
> community has done a lot of work to attempt to make that happen, with
> varying degrees of success. The challenges are primarily centered on two
> things.
> First, SPDK requires special memory management operations to allocate
> DMA-safe memory.
> This stems from the strict requirement to avoid data copies. The problem
> would essentially go away if SPDK instead internally allocated DMA-safe
> memory and copied user data into those buffers, but the performance would
> take a big hit.
> Second, SPDK avoids locks by instead passing messages between threads. That
> means that many components (although not all) within SPDK assume that the
> application is using a certain threading model.
> Specifically, the threading model needs to look like cooperative
> multi-tasking, or futures and promises, or event loops, etc. So far the
> consensus seems to be that it is acceptable to assume there is some
> threading model that is conducive to message passing, but we don't want to
> specifically pick a single model or framework.
>
> The problem that John, Madhu, and the others at NetApp have identified is
> that SPDK currently makes entirely too many assumptions about and places too
> many strict requirements on the mechanics of the threading model in an
> application.
> I think there is a strong consensus that fixing this is important and should
> be high priority. The fix, ultimately, will be better abstractions around
> the underlying application's threading model. I hope we can design something
> that will enable people to plug SPDK into all sorts of frameworks - green
> threading frameworks, DPDK lthreads, Seastar, coroutine frameworks, etc. The
> more people we can get participating in this work, the better the
> abstractions will be, so please everyone chime in with requirements and
> ideas.
>
> The current set of patches breaks the 1:1 mapping between reactors and
> cores. Instead, reactors are stored on a global list. Each core iterates on
> this global list and pulls the next reactor and processes any waiting events
> and executes pollers, then places the reactor back on the list. I'm
> concerned about three things with this design:
>
> * Since the reactors now potentially execute on a different core each time
> through their loop, the CPU cache is going to be badly thrashed. I suspect
> the performance hit here is very large and continues to grow as additional
> threads are added. SPDK is designed to scale linearly with the addition of
> CPU cores as much as possible, and I think it would be a mistake to move
> away from that.
> * All NUMA-awareness has been lost.
> Placing the processing of I/O on the same NUMA node as the NIC or SSD is
> critical to achieving high performance, so the code needs to remain
> NUMA-aware.
> * All threads are polling a single queue of reactors, so the atomic
> variables controlling the head and tail of that queue are going to be highly
> contended and become more contended as the number of threads increases.
>
> I hope this is just the beginning of a larger discussion. I'll let the patch
> review settle into next week and see if solutions begin to emerge.
>
> Thanks,
> Ben
>
> On Thu, 2018-05-24 at 02:24 +0000, Meneghini, John wrote:
> > Hi Frank.
> >
> > Thanks for your suggestion.
> >
> > In our implementation/application, we don’t use DPDK. This is why the
> > first set of changes we proposed last year was to abstract out the
> > dependencies on DPDK. I think I still have a copy of the old pull request
> > around for reference.
> >
> > https://github.com/spdk/spdk/pull/152
> >
> > We are actually running SPDK in a completely different execution
> > environment, and we need a “native” SPDK dynamic threading model that
> > can be supported on any platform, without DPDK.
> >
> > A second RFC patch has been pushed up to GerritHub for review.
> > Please see the commit message of these two patches for a complete
> > description of the proposed change.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412277/
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > /John
> >
> > 40.5. The L-thread subsystem
> > The L-thread subsystem resides in the
> > examples/performance-thread/common
> > directory and is built and linked automatically when building the
> > l3fwd-thread example.
> >
> > The subsystem provides a simple cooperative scheduler to enable
> > arbitrary functions to run as cooperative threads within a single EAL
> > thread.
> > The subsystem provides a pthread-like API that is intended to
> > assist in reuse of legacy code written for POSIX pthreads.
> >
> > The following sections provide some detail on the features,
> > constraints, performance and porting considerations when using L-threads.
> >
> >
> >
> > From: SPDK on behalf of Huang Frank
> > Reply-To: Storage Performance Development Kit
> > Date: Wednesday, May 23, 2018 at 9:46 PM
> > To: Storage Performance Development Kit
> > Subject: [SPDK] Re: SPDK Dynamic Threading Model
> >
> > Hi,
> >
> > Why not consider using the lthread subsystem provided by DPDK?
> > http://dpdk.org/doc/guides-16.04/sample_app_ug/performance_thread.html#lthread-subsystem
> >
> >
> >
> > Frank Huang
> >
> >
> > From: SPDK on behalf of Meneghini, John
> > Sent: May 23, 2018 4:12
> > To: Storage Performance Development Kit
> > Subject: [SPDK] RFC: SPDK Dynamic Threading Model
> >
> > As discussed during the Summit last week, we believe SPDK needs
> > support for a dynamic threading model. An RFC patch has been pushed
> > upstream for review.
> >
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >
> > This patch is a beginning point for our proposed changes. Improvements
> > will be made with subsequent patches.
> >
> > The description below is taken from
> > https://github.com/spdk/spdk/issues/308
> > SPDK needs to support a dynamic threading model where reactors are NOT
> > bound to lcores.
> > Many applications need SPDK to support a threading model that:
> > - Does not assume a static number of threads
> > - Does not bind threads to cores (this burns up cores)
> > - Does not assume all threads use the same polling model
> > Removing these assumptions from the SPDK libraries will allow:
> > - Different applications to share the SPDK libraries on the same
> >   platform, e.g. FC-NVMe, RDMA-NVMe, and NVMe
> > - Different platforms to support the same applications with the same
> >   libraries, e.g. a 4 core platform and a 128 core platform, a PowerPC
> >   and NFS traffic
> > - Different workloads at different scales, e.g. 1 NVMF Host with 1
> >   Subsystem and 1 Namespace, or 16 NVMF Hosts with 100 Subsystems and
> >   1,000 Namespaces
> > In particular, in SPDK, NVMF threads need to come and go depending
> > upon the “NVMF load”.
> > More Dynamic Use Cases Coming
> > With the advent of FC-NVMe (which uses NPIV to virtualize FC ports),
> > NVMF Subsystem Ports and Host Ports are not static. Different Hosts
> > and Subsystems can have a different number of Ports, and Ports can be
> > dynamically added and removed from the configuration. This means:
> > - The same platform may end up having a different number of Subsystem
> >   ports at various points in its lifecycle
> > - The SPDK FC-NVMe application does NOT know up front how many ports
> >   it will have.
> > Expected Behavior
> > - SPDK libraries should not assume a static number of threads
> > - SPDK libraries should bind threads to cores only optionally,
> >   supporting both static and dynamic threading models
> > - SPDK libraries should support a hybrid polling model (modified run to
> >   completion)
> > Current Behavior
> > - SPDK libraries assume a static number of threads
> > - SPDK libraries bind threads to cores
> > - SPDK libraries assume all threads use the same polling model
> > Possible Solution
> > Proposal to solve above Use Cases:
> > - Use the spdk_nvmf_poll_group (PG) as the unit of threading abstraction
> > - Use PG as the fundamental unit on which a thread operates
> > - The spdk_thread will be a “virtual” thread that gets tied into a PG
> >   (1-1 relationship)
> > - Create PGs as and when hardware ports (and associated queue-pairs)
> >   come to life.
> > - No dependency between a PG and a “real” thread.
> > - A PG can be picked up by any “real” thread and worked upon. The PG
> >   contains everything needed for IO handling.
> > - PG continues to contain spdk_thread. spdk_thread continues to use the
> >   same mechanisms for IO channels to different NS etc.
> > - PG contains vendor data, e.g. a “ring” for depositing asynchronous
> >   callback events from the backend OR management events that come from
> >   external modules.
> > - spdk_thread contains thread_context that points to a PG instead of a
> >   reactor.
> > - So messages from the library get routed to the PG “ring” instead of a
> >   thread/reactor event ring.
> > spdk_bdev_get_io
> > Understanding the intent of
> > the event library, it is believed this is the place for customization.
> > However, the current event library assumes a threading model that's a
> > part of the util library. Moreover, many of the other SPDK core
> > libraries assume the same threading model as the util library.
> > If the
> > SPDK util library can be modified to support these dynamic
> > threading use cases, all applications would be able to use the SPDK
> > framework more effectively.
> > Steps to Reproduce
> > This is an enhancement. There is no bug.
> > Context (Environment including OS version, SPDK version, etc.)
> > Would like to provide these enhancements in V18.07.
> >
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
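
For readers following the thread, here is a minimal sketch of the
"explicit move" idea raised above: a virtual thread (standing in for an
spdk_thread) owns a message queue and pollers, a native thread (standing in
for a DPDK thread) polls whatever virtual threads it currently holds, and the
application alone decides when a virtual thread changes hands. Every name
here (VirtualThread, NativeThread, move, send_msg, run_once) is hypothetical
and chosen for illustration; none of it is SPDK or DPDK API, and the model is
driven from a single OS thread, so it says nothing about the synchronization
a real implementation would need.

```python
from collections import deque


class VirtualThread:
    """Hypothetical stand-in for an spdk_thread: messages plus pollers."""

    def __init__(self, name):
        self.name = name
        self.messages = deque()  # callables queued by other threads
        self.pollers = []        # callables run once per iteration

    def send_msg(self, fn):
        """Message passing instead of locks: queue work for this thread."""
        self.messages.append(fn)

    def poll(self):
        """Drain queued messages, then run each poller once.

        Returns the amount of work done so a caller can tell useful work
        from idle polling.
        """
        work = 0
        while self.messages:
            self.messages.popleft()()
            work += 1
        for poller in self.pollers:
            work += poller()
        return work


class NativeThread:
    """Hypothetical stand-in for a DPDK thread pinned to one core."""

    def __init__(self, core, numa_node):
        self.core = core
        self.numa_node = numa_node
        self.virtual_threads = []

    def run_once(self):
        """One pass over every virtual thread currently held here."""
        return sum(vt.poll() for vt in self.virtual_threads)


def move(vt, src, dst):
    """Explicit, application-driven migration between native threads."""
    src.virtual_threads.remove(vt)
    dst.virtual_threads.append(vt)
```

With exactly one VirtualThread per NativeThread and no calls to move(), this
degenerates to the static one-thread-per-core arrangement, which is the "old
behavior falls out of the new behavior" property asked for earlier in the
thread. An application that knows which NIC or SSD matters most for a given
virtual thread could call move() with a destination on the matching NUMA
node.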