Re: [SPDK] SPDK Dynamic Threading Model

From: Walker, Benjamin <benjamin.walker at intel.com>
To: spdk@lists.01.org
Subject: Re: [SPDK] SPDK Dynamic Threading Model
Date: Fri, 25 May 2018 21:26:03 +0000
Message-ID: <1527283562.55770.73.camel@intel.com>
In-Reply-To: MWHPR06MB2558CED974739A13B11CA842E5690@MWHPR06MB2558.namprd06.prod.outlook.com

On Fri, 2018-05-25 at 20:31 +0000, Pai, Madhu wrote:
> Ben,
> 
> Thank you very much for the detailed analysis and mail. I agree with the
> points you are making here and the design goals for SPDK.  I'll try to talk
> some more about the design for the patch, the advantages, and see if we can
> improve on it. I'll also try to answer the three valid concerns
> you have raised below.
> 
> 1) Design principle:
> Breaking the 1:1 mapping between reactors and cores will give an app better
> flexibility in terms of the threading model. At a high level of abstraction,
> this could be viewed as similar to a green threading framework. I'm no expert
> in green threading, but based on my reading, the similarities would be:
> - The "spdk_thread" is the virtual thread. 
> - The reactor is the "cache" of this virtual thread (i.e. they have a 1-1
> relationship). 
> - The bare metal thread is the DPDK thread in this model.
> So in this design, the bare metal thread inherits the properties of the virtual
> thread when it is running on a reactor. The implementation-specific part of
> that was changing the TLS _lcore value to the reactor id.

I think we'll need to iterate on what the basic primitives are (especially their
names), but I'm generally leaving that discussion for slightly later on in the
design. For now, I agree with the direction above. I'm going to temporarily use
the words "spdk_thread" for the virtual thread and "DPDK thread" for the native
thread.

> 
> 2) Advantages:
> This was the first iteration of what can become a more solidified solution as
> we go along. There are gaps in this approach and I think we can mitigate some
> of those concerns. 
> - It gives applications a switch. They may choose to not do this at all and
> then a virtual thread would map to a bare metal thread and stay that way. 
> - The other advantage is that the core library, APIs and the design do not
> change. 

Where possible, I prefer to avoid switches that control behavior this
fundamental. It would be ideal to have the old behavior fall out of the new
behavior whenever you happen to create one spdk_thread per DPDK thread.
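
To make that concrete, here is a toy model in Python (not SPDK code; the names
NativeThread, VirtualThread, build and poll_once are invented for this sketch).
Each native thread polls whatever virtual threads it currently owns, so the 1:1
case needs no switch: it is just the configuration where every list has length
one.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VirtualThread:
    """Hypothetical stand-in for an spdk_thread: a name plus its pollers."""
    name: str
    pollers: List[Callable[[], None]] = field(default_factory=list)

    def poll(self) -> None:
        for poller in self.pollers:
            poller()


@dataclass
class NativeThread:
    """Hypothetical stand-in for a DPDK thread pinned to one core."""
    core: int
    owned: List[VirtualThread] = field(default_factory=list)

    def poll_once(self) -> None:
        # One pass of the run loop: service every virtual thread we own.
        for vt in self.owned:
            vt.poll()


def build(num_native: int, num_virtual: int, log: List[str]) -> List[NativeThread]:
    """Distribute virtual threads round-robin over native threads."""
    natives = [NativeThread(core=i) for i in range(num_native)]
    for i in range(num_virtual):
        native = natives[i % num_native]
        name = f"vt{i}"
        # Bind name and core as defaults so each poller records where it ran.
        poller = lambda n=name, c=native.core: log.append(f"{n}@core{c}")
        native.owned.append(VirtualThread(name, [poller]))
    return natives
```

Calling build(2, 2, log) gives the traditional one-per-core layout, while
build(2, 4, log) puts two virtual threads on each native thread. The run loop
is identical in both cases.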

> - I think it helps apps to seamlessly adapt to a dynamic threading model with
> their current code base.
> - It allows apps to dynamically increase the number of reactors. The
> granularity of a QP to a reactor could be decided by the app. So this could be
> one way to solve the problem of trying to move a QP from one thread to another
> for load balancing.
> - It gives apps certain QoS capabilities. Apps can create "gold", "silver" and
> "bronze" reactor rings. The number of bare-metal threads working on each ring
> may be different depending on how quickly certain QPs have to be serviced.
> 
> 3) Concerns:
> Looking at the three concerns you raised - 
> 
> 3a) Cache thrashing: Agreed. In fact, I write about this in the commit patch
> as a valid concern. But, I think we can mitigate this. For applications that
> are concerned about cache thrashing - the option is to run the reactor on a
> bare metal thread for a longer period of time. In the patch I showed a crude
> way, where after 100 usec, the reactor is switched out. But, that does not
> absolutely have to be the case. A reactor could run for a much longer period
> of time (10s of msec) allowing the benefits of the CPU caching to be used. The
> other way to mitigate this is to make sure that the bare metal threads run on
> the same socket. Thus even when reactors are switched out, the cache at the
> socket layer is not invalidated.

Instead of automatically moving spdk_thread/reactor objects between DPDK
threads, what if moving an spdk_thread to a new DPDK thread was an explicit
operation performed by the application periodically?
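
A minimal sketch of what I mean, again as a toy Python model rather than SPDK
code (NativeThread, VirtualThread, request_move and poll_once are hypothetical
names, and everything runs in one OS thread for simplicity). The application
asks for the move, and the source thread applies it between poll iterations so
the virtual thread is never handed off mid-poll.

```python
from collections import deque
from typing import Deque, List, Tuple


class VirtualThread:
    """Hypothetical stand-in for an spdk_thread; records which core polled it."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.history: List[int] = []


class NativeThread:
    """Hypothetical stand-in for a DPDK thread pinned to one core."""

    def __init__(self, core: int) -> None:
        self.core = core
        self.owned: List[VirtualThread] = []
        # Moves requested by the application, applied between iterations.
        self.pending: Deque[Tuple[VirtualThread, "NativeThread"]] = deque()

    def request_move(self, vt: VirtualThread, dst: "NativeThread") -> None:
        """Application-facing call: ask this thread to hand vt over to dst."""
        self.pending.append((vt, dst))

    def poll_once(self) -> None:
        # Apply hand-offs first, so a virtual thread never moves mid-poll.
        while self.pending:
            vt, dst = self.pending.popleft()
            if vt in self.owned:
                self.owned.remove(vt)
                dst.owned.append(vt)
        for vt in self.owned:
            vt.history.append(self.core)
```

If a virtual thread starts on core 0 and the application calls request_move()
toward the thread on core 1, the next poll_once() on core 0 hands it off and
later polls happen on core 1. Nothing moves unless the application asks.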

> 
> 3b) NUMA: I believe NUMA-awareness can be built in. It depends on where the
> bare metal threads run. If we have NICs and SSDs spread across the two
> sockets, a more elegant solution can probably be designed where we create
> reactor rings per socket. Then we would have the capability to add the QPs to
> the right reactor (in the right ring) based on the NIC. That is an extension
> of the current design IMO.

It's not always clear what the right NUMA node to run on actually is. That's
because an spdk_thread has a set of I/O channels (queue pairs) that talk to
different devices. Sometimes you want to be on the same NUMA node as a
particular NIC, but other times as a particular SSD. Making the movement of
spdk_threads between DPDK threads an explicit operation performed by the
application would push this decision up into the application/user code, where it
knows best.
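
As an illustration of the kind of policy an application might apply, here is a
small Python sketch. It is purely hypothetical: Channel, recent_ios and
preferred_node are names made up for this example, and weighting by recent I/O
count is just one possible heuristic, not something SPDK provides.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Channel(NamedTuple):
    """Hypothetical view of one I/O channel owned by an spdk_thread."""
    device: str
    numa_node: int
    recent_ios: int  # assumed application-maintained traffic counter


def preferred_node(channels: Iterable[Channel]) -> Optional[int]:
    """Pick the NUMA node whose devices carried the most recent traffic."""
    weight: Counter = Counter()
    for ch in channels:
        weight[ch.numa_node] += ch.recent_ios
    if not weight:
        return None  # no channels, so no preference
    # Highest traffic wins; ties go to the lower node id for determinism.
    return min(weight, key=lambda node: (-weight[node], node))
```

With a busy NIC on node 0 and lightly used SSDs on node 1 this picks node 0.
If the SSD traffic dominates, it picks node 1. The application can then issue
an explicit move to a DPDK thread on that node.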

> 
> 3c) Global Reactor Ring bottleneck: The number of reactors and the number of
> threads are not high. Also, the idea here is to run on a reactor for "some
> extended period of time". Given that the number of producer/consumers from
> this ring will be limited, I don't think the reactor ring will be a
> bottleneck. Compare and contrast this global reactor ring with the event queue
> ring that exists today. We use events for all callback events from the backend
> during IO. We definitely do not want the reactors to be swapped in and out at
> the rate of IO, but to hold on to a reactor for a somewhat larger period of
> time. When the application specific metrics show that these threads are doing
> more "useful work" versus "idle polling", we just add more threads. Eventually
> at high loads the number of threads will be the same as the number of reactors
> and the design thus falls back to the traditional SPDK model. 
> This would be something that is decided by the ecosystem that runs SPDK.
> 
> Looking forward to discussing this more.
> 
> Thanks,
> Madhu
> 
> 
> -----Original Message-----
> From: SPDK <spdk-bounces(a)lists.01.org> On Behalf Of Walker, Benjamin
> Sent: Friday, May 25, 2018 3:03 PM
> To: spdk(a)lists.01.org
> Subject: Re: [SPDK] SPDK Dynamic Threading Model
> 
> I've been doing my best to think this through over the last few days, as have
> a number of other community members, and some things are beginning to look a
> bit clearer now.
> 
> SPDK was always intended to be a composable set of libraries as opposed to a
> framework. By that, I mean that SPDK is intended to be integrated into other
> applications as opposed to existing code being integrated into SPDK. The
> community has done a lot of work to attempt to make that happen, with varying
> degrees of success. The challenges are primarily centered on two things.
> First, SPDK requires special memory management operations to allocate DMA-safe
> memory. This stems from the strict requirement to avoid data copies. The
> problem would essentially go away if SPDK instead internally allocated
> DMA-safe memory and copied user data into those buffers, but the performance
> would take a big hit.
> Second, SPDK avoids locks by instead passing messages between threads. That
> means that many components (although not all) within SPDK imply that the
> application is using a certain threading model. Specifically, the threading
> model needs to look like cooperative multi-tasking, or futures and promises,
> or event loops, etc. So far the consensus seems to be that it is acceptable to
> assume there is some threading model that is conducive to message passing, but
> we don't want to specifically pick a single model or framework.
> 
> The problem that John, Madhu, and the others at NetApp have identified is that
> SPDK currently makes entirely too many assumptions about and places too many
> strict requirements on the mechanics of the threading model in an application.
> I think there is a strong consensus that fixing this is important and should
> be high priority. The fix, ultimately, will be better abstractions around the
> underlying application's threading model. I hope we can design something that
> will enable people to plug SPDK into all sorts of frameworks - green threading
> frameworks, DPDK lthreads, Seastar, coroutine frameworks, etc. The more people
> we can get participating in this work, the better the abstractions will be, so
> please everyone chime in with requirements and ideas.
> 
> The current set of patches breaks the 1:1 mapping between reactors and cores.
> Instead, reactors are stored on a global list. Each core iterates on this
> global list and pulls the next reactor and processes any waiting events and
> executes pollers, then places the reactor back on the list. I'm concerned
> about three things with this design:
> 
> * Since the reactors now potentially execute on a different core each time
> through their loop, the CPU cache is going to be badly thrashed. I suspect the
> performance hit here is very large and continues to grow as additional threads
> are added. SPDK is designed to scale linearly with the addition of CPU cores
> as much as possible, and I think it would be a mistake to move away from
> that. 
> * All NUMA-awareness has been lost. Placing the processing of I/O on the same
> NUMA node as the NIC or SSD is critical to achieving high performance, so the
> code needs to remain NUMA-aware.
> * All threads are polling a single queue of reactors, so the atomic variables
> controlling the head and tail of that queue are going to be highly contended
> and become more contended as the number of threads increases.
> 
> I hope this is just the beginning of a larger discussion. I'll let the patch
> review settle into next week and see if solutions begin to emerge.
> 
> Thanks,
> Ben
> 
> On Thu, 2018-05-24 at 02:24 +0000, Meneghini, John wrote:
> > Hi Frank.
> >  
> > Thanks for your suggestion.
> >  
> > In our implementation/application, we don’t use DPDK.  This is why the 
> > first set of changes we proposed last year was to abstract out the 
> > dependencies on DPDK. I think I still have a copy of the old pull request
> > around for reference.
> >  
> > https://github.com/spdk/spdk/pull/152
> >  
> > We are actually running SPDK in a completely different execution 
> > environment, and we need a “native” SPDK dynamic threading model that 
> > can be supported on any platform, without DPDK.
> >  
> > A second RFC patch has been pushed up to GerritHub for review.  
> > Please see the commit messages of these two patches for a complete 
> > description of the proposed change.
> >  
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412277/
> >  
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >  
> > /John
> >  
> > 40.5. The L-thread subsystem
> > The L-thread subsystem resides in the 
> > examples/performance-thread/common
> > directory and is built and linked automatically when building the 
> > l3fwd-thread example.
> > 
> > The subsystem provides a simple cooperative scheduler to enable 
> > arbitrary functions to run as cooperative threads within a single EAL 
> > thread. The subsystem provides a pthread like API that is intended to 
> > assist in reuse of legacy code written for POSIX pthreads.
> > 
> > The following sections provide some detail on the features, 
> > constraints, performance and porting considerations when using L-threads.
> > 
> >  
> >  
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Huang Frank 
> > <kinzent(a)hotmail.com>
> > Reply-To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Date: Wednesday, May 23, 2018 at 9:46 PM
> > To: Storage Performance Development Kit <spdk(a)lists.01.org>
> > Subject: [SPDK] Re: SPDK Dynamic Threading Model
> >  
> > Hi,
> >  
> > Why not consider using the lthread subsystem provided by DPDK?
> > http://dpdk.org/doc/guides-16.04/sample_app_ug/performance_thread.html#lthread-subsystem
> >  
> >  
> > 
> > Frank Huang
> > 
> >  
> > From: SPDK <spdk-bounces(a)lists.01.org> on behalf of Meneghini, John 
> > <John.Meneghini(a)netapp.com>
> > Sent: May 23, 2018 4:12
> > To: Storage Performance Development Kit
> > Subject: [SPDK] RFC: SPDK Dynamic Threading Model
> >  
> > As discussed during the Summit last week, we believe SPDK needs 
> > support for a dynamic threading model.  An RFC patch has been pushed
> > upstream for review.
> >  
> > https://review.gerrithub.io/#/c/spdk/spdk/+/412093/
> >  
> > This patch is a beginning point for our proposed changes. Improvements 
> > will be made with subsequent patches.
> >  
> > The description below is taken from 
> > https://github.com/spdk/spdk/issues/308
> > SPDK needs to support a dynamic threading model where reactors are NOT 
> > bound to lcores.
> > Many applications need SPDK to support a threading model that:
> > - Does not assume a static number of threads
> > - Does not bind threads to cores (this burns up cores)
> > - Does not assume all threads use the same polling model
> > Removing these assumptions from the SPDK libraries will allow:
> > - Different applications to share the SPDK libraries on the same platform
> >   E.g. FC-NVMe, RDMA-NVMe, and NVMe
> > - Different platforms to support the same applications with the same
> >   libraries
> >   E.g. a 4 core platform and a 128 core platform, a PowerPC and NFS traffic
> > - Different workloads at different scales
> >   E.g. 1 NVMF Host with 1 Subsystem and 1 Namespace, or 16 NVMF Hosts with
> >   100 Subsystems and 1,000 namespaces.
> > In particular, in SPDK, NVMF threads need to come and go depending 
> > upon the “NVMF load”.
> > More Dynamic Use Cases Coming
> > With the advent of FC-NVMe (which uses NPIV to virtualize FC ports) 
> > NVMF Subsystem Ports and Host Ports are not static. Different Hosts 
> > and Subsystems can have a different number of Ports, and Ports can be 
> > dynamically added and removed from the configuration. This means:
> > - The same platform may end up having a different number of Subsystem
> >   ports at various points in its lifecycle
> > - The SPDK FC-NVMe application does NOT know up front how many ports it
> >   will have.
> > Expected Behavior
> > - SPDK libraries should not assume a static number of threads
> > - SPDK libraries should bind threads to cores only optionally, supporting
> >   both static and dynamic threading models
> > - SPDK libraries should support a Hybrid polling model (modified run to
> >   completion)
> > Current Behavior
> > - SPDK libraries assume a static number of threads
> > - SPDK libraries bind threads to cores
> > - SPDK libraries assume all threads use the same polling model
> > Possible Solution
> > Proposal to solve above Use Cases:
> > - Use the spdk_nvmf_poll_group (PG) as the unit of threading abstraction
> > - Use PG as the fundamental unit on which a thread operates
> > - The spdk_thread will be a “virtual” thread that gets tied into a PG (1-1
> >   relationship)
> > - Create PGs as and when hardware ports (and associated queue-pairs)
> >   come to life.
> > - No dependency between a PG and a “real” thread.
> > - A PG can be picked up by any “real” thread and worked upon. The PG
> >   contains everything needed for IO handling.
> > - PG continues to contain spdk_thread. spdk_thread continues to use the
> >   same mechanisms for IO channels to different NS etc.
> > - PG contains vendor data. E.g. a “ring” for depositing asynchronous
> >   callback events from the backend OR management events that come from
> >   external modules.
> > - spdk_thread contains thread_context that points to a PG instead of a
> >   reactor.
> > - So messages from the library get routed to the PG “ring” instead of a
> >   thread/reactor event ring.
> > spdk_bdev_get_io
> > Understanding the intent of the event library, it is believed this is the
> > place for customization. However, the current event library assumes a
> > threading model that's a part of the util library. Moreover, many of the
> > other SPDK core libraries assume the same threading model as the util
> > library. If the SPDK util library can be modified to support these dynamic
> > threading use cases, all applications would be able to use the SPDK
> > framework more effectively.
> > Steps to Reproduce
> > This is an enhancement. There is no bug.
> > Context (Environment including OS version, SPDK version, etc.)
> > Would like to provide these enhancements in V18.07.
> >  
> >  
> >  
> >  
> > _______________________________________________
> > SPDK mailing list
> > SPDK(a)lists.01.org
> > https://lists.01.org/mailman/listinfo/spdk
> 