From mboxrd@z Thu Jan 1 00:00:00 1970
From: Orit Wasserman
Subject: Re: newstore direction
Date: Thu, 22 Oct 2015 10:51:35 +0200
Message-ID: <1445503895.9019.11.camel@redhat.com>
References: <56268886.7010806@redhat.com>
 <7334B4281E425749B85E08CF7EC6F8534383DD15@SACMBXIP03.sdcorp.global.sandisk.com>
 <562775DD.8050304@redhat.com>
 <56279DD3.4020901@redhat.com>
 <5627B471.1040602@redhat.com>
 <5627E96E.6090706@redhat.com>
 <1445462425.24939.21.camel@millnert.se>
 <7334B4281E425749B85E08CF7EC6F8534383E574@SACMBXIP03.sdcorp.global.sandisk.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:41553 "EHLO mx1.redhat.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1753832AbbJVIvk (ORCPT ); Thu, 22 Oct 2015 04:51:40 -0400
In-Reply-To: <7334B4281E425749B85E08CF7EC6F8534383E574@SACMBXIP03.sdcorp.global.sandisk.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Allen Samuels
Cc: Martin Millnert, Mark Nelson, Ric Wheeler, Sage Weil, "ceph-devel@vger.kernel.org"

On Thu, 2015-10-22 at 02:12 +0000, Allen Samuels wrote:
> One of the biggest changes that flash is making in the storage world is the way basic trade-offs in storage management software architecture are being affected. In the HDD world CPU time per IOP was relatively inconsequential, i.e., it had little effect on overall performance, which was limited by the physics of the hard drive. Flash is now inverting that situation. When you look at the performance levels being delivered in the latest generation of NVMe SSDs you rapidly see that the storage itself is generally no longer the bottleneck (speaking about BW, not latency of course) but rather it's the system sitting in front of the storage that is the bottleneck. Generally it's the CPU cost of an IOP.
>
> When SanDisk first started working with Ceph (Dumpling), the design of librados and the OSD led to a situation in which the CPU cost of an IOP was dominated by context switches and network socket handling. Over time, much of that has been addressed. The socket handling code has been rewritten (more than once!), and some of the internal queueing in the OSD (and the associated context switches) has been eliminated. As the CPU costs have dropped, performance on flash has improved accordingly.
>
> Because we didn't want to completely rewrite the OSD (time-to-market and stability drove that decision), we didn't move it from the current "thread per IOP" model into a truly asynchronous "thread per CPU core" model that essentially eliminates context switches in the IO path. But a fully optimized OSD would go down that path (at least part-way). I believe it's been proposed in the past. Perhaps a hybrid "fast-path" style could get most of the benefits while preserving much of the legacy code.
>

+1
It's not just about reducing context switches but also about removing
contention and data copies and getting better cache utilization.

ScyllaDB just did this to Cassandra (using the Seastar library):
http://www.zdnet.com/article/kvm-creators-open-source-fast-cassandra-drop-in-replacement-scylla/

Orit

> I believe this trend toward thread-per-core software development will also tend to support the "do it in user-space" trend. That's because most of the kernel and file-system interface is architected around the blocking "thread-per-IOP" model and is unlikely to change in the future.
>
>
> Allen Samuels
> Software Architect, Fellow, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@SanDisk.com
>
> -----Original Message-----
> From: Martin Millnert [mailto:martin@millnert.se]
> Sent: Thursday, October 22, 2015 6:20 AM
> To: Mark Nelson
> Cc: Ric Wheeler ; Allen Samuels ; Sage Weil ; ceph-devel@vger.kernel.org
> Subject: Re: newstore direction
>
> Adding 2c
>
> On Wed, 2015-10-21 at 14:37 -0500, Mark Nelson wrote:
> > My thought is that there is some inflection point where the userland
> > kvstore/block approach is going to be less work, for everyone I think,
> > than trying to quickly discover, understand, fix, and push upstream
> > patches that sometimes only really benefit us.  I don't know if we've
> > truly hit that point, but it's tough for me to find flaws with
> > Sage's argument.
>
> Regarding the userland / kernel land aspect of the topic, there are further aspects AFAIK not yet addressed in the thread:
> In the networking world, there's been development on memory-mapped (multiple approaches exist) userland networking, which for packet management has the benefit of - for very, very specific applications of networking code - avoiding e.g. per-packet context switches etc, and streamlining processor cache management performance. People have gone as far as removing CPU cores from the CPU scheduler to completely dedicate them to the networking task at hand (cache optimizations). There are various latency/throughput (bulking) optimizations applicable, but at the end of the day, it's about keeping the CPU bus busy with "revenue" bus traffic.
>
> Granted, storage IO operations may be much too heavy in cycle counts for context switches to ever appear as a problem in themselves, certainly for slower SSDs and HDDs.
> However, when going for truly high performance IO, *every* hurdle in the data path counts toward the total latency.
> (And really, high performance random IO characteristics approach the networking, per-packet handling characteristics.) Now, I'm not really suggesting memory-mapping a storage device to user space, not at all, but having better control over the data path for a very specific use case reduces dependency on the code that works as well as possible for the general case, and allows for very purpose-built code to address a narrow set of requirements. ("Ceph storage cluster backend" isn't a typical FS use case.) It also decouples users from dependencies, i.e. waiting for the next distro release before being able to take up the benefits of improvements to the storage code.
>
> A random google came up with related data on where "doing something way different" /can/ have significant benefits:
> http://phunq.net/pipermail/tux3/2015-April/002147.html
>
> I (FWIW) certainly agree there is merit to the idea.
> The scientific approach here could perhaps be to simply enumerate all corner cases of "generic FS" that actually are cause for the experienced issues, and assess the probability of them being solved (and if so when).
> That *could* improve chances of approaching consensus, which wouldn't hurt I suppose?
>
> BR,
> Martin
>
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited.
> If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
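
A minimal sketch of the "thread per CPU core" model discussed in the quoted
text above, assuming a toy in-memory key/value store. The names Shard and
ShardedStore are hypothetical; they are not Ceph, Seastar or ScyllaDB APIs.
Each worker thread exclusively owns one slice of the key space, so the data
itself needs no locks, and requests reach a worker only through its inbox
queue. Python threads will not show the performance benefit (global
interpreter lock, no core pinning); only the ownership structure is
illustrated.

```python
import os
import queue
import threading
import zlib


class Shard(threading.Thread):
    """One worker that exclusively owns a slice of the key space."""

    def __init__(self, shard_id):
        super().__init__(daemon=True)
        self.shard_id = shard_id
        self.inbox = queue.Queue()  # the only cross-thread touch point
        self.store = {}             # private to this thread: no locks needed

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:        # shutdown sentinel
                break
            op, key, value, reply = item
            if op == "put":
                self.store[key] = value
                reply.put(True)
            else:                   # "get"
                reply.put(self.store.get(key))


class ShardedStore:
    """Routes every request to the shard that owns the key."""

    def __init__(self, num_shards=None):
        # Assumption: one shard per CPU core unless told otherwise.
        n = num_shards or os.cpu_count() or 1
        self.shards = [Shard(i) for i in range(n)]
        for shard in self.shards:
            shard.start()

    def _owner(self, key):
        # Stable hash so a key always maps to the same shard.
        return self.shards[zlib.crc32(key.encode()) % len(self.shards)]

    def _call(self, op, key, value=None):
        # The caller blocks on the reply here to keep the sketch short;
        # a truly asynchronous design would return a future instead.
        reply = queue.Queue(maxsize=1)
        self._owner(key).inbox.put((op, key, value, reply))
        return reply.get()

    def put(self, key, value):
        return self._call("put", key, value)

    def get(self, key):
        return self._call("get", key)

    def close(self):
        for shard in self.shards:
            shard.inbox.put(None)
        for shard in self.shards:
            shard.join()
```

Usage would look like store = ShardedStore(); store.put("obj.1", b"data");
store.get("obj.1"); store.close(). Because a key is only ever touched by its
owning thread, there is no contention on the store itself, which is the
property the thread-per-core argument relies on.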