From: Bruce Richardson
Subject: Re: A question about hugepage initialization time
Date: Fri, 12 Dec 2014 09:59:40 +0000
Message-ID: <20141212095940.GA2100@bricha3-MOBL3>
References: <20141209141032.5fa2db0d@urahara>
 <20141210103225.GA10056@bricha3-MOBL3>
 <20141210142926.GA17040@localhost.localdomain>
 <20141210143558.GB1632@bricha3-MOBL3>
 <20141211101449.GB5668@bricha3-MOBL3>
To: László Vadkerti
Cc: "dev-VfR2kkLFssw@public.gmane.org"
List-Id: patches and discussions about DPDK

On Fri, Dec 12, 2014 at 04:07:40AM +0000, László Vadkerti wrote:
> On Thu, 11 Dec 2014, Bruce Richardson wrote:
> > On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> > >
> > > On Wed, 10 Dec 2014, Bruce Richardson wrote:
> > >
> > > > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> > > >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> > > >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> > > >>>> On Tue, 9 Dec 2014 11:45:07 -0800 &rew wrote:
> > > >>>>
> > > >>>>>> Hey Folks,
> > > >>>>>>
> > > >>>>>> Our DPDK application deals with very large in-memory data
> > > >>>>>> structures, and can potentially use tens or even hundreds of
> > > >>>>>> gigabytes of hugepage memory. During the course of development,
> > > >>>>>> we've noticed that as the number of huge pages increases, the
> > > >>>>>> memory initialization time during EAL init gets to be quite
> > > >>>>>> long, lasting several minutes at present. The growth in init
> > > >>>>>> time doesn't appear to be linear, which is concerning.
> > > >>>>>>
> > > >>>>>> This is a minor inconvenience for us and our customers, as
> > > >>>>>> memory initialization makes our boot times a lot longer than
> > > >>>>>> they would otherwise be. Also, my experience has been that
> > > >>>>>> really long operations are often hiding errors - what you think
> > > >>>>>> is merely a slow operation is actually a timeout of some sort,
> > > >>>>>> often due to misconfiguration. This leads to two questions:
> > > >>>>>>
> > > >>>>>> 1. Does the long initialization time suggest that there's an
> > > >>>>>> error happening under the covers?
> > > >>>>>> 2. If not, is there any simple way that we can shorten memory
> > > >>>>>> initialization time?
> > > >>>>>>
> > > >>>>>> Thanks in advance for your insights.
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Matt Laswell
> > > >>>>>> laswell-bIuJOMs36aleGPcbtGPokg@public.gmane.org
> > > >>>>>> infinite io, inc.
> > > >>>>>>
> > > >>>>>
> > > >>>>> Hello,
> > > >>>>>
> > > >>>>> please find some quick comments on the questions:
> > > >>>>> 1.) In our experience, long initialization time is normal with a
> > > >>>>> large amount of memory. This time depends on a few things:
> > > >>>>> - the number of hugepages (page faults handled by the kernel are
> > > >>>>> pretty expensive)
> > > >>>>> - the size of the hugepages (memset at initialization)
> > > >>>>>
> > > >>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
> > > >>>>> time significantly. Using wmemset instead of memset adds an
> > > >>>>> additional 20-30% boost by our measurements. Or, just by touching
> > > >>>>> the pages but not cleaning them you can still get some more
> > > >>>>> speedup. But in this case your layer or the applications above
> > > >>>>> need to do the cleanup at allocation time (e.g. by using
> > > >>>>> rte_zmalloc).
> > > >>>>>
> > > >>>>> Cheers,
> > > >>>>> &rew
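For illustration, a minimal sketch of the three initialization strategies
described above. The helper names are invented; memset() and wmemset() are
the standard C functions, and the speedup figures are the measurements
quoted in the thread, not guarantees:

    #include <stddef.h>
    #include <string.h>
    #include <wchar.h>

    /* Baseline: zero the whole region, as EAL init effectively does. */
    static void zero_memset(void *addr, size_t len)
    {
        memset(addr, 0, len);
    }

    /*
     * Zero using wide stores; measured above as roughly 20-30% faster.
     * Hugepage regions are page-aligned and page-sized, so len is
     * always a multiple of sizeof(wchar_t).
     */
    static void zero_wmemset(void *addr, size_t len)
    {
        wmemset((wchar_t *)addr, L'\0', len / sizeof(wchar_t));
    }

    /*
     * Fastest at init time: fault each page in without cleaning it.
     * The layer above must then zero memory at allocation time
     * (e.g. via rte_zmalloc).
     */
    static void touch_only(volatile char *addr, size_t len, size_t pgsz)
    {
        size_t off;

        for (off = 0; off < len; off += pgsz)
            addr[off] = 0; /* one write per page triggers the fault */
    }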
> > > >>>> I wonder if the whole rte_malloc code is even worth it with a
> > > >>>> modern kernel with transparent huge pages? rte_malloc adds very
> > > >>>> little value and is less safe and slower than glibc or other
> > > >>>> allocators. Plus you lose the ability to get all the benefit out
> > > >>>> of valgrind or electric fence.
> > > >>>
> > > >>> While I'd dearly love to not have our own custom malloc lib to
> > > >>> maintain, for DPDK multiprocess, rte_malloc will be hard to
> > > >>> replace as we would need a replacement solution that similarly
> > > >>> guarantees that memory mapped in process A is also available at
> > > >>> the same address in process B. :-(
> > > >>>
> > > >> Just out of curiosity, why even bother with multiprocess support?
> > > >> What you're talking about above is a multithread model, and you're
> > > >> shoehorning multiple processes into it.
> > > >> Neil
> > > >>
> > > >
> > > > Yep, that's pretty much what it is alright. However, this
> > > > multiprocess support is very widely used by our customers in
> > > > building their applications, and has been in place and supported
> > > > since some of the earliest DPDK releases. If it is to be removed,
> > > > it needs to be replaced by something that provides equivalent
> > > > capabilities to application writers (perhaps something with more
> > > > fine-grained sharing etc.)
> > > >
> > > > /Bruce
> > > >
> > >
> > > It is probably time to start discussing how to pull in the
> > > multi-process and memory management improvements we were talking
> > > about in our DPDK Summit presentation:
> > > https://www.youtube.com/watch?v=907VShi799k#t=647
> > >
> > > A multi-process model can have several benefits, mostly in the high
> > > availability area (a telco requirement), due to better separation,
> > > permission control (per-process RO or RW page mappings),
> > > single-process restartability, improved startup and core dumping
> > > time, etc.
> > >
> > > As a summary of our memory management additions: they allow an
> > > application to describe its memory model in a configuration (or via
> > > an API). E.g. a simplified config would say that every instance
> > > needs 4GB private memory and 2GB shared memory. In a multi-process
> > > model this results in mapping only 6GB of memory in each process,
> > > instead of the current DPDK model where the 4GB per-process private
> > > memory is mapped into all other processes as well, resulting in
> > > unnecessary mappings, e.g. 16x4GB + 2GB in every process.
> > >
> > > What we've chosen is to use DPDK's NUMA-aware allocator for this
> > > purpose. E.g. the above example with 16 instances results in
> > > allocating 17 DPDK NUMA sockets (1 default shared + 16 private),
> > > and we can selectively map a given "NUMA socket" (set of memsegs)
> > > into a process. This also opens up many other possibilities, e.g.:
> > > - clearing the full private memory if a process dies, including the
> > > memzones on it
> > > - pop-up memory support
> > > etc. etc.
> > >
> > > Another option could be to use page-aligned memzones and control
> > > the mapping/permissions at the memzone level.
> > >
> > > /Laszlo
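For illustration, a minimal sketch of that last option, memzone-level
permission control, assuming 2MB hugepages. rte_memzone_reserve_aligned()
and mprotect() are existing APIs; the helper names and the page-size
constant are invented:

    #include <sys/mman.h>
    #include <rte_memzone.h>

    #define APP_PAGE_SZ (2 * 1024 * 1024) /* assumes 2MB hugepages */

    /*
     * Reserve a page-aligned memzone sized in whole pages, so its
     * protection can later be changed without affecting neighbouring
     * allocations (mprotect() works at page granularity).
     */
    static const struct rte_memzone *
    reserve_protectable(const char *name, size_t pages)
    {
        return rte_memzone_reserve_aligned(name, pages * APP_PAGE_SZ,
                                           SOCKET_ID_ANY, 0, APP_PAGE_SZ);
    }

    /*
     * In a process that should only read the zone, drop write access
     * on the local mapping; the owning process keeps its RW mapping.
     */
    static int
    memzone_set_readonly(const struct rte_memzone *mz)
    {
        return mprotect(mz->addr, mz->len, PROT_READ);
    }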
> >
> > Those enhancements sound really, really good. Do you have code for
> > these that you can share, that we can start looking at with a view to
> > pulling it in?
> >
> > /Bruce
>
> Our approach when we started implementing these enhancements was to have
> an additional layer on top of DPDK, so our changes cannot just be pulled
> in as-is, and unfortunately we do not yet have permission to share our
> code. However, we can share ideas and start discussing what would most
> interest the community, and whether there is something we can easily
> pull in or put on the DPDK roadmap.
>
> As mentioned in the presentation, we implemented a new EAL layer which
> we also rely on, although this may not be necessary for all our
> enhancements. For example, our named memory partition pools
> ("memdomains"), which are the basis of our selective memory mapping and
> permission control, could be implemented either above or below the
> memzones, or DPDK could even be just a user of it. Our implementation
> relies on our new EAL layer, but there may be another option: to pull
> this in as a new library which relies on the memzone allocator.
>
> We have a whole set of features with the main goal of environment
> independence, and of course performance first, mainly focusing on NFV
> deployments, e.g. allowing applications to adapt to different
> environments (without any code change) while still getting the highest
> possible performance. The key to this is our new split EAL layer, which
> I think should be the first step to start with. This can co-exist with
> the current linuxapp and bsdapp, and would allow supporting both Linux
> and BSD with separate publisher components which could rely on the
> existing linuxapp/bsdapp code :)
> This new EAL layer would open up many possibilities to play with, e.g.
> exposing NUMA in a non-NUMA-aware VM, pretending that every CPU is in a
> new NUMA domain, emulating a multi-CPU, multi-socket system on a single
> CPU, etc.
>
> What do you think would be the right way to start these discussions?
> We should probably open a new thread on this, as it is no longer fully
> related to the subject - or should we have an internal discussion and
> then present and discuss the ideas in a community call?
> We have been working with DPDK for a long time, but we are new to the
> community and need to understand the ways of working here...

A new thread describing the details of how you have implemented things
would be great.

Thanks,
/Bruce
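For illustration only, a purely hypothetical sketch of what a "memdomain"
API along the lines described in this thread might look like if pulled in
as a new library on top of the memzone allocator. None of these names
exist in DPDK:

    #include <stddef.h>

    enum memdomain_policy {
        MEMDOMAIN_SHARED,  /* mapped into every process in the group */
        MEMDOMAIN_PRIVATE  /* mapped only into the owning process    */
    };

    struct memdomain; /* opaque: e.g. a named set of memsegs or
                         page-aligned memzones reserved per domain   */

    /* Declare a named partition; called by each process at startup. */
    struct memdomain *memdomain_create(const char *name, size_t len,
                                       enum memdomain_policy policy);

    /* Map a domain owned by another process, e.g. with PROT_READ only. */
    int memdomain_attach(struct memdomain *md, int prot);

    /*
     * With the example from the thread (16 instances, 4GB private +
     * 2GB shared), each process maps 4GB + 2GB = 6GB, instead of the
     * 16x4GB + 2GB = 66GB mapped into every process today.
     */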