From: Bruce Richardson
Subject: Re: A question about hugepage initialization time
Date: Fri, 12 Dec 2014 09:59:40 +0000
Message-ID: <20141212095940.GA2100@bricha3-MOBL3>
References: <20141209141032.5fa2db0d@urahara>
 <20141210103225.GA10056@bricha3-MOBL3>
 <20141210142926.GA17040@localhost.localdomain>
 <20141210143558.GB1632@bricha3-MOBL3>
 <20141211101449.GB5668@bricha3-MOBL3>
To: László Vadkerti
Cc: "dev-VfR2kkLFssw@public.gmane.org"
List-Id: patches and discussions about DPDK

On Fri, Dec 12, 2014 at 04:07:40AM +0000, László Vadkerti wrote:
> On Thu, 11 Dec 2014, Bruce Richardson wrote:
> > On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> > >
> > > On Wed, 10 Dec 2014, Bruce Richardson wrote:
> > >
> > > > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> > > >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> > > >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> > > >>>> On Tue, 9 Dec 2014 11:45:07 -0800 &rew wrote:
> > > >>>>
> > > >>>>>> Hey Folks,
> > > >>>>>>
> > > >>>>>> Our DPDK application deals with very large in-memory data
> > > >>>>>> structures, and can potentially use tens or even hundreds of
> > > >>>>>> gigabytes of hugepage memory. During the course of development,
> > > >>>>>> we've noticed that as the number of huge pages increases, the
> > > >>>>>> memory initialization time during EAL init gets to be quite
> > > >>>>>> long, lasting several minutes at present. The growth in init
> > > >>>>>> time doesn't appear to be linear, which is concerning.
> > > >>>>>>
> > > >>>>>> This is a minor inconvenience for us and our customers, as
> > > >>>>>> memory initialization makes our boot times a lot longer than
> > > >>>>>> they would otherwise be. Also, my experience has been that
> > > >>>>>> really long operations are often hiding errors - what you think
> > > >>>>>> is merely a slow operation is actually a timeout of some sort,
> > > >>>>>> often due to misconfiguration. This leads to two questions:
> > > >>>>>>
> > > >>>>>> 1. Does the long initialization time suggest that there's an
> > > >>>>>> error happening under the covers?
> > > >>>>>> 2. If not, is there any simple way that we can shorten memory
> > > >>>>>> initialization time?
> > > >>>>>>
> > > >>>>>> Thanks in advance for your insights.
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Matt Laswell
> > > >>>>>> laswell-bIuJOMs36aleGPcbtGPokg@public.gmane.org
> > > >>>>>> infinite io, inc.
> > > >>>>>>
> > > >>>>>
> > > >>>>> Hello,
> > > >>>>>
> > > >>>>> please find some quick comments on the questions:
> > > >>>>> 1.) In our experience, long initialization time is normal with a
> > > >>>>> large amount of memory. This time depends on a few things:
> > > >>>>> - the number of hugepages (page faults handled by the kernel are
> > > >>>>> pretty expensive)
> > > >>>>> - the size of the hugepages (memset at initialization)
> > > >>>>>
> > > >>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
> > > >>>>> time significantly. Using wmemset instead of memset adds an
> > > >>>>> additional 20-30% boost by our measurements. Or, just by touching
> > > >>>>> the pages but not cleaning them you can still get some more
> > > >>>>> speedup. But in this case your layer or the applications above
> > > >>>>> need to do the cleanup at allocation time (e.g. by using
> > > >>>>> rte_zmalloc).
> > > >>>>>
> > > >>>>> Cheers,
> > > >>>>> &rew
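For illustration, a minimal sketch of the three initialization strategies
described above. The helper names are invented; memset() and wmemset() are
the standard C functions, and the speedup figures are the measurements
quoted in the thread, not guarantees:

    #include <stddef.h>
    #include <string.h>
    #include <wchar.h>

    /* Baseline: zero the whole region, as EAL init effectively does. */
    static void zero_memset(void *addr, size_t len)
    {
        memset(addr, 0, len);
    }

    /*
     * Zero using wide stores; measured above as roughly 20-30% faster.
     * Hugepage regions are page-aligned and page-sized, so len is
     * always a multiple of sizeof(wchar_t).
     */
    static void zero_wmemset(void *addr, size_t len)
    {
        wmemset((wchar_t *)addr, L'\0', len / sizeof(wchar_t));
    }

    /*
     * Fastest at init time: fault each page in without cleaning it.
     * The layer above must then zero memory at allocation time
     * (e.g. via rte_zmalloc).
     */
    static void touch_only(volatile char *addr, size_t len, size_t pgsz)
    {
        size_t off;

        for (off = 0; off < len; off += pgsz)
            addr[off] = 0; /* one write per page triggers the fault */
    }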
> > > >>>> I wonder if the whole rte_malloc code is even worth it with a
> > > >>>> modern kernel with transparent huge pages? rte_malloc adds very
> > > >>>> little value and is less safe and slower than glibc or other
> > > >>>> allocators. Plus you lose the ability to get all the benefit out
> > > >>>> of valgrind or electric fence.
> > > >>>
> > > >>> While I'd dearly love to not have our own custom malloc lib to
> > > >>> maintain, for DPDK multiprocess, rte_malloc will be hard to
> > > >>> replace as we would need a replacement solution that similarly
> > > >>> guarantees that memory mapped in process A is also available at
> > > >>> the same address in process B. :-(
> > > >>>
> > > >> Just out of curiosity, why even bother with multiprocess support?
> > > >> What you're talking about above is a multithread model, and you're
> > > >> shoehorning multiple processes into it.
> > > >> Neil
> > > >>
> > > >
> > > > Yep, that's pretty much what it is alright. However, this
> > > > multiprocess support is very widely used by our customers in
> > > > building their applications, and has been in place and supported
> > > > since some of the earliest DPDK releases. If it is to be removed,
> > > > it needs to be replaced by something that provides equivalent
> > > > capabilities to application writers (perhaps something with more
> > > > fine-grained sharing etc.)
> > > >
> > > > /Bruce
> > > >
> > >
> > > It is probably time to start discussing how to pull in the
> > > multi-process and memory management improvements we were talking
> > > about in our DPDK Summit presentation:
> > > https://www.youtube.com/watch?v=907VShi799k#t=647
> > >
> > > A multi-process model can have several benefits, mostly in the high
> > > availability area (a telco requirement), due to better separation,
> > > permission control (per-process RO or RW page mappings),
> > > single-process restartability, improved startup and core dumping
> > > time, etc.
> > >
> > > As a summary of our memory management additions: they allow an
> > > application to describe its memory model in a configuration (or via
> > > an API). E.g. a simplified config would say that every instance
> > > needs 4GB private memory and 2GB shared memory. In a multi-process
> > > model this results in mapping only 6GB of memory in each process,
> > > instead of the current DPDK model where the 4GB per-process private
> > > memory is mapped into all other processes as well, resulting in
> > > unnecessary mappings, e.g. 16x4GB + 2GB in every process.
> > >
> > > What we've chosen is to use DPDK's NUMA-aware allocator for this
> > > purpose. E.g. the above example with 16 instances results in
> > > allocating 17 DPDK NUMA sockets (1 default shared + 16 private),
> > > and we can selectively map a given "NUMA socket" (set of memsegs)
> > > into a process. This also opens up many other possibilities, e.g.:
> > > - clearing the full private memory if a process dies, including the
> > > memzones on it
> > > - pop-up memory support
> > > etc. etc.
> > >
> > > Another option could be to use page-aligned memzones and control
> > > the mapping/permissions at the memzone level.
> > >
> > > /Laszlo
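For illustration, a minimal sketch of that last option, memzone-level
permission control, assuming 2MB hugepages. rte_memzone_reserve_aligned()
and mprotect() are existing APIs; the helper names and the page-size
constant are invented:

    #include <sys/mman.h>
    #include <rte_memzone.h>

    #define APP_PAGE_SZ (2 * 1024 * 1024) /* assumes 2MB hugepages */

    /*
     * Reserve a page-aligned memzone sized in whole pages, so its
     * protection can later be changed without affecting neighbouring
     * allocations (mprotect() works at page granularity).
     */
    static const struct rte_memzone *
    reserve_protectable(const char *name, size_t pages)
    {
        return rte_memzone_reserve_aligned(name, pages * APP_PAGE_SZ,
                                           SOCKET_ID_ANY, 0, APP_PAGE_SZ);
    }

    /*
     * In a process that should only read the zone, drop write access
     * on the local mapping; the owning process keeps its RW mapping.
     */
    static int
    memzone_set_readonly(const struct rte_memzone *mz)
    {
        return mprotect(mz->addr, mz->len, PROT_READ);
    }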
> >
> > Those enhancements sound really, really good. Do you have code for
> > these that you can share, that we can start looking at with a view to
> > pulling it in?
> >
> > /Bruce
>
> Our approach when we started implementing these enhancements was to have
> an additional layer on top of DPDK, so our changes cannot just be pulled
> in as-is, and unfortunately we do not yet have permission to share our
> code. However, we can share ideas and start discussing what would most
> interest the community, and whether there is something we can easily
> pull in or put on the DPDK roadmap.
>
> As mentioned in the presentation, we implemented a new EAL layer which
> we also rely on, although this may not be necessary for all our
> enhancements. For example, our named memory partition pools
> ("memdomains"), which are the basis of our selective memory mapping and
> permission control, could be implemented either above or below the
> memzones, or DPDK could even be just a user of it. Our implementation
> relies on our new EAL layer, but there may be another option: to pull
> this in as a new library which relies on the memzone allocator.
>
> We have a whole set of features with the main goal of environment
> independence, and of course performance first, mainly focusing on NFV
> deployments, e.g. allowing applications to adapt to different
> environments (without any code change) while still getting the highest
> possible performance. The key to this is our new split EAL layer, which
> I think should be the first step to start with. This can co-exist with
> the current linuxapp and bsdapp, and would allow supporting both Linux
> and BSD with separate publisher components which could rely on the
> existing linuxapp/bsdapp code :)
> This new EAL layer would open up many possibilities to play with, e.g.
> exposing NUMA in a non-NUMA-aware VM, pretending that every CPU is in a
> new NUMA domain, emulating a multi-CPU, multi-socket system on a single
> CPU, etc.
>
> What do you think would be the right way to start these discussions?
> We should probably open a new thread on this, as it is no longer fully
> related to the subject - or should we have an internal discussion and
> then present and discuss the ideas in a community call?
> We have been working with DPDK for a long time, but we are new to the
> community and need to understand the ways of working here...

A new thread describing the details of how you have implemented things
would be great.

Thanks,
/Bruce
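For illustration only, a purely hypothetical sketch of what a "memdomain"
API along the lines described in this thread might look like if pulled in
as a new library on top of the memzone allocator. None of these names
exist in DPDK:

    #include <stddef.h>

    enum memdomain_policy {
        MEMDOMAIN_SHARED,  /* mapped into every process in the group */
        MEMDOMAIN_PRIVATE  /* mapped only into the owning process    */
    };

    struct memdomain; /* opaque: e.g. a named set of memsegs or
                         page-aligned memzones reserved per domain   */

    /* Declare a named partition; called by each process at startup. */
    struct memdomain *memdomain_create(const char *name, size_t len,
                                       enum memdomain_policy policy);

    /* Map a domain owned by another process, e.g. with PROT_READ only. */
    int memdomain_attach(struct memdomain *md, int prot);

    /*
     * With the example from the thread (16 instances, 4GB private +
     * 2GB shared), each process maps 4GB + 2GB = 6GB, instead of the
     * 16x4GB + 2GB = 66GB mapped into every process today.
     */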