From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Monjalon
Subject: Re: A question about hugepage initialization time
Date: Fri, 12 Dec 2014 16:50:33 +0100
Message-ID: <2123951.k0dJfZKBPF@xps13>
References: <20141212095940.GA2100@bricha3-MOBL3>
In-Reply-To: <20141212095940.GA2100@bricha3-MOBL3>
To: László Vadkerti
Cc: dev-VfR2kkLFssw@public.gmane.org
List-Id: patches and discussions about DPDK

2014-12-12 09:59, Bruce Richardson:
> On Fri, Dec 12, 2014 at 04:07:40AM +0000, László Vadkerti wrote:
> > On Thu, 11 Dec 2014, Bruce Richardson wrote:
> > > On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> > > >
> > > > On Wed, 10 Dec 2014, Bruce Richardson wrote:
> > > >
> > > > > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> > > > >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> > > > >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> > > > >>>> On Tue, 9 Dec 2014 11:45:07 -0800 &rew wrote:
> > > > >>>>
> > > > >>>>>> Hey Folks,
> > > > >>>>>>
> > > > >>>>>> Our DPDK application deals with very large in-memory data
> > > > >>>>>> structures, and can potentially use tens or even hundreds
> > > > >>>>>> of gigabytes of hugepage memory. During the course of
> > > > >>>>>> development, we've noticed that as the number of huge
> > > > >>>>>> pages increases, the memory initialization time during
> > > > >>>>>> EAL init gets to be quite long, lasting several minutes
> > > > >>>>>> at present. The growth in init time doesn't appear to be
> > > > >>>>>> linear, which is concerning.
> > > > >>>>>>
> > > > >>>>>> This is a minor inconvenience for us and our customers,
> > > > >>>>>> as memory initialization makes our boot times a lot
> > > > >>>>>> longer than they would otherwise be. Also, my experience
> > > > >>>>>> has been that really long operations are often hiding
> > > > >>>>>> errors - what you think is merely a slow operation is
> > > > >>>>>> actually a timeout of some sort, often due to
> > > > >>>>>> misconfiguration. This leads to two questions:
> > > > >>>>>>
> > > > >>>>>> 1. Does the long initialization time suggest that there's
> > > > >>>>>> an error happening under the covers?
> > > > >>>>>> 2. If not, is there any simple way that we can shorten
> > > > >>>>>> memory initialization time?
> > > > >>>>>>
> > > > >>>>>> Thanks in advance for your insights.
> > > > >>>>>>
> > > > >>>>>> --
> > > > >>>>>> Matt Laswell
> > > > >>>>>> laswell-bIuJOMs36aleGPcbtGPokg@public.gmane.org
> > > > >>>>>> infinite io, inc.
> > > > >>>>>>
> > > > >>>>> Hello,
> > > > >>>>>
> > > > >>>>> please find some quick comments on the questions:
> > > > >>>>> 1.) In our experience, a long initialization time is
> > > > >>>>> normal with a large amount of memory. However, this time
> > > > >>>>> depends on a few things:
> > > > >>>>> - the number of hugepages (a page fault handled by the
> > > > >>>>> kernel is pretty expensive)
> > > > >>>>> - the size of the hugepages (memset at initialization)
> > > > >>>>>
> > > > >>>>> 2.) Using 1G pages instead of 2M will reduce the
> > > > >>>>> initialization time significantly. Using wmemset instead
> > > > >>>>> of memset adds an additional 20-30% boost by our
> > > > >>>>> measurements. Or, by just touching the pages but not
> > > > >>>>> clearing them you can get still more speedup; but in that
> > > > >>>>> case your layer or the applications above need to do the
> > > > >>>>> cleanup at allocation time (e.g. by using rte_zmalloc).
> > > > >>>>>
> > > > >>>>> Cheers,
> > > > >>>>> &rew
> > > > >>>>
> > > > >>>> I wonder if the whole rte_malloc code is even worth it with
> > > > >>>> a modern kernel with transparent huge pages? rte_malloc
> > > > >>>> adds very little value and is less safe and slower than
> > > > >>>> glibc or other allocators. Plus you lose the ability to get
> > > > >>>> all the benefit out of valgrind or electric fence.
> > > > >>>
> > > > >>> While I'd dearly love to not have our own custom malloc lib
> > > > >>> to maintain, for DPDK multiprocess, rte_malloc will be hard
> > > > >>> to replace as we would need a replacement solution that
> > > > >>> similarly guarantees that memory mapped in process A is also
> > > > >>> available at the same address in process B. :-(
> > > > >>>
> > > > >> Just out of curiosity, why even bother with multiprocess
> > > > >> support? What you're talking about above is a multithread
> > > > >> model, and you're shoehorning multiple processes into it.
> > > > >> Neil
> > > > >>
> > > > > Yep, that's pretty much what it is alright. However, this
> > > > > multiprocess support is very widely used by our customers in
> > > > > building their applications, and has been in place and
> > > > > supported since some of the earliest DPDK releases. If it is
> > > > > to be removed, it needs to be replaced by something that
> > > > > provides equivalent capabilities to application writers
> > > > > (perhaps something with more fine-grained sharing etc.)
> > > > >
> > > > > /Bruce
> > > > >
> > > > It is probably time to start discussing how to pull in the
> > > > multi-process and memory management improvements we were talking
> > > > about in our DPDK Summit presentation:
> > > > https://www.youtube.com/watch?v=907VShi799k#t=647
> > > >
> > > > A multi-process model could have several benefits, mostly in the
> > > > high availability area (a telco requirement), due to better
> > > > separation, permission control (per-process RO or RW page
> > > > mappings), single process restartability, improved startup and
> > > > core dumping time, etc.
> > > >
> > > > As a summary of our memory management additions: they allow an
> > > > application to describe its memory model in a configuration (or
> > > > via an API). For example, a simplified config would say that
> > > > every instance needs 4GB of private memory and 2GB of shared
> > > > memory. In a multi-process model this results in mapping only
> > > > 6GB of memory in each process, instead of the current DPDK model
> > > > where the 4GB per-process private memory is mapped into all
> > > > other processes, resulting in unnecessary mappings, e.g.
> > > > 16x4GB + 2GB in every process.
> > > >
> > > > What we've chosen is to use DPDK's NUMA-aware allocator for this
> > > > purpose; the above example for 16 instances results in
> > > > allocating 17 DPDK NUMA sockets (1 default shared + 16 private),
> > > > and we can selectively map a given "NUMA socket" (a set of
> > > > memsegs) into a process. This also opens up many other
> > > > possibilities to play with, e.g.:
> > > > - clearing the full private memory if a process dies, including
> > > > the memzones on it
> > > > - pop-up memory support
> > > > etc. etc.
> > > >
> > > > Another option could be to use page-aligned memzones and control
> > > > the mapping/permissions at the memzone level.
> > > >
> > > > /Laszlo
> > >
> > > Those enhancements sound really, really good.
> > > Do you have code for these that you can share, that we can start
> > > looking at with a view to pulling it in?
> > >
> > > /Bruce
> >
> > Our approach when we started implementing these enhancements was to
> > build an additional layer on top of DPDK, so our changes cannot
> > just be pulled in as-is, and unfortunately we do not yet have
> > permission to share our code. However, we can share ideas and start
> > discussing what would most interest the community, and whether
> > there is something we can easily pull in or put on the DPDK
> > roadmap.
> >
> > As mentioned in the presentation, we implemented a new EAL layer
> > which we also rely on, although this may not be necessary for all
> > of our enhancements. For example, our named memory partition pools
> > ("memdomains"), which are the basis of our selective memory mapping
> > and permission control, could be implemented either above or below
> > the memzones, or DPDK could even be just a user of them. Our
> > implementation relies on our new EAL layer, but another option may
> > be to pull this in as a new library which relies on the memzone
> > allocator.
> >
> > We have a whole set of features with the main goal of environment
> > independence and, of course, performance first, mainly focusing on
> > NFV deployments, e.g. allowing applications to adapt to different
> > environments (without any code change) while still getting the
> > highest possible performance.
> > The key to this is our new split EAL layer, which I think should be
> > the first step to start with. It can co-exist with the current
> > linuxapp and bsdapp, and would allow supporting both Linux and BSD
> > with separate publisher components which could rely on the existing
> > linuxapp/bsdapp code :)
> > This new EAL layer would open up many possibilities to play with,
> > e.g.
> > exposing NUMA in a non-NUMA-aware VM, pretending that every CPU is
> > in a new NUMA domain, emulating a multi-CPU, multi-socket system on
> > a single CPU, etc.
> >
> > What do you think would be the right way to start these
> > discussions? Should we open a new thread on this, as it is now not
> > fully related to the subject, or should we have an internal
> > discussion first and then present and discuss the ideas in a
> > community call?
> > We have been working with DPDK for a long time, but we are new to
> > the community and need to understand the ways of working here...
>
> A new thread describing the details of how you have implemented
> things would be great.

+1

Please also explain which problems you are trying to solve.
Maybe some of your constraints do not apply here, so the
implementation could be different.

If your work can be split into different features, it may be easier
to discuss each feature in a separate thread.

Thank you
-- 
Thomas