* Bitbake Memory Usage
From: Richard Purdie @ 2012-02-18 23:36 UTC
To: bitbake-devel, openembedded-core
One of the advantages of conferences is that you can talk to others
about current issues and discuss them. At ELC, I talked with Chris
Larson about the memory usage of bitbake and how I'd been having trouble
tracking usage down. Chris should take credit for pointing me at the
tools I'm mentioning here but I wanted to write down some of the things
I've found so far when looking around bitbake with the tools.
Before I dive into bitbake's memory usage, it's probably worth refreshing
how system memory management roughly works for those who've not had the
pleasure of working more closely with it. In simple terms, the kernel is
clever and does things to try and share memory pages between processes.
Part of this is when you make a fork() call, it uses CoW (copy on write)
for the process's pages. This means that you can have ten forked
processes, each looking like they use 100MB but the total usage can be
100MB if they all never write to memory. When files like dynamic
libraries are loaded, particularly read only sections, those mapped
pages can also be shared between different processes. Prelink is such a
benefit because that makes the load addresses consistent and hence even
the precomputed offset tables can be shared between processes. Finally,
it's worth keeping in mind the kernel only backs memory with real pages
when a process first touches it. Until a write happens, the memory mapping
can point to the kernel's shared zero page. This means that a system
generally overcommits to processes, knowing most processes don't ever use
all the memory they request.
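The ten-process example above can be put into numbers. This is a toy accounting model only, not a measurement; the function name and its parameters are invented for illustration, and it assumes the parent never writes to the pages it shares with its children.

```python
def cow_accounting(parent_mb, n_children, dirtied_mb_per_child=0):
    """Toy model of fork() with copy-on-write.

    parent_mb: memory held by the parent when it forks.
    n_children: number of forked children.
    dirtied_mb_per_child: how much inherited memory each child writes to
    (each written page becomes a private copy in that child).

    Returns (apparent_mb, actual_mb): the sum of what each process looks
    like it uses, and what the pages really cost the system.
    """
    apparent_mb = (1 + n_children) * parent_mb
    actual_mb = parent_mb + n_children * dirtied_mb_per_child
    return apparent_mb, actual_mb


# One parent plus nine children, each looking like 100MB, nobody writing.
print(cow_accounting(100, 9))
# The same processes, but every child has dirtied 11MB of inherited pages.
print(cow_accounting(100, 9, dirtied_mb_per_child=11))
```

The first call is the "looks like 1000MB, costs 100MB" case; the second shows the real cost growing only by what the children actually write.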
I decided to poke around the memory used in a bitbake "worker" process.
This is where we do the real work and ideally we want a low memory
overhead when we fork these off. To start with to see where we were, I
added:
import meliae
from meliae import scanner
scanner.dump_all_objects("/tmp/dump2")
to runqueue.py, just before the bb.build.exec_task() call. There is a
visualisation tool I'd never used before called "runsnakemem" which
provides a graphical view of where the large users are (and will load
the saved dump in /tmp/dump2). It's semi-cryptic in that you don't get
the full variable names of the users but there is enough information
there to be able to look for the problem. I would note that the memory
sizes displayed by runsnakemem (if they are memory sizes?) seem totally
bogus. They are still proportional though (see more info below).
It's pretty easy to spot our worst memory offenders so I started to
remove a few of these from memory (all still in the child process).
These are mostly cache objects and we can neutralise them with:
bb.codeparser.pythonparsecache = {}
bb.codeparser.shellparsecache = {}
bb.providers.regexp_cache = {}
bb.parse.parse_py.BBHandler.cached_statements = {}
I was looking at our process with top and sadly, this shows zero change
in bitbake memory usage, with it being stable at 206M. What is worth
noting is I've seen instability and had problems reproducing numbers in
the past. It now appears the first time you edit any of bitbake's .py
files and run it, you will get a higher memory usage (say 267M). The
second time you run bitbake having changed none of its .py files, the
usage will go back down. I can only assume there is some overhead in
creating the .pyc files and this pushes the memory usage up. I'd like to
understand that problem further.
I started wondering what the size of the objects we were removing was
and whether the statistics were somehow misleading things. The dump file
itself is human readable and was varying heavily in size before and
after the above changes which was a good sign. There is also a second
way to summarise the data using the python:
import meliae
from meliae import loader
om = loader.load("/tmp/dump2")
print(om.summarize())
which in the above case gives the before and after of:
Total 762502 objects, 188 types, Total size = 114.7MiB (120289900 bytes)
Index   Count   %      Size   % Cum     Max Kind
    0   55345   7  41224216  34  34 3146008 dict
    1  480494  63  28878839  24  58   14798 str
    2   50615   6  26939864  22  80   33000 set
    3   98452  12  10339600   8  89   23736 list
    4    1111   0   3751400   3  92    3416 CoreRecipeInfo
    5    2203   0   3271776   2  95  786720 collections.defaultdict
    6   48389   6   1161336   0  96      24 int
    7    9687   1    721248   0  96     424 tuple
    8     213   0    687504   0  97   12624 module
    9     629   0    568616   0  97     904 type
   10     446   0    495952   0  98    1112 LRItem
   11    3542   0    453376   0  98     128 code
   12    3645   0    437400   0  98     120 function
   13     148   0    164576   0  99    1112 Production
   14     131   0    145672   0  99    1112 LogRecord
   15     388   0    133472   0  99     344 DataNode
   16    1159   0     92720   0  99      80 wrapper_descriptor
   17      68   0     75616   0  99    1112 BufferedLogger
   18     834   0     73392   0  99      88 _sre.SRE_Pattern
   19     825   0     72600   0  99      88 weakref

Total 371890 objects, 176 types, Total size = 65.4MiB (68581634 bytes)
Index   Count   %      Size   % Cum     Max Kind
    0   21282   5  29763504  43  43 3146008 dict
    1  208546  56  15827021  23  66   14798 str
    2   97478  26  10204048  14  81   23736 list
    3    1111   0   3751400   5  86    3416 CoreRecipeInfo
    4    2203   0   3271776   4  91  786720 collections.defaultdict
    5    2488   0   1044160   1  93   33000 set
    6    9685   2    721120   1  94     424 tuple
    7     213   0    687504   1  95   12624 module
    8     629   0    568616   0  96     904 type
    9     446   0    495952   0  96    1112 LRItem
   10    3542   0    453376   0  97     128 code
   11    3645   0    437400   0  98     120 function
   12   14539   3    348936   0  98      24 int
   13     148   0    164576   0  98    1112 Production
   14     129   0    143448   0  98    1112 LogRecord
   15    1159   0     92720   0  99      80 wrapper_descriptor
   16      67   0     74504   0  99    1112 BufferedLogger
   17     829   0     72952   0  99      88 weakref
   18     797   0     57384   0  99      72 builtin_function_or_method
   19     652   0     46944   0  99      72 method_descriptor
So we got rid of about half the objects from memory and nearly halved
the memory consumption. Seems a shame the overall memory usage therefore
didn't change then? :/.
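For anyone without meliae to hand, a rough per-type summary in the same spirit can be sketched with the standard library alone. This is not what meliae does internally and the numbers will not match its output: the helper name summarize_by_type is made up here, sys.getsizeof() is shallow, and gc.get_objects() only sees objects tracked by the collector.

```python
import gc
import sys
from collections import defaultdict


def summarize_by_type(objects):
    """Return [(type_name, count, total_bytes)], largest total first."""
    stats = defaultdict(lambda: [0, 0])
    for obj in objects:
        entry = stats[type(obj).__name__]
        entry[0] += 1
        # Shallow size only: a dict's size excludes its keys and values.
        # The second argument is a fallback for objects without __sizeof__.
        entry[1] += sys.getsizeof(obj, 0)
    rows = [(name, count, size) for name, (count, size) in stats.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# gc.get_objects() returns only collector-tracked (container) objects,
# so plain str and int instances are largely missing from this view.
for name, count, size in summarize_by_type(gc.get_objects())[:5]:
    print("%-30s %8d %12d" % (name, count, size))
```

Running it before and after clearing a cache gives a quick, if crude, idea of which types shrank.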
I therefore added in some other calls and also called in smem to get a
better look at the numbers. smem dives a bit deeper into process memory
usage by working through /proc/xxx/smaps, which describes all memory
mappings for a process.
I'm going to give three sets of numbers here using 'smem -c "pid user
command swap uss pss rss vss maps" -P python'. Before we do anything
after the fork() we have:
 PID User    Command                     Swap    USS    PSS    RSS    VSS Maps
4937 richard python /media/data1/build1/    0  65392  72681  89052 141140  107
4941 richard python /media/data1/build1/    0  10404  75968 149476 211032  109
4961 richard python /media/data1/build1/    0  11416  76421 149136 211032  109
and then after we free the caches we've identified, we have:
 PID User    Command                     Swap    USS    PSS    RSS    VSS Maps
4937 richard python /media/data1/build1/    0  65392  74830  89052 141140  107
4941 richard python /media/data1/build1/    0  68260 107045 149476 211032  109
4961 richard python /media/data1/build1/    0  82164 113944 149136 211032  109
For fun, I then wondered what difference the python garbage collection
routines would make using:
import gc
gc.collect(2)
which gave:
 PID User    Command                     Swap    USS    PSS    RSS    VSS Maps
4937 richard python /media/data1/build1/    0  65392  74831  89052 141140  107
4941 richard python /media/data1/build1/    0  96684 121258 149476 211032  109
4961 richard python /media/data1/build1/    0 110596 128163 149152 211032  109
Incidentally, I also checked len(gc.garbage) which was zero so we didn't
have uncollectable objects.
So what does the above mean? Firstly, to define what these numbers mean:
USS: Unique Set Size - Memory unique to the process. If you kill it,
you get this much back
PSS: Proportional Set Size - Includes a measurement accounting for
shared pages (the size of each page is divided by the number of users it has)
RSS: Resident Set Size - Ignores sharing and is how much memory the
process has actively used
VSS: Virtual Set Size - A total of all the memory mapped by the process,
whether used or not.
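As a rough illustration of how these four figures fall out of /proc/xxx/smaps, here is a minimal sketch that sums the per-mapping fields. The function name smaps_totals and the sample text are invented for the example, and taking USS as Private_Clean + Private_Dirty is my understanding of the usual definition rather than something checked against the smem source.

```python
import re

# Matches lines such as "Private_Dirty:      2048 kB".
_FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")


def smaps_totals(smaps_text):
    """Sum the per-mapping fields of an smaps dump into USS/PSS/RSS/VSS (kB)."""
    totals = {}
    for line in smaps_text.splitlines():
        match = _FIELD.match(line.strip())
        if match:
            key, value = match.group(1), int(match.group(2))
            totals[key] = totals.get(key, 0) + value
    return {
        "USS": totals.get("Private_Clean", 0) + totals.get("Private_Dirty", 0),
        "PSS": totals.get("Pss", 0),
        "RSS": totals.get("Rss", 0),
        "VSS": totals.get("Size", 0),
    }


# Invented sample: a library text mapping shared by four processes, and a
# heap that is half private and half still shared with three siblings.
SAMPLE = """\
00400000-00500000 r-xp 00000000 08:01 1234 /usr/bin/python
Size:               1024 kB
Rss:                 512 kB
Pss:                 128 kB
Shared_Clean:        512 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
01000000-01800000 rw-p 00000000 00:00 0 [heap]
Size:               8192 kB
Rss:                4096 kB
Pss:                2560 kB
Shared_Clean:          0 kB
Shared_Dirty:       2048 kB
Private_Clean:         0 kB
Private_Dirty:      2048 kB
"""
print(smaps_totals(SAMPLE))
```

On a live system the same function could be fed open("/proc/self/smaps").read().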
Why three processes? One is the UI process, the second is the main
runqueue (the parent of the worker) and the third is the child worker
running the actual task.
So bitbake is using about 150M of memory and mapping 211M. When we fork
off the child worker, it's got a surprisingly low overhead of unique
pages (USS) of about 11M. As soon as the child starts trying to remove
things from memory, we lose the benefits of CoW and USS and PSS rise.
RSS and VSS remain unchanged. We then call the garbage collection and it
just makes USS and PSS worse.
So by trying to make things better here, we've taken the unique overhead
of the workers from 11M and made it about ten times worse.
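The "about ten times" figure can be checked directly against the smem output. The USS values below are copied from the tables above; which PID is the runqueue and which is the worker is an assumption based on the order the processes are described in.

```python
# USS in kB for the two forked processes, taken from the smem tables:
# straight after the fork(), and after clearing caches plus gc.collect().
uss_after_fork = {4941: 10404, 4961: 11416}
uss_after_gc = {4941: 96684, 4961: 110596}

# Growth factor per process (float() keeps this correct on python 2 too).
growth = dict(
    (pid, uss_after_gc[pid] / float(uss_after_fork[pid]))
    for pid in uss_after_fork
)

for pid in sorted(growth):
    print("pid %d: USS %d kB -> %d kB (%.1fx)"
          % (pid, uss_after_fork[pid], uss_after_gc[pid], growth[pid]))
```

Note that USS rises for both processes: once the child has its own copy of a page, the remaining copy may no longer be shared with anyone and so starts counting as unique to the parent.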
I have to say this doesn't surprise me and I kind of expected this but
it's nice to see the numbers. The piece of the puzzle I don't understand
is why even with the gc calls, RSS does not decrease. It appears python
is not returning memory to the system even when we do free it up.
So in summary, our fork() execution model may have some scary looking
memory numbers but in reality the overhead is good with CoW. There are
some puzzles remaining about how/when we can coerce python to give
memory back to the system. We should also look at the memory hogging
variables identified and see if we can't do something better with them
in the core.
I'm going to continue to look at this when I have time but I hope this
reassures some people about the fork() overhead and enthuses some people
to further look around bitbake's memory usage.
Cheers,
Richard
* Re: [bitbake-devel] Bitbake Memory Usage
From: Richard Purdie @ 2012-02-19 2:55 UTC
To: bitbake-devel; +Cc: openembedded-core
On Sat, 2012-02-18 at 23:36 +0000, Richard Purdie wrote:
> I was looking at our process with top and sadly, this shows zero change
> in bitbake memory usage, with it being stable at 206M. What is worth
> noting is I've seen instability and had problems reproducing numbers in
> the past. It now appears the first time you edit any of bitbake's .py
> files and run it, you will get a higher memory usage (say 267M). The
> second time you run bitbake having changed none of its .py files, the
> usage will go back down. I can only assume there is some overhead in
> creating the .pyc files and this pushes the memory usage up. I'd like to
> understand that problem further.
For reference, only VSS increases and RSS does not in this case so it
looks like there are leftover memory mappings.
> So we got rid of about half the objects from memory and nearly halved
> the memory consumption. Seems a shame the overall memory usage therefore
> didn't change then? :/.
> So in summary, our fork() execution model may have some scary looking
> memory numbers but in reality the overhead is good with CoW. There are
> some puzzles remaining about how/when we can coerce python to give
> memory back to the system. We should also look at the memory hogging
> variables identified and see if we can't do something better with them
> in the core.
>
> I'm going to continue to look at this when I have time but I hope this
> reassures some people about the fork() overhead and enthuses some people
> to further look around bitbake's memory usage.
Digging a little deeper, it was mainly strings that we were making the
saving with. I checked and the allocator for string objects in python
will return them to the main python object allocator, maintaining only
a small queue of 80 objects for reuse. For objects over 256 bytes in
size, the glibc malloc/free calls are used. In turn, glibc uses sbrk()
to change the data segment size if necessary for anything less than
128kB in size. The strings in the caches in question are very likely to
be over 256 bytes in size but under 128kB.
The problem is then going to be memory fragmentation since whilst we can
free the memory used by the objects, it's extremely unlikely that
everything up to the end of the data segment will be free, so sbrk()
cannot be rolled backwards to free memory.
So in summary it's unlikely we're going to get freed memory back with
python for these kinds of caches, at least from the main server
process...
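The fragmentation argument can be illustrated with a toy model of a heap that can only shrink from the top. This is a deliberate simplification and not how glibc is implemented: the function name and block layout are invented, and real glibc also only trims once the free space at the top passes a threshold (M_TRIM_THRESHOLD).

```python
def trimmable_bytes(blocks):
    """blocks: list of (size_in_bytes, in_use) in address order, lowest first.

    A brk-style heap can only shrink from the top, so only the run of
    free blocks at the very end could be handed back to the system.
    """
    trimmable = 0
    for size, in_use in reversed(blocks):
        if in_use:
            break
        trimmable += size
    return trimmable


# 1000 blocks of 1kB; everything freed except the block at the very top.
mostly_free = [(1024, False)] * 999 + [(1024, True)]
free_total = sum(size for size, in_use in mostly_free if not in_use)
print(free_total, trimmable_bytes(mostly_free))

# Only the top 10 blocks freed: all of that space could be returned.
top_free = [(1024, True)] * 990 + [(1024, False)] * 10
print(trimmable_bytes(top_free))
```

In the first case nearly all of the heap is free yet nothing can be given back, which is the situation a long-lived server with a few surviving objects near the top of the heap ends up in.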
Cheers,
Richard
* Re: [bitbake-devel] Bitbake Memory Usage
From: Phil Blundell @ 2012-02-19 8:34 UTC
To: Patches and discussions about the oe-core layer; +Cc: bitbake-devel
On Sun, 2012-02-19 at 02:55 +0000, Richard Purdie wrote:
> The problem is then going to be memory fragmentation since whilst we can
> free the memory used by the objects, it's extremely unlikely that
> everything up to the end of the data segment will be free, so sbrk()
> cannot be
> rolled backwards to free memory.
Sounds like it. Have you tried running python with a lower
MMAP_THRESHOLD?
p.
* Re: [bitbake-devel] Bitbake Memory Usage
From: Richard Purdie @ 2012-02-19 15:41 UTC
To: Patches and discussions about the oe-core layer; +Cc: bitbake-devel
On Sun, 2012-02-19 at 08:34 +0000, Phil Blundell wrote:
> On Sun, 2012-02-19 at 02:55 +0000, Richard Purdie wrote:
> > The problem is then going to be memory fragmentation since whilst we can
> > free the memory used by the objects, it's extremely unlikely that
> > everything up to the end of the data segment will be free, so sbrk()
> > cannot be
> > rolled backwards to free memory.
>
> Sounds like it. Have you tried running python with a lower
> MMAP_THRESHOLD?
I just did (to 4k) and it drops the overall RSS from 150M to 143M but
otherwise no significant difference. For the record in the archives, the
rather nasty hacky code I used was:
import ctypes
libc = ctypes.CDLL("libc.so.6")
M_MMAP_THRESHOLD = -3    # value from glibc's <malloc.h>
print(libc.mallopt(M_MMAP_THRESHOLD, 4*1024))    # mallopt() returns 1 on success
I just looked at the caches in question again and I now suspect the
strings are under the 256 byte limit; I was thinking the data was slightly
different.
If that is the case, it means the memory is being returned to python's
pool allocator. Unfortunately that will suffer the same fragmentation
problem described above.
Either way, I'm pretty sure memory fragmentation is the problem here.
One solution we might consider is putting the codeparser into a separate
process accessed over a socket. At the (mostly) face to face TSC meeting
earlier in the week I did briefly touch on the future with bitbake. I'm
planning to summarise this on the mailing list when I get a chance but
one of the ideas discussed was putting the data store into a separate
process accessed over sockets as a solution to some of the memory use
issues. There are problems with this such as the anonymous python
functions and where in the system these would run but it's certainly
worth thinking along those lines. It would solve some of the nasty cache
issues we currently have with the interaction between multiple parsing
threads and the codeparser cache needing to get "merged" at the end of
parsing.
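To make the idea a little more concrete, here is a minimal python 3 sketch of a cache living in its own process and answering requests over a connection. It is not bitbake code: the names cache_server and start_cache_server and the tuple-based protocol are invented, and a multiprocessing Pipe stands in for the socket.

```python
import multiprocessing


def cache_server(conn, cache):
    """Answer ("get", key) and ("set", key, value) requests until told to stop."""
    while True:
        request = conn.recv()
        if request is None:            # shutdown sentinel
            break
        if request[0] == "get":
            conn.send(cache.get(request[1]))
        elif request[0] == "set":
            cache[request[1]] = request[2]
            conn.send(True)
    conn.close()


def start_cache_server():
    # "fork" matches bitbake's execution model; not available on Windows.
    ctx = multiprocessing.get_context("fork")
    client_end, server_end = ctx.Pipe()
    proc = ctx.Process(target=cache_server, args=(server_end, {}))
    proc.start()
    server_end.close()                 # the parent only needs its own end
    return proc, client_end


proc, conn = start_cache_server()
conn.send(("set", "do_compile", "parsed-result"))
conn.recv()                            # acknowledgement of the "set"
conn.send(("get", "do_compile"))
result = conn.recv()
print(result)
conn.send(None)                        # ask the server to exit
proc.join()
```

The cache's memory then belongs to one process only, at the price of a round trip and pickling for every lookup.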
Cheers,
Richard
* Re: [bitbake-devel] Bitbake Memory Usage
From: Phil Blundell @ 2012-02-19 16:04 UTC
To: Patches and discussions about the oe-core layer; +Cc: bitbake-devel
On Sun, 2012-02-19 at 15:41 +0000, Richard Purdie wrote:
> I just did (to 4k) and it drops the overall RSS from 150M to 143M but
> otherwise no significant difference. For the record in the archives, the
> rather nasty hacky code I used was:
>
> from ctypes import *
> cdll.LoadLibrary("libc.so.6")
> libc = CDLL("libc.so.6")
> print libc.mallopt(-3, 4*1024)
FWIW, you can just say "export MALLOC_MMAP_THRESHOLD_=4096" before running
bitbake, which should achieve roughly the same thing with slightly less
typing.
> I just looked at the caches in question again and I now suspect the
> strings are under the 256 byte limit; I was thinking the data was slightly
> different.
>
> If that is the case, it means the memory is being returned to python's
> pool allocator. Unfortunately that will suffer the same fragmentation
> problem described above.
Yeah. I guess it would be possible to make the pool allocator smarter
to cut down on or ameliorate the fragmentation. But I don't know
whether that is feasible/straightforward without patching Python itself.
Some further investigation required I suppose.
p.
* Re: Bitbake Memory Usage
From: Colin Walters @ 2012-02-20 14:05 UTC
To: Patches and discussions about the oe-core layer; +Cc: bitbake-devel
On Sat, 2012-02-18 at 23:36 +0000, Richard Purdie wrote:
> As soon as the child starts trying to remove
> things from memory, we lose the benefits of CoW and USS and PSS rise.
Note that even leaving out the garbage collector, the CPython VM
incrementing/decrementing the refcount of objects will force the copy.
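A small demonstration of why merely looking at an object is a write as far as the kernel is concerned: in CPython the reference count is stored in the object itself, so taking a reference changes the memory the object lives in. The variable names are arbitrary and the behaviour shown is specific to CPython.

```python
import sys

cache_entry = ["some", "cached", "data"]

before = sys.getrefcount(cache_entry)
alias = cache_entry        # no data is modified, we only take a reference
after = sys.getrefcount(cache_entry)

# The count lives in the object's header, so the page holding the object
# has been written to even though the list contents are untouched.
print(before, after)
```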
The python multiprocessing module has some classes which allow
explicitly sharing memory. Certainly for various caches e.g. if
they're just read-only after a certain point it might make sense
to wrap them in e.g. a multiprocessing.Value.
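For reference, a minimal python 3 sketch of what multiprocessing.Value gives you, assuming the fork start method. Note it holds a single C-typed value in shared memory, so a dict-based cache could not simply be wrapped in one; the bump function and the counter are invented for the example.

```python
import multiprocessing

ctx = multiprocessing.get_context("fork")   # fork, to match bitbake's model


def bump(counter):
    # get_lock() serialises access; that locking is part of the cost.
    with counter.get_lock():
        counter.value += 1


# "i" is a C int living in shared memory, not an ordinary python object.
counter = ctx.Value("i", 41)
child = ctx.Process(target=bump, args=(counter,))
child.start()
child.join()

# The child's write is visible here because the memory really is shared.
print(counter.value)
```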
* Re: Bitbake Memory Usage
From: Chris Larson @ 2012-02-20 19:57 UTC
To: Patches and discussions about the oe-core layer; +Cc: bitbake-devel
On Mon, Feb 20, 2012 at 7:05 AM, Colin Walters <walters@verbum.org> wrote:
> On Sat, 2012-02-18 at 23:36 +0000, Richard Purdie wrote:
>> As soon as the child starts trying to remove
>> things from memory, we lose the benefits of CoW and USS and PSS rise.
>
> Note that even leaving out the garbage collector, the CPython VM
> incrementing/decrementing the refcount of objects will force the copy.
>
> The python multiprocessing module has some classes which allow
> explicitly sharing memory. Certainly for various caches e.g. if
> they're just read-only after a certain point it might make sense
> to wrap them in e.g. a multiprocessing.Value.
Hmm, that's quite interesting, thanks for that tidbit. We should
consider looking into that. On a related note, I think we should sit
down and basically review every use of a cache anywhere in bitbake and
rethink their operation. We still have multiple caches stored as
globals with no expiration policy of any sort, they accrue
indefinitely, until process exit. As we review them, we could consider
this as well on a case-by-case basis.
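As one possible shape for an expiration policy, here is a minimal sketch of a size-bounded cache that evicts the least recently used entry. The class name BoundedCache and its interface are hypothetical, and LRU is just one policy picked for illustration; nothing like this exists in bitbake as far as this thread says.

```python
from collections import OrderedDict


class BoundedCache(object):
    """Hypothetical LRU cache: evicts the least recently used entry when full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        value = self._data.pop(key)
        self._data[key] = value            # re-insert as most recently used
        return value

    def put(self, key, value):
        self._data.pop(key, None)
        self._data[key] = value
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is now the most recently used
cache.put("c", 3)     # evicts "b"
print(cache.get("a"), cache.get("b"), cache.get("c"))
```

Given the fragmentation discussion earlier in the thread, a bound like this is more likely to stop a cache growing than to hand memory back once it has been used.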
--
Christopher Larson
* Re: [bitbake-devel] Bitbake Memory Usage
From: Richard Purdie @ 2012-02-21 22:15 UTC
To: Chris Larson
Cc: bitbake-devel, Patches and discussions about the oe-core layer
On Mon, 2012-02-20 at 12:57 -0700, Chris Larson wrote:
> On Mon, Feb 20, 2012 at 7:05 AM, Colin Walters <walters@verbum.org> wrote:
> > On Sat, 2012-02-18 at 23:36 +0000, Richard Purdie wrote:
> >> As soon as the child starts trying to remove
> >> things from memory, we lose the benefits of CoW and USS and PSS rise.
> >
> > Note that even leaving out the garbage collector, the CPython VM
> > incrementing/decrementing the refcount of objects will force the copy.
> >
> > The python multiprocessing module has some classes which allow
> > explicitly sharing memory. Certainly for various caches e.g. if
> > they're just read-only after a certain point it might make sense
> > to wrap them in e.g. a multiprocessing.Value.
>
> Hmm, that's quite interesting, thanks for that tidbit. We should
> consider looking into that.
It's interesting but my tests are suggesting the reference counting isn't
hurting in at least our common "execute a shell task" use case. With the
way CoW and the data store works, I can believe that.
On the other hand, the serialisation that multiprocessing puts around
its variables looked potentially problematic (and slow) last time I
looked. The manual page specifically says those structures are slow if I
remember rightly.
> On a related note, I think we should sit
> down and basically review every use of a cache anywhere in bitbake and
> rethink their operation. We still have multiple caches stored as
> globals with no expiration policy of any sort, they accrue
> indefinitely, until process exit. As we review them, we could consider
> this as well on a case-by-case basis.
I agree we need to resolve some of the life cycle issues and remove the
globals. The ones I'm aware of do give significant speed benefits though
and I'm not sure there will be much scope for removing them :/.
Cheers,
Richard