public inbox for linux-kernel@vger.kernel.org
* mmap vs fs cache
@ 2013-03-05 17:57 Howard Chu
From: Howard Chu @ 2013-03-05 17:57 UTC
  To: linux-kernel

I'm testing our memory-mapped database code on a small VM. The machine has 
32GB of RAM and the size of the DB on disk is ~44GB. The database library 
mmaps the entire file as a single region and starts accessing it as a tree of 
B+trees. Running on an Ubuntu 3.5.0-23 kernel, XFS on a local disk.
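For readers who want to reproduce the setup, here is a minimal sketch of mapping a whole file read-only as one region. It is not the database library's code; the function name `map_whole_file` is hypothetical and used only for illustration.

```python
import mmap
import os


def map_whole_file(path):
    """Map the entire file at `path` read-only as a single region."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # One mapping covering the whole file. No data is read here;
        # pages are faulted in from disk only when they are first touched.
        return mmap.mmap(fd, size, access=mmap.ACCESS_READ)
    finally:
        # The mapping keeps its own reference to the file, so the
        # descriptor can be closed right away.
        os.close(fd)
```

Indexing the returned object (for example `m[offset:offset + 16]`) is what triggers the page faults, which is the only source of reads in the test described here.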

If I start running read-only queries against the DB with a freshly started 
server, I see that my process (OpenLDAP slapd) quickly grows to an RSS of 
about 16GB in tandem with the FS cache. (I.e., "top" shows 16GB cached, and 
slapd is 16GB.)
If I confine my queries to the first 20% of the data then it all fits in RAM 
and queries are nice and fast.
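The two numbers being compared can also be sampled without "top". The sketch below reads `VmRSS` from `/proc/<pid>/status` and `Cached` from `/proc/meminfo`, on the assumption that these are close to what "top" reports. The helper names are hypothetical.

```python
def kib_field(text, key):
    """Return the integer value of a 'Key:   12345 kB' line, or None."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        # Exact match on the key, so "Cached" does not pick up "SwapCached".
        if sep and name.strip() == key:
            return int(rest.split()[0])
    return None


def rss_and_cached_kib(pid="self"):
    """Return (process RSS, system page cache size), both in KiB."""
    with open(f"/proc/{pid}/status") as f:
        rss = kib_field(f.read(), "VmRSS")
    with open("/proc/meminfo") as f:
        cached = kib_field(f.read(), "Cached")
    return rss, cached
```

Sampling `rss_and_cached_kib(pid_of_slapd)` at intervals during a test run would show whether the two values track each other or diverge.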

If I extend the query range to cover more of the data, approaching the size of 
physical RAM, I see something strange - the FS cache keeps growing, but the 
slapd process size grows at a slower rate. This is rather puzzling to me since 
the only thing triggering reads is accesses through the mmap region. 
Eventually the FS cache grows to basically all of the 32GB of RAM (+/- some 
text/data space...) but the slapd process only reaches 25GB, at which point it 
actually starts to shrink - apparently the FS cache is now stealing pages from 
it. I find that a bit puzzling; if the pages are present in memory, and the 
only reason they were paged in was to satisfy an mmap reference, why aren't 
they simply assigned to the slapd process?

The current behavior gets even more aggravating: I can run a test that spans 
exactly 30GB of the data. One would expect that the slapd process should 
simply grow to 30GB in size, and then remain static for the remainder of the 
test. Instead, the server grows to 25GB, the FS cache grows to 32GB, and 
starts stealing pages from the server, shrinking it back down to 19GB or so.

If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this condition, 
the FS cache shrinks back to 25GB, matching the slapd process size.
This then frees up enough RAM for slapd to grow further. If I don't do this, 
the test is constantly paging in data from disk. Even so, the FS cache 
continues to grow faster than the slapd process size, so the system may run 
out of free RAM again, and I have to drop caches multiple times before slapd 
finally grows to the full 30GB. Once it gets to that size the test runs 
entirely from RAM with zero I/Os, but it doesn't get there without a lot of 
babysitting.
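The manual step above can be scripted. This is only a sketch of the same "echo 1" workaround, not a fix. It needs root, and the `control_file` parameter is a hypothetical addition so the function can be pointed at a scratch file when trying it out.

```python
def drop_clean_page_cache(control_file="/proc/sys/vm/drop_caches"):
    """Write 1 to drop_caches, asking the kernel to free clean page cache."""
    # Equivalent to: echo 1 > /proc/sys/vm/drop_caches
    # Writing to the real control file requires root privileges.
    with open(control_file, "w") as f:
        f.write("1\n")
```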

Two questions:
   1. Why is there data in the FS cache that isn't owned by (the mmap of) the
      process that caused it to be paged in in the first place?
   2. Is there a tunable knob to discourage the page cache from stealing from
      the process?
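One candidate explanation that may be worth ruling out is readahead. On a fault, the kernel can read neighbouring file pages into the page cache, and those pages do not count toward the process RSS until the process touches them. This is an assumption about the cause; the measurements above do not show it. A sketch of hinting random access on the mapping, using Python's `mmap.madvise` with `MADV_RANDOM` where the platform provides them (the function name is hypothetical):

```python
import mmap


def hint_random_access(mapping):
    """Advise the kernel that `mapping` is accessed randomly.

    Returns True if the hint was applied, False if this Python build
    or platform does not expose madvise/MADV_RANDOM.
    """
    if not hasattr(mapping, "madvise") or not hasattr(mmap, "MADV_RANDOM"):
        return False
    # Applies to the whole mapping. With this hint the kernel is expected
    # to skip readahead and bring in only the page that was faulted on.
    mapping.madvise(mmap.MADV_RANDOM)
    return True
```

If the gap between the FS cache and the process size narrows with the hint applied, that would point at readahead. If it does not, the cause lies elsewhere.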

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/


