All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: "Peter Schüller" <scode@spotify.com>
Cc: linux-kernel@vger.kernel.org,
	Mattias de Zalenski <zalenski@spotify.com>,
	linux-mm@kvack.org
Subject: Re: Sudden and massive page cache eviction
Date: Mon, 22 Nov 2010 16:11:58 -0800	[thread overview]
Message-ID: <20101122161158.02699d10.akpm@linux-foundation.org> (raw)
In-Reply-To: <AANLkTikg-sR97tkG=ST9kjZcHe6puYSvMGh-eA3cnH7X@mail.gmail.com>

(cc linux-mm)

On Fri, 12 Nov 2010 17:20:21 +0100
Peter Sch__ller <scode@spotify.com> wrote:

> Hello,
> 
> We have been seeing sudden and repeated evictions of huge amounts of
> page cache on some of our servers for reasons that we cannot explain.
> We are hoping that someone familiar with the vm subsystem may be able
> to shed some light on the issue and perhaps confirm whether it is
> plausibly a kernel bug or not. I will try to present the information
> most-important-first, but this post will unavoidable be a bit long -
> sorry.
> 
> First, here is a good example of the symptom (more graphs later on):
> 
>    http://files.spotify.com/memcut/b_daily_allcut.png
> 
> After looking into this we have seen similar incidents on servers
> running completely different software; but in this particular case
> this machine is running a service which is heavily dependent on the
> buffer cache to deal with incoming request load. The direct effects of
> these is that we end up in complete I/O saturation (average queue
> depth goes to 150-250 and stays there indefinitely or until we
> actively tweak it (warm up caches etc)). Our interpretation of that is
> that the eviction is not the result of something along the lines of a
> large file being removed; given the effects on I/O load it is clear
> that the data being evicted is in fact part of the active set used by
> the service running on the machine.
> 
> The I/O load on these systems comes mainly from two things:
> 
>   (1) Seek-bound I/O generated by lookups in a BDB (b-tree traversal).
>   (2) Seek-bound I/O generated by traversal of prefix directory trees
> (i.e., 00/01/0001334234...., a poor man's b-tree on top of ext3).
>   (3) Seek-bound I/O reading small segments of small-to-medium sized
> files contained in the prefix tree.
> 
> The prefix tree consist of 8*2^16 directory entries in total, with
> individual files being in the tens of millions per server.
> 
> We initially ran 2.6.32-bpo.5-amd64 (Debian backports kernel) and have
> subsequently upgraded some of them to 2.6.36-rc6-amd64 (Debian
> experimental repo). While it initially looked like it was behaving
> better, it slowly reverted to not making a difference (maybe as a
> function of uptime, but we have not had the opportunity to test this
> by re-booting some of them so it is an untested hypothesis).
> 
> Most of the activity on this system (ignoring the usual stuff like
> ssh/cron/syslog/etc) is coming from Python processes that consume
> non-trivial amounts of heap space, plus the disk activity and some
> POSIX shared memory caching utilized by the BDB library.
> 
> We have correlated the incidence of these page eviction with higher
> loads on the system; i.e., it tends to happen under high-load periods
> and in addition we tend to see additional machines having problems as
> a result of us "fixing" a machine that experienced an eviction (we
> have some limited cascading effects that causes slightly higher load
> on other servers in the cluster when we do that).
> 
> We believe the most plausible way an application bug could trigger
> this behavior would require that (1) the application allocates the
> memory, and (2) actually touches the pages. We believe this to be
> unlikely in this case because:
> 
>   (1) We see similar sudden evictions on various other servers, which
> we noticed when we started looking for them.
>   (2) The fact that it tends to trigger correlated with load suggests
> that it is not a functional bug in the service as such as higher load
> is in this case unlikely to trigger any paths that does anything
> unique with respect to memory allocation. In particular because the
> domain logic is all Python, and none of it really deals with data
> chunks.
>   (3) If we did manage to allocate something in the Python heap, we
> would have to be "lucky" (or unlucky) if Python were consistently able
> to munmap()/brk() down afterwards.
> 
> Some additional "sample" graphs showing a few incidences of the problem:
> 
>    http://files.spotify.com/memcut/a_daily.png
>    http://files.spotify.com/memcut/a_weekly.png
>    http://files.spotify.com/memcut/b_daily_allcut.png
>    http://files.spotify.com/memcut/c_monthly.png
>    http://files.spotify.com/memcut/c_yearly.png
>    http://files.spotify.com/memcut/d_monthly.png
>    http://files.spotify.com/memcut/d_yearly.png
>    http://files.spotify.com/memcut/a_monthly.png
>    http://files.spotify.com/memcut/a_yearly.png
>    http://files.spotify.com/memcut/c_daily.png
>    http://files.spotify.com/memcut/c_weekly.png
>    http://files.spotify.com/memcut/d_daily.png
>    http://files.spotify.com/memcut/d_weekly.png
> 
> And here is an example from a server only running PostgreSQL (where
> the sudden drop of gigabytes of page cache is unlikely because we are
> not DROP:ing tables, nor do we have multi-gigabyte WAL archive sizes,
> nor do we have a use-case which will imply ftruncate() on table
> files):
> 
>    http://files.spotify.com/memcut/postgresql_weekly.png
> 
> As you can see it's not as significant there, but it seems to, at
> least visually, be the same "type" of effect. We've seen similar on
> various machines, although depending on service running it may or may
> not be explainable by regular file removal.
> 
> Further, we have observed the kernel's unwillingness to retain data in
> page cache under interesting circumstances:
> 
> (1) page cache eviction happens
> (2) we warm up our BDB files by cat:ing them (simple but effective)
> (3) within a matter of minutes, while there is still several GB of
> free (truly free, not page cached), these are evicted (as evidenced by
> re-cat:ing them a little while later)
> 
> This latest observation we understand may be due to NUMA related
> allocation issues, and we should probably try to use numactl to ask
> for a more even allocation. We have not yet tried this. However, it is
> not clear how any issues having to do with that would cause sudden
> eviction of data already *in* the page cache (on whichever node).
> 


WARNING: multiple messages have this Message-ID (diff)
From: Andrew Morton <akpm@linux-foundation.org>
To: "Peter Schüller" <scode@spotify.com>
Cc: linux-kernel@vger.kernel.org,
	Mattias de Zalenski <zalenski@spotify.com>,
	linux-mm@kvack.org
Subject: Re: Sudden and massive page cache eviction
Date: Mon, 22 Nov 2010 16:11:58 -0800	[thread overview]
Message-ID: <20101122161158.02699d10.akpm@linux-foundation.org> (raw)
In-Reply-To: <AANLkTikg-sR97tkG=ST9kjZcHe6puYSvMGh-eA3cnH7X@mail.gmail.com>

(cc linux-mm)

On Fri, 12 Nov 2010 17:20:21 +0100
Peter Sch__ller <scode@spotify.com> wrote:

> Hello,
> 
> We have been seeing sudden and repeated evictions of huge amounts of
> page cache on some of our servers for reasons that we cannot explain.
> We are hoping that someone familiar with the vm subsystem may be able
> to shed some light on the issue and perhaps confirm whether it is
> plausibly a kernel bug or not. I will try to present the information
> most-important-first, but this post will unavoidable be a bit long -
> sorry.
> 
> First, here is a good example of the symptom (more graphs later on):
> 
>    http://files.spotify.com/memcut/b_daily_allcut.png
> 
> After looking into this we have seen similar incidents on servers
> running completely different software; but in this particular case
> this machine is running a service which is heavily dependent on the
> buffer cache to deal with incoming request load. The direct effects of
> these is that we end up in complete I/O saturation (average queue
> depth goes to 150-250 and stays there indefinitely or until we
> actively tweak it (warm up caches etc)). Our interpretation of that is
> that the eviction is not the result of something along the lines of a
> large file being removed; given the effects on I/O load it is clear
> that the data being evicted is in fact part of the active set used by
> the service running on the machine.
> 
> The I/O load on these systems comes mainly from two things:
> 
>   (1) Seek-bound I/O generated by lookups in a BDB (b-tree traversal).
>   (2) Seek-bound I/O generated by traversal of prefix directory trees
> (i.e., 00/01/0001334234...., a poor man's b-tree on top of ext3).
>   (3) Seek-bound I/O reading small segments of small-to-medium sized
> files contained in the prefix tree.
> 
> The prefix tree consist of 8*2^16 directory entries in total, with
> individual files being in the tens of millions per server.
> 
> We initially ran 2.6.32-bpo.5-amd64 (Debian backports kernel) and have
> subsequently upgraded some of them to 2.6.36-rc6-amd64 (Debian
> experimental repo). While it initially looked like it was behaving
> better, it slowly reverted to not making a difference (maybe as a
> function of uptime, but we have not had the opportunity to test this
> by re-booting some of them so it is an untested hypothesis).
> 
> Most of the activity on this system (ignoring the usual stuff like
> ssh/cron/syslog/etc) is coming from Python processes that consume
> non-trivial amounts of heap space, plus the disk activity and some
> POSIX shared memory caching utilized by the BDB library.
> 
> We have correlated the incidence of these page eviction with higher
> loads on the system; i.e., it tends to happen under high-load periods
> and in addition we tend to see additional machines having problems as
> a result of us "fixing" a machine that experienced an eviction (we
> have some limited cascading effects that causes slightly higher load
> on other servers in the cluster when we do that).
> 
> We believe the most plausible way an application bug could trigger
> this behavior would require that (1) the application allocates the
> memory, and (2) actually touches the pages. We believe this to be
> unlikely in this case because:
> 
>   (1) We see similar sudden evictions on various other servers, which
> we noticed when we started looking for them.
>   (2) The fact that it tends to trigger correlated with load suggests
> that it is not a functional bug in the service as such as higher load
> is in this case unlikely to trigger any paths that does anything
> unique with respect to memory allocation. In particular because the
> domain logic is all Python, and none of it really deals with data
> chunks.
>   (3) If we did manage to allocate something in the Python heap, we
> would have to be "lucky" (or unlucky) if Python were consistently able
> to munmap()/brk() down afterwards.
> 
> Some additional "sample" graphs showing a few incidences of the problem:
> 
>    http://files.spotify.com/memcut/a_daily.png
>    http://files.spotify.com/memcut/a_weekly.png
>    http://files.spotify.com/memcut/b_daily_allcut.png
>    http://files.spotify.com/memcut/c_monthly.png
>    http://files.spotify.com/memcut/c_yearly.png
>    http://files.spotify.com/memcut/d_monthly.png
>    http://files.spotify.com/memcut/d_yearly.png
>    http://files.spotify.com/memcut/a_monthly.png
>    http://files.spotify.com/memcut/a_yearly.png
>    http://files.spotify.com/memcut/c_daily.png
>    http://files.spotify.com/memcut/c_weekly.png
>    http://files.spotify.com/memcut/d_daily.png
>    http://files.spotify.com/memcut/d_weekly.png
> 
> And here is an example from a server only running PostgreSQL (where
> the sudden drop of gigabytes of page cache is unlikely because we are
> not DROP:ing tables, nor do we have multi-gigabyte WAL archive sizes,
> nor do we have a use-case which will imply ftruncate() on table
> files):
> 
>    http://files.spotify.com/memcut/postgresql_weekly.png
> 
> As you can see it's not as significant there, but it seems to, at
> least visually, be the same "type" of effect. We've seen similar on
> various machines, although depending on service running it may or may
> not be explainable by regular file removal.
> 
> Further, we have observed the kernel's unwillingness to retain data in
> page cache under interesting circumstances:
> 
> (1) page cache eviction happens
> (2) we warm up our BDB files by cat:ing them (simple but effective)
> (3) within a matter of minutes, while there is still several GB of
> free (truly free, not page cached), these are evicted (as evidenced by
> re-cat:ing them a little while later)
> 
> This latest observation we understand may be due to NUMA related
> allocation issues, and we should probably try to use numactl to ask
> for a more even allocation. We have not yet tried this. However, it is
> not clear how any issues having to do with that would cause sudden
> eviction of data already *in* the page cache (on whichever node).
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2010-11-23  0:12 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-12 16:20 Sudden and massive page cache eviction Peter Schüller
2010-11-23  0:11 ` Andrew Morton [this message]
2010-11-23  0:11   ` Andrew Morton
2010-11-23  8:38   ` Dave Hansen
2010-11-23  8:38     ` Dave Hansen
2010-11-23  9:44     ` Peter Schüller
2010-11-23  9:44       ` Peter Schüller
2010-11-23 16:19       ` Dave Hansen
2010-11-23 16:19         ` Dave Hansen
2010-11-24 14:02         ` Peter Schüller
2010-11-24 14:02           ` Peter Schüller
2010-11-24 14:14           ` Peter Schüller
2010-11-24 14:14             ` Peter Schüller
2010-11-24 14:20             ` Pekka Enberg
2010-11-24 14:20               ` Pekka Enberg
2010-11-24 15:32               ` Peter Schüller
2010-11-24 15:32                 ` Peter Schüller
2010-11-24 17:46                 ` Pekka Enberg
2010-11-24 17:46                   ` Pekka Enberg
2010-11-25  1:18                 ` Simon Kirby
2010-11-25  1:18                   ` Simon Kirby
2010-11-25 15:59                   ` Peter Schüller
2010-11-25 15:59                     ` Peter Schüller
2010-12-01  6:36                     ` Simon Kirby
2010-12-01  6:36                       ` Simon Kirby
2010-11-24 17:32             ` Dave Hansen
2010-11-24 17:32               ` Dave Hansen
2010-11-25 15:33               ` Peter Schüller
2010-11-25 15:33                 ` Peter Schüller
2010-12-01  9:15                 ` Simon Kirby
2010-12-01  9:15                   ` Simon Kirby

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101122161158.02699d10.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=scode@spotify.com \
    --cc=zalenski@spotify.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.