* Linux swapping with MySQL/InnoDB due to NUMA architecture imbalanced allocations?
From: Jeremy Cole @ 2010-09-23 22:29 UTC
To: Linux Memory Management
Hello Linux MM community!
I have some questions about Linux memory management, and I'm hoping
someone can check my theory. I've been researching why larger MySQL
servers sometimes tend to swap even though they don't appear to be
under memory pressure. The machine(s) in question have dual quad-core
Intel Nehalem processors and 64GB (16x4GB) of memory. They are used
for a MySQL+InnoDB workload with 48GB allocated to InnoDB's buffer
pool.
My theory is that it is caused by the interplay between the NUMA
memory allocation policy and the paging and page reclaim algorithms,
but I'd like a double-check on my understanding of it and whether my
theory makes sense.
With our configuration, MySQL allocates about 53GB of memory -- 48GB
for the buffer pool, plus a bunch of smaller allocations. Nearly all of
this happens at startup (as expected), and since the NUMA allocation
policy is "local" by default, the memory ends up being allocated
preferentially on Node 0 (in this case). Node 0 has less than 32GB
total, so once it is full, the allocation spills over into more than
half of Node 1 as well.
Now, since Node 0 has zero free memory, the entire file cache and most
of the other on-demand allocations on the system end up in Node 1's
free memory. So while the system has lots of "free" memory tied up in
caches, nearly none of it is on Node 0.
There seems to be a bundle of allocations which, for whatever reason,
must be satisfied from Node 0, and this causes some of the
already-allocated memory on Node 0 to be paged out to make room.
However, what gets paged out is mostly parts of InnoDB's buffer pool,
which inevitably need to be paged back in to satisfy a query fairly
soon afterwards.
(Note that this isn't always Node 0, sometimes Node 1 gets most of the
memory allocated and the exact same thing happens in reverse.)
This seems to be especially exacerbated on machines acting primarily
as a slave (a single writer reading from the master, with various logs
queuing up on disk) and on systems where backups are running.
On an exemplary running (production) system, free shows:
# free -m
             total       used       free     shared    buffers     cached
Mem:         64435      64218        216          0        240      10570
-/+ buffers/cache:      53406      11028
Swap:        31996      15068      16928
The InnoDB buffer pool can be found easily enough (by its sheer size)
in the numa_maps:
# grep 2aaaaad3e000 /proc/20520/numa_maps
2aaaaad3e000 default anon=13226274 dirty=13210834 swapcache=3440575
active=13226272 N0=7849384 N1=5376890
(anon=~50.45GB swapcache=~13.12GB N0=~29.94GB N1=~20.51GB)
(Note: For this case it would be quite interesting to see, e.g.,
N0_swapcache and N1_swapcache broken out separately. I noticed in the
code that these are only accounted as overall totals, not per node.
Would it make sense to tally them per node and then sum those for the
presentation of the totals?)
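(As an aside, the per-node totals for the whole process can be tallied
up from numa_maps with something along these lines -- this assumes
ordinary 4 KiB pages and no hugepages:)

# grep -o 'N[0-9]*=[0-9]*' /proc/20520/numa_maps \
    | awk -F= '{ pages[$1] += $2 }
               END { for (n in pages) printf "%s: %.2f GB\n", n, pages[n] * 4096 / 2^30 }'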
So the questions are:
1. Is it plausible that Linux for whatever reason needs memory to be
in Node 0, and chooses to page out used memory to make room, rather
than choosing to drop some of the cache in Node 1 and use that memory?
I think this is true, but maybe I've missed something important.
2. If so, under what circumstances would you expect to see that?
I think we have a solution to this (still in testing and
qualification) using "numactl --interleave=all" on the mysqld process,
but in the meantime I am hoping to check my understanding and theory.
This of course spreads all of the allocations across the two nodes,
which leaves some memory on each node available for file cache, putting
the nodes on at least equal footing when it comes to freeing memory.
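For the record, the workaround is nothing more exotic than prefixing
the server startup with numactl; since the policy is inherited across
fork/exec, wrapping mysqld_safe works just as well (paths here are only
illustrative):

# numactl --interleave=all /usr/sbin/mysqld --defaults-file=/etc/my.cnf &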
All replies, questions, clarifications, requests for clarification,
and requests to bugger off welcome!
Regards,
Jeremy
P.S.: Thank you a million times to Rik van Riel and to Mel Gorman;
your amazing documentation, wiki posts, mailing list replies, etc.,
have helped me immensely in understanding how things work and in
researching this issue.
* Re: Linux swapping with MySQL/InnoDB due to NUMA architecture imbalanced allocations?
From: Dave Hansen @ 2010-09-24 18:37 UTC
To: Jeremy Cole; +Cc: Linux Memory Management
On Thu, 2010-09-23 at 15:29 -0700, Jeremy Cole wrote:
> 1. Is it plausible that Linux for whatever reason needs memory to be
> in Node 0, and chooses to page out used memory to make room, rather
> than choosing to drop some of the cache in Node 1 and use that memory?
> I think this is true, but maybe I've missed something important.
Your situation sounds pretty familiar. It happens a lot when
applications are moved over to a NUMA system for the first time. Your
interleaving solution is a decent one, although teaching the database
about NUMA is a much better long-term approach.
As far as the decisions about running reclaim or swapping versus going
to another node for an allocation, take a look at the
"zone_reclaim_mode" bits in Documentation/sysctl/vm.txt . It does a
decent job of explaining what we do.
Most users new to NUMA systems just prefer to "echo 0 >
zone_reclaim_mode". I've also run into a fair number of "tuning" guides
that say to do this. It makes the allocator act much more as if NUMA
weren't there. It isn't as _optimized_ for NUMA locality then, but it
does tend to let you allocate memory more freely.
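Spelled out, since the file lives under /proc/sys/vm (and
/etc/sysctl.conf is the usual place to make it stick across reboots):

# cat /proc/sys/vm/zone_reclaim_mode         # see what you're running with now
# echo 0 > /proc/sys/vm/zone_reclaim_mode    # or: sysctl -w vm.zone_reclaim_mode=0
# echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf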
-- Dave
* Re: Linux swapping with MySQL/InnoDB due to NUMA architecture imbalanced allocations?
From: Jeremy Cole @ 2010-09-28 1:58 UTC
To: Dave Hansen; +Cc: Linux Memory Management
Dave,
Thanks for your response. This is helpful. My testing with
"numactl --interleave=all" is going well; so far it indicates that the
interleave policy completely eliminates the swapping, without incurring
a measurable performance penalty for my workload.
> Your situation sounds pretty familiar. It happens a lot when
> applications are moved over to a NUMA system for the first time. Your
> interleaving solution is a decent one, although teaching the database
> about NUMA is a much better long-term approach.
That's exactly my thought. I've read through the NUMA API
documentation, and it looks like it's not at all insurmountable to, at
the very least, ensure that the big cache allocations (InnoDB buffer
pool, MyISAM key buffer, etc.) are done interleaved, while leaving
most of the rest alone. My biggest worry with using "numactl
--interleave=all" is that all of the small buffers allocated for the
use of a single thread (for instance, query text buffers, sort
buffers, etc.) will also get spread around. I'm still working to fully
understand the performance implications of this, but I
don't think it will be terribly bad -- much better than the current
swapping situation, certainly.
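(Once mysqld is restarted under the interleave policy I plan to
sanity-check it against numa_maps, roughly like this, with the buffer
pool address found the same way as before; the policy field should then
read "interleave" rather than "default", and the N0=/N1= page counts
should come out roughly equal:)

# grep <buffer_pool_address> /proc/$(pidof mysqld)/numa_maps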
> As far as the decisions about running reclaim or swapping versus going
> to another node for an allocation, take a look at the
> "zone_reclaim_mode" bits in Documentation/sysctl/vm.txt . It does a
> decent job of explaining what we do.
I had read about zone_reclaim_mode, and I've also been testing
different settings for it, but I don't think it actually completely
solves the situation here. It seems to be primarily concerned with
allocations that *could* happen anywhere, whereas I think what we're
often seeing is that memory for whatever reason (which is not
completely obvious to me) *must* be allocated on Node X, but Node X
has no free memory and no caches to free.
Nonetheless, I have to admit that I don't completely understand the
documentation for zone_reclaim_mode in its current form. Perhaps you
could answer a few questions? I feel that the documentation could be
updated with some important answers, which are missing now:
1. What "zone reclaim" actually means. My understanding is that "zone
reclaim" is the practice of freeing memory on a specific node where
memory was preferentially requested (due to NUMA memory allocation
policy, by default "local") in favor of satisfying the allocation
using free memory from wherever it is currently available.
2. It isn't terribly clear what the default (0) policy is, and it
could use an explanation. Here's my take on it:
When zone_reclaim_mode = 0, programs requesting memory to be allocated
on a particular node will only receive memory on the requested node if
free memory is available. If no free memory is available on the
requested node, but free memory is available on a different node, the
allocation will be made there unless policy forbids it. If no free
memory is available on any node, then the normal cache freeing and
paging out policies will apply to make free memory available on any
node to satisfy the allocation. [Is there any preference for which
node caches are freed from in this case?]
Is this correct?
3. I found the descriptions of the possible values a bit too terse to
be usable. Here are some efforts to refine the definitions:
a. "1 = Zone reclaim on" -- This means that cache pages will be
freed to make free memory to satisfy the request only if they are not
dirty.
b. "2 = Zone reclaim writes dirty pages out" -- This means that
dirty cache pages will be written out and then freed if no clean pages
are available to be freed. This incurs additional cost due to disk
I/O.
c. "4 = Zone reclaim swaps pages" -- This means that anonymous pages
may be swapped out to disk and then freed if no clean pages are
available to be freed and (if bit 2 is set) no dirty cache pages are
available to be written out and freed. This incurs additional cost
due to swap I/O.
Do those refinements make sense and are they correct?
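(For concreteness, my understanding is that these values are bit flags
which get ORed together, e.g.:)

# echo 1 > /proc/sys/vm/zone_reclaim_mode    # local reclaim of clean page cache only
# echo 3 > /proc/sys/vm/zone_reclaim_mode    # 1+2: additionally write out dirty page cache
# echo 7 > /proc/sys/vm/zone_reclaim_mode    # 1+2+4: additionally swap out anonymous pages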
4. How is it determined that "pages from remote zones will cause a
measurable performance reduction"? My understanding is that this is
based on whether the node distance, as reported by "numactl
--hardware" is > RECLAIM_DISTANCE (by default defined as 20). In this
case zone_reclaim_mode will be set to 1 by default by the kernel,
meaning cache pages may be freed on the particular node to make free
memory in order to preferentially allocate for programs that request
on a particular node.
5. I cannot parse/understand this statement at all: "Allowing regular
swap effectively restricts allocations to the local node unless
explicitly overridden by memory policies or cpuset configurations." --
Could this be rephrased and/or explained?
Thanks, again, everyone.
Regards,
Jeremy
* Re: Linux swapping with MySQL/InnoDB due to NUMA architecture imbalanced allocations?
From: Dave Hansen @ 2010-10-01 16:43 UTC
To: Jeremy Cole; +Cc: Linux Memory Management
On Mon, 2010-09-27 at 18:58 -0700, Jeremy Cole wrote:
> > As far as the decisions about running reclaim or swapping versus going
> > to another node for an allocation, take a look at the
> > "zone_reclaim_mode" bits in Documentation/sysctl/vm.txt . It does a
> > decent job of explaining what we do.
>
> I had read about zone_reclaim_mode, and I've also been testing
> different settings for it, but I don't think it actually completely
> solves the situation here. It seems to be primarily concerned with
> allocations that *could* happen anywhere, whereas I think what we're
> often seeing is that memory for whatever reason (which is not
> completely obvious to me) *must* be allocated on Node X, but Node X
> has no free memory and no caches to free.
All allocations have a list of zones from which the allocation can be
satisfied. With many nodes, some may be closer than others, so
allocations _prefer_ to be on close (rather than distant) nodes.
zone_reclaim_mode basically speaks to how hard we work at reclaiming
locally before we give up and move down this priority list.
> Nonetheless, I have to admit that I don't completely understand the
> documentation for zone_reclaim_mode in its current form. Perhaps you
> could answer a few questions? I feel that the documentation could be
> updated with some important answers, which are missing now:
>
> 1. What "zone reclaim" actually means. My understanding is that "zone
> reclaim" is the practice of freeing memory on the specific node where
> memory was preferentially requested (due to the NUMA memory allocation
> policy, "local" by default), rather than satisfying the allocation
> with free memory from wherever it is currently available.
Reclaim is what happens when you ask for memory and we don't have any.
Zone reclaim is the process that we follow to get memory inside a
particular area.
> 2. It isn't terribly clear what the default (0) policy is, and it
> could use an explanation. Here's my take on it:
>
> When zone_reclaim_mode = 0, programs requesting memory to be allocated
> on a particular node will only receive memory on the requested node if
> free memory is available. If no free memory is available on the
> requested node, but free memory is available on a different node, the
> allocation will be made there unless policy forbids it. If no free
> memory is available on any node, then the normal cache freeing and
> paging out policies will apply to make free memory available on any
> node to satisfy the allocation. [Is there any preference for which
> node caches are freed from in this case?]
I think it's simpler than that. When it's 0, we don't try to reclaim
memory until the whole system is full. Basically, memory allocation
acts like it would on a system which isn't NUMA.
> Is this correct?
>
> 3. I found the descriptions of the possible values a bit too terse to
> be usable. Here are some efforts to refine the definitions:
>
> a. "1 = Zone reclaim on" -- This means that cache pages will be
> freed to make free memory to satisfy the request only if they are not
> dirty.
It also means that we'll try and drop some slab caches. I think you
can put it more generally: this will try to reclaim memory in ways
that won't cause extra writes to disk, but it might cause extra reads
at some point in the future.
> b. "2 = Zone reclaim writes dirty pages out" -- This means that
> dirty cache pages will be written out and then freed if no clean pages
> are available to be freed. This incurs additional cost due to disk
> I/O.
It can also cause processes doing writes to stall sooner than they might
have otherwise. That's kinda covered in the current documentation.
> c. "4 = Zone reclaim swaps pages" -- This means that anonymous pages
> may be swapped out to disk and then freed if no clean pages are
> available to be freed and (if bit 2 is set) no dirty cache pages are
> available to be written out and freed. This incurs additional cost
> due to swap I/O.
I wouldn't mention the other modes.
I'd encourage you to try and put together a patch for the documentation.
I can't tell you how many times I've scratched my head looking at that
particular entry.
> Do those refinements make sense and are they correct?
>
> 4. How is it determined that "pages from remote zones will cause a
> measurable performance reduction"? My understanding is that this is
> based on whether the node distance, as reported by "numactl
> --hardware" is > RECLAIM_DISTANCE (by default defined as 20).
Exactly. We take the BIOS tables (or whatever they're called on
non-x86) and translate them into a reclaim distance. If it's >20, then
we say "pages from remote zones will cause a measurable performance
reduction".
> 5. I cannot parse/understand this statement at all: "Allowing regular
> swap effectively restricts allocations to the local node unless
> explicitly overridden by memory policies or cpuset configurations." --
> Could this be rephrased and/or explained?
Yeah, that's pretty obtuse. :)
If you ask for memory from one zone, you'll get it from that zone. The
kernel will do everything it can to give you memory from that zone,
including swapping pages out. You implicitly ask for memory from the
current node for every allocation, so this will effectively restrict
your memory use to the local node, unless you override it.
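The override would be the usual mempolicy/cpuset machinery, for
example:

# numactl --interleave=all <cmd>    # spread page allocations round-robin across all nodes
# numactl --preferred=1 <cmd>       # prefer node 1, but fall back to other nodes
# numactl --membind=0 <cmd>         # hard-restrict allocations to node 0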
-- Dave