* cleancache can lead to serious performance degradation

From: Nebojsa Trpkovic @ 2011-08-17 21:57 UTC
To: linux-kernel

Hello.

I've tried using cleancache on my file server and came to the conclusion
that my Core2 Duo 4MB L2 cache 2.33GHz CPU cannot cope with the amount of
data it needs to compress during heavy sequential IO when cleancache/zcache
are enabled.

For example, with cleancache enabled I get 60-70MB/s from my RAID arrays
and both CPU cores are saturated with system (kernel) time. Without
cleancache, each RAID gives me more than 300MB/s of useful read throughput.

In the scenario of sequential reading, this drop in throughput seems
completely normal:
- a lot of data gets pulled in from disks
- the data is processed in some non-CPU-intensive way
- the page cache fills up quickly and cleancache starts compressing pages
  (a lot of "puts" in /sys/kernel/mm/cleancache/)
- these compressed cleancache pages never get read, because a whole lot of
  new pages come in every second, replacing the old ones (practically no
  "succ_gets" in /sys/kernel/mm/cleancache/)
- the CPU saturates doing useless compression, and even worse:
- new disk read operations wait for the CPU to finish compressing and make
  some space in memory

So, using cleancache in scenarios with a lot of non-random data throughput
can lead to very bad performance degradation.

I guess a possible workaround could be to implement some kind of
compression throttling valve for cleancache/zcache:

- if there's available CPU time (idle cycles or so), then compress (maybe
  even with a low CPU scheduler priority);

- if there's no available CPU time, just store (or throw away) to avoid IO
  waits.

At the very least, there should be a warning in the kernel help text about
this kind of situation.

Regards,
Nebojsa Trpkovic
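
[A minimal sketch of the throttling valve proposed above, under stated
assumptions: zcache_do_compress() and zcache_reject() are hypothetical
stand-ins for zcache's real put path, and the headroom heuristic is only
illustrative. nr_running() and num_online_cpus() are real kernel helpers.]

#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/mm.h>

/* hypothetical hooks standing in for zcache's real put path */
extern int zcache_do_compress(struct page *page);
extern int zcache_reject(struct page *page);

static bool cpu_has_headroom(void)
{
	/*
	 * Crude global heuristic: with more runnable tasks than online
	 * CPUs, the system is saturated, and compressing a clean page
	 * that will probably never be read back is a losing trade.
	 */
	return nr_running() <= num_online_cpus();
}

static int throttled_put(struct page *page)
{
	if (cpu_has_headroom())
		return zcache_do_compress(page);
	return zcache_reject(page);	/* drop it; a later read hits disk */
}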
* Re: cleancache can lead to serious performance degradation

From: Konrad Rzeszutek Wilk @ 2011-08-25 4:12 UTC
To: Nebojsa Trpkovic, dan.magenheimer; +Cc: linux-kernel

On Wed, Aug 17, 2011 at 11:57:50PM +0200, Nebojsa Trpkovic wrote:
> Hello.

I've put Dan on the CC since he is the author of it.

> I've tried using cleancache on my file server and came to the conclusion
> that my Core2 Duo 4MB L2 cache 2.33GHz CPU cannot cope with the amount
> of data it needs to compress during heavy sequential IO when
> cleancache/zcache are enabled.
> [...]
* RE: cleancache can lead to serious performance degradation

From: Dan Magenheimer @ 2011-08-25 16:56 UTC
To: Nebojsa Trpkovic
Cc: linux-kernel, Konrad Wilk, Andrew Morton, Seth Jennings, Nitin Gupta

> From: Konrad Rzeszutek Wilk
> I've put Dan on the CC since he is the author of it.

Thanks for forwarding, Konrad (and thanks, akpm, for the offlist forward).

Hi Nebojsa --

> > I've tried using cleancache on my file server and came to the
> > conclusion that my Core2 Duo 4MB L2 cache 2.33GHz CPU cannot cope
> > with the amount of data it needs to compress during heavy sequential
> > IO when cleancache/zcache are enabled.
> >
> > For example, with cleancache enabled I get 60-70MB/s from my RAID
> > arrays and both CPU cores are saturated with system (kernel) time.
> > Without cleancache, each RAID gives me more than 300MB/s of useful
> > read throughput.

Hmmm... yes, an older processor with multiple very fast storage devices
could certainly be overstressed by all the compression. Are your
measurements on a real workload or a benchmark? Can you describe your
configuration more (e.g. number of spindles -- or SSDs?)? Is any swapping
occurring? In any case, please see below.

> > In the scenario of sequential reading, this drop in throughput seems
> > completely normal:
> > - a lot of data gets pulled in from disks
> > - the data is processed in some non-CPU-intensive way
> > - the page cache fills up quickly and cleancache starts compressing
> >   pages (a lot of "puts" in /sys/kernel/mm/cleancache/)
> > - these compressed cleancache pages never get read, because a whole
> >   lot of new pages come in every second, replacing the old ones
> >   (practically no "succ_gets" in /sys/kernel/mm/cleancache/)
> > - the CPU saturates doing useless compression, and even worse:
> > - new disk read operations wait for the CPU to finish compressing and
> >   make some space in memory
> >
> > So, using cleancache in scenarios with a lot of non-random data
> > throughput can lead to very bad performance degradation.

Your analysis is correct IF compression of a page is very slow and "a lot
of data" is *really* a lot. But...

First, I don't recommend that zcache+cleancache be used without frontswap,
because the additional memory pressure from saving compressed clean
pagecache pages may sometimes result in swapping (and frontswap will
absorb much of the performance loss). I know frontswap isn't officially
merged yet, but that's an artifact of the process for submitting a complex
patchset (transcendent memory) that crosses multiple subsystems. (See
https://lwn.net/Articles/454795/ if you're not familiar with the whole
transcendent memory picture.)

Second, I believe that for any caching mechanism -- hardware or software
-- in the history of computing, it is possible to identify workloads or
benchmarks that render the cache counter-productive. Zcache+cleancache is
not immune to that, and benchmarks intended to measure sequential disk
read throughput are likely to make zcache look like a problem. However,
Phoronix's benchmarking (see
http://openbenchmarking.org/result/1105276-GR-CLEANCACH02) didn't show
anything like the performance degradation you observed, even though it was
also run on a Core 2 Duo, albeit apparently with only one spindle.
Third, zcache is relatively new and can certainly benefit from the input
of other developers. The lzo1x compression in the kernel is fairly slow;
Seth Jennings (cc'ed) is looking into alternate compression technologies.
Perhaps there is a better compression choice, more suitable for
older/slower processors, probably with a poorer compression ratio.
Further, zcache currently does compression and decompression with
interrupts disabled, which may be a significant factor in the slowdowns
you've observed. This should be fixable. Also, the policies that determine
acceptance into and reclaim from cleancache are still fairly primitive,
but the dynamicity of cleancache should allow some interesting heuristics
to be explored (as you suggest below).

Fourth, cleancache usually shows a fairly low "hit rate" but still a solid
performance improvement. I've looked at improving the hit rate by moving
into cleancache only pages that had previously been on the active list,
which would likely solve your problem, but the patch wasn't really
upstreamable and may make cleancache function less well on other workloads
and in other environments (such as in Xen, where the size of transcendent
memory may greatly exceed the size of the guest's pagecache).

> > I guess a possible workaround could be to implement some kind of
> > compression throttling valve for cleancache/zcache:
> >
> > - if there's available CPU time (idle cycles or so), then compress
> >   (maybe even with a low CPU scheduler priority);

Agreed, and this low-priority kernel thread would ideally also solve the
"compress while irqs disabled" problem!

> > - if there's no available CPU time, just store (or throw away) to
> >   avoid IO waits.

Any ideas on how to code this test (for "no available CPU time")?

> > At the very least, there should be a warning in the kernel help text
> > about this kind of situation.

That's a good point. I'll submit a patch updating the Kconfig and the
Documentation/vm/cleancache.txt FAQ to identify the concern. Documentation
for zcache isn't in place yet because frontswap isn't yet merged and I
don't want to encourage zcache use without it, but I'll ensure a warning
gets added somewhere in zcache as well.

If you have any more comments or questions, please cc me directly, as I'm
not able to keep up with all the lkml traffic, especially when
traveling... when you posted this I was at LinuxCon 2011 talking about
transcendent memory! See:
http://events.linuxfoundation.org/events/linuxcon/magenheimer

Dan
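
[One hedged sketch of the low-priority compression thread discussed above:
compress_pending() and compress_one_page() are hypothetical names for the
queue that such a design would need, while kthread_run(), set_user_nice()
and the wait-queue calls are real kernel APIs. A nice value of 19 means the
thread only gets CPU time that nothing else wants, which is exactly the
"use idle cycles" behavior proposed; it also moves compression out of
irq-disabled context.]

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/err.h>
#include <linux/init.h>

/* hypothetical: true while pages are queued awaiting compression */
extern bool compress_pending(void);
/* hypothetical: compress one queued page, with irqs enabled */
extern void compress_one_page(void);

static DECLARE_WAIT_QUEUE_HEAD(zcache_compress_wq);

static int zcache_compress_fn(void *unused)
{
	set_user_nice(current, 19);	/* lowest priority: idle cycles only */
	while (!kthread_should_stop()) {
		wait_event_interruptible(zcache_compress_wq,
				compress_pending() || kthread_should_stop());
		while (compress_pending()) {
			compress_one_page();
			cond_resched();	/* yield to anything more urgent */
		}
	}
	return 0;
}

static int __init zcache_compress_thread_init(void)
{
	struct task_struct *t = kthread_run(zcache_compress_fn, NULL,
					    "zcache-compress");
	return IS_ERR(t) ? PTR_ERR(t) : 0;
}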
* Re: cleancache can lead to serious performance degradation

From: Seth Jennings @ 2011-08-26 13:24 UTC
To: Dan Magenheimer
Cc: Nebojsa Trpkovic, linux-kernel, Konrad Wilk, Andrew Morton, Nitin Gupta

On 08/25/2011 11:56 AM, Dan Magenheimer wrote:

> Third, zcache is relatively new and can certainly benefit from the input
> of other developers. The lzo1x compression in the kernel is fairly slow;
> Seth Jennings (cc'ed) is looking into alternate compression
> technologies. Perhaps there is a better compression choice, more
> suitable for older/slower processors, probably with a poorer compression
> ratio. Further, zcache currently does compression and decompression with
> interrupts disabled, which may be a significant factor in the slowdowns
> you've observed. This should be fixable.

This is something I've been meaning to ask about. Why are compression and
decompression done with interrupts disabled? What would need to change so
that we don't have to disable interrupts?

<cut>

>>> I guess a possible workaround could be to implement some kind of
>>> compression throttling valve for cleancache/zcache:
>>>
>>> - if there's available CPU time (idle cycles or so), then compress
>>>   (maybe even with a low CPU scheduler priority);
>
> Agreed, and this low-priority kernel thread would ideally also solve the
> "compress while irqs disabled" problem!

--
Seth
* RE: cleancache can lead to serious performance degradation

From: Dan Magenheimer @ 2011-08-26 14:42 UTC
To: Seth Jennings
Cc: Nebojsa Trpkovic, linux-kernel, Konrad Wilk, Andrew Morton, Nitin Gupta

> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
>
> This is something I've been meaning to ask about. Why are compression
> and decompression done with interrupts disabled?

The "hb->lock" is held during most tmem operations. If a tmem operation
(or the callback into zcache) is interrupted and the current cpu is
scheduled to run another task, and the new task calls into tmem, deadlock
could occur. In some cases, I think disabling preemption or bottom halves
instead of disabling interrupts may be sufficient, but I ran into problems
when I tried that and never got back to it. Note, though, that interrupts
are already disabled in some cases when cleancache is called, so that
would only solve part of the problem.

> What would need to change so that we don't have to disable interrupts?

Not easy, but not terribly hard, I think:

1) On put, copy the page into tmem uncompressed, and keep a list of
not-yet-compressed pages. (The copy would still need to be done with
interrupts disabled, but copying a page is presumably one to three orders
of magnitude faster than compressing/decompressing it.)

2) An independent lower-priority thread would be launched periodically to
compress one or more pages on the uncompressed list and atomically replace
each uncompressed page with its compressed version, changing all pointers.
The uncompressed page could then be freed. There are likely some ugly race
conditions in here.

3) On get, ensure not-yet-compressed pages are properly handled.

Dan

P.S. On vacation today, no more email until next week.
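
[A rough sketch of step 1 of the plan above, under stated assumptions:
struct zpage, its fields, and the uncompressed list are invented for
illustration, and the atomic-replacement and race handling from step 2 are
deliberately elided. The put path does only a plain copy, which is why it
tolerates irq-disabled context; the background thread would later compress
zp->data and swap in the smaller buffer.]

#include <linux/list.h>
#include <linux/highmem.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/mm.h>

struct zpage {			/* hypothetical per-page descriptor */
	struct list_head lru;
	void *data;		/* PAGE_SIZE raw now, smaller once compressed */
	size_t len;
	bool compressed;
};

static LIST_HEAD(uncompressed_list);
static DEFINE_SPINLOCK(uncompressed_lock);

/* step 1: called from the put path, possibly with irqs already disabled */
static int zpage_put_raw(struct page *page, struct zpage *zp)
{
	unsigned long flags;
	void *src;

	zp->data = kmalloc(PAGE_SIZE, GFP_ATOMIC);	/* can't sleep here */
	if (!zp->data)
		return -ENOMEM;

	src = kmap_atomic(page);
	memcpy(zp->data, src, PAGE_SIZE);	/* plain copy, no compression */
	kunmap_atomic(src);
	zp->len = PAGE_SIZE;
	zp->compressed = false;

	spin_lock_irqsave(&uncompressed_lock, flags);
	list_add_tail(&zp->lru, &uncompressed_list);	/* worker drains this */
	spin_unlock_irqrestore(&uncompressed_lock, flags);

	return 0;
}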
* Re: cleancache can lead to serious performance degradation

From: Nebojsa Trpkovic @ 2011-08-29 0:45 UTC
To: Dan Magenheimer
Cc: linux-kernel, Konrad Wilk, Andrew Morton, Seth Jennings, Nitin Gupta

Thank you, everybody, for reviewing my report.

On 08/25/11 18:56, Dan Magenheimer wrote:
> Are your measurements on a real workload or a benchmark? Can you
> describe your configuration more (e.g. number of spindles -- or SSDs?)?
> Is any swapping occurring?

I noticed the performance degradation during real workloads; I did not run
any special benchmarks to prove this problem. All my conclusions are based
on everyday usage scenarios.

I use a multi-purpose (read: all-purpose) server with an Intel Core 2 Duo
E6550, 8GB DDR2, 4 1Gbps NICs and 16 1.5TB 5.4k rpm hard drives, serving a
LAN with ~50 workstations and a WAN with a couple of hundred clients.

Usually, 50 to 60% of RAM is used by "applications". Most of the rest is
used for cache. Swap allocation is usually less than 100MB (divided across
three spindles). Swapping is rare (I monitor both swap usage in MB and
swap reads/writes, along with many other parameters).

The drives are partitioned; some partitions are stand-alone, some are in
software RAID1 and some are in software RAID5, depending on each
partition's purpose. The spindles are slow, but RAID5 makes it possible to
reach throughputs high enough to be affected by cleancache/zcache.

Just for insight, here is _synthetic_ RAID5 read performance during a
light night-time server load (not an isolated test with all other services
shut down):

/dev/md2:
 Timing buffered disk reads: 1044 MB in 3.00 seconds = 347.73 MB/sec
/dev/md3:
 Timing buffered disk reads: 1078 MB in 3.02 seconds = 356.94 MB/sec
/dev/md4:
 Timing buffered disk reads: 1170 MB in 3.00 seconds = 389.86 MB/sec

Scenarios affected by cleancache/zcache usage include:

- hashing of DirectConnect (DC++) shares on RAID5 arrays full of big files
  like ISO images (~120MB/s without cleancache/zcache, as microdc2 uses
  just one thread to hash)

- copying big files over gigabit LAN to an SSD in my workstation (without
  cleancache/zcache, up to ~105MB/s via NFS and ~117MB/s via FTP)

- copying big files between RAID5 arrays that do not share any spindle
  (without cleancache/zcache, performance varies heavily with the current
  server workload: 150-250MB/s)

In all these scenarios, enabling cleancache/zcache caps throughput at
60-70MB/s.

> First, I don't recommend that zcache+cleancache be used without
> frontswap, because the additional memory pressure from saving compressed
> clean pagecache pages may sometimes result in swapping (and frontswap
> will absorb much of the performance loss). I know frontswap isn't
> officially merged yet, but that's an artifact of the process for
> submitting a complex patchset (transcendent memory) that crosses
> multiple subsystems. (See https://lwn.net/Articles/454795/ if you're
> not familiar with the whole transcendent memory picture.)

I'll do my best to get familiar with the whole transcendent memory story
and give frontswap a try as soon as I can. Unfortunately, I'm afraid I'll
have to postpone that for at least a couple of weeks.

>>> - if there's no available CPU time, just store (or throw away) to
>>>   avoid IO waits.
>
> Any ideas on how to code this test (for "no available CPU time")?
I can't help much with this question, as I have no practical experience in
code development, especially OS-related, but perhaps an approach similar
to other kernel subsystems could be used. For example, cpufreq makes
decisions about CPU clock changes based on current CPU usage, by sampling
recent load/usage statistics. Obviously, I have no idea whether something
similar could be used with zcache, but this was my best shot. :)

> If you have any more comments or questions, please cc me directly, as
> I'm not able to keep up with all the lkml traffic, especially when
> traveling... when you posted this I was at LinuxCon 2011 talking about
> transcendent memory! See:
> http://events.linuxfoundation.org/events/linuxcon/magenheimer

Please CC me directly on any further messages regarding this problem, too.

Last but not least, thank you for developing such a great feature for the
Linux kernel!

Best Regards,
Nebojsa Trpkovic
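
[For concreteness, a sketch of what that cpufreq-style sampling could look
like. get_cpu_idle_time_us() is the real helper the ondemand governor uses
(it needs NO_HZ to be enabled); the gate variable, the 25% threshold, the
~100ms window, and sampling only CPU 0 are all illustrative assumptions. A
put path like the throttled_put() sketch earlier in the thread would test
zcache_gate_open instead of computing load itself.]

#include <linux/tick.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

static bool zcache_gate_open = true;	/* the put path would test this */
static u64 prev_idle, prev_wall;

static void sample_load(struct work_struct *work);
static DECLARE_DELAYED_WORK(sample_work, sample_load);

static void sample_load(struct work_struct *work)
{
	u64 wall;
	u64 idle = get_cpu_idle_time_us(0, &wall);	/* CPU 0 only, for brevity */
	u64 didle = idle - prev_idle;
	u64 dwall = wall - prev_wall;

	/* open the gate only if the CPU was >25% idle over the last window */
	zcache_gate_open = dwall && (didle * 4 >= dwall);

	prev_idle = idle;
	prev_wall = wall;
	schedule_delayed_work(&sample_work, HZ / 10);	/* ~100ms window */
}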
* RE: cleancache can lead to serious performance degradation

From: Dan Magenheimer @ 2011-08-29 15:08 UTC
To: Nebojsa Trpkovic
Cc: linux-kernel, Konrad Wilk, Andrew Morton, Seth Jennings, Nitin Gupta

> From: Nebojsa Trpkovic [mailto:trx.lists@gmail.com]
> Subject: Re: cleancache can lead to serious performance degradation
>
> Thank you, everybody, for reviewing my report.

Thanks for reporting it! To paraphrase Clint Eastwood: "A kernel
developer's got to know his patchset's limitations". (Clint said "A man's
got to know his limitations" in Magnum Force, 1973.)

> I noticed the performance degradation during real workloads; I did not
> run any special benchmarks to prove this problem. All my conclusions are
> based on everyday usage scenarios.
>
> I use a multi-purpose (read: all-purpose) server with an Intel Core 2
> Duo E6550, 8GB DDR2, 4 1Gbps NICs and 16 1.5TB 5.4k rpm hard drives,
> serving a LAN with ~50 workstations and a WAN with a couple of hundred
> clients.
> <snip>

OK, so the old CPU is mostly acting as a fancy router between 16 spindles
and 4 fast NICs, transferring very large "packets" (sequential files). I
can see how that kind of workload would not be kind to zcache, though a
more modern multi-core (4-core i7?) might handle it better. And, as
suggested earlier in this thread, a faster but less space-efficient
compression algorithm would be useful too, along with some policy that
senses when compression is doing more harm than good.

> Last but not least, thank you for developing such a great feature for
> the Linux kernel!

Thanks! Many people contributed (especially Nitin Gupta). I hope other
users and workloads have better luck with it than yours does!

> Best Regards,
> Nebojsa Trpkovic

Thanks,
Dan