* zsmalloc limitations and related topics
From: Dan Magenheimer @ 2013-02-27 23:24 UTC (permalink / raw)
To: minchan, sjenning, Nitin Gupta
Cc: Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato,
Mel Gorman
Hi all --
I've been doing some experimentation on zsmalloc in preparation
for my topic proposed for LSFMM13 and have run across some
perplexing limitations. Those familiar with the intimate details
of zsmalloc might be well aware of these limitations, but they
aren't documented or immediately obvious, so I thought it would
be worthwhile to air them publicly. I've also included some
measurements from the experimentation and some related thoughts.
(Some of the terms here are unusual and may be used inconsistently
by different developers so a glossary of definitions of the terms
used here is appended.)
ZSMALLOC LIMITATIONS
Zsmalloc is used for two zprojects: zram and the out-of-tree
zswap. Zsmalloc can achieve high density when "full". But:
1) Zsmalloc has a worst-case density of 0.25 (one zpage per
four pageframes: a single remaining zpage can pin an entire
zspage spanning up to four pageframes).
2) When not full and especially when nearly-empty _after_
being full, density may fall below 1.0 as a result of
fragmentation.
3) Zsmalloc has a density of exactly 1.0 for any number of
zpages with zsize >= 0.8 * PAGE_SIZE (i.e. pages that compress
to 80% or more of a pageframe).
4) Zsmalloc contains several compile-time parameters;
the best value of these parameters may be very workload
dependent.
If density == 1.0, that means we are paying the overhead of
compression+decompression for no space advantage. If
density < 1.0, that means using zsmalloc is detrimental,
resulting in worse memory pressure than if it were not used.
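
As a back-of-the-envelope illustration (not zsmalloc-specific code,
just the arithmetic behind these thresholds, using the density
figures quoted elsewhere in this note), the following userspace
sketch shows how density translates into pageframes consumed:

#include <stdio.h>

/* pageframes needed to hold 'zpages' compressed pages at a given
 * density (density = zpages per pageframe, as defined in the
 * glossary below) */
static double pageframes_used(double zpages, double density)
{
        return zpages / density;
}

int main(void)
{
        double zpages = 1000;           /* arbitrary example count */
        double densities[] = { 3.2,     /* kernbench measurement below */
                               1.0,     /* break-even: no space advantage */
                               0.25 };  /* zsmalloc worst case */
        int i;

        for (i = 0; i < 3; i++) {
                double used = pageframes_used(zpages, densities[i]);
                printf("density %.2f: %4.0f zpages occupy %4.0f pageframes (%s)\n",
                       densities[i], zpages, used,
                       used < zpages ? "saves memory" :
                       used > zpages ? "costs memory" : "break-even");
        }
        return 0;
}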
WORKLOAD ANALYSIS
These limitations emphasize that the workload used to evaluate
zsmalloc is very important. Benchmarks that measure data
throughput or CPU utilization are of questionable value because
it is the _content_ of the data that is particularly relevant
for compression. More precisely, it is the "entropy" of the
data that matters, because compressibility is determined by
entropy: an entirely random pagefull of bits will compress
poorly, while a highly-regular pagefull of bits will compress
well (a short userspace demonstration appears below).
Since the zprojects manage a large number of zpages, both
the mean and distribution of zsize of the workload should
be "representative".
The workload most widely used to publish results for
the various zprojects is a kernel-compile using "make -jN"
where N is artificially increased to impose memory pressure.
By adding some debug code to zswap, I was able to analyze
this workload and found the following (a sketch of this kind
of instrumentation appears after the list):
1) The average page compressed by almost a factor of six
(mean zsize == 694, stddev == 474)
2) Almost eleven percent of the pages were zero pages. A
zero page compresses to 28 bytes.
3) On average, 77% of the bytes (3156) in the pages-to-be-
compressed contained a byte-value of zero.
4) Despite the above, mean density of zsmalloc was measured at
3.2 zpages/pageframe, presumably losing nearly half of
available space to fragmentation.
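
For reference, statistics like these can be collected with very
little code. The following is a userspace mock-up of that kind of
instrumentation, not the actual debug patch; the zstat_* names are
purely illustrative, and in zswap the record function would be
called from the store path with each page and its compressed length:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

#define PAGE_SIZE 4096

static unsigned long pages, zero_pages, zero_bytes_total;
static double zsize_mean, zsize_m2;     /* Welford running mean/variance */

/* Record one stored page: its original contents and its compressed length. */
static void zstat_record(const uint8_t *page, size_t zsize)
{
        size_t i, zero_bytes = 0;
        double delta;

        for (i = 0; i < PAGE_SIZE; i++)
                if (page[i] == 0)
                        zero_bytes++;

        pages++;
        zero_bytes_total += zero_bytes;
        if (zero_bytes == PAGE_SIZE)
                zero_pages++;

        /* Welford's online update: no need to keep every sample around */
        delta = (double)zsize - zsize_mean;
        zsize_mean += delta / pages;
        zsize_m2 += delta * ((double)zsize - zsize_mean);
}

static void zstat_report(void)
{
        double stddev = pages > 1 ? sqrt(zsize_m2 / (pages - 1)) : 0.0;

        if (!pages)
                return;
        printf("pages=%lu mean zsize=%.0f stddev=%.0f zero pages=%.1f%% zero bytes=%.1f%%\n",
               pages, zsize_mean, stddev,
               100.0 * zero_pages / pages,
               100.0 * zero_bytes_total / ((double)pages * PAGE_SIZE));
}

int main(void)
{
        /* Feed two fake samples just to show the interface; the zsize
         * values passed here are made up, not measured. */
        static uint8_t zero_page[PAGE_SIZE];
        static uint8_t text_page[PAGE_SIZE];

        memset(text_page, 'a', PAGE_SIZE / 4);  /* mostly-zero page */
        zstat_record(zero_page, 28);
        zstat_record(text_page, 694);
        zstat_report();
        return 0;
}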
I have no clue if these measurements are representative
of a wide range of workloads over the lifetime of a booted
machine, but I am suspicious that they are not. For example,
the lzo1x compression algorithm claims to compress data by
about a factor of two.
I would welcome ideas on how to evaluate workloads for
"representativeness". Personally I don't believe we should
be making decisions about selecting the "best" algorithms
or merging code without an agreement on workloads.
PAGEFRAME EVACUATION AND RECLAIM
I've repeatedly stated the opinion that managing the number of
pageframes containing compressed pages will be valuable for
managing MM interaction/policy when compression is used in
the kernel. After the experimentation above and some brainstorming,
I still do not see an effective method for zsmalloc evacuating and
reclaiming pageframes, because both are complicated by high density
and page-crossing. In other words, zsmalloc's strengths may
also be its Achilles' heel. For zram, as far as I can see,
pageframe evacuation/reclaim is irrelevant except perhaps
as part of mass defragmentation. For zcache and zswap, where
writethrough is used, pageframe evacuation/reclaim is very relevant.
(Note: The writeback implemented in zswap does _zpage_ evacuation
without pageframe reclaim.)
CLOSING THOUGHT
Since zsmalloc and zbud have different strengths and weaknesses,
I wonder if some combination or hybrid might be more optimal?
But unless/until we have and can measure a representative workload,
only intuition can answer that.
GLOSSARY
zproject -- a kernel project using compression (zram, zcache, zswap)
zpage -- the compressed form of a PAGE_SIZE-byte page of data
zsize -- the number of bytes in a compressed page
pageframe -- the term "page" is widely used to describe
either (1) PAGE_SIZE bytes of data, or (2) a physical RAM
area of size PAGE_SIZE which is PAGE_SIZE-aligned,
as represented in the kernel by a struct page. To be explicit,
we refer to (2) as a pageframe.
density -- zpages per pageframe; higher is (presumably) better
zsmalloc -- a slab-based allocator written by Nitin Gupta to
efficiently store zpages and designed to allow zpages
to be split across two non-contiguous pageframes
zspage -- a grouping of N non-contiguous pageframes managed
as a unit by zsmalloc to store zpages for which zsize
falls within a certain range. (The compile-time
default maximum value of N is 4.)
zbud -- a buddy-based allocator written by Dan Magenheimer
(specifically for zcache) to predictably store zpages;
no more than two zpages are stored in any pageframe
pageframe evacuation/reclaim -- the process of removing
zpages from one or more pageframes, including pointers/nodes
from any data structures referencing those zpages,
so that the pageframe(s) can be freed for use by
the rest of the kernel
writeback -- the process of transferring zpages from
storage in a zproject to a backing swap device
lzo1x -- a compression algorithm used by default by all the
zprojects; the kernel implementation resides in lib/lzo/
entropy -- randomness of data to be compressed; higher entropy
means worse data compression
* RE: zsmalloc limitations and related topics
From: Dan Magenheimer @ 2013-02-28 22:00 UTC (permalink / raw)
To: minchan, sjenning, Nitin Gupta
Cc: Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato,
Mel Gorman

> From: Dan Magenheimer
> Subject: zsmalloc limitations and related topics
>
> WORKLOAD ANALYSIS
> :
> 1) The average page compressed by almost a factor of six
> (mean zsize == 694, stddev == 474)
> 2) Almost eleven percent of the pages were zero pages. A
> zero page compresses to 28 bytes.
> 3) On average, 77% of the bytes (3156) in the pages-to-be-
> compressed contained a byte-value of zero.
> 4) Despite the above, mean density of zsmalloc was measured at
> 3.2 zpages/pageframe, presumably losing nearly half of
> available space to fragmentation.
>
> I have no clue if these measurements are representative
> of a wide range of workloads over the lifetime of a booted
> machine, but I am suspicious that they are not. For example,
> the lzo1x compression algorithm claims to compress data by
> about a factor of two.

I realized that with a small hack in zswap, I could simulate
the effect on zsmalloc of a workload with very different zsize
distribution, one with a much higher mean, by simply doubling
(and tripling) the zsize passed to zs_malloc. The results:

Unchanged: mean=694  stddev=474  -> mean density = 3.2
Doubled:   mean=1340 stddev=842  -> mean density = 1.9
Tripled:   mean=1636 stddev=1031 -> mean density = 1.6

Note that even tripled, the mean of the simulated distribution
is still much lower than PAGE_SIZE/2, which is roughly the
published expected compression for lzo1x. So one would still
expect a mean density greater than two but, apparently,
one-third of available space is lost to fragmentation.

Without a "representative" workload, I still have no clue as
to whether this simulated distribution is relevant, but it is
interesting to note that, for a workload with lower mean
compressibility, zsmalloc's reputation as "high density" may
be undeserved.
* Re: zsmalloc limitations and related topics 2013-02-27 23:24 zsmalloc limitations and related topics Dan Magenheimer 2013-02-28 22:00 ` Dan Magenheimer @ 2013-03-01 1:40 ` Ric Mason 2013-03-04 18:29 ` Dan Magenheimer 2013-03-13 15:14 ` Robert Jennings 2 siblings, 1 reply; 17+ messages in thread From: Ric Mason @ 2013-03-01 1:40 UTC (permalink / raw) To: Dan Magenheimer Cc: minchan, sjenning, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman On 02/28/2013 07:24 AM, Dan Magenheimer wrote: > Hi all -- > > I've been doing some experimentation on zsmalloc in preparation > for my topic proposed for LSFMM13 and have run across some > perplexing limitations. Those familiar with the intimate details > of zsmalloc might be well aware of these limitations, but they > aren't documented or immediately obvious, so I thought it would > be worthwhile to air them publicly. I've also included some > measurements from the experimentation and some related thoughts. > > (Some of the terms here are unusual and may be used inconsistently > by different developers so a glossary of definitions of the terms > used here is appended.) > > ZSMALLOC LIMITATIONS > > Zsmalloc is used for two zprojects: zram and the out-of-tree > zswap. Zsmalloc can achieve high density when "full". But: > > 1) Zsmalloc has a worst-case density of 0.25 (one zpage per > four pageframes). > 2) When not full and especially when nearly-empty _after_ > being full, density may fall below 1.0 as a result of > fragmentation. What's the meaning of nearly-empty _after_ being full? > 3) Zsmalloc has a density of exactly 1.0 for any number of > zpages with zsize >= 0.8. > 4) Zsmalloc contains several compile-time parameters; > the best value of these parameters may be very workload > dependent. > > If density == 1.0, that means we are paying the overhead of > compression+decompression for no space advantage. If > density < 1.0, that means using zsmalloc is detrimental, > resulting in worse memory pressure than if it were not used. > > WORKLOAD ANALYSIS > > These limitations emphasize that the workload used to evaluate > zsmalloc is very important. Benchmarks that measure data Could you share your benchmark? In order that other guys can take advantage of it. > throughput or CPU utilization are of questionable value because > it is the _content_ of the data that is particularly relevant > for compression. Even more precisely, it is the "entropy" > of the data that is relevant, because the amount of > compressibility in the data is related to the entropy: > I.e. an entirely random pagefull of bits will compress poorly > and a highly-regular pagefull of bits will compress well. > Since the zprojects manage a large number of zpages, both > the mean and distribution of zsize of the workload should > be "representative". > > The workload most widely used to publish results for > the various zprojects is a kernel-compile using "make -jN" > where N is artificially increased to impose memory pressure. > By adding some debug code to zswap, I was able to analyze > this workload and found the following: > > 1) The average page compressed by almost a factor of six > (mean zsize == 694, stddev == 474) stddev is what? > 2) Almost eleven percent of the pages were zero pages. A > zero page compresses to 28 bytes. > 3) On average, 77% of the bytes (3156) in the pages-to-be- > compressed contained a byte-value of zero. 
> 4) Despite the above, mean density of zsmalloc was measured at > 3.2 zpages/pageframe, presumably losing nearly half of > available space to fragmentation. > > I have no clue if these measurements are representative > of a wide range of workloads over the lifetime of a booted > machine, but I am suspicious that they are not. For example, > the lzo1x compression algorithm claims to compress data by > about a factor of two. > > I would welcome ideas on how to evaluate workloads for > "representativeness". Personally I don't believe we should > be making decisions about selecting the "best" algorithms > or merging code without an agreement on workloads. > > PAGEFRAME EVACUATION AND RECLAIM > > I've repeatedly stated the opinion that managing the number of > pageframes containing compressed pages will be valuable for > managing MM interaction/policy when compression is used in > the kernel. After the experimentation above and some brainstorming, > I still do not see an effective method for zsmalloc evacuating and > reclaiming pageframes, because both are complicated by high density > and page-crossing. In other words, zsmalloc's strengths may > also be its Achilles heels. For zram, as far as I can see, > pageframe evacuation/reclaim is irrelevant except perhaps > as part of mass defragmentation. For zcache and zswap, where > writethrough is used, pageframe evacuation/reclaim is very relevant. > (Note: The writeback implemented in zswap does _zpage_ evacuation > without pageframe reclaim.) > > CLOSING THOUGHT > > Since zsmalloc and zbud have different strengths and weaknesses, > I wonder if some combination or hybrid might be more optimal? > But unless/until we have and can measure a representative workload, > only intuition can answer that. > > GLOSSARY > > zproject -- a kernel project using compression (zram, zcache, zswap) > zpage -- a compressed sequence of PAGE_SIZE bytes > zsize -- the number of bytes in a compressed page > pageframe -- the term "page" is widely used both to describe > either (1) PAGE_SIZE bytes of data, or (2) a physical RAM > area with size=PAGE_SIZE which is PAGE_SIZE-aligned, > as represented in the kernel by a struct page. To be explicit, > we refer to (2) as a pageframe. > density -- zpages per pageframe; higher is (presumably) better > zsmalloc -- a slab-based allocator written by Nitin Gupta to > efficiently store zpages and designed to allow zpages > to be split across two non-contiguous pageframes > zspage -- a grouping of N non-contiguous pageframes managed > as a unit by zsmalloc to store zpages for which zsize > falls within a certain range. (The compile-time > default maximum size for N is 4). > zbud -- a buddy-based allocator written by Dan Magenheimer > (specifically for zcache) to predictably store zpages; > no more than two zpages are stored in any pageframe > pageframe evacuation/reclaim -- the process of removing > zpages from one or more pageframes, including pointers/nodes > from any data structures referencing those zpages, > so that the pageframe(s) can be freed for use by > the rest of the kernel > writeback -- the process of transferring zpages from > storage in a zproject to a backing swap device > lzo1x -- a compression algorithm used by default by all the > zprojects; the kernel implementation resides in lib/lzo.c > entropy -- randomness of data to be compressed; higher entropy > means worse data compression > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. 
For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=ilto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* RE: zsmalloc limitations and related topics 2013-03-01 1:40 ` Ric Mason @ 2013-03-04 18:29 ` Dan Magenheimer 0 siblings, 0 replies; 17+ messages in thread From: Dan Magenheimer @ 2013-03-04 18:29 UTC (permalink / raw) To: Ric Mason Cc: minchan, sjenning, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman > From: Ric Mason [mailto:ric.masonn@gmail.com] > Subject: Re: zsmalloc limitations and related topics > > On 02/28/2013 07:24 AM, Dan Magenheimer wrote: > > Hi all -- > > > > I've been doing some experimentation on zsmalloc in preparation > > for my topic proposed for LSFMM13 and have run across some > > perplexing limitations. Those familiar with the intimate details > > of zsmalloc might be well aware of these limitations, but they > > aren't documented or immediately obvious, so I thought it would > > be worthwhile to air them publicly. I've also included some > > measurements from the experimentation and some related thoughts. > > > > (Some of the terms here are unusual and may be used inconsistently > > by different developers so a glossary of definitions of the terms > > used here is appended.) > > > > ZSMALLOC LIMITATIONS > > > > Zsmalloc is used for two zprojects: zram and the out-of-tree > > zswap. Zsmalloc can achieve high density when "full". But: > > > > 1) Zsmalloc has a worst-case density of 0.25 (one zpage per > > four pageframes). > > 2) When not full and especially when nearly-empty _after_ > > being full, density may fall below 1.0 as a result of > > fragmentation. > > What's the meaning of nearly-empty _after_ being full? Step 1: Add a few (N) pages to zsmalloc. It is "nearly empty". Step 2: Now add many more pages to zsmalloc until allocation limits are reached. It is "full". Step 3: Now remove many pages from zsmalloc until there are N pages remaining. It is now "nearly empty after being full". Fragmentation characteristics are different comparing after Step 1 and after Step 3 even though, in both cases, zsmalloc contains N pages. > > 3) Zsmalloc has a density of exactly 1.0 for any number of > > zpages with zsize >= 0.8. > > 4) Zsmalloc contains several compile-time parameters; > > the best value of these parameters may be very workload > > dependent. > > > > If density == 1.0, that means we are paying the overhead of > > compression+decompression for no space advantage. If > > density < 1.0, that means using zsmalloc is detrimental, > > resulting in worse memory pressure than if it were not used. > > > > WORKLOAD ANALYSIS > > > > These limitations emphasize that the workload used to evaluate > > zsmalloc is very important. Benchmarks that measure data > > Could you share your benchmark? In order that other guys can take > advantage of it. As Seth does, I just used "make" of a kernel. I run it on a full graphical installation of EL6. In order to ensure there is memory pressure, I limit physical memory to 1GB, and use "make -j20". > > throughput or CPU utilization are of questionable value because > > it is the _content_ of the data that is particularly relevant > > for compression. Even more precisely, it is the "entropy" > > of the data that is relevant, because the amount of > > compressibility in the data is related to the entropy: > > I.e. an entirely random pagefull of bits will compress poorly > > and a highly-regular pagefull of bits will compress well. > > Since the zprojects manage a large number of zpages, both > > the mean and distribution of zsize of the workload should > > be "representative". 
> > > > The workload most widely used to publish results for > > the various zprojects is a kernel-compile using "make -jN" > > where N is artificially increased to impose memory pressure. > > By adding some debug code to zswap, I was able to analyze > > this workload and found the following: > > > > 1) The average page compressed by almost a factor of six > > (mean zsize == 694, stddev == 474) > > stddev is what? Standard deviation. See: http://en.wikipedia.org/wiki/Standard_deviation -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: zsmalloc limitations and related topics 2013-02-27 23:24 zsmalloc limitations and related topics Dan Magenheimer 2013-02-28 22:00 ` Dan Magenheimer 2013-03-01 1:40 ` Ric Mason @ 2013-03-13 15:14 ` Robert Jennings 2013-03-13 15:33 ` Seth Jennings 2013-03-13 20:02 ` Dan Magenheimer 2 siblings, 2 replies; 17+ messages in thread From: Robert Jennings @ 2013-03-13 15:14 UTC (permalink / raw) To: Dan Magenheimer Cc: minchan, sjenning, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman, Robert Jennings * Dan Magenheimer (dan.magenheimer@oracle.com) wrote: > Hi all -- > > I've been doing some experimentation on zsmalloc in preparation > for my topic proposed for LSFMM13 and have run across some > perplexing limitations. Those familiar with the intimate details > of zsmalloc might be well aware of these limitations, but they > aren't documented or immediately obvious, so I thought it would > be worthwhile to air them publicly. I've also included some > measurements from the experimentation and some related thoughts. > > (Some of the terms here are unusual and may be used inconsistently > by different developers so a glossary of definitions of the terms > used here is appended.) > > ZSMALLOC LIMITATIONS > > Zsmalloc is used for two zprojects: zram and the out-of-tree > zswap. Zsmalloc can achieve high density when "full". But: > > 1) Zsmalloc has a worst-case density of 0.25 (one zpage per > four pageframes). The design of the allocator results in a trade-off between best case density and the worst-case which is true for any allocator. For zsmalloc, the best case density with a 4K page size is 32.0, or 177.0 for a 64K page size, based on storing a set of zero-filled pages compressed by lzo1x-1. > 2) When not full and especially when nearly-empty _after_ > being full, density may fall below 1.0 as a result of > fragmentation. True and there are several ways to address this including defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback of zpages in sparse zspages to free pageframes during normal writeback. > 3) Zsmalloc has a density of exactly 1.0 for any number of > zpages with zsize >= 0.8. For this reason zswap does not cache pages which in this range. It is not enforced in the allocator because some users may be forced to store these pages; users like zram. > 4) Zsmalloc contains several compile-time parameters; > the best value of these parameters may be very workload > dependent. The parameters fall into two major areas, handle computation and class size. The handle can be abstracted away, eliminating the compile-time parameters. The class-size tunable could be changed to a default value with the option for specifying an alternate value from the user during pool creation. > If density == 1.0, that means we are paying the overhead of > compression+decompression for no space advantage. If > density < 1.0, that means using zsmalloc is detrimental, > resulting in worse memory pressure than if it were not used. > > WORKLOAD ANALYSIS > > These limitations emphasize that the workload used to evaluate > zsmalloc is very important. Benchmarks that measure data > throughput or CPU utilization are of questionable value because > it is the _content_ of the data that is particularly relevant > for compression. Even more precisely, it is the "entropy" > of the data that is relevant, because the amount of > compressibility in the data is related to the entropy: > I.e. 
an entirely random pagefull of bits will compress poorly > and a highly-regular pagefull of bits will compress well. > Since the zprojects manage a large number of zpages, both > the mean and distribution of zsize of the workload should > be "representative". > > The workload most widely used to publish results for > the various zprojects is a kernel-compile using "make -jN" > where N is artificially increased to impose memory pressure. > By adding some debug code to zswap, I was able to analyze > this workload and found the following: > > 1) The average page compressed by almost a factor of six > (mean zsize == 694, stddev == 474) > 2) Almost eleven percent of the pages were zero pages. A > zero page compresses to 28 bytes. > 3) On average, 77% of the bytes (3156) in the pages-to-be- > compressed contained a byte-value of zero. > 4) Despite the above, mean density of zsmalloc was measured at > 3.2 zpages/pageframe, presumably losing nearly half of > available space to fragmentation. > > I have no clue if these measurements are representative > of a wide range of workloads over the lifetime of a booted > machine, but I am suspicious that they are not. For example, > the lzo1x compression algorithm claims to compress data by > about a factor of two. I'm suspicious of the "factor of two" claim. The reference (http://www.oberhumer.com/opensource/lzo/lzodoc.php) for this would appear to be the results of compressing the Calgary Corpus. This is fine for comparing compression algorithms but I would be hesitant to apply that to this problem space. To illustrate the affect of input set, the newer Canterbury Corpus compresses to ~43% of the input size using LZO1X-1. In practice the average for LZO would be workload dependent, as you demonstrate with the kernel build. Swap page entropy for any given workload will not necessarily fit the distribution present in the Calgary Corpus. The high density allocation design in zsmalloc allows for workloads that can compress to factors greater than 2 to do so. > I would welcome ideas on how to evaluate workloads for > "representativeness". Personally I don't believe we should > be making decisions about selecting the "best" algorithms > or merging code without an agreement on workloads. I'd argue that there is no such thing as a "representative workload". Instead, we try different workloads to validate the design and illustrate the performance characteristics and impacts. > PAGEFRAME EVACUATION AND RECLAIM > > I've repeatedly stated the opinion that managing the number of > pageframes containing compressed pages will be valuable for > managing MM interaction/policy when compression is used in > the kernel. After the experimentation above and some brainstorming, > I still do not see an effective method for zsmalloc evacuating and > reclaiming pageframes, because both are complicated by high density > and page-crossing. In other words, zsmalloc's strengths may > also be its Achilles heels. For zram, as far as I can see, > pageframe evacuation/reclaim is irrelevant except perhaps > as part of mass defragmentation. For zcache and zswap, where > writethrough is used, pageframe evacuation/reclaim is very relevant. > (Note: The writeback implemented in zswap does _zpage_ evacuation > without pageframe reclaim.) zswap writeback without guaranteed pageframe reclaim can occur during swap activity. Reclaim, even if it doesn't free a physical page, makes room in the page for incoming swap. 
With zswap the writeback mechanism is driven by swap activity, so a zpage freed through writeback can be back-filled by a newly compressed zpage. Fragmentation is an issue when processes exit and block zpages are invalidated and becomes an issue when zswap is idle. Otherwise the holes provide elasticity to accommodate incoming pages to zswap. This is the case for both zswap and zcache. At idle we would want defragmentation or aging, either of which has the end result of shrinking the cache and returning pages to the memory manager. The former only reduces fragmentation while the later has the additional benefit of returning memory for other uses. By adding aging, through periodic writeback, zswap becomes a true cache, it eliminates long-held allocations, and addresses fragmentation for long-held allocations. Because the return value of zs_malloc() is not a pointer, but an opaque value that only has meaning to zsmalloc, the API zsmalloc already has would support the addition of an abstraction layer that would accommodate allocation migration necessary for defragmentation. > CLOSING THOUGHT > > Since zsmalloc and zbud have different strengths and weaknesses, > I wonder if some combination or hybrid might be more optimal? > But unless/until we have and can measure a representative workload, > only intuition can answer that. > > GLOSSARY > > zproject -- a kernel project using compression (zram, zcache, zswap) > zpage -- a compressed sequence of PAGE_SIZE bytes > zsize -- the number of bytes in a compressed page > pageframe -- the term "page" is widely used both to describe > either (1) PAGE_SIZE bytes of data, or (2) a physical RAM > area with size=PAGE_SIZE which is PAGE_SIZE-aligned, > as represented in the kernel by a struct page. To be explicit, > we refer to (2) as a pageframe. > density -- zpages per pageframe; higher is (presumably) better > zsmalloc -- a slab-based allocator written by Nitin Gupta to > efficiently store zpages and designed to allow zpages > to be split across two non-contiguous pageframes > zspage -- a grouping of N non-contiguous pageframes managed > as a unit by zsmalloc to store zpages for which zsize > falls within a certain range. (The compile-time > default maximum size for N is 4). > zbud -- a buddy-based allocator written by Dan Magenheimer > (specifically for zcache) to predictably store zpages; > no more than two zpages are stored in any pageframe > pageframe evacuation/reclaim -- the process of removing > zpages from one or more pageframes, including pointers/nodes > from any data structures referencing those zpages, > so that the pageframe(s) can be freed for use by > the rest of the kernel > writeback -- the process of transferring zpages from > storage in a zproject to a backing swap device > lzo1x -- a compression algorithm used by default by all the > zprojects; the kernel implementation resides in lib/lzo.c > entropy -- randomness of data to be compressed; higher entropy > means worse data compression -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: zsmalloc limitations and related topics 2013-03-13 15:14 ` Robert Jennings @ 2013-03-13 15:33 ` Seth Jennings 2013-03-13 15:56 ` Seth Jennings 2013-03-13 20:02 ` Dan Magenheimer 1 sibling, 1 reply; 17+ messages in thread From: Seth Jennings @ 2013-03-13 15:33 UTC (permalink / raw) To: Dan Magenheimer, minchan, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman The periodic writeback that Rob mentions would go something like this for zswap: --- mm/filemap.c | 3 +-- mm/zswap.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++----- 2 files changed, 59 insertions(+), 7 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..fe63e95 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -735,12 +735,11 @@ repeat: if (page && !radix_tree_exception(page)) { lock_page(page); /* Has the page been truncated? */ - if (unlikely(page->mapping != mapping)) { + if (unlikely(page_mapping(page) != mapping)) { unlock_page(page); page_cache_release(page); goto repeat; } - VM_BUG_ON(page->index != offset); } return page; } diff --git a/mm/zswap.c b/mm/zswap.c index 82b8d59..0b2351e 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -42,6 +42,9 @@ #include <linux/writeback.h> #include <linux/pagemap.h> +#include <linux/workqueue.h> +#include <linux/time.h> + /********************************* * statistics **********************************/ @@ -102,6 +105,23 @@ module_param_named(max_compression_ratio, */ #define ZSWAP_MAX_OUTSTANDING_FLUSHES 64 +/* + * The amount of time in seconds for zswap is considered "idle" and periodic + * writeback begins + */ +static int zswap_pwb_idle_secs = 30; + +/* + * The delay between iterations of periodic writeback + */ +static unsigned long zswap_pwb_delay_secs = 1; + +/* + * The number of pages to attempt to writeback on each iteration of the periodic + * writeback thread + */ +static int zswap_pwb_writeback_pages = 32; + /********************************* * compression functions **********************************/ @@ -199,6 +219,7 @@ struct zswap_entry { * The tree lock in the zswap_tree struct protects a few things: * - the rbtree * - the lru list + * - starting/modifying the pwb_work timer * - the refcount field of each entry in the tree */ struct zswap_tree { @@ -207,6 +228,7 @@ struct zswap_tree { spinlock_t lock; struct zs_pool *pool; unsigned type; + struct delayed_work pwb_work; }; static struct zswap_tree *zswap_trees[MAX_SWAPFILES]; @@ -492,7 +514,7 @@ static int zswap_get_swap_cache_page(swp_entry_t entry, * called after lookup_swap_cache() failed, re-calling * that would confuse statistics. 
*/ - found_page = find_get_page(&swapper_space, entry.val); + found_page = find_lock_page(&swapper_space, entry.val); if (found_page) break; @@ -588,9 +610,8 @@ static int zswap_writeback_entry(struct zswap_tree *tree, struct zswap_entry *en break; /* not reached */ case ZSWAP_SWAPCACHE_EXIST: /* page is unlocked */ - /* page is already in the swap cache, ignore for now */ - return -EEXIST; - break; /* not reached */ + /* page is already in the swap cache, no need to decompress */ + break; case ZSWAP_SWAPCACHE_NEW: /* page is locked */ /* decompress */ @@ -698,6 +719,26 @@ static int zswap_writeback_entries(struct zswap_tree *tree, int nr) return freed_nr++; } +/********************************* +* periodic writeback (pwb) +**********************************/ +void zswap_pwb_work(struct work_struct *work) +{ + struct delayed_work *dwork; + struct zswap_tree *tree; + + dwork = to_delayed_work(work); + tree = container_of(dwork, struct zswap_tree, pwb_work); + + zswap_writeback_entries(tree, zswap_pwb_writeback_pages); + + spin_lock(&tree->lock); + if (!list_empty(&tree->lru)) + schedule_delayed_work(&tree->pwb_work, + msecs_to_jiffies(MSEC_PER_SEC * zswap_pwb_delay_secs)); + spin_unlock(&tree->lock); +} + /******************************************* * page pool for temporary compression result ********************************************/ @@ -854,8 +895,18 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, entry->handle = handle; entry->length = dlen; - /* map */ spin_lock(&tree->lock); + + if (RB_EMPTY_ROOT(&tree->rbroot)) + /* schedule delayed periodic writeback work */ + schedule_delayed_work(&tree->pwb_work, + msecs_to_jiffies(MSEC_PER_SEC * zswap_pwb_idle_secs)); + else + /* update delay on already scheduled delayed work */ + mod_delayed_work(system_wq, &tree->pwb_work, + msecs_to_jiffies(MSEC_PER_SEC * zswap_pwb_idle_secs)); + + /* map */ do { ret = zswap_rb_insert(&tree->rbroot, entry, &dupentry); if (ret == -EEXIST) { @@ -1001,6 +1052,7 @@ static void zswap_frontswap_invalidate_area(unsigned type) * If post-order traversal code is ever added to the rbtree * implementation, it should be used here. */ + cancel_delayed_work_sync(&tree->pwb_work); while ((node = rb_first(&tree->rbroot))) { entry = rb_entry(node, struct zswap_entry, rbnode); rb_erase(&entry->rbnode, &tree->rbroot); @@ -1027,6 +1079,7 @@ static void zswap_frontswap_init(unsigned type) INIT_LIST_HEAD(&tree->lru); spin_lock_init(&tree->lock); tree->type = type; + INIT_DELAYED_WORK(&tree->pwb_work, zswap_pwb_work); zswap_trees[type] = tree; return; -- 1.7.9.5 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: zsmalloc limitations and related topics 2013-03-13 15:33 ` Seth Jennings @ 2013-03-13 15:56 ` Seth Jennings 0 siblings, 0 replies; 17+ messages in thread From: Seth Jennings @ 2013-03-13 15:56 UTC (permalink / raw) To: Dan Magenheimer, minchan, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman On 03/13/2013 10:33 AM, Seth Jennings wrote: > The periodic writeback that Rob mentions would go something like this > for zswap: > > --- > mm/filemap.c | 3 +-- > mm/zswap.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++----- > 2 files changed, 59 insertions(+), 7 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 83efee7..fe63e95 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -735,12 +735,11 @@ repeat: > if (page && !radix_tree_exception(page)) { > lock_page(page); > /* Has the page been truncated? */ > - if (unlikely(page->mapping != mapping)) { > + if (unlikely(page_mapping(page) != mapping)) { > unlock_page(page); > page_cache_release(page); > goto repeat; > } > - VM_BUG_ON(page->index != offset); A little followup here, previously we were using find_get_page() in zswap_get_swap_cache_page() and if the page was already in the swap cache, then we aborted the writeback of that entry. However, if we do wish to write the page back, as is the case in periodic writeback, we must find _and_ lock it which suggests using find_lock_page() instead. My first attempt to just do a s/find_get_page/find_lock_page/ failed because, for entries that were already in the swap cache, we would hang in the repeat loop of find_lock_page() forever because page->mapping of pages in the swap cache is not set to &swapper_space. However, there is logic in the page_mapping() function to handle swap cache entries, hence the change here. Also page->index != offset for swap cache pages so I just took out the VM_BUG_ON(). Another solution would be to just set the mapping and index fields of swap cache pages, if those fields (or fields in the same union) aren't being used already. Seth -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* RE: zsmalloc limitations and related topics 2013-03-13 15:14 ` Robert Jennings 2013-03-13 15:33 ` Seth Jennings @ 2013-03-13 20:02 ` Dan Magenheimer 2013-03-13 22:59 ` Seth Jennings 2013-03-14 19:16 ` Dan Magenheimer 1 sibling, 2 replies; 17+ messages in thread From: Dan Magenheimer @ 2013-03-13 20:02 UTC (permalink / raw) To: Robert Jennings Cc: minchan, sjenning, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman > From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] > Subject: Re: zsmalloc limitations and related topics Hi Robert -- Thanks for the well-considered reply! > * Dan Magenheimer (dan.magenheimer@oracle.com) wrote: > > Hi all -- > > > > I've been doing some experimentation on zsmalloc in preparation > > for my topic proposed for LSFMM13 and have run across some > > perplexing limitations. Those familiar with the intimate details > > of zsmalloc might be well aware of these limitations, but they > > aren't documented or immediately obvious, so I thought it would > > be worthwhile to air them publicly. I've also included some > > measurements from the experimentation and some related thoughts. > > > > (Some of the terms here are unusual and may be used inconsistently > > by different developers so a glossary of definitions of the terms > > used here is appended.) > > > > ZSMALLOC LIMITATIONS > > > > Zsmalloc is used for two zprojects: zram and the out-of-tree > > zswap. Zsmalloc can achieve high density when "full". But: > > > > 1) Zsmalloc has a worst-case density of 0.25 (one zpage per > > four pageframes). > > The design of the allocator results in a trade-off between best case > density and the worst-case which is true for any allocator. For zsmalloc, > the best case density with a 4K page size is 32.0, or 177.0 for a 64K page > size, based on storing a set of zero-filled pages compressed by lzo1x-1. Right. Without a "representative workload", we have no idea whether either my worst-case or your best-case will be relevant. (As an aside, I'm measuring zsize=28 bytes for a zero page... Seth has repeatedly said 103 bytes and I think this is reflected in your computation above. Maybe it is 103 for your hardware compression engine? Else, I'm not sure why our numbers would be different.) > > 2) When not full and especially when nearly-empty _after_ > > being full, density may fall below 1.0 as a result of > > fragmentation. > > True and there are several ways to address this including > defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback > of zpages in sparse zspages to free pageframes during normal writeback. Yes. And add pageframe-reclaim to this list of things that zsmalloc should do but currently cannot do. > > 3) Zsmalloc has a density of exactly 1.0 for any number of > > zpages with zsize >= 0.8. > > For this reason zswap does not cache pages which in this range. > It is not enforced in the allocator because some users may be forced to > store these pages; users like zram. Again, without a "representative" workload, we don't know whether or not it is important to manage pages with zsize >= 0.8. You are simply dismissing it as unnecessary because zsmalloc can't handle them and because they don't appear at any measurable frequency in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger pages under many circumstances... but without a "representative" workload, we don't know whether or not those circumstances will occur.) 
> > 4) Zsmalloc contains several compile-time parameters; > > the best value of these parameters may be very workload > > dependent. > > The parameters fall into two major areas, handle computation and class > size. The handle can be abstracted away, eliminating the compile-time > parameters. The class-size tunable could be changed to a default value > with the option for specifying an alternate value from the user during > pool creation. Perhaps my point here wasn't clear so let me be more blunt: There's no way in hell that even a very sophisticated user will know how to set these values. I think we need to ensure either that they are "always right" (which without a "representative workload"...) or, preferably, have some way so that they can dynamically adapt at runtime. > > If density == 1.0, that means we are paying the overhead of > > compression+decompression for no space advantage. If > > density < 1.0, that means using zsmalloc is detrimental, > > resulting in worse memory pressure than if it were not used. > > > > WORKLOAD ANALYSIS > > > > These limitations emphasize that the workload used to evaluate > > zsmalloc is very important. Benchmarks that measure data > > throughput or CPU utilization are of questionable value because > > it is the _content_ of the data that is particularly relevant > > for compression. Even more precisely, it is the "entropy" > > of the data that is relevant, because the amount of > > compressibility in the data is related to the entropy: > > I.e. an entirely random pagefull of bits will compress poorly > > and a highly-regular pagefull of bits will compress well. > > Since the zprojects manage a large number of zpages, both > > the mean and distribution of zsize of the workload should > > be "representative". > > > > The workload most widely used to publish results for > > the various zprojects is a kernel-compile using "make -jN" > > where N is artificially increased to impose memory pressure. > > By adding some debug code to zswap, I was able to analyze > > this workload and found the following: > > > > 1) The average page compressed by almost a factor of six > > (mean zsize == 694, stddev == 474) > > 2) Almost eleven percent of the pages were zero pages. A > > zero page compresses to 28 bytes. > > 3) On average, 77% of the bytes (3156) in the pages-to-be- > > compressed contained a byte-value of zero. > > 4) Despite the above, mean density of zsmalloc was measured at > > 3.2 zpages/pageframe, presumably losing nearly half of > > available space to fragmentation. > > > > I have no clue if these measurements are representative > > of a wide range of workloads over the lifetime of a booted > > machine, but I am suspicious that they are not. For example, > > the lzo1x compression algorithm claims to compress data by > > about a factor of two. > > I'm suspicious of the "factor of two" claim. The reference > (http://www.oberhumer.com/opensource/lzo/lzodoc.php) for this would appear > to be the results of compressing the Calgary Corpus. This is fine for > comparing compression algorithms but I would be hesitant to apply that > to this problem space. To illustrate the affect of input set, the newer > Canterbury Corpus compresses to ~43% of the input size using LZO1X-1. Yes, agreed, we have no idea if the Corpus is representative of this problem space... because we have no idea what would be a "representative workload" for this problem space. But for how I was using "factor of two", a factor of 100/43=~2.3 is close enough. 
I was only trying to say "factor of two" may be more "representative" than the "factor of six" in kernbench. (As an aside, I like the data Nitin collected here: http://code.google.com/p/compcache/wiki/CompressedLengthDistribution as it shows how different workloads can result in dramatically different zsize distributions. However, this data includes all the pages in a running system, including both anonymous and file pages, and doesn't include mean/stddev.) > In practice the average for LZO would be workload dependent, as you > demonstrate with the kernel build. Swap page entropy for any given > workload will not necessarily fit the distribution present in the > Calgary Corpus. The high density allocation design in zsmalloc allows > for workloads that can compress to factors greater than 2 to do so. Exactly. But at what cost on other workloads? And how do we evaluate the cost/benefit of that high density? (... without a "representative workload" ;-) > > I would welcome ideas on how to evaluate workloads for > > "representativeness". Personally I don't believe we should > > be making decisions about selecting the "best" algorithms > > or merging code without an agreement on workloads. > > I'd argue that there is no such thing as a "representative workload". > Instead, we try different workloads to validate the design and illustrate > the performance characteristics and impacts. Sorry for repeatedly hammering my point in the above, but there have been many design choices driven by what was presumed to be representative (kernbench and now SPECjbb) workload that may be entirely wrong for a different workload (as Seth once pointed out using the text of Moby Dick as a source data stream). Further, the value of different designs can't be measured here just by the workload because the pages chosen to swap may be completely independent of the intended workload-driver... i.e. if you track the pid of the pages intended for swap, the pages can be mostly pages from long-running or periodic system services, not pages generated by kernbench or SPECjbb. So it is the workload PLUS the environment that is being measured and evaluated. That makes the problem especially tough. Just to clarify, I'm not suggesting that there is any single workload that can be called representative, just that we may need both a broad set of workloads (not silly benchmarks) AND some theoretical analysis to drive design decisions. And, without this, arguing about whether zsmalloc is better than zbud or not is silly. Both zbud and zsmalloc have strengths and weaknesses. That said, it should also be pointed out that the stream of pages-to-compress from cleancache ("file pages") may be dramatically different than for frontswap ("anonymous pages"), so unless you and Seth are going to argue upfront that cleancache pages should NEVER be candidates for compression, the evaluation criteria to drive design decisions needs to encompass both anonymous and file pages. It is currently impossible to evaluate that with zswap. > > PAGEFRAME EVACUATION AND RECLAIM > > > > I've repeatedly stated the opinion that managing the number of > > pageframes containing compressed pages will be valuable for > > managing MM interaction/policy when compression is used in > > the kernel. After the experimentation above and some brainstorming, > > I still do not see an effective method for zsmalloc evacuating and > > reclaiming pageframes, because both are complicated by high density > > and page-crossing. 
In other words, zsmalloc's strengths may > > also be its Achilles heels. For zram, as far as I can see, > > pageframe evacuation/reclaim is irrelevant except perhaps > > as part of mass defragmentation. For zcache and zswap, where > > writethrough is used, pageframe evacuation/reclaim is very relevant. > > (Note: The writeback implemented in zswap does _zpage_ evacuation > > without pageframe reclaim.) > > zswap writeback without guaranteed pageframe reclaim can occur during > swap activity. Reclaim, even if it doesn't free a physical page, makes > room in the page for incoming swap. With zswap the writeback mechanism > is driven by swap activity, so a zpage freed through writeback can be > back-filled by a newly compressed zpage. Fragmentation is an issue when > processes exit and block zpages are invalidated and becomes an issue when > zswap is idle. Otherwise the holes provide elasticity to accommodate > incoming pages to zswap. This is the case for both zswap and zcache. > > At idle we would want defragmentation or aging, either of which has > the end result of shrinking the cache and returning pages to the > memory manager. The former only reduces fragmentation while the > later has the additional benefit of returning memory for other uses. > By adding aging, through periodic writeback, zswap becomes a true cache, > it eliminates long-held allocations, and addresses fragmentation for > long-held allocations. We are definitely on different pages here. You are still trying to push zswap as a separate subsystem that can independently decide how to size itself. I see zcache (and zswap) as a "helper" for the MM subsystem which allow MM to store more anonymous/pagecache pages in memory than otherwise possible. This becomes more obvious when considering policy for both anonymous AND pagecache pages... and zswap is not handling both. > Because the return value of zs_malloc() is not a pointer, but an opaque > value that only has meaning to zsmalloc, the API zsmalloc already has > would support the addition of an abstraction layer that would accommodate > allocation migration necessary for defragmentation. While what you say is theoretically true and theoretically a very nice feature to have, the current encoding of a zsmalloc handle does not appear to be support your argument. (And, btw, zbud does a very similar opaque encoding.) > > CLOSING THOUGHT > > > > Since zsmalloc and zbud have different strengths and weaknesses, > > I wonder if some combination or hybrid might be more optimal? > > But unless/until we have and can measure a representative workload, > > only intuition can answer that. You didn't respond to this, but I am increasingly inclined to believe that the truth lies here, and the path to success lies in working together rather than in battling/forking. From what we jointly have learned, if we were locked together in the same room and asked to jointly design a zpage allocator from scratch, I suspect the result would be quite different from either zsmalloc or zbud. Thanks, Dan -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: zsmalloc limitations and related topics 2013-03-13 20:02 ` Dan Magenheimer @ 2013-03-13 22:59 ` Seth Jennings 2013-03-14 12:02 ` Bob 2013-03-14 17:39 ` Dan Magenheimer 2013-03-14 19:16 ` Dan Magenheimer 1 sibling, 2 replies; 17+ messages in thread From: Seth Jennings @ 2013-03-13 22:59 UTC (permalink / raw) To: Dan Magenheimer Cc: Robert Jennings, minchan, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman On 03/13/2013 03:02 PM, Dan Magenheimer wrote: >> From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] >> Subject: Re: zsmalloc limitations and related topics > > Hi Robert -- > > Thanks for the well-considered reply! > >> * Dan Magenheimer (dan.magenheimer@oracle.com) wrote: >>> Hi all -- >>> >>> I've been doing some experimentation on zsmalloc in preparation >>> for my topic proposed for LSFMM13 and have run across some >>> perplexing limitations. Those familiar with the intimate details >>> of zsmalloc might be well aware of these limitations, but they >>> aren't documented or immediately obvious, so I thought it would >>> be worthwhile to air them publicly. I've also included some >>> measurements from the experimentation and some related thoughts. >>> >>> (Some of the terms here are unusual and may be used inconsistently >>> by different developers so a glossary of definitions of the terms >>> used here is appended.) >>> >>> ZSMALLOC LIMITATIONS >>> >>> Zsmalloc is used for two zprojects: zram and the out-of-tree >>> zswap. Zsmalloc can achieve high density when "full". But: >>> >>> 1) Zsmalloc has a worst-case density of 0.25 (one zpage per >>> four pageframes). >> >> The design of the allocator results in a trade-off between best case >> density and the worst-case which is true for any allocator. For zsmalloc, >> the best case density with a 4K page size is 32.0, or 177.0 for a 64K page >> size, based on storing a set of zero-filled pages compressed by lzo1x-1. > > Right. Without a "representative workload", we have no idea > whether either my worst-case or your best-case will be relevant. > > (As an aside, I'm measuring zsize=28 bytes for a zero page... > Seth has repeatedly said 103 bytes and I think this is > reflected in your computation above. Maybe it is 103 for your > hardware compression engine? Else, I'm not sure why our > numbers would be different.) I rechecked this and found my measurement was flawed. It was based on compressing a zero-filled file with lzop -1. The file size is 107 but, as I recently discovered, contains LZO metadata as well. Using lzop -l, I got that the compressed size of the data (not the file), is 44 bytes. So still not what you are observing but closer. $ dd if=/dev/zero of=zero.page bs=4k count=1 $ lzop -1 zero.page $ lzop -l zero.page.lzo method compressed uncompr. ratio uncompressed_name LZO1X-1(15) 44 4096 1.1% zero.page > >>> 2) When not full and especially when nearly-empty _after_ >>> being full, density may fall below 1.0 as a result of >>> fragmentation. >> >> True and there are several ways to address this including >> defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback >> of zpages in sparse zspages to free pageframes during normal writeback. > > Yes. And add pageframe-reclaim to this list of things that > zsmalloc should do but currently cannot do. The real question is why is pageframe-reclaim a requirement? What operation needs this feature? 
AFAICT, the pageframe-reclaim requirements is derived from the assumption that some external control path should be able to tell zswap/zcache to evacuate a page, like the shrinker interface. But this introduces a new and complex problem in designing a policy that doesn't shrink the zpage pool so aggressively that it is useless. Unless there is another reason for this functionality I'm missing. > >>> 3) Zsmalloc has a density of exactly 1.0 for any number of >>> zpages with zsize >= 0.8. >> >> For this reason zswap does not cache pages which in this range. >> It is not enforced in the allocator because some users may be forced to >> store these pages; users like zram. > > Again, without a "representative" workload, we don't know whether > or not it is important to manage pages with zsize >= 0.8. You are > simply dismissing it as unnecessary because zsmalloc can't handle > them and because they don't appear at any measurable frequency > in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger > pages under many circumstances... but without a "representative" workload, > we don't know whether or not those circumstances will occur.) The real question is not whether any workload would operate on pages that don't compress to 80%. Any workload that operates on pages of already compressed or encrypted data would do this. The question is, is it worth it to store those pages in the compressed cache since the effective reclaim efficiency approaches 0. > >>> 4) Zsmalloc contains several compile-time parameters; >>> the best value of these parameters may be very workload >>> dependent. >> >> The parameters fall into two major areas, handle computation and class >> size. The handle can be abstracted away, eliminating the compile-time >> parameters. The class-size tunable could be changed to a default value >> with the option for specifying an alternate value from the user during >> pool creation. > > Perhaps my point here wasn't clear so let me be more blunt: > There's no way in hell that even a very sophisticated user > will know how to set these values. I think we need to > ensure either that they are "always right" (which without > a "representative workload"...) or, preferably, have some way > so that they can dynamically adapt at runtime. I think you made the point that if this "representative workload" is completely undefined, then having tunables for zsmalloc that are "always right" is also not possible. The best we can hope for is "mostly right" which, of course, is difficult to get everyone to agree on and will be based on usage. > >>> If density == 1.0, that means we are paying the overhead of >>> compression+decompression for no space advantage. If >>> density < 1.0, that means using zsmalloc is detrimental, >>> resulting in worse memory pressure than if it were not used. >>> >>> WORKLOAD ANALYSIS >>> >>> These limitations emphasize that the workload used to evaluate >>> zsmalloc is very important. Benchmarks that measure data >>> throughput or CPU utilization are of questionable value because >>> it is the _content_ of the data that is particularly relevant >>> for compression. Even more precisely, it is the "entropy" >>> of the data that is relevant, because the amount of >>> compressibility in the data is related to the entropy: >>> I.e. an entirely random pagefull of bits will compress poorly >>> and a highly-regular pagefull of bits will compress well. 
>>> Since the zprojects manage a large number of zpages, both >>> the mean and distribution of zsize of the workload should >>> be "representative". >>> >>> The workload most widely used to publish results for >>> the various zprojects is a kernel-compile using "make -jN" >>> where N is artificially increased to impose memory pressure. >>> By adding some debug code to zswap, I was able to analyze >>> this workload and found the following: >>> >>> 1) The average page compressed by almost a factor of six >>> (mean zsize == 694, stddev == 474) >>> 2) Almost eleven percent of the pages were zero pages. A >>> zero page compresses to 28 bytes. >>> 3) On average, 77% of the bytes (3156) in the pages-to-be- >>> compressed contained a byte-value of zero. >>> 4) Despite the above, mean density of zsmalloc was measured at >>> 3.2 zpages/pageframe, presumably losing nearly half of >>> available space to fragmentation. >>> >>> I have no clue if these measurements are representative >>> of a wide range of workloads over the lifetime of a booted >>> machine, but I am suspicious that they are not. For example, >>> the lzo1x compression algorithm claims to compress data by >>> about a factor of two. >> >> I'm suspicious of the "factor of two" claim. The reference >> (http://www.oberhumer.com/opensource/lzo/lzodoc.php) for this would appear >> to be the results of compressing the Calgary Corpus. This is fine for >> comparing compression algorithms but I would be hesitant to apply that >> to this problem space. To illustrate the affect of input set, the newer >> Canterbury Corpus compresses to ~43% of the input size using LZO1X-1. > > Yes, agreed, we have no idea if the Corpus is representative of > this problem space... because we have no idea what would > be a "representative workload" for this problem space. > > But for how I was using "factor of two", a factor of 100/43=~2.3 is > close enough. I was only trying to say "factor of two" may be > more "representative" than the "factor of six" in kernbench. Again, this "representative workload" is undefined to the point of uselessness. At this point _any_ actual workload is more useful than this undefined representative. > > (As an aside, I like the data Nitin collected here: > http://code.google.com/p/compcache/wiki/CompressedLengthDistribution > as it shows how different workloads can result in dramatically > different zsize distributions. However, this data includes > all the pages in a running system, including both anonymous > and file pages, and doesn't include mean/stddev.) > >> In practice the average for LZO would be workload dependent, as you >> demonstrate with the kernel build. Swap page entropy for any given >> workload will not necessarily fit the distribution present in the >> Calgary Corpus. The high density allocation design in zsmalloc allows >> for workloads that can compress to factors greater than 2 to do so. > > Exactly. But at what cost on other workloads? And how do we evaluate > the cost/benefit of that high density? (... without a "representative > workload" ;-) > >>> I would welcome ideas on how to evaluate workloads for >>> "representativeness". Personally I don't believe we should >>> be making decisions about selecting the "best" algorithms >>> or merging code without an agreement on workloads. >> >> I'd argue that there is no such thing as a "representative workload". >> Instead, we try different workloads to validate the design and illustrate >> the performance characteristics and impacts. 
> > Sorry for repeatedly hammering my point in the above, but > there have been many design choices driven by what was presumed > to be representative (kernbench and now SPECjbb) workload > that may be entirely wrong for a different workload (as > Seth once pointed out using the text of Moby Dick as a source > data stream). The reality we are going to have to face with the feature of memory compression is that not every workload can benefit. The objective should be to improve known workloads that are able to benefit. Then make improvements that grow that set of workloads. > > Further, the value of different designs can't be measured here just > by the workload because the pages chosen to swap may be completely > independent of the intended workload-driver... i.e. if you track > the pid of the pages intended for swap, the pages can be mostly > pages from long-running or periodic system services, not pages > generated by kernbench or SPECjbb. So it is the workload PLUS the > environment that is being measured and evaluated. That makes > the problem especially tough. > > Just to clarify, I'm not suggesting that there is any single > workload that can be called representative, just that we may > need both a broad set of workloads (not silly benchmarks) AND > some theoretical analysis to drive design decisions. And, without > this, arguing about whether zsmalloc is better than zbud or not > is silly. Both zbud and zsmalloc have strengths and weaknesses. > > That said, it should also be pointed out that the stream of > pages-to-compress from cleancache ("file pages") may be dramatically > different than for frontswap ("anonymous pages"), so unless you > and Seth are going to argue upfront that cleancache pages should > NEVER be candidates for compression, the evaluation criteria > to drive design decisions needs to encompass both anonymous > and file pages. It is currently impossible to evaluate that > with zswap. > >>> PAGEFRAME EVACUATION AND RECLAIM >>> >>> I've repeatedly stated the opinion that managing the number of >>> pageframes containing compressed pages will be valuable for >>> managing MM interaction/policy when compression is used in >>> the kernel. After the experimentation above and some brainstorming, >>> I still do not see an effective method for zsmalloc evacuating and >>> reclaiming pageframes, because both are complicated by high density >>> and page-crossing. In other words, zsmalloc's strengths may >>> also be its Achilles heels. For zram, as far as I can see, >>> pageframe evacuation/reclaim is irrelevant except perhaps >>> as part of mass defragmentation. For zcache and zswap, where >>> writethrough is used, pageframe evacuation/reclaim is very relevant. >>> (Note: The writeback implemented in zswap does _zpage_ evacuation >>> without pageframe reclaim.) >> >> zswap writeback without guaranteed pageframe reclaim can occur during >> swap activity. Reclaim, even if it doesn't free a physical page, makes >> room in the page for incoming swap. With zswap the writeback mechanism >> is driven by swap activity, so a zpage freed through writeback can be >> back-filled by a newly compressed zpage. Fragmentation is an issue when >> processes exit and block zpages are invalidated and becomes an issue when >> zswap is idle. Otherwise the holes provide elasticity to accommodate >> incoming pages to zswap. This is the case for both zswap and zcache. 
>> >> At idle we would want defragmentation or aging, either of which has >> the end result of shrinking the cache and returning pages to the >> memory manager. The former only reduces fragmentation while the >> later has the additional benefit of returning memory for other uses. >> By adding aging, through periodic writeback, zswap becomes a true cache, >> it eliminates long-held allocations, and addresses fragmentation for >> long-held allocations. > > We are definitely on different pages here. You are still trying to > push zswap as a separate subsystem that can independently decide how > to size itself. I see zcache (and zswap) as a "helper" for the MM > subsystem which allow MM to store more anonymous/pagecache pages in > memory than otherwise possible. IIUC from this and your "Better integration of compression with the broader linux-mm" thread, you are wanting to allow the MM to tell a compressed-MM subsystem to free up pages. There are a few problems I see here, mostly policy related. How does the MM know whether is should reclaim compressed page space or pages from the inactive list? In the case of frontswap, the policies feedback on one another in that the reclaim of an anonymous page from the inactive list via swap results in an increase in the number of pages on the anonymous zspage list. I'm not saying I have the solution. The ideal sizing of the compressed pool is a complex issue and, like so many other elements of compressed memory design, depends on the workload. That being said, just because an ideal policy for every workload doesn't exist doesn't mean you can't choose one policy (hopefully a simple one) and improve it as measurable deficiencies are identified. Seth -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
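For concreteness, the zsize >= 0.8 point argued earlier in this message amounts to a store-side admission check along the following lines. This is only a minimal sketch with illustrative names (zpage_worth_storing() is not an actual zswap function), assuming the usual 4 KiB PAGE_SIZE:

#include <linux/types.h>   /* bool */
#include <linux/mm.h>      /* PAGE_SIZE */

/*
 * Reject zpages whose compressed size ("zsize") is >= 80% of PAGE_SIZE:
 * storing them costs compression/decompression CPU for almost no space
 * advantage, so let them go to the swap device instead.
 */
#define ZSIZE_REJECT_THRESHOLD ((PAGE_SIZE * 4) / 5)

static bool zpage_worth_storing(size_t zsize)
{
	return zsize < ZSIZE_REJECT_THRESHOLD;
}

/* in a hypothetical store path: if (!zpage_worth_storing(dlen)) reject the page */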
* Re: zsmalloc limitations and related topics 2013-03-13 22:59 ` Seth Jennings @ 2013-03-14 12:02 ` Bob 2013-03-14 13:20 ` Robert Jennings 2013-03-14 17:39 ` Dan Magenheimer 1 sibling, 1 reply; 17+ messages in thread From: Bob @ 2013-03-14 12:02 UTC (permalink / raw) To: Seth Jennings Cc: Dan Magenheimer, Robert Jennings, minchan, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman On 03/14/2013 06:59 AM, Seth Jennings wrote: > On 03/13/2013 03:02 PM, Dan Magenheimer wrote: >>> From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] >>> Subject: Re: zsmalloc limitations and related topics >> >> Hi Robert -- >> >> Thanks for the well-considered reply! >> >>> * Dan Magenheimer (dan.magenheimer@oracle.com) wrote: >>>> Hi all -- >>>> >>>> I've been doing some experimentation on zsmalloc in preparation >>>> for my topic proposed for LSFMM13 and have run across some >>>> perplexing limitations. Those familiar with the intimate details >>>> of zsmalloc might be well aware of these limitations, but they >>>> aren't documented or immediately obvious, so I thought it would >>>> be worthwhile to air them publicly. I've also included some >>>> measurements from the experimentation and some related thoughts. >>>> >>>> (Some of the terms here are unusual and may be used inconsistently >>>> by different developers so a glossary of definitions of the terms >>>> used here is appended.) >>>> >>>> ZSMALLOC LIMITATIONS >>>> >>>> Zsmalloc is used for two zprojects: zram and the out-of-tree >>>> zswap. Zsmalloc can achieve high density when "full". But: >>>> >>>> 1) Zsmalloc has a worst-case density of 0.25 (one zpage per >>>> four pageframes). >>> >>> The design of the allocator results in a trade-off between best case >>> density and the worst-case which is true for any allocator. For zsmalloc, >>> the best case density with a 4K page size is 32.0, or 177.0 for a 64K page >>> size, based on storing a set of zero-filled pages compressed by lzo1x-1. >> >> Right. Without a "representative workload", we have no idea >> whether either my worst-case or your best-case will be relevant. >> >> (As an aside, I'm measuring zsize=28 bytes for a zero page... >> Seth has repeatedly said 103 bytes and I think this is >> reflected in your computation above. Maybe it is 103 for your >> hardware compression engine? Else, I'm not sure why our >> numbers would be different.) > > I rechecked this and found my measurement was flawed. It was based on > compressing a zero-filled file with lzop -1. The file size is 107 but, > as I recently discovered, contains LZO metadata as well. Using lzop -l, > I got that the compressed size of the data (not the file), is 44 bytes. > So still not what you are observing but closer. > > $ dd if=/dev/zero of=zero.page bs=4k count=1 > $ lzop -1 zero.page > $ lzop -l zero.page.lzo > method compressed uncompr. ratio uncompressed_name > LZO1X-1(15) 44 4096 1.1% zero.page > >> >>>> 2) When not full and especially when nearly-empty _after_ >>>> being full, density may fall below 1.0 as a result of >>>> fragmentation. >>> >>> True and there are several ways to address this including >>> defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback >>> of zpages in sparse zspages to free pageframes during normal writeback. >> >> Yes. And add pageframe-reclaim to this list of things that >> zsmalloc should do but currently cannot do. > > The real question is why is pageframe-reclaim a requirement? What > operation needs this feature? 
> > AFAICT, the pageframe-reclaim requirements is derived from the > assumption that some external control path should be able to tell > zswap/zcache to evacuate a page, like the shrinker interface. But this > introduces a new and complex problem in designing a policy that doesn't > shrink the zpage pool so aggressively that it is useless. > > Unless there is another reason for this functionality I'm missing. > Perhaps it's needed if the user want to enable/disable the memory compression feature dynamically. Eg, use it as a module instead of recompile the kernel or even reboot the system. >> >>>> 3) Zsmalloc has a density of exactly 1.0 for any number of >>>> zpages with zsize >= 0.8. >>> >>> For this reason zswap does not cache pages which in this range. >>> It is not enforced in the allocator because some users may be forced to >>> store these pages; users like zram. >> >> Again, without a "representative" workload, we don't know whether >> or not it is important to manage pages with zsize >= 0.8. You are >> simply dismissing it as unnecessary because zsmalloc can't handle >> them and because they don't appear at any measurable frequency >> in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger >> pages under many circumstances... but without a "representative" workload, >> we don't know whether or not those circumstances will occur.) > > The real question is not whether any workload would operate on pages > that don't compress to 80%. Any workload that operates on pages of > already compressed or encrypted data would do this. The question is, is > it worth it to store those pages in the compressed cache since the > effective reclaim efficiency approaches 0. > Hmm.. Yes, i'd prefer to skip those pages at first glance. >> >>>> 4) Zsmalloc contains several compile-time parameters; >>>> the best value of these parameters may be very workload >>>> dependent. >>> >>> The parameters fall into two major areas, handle computation and class >>> size. The handle can be abstracted away, eliminating the compile-time >>> parameters. The class-size tunable could be changed to a default value >>> with the option for specifying an alternate value from the user during >>> pool creation. >> >> Perhaps my point here wasn't clear so let me be more blunt: >> There's no way in hell that even a very sophisticated user >> will know how to set these values. I think we need to >> ensure either that they are "always right" (which without >> a "representative workload"...) or, preferably, have some way >> so that they can dynamically adapt at runtime. > > I think you made the point that if this "representative workload" is > completely undefined, then having tunables for zsmalloc that are "always > right" is also not possible. The best we can hope for is "mostly right" > which, of course, is difficult to get everyone to agree on and will be > based on usage. > >> >>>> If density == 1.0, that means we are paying the overhead of >>>> compression+decompression for no space advantage. If >>>> density < 1.0, that means using zsmalloc is detrimental, >>>> resulting in worse memory pressure than if it were not used. >>>> >>>> WORKLOAD ANALYSIS >>>> >>>> These limitations emphasize that the workload used to evaluate >>>> zsmalloc is very important. Benchmarks that measure data >>>> throughput or CPU utilization are of questionable value because >>>> it is the _content_ of the data that is particularly relevant >>>> for compression. 
Even more precisely, it is the "entropy" >>>> of the data that is relevant, because the amount of >>>> compressibility in the data is related to the entropy: >>>> I.e. an entirely random pagefull of bits will compress poorly >>>> and a highly-regular pagefull of bits will compress well. >>>> Since the zprojects manage a large number of zpages, both >>>> the mean and distribution of zsize of the workload should >>>> be "representative". >>>> >>>> The workload most widely used to publish results for >>>> the various zprojects is a kernel-compile using "make -jN" >>>> where N is artificially increased to impose memory pressure. >>>> By adding some debug code to zswap, I was able to analyze >>>> this workload and found the following: >>>> >>>> 1) The average page compressed by almost a factor of six >>>> (mean zsize == 694, stddev == 474) >>>> 2) Almost eleven percent of the pages were zero pages. A >>>> zero page compresses to 28 bytes. >>>> 3) On average, 77% of the bytes (3156) in the pages-to-be- >>>> compressed contained a byte-value of zero. >>>> 4) Despite the above, mean density of zsmalloc was measured at >>>> 3.2 zpages/pageframe, presumably losing nearly half of >>>> available space to fragmentation. >>>> >>>> I have no clue if these measurements are representative >>>> of a wide range of workloads over the lifetime of a booted >>>> machine, but I am suspicious that they are not. For example, >>>> the lzo1x compression algorithm claims to compress data by >>>> about a factor of two. >>> >>> I'm suspicious of the "factor of two" claim. The reference >>> (http://www.oberhumer.com/opensource/lzo/lzodoc.php) for this would appear >>> to be the results of compressing the Calgary Corpus. This is fine for >>> comparing compression algorithms but I would be hesitant to apply that >>> to this problem space. To illustrate the affect of input set, the newer >>> Canterbury Corpus compresses to ~43% of the input size using LZO1X-1. >> >> Yes, agreed, we have no idea if the Corpus is representative of >> this problem space... because we have no idea what would >> be a "representative workload" for this problem space. >> >> But for how I was using "factor of two", a factor of 100/43=~2.3 is >> close enough. I was only trying to say "factor of two" may be >> more "representative" than the "factor of six" in kernbench. > > Again, this "representative workload" is undefined to the point of > uselessness. At this point _any_ actual workload is more useful than > this undefined representative. > >> >> (As an aside, I like the data Nitin collected here: >> http://code.google.com/p/compcache/wiki/CompressedLengthDistribution >> as it shows how different workloads can result in dramatically >> different zsize distributions. However, this data includes >> all the pages in a running system, including both anonymous >> and file pages, and doesn't include mean/stddev.) >> >>> In practice the average for LZO would be workload dependent, as you >>> demonstrate with the kernel build. Swap page entropy for any given >>> workload will not necessarily fit the distribution present in the >>> Calgary Corpus. The high density allocation design in zsmalloc allows >>> for workloads that can compress to factors greater than 2 to do so. >> >> Exactly. But at what cost on other workloads? And how do we evaluate >> the cost/benefit of that high density? (... without a "representative >> workload" ;-) >> >>>> I would welcome ideas on how to evaluate workloads for >>>> "representativeness". 
Personally I don't believe we should >>>> be making decisions about selecting the "best" algorithms >>>> or merging code without an agreement on workloads. >>> >>> I'd argue that there is no such thing as a "representative workload". >>> Instead, we try different workloads to validate the design and illustrate >>> the performance characteristics and impacts. >> >> Sorry for repeatedly hammering my point in the above, but >> there have been many design choices driven by what was presumed >> to be representative (kernbench and now SPECjbb) workload >> that may be entirely wrong for a different workload (as >> Seth once pointed out using the text of Moby Dick as a source >> data stream). > > The reality we are going to have to face with the feature of memory > compression is that not every workload can benefit. The objective > should be to improve known workloads that are able to benefit. Then > make improvements that grow that set of workloads. > >> >> Further, the value of different designs can't be measured here just >> by the workload because the pages chosen to swap may be completely >> independent of the intended workload-driver... i.e. if you track >> the pid of the pages intended for swap, the pages can be mostly >> pages from long-running or periodic system services, not pages >> generated by kernbench or SPECjbb. So it is the workload PLUS the >> environment that is being measured and evaluated. That makes >> the problem especially tough. >> >> Just to clarify, I'm not suggesting that there is any single >> workload that can be called representative, just that we may >> need both a broad set of workloads (not silly benchmarks) AND >> some theoretical analysis to drive design decisions. And, without >> this, arguing about whether zsmalloc is better than zbud or not >> is silly. Both zbud and zsmalloc have strengths and weaknesses. >> >> That said, it should also be pointed out that the stream of >> pages-to-compress from cleancache ("file pages") may be dramatically >> different than for frontswap ("anonymous pages"), so unless you >> and Seth are going to argue upfront that cleancache pages should >> NEVER be candidates for compression, the evaluation criteria >> to drive design decisions needs to encompass both anonymous >> and file pages. It is currently impossible to evaluate that >> with zswap. >> >>>> PAGEFRAME EVACUATION AND RECLAIM >>>> >>>> I've repeatedly stated the opinion that managing the number of >>>> pageframes containing compressed pages will be valuable for >>>> managing MM interaction/policy when compression is used in >>>> the kernel. After the experimentation above and some brainstorming, >>>> I still do not see an effective method for zsmalloc evacuating and >>>> reclaiming pageframes, because both are complicated by high density >>>> and page-crossing. In other words, zsmalloc's strengths may >>>> also be its Achilles heels. For zram, as far as I can see, >>>> pageframe evacuation/reclaim is irrelevant except perhaps >>>> as part of mass defragmentation. For zcache and zswap, where >>>> writethrough is used, pageframe evacuation/reclaim is very relevant. >>>> (Note: The writeback implemented in zswap does _zpage_ evacuation >>>> without pageframe reclaim.) >>> >>> zswap writeback without guaranteed pageframe reclaim can occur during >>> swap activity. Reclaim, even if it doesn't free a physical page, makes >>> room in the page for incoming swap. 
With zswap the writeback mechanism >>> is driven by swap activity, so a zpage freed through writeback can be >>> back-filled by a newly compressed zpage. Fragmentation is an issue when >>> processes exit and block zpages are invalidated and becomes an issue when >>> zswap is idle. Otherwise the holes provide elasticity to accommodate >>> incoming pages to zswap. This is the case for both zswap and zcache. >>> >>> At idle we would want defragmentation or aging, either of which has >>> the end result of shrinking the cache and returning pages to the >>> memory manager. The former only reduces fragmentation while the >>> later has the additional benefit of returning memory for other uses. >>> By adding aging, through periodic writeback, zswap becomes a true cache, >>> it eliminates long-held allocations, and addresses fragmentation for >>> long-held allocations. >> >> We are definitely on different pages here. You are still trying to >> push zswap as a separate subsystem that can independently decide how >> to size itself. I see zcache (and zswap) as a "helper" for the MM >> subsystem which allow MM to store more anonymous/pagecache pages in >> memory than otherwise possible. > > IIUC from this and your "Better integration of compression with the > broader linux-mm" thread, you are wanting to allow the MM to tell a > compressed-MM subsystem to free up pages. There are a few problems I > see here, mostly policy related. How does the MM know whether is should > reclaim compressed page space or pages from the inactive list? In the > case of frontswap, the policies feedback on one another in that the > reclaim of an anonymous page from the inactive list via swap results in > an increase in the number of pages on the anonymous zspage list. > From the top level, I agree with Dan's opinion that we should integrate compression into the MM core system more tightly. But yes, it will be complicated, and I think it's hard to eliminate the effect on people who don't need any compression. > I'm not saying I have the solution. The ideal sizing of the compressed > pool is a complex issue and, like so many other elements of compressed > memory design, depends on the workload. > > That being said, just because an ideal policy for every workload doesn't > exist doesn't mean you can't choose one policy (hopefully a simple one) > and improve it as measurable deficiencies are identified. > Regards, -Bob
* Re: zsmalloc limitations and related topics 2013-03-14 12:02 ` Bob @ 2013-03-14 13:20 ` Robert Jennings 2013-03-14 18:54 ` Dan Magenheimer 0 siblings, 1 reply; 17+ messages in thread From: Robert Jennings @ 2013-03-14 13:20 UTC (permalink / raw) To: Bob Cc: Seth Jennings, Dan Magenheimer, minchan, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman * Bob (bob.liu@oracle.com) wrote: > On 03/14/2013 06:59 AM, Seth Jennings wrote: > >On 03/13/2013 03:02 PM, Dan Magenheimer wrote: > >>>From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] > >>>Subject: Re: zsmalloc limitations and related topics > >> <snip> > >>Yes. And add pageframe-reclaim to this list of things that > >>zsmalloc should do but currently cannot do. > > > >The real question is why is pageframe-reclaim a requirement? What > >operation needs this feature? > > > >AFAICT, the pageframe-reclaim requirements is derived from the > >assumption that some external control path should be able to tell > >zswap/zcache to evacuate a page, like the shrinker interface. But this > >introduces a new and complex problem in designing a policy that doesn't > >shrink the zpage pool so aggressively that it is useless. > > > >Unless there is another reason for this functionality I'm missing. > > > > Perhaps it's needed if the user want to enable/disable the memory > compression feature dynamically. > Eg, use it as a module instead of recompile the kernel or even > reboot the system. To unload zswap all that is needed is to perform writeback on the pages held in the cache, this can be done by extending the existing writeback code. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* RE: zsmalloc limitations and related topics 2013-03-14 13:20 ` Robert Jennings @ 2013-03-14 18:54 ` Dan Magenheimer 2013-03-15 16:14 ` Seth Jennings 2013-03-15 16:18 ` Seth Jennings 0 siblings, 2 replies; 17+ messages in thread From: Dan Magenheimer @ 2013-03-14 18:54 UTC (permalink / raw) To: Robert Jennings, Bob Liu Cc: Seth Jennings, minchan, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman > From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] > Sent: Thursday, March 14, 2013 7:21 AM > To: Bob > Cc: Seth Jennings; Dan Magenheimer; minchan@kernel.org; Nitin Gupta; Konrad Wilk; linux-mm@kvack.org; > linux-kernel@vger.kernel.org; Bob Liu; Luigi Semenzato; Mel Gorman > Subject: Re: zsmalloc limitations and related topics > > * Bob (bob.liu@oracle.com) wrote: > > On 03/14/2013 06:59 AM, Seth Jennings wrote: > > >On 03/13/2013 03:02 PM, Dan Magenheimer wrote: > > >>>From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] > > >>>Subject: Re: zsmalloc limitations and related topics > > >> > <snip> > > >>Yes. And add pageframe-reclaim to this list of things that > > >>zsmalloc should do but currently cannot do. > > > > > >The real question is why is pageframe-reclaim a requirement? What > > >operation needs this feature? > > > > > >AFAICT, the pageframe-reclaim requirements is derived from the > > >assumption that some external control path should be able to tell > > >zswap/zcache to evacuate a page, like the shrinker interface. But this > > >introduces a new and complex problem in designing a policy that doesn't > > >shrink the zpage pool so aggressively that it is useless. > > > > > >Unless there is another reason for this functionality I'm missing. > > > > > > > Perhaps it's needed if the user want to enable/disable the memory > > compression feature dynamically. > > Eg, use it as a module instead of recompile the kernel or even > > reboot the system. It's worth thinking about: Under what circumstances would a user want to turn off compression? While unloading a compression module should certainly be allowed if it makes a user comfortable, in my opinion, if a user wants to do that, we have done our job poorly (or there is a bug). > To unload zswap all that is needed is to perform writeback on the pages > held in the cache, this can be done by extending the existing writeback > code. Actually, frontswap supports this directly. See frontswap_shrink. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: zsmalloc limitations and related topics 2013-03-14 18:54 ` Dan Magenheimer @ 2013-03-15 16:14 ` Seth Jennings 2013-03-15 16:54 ` Dan Magenheimer 2013-03-15 16:18 ` Seth Jennings 1 sibling, 1 reply; 17+ messages in thread From: Seth Jennings @ 2013-03-15 16:14 UTC (permalink / raw) To: Dan Magenheimer Cc: Robert Jennings, Bob Liu, minchan, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman On 03/14/2013 01:54 PM, Dan Magenheimer wrote: >> From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] >> Sent: Thursday, March 14, 2013 7:21 AM >> To: Bob >> Cc: Seth Jennings; Dan Magenheimer; minchan@kernel.org; Nitin Gupta; Konrad Wilk; linux-mm@kvack.org; >> linux-kernel@vger.kernel.org; Bob Liu; Luigi Semenzato; Mel Gorman >> Subject: Re: zsmalloc limitations and related topics >> >> * Bob (bob.liu@oracle.com) wrote: >>> On 03/14/2013 06:59 AM, Seth Jennings wrote: >>>> On 03/13/2013 03:02 PM, Dan Magenheimer wrote: >>>>>> From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] >>>>>> Subject: Re: zsmalloc limitations and related topics >>>>> >> <snip> >>>>> Yes. And add pageframe-reclaim to this list of things that >>>>> zsmalloc should do but currently cannot do. >>>> >>>> The real question is why is pageframe-reclaim a requirement? What >>>> operation needs this feature? >>>> >>>> AFAICT, the pageframe-reclaim requirements is derived from the >>>> assumption that some external control path should be able to tell >>>> zswap/zcache to evacuate a page, like the shrinker interface. But this >>>> introduces a new and complex problem in designing a policy that doesn't >>>> shrink the zpage pool so aggressively that it is useless. >>>> >>>> Unless there is another reason for this functionality I'm missing. >>>>. >>> >>> Perhaps it's needed if the user want to enable/disable the memory >>> compression feature dynamically. >>> Eg, use it as a module instead of recompile the kernel or even >>> reboot the system. > > It's worth thinking about: Under what circumstances would a user want > to turn off compression? While unloading a compression module should > certainly be allowed if it makes a user comfortable, in my opinion, > if a user wants to do that, we have done our job poorly (or there > is a bug). > >> To unload zswap all that is needed is to perform writeback on the pages >> held in the cache, this can be done by extending the existing writeback >> code. > > Actually, frontswap supports this directly. See frontswap_shrink. frontswap_shrink() is a best-effort attempt to fault in all the pages stored in the backend. However, if there is not enough RAM to hold all the pages, then it can not completely evacuate the backend. Module exit functions must return void, so there is no way to fail a module unload. If you implement an exit function for your module, you must insure that it can always complete successfully. For this reason frontswap_shrink() is unsuitable for module unloading. You'd need to use a mechanism like writeback that could surely evacuate the backend (baring I/O failures). Seth -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* RE: zsmalloc limitations and related topics 2013-03-15 16:14 ` Seth Jennings @ 2013-03-15 16:54 ` Dan Magenheimer 0 siblings, 0 replies; 17+ messages in thread From: Dan Magenheimer @ 2013-03-15 16:54 UTC (permalink / raw) To: Seth Jennings Cc: Robert Jennings, Bob Liu, minchan, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman > From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com] > Subject: Re: zsmalloc limitations and related topics > > On 03/14/2013 01:54 PM, Dan Magenheimer wrote: > >> From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] > >> Subject: Re: zsmalloc limitations and related topics > >> > >> * Bob (bob.liu@oracle.com) wrote: > >>> On 03/14/2013 06:59 AM, Seth Jennings wrote: > >>>> On 03/13/2013 03:02 PM, Dan Magenheimer wrote: > >>>>>> From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] > >>>>>> Subject: Re: zsmalloc limitations and related topics > >>>>> > >> <snip> > >>>>> Yes. And add pageframe-reclaim to this list of things that > >>>>> zsmalloc should do but currently cannot do. > >>>> > >>>> The real question is why is pageframe-reclaim a requirement? What > >>>> operation needs this feature? > >>>> > >>>> AFAICT, the pageframe-reclaim requirements is derived from the > >>>> assumption that some external control path should be able to tell > >>>> zswap/zcache to evacuate a page, like the shrinker interface. But this > >>>> introduces a new and complex problem in designing a policy that doesn't > >>>> shrink the zpage pool so aggressively that it is useless. > >>>> > >>>> Unless there is another reason for this functionality I'm missing. > >>>>. > >>> > >>> Perhaps it's needed if the user want to enable/disable the memory > >>> compression feature dynamically. > >>> Eg, use it as a module instead of recompile the kernel or even > >>> reboot the system. > > > > It's worth thinking about: Under what circumstances would a user want > > to turn off compression? While unloading a compression module should > > certainly be allowed if it makes a user comfortable, in my opinion, > > if a user wants to do that, we have done our job poorly (or there > > is a bug). > > > >> To unload zswap all that is needed is to perform writeback on the pages > >> held in the cache, this can be done by extending the existing writeback > >> code. > > > > Actually, frontswap supports this directly. See frontswap_shrink. > > frontswap_shrink() is a best-effort attempt to fault in all the pages > stored in the backend. However, if there is not enough RAM to hold all > the pages, then it can not completely evacuate the backend. > > Module exit functions must return void, so there is no way to fail a > module unload. If you implement an exit function for your module, you > must insure that it can always complete successfully. For this reason > frontswap_shrink() is unsuitable for module unloading. You'd need to > use a mechanism like writeback that could surely evacuate the backend > (baring I/O failures). A single call to frontswap_shrink may be unsuitable... multiple calls (do while zcache/zswap is not empty) may work fine. Writeback-until-empty should also work fine. In any case, it's a good point that module exit must succeed, and that if there is already heavy memory pressure when zcache/zswap module exit is invoked, module exit may be very very slow and cause many many swap disk writes, so the system may become unresponsive (and may even OOM). So if someone implements zcache/zswap module unload, a thorough test plan would be good. 
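For reference, the "call frontswap_shrink until empty" idea described above might look roughly like the following on module unload, assuming the frontswap API of this era (frontswap_curr_pages() and frontswap_shrink()). This is an untested sketch; as noted, a real unload path would still need writeback-until-empty as a fallback when there is not enough free RAM to pull everything back in:

#include <linux/kernel.h>     /* ULONG_MAX */
#include <linux/sched.h>      /* cond_resched() */
#include <linux/frontswap.h>  /* frontswap_curr_pages(), frontswap_shrink() */

/*
 * Best-effort drain of a frontswap backend prior to module unload.
 * frontswap_shrink() faults stored pages back in; if it stops making
 * progress (e.g. memory is too tight), fall back to writeback.
 */
static void drain_frontswap_backend(void)
{
	unsigned long remaining, prev = ULONG_MAX;

	while ((remaining = frontswap_curr_pages()) > 0) {
		if (remaining >= prev)
			break;		/* no forward progress; switch to writeback-until-empty */
		prev = remaining;
		frontswap_shrink(0);	/* try to shrink the backend toward zero stored pages */
		cond_resched();
	}
}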
* Re: zsmalloc limitations and related topics 2013-03-14 18:54 ` Dan Magenheimer 2013-03-15 16:14 ` Seth Jennings @ 2013-03-15 16:18 ` Seth Jennings 1 sibling, 0 replies; 17+ messages in thread From: Seth Jennings @ 2013-03-15 16:18 UTC (permalink / raw) To: Dan Magenheimer Cc: Robert Jennings, Bob Liu, minchan, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman On 03/14/2013 01:54 PM, Dan Magenheimer wrote: >> From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] >> Sent: Thursday, March 14, 2013 7:21 AM >> To: Bob >> Cc: Seth Jennings; Dan Magenheimer; minchan@kernel.org; Nitin Gupta; Konrad Wilk; linux-mm@kvack.org; >> linux-kernel@vger.kernel.org; Bob Liu; Luigi Semenzato; Mel Gorman >> Subject: Re: zsmalloc limitations and related topics >> >> * Bob (bob.liu@oracle.com) wrote: >>> On 03/14/2013 06:59 AM, Seth Jennings wrote: >>>> On 03/13/2013 03:02 PM, Dan Magenheimer wrote: >>>>>> From: Robert Jennings [mailto:rcj@linux.vnet.ibm.com] >>>>>> Subject: Re: zsmalloc limitations and related topics >>>>> >> <snip> >>>>> Yes. And add pageframe-reclaim to this list of things that >>>>> zsmalloc should do but currently cannot do. >>>> >>>> The real question is why is pageframe-reclaim a requirement? What >>>> operation needs this feature? >>>> >>>> AFAICT, the pageframe-reclaim requirements is derived from the >>>> assumption that some external control path should be able to tell >>>> zswap/zcache to evacuate a page, like the shrinker interface. But this >>>> introduces a new and complex problem in designing a policy that doesn't >>>> shrink the zpage pool so aggressively that it is useless. >>>> >>>> Unless there is another reason for this functionality I'm missing. >>>>. >>> >>> Perhaps it's needed if the user want to enable/disable the memory >>> compression feature dynamically. >>> Eg, use it as a module instead of recompile the kernel or even >>> reboot the system. > > It's worth thinking about: Under what circumstances would a user want > to turn off compression? While unloading a compression module should > certainly be allowed if it makes a user comfortable, in my opinion, > if a user wants to do that, we have done our job poorly (or there > is a bug). > >> To unload zswap all that is needed is to perform writeback on the pages >> held in the cache, this can be done by extending the existing writeback >> code. > > Actually, frontswap supports this directly. See frontswap_shrink. frontswap_shrink() is a best-effort attempt to fault in all the pages stored in the backend. However, if there is not enough RAM to hold all the pages, then it can not completely evacuate the backend. Module exit functions must return void, so there is no way to fail a module unload. If you implement an exit function for your module, you must insure that it can always complete successfully. For this reason frontswap_shrink() is unsuitable for module unloading. You'd need to use a mechanism like writeback that could surely evacuate the backend (baring I/O failures). Seth -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
* RE: zsmalloc limitations and related topics 2013-03-13 22:59 ` Seth Jennings 2013-03-14 12:02 ` Bob @ 2013-03-14 17:39 ` Dan Magenheimer 1 sibling, 0 replies; 17+ messages in thread From: Dan Magenheimer @ 2013-03-14 17:39 UTC (permalink / raw) To: Seth Jennings Cc: Robert Jennings, minchan, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel, Bob Liu, Luigi Semenzato, Mel Gorman > From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com] > Subject: Re: zsmalloc limitations and related topics Hi Seth -- Thanks for the reply. I think it is very important to be having these conversations. > >>> 2) When not full and especially when nearly-empty _after_ > >>> being full, density may fall below 1.0 as a result of > >>> fragmentation. > >> > >> True and there are several ways to address this including > >> defragmentation, fewer class sizes in zsmalloc, aging, and/or writeback > >> of zpages in sparse zspages to free pageframes during normal writeback. > > > > Yes. And add pageframe-reclaim to this list of things that > > zsmalloc should do but currently cannot do. > > The real question is why is pageframe-reclaim a requirement? It is because pageframes are the currency of the MM subsystem. See more below. > What operation needs this feature? > AFAICT, the pageframe-reclaim requirements is derived from the > assumption that some external control path should be able to tell > zswap/zcache to evacuate a page, like the shrinker interface. But this > introduces a new and complex problem in designing a policy that doesn't > shrink the zpage pool so aggressively that it is useless. > > Unless there is another reason for this functionality I'm missing. That's the reason. IMHO, it is precisely this "new and complex" problem that we must solve. Otherwise, compression is just a cool toy that may (or may not) help your workload if you turn it on. Zcache already does implement "a policy that doesn't shrink the zpage pool so aggressively that it is useless". While I won't claim the policy is the right one, it is a policy, it is not particularly complex, and it is definitely not useless. And it depends on pageframe-reclaim. > >>> 3) Zsmalloc has a density of exactly 1.0 for any number of > >>> zpages with zsize >= 0.8. > >> > >> For this reason zswap does not cache pages which in this range. > >> It is not enforced in the allocator because some users may be forced to > >> store these pages; users like zram. > > > > Again, without a "representative" workload, we don't know whether > > or not it is important to manage pages with zsize >= 0.8. You are > > simply dismissing it as unnecessary because zsmalloc can't handle > > them and because they don't appear at any measurable frequency > > in kernbench or SPECjbb. (Zbud _can_ efficiently handle these larger > > pages under many circumstances... but without a "representative" workload, > > we don't know whether or not those circumstances will occur.) > > The real question is not whether any workload would operate on pages > that don't compress to 80%. Any workload that operates on pages of > already compressed or encrypted data would do this. The question is, is > it worth it to store those pages in the compressed cache since the > effective reclaim efficiency approaches 0. You are letting the implementation of zsmalloc color your thinking. Zbud can quite efficiently store pages that compress up to zsize = ((63 * PAGE_SIZE) / 64) because it buddies highly compressible pages with poorly compressible pages. This is also, of course, very zsize-distribution-dependent. 
(These are not just already-compressed or encrypted data, although those are good examples. Compressibility is related to entropy, and there may be many anonymous pages that have high entropy. We really just don't know.) > >>> 4) Zsmalloc contains several compile-time parameters; > >>> the best value of these parameters may be very workload > >>> dependent. > >> > >> The parameters fall into two major areas, handle computation and class > >> size. The handle can be abstracted away, eliminating the compile-time > >> parameters. The class-size tunable could be changed to a default value > >> with the option for specifying an alternate value from the user during > >> pool creation. > > > > Perhaps my point here wasn't clear so let me be more blunt: > > There's no way in hell that even a very sophisticated user > > will know how to set these values. I think we need to > > ensure either that they are "always right" (which without > > a "representative workload"...) or, preferably, have some way > > so that they can dynamically adapt at runtime. > > I think you made the point that if this "representative workload" is > completely undefined, then having tunables for zsmalloc that are "always > right" is also not possible. The best we can hope for is "mostly right" > which, of course, is difficult to get everyone to agree on and will be > based on usage. I agree "always right" is impossible and, as I said, would prefer adaptable. I think zsmalloc and zbud address very different zsize-distributions so some combination may be better than either by itself. > >>> If density == 1.0, that means we are paying the overhead of > >>> compression+decompression for no space advantage. If > >>> density < 1.0, that means using zsmalloc is detrimental, > >>> resulting in worse memory pressure than if it were not used. > >>> > >>> WORKLOAD ANALYSIS > >>> > >>> These limitations emphasize that the workload used to evaluate > >>> zsmalloc is very important. Benchmarks that measure data > >>> throughput or CPU utilization are of questionable value because > >>> it is the _content_ of the data that is particularly relevant > >>> for compression. Even more precisely, it is the "entropy" > >>> of the data that is relevant, because the amount of > >>> compressibility in the data is related to the entropy: > >>> I.e. an entirely random pagefull of bits will compress poorly > >>> and a highly-regular pagefull of bits will compress well. > >>> Since the zprojects manage a large number of zpages, both > >>> the mean and distribution of zsize of the workload should > >>> be "representative". > >>> > >>> The workload most widely used to publish results for > >>> the various zprojects is a kernel-compile using "make -jN" > >>> where N is artificially increased to impose memory pressure. > >>> By adding some debug code to zswap, I was able to analyze > >>> this workload and found the following: > >>> > >>> 1) The average page compressed by almost a factor of six > >>> (mean zsize == 694, stddev == 474) > >>> 2) Almost eleven percent of the pages were zero pages. A > >>> zero page compresses to 28 bytes. > >>> 3) On average, 77% of the bytes (3156) in the pages-to-be- > >>> compressed contained a byte-value of zero. > >>> 4) Despite the above, mean density of zsmalloc was measured at > >>> 3.2 zpages/pageframe, presumably losing nearly half of > >>> available space to fragmentation. 
> >>> > >>> I have no clue if these measurements are representative > >>> of a wide range of workloads over the lifetime of a booted > >>> machine, but I am suspicious that they are not. For example, > >>> the lzo1x compression algorithm claims to compress data by > >>> about a factor of two. > >> > >> I'm suspicious of the "factor of two" claim. The reference > >> (http://www.oberhumer.com/opensource/lzo/lzodoc.php) for this would appear > >> to be the results of compressing the Calgary Corpus. This is fine for > >> comparing compression algorithms but I would be hesitant to apply that > >> to this problem space. To illustrate the affect of input set, the newer > >> Canterbury Corpus compresses to ~43% of the input size using LZO1X-1. > > > > Yes, agreed, we have no idea if the Corpus is representative of > > this problem space... because we have no idea what would > > be a "representative workload" for this problem space. > > > > But for how I was using "factor of two", a factor of 100/43=~2.3 is > > close enough. I was only trying to say "factor of two" may be > > more "representative" than the "factor of six" in kernbench. > > Again, this "representative workload" is undefined to the point of > uselessness. At this point _any_ actual workload is more useful than > this undefined representative. I think you are just saying that, on a scale of zero to infinity, "one" is better than "zero". While I can't argue with that logic, I'd prefer "many" to "one", and I'd prefer some theoretical foundation which implies that "many" and "very many" will be similar. > > (As an aside, I like the data Nitin collected here: > > http://code.google.com/p/compcache/wiki/CompressedLengthDistribution > > as it shows how different workloads can result in dramatically > > different zsize distributions. However, this data includes > > all the pages in a running system, including both anonymous > > and file pages, and doesn't include mean/stddev.) > > > >> In practice the average for LZO would be workload dependent, as you > >> demonstrate with the kernel build. Swap page entropy for any given > >> workload will not necessarily fit the distribution present in the > >> Calgary Corpus. The high density allocation design in zsmalloc allows > >> for workloads that can compress to factors greater than 2 to do so. > > > > Exactly. But at what cost on other workloads? And how do we evaluate > > the cost/benefit of that high density? (... without a "representative > > workload" ;-) > > > >>> I would welcome ideas on how to evaluate workloads for > >>> "representativeness". Personally I don't believe we should > >>> be making decisions about selecting the "best" algorithms > >>> or merging code without an agreement on workloads. > >> > >> I'd argue that there is no such thing as a "representative workload". > >> Instead, we try different workloads to validate the design and illustrate > >> the performance characteristics and impacts. > > > > Sorry for repeatedly hammering my point in the above, but > > there have been many design choices driven by what was presumed > > to be representative (kernbench and now SPECjbb) workload > > that may be entirely wrong for a different workload (as > > Seth once pointed out using the text of Moby Dick as a source > > data stream). > > The reality we are going to have to face with the feature of memory > compression is that not every workload can benefit. The objective > should be to improve known workloads that are able to benefit. 
Then > make improvements that grow that set of workloads. Right, I definitely agree that some compression solution is better than no compression solution, but with this important caveat: _provided_ that the compression solution doesn't benefit some workloads and _penalize_ other workloads. However, the discussion we are having is more about which compression solution is better and why, because we have made different design choices based on assumptions that may or may not be valid. I think it is incumbent on us to validate those assumptions. > > Further, the value of different designs can't be measured here just > > by the workload because the pages chosen to swap may be completely > > independent of the intended workload-driver... i.e. if you track > > the pid of the pages intended for swap, the pages can be mostly > > pages from long-running or periodic system services, not pages > > generated by kernbench or SPECjbb. So it is the workload PLUS the > > environment that is being measured and evaluated. That makes > > the problem especially tough. > > > > Just to clarify, I'm not suggesting that there is any single > > workload that can be called representative, just that we may > > need both a broad set of workloads (not silly benchmarks) AND > > some theoretical analysis to drive design decisions. And, without > > this, arguing about whether zsmalloc is better than zbud or not > > is silly. Both zbud and zsmalloc have strengths and weaknesses. > > > > That said, it should also be pointed out that the stream of > > pages-to-compress from cleancache ("file pages") may be dramatically > > different than for frontswap ("anonymous pages"), so unless you > > and Seth are going to argue upfront that cleancache pages should > > NEVER be candidates for compression, the evaluation criteria > > to drive design decisions needs to encompass both anonymous > > and file pages. It is currently impossible to evaluate that > > with zswap. > > > >>> PAGEFRAME EVACUATION AND RECLAIM > >>> > > > > We are definitely on different pages here. You are still trying to > > push zswap as a separate subsystem that can independently decide how > > to size itself. I see zcache (and zswap) as a "helper" for the MM > > subsystem which allow MM to store more anonymous/pagecache pages in > > memory than otherwise possible. > > IIUC from this and your "Better integration of compression with the > broader linux-mm" thread, you are wanting to allow the MM to tell a > compressed-MM subsystem to free up pages. There are a few problems I > see here, mostly policy related. How does the MM know whether is should > reclaim compressed page space or pages from the inactive list? In the > case of frontswap, the policies feedback on one another in that the > reclaim of an anonymous page from the inactive list via swap results in > an increase in the number of pages on the anonymous zspage list. > > I'm not saying I have the solution. The ideal sizing of the compressed > pool is a complex issue and, like so many other elements of compressed > memory design, depends on the workload. > > That being said, just because an ideal policy for every workload doesn't > exist doesn't mean you can't choose one policy (hopefully a simple one) > and improve it as measurable deficiencies are identified. But a policy very definitely can impose constraints on the underlying implementation. 
I am claiming (and have been since last summer) that a compression policy integrated (partially or fully) with MM is better than a completely independent compression policy. (Right now, the only information zswap has that it can use to drive its policy is whether or not alloc_page was successful!) Assuming an integrated policy, since MM's currency of choice is pageframes, I am further claiming that it is important for the compression policy to easily converse in pageframes, i.e. pageframe-reclaim must be supported by the underlying implementation. Zsmalloc doesn't support pageframe- reclaim and adding it may require a complete rewrite. And I am claiming that a policy for managing compression for BOTH pagecache pages AND anonymous pages is very important and that there is opportunity for policy interaction between them. Zswap implements compression for only the (far) simpler of these two, so policy management between the two classes cannot be addressed. Thanks, Dan P.S. (moved to end) > > (As an aside, I'm measuring zsize=28 bytes for a zero page... > > Seth has repeatedly said 103 bytes and I think this is > > reflected in your computation above. Maybe it is 103 for your > > hardware compression engine? Else, I'm not sure why our > > numbers would be different.) > > I rechecked this and found my measurement was flawed. It was based on > compressing a zero-filled file with lzop -1. The file size is 107 but, > as I recently discovered, contains LZO metadata as well. Using lzop -l, > I got that the compressed size of the data (not the file), is 44 bytes. > So still not what you are observing but closer. > > $ dd if=/dev/zero of=zero.page bs=4k count=1 > $ lzop -1 zero.page > $ lzop -l zero.page.lzo > method compressed uncompr. ratio uncompressed_name > LZO1X-1(15) 44 4096 1.1% zero.page I added debug code to look at dlen in zswap_frontswap_store on zero-filled pages and always get 28. Perhaps the lzo code in the kernel is different from the lzo code in userland lzop? Or maybe you are measuring on PPC, and lzo is different there? It would be nice to solve the mystery of this difference. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 17+ messages in thread
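One way to pin down the 28-vs-44-byte discrepancy discussed above would be to run the in-kernel LZO compressor directly on a zeroed page, independent of both zswap internals and userland lzop. A minimal, untested probe-module sketch using the kernel's lzo1x_1_compress() API might look like:

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/lzo.h>
#include <linux/mm.h>

static int __init zero_zsize_init(void)
{
	u8 *src, *dst, *wrkmem;
	size_t dlen = lzo1x_worst_compress(PAGE_SIZE);
	int ret;

	src = kzalloc(PAGE_SIZE, GFP_KERNEL);			/* a zero-filled page */
	dst = kmalloc(lzo1x_worst_compress(PAGE_SIZE), GFP_KERNEL);
	wrkmem = kmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
	if (src && dst && wrkmem) {
		ret = lzo1x_1_compress(src, PAGE_SIZE, dst, &dlen, wrkmem);
		pr_info("zero page: lzo1x_1_compress ret=%d dlen=%zu\n", ret, dlen);
	}
	kfree(wrkmem);
	kfree(dst);
	kfree(src);
	return 0;
}

static void __exit zero_zsize_exit(void)
{
}

module_init(zero_zsize_init);
module_exit(zero_zsize_exit);
MODULE_LICENSE("GPL");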
* RE: zsmalloc limitations and related topics
2013-03-13 20:02 ` Dan Magenheimer
2013-03-13 22:59 ` Seth Jennings
@ 2013-03-14 19:16 ` Dan Magenheimer
1 sibling, 0 replies; 17+ messages in thread
From: Dan Magenheimer @ 2013-03-14 19:16 UTC (permalink / raw)
To: Dan Magenheimer, Robert Jennings
Cc: minchan, sjenning, Nitin Gupta, Konrad Wilk, linux-mm, linux-kernel,
Bob Liu, Luigi Semenzato, Mel Gorman

> From: Dan Magenheimer
> Subject: RE: zsmalloc limitations and related topics
>
> > > I would welcome ideas on how to evaluate workloads for
> > > "representativeness".  Personally I don't believe we should
> > > be making decisions about selecting the "best" algorithms
> > > or merging code without an agreement on workloads.
> >
> > I'd argue that there is no such thing as a "representative workload".
> > Instead, we try different workloads to validate the design and illustrate
> > the performance characteristics and impacts.
>
> Sorry for repeatedly hammering my point in the above, but
> there have been many design choices driven by what was presumed
> to be a representative (kernbench and now SPECjbb) workload
> that may be entirely wrong for a different workload (as
> Seth once pointed out using the text of Moby Dick as a source
> data stream).
>
> Further, the value of different designs can't be measured here just
> by the workload because the pages chosen to swap may be completely
> independent of the intended workload-driver... i.e. if you track
> the pid of the pages intended for swap, the pages can be mostly
> pages from long-running or periodic system services, not pages
> generated by kernbench or SPECjbb.  So it is the workload PLUS the
> environment that is being measured and evaluated.  That makes
> the problem especially tough.
>
> Just to clarify, I'm not suggesting that there is any single
> workload that can be called representative, just that we may
> need both a broad set of workloads (not silly benchmarks) AND
> some theoretical analysis to drive design decisions.  And, without
> this, arguing about whether zsmalloc is better than zbud or not
> is silly.  Both zbud and zsmalloc have strengths and weaknesses.
>
> That said, it should also be pointed out that the stream of
> pages-to-compress from cleancache ("file pages") may be dramatically
> different than for frontswap ("anonymous pages"), so unless you
> and Seth are going to argue upfront that cleancache pages should
> NEVER be candidates for compression, the evaluation criteria
> to drive design decisions needs to encompass both anonymous
> and file pages.  It is currently impossible to evaluate that
> with zswap.

Sorry to reply to myself here, but I realized last night that
I left off another related important point:

We have a tendency to run benchmarks on a "cold" system so that
the results are reproducible.  For compression, however, this may
unnaturally skew the entropy of data-pages-to-be-compressed and so
also the density measurements.

I can't prove it, but I suspect that soon after boot the number of
anonymous pages containing all (or nearly all) zeroes is large,
i.e. entropy is low.  As the length of time grows since the system
booted, more anonymous pages will be written with non-zero data,
thus increasing entropy and decreasing compressibility.  So, over
time, the distribution of zsize may slowly skew right (toward
PAGE_SIZE).  If so, this effect may be very real but very hard
to observe.

Dan
^ permalink raw reply	[flat|nested] 17+ messages in thread
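
The entropy argument above is hard to observe directly, but the
quantity it appeals to is easy to compute.  Below is a small, purely
illustrative sketch, not from the thread, that computes the Shannon
byte-entropy of a 4 KiB page; sampling this over pages selected for
swap at different uptimes would be one way to check whether zsize
really does skew toward PAGE_SIZE as the system ages.  The page
contents constructed in main() are hypothetical stand-ins.

/*
 * Illustrative sketch: Shannon byte-entropy of a 4 KiB page, in bits
 * per byte.  0.0 for an all-zero page, approaching 8.0 for random
 * data; lower entropy generally means a smaller zsize.
 * Build with: gcc -O2 entropy.c -lm -o entropy
 */
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

static double page_entropy(const unsigned char *page)
{
	unsigned int hist[256] = { 0 };
	double entropy = 0.0;
	size_t i;

	for (i = 0; i < PAGE_SIZE; i++)
		hist[page[i]]++;

	for (i = 0; i < 256; i++) {
		if (hist[i]) {
			double p = (double)hist[i] / PAGE_SIZE;
			entropy -= p * log2(p);
		}
	}
	return entropy;		/* bits per byte, 0.0 .. 8.0 */
}

int main(void)
{
	unsigned char zero_page[PAGE_SIZE];
	unsigned char aging_page[PAGE_SIZE];
	size_t i;

	memset(zero_page, 0, sizeof(zero_page));

	/* crude stand-in for an anonymous page that has been partly
	 * overwritten with non-zero data since boot */
	memset(aging_page, 0, sizeof(aging_page));
	for (i = 0; i < PAGE_SIZE / 4; i++)
		aging_page[i * 4] = (unsigned char)(i * 2654435761u >> 24);

	printf("zero page entropy:  %.3f bits/byte\n",
	       page_entropy(zero_page));
	printf("aging page entropy: %.3f bits/byte\n",
	       page_entropy(aging_page));
	return 0;
}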
end of thread, other threads:[~2013-03-15 16:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-27 23:24 zsmalloc limitations and related topics Dan Magenheimer
2013-02-28 22:00 ` Dan Magenheimer
2013-03-01 1:40 ` Ric Mason
2013-03-04 18:29 ` Dan Magenheimer
2013-03-13 15:14 ` Robert Jennings
2013-03-13 15:33 ` Seth Jennings
2013-03-13 15:56 ` Seth Jennings
2013-03-13 20:02 ` Dan Magenheimer
2013-03-13 22:59 ` Seth Jennings
2013-03-14 12:02 ` Bob
2013-03-14 13:20 ` Robert Jennings
2013-03-14 18:54 ` Dan Magenheimer
2013-03-15 16:14 ` Seth Jennings
2013-03-15 16:54 ` Dan Magenheimer
2013-03-15 16:18 ` Seth Jennings
2013-03-14 17:39 ` Dan Magenheimer
2013-03-14 19:16 ` Dan Magenheimer