Q. cache in squashfs?

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Q. cache in squashfs?
@ 2010-06-24  2:37 J. R. Okajima
  2010-07-08  3:57 ` Phillip Lougher
  0 siblings, 1 reply; 16+ messages in thread
From: J. R. Okajima @ 2010-06-24  2:37 UTC (permalink / raw)
  To: phillip; +Cc: linux-fsdevel

Hello Phillip,

I've found an intersting issue about squashfs.
Please give me a guidance or an advise.
In short:
Why does squashfs read and decompress the same block several times?
Is the nested fs-image always better for squashfs?

Long:
I created two squashfs images.
- from /bin directly by mksquashfs
  $ mksquashfs /bin /tmp/a.img
- from a single ext3 fs image which contains /bin
  $ dd if=/dev/zero of=/tmp/ext3/img bs=... count=...
  $ mkfs -t ext3 -F -m 0 -T small -O dir_index /tmp/ext3/img
  $ sudo mount -o loop /tmp/ext3/img /mnt
  $ tar -C /bin -cf - . | tar -C /mnt -xpf -
  $ sudo umount /mnt
  $ mksquashfs /tmp/ext3/img /tmp/b.img

Of course, /tmp/b.img is bigger than /tmp/a.img. It is OK.
For these squashfs, I tried profiling the random file read all over the
fs.
$ find /squashfs -type f > /tmp/l
$ seq 10 | time sh -c "while read i; do rl /tmp/l | xargs -r cat & done > /dev/null; wait"
("rl" is a command to randomize lines)

For b.img, I have to loopback-mount twice.
$ mount -o ro,loop /tmp/b.img /tmp/sq
$ mount -o ro,loop /tmp/sq/img /mnt

Honestly speaking, I gueesed b.img is slower due to the nested fs
overhead. But it shows that b.img (ext3 within squashfs) consumes less
CPU cycles and faster.

- a.img (plain squashfs)
  0.00user 0.14system 0:00.09elapsed 151%CPU (0avgtext+0avgdata 2192maxresident)k
  0inputs+0outputs (0major+6184minor)pagefaults 0swaps

(oprofile report)
samples  %        image name               app name                 symbol name
710      53.9514  zlib_inflate.ko          zlib_inflate             inflate_fast
123       9.3465  libc-2.7.so              libc-2.7.so              (no symbols)
119       9.0426  zlib_inflate.ko          zlib_inflate             zlib_adler32
106       8.0547  zlib_inflate.ko          zlib_inflate             zlib_inflate
95        7.2188  ld-2.7.so                ld-2.7.so                (no symbols)
64        4.8632  oprofiled                oprofiled                (no symbols)
36        2.7356  dash                     dash                     (no symbols)

- b.img (ext3 + squashfs)
  0.00user 0.01system 0:00.06elapsed 22%CPU (0avgtext+0avgdata 2192maxresident)k
  0inputs+0outputs (0major+6134minor)pagefaults 0swaps

samples  %        image name               app name                 symbol name
268      37.0678  zlib_inflate.ko          zlib_inflate             inflate_fast
126      17.4274  libc-2.7.so              libc-2.7.so              (no symbols)
106      14.6611  ld-2.7.so                ld-2.7.so                (no symbols)
57        7.8838  zlib_inflate.ko          zlib_inflate             zlib_adler32
45        6.2241  oprofiled                oprofiled                (no symbols)
40        5.5325  dash                     dash                     (no symbols)
33        4.5643  zlib_inflate.ko          zlib_inflate             zlib_inflate

The biggest difference is to decompress the blocks.
(Since /bin is used for this sample, the difference is not so big. But
if I used antoher dir which has much more files than /bin, then the
difference grows too).
I don't think the difference of fs-layout or metadata is a problem.
Actually inserting debug-prints to show the block index in
squashfs_read_data(), it shows squashfs reads the same block multiple
times from a.img.

int squashfs_read_data(struct super_block *sb, void **buffer, u64 index,
			int length, u64 *next_index, int srclength, int pages)
{
	:::
	// for datablock
	for (b = 0; bytes < length; b++, cur_index++) {
		bh[b] = sb_getblk(sb, cur_index);
+		pr_info("%llu\n", cur_index);
		if (bh[b] == NULL)
			goto block_release;
		bytes += msblk->devblksize;
	}
	ll_rw_block(READ, b, bh);
	:::
	// for metadata
	for (; bytes < length; b++) {
		bh[b] = sb_getblk(sb, ++cur_index);
+		pr_info("%llu\n", cur_index);
		if (bh[b] == NULL)
			goto block_release;
		bytes += msblk->devblksize;
	}
	ll_rw_block(READ, b - 1, bh + 1);
	:::
}

In case of b.img, the same block is read several times too. But the
number of times is much smaller than a.img.

I am intrested where did the difference come from.
Do you think the loopback block device in the middle cached the
decompressed block effectively?
- a.img
  squashfs
  + loop0
  + disk

- b.img
  ext3
  + loop1	<-- so effective?
  + squashfs
  + loop0
  + disk

In other word, is inserting a loopback mount always effective for all
squashfs?

Thanx for reading long mail
J. R. Okajima

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Q. cache in squashfs?
  2010-06-24  2:37 Q. cache in squashfs? J. R. Okajima
@ 2010-07-08  3:57 ` Phillip Lougher
  2010-07-08  6:08   ` J. R. Okajima
  0 siblings, 1 reply; 16+ messages in thread
From: Phillip Lougher @ 2010-07-08  3:57 UTC (permalink / raw)
  To: J. R. Okajima; +Cc: linux-fsdevel

J. R. Okajima wrote:
> Hello Phillip,
> 
> I've found an intersting issue about squashfs.
> Please give me a guidance or an advise.
> In short:
> Why does squashfs read and decompress the same block several times?
> Is the nested fs-image always better for squashfs?
> 

Hi Junjiro,

What I think you're seeing here is the negative effect of fragment
blocks (tail-end packing) in the native squashfs example and the
positive effect of vfs/loop block caching in the ext3 on squashfs example.

In the native Squashfs example (a.img) multiple files are packed into
fragment blocks and then compressed together to get better compression.
Fragment blocks are efficient at getting better compression, but are
highly CPU decompress inefficient on very random file access patterns.
It is easy to see why - access of file X (of say 5K) packed into
a 128K fragment containing files W, X ,Y and Z requires the decompression
of the entire 128K fragment.  If the file access pattern is extremely random
(i.e. it doesn't exhibit any locality of reference), none of the other
files are accessed before the decompressed fragment is evicted from
the fragment cache.  So, when the other files are read the fragment
needs to be read and decompressed a second (and third etc.) time.

As I said fragments are good at getting better compression, and are
CPU efficient in normal file access patterns which exhibit locality
of reference - files in the same directory (and hence are in the
same fragment) tend to be accessed at the same time.  In the above example
it means once the fragment has been decompressed for file X, the other
files W, Y and Z will likely to be read soon afterwards and will find
the already decompressed fragment in the fragment cache.  Indeed for this
scenario fragments actually improve I/O performance and reduce CPU overhead.

In the ext3 on squashfs example you're seeing the effect of VFS/loopback
caching of the decompressed data from Squashfs.  In this example
the ext3 file is stored as a single file inside Squashfs compressed
in blocks of 128K.  In normal operation the mounted ext3
filesystem will issue 4K block reads to the loopback file which
will cause the underlying 128K compressed block to be read from
the Squashfs file, and this decompressed data will go into the
VFS page cache.  With random file access of the ext3 file system only
a small part of that 128K block may be required at this time, however,
much later ext3 filesystem accesses requiring that 128K block will
be satisfied from the page cache still holding the decompressed block,
and so this later access *won't* go to Squashfs wand there will not be
a second decompress of the block.

The overall effect means worse performance for Squashfs.  But this
is rather unsurprising due to the negative effects of fragments and
the positive effect of VFS caching in the case of ext3 on Squashfs.

As I said I suspect the major cause of worse performance for Squashfs
is fragments coupled with your very random file access pattern.
Fragments simply do not work well with atypical random access patterns.
If you expect random access you should *not* use fragments and
specify the -no-fragments option to Mksquashfs.

The default Mksquashfs options (duplicate detection, fragments, 128K
blocks) are a compromise which give high compression coupled with
fast I/O for typical file access patterns.  However, people should
*always* play around with the defaults as the different compression and
I/O performance achieved may suit their needs better.  Unfortunately very
few people do so, which is a shame as I often see people complaining
about various aspects of Squashfs which I know would be solved if they'd
only use different Mksquashfs settings.

Regards

Phillip

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Q. cache in squashfs?
  2010-07-08  3:57 ` Phillip Lougher
@ 2010-07-08  6:08   ` J. R. Okajima
  2010-07-09  7:53     ` J. R. Okajima
  0 siblings, 1 reply; 16+ messages in thread
From: J. R. Okajima @ 2010-07-08  6:08 UTC (permalink / raw)
  To: Phillip Lougher; +Cc: linux-fsdevel


Phillip Lougher:
> What I think you're seeing here is the negative effect of fragment
> blocks (tail-end packing) in the native squashfs example and the
> positive effect of vfs/loop block caching in the ext3 on squashfs example.

Thank you very much for your explanation.
I think the number of cached decompressed fragment blocks is related
too. I thought it is much larger, but I found it is 3 by default. I will
try larger value with/without -no-fragments which you pointed.

Also I am afraid the nested loopback mount will cause caching doubly (or
more), cache by ext3-loopback and by native squashfs loopback, and some
people doesn't want this.
But if user has rich memory and doen't care about nested caching
(because it will be reclaimed when necessary), then I expect the nested
loopback mount will be a good option.
For instance,
- CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE = 1
- inner single ext2 image
- mksquashfs without -no-fragments
- ram 1GB
- the squashfs image size 250MB

Do you think will it be better for very random access pattern?


J. R. Okajima

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Q. cache in squashfs?
  2010-07-08  6:08   ` J. R. Okajima
@ 2010-07-09  7:53     ` J. R. Okajima
  2010-07-09 10:32       ` Phillip Lougher
  0 siblings, 1 reply; 16+ messages in thread
From: J. R. Okajima @ 2010-07-09  7:53 UTC (permalink / raw)
  To: Phillip Lougher, linux-fsdevel


> Phillip Lougher:
> > What I think you're seeing here is the negative effect of fragment
> > blocks (tail-end packing) in the native squashfs example and the
> > positive effect of vfs/loop block caching in the ext3 on squashfs example.
>
> Thank you very much for your explanation.
> I think the number of cached decompressed fragment blocks is related
> too. I thought it is much larger, but I found it is 3 by default. I will
> try larger value with/without -no-fragments which you pointed.

The -no-fragments shows better performance, but it is very small.
It doesn't seem that the number of fragment blocks is large on my test
environment.

Next, I tried increasing the number of cache entries in squashfs.
squashfs_fill_super()
        /* Allocate read_page block */
-       msblk->read_page = squashfs_cache_init("data", 1, msblk->block_size);
+       msblk->read_page = squashfs_cache_init("data", 100, msblk->block_size);
and
CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=100 (it was 3)
which is for msblk->fragment_cache.
Of course, these numbers are not generic solution, but they are large
enough to keep all blocks for my test.

It shows much better performance. All blocks are cached and the number
of decompression for native squashfs (a.img) is almost equivalent to the
case of nested ext3 (b.img). But a.img consumes CPU much more than
b.img.
My guess for CPU is the cost to search in cache.
squashfs_cache_get()
		for (i = 0; i < cache->entries; i++)
			if (cache->entry[i].block == block)
				break;
The value of cache->entries grows, and its search cost grows too.

Befor I am going to introduce a hash table or something to reduce the
search cost, I think it is better to convert the squashfs cache into
generic system cache. The hash index will be based on the block number.
I don't know it will be able to combine with the page cache. But at
least, it will be able to kmem_cache_create() and register_shrinker().

Phillip, how do you think about converting the cache system?


J. R. Okajima

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Q. cache in squashfs?
  2010-07-09  7:53     ` J. R. Okajima
@ 2010-07-09 10:32       ` Phillip Lougher
  2010-07-09 10:55         ` Phillip Lougher
  2010-07-09 12:24         ` Q. cache in squashfs? J. R. Okajima
  0 siblings, 2 replies; 16+ messages in thread
From: Phillip Lougher @ 2010-07-09 10:32 UTC (permalink / raw)
  To: J. R. Okajima; +Cc: linux-fsdevel

J. R. Okajima wrote:
>> Phillip Lougher:
>>> What I think you're seeing here is the negative effect of fragment
>>> blocks (tail-end packing) in the native squashfs example and the
>>> positive effect of vfs/loop block caching in the ext3 on squashfs example.
>> Thank you very much for your explanation.
>> I think the number of cached decompressed fragment blocks is related
>> too. I thought it is much larger, but I found it is 3 by default. I will
>> try larger value with/without -no-fragments which you pointed.
> 
> The -no-fragments shows better performance, but it is very small.
> It doesn't seem that the number of fragment blocks is large on my test
> environment.

That is *very* surprising.  How many fragments do you have?

> 
> Next, I tried increasing the number of cache entries in squashfs.
> squashfs_fill_super()
>         /* Allocate read_page block */
> -       msblk->read_page = squashfs_cache_init("data", 1, msblk->block_size);
> +       msblk->read_page = squashfs_cache_init("data", 100, msblk->block_size);

That is the *wrong* cache.  Read_page isn't really a cache (it is merely
allocated as a cache to re-use code).  This is used to store the data block
in the read_page() routine, and the entire contents are explicitly pushed into the
page cache.  As the entire contents are pushed into the page cache, it is
*very* unlikely the VFS is calling Squashfs to re-read *this* data.  If it is
then something fundamental is broken, or you're seeing page cache shrinkage.

> and
> CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=100 (it was 3)
> which is for msblk->fragment_cache.

and which should make *no* difference if you've used the -no-fragments option to
build an image without fragments.

Squashfs has three types of compressed block, each with different
caching behaviour

1. Data blocks.  Once read, the entire contents are pushed in the page cache.
    They are not cached by Squashfs.  If you've got repeated reads of *these*
    blocks then you're seeing page cache shrinkage or flushing.

2. Fragment blocks.  These are large data blocks which have have multiple small
    files packed together.  In read_page() the file for which the fragment has
    been read is pushed into the page cache.  The other contents of the fragment
    block (the other files) are not, so they're temporarily cached in the
    squashfs fragment cache in the belief they'll be requested soon (locality of
    reference and all that stuff).

3. Metadata blocks (always 8K).  These store inode and directory metadata, and
    are (unsurprisingly) read when inodes are looked-up/instantiated and when
    directory look-up takes place. These blocks tend to store multiple inodes
    and directories packed together (for greater compression).  As such they're
    temporarily cached in the squashfs metadata_cache in belief they'll be re-used
    soon (again locality of reference).

It is fragments and metadata blocks which show the potential for
repeated re-reading on random access patterns.

As you've presumably eliminated fragments from your image, that leaves
metadata blocks as the *only* cause of repeated re-reading/decompression.

You should have modified the size of the metadata cache, from 8 to something
larger, i.e.

  msblk->block_cache = squashfs_cache_init("metadata",
                         SQUASHFS_CACHED_BLKS, SQUASHFS_METADATA_SIZE);

As a rough guide, to see how much to increase the cache so that it caches the
entire amount of metadata in your image, you can add up the uncompressed
sizes of the inode and directory tables reported by mksquashfs.

But there's a mystery here, I'll be very much surprised if your test image has
more than 64K of metadata, which would fit into the existing 8 entry metadata
cache.

> Of course, these numbers are not generic solution, but they are large
> enough to keep all blocks for my test.

> 
> It shows much better performance. 

If you've done as you said, it should have made no difference whatsoever, unless
the page pushing into the page cache is broken.

So there's a big mystery here.

>All blocks are cached and the number
> of decompression for native squashfs (a.img) is almost equivalent to the
> case of nested ext3 (b.img). But a.img consumes CPU much more than
> b.img.
> My guess for CPU is the cost to search in cache.
> squashfs_cache_get()
> 		for (i = 0; i < cache->entries; i++)
> 			if (cache->entry[i].block == block)
> 				break;
> The value of cache->entries grows, and its search cost grows too.

As you seriously suggesting a scan of a 100 entry table on a modern CPU
makes any noticable difference?
> 
> Befor I am going to introduce a hash table or something to reduce the
> search cost, I think it is better to convert the squashfs cache into
> generic system cache. The hash index will be based on the block number.
> I don't know it will be able to combine with the page cache. But at
> least, it will be able to kmem_cache_create() and register_shrinker().
> 
> Phillip, how do you think about converting the cache system?
> 

That was discussed on this list back in 2008, and there are pros and cons
to doing this.  You can look at the list archives for the discussion and
so I won't repeat it here.  At the moment I see this as a red herring
because your results suggest something more fundamental is wrong.  Doing
what you did above with the size of the read_page cache should not have
made any difference, and if it did, it suggests pages which *should* be
in the page cache (explicitly pushed there by the read_page() routine) are
not there.  In short its not a question of should Squashfs be using the
page cache, for the pages in question it already is.

I'll try and reproduce your results, as they're to be frank
significantly at variance to my previous experience.  Maybe there's a bug
or VFS changes means the page pushing into the page cache isn't working, but
I cannot see where your repeated block reading/decompression results are
coming from.

Phillip

> 
> J. R. Okajima
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Q. cache in squashfs?
  2010-07-09 10:32       ` Phillip Lougher
@ 2010-07-09 10:55         ` Phillip Lougher
  2010-07-10  5:07           ` J. R. Okajima
  2010-07-09 12:24         ` Q. cache in squashfs? J. R. Okajima
  1 sibling, 1 reply; 16+ messages in thread
From: Phillip Lougher @ 2010-07-09 10:55 UTC (permalink / raw)
  To: J. R. Okajima; +Cc: Phillip Lougher, linux-fsdevel

Phillip Lougher wrote:

> 
> That was discussed on this list back in 2008, and there are pros and cons
> to doing this.  You can look at the list archives for the discussion and
> so I won't repeat it here.  At the moment I see this as a red herring
> because your results suggest something more fundamental is wrong.  Doing
> what you did above with the size of the read_page cache should not have
> made any difference, and if it did, it suggests pages which *should* be
> in the page cache (explicitly pushed there by the read_page() routine) are
> not there.  In short its not a question of should Squashfs be using the
> page cache, for the pages in question it already is.
> 
> I'll try and reproduce your results, as they're to be frank
> significantly at variance to my previous experience.  Maybe there's a bug
> or VFS changes means the page pushing into the page cache isn't working, 
> but I cannot see where your repeated block reading/decompression results are
> coming from.
> 

You can determine which blocks are being repeatedly decompressed by
printing out the value of cache->name in squashfs_cache_get().

You should get one of "data", "fragment" and "metadata" for data
blocks, fragment blocks and metadata respectively.

This information will go a long way in showing where the problem lies.

Phillip


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Q. cache in squashfs?
  2010-07-09 10:55         ` Phillip Lougher
@ 2010-07-10  5:07           ` J. R. Okajima
  2010-07-10  5:08             ` J. R. Okajima
  2010-07-11  2:48             ` Phillip Lougher
  0 siblings, 2 replies; 16+ messages in thread
From: J. R. Okajima @ 2010-07-10  5:07 UTC (permalink / raw)
  To: Phillip Lougher; +Cc: linux-fsdevel


Phillip Lougher:
> You can determine which blocks are being repeatedly decompressed by
> printing out the value of cache->name in squashfs_cache_get().
>
> You should get one of "data", "fragment" and "metadata" for data
> blocks, fragment blocks and metadata respectively.
>
> This information will go a long way in showing where the problem lies.

Here is a patch to count and the result.

----------------------------------------------------------------------
frag(3, 100) x -no-fragments(with, without)

O: no-fragments x inner ext3

A: frag=3 x without -no-fragments
B: frag=3 x with -no-fragments

C: frag=100 x without -no-fragments
-: frag=100 x with -no-fragments

	cat10		cache_get		read		zlib
	(sec,cpu)	(meta,frag,data)	(meta,data)	(meta,data)
	----------------------------------------------------------------------
O	.06, 35%	92, -, 41		3, 44		2, 3557
A	.09, 113%	12359, 81, 22		4, 90		6, 6474
B	.07, 104%	12369, -, 109		3, 100		5, 3484
C	.06, 112%	12381, 80, 35		4, 53		6, 3650

- the case O is b.img in my first mail, and the case A is a.img.
- the "cat10" column is the result of time command as described in my
  first mail.
- all these numbers just show the trend and the small difference doesn't
  have much meaning.
- with -no-fragments option (case B),
  + the number of zlib call is reduced.
  + the CPU usage is not reduced much.
  + the number of cache_get for data increses.
  + the number of read for data may increse too.
- even with the compressed fragments, by increasing
  CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE it shows similar performance (case
  C),
  + the number of zlib call is reduced.
  + the CPU usage is not reduced much.
  + the number of cache_get for data may increse.
  + the number of read for data may decrese.

I am not sure the differece of cache_get/read for data between cases is
so meaningful.
But it surely shows high CPU usage in squashfs and I guess it is caused
by cache_get for metadata. The number of zlib compression may not be
related to this CPU usage much.


J. R. Okajima




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Q. cache in squashfs?
  2010-07-10  5:07           ` J. R. Okajima
@ 2010-07-10  5:08             ` J. R. Okajima
  2010-07-11  2:48             ` Phillip Lougher
  1 sibling, 0 replies; 16+ messages in thread
From: J. R. Okajima @ 2010-07-10  5:08 UTC (permalink / raw)
  To: Phillip Lougher, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 111 bytes --]

"J. R. Okajima":
> Here is a patch to count and the result.

Forgot appending the patch, sorry.

J. R. Okajima

[-- Attachment #2: a.patch.bz2 --]
[-- Type: application/x-bzip2, Size: 2075 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Q. cache in squashfs?
  2010-07-10  5:07           ` J. R. Okajima
  2010-07-10  5:08             ` J. R. Okajima
@ 2010-07-11  2:48             ` Phillip Lougher
  2010-07-11  5:55               ` J. R. Okajima
  1 sibling, 1 reply; 16+ messages in thread
From: Phillip Lougher @ 2010-07-11  2:48 UTC (permalink / raw)
  To: J. R. Okajima; +Cc: linux-fsdevel

J. R. Okajima wrote:

> O: no-fragments x inner ext3
> 
> A: frag=3 x without -no-fragments
> B: frag=3 x with -no-fragments
> 
> C: frag=100 x without -no-fragments
> -: frag=100 x with -no-fragments
> 
> 	cat10		cache_get		read		zlib
> 	(sec,cpu)	(meta,frag,data)	(meta,data)	(meta,data)
> 	----------------------------------------------------------------------
> O	.06, 35%	92, -, 41		3, 44		2, 3557
> A	.09, 113%	12359, 81, 22		4, 90		6, 6474
> B	.07, 104%	12369, -, 109		3, 100		5, 3484
> C	.06, 112%	12381, 80, 35		4, 53		6, 3650
> 

OK,

I've done some tests of my own, and I can report that there is no issue
with Squashfs.  Squashfs on its own is performing better than ext3 on
Squashfs. The reason why your tests suggest otherwise is because your
testing methodology is *broken*.

In your first column (ext3 on squashfs), only a small amount of the
overall cost is being accounted to the 'cat10' command, the bulk of
the work is being accounted to the kernel 'loop1' thread and this isn't
showing up. In the other cases (Squashfs only) the entire cost is being
accounted to the 'cat10' command.  The resulting results are therefore
completely bogus, and incorrectly show higher CPU usage for Squashfs.

The following should illustrate this (all tests done under kvm):

1. Squashfs native

Following sqsh.sh shell used

#!/bin/sh
  for i in `seq 2`; do
  	mount -t squashfs /data/comp/bin.sqsh /mnt -o loop
  	find /mnt -type f|xargs wc 2>&1 > /dev/null
  	umount /mnt
  done

bin.sqsh is a copy of /usr/bin, without any fragments.

# /usr/bin/time sqsh.sh

root@slackware:/data/blame-game/data# /usr/bin/time ./test-sqsh.sh
5.51user 12.70system 0:18.72elapsed 97%CPU (0avgtext+0avgdata 5712maxresident)k

High CPU usage, however, this should not be surprising, in an otherwise idle
system there is no reason not to use all CPU.

Snapshot from top while running confirms this:

top - 01:59:30 up  1:13,  2 users,  load average: 0.49, 0.23, 0.10
  Tasks:  58 total,   2 running,  56 sleeping,   0 stopped,   0 zombie
  Cpu(s): 36.0%us, 64.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  Mem:   2023364k total,  1342200k used,   681164k free,   127316k buffers
  Swap:        0k total,        0k used,        0k free,  1134448k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   3214 root      20   0  4696 1124  552 R 97.1  0.1   0:05.33 wc

The system is running fully occupied, with 0 % idle.

Note overall elapsed time (from time command) running Squashfs native: 18.72 s

2. ext3 on squashfs

Following ext3.sh shell used

#!/bin/sh
for i in `seq 2`; do
  	mount -t squashfs /data/comp/ext3.sqsh /mnt2 -o loop
  	mount -t ext3 /mnt2/ext3.img /mnt -o loop
  	find /mnt -type f | xargs wc  2&>1 > /dev/null
  	umount /mnt
  	umount /mnt2
  done

ext3.img is an ext3 fs containing /usr/bin.

# /usr/bin/time ext3.sh

5.70user 5.11system 0:20.28elapsed 53%CPU (0avgtext+0avgdata 5712maxresident)k
0inputs+0outputs (0major+5346minor)pagefaults 0swaps

Much lower CPU, but this is bogus.

A snapshot from top shows:

top - 02:04:29 up  1:18,  2 users,  load average: 0.44, 0.18, 0.10
Tasks:  61 total,   2 running,  59 sleeping,   0 stopped,   0 zombie
Cpu(s): 33.0%us, 67.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2023364k total,  1637056k used,   386308k free,   143416k buffers
Swap:        0k total,        0k used,        0k free,  1410636k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  3241 root       0 -20     0    0    0 S 52.8  0.0   0:03.19 loop1
  3248 root      20   0  4696 1148  576 R 44.5  0.1   0:02.38 wc

Again the system is running fully occupied with 0 % idle.

The major difference is 52.8 % of the CPU is being accounted to the
'loop1' kernel thread, and this does not show up in the time command.
To make that clear, all of the cost of reading the loop1 file (ext3.img)
is being accounted to the loop1 kernel thread, and therefore no
decompression overhead is showing up in the time command.  As decompression
cost is the majority of the overhead of reading compressed data, it is
little wonder the CPU usage reported by time is only 53 %.

In fact as the 53 % CPU figure only includes time spent in user-space and
ext3 (and excludes decompression cost), it is surprising it is so *high*.
On this basis Squashfs is using only 47 % of CPU or less to decompress
the data.  Which is *good*, and a complete reversal of your bogus results.

Note overall elapsed time (from time command) running ext3 on Squashfs: 20.28 s.

3. Overall conclusion

On my tests both Squashfs native and ext3 on Squashfs uses 100 % CPU.  However,
Squashfs native is faster, 18.72. seconds versus 20.28 seconds.

Cheers

Phillip

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Q. cache in squashfs?
  2010-07-11  2:48             ` Phillip Lougher
@ 2010-07-11  5:55               ` J. R. Okajima
  2010-07-11  9:38                 ` [RFC 0/2] squashfs parallel decompression J. R. Okajima
                                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: J. R. Okajima @ 2010-07-11  5:55 UTC (permalink / raw)
  To: Phillip Lougher; +Cc: linux-fsdevel


Phillip Lougher:
> In your first column (ext3 on squashfs), only a small amount of the
> overall cost is being accounted to the 'cat10' command, the bulk of
> the work is being accounted to the kernel 'loop1' thread and this isn't
> showing up. In the other cases (Squashfs only) the entire cost is being
> accounted to the 'cat10' command.  The resulting results are therefore
> completely bogus, and incorrectly show higher CPU usage for Squashfs.

Ah, I forget about the kthread.
My question about CPU usage must be due to the kthread.
Also I could confirm that the sequential access pattern as you did shows
good performance.
While the very random access shows worse, it is a positive effect of
loopback caching as you wrote in your first reply.

Thank you very much.


J. R. Okajima

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC 0/2] squashfs parallel decompression
  2010-07-11  5:55               ` J. R. Okajima
@ 2010-07-11  9:38                 ` J. R. Okajima
  2011-02-22 19:41                   ` Phillip Susi
  2010-07-11  9:38                 ` [RFC 1/2] squashfs parallel decompression, early wait_on_buffer J. R. Okajima
  2010-07-11  9:38                 ` [RFC 2/2] squashfs parallel decompression, z_stream per cpu J. R. Okajima
  2 siblings, 1 reply; 16+ messages in thread
From: J. R. Okajima @ 2010-07-11  9:38 UTC (permalink / raw)
  To: phillip; +Cc: linux-fsdevel, J. R. Okajima

Discussing about the performance of squashfs, I have tried enabling
parallel decompression.
On my test system, the elapsed time to read 59171 files randomly 10
times becomes 33.25sec to 20.54sec (of course, CPU usage increases).

The base version is v2.6.33.

J. R. Okajima (2):
  squashfs parallel decompression, early wait_on_buffer
  squashfs parallel decompression, z_stream per cpu

 fs/squashfs/block.c          |   81 +++++++++++++++++------------------------
 fs/squashfs/cache.c          |    1 +
 fs/squashfs/dir.c            |    1 +
 fs/squashfs/export.c         |    1 +
 fs/squashfs/file.c           |    1 +
 fs/squashfs/fragment.c       |    1 +
 fs/squashfs/id.c             |    1 +
 fs/squashfs/inode.c          |    1 +
 fs/squashfs/namei.c          |    1 +
 fs/squashfs/squashfs.h       |    3 ++
 fs/squashfs/squashfs_fs_sb.h |    2 -
 fs/squashfs/super.c          |   48 +++++++++++++++++++------
 fs/squashfs/symlink.c        |    1 +
 13 files changed, 83 insertions(+), 60 deletions(-)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] squashfs parallel decompression
  2010-07-11  9:38                 ` [RFC 0/2] squashfs parallel decompression J. R. Okajima
@ 2011-02-22 19:41                   ` Phillip Susi
  2011-02-23  3:23                     ` Phillip Lougher
  0 siblings, 1 reply; 16+ messages in thread
From: Phillip Susi @ 2011-02-22 19:41 UTC (permalink / raw)
  To: J. R. Okajima; +Cc: phillip, linux-fsdevel

Did anything ever come of this?  Nobody ever responded and I don't see
it ever having been merged to Linus' tree.

On 7/11/2010 5:38 AM, J. R. Okajima wrote:
> Discussing about the performance of squashfs, I have tried enabling
> parallel decompression.
> On my test system, the elapsed time to read 59171 files randomly 10
> times becomes 33.25sec to 20.54sec (of course, CPU usage increases).
> 
> The base version is v2.6.33.
> 
> J. R. Okajima (2):
>   squashfs parallel decompression, early wait_on_buffer
>   squashfs parallel decompression, z_stream per cpu
> 
>  fs/squashfs/block.c          |   81 +++++++++++++++++------------------------
>  fs/squashfs/cache.c          |    1 +
>  fs/squashfs/dir.c            |    1 +
>  fs/squashfs/export.c         |    1 +
>  fs/squashfs/file.c           |    1 +
>  fs/squashfs/fragment.c       |    1 +
>  fs/squashfs/id.c             |    1 +
>  fs/squashfs/inode.c          |    1 +
>  fs/squashfs/namei.c          |    1 +
>  fs/squashfs/squashfs.h       |    3 ++
>  fs/squashfs/squashfs_fs_sb.h |    2 -
>  fs/squashfs/super.c          |   48 +++++++++++++++++++------
>  fs/squashfs/symlink.c        |    1 +
>  13 files changed, 83 insertions(+), 60 deletions(-)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 0/2] squashfs parallel decompression
  2011-02-22 19:41                   ` Phillip Susi
@ 2011-02-23  3:23                     ` Phillip Lougher
  0 siblings, 0 replies; 16+ messages in thread
From: Phillip Lougher @ 2011-02-23  3:23 UTC (permalink / raw)
  To: Phillip Susi; +Cc: J. R. Okajima, linux-fsdevel

Phillip Susi wrote:
> Did anything ever come of this?  Nobody ever responded and I don't see
> it ever having been merged to Linus' tree.
> 

They're on my TO DO list.  Unfortunately the patches are against
2.6.33 and are based on Squashfs with only gzip compression
support.  At the time of posting mainline was 2.6.36 or so with
a new compression framework backend in Squashfs with support
for gzip, lzo and (now) xz.  So the patches need a re-work to
fit in with this new framework.

When I get time I'll look into refactoring the patches so they
can apply to mainline.  I've been tied up with xz compression
support for the last couple of months.

Phillip

> On 7/11/2010 5:38 AM, J. R. Okajima wrote:
>> Discussing about the performance of squashfs, I have tried enabling
>> parallel decompression.
>> On my test system, the elapsed time to read 59171 files randomly 10
>> times becomes 33.25sec to 20.54sec (of course, CPU usage increases).
>>
>> The base version is v2.6.33.
>>
>> J. R. Okajima (2):
>>   squashfs parallel decompression, early wait_on_buffer
>>   squashfs parallel decompression, z_stream per cpu
>>
>>  fs/squashfs/block.c          |   81 +++++++++++++++++------------------------
>>  fs/squashfs/cache.c          |    1 +
>>  fs/squashfs/dir.c            |    1 +
>>  fs/squashfs/export.c         |    1 +
>>  fs/squashfs/file.c           |    1 +
>>  fs/squashfs/fragment.c       |    1 +
>>  fs/squashfs/id.c             |    1 +
>>  fs/squashfs/inode.c          |    1 +
>>  fs/squashfs/namei.c          |    1 +
>>  fs/squashfs/squashfs.h       |    3 ++
>>  fs/squashfs/squashfs_fs_sb.h |    2 -
>>  fs/squashfs/super.c          |   48 +++++++++++++++++++------
>>  fs/squashfs/symlink.c        |    1 +
>>  13 files changed, 83 insertions(+), 60 deletions(-)
> 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC 1/2] squashfs parallel decompression, early wait_on_buffer
  2010-07-11  5:55               ` J. R. Okajima
  2010-07-11  9:38                 ` [RFC 0/2] squashfs parallel decompression J. R. Okajima
@ 2010-07-11  9:38                 ` J. R. Okajima
  2010-07-11  9:38                 ` [RFC 2/2] squashfs parallel decompression, z_stream per cpu J. R. Okajima
  2 siblings, 0 replies; 16+ messages in thread
From: J. R. Okajima @ 2010-07-11  9:38 UTC (permalink / raw)
  To: phillip; +Cc: linux-fsdevel, J. R. Okajima

Preparing parallel decompression.
In squashfs_read_data(), move wait_on_buffer() forer which is common
part of the 'if (compressed) - else' blocks.

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/squashfs/block.c |   17 +++++++----------
 1 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
index 2a79603..1017b94 100644
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -151,6 +151,12 @@ int squashfs_read_data(struct super_block *sb, void **buffer, u64 index,
 		}
 		ll_rw_block(READ, b - 1, bh + 1);
 	}
+	for (k = 0; k < b; k++) {
+		wait_on_buffer(bh[k]);
+		/* possible? */
+		WARN_ON(!buffer_uptodate(bh[k]));
+	}
+	k = 0;
 
 	if (compressed) {
 		int zlib_err = 0, zlib_init = 0;
@@ -169,9 +175,6 @@ int squashfs_read_data(struct super_block *sb, void **buffer, u64 index,
 			if (msblk->stream.avail_in == 0 && k < b) {
 				avail = min(bytes, msblk->devblksize - offset);
 				bytes -= avail;
-				wait_on_buffer(bh[k]);
-				if (!buffer_uptodate(bh[k]))
-					goto release_mutex;
 
 				if (avail == 0) {
 					offset = 0;
@@ -223,13 +226,7 @@ int squashfs_read_data(struct super_block *sb, void **buffer, u64 index,
 		/*
 		 * Block is uncompressed.
 		 */
-		int i, in, pg_offset = 0;
-
-		for (i = 0; i < b; i++) {
-			wait_on_buffer(bh[i]);
-			if (!buffer_uptodate(bh[i]))
-				goto block_release;
-		}
+		int in, pg_offset = 0;
 
 		for (bytes = length; k < b; k++) {
 			in = min(bytes, msblk->devblksize - offset);
-- 
1.6.6.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC 2/2] squashfs parallel decompression, z_stream per cpu
  2010-07-11  5:55               ` J. R. Okajima
  2010-07-11  9:38                 ` [RFC 0/2] squashfs parallel decompression J. R. Okajima
  2010-07-11  9:38                 ` [RFC 1/2] squashfs parallel decompression, early wait_on_buffer J. R. Okajima
@ 2010-07-11  9:38                 ` J. R. Okajima
  2 siblings, 0 replies; 16+ messages in thread
From: J. R. Okajima @ 2010-07-11  9:38 UTC (permalink / raw)
  To: phillip; +Cc: linux-fsdevel, J. R. Okajima

Convert z_stream in squashfs_sb_info into the module-global per-cpu var,
and remove read_data_mutex in squashfs_sb_info.
Also convert the repeated sequence of
	zlib_inflateInit - zlib_inflate - zlib_inflateEnd
into
	zlib_inflateReset - zlib_inflate

zlib_inflateEnd() is not called since its current implementation is less
important.

Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp>
---
 fs/squashfs/block.c          |   64 +++++++++++++++++------------------------
 fs/squashfs/cache.c          |    1 +
 fs/squashfs/dir.c            |    1 +
 fs/squashfs/export.c         |    1 +
 fs/squashfs/file.c           |    1 +
 fs/squashfs/fragment.c       |    1 +
 fs/squashfs/id.c             |    1 +
 fs/squashfs/inode.c          |    1 +
 fs/squashfs/namei.c          |    1 +
 fs/squashfs/squashfs.h       |    3 ++
 fs/squashfs/squashfs_fs_sb.h |    2 -
 fs/squashfs/super.c          |   48 ++++++++++++++++++++++++-------
 fs/squashfs/symlink.c        |    1 +
 13 files changed, 76 insertions(+), 50 deletions(-)

diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
index 1017b94..ed34979 100644
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -33,6 +33,7 @@
 #include <linux/string.h>
 #include <linux/buffer_head.h>
 #include <linux/zlib.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
@@ -159,20 +160,28 @@ int squashfs_read_data(struct super_block *sb, void **buffer, u64 index,
 	k = 0;
 
 	if (compressed) {
-		int zlib_err = 0, zlib_init = 0;
+		int zlib_err = 0;
+		z_stream *z;
 
 		/*
 		 * Uncompress block.
 		 */
+		/* it disables preemption */
+		z = &get_cpu_var(squashfs_zstream);
+		zlib_err = zlib_inflateReset(z);
+		if (zlib_err != Z_OK) {
+			put_cpu_var(squashfs_zstream);
+			ERROR("zlib_inflateReset returned"
+			      " unexpected result 0x%x, srclength %d\n",
+			      zlib_err, srclength);
+			goto block_release;
+		}
 
-		mutex_lock(&msblk->read_data_mutex);
-
-		msblk->stream.avail_out = 0;
-		msblk->stream.avail_in = 0;
-
+		z->avail_out = 0;
+		z->avail_in = 0;
 		bytes = length;
 		do {
-			if (msblk->stream.avail_in == 0 && k < b) {
+			if (z->avail_in == 0 && k < b) {
 				avail = min(bytes, msblk->devblksize - offset);
 				bytes -= avail;
 
@@ -182,46 +191,30 @@ int squashfs_read_data(struct super_block *sb, void **buffer, u64 index,
 					continue;
 				}
 
-				msblk->stream.next_in = bh[k]->b_data + offset;
-				msblk->stream.avail_in = avail;
+				z->next_in = bh[k]->b_data + offset;
+				z->avail_in = avail;
 				offset = 0;
 			}
 
-			if (msblk->stream.avail_out == 0 && page < pages) {
-				msblk->stream.next_out = buffer[page++];
-				msblk->stream.avail_out = PAGE_CACHE_SIZE;
+			if (z->avail_out == 0 && page < pages) {
+				z->next_out = buffer[page++];
+				z->avail_out = PAGE_CACHE_SIZE;
 			}
 
-			if (!zlib_init) {
-				zlib_err = zlib_inflateInit(&msblk->stream);
-				if (zlib_err != Z_OK) {
-					ERROR("zlib_inflateInit returned"
-						" unexpected result 0x%x,"
-						" srclength %d\n", zlib_err,
-						srclength);
-					goto release_mutex;
-				}
-				zlib_init = 1;
-			}
+			zlib_err = zlib_inflate(z, Z_SYNC_FLUSH);
 
-			zlib_err = zlib_inflate(&msblk->stream, Z_SYNC_FLUSH);
-
-			if (msblk->stream.avail_in == 0 && k < b)
+			if (z->avail_in == 0 && k < b)
 				put_bh(bh[k++]);
 		} while (zlib_err == Z_OK);
 
 		if (zlib_err != Z_STREAM_END) {
+			put_cpu_var(squashfs_zstream);
 			ERROR("zlib_inflate error, data probably corrupt\n");
-			goto release_mutex;
+			goto block_release;
 		}
 
-		zlib_err = zlib_inflateEnd(&msblk->stream);
-		if (zlib_err != Z_OK) {
-			ERROR("zlib_inflate error, data probably corrupt\n");
-			goto release_mutex;
-		}
-		length = msblk->stream.total_out;
-		mutex_unlock(&msblk->read_data_mutex);
+		length = z->total_out;
+		put_cpu_var(squashfs_zstream);
 	} else {
 		/*
 		 * Block is uncompressed.
@@ -252,9 +245,6 @@ int squashfs_read_data(struct super_block *sb, void **buffer, u64 index,
 	kfree(bh);
 	return length;
 
-release_mutex:
-	mutex_unlock(&msblk->read_data_mutex);
-
 block_release:
 	for (; k < b; k++)
 		put_bh(bh[k]);
diff --git a/fs/squashfs/cache.c b/fs/squashfs/cache.c
index 40c98fa..d90814f 100644
--- a/fs/squashfs/cache.c
+++ b/fs/squashfs/cache.c
@@ -53,6 +53,7 @@
 #include <linux/wait.h>
 #include <linux/zlib.h>
 #include <linux/pagemap.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/fs/squashfs/dir.c b/fs/squashfs/dir.c
index 566b0ea..3d2a632 100644
--- a/fs/squashfs/dir.c
+++ b/fs/squashfs/dir.c
@@ -31,6 +31,7 @@
 #include <linux/vfs.h>
 #include <linux/slab.h>
 #include <linux/zlib.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/fs/squashfs/export.c b/fs/squashfs/export.c
index 2b1b8fe..9b33473 100644
--- a/fs/squashfs/export.c
+++ b/fs/squashfs/export.c
@@ -41,6 +41,7 @@
 #include <linux/exportfs.h>
 #include <linux/zlib.h>
 #include <linux/slab.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/fs/squashfs/file.c b/fs/squashfs/file.c
index 717767d..f9728f3 100644
--- a/fs/squashfs/file.c
+++ b/fs/squashfs/file.c
@@ -48,6 +48,7 @@
 #include <linux/pagemap.h>
 #include <linux/mutex.h>
 #include <linux/zlib.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/fs/squashfs/fragment.c b/fs/squashfs/fragment.c
index b5a2c15..fd6f776 100644
--- a/fs/squashfs/fragment.c
+++ b/fs/squashfs/fragment.c
@@ -37,6 +37,7 @@
 #include <linux/vfs.h>
 #include <linux/slab.h>
 #include <linux/zlib.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/fs/squashfs/id.c b/fs/squashfs/id.c
index 3795b83..cea22c1 100644
--- a/fs/squashfs/id.c
+++ b/fs/squashfs/id.c
@@ -35,6 +35,7 @@
 #include <linux/vfs.h>
 #include <linux/slab.h>
 #include <linux/zlib.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/fs/squashfs/inode.c b/fs/squashfs/inode.c
index 9101dbd..6977d34 100644
--- a/fs/squashfs/inode.c
+++ b/fs/squashfs/inode.c
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/vfs.h>
 #include <linux/zlib.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/fs/squashfs/namei.c b/fs/squashfs/namei.c
index 9e39865..3a6e51f 100644
--- a/fs/squashfs/namei.c
+++ b/fs/squashfs/namei.c
@@ -58,6 +58,7 @@
 #include <linux/string.h>
 #include <linux/dcache.h>
 #include <linux/zlib.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/fs/squashfs/squashfs.h b/fs/squashfs/squashfs.h
index 0e9feb6..0ff530e 100644
--- a/fs/squashfs/squashfs.h
+++ b/fs/squashfs/squashfs.h
@@ -70,6 +70,9 @@ extern struct inode *squashfs_iget(struct super_block *, long long,
 				unsigned int);
 extern int squashfs_read_inode(struct inode *, long long);
 
+/* super.c */
+DECLARE_PER_CPU(z_stream, squashfs_zstream);
+
 /*
  * Inodes and files operations
  */
diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
index c8c6561..95489cc 100644
--- a/fs/squashfs/squashfs_fs_sb.h
+++ b/fs/squashfs/squashfs_fs_sb.h
@@ -61,10 +61,8 @@ struct squashfs_sb_info {
 	__le64			*id_table;
 	__le64			*fragment_index;
 	unsigned int		*fragment_index_2;
-	struct mutex		read_data_mutex;
 	struct mutex		meta_index_mutex;
 	struct meta_index	*meta_index;
-	z_stream		stream;
 	__le64			*inode_lookup_table;
 	u64			inode_table;
 	u64			directory_table;
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
index 6c197ef..e78ffbf 100644
--- a/fs/squashfs/super.c
+++ b/fs/squashfs/super.c
@@ -37,6 +37,7 @@
 #include <linux/module.h>
 #include <linux/zlib.h>
 #include <linux/magic.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
@@ -87,13 +88,6 @@ static int squashfs_fill_super(struct super_block *sb, void *data, int silent)
 	}
 	msblk = sb->s_fs_info;
 
-	msblk->stream.workspace = kmalloc(zlib_inflate_workspacesize(),
-		GFP_KERNEL);
-	if (msblk->stream.workspace == NULL) {
-		ERROR("Failed to allocate zlib workspace\n");
-		goto failure;
-	}
-
 	sblk = kzalloc(sizeof(*sblk), GFP_KERNEL);
 	if (sblk == NULL) {
 		ERROR("Failed to allocate squashfs_super_block\n");
@@ -103,7 +97,6 @@ static int squashfs_fill_super(struct super_block *sb, void *data, int silent)
 	msblk->devblksize = sb_min_blocksize(sb, BLOCK_SIZE);
 	msblk->devblksize_log2 = ffz(~msblk->devblksize);
 
-	mutex_init(&msblk->read_data_mutex);
 	mutex_init(&msblk->meta_index_mutex);
 
 	/*
@@ -295,14 +288,12 @@ failed_mount:
 	kfree(msblk->inode_lookup_table);
 	kfree(msblk->fragment_index);
 	kfree(msblk->id_table);
-	kfree(msblk->stream.workspace);
 	kfree(sb->s_fs_info);
 	sb->s_fs_info = NULL;
 	kfree(sblk);
 	return err;
 
 failure:
-	kfree(msblk->stream.workspace);
 	kfree(sb->s_fs_info);
 	sb->s_fs_info = NULL;
 	return -ENOMEM;
@@ -349,7 +340,6 @@ static void squashfs_put_super(struct super_block *sb)
 		kfree(sbi->id_table);
 		kfree(sbi->fragment_index);
 		kfree(sbi->meta_index);
-		kfree(sbi->stream.workspace);
 		kfree(sb->s_fs_info);
 		sb->s_fs_info = NULL;
 	}
@@ -394,14 +384,43 @@ static void destroy_inodecache(void)
 }
 
 
+DEFINE_PER_CPU(z_stream, squashfs_zstream);
 static int __init init_squashfs_fs(void)
 {
 	int err = init_inodecache();
+	int cpu, sz;
+	z_stream *z;
 
 	if (err)
 		return err;
 
+	err = -ENOMEM;
+	z = NULL; /* suppress gcc warning */
+	sz = zlib_inflate_workspacesize();
+	for_each_online_cpu(cpu) {
+		z = &per_cpu(squashfs_zstream, cpu);
+		z->workspace = kmalloc(sz, GFP_KERNEL);
+		if (!z->workspace) {
+			ERROR("Failed to allocate zlib workspace\n");
+			break;
+		}
+		err = zlib_inflateInit(z);
+		if (err == Z_MEM_ERROR) {
+			err = -ENOMEM;
+			ERROR("Failed to intialize zlib\n");
+			break;
+		}
+	}
+	if (!z->workspace) {
+		for_each_online_cpu(cpu) {
+			z = &per_cpu(squashfs_zstream, cpu);
+			kfree(z->workspace);
+		}
+		goto out_err;
+	}
+
 	err = register_filesystem(&squashfs_fs_type);
+out_err:
 	if (err) {
 		destroy_inodecache();
 		return err;
@@ -416,7 +435,14 @@ static int __init init_squashfs_fs(void)
 
 static void __exit exit_squashfs_fs(void)
 {
+	int cpu;
+	z_stream *z;
+
 	unregister_filesystem(&squashfs_fs_type);
+	for_each_online_cpu(cpu) {
+		z = &per_cpu(squashfs_zstream, cpu);
+		kfree(z->workspace);
+	}
 	destroy_inodecache();
 }
 
diff --git a/fs/squashfs/symlink.c b/fs/squashfs/symlink.c
index 83d8788..133776b 100644
--- a/fs/squashfs/symlink.c
+++ b/fs/squashfs/symlink.c
@@ -37,6 +37,7 @@
 #include <linux/string.h>
 #include <linux/pagemap.h>
 #include <linux/zlib.h>
+#include <linux/percpu.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
-- 
1.6.6.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: Q. cache in squashfs?
  2010-07-09 10:32       ` Phillip Lougher
  2010-07-09 10:55         ` Phillip Lougher
@ 2010-07-09 12:24         ` J. R. Okajima
  1 sibling, 0 replies; 16+ messages in thread
From: J. R. Okajima @ 2010-07-09 12:24 UTC (permalink / raw)
  To: Phillip Lougher; +Cc: linux-fsdevel

Phillip Lougher:
> > The -no-fragments shows better performance, but it is very small.
> > It doesn't seem that the number of fragment blocks is large on my test
> > environment.
>
> That is *very* surprising.  How many fragments do you have?

Actually -no-fragments could reduce the number of zlib_inflate()
expectedly. But the performance didn't improve much, particulary CPU
usage.
So I removed -no-fragments option again. This is what I forgot to write
in my mail. I hope one your big mystery solved.

$ sq4.0.wcvs/squashfs/squashfs-tools/mksquashfs  /bin  /tmp/a.img -no-progress -noappend -keep-as-directory -comp gzip
Parallel mksquashfs: Using 2 processors
Creating 4.0 filesystem on /tmp/a.img, block size 131072.

Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
	compressed data, compressed metadata, compressed fragments
	duplicates are removed
Filesystem size 2236.52 Kbytes (2.18 Mbytes)
	47.19% of uncompressed filesystem size (4739.02 Kbytes)
Inode table size 1210 bytes (1.18 Kbytes)
	36.87% of uncompressed inode table size (3282 bytes)
Directory table size 851 bytes (0.83 Kbytes)
	63.70% of uncompressed directory table size (1336 bytes)
Number of duplicate files found 1
Number of inodes 98
Number of files 84
Number of fragments 28
Number of symbolic links  12
Number of device nodes 0
Number of fifo nodes 0
Number of socket nodes 0
Number of directories 2
Number of ids (unique uids + gids) 2
Number of uids 2
	root (0)
	jro (1000)
Number of gids 2
	root (0)
	jro (1000)

> It is fragments and metadata blocks which show the potential for
> repeated re-reading on random access patterns.

Ok, then I'd focus metadata.
Increasing SQUASHFS_CACHED_BLKS	to (8<<10) didn't help the performance
for my case.

Here is my thought.
squashfs_read_metadata() is called very many times, from (every?) lookup
or file read. In squashfs_cache_get(), the search loop runs every time
with a spinlock held. That is why I thought the search is the CPU eater.
"100" is not a problem.

J. R. Okajima

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2011-02-23  3:42 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-06-24  2:37 Q. cache in squashfs? J. R. Okajima
2010-07-08  3:57 ` Phillip Lougher
2010-07-08  6:08   ` J. R. Okajima
2010-07-09  7:53     ` J. R. Okajima
2010-07-09 10:32       ` Phillip Lougher
2010-07-09 10:55         ` Phillip Lougher
2010-07-10  5:07           ` J. R. Okajima
2010-07-10  5:08             ` J. R. Okajima
2010-07-11  2:48             ` Phillip Lougher
2010-07-11  5:55               ` J. R. Okajima
2010-07-11  9:38                 ` [RFC 0/2] squashfs parallel decompression J. R. Okajima
2011-02-22 19:41                   ` Phillip Susi
2011-02-23  3:23                     ` Phillip Lougher
2010-07-11  9:38                 ` [RFC 1/2] squashfs parallel decompression, early wait_on_buffer J. R. Okajima
2010-07-11  9:38                 ` [RFC 2/2] squashfs parallel decompression, z_stream per cpu J. R. Okajima
2010-07-09 12:24         ` Q. cache in squashfs? J. R. Okajima

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).