btrfs deduplication and linux cache management


* btrfs deduplication and linux cache management
       [not found] <1589590871.231414660858286.JavaMail.root@shiva>
@ 2014-10-30  9:26 ` luvar
  2014-10-30 12:00   ` Austin S Hemmelgarn
  2014-10-30 16:00   ` Zygo Blaxell
  0 siblings, 2 replies; 5+ messages in thread
From: luvar @ 2014-10-30  9:26 UTC (permalink / raw)
  To: linux-btrfs

Hi,
I want to ask whether deduplicated file content is cached in the Linux kernel just once for two deduplicated files.

To explain in depth:
 - I use btrfs for the whole system, with a few subvolumes and compression enabled on some of them.
 - I have two directories containing the Eclipse SDK with slight differences (same version, different config).
 - I assume the two directories are deduplicated, so the two Eclipse installations take roughly as much space on the HDD as one would.
 - I start one of the two Eclipse installations.
 - The Linux kernel caches all files opened during Eclipse startup (I have enough free RAM).
 - I am just a happy, naive Linux user, so:
    1. Will the kernel cache file content after decompression? (I think yes.)
    2. Will the cached data live in the VFS layer or in the block device layer?
 - When I launch the second Eclipse (a different installation, but deduplicated against the first) after the first one:
    1. Will the second start require less data to be read from the HDD?
    2. Will metadata for the second instance be read from the HDD? (I assume yes.)
    3. Will the actual data be read a second time? (I hope not.)

Thanks for answers,
have a nice day,
--
LuVar


* Re: btrfs deduplication and linux cache management
  2014-10-30  9:26 ` btrfs deduplication and linux cache management luvar
@ 2014-10-30 12:00   ` Austin S Hemmelgarn
  2014-10-30 16:00   ` Zygo Blaxell
  1 sibling, 0 replies; 5+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-30 12:00 UTC (permalink / raw)
  To: luvar, linux-btrfs


I don't know for certain, but here is how I understand things to work
in this case:
1. Individual blocks are cached in the block device layer, which means
that the de-duplicated data would be cached at most as many times as
there are disks it is on (i.e. at most once for a single-device
filesystem, up to twice for a multi-device btrfs raid1 setup).
2. In the VFS layer, the cache handles decoded inodes (the actual file
metadata), dentries (the file's entry in the parent directory), and
individual pages of file content (after decompression).  AFAIK, the VFS
layer's cache is pathname based, so it would probably cache two copies
of the data, but after the metadata look-up it wouldn't need to read
from the disk because of the block layer cache.

Overall, this means that while de-duplicated data may be cached more 
than once, it shouldn't need to be reread from disk if there is still a 
copy in cache.  Metadata may or may not need to be read from the disk, 
depending on what is in the VFS cache.



* Re: btrfs deduplication and linux cache management
  2014-10-30  9:26 ` btrfs deduplication and linux cache management luvar
  2014-10-30 12:00   ` Austin S Hemmelgarn
@ 2014-10-30 16:00   ` Zygo Blaxell
  2014-11-03 14:09     ` LuVar
  1 sibling, 1 reply; 5+ messages in thread
From: Zygo Blaxell @ 2014-10-30 16:00 UTC (permalink / raw)
  To: luvar; +Cc: linux-btrfs


On Thu, Oct 30, 2014 at 10:26:07AM +0100, luvar@plaintext.sk wrote:
> Hi,
> I want to ask whether deduplicated file content is cached in the Linux kernel just once for two deduplicated files.
> 
> To explain in depth:
>  - I use btrfs for the whole system, with a few subvolumes and compression enabled on some of them.
>  - I have two directories containing the Eclipse SDK with slight differences (same version, different config).
>  - I assume the two directories are deduplicated, so the two Eclipse installations take roughly as much space on the HDD as one would.
>  - I start one of the two Eclipse installations.
>  - The Linux kernel caches all files opened during Eclipse startup (I have enough free RAM).
>  - I am just a happy, naive Linux user, so:
>     1. Will the kernel cache file content after decompression? (I think yes.)
>     2. Will the cached data live in the VFS layer or in the block device layer?

My guess based on behavior is the VFS layer.  See below.

>  - When I launch the second Eclipse (a different installation, but deduplicated against the first) after the first one:
>     1. Will the second start require less data to be read from the HDD?

No.

>     2. Will metadata for the second instance be read from the HDD? (I assume yes.)

Yes (how could it not?).

>     3. Will the actual data be read a second time? (I hope not.)

Unfortunately, yes.

This is my test:

1.  Create a file full of compressible data that is big enough to take
a few seconds to read from disk, but not too big to fit in RAM:

	yes $(date) | head -c 500m > a

2.  Create a "deduplicated" (shared extent) copy of same:

	cp --reflink=always a b

	(use filefrag -v to verify both files have same physical extents)

3.  Drop caches

	sync; sysctl vm.drop_caches=1

4.  Time reading both files with cold and hot cache:

	time cat a > /dev/null
	time cat b > /dev/null
	time cat a > /dev/null
	time cat b > /dev/null

Ideally, the first 'cat a' would load the file back from disk, so it
will take a long time, and the other three would be very fast as the
shared extent data would already be in RAM.

That is not what happens on 3.17.1:

	time cat a > /dev/null
	real    0m18.870s
	user    0m0.017s
	sys     0m3.432s

	time cat b > /dev/null
	real    0m16.931s
	user    0m0.007s
	sys     0m3.357s

	time cat a > /dev/null
	real    0m0.141s
	user    0m0.001s
	sys     0m0.136s

	time cat b > /dev/null
	real    0m0.121s
	user    0m0.002s
	sys     0m0.116s

Above we see that reading 'b' the first time takes almost as long as 'a'.
The second reads are cached, so they finish two orders of magnitude
faster.

That suggests that deduplicated extents are read and cached as entirely
separate copies of the data.  The sys time for the first read of 'b'
would imply separate decompression as well.

Compare the above result with a hardlink, which might behave more like
what we expect:

	rm -f b
	ln a b
	sync; sysctl vm.drop_caches=1

	time cat a > /dev/null
	real    0m20.262s
	user    0m0.010s
	sys     0m3.376s

	time cat b > /dev/null
	real    0m0.125s
	user    0m0.003s
	sys     0m0.120s

	time cat a > /dev/null
	real    0m0.103s
	user    0m0.004s
	sys     0m0.097s

	time cat b > /dev/null
	real    0m0.098s
	user    0m0.002s
	sys     0m0.091s

Above we clearly see that we read 'a' from disk only once, and use the
cache three times.
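
For anyone who wants to repeat the comparison, the steps above can be
wrapped in a small script.  This is only a sketch: it assumes GNU date
(for nanosecond timestamps), root privileges for vm.drop_caches, and a
btrfs working directory if 'b' is a reflink copy.  The function names
read_ms, drop_page_cache and compare_reads are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any tool.

# Print the wall-clock milliseconds taken to read one file in full.
# Assumes GNU date, which supports %N (nanoseconds).
read_ms() {
    start=$(date +%s%N)
    cat -- "$1" > /dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Flush dirty data, then drop the clean page cache (step 3 above).
# Needs root; returns non-zero if the sysctl write fails.
drop_page_cache() {
    sync
    sysctl vm.drop_caches=1 > /dev/null
}

# Run the four timed reads from step 4 against two paths:
# cold read of each file first, then a warm read of each.
compare_reads() {
    drop_page_cache || return 1
    for f in "$1" "$2" "$1" "$2"; do
        printf '%s\t%s ms\n' "$f" "$(read_ms "$f")"
    done
}
```

After sourcing the file, 'compare_reads a b' prints one line per read.
Run it once with 'b' created by 'cp --reflink=always a b' and once with
'b' created by 'ln a b' to see the two patterns side by side.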



* Re: btrfs deduplication and linux cache management
  2014-10-30 16:00   ` Zygo Blaxell
@ 2014-11-03 14:09     ` LuVar
  2014-11-04 20:01       ` Zygo Blaxell
  0 siblings, 1 reply; 5+ messages in thread
From: LuVar @ 2014-11-03 14:09 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

Thanks for the nice "replicate it at home yourself" example. On my machine it behaves precisely as on yours:

<code>
root@blackdawn:/home/luvar# sync; sysctl vm.drop_caches=1
vm.drop_caches = 1
root@blackdawn:/home/luvar# time cat /home/luvar/programs/adt-bundle-linux/sdk/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null 
real    0m6.768s
user    0m0.016s
sys     0m0.599s

root@blackdawn:/home/luvar# time cat /home/luvar/programs/android-sdk-linux/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null 
real    0m5.259s
user    0m0.018s
sys     0m0.695s

root@blackdawn:/home/luvar# time cat /home/luvar/programs/adt-bundle-linux/sdk/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null 
real    0m0.701s
user    0m0.014s
sys     0m0.288s

root@blackdawn:/home/luvar# time cat /home/luvar/programs/android-sdk-linux/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null
real    0m0.286s
user    0m0.013s
sys     0m0.272s
</code>

If you don't mind my asking, is there any plan to optimize this behaviour? I know that btrfs is not like ZFS (a whole system, from block device, through cache, to VFS), so would it be possible to implement such an optimization without a major patch to the Linux block cache/VFS cache?

Thanks, have a nice day,
--
LuVar



* Re: btrfs deduplication and linux cache management
  2014-11-03 14:09     ` LuVar
@ 2014-11-04 20:01       ` Zygo Blaxell
  0 siblings, 0 replies; 5+ messages in thread
From: Zygo Blaxell @ 2014-11-04 20:01 UTC (permalink / raw)
  To: LuVar; +Cc: linux-btrfs


On Mon, Nov 03, 2014 at 03:09:11PM +0100, LuVar wrote:
> Thanks for the nice "replicate it at home yourself" example. On my machine it behaves precisely as on yours:
> 
> <code>
> root@blackdawn:/home/luvar# sync; sysctl vm.drop_caches=1
> vm.drop_caches = 1
> root@blackdawn:/home/luvar# time cat /home/luvar/programs/adt-bundle-linux/sdk/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null 
> real    0m6.768s
> user    0m0.016s
> sys     0m0.599s
> 
> root@blackdawn:/home/luvar# time cat /home/luvar/programs/android-sdk-linux/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null 
> real    0m5.259s
> user    0m0.018s
> sys     0m0.695s
> 
> root@blackdawn:/home/luvar# time cat /home/luvar/programs/adt-bundle-linux/sdk/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null 
> real    0m0.701s
> user    0m0.014s
> sys     0m0.288s
> 
> root@blackdawn:/home/luvar# time cat /home/luvar/programs/android-sdk-linux/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null
> real    0m0.286s
> user    0m0.013s
> sys     0m0.272s
> </code>
> 
> If you don't mind my asking, is there any plan to optimize this
> behaviour? I know that btrfs is not like ZFS (a whole system, from
> block device, through cache, to VFS), so would it be possible to
> implement such an optimization without a major patch to the Linux
> block cache/VFS cache?

I'd like to know this too.  I think not any time soon though.

AIUI (I'm not really an expert here), the VFS cache is keyed on tuples of
(device:inode, offset), so it has no way to cope with aliasing the same
physical blocks through distinct inodes.  It would have to learn about
reference counting (so multiple inodes can refer to shared blocks, one
inode can refer to the same blocks twice, etc) and copy-on-write (so we
can modify just one share of a shared-extent cache page).  For compressed
data caching, the filesystem would be volunteering references to blocks
that were not asked for (e.g.  unread portions of compressed extents).

It's not impossible to make those changes to the VFS cache, but the
only filesystem on mainline Linux that would benefit is btrfs (ZFS is
not on mainline Linux, the ZFS maintainers probably prefer to use their
own cache layer anyway, and nobody else shares extents between files).
For filesystems that don't share extents, adding the necessary stuff to
VFS is a lot of extra overhead they will never use.

Back in the day, the Linux cache used to use tuples of (device,
block_number), but this approach doesn't work on non-block filesystems
like NFS, so it was dropped in favor of the inode+offset caching.
A block-based scheme would handle shared extents but not compressed ones
(e.g. you've got a 4K cacheable page that was compressed to 312 bytes
somewhere in the middle of a 57K compressed data extent...what's that
page's block number, again?).
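
To make the cache key concrete, the device and inode numbers behind two
paths can be compared from the shell.  This is a rough sketch assuming
GNU stat; page_cache_owner and same_cache_owner are hypothetical helper
names, and the check only shows which inode owns the cached pages, not
what is actually resident in RAM.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the device:inode pair for a path, i.e. the part of the
# (device:inode, offset) key that identifies the file.
# Assumes GNU stat (-c with %d and %i).
page_cache_owner() {
    stat -c '%d:%i' -- "$1"
}

# Succeed if two paths resolve to the same inode on the same device,
# which is the case where cached pages can be shared.
same_cache_owner() {
    [ "$(page_cache_owner "$1")" = "$(page_cache_owner "$2")" ]
}
```

With the files from the earlier test, 'same_cache_owner a b' succeeds
when 'b' is a hardlink (one inode, one set of cached pages) and fails
when 'b' is a reflink copy (two inodes, even though filefrag reports
the same physical extents).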



