linux-lvm.redhat.com archive mirror
* [linux-lvm] Data deduplication for Linux : lessfs
@ 2009-06-24 15:12 Mark Ruijter
  2009-06-24 18:50 ` Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 11+ messages in thread
From: Mark Ruijter @ 2009-06-24 15:12 UTC (permalink / raw)
  To: linux-lvm

Those who need open-source data deduplication today rather than
tomorrow might want to take a look at lessfs.
http://www.lessfs.com
I am thinking about starting work on a data-deduplicating
block device, a kernel module called blockless.
For the time being, lessfs is stable and can even be used as a block device
with the help of ietd, loop or nbd.

Mark Ruijter


* Re: [linux-lvm] Data deduplication for Linux : lessfs
  2009-06-24 15:12 [linux-lvm] Data deduplication for Linux : lessfs Mark Ruijter
@ 2009-06-24 18:50 ` Roy Sigurd Karlsbakk
  2009-06-24 19:25   ` Mark Ruijter
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Roy Sigurd Karlsbakk @ 2009-06-24 18:50 UTC (permalink / raw)
  To: LVM general discussion and development

On 24. juni. 2009, at 17.12, Mark Ruijter wrote:

> For those who need OpenSource data deduplication today instead of
> tomorrow one might take a look at lessfs.
> http://www.lessfs.com

It's a good idea, but given the current traffic on the lessfs mailing
list, I'm not sure much work is being done. I have been a member of that
list since June 1 and haven't received more than one message, which
was the one I wrote myself.
>
> I am thinking about starting to work on a data deduplicating
> blockdevice, a kernel module called blockless.

If done smartly, this may perhaps be possible, but the problem is the  
filesystem's metadata. Is this going to be dedup'ed? How much will  
this take? A simple backup will update atime on all the files backed  
up, and although atime isn't always wanted or needed, the problem  
occurs elsewhere.

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to
avoid excessive use of idioms of foreign origin. In most
cases, adequate and relevant synonyms exist in Norwegian.


* Re: [linux-lvm] Data deduplication for Linux : lessfs
  2009-06-24 18:50 ` Roy Sigurd Karlsbakk
@ 2009-06-24 19:25   ` Mark Ruijter
  2009-06-24 19:43     ` Roy Sigurd Karlsbakk
  2009-06-24 19:32   ` Greg Freemyer
  2009-06-24 20:04   ` Les Mikesell
  2 siblings, 1 reply; 11+ messages in thread
From: Mark Ruijter @ 2009-06-24 19:25 UTC (permalink / raw)
  To: LVM general discussion and development


Hi Roy,
>
> It's a good idea, but given the current traffic on the lessfs mailing
> list, I'm not sure if much work is done. I have been a member of that
> list since June 1 and haven't received more than one message, which
> was the one I wrote myself.
>
Almost all the traffic is on the forum - open discussion.
Only one person posted to the mailing list. ;-)

> If done smartly, this may perhaps be possible, but the problem is the
> filesystem's metadata. Is this going to be dedup'ed? How much will
> this take? A simple backup will update atime on all the files backed
> up, and although atime isn't always wanted or needed, the problem
> occurs elsewhere.
Typically the metadata on production systems is approximately 10-20% of the
deduplicated stored data.
On my systems the stored data is 40x smaller than the data written to the
filesystem.

For example, from a real-life backup server making dozens of backups
each day:
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p3     9.7G  2.4G  6.9G  26% /
/dev/cciss/c0d0p1      99M   23M   72M  24% /boot
tmpfs                 7.9G     0  7.9G   0% /dev/shm
/dev/cciss/c0d0p4     246G  6.0G  241G   3% /meta
/dev/cciss/c0d1p1     274G   73G  202G  27% /blockdata
/dev/cciss/c1d0p1     4.1T  1.5T  2.7T  35% /data
lessfs                4.1T  1.5T  2.7T  35% /pooldata
[root@lessfssrv pooldata]# du . -s -h
31T     .
[root@lessfssrv pooldata]# ls -alh /data/current/
total 314G
drwxr-xr-x 2 root root   26 Jun  1 00:12 .
drwxr-xr-x 6 root root   59 Jun  1 00:12 ..
-rw-r--r-- 1 root root 314G Jun 22 14:26 blockdata.tch
[root@lessfssrv pooldata]# ls -alh /meta/current/
total 1.4G
drwxr-xr-x 2 root root   63 Jun  1 00:12 .
drwxr-xr-x 6 root root   59 Jun  1 00:12 ..
-rw-r--r-- 1 root root 1.3G Jun 22 14:52 blockusage.tch
-rw-r--r-- 1 root root  89M Jun 22 14:45 dirent.tcb
-rw-r--r-- 1 root root  89M Jun 22 14:52 metadata.tcb

Mark.




* Re: [linux-lvm] Data deduplication for Linux : lessfs
  2009-06-24 18:50 ` Roy Sigurd Karlsbakk
  2009-06-24 19:25   ` Mark Ruijter
@ 2009-06-24 19:32   ` Greg Freemyer
  2009-06-24 20:04   ` Les Mikesell
  2 siblings, 0 replies; 11+ messages in thread
From: Greg Freemyer @ 2009-06-24 19:32 UTC (permalink / raw)
  To: LVM general discussion and development

On Wed, Jun 24, 2009 at 2:50 PM, Roy Sigurd Karlsbakk<roy@karlsbakk.net> wrote:
> On 24. juni. 2009, at 17.12, Mark Ruijter wrote:
<snip>
>> I am thinking about starting to work on a data deduplicating
>> blockdevice, a kernel module called blockless.
>
> If done smartly, this may perhaps be possible, but the problem is the
> filesystem's metadata. Is this going to be dedup'ed? How much will this
> take? A simple backup will update atime on all the files backed up, and
> although atime isn't always wanted or needed, the problem occurs elsewhere.
>

As of the 2.6.30 kernel, "relatime" is a default mount option.  It is
a relatively new mount option that drastically reduces the number of
atime updates.  Note the spelling: relatime, not realtime.

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
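
To illustrate why relatime cuts down on atime writes, here is a minimal
Python sketch of the decision rule as described in mount(8): atime is
updated only if the recorded atime is not newer than mtime or ctime, or if
it is at least a day old. The function names and the hourly-read scenario
are made up for illustration; this is not kernel code.

```python
DAY_SECONDS = 24 * 60 * 60


def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if a read at time `now` should update atime under relatime."""
    if mtime >= atime:
        # File contents changed at or after the last recorded access.
        return True
    if ctime >= atime:
        # Inode changed at or after the last recorded access.
        return True
    if now - atime >= DAY_SECONDS:
        # Recorded access time is at least one day old.
        return True
    return False


def count_atime_writes(read_times, mtime, ctime, atime):
    """Count how many reads would actually cause an atime write."""
    writes = 0
    for now in read_times:
        if relatime_needs_update(atime, mtime, ctime, now):
            atime = now
            writes += 1
    return writes


# Hypothetical scenario: a file written once at t=0, then read hourly for 3 days.
hourly_reads = [hour * 3600 for hour in range(1, 73)]
writes = count_atime_writes(hourly_reads, mtime=0, ctime=0, atime=0)
print(writes, "atime writes for", len(hourly_reads), "reads")
```

With strict atime every one of those reads would dirty the inode; under the
rule above only the first read after a change and then roughly one read per
day do.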


* Re: [linux-lvm] Data deduplication for Linux : lessfs
  2009-06-24 19:25   ` Mark Ruijter
@ 2009-06-24 19:43     ` Roy Sigurd Karlsbakk
  0 siblings, 0 replies; 11+ messages in thread
From: Roy Sigurd Karlsbakk @ 2009-06-24 19:43 UTC (permalink / raw)
  To: LVM general discussion and development

On 24. juni. 2009, at 21.25, Mark Ruijter wrote:

> Hi Roy,
>>
>> It's a good idea, but given the current traffic on the lessfs  
>> mailing list, I'm not sure if much work is done. I have been a  
>> member of that list since June 1 and haven't received more than one  
>> message, which was the one I wrote myself.
>>
>
> Almost all the traffic is on the forum - open discussion.
> Only one person posted to the mailing list. ;-)

Why??
Mailing lists are so much easier to use. Instead of my having to visit a
bunch of websites, the messages all sit in my mailbox.

>> If done smartly, this may perhaps be possible, but the problem is  
>> the filesystem's metadata. Is this going to be dedup'ed? How much  
>> will this take? A simple backup will update atime on all the files  
>> backed up, and although atime isn't always wanted or needed, the  
>> problem occurs elsewhere.
>
> Typically the meta data on production systems is approx 10%~20% of  
> the deduplicated stored data.
> Stored data is on my systems 40x less than the data written to the
> filesystem.


The problem with metadata is not that it takes up a lot of space,
but that it is updated so frequently. As Greg Freemyer pointed out,
relatime will help a lot, but deduplicating metadata may still take
up a serious amount of time because of the frequent updates.

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to
avoid excessive use of idioms of foreign origin. In most
cases, adequate and relevant synonyms exist in Norwegian.


* Re: [linux-lvm] Data deduplication for Linux : lessfs
  2009-06-24 18:50 ` Roy Sigurd Karlsbakk
  2009-06-24 19:25   ` Mark Ruijter
  2009-06-24 19:32   ` Greg Freemyer
@ 2009-06-24 20:04   ` Les Mikesell
  2009-06-24 20:09     ` Roy Sigurd Karlsbakk
  2009-06-24 20:12     ` Mark Ruijter
  2 siblings, 2 replies; 11+ messages in thread
From: Les Mikesell @ 2009-06-24 20:04 UTC (permalink / raw)
  To: LVM general discussion and development

Roy Sigurd Karlsbakk wrote:
> On 24. juni. 2009, at 17.12, Mark Ruijter wrote:
> 
>> For those who need OpenSource data deduplication today instead of
>> tomorrow one might take a look at lessfs.
>> http://www.lessfs.com
> 
> It's a good idea, but given the current traffic on the lessfs mailing 
> list, I'm not sure if much work is done. I have been a member of that 
> list since June 1 and haven't received more than one message, which was 
> the one I wrote myself.
>>
>> I am thinking about starting to work on a data deduplicating
>> blockdevice, a kernel module called blockless.
> 
> If done smartly, this may perhaps be possible, but the problem is the 
> filesystem's metadata. Is this going to be dedup'ed? How much will this 
> take? A simple backup will update atime on all the files backed up, and 
> although atime isn't always wanted or needed, the problem occurs elsewhere.

Block level deduplication isn't going to know/care about the difference 
between file contents and metadata.  It is either stored in blocks that 
match other blocks or not and the difference should not be visible to 
the filesystem living on top of the block device.

-- 
   Les Mikesell
    lesmikesell@gmail.com
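
The point that a block layer sees only blocks, not "file data" versus
"metadata", can be sketched in a few lines of Python. This is a toy model
under stated assumptions: a fixed 4096-byte block size and SHA-256 as the
fingerprint are choices made for the example, and nothing here is taken
from lessfs or the proposed blockless module.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def dedup_blocks(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and store each distinct block once.

    Returns (store, layout): store maps digest -> block bytes, and layout is
    the ordered list of digests needed to rebuild the original byte stream.
    """
    store = {}
    layout = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy
        layout.append(digest)
    return store, layout


def rebuild(store, layout):
    """Reassemble the original bytes from the store and the layout."""
    return b"".join(store[digest] for digest in layout)


# A made-up "device image": the dedup code never asks which blocks hold
# file contents and which hold filesystem metadata.
file_block = b"A" * BLOCK_SIZE
inode_block = b"inode-table".ljust(BLOCK_SIZE, b"\0")
image = file_block + inode_block + file_block + file_block

store, layout = dedup_blocks(image)
print(len(layout), "logical blocks,", len(store), "stored blocks")
```

The filesystem on top reads back exactly the bytes it wrote, so whether a
block was shared is invisible to it.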


* Re: [linux-lvm] Data deduplication for Linux : lessfs
  2009-06-24 20:04   ` Les Mikesell
@ 2009-06-24 20:09     ` Roy Sigurd Karlsbakk
  2009-06-24 20:59       ` Les Mikesell
  2009-06-24 20:12     ` Mark Ruijter
  1 sibling, 1 reply; 11+ messages in thread
From: Roy Sigurd Karlsbakk @ 2009-06-24 20:09 UTC (permalink / raw)
  To: LVM general discussion and development

On 24. juni. 2009, at 22.04, Les Mikesell wrote:

> Roy Sigurd Karlsbakk wrote:
>> On 24. juni. 2009, at 17.12, Mark Ruijter wrote:
>>> For those who need OpenSource data deduplication today instead of
>>> tomorrow one might take a look at lessfs.
>>> http://www.lessfs.com
>> It's a good idea, but given the current traffic on the lessfs  
>> mailing list, I'm not sure if much work is done. I have been a  
>> member of that list since June 1 and haven't received more than one  
>> message, which was the one I wrote myself.
>>>
>>> I am thinking about starting to work on a data deduplicating
>>> blockdevice, a kernel module called blockless.
>> If done smartly, this may perhaps be possible, but the problem is  
>> the filesystem's metadata. Is this going to be dedup'ed? How much  
>> will this take? A simple backup will update atime on all the files  
>> backed up, and although atime isn't always wanted or needed, the  
>> problem occurs elsewhere.
>
> Block level deduplication isn't going to know/care about the  
> difference between file contents and metadata.  It is either stored  
> in blocks that match other blocks or not and the difference should  
> not be visible to the filesystem living on top of the block device.


My point exactly. If dedup were to be done at the block layer, you'd
need a flag to say "do not dedup this".

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to
avoid excessive use of idioms of foreign origin. In most
cases, adequate and relevant synonyms exist in Norwegian.


* Re: [linux-lvm] Data deduplication for Linux : lessfs
  2009-06-24 20:04   ` Les Mikesell
  2009-06-24 20:09     ` Roy Sigurd Karlsbakk
@ 2009-06-24 20:12     ` Mark Ruijter
  1 sibling, 0 replies; 11+ messages in thread
From: Mark Ruijter @ 2009-06-24 20:12 UTC (permalink / raw)
  To: LVM general discussion and development


> Block level deduplication isn't going to know/care about the
> difference between file contents and metadata.  It is either stored in
> blocks that match other blocks or not and the difference should not be
> visible to the filesystem living on top of the block device.
>
Exactly. And this is why lessfs can easily write files at speeds of up to
200-300 MB/s on modern hardware.

Mark.


* Re: [linux-lvm] Data deduplication for Linux : lessfs
  2009-06-24 20:09     ` Roy Sigurd Karlsbakk
@ 2009-06-24 20:59       ` Les Mikesell
  2009-06-24 21:03         ` malahal
  0 siblings, 1 reply; 11+ messages in thread
From: Les Mikesell @ 2009-06-24 20:59 UTC (permalink / raw)
  To: LVM general discussion and development

Roy Sigurd Karlsbakk wrote:
>>>> I am thinking about starting to work on a data deduplicating
>>>> blockdevice, a kernel module called blockless.
>>> If done smartly, this may perhaps be possible, but the problem is the 
>>> filesystem's metadata. Is this going to be dedup'ed? How much will 
>>> this take? A simple backup will update atime on all the files backed 
>>> up, and although atime isn't always wanted or needed, the problem 
>>> occurs elsewhere.
>>
>> Block level deduplication isn't going to know/care about the 
>> difference between file contents and metadata.  It is either stored in 
>> blocks that match other blocks or not and the difference should not be 
>> visible to the filesystem living on top of the block device.
> 
> 
> My point exactly. If dedup was to be done on the block layer, you'd need
> a flag to say "do not dedup this".

Why?  How can it possibly make any difference? It's not likely that 
you'd have dupes in the metadata block, but if you do it doesn't matter 
that they are transparently mapped into one.  You need a copy-on-write 
mechanism anyway since if you write to either they won't be dups any more.

-- 
   Les Mikesell
    lesmikesell@gmail.com
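
The copy-on-write requirement mentioned above can be sketched as a toy
Python block device with reference counts. Everything here (the class
name, SHA-256 fingerprints, in-memory dictionaries) is an assumption made
for illustration, not a description of any existing implementation: a
write to a logical block that shares storage with another is redirected to
its own physical block, so the other logical block keeps its contents.

```python
import hashlib


class DedupDevice:
    """Toy block device: logical block numbers map to shared physical blocks."""

    def __init__(self):
        self.mapping = {}    # logical block number -> digest
        self.blocks = {}     # digest -> block contents
        self.refcount = {}   # digest -> number of logical blocks using it

    def write(self, lbn, data):
        digest = hashlib.sha256(data).hexdigest()
        old = self.mapping.get(lbn)
        if old == digest:
            return  # same content rewritten; nothing changes
        # Take a reference on the new content, storing it only if unseen.
        if digest not in self.blocks:
            self.blocks[digest] = data
            self.refcount[digest] = 0
        self.refcount[digest] += 1
        self.mapping[lbn] = digest
        # Drop the reference on the old content; other users keep their copy.
        if old is not None:
            self.refcount[old] -= 1
            if self.refcount[old] == 0:
                del self.blocks[old]
                del self.refcount[old]

    def read(self, lbn):
        return self.blocks[self.mapping[lbn]]


dev = DedupDevice()
dev.write(0, b"same")
dev.write(1, b"same")          # shares storage with logical block 0
shared_count = len(dev.blocks)
dev.write(1, b"changed")       # block 1 diverges; block 0 is untouched
print(shared_count, "->", len(dev.blocks), "physical blocks")
```

The reference count is what lets the device free a physical block once the
last logical block pointing at it has been overwritten.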


* Re: [linux-lvm] Data deduplication for Linux : lessfs
  2009-06-24 20:59       ` Les Mikesell
@ 2009-06-24 21:03         ` malahal
  2009-06-24 21:21           ` Les Mikesell
  0 siblings, 1 reply; 11+ messages in thread
From: malahal @ 2009-06-24 21:03 UTC (permalink / raw)
  To: linux-lvm

Les Mikesell [lesmikesell@gmail.com] wrote:
> Roy Sigurd Karlsbakk wrote:
>>>>> I am thinking about starting to work on a data deduplicating
>>>>> blockdevice, a kernel module called blockless.
>>>> If done smartly, this may perhaps be possible, but the problem is the 
>>>> filesystem's metadata. Is this going to be dedup'ed? How much will this 
>>>> take? A simple backup will update atime on all the files backed up, and 
>>>> although atime isn't always wanted or needed, the problem occurs 
>>>> elsewhere.
>>>
>>> Block level deduplication isn't going to know/care about the difference 
>>> between file contents and metadata.  It is either stored in blocks that 
>>> match other blocks or not and the difference should not be visible to the 
>>> filesystem living on top of the block device.
>> My point exactly. If dedup was to be done on the block layer, you'd need
>> a flag to say "do not dedup this".
>
> Why?  How can it possibly make any difference? It's not likely that you'd 
> have dupes in the metadata block, but if you do it doesn't matter that they 
> are transparently mapped into one.  You need a copy-on-write mechanism 
> anyway since if you write to either they won't be dups any more.

Because some file systems create duplicate copies of metadata for
recovery in case some sectors on the media go bad. You really don't
want to merge them!
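
A small Python sketch of the concern, under made-up assumptions (a
three-block device with a primary and a backup copy of the same metadata
block, SHA-256 fingerprints, hypothetical function names): once identical
blocks are merged, a single bad physical block takes out both logical
copies.

```python
import hashlib


def place_blocks(logical_blocks, dedup):
    """Return, for each logical block index, the physical slot that holds it."""
    slots = []
    seen = {}  # fingerprint -> physical slot already holding that content
    for index, block in enumerate(logical_blocks):
        if dedup:
            key = hashlib.sha256(block).digest()
            slot = seen.setdefault(key, len(seen))
        else:
            slot = index  # every logical block gets its own physical slot
        slots.append(slot)
    return slots


def unreadable_after_failure(slots, bad_slot):
    """List the logical blocks lost when one physical slot goes bad."""
    return [index for index, slot in enumerate(slots) if slot == bad_slot]


# Hypothetical layout: primary metadata copy, some file data, backup copy.
metadata = b"superblock"
device = [metadata, b"file data", metadata]

plain = place_blocks(device, dedup=False)
merged = place_blocks(device, dedup=True)

print("without dedup, lost:", unreadable_after_failure(plain, 0))
print("with dedup, lost:   ", unreadable_after_failure(merged, 0))
```

Without dedup the backup copy at logical block 2 survives the failure; with
dedup both copies point at the same physical slot and are lost together.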


* Re: [linux-lvm] Data deduplication for Linux : lessfs
  2009-06-24 21:03         ` malahal
@ 2009-06-24 21:21           ` Les Mikesell
  0 siblings, 0 replies; 11+ messages in thread
From: Les Mikesell @ 2009-06-24 21:21 UTC (permalink / raw)
  To: LVM general discussion and development

malahal@us.ibm.com wrote:
> 
>>>> Block level deduplication isn't going to know/care about the difference 
>>>> between file contents and metadata.  It is either stored in blocks that 
>>>> match other blocks or not and the difference should not be visible to the 
>>>> filesystem living on top of the block device.
>>> My point exactly. If dedup was to be done on the block layer, you'd need
>>> a flag to say "do not dedup this".
>> Why?  How can it possibly make any difference? It's not likely that you'd 
>> have dupes in the metadata block, but if you do it doesn't matter that they 
>> are transparently mapped into one.  You need a copy-on-write mechanism 
>> anyway since if you write to either they won't be dups any more.
> 
> Because some file systems create duplicate copies of metadata for
> recovery in case some sectors on the media go bad. You really don't
> want to merge them!

My experience with disks is that if any part of them fails you don't 
want to trust data from any other part.  So I'd consider this a big 
waste of time and generally keep data that matters on mirrored drives. 
Hmmm, I suppose you would want it to know not to de-dup the mirrored 
blocks..

-- 
   Les Mikesell
    lesmikesell@gmail.com


end of thread, other threads:[~2009-06-24 21:22 UTC | newest]

Thread overview: 11+ messages
2009-06-24 15:12 [linux-lvm] Data deduplication for Linux : lessfs Mark Ruijter
2009-06-24 18:50 ` Roy Sigurd Karlsbakk
2009-06-24 19:25   ` Mark Ruijter
2009-06-24 19:43     ` Roy Sigurd Karlsbakk
2009-06-24 19:32   ` Greg Freemyer
2009-06-24 20:04   ` Les Mikesell
2009-06-24 20:09     ` Roy Sigurd Karlsbakk
2009-06-24 20:59       ` Les Mikesell
2009-06-24 21:03         ` malahal
2009-06-24 21:21           ` Les Mikesell
2009-06-24 20:12     ` Mark Ruijter
