From mboxrd@z Thu Jan 1 00:00:00 1970
From: Leszek Ciesielski
Subject: Re: Content based storage
Date: Wed, 17 Mar 2010 16:33:41 +0100
Message-ID: <23a15591003170833t3ec4dc3fq9630558aa190afc@mail.gmail.com>
References: <201003170948.18819.hjclaes@web.de> <201003171625.50257.hka@qbs.com.pl>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-2
To: Hubert Kario, linux-btrfs@vger.kernel.org
Return-path:
In-Reply-To: <201003171625.50257.hka@qbs.com.pl>
List-ID:

On Wed, Mar 17, 2010 at 4:25 PM, Hubert Kario wrote:
> On Wednesday 17 March 2010 09:48:18 Heinz-Josef Claes wrote:
>> Hi,
>>
>> just want to add one correction to your thoughts:
>>
>> Storage is not cheap if you think about enterprise storage on a SAN,
>> replicated to another data centre. Using dedup on the storage boxes leads
>> to performance issues and other problems - only NetApp is offering this
>> at the moment and it's not heavily used (because of the issues).
>
> there are at least two other suppliers with inline dedup products and
> there is an OSS solution: lessfs
>
>> So I think it would be a big advantage for professional use to have dedup
>> built into the filesystem - processors are faster and faster today and
>> are not the cost drivers any more. I do not think it's a problem to
>> "spend" one core of a 2-socket box with 12 cores for this purpose.
>> Storage is cost intensive:
>> - SAN boxes are expensive
>> - RAID5 in two locations is expensive
>> - FC lines between locations are expensive (depending very much on where
>> you are).
>
> In-line dedup is expensive in two ways: first you have to cache the data
> going to disk and generate a checksum for it, then you have to check
> whether such a block is already stored -- if the database doesn't fit into
> RAM (for a VM host it's more than likely) that requires at least a few
> disk seeks, if not a few dozen for really big databases. Then you should
> read the block/extent back and compare them bit for bit. And only then
> write the data to the disk. That reduces your IOPS by at least an order
> of magnitude, if not more.

Sun decided that with SHA256 (which ZFS uses for normal checksumming)
collisions are unlikely enough to skip the read/compare step:
http://blogs.sun.com/bonwick/entry/zfs_dedup . That's not the case, of
course, with the CRC32 that btrfs uses, but a switch to a stronger hash
would be recommended to reduce collisions anyway. And yes, for the truly
paranoid, a forced verification (after the hashes match) is always an
option; a rough sketch of such a write path is appended at the end of
this message.

>
> For post-process dedup you can go as fast as your HDDs will allow you. And
> then, when your machine is mostly idle, you can go and churn through the
> data.
>
> IMHO in-line dedup is a good thing only as storage for backups -- when you
> have a high probability that the stored data is duplicated (and with a
> 1:10 dedup ratio you have a 90% probability, so it is).
>
> So the CPU cost is only one factor. HDDs are a major bottleneck too.
>
> All things considered, it would be best to have both post-process and
> in-line data deduplication, but I think that in-line dedup will see much
> less use.
>
>>
>> Naturally, you would not use this feature for all kinds of use cases
>> (e.g. a heavily used database), but I think there is enough need.
>>
>> my 2 cents,
>> Heinz-Josef Claes
> --
> Hubert Kario
> QBS - Quality Business Software
> 02-656 Warszawa, ul. Ksawerów 30/85
> tel.
> +48 (22) 646-61-51, 646-74-24
> www.qbs.com.pl
>
> Quality Management System
> compliant with the ISO 9001:2000 standard
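
For what it's worth, below is a minimal sketch of that write path with the
verification step made optional. This is hypothetical Python, with a plain
in-memory dict standing in for the on-disk hash index (the part whose
lookups cost the extra seeks Hubert describes); it is not how ZFS or btrfs
actually implement dedup, just an illustration of the trade-off.

import hashlib

BLOCK_SIZE = 4096

def write_block(store, index, data, verify=True):
    """Store one block, reusing an existing copy when its hash matches.

    With a strong hash such as SHA-256 the verify step can arguably be
    skipped (the ZFS bet linked above); with a weak checksum like CRC32
    a byte-for-byte comparison is needed before deduplicating.
    """
    digest = hashlib.sha256(data).digest()
    existing = index.get(digest)
    if existing is not None:
        # Optional paranoia: read the stored block back and compare it
        # bit for bit before trusting the hash match.
        if not verify or store[existing] == data:
            return existing              # deduplicated: reuse the old block
    # New block (or an astronomically unlikely collision): write it out.
    block_id = len(store)
    store[block_id] = data
    index[digest] = block_id
    return block_id

# Two identical blocks end up stored only once:
store, index = {}, {}
first = write_block(store, index, b"x" * BLOCK_SIZE)
second = write_block(store, index, b"x" * BLOCK_SIZE)
assert first == second and len(store) == 1

Passing verify=False is exactly the ZFS wager that a SHA-256 collision will
never occur in practice; with CRC32 that read-back would have to stay on
for correctness.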