From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeff Woods
Subject: Re: Question about checksums
Date: Thu, 21 Aug 2003 10:28:46 -0700
Sender: linux-c-programming-owner@vger.kernel.org
Message-ID: <5.2.1.1.0.20030821100717.032dfb30@no.incoming.mail>
References: <20030821132005.GA8614@lsd.di.uminho.pt>
Mime-Version: 1.0
Return-path:
In-Reply-To:
List-Id:
Content-Type: text/plain; charset="us-ascii"; format="flowed"
Content-Transfer-Encoding: 7bit
To: Holger Kiehl
Cc: Luciano Miguel Ferreira Rocha, linux-c-programming@vger.kernel.org

At 04:36 PM +0000 8/21/2003, Holger Kiehl wrote:
>On Thu, 21 Aug 2003, Luciano Miguel Ferreira Rocha wrote:
>>On Thu, Aug 21, 2003 at 12:48:05PM +0000, Holger Kiehl wrote:
[snip]
>>>I think md5sum could do the job, but I think it is a bit of an
>>>overkill to generate a 128 Bit checksum for such small input data.
>>>Also storing such huge numbers is a bit of a pain. Would a 32 or 64
>>>Bit checksum be sufficient, or would I be running into problems when
>>>these are too short?
>>
>>CRC-32 is normally sufficient. It's designed for detecting data
>>corruption in transmission, though, but it should be OK as long as
>>you don't expect people to try and break your code with equal
>>checksums.
>
>I am not trying to make anything more secure. Will a CRC-32 be
>sufficient to always generate a different sum if a single bit changes
>within the maximum 5120 Bytes?

In general, X bits of storage can take on 2^X distinct values, so a
CRC-32 can take on at most approximately four billion possible values.
That's a number with three commas in US notation; I suppose that's
three periods or spaces on your side of the pond.

A 128 bit value can store approximately 340 trillion trillion trillion
distinct values. That's a number with *twelve* commas.

And a 5120 byte file has 40960 bits, so it can have 2^40960, or roughly
10^12330, distinct values. (The short program at the end of this
message does that arithmetic.)

There will always be the possibility of duplicate values when you take
a checksum of arbitrary data longer than the checksum itself: there are
simply more possible inputs than there are checksum values to go
around. You have to make a tradeoff of how much risk of a duplicate
you're willing to accept based on how a duplicate would affect you.
With a 32 bit checksum, two different files have at *minimum* a one in
four billion chance of sharing the same checksum. For most non-security
applications, that's ample. The same possibility exists with 128 bit
md5 checksums (or any other hash), but the larger the checksum, the
less often you'll get duplicate checksums for different data (assuming
hash algorithms of comparable quality).

(For what it's worth, your specific case of a single flipped bit is the
one change a CRC is guaranteed to catch, whatever the message length;
it's larger, multi-bit changes where duplicate sums become possible.)

One way to make such use of checksums fail-safe is to treat the
checksum as proof that two files are different, but never as proof that
they are the same. When the checksums match you don't really know the
files are the same until you compare their actual contents, and if it's
really critical to know for certain, you can probably afford a full
comparison once every four billion matches. A sketch of that follows
below.
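Here's a minimal sketch of that fail-safe comparison, assuming zlib's
crc32() is on hand (link with -lz); any decent 32 bit checksum could
stand in for it. Note the order of the tests: the checksum is only ever
allowed to prove difference, and memcmp() settles a match.

#include <string.h>
#include <zlib.h>

/* Returns 1 if buffers a and b (both len bytes) hold the same data.
 * A CRC mismatch proves they differ; a CRC match proves nothing,
 * so we fall through to a byte-for-byte comparison. */
int same_contents(const unsigned char *a, const unsigned char *b,
                  unsigned int len)
{
    uLong ca = crc32(crc32(0L, Z_NULL, 0), a, len);
    uLong cb = crc32(crc32(0L, Z_NULL, 0), b, len);

    if (ca != cb)
        return 0;                       /* provably different */
    return memcmp(a, b, len) == 0;      /* match: verify for certain */
}

In real use you'd have one side's checksum stored already (that being
the whole point of keeping checksums around), so only the rare matching
pair costs you a full comparison.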
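And the arithmetic check promised above: a tiny C program (my sketch,
nothing more) that multiplies a bit count by log10(2) to estimate how
many decimal digits 2^X runs to. Build with something like
"gcc check.c -lm".

/* Rough size check: an X-bit value has 2^X possible values,
 * which is about X * log10(2) decimal digits' worth. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    int bits[] = { 32, 128, 5120 * 8 };
    int i;

    for (i = 0; i < 3; i++)
        printf("%5d bits -> about 10^%.1f distinct values\n",
               bits[i], bits[i] * log10(2.0));
    return 0;
}

/* prints:
 *    32 bits -> about 10^9.6 distinct values
 *   128 bits -> about 10^38.5 distinct values
 * 40960 bits -> about 10^12330.2 distinct values
 */

--
Jeff Woods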