linux-c-programming.vger.kernel.org archive mirror
* Question about checksums
@ 2003-08-21 12:48 Holger Kiehl
  2003-08-21 13:20 ` Luciano Miguel Ferreira Rocha
  0 siblings, 1 reply; 9+ messages in thread
From: Holger Kiehl @ 2003-08-21 12:48 UTC (permalink / raw)
  To: linux-c-programming

Hello

Let me first explain what I am trying to do. I have a big ASCII
configuration file (approx. 500 KB), which I split up into many smaller
jobs of approx. 180 bytes each (average; minimum is 50, maximum 5120 bytes).
For each job I would like to generate a unique number, so that I can
refer to these jobs by their individual numbers.

What is the best way to generate a checksum from each job? I would also
like the checksum to always be the same when it is calculated on a
different host with a different CPU and OS but from the same job data.

I think md5sum could do the job, but I think it is a bit of overkill
to generate a 128-bit checksum for such small input data. Also, storing
such huge numbers is a bit of a pain. Would a 32- or 64-bit checksum
be sufficient, or would I run into problems if these are too
short?

Regards,
Holger



* Re: Question about checksums
  2003-08-21 12:48 Question about checksums Holger Kiehl
@ 2003-08-21 13:20 ` Luciano Miguel Ferreira Rocha
  2003-08-21 16:36   ` Holger Kiehl
  0 siblings, 1 reply; 9+ messages in thread
From: Luciano Miguel Ferreira Rocha @ 2003-08-21 13:20 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-c-programming

On Thu, Aug 21, 2003 at 12:48:05PM +0000, Holger Kiehl wrote:
> Hello
> 
> Let me first explain what I am trying to do. I have a big ASCII
> configuration file (approx. 500 KB), which I split up into many smaller
> jobs of approx. 180 bytes each (average; minimum is 50, maximum 5120 bytes).
> For each job I would like to generate a unique number, so that I can
> refer to these jobs by their individual numbers.
> 
> What is the best way to generate a checksum from each job? I would also
> like the checksum to always be the same when it is calculated on a
> different host with a different CPU and OS but from the same job data.

Why not just use the number of the job? Or the offset from the file of
the job?

> I think md5sum could do the job, but I think it is a bit of overkill
> to generate a 128-bit checksum for such small input data. Also, storing
> such huge numbers is a bit of a pain. Would a 32- or 64-bit checksum
> be sufficient, or would I run into problems if these are too
> short?

CRC-32 is normally sufficient. It is designed to detect data corruption in
transmission, though, so it should be OK only as long as you don't expect
people to try to break your code with deliberately equal checksums.
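
A minimal sketch of such a checksum, assuming the common reflected CRC-32
(polynomial 0xEDB88320, the variant used by zlib and Ethernet). The function
name job_crc32 is made up for illustration. It works byte by byte on
fixed-width unsigned integers, so the same job data should give the same
value on any CPU or OS, provided the bytes themselves (line endings,
whitespace) are identical on every host.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 over a byte buffer (reflected polynomial 0xEDB88320).
 * Only fixed-width unsigned arithmetic is used, so the result does not
 * depend on the host's byte order or word size. */
uint32_t job_crc32(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;          /* standard initial value */

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift out one bit; fold in the polynomial if it was set. */
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;            /* standard final inversion */
}
```

A table-driven version would be faster, but for jobs of at most 5120 bytes
the bitwise loop is probably fast enough.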

Regards,
Luciano Rocha


* Re: Question about checksums
  2003-08-21 13:20 ` Luciano Miguel Ferreira Rocha
@ 2003-08-21 16:36   ` Holger Kiehl
  2003-08-21 17:28     ` Jeff Woods
  2003-08-21 18:19     ` Question about checksums Luciano Miguel Ferreira Rocha
  0 siblings, 2 replies; 9+ messages in thread
From: Holger Kiehl @ 2003-08-21 16:36 UTC (permalink / raw)
  To: Luciano Miguel Ferreira Rocha; +Cc: linux-c-programming

On Thu, 21 Aug 2003, Luciano Miguel Ferreira Rocha wrote:

> On Thu, Aug 21, 2003 at 12:48:05PM +0000, Holger Kiehl wrote:
> > Hello
> > 
> > Let me first explain what I am trying to do. I have a big ASCII
> > configuration file (approx. 500 KB), which I split up into many smaller
> > jobs of approx. 180 bytes each (average; minimum is 50, maximum 5120 bytes).
> > For each job I would like to generate a unique number, so that I can
> > refer to these jobs by their individual numbers.
> > 
> > What is the best way to generate a checksum from each job? I would also
> > like the checksum to always be the same when it is calculated on a
> > different host with a different CPU and OS but from the same job data.
> 
> Why not just use the number of the job?
>
This is what I currently do. However, it has the disadvantage that the
number is increased with each change to the configuration file, so the
job numbers have no direct relationship to the jobs themselves. There
is no way for me to trace a job number back to the job itself.

> Or the offset from the file of the job?
> 
The problem here is that the user can move a job from the beginning of
the configuration to the end, i.e. the jobs themselves can 'flow' around
in the configuration file.

> > I think md5sum could do the job, but I think it is a bit of overkill
> > to generate a 128-bit checksum for such small input data. Also, storing
> > such huge numbers is a bit of a pain. Would a 32- or 64-bit checksum
> > be sufficient, or would I run into problems if these are too
> > short?
> 
> CRC-32 is normally sufficient. It is designed to detect data corruption in
> transmission, though, so it should be OK only as long as you don't expect
> people to try to break your code with deliberately equal checksums.
> 
I am not trying to make anything more secure. Will a CRC-32 be sufficient
to always generate a different sum if a single bit changes within the
maximum 5120 Bytes?

Thanks,
Holger



* Re: Question about checksums
  2003-08-21 16:36   ` Holger Kiehl
@ 2003-08-21 17:28     ` Jeff Woods
  2003-08-22 20:18       ` Holger Kiehl
  2003-08-21 18:19     ` Question about checksums Luciano Miguel Ferreira Rocha
  1 sibling, 1 reply; 9+ messages in thread
From: Jeff Woods @ 2003-08-21 17:28 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: Luciano Miguel Ferreira Rocha, linux-c-programming

At +0000 04:36 PM 8/21/2003, Holger Kiehl wrote:
>On Thu, 21 Aug 2003, Luciano Miguel Ferreira Rocha wrote:
>>On Thu, Aug 21, 2003 at 12:48:05PM +0000, Holger Kiehl wrote:
[snip]
>>>I think md5sum could do the job, but I think it is a bit of overkill to
>>>generate a 128-bit checksum for such small input data. Also, storing such
>>>huge numbers is a bit of a pain. Would a 32- or 64-bit checksum be
>>>sufficient, or would I run into problems if these are too short?
>>
>>CRC-32 is normally sufficient. It is designed to detect data corruption in
>>transmission, though, so it should be OK only as long as you don't expect
>>people to try to break your code with deliberately equal checksums.
>
>I am not trying to make anything more secure. Will a CRC-32 be sufficient 
>to always generate a different sum if a single bit changes within the 
>maximum 5120 Bytes?

In general, X bits of storage can take on 2^X distinct values.  So CRC-32
can take a maximum of approximately four billion possible values.  That's
a number with three commas in US notation; I suppose that's three periods
or spaces on your side of the pond.  A 128-bit value can store
approximately 340 trillion trillion trillion distinct values. That's a
number with *twelve* commas.  And a 5120-byte file has 40960 bits, so it can
have 2^40960, or roughly 10^12330, distinct values.  There will always be
the possibility of duplicate values when you take a checksum of arbitrary
data longer than the checksum length.

You have to make a tradeoff of how much risk you're willing to accept for a 
duplicate based on how that would affect you.  A 32 bit checksum has a 
*minimum* of one in 4 billion chance of two files sharing the same 
checksum.  For most non-security applications, that's ample.  The same 
possibility exists with 128 bit md5 checksums (or any other hash) but the 
larger the checksum the less often you'll get duplicate checksums for 
different data (assuming comparable quality hash algorithms).
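
To put a rough number on that tradeoff, here is a small sketch of the
"birthday" estimate: the chance that at least two out of n jobs share a
checksum. It assumes the checksum behaves like a uniformly random value,
which is only an approximation for CRC-32, and it uses the simple estimate
n(n-1)/2 divided by 2^bits, which is reasonable only while the result is
small. The function name is made up for illustration.

```c
#include <stddef.h>

/* Approximate probability that at least two of n items share a checksum
 * of the given width, assuming uniformly distributed checksums.
 * Uses pairs / 2^bits, which is an upper bound and a good estimate only
 * while the result is much smaller than 1. */
double collision_estimate(size_t n, unsigned bits)
{
    double space = 1.0;
    for (unsigned i = 0; i < bits; i++)
        space *= 2.0;                        /* 2^bits possible values */

    double pairs = (double)n * (double)(n - 1) / 2.0;
    return pairs / space;
}
```

Assuming about 2800 jobs (500 KB divided by 180 bytes, my own rough figure),
collision_estimate(2800, 32) comes out at about 0.0009, i.e. roughly one
configuration in 1100 would contain a colliding pair, while
collision_estimate(2800, 64) is of the order of 10^-13.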

One way to make such use of checksums fail-safe is to use the checksum as
proof that files are different but not as proof that they are the same.
When the checksums match, you don't really know the files are the same
until you compare their actual contents. A chance match happens only about
once in four billion comparisons, so you can probably afford to go and
check when it's really critical to know for certain.
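
A sketch of that fail-safe comparison, with a hypothetical struct job that
carries the job text and a checksum computed over it beforehand. The
checksum is only used to reject quickly; a match is confirmed by comparing
the bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A job's text together with a checksum computed over it beforehand.
 * (Illustrative layout, not taken from the original program.) */
struct job {
    const unsigned char *text;
    size_t len;
    uint32_t checksum;
};

/* Returns 1 if both jobs hold identical text, 0 otherwise. */
int jobs_identical(const struct job *a, const struct job *b)
{
    if (a->checksum != b->checksum)
        return 0;       /* different checksums: the data must differ */
    if (a->len != b->len)
        return 0;
    /* Equal checksums prove nothing, so compare the contents. */
    return memcmp(a->text, b->text, a->len) == 0;
}
```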

--
Jeff Woods <kazrak+kernel@cesmail.net> 




* Re: Question about checksums
  2003-08-21 16:36   ` Holger Kiehl
  2003-08-21 17:28     ` Jeff Woods
@ 2003-08-21 18:19     ` Luciano Miguel Ferreira Rocha
  1 sibling, 0 replies; 9+ messages in thread
From: Luciano Miguel Ferreira Rocha @ 2003-08-21 18:19 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-c-programming

On Thu, Aug 21, 2003 at 04:36:45PM +0000, Holger Kiehl wrote:
> On Thu, 21 Aug 2003, Luciano Miguel Ferreira Rocha wrote:
> 
> > On Thu, Aug 21, 2003 at 12:48:05PM +0000, Holger Kiehl wrote:
> > > Hello
> > > 
> > > Let me first explain what I am trying to do. I have a big ASCII
> > > configuration file (approx. 500 KB), which I split up into many smaller
> > > jobs of approx. 180 bytes each (average; minimum is 50, maximum 5120 bytes).
> > > For each job I would like to generate a unique number, so that I can
> > > refer to these jobs by their individual numbers.
> > > 
> > > What is the best way to generate a checksum from each job? I would also
> > > like the checksum to always be the same when it is calculated on a
> > > different host with a different CPU and OS but from the same job data.
> > 
> > Why not just use the number of the job?
> >
> This is what I currently do. However, it has the disadvantage that the
> number is increased with each change to the configuration file, so the
> job numbers have no direct relationship to the jobs themselves. There
> is no way for me to trace a job number back to the job itself.

Is there no other unique information that you can use? Or that you can
add? Like time of submission?

> > > I think md5sum could do the job, but I think it is a bit of overkill
> > > to generate a 128-bit checksum for such small input data. Also, storing
> > > such huge numbers is a bit of a pain. Would a 32- or 64-bit checksum
> > > be sufficient, or would I run into problems if these are too
> > > short?
> > 
> > CRC-32 is normally sufficient. It is designed to detect data corruption in
> > transmission, though, so it should be OK only as long as you don't expect
> > people to try to break your code with deliberately equal checksums.
> > 
> I am not trying to make anything more secure. Will a CRC-32 be sufficient
> to always generate a different sum if a single bit changes within the
> maximum 5120 Bytes?

Well, a change to a single bit is always detected. But changes to several
bits at certain places may cancel each other out.

For a little more certainty, you could use two different algorithms. If
some changes end up cancelling each other out under one algorithm, they
should show up in the other.
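
One way this could look, as a sketch only: compute CRC-32 and a second,
unrelated 32-bit hash over the same bytes and store both in one 64-bit
number. The choice of FNV-1a as the second algorithm and the name job_id64
are my assumptions, not something from this thread, and two 32-bit checks
glued together are not necessarily as good as a purpose-built 64-bit hash.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32, reflected polynomial 0xEDB88320. */
uint32_t crc32_bitwise(const unsigned char *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;
}

/* 32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime. */
uint32_t fnv1a_32(const unsigned char *p, size_t n)
{
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;                  /* FNV prime, wraps mod 2^32 */
    }
    return h;
}

/* CRC-32 in the upper 32 bits, FNV-1a in the lower 32 bits. */
uint64_t job_id64(const unsigned char *p, size_t n)
{
    return ((uint64_t)crc32_bitwise(p, n) << 32) | fnv1a_32(p, n);
}
```

Two jobs then get the same identifier only if they collide under both
algorithms at once.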

Regards,
Luciano Rocha


* Re: Question about checksums
  2003-08-21 17:28     ` Jeff Woods
@ 2003-08-22 20:18       ` Holger Kiehl
  2003-08-23 20:31         ` printf(), aligning fields J.
  0 siblings, 1 reply; 9+ messages in thread
From: Holger Kiehl @ 2003-08-22 20:18 UTC (permalink / raw)
  To: Jeff Woods; +Cc: Luciano Miguel Ferreira Rocha, linux-c-programming

On Thu, 21 Aug 2003, Jeff Woods wrote:

> At +0000 04:36 PM 8/21/2003, Holger Kiehl wrote:
> >On Thu, 21 Aug 2003, Luciano Miguel Ferreira Rocha wrote:
> >>On Thu, Aug 21, 2003 at 12:48:05PM +0000, Holger Kiehl wrote:
> [snip]
> >>>I think md5sum could do the job, but I think it is a bit of overkill to
> >>>generate a 128-bit checksum for such small input data. Also, storing such
> >>>huge numbers is a bit of a pain. Would a 32- or 64-bit checksum be
> >>>sufficient, or would I run into problems if these are too short?
> >>
> >>CRC-32 is normally sufficient. It is designed to detect data corruption in
> >>transmission, though, so it should be OK only as long as you don't expect
> >>people to try to break your code with deliberately equal checksums.
> >
> >I am not trying to make anything more secure. Will a CRC-32 be sufficient 
> >to always generate a different sum if a single bit changes within the 
> >maximum 5120 Bytes?
> 
> In general, X bits of storage can take on 2^X distinct values.  So CRC-32
> can take a maximum of approximately four billion possible values.  That's
> a number with three commas in US notation; I suppose that's three periods
> or spaces on your side of the pond.  A 128-bit value can store
> approximately 340 trillion trillion trillion distinct values. That's a
> number with *twelve* commas.  And a 5120-byte file has 40960 bits, so it can
> have 2^40960, or roughly 10^12330, distinct values.  There will always be
> the possibility of duplicate values when you take a checksum of arbitrary
> data longer than the checksum length.
> 
> You have to make a tradeoff of how much risk you're willing to accept for a 
> duplicate based on how that would affect you.  A 32 bit checksum has a 
> *minimum* of one in 4 billion chance of two files sharing the same 
> checksum.  For most non-security applications, that's ample.  The same 
> possibility exists with 128 bit md5 checksums (or any other hash) but the 
> larger the checksum the less often you'll get duplicate checksums for 
> different data (assuming comparable quality hash algorithms).
> 
> One way to make such use of checksums fail-safe is to use the checksum as
> proof that files are different but not as proof that they are the same.
> When the checksums match, you don't really know the files are the same
> until you compare their actual contents. A chance match happens only about
> once in four billion comparisons, so you can probably afford to go and
> check when it's really critical to know for certain.
> 
Thanks for the very good explanation! I will try one of the CRC-32 checksums.
I always check for duplicate entries in any case, so I will notice
when two or more jobs end up with the same checksum.

Thanks,
Holger



* printf(), aligning fields
  2003-08-22 20:18       ` Holger Kiehl
@ 2003-08-23 20:31         ` J.
  2003-08-24  0:07           ` Glynn Clements
  0 siblings, 1 reply; 9+ messages in thread
From: J. @ 2003-08-23 20:31 UTC (permalink / raw)
  To: linux-c-programming

Hello,

I am trying to print columns from a C program, as the
following example illustrates:

  Total  Number   Folder
  -----  ------   ------
  15502   17      cronlog
  189897  42      linux/debian-curiosa
  161751  32      linux/debian-firewall
   4305   1       linux/debian-general
  17431   1       linux/debian-news
  107517  1       linux/vger-kernel-announce
  61136   16      linux/vger-kernel-c-programming
   8580   2       linux/vger-kernel-gcc

How can I align the output to the right edge of each field without
overrunning the field boundary if a field value gets longer?

Do I really have to check the length of each value before printing it with
%*s, and then determine how many spaces I need?

Thnkx...

JM



* Re: printf(), aligning fields
  2003-08-23 20:31         ` printf(), aligning fields J.
@ 2003-08-24  0:07           ` Glynn Clements
  2003-08-24  1:05             ` Stephen Satchell
  0 siblings, 1 reply; 9+ messages in thread
From: Glynn Clements @ 2003-08-24  0:07 UTC (permalink / raw)
  To: J.; +Cc: linux-c-programming


J. wrote:

> I am trying to print columns from a C program, as the
> following example illustrates:
> 
>   Total  Number   Folder
>   -----  ------   ------
>   15502   17      cronlog
>   189897  42      linux/debian-curiosa
>   161751  32      linux/debian-firewall
>    4305   1       linux/debian-general
>   17431   1       linux/debian-news
>   107517  1       linux/vger-kernel-announce
>   61136   16      linux/vger-kernel-c-programming
>    8580   2       linux/vger-kernel-gcc
> 
> How can I align the output to the right edge of each field without
> overrunning the field boundary if a field value gets longer?
> 
> Do I really have to check the length of each value before printing it with
> %*s, and then determine how many spaces I need?

You have to make the field large enough to hold all of the values. The
*printf() family of functions doesn't provide any way to truncate a
numeric field to a maximum width (a precision, as in %.10s, truncates
only strings).
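
A sketch of the two-pass approach this implies: first find the widest value
in each column, then pass that width to printf() through the * field-width
specifier. The helper names are made up for illustration; the length
measurement relies on C99 snprintf(), which returns the number of characters
it would have written when given a NULL buffer of size 0.

```c
#include <stddef.h>
#include <stdio.h>

/* Width in characters of the widest value when printed with %ld. */
int widest_field(const long *values, size_t n)
{
    int width = 0;
    for (size_t i = 0; i < n; i++) {
        /* NULL buffer, size 0: nothing is written, only the length
         * that would be needed is returned. */
        int len = snprintf(NULL, 0, "%ld", values[i]);
        if (len > width)
            width = len;
    }
    return width;
}

/* Prints one row with both numbers right-aligned in their columns. */
void print_row(int total_w, int number_w,
               long total, long number, const char *folder)
{
    printf("%*ld  %*ld  %s\n", total_w, total, number_w, number, folder);
}
```

In practice the column header ("Total", "Number") would also count towards
the width, so the larger of the header length and widest_field() is probably
what should be passed to print_row().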

-- 
Glynn Clements <glynn.clements@virgin.net>


* Re: printf(), aligning fields
  2003-08-24  0:07           ` Glynn Clements
@ 2003-08-24  1:05             ` Stephen Satchell
  0 siblings, 0 replies; 9+ messages in thread
From: Stephen Satchell @ 2003-08-24  1:05 UTC (permalink / raw)
  To: Glynn Clements, J.; +Cc: linux-c-programming

At 01:07 AM 8/24/2003 +0100, Glynn Clements wrote:
> > I am trying to print columns from a C program, as the
> > following example illustrates:
> >
> >   Total  Number   Folder
> >   -----  ------   ------
> >   15502   17      cronlog
> >   189897  42      linux/debian-curiosa
> >   161751  32      linux/debian-firewall
> >    4305   1       linux/debian-general
> >   17431   1       linux/debian-news
> >   107517  1       linux/vger-kernel-announce
> >   61136   16      linux/vger-kernel-c-programming
> >    8580   2       linux/vger-kernel-gcc
> >
> > How can I align the output to the right edge of each field without
> > overrunning the field boundary if a field value gets longer?
> >
> > Do I really have to check the length of each value before printing it with
> > %*s, and then determine how many spaces I need?
>
>You have to make the field large enough to hold all of the values. The
>*printf() family of functions doesn't provide any way to truncate a
>numeric field to a maximum width (a precision, as in %.10s, truncates
>only strings).

No, but a program can truncate numbers to the left.

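/* The modulo keeps only the low-order digits, so each number fits its field;
   the space after %7d keeps a 3-digit second column from running into it. */
printf("%7d %3d %.64s\n", total % 10000000, number % 1000, folder);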

Your other option is to do a conversion to string, and fill the string with 
an "overflow indicator" when the number to be displayed is too large.  The 
advantage of the second method is you can insert comma (or dot, for Europe) 
delimiters to make the numbers easier for humans to read.
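
A sketch of that second method, using a hypothetical helper format_field().
It converts the number to a string first and fills the field with '*' when
the value does not fit; the asterisk as overflow indicator is my choice, not
something specified above.

```c
#include <stdio.h>
#include <string.h>

/* Writes value right-aligned into exactly `width` characters of out,
 * followed by a terminating NUL. If the value needs more than `width`
 * characters, the field is filled with '*' instead.
 * Assumes width >= 0 and that out has room for width + 1 bytes. */
void format_field(char *out, int width, long value)
{
    char tmp[32];                        /* ample for a 64-bit long */
    int len = snprintf(tmp, sizeof tmp, "%ld", value);

    if (len < 0 || len > width) {
        memset(out, '*', (size_t)width); /* overflow indicator */
        out[width] = '\0';
    } else {
        /* Pad on the left with spaces up to the field width. */
        snprintf(out, (size_t)width + 1, "%*s", width, tmp);
    }
}
```

For example, format_field(buf, 5, 42) leaves "   42" in buf, while
format_field(buf, 5, 1234567) leaves "*****". The intermediate string is
also the place where thousands separators could be inserted.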


--
X -> unknown; Spurt -> drip of water under pressure
Expert -> X-Spurt -> Unknown drip under pressure.

