Beginning programmer + simple program

linux-c-programming.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Beginning programmer + simple program
@ 2004-01-18  0:52 Eric
  2004-01-18  2:34 ` Glynn Clements
  0 siblings, 1 reply; 6+ messages in thread
From: Eric @ 2004-01-18  0:52 UTC (permalink / raw)
  To: linux-c-programming

Hello,
	I have recently written a program designed to be used in the qmail mail queue 
in conjunction with fighting spam. Basically my program pipes STDIN to 
STDOUT, but in the process checks to see if a string is contained in STDIN. 
If a string is contained in STDIN it will return 1, else 0. This is important 
because I will be using the return value to decide what to do with a mail 
message. It is used in conjunction with a message already scanned by 
spamassasin. (See spamflag variable)
 	As a begginning programmer I would like some honest comments on the 
functionality of this program and its flaws/strengths. I thought very much 
about possible error conditions, and I tried very hard to not abort or quit 
without trying to pass the message on to stdout, even at the expense of not 
checking it anymore. I would like this program to be reliable above all else 
as this will be implemented on a site-wise basis. 
		The program will only be manipulating character data, I realize it will 
probably truncate if given binary data, however I am not worried as even 
files are sent as MIME characters (right?)
	This program is pretty fast. It parsed a 2.3MB file in about a second. The 
implementation should be pretty close to O(1) , probably slightly more. Since 
it parsed this pretty fast with very low overhead, I am not worried about 
speed, just correctness.

*Legal disclaimer*
This program is copyright ME. No distribution or modification unless otherwise 
specified by me, Eric Bambach. No guarantees it wont erase your hard drive, 
drink your beer, or kick your cat. Read it over before you compile it. 
*Legal disclaimer*

-------------Start of File------------

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define EXIT_NOMATCH 0
#define EXIT_MATCH 1
#define BUFFERSIZE 65535
int main();
inline int dump_message(char *message);
inline int dump_full_message();

int main()
{
  //Make a buffer for holding the message and related position pointers
  char *message, *msgptr, *spamptr;
  //What we are checking for. must be EXACT
  char *spamflag = "X-Spam-Flag: YES";
  int ret = 0;
  int exit_status = 0;
  exit_status= EXIT_NOMATCH;
  //Set up the buffer. Just do a 1-1 copy if we cant allocate memory
  if ( (message = (char *)malloc(sizeof(char)*BUFFERSIZE)) == NULL )
    dump_full_message(exit_status);
  //Set up ptr positions
  msgptr = message;
  spamptr = spamflag;
  //Null terminate the buffer.
  msgptr += BUFFERSIZE;
  *msgptr = '\0';
  msgptr = message;
  //Start copying from stdin
  do{
    //We have a match!
    if ( *spamptr == '\0' ){
      exit_status = EXIT_MATCH;
      *msgptr='\0';
      dump_message(message);
      dump_full_message(exit_status);
    }
    ret = read(STDIN_FILENO, msgptr, 1);
    //Read has returned an error....dump what we have.
    if (ret == -1){
      dump_message(message);
    }
    if (*spamptr == *msgptr)
      spamptr++;
    else
       //Start checking from the beginning.
      spamptr = spamflag;
     //Avoid an off by one error when ret = 0 and we havent actually read a 
character.
    if (ret){
      msgptr++;
    }
     //Flush the buffer so we dont overrun it. Next start filling the buffer 
from the beggining again.
    if (*msgptr == '\0'){
      dump_message(message);
      msgptr = message;
    }
  }while (ret != 0);

  //Terminate whatever is left in the buffer, it might not be full then flush 
whatever is remaining
  *msgptr = '\0';
  dump_message(message); 
  return exit_status;
}

//Dumps the current buffer to stdout in case of problems.
//Even if we run out of memory we should still be able to dump what we have.
// Also acts as a buffer flush.
inline int dump_message(char *message){
  char *msgptr;
  msgptr = message;
  while (*msgptr != '\0'){
    putchar(*msgptr);
   msgptr++;
  }
  return 0;
}

//Just do a 1-1 copy to stdout.
inline int dump_full_message(int exit_status){
  char c;
  while ((c = getchar()) != EOF)
      putchar(c);
  exit (exit_status);
}

---------EOF---------------

-------------------------
Eric Bambach
Eric at cisu dot net
-------------------------

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Beginning programmer + simple program
  2004-01-18  0:52 Beginning programmer + simple program Eric
@ 2004-01-18  2:34 ` Glynn Clements
  2004-01-18  6:00   ` Eric
  0 siblings, 1 reply; 6+ messages in thread
From: Glynn Clements @ 2004-01-18  2:34 UTC (permalink / raw)
  To: Eric; +Cc: linux-c-programming


Eric wrote:

>       I have recently written a program designed to be used in the qmail mail queue 
> in conjunction with fighting spam. Basically my program pipes STDIN to 
> STDOUT, but in the process checks to see if a string is contained in STDIN. 
> If a string is contained in STDIN it will return 1, else 0. This is important 
> because I will be using the return value to decide what to do with a mail 
> message. It is used in conjunction with a message already scanned by 
> spamassasin. (See spamflag variable)

>       As a begginning programmer I would like some honest comments on the 
> functionality of this program and its flaws/strengths. I thought very much 
> about possible error conditions, and I tried very hard to not abort or quit 
> without trying to pass the message on to stdout, even at the expense of not 
> checking it anymore. I would like this program to be reliable above all else 
> as this will be implemented on a site-wise basis. 

>               The program will only be manipulating character data, I realize it will 
> probably truncate if given binary data, however I am not worried as even 
> files are sent as MIME characters (right?)

The code is 8-bit clean; it won't have any problems with binary data.

>       This program is pretty fast. It parsed a 2.3MB file in about a second. The 
> implementation should be pretty close to O(1) , probably slightly more. Since 
> it parsed this pretty fast with very low overhead, I am not worried about 
> speed, just correctness.

> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #define EXIT_NOMATCH 0
> #define EXIT_MATCH 1
> #define BUFFERSIZE 65535
> int main();
> inline int dump_message(char *message);
> inline int dump_full_message();
> 
> int main()
> {
>   //Make a buffer for holding the message and related position pointers
>   char *message, *msgptr, *spamptr;
>   //What we are checking for. must be EXACT
>   char *spamflag = "X-Spam-Flag: YES";

This should be:

	const char *spamflag = "X-Spam-Flag: YES";
or:
	const char spamflag[] = "X-Spam-Flag: YES";

if you don't need to modify it, or:

	char spamflag[] = "X-Spam-Flag: YES";

if you do.

By default, gcc stores string literals in the .rodata segment, so
pointers to string literals should be declared "const char *".

>   int ret = 0;
>   int exit_status = 0;
>   exit_status= EXIT_NOMATCH;

>   //Set up the buffer. Just do a 1-1 copy if we cant allocate memory
>   if ( (message = (char *)malloc(sizeof(char)*BUFFERSIZE)) == NULL )

sizeof(char) is always 1, so it can be omitted. Also, malloc() returns
"void*", which doesn't require an explicit cast.

Also, I don't see any point in using malloc() rather than just an
array declaration, i.e.

	char message[BUFFERSIZE];

>     dump_full_message(exit_status);
>   //Set up ptr positions
>   msgptr = message;
>   spamptr = spamflag;
>   //Null terminate the buffer.
>   msgptr += BUFFERSIZE;
>   *msgptr = '\0';

This is writing outside the allocated block. If you allocate
BUFFERSIZE bytes, valid indices are 0 to BUFFERSIZE-1 inclusive.

>   msgptr = message;
>   //Start copying from stdin
>   do{
>     //We have a match!
>     if ( *spamptr == '\0' ){
>       exit_status = EXIT_MATCH;
>       *msgptr='\0';
>       dump_message(message);
>       dump_full_message(exit_status);
>     }
>     ret = read(STDIN_FILENO, msgptr, 1);

Reading individual bytes with read() will seriously reduce
performance. Better to use getc() or, better still, getc_unlocked()
(which is a macro).

Either that, or read() large blocks.

Apart from that, I find your algorithm is quite hard to read. Here's
how I would have written it:

	static int writen(int fd, const char *buf, int count)
	{
		while (count > 0)
		{
			int n = write(fd, buf, count);
			if (n < 0)
				return -1;

			buf += n;
			count -= n;
		}

		return 0;
	}


	static int scan_message(const char *spamflag)
	{
		const char *spamptr = spamflag;
		int found = 0;

		for (;;)
		{
			char buf[BUFFERSIZE];
			int count = read(STDIN_FILENO, buf, sizeof(buf));
			int i;
	
			if (count < 0)
			{
				perror("read");
				exit(1);
			}
	
			if (!count)
				break;
	
			if (!found)
				for (i = 0; i < count; i++)
				{
					if (buf[i] == *spamptr)
						spamptr++;
					else
						spamptr = spamflag;
	
					if (*spamptr == '\0')
					{
						found = 1;
						break;
					}
				}
	
			if (writen(STDOUT_FILENO, buf, count) < 0)
			{
				perror("writen");
				exit(1);
			}
		}
	
		return found;
	}

	int main(void)
	{
		return scan_message("X-Spam-Flag: YES");
	}

In most cases, if you can't understand the code without comments, you
should re-write the code.

A few other points:

1. There is a bug in the algorithm: it won't find the search string if
it is prefixed by a prefix of the search string, e.g. it won't find
"X-X-Spam-Flag: YES". When a non-matching character is found, it
resets spamptr to the beginning of spamflag and *doesn't* check for a
match. To fix it, change:

	if (buf[i] == *spamptr)
		spamptr++;
	else
		spamptr = spamflag;

to:
	if (buf[i] != *spamptr)
		spamptr = spamflag;
	if (buf[i] == *spamptr)
		spamptr++;


2. I'm guessing that you only really want to find that string if it
occurs at the beginning of a line, and probably only within the
headers and not the message body. You can fudge the first issue by
searching for "\nX-Spam-Flag: YES". The second issue involves keeping
track of whether you are reading the headers or the body, which
relates to point 3.

3. As soon as the task becomes more complex than searching for a
single fixed string at any point in the stream, you would be better
off using a finite automaton (i.e. using (f)lex rather than coding
directly in C).

-- 
Glynn Clements <glynn.clements@virgin.net>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Beginning programmer + simple program
  2004-01-18  2:34 ` Glynn Clements
@ 2004-01-18  6:00   ` Eric
  2004-01-18  6:54     ` Eric
  2004-01-19  6:02     ` Glynn Clements
  0 siblings, 2 replies; 6+ messages in thread
From: Eric @ 2004-01-18  6:00 UTC (permalink / raw)
  To: Glynn Clements, linux-c-programming

On Saturday 17 January 2004 08:34 pm, Glynn Clements wrote:
> Eric wrote:
> >       I have recently written a program designed to be used in the qmail
> > mail queue in conjunction with fighting spam. Basically my program pipes
> > STDIN to STDOUT, but in the process checks to see if a string is
> > contained in STDIN. If a string is contained in STDIN it will return 1,
> > else 0. This is important because I will be using the return value to
> > decide what to do with a mail message. It is used in conjunction with a
> > message already scanned by spamassasin. (See spamflag variable)
> >
> >       As a begginning programmer I would like some honest comments on the
> > functionality of this program and its flaws/strengths. I thought very
> > much about possible error conditions, and I tried very hard to not abort
> > or quit without trying to pass the message on to stdout, even at the
> > expense of not checking it anymore. I would like this program to be
> > reliable above all else as this will be implemented on a site-wise basis.
> >
> >               The program will only be manipulating character data, I
> > realize it will probably truncate if given binary data, however I am not
> > worried as even files are sent as MIME characters (right?)
>
> The code is 8-bit clean; it won't have any problems with binary data.
 My guess was with the EOF detection and the problems you can encounter with 
binary data.
> >       This program is pretty fast. It parsed a 2.3MB file in about a
> > second. The implementation should be pretty close to O(1) , probably
> > slightly more. Since it parsed this pretty fast with very low overhead, I
> > am not worried about speed, just correctness.
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #define EXIT_NOMATCH 0
> > #define EXIT_MATCH 1
> > #define BUFFERSIZE 65535
> > int main();
> > inline int dump_message(char *message);
> > inline int dump_full_message();
> >
> > int main()
> > {
> >   //Make a buffer for holding the message and related position pointers
> >   char *message, *msgptr, *spamptr;
> >   //What we are checking for. must be EXACT
> >   char *spamflag = "X-Spam-Flag: YES";
>
> This should be:
>
> 	const char *spamflag = "X-Spam-Flag: YES";
> or:
> 	const char spamflag[] = "X-Spam-Flag: YES";
>
> if you don't need to modify it, or:
>
> 	char spamflag[] = "X-Spam-Flag: YES";
>
> if you do.
>
> By default, gcc stores string literals in the .rodata segment, so
> pointers to string literals should be declared "const char *".
 Thank you, thats something I didn't know.
> >   int ret = 0;
> >   int exit_status = 0;
> >   exit_status= EXIT_NOMATCH;
> >
> >   //Set up the buffer. Just do a 1-1 copy if we cant allocate memory
> >   if ( (message = (char *)malloc(sizeof(char)*BUFFERSIZE)) == NULL )
>
> sizeof(char) is always 1, so it can be omitted. Also, malloc() returns
> "void*", which doesn't require an explicit cast.
>
> Also, I don't see any point in using malloc() rather than just an
> array declaration, i.e.
 True...this was leftover from when I thought I would need to malloc and 
resize teh buffer as I kept reading in data, however I realized I only needed 
to "remember" one or two characters at a time to search for a sting match.
> 	char message[BUFFERSIZE];
>
> >     dump_full_message(exit_status);
> >   //Set up ptr positions
> >   msgptr = message;
> >   spamptr = spamflag;
> >   //Null terminate the buffer.
> >   msgptr += BUFFERSIZE;
> >   *msgptr = '\0';
>
> This is writing outside the allocated block. If you allocate
> BUFFERSIZE bytes, valid indices are 0 to BUFFERSIZE-1 inclusive.
 That did cross my mind at one point. Thanks, im not sure why I didn't catch 
that one. Like I said....beginning programmer.
> >   msgptr = message;
> >   //Start copying from stdin
> >   do{
> >     //We have a match!
> >     if ( *spamptr == '\0' ){
> >       exit_status = EXIT_MATCH;
> >       *msgptr='\0';
> >       dump_message(message);
> >       dump_full_message(exit_status);
> >     }
> >     ret = read(STDIN_FILENO, msgptr, 1);
>
> Reading individual bytes with read() will seriously reduce
> performance. Better to use getc() or, better still, getc_unlocked()
> (which is a macro).
I did wonder that. By chance what does getc_unlocked do? What are the benifits 
over getc? What's its prototype? I wasnt too concerned over speed, it 
provided enough performance for my tastes.
> Either that, or read() large blocks.
>
> Apart from that, I find your algorithm is quite hard to read. Here's
> how I would have written it:
 Begginng programmer again :). I appreciate all your comments.
> 	static int writen(int fd, const char *buf, int count)
> 	{
> 		while (count > 0)
> 		{
> 			int n = write(fd, buf, count);
> 			if (n < 0)
> 				return -1;
>
> 			buf += n;
> 			count -= n;
> 		}
>
> 		return 0;
> 	}
>
>
> 	static int scan_message(const char *spamflag)
> 	{
> 		const char *spamptr = spamflag;
> 		int found = 0;
>
> 		for (;;)
> 		{
> 			char buf[BUFFERSIZE];
> 			int count = read(STDIN_FILENO, buf, sizeof(buf));
> 			int i;
>
> 			if (count < 0)
> 			{
> 				perror("read");
> 				exit(1);
> 			}
>
> 			if (!count)
> 				break;
>
> 			if (!found)
> 				for (i = 0; i < count; i++)
> 				{
> 					if (buf[i] == *spamptr)
> 						spamptr++;
> 					else
> 						spamptr = spamflag;
>
> 					if (*spamptr == '\0')
> 					{
> 						found = 1;
> 						break;
> 					}
> 				}
>
> 			if (writen(STDOUT_FILENO, buf, count) < 0)
> 			{
> 				perror("writen");
> 				exit(1);
> 			}
> 		}
>
> 		return found;
> 	}
>
> 	int main(void)
> 	{
> 		return scan_message("X-Spam-Flag: YES");
> 	}
>
> In most cases, if you can't understand the code without comments, you
> should re-write the code.
>
> A few other points:
>
> 1. There is a bug in the algorithm: it won't find the search string if
> it is prefixed by a prefix of the search string, e.g. it won't find
> "X-X-Spam-Flag: YES". When a non-matching character is found, it
> resets spamptr to the beginning of spamflag and *doesn't* check for a
> match. To fix it, change:
>
> 	if (buf[i] == *spamptr)
> 		spamptr++;
> 	else
> 		spamptr = spamflag;
>
> to:
> 	if (buf[i] != *spamptr)
> 		spamptr = spamflag;
> 	if (buf[i] == *spamptr)
> 		spamptr++;
>
>
> 2. I'm guessing that you only really want to find that string if it
> occurs at the beginning of a line, and probably only within the
> headers and not the message body. You can fudge the first issue by
> searching for "\nX-Spam-Flag: YES". The second issue involves keeping
> track of whether you are reading the headers or the body, which
> relates to point 3.
>
> 3. As soon as the task becomes more complex than searching for a
> single fixed string at any point in the stream, you would be better
> off using a finite automaton (i.e. using (f)lex rather than coding
> directly in C).
True true. The only reason I wrote this is because I couldnt find a decent 
solution so provide me an exit status on a match without some horrible kludgy 
bash scripting. I only know a few languages so my options are limited :) 
Besides, it was a great learning program and I appreciate all your imput very 
much. I will re-write my program and see how it does.

-------------------------
Eric Bambach
Eric at cisu dot net
-------------------------

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Beginning programmer + simple program
  2004-01-18  6:00   ` Eric
@ 2004-01-18  6:54     ` Eric
  2004-01-19  6:36       ` Glynn Clements
  2004-01-19  6:02     ` Glynn Clements
  1 sibling, 1 reply; 6+ messages in thread
From: Eric @ 2004-01-18  6:54 UTC (permalink / raw)
  To: linux-c-programming

On Sunday 18 January 2004 12:00 am, Eric wrote:
> On Saturday 17 January 2004 08:34 pm, Glynn Clements wrote:
> > Eric wrote:
> > >       I have recently written a program designed to be used in the
> > > qmail mail queue in conjunction with fighting spam. Basically my
> > > program pipes STDIN to STDOUT, but in the process checks to see if a
> > > string is contained in STDIN. If a string is contained in STDIN it will
> > > return 1, else 0. This is important because I will be using the return
> > > value to decide what to do with a mail message. It is used in
> > > conjunction with a message already scanned by spamassasin. (See
> > > spamflag variable)
> > >
> > >       As a begginning programmer I would like some honest comments on
> > > the functionality of this program and its flaws/strengths. I thought
> > > very much about possible error conditions, and I tried very hard to not
> > > abort or quit without trying to pass the message on to stdout, even at
> > > the expense of not checking it anymore. I would like this program to be
> > > reliable above all else as this will be implemented on a site-wise
> > > basis.
> > >
> > >               The program will only be manipulating character data, I
> > > realize it will probably truncate if given binary data, however I am
> > > not worried as even files are sent as MIME characters (right?)
> >
> > The code is 8-bit clean; it won't have any problems with binary data.
>
>  My guess was with the EOF detection and the problems you can encounter
> with binary data.
>
> > >       This program is pretty fast. It parsed a 2.3MB file in about a
> > > second. The implementation should be pretty close to O(1) , probably
> > > slightly more. Since it parsed this pretty fast with very low overhead,
> > > I am not worried about speed, just correctness.
> > >

Here is an update. I am quite pleased with the feedback you have given me and 
how it has improved my program. It scanned a 73MB text file in 7 seconds. I 
would say thats even better! Silly read() code. That was stupid to begin 
with.
	Unfortunatly, I don't believe your code will work for me. I dont want to run 
the risk of overflowing the buffer as I believe you might with your read() 
command. I will be getting an unknown amount of data (possibly file 
attachments) and I dont want to allocate a huge(couple MB) buffer for an 
attachement. I'd rather just pass it along in byte for byte as it comes in.
	Does that sound right? Or am I mis-reading your code. I believe you are 
assuming small text files.
	This is MUCH more readable. I've re-thought my approach and realized I don't 
even need a buffer. This has shortened my code considerably.

BTW, is there a good method for 1-1 copy from STDIN to STDOUT?
time cat < largefile > testfile 
gives me .5s
while my program will find the string in the first few lines but still take 
10s to do essentially the same operation. After finding the string it just 
goes to dump_full_message() which I want to act just like cat in this sense. 

-----Beginning of File----------

#include <stdio.h>
//#include <stdlib.h>

#define EXIT_NOMATCH 0
#define EXIT_MATCH 1
#define BUFFERSIZE 65535
int main();
inline int dump_message(char *message);
inline int dump_full_message(int exit_status);

int main()
{
  char c, *spamptr;
  //What we are checking for. must be EXACT. Leave the newline in because it 
protects the offchance that it is in the message body somewhere.
  //This way it will only match if its at the beginning of the line.
  const char *spamflag = "\nX-Spam-Flag: YES";
  int exit_status =EXIT_NOMATCH;
  spamptr = spamflag;
  //Start copying from stdin
  while( (c = getchar()) != EOF){
    //Test it
    if (c != *spamptr)
      spamptr = spamflag;
    if (c == *spamptr)
      spamptr++;
   //We've matched, so proceed to do a 1-1 copy and exit EXIT_MATCH
    if (*spamptr == '\0'){
      exit_status = EXIT_MATCH;
      dump_full_message(exit_status);
    }
    putchar(c);
  }
  dump_full_message(exit_status);
  return exit_status;
}

inline int dump_full_message(int exit_status){
  char c;
  while( (c = getchar()) != EOF){
    putchar(c);
  }
  exit (exit_status);
}

----------EOF-----------
 
-------------------------
Eric Bambach
Eric at cisu dot net
-------------------------

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Beginning programmer + simple program
  2004-01-18  6:00   ` Eric
  2004-01-18  6:54     ` Eric
@ 2004-01-19  6:02     ` Glynn Clements
  1 sibling, 0 replies; 6+ messages in thread
From: Glynn Clements @ 2004-01-19  6:02 UTC (permalink / raw)
  To: Eric; +Cc: linux-c-programming

Eric wrote:

> > Reading individual bytes with read() will seriously reduce
> > performance. Better to use getc() or, better still, getc_unlocked()
> > (which is a macro).
> 
> I did wonder that. By chance what does getc_unlocked do? What are the benifits 
> over getc? What's its prototype? I wasnt too concerned over speed, it 
> provided enough performance for my tastes.

getc_unlocked() is essentially the original libc-5 version of getc(). 
This is a macro which reads a byte directly from the FILE's buffers:

#define _IO_getc_unlocked(_fp) \
       ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end ? __uflow (_fp) \
	: *(unsigned char *) (_fp)->_IO_read_ptr++)

So long as the read buffer isn't empty, it doesn't involve a function
call. As it is a macro, its usage can be optimised, which may make a
significant difference in a tight loop.

Actually, in glibc 2.0.x, getc_unlocked was just a macro which
expanded to _IO_getc_unlocked, in 2.1.x it's an inline function:

	/* This is defined in POSIX.1:1996.  */
	__STDIO_INLINE int
	getc_unlocked (FILE *__fp) __THROW
	{
	  return _IO_getc_unlocked (__fp);
	}

This shouldn't make any difference; the essential point is that the
code gets inlined rather than called as a function, which allows for
more optimisations.

getc() itself was changed to a function in libc-6 (glibc 2.x) to
handle threading, as the FILE structure has to be locked against
concurrent access. So long as you aren't using multiple threads (or,
at least, aren't accessing any given FILE structure from more than one
thread), getc_unlocked() is safe, and will be faster than getc().

-- 
Glynn Clements <glynn.clements@virgin.net>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Beginning programmer + simple program
  2004-01-18  6:54     ` Eric
@ 2004-01-19  6:36       ` Glynn Clements
  0 siblings, 0 replies; 6+ messages in thread
From: Glynn Clements @ 2004-01-19  6:36 UTC (permalink / raw)
  To: Eric; +Cc: linux-c-programming

Eric wrote:

> > > >       This program is pretty fast. It parsed a 2.3MB file in about a
> > > > second. The implementation should be pretty close to O(1) , probably
> > > > slightly more. Since it parsed this pretty fast with very low overhead,
> > > > I am not worried about speed, just correctness.
> > > >
> 
> Here is an update. I am quite pleased with the feedback you have given me and 
> how it has improved my program. It scanned a 73MB text file in 7 seconds. I 
> would say thats even better! Silly read() code. That was stupid to begin 
> with.
> 	Unfortunatly, I don't believe your code will work for me. I dont want to run 
> the risk of overflowing the buffer as I believe you might with your read() 
> command. I will be getting an unknown amount of data (possibly file 
> attachments) and I dont want to allocate a huge(couple MB) buffer for an 
> attachement. I'd rather just pass it along in byte for byte as it comes in.
> 	Does that sound right? Or am I mis-reading your code. I believe you are 
> assuming small text files.

No; I'm using a double-nested loop. The outer loop reads at most
BUFFERSIZE bytes into a buffer on each pass, and repeats until the
entire file has been read. The inner loop processes the bytes which
have been read on that pass.

When using the *_unlocked() macros, there may not be a significant
difference between the two approaches. Without optimisation, the
double-nested loop would be slightly quicker (as the loop test is
simpler), but optimisation may eliminate the difference.

Using the locked I/O functions could incur a substantial performance
hit, as they have to lock/unlock the FILE structure for each
operation. While the cost of obtaining and releasing an uncontested
lock is small in absolute terms, it could be a substantial proportion
of the overall cost for a tight getc/putc loop.

> 	This is MUCH more readable. I've re-thought my approach and realized I don't 
> even need a buffer. This has shortened my code considerably.
> 
> BTW, is there a good method for 1-1 copy from STDIN to STDOUT?
> time cat < largefile > testfile 
> gives me .5s
> while my program will find the string in the first few lines but still take 
> 10s to do essentially the same operation. After finding the string it just 
> goes to dump_full_message() which I want to act just like cat in this sense. 

First, try using the *_unlocked functions. However:

The fastest way to copy data between two files is to mmap() both
source and destination and use memcpy() to copy the data. This only
requires one copy rather than two (read() copies from the kernel's
buffers to application memory, write() copies from application memory
to the kernel's buffers).

However, mmap() only works with files and block devices, and not with
pipes, sockets or character devices. Also, you need to create space in
the destination (with ftruncate()) first.

If the source is a file but the destination isn't, you can mmap() the
source then write() the mmap()ed region to the destination. Similarly,
if the destination is a file but the source isn't, you can mmap() the
destination then read() into it. Again, the data is only copied once.

If neither source nor destination can be mmap()ed, you may be able to
use sendfile(); however, this isn't portable and doesn't work with all
types of streams (IIRC, one of them has to be a socket). Otherwise,
you're stuck with read()/write(), which involves two copies.

-- 
Glynn Clements <glynn.clements@virgin.net>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2004-01-19  6:36 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-01-18  0:52 Beginning programmer + simple program Eric
2004-01-18  2:34 ` Glynn Clements
2004-01-18  6:00   ` Eric
2004-01-18  6:54     ` Eric
2004-01-19  6:36       ` Glynn Clements
2004-01-19  6:02     ` Glynn Clements

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).