All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric <eric@cisu.net>
To: Glynn Clements <glynn.clements@virgin.net>,
	linux-c-programming@vger.kernel.org
Subject: Re: Beginning programmer + simple program
Date: Sun, 18 Jan 2004 00:00:26 -0600	[thread overview]
Message-ID: <200401180000.26606.eric@cisu.net> (raw)
In-Reply-To: <16393.61626.659377.266758@cerise.nosuchdomain.co.uk>

On Saturday 17 January 2004 08:34 pm, Glynn Clements wrote:
> Eric wrote:
> >       I have recently written a program designed to be used in the qmail
> > mail queue in conjunction with fighting spam. Basically my program pipes
> > STDIN to STDOUT, but in the process checks to see if a string is
> > contained in STDIN. If a string is contained in STDIN it will return 1,
> > else 0. This is important because I will be using the return value to
> > decide what to do with a mail message. It is used in conjunction with a
> > message already scanned by spamassasin. (See spamflag variable)
> >
> >       As a begginning programmer I would like some honest comments on the
> > functionality of this program and its flaws/strengths. I thought very
> > much about possible error conditions, and I tried very hard to not abort
> > or quit without trying to pass the message on to stdout, even at the
> > expense of not checking it anymore. I would like this program to be
> > reliable above all else as this will be implemented on a site-wise basis.
> >
> >               The program will only be manipulating character data, I
> > realize it will probably truncate if given binary data, however I am not
> > worried as even files are sent as MIME characters (right?)
>
> The code is 8-bit clean; it won't have any problems with binary data.
 My guess was with the EOF detection and the problems you can encounter with 
binary data.
> >       This program is pretty fast. It parsed a 2.3MB file in about a
> > second. The implementation should be pretty close to O(1) , probably
> > slightly more. Since it parsed this pretty fast with very low overhead, I
> > am not worried about speed, just correctness.
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #define EXIT_NOMATCH 0
> > #define EXIT_MATCH 1
> > #define BUFFERSIZE 65535
> > int main();
> > inline int dump_message(char *message);
> > inline int dump_full_message();
> >
> > int main()
> > {
> >   //Make a buffer for holding the message and related position pointers
> >   char *message, *msgptr, *spamptr;
> >   //What we are checking for. must be EXACT
> >   char *spamflag = "X-Spam-Flag: YES";
>
> This should be:
>
> 	const char *spamflag = "X-Spam-Flag: YES";
> or:
> 	const char spamflag[] = "X-Spam-Flag: YES";
>
> if you don't need to modify it, or:
>
> 	char spamflag[] = "X-Spam-Flag: YES";
>
> if you do.
>
> By default, gcc stores string literals in the .rodata segment, so
> pointers to string literals should be declared "const char *".
 Thank you, thats something I didn't know.
> >   int ret = 0;
> >   int exit_status = 0;
> >   exit_status= EXIT_NOMATCH;
> >
> >   //Set up the buffer. Just do a 1-1 copy if we cant allocate memory
> >   if ( (message = (char *)malloc(sizeof(char)*BUFFERSIZE)) == NULL )
>
> sizeof(char) is always 1, so it can be omitted. Also, malloc() returns
> "void*", which doesn't require an explicit cast.
>
> Also, I don't see any point in using malloc() rather than just an
> array declaration, i.e.
 True...this was leftover from when I thought I would need to malloc and 
resize teh buffer as I kept reading in data, however I realized I only needed 
to "remember" one or two characters at a time to search for a sting match.
> 	char message[BUFFERSIZE];
>
> >     dump_full_message(exit_status);
> >   //Set up ptr positions
> >   msgptr = message;
> >   spamptr = spamflag;
> >   //Null terminate the buffer.
> >   msgptr += BUFFERSIZE;
> >   *msgptr = '\0';
>
> This is writing outside the allocated block. If you allocate
> BUFFERSIZE bytes, valid indices are 0 to BUFFERSIZE-1 inclusive.
 That did cross my mind at one point. Thanks, im not sure why I didn't catch 
that one. Like I said....beginning programmer.
> >   msgptr = message;
> >   //Start copying from stdin
> >   do{
> >     //We have a match!
> >     if ( *spamptr == '\0' ){
> >       exit_status = EXIT_MATCH;
> >       *msgptr='\0';
> >       dump_message(message);
> >       dump_full_message(exit_status);
> >     }
> >     ret = read(STDIN_FILENO, msgptr, 1);
>
> Reading individual bytes with read() will seriously reduce
> performance. Better to use getc() or, better still, getc_unlocked()
> (which is a macro).
I did wonder that. By chance what does getc_unlocked do? What are the benifits 
over getc? What's its prototype? I wasnt too concerned over speed, it 
provided enough performance for my tastes.
> Either that, or read() large blocks.
>
> Apart from that, I find your algorithm is quite hard to read. Here's
> how I would have written it:
 Begginng programmer again :). I appreciate all your comments.
> 	static int writen(int fd, const char *buf, int count)
> 	{
> 		while (count > 0)
> 		{
> 			int n = write(fd, buf, count);
> 			if (n < 0)
> 				return -1;
>
> 			buf += n;
> 			count -= n;
> 		}
>
> 		return 0;
> 	}
>
>
> 	static int scan_message(const char *spamflag)
> 	{
> 		const char *spamptr = spamflag;
> 		int found = 0;
>
> 		for (;;)
> 		{
> 			char buf[BUFFERSIZE];
> 			int count = read(STDIN_FILENO, buf, sizeof(buf));
> 			int i;
>
> 			if (count < 0)
> 			{
> 				perror("read");
> 				exit(1);
> 			}
>
> 			if (!count)
> 				break;
>
> 			if (!found)
> 				for (i = 0; i < count; i++)
> 				{
> 					if (buf[i] == *spamptr)
> 						spamptr++;
> 					else
> 						spamptr = spamflag;
>
> 					if (*spamptr == '\0')
> 					{
> 						found = 1;
> 						break;
> 					}
> 				}
>
> 			if (writen(STDOUT_FILENO, buf, count) < 0)
> 			{
> 				perror("writen");
> 				exit(1);
> 			}
> 		}
>
> 		return found;
> 	}
>
> 	int main(void)
> 	{
> 		return scan_message("X-Spam-Flag: YES");
> 	}
>
> In most cases, if you can't understand the code without comments, you
> should re-write the code.
>
> A few other points:
>
> 1. There is a bug in the algorithm: it won't find the search string if
> it is prefixed by a prefix of the search string, e.g. it won't find
> "X-X-Spam-Flag: YES". When a non-matching character is found, it
> resets spamptr to the beginning of spamflag and *doesn't* check for a
> match. To fix it, change:
>
> 	if (buf[i] == *spamptr)
> 		spamptr++;
> 	else
> 		spamptr = spamflag;
>
> to:
> 	if (buf[i] != *spamptr)
> 		spamptr = spamflag;
> 	if (buf[i] == *spamptr)
> 		spamptr++;
>
>
> 2. I'm guessing that you only really want to find that string if it
> occurs at the beginning of a line, and probably only within the
> headers and not the message body. You can fudge the first issue by
> searching for "\nX-Spam-Flag: YES". The second issue involves keeping
> track of whether you are reading the headers or the body, which
> relates to point 3.
>
> 3. As soon as the task becomes more complex than searching for a
> single fixed string at any point in the stream, you would be better
> off using a finite automaton (i.e. using (f)lex rather than coding
> directly in C).
True true. The only reason I wrote this is because I couldnt find a decent 
solution so provide me an exit status on a match without some horrible kludgy 
bash scripting. I only know a few languages so my options are limited :) 
Besides, it was a great learning program and I appreciate all your imput very 
much. I will re-write my program and see how it does.

-------------------------
Eric Bambach
Eric at cisu dot net
-------------------------

  reply	other threads:[~2004-01-18  6:00 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-01-18  0:52 Beginning programmer + simple program Eric
2004-01-18  2:34 ` Glynn Clements
2004-01-18  6:00   ` Eric [this message]
2004-01-18  6:54     ` Eric
2004-01-19  6:36       ` Glynn Clements
2004-01-19  6:02     ` Glynn Clements

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200401180000.26606.eric@cisu.net \
    --to=eric@cisu.net \
    --cc=glynn.clements@virgin.net \
    --cc=linux-c-programming@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.