linux-c-programming.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Eric <eric@cisu.net>
To: Glynn Clements <glynn.clements@virgin.net>,
	linux-c-programming@vger.kernel.org
Subject: Re: Beginning programmer + simple program
Date: Sun, 18 Jan 2004 00:00:26 -0600	[thread overview]
Message-ID: <200401180000.26606.eric@cisu.net> (raw)
In-Reply-To: <16393.61626.659377.266758@cerise.nosuchdomain.co.uk>

On Saturday 17 January 2004 08:34 pm, Glynn Clements wrote:
> Eric wrote:
> >       I have recently written a program designed to be used in the qmail
> > mail queue in conjunction with fighting spam. Basically my program pipes
> > STDIN to STDOUT, but in the process checks to see if a string is
> > contained in STDIN. If a string is contained in STDIN it will return 1,
> > else 0. This is important because I will be using the return value to
> > decide what to do with a mail message. It is used in conjunction with a
> > message already scanned by spamassasin. (See spamflag variable)
> >
> >       As a begginning programmer I would like some honest comments on the
> > functionality of this program and its flaws/strengths. I thought very
> > much about possible error conditions, and I tried very hard to not abort
> > or quit without trying to pass the message on to stdout, even at the
> > expense of not checking it anymore. I would like this program to be
> > reliable above all else as this will be implemented on a site-wise basis.
> >
> >               The program will only be manipulating character data, I
> > realize it will probably truncate if given binary data, however I am not
> > worried as even files are sent as MIME characters (right?)
>
> The code is 8-bit clean; it won't have any problems with binary data.
 My guess was with the EOF detection and the problems you can encounter with 
binary data.
> >       This program is pretty fast. It parsed a 2.3MB file in about a
> > second. The implementation should be pretty close to O(1) , probably
> > slightly more. Since it parsed this pretty fast with very low overhead, I
> > am not worried about speed, just correctness.
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #define EXIT_NOMATCH 0
> > #define EXIT_MATCH 1
> > #define BUFFERSIZE 65535
> > int main();
> > inline int dump_message(char *message);
> > inline int dump_full_message();
> >
> > int main()
> > {
> >   //Make a buffer for holding the message and related position pointers
> >   char *message, *msgptr, *spamptr;
> >   //What we are checking for. must be EXACT
> >   char *spamflag = "X-Spam-Flag: YES";
>
> This should be:
>
> 	const char *spamflag = "X-Spam-Flag: YES";
> or:
> 	const char spamflag[] = "X-Spam-Flag: YES";
>
> if you don't need to modify it, or:
>
> 	char spamflag[] = "X-Spam-Flag: YES";
>
> if you do.
>
> By default, gcc stores string literals in the .rodata segment, so
> pointers to string literals should be declared "const char *".
 Thank you, thats something I didn't know.
> >   int ret = 0;
> >   int exit_status = 0;
> >   exit_status= EXIT_NOMATCH;
> >
> >   //Set up the buffer. Just do a 1-1 copy if we cant allocate memory
> >   if ( (message = (char *)malloc(sizeof(char)*BUFFERSIZE)) == NULL )
>
> sizeof(char) is always 1, so it can be omitted. Also, malloc() returns
> "void*", which doesn't require an explicit cast.
>
> Also, I don't see any point in using malloc() rather than just an
> array declaration, i.e.
 True...this was leftover from when I thought I would need to malloc and 
resize teh buffer as I kept reading in data, however I realized I only needed 
to "remember" one or two characters at a time to search for a sting match.
> 	char message[BUFFERSIZE];
>
> >     dump_full_message(exit_status);
> >   //Set up ptr positions
> >   msgptr = message;
> >   spamptr = spamflag;
> >   //Null terminate the buffer.
> >   msgptr += BUFFERSIZE;
> >   *msgptr = '\0';
>
> This is writing outside the allocated block. If you allocate
> BUFFERSIZE bytes, valid indices are 0 to BUFFERSIZE-1 inclusive.
 That did cross my mind at one point. Thanks, im not sure why I didn't catch 
that one. Like I said....beginning programmer.
> >   msgptr = message;
> >   //Start copying from stdin
> >   do{
> >     //We have a match!
> >     if ( *spamptr == '\0' ){
> >       exit_status = EXIT_MATCH;
> >       *msgptr='\0';
> >       dump_message(message);
> >       dump_full_message(exit_status);
> >     }
> >     ret = read(STDIN_FILENO, msgptr, 1);
>
> Reading individual bytes with read() will seriously reduce
> performance. Better to use getc() or, better still, getc_unlocked()
> (which is a macro).
I did wonder that. By chance what does getc_unlocked do? What are the benifits 
over getc? What's its prototype? I wasnt too concerned over speed, it 
provided enough performance for my tastes.
> Either that, or read() large blocks.
>
> Apart from that, I find your algorithm is quite hard to read. Here's
> how I would have written it:
 Begginng programmer again :). I appreciate all your comments.
> 	static int writen(int fd, const char *buf, int count)
> 	{
> 		while (count > 0)
> 		{
> 			int n = write(fd, buf, count);
> 			if (n < 0)
> 				return -1;
>
> 			buf += n;
> 			count -= n;
> 		}
>
> 		return 0;
> 	}
>
>
> 	static int scan_message(const char *spamflag)
> 	{
> 		const char *spamptr = spamflag;
> 		int found = 0;
>
> 		for (;;)
> 		{
> 			char buf[BUFFERSIZE];
> 			int count = read(STDIN_FILENO, buf, sizeof(buf));
> 			int i;
>
> 			if (count < 0)
> 			{
> 				perror("read");
> 				exit(1);
> 			}
>
> 			if (!count)
> 				break;
>
> 			if (!found)
> 				for (i = 0; i < count; i++)
> 				{
> 					if (buf[i] == *spamptr)
> 						spamptr++;
> 					else
> 						spamptr = spamflag;
>
> 					if (*spamptr == '\0')
> 					{
> 						found = 1;
> 						break;
> 					}
> 				}
>
> 			if (writen(STDOUT_FILENO, buf, count) < 0)
> 			{
> 				perror("writen");
> 				exit(1);
> 			}
> 		}
>
> 		return found;
> 	}
>
> 	int main(void)
> 	{
> 		return scan_message("X-Spam-Flag: YES");
> 	}
>
> In most cases, if you can't understand the code without comments, you
> should re-write the code.
>
> A few other points:
>
> 1. There is a bug in the algorithm: it won't find the search string if
> it is prefixed by a prefix of the search string, e.g. it won't find
> "X-X-Spam-Flag: YES". When a non-matching character is found, it
> resets spamptr to the beginning of spamflag and *doesn't* check for a
> match. To fix it, change:
>
> 	if (buf[i] == *spamptr)
> 		spamptr++;
> 	else
> 		spamptr = spamflag;
>
> to:
> 	if (buf[i] != *spamptr)
> 		spamptr = spamflag;
> 	if (buf[i] == *spamptr)
> 		spamptr++;
>
>
> 2. I'm guessing that you only really want to find that string if it
> occurs at the beginning of a line, and probably only within the
> headers and not the message body. You can fudge the first issue by
> searching for "\nX-Spam-Flag: YES". The second issue involves keeping
> track of whether you are reading the headers or the body, which
> relates to point 3.
>
> 3. As soon as the task becomes more complex than searching for a
> single fixed string at any point in the stream, you would be better
> off using a finite automaton (i.e. using (f)lex rather than coding
> directly in C).
True true. The only reason I wrote this is because I couldnt find a decent 
solution so provide me an exit status on a match without some horrible kludgy 
bash scripting. I only know a few languages so my options are limited :) 
Besides, it was a great learning program and I appreciate all your imput very 
much. I will re-write my program and see how it does.

-------------------------
Eric Bambach
Eric at cisu dot net
-------------------------

  reply	other threads:[~2004-01-18  6:00 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-01-18  0:52 Beginning programmer + simple program Eric
2004-01-18  2:34 ` Glynn Clements
2004-01-18  6:00   ` Eric [this message]
2004-01-18  6:54     ` Eric
2004-01-19  6:36       ` Glynn Clements
2004-01-19  6:02     ` Glynn Clements

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200401180000.26606.eric@cisu.net \
    --to=eric@cisu.net \
    --cc=glynn.clements@virgin.net \
    --cc=linux-c-programming@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).