From mboxrd@z Thu Jan 1 00:00:00 1970 From: Eric Subject: Re: Beginning programmer + simple program Date: Sun, 18 Jan 2004 00:00:26 -0600 Sender: linux-c-programming-owner@vger.kernel.org Message-ID: <200401180000.26606.eric@cisu.net> References: <200401171852.03121.eric@cisu.net> <16393.61626.659377.266758@cerise.nosuchdomain.co.uk> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <16393.61626.659377.266758@cerise.nosuchdomain.co.uk> Content-Disposition: inline List-Id: Content-Type: text/plain; charset="us-ascii" To: Glynn Clements , linux-c-programming@vger.kernel.org On Saturday 17 January 2004 08:34 pm, Glynn Clements wrote: > Eric wrote: > > I have recently written a program designed to be used in the qmail > > mail queue in conjunction with fighting spam. Basically my program pipes > > STDIN to STDOUT, but in the process checks to see if a string is > > contained in STDIN. If a string is contained in STDIN it will return 1, > > else 0. This is important because I will be using the return value to > > decide what to do with a mail message. It is used in conjunction with a > > message already scanned by spamassasin. (See spamflag variable) > > > > As a begginning programmer I would like some honest comments on the > > functionality of this program and its flaws/strengths. I thought very > > much about possible error conditions, and I tried very hard to not abort > > or quit without trying to pass the message on to stdout, even at the > > expense of not checking it anymore. I would like this program to be > > reliable above all else as this will be implemented on a site-wise basis. > > > > The program will only be manipulating character data, I > > realize it will probably truncate if given binary data, however I am not > > worried as even files are sent as MIME characters (right?) > > The code is 8-bit clean; it won't have any problems with binary data. My guess was with the EOF detection and the problems you can encounter with binary data. > > This program is pretty fast. It parsed a 2.3MB file in about a > > second. The implementation should be pretty close to O(1) , probably > > slightly more. Since it parsed this pretty fast with very low overhead, I > > am not worried about speed, just correctness. > > > > #include > > #include > > #include > > #define EXIT_NOMATCH 0 > > #define EXIT_MATCH 1 > > #define BUFFERSIZE 65535 > > int main(); > > inline int dump_message(char *message); > > inline int dump_full_message(); > > > > int main() > > { > > //Make a buffer for holding the message and related position pointers > > char *message, *msgptr, *spamptr; > > //What we are checking for. must be EXACT > > char *spamflag = "X-Spam-Flag: YES"; > > This should be: > > const char *spamflag = "X-Spam-Flag: YES"; > or: > const char spamflag[] = "X-Spam-Flag: YES"; > > if you don't need to modify it, or: > > char spamflag[] = "X-Spam-Flag: YES"; > > if you do. > > By default, gcc stores string literals in the .rodata segment, so > pointers to string literals should be declared "const char *". Thank you, thats something I didn't know. > > int ret = 0; > > int exit_status = 0; > > exit_status= EXIT_NOMATCH; > > > > //Set up the buffer. Just do a 1-1 copy if we cant allocate memory > > if ( (message = (char *)malloc(sizeof(char)*BUFFERSIZE)) == NULL ) > > sizeof(char) is always 1, so it can be omitted. Also, malloc() returns > "void*", which doesn't require an explicit cast. > > Also, I don't see any point in using malloc() rather than just an > array declaration, i.e. True...this was leftover from when I thought I would need to malloc and resize teh buffer as I kept reading in data, however I realized I only needed to "remember" one or two characters at a time to search for a sting match. > char message[BUFFERSIZE]; > > > dump_full_message(exit_status); > > //Set up ptr positions > > msgptr = message; > > spamptr = spamflag; > > //Null terminate the buffer. > > msgptr += BUFFERSIZE; > > *msgptr = '\0'; > > This is writing outside the allocated block. If you allocate > BUFFERSIZE bytes, valid indices are 0 to BUFFERSIZE-1 inclusive. That did cross my mind at one point. Thanks, im not sure why I didn't catch that one. Like I said....beginning programmer. > > msgptr = message; > > //Start copying from stdin > > do{ > > //We have a match! > > if ( *spamptr == '\0' ){ > > exit_status = EXIT_MATCH; > > *msgptr='\0'; > > dump_message(message); > > dump_full_message(exit_status); > > } > > ret = read(STDIN_FILENO, msgptr, 1); > > Reading individual bytes with read() will seriously reduce > performance. Better to use getc() or, better still, getc_unlocked() > (which is a macro). I did wonder that. By chance what does getc_unlocked do? What are the benifits over getc? What's its prototype? I wasnt too concerned over speed, it provided enough performance for my tastes. > Either that, or read() large blocks. > > Apart from that, I find your algorithm is quite hard to read. Here's > how I would have written it: Begginng programmer again :). I appreciate all your comments. > static int writen(int fd, const char *buf, int count) > { > while (count > 0) > { > int n = write(fd, buf, count); > if (n < 0) > return -1; > > buf += n; > count -= n; > } > > return 0; > } > > > static int scan_message(const char *spamflag) > { > const char *spamptr = spamflag; > int found = 0; > > for (;;) > { > char buf[BUFFERSIZE]; > int count = read(STDIN_FILENO, buf, sizeof(buf)); > int i; > > if (count < 0) > { > perror("read"); > exit(1); > } > > if (!count) > break; > > if (!found) > for (i = 0; i < count; i++) > { > if (buf[i] == *spamptr) > spamptr++; > else > spamptr = spamflag; > > if (*spamptr == '\0') > { > found = 1; > break; > } > } > > if (writen(STDOUT_FILENO, buf, count) < 0) > { > perror("writen"); > exit(1); > } > } > > return found; > } > > int main(void) > { > return scan_message("X-Spam-Flag: YES"); > } > > In most cases, if you can't understand the code without comments, you > should re-write the code. > > A few other points: > > 1. There is a bug in the algorithm: it won't find the search string if > it is prefixed by a prefix of the search string, e.g. it won't find > "X-X-Spam-Flag: YES". When a non-matching character is found, it > resets spamptr to the beginning of spamflag and *doesn't* check for a > match. To fix it, change: > > if (buf[i] == *spamptr) > spamptr++; > else > spamptr = spamflag; > > to: > if (buf[i] != *spamptr) > spamptr = spamflag; > if (buf[i] == *spamptr) > spamptr++; > > > 2. I'm guessing that you only really want to find that string if it > occurs at the beginning of a line, and probably only within the > headers and not the message body. You can fudge the first issue by > searching for "\nX-Spam-Flag: YES". The second issue involves keeping > track of whether you are reading the headers or the body, which > relates to point 3. > > 3. As soon as the task becomes more complex than searching for a > single fixed string at any point in the stream, you would be better > off using a finite automaton (i.e. using (f)lex rather than coding > directly in C). True true. The only reason I wrote this is because I couldnt find a decent solution so provide me an exit status on a match without some horrible kludgy bash scripting. I only know a few languages so my options are limited :) Besides, it was a great learning program and I appreciate all your imput very much. I will re-write my program and see how it does. ------------------------- Eric Bambach Eric at cisu dot net -------------------------