From: Eric <eric@cisu.net>
To: Glynn Clements <glynn.clements@virgin.net>,
linux-c-programming@vger.kernel.org
Subject: Re: Beginning programmer + simple program
Date: Sun, 18 Jan 2004 00:00:26 -0600 [thread overview]
Message-ID: <200401180000.26606.eric@cisu.net> (raw)
In-Reply-To: <16393.61626.659377.266758@cerise.nosuchdomain.co.uk>
On Saturday 17 January 2004 08:34 pm, Glynn Clements wrote:
> Eric wrote:
> > I have recently written a program designed to be used in the qmail
> > mail queue in conjunction with fighting spam. Basically my program pipes
> > STDIN to STDOUT, but in the process checks to see if a string is
> > contained in STDIN. If a string is contained in STDIN it will return 1,
> > else 0. This is important because I will be using the return value to
> > decide what to do with a mail message. It is used in conjunction with a
> > message already scanned by spamassasin. (See spamflag variable)
> >
> > As a begginning programmer I would like some honest comments on the
> > functionality of this program and its flaws/strengths. I thought very
> > much about possible error conditions, and I tried very hard to not abort
> > or quit without trying to pass the message on to stdout, even at the
> > expense of not checking it anymore. I would like this program to be
> > reliable above all else as this will be implemented on a site-wise basis.
> >
> > The program will only be manipulating character data, I
> > realize it will probably truncate if given binary data, however I am not
> > worried as even files are sent as MIME characters (right?)
>
> The code is 8-bit clean; it won't have any problems with binary data.
My guess was with the EOF detection and the problems you can encounter with
binary data.
> > This program is pretty fast. It parsed a 2.3MB file in about a
> > second. The implementation should be pretty close to O(1) , probably
> > slightly more. Since it parsed this pretty fast with very low overhead, I
> > am not worried about speed, just correctness.
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #define EXIT_NOMATCH 0
> > #define EXIT_MATCH 1
> > #define BUFFERSIZE 65535
> > int main();
> > inline int dump_message(char *message);
> > inline int dump_full_message();
> >
> > int main()
> > {
> > //Make a buffer for holding the message and related position pointers
> > char *message, *msgptr, *spamptr;
> > //What we are checking for. must be EXACT
> > char *spamflag = "X-Spam-Flag: YES";
>
> This should be:
>
> const char *spamflag = "X-Spam-Flag: YES";
> or:
> const char spamflag[] = "X-Spam-Flag: YES";
>
> if you don't need to modify it, or:
>
> char spamflag[] = "X-Spam-Flag: YES";
>
> if you do.
>
> By default, gcc stores string literals in the .rodata segment, so
> pointers to string literals should be declared "const char *".
Thank you, thats something I didn't know.
> > int ret = 0;
> > int exit_status = 0;
> > exit_status= EXIT_NOMATCH;
> >
> > //Set up the buffer. Just do a 1-1 copy if we cant allocate memory
> > if ( (message = (char *)malloc(sizeof(char)*BUFFERSIZE)) == NULL )
>
> sizeof(char) is always 1, so it can be omitted. Also, malloc() returns
> "void*", which doesn't require an explicit cast.
>
> Also, I don't see any point in using malloc() rather than just an
> array declaration, i.e.
True...this was leftover from when I thought I would need to malloc and
resize teh buffer as I kept reading in data, however I realized I only needed
to "remember" one or two characters at a time to search for a sting match.
> char message[BUFFERSIZE];
>
> > dump_full_message(exit_status);
> > //Set up ptr positions
> > msgptr = message;
> > spamptr = spamflag;
> > //Null terminate the buffer.
> > msgptr += BUFFERSIZE;
> > *msgptr = '\0';
>
> This is writing outside the allocated block. If you allocate
> BUFFERSIZE bytes, valid indices are 0 to BUFFERSIZE-1 inclusive.
That did cross my mind at one point. Thanks, im not sure why I didn't catch
that one. Like I said....beginning programmer.
> > msgptr = message;
> > //Start copying from stdin
> > do{
> > //We have a match!
> > if ( *spamptr == '\0' ){
> > exit_status = EXIT_MATCH;
> > *msgptr='\0';
> > dump_message(message);
> > dump_full_message(exit_status);
> > }
> > ret = read(STDIN_FILENO, msgptr, 1);
>
> Reading individual bytes with read() will seriously reduce
> performance. Better to use getc() or, better still, getc_unlocked()
> (which is a macro).
I did wonder that. By chance what does getc_unlocked do? What are the benifits
over getc? What's its prototype? I wasnt too concerned over speed, it
provided enough performance for my tastes.
> Either that, or read() large blocks.
>
> Apart from that, I find your algorithm is quite hard to read. Here's
> how I would have written it:
Begginng programmer again :). I appreciate all your comments.
> static int writen(int fd, const char *buf, int count)
> {
> while (count > 0)
> {
> int n = write(fd, buf, count);
> if (n < 0)
> return -1;
>
> buf += n;
> count -= n;
> }
>
> return 0;
> }
>
>
> static int scan_message(const char *spamflag)
> {
> const char *spamptr = spamflag;
> int found = 0;
>
> for (;;)
> {
> char buf[BUFFERSIZE];
> int count = read(STDIN_FILENO, buf, sizeof(buf));
> int i;
>
> if (count < 0)
> {
> perror("read");
> exit(1);
> }
>
> if (!count)
> break;
>
> if (!found)
> for (i = 0; i < count; i++)
> {
> if (buf[i] == *spamptr)
> spamptr++;
> else
> spamptr = spamflag;
>
> if (*spamptr == '\0')
> {
> found = 1;
> break;
> }
> }
>
> if (writen(STDOUT_FILENO, buf, count) < 0)
> {
> perror("writen");
> exit(1);
> }
> }
>
> return found;
> }
>
> int main(void)
> {
> return scan_message("X-Spam-Flag: YES");
> }
>
> In most cases, if you can't understand the code without comments, you
> should re-write the code.
>
> A few other points:
>
> 1. There is a bug in the algorithm: it won't find the search string if
> it is prefixed by a prefix of the search string, e.g. it won't find
> "X-X-Spam-Flag: YES". When a non-matching character is found, it
> resets spamptr to the beginning of spamflag and *doesn't* check for a
> match. To fix it, change:
>
> if (buf[i] == *spamptr)
> spamptr++;
> else
> spamptr = spamflag;
>
> to:
> if (buf[i] != *spamptr)
> spamptr = spamflag;
> if (buf[i] == *spamptr)
> spamptr++;
>
>
> 2. I'm guessing that you only really want to find that string if it
> occurs at the beginning of a line, and probably only within the
> headers and not the message body. You can fudge the first issue by
> searching for "\nX-Spam-Flag: YES". The second issue involves keeping
> track of whether you are reading the headers or the body, which
> relates to point 3.
>
> 3. As soon as the task becomes more complex than searching for a
> single fixed string at any point in the stream, you would be better
> off using a finite automaton (i.e. using (f)lex rather than coding
> directly in C).
True true. The only reason I wrote this is because I couldnt find a decent
solution so provide me an exit status on a match without some horrible kludgy
bash scripting. I only know a few languages so my options are limited :)
Besides, it was a great learning program and I appreciate all your imput very
much. I will re-write my program and see how it does.
-------------------------
Eric Bambach
Eric at cisu dot net
-------------------------
next prev parent reply other threads:[~2004-01-18 6:00 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2004-01-18 0:52 Beginning programmer + simple program Eric
2004-01-18 2:34 ` Glynn Clements
2004-01-18 6:00 ` Eric [this message]
2004-01-18 6:54 ` Eric
2004-01-19 6:36 ` Glynn Clements
2004-01-19 6:02 ` Glynn Clements
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=200401180000.26606.eric@cisu.net \
--to=eric@cisu.net \
--cc=glynn.clements@virgin.net \
--cc=linux-c-programming@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).