Re: how to find the end of piped data?

linux-c-programming.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Charlie Gordon" <gmane@chqrlie.org>
To: linux-c-programming@vger.kernel.org
Subject: Re: how to find the end of piped data?
Date: Tue, 14 Sep 2004 10:10:08 +0200	[thread overview]
Message-ID: <ci68vo$eu4$1@sea.gmane.org> (raw)
In-Reply-To: 200409140106.08674.eric@cisu.net

Dear Eric,

Here are a few remarks on your email scanner:

- Why do you bother to write matching streams to /dev/null ?
If you mean to discard them, just do that, do not write the stuff anywhere.
You still need to read stdin all the way to EOF in case it is a pipe, or you
will crash the piping program with a broken pipe signal.
If you mean to stash the spam somewhere (a real file), you probably should
append to the 'spampipe' file, not just overwrite it.

- Your do/while loop is really a while loop.  do/while loops are error prone
and should be avoided.  In this example, I would write:

while ((len = read_message(buffer)) > 0) {
    if (!discard)
        write_message(buffer, len, fd);
}

This saves many system calls in case of discarding spam.
It is also one less test on len, and removes a bug opportunity, should you
decide to return negative values for len in read_message.

scan_message should take the buffer as const char * to emphasize the fact
that it doesn't modify it.
Your implementation of scanning for a string in a buffer has a potential
flaw:  when the comparison fails, you restart from the failure point instead
of backtracking by matched characters but one.  This happens to work for
your CHECKSTRING but will fail for certain patterns.  eg:

CHECKSTRING = "ABAC"
buffer contents: "...xABABAC..."  will not match with your scanner.

I suggest you go with a more straightforward method involving memchr and
memcmp.  Both functions are implemented inline and will likely perform even
better than your loop.  Otherwise, look up the Boyer-Moore algorithm for
searching for a pattern in a buffer, and find a good implementation of this
tricky but vastly more efficient method.

Why do you read BUFFERSIZE-1 bytes into the buffer ?  I suspect you used
strstr() in a previous version of this program to scan for the pattern...
How did that perform with the same buffer size ?

Regarding performance, while I agree that handling large buffers at a time
is usually more efficient than using getchar() and relying on stdio
buffering, I have experienced that using a well chosen buffer size is the
key.  A multiple of the file system block size, or of the OS's page size is
a better choice, and care must be taken of keeping I/O aligned on such
boundaries.  So I would suggest you use either 4K or 8K instead of 6143.

Regarding CHECKSTRING starting with \n  :  you definitely want to force the
match at the beginning of a line, and it is very unlikely the spam flag will
have been prepended to the file, so not matching the first line is not a
problem, but should be documented.  The main issue with the way you scan
your emails is that you don't stop at the end of the header section.  This
causes unnecessary scanning of some or all of the message body, and more
importantly causes messages that contain the CHECKSTRING in the body only to
match!

For example, you are searching for the following string:

X-Spam-Flag: YES

Bang! this is a match! A good thing there was more than 6K of headers and
blabber above this line!

Here is my proposal for scan_message:

int scan_message (const char *buffer, int size) {

  const char *spamflag = "X-Spam-Flag: YES";
  int len = strlen(spamflag);
  const char *bufptr, *buflast;

  if (size >= len) {
    bufptr = buffer;
    buflast = buffer + size - len;
    for (;;) {
      // match at the beginning of a header line
      if (!memcmp(bufptr, spamflag, len))
        return EXIT_MATCH;
      bufptr = memchr(bufptr, '\n', buflast - bufptr);
      if (!bufptr)
        break;
      bufptr++;   // skip \n
      if (*bufptr == '\n'
      ||  (*bufptr == '\r' && bufptr[1] == '\n')) {
        break;    // end of email header
      }
    }
  }
  return EXIT_NOMATCH;
}

Cheers,

Chqrlie.

PS: Please do not use tabs in source files, you can indent by 2 spaces if
you want, but do not set tabs to something else than 8 spaces.

next prev parent reply	other threads:[~2004-09-14  8:10 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-09-13 11:36 how to find the end of piped data? Richard Sammet
2004-09-14  6:06 ` Eric Bambach
2004-09-14  8:10   ` Charlie Gordon [this message]
2004-09-14  8:45   ` Richard Sammet
2004-09-14  9:03 ` Charlie Gordon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='ci68vo$eu4$1@sea.gmane.org' \
    --to=gmane@chqrlie.org \
    --cc=linux-c-programming@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).