linux-c-programming.vger.kernel.org archive mirror
* how to find the end of piped data?
@ 2004-09-13 11:36 Richard Sammet
  2004-09-14  6:06 ` Eric Bambach
  2004-09-14  9:03 ` Charlie Gordon
  0 siblings, 2 replies; 5+ messages in thread
From: Richard Sammet @ 2004-09-13 11:36 UTC
  To: linux-c-programming

hey list,

i wrote a small tool which gets data over a pipe from other tools (like: 
cat stuff | mytool).

how can i find the end of this data stream?

at the moment i'm looking for a newline to see if the input is finished,
but that's not practical.

this is the routine for getting the data:

void scanin()
{
   int tmpcnt=0;

   while(sizeof(tmpkey) && tmpkey[tmpcnt-1] != 10)
   {
      tmpkey[tmpcnt]=getchar();
      tmpcnt++;
   }
}

i'm looking for a flag like EOF, but for end of stream, or something like that ;)

anybody any idea?

e-axe

-- 
=====================================================================
Fraunhofer-Institut für Sichere Informations-Technologie (SIT)

Richard Sammet
Tel.: +49 6151 869 60027
Email: richard.sammet@sit.fraunhofer.de

main(){int
y=0,x;while(y!=6){x=(y==0)?101:((y==1)?45:((y==2)?97:((y==3)?120:((y==4)?101:10))));putchar(x);y++;}}
-
To unsubscribe from this list: send the line "unsubscribe linux-c-programming" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: how to find the end of piped data?
  2004-09-13 11:36 how to find the end of piped data? Richard Sammet
@ 2004-09-14  6:06 ` Eric Bambach
  2004-09-14  8:10   ` Charlie Gordon
  2004-09-14  8:45   ` Richard Sammet
  2004-09-14  9:03 ` Charlie Gordon
  1 sibling, 2 replies; 5+ messages in thread
From: Eric Bambach @ 2004-09-14  6:06 UTC
  To: Richard Sammet; +Cc: linux-c-programming

On Monday 13 September 2004 06:36 am, you wrote:
> hey list,
>
> i wrote a small tool which gets data over a pipe from other tools (like:
> cat stuff | mytool).
>
> how can i find the end of this data stream?
>
> at the moment i'm looking for a newline to see if the input is finished,
> but that's not practical.
>
> this is the routine for getting the data:
>
> void scanin()
> {
>    int tmpcnt=0;
>
>    while(sizeof(tmpkey) && tmpkey[tmpcnt-1] != 10)
>    {
>       tmpkey[tmpcnt]=getchar();
>       tmpcnt++;
>    }
> }
>
> i'm looking for a flag like EOF, but for end of stream, or something like that ;)
>
> anybody any idea?

Yeah, here's a small program I wrote that works the same way, with piped
data. It scans stdin for a pattern in the input stream. If it finds the
pattern, it writes the data to /dev/null; if not, it writes it to stdout.
Notice the read()/write() combo with a buffer: read() returns 0 once the
writer has closed the pipe, and that is your end of data. This is MUCH
faster than the getchar() method. Hope it helps.

#include <stdio.h>
#include <stdlib.h>   //for exit()
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define EXIT_NOMATCH 0
#define EXIT_MATCH 1

//Should be large enough to scan most mail messages in one or two passes.
//Performance with 4K buffer is .8s for a 73M message.
//Increasing to 32K only trims it to .6s with an unjustified increase in
//memory use.
//DO NOT SET THIS TOO SMALL. It probably won't catch the spam flag then since
//it only scans the first BUFFERSIZE characters read in then fast copies the
//rest to stdout.
#define BUFFERSIZE 6144

//What we are checking for. must be EXACT. Leave the newline in because it
//protects the offchance that it is in the message body somewhere.
//This way it will only match if its at the beginning of the line.
#define CHECKSTRING "\nX-Spam-Flag: YES"

int main(void);
int dump_full_message();
int write_message(char *buffer,int len,int fd);
int read_message(char *buffer);
int scan_message(char *buffer,int len);

int main(void){
  //Our faithful buffer
  char buffer[BUFFERSIZE];
  //What file should we write to if its spam.
  const char *spampipe = "/dev/null";
  int len,
      fd = STDOUT_FILENO,
      exit_status = EXIT_NOMATCH;

  //We only want to scan our message once. Its unlikely there are more than
  //4K(BUFFERSIZE) of headers. By scanning once, this lets us trash the rest
  //of the output if its spam. Also prevents scanning a huge non-matching
  //mail-message-attachment.
  len = read_message(buffer);
  if (len){
    if( scan_message(buffer,len) == EXIT_MATCH){
      if( (fd = open(spampipe, O_WRONLY)) == -1){
        perror("Cannot open spam pipe...will write to stdout");
        fd = STDOUT_FILENO;
      }
      exit_status= EXIT_MATCH;
    }
  }
  write_message(buffer,len,fd);

  //Tight read/write for just piping the data. After the first BUFFERSIZE
  //characters we should already have what we need and just pass it on in
  //the queue.
  do{
    len = read_message(buffer);
    if (len){
      write_message(buffer,len,fd);
    }
  }while(len > 0);
  close(fd);
  return exit_status;
}

int scan_message(char * buffer,int len){
  //Both pointers only read, so they are const. This also lets spamptr
  //point at the const pattern without a warning.
  const char *spamptr,*bufptr;
  int count = 0;
  const char *spamflag = CHECKSTRING;
  spamptr = spamflag;
  bufptr = buffer;
  for(count =0 ; count<len ; count++,bufptr++){
    //Test it
    if (*bufptr != *spamptr)
      spamptr = spamflag;
    if (*bufptr == *spamptr)
      spamptr++;
    //We've hit a match
    if (*spamptr == '\0'){
      return EXIT_MATCH;
    }
  }
  return EXIT_NOMATCH;
}

int read_message(char *buffer){
  int len;
  len = read(STDIN_FILENO,buffer,BUFFERSIZE-1);
  if (len < 0){
    perror("Read Error");
    exit(EXIT_NOMATCH);
  }
  return len;
}

//Works almost like write() except checks for errors.
int write_message(char *buffer,int len,int fd){
  int ret;
  ret = write(fd,buffer,len);
  if (ret < 0){
    perror("Write Error");
    exit(EXIT_NOMATCH);
  }
  return ret;
}
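
For the original question, the end-of-data condition in the program above
is read() returning 0, which on a pipe happens once every writer has closed
its end. Below is a minimal sketch of just that copy loop. The names
copy_fd and write_all are made up for this illustration and are not part of
the program above. Unlike write_message(), the sketch also retries short
writes and EINTR, which is an addition and an assumption about what you
would want.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Write all len bytes, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. (Hypothetical helper.) */
int write_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Copy everything from descriptor in to descriptor out.
 * read() returning 0 is the end of the data.
 * Returns the number of bytes copied, or -1 on error. */
long copy_fd(int in, int out)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(in, buf, sizeof buf);
        if (n == 0)
            return total;           /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (write_all(out, buf, (size_t)n) < 0)
            return -1;
        total += n;
    }
}
```

In a filter like "cat stuff | mytool" this would be called as
copy_fd(STDIN_FILENO, STDOUT_FILENO).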

-- 

-EB


* Re: how to find the end of piped data?
  2004-09-14  6:06 ` Eric Bambach
@ 2004-09-14  8:10   ` Charlie Gordon
  2004-09-14  8:45   ` Richard Sammet
  1 sibling, 0 replies; 5+ messages in thread
From: Charlie Gordon @ 2004-09-14  8:10 UTC
  To: linux-c-programming

Dear Eric,

Here are a few remarks on your email scanner:

- Why do you bother to write matching streams to /dev/null ?
If you mean to discard them, just do that, do not write the stuff anywhere.
You still need to read stdin all the way to EOF in case it is a pipe, or you
will crash the piping program with a broken pipe signal.
If you mean to stash the spam somewhere (a real file), you probably should
append to the 'spampipe' file, not just overwrite it.

- Your do/while loop is really a while loop.  do/while loops are error prone
and should be avoided.  In this example, I would write:

while ((len = read_message(buffer)) > 0) {
    if (!discard)
        write_message(buffer, len, fd);
}

This saves many system calls in case of discarding spam.
It is also one less test on len, and removes a bug opportunity, should you
decide to return negative values for len in read_message.

scan_message should take the buffer as const char * to emphasize the fact
that it doesn't modify it.
Your implementation of scanning for a string in a buffer has a potential
flaw:  when the comparison fails, you restart from the failure point instead
of backtracking by matched characters but one.  This happens to work for
your CHECKSTRING but will fail for certain patterns.  eg:

CHECKSTRING = "ABAC"
buffer contents: "...xABABAC..."  will not match with your scanner.

I suggest you go with a more straightforward method involving memchr and
memcmp.  Both functions are implemented inline and will likely perform even
better than your loop.  Otherwise, look up the Boyer-Moore algorithm for
searching for a pattern in a buffer, and find a good implementation of this
tricky but vastly more efficient method.
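
To make the "ABAC" case concrete, here is a small sketch. naive_scan
mirrors the logic of the scanner under review (the pattern pointer is
reset on a mismatch, the buffer position is not backed up), and
find_pattern is one possible memchr/memcmp search. Both names and
signatures are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same logic as the reviewed scanner: returns 1 on a match, 0 otherwise. */
int naive_scan(const char *buf, size_t len, const char *pat)
{
    const char *p = pat;
    size_t i;

    assert(*pat != '\0');
    for (i = 0; i < len; i++) {
        if (buf[i] != *p)
            p = pat;            /* restart the pattern, keep the position */
        if (buf[i] == *p)
            p++;
        if (*p == '\0')
            return 1;
    }
    return 0;
}

/* memchr finds each candidate for the first byte, memcmp checks the rest.
 * Returns a pointer to the first occurrence, or NULL if there is none. */
const char *find_pattern(const char *buf, size_t len, const char *pat)
{
    size_t patlen = strlen(pat);
    const char *p = buf;
    const char *end = buf + len;

    assert(patlen > 0);
    while ((size_t)(end - p) >= patlen) {
        /* only look at start positions where the whole pattern still fits */
        p = memchr(p, pat[0], (size_t)(end - p) - patlen + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, pat, patlen) == 0)
            return p;
        p++;
    }
    return NULL;
}
```

With the buffer "xABABAC" and the pattern "ABAC", naive_scan reports no
match, while find_pattern returns a pointer to offset 3.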

Why do you read BUFFERSIZE-1 bytes into the buffer ?  I suspect you used
strstr() in a previous version of this program to scan for the pattern...
How did that perform with the same buffer size ?

Regarding performance, while I agree that handling large buffers at a time
is usually more efficient than using getchar() and relying on stdio
buffering, I have experienced that using a well chosen buffer size is the
key.  A multiple of the file system block size, or of the OS's page size is
a better choice, and care must be taken of keeping I/O aligned on such
boundaries.  So I would suggest you use either 4K or 8K instead of 6143.
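
One way to pick such a size at run time, as a sketch only: ask fstat() for
the preferred I/O block size of the descriptor and fall back to the page
size. The helper names pick_buffer_size and round_up are made up here, and
the 4096 fallback is an assumption, not a measured value.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/* Preferred I/O size for fd: st_blksize if available, else the page size,
 * else an assumed 4096. */
size_t pick_buffer_size(int fd)
{
    struct stat st;
    long page;

    if (fstat(fd, &st) == 0 && st.st_blksize > 0)
        return (size_t)st.st_blksize;
    page = sysconf(_SC_PAGESIZE);
    if (page > 0)
        return (size_t)page;
    return 4096;
}

/* Round n up to the next multiple of unit. */
size_t round_up(size_t n, size_t unit)
{
    assert(unit > 0);
    return (n + unit - 1) / unit * unit;
}
```

For example, round_up(6144, 4096) gives 8192, the 8K suggested above.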

Regarding CHECKSTRING starting with \n  :  you definitely want to force the
match at the beginning of a line, and it is very unlikely the spam flag will
have been prepended to the file, so not matching the first line is not a
problem, but should be documented.  The main issue with the way you scan
your emails is that you don't stop at the end of the header section.  This
causes unnecessary scanning of some or all of the message body, and more
importantly causes messages that contain the CHECKSTRING in the body only to
match!

For example, you are searching for the following string:

X-Spam-Flag: YES

Bang! this is a match! A good thing there was more than 6K of headers and
blabber above this line!

Here is my proposal for scan_message:

int scan_message (const char *buffer, int size) {

  const char *spamflag = "X-Spam-Flag: YES";
  int len = strlen(spamflag);
  const char *bufptr, *buflast;

  if (size >= len) {
    bufptr = buffer;
    buflast = buffer + size - len;
    for (;;) {
      // match at the beginning of a header line
      if (!memcmp(bufptr, spamflag, len))
        return EXIT_MATCH;
      bufptr = memchr(bufptr, '\n', buflast - bufptr);
      if (!bufptr)
        break;
      bufptr++;   // skip \n
      if (*bufptr == '\n'
      ||  (*bufptr == '\r' && bufptr[1] == '\n')) {
        break;    // end of email header
      }
    }
  }
  return EXIT_NOMATCH;
}

Cheers,

Chqrlie.

PS: Please do not use tabs in source files, you can indent by 2 spaces if
you want, but do not set tabs to something else than 8 spaces.


* Re: how to find the end of piped data?
  2004-09-14  6:06 ` Eric Bambach
  2004-09-14  8:10   ` Charlie Gordon
@ 2004-09-14  8:45   ` Richard Sammet
  1 sibling, 0 replies; 5+ messages in thread
From: Richard Sammet @ 2004-09-14  8:45 UTC
  To: eric; +Cc: linux-c-programming

k, thx, it works fine...

e-axe

Eric Bambach wrote:
> Yea, heres a small program I wrote that works exactly the same way. WIth piped 
> data. It just scans STDIN to match to a pattern in the input stream. If it 
> finds it, it pipes to /dev/null, if not, it pipes to stdout. Notice the 
> read()/write() combo with a buffer. This is MUCH faster than getchar() 
> method. Hope it helps.
> 
> int read_message(char *buffer){
>   int len;
>   len = read(STDIN_FILENO,buffer,BUFFERSIZE-1);
>   if (len < 0){
>     perror("Read Error");
>     exit(EXIT_NOMATCH);
>   }
>   return len;


* Re: how to find the end of piped data?
  2004-09-13 11:36 how to find the end of piped data? Richard Sammet
  2004-09-14  6:06 ` Eric Bambach
@ 2004-09-14  9:03 ` Charlie Gordon
  1 sibling, 0 replies; 5+ messages in thread
From: Charlie Gordon @ 2004-09-14  9:03 UTC
  To: linux-c-programming

Dear Richard,

The C library I/O functions are agnostic about whether stdin data is coming
from a file, a pipe or a character device.
getchar() will return EOF upon end of file, closing of the pipe by the
piping process, or end of stream.

Here is how you should write your scanner:

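void handle_one_line(const char *line);  /* to be supplied by you */

void scanin(void)
{
    int c;              /* int, not char, so that EOF can be told apart */
    char tmpkey[1024];  /* or whatever size you want */
    int tmpcnt = 0;

    while ((c = getchar()) != EOF) {
        if (c == '\n') {
            tmpkey[tmpcnt] = '\0';
            /* perform whatever treatment you want on a line of input */
            handle_one_line(tmpkey);
            tmpcnt = 0;
        } else {
            /* keep room for the final '\0'; extra characters are dropped */
            if (tmpcnt < (int)sizeof(tmpkey) - 1) {
                tmpkey[tmpcnt++] = c;
            }
        }
    }
    /* a last line without a trailing newline still needs handling */
    if (tmpcnt > 0) {
        tmpkey[tmpcnt] = '\0';
        handle_one_line(tmpkey);
    }
}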

You can also use fgets() and check for NULL as the indication for end of
stream, but you will get a slightly different behaviour on truncated lines.
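
A sketch of the fgets() variant follows. The name scan_lines and the
callback parameter are invented for this illustration. The difference on
long lines: a line longer than the buffer is not truncated but arrives in
several pieces, each passed to the callback separately.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read in line by line until fgets() returns NULL (end of stream or error).
 * Calls handle() on each piece if handle is not NULL.
 * Returns the number of pieces read, or -1 on a read error. */
long scan_lines(FILE *in, void (*handle)(const char *line))
{
    char line[1024];
    long count = 0;

    assert(in != NULL);
    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip the newline, if any */
        if (handle != NULL)
            handle(line);
        count++;
    }
    return ferror(in) ? -1 : count;
}
```

For piped input this would be called as scan_lines(stdin, handle_one_line).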

Cheers,

Chqrlie.

main(){int
y=0,x;while(y!=6){x=(y==0)?101:((y==1)?45:((y==2)?97:((y==3)?120:((y==4)?101
:10))));putchar(x);y++;}}

You can simplify this further : one liners should be less than 80
characters!

main(){int
y=0;while(++y<7)putchar(y<2?101:y<3?45:y<4?97:y<5?120:y<6?101:10);}

You can reduce this even further ;-)

