* how to find the end of piped data?
@ 2004-09-13 11:36 Richard Sammet
2004-09-14 6:06 ` Eric Bambach
2004-09-14 9:03 ` Charlie Gordon
0 siblings, 2 replies; 5+ messages in thread
From: Richard Sammet @ 2004-09-13 11:36 UTC
To: linux-c-programming
hey list,
I wrote a small tool which gets data over a pipe from other tools (like:
cat stuff | mytool).
How can I find the end of this data stream?
At the moment I'm looking for a newline to see if the input is finished,
but that's not practical.
This is the routine for getting the data:
void scanin()
{
    int tmpcnt=0;

    while(sizeof(tmpkey) && tmpkey[tmpcnt-1] != 10)
    {
        tmpkey[tmpcnt]=getchar();
        tmpcnt++;
    }
}
I'm looking for a flag like EOF, but EndOfStream or something like this? ;)
Does anybody have any idea?
e-axe
--
=====================================================================
Fraunhofer-Institut für Sichere Informations-Technologie (SIT)
Richard Sammet
Tel.: +49 6151 869 60027
Email: richard.sammet@sit.fraunhofer.de
main(){int
y=0,x;while(y!=6){x=(y==0)?101:((y==1)?45:((y==2)?97:((y==3)?120:((y==4)?101:10))));putchar(x);y++;}}
* Re: how to find the end of piped data?
2004-09-13 11:36 how to find the end of piped data? Richard Sammet
@ 2004-09-14 6:06 ` Eric Bambach
2004-09-14 8:10 ` Charlie Gordon
2004-09-14 8:45 ` Richard Sammet
2004-09-14 9:03 ` Charlie Gordon
1 sibling, 2 replies; 5+ messages in thread
From: Eric Bambach @ 2004-09-14 6:06 UTC
To: Richard Sammet; +Cc: linux-c-programming
On Monday 13 September 2004 06:36 am, you wrote:
> hey list,
>
> i wrote a small tool which gets data over a pipe from other tools (like:
> cat stuff | mytool).
>
> how can i find the end of this data stream?
>
> at the moment im looking for a newline to see if the input is finished,
> but thats not practicable.
>
> this is the rutine for getting the data:
>
> void scanin()
> {
>     int tmpcnt=0;
>
>     while(sizeof(tmpkey) && tmpkey[tmpcnt-1] != 10)
>     {
>         tmpkey[tmpcnt]=getchar();
>         tmpcnt++;
>     }
> }
>
> im looking for a flag like EOF but EndOfStream or something like this? ;)
>
> anybody any idea?

Yeah, here's a small program I wrote that works the same way, with piped
data. It just scans stdin for a pattern in the input stream. If it finds
it, it writes to /dev/null; if not, it writes to stdout. Notice the
read()/write() combo with a buffer. This is MUCH faster than the getchar()
method. Hope it helps.

#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define EXIT_NOMATCH 0
#define EXIT_MATCH 1

//Should be large enough to scan most mail messages in one or two passes.
//Performance with 4K buffer is .8s for a 73M message.
//Increasing to 32K only trims it to .6s with an unjustified increase in
//memory use.
//DO NOT SET THIS TOO SMALL. It probably won't catch the spam flag then
//since it only scans the first BUFFERSIZE characters read in then fast
//copies the rest to stdout.
#define BUFFERSIZE 6144

//What we are checking for. Must be EXACT. Leave the newline in because it
//protects against the off chance that it is in the message body somewhere.
//This way it will only match if it's at the beginning of the line.
#define CHECKSTRING "\nX-Spam-Flag: YES"

int main(void);
int dump_full_message();
int write_message(char *buffer,int len,int fd);
int read_message(char *buffer);
int scan_message(char *buffer,int len);

int main(void){
    //Our faithful buffer
    char buffer[BUFFERSIZE];
    //What file should we write to if it's spam.
    const char *spampipe = "/dev/null";
    int len, fd = STDOUT_FILENO, exit_status = EXIT_NOMATCH;

    //We only want to scan our message once. It's unlikely there are more
    //than 4K(BUFFERSIZE) of headers. By scanning once, this lets us trash
    //the rest of the output if it's spam. Also prevents scanning a huge
    //non-matching mail-message-attachment.
    len = read_message(buffer);
    if (len){
        if( scan_message(buffer,len) == EXIT_MATCH){
            if( (fd = open(spampipe, O_WRONLY)) == -1){
                perror("Cannot open spam pipe...will write to stdout");
                fd = STDOUT_FILENO;
            }
            exit_status= EXIT_MATCH;
        }
    }
    write_message(buffer,len,fd);

    //Tight read/write for just piping the data. After the first BUFFERSIZE
    //characters we should already have what we need and just pass it on in
    //the queue.
    do{
        len = read_message(buffer);
        if (len){
            write_message(buffer,len,fd);
        }
    }while(len > 0);
    close(fd);
    return exit_status;
}

int scan_message(char * buffer,int len){
    char *spamptr,*bufptr;
    int count = 0;
    const char *spamflag = CHECKSTRING;

    spamptr = spamflag;
    bufptr = buffer;
    for(count =0 ; count<len ; count++,bufptr++){
        //Test it
        if (*bufptr != *spamptr)
            spamptr = spamflag;
        if (*bufptr == *spamptr)
            spamptr++;
        //We've hit a match
        if (*spamptr == '\0'){
            return EXIT_MATCH;
        }
    }
    return EXIT_NOMATCH;
}

//Returns 0 once the other end of the pipe has been closed (end of data).
int read_message(char *buffer){
    int len;

    len = read(STDIN_FILENO,buffer,BUFFERSIZE-1);
    if (len < 0){
        perror("Read Error");
        exit(EXIT_NOMATCH);
    }
    return len;
}

//Works almost like write() except checks for errors.
int write_message(char *buffer,int len,int fd){
    int ret;

    ret = write(fd,buffer,len);
    if (ret < 0){
        perror("Write Error");
        exit(EXIT_NOMATCH);
    }
    return ret;
}

--
-EB
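For the original question, the relevant detail in the program above is that read() returns 0 once the writing side of the pipe has been closed; that return value is the end-of-data signal. Below is a minimal sketch of just that loop, not Eric's code. The helper name copy_until_eof and the 4096-byte buffer are assumptions made for illustration, and the retry on EINTR and the handling of short writes are additions the program above does not have.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not from the thread): copy everything from in_fd
 * to out_fd. Returns the number of bytes copied, or -1 on error. */
long copy_until_eof(int in_fd, int out_fd)
{
    char buf[4096];
    long total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            break;              /* end of data: the writer closed the pipe */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, try again */
            return -1;
        }
        /* write() may accept fewer bytes than requested, so loop */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
    return total;
}
```

In a filter such as "cat stuff | mytool" this would be called as copy_until_eof(STDIN_FILENO, STDOUT_FILENO); the loop ends when cat exits and its end of the pipe is closed.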
* Re: how to find the end of piped data?
2004-09-14 6:06 ` Eric Bambach
@ 2004-09-14 8:10 ` Charlie Gordon
2004-09-14 8:45 ` Richard Sammet
1 sibling, 0 replies; 5+ messages in thread
From: Charlie Gordon @ 2004-09-14 8:10 UTC
To: linux-c-programming
Dear Eric,

Here are a few remarks on your email scanner:

- Why do you bother to write matching streams to /dev/null? If you mean to
  discard them, just do that; do not write the stuff anywhere. You still
  need to read stdin all the way to EOF in case it is a pipe, or you will
  crash the piping program with a broken pipe signal. If you mean to stash
  the spam somewhere (a real file), you probably should append to the
  'spampipe' file, not just overwrite it.

- Your do/while loop is really a while loop. do/while loops are error prone
  and should be avoided. In this example, I would write:

    while ((len = read_message(buffer)) > 0) {
        if (!discard)
            write_message(buffer, len, fd);
    }

  This saves many system calls when discarding spam. It is also one less
  test on len, and removes a bug opportunity, should you decide to return
  negative values for len in read_message.

scan_message should take the buffer as const char * to emphasize the fact
that it doesn't modify it.

Your implementation of scanning for a string in a buffer has a potential
flaw: when the comparison fails, you restart from the failure point instead
of backtracking to one character past the start of the partial match. This
happens to work for your CHECKSTRING but will fail for certain patterns,
e.g.:

    CHECKSTRING = "ABAC"
    buffer contents: "...xABABAC..."

will not match with your scanner.

I suggest you go with a more straightforward method involving memchr and
memcmp. Both functions are implemented inline and will likely perform even
better than your loop. Otherwise, look up the Boyer-Moore algorithm for
searching for a pattern in a buffer, and find a good implementation of this
tricky but vastly more efficient method.

Why do you read BUFFERSIZE-1 bytes into the buffer? I suspect you used
strstr() in a previous version of this program to scan for the pattern...
How did that perform with the same buffer size?

Regarding performance, while I agree that handling large buffers at a time
is usually more efficient than using getchar() and relying on stdio
buffering, my experience is that a well-chosen buffer size is the key. A
multiple of the file system block size, or of the OS's page size, is a
better choice, and care must be taken to keep I/O aligned on such
boundaries. So I would suggest you use either 4K or 8K instead of 6143.

Regarding CHECKSTRING starting with \n: you definitely want to force the
match at the beginning of a line, and it is very unlikely the spam flag
will have been prepended to the file, so not matching the first line is not
a problem, but it should be documented.

The main issue with the way you scan your emails is that you don't stop at
the end of the header section. This causes unnecessary scanning of some or
all of the message body, and more importantly causes messages that contain
the CHECKSTRING in the body only to match! For example, you are searching
for the following string:

X-Spam-Flag: YES

Bang! This is a match! A good thing there was more than 6K of headers and
blabber above this line!

Here is my proposal for scan_message:

/* needs <string.h> for strlen, memcmp and memchr */
int scan_message (const char *buffer, int size)
{
    const char *spamflag = "X-Spam-Flag: YES";
    int len = strlen(spamflag);
    const char *bufptr, *buflast;

    if (size >= len) {
        bufptr = buffer;
        buflast = buffer + size - len;
        for (;;) {
            // match at the beginning of a header line
            if (!memcmp(bufptr, spamflag, len))
                return EXIT_MATCH;
            bufptr = memchr(bufptr, '\n', buflast - bufptr);
            if (!bufptr)
                break;
            bufptr++; // skip \n
            if (*bufptr == '\n' || (*bufptr == '\r' && bufptr[1] == '\n')) {
                break; // end of email header
            }
        }
    }
    return EXIT_NOMATCH;
}

Cheers,
Chqrlie.

PS: Please do not use tabs in source files. You can indent by 2 spaces if
you want, but do not set tabs to anything other than 8 spaces.
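The memchr/memcmp approach suggested above can also be written as a general substring search over a byte buffer, which makes the "ABAC" failure case easy to check. This is only a sketch under stated assumptions: the name find_in_buffer and its signature are invented for illustration and appear nowhere in the thread. After a failed comparison the search resumes one byte past the candidate start, so no overlapping match is skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the thread): return a pointer to the first
 * occurrence of pat[0..patlen) inside buf[0..buflen), or NULL if absent. */
const char *find_in_buffer(const char *buf, size_t buflen,
                           const char *pat, size_t patlen)
{
    assert(buf != NULL && pat != NULL);
    if (patlen == 0)
        return buf;             /* empty pattern matches at the start */
    if (buflen < patlen)
        return NULL;            /* pattern cannot fit */

    const char *p = buf;
    const char *last = buf + (buflen - patlen);  /* last possible start */

    while (p <= last) {
        /* jump to the next byte equal to the first pattern byte */
        p = memchr(p, pat[0], (size_t)(last - p) + 1);
        if (p == NULL)
            return NULL;
        if (memcmp(p, pat, patlen) == 0)
            return p;
        p++;    /* resume one byte after this candidate start */
    }
    return NULL;
}
```

With the buffer "xABABAC" and the pattern "ABAC", the first candidate at offset 1 fails the memcmp, the search resumes at offset 2, and the match is found at offset 3.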
* Re: how to find the end of piped data?
2004-09-14 6:06 ` Eric Bambach
2004-09-14 8:10 ` Charlie Gordon
@ 2004-09-14 8:45 ` Richard Sammet
1 sibling, 0 replies; 5+ messages in thread
From: Richard Sammet @ 2004-09-14 8:45 UTC
To: eric; +Cc: linux-c-programming
k, thx, it works fine...
e-axe

Eric Bambach wrote:
> Yea, heres a small program I wrote that works exactly the same way. WIth piped
> data. It just scans STDIN to match to a pattern in the input stream. If it
> finds it, it pipes to /dev/null, if not, it pipes to stdout. Notice the
> read()/write() combo with a buffer. This is MUCH faster than getchar()
> method. Hope it helps.
>
> int read_message(char *buffer){
>     int len;
>     len = read(STDIN_FILENO,buffer,BUFFERSIZE-1);
>     if (len < 0){
>         perror("Read Error");
>         exit(EXIT_NOMATCH);
>     }
>     return len;
* Re: how to find the end of piped data?
2004-09-13 11:36 how to find the end of piped data? Richard Sammet
2004-09-14 6:06 ` Eric Bambach
@ 2004-09-14 9:03 ` Charlie Gordon
1 sibling, 0 replies; 5+ messages in thread
From: Charlie Gordon @ 2004-09-14 9:03 UTC
To: linux-c-programming
Dear Richard,

The C library I/O functions are agnostic about whether stdin data is coming
from a file, a pipe or a character device. getchar() will return EOF upon
end of file, closing of the pipe by the piping process, or end of stream.

Here is how you should write your scanner:

void scanin()
{
    int c;              /* int, not char, so EOF can be told apart from data */
    char tmpkey[1024];  /* or whatever size you want */
    int tmpcnt = 0;

    while ((c = getchar()) != EOF) {
        if (c == '\n') {
            tmpkey[tmpcnt] = '\0';
            /* perform whatever treatment you want on a line of input */
            handle_one_line(tmpkey);
            tmpcnt = 0;
        } else {
            if (tmpcnt < (int)sizeof(tmpkey) - 2) {
                tmpkey[tmpcnt++] = c;
            }
        }
    }
}

You can also use fgets() and check for NULL as the indication for end of
stream, but you will get a slightly different behaviour on truncated lines.

Cheers,
Chqrlie.

> main(){int
> y=0,x;while(y!=6){x=(y==0)?101:((y==1)?45:((y==2)?97:((y==3)?120:((y==4)?101:10))));putchar(x);y++;}}

You can simplify this further: one-liners should be less than 80
characters!

main(){int y=0;while(++y<7)putchar(y<2?101:y<3?45:y<4?97:y<5?120:y<6?101:10);}

You can reduce this even further ;-)
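The fgets() alternative mentioned above could look roughly like the sketch below. The helper name count_lines, the FILE * parameter and the 1024-byte buffer are assumptions for illustration. It shows the "different behaviour on truncated lines": a line longer than the buffer arrives in several chunks, and only the last chunk ends in '\n'. When fgets() returns NULL, feof() and ferror() tell end of stream apart from a read error.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the thread): count the lines on a stream,
 * reading until fgets() returns NULL. A final line without a trailing
 * newline is still counted. */
long count_lines(FILE *in)
{
    char line[1024];
    long lines = 0;
    int at_line_start = 1;      /* is the next chunk the start of a line? */

    assert(in != NULL);
    while (fgets(line, sizeof line, in) != NULL) {
        size_t n = strlen(line);

        if (at_line_start)
            lines++;
        /* a chunk without '\n' means the same line continues */
        at_line_start = (n > 0 && line[n - 1] == '\n');
    }
    /* here feof(in) means end of stream, ferror(in) means a read error */
    return lines;
}
```

For piped input this would be called as count_lines(stdin); it returns once the piping process closes its end.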