RE: Newbie - Perl Equivalent Split

linux-c-programming.vger.kernel.org archive mirror

* RE: Newbie - Perl Equivalent Split - Seg Faults
  2004-12-13 21:33 Huber, George K RDECOM CERDEC STCD SRI
@ 2004-11-16 21:20 ` J.
  0 siblings, 0 replies; 5+ messages in thread
From: J. @ 2004-11-16 21:20 UTC (permalink / raw)
  To: linux-c-programming

On Mon, 13 Dec 2004, Huber, George K RDECOM CERDEC STCD SRI wrote:

> Darren wrote:
> 
> >Oh.. I almost forgot, if I place the strtok in main instead of calling it as
> >a function in split_char - it works.
> 
> > Here is a super simple program that (trys) to split a charvar based on a
> > delimiter. I get no compile errors. If I remove the strtok line, then
> > split_var returns the string passed to it from main just fine.
> > 
> > I tried changing the char *delim from *delim to delim[50] - same problem.
> > 
> > This is something stupid, and probably super simple. Coming from the Perl
> > world, I'm trying to write some equivalent string manipulation functions
> > that I can use throughout my programs to avoid repetition and make the code
> > cleaner and easier to read.
> > 
> > #include <stdio.h>
> > #include <string.h>
> > 
> > char *split_char(char *string, char *delim) {
> >   fprintf( stderr, "\tString = %s \n", string);
> >   fprintf( stderr, "\tDelimiter = %s \n", delim);
> >   string = strtok(string, delim);
> >   return string;
> > }
> > 
> > int main()
> > {
> >   char *testvar;
> >   testvar = split_char("test-hello", "-");
> >   fprintf( stderr, "\tArray = %s \n", testvar);
> >   return(0);
> > }
> > 
> 
> First of all, this is not a silly question.  You are going to find that string processing
> in C is very painfull when compared to PERL.  After all PERL as initially designed as 
> a text processing language, C was designed as a general purpose language.
> 
> You probably want to take a look at how strtok works.  It is a `destructive' function
> call in that it actually modifies the string that is being tokenized.  
> 
> As an example, consider tokenizing the string `test-case-hello'.  After the first
> call to strtok (i.e. strtok(string, delim), with string containing "test-case-hello" 
> and delim containing "-") we have the following situation.  strtok has replaced the 
> first occurance of the deliminator with a null ('\0') character, returns a pointer to the 
> first token and moves it internal pointer to the character after the old deliminator.
> 
>   return value
>   |        internal pointer
>   |        |
>   \/       \/
>   t e s t\0case-hello\0
> 
> on the next call to strtok (i.e. strtok(NULL, delim) -- not the value of the first 
> parameter.  NULL is used to signify that we are continuing to tokenize the first string),
> we start from the internal pointer and search forward to find the next deliminator.  Now
> this is replaced with a null and the address of the token is returned and the internal 
> pointer is moved to the first character after the old deliminator.  This process continues
> until no more delimitors are found and then strtok returns null.
> 
> At this point, the original string now looks like this:
> 
>   t e s t\0c a s e\0h e l l o\0
> 
> When I need to split a line using strtok I typically do something like this.  Note, I 
> have not attempted to compile this, it should work (barring typos) - but no guarentees.
> 
> char* string="test-case-hello"
> char* delim="-"
> int   idx = 0;
> 
>     these creates my sting and delimators and an index value
> 
> char* tokens[MAX_TOKENS];
> 
>     this creates an array of pointers to characters.  In production code you would need
>     to use some sort of dynamic data structure since you could not know the number of 
>     tokens in a line in advance
> 
> char*  token;
>     
>     this is a pointer to a character (which in C can also be a pointer to a string).
> 
> for(idx = 0; idx < MAX_TOKENS; idx++)
>     tokens[idx] = NULL;
> 
>     this is used to initialize each pointer in the array to NULL.
> 
> token = strtok(string, delim);
> idx = 0;
> 
>     perform the first tokenization and reset the index value.
> 
> while((idx < MAX_TOKENS) && (token != NULL) 
> {
>     tokens[idx] = token;   
>     idx++;
>     strtok(NULL, delim);
> }
> 
>    this loops through the string, pulling out each token or until the max number of 
>    tokens has been found (do not want to overflow a buffer now do we?).  Now at 
>    this point the array `tokens' contains the various tokens.
> 
> idx = 0;
> while((idx < MAX_TOKENS) && (tokens[idx] != 0)
> {
>     printf("%s ", tokens[nIdx]);
>     idx++;
> }
> 
> printf("\n");
> 
>     this loops through the array, stopping when index reaches MAX_TOKEN of a token
>     is NULL, printing each token in turn.
> 
> So now armed with this to write an equivalent `split' function in C you need to create 
> a function that takes a string (or a character array in C-speak) and another string as 
> arguments and returns an array of pointers to string.  A first pass might be:
> 
> char** split(char* string, char* delim)
> {
>    char** tokens = malloc(sizeof(char*) * MAX_TOKENS);
>    char*  working = malloc((strlen(string) + 1) * sizeof(char));
>    char*  token;  
>    int    idx;
> 
>    /* insure that malloc worked.... */
>    if((NULL != working) && (NULL != tokens))
>    {   
>        strcpy(working, string);        /* make a working copy of the string */
> 
>        for(idx = 0; idx < MAX_TOKENS; idx++)
>           tokens[idx] = NULL;
> 
>        token = strtok(working, delim);
>        idx = 0;
> 
>        while((idx < MAX_TOKENS) && (token != NULL) 
>        {
>             tokens[idx] = malloc(sizeof(char) * strlen(token);
>             strcpy(tokens[idx], token);  
>             idx++;
>             token = strtok(NULL, delim);
>        }
> 
>        free(working);
>    }
> 
>    return tokens;
> }
> 
> You would use this function like,
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> 
> char** split(char*, char*);
> 
> int main(int argc, char** argv)
> {
>     char*   text="this-is-a-testing";
>     char*   delim="-";
>     char**  tokens;
>     int     i = 0;
> 
>     tokens = split(text,delim);
> 
>     while(NULL != tokens[i])
>     {
>          printf("token %d is %s\n", i, tokens[i]);
>          i++;
>     }
> 
>     /* NOTE : missing clean up of tokens.  This program leaks 
>               memory like a sieve */
> 
>     return 0;
> }
> 
> The program can be compiled as (assuming everything is in a file called split.c):
> 
> gcc -ansi -pedantic -Wall split.c -o my_split
> 
> and produces the following output:

Error, error.. because you missed a couple ')' and a define...

Sorry.. Just couldn't resist the reply.. ;-) After reading all your
keyboard bashing ;-) Anywaysss.. May tha peace be with ya..

> George

J.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Newbie - Perl Equivalent Split - Seg Faults
@ 2004-12-13 16:56 Darren Sessions
  2004-12-13 17:10 ` Darren Sessions
  2004-12-13 20:21 ` Jan-Benedict Glaw
  0 siblings, 2 replies; 5+ messages in thread
From: Darren Sessions @ 2004-12-13 16:56 UTC (permalink / raw)
  To: linux-c-programming

Here is a super simple program that tries to split a char variable based on a
delimiter. I get no compile errors. If I remove the strtok line, then
split_char returns the string passed to it from main just fine.

I tried changing the char *delim from *delim to delim[50] - same problem.

This is something stupid, and probably super simple. Coming from the Perl
world, I'm trying to write some equivalent string manipulation functions
that I can use throughout my programs to avoid repetition and make the code
cleaner and easier to read.

Thanks in advance,

 - Darren

#include <stdio.h>
#include <string.h>

char *split_char(char *string, char *delim) {
  fprintf( stderr, "\tString = %s \n", string);
  fprintf( stderr, "\tDelimiter = %s \n", delim);
  string = strtok(string, delim);
  return string;
}

int main()
{
  char *testvar;
  testvar = split_char("test-hello", "-");
  fprintf( stderr, "\tArray = %s \n", testvar);
  return(0);
}

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Newbie - Perl Equivalent Split - Seg Faults
  2004-12-13 16:56 Newbie - Perl Equivalent Split - Seg Faults Darren Sessions
@ 2004-12-13 17:10 ` Darren Sessions
  2004-12-13 20:21 ` Jan-Benedict Glaw
  1 sibling, 0 replies; 5+ messages in thread
From: Darren Sessions @ 2004-12-13 17:10 UTC (permalink / raw)
  To: linux-c-programming

Oh.. I almost forgot, if I place the strtok in main instead of calling it as
a function in split_char - it works.

Thanks,

 - Darren


On 12/13/04 11:56 AM, "Darren Sessions" <dsessions@ionosphere.net> wrote:

> Here is a super simple program that (trys) to split a charvar based on a
> delimiter. I get no compile errors. If I remove the strtok line, then
> split_var returns the string passed to it from main just fine.
> 
> I tried changing the char *delim from *delim to delim[50] - same problem.
> 
> This is something stupid, and probably super simple. Coming from the Perl
> world, I'm trying to write some equivalent string manipulation functions
> that I can use throughout my programs to avoid repetition and make the code
> cleaner and easier to read.
> 
> Thanks in advance,
> 
>  - Darren
> 
> 
> 
> 
> #include <stdio.h>
> #include <string.h>
> 
> char *split_char(char *string, char *delim) {
>   fprintf( stderr, "\tString = %s \n", string);
>   fprintf( stderr, "\tDelimiter = %s \n", delim);
>   string = strtok(string, delim);
>   return string;
> }
> 
> int main()
> {
>   char *testvar;
>   testvar = split_char("test-hello", "-");
>   fprintf( stderr, "\tArray = %s \n", testvar);
>   return(0);
> }



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Newbie - Perl Equivalent Split - Seg Faults
  2004-12-13 16:56 Newbie - Perl Equivalent Split - Seg Faults Darren Sessions
  2004-12-13 17:10 ` Darren Sessions
@ 2004-12-13 20:21 ` Jan-Benedict Glaw
  1 sibling, 0 replies; 5+ messages in thread
From: Jan-Benedict Glaw @ 2004-12-13 20:21 UTC (permalink / raw)
  To: linux-c-programming

[-- Attachment #1: Type: text/plain, Size: 2710 bytes --]

On Mon, 2004-12-13 11:56:25 -0500, Darren Sessions <dsessions@ionosphere.net>
wrote in message <BDE333E9.1622%dsessions@ionosphere.net>:
> #include <stdio.h>
> #include <string.h>
> 
> char *split_char(char *string, char *delim) {
>   fprintf( stderr, "\tString = %s \n", string);
>   fprintf( stderr, "\tDelimiter = %s \n", delim);
>   string = strtok(string, delim);

Rule of thumb: don't use strtok(), because it internally maintains some
parsing state between calls. Even if you (right now) only have a
single-threaded program which probably wouldn't suffer from strtok()'s
limitations, you don't know whether your code will--at some time--be
used in a threaded environment. Just use strtok_r() instead. ...and even
if you use strtok_r(), try to avoid it :-)
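One place where the hidden state matters even without threads is nested tokenizing: an inner strtok() loop would overwrite the state of the outer one. The following is only a sketch of how strtok_r() avoids that by keeping the state in a caller-supplied pointer. The function name print_pairs() and the "key=value;key=value" input format are made up for illustration, and strtok_r() is POSIX rather than ISO C.

```c
#define _POSIX_C_SOURCE 200809L   /* for strtok_r() under strict -std= modes */
#include <stdio.h>
#include <string.h>

/* Splits "k1=v1;k2=v2" in place and prints each pair.  Returns the number
 * of pairs that had both a key and a value.  buf must be writable. */
int print_pairs(char *buf)
{
    char *outer_state;   /* parsing state for the ';' level */
    char *inner_state;   /* parsing state for the '=' level */
    char *pair;
    int   count = 0;

    for (pair = strtok_r(buf, ";", &outer_state);
         pair != NULL;
         pair = strtok_r(NULL, ";", &outer_state))
    {
        /* The inner calls use their own state, so the outer loop can resume. */
        char *key   = strtok_r(pair, "=", &inner_state);
        char *value = strtok_r(NULL, "=", &inner_state);

        if (key != NULL && value != NULL) {
            printf("%s -> %s\n", key, value);
            count++;
        }
    }
    return count;
}
```

Like strtok(), this still writes '\0' bytes into the buffer, so it has to be called on an array or other writable memory, never on a string literal.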

Both strtok_r() and strtok() modify the string you supplied. While this
is acceptable for some uses, it'll break from time to time (as in this
example). This is why the man page actually warns about using these
functions.

...but what happens here, why does it break? Well, that's easy. Keeping
in mind that this function actually modifies the supplied string, this
is actually where it segfaults...

>   return string;
> }
> 
> int main()
> {
>   char *testvar;
>   testvar = split_char("test-hello", "-");

...and it segfaults because you supply "test-hello" right here, exactly
the way you do it.

If you put some "sdfkjhsdf" constant somewhere, or if you have a

char *string = "some text";

the compiler is allowed to assume that these strings are never ever
modified, though (in the 2nd example) the *pointer* to the string may
change. So gcc knows that "test-hello" won't ever be modified and puts
it into a segment of memory that gets configured as "modify forbidden".
When split_char() calls strtok(), the latter tries to modify the
string (it replaces '-' with '\0'), which results in a segmentation
violation. So you need to force the compiler into laying out your text
as a modifiable string. You can do this by:

char test_hello[] = "test-hello";
...
testvar = split_char (test_hello, "-");
...

Notice that this time, I didn't declare a pointer to a string (char *),
but an array of chars (char []). That is what makes the difference
between "modify forbidden" and "modify allowed".

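Put together, the array-based fix might look like the sketch below. split_char() keeps the shape of the function from the original post, minus the debug output. demo() is a name invented here, just to show the call on writable memory.

```c
#include <stdio.h>
#include <string.h>

/* Returns the first token of string.  The caller must pass writable
 * memory, because strtok() writes a '\0' over the delimiter it finds. */
char *split_char(char *string, const char *delim)
{
    return strtok(string, delim);
}

/* Calls split_char() on an array, which is a writable copy of the literal. */
void demo(void)
{
    char test_hello[] = "test-hello";   /* array: initialised copy, writable */
    /* char *bad = "test-hello";   pointer to a literal: must not be modified */

    printf("first token = %s\n", split_char(test_hello, "-"));
}
```

Calling demo() should print "first token = test". Swapping in the commented-out pointer is what leads to the crash described above.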
>   fprintf( stderr, "\tArray = %s \n", testvar);
>   return(0);
> }

MfG, JBG

-- 
Jan-Benedict Glaw       jbglaw@lug-owl.de    . +49-172-7608481             _ O _
"Eine Freie Meinung in  einem Freien Kopf    | Gegen Zensur | Gegen Krieg  _ _ O
 fuer einen Freien Staat voll Freier Bürger" | im Internet! |   im Irak!   O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: Newbie - Perl Equivalent Split - Seg Faults
@ 2004-12-13 21:33 Huber, George K RDECOM CERDEC STCD SRI
  2004-11-16 21:20 ` J.
  0 siblings, 1 reply; 5+ messages in thread
From: Huber, George K RDECOM CERDEC STCD SRI @ 2004-12-13 21:33 UTC (permalink / raw)
  To: linux-c-programming

Darren wrote:

>Oh.. I almost forgot, if I place the strtok in main instead of calling it as
>a function in split_char - it works.

> Here is a super simple program that (trys) to split a charvar based on a
> delimiter. I get no compile errors. If I remove the strtok line, then
> split_var returns the string passed to it from main just fine.
> 
> I tried changing the char *delim from *delim to delim[50] - same problem.
> 
> This is something stupid, and probably super simple. Coming from the Perl
> world, I'm trying to write some equivalent string manipulation functions
> that I can use throughout my programs to avoid repetition and make the code
> cleaner and easier to read.
> 
> #include <stdio.h>
> #include <string.h>
> 
> char *split_char(char *string, char *delim) {
>   fprintf( stderr, "\tString = %s \n", string);
>   fprintf( stderr, "\tDelimiter = %s \n", delim);
>   string = strtok(string, delim);
>   return string;
> }
> 
> int main()
> {
>   char *testvar;
>   testvar = split_char("test-hello", "-");
>   fprintf( stderr, "\tArray = %s \n", testvar);
>   return(0);
> }
> 

First of all, this is not a silly question.  You are going to find that string processing
in C is very painful when compared to Perl.  After all, Perl was initially designed as
a text processing language, whereas C was designed as a general purpose language.

You probably want to take a look at how strtok works.  It is a `destructive' function
call in that it actually modifies the string that is being tokenized.  

As an example, consider tokenizing the string `test-case-hello'.  After the first
call to strtok (i.e. strtok(string, delim), with string containing "test-case-hello"
and delim containing "-") we have the following situation.  strtok has replaced the
first occurrence of the delimiter with a null ('\0') character, returned a pointer to the
first token and moved its internal pointer to the character after the old delimiter.

  return value
  |        internal pointer
  |        |
  \/       \/
  t e s t\0case-hello\0

On the next call to strtok (i.e. strtok(NULL, delim) -- note the value of the first
parameter.  NULL is used to signify that we are continuing to tokenize the first string),
we start from the internal pointer and search forward to find the next delimiter.  Now
this is replaced with a null, the address of the token is returned, and the internal
pointer is moved to the first character after the old delimiter.  This process continues
until no more tokens are found, and then strtok returns NULL.

At this point, the original string now looks like this:

  t e s t\0c a s e\0h e l l o\0
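To see this for yourself, something along the following lines should do. It is only a sketch, and the helper names tokenize_all() and print_raw() are invented for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* Runs strtok() over buf until it returns NULL; returns the token count.
 * buf must be writable, since strtok() overwrites delimiters with '\0'. */
int tokenize_all(char *buf, const char *delim)
{
    int   count = 0;
    char *token = strtok(buf, delim);

    while (token != NULL) {
        count++;
        token = strtok(NULL, delim);
    }
    return count;
}

/* Prints len bytes of buf, showing each embedded null as the two
 * characters \0 so the overwritten delimiters become visible. */
void print_raw(const char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == '\0')
            fputs("\\0", stdout);
        else
            putchar(buf[i]);
    }
    putchar('\n');
}
```

Declare char buf[] = "test-case-hello"; and call tokenize_all(buf, "-") followed by print_raw(buf, sizeof buf). Both delimiters should show up as \0 in the output, next to the terminator that was already there.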

When I need to split a line using strtok I typically do something like this.  Note, I
have not attempted to compile this, it should work (barring typos) - but no guarantees.

char  string[] = "test-case-hello";
char* delim = "-";
int   idx = 0;

    this creates my string, delimiter and an index value.  The string is declared as
    an array rather than a pointer to a literal, because strtok has to modify it.

char* tokens[MAX_TOKENS];

    this creates an array of pointers to characters (MAX_TOKENS is a constant you
    #define yourself).  In production code you would need to use some sort of dynamic
    data structure since you could not know the number of tokens in a line in advance.

char*  token;

    this is a pointer to a character (which in C can also be a pointer to a string).

for(idx = 0; idx < MAX_TOKENS; idx++)
    tokens[idx] = NULL;

    this is used to initialize each pointer in the array to NULL.

token = strtok(string, delim);
idx = 0;

    perform the first tokenization and reset the index value.

while((idx < MAX_TOKENS) && (token != NULL))
{
    tokens[idx] = token;
    idx++;
    token = strtok(NULL, delim);
}

   this loops through the string, pulling out tokens until none are left or the max
   number of tokens has been found (do not want to overflow a buffer now do we?).  Now at
   this point the array `tokens' contains the various tokens.

idx = 0;
while((idx < MAX_TOKENS) && (tokens[idx] != NULL))
{
    printf("%s ", tokens[idx]);
    idx++;
}

printf("\n");

    this loops through the array, stopping when the index reaches MAX_TOKENS or a token
    is NULL, printing each token in turn.
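The fragments above could be gathered into one small helper roughly like this. Treat it as a sketch: the name collect_tokens() and its parameter list are made up here. The caller passes the array size, so no particular MAX_TOKENS value is assumed.

```c
#include <stddef.h>
#include <string.h>

/* Tokenizes line in place and stores up to max_tokens pointers into tokens[].
 * The stored pointers point into line itself, so line must be writable and
 * must outlive the tokens.  Returns the number of tokens stored. */
int collect_tokens(char *line, const char *delim, char *tokens[], int max_tokens)
{
    int   idx = 0;
    char *token = strtok(line, delim);

    while ((idx < max_tokens) && (token != NULL)) {
        tokens[idx] = token;
        idx++;
        token = strtok(NULL, delim);   /* fetch the next token, if any */
    }
    return idx;
}
```

Returning the count means the caller does not have to pre-fill the array with NULL and scan for the end afterwards.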

So now, armed with this, to write an equivalent `split' function in C you need to create
a function that takes a string (or a character array in C-speak) and another string as
arguments and returns an array of pointers to strings.  A first pass might be:

char** split(char* string, char* delim)
{
   /* one extra slot so the array is always NULL-terminated */
   char** tokens = malloc(sizeof(char*) * (MAX_TOKENS + 1));
   char*  working = malloc(strlen(string) + 1);
   char*  token;
   int    idx;

   /* ensure that malloc worked.... */
   if((NULL == working) || (NULL == tokens))
   {
       free(working);
       free(tokens);
       return NULL;
   }

   strcpy(working, string);        /* make a working copy of the string */

   for(idx = 0; idx <= MAX_TOKENS; idx++)
      tokens[idx] = NULL;

   token = strtok(working, delim);
   idx = 0;

   while((idx < MAX_TOKENS) && (token != NULL))
   {
        /* +1 leaves room for the terminating '\0' */
        tokens[idx] = malloc(strlen(token) + 1);
        if(NULL == tokens[idx])
            break;                 /* out of memory: return what we have */
        strcpy(tokens[idx], token);
        idx++;
        token = strtok(NULL, delim);
   }

   free(working);
   return tokens;
}

You would use this function like,

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TOKENS 32    /* arbitrary limit for this example */

char** split(char*, char*);

int main(int argc, char** argv)
{
    char*   text="this-is-a-testing";
    char*   delim="-";
    char**  tokens;
    int     i = 0;

    tokens = split(text,delim);

    while((NULL != tokens) && (NULL != tokens[i]))
    {
         printf("token %d is %s\n", i, tokens[i]);
         i++;
    }

    /* NOTE : missing clean up of tokens.  This program leaks 
              memory like a sieve */

    return 0;
}

The program can be compiled as (assuming everything is in a file called split.c):

gcc -ansi -pedantic -Wall split.c -o my_split

and produces the following output:

token 0 is this
token 1 is is
token 2 is a
token 3 is testing
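Two things the first pass leaves open are the fixed MAX_TOKENS limit and the missing cleanup. The sketch below shows one way to handle both. It is an illustration under assumptions, not a drop-in replacement: split_dynamic() and free_tokens() are names made up here, the starting capacity of 4 is arbitrary, and the tokenizing uses strspn()/strcspn() instead of strtok() so that the input string is never modified.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a NULL-terminated, heap-allocated array of heap-allocated tokens,
 * or NULL if an allocation fails.  The input string is left untouched. */
char **split_dynamic(const char *string, const char *delim)
{
    size_t      capacity = 4;          /* arbitrary starting size */
    size_t      count = 0;
    char      **tokens = malloc(capacity * sizeof *tokens);
    const char *p = string;

    if (tokens == NULL)
        return NULL;

    for (;;) {
        size_t len;
        char  *copy;

        p += strspn(p, delim);         /* skip any leading delimiters */
        if (*p == '\0')
            break;
        len = strcspn(p, delim);       /* length of the current token */

        if (count + 1 >= capacity) {   /* keep one slot free for the NULL */
            char **grown = realloc(tokens, 2 * capacity * sizeof *tokens);
            if (grown == NULL)
                goto fail;
            tokens = grown;
            capacity *= 2;
        }

        copy = malloc(len + 1);        /* +1 for the terminating '\0' */
        if (copy == NULL)
            goto fail;
        memcpy(copy, p, len);
        copy[len] = '\0';
        tokens[count++] = copy;
        p += len;
    }

    tokens[count] = NULL;
    return tokens;

fail:
    while (count > 0)
        free(tokens[--count]);
    free(tokens);
    return NULL;
}

/* Frees every string in a NULL-terminated token array, then the array. */
void free_tokens(char **tokens)
{
    size_t i;

    if (tokens == NULL)
        return;
    for (i = 0; tokens[i] != NULL; i++)
        free(tokens[i]);
    free(tokens);
}
```

Like strtok, this treats a run of delimiters as a single separator, so "a--b" yields two tokens. Perl's split would return an empty field in between, so it is not an exact equivalent. free_tokens() only works on arrays that really are NULL-terminated.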

Hope this helps and happy coding,
George

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2004-12-13 21:33 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-12-13 16:56 Newbie - Perl Equivalent Split - Seg Faults Darren Sessions
2004-12-13 17:10 ` Darren Sessions
2004-12-13 20:21 ` Jan-Benedict Glaw
  -- strict thread matches above, loose matches on Subject: below --
2004-12-13 21:33 Huber, George K RDECOM CERDEC STCD SRI
2004-11-16 21:20 ` J.
