linux-c-programming.vger.kernel.org archive mirror
* Pattern matching programming
@ 2005-05-18 17:36 fabio
  2005-05-18 20:09 ` Fabrizio Sestito
  2005-05-20 20:36 ` Glynn Clements
  0 siblings, 2 replies; 4+ messages in thread
From: fabio @ 2005-05-18 17:36 UTC (permalink / raw)
  To: linux-c-programming

Hello,

I am trying to write a small C program that processes a long text file
of data exported from a MySQL server.

But I realize it is better to use regular expressions. This is an example
of the text:

=1 <p> blah </p> <div foo>{$foobar}</div>blah.... <p>linux rulez</p>
misc characters.... =2 blah blah <p> linux rulez again</p>.... <p>foo</p?blah


And so on.

The patterns are:

Each record is marked by an equals sign followed by its number. E.g.,
record 1 is "=1", record 2 is "=2", and so on.

The desired text (the part containing "linux rulez") is inside the FIRST
<p> </p> pair AFTER a record marker.

So I see that writing a program for this makes no sense, because it is
better to use sed and awk.

The result I want to have is something like:

1 linux rulez
2 linux rulez again
3 linux rulez so far
...etc

The idea is to eliminate all the <div> tags, then get the record numbers
(maybe with awk -F"="), then get the next <p> tag, strip the tags
themselves, print the number followed by the text, and repeat the same
procedure for all 65230 records.

Thanks a lot for any comments, and sorry for the off-topic post.

Kind regards,

fabio





* Re: Pattern matching programming
  2005-05-18 17:36 Pattern matching programming fabio
@ 2005-05-18 20:09 ` Fabrizio Sestito
  2005-05-20  1:55   ` Hareesh Nagarajan
  2005-05-20 20:36 ` Glynn Clements
  1 sibling, 1 reply; 4+ messages in thread
From: Fabrizio Sestito @ 2005-05-18 20:09 UTC (permalink / raw)
  To: linux-c-programming

Why don't you use an XML parser library?

Fabrizio


* Re: Pattern matching programming
  2005-05-18 20:09 ` Fabrizio Sestito
@ 2005-05-20  1:55   ` Hareesh Nagarajan
  0 siblings, 0 replies; 4+ messages in thread
From: Hareesh Nagarajan @ 2005-05-20  1:55 UTC (permalink / raw)
  To: fabio; +Cc: linux-c-programming

On 5/18/05, Fabrizio Sestito <lain@neotes.org> wrote:
> On Wednesday 18 May 2005 17:36, fabio@crearium.com wrote:
> > Hello,
> >
> > I am trying to code a small C program that basically takes a long text
> > file with data that comes from a mysql server.

If you know the exact syntax of the incoming text, you could hand-write
a parser. Essentially, you need to know all the states you can be in.

For example, you cannot encounter a </p> before you have seen a <p>,
and so on.

HTH,

Hareesh

PS: But you should use an existing library, as Fabrizio suggests,
instead of reinventing the wheel.



* Re: Pattern matching programming
  2005-05-18 17:36 Pattern matching programming fabio
  2005-05-18 20:09 ` Fabrizio Sestito
@ 2005-05-20 20:36 ` Glynn Clements
  1 sibling, 0 replies; 4+ messages in thread
From: Glynn Clements @ 2005-05-20 20:36 UTC (permalink / raw)
  To: fabio; +Cc: linux-c-programming


fabio@crearium.com wrote:

> I am trying to code a small C program that basically takes a long text
> file with data that comes from a mysql server.
> 
> But I realize It is better to use regular expression. This is an examples
> of the text:
> 
> =1 <p> blah </p> <div foo>{$foobar}</div>blah.... <p>linux rulez</p>
> misc characters.... =2 blah blah <p> linux rulez again</p>.... <p>foo</p?blah

Don't try to parse anything as complex as HTML/SGML/XML using an
ad-hoc set of rules; use an existing library.

More generally, don't try to perform any non-trivial parsing without
familiarising yourself with the core theoretical concepts behind
formal grammars. A reasonable starting point is:

	http://en.wikipedia.org/wiki/Formal_grammar

If you can't locate an existing library for the language in question,
and you are using C or C++, the appropriate solution is almost
invariably to use lex (or flex) and yacc (or bison) to generate the
parser.

Using an ad-hoc approach (i.e. hand-coding a parser using the
functions in <regex.h> or, worse still, <string.h> or hand-coded
equivalents) is a recipe for producing a parser that is at worst
incorrect (i.e. will produce the wrong result in some cases) and at
best inefficient (it isn't hard to end up with a parser which is a
hundred times slower than an optimal lex-generated one).

-- 
Glynn Clements <glynn@gclements.plus.com>
