* Handling large files
From: Anindya Mozumdar @ 2005-04-22 17:03 UTC
To: linux-c-programming
Hi,
Recently I was dealing with large CSV (comma-separated value)
files, around 500 MB in size.

I was using Perl to parse such files; reading a file and
duplicating it with the CSV module took around 40 minutes.
Python's csv module took an hour. I am sure that even if I had
written C code to open and parse the file, it would have taken a
long time.

However, when I used MySQL to create a database from the file, the
entire load took around 2 minutes. I would like to know how this is
possible - is it threading, memory mapping, or some good algorithm?

I would be thankful to anyone who can give me a good answer, as I
can't think of a way to solve the problem myself.
Anindya.
* Re: Handling large files
From: Glynn Clements @ 2005-04-22 19:20 UTC
To: Anindya Mozumdar; +Cc: linux-c-programming
Anindya Mozumdar wrote:
> Recently I was dealing with large CSV (comma-separated value)
> files, around 500 MB in size.
>
> I was using Perl to parse such files; reading a file and
> duplicating it with the CSV module took around 40 minutes.
> Python's csv module took an hour. I am sure that even if I had
> written C code to open and parse the file, it would have taken a
> long time.
>
> However, when I used MySQL to create a database from the file, the
> entire load took around 2 minutes. I would like to know how this is
> possible - is it threading, memory mapping, or some good algorithm?
Good algorithm, probably.
When done correctly, parsing uses minimal CPU resources. Unless you
have a very fast hard disk or a very slow CPU, you should be able to
parse that CSV file as fast as you can read it from disk without
significantly loading the CPU.
The standard mechanism for parsing according to a regular grammar is
to use a deterministic finite automaton. The CSV format can be
described by a regular grammar.
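For illustration, here is a minimal sketch (my own, untested against
your data, and not a complete CSV parser) of such a DFA written
directly in C. It scans the input a buffer at a time and counts
fields and records, handling quoted fields with the usual doubled-""
escape:

#include <stdio.h>

enum state { UNQUOTED, QUOTED, QUOTE_END };

int main(int argc, char **argv)
{
    static char buf[1 << 16];   /* 64 KB read buffer */
    enum state s = UNQUOTED;
    long fields = 0, records = 0;
    size_t n, i;
    FILE *fp = argc > 1 ? fopen(argv[1], "r") : stdin;

    if (!fp) {
        perror("fopen");
        return 1;
    }

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (i = 0; i < n; i++) {
            char c = buf[i];
            switch (s) {
            case UNQUOTED:          /* outside any quotes */
                if (c == '"')       s = QUOTED;
                else if (c == ',')  fields++;
                else if (c == '\n') { fields++; records++; }
                break;
            case QUOTED:            /* inside a quoted field */
                if (c == '"')       s = QUOTE_END;
                break;
            case QUOTE_END:         /* saw '"' inside a quoted field */
                if (c == '"')       s = QUOTED;  /* "" = literal quote */
                else if (c == ',')  { fields++; s = UNQUOTED; }
                else if (c == '\n') { fields++; records++; s = UNQUOTED; }
                else                s = UNQUOTED;
                break;
            }
        }
    }
    printf("%ld records, %ld fields\n", records, fields);
    return 0;
}

Each input byte is examined exactly once, and the only per-byte work
is a switch on a small enum, so the loop is limited by how fast the
data can be read rather than by the CPU.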
There is a substantial amount of theory on formal grammars and how
to parse them. Rather than go into that here, I'll provide a
link to the Wikipedia page:
http://en.wikipedia.org/wiki/Formal_grammar
The links on that page may provide useful information. Also, the
technical terms used on that page (and those it links to) will be
useful as search phrases for Google.
From the practical perspective, the usual mechanism for generating
code to parse according to a regular grammar is to use lex (or flex).
lex reads a grammar description and generates the C source code for a
scanner which recognises that grammar.
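For example, a flex description of CSV tokens might look roughly like
this (an illustrative sketch only; the counting actions and the file
name csv.l below are mine):

%{
/* Sketch of a flex specification for CSV tokens. */
#include <stdio.h>
long fields, records;
%}
%option noyywrap
%%
\"([^\"]|\"\")*\"   { fields++; }   /* quoted field, "" = escaped quote */
[^,\n\"]+           { fields++; }   /* unquoted field */
,                   { /* field separator */ }
\n                  { records++; }
%%
int main(void)
{
    yylex();
    printf("%ld records, %ld fields\n", records, fields);
    return 0;
}

Build with something like: flex csv.l && cc lex.yy.c -o csvcount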
If you're concerned about performance, use the -b switch to generate
diagnostic information regarding backing-up, then modify the grammar
as necessary so as to eliminate backing-up.
The main limitation of regular grammars is that they cannot describe
languages (formats) which allow for "nested" constructs (e.g. most
programming languages). The most common class that can handle nesting
is context-free grammars, which can be parsed using a mix of lex and
yacc (or bison).
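As a toy illustration (mine, not taken from any real format), a
yacc/bison grammar can recognise arbitrarily nested parentheses,
which no regular grammar can:

%{
/* Toy yacc/bison grammar: balanced, arbitrarily nested parentheses. */
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%%
input: /* empty */
     | input group '\n'   { puts("balanced"); }
     ;
group: '(' ')'
     | '(' group ')'
     ;
%%
int yylex(void)
{
    int c = getchar();          /* each character is its own token */
    return c == EOF ? 0 : c;
}
int main(void) { return yyparse(); }

Each extra '(' pushes a state onto the parser's stack, which is
exactly the unbounded bookkeeping a finite automaton cannot do.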
--
Glynn Clements <glynn@gclements.plus.com>