linux-c-programming.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Glynn Clements <glynn@gclements.plus.com>
To: Anindya Mozumdar <anindya@cmi.ac.in>
Cc: linux-c-programming@vger.kernel.org
Subject: Re: Handling large files
Date: Fri, 22 Apr 2005 20:20:44 +0100	[thread overview]
Message-ID: <17001.20108.68648.190269@gargle.gargle.HOWL> (raw)
In-Reply-To: <20050422170321.GA16959@cmi.ac.in>


Anindya Mozumdar wrote:

>    Recently I was dealing with large csv ( comma separated value )
>    files, of size around 500M.
> 
>    I was using perl to parse such files, and it took around 40 minutes
>    for perl to read the file, and duplicate it using the csv module.
>    Python's module took 1 hr. I am sure even if I had written c code,
>    opened the file and parsed it, it would have taken a lot of time.
> 
>    However, I used MySQL to create a database from the file, and the
>    entire creation took around 2 minutes. I would like to know how is
>    this possible - is it a case of threading, memory mapping or some
>    good algorithm ?

Good algorithm, probably.

When done correctly, parsing uses minimal CPU resources. Unless you
have a very fast hard disk or a very slow CPU, you should be able to
parse that CSV file as fast as you can read it from disk without
significantly loading the CPU.

The standard mechanism for parsing according to a regular grammar is
to use a deterministic finite automaton. The CSV format can be
described by a regular grammar.

There is a substantial amount of theory related to formal grammars and
the parsing of them. Rather than go into that here, I'll provide a
link to the Wikipedia page:

	http://en.wikipedia.org/wiki/Formal_grammar

The links on that page may provide useful information. Also, the
technical terms used on that page (and those it links to) will be
useful as search phrases for Google.

From the practical perspective, the usual mechanism for generating
code to parse according to a regular grammar is to use lex (or flex). 
lex reads a grammar description and generates the C source code for a
parser which will parse that grammar.

If you're concerned about performance, use the -b switch to generate
diagnostic information regarding backing-up, then modify the grammar
as necessary so as to eliminate backing-up.

The main limitation of regular grammars is that they cannot describe
languages (formats) which allow for "nested" constructs (e.g. most
programming languages). The most common such grammars are context-free
grammars, which can be handled using a mix of lex and yacc (or bison).

-- 
Glynn Clements <glynn@gclements.plus.com>

      reply	other threads:[~2005-04-22 19:20 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-04-22 17:03 Handling large files Anindya Mozumdar
2005-04-22 19:20 ` Glynn Clements [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=17001.20108.68648.190269@gargle.gargle.HOWL \
    --to=glynn@gclements.plus.com \
    --cc=anindya@cmi.ac.in \
    --cc=linux-c-programming@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).