From mboxrd@z Thu Jan 1 00:00:00 1970 From: Glynn Clements Subject: Re: Handling large files Date: Fri, 22 Apr 2005 20:20:44 +0100 Message-ID: <17001.20108.68648.190269@gargle.gargle.HOWL> References: <20050422170321.GA16959@cmi.ac.in> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20050422170321.GA16959@cmi.ac.in> Sender: linux-c-programming-owner@vger.kernel.org List-Id: Content-Type: text/plain; charset="us-ascii" To: Anindya Mozumdar Cc: linux-c-programming@vger.kernel.org Anindya Mozumdar wrote: > Recently I was dealing with large csv ( comma separated value ) > files, of size around 500M. > > I was using perl to parse such files, and it took around 40 minutes > for perl to read the file, and duplicate it using the csv module. > Python's module took 1 hr. I am sure even if I had written c code, > opened the file and parsed it, it would have taken a lot of time. > > However, I used MySQL to create a database from the file, and the > entire creation took around 2 minutes. I would like to know how is > this possible - is it a case of threading, memory mapping or some > good algorithm ? Good algorithm, probably. When done correctly, parsing uses minimal CPU resources. Unless you have a very fast hard disk or a very slow CPU, you should be able to parse that CSV file as fast as you can read it from disk without significantly loading the CPU. The standard mechanism for parsing according to a regular grammar is to use a deterministic finite automaton. The CSV format can be described by a regular grammar. There is a substantial amount of theory related to formal grammars and the parsing of them. Rather than go into that here, I'll provide a link to the Wikipedia page: http://en.wikipedia.org/wiki/Formal_grammar The links on that page may provide useful information. Also, the technical terms used on that page (and those it links to) will be useful as search phrases for Google. >From the practical perspective, the usual mechanism for generating code to parse according to a regular grammar is to use lex (or flex). lex reads a grammar description and generates the C source code for a parser which will parse that grammar. If you're concerned about performance, use the -b switch to generate diagnostic information regarding backing-up, then modify the grammar as necessary so as to eliminate backing-up. The main limitation of regular grammars is that they cannot describe languages (formats) which allow for "nested" constructs (e.g. most programming languages). The most common such grammars are context-free grammars, which can be handled using a mix of lex and yacc (or bison). -- Glynn Clements