From mboxrd@z Thu Jan 1 00:00:00 1970 From: Thomas Gummerer Subject: Re: Index format v5 Date: Thu, 10 May 2012 14:19:11 +0200 Message-ID: <20120510121911.GB98491@tgummerer> References: <4FA7E703.7040408@alum.mit.edu> <20120508141137.GA3937@tgummerer.surfnet.iacbox> <4FAA2CAF.3040408@alum.mit.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: git@vger.kernel.org, trast@student.ethz.ch, gitster@pobox.com, peff@peff.net, spearce@spearce.org, davidbarr@google.com To: Michael Haggerty X-From: git-owner@vger.kernel.org Thu May 10 14:19:25 2012 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1SSSL3-0000U5-Js for gcvg-git-2@plane.gmane.org; Thu, 10 May 2012 14:19:22 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756296Ab2EJMTQ (ORCPT ); Thu, 10 May 2012 08:19:16 -0400 Received: from mail-wi0-f172.google.com ([209.85.212.172]:39017 "EHLO mail-wi0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754274Ab2EJMTP (ORCPT ); Thu, 10 May 2012 08:19:15 -0400 Received: by wibhr2 with SMTP id hr2so422091wib.1 for ; Thu, 10 May 2012 05:19:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=S+Mhd0xIUJup6Y1KozETrcQH8JRDXzFHGCaVo/dazaM=; b=v+1+CSmWXEbrrOWYMswIMGW91rsDcSiPtxBV2JLRknW5WMAVQphsfZfotWt+y/t/+f O+fYSfRzrBfwOXFXj8JY6IuG+dR3MDcdTcETC9TERyOvRjklHmekAp6OVfAEZ7LoWZ3I x4u8RY3A12dm/+hrIM2LLvZLmNIE5CO10vUybnE5YfWXc5+bHazjXwpu/KmBHppukmJq 9YXwABV3bmhuFelL3ciWFaBoFzHY/L6W/F01UXadSrpvN5bPc1j9VExwimfWyVGSFvvO GlVv0MjzjezSpbfzzz8voxjZG13dV8VN4KRN0jX1PrdY3q8bjqelaOYb+8v11VnSZ8NR oqGg== Received: by 10.180.77.4 with SMTP id o4mr9022917wiw.17.1336652354182; Thu, 10 May 2012 05:19:14 -0700 (PDT) Received: from localhost ([46.18.27.126]) by mx.google.com with ESMTPS id bn9sm2840610wib.5.2012.05.10.05.19.12 (version=TLSv1/SSLv3 cipher=OTHER); Thu, 10 May 2012 05:19:13 -0700 (PDT) Content-Disposition: inline In-Reply-To: <4FAA2CAF.3040408@alum.mit.edu> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On 05/09, Michael Haggerty wrote: > On 05/08/2012 04:11 PM, Thomas Gummerer wrote: > >>* The details of the extension data blocks are described in the first > >> (overview) section, whereas it seems like they should be described > >> in their own section following the "conflict data" section. But > >> wouldn't the presence of extension data blocks prevent the addition > >> of conflict data? > > > >Only the details that should be there for every extension are described > >in the overview (the header of the extension), to make sure every > >extension has the same header format, and thus a reader which doesn't > >understand a specific extension still can read its header and know > >what's going on. > > > >They won't prevent the addition of conflicted data, since when a > >conflict is created, other files were probably added and the index has > >to be rewritten anyway. Once the conflict is resolved however only a > >bit has to be flipped, so there is no rewrite necessary. > > In other words, the presence of extensions *does indeed* prevent the > addition of conflict data, but you don't think that it is a problem. Exactly. > Moving the conflict data to after the extensions, on the other hand, > would mean that conflict data can sometimes be added without a > rewrite. I cannot judge whether this would be useful. > > Handling conflict data *as* an extension would allow the conflict > data to be added at any time without rewriting. I cannot judge > whether this would be useful. Since there are offsets in the directory data to the conflicted data I don't think it's good to call this data extension data. It may however be beneficial to have the conflict data after the extension. I'll investigate this. > >>* Does the index file include directory entries for empty directories? > >> What about directories that contain only other directories? > > > >In theory the index is able to include empty directories. I'm however > >not sure if this should be implemented. I'd be happy to get more > >feedback there. > > Currently git does not keep track of empty directories. Even though > there have been proposals to fix this, it is far beyond the scope of > your project to implement the handling of empty directories. The > question is whether your format definition *forbids* the presence of > empty directories in the index file (in the interest of > definiteness, and it might make the reader implementation a little > bit simpler, but it imposes a constraint on the writer). Obviously > empty directories, even if present, mustn't have an effect on the > SHA1 of the trees containing them. No, the index format doesn't forbid the presence of empty directories. Empty directories will have a fileoffset of 0, and the reader will just ignore them as long as there is no empty directory tracking. > >>Directory entry > >>=============== > >> > >>* "4-byte number of entries in the index that is covered by the tree > >> this entry represents." What does this include? > >> Files/directories/both? Recursive or non-recursive? > > > >This is from the cache-tree. I'm not sure but I think it includes both > >files and directories, recursively. > > Please figure this out for the final spec. It includes only files, in a recursive manner. I've written this down in the spec. > >>File entry > >>========== > >>[...] > > > >>* Are file entries sorted by entire path or only by the basename? > > > >They are sorted by the basename, in the respective block of their > >directories. > >Example: paths: a/a a/z b/b > >File entries in the index: > >a ... > >z ... > >b ... > > OK, so in other words, the file entries of all files in a directory > (not including files in subdirectories) are stored contiguously, > sorted by basename. (The thing that wasn't immediately clear is > whether files from subdirectories are intermingled with those of the > parent directory.) Yes, exactly. > >>Flat loading > >>============ > >> > >>* I found the explanation pretty incomprehensible. Perhaps some > >> pseudo-code would make it clearer? > >>[...] > >[...] I have changed the flat loading in the documentation, > >hope it's more understandable now. > > Maybe it's just be, but I still don't think it is very clear. Here > is version fbf8add1b026: > > >== Flat loading > > > >Since internally git expects and works with lexicografic ordering, > >a simple linear scan throught the subdirectories doesn't give > >the right internal sorting. To achieve the right internal sorting > >the loading will be done in the following way: > > > >1. Start with the root directory, and read also the name of the > > first subdirectory (=next directory in the list). > > > >1a. Use the next directory (the one against which the filenames > > were checked previously), and read the next directory name, > > to check the files against. > > > >2. Check the stack if the element at the top is < then the current > > directoryname. > > > > If it's < then current directory name, add files from the stack > > to the entry list, until the file name is > then the > > directory name. > > > >2. While filename < directoryname add the filenames to the entry > > list > > > >3. Add the rest of the files to a stack. > > > >4. Continue with 1a, if there are more directories left. > > > >5. Add the rest of the files from the stack to the end of the > > entry list. > > [..] > There is a reason that I asked for pseudocode, namely because it > forces you to be more precise in your description. I can certainly > imagine several workable algorithms for reading the index file, and > the different algorithms have different tradeoffs particularly > regarding the amount of temporary space needed and locality of > reference in the index file (which, I understand, will be mmapped > when practical but it is not practical on all platforms). Once you > express the algorithm in pseudocode it is possible to be sure which > variant you have chosen and consider whether it is really workable. Ok, here is the variant in pseudo code. I hope it's understandable this way. It needs some temporary space, but never more then the actual entries will need in the end anyway. == Flat loading Since internally git expects and works with lexicografic ordering, a simple linear scan throught the subdirectories doesn't give the right internal sorting. To achieve the right internal sorting the loading will be done in the following way: The data structure is a stack of queues, to allow continous reading of the file. s -> queue1 t -> queue2 a -> queue3 c -> queue4 k -> queue5 dirs = read_all_directories foreach dir in dirs do file = read_next_file while element_on_top_of_stack.first_element < nextdir indexentries.append(dequeue(element_on_top_of_stack)) if element_on_top_of_stack == emtpy: remove_element_on_top_of_stack if file[filename] < nextdir indexentries.append(file) else queue.add(file) foreach f in rest_of_files_in_directory: queue.add(f) stack.push(queue) foreach queue in stack: foreach entry in queue: indexentry.append(entry)