[GSoC] Designing a faster index format

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [GSoC] Designing a faster index format - Progress report week 7
@ 2012-06-04 20:07 Thomas Gummerer
  2012-06-06  9:45 ` Handling racy entries in the v5 format [Re: [GSoC] Designing a faster index format - Progress report week 7] Thomas Rast
  0 siblings, 1 reply; 4+ messages in thread
From: Thomas Gummerer @ 2012-06-04 20:07 UTC (permalink / raw)
  To: git; +Cc: trast, gitster, mhagger, pclouds

== Work done in the previous 6 weeks ==

- Definition of a tentative index file v5 format [1]. This differs
  from the proposal in making it possible to bisect the directory
  entries and file entries, to do a binary search. The exact bits
  for each section were also defined. To further compress the index,
  along with prefix compression, the stat data is hashed, since
  it's only used for comparison, but the plain data is never used.
  Thanks to Michael Haggerty, Nguyen Thai Ngoc Duy, Thomas Rast
  and Robin Rosenberg for feedback.
- Prototype of a converter from the index format v2/v3 to the index
  format v5. [2] The converter reads the index from a git repository,
  can output parts of the index (header, index entries as in
  git ls-files --debug, cache tree as in test-dump-cache-tree, or
  the reuc data). Then it writes the v5 index file format to
  .git/index-v5. Thanks to Michael Haggerty for the code review.
- Prototype of a reader for the new index file format. [3] The
  reader has mainly the purpose to show the algorithm used to read
  the index lexicographically sorted after the full name which is
  required by the current internal memory format. Big thanks for
  reviewing this code and giving me advice on refactoring goes
  to Michael Haggerty.
- Started working on the actual git code. Git is now able to read
  the index format v5, although the mapping of the new ondisk
  format to the internal format is not done yet. The latest code
  is not pushed to github yet, since it still needs some polishing,
  but if anyone is interested in the general direction it's going,
  the initial steps are on github. [4] Thanks for reviewing the
  first steps to Thomas Rast.

== Work done in the last week ==

- Continued working on the git code, which now maps the index to
  the internal format. This includes reading the conflicted data
  at the far end of the index. [4]

== Outlook for the next week ==

- Read the cache-tree (convert the directories on the disk to the
  current in-memory cache-tree structure)
- Start implementing an API, as it was discussed at [5]

[1] https://github.com/tgummerer/git/wiki/Index-file-format-v5
[2] https://github.com/tgummerer/git/blob/pythonprototype/git-convert-index.py
[3] https://github.com/tgummerer/git/blob/pythonprototype/git-read-index-v5.py
[4] https://github.com/tgummerer/git/tree/index-v5
[5] http://thread.gmane.org/gmane.comp.version-control.git/198283/focus=198474

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Handling racy entries in the v5 format [Re: [GSoC] Designing a faster index format - Progress report week 7]
  2012-06-04 20:07 [GSoC] Designing a faster index format - Progress report week 7 Thomas Gummerer
@ 2012-06-06  9:45 ` Thomas Rast
  2012-06-06 13:01   ` Johannes Sixt
  2012-06-06 17:31   ` Junio C Hamano
  0 siblings, 2 replies; 4+ messages in thread
From: Thomas Rast @ 2012-06-06  9:45 UTC (permalink / raw)
  To: Thomas Gummerer; +Cc: git, trast, gitster, mhagger, pclouds

Hi,

Michael, Thomas and me just had a lengthy discussion on IRC about racy
entries.  I'll use "simultaneously" from the perspective of the
filesystem's mtimes; depending on your USE_NSEC, that may mean in the
same second, or the same nanosecond.

Background: Racy Entries
------------------------

There are two cases of racy index entries:

(A)  echo foo >foo
     git add foo
     echo bar >foo

If the latter two commands happen simultaneously, lstat() will match the
index entry.  Git handles this by checking foo.mtime >= index.mtime, and
if so, doing a content check.  Such entries are called racy.

(B) echo foo >foo
    git add foo     # (i)
    echo bar >foo
    sleep 2
    : >dummy
    git add dummy   # (ii)

If the commands before the sleep happen simultaneously, then foo.mtime
has not changed since (i), but due to (ii) index.mtime has, defeating
the raciness check.  To handle this, git checks for racy entries
*w.r.t. the old index* immediately before it writes a new index.  For
all[1] such entries it does a content check.  All racy entries found to
be modified get ce_size=0, which tells the next git that "we know they
are modified".  We call them "smudged".

The Problem
-----------

The use of ce_size=0 is a problem for index v5.  The current drafts
exclude the size field, instead wrapping it in stat_crc along with most
of the other stat fields.

There are some obvious solutions:

* Put the size field back, costing us 4B/entry.

* Use some other marker field for the v5 format, e.g., the stat crc.

Neither of these is good, for an entirely different reason: The current
scheme checks *all* entries for being racy w.r.t. the old index, before
any write.  This completely defeats the point of index v5: *avoid*
loading the entire index for small changes.

Proposed Solution
-----------------

(Michael, we have adapted it somewhat this since you left IRC.)

  When writing an entry: check whether ce_mtime >= index.mtime.  If so,
  write out ce_mtime=0.

The index.mtime here is a lower bound on the mtime of the new index,
obtained e.g. by touching the index and then stat()ing it immediately
before writing out the changed entries.

Note that this is a fundamentally different approach from the one taken
in v[2-4] indexes.  In the old approach, it is the *next* writer's
responsibility to ensure that all racy entries are either truly clean,
or smudged (since they will presumably lose their raciness).  In the new
approach, racy entries are immediately smudged and remain so until an
update.

Footnotes: 
[1]  Ignoring the case where st_size==0 at the beginning, which needs
some arguing around because st_size is also the smudge marker.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Handling racy entries in the v5 format [Re: [GSoC] Designing a faster index format - Progress report week 7]
  2012-06-06  9:45 ` Handling racy entries in the v5 format [Re: [GSoC] Designing a faster index format - Progress report week 7] Thomas Rast
@ 2012-06-06 13:01   ` Johannes Sixt
  2012-06-06 17:31   ` Junio C Hamano
  1 sibling, 0 replies; 4+ messages in thread
From: Johannes Sixt @ 2012-06-06 13:01 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Thomas Gummerer, git, gitster, mhagger, pclouds

Am 6/6/2012 11:45, schrieb Thomas Rast:
> Proposed Solution
> -----------------
> 
>   When writing an entry: check whether ce_mtime >= index.mtime.  If so,
>   write out ce_mtime=0.
> 
> The index.mtime here is a lower bound on the mtime of the new index,
> obtained e.g. by touching the index and then stat()ing it immediately
> before writing out the changed entries.

Portability note: To "touch" on Windows will mean that the file is
modified (at least one byte is written), then closed. Only then the stat
information is reliable. The reason is that the time stamp is only valid
after (the last handle of) the file was closed.

> Note that this is a fundamentally different approach from the one taken
> in v[2-4] indexes.  In the old approach, it is the *next* writer's
> responsibility to ensure that all racy entries are either truly clean,
> or smudged (since they will presumably lose their raciness).  In the new
> approach, racy entries are immediately smudged and remain so until an
> update.

-- Hannes

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Handling racy entries in the v5 format [Re: [GSoC] Designing a faster index format - Progress report week 7]
  2012-06-06  9:45 ` Handling racy entries in the v5 format [Re: [GSoC] Designing a faster index format - Progress report week 7] Thomas Rast
  2012-06-06 13:01   ` Johannes Sixt
@ 2012-06-06 17:31   ` Junio C Hamano
  1 sibling, 0 replies; 4+ messages in thread
From: Junio C Hamano @ 2012-06-06 17:31 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Thomas Gummerer, git, mhagger, pclouds

Thomas Rast <trast@student.ethz.ch> writes:

> (Michael, we have adapted it somewhat this since you left IRC.)
>
>   When writing an entry: check whether ce_mtime >= index.mtime.  If so,
>   write out ce_mtime=0.
>
> The index.mtime here is a lower bound on the mtime of the new index,
> obtained e.g. by touching the index and then stat()ing it immediately
> before writing out the changed entries.

Is this even workable?  I found that "open, read a byte, write it
back in place, then stat" was not giving useful timestamp and that
was the reason the original racy-git code chose not to do this.

You may be able to do "open, read a byte, write it back in place,
fsync, close and then stat", but doing so while holding that file
also as a lock feels somewhat dirty...

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2012-06-06 17:31 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-06-04 20:07 [GSoC] Designing a faster index format - Progress report week 7 Thomas Gummerer
2012-06-06  9:45 ` Handling racy entries in the v5 format [Re: [GSoC] Designing a faster index format - Progress report week 7] Thomas Rast
2012-06-06 13:01   ` Johannes Sixt
2012-06-06 17:31   ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).