Re: [RFC Patch] Preventing corrupt objects from entering the repository

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Shawn O. Pearce" <spearce@spearce.org>
To: Nicolas Pitre <nico@cam.org>,
	Martin Koegler <mkoegler@auto.tuwien.ac.at>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: [RFC Patch] Preventing corrupt objects from entering the repository
Date: Wed, 13 Feb 2008 02:42:09 -0500	[thread overview]
Message-ID: <20080213074209.GG24004@spearce.org> (raw)
In-Reply-To: <alpine.LFD.1.00.0802120937330.2732@xanadu.home>

Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 11 Feb 2008, Martin Koegler wrote:
> > 
> > But lots of code in git assums that the object content is welformd.
> > 
> > Having such objects somewhere reachable in your repository will
> > disturb cloning (In the best case you get a error messages, in the
> > worst a child process of upload-pack segfaults), gitweb, ... . To fix
> > it, you will need a person with native access to the repository in the
> > worst case. 

We have *waaaay* overthought this.  The solution is a lot simpler
than Martin's patches make it out to be, and we can do it without
changing our current memory footprint.

> I think this is a good idea to always have some sanity checks on any 
> incoming objects so to make sure they're well formed and valid before 
> giving them a SHA1 value, and bail out as soon as any error is found.  

When we get the raw data for an object so we can compute its SHA-1
and/or write its loose object to disk we should first verify its
content is sane, then do the SHA-1/store loose.

Blobs - always assume sane.  Remember that these make up the largest
percentage in terms of raw bytes of any project's packfiles and we
don't need to do any additional processing to them.  Yay us.

Tags - stack allocate a temporary struct tag and pass it's address
into parse_tag_buffer().  The only thing it tries to allocate is
the item->tagged.  Throw an extra argument into parse_tag_buffer()
to bypass this lookup block.  Now there's no extra memory used for
tags, just a bit more CPU.  Tags aren't frequent.

Commits - do like tags, but use parse_commit_buffer().  Avoid
looking up the tree object and the parents, and allocating the
parent references via a new argument.  Also check !item->date like
fsck does.  Other than that I think the validation code in fsck is
redundant with what parse_commit_buffer() is doing already.

Trees - just run init_tree_desc() and tree_entry() loop like in
track_tree_refs() (and a ton of other places) to iterate over the
buffer.  All we care about is sane entry names (no '/'), sane modes,
and correct sorting.

In short, we ignore reachability, but get fsck, and we can completely
avoid additional malloc activity as well as any sort of heap increase
in index-pack and unpack-objects.  Its not hard.

Its a small amount of refactoring for the parse functions we
already have, and maybe expose a function or two from builtin-fsck
(in particular a bulk of fsck_tree should be reused).

Yea, its a tad more CPU time, but it shouldn't be that costly here.
The huge cost in fsck is redoing the SHA-1 checksums, and inflating
the deltas.  We've already got the delta inflated, and we're about
to compute the SHA-1 checksum for the first time.  So those major
costs of fsck drop to 0.

> As to making sure those objects are well connected... well this is a 
> technically different issue entirely, and I wonder if a special mode to 
> fsck might not be a better solution.

Nah, just do what quickfetch does in builtin-fetch.c, but run it
in receive-pack, between unpack() and execute_commands():

	rev-list --quiet --objects $new... --not --all

If it aborts, reachability testing failed and the push is rejected
without updating any refs.  Yes your repository now has objects
that are missing things, but none of those are considered to be
reachable, so this isn't a big deal.  They will get cleaned up on
the next `gc --prune`, whenever that is.

In this configuration (--quiet) rev-list tries to be pretty low
on its memory usage, it doesn't save buffers, etc.  Further since
everything that is already considered reachable is not interesting,
we are only doing a walk over the objects that we just received,
not our entire ODB.  Its also after index-pack exited, so we just
freed up a good chunk of memory.

Rememeber we are talking about receive-pack here.  The cost on
the to perform the rev-list is lower than the cost will be to pack
these objects for distribution back to just one client.  Since this
is a server of some sorts (otherwise why did you push here?), odds
are its going to be doing a lot of packing requests for clients to
receive these newly uploaded objects by the native git protocol.
This new rev-list is nothing compared to that already existing load.
And if your OS is any good the just created .idx and .pack is still
in OS buffer cache, so there shouldn't be any additional disk IO.

Yes, we could make this optional in receive-pack, but really I don't
see a reason to.  Just run it.  The client shouldn't be giving us
unreachable crap.

-- 
Shawn.

next prev parent reply	other threads:[~2008-02-13  7:43 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-02-10 17:58 [RFC Patch] Preventing corrupt objects from entering the repository Martin Koegler
2008-02-11  0:00 ` Junio C Hamano
2008-02-11  0:33   ` Nicolas Pitre
2008-02-11 19:56     ` Martin Koegler
2008-02-11 20:41       ` Nicolas Pitre
2008-02-11 21:58         ` Martin Koegler
2008-02-12 16:02           ` Nicolas Pitre
2008-02-12 19:04             ` Martin Koegler
2008-02-12 20:22               ` Nicolas Pitre
2008-02-12 21:38                 ` Martin Koegler
2008-02-12 21:51                   ` Nicolas Pitre
2008-02-13  6:20                     ` Shawn O. Pearce
2008-02-13  7:39                       ` Martin Koegler
2008-02-14  9:00                         ` [RFC PATCH] Remove object-refs from fsck Shawn O. Pearce
2008-02-14 19:07                           ` Martin Koegler
2008-02-13  7:42             ` Shawn O. Pearce [this message]
2008-02-13  8:11               ` [RFC Patch] Preventing corrupt objects from entering the repository Martin Koegler
2008-02-13 12:01                 ` Johannes Schindelin
2008-02-14  6:16                   ` Shawn O. Pearce
2008-02-14 19:04                   ` Martin Koegler
2008-02-15  0:06                     ` Johannes Schindelin
2008-02-15  7:18                       ` Martin Koegler
2008-02-12  7:20   ` Martin Koegler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080213074209.GG24004@spearce.org \
    --to=spearce@spearce.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=mkoegler@auto.tuwien.ac.at \
    --cc=nico@cam.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.