Re: [RFC Patch] Preventing corrupt objects from entering the repository

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: mkoegler@auto.tuwien.ac.at (Martin Koegler)
To: Nicolas Pitre <nico@cam.org>
Cc: Junio C Hamano <gitster@pobox.com>, git@vger.kernel.org
Subject: Re: [RFC Patch] Preventing corrupt objects from entering the repository
Date: Mon, 11 Feb 2008 22:58:06 +0100	[thread overview]
Message-ID: <20080211215806.GA24971@auto.tuwien.ac.at> (raw)
In-Reply-To: <alpine.LFD.1.00.0802111513360.2732@xanadu.home>

On Mon, Feb 11, 2008 at 03:41:06PM -0500, Nicolas Pitre wrote:
> On Mon, 11 Feb 2008, Martin Koegler wrote:
> > Please remember, that --strict is used for pushing data.
> 
> And what exactly does it provide in terms of safetiness that index-pack 
> doesn't already do?  I'm not sure I fully understood the goal here.

The patch is about verifing the content of objects. My intented use is
for running a (restricted) server, which allows pushing via git-daemon
(or git-shell) from (untrusted) users. It can be also used to catch
corrupt objects, before they leave your repository, but this is for me
only a side effect.

receive-pack accepts any crap (anything, which you can pipe into
git-hash-object with -t parameter of your choice), if the pack file
format is correct.

But lots of code in git assums that the object content is welformd.

Having such objects somewhere reachable in your repository will
disturb cloning (In the best case you get a error messages, in the
worst a child process of upload-pack segfaults), gitweb, ... . To fix
it, you will need a person with native access to the repository in the
worst case. 

> > Maybe I have missed something, but all repack problems reported on the
> > git mailing list happen durring the deltifing phase. The problematic
> > files are mostly bigger blobs. I'm aware of these problems, so my
> > patch does not keep any blobs in memory.
> 
> This is not only repack, which is something that few people should run 
> anyway.
> 
> It is also index-pack which takes an awful amount of time on the OOO 
> repo (much less than the repack but still), and that's something that is 
> visible to anyone cloning it.
> 
> So the comparison with pack-objects is useless.  In absolute terms, 
> index-pack has to inprove, not regress.

The patch does not change the cloning/fetching behaviour:
@@ -18,6 +19,7 @@ struct object_entry
        unsigned int hdr_size;
        enum object_type type;
        enum object_type real_type;
+       struct object *obj;
 };

 union delta_base {
@@ -44,6 +49,7 @@ static int nr_deltas;
 static int nr_resolved_deltas;

 static int from_stdin;
+static int strict;
 static int verbose;

 static struct progress *progress;
@@ -325,8 +366,8 @@ static int find_delta_children(const union delta_base *base,
        return 0;
 }

-static void sha1_object(const void *data, unsigned long size,
-                       enum object_type type, unsigned char *sha1)
+static struct object* sha1_object(const void *data, unsigned long size,
+                                 enum object_type type, unsigned char *sha1)
 {
        hash_sha1_file(data, size, typename(type), sha1);
        if (has_sha1_file(sha1)) {
@@ -341,6 +382,29 @@ static void sha1_object(const void *data, unsigned long size,
                        die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
                free(has_data);
        }
+       if (strict) {
[..]
+       return NULL;
 }

 static void resolve_delta(struct object_entry *delta_obj, void *base_data,
@@ -361,7 +425,7 @@ static void resolve_delta(struct object_entry *delta_obj, void *base_data,
        free(delta_data);
        if (!result)
                bad_object(delta_obj->idx.offset, "failed to apply delta");
-       sha1_object(result, result_size, type, delta_obj->idx.sha1);
+       delta_obj->obj = sha1_object(result, result_size, type, delta_obj->idx.sha1);
        nr_resolved_deltas++;

        hashcpy(delta_base.sha1, delta_obj->idx.sha1);
@@ -420,7 +484,7 @@ static void parse_pack_objects(unsigned char *sha1)
                        delta->obj_no = i;
                        delta++;
                } else
-                       sha1_object(data, obj->size, obj->type, obj->idx.sha1);
+                       obj->obj = sha1_object(data, obj->size, obj->type, obj->idx.sha1);
                free(data);
                display_progress(progress, i+1);
        }
@@ -714,6 +778,8 @@ int main(int argc, char **argv)
                                from_stdin = 1;
                        } else if (!strcmp(arg, "--fix-thin")) {
                                fix_thin_pack = 1;
+                       } else if (!strcmp(arg, "--strict")) {
+                               strict = 1;
                        } else if (!strcmp(arg, "--keep")) {
                                keep_msg = "";
                        } else if (!prefixcmp(arg, "--keep=")) {
@@ -812,6 +878,8 @@ int main(int argc, char **argv)
                            nr_deltas - nr_resolved_deltas);
        }
        free(deltas);
+       if (strict)
+               check_objects();

        idx_objects = xmalloc((nr_objects) * sizeof(struct pack_idx_entry *));
        for (i = 0; i < nr_objects; i++)

Do you see any problem, if int strict is zero?

For pushing, you must distinguish two case:

* pushing some changes/new branches
  You will have a maximum of a few thousands of objects.

* pushing the whole content into an empty repository

  This is not an every day operation. 

  I implemented a config option to disable the checking. It can be
  used safe time/memory in such cases.
 
  To make this working for OOO repository, my patch will probable need
  more tuning.

> > So my conclusion is, that the memory usage of index-pack with --strict
> > should not be too worse compared to pack-objects.
> 
> Well, the comparison is flawed anyway.  Agressively repacking the OOO 
> repo is reported to take around 14 GB of RAM if not bounded, whereas 
> index-pack remained around 300MB.  But the buffer cache managed by the 
> kernel is also a big performance factor and anything that requires more 
> memory in user space will affect that cache.

The 14 GB is caused by the memory used in the deltifing phase. I
compared it to the counting object phase.

> Then remember that this repo has 2411764 objects.

How many blobs, tags, trees and commits? What is the uncompressed size
of all tags? What is the uncompressed size of all trees? What is the
uncompressed size of all commits?

This information would give me a clue, what bad effects my approach
would have.

mfg Martin Kögler

next prev parent reply	other threads:[~2008-02-11 21:58 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-02-10 17:58 [RFC Patch] Preventing corrupt objects from entering the repository Martin Koegler
2008-02-11  0:00 ` Junio C Hamano
2008-02-11  0:33   ` Nicolas Pitre
2008-02-11 19:56     ` Martin Koegler
2008-02-11 20:41       ` Nicolas Pitre
2008-02-11 21:58         ` Martin Koegler [this message]
2008-02-12 16:02           ` Nicolas Pitre
2008-02-12 19:04             ` Martin Koegler
2008-02-12 20:22               ` Nicolas Pitre
2008-02-12 21:38                 ` Martin Koegler
2008-02-12 21:51                   ` Nicolas Pitre
2008-02-13  6:20                     ` Shawn O. Pearce
2008-02-13  7:39                       ` Martin Koegler
2008-02-14  9:00                         ` [RFC PATCH] Remove object-refs from fsck Shawn O. Pearce
2008-02-14 19:07                           ` Martin Koegler
2008-02-13  7:42             ` [RFC Patch] Preventing corrupt objects from entering the repository Shawn O. Pearce
2008-02-13  8:11               ` Martin Koegler
2008-02-13 12:01                 ` Johannes Schindelin
2008-02-14  6:16                   ` Shawn O. Pearce
2008-02-14 19:04                   ` Martin Koegler
2008-02-15  0:06                     ` Johannes Schindelin
2008-02-15  7:18                       ` Martin Koegler
2008-02-12  7:20   ` Martin Koegler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080211215806.GA24971@auto.tuwien.ac.at \
    --to=mkoegler@auto.tuwien.ac.at \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=nico@cam.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).