* GSoC 2009 Prospective student
@ 2009-02-22 19:58 Rohan Dhruva
  2009-02-22 20:07 ` Sverre Rabbelier
  2009-02-22 20:43 ` Miklos Vajna
  0 siblings, 2 replies; 14+ messages in thread
From: Rohan Dhruva @ 2009-02-22 19:58 UTC (permalink / raw)
To: git

Hi,

I am a student from India. I am very interested in taking part in GSoC
2009, working under git project mentors. However, I am completely new
to git, I have never used it in the past. I have used svn, but only
for downloading source code, never to manage my own code. I am very
interested in open source in general, and I have been using Linux for
5-6 years.

That being said, I have knowledge of C/C++ as taught to me in
school and college. I realize that my qualifications as such are not
very impressive, and hence I wish to start with a smaller project. I
read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
"jump-in" project might be the "Restartable Clones" proposal. Seeing
my capabilities, I would like to know whether I am "fit" to undertake
work on that project? I promise to put in a lot of hard work to learn
git, and its source code. However, I would also require a bit of
hand-holding, at least initially, to get me through.

I am very interested to know the opinion of all prospective mentors on
this issue. Thank you very much, and I do hope I am useful to the git
community.

--
Rohan Dhruva

PS: Please CC me, as I am not subscribed to the list. Thanks.

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: GSoC 2009 Prospective student
From: Sverre Rabbelier @ 2009-02-22 20:07 UTC
To: Rohan Dhruva; +Cc: git

Heya,

On Sun, Feb 22, 2009 at 20:58, Rohan Dhruva <rohandhruva@gmail.com> wrote:
> I am a student from India. I am very interested in taking part in GSoC
> 2009, working under git project mentors. However, I am completely new
> to git, I have never used it in the past. I have used svn, but only
> for downloading source code, never to manage my own code. I am very
> interested in open source in general, and I have been using Linux for
> 5-6 years.

I was in a similar situation myself when I decided to apply for Git as
a GSoC student last year, your description makes me wonder "why git"
though, any particular reason?

> That being said, I have knowledge of C/C++ as taught to me in
> school and college. I realize that my qualifications as such are not
> very impressive, and hence I wish to start with a smaller project. I
> read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
> "jump-in" project might be the "Restartable Clones" proposal. Seeing
> my capabilities, I would like to know whether I am "fit" to undertake
> work on that project? I promise to put in a lot of hard work to learn
> git, and its source code. However, I would also require a bit of
> hand-holding, at least initially, to get me through.

Almost all students require such handholding, that's what the mentors
are for ;).

> I am very interested to know the opinion of all prospective mentors on
> this issue. Thank you very much, and I do hope I am useful to the git
> community.

Showing your face early and asking around is a good thing to do as a
prospective student, good luck :).

> PS: Please CC me, as I am not subscribed to the list. Thanks.

I'd say, step one in your path to being a GSoC student with git would
be to subscribe to the mailing list. Read the "A note from the
maintainer" mails Junio sends out, as well as his "What's cooking"
mails. Do some research on the topic you are interested in (e.g.,
search gmane.org's git archive for discussions on the topic, etc). You
might also want to hang out in #git on irc.freenode.net and get to
know people there.

--
Cheers,

Sverre Rabbelier
* Re: GSoC 2009 Prospective student
From: Rohan Dhruva @ 2009-02-22 20:29 UTC
To: Sverre Rabbelier; +Cc: git

Hi Sverre,

On Mon, Feb 23, 2009 at 1:37 AM, Sverre Rabbelier <srabbelier@gmail.com> wrote:
> Heya,
>
> On Sun, Feb 22, 2009 at 20:58, Rohan Dhruva <rohandhruva@gmail.com> wrote:
>> I am a student from India. I am very interested in taking part in GSoC
>> 2009, working under git project mentors. However, I am completely new
>> to git, I have never used it in the past. I have used svn, but only
>> for downloading source code, never to manage my own code. I am very
>> interested in open source in general, and I have been using Linux for
>> 5-6 years.
>
> I was in a similar situation myself when I decided to apply for Git as
> a GSoC student last year, your description makes me wonder "why git"
> though, any particular reason?
>

I have developed a particular interest in SCMs lately. Git is a widely
used SCM. Also, this project would require knowledge of C, and not
some other language which I am not familiar with.

Seeing that you were a student yourself, can you please give me
some tips? Any things for me to keep in mind?

>> That being said, I have knowledge of C/C++ as taught to me in
>> school and college. I realize that my qualifications as such are not
>> very impressive, and hence I wish to start with a smaller project. I
>> read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
>> "jump-in" project might be the "Restartable Clones" proposal. Seeing
>> my capabilities, I would like to know whether I am "fit" to undertake
>> work on that project? I promise to put in a lot of hard work to learn
>> git, and its source code. However, I would also require a bit of
>> hand-holding, at least initially, to get me through.
>
> Almost all students require such handholding, that's what the mentors
> are for ;).
>
>> I am very interested to know the opinion of all prospective mentors on
>> this issue. Thank you very much, and I do hope I am useful to the git
>> community.
>
> Showing your face early and asking around is a good thing to do as a
> prospective student, good luck :).
>

Thanks, I am encouraged :-)

>> PS: Please CC me, as I am not subscribed to the list. Thanks.
>
> I'd say, step one in your path to being a GSoC student with git would
> be to subscribe to the mailing list. Read the "A note from the
> maintainer" mails Junio sends out, as well as his "What's cooking"
> mails. Do some research on the topic you are interested in (e.g.,
> search gmane.org's git archive for discussions on the topic, etc). You
> might also want to hang out in #git on irc.freenode.net and get to
> know people there.
>

I will join the mailing list soon, thanks :)

Cheers,
--
Rohan Dhruva
* Re: GSoC 2009 Prospective student
From: Sverre Rabbelier @ 2009-02-22 20:38 UTC
To: Rohan Dhruva; +Cc: git

On Sun, Feb 22, 2009 at 21:29, Rohan Dhruva <rohandhruva@gmail.com> wrote:
> I have developed a particular interest in SCMs lately. Git is a widely
> used SCM. Also, this project would require knowledge of C, and not
> some other language which I am not familiar with.

Ah, that makes sense then. You should make sure you like using git
then, use it in some project for school, perhaps in combination with
'git svn'.

> Seeing that you were a student yourself, can you please give me
> some tips? Any things for me to keep in mind?

Hmmm, work on list! As soon as you have anything half-decent (this
will hopefully be after a week or two, maybe three), send your work to
the list for review! Work in the open as much as possible and profit
from the combined knowledge of the mailing list. Before GSoC starts,
get in contact with possible mentors, try to learn about the area of
the code you will be touching. Learn the coding style, and learn how
to send patches by reading Documentation/SubmittingPatches and the
list archive, but preferably by sending one! Most important is that
you have a good time and learn from it though :).

--
Cheers,

Sverre Rabbelier
* Re: GSoC 2009 Prospective student
From: Miklos Vajna @ 2009-02-22 20:43 UTC
To: Rohan Dhruva; +Cc: git

On Mon, Feb 23, 2009 at 01:28:33AM +0530, Rohan Dhruva <rohandhruva@gmail.com> wrote:
> That being said, I have knowledge of C/C++ as taught to me in
> school and college. I realize that my qualifications as such are not
> very impressive, and hence I wish to start with a smaller project. I
> read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
> "jump-in" project might be the "Restartable Clones" proposal.

I would recommend you to read this thread:

http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298

Especially Shawn's message, which can be a base for your proposal, if
you want to work on this.
* Re: GSoC 2009 Prospective student
From: Nicolas Pitre @ 2009-02-22 22:22 UTC
To: Miklos Vajna; +Cc: Rohan Dhruva, git

On Sun, 22 Feb 2009, Miklos Vajna wrote:

> On Mon, Feb 23, 2009 at 01:28:33AM +0530, Rohan Dhruva <rohandhruva@gmail.com> wrote:
> > That being said, I have knowledge of C/C++ as taught to me in
> > school and college. I realize that my qualifications as such are not
> > very impressive, and hence I wish to start with a smaller project. I
> > read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
> > "jump-in" project might be the "Restartable Clones" proposal.
>
> I would recommend you to read this thread:
>
> http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
>
> Especially Shawn's message, which can be a base for your proposal, if
> you want to work on this.

I don't particularly agree with Shawn's proposal. Reliance on a stable
sorting on the server side is too fragile, restrictive and cumbersome.

Restartable clone is _hard_. Even I who has quite a bit of knowledge in
the affected area didn't find a satisfactory solution yet.

I think restartable clone is a really bad suggestion for SOC students.
After all we want successful SOC projects, not ones that even core git
developers did not yet find a good solution for.

IMHO of course.

Nicolas
* Re: GSoC 2009 Prospective student
From: Sitaram Chamarty @ 2009-02-23 0:46 UTC
To: git

On 2009-02-22, Nicolas Pitre <nico@cam.org> wrote:
> Restartable clone is _hard_. Even I who has quite a bit of knowledge in
> the affected area didn't find a satisfactory solution yet.

I'm sorry I have not followed the earlier discussion. I have a
question.

I know the rsync transport is not much used, and I myself have never
used it. But can there not be a 'sorry, this repo is not yet open'
flag that prevents local git operations while the clone is going on,
and then the actual clone itself merely does an rsync of the
corresponding files? Because rsync is quite restartable.

I can see that this would be a problem if the remote were to 'git
repack' in between 2 attempts by the client, because the actual tree
inside .git/objects would change, but that is hardly a common
occurrence I would think.

I'm sorry if I'm being naive and missing a lot of important nuances,
but I was looking at it from an "if I had to do it in shell, how would
I do it" mindset.

Or perhaps by 'restartable clone' you also mean 'restartable fetch',
etc, in which case of course you can't lock out the repo if a fetch
dies partway.

It is not necessary to reply in detail; even a gmane or other link
will do if this was already shot down :-)
* Re: GSoC 2009 Prospective student
From: Jakub Narebski @ 2009-02-23 15:37 UTC
To: Nicolas Pitre; +Cc: Miklos Vajna, Rohan Dhruva, git

Nicolas Pitre <nico@cam.org> writes:
> On Sun, 22 Feb 2009, Miklos Vajna wrote:
> > On Mon, Feb 23, 2009 at 01:28:33AM +0530, Rohan Dhruva <rohandhruva@gmail.com> wrote:
> > > That being said, I have knowledge of C/C++ as taught to me in
> > > school and college. I realize that my qualifications as such are not
> > > very impressive, and hence I wish to start with a smaller project. I
> > > read on the http://git.or.cz/gitwiki/SoC2009Ideas page that a
> > > "jump-in" project might be the "Restartable Clones" proposal.
> >
> > I would recommend you to read this thread:
> >
> > http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
> >
> > Especially Shawn's message, which can be a base for your proposal, if
> > you want to work on this.
>
> I don't particularly agree with Shawn's proposal. Reliance on a stable
> sorting on the server side is too fragile, restrictive and cumbersome.
>
> Restartable clone is _hard_. Even I who has quite a bit of knowledge in
> the affected area didn't find a satisfactory solution yet.

I think it is possible for dumb protocols (using commit walkers) and
for (deprecated) rsync. The only thing would be for "git clone
--continue" to bypass the check that the directory to download the
repository to is nonexistent or empty.

I guess that what the code can do (or perhaps even does currently) for
commit walk based dumb protocols (like HTTP) is to do the commit walk,
and for packfiles which are already downloaded or partially
downloaded, download the rest of the file (if the web server supports
it; if not, redownload the whole packfile, but do not redownload
already existing packfiles).

For rsync:// it could be enough to just bypass the check... but the
probability of getting a corrupted repository would be even higher,
unfortunately.

> I think restartable clone is a really bad suggestion for SOC students.
> After all we want successful SOC projects, not ones that even core git
> developers did not yet find a good solution for.
>
> IMHO of course.

But I agree that within current limits (as far as I know there is no
way to ask for SHA-1; you can only ask for refs for security reasons)
it would be difficult to very difficult to add restartable clone
support to native (smart) protocols.

If not for this limitation it would be, I think, possible to do a kind
of fsck, checking which commits in packfile are complete (i.e. have
all objects), and based on that ask for subset of objects. This would
require support only from a client... alas, this is not possible.

In the mentioned post Shawn talks about a way for the server to
1) generate exactly the same packfile (the proposal is to replay
want/have, but it also requires stable sorting of objects);
2) transfer only the rest of the file (but the server has to
regenerate the packfile anyway, as packfiles are generated on-the-fly;
well, unless it caches packfiles, which might be a good idea anyway).

I think that unless 'restartable clone' is limited to commit walkers
(HTTP protocol etc.) it should be moved up in difficulty from the "New
to Git?" section. I guess that mirror-sync, formerly GitTorrent, could
be easier to implement.

--
Jakub Narebski
Poland
ShadeHawk on #git
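The "download the rest of the file if the web server supports it" step that Jakub describes for dumb HTTP transport can be sketched roughly as follows. This is only an illustration of the resume decision, not code from git: the function names and the file path are hypothetical, and the actual HTTP request is left out so the logic can be read on its own.

```python
import os

def resume_headers(partial_path):
    """Build request headers that ask only for the bytes still missing."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # No Range header on a fresh download; "bytes=N-" means "from offset N on".
    return {"Range": f"bytes={have}-"} if have > 0 else {}

def store_response(status, body, partial_path):
    """Append on 206 Partial Content, restart on 200 OK, refuse anything else."""
    if status == 206:
        mode = "ab"   # server honoured the range: keep what we already have
    elif status == 200:
        mode = "wb"   # server ignored the range and sent the whole file again
    else:
        raise ValueError(f"unexpected HTTP status {status}")
    with open(partial_path, mode) as f:
        f.write(body)
```

A server without byte range support simply answers 200 with the full file, which is the "redownload whole packfile" fallback mentioned above.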
* Re: GSoC 2009 Prospective student
From: Shawn O. Pearce @ 2009-02-23 15:58 UTC
To: Jakub Narebski; +Cc: Nicolas Pitre, Miklos Vajna, Rohan Dhruva, git

Jakub Narebski <jnareb@gmail.com> wrote:
> Nicolas Pitre <nico@cam.org> writes:
> > On Sun, 22 Feb 2009, Miklos Vajna wrote:
> > >
> > > http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
> > >
> > > Especially Shawn's message, which can be a base for your proposal, if
> > > you want to work on this.
> >
> > I don't particularly agree with Shawn's proposal. Reliance on a stable
> > sorting on the server side is too fragile, restrictive and cumbersome.

We already rely on a stable sort in the tree format. Asking that
a stable sort be applied when a clone is started so that we can
later resume it isn't unreasonable. Hell, that tree format sort
is a B***H anyway, it's not a simple sort by memcmp(). Almost every
Git re-implementation gets it wrong the first time out.

> > Restartable clone is _hard_. Even I who has quite a bit of knowledge in
> > the affected area didn't find a satisfactory solution yet.

Sure, it's difficult, but nobody has put effort into it either.
I think it could be done by enforcing a stable sort during clone
(and perhaps only during clone). That's the basis of that message
Miklos points to. Though I don't think I ever said anything about
the stable sort only being used during clone.

> I think it is possible for dumb protocols (using commit walkers) and
> for (deprecated) rsync.

Yes, it is possible for the commit walkers to implement a restart,
as they are actually beginning at the current root and walking back
in history. Resuming a large file like a pack is easy to do on HTTP
if the remote server supports byte range serving. It's also easy
to validate on the client that the pack wasn't repacked during the
idle period (between initial fetch and restart), just validate the
SHA-1 footer. If the pack was repacked and came up with the same
name you'll have a mismatch on the footer. Discard and try again.

And if you want to save bandwidth, always grab the last 20 bytes
of the file before getting any other parts, save it somewhere,
and revalidate that last 20 before resuming. If it's changed,
you should discard what you have and start over from the beginning.

> > I think restartable clone is a really bad suggestion for SOC students.
> > After all we want successful SOC projects, not ones that even core git
> > developers did not yet find a good solution for.
> >
> > IMHO of course.
>
> But I agree that within current limits (as far as I know there is no
> way to ask for SHA-1; you can only ask for refs for security reasons)
> it would be difficult to very difficult to add restartable clone
> support to native (smart) protocols.
>
> If not for this limitation it would be, I think, possible to do a kind
> of fsck, checking which commits in packfile are complete (i.e. have
> all objects), and based on that ask for subset of objects. This would
> require support only from a client... alas, this is not possible.

I think the current "must want advertised ref" restriction is
too strict. If you make the server check the reachability of the
wanted object, (assuming it can be resolved to a commit) then you
can pick up in the middle of history. We already (to some extent)
support that with the deepen thing in a shallow clone. Sure, it
may cause more server load when clients ask for this partial fetch.

But clients can already abuse a server far more by repeatedly doing
a clone, and then break the network connection as soon as the PACK
header comes down the wire. The server just spent a lot of CPU
and IO time building the complete list of the objects to transmit.
It's really a non-trivial load on the server side. And by having
the client break the pipe at the 'PACK' header, the client doesn't
have to absorb the large data transfer either. Making it fairly
easy to DOS a Git daemon with a small botnet.

So, IMHO, the restriction that a commit must be advertised, and not
merely reachable, is overly strict and doesn't buy us a whole lot.

> I think that unless 'restartable clone' is limited to commit walkers
> (HTTP protocol etc.) it should be moved up in difficulty from the "New
> to Git?" section. I guess that mirror-sync, formerly GitTorrent, could
> be easier to implement.

Maybe. But a simple stable sort on the objects makes it easier,
perhaps within reach of "new to git".

That ideas page is a wiki for a reason. If folks feel differently
from me, please edit it to improve things! :-)

--
Shawn.
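The footer check Shawn describes can be sketched in a few lines. A pack file ends with a 20-byte SHA-1 checksum of everything that precedes it, so a client can both verify a finished download and compare a saved copy of the trailer against the remote file before resuming. The function names here are illustrative only and do not come from git.

```python
import hashlib

def pack_trailer_ok(pack_bytes):
    """True if the last 20 bytes are the SHA-1 of all bytes before them."""
    if len(pack_bytes) < 20:
        return False
    body, trailer = pack_bytes[:-20], pack_bytes[-20:]
    return hashlib.sha1(body).digest() == trailer

def can_resume(saved_trailer, remote_last_20):
    """Resume only if the remote pack still ends with the trailer we saved.

    A different trailer means the pack was rewritten in the meantime, so
    the partial download should be discarded and fetched from the start.
    """
    return saved_trailer == remote_last_20
```

Fetching just the last 20 bytes first (for example with a suffix byte range request) is what keeps the revalidation cheap.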
* Re: GSoC 2009 Prospective student
From: Nicolas Pitre @ 2009-02-23 16:31 UTC
To: Shawn O. Pearce; +Cc: Jakub Narebski, Miklos Vajna, Rohan Dhruva, git

On Mon, 23 Feb 2009, Shawn O. Pearce wrote:

> Jakub Narebski <jnareb@gmail.com> wrote:
> > Nicolas Pitre <nico@cam.org> writes:
> > > On Sun, 22 Feb 2009, Miklos Vajna wrote:
> > > >
> > > > http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
> > > >
> > > > Especially Shawn's message, which can be a base for your proposal, if
> > > > you want to work on this.
> > >
> > > I don't particularly agree with Shawn's proposal. Reliance on a stable
> > > sorting on the server side is too fragile, restrictive and cumbersome.
>
> We already rely on a stable sort in the tree format. Asking that
> a stable sort be applied when a clone is started so that we can
> later resume it isn't unreasonable. Hell, that tree format sort
> is a B***H anyway, it's not a simple sort by memcmp(). Almost every
> Git re-implementation gets it wrong the first time out.

That's not the issue at all. The sorting within a single tree object
is indeed well defined (even if it is arguably a bit odd). The object
order is not, and now with threaded delta the list of actually
deltified objects may and does vary between successive packings of the
same repo. Committing ourselves to determinism here just for the sake
of a restartable clone is not something I subscribe to.

> > > Restartable clone is _hard_. Even I who has quite a bit of knowledge in
> > > the affected area didn't find a satisfactory solution yet.
>
> Sure, it's difficult, but nobody has put effort into it either.
> I think it could be done by enforcing a stable sort during clone
> (and perhaps only during clone).

We should aim for a real solution, not something that is "special" for
a clone. After all, a clone is just a fetch, and large fetches may be
interrupted too.

> > I think it is possible for dumb protocols (using commit walkers) and
> > for (deprecated) rsync.
>
> Yes, it is possible for the commit walkers to implement a restart,
> as they are actually beginning at the current root and walking back
> in history. Resuming a large file like a pack is easy to do on HTTP
> if the remote server supports byte range serving. It's also easy
> to validate on the client that the pack wasn't repacked during the
> idle period (between initial fetch and restart), just validate the
> SHA-1 footer. If the pack was repacked and came up with the same
> name you'll have a mismatch on the footer. Discard and try again.

Sure, dumb protocols are easy. It's one of the few advantages they
have over the native protocol.

> But clients can already abuse a server far more by repeatedly doing
> a clone, and then break the network connection as soon as the PACK
> header comes down the wire. The server just spent a lot of CPU
> and IO time building the complete list of the objects to transmit.
> It's really a non-trivial load on the server side. And by having
> the client break the pipe at the 'PACK' header, the client doesn't
> have to absorb the large data transfer either. Making it fairly
> easy to DOS a Git daemon with a small botnet.

This is easy to fix, and something I've posted design notes about a
while ago. A cache of generated packs can be made, indexed by a hash
of the wanted/excluded refs used for pack generation. This way popular
fetches (say after Linus pushes stuff to his tree and everyone else
fetches it at night) would require computation only once. That is I
think something more suitable for a SOC student project.

Of course willfully abusing a git server can be done despite this, but
that is true for any other service as well.

> That ideas page is a wiki for a reason. If folks feel differently
> from me, please edit it to improve things! :-)

/me hates editing wiki pages... :-/

Nicolas
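One way to picture the cache index Nicolas mentions, a hash of the wanted and excluded refs, is the sketch below. It is an assumption-laden illustration rather than his design notes: the function name, the "want"/"have" labels and the choice of SHA-1 for the key are all picked here for the example.

```python
import hashlib

def pack_cache_key(wants, haves):
    """Derive a cache key from the wanted and excluded object ids.

    Ids are de-duplicated and sorted so that two clients sending the same
    request in a different order map to the same cached pack.
    """
    h = hashlib.sha1()
    for label, ids in (("want", wants), ("have", haves)):
        for oid in sorted(set(ids)):
            h.update(f"{label} {oid}\n".encode())
    return h.hexdigest()
```

With a key like this, the first fetch of a popular update pays for pack generation and later identical requests can be served from the stored file, which is the "computation only once" property described above.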
* Re: GSoC 2009 Prospective student
From: Jakub Narebski @ 2009-02-24 15:38 UTC
To: Shawn O. Pearce; +Cc: Nicolas Pitre, Miklos Vajna, Rohan Dhruva, git

On Mon, 23 Feb 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
>> Nicolas Pitre <nico@cam.org> writes:
>>> On Sun, 22 Feb 2009, Miklos Vajna wrote:
>>>>
>>>> http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
>>>>
>>>> Especially Shawn's message, which can be a base for your proposal, if
>>>> you want to work on this.
>>>
>>> I don't particularly agree with Shawn's proposal. Reliance on a stable
>>> sorting on the server side is too fragile, restrictive and cumbersome.
>
> We already rely on a stable sort in the tree format. [...]

I (and Nicolas) by 'sorting order' mean here ordering of objects and
deltas in the pack file, i.e. whether we get _exactly_ the same (byte
for byte) packfile for the same want/have exchange (your proposal), or
even for the same arguments to git-pack-objects (which is a necessary,
although I think not sufficient condition).

[...]
>> I think it is possible for dumb protocols (using commit walkers) and
>> for (deprecated) rsync.
>
> Yes, it is possible for the commit walkers to implement a restart,
> as they are actually beginning at the current root and walking back
> in history. Resuming a large file like a pack is easy to do on HTTP
> if the remote server supports byte range serving. It's also easy
> to validate on the client that the pack wasn't repacked during the
> idle period (between initial fetch and restart), just validate the
> SHA-1 footer. If the pack was repacked and came up with the same
> name you'll have a mismatch on the footer. Discard and try again.

Can we assume that packfiles are named correctly (i.e. name of packfile
matches SHA-1 footer)?

>
> And if you want to save bandwidth, always grab the last 20 bytes
> of the file before getting any other parts, save it somewhere,
> and revalidate that last 20 before resuming. If it's changed,
> you should discard what you have and start over from the beginning.

Therefore I think that restartable clone for "dumb" (commit walker)
protocols is an easy GSoC project, while restartable clone for "smart"
(generate packfile) protocols is at least of medium difficulty, and
might be harder.

>>> I think restartable clone is a really bad suggestion for SOC students.
>>> After all we want successful SOC projects, not ones that even core git
>>> developers did not yet find a good solution for.
>>>
>>> IMHO of course.
>>
>> But I agree that within current limits (as far as I know there is no
>> way to ask for SHA-1; you can only ask for refs for security reasons)
>> it would be difficult to very difficult to add restartable clone
>> support to native (smart) protocols.
>>
>> If not for this limitation it would be, I think, possible to do a kind
>> of fsck, checking which commits in packfile are complete (i.e. have
>> all objects), and based on that ask for subset of objects. This would
>> require support only from a client... alas, this is not possible.
>
> I think the current "must want advertised ref" restriction is
> too strict. If you make the server check the reachability of the
> wanted object, (assuming it can be resolved to a commit) then you
> can pick up in the middle of history. We already (to some extent)
> support that with the deepen thing in a shallow clone. Sure, it
> may cause more server load when clients ask for this partial fetch.

Hmmm... I forgot about shallow clone.

Still, we can have the following situation:

  *---*---o---.---.---.  ....  .---o---*---*   <-- some ref

      ^                                ^
      |                                |
      a                                b

where '*' means that we have commit and all its object fully in packfile
(i.e. if they are delta, there is base for delta in packfile), 'o' means
incomplete, for example commit with some of its blobs missing, and '.'
means missing commit object.

Because git deals with continuous range, we can tell on restart of clone
that we have 'a', and that we want 'b', but without further extensions
to git protocols, where we can tell that we have some objects (to
exclude), but not assume anything about their requirements; something
that if I remember correctly was implemented in some floating 'lazy
clone' patch (well, lazy loading of blobs patch)...

[...]
> So, IMHO, the restriction that a commit must be advertised, and not
> merely reachable, is overly strict and doesn't buy us a whole lot.
>
>> I think that unless 'restartable clone' is limited to commit walkers
>> (HTTP protocol etc.) it should be moved up in difficulty from the "New
>> to Git?" section. I guess that mirror-sync, formerly GitTorrent, could
>> be easier to implement.
>
> Maybe. But a simple stable sort on the objects makes it easier,
> perhaps within reach of "new to git".

As Nico said in the presence of threaded packing ordering of _objects_
on _packfile_ might be not deterministic.

>
> That ideas page is a wiki for a reason. If folks feel differently
> from me, please edit it to improve things! :-)

I'll try to add 'pack file cache for git-daemon' proposal to
GSoC2009Ideas page... but I cannot be mentor (or even co-mentor) for
this idea.

--
Jakub Narebski
Poland
* Re: GSoC 2009 Prospective student 2009-02-24 15:38 ` Jakub Narebski @ 2009-02-24 15:55 ` Shawn O. Pearce 2009-02-24 21:08 ` Jakub Narebski 0 siblings, 1 reply; 14+ messages in thread From: Shawn O. Pearce @ 2009-02-24 15:55 UTC (permalink / raw) To: Jakub Narebski; +Cc: Nicolas Pitre, Miklos Vajna, Rohan Dhruva, git Jakub Narebski <jnareb@gmail.com> wrote: > > I (and Nicolas) by 'sorting order' mean here ordering of objects and > deltas in the pack file, i.e. whether we get _exactly_ the same (byte > for byte) packfile for the same want/have exchange (your proposal), or > even for the same arguments to git-pack-objects (which is a necessary, > although I think not sufficient condition). I know. My proposal though didn't require the same byte-for-byte pack file. Only that the objects were in a predictable order. It didn't permit resuming in the middle of an object. If the last object in the pack was truncated the client would resume by getting that object again, and may get a different byte sequence for that object representation. Its a b**ch to know where you stopped though, as you could be in a long string of deltas whose base is in the portion you didn't yet receive. Which means you can't identify that string that you already have, and pack-objects on resume can't assume you have those objects, because you only have the deltas for them and are lacking a way to restore them. > Can we assume that packfiles are named correctly (i.e. name of packfile > match SHA-1 footer)? Wrong. The hash in "pack-$hash.pack"/"pack-$hash.idx" is *not* the 20 byte SHA-1 footer. Its the 20 byte SHA-1 of the sorted object names who are in that pack. We should try not to assume that the pack's file name matches the sorted object names, but we can assume that the pack file name is "pack-$hash.pack" where $hash is a 40 character hexadecimal string. The dumb commit walkers already have this restriction built into them, and have for quite some time. 
Any pack writers, including fast-import, honor this naming standard
in order to ensure they are compatible with the existing dumb
commit walkers.

> Therefore I think that restartable clone for "dumb" (commit walker)
> protocols is easy GSoC project, while restartable clone for "smart"
> (generate packfile) protocols is at least of medium difficulty, and
> might be harder.

Probably quite right. Unfortunately the majority of the git
repositories out there are served with the smart protocol, because
it is more efficient. :)

> Still, we can have the following situation:
>
>   *---*---o---.---.---. .... .---o---*---*    <-- some ref
>
>       ^                              ^
>       |                              |
>       a                              b
>
> where '*' means that we have commit and all its object fully in packfile
> (i.e. if they are delta, there is base for delta in packfile), 'o' means
> incomplete, for example a commit with some of its blobs missing, and '.'
> means a missing commit object.
>
> Because git deals with a continuous range, we can tell on restart of the
> clone that we have 'a' and that we want 'b', but nothing finer-grained
> without further extensions to the git protocols, i.e. ones where we can
> say that we have some objects (to exclude) without assuming anything
> about their prerequisites; something that, if I remember correctly, was
> implemented in some floating 'lazy clone' patch (well, lazy loading of
> blobs patch)...

Err, yes. Which is why I wanted to put a stable sort order on the
objects in the pack. If you do that then you can specify a range
within the range of objects being fetched.

E.g. in the diagram above if the client said "want b, have a" during
a "git fetch" we can apply the stable ordering to all objects in
that range "a..b", and then apply another subrange to that where
the client says "complete until Q", where "Q" denotes a position
in that sorted list. Thus we only need to transmit the remaining
elements.

> As Nico said in the presence of threaded packing ordering of _objects_
> on _packfile_ might be not deterministic.

Yea, ick.
I haven't looked at the threaded code in enough detail
to know how it behaves. But from what I read in discussion on the
list it really makes it impossible to get a stable ordering because
the delta base selected for an object can differ depending on which
thread handled that object, and if OFS_DELTA is being used then the
base must go before the delta, making the order somewhat determined
by which thread handled which object.

IIRC, my proposal predates the introduction of the threaded delta code.
Now that we have threaded delta code as the default on many
platforms... yea, this is likely *not* a good project for someone
who is new to Git. It's become a lot more difficult.

> I'll try to add 'pack file cache for git-daemon' proposal to
> GSoC2009Ideas page... but I cannot be mentor (or even co-mentor) for
> this idea.

The pack file cache project is likely easier than restarting a
pack file. Especially in the face of the threaded delta code.

There are difficult details about making the cache secure so we can't
overwrite repository data due to a buffer overflow. Or making
the cache prune itself so it doesn't run out of disk. Etc.
We've talked about a cache before on list.

On a related note, I remember I wrote a patch that saved packs
during "git push", before we added "git gc --auto", as a crude
attempt to incrementally repack a repository during other operations.

-- 
Shawn.

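The "want b, have a, complete until Q" idea from the message above could
be sketched roughly as follows. This is only an illustration under
stated assumptions: the thread does not say which sort key the proposal
would use, so sorting by object name is a guess, and the sketch ignores
the OFS_DELTA constraint that a base must precede its delta. The
function name is hypothetical.

```python
def remaining_objects(objects_in_range, completed_until):
    """Return the objects still to be sent after the client reports Q.

    objects_in_range: object names selected by "want b, have a", in any order.
    completed_until:  Q, the number of objects the client received in full.
    """
    # Assumption: ordering by object name stands in for the "stable sort".
    # It gives the same list no matter how the objects were enumerated.
    ordered = sorted(set(objects_in_range))
    if not 0 <= completed_until <= len(ordered):
        raise ValueError("position Q is outside the sorted object list")
    # Objects before Q were already received in full; a truncated object
    # at position Q is simply sent again from its start.
    return ordered[completed_until:]
```

Because Q counts whole objects rather than bytes, the server is free to
send a different byte representation of each remaining object on
resume, which is the property the proposal relies on.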
* Re: GSoC 2009 Prospective student
  2009-02-24 15:55 ` Shawn O. Pearce
@ 2009-02-24 21:08 ` Jakub Narebski
  2009-02-24 21:17 ` Nicolas Pitre
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Narebski @ 2009-02-24 21:08 UTC
To: Shawn O. Pearce; +Cc: Nicolas Pitre, Miklos Vajna, Rohan Dhruva, git

On Tue, 24 Feb 2009, Shawn O. Pearce wrote:
> Jakub Narebski <jnareb@gmail.com> wrote:
> >
> > I (and Nicolas) by 'sorting order' mean here ordering of objects and
> > deltas in the pack file, i.e. whether we get _exactly_ the same (byte
> > for byte) packfile for the same want/have exchange (your proposal), or
> > even for the same arguments to git-pack-objects (which is a necessary,
> > although I think not sufficient condition).
>
> I know.
>
> My proposal though didn't require the same byte-for-byte pack file.
> Only that the objects were in a predictable order. It didn't permit
> resuming in the middle of an object. If the last object in the pack
> was truncated the client would resume by getting that object again,
> and may get a different byte sequence for that object representation.

Ah, so you meant skipping the first N _objects_, and not the first N
_bytes_ of a re-generated pack. That's better.

Although in the case when packfiles are cached, I think you can support
resuming on a byte. But I guess only in such a case (where exactly
byte-for-byte the same packfile is resent / reused).

>
> It's a b**ch to know where you stopped though, as you could be in
> a long string of deltas whose base is in the portion you didn't
> yet receive. Which means you can't identify that string that you
> already have, and pack-objects on resume can't assume you have
> those objects, because you only have the deltas for them and are
> lacking a way to restore them.

Moreover, from what I understand the want/have exchange is about
_commits_, and it assumes that if you 'have' a commit, you have all
its ancestors, and all trees (including those of ancestors), and all
blobs (including those of ancestors). Not only delta without base.
Besides, if I remember correctly we always write base before delta;
or am I mistaken here?

But one could take a look at patches (present in the git mailing list
archive) which tried to add 'lazy clone' / 'remote alternates' support.
IIRC there was a 'haveonly' extension to the exchange protocol, which
was meant to say that you have (in full) only the given object, but not
necessarily its prerequisites. Then you can filter out those 'haveonly'
objects from the list of objects to pack fed to git-pack-objects, can't
you?

>
> > Can we assume that packfiles are named correctly (i.e. name of packfile
> > match SHA-1 footer)?
>
> Wrong.
>
> The hash in "pack-$hash.pack"/"pack-$hash.idx" is *not* the 20 byte
> SHA-1 footer. It's the 20 byte SHA-1 of the sorted object names that
> are in that pack.
>
> We should try not to assume that the pack's file name matches the
> sorted object names, but we can assume that the pack file name is
> "pack-$hash.pack" where $hash is a 40 character hexadecimal string.
> The dumb commit walkers already have this restriction built into
> them, and have for quite some time.
>
> Any pack writers, including fast-import, honor this naming standard
> in order to ensure they are compatible with the existing dumb
> commit walkers.

Ah. So it is a _bit_ harder (for "dumb" protocols) than I thought.
Still much easier than resumable clone for smart (pack generating)
protocols.

>
> > Therefore I think that restartable clone for "dumb" (commit walker)
> > protocols is easy GSoC project, while restartable clone for "smart"
> > (generate packfile) protocols is at least of medium difficulty, and
> > might be harder.
>
> Probably quite right. Unfortunately the majority of the git
> repositories out there are served with the smart protocol, because
> it is more efficient. :)

A long, long time ago the rsync:// protocol was recommended for the
initial clone. It has the serious disadvantage of possibly returning a
silently corrupted repository, as it didn't ensure that references and
objects were fetched in the correct sequence, and it is thus deprecated,
and support for it has bit-rotted ;) in places...

I wonder if it is possible to make rsync:// more robust...

[...]
> > I'll try to add 'pack file cache for git-daemon' proposal to
> > GSoC2009Ideas page... but I cannot be mentor (or even co-mentor) for
> > this idea.
>
> The pack file cache project is likely easier than restarting a
> pack file. Especially in the face of the threaded delta code.
>
> There are difficult details about making the cache secure so we can't
> overwrite repository data due to a buffer overflow. Or making
> the cache prune itself so it doesn't run out of disk. Etc.
> We've talked about a cache before on list.

Well, this is _cache_. OTOH having pack cache would make it easy to have
resumable clone if you hit one of cached packfiles on resume...

On the other hand I wonder what improvements it would give, as generating
packs with delta reuse is, I think, quite fast...

-- 
Jakub Narebski
Poland

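The difference between 'have' and the 'haveonly' extension recalled
earlier in this message can be sketched in Python. This is a toy model,
not the floating patch itself: the `deps` mapping (object name to the
names it directly requires) and both function names are hypothetical,
and the 'haveonly' semantics follow only the recollection given in this
thread.

```python
def reachable(start, deps):
    """Return every object reachable from `start` via the `deps` mapping."""
    seen = set()
    stack = list(start)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(deps.get(name, ()))
    return seen


def objects_to_pack(wants, haves, haveonly, deps):
    """List the objects a server would still need to pack.

    'have' excludes an object together with everything it requires;
    'haveonly' excludes just the named object and nothing else.
    """
    wanted = reachable(wants, deps)
    excluded = reachable(haves, deps) | set(haveonly)
    return sorted(wanted - excluded)
```

For example, with commits c1..c3 and their trees t1..t3, a client that
has c1 in full and only the single tree t2 would still be sent c2, c3
and t3.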
* Re: GSoC 2009 Prospective student
  2009-02-24 21:08 ` Jakub Narebski
@ 2009-02-24 21:17 ` Nicolas Pitre
  0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2009-02-24 21:17 UTC
To: Jakub Narebski; +Cc: Shawn O. Pearce, Miklos Vajna, Rohan Dhruva, git

On Tue, 24 Feb 2009, Jakub Narebski wrote:

> Well, this is _cache_. OTOH having pack cache would make it easy to have
> resumable clone if you hit one of cached packfiles on resume...
>
> On the other hand I wonder what improvements it would give, as generating
> packs with delta reuse is, I think, quite fast...

Object enumeration is still an issue. A cache would allow skipping that
part as well, making cached packs about the same as a simple file
server.

Nicolas

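As a rough sketch of the pack cache discussed in this thread: if the
exact same pack bytes are kept for a given want/have request, a resumed
clone can be answered from a byte offset without enumerating objects
again. Everything here is hypothetical (the class, its cache key and
the `generate_pack` callback); it is an in-memory toy, and it leaves out
the pruning and security concerns raised earlier.

```python
import hashlib


class PackCache:
    """Hypothetical in-memory cache of generated packs, keyed by the request."""

    def __init__(self, generate_pack):
        # generate_pack: callable(wants, haves) -> bytes; stands in for the
        # expensive enumerate-and-pack step.
        self._generate_pack = generate_pack
        self._packs = {}

    @staticmethod
    def _key(wants, haves):
        # Assumption: the same sorted want/have sets identify the same pack.
        text = "want %s\nhave %s" % (" ".join(sorted(wants)), " ".join(sorted(haves)))
        return hashlib.sha1(text.encode("ascii")).hexdigest()

    def serve(self, wants, haves, offset=0):
        """Return the pack bytes from `offset` on, generating the pack once."""
        key = self._key(wants, haves)
        if key not in self._packs:
            if offset:
                # Byte resume is only safe against the identical cached pack.
                raise LookupError("no cached pack to resume from")
            self._packs[key] = self._generate_pack(wants, haves)
        pack = self._packs[key]
        if not 0 <= offset <= len(pack):
            raise ValueError("offset is outside the cached pack")
        return pack[offset:]
```

On a cache hit the server only slices stored bytes, which is the "simple
file server" behaviour described above.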
end of thread [~2009-02-24 21:18 UTC]

Thread overview: 14+ messages
2009-02-22 19:58 GSoC 2009 Prospective student Rohan Dhruva
2009-02-22 20:07 ` Sverre Rabbelier
2009-02-22 20:29 ` Rohan Dhruva
2009-02-22 20:38 ` Sverre Rabbelier
2009-02-22 20:43 ` Miklos Vajna
2009-02-22 22:22 ` Nicolas Pitre
2009-02-23  0:46 ` Sitaram Chamarty
2009-02-23 15:37 ` Jakub Narebski
2009-02-23 15:58 ` Shawn O. Pearce
2009-02-23 16:31 ` Nicolas Pitre
2009-02-24 15:38 ` Jakub Narebski
2009-02-24 15:55 ` Shawn O. Pearce
2009-02-24 21:08 ` Jakub Narebski
2009-02-24 21:17 ` Nicolas Pitre