* [RFC] Introducing different handling for small/large transactions

From: Stefan Beller @ 2015-01-15 22:36 UTC
To: mhagger, git; +Cc: Stefan Beller

For everyday use we want git to be fast. Creating one commit should not
touch the packed refs file. If we do other stuff involving more than one
ref, we may accept touching the packed refs file and have a process which
takes slightly longer but can handle more complex requests correctly, such
as renaming into and from directories (topic/1 -> topic and the reverse).
Renaming is currently not part of the transaction API because of the (D/F)
problems. This proposed change would enable renames to be part of the
transaction API.

A transaction covers creating, deleting and updating a ref and its reflog.
Renaming would be a deletion followed by the creation of a new ref,
performed atomically.

So here is my proposal for small transactions
(just one ref [and/or reflog] touched):

In ref_transaction_update:
  * create the $REF.lock file
  * write the new content to the lock file

In ref_transaction_commit:
  * commit the .lock file to its destination
  * in case this is a deletion:
    * remove the loose ref
    * and repack the packed refs file if necessary

Larger transactions would be handled differently, by relying on the
packed refs file:

In ref_transaction_update:
  * detect if we transition to a large transaction
    (by having more than one entry in transaction->updates); if so:
    * Pack all currently existing refs into the packed refs file, commit
      the packed refs file and delete all loose refs. This will avoid
      (D/F) conflicts.
    * Keep the packed-refs file locked and move the first transaction
      update into the packed-refs.lock file.
  * Any update (delete, create, update) is put into the locked packed
    refs file.
  * Additionally we need to obtain the .lock for the loose refs to
    keep the usual guarantees, though we should close the file
    descriptors as we don't want to run out of file descriptors.

In ref_transaction_commit:
  * We only need to commit the packed refs file.
  * Discard all other lock files, as the changes get committed as a whole
    by the packed refs file.

Any feedback would be welcome!

Thanks,
Stefan
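A minimal C sketch of the proposed dispatch, under assumed names (the
struct and helpers below are hypothetical illustrations, not the real
refs.c API); the intent is only to show where the small-to-large
transition would happen:

    #include <stddef.h>
    #include <stdio.h>

    enum transaction_kind { SMALL_TRANSACTION, LARGE_TRANSACTION };

    struct ref_transaction {
    	size_t nr_updates;          /* entries in transaction->updates */
    	enum transaction_kind kind;
    };

    static void transition_to_large(struct ref_transaction *t)
    {
    	/*
    	 * Proposed: pack all existing refs, commit the packed-refs
    	 * file, delete all loose refs (avoiding D/F conflicts), then
    	 * keep packed-refs.lock held and funnel every further update
    	 * through it.
    	 */
    	t->kind = LARGE_TRANSACTION;
    }

    static void ref_transaction_update(struct ref_transaction *t)
    {
    	t->nr_updates++;
    	/* proposed trigger: more than one update in the transaction */
    	if (t->kind == SMALL_TRANSACTION && t->nr_updates > 1)
    		transition_to_large(t);
    	if (t->kind == SMALL_TRANSACTION)
    		puts("write $REF.lock with the new content");
    	else
    		puts("stage the update in the locked packed-refs file");
    }

    static void ref_transaction_commit(struct ref_transaction *t)
    {
    	if (t->kind == SMALL_TRANSACTION)
    		puts("rename $REF.lock into place (repack first on deletion)");
    	else
    		puts("commit packed-refs.lock; discard the per-ref locks");
    }

    int main(void)
    {
    	struct ref_transaction t = { 0, SMALL_TRANSACTION };
    	ref_transaction_update(&t);   /* still small */
    	ref_transaction_update(&t);   /* second update: go large */
    	ref_transaction_commit(&t);
    	return 0;
    }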
* Re: [RFC] Introducing different handling for small/large transactions

From: Jeff King @ 2015-01-15 22:46 UTC
To: Stefan Beller; +Cc: mhagger, git

On Thu, Jan 15, 2015 at 02:36:11PM -0800, Stefan Beller wrote:

> So here is my proposal for small transactions
> (just one ref [and/or reflog] touched):

The implication being that a "large" transaction is any with more than
one update.

I think performance may suffer if you do not also take into account the
size of the packed-refs file. If you are updating 5 refs and there are 10
in the packed-refs file, rewriting the extra 5 is probably not a big
deal. If there are 400,000 in the packed-refs file, it probably is. I'm
not sure where the cutoff is (certainly the per-ref cost is less for
packed-refs once you have started writing the file, so there is probably
some crossover percentage that you could measure).

> * detect if we transition to a large transaction
>   (by having more than one entry in transaction->updates); if so:
>   * Pack all currently existing refs into the packed refs file, commit
>     the packed refs file and delete all loose refs. This will avoid
>     (D/F) conflicts.
>   * Keep the packed-refs file locked and move the first transaction
>     update into the packed-refs.lock file.

This increases lock contention, as now independent ref updates all need
to take the same packed-refs.lock. This can be a problem on a busy
repository, especially because we never retry the packed-refs lock.

We already see this problem somewhat on GitHub. Ref deletions need the
packed-refs.lock file, which can conflict with another deletion, or with
running `git pack-refs`.

-Peff
* Re: [RFC] Introducing different handling for small/large transactions

From: Stefan Beller @ 2015-01-15 23:24 UTC
To: Jeff King; +Cc: Michael Haggerty, git@vger.kernel.org

On Thu, Jan 15, 2015 at 2:46 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Jan 15, 2015 at 02:36:11PM -0800, Stefan Beller wrote:
>
>> So here is my proposal for small transactions
>> (just one ref [and/or reflog] touched):
>
> The implication being that a "large" transaction is any with more than
> one update.

Exactly.

> I think performance may suffer if you do not also take into account the
> size of the packed-refs file. If you are updating 5 refs and there are 10
> in the packed-refs file, rewriting the extra 5 is probably not a big
> deal. If there are 400,000 in the packed-refs file, it probably is. I'm
> not sure where the cutoff is (certainly the per-ref cost is less for
> packed-refs once you have started writing the file, so there is probably
> some crossover percentage that you could measure).
>
>> * detect if we transition to a large transaction
>>   (by having more than one entry in transaction->updates); if so:
>>   * Pack all currently existing refs into the packed refs file, commit
>>     the packed refs file and delete all loose refs. This will avoid
>>     (D/F) conflicts.
>>   * Keep the packed-refs file locked and move the first transaction
>>     update into the packed-refs.lock file.
>
> This increases lock contention, as now independent ref updates all need
> to take the same packed-refs.lock. This can be a problem on a busy
> repository, especially because we never retry the packed-refs lock.
> We already see this problem somewhat on GitHub. Ref deletions need the
> packed-refs.lock file, which can conflict with another deletion, or with
> running `git pack-refs`.
>
> -Peff

I see the performance problem as well as the contention problem you're
pointing out. Dealing with loose refs, however, creates other problems,
such as directory/file conflicts on renaming. I am trying to think of a
way which moves most of the failures to the transaction_update phase,
such that transaction_commit is rather easy and not expected to produce
many errors.

So I think dealing with a generic large transaction cannot really be
solved outside the packed refs file. There could be another special case
for mass-deleting refs, however. Or retries for the packed refs file. Or
we could first check whether we *really* need to lock the packed refs
file (just realized we already do that :/).

(Just curious:) May I ask on which kinds of repository you can see
packed-refs.lock contention?

I want to improve git atomicity, especially for 'weird' cases as
presented in my previous mail[1]. Eventually I even want to have
cross-repository atomicity in git, so an example could be:

    (
        cd API-Provider &&
        edit files # new changes breaking the API
        git commit -a -m "..."
    ) &&
    (
        cd API-consumer &&
        edit files # use new and shiny API
        git commit -a -m "..."
    ) &&
    git multipush --atomic API-Provider:origin:master API-consumer:origin:master

When having such a goal, a reliable and easy-to-use ref transaction API
makes life a lot easier. By reliable I mean that there are no sudden
problems as pointed out in [1]; these kinds of rejections make users
unhappy. And by easy to use I mean that there are only a few functions I
need to know and no proliferation of functions exposed in the API.
Internally we can do all we want, such as special cases for delete-only
transactions.

As another, unrelated thought (400,000 refs is quite a lot): would it
make sense to have packed-refs files grouped by topic directory, i.e. one
packed-refs file for

    topic/1
    topic/2
    topic/*

and another one for

    feature/a
    feature/b
    feature/*

This would rather help your problems with contention than make things
more atomic, though. But it would avoid 400,000 refs in one packed refs
file, which may then still be easier to handle for larger transactions.

Thanks,
Stefan

[1] http://www.mail-archive.com/git@vger.kernel.org/msg63919.html
* Re: [RFC] Introducing different handling for small/large transactions

From: Jeff King @ 2015-01-15 23:53 UTC
To: Stefan Beller; +Cc: Michael Haggerty, git@vger.kernel.org

On Thu, Jan 15, 2015 at 03:24:15PM -0800, Stefan Beller wrote:

> I see the performance problem as well as the contention problem you're
> pointing out. Dealing with loose refs, however, creates other problems,
> such as directory/file conflicts on renaming. I am trying to think of a
> way which moves most of the failures to the transaction_update phase,
> such that transaction_commit is rather easy and not expected to produce
> many errors.
>
> So I think dealing with a generic large transaction cannot really be
> solved outside the packed refs file. There could be another special case
> for mass-deleting refs, however. Or retries for the packed refs file. Or
> we could first check whether we *really* need to lock the packed refs
> file (just realized we already do that :/).

It sounds like there are certain _types_ of transactions that trigger
problems (and that we can handle better). E.g., renames. Which must
involve a deletion. And we already have to touch the packed-refs file
for a deletion. Once we know that we must do that, we might as well
write the rest of the ref updates into it for "free", right? There is no
extra lock contention (there is contention, but we are not making it
worse than what is already there, and what is necessary for the on-disk
format).

Could the rule be something like:

    if there are deletions in the transaction
        /* to handle potential D/F conflicts */
        update_type = packed
    else if # of updates / # of packed-refs entries > 0.8
        /*
         * for performance, we are better off just writing the whole file;
         * extra lock contention probably doesn't matter because a
         * simultaneous write would probably conflict with one of our
         * refs anyway.
         */
        update_type = packed
    else
        /*
         * small-ish ref update. Just write out the loose refs.
         */
        update_type = loose

I'm not sure I'm completely comfortable with all of the "probably"'s in
the second comment, though.

> May I ask on which kinds of repository you can see packed-refs.lock
> contention?

I'm not sure what you mean by "kinds" (i.e., how you want me to
characterize our repos). Busy ones, certainly. :) And this is all
server-side, where the main write operations are client pushes and
running "git gc". And of course mainly ones that are shared between many
users (so many users pushing to a central repo, not each one forking and
making pull requests from their own publishing points).

Some people do a lot of deletions. E.g., they may be constantly pushing
and deleting tags from a CI process that kicks off whenever a "real"
push happens.

A lot of our contention is from the pack-refs process itself. It has to
take a lock on each of the loose refs it prunes. It's smart enough
(since 06839515, anyway) to avoid complaining when a real writer is
holding the lock. But when the race goes the other way, the "real"
writer has no idea that the process holding the lock is "weak". That is,
if it were another push, we would not want to block and wait to acquire
the lock. We would only find out then that the sha1 had moved, and we
have to abort the operation. But for a weak writer like packed-refs, the
ref value is not changing at all, and we could probably succeed if we
retried the lock after a delay.

> I want to improve git atomicity, especially for 'weird' cases as
> presented in my previous mail[1]. Eventually I even want to have
> cross-repository atomicity in git, so an example could be:
>     (
>         cd API-Provider &&
>         edit files # new changes breaking the API
>         git commit -a -m "..."
>     ) &&
>     (
>         cd API-consumer &&
>         edit files # use new and shiny API
>         git commit -a -m "..."
>     ) &&
>     git multipush --atomic API-Provider:origin:master API-consumer:origin:master

I think that's a reasonable goal, but I am not sure what packed-refs has
to do with it. The client sees only the push interface. So you need:

  1. Atomic push to each remote (which I think you have already
     implemented).

  2. Some kind of 2-phase commit on each remote. To say "take the lock,
     tell me when you have it, but _don't_ update until I ask you to".
     And then when you have confirmation from all remotes, send each one
     the instruction to commit the update.

     Of course that isn't foolproof if the commit step fails (lost
     connection, write error on disk, etc), but it's basically what you
     want (and this kind of commit is well-studied in the literature).

The actions that the remote takes to do (1) are not important, and
either way it needs to take all of the locks for each ref. Using
packed-refs here is just a quality-of-implementation issue (there are
some transactions we can take as a single atomic change that we
otherwise would not be able to).

Coincidentally, we are looking at something similar to (2) at GitHub for
doing repository replication (where you'd want to make sure that each of
your replicas moves in lockstep). It hasn't been written yet, but we
imagined that the result would be way too gross and too GitHub-specific
to make it into the upstream protocol. But maybe not.

> When having such a goal, a reliable and easy-to-use ref transaction API
> makes life a lot easier. By reliable I mean that there are no sudden
> problems as pointed out in [1]; these kinds of rejections make users
> unhappy. And by easy to use I mean that there are only a few functions I
> need to know and no proliferation of functions exposed in the API.
> Internally we can do all we want, such as special cases for delete-only
> transactions.

Yeah, I agree with this. But I have a nagging feeling as we deal with
these problems that the right solution is some better ref-store than the
filesystem + packed-refs. It's a great, simple solution, but the scaling
and transactional considerations are really making it not-simple. These
are solved problems in the database world. Moving onto a ready-made
solution has its own problems, and I do not think we would ever
deprecate the filesystem + packed-refs storage format. It works very
well for client repositories, and has no dependencies. But it would be
nice if there was a pluggable option for busy shared server repos.

> As another, unrelated thought (400,000 refs is quite a lot): would it
> make sense to have packed-refs files grouped by topic directory, i.e. one
> packed-refs file for
>     topic/1
>     topic/2
>     topic/*
> and another one for
>     feature/a
>     feature/b
>     feature/*
>
> This would rather help your problems with contention than make things
> more atomic, though. But it would avoid 400,000 refs in one packed refs
> file, which may then still be easier to handle for larger transactions.

Certainly the thought has occurred to me (and others at GitHub). But if
we are going to break compatibility, I think I'd rather see us move onto
a more mature data-storage backend than invent something even more
complicated (and that does not really solve the problems, just makes
them less bad).

-Peff
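A compilable C rendering of the rule Peff sketches above; the 0.8
crossover is his illustrative guess, and the function name and
signature are hypothetical, not git's internal API:

    #include <stddef.h>
    #include <stdio.h>

    enum update_type { UPDATE_LOOSE, UPDATE_PACKED };

    /*
     * Deletions always go through packed-refs (to handle potential D/F
     * conflicts); otherwise fall back to the packed path only when the
     * transaction covers most of the packed-refs file.  The crossover
     * ratio would need to be measured, as Peff notes.
     */
    static enum update_type choose_update_type(size_t nr_updates,
                                               size_t nr_deletions,
                                               size_t nr_packed_refs)
    {
    	if (nr_deletions > 0)
    		return UPDATE_PACKED;
    	if (nr_packed_refs > 0 &&
    	    (double)nr_updates / (double)nr_packed_refs > 0.8)
    		return UPDATE_PACKED;
    	return UPDATE_LOOSE;
    }

    int main(void)
    {
    	/* 5 updates, no deletions, 400,000 packed refs: stay loose */
    	printf("%s\n", choose_update_type(5, 0, 400000) == UPDATE_LOOSE
    	       ? "loose" : "packed");
    	/* a rename (delete+create): always go through packed-refs */
    	printf("%s\n", choose_update_type(2, 1, 400000) == UPDATE_PACKED
    	       ? "packed" : "loose");
    	return 0;
    }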
* Re: [RFC] Introducing different handling for small/large transactions

From: Stefan Beller @ 2015-01-16 19:23 UTC
To: Jeff King; +Cc: Michael Haggerty, git@vger.kernel.org

On Thu, Jan 15, 2015 at 3:53 PM, Jeff King <peff@peff.net> wrote:
> On Thu, Jan 15, 2015 at 03:24:15PM -0800, Stefan Beller wrote:
>
>> I see the performance problem as well as the contention problem you're
>> pointing out. Dealing with loose refs, however, creates other problems,
>> such as directory/file conflicts on renaming. I am trying to think of a
>> way which moves most of the failures to the transaction_update phase,
>> such that transaction_commit is rather easy and not expected to produce
>> many errors.
>>
>> So I think dealing with a generic large transaction cannot really be
>> solved outside the packed refs file. There could be another special case
>> for mass-deleting refs, however. Or retries for the packed refs file. Or
>> we could first check whether we *really* need to lock the packed refs
>> file (just realized we already do that :/).
>
> It sounds like there are certain _types_ of transactions that trigger
> problems (and that we can handle better). E.g., renames. Which must
> involve a deletion. And we already have to touch the packed-refs file
> for a deletion. Once we know that we must do that, we might as well
> write the rest of the ref updates into it for "free", right? There is no
> extra lock contention (there is contention, but we are not making it
> worse than what is already there, and what is necessary for the on-disk
> format).
>
> Could the rule be something like:
>
>     if there are deletions in the transaction
>         /* to handle potential D/F conflicts */
>         update_type = packed
>     else if # of updates / # of packed-refs entries > 0.8
>         /*
>          * for performance, we are better off just writing the whole file;
>          * extra lock contention probably doesn't matter because a
>          * simultaneous write would probably conflict with one of our
>          * refs anyway.
>          */
>         update_type = packed
>     else
>         /*
>          * small-ish ref update. Just write out the loose refs.
>          */
>         update_type = loose
>
> I'm not sure I'm completely comfortable with all of the "probably"'s in
> the second comment, though.
>
>> May I ask on which kinds of repository you can see packed-refs.lock
>> contention?
>
> I'm not sure what you mean by "kinds" (i.e., how you want me to
> characterize our repos). Busy ones, certainly. :) And this is all
> server-side, where the main write operations are client pushes and
> running "git gc". And of course mainly ones that are shared between many
> users (so many users pushing to a central repo, not each one forking and
> making pull requests from their own publishing points).
>
> Some people do a lot of deletions. E.g., they may be constantly pushing
> and deleting tags from a CI process that kicks off whenever a "real"
> push happens.
>
> A lot of our contention is from the pack-refs process itself. It has to
> take a lock on each of the loose refs it prunes. It's smart enough
> (since 06839515, anyway) to avoid complaining when a real writer is
> holding the lock. But when the race goes the other way, the "real"
> writer has no idea that the process holding the lock is "weak". That is,
> if it were another push, we would not want to block and wait to acquire
> the lock. We would only find out then that the sha1 had moved, and we
> have to abort the operation. But for a weak writer like packed-refs, the
> ref value is not changing at all, and we could probably succeed if we
> retried the lock after a delay.
>
>> I want to improve git atomicity, especially for 'weird' cases as
>> presented in my previous mail[1]. Eventually I even want to have
>> cross-repository atomicity in git, so an example could be:
>>     (
>>         cd API-Provider &&
>>         edit files # new changes breaking the API
>>         git commit -a -m "..."
>>     ) &&
>>     (
>>         cd API-consumer &&
>>         edit files # use new and shiny API
>>         git commit -a -m "..."
>>     ) &&
>>     git multipush --atomic API-Provider:origin:master API-consumer:origin:master
>
> I think that's a reasonable goal, but I am not sure what packed-refs has
> to do with it. The client sees only the push interface. So you need:

I was assuming origin to be the same for both repos, which then makes it
a lot easier. So on the server side you need to:

  * prepare the ref transaction in the first repo
  * "cd second-repo"
  * prepare the ref transaction in the second repo
  * commit the ref transaction

Note that we're in the second repository when committing the ref
transaction, but we also want to commit things in the first repo. So we
cannot just keep branch names in memory, but need to have the lock files
already: a lock file is identified by its filename (or optionally by its
file descriptor), so we still know which of the repos we're addressing.

>   1. Atomic push to each remote (which I think you have already
>      implemented).
>
>   2. Some kind of 2-phase commit on each remote. To say "take the lock,
>      tell me when you have it, but _don't_ update until I ask you to".
>      And then when you have confirmation from all remotes, send each one
>      the instruction to commit the update.
>
>      Of course that isn't foolproof if the commit step fails (lost
>      connection, write error on disk, etc), but it's basically what you
>      want (and this kind of commit is well-studied in the literature).

It is even taught in universities, so it must be at least 15 years old
already. ;)

> The actions that the remote takes to do (1) are not important, and
> either way it needs to take all of the locks for each ref. Using
> packed-refs here is just a quality-of-implementation issue (there are
> some transactions we can take as a single atomic change that we
> otherwise would not be able to).

Yeah, I think it's a good goal that atomicity does not shrink the
possible set of operations, i.e. what works without --atomic should also
work with --atomic.

> Coincidentally, we are looking at something similar to (2) at GitHub for
> doing repository replication (where you'd want to make sure that each of
> your replicas moves in lockstep). It hasn't been written yet, but we
> imagined that the result would be way too gross and too GitHub-specific
> to make it into the upstream protocol. But maybe not.

Ronnie has written code for that and already posted it to the mailing
list a few times, but it was lacking reviews and feedback. Essentially he
proposed the following:

  * Do a push to a hidden part of the git server. This place is hidden
    by default and doesn't need special setup (refs/hidden/ IIRC).
  * Do a second push which makes the previous push visible. This one
    doesn't need to wait for pack uploads, so it's fast. Hence it could
    lock all it wants.

>> When having such a goal, a reliable and easy-to-use ref transaction API
>> makes life a lot easier. By reliable I mean that there are no sudden
>> problems as pointed out in [1]; these kinds of rejections make users
>> unhappy. And by easy to use I mean that there are only a few functions I
>> need to know and no proliferation of functions exposed in the API.
>> Internally we can do all we want, such as special cases for delete-only
>> transactions.
>
> Yeah, I agree with this. But I have a nagging feeling as we deal with
> these problems that the right solution is some better ref-store than the
> filesystem + packed-refs. It's a great, simple solution, but the scaling
> and transactional considerations are really making it not-simple. These
> are solved problems in the database world. Moving onto a ready-made
> solution has its own problems, and I do not think we would ever
> deprecate the filesystem + packed-refs storage format. It works very
> well for client repositories, and has no dependencies. But it would be
> nice if there was a pluggable option for busy shared server repos.

That's what Jonathan said here as well. I thought about improving the
files backend, but you're the second person talking me into implementing
a database backend. I'll look into that.

Thanks,
Stefan
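A minimal C sketch of the two-phase commit shape Peff describes in (2);
remote_prepare/remote_commit/remote_abort and multipush_atomic are
hypothetical stand-ins, not git protocol verbs:

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Textbook two-phase commit across remotes: phase 1 asks every
     * remote to take its ref locks and report back; only after all
     * have confirmed does phase 2 tell them to commit.
     */
    struct remote { const char *name; bool prepared; };

    static bool remote_prepare(struct remote *r)
    {
    	printf("%s: take ref locks, hold them, report success\n", r->name);
    	r->prepared = true;
    	return true;
    }

    static void remote_commit(struct remote *r)
    {
    	printf("%s: rename locks into place\n", r->name);
    }

    static void remote_abort(struct remote *r)
    {
    	if (r->prepared)
    		printf("%s: roll back, release locks\n", r->name);
    }

    static bool multipush_atomic(struct remote *remotes, int n)
    {
    	int i;

    	/* Phase 1: every remote must acquire its locks. */
    	for (i = 0; i < n; i++)
    		if (!remote_prepare(&remotes[i]))
    			goto rollback;

    	/* Phase 2: only now tell everyone to commit. */
    	for (i = 0; i < n; i++)
    		remote_commit(&remotes[i]);
    	return true;

    rollback:
    	while (i-- > 0)
    		remote_abort(&remotes[i]);
    	return false;
    }

    int main(void)
    {
    	struct remote r[] = {
    		{ "API-Provider", false },
    		{ "API-consumer", false },
    	};
    	return multipush_atomic(r, 2) ? 0 : 1;
    }

As Peff notes, this is not foolproof if the commit step itself fails;
the sketch only shows the lock-then-confirm ordering.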
* Re: [RFC] Introducing different handling for small/large transactions

From: Junio C Hamano @ 2015-01-15 23:34 UTC
To: Stefan Beller; +Cc: mhagger, git

Stefan Beller <sbeller@google.com> writes:

> In ref_transaction_commit:
>   * commit the .lock file to its destination
>   * in case this is a deletion:
>     * remove the loose ref
>     * and repack the packed refs file if necessary

Don't you need to repack and then remove the loose one, though?
Otherwise you would expose a stale packed ref in the middle to the
other readers, no?

> Larger transactions would be handled differently, by relying on the
> packed refs file:
>
> In ref_transaction_update:
>   * detect if we transition to a large transaction
>     (by having more than one entry in transaction->updates); if so:
>     * Pack all currently existing refs into the packed refs file, commit
>       the packed refs file and delete all loose refs. This will avoid
>       (D/F) conflicts.
>     * Keep the packed-refs file locked and move the first transaction
>       update into the packed-refs.lock file.
>   * Any update (delete, create, update) is put into the locked packed
>     refs file.

I am not sure if you mean (a) keep updates only in-core, to be flushed
at commit time, or (b) each and every update in the large transaction
results in rewriting the entire packed-refs.lock file, only to be
renamed to the final name at commit time. I am hoping it would be the
former.

>   * Additionally we need to obtain the .lock for the loose refs to
>     keep the usual guarantees, though we should close the file
>     descriptors as we don't want to run out of file descriptors.

Yes, this last point is important.
* Re: [RFC] Introducing different handling for small/large transactions

From: Stefan Beller @ 2015-01-16 19:00 UTC
To: Junio C Hamano; +Cc: Michael Haggerty, git@vger.kernel.org

On Thu, Jan 15, 2015 at 3:34 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Stefan Beller <sbeller@google.com> writes:
>
>> In ref_transaction_commit:
>>   * commit the .lock file to its destination
>>   * in case this is a deletion:
>>     * remove the loose ref
>>     * and repack the packed refs file if necessary
>
> Don't you need to repack and then remove the loose one, though?
> Otherwise you would expose a stale packed ref in the middle to the
> other readers, no?

You're right.

>> Larger transactions would be handled differently, by relying on the
>> packed refs file:
>>
>> In ref_transaction_update:
>>   * detect if we transition to a large transaction
>>     (by having more than one entry in transaction->updates); if so:
>>     * Pack all currently existing refs into the packed refs file, commit
>>       the packed refs file and delete all loose refs. This will avoid
>>       (D/F) conflicts.
>>     * Keep the packed-refs file locked and move the first transaction
>>       update into the packed-refs.lock file.
>>   * Any update (delete, create, update) is put into the locked packed
>>     refs file.
>
> I am not sure if you mean (a) keep updates only in-core, to be flushed
> at commit time, or (b) each and every update in the large transaction
> results in rewriting the entire packed-refs.lock file, only to be
> renamed to the final name at commit time. I am hoping it would be the
> former.

I wasn't sure which I meant. The first one is obviously better to do.

>>   * Additionally we need to obtain the .lock for the loose refs to
>>     keep the usual guarantees, though we should close the file
>>     descriptors as we don't want to run out of file descriptors.
>
> Yes, this last point is important.
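A sketch of Junio's option (a), under assumed names: updates accumulate
in core during ref_transaction_update() and the locked packed-refs file
is written exactly once, at commit time. Error handling and the actual
file locking are elided:

    #include <stdio.h>
    #include <stdlib.h>

    struct queued_update {
    	char refname[256];
    	char sha1_hex[41];       /* all-zeros would mean "delete" */
    };

    struct transaction {
    	struct queued_update *updates;
    	size_t nr, alloc;
    };

    /* called once per ref from the update step: in-core only, no I/O */
    static void queue_update(struct transaction *t,
    			 const char *refname, const char *sha1_hex)
    {
    	if (t->nr == t->alloc) {
    		t->alloc = t->alloc ? 2 * t->alloc : 16;
    		t->updates = realloc(t->updates,
    				     t->alloc * sizeof(*t->updates));
    	}
    	snprintf(t->updates[t->nr].refname, 256, "%s", refname);
    	snprintf(t->updates[t->nr].sha1_hex, 41, "%s", sha1_hex);
    	t->nr++;
    }

    /* called once from the commit step: the single flush */
    static void commit_packed_refs(struct transaction *t, FILE *lockfile)
    {
    	for (size_t i = 0; i < t->nr; i++)
    		fprintf(lockfile, "%s %s\n",
    			t->updates[i].sha1_hex, t->updates[i].refname);
    	/* ...then fsync and rename packed-refs.lock into place */
    }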
* Re: [RFC] Introducing different handling for small/large transactions

From: Michael Haggerty @ 2015-01-18 12:13 UTC
To: Stefan Beller, git

On 01/15/2015 11:36 PM, Stefan Beller wrote:
> For everyday use we want git to be fast. Creating one commit should not
> touch the packed refs file. If we do other stuff involving more than one
> ref, we may accept touching the packed refs file and have a process which
> takes slightly longer but can handle more complex requests correctly, such
> as renaming into and from directories (topic/1 -> topic and the reverse).
> Renaming is currently not part of the transaction API because of the (D/F)
> problems. This proposed change would enable renames to be part of the
> transaction API.
>
> A transaction covers creating, deleting and updating a ref and its reflog.
> Renaming would be a deletion followed by the creation of a new ref,
> performed atomically.

A rename is a little bit more than a generic delete+create pair; it also
moves the reflog from the old name to the new name. Is your plan to add
an extra "rename" operation to the refs-transactions API, or to
automatically detect delete+create pairs and treat them as renames?

> So here is my proposal for small transactions
> (just one ref [and/or reflog] touched):
>
> In ref_transaction_update:
>   * create the $REF.lock file
>   * write the new content to the lock file
>
> In ref_transaction_commit:
>   * commit the .lock file to its destination
>   * in case this is a deletion:
>     * remove the loose ref
>     * and repack the packed refs file if necessary

The above describes the current algorithm, but FYI it is not entirely
correct. The deletion of the loose ref might expose an old version of
the reference in the packed-refs file (which might even point at an
object that has been garbage-collected). So the reference has to be
deleted from the packed-refs file before the loose ref version is
deleted. However, it is important that the packed-refs lock be held
during the whole procedure, so that a pack-refs process doesn't rewrite
the loose ref version of the reference into the (now-unlocked)
packed-refs file, causing the reference to survive its supposed
deletion. (At least that was the status a while ago; I don't know
whether recent changes to pack-refs might have removed this problem in
another way.)

But activating a new packed-refs file while still holding the
packed-refs lock is not supported by our current lockfile API. In fact,
working towards enabling this was one of the reasons for the big
lockfile refactoring that I did a while back, though I never got as far
as fixing this bug.

> Larger transactions would be handled differently, by relying on the
> packed refs file:
>
> In ref_transaction_update:
>   * detect if we transition to a large transaction
>     (by having more than one entry in transaction->updates); if so:
>     * Pack all currently existing refs into the packed refs file, commit
>       the packed refs file and delete all loose refs. This will avoid
>       (D/F) conflicts.
>     * Keep the packed-refs file locked and move the first transaction
>       update into the packed-refs.lock file.

NB: this requires not just one but two rewrites of the packed-refs
file, sharpening the performance concerns expressed elsewhere in this
thread. But couldn't one of the rewrites be avoided if the transaction
doesn't involve any deletes?

>   * Any update (delete, create, update) is put into the locked packed
>     refs file.
>   * Additionally we need to obtain the .lock for the loose refs to
>     keep the usual guarantees, though we should close the file
>     descriptors as we don't want to run out of file descriptors.
>
> In ref_transaction_commit:
>   * We only need to commit the packed refs file.
>   * Discard all other lock files, as the changes get committed as a whole
>     by the packed refs file.

Michael

--
Michael Haggerty
mhagger@alum.mit.edu
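A sketch of the safe deletion ordering Michael (and Junio, above)
describe, with hypothetical helper names rather than git's actual
lockfile API; the point is the ordering, and that the packed-refs lock
must be held across both steps:

    #include <stdio.h>

    /* hypothetical stand-ins; error handling elided */
    static void lock_packed_refs(void)
    {
    	puts("take packed-refs.lock");
    }
    static void rewrite_packed_refs_without(const char *ref)
    {
    	printf("write new packed-refs omitting %s, activate it\n", ref);
    }
    static void remove_loose_ref(const char *ref)
    {
    	printf("unlink %s\n", ref);
    }
    static void unlock_packed_refs(void)
    {
    	puts("release packed-refs lock");
    }

    static void delete_ref_safely(const char *ref)
    {
    	lock_packed_refs();

    	/*
    	 * 1. Drop the ref from packed-refs *first*, so removing the
    	 *    loose file cannot expose a stale packed version.
    	 */
    	rewrite_packed_refs_without(ref);

    	/*
    	 * 2. Only then remove the loose ref -- while logically still
    	 *    holding the packed-refs lock, so a concurrent pack-refs
    	 *    cannot write the loose value back into packed-refs.
    	 *    (Activating the new packed-refs while keeping the lock is
    	 *    exactly what Michael says the lockfile API of the time
    	 *    did not support.)
    	 */
    	remove_loose_ref(ref);

    	unlock_packed_refs();
    }

    int main(void)
    {
    	delete_ref_safely("refs/heads/topic/1");
    	return 0;
    }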