From: Luke Diamand <luke@diamand.org>
To: larsxschneider@gmail.com, git@vger.kernel.org
Cc: gitster@pobox.com
Subject: Re: [PATCH v5 6/7] git-p4: add support for large file systems
Date: Wed, 16 Sep 2015 09:36:46 +0100 [thread overview]
Message-ID: <55F92A1E.1090002@diamand.org> (raw)
In-Reply-To: <1442237194-49624-7-git-send-email-larsxschneider@gmail.com>
On 14/09/15 14:26, larsxschneider@gmail.com wrote:
> From: Lars Schneider <larsxschneider@gmail.com>
>
> Perforce repositories can contain large (binary) files. Migrating these
> repositories to Git generates very large local clones. External storage
> systems such as Git LFS [1], Git Fat [2], Git Media [3], git-annex [4]
> try to address this problem.
>
> Add a generic mechanism to detect large files based on extension,
> uncompressed size, and/or compressed size.
Looks excellent! I've got a small few comments (below) and I need to
come back and have another look through if that's OK.
Thanks!
Luke
>
> [1] https://git-lfs.github.com/
> [2] https://github.com/jedbrown/git-fat
> [3] https://github.com/alebedev/git-media
> [4] https://git-annex.branchable.com/
>
> Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
> ---
> Documentation/git-p4.txt | 32 +++++++++++
> git-p4.py | 139 +++++++++++++++++++++++++++++++++++++++++----
> t/t9823-git-p4-mock-lfs.sh | 96 +++++++++++++++++++++++++++++++
> 3 files changed, 257 insertions(+), 10 deletions(-)
> create mode 100755 t/t9823-git-p4-mock-lfs.sh
>
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index 82aa5d6..3f21e95 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -510,6 +510,38 @@ git-p4.useClientSpec::
> option '--use-client-spec'. See the "CLIENT SPEC" section above.
> This variable is a boolean, not the name of a p4 client.
>
> +git-p4.largeFileSystem::
> + Specify the system that is used for large (binary) files. Please note
> + that large file systems do not support the 'git p4 submit' command.
Why is that? Is it just that you haven't implemented support, or is it
fundamentally impossible?
> + Only Git LFS [1] is implemented right now. Download
> + and install the Git LFS command line extension to use this option
> + and configure it like this:
> ++
> +-------------
> +git config git-p4.largeFileSystem GitLFS
> +-------------
> ++
> + [1] https://git-lfs.github.com/
> +
> +git-p4.largeFileExtensions::
> + All files matching a file extension in the list will be processed
> + by the large file system. Do not prefix the extensions with '.'.
> +
> +git-p4.largeFileThreshold::
> + All files with an uncompressed size exceeding the threshold will be
> + processed by the large file system. By default the threshold is
> + defined in bytes. Add the suffix k, m, or g to change the unit.
> +
> +git-p4.largeFileCompressedThreshold::
> + All files with a compressed size exceeding the threshold will be
> + processed by the large file system. This option might slow down
> + your clone/sync process. By default the threshold is defined in
> + bytes. Add the suffix k, m, or g to change the unit.
> +
> +git-p4.pushLargeFiles::
> + Boolean variable which defines if large files are automatically
> + pushed to a server.
> +
> Submit variables
> ~~~~~~~~~~~~~~~~
> git-p4.detectRenames::
> diff --git a/git-p4.py b/git-p4.py
> index b465356..bfe71b5 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -22,6 +22,8 @@ import platform
> import re
> import shutil
> import stat
> +import zipfile
> +import zlib
>
> try:
> from subprocess import CalledProcessError
> @@ -932,6 +934,111 @@ def wildcard_present(path):
> m = re.search("[*#@%]", path)
> return m is not None
>
> +class LargeFileSystem(object):
> + """Base class for large file system support."""
> +
> + def __init__(self, writeToGitStream):
> + self.largeFiles = set()
> + self.writeToGitStream = writeToGitStream
> +
> + def generatePointer(self, cloneDestination, contentFile):
> + """Return the content of a pointer file that is stored in Git instead of
> + the actual content."""
> + assert False, "Method 'generatePointer' required in " + self.__class__.__name__
> +
> + def pushFile(self, localLargeFile):
> + """Push the actual content which is not stored in the Git repository to
> + a server."""
> + assert False, "Method 'pushFile' required in " + self.__class__.__name__
> +
> + def hasLargeFileExtension(self, relPath):
> + return reduce(
> + lambda a, b: a or b,
> + [relPath.endswith('.' + e) for e in gitConfigList('git-p4.largeFileExtensions')],
> + False
> + )
> +
> + def generateTempFile(self, contents):
> + contentFile = tempfile.NamedTemporaryFile(prefix='git-p4-large-file', delete=False)
> + for d in contents:
> + contentFile.write(d)
> + contentFile.flush()
> + contentFile.close()
close() should flush() automatically I would have thought.
> + return contentFile.name
> +
> + def exceedsLargeFileThreshold(self, relPath, contents):
> + if gitConfigInt('git-p4.largeFileThreshold'):
> + contentsSize = sum(len(d) for d in contents)
> + if contentsSize > gitConfigInt('git-p4.largeFileThreshold'):
> + return True
> + if gitConfigInt('git-p4.largeFileCompressedThreshold'):
> + contentsSize = sum(len(d) for d in contents)
> + if contentsSize <= gitConfigInt('git-p4.largeFileCompressedThreshold'):
> + return False
> + contentTempFile = self.generateTempFile(contents)
> + compressedContentFile = tempfile.NamedTemporaryFile(prefix='git-p4-large-file', delete=False)
> + zf = zipfile.ZipFile(compressedContentFile.name, mode='w')
> + zf.write(contentTempFile, compress_type=zipfile.ZIP_DEFLATED)
> + zf.close()
> + compressedContentsSize = zf.infolist()[0].compress_size
> + os.remove(contentTempFile)
> + os.remove(compressedContentFile.name)
> + if compressedContentsSize > gitConfigInt('git-p4.largeFileCompressedThreshold'):
> + return True
> + return False
> +
> + def addLargeFile(self, relPath):
> + self.largeFiles.add(relPath)
> +
> + def removeLargeFile(self, relPath):
> + self.largeFiles.remove(relPath)
> +
> + def isLargeFile(self, relPath):
> + return relPath in self.largeFiles
> +
> + def processContent(self, cloneDestination, git_mode, relPath, contents):
> + """Processes the content of git fast import. This method decides if a
> + file is stored in the large file system and handles all necessary
> + steps."""
> + if self.exceedsLargeFileThreshold(relPath, contents) or self.hasLargeFileExtension(relPath):
> + contentTempFile = self.generateTempFile(contents)
> + (git_mode, contents, localLargeFile) = self.generatePointer(cloneDestination, contentTempFile)
> +
> + # Move temp file to final location in large file system
> + largeFileDir = os.path.dirname(localLargeFile)
> + if not os.path.isdir(largeFileDir):
> + os.makedirs(largeFileDir)
> + shutil.move(contentTempFile, localLargeFile)
> + self.addLargeFile(relPath)
> + if gitConfigBool('git-p4.pushLargeFiles'):
> + self.pushFile(localLargeFile)
> + if verbose:
> + sys.stderr.write("%s moved to large file system (%s)\n" % (relPath, localLargeFile))
> + return (git_mode, contents)
> +
> +class MockLFS(LargeFileSystem):
> + """Mock large file system for testing."""
> +
> + def generatePointer(self, cloneDestination, contentFile):
> + """The pointer content is the original content prefixed with "pointer-".
> + The local filename of the large file storage is derived from the file content.
> + """
> + with open(contentFile, 'r') as f:
> + content = next(f)
> + gitMode = '100644'
> + pointerContents = 'pointer-' + content
> + localLargeFile = os.path.join(cloneDestination, '.git', 'mock-storage', 'local', content[:-1])
> + return (gitMode, pointerContents, localLargeFile)
> +
> + def pushFile(self, localLargeFile):
> + """The remote filename of the large file storage is the same as the local
> + one but in a different directory.
> + """
> + remotePath = os.path.join(os.path.dirname(localLargeFile), '..', 'remote')
> + if not os.path.exists(remotePath):
> + os.makedirs(remotePath)
> + shutil.copyfile(localLargeFile, os.path.join(remotePath, os.path.basename(localLargeFile)))
> +
> class Command:
> def __init__(self):
> self.usage = "usage: %prog [options]"
> @@ -1105,6 +1212,9 @@ class P4Submit(Command, P4UserMap):
> self.p4HasMoveCommand = p4_has_move_command()
> self.branch = None
>
> + if gitConfig('git-p4.largeFileSystem'):
> + die("Large file system not supported for git-p4 submit command. Please remove it from config.")
> +
> def check(self):
> if len(p4CmdList("opened ...")) > 0:
> die("You have files opened with perforce! Close them before starting the sync.")
> @@ -2057,6 +2167,12 @@ class P4Sync(Command, P4UserMap):
> lambda git_mode, relPath, contents: self.writeToGitStream(git_mode, relPath, contents)
> )
>
> + if gitConfig('git-p4.largeFileSystem'):
> + largeFileSystemConstructor = globals()[gitConfig('git-p4.largeFileSystem')]
> + self.largeFileSystem = largeFileSystemConstructor(
> + lambda git_mode, relPath, contents: self.writeToGitStream(git_mode, relPath, contents)
> + )
> +
> if gitConfig("git-p4.syncFromOrigin") == "false":
> self.syncWithOrigin = False
>
> @@ -2176,6 +2292,13 @@ class P4Sync(Command, P4UserMap):
>
> return branches
>
> + def writeToGitStream(self, gitMode, relPath, contents):
> + self.gitStream.write('M %s inline %s\n' % (gitMode, relPath))
> + self.gitStream.write('data %d\n' % sum(len(d) for d in contents))
> + for d in contents:
> + self.gitStream.write(d)
> + self.gitStream.write('\n')
> +
> # output one file from the P4 stream
> # - helper for streamP4Files
>
> @@ -2246,17 +2369,10 @@ class P4Sync(Command, P4UserMap):
> text = regexp.sub(r'$\1$', text)
> contents = [ text ]
>
> - self.gitStream.write("M %s inline %s\n" % (git_mode, relPath))
> + if self.largeFileSystem:
> + (git_mode, contents) = self.largeFileSystem.processContent(self.cloneDestination, git_mode, relPath, contents)
>
> - # total length...
> - length = 0
> - for d in contents:
> - length = length + len(d)
> -
> - self.gitStream.write("data %d\n" % length)
> - for d in contents:
> - self.gitStream.write(d)
> - self.gitStream.write("\n")
> + self.writeToGitStream(git_mode, relPath, contents)
>
> def streamOneP4Deletion(self, file):
> relPath = self.stripRepoPath(file['path'], self.branchPrefixes)
> @@ -2265,6 +2381,9 @@ class P4Sync(Command, P4UserMap):
> sys.stdout.flush()
> self.gitStream.write("D %s\n" % relPath)
>
> + if self.largeFileSystem and self.largeFileSystem.isLargeFile(relPath):
> + self.largeFileSystem.removeLargeFile(relPath)
> +
> # handle another chunk of streaming data
> def streamP4FilesCb(self, marshalled):
>
> diff --git a/t/t9823-git-p4-mock-lfs.sh b/t/t9823-git-p4-mock-lfs.sh
> new file mode 100755
> index 0000000..8582857
> --- /dev/null
> +++ b/t/t9823-git-p4-mock-lfs.sh
> @@ -0,0 +1,96 @@
> +#!/bin/sh
> +
> +test_description='Clone repositories and store files in Mock LFS'
> +
> +. ./lib-git-p4.sh
> +
> +test_file_in_mock () {
> + FILE="$1"
Missing &&
Plus the next few lines
> + CONTENT="$2"
> + LOCAL_STORAGE=".git/mock-storage/local/$CONTENT"
> + SERVER_STORAGE=".git/mock-storage/remote/$CONTENT"
> + echo "pointer-$CONTENT" >expect_pointer
> + echo "$CONTENT" >expect_content
> + test_path_is_file "$FILE" &&
> + test_path_is_file "$LOCAL_STORAGE" &&
> + test_path_is_file "$SERVER_STORAGE" &&
> + test_cmp expect_pointer "$FILE" &&
> + test_cmp expect_content "$LOCAL_STORAGE" &&
> + test_cmp expect_content "$SERVER_STORAGE"
> +}
> +
> +test_file_count_in_dir () {
> + DIR="$1"
> + EXPECTED_COUNT="$2"
> + find "$DIR" -type f >actual
> + test_line_count = $EXPECTED_COUNT actual
> +}
> +
> +test_expect_success 'start p4d' '
> + start_p4d
> +'
> +
> +test_expect_success 'Create repo with binary files' '
> + client_view "//depot/... //client/..." &&
> + (
> + cd "$cli" &&
> +
> + echo "content 1 txt 23 bytes" >file1.txt &&
> + p4 add file1.txt &&
> + echo "content 2-3 bin 25 bytes" >file2.dat &&
> + p4 add file2.dat &&
> + p4 submit -d "Add text and binary file" &&
> +
> + mkdir "path with spaces" &&
> + echo "content 2-3 bin 25 bytes" >"path with spaces/file3.bin" &&
Nice putting spaces in the path!
> + p4 add "path with spaces/file3.bin" &&
> + p4 submit -d "Add another binary file with same content and spaces in path" &&
> +
> + echo "content 4 bin 26 bytes XX" >file4.bin &&
> + p4 add file4.bin &&
> + p4 submit -d "Add another binary file with different content"
> + )
> +'
> +
> +test_expect_success 'Store files in Mock based on size (>24 bytes)' '
> + client_view "//depot/... //client/..." &&
> + test_when_finished cleanup_git &&
> + (
> + cd "$git" &&
> + git init . &&
> + git config git-p4.useClientSpec true &&
> + git config git-p4.largeFileSystem MockLFS &&
> + git config git-p4.largeFileThreshold 24 &&
> + git config git-p4.pushLargeFiles True &&
> + git p4 clone --destination="$git" //depot@all &&
> +
> + test_file_in_mock file2.dat "content 2-3 bin 25 bytes" &&
> + test_file_in_mock "path with spaces/file3.bin" "content 2-3 bin 25 bytes" &&
> + test_file_in_mock file4.bin "content 4 bin 26 bytes XX" &&
> +
> + test_file_count_in_dir ".git/mock-storage/local" 2 &&
> + test_file_count_in_dir ".git/mock-storage/remote" 2
> + )
> +'
> +
> +test_expect_success 'Run git p4 submit in repo configured with large file system' '
> + client_view "//depot/... //client/..." &&
> + test_when_finished cleanup_git &&
> + (
> + cd "$git" &&
> + git init . &&
> + git config git-p4.useClientSpec true &&
> + git config git-p4.largeFileSystem MockLFS &&
> + git config git-p4.largeFileThreshold 24 &&
> + git config git-p4.pushLargeFiles True &&
> + git p4 clone --destination="$git" //depot@all &&
> +
> + test_must_fail git p4 submit
No test for deletion, does that matter?
> + )
> +'
> +
> +test_expect_success 'kill p4d' '
> + kill_p4d
> +'
> +
> +test_done
>
next prev parent reply other threads:[~2015-09-16 8:37 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-14 13:26 [PATCH v5 0/7] git-p4: add support for large file systems larsxschneider
2015-09-14 13:26 ` [PATCH v5 1/7] git-p4: add optional type specifier to gitConfig reader larsxschneider
2015-09-15 7:34 ` Luke Diamand
2015-09-14 13:26 ` [PATCH v5 2/7] git-p4: add gitConfigInt reader larsxschneider
2015-09-14 13:26 ` [PATCH v5 3/7] git-p4: return an empty list if a list config has no values larsxschneider
2015-09-14 13:26 ` [PATCH v5 4/7] git-p4: add file streaming progress in verbose mode larsxschneider
2015-09-14 13:26 ` [PATCH v5 5/7] git-p4: check free space during streaming larsxschneider
2015-09-14 13:26 ` [PATCH v5 6/7] git-p4: add support for large file systems larsxschneider
2015-09-16 8:36 ` Luke Diamand [this message]
2015-09-16 12:05 ` Lars Schneider
2015-09-16 14:59 ` Eric Sunshine
2015-09-16 15:20 ` Junio C Hamano
2015-09-20 21:26 ` Lars Schneider
2015-09-14 13:26 ` [PATCH v5 7/7] git-p4: add Git LFS backend for large file system larsxschneider
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55F92A1E.1090002@diamand.org \
--to=luke@diamand.org \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=larsxschneider@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).