* [PATCH 1/3] dir.c: clean up handling of 'path' parameter in read_directory_recursive()
@ 2009-05-14 20:42 Linus Torvalds
2009-05-14 20:46 ` [PATCH 2/3] Add 'fill_directory()' helper function for directory traversal Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-05-14 20:42 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 14 May 2009 13:05:03 -0700
Right now we pass two different pathnames ('path' and 'base') down to
read_directory_recursive(), and the only real reason for that is that we
want to allow an empty 'base' parameter, but when we do so, we need the
pathname to "opendir()" to be "." rather than the empty string.
And rather than handle that confusion in the caller, we can just fix
read_directory_recursive() to handle the case of an empty path itself,
by just passing opendir() a "." ourselves if the path is empty.
This would allow us to then drop one of the pathnames entirely from the
calling convention, but rather than do that, we'll start separating them
out as a "filesystem pathname" (the one we use for filesystem accesses)
and a "git internal base name" (which is the name that we use for git
internally).
That will eventually allow us to do things like handle different
encodings (eg the filesystem pathnames might be Latin1, while git itself
would use UTF-8 for filename information).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
This is a truly trivial diff, but it's independent from the other changes
I have, and simplifies the next ones, so I've made it a patch of its own.
dir.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/dir.c b/dir.c
index 6aae09a..0e6b752 100644
--- a/dir.c
+++ b/dir.c
@@ -576,7 +576,7 @@ static int get_dtype(struct dirent *de, const char *path)
*/
static int read_directory_recursive(struct dir_struct *dir, const char *path, const char *base, int baselen, int check_only, const struct path_simplify *simplify)
{
- DIR *fdir = opendir(path);
+ DIR *fdir = opendir(*path ? path : ".");
int contents = 0;
if (fdir) {
--
1.6.3.1.11.g97114
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/3] Add 'fill_directory()' helper function for directory traversal
2009-05-14 20:42 [PATCH 1/3] dir.c: clean up handling of 'path' parameter in read_directory_recursive() Linus Torvalds
@ 2009-05-14 20:46 ` Linus Torvalds
2009-05-14 20:54 ` [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-05-14 20:46 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 14 May 2009 13:22:36 -0700
Most of the users of "read_directory()" actually want a much simpler
interface than the whole complex (but rather powerful) one.
In fact 'git add' had already largely abstracted out the core interface
issues into a private "fill_directory()" function that was largely
applicable almost as-is to a number of callers. Yes, 'git add' wants to
do some extra work of its own, specific to the add semantics, but we can
easily split that out, and use the core as a generic function.
This function does exactly that, and now that much simplified
'fill_directory()' function can be shared with a number of callers,
while also ensuring that the rather more complex calling conventions of
read_directory() are used by fewer call-sites.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
As you can see from the diffstat, this removes more lines than it adds,
and generally simplifies some calling conventions.
The return value from "fill_directory()" makes little sense for any other
user than the builtin-add.c case, and I'm not really proud of it, but it
basically allows everybody to share the same general infrastructure.
Note: this depends on the previous one, in that we now use the empty
string ("") as the "path" argument, and now almost all users of
read_directory() will pass in the same thing as both "path" and "base".
That will eventually change, though, if we want to have different
encodings.
builtin-add.c | 45 ++++++++++++++-------------------------------
builtin-clean.c | 12 +-----------
builtin-ls-files.c | 7 +------
dir.c | 21 +++++++++++++++++++++
dir.h | 1 +
wt-status.c | 2 +-
6 files changed, 39 insertions(+), 49 deletions(-)
diff --git a/builtin-add.c b/builtin-add.c
index cb67d2c..ba25893 100644
--- a/builtin-add.c
+++ b/builtin-add.c
@@ -95,35 +95,6 @@ static void treat_gitlinks(const char **pathspec)
}
}
-static void fill_directory(struct dir_struct *dir, const char **pathspec,
- int ignored_too)
-{
- const char *path, *base;
- int baselen;
-
- /* Set up the default git porcelain excludes */
- memset(dir, 0, sizeof(*dir));
- if (!ignored_too) {
- dir->flags |= DIR_COLLECT_IGNORED;
- setup_standard_excludes(dir);
- }
-
- /*
- * Calculate common prefix for the pathspec, and
- * use that to optimize the directory walk
- */
- baselen = common_prefix(pathspec);
- path = ".";
- base = "";
- if (baselen)
- path = base = xmemdupz(*pathspec, baselen);
-
- /* Read the directory and prune it */
- read_directory(dir, path, base, baselen, pathspec);
- if (pathspec)
- prune_directory(dir, pathspec, baselen);
-}
-
static void refresh(int verbose, const char **pathspec)
{
char *seen;
@@ -290,9 +261,21 @@ int cmd_add(int argc, const char **argv, const char *prefix)
die("index file corrupt");
treat_gitlinks(pathspec);
- if (add_new_files)
+ if (add_new_files) {
+ int baselen;
+
+ /* Set up the default git porcelain excludes */
+ memset(&dir, 0, sizeof(dir));
+ if (!ignored_too) {
+ dir.flags |= DIR_COLLECT_IGNORED;
+ setup_standard_excludes(&dir);
+ }
+
/* This picks up the paths that are not tracked */
- fill_directory(&dir, pathspec, ignored_too);
+ baselen = fill_directory(&dir, pathspec);
+ if (pathspec)
+ prune_directory(&dir, pathspec, baselen);
+ }
if (refresh_only) {
refresh(verbose, pathspec);
diff --git a/builtin-clean.c b/builtin-clean.c
index c5ad33d..febd10f 100644
--- a/builtin-clean.c
+++ b/builtin-clean.c
@@ -33,7 +33,6 @@ int cmd_clean(int argc, const char **argv, const char *prefix)
int ignored_only = 0, baselen = 0, config_set = 0, errors = 0;
struct strbuf directory = STRBUF_INIT;
struct dir_struct dir;
- const char *path, *base;
static const char **pathspec;
struct strbuf buf = STRBUF_INIT;
const char *qname;
@@ -77,16 +76,7 @@ int cmd_clean(int argc, const char **argv, const char *prefix)
pathspec = get_pathspec(prefix, argv);
read_cache();
- /*
- * Calculate common prefix for the pathspec, and
- * use that to optimize the directory walk
- */
- baselen = common_prefix(pathspec);
- path = ".";
- base = "";
- if (baselen)
- path = base = xmemdupz(*pathspec, baselen);
- read_directory(&dir, path, base, baselen, pathspec);
+ fill_directory(&dir, pathspec);
if (pathspec)
seen = xmalloc(argc > 0 ? argc : 1);
diff --git a/builtin-ls-files.c b/builtin-ls-files.c
index da2daf4..a011a42 100644
--- a/builtin-ls-files.c
+++ b/builtin-ls-files.c
@@ -161,12 +161,7 @@ static void show_files(struct dir_struct *dir, const char *prefix)
/* For cached/deleted files we don't need to even do the readdir */
if (show_others || show_killed) {
- const char *path = ".", *base = "";
- int baselen = prefix_len;
-
- if (baselen)
- path = base = prefix;
- read_directory(dir, path, base, baselen, pathspec);
+ fill_directory(dir, pathspec);
if (show_others)
show_other_files(dir);
if (show_killed)
diff --git a/dir.c b/dir.c
index 0e6b752..c667d38 100644
--- a/dir.c
+++ b/dir.c
@@ -9,6 +9,27 @@
#include "dir.h"
#include "refs.h"
+int fill_directory(struct dir_struct *dir, const char **pathspec)
+{
+ const char *path, *base;
+ int baselen;
+
+ /*
+ * Calculate common prefix for the pathspec, and
+ * use that to optimize the directory walk
+ */
+ baselen = common_prefix(pathspec);
+ path = "";
+ base = "";
+
+ if (baselen)
+ path = base = xmemdupz(*pathspec, baselen);
+
+ /* Read the directory and prune it */
+ read_directory(dir, path, base, baselen, pathspec);
+ return baselen;
+}
+
struct path_simplify {
int len;
const char *path;
diff --git a/dir.h b/dir.h
index 541286a..9f7c3ba 100644
--- a/dir.h
+++ b/dir.h
@@ -68,6 +68,7 @@ extern int common_prefix(const char **pathspec);
#define MATCHED_EXACTLY 3
extern int match_pathspec(const char **pathspec, const char *name, int namelen, int prefix, char *seen);
+extern int fill_directory(struct dir_struct *dir, const char **pathspec);
extern int read_directory(struct dir_struct *, const char *path, const char *base, int baselen, const char **pathspec);
extern int excluded(struct dir_struct *, const char *, int *);
diff --git a/wt-status.c b/wt-status.c
index 1b6df45..24a6abf 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -255,7 +255,7 @@ static void wt_status_print_untracked(struct wt_status *s)
DIR_SHOW_OTHER_DIRECTORIES | DIR_HIDE_EMPTY_DIRECTORIES;
setup_standard_excludes(&dir);
- read_directory(&dir, ".", "", 0, NULL);
+ fill_directory(&dir, NULL);
for(i = 0; i < dir.nr; i++) {
struct dir_entry *ent = dir.entries[i];
if (!cache_name_is_other(ent->name, ent->len))
--
1.6.3.1.11.g97114
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion
2009-05-14 20:46 ` [PATCH 2/3] Add 'fill_directory()' helper function for directory traversal Linus Torvalds
@ 2009-05-14 20:54 ` Linus Torvalds
2009-05-14 21:23 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-05-14 20:54 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 14 May 2009 13:31:59 -0700
This is some early first infrastructure to make it much easier for
read_directory() to recursively traversing a filesystem directory
structure, while at the same time doing a character set or naming
conversion during traversal.
In particular, this allows:
- the filesystem path component separator to be set to something
different than the normal UNIX '/' character.
Nobody may care (even windows tends to handle '/' well), but it also
happens to be a good way to test the feature, and this patch makes
the filesystem separator be "//" (_two_ slashes) just to verify that
the code correctly keeps the "filesystem representation" from the
"git internal filename representation".
- We could - for example - read filesystems that have pathnames in
a Latin1 encoding, and use this to convert such filesystem character
set details into a git format (where UTF-8 would be the default, the
same way we default to UTF-8 in commit messages without actually
_forcing_ it)
- On OS X, we can make the read_directory() conversion convert the
native (and odd/crazy) UTF-8 NFD representation into the more normal
cross-platform NFC representation.
But please note that this is just preliminary, and not only doesn't
actually have any explicit character set convrsion code, it still lacks
some other infrastructure.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
Ok, so this adds many more lines than I just removed, but at least several
of them are comments about what the difference between 'path' and 'base'
is, and it all works.
The use of "//" as the filesystem path component separator may be odd, but
it's a useful hack: from 'strace', you can now see how different
operations now use the "filesystem pathname" and others use the "native
git pathname", because one will have two slashes between components and
the other will not.
So you can literally -visually- see the places that aren't converted, and
that use the git internal paths for filesystem operations. Example strace
output:
...
open("compat//fnmatch//", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
getdents(6, /* 4 entries */, 32768) = 112
open("compat/fnmatch/.gitignore", O_RDONLY) = -1 ENOENT (No such file or directory)
...
iow, here we see how the directory traversal itself uses the "filesystem
pathname", but then the ignore-file handling does not.
Now imagine if one of them needs to do some UTF-8 -> EUC-JP translation or
something like that, rather than having the (purely visual) extra '/'.
So it's currently just a cheezy hack, but it was useful for testing, and I
think this series is worth thinking seriously about. The two first patches
were plain cleanups, and this one isn't _that_ complex, but adds some
potentially interesting infrastructure, even if it's not complete yet.
dir.c | 80 ++++++++++++++++++++++++++++++++++++++++++++-----------
unpack-trees.c | 4 ++-
2 files changed, 67 insertions(+), 17 deletions(-)
diff --git a/dir.c b/dir.c
index c667d38..ae1ae61 100644
--- a/dir.c
+++ b/dir.c
@@ -22,6 +22,7 @@ int fill_directory(struct dir_struct *dir, const char **pathspec)
path = "";
base = "";
+ /* FIXME! Filesystem character set vs git internal character set! */
if (baselen)
path = base = xmemdupz(*pathspec, baselen);
@@ -586,6 +587,21 @@ static int get_dtype(struct dirent *de, const char *path)
return dtype;
}
+/* No actual conversion yet */
+static int convert_path_to_git(const char *path, int plen, char *result)
+{
+ memcpy(result, path, plen+1);
+ return plen;
+}
+
+/*
+ * For testing!
+ *
+ * On Windows, maybe we want FS_PATH_SEP being "\\"?
+ */
+#define FS_PATH_SEP "//"
+#define FS_PATH_SEP_LEN 2
+
/*
* Read a directory tree. We currently ignore anything but
* directories, regular files and symlinks. That's because git
@@ -594,37 +610,62 @@ static int get_dtype(struct dirent *de, const char *path)
*
* Also, we ignore the name ".git" (even if it is not a directory).
* That likely will not change.
+ *
+ * 'path' is the filesystem name of directory, in the filesystem
+ * namespace, while 'base' is the internal git path (converted
+ * into the standard git namespace).
*/
-static int read_directory_recursive(struct dir_struct *dir, const char *path, const char *base, int baselen, int check_only, const struct path_simplify *simplify)
+static int read_directory_recursive(struct dir_struct *dir,
+ const char *path,
+ const char *base, int baselen,
+ int check_only, const struct path_simplify *simplify)
{
DIR *fdir = opendir(*path ? path : ".");
int contents = 0;
if (fdir) {
+ int pathlen = strlen(path);
struct dirent *de;
- char fullname[PATH_MAX + 1];
- memcpy(fullname, base, baselen);
+ char newpath[PATH_MAX + 1];
+ char newbase[PATH_MAX + 1];
+
+ memcpy(newpath, path, pathlen);
+ memcpy(newbase, base, baselen);
while ((de = readdir(fdir)) != NULL) {
- int len, dtype;
+ char converted[256];
+ int len, dtype, nlen;
int exclude;
if (is_dot_or_dotdot(de->d_name) ||
!strcmp(de->d_name, ".git"))
continue;
+
len = strlen(de->d_name);
+
/* Ignore overly long pathnames! */
- if (len + baselen + 8 > sizeof(fullname))
+ if (len + pathlen + 8 > sizeof(newpath))
continue;
- memcpy(fullname + baselen, de->d_name, len+1);
- if (simplify_away(fullname, baselen + len, simplify))
+ memcpy(newpath + pathlen, de->d_name, len+1);
+
+ nlen = convert_path_to_git(de->d_name, len, converted);
+ if (nlen + baselen + 8 > sizeof(newbase))
continue;
+ memcpy(newbase + baselen, converted, nlen+1);
+ len = pathlen + len;
+ nlen = baselen + nlen;
+
+ /* We simplify by the git internal pathname (newbase) */
+ if (simplify_away(newbase, nlen, simplify))
+ continue;
+
+ /* Similarly, exclude rules work on the git pathname */
dtype = DTYPE(de);
- exclude = excluded(dir, fullname, &dtype);
+ exclude = excluded(dir, newbase, &dtype);
if (exclude && (dir->flags & DIR_COLLECT_IGNORED)
- && in_pathspec(fullname, baselen + len, simplify))
- dir_add_ignored(dir, fullname, baselen + len);
+ && in_pathspec(newbase, nlen, simplify))
+ dir_add_ignored(dir, newbase, nlen);
/*
* Excluded? If we don't explicitly want to show
@@ -633,8 +674,12 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
if (exclude && !(dir->flags & DIR_SHOW_IGNORED))
continue;
+ /*
+ * The 'dtype' information comes from the filesystem,
+ * and we use the filesystem pathname for that (newpath)
+ */
if (dtype == DT_UNKNOWN)
- dtype = get_dtype(de, fullname);
+ dtype = get_dtype(de, newpath);
/*
* Do we want to see just the ignored files?
@@ -651,9 +696,12 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
default:
continue;
case DT_DIR:
- memcpy(fullname + baselen + len, "/", 2);
- len++;
- switch (treat_directory(dir, fullname, baselen + len, simplify)) {
+ memcpy(newpath + len, FS_PATH_SEP, FS_PATH_SEP_LEN+1);
+ len += FS_PATH_SEP_LEN;
+ memcpy(newbase + nlen, "/", 2);
+ nlen++;
+
+ switch (treat_directory(dir, newbase, nlen, simplify)) {
case show_directory:
if (exclude != !!(dir->flags
& DIR_SHOW_IGNORED))
@@ -661,7 +709,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
break;
case recurse_into_directory:
contents += read_directory_recursive(dir,
- fullname, fullname, baselen + len, 0, simplify);
+ newpath, newbase, nlen, 0, simplify);
continue;
case ignore_directory:
continue;
@@ -675,7 +723,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
if (check_only)
goto exit_early;
else
- dir_add_name(dir, fullname, baselen + len);
+ dir_add_name(dir, newbase, nlen);
}
exit_early:
closedir(fdir);
diff --git a/unpack-trees.c b/unpack-trees.c
index e4eb8fa..457b529 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -534,7 +534,9 @@ static int verify_clean_subdirectory(struct cache_entry *ce, const char *action,
memset(&d, 0, sizeof(d));
if (o->dir)
d.exclude_per_dir = o->dir->exclude_per_dir;
- i = read_directory(&d, ce->name, pathbuf, namelen+1, NULL);
+
+ /* FIXME! Filesystem pathname vs internal git pathname! */
+ i = read_directory(&d, pathbuf, pathbuf, namelen+1, NULL);
if (i)
return o->gently ? -1 :
error(ERRORMSG(o, not_uptodate_dir), ce->name);
--
1.6.3.1.11.g97114
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion
2009-05-14 20:54 ` [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion Linus Torvalds
@ 2009-05-14 21:23 ` Linus Torvalds
2009-05-14 22:19 ` Johannes Schindelin
2009-05-15 19:01 ` [PATCH 4/3] Introduce 'convert_path_to_git()' Linus Torvalds
2 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-05-14 21:23 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
On Thu, 14 May 2009, Linus Torvalds wrote:
> In particular, this allows:
>
> - the filesystem path component separator to be set to something
> different than the normal UNIX '/' character.
I forgot to mention that this also now allows really having a different
prefix. The old code had "path" and "base", and without really reading the
code you might think that you could have a different base for the two, but
immediately when it recursed, it would re-set the path and base to be the
same thing, so you could never really have two different address spaces.
The new code very much intentionally keeps the two apart, and the
_intention_ is that on platforms like Windows, you should not just be able
to use other path component separators like '\', it should also be
possible to use an absolute base (which, if I recall correctly, is the
only way to handle things like long path-names. But maybe I'm wrong - I
really don't know the crazy native Windows API's).
IOW, the _intention_ is that you could literally pass in something like
"c:\Source\git\myrepo"
as the "path", and with an empty "base", it would then be possible to
basically traverse the tree with the filesystem operations building up a
"path" like
c:\Source\git\myrepo\subdir\myfile.txt
while "base" would track it, but become "subdir/myfile.txt".
In fact, my intention was that the pathname could easily be in some crazy
UTF16LE format (ie not a real "string" at all), but I might need to pass
the "pathlen" around as a parameter if we need to handle strings that
contain embedded NUL characters. That's an easy thing to do if required,
though.
Now, it's possible that nobody wants to do that kind of crazy windows
stuff, because even windows people are perfectly fine using regular utf-8.
I really dunno. My point is more that this is meant to be very flexible
basic infrastructure and that we _could_ do things like that.
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion
2009-05-14 20:54 ` [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion Linus Torvalds
2009-05-14 21:23 ` Linus Torvalds
@ 2009-05-14 22:19 ` Johannes Schindelin
2009-05-14 22:36 ` Aaron Cohen
2009-05-14 22:47 ` Linus Torvalds
2009-05-15 19:01 ` [PATCH 4/3] Introduce 'convert_path_to_git()' Linus Torvalds
2 siblings, 2 replies; 13+ messages in thread
From: Johannes Schindelin @ 2009-05-14 22:19 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano
Hi,
On Thu, 14 May 2009, Linus Torvalds wrote:
> The use of "//" as the filesystem path component separator may be odd,
Hopefully it will not bite us on Windows: "//fileserver/x" is different
from "/fileserver/x" there: the former tries to access the share "x" of
samba server "fileserver", while the latter will expand to "C:\Program
Files\Git\fileserver\x" (or wherever you installed Git).
Ciao,
Dscho
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion
2009-05-14 22:19 ` Johannes Schindelin
@ 2009-05-14 22:36 ` Aaron Cohen
2009-05-14 22:51 ` Linus Torvalds
2009-05-14 22:47 ` Linus Torvalds
1 sibling, 1 reply; 13+ messages in thread
From: Aaron Cohen @ 2009-05-14 22:36 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Linus Torvalds, Git Mailing List, Junio C Hamano
On Thu, May 14, 2009 at 6:19 PM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
> Hi,
>
> On Thu, 14 May 2009, Linus Torvalds wrote:
>
>> The use of "//" as the filesystem path component separator may be odd,
>
> Hopefully it will not bite us on Windows: "//fileserver/x" is different
> from "/fileserver/x" there: the former tries to access the share "x" of
> samba server "fileserver", while the latter will expand to "C:\Program
> Files\Git\fileserver\x" (or wherever you installed Git).
>
> Ciao,
> Dscho
Does this possibly allow using the magic "\\?\" prefix on windows to
avoid file name length restrictions?
-- Aaron
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion
2009-05-14 22:19 ` Johannes Schindelin
2009-05-14 22:36 ` Aaron Cohen
@ 2009-05-14 22:47 ` Linus Torvalds
1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-05-14 22:47 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Git Mailing List, Junio C Hamano
On Fri, 15 May 2009, Johannes Schindelin wrote:
>
> On Thu, 14 May 2009, Linus Torvalds wrote:
>
> > The use of "//" as the filesystem path component separator may be odd,
>
> Hopefully it will not bite us on Windows: "//fileserver/x" is different
> from "/fileserver/x" there: the former tries to access the share "x" of
> samba server "fileserver", while the latter will expand to "C:\Program
> Files\Git\fileserver\x" (or wherever you installed Git).
It only does it in the middle of names, so it should be safe.
That said, it's also meant to be just a temporary debugging aid. I'm fine
with the '//' part not being merged.
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion
2009-05-14 22:36 ` Aaron Cohen
@ 2009-05-14 22:51 ` Linus Torvalds
0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-05-14 22:51 UTC (permalink / raw)
To: aaron; +Cc: Johannes Schindelin, Git Mailing List, Junio C Hamano
On Thu, 14 May 2009, Aaron Cohen wrote:
>
> Does this possibly allow using the magic "\\?\" prefix on windows to
> avoid file name length restrictions?
That would be the intention - eventually. The point being exactly that the
'path' side can be done differently from the 'basename' part that git then
uses internally.
However, the thing is not complete. As shown from the strace, almost all
filesystem operations then end up using the 'git internal' name anyway.
It's currently literally just the filesystem traversal itself that knows
to separate the notion of 'internal pathname representation' from the
filesystem accesses.
So right now, the only thing that uses the filesystem-specific stuff is
the "opendir()" (and the lstat() in case the filesystem doesn't support
d_type in the dirent).
Kubys
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 4/3] Introduce 'convert_path_to_git()'
2009-05-14 20:54 ` [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion Linus Torvalds
2009-05-14 21:23 ` Linus Torvalds
2009-05-14 22:19 ` Johannes Schindelin
@ 2009-05-15 19:01 ` Linus Torvalds
2009-05-16 6:40 ` Junio C Hamano
2 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2009-05-15 19:01 UTC (permalink / raw)
To: Git Mailing List, Junio C Hamano
Ok, this is at the stage where if somebody wants UTF-8 filenames to work
sanely on OS X, you can now add just a couple of lines (assuming you find
the right conversion function to turn it into NFC), and get a pretty
efficient end result.
It avoids the conversion overhead for the case where the filename is
perfectly normal US-ASCII, so in 99% of all cases you'd not have to do
anything fancy, and then if you have a slightly more expensive thing to
handle decomposed -> composed UTF-8, you'll still be ok.
There's a comment on "This is where we should get fancy. Some day", and
the only case that matters for OS X is "convert_path_to_git()", since you
never need to do it the other way around.
Anybody with OS X, and a wish to be able to sanely use non-ASCII UTF-8?
(I only did the "unsigned long at a time" for x86, which does unaligneds
well. If your architecture sucks at unaligned accesses, you don't want to
do that thing.)
Linus
---
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri May 15 09:40:57 2009 -0700
Add initial support for pathname conversion to UTF-8
We're still not converting anything, but this adds the infrastructure
for it, including a fast-path to handle the trivial case of all-ASCII
characters with no need for conversion.
That means that we can then afford to spend a bit more time on the case
where conversion fails.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
Makefile | 1 +
dir.c | 11 ++-----
dir.h | 3 ++
filename.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 93 insertions(+), 8 deletions(-)
diff --git a/Makefile b/Makefile
index 6e21643..fc847c7 100644
--- a/Makefile
+++ b/Makefile
@@ -473,6 +473,7 @@ LIB_OBJS += editor.o
LIB_OBJS += entry.o
LIB_OBJS += environment.o
LIB_OBJS += exec_cmd.o
+LIB_OBJS += filename.o
LIB_OBJS += fsck.o
LIB_OBJS += graph.o
LIB_OBJS += grep.o
diff --git a/dir.c b/dir.c
index 2d53b11..3c5c6a6 100644
--- a/dir.c
+++ b/dir.c
@@ -589,13 +589,6 @@ static int get_dtype(struct dirent *de, const char *path)
return dtype;
}
-/* No actual conversion yet */
-static int convert_path_to_git(const char *path, int plen, char *result)
-{
- memcpy(result, path, plen+1);
- return plen;
-}
-
/*
* For testing!
*
@@ -650,7 +643,9 @@ static int read_directory_recursive(struct dir_struct *dir,
continue;
memcpy(newpath + pathlen, de->d_name, len+1);
- nlen = convert_path_to_git(de->d_name, len, converted);
+ nlen = convert_path_to_git(de->d_name, len, converted, sizeof(converted));
+ if (nlen <= 0)
+ continue;
if (nlen + baselen + 8 > sizeof(newbase))
continue;
memcpy(newbase + baselen, converted, nlen+1);
diff --git a/dir.h b/dir.h
index f9d69dd..c04eb3a 100644
--- a/dir.h
+++ b/dir.h
@@ -66,6 +66,9 @@ struct dir_struct {
#define MATCHED_EXACTLY 3
extern int match_pathspec(const char **pathspec, const char *name, int namelen, int prefix, char *seen);
+extern int convert_path_to_git(const char *, int, char *, int);
+extern int convert_git_to_path(const char *, int, char *, int);
+
extern int fill_directory(struct dir_struct *dir, const char **pathspec);
extern int read_directory(struct dir_struct *, const char *path, const char *base, int baselen, const char **pathspec);
diff --git a/filename.c b/filename.c
new file mode 100644
index 0000000..8a049ab
--- /dev/null
+++ b/filename.c
@@ -0,0 +1,86 @@
+/*
+ * Filename character set conversion
+ */
+#include "cache.h"
+#include "dir.h"
+
+#if defined(__x86_64__) || defined(__i386__)
+#define FAST_UNALIGNED
+#endif
+
+/*
+ * The "common" case that requires no conversion: all 7-bit ASCII.
+ *
+ * Return how many character were trivially converted, negative on
+ * error (result wouldn't fit in buffer even trivially).
+ */
+static int convert_path_common(const char *path, int plen, char *result, int resultlen)
+{
+ int retval;
+
+ if (plen+1 > resultlen)
+ return -1;
+ retval = 0;
+#ifdef FAST_UNALIGNED
+ while (plen >= sizeof(unsigned long)) {
+ unsigned long x = *(unsigned long *)path;
+ if (x & (unsigned long) 0x8080808080808080ull)
+ break;
+ *(unsigned long *)result = x;
+ path += sizeof(unsigned long);
+ result += sizeof(unsigned long);
+ plen -= sizeof(unsigned long);
+ retval += sizeof(unsigned long);
+ }
+#endif
+ while (plen > 0) {
+ unsigned char x = *path;
+ if (x & 0x80)
+ break;
+ *result = x;
+ path++;
+ result++;
+ plen--;
+ retval++;
+ }
+ *result = 0;
+ return retval;
+}
+
+int convert_path_to_git(const char *path, int plen, char *result, int resultlen)
+{
+ int retval;
+
+ retval = convert_path_common(path, plen, result, resultlen);
+ /* Absolute failure, or total success.. */
+ if (retval < 0 || retval == plen)
+ return retval;
+
+ /* Skip the part we already did trivially */
+ result += retval;
+ path += retval;
+ plen -= retval;
+
+ /* This is where we should get fancy. Some day */
+ memcpy(result, path, plen+1);
+ return retval + plen;
+}
+
+int convert_git_to_path(const char *path, int plen, char *result, int resultlen)
+{
+ int retval;
+
+ retval = convert_path_common(path, plen, result, resultlen);
+ /* Absolute failure, or total success.. */
+ if (retval < 0 || retval == plen)
+ return retval;
+
+ /* Skip the part we already did trivially */
+ result += retval;
+ path += retval;
+ plen -= retval;
+
+ /* This is where we should get fancy. Some day */
+ memcpy(result, path, plen+1);
+ return retval + plen;
+}
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 4/3] Introduce 'convert_path_to_git()'
2009-05-15 19:01 ` [PATCH 4/3] Introduce 'convert_path_to_git()' Linus Torvalds
@ 2009-05-16 6:40 ` Junio C Hamano
2009-05-16 17:27 ` Linus Torvalds
2009-05-19 12:20 ` Jens Kilian
0 siblings, 2 replies; 13+ messages in thread
From: Junio C Hamano @ 2009-05-16 6:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Git Mailing List, Junio C Hamano
Linus Torvalds <torvalds@linux-foundation.org> writes:
> +#ifdef FAST_UNALIGNED
> + while (plen >= sizeof(unsigned long)) {
> + unsigned long x = *(unsigned long *)path;
> + if (x & (unsigned long) 0x8080808080808080ull)
> + break;
> + *(unsigned long *)result = x;
> + path += sizeof(unsigned long);
> + result += sizeof(unsigned long);
> + plen -= sizeof(unsigned long);
> + retval += sizeof(unsigned long);
> + }
> +#endif
It somehow bothers me that the stride of this loop is protected from
having different size of unsigned long from what the author of this
function expected, but the literal constant used as the mask is not quite,
which means that taken as a whole it does not work if your unsigned long
is more than 8-bytes. No, I do not think it matters in practice, and I
find it a neat trick to cast the unsigned long long literal down to
unsigned long, but still it looks somewhat, eh, what would I say...
"Ugly" is not quite the word I am looking for. "My gut feels that there
has to be a way to write this more cleanly, but I am frustrated that I
cannot come up with one" might be the word...
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 4/3] Introduce 'convert_path_to_git()'
2009-05-16 6:40 ` Junio C Hamano
@ 2009-05-16 17:27 ` Linus Torvalds
2009-05-19 12:20 ` Jens Kilian
1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2009-05-16 17:27 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List
On Fri, 15 May 2009, Junio C Hamano wrote:
>
> "Ugly" is not quite the word I am looking for. "My gut feels that there
> has to be a way to write this more cleanly, but I am frustrated that I
> cannot come up with one" might be the word...
Well, we can certainly make it even more interesting, and more prone to
work even when the word-size grows.
#define MAX_SHIFT (8*sizeof(unsigned long))
#define SHIFT_BITS(x,y) ((x) << ((y) & (MAX_SHIFT-1)))
#define EXPAND(x,bits) ((x) | SHIFT_BITS(x,bits))
#define EXPAND2(x,bits) EXPAND(EXPAND(x,bits),bits*2)
#define EXPAND4(x,bits) EXPAND2(EXPAND2(x,bits),bits*4)
#define MASK80 EXPAND4(0x80808080ul,32)
and now it should work up to 256 bits without warnings or undefined
behavior (shifting by the word-size or more is not well-specified, which
is why it has the "MAX_SHIFT/SHIFT_BIT" magic)
Untested, of course. But it seems to work on 32-bit and 64-bit cases. I
can only hope that it works for 128-bit and 256-bit cases too.
And yes, it depends on "sizeof(unsigned long)" being a power of two. We
could avoid that dependency by turning the "& (MAX_SHIFT-1)" into a ?:
operation that actually compares with the value, and then it would work
for a 6-byte "unsigned long" too.
It fundamentally does depend on 8-bit bytes, of course, but so does the
whole algorithm, so that's not much of a dependency.
IOW, I'm not claiming it's "truly portable". Just reasonably so.
Linus
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 4/3] Introduce 'convert_path_to_git()'
2009-05-16 6:40 ` Junio C Hamano
2009-05-16 17:27 ` Linus Torvalds
@ 2009-05-19 12:20 ` Jens Kilian
2009-05-19 13:31 ` John Koleszar
1 sibling, 1 reply; 13+ messages in thread
From: Jens Kilian @ 2009-05-19 12:20 UTC (permalink / raw)
To: git
Junio C Hamano <gitster <at> pobox.com> writes:
> "Ugly" is not quite the word I am looking for. "My gut feels that there
> has to be a way to write this more cleanly, but I am frustrated that I
> cannot come up with one" might be the word...
How about this:
#include <stdio.h>
#define MAGIC(type) ((~(type)0 / (type)0xff) << 7)
#define TEST(type) printf(#type " %llx\n", (unsigned long long)MAGIC(type))
int
main(void)
{
/*TEST(unsigned char); Doesn't work, and I'm too lazy to find out why. */
TEST(unsigned int);
TEST(unsigned long);
TEST(unsigned long long);
return 0;
}
HTH,
Jens.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 4/3] Introduce 'convert_path_to_git()'
2009-05-19 12:20 ` Jens Kilian
@ 2009-05-19 13:31 ` John Koleszar
0 siblings, 0 replies; 13+ messages in thread
From: John Koleszar @ 2009-05-19 13:31 UTC (permalink / raw)
To: Jens Kilian; +Cc: git@vger.kernel.org
Haven't been following this thread, so I don't know what the context is
here, but the brain teaser caught my eye.
On Tue, 2009-05-19 at 08:20 -0400, Jens Kilian wrote:
> Junio C Hamano <gitster <at> pobox.com> writes:
> > "Ugly" is not quite the word I am looking for. "My gut feels that there
> > has to be a way to write this more cleanly, but I am frustrated that I
> > cannot come up with one" might be the word...
>
> How about this:
>
> #include <stdio.h>
>
-#define MAGIC(type) ((~(type)0 / (type)0xff) << 7)
+#define MAGIC(type) ((type)(~(type)0 / 0xffU << 7))
> #define TEST(type) printf(#type " %llx\n", (unsigned long long)MAGIC(type))
>
> int
> main(void)
> {
> /*TEST(unsigned char); Doesn't work, and I'm too lazy to find out why. */
> TEST(unsigned int);
> TEST(unsigned long);
> TEST(unsigned long long);
> return 0;
> }
>
--John
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2009-05-19 13:39 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-05-14 20:42 [PATCH 1/3] dir.c: clean up handling of 'path' parameter in read_directory_recursive() Linus Torvalds
2009-05-14 20:46 ` [PATCH 2/3] Add 'fill_directory()' helper function for directory traversal Linus Torvalds
2009-05-14 20:54 ` [PATCH 3/3] read_directory(): infrastructure for pathname character set conversion Linus Torvalds
2009-05-14 21:23 ` Linus Torvalds
2009-05-14 22:19 ` Johannes Schindelin
2009-05-14 22:36 ` Aaron Cohen
2009-05-14 22:51 ` Linus Torvalds
2009-05-14 22:47 ` Linus Torvalds
2009-05-15 19:01 ` [PATCH 4/3] Introduce 'convert_path_to_git()' Linus Torvalds
2009-05-16 6:40 ` Junio C Hamano
2009-05-16 17:27 ` Linus Torvalds
2009-05-19 12:20 ` Jens Kilian
2009-05-19 13:31 ` John Koleszar
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).