* Maximum filename length @ 2008-11-21 12:51 Alexey Salmin 2008-11-21 22:32 ` Theodore Tso 0 siblings, 1 reply; 7+ messages in thread From: Alexey Salmin @ 2008-11-21 12:51 UTC (permalink / raw) To: linux-ext4 Hello! I'm not sure the developers mailing list is a right place for philosophical discussions - it's more about features, bugs and patches. But anyway I have an important for me topic I want to talk about. Limits of the ext4 file are really huge, I just can't imagine 1 EiB disk array, it's out of my mind's bounds. Maximum file size is quite big too. But there is one limitation looking tiny against these Tera- and Exbi-bytes: maximum filename length is 255 bytes. Is 255 characters enough? I think it's enough for the vast majority of users. But there is one problem: 255 bytes and 255 characters are no longer equal. Multibyte encodings are spreading fast and it should be taken into account. For a long time I was using the simple koi8-r encoding and it was enough. Even when my favorite debian distribution moved to utf8 I was still keeping it. Even when I discovered that gtk and qt application always use utf8 and every io-operation causes conversions I was using koi8. But when I found out that the first thing gcc does with the source code is converting it to utf8 I thought that it is really the time to move ahead. I was full of optimism converting my file systems to utf8 but I discovered that my book collection can not be stored correctly due to the 127 characters filename length limitation. Actually I'm lucky having only two bytes per character, utf8-character can contain up to 4 bytes which reduces the limit to 63 characters. Really I see no reasons for keeping such a terrible limitation. Ext4 branch was created because there were to many things to change compared to ext3. And it's very sad that such a simple improvement was forgotten :( Alexey ^ permalink raw reply [flat|nested] 7+ messages in thread
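The arithmetic in the message above (255 bytes giving 127 two-byte characters, or 63 four-byte ones) can be sketched in C. This is only an illustration: the helper names `utf8_encoded_len` and `max_chars_in_name` and the constant `NAME_LIMIT_BYTES` are made up for this example and are not kernel or libc interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* The ext4 per-name limit discussed in the thread, in bytes (assumed here). */
#define NAME_LIMIT_BYTES 255

/* Bytes needed to encode one Unicode code point in UTF-8.
 * Returns 0 for surrogates and values beyond U+10FFFF. */
static size_t utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)
        return 1;                       /* ASCII */
    if (cp < 0x800)
        return 2;                       /* e.g. Cyrillic, Greek, Hebrew, Arabic */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not encodable */
    if (cp < 0x10000)
        return 3;                       /* rest of the BMP, e.g. CJK */
    if (cp <= 0x10FFFF)
        return 4;                       /* supplementary planes */
    return 0;
}

/* How many copies of one code point fit into a single file name. */
static size_t max_chars_in_name(uint32_t cp)
{
    size_t len = utf8_encoded_len(cp);
    return len ? NAME_LIMIT_BYTES / len : 0;
}
```

With these helpers, a name made only of Cyrillic letters such as U+0416 tops out at 127 characters, one made of CJK ideographs such as U+4E2D at 85, and one made of characters at or above U+10000 at 63.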
* Re: Maximum filename length 2008-11-21 12:51 Maximum filename length Alexey Salmin @ 2008-11-21 22:32 ` Theodore Tso 2008-11-22 2:50 ` rae l 2008-11-22 18:48 ` Alexey Salmin 0 siblings, 2 replies; 7+ messages in thread From: Theodore Tso @ 2008-11-21 22:32 UTC (permalink / raw) To: Alexey Salmin; +Cc: linux-ext4 On Fri, Nov 21, 2008 at 06:51:23PM +0600, Alexey Salmin wrote: > But there is one limitation looking tiny against these Tera- > and Exbi-bytes: maximum filename length is 255 bytes. Is 255 > characters enough? I think it's enough for the vast majority of users. > But there is one problem: 255 bytes and 255 characters are no longer > equal. Multibyte encodings are spreading fast and it should be taken > into account. For a long time I was using the simple koi8-r encoding > and it was enough. Even when my favorite debian distribution moved to > utf8 I was still keeping it. Yeah, unfortunately Unicode and UTF-8 is unfortunate for the Cyrillic and Greek alphabets, since they are non-Latin alphabets where multiple characters form a word. For most writing systems, they either use a single glyph to represent a word (such as the CJK, or Chinese-Japanese-Korean characters), or they are based on the Latin alphabet, so in those writing systems, only a few characters in practice require glyphs that are encoded using two bytes, and most require only one byte. > Actually I'm lucky having only two bytes per character, > utf8-character can contain up to 4 bytes which reduces the limit to 63 > characters. Really I see no reasons for keeping such a terrible In practice the bulk of the characters which require 3 bytes to encode are used to denote a word (which in most other languages might be encoded in 3 to 20 letters). There are a few writing systems that have letters encoded above U+0800, and so require 3 bytes per letter, but they tend to be "niche" languages that are rarely used in computing. 
For example, the Buhid script, which is used to write the language of the indigenous Mangyan people, who live in the province of Mindoro in the Philippines, and which has about 8,000 speakers in the world, utilizes Unicode characters U+1740 through U+175F, and so requires 3 bytes per character. The Native American Cherokee language, which has about 10,000 speakers in the world, uses Unicode symbols U+13A0 through U+13F4, and similarly needs 3 bytes per character. Characters that require 4 bytes to encode are needed to encode Unicode symbols at U+10000 and above, which are used primarily by dead languages (i.e., no one alive speaks them as their primary language --- and in some cases, no one alive has any idea how to speak them). For example the Linear B script, which was used in Mycenaean civilization sometime around the 13th and 14th centuries BC (i.e., over 3 millennia ago) is assigned Unicode characters U+10000 through U+100FF, and so would require 4 bytes per Linear B glyph to encode. However, aside from researchers in ancient languages, it is doubtful anyone would actually be using it, and it's even less likely anyone would be trying to catalog books or mp3 filenames using Linear B glyphs. :-) So in practice, in terms of the common languages that are likely to be used in computing that are based on phonemes (e.g., most European, Russian, and Greek writing systems) as opposed to ideographs (i.e., the CJK writing systems), Russian, Greek, Hebrew, and Arabic are the unlucky ones that are not based on a Latin-1 alphabet, and have this problem where 2 bytes are required. Curiously enough, it's generally people using the Cyrillic alphabet who tend to complain; I suspect that it has the largest number of users who are likely to use those letters in computing. 
(In practice, not many people complain about Hebrew writing systems, and I suspect that it's partially because of the relative difference in the number of people using the Hebrew writing system as compared to the Cyrillic, and also that most Israeli computer folk I know tend to do most of their computing work in English, and not in Hebrew.) > Really I see no reasons for keeping such a terrible > limitation. Ext4 branch was created because there were to many things > to change compared to ext3. And it's very sad that such a simple > improvement was forgotten :( It wouldn't be _that_ hard to add an extension to ext4 to support longer filenames (it would mean a new directory entry format, and a way of marking a directory inode as to whether the old or new directory format was being used). Unfortunately, the 255 byte limit is encoded not only in the filesystem, but also in the kernel. Changing it in the kernel is not just a matter of a #define constant, but also of fixing places which put filename[NAME_MAX] on the stack, and where increasing NAME_MAX might cause kernel functions to blow the limited stack space available to kernel code. In addition, there are numerous userspace and, in some cases, protocol limitations which assume that the total overall length of a pathname is no more than 1024 bytes. (I suspect there is at least some userspace code that also would blow up if an individual pathname component exceeded NAME_MAX, or 256 bytes.) So the problem is that even if we were to add that enhancement to ext4, there are lots of other things, both in and outside of the kernel, that would likely need to be changed in order to support this. I will say personally that it's rare for me to use filenames longer than 50-60 characters, just because they are a pain in the *ss to type. However, I can see how someone using a graphical interface might be happy with filenames in the 100-120 character range. 
The question though is whether it is worth trying to fix this by increasing the filename length beyond 255 bytes or not, given the amount of effort that would be required in the kernel, libc, userspace, etc. - Ted ^ permalink raw reply [flat|nested] 7+ messages in thread
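The userspace side of the concern above (code that bakes the limit into a fixed-size buffer) can be illustrated with a small sketch. It asks the file system for its name limit through POSIX `pathconf()` with `_PC_NAME_MAX` and sizes a heap buffer from the answer, instead of declaring something like `char filename[NAME_MAX]` on the stack. The helper names and `FALLBACK_NAME_LIMIT` are hypothetical, chosen only for this example.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical fallback used when the file system reports no fixed limit. */
#define FALLBACK_NAME_LIMIT 255L

/* Per-component name limit for the file system holding `dir`.
 * Returns -1 if pathconf() fails with an error. */
static long name_limit_for(const char *dir)
{
    errno = 0;
    long limit = pathconf(dir, _PC_NAME_MAX);
    if (limit < 0)
        /* -1 with errno untouched means "no definite limit". */
        return errno ? -1 : FALLBACK_NAME_LIMIT;
    return limit;
}

/* Allocate a name buffer sized from the runtime limit (plus the NUL byte).
 * The caller frees the result; *capacity is set only on success. */
static char *alloc_name_buffer(const char *dir, size_t *capacity)
{
    long limit = name_limit_for(dir);
    if (limit < 0)
        return NULL;

    size_t cap = (size_t)limit + 1;
    char *buf = malloc(cap);
    if (buf) {
        buf[0] = '\0';
        *capacity = cap;
    }
    return buf;
}
```

Code written this way would at least not overflow if a file system started reporting a larger limit, although it says nothing about the in-kernel stack usage or the protocol limits mentioned above.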
* Re: Maximum filename length 2008-11-21 22:32 ` Theodore Tso @ 2008-11-22 2:50 ` rae l 2008-11-23 3:08 ` Theodore Tso 2008-11-22 18:48 ` Alexey Salmin 1 sibling, 1 reply; 7+ messages in thread From: rae l @ 2008-11-22 2:50 UTC (permalink / raw) To: Theodore Tso; +Cc: Alexey Salmin, linux-ext4 On Sat, Nov 22, 2008 at 6:32 AM, Theodore Tso <tytso@mit.edu> wrote: > So the problem is that even if we were to add that enhancement to > ext4, there are lots of other things, both in and outside of the > kernel, that would likely need to be changed in order to support this. > I will say personally that its rare for me to use filenames longer > than 50-60 characters, just because they are a pain in the *ss to > type. However, I can see how someone using a graphical interface > might be happy with filenames in the 100-120 character range. The > question though is whether it is worth trying to fix this by > increasing the filename length beyond 255 bytes or not, given the > amount of effort that would be required in the kernel, libc, > userspace, etc. In China, there's also a trend moving ahead from 2bytes charsets (GB2312/GBK/GB18030/BIG5) to UTF-8, so all Chinese characters will need 3 bytes for each to each instead of 2 from then on. The 255 filename length limit the Chinese filename to 85 characters: now try to touch a file with 86 Chinese characters: gektop@tux ~/tmp 0 $ touch 中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中国 touch: cannot touch `中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中中国': File name too long It's very difficult to solve all these problems in the kernel, libc, userspace, I know, but I think we should keep the option to fix them in the future. Maybe in the next POSIX standard, we should change NAME_MAX to [255 * max_bytes_per_character] ? 
Regards, -- Cheng Renquan, Shenzhen, China Yogi Berra - "I never said most of the things I said." ^ permalink raw reply [flat|nested] 7+ messages in thread
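The failing `touch` above comes down to 86 ideographs at 3 bytes each, i.e. 258 bytes against a 255-byte limit. A minimal C sketch of that check follows; `utf8_codepoints`, `name_fits` and `NAME_LIMIT_BYTES` are illustrative names introduced here, and the code point counter assumes its input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* The per-name byte limit discussed in the thread (assumed here). */
#define NAME_LIMIT_BYTES 255

/* Count code points in a valid UTF-8 string by skipping
 * continuation bytes, which all have the bit pattern 10xxxxxx. */
static size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* The limit applies to bytes, not characters, so strlen() decides. */
static int name_fits(const char *name)
{
    return strlen(name) <= NAME_LIMIT_BYTES;
}
```

For a name of 86 copies of U+4E2D (bytes `E4 B8 AD`), `utf8_codepoints()` reports 86 while `strlen()` reports 258, so `name_fits()` rejects it; 85 copies (255 bytes) still fit.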
* Re: Maximum filename length 2008-11-22 2:50 ` rae l @ 2008-11-23 3:08 ` Theodore Tso 0 siblings, 0 replies; 7+ messages in thread From: Theodore Tso @ 2008-11-23 3:08 UTC (permalink / raw) To: rae l; +Cc: Alexey Salmin, linux-ext4 On Sat, Nov 22, 2008 at 10:50:25AM +0800, rae l wrote: > In China, there's also a trend moving ahead from 2bytes charsets > (GB2312/GBK/GB18030/BIG5) to UTF-8, so all Chinese characters will > need 3 bytes for each to each instead of 2 from then on. The 255 > filename length limit the Chinese filename to 85 characters: Sure, but 85 characters is a lot, given that each character might be the equivalent of multiple letters. For example the English word "country", which takes seven characters, or 7 bytes in UTF-8, can be encoded as a single Chinese ideograph, which can be encoded in 3 bytes in UTF-8. Something like "United States of America" is encoded in 24 bytes in English, and 6 bytes (two ideographs) in Chinese in UTF-8. My name, "Theodore Yue Tak Ts'o", takes 21 bytes in English and UTF-8. In Chinese, it's 3 ideographs, or 9 bytes in UTF-8. I'm choosing fairly biased examples here, of course, but I think it's in general true. As a final example consider, "The Tao which can be described is not the true Tao". This can be expressed *much* more succinctly in Chinese. :-) So I don't think people who use the Chinese writing system have much to complain about with respect to the 255 byte / 85 ideograph limit. I have much more sympathy for people who are trying to write something like "Union of Soviet Socialist Republics" in Russian, and find that it takes many more bytes in UTF-8..... - Ted ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Maximum filename length 2008-11-21 22:32 ` Theodore Tso 2008-11-22 2:50 ` rae l @ 2008-11-22 18:48 ` Alexey Salmin 2008-11-22 23:36 ` Andreas Dilger 1 sibling, 1 reply; 7+ messages in thread From: Alexey Salmin @ 2008-11-22 18:48 UTC (permalink / raw) To: Theodore Tso; +Cc: linux-ext4 2008/11/22 Theodore Tso <tytso@mit.edu>: > It wouldn't be _that_ hard to add an extension to ext4 to support > longer filenames (it would mean a new directory entry format, and a > way of marking a directory inode as to whether the old or new > directory format was being used). Unfortunately, the 255 byte limit > is encoded not only in the filesystem, but also in the kernel. > Changing it in the kernel is not just a matter of a #define constant, > but also fixing places which put filename[NAME_MAX] on the stack, and > where increasing NAME_MAX might cause kernel functions to blow the > limited stack space available to kernel code. In addition, there are > numerous userspace and in some cases, protocol limitations which > assume that the total overall length of a pathname is no more than > 1024 bytes. (I suspect there is at least userspace code that also > would blow up if an individual pathname exceeded NAME_MAX, or 256 > bytes.) > Sure, I understand the problems you've mentioned. But every big undertaking has a beginning. Adding the extension to ext4 is only the first step. Of course it'll cause crashes and other problems in many places, from kernel to userspace code. But these problems will disturb only people who really use this extension (like me). Anyway, most of these bugs will be fixed some day, maybe in two or three years. No one is saying that it's a fast process, but it will reach its end and that's good, I think. > I will say personally that its rare for me to use filenames longer > than 50-60 characters, just because they are a pain in the *ss to > type. 
However, I can see how someone using a graphical interface > might be happy with filenames in the 100-120 character range. Same here: most of my filenames are _way_ shorter than 50-60 characters. Besides, I really use English filenames almost always. But there are some cases when long Cyrillic names are needed and it's sad for me to have problems here. Alexey ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Maximum filename length 2008-11-22 18:48 ` Alexey Salmin @ 2008-11-22 23:36 ` Andreas Dilger 2008-11-23 22:15 ` Andi Kleen 0 siblings, 1 reply; 7+ messages in thread From: Andreas Dilger @ 2008-11-22 23:36 UTC (permalink / raw) To: Alexey Salmin; +Cc: Theodore Tso, linux-ext4 On Nov 23, 2008 00:48 +0600, Alexey Salmin wrote: > Sure, I understand the problems you've mentioned. But every big act > has the beginning. Adding the extension to the ext4 is only the first > step. Of course it'll cause crashes and other problems in many places > from kernel to userspace code. But these problems will disturb only > people who will really use this extension (like me). Anyway most of these > bugs will be fixed some day, may be in two or three years. No one is > talking that it's a fast process but it will reach it's end and that's good > I think. If you are motivated to work on this, there are a number of possible ways that this could be done. The simplest would be to create a new directory entry (replacing ext4_dir_entry_2) that has a longer name_len field, and ideally it would also have space for a 48-bit inode number (ext4 will NEVER need more than 280 trillion inodes I think). I don't know that it is practical to require this format for the entire directory, because it would mean in some rare cases rewriting 1M entries (or whatever) in a large directory to the new format. It would be better to allow either just the leaf block to hold the new record format (with a marker at the start of the block), or individual records having the new format, possibly marked by a bit in the "file_type" field. It's kind of ugly, but it needs to be possible to detect if the entry is the old format or the new one. 
#define EXT4_DIRENT3_FL   0x00400000          /* directory has any dir_entry_3 */
#define EXT4_FT_ENTRY_3   0x80                /* file_type for dir_entry_3 */
#define EXT4_FT_MASK      0x0f                /* EXT4_FT_* mask */
#define EXT4_INODE_MASK   0x00ffffffffffffff  /* inode number mask (clears the file_type byte) */
#define EXT4_NAME_LEN3    1012

struct ext4_dir_entry_3 {
        __le64  inode;          /* High byte holds file_type */
        __le16  rec_len;
        __le16  name_len;
        char    name[EXT4_NAME_LEN3];
};

static inline __u8 ext4_get_de_file_type(struct ext4_dir_entry_2 *dirent)
{
        return (dirent->file_type & EXT4_FT_MASK);
}

static inline int ext4_get_de_name_len(struct ext4_dir_entry_2 *dirent)
{
        if (dirent->file_type & EXT4_FT_ENTRY_3) {
                struct ext4_dir_entry_3 *dirent3 =
                        (struct ext4_dir_entry_3 *)dirent;

                return le16_to_cpu(dirent3->name_len);
        }

        return dirent->name_len;
}

static inline int ext4_get_de_rec_len(struct ext4_dir_entry_2 *dirent)
{
        if (dirent->file_type & EXT4_FT_ENTRY_3) {
                struct ext4_dir_entry_3 *dirent3 =
                        (struct ext4_dir_entry_3 *)dirent;

                return le16_to_cpu(dirent3->rec_len);
        }

        return le16_to_cpu(dirent->rec_len);
}

static inline __u64 ext4_get_de_inode(struct ext4_dir_entry_2 *dirent)
{
        if (dirent->file_type & EXT4_FT_ENTRY_3) {
                struct ext4_dir_entry_3 *dirent3 =
                        (struct ext4_dir_entry_3 *)dirent;

                return le64_to_cpu(dirent3->inode) & EXT4_INODE_MASK;
        }

        return le32_to_cpu(dirent->inode);
}

/* Returns a pointer to the name, so the return type is char *, not __u64. */
static inline char *ext4_get_de_name(struct ext4_dir_entry_2 *dirent)
{
        if (dirent->file_type & EXT4_FT_ENTRY_3) {
                struct ext4_dir_entry_3 *dirent3 =
                        (struct ext4_dir_entry_3 *)dirent;

                return dirent3->name;
        }

        return dirent->name;
}

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply [flat|nested] 7+ messages in thread
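The reason a single flag bit can distinguish the two layouts is that the file_type byte sits at byte offset 7 in both: it is the last byte of the existing header (32-bit inode, 16-bit rec_len, 8-bit name_len, 8-bit file_type) and the high byte of the little-endian 64-bit inode in the proposed one. The following userspace sketch decodes either layout from a raw buffer to make that concrete. It is not kernel code: `struct dirent_view`, `decode_dirent` and the `get_le*` helpers are hypothetical names, and the layout of the new entry is taken from the proposal above, which was never part of ext4 as far as this thread shows.

```c
#include <stdint.h>

#define FT_ENTRY_3  0x80u                   /* flag: entry uses the proposed long layout */
#define FT_MASK     0x0fu                   /* low bits carry the actual file type */
#define INODE_MASK  0x00ffffffffffffffULL   /* strips the file_type byte from the inode */

/* Host-endian-independent little-endian readers. */
static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
    uint32_t v = 0;
    for (int i = 3; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

static uint64_t get_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Decoded form of one directory entry, independent of on-disk layout. */
struct dirent_view {
    uint64_t    inode;
    uint16_t    rec_len;
    uint16_t    name_len;
    uint8_t     file_type;
    const char *name;       /* points into the caller's buffer, not NUL-terminated */
};

/* Decode one entry; the caller guarantees the buffer holds a full header. */
static struct dirent_view decode_dirent(const uint8_t *p)
{
    struct dirent_view v;
    uint8_t ft = p[7];      /* same offset in both layouts */

    v.file_type = ft & FT_MASK;
    if (ft & FT_ENTRY_3) {
        /* Proposed layout: le64 inode, le16 rec_len, le16 name_len, name. */
        v.inode    = get_le64(p) & INODE_MASK;
        v.rec_len  = get_le16(p + 8);
        v.name_len = get_le16(p + 10);
        v.name     = (const char *)(p + 12);
    } else {
        /* Existing layout: le32 inode, le16 rec_len, u8 name_len, u8 file_type, name. */
        v.inode    = get_le32(p);
        v.rec_len  = get_le16(p + 4);
        v.name_len = p[6];
        v.name     = (const char *)(p + 8);
    }
    return v;
}
```

The 12-byte header of the proposed layout is also where the 1012 comes from: 1024 minus 12 bytes of header.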
* Re: Maximum filename length 2008-11-22 23:36 ` Andreas Dilger @ 2008-11-23 22:15 ` Andi Kleen 0 siblings, 0 replies; 7+ messages in thread From: Andi Kleen @ 2008-11-23 22:15 UTC (permalink / raw) To: Andreas Dilger; +Cc: Alexey Salmin, Theodore Tso, linux-ext4 Andreas Dilger <adilger@sun.com> writes: > > #define EXT4_FT_ENTRY_3 0x80 /* file_type for dir_entry_3 */ > #define EXT4_FT_MASK 0x0f /* EXT4_FT_* mask */ > #define EXT4_INODE_MASK 0x00ffffffffffffff /* 48-bit inode number mask */ > #define EXT4_NAME_LEN3 1012 > > struct ext4_dir_entry_3 { > __le64 inode; /* High byte holds file_type */ > __le16 rec_len; > __le16 name_len; The new format should also reserve space for a checksum. Adding that would be actually a reasonable practical improvement for everyone. > char name[EXT4_NAME_LEN3]; -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 7+ messages in thread