Hi, While working on a new file systems latency profiler, which we call FSprof, we noticed that generic_file_llseek competes for i_sem if several processes access the same file. For example, we observed that contention happens up to 25% of the time, even with just two processes randomly reading the same file. Even worse, i_sem (as well as other semaphores) accesses require either locking the whole memory bus or at least purging the same cache line from all other processors which should hurt performance on SMP systems. Fortunately, we reviewed the rest of the VFS and file systems code and it seems that most file systems don't need to touch i_sem during llseeks of files (but they do for directories). Here is the generic_file_llseek used as the ->llseek file operation by most of the file systems (see fs/read_write.c, Linux 2.6.10, lines 30-53): loff_t generic_file_llseek(struct file *file, loff_t offset, int origin) { long long retval; struct inode *inode = file->f_mapping->host; down(&inode->i_sem); switch (origin) { case 2: offset += inode->i_size; break; case 1: offset += file->f_pos; } retval = -EINVAL; if (offset>=0 && offset<=inode->i_sb->s_maxbytes) { if (offset != file->f_pos) { file->f_pos = offset; file->f_version = 0; } retval = offset; } up(&inode->i_sem); return retval; } An important observation here is that we need to down the i_sem to atomically update f_pos and f_version. However, f_version is not currently used for file objects by any of the file systems supplied with the kernel. f_version is only used by ufs, ntfs, hpfs, ext3, ext2, proc, and affs for readdir operations of their directories and by fifos and pipes for poll(). One may think that the i_sem protection is necessary to provide consistency between i_size reads and f_pos reads/writes. However, these are not provided in other places of the kernel. For example, consider the code of sys_write (see lines 309-311 of fs/read_write.c): loff_t pos = file_pos_read(file); ret = vfs_write(file, buf, count, &pos); file_pos_write(file, pos); In the above case, f_pos is read and written (using file_pos_read and file_pos_write) without any provisions for atomicity; hence, we shouldn't expect atomicity on f_pos. Therefore, in the case of SEEK_CUR, we can remove the down and up on i_sem in generic_file_llseek, which would only be useful if transaction semantics were enforced on f_pos everywhere else as well. Clearly the same applies in the SEEK_SET case. In the case of SEEK_END, the i_sem protection doesn't add any consistency between i_size and f_pos either. Some other process can modify i_size right after the i_sem is released with the same effect as modifying it right after i_size is read. This is an existing, well-known inconsistency that happens when several processes are appending to a file at the same time and overwrite each other's data. Note that we need to read i_size using the i_size_read function to guarantee that the unaligned 64-bit i_size is read atomically on 32-bit architectures. Considering all the above, it seems that there is no need to down/up i_sem for generic_file_llseek on files. We can do it in two ways: 1) Add a line like: if (S_ISDIR(inode->i_mode)) before the down/up of the i_sem in the generic_file_llseek. 2) Introduce generic_file_llseek_light function (or, pick a function name you deem more appropriate) which doesn't touch the i_sem and f_version. This approach is less intrusive: FS developers who like it can use it. We've benchmarked both options. Both cases result in very similar performance improvements. Thus, while testing the new code for case 2 on our P4 SMP system, we found that removal of the down/up pair reduces average generic_file_llseek execution time from about 400 cycles to 120 cycles, a 70% reduction even if there is no contention on i_sem. Most importantly, however, neither the memory bus is locked two extra times nor the inode structure is evicted from the caches of all other CPUs which is significant for SMP systems. And, of course, now there is no i_sem contention if several processes access the same file and therefore share the inode; contention happens about 25% of the time with the original lseek, even with just two processes randomly reading the same file opened with O_DIRECT. The patch below adds a new function, generic_file_llseek_light, and substitutes it for generic_file_llseek in file seeks in ext2/ext3. (Seeks in directories are left unchanged.) We have used a kernel patched in this way for more than a month without problems, even under FS-intensive benchmarks. We therefore recommend that this patch be considered (in some form) in 2.6 and that all other file system developers consider integrating it into their code as well. Note that this patch only fixes ext2/3, and is here primarily to illustrate one possible way to fix this issue; there may be other/better ways to address this contention. Either way, if you apply it to other file systems, then you need to use the "*_light" version of llseek appropriately for those other file systems. Cheers, Nikolai Joukov. ------------------------------------------ PhD student. Filesystems and Storage Lab. Stony Brook University (SUNY). Advisor: Erez Zadok ------------------------------------------ diff -Nrup linux-2.6.10-original/fs/read_write.c linux-2.6.10-light/fs/read_write.c --- linux-2.6.10-original/fs/read_write.c 2004-12-24 16:35:00.000000000 -0500 +++ linux-2.6.10-light/fs/read_write.c 2005-02-08 20:23:21.000000000 -0500 @@ -27,6 +27,27 @@ struct file_operations generic_ro_fops = EXPORT_SYMBOL(generic_ro_fops); +loff_t generic_file_llseek_light(struct file *file, loff_t offset, int origin) +{ + long long retval = -EINVAL; + struct inode *inode = file->f_mapping->host; + + switch (origin) { + case 2: + offset += i_size_read(inode); + break; + case 1: + offset += file->f_pos; + } + if (offset>=0 && offset<=inode->i_sb->s_maxbytes) { + file->f_pos = offset; + retval = offset; + } + return retval; +} + +EXPORT_SYMBOL(generic_file_llseek_light); + loff_t generic_file_llseek(struct file *file, loff_t offset, int origin) { long long retval; diff -rup linux-2.6.10-original/include/linux/fs.h linux-2.6.10-light/include/linux/fs.h --- linux-2.6.10-original/include/linux/fs.h 2004-12-24 16:34:27.000000000 -0500 +++ linux-2.6.10-light/include/linux/fs.h 2005-02-08 20:23:54.000000000 -0500 @@ -1473,6 +1473,7 @@ extern ssize_t generic_file_readv(struct ssize_t generic_file_writev(struct file *filp, const struct iovec *iov, unsigned long nr_segs, loff_t *ppos); extern loff_t no_llseek(struct file *file, loff_t offset, int origin); +extern loff_t generic_file_llseek_light(struct file *file, loff_t offset, int origin); extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin); extern loff_t remote_llseek(struct file *file, loff_t offset, int origin); extern int generic_file_open(struct inode * inode, struct file * filp); diff -rup linux-2.6.10-original/fs/ext2/file.c linux-2.6.10-light/fs/ext2/file.c --- linux-2.6.10-original/fs/ext2/file.c 2004-12-24 16:34:31.000000000 -0500 +++ linux-2.6.10-light/fs/ext2/file.c 2005-02-08 20:28:05.000000000 -0500 @@ -40,7 +40,7 @@ static int ext2_release_file (struct ino * the ext2 filesystem. */ struct file_operations ext2_file_operations = { - .llseek = generic_file_llseek, + .llseek = generic_file_llseek_light, .read = generic_file_read, .write = generic_file_write, .aio_read = generic_file_aio_read, diff -rup linux-2.6.10-original/fs/ext3/file.c linux-2.6.10-light/fs/ext3/file.c --- linux-2.6.10-original/fs/ext3/file.c 2004-12-24 16:35:39.000000000 -0500 +++ linux-2.6.10-light/fs/ext3/file.c 2005-02-08 20:28:18.000000000 -0500 @@ -116,7 +116,7 @@ force_commit: } struct file_operations ext3_file_operations = { - .llseek = generic_file_llseek, + .llseek = generic_file_llseek_light, .read = do_sync_read, .write = do_sync_write, .aio_read = generic_file_aio_read,