From mboxrd@z Thu Jan 1 00:00:00 1970 From: Nikolai Joukov Subject: [PATCH-2.6] i_sem contention in 2.6.10 generic_file_llseek Date: Wed, 23 Feb 2005 11:33:24 -0500 (EST) Message-ID: Mime-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-559023410-684387517-1109097904=:26945" Cc: Andrew Morton , Alexander Viro Received: from sbcs.cs.sunysb.edu ([130.245.1.15]:64500 "EHLO sbcs.cs.sunysb.edu") by vger.kernel.org with ESMTP id S261494AbVBWQel (ORCPT ); Wed, 23 Feb 2005 11:34:41 -0500 To: linux-fsdevel@vger.kernel.org Content-ID: Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-fsdevel.vger.kernel.org This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. Send mail to mime@docserver.cac.washington.edu for more info. ---559023410-684387517-1109097904=:26945 Content-Type: TEXT/PLAIN; CHARSET=US-ASCII Content-ID: Hi, While working on a new file systems latency profiler, which we call FSprof, we noticed that generic_file_llseek competes for i_sem if several processes access the same file. For example, we observed that contention happens up to 25% of the time, even with just two processes randomly reading the same file. Even worse, i_sem (as well as other semaphores) accesses require either locking the whole memory bus or at least purging the same cache line from all other processors which should hurt performance on SMP systems. Fortunately, we reviewed the rest of the VFS and file systems code and it seems that most file systems don't need to touch i_sem during llseeks of files (but they do for directories). Here is the generic_file_llseek used as the ->llseek file operation by most of the file systems (see fs/read_write.c, Linux 2.6.10, lines 30-53): loff_t generic_file_llseek(struct file *file, loff_t offset, int origin) { long long retval; struct inode *inode = file->f_mapping->host; down(&inode->i_sem); switch (origin) { case 2: offset += inode->i_size; break; case 1: offset += file->f_pos; } retval = -EINVAL; if (offset>=0 && offset<=inode->i_sb->s_maxbytes) { if (offset != file->f_pos) { file->f_pos = offset; file->f_version = 0; } retval = offset; } up(&inode->i_sem); return retval; } An important observation here is that we need to down the i_sem to atomically update f_pos and f_version. However, f_version is not currently used for file objects by any of the file systems supplied with the kernel. f_version is only used by ufs, ntfs, hpfs, ext3, ext2, proc, and affs for readdir operations of their directories and by fifos and pipes for poll(). One may think that the i_sem protection is necessary to provide consistency between i_size reads and f_pos reads/writes. However, these are not provided in other places of the kernel. For example, consider the code of sys_write (see lines 309-311 of fs/read_write.c): loff_t pos = file_pos_read(file); ret = vfs_write(file, buf, count, &pos); file_pos_write(file, pos); In the above case, f_pos is read and written (using file_pos_read and file_pos_write) without any provisions for atomicity; hence, we shouldn't expect atomicity on f_pos. Therefore, in the case of SEEK_CUR, we can remove the down and up on i_sem in generic_file_llseek, which would only be useful if transaction semantics were enforced on f_pos everywhere else as well. Clearly the same applies in the SEEK_SET case. In the case of SEEK_END, the i_sem protection doesn't add any consistency between i_size and f_pos either. Some other process can modify i_size right after the i_sem is released with the same effect as modifying it right after i_size is read. This is an existing, well-known inconsistency that happens when several processes are appending to a file at the same time and overwrite each other's data. Note that we need to read i_size using the i_size_read function to guarantee that the unaligned 64-bit i_size is read atomically on 32-bit architectures. Considering all the above, it seems that there is no need to down/up i_sem for generic_file_llseek on files. We can do it in two ways: 1) Add a line like: if (S_ISDIR(inode->i_mode)) before the down/up of the i_sem in the generic_file_llseek. 2) Introduce generic_file_llseek_light function (or, pick a function name you deem more appropriate) which doesn't touch the i_sem and f_version. This approach is less intrusive: FS developers who like it can use it. We've benchmarked both options. Both cases result in very similar performance improvements. Thus, while testing the new code for case 2 on our P4 SMP system, we found that removal of the down/up pair reduces average generic_file_llseek execution time from about 400 cycles to 120 cycles, a 70% reduction even if there is no contention on i_sem. Most importantly, however, neither the memory bus is locked two extra times nor the inode structure is evicted from the caches of all other CPUs which is significant for SMP systems. And, of course, now there is no i_sem contention if several processes access the same file and therefore share the inode; contention happens about 25% of the time with the original lseek, even with just two processes randomly reading the same file opened with O_DIRECT. The patch below adds a new function, generic_file_llseek_light, and substitutes it for generic_file_llseek in file seeks in ext2/ext3. (Seeks in directories are left unchanged.) We have used a kernel patched in this way for more than a month without problems, even under FS-intensive benchmarks. We therefore recommend that this patch be considered (in some form) in 2.6 and that all other file system developers consider integrating it into their code as well. Note that this patch only fixes ext2/3, and is here primarily to illustrate one possible way to fix this issue; there may be other/better ways to address this contention. Either way, if you apply it to other file systems, then you need to use the "*_light" version of llseek appropriately for those other file systems. Cheers, Nikolai Joukov. ------------------------------------------ PhD student. Filesystems and Storage Lab. Stony Brook University (SUNY). Advisor: Erez Zadok ------------------------------------------ diff -Nrup linux-2.6.10-original/fs/read_write.c linux-2.6.10-light/fs/read_write.c --- linux-2.6.10-original/fs/read_write.c 2004-12-24 16:35:00.000000000 -0500 +++ linux-2.6.10-light/fs/read_write.c 2005-02-08 20:23:21.000000000 -0500 @@ -27,6 +27,27 @@ struct file_operations generic_ro_fops = EXPORT_SYMBOL(generic_ro_fops); +loff_t generic_file_llseek_light(struct file *file, loff_t offset, int origin) +{ + long long retval = -EINVAL; + struct inode *inode = file->f_mapping->host; + + switch (origin) { + case 2: + offset += i_size_read(inode); + break; + case 1: + offset += file->f_pos; + } + if (offset>=0 && offset<=inode->i_sb->s_maxbytes) { + file->f_pos = offset; + retval = offset; + } + return retval; +} + +EXPORT_SYMBOL(generic_file_llseek_light); + loff_t generic_file_llseek(struct file *file, loff_t offset, int origin) { long long retval; diff -rup linux-2.6.10-original/include/linux/fs.h linux-2.6.10-light/include/linux/fs.h --- linux-2.6.10-original/include/linux/fs.h 2004-12-24 16:34:27.000000000 -0500 +++ linux-2.6.10-light/include/linux/fs.h 2005-02-08 20:23:54.000000000 -0500 @@ -1473,6 +1473,7 @@ extern ssize_t generic_file_readv(struct ssize_t generic_file_writev(struct file *filp, const struct iovec *iov, unsigned long nr_segs, loff_t *ppos); extern loff_t no_llseek(struct file *file, loff_t offset, int origin); +extern loff_t generic_file_llseek_light(struct file *file, loff_t offset, int origin); extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin); extern loff_t remote_llseek(struct file *file, loff_t offset, int origin); extern int generic_file_open(struct inode * inode, struct file * filp); diff -rup linux-2.6.10-original/fs/ext2/file.c linux-2.6.10-light/fs/ext2/file.c --- linux-2.6.10-original/fs/ext2/file.c 2004-12-24 16:34:31.000000000 -0500 +++ linux-2.6.10-light/fs/ext2/file.c 2005-02-08 20:28:05.000000000 -0500 @@ -40,7 +40,7 @@ static int ext2_release_file (struct ino * the ext2 filesystem. */ struct file_operations ext2_file_operations = { - .llseek = generic_file_llseek, + .llseek = generic_file_llseek_light, .read = generic_file_read, .write = generic_file_write, .aio_read = generic_file_aio_read, diff -rup linux-2.6.10-original/fs/ext3/file.c linux-2.6.10-light/fs/ext3/file.c --- linux-2.6.10-original/fs/ext3/file.c 2004-12-24 16:35:39.000000000 -0500 +++ linux-2.6.10-light/fs/ext3/file.c 2005-02-08 20:28:18.000000000 -0500 @@ -116,7 +116,7 @@ force_commit: } struct file_operations ext3_file_operations = { - .llseek = generic_file_llseek, + .llseek = generic_file_llseek_light, .read = do_sync_read, .write = do_sync_write, .aio_read = generic_file_aio_read, ---559023410-684387517-1109097904=:26945 Content-Type: TEXT/PLAIN; CHARSET=US-ASCII; NAME="2.6.10-generic_file_llseek_light.diff" Content-Transfer-Encoding: BASE64 Content-ID: Content-Description: Content-Disposition: ATTACHMENT; FILENAME="2.6.10-generic_file_llseek_light.diff" ZGlmZiAtcnVwIGxpbnV4LTIuNi4xMC1vcmlnaW5hbC9mcy9yZWFkX3dyaXRl LmMgbGludXgtMi42LjEwLWxpZ2h0L2ZzL3JlYWRfd3JpdGUuYw0KLS0tIGxp bnV4LTIuNi4xMC1vcmlnaW5hbC9mcy9yZWFkX3dyaXRlLmMJMjAwNC0xMi0y NCAxNjozNTowMC4wMDAwMDAwMDAgLTA1MDANCisrKyBsaW51eC0yLjYuMTAt bGlnaHQvZnMvcmVhZF93cml0ZS5jCTIwMDUtMDItMDggMjA6MjM6MjEuMDAw MDAwMDAwIC0wNTAwDQpAQCAtMjcsNiArMjcsMjcgQEAgc3RydWN0IGZpbGVf b3BlcmF0aW9ucyBnZW5lcmljX3JvX2ZvcHMgPQ0KIA0KIEVYUE9SVF9TWU1C T0woZ2VuZXJpY19yb19mb3BzKTsNCiANCitsb2ZmX3QgZ2VuZXJpY19maWxl X2xsc2Vla19saWdodChzdHJ1Y3QgZmlsZSAqZmlsZSwgbG9mZl90IG9mZnNl dCwgaW50IG9yaWdpbikNCit7DQorCWxvbmcgbG9uZyByZXR2YWwgPSAtRUlO VkFMOw0KKwlzdHJ1Y3QgaW5vZGUgKmlub2RlID0gZmlsZS0+Zl9tYXBwaW5n LT5ob3N0Ow0KKw0KKwlzd2l0Y2ggKG9yaWdpbikgew0KKwkJY2FzZSAyOg0K KwkJCW9mZnNldCArPSBpX3NpemVfcmVhZChpbm9kZSk7DQorCQkJYnJlYWs7 DQorCQljYXNlIDE6DQorCQkJb2Zmc2V0ICs9IGZpbGUtPmZfcG9zOw0KKwl9 DQorCWlmIChvZmZzZXQ+PTAgJiYgb2Zmc2V0PD1pbm9kZS0+aV9zYi0+c19t YXhieXRlcykgew0KKwkJZmlsZS0+Zl9wb3MgPSBvZmZzZXQ7DQorCQlyZXR2 YWwgPSBvZmZzZXQ7DQorCX0NCisJcmV0dXJuIHJldHZhbDsNCit9DQorDQor RVhQT1JUX1NZTUJPTChnZW5lcmljX2ZpbGVfbGxzZWVrX2xpZ2h0KTsNCisN CiBsb2ZmX3QgZ2VuZXJpY19maWxlX2xsc2VlayhzdHJ1Y3QgZmlsZSAqZmls ZSwgbG9mZl90IG9mZnNldCwgaW50IG9yaWdpbikNCiB7DQogCWxvbmcgbG9u ZyByZXR2YWw7DQpkaWZmIC1ydXAgbGludXgtMi42LjEwLW9yaWdpbmFsL2lu Y2x1ZGUvbGludXgvZnMuaCBsaW51eC0yLjYuMTAtbGlnaHQvaW5jbHVkZS9s aW51eC9mcy5oDQotLS0gbGludXgtMi42LjEwLW9yaWdpbmFsL2luY2x1ZGUv bGludXgvZnMuaAkyMDA0LTEyLTI0IDE2OjM0OjI3LjAwMDAwMDAwMCAtMDUw MA0KKysrIGxpbnV4LTIuNi4xMC1saWdodC9pbmNsdWRlL2xpbnV4L2ZzLmgJ MjAwNS0wMi0wOCAyMDoyMzo1NC4wMDAwMDAwMDAgLTA1MDANCkBAIC0xNDcz LDYgKzE0NzMsNyBAQCBleHRlcm4gc3NpemVfdCBnZW5lcmljX2ZpbGVfcmVh ZHYoc3RydWN0DQogc3NpemVfdCBnZW5lcmljX2ZpbGVfd3JpdGV2KHN0cnVj dCBmaWxlICpmaWxwLCBjb25zdCBzdHJ1Y3QgaW92ZWMgKmlvdiwgDQogCQkJ dW5zaWduZWQgbG9uZyBucl9zZWdzLCBsb2ZmX3QgKnBwb3MpOw0KIGV4dGVy biBsb2ZmX3Qgbm9fbGxzZWVrKHN0cnVjdCBmaWxlICpmaWxlLCBsb2ZmX3Qg b2Zmc2V0LCBpbnQgb3JpZ2luKTsNCitleHRlcm4gbG9mZl90IGdlbmVyaWNf ZmlsZV9sbHNlZWtfbGlnaHQoc3RydWN0IGZpbGUgKmZpbGUsIGxvZmZfdCBv ZmZzZXQsIGludCBvcmlnaW4pOw0KIGV4dGVybiBsb2ZmX3QgZ2VuZXJpY19m aWxlX2xsc2VlayhzdHJ1Y3QgZmlsZSAqZmlsZSwgbG9mZl90IG9mZnNldCwg aW50IG9yaWdpbik7DQogZXh0ZXJuIGxvZmZfdCByZW1vdGVfbGxzZWVrKHN0 cnVjdCBmaWxlICpmaWxlLCBsb2ZmX3Qgb2Zmc2V0LCBpbnQgb3JpZ2luKTsN CiBleHRlcm4gaW50IGdlbmVyaWNfZmlsZV9vcGVuKHN0cnVjdCBpbm9kZSAq IGlub2RlLCBzdHJ1Y3QgZmlsZSAqIGZpbHApOw0KZGlmZiAtcnVwIGxpbnV4 LTIuNi4xMC1vcmlnaW5hbC9mcy9leHQyL2ZpbGUuYyBsaW51eC0yLjYuMTAt bGlnaHQvZnMvZXh0Mi9maWxlLmMNCi0tLSBsaW51eC0yLjYuMTAtb3JpZ2lu YWwvZnMvZXh0Mi9maWxlLmMJMjAwNC0xMi0yNCAxNjozNDozMS4wMDAwMDAw MDAgLTA1MDANCisrKyBsaW51eC0yLjYuMTAtbGlnaHQvZnMvZXh0Mi9maWxl LmMJMjAwNS0wMi0wOCAyMDoyODowNS4wMDAwMDAwMDAgLTA1MDANCkBAIC00 MCw3ICs0MCw3IEBAIHN0YXRpYyBpbnQgZXh0Ml9yZWxlYXNlX2ZpbGUgKHN0 cnVjdCBpbm8NCiAgKiB0aGUgZXh0MiBmaWxlc3lzdGVtLg0KICAqLw0KIHN0 cnVjdCBmaWxlX29wZXJhdGlvbnMgZXh0Ml9maWxlX29wZXJhdGlvbnMgPSB7 DQotCS5sbHNlZWsJCT0gZ2VuZXJpY19maWxlX2xsc2VlaywNCisJLmxsc2Vl awkJPSBnZW5lcmljX2ZpbGVfbGxzZWVrX2xpZ2h0LA0KIAkucmVhZAkJPSBn ZW5lcmljX2ZpbGVfcmVhZCwNCiAJLndyaXRlCQk9IGdlbmVyaWNfZmlsZV93 cml0ZSwNCiAJLmFpb19yZWFkCT0gZ2VuZXJpY19maWxlX2Fpb19yZWFkLA0K ZGlmZiAtcnVwIGxpbnV4LTIuNi4xMC1vcmlnaW5hbC9mcy9leHQzL2ZpbGUu YyBsaW51eC0yLjYuMTAtbGlnaHQvZnMvZXh0My9maWxlLmMNCi0tLSBsaW51 eC0yLjYuMTAtb3JpZ2luYWwvZnMvZXh0My9maWxlLmMJMjAwNC0xMi0yNCAx NjozNTozOS4wMDAwMDAwMDAgLTA1MDANCisrKyBsaW51eC0yLjYuMTAtbGln aHQvZnMvZXh0My9maWxlLmMJMjAwNS0wMi0wOCAyMDoyODoxOC4wMDAwMDAw MDAgLTA1MDANCkBAIC0xMTYsNyArMTE2LDcgQEAgZm9yY2VfY29tbWl0Og0K IH0NCiANCiBzdHJ1Y3QgZmlsZV9vcGVyYXRpb25zIGV4dDNfZmlsZV9vcGVy YXRpb25zID0gew0KLQkubGxzZWVrCQk9IGdlbmVyaWNfZmlsZV9sbHNlZWss DQorCS5sbHNlZWsJCT0gZ2VuZXJpY19maWxlX2xsc2Vla19saWdodCwNCiAJ LnJlYWQJCT0gZG9fc3luY19yZWFkLA0KIAkud3JpdGUJCT0gZG9fc3luY193 cml0ZSwNCiAJLmFpb19yZWFkCT0gZ2VuZXJpY19maWxlX2Fpb19yZWFkLA0K ---559023410-684387517-1109097904=:26945--