* [PATCH v3 0/1] dcache: Translating dentry into pathname without taking rename_lock @ 2013-09-06 16:08 Waiman Long 2013-09-06 16:08 ` [PATCH v3 1/1] " Waiman Long 0 siblings, 1 reply; 18+ messages in thread From: Waiman Long @ 2013-09-06 16:08 UTC (permalink / raw) To: Alexander Viro, Linus Torvalds Cc: Waiman Long, linux-fsdevel, linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel Change History v2->v3: - Simplify prepend_name() to blindly copy the dname over until it reaches a null byte or the specified length leaving the sequence check to handle error case. v1->v2: - Check for internal vs external dname, taking d_lock only for external dname for safety. - Replace memchr() by a byte-by-byte checking for loop. - Try lockless dentry to pathname conversion 3 times before falling back to taking the rename_lock to prevent live-lock. - Make code re-factoring suggested by George Spelvin. Waiman Long (1): dcache: Translating dentry into pathname without taking rename_lock fs/dcache.c | 178 ++++++++++++++++++++++++++++++++++++++-------------------- 1 files changed, 116 insertions(+), 62 deletions(-) ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-06 16:08 [PATCH v3 0/1] dcache: Translating dentry into pathname without taking rename_lock Waiman Long @ 2013-09-06 16:08 ` Waiman Long 2013-09-06 20:52 ` Linus Torvalds 0 siblings, 1 reply; 18+ messages in thread From: Waiman Long @ 2013-09-06 16:08 UTC (permalink / raw) To: Alexander Viro, Linus Torvalds Cc: Waiman Long, linux-fsdevel, linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel When running the AIM7's short workload, Linus' lockref patch eliminated most of the spinlock contention. However, there were still some left: 8.46% reaim [kernel.kallsyms] [k] _raw_spin_lock |--42.21%-- d_path | proc_pid_readlink | SyS_readlinkat | SyS_readlink | system_call | __GI___readlink | |--40.97%-- sys_getcwd | system_call | __getcwd The big one here is the rename_lock (seqlock) contention in d_path() and the getcwd system call. This patch will eliminate the need to take the rename_lock while translating dentries into the full pathnames. The need to take the rename_lock is to make sure that no rename operation can be ongoing while the translation is in progress. However, only one thread can take the rename_lock thus blocking all the other threads that need it even though the translation process won't make any change to the dentries. This patch will replace the writer's write_seqlock/write_sequnlock sequence of the rename_lock of the callers of the prepend_path() and __dentry_path() functions with the reader's read_seqbegin/read_seqretry sequence within these 2 functions. As a result, the code will have to retry if one or more rename operations had been performed. In addition, RCU read lock will be taken during the translation process to make sure that no dentries will go away. To prevent live-lock from happening, the code will switch back to take the rename_lock if read_seqretry() fails for three times. To further reduce spinlock contention, this patch does not take the dentry's d_lock when copying the filename from the dentries. Instead, it treats the name pointer and length as unreliable and just copy the string byte-by-byte over until it hits a null byte or the end of string as specified by the length. This should avoid stepping into invalid memory address. The error cases are left to be handled by the sequence number check. The following code re-factoring are also made: 1. Move prepend('/') into prepend_name() to remove one conditional check. 2. Move the global root check in prepend_path() back to the top of the while loop. With this patch, the _raw_spin_lock will now account for only 1.2% of the total CPU cycles for the short workload. This patch also has the effect of reducing the effect of running perf on its profile since the perf command itself can be a heavy user of the d_path() function depending on the complexity of the workload. When taking the perf profile of the high-systime workload, the amount of spinlock contention contributed by running perf without this patch was about 16%. With this patch, the spinlock contention caused by the running of perf will go away and we will have a more accurate perf profile. Signed-off-by: Waiman Long <Waiman.Long@hp.com> --- fs/dcache.c | 178 ++++++++++++++++++++++++++++++++++++++-------------------- 1 files changed, 116 insertions(+), 62 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 96655f4..da095ce 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -89,6 +89,12 @@ EXPORT_SYMBOL(rename_lock); static struct kmem_cache *dentry_cache __read_mostly; /* + * Try 3 times of lockless dentry to pathname conversion before falling + * back to take the rename_lock. + */ +#define D_LOCKLESS_PREPEND_RETRY 3 + +/* * This is the single most critical data structure when it comes * to the dcache: the hashtable for lookups. Somebody should try * to make this good - I've just made it work. @@ -2512,7 +2518,33 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen) static int prepend_name(char **buffer, int *buflen, struct qstr *name) { - return prepend(buffer, buflen, name->name, name->len); + /* + * With RCU path tracing, it may race with d_move(). Use + * ACCESS_ONCE() to make sure that either the old or the new name + * pointer and length are fetched. However, there may be mismatch + * between length and pointer. The length cannot be trusted, we need + * to copy it byte-by-byte until the length is reached or a null + * byte is found. It also prepends "/" at the beginning of the name. + * The sequence number check at the caller will retry it again when + * a d_move() does happen. So any garbage in the buffer due to + * mismatched pointer and length will be discarded. + */ + const char *dname = ACCESS_ONCE(name->name); + u32 dlen = ACCESS_ONCE(name->len); + char *p; + + if (*buflen < dlen + 1) + return -ENAMETOOLONG; + *buflen -= dlen + 1; + p = *buffer -= dlen + 1; + *p++ = '/'; + while (dlen--) { + char c = *dname++; + if (!c) + break; + *p++ = c; + } + return 0; } /** @@ -2522,7 +2554,14 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name) * @buffer: pointer to the end of the buffer * @buflen: pointer to buffer length * - * Caller holds the rename_lock. + * The function tries to write out the pathname without taking any lock other + * than the RCU read lock to make sure that dentries won't go away. It only + * checks the sequence number of the global rename_lock as any change in the + * dentry's d_seq will be preceded by changes in the rename_lock sequence + * number. If the sequence number had been change, it will restart the whole + * pathname back-tracing sequence again. It performs a total of 3 trials of + * lockless back-tracing sequences before falling back to take the + * rename_lock. */ static int prepend_path(const struct path *path, const struct path *root, @@ -2531,54 +2570,70 @@ static int prepend_path(const struct path *path, struct dentry *dentry = path->dentry; struct vfsmount *vfsmnt = path->mnt; struct mount *mnt = real_mount(vfsmnt); - bool slash = false; int error = 0; + unsigned seq = 0; + char *bptr; + int blen; + int retry_cnt = D_LOCKLESS_PREPEND_RETRY; +restart: + bptr = *buffer; + blen = *buflen; + if (retry_cnt) { + seq = read_seqbegin(&rename_lock); + rcu_read_lock(); + } else + write_seqlock(&rename_lock); while (dentry != root->dentry || vfsmnt != root->mnt) { struct dentry * parent; if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) { /* Global root? */ - if (!mnt_has_parent(mnt)) - goto global_root; - dentry = mnt->mnt_mountpoint; - mnt = mnt->mnt_parent; - vfsmnt = &mnt->mnt; - continue; + if (mnt_has_parent(mnt)) { + dentry = mnt->mnt_mountpoint; + mnt = mnt->mnt_parent; + vfsmnt = &mnt->mnt; + continue; + } + /* + * Filesystems needing to implement special "root names" + * should do so with ->d_dname() + */ + if (IS_ROOT(dentry) && + (dentry->d_name.len != 1 || + dentry->d_name.name[0] != '/')) { + WARN(1, "Root dentry has weird name <%.*s>\n", + (int) dentry->d_name.len, + dentry->d_name.name); + } + if (!error) + error = is_mounted(vfsmnt) ? 1 : 2; + break; } parent = dentry->d_parent; prefetch(parent); - spin_lock(&dentry->d_lock); - error = prepend_name(buffer, buflen, &dentry->d_name); - spin_unlock(&dentry->d_lock); - if (!error) - error = prepend(buffer, buflen, "/", 1); + error = prepend_name(&bptr, &blen, &dentry->d_name); if (error) break; - slash = true; dentry = parent; } + if (retry_cnt) { + retry_cnt--; + rcu_read_unlock(); + if (read_seqretry(&rename_lock, seq)) + goto restart; + } else + write_sequnlock(&rename_lock); - if (!error && !slash) - error = prepend(buffer, buflen, "/", 1); - - return error; - -global_root: - /* - * Filesystems needing to implement special "root names" - * should do so with ->d_dname() - */ - if (IS_ROOT(dentry) && - (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) { - WARN(1, "Root dentry has weird name <%.*s>\n", - (int) dentry->d_name.len, dentry->d_name.name); - } - if (!slash) - error = prepend(buffer, buflen, "/", 1); - if (!error) - error = is_mounted(vfsmnt) ? 1 : 2; + if (error >= 0 && bptr == *buffer) { + if (--blen < 0) + error = -ENAMETOOLONG; + else + *--bptr = '/'; + } + *buffer = bptr; + *buflen = blen; return error; } @@ -2607,9 +2662,7 @@ char *__d_path(const struct path *path, prepend(&res, &buflen, "\0", 1); br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); error = prepend_path(path, root, &res, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error < 0) @@ -2628,9 +2681,7 @@ char *d_absolute_path(const struct path *path, prepend(&res, &buflen, "\0", 1); br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); error = prepend_path(path, &root, &res, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error > 1) @@ -2696,9 +2747,7 @@ char *d_path(const struct path *path, char *buf, int buflen) get_fs_root(current->fs, &root); br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); error = path_with_deleted(path, &root, &res, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error < 0) res = ERR_PTR(error); @@ -2733,10 +2782,10 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen) char *end = buffer + buflen; /* these dentries are never renamed, so d_lock is not needed */ if (prepend(&end, &buflen, " (deleted)", 11) || - prepend_name(&end, &buflen, &dentry->d_name) || + prepend(&end, &buflen, dentry->d_name.name, dentry->d_name.len) || prepend(&end, &buflen, "/", 1)) end = ERR_PTR(-ENAMETOOLONG); - return end; + return end; } /* @@ -2744,30 +2793,46 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen) */ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen) { - char *end = buf + buflen; - char *retval; + char *end, *retval; + int len, seq = 0; + int error = 0; + int retry_cnt = D_LOCKLESS_PREPEND_RETRY; - prepend(&end, &buflen, "\0", 1); +restart: + end = buf + buflen; + len = buflen; + prepend(&end, &len, "\0", 1); if (buflen < 1) goto Elong; /* Get '/' right */ retval = end-1; *retval = '/'; - + if (retry_cnt) { + seq = read_seqbegin(&rename_lock); + rcu_read_lock(); + } else + write_seqlock(&rename_lock); while (!IS_ROOT(dentry)) { struct dentry *parent = dentry->d_parent; int error; prefetch(parent); - spin_lock(&dentry->d_lock); - error = prepend_name(&end, &buflen, &dentry->d_name); - spin_unlock(&dentry->d_lock); - if (error != 0 || prepend(&end, &buflen, "/", 1) != 0) - goto Elong; + error = prepend_name(&end, &len, &dentry->d_name); + if (error) + break; retval = end; dentry = parent; } + if (retry_cnt) { + retry_cnt--; + rcu_read_unlock(); + if (read_seqretry(&rename_lock, seq)) + goto restart; + } else + write_sequnlock(&rename_lock); + if (error) + goto Elong; return retval; Elong: return ERR_PTR(-ENAMETOOLONG); @@ -2775,13 +2840,7 @@ Elong: char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen) { - char *retval; - - write_seqlock(&rename_lock); - retval = __dentry_path(dentry, buf, buflen); - write_sequnlock(&rename_lock); - - return retval; + return __dentry_path(dentry, buf, buflen); } EXPORT_SYMBOL(dentry_path_raw); @@ -2790,7 +2849,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen) char *p = NULL; char *retval; - write_seqlock(&rename_lock); if (d_unlinked(dentry)) { p = buf + buflen; if (prepend(&p, &buflen, "//deleted", 10) != 0) @@ -2798,7 +2856,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen) buflen++; } retval = __dentry_path(dentry, buf, buflen); - write_sequnlock(&rename_lock); if (!IS_ERR(retval) && p) *p = '/'; /* restore '/' overriden with '\0' */ return retval; @@ -2837,7 +2894,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size) error = -ENOENT; br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); if (!d_unlinked(pwd.dentry)) { unsigned long len; char *cwd = page + PAGE_SIZE; @@ -2845,7 +2901,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size) prepend(&cwd, &buflen, "\0", 1); error = prepend_path(&pwd, &root, &cwd, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error < 0) @@ -2866,7 +2921,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size) error = -EFAULT; } } else { - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); } -- 1.7.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-06 16:08 ` [PATCH v3 1/1] " Waiman Long @ 2013-09-06 20:52 ` Linus Torvalds 2013-09-06 21:05 ` Al Viro 2013-09-07 2:24 ` Waiman Long 0 siblings, 2 replies; 18+ messages in thread From: Linus Torvalds @ 2013-09-06 20:52 UTC (permalink / raw) To: Waiman Long Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Fri, Sep 6, 2013 at 9:08 AM, Waiman Long <Waiman.Long@hp.com> wrote: > > This patch will replace the writer's write_seqlock/write_sequnlock > sequence of the rename_lock of the callers of the prepend_path() and > __dentry_path() functions with the reader's read_seqbegin/read_seqretry > sequence within these 2 functions. Ok, this actually looks really good. I do have one comment, from just reading the patch: I would really like the stuff inside the restart: bptr = *buffer; blen = *buflen; if (retry_cnt) { seq = read_seqbegin(&rename_lock); rcu_read_lock(); } else write_seqlock(&rename_lock); ... guts of path generation ... if (retry_cnt) { retry_cnt--; rcu_read_unlock(); if (read_seqretry(&rename_lock, seq)) goto restart; } else write_sequnlock(&rename_lock); could possible be done as a separate function? Alternatively (or perhaps additionally), maybe the locking could be done as an inline function too (taking the retry count as an argument) to make things a bit more easy to understand. Right now there is a lot of fairly subtle things going on in that __dentry_path() function. It's not a huge function, but I think that "while()" loop inside that locking could be done as its own function and make it even more readable. But I could already apply this as-is, so it's not a big deal. Al - do you have comments? Do you want to take this through your tree, or are you working on other things? I can take this directly too.. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-06 20:52 ` Linus Torvalds @ 2013-09-06 21:05 ` Al Viro 2013-09-06 21:48 ` Linus Torvalds 2013-09-07 2:24 ` Waiman Long 1 sibling, 1 reply; 18+ messages in thread From: Al Viro @ 2013-09-06 21:05 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Fri, Sep 06, 2013 at 01:52:49PM -0700, Linus Torvalds wrote: > Al - do you have comments? Do you want to take this through your tree, > or are you working on other things? I can take this directly too.. I can take that, but I'm really not convinced that we need writer lock there at all. After all, if we really can get livelocks on that one, we would be getting them on d_lookup()... Let's not complicate the things without need; if we ever run into loads where livelock start to happen, we can look into dealing with them. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-06 21:05 ` Al Viro @ 2013-09-06 21:48 ` Linus Torvalds 2013-09-07 0:00 ` Al Viro 0 siblings, 1 reply; 18+ messages in thread From: Linus Torvalds @ 2013-09-06 21:48 UTC (permalink / raw) To: Al Viro Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Fri, Sep 6, 2013 at 2:05 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > I can take that, but I'm really not convinced that we need writer lock > there at all. After all, if we really can get livelocks on that one, > we would be getting them on d_lookup()... d_lookup() does a _single_ path component. That's a *big* difference. Sure, the hash chain that d_lookup() (well, __d_lookup()) ends up walking is a bit more complicated than just following the dentry parent pointer, but that's much harder to trigger than just creating a really deep directory structure of single-letter nested directories, and then doing a "getcwd()" that walks 1024+ parents, while another thread is looping renaming things.. So I personally do feel a lot safer with the fallback to write locking here. Especially since it's pretty simple, so there isn't really much downside. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-06 21:48 ` Linus Torvalds @ 2013-09-07 0:00 ` Al Viro 2013-09-07 0:19 ` Linus Torvalds 0 siblings, 1 reply; 18+ messages in thread From: Al Viro @ 2013-09-07 0:00 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Fri, Sep 06, 2013 at 02:48:32PM -0700, Linus Torvalds wrote: > On Fri, Sep 6, 2013 at 2:05 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > I can take that, but I'm really not convinced that we need writer lock > > there at all. After all, if we really can get livelocks on that one, > > we would be getting them on d_lookup()... > > d_lookup() does a _single_ path component. That's a *big* difference. > > Sure, the hash chain that d_lookup() (well, __d_lookup()) ends up > walking is a bit more complicated than just following the dentry > parent pointer, but that's much harder to trigger than just creating a > really deep directory structure of single-letter nested directories, > and then doing a "getcwd()" that walks 1024+ parents, while another > thread is looping renaming things.. > > So I personally do feel a lot safer with the fallback to write locking here. > > Especially since it's pretty simple, so there isn't really much downside. Er... what will happen if you have done just what you've described and have a process call d_lookup()? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-07 0:00 ` Al Viro @ 2013-09-07 0:19 ` Linus Torvalds 2013-09-07 0:58 ` Linus Torvalds 0 siblings, 1 reply; 18+ messages in thread From: Linus Torvalds @ 2013-09-07 0:19 UTC (permalink / raw) To: Al Viro Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Fri, Sep 6, 2013 at 5:00 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Er... what will happen if you have done just what you've described and have > a process call d_lookup()? Umm. Yes? What part of "one single path component" did you not get? To repeat: d_lookup() NEVER LOOKS UP A WHOLE PATHNAME. It looks up just a single path component. It matters not one whit whether you look up a filename that is 1500 components deep, d_lookup() only ever works on one single component. It's a single short hash chain lookup. Sure, it can fail, but people really have to work at it. You're not going to be able to make it fail by just calling "rename()" in a loop. You're going to have to do multiple threads at least, and now you need to do it on multiple different filesystems, since otherwise those multiple threads are going to be serialized by the (outer) per-filesystem vfs-layer rename semaphores. In other words, it sounds impossible to trigger, wouldn't you say? Or if you try, you're going to stand out for using a *lot* of resources. In contrast, doing the getcwd() lookup really is following potentially quite long chains. So it's quite possible that just a single thread doing rename() in a loop (on, say, /tmp, so that there isn't any IO) can trigger the sequence write-lock frequently enough that traversing 1500 d_parent entries might never complete. Have I tried it? No. But think about the two different scenarios. There really is a *big* difference between looking up one single dentry from its parent using our hash tables, and traversing a potentially almost unbounded parenthood chain. (We're bounded in practice by PATH_MAX, so you can't make getcwd() traverse more than about 2000 parents (single character filename plus the slash for each level), and for all I know filesystems might cap it before that, so it's not unbounded, but the difference between "1" and "2000" is pretty damn big) Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-07 0:19 ` Linus Torvalds @ 2013-09-07 0:58 ` Linus Torvalds 2013-09-07 3:01 ` Al Viro 0 siblings, 1 reply; 18+ messages in thread From: Linus Torvalds @ 2013-09-07 0:58 UTC (permalink / raw) To: Al Viro Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Fri, Sep 6, 2013 at 5:19 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > (We're bounded in practice by PATH_MAX, so you can't make getcwd() > traverse more than about 2000 parents (single character filename plus > the slash for each level), and for all I know filesystems might cap it > before that, so it's not unbounded, but the difference between "1" and > "2000" is pretty damn big) .. in particular, it's big enough that one is pretty much guaranteed to fit in any reasonable L1 cache (if we have dentry hash chains so long that that becomes a problem for traversing a single chain, we're screwed anyway), while the other can most likely be a case of "not a single L1 cache hit because by the time you fail and go back to the start, you've flushed the L1 cache". Now, whether 2000 L2 cache misses is long enough to give people a chance to run the whole rename system call path in a loop a few times, I don't know, but it sure as heck sounds likely. Of course, you might still ask "why should we even care?" At least without preemption, you might be able to trigger some really excessive latencies and possibly a watchdog screaming at you as a result. But that said, maybe we wouldn't care. I just think that the solution is so simple (what, five extra lines or so) that it's worth avoiding even the worry. Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-07 0:58 ` Linus Torvalds @ 2013-09-07 3:01 ` Al Viro 2013-09-07 17:32 ` Al Viro 2013-09-07 17:52 ` Linus Torvalds 0 siblings, 2 replies; 18+ messages in thread From: Al Viro @ 2013-09-07 3:01 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Fri, Sep 06, 2013 at 05:58:51PM -0700, Linus Torvalds wrote: > On Fri, Sep 6, 2013 at 5:19 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > (We're bounded in practice by PATH_MAX, so you can't make getcwd() > > traverse more than about 2000 parents (single character filename plus > > the slash for each level), and for all I know filesystems might cap it > > before that, so it's not unbounded, but the difference between "1" and > > "2000" is pretty damn big) > > .. in particular, it's big enough that one is pretty much guaranteed > to fit in any reasonable L1 cache (if we have dentry hash chains so > long that that becomes a problem for traversing a single chain, we're > screwed anyway), while the other can most likely be a case of "not a > single L1 cache hit because by the time you fail and go back to the > start, you've flushed the L1 cache". > > Now, whether 2000 L2 cache misses is long enough to give people a > chance to run the whole rename system call path in a loop a few times, > I don't know, but it sure as heck sounds likely. > > Of course, you might still ask "why should we even care?" At least > without preemption, you might be able to trigger some really excessive > latencies and possibly a watchdog screaming at you as a result. But > that said, maybe we wouldn't care. I just think that the solution is > so simple (what, five extra lines or so) that it's worth avoiding even > the worry. We already have that kind of logics - see select_parent() et.al. in mainline or d_walk() in vfs.git#for-linus (pull request will go in a few minutes). With this patch we get * plain seqretry loop (d_lookup(), is_subdir(), autofs4_getpath(), ceph_misc_build_path(), [cifs] build_path_from_dentry(), nfs_path(), [audit] handle_path()) * try seqretry once, then switch to write_seqlock() (the things that got unified into d_walk()) * try seqretry three times, then switch to write_seqlock() (d_path() and friends) * several pure write_seqlock() users (d_move(), d_set_mounted(), d_materialize_unique()) The last class is not a problem - these we want as writers. I really don't like the way the rest is distributed - if nothing else, nfs_path() and friends are in exactly the same situation as d_path(). Moreover, why the distinction between "try once" and "try thrice"? _If_ we fold the second and the third groups together (and probably have a bunch from the first one join that), we at least get something understandable, but the I really wonder if seqlock has the right calling conventions for that (and at least I'd like to fold the "already got writelock" flag into seq - we do have a spare bit there). Comments? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-07 3:01 ` Al Viro @ 2013-09-07 17:32 ` Al Viro 2013-09-08 4:15 ` Ian Kent 2013-09-07 17:52 ` Linus Torvalds 1 sibling, 1 reply; 18+ messages in thread From: Al Viro @ 2013-09-07 17:32 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel, Sage Weil, Ian Kent On Sat, Sep 07, 2013 at 04:01:10AM +0100, Al Viro wrote: > * plain seqretry loop (d_lookup(), is_subdir(), autofs4_getpath(), > ceph_misc_build_path(), [cifs] build_path_from_dentry(), nfs_path(), _mds_, actually - sorry. > [audit] handle_path()) > * try seqretry once, then switch to write_seqlock() (the things > that got unified into d_walk()) > * try seqretry three times, then switch to write_seqlock() (d_path() > and friends) > * several pure write_seqlock() users (d_move(), d_set_mounted(), > d_materialize_unique()) BTW, autofs4_getpath() looks really odd: static int autofs4_getpath(struct autofs_sb_info *sbi, struct dentry *dentry, char **name) and *name is never modified in it. Why not simply pass it by value? Moreover, I'm not sure I understand what do we need sbi->fs_lock in there. Other than that, it's very close to dentry_path() (well, that and different calling conventions). Ian? ceph_mds_build_path() is similar, but it does kmalloc to store the result and grabs ->d_lock for ->d_name protection. This one, BTW, is much more likely to get stalls - it ends up doing kmalloc on each attempt (after having calculated the current length). Bugger if I understand what's wrong with simply grabbing a page and doing that once - before everything else... build_path_from_dentry() - same story, might very well have been the source of ceph_mds_build_path(). nfs_path() - not far from open-coded dentry_path() with some postprocessing, uses ->d_lock for ->d_name protection. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-07 17:32 ` Al Viro @ 2013-09-08 4:15 ` Ian Kent 2013-09-08 4:58 ` Al Viro 0 siblings, 1 reply; 18+ messages in thread From: Ian Kent @ 2013-09-08 4:15 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel, Sage Weil On Sat, 2013-09-07 at 18:32 +0100, Al Viro wrote: > On Sat, Sep 07, 2013 at 04:01:10AM +0100, Al Viro wrote: > > * plain seqretry loop (d_lookup(), is_subdir(), autofs4_getpath(), > > ceph_misc_build_path(), [cifs] build_path_from_dentry(), nfs_path(), > _mds_, actually - sorry. > > [audit] handle_path()) > > * try seqretry once, then switch to write_seqlock() (the things > > that got unified into d_walk()) > > * try seqretry three times, then switch to write_seqlock() (d_path() > > and friends) > > * several pure write_seqlock() users (d_move(), d_set_mounted(), > > d_materialize_unique()) > > BTW, autofs4_getpath() looks really odd: > static int autofs4_getpath(struct autofs_sb_info *sbi, > struct dentry *dentry, char **name) > and *name is never modified in it. Why not simply pass it by value? > Moreover, I'm not sure I understand what do we need sbi->fs_lock in > there. Other than that, it's very close to dentry_path() (well, that > and different calling conventions). Ian? Yes, it is close to dentry_path() but the path functions used to return the full path to the root, although I don't see where dentry_path() get the root, mmm ... autofs4_getpath() is supposed to return a path relative to the current dentry. That goes back to the pseudo direct mounts of autofs version 4 where the keys could be a relative path. The fs_lock probably isn't needed. This was a tree of directories and I didn't want mount requests coming in while I was getting the path, and didn't want an expire to happen either, although there should only be one expire process anyway. Ian ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-08 4:15 ` Ian Kent @ 2013-09-08 4:58 ` Al Viro 2013-09-08 8:51 ` Ian Kent 0 siblings, 1 reply; 18+ messages in thread From: Al Viro @ 2013-09-08 4:58 UTC (permalink / raw) To: Ian Kent Cc: Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel, Sage Weil On Sun, Sep 08, 2013 at 12:15:46PM +0800, Ian Kent wrote: > > and *name is never modified in it. Why not simply pass it by value? > > Moreover, I'm not sure I understand what do we need sbi->fs_lock in > > there. Other than that, it's very close to dentry_path() (well, that > > and different calling conventions). Ian? > > Yes, it is close to dentry_path() but the path functions used to return > the full path to the root, although I don't see where dentry_path() get > the root, mmm ... dentry_path(), unlike d_path(), is relative to the root of filesystem containing dentry in question. There are 3 of those suckers: d_path(): vfsmount/dentry => path to current process' root d_absolute_path(): ditto, but ignores chroot jails (goes to the absolute root of namespace) dentry_path(): dentry => path from root of fs containing that dentry. IOW, if you have something mounted on /jail/mnt/foo and are chrooted into /jail, passing vfsmount/dentry of /jail/mnt/foo/bar/baz to d_path() will yield "/mnt/foo/bar/baz", to d_absolute_path() - "/jail/mnt/foo/bar/baz" and passing the same dentry to dentry_path() - "/bar/baz". ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-08 4:58 ` Al Viro @ 2013-09-08 8:51 ` Ian Kent 0 siblings, 0 replies; 18+ messages in thread From: Ian Kent @ 2013-09-08 8:51 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel, Sage Weil On Sun, 2013-09-08 at 05:58 +0100, Al Viro wrote: > On Sun, Sep 08, 2013 at 12:15:46PM +0800, Ian Kent wrote: > > > and *name is never modified in it. Why not simply pass it by value? > > > Moreover, I'm not sure I understand what do we need sbi->fs_lock in > > > there. Other than that, it's very close to dentry_path() (well, that > > > and different calling conventions). Ian? > > > > Yes, it is close to dentry_path() but the path functions used to return > > the full path to the root, although I don't see where dentry_path() get > > the root, mmm ... > > dentry_path(), unlike d_path(), is relative to the root of filesystem > containing dentry in question. There are 3 of those suckers: > d_path(): vfsmount/dentry => path to current process' root > d_absolute_path(): ditto, but ignores chroot jails (goes to > the absolute root of namespace) > dentry_path(): dentry => path from root of fs containing that > dentry. > > IOW, if you have something mounted on /jail/mnt/foo and are chrooted into > /jail, passing vfsmount/dentry of /jail/mnt/foo/bar/baz to d_path() will > yield "/mnt/foo/bar/baz", to d_absolute_path() - "/jail/mnt/foo/bar/baz" and > passing the same dentry to dentry_path() - "/bar/baz". Right, so all I need to do is drop the leading "/" and I'll have what I need. I'll do that. Thanks Ian ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-07 3:01 ` Al Viro 2013-09-07 17:32 ` Al Viro @ 2013-09-07 17:52 ` Linus Torvalds 2013-09-07 18:07 ` Al Viro 1 sibling, 1 reply; 18+ messages in thread From: Linus Torvalds @ 2013-09-07 17:52 UTC (permalink / raw) To: Al Viro Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Fri, Sep 6, 2013 at 8:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > * plain seqretry loop (d_lookup(), is_subdir(), autofs4_getpath(), > ceph_misc_build_path(), [cifs] build_path_from_dentry(), nfs_path(), > [audit] handle_path()) > * try seqretry once, then switch to write_seqlock() (the things > that got unified into d_walk()) > * try seqretry three times, then switch to write_seqlock() (d_path() > and friends) > * several pure write_seqlock() users (d_move(), d_set_mounted(), > d_materialize_unique()) > > The last class is not a problem - these we want as writers. I really don't > like the way the rest is distributed - if nothing else, nfs_path() and > friends are in exactly the same situation as d_path(). Moreover, why > the distinction between "try once" and "try thrice"? > > _If_ we fold the second and the third groups together (and probably have > a bunch from the first one join that), we at least get something > understandable, but the I really wonder if seqlock has the right calling > conventions for that (and at least I'd like to fold the "already got writelock" > flag into seq - we do have a spare bit there). > > Comments? So I think we could make a more complicated data structure that looks something like this: struct seqlock_retry { unsigned int seq_no; int state; }; and pass that around. Gcc should do pretty well, especially if we inline things (but even if not, small structures that fit in 64 bytes generate reasonable code even on 32-bit targets, because gcc knows about using two registers for passing data around).. Then you can make "state" have a retry counter in it, and have a negative value mean "I hold the lock for writing". Add a couple of helper functions, and you can fairly easily handle the mixed "try for reading first, then fall back to writing". That said, __d_lookup() still shows up as very performance-critical on some loads (symlinks in particular cause us to fall out of the RCU cases) so I'd like to keep that using the simple pure read case. I don't believe you can livelock it, as mentioned. But the other ones might well be worth moving to a "fall back to write-locking after <n> tries" model. They might all traverse user-specified paths of fairly arbitrary depth, no? So this "seqlock_retry" thing wouldn't _replace_ bare seqlocks, it would just be a helper thing for this kind of behavior where we want to normally do things with just the read-lock, but want to guarantee that we don't live-lock. Sounds reasonable? Linus ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-07 17:52 ` Linus Torvalds @ 2013-09-07 18:07 ` Al Viro 2013-09-07 18:53 ` Al Viro 2013-09-09 14:31 ` Waiman Long 0 siblings, 2 replies; 18+ messages in thread From: Al Viro @ 2013-09-07 18:07 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Sat, Sep 07, 2013 at 10:52:02AM -0700, Linus Torvalds wrote: > So I think we could make a more complicated data structure that looks > something like this: > > struct seqlock_retry { > unsigned int seq_no; > int state; > }; > > and pass that around. Gcc should do pretty well, especially if we > inline things (but even if not, small structures that fit in 64 bytes > generate reasonable code even on 32-bit targets, because gcc knows > about using two registers for passing data around).. > > Then you can make "state" have a retry counter in it, and have a > negative value mean "I hold the lock for writing". Add a couple of > helper functions, and you can fairly easily handle the mixed "try for > reading first, then fall back to writing". > > That said, __d_lookup() still shows up as very performance-critical on > some loads (symlinks in particular cause us to fall out of the RCU > cases) so I'd like to keep that using the simple pure read case. I > don't believe you can livelock it, as mentioned. But the other ones > might well be worth moving to a "fall back to write-locking after <n> > tries" model. They might all traverse user-specified paths of fairly > arbitrary depth, no? > > So this "seqlock_retry" thing wouldn't _replace_ bare seqlocks, it > would just be a helper thing for this kind of behavior where we want > to normally do things with just the read-lock, but want to guarantee > that we don't live-lock. > > Sounds reasonable? More or less; I just wonder if we are overdesigning here - if we don't do "repeat more than once", we can simply use the lower bit of seq - read_seqlock() always returns an even value. So we could do something like seqretry_and_lock(lock, &seq): if ((*seq & 1) || !read_seqretry(lock, *seq)) return true; *seq |= 1; write_seqlock(lock); return false; and seqretry_done(lock, seq): if (seq & 1) write_sequnlock(lock); with these loops turning into seq = read_seqlock(&rename_lock); ... if (!seqretry_and_lock(&rename_lock, &seq)) goto again; ... seqretry_done(&rename_lock); But I'd really like to understand the existing zoo - in particular, ceph and cifs users can't be converted to anything of that kind (blocking kmalloc() can't live under write_seqlock()) and they are _easier_ to livelock than d_path(), due to the same kmalloc() widening the window. Guys, do we really care about precisely-sized allocations there? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-07 18:07 ` Al Viro @ 2013-09-07 18:53 ` Al Viro 2013-09-09 14:31 ` Waiman Long 1 sibling, 0 replies; 18+ messages in thread From: Al Viro @ 2013-09-07 18:53 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Sat, Sep 07, 2013 at 07:07:24PM +0100, Al Viro wrote: > with these loops turning into > seq = read_seqlock(&rename_lock); again: > ... > if (!seqretry_and_lock(&rename_lock, &seq)) > goto again; > ... > seqretry_done(&rename_lock); Forgot the label, sorry. And I agree that d_lookup() ought to stay as is - this is just about the ones that try readlock once and then fall back to writer. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-07 18:07 ` Al Viro 2013-09-07 18:53 ` Al Viro @ 2013-09-09 14:31 ` Waiman Long 1 sibling, 0 replies; 18+ messages in thread From: Waiman Long @ 2013-09-09 14:31 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On 09/07/2013 02:07 PM, Al Viro wrote: > On Sat, Sep 07, 2013 at 10:52:02AM -0700, Linus Torvalds wrote: > >> So I think we could make a more complicated data structure that looks >> something like this: >> >> struct seqlock_retry { >> unsigned int seq_no; >> int state; >> }; >> >> and pass that around. Gcc should do pretty well, especially if we >> inline things (but even if not, small structures that fit in 64 bytes >> generate reasonable code even on 32-bit targets, because gcc knows >> about using two registers for passing data around).. >> >> Then you can make "state" have a retry counter in it, and have a >> negative value mean "I hold the lock for writing". Add a couple of >> helper functions, and you can fairly easily handle the mixed "try for >> reading first, then fall back to writing". >> >> That said, __d_lookup() still shows up as very performance-critical on >> some loads (symlinks in particular cause us to fall out of the RCU >> cases) so I'd like to keep that using the simple pure read case. I >> don't believe you can livelock it, as mentioned. But the other ones >> might well be worth moving to a "fall back to write-locking after<n> >> tries" model. They might all traverse user-specified paths of fairly >> arbitrary depth, no? >> >> So this "seqlock_retry" thing wouldn't _replace_ bare seqlocks, it >> would just be a helper thing for this kind of behavior where we want >> to normally do things with just the read-lock, but want to guarantee >> that we don't live-lock. >> >> Sounds reasonable? > More or less; I just wonder if we are overdesigning here - if we don't > do "repeat more than once", we can simply use the lower bit of seq - > read_seqlock() always returns an even value. So we could do something > like seqretry_and_lock(lock,&seq): > if ((*seq& 1) || !read_seqretry(lock, *seq)) > return true; > *seq |= 1; > write_seqlock(lock); > return false; > and seqretry_done(lock, seq): > if (seq& 1) > write_sequnlock(lock); > with these loops turning into > seq = read_seqlock(&rename_lock); > ... > if (!seqretry_and_lock(&rename_lock,&seq)) > goto again; > ... > seqretry_done(&rename_lock); I am fine with try it once and then lock instead of trying it a few times. Now are you planning to have these helper functions for the dcache layer only or as part of the seqlock infrastructure? If we are going to touch the seqlock layer, I would suggest adding a blocking reader that takes the lock, but won't update the sequence number so that it won't block other sequence readers as my original seqlock patch was doing. -Longman ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-06 20:52 ` Linus Torvalds 2013-09-06 21:05 ` Al Viro @ 2013-09-07 2:24 ` Waiman Long 1 sibling, 0 replies; 18+ messages in thread From: Waiman Long @ 2013-09-07 2:24 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On 09/06/2013 04:52 PM, Linus Torvalds wrote: > On Fri, Sep 6, 2013 at 9:08 AM, Waiman Long<Waiman.Long@hp.com> wrote: >> This patch will replace the writer's write_seqlock/write_sequnlock >> sequence of the rename_lock of the callers of the prepend_path() and >> __dentry_path() functions with the reader's read_seqbegin/read_seqretry >> sequence within these 2 functions. > Ok, this actually looks really good. > > I do have one comment, from just reading the patch: > > I would really like the stuff inside the > > restart: > bptr = *buffer; > blen = *buflen; > if (retry_cnt) { > seq = read_seqbegin(&rename_lock); > rcu_read_lock(); > } else > write_seqlock(&rename_lock); > > ... guts of path generation ... > > if (retry_cnt) { > retry_cnt--; > rcu_read_unlock(); > if (read_seqretry(&rename_lock, seq)) > goto restart; > } else > write_sequnlock(&rename_lock); > > could possible be done as a separate function? > > Alternatively (or perhaps additionally), maybe the locking could be > done as an inline function too (taking the retry count as an argument) > to make things a bit more easy to understand. I would prefer putting the begin and end blocks into 2 helper inlined helper functions to make code easier to look at. I will work on this over the weekend. -Longman ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2013-09-09 14:31 UTC | newest] Thread overview: 18+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-09-06 16:08 [PATCH v3 0/1] dcache: Translating dentry into pathname without taking rename_lock Waiman Long 2013-09-06 16:08 ` [PATCH v3 1/1] " Waiman Long 2013-09-06 20:52 ` Linus Torvalds 2013-09-06 21:05 ` Al Viro 2013-09-06 21:48 ` Linus Torvalds 2013-09-07 0:00 ` Al Viro 2013-09-07 0:19 ` Linus Torvalds 2013-09-07 0:58 ` Linus Torvalds 2013-09-07 3:01 ` Al Viro 2013-09-07 17:32 ` Al Viro 2013-09-08 4:15 ` Ian Kent 2013-09-08 4:58 ` Al Viro 2013-09-08 8:51 ` Ian Kent 2013-09-07 17:52 ` Linus Torvalds 2013-09-07 18:07 ` Al Viro 2013-09-07 18:53 ` Al Viro 2013-09-09 14:31 ` Waiman Long 2013-09-07 2:24 ` Waiman Long
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).