* [PATCH v4 0/1] dcache: Translating dentry into pathname without taking rename_lock @ 2013-09-09 16:18 Waiman Long 2013-09-09 16:18 ` [PATCH v4 1/1] " Waiman Long 0 siblings, 1 reply; 25+ messages in thread From: Waiman Long @ 2013-09-09 16:18 UTC (permalink / raw) To: Alexander Viro, Linus Torvalds Cc: Waiman Long, linux-fsdevel, linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel Change History v3->v4: - Extract the begin and end blocks of the rename_lock sequence number check into helper functions to make the code easier to read. v2->v3: - Simplify prepend_name() to blindly copy the dname over until it reaches a null byte or the specified length leaving the sequence check to handle error case. v1->v2: - Check for internal vs external dname, taking d_lock only for external dname for safety. - Replace memchr() by a byte-by-byte checking for loop. - Try lockless dentry to pathname conversion 3 times before falling back to taking the rename_lock to prevent live-lock. - Make code re-factoring suggested by George Spelvin. Waiman Long (1): dcache: Translating dentry into pathname without taking rename_lock fs/dcache.c | 196 ++++++++++++++++++++++++++++++++++++++++------------------- 1 files changed, 133 insertions(+), 63 deletions(-) ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 16:18 [PATCH v4 0/1] dcache: Translating dentry into pathname without taking rename_lock Waiman Long @ 2013-09-09 16:18 ` Waiman Long 2013-09-09 17:29 ` Al Viro 0 siblings, 1 reply; 25+ messages in thread From: Waiman Long @ 2013-09-09 16:18 UTC (permalink / raw) To: Alexander Viro, Linus Torvalds Cc: Waiman Long, linux-fsdevel, linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel When running the AIM7's short workload, Linus' lockref patch eliminated most of the spinlock contention. However, there were still some left: 8.46% reaim [kernel.kallsyms] [k] _raw_spin_lock |--42.21%-- d_path | proc_pid_readlink | SyS_readlinkat | SyS_readlink | system_call | __GI___readlink | |--40.97%-- sys_getcwd | system_call | __getcwd The big one here is the rename_lock (seqlock) contention in d_path() and the getcwd system call. This patch will eliminate the need to take the rename_lock while translating dentries into the full pathnames. The need to take the rename_lock is to make sure that no rename operation can be ongoing while the translation is in progress. However, only one thread can take the rename_lock thus blocking all the other threads that need it even though the translation process won't make any change to the dentries. This patch will replace the writer's write_seqlock/write_sequnlock sequence of the rename_lock of the callers of the prepend_path() and __dentry_path() functions with the reader's read_seqbegin/read_seqretry sequence within these 2 functions. As a result, the code will have to retry if one or more rename operations had been performed. In addition, RCU read lock will be taken during the translation process to make sure that no dentries will go away. To prevent live-lock from happening, the code will switch back to take the rename_lock if read_seqretry() fails for three times. To further reduce spinlock contention, this patch does not take the dentry's d_lock when copying the filename from the dentries. Instead, it treats the name pointer and length as unreliable and just copy the string byte-by-byte over until it hits a null byte or the end of string as specified by the length. This should avoid stepping into invalid memory address. The error cases are left to be handled by the sequence number check. The following code re-factoring are also made: 1. Move prepend('/') into prepend_name() to remove one conditional check. 2. Move the global root check in prepend_path() back to the top of the while loop. With this patch, the _raw_spin_lock will now account for only 1.2% of the total CPU cycles for the short workload. This patch also has the effect of reducing the effect of running perf on its profile since the perf command itself can be a heavy user of the d_path() function depending on the complexity of the workload. When taking the perf profile of the high-systime workload, the amount of spinlock contention contributed by running perf without this patch was about 16%. With this patch, the spinlock contention caused by the running of perf will go away and we will have a more accurate perf profile. Signed-off-by: Waiman Long <Waiman.Long@hp.com> --- fs/dcache.c | 196 ++++++++++++++++++++++++++++++++++++++++------------------- 1 files changed, 133 insertions(+), 63 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index ca8e9cd..8186ff9 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -88,6 +88,44 @@ EXPORT_SYMBOL(rename_lock); static struct kmem_cache *dentry_cache __read_mostly; +/** + * read_seqbegin_or_lock - begin a sequence number check or locking block + * lock: sequence lock + * seq : sequence number to be checked + * + * First try it once optimistically without taking the lock. If that fails, + * take the lock. The sequence number is also used as a marker for deciding + * whether to be a reader (even) or writer (odd). + * N.B. seq must be initialized to an even number to begin with. + */ +static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq) +{ + if (!(*seq & 1)) { /* Even */ + *seq = read_seqbegin(lock); + rcu_read_lock(); + } else /* Odd */ + write_seqlock(lock); +} + +/** + * read_seqretry_or_unlock - end a seqretry or lock block & return retry status + * lock : sequence lock + * seq : sequence number + * Return: 1 to retry operation again, 0 to continue + */ +static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq) +{ + if (!(*seq & 1)) { /* Even */ + rcu_read_unlock(); + if (read_seqretry(lock, *seq)) { + (*seq)++; /* Take writer lock */ + return 1; + } + } else /* Odd */ + write_sequnlock(lock); + return 0; +} + /* * This is the single most critical data structure when it comes * to the dcache: the hashtable for lookups. Somebody should try @@ -2647,9 +2685,39 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen) return 0; } +/** + * prepend_name - prepend a pathname in front of current buffer pointer + * buffer: buffer pointer + * buflen: allocated length of the buffer + * name: name string and length qstr structure + * + * With RCU path tracing, it may race with d_move(). Use ACCESS_ONCE() to + * make sure that either the old or the new name pointer and length are + * fetched. However, there may be mismatch between length and pointer. + * The length cannot be trusted, we need to copy it byte-by-byte until + * the length is reached or a null byte is found. It also prepends "/" at + * the beginning of the name. The sequence number check at the caller will + * retry it again when a d_move() does happen. So any garbage in the buffer + * due to mismatched pointer and length will be discarded. + */ static int prepend_name(char **buffer, int *buflen, struct qstr *name) { - return prepend(buffer, buflen, name->name, name->len); + const char *dname = ACCESS_ONCE(name->name); + u32 dlen = ACCESS_ONCE(name->len); + char *p; + + if (*buflen < dlen + 1) + return -ENAMETOOLONG; + *buflen -= dlen + 1; + p = *buffer -= dlen + 1; + *p++ = '/'; + while (dlen--) { + char c = *dname++; + if (!c) + break; + *p++ = c; + } + return 0; } /** @@ -2659,7 +2727,14 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name) * @buffer: pointer to the end of the buffer * @buflen: pointer to buffer length * - * Caller holds the rename_lock. + * The function tries to write out the pathname without taking any lock other + * than the RCU read lock to make sure that dentries won't go away. It only + * checks the sequence number of the global rename_lock as any change in the + * dentry's d_seq will be preceded by changes in the rename_lock sequence + * number. If the sequence number had been change, it will restart the whole + * pathname back-tracing sequence again. It performs a total of 3 trials of + * lockless back-tracing sequences before falling back to take the + * rename_lock. */ static int prepend_path(const struct path *path, const struct path *root, @@ -2668,54 +2743,60 @@ static int prepend_path(const struct path *path, struct dentry *dentry = path->dentry; struct vfsmount *vfsmnt = path->mnt; struct mount *mnt = real_mount(vfsmnt); - bool slash = false; int error = 0; + unsigned seq = 0; + char *bptr; + int blen; +restart: + bptr = *buffer; + blen = *buflen; + read_seqbegin_or_lock(&rename_lock, &seq); while (dentry != root->dentry || vfsmnt != root->mnt) { struct dentry * parent; if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) { /* Global root? */ - if (!mnt_has_parent(mnt)) - goto global_root; - dentry = mnt->mnt_mountpoint; - mnt = mnt->mnt_parent; - vfsmnt = &mnt->mnt; - continue; + if (mnt_has_parent(mnt)) { + dentry = mnt->mnt_mountpoint; + mnt = mnt->mnt_parent; + vfsmnt = &mnt->mnt; + continue; + } + /* + * Filesystems needing to implement special "root names" + * should do so with ->d_dname() + */ + if (IS_ROOT(dentry) && + (dentry->d_name.len != 1 || + dentry->d_name.name[0] != '/')) { + WARN(1, "Root dentry has weird name <%.*s>\n", + (int) dentry->d_name.len, + dentry->d_name.name); + } + if (!error) + error = is_mounted(vfsmnt) ? 1 : 2; + break; } parent = dentry->d_parent; prefetch(parent); - spin_lock(&dentry->d_lock); - error = prepend_name(buffer, buflen, &dentry->d_name); - spin_unlock(&dentry->d_lock); - if (!error) - error = prepend(buffer, buflen, "/", 1); + error = prepend_name(&bptr, &blen, &dentry->d_name); if (error) break; - slash = true; dentry = parent; } + if (read_seqretry_or_unlock(&rename_lock, &seq)) + goto restart; - if (!error && !slash) - error = prepend(buffer, buflen, "/", 1); - - return error; - -global_root: - /* - * Filesystems needing to implement special "root names" - * should do so with ->d_dname() - */ - if (IS_ROOT(dentry) && - (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) { - WARN(1, "Root dentry has weird name <%.*s>\n", - (int) dentry->d_name.len, dentry->d_name.name); - } - if (!slash) - error = prepend(buffer, buflen, "/", 1); - if (!error) - error = is_mounted(vfsmnt) ? 1 : 2; + if (error >= 0 && bptr == *buffer) { + if (--blen < 0) + error = -ENAMETOOLONG; + else + *--bptr = '/'; + } + *buffer = bptr; + *buflen = blen; return error; } @@ -2744,9 +2825,7 @@ char *__d_path(const struct path *path, prepend(&res, &buflen, "\0", 1); br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); error = prepend_path(path, root, &res, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error < 0) @@ -2765,9 +2844,7 @@ char *d_absolute_path(const struct path *path, prepend(&res, &buflen, "\0", 1); br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); error = prepend_path(path, &root, &res, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error > 1) @@ -2833,9 +2910,7 @@ char *d_path(const struct path *path, char *buf, int buflen) get_fs_root(current->fs, &root); br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); error = path_with_deleted(path, &root, &res, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error < 0) res = ERR_PTR(error); @@ -2870,10 +2945,10 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen) char *end = buffer + buflen; /* these dentries are never renamed, so d_lock is not needed */ if (prepend(&end, &buflen, " (deleted)", 11) || - prepend_name(&end, &buflen, &dentry->d_name) || + prepend(&end, &buflen, dentry->d_name.name, dentry->d_name.len) || prepend(&end, &buflen, "/", 1)) end = ERR_PTR(-ENAMETOOLONG); - return end; + return end; } /* @@ -2881,30 +2956,36 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen) */ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen) { - char *end = buf + buflen; - char *retval; + char *end, *retval; + int len, seq = 0; + int error = 0; - prepend(&end, &buflen, "\0", 1); +restart: + end = buf + buflen; + len = buflen; + prepend(&end, &len, "\0", 1); if (buflen < 1) goto Elong; /* Get '/' right */ retval = end-1; *retval = '/'; - + read_seqbegin_or_lock(&rename_lock, &seq); while (!IS_ROOT(dentry)) { struct dentry *parent = dentry->d_parent; int error; prefetch(parent); - spin_lock(&dentry->d_lock); - error = prepend_name(&end, &buflen, &dentry->d_name); - spin_unlock(&dentry->d_lock); - if (error != 0 || prepend(&end, &buflen, "/", 1) != 0) - goto Elong; + error = prepend_name(&end, &len, &dentry->d_name); + if (error) + break; retval = end; dentry = parent; } + if (read_seqretry_or_unlock(&rename_lock, &seq)) + goto restart; + if (error) + goto Elong; return retval; Elong: return ERR_PTR(-ENAMETOOLONG); @@ -2912,13 +2993,7 @@ Elong: char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen) { - char *retval; - - write_seqlock(&rename_lock); - retval = __dentry_path(dentry, buf, buflen); - write_sequnlock(&rename_lock); - - return retval; + return __dentry_path(dentry, buf, buflen); } EXPORT_SYMBOL(dentry_path_raw); @@ -2927,7 +3002,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen) char *p = NULL; char *retval; - write_seqlock(&rename_lock); if (d_unlinked(dentry)) { p = buf + buflen; if (prepend(&p, &buflen, "//deleted", 10) != 0) @@ -2935,7 +3009,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen) buflen++; } retval = __dentry_path(dentry, buf, buflen); - write_sequnlock(&rename_lock); if (!IS_ERR(retval) && p) *p = '/'; /* restore '/' overriden with '\0' */ return retval; @@ -2974,7 +3047,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size) error = -ENOENT; br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); if (!d_unlinked(pwd.dentry)) { unsigned long len; char *cwd = page + PAGE_SIZE; @@ -2982,7 +3054,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size) prepend(&cwd, &buflen, "\0", 1); error = prepend_path(&pwd, &root, &cwd, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error < 0) @@ -3003,7 +3074,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size) error = -EFAULT; } } else { - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); } -- 1.7.1 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 16:18 ` [PATCH v4 1/1] " Waiman Long @ 2013-09-09 17:29 ` Al Viro 2013-09-09 17:45 ` Linus Torvalds 2013-09-09 17:55 ` Waiman Long 0 siblings, 2 replies; 25+ messages in thread From: Al Viro @ 2013-09-09 17:29 UTC (permalink / raw) To: Waiman Long Cc: Linus Torvalds, linux-fsdevel, linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Mon, Sep 09, 2013 at 12:18:13PM -0400, Waiman Long wrote: > +/** > + * read_seqbegin_or_lock - begin a sequence number check or locking block > + * lock: sequence lock > + * seq : sequence number to be checked > + * > + * First try it once optimistically without taking the lock. If that fails, > + * take the lock. The sequence number is also used as a marker for deciding > + * whether to be a reader (even) or writer (odd). > + * N.B. seq must be initialized to an even number to begin with. > + */ > +static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq) > +{ > + if (!(*seq & 1)) { /* Even */ > + *seq = read_seqbegin(lock); > + rcu_read_lock(); > + } else /* Odd */ > + write_seqlock(lock); > +} > +static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq) > +{ > + if (!(*seq & 1)) { /* Even */ > + rcu_read_unlock(); > + if (read_seqretry(lock, *seq)) { > + (*seq)++; /* Take writer lock */ > + return 1; > + } > + } else /* Odd */ > + write_sequnlock(lock); > + return 0; > +} I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends can do that themselves just fine (it needs to be taken when seq is even), and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with this variant. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 17:29 ` Al Viro @ 2013-09-09 17:45 ` Linus Torvalds 2013-09-09 17:56 ` Waiman Long 2013-09-09 18:06 ` Al Viro 2013-09-09 17:55 ` Waiman Long 1 sibling, 2 replies; 25+ messages in thread From: Linus Torvalds @ 2013-09-09 17:45 UTC (permalink / raw) To: Al Viro Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Mon, Sep 9, 2013 at 10:29 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends > can do that themselves just fine (it needs to be taken when seq is even), > and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with > this variant. Hmm.. I think you need the RCU read lock even when you get the write_seqlock(). Yes, getting the seqlock for write implies that you get a spinlock and in many normal circumstances that basically is equvalent to being rcu-locked, but afaik in some configurations that is *not* sufficient protection against an RCU grace period on another CPU. You need to do a real rcu_read_lock that increments that whole rcu_read_lock_nesting level, which a spinlock won't do. And while the rename sequence lock protects against _renames_, it does not protect against just plain dentries getting free'd under memory pressure. So I think the RCU-readlockness really needs to be independent of the sequence lock. Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 17:45 ` Linus Torvalds @ 2013-09-09 17:56 ` Waiman Long 2013-09-09 18:06 ` Al Viro 1 sibling, 0 replies; 25+ messages in thread From: Waiman Long @ 2013-09-09 17:56 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On 09/09/2013 01:45 PM, Linus Torvalds wrote: > On Mon, Sep 9, 2013 at 10:29 AM, Al Viro<viro@zeniv.linux.org.uk> wrote: >> I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends >> can do that themselves just fine (it needs to be taken when seq is even), >> and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with >> this variant. > Hmm.. I think you need the RCU read lock even when you get the write_seqlock(). > > Yes, getting the seqlock for write implies that you get a spinlock and > in many normal circumstances that basically is equvalent to being > rcu-locked, but afaik in some configurations that is *not* sufficient > protection against an RCU grace period on another CPU. You need to do > a real rcu_read_lock that increments that whole rcu_read_lock_nesting > level, which a spinlock won't do. > > And while the rename sequence lock protects against _renames_, it does > not protect against just plain dentries getting free'd under memory > pressure. > > So I think the RCU-readlockness really needs to be independent of the > sequence lock. > > Linus Yes, you are right. It will be safer to take rcu_read_lock() even if we are taking the rename_lock. It wasn't needed before as d_lock was taken. Will update the patch to take rcu_read_lock() out to reflect that. Regards, Longman ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 17:45 ` Linus Torvalds 2013-09-09 17:56 ` Waiman Long @ 2013-09-09 18:06 ` Al Viro 2013-09-09 18:15 ` Linus Torvalds 2013-09-09 18:21 ` Al Viro 1 sibling, 2 replies; 25+ messages in thread From: Al Viro @ 2013-09-09 18:06 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Mon, Sep 09, 2013 at 10:45:38AM -0700, Linus Torvalds wrote: > On Mon, Sep 9, 2013 at 10:29 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends > > can do that themselves just fine (it needs to be taken when seq is even), > > and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with > > this variant. > > Hmm.. I think you need the RCU read lock even when you get the write_seqlock(). > > Yes, getting the seqlock for write implies that you get a spinlock and > in many normal circumstances that basically is equvalent to being > rcu-locked, but afaik in some configurations that is *not* sufficient > protection against an RCU grace period on another CPU. You need to do > a real rcu_read_lock that increments that whole rcu_read_lock_nesting > level, which a spinlock won't do. > > And while the rename sequence lock protects against _renames_, it does > not protect against just plain dentries getting free'd under memory > pressure. It protects the chain of ->d_parent, so they'd better not get freeds at all... > So I think the RCU-readlockness really needs to be independent of the > sequence lock. Actually, now that I've tried to convert d_walk() to those guys, I think I like my proposal for the set of primitives better: static inline bool seqretry_and_lock(seqlock_t *lock, unsigned *seq): { if ((*seq & 1) || !read_seqretry(lock, *seq)) return true; *seq |= 1; write_seqlock(lock); return false; } static inline void seqretry_done(seqlock_t *lock, unsigned seq) { if (seq & 1) write_sequnlock(lock); } with the prepend_path() and friends becoming rcu_read_lock(); seq = read_seqbegin(&rename_lock); again: .... if (!seqretry_and_lock(&rename_lock, seq)) goto again; /* now as writer */ seqretry_done(&rename_lock, seq); rcu_read_unlock(); The thing is, d_walk() does essentially seq = read_seqbegin(&rename_lock); again: .... spin_lock(&d->d_lock); if (!seqretry_and_lock(&rename_lock, seq)) { spin_unlock(&d->d_lock); goto again; /* now as writer */ } /* now we are holding ->d_lock on it and we know * that d has not gone stale until that point. */ do stuff with d spin_unlock(&d->d_lock); seqretry_done(&rename_lock, seq); OTOH, it's not impossible to handle with Waiman's primitives, just more massage to do that... ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 18:06 ` Al Viro @ 2013-09-09 18:15 ` Linus Torvalds 2013-09-09 18:21 ` Al Viro 1 sibling, 0 replies; 25+ messages in thread From: Linus Torvalds @ 2013-09-09 18:15 UTC (permalink / raw) To: Al Viro Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Mon, Sep 9, 2013 at 11:06 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> >> And while the rename sequence lock protects against _renames_, it does >> not protect against just plain dentries getting free'd under memory >> pressure. > > It protects the chain of ->d_parent, so they'd better not get freeds at > all... Yeah, I think you're right because of the other constraints.. Holding just the rename-lock for writing is actually sufficient, because (a) you are guaranteed to already hold on to the end-point of the walk (it's where you begin your path lookup), and so the memory pressure issue is actually irrelevant. (b) the only dentry lookup thing that remains is the parenthood chain which is recursively reference-count-protected from the end, and yes, in the absense of normal memory pressure, renames are the only thing that will mess with that. So even without holding d_lock, I guess we're actually safe even without a real rcu read lock if we hold the rename lock for writing. That actually argues for doing the rcu_read_lock() inside the helper function. HOWEVER, I'd like a comment to that effect, to show why we can access all those dentries without any other synchronization. Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 18:06 ` Al Viro 2013-09-09 18:15 ` Linus Torvalds @ 2013-09-09 18:21 ` Al Viro 2013-09-09 18:36 ` Al Viro 2013-09-10 0:40 ` George Spelvin 1 sibling, 2 replies; 25+ messages in thread From: Al Viro @ 2013-09-09 18:21 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Mon, Sep 09, 2013 at 07:06:04PM +0100, Al Viro wrote: > I like my proposal for the set of primitives better: > > static inline bool seqretry_and_lock(seqlock_t *lock, unsigned *seq): > { > if ((*seq & 1) || !read_seqretry(lock, *seq)) > return true; > *seq |= 1; > write_seqlock(lock); > return false; > } > > static inline void seqretry_done(seqlock_t *lock, unsigned seq) > { > if (seq & 1) > write_sequnlock(lock); > } > > with the prepend_path() and friends becoming > > rcu_read_lock(); > seq = read_seqbegin(&rename_lock); > again: > .... > if (!seqretry_and_lock(&rename_lock, seq)) > goto again; /* now as writer */ > seqretry_done(&rename_lock, seq); > rcu_read_unlock(); Actually, it's better for prepend_path() as well, because it's actually rcu_read_lock(); seq = read_seqbegin(&rename_lock); again: .... if (error) goto done; .... if (!seqretry_and_lock(&rename_lock, seq)) goto again; /* now as writer */ done: seqretry_done(&rename_lock, seq); rcu_read_unlock(); Posted variant will sometimes hit the following path: * seq_readlock() * start generating the output * hit an error [another process has taken and released rename_lock for some reason] * hit read_seqretry_and_unlock(), which returns 1. * retry everything with seq_writelock(), despite the error. It's not too horrible (we won't be looping indefinitely, ignoring error all along), but it's certainly subtle enough... ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 18:21 ` Al Viro @ 2013-09-09 18:36 ` Al Viro 2013-09-09 18:46 ` Al Viro 2013-09-09 18:46 ` Waiman Long 2013-09-10 0:40 ` George Spelvin 1 sibling, 2 replies; 25+ messages in thread From: Al Viro @ 2013-09-09 18:36 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Mon, Sep 09, 2013 at 07:21:11PM +0100, Al Viro wrote: > Actually, it's better for prepend_path() as well, because it's actually > > rcu_read_lock(); > seq = read_seqbegin(&rename_lock); > again: > .... > if (error) > goto done; > .... > if (!seqretry_and_lock(&rename_lock, seq)) > goto again; /* now as writer */ > done: > seqretry_done(&rename_lock, seq); > rcu_read_unlock(); > > Posted variant will sometimes hit the following path: > * seq_readlock() > * start generating the output > * hit an error > [another process has taken and released rename_lock for some reason] > * hit read_seqretry_and_unlock(), which returns 1. > * retry everything with seq_writelock(), despite the error. > > It's not too horrible (we won't be looping indefinitely, ignoring error > all along), but it's certainly subtle enough... FWIW, what I propose is this (just the d_path-related parts): diff --git a/fs/dcache.c b/fs/dcache.c index 761e31b..b963605 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -88,6 +88,21 @@ EXPORT_SYMBOL(rename_lock); static struct kmem_cache *dentry_cache __read_mostly; +static inline bool seqretry_and_lock(seqlock_t *lock, unsigned *seq) +{ + if ((*seq & 1) || !read_seqretry(lock, *seq)) + return true; + *seq |= 1; + write_seqlock(lock); + return false; +} + +static inline void seqretry_done(seqlock_t *lock, unsigned seq) +{ + if (seq & 1) + write_sequnlock(lock); +} + /* * This is the single most critical data structure when it comes * to the dcache: the hashtable for lookups. Somebody should try @@ -2644,9 +2659,39 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen) return 0; } +/** + * prepend_name - prepend a pathname in front of current buffer pointer + * buffer: buffer pointer + * buflen: allocated length of the buffer + * name: name string and length qstr structure + * + * With RCU path tracing, it may race with d_move(). Use ACCESS_ONCE() to + * make sure that either the old or the new name pointer and length are + * fetched. However, there may be mismatch between length and pointer. + * The length cannot be trusted, we need to copy it byte-by-byte until + * the length is reached or a null byte is found. It also prepends "/" at + * the beginning of the name. The sequence number check at the caller will + * retry it again when a d_move() does happen. So any garbage in the buffer + * due to mismatched pointer and length will be discarded. + */ static int prepend_name(char **buffer, int *buflen, struct qstr *name) { - return prepend(buffer, buflen, name->name, name->len); + const char *dname = ACCESS_ONCE(name->name); + u32 dlen = ACCESS_ONCE(name->len); + char *p; + + if (*buflen < dlen + 1) + return -ENAMETOOLONG; + *buflen -= dlen + 1; + p = *buffer -= dlen + 1; + *p++ = '/'; + while (dlen--) { + char c = *dname++; + if (!c) + break; + *p++ = c; + } + return 0; } /** @@ -2656,7 +2701,14 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name) * @buffer: pointer to the end of the buffer * @buflen: pointer to buffer length * - * Caller holds the rename_lock. + * The function tries to write out the pathname without taking any lock other + * than the RCU read lock to make sure that dentries won't go away. It only + * checks the sequence number of the global rename_lock as any change in the + * dentry's d_seq will be preceded by changes in the rename_lock sequence + * number. If the sequence number had been change, it will restart the whole + * pathname back-tracing sequence again. It performs a total of 3 trials of + * lockless back-tracing sequences before falling back to take the + * rename_lock. */ static int prepend_path(const struct path *path, const struct path *root, @@ -2665,54 +2717,64 @@ static int prepend_path(const struct path *path, struct dentry *dentry = path->dentry; struct vfsmount *vfsmnt = path->mnt; struct mount *mnt = real_mount(vfsmnt); - bool slash = false; int error = 0; + unsigned seq; + char *bptr; + int blen; + rcu_read_lock(); + seq = read_seqbegin(&rename_lock); +restart: + bptr = *buffer; + blen = *buflen; while (dentry != root->dentry || vfsmnt != root->mnt) { struct dentry * parent; if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) { /* Global root? */ - if (!mnt_has_parent(mnt)) - goto global_root; - dentry = mnt->mnt_mountpoint; - mnt = mnt->mnt_parent; - vfsmnt = &mnt->mnt; - continue; + if (mnt_has_parent(mnt)) { + dentry = mnt->mnt_mountpoint; + mnt = mnt->mnt_parent; + vfsmnt = &mnt->mnt; + continue; + } + /* + * Filesystems needing to implement special "root names" + * should do so with ->d_dname() + */ + if (IS_ROOT(dentry) && + (dentry->d_name.len != 1 || + dentry->d_name.name[0] != '/')) { + WARN(1, "Root dentry has weird name <%.*s>\n", + (int) dentry->d_name.len, + dentry->d_name.name); + } + if (!error) + error = is_mounted(vfsmnt) ? 1 : 2; + break; } parent = dentry->d_parent; prefetch(parent); - spin_lock(&dentry->d_lock); - error = prepend_name(buffer, buflen, &dentry->d_name); - spin_unlock(&dentry->d_lock); - if (!error) - error = prepend(buffer, buflen, "/", 1); + error = prepend_name(&bptr, &blen, &dentry->d_name); if (error) break; - slash = true; dentry = parent; } + if (error >= 0 && !seqretry_and_lock(&rename_lock, &seq)) + goto restart; - if (!error && !slash) - error = prepend(buffer, buflen, "/", 1); - - return error; + seqretry_done(&rename_lock, seq); + rcu_read_unlock(); -global_root: - /* - * Filesystems needing to implement special "root names" - * should do so with ->d_dname() - */ - if (IS_ROOT(dentry) && - (dentry->d_name.len != 1 || dentry->d_name.name[0] != '/')) { - WARN(1, "Root dentry has weird name <%.*s>\n", - (int) dentry->d_name.len, dentry->d_name.name); - } - if (!slash) - error = prepend(buffer, buflen, "/", 1); - if (!error) - error = is_mounted(vfsmnt) ? 1 : 2; + if (error >= 0 && bptr == *buffer) { + if (--blen < 0) + error = -ENAMETOOLONG; + else + *--bptr = '/'; + } + *buffer = bptr; + *buflen = blen; return error; } @@ -2741,9 +2803,7 @@ char *__d_path(const struct path *path, prepend(&res, &buflen, "\0", 1); br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); error = prepend_path(path, root, &res, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error < 0) @@ -2762,9 +2822,7 @@ char *d_absolute_path(const struct path *path, prepend(&res, &buflen, "\0", 1); br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); error = prepend_path(path, &root, &res, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error > 1) @@ -2830,9 +2888,7 @@ char *d_path(const struct path *path, char *buf, int buflen) get_fs_root(current->fs, &root); br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); error = path_with_deleted(path, &root, &res, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error < 0) res = ERR_PTR(error); @@ -2867,10 +2923,10 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen) char *end = buffer + buflen; /* these dentries are never renamed, so d_lock is not needed */ if (prepend(&end, &buflen, " (deleted)", 11) || - prepend_name(&end, &buflen, &dentry->d_name) || + prepend(&end, &buflen, dentry->d_name.name, dentry->d_name.len) || prepend(&end, &buflen, "/", 1)) end = ERR_PTR(-ENAMETOOLONG); - return end; + return end; } /* @@ -2878,30 +2934,40 @@ char *simple_dname(struct dentry *dentry, char *buffer, int buflen) */ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen) { - char *end = buf + buflen; - char *retval; + char *end, *retval; + int len, seq; + int error = 0; - prepend(&end, &buflen, "\0", 1); + rcu_read_lock(); + seq = read_seqbegin(&rename_lock); +restart: + end = buf + buflen; + len = buflen; + prepend(&end, &len, "\0", 1); if (buflen < 1) goto Elong; /* Get '/' right */ retval = end-1; *retval = '/'; - while (!IS_ROOT(dentry)) { struct dentry *parent = dentry->d_parent; int error; prefetch(parent); - spin_lock(&dentry->d_lock); - error = prepend_name(&end, &buflen, &dentry->d_name); - spin_unlock(&dentry->d_lock); - if (error != 0 || prepend(&end, &buflen, "/", 1) != 0) - goto Elong; + error = prepend_name(&end, &len, &dentry->d_name); + if (error) + goto done; retval = end; dentry = parent; } + if (!seqretry_and_lock(&rename_lock, &seq)) + goto restart; +done: + seqretry_done(&rename_lock, seq); + rcu_read_unlock(); + if (error) + goto Elong; return retval; Elong: return ERR_PTR(-ENAMETOOLONG); @@ -2909,13 +2975,7 @@ Elong: char *dentry_path_raw(struct dentry *dentry, char *buf, int buflen) { - char *retval; - - write_seqlock(&rename_lock); - retval = __dentry_path(dentry, buf, buflen); - write_sequnlock(&rename_lock); - - return retval; + return __dentry_path(dentry, buf, buflen); } EXPORT_SYMBOL(dentry_path_raw); @@ -2924,7 +2984,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen) char *p = NULL; char *retval; - write_seqlock(&rename_lock); if (d_unlinked(dentry)) { p = buf + buflen; if (prepend(&p, &buflen, "//deleted", 10) != 0) @@ -2932,7 +2991,6 @@ char *dentry_path(struct dentry *dentry, char *buf, int buflen) buflen++; } retval = __dentry_path(dentry, buf, buflen); - write_sequnlock(&rename_lock); if (!IS_ERR(retval) && p) *p = '/'; /* restore '/' overriden with '\0' */ return retval; @@ -2971,7 +3029,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size) error = -ENOENT; br_read_lock(&vfsmount_lock); - write_seqlock(&rename_lock); if (!d_unlinked(pwd.dentry)) { unsigned long len; char *cwd = page + PAGE_SIZE; @@ -2979,7 +3036,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size) prepend(&cwd, &buflen, "\0", 1); error = prepend_path(&pwd, &root, &cwd, &buflen); - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); if (error < 0) @@ -3000,7 +3056,6 @@ SYSCALL_DEFINE2(getcwd, char __user *, buf, unsigned long, size) error = -EFAULT; } } else { - write_sequnlock(&rename_lock); br_read_unlock(&vfsmount_lock); } ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 18:36 ` Al Viro @ 2013-09-09 18:46 ` Al Viro 2013-09-09 18:46 ` Waiman Long 1 sibling, 0 replies; 25+ messages in thread From: Al Viro @ 2013-09-09 18:46 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Mon, Sep 09, 2013 at 07:36:08PM +0100, Al Viro wrote: > FWIW, what I propose is this (just the d_path-related parts): Grrr... It's still the wrong set of primitives ;-/ Nevermind that one... ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 18:36 ` Al Viro 2013-09-09 18:46 ` Al Viro @ 2013-09-09 18:46 ` Waiman Long 2013-09-09 19:10 ` Al Viro 1 sibling, 1 reply; 25+ messages in thread From: Waiman Long @ 2013-09-09 18:46 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On 09/09/2013 02:36 PM, Al Viro wrote: > On Mon, Sep 09, 2013 at 07:21:11PM +0100, Al Viro wrote: >> Actually, it's better for prepend_path() as well, because it's actually >> >> rcu_read_lock(); >> seq = read_seqbegin(&rename_lock); >> again: >> .... >> if (error) >> goto done; >> .... >> if (!seqretry_and_lock(&rename_lock, seq)) >> goto again; /* now as writer */ >> done: >> seqretry_done(&rename_lock, seq); >> rcu_read_unlock(); >> >> Posted variant will sometimes hit the following path: >> * seq_readlock() >> * start generating the output >> * hit an error >> [another process has taken and released rename_lock for some reason] >> * hit read_seqretry_and_unlock(), which returns 1. >> * retry everything with seq_writelock(), despite the error. >> >> It's not too horrible (we won't be looping indefinitely, ignoring error >> all along), but it's certainly subtle enough... > FWIW, what I propose is this (just the d_path-related parts): > > I am fine with your proposed change as long as it gets the job done. It doesn't really matter if you do it or I do it. Thank for taking the effort to make it better for us all. -Longman ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 18:46 ` Waiman Long @ 2013-09-09 19:10 ` Al Viro 2013-09-09 19:28 ` Al Viro 0 siblings, 1 reply; 25+ messages in thread From: Al Viro @ 2013-09-09 19:10 UTC (permalink / raw) To: Waiman Long Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Mon, Sep 09, 2013 at 02:46:57PM -0400, Waiman Long wrote: > I am fine with your proposed change as long as it gets the job done. I suspect that the real problem is the unlock part of read_seqretry_or_unlock(); for d_walk() we want to be able to check if we need retry and continue walking if we do not. Let's do it that way: I've applied your patch as is, with the next step being * split read_seqretry_or_unlock(): need_seqretry() (return (!(seq & 1) && read_seqretry(lock, seq)) done_seqretry() (if (seq & 1) write_sequnlock(lock, seq)), your if (read_seqretry_or_unlock(&rename_lock, &seq)) goto restart; becoming if (need_seqretry(&rename_lock, seq)) { seq = 1; goto restart; } done_seqretry(&rename_lock, seq); Then d_walk() is trivially massaged to use of read_seqbegin_or_lock(), need_seqretry() and done_seqretry(). Give me a few, I'll post it... ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 19:10 ` Al Viro @ 2013-09-09 19:28 ` Al Viro 2013-09-09 22:57 ` Waiman Long 0 siblings, 1 reply; 25+ messages in thread From: Al Viro @ 2013-09-09 19:28 UTC (permalink / raw) To: Waiman Long Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Mon, Sep 09, 2013 at 08:10:29PM +0100, Al Viro wrote: > On Mon, Sep 09, 2013 at 02:46:57PM -0400, Waiman Long wrote: > > > I am fine with your proposed change as long as it gets the job done. > > I suspect that the real problem is the unlock part of read_seqretry_or_unlock(); > for d_walk() we want to be able to check if we need retry and continue walking > if we do not. Let's do it that way: I've applied your patch as is, with the > next step being > * split read_seqretry_or_unlock(): > need_seqretry() (return (!(seq & 1) && read_seqretry(lock, seq)) > done_seqretry() (if (seq & 1) write_sequnlock(lock, seq)), > your if (read_seqretry_or_unlock(&rename_lock, &seq)) > goto restart; > becoming > if (need_seqretry(&rename_lock, seq)) { > seq = 1; > goto restart; > } > done_seqretry(&rename_lock, seq); > > Then d_walk() is trivially massaged to use of read_seqbegin_or_lock(), > need_seqretry() and done_seqretry(). Give me a few, I'll post it... OK, how about this? It splits read_seqretry_or_unlock(), takes rcu_read_{lock,unlock} in the callers and converts d_walk() to those primitives. I've pushed that and your commit into vfs.git#experimental (head at 48f5ec2, should propagate in a few); guys, please give it a look and comment. diff --git a/fs/dcache.c b/fs/dcache.c index 38b1b09..b9caf47 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -100,30 +100,21 @@ static struct kmem_cache *dentry_cache __read_mostly; */ static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq) { - if (!(*seq & 1)) { /* Even */ + if (!(*seq & 1)) /* Even */ *seq = read_seqbegin(lock); - rcu_read_lock(); - } else /* Odd */ + else /* Odd */ write_seqlock(lock); } -/** - * read_seqretry_or_unlock - end a seqretry or lock block & return retry status - * lock : sequence lock - * seq : sequence number - * Return: 1 to retry operation again, 0 to continue - */ -static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq) +static inline int need_seqretry(seqlock_t *lock, int seq) { - if (!(*seq & 1)) { /* Even */ - rcu_read_unlock(); - if (read_seqretry(lock, *seq)) { - (*seq)++; /* Take writer lock */ - return 1; - } - } else /* Odd */ + return !(seq & 1) && read_seqretry(lock, seq); +} + +static inline void done_seqretry(seqlock_t *lock, int seq) +{ + if (seq & 1) write_sequnlock(lock); - return 0; } /* @@ -1047,7 +1038,7 @@ void shrink_dcache_for_umount(struct super_block *sb) * the parenthood after dropping the lock and check * that the sequence number still matches. */ -static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq) +static struct dentry *try_to_ascend(struct dentry *old, unsigned seq) { struct dentry *new = old->d_parent; @@ -1061,7 +1052,7 @@ static struct dentry *try_to_ascend(struct dentry *old, int locked, unsigned seq */ if (new != old->d_parent || (old->d_flags & DCACHE_DENTRY_KILLED) || - (!locked && read_seqretry(&rename_lock, seq))) { + need_seqretry(&rename_lock, seq)) { spin_unlock(&new->d_lock); new = NULL; } @@ -1098,13 +1089,12 @@ static void d_walk(struct dentry *parent, void *data, { struct dentry *this_parent; struct list_head *next; - unsigned seq; - int locked = 0; + unsigned seq = 0; enum d_walk_ret ret; bool retry = true; - seq = read_seqbegin(&rename_lock); again: + read_seqbegin_or_lock(&rename_lock, &seq); this_parent = parent; spin_lock(&this_parent->d_lock); @@ -1158,13 +1148,13 @@ resume: */ if (this_parent != parent) { struct dentry *child = this_parent; - this_parent = try_to_ascend(this_parent, locked, seq); + this_parent = try_to_ascend(this_parent, seq); if (!this_parent) goto rename_retry; next = child->d_u.d_child.next; goto resume; } - if (!locked && read_seqretry(&rename_lock, seq)) { + if (need_seqretry(&rename_lock, seq)) { spin_unlock(&this_parent->d_lock); goto rename_retry; } @@ -1173,17 +1163,13 @@ resume: out_unlock: spin_unlock(&this_parent->d_lock); - if (locked) - write_sequnlock(&rename_lock); + done_seqretry(&rename_lock, seq); return; rename_retry: if (!retry) return; - if (locked) - goto again; - locked = 1; - write_seqlock(&rename_lock); + seq = 1; goto again; } @@ -2745,6 +2731,7 @@ static int prepend_path(const struct path *path, char *bptr; int blen; + rcu_read_lock(); restart: bptr = *buffer; blen = *buflen; @@ -2783,8 +2770,13 @@ restart: dentry = parent; } - if (read_seqretry_or_unlock(&rename_lock, &seq)) + if (!(seq & 1)) + rcu_read_unlock(); + if (need_seqretry(&rename_lock, seq)) { + seq = 1; goto restart; + } + done_seqretry(&rename_lock, seq); if (error >= 0 && bptr == *buffer) { if (--blen < 0) @@ -2957,6 +2949,7 @@ static char *__dentry_path(struct dentry *dentry, char *buf, int buflen) int len, seq = 0; int error = 0; + rcu_read_lock(); restart: end = buf + buflen; len = buflen; @@ -2979,8 +2972,13 @@ restart: retval = end; dentry = parent; } - if (read_seqretry_or_unlock(&rename_lock, &seq)) + if (!(seq & 1)) + rcu_read_unlock(); + if (need_seqretry(&rename_lock, seq)) { + seq = 1; goto restart; + } + done_seqretry(&rename_lock, seq); if (error) goto Elong; return retval; ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 19:28 ` Al Viro @ 2013-09-09 22:57 ` Waiman Long 0 siblings, 0 replies; 25+ messages in thread From: Waiman Long @ 2013-09-09 22:57 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, linux-fsdevel, Linux Kernel Mailing List, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On 09/09/2013 03:28 PM, Al Viro wrote: > On Mon, Sep 09, 2013 at 08:10:29PM +0100, Al Viro wrote: >> On Mon, Sep 09, 2013 at 02:46:57PM -0400, Waiman Long wrote: >> >>> I am fine with your proposed change as long as it gets the job done. >> I suspect that the real problem is the unlock part of read_seqretry_or_unlock(); >> for d_walk() we want to be able to check if we need retry and continue walking >> if we do not. Let's do it that way: I've applied your patch as is, with the >> next step being >> * split read_seqretry_or_unlock(): >> need_seqretry() (return (!(seq& 1)&& read_seqretry(lock, seq)) >> done_seqretry() (if (seq& 1) write_sequnlock(lock, seq)), >> your if (read_seqretry_or_unlock(&rename_lock,&seq)) >> goto restart; >> becoming >> if (need_seqretry(&rename_lock, seq)) { >> seq = 1; >> goto restart; >> } >> done_seqretry(&rename_lock, seq); >> >> Then d_walk() is trivially massaged to use of read_seqbegin_or_lock(), >> need_seqretry() and done_seqretry(). Give me a few, I'll post it... > OK, how about this? It splits read_seqretry_or_unlock(), takes > rcu_read_{lock,unlock} in the callers and converts d_walk() to those > primitives. I've pushed that and your commit into vfs.git#experimental > (head at 48f5ec2, should propagate in a few); guys, please give it a look > and comment. The changes look good to me. I was planning to take rcu_read_lock() out and doing something similar, but your change is good. BTW, I think Linus want to add some comments on why RCU lock is needed without the rename_lock, but I can put that in with a follow-up patch once the current change is merged. Thank for your help and inspiration on this patch. -Longman ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 18:21 ` Al Viro 2013-09-09 18:36 ` Al Viro @ 2013-09-10 0:40 ` George Spelvin 2013-09-10 0:57 ` Al Viro 2013-09-10 3:57 ` Waiman Long 1 sibling, 2 replies; 25+ messages in thread From: George Spelvin @ 2013-09-10 0:40 UTC (permalink / raw) To: torvalds, viro Cc: aswin, john, linux-fsdevel, linux-kernel, linux, scott.norton, Waiman.Long I'm really wondering about only trying once before taking the write lock. Yes, using the lsbit is a cute hack, but are we using it for its cuteness rather than its effectiveness? Renames happen occasionally. If that causes all the current pathname translations to fall back to the write lock, that is fairly heavy. Worse, all of those translations will (unnecessarily) bump the write seqcount, triggering *other* translations to fail back to the write-lock path. One patch to fix this would be to have the fallback read algorithm take sl->lock but *not* touch sl->seqcount, so it wouldn't break concurrent readers. But another is to simply retry at least once (two attempts) on the non-exclusive path before falling back to the exclusive one, This means that the count lsbit is no longer enough space for a retry counter, but oh, well. (If you really want to use one word, perhaps a better heuristic as to how to retry would be to examine the *number* of writes to the seqlock during the read. If there was only one, there's a fair chance that another read will succeed. If there was more than one (i.e. the seqlock has incremented by 3 or more), then forcing the writers to stop is probably necessary.) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-10 0:40 ` George Spelvin @ 2013-09-10 0:57 ` Al Viro 2013-09-10 1:15 ` Ramkumar Ramachandra 2013-09-10 8:24 ` George Spelvin 2013-09-10 3:57 ` Waiman Long 1 sibling, 2 replies; 25+ messages in thread From: Al Viro @ 2013-09-10 0:57 UTC (permalink / raw) To: George Spelvin Cc: torvalds, aswin, john, linux-fsdevel, linux-kernel, scott.norton, Waiman.Long On Mon, Sep 09, 2013 at 08:40:20PM -0400, George Spelvin wrote: > I'm really wondering about only trying once before taking the write lock. > Yes, using the lsbit is a cute hack, but are we using it for its cuteness > rather than its effectiveness? > > Renames happen occasionally. If that causes all the current pathname > translations to fall back to the write lock, that is fairly heavy. > Worse, all of those translations will (unnecessarily) bump the write > seqcount, triggering *other* translations to fail back to the write-lock > path. _What_ "pathname translations"? Pathname resolution doesn't fall back to seq_writelock() at all. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-10 0:57 ` Al Viro @ 2013-09-10 1:15 ` Ramkumar Ramachandra 2013-09-10 1:34 ` Linus Torvalds 2013-09-10 8:24 ` George Spelvin 1 sibling, 1 reply; 25+ messages in thread From: Ramkumar Ramachandra @ 2013-09-10 1:15 UTC (permalink / raw) To: Al Viro Cc: George Spelvin, Linus Torvalds, Chandramouleeswaran, Aswin, john, linux-fsdevel, LKML, Norton, Scott J, Waiman Long Al Viro wrote: > _What_ "pathname translations"? Pathname resolution doesn't fall back to > seq_writelock() at all. Maybe it should then? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-10 1:15 ` Ramkumar Ramachandra @ 2013-09-10 1:34 ` Linus Torvalds 2013-09-10 2:25 ` Al Viro 2013-09-10 3:12 ` Ramkumar Ramachandra 0 siblings, 2 replies; 25+ messages in thread From: Linus Torvalds @ 2013-09-10 1:34 UTC (permalink / raw) To: Ramkumar Ramachandra Cc: Al Viro, George Spelvin, Chandramouleeswaran, Aswin, John Stoffel, linux-fsdevel, LKML, Norton, Scott J, Waiman Long On Mon, Sep 9, 2013 at 6:15 PM, Ramkumar Ramachandra <artagnon@gmail.com> wrote: > > Maybe it should then? It doesn't need to. The RCU lookup looks at individual dentry sequence numbers and doesn't care about the bigger rename sequence number at all. The fallback (if you hit one of the very very rare races, or if you hit a symlink) ends up doing per-path-component lookups under the rename sequence lock, but for it, read-locking it until it succeeds is the right thing to do. Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-10 1:34 ` Linus Torvalds @ 2013-09-10 2:25 ` Al Viro 2013-09-10 2:33 ` Linus Torvalds 2013-09-10 3:12 ` Ramkumar Ramachandra 1 sibling, 1 reply; 25+ messages in thread From: Al Viro @ 2013-09-10 2:25 UTC (permalink / raw) To: Linus Torvalds Cc: Ramkumar Ramachandra, George Spelvin, Chandramouleeswaran, Aswin, John Stoffel, linux-fsdevel, LKML, Norton, Scott J, Waiman Long On Mon, Sep 09, 2013 at 06:34:16PM -0700, Linus Torvalds wrote: > On Mon, Sep 9, 2013 at 6:15 PM, Ramkumar Ramachandra <artagnon@gmail.com> wrote: > > > > Maybe it should then? > > It doesn't need to. The RCU lookup looks at individual dentry sequence > numbers and doesn't care about the bigger rename sequence number at > all. One name: Mark V. Shaney... ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-10 2:25 ` Al Viro @ 2013-09-10 2:33 ` Linus Torvalds 0 siblings, 0 replies; 25+ messages in thread From: Linus Torvalds @ 2013-09-10 2:33 UTC (permalink / raw) To: Al Viro Cc: George Spelvin, Chandramouleeswaran, Aswin, John Stoffel, linux-fsdevel, LKML, Norton, Scott J, Waiman Long On Mon, Sep 9, 2013 at 7:25 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > > One name: Mark V. Shaney... Heh, yes. I had ignored the earlier emails, and that last one looked more reasonable than the earlier ones ;) Linus ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-10 1:34 ` Linus Torvalds 2013-09-10 2:25 ` Al Viro @ 2013-09-10 3:12 ` Ramkumar Ramachandra 1 sibling, 0 replies; 25+ messages in thread From: Ramkumar Ramachandra @ 2013-09-10 3:12 UTC (permalink / raw) To: Linus Torvalds Cc: Al Viro, George Spelvin, Chandramouleeswaran, Aswin, John Stoffel, linux-fsdevel, LKML, Norton, Scott J, Waiman Long Linus Torvalds wrote: > It doesn't need to. The RCU lookup looks at individual dentry sequence > numbers and doesn't care about the bigger rename sequence number at > all. Right; it's sequential. > The fallback (if you hit one of the very very rare races, or if you > hit a symlink) ends up doing per-path-component lookups under the > rename sequence lock, but for it, read-locking it until it succeeds is > the right thing to do. No, it's write-locking. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-10 0:57 ` Al Viro 2013-09-10 1:15 ` Ramkumar Ramachandra @ 2013-09-10 8:24 ` George Spelvin 1 sibling, 0 replies; 25+ messages in thread From: George Spelvin @ 2013-09-10 8:24 UTC (permalink / raw) To: linux, viro Cc: aswin, john, linux-fsdevel, linux-kernel, scott.norton, torvalds, Waiman.Long > _What_ "pathname translations"? Pathname resolution doesn't fall back to > seq_writelock() at all. I meant the translation of dentry to pathname that this thread is about. Apologies for my confusing abbreviation. Since the reverse translation (pathname to dentry) is done a level at a time and people understand that it's not atomic, I agree with Linus that a livelock there is too remote a possibility to worry about. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-10 0:40 ` George Spelvin 2013-09-10 0:57 ` Al Viro @ 2013-09-10 3:57 ` Waiman Long 1 sibling, 0 replies; 25+ messages in thread From: Waiman Long @ 2013-09-10 3:57 UTC (permalink / raw) To: George Spelvin Cc: torvalds, viro, aswin, john, linux-fsdevel, linux-kernel, scott.norton On 09/09/2013 08:40 PM, George Spelvin wrote: > I'm really wondering about only trying once before taking the write lock. > Yes, using the lsbit is a cute hack, but are we using it for its cuteness > rather than its effectiveness? > > Renames happen occasionally. If that causes all the current pathname > translations to fall back to the write lock, that is fairly heavy. > Worse, all of those translations will (unnecessarily) bump the write > seqcount, triggering *other* translations to fail back to the write-lock > path. > > One patch to fix this would be to have the fallback read algorithm take > sl->lock but *not* touch sl->seqcount, so it wouldn't break concurrent > readers. Actually, a follow-up patch that I am planning to do is to introduce a read_seqlock() primitive in seqlock.h that does exactly that. Then the write_seqlock() in this patch will be modified to read_seqlock(). -Longman ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 17:29 ` Al Viro 2013-09-09 17:45 ` Linus Torvalds @ 2013-09-09 17:55 ` Waiman Long 2013-09-09 18:07 ` Al Viro 1 sibling, 1 reply; 25+ messages in thread From: Waiman Long @ 2013-09-09 17:55 UTC (permalink / raw) To: Al Viro Cc: Linus Torvalds, linux-fsdevel, linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On 09/09/2013 01:29 PM, Al Viro wrote: > On Mon, Sep 09, 2013 at 12:18:13PM -0400, Waiman Long wrote: >> +/** >> + * read_seqbegin_or_lock - begin a sequence number check or locking block >> + * lock: sequence lock >> + * seq : sequence number to be checked >> + * >> + * First try it once optimistically without taking the lock. If that fails, >> + * take the lock. The sequence number is also used as a marker for deciding >> + * whether to be a reader (even) or writer (odd). >> + * N.B. seq must be initialized to an even number to begin with. >> + */ >> +static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq) >> +{ >> + if (!(*seq& 1)) { /* Even */ >> + *seq = read_seqbegin(lock); >> + rcu_read_lock(); >> + } else /* Odd */ >> + write_seqlock(lock); >> +} >> +static inline int read_seqretry_or_unlock(seqlock_t *lock, int *seq) >> +{ >> + if (!(*seq& 1)) { /* Even */ >> + rcu_read_unlock(); >> + if (read_seqretry(lock, *seq)) { >> + (*seq)++; /* Take writer lock */ >> + return 1; >> + } >> + } else /* Odd */ >> + write_sequnlock(lock); >> + return 0; >> +} > I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends > can do that themselves just fine (it needs to be taken when seq is even), > and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with > this variant. I think rcu_read_lock() is needed to make sure that the dentry won't be freed as we don't take d_lock now. -Longman ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH v4 1/1] dcache: Translating dentry into pathname without taking rename_lock 2013-09-09 17:55 ` Waiman Long @ 2013-09-09 18:07 ` Al Viro 0 siblings, 0 replies; 25+ messages in thread From: Al Viro @ 2013-09-09 18:07 UTC (permalink / raw) To: Waiman Long Cc: Linus Torvalds, linux-fsdevel, linux-kernel, Chandramouleeswaran, Aswin, Norton, Scott J, George Spelvin, John Stoffel On Mon, Sep 09, 2013 at 01:55:06PM -0400, Waiman Long wrote: > >I'm not sure I like mixing rcu_read_lock() into that - d_path() and friends > >can do that themselves just fine (it needs to be taken when seq is even), > >and e.g. d_walk() doesn't need it at all. Other than that, I'm OK with > >this variant. > > I think rcu_read_lock() is needed to make sure that the dentry won't > be freed as we don't take d_lock now. Sure, you do need that; the question is whether you need to take it in the primitives you are introducing. ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2013-09-10 8:24 UTC | newest] Thread overview: 25+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-09-09 16:18 [PATCH v4 0/1] dcache: Translating dentry into pathname without taking rename_lock Waiman Long 2013-09-09 16:18 ` [PATCH v4 1/1] " Waiman Long 2013-09-09 17:29 ` Al Viro 2013-09-09 17:45 ` Linus Torvalds 2013-09-09 17:56 ` Waiman Long 2013-09-09 18:06 ` Al Viro 2013-09-09 18:15 ` Linus Torvalds 2013-09-09 18:21 ` Al Viro 2013-09-09 18:36 ` Al Viro 2013-09-09 18:46 ` Al Viro 2013-09-09 18:46 ` Waiman Long 2013-09-09 19:10 ` Al Viro 2013-09-09 19:28 ` Al Viro 2013-09-09 22:57 ` Waiman Long 2013-09-10 0:40 ` George Spelvin 2013-09-10 0:57 ` Al Viro 2013-09-10 1:15 ` Ramkumar Ramachandra 2013-09-10 1:34 ` Linus Torvalds 2013-09-10 2:25 ` Al Viro 2013-09-10 2:33 ` Linus Torvalds 2013-09-10 3:12 ` Ramkumar Ramachandra 2013-09-10 8:24 ` George Spelvin 2013-09-10 3:57 ` Waiman Long 2013-09-09 17:55 ` Waiman Long 2013-09-09 18:07 ` Al Viro
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).