Re: linux-next: manual merge of the akpm tree with the vfs tree

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Al Viro <viro@ZenIV.linux.org.uk>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
	linux-next@vger.kernel.org, linux-kernel@vger.kernel.org,
	Nathan Zimmer <nzimmer@sgi.com>
Subject: Re: linux-next: manual merge of the akpm tree with the vfs tree
Date: Thu, 4 Apr 2013 09:02:48 +0100	[thread overview]
Message-ID: <20130404080248.GN21522@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20130403235634.6bc72c39.akpm@linux-foundation.org>

On Wed, Apr 03, 2013 at 11:56:34PM -0700, Andrew Morton wrote:
> On Thu, 4 Apr 2013 17:26:48 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> 
> > Hi Andrew,
> > 
> > Today's linux-next merge of the akpm tree got a conflict in
> > fs/proc/generic.c between several commits from the vfs tree and commit
> > "procfs: improve scaling in proc" from the akpm tree.
> > 
> > I just dropped the akpm tree patch (and the following
> > "procfs-improve-scaling-in-proc-v5") as the conflicts are a bit complex.
> 
> Well perhaps the vfs tree should start paying some attention to the
> rest of the world, particularly after -rc5.

I'm sorry, but... not in this case.  There are seriously nasty races around
remove_proc_entry()/proc_reg_release() and the whole area needs a rewrite.
Tentative fix is in vfs.git#experimental; I hadn't pushed it into #for-next
yet, but Nathan's patches are definitely going to buggered by any realistic
solution.  For now I'm going for the dumbest variant possible; ->pde_users
will eventually become atomic_t, but it'll be modified by
atomic_inc_unless_negative(), not atomic_inc() and ->proc_fops won't be
zeroed at all.  But before we do that, I want to get that sucker tested and
stabilized - the whole thing is sufficiently convoluted for those races to
stay unnoticed for a long time in the first place.

I am aware of those patches; it's just that they'll need to be redone - the
code being optimized is broken and needs to be fixed.  Longer term, I want
to lift the whole thing into VFS proper; it *can* be done with minimal
overhead and it'll get us a large part of revoke(2) if done right.  Basically,
what I want is a pair of new types - struct revoke and struct revokable.
	* struct file gets a pointer to struct revoke; set in ->open() by
file_revokable(file, revokable) and never changed afterwards.  No locking
is needed to check it.
	* struct revoke contains a pointer back to struct file (never changed)
+ pointer to struct revokable (RCU-protected, zeroed on revoke) + mutex/count
(serializes __fput() vs. revoke() deciding who's going to call ->release() and
who'll be waiting; see pdeo->mutex and pdep->count in #experimental) + cyclic
list anchored in struct revokable (list_del_init() after ->release() had been
done, under revoke->mutex and revokable->lock).
	* struct revokable contains an anchor for aforementioned cyclic
list + spinlock + atomic_t in_use (a-la pde->pde_users) + pointer to
completion + one method (->kick(); see below).  Freed via RCU.
	* start_using(file) is an inlined helper, returning true if
file->f_revoke is NULL; if it's not NULL, we do rcu_read_lock() and look
at file->f_revoke->revokable.  If it's NULL - rcu_read_unlock() and return
false (file had been revoked).  If it's not, atomic_inc_unless_negative()
of revokable->in_use.  Then rcu_read_unlock() and return - true if
->in_use used to be non-negative (file not revoked, revoke will wait) and
false if it was negative (file in process of being revoked).
	* stop_using(file) - inlined helper, does nothing if file->f_revoke
is NULL, otherwise decrements revokable->in_use and if it's reached
BIAS (large negative), complete(revokable->completion).
	* all normal method calls (everything except ->release()) are
turned into
	if (likely(start_using(file)) {
		res = method call
		stop_using(file);
	} else {
		res = <appropriate for method>;
	}
Overhead for non-revokable files is trivial - we just check one field in
struct file and if it's in the same cacheline as ->f_op, we are not going
to see any real delays.
	* new helper - release_revoke(); similar to close_pdeo() in
vfs.git#experimenatal.  __fput() checks ->f_revoke and, if that sucker's
non-NULL, does rcu_read_lock(), checks ->revokable, grabs ->lock on it
and does release_revoke().  If ->f_revoke is NULL, call ->release() as
we do now.  Note that unlike struct pde_opener, these guys would be
created with ->count equal to 1 - IOW, file->f_revoke would contribute to it.
	* do_revoke(revokable) starts with setting ->completion and adding
BIAS to ->in_use; if it non-zero prior to that, call ->kick(revokable) and
wait for completion.  ->kick() should essentially wake up those who are
sleeping in ->read() or ->write() and make them return, be it with EAGAIN
or short read/write.  Empty for procfs, for something like TTY it should
imitate hangup.  That's where driver-specific logics in revoke(2) would
live.  Once do_revoke() has finished waiting, we know that nobody is in
method calls (except possibly ->release()) and nobody will manage to enter
them from now on.  We grab ->lock, pick the first struct revoke from the
list, do release_revoke() to it and keep doing that to these guys until
none is left.

procfs would have struct revokable embedded into proc_dir_entry, with
freeing of those guys RCUd.  It would have file_revokable() done in
proc_reg_open() (that would do allocation of struct revoke, adding it
to the cyclic list, etc.) and set ->f_op to ->proc_fops; no wrappers
needed anymore.  All file_operations instances fed to proc_create()
et.al. would lose ->owner - it's already not needed for those, actually.
remove_proc_entry()/remove_proc_subtree() would call do_revoke() on
everything we are removing.

Getting from there to working revoke(2) is probably not too hard; the
interesting part is what should we associate struct revokable with and
what should revoke(2) in progress do to new attempts to open the damn
device.  Hell knows - not sure what's the right semantics here.

next prev parent reply	other threads:[~2013-04-04  8:02 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-04-04  6:26 linux-next: manual merge of the akpm tree with the vfs tree Stephen Rothwell
2013-04-04  6:56 ` Andrew Morton
2013-04-04  7:02   ` Andrew Morton
2013-04-04  8:10     ` Al Viro
2013-04-04 23:18       ` Stephen Rothwell
2013-04-04  8:02   ` Al Viro [this message]
2013-04-04 15:43     ` Nathan Zimmer
2013-04-04 15:43       ` Nathan Zimmer
2013-04-04 15:53       ` [PATCH resend] fs/proc: Move kfree outside pde_unload_lock Nathan Zimmer
2013-04-04 16:11         ` Al Viro
2013-04-04 17:12           ` Nathan Zimmer
2013-04-04 20:44             ` Al Viro
2013-04-05 17:05               ` Nathan Zimmer
2013-04-05 17:36                 ` Al Viro
2013-04-05 20:56                   ` Nathan Zimmer
2013-04-05 21:00                     ` Al Viro
2013-04-08 15:34                       ` Nathan Zimmer
2013-04-08 15:58                         ` Al Viro
2013-04-08 16:42                           ` Nathan Zimmer
2013-04-08 20:52                           ` Nathan Zimmer
2013-04-08 21:23                             ` Al Viro
2013-04-08 21:48                               ` hangs on boot in 9984d7394618df9 Al Viro
2013-04-08 22:17                                 ` Stephen Warren
2013-04-08 22:45                                   ` Doug Anderson
2013-04-08 23:06                                     ` Al Viro
2013-04-08 23:20                                       ` Stephen Warren
2013-04-08 23:46                                       ` Doug Anderson
2013-04-09 17:12                                         ` Nathan Zimmer
2013-04-09 17:12                                           ` Nathan Zimmer
2013-04-08 22:46                                   ` Al Viro
2013-04-08 22:57                                     ` Al Viro
2013-04-08 21:56                               ` [PATCH resend] fs/proc: Move kfree outside pde_unload_lock Nathan Zimmer
  -- strict thread matches above, loose matches on Subject: below --
2020-05-15 11:33 linux-next: manual merge of the akpm tree with the vfs tree Stephen Rothwell
2019-04-11  6:21 Stephen Rothwell
2018-05-17  6:36 Stephen Rothwell
2018-01-02  6:46 Stephen Rothwell
2016-12-12  5:52 Stephen Rothwell
2016-12-12  8:14 ` Ian Kent
2014-08-08  6:20 Stephen Rothwell
2013-09-10  4:41 Stephen Rothwell
2013-09-05  8:56 Stephen Rothwell
2013-04-30  5:54 Stephen Rothwell
2013-04-29  8:34 Stephen Rothwell
2013-04-29  8:25 Stephen Rothwell
2013-04-04  6:17 Stephen Rothwell
2013-04-04  6:12 Stephen Rothwell
2013-04-04 12:33 ` Jan Kara
2013-04-04  6:04 Stephen Rothwell
2013-02-25  3:40 Stephen Rothwell
2012-09-24 13:40 Stephen Rothwell
2012-09-24 13:12 Stephen Rothwell
2012-09-24 13:06 Stephen Rothwell
2012-07-22  5:44 Stephen Rothwell
2011-07-18  8:55 Stephen Rothwell

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130404080248.GN21522@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-next@vger.kernel.org \
    --cc=nzimmer@sgi.com \
    --cc=sfr@canb.auug.org.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.