* Re: Linux 2.6.35
[not found] ` <20100802055834.GB19164@dastard>
@ 2010-08-02 7:55 ` Nick Piggin
2010-08-02 8:24 ` Christoph Hellwig
0 siblings, 1 reply; 9+ messages in thread
From: Nick Piggin @ 2010-08-02 7:55 UTC (permalink / raw)
To: Dave Chinner, linux-fsdevel; +Cc: Linus Torvalds, Linux Kernel Mailing List
On Mon, Aug 02, 2010 at 03:58:34PM +1000, Dave Chinner wrote:
> On Sun, Aug 01, 2010 at 07:50:02PM -0700, Linus Torvalds wrote:
> > On Sun, Aug 1, 2010 at 7:33 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > There hasn't been nearly enough review or testing of this patch
> > > series yet. Before a merge, it needs to be split up in smaller,
> > > more digestable chunks for more comprehensive review, regression
> > > testing and behavioural analysis.
BTW. it has in fact had quite a bit of testing in earlier form in the
-rt tree for a long time, and several fixes come from there. And good
performance results there too.
> > I dunno. We merge _way_ scarier things in the VM and the block layer,
> > for much less actual upside, and with less review.
>
> Scary stuff outside of direct VFS/FS interfaces is generally hidden
> from me by my +6 Blinkers of Blissful Ignorance. I make the
> assumption that the experts involved know the risks and have weighed
> them up appropriately. ;)
>
> > The RCU pathname lookup has some rather impressive performance
> > upsides, and I agree that it would be good to get a lot of review and
> > testing, but the latter isn't going to happen without it being
> > mainlined, and the former is sadly lacking. The person I'd like most
> > to review it is Al,
>
> Most definitely.
I hate to say but I would like to see it mature for another release. It
should also clash a bit with Al's recent inode work that he'll want to
push.
What I can do is send some of the ground work patches this time around,
put the tree into linux-next, and put reviewers on notice.
I think it is all conceptually sound, but it will inevitably have some
bugs left to shake out, and things to be fixed on the review side. I
don't anticipate a problem that could not be fixed in the release cycle,
but I think aiming for post 2.6.36 is a bit fairer for vfs guys,
honestly. LSF is next week too, so most of them will be busy with travel
and such. But I do hope to discuss the vfs-scale patches there.
> > but anybody in the filesystem world should
> > basically see it as a #1 priority,
>
> Agreed - I've actually looked at every patch, commented on some
> of the more questionable things, got quoted by LWN for saying that
> it "fell off the locking cliff", have run benchmarks on it and sent
> patches fixing bugs back to Nick.
>
> It's just really hard to digest it all in one lump and core VFS
> changes on this scale scare me....
For filesystems developers, the dcache and inode locking changes
should be more or less just following simple steps as shown in the
patch series. If they're not abusing dcache_lock (and most except
autofs4 are not), then it should not be a big deal.
There are a couple of locking constraints changed at the API level,
but I didn't run into any problems there yet. It should be all
documented in Documentation/filesystems/* although I need to run a
few more passes over the series to ensure I caught everything.
> > because unlike all the masturbatory
> > patches like xstat() that add new functionality that nobody will
> > likely ever use, Nick's patchseries improves on the thing that
> > everybody uses heavily every day without even thinking about it.
> >
> > Is it tough to review? Yes. It's core code, not just some random
> > addition that adds a new feature and doesn't impact any old code. But
> > that's also the thing that makes it meaningful, and makes me think it
> > should get merged _much_ more eagerly than most code we ever see.
>
> I agree with you for the pure locking changes.
>
> But for the bits that change writeback, LRU ordering and reclaim
> calculations the benefits are not quite so obvious, nor is the
> correctness of the code/behaviour quite so provably correct. Maybe
> I'm being a bit too paranoid, but generally it pays to be a bit
> conservative as a filesystem developer because the cost of screwing
> up can be pretty high...
Writeback shouldn't be changed. LRU ordering is changed for 2
reasons. Firstly, to make things per-zone instead of global. This
basically fits our whole reclaim model much better, although it
will inevitably cause some random little changes but I think it
is agreed this is a good thing (memory shortage in one zone or
node does not require global shrinkings, NUMA level parallelism
of reclaim.)
The other thing is converting the last few dcache refcounting, and
all of inode refcounting over to this "lazy LRU" model. This can
have a bigger impact, but it really reduces locking on the per-zone
lists, so it definitely helps speed and scalability of non-reclaim
fastpaths. I'm up for changing this if numbers show it hurts, it
would be rather easy to do, but in comparison to the overall
patchset, it would rate as a minor tweak :)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.35
2010-08-02 7:55 ` Linux 2.6.35 Nick Piggin
@ 2010-08-02 8:24 ` Christoph Hellwig
2010-08-02 8:46 ` KOSAKI Motohiro
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Christoph Hellwig @ 2010-08-02 8:24 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Chinner, linux-fsdevel, Linus Torvalds,
Linux Kernel Mailing List
On Mon, Aug 02, 2010 at 05:55:37PM +1000, Nick Piggin wrote:
> I hate to say but I would like to see it mature for another release. It
> should also clash a bit with Al's recent inode work that he'll want to
> push.
>
> What I can do is send some of the ground work patches this time around,
> put the tree into linux-next, and put reviewers on notice.
>
> I think it is all conceptually sound, but it will inevitably have some
> bugs left to shake out, and things to be fixed on the review side. I
> don't anticipate a problem that could not be fixed in the release cycle,
> but I think aiming for post 2.6.36 is a bit fairer for vfs guys,
> honestly. LSF is next week too, so most of them will be busy with travel
> and such. But I do hope to discuss the vfs-scale patches there.
What I'm most concerned bit merging everything in one go. It's a huge
series and I'd rather see it start going in in batches over multiple
kernel releases.
Things like the fs_struct spinlock and some other preparatory patches
should be ver easily to do for 2.6.36. Scaling the files and vfsmount
locks should also be easily doable, but we need to sort out the struct
file growth in the later. We really can't grow struct file by two
pointers as that would have devasting effects on various workloads.
What follows after that is the dcache_lock scaling which to seems the
most immature bit of the series, and the one that showed by far the
most problems in -RT. I'm very much dead set against merging that in
.36. I'd much rather see the inode_lock scaling or the lockless path
walk going in before, but I haven't checked how complicated the
reordering would be. The lockless path walk also is only rather
theoretically useful until we do ACL checks lockless as we're having
ACLs enabled pretty much everywhere at least in the distros.
The per-zone shrinkers are another thing that's not directly related,
I think they need a lot more discussion with the VM folks, and
integrating with Dave's work in that area.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.35
2010-08-02 8:24 ` Christoph Hellwig
@ 2010-08-02 8:46 ` KOSAKI Motohiro
2010-08-02 9:05 ` Christoph Hellwig
2010-08-02 9:51 ` Nick Piggin
2 siblings, 0 replies; 9+ messages in thread
From: KOSAKI Motohiro @ 2010-08-02 8:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: kosaki.motohiro, Nick Piggin, Dave Chinner, linux-fsdevel,
Linus Torvalds, Linux Kernel Mailing List
> The per-zone shrinkers are another thing that's not directly related,
> I think they need a lot more discussion with the VM folks, and
> integrating with Dave's work in that area.
per-zone shrinkers don't cause so much impact to VM design except zone
reclaim feature. So, if FS folks think it's ok, I'm not against this at all.
btw, however, I haven't review such patch series in the detail yet. so
perhaps I might post some bug fix later.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.35
2010-08-02 8:24 ` Christoph Hellwig
2010-08-02 8:46 ` KOSAKI Motohiro
@ 2010-08-02 9:05 ` Christoph Hellwig
2010-08-02 10:07 ` Nick Piggin
2010-08-02 9:51 ` Nick Piggin
2 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2010-08-02 9:05 UTC (permalink / raw)
To: Nick Piggin
Cc: Dave Chinner, linux-fsdevel, Linus Torvalds,
Linux Kernel Mailing List
On Mon, Aug 02, 2010 at 04:24:28AM -0400, Christoph Hellwig wrote:
> .36. I'd much rather see the inode_lock scaling or the lockless path
> walk going in before, but I haven't checked how complicated the
> reordering would be. The lockless path walk also is only rather
> theoretically useful until we do ACL checks lockless as we're having
> ACLs enabled pretty much everywhere at least in the distros.
>From a quick look it seems like the inode_lock splitup can easily
be moved forward, and it would help us with doing some work on the
writeback side. The problem is that it would need rebasing ontop
of both the vfs and writeback (aka block) trees.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.35
2010-08-02 8:24 ` Christoph Hellwig
2010-08-02 8:46 ` KOSAKI Motohiro
2010-08-02 9:05 ` Christoph Hellwig
@ 2010-08-02 9:51 ` Nick Piggin
2 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2010-08-02 9:51 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Nick Piggin, Dave Chinner, linux-fsdevel, Linus Torvalds,
Linux Kernel Mailing List
On Mon, Aug 02, 2010 at 04:24:28AM -0400, Christoph Hellwig wrote:
> On Mon, Aug 02, 2010 at 05:55:37PM +1000, Nick Piggin wrote:
> > I hate to say but I would like to see it mature for another release. It
> > should also clash a bit with Al's recent inode work that he'll want to
> > push.
> >
> > What I can do is send some of the ground work patches this time around,
> > put the tree into linux-next, and put reviewers on notice.
> >
> > I think it is all conceptually sound, but it will inevitably have some
> > bugs left to shake out, and things to be fixed on the review side. I
> > don't anticipate a problem that could not be fixed in the release cycle,
> > but I think aiming for post 2.6.36 is a bit fairer for vfs guys,
> > honestly. LSF is next week too, so most of them will be busy with travel
> > and such. But I do hope to discuss the vfs-scale patches there.
>
> What I'm most concerned bit merging everything in one go. It's a huge
> series and I'd rather see it start going in in batches over multiple
> kernel releases.
One problem is that to win much benefit, several different aspects
must be scaled. If not, then you end up with more locks *and* still
have bouncing global cachelines. And filesystems will go through
multiple releases where locking changes are in flux. This is what
I'm concerned about.
I definitely have tried to keep everything as conceptually seperate
small chunks. But there is a real big-picture aspect that is required
to review it.
For example, you asked for just the locking split-up without any of
the per-hash-locks and per-cpu locks etc. That's fine for review, but
you cannot merge it because then you end up with N bouncing global
locks instead of 1. It also tends to be much uglier than a final
outcome because I have not applied any transformations to improve lock
orderings and reduce trylocking etc.
> Things like the fs_struct spinlock and some other preparatory patches
> should be ver easily to do for 2.6.36. Scaling the files and vfsmount
> locks should also be easily doable, but we need to sort out the struct
> file growth in the later. We really can't grow struct file by two
> pointers as that would have devasting effects on various workloads.
Strictly, it is a filesystem corruption bug-fix for the tty layer
and nothing to do with tty scaling patches.
I don't have the patience at the moment to sort through tty layer
crap, but whoever is maintaining that should. I could possibly come
back and look at it some point, but given your half-working patch
as a guide, I think someone who knows the code can fix it.
> What follows after that is the dcache_lock scaling which to seems the
> most immature bit of the series, and the one that showed by far the
> most problems in -RT. I'm very much dead set against merging that in
> .36.
That's a fair point, I agree with. It needs most review.
> I'd much rather see the inode_lock scaling or the lockless path
> walk going in before, but I haven't checked how complicated the
> reordering would be.
I would much prefer not to re-order it before either of inode or
dcache scaling patches. It would introduce a lot of churn and
locking is significantly changed.
It probably should be possible, although we would still get path
walk contention on dcache_lock, vfsmount_lock, and requires inode-RCU
(making inodes more expensive without being offset by any benefits of
inode scaling), and requires changes to filesystem dcache and inode
APIs.
I could work on re-ordering it certainly, but only if it is decided
that we definitely don't want dcache-scale or inode-scale patch sets
in the forseeable future. I think we definitely do want them, so I
find it hard to justify a big reordering.
> The lockless path walk also is only rather
> theoretically useful until we do ACL checks lockless as we're having
> ACLs enabled pretty much everywhere at least in the distros.
True, it needs a last bit of work for permission checking. The
conceptual idea and the bulk of the code I think is ready to review
though. ACLs should be just more of the same.
> The per-zone shrinkers are another thing that's not directly related,
> I think they need a lot more discussion with the VM folks, and
> integrating with Dave's work in that area.
Well I'm a VM folk :) Conceptually, there is no problems for MM
here. This is really the right way to drive reclaim from the MM
perspective (ie. per-zone). Of course I will work with Dave and
take suggestions on implementation.
It is directly related in that it is required to remove global lock and
global list scanning from vfs reclaim, which is something that we've
known and wanted for a long time.
On one hand, you might say I'm going overboard, but on another hand,
vfs really sucks on NUMA and SMP right now and it's only going to
get worse for "normal" (ie. not HPC) people.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.35
2010-08-02 9:05 ` Christoph Hellwig
@ 2010-08-02 10:07 ` Nick Piggin
0 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2010-08-02 10:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Nick Piggin, Dave Chinner, linux-fsdevel, Linus Torvalds,
Linux Kernel Mailing List
On Mon, Aug 02, 2010 at 05:05:42AM -0400, Christoph Hellwig wrote:
> On Mon, Aug 02, 2010 at 04:24:28AM -0400, Christoph Hellwig wrote:
> > .36. I'd much rather see the inode_lock scaling or the lockless path
> > walk going in before, but I haven't checked how complicated the
> > reordering would be. The lockless path walk also is only rather
> > theoretically useful until we do ACL checks lockless as we're having
> > ACLs enabled pretty much everywhere at least in the distros.
>
> >From a quick look it seems like the inode_lock splitup can easily
> be moved forward, and it would help us with doing some work on the
> writeback side. The problem is that it would need rebasing ontop
> of both the vfs and writeback (aka block) trees.
inode_lock splitup is much simpler than dcache_lock, yes.
And I have to rebase it on the work currently queued for 2.6.35
anyway, so that's no problem. I can easily put it in front of
dcache_lock patches in the series (as I said, I've kept everything
independent and well split up).
I do want opinions on how to do the big-picture merge, though,
before I start moving things around. And obviously reviewing
each of the parts is more important at this point than exact
way to order the thing.
But even the inode_lock patches I am wary of merging in 2.6.36
without having much review or any linux-next / vfs-tree exposure.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.35
[not found] ` <87tyncds89.fsf@basil.nowhere.org>
@ 2010-08-03 9:28 ` Nick Piggin
2010-08-03 9:49 ` Andi Kleen
2010-08-03 15:05 ` Henrique de Moraes Holschuh
0 siblings, 2 replies; 9+ messages in thread
From: Nick Piggin @ 2010-08-03 9:28 UTC (permalink / raw)
To: Andi Kleen, linux-fsdevel, Christoph Hellwig
Cc: Dave Chinner, Linus Torvalds, Linux Kernel Mailing List, npiggin
On Tue, Aug 03, 2010 at 10:18:30AM +0200, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
>
> > On Sun, Aug 01, 2010 at 04:52:42PM -0700, Linus Torvalds wrote:
> >> On a slightly happier note: one thing I do hope we can merge in the
> >> upcoming merge window is Nick Piggin's cool VFS scalability series.
> >> I've been using it on my own machine, and gone through all the commits
> >> (not that I shouldn't go through some of them some more), and am
> >> personally really excited about it. It's seldom we see major
> >> performance improvements in core code that are quite that noticeable,
> >> and Nick's whole RCU pathname lookup in particular just tickles me
> >> pink.
> >
> > There hasn't been nearly enough review or testing of this patch
> > series yet. Before a merge, it needs to be split up in smaller,
> > more digestable chunks for more comprehensive review, regression
> > testing and behavioural analysis.
>
> We started some testing of the VFS series on larger systems and so
> far it looks all very good and the performance improvements are impressive
> (but of course new bottlenecks are being exposed then)
>
> The only snag found so far was that an ACL enabled file system
> disables all the nice path walk improvements, so right now you
> need to remount with noacl. I'm hoping this can be fixed
> before a mainline release, otherwise I suspect it would disable
> the improvements for lots of people (common distributions default
> to acl on)
OK, vfs-scale-working branch now has commits to enable rcu-walk aware
d_revalidate, permission, and check_acl in the filesystems, and
implements a basic rcu-walk aware scheme for generic/posix acls and
implements that for tmpfs, ext2, btrfs. It just drops out of rcu-walk
if there is an ACL on a directory, or if it is not cached. I think
that's enough to be production ready now. Pushing rcu-walk awareness
down into acl checking code would not be hard.
I was under the impression that ACLs on directories are not that common,
so maybe this is as far as we need to go for now anyway.
It does need more commenting of the new methods and explaining how they
can and can't be used by filesystems. The tree is also getting messy
with incremental changes -- I'm avoiding rebasing it so people following
it can see response to reviews and issues that arise. Obviously it will
all get cleaned up and rebased properly onto a new branch before
anything is merged.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.35
2010-08-03 9:28 ` Nick Piggin
@ 2010-08-03 9:49 ` Andi Kleen
2010-08-03 15:05 ` Henrique de Moraes Holschuh
1 sibling, 0 replies; 9+ messages in thread
From: Andi Kleen @ 2010-08-03 9:49 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, linux-fsdevel, Christoph Hellwig, Dave Chinner,
Linus Torvalds, Linux Kernel Mailing List
On Tue, Aug 03, 2010 at 07:28:23PM +1000, Nick Piggin wrote:
> OK, vfs-scale-working branch now has commits to enable rcu-walk aware
> d_revalidate, permission, and check_acl in the filesystems, and
> implements a basic rcu-walk aware scheme for generic/posix acls and
Cool. I was just looking at that.
> I was under the impression that ACLs on directories are not that common,
> so maybe this is as far as we need to go for now anyway.
Yes it shouldn't be common normally. I think the common case for distros
is just a few ACLs in /dev. Of course you never know for
specific end user workloads.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Linux 2.6.35
2010-08-03 9:28 ` Nick Piggin
2010-08-03 9:49 ` Andi Kleen
@ 2010-08-03 15:05 ` Henrique de Moraes Holschuh
1 sibling, 0 replies; 9+ messages in thread
From: Henrique de Moraes Holschuh @ 2010-08-03 15:05 UTC (permalink / raw)
To: Nick Piggin
Cc: Andi Kleen, linux-fsdevel, Christoph Hellwig, Dave Chinner,
Linus Torvalds, Linux Kernel Mailing List
On Tue, 03 Aug 2010, Nick Piggin wrote:
> I was under the impression that ACLs on directories are not that common,
> so maybe this is as far as we need to go for now anyway.
They are quite common on fileserver data areas, at least on the places where
I worked at.
--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2010-08-03 15:05 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <AANLkTim8FALeVG+NPLNLEQChu=dPD=GHSpnxmZHYgsNx@mail.gmail.com>
[not found] ` <20100802023322.GA19164@dastard>
[not found] ` <AANLkTint_s6h_EC7mDiPsyxr=C0GSrnrgJYkCCU7JEtN@mail.gmail.com>
[not found] ` <20100802055834.GB19164@dastard>
2010-08-02 7:55 ` Linux 2.6.35 Nick Piggin
2010-08-02 8:24 ` Christoph Hellwig
2010-08-02 8:46 ` KOSAKI Motohiro
2010-08-02 9:05 ` Christoph Hellwig
2010-08-02 10:07 ` Nick Piggin
2010-08-02 9:51 ` Nick Piggin
[not found] ` <87tyncds89.fsf@basil.nowhere.org>
2010-08-03 9:28 ` Nick Piggin
2010-08-03 9:49 ` Andi Kleen
2010-08-03 15:05 ` Henrique de Moraes Holschuh
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).