Re: sysfs: tagged directories not merged completely yet

From: ebiederm@xmission.com (Eric W. Biederman)
To: Tejun Heo <tj@kernel.org>
Cc: Greg KH <greg@kroah.com>, Al Viro <viro@ZenIV.linux.org.uk>,
	Benjamin Thery <benjamin.thery@bull.net>,
	linux-kernel@vger.kernel.org,
	"Serge E. Hallyn" <serue@us.ibm.com>,
	Al Viro <viro@ftp.linux.org.uk>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: sysfs: tagged directories not merged completely yet
Date: Thu, 16 Oct 2008 14:58:12 -0700
Message-ID: <m1r66gb3t7.fsf@frodo.ebiederm.org>
In-Reply-To: <48F5CE43.2030207@kernel.org> (Tejun Heo's message of "Wed, 15 Oct 2008 20:04:35 +0900")

Tejun Heo <tj@kernel.org> writes:

> Eric W. Biederman wrote:
>> Tejun Heo <tj@kernel.org> writes:

>>>> 2) i_mutex seems to protect very little if anything that we care about.
>>>>    The dcache has its own set of locks.  So we may be able to completely
>>>>    avoid taking i_mutex in sysfs and simplify things enormously.
>>>>    Currently I believe we are very similar to ocfs2 in terms of locking
>>>>    requirements.
>>> I think the timestamps are one of the things it protects.
>> 
>> Yes.  I think parts of the page cache and anything in the inode itself
>> are protected by i_mutex.  As for timestamps or anything else that
>> we really care about we can and should put them in sysfs_dirent and
>> we can have the stat method recreate it, and possibly have d_revalidate
>> refresh it.
>
> Some of the timestamps are not on the sysfs_dirent because 1. nobody
> cared (the original sd implementation didn't preserve it) and
> 2. of memory overhead.

Yep.  Which basically boils down to nobody cared.

>>>> 3) For i_notify and d_notify that seems to require pinning the inode
>>>>    or the dentry in question, so I see no reason why a d_revalidate
>>>>    style of update will have problems.
>>> Because the existing notifications won't be moved over to the new
>>> dentry.  dnotify wouldn't work the same way.  ISTR that was the reason
>>> why I didn't do the d_revalidate thing, but I don't think it really
>>> matters.  dnotify on sysfs nodes doesn't work properly anyway.
>> 
>> Reasonable.  I have seen two ways of handling rename properly.
>> Some weird variant of d_splice_alias or some cleaner variant of what
>> we are doing today.
>
> FWIW, I think it would be just fine to invalidate the renamed dentry.

I have a working implementation now and it needs a little bit of cleanup.
The patch that gets me there is too big.

I have realized the following things.
1) Keeping the VFS in sync with the sysfs_dirent tree is impossible
   because the VFS occasionally denies operations for internal
   consistency reasons (think removing a mount point from the dcache)
   and getting the sysfs_dirent tree out of sync with the kobject
   tree could get very hairy.

2) For a distributed filesystem there are small races in lookup
   and revalidate between when the change was made and when it
   appears so lookup and revalidate need to cope.

3) fsnotify and the like are pointless to worry about because it looks
   like only sysfs_chmod_file does the necessary magic and
   sysfs_chmod_file appears to only happen at file creation.

4) If we really need to we can run what is essentially sysfs_get_dirent
   after the appropriate operations and in a timely manner see
   renames and have notifications, but I don't think the cost
   is worth it.

>> Depends on how many devices people are adding and removing dynamically
>> I guess.  sysctl has had that issue so I am thinking about it.  I
>> figure we need to make things work properly first.
>
> Yeap, let's think about optimization later.  The problem hasn't come
> up yet even on machines where the memory footprint of sysfs dentries
> and inodes posed serious problems, so I don't think optimizing it is a
> high priority at this point.

Agreed, worry about optimization later.  Except at the extreme end it
isn't a real issue.  I keep thinking about it because it has come up
with sysctl when creating lots of virtual network devices, and sysfs is
the next obvious target.

> Well, I suppose most of that blame falls on me but I still can't bring
> myself to agree with the current implementation.  The biggest problem
> I have is that the implementation doesn't really show in a straightforward
> manner what it tries to achieve (showing a partial tree
> depending on sb).

Alright.

I guess we really need to talk about this some more and look at patches.
I expect some of the blame falls on me for being a bit impatient.  sysfs
has not been the interesting part, just a silly little filesystem that
is in the way.

> Getting the clean up part in usually isn't a problem, right?  But
> getting in the actual namespace part is (and should be).

Nah.  Getting the namespace design agreed to is the hard part.
Once everyone is on the same page, namespace patches are no harder
than any others.  Unfortunately it looks like we really haven't
agreed to the design.

So back to basics: where I think we should go, from 10,000 feet.

- Multiple superblocks for sysfs.
- Each superblock showing the devices in sysfs from a different network namespace.
- There should be one instance of uevent_sock for each network namespace.
- kobject_uevents should be sent out all uevent_socks unless they are for a
  network device.
- kobject_uevents for a network device should be sent out the uevent_sock for
  its network namespace.
- kobject_uevents for network devices not in the initial network namespace will
  not be sent with uevent_helper.
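
A rough userspace model of the routing rules in the list above, as a sketch
rather than kernel code.  The names UeventSock, Kobject and route_uevent, and
the string labels used for namespaces, are made up for illustration; in the
kernel a namespace would be identified by its object, not by a name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

INIT_NET = "init_net"  # hypothetical label for the initial network namespace


@dataclass
class UeventSock:
    """Stand-in for one per-namespace uevent_sock; records what it was sent."""
    netns: str
    received: List[str] = field(default_factory=list)


@dataclass
class Kobject:
    name: str
    netns: Optional[str] = None  # None means "not a network device"


def route_uevent(kobj: Kobject, action: str,
                 socks: Dict[str, UeventSock], helper_log: List[str]) -> None:
    """Deliver one uevent according to the rules in the list above."""
    msg = f"{action}@{kobj.name}"
    if kobj.netns is None:
        # Not a network device: send on every namespace's socket.
        targets = list(socks.values())
    else:
        # Network device: send only on the socket of its own namespace.
        targets = [socks[kobj.netns]]
    for sock in targets:
        sock.received.append(msg)
    # uevent_helper is skipped for network devices outside the initial namespace.
    if kobj.netns in (None, INIT_NET):
        helper_log.append(msg)
```

In this model, a disk event reaches every socket and the helper, while an
eth0 event from a non-initial namespace reaches only that namespace's socket
and never the helper.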

Reasons.

Foremost, namespaces don't have names.  Namespaces without names
allow us to have nested containers.

Without a name for the network namespace there is no way to create
a unique entry in sysfs for each network device, as the network device
names themselves can repeat in each network namespace.  So I have chosen
the point of user space control, the mount of sysfs, to encode the network
namespace information.  This allows everything to be visible and user space
to still set policy.
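
To make the per-mount view concrete, here is a small userspace model, not the
sysfs implementation.  TaggedDir, sb_tag and the "nsA"/"nsB" labels are
hypothetical; the labels only stand in for namespace object identity, since
the namespaces themselves have no names.

```python
class TaggedDir:
    """Model of a directory whose children may carry a namespace tag."""

    def __init__(self):
        # (tag, name) -> entry; a tag of None means visible through every mount.
        self._children = {}

    def add(self, name, entry, tag=None):
        key = (tag, name)
        if key in self._children:
            raise FileExistsError(name)
        self._children[key] = entry

    def lookup(self, name, sb_tag):
        """Resolve name as seen through a mount whose superblock has sb_tag."""
        for key in ((sb_tag, name), (None, name)):
            if key in self._children:
                return self._children[key]
        return None

    def readdir(self, sb_tag):
        """List the names visible through a mount whose superblock has sb_tag."""
        return sorted(name for (tag, name) in self._children
                      if tag in (None, sb_tag))


net = TaggedDir()
net.add("eth0", "eth0 of nsA", tag="nsA")
net.add("eth0", "eth0 of nsB", tag="nsB")  # same name, different namespace
net.add("lo", "lo of nsA", tag="nsA")
```

Both eth0 entries coexist in the one directory, yet each mount resolves the
name eth0 to the device from its own namespace, so the existing naming policy
is unchanged from user space's point of view.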

Without a change to the policy of how things are named in sysfs, the existing
user space code will just work, both inside and outside a network namespace.

Since the network namespace is available on the kobject it is easy to stuff
packets into the correct socket and user space code will just work.

Since uevent_helper is a limited and essentially legacy mechanism it makes sense
to only tell it about devices in the initial network namespace.

What clinches it for me is that even if network namespaces had names, if we don't
have a different view on different mounts I don't see how we could put network
devices in sysfs without breaking user space backwards compatibility.

Eric
