public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC]  Slimming down struct inode
@ 2006-06-09 23:50 Theodore Ts'o
  2006-06-10  0:24 ` Bernd Eckenfels
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Theodore Ts'o @ 2006-06-09 23:50 UTC (permalink / raw)
  To: linux-kernel


Since Linus has been complaining about how awful ext3's in-core inodes
are, and given that 70% of the space used by ext3's in-core inode is
coming from the include/linux/fs.h's struct inode structure, I decided
it would be good to see how we might be able to slim it down.  Slimming
down struct inode has the added benefit that it will help all
filesystems, so this is generic goodness.  Furthermore, depending on
which features you have compiled into the kernel, even slimming struct
inode by 12 bytes could result being able to pack more objects per page
in the slab cache.

So without further ado, here are some ideas of ways that we can slim
down struct inode:

1) Move i_blksize (optimal size for I/O, reported by the stat system
   call).  Is there any reason why this needs to be per-inode, instead
   of per-filesystem?

2) Move i_blkbits (blocksize for doing direct I/O in bits) to struct
   super.  Again, why is this per-inode?

3) Move i_pipe, i_bdev, and i_cdev into a union.  An inode cannot
    simultaneously be a pipe, block device, and character device at the
    same time.

4) i_state and i_flags are both 4-byte longs, but they only need to be
   2-byte shorts, and could be placed next to each other.

5) Nuke i_cindex.  This is only used by ieee1394's
   ieee_file_to_instance.  There must be another place where we can
   store this --- say, in a ieee1394-specific field in struct file?  Or
   maybe it can be derived some other way, but to chew up 4 bytes in
   i_cindex for all inodes just for ieee1394's benefit seems like the
   Wrong Thing(tm).

6) Separate out those elements which are only used if the inode is open
   into a separate data structure (call it "struct inode_state" for
   the sake of argument):

	i_flock, i_mapping, i_data, i_dnotify_mask, i_dnotify,
	inotify_watches, inotify_sem, i_state, dirtied_when,
	i_size_seqcount, i_mutex, i_alloc_sem

   This is the motherload.  Moving these fields out to a separate
   structure which is only allocated for inodes which are open will save
   a huge amount of memory.  But, of course, sweeping through all of the
   code which uses these variables to move them would be a major code
   change.  (We can do it initially with #define magic, but we will need
   to audit the code paths to see if it's always to safe to assume that
   inode is open before dereferencing the i_state pointer, or whether we
   need to check to see if it is valid first.)

The first four I think are fairly non-contentious, but it's worth
checking to see if anybody knows if there are filesystems for which the
block size changes on a per-inode basis (I hope not!).  There's a very
trivial way of fixing #5 by simply moving i_cindex into struct file, but
it may be possible to fix the ieee1394 layer so it doesn't need it at
all.

#6 is going to be the hard one, since it will involving touching the
largest amount of code.  But of course, the payoff will be quite big as
well.  Of course, memory is pretty cheap these days, which is probably
why we've ignored it until Linus started kvetching about the size of
ext3's in-core inodes....   and when I looked into it, most of it wasn't
even ext3's fault.  :-)

What do people think?  Is it worth putting together patches to do some
or all of the above?

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-09 23:50 [RFC] Slimming down struct inode Theodore Ts'o
@ 2006-06-10  0:24 ` Bernd Eckenfels
  2006-06-10  1:27 ` Al Viro
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Bernd Eckenfels @ 2006-06-10  0:24 UTC (permalink / raw)
  To: linux-kernel; +Cc: tytso

Theodore Ts'o <tytso@mit.edu> wrote:
> 3) Move i_pipe, i_bdev, and i_cdev into a union.  An inode cannot
>    simultaneously be a pipe, block device, and character device at the
>    same time.

Mabe a regular file inode atribute can be put in that union, too?

Gruss
Bernd



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-09 23:50 [RFC] Slimming down struct inode Theodore Ts'o
  2006-06-10  0:24 ` Bernd Eckenfels
@ 2006-06-10  1:27 ` Al Viro
  2006-06-10  1:56   ` Theodore Tso
  2006-06-10 10:48 ` Jan Engelhardt
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Al Viro @ 2006-06-10  1:27 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel

On Fri, Jun 09, 2006 at 07:50:08PM -0400, Theodore Ts'o wrote:
> 4) i_state and i_flags are both 4-byte longs, but they only need to be
>    2-byte shorts, and could be placed next to each other.

void wake_up_inode(struct inode *inode)
{
        /*
         * Prevent speculative execution through spin_unlock(&inode_lock);
         */
        smp_mb();
        wake_up_bit(&inode->i_state, __I_LOCK);
}

> 5) Nuke i_cindex.  This is only used by ieee1394's
>    ieee_file_to_instance.  There must be another place where we can
>    store this --- say, in a ieee1394-specific field in struct file?  Or
>    maybe it can be derived some other way, but to chew up 4 bytes in
>    i_cindex for all inodes just for ieee1394's benefit seems like the
>    Wrong Thing(tm).

No, it's actually the right thing for _all_ char devices.  And it's used
before we get a struct file.  If anything, ->i_rdev should go long-term...

> 6) Separate out those elements which are only used if the inode is open
>    into a separate data structure (call it "struct inode_state" for
>    the sake of argument):
> 
> 	i_flock, i_mapping, i_data, i_dnotify_mask, i_dnotify,
> 	inotify_watches, inotify_sem, i_state, dirtied_when,
> 	i_size_seqcount, i_mutex, i_alloc_sem
> 
>    This is the motherload.  Moving these fields out to a separate
>    structure which is only allocated for inodes which are open will save
>    a huge amount of memory.  But, of course, sweeping through all of the
>    code which uses these variables to move them would be a major code
>    change.  (We can do it initially with #define magic, but we will need
>    to audit the code paths to see if it's always to safe to assume that
>    inode is open before dereferencing the i_state pointer, or whether we
>    need to check to see if it is valid first.)

It is not safe e.g. for ->i_mutex, since that puppy is used not only when
there's an opened file over this inode (or, in fact, before any method
had been called for this inode).

It is _certainly_ not safe for ->i_state (take a look at fs/inode.c).

It is not safe for ->i_data, unless you are willing to dispose of page
cache on close (even leaving aside such things as directories).

No comments on idiotify fields - IIRC, they can also be used before any
->open() on the inode in question.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-10  1:27 ` Al Viro
@ 2006-06-10  1:56   ` Theodore Tso
  2006-06-10  6:24     ` Stefan Richter
  0 siblings, 1 reply; 28+ messages in thread
From: Theodore Tso @ 2006-06-10  1:56 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel

On Sat, Jun 10, 2006 at 02:27:57AM +0100, Al Viro wrote:
> > 5) Nuke i_cindex.  This is only used by ieee1394's
> >    ieee_file_to_instance.  There must be another place where we can
> >    store this --- say, in a ieee1394-specific field in struct file?  Or
> >    maybe it can be derived some other way, but to chew up 4 bytes in
> >    i_cindex for all inodes just for ieee1394's benefit seems like the
> >    Wrong Thing(tm).
> 
> No, it's actually the right thing for _all_ char devices.  And it's used
> before we get a struct file.  If anything, ->i_rdev should go long-term...

i_cindex is set by fs/char_dev.h, and the only user of that field (I
grepped the entire sources) is ./drivers/ieee1394/ieee1394_core.h:

   /* return the index (within a minor number block) of a file */
   static inline unsigned char ieee1394_file_to_instance(struct file *file)
   {
	return file->f_dentry->d_inode->i_cindex;
   }


> > 6) Separate out those elements which are only used if the inode is open
> >    into a separate data structure (call it "struct inode_state" for
> >    the sake of argument):
> > 
> > 	i_flock, i_mapping, i_data, i_dnotify_mask, i_dnotify,
> > 	inotify_watches, inotify_sem, i_state, dirtied_when,
> > 	i_size_seqcount, i_mutex, i_alloc_sem
> > 
>
> It is not safe e.g. for ->i_mutex, since that puppy is used not only when
> there's an opened file over this inode (or, in fact, before any method
> had been called for this inode).
> 
> It is _certainly_ not safe for ->i_state (take a look at fs/inode.c).
> 
> It is not safe for ->i_data, unless you are willing to dispose of page
> cache on close (even leaving aside such things as directories).
> 
> No comments on idiotify fields - IIRC, they can also be used before any
> ->open() on the inode in question.

You're right, I added some fields which can't be moved out of struct
inode, but the basic point remains; it should be possible to shrink
struct inode by moving fields to an inode_state structure.

As far as i/dnotify fields are concerned, yes, they can be used on an
unopened inode, but we could also simply consider a file that is being
watched by i/dnotify as one that is "opened" from the point of view of
needing a inode_state structure to be allocated.  Depending on how
many files a typical GUI desktop is going to watch, this might or
might not be good idea.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-10  1:56   ` Theodore Tso
@ 2006-06-10  6:24     ` Stefan Richter
  0 siblings, 0 replies; 28+ messages in thread
From: Stefan Richter @ 2006-06-10  6:24 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Al Viro, linux-kernel, linux1394-devel

Theodore Tso wrote at lkml:
[saving space in the numerously instantiated struct inode]
> On Sat, Jun 10, 2006 at 02:27:57AM +0100, Al Viro wrote:
>>>5) Nuke i_cindex.  This is only used by ieee1394's
>>>   ieee_file_to_instance.  There must be another place where we can
>>>   store this --- say, in a ieee1394-specific field in struct file?  Or
>>>   maybe it can be derived some other way, but to chew up 4 bytes in
>>>   i_cindex for all inodes just for ieee1394's benefit seems like the
>>>   Wrong Thing(tm).
>>
>>No, it's actually the right thing for _all_ char devices.  And it's used
>>before we get a struct file.  If anything, ->i_rdev should go long-term...
> 
> i_cindex is set by fs/char_dev.h, and the only user of that field (I
> grepped the entire sources) is ./drivers/ieee1394/ieee1394_core.h:
> 
>    /* return the index (within a minor number block) of a file */
>    static inline unsigned char ieee1394_file_to_instance(struct file *file)
>    {
> 	return file->f_dentry->d_inode->i_cindex;
>    }
...

Also, the places where ieee1394_file_to_instance() is used appear not to 
be fast paths. (It's the struct file_operations.open() hook of two 
protocol drivers.) I.e. there is no harm in replacing it by whatever 
more expensive lookup method.

Dv1394 calls ieee1394_file_to_instance() from within spin_lock_irqsave() 
/unlock_irqrestore(), but it can easily moved out of it if need be.
-- 
Stefan Richter
-=====-=-==- -==- -=-=-
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-09 23:50 [RFC] Slimming down struct inode Theodore Ts'o
  2006-06-10  0:24 ` Bernd Eckenfels
  2006-06-10  1:27 ` Al Viro
@ 2006-06-10 10:48 ` Jan Engelhardt
  2006-06-10 15:04   ` Jeff Garzik
  2006-06-13  4:32   ` Nathan Scott
  2006-06-10 11:03 ` Tomasz Torcz
  2006-06-15  0:16 ` Brian F. G. Bidulock
  4 siblings, 2 replies; 28+ messages in thread
From: Jan Engelhardt @ 2006-06-10 10:48 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel


>1) Move i_blksize (optimal size for I/O, reported by the stat system
>   call).  Is there any reason why this needs to be per-inode, instead
>   of per-filesystem?
>
I do not know much about XFS's realtime feature, but from what I have read 
about it so far, it sounds to be a potential source where i_blksize might 
differ from the regular filesystem. A guess, though.


>3) Move i_pipe, i_bdev, and i_cdev into a union.  An inode cannot
>    simultaneously be a pipe, block device, and character device at the
>    same time.
>
Ah, finally this is being tackled. Let's recall Bernd's post:

  "Mabe a regular file inode atribute can be put in that union, too?"

which made me think about members in struct inode that are only useful for 
regular files. Is i_blkbits/i_blksize relevant for block special files? I 
doubt that, since (as you mentioned above) it is (=should be) a 
per-filesystem option, but block devices don't have to do anything with 
filesystems in this respect. IOW,

  struct ionde {
      ...
      union {
          struct regular_file {
              unsigned int i_blkbits;
              unsigned long i_blksize;
              ...
          };
          ...
          struct block_device *i_bdev;
          struct cdev *i_cdev;
      };
      ...
  };              

something like that. (Yes, i_blkbits/size should go into the sb, but maybe 
there are other regular-files-only members.)


>6) Separate out those elements which are only used if the inode is open
>   into a separate data structure (call it "struct inode_state" for
>   the sake of argument):
>
>	i_flock, i_mapping, i_data, i_dnotify_mask, i_dnotify,
>	inotify_watches, inotify_sem, i_state, dirtied_when,
>	i_size_seqcount, i_mutex, i_alloc_sem
>
>   This is the motherload.  Moving these fields out to a separate
>   structure which is only allocated for inodes which are open will save
>   a huge amount of memory.  But, of course, sweeping through all of the
>   code which uses these variables to move them would be a major code
>   change.  (We can do it initially with #define magic, but we will need
>   to audit the code paths to see if it's always to safe to assume that
>   inode is open before dereferencing the i_state pointer, or whether we
>   need to check to see if it is valid first.)
>
Maybe doing it one at a time.


How long is inotify going to remain around?


>#6 is going to be the hard one, since it will involving touching the
>largest amount of code.  But of course, the payoff will be quite big as
>well.  Of course, memory is pretty cheap these days, which is probably
>
Cheap is relative :)

>What do people think?  Is it worth putting together patches to do some
>or all of the above?


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-09 23:50 [RFC] Slimming down struct inode Theodore Ts'o
                   ` (2 preceding siblings ...)
  2006-06-10 10:48 ` Jan Engelhardt
@ 2006-06-10 11:03 ` Tomasz Torcz
  2006-06-10 15:06   ` Jeff Garzik
  2006-06-15  0:16 ` Brian F. G. Bidulock
  4 siblings, 1 reply; 28+ messages in thread
From: Tomasz Torcz @ 2006-06-10 11:03 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 733 bytes --]

On Fri, Jun 09, 2006 at 07:50:08PM -0400, Theodore Ts'o wrote:
> So without further ado, here are some ideas of ways that we can slim
> down struct inode:
> 
> 1) Move i_blksize (optimal size for I/O, reported by the stat system
>    call).  Is there any reason why this needs to be per-inode, instead
>    of per-filesystem?
> 
> 2) Move i_blkbits (blocksize for doing direct I/O in bits) to struct
>    super.  Again, why is this per-inode?

  ZFS filesystem uses dynamic, per-file blocksizes. Some Linux
filesystem may implement something like this in order to be called
"modern".

-- 
Tomasz Torcz                 "God, root, what's the difference?"
zdzichu@irc.-nie.spam-.pl         "God is more forgiving."


[-- Attachment #2: Type: application/pgp-signature, Size: 229 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-10 10:48 ` Jan Engelhardt
@ 2006-06-10 15:04   ` Jeff Garzik
  2006-06-13  4:35     ` Nathan Scott
  2006-06-13  4:32   ` Nathan Scott
  1 sibling, 1 reply; 28+ messages in thread
From: Jeff Garzik @ 2006-06-10 15:04 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Theodore Ts'o, linux-kernel

Jan Engelhardt wrote:
> I do not know much about XFS's realtime feature, but from what I have read 


I _think_ people are referring to GRIO when they mention that:

http://www.cepba.upc.es/docs/sgi_doc/SGI_Admin/books/IA_DiskFiles/sgi_html/ch08.html

	Jeff



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-10 11:03 ` Tomasz Torcz
@ 2006-06-10 15:06   ` Jeff Garzik
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff Garzik @ 2006-06-10 15:06 UTC (permalink / raw)
  To: Theodore Ts'o, linux-kernel

Tomasz Torcz wrote:
> On Fri, Jun 09, 2006 at 07:50:08PM -0400, Theodore Ts'o wrote:
>> So without further ado, here are some ideas of ways that we can slim
>> down struct inode:
>>
>> 1) Move i_blksize (optimal size for I/O, reported by the stat system
>>    call).  Is there any reason why this needs to be per-inode, instead
>>    of per-filesystem?
>>
>> 2) Move i_blkbits (blocksize for doing direct I/O in bits) to struct
>>    super.  Again, why is this per-inode?
> 
>   ZFS filesystem uses dynamic, per-file blocksizes. Some Linux
> filesystem may implement something like this in order to be called
> "modern".

Yep, Sun stole my buckets idea <j/k>  I think ZFS calls them 
"meta-slabs" or somesuch.

See what I posted in the 'continuous inodes' sub-thread.

	Jeff




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-10 10:48 ` Jan Engelhardt
  2006-06-10 15:04   ` Jeff Garzik
@ 2006-06-13  4:32   ` Nathan Scott
  2006-06-13 14:00     ` Avi Kivity
  1 sibling, 1 reply; 28+ messages in thread
From: Nathan Scott @ 2006-06-13  4:32 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Engelhardt; +Cc: linux-kernel

On Sat, Jun 10, 2006 at 12:48:27PM +0200, Jan Engelhardt wrote:
> 
> >1) Move i_blksize (optimal size for I/O, reported by the stat system
> >   call).  Is there any reason why this needs to be per-inode, instead
> >   of per-filesystem?

Sorry, missed this on the first reading - yes, there are reasons
for doing this per inode, as Jan points out...

> I do not know much about XFS's realtime feature, but from what I have read 
> about it so far, it sounds to be a potential source where i_blksize might 
> differ from the regular filesystem. A guess, though.

Such a change would would indeed break XFS, in exactly the way you
suggest Jan - the realtime subvolume does typically use a different
blocksize from the data subvolume (the realtime extent size is used,
and this can be set per-inode too), and there would now be no way to
distinguish this preferred IO size difference.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-10 15:04   ` Jeff Garzik
@ 2006-06-13  4:35     ` Nathan Scott
  0 siblings, 0 replies; 28+ messages in thread
From: Nathan Scott @ 2006-06-13  4:35 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Jan Engelhardt, Theodore Ts'o, linux-kernel

On Sat, Jun 10, 2006 at 11:04:45AM -0400, Jeff Garzik wrote:
> Jan Engelhardt wrote:
> > I do not know much about XFS's realtime feature, but from what I have read 
> 
> I _think_ people are referring to GRIO when they mention that:
> 

No, thats something else.  The realtime feature is a mechanism
XFS provides for separating out data IOs for specific files from
metadata/log IOs.  This is CONFIG_XFS_RT in the kernel sources.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-13  4:32   ` Nathan Scott
@ 2006-06-13 14:00     ` Avi Kivity
  2006-06-13 17:44       ` Theodore Tso
  2006-06-13 22:41       ` Nathan Scott
  0 siblings, 2 replies; 28+ messages in thread
From: Avi Kivity @ 2006-06-13 14:00 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Theodore Ts'o, Jan Engelhardt, linux-kernel

Nathan Scott wrote:
>
> Such a change would would indeed break XFS, in exactly the way you
> suggest Jan - the realtime subvolume does typically use a different
> blocksize from the data subvolume (the realtime extent size is used,
> and this can be set per-inode too), and there would now be no way to
> distinguish this preferred IO size difference.
>

It can be made into an inode operation:

    if (inode->i_ops->getblksize)
         return inode->i_ops->getblksize(inode);
    else
         return inode->i_sb->s_blksize;

Trading some efficiency for space.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-13 14:00     ` Avi Kivity
@ 2006-06-13 17:44       ` Theodore Tso
  2006-06-13 18:08         ` Avi Kivity
  2006-06-13 22:41       ` Nathan Scott
  1 sibling, 1 reply; 28+ messages in thread
From: Theodore Tso @ 2006-06-13 17:44 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Nathan Scott, Jan Engelhardt, linux-kernel

On Tue, Jun 13, 2006 at 05:00:59PM +0300, Avi Kivity wrote:
> 
> It can be made into an inode operation:
> 
>    if (inode->i_ops->getblksize)
>         return inode->i_ops->getblksize(inode);
>    else
>         return inode->i_sb->s_blksize;
> 
> Trading some efficiency for space.

Yep, that was what I was planning on doing....

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-13 17:44       ` Theodore Tso
@ 2006-06-13 18:08         ` Avi Kivity
  2006-06-13 20:10           ` Jan Engelhardt
  0 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2006-06-13 18:08 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Nathan Scott, Jan Engelhardt, linux-kernel

Theodore Tso wrote:
>
> On Tue, Jun 13, 2006 at 05:00:59PM +0300, Avi Kivity wrote:
> >
> > It can be made into an inode operation:
> >
> >    if (inode->i_ops->getblksize)
> >         return inode->i_ops->getblksize(inode);
> >    else
> >         return inode->i_sb->s_blksize;
> >
> > Trading some efficiency for space.
>
> Yep, that was what I was planning on doing....
>

Maybe

    if (inode->i_sb->s_blksize)
        return inode->i_sb->s_blksize;
    else
        ...

is a tiny little bit faster...

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-13 18:08         ` Avi Kivity
@ 2006-06-13 20:10           ` Jan Engelhardt
  2006-06-13 20:25             ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Engelhardt @ 2006-06-13 20:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Theodore Tso, Nathan Scott, linux-kernel

>> > if (inode->i_ops->getblksize)
>> >     return inode->i_ops->getblksize(inode);
>> > else
>> > return inode->i_sb->s_blksize;
>> > 
>> > Trading some efficiency for space.
>> 
>> Yep, that was what I was planning on doing....
>> 
>
> Maybe
>
> if (inode->i_sb->s_blksize)
>   return inode->i_sb->s_blksize;
> else
> ...
>
> is a tiny little bit faster...
>

The compiler will anyway pick the one it thinks is better by itself.
Influence can be taken using likely/unlikely of course.


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-13 20:10           ` Jan Engelhardt
@ 2006-06-13 20:25             ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2006-06-13 20:25 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Theodore Tso, Nathan Scott, linux-kernel

Jan Engelhardt wrote:
>>>> if (inode->i_ops->getblksize)
>>>>     return inode->i_ops->getblksize(inode);
>>>> else
>>>> return inode->i_sb->s_blksize;
>>>>
>>>> Trading some efficiency for space.
>>>>         
>>> Yep, that was what I was planning on doing....
>>>
>>>       
>> Maybe
>>
>> if (inode->i_sb->s_blksize)
>>   return inode->i_sb->s_blksize;
>> else
>> ...
>>
>> is a tiny little bit faster...
>>
>>     
>
> The compiler will anyway pick the one it thinks is better by itself.
> Influence can be taken using likely/unlikely of course.
>   

The compiler cannot infer that (inode->i_ops->getblksize == 0) is 
equivalent to (inode->i_sb->s_blksize != 0).

Maybe someday the language will allow us to specify it, but not today.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-13 14:00     ` Avi Kivity
  2006-06-13 17:44       ` Theodore Tso
@ 2006-06-13 22:41       ` Nathan Scott
  2006-06-14 10:29         ` Nikita Danilov
  1 sibling, 1 reply; 28+ messages in thread
From: Nathan Scott @ 2006-06-13 22:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Theodore Ts'o, Jan Engelhardt, linux-kernel

On Tue, Jun 13, 2006 at 05:00:59PM +0300, Avi Kivity wrote:
> Nathan Scott wrote:
> > Such a change would would indeed break XFS, in exactly the way you

Oh, I should clarify - the suggestion of using sb->s_blocksize/
s_blocksize_bits was what I meant by "would break XFS".

> > suggest Jan - the realtime subvolume does typically use a different
> > blocksize from the data subvolume (the realtime extent size is used,
> > and this can be set per-inode too), and there would now be no way to
> > distinguish this preferred IO size difference.
> 
> It can be made into an inode operation:

*nod* - that'd work fine for our needs here.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-13 22:41       ` Nathan Scott
@ 2006-06-14 10:29         ` Nikita Danilov
  2006-06-14 21:50           ` Nathan Scott
  2006-06-14 23:27           ` Jan Engelhardt
  0 siblings, 2 replies; 28+ messages in thread
From: Nikita Danilov @ 2006-06-14 10:29 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Theodore Ts'o, Jan Engelhardt, linux-kernel

Nathan Scott writes:
 > On Tue, Jun 13, 2006 at 05:00:59PM +0300, Avi Kivity wrote:
 > > Nathan Scott wrote:
 > > > Such a change would would indeed break XFS, in exactly the way you
 > 
 > Oh, I should clarify - the suggestion of using sb->s_blocksize/
 > s_blocksize_bits was what I meant by "would break XFS".
 > 
 > > > suggest Jan - the realtime subvolume does typically use a different
 > > > blocksize from the data subvolume (the realtime extent size is used,
 > > > and this can be set per-inode too), and there would now be no way to
 > > > distinguish this preferred IO size difference.
 > > 
 > > It can be made into an inode operation:
 > 
 > *nod* - that'd work fine for our needs here.

Sorry, but why this operation is needed? Generic code (in fs/*.c)
doesn't use ->i_blksize at all. If XFS wants to provide per-inode
st_blksize, all it has to do is to store preferred buffer size in its
file system specific inode (struct xfs_inode), and use something
different from generic_fillattr() as its ->i_op->getattr() callback
(xfs_vn_getattr()).

 > 
 > cheers.
 > 
 > -- 
 > Nathan

Nikita.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-14 10:29         ` Nikita Danilov
@ 2006-06-14 21:50           ` Nathan Scott
  2006-06-15  5:49             ` Theodore Tso
  2006-06-14 23:27           ` Jan Engelhardt
  1 sibling, 1 reply; 28+ messages in thread
From: Nathan Scott @ 2006-06-14 21:50 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Theodore Ts'o, Jan Engelhardt, linux-kernel

Hi Nikita,

On Wed, Jun 14, 2006 at 02:29:39PM +0400, Nikita Danilov wrote:
> Sorry, but why this operation is needed? Generic code (in fs/*.c)
> doesn't use ->i_blksize at all. If XFS wants to provide per-inode
> st_blksize, all it has to do is to store preferred buffer size in its
> file system specific inode (struct xfs_inode), and use something
> different from generic_fillattr() as its ->i_op->getattr() callback
> (xfs_vn_getattr()).

We already do this.  The original questions were related to whether
i_blksize and i_blkbits need to be per-inode or per-filesystem, and
thats what I was trying to answer...

| 1) Move i_blksize (optimal size for I/O, reported by the stat system
|   call).  Is there any reason why this needs to be per-inode, instead
|   of per-filesystem?
| 2) Move i_blkbits (blocksize for doing direct I/O in bits) to struct
|    super.  Again, why is this per-inode?

As to whether a new inode operation is useful/needed - *shrug* - not
really my call, I was saying we can work with whatever ends up being
the final solution, provided it keeps per-inode granularity.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-14 10:29         ` Nikita Danilov
  2006-06-14 21:50           ` Nathan Scott
@ 2006-06-14 23:27           ` Jan Engelhardt
  2006-06-15 10:09             ` Nikita Danilov
  1 sibling, 1 reply; 28+ messages in thread
From: Jan Engelhardt @ 2006-06-14 23:27 UTC (permalink / raw)
  To: Nikita Danilov; +Cc: Nathan Scott, Theodore Ts'o, Linux Kernel Mailing List

>
>Sorry, but why this operation is needed? Generic code (in fs/*.c)
>doesn't use ->i_blksize at all. If XFS wants to provide per-inode
>st_blksize, all it has to do is to store preferred buffer size in its
>file system specific inode (struct xfs_inode), and use something
>different from generic_fillattr() as its ->i_op->getattr() callback
>(xfs_vn_getattr()).
>
By the way, are there any significant userspace applications that use 
i_blksize/i_blkbits?


Jan Engelhardt
-- 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-09 23:50 [RFC] Slimming down struct inode Theodore Ts'o
                   ` (3 preceding siblings ...)
  2006-06-10 11:03 ` Tomasz Torcz
@ 2006-06-15  0:16 ` Brian F. G. Bidulock
  2006-06-15  4:43   ` Theodore Tso
  4 siblings, 1 reply; 28+ messages in thread
From: Brian F. G. Bidulock @ 2006-06-15  0:16 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel

Theodore,

On Fri, 09 Jun 2006, Theodore Ts'o wrote:
> 1) Move i_blksize (optimal size for I/O, reported by the stat system
>    call).  Is there any reason why this needs to be per-inode, instead
>    of per-filesystem?
> 
> 2) Move i_blkbits (blocksize for doing direct I/O in bits) to struct
>    super.  Again, why is this per-inode?

Have you considered NFS?

> 3) Move i_pipe, i_bdev, and i_cdev into a union.  An inode cannot
>     simultaneously be a pipe, block device, and character device at the
>     same time.

A STREAMS-based FIFO is both a (named) pipe and a character device at the
same time.  I would prefer if you did not merge i_pipe with i_cdev for this
reason.  In the current GPL'ed out of tree STREAMS implementation, i_pipe
is used to point to the Stream head (as the normal v_str pointer in the UNIX
vnode).  STREAMS-based FIFOs are the only instance in STREAMS where it
ventures outside its own filesystem (specfs) and adjusts inodes from other
filesystems.  This is also true for Linux native FIFOs (named pipes) that
use the i_pipe pointer in the filesystem in which they are named instead of
creating inodes within the pipefs.

I suppose that the other two permutations are correct:

  - a block device inode cannot also be a character device inode
  - a block device inode cannot also be a pipe

so at least i_cdev and i_bdev could be merged, however, you will need some
way to determine which actual object was attached to the union to allow the
object reference to be dropped when the inode is cleaned.  It might be better
to leave that one alone too, as any flag or mode that might be used could get
munged by a filesystem during the inode lifecycle causing incorrect reference
counts, or worse, an attempt to free the object against the wrong cache.

No comment on the rest.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-15  0:16 ` Brian F. G. Bidulock
@ 2006-06-15  4:43   ` Theodore Tso
  2006-06-15  8:27     ` Brian F. G. Bidulock
  0 siblings, 1 reply; 28+ messages in thread
From: Theodore Tso @ 2006-06-15  4:43 UTC (permalink / raw)
  To: linux-kernel

On Wed, Jun 14, 2006 at 06:16:27PM -0600, Brian F. G. Bidulock wrote:
> > 3) Move i_pipe, i_bdev, and i_cdev into a union.  An inode cannot
> >     simultaneously be a pipe, block device, and character device at the
> >     same time.
> 
> A STREAMS-based FIFO is both a (named) pipe and a character device at the
> same time.  I would prefer if you did not merge i_pipe with i_cdev for this
> reason.  In the current GPL'ed out of tree STREAMS implementation, i_pipe
> is used to point to the Stream head (as the normal v_str pointer in the UNIX
> vnode).  

I'm not particularly sympathetic to out of tree implementations,
especially one which is as (NOT!) likely to be merged as STREAMS
support.  Out of tree patches can always patch struct inode to add all
the bloat they want.  Also, it souinds like you're not usually using
i_pipe as a true pointer to a struct pipe_inode_info, but rather as
kludged location to stash your v_str pointer.  Why not just have your
STREAMS implementation patch include/linux/fs.h to add a v_str pointer?

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-14 21:50           ` Nathan Scott
@ 2006-06-15  5:49             ` Theodore Tso
  2006-06-15  7:01               ` Nathan Scott
  0 siblings, 1 reply; 28+ messages in thread
From: Theodore Tso @ 2006-06-15  5:49 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Nikita Danilov, Jan Engelhardt, linux-kernel

On Thu, Jun 15, 2006 at 07:50:19AM +1000, Nathan Scott wrote:
> As to whether a new inode operation is useful/needed - *shrug* - not
> really my call, I was saying we can work with whatever ends up being
> the final solution, provided it keeps per-inode granularity.

XFS should be return a per-inode value for st_blksize by simply setting
kstat->st_blksize in linvfs_getattr() found in fs/xfs/linux-2.6/xfs_iops.c.

In the inode diet patches that I'm working on, I've just deleted the
calls to set i_blksize in the XFS code, which will cause st_blksize to
default to PAGE_CACHE_SIZE (unless the filesystem overrides the value
found in sb->s_blksize).  I tried to figure out mind-twisting
conversion of the multiple data structures hanging off of each other
in the Irix/Linux compatibility layer (yuck, I still can't believe
this got into mainline), but since I didn't have the stomach for it,
I'll let the XFS maintainers decide how put in per-inode st_blksize
values --- but it is definitely doable.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-15  5:49             ` Theodore Tso
@ 2006-06-15  7:01               ` Nathan Scott
  2006-06-15  8:46                 ` Brian F. G. Bidulock
  0 siblings, 1 reply; 28+ messages in thread
From: Nathan Scott @ 2006-06-15  7:01 UTC (permalink / raw)
  To: Theodore Tso, Nikita Danilov, Jan Engelhardt, linux-kernel

Hi Ted,

On Thu, Jun 15, 2006 at 01:49:31AM -0400, Theodore Tso wrote:
> On Thu, Jun 15, 2006 at 07:50:19AM +1000, Nathan Scott wrote:
> > As to whether a new inode operation is useful/needed - *shrug* - not
> > really my call, I was saying we can work with whatever ends up being
> > the final solution, provided it keeps per-inode granularity.
> 
> XFS should be return a per-inode value for st_blksize by simply setting
> kstat->st_blksize in linvfs_getattr() found in fs/xfs/linux-2.6/xfs_iops.c.

Hmm, you're looking at an older kernel there I guess - that routine
doesn't exist anymore.  Look at -mm, its the most recent version.

> in the Irix/Linux compatibility layer (yuck, I still can't believe
> this got into mainline),

There is no IRIX/Linux compatibility layer, you're misunderstanding
the code (which is understandable, its erm a bit crufty in places).
There's a fair bit of history there too - *shrug* - its on the improve
but its a complex body of code and takes time to mutate.  We've tended
to focus on real problems and needed features in the past, moreso than
cosmetics, but thats getting attention nowadays too.

> I'll let the XFS maintainers decide how put in per-inode st_blksize
> values --- but it is definitely doable.

Yep, agreed.

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-15  4:43   ` Theodore Tso
@ 2006-06-15  8:27     ` Brian F. G. Bidulock
  0 siblings, 0 replies; 28+ messages in thread
From: Brian F. G. Bidulock @ 2006-06-15  8:27 UTC (permalink / raw)
  To: Theodore Tso, linux-kernel

Theodore,

On Thu, 15 Jun 2006, Theodore Tso wrote:
> 
> I'm not particularly sympathetic to out of tree implementations,

I believe you asked for comments; not proposed to offer sympathy.

> especially one which is as (NOT!) likely to be merged as STREAMS
> support.  Out of tree patches can always patch struct inode to add all
> the bloat they want.

Bloat that nobody complained about for 10 years or so...  Sounds pretty
antagonistic to me.  Streams are pretty basic character devices.

> Also, it souinds like you're not usually using
> i_pipe as a true pointer to a struct pipe_inode_info, but rather as
> kludged location to stash your v_str pointer.

It points to a STREAMS-based FIFO which, of course, uses a Stream head
structure instead of a pipe_inode_info structure.  Is is used in the
same fashion as Linux FIFOs use the pointer: to attach information to
an inode in a foreign filesystem.  generic_ip and such does not help
here because they are already used by the foreign filesystem.

The cdev structure used to have a private pointer (cd_private) that
could be used to the same effect, but it disappeared some time ago when
char_device was reworked into cdev.  block_device still has a bd_private
pointer.  Merging i_pipe will remove the ability for a character
device driver to association a private pointer with the inode at open
to obtain FIFO-like behaviour (where the device is associated with the
specific inode rather than the device number), reducing the flexibility
of the Linux VFS.

i_pipe is not necessary for Unnamed FIFOs or pipes in Linux (those
inodes are allocated from the pipefs and could very well use the
generic_ip pointer.  i_pipe is only necessary for named FIFOs because
these are attached to inodes belonging to filesystems foreign to the
pipefs.  It is a slim special case of file on file mounting.  Regardless
of STREAMS I think that it deserves to be separate from i_cdev and
i_bdev.

All other basic character device mechanisms either allocate their own
inodes or hang their private data off of the open file pointer.  As
do STREAMS, which, aside from STREAMS-based FIFOs are basic character
devices.


> Why not just have your
> STREAMS implementation patch include/linux/fs.h to add a v_str pointer?

Because it does not patch the kernel but simply loads as a set of
GPL'ed kernel modules.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-15  7:01               ` Nathan Scott
@ 2006-06-15  8:46                 ` Brian F. G. Bidulock
  2006-06-15 10:20                   ` Nathan Scott
  0 siblings, 1 reply; 28+ messages in thread
From: Brian F. G. Bidulock @ 2006-06-15  8:46 UTC (permalink / raw)
  To: Nathan Scott; +Cc: linux-kernel

Nathan,

On Thu, 15 Jun 2006, Nathan Scott wrote:
> 
> There is no IRIX/Linux compatibility layer, you're misunderstanding
> the code (which is understandable, its erm a bit crufty in places).

You gotta be kidding.  It does everything in terms of an SVR 4 VFS
vnode and then converts that to Linux VFS on dentries/inodes.  It was
obviously built by stuffing the Linux VFS under SVR 4 VFS code and
even documents the code as such.  Things like:

| /*
|  * XFS arguments structure, constructed from the arguments we
|  * are passed via the mount system call.
|  *
|  * NOTE: The mount system call is handled differently between
|  * Linux and IRIX.  In IRIX we worked work with a binary data
|  * structure coming in across the syscall interface from user
|  * space (the mount userspace knows about each filesystem type
|  * and the set of valid options for it, and converts the users
|  * argument string into a binary structure _before_ making the
|  * system call), and the ABI issues that this implies.
|  *
|  * In Linux, we are passed a comma separated set of options;
|  * ie. a NULL terminated string of characters.  Userspace mount
|  * code does not have any knowledge of mount options expected by
|  * each filesystem type and so each filesystem parses its mount
|  * options in kernel space.
|  *
|  * For the Linux port, we kept this structure pretty much intact
|  * and use it internally (because the existing code groks it).
|  */
| struct xfs_mount_args {
| 	int	flags;		/* flags -> see XFSMNT_... macros below */
| 	int	flags2;		/* flags -> see XFSMNT2_... macros below */
| 	int	logbufs;	/* Number of log buffers, -1 to default */
| 	int	logbufsize;	/* Size of log buffers, -1 to default */
| 	char	fsname[MAXNAMELEN+1];	/* data device name */
| 	char	rtname[MAXNAMELEN+1];	/* realtime device filename */
| 	char	logname[MAXNAMELEN+1];	/* journal device filename */
| 	char	mtpt[MAXNAMELEN+1];	/* filesystem mount point */
| 	int	sunit;		/* stripe unit (BBs) */
| 	int	swidth;		/* stripe width (BBs), multiple of sunit */
| 	uchar_t iosizelog;	/* log2 of the preferred I/O size */
| 	int	ihashsize;	/* inode hash table size (buckets) */
| };

No... No compatibility layer there.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-14 23:27           ` Jan Engelhardt
@ 2006-06-15 10:09             ` Nikita Danilov
  0 siblings, 0 replies; 28+ messages in thread
From: Nikita Danilov @ 2006-06-15 10:09 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: Nathan Scott, Theodore Ts'o, Linux Kernel Mailing List

Jan Engelhardt writes:
 > >
 > >Sorry, but why this operation is needed? Generic code (in fs/*.c)
 > >doesn't use ->i_blksize at all. If XFS wants to provide per-inode
 > >st_blksize, all it has to do is to store preferred buffer size in its
 > >file system specific inode (struct xfs_inode), and use something
 > >different from generic_fillattr() as its ->i_op->getattr() callback
 > >(xfs_vn_getattr()).
 > >
 > By the way, are there any significant userspace applications that use 
 > i_blksize/i_blkbits?
 > 

cp(1), db(3) or any other using st_blksize field of struct statbuf to
select the size of IO buffer.

 > 
 > Jan Engelhardt
 > -- 

Nikita.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC]  Slimming down struct inode
  2006-06-15  8:46                 ` Brian F. G. Bidulock
@ 2006-06-15 10:20                   ` Nathan Scott
  0 siblings, 0 replies; 28+ messages in thread
From: Nathan Scott @ 2006-06-15 10:20 UTC (permalink / raw)
  To: linux-kernel

On Thu, Jun 15, 2006 at 02:46:06AM -0600, Brian F. G. Bidulock wrote:
> On Thu, 15 Jun 2006, Nathan Scott wrote:
> > There is no IRIX/Linux compatibility layer, you're misunderstanding
> > the code (which is understandable, its erm a bit crufty in places).
> 
> You gotta be kidding.

No, go look at some IRIX headers, instead of just believing what
people tell you...

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2006-06-15 10:21 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-06-09 23:50 [RFC] Slimming down struct inode Theodore Ts'o
2006-06-10  0:24 ` Bernd Eckenfels
2006-06-10  1:27 ` Al Viro
2006-06-10  1:56   ` Theodore Tso
2006-06-10  6:24     ` Stefan Richter
2006-06-10 10:48 ` Jan Engelhardt
2006-06-10 15:04   ` Jeff Garzik
2006-06-13  4:35     ` Nathan Scott
2006-06-13  4:32   ` Nathan Scott
2006-06-13 14:00     ` Avi Kivity
2006-06-13 17:44       ` Theodore Tso
2006-06-13 18:08         ` Avi Kivity
2006-06-13 20:10           ` Jan Engelhardt
2006-06-13 20:25             ` Avi Kivity
2006-06-13 22:41       ` Nathan Scott
2006-06-14 10:29         ` Nikita Danilov
2006-06-14 21:50           ` Nathan Scott
2006-06-15  5:49             ` Theodore Tso
2006-06-15  7:01               ` Nathan Scott
2006-06-15  8:46                 ` Brian F. G. Bidulock
2006-06-15 10:20                   ` Nathan Scott
2006-06-14 23:27           ` Jan Engelhardt
2006-06-15 10:09             ` Nikita Danilov
2006-06-10 11:03 ` Tomasz Torcz
2006-06-10 15:06   ` Jeff Garzik
2006-06-15  0:16 ` Brian F. G. Bidulock
2006-06-15  4:43   ` Theodore Tso
2006-06-15  8:27     ` Brian F. G. Bidulock

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox