linux-fsdevel.vger.kernel.org archive mirror
* Re: Question about XFS_MAXINUMBER
       [not found]       ` <20180318230259.GJ7000@dastard>
@ 2018-03-19  4:03         ` Amir Goldstein
  2018-03-19  8:42           ` Miklos Szeredi
  2018-03-20  1:47           ` Dave Chinner
  0 siblings, 2 replies; 10+ messages in thread
From: Amir Goldstein @ 2018-03-19  4:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs,
	linux-fsdevel

This thread has come to a point where I should have included fsdevel a
while ago,
so CCing fsdevel. For those interested in previous episodes:
https://marc.info/?l=linux-xfs&m=152120912822207&w=2

On Mon, Mar 19, 2018 at 1:02 AM, Dave Chinner <david@fromorbit.com> wrote:
> [....]
>
>> I should have mentioned that "foo" is a pure upper - a file that was created
>> as upper and let's suppose the real ino of "foo" in upper fs is 10.
>> And let's suppose that the real ino of "bar" on lower fs is also 10, which is
>> possible when lower fs is a different fs than upper fs.
>
> Ok, so to close the loop. The problem is that overlay has no inode
> number space of it's own, nor does it have any persistent inode
> number mapping scheme. Hence overlay has no way of providing users
> with a consistent, unique {dev,ino #} tuple to userspace when it's
> different directories lie on different filesystems.
>

Yes.

[...]
>> Because real pure upper inode and lower inode can have the same
>> inode number and we want to multiplex our way out of this collision.
>>
>> Note that we do NOT maintain a data structure for looking up used
>> lower/upper inode numbers, nor do we want to maintain a persistent
>> data structure for persistent overlay inode numbers that map to
>> real underlying inodes. AFAIK, aufs can use a small db for it's 'xino'
>> feature. This is something that we wish to avoid.
>
> So instead of maintaining your own data structure to provide the
> necessary guarantees, the solution is to steal bits from the
> underlying filesystem inode numbers on the assumption that they will
> never use them?
>

Well, it is not an assumption if the filesystem is inclined to publish
s_max_ino_bits, which is not that different in concept from publishing
s_maxbytes and s_max_links; those are also limitations of the current
kernel/superblock that could be lifted in the future.

> What happens when a user upgrades their kernel, the underlying fs
> changes all it's inode numbers because it's done some virtual
> mapping thing for, say, having different inode number ranges for
> separate mount namespaces? And so instead of having N bits of free
> inode number space before upgrade, it now has zero? How will overlay
> react to this sort of change, given it could expose duplicate inode
> numbers....

After a kernel upgrade, the filesystem would set s_max_ino_bits to 64
(or not set it at all), and then overlayfs would not use the high bits,
falling back to what it does today.
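To make the multiplexing idea concrete, here is a minimal userspace
sketch; the helper name and the fsid packing are illustrative only, not
the actual kernel patches:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch only: the underlying fs promises that its inode
 * numbers fit in max_ino_bits, and overlayfs packs a small fs/layer
 * index into the spare bits above that limit.
 */
uint64_t ovl_remap_ino(uint64_t real_ino, unsigned int fsid,
                       unsigned int max_ino_bits)
{
	if (max_ino_bits >= 64)
		return real_ino;	/* no spare bits: fall back */

	/* the promise the fs makes by publishing max_ino_bits */
	assert((real_ino >> max_ino_bits) == 0);

	return real_ino | ((uint64_t)fsid << max_ino_bits);
}
```

With a filesystem publishing 56 usable bits, a pure-upper inode 10
(fsid 0) and a lower inode 10 (fsid 1) map to distinct overlay inode
numbers; a filesystem publishing all 64 bits keeps today's colliding
behavior.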

But if we want to bring practical arguments from the container world
into the picture, IMO it is far more likely that existing container
solutions would benefit from overlayfs inode number multiplexing
than from inode number mapping by the filesystem for different
mount namespaces.

>
> Quite frankly, I think this "steal bits from the underlying
> filesystems" mechanism is a recipe for trouble. If you want play
> these games, you get to keep all the broken bits when filesystems
> change the number of available bits.
>

I don't see that as a problem. I would say there is a fair number of
users out there running containers with overlayfs.

Do you realize that the majority of those users are settling for things
like no directory renames and hardlinks breaking on copy up?
Those are "features" of overlayfs that have been fixed in recent kernels,
but are only now making their way into distro kernels and are not yet
enabled by container runtimes.

Container admins already choose the underlying filesystem consciously
to get the best from overlayfs, and I would expect that they will soon
be opting in for xfs+reflink because of that conscious choice. I would
be surprised if xfs ever decided to change the inode number address
space on a kernel upgrade without users opting in, but I should also
hope that xfs would at least leave users a choice to opt out of that
behavior, and that is what container admins would do.

Heck, for all I care, users could also opt in to unused inode bits
explicitly (e.g. -o inode56) if you are concerned about letting go
of those upper bits implicitly.
My patch set already provides the capability for users to declare
with overlay -o xino that enough upper bits are available (i.e. because
the user knows the underlying fs and its true practical limits). But the
feature will be much more useful if users don't have to do that.

> Given that overlay has a persistent inode numbering problem, why
> doesn't overlay just allocate and store it's own inode numbers and
> other required persistent state in an xattr?
>

First, this is not as simple as it sounds.
If you have a huge number of read-only files in multiple lower layers,
it makes no sense to scan them all on overlay mount to discover which
inode numbers are free to use, nor does it make sense to create
a persistent mapping for every lower file accessed in that case.
And there are other problematic factors with this sort of scheme.

Second, and this may be a revolutionary argument, I would like to
believe that we are all working together for a "greater good".
Sure, xfs developers strive to perfect and enhance xfs, and overlayfs
developers strive to perfect and enhance overlayfs.
But when there is an opportunity for synergy between subsystems,
one should consider the best solution as a whole, and IMHO
the solution of a filesystem declaring its already-unused inode bits
is the best solution as a whole. xfs is not required to declare
s_max_ino_bits for all eternity, only for this specific superblock
instance, in this specific kernel.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Question about XFS_MAXINUMBER
  2018-03-19  4:03         ` Question about XFS_MAXINUMBER Amir Goldstein
@ 2018-03-19  8:42           ` Miklos Szeredi
  2018-03-20  1:47           ` Dave Chinner
  1 sibling, 0 replies; 10+ messages in thread
From: Miklos Szeredi @ 2018-03-19  8:42 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Darrick J. Wong, linux-xfs, overlayfs,
	linux-fsdevel

On Mon, Mar 19, 2018 at 5:03 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Mar 19, 2018 at 1:02 AM, Dave Chinner <david@fromorbit.com> wrote:

>> Given that overlay has a persistent inode numbering problem, why
>> doesn't overlay just allocate and store it's own inode numbers and
>> other required persistent state in an xattr?
>
> First, this is not as simple as it sounds.
> If you have a huge number of readonly files in multiple lower layers,
> it makes no sense to scan them all on overlay mount to discover which
> inode numbers are free to use and it make no sense either to create
> a persistent mapping for every lower file accessed in that case.
> And there are other problematic factors with this sort of scheme.

Such as when all layers are read-only: where do we store the
persistent inode numbers in that case?

>
> Second, and this may be a revolutionary argument, I would like to
> believe that we are all working together for a "greater good".
> Sure, xfs developers strive to perfect and enhance xfs and overlayfs
> developers strive to perfect and enhance overlayfs.
> But when there is an opportunity for synergy between subsystems,
> one should consider the best solution as a whole and IMHO,
> the solution of filesystem declaring already unused ino bits
> is the best solution as a whole. xfs is not required to declare
> s_max_ino_bits for all eternity, only for this specific super block
> instance, in this specific kernel.

The "specific kernel" part requires clarification.  We do promise
backward compatibility when upgrading the kernel, and silently
increasing s_max_ino_bits on a kernel upgrade would break that
promise.  This could be backed by a feature flag, with unlimited use
as the default; people have learned to live with needing special
features for overlayfs.

And I do agree with Amir that the "mine all mine" philosophy isn't
necessarily the right one.  In normal cases overlayfs would use just
one or two bits of the inode number space.  While Amir's current patch
keeps the layer index in the spare bits, it is sufficient to hold an
"fs index" that is incremented whenever a new superblock is encountered
during enumeration of the layers.  The number of different fs instances
used to create an overlay is unlikely to be large, so for all
practical purposes a few (4-6) bits should be enough.
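The incrementing "fs index" allocation described above can be sketched
as follows (hypothetical helper, not the actual patch):

```c
#include <stddef.h>

#define OVL_MAX_FS 64

/*
 * Sketch only: while enumerating layers, hand out a small index per
 * distinct superblock, reusing the index when the same superblock
 * appears again for another layer.
 */
unsigned int ovl_get_fsid(const void *sb, const void *seen[OVL_MAX_FS],
			  unsigned int *nr_seen)
{
	for (unsigned int i = 0; i < *nr_seen; i++)
		if (seen[i] == sb)
			return i;	/* layer on an already-seen fs */

	seen[*nr_seen] = sb;		/* new fs instance: next index */
	return (*nr_seen)++;
}
```

Two layers backed by the same superblock share one index, so the spare
bits only need to cover the number of distinct filesystems, not the
number of layers.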

Thanks,
Miklos

* Re: Question about XFS_MAXINUMBER
  2018-03-19  4:03         ` Question about XFS_MAXINUMBER Amir Goldstein
  2018-03-19  8:42           ` Miklos Szeredi
@ 2018-03-20  1:47           ` Dave Chinner
  2018-03-20  6:29             ` Amir Goldstein
  2018-03-20  9:32             ` Miklos Szeredi
  1 sibling, 2 replies; 10+ messages in thread
From: Dave Chinner @ 2018-03-20  1:47 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs,
	linux-fsdevel

On Mon, Mar 19, 2018 at 06:03:30AM +0200, Amir Goldstein wrote:
> This thread has come to a point where I should have included fsdevel a
> while ago,
> so CCing fsdevel. For those interested in previous episodes:
> https://marc.info/?l=linux-xfs&m=152120912822207&w=2
> 
> On Mon, Mar 19, 2018 at 1:02 AM, Dave Chinner <david@fromorbit.com> wrote:
> > [....]
> >
> >> I should have mentioned that "foo" is a pure upper - a file that was created
> >> as upper and let's suppose the real ino of "foo" in upper fs is 10.
> >> And let's suppose that the real ino of "bar" on lower fs is also 10, which is
> >> possible when lower fs is a different fs than upper fs.
> >
> > Ok, so to close the loop. The problem is that overlay has no inode
> > number space of it's own, nor does it have any persistent inode
> > number mapping scheme. Hence overlay has no way of providing users
> > with a consistent, unique {dev,ino #} tuple to userspace when it's
> > different directories lie on different filesystems.
> >
> 
> Yes.
> 
> [...]
> >> Because real pure upper inode and lower inode can have the same
> >> inode number and we want to multiplex our way out of this collision.
> >>
> >> Note that we do NOT maintain a data structure for looking up used
> >> lower/upper inode numbers, nor do we want to maintain a persistent
> >> data structure for persistent overlay inode numbers that map to
> >> real underlying inodes. AFAIK, aufs can use a small db for it's 'xino'
> >> feature. This is something that we wish to avoid.
> >
> > So instead of maintaining your own data structure to provide the
> > necessary guarantees, the solution is to steal bits from the
> > underlying filesystem inode numbers on the assumption that they will
> > never use them?
> >
> 
> Well, it is not an assumption if filesystem is inclined to publish
> s_max_ino_bits, which is not that different in concept from publishing
> s_maxbytes and s_max_links, which are also limitations in current
> kernel/sb that could be lifted in the future.

It is different, because you're expecting to be able to publish
persistent user visible information based on it.

If we change s_max_ino_bits in the underlying filesystem, then
overlay inode numbers change, and that can cause all sorts of problems
with things like filehandles, backups that use dev/inode number
tuples to detect identical files, etc.  i.e. there's a heap of
downstream impact from changing inode numbers. If we have to
publish s_max_ino_bits to the VFS, we essentially fix the ABI of the
user-visible inode number the filesystem publishes. IOWs, we
effectively can't change it without breaking external users.

I suspect you don't realise we already expose the full 64 bit
inode number space completely to userspace through other ABIs, e.g.
the bulkstat ioctls. We've already got applications that use the XFS
inode number as a 64 bit value both to and from the kernel (e.g.
xfsdump, file handle encoding, etc), so the idea that we can now
take bits back from what we've already agreed to expose to userspace
is fraught with problems.

That's the problem I see here - it's not that we /can't/ implement
s_max_ino_bits, the problem is that once we publish it we can't
change it because it will cause random breakage of applications
using it. And because we've already effectively published it to
userspace applications as s_max_ino_bits = 64, there's no scope for
movement at all.

> Do you realize that the majority of those users are settling for things
> like: no directory rename, breaking hardlinks on copy up.
> Those are "features" of overlayfs that have been fixed in recent kernels,
> but only now on their way to distro kernels and not yet enabled
> by container runtimes.
> 
> Container admins already make the choice of underlying filesystem
> consciously to get the best from overlayfs and I would expect that
> they will soon be opting in for xfs+reflink because of that conscious
> choice. If ever xfs decides to change inode numbers address space
> on kernel upgrade without users opting in for it,

We've done this many times in the past. e.g. we changed the default
inode allocation policy from inode32 to inode64 back in 2012. That
means users, on kernel upgrade, silently went from 32 bit inodes to
64 bit inodes. We've done this because of the fact that the
*filesystem owns the entire inode number space* and as long as we
don't change individual inode numbers that users see for a specific
inode, we can do whatever we want inside that inode number space.

> > Given that overlay has a persistent inode numbering problem, why
> > doesn't overlay just allocate and store it's own inode numbers and
> > other required persistent state in an xattr?
> >
> 
> First, this is not as simple as it sounds.

Sure, just like s_max_ino_bits is not as simple as it sounds.

If we want to explicitly reserve part of the inode number space for
other layers to use for their own purposes, then we need to
explicitly and persistently support that in the underlying
filesystem. That means mkfs, repair, db, growfs, etc all need to
understand that inode numbers have a size limit and do the right
thing...

That makes it an opt-in configuration that we can test and support
without having to care about overlay implementations or backwards
compatibility across applications on existing filesystems.

> Second, and this may be a revolutionary argument, I would like to
> believe that we are all working together for a "greater good".

I don't say no for the fun of saying no. I say no because I think
something is a bad idea. Just because I say no doesn't mean I don't
want to solve the problem. It just means that I think the solution
being presented is a bad idea and we need to explore the problem
space for a more robust solution.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Question about XFS_MAXINUMBER
  2018-03-20  1:47           ` Dave Chinner
@ 2018-03-20  6:29             ` Amir Goldstein
  2018-03-20  8:04               ` Ian Kent
  2018-03-20 13:08               ` Dave Chinner
  2018-03-20  9:32             ` Miklos Szeredi
  1 sibling, 2 replies; 10+ messages in thread
From: Amir Goldstein @ 2018-03-20  6:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs,
	linux-fsdevel

On Tue, Mar 20, 2018 at 3:47 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Mar 19, 2018 at 06:03:30AM +0200, Amir Goldstein wrote:
[...]
>> Well, it is not an assumption if filesystem is inclined to publish
>> s_max_ino_bits, which is not that different in concept from publishing
>> s_maxbytes and s_max_links, which are also limitations in current
>> kernel/sb that could be lifted in the future.
>
> It is different, because you're expecting to be able to publish
> persistent user visible information based on it.
>
> If we change s_max_ino_bits in the underlying filesystem, then
> overlay inode numbers change and that can cause all sorts of problem
> with things like filehandles, backups that use dev/inode number
> tuples to detect identical files, etc.  i.e. there's a heap of
> downstream impacts of changing inode numbers. If we have to
> publish s_max_ino_bits to the VFS, we essentially fix the ABI of the
> user visible inode number the filesysetm publishes. IOWs, we
> effectively can't change it without breaking external users.
>

You are right.

> I suspect you don't realise we already expose the full 64 bit
> inode number space completely to userspace through other ABIs. e.g.
> the bulkstat ioctls. We've already got applications that use the XFS
> inode number as a 64 bit value both to and from the kernel (e.g.
> xfs_dump, file handle encoding, etc), so the idea that we can now
> take bits back from what we've already agreed to expose to userspace
> is fraught with problems.

I'm sorry, there must be something I am missing.
Are users exposed to high ino bits via xfs tools other than via
NULLFSINO and NULLAGINO? If they are, then I did not find where.
And w.r.t. NULLINO (-1), that ino is not exposed via getattr() and
readdir(), so it is not a problem for overlayfs.

>
> That's the problem I see here - it's not that we /can't/ implement
> s_max_ino_bits, the problem is that once we publish it we can't
> change it because it will cause random breakage of applications
> using it. And because we've already effectively published it to
> userspace applications as s_max_ino_bits = 64, there's no scope for
> movement at all.
>

Agreed. So we can add an explicit compat feature bit to declare that
the user would like to limit future use of high ino bits on his fs.
Makes me wonder: how come there is no feature to block the "inode64"
mount option, so a user can declare that he wishes to keep the fs fully
compatible for mounting on 32bit systems?

[...]

> We've done this many times in the past. e.g. we changed the default
> inode allocation policy from inode32 to inode64 back in 2012. That
> means users, on kernel upgrade, silently went from 32 bit inodes to
> 64 bit inodes. We've done this because of the fact that the
> *filesystem owns the entire inode number space* and as long as we
> don't change individual inode numbers that users see for a specific
> inode, we can do whatever we want inside that inode number space.
>

Right. My main point is that, unless I am missing something, never in
xfs history was a non-NULL inode number with the high 8 bits used
exposed to users, so at least forward/backward compat for an "inode56"
feature is not going to be a big challenge.

>> > Given that overlay has a persistent inode numbering problem, why
>> > doesn't overlay just allocate and store it's own inode numbers and
>> > other required persistent state in an xattr?
>> >
>>
>> First, this is not as simple as it sounds.
>
> Sure, just like s_max_ino_bits is not as simple as it sounds.

It never is ;-)

>
> If we want to explicitly reserve part of the inode number space for
> other layers to use for their own purposes, then we need to
> explicitly and persistently support that in the underlying
> filesystem. That means mkfs, repair, db, growfs, etc all need to
> understand that inode numbers have a size limit and do the right
> thing...
>
> That makes it an opt-in configuration that we can test and support
> without having to care about overlay implementations or backwards
> compatibility across applications on existing filesystems.
>

OK. I'll work on a proposal.

>> Second, and this may be a revolutionary argument, I would like to
>> believe that we are all working together for a "greater good".
>
> I don't say no for the fun of saying no. I say no because I think
> something is a bad idea. Just because I say no doesn't mean I don't
> want to solve the problem. It just means that I think the
> solution being presented is a bad idea and we need to explore the
> problem space for a more robust solution.
>

And I do appreciate the time you've put into understanding the overlayfs
problem and explaining the problems with my current proposal.

Thanks,
Amir.

* Re: Question about XFS_MAXINUMBER
  2018-03-20  6:29             ` Amir Goldstein
@ 2018-03-20  8:04               ` Ian Kent
  2018-03-20  8:57                 ` Amir Goldstein
  2018-03-20  9:20                 ` Miklos Szeredi
  2018-03-20 13:08               ` Dave Chinner
  1 sibling, 2 replies; 10+ messages in thread
From: Ian Kent @ 2018-03-20  8:04 UTC (permalink / raw)
  To: Amir Goldstein, Dave Chinner
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs,
	linux-fsdevel

Hi Amir, Miklos,

On 20/03/18 14:29, Amir Goldstein wrote:
> 
> And I do appreciate the time you've put into understanding the overlayfs
> problem and explaining the problems with my current proposal.
> 

For a while now I've been wondering why overlayfs is keen to avoid
using a local, persistent inode number mapping cache.

Sure, there can be subtle problems with such a cache, but there are
problems with the other alternatives too.

Ian

* Re: Question about XFS_MAXINUMBER
  2018-03-20  8:04               ` Ian Kent
@ 2018-03-20  8:57                 ` Amir Goldstein
  2018-03-20 10:18                   ` Ian Kent
  2018-03-20  9:20                 ` Miklos Szeredi
  1 sibling, 1 reply; 10+ messages in thread
From: Amir Goldstein @ 2018-03-20  8:57 UTC (permalink / raw)
  To: Ian Kent
  Cc: Dave Chinner, Miklos Szeredi, Darrick J. Wong, linux-xfs,
	overlayfs, linux-fsdevel

On Tue, Mar 20, 2018 at 10:04 AM, Ian Kent <raven@themaw.net> wrote:
> Hi Amir, Miklos,
>
> On 20/03/18 14:29, Amir Goldstein wrote:
>>
>> And I do appreciate the time you've put into understanding the overlayfs
>> problem and explaining the problems with my current proposal.
>>
>
> For a while now I've been wondering why overlayfs is keen to avoid using
> a local, persistent, inode number mapping cache?
>

A local persistent inode map is a more complex solution.
Excluding refactoring, my patch set adds less than 100 lines of code,
and it solves the problem for many real world setups.
A more complex solution needs a real-world use case to justify it
over a less complex one.
I am not saying we can avoid the complex solution forever, but so far
I have not seen requests from users that would justify it.

> Sure there can be subtle problems with them but there are problems with
> other alternatives too.
>

There is a difference between "not applicable" and "problematic".
The -o xino solution is not applicable to all setups, but I am not
aware of any problems with it.
Even without the underlying filesystem declaring the number of used
ino bits, the user can declare this with an overlayfs mount option, so
practically, the problem for overlayfs over xfs is already solved.

The discussion about a VFS API for max_ino_bits is about making users'
lives easier; the API is not required to fix the overlayfs inode
number problem.

Thanks,
Amir.

* Re: Question about XFS_MAXINUMBER
  2018-03-20  8:04               ` Ian Kent
  2018-03-20  8:57                 ` Amir Goldstein
@ 2018-03-20  9:20                 ` Miklos Szeredi
  1 sibling, 0 replies; 10+ messages in thread
From: Miklos Szeredi @ 2018-03-20  9:20 UTC (permalink / raw)
  To: Ian Kent
  Cc: Amir Goldstein, Dave Chinner, Darrick J. Wong, linux-xfs,
	overlayfs, linux-fsdevel

On Tue, Mar 20, 2018 at 9:04 AM, Ian Kent <raven@themaw.net> wrote:
> Hi Amir, Miklos,
>
> On 20/03/18 14:29, Amir Goldstein wrote:
>>
>> And I do appreciate the time you've put into understanding the overlayfs
>> problem and explaining the problems with my current proposal.
>>
>
> For a while now I've been wondering why overlayfs is keen to avoid using
> a local, persistent, inode number mapping cache?

Think of overlayfs as a normal filesystem, except that it's not backed
by a block device but instead by one or more read-only directory trees
and, optionally, one writable directory tree. There's a twist, however:
when not mounted, you are allowed to change the backing directories.
This is a really important feature of overlayfs.

So where does the initial mapping come from (overlay is never started
from scratch, like a newly formatted filesystem)?  And what happens
when layers are modified and we encounter unmapped inode numbers?

In both cases we must either create/update the mapping before mount,
or update the mapping on lookup.

Creating/updating the mapping up-front means a really high startup
cost, which can be amortized if the layers are guaranteed not to
change outside of the overlay.

Updating a persistent mapping on lookup means having to do sync writes
on lookup, which can be very detrimental to performance.  If all
layers are read-only, this scheme falls apart, since we've nowhere to
write the persistent mapping.

Or we can just say, screw the persistency and store the mapping on
e.g. tmpfs.  Performance-wise that's much better, but then we fail to
provide the guarantees about inode numbers (e.g. NFS export won't work
properly).
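As a toy illustration of the mapping alternative under discussion
(in-memory only, names invented for this sketch; a real version would
need the synchronous write at each miss, as described above):

```c
#include <assert.h>
#include <stdint.h>

struct ino_map_entry {
	unsigned int fsid;	/* which backing filesystem */
	uint64_t real_ino;	/* inode number in that layer */
	uint64_t ovl_ino;	/* overlay's own allocated number */
};

#define MAP_MAX 1024
static struct ino_map_entry ino_map[MAP_MAX];
static unsigned int ino_map_len;
static uint64_t ovl_next_ino = 1;

/* stable overlay ino for (fsid, real_ino); allocate on first lookup */
uint64_t ovl_map_ino(unsigned int fsid, uint64_t real_ino)
{
	for (unsigned int i = 0; i < ino_map_len; i++)
		if (ino_map[i].fsid == fsid &&
		    ino_map[i].real_ino == real_ino)
			return ino_map[i].ovl_ino;

	/* miss: this is where a real implementation would need a
	 * synchronous write to make the new mapping persistent */
	assert(ino_map_len < MAP_MAX);
	ino_map[ino_map_len].fsid = fsid;
	ino_map[ino_map_len].real_ino = real_ino;
	ino_map[ino_map_len].ovl_ino = ovl_next_ino;
	ino_map_len++;
	return ovl_next_ino++;
}
```

Colliding real inode numbers from different layers get distinct,
stable overlay numbers, but only as long as the map itself survives,
which is exactly the persistency problem when all layers are read-only.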

In my opinion it's much less about simplicity of implementation than
about quality of implementation.

Ideas for fixing the above issues are welcome.

Thanks,
Miklos

* Re: Question about XFS_MAXINUMBER
  2018-03-20  1:47           ` Dave Chinner
  2018-03-20  6:29             ` Amir Goldstein
@ 2018-03-20  9:32             ` Miklos Szeredi
  1 sibling, 0 replies; 10+ messages in thread
From: Miklos Szeredi @ 2018-03-20  9:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Darrick J. Wong, linux-xfs, overlayfs,
	linux-fsdevel

On Tue, Mar 20, 2018 at 2:47 AM, Dave Chinner <david@fromorbit.com> wrote:

>> Second, and this may be a revolutionary argument, I would like to
>> believe that we are all working together for a "greater good".
>
> I don't say no for the fun of saying no. I say no because I think
> something is a bad idea. Just because I say no doesn't mean I don't
> want to solve the problem. It just means that I think the
> solution being presented is a bad idea and we need to explore the
> problem space for a more robust solution.

Totally agreed, let's do that.  I've presented the issues I see with
creating a generic (i.e. non-multiplexing) inode number mapping for
overlayfs in answer to Ian's mail.

Do you see a way this problem can be solved without those issues?

Thanks,
Miklos

* Re: Question about XFS_MAXINUMBER
  2018-03-20  8:57                 ` Amir Goldstein
@ 2018-03-20 10:18                   ` Ian Kent
  0 siblings, 0 replies; 10+ messages in thread
From: Ian Kent @ 2018-03-20 10:18 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Dave Chinner, Miklos Szeredi, Darrick J. Wong, linux-xfs,
	overlayfs, linux-fsdevel

On 20/03/18 16:57, Amir Goldstein wrote:
> On Tue, Mar 20, 2018 at 10:04 AM, Ian Kent <raven@themaw.net> wrote:
>> Hi Amir, Miklos,
>>
>> On 20/03/18 14:29, Amir Goldstein wrote:
>>>
>>> And I do appreciate the time you've put into understanding the overlayfs
>>> problem and explaining the problems with my current proposal.
>>>
>>
>> For a while now I've been wondering why overlayfs is keen to avoid using
>> a local, persistent, inode number mapping cache?
>>
> 
> A local persistent inode map is a more complex solution.
> If you remove re-factoring, my patch set adds less than 100 lines of code
> and it solves the problem for many real world setups.
> A more complex solution needs a use case in the real world to justify
> it over a less complex solution.

Indeed, it is significantly more complex.

Ian

* Re: Question about XFS_MAXINUMBER
  2018-03-20  6:29             ` Amir Goldstein
  2018-03-20  8:04               ` Ian Kent
@ 2018-03-20 13:08               ` Dave Chinner
  1 sibling, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2018-03-20 13:08 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Miklos Szeredi, Darrick J. Wong, linux-xfs, overlayfs,
	linux-fsdevel

On Tue, Mar 20, 2018 at 08:29:35AM +0200, Amir Goldstein wrote:
> On Tue, Mar 20, 2018 at 3:47 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Mar 19, 2018 at 06:03:30AM +0200, Amir Goldstein wrote:
> [...]
> >> Well, it is not an assumption if filesystem is inclined to publish
> >> s_max_ino_bits, which is not that different in concept from publishing
> >> s_maxbytes and s_max_links, which are also limitations in current
> >> kernel/sb that could be lifted in the future.
> >
> > It is different, because you're expecting to be able to publish
> > persistent user visible information based on it.
> >
> > If we change s_max_ino_bits in the underlying filesystem, then
> > overlay inode numbers change and that can cause all sorts of problem
> > with things like filehandles, backups that use dev/inode number
> > tuples to detect identical files, etc.  i.e. there's a heap of
> > downstream impacts of changing inode numbers. If we have to
> > publish s_max_ino_bits to the VFS, we essentially fix the ABI of the
> > user visible inode number the filesysetm publishes. IOWs, we
> > effectively can't change it without breaking external users.
> >
> 
> You are right.
> 
> > I suspect you don't realise we already expose the full 64 bit
> > inode number space completely to userspace through other ABIs. e.g.
> > the bulkstat ioctls. We've already got applications that use the XFS
> > inode number as a 64 bit value both to and from the kernel (e.g.
> > xfs_dump, file handle encoding, etc), so the idea that we can now
> > take bits back from what we've already agreed to expose to userspace
> > is fraught with problems.
> 
> I'm sorry. There must be something I am missing.
> Are users exposed to high ino bits via xfs tools other than NULLFSINO
> NULLAGINO? If they are then I did not find where.
> And w.r.t to NULLINO (-1), that ino is not exposed via getattr() and readdir(),
> so not a problem for overlayfs.

Bulkstat exposes the on-disk inode number directly to userspace, and
other ioctls take those inode numbers back in as ioctl parameters
(e.g. as bulkstat iteration cookies) and as part of
userspace-constructed filehandles (i.e. in libhandle, xfs_fsr,
xfsdump, etc).
The filehandles are explicitly encoded with 64 bit inode numbers....

> > That's the problem I see here - it's not that we /can't/ implement
> > s_max_ino_bits, the problem is that once we publish it we can't
> > change it because it will cause random breakage of applications
> > using it. And because we've already effectively published it to
> > userspace applications as s_max_ino_bits = 64, there's no scope for
> > movement at all.
> >
> 
> Agreed. So we can add an explicit compat feature bit to declare that user
> would like to limit future use of high ino bits on his fs.
> Makes me wonder, how come there is no feature to block "inode64"
> mount option, so user can declare he wishes to keep the fs fully
> compatible for mounting on 32bit systems?

Because inode64 was the original mechanism for allocating inodes;
inode32 was introduced years after XFS first shipped. You'd need
to ask the old Irix engineers why they implemented inode32 as a
mount option and not an on-disk feature flag, creating the mess
that is the inode32 mount option.

These days, inode32 reads 64 bit inodes just fine - it just can't
create new 64 bit inode numbers.  And if you *really* still need
only 32 bit inodes in this day and age, there's the old xfs_reno
tool:

http://xfs.org/index.php/Unfinished_work#The_xfs_reno_tool

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

end of thread, other threads:[~2018-03-20 13:08 UTC | newest]
