Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v4 14/16] ext4: add basic fs-verity support
From: Eric Biggers @ 2019-06-18 17:51 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Darrick J . Wong, linux-api, Dave Chinner, linux-f2fs-devel,
	linux-fscrypt, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190615153112.GO6142@mit.edu>

On Sat, Jun 15, 2019 at 11:31:12AM -0400, Theodore Ts'o wrote:
> On Thu, Jun 06, 2019 at 08:52:03AM -0700, Eric Biggers wrote:
> > +/*
> > + * Format of ext4 verity xattr.  This points to the location of the verity
> > + * descriptor within the file data rather than containing it directly because
> > + * the verity descriptor *must* be encrypted when ext4 encryption is used.  But,
> > + * ext4 encryption does not encrypt xattrs.
> > + */
> > +struct fsverity_descriptor_location {
> > +	__le32 version;
> > +	__le32 size;
> > +	__le64 pos;
> > +};
> 
> What's the benefit of storing the location in an xattr as opposed to
> just keying it off the end of i_size, rounded up to next page size (or
> 64k) as I had suggested earlier?
> 
> Using an xattr burns xattr space, which is a limited resource, and it
> adds some additional code complexity.  Does the benefits outweigh the
> added complexity?
> 
> 						- Ted

It means that only the fs/verity/ support layer has to be aware of the format of
the fsverity_descriptor, and the filesystem can just treat it an as opaque blob.

Otherwise the filesystem would need to read the first 'sizeof(struct
fsverity_descriptor)' bytes and use those to calculate the size as
'sizeof(struct fsverity_descriptor) + le32_to_cpu(desc.sig_size)', then read the
rest.  Is this what you have in mind?

Alternatively the filesystem could prepend the fsverity_descriptor with its
size, similar to how in the v1 and v2 patchsets there was an fsverity_footer
appended to the fsverity_descriptor.  But an xattr seems a cleaner approach to
store a few bytes that don't need to be encrypted.

Putting the verity descriptor before the Merkle tree also means that we'd have
to pass the desc_size to ->begin_enable_verity(), ->read_merkle_tree_page(), and
->write_merkle_tree_block(), versus just passing the merkle_tree_size to
->end_enable_verity().  This would be easy, but it would still add a bit of
complexity in the fsverity_operations rather than reduce it.

It's also somewhat nice to have the version number in the xattr, in case we ever
introduce a new fs-verity format for ext4 or f2fs.

So to me, it doesn't seem like the other possible solutions are better.

- Eric

^ permalink raw reply

* Re: [PATCH] mm, memcg: Report number of memcg caches in slabinfo
From: Michal Hocko @ 2019-06-18 18:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, linux-kernel, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-api
In-Reply-To: <dee4dee2-1f4f-a7c9-0014-dca54b991377@redhat.com>

On Tue 18-06-19 12:59:24, Waiman Long wrote:
> On 6/18/19 8:37 AM, Michal Hocko wrote:
[...]
> > Is this useful enough to put into slabinfo? Doesn't this sound more like
> > a debugfs kinda a thing?
> 
> I guess it is probably more on the debug side of things. I add it to
> slabinfo as the data is readily available. It will be much more work if
> we need to export the data via debugfs.
> 
> We are seeing the kmem_cache slab growing continuously overtime when
> running a container-based workloads. Roman's kmem_cache reparenting
> patch will hopefully solve a major part of the problem, but we still
> need a way to confirm that by looking at how many memcg kmem_caches are
> associated with each root kmem_cache.

I am not disputing usefulness. Dead memcgs are showing up as a problem
for a longer time and having a more debugging information is definitely
useful. I am just not really sure that /proc/slabinfo is the proper
vehicle for that information. It might be just easier to stick it there
but that is not the best justification for adding something we will have
to maintain for ever. Not to mention that the number of dead memcgs
might not be enough to debug further when we can easily end up needing
to provide more in something less "carved in stone" kinda interface like
debugfs.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* Re: [PATCH] mm, memcg: Report number of memcg caches in slabinfo
From: Waiman Long @ 2019-06-18 19:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Andrew Morton, linux-mm, linux-kernel, Roman Gushchin,
	Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-api
In-Reply-To: <20190618183208.GK3318@dhcp22.suse.cz>

On 6/18/19 2:32 PM, Michal Hocko wrote:
> On Tue 18-06-19 12:59:24, Waiman Long wrote:
>> On 6/18/19 8:37 AM, Michal Hocko wrote:
> [...]
>>> Is this useful enough to put into slabinfo? Doesn't this sound more like
>>> a debugfs kinda a thing?
>> I guess it is probably more on the debug side of things. I add it to
>> slabinfo as the data is readily available. It will be much more work if
>> we need to export the data via debugfs.
>>
>> We are seeing the kmem_cache slab growing continuously overtime when
>> running a container-based workloads. Roman's kmem_cache reparenting
>> patch will hopefully solve a major part of the problem, but we still
>> need a way to confirm that by looking at how many memcg kmem_caches are
>> associated with each root kmem_cache.
> I am not disputing usefulness. Dead memcgs are showing up as a problem
> for a longer time and having a more debugging information is definitely
> useful. I am just not really sure that /proc/slabinfo is the proper
> vehicle for that information. It might be just easier to stick it there
> but that is not the best justification for adding something we will have
> to maintain for ever. Not to mention that the number of dead memcgs
> might not be enough to debug further when we can easily end up needing
> to provide more in something less "carved in stone" kinda interface like
> debugfs.
>
Fair enough.

I will rework the patch and expose the information via debugfs then.

Cheers,
Longman

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-06-18 22:12 UTC (permalink / raw)
  To: Steve Grubb
  Cc: Richard Guy Briggs, Tycho Andersen, Serge E. Hallyn, containers,
	linux-api, Linux-Audit Mailing List, linux-fsdevel, LKML, netdev,
	netfilter-devel, omosnace, dhowells, simo, Eric Paris, ebiederm,
	nhorman
In-Reply-To: <97478582.yP93vGJyqj@x2>

On Mon, Jun 3, 2019 at 4:24 PM Steve Grubb <sgrubb@redhat.com> wrote:
> Hello Paul,
>
> I am curious about this. We seemed to be close to getting this patch pulled
> in. A lot of people are waiting for it. Can you summarize what you think the
> patches need and who we think needs to do it? I'm lost. Does LXC people need
> to propose something? Does Richard? Someone else? Who's got the ball?

[My apologies, this was lost in my inbox and I just not noticed it.]

Please don't top post on things sent to the mailing lists Steve, you
know better than that.

Yes, things were moving along well, but upon talking with the LXC
folks it appears we underestimated the importance of nested
orchestrators.  I suspect my reply to Dan on the 4th covered your
questions, if you didn't see it, here is the relevant snippet:

"To be clear, that's where we are at: we need to figure out what the
kernel API would look like to support nested container orchestrators.
My gut feeling is that this isn't going to be terribly complicated
compared to the rest of the audit container ID work, but it is going
to be some work.  We had a discussion about some potential solutions
in the cover letter and it sounds like Richard is working up some
ideas now, let's wait to see what that looks like."

... and that is where we are at.  I'm looking forward to seeing
Richard's next patchset.

> On Friday, May 31, 2019 8:44:45 AM EDT Paul Moore wrote:
> > On Thu, May 30, 2019 at 8:21 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2019-05-30 19:26, Paul Moore wrote:
> > > > On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > > > > > [REMINDER: It is an "*audit* container ID" and not a general
> > > > > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> > > > >
> > > > > This sort of seems like a distinction without a difference;
> > > > > presumably
> > > > > audit is going to want to differentiate between everything that
> > > > > people
> > > > > in userspace call a container. So you'll have to support all this
> > > > > insanity anyway, even if it's "not a container ID".
> > > >
> > > > That's not quite right.  Audit doesn't care about what a container is,
> > > > or is not, it also doesn't care if the "audit container ID" actually
> > > > matches the ID used by the container engine in userspace and I think
> > > > that is a very important line to draw.  Audit is simply given a value
> > > > which it calls the "audit container ID", it ensures that the value is
> > > > inherited appropriately (e.g. children inherit their parent's audit
> > > > container ID), and it uses the value in audit records to provide some
> > > > additional context for log analysis.  The distinction isn't limited to
> > > > the value itself, but also to how it is used; it is an "audit
> > > > container ID" and not a "container ID" because this value is
> > > > exclusively for use by the audit subsystem.  We are very intentionally
> > > > not adding a generic container ID to the kernel.  If the kernel does
> > > > ever grow a general purpose container ID we will be one of the first
> > > > ones in line to make use of it, but we are not going to be the ones to
> > > > generically add containers to the kernel.  Enough people already hate
> > > > audit ;)
> > > >
> > > > > > I'm not interested in supporting/merging something that isn't
> > > > > > useful;
> > > > > > if this doesn't work for your use case then we need to figure out
> > > > > > what
> > > > > > would work.  It sounds like nested containers are much more common
> > > > > > in
> > > > > > the lxc world, can you elaborate a bit more on this?
> > > > > >
> > > > > > As far as the possible solutions you mention above, I'm not sure I
> > > > > > like the per-userns audit container IDs, I'd much rather just emit
> > > > > > the
> > > > > > necessary tracking information via the audit record stream and let
> > > > > > the
> > > > > > log analysis tools figure it out.  However, the bigger question is
> > > > > > how
> > > > > > to limit (re)setting the audit container ID when you are in a
> > > > > > non-init
> > > > > > userns.  For reasons already mentioned, using capable() is a non
> > > > > > starter for everything but the initial userns, and using
> > > > > > ns_capable()
> > > > > > is equally poor as it essentially allows any userns the ability to
> > > > > > munge it's audit container ID (obviously not good).  It appears we
> > > > > > need a different method for controlling access to the audit
> > > > > > container
> > > > > > ID.
> > > > >
> > > > > One option would be to make it a string, and have it be append only.
> > > > > That should be safe with no checks.
> > > > >
> > > > > I know there was a long thread about what type to make this thing. I
> > > > > think you could accomplish the append-only-ness with a u64 if you had
> > > > > some rule about only allowing setting lower order bits than those
> > > > > that
> > > > > are already set. With 4 bits for simplicity:
> > > > >
> > > > > 1100         # initial container id
> > > > > 1100 -> 1011 # not allowed
> > > > > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> > > > >
> > > > >              # no lower order bits left
> > > > >
> > > > > There are probably fancier ways to do it if you actually understand
> > > > > math :)
> > > >
> > > >  ;)
> > > >
> > > > > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > > > > you have 64 bits, this might be reasonable. You could just teach
> > > > > container engines to use the first say N bits for themselves, with a
> > > > > 1
> > > > > bit for the barrier at the end.
> > > >
> > > > I like the creativity, but I worry that at some point these
> > > > limitations are going to be raised (limits have a funny way of doing
> > > > that over time) and we will be in trouble.  I say "trouble" because I
> > > > want to be able to quickly do an audit container ID comparison and
> > > > we're going to pay a penalty for these larger values (we'll need this
> > > > when we add multiple auditd support and the requisite record routing).
> > > >
> > > > Thinking about this makes me also realize we probably need to think a
> > > > bit longer about audit container ID conflicts between orchestrators.
> > > > Right now we just take the value that is given to us by the
> > > > orchestrator, but if we want to allow multiple container orchestrators
> > > > to work without some form of cooperation in userspace (I think we have
> > > > to assume the orchestrators will not talk to each other) we likely
> > > > need to have some way to block reuse of an audit container ID.  We
> > > > would either need to prevent the orchestrator from explicitly setting
> > > > an audit container ID to a currently in use value, or instead generate
> > > > the audit container ID in the kernel upon an event triggered by the
> > > > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > > > start looking at the idr code, I think we will need to make use of it.
> > >
> > > My first reaction to using the IDR code is that once an idr is given up,
> > > it can be reused.  I suppose we request IDRs and then never give them up
> > > to avoid reuse...
> >
> > I'm not sure we ever what to guarantee that an audit container ID
> > won't be reused during the lifetime of the system, it is a discrete
> > integer after all.  What I think we do want to guarantee is that we
> > won't allow an unintentional audit container ID collision between
> > different orchestrators; if a single orchestrator wants to reuse an
> > audit container ID then that is its choice.
> >
> > > I already had some ideas of preventing an existing ID from being reused,
> >
> > Cool.  I only made the idr suggestion since it is used for PIDs and
> > solves a very similar problem.
> >
> > > but that makes the practice of some container engines injecting
> > > processes into existing containers difficult if not impossible.
> >
> > Yes, we'll need some provision to indicate which orchestrator
> > "controls" that particular audit container ID, and allow that
> > orchestrator to reuse that particular audit container ID (until all
> > those containers disappear and the audit container ID is given back to
> > the pool).
>
>
>
>

^ permalink raw reply

* Re: [PATCH 01/25] vfs: syscall: Add fsinfo() to query filesystem information [ver #13]
From: David Howells @ 2019-06-18 22:24 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Al Viro, Ian Kent, Linux API, linux-fsdevel,
	linux-kernel, Miklos Szeredi
In-Reply-To: <CAJfpegtkpNNOOWQ3TnLPGSm=bwL2otQp1--GjNNFiXO7imMxEQ@mail.gmail.com>

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Please don't resurrect MS_ flags.  They are from the old API and
> shouldn't be used in the new one.  Some of them (e.g. MS_POSIXACL,
> MS_I_VERSION) are actually internal flags despite being exported on
> the old API.

That makes it harder to emulate statfs() using this interface, but ok.

I wonder if I should split the standard parameters (rw/ro, posixacl, dirsync,
sync, lazytime, mand) out of FSINFO_ATTR_PARAMETERS and stick them in their
own attribute, say FSINFO_ATTR_STD_PARAMETERS.  That would make it easier for
a filesystem to only overload them if it wants to.

> And there's SB_SILENT which is simply not a superblock flag and we might be
> better getting rid of it entirely.

Yeah.  It's a parse-time flag.

> The proper way to query mount options should be analogous to the way
> they are set on the new API: list of {key, type, value, aux} tuples.

It's not quite that simple: "aux" might be a datum that you can't recover or
is meaningless to another process (an fd, for example).

David

^ permalink raw reply

* Re: [PATCH 04/25] vfs: Implement parameter value retrieval with fsinfo() [ver #13]
From: David Howells @ 2019-06-18 22:34 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: dhowells, Al Viro, Ian Kent, Linux API, linux-fsdevel,
	linux-kernel, Miklos Szeredi
In-Reply-To: <CAJfpegutheVtnmN6BFSjzrmz8p9+DpZxFoKa4CoShoh4MW+5gQ@mail.gmail.com>

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Again, don't blindly transform s_flags into options, because some of
> them may have been internally manipulated by the filesystem.

In what filesystems do I need to undo this manipulation?

> You could do a helper for filesystems that does the the common ones
> (ro/sync/dirsync) but all of that *should* go through the filesystem.

I don't agree, but since you keep insisting, I've changed the helper function
that renders these so that it now takes s_flags as an argument and is called
from generic_fsinfo() if the filesystem doesn't handle FSINFO_ATTR_PARAMETERS.

Therefore, every filesystem that handles FSINFO_ATTR_PARAMETERS, *must* call
the function itself (or do the noting directly) otherwise these parameters
will not get rendered.

The helper function has been exported, and the calling filesystem can give any
s_flags it likes.  All the filesystems so far just use
path->dentry->d_sb->s_flags.

> Same goes for vfs_parse_sb_flag() btw.   It should be moved into each
> filesystem's ->parse_param() and not be a mandatory thing.

I disagree.  Every filesystem *must* be able to accept these standard flags,
even if it then ignores them.

David

^ permalink raw reply

* Re: [PATCH v4 14/16] ext4: add basic fs-verity support
From: Theodore Ts'o @ 2019-06-18 22:46 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Darrick J . Wong, linux-api, Dave Chinner, linux-f2fs-devel,
	linux-fscrypt, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190618175117.GF184520@gmail.com>

On Tue, Jun 18, 2019 at 10:51:18AM -0700, Eric Biggers wrote:
> On Sat, Jun 15, 2019 at 11:31:12AM -0400, Theodore Ts'o wrote:
> > On Thu, Jun 06, 2019 at 08:52:03AM -0700, Eric Biggers wrote:
> > > +/*
> > > + * Format of ext4 verity xattr.  This points to the location of the verity
> > > + * descriptor within the file data rather than containing it directly because
> > > + * the verity descriptor *must* be encrypted when ext4 encryption is used.  But,
> > > + * ext4 encryption does not encrypt xattrs.
> > > + */
> > > +struct fsverity_descriptor_location {
> > > +	__le32 version;
> > > +	__le32 size;
> > > +	__le64 pos;
> > > +};
> > 
> > What's the benefit of storing the location in an xattr as opposed to
> > just keying it off the end of i_size, rounded up to next page size (or
> > 64k) as I had suggested earlier?
> > 
> > Using an xattr burns xattr space, which is a limited resource, and it
> > adds some additional code complexity.  Does the benefits outweigh the
> > added complexity?
> > 
> > 						- Ted
> 
> It means that only the fs/verity/ support layer has to be aware of the format of
> the fsverity_descriptor, and the filesystem can just treat it an as opaque blob.
> 
> Otherwise the filesystem would need to read the first 'sizeof(struct
> fsverity_descriptor)' bytes and use those to calculate the size as
> 'sizeof(struct fsverity_descriptor) + le32_to_cpu(desc.sig_size)', then read the
> rest.  Is this what you have in mind?

So right now, the way enable_verity() works is that it appends the
Merkle tree to the data file, rounding up to the next page (but we
might change so we round up to the next 64k boundary).  Then it calls
end_enable_verity(), which is a file system specific function, passing
in the descriptor and the descriptor size.

Today ext4 and f2fs appends the descriptor after the Merkle, and then
sets the xattr containing the fsverity_descriptor_location.  Correct?

What I'm suggesting that ext4 do instead is that it appends the
descriptor to the Merkle tree, and then assuming that there is the
(descriptor size % block_size) is less than PAGE_SIZE-4, we can write
the descriptor size into the last 4 bytes of the last block in the
file.  If there is not enough space at the end of the descriptor, then
we append a block to the file, and then write the descriptor_size into
last 4 bytes of that block.

When ext4 needs to find the descriptor, it simply reads the last block
from the file, reads it into the page cache, reads the last 4 bytes
from that block to fetch the descriptor size, and it can use the
logical offset of the last block and the descriptor size to calculate
the beginning offset of the descriptor size.

We can then fake up the fsverity_descriptor_location structure, and
pass that into fsverity.

It does add a bit of extra complexity, but 99.9% of the time, it
requires no extra space.  The last 0.098% of the time, the file size
will grow by 4k, but if we can avoid spilling over to an external
xattr block, it will all be worth it.

And in the V1 version of the fsverity code, I had already written the
code to descend the extent tree to find the last logical block in the
extent tree.

> It's also somewhat nice to have the version number in the xattr, in case we ever
> introduce a new fs-verity format for ext4 or f2fs.

We already have a version number in the fsverity descriptor.  Surely
that is what we would bump if we need to itnroduce a new fs-verity
format?

						- Ted

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Richard Guy Briggs @ 2019-06-18 22:46 UTC (permalink / raw)
  To: Paul Moore
  Cc: Steve Grubb, Tycho Andersen, nhorman, linux-api, containers, LKML,
	dhowells, Linux-Audit Mailing List, netfilter-devel, ebiederm,
	simo, netdev, linux-fsdevel, Eric Paris, Serge E. Hallyn
In-Reply-To: <CAHC9VhQTPUnHt73SVn=iA_PW1mHkkDf=Nri1kgw_m5mcoiJV9A@mail.gmail.com>

On 2019-06-18 18:12, Paul Moore wrote:
> On Mon, Jun 3, 2019 at 4:24 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > Hello Paul,
> >
> > I am curious about this. We seemed to be close to getting this patch pulled
> > in. A lot of people are waiting for it. Can you summarize what you think the
> > patches need and who we think needs to do it? I'm lost. Does LXC people need
> > to propose something? Does Richard? Someone else? Who's got the ball?
> 
> [My apologies, this was lost in my inbox and I just not noticed it.]
> 
> Please don't top post on things sent to the mailing lists Steve, you
> know better than that.
> 
> Yes, things were moving along well, but upon talking with the LXC
> folks it appears we underestimated the importance of nested
> orchestrators.  I suspect my reply to Dan on the 4th covered your
> questions, if you didn't see it, here is the relevant snippet:
> 
> "To be clear, that's where we are at: we need to figure out what the
> kernel API would look like to support nested container orchestrators.
> My gut feeling is that this isn't going to be terribly complicated
> compared to the rest of the audit container ID work, but it is going
> to be some work.  We had a discussion about some potential solutions
> in the cover letter and it sounds like Richard is working up some
> ideas now, let's wait to see what that looks like."
> 
> ... and that is where we are at.  I'm looking forward to seeing
> Richard's next patchset.

I've rebased everything and am trying out some code to see if it will
address the concerns raised...  There will be more overhead on contid
write, and a tiny bit more for normal operations...

> > On Friday, May 31, 2019 8:44:45 AM EDT Paul Moore wrote:
> > > On Thu, May 30, 2019 at 8:21 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2019-05-30 19:26, Paul Moore wrote:
> > > > > On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > > > > > > [REMINDER: It is an "*audit* container ID" and not a general
> > > > > > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> > > > > >
> > > > > > This sort of seems like a distinction without a difference;
> > > > > > presumably
> > > > > > audit is going to want to differentiate between everything that
> > > > > > people
> > > > > > in userspace call a container. So you'll have to support all this
> > > > > > insanity anyway, even if it's "not a container ID".
> > > > >
> > > > > That's not quite right.  Audit doesn't care about what a container is,
> > > > > or is not, it also doesn't care if the "audit container ID" actually
> > > > > matches the ID used by the container engine in userspace and I think
> > > > > that is a very important line to draw.  Audit is simply given a value
> > > > > which it calls the "audit container ID", it ensures that the value is
> > > > > inherited appropriately (e.g. children inherit their parent's audit
> > > > > container ID), and it uses the value in audit records to provide some
> > > > > additional context for log analysis.  The distinction isn't limited to
> > > > > the value itself, but also to how it is used; it is an "audit
> > > > > container ID" and not a "container ID" because this value is
> > > > > exclusively for use by the audit subsystem.  We are very intentionally
> > > > > not adding a generic container ID to the kernel.  If the kernel does
> > > > > ever grow a general purpose container ID we will be one of the first
> > > > > ones in line to make use of it, but we are not going to be the ones to
> > > > > generically add containers to the kernel.  Enough people already hate
> > > > > audit ;)
> > > > >
> > > > > > > I'm not interested in supporting/merging something that isn't
> > > > > > > useful;
> > > > > > > if this doesn't work for your use case then we need to figure out
> > > > > > > what
> > > > > > > would work.  It sounds like nested containers are much more common
> > > > > > > in
> > > > > > > the lxc world, can you elaborate a bit more on this?
> > > > > > >
> > > > > > > As far as the possible solutions you mention above, I'm not sure I
> > > > > > > like the per-userns audit container IDs, I'd much rather just emit
> > > > > > > the
> > > > > > > necessary tracking information via the audit record stream and let
> > > > > > > the
> > > > > > > log analysis tools figure it out.  However, the bigger question is
> > > > > > > how
> > > > > > > to limit (re)setting the audit container ID when you are in a
> > > > > > > non-init
> > > > > > > userns.  For reasons already mentioned, using capable() is a non
> > > > > > > starter for everything but the initial userns, and using
> > > > > > > ns_capable()
> > > > > > > is equally poor as it essentially allows any userns the ability to
> > > > > > > munge it's audit container ID (obviously not good).  It appears we
> > > > > > > need a different method for controlling access to the audit
> > > > > > > container
> > > > > > > ID.
> > > > > >
> > > > > > One option would be to make it a string, and have it be append only.
> > > > > > That should be safe with no checks.
> > > > > >
> > > > > > I know there was a long thread about what type to make this thing. I
> > > > > > think you could accomplish the append-only-ness with a u64 if you had
> > > > > > some rule about only allowing setting lower order bits than those
> > > > > > that
> > > > > > are already set. With 4 bits for simplicity:
> > > > > >
> > > > > > 1100         # initial container id
> > > > > > 1100 -> 1011 # not allowed
> > > > > > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> > > > > >
> > > > > >              # no lower order bits left
> > > > > >
> > > > > > There are probably fancier ways to do it if you actually understand
> > > > > > math :)
> > > > >
> > > > >  ;)
> > > > >
> > > > > > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > > > > > you have 64 bits, this might be reasonable. You could just teach
> > > > > > container engines to use the first say N bits for themselves, with a
> > > > > > 1
> > > > > > bit for the barrier at the end.
> > > > >
> > > > > I like the creativity, but I worry that at some point these
> > > > > limitations are going to be raised (limits have a funny way of doing
> > > > > that over time) and we will be in trouble.  I say "trouble" because I
> > > > > want to be able to quickly do an audit container ID comparison and
> > > > > we're going to pay a penalty for these larger values (we'll need this
> > > > > when we add multiple auditd support and the requisite record routing).
> > > > >
> > > > > Thinking about this makes me also realize we probably need to think a
> > > > > bit longer about audit container ID conflicts between orchestrators.
> > > > > Right now we just take the value that is given to us by the
> > > > > orchestrator, but if we want to allow multiple container orchestrators
> > > > > to work without some form of cooperation in userspace (I think we have
> > > > > to assume the orchestrators will not talk to each other) we likely
> > > > > need to have some way to block reuse of an audit container ID.  We
> > > > > would either need to prevent the orchestrator from explicitly setting
> > > > > an audit container ID to a currently in use value, or instead generate
> > > > > the audit container ID in the kernel upon an event triggered by the
> > > > > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > > > > start looking at the idr code, I think we will need to make use of it.
> > > >
> > > > My first reaction to using the IDR code is that once an idr is given up,
> > > > it can be reused.  I suppose we request IDRs and then never give them up
> > > > to avoid reuse...
> > >
> > > I'm not sure we ever what to guarantee that an audit container ID
> > > won't be reused during the lifetime of the system, it is a discrete
> > > integer after all.  What I think we do want to guarantee is that we
> > > won't allow an unintentional audit container ID collision between
> > > different orchestrators; if a single orchestrator wants to reuse an
> > > audit container ID then that is its choice.
> > >
> > > > I already had some ideas of preventing an existing ID from being reused,
> > >
> > > Cool.  I only made the idr suggestion since it is used for PIDs and
> > > solves a very similar problem.
> > >
> > > > but that makes the practice of some container engines injecting
> > > > processes into existing containers difficult if not impossible.
> > >
> > > Yes, we'll need some provision to indicate which orchestrator
> > > "controls" that particular audit container ID, and allow that
> > > orchestrator to reuse that particular audit container ID (until all
> > > those containers disappear and the audit container ID is given back to
> > > the pool).
> 
> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635

^ permalink raw reply

* Re: [PATCH v4 14/16] ext4: add basic fs-verity support
From: Eric Biggers @ 2019-06-18 23:41 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Darrick J . Wong, linux-api, Dave Chinner, linux-f2fs-devel,
	linux-fscrypt, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190618224615.GB4576@mit.edu>

On Tue, Jun 18, 2019 at 06:46:15PM -0400, Theodore Ts'o wrote:
> On Tue, Jun 18, 2019 at 10:51:18AM -0700, Eric Biggers wrote:
> > On Sat, Jun 15, 2019 at 11:31:12AM -0400, Theodore Ts'o wrote:
> > > On Thu, Jun 06, 2019 at 08:52:03AM -0700, Eric Biggers wrote:
> > > > +/*
> > > > + * Format of ext4 verity xattr.  This points to the location of the verity
> > > > + * descriptor within the file data rather than containing it directly because
> > > > + * the verity descriptor *must* be encrypted when ext4 encryption is used.  But,
> > > > + * ext4 encryption does not encrypt xattrs.
> > > > + */
> > > > +struct fsverity_descriptor_location {
> > > > +	__le32 version;
> > > > +	__le32 size;
> > > > +	__le64 pos;
> > > > +};
> > > 
> > > What's the benefit of storing the location in an xattr as opposed to
> > > just keying it off the end of i_size, rounded up to next page size (or
> > > 64k) as I had suggested earlier?
> > > 
> > > Using an xattr burns xattr space, which is a limited resource, and it
> > > adds some additional code complexity.  Does the benefits outweigh the
> > > added complexity?
> > > 
> > > 						- Ted
> > 
> > It means that only the fs/verity/ support layer has to be aware of the format of
> > the fsverity_descriptor, and the filesystem can just treat it an as opaque blob.
> > 
> > Otherwise the filesystem would need to read the first 'sizeof(struct
> > fsverity_descriptor)' bytes and use those to calculate the size as
> > 'sizeof(struct fsverity_descriptor) + le32_to_cpu(desc.sig_size)', then read the
> > rest.  Is this what you have in mind?
> 
> So right now, the way enable_verity() works is that it appends the
> Merkle tree to the data file, rounding up to the next page (but we
> might change so we round up to the next 64k boundary).  Then it calls
> end_enable_verity(), which is a file system specific function, passing
> in the descriptor and the descriptor size.
> 
> Today ext4 and f2fs appends the descriptor after the Merkle, and then
> sets the xattr containing the fsverity_descriptor_location.  Correct?

That's all correct, except that enable_verity() itself doesn't know or care that
the Merkle tree is being appended to the file.  That's up to the
->write_merkle_tree_block() and ->read_merkle_tree_page() methods which are
filesystem-specific.

> 
> What I'm suggesting that ext4 do instead is that it appends the
> descriptor to the Merkle tree, and then assuming that there is the
> (descriptor size % block_size) is less than PAGE_SIZE-4, we can write
> the descriptor size into the last 4 bytes of the last block in the
> file.  If there is not enough space at the end of the descriptor, then
> we append a block to the file, and then write the descriptor_size into
> last 4 bytes of that block.
> 
> When ext4 needs to find the descriptor, it simply reads the last block
> from the file, reads it into the page cache, reads the last 4 bytes
> from that block to fetch the descriptor size, and it can use the
> logical offset of the last block and the descriptor size to calculate
> the beginning offset of the descriptor size.
> 
> We can then fake up the fsverity_descriptor_location structure, and
> pass that into fsverity.
> 
> It does add a bit of extra complexity, but 99.9% of the time, it
> requires no extra space.  The last 0.098% of the time, the file size
> will grow by 4k, but if we can avoid spilling over to an external
> xattr block, it will all be worth it.
> 
> And in the V1 version of the fsverity code, I had already written the
> code to descend the extent tree to find the last logical block in the
> extent tree.
> 

I don't think your proposed solution is so simple.  By definition the last
extent ends on a filesystem block boundary, while the Merkle tree ends on a
Merkle tree block boundary.  In the future we might support the case where these
differ, so we don't want to preclude that in the on-disk format we choose now.
Therefore, just storing the desc_size isn't enough; we'd actually have to store
(desc_pos, desc_size), like I'm doing in the xattr.

Also, using ext4_find_extent() to find the last mapped block (as the v1 and v2
patchsets did) assumes the file actually uses extents.  So we'd have to forbid
non-extents based files as a special case, as the v2 patchset did.  We'd also
have to find a way to implement the same functionality on f2fs (which should be
possible, but it seems it would require some new code; there's nothing like
f2fs_get_extent()) unless we did something different for f2fs.

Note that on Android devices (the motivating use case for fs-verity), the xattrs
of user data files on ext4 already spill into an external xattr block, due to
the fscrypt and SELinux xattrs.  If/when people actually start caring about
this, they'll need to increase the inode size to 512 bytes anyway, in which case
there will be plenty of space for a few more in-line xattrs.  So I don't think
we should jump through too many hoops to avoid using an xattr.

> > It's also somewhat nice to have the version number in the xattr, in case we ever
> > introduce a new fs-verity format for ext4 or f2fs.
> 
> We already have a version number in the fsverity descriptor.  Surely
> that is what we would bump if we need to itnroduce a new fs-verity
> format?
> 

I'm talking about if we ever wanted to make a filesystem-specific change to
where the verity metadata is stored.  That's what the version number in the
filesystem-specific xattr is for.  The version number in the fsverity_descriptor
is different: that's for if we made a change to fs-verity for *all* filesystems.
We hopefully won't ever need the filesystem-specific version number, but as long
as we have to store the (desc_pos, desc_size) anyway, I think it's wise to add a
version number just in case; it doesn't really cost anything.

- Eric

^ permalink raw reply

* Re: [PATCH v4 14/16] ext4: add basic fs-verity support
From: Theodore Ts'o @ 2019-06-19  3:05 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Darrick J . Wong, linux-api, Dave Chinner, linux-f2fs-devel,
	linux-fscrypt, linux-fsdevel, Jaegeuk Kim, linux-integrity,
	linux-ext4, Linus Torvalds, Christoph Hellwig, Victor Hsieh
In-Reply-To: <20190618234133.GL184520@gmail.com>

On Tue, Jun 18, 2019 at 04:41:34PM -0700, Eric Biggers wrote:
> 
> I don't think your proposed solution is so simple.  By definition the last
> extent ends on a filesystem block boundary, while the Merkle tree ends on a
> Merkle tree block boundary.  In the future we might support the case where these
> differ, so we don't want to preclude that in the on-disk format we choose now.
> Therefore, just storing the desc_size isn't enough; we'd actually have to store
> (desc_pos, desc_size), like I'm doing in the xattr.

I don't think any of this matters much, since what you're describing
above is all about the Merkle tree, and that doesn't affect how we
find the fsverity descriptor information.  We can just say that
fsverity descriptor block begins on the next file system block
boundary after the Merkle tree.  And in the case where say, the Merkle
tree is 4k and the file system block size is 64k, that's fine --- the
fs descriptor would just begin at the next 64k (fs blocksize)
boundary.

> Also, using ext4_find_extent() to find the last mapped block (as the v1 and v2
> patchsets did) assumes the file actually uses extents.  So we'd have to forbid
> non-extents based files as a special case, as the v2 patchset did.  We'd also
> have to find a way to implement the same functionality on f2fs (which should be
> possible, but it seems it would require some new code; there's nothing like
> f2fs_get_extent()) unless we did something different for f2fs.

So first, if f2fs wants to continue using the xattr, that's fine.  The
code to write and fetch the fsverity descriptor is in file system
specific code, and so this is something I'm happy to support just for
ext4, and it shouldn't require any special changes in the common
fsverity code at all.  Secondly, I suspect it's not *that* hard to
find the last logical block mapping in f2fs, but I'll let Jaeguk
comment on that.

Finally, it's not that hard to find the last mapped block for indirect
blocks, if we really care about supporting that combination.  (There
are enough other things --- like fallocate --- which don't work with
indirect mapped files, so I don't feel especially bad forbidding that
combination.  A quick check in enable_verity() to return EOPNOTSUPP if
the EXTENTS_FL flag is not present is not all that different from what
we do with fallocate today.)

But if we *did* want to support it, it's actually quite easy to find
the last mapped block for an indirect mapped inode.  I just didn't
bother to write the code, but it requires at most 3 block reads if
there is a triple indirection block.  Otherwise, if there is a double
indirection block in the inode, it requires at most 2 block reads, and
otherwise, at most a single block read.

> Note that on Android devices (the motivating use case for fs-verity), the xattrs
> of user data files on ext4 already spill into an external xattr block, due to
> the fscrypt and SELinux xattrs.  If/when people actually start caring about
> this, they'll need to increase the inode size to 512 bytes anyway, in which case
> there will be plenty of space for a few more in-line xattrs.  So I don't think
> we should jump through too many hoops to avoid using an xattr.

I'm thinking about other cases where we might not be using fscrypt,
but where we might still be using fsverity and SELinux --- or maybe
cases where the file systems are using 128 byte inodes, and where only
fsverity is required.  (There are a *vast* number of production file
systems using 128 byte inodes.)

Cheers,

						- Ted

^ permalink raw reply

* Re: [PATCH 04/25] vfs: Implement parameter value retrieval with fsinfo() [ver #13]
From: Miklos Szeredi @ 2019-06-19  6:33 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Ian Kent, Linux API, linux-fsdevel, linux-kernel,
	Miklos Szeredi
In-Reply-To: <24127.1560897289@warthog.procyon.org.uk>

On Wed, Jun 19, 2019 at 12:34 AM David Howells <dhowells@redhat.com> wrote:

> > Same goes for vfs_parse_sb_flag() btw.   It should be moved into each
> > filesystem's ->parse_param() and not be a mandatory thing.
>
> I disagree.  Every filesystem *must* be able to accept these standard flags,
> even if it then ignores them.

"posixacl" is not a standard flag.  It never was accepted by mount(8)
so I don't see where you got that from.

Can you explain why you think "mand", "sync", "dirsync", "lazytime"
should be accepted by a filesystem such as proc?  The argument that it
breaks userspace is BS, because this is a new interface, hence by
definition we cannot break old userspace.  If mount(8) wants to use
the new API and there really is breakage if these options are rejected
(which I doubt) then it can easily work around that by ignoring them
itself.

Also why should "rw" not be rejected for filesystems which are
read-only by definition, such as iso9660?

Thanks,
Miklos

^ permalink raw reply

* Re: [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT
From: Michal Hocko @ 2019-06-19 12:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
	Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, jannh, oleg, christian,
	oleksandr, hdanton, lizeb
In-Reply-To: <20190610111252.239156-1-minchan@kernel.org>

On Mon 10-06-19 20:12:47, Minchan Kim wrote:
> This patch is part of previous series:
> https://lore.kernel.org/lkml/20190531064313.193437-1-minchan@kernel.org/T/#u
> Originally, it was created for external madvise hinting feature.
> 
> https://lkml.org/lkml/2019/5/31/463
> Michal wanted to separte the discussion from external hinting interface
> so this patchset includes only first part of my entire patchset
> 
>   - introduce MADV_COLD and MADV_PAGEOUT hint to madvise.
> 
> However, I keep entire description for others for easier understanding
> why this kinds of hint was born.
> 
> Thanks.
> 
> This patchset is against on next-20190530.
> 
> Below is description of previous entire patchset.
> ================= &< =====================
> 
> - Background
> 
> The Android terminology used for forking a new process and starting an app
> from scratch is a cold start, while resuming an existing app is a hot start.
> While we continually try to improve the performance of cold starts, hot
> starts will always be significantly less power hungry as well as faster so
> we are trying to make hot start more likely than cold start.
> 
> To increase hot start, Android userspace manages the order that apps should
> be killed in a process called ActivityManagerService. ActivityManagerService
> tracks every Android app or service that the user could be interacting with
> at any time and translates that into a ranked list for lmkd(low memory
> killer daemon). They are likely to be killed by lmkd if the system has to
> reclaim memory. In that sense they are similar to entries in any other cache.
> Those apps are kept alive for opportunistic performance improvements but
> those performance improvements will vary based on the memory requirements of
> individual workloads.
> 
> - Problem
> 
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they are
> good candidate for swap. Under investigation, swapping out only begins
> once the low zone watermark is hit and kswapd wakes up, but the overall
> allocation rate in the system might trip lmkd thresholds and cause a cached
> process to be killed(we measured performance swapping out vs. zapping the
> memory by killing a process. Unsurprisingly, zapping is 10x times faster
> even though we use zram which is much faster than real storage) so kill
> from lmkd will often satisfy the high zone watermark, resulting in very
> few pages actually being moved to swap.
> 
> - Approach
> 
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they entered the cached state. Additionally,
> it could provide many chances for platform to use much information to
> optimize memory efficiency.
> 
> To achieve the goal, the patchset introduce two new options for madvise.
> One is MADV_COLD which will deactivate activated pages and the other is
> MADV_PAGEOUT which will reclaim private pages instantly. These new options
> complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
> gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way
> that it hints the kernel that memory region is not currently needed and
> should be reclaimed when memory pressure rises.

This all is a very good background information suitable for the cover
letter.

> This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> information required to make the reclaim decision is not known to the app.
> Instead, it is known to a centralized userspace daemon, and that daemon
> must be able to initiate reclaim on its own without any app involvement.
> To solve the concern, this patch introduces new syscall -
> 
>     struct pr_madvise_param {
>             int size;               /* the size of this structure */
>             int cookie;             /* reserved to support atomicity */
>             int nr_elem;            /* count of below arrary fields */
>             int __user *hints;      /* hints for each range */
>             /* to store result of each operation */
>             const struct iovec __user *results;
>             /* input address ranges */
>             const struct iovec __user *ranges;
>     };
>     
>     int process_madvise(int pidfd, struct pr_madvise_param *u_param,
>                             unsigned long flags);

But this and the following paragraphs are referring to the later step
when the madvise gains a remote process capabilities and that is out
of the scope of this patch series so I would simply remove it from
here. Andrew tends to put the cover letter into the first patch of the
series and that would be indeed
confusing here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

* [PATCH 01/13] vfs: verify param type in vfs_parse_sb_flag()
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel

vfs_parse_sb_flag() accepted any kind of param with a matching key, not
just a flag.  This is wrong, only allow flag type and return -EINVAL
otherwise.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fs_context.c | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index 103643c68e3f..e56310fd8c75 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -81,30 +81,29 @@ static const char *const forbidden_sb_flag[] = {
 /*
  * Check for a common mount option that manipulates s_flags.
  */
-static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
+static int vfs_parse_sb_flag(struct fs_context *fc, struct fs_parameter *param)
 {
-	unsigned int token;
+	const char *key = param->key;
+	unsigned int set, clear;
 	unsigned int i;
 
 	for (i = 0; i < ARRAY_SIZE(forbidden_sb_flag); i++)
 		if (strcmp(key, forbidden_sb_flag[i]) == 0)
 			return -EINVAL;
 
-	token = lookup_constant(common_set_sb_flag, key, 0);
-	if (token) {
-		fc->sb_flags |= token;
-		fc->sb_flags_mask |= token;
-		return 0;
-	}
+	set = lookup_constant(common_set_sb_flag, key, 0);
+	clear = lookup_constant(common_clear_sb_flag, key, 0);
+	if (!set && !clear)
+		return -ENOPARAM;
 
-	token = lookup_constant(common_clear_sb_flag, key, 0);
-	if (token) {
-		fc->sb_flags &= ~token;
-		fc->sb_flags_mask |= token;
-		return 0;
-	}
+	if (param->type != fs_value_is_flag)
+		return invalf(fc, "%s: Unexpected value for '%s'",
+			      fc->fs_type->name, param->key);
 
-	return -ENOPARAM;
+	fc->sb_flags |= set;
+	fc->sb_flags &= ~clear;
+	fc->sb_flags_mask |= set | clear;
+	return 0;
 }
 
 /**
@@ -130,7 +129,7 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
 	if (!param->key)
 		return invalf(fc, "Unnamed parameter\n");
 
-	ret = vfs_parse_sb_flag(fc, param->key);
+	ret = vfs_parse_sb_flag(fc, param);
 	if (ret != -ENOPARAM)
 		return ret;
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH 02/13] vfs: move vfs_parse_sb_flag() calls into filesystems
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

Move parsing "standard" options into filesystems' ->parse_param().  This
patch doesn't change behavior.

This is in preparation for allowing filesystems to reject options that they
don't support.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  6 +++++-
 fs/afs/super.c                         |  6 +++++-
 fs/fs_context.c                        | 12 +++++++-----
 fs/fuse/control.c                      |  1 +
 fs/hugetlbfs/inode.c                   |  6 +++++-
 fs/proc/root.c                         |  6 +++++-
 fs/sysfs/mount.c                       |  1 +
 include/linux/fs_context.h             |  1 +
 ipc/mqueue.c                           |  1 +
 kernel/cgroup/cgroup-v1.c              |  6 +++++-
 kernel/cgroup/cgroup.c                 |  6 +++++-
 kernel/cgroup/cpuset.c                 |  1 +
 12 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 2131b8bbaad7..83d3c358f95e 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2051,7 +2051,11 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct rdt_fs_context *ctx = rdt_fc2context(fc);
 	struct fs_parse_result result;
-	int opt;
+	int ret, opt;
+
+	ret = vfs_parse_sb_flag(fc, param);
+	if (ret != -ENOPARAM)
+		return ret;
 
 	opt = fs_parse(fc, &rdt_fs_parameters, param, &result);
 	if (opt < 0)
diff --git a/fs/afs/super.c b/fs/afs/super.c
index f18911e8d770..7f032d08781b 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -321,7 +321,11 @@ static int afs_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct fs_parse_result result;
 	struct afs_fs_context *ctx = fc->fs_private;
-	int opt;
+	int ret, opt;
+
+	ret = vfs_parse_sb_flag(fc, param);
+	if (ret != -ENOPARAM)
+		return ret;
 
 	opt = fs_parse(fc, &afs_fs_parameters, param, &result);
 	if (opt < 0)
diff --git a/fs/fs_context.c b/fs/fs_context.c
index e56310fd8c75..a9f314390b99 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -81,7 +81,7 @@ static const char *const forbidden_sb_flag[] = {
 /*
  * Check for a common mount option that manipulates s_flags.
  */
-static int vfs_parse_sb_flag(struct fs_context *fc, struct fs_parameter *param)
+int vfs_parse_sb_flag(struct fs_context *fc, struct fs_parameter *param)
 {
 	const char *key = param->key;
 	unsigned int set, clear;
@@ -105,6 +105,7 @@ static int vfs_parse_sb_flag(struct fs_context *fc, struct fs_parameter *param)
 	fc->sb_flags_mask |= set | clear;
 	return 0;
 }
+EXPORT_SYMBOL(vfs_parse_sb_flag);
 
 /**
  * vfs_parse_fs_param - Add a single parameter to a superblock config
@@ -129,10 +130,6 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param)
 	if (!param->key)
 		return invalf(fc, "Unnamed parameter\n");
 
-	ret = vfs_parse_sb_flag(fc, param);
-	if (ret != -ENOPARAM)
-		return ret;
-
 	ret = security_fs_context_parse_param(fc, param);
 	if (ret != -ENOPARAM)
 		/* Param belongs to the LSM or is disallowed by the LSM; so
@@ -561,6 +558,11 @@ static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	struct legacy_fs_context *ctx = fc->fs_private;
 	unsigned int size = ctx->data_size;
 	size_t len = 0;
+	int ret;
+
+	ret = vfs_parse_sb_flag(fc, param);
+	if (ret != -ENOPARAM)
+		return ret;
 
 	if (strcmp(param->key, "source") == 0) {
 		if (param->type != fs_value_is_string)
diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index 14ce1e47f980..c35013ed7f65 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -351,6 +351,7 @@ static int fuse_ctl_get_tree(struct fs_context *fc)
 
 static const struct fs_context_operations fuse_ctl_context_ops = {
 	.get_tree	= fuse_ctl_get_tree,
+	.parse_param	= vfs_parse_fs_param,
 };
 
 static int fuse_ctl_init_fs_context(struct fs_context *fc)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 1dcc57189382..89125cc36d0e 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1149,7 +1149,11 @@ static int hugetlbfs_parse_param(struct fs_context *fc, struct fs_parameter *par
 	struct fs_parse_result result;
 	char *rest;
 	unsigned long ps;
-	int opt;
+	int ret, opt;
+
+	ret = vfs_parse_sb_flag(fc, param);
+	if (ret != -ENOPARAM)
+		return ret;
 
 	opt = fs_parse(fc, &hugetlb_fs_parameters, param, &result);
 	if (opt < 0)
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 8b145e7b9661..6ef1527ad975 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -56,7 +56,11 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 	struct fs_parse_result result;
-	int opt;
+	int ret, opt;
+
+	ret = vfs_parse_sb_flag(fc, param);
+	if (ret != -ENOPARAM)
+		return ret;
 
 	opt = fs_parse(fc, &proc_fs_parameters, param, &result);
 	if (opt < 0)
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 1b56686ab178..ba576a976e8c 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -49,6 +49,7 @@ static void sysfs_fs_context_free(struct fs_context *fc)
 
 static const struct fs_context_operations sysfs_fs_context_ops = {
 	.free		= sysfs_fs_context_free,
+	.parse_param	= vfs_parse_sb_flag,
 	.get_tree	= sysfs_get_tree,
 };
 
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index d476ff0c10df..39f4d8b0a390 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -127,6 +127,7 @@ extern struct fs_context *fs_context_for_submount(struct file_system_type *fs_ty
 						struct dentry *reference);
 
 extern struct fs_context *vfs_dup_fs_context(struct fs_context *fc);
+extern int vfs_parse_sb_flag(struct fs_context *fc, struct fs_parameter *param);
 extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param);
 extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
 			       const char *value, size_t v_size);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 216cad1ff0d0..557aa887996a 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -1577,6 +1577,7 @@ static const struct super_operations mqueue_super_ops = {
 
 static const struct fs_context_operations mqueue_fs_context_ops = {
 	.free		= mqueue_fs_context_free,
+	.parse_param	= vfs_parse_sb_flag,
 	.get_tree	= mqueue_get_tree,
 };
 
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 88006be40ea3..f960e6149311 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -927,7 +927,11 @@ int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 	struct cgroup_subsys *ss;
 	struct fs_parse_result result;
-	int opt, i;
+	int ret, opt, i;
+
+	ret = vfs_parse_sb_flag(fc, param);
+	if (ret != -ENOPARAM)
+		return ret;
 
 	opt = fs_parse(fc, &cgroup1_fs_parameters, param, &result);
 	if (opt == -ENOPARAM) {
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index bf9dbffd46b1..93890285b510 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1834,7 +1834,11 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
 {
 	struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
 	struct fs_parse_result result;
-	int opt;
+	int ret, opt;
+
+	ret = vfs_parse_sb_flag(fc, param);
+	if (ret != -ENOPARAM)
+		return ret;
 
 	opt = fs_parse(fc, &cgroup2_fs_parameters, param, &result);
 	if (opt < 0)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 515525ff1cfd..025f6c6083a3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -394,6 +394,7 @@ static int cpuset_get_tree(struct fs_context *fc)
 }
 
 static const struct fs_context_operations cpuset_fs_context_ops = {
+	.parse_param	= vfs_parse_sb_flag,
 	.get_tree	= cpuset_get_tree,
 };
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH 03/13] vfs: don't parse forbidden flags
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

Impossible to keep this blacklist properly synced with what mount(8) parses
and what it doesn't.  E.g. it has various forms of "*atime" options, but
not "atime"...

Other than being impossible to maintain, it also makes little sense.  So
just get rid of it.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fs_context.c | 28 ----------------------------
 1 file changed, 28 deletions(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index a9f314390b99..cbf89117a507 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -55,29 +55,6 @@ static const struct constant_table common_clear_sb_flag[] = {
 	{ "silent",	SB_SILENT },
 };
 
-static const char *const forbidden_sb_flag[] = {
-	"bind",
-	"dev",
-	"exec",
-	"move",
-	"noatime",
-	"nodev",
-	"nodiratime",
-	"noexec",
-	"norelatime",
-	"nostrictatime",
-	"nosuid",
-	"private",
-	"rec",
-	"relatime",
-	"remount",
-	"shared",
-	"slave",
-	"strictatime",
-	"suid",
-	"unbindable",
-};
-
 /*
  * Check for a common mount option that manipulates s_flags.
  */
@@ -85,11 +62,6 @@ int vfs_parse_sb_flag(struct fs_context *fc, struct fs_parameter *param)
 {
 	const char *key = param->key;
 	unsigned int set, clear;
-	unsigned int i;
-
-	for (i = 0; i < ARRAY_SIZE(forbidden_sb_flag); i++)
-		if (strcmp(key, forbidden_sb_flag[i]) == 0)
-			return -EINVAL;
 
 	set = lookup_constant(common_set_sb_flag, key, 0);
 	clear = lookup_constant(common_clear_sb_flag, key, 0);
-- 
2.21.0

^ permalink raw reply related

* [PATCH 04/13] vfs: don't parse "posixacl" option
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

Unlike the others, this is _not_ a standard option accepted by mount(8).

In fact SB_POSIXACL is an internal flag, and accepting MS_POSIXACL on the
mount(2) interface is possibly a bug.

The only filesystem that apparently wants to handle the "posixacl" option
is 9p, but it has special handling of that option besides setting
SB_POSIXACL.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fs_context.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index cbf89117a507..49636e541293 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -42,7 +42,6 @@ static const struct constant_table common_set_sb_flag[] = {
 	{ "dirsync",	SB_DIRSYNC },
 	{ "lazytime",	SB_LAZYTIME },
 	{ "mand",	SB_MANDLOCK },
-	{ "posixacl",	SB_POSIXACL },
 	{ "ro",		SB_RDONLY },
 	{ "sync",	SB_SYNCHRONOUS },
 };
-- 
2.21.0

^ permalink raw reply related

* [PATCH 05/13] vfs: don't parse "silent" option
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

While this is a standard option as documented in mount(8), it is ignored by
most filesystems.  So reject, unless filesystem explicitly wants to handle
it.

The exception is unconverted filesystems, where it is unknown if the
filesystem handles this or not.

Any implementation, such as mount(8) that needs to parse this option
without failing should simply ignore the return value from fsconfig().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fs_context.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index 49636e541293..c26b353aa858 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -51,7 +51,6 @@ static const struct constant_table common_clear_sb_flag[] = {
 	{ "nolazytime",	SB_LAZYTIME },
 	{ "nomand",	SB_MANDLOCK },
 	{ "rw",		SB_RDONLY },
-	{ "silent",	SB_SILENT },
 };
 
 /*
@@ -535,6 +534,9 @@ static int legacy_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	if (ret != -ENOPARAM)
 		return ret;
 
+	if (strcmp(param->key, "silent") == 0)
+		fc->sb_flags |= SB_SILENT;
+
 	if (strcmp(param->key, "source") == 0) {
 		if (param->type != fs_value_is_string)
 			return invalf(fc, "VFS: Legacy: Non-string source");
-- 
2.21.0

^ permalink raw reply related

* [PATCH 06/13] vfs: new helper: vfs_parse_ro_rw()
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

This just parses the "ro" and "rw" options and sets sb_flags accordingly.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fs_context.c            | 18 ++++++++++++++++++
 include/linux/fs_context.h |  1 +
 2 files changed, 19 insertions(+)

diff --git a/fs/fs_context.c b/fs/fs_context.c
index c26b353aa858..5012ab7650ec 100644
--- a/fs/fs_context.c
+++ b/fs/fs_context.c
@@ -77,6 +77,24 @@ int vfs_parse_sb_flag(struct fs_context *fc, struct fs_parameter *param)
 }
 EXPORT_SYMBOL(vfs_parse_sb_flag);
 
+int vfs_parse_ro_rw(struct fs_context *fc, struct fs_parameter *param)
+{
+	if (strcmp(param->key, "ro") == 0)
+		fc->sb_flags |= SB_RDONLY;
+	else if (strcmp(param->key, "rw") == 0)
+		fc->sb_flags &= ~SB_RDONLY;
+	else
+		return -ENOPARAM;
+
+	if (param->type != fs_value_is_flag)
+		return invalf(fc, "%s: Unexpected value for '%s'",
+			      fc->fs_type->name, param->key);
+
+	fc->sb_flags_mask |= SB_RDONLY;
+	return 0;
+}
+EXPORT_SYMBOL(vfs_parse_ro_rw);
+
 /**
  * vfs_parse_fs_param - Add a single parameter to a superblock config
  * @fc: The filesystem context to modify
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 39f4d8b0a390..bff6796a89ef 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -128,6 +128,7 @@ extern struct fs_context *fs_context_for_submount(struct file_system_type *fs_ty
 
 extern struct fs_context *vfs_dup_fs_context(struct fs_context *fc);
 extern int vfs_parse_sb_flag(struct fs_context *fc, struct fs_parameter *param);
+extern int vfs_parse_ro_rw(struct fs_context *fc, struct fs_parameter *param);
 extern int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param);
 extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
 			       const char *value, size_t v_size);
-- 
2.21.0

^ permalink raw reply related

* [PATCH 07/13] proc: don't ignore options
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

The options "sync", "async", "dirsync", "lazytime", "nolazytime", "mand"
and "nomand" make no sense for the proc filesystem.  If these options are
supplied to fsconfig(FSCONFIG_SET_FLAG), then return -EINVAL instead of
silently ignoring the option.

Any implementation, such as mount(8) that needs to parse this option
without failing should simply ignore the return value from fsconfig().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/proc/root.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 6ef1527ad975..70127eaba04d 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -58,7 +58,7 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	struct fs_parse_result result;
 	int ret, opt;
 
-	ret = vfs_parse_sb_flag(fc, param);
+	ret = vfs_parse_ro_rw(fc, param);
 	if (ret != -ENOPARAM)
 		return ret;
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH 08/13] sysfs: don't ignore options
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

The options "sync", "async", "dirsync", "lazytime", "nolazytime", "mand"
and "nomand" make no sense for the sysfs filesystem.  If these options are
supplied to fsconfig(FSCONFIG_SET_FLAG), then return -EINVAL instead of
silently ignoring the option.

Any implementation, such as mount(8) that needs to parse this option
without failing should simply ignore the return value from fsconfig().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/sysfs/mount.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index ba576a976e8c..4797b7952fc5 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -49,7 +49,7 @@ static void sysfs_fs_context_free(struct fs_context *fc)
 
 static const struct fs_context_operations sysfs_fs_context_ops = {
 	.free		= sysfs_fs_context_free,
-	.parse_param	= vfs_parse_sb_flag,
+	.parse_param	= vfs_parse_ro_rw,
 	.get_tree	= sysfs_get_tree,
 };
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH 09/13] mqueue: don't ignore options
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

The options "sync", "async", "dirsync", "lazytime", "nolazytime", "mand"
and "nomand" make no sense for the mqueue filesystem.  If these options are
supplied to fsconfig(FSCONFIG_SET_FLAG), then return -EINVAL instead of
silently ignoring the option.

Any implementation, such as mount(8) that needs to parse this option
without failing should simply ignore the return value from fsconfig().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 ipc/mqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 557aa887996a..1e8567c20d6a 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -1577,7 +1577,7 @@ static const struct super_operations mqueue_super_ops = {
 
 static const struct fs_context_operations mqueue_fs_context_ops = {
 	.free		= mqueue_fs_context_free,
-	.parse_param	= vfs_parse_sb_flag,
+	.parse_param	= vfs_parse_ro_rw,
 	.get_tree	= mqueue_get_tree,
 };
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH 10/13] cpuset: don't ignore options
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

The options "sync", "async", "dirsync", "lazytime", "nolazytime", "mand"
and "nomand" make no sense for the cpuset filesystem.  If these options are
supplied to fsconfig(FSCONFIG_SET_FLAG), then return -EINVAL instead of
silently ignoring the option.

Any implementation, such as mount(8) that needs to parse this option
without failing should simply ignore the return value from fsconfig().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 kernel/cgroup/cpuset.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 025f6c6083a3..f6f6b5b44fc3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -394,7 +394,7 @@ static int cpuset_get_tree(struct fs_context *fc)
 }
 
 static const struct fs_context_operations cpuset_fs_context_ops = {
-	.parse_param	= vfs_parse_sb_flag,
+	.parse_param	= vfs_parse_ro_rw,
 	.get_tree	= cpuset_get_tree,
 };
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH 11/13] cgroup: don't ignore options
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

The options "sync", "async", "dirsync", "lazytime", "nolazytime", "mand"
and "nomand" make no sense for the cgroup filesystem.  If these options are
supplied to fsconfig(FSCONFIG_SET_FLAG), then return -EINVAL instead of
silently ignoring the option.

Any implementation, such as mount(8) that needs to parse this option
without failing should simply ignore the return value from fsconfig().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 kernel/cgroup/cgroup-v1.c | 2 +-
 kernel/cgroup/cgroup.c    | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index f960e6149311..1f50d59f7f4e 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -929,7 +929,7 @@ int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	struct fs_parse_result result;
 	int ret, opt, i;
 
-	ret = vfs_parse_sb_flag(fc, param);
+	ret = vfs_parse_ro_rw(fc, param);
 	if (ret != -ENOPARAM)
 		return ret;
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 93890285b510..f2e86b3942b3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1836,7 +1836,7 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
 	struct fs_parse_result result;
 	int ret, opt;
 
-	ret = vfs_parse_sb_flag(fc, param);
+	ret = vfs_parse_ro_rw(fc, param);
 	if (ret != -ENOPARAM)
 		return ret;
 
-- 
2.21.0

^ permalink raw reply related

* [PATCH 12/13] fusectl: don't ignore options
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

The options "sync", "async", "dirsync", "lazytime", "nolazytime", "mand"
and "nomand" make no sense for the fusectl filesystem.  If these options
are supplied to fsconfig(FSCONFIG_SET_FLAG), then return -EINVAL instead of
silently ignoring the option.

Any implementation, such as mount(8) that needs to parse this option
without failing should simply ignore the return value from fsconfig().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/fuse/control.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/control.c b/fs/fuse/control.c
index c35013ed7f65..f3aab288929f 100644
--- a/fs/fuse/control.c
+++ b/fs/fuse/control.c
@@ -351,7 +351,7 @@ static int fuse_ctl_get_tree(struct fs_context *fc)
 
 static const struct fs_context_operations fuse_ctl_context_ops = {
 	.get_tree	= fuse_ctl_get_tree,
-	.parse_param	= vfs_parse_fs_param,
+	.parse_param	= vfs_parse_ro_rw,
 };
 
 static int fuse_ctl_init_fs_context(struct fs_context *fc)
-- 
2.21.0

^ permalink raw reply related

* [PATCH 13/13] resctrl: don't ignore options
From: Miklos Szeredi @ 2019-06-19 12:30 UTC (permalink / raw)
  To: David Howells; +Cc: Al Viro, Ian Kent, linux-api, linux-fsdevel, linux-kernel
In-Reply-To: <20190619123019.30032-1-mszeredi@redhat.com>

The options "sync", "async", "dirsync", "lazytime", "nolazytime", "mand"
and "nomand" make no sense for the resctrl filesystem.  If these options
are supplied to fsconfig(FSCONFIG_SET_FLAG), then return -EINVAL instead of
silently ignoring the option.

Any implementation, such as mount(8) that needs to parse this option
without failing should simply ignore the return value from fsconfig().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 83d3c358f95e..16b110d31457 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2053,7 +2053,7 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	struct fs_parse_result result;
 	int ret, opt;
 
-	ret = vfs_parse_sb_flag(fc, param);
+	ret = vfs_parse_ro_rw(fc, param);
 	if (ret != -ENOPARAM)
 		return ret;
 
-- 
2.21.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox