* Re: [PATCHv3 01/20] media: add new types for DVB devnodes
From: Sakari Ailus @ 2015-01-07 14:09 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Laurent Pinchart
Cc: Linux Media Mailing List, Mauro Carvalho Chehab,
linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <7f1ea82b1055aa490726f3af2ad22bca25e49a28.1420578087.git.mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
Hi Mauro,
Mauro Carvalho Chehab wrote:
> Most of the DVB subdevs have already their own devnode.
>
> Add support for them at the media controller API.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
>
> diff --git a/include/uapi/linux/media.h b/include/uapi/linux/media.h
> index 7902e800f019..707db275f92b 100644
> --- a/include/uapi/linux/media.h
> +++ b/include/uapi/linux/media.h
> @@ -50,7 +50,14 @@ struct media_device_info {
> #define MEDIA_ENT_T_DEVNODE_V4L (MEDIA_ENT_T_DEVNODE + 1)
> #define MEDIA_ENT_T_DEVNODE_FB (MEDIA_ENT_T_DEVNODE + 2)
> #define MEDIA_ENT_T_DEVNODE_ALSA (MEDIA_ENT_T_DEVNODE + 3)
> -#define MEDIA_ENT_T_DEVNODE_DVB (MEDIA_ENT_T_DEVNODE + 4)
> +#define MEDIA_ENT_T_DEVNODE_DVB_FE (MEDIA_ENT_T_DEVNODE + 4)
> +#define MEDIA_ENT_T_DEVNODE_DVB_DEMUX (MEDIA_ENT_T_DEVNODE + 5)
> +#define MEDIA_ENT_T_DEVNODE_DVB_DVR (MEDIA_ENT_T_DEVNODE + 6)
> +#define MEDIA_ENT_T_DEVNODE_DVB_CA (MEDIA_ENT_T_DEVNODE + 7)
> +#define MEDIA_ENT_T_DEVNODE_DVB_NET (MEDIA_ENT_T_DEVNODE + 8)
I'd create another type for the DVB sub-type devices, as there is for
V4L2 sub-devices. I wonder what Laurent thinks.
> +
> +/* Legacy symbol. Use it to avoid userspace compilation breakages */
> +#define MEDIA_ENT_T_DEVNODE_DVB MEDIA_ENT_T_DEVNODE_DVB_FE
>
> #define MEDIA_ENT_T_V4L2_SUBDEV (2 << MEDIA_ENT_TYPE_SHIFT)
> #define MEDIA_ENT_T_V4L2_SUBDEV_SENSOR (MEDIA_ENT_T_V4L2_SUBDEV + 1)
>
--
Kind regards,
Sakari Ailus
sakari.ailus-VuQAYsv1563Yd54FQh9/CA@public.gmane.org
^ permalink raw reply
* Re: [PATCHv3 01/20] media: add new types for DVB devnodes
From: Mauro Carvalho Chehab @ 2015-01-07 14:22 UTC (permalink / raw)
To: Sakari Ailus
Cc: Laurent Pinchart, Linux Media Mailing List, Mauro Carvalho Chehab,
linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <54AD3E00.5070208-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Em Wed, 07 Jan 2015 16:09:04 +0200
Sakari Ailus <sakari.ailus-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> escreveu:
> Hi Mauro,
>
> Mauro Carvalho Chehab wrote:
> > Most of the DVB subdevs have already their own devnode.
> >
> > Add support for them at the media controller API.
> >
> > Signed-off-by: Mauro Carvalho Chehab <mchehab-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org>
> >
> > diff --git a/include/uapi/linux/media.h b/include/uapi/linux/media.h
> > index 7902e800f019..707db275f92b 100644
> > --- a/include/uapi/linux/media.h
> > +++ b/include/uapi/linux/media.h
> > @@ -50,7 +50,14 @@ struct media_device_info {
> > #define MEDIA_ENT_T_DEVNODE_V4L (MEDIA_ENT_T_DEVNODE + 1)
> > #define MEDIA_ENT_T_DEVNODE_FB (MEDIA_ENT_T_DEVNODE + 2)
> > #define MEDIA_ENT_T_DEVNODE_ALSA (MEDIA_ENT_T_DEVNODE + 3)
> > -#define MEDIA_ENT_T_DEVNODE_DVB (MEDIA_ENT_T_DEVNODE + 4)
> > +#define MEDIA_ENT_T_DEVNODE_DVB_FE (MEDIA_ENT_T_DEVNODE + 4)
> > +#define MEDIA_ENT_T_DEVNODE_DVB_DEMUX (MEDIA_ENT_T_DEVNODE + 5)
> > +#define MEDIA_ENT_T_DEVNODE_DVB_DVR (MEDIA_ENT_T_DEVNODE + 6)
> > +#define MEDIA_ENT_T_DEVNODE_DVB_CA (MEDIA_ENT_T_DEVNODE + 7)
> > +#define MEDIA_ENT_T_DEVNODE_DVB_NET (MEDIA_ENT_T_DEVNODE + 8)
>
> I'd create another type for the DVB sub-type devices, as there is for
> V4L2 sub-devices. I wonder what Laurent thinks.
I discussed this quickly with Laurent on IRC.
There are some concept differences between V4L2 and DVB.
At v4l2:
- the spec is one monolitic header (videodev2.h);
- one devnode is used to control everyhing (/dev/video?)
- there is one v4l core for all types of devices
At DVB:
- each different DVB API has its own header;
- each DVB device type has its own core (ok, they're
linked into one module, but internally they're almost independent);
- each different DVB API has its own devnode.
So, using "SUBDEV" for DVB (or at least for the devnodes) don't
make much sense.
Ok, there are still some things at DVB side that could be mapped as
subdev. The clear example is the tuner. However, in this case, the
same tuner can be either V4L, DVB or both. So, we need to define just
one subdev type for the tuner.
Also, each DVB device can be identified via major/minor pairs.
I wrote already (and submitted upstream) the patches for media-ctl to
recognize them. They're also on my experimental v4l-utils tree:
http://git.linuxtv.org/cgit.cgi/mchehab/experimental-v4l-utils.git/log/?h=dvb-media-ctl
Regards,
Mauro
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Eric W. Biederman @ 2015-01-07 14:45 UTC (permalink / raw)
To: Richard Weinberger
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Tejun Heo, cgroups mailinglist, Ingo Molnar
In-Reply-To: <54ACFC38.5070007-/L3Ra7n9ekc@public.gmane.org>
Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
> Am 07.01.2015 um 00:20 schrieb Aditya Kali:
>> I understand your point. But it will add some complexity to the code.
>>
>> Before trying to make it work for non-unified hierarchy cases, I would
>> like to get a clearer idea.
>> What do you expect to be mounted when you run:
>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>> from inside the container?
>>
>> Note that cgroup-namespace wont be able to change the way cgroups are
>> mounted .. i.e., if say cpu and cpuacct subsystems are mounted
>> together at a single mount-point, then we cannot mount them any other
>> way (inside a container or outside). This restriction exists today and
>> cgroup-namespaces won't change that.
>
> I wondered why cgroup namespaces won't change that and looked at your patches
> in more detail.
> What you propose as cgroup namespace is much more a cgroup chroot() than
> a namespace.
> As you pass relative paths into the namespace you depend on the mount structure
> of the host side.
> Hence, the abstraction between namespaces happens on the mount paths of the initial
> cgroupfs. But we really want a new cgroupfs instance within a container and not just
> a cut out of the initial cgroupfs mount.
>
> I fear you approach is over simplified and won't work for all cases. It may work
> for your specific use case at Google but we really want something generic.
> Eric, what do you think?
I think I probably need to go back upthread and read the patches.
I think it is a reasonable practical requirement that a widely used long
term supported distribution like RHEL 7 needs to be able to run in a linux
container bizarre init system and all. And that we the abstractions
should be that that we should be able to migrate such a beast.
There are a couple of issues in play and I think we need actual testing
rather than reports that something shouldn't work before we reject a set
of patches. Aditya in one of his replies to me has reported a
configuration that he expects will work. So I think that configuration
needs to be tested.
cgroups is a weird beast and the problems tend not to lie where a person
would first expect.
I suspect no one strongly cares if the cgroup hierarchy is unified or
not. By unified hierarchy I mean that every mount of cgroupfs has the
same directories with the same processes in each directory.
I do think people will care which controllers will show up in differ
mounts of cgroupfs, and I think that is relevant to process migration.
I am going to segway into scope of what is achievable with a cgroup namespace.
- If there are files in cgroupfs that are not safe to delegate we can
not support those files in a container.
Last I looked there were such files and systemd used them.
- Which controllers share hierarchies of processes to track resources is
a core cgroup issue and not a cgroup namespace issue.
If we find problems with using a unified hierarchy support we need to
go fix cgroups in general not cgroupfs.
Eric
^ permalink raw reply
* [PATCH v8 0/2] crypto: AF_ALG: add AEAD and RNG support
From: Stephan Mueller @ 2015-01-07 15:51 UTC (permalink / raw)
To: 'Herbert Xu'
Cc: 'Quentin Gouchet', Daniel Borkmann, linux-kernel,
linux-api, linux-crypto
Hi,
This patch set adds AEAD and RNG support to the AF_ALG interface
exported by the kernel crypto API. By extending AF_ALG with AEAD and RNG
support, all cipher types the kernel crypto API allows access to are
now accessible from userspace.
Both, AEAD and RNG implementations are stand-alone and do not depend
other AF_ALG interfaces (like hash or skcipher).
The AEAD implementation uses the same approach as provided with
skcipher by offering the following interfaces:
* sendmsg and recvmsg interfaces allowing multiple
invocations supporting a threaded user space. To support
multi-threaded user space, kernel-side buffering
is implemented similarly to skcipher.
* splice / vmsplice interfaces allowing a zero-copy
invocation
The RNG interface only implements the recvmsg interface as
zero-copy is not applicable.
The new AEAD and RNG interfaces are fully tested with the test application
provided at [1]. That test application exercises all newly added user space
interfaces. The testing covers:
* use of the sendmsg/recvmsg interface
* use of the splice / vmsplice interface
* invocation of all AF_ALG types (aead, rng, skcipher, hash)
* using all types of operation (encryption, decryption, keyed MD,
MD, random numbers, AEAD decryption with positive and negative
authentication verification)
* stress testing by running all tests for 30 minutes in an
endless loop
* test execution on 64 bit and 32 bit
[1] http://www.chronox.de/libkcapi.html
Changes v2:
* rebase to current cryptodev-2.6 tree
* use memzero_explicit to zeroize AEAD associated data
* use sizeof for determining length of AEAD associated data
* update algif_rng.c covering all suggestions from Daniel Borkmann
<dborkman@redhat.com>
* addition of patch 9: add digestsize interface for hashes
* addition of patch to update documentation covering the userspace interface
* change numbers of getsockopt options: separate them from sendmsg interface
definitions
Changes v3:
* remove getsockopt interface
* AEAD: associated data is set prepended to the plain/ciphertext
* AEAD: allowing arbitrary associated data lengths
* remove setkey patch as protection was already in the existing code
Changes v4:
* stand-alone implementation of AEAD
* testing of all interfaces offered by AEAD
* stress testing of AEAD and RNG
Changes v5:
* AEAD: add outer while(size) loop in aead_sendmsg to ensure all data is
copied into the kernel (reporter Herbert Xu)
* AEAD: aead_sendmsg bug fix: change size -= len; to size -= plen;
* AF_ALG / AEAD: add aead_setauthsize and associated extension to
struct af_alg_type as well as alg_setsockopt (reporter Herbert Xu)
* RNG: rng_recvmsg: use 128 byte stack variable for output of RNG instead
of ctx->result (reporter Herbert Xu)
* RNG / AF_ALG: allow user space to seed RNG via setsockopt
* RNG: rng_recvmsg bug fix: use genlen as result variable for
crypto_rng_get_bytes as previously no negative errors were obtained
* AF_ALG: alg_setop: zeroize buffer before free
Changes v6:
* AEAD/RNG: port to 3.19-rc1 with the iov_iter handling
* RNG: use the setkey interface to obtain the seed and drop the patch adding
a separate reseeding interface
* extract the zeroization patch for alg_setkey into a stand-alone patch
submission
* fix bug in aead_sufficient_data (reporter Herbert Xu)
* testing of all interfaces with test application provided with libkcapi version
0.6.2
Changes v7:
* AEAD: aead_recvmsg: change error code from ENOMEM to EINVAL
* AEAD: drop aead_readable/aead_sufficient_data and only use ctx->more to decide
whether the read side shall become active. This change requires that the
patch for crypto_aead_decrypt ensuring that the ciphertext contains the
authentication tag was added -- see https://lkml.org/lkml/2014/12/30/200.
Otherwise, user space can trigger a kernel crash.
* RNG: patch dropped as it was applied
* AEAD: port Kconfig/Makefile patch forward to current code base
Changes v8:
* Removed check for aead_assoclen in aead_sendmsg
* Fix endless loop bug in aead_sendmsg (check for sgl->cur > ALG_MAX_PAGES in
while condition removed -- this condition is checked within the loop already)
* Resurrect aead_sufficient_data and call it in aead_sendmsg, aead_sendpage to
notify caller about wrong invocation
* Re-add aead_sufficient_data to aead_recvmsg to verify user input data before
using them to ensure the kernel protects against malicious parameters
* Allow arbitrary size of AD (i.e. up to the maximum buffer size of
ALG_MAX_PAGES)
* When aead_recvmsg receives an error from decryption, release all pages if the
error is EBADMSG -- this error implies that a proper decryption was performed
but the integrity of the message is lost. This error is considered to be a
valid decryption result.
* Add test cases for sendmsg and splice interface to test large AD sizes (in
case of sendmsg, use 65504 bytes AD and 32 bytes plaintext; in case of splice
use 15 pages AD and 32 bytes in the 16th page for plaintext). See [1] for
updated test case.
Stephan Mueller (2):
crypto: AF_ALG: add AEAD support
crypto: AF_ALG: enable AEAD interface compilation
crypto/Kconfig | 9 +
crypto/Makefile | 1 +
crypto/algif_aead.c | 666 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 676 insertions(+)
create mode 100644 crypto/algif_aead.c
--
2.1.0
^ permalink raw reply
* [PATCH v8 1/2] crypto: AF_ALG: add AEAD support
From: Stephan Mueller @ 2015-01-07 15:51 UTC (permalink / raw)
To: 'Herbert Xu'
Cc: 'Quentin Gouchet', Daniel Borkmann, linux-kernel,
linux-api, linux-crypto
In-Reply-To: <33040723.pAXIT3fl8h@tachyon.chronox.de>
This patch adds the AEAD support for AF_ALG.
The implementation is based on algif_skcipher, but contains heavy
modifications to streamline the interface for AEAD uses.
To use AEAD, the user space consumer has to use the salg_type named
"aead".
The AEAD implementation includes some overhead to calculate the size of
the ciphertext, because the AEAD implementation of the kernel crypto API
makes implied assumption on the location of the authentication tag. When
performing an encryption, the tag will be added to the created
ciphertext (note, the tag is placed adjacent to the ciphertext). For
decryption, the caller must hand in the ciphertext with the tag appended
to the ciphertext. Therefore, the selection of the used memory
needs to add/subtract the tag size from the source/destination buffers
depending on the encryption type. The code is provided with comments
explaining when and how that operation is performed.
A fully working example using all aspects of AEAD is provided at
http://www.chronox.de/libkcapi.html
Signed-off-by: Stephan Mueller <smueller@chronox.de>
---
crypto/algif_aead.c | 666 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 666 insertions(+)
create mode 100644 crypto/algif_aead.c
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
new file mode 100644
index 0000000..34bf4a0
--- /dev/null
+++ b/crypto/algif_aead.c
@@ -0,0 +1,666 @@
+/*
+ * algif_aeadr: User-space interface for AEAD algorithms
+ *
+ * Copyright (C) 2014, Stephan Mueller <smueller@chronox.de>
+ *
+ * This file provides the user-space API for AEAD ciphers.
+ *
+ * This file is derived from algif_skcipher.c.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <crypto/scatterwalk.h>
+#include <crypto/if_alg.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/net.h>
+#include <net/sock.h>
+
+struct aead_sg_list {
+ unsigned int cur;
+ struct scatterlist sg[ALG_MAX_PAGES];
+};
+
+struct aead_ctx {
+ struct aead_sg_list tsgl;
+ struct af_alg_sgl rsgl;
+
+ void *iv;
+
+ struct af_alg_completion completion;
+
+ unsigned long used;
+
+ unsigned int len;
+ bool more;
+ bool merge;
+ bool enc;
+
+ size_t aead_assoclen;
+ struct aead_request aead_req;
+};
+
+static inline int aead_sndbuf(struct sock *sk)
+{
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+
+ return max_t(int, max_t(int, sk->sk_sndbuf & PAGE_MASK, PAGE_SIZE) -
+ ctx->used, 0);
+}
+
+static inline bool aead_writable(struct sock *sk)
+{
+ return PAGE_SIZE <= aead_sndbuf(sk);
+}
+
+static inline bool aead_sufficient_data(struct aead_ctx *ctx)
+{
+ unsigned as = crypto_aead_authsize(crypto_aead_reqtfm(&ctx->aead_req));
+
+ return (ctx->used >= (ctx->aead_assoclen + (ctx->enc ? 0 : as)));
+}
+
+static void aead_put_sgl(struct sock *sk)
+{
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ struct aead_sg_list *sgl = &ctx->tsgl;
+ struct scatterlist *sg = sgl->sg;
+ unsigned int i;
+
+ for (i = 0; i < sgl->cur; i++) {
+ if (!sg_page(sg + i))
+ continue;
+
+ put_page(sg_page(sg + i));
+ sg_assign_page(sg + i, NULL);
+ }
+ sgl->cur = 0;
+ ctx->used = 0;
+ ctx->more = 0;
+ ctx->merge = 0;
+}
+
+static int aead_wait_for_wmem(struct sock *sk, unsigned flags)
+{
+ long timeout;
+ DEFINE_WAIT(wait);
+ int err = -ERESTARTSYS;
+
+ if (flags & MSG_DONTWAIT)
+ return -EAGAIN;
+
+ set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+ for (;;) {
+ if (signal_pending(current))
+ break;
+ prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+ timeout = MAX_SCHEDULE_TIMEOUT;
+ if (sk_wait_event(sk, &timeout, aead_writable(sk))) {
+ err = 0;
+ break;
+ }
+ }
+ finish_wait(sk_sleep(sk), &wait);
+
+ return err;
+}
+
+static void aead_wmem_wakeup(struct sock *sk)
+{
+ struct socket_wq *wq;
+
+ if (!aead_writable(sk))
+ return;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_sync_poll(&wq->wait, POLLIN |
+ POLLRDNORM |
+ POLLRDBAND);
+ sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+ rcu_read_unlock();
+}
+
+static int aead_wait_for_data(struct sock *sk, unsigned flags)
+{
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ long timeout;
+ DEFINE_WAIT(wait);
+ int err = -ERESTARTSYS;
+
+ if (flags & MSG_DONTWAIT) {
+ return -EAGAIN;
+ }
+
+ set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+ for (;;) {
+ if (signal_pending(current))
+ break;
+ prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
+ timeout = MAX_SCHEDULE_TIMEOUT;
+ if (sk_wait_event(sk, &timeout, !ctx->more)) {
+ err = 0;
+ break;
+ }
+ }
+ finish_wait(sk_sleep(sk), &wait);
+
+ clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
+
+ return err;
+}
+
+static void aead_data_wakeup(struct sock *sk)
+{
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ struct socket_wq *wq;
+
+ if (ctx->more)
+ return;
+
+ rcu_read_lock();
+ wq = rcu_dereference(sk->sk_wq);
+ if (wq_has_sleeper(wq))
+ wake_up_interruptible_sync_poll(&wq->wait, POLLOUT |
+ POLLRDNORM |
+ POLLRDBAND);
+ sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
+ rcu_read_unlock();
+}
+
+static int aead_sendmsg(struct kiocb *unused, struct socket *sock,
+ struct msghdr *msg, size_t size)
+{
+ struct sock *sk = sock->sk;
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ unsigned ivsize =
+ crypto_aead_ivsize(crypto_aead_reqtfm(&ctx->aead_req));
+ struct aead_sg_list *sgl = &ctx->tsgl;
+ struct af_alg_control con = {};
+ long copied = 0;
+ bool enc = 0;
+ bool init = 0;
+ int err = -EINVAL;
+
+ if (msg->msg_controllen) {
+ err = af_alg_cmsg_send(msg, &con);
+ if (err)
+ return err;
+
+ init = 1;
+ switch (con.op) {
+ case ALG_OP_ENCRYPT:
+ enc = 1;
+ break;
+ case ALG_OP_DECRYPT:
+ enc = 0;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (con.iv && con.iv->ivlen != ivsize)
+ return -EINVAL;
+ }
+
+ lock_sock(sk);
+ if (!ctx->more && ctx->used)
+ goto unlock;
+
+ if (init) {
+ ctx->enc = enc;
+ if (con.iv)
+ memcpy(ctx->iv, con.iv->iv, ivsize);
+
+ ctx->aead_assoclen = con.aead_assoclen;
+ }
+
+ while (size) {
+ unsigned long len = size;
+ struct scatterlist *sg = NULL;
+
+ if (ctx->merge) {
+ sg = sgl->sg + sgl->cur - 1;
+ len = min_t(unsigned long, len,
+ PAGE_SIZE - sg->offset - sg->length);
+ err = memcpy_from_msg(page_address(sg_page(sg)) +
+ sg->offset + sg->length,
+ msg, len);
+ if (err)
+ goto unlock;
+
+ sg->length += len;
+ ctx->merge = (sg->offset + sg->length) &
+ (PAGE_SIZE - 1);
+
+ ctx->used += len;
+ copied += len;
+ size -= len;
+ }
+
+ if (!aead_writable(sk)) {
+ /*
+ * If there is more data to be expected, but we cannot
+ * write more data, forcefully define that we do not
+ * expect more data to invoke the AEAD operation. This
+ * prevents a deadlock in user space.
+ */
+ ctx->more = 0;
+ err = aead_wait_for_wmem(sk, msg->msg_flags);
+ if (err)
+ goto unlock;
+ }
+
+ len = min_t(unsigned long, size, aead_sndbuf(sk));
+ while (len) {
+ int plen = 0;
+
+ if (sgl->cur >= ALG_MAX_PAGES) {
+ err = -E2BIG;
+ goto unlock;
+ }
+
+ sg = sgl->sg + sgl->cur;
+ plen = min_t(int, len, PAGE_SIZE);
+
+ sg_assign_page(sg, alloc_page(GFP_KERNEL));
+ err = -ENOMEM;
+ if (!sg_page(sg))
+ goto unlock;
+
+ err = memcpy_from_msg(page_address(sg_page(sg)),
+ msg, plen);
+ if (err) {
+ __free_page(sg_page(sg));
+ sg_assign_page(sg, NULL);
+ goto unlock;
+ }
+
+ sg->length = plen;
+ len -= plen;
+ ctx->used += plen;
+ copied += plen;
+ sgl->cur++;
+ size -= plen;
+ ctx->merge = plen & (PAGE_SIZE - 1);
+ }
+ }
+
+ err = 0;
+
+ ctx->more = msg->msg_flags & MSG_MORE;
+ if (!ctx->more && !aead_sufficient_data(ctx))
+ err = -EINVAL;
+
+unlock:
+ aead_data_wakeup(sk);
+ release_sock(sk);
+
+ return err ?: copied;
+}
+
+static ssize_t aead_sendpage(struct socket *sock, struct page *page,
+ int offset, size_t size, int flags)
+{
+ struct sock *sk = sock->sk;
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ struct aead_sg_list *sgl = &ctx->tsgl;
+ int err = -EINVAL;
+
+ if (flags & MSG_SENDPAGE_NOTLAST)
+ flags |= MSG_MORE;
+
+ if (sgl->cur >= ALG_MAX_PAGES)
+ return -E2BIG;
+
+ lock_sock(sk);
+ if (!ctx->more && ctx->used)
+ goto unlock;
+
+ if (!size)
+ goto done;
+
+ if (!aead_writable(sk)) {
+ /* see aead_sendmsg why more is set to 0 */
+ ctx->more = 0;
+ err = aead_wait_for_wmem(sk, flags);
+ if (err)
+ goto unlock;
+ }
+
+ ctx->merge = 0;
+
+ get_page(page);
+ sg_set_page(sgl->sg + sgl->cur, page, size, offset);
+ sgl->cur++;
+ ctx->used += size;
+
+ err = 0;
+
+done:
+ ctx->more = flags & MSG_MORE;
+ if (!ctx->more && !aead_sufficient_data(ctx))
+ err = -EINVAL;
+
+unlock:
+ aead_data_wakeup(sk);
+ release_sock(sk);
+
+ return err ?: size;
+}
+
+static int aead_recvmsg(struct kiocb *unused, struct socket *sock,
+ struct msghdr *msg, size_t ignored, int flags)
+{
+ struct sock *sk = sock->sk;
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ unsigned bs = crypto_aead_blocksize(crypto_aead_reqtfm(&ctx->aead_req));
+ unsigned as = crypto_aead_authsize(crypto_aead_reqtfm(&ctx->aead_req));
+ struct aead_sg_list *sgl = &ctx->tsgl;
+ struct scatterlist *sg = sgl->sg;
+ struct scatterlist assoc[ALG_MAX_PAGES];
+ size_t assoclen = 0;
+ unsigned int i = 0;
+ int err = -EAGAIN;
+ unsigned long used = 0;
+ unsigned long outlen = 0;
+
+ /*
+ * Require exactly one IOV block as the AEAD operation is a one shot
+ * due to the authentication tag.
+ */
+ if (msg->msg_iter.nr_segs != 1)
+ return -ENOMSG;
+
+ lock_sock(sk);
+ /*
+ * AEAD memory structure: For encryption, the tag is appended to the
+ * ciphertext which implies that the memory allocated for the ciphertext
+ * must be increased by the tag length. For decryption, the tag
+ * is expected to be concatenated to the ciphertext. The plaintext
+ * therefore has a memory size of the ciphertext minus the tag length.
+ *
+ * The memory structure for cipher operation has the following
+ * structure:
+ * AEAD encryption input: assoc data || plaintext
+ * AEAD encryption output: cipherntext || auth tag
+ * AEAD decryption input: assoc data || ciphertext || auth tag
+ * AEAD decryption output: plaintext
+ */
+
+ if (ctx->more) {
+ err = aead_wait_for_data(sk, flags);
+ if (err)
+ goto unlock;
+ }
+
+ used = ctx->used;
+
+ err = -EINVAL;
+
+ /*
+ * Make sure sufficient data is present -- note, the same check is
+ * is also present in sendmsg/sendpage. The checks in sendpage/sendmsg
+ * shall provide an information to the data sender that something is
+ * wrong, but they are irrelevant to maintain the kernel integrity.
+ * We need this check here too in case user space decides to not honor
+ * the error message in sendmsg/sendpage and still call recvmsg. This
+ * check here protects the kernel integrity.
+ */
+ if (!aead_sufficient_data(ctx))
+ goto unlock;
+
+ /*
+ * The cipher operation input data is reduced by the associated data
+ * length as this data is processed separately later on.
+ */
+ used -= ctx->aead_assoclen;
+
+ if (ctx->enc) {
+ /* round up output buffer to multiple of block size */
+ outlen = ((used + bs - 1) / bs * bs);
+ /* add the size needed for the auth tag to be created */
+ outlen += as;
+ } else {
+ /* output data size is input without the authentication tag */
+ outlen = used - as;
+ /* round up output buffer to multiple of block size */
+ outlen = ((outlen + bs - 1) / bs * bs);
+ }
+
+ /* ensure output buffer is sufficiently large */
+ if (msg->msg_iter.iov->iov_len < outlen)
+ goto unlock;
+
+ outlen = af_alg_make_sg(&ctx->rsgl, msg->msg_iter.iov->iov_base,
+ outlen, 1);
+ err = outlen;
+ if (err < 0)
+ goto unlock;
+
+ err = -EINVAL;
+ sg_init_table(assoc, ALG_MAX_PAGES);
+ assoclen = ctx->aead_assoclen;
+ /*
+ * Split scatterlist into two: first part becomes AD, second part
+ * is plaintext / ciphertext. The first part is assigned to assoc
+ * scatterlist. When this loop finishes, sg points to the start of the
+ * plaintext / ciphertext.
+ */
+ for(i = 0; i < ctx->tsgl.cur; i++) {
+ sg = sgl->sg + i;
+ if (sg->length <= assoclen) {
+ /* AD is larger than one page */
+ sg_set_page(assoc + i, sg_page(sg),
+ sg->length, sg->offset);
+ assoclen -= sg->length;
+ if (i >= ctx->tsgl.cur)
+ goto unlock;
+ } else if (!assoclen) {
+ /* current page is to start of plaintext / ciphertext */
+ if (i)
+ /* AD terminates at page boundary */
+ sg_mark_end(assoc + i - 1);
+ else
+ /* AD size is zero */
+ sg_mark_end(assoc);
+ break;
+ } else {
+ /* AD does not terminate at page boundary */
+ sg_set_page(assoc + i, sg_page(sg),
+ assoclen, sg->offset);
+ sg_mark_end(assoc + i);
+ /* plaintext / ciphertext starts after AD */
+ sg->length -= assoclen;
+ sg->offset += assoclen;
+ break;
+ }
+ }
+
+ aead_request_set_assoc(&ctx->aead_req, assoc, ctx->aead_assoclen);
+ aead_request_set_crypt(&ctx->aead_req, sg, ctx->rsgl.sg, used, ctx->iv);
+
+ err = af_alg_wait_for_completion(ctx->enc ?
+ crypto_aead_encrypt(&ctx->aead_req) :
+ crypto_aead_decrypt(&ctx->aead_req),
+ &ctx->completion);
+
+ af_alg_free_sg(&ctx->rsgl);
+
+ if (err) {
+ /* EBADMSG implies a valid cipher operation took place */
+ if (err == -EBADMSG)
+ aead_put_sgl(sk);
+ goto unlock;
+ }
+
+ aead_put_sgl(sk);
+
+ err = 0;
+
+unlock:
+ aead_wmem_wakeup(sk);
+ release_sock(sk);
+
+ return err ? err : outlen;
+}
+
+static unsigned int aead_poll(struct file *file, struct socket *sock,
+ poll_table *wait)
+{
+ struct sock *sk = sock->sk;
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ unsigned int mask;
+
+ sock_poll_wait(file, sk_sleep(sk), wait);
+ mask = 0;
+
+ if (!ctx->more)
+ mask |= POLLIN | POLLRDNORM;
+
+ if (aead_writable(sk))
+ mask |= POLLOUT | POLLWRNORM | POLLWRBAND;
+
+ return mask;
+}
+
+static struct proto_ops algif_aead_ops = {
+ .family = PF_ALG,
+
+ .connect = sock_no_connect,
+ .socketpair = sock_no_socketpair,
+ .getname = sock_no_getname,
+ .ioctl = sock_no_ioctl,
+ .listen = sock_no_listen,
+ .shutdown = sock_no_shutdown,
+ .getsockopt = sock_no_getsockopt,
+ .mmap = sock_no_mmap,
+ .bind = sock_no_bind,
+ .accept = sock_no_accept,
+ .setsockopt = sock_no_setsockopt,
+
+ .release = af_alg_release,
+ .sendmsg = aead_sendmsg,
+ .sendpage = aead_sendpage,
+ .recvmsg = aead_recvmsg,
+ .poll = aead_poll,
+};
+
+static void *aead_bind(const char *name, u32 type, u32 mask)
+{
+ return crypto_alloc_aead(name, type, mask);
+}
+
+static void aead_release(void *private)
+{
+ crypto_free_aead(private);
+}
+
+static int aead_setauthsize(void *private, unsigned int authsize)
+{
+ return crypto_aead_setauthsize(private, authsize);
+}
+
+static int aead_setkey(void *private, const u8 *key, unsigned int keylen)
+{
+ return crypto_aead_setkey(private, key, keylen);
+}
+
+static void aead_sock_destruct(struct sock *sk)
+{
+ struct alg_sock *ask = alg_sk(sk);
+ struct aead_ctx *ctx = ask->private;
+ unsigned int ivlen = crypto_aead_ivsize(
+ crypto_aead_reqtfm(&ctx->aead_req));
+
+ aead_put_sgl(sk);
+ sock_kzfree_s(sk, ctx->iv, ivlen);
+ sock_kfree_s(sk, ctx, ctx->len);
+ af_alg_release_parent(sk);
+}
+
+static int aead_accept_parent(void *private, struct sock *sk)
+{
+ struct aead_ctx *ctx;
+ struct alg_sock *ask = alg_sk(sk);
+ unsigned int len = sizeof(*ctx) + crypto_aead_reqsize(private);
+ unsigned int ivlen = crypto_aead_ivsize(private);
+
+ ctx = sock_kmalloc(sk, len, GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+ memset(ctx, 0, len);
+
+ ctx->iv = sock_kmalloc(sk, ivlen, GFP_KERNEL);
+ if (!ctx->iv) {
+ sock_kfree_s(sk, ctx, len);
+ return -ENOMEM;
+ }
+ memset(ctx->iv, 0, ivlen);
+
+ ctx->len = len;
+ ctx->used = 0;
+ ctx->more = 0;
+ ctx->merge = 0;
+ ctx->enc = 0;
+ ctx->tsgl.cur = 0;
+ ctx->aead_assoclen = 0;
+ af_alg_init_completion(&ctx->completion);
+ sg_init_table(ctx->tsgl.sg, ALG_MAX_PAGES);
+
+ ask->private = ctx;
+
+ aead_request_set_tfm(&ctx->aead_req, private);
+ aead_request_set_callback(&ctx->aead_req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+ af_alg_complete, &ctx->completion);
+
+ sk->sk_destruct = aead_sock_destruct;
+
+ return 0;
+}
+
+static const struct af_alg_type algif_type_aead = {
+ .bind = aead_bind,
+ .release = aead_release,
+ .setkey = aead_setkey,
+ .setauthsize = aead_setauthsize,
+ .accept = aead_accept_parent,
+ .ops = &algif_aead_ops,
+ .name = "aead",
+ .owner = THIS_MODULE
+};
+
+static int __init algif_aead_init(void)
+{
+ return af_alg_register_type(&algif_type_aead);
+}
+
+static void __exit algif_aead_exit(void)
+{
+ int err = af_alg_unregister_type(&algif_type_aead);
+ BUG_ON(err);
+}
+
+module_init(algif_aead_init);
+module_exit(algif_aead_exit);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Stephan Mueller <smueller@chronox.de>");
+MODULE_DESCRIPTION("AEAD kernel crypto API user space interface");
--
2.1.0
^ permalink raw reply related
* [PATCH v8 2/2] crypto: AF_ALG: enable AEAD interface compilation
From: Stephan Mueller @ 2015-01-07 15:52 UTC (permalink / raw)
To: 'Herbert Xu'
Cc: 'Quentin Gouchet', Daniel Borkmann,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-crypto-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <33040723.pAXIT3fl8h-PJstQz4BMNNP20K/wil9xYQuADTiUCJX@public.gmane.org>
Enable compilation of the AEAD AF_ALG support and provide a Kconfig
option to compile the AEAD AF_ALG support.
Signed-off-by: Stephan Mueller <smueller-T9tCv8IpfcWELgA04lAiVw@public.gmane.org>
---
crypto/Kconfig | 9 +++++++++
crypto/Makefile | 1 +
2 files changed, 10 insertions(+)
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 50f4da4..41a3fc5 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1523,6 +1523,15 @@ config CRYPTO_USER_API_RNG
This option enables the user-spaces interface for random
number generator algorithms.
+config CRYPTO_USER_API_AEAD
+ tristate "User-space interface for AEAD cipher algorithms"
+ depends on NET
+ select CRYPTO_AEAD
+ select CRYPTO_USER_API
+ help
+ This option enables the user-spaces interface for AEAD
+ cipher algorithms.
+
config CRYPTO_HASH_INFO
bool
diff --git a/crypto/Makefile b/crypto/Makefile
index ba19465..97b7d3a 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
obj-$(CONFIG_CRYPTO_USER_API_RNG) += algif_rng.o
+obj-$(CONFIG_CRYPTO_USER_API_AEAD) += algif_aead.o
#
# generic algorithms and the async_tx api
--
2.1.0
^ permalink raw reply related
* [PATCH] selftests/vm: fix link error for transhuge-stress test
From: Andrey Skvortsov @ 2015-01-07 18:35 UTC (permalink / raw)
To: Shuah Khan, Andrew Morton, Konstantin Khlebnikov, linux-api,
linux-kernel
Cc: Andrey Skvortsov
add -lrt to fix undefined reference to `clock_gettime'
Signed-off-by: Andrey Skvortsov <andrej.skvortzov@gmail.com>
---
tools/testing/selftests/vm/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 4c4b1f6..077828c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -7,7 +7,7 @@ BINARIES += transhuge-stress
all: $(BINARIES)
%: %.c
- $(CC) $(CFLAGS) -o $@ $^
+ $(CC) $(CFLAGS) -o $@ $^ -lrt
run_tests: all
@/bin/sh ./run_vmtests || (echo "vmtests: [FAIL]"; exit 1)
--
1.7.9.5
^ permalink raw reply related
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Aditya Kali @ 2015-01-07 18:57 UTC (permalink / raw)
To: Richard Weinberger
Cc: Eric W. Biederman, Tejun Heo, Li Zefan, Serge Hallyn,
Andy Lutomirski, cgroups mailinglist,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
Ingo Molnar, Linux Containers, Rohit Jnagal, Vivek Goyal
In-Reply-To: <54ACFC38.5070007-/L3Ra7n9ekc@public.gmane.org>
On Wed, Jan 7, 2015 at 1:28 AM, Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> wrote:
> Am 07.01.2015 um 00:20 schrieb Aditya Kali:
>> I understand your point. But it will add some complexity to the code.
>>
>> Before trying to make it work for non-unified hierarchy cases, I would
>> like to get a clearer idea.
>> What do you expect to be mounted when you run:
>> container:/ # mount -t cgroup none /sys/fs/cgroup/
>> from inside the container?
>>
>> Note that cgroup-namespace wont be able to change the way cgroups are
>> mounted .. i.e., if say cpu and cpuacct subsystems are mounted
>> together at a single mount-point, then we cannot mount them any other
>> way (inside a container or outside). This restriction exists today and
>> cgroup-namespaces won't change that.
>
> I wondered why cgroup namespaces won't change that and looked at your patches
> in more detail.
> What you propose as cgroup namespace is much more a cgroup chroot() than
> a namespace.
> As you pass relative paths into the namespace you depend on the mount structure
> of the host side.
> Hence, the abstraction between namespaces happens on the mount paths of the initial
> cgroupfs. But we really want a new cgroupfs instance within a container and not just
> a cut out of the initial cgroupfs mount.
>
What you describe will be useful at Google too, just that I found it
difficult/infeasible to include it in the scope of cgroup namespaces.
The scope of cgroup namespace was deliberately limited to virtualize
/proc/<pid>/cgroup file. That too in a way that doesn't need major
changes to cgroup code itself. (It was also limited to unified
hierarchy to keep things simple, but that can be changed).
Many of the cgroup subsystems (memory, cpu, etc) rely on the fact that
they can see entire cgroup view. For example, in a memcg-OOM scenario,
the memory controller would need to look at all sub-cgroups inside the
OOMing cgroup. A per namespace cgroupfs instance (if I understand
correctly) would mean that sub-cgroups created inside the namespace
won't be visible outside. I expect this will break the functionality
of the subsystem.
Illustration: memcg A is under OOM; [B] and [C] are cgroup namespace
roots with possibly namespace-private sub-cgroups.
------ [B]
A --------|
------ [C]
Cgroups are heavily used inside the kernel for various purposes which
need any namespace-agnostic view. Inherent limitation of running
containers running on a machine is that they share the same kernel.
Perhaps what you need is something like kexec to be supported inside a
container.
> I fear you approach is over simplified and won't work for all cases. It may work
> for your specific use case at Google but we really want something generic.
> Eric, what do you think?
>
> Thanks,
> //richard
--
Aditya
^ permalink raw reply
* Re: [PATCH] selftests/vm: fix link error for transhuge-stress test
From: Shuah Khan @ 2015-01-07 19:21 UTC (permalink / raw)
To: Andrey Skvortsov, Andrew Morton, Konstantin Khlebnikov,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1420655754-10076-1-git-send-email-Andrej.Skvortzov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
On 01/07/2015 11:35 AM, Andrey Skvortsov wrote:
> add -lrt to fix undefined reference to `clock_gettime'
>
> Signed-off-by: Andrey Skvortsov <andrej.skvortzov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
> tools/testing/selftests/vm/Makefile | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
> index 4c4b1f6..077828c 100644
> --- a/tools/testing/selftests/vm/Makefile
> +++ b/tools/testing/selftests/vm/Makefile
> @@ -7,7 +7,7 @@ BINARIES += transhuge-stress
>
> all: $(BINARIES)
> %: %.c
> - $(CC) $(CFLAGS) -o $@ $^
> + $(CC) $(CFLAGS) -o $@ $^ -lrt
>
> run_tests: all
> @/bin/sh ./run_vmtests || (echo "vmtests: [FAIL]"; exit 1)
>
Andrey,
I don't see any undefined references when I build. Curious if it is
specific to your env??
Please include the warning in the change log
when you fix warnings in the future.
-- Shuah
--
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh-JPH+aEBZ4P+UEJcrhfAQsw@public.gmane.org | (970) 217-8978
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Serge E. Hallyn @ 2015-01-07 19:30 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Richard Weinberger, Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Tejun Heo, cgroups mailinglist, Ingo Molnar
In-Reply-To: <87fvbmir9q.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> Richard Weinberger <richard-/L3Ra7n9ekc@public.gmane.org> writes:
>
> > Am 07.01.2015 um 00:20 schrieb Aditya Kali:
> >> I understand your point. But it will add some complexity to the code.
> >>
> >> Before trying to make it work for non-unified hierarchy cases, I would
> >> like to get a clearer idea.
> >> What do you expect to be mounted when you run:
> >> container:/ # mount -t cgroup none /sys/fs/cgroup/
> >> from inside the container?
> >>
> >> Note that cgroup-namespace wont be able to change the way cgroups are
> >> mounted .. i.e., if say cpu and cpuacct subsystems are mounted
> >> together at a single mount-point, then we cannot mount them any other
> >> way (inside a container or outside). This restriction exists today and
> >> cgroup-namespaces won't change that.
> >
> > I wondered why cgroup namespaces won't change that and looked at your patches
> > in more detail.
> > What you propose as cgroup namespace is much more a cgroup chroot() than
> > a namespace.
> > As you pass relative paths into the namespace you depend on the mount structure
> > of the host side.
> > Hence, the abstraction between namespaces happens on the mount paths of the initial
> > cgroupfs. But we really want a new cgroupfs instance within a container and not just
> > a cut out of the initial cgroupfs mount.
> >
> > I fear you approach is over simplified and won't work for all cases. It may work
> > for your specific use case at Google but we really want something generic.
> > Eric, what do you think?
>
> I think I probably need to go back upthread and read the patches.
>
> I think it is a reasonable practical requirement that a widely used long
> term supported distribution like RHEL 7 needs to be able to run in a linux
> container bizarre init system and all. And that we the abstractions
> should be that that we should be able to migrate such a beast.
Userspace should be able to deal with however cgroups are mounted for
it. The only case I've heard of where it really made a meaningful
difference was google's advanced grid usage. In fact, the whole
justification of the unified cgroup stuff was that it was claimed (and
argued against by google) that that sufficed for any users.
Now yes, until now userspace could cache its info on how cgroups were
mounted and assume that wouldn't change (because the kernel wouldn't
let it), and migration will break that. But if the cgroup roadmap
is to obsolete anything but unified hierarchy, then this was going
to happen regardless of what the cgroupns patchset did.
I agree with Aditya. So long as the proclaimed direction of cgroups is
to only support unified cgroup hierarchy, there's no point in having
cgroupns do anything more than the chrooting.
> There are a couple of issues in play and I think we need actual testing
> rather than reports that something shouldn't work before we reject a set
> of patches. Aditya in one of his replies to me has reported a
> configuration that he expects will work. So I think that configuration
> needs to be tested.
>
> cgroups is a weird beast and the problems tend not to lie where a person
> would first expect.
>
> I suspect no one strongly cares if the cgroup hierarchy is unified or
> not.
Well, google does. There are cases that were either much more complicated
or impossible to represent with unified hierarchy. But complicating cgroupns
to support something which Tejun has said is explicitly not going to be
supported in the future would be ill-advised.
> By unified hierarchy I mean that every mount of cgroupfs has the
> same directories with the same processes in each directory.
No, my reading of Documentation/cgroups/unified-hierarchy.txt is that
unified hierarchy means that all (sane) controllers are co-mounted into
one hierarchy.
> I do think people will care which controllers will show up in differ
> mounts of cgroupfs, and I think that is relevant to process migration.
>
>
>
>
> I am going to segway into scope of what is achievable with a cgroup namespace.
>
> - If there are files in cgroupfs that are not safe to delegate we can
> not support those files in a container.
>
> Last I looked there were such files and systemd used them.
>
> - Which controllers share hierarchies of processes to track resources is
> a core cgroup issue and not a cgroup namespace issue.
>
> If we find problems with using a unified hierarchy support we need to
> go fix cgroups in general not cgroupfs.
>
> Eric
> _______________________________________________
> Containers mailing list
> Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
^ permalink raw reply
* Re: [PATCH] selftests/vm: fix link error for transhuge-stress test
From: Andrey Skvortsov @ 2015-01-07 20:10 UTC (permalink / raw)
To: Shuah Khan; +Cc: Andrew Morton, Konstantin Khlebnikov, linux-api, linux-kernel
In-Reply-To: <54AD872E.9060802@osg.samsung.com>
[-- Attachment #1: Type: text/plain, Size: 2029 bytes --]
On Wed, Jan 07, 2015 at 12:21:18PM -0700, Shuah Khan wrote:
> On 01/07/2015 11:35 AM, Andrey Skvortsov wrote:
> > add -lrt to fix undefined reference to `clock_gettime'
> >
> > Signed-off-by: Andrey Skvortsov <andrej.skvortzov@gmail.com>
> > ---
> > tools/testing/selftests/vm/Makefile | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
> > index 4c4b1f6..077828c 100644
> > --- a/tools/testing/selftests/vm/Makefile
> > +++ b/tools/testing/selftests/vm/Makefile
> > @@ -7,7 +7,7 @@ BINARIES += transhuge-stress
> >
> > all: $(BINARIES)
> > %: %.c
> > - $(CC) $(CFLAGS) -o $@ $^
> > + $(CC) $(CFLAGS) -o $@ $^ -lrt
> >
> > run_tests: all
> > @/bin/sh ./run_vmtests || (echo "vmtests: [FAIL]"; exit 1)
> >
>
> Andrey,
>
> I don't see any undefined references when I build. Curious if it is
> specific to your env??
>
> Please include the warning in the change log
> when you fix warnings in the future.
>
thanks for the comment.
Here is what I get without a patch:
linux-next/tools/testing/selftests/vm $ make
gcc -Wall -o hugepage-mmap hugepage-mmap.c
gcc -Wall -o hugepage-shm hugepage-shm.c
gcc -Wall -o map_hugetlb map_hugetlb.c
gcc -Wall -o thuge-gen thuge-gen.c
gcc -Wall -o hugetlbfstest hugetlbfstest.c
gcc -Wall -o transhuge-stress transhuge-stress.c
/tmp/ccpWoqkG.o: In function `main':
transhuge-stress.c:(.text+0x3a3): undefined reference to
`clock_gettime'
transhuge-stress.c:(.text+0x4dc): undefined reference to
`clock_gettime'
collect2: ld returned 1 exit status
make: *** [transhuge-stress] Error 1
$ gcc --version
gcc (Ubuntu/Linaro 4.6.4-1ubuntu1~12.04) 4.6.4.
The same error I get on my other Debian system.
man page for clock_gettime says 'Link with -lrt'. So I think the error
message is correct.
--
Best regards,
Andrey Skvortsov
Secure e-mail with gnupg: See http://www.gnupg.org/
PGP Key ID: 0x57A3AEAD
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Eric W. Biederman @ 2015-01-07 22:14 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Richard Weinberger, Linux API, Linux Containers, Serge Hallyn,
linux-kernel@vger.kernel.org, Andy Lutomirski, Tejun Heo,
cgroups mailinglist, Ingo Molnar
In-Reply-To: <20150107193059.GA1857-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>> By unified hierarchy I mean that every mount of cgroupfs has the
>> same directories with the same processes in each directory.
>
> No, my reading of Documentation/cgroups/unified-hierarchy.txt is that
> unified hierarchy means that all (sane) controllers are co-mounted into
> one hierarchy.
I see what you mean. If it is indeed the case than a mount of cgroupfs
using the unified hiearchy and can not specify which controllers are
present under that mount that very significant bug and presents a very
significant regression in user space flexibility.
I think you can still mount the unified hierarchy and select which
controls you want to see. If you can not that is a change significantly
past what was agreed to and a regression fix needs to be applied.
With a unified hierarchy and separate controllers per mount many cgroup
using applications will continue to work as before without changes, or
with minimal changes. That is what was agreed to and what I expect has
been actually implemented and it is what needs to be implemented in any
case.
I will see about making time to see where things are really at.
Eric
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Tejun Heo @ 2015-01-07 22:45 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Ingo Molnar, Linux API, cgroups mailinglist
In-Reply-To: <87bnma6xwv.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
On Wed, Jan 07, 2015 at 04:14:40PM -0600, Eric W. Biederman wrote:
> I see what you mean. If it is indeed the case than a mount of cgroupfs
> using the unified hiearchy and can not specify which controllers are
> present under that mount that very significant bug and presents a very
> significant regression in user space flexibility.
The parent always controls which controllers are made available at the
children level. Only if the parent enables a controller, its
children, whether they're namespaces or not, can choose to further
distribute resources using that controller. It's a straight-forward
top-down thing.
--
tejun
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Eric W. Biederman @ 2015-01-07 23:02 UTC (permalink / raw)
To: Tejun Heo
Cc: Serge E. Hallyn, Richard Weinberger, Linux API, Linux Containers,
Serge Hallyn, linux-kernel@vger.kernel.org, Andy Lutomirski,
cgroups mailinglist, Ingo Molnar
In-Reply-To: <20150107224430.GA28414-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:
> On Wed, Jan 07, 2015 at 04:14:40PM -0600, Eric W. Biederman wrote:
>> I see what you mean. If it is indeed the case than a mount of cgroupfs
>> using the unified hiearchy and can not specify which controllers are
>> present under that mount that very significant bug and presents a very
>> significant regression in user space flexibility.
>
> The parent always controls which controllers are made available at the
> children level. Only if the parent enables a controller, its
> children, whether they're namespaces or not, can choose to further
> distribute resources using that controller. It's a straight-forward
> top-down thing.
Ignoring namespace details for a moment. The following should be
possible with a unified hierarchy. If it is not it is a show stopper
of a regression.
mount -t tmpfs none /sys/fs/cgroup
(cd /sys/fs/cgroup ; mkdir cpu cpuacct devices memory)
mount -t cgroupfs -o cpu /sys/fs/cgroup/cpu
mount -t cgroupfs -o cpuacct /sys/fs/cgroup/cpuacct
mount -t cgroupfs -o devices /sys/fs/cgroup/devices
mount -t cgroupfs -o memory /sys/fs/cgroup/memory
With the expectation that only the control files for the specified
controllers show up in those mounts.
That is a unified hierarchy is fine. Requiring that there only be one
mount point and that every one use it is not ok and it actively a problem.
It is absolutely required to be able to avoid b0rked controllers, and
to my knowledge the only way to do that is to have multiple mounts where
we pick the controller on each mount. Even if there is now a way that
doesn't require multiple mounts to keep b0rked controllers from being
enabled multiple mounts still need to work to support the existing
userspace programs.
This discussion is happening because Documentation/cgroups/unified-hierarchy.txt
implies the configuration I have just described will not work with
unified hierachies enabled.
Eric
^ permalink raw reply
* Re: [PATCH v4 14/20] selftests/size: add install target to enable test install
From: Tim Bird @ 2015-01-07 23:05 UTC (permalink / raw)
To: Shuah Khan, mmarek@suse.cz, gregkh@linuxfoundation.org,
akpm@linux-foundation.org, rostedt@goodmis.org, mingo@redhat.com,
davem@davemloft.net, keescook@chromium.org,
tranmanphong@gmail.com, mpe@ellerman.id.au, cov@codeaurora.org,
dh.herrmann@gmail.com, hughd@google.com, bobby.prani@gmail.com,
serge.hallyn@ubuntu.com, ebiederm@xmission.com,
josh@joshtriplett.org, koct9i@gmail.com,
"masami.hiramatsu.pt@hitachi.com" <masami.hiram>
Cc: linux-kbuild@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <5861e16e1b533cae37471a8cf18597d3e9e6b6f6.1420571615.git.shuahkh@osg.samsung.com>
On 01/06/2015 11:43 AM, Shuah Khan wrote:
> Add a new make target to enable installing test. This target
> installs test in the kselftest install location and add to the
> kselftest script to run the test. Install target can be run
> only from top level kernel source directory.
>
> Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
> ---
> tools/testing/selftests/size/Makefile | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/size/Makefile b/tools/testing/selftests/size/Makefile
> index 04dc25e..a1478fa 100644
> --- a/tools/testing/selftests/size/Makefile
> +++ b/tools/testing/selftests/size/Makefile
> @@ -1,12 +1,22 @@
> CC = $(CROSS_COMPILE)gcc
>
> +TEST_STR = ./get_size || echo 'get_size selftests: [FAIL]'
> +
> all: get_size
>
> get_size: get_size.c
> $(CC) -static -ffreestanding -nostartfiles -s $< -o $@
>
> +install:
> +ifdef INSTALL_KSFT_PATH
> + install ./get_size $(INSTALL_KSFT_PATH)
> + @echo "$(TEST_STR)" >> $(KSELFTEST)
> +else
> + @echo "Run make kselftest_install in top level source directory"
> +endif
> +
> run_tests: all
> - ./get_size
> + @$(TEST_STR)
>
> clean:
> $(RM) get_size
>
Acked-by: Tim Bird <tim.bird@sonymobile.com>
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Tejun Heo @ 2015-01-07 23:06 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Ingo Molnar, Linux API, cgroups mailinglist
In-Reply-To: <878uhe42km.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
On Wed, Jan 07, 2015 at 05:02:17PM -0600, Eric W. Biederman wrote:
> Ignoring namespace details for a moment. The following should be
> possible with a unified hierarchy. If it is not it is a show stopper
> of a regression.
The -o SUBSYS option doesn't exist. Jesus, at least get yourself
familiar with the basics before claiming random stuff.
--
tejun
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Eric W. Biederman @ 2015-01-07 23:09 UTC (permalink / raw)
To: Tejun Heo
Cc: Serge E. Hallyn, Richard Weinberger, Linux API, Linux Containers,
Serge Hallyn, linux-kernel@vger.kernel.org, Andy Lutomirski,
cgroups mailinglist, Ingo Molnar
In-Reply-To: <20150107230615.GA28630-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:
> On Wed, Jan 07, 2015 at 05:02:17PM -0600, Eric W. Biederman wrote:
>> Ignoring namespace details for a moment. The following should be
>> possible with a unified hierarchy. If it is not it is a show stopper
>> of a regression.
>
> The -o SUBSYS option doesn't exist. Jesus, at least get yourself
> familiar with the basics before claiming random stuff.
Not random and I am familiar thank you very much.
I may have mistyped the manual command line configuration for specifying
which controllers appear on a mount point does not alter my point.
The old options to enable selecting controllers need to continue and
need to continue to work with a unified hierarchy.
Anything else is a gratuitious regression.
Eric
^ permalink raw reply
* Re: [v8 2/5] ext4: adds project ID support
From: Andreas Dilger @ 2015-01-07 23:11 UTC (permalink / raw)
To: Li Xi
Cc: linux-fsdevel, linux-ext4, linux-api, tytso, jack, viro, hch,
dmonakhov
In-Reply-To: <1418102548-5469-3-git-send-email-lixi@ddn.com>
On Dec 8, 2014, at 10:22 PM, Li Xi <pkuelelixi@gmail.com> wrote:
> This patch adds a new internal field of ext4 inode to save project
> identifier. Also a new flag EXT4_INODE_PROJINHERIT is added for
> inheriting project ID from parent directory.
>
> Signed-off-by: Li Xi <lixi@ddn.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---
> fs/ext4/ext4.h | 21 +++++++++++++++++----
> fs/ext4/ialloc.c | 6 ++++++
> fs/ext4/inode.c | 29 +++++++++++++++++++++++++++++
> fs/ext4/namei.c | 17 +++++++++++++++++
> fs/ext4/super.c | 1 +
> include/uapi/linux/fs.h | 1 +
> 6 files changed, 71 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 29c43e7..8bd1da9 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -377,16 +377,18 @@ struct flex_groups {
> #define EXT4_EA_INODE_FL 0x00200000 /* Inode used for large EA */
> #define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */
> #define EXT4_INLINE_DATA_FL 0x10000000 /* Inode has inline data. */
> +#define EXT4_PROJINHERIT_FL FS_PROJINHERIT_FL /* Create with parents projid */
Sigh, only a single on-disk inode flag unused.
> #define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */
>
> -#define EXT4_FL_USER_VISIBLE 0x004BDFFF /* User visible flags */
> -#define EXT4_FL_USER_MODIFIABLE 0x004380FF /* User modifiable flags */
> +#define EXT4_FL_USER_VISIBLE 0x204BDFFF /* User visible flags */
> +#define EXT4_FL_USER_MODIFIABLE 0x204380FF /* User modifiable flags */
>
> /* Flags that should be inherited by new inodes from their parent. */
> #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
> EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
> EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
> - EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL)
> + EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
> + EXT4_PROJINHERIT_FL)
>
> /* Flags that are appropriate for regular files (all but dir-specific ones). */
> #define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
> @@ -434,6 +436,7 @@ enum {
> EXT4_INODE_EA_INODE = 21, /* Inode used for large EA */
> EXT4_INODE_EOFBLOCKS = 22, /* Blocks allocated beyond EOF */
> EXT4_INODE_INLINE_DATA = 28, /* Data in inode. */
> + EXT4_INODE_PROJINHERIT = 29, /* Create with parents projid */
> EXT4_INODE_RESERVED = 31, /* reserved for ext4 lib */
> };
>
> @@ -683,6 +686,7 @@ struct ext4_inode {
> __le32 i_crtime; /* File Creation time */
> __le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
> __le32 i_version_hi; /* high 32 bits for 64-bit version */
> + __le32 i_projid; /* Project ID */
> };
>
> struct move_extent {
> @@ -934,6 +938,7 @@ struct ext4_inode_info {
>
> /* Precomputed uuid+inum+igen checksum for seeding inode checksums */
> __u32 i_csum_seed;
> + kprojid_t i_projid;
> };
>
> /*
> @@ -1518,6 +1523,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
> * GDT_CSUM bits are mutually exclusive.
> */
> #define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM 0x0400
> +#define EXT4_FEATURE_RO_COMPAT_PROJECT 0x1000 /* Project quota */
I wonder if it makes sense to add EXT4_FEATURE_RO_COMPAT_METADATA_CSUM
here as well, to make it clear why this is skipping a value?
> #define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
> #define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
> @@ -1567,7 +1573,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
> EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
> EXT4_FEATURE_RO_COMPAT_BIGALLOC |\
> EXT4_FEATURE_RO_COMPAT_METADATA_CSUM|\
> - EXT4_FEATURE_RO_COMPAT_QUOTA)
> + EXT4_FEATURE_RO_COMPAT_QUOTA |\
> + EXT4_FEATURE_RO_COMPAT_PROJECT)
>
> /*
> * Default values for user and/or group using reserved blocks
> @@ -1575,6 +1582,11 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
> #define EXT4_DEF_RESUID 0
> #define EXT4_DEF_RESGID 0
>
> +/*
> + * Default project ID
> + */
> +#define EXT4_DEF_PROJID 0
> +
> #define EXT4_DEF_INODE_READAHEAD_BLKS 32
>
> /*
> @@ -2131,6 +2143,7 @@ extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
> loff_t lstart, loff_t lend);
> extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
> extern qsize_t *ext4_get_reserved_space(struct inode *inode);
> +extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
> extern void ext4_da_update_reserve_space(struct inode *inode,
> int used, int quota_claim);
>
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index ac644c3..fefb948 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -756,6 +756,12 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
> inode->i_gid = dir->i_gid;
> } else
> inode_init_owner(inode, dir, mode);
> + if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT) &&
> + ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) {
> + ei->i_projid = EXT4_I(dir)->i_projid;
> + } else {
> + ei->i_projid = make_kprojid(&init_user_ns, EXT4_DEF_PROJID);
> + }
(style) no need for { } if only a single line in both if/else parts.
> dquot_initialize(inode);
>
> if (!goal)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5653fa4..29204d4 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3863,6 +3863,14 @@ static inline void ext4_iget_extra_inode(struct inode *inode,
> EXT4_I(inode)->i_inline_off = 0;
> }
>
> +int ext4_get_projid(struct inode *inode, kprojid_t *projid)
> +{
> + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb, EXT4_FEATURE_RO_COMPAT_PROJECT))
> + return -EOPNOTSUPP;
> + *projid = EXT4_I(inode)->i_projid;
> + return 0;
> +}
> +
> struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> {
> struct ext4_iloc iloc;
> @@ -3874,6 +3882,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> int block;
> uid_t i_uid;
> gid_t i_gid;
> + projid_t i_projid;
>
> inode = iget_locked(sb, ino);
> if (!inode)
> @@ -3923,12 +3932,18 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> inode->i_mode = le16_to_cpu(raw_inode->i_mode);
> i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
> i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
> + if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_PROJECT))
> + i_projid = (projid_t)le32_to_cpu(raw_inode->i_projid);
> + else
> + i_projid = EXT4_DEF_PROJID;
> +
> if (!(test_opt(inode->i_sb, NO_UID32))) {
> i_uid |= le16_to_cpu(raw_inode->i_uid_high) << 16;
> i_gid |= le16_to_cpu(raw_inode->i_gid_high) << 16;
> }
> i_uid_write(inode, i_uid);
> i_gid_write(inode, i_gid);
> + ei->i_projid = make_kprojid(&init_user_ns, i_projid);;
> set_nlink(inode, le16_to_cpu(raw_inode->i_links_count));
>
> ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */
> @@ -4158,6 +4173,7 @@ static int ext4_do_update_inode(handle_t *handle,
> int need_datasync = 0, set_large_file = 0;
> uid_t i_uid;
> gid_t i_gid;
> + projid_t i_projid;
>
> spin_lock(&ei->i_raw_lock);
>
> @@ -4170,6 +4186,7 @@ static int ext4_do_update_inode(handle_t *handle,
> raw_inode->i_mode = cpu_to_le16(inode->i_mode);
> i_uid = i_uid_read(inode);
> i_gid = i_gid_read(inode);
> + i_projid = from_kprojid(&init_user_ns, ei->i_projid);
> if (!(test_opt(inode->i_sb, NO_UID32))) {
> raw_inode->i_uid_low = cpu_to_le16(low_16_bits(i_uid));
> raw_inode->i_gid_low = cpu_to_le16(low_16_bits(i_gid));
> @@ -4249,6 +4266,18 @@ static int ext4_do_update_inode(handle_t *handle,
> }
> }
>
> + BUG_ON(!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> + EXT4_FEATURE_RO_COMPAT_PROJECT) &&
> + i_projid != EXT4_DEF_PROJID);
> + if (i_projid != EXT4_DEF_PROJID &&
> + (EXT4_INODE_SIZE(inode->i_sb) <= EXT4_GOOD_OLD_INODE_SIZE ||
> + (!EXT4_FITS_IN_INODE(raw_inode, ei, i_projid)))) {
> + spin_unlock(&ei->i_raw_lock);
> + err = -EFBIG;
I'm not sure if -EFBIG is the best error case, since that is a common
filesystem error. Maybe -EOVERFLOW would be better?
Also, returning the error from ext4_mark_iloc_dirty->ext4_do_update_inode()
is a bit late in the game, since the inode has already been modified at
this point. That callpath typically only returned an error if the disk
was bad or the journal aborted (again normally because the disk was bad),
so at that point the dirty in-memory data was lost anyway.
It would be better to check this in the caller before the inode is changed
so that an error can be returned without (essentially) corrupting the
in-memory state. Since the projid should only be set for new inodes (which
will always have enough space, assuming RO_COMPAT_PROJECT cannot be set on
filesystems with 128-byte inodes), or in case of rename into a project
directory, it shouldn't be too big a change.
> + goto out_brelse;
> + }
> + raw_inode->i_projid = cpu_to_le32(i_projid);
> +
> ext4_inode_csum_set(inode, raw_inode, ei);
>
> spin_unlock(&ei->i_raw_lock);
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 2291923..09e8e55 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2938,6 +2938,11 @@ static int ext4_link(struct dentry *old_dentry,
> if (inode->i_nlink >= EXT4_LINK_MAX)
> return -EMLINK;
>
> + if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) &&
> + (!projid_eq(EXT4_I(dir)->i_projid,
> + EXT4_I(old_dentry->d_inode)->i_projid)))
> + return -EXDEV;
> +
> dquot_initialize(dir);
>
> retry:
> @@ -3217,6 +3222,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
> int credits;
> u8 old_file_type;
>
> + if ((ext4_test_inode_flag(new_dir, EXT4_INODE_PROJINHERIT)) &&
> + (!projid_eq(EXT4_I(new_dir)->i_projid,
> + EXT4_I(old_dentry->d_inode)->i_projid)))
> + return -EXDEV;
> +
> dquot_initialize(old.dir);
> dquot_initialize(new.dir);
>
> @@ -3395,6 +3405,13 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
> u8 new_file_type;
> int retval;
>
> + if ((ext4_test_inode_flag(new_dir, EXT4_INODE_PROJINHERIT)) &&
> + ((!projid_eq(EXT4_I(new_dir)->i_projid,
> + EXT4_I(old_dentry->d_inode)->i_projid)) ||
> + (!projid_eq(EXT4_I(old_dir)->i_projid,
> + EXT4_I(new_dentry->d_inode)->i_projid))))
> + return -EXDEV;
> +
> dquot_initialize(old.dir);
> dquot_initialize(new.dir);
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 4fca81c..6b67795 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1067,6 +1067,7 @@ static const struct dquot_operations ext4_quota_operations = {
> .write_info = ext4_write_info,
> .alloc_dquot = dquot_alloc,
> .destroy_dquot = dquot_destroy,
> + .get_projid = ext4_get_projid,
> };
>
> static const struct quotactl_ops ext4_qctl_operations = {
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 3735fa0..fcbf647 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -195,6 +195,7 @@ struct inodes_stat_t {
> #define FS_EXTENT_FL 0x00080000 /* Extents */
> #define FS_DIRECTIO_FL 0x00100000 /* Use direct i/o */
> #define FS_NOCOW_FL 0x00800000 /* Do not cow file */
> +#define FS_PROJINHERIT_FL 0x20000000 /* Create with parents projid */
> #define FS_RESERVED_FL 0x80000000 /* reserved for ext2 lib */
>
> #define FS_FL_USER_VISIBLE 0x0003DFFF /* User visible flags */
> --
> 1.7.1
>
Cheers, Andreas
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Tejun Heo @ 2015-01-07 23:16 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Ingo Molnar, Linux API, cgroups mailinglist
In-Reply-To: <87fvbm2nni.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
On Wed, Jan 07, 2015 at 05:09:53PM -0600, Eric W. Biederman wrote:
> I may have mistyped the manual command line configuration for specifying
> which controllers appear on a mount point does not alter my point.
Hmmm? You were talking about the old hierarchies?
> The old options to enable selecting controllers need to continue and
> need to continue to work with a unified hierarchy.
>
> Anything else is a gratuitious regression.
I have no idea what you're on about. If the outer system uses unified
hierarchy, the inner system should use that too. If the outer system
doesn't use unified hierarchy, namespace support has never existed,
and even if it did, the inside could never pick and choose controllers
independent from the outside. If the outside is co-mounting cpu and
cpuacct, the inside is either also doing that or not mounting either.
--
tejun
^ permalink raw reply
* Re: WIP alternative - was Re: [PATCH v3 14/20] selftests/size: add install target to enable test install
From: Shuah Khan @ 2015-01-07 23:22 UTC (permalink / raw)
To: Tim Bird, mmarek@suse.cz, gregkh@linuxfoundation.org,
akpm@linux-foundation.org, rostedt@goodmis.org, mingo@redhat.com,
davem@davemloft.net, keescook@chromium.org,
tranmanphong@gmail.com, mpe@ellerman.id.au, cov@codeaurora.org,
dh.herrmann@gmail.com, hughd@google.com, bobby.prani@gmail.com,
serge.hallyn@ubuntu.com, ebiederm@xmission.com,
josh@joshtriplett.org, koct9i@gmail.com
Cc: linux-kbuild@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <54AB08AB.7040009@sonymobile.com>
On 01/05/2015 02:56 PM, Tim Bird wrote:
> On 01/05/2015 01:28 PM, Shuah Khan wrote:
>> On 12/31/2014 07:31 PM, Tim Bird wrote:
> ...
>>> The install phase is desperately needed for usage of kselftest in
>>> cross-target situations (applicable to almost all embedded). So this
>>> is great stuff.
>>
>> Thanks.
>>
>>>
>>> I worked a bit on isolating the install stuff to a makefile include file.
>>> This allows simplifying some of the sub-level Makefiles a bit, and allowing
>>> control of some of the install and run logic in less places.
>>>
>>> This is still a work in progress, but before I got too far along, I wanted
>>> to post it for people to provide feedback. A couple of problems cropped
>>> up that are worth discussing, IMHO.
>>>
>>> 1) I think it should be a requirement that each test has a single
>>> "main" program to execute to run the tests. If multiple tests are supported
>>> or more flexibility is desired for additional arguments, or that sort of
>>> thing, then that's fine, but the automated script builder should be able
>>> to run just a single program or script to have things work. This also
>>> makes things more consistent. In the case of the firmware test, I created
>>> a single fw_both.sh script to do this, instead of having two separate
>>> blocks in the kselftest.sh script.
>>
>> It is a good goal for individual tests to use a main program to run
>> tests, even though, I would not make it a requirement. I would like to
>> leave that decision up to the individual test writer.
>>
> OK. It helps to have a single line when trying to isolate
> RUN_TEST creation into the include file, but there may be other
> ways to accomplish this.
>
>>>
>>> 2) I've added a CROSS_INSTALL variable, which can call an arbitrary program
>>> to place files on the target system (rather than just calling 'install').
>>> In my case, I'd use my own 'ttc cp' command, which I can extend as necessary
>>> to put things on a remote machine. This works for a single directory,
>>> but things get dicier with sub-directory trees full of files (like
>>> the ftrace test uses.)
>>>
>>> If additional items need to be installed to the target, then maybe a setup
>>> program should be used, rather than just copying files.
>>>
>>> 3) Some of the scripts were using /bin/bash to execute them, rather
>>> than rely on the interpreter line in the script itself (and having
>>> the script have executable privileges). Is there a reason for this?
>>> I modified a few scripts to be executable, and got rid of the
>>> explicit execution with /bin/bash.
>>
>> Probably no reason other than the choice made by the test writer.
>> It could be cleaned up and made consistent, however, I would see
>> this as a separate enhancement type work that could be done on its
>> own and not include it in the install work.
>
> OK - this was also something that simplified the creation
> of the RUN_TEST variable in the isolated include file.
> Also, having the interpreter explicit in the invocation line
> in the Makefile as well as in the script itself is a bit redundant.
>>>
>>> The following is just a start... Let me know if this direction looks
>>> OK, and I'll finish this up. The main item to look at is
>>> kselftest.include file. Note that these patches are based on Shuah's
>>> series - but if you want to use these ideas I can rebase them onto
>>> mainline, and break them out per test sub-level like Shuah did.
>>
>> One of the reasons I picked install target approach is to enable the
>> feature by extending the existing run_tests support. This way we will
>> have the feature available quickly. Once that is supported, we can work
>> on evolving to a generic approach to use the include file approach, as
>> the changes you are proposing are based on the the series I sent out,
>> and makes improvements to it.
>>
>> kselftest.include file approach could work for several tests and tests
>> that can't use the generic could add their own install support.
>>
>> I propose evolving to a generic kselftest.include as the second step in
>> evolving the install feature. Can I count on you do the work and update
>> the tests to use kselftest.include, CROSS_INSTALL variable support?
>
> Yes. I'd be happy to evolve it through phases to support the include
> file and cross-target install feature.
Thanks that would be great.
>
> Is there anything I can help with in the mean time? Some of the tests
> require a directory tree of files rather than just a few top-level files
> (e.g. ftrace).
I left out couple of tests in this first round. exec and powerpc. Would
you like to add install target support for these tests? That is the next
thing on my list of todo and if you would like to get that working, you
are welcome to.
>
> I was thinking about doing some work to tar-ify the needed directories of
> data files during build, and untar it in the execution area during the
> install step. Do you want me to propose something there?
More like generating tar-ball of the install directory for remote
installs perhaps. Since install feature is using existing Makefile,
user can specify a override install location. Is there a need for,
more like use for generating a tar-ball for remote installs?
thanks,
-- Shuah
--
Shuah Khan
Sr. Linux Kernel Developer
Open Source Innovation Group
Samsung Research America (Silicon Valley)
shuahkh@osg.samsung.com | (970) 217-8978
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Eric W. Biederman @ 2015-01-07 23:27 UTC (permalink / raw)
To: Tejun Heo
Cc: Serge E. Hallyn, Richard Weinberger, Linux API, Linux Containers,
Serge Hallyn, linux-kernel@vger.kernel.org, Andy Lutomirski,
cgroups mailinglist, Ingo Molnar
In-Reply-To: <87fvbm2nni.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes:
> Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes:
>
>> On Wed, Jan 07, 2015 at 05:02:17PM -0600, Eric W. Biederman wrote:
>>> Ignoring namespace details for a moment. The following should be
>>> possible with a unified hierarchy. If it is not it is a show stopper
>>> of a regression.
>>
>> The -o SUBSYS option doesn't exist. Jesus, at least get yourself
>> familiar with the basics before claiming random stuff.
Oh let's see I got that command line option out of /proc/mounts and yes
it works. Perhaps it doesn't if I invoke unified hiearchies but the
option does in fact exist and work.
Now I really do need to test report regressions, and send probably send
regression fixes. If I understand your strange ranting I think you just
told me that option that -o SUBSYS does work with unified hierarchies.
Tejun. I asked you specifically about this case 2 years ago at plumbers
and you personally told me this would continue to work. I am going to
hold you to that.
Fixing bugs is one thing. Gratuitious regressions that make supporting
existing user space applications insane is another.
Eric
^ permalink raw reply
* Re: [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
From: Tejun Heo @ 2015-01-07 23:35 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Richard Weinberger, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Ingo Molnar, Linux API, cgroups mailinglist
In-Reply-To: <87y4peyxw5.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
On Wed, Jan 07, 2015 at 05:27:38PM -0600, Eric W. Biederman wrote:
> >> The -o SUBSYS option doesn't exist. Jesus, at least get yourself
> >> familiar with the basics before claiming random stuff.
>
> Oh let's see I got that command line option out of /proc/mounts and yes
> it works. Perhaps it doesn't if I invoke unified hiearchies but the
> option does in fact exist and work.
I meant the -o SUBSYS doesn't exist for unified hierarchy.
> Now I really do need to test report regressions, and send probably send
> regression fixes. If I understand your strange ranting I think you just
> told me that option that -o SUBSYS does work with unified hierarchies.
What? Why would -O SUBSYS exist for unified hierarchy? It's unified
for all controllers.
> Tejun. I asked you specifically about this case 2 years ago at plumbers
> and you personally told me this would continue to work. I am going to
> hold you to that.
I have no idea what you're talking about in *THIS* thread. I'm fully
aware of what was discussed *THEN*.
> Fixing bugs is one thing. Gratuitious regressions that make supporting
> existing user space applications insane is another.
Can you explain what problem you're actually trying to talk about
without spouting random claims about regressions?
--
tejun
^ permalink raw reply
* Re: [v8 2/5] ext4: adds project ID support
From: Jan Kara @ 2015-01-08 8:26 UTC (permalink / raw)
To: Li Xi
Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
linux-ext4-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA, tytso-3s7WtUTddSA,
adilger-m1MBpc4rdrD3fQ9qLvQP4Q, jack-AlSwsSmVLrQ,
viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hch-wEGCiKHe2LqWVfeAwA7xHQ,
dmonakhov-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1418102548-5469-3-git-send-email-lixi-LfVdkaOWEx8@public.gmane.org>
On Tue 09-12-14 13:22:25, Li Xi wrote:
> This patch adds a new internal field of ext4 inode to save project
> identifier. Also a new flag EXT4_INODE_PROJINHERIT is added for
> inheriting project ID from parent directory.
I have noticed one thing you apparently changed in v7 of the patch set.
See below.
> Signed-off-by: Li Xi <lixi-LfVdkaOWEx8@public.gmane.org>
> Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
> ---
> fs/ext4/ext4.h | 21 +++++++++++++++++----
> fs/ext4/ialloc.c | 6 ++++++
> fs/ext4/inode.c | 29 +++++++++++++++++++++++++++++
> fs/ext4/namei.c | 17 +++++++++++++++++
> fs/ext4/super.c | 1 +
> include/uapi/linux/fs.h | 1 +
> 6 files changed, 71 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 29c43e7..8bd1da9 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -377,16 +377,18 @@ struct flex_groups {
> #define EXT4_EA_INODE_FL 0x00200000 /* Inode used for large EA */
> #define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */
> #define EXT4_INLINE_DATA_FL 0x10000000 /* Inode has inline data. */
> +#define EXT4_PROJINHERIT_FL FS_PROJINHERIT_FL /* Create with parents projid */
How did FS_PROJINHERIT_FL get here? There used to be 0x20000000 in older
version of the patch set which is correct - this definition is defining
ext4 on-disk format. As such it is an ext4 specific flag and should be
definined to a fixed constant independed of any other filesystem. It seems
you are somewhat mixing what is an on-disk format flag value and what is a
flag value passed from userspace. These two may be different things and you
need to convert between the values when getting / setting flags...
Honza
--
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
SUSE Labs, CR
^ permalink raw reply
* Re: [PATCH net-next v2 0/8] net: extend ethtool link mode bitmaps to 48 bits
From: Amir Vadai @ 2015-01-08 8:40 UTC (permalink / raw)
To: David Decotigny
Cc: Florian Fainelli, netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Saeed Mahameed,
David S. Miller, Jason Wang, Michael S. Tsirkin, Herbert Xu,
Al Viro, Ben Hutchings, Masatake YAMATO, Xi Wang, Neil Horman,
WANG Cong, Flavio Leitner, Tom Gundersen, Jiri Pirko,
Vlad Yasevich, Eric W. Biederman, Venkata Duvvuru,
Govindarajulu Varadarajan <_govind>
In-Reply-To: <CAG88wWYPDpwkWkL+Pj2VKrX5WVp=at8v0=gcFAVAA8nntv+-nw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On 1/6/2015 7:36 PM, David Decotigny wrote:
> Interesting. It seems that the band-aid I was proposing is already
> obsolete. We could still use the remaining reserved 16 bits to encode
> 5 more bits per mask (that is: 53 bits / mask total). But if I
> understand you, it would allow us to survive only a few months longer,
> as opposed to a few weeks.
>
> One short-term alternative solution I can imagine is the following:
> /* For example bitmap-based for variable length: */
> struct ethtool_link_mode {
> __u32 cmd;
> __u8 autoneg :1;
> __u8 duplex :2;
> __u16 supported_nbits;
> __u16 advertising_nbits;
> __u16 lp_advertising_nbits;
> __u32 reserved[4];
> __u8 masks[0];
> };
> /* Or simpler, statically limited to 64b / mask, but easier to migrate
> to for driver authors: */
I think the first options is better. A driver will have to do changes in
order to support >32 link modes, so better change it once now, without
having to change it again for >64 link modes.
> struct ethtool_link_mode {
> __u32 cmd;
> __u8 autoneg :1;
> __u8 duplex :2;
> __u64 supported;
> __u64 advertising;
> __u64 lp_advertising;
> __u32 reserved[4];
> };
> #define ETHTOOL_GLINK_MODE 0x0000004a
> #define ETHTOOL_SLINK_MODE 0x0000004b
> struct ethtool_ops {
> ...
> int (*get_link_mode)(struct net_device *, struct ethtool_link_mode *);
> int (*set_link_mode)(struct net_device *, struct ethtool_link_mode *);
> };
>
> The same thing required for EEE.
Yeh :(
>
> I am not sure about moving the autoneg and duplex fields into the new
> struct. Especially the "duplex" field.
I think so too. ethtool user space will call ETHTOOL_[GS]SET and after
that ETHTOOL_[GS]LINK_MODE (if supported). No need to get the
duplex/autoneg fields again.
>
> Then the idea would be to update the ethtool user-space tool to try
> get/set_link mode when reporting/changing the autoneg/advertising
> settings.
>
> Both will require significant effort from the driver authors.
> Especially if the variable-length bitmap approach is preferred:
> - most drivers currently use simple bitwise arithmetic in their code,
> and that goes far beyond get/set_settings, it is sometimes part of the
> core driver logic. They will have to migrate to the bitmap API if they
> want to use the larger bitmaps (note: no change needed if they are
> happy with <= 32b / mask)
As I said above, it will save as doing this work again in the future,
and more problematic, save another version to backport in the future. In
addition, not all drivers will have to do it, only if >32 link speeds is
needed - this work will be required.
> - we would have to progressively deprecate the use of #define
> ADVERTISED_1000baseT_Full in favor of an enum of the bit indices.
Maybe we could use some macro juggling to define the legacy macro's
using enum for the first 32 bits, and fail the compilation if used on
>32. For example, calling this:
DEFINE_LINK_MODE(ADVERTISED_1000baseT_Full, 5)
Will add the following:
ADVERTISED_1000baseT_Full_SHIFT = 5
ADVERTISED_1000baseT_Full = (1<<5)
DEFINE_LINK_MODE(ADVERTISED_100000baseKR5_Full, 50) will add:
ADVERTISED_100000baseKR5_Full_SHIFT = 50
ADVERTISED_100000baseKR5_Full = #error new link speeds must be defined
using [gs]et_link_speed
This will break compilation if ADVERTISED_100000baseKR5_Full is used in
[gs]et_settings (I know the '#error' will not print something very
pretty - I used it only to explain what I meant)
>
> Any feedback welcome. In the meantime, I am going to propose a v3 of
> current option with 53 bits / mask. I can also propose a prototype of
> the scheme described above, please let me know.
I think that it is better to do it once, and skip the 53 bits / mask
version.
I'll be happy to assist.
Amir
^ permalink raw reply
* Re: [v8 2/5] ext4: adds project ID support
From: Jan Kara @ 2015-01-08 8:51 UTC (permalink / raw)
To: Andreas Dilger
Cc: Li Xi, linux-fsdevel, linux-ext4, linux-api, tytso, jack, viro,
hch, dmonakhov
In-Reply-To: <0F8FAF43-54EB-4202-8120-C1D70CF330F0@dilger.ca>
On Wed 07-01-15 16:11:16, Andreas Dilger wrote:
> On Dec 8, 2014, at 10:22 PM, Li Xi <pkuelelixi@gmail.com> wrote:
> > This patch adds a new internal field of ext4 inode to save project
> > identifier. Also a new flag EXT4_INODE_PROJINHERIT is added for
> > inheriting project ID from parent directory.
> >
> > Signed-off-by: Li Xi <lixi@ddn.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > ---
...
> > @@ -4249,6 +4266,18 @@ static int ext4_do_update_inode(handle_t *handle,
> > }
> > }
> >
> > + BUG_ON(!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_PROJECT) &&
> > + i_projid != EXT4_DEF_PROJID);
> > + if (i_projid != EXT4_DEF_PROJID &&
> > + (EXT4_INODE_SIZE(inode->i_sb) <= EXT4_GOOD_OLD_INODE_SIZE ||
> > + (!EXT4_FITS_IN_INODE(raw_inode, ei, i_projid)))) {
> > + spin_unlock(&ei->i_raw_lock);
> > + err = -EFBIG;
>
> I'm not sure if -EFBIG is the best error case, since that is a common
> filesystem error. Maybe -EOVERFLOW would be better?
So my thought has been that this is mostly a sanity check and we would
check in tune2fs whether all inodes have enough space when setting
EXT4_FEATURE_RO_COMPAT_PROJECT feature... Because sudden errors (regardless
whether it will be EFBIG or EOVERFLOW) when renaming files will be hard to
understand for sysadmins IMHO.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox