* [PATCH] user namespaces: bump idmap limits
@ 2017-10-04 11:20 Christian Brauner
From: Christian Brauner @ 2017-10-04 11:20 UTC (permalink / raw)
To: ebiederm, serge, stgraber, linux-kernel, tycho; +Cc: Christian Brauner
We have quite a few use cases where users already run into the current limit
for {g,u}id mappings. Consider a user requesting us to map everything but 999
and 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
some-user:100000:1000000000
some-user:999:1
some-user:1000:1
some-user:1001:1
some-user:1002:1
This translates to:
MAPPING-TYPE CONTAINER HOST RANGE
uid 999 999 1
uid 1001 1001 1
uid 0 1000000 999
uid 1000 1001000 1
uid 1002 1001002 999998998
gid 999 999 1
gid 1001 1001 1
gid 0 1000000 999
gid 1000 1001000 1
gid 1002 1001002 999998998
which is already the current limit.
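To make the table concrete: each line is one extent that translates ids by a
constant offset. A minimal sketch of the per-extent translation (a
simplification of what map_id_down() in kernel/user_namespace.c does; the
helper name is illustrative):

/*
 * Translate a container id to a host id through a single extent.
 * For the extent "uid 1002 1001002 999998998" above, container
 * uid 1500 maps to host uid 1001002 + (1500 - 1002) = 1001500.
 */
static u32 map_one_extent(u32 id, u32 container_first, u32 host_first,
			  u32 range)
{
	if (id >= container_first && id - container_first < range)
		return host_first + (id - container_first);
	return (u32) -1; /* this extent does not cover the id */
}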
Design Notes:
As discussed at LPC, simply bumping the number of limits is not going to work
since this would mean that struct uid_gid_map won't fit into a single cache
line anymore, thereby regressing performance for the base cases. The same
problem arises when using a single pointer. So the idea is to keep the base
cases (0-3 mappings) directly in struct uid_gid_map so that they fit into a
single cache line of 64 bytes. In place of the two removed mappings we put
three pointers into the struct that mimic the behavior of traditional
filesystems:
1. a direct pointer to a block of 5 struct uid_gid_extent mappings (60 bytes)
2. an indirect pointer to a 64 byte array of direct pointers, each pointing
to a block of 5 struct uid_gid_extent mappings (60 bytes each)
3. a double indirect pointer to a 64 byte array of indirect pointers, each
pointing to a 64 byte array of direct pointers
Assuming a pointer size of 8 bytes this gives us 3 + 5 + (8 * 5) + (8 * (8 *
5)) = 368 mappings, which should really be enough. The idea of this approach
is to always have each block of idmap extents (struct uid_gid_extent) be 60
bytes (5 * (4 + 4 + 4)) and thus 4 bytes smaller than a single cache line.
Lookups beyond the base cases should only see a linear performance impact
caused by iterating through the idmappings in a for-loop. Note that the base
cases shouldn't see any performance degradation, which is the most important
part.
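Spelled out, the per-tier capacities look like this (a sketch with
illustrative macro names; the patch below introduces the equivalent
UID_GID_MAP_* constants):

#define PTRS_PER_LEVEL   8  /* assumed 8 byte pointers in a 64 byte array */
#define EXTENTS_PER_BLK  5  /* one 60 byte block of struct uid_gid_extent */
#define BASE_EXTENTS     3  /* kept inline in struct uid_gid_map */
#define DIRECT_EXTENTS   EXTENTS_PER_BLK                      /* 5 */
#define IDIRECT_EXTENTS  (PTRS_PER_LEVEL * EXTENTS_PER_BLK)   /* 40 */
#define DIDIRECT_EXTENTS (PTRS_PER_LEVEL * IDIRECT_EXTENTS)   /* 320 */
#define TOTAL_EXTENTS    (BASE_EXTENTS + DIRECT_EXTENTS + \
			  IDIRECT_EXTENTS + DIDIRECT_EXTENTS) /* 368 */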
Performance Testing:
When Eric introduced the extent-based struct uid_gid_map approach he measured
the performance impact of his idmap changes:
> My benchmark consisted of going to single user mode where nothing else was
> running. On an ext4 filesystem opening 1,000,000 files and looping through all
> of the files 1000 times and calling fstat on the individual files. This was
> to ensure I was benchmarking stat times where the inodes were in the kernel's
> cache, but the inode values were not in the processor's cache. My results:
> v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
I used an identical approach on my laptop. Here's a thorough description of what
I did. I built three kernels and used an additional "control" kernel:
1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)
I booted into single user mode (systemd rescue target in newspeak) and used an
ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the
files calling fstat() on each of them 1000 times and calculated the mean fstat()
time for a single file. (The test program can be found below.)
For kernels v4.14-rc2-vanilla, v4.12.0-12-generic I tested the following cases:
0 mappings
1 mapping
2 mappings
3 mappings
5 mappings
For kernel v4.14-rc2-userns+ I tested:
0 mappings
1 mapping
2 mappings
3 mappings
5 mappings
10 mappings
50 mappings
100 mappings
200 mappings
300 mappings
Here are the results:
- v4.14-rc2-vanilla (unmodified v4.14-rc2)
# no unshare: 312 ns
unshare -U # write 0 mappings: 307 ns
unshare -U # write 1 mappings: 328 ns
unshare -U # write 2 mappings: 328 ns
unshare -U # write 3 mappings: 328 ns
unshare -U # write 5 mappings: 338 ns
- v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
# no unshare: 158 ns
unshare -U # write 0 mappings: 158 ns
unshare -U # write 1 mappings: 164 ns
unshare -U # write 2 mappings: 170 ns
unshare -U # write 3 mappings: 175 ns
unshare -U # write 5 mappings: 187 ns
unshare -U # write 10 mappings: 218 ns
unshare -U # write 50 mappings: 528 ns
unshare -U # write 100 mappings: 980 ns
unshare -U # write 200 mappings: 1880 ns
unshare -U # write 300 mappings: 2760 ns
- v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
# no unshare: 161 ns
- 4.12.0-12-generic Ubuntu Kernel:
# no unshare: 328 ns
unshare -U # write 0 mappings: 327 ns
unshare -U # write 1 mappings: 328 ns
unshare -U # write 2 mappings: 328 ns
unshare -U # write 3 mappings: 328 ns
unshare -U # write 5 mappings: 338 ns
I've tested this multiple times and the numbers hold up. All v4.14-rc2 kernels
were built on the same machine with the same .config, the same options and a
simple call to make -j 11 bindeb-pkg. The 4.12 kernel was simply installed from
the Ubuntu archives.
The most important part to me is that my idmap patches don't regress performance
for the base cases. I'd actually only consider 0 and 1 mapping to be the proper
base cases since this is the standard layout that users usually expect when
using user namespaces. However, the numbers show that there's no real
performance regression even up to 5 mappings.
When writing more than 10 mappings the numbers nicely show that in the worst
case the performance degrades linearly, which is what I was aiming for: going
from 100 to 200 to 300 mappings costs roughly 900 ns per 100 extra mappings,
i.e. about 9 ns per additional mapping. To me this seems unrelated to cache
misses and rather related to how for-loops usually scale, which is a good sign
(I think).
I expect most users will usually go slightly over the 5 mappings mark, up to
10 mappings. For example, when an administrator runs a system container that
serves a number of users, each of whom wants to be able to mount their home
directory into the container and create files with their own uid and gid. I've
used this specific system container setup before on HPC clusters.
The really high idmap numbers are good to have for specific use cases. I have a
bunch of containers that punch various complex holes into idmap ranges to
guarantee safe access for a high number of programs that run with their own uid
on the host and in the container. LXD's idmapping algorithm is able to calculate
complex idmaps that are handed down to the liblxc driver.
Here's the test program I used. I asked Eric what he did and this is a more
"advanced" implementation of the idea. It's pretty straight-forward:
/*
* Copyright © 2017 Christian Brauner <christian.brauner@ubuntu.com>.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2, as
* published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License along
* with this program; if not, write to the Free Software Foundation, Inc.,
* 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*/
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int ret;
	size_t i, k;
	/* static: 1,000,000 ints would otherwise take 4 MB of stack */
	static int fd[1000000];
	static int times[1000];
	char pathname[4096];
	struct stat st;
	struct timeval t1, t2;
	uint64_t time_in_mcs;
	uint64_t sum = 0;

	if (argc != 2) {
		fprintf(stderr, "Please specify a directory where to create "
				"the test files\n");
		exit(EXIT_FAILURE);
	}

	/* Create and open 1,000,000 test files. Note that this needs
	 * RLIMIT_NOFILE raised accordingly (e.g. via ulimit -n). */
	for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
		sprintf(pathname, "%s/idmap_test_%zu", argv[1], i);
		fd[i] = open(pathname, O_RDWR | O_CREAT,
			     S_IXUSR | S_IXGRP | S_IXOTH);
		if (fd[i] < 0) {
			ssize_t j;
			for (j = i; j >= 0; j--)
				close(fd[j]);
			exit(EXIT_FAILURE);
		}
	}

	/* 1000 runs of fstat()ing all files; since there are 10^6 files,
	 * the per-run time in microseconds equals the per-file time in
	 * nanoseconds. */
	for (k = 0; k < 1000; k++) {
		ret = gettimeofday(&t1, NULL);
		if (ret < 0)
			goto close_all;

		for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++) {
			ret = fstat(fd[i], &st);
			if (ret < 0)
				goto close_all;
		}

		ret = gettimeofday(&t2, NULL);
		if (ret < 0)
			goto close_all;

		time_in_mcs = (1000000 * t2.tv_sec + t2.tv_usec) -
			      (1000000 * t1.tv_sec + t1.tv_usec);
		printf("Total time in micro seconds: %" PRIu64 "\n",
		       time_in_mcs);
		printf("Total time in nanoseconds: %" PRIu64 "\n",
		       time_in_mcs * 1000);
		printf("Time per file in nanoseconds: %" PRIu64 "\n",
		       (time_in_mcs * 1000) / 1000000);
		times[k] = (time_in_mcs * 1000) / 1000000;
	}

close_all:
	for (i = 0; i < sizeof(fd) / sizeof(fd[0]); i++)
		close(fd[i]);

	if (ret < 0)
		exit(EXIT_FAILURE);

	/* Average the per-file times over all 1000 runs. */
	for (k = 0; k < 1000; k++)
		sum += times[k];
	printf("Mean time per file in nanoseconds: %" PRIu64 "\n", sum / 1000);

	exit(EXIT_SUCCESS);
}
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
include/linux/user_namespace.h | 18 ++-
kernel/user_namespace.c | 264 ++++++++++++++++++++++++++++++++++++++---
2 files changed, 263 insertions(+), 19 deletions(-)
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index c18e01252346..66e218c289b3 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -10,15 +10,29 @@
#include <linux/sysctl.h>
#include <linux/err.h>
+#define UID_GID_MAP_PTR_SIZE 8
#define UID_GID_MAP_MAX_EXTENTS 5
+#define UID_GID_MAP_BASE 3
+#define UID_GID_MAP_DIRECT UID_GID_MAP_MAX_EXTENTS
+#define UID_GID_MAP_IDIRECT (UID_GID_MAP_MAX_EXTENTS * UID_GID_MAP_PTR_SIZE)
+#define UID_GID_MAP_DIDIRECT (UID_GID_MAP_IDIRECT * UID_GID_MAP_PTR_SIZE)
+#define UID_GID_MAP_MAX \
+ (UID_GID_MAP_BASE + UID_GID_MAP_DIRECT + UID_GID_MAP_IDIRECT + \
+ UID_GID_MAP_DIDIRECT)
struct uid_gid_map { /* 64 bytes -- 1 cache line */
- u32 nr_extents;
+ /* Pointer to an array of UID_GID_MAP_MAX_EXTENTS extents. */
+ struct uid_gid_extent *direct;
+ /* Pointer to a 64 byte array of 8 direct pointers. */
+ struct uid_gid_extent **idirect;
+ /* Pointer to a 64 byte array of 8 idirect pointers. */
+ struct uid_gid_extent ***didirect;
struct uid_gid_extent {
u32 first;
u32 lower_first;
u32 count;
- } extent[UID_GID_MAP_MAX_EXTENTS];
+ } extent[3];
+ u32 nr_extents;
};
#define USERNS_SETGROUPS_ALLOWED 1UL
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index c490f1e4313b..739d880aff39 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -24,6 +24,11 @@
#include <linux/projid.h>
#include <linux/fs_struct.h>
+#define UID_GID_MAP_BASE_MAX UID_GID_MAP_BASE
+#define UID_GID_MAP_DIRECT_MAX (UID_GID_MAP_BASE_MAX + UID_GID_MAP_MAX_EXTENTS)
+#define UID_GID_MAP_IDIRECT_MAX (UID_GID_MAP_IDIRECT + UID_GID_MAP_DIRECT_MAX)
+#define UID_GID_MAP_DIDIRECT_MAX (UID_GID_MAP_DIDIRECT + UID_GID_MAP_IDIRECT_MAX)
+
static struct kmem_cache *user_ns_cachep __read_mostly;
static DEFINE_MUTEX(userns_state_mutex);
@@ -173,6 +178,78 @@ int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
return err;
}
+/* Get indices for an extent. */
+static inline u32 get_eidx(u32 idx)
+{
+ /*
+ * Subtract UID_GID_MAP_MAX_EXTENTS plus the 3 extents kept directly in
+ * struct uid_gid_map, i.e. UID_GID_MAP_DIRECT_MAX entries in total.
+ */
+ return ((idx - UID_GID_MAP_MAX_EXTENTS - 3) % UID_GID_MAP_MAX_EXTENTS);
+}
+
+/* Get indices for the direct pointer. */
+static inline u32 get_didx(u32 idx)
+{
+ return idx - UID_GID_MAP_BASE_MAX;
+}
+
+/* Get indices for the indirect pointer. */
+static inline u32 get_iidx(u32 idx)
+{
+ if (idx < UID_GID_MAP_IDIRECT_MAX)
+ return (idx - UID_GID_MAP_DIRECT_MAX) / UID_GID_MAP_MAX_EXTENTS;
+
+ return ((idx - UID_GID_MAP_IDIRECT_MAX) / UID_GID_MAP_MAX_EXTENTS) %
+ UID_GID_MAP_PTR_SIZE;
+}
+
+/* Get indices for the double indirect pointer. */
+static inline u32 get_diidx(u32 idx)
+{
+ return (idx - UID_GID_MAP_IDIRECT_MAX) /
+ (UID_GID_MAP_PTR_SIZE * UID_GID_MAP_MAX_EXTENTS);
+}
+
+static void free_extents(struct uid_gid_map *maps)
+{
+ u32 diidx, i, idx, iidx;
+
+ if (!maps->direct)
+ return;
+ kfree(maps->direct);
+ maps->direct = NULL;
+
+ if (!maps->idirect)
+ return;
+
+ if (maps->nr_extents >= UID_GID_MAP_IDIRECT_MAX)
+ iidx = get_iidx(UID_GID_MAP_IDIRECT_MAX - 1);
+ else
+ iidx = get_iidx(maps->nr_extents - 1);
+ for (idx = 0; idx <= iidx; idx++)
+ kfree(maps->idirect[idx]);
+ kfree(maps->idirect);
+ maps->idirect = NULL;
+
+ if (!maps->didirect)
+ return;
+
+ diidx = get_diidx(maps->nr_extents - 1);
+ iidx = (64 / UID_GID_MAP_PTR_SIZE) - 1;
+ for (idx = 0; idx <= diidx; idx++) {
+ if (idx == diidx)
+ iidx = get_iidx(maps->nr_extents - 1);
+
+ for (i = 0; i <= iidx; i++)
+ kfree(maps->didirect[idx][i]);
+
+ kfree(maps->didirect[idx]);
+ }
+ kfree(maps->didirect);
+ maps->didirect = NULL;
+}
+
static void free_user_ns(struct work_struct *work)
{
struct user_namespace *parent, *ns =
@@ -185,6 +262,12 @@ static void free_user_ns(struct work_struct *work)
#ifdef CONFIG_PERSISTENT_KEYRINGS
key_put(ns->persistent_keyring_register);
#endif
+ mutex_lock(&userns_state_mutex);
+ free_extents(&ns->uid_map);
+ free_extents(&ns->gid_map);
+ free_extents(&ns->projid_map);
+ mutex_unlock(&userns_state_mutex);
+
ns_free_inum(&ns->ns);
kmem_cache_free(user_ns_cachep, ns);
dec_user_namespaces(ucounts);
@@ -198,8 +281,36 @@ void __put_user_ns(struct user_namespace *ns)
}
EXPORT_SYMBOL(__put_user_ns);
+static struct uid_gid_extent *get_idmap(struct uid_gid_map *maps, u32 idx)
+{
+ if (idx < UID_GID_MAP_BASE_MAX)
+ return &maps->extent[idx];
+
+ if ((idx >= UID_GID_MAP_BASE_MAX) &&
+ (idx < UID_GID_MAP_DIRECT_MAX))
+ return &maps->direct[get_didx(idx)];
+
+ if ((idx >= UID_GID_MAP_DIRECT_MAX) &&
+ (idx < UID_GID_MAP_IDIRECT_MAX)) {
+ u32 iidx = get_iidx(idx);
+ u32 eidx = get_eidx(idx);
+ return &maps->idirect[iidx][eidx];
+ }
+
+ if ((idx >= UID_GID_MAP_IDIRECT_MAX) &&
+ (idx < UID_GID_MAP_DIDIRECT_MAX)) {
+ u32 diidx = get_diidx(idx);
+ u32 iidx = get_iidx(idx);
+ u32 eidx = get_eidx(idx);
+ return &maps->didirect[diidx][iidx][eidx];
+ }
+
+ return NULL;
+}
+
static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
{
+ struct uid_gid_extent *extent;
unsigned idx, extents;
u32 first, last, id2;
@@ -209,15 +320,16 @@ static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
extents = map->nr_extents;
smp_rmb();
for (idx = 0; idx < extents; idx++) {
- first = map->extent[idx].first;
- last = first + map->extent[idx].count - 1;
+ extent = get_idmap(map, idx);
+ first = extent->first;
+ last = first + extent->count - 1;
if (id >= first && id <= last &&
(id2 >= first && id2 <= last))
break;
}
/* Map the id or note failure */
if (idx < extents)
- id = (id - first) + map->extent[idx].lower_first;
+ id = (id - first) + extent->lower_first;
else
id = (u32) -1;
@@ -226,6 +338,7 @@ static u32 map_id_range_down(struct uid_gid_map *map, u32 id, u32 count)
static u32 map_id_down(struct uid_gid_map *map, u32 id)
{
+ struct uid_gid_extent *extent;
unsigned idx, extents;
u32 first, last;
@@ -233,14 +346,15 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id)
extents = map->nr_extents;
smp_rmb();
for (idx = 0; idx < extents; idx++) {
- first = map->extent[idx].first;
- last = first + map->extent[idx].count - 1;
+ extent = get_idmap(map, idx);
+ first = extent->first;
+ last = first + extent->count - 1;
if (id >= first && id <= last)
break;
}
/* Map the id or note failure */
if (idx < extents)
- id = (id - first) + map->extent[idx].lower_first;
+ id = (id - first) + extent->lower_first;
else
id = (u32) -1;
@@ -249,6 +363,7 @@ static u32 map_id_down(struct uid_gid_map *map, u32 id)
static u32 map_id_up(struct uid_gid_map *map, u32 id)
{
+ struct uid_gid_extent *extent;
unsigned idx, extents;
u32 first, last;
@@ -256,14 +371,15 @@ static u32 map_id_up(struct uid_gid_map *map, u32 id)
extents = map->nr_extents;
smp_rmb();
for (idx = 0; idx < extents; idx++) {
- first = map->extent[idx].lower_first;
- last = first + map->extent[idx].count - 1;
+ extent = get_idmap(map, idx);
+ first = extent->lower_first;
+ last = first + extent->count - 1;
if (id >= first && id <= last)
break;
}
/* Map the id or note failure */
if (idx < extents)
- id = (id - first) + map->extent[idx].first;
+ id = (id - first) + extent->first;
else
id = (u32) -1;
@@ -544,7 +660,7 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
loff_t pos = *ppos;
if (pos < map->nr_extents)
- extent = &map->extent[pos];
+ extent = get_idmap(map, pos);
return extent;
}
@@ -618,7 +734,7 @@ static bool mappings_overlap(struct uid_gid_map *new_map,
u32 prev_upper_last, prev_lower_last;
struct uid_gid_extent *prev;
- prev = &new_map->extent[idx];
+ prev = get_idmap(new_map, idx);
prev_upper_first = prev->first;
prev_lower_first = prev->lower_first;
@@ -638,6 +754,110 @@ static bool mappings_overlap(struct uid_gid_map *new_map,
return false;
}
+static struct uid_gid_extent *alloc_extent(struct uid_gid_map *maps)
+{
+ void *tmp;
+ u32 next = maps->nr_extents;
+ u32 eidx, iidx, ldiidx, diidx, liidx;
+
+ if (next < UID_GID_MAP_BASE_MAX)
+ return &maps->extent[next];
+
+ if ((next >= UID_GID_MAP_BASE_MAX) && (next < UID_GID_MAP_DIRECT_MAX)) {
+ if (!maps->direct)
+ maps->direct = kzalloc(sizeof(struct uid_gid_extent) *
+ UID_GID_MAP_MAX_EXTENTS,
+ GFP_KERNEL);
+ if (!maps->direct)
+ return ERR_PTR(-ENOMEM);
+
+ return &maps->direct[next - UID_GID_MAP_BASE_MAX];
+ }
+
+ if ((next >= UID_GID_MAP_DIRECT_MAX) &&
+ (next < UID_GID_MAP_IDIRECT_MAX)) {
+ liidx = 0;
+ iidx = get_iidx(next);
+ eidx = get_eidx(next);
+
+ if (iidx > 0)
+ liidx = get_iidx(next - 1);
+
+ if (!maps->idirect || iidx > liidx) {
+ tmp = krealloc(maps->idirect,
+ sizeof(struct uid_gid_extent *) * (iidx + 1),
+ GFP_KERNEL);
+ if (!tmp)
+ return ERR_PTR(-ENOMEM);
+
+ maps->idirect = tmp;
+ maps->idirect[iidx] = NULL;
+ }
+
+ tmp = krealloc(maps->idirect[iidx],
+ sizeof(struct uid_gid_extent) * (eidx + 1),
+ GFP_KERNEL);
+ if (!tmp)
+ return ERR_PTR(-ENOMEM);
+
+ maps->idirect[iidx] = tmp;
+ memset(&maps->idirect[iidx][eidx], 0,
+ sizeof(struct uid_gid_extent));
+
+ return &maps->idirect[iidx][eidx];
+ }
+
+ if (next >= UID_GID_MAP_IDIRECT_MAX &&
+ next < UID_GID_MAP_DIDIRECT_MAX) {
+ ldiidx = 0;
+ diidx = get_diidx(next);
+ liidx = 0;
+ iidx = get_iidx(next);
+ eidx = get_eidx(next);
+
+ if (diidx > 0)
+ ldiidx = get_diidx(next - 1);
+
+ if (iidx > 0)
+ liidx = get_iidx(next - 1);
+
+ if (!maps->didirect || diidx > ldiidx) {
+ tmp = krealloc(maps->didirect, sizeof(struct uid_gid_extent **) *
+ (diidx + 1),
+ GFP_KERNEL);
+ if (!tmp)
+ return ERR_PTR(-ENOMEM);
+ maps->didirect = tmp;
+ maps->didirect[diidx] = NULL;
+ }
+
+ if (!maps->didirect[diidx] || iidx > liidx) {
+ tmp = krealloc(maps->didirect[diidx],
+ sizeof(struct uid_gid_extent *) * (iidx + 1),
+ GFP_KERNEL);
+ if (!tmp)
+ return ERR_PTR(-ENOMEM);
+
+ maps->didirect[diidx] = tmp;
+ maps->didirect[diidx][iidx] = NULL;
+ }
+
+ tmp = krealloc(maps->didirect[diidx][iidx],
+ sizeof(struct uid_gid_extent) * (eidx + 1),
+ GFP_KERNEL);
+ if (!tmp)
+ return ERR_PTR(-ENOMEM);
+
+ maps->didirect[diidx][iidx] = tmp;
+ memset(&maps->didirect[diidx][iidx][eidx], 0,
+ sizeof(struct uid_gid_extent));
+
+ return &maps->didirect[diidx][iidx][eidx];
+ }
+
+ return ERR_PTR(-ENOMEM);
+}
+
static ssize_t map_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos,
int cap_setid,
@@ -672,6 +892,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
* architectures returning stale data.
*/
mutex_lock(&userns_state_mutex);
+ memset(&new_map, 0, sizeof(new_map));
ret = -EPERM;
/* Only allow one successful write to the map */
@@ -702,7 +923,11 @@ static ssize_t map_write(struct file *file, const char __user *buf,
pos = kbuf;
new_map.nr_extents = 0;
for (; pos; pos = next_line) {
- extent = &new_map.extent[new_map.nr_extents];
+ extent = alloc_extent(&new_map);
+ if (IS_ERR(extent)) {
+ ret = PTR_ERR(extent);
+ goto out;
+ }
/* Find the end of line and ensure I don't look past it */
next_line = strchr(pos, '\n');
@@ -754,11 +979,11 @@ static ssize_t map_write(struct file *file, const char __user *buf,
new_map.nr_extents++;
/* Fail if the file contains too many extents */
- if ((new_map.nr_extents == UID_GID_MAP_MAX_EXTENTS) &&
+ if ((new_map.nr_extents == UID_GID_MAP_MAX) &&
(next_line != NULL))
goto out;
}
- /* Be very certaint the new map actually exists */
+ /* Be very certain the new map actually exists */
if (new_map.nr_extents == 0)
goto out;
@@ -772,7 +997,7 @@ static ssize_t map_write(struct file *file, const char __user *buf,
*/
for (idx = 0; idx < new_map.nr_extents; idx++) {
u32 lower_first;
- extent = &new_map.extent[idx];
+ extent = get_idmap(&new_map, idx);
lower_first = map_id_range_down(parent_map,
extent->lower_first,
@@ -788,15 +1013,20 @@ static ssize_t map_write(struct file *file, const char __user *buf,
}
/* Install the map */
- memcpy(map->extent, new_map.extent,
- new_map.nr_extents*sizeof(new_map.extent[0]));
+ memcpy(map->extent, new_map.extent, sizeof(new_map.extent));
+ map->direct = new_map.direct;
+ map->idirect = new_map.idirect;
+ map->didirect = new_map.didirect;
smp_wmb();
map->nr_extents = new_map.nr_extents;
*ppos = count;
ret = count;
out:
+ if (ret < 0)
+ free_extents(&new_map);
mutex_unlock(&userns_state_mutex);
+
kfree(kbuf);
return ret;
}
--
2.14.1
* Re: [PATCH] user namespaces: bump idmap limits
@ 2017-10-04 14:28 ` Serge E. Hallyn
From: Serge E. Hallyn @ 2017-10-04 14:28 UTC (permalink / raw)
To: Christian Brauner; +Cc: ebiederm, serge, stgraber, linux-kernel, tycho
Quoting Christian Brauner (christian.brauner@ubuntu.com):
> We have quite a few use cases where users already run into the current limit
> for {g,u}id mappings. Consider a user requesting us to map everything but 999
> and 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
>
> some-user:100000:1000000000
> some-user:999:1
> some-user:1000:1
> some-user:1001:1
> some-user:1002:1
>
> This translates to:
>
> MAPPING-TYPE CONTAINER HOST RANGE
> uid 999 999 1
> uid 1001 1001 1
> uid 0 1000000 999
> uid 1000 1001000 1
> uid 1002 1001002 999998998
>
> gid 999 999 1
> gid 1001 1001 1
> gid 0 1000000 999
> gid 1000 1001000 1
> gid 1002 1001002 999998998
>
> which is already the current limit.
>
> Design Notes:
> As discussed at LPC, simply bumping the number of limits is not going to work
> since this would mean that struct uid_gid_map won't fit into a single cache
> line anymore, thereby regressing performance for the base cases. The same
> problem arises when using a single pointer. So the idea is to keep the base
> cases (0-3 mappings) directly in struct uid_gid_map so that they fit into a
> single cache line of 64 bytes. In place of the two removed mappings we put
> three pointers into the struct that mimic the behavior of traditional
> filesystems:
> 1. a direct pointer to a block of 5 struct uid_gid_extent mappings (60 bytes)
> 2. an indirect pointer to a 64 byte array of direct pointers, each pointing
> to a block of 5 struct uid_gid_extent mappings (60 bytes each)
> 3. a double indirect pointer to a 64 byte array of indirect pointers, each
> pointing to a 64 byte array of direct pointers
> Assuming a pointer size of 8 bytes this gives us 3 + 5 + (8 * 5) + (8 * (8 *
> 5)) = 368 mappings, which should really be enough. The idea of this approach
> is to always have each block of idmap extents (struct uid_gid_extent) be 60
> bytes (5 * (4 + 4 + 4)) and thus 4 bytes smaller than a single cache line.
> Lookups beyond the base cases should only see a linear performance impact
> caused by iterating through the idmappings in a for-loop. Note that the base
> cases shouldn't see any performance degradation, which is the most important
> part.
Sounds like a good plan.
> Performance Testing:
> When Eric introduced the extent-based struct uid_gid_map approach he measured
> the performance impact of his idmap changes:
>
> > My benchmark consisted of going to single user mode where nothing else was
> > running. On an ext4 filesystem opening 1,000,000 files and looping through all
> > of the files 1000 times and calling fstat on the individual files. This was
> > to ensure I was benchmarking stat times where the inodes were in the kernel's
> > cache, but the inode values were not in the processor's cache. My results:
>
> > v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
>
> I used an identical approach on my laptop. Here's a thorough description of what
> I did. I built three kernels and used an additional "control" kernel:
>
> 1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
> 2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> 3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
^ you mean *with* your patch but with CONFIG_USER_NS=n ?
> 4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)
^ Just curious, why did you include this? To show that other factors have a much
larger impact? This does not include your patch, right?
>
> I booted into single user mode (systemd rescue target in newspeak) and used an
> ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the
> files calling fstat() on each of them 1000 times and calculated the mean fstat()
> time for a single file. (The test program can be found below.)
>
> For kernels v4.14-rc2-vanilla, v4.12.0-12-generic I tested the following cases:
> 0 mappings
> 1 mapping
> 2 mappings
> 3 mappings
> 5 mappings
>
> For kernel v4.14-rc2-userns+ I tested:
> 0 mappings
> 1 mapping
> 2 mappings
> 3 mappings
> 5 mappings
> 10 mappings
> 50 mappings
> 100 mappings
> 200 mappings
> 300 mappings
>
> Here are the results:
>
> - v4.14-rc2-vanilla (unmodified v4.14-rc2)
> # no unshare: 312 ns
> unshare -U # write 0 mappings: 307 ns
> unshare -U # write 1 mappings: 328 ns
> unshare -U # write 2 mappings: 328 ns
> unshare -U # write 3 mappings: 328 ns
> unshare -U # write 5 mappings: 338 ns
>
> - v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> # no unshare: 158 ns
> unshare -U # write 0 mappings: 158 ns
> unshare -U # write 1 mappings: 164 ns
> unshare -U # write 2 mappings: 170 ns
> unshare -U # write 3 mappings: 175 ns
> unshare -U # write 5 mappings: 187 ns
> unshare -U # write 10 mappings: 218 ns
> unshare -U # write 50 mappings: 528 ns
> unshare -U # write 100 mappings: 980 ns
> unshare -U # write 200 mappings: 1880 ns
> unshare -U # write 300 mappings: 2760 ns
>
> - v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
> # no unshare: 161 ns
>
> - 4.12.0-12-generic Ubuntu Kernel:
> # no unshare: 328 ns
> unshare -U # write 0 mappings: 327 ns
> unshare -U # write 1 mappings: 328 ns
> unshare -U # write 2 mappings: 328 ns
> unshare -U # write 3 mappings: 328 ns
> unshare -U # write 5 mappings: 338 ns
>
^ This is really weird. Why does the Ubuntu kernel have near-constant
(horrible) time?
> I've tested this multiple times and the numbers hold up. All v4.14-rc2 kernels
> were built on the same machine with the same .config, the same options and a
> simple call to make -j 11 bindeb-pkg. The 4.12 kernel was simply installed from
> the Ubuntu archives.
>
> The most important part to me is that my idmap patches don't regress performance
> for the base cases. I'd actually only consider 0 and 1 mapping to be the proper
Agreed. Now personally I probably would have kept 4 direct pointers and made
the 5+ case hurt more, but I'm not saying that's the right thing.
(haven't looked at the patch itself yet)
thanks,
-serge
* Re: [PATCH] user namespaces: bump idmap limits
@ 2017-10-04 14:45 ` Christian Brauner
From: Christian Brauner @ 2017-10-04 14:45 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Christian Brauner, ebiederm, stgraber, linux-kernel, tycho
On Wed, Oct 04, 2017 at 09:28:57AM -0500, Serge Hallyn wrote:
> Quoting Christian Brauner (christian.brauner@ubuntu.com):
> > We have quite a few use cases where users already run into the current limit
> > for {g,u}id mappings. Consider a user requesting us to map everything but 999
> > and 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
> >
> > some-user:100000:1000000000
> > some-user:999:1
> > some-user:1000:1
> > some-user:1001:1
> > some-user:1002:1
> >
> > This translates to:
> >
> > MAPPING-TYPE CONTAINER HOST RANGE
> > uid 999 999 1
> > uid 1001 1001 1
> > uid 0 1000000 999
> > uid 1000 1001000 1
> > uid 1002 1001002 999998998
> >
> > gid 999 999 1
> > gid 1001 1001 1
> > gid 0 1000000 999
> > gid 1000 1001000 1
> > gid 1002 1001002 999998998
> >
> > which is already the current limit.
> >
> > Design Notes:
> > As discussed at LPC, simply bumping the number of limits is not going to work
> > since this would mean that struct uid_gid_map won't fit into a single cache
> > line anymore, thereby regressing performance for the base cases. The same
> > problem arises when using a single pointer. So the idea is to keep the base
> > cases (0-3 mappings) directly in struct uid_gid_map so that they fit into a
> > single cache line of 64 bytes. In place of the two removed mappings we put
> > three pointers into the struct that mimic the behavior of traditional
> > filesystems:
> > 1. a direct pointer to a block of 5 struct uid_gid_extent mappings (60 bytes)
> > 2. an indirect pointer to a 64 byte array of direct pointers, each pointing
> > to a block of 5 struct uid_gid_extent mappings (60 bytes each)
> > 3. a double indirect pointer to a 64 byte array of indirect pointers, each
> > pointing to a 64 byte array of direct pointers
> > Assuming a pointer size of 8 bytes this gives us 3 + 5 + (8 * 5) + (8 * (8 *
> > 5)) = 368 mappings, which should really be enough. The idea of this approach
> > is to always have each block of idmap extents (struct uid_gid_extent) be 60
> > bytes (5 * (4 + 4 + 4)) and thus 4 bytes smaller than a single cache line.
> > Lookups beyond the base cases should only see a linear performance impact
> > caused by iterating through the idmappings in a for-loop. Note that the base
> > cases shouldn't see any performance degradation, which is the most important
> > part.
>
> Sounds like a good plan.
>
> > Performance Testing:
> > When Eric introduced the extent-based struct uid_gid_map approach he measured
> > the performance impact of his idmap changes:
> >
> > > My benchmark consisted of going to single user mode where nothing else was
> > > running. On an ext4 filesystem opening 1,000,000 files and looping through all
> > > of the files 1000 times and calling fstat on the individual files. This was
> > > to ensure I was benchmarking stat times where the inodes were in the kernel's
> > > cache, but the inode values were not in the processor's cache. My results:
> >
> > > v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> > > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> > > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
> >
> > I used an identical approach on my laptop. Here's a thorough description of what
> > I did. I built three kernels and used an additional "control" kernel:
> >
> > 1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
> > 2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> > 3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
>
> ^ you mean *with* your patch but with CONFIG_USER_NS=n ?
Yes, exactly. Sorry, that was unclear here.
>
> > 4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)
>
> ^ Just curious, why did you include this? To show that other factors have a much
> larger impact? This does not include your patch, right?
Basically I wanted something I didn't compile myself, to see whether the
numbers somehow line up. In terms of experimentation you could think of this
as a second "control condition".
>
> >
> > I booted into single user mode (systemd rescue target in newspeak) and used an
> > ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the
> > files calling fstat() on each of them 1000 times and calculated the mean fstat()
> > time for a single file. (The test program can be found below.)
> >
> > For kernels v4.14-rc2-vanilla, v4.12.0-12-generic I tested the following cases:
> > 0 mappings
> > 1 mapping
> > 2 mappings
> > 3 mappings
> > 5 mappings
> >
> > For kernel v4.14-rc2-userns+ I tested:
> > 0 mappings
> > 1 mapping
> > 2 mappings
> > 3 mappings
> > 5 mappings
> > 10 mappings
> > 50 mappings
> > 100 mappings
> > 200 mappings
> > 300 mappings
> >
> > Here are the results:
> >
> > - v4.14-rc2-vanilla (unmodified v4.14-rc2)
> > # no unshare: 312 ns
> > unshare -U # write 0 mappings: 307 ns
> > unshare -U # write 1 mappings: 328 ns
> > unshare -U # write 2 mappings: 328 ns
> > unshare -U # write 3 mappings: 328 ns
> > unshare -U # write 5 mappings: 338 ns
> >
> > - v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> > # no unshare: 158 ns
> > unshare -U # write 0 mappings: 158 ns
> > unshare -U # write 1 mappings: 164 ns
> > unshare -U # write 2 mappings: 170 ns
> > unshare -U # write 3 mappings: 175 ns
> > unshare -U # write 5 mappings: 187 ns
> > unshare -U # write 10 mappings: 218 ns
> > unshare -U # write 50 mappings: 528 ns
> > unshare -U # write 100 mappings: 980 ns
> > unshare -U # write 200 mappings: 1880 ns
> > unshare -U # write 300 mappings: 2760 ns
> >
> > - v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
> > # no unshare: 161 ns
> >
> > - 4.12.0-12-generic Ubuntu Kernel:
> > # no unshare: 328 ns
> > unshare -U # write 0 mappings: 327 ns
> > unshare -U # write 1 mappings: 328 ns
> > unshare -U # write 2 mappings: 328 ns
> > unshare -U # write 3 mappings: 328 ns
> > unshare -U # write 5 mappings: 338 ns
> >
>
> ^ This is really weird. Why does the Ubuntu kernel have near-constant
> (horrible) time?
I actually think that - even in single user mode, with the same number of
processes running and so on - there's a lot of fluctuation going on. That's why
I ran the tests multiple times. It might also depend on compilation since I
compiled the three kernels myself and just downloaded the binaries for the
Ubuntu kernel. The tests clearly show that there's an increase with the number
of mappings, which is what I expected.
>
> > I've tested this multiple times and the numbers hold up. All v4.14-rc2 kernels
> > were built on the same machine with the same .config, the same options and a
> > simple call to make -j 11 bindeb-pkg. The 4.12 kernel was simply installed from
> > the Ubuntu archives.
> >
> > The most important part to me is that my idmap patches don't regress performance
> > for the base cases. I'd actually only consider 0 and 1 mapping to be the proper
>
> Agreed. Now personally I probably would have kept 4 direct pointers and made
> the 5+ case hurt more, but I'm not saying that's the right thing.
Yeah, I thought about that as well but my goal was to basically ramp up the
number of mappings into the hundreds to settle this "once and for all". I
actually don't expect us to go any higher than this. Tbh, users that have a
requirement to have many mappings should be prepared to take the performance
hit. Also, I think that the extra direct pointers won't necessarily give you
more speed since - I'd guess - the slowdown simply comes from the number of
iterations through the map you have to do and not necessarily from cache
misses. But I might be thinking nonsense here. Thanks!
>
> (haven't looked at the patch itself yet)
>
> thanks,
> -serge
* Re: [PATCH] user namespaces: bump idmap limits
@ 2017-10-11 20:30 ` Serge E. Hallyn
From: Serge E. Hallyn @ 2017-10-11 20:30 UTC (permalink / raw)
To: Christian Brauner; +Cc: ebiederm, serge, stgraber, linux-kernel, tycho
Quoting Christian Brauner (christian.brauner@ubuntu.com):
> We have quite a few use cases where users already run into the current limit
> for {g,u}id mappings. Consider a user requesting us to map everything but 999
> and 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
>
> some-user:100000:1000000000
> some-user:999:1
> some-user:1000:1
> some-user:1001:1
> some-user:1002:1
>
> This translates to:
>
> MAPPING-TYPE CONTAINER HOST RANGE
> uid 999 999 1
> uid 1001 1001 1
> uid 0 1000000 999
> uid 1000 1001000 1
> uid 1002 1001002 999998998
>
> gid 999 999 1
> gid 1001 1001 1
> gid 0 1000000 999
> gid 1000 1001000 1
> gid 1002 1001002 999998998
>
> which is already the current limit.
>
> Design Notes:
> As discussed at LPC, simply bumping the number of limits is not going to work
> since this would mean that struct uid_gid_map won't fit into a single cache
> line anymore, thereby regressing performance for the base cases. The same
> problem arises when using a single pointer. So the idea is to keep the base
> cases (0-3 mappings) directly in struct uid_gid_map so that they fit into a
> single cache line of 64 bytes. In place of the two removed mappings we put
> three pointers into the struct that mimic the behavior of traditional
> filesystems:
> 1. a direct pointer to a block of 5 struct uid_gid_extent mappings (60 bytes)
> 2. an indirect pointer to a 64 byte array of direct pointers, each pointing
> to a block of 5 struct uid_gid_extent mappings (60 bytes each)
> 3. a double indirect pointer to a 64 byte array of indirect pointers, each
> pointing to a 64 byte array of direct pointers
> Assuming a pointer size of 8 bytes this gives us 3 + 5 + (8 * 5) + (8 * (8 *
> 5)) = 368 mappings, which should really be enough. The idea of this approach
> is to always have each block of idmap extents (struct uid_gid_extent) be 60
> bytes (5 * (4 + 4 + 4)) and thus 4 bytes smaller than a single cache line.
> Lookups beyond the base cases should only see a linear performance impact
> caused by iterating through the idmappings in a for-loop. Note that the base
> cases shouldn't see any performance degradation, which is the most important
> part.
>
> Performance Testing:
> When Eric introduced the extent-based struct uid_gid_map approach he measured
> the performance impact of his idmap changes:
>
> > My benchmark consisted of going to single user mode where nothing else was
> > running. On an ext4 filesystem opening 1,000,000 files and looping through all
> > of the files 1000 times and calling fstat on the individual files. This was
> > to ensure I was benchmarking stat times where the inodes were in the kernel's
> > cache, but the inode values were not in the processor's cache. My results:
>
> > v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
>
> I used an identical approach on my laptop. Here's a thorough description of what
> I did. I built three kernels and used an additional "control" kernel:
>
> 1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
> 2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> 3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
> 4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)
>
> I booted into single user mode (systemd rescue target in newspeak) and used an
> ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the
> files calling fstat() on each of them 1000 times and calculated the mean fstat()
> time for a single file. (The test program can be found below.)
>
> For kernels v4.14-rc2-vanilla, v4.12.0-12-generic I tested the following cases:
> 0 mappings
> 1 mapping
> 2 mappings
> 3 mappings
> 5 mappings
>
> For kernel v4.14-rc2-userns+ I tested:
> 0 mappings
> 1 mapping
> 2 mappings
> 3 mappings
> 5 mappings
> 10 mappings
> 50 mappings
> 100 mappings
> 200 mappings
> 300 mappings
>
> Here are the results:
>
> - v4.14-rc2-vanilla (unmodified v4.14-rc2)
> # no unshare: 312 ns
> unshare -U # write 0 mappings: 307 ns
> unshare -U # write 1 mappings: 328 ns
> unshare -U # write 2 mappings: 328 ns
> unshare -U # write 3 mappings: 328 ns
> unshare -U # write 5 mappings: 338 ns
>
> - v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> # no unshare: 158 ns
> unshare -U # write 0 mappings: 158 ns
> unshare -U # write 1 mappings: 164 ns
> unshare -U # write 2 mappings: 170 ns
> unshare -U # write 3 mappings: 175 ns
> unshare -U # write 5 mappings: 187 ns
> unshare -U # write 10 mappings: 218 ns
> unshare -U # write 50 mappings: 528 ns
> unshare -U # write 100 mappings: 980 ns
> unshare -U # write 200 mappings: 1880 ns
> unshare -U # write 300 mappings: 2760 ns
These numbers look very good.
Eric, does that address your performance concern?
(If so I'll take a closer look at the patch itself, I've not yet done so)
thanks,
-serge
* Re: [PATCH] user namespaces: bump idmap limits
@ 2017-10-11 20:43 ` Eric W. Biederman
From: Eric W. Biederman @ 2017-10-11 20:43 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: Christian Brauner, stgraber, linux-kernel, tycho
"Serge E. Hallyn" <serge@hallyn.com> writes:
> Quoting Christian Brauner (christian.brauner@ubuntu.com):
>> We have quite a few use cases where users already run into the current limit
>> for {g,u}id mappings. Consider a user requesting us to map everything but 999
>> and 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
>>
>> some-user:100000:1000000000
>> some-user:999:1
>> some-user:1000:1
>> some-user:1001:1
>> some-user:1002:1
>>
>> This translates to:
>>
>> MAPPING-TYPE CONTAINER HOST RANGE
>> uid 999 999 1
>> uid 1001 1001 1
>> uid 0 1000000 999
>> uid 1000 1001000 1
>> uid 1002 1001002 999998998
>>
>> gid 999 999 1
>> gid 1001 1001 1
>> gid 0 1000000 999
>> gid 1000 1001000 1
>> gid 1002 1001002 999998998
>>
>> which is already the current limit.
>>
>> Design Notes:
>> As discussed at LPC, simply bumping the number of limits is not going to work
>> since this would mean that struct uid_gid_map won't fit into a single cache
>> line anymore, thereby regressing performance for the base cases. The same
>> problem arises when using a single pointer. So the idea is to keep the base
>> cases (0-3 mappings) directly in struct uid_gid_map so that they fit into a
>> single cache line of 64 bytes. In place of the two removed mappings we put
>> three pointers into the struct that mimic the behavior of traditional
>> filesystems:
>> 1. a direct pointer to a block of 5 struct uid_gid_extent mappings (60 bytes)
>> 2. an indirect pointer to a 64 byte array of direct pointers, each pointing
>> to a block of 5 struct uid_gid_extent mappings (60 bytes each)
>> 3. a double indirect pointer to a 64 byte array of indirect pointers, each
>> pointing to a 64 byte array of direct pointers
>> Assuming a pointer size of 8 bytes this gives us 3 + 5 + (8 * 5) + (8 * (8 *
>> 5)) = 368 mappings, which should really be enough. The idea of this approach
>> is to always have each block of idmap extents (struct uid_gid_extent) be 60
>> bytes (5 * (4 + 4 + 4)) and thus 4 bytes smaller than a single cache line.
>> Lookups beyond the base cases should only see a linear performance impact
>> caused by iterating through the idmappings in a for-loop. Note that the base
>> cases shouldn't see any performance degradation, which is the most important
>> part.
>>
>> Performance Testing:
>> When Eric introduced the extent-based struct uid_gid_map approach he measured
>> the performance impact of his idmap changes:
>>
>> > My benchmark consisted of going to single user mode where nothing else was
>> > running. On an ext4 filesystem opening 1,000,000 files and looping through all
>> > of the files 1000 times and calling fstat on the individual files. This was
>> > to ensure I was benchmarking stat times where the inodes were in the kernel's
>> > cache, but the inode values were not in the processor's cache. My results:
>>
>> > v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
>> > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
>> > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
>>
>> I used an identical approach on my laptop. Here's a thorough description of what
>> I did. I built three kernels and used an additional "control" kernel:
>>
>> 1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
>> 2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
>> 3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
>> 4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)
>>
>> I booted into single user mode (systemd rescue target in newspeak) and used an
>> ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the
>> files calling fstat() on each of them 1000 times and calculated the mean fstat()
>> time for a single file. (The test program can be found below.)
>>
>> For kernels v4.14-rc2-vanilla, v4.12.0-12-generic I tested the following cases:
>> 0 mappings
>> 1 mapping
>> 2 mappings
>> 3 mappings
>> 5 mappings
>>
>> For kernel v4.14-rc2-userns+ I tested:
>> 0 mappings
>> 1 mapping
>> 2 mappings
>> 3 mappings
>> 5 mappings
>> 10 mappings
>> 50 mappings
>> 100 mappings
>> 200 mappings
>> 300 mappings
>>
>> Here are the results:
>>
>> - v4.14-rc2-vanilla (unmodified v4.14-rc2)
>> # no unshare: 312 ns
>> unshare -U # write 0 mappings: 307 ns
>> unshare -U # write 1 mappings: 328 ns
>> unshare -U # write 2 mappings: 328 ns
>> unshare -U # write 3 mappings: 328 ns
>> unshare -U # write 5 mappings: 338 ns
>>
>> - v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
>> # no unshare: 158 ns
>> unshare -U # write 0 mappings: 158 ns
>> unshare -U # write 1 mappings: 164 ns
>> unshare -U # write 2 mappings: 170 ns
>> unshare -U # write 3 mappings: 175 ns
>> unshare -U # write 5 mappings: 187 ns
>> unshare -U # write 10 mappings: 218 ns
>> unshare -U # write 50 mappings: 528 ns
>> unshare -U # write 100 mappings: 980 ns
>> unshare -U # write 200 mappings: 1880 ns
>> unshare -U # write 300 mappings: 2760 ns
>
> These numbers look very good.
>
> Eric, does that address your performance concern?
>
> (If so I'll take a closer look at the patch itself, I've not yet done so)
Apologies, it has been a busy couple of days. I really meant to write my
reply a while ago.
With the current strategy I have no concerns about the new
implementation regressing performance in existing use cases.
I meant to mention that one way I reduced variability when I was timing
things was to turn off cpu power management. I don't know if that works
on the newest cpus that can occasionally speed up.
I am concerned about an implementation that scales linearly with the
number of mappings in the large, when we know it should be possible
to scale at log(N) in the number of mappings.
So perhaps we want the implementation to look something like:
struct uid_gid_extent {
	u32 first;
	u32 lower_first;
	u32 count;
};

struct uid_gid_map {
	u32 nr_extents;
	union {
		struct uid_gid_extent extent[UID_GID_MAP_MAX_INLINE_EXTENTS];
		struct {
			struct uid_gid_extent *forward;
			struct uid_gid_extent *reverse;
		};
	};
};
So perhaps we want the implementation to be a union with nr_extents
being the constant member, and two pointers: one for the forward and
one for the reverse mappings.
Then we limit ourselves to a single kmalloc (about 4K) of extents and
perform a binary search through them for the large case. That will take
twice the memory but it will be fast, and it will scale to about 341
extents (4096 / 12 bytes per extent) with only 8 or 9 comparisons/cache
line hits per lookup, since log2(341) is a bit over 8.
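A minimal sketch of that lookup, assuming the forward array is kept sorted by
each extent's first field (the reverse array would be sorted by lower_first);
apart from struct uid_gid_extent everything here is illustrative:

/*
 * Binary search for the extent covering @id in an array sorted by
 * ->first. Returns NULL if no extent covers @id.
 */
static struct uid_gid_extent *
extent_bsearch(struct uid_gid_extent *extents, u32 nr_extents, u32 id)
{
	u32 left = 0, right = nr_extents;

	while (left < right) {
		u32 mid = left + (right - left) / 2;
		struct uid_gid_extent *e = &extents[mid];

		if (id < e->first)
			right = mid;
		else if (id - e->first >= e->count)
			left = mid + 1;
		else
			return e;
	}

	return NULL;
}

Mapping down is then e ? e->lower_first + (id - e->first) : (u32) -1; the
reverse array allows the same search keyed on lower_first for map_id_up().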
Eric
* Re: [PATCH] user namespaces: bump idmap limits
@ 2017-10-12 14:09 ` Christian Brauner
From: Christian Brauner @ 2017-10-12 14:09 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Serge E. Hallyn, Christian Brauner, Stéphane Graber,
Linux Kernel Mailing List, tycho
On Wed, Oct 11, 2017 at 03:43:20PM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge@hallyn.com> writes:
>
> > Quoting Christian Brauner (christian.brauner@ubuntu.com):
> >> We have quite a few use cases where users already run into the current limit
> >> for {g,u}id mappings. Consider a user requesting us to map everything but 999
> >> and 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
> >>
> >> some-user:100000:1000000000
> >> some-user:999:1
> >> some-user:1000:1
> >> some-user:1001:1
> >> some-user:1002:1
> >>
> >> This translates to:
> >>
> >> MAPPING-TYPE CONTAINER HOST RANGE
> >> uid 999 999 1
> >> uid 1001 1001 1
> >> uid 0 1000000 999
> >> uid 1000 1001000 1
> >> uid 1002 1001002 999998998
> >>
> >> gid 999 999 1
> >> gid 1001 1001 1
> >> gid 0 1000000 999
> >> gid 1000 1001000 1
> >> gid 1002 1001002 999998998
> >>
> >> which is already the current limit.
> >>
> >> Design Notes:
> >> As discussed at LPC, simply bumping the number of limits is not going to work
> >> since this would mean that struct uid_gid_map won't fit into a single cache
> >> line anymore, thereby regressing performance for the base cases. The same
> >> problem arises when using a single pointer. So the idea is to keep the base
> >> cases (0-3 mappings) directly in struct uid_gid_map so that they fit into a
> >> single cache line of 64 bytes. In place of the two removed mappings we put
> >> three pointers into the struct that mimic the behavior of traditional
> >> filesystems:
> >> 1. a direct pointer to a block of 5 struct uid_gid_extent mappings (60 bytes)
> >> 2. an indirect pointer to a 64 byte array of direct pointers, each pointing
> >> to a block of 5 struct uid_gid_extent mappings (60 bytes each)
> >> 3. a double indirect pointer to a 64 byte array of indirect pointers, each
> >> pointing to a 64 byte array of direct pointers
> >> Assuming a pointer size of 8 bytes this gives us 3 + 5 + (8 * 5) + (8 * (8 *
> >> 5)) = 368 mappings, which should really be enough. The idea of this approach
> >> is to always have each block of idmap extents (struct uid_gid_extent) be 60
> >> bytes (5 * (4 + 4 + 4)) and thus 4 bytes smaller than a single cache line.
> >> Lookups beyond the base cases should only see a linear performance impact
> >> caused by iterating through the idmappings in a for-loop. Note that the base
> >> cases shouldn't see any performance degradation, which is the most important
> >> part.
> >>
> >> Performance Testing:
> >> When Eric introduced the extent-based struct uid_gid_map approach he measured
> >> the performance impact of his idmap changes:
> >>
> >> > My benchmark consisted of going to single user mode where nothing else was
> >> > running. On an ext4 filesystem opening 1,000,000 files and looping through all
> >> > of the files 1000 times and calling fstat on the individual files. This was
> >> > to ensure I was benchmarking stat times where the inodes were in the kernel's
> >> > cache, but the inode values were not in the processor's cache. My results:
> >>
> >> > v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
> >> > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
> >> > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
> >>
> >> I used an identical approach on my laptop. Here's a thorough description of what
> >> I did. I built three kernels and used an additional "control" kernel:
> >>
> >> 1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
> >> 2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> >> 3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
> >> 4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)
> >>
> >> I booted into single user mode (systemd rescue target in newspeak) and used an
> >> ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the
> >> files calling fstat() on each of them 1000 times and calculated the mean fstat()
> >> time for a single file. (The test program can be found below.)
> >>
> >> For kernels v4.14-rc2-vanilla, v4.12.0-12-generic I tested the following cases:
> >> 0 mappings
> >> 1 mapping
> >> 2 mappings
> >> 3 mappings
> >> 5 mappings
> >>
> >> For kernel v4.14-rc2-userns+ I tested:
> >> 0 mappings
> >> 1 mapping
> >> 2 mappings
> >> 3 mappings
> >> 5 mappings
> >> 10 mappings
> >> 50 mappings
> >> 100 mappings
> >> 200 mappings
> >> 300 mappings
> >>
> >> Here are the results:
> >>
> >> - v4.14-rc2-vanilla (unmodified v4.14-rc2)
> >> # no unshare: 312 ns
> >> unshare -U # write 0 mappings: 307 ns
> >> unshare -U # write 1 mappings: 328 ns
> >> unshare -U # write 2 mappings: 328 ns
> >> unshare -U # write 3 mappings: 328 ns
> >> unshare -U # write 5 mappings: 338 ns
> >>
> >> - v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
> >> # no unshare: 158 ns
> >> unshare -U # write 0 mappings: 158 ns
> >> unshare -U # write 1 mappings: 164 ns
> >> unshare -U # write 2 mappings: 170 ns
> >> unshare -U # write 3 mappings: 175 ns
> >> unshare -U # write 5 mappings: 187 ns
> >> unshare -U # write 10 mappings: 218 ns
> >> unshare -U # write 50 mappings: 528 ns
> >> unshare -U # write 100 mappings: 980 ns
> >> unshare -U # write 200 mappings: 1880 ns
> >> unshare -U # write 300 mappings: 2760 ns
> >
> > These numbers look very good.
> >
> > Eric, does that address your performance concern?
> >
> > (If so I'll take a closer look at the patch itself, I've not yet done so)
>
> Apologies it has been a busy couple of days. I really meant to write my
> reply a while ago.
>
> With the current strategy I have no concerns about the new
> implementation regressing performance in existing use cases.
>
> I meant to mention that one way I got out variability when I was timing
> things was to turn off cpu power management. I don't know if that works
> on the newest cpus that can occassionally speed up.
>
> I am concerned about an implementation that scales linearly with the
> number of mappings in the large. When it is know we should be able
> to scale at log(N) of the number of mappings.
Oh, I was under the impression that your concern was only about slowing down
the base cases. Thanks!
>
> So perhaps we want the implementation to look something like:
> struct uid_gid_extent {
> u32 first;
> u32 lower_first;
> u32 count;
> };
> struct uid_gid_map {
> u32 nr_extents;
> union {
> extent[UID_GID_MAP_MAX_INLINE_EXTENTS];
> struct {
> struct uid_gid_extent *forward;
> struct uid_gid_extent *reverse;
> };
> };
> };
>
> So perhaps we want the implementation to be a union with nr_extents
> being the constant member and having two pointers one for a forward
> and one for the reverse mappings.
That looks reasonable. I'll code that up soon.
Just to clarify, what do you mean by forward mappings and reverse mappings? Are
you just saying that struct uid_gid_extent *forward points to the beginning of
the sorted 4k kmalloc()ed extent array and struct uid_gid_extent *reverse points
to the end of the sorted 4k kmalloc()ed extent array for the binary search? I
want to make sure I'm not missing some intricacy here that might be crucial.
Christian
* Re: [PATCH] user namespaces: bump idmap limits
@ 2017-10-12 17:08 ` Eric W. Biederman
From: Eric W. Biederman @ 2017-10-12 17:08 UTC (permalink / raw)
To: Christian Brauner
Cc: Serge E. Hallyn, Christian Brauner, Stéphane Graber,
Linux Kernel Mailing List, tycho
Christian Brauner <christian.brauner@canonical.com> writes:
> On Wed, Oct 11, 2017 at 03:43:20PM -0500, Eric W. Biederman wrote:
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>>
>> > Quoting Christian Brauner (christian.brauner@ubuntu.com):
>> >> We have quite a few use cases where users already run into the current limit for
>> >> {g,u}id mappings. Consider a user requesting us to map everything but 999 and
>> >> 1001 for a given range of 1000000000 with a sub{g,u}id layout of:
>> >>
>> >> some-user:100000:1000000000
>> >> some-user:999:1
>> >> some-user:1000:1
>> >> some-user:1001:1
>> >> some-user:1002:1
>> >>
>> >> This translates to:
>> >>
>> >> MAPPING-TYPE CONTAINER HOST RANGE
>> >> uid 999 999 1
>> >> uid 1001 1001 1
>> >> uid 0 1000000 999
>> >> uid 1000 1001000 1
>> >> uid 1002 1001002 999998998
>> >>
>> >> gid 999 999 1
>> >> gid 1001 1001 1
>> >> gid 0 1000000 999
>> >> gid 1000 1001000 1
>> >> gid 1002 1001002 999998998
>> >>
>> >> which is already the current limit.
>> >>
>> >> Design Notes:
>> >> As discussed at LPC simply bumping the number of limits is not going to work
>> >> since this would mean that struct uid_gid_map won't fit into a single cache-line
>> >> anymore, thereby regressing performance for the base cases. The same problem
>> >> arises when using a single pointer. So the idea is to keep the base
>> >> cases (0-3 mappings) directly in struct uid_gid_map so they fit into a single
>> >> cache-line of 64 bytes. For the two removed mappings we place three pointers in
>> >> the struct that mimic the behavior of traditional filesystems:
>> >> 1. a direct pointer to a struct uid_gid_extent array of 5 mappings (60 bytes)
>> >> 2. an indirect pointer to a 64-byte array of direct pointers, each to a
>> >> struct uid_gid_extent array of 5 mappings (60 bytes each)
>> >> 3. a double indirect pointer to a 64-byte array of indirect pointers, each to
>> >> a 64-byte array of direct pointers (and so on)
>> >> Assuming a pointer size of 8 bytes this gives us 3 + 5 + (8 * 5) + (8 * (8 * 5)) =
>> >> 368 mappings, which should really be enough. The idea of this approach is to
>> >> always have each extent array of idmaps (struct uid_gid_extent) be 60 bytes
>> >> (5 * (4 + 4 + 4)) and thus 4 bytes smaller than a single cache line. Lookups
>> >> should only see a linear performance impact caused by iterating through
>> >> the idmappings in a for-loop. Note that the base cases shouldn't see any
>> >> performance degradation, which is the most important part.
>> >>
>> >> Performance Testing:
>> >> When Eric introduced the extent-based struct uid_gid_map approach he measured
>> >> the performance impact of his idmap changes:
>> >>
>> >> > My benchmark consisted of going to single user mode where nothing else was
>> >> > running. On an ext4 filesystem opening 1,000,000 files and looping through all
>> >> > of the files 1000 times and calling fstat on the individual files. This was
>> >> > to ensure I was benchmarking stat times where the inodes were in the kernel's
>> >> > cache, but the inode values were not in the processor's cache. My results:
>> >>
>> >> > v3.4-rc1: ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
>> >> > v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
>> >> > v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)
>> >>
>> >> I used an identical approach on my laptop. Here's a thorough description of what
>> >> I did. I built three kernels and used an additional "control" kernel:
>> >>
>> >> 1. v4.14-rc2-vanilla (unmodified v4.14-rc2)
>> >> 2. v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
>> >> 3. v4.14-rc2-userns- (v4.14-rc2 without my new user namespace idmap limits patch)
>> >> 4. v4.12.0-12-generic (v4.12.0-12 standard Ubuntu kernel)
>> >>
>> >> I booted into single user mode (systemd rescue target in newspeak) and used an
>> >> ext4 filesystem to open/create 1,000,000 files. Then I looped through all of the
>> >> files calling fstat() on each of them 1000 times and calculated the mean fstat()
>> >> time for a single file. (The test program can be found below.)
>> >>
>> >> For kernels v4.14-rc2-vanilla and v4.12.0-12-generic I tested the following cases:
>> >> 0 mappings
>> >> 1 mapping
>> >> 2 mappings
>> >> 3 mappings
>> >> 5 mappings
>> >>
>> >> For kernel v4.14-rc2-userns+ I tested:
>> >> 0 mappings
>> >> 1 mapping
>> >> 2 mappings
>> >> 3 mappings
>> >> 5 mappings
>> >> 10 mappings
>> >> 50 mappings
>> >> 100 mappings
>> >> 200 mappings
>> >> 300 mappings
>> >>
>> >> Here are the results:
>> >>
>> >> - v4.14-rc2-vanilla (unmodified v4.14-rc2)
>> >> # no unshare: 312 ns
>> >> unshare -U # write 0 mappings: 307 ns
>> >> unshare -U # write 1 mapping: 328 ns
>> >> unshare -U # write 2 mappings: 328 ns
>> >> unshare -U # write 3 mappings: 328 ns
>> >> unshare -U # write 5 mappings: 338 ns
>> >>
>> >> - v4.14-rc2-userns+ (v4.14-rc2 with my new user namespace idmap limits patch)
>> >> # no unshare: 158 ns
>> >> unshare -U # write 0 mappings: 158 ns
>> >> unshare -U # write 1 mapping: 164 ns
>> >> unshare -U # write 2 mappings: 170 ns
>> >> unshare -U # write 3 mappings: 175 ns
>> >> unshare -U # write 5 mappings: 187 ns
>> >> unshare -U # write 10 mappings: 218 ns
>> >> unshare -U # write 50 mappings: 528 ns
>> >> unshare -U # write 100 mappings: 980 ns
>> >> unshare -U # write 200 mappings: 1880 ns
>> >> unshare -U # write 300 mappings: 2760 ns
>> >
>> > These numbers look very good.
>> >
>> > Eric, does that address your performance concern?
>> >
>> > (If so I'll take a closer look at the patch itself, I've not yet done so)
>>
>> Apologies, it has been a busy couple of days. I really meant to write my
>> reply a while ago.
>>
>> With the current strategy I have no concerns about the new
>> implementation regressing performance in existing use cases.
>>
>> I meant to mention that one way I removed variability when I was timing
>> things was to turn off CPU power management. I don't know if that works
>> on the newest CPUs that can occasionally speed up.
>>
>> I am concerned about an implementation that scales linearly with the
>> number of mappings in the large, when it is known that we should be able
>> to scale at log(N) of the number of mappings.
>
> Oh, I was under the impression that what you cared about was avoiding a
> slowdown in the base cases. Thanks for clarifying!
I care very much about the base case because if that slows down we have
a performance regression that must be fixed. At the same time it
doesn't make sense to extend the data structure with something that
performs poorly when used heavily; we would just have to fix that later.
>> So perhaps we want the implementation to look something like:
>> struct uid_gid_extent {
>> 	u32 first;
>> 	u32 lower_first;
>> 	u32 count;
>> };
>>
>> struct uid_gid_map {
>> 	u32 nr_extents;
>> 	union {
>> 		struct uid_gid_extent extent[UID_GID_MAP_MAX_INLINE_EXTENTS];
>> 		struct {
>> 			struct uid_gid_extent *forward;
>> 			struct uid_gid_extent *reverse;
>> 		};
>> 	};
>> };
>
>>
>> So perhaps we want the implementation to be a union with nr_extents
>> being the constant member, and two pointers: one for the forward
>> and one for the reverse mappings.
>
> That looks reasonable. I'll code that up soon.
> Just to clarify, what do you mean by forward and reverse mappings? Are
> you saying that struct uid_gid_extent *forward points to the beginning of
> the sorted 4k kmalloc()ed extent array and struct uid_gid_extent *reverse
> points to the end of that array for the binary search? I want to be sure
> I'm not missing some intricacy here that might be crucial.
No, there is an intricacy you missed. The sort order for the forward and
reverse mappings is different, so we need two different arrays: one
sorted on first, the other sorted on lower_first. If the entries were
larger we could store a pointer to their common elements, but as only a
u32 differs, a pointer to a common element would be a waste of space.
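To illustrate (sketch only, the comparator names are made up and this is
not the in-tree code):

/*
 * Sketch: the same extents are kept twice, each copy sorted for one
 * lookup direction, e.g. with the kernel's sort() from lib/sort.c.
 */
static int cmp_forward(const void *a, const void *b)
{
	const struct uid_gid_extent *ea = a, *eb = b;

	/* forward lookups (ns id -> kernel id) search on ->first */
	if (ea->first < eb->first)
		return -1;
	return ea->first > eb->first;
}

static int cmp_reverse(const void *a, const void *b)
{
	const struct uid_gid_extent *ea = a, *eb = b;

	/* reverse lookups (kernel id -> ns id) search on ->lower_first */
	if (ea->lower_first < eb->lower_first)
		return -1;
	return ea->lower_first > eb->lower_first;
}

map->forward gets sorted with cmp_forward and binary-searched on ->first;
map->reverse gets sorted with cmp_reverse and binary-searched on
->lower_first.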
Eric
end of thread
Thread overview: 7+ messages
2017-10-04 11:20 [PATCH] user namespaces: bump idmap limits Christian Brauner
2017-10-04 14:28 ` Serge E. Hallyn
2017-10-04 14:45 ` Christian Brauner
2017-10-11 20:30 ` Serge E. Hallyn
2017-10-11 20:43 ` Eric W. Biederman
2017-10-12 14:09 ` Christian Brauner
2017-10-12 17:08 ` Eric W. Biederman