* [PATCHv5 08/37] posix-timers: Use clock_get_ktime() in common_timer_get()
From: Dmitry Safonov @ 2019-07-29 21:56 UTC (permalink / raw)
To: linux-kernel
Cc: Dmitry Safonov, Andrei Vagin, Dmitry Safonov, Adrian Reber,
Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api
In-Reply-To: <20190729215758.28405-1-dima@arista.com>
From: Andrei Vagin <avagin@gmail.com>
Now, when the clock_get_ktime() callback exists, the suboptimal
timespec64-based conversion can be removed from common_timer_get().
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
kernel/time/posix-timers.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index fb1848c84241..aae7ab53790d 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -666,7 +666,6 @@ void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
{
const struct k_clock *kc = timr->kclock;
ktime_t now, remaining, iv;
- struct timespec64 ts64;
bool sig_none;
sig_none = timr->it_sigev_notify == SIGEV_NONE;
@@ -684,12 +683,7 @@ void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
return;
}
- /*
- * The timespec64 based conversion is suboptimal, but it's not
- * worth to implement yet another callback.
- */
- kc->clock_get_timespec(timr->it_clock, &ts64);
- now = timespec64_to_ktime(ts64);
+ now = kc->clock_get_ktime(timr->it_clock);
/*
* When a requeue is pending or this is a SIGEV_NONE timer move the
--
2.22.0
^ permalink raw reply related
* [PATCHv5 07/37] posix-clocks: Introduce clock_get_ktime() callback
From: Dmitry Safonov @ 2019-07-29 21:56 UTC (permalink / raw)
To: linux-kernel
Cc: Dmitry Safonov, Andrei Vagin, Dmitry Safonov, Adrian Reber,
Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api
In-Reply-To: <20190729215758.28405-1-dima@arista.com>
From: Andrei Vagin <avagin@gmail.com>
The callsite in common_timer_get() has already a comment:
/*
* The timespec64 based conversion is suboptimal, but it's not
* worth to implement yet another callback.
*/
kc->clock_get(timr->it_clock, &ts64);
now = timespec64_to_ktime(ts64);
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
kernel/time/alarmtimer.c | 19 ++++++++++++++++++-
kernel/time/posix-timers.c | 26 +++++++++++++++++++++++++-
kernel/time/posix-timers.h | 3 +++
3 files changed, 46 insertions(+), 2 deletions(-)
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index d9e6baa9976a..55cb6e78d082 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -651,7 +651,7 @@ static int alarm_clock_getres(const clockid_t which_clock, struct timespec64 *tp
* @which_clock: clockid
* @tp: timespec to fill.
*
- * Provides the underlying alarm base time.
+ * Provides the underlying alarm base time in a tasks time namespace.
*/
static int alarm_clock_get_timespec(clockid_t which_clock, struct timespec64 *tp)
{
@@ -663,6 +663,22 @@ static int alarm_clock_get_timespec(clockid_t which_clock, struct timespec64 *tp
return base->get_timespec(base->base_clockid, tp);
}
+/**
+ * alarm_clock_get_ktime - posix clock_get_ktime interface
+ * @which_clock: clockid
+ *
+ * Provides the underlying alarm base time in the root namespace.
+ */
+static ktime_t alarm_clock_get_ktime(clockid_t which_clock)
+{
+ struct alarm_base *base = &alarm_bases[clock2alarm(which_clock)];
+
+ if (!alarmtimer_get_rtcdev())
+ return -EINVAL;
+
+ return base->get_ktime();
+}
+
/**
* alarm_timer_create - posix timer_create interface
* @new_timer: k_itimer pointer to manage
@@ -826,6 +842,7 @@ static int alarm_timer_nsleep(const clockid_t which_clock, int flags,
const struct k_clock alarm_clock = {
.clock_getres = alarm_clock_getres,
+ .clock_get_ktime = alarm_clock_get_ktime,
.clock_get_timespec = alarm_clock_get_timespec,
.timer_create = alarm_timer_create,
.timer_set = common_timer_set,
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index a600b8b3e9bf..fb1848c84241 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -172,6 +172,11 @@ posix_clock_realtime_get_timespec(clockid_t which_clock, struct timespec64 *tp)
return 0;
}
+static ktime_t posix_clock_realtime_get_ktime(clockid_t which_clock)
+{
+ return ktime_get_real();
+}
+
/* Set clock_realtime */
static int posix_clock_realtime_set(const clockid_t which_clock,
const struct timespec64 *tp)
@@ -194,6 +199,11 @@ int posix_get_timespec(clockid_t which_clock, struct timespec64 *tp)
return 0;
}
+static ktime_t posix_get_ktime(clockid_t which_clock)
+{
+ return ktime_get();
+}
+
/*
* Get monotonic-raw time for posix timers
*/
@@ -229,12 +239,22 @@ int posix_get_boottime_timespec(const clockid_t which_clock, struct timespec64 *
return 0;
}
+static ktime_t posix_get_boottime_ktime(const clockid_t which_clock)
+{
+ return ktime_get_boottime();
+}
+
static int posix_get_tai_timespec(clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_clocktai_ts64(tp);
return 0;
}
+static ktime_t posix_get_tai_ktime(clockid_t which_clock)
+{
+ return ktime_get_clocktai();
+}
+
static int posix_get_hrtimer_res(clockid_t which_clock, struct timespec64 *tp)
{
tp->tv_sec = 0;
@@ -782,7 +802,7 @@ static void common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
* Posix magic: Relative CLOCK_REALTIME timers are not affected by
* clock modifications, so they become CLOCK_MONOTONIC based under the
* hood. See hrtimer_init(). Update timr->kclock, so the generic
- * functions which use timr->kclock->clock_get_timespec() work.
+ * functions which use timr->kclock->clock_get_*() work.
*
* Note: it_clock stays unmodified, because the next timer_set() might
* use ABSTIME, so it needs to switch back.
@@ -1228,6 +1248,7 @@ SYSCALL_DEFINE4(clock_nanosleep_time32, clockid_t, which_clock, int, flags,
static const struct k_clock clock_realtime = {
.clock_getres = posix_get_hrtimer_res,
.clock_get_timespec = posix_clock_realtime_get_timespec,
+ .clock_get_ktime = posix_clock_realtime_get_ktime,
.clock_set = posix_clock_realtime_set,
.clock_adj = posix_clock_realtime_adj,
.nsleep = common_nsleep,
@@ -1245,6 +1266,7 @@ static const struct k_clock clock_realtime = {
static const struct k_clock clock_monotonic = {
.clock_getres = posix_get_hrtimer_res,
.clock_get_timespec = posix_get_timespec,
+ .clock_get_ktime = posix_get_ktime,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
@@ -1274,6 +1296,7 @@ static const struct k_clock clock_monotonic_coarse = {
static const struct k_clock clock_tai = {
.clock_getres = posix_get_hrtimer_res,
+ .clock_get_ktime = posix_get_tai_ktime,
.clock_get_timespec = posix_get_tai_timespec,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
@@ -1289,6 +1312,7 @@ static const struct k_clock clock_tai = {
static const struct k_clock clock_boottime = {
.clock_getres = posix_get_hrtimer_res,
+ .clock_get_ktime = posix_get_boottime_ktime,
.clock_get_timespec = posix_get_boottime_timespec,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
diff --git a/kernel/time/posix-timers.h b/kernel/time/posix-timers.h
index b3cc9ee36a6b..183994f7e466 100644
--- a/kernel/time/posix-timers.h
+++ b/kernel/time/posix-timers.h
@@ -6,8 +6,11 @@ struct k_clock {
struct timespec64 *tp);
int (*clock_set)(const clockid_t which_clock,
const struct timespec64 *tp);
+ /* Returns the clock value in the current time namespace. */
int (*clock_get_timespec)(const clockid_t which_clock,
struct timespec64 *tp);
+ /* Returns the clock value in the root time namespace. */
+ ktime_t (*clock_get_ktime)(const clockid_t which_clock);
int (*clock_adj)(const clockid_t which_clock, struct __kernel_timex *tx);
int (*timer_create)(struct k_itimer *timer);
int (*nsleep)(const clockid_t which_clock, int flags,
--
2.22.0
^ permalink raw reply related
* [PATCHv5 06/37] alarmtimer: Provide get_timespec() callback
From: Dmitry Safonov @ 2019-07-29 21:56 UTC (permalink / raw)
To: linux-kernel
Cc: Dmitry Safonov, Andrei Vagin, Dmitry Safonov, Adrian Reber,
Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api
In-Reply-To: <20190729215758.28405-1-dima@arista.com>
From: Andrei Vagin <avagin@gmail.com>
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
Wire up alarm bases with get_timespec().
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
include/linux/posix-timers.h | 3 +++
kernel/time/alarmtimer.c | 8 ++++++--
kernel/time/posix-timers.c | 4 ++--
3 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index b20798fc5191..fe13ab265213 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -127,4 +127,7 @@ void set_process_cpu_timer(struct task_struct *task, unsigned int clock_idx,
void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
void posixtimer_rearm(struct kernel_siginfo *info);
+
+int posix_get_timespec(clockid_t which_clock, struct timespec64 *tp);
+int posix_get_boottime_timespec(const clockid_t which_clock, struct timespec64 *tp);
#endif
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 37254eb64459..d9e6baa9976a 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -37,12 +37,15 @@
* @lock: Lock for syncrhonized access to the base
* @timerqueue: Timerqueue head managing the list of events
* @get_ktime: Function to read the time correlating to the base
+ * @get_timespec: Function to read the namespace time correlating to the base
* @base_clockid: clockid for the base
*/
static struct alarm_base {
spinlock_t lock;
struct timerqueue_head timerqueue;
ktime_t (*get_ktime)(void);
+ int (*get_timespec)(const clockid_t which_clock,
+ struct timespec64 *tp);
clockid_t base_clockid;
} alarm_bases[ALARM_NUMTYPE];
@@ -657,8 +660,7 @@ static int alarm_clock_get_timespec(clockid_t which_clock, struct timespec64 *tp
if (!alarmtimer_get_rtcdev())
return -EINVAL;
- *tp = ktime_to_timespec64(base->get_ktime());
- return 0;
+ return base->get_timespec(base->base_clockid, tp);
}
/**
@@ -869,8 +871,10 @@ static int __init alarmtimer_init(void)
/* Initialize alarm bases */
alarm_bases[ALARM_REALTIME].base_clockid = CLOCK_REALTIME;
alarm_bases[ALARM_REALTIME].get_ktime = &ktime_get_real;
+ alarm_bases[ALARM_REALTIME].get_timespec = posix_get_timespec,
alarm_bases[ALARM_BOOTTIME].base_clockid = CLOCK_BOOTTIME;
alarm_bases[ALARM_BOOTTIME].get_ktime = &ktime_get_boottime;
+ alarm_bases[ALARM_BOOTTIME].get_timespec = posix_get_boottime_timespec;
for (i = 0; i < ALARM_NUMTYPE; i++) {
timerqueue_init_head(&alarm_bases[i].timerqueue);
spin_lock_init(&alarm_bases[i].lock);
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index a71e43178a7d..a600b8b3e9bf 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -188,7 +188,7 @@ static int posix_clock_realtime_adj(const clockid_t which_clock,
/*
* Get monotonic time for posix timers
*/
-static int posix_get_timespec(clockid_t which_clock, struct timespec64 *tp)
+int posix_get_timespec(clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_ts64(tp);
return 0;
@@ -223,7 +223,7 @@ static int posix_get_coarse_res(const clockid_t which_clock, struct timespec64 *
return 0;
}
-static int posix_get_boottime_timespec(const clockid_t which_clock, struct timespec64 *tp)
+int posix_get_boottime_timespec(const clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_boottime_ts64(tp);
return 0;
--
2.22.0
^ permalink raw reply related
* [PATCHv5 05/37] alarmtimer: Rename gettime() callback to get_ktime()
From: Dmitry Safonov @ 2019-07-29 21:56 UTC (permalink / raw)
To: linux-kernel
Cc: Dmitry Safonov, Andrei Vagin, Dmitry Safonov, Adrian Reber,
Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api
In-Reply-To: <20190729215758.28405-1-dima@arista.com>
From: Andrei Vagin <avagin@gmail.com>
The upcoming support for time namespaces requires to have access to:
- The time in a tasks time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
struct alarm_base needs to follow the same name convention, so rename
.gettime() callback into get_ktime() as a preparation for introducing
get_timespec().
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
kernel/time/alarmtimer.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index c191254f178b..37254eb64459 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -36,13 +36,13 @@
* struct alarm_base - Alarm timer bases
* @lock: Lock for syncrhonized access to the base
* @timerqueue: Timerqueue head managing the list of events
- * @gettime: Function to read the time correlating to the base
+ * @get_ktime: Function to read the time correlating to the base
* @base_clockid: clockid for the base
*/
static struct alarm_base {
spinlock_t lock;
struct timerqueue_head timerqueue;
- ktime_t (*gettime)(void);
+ ktime_t (*get_ktime)(void);
clockid_t base_clockid;
} alarm_bases[ALARM_NUMTYPE];
@@ -207,7 +207,7 @@ static enum hrtimer_restart alarmtimer_fired(struct hrtimer *timer)
spin_unlock_irqrestore(&base->lock, flags);
if (alarm->function)
- restart = alarm->function(alarm, base->gettime());
+ restart = alarm->function(alarm, base->get_ktime());
spin_lock_irqsave(&base->lock, flags);
if (restart != ALARMTIMER_NORESTART) {
@@ -217,7 +217,7 @@ static enum hrtimer_restart alarmtimer_fired(struct hrtimer *timer)
}
spin_unlock_irqrestore(&base->lock, flags);
- trace_alarmtimer_fired(alarm, base->gettime());
+ trace_alarmtimer_fired(alarm, base->get_ktime());
return ret;
}
@@ -225,7 +225,7 @@ static enum hrtimer_restart alarmtimer_fired(struct hrtimer *timer)
ktime_t alarm_expires_remaining(const struct alarm *alarm)
{
struct alarm_base *base = &alarm_bases[alarm->type];
- return ktime_sub(alarm->node.expires, base->gettime());
+ return ktime_sub(alarm->node.expires, base->get_ktime());
}
EXPORT_SYMBOL_GPL(alarm_expires_remaining);
@@ -270,7 +270,7 @@ static int alarmtimer_suspend(struct device *dev)
spin_unlock_irqrestore(&base->lock, flags);
if (!next)
continue;
- delta = ktime_sub(next->expires, base->gettime());
+ delta = ktime_sub(next->expires, base->get_ktime());
if (!min || (delta < min)) {
expires = next->expires;
min = delta;
@@ -364,7 +364,7 @@ void alarm_start(struct alarm *alarm, ktime_t start)
hrtimer_start(&alarm->timer, alarm->node.expires, HRTIMER_MODE_ABS);
spin_unlock_irqrestore(&base->lock, flags);
- trace_alarmtimer_start(alarm, base->gettime());
+ trace_alarmtimer_start(alarm, base->get_ktime());
}
EXPORT_SYMBOL_GPL(alarm_start);
@@ -377,7 +377,7 @@ void alarm_start_relative(struct alarm *alarm, ktime_t start)
{
struct alarm_base *base = &alarm_bases[alarm->type];
- start = ktime_add_safe(start, base->gettime());
+ start = ktime_add_safe(start, base->get_ktime());
alarm_start(alarm, start);
}
EXPORT_SYMBOL_GPL(alarm_start_relative);
@@ -414,7 +414,7 @@ int alarm_try_to_cancel(struct alarm *alarm)
alarmtimer_dequeue(base, alarm);
spin_unlock_irqrestore(&base->lock, flags);
- trace_alarmtimer_cancel(alarm, base->gettime());
+ trace_alarmtimer_cancel(alarm, base->get_ktime());
return ret;
}
EXPORT_SYMBOL_GPL(alarm_try_to_cancel);
@@ -474,7 +474,7 @@ u64 alarm_forward_now(struct alarm *alarm, ktime_t interval)
{
struct alarm_base *base = &alarm_bases[alarm->type];
- return alarm_forward(alarm, base->gettime(), interval);
+ return alarm_forward(alarm, base->get_ktime(), interval);
}
EXPORT_SYMBOL_GPL(alarm_forward_now);
@@ -500,7 +500,7 @@ static void alarmtimer_freezerset(ktime_t absexp, enum alarmtimer_type type)
return;
}
- delta = ktime_sub(absexp, base->gettime());
+ delta = ktime_sub(absexp, base->get_ktime());
spin_lock_irqsave(&freezer_delta_lock, flags);
if (!freezer_delta || (delta < freezer_delta)) {
@@ -619,7 +619,7 @@ static void alarm_timer_arm(struct k_itimer *timr, ktime_t expires,
struct alarm_base *base = &alarm_bases[alarm->type];
if (!absolute)
- expires = ktime_add_safe(expires, base->gettime());
+ expires = ktime_add_safe(expires, base->get_ktime());
if (sigev_none)
alarm->node.expires = expires;
else
@@ -657,7 +657,7 @@ static int alarm_clock_get_timespec(clockid_t which_clock, struct timespec64 *tp
if (!alarmtimer_get_rtcdev())
return -EINVAL;
- *tp = ktime_to_timespec64(base->gettime());
+ *tp = ktime_to_timespec64(base->get_ktime());
return 0;
}
@@ -734,7 +734,7 @@ static int alarmtimer_do_nsleep(struct alarm *alarm, ktime_t absexp,
struct timespec64 rmt;
ktime_t rem;
- rem = ktime_sub(absexp, alarm_bases[type].gettime());
+ rem = ktime_sub(absexp, alarm_bases[type].get_ktime());
if (rem <= 0)
return 0;
@@ -803,7 +803,7 @@ static int alarm_timer_nsleep(const clockid_t which_clock, int flags,
exp = timespec64_to_ktime(*tsreq);
/* Convert (if necessary) to absolute time */
if (flags != TIMER_ABSTIME) {
- ktime_t now = alarm_bases[type].gettime();
+ ktime_t now = alarm_bases[type].get_ktime();
exp = ktime_add_safe(now, exp);
}
@@ -868,9 +868,9 @@ static int __init alarmtimer_init(void)
/* Initialize alarm bases */
alarm_bases[ALARM_REALTIME].base_clockid = CLOCK_REALTIME;
- alarm_bases[ALARM_REALTIME].gettime = &ktime_get_real;
+ alarm_bases[ALARM_REALTIME].get_ktime = &ktime_get_real;
alarm_bases[ALARM_BOOTTIME].base_clockid = CLOCK_BOOTTIME;
- alarm_bases[ALARM_BOOTTIME].gettime = &ktime_get_boottime;
+ alarm_bases[ALARM_BOOTTIME].get_ktime = &ktime_get_boottime;
for (i = 0; i < ALARM_NUMTYPE; i++) {
timerqueue_init_head(&alarm_bases[i].timerqueue);
spin_lock_init(&alarm_bases[i].lock);
--
2.22.0
^ permalink raw reply related
* [PATCHv5 04/37] posix-clocks: Rename *_clock_get() functions into *_clock_get_timespec()
From: Dmitry Safonov @ 2019-07-29 21:56 UTC (permalink / raw)
To: linux-kernel
Cc: Dmitry Safonov, Andrei Vagin, Dmitry Safonov, Adrian Reber,
Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api
In-Reply-To: <20190729215758.28405-1-dima@arista.com>
From: Andrei Vagin <avagin@gmail.com>
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format in (struct k_clock).
As a preparation ground for introducing clock_get_ktime(), the original
callback clock_get() was renamed into clock_get_timespec().
Reflect the renaming into callbacks realizations.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
kernel/time/alarmtimer.c | 6 +++---
kernel/time/posix-timers.c | 17 +++++++++--------
2 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index cada96fdc615..c191254f178b 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -644,13 +644,13 @@ static int alarm_clock_getres(const clockid_t which_clock, struct timespec64 *tp
}
/**
- * alarm_clock_get - posix clock_get_timespec interface
+ * alarm_clock_get_timespec - posix clock_get_timespec interface
* @which_clock: clockid
* @tp: timespec to fill.
*
* Provides the underlying alarm base time.
*/
-static int alarm_clock_get(clockid_t which_clock, struct timespec64 *tp)
+static int alarm_clock_get_timespec(clockid_t which_clock, struct timespec64 *tp)
{
struct alarm_base *base = &alarm_bases[clock2alarm(which_clock)];
@@ -824,7 +824,7 @@ static int alarm_timer_nsleep(const clockid_t which_clock, int flags,
const struct k_clock alarm_clock = {
.clock_getres = alarm_clock_getres,
- .clock_get_timespec = alarm_clock_get,
+ .clock_get_timespec = alarm_clock_get_timespec,
.timer_create = alarm_timer_create,
.timer_set = common_timer_set,
.timer_del = common_timer_del,
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 6c9c0f2d2f1e..a71e43178a7d 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -165,7 +165,8 @@ static inline void unlock_timer(struct k_itimer *timr, unsigned long flags)
}
/* Get clock_realtime */
-static int posix_clock_realtime_get(clockid_t which_clock, struct timespec64 *tp)
+static int
+posix_clock_realtime_get_timespec(clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_real_ts64(tp);
return 0;
@@ -187,7 +188,7 @@ static int posix_clock_realtime_adj(const clockid_t which_clock,
/*
* Get monotonic time for posix timers
*/
-static int posix_ktime_get_ts(clockid_t which_clock, struct timespec64 *tp)
+static int posix_get_timespec(clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_ts64(tp);
return 0;
@@ -222,13 +223,13 @@ static int posix_get_coarse_res(const clockid_t which_clock, struct timespec64 *
return 0;
}
-static int posix_get_boottime(const clockid_t which_clock, struct timespec64 *tp)
+static int posix_get_boottime_timespec(const clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_boottime_ts64(tp);
return 0;
}
-static int posix_get_tai(clockid_t which_clock, struct timespec64 *tp)
+static int posix_get_tai_timespec(clockid_t which_clock, struct timespec64 *tp)
{
ktime_get_clocktai_ts64(tp);
return 0;
@@ -1226,7 +1227,7 @@ SYSCALL_DEFINE4(clock_nanosleep_time32, clockid_t, which_clock, int, flags,
static const struct k_clock clock_realtime = {
.clock_getres = posix_get_hrtimer_res,
- .clock_get_timespec = posix_clock_realtime_get,
+ .clock_get_timespec = posix_clock_realtime_get_timespec,
.clock_set = posix_clock_realtime_set,
.clock_adj = posix_clock_realtime_adj,
.nsleep = common_nsleep,
@@ -1243,7 +1244,7 @@ static const struct k_clock clock_realtime = {
static const struct k_clock clock_monotonic = {
.clock_getres = posix_get_hrtimer_res,
- .clock_get_timespec = posix_ktime_get_ts,
+ .clock_get_timespec = posix_get_timespec,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
@@ -1273,7 +1274,7 @@ static const struct k_clock clock_monotonic_coarse = {
static const struct k_clock clock_tai = {
.clock_getres = posix_get_hrtimer_res,
- .clock_get_timespec = posix_get_tai,
+ .clock_get_timespec = posix_get_tai_timespec,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
@@ -1288,7 +1289,7 @@ static const struct k_clock clock_tai = {
static const struct k_clock clock_boottime = {
.clock_getres = posix_get_hrtimer_res,
- .clock_get_timespec = posix_get_boottime,
+ .clock_get_timespec = posix_get_boottime_timespec,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
--
2.22.0
^ permalink raw reply related
* [PATCHv5 03/37] posix-clocks: Rename the clock_get() into clock_get_timespec()
From: Dmitry Safonov @ 2019-07-29 21:56 UTC (permalink / raw)
To: linux-kernel
Cc: Dmitry Safonov, Andrei Vagin, Dmitry Safonov, Adrian Reber,
Andrei Vagin, Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api
In-Reply-To: <20190729215758.28405-1-dima@arista.com>
From: Andrei Vagin <avagin@gmail.com>
The upcoming support for time namespaces requires to have access to:
- The time in a task's time namespace for sys_clock_gettime()
- The time in the root name space for common_timer_get()
That adds a valid reason to finally implement a separate callback which
returns the time in ktime_t format, rather than in (struct timespec).
Rename clock_get() callback into clock_get_timespec() as a preparation
for introducing clock_get_ktime().
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
kernel/time/alarmtimer.c | 4 ++--
kernel/time/posix-clock.c | 8 ++++----
kernel/time/posix-cpu-timers.c | 32 ++++++++++++++++----------------
kernel/time/posix-timers.c | 22 +++++++++++-----------
kernel/time/posix-timers.h | 4 ++--
5 files changed, 35 insertions(+), 35 deletions(-)
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 57518efc3810..cada96fdc615 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -644,7 +644,7 @@ static int alarm_clock_getres(const clockid_t which_clock, struct timespec64 *tp
}
/**
- * alarm_clock_get - posix clock_get interface
+ * alarm_clock_get - posix clock_get_timespec interface
* @which_clock: clockid
* @tp: timespec to fill.
*
@@ -824,7 +824,7 @@ static int alarm_timer_nsleep(const clockid_t which_clock, int flags,
const struct k_clock alarm_clock = {
.clock_getres = alarm_clock_getres,
- .clock_get = alarm_clock_get,
+ .clock_get_timespec = alarm_clock_get,
.timer_create = alarm_timer_create,
.timer_set = common_timer_set,
.timer_del = common_timer_del,
diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c
index ec960bb939fd..c8f9c9b1cd82 100644
--- a/kernel/time/posix-clock.c
+++ b/kernel/time/posix-clock.c
@@ -315,8 +315,8 @@ static int pc_clock_settime(clockid_t id, const struct timespec64 *ts)
}
const struct k_clock clock_posix_dynamic = {
- .clock_getres = pc_clock_getres,
- .clock_set = pc_clock_settime,
- .clock_get = pc_clock_gettime,
- .clock_adj = pc_clock_adjtime,
+ .clock_getres = pc_clock_getres,
+ .clock_set = pc_clock_settime,
+ .clock_get_timespec = pc_clock_gettime,
+ .clock_adj = pc_clock_adjtime,
};
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 0a426f4e3125..dccf7dfcd36a 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -1417,26 +1417,26 @@ static int thread_cpu_timer_create(struct k_itimer *timer)
}
const struct k_clock clock_posix_cpu = {
- .clock_getres = posix_cpu_clock_getres,
- .clock_set = posix_cpu_clock_set,
- .clock_get = posix_cpu_clock_get,
- .timer_create = posix_cpu_timer_create,
- .nsleep = posix_cpu_nsleep,
- .timer_set = posix_cpu_timer_set,
- .timer_del = posix_cpu_timer_del,
- .timer_get = posix_cpu_timer_get,
- .timer_rearm = posix_cpu_timer_rearm,
+ .clock_getres = posix_cpu_clock_getres,
+ .clock_set = posix_cpu_clock_set,
+ .clock_get_timespec = posix_cpu_clock_get,
+ .timer_create = posix_cpu_timer_create,
+ .nsleep = posix_cpu_nsleep,
+ .timer_set = posix_cpu_timer_set,
+ .timer_del = posix_cpu_timer_del,
+ .timer_get = posix_cpu_timer_get,
+ .timer_rearm = posix_cpu_timer_rearm,
};
const struct k_clock clock_process = {
- .clock_getres = process_cpu_clock_getres,
- .clock_get = process_cpu_clock_get,
- .timer_create = process_cpu_timer_create,
- .nsleep = process_cpu_nsleep,
+ .clock_getres = process_cpu_clock_getres,
+ .clock_get_timespec = process_cpu_clock_get,
+ .timer_create = process_cpu_timer_create,
+ .nsleep = process_cpu_nsleep,
};
const struct k_clock clock_thread = {
- .clock_getres = thread_cpu_clock_getres,
- .clock_get = thread_cpu_clock_get,
- .timer_create = thread_cpu_timer_create,
+ .clock_getres = thread_cpu_clock_getres,
+ .clock_get_timespec = thread_cpu_clock_get,
+ .timer_create = thread_cpu_timer_create,
};
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index d7f2d91acdac..6c9c0f2d2f1e 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -667,7 +667,7 @@ void common_timer_get(struct k_itimer *timr, struct itimerspec64 *cur_setting)
* The timespec64 based conversion is suboptimal, but it's not
* worth to implement yet another callback.
*/
- kc->clock_get(timr->it_clock, &ts64);
+ kc->clock_get_timespec(timr->it_clock, &ts64);
now = timespec64_to_ktime(ts64);
/*
@@ -781,7 +781,7 @@ static void common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
* Posix magic: Relative CLOCK_REALTIME timers are not affected by
* clock modifications, so they become CLOCK_MONOTONIC based under the
* hood. See hrtimer_init(). Update timr->kclock, so the generic
- * functions which use timr->kclock->clock_get() work.
+ * functions which use timr->kclock->clock_get_timespec() work.
*
* Note: it_clock stays unmodified, because the next timer_set() might
* use ABSTIME, so it needs to switch back.
@@ -1032,7 +1032,7 @@ SYSCALL_DEFINE2(clock_gettime, const clockid_t, which_clock,
if (!kc)
return -EINVAL;
- error = kc->clock_get(which_clock, &kernel_tp);
+ error = kc->clock_get_timespec(which_clock, &kernel_tp);
if (!error && put_timespec64(&kernel_tp, tp))
error = -EFAULT;
@@ -1114,7 +1114,7 @@ SYSCALL_DEFINE2(clock_gettime32, clockid_t, which_clock,
if (!kc)
return -EINVAL;
- err = kc->clock_get(which_clock, &ts);
+ err = kc->clock_get_timespec(which_clock, &ts);
if (!err && put_old_timespec32(&ts, tp))
err = -EFAULT;
@@ -1226,7 +1226,7 @@ SYSCALL_DEFINE4(clock_nanosleep_time32, clockid_t, which_clock, int, flags,
static const struct k_clock clock_realtime = {
.clock_getres = posix_get_hrtimer_res,
- .clock_get = posix_clock_realtime_get,
+ .clock_get_timespec = posix_clock_realtime_get,
.clock_set = posix_clock_realtime_set,
.clock_adj = posix_clock_realtime_adj,
.nsleep = common_nsleep,
@@ -1243,7 +1243,7 @@ static const struct k_clock clock_realtime = {
static const struct k_clock clock_monotonic = {
.clock_getres = posix_get_hrtimer_res,
- .clock_get = posix_ktime_get_ts,
+ .clock_get_timespec = posix_ktime_get_ts,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
@@ -1258,22 +1258,22 @@ static const struct k_clock clock_monotonic = {
static const struct k_clock clock_monotonic_raw = {
.clock_getres = posix_get_hrtimer_res,
- .clock_get = posix_get_monotonic_raw,
+ .clock_get_timespec = posix_get_monotonic_raw,
};
static const struct k_clock clock_realtime_coarse = {
.clock_getres = posix_get_coarse_res,
- .clock_get = posix_get_realtime_coarse,
+ .clock_get_timespec = posix_get_realtime_coarse,
};
static const struct k_clock clock_monotonic_coarse = {
.clock_getres = posix_get_coarse_res,
- .clock_get = posix_get_monotonic_coarse,
+ .clock_get_timespec = posix_get_monotonic_coarse,
};
static const struct k_clock clock_tai = {
.clock_getres = posix_get_hrtimer_res,
- .clock_get = posix_get_tai,
+ .clock_get_timespec = posix_get_tai,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
@@ -1288,7 +1288,7 @@ static const struct k_clock clock_tai = {
static const struct k_clock clock_boottime = {
.clock_getres = posix_get_hrtimer_res,
- .clock_get = posix_get_boottime,
+ .clock_get_timespec = posix_get_boottime,
.nsleep = common_nsleep,
.timer_create = common_timer_create,
.timer_set = common_timer_set,
diff --git a/kernel/time/posix-timers.h b/kernel/time/posix-timers.h
index de5daa6d975a..b3cc9ee36a6b 100644
--- a/kernel/time/posix-timers.h
+++ b/kernel/time/posix-timers.h
@@ -6,8 +6,8 @@ struct k_clock {
struct timespec64 *tp);
int (*clock_set)(const clockid_t which_clock,
const struct timespec64 *tp);
- int (*clock_get)(const clockid_t which_clock,
- struct timespec64 *tp);
+ int (*clock_get_timespec)(const clockid_t which_clock,
+ struct timespec64 *tp);
int (*clock_adj)(const clockid_t which_clock, struct __kernel_timex *tx);
int (*timer_create)(struct k_itimer *timer);
int (*nsleep)(const clockid_t which_clock, int flags,
--
2.22.0
^ permalink raw reply related
* [PATCHv5 02/37] timens: Add timens_offsets
From: Dmitry Safonov @ 2019-07-29 21:56 UTC (permalink / raw)
To: linux-kernel
Cc: Dmitry Safonov, Andrei Vagin, Dmitry Safonov, Adrian Reber,
Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api,
x86
In-Reply-To: <20190729215758.28405-1-dima@arista.com>
From: Andrei Vagin <avagin@openvz.org>
Introduce offsets for time namespace. They will contain an adjustment
needed to convert clocks to/from host's.
Allocate one page for each time namespace that will be premapped into
userspace among vvar pages.
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
MAINTAINERS | 1 +
include/linux/time_namespace.h | 1 +
include/linux/timens_offsets.h | 8 ++++++++
kernel/time_namespace.c | 14 ++++++++++++--
4 files changed, 22 insertions(+), 2 deletions(-)
create mode 100644 include/linux/timens_offsets.h
diff --git a/MAINTAINERS b/MAINTAINERS
index ce6ff88f7efd..b58c49d331eb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12837,6 +12837,7 @@ S: Maintained
F: fs/timerfd.c
F: include/linux/timer*
F: include/linux/time_namespace.h
+F: include/linux/timens_offsets.h
F: kernel/time_namespace.c
F: kernel/time/*timer*
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 9507ed7072fe..b6985aa87479 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -8,6 +8,7 @@
#include <linux/nsproxy.h>
#include <linux/ns_common.h>
#include <linux/err.h>
+#include <linux/timens_offsets.h>
struct user_namespace;
extern struct user_namespace init_user_ns;
diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
new file mode 100644
index 000000000000..7d7cb68ea778
--- /dev/null
+++ b/include/linux/timens_offsets.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TIME_OFFSETS_H
+#define _LINUX_TIME_OFFSETS_H
+
+struct timens_offsets {
+};
+
+#endif
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 7ee6be0647bc..f849c59f1108 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -14,6 +14,7 @@
#include <linux/slab.h>
#include <linux/cred.h>
#include <linux/err.h>
+#include <linux/mm.h>
static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
{
@@ -47,6 +48,7 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
{
struct time_namespace *ns;
struct ucounts *ucounts;
+ struct page *page;
int err;
err = -ENOSPC;
@@ -59,15 +61,22 @@ static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
if (!ns)
goto fail_dec;
+ page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ goto fail_free;
+ ns->offsets = page_address(page);
+ BUILD_BUG_ON(sizeof(*ns->offsets) > PAGE_SIZE);
+
err = ns_alloc_inum(&ns->ns);
if (err)
- goto fail_free;
+ goto fail_page;
ns->ucounts = ucounts;
ns->ns.ops = &timens_operations;
ns->user_ns = get_user_ns(user_ns);
return ns;
-
+fail_page:
+ free_page((unsigned long)ns->offsets);
fail_free:
kfree(ns);
fail_dec:
@@ -95,6 +104,7 @@ void free_time_ns(struct kref *kref)
struct time_namespace *ns;
ns = container_of(kref, struct time_namespace, kref);
+ free_page((unsigned long)ns->offsets);
dec_time_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
--
2.22.0
^ permalink raw reply related
* [PATCHv5 01/37] ns: Introduce Time Namespace
From: Dmitry Safonov @ 2019-07-29 21:56 UTC (permalink / raw)
To: linux-kernel
Cc: Dmitry Safonov, Andrei Vagin, Dmitry Safonov, Adrian Reber,
Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api,
x86
In-Reply-To: <20190729215758.28405-1-dima@arista.com>
From: Andrei Vagin <avagin@openvz.org>
Time Namespace isolates clock values.
The kernel provides access to several clocks CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
CLOCK_REALTIME
System-wide clock that measures real (i.e., wall-clock) time.
CLOCK_MONOTONIC
Clock that cannot be set and represents monotonic time since
some unspecified starting point.
CLOCK_BOOTTIME
Identical to CLOCK_MONOTONIC, except it also includes any time
that the system is suspended.
For many users, the time namespace means the ability to changes date and
time in a container (CLOCK_REALTIME).
But in a context of the checkpoint/restore functionality, monotonic and
bootime clocks become interesting. Both clocks are monotonic with
unspecified staring points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, we have to
guarantee that they never go backward. In an ideal case, the behavior of
these clocks should be the same as for a case when a whole system is
suspended. All this means that we need to be able to set CLOCK_MONOTONIC
and CLOCK_BOOTTIME clocks, what can be done by adding per-namespace
offsets for clocks.
A time namespace is similar to a pid namespace in a way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of
the process will be born in the new time namespace, or a process can
use the setns() system call to join a namespace.
This scheme allows setting clock offsets for a namespace, before any
processes appear in it.
All available clone flags have been used, so CLONE_NEWTIME uses the
highest bit of CSIGNAL. It means that we can use it with the unshare
system call only. Rith now, this works for us, because time namespace
offsets can be set only when a new time namespace is not populated. In a
future, we will have the clone3 system call [1] which will allow to use
the CSIGNAL mask for clone flags.
[1]: httmps://lkml.kernel.org/r/20190604160944.4058-1-christian@brauner.io
Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
MAINTAINERS | 2 +
fs/proc/namespaces.c | 4 +
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 3 +
include/linux/time_namespace.h | 69 +++++++++++
include/linux/user_namespace.h | 1 +
include/uapi/linux/sched.h | 6 +
init/Kconfig | 7 ++
kernel/Makefile | 1 +
kernel/fork.c | 16 ++-
kernel/nsproxy.c | 41 +++++--
kernel/time_namespace.c | 217 +++++++++++++++++++++++++++++++++
12 files changed, 359 insertions(+), 10 deletions(-)
create mode 100644 include/linux/time_namespace.h
create mode 100644 kernel/time_namespace.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 6426db5198f0..ce6ff88f7efd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12836,6 +12836,8 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
S: Maintained
F: fs/timerfd.c
F: include/linux/timer*
+F: include/linux/time_namespace.h
+F: kernel/time_namespace.c
F: kernel/time/*timer*
POWER MANAGEMENT CORE
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..8b5c720fe5d7 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -33,6 +33,10 @@ static const struct proc_ns_operations *ns_entries[] = {
#ifdef CONFIG_CGROUPS
&cgroupns_operations,
#endif
+#ifdef CONFIG_TIME_NS
+ &timens_operations,
+ &timens_for_children_operations,
+#endif
};
static const char *proc_ns_get_link(struct dentry *dentry,
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 2ae1b1a4d84d..074f395b9ad2 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -35,6 +35,8 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
+ struct time_namespace *time_ns;
+ struct time_namespace *time_ns_for_children;
struct cgroup_namespace *cgroup_ns;
};
extern struct nsproxy init_nsproxy;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..d312e6281e69 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -32,6 +32,8 @@ extern const struct proc_ns_operations pidns_for_children_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
extern const struct proc_ns_operations cgroupns_operations;
+extern const struct proc_ns_operations timens_operations;
+extern const struct proc_ns_operations timens_for_children_operations;
/*
* We always define these enumerators
@@ -43,6 +45,7 @@ enum {
PROC_USER_INIT_INO = 0xEFFFFFFDU,
PROC_PID_INIT_INO = 0xEFFFFFFCU,
PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
+ PROC_TIME_INIT_INO = 0xEFFFFFFAU,
};
#ifdef CONFIG_PROC_FS
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
new file mode 100644
index 000000000000..9507ed7072fe
--- /dev/null
+++ b/include/linux/time_namespace.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TIMENS_H
+#define _LINUX_TIMENS_H
+
+
+#include <linux/sched.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/ns_common.h>
+#include <linux/err.h>
+
+struct user_namespace;
+extern struct user_namespace init_user_ns;
+
+struct time_namespace {
+ struct kref kref;
+ struct user_namespace *user_ns;
+ struct ucounts *ucounts;
+ struct ns_common ns;
+ struct timens_offsets *offsets;
+ bool initialized;
+} __randomize_layout;
+extern struct time_namespace init_time_ns;
+
+#ifdef CONFIG_TIME_NS
+static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
+{
+ kref_get(&ns->kref);
+ return ns;
+}
+
+extern struct time_namespace *copy_time_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct time_namespace *old_ns);
+extern void free_time_ns(struct kref *kref);
+extern int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk);
+
+static inline void put_time_ns(struct time_namespace *ns)
+{
+ kref_put(&ns->kref, free_time_ns);
+}
+
+
+#else
+static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
+{
+ return NULL;
+}
+
+static inline void put_time_ns(struct time_namespace *ns)
+{
+}
+
+static inline struct time_namespace *copy_time_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct time_namespace *old_ns)
+{
+ if (flags & CLONE_NEWTIME)
+ return ERR_PTR(-EINVAL);
+
+ return old_ns;
+}
+
+static inline int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
+{
+ return 0;
+}
+
+#endif
+
+#endif /* _LINUX_TIMENS_H */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index fb9f4f799554..6ef1c7109fc4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -45,6 +45,7 @@ enum ucount_type {
UCOUNT_NET_NAMESPACES,
UCOUNT_MNT_NAMESPACES,
UCOUNT_CGROUP_NAMESPACES,
+ UCOUNT_TIME_NAMESPACES,
#ifdef CONFIG_INOTIFY_USER
UCOUNT_INOTIFY_INSTANCES,
UCOUNT_INOTIFY_WATCHES,
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index b3105ac1381a..551739227e96 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -33,6 +33,12 @@
#define CLONE_NEWNET 0x40000000 /* New network namespace */
#define CLONE_IO 0x80000000 /* Clone io context */
+/*
+ * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
+ * syscalls only:
+ */
+#define CLONE_NEWTIME 0x00000080 /* New time namespace */
+
/*
* Arguments for the clone3 syscall
*/
diff --git a/init/Kconfig b/init/Kconfig
index bd7d650d4a99..a7cbc9b470c7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1069,6 +1069,13 @@ config UTS_NS
In this namespace tasks see different info provided with the
uname() system call
+config TIME_NS
+ bool "TIME namespace"
+ default y
+ help
+ In this namespace boottime and monotonic clocks can be set.
+ The time will keep going with the same pace.
+
config IPC_NS
bool "IPC namespace"
depends on (SYSVIPC || POSIX_MQUEUE)
diff --git a/kernel/Makefile b/kernel/Makefile
index a8d923b5481b..a8654c9aa2f5 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -68,6 +68,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup/
obj-$(CONFIG_UTS_NS) += utsname.o
+obj-$(CONFIG_TIME_NS) += time_namespace.o
obj-$(CONFIG_USER_NS) += user_namespace.o
obj-$(CONFIG_PID_NS) += pid_namespace.o
obj-$(CONFIG_IKCONFIG) += configs.o
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..3dbb2d33dc52 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1777,6 +1777,7 @@ static __latent_entropy struct task_struct *copy_process(
struct multiprocess_signals delayed;
struct file *pidfile = NULL;
u64 clone_flags = args->flags;
+ struct nsproxy *nsp = current->nsproxy;
/*
* Don't allow sharing the root directory with processes in a different
@@ -1819,8 +1820,16 @@ static __latent_entropy struct task_struct *copy_process(
*/
if (clone_flags & CLONE_THREAD) {
if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
- (task_active_pid_ns(current) !=
- current->nsproxy->pid_ns_for_children))
+ (task_active_pid_ns(current) != nsp->pid_ns_for_children))
+ return ERR_PTR(-EINVAL);
+ }
+
+ /*
+ * If the new process will be in a different time namespace
+ * do not allow it to share VM or a thread group with the forking task.
+ */
+ if (clone_flags & (CLONE_THREAD | CLONE_VM)) {
+ if (nsp->time_ns != nsp->time_ns_for_children)
return ERR_PTR(-EINVAL);
}
@@ -2707,7 +2716,8 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP|
+ CLONE_NEWTIME))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index c815f58e6bc0..ed9882108cd2 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -18,6 +18,7 @@
#include <linux/pid_namespace.h>
#include <net/net_namespace.h>
#include <linux/ipc_namespace.h>
+#include <linux/time_namespace.h>
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
@@ -40,6 +41,10 @@ struct nsproxy init_nsproxy = {
#ifdef CONFIG_CGROUPS
.cgroup_ns = &init_cgroup_ns,
#endif
+#ifdef CONFIG_TIME_NS
+ .time_ns = &init_time_ns,
+ .time_ns_for_children = &init_time_ns,
+#endif
};
static inline struct nsproxy *create_nsproxy(void)
@@ -106,8 +111,18 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_net;
}
+ new_nsp->time_ns_for_children = copy_time_ns(flags, user_ns,
+ tsk->nsproxy->time_ns_for_children);
+ if (IS_ERR(new_nsp->time_ns_for_children)) {
+ err = PTR_ERR(new_nsp->time_ns_for_children);
+ goto out_time;
+ }
+ new_nsp->time_ns = get_time_ns(tsk->nsproxy->time_ns);
+
return new_nsp;
+out_time:
+ put_net(new_nsp->net_ns);
out_net:
put_cgroup_ns(new_nsp->cgroup_ns);
out_cgroup:
@@ -136,15 +151,16 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *old_ns = tsk->nsproxy;
struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
struct nsproxy *new_ns;
+ int ret;
if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
- CLONE_NEWCGROUP)))) {
- get_nsproxy(old_ns);
- return 0;
- }
-
- if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ CLONE_NEWCGROUP | CLONE_NEWTIME)))) {
+ if (likely(old_ns->time_ns_for_children == old_ns->time_ns)) {
+ get_nsproxy(old_ns);
+ return 0;
+ }
+ } else if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;
/*
@@ -162,6 +178,12 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (IS_ERR(new_ns))
return PTR_ERR(new_ns);
+ ret = timens_on_fork(new_ns, tsk);
+ if (ret) {
+ free_nsproxy(new_ns);
+ return ret;
+ }
+
tsk->nsproxy = new_ns;
return 0;
}
@@ -176,6 +198,10 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns_for_children)
put_pid_ns(ns->pid_ns_for_children);
+ if (ns->time_ns)
+ put_time_ns(ns->time_ns);
+ if (ns->time_ns_for_children)
+ put_time_ns(ns->time_ns_for_children);
put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
kmem_cache_free(nsproxy_cachep, ns);
@@ -192,7 +218,8 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
int err = 0;
if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
+ CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP |
+ CLONE_NEWTIME)))
return 0;
user_ns = new_cred ? new_cred->user_ns : current_user_ns();
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
new file mode 100644
index 000000000000..7ee6be0647bc
--- /dev/null
+++ b/kernel/time_namespace.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Author: Andrei Vagin <avagin@openvz.org>
+ * Author: Dmitry Safonov <dima@arista.com>
+ */
+
+#include <linux/time_namespace.h>
+#include <linux/user_namespace.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/proc_ns.h>
+#include <linux/export.h>
+#include <linux/time.h>
+#include <linux/slab.h>
+#include <linux/cred.h>
+#include <linux/err.h>
+
+static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
+{
+ return inc_ucount(ns, current_euid(), UCOUNT_TIME_NAMESPACES);
+}
+
+static void dec_time_namespaces(struct ucounts *ucounts)
+{
+ dec_ucount(ucounts, UCOUNT_TIME_NAMESPACES);
+}
+
+static struct time_namespace *create_time_ns(void)
+{
+ struct time_namespace *time_ns;
+
+ time_ns = kmalloc(sizeof(struct time_namespace), GFP_KERNEL);
+ if (time_ns) {
+ kref_init(&time_ns->kref);
+ time_ns->initialized = false;
+ }
+ return time_ns;
+}
+
+/*
+ * Clone a new ns copying @old_ns, setting refcount to 1
+ * @old_ns: namespace to clone
+ * Return the new ns or ERR_PTR.
+ */
+static struct time_namespace *clone_time_ns(struct user_namespace *user_ns,
+ struct time_namespace *old_ns)
+{
+ struct time_namespace *ns;
+ struct ucounts *ucounts;
+ int err;
+
+ err = -ENOSPC;
+ ucounts = inc_time_namespaces(user_ns);
+ if (!ucounts)
+ goto fail;
+
+ err = -ENOMEM;
+ ns = create_time_ns();
+ if (!ns)
+ goto fail_dec;
+
+ err = ns_alloc_inum(&ns->ns);
+ if (err)
+ goto fail_free;
+
+ ns->ucounts = ucounts;
+ ns->ns.ops = &timens_operations;
+ ns->user_ns = get_user_ns(user_ns);
+ return ns;
+
+fail_free:
+ kfree(ns);
+fail_dec:
+ dec_time_namespaces(ucounts);
+fail:
+ return ERR_PTR(err);
+}
+
+/*
+ * Add a reference to old_ns, or clone it if @flags specify CLONE_NEWTIME.
+ * In latter case, changes to the time of this process won't be seen by parent,
+ * and vice versa.
+ */
+struct time_namespace *copy_time_ns(unsigned long flags,
+ struct user_namespace *user_ns, struct time_namespace *old_ns)
+{
+ if (!(flags & CLONE_NEWTIME))
+ return get_time_ns(old_ns);
+
+ return clone_time_ns(user_ns, old_ns);
+}
+
+void free_time_ns(struct kref *kref)
+{
+ struct time_namespace *ns;
+
+ ns = container_of(kref, struct time_namespace, kref);
+ dec_time_namespaces(ns->ucounts);
+ put_user_ns(ns->user_ns);
+ ns_free_inum(&ns->ns);
+ kfree(ns);
+}
+
+static struct time_namespace *to_time_ns(struct ns_common *ns)
+{
+ return container_of(ns, struct time_namespace, ns);
+}
+
+static struct ns_common *timens_get(struct task_struct *task)
+{
+ struct time_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ task_lock(task);
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->time_ns;
+ get_time_ns(ns);
+ }
+ task_unlock(task);
+
+ return ns ? &ns->ns : NULL;
+}
+
+static struct ns_common *timens_for_children_get(struct task_struct *task)
+{
+ struct time_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ task_lock(task);
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->time_ns_for_children;
+ get_time_ns(ns);
+ }
+ task_unlock(task);
+
+ return ns ? &ns->ns : NULL;
+}
+
+static void timens_put(struct ns_common *ns)
+{
+ put_time_ns(to_time_ns(ns));
+}
+
+static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
+{
+ struct time_namespace *ns = to_time_ns(new);
+
+ if (!thread_group_empty(current))
+ return -EINVAL;
+
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
+ !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
+ return -EPERM;
+
+ get_time_ns(ns);
+ get_time_ns(ns);
+ put_time_ns(nsproxy->time_ns);
+ put_time_ns(nsproxy->time_ns_for_children);
+ nsproxy->time_ns = ns;
+ nsproxy->time_ns_for_children = ns;
+ ns->initialized = true;
+ return 0;
+}
+
+int timens_on_fork(struct nsproxy *nsproxy, struct task_struct *tsk)
+{
+ struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
+ struct time_namespace *ns = to_time_ns(nsc);
+
+ if (nsproxy->time_ns == nsproxy->time_ns_for_children)
+ return 0;
+
+ get_time_ns(ns);
+ put_time_ns(nsproxy->time_ns);
+ nsproxy->time_ns = ns;
+ ns->initialized = true;
+
+ return 0;
+}
+
+static struct user_namespace *timens_owner(struct ns_common *ns)
+{
+ return to_time_ns(ns)->user_ns;
+}
+
+const struct proc_ns_operations timens_operations = {
+ .name = "time",
+ .type = CLONE_NEWTIME,
+ .get = timens_get,
+ .put = timens_put,
+ .install = timens_install,
+ .owner = timens_owner,
+};
+
+const struct proc_ns_operations timens_for_children_operations = {
+ .name = "time_for_children",
+ .type = CLONE_NEWTIME,
+ .get = timens_for_children_get,
+ .put = timens_put,
+ .install = timens_install,
+ .owner = timens_owner,
+};
+
+struct time_namespace init_time_ns = {
+ .kref = KREF_INIT(3),
+ .user_ns = &init_user_ns,
+ .ns.inum = PROC_TIME_INIT_INO,
+ .ns.ops = &timens_operations,
+};
+
+static int __init time_ns_init(void)
+{
+ return 0;
+}
+subsys_initcall(time_ns_init);
--
2.22.0
^ permalink raw reply related
* [PATCHv5 00/37] kernel: Introduce Time Namespace
From: Dmitry Safonov @ 2019-07-29 21:56 UTC (permalink / raw)
To: linux-kernel
Cc: Dmitry Safonov, Dmitry Safonov, Adrian Reber, Andrei Vagin,
Andy Lutomirski, Arnd Bergmann, Christian Brauner,
Cyrill Gorcunov, Eric W. Biederman, H. Peter Anvin, Ingo Molnar,
Jann Horn, Jeff Dike, Oleg Nesterov, Pavel Emelyanov, Shuah Khan,
Thomas Gleixner, Vincenzo Frascino, containers, criu, linux-api,
x86
Discussions around time namespace are there for a long time. The first
attempt to implement it was in 2006 by Jeff Dike. From that time, the
topic appears on and off in various discussions.
There are two main use cases for time namespaces:
1. change date and time inside a container;
2. adjust clocks for a container restored from a checkpoint.
“It seems like this might be one of the last major obstacles keeping
migration from being used in production systems, given that not all
containers and connections can be migrated as long as a time dependency
is capable of messing it up.” (by github.com/dav-ell)
The kernel provides access to several clocks: CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME. Last two clocks are monotonous, but the
start points for them are not defined and are different for each
system. When a container is migrated from one node to another, all
clocks have to be restored into consistent states; in other words, they
have to continue running from the same points where they have been
dumped.
The main idea of this patch set is adding per-namespace offsets for
system clocks. When a process in a non-root time namespace requests
time of a clock, a namespace offset is added to the current value of
this clock and the sum is returned.
All offsets are placed on a separate page, this allows us to map it as
part of VVAR into user processes and use offsets from VDSO calls.
Now offsets are implemented for CLOCK_MONOTONIC and CLOCK_BOOTTIME
clocks.
v4..v5 Changes:
* Rebased over generic vdso (already in master)
* Addressing review comments by Thomas Gleixner
(thanks much for your time and patience):
- Dropping `timens` prefix from subjects (it's not a subsystem)
- Keeping commit messages in a neutral technical form
- Splitting unreasonably large patches
- Document code with missing comments
- Dropped dead code that's not compiled with !CONFIG_TIME_NS
* Updated performance results [here, at the bottom]
* Split vdso jump tables patch
* Allow unshare() with many threads: it's safe until fork()/clone(),
where we check for CLONE_THREADS
* Add missed check in setns() for CLONE_VM | CLONE_THREADS
* Fixed compilation with !CONFIG_UTS_NS
* Add a plan in selftests (prevents new warning "Planned tests != run tests")
* Set jump table section address & size to (-1UL) just in case if there
is no such section while running vdso2c (and WARN() on boot in such
case)
[v1..v4 Changelog is at the very bottom here]
Our performance measurements show that the price of VDSO's clock_gettime()
in a child time namespace is about 8% with a hot CPU cache and about 90%
with a cold CPU cache. There is no performance regression for host
processes outside time namespace on those tests.
We wrote two small benchmarks. The first one gettime_perf.c calls
clock_gettime() in a loop for 3 seconds. It shows us performance with
a hot CPU cache (more clock_gettime() cycles - the better):
| before | CONFIG_TIME_NS=n | host | inside timens
--------------------------------------------------------------
| 161822960 | 161147792 | 160187142 | 150584782
| 161891728 | 161489804 | 159914989 | 150417019
| 161891770 | 161098734 | 160123179 | 150601277
| 161687686 | 161114738 | 159874249 | 150243276
| 161247151 | 161411636 | 159096392 | 149637536
--------------------------------------------------------------
avg | 161708259 | 161252540 | 159839190 | 150296778
diff % | 100 | 99.7 | 98.8 | 92.9
-------------------------------------------------------------
stdev % | 0.2 | 0.1 | 0.1 | 0.2
The gettime_perf_cold test does 10K iterations. In each iteration, it
drops cpu caches for vdso pages, clflush is used for this, then it runs
rdtsc(); clock_gettime; rdtsc(); and prints the number of tsc cycles.
(lesser tsc per cycle - the better):
| before | CONFIG_TIME_NS=n | host | inside timens
--------------------------------------------------------------
tsc | 434 | 433 | 437 | 477
stdev(tsc) | 5 | 5 | 5 | 3
diff (%) | 1 | 1 | 100.1 | 109
vdsotest results: https://gist.github.com/avagin/f290afb8b721ae0522a561d585f34de0
The numbers gathered on Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz.
Cc: Adrian Reber <adrian@lisas.de>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: containers@lists.linux-foundation.org
Cc: criu@openvz.org
Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
v5 on github (if someone prefers `git pull` or `git log`):
https://github.com/0x7f454c46/linux/tree/timens-v5
v4: https://lkml.kernel.org/r/20190612192628.23797-1-dima@arista.com
v3: https://lkml.kernel.org/r/20190425161416.26600-1-dima@arista.com
v2: https://lore.kernel.org/lkml/20190206001107.16488-1-dima@arista.com/
RFC: https://lkml.kernel.org/r/20180919205037.9574-1-dima@arista.com/
v3..v4 Changes:
* CLOCKE_NEWTIME is unshare()-only flag now (CLON_PIDFD took previous value)
* Addressing Jann Horn's feedback - we don't allow CLONE_THREAD or
CLONE_VM together with CLONE_NEWTIME (thanks for spotting!)
* Addressing issues found by Thomas - removed unmaintainable CLOCK_TIMENS
and introduced another call back into k_clock to get ktime instead
of getting timespec and converting it (Patch 03)
* Renaming timens_offsets members to omit _offset postfix
(thanks Cyrill for the suggestion)
* Suggestions, renaming and making code more maintainable from Thomas's
feedback (thanks much!)
* Fixing out-of-bounds and other issues in procfs file (kudos Jann Horn)
* vdso_fault() can be called on a remote task by /proc/$pid/mem or
process_vm_readv() - addressed by adding a slow-path with searching
for owner's namespace (thanks for spotting this unobvious issue, Jann)
* Other nits by Jann Horn
v2..v3: Major changes:
* Simplify two VDSO images by using static_branch() in vclock_gettime()
Removes unwanted conflicts with generic VDSO movement patches and
simplifies things by dropping too invasive linker magic.
As an alternative to static_branch() we tested an attempt to introduce
home-made dynamic patching called retcalls:
https://github.com/0x7f454c46/linux/commit/4cc0180f6d65
Considering some theoretical problems with toolchains, we decided to go
with long well-tested nop-patching in static_branch(). Though, it was
needed to provide backend for relative code.
* address Thomas' comments.
* add sanity checks for offsets:
- the current clock time in a namespace has to be in [0, KTIME_MAX / 2).
KTIME_MAX is divided by two here to be sure that the KTIME_MAX limit
is still unreachable.
Link: https://lkml.org/lkml/2018/9/19/950
Link: https://lkml.org/lkml/2019/2/5/867
v1..v2: There are two major changes:
* Two versions of the VDSO library to avoid a performance penalty for
host tasks outside time namespace (as suggested by Andy and Thomas).
As it has been discussed on timens RFC, adding a new conditional branch
`if (inside_time_ns)` on VDSO for all processes is undesirable.
It will add a penalty for everybody as branch predictor may mispredict
the jump. Also there are instruction cache lines wasted on cmp/jmp.
Those effects of introducing time namespace are very much unwanted
having in mind how much work have been spent on micro-optimisation
VDSO code.
Addressing those problems, there are two versions of VDSO's .so:
for host tasks (without any penalty) and for processes inside of time
namespace with clk_to_ns() that subtracts offsets from host's time.
* Allow to set clock offsets for a namespace only before any processes
appear in it.
Now a time namespace looks similar to a pid namespace in a way how it is
created: unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it to the current process. Then all children of
the process will be born in the new time namespace, or a process can
use the setns() system call to join a namespace.
This scheme allows to create a new time namespaces, set clock offsets
and then populate the namespace with processes.
Andrei Vagin (25):
ns: Introduce Time Namespace
timens: Add timens_offsets
posix-clocks: Rename the clock_get() into clock_get_timespec()
posix-clocks: Rename *_clock_get() functions into
*_clock_get_timespec()
alarmtimer: Rename gettime() callback to get_ktime()
alarmtimer: Provide get_timespec() callback
posix-clocks: Introduce clock_get_ktime() callback
posix-timers: Use clock_get_ktime() in common_timer_get()
posix-clocks: Introduce CLOCK_MONOTONIC time namespace offsets
posix-clocks: Introduce CLOCK_BOOTTIME time namespace offset
kernel: Add do_timens_ktime_to_host() helper
timerfd: Make timerfd_settime() time namespace aware
posix-timers: Make timer_settime() time namespace aware
alarmtimer: Make nanosleep time namespace aware
hrtimers: Prepare hrtimer_nanosleep() for time namespaces
posix-timers: Make clock_nanosleep() time namespace aware
x86/vdso: Add offsets page in vvar
vdso: Introduce vdso_static_branch_unlikely()
x86/vdso: Enable static branches for the timens vdso
fs/proc: Introduce /proc/pid/timens_offsets
selftest/timens: Add a test for timerfd
selftest/timens: Add a test for clock_nanosleep()
selftest/timens: Add timer offsets test
selftests/timens: Add a simple perf test for clock_gettime()
selftest/timens: Check that a right vdso is mapped after fork and exec
Dmitry Safonov (12):
fd/proc: Respect boottime inside time namespace for /proc/uptime
x86/vdso2c: Correct err messages on file opening
x86/vdso2c: Convert iterator to unsigned
x86/vdso/Makefile: Add vobjs32
x86/vdso: Restrict splitting VVAR VMA
x86/vdso: Rename vdso_image {.data=>.text}
x86/vdso: Allocate timens vdso
x86/vdso: Switch image on setns()/clone()
x86/vdso2c: Process jump tables
posix-clocks: Add align for timens_offsets
selftest/timens: Add Time Namespace test for supported clocks
selftest/timens: Add procfs selftest
MAINTAINERS | 3 +
arch/Kconfig | 5 +
arch/x86/Kconfig | 1 +
arch/x86/entry/vdso/Makefile | 15 +-
arch/x86/entry/vdso/vdso-layout.lds.S | 10 +-
arch/x86/entry/vdso/vdso2c.c | 7 +-
arch/x86/entry/vdso/vdso2c.h | 22 +-
arch/x86/entry/vdso/vma.c | 177 +++++++-
arch/x86/include/asm/jump_label.h | 14 +
arch/x86/include/asm/vdso.h | 14 +-
arch/x86/kernel/jump_label.c | 14 +
fs/proc/base.c | 95 +++++
fs/proc/namespaces.c | 4 +
fs/proc/uptime.c | 3 +
fs/timerfd.c | 3 +
include/linux/hrtimer.h | 2 +-
include/linux/jump_label.h | 8 +
include/linux/nsproxy.h | 2 +
include/linux/posix-timers.h | 3 +
include/linux/proc_ns.h | 3 +
include/linux/time_namespace.h | 114 ++++++
include/linux/timens_offsets.h | 18 +
include/linux/user_namespace.h | 1 +
include/uapi/linux/sched.h | 6 +
init/Kconfig | 9 +
kernel/Makefile | 1 +
kernel/fork.c | 16 +-
kernel/nsproxy.c | 41 +-
kernel/time/alarmtimer.c | 68 +++-
kernel/time/hrtimer.c | 8 +-
kernel/time/posix-clock.c | 8 +-
kernel/time/posix-cpu-timers.c | 32 +-
kernel/time/posix-stubs.c | 15 +-
kernel/time/posix-timers.c | 89 ++--
kernel/time/posix-timers.h | 7 +-
kernel/time_namespace.c | 385 ++++++++++++++++++
lib/vdso/gettimeofday.c | 53 +++
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/timens/.gitignore | 8 +
tools/testing/selftests/timens/Makefile | 12 +
.../selftests/timens/clock_nanosleep.c | 102 +++++
tools/testing/selftests/timens/config | 1 +
tools/testing/selftests/timens/exec.c | 93 +++++
tools/testing/selftests/timens/gettime_perf.c | 101 +++++
.../selftests/timens/gettime_perf_cold.c | 160 ++++++++
tools/testing/selftests/timens/log.h | 26 ++
tools/testing/selftests/timens/procfs.c | 144 +++++++
tools/testing/selftests/timens/timens.c | 185 +++++++++
tools/testing/selftests/timens/timens.h | 63 +++
tools/testing/selftests/timens/timer.c | 118 ++++++
tools/testing/selftests/timens/timerfd.c | 129 ++++++
51 files changed, 2304 insertions(+), 115 deletions(-)
create mode 100644 include/linux/time_namespace.h
create mode 100644 include/linux/timens_offsets.h
create mode 100644 kernel/time_namespace.c
create mode 100644 tools/testing/selftests/timens/.gitignore
create mode 100644 tools/testing/selftests/timens/Makefile
create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c
create mode 100644 tools/testing/selftests/timens/config
create mode 100644 tools/testing/selftests/timens/exec.c
create mode 100644 tools/testing/selftests/timens/gettime_perf.c
create mode 100644 tools/testing/selftests/timens/gettime_perf_cold.c
create mode 100644 tools/testing/selftests/timens/log.h
create mode 100644 tools/testing/selftests/timens/procfs.c
create mode 100644 tools/testing/selftests/timens/timens.c
create mode 100644 tools/testing/selftests/timens/timens.h
create mode 100644 tools/testing/selftests/timens/timer.c
create mode 100644 tools/testing/selftests/timens/timerfd.c
--
2.22.0
^ permalink raw reply
* Re: [PATCH V36 19/29] Lock down module params that specify hardware parameters (eg. ioport)
From: Matthew Garrett @ 2019-07-29 21:47 UTC (permalink / raw)
To: James Morris
Cc: LSM List, Linux Kernel Mailing List, Linux API, David Howells,
Alan Cox, Kees Cook, Jessica Yu
In-Reply-To: <20190718194415.108476-20-matthewgarrett@google.com>
On Thu, Jul 18, 2019 at 12:45 PM Matthew Garrett
<matthewgarrett@google.com> wrote:
>
> From: David Howells <dhowells@redhat.com>
>
> Provided an annotation for module parameters that specify hardware
> parameters (such as io ports, iomem addresses, irqs, dma channels, fixed
> dma buffers and other types).
>
> Suggested-by: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Cc: Jessica Yu <jeyu@kernel.org>
Jessica, any feedback on this?
^ permalink raw reply
* Re: [PATCH V36 23/29] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Matthew Garrett @ 2019-07-29 21:47 UTC (permalink / raw)
To: James Morris
Cc: LSM List, Linux Kernel Mailing List, Linux API, David Howells,
Alexei Starovoitov, Network Development, Chun-Yi Lee,
Daniel Borkmann
In-Reply-To: <20190718194415.108476-24-matthewgarrett@google.com>
On Thu, Jul 18, 2019 at 12:45 PM Matthew Garrett
<matthewgarrett@google.com> wrote:
> bpf_read() and bpf_read_str() could potentially be abused to (eg) allow
> private keys in kernel memory to be leaked. Disable them if the kernel
> has been locked down in confidentiality mode.
>
> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> cc: netdev@vger.kernel.org
> cc: Chun-Yi Lee <jlee@suse.com>
> cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
Any further feedback on this?
^ permalink raw reply
* Re: [PATCH v7 09/16] fscrypt: add an HKDF-SHA512 implementation
From: James Bottomley @ 2019-07-29 21:42 UTC (permalink / raw)
To: Eric Biggers, Theodore Y. Ts'o
Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
keyrings, linux-mtd, linux-crypto, linux-fsdevel, linux-ext4,
Paul Crowley
In-Reply-To: <20190729202951.GG169027@gmail.com>
On Mon, 2019-07-29 at 13:29 -0700, Eric Biggers wrote:
> On Sun, Jul 28, 2019 at 03:39:49PM -0400, Theodore Y. Ts'o wrote:
> > On Fri, Jul 26, 2019 at 03:41:34PM -0700, Eric Biggers wrote:
> > > From: Eric Biggers <ebiggers@google.com>
[...]
> > > HKDF solves all the above problems.
> > >
> > > Signed-off-by: Eric Biggers <ebiggers@google.com>
> >
> > Unless I missed something there's nothing here which is fscrypt
> > specific. Granted that it's somewhat unlikely that someone would
> > want to implement (the very bloated) IKE from IPSEC in the kernel,
> > I wonder if there might be other users of HKDF, and whether this
> > would be better placed in lib/ or crypto/ instead of fs/crypto?
>
> This is standard HKDF-SHA512; only the choice of parameters is
> fscrypt-specific. So it could indeed use a common implementation of
> HKDF if one were available.
>
> However, I don't think there are any other HKDF users in the kernel
> currently.
Well, I'm still trying to add the TPM ones, but they're based on SP800-
108 for arbitrary keys and SP800-56A for elliptic curve ones. These
are similar to the RFC5869 except that they do extract/expand in a
single operation. Plus, of course, the TPM mandates we use the name
algorithm (usually sha256, which is what I hardcoded) as the hash.
Note: since you don't use the extract step either in your
implementation, effectively you're equivalent to SP800-108 as well.
This is effectively the same reason as the TPM: we need deterministic
keys, so we've nowhere to get the salt from that would persist.
> Also, while there was a patch to support HKDF via the crypto_rng API,
> there was no consensus about whether this was actually the best way
> to add KDF support:
> https://lore.kernel.org/linux-crypto/2423373.Zd5ThvQH5g@positron.chro
> nox.de
>
> So for now, to avoid unnecessarily blocking this patchset I think we
> should just go with this implementation in fs/crypto/. It can always
> be changed later, once we decide on the best way to add KDFs to the
> crypto API.
>
> [To be clear: this patch already uses "hmac(sha512)" from the crypto
> API. It's only the actual HKDF part that we're talking about here.
Right, once you have the hmac + hash available, the rest is easy, so
this is what we have for the TPM KDFa:
static void KDFa(u8 *key, int keylen, const char *label, u8 *u,
u8 *v, int bytes, u8 *out)
{
u32 counter;
const __be32 bits = cpu_to_be32(bytes * 8);
for (counter = 1; bytes > 0; bytes -= SHA256_DIGEST_SIZE, counter++,
out += SHA256_DIGEST_SIZE) {
SHASH_DESC_ON_STACK(desc, sha256_hash);
__be32 c = cpu_to_be32(counter);
hmac_init(desc, key, keylen);
crypto_shash_update(desc, (u8 *)&c, sizeof(c));
crypto_shash_update(desc, label, strlen(label)+1);
crypto_shash_update(desc, u, SHA256_DIGEST_SIZE);
crypto_shash_update(desc, v, SHA256_DIGEST_SIZE);
crypto_shash_update(desc, (u8 *)&bits, sizeof(bits));
hmac_final(desc, key, keylen, out);
}
}
I honestly think these things are so simplistic with the correct hmac
that it would make it more confusing to try to produce a general KDF
than it would simply to hard code them where we need them.
James
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply
* Re: [PATCH v7 16/16] fscrypt: document the new ioctls and policy version
From: Eric Biggers @ 2019-07-29 21:36 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
keyrings, linux-mtd, linux-crypto, linux-fsdevel, linux-ext4,
Paul Crowley
In-Reply-To: <20190729020009.GA3863@mit.edu>
On Sun, Jul 28, 2019 at 10:00:09PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Jul 26, 2019 at 03:41:41PM -0700, Eric Biggers wrote:
> > +- The kernel cannot magically wipe copies of the master key(s) that
> > + userspace might have as well. Therefore, userspace must wipe all
> > + copies of the master key(s) it makes as well. Naturally, the same
> > + also applies to all higher levels in the key hierarchy. Userspace
> > + should also follow other security precautions such as mlock()ing
> > + memory containing keys to prevent it from being swapped out.
>
> Normally, shouldn't userspace have wiped all copies of the master key
> after they have called ADD_KEY? Why should they be left hanging
> around? Waiting until REMOVE_KEY to remove other copies of the master
> key seems.... late.
Correct, normally userspace should wipe its copy of the key immediately after
adding it to the kernel. I'll clarify that here.
>
> > +- In general, decrypted contents and filenames in the kernel VFS
> > + caches are freed but not wiped. Therefore, portions thereof may be
> > + recoverable from freed memory, even after the corresponding key(s)
> > + were wiped. To partially solve this, you can set
> > + CONFIG_PAGE_POISONING=y in your kernel config and add page_poison=1
> > + to your kernel command line. However, this has a performance cost.
>
> ... and even this won't help if you have swap configured....
Yes, but that's a larger issue. Unencrypted data can be written to swap and
then be recovered from disk offline. This has nothing to do with whether the
key is ever removed on-line or not. So swap really could use its own mention
somewhere else, maybe in the "Offline attacks" section.
>
> > +v1 encryption policies have some weaknesses with respect to online
> > +attacks:
> > +
> > +- There is no verification that the provided master key is correct.
> > + Consequently, malicious users can associate the wrong key with
> > + encrypted files, even files to which they have only read-only
> > + access.
>
> Yes, but they won't be able to trick other users into using that
> incorrect key. With the old interface, it gets written into the
> user's session keyring, which won't get used by another user. And
> with the newer interface, only root is allowed to set v1 key.
>
As mentioned in a previous reply, they *can* trick other users into using that
incorrect key, by opening files using that incorrect key. The incorrect key is
then cached for everyone. (This assumes the other users have at least read
access to the file. If it's mode 0700, this won't work.)
> > +Master keys should be pseudorandom, i.e. indistinguishable from random
> > +bytestrings of the same length. This implies that users **must not**
> > +directly use a password as a master key, zero-pad a shorter key, or
> > +repeat a shorter key.
>
> These paragraphs starts a bit funny, since we first say "should" in
> the first sentence, and then it's followed up by "**must not**" in the
> second sentence. Basically, they *could* do this, but it would just
> weaken the security of the system significantly.
>
> At the very least, we should explain the basis of the recommendation.
I think we should go with "must" instead of "should".
Basically the point of this paragraph is to explain that the API takes a real
cryptographic key of the full given length.
Otherwise the security guarantees for the algorithms the master key may be used
in (AES-128-ECB KDF, HKDF-SHA512, or Adiantum) aren't guaranteed to hold.
One can argue about how much of a problem this actually is, like how unsalted
HKDF on a key with unevenly distributed entropy is *probably* fine in practice
(and much better than the AES-128-ECB KDF). But the security proof for unsalted
HKDF actually still assumes a pseudorandom key. It's only randomly salted HKDF
that doesn't.
I'd strongly prefer to go with *must* for things that are necessary for the
security proofs or cryptanalysis to apply, even if they *might* still be "good
enough" in practice.
I'll try to find a better way to word this paragraph.
>
> > +The KDF used for a particular master key differs depending on whether
> > +the key is used for v1 encryption policies or for v2 encryption
> > +policies. Users **must not** use the same key for both v1 and v2
> > +encryption policies.
>
> "Must not" seems a bit strong. If they do, and a v1 per-file key and
> nonce leaks out, then the encryption key will be compromised. So the
> strength of the key will be limited by the weaknesses of the v1
> scheme. But it's not like using a that was originally meant for v1,
> and then using it for v2, causes any additional weakness. Right?
>
Probably, but we don't know for sure. It's theoretically possible that
cryptanalysis of two cryptographic primitives A and B, where they are each given
the same key, could be much easier than attacking A or B individually.
So again, I'd prefer to go with *must not* for things where there is no theory
of cryptography that says it is okay, even if *probably* someone could get away
with doing it in practice.
- Eric
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply
* Re: [PATCH v7 10/16] fscrypt: v2 encryption policy support
From: Eric Biggers @ 2019-07-29 20:46 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
keyrings, linux-mtd, linux-crypto, linux-fsdevel, linux-ext4,
Paul Crowley
In-Reply-To: <20190728211730.GK6088@mit.edu>
On Sun, Jul 28, 2019 at 05:17:30PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Jul 26, 2019 at 03:41:35PM -0700, Eric Biggers wrote:
> > @@ -319,6 +329,31 @@ int fscrypt_ioctl_add_key(struct file *filp, void __user *_uarg)
> > if (!capable(CAP_SYS_ADMIN))
> > goto out_wipe_secret;
> >
> > + if (arg.key_spec.type != FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR) {
>
> This should be "== FSCRYPT_KEY_SPEC_TYPE_INDENTIFIER" instead. That's
> because you use the identifier part of the union:
>
> > + /* Calculate the key identifier and return it to userspace. */
> > + err = fscrypt_hkdf_expand(&secret.hkdf,
> > + HKDF_CONTEXT_KEY_IDENTIFIER,
> > + NULL, 0, arg.key_spec.u.identifier,
>
> If we ever add a new key specifier type, and alternative in the union,
> this is going to come back to bite us.
Well, I did it this way because the next patch changes the code to:
if (arg.key_spec.type == FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR) {
...
} else {
...
}
We already validated that it's either TYPE_DESCRIPTOR or TYPE_IDENTIFIER.
But I guess to be more clear I'll just make it handle the default case again.
switch (arg.key_spec.type) {
case FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR:
...
break;
case FSCRYPT_KEY_SPEC_TYPE_IDENTIFIER:
...
break;
default:
err = -EINVAL;
break;
}
>
> > + if (policy->version == FSCRYPT_POLICY_V1) {
> > + /*
> > + * The original encryption policy version provided no way of
> > + * verifying that the correct master key was supplied, which was
> > + * insecure in scenarios where multiple users have access to the
> > + * same encrypted files (even just read-only access).
>
> Which scenario do you have in mind? With read-only access, Alice can
> fetch the encryption policy for a directory, and introduce a key with
> the same descriptor, but the "wrong" key, but that's only going to
> affect Alice's use of the key. It won't affect what key is used by
> Bob, since Alice doesn't have write access to Bob's keyrings.
>
> If what you mean is the risk when there is a single global
> filesystem-specific keyring, where Alice could introduce a "wrong" key
> identified with a specific descriptor, then sure, Alice could trick
> Bob into encrypting his data with the wrong key (one known to Alice).
> But we don't allow keys usable by V1 policies to be used in the
> filesystem-specific keyring, do we?
>
The scenario is that Alice lists the directory with the wrong key, then Bob
lists the directory too and gets the wrong filenames. This happens because the
inode, fscrypt_info, dentry cache, page cache, etc. are the same for everyone.
Bob's key is never looked up because the inode already has a key cached.
This also applies to regular files and symlinks.
- Eric
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply
* Re: [PATCH v7 09/16] fscrypt: add an HKDF-SHA512 implementation
From: Eric Biggers @ 2019-07-29 20:29 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
keyrings, linux-mtd, linux-crypto, linux-fsdevel, linux-ext4,
Paul Crowley
In-Reply-To: <20190728193949.GI6088@mit.edu>
On Sun, Jul 28, 2019 at 03:39:49PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Jul 26, 2019 at 03:41:34PM -0700, Eric Biggers wrote:
> > From: Eric Biggers <ebiggers@google.com>
> >
> > Add an implementation of HKDF (RFC 5869) to fscrypt, for the purpose of
> > deriving additional key material from the fscrypt master keys for v2
> > encryption policies. HKDF is a key derivation function built on top of
> > HMAC. We choose SHA-512 for the underlying unkeyed hash, and use an
> > "hmac(sha512)" transform allocated from the crypto API.
> >
> > We'll be using this to replace the AES-ECB based KDF currently used to
> > derive the per-file encryption keys. While the AES-ECB based KDF is
> > believed to meet the original security requirements, it is nonstandard
> > and has problems that don't exist in modern KDFs such as HKDF:
> >
> > 1. It's reversible. Given a derived key and nonce, an attacker can
> > easily compute the master key. This is okay if the master key and
> > derived keys are equally hard to compromise, but now we'd like to be
> > more robust against threats such as a derived key being compromised
> > through a timing attack, or a derived key for an in-use file being
> > compromised after the master key has already been removed.
> >
> > 2. It doesn't evenly distribute the entropy from the master key; each 16
> > input bytes only affects the corresponding 16 output bytes.
> >
> > 3. It isn't easily extensible to deriving other values or keys, such as
> > a public hash for securely identifying the key, or per-mode keys.
> > Per-mode keys will be immediately useful for Adiantum encryption, for
> > which fscrypt currently uses the master key directly, introducing
> > unnecessary usage constraints. Per-mode keys will also be useful for
> > hardware inline encryption, which is currently being worked on.
> >
> > HKDF solves all the above problems.
> >
> > Signed-off-by: Eric Biggers <ebiggers@google.com>
>
> Unless I missed something there's nothing here which is fscrypt
> specific. Granted that it's somewhat unlikely that someone would want
> to implement (the very bloated) IKE from IPSEC in the kernel, I wonder
> if there might be other users of HKDF, and whether this would be
> better placed in lib/ or crypto/ instead of fs/crypto?
>
This is standard HKDF-SHA512; only the choice of parameters is fscrypt-specific.
So it could indeed use a common implementation of HKDF if one were available.
However, I don't think there are any other HKDF users in the kernel currently.
Also, while there was a patch to support HKDF via the crypto_rng API, there was
no consensus about whether this was actually the best way to add KDF support:
https://lore.kernel.org/linux-crypto/2423373.Zd5ThvQH5g@positron.chronox.de
So for now, to avoid unnecessarily blocking this patchset I think we should just
go with this implementation in fs/crypto/. It can always be changed later, once
we decide on the best way to add KDFs to the crypto API.
[To be clear: this patch already uses "hmac(sha512)" from the crypto API. It's
only the actual HKDF part that we're talking about here.
Also, its correctness is tested by the ciphertext verification xfstests.]
- Eric
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply
* Re: [PATCH v7 06/16] fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl
From: Theodore Y. Ts'o @ 2019-07-29 20:14 UTC (permalink / raw)
To: linux-fscrypt, linux-fsdevel, linux-ext4, linux-f2fs-devel,
linux-mtd, linux-api, linux-crypto, keyrings, Paul Crowley,
Satya Tangirala
In-Reply-To: <20190729194644.GE169027@gmail.com>
On Mon, Jul 29, 2019 at 12:46:45PM -0700, Eric Biggers wrote:
> > For that matter, we could just add a new ioctl which returns the file
> > system's keyring id. That way an application program won't have to
> > try to figure out what a file's underlying sb->s_id happens to be.
> > (Especially if things like overlayfs are involved.)
>
> Keep in mind that the new ioctls (FS_IOC_ADD_ENCRYPTION_KEY,
> FS_IOC_REMOVE_ENCRYPTION_KEY, FS_IOC_GET_ENCRYPTION_KEY_STATUS) don't take the
> keyring ID as a parameter, since it's already known from the filesystem the
> ioctl is executed on. So there actually isn't much that can be done with the
> keyring ID. But sure, if it's needed later we can add an API to get it.
Yeah, I was thinking about for testing/debugging purposes so that we
could use keyctl to examine the per-file system keyring and see what
keys are attached to a file system. This is only going to be usable
by root, so I guess we can just try to figure it out by going through
/proc/keys and searching by sb->s_id. If there are ambiguities that
make this hard to do, we can add an interface to make this easier.
- Ted
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply
* Re: [PATCH v12 0/6] Add utilization clamping support (CGroups API)
From: Tejun Heo @ 2019-07-29 20:06 UTC (permalink / raw)
To: Patrick Bellasi
Cc: linux-kernel, linux-pm, linux-api, cgroups, Ingo Molnar,
Peter Zijlstra, Rafael J . Wysocki, Vincent Guittot, Viresh Kumar,
Paul Turner, Michal Koutny, Quentin Perret, Dietmar Eggemann,
Morten Rasmussen, Juri Lelli, Todd Kjos, Joel Fernandes,
Steve Muckle, Suren Baghdasaryan, Alessio Balsini
In-Reply-To: <20190718181748.28446-1-patrick.bellasi@arm.com>
Hello,
Looks good to me. On cgroup side,
Acked-by: Tejun Heo <tj@kernel.org>
Thanks.
--
tejun
^ permalink raw reply
* Re: [PATCH v7 07/16] fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl
From: Eric Biggers @ 2019-07-29 19:58 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
keyrings, linux-mtd, linux-crypto, linux-fsdevel, linux-ext4,
Paul Crowley
In-Reply-To: <20190728192417.GG6088@mit.edu>
On Sun, Jul 28, 2019 at 03:24:17PM -0400, Theodore Y. Ts'o wrote:
> > +
> > +/*
> > + * Try to remove an fscrypt master encryption key. If other users have also
> > + * added the key, we'll remove the current user's usage of the key, then return
> > + * -EUSERS. Otherwise we'll continue on and try to actually remove the key.
>
> Nit: this should be moved to patch #11
>
> Also, perror(EUSERS) will display "Too many users" which is going to
> be confusing. I understand why you chose this; we would like to
> distinguish between there are still inodes using this key, and there
> are other users using this key.
>
> Do we really need to return EUSERS in this case? It's actually not an
> *error* that other users are using the key. After all, the unlink(2)
> system call doesn't return an advisory error when you delete a file
> which has other hard links. And an application which does care about
> this detail can always call FS_IOC_ENCRYPTION_KEY_STATUS() and check
> user_count.
>
Returning 0 when the key wasn't fully removed might also be confusing. But I
guess you're right that returning an error doesn't match how syscalls usually
work. It did remove the current user's usage of the key, after all, rather than
completely fail. And as you point out, if someone cares about other users
having added the key, they can use FS_IOC_GET_ENCRYPTION_KEY_STATUS.
So I guess I'll change it to 0.
Thanks!
- Eric
^ permalink raw reply
* Re: [PATCH v7 06/16] fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl
From: Eric Biggers @ 2019-07-29 19:46 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
keyrings, linux-mtd, linux-crypto, linux-fsdevel, linux-ext4,
Paul Crowley
In-Reply-To: <20190728185003.GF6088@mit.edu>
On Sun, Jul 28, 2019 at 02:50:03PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Jul 26, 2019 at 03:41:31PM -0700, Eric Biggers wrote:
> > From: Eric Biggers <ebiggers@google.com>
> >
> > Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY. This ioctl adds an
> > encryption key to the filesystem's fscrypt keyring ->s_master_keys,
> > making any files encrypted with that key appear "unlocked".
>
> Note: it think it's going to be useful to make the keyring id
> available someplace like /sys/fs/<fs>/<blkdev>/keyring, or preferably
> in the new fsinfo system call. Yes, the system administrator can paw
> through /proc/keys and try to figure it out, but it will be nicer if
> there's a direct way to do that.
>
> For that matter, we could just add a new ioctl which returns the file
> system's keyring id. That way an application program won't have to
> try to figure out what a file's underlying sb->s_id happens to be.
> (Especially if things like overlayfs are involved.)
Keep in mind that the new ioctls (FS_IOC_ADD_ENCRYPTION_KEY,
FS_IOC_REMOVE_ENCRYPTION_KEY, FS_IOC_GET_ENCRYPTION_KEY_STATUS) don't take the
keyring ID as a parameter, since it's already known from the filesystem the
ioctl is executed on. So there actually isn't much that can be done with the
keyring ID. But sure, if it's needed later we can add an API to get it.
>
> > diff --git a/include/uapi/linux/fscrypt.h b/include/uapi/linux/fscrypt.h
> > index 29a945d165def..93d6eabaa7de4 100644
> > --- a/include/uapi/linux/fscrypt.h
> > +++ b/include/uapi/linux/fscrypt.h
> > +
> > +struct fscrypt_key_specifier {
> > +#define FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR 1
> > + __u32 type;
> > + __u32 __reserved;
>
> Can you move the definition of FSCRYPT_KEY_SPEC_TYPE_DESCRIPTOR
> outside of the structure definition, and then add a comment about what
> is a "descriptor" key spec? (And then in a later patch, please add a
> comment about what is an "identifier" key type.) There's an
> explanation in Documentation/filesystems/fscrypt.rst, I know, but a
> one or two line comment plus a pointer to
> Documentation/filesystems/fscrypt.rst in the header file would be
> really helpful.
I'll add a brief comment that explains the key specifier.
I've already added a pointer to Documentation/filesystems/fscrypt.rst at the top
of the header (this was one of the cleanups in v6 => v7):
/*
* fscrypt user API
*
* These ioctls can be used on filesystems that support fscrypt. See the
* "User API" section of Documentation/filesystems/fscrypt.rst.
*/
- Eric
^ permalink raw reply
* Re: [PATCH v7 05/16] fscrypt: refactor v1 policy key setup into keysetup_legacy.c
From: Eric Biggers @ 2019-07-29 19:37 UTC (permalink / raw)
To: Theodore Y. Ts'o
Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
keyrings, linux-mtd, linux-crypto, linux-fsdevel, linux-ext4,
Paul Crowley
In-Reply-To: <20190728154032.GE6088@mit.edu>
Hi Ted, thanks for the review!
On Sun, Jul 28, 2019 at 11:40:32AM -0400, Theodore Y. Ts'o wrote:
> On Fri, Jul 26, 2019 at 03:41:30PM -0700, Eric Biggers wrote:
> > From: Eric Biggers <ebiggers@google.com>
> >
> > In preparation for introducing v2 encryption policies which will find
> > and derive encryption keys differently from the current v1 encryption
> > policies, refactor the v1 policy-specific key setup code from keyinfo.c
> > into keysetup_legacy.c. Then rename keyinfo.c to keysetup.c.
>
> I'd use keysetup_v1.c, myself. We can hope that we've gotten it right
> with v2 and we'll never need to do another version, but *something* is
> going to come up eventually which will require a v3 keysetup , whether
> it's post-quantuum cryptography or something else we can't anticipate
> right now.
>
> For an example of the confusion that can result, one good example is
> in the fs/quota subsystem, where QFMT_VFS_OLD, QFMT_VFS_V0, and
> QFMT_VFS_V1 maps to quota_v1 and quota_v2 in an amusing and
> non-obvious way. (Go ahead, try to guess before you go look at the
> code. :-)
>
> Other than that, looks good. We can always move code around or rename
> files in the future, so I'm not going to insist on doing it now (but
> it would be my preference).
>
> Reviewed-by: Theodore Ts'o <tytso@mit.edu>
>
Agreed, I'll call it keysetup_v1.c instead.
- Eric
______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
^ permalink raw reply
* [PATCH 5.2 121/215] rseq/selftests: Fix Thumb mode build failure on arm32
From: Greg Kroah-Hartman @ 2019-07-29 19:21 UTC (permalink / raw)
To: linux-kernel
Cc: Greg Kroah-Hartman, stable, Mathieu Desnoyers, Will Deacon,
Peter Zijlstra, Thomas Gleixner, Joel Fernandes, Catalin Marinas,
Dave Watson, Shuah Khan, Andi Kleen, linux-kselftest,
H . Peter Anvin, Chris Lameter, Russell King, Michael Kerrisk,
Paul E . McKenney, Paul Turner, Boqun Feng, Josh Triplett
In-Reply-To: <20190729190739.971253303@linuxfoundation.org>
[ Upstream commit ee8a84c60bcc1f1615bd9cb3edfe501e26cdc85b ]
Using ".arm .inst" for the arm signature introduces build issues for
programs compiled in Thumb mode because the assembler stays in the
arm mode for the rest of the inline assembly. Revert to using a ".word"
to express the signature as data instead.
The choice of signature is a valid trap instruction on arm32 little
endian, where both code and data are little endian.
ARMv6+ big endian (BE8) generates mixed endianness code vs data:
little-endian code and big-endian data. The data value of the signature
needs to have its byte order reversed to generate the trap instruction.
Prior to ARMv6, -mbig-endian generates big-endian code and data
(which match), so the endianness of the data representation of the
signature should not be reversed. However, the choice between BE32
and BE8 is done by the linker, so we cannot know whether code and
data endianness will be mixed before the linker is invoked. So rather
than try to play tricks with the linker, the rseq signature is simply
data (not a trap instruction) prior to ARMv6 on big endian. This is
why the signature is expressed as data (.word) rather than as
instruction (.inst) in assembler.
Because a ".word" is used to emit the signature, it will be interpreted
as a literal pool by a disassembler, not as an actual instruction.
Considering that the signature is not meant to be executed except in
scenarios where the program execution is completely bogus, this should
not be an issue.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Will Deacon <will.deacon@arm.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Joel Fernandes <joelaf@google.com>
CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Dave Watson <davejwatson@fb.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Andi Kleen <andi@firstfloor.org>
CC: linux-kselftest@vger.kernel.org
CC: "H . Peter Anvin" <hpa@zytor.com>
CC: Chris Lameter <cl@linux.com>
CC: Russell King <linux@arm.linux.org.uk>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
CC: Paul Turner <pjt@google.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ben Maurer <bmaurer@fb.com>
CC: linux-api@vger.kernel.org
CC: Andy Lutomirski <luto@amacapital.net>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Carlos O'Donell <carlos@redhat.com>
CC: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
tools/testing/selftests/rseq/rseq-arm.h | 61 +++++++++++++------------
1 file changed, 33 insertions(+), 28 deletions(-)
diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
index 84f28f147fb6..5943c816c07c 100644
--- a/tools/testing/selftests/rseq/rseq-arm.h
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -6,6 +6,8 @@
*/
/*
+ * - ARM little endian
+ *
* RSEQ_SIG uses the udf A32 instruction with an uncommon immediate operand
* value 0x5de3. This traps if user-space reaches this instruction by mistake,
* and the uncommon operand ensures the kernel does not move the instruction
@@ -22,36 +24,40 @@
* def3 udf #243 ; 0xf3
* e7f5 b.n <7f5>
*
- * pre-ARMv6 big endian code:
- * e7f5 b.n <7f5>
- * def3 udf #243 ; 0xf3
+ * - ARMv6+ big endian (BE8):
*
* ARMv6+ -mbig-endian generates mixed endianness code vs data: little-endian
- * code and big-endian data. Ensure the RSEQ_SIG data signature matches code
- * endianness. Prior to ARMv6, -mbig-endian generates big-endian code and data
- * (which match), so there is no need to reverse the endianness of the data
- * representation of the signature. However, the choice between BE32 and BE8
- * is done by the linker, so we cannot know whether code and data endianness
- * will be mixed before the linker is invoked.
+ * code and big-endian data. The data value of the signature needs to have its
+ * byte order reversed to generate the trap instruction:
+ *
+ * Data: 0xf3def5e7
+ *
+ * Translates to this A32 instruction pattern:
+ *
+ * e7f5def3 udf #24035 ; 0x5de3
+ *
+ * Translates to this T16 instruction pattern:
+ *
+ * def3 udf #243 ; 0xf3
+ * e7f5 b.n <7f5>
+ *
+ * - Prior to ARMv6 big endian (BE32):
+ *
+ * Prior to ARMv6, -mbig-endian generates big-endian code and data
+ * (which match), so the endianness of the data representation of the
+ * signature should not be reversed. However, the choice between BE32
+ * and BE8 is done by the linker, so we cannot know whether code and
+ * data endianness will be mixed before the linker is invoked. So rather
+ * than try to play tricks with the linker, the rseq signature is simply
+ * data (not a trap instruction) prior to ARMv6 on big endian. This is
+ * why the signature is expressed as data (.word) rather than as
+ * instruction (.inst) in assembler.
*/
-#define RSEQ_SIG_CODE 0xe7f5def3
-
-#ifndef __ASSEMBLER__
-
-#define RSEQ_SIG_DATA \
- ({ \
- int sig; \
- asm volatile ("b 2f\n\t" \
- "1: .inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
- "2:\n\t" \
- "ldr %[sig], 1b\n\t" \
- : [sig] "=r" (sig)); \
- sig; \
- })
-
-#define RSEQ_SIG RSEQ_SIG_DATA
-
+#ifdef __ARMEB__
+#define RSEQ_SIG 0xf3def5e7 /* udf #24035 ; 0x5de3 (ARMv6+) */
+#else
+#define RSEQ_SIG 0xe7f5def3 /* udf #24035 ; 0x5de3 */
#endif
#define rseq_smp_mb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
@@ -125,8 +131,7 @@ do { \
__rseq_str(table_label) ":\n\t" \
".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
- ".arm\n\t" \
- ".inst " __rseq_str(RSEQ_SIG_CODE) "\n\t" \
+ ".word " __rseq_str(RSEQ_SIG) "\n\t" \
__rseq_str(label) ":\n\t" \
teardown \
"b %l[" __rseq_str(abort_label) "]\n\t"
--
2.20.1
^ permalink raw reply related
* Re: [f2fs-dev] [PATCH v4 3/3] f2fs: Support case-insensitive file name lookups
From: Chao Yu @ 2019-07-29 14:57 UTC (permalink / raw)
To: Chao Yu, Jaegeuk Kim
Cc: Daniel Rosenberg, Jonathan Corbet, linux-f2fs-devel, linux-doc,
linux-api, linux-kernel, linux-fsdevel, kernel-team
In-Reply-To: <fa07a09d-92c9-4e0b-7c2b-e87771273dce@huawei.com>
On 2019-7-29 15:22, Chao Yu wrote:
> On 2019/7/29 14:27, Jaegeuk Kim wrote:
>> On 07/28, Chao Yu wrote:
>>> On 2019-7-24 7:05, Daniel Rosenberg via Linux-f2fs-devel wrote:
>>>> /* Flags that are appropriate for regular files (all but dir-specific ones). */
>>>> #define F2FS_REG_FLMASK (~(F2FS_DIRSYNC_FL | F2FS_PROJINHERIT_FL))
>>>
>>> We missed to add F2FS_CASEFOLD_FL here to exclude it in F2FS_REG_FLMASK.
>>
>> Applied.
>>
>>>
>>>> @@ -1660,7 +1660,16 @@ static int f2fs_setflags_common(struct inode *inode, u32 iflags, u32 mask)
>>>> return -EPERM;
>>>>
>>>> oldflags = fi->i_flags;
>>>> + if ((iflags ^ oldflags) & F2FS_CASEFOLD_FL) {
>>>> + if (!f2fs_sb_has_casefold(F2FS_I_SB(inode)))
>>>> + return -EOPNOTSUPP;
>>>> +
>>>> + if (!S_ISDIR(inode->i_mode))
>>>> + return -ENOTDIR;
>>>>
>>>> + if (!f2fs_empty_dir(inode))
>>>> + return -ENOTEMPTY;
>>>> + }
>>
>> Modified like this:
>> @@ -1665,6 +1665,13 @@ static int f2fs_setflags_common(struct inode *inode, u32 iflags, u32 mask)
>> if (IS_NOQUOTA(inode))
>> return -EPERM;
>>
>> + if ((iflags ^ fi->i_flags) & F2FS_CASEFOLD_FL) {
>> + if (!f2fs_sb_has_casefold(F2FS_I_SB(inode)))
>> + return -EOPNOTSUPP;
>> + if (!f2fs_empty_dir(inode))
>> + return -ENOTEMPTY;
>> + }
>> +
>>
>> Note that, directory is checked by above change.
>>
>> I've uploaded in f2fs.git, so could you check it out and test a bit?
>
> I've checked it out, it looks good to me now, and later I will test this new
> version.
>
> Reviewed-by: Chao Yu <yuchao0@huawei.com>
It can pass generic/556 as well.
Thanks,
>
> Thanks,
>
>>
>> Thanks,
>>
>>>
>>> I applied the patches based on last Jaegeuk's dev branch, it seems we needs to
>>> adjust above code a bit. Otherwise it looks good to me.
>>>
>>> BTW, it looks the patchset works fine with generic/556 testcase.
>>>
>>> Thanks,
>> .
>>
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v4 3/3] f2fs: Support case-insensitive file name lookups
From: Chao Yu @ 2019-07-29 7:22 UTC (permalink / raw)
To: Jaegeuk Kim, Chao Yu
Cc: Daniel Rosenberg, Jonathan Corbet, linux-f2fs-devel, linux-doc,
linux-api, linux-kernel, linux-fsdevel, kernel-team
In-Reply-To: <20190729062735.GA98839@jaegeuk-macbookpro.roam.corp.google.com>
On 2019/7/29 14:27, Jaegeuk Kim wrote:
> On 07/28, Chao Yu wrote:
>> On 2019-7-24 7:05, Daniel Rosenberg via Linux-f2fs-devel wrote:
>>> /* Flags that are appropriate for regular files (all but dir-specific ones). */
>>> #define F2FS_REG_FLMASK (~(F2FS_DIRSYNC_FL | F2FS_PROJINHERIT_FL))
>>
>> We missed to add F2FS_CASEFOLD_FL here to exclude it in F2FS_REG_FLMASK.
>
> Applied.
>
>>
>>> @@ -1660,7 +1660,16 @@ static int f2fs_setflags_common(struct inode *inode, u32 iflags, u32 mask)
>>> return -EPERM;
>>>
>>> oldflags = fi->i_flags;
>>> + if ((iflags ^ oldflags) & F2FS_CASEFOLD_FL) {
>>> + if (!f2fs_sb_has_casefold(F2FS_I_SB(inode)))
>>> + return -EOPNOTSUPP;
>>> +
>>> + if (!S_ISDIR(inode->i_mode))
>>> + return -ENOTDIR;
>>>
>>> + if (!f2fs_empty_dir(inode))
>>> + return -ENOTEMPTY;
>>> + }
>
> Modified like this:
> @@ -1665,6 +1665,13 @@ static int f2fs_setflags_common(struct inode *inode, u32 iflags, u32 mask)
> if (IS_NOQUOTA(inode))
> return -EPERM;
>
> + if ((iflags ^ fi->i_flags) & F2FS_CASEFOLD_FL) {
> + if (!f2fs_sb_has_casefold(F2FS_I_SB(inode)))
> + return -EOPNOTSUPP;
> + if (!f2fs_empty_dir(inode))
> + return -ENOTEMPTY;
> + }
> +
>
> Note that, directory is checked by above change.
>
> I've uploaded in f2fs.git, so could you check it out and test a bit?
I've checked it out, it looks good to me now, and later I will test this new
version.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Thanks,
>
> Thanks,
>
>>
>> I applied the patches based on last Jaegeuk's dev branch, it seems we needs to
>> adjust above code a bit. Otherwise it looks good to me.
>>
>> BTW, it looks the patchset works fine with generic/556 testcase.
>>
>> Thanks,
> .
>
^ permalink raw reply
* Re: [f2fs-dev] [PATCH v4 3/3] f2fs: Support case-insensitive file name lookups
From: Jaegeuk Kim @ 2019-07-29 6:27 UTC (permalink / raw)
To: Chao Yu
Cc: Daniel Rosenberg, Chao Yu, Jonathan Corbet, linux-f2fs-devel,
linux-doc, linux-api, linux-kernel, linux-fsdevel, kernel-team
In-Reply-To: <9362e4ed-2be8-39f5-b4d9-9c86e37ab993@kernel.org>
On 07/28, Chao Yu wrote:
> On 2019-7-24 7:05, Daniel Rosenberg via Linux-f2fs-devel wrote:
> > /* Flags that are appropriate for regular files (all but dir-specific ones). */
> > #define F2FS_REG_FLMASK (~(F2FS_DIRSYNC_FL | F2FS_PROJINHERIT_FL))
>
> We missed to add F2FS_CASEFOLD_FL here to exclude it in F2FS_REG_FLMASK.
Applied.
>
> > @@ -1660,7 +1660,16 @@ static int f2fs_setflags_common(struct inode *inode, u32 iflags, u32 mask)
> > return -EPERM;
> >
> > oldflags = fi->i_flags;
> > + if ((iflags ^ oldflags) & F2FS_CASEFOLD_FL) {
> > + if (!f2fs_sb_has_casefold(F2FS_I_SB(inode)))
> > + return -EOPNOTSUPP;
> > +
> > + if (!S_ISDIR(inode->i_mode))
> > + return -ENOTDIR;
> >
> > + if (!f2fs_empty_dir(inode))
> > + return -ENOTEMPTY;
> > + }
Modified like this:
@@ -1665,6 +1665,13 @@ static int f2fs_setflags_common(struct inode *inode, u32 iflags, u32 mask)
if (IS_NOQUOTA(inode))
return -EPERM;
+ if ((iflags ^ fi->i_flags) & F2FS_CASEFOLD_FL) {
+ if (!f2fs_sb_has_casefold(F2FS_I_SB(inode)))
+ return -EOPNOTSUPP;
+ if (!f2fs_empty_dir(inode))
+ return -ENOTEMPTY;
+ }
+
Note that, directory is checked by above change.
I've uploaded in f2fs.git, so could you check it out and test a bit?
Thanks,
>
> I applied the patches based on last Jaegeuk's dev branch, it seems we needs to
> adjust above code a bit. Otherwise it looks good to me.
>
> BTW, it looks the patchset works fine with generic/556 testcase.
>
> Thanks,
^ permalink raw reply
* Re: [PATCH v7 16/16] fscrypt: document the new ioctls and policy version
From: Theodore Y. Ts'o @ 2019-07-29 2:00 UTC (permalink / raw)
To: Eric Biggers
Cc: Satya Tangirala, linux-api, linux-f2fs-devel, linux-fscrypt,
keyrings, linux-mtd, linux-crypto, linux-fsdevel, linux-ext4,
Paul Crowley
In-Reply-To: <20190726224141.14044-17-ebiggers@kernel.org>
On Fri, Jul 26, 2019 at 03:41:41PM -0700, Eric Biggers wrote:
> +- The kernel cannot magically wipe copies of the master key(s) that
> + userspace might have as well. Therefore, userspace must wipe all
> + copies of the master key(s) it makes as well. Naturally, the same
> + also applies to all higher levels in the key hierarchy. Userspace
> + should also follow other security precautions such as mlock()ing
> + memory containing keys to prevent it from being swapped out.
Normally, shouldn't userspace have wiped all copies of the master key
after they have called ADD_KEY? Why should they be left hanging
around? Waiting until REMOVE_KEY to remove other copies of the master
key seems.... late.
> +- In general, decrypted contents and filenames in the kernel VFS
> + caches are freed but not wiped. Therefore, portions thereof may be
> + recoverable from freed memory, even after the corresponding key(s)
> + were wiped. To partially solve this, you can set
> + CONFIG_PAGE_POISONING=y in your kernel config and add page_poison=1
> + to your kernel command line. However, this has a performance cost.
... and even this won't help if you have swap configured....
> +v1 encryption policies have some weaknesses with respect to online
> +attacks:
> +
> +- There is no verification that the provided master key is correct.
> + Consequently, malicious users can associate the wrong key with
> + encrypted files, even files to which they have only read-only
> + access.
Yes, but they won't be able to trick other users into using that
incorrect key. With the old interface, it gets written into the
user's session keyring, which won't get used by another user. And
with the newer interface, only root is allowed to set v1 key.
> +Master keys should be pseudorandom, i.e. indistinguishable from random
> +bytestrings of the same length. This implies that users **must not**
> +directly use a password as a master key, zero-pad a shorter key, or
> +repeat a shorter key.
These paragraphs starts a bit funny, since we first say "should" in
the first sentence, and then it's followed up by "**must not**" in the
second sentence. Basically, they *could* do this, but it would just
weaken the security of the system significantly.
At the very least, we should explain the basis of the recommendation.
> +The KDF used for a particular master key differs depending on whether
> +the key is used for v1 encryption policies or for v2 encryption
> +policies. Users **must not** use the same key for both v1 and v2
> +encryption policies.
"Must not" seems a bit strong. If they do, and a v1 per-file key and
nonce leaks out, then the encryption key will be compromised. So the
strength of the key will be limited by the weaknesses of the v1
scheme. But it's not like using a that was originally meant for v1,
and then using it for v2, causes any additional weakness. Right?
- Ted
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox