* [PATCH v17 4/7] powerpc: add pmd_[dirty|mkclean] for THP
From: Minchan Kim @ 2014-10-20 10:12 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
Jason Evans, zhangyanfei, Kirill A. Shutemov, Minchan Kim,
Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev,
Aneesh Kumar K.V
In-Reply-To: <1413799924-17946-1-git-send-email-minchan@kernel.org>
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
overwrite of the contents since MADV_FREE syscall is called for
THP page.
This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 889c6fa9ee01..7c07e5975871 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
#define pmd_pfn(pmd) pte_pfn(pmd_pte(pmd))
#define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
--
2.0.0
^ permalink raw reply related
* [PATCH v17 3/7] sparc: add pmd_[dirty|mkclean] for THP
From: Minchan Kim @ 2014-10-20 10:12 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michael Kerrisk,
linux-api-u79uwXL29TY76Z2rM5mHXA, Hugh Dickins, Johannes Weiner,
Rik van Riel, KOSAKI Motohiro, Mel Gorman, Jason Evans,
zhangyanfei-BthXqXjhjHXQFUHtdCDX3A, Kirill A. Shutemov,
Minchan Kim, sparclinux-u79uwXL29TY76Z2rM5mHXA, David S. Miller
In-Reply-To: <1413799924-17946-1-git-send-email-minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
overwrite of the contents since MADV_FREE syscall is called for
THP page.
This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.
Acked-by: David S. Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Cc: sparclinux-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Minchan Kim <minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
arch/sparc/include/asm/pgtable_64.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 3770bf5c6e1b..b80a309d7e00 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -666,6 +666,13 @@ static inline unsigned long pmd_young(pmd_t pmd)
return pte_young(pte);
}
+static inline int pmd_dirty(pmd_t pmd)
+{
+ pte_t pte = __pte(pmd_val(pmd));
+
+ return pte_dirty(pte);
+}
+
static inline unsigned long pmd_write(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
@@ -723,6 +730,15 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
return __pmd(pte_val(pte));
}
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ pte_t pte = __pte(pmd_val(pmd));
+
+ pte = pte_mkclean(pte);
+
+ return __pmd(pte_val(pte));
+}
+
static inline pmd_t pmd_mkyoung(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
--
2.0.0
^ permalink raw reply related
* [PATCH v17 2/7] x86: add pmd_[dirty|mkclean] for THP
From: Minchan Kim @ 2014-10-20 10:11 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
Jason Evans, zhangyanfei, Kirill A. Shutemov, Minchan Kim,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
Kirill A. Shutemov
In-Reply-To: <1413799924-17946-1-git-send-email-minchan@kernel.org>
MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent
overwrite of the contents since MADV_FREE syscall is called for
THP page.
This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
arch/x86/include/asm/pgtable.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index aa97a070f09f..2259de0ccd79 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -104,6 +104,11 @@ static inline int pmd_young(pmd_t pmd)
return pmd_flags(pmd) & _PAGE_ACCESSED;
}
+static inline int pmd_dirty(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_DIRTY;
+}
+
static inline int pte_write(pte_t pte)
{
return pte_flags(pte) & _PAGE_RW;
@@ -272,6 +277,11 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
return pmd_clear_flags(pmd, _PAGE_ACCESSED);
}
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_DIRTY);
+}
+
static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_RW);
--
2.0.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related
* [PATCH v17 1/7] mm: support madvise(MADV_FREE)
From: Minchan Kim @ 2014-10-20 10:11 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
Jason Evans, zhangyanfei, Kirill A. Shutemov, Minchan Kim,
Kirill A. Shutemov
In-Reply-To: <1413799924-17946-1-git-send-email-minchan@kernel.org>
Linux doesn't have an ability to free pages lazy while other OS
already have been supported that named by madvise(MADV_FREE).
The gain is clear that kernel can discard freed pages rather than
swapping out or OOM if memory pressure happens.
Without memory pressure, freed pages would be reused by userspace
without another additional overhead(ex, page fault + allocation
+ zeroing).
How to work is following as.
When madvise syscall is called, VM clears dirty bit of ptes of
the range. If memory pressure happens, VM checks dirty bit of
page table and if it found still "clean", it means it's a
"lazyfree pages" so VM could discard the page instead of swapping out.
Once there was store operation for the page before VM peek a page
to reclaim, dirty bit is set so VM can swap out the page instead of
discarding.
Firstly, heavy users would be general allocators(ex, jemalloc,
tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
have supported the feature for other OS(ex, FreeBSD)
barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 12
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 2
Stepping: 3
CPU MHz: 3200.185
BogoMIPS: 6400.53
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
ebizzy benchmark(./ebizzy -S 10 -n 512)
Higher avg is better.
vanilla-jemalloc MADV_free-jemalloc
1 thread
records: 10 records: 10
avg: 2961.90 avg: 12069.70
std: 71.96(2.43%) std: 186.68(1.55%)
max: 3070.00 max: 12385.00
min: 2796.00 min: 11746.00
2 thread
records: 10 records: 10
avg: 5020.00 avg: 17827.00
std: 264.87(5.28%) std: 358.52(2.01%)
max: 5244.00 max: 18760.00
min: 4251.00 min: 17382.00
4 thread
records: 10 records: 10
avg: 8988.80 avg: 27930.80
std: 1175.33(13.08%) std: 3317.33(11.88%)
max: 9508.00 max: 30879.00
min: 5477.00 min: 21024.00
8 thread
records: 10 records: 10
avg: 13036.50 avg: 33739.40
std: 170.67(1.31%) std: 5146.22(15.25%)
max: 13371.00 max: 40572.00
min: 12785.00 min: 24088.00
16 thread
records: 10 records: 10
avg: 11092.40 avg: 31424.20
std: 710.60(6.41%) std: 3763.89(11.98%)
max: 12446.00 max: 36635.00
min: 9949.00 min: 25669.00
32 thread
records: 10 records: 10
avg: 11067.00 avg: 34495.80
std: 971.06(8.77%) std: 2721.36(7.89%)
max: 12010.00 max: 38598.00
min: 9002.00 min: 30636.00
In summary, MADV_FREE is about much faster than MADV_DONTNEED.
* from v16
* Rebased on mmotm-2014-10-15-16-57
* from v15
* Add more Acked-by - Rik van Riel
* Rebased on mmotom-08-29-15-15
* from v14
* Add more Ackedy-by from arch people(sparc, arm64 and arm)
* Drop s390 since pmd_dirty/clean was merged
* from v13
* Add more Ackedy-by from arch people(arm, arm64 and ppc)
* Rebased on mmotm 2014-08-13-14-29
* from v12
* Fix - skip to mark free pte on try_to_free_swap failed page - Kirill
* Add more Acked-by from arch maintainers and Kirill
* From v11
* Fix arm build - Steve
* Separate patch for arm and arm64 - Steve
* Remove unnecessary check - Kirill
* Skip non-vm_normal page - Kirill
* Add Acked-by - Zhang
* Sparc64 build fix
* Pagetable walker THP handling fix
* From v10
* Add Acked-by from arch stuff(x86, s390)
* Pagewalker based pagetable working - Kirill
* Fix try_to_unmap_one broken with hwpoison - Kirill
* Use VM_BUG_ON_PAGE in madvise_free_pmd - Kirill
* Fix pgtable-3level.h for arm - Steve
* From v9
* Add Acked-by - Rik
* Add THP page support - Kirill
* From v8
* Rebased-on v3.16-rc2-mmotm-2014-06-25-16-44
* From v7
* Rebased-on next-20140613
* From v6
* Remove page from swapcache in syscal time
* Move utility functions from memory.c to madvise.c - Johannes
* Rename untilify functtions - Johannes
* Remove unnecessary checks from vmscan.c - Johannes
* Rebased-on v3.15-rc5-mmotm-2014-05-16-16-56
* Drop Reviewe-by because there was some changes since then.
* From v5
* Fix PPC problem which don't flush TLB - Rik
* Remove unnecessary lazyfree_range stub function - Rik
* Rebased on v3.15-rc5
* From v4
* Add Reviewed-by: Zhang Yanfei
* Rebase on v3.15-rc1-mmotm-2014-04-15-16-14
* From v3
* Add "how to work part" in description - Zhang
* Add page_discardable utility function - Zhang
* Clean up
* From v2
* Remove forceful dirty marking of swap-readed page - Johannes
* Remove deactivation logic of lazyfreed page
* Rebased on 3.14
* Remove RFC tag
* From v1
* Use custom page table walker for madvise_free - Johannes
* Remove PG_lazypage flag - Johannes
* Do madvise_dontneed instead of madvise_freein swapless system
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jason Evans <je@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/rmap.h | 9 ++-
include/linux/vm_event_item.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 140 +++++++++++++++++++++++++++++++++
mm/rmap.c | 42 +++++++++-
mm/vmscan.c | 40 ++++++++--
mm/vmstat.c | 1 +
7 files changed, 222 insertions(+), 12 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c0c2bce6b0b7..94d5bcacc83e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -75,6 +75,7 @@ enum ttu_flags {
TTU_UNMAP = 1, /* unmap mode */
TTU_MIGRATION = 2, /* migration mode */
TTU_MUNLOCK = 4, /* munlock mode */
+ TTU_FREE = 8, /* free mode */
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
@@ -181,7 +182,8 @@ static inline void page_dup_rmap(struct page *page)
* Called from mm/vmscan.c to handle paging out
*/
int page_referenced(struct page *, int is_locked,
- struct mem_cgroup *memcg, unsigned long *vm_flags);
+ struct mem_cgroup *memcg, unsigned long *vm_flags,
+ int *is_pte_dirty);
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
@@ -260,9 +262,12 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
static inline int page_referenced(struct page *page, int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_pte_dirty)
{
*vm_flags = 0;
+ if (is_pte_dirty)
+ *is_pte_dirty = 0;
return 0;
}
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 730334cdf037..8be4582ac3ff 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ PGLAZYFREED,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36f1046..7a94102b7a02 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30da4ab..a21584235bb6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -19,6 +19,14 @@
#include <linux/blkdev.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>
+
+struct madvise_free_private {
+ struct vm_area_struct *vma;
+ struct mmu_gather *tlb;
+};
/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -31,6 +39,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
return 0;
default:
/* be safe, default to 1. list exceptions explicitly */
@@ -251,6 +260,128 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}
+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+
+{
+ struct madvise_free_private *fp = walk->private;
+ struct mmu_gather *tlb = fp->tlb;
+ struct mm_struct *mm = tlb->mm;
+ struct vm_area_struct *vma = fp->vma;
+ spinlock_t *ptl;
+ pte_t *pte, ptent;
+ struct page *page;
+
+ split_huge_page_pmd(vma, addr, pmd);
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (PageSwapCache(page)) {
+ if (!trylock_page(page))
+ continue;
+
+ if (!try_to_free_swap(page)) {
+ unlock_page(page);
+ continue;
+ }
+
+ ClearPageDirty(page);
+ unlock_page(page);
+ }
+
+ /*
+ * Some of architecture(ex, PPC) don't update TLB
+ * with set_pte_at and tlb_remove_tlb_entry so for
+ * the portability, remap the pte with old|clean
+ * after pte clearing.
+ */
+ ptent = ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+ ptent = pte_mkold(ptent);
+ ptent = pte_mkclean(ptent);
+ set_pte_at(mm, addr, pte, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ }
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+ return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct madvise_free_private fp = {
+ .vma = vma,
+ .tlb = tlb,
+ };
+
+ struct mm_walk free_walk = {
+ .pmd_entry = madvise_free_pte_range,
+ .mm = vma->vm_mm,
+ .private = &fp,
+ };
+
+ BUG_ON(addr >= end);
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &free_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ unsigned long start, end;
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ /* MADV_FREE works for only anon vma at the moment */
+ if (vma->vm_file)
+ return -EINVAL;
+
+ start = max(vma->vm_start, start_addr);
+ if (start >= vma->vm_end)
+ return -EINVAL;
+ end = min(vma->vm_end, end_addr);
+ if (end <= vma->vm_start)
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start, end);
+ update_hiwater_rss(mm);
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ madvise_free_page_range(&tlb, vma, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+ tlb_finish_mmu(&tlb, start, end);
+
+ return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ *prev = vma;
+ return madvise_free_single_vma(vma, start, end);
+}
+
/*
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
@@ -381,6 +512,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_FREE:
+ /*
+ * XXX: In this implementation, MADV_FREE works like
+ * MADV_DONTNEED on swapless system or full swap.
+ */
+ if (get_nr_swap_pages() > 0)
+ return madvise_free(vma, prev, start, end);
+ /* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
default:
@@ -400,6 +539,7 @@ madvise_behavior_valid(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index 5fbd0fe8f933..93149c82a5a4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -663,6 +663,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
}
struct page_referenced_arg {
+ int dirtied;
int mapcount;
int referenced;
unsigned long vm_flags;
@@ -677,6 +678,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
int referenced = 0;
+ int dirty = 0;
struct page_referenced_arg *pra = arg;
if (unlikely(PageTransHuge(page))) {
@@ -700,6 +702,11 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
/* go ahead even if the pmd is pmd_trans_splitting() */
if (pmdp_clear_flush_young_notify(vma, address, pmd))
referenced++;
+
+ /*
+ * In this implmentation, MADV_FREE doesn't support THP free
+ */
+ dirty++;
spin_unlock(ptl);
} else {
pte_t *pte;
@@ -729,6 +736,10 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
if (likely(!(vma->vm_flags & VM_SEQ_READ)))
referenced++;
}
+
+ if (pte_dirty(*pte))
+ dirty++;
+
pte_unmap_unlock(pte, ptl);
}
@@ -737,6 +748,9 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
pra->vm_flags |= vma->vm_flags;
}
+ if (dirty)
+ pra->dirtied++;
+
pra->mapcount--;
if (!pra->mapcount)
return SWAP_SUCCESS; /* To break the loop */
@@ -761,6 +775,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
* @is_locked: caller holds lock on the page
* @memcg: target memory cgroup
* @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page
*
* Quick test_and_clear_referenced for all mappings to a page,
* returns the number of ptes which referenced the page.
@@ -768,7 +783,8 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
int page_referenced(struct page *page,
int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_pte_dirty)
{
int ret;
int we_locked = 0;
@@ -783,6 +799,9 @@ int page_referenced(struct page *page,
};
*vm_flags = 0;
+ if (is_pte_dirty)
+ *is_pte_dirty = 0;
+
if (!page_mapped(page))
return 0;
@@ -810,6 +829,9 @@ int page_referenced(struct page *page,
if (we_locked)
unlock_page(page);
+ if (is_pte_dirty)
+ *is_pte_dirty = pra.dirtied;
+
return pra.referenced;
}
@@ -1128,6 +1150,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
spinlock_t *ptl;
int ret = SWAP_AGAIN;
enum ttu_flags flags = (enum ttu_flags)arg;
+ int dirty = 0;
pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
@@ -1157,7 +1180,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
pteval = ptep_clear_flush(vma, address, pte);
/* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
+ dirty = pte_dirty(pteval);
+ if (dirty)
set_page_dirty(page);
/* Update high watermark before we lower rss */
@@ -1186,6 +1210,19 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;
+ if (flags & TTU_FREE) {
+ VM_BUG_ON_PAGE(PageSwapCache(page), page);
+ if (!dirty && !PageDirty(page)) {
+ /* It's a freeable page by MADV_FREE */
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ } else {
+ set_pte_at(mm, address, pte, pteval);
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+ }
+
if (PageSwapCache(page)) {
/*
* Store the swap location in the pte.
@@ -1227,6 +1264,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
} else
dec_mm_counter(mm, MM_FILEPAGES);
+discard:
page_remove_rmap(page);
page_cache_release(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4636d9e822c1..8f67765ebb77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -712,13 +712,17 @@ enum page_references {
};
static enum page_references page_check_references(struct page *page,
- struct scan_control *sc)
+ struct scan_control *sc,
+ bool *freeable)
{
int referenced_ptes, referenced_page;
unsigned long vm_flags;
+ int pte_dirty;
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
- &vm_flags);
+ &vm_flags, &pte_dirty);
referenced_page = TestClearPageReferenced(page);
/*
@@ -759,6 +763,10 @@ static enum page_references page_check_references(struct page *page,
return PAGEREF_KEEP;
}
+ if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
+ !PageDirty(page))
+ *freeable = true;
+
/* Reclaim if clean, defer dirty pages to writeback */
if (referenced_page && !PageSwapBacked(page))
return PAGEREF_RECLAIM_CLEAN;
@@ -827,6 +835,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
int may_enter_fs;
enum page_references references = PAGEREF_RECLAIM_CLEAN;
bool dirty, writeback;
+ bool freeable = false;
cond_resched();
@@ -950,7 +959,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
if (!force_reclaim)
- references = page_check_references(page, sc);
+ references = page_check_references(page, sc,
+ &freeable);
switch (references) {
case PAGEREF_ACTIVATE:
@@ -966,7 +976,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
*/
- if (PageAnon(page) && !PageSwapCache(page)) {
+ if (PageAnon(page) && !PageSwapCache(page) && !freeable) {
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
if (!add_to_swap(page, page_list))
@@ -981,8 +991,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
*/
- if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page, ttu_flags)) {
+ if (page_mapped(page) && (mapping || freeable)) {
+ switch (try_to_unmap(page,
+ freeable ? TTU_FREE : ttu_flags)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -990,7 +1001,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
case SWAP_MLOCK:
goto cull_mlocked;
case SWAP_SUCCESS:
- ; /* try to free the page below */
+ /* try to free the page below */
+ if (!freeable)
+ break;
+ /*
+ * Freeable anon page doesn't have mapping
+ * due to skipping of swapcache so we free
+ * page in here rather than __remove_mapping.
+ */
+ VM_BUG_ON_PAGE(PageSwapCache(page), page);
+ if (!page_freeze_refs(page, 1))
+ goto keep_locked;
+ __clear_page_locked(page);
+ count_vm_event(PGLAZYFREED);
+ goto free_it;
}
}
@@ -1730,7 +1754,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
}
if (page_referenced(page, 0, sc->target_mem_cgroup,
- &vm_flags)) {
+ &vm_flags, NULL)) {
nr_rotated += hpage_nr_pages(page);
/*
* Identify referenced, file-backed active pages and
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1b12d390dc68..87b6c83ce717 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -814,6 +814,7 @@ const char * const vmstat_text[] = {
"pgfault",
"pgmajfault",
+ "pglazyfreed",
TEXTS_FOR_ZONES("pgrefill")
TEXTS_FOR_ZONES("pgsteal_kswapd")
--
2.0.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related
* [PATCH v17 0/7] MADV_FREE support
From: Minchan Kim @ 2014-10-20 10:11 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
Jason Evans, zhangyanfei, Kirill A. Shutemov, Minchan Kim
This patch enable MADV_FREE hint for madvise syscall, which have
been supported by other OSes. [PATCH 1] includes the details.
[1] support MADVISE_FREE for !THP page so if VM encounter
THP page in syscall context, it splits THP page.
[2-6] is to preparing to call madvise syscall without THP plitting
[7] enable THP page support for MADV_FREE.
* from v16
* Rebased on mmotm-2014-10-15-16-57
* from v15
* Add more Acked-by - Rik van Riel
* Rebased on mmotom-08-29-15-15
* from v14
* Add more Ackedy-by from arch people(sparc, arm64 and arm)
* Drop s390 since pmd_dirty/clean was merged
* from v13
* Add more Ackedy-by from arch people(arm, arm64 and ppc)
* Rebased on mmotm 2014-08-13-14-29
* from v12
* Fix - skip to mark free pte on try_to_free_swap failed page - Kirill
* Add more Acked-by from arch maintainers and Kirill
* From v11
* Fix arm build - Steve
* Separate patch for arm and arm64 - Steve
* Remove unnecessary check - Kirill
* Skip non-vm_normal page - Kirill
* Add Acked-by - Zhang
* Sparc64 build fix
* Pagetable walker THP handling fix
* From v10
* Add Acked-by from arch stuff(x86, s390)
* Pagewalker based pagetable working - Kirill
* Fix try_to_unmap_one broken with hwpoison - Kirill
* Use VM_BUG_ON_PAGE in madvise_free_pmd - Kirill
* Fix pgtable-3level.h for arm - Steve
* From v9
* Add Acked-by - Rik
* Add THP page support - Kirill
* From v8
* Rebased-on v3.16-rc2-mmotm-2014-06-25-16-44
* From v7
* Rebased-on next-20140613
* From v6
* Remove page from swapcache in syscal time
* Move utility functions from memory.c to madvise.c - Johannes
* Rename untilify functtions - Johannes
* Remove unnecessary checks from vmscan.c - Johannes
* Rebased-on v3.15-rc5-mmotm-2014-05-16-16-56
* Drop Reviewe-by because there was some changes since then.
* From v5
* Fix PPC problem which don't flush TLB - Rik
* Remove unnecessary lazyfree_range stub function - Rik
* Rebased on v3.15-rc5
* From v4
* Add Reviewed-by: Zhang Yanfei
* Rebase on v3.15-rc1-mmotm-2014-04-15-16-14
* From v3
* Add "how to work part" in description - Zhang
* Add page_discardable utility function - Zhang
* Clean up
* From v2
* Remove forceful dirty marking of swap-readed page - Johannes
* Remove deactivation logic of lazyfreed page
* Rebased on 3.14
* Remove RFC tag
* From v1
* Use custom page table walker for madvise_free - Johannes
* Remove PG_lazypage flag - Johannes
* Do madvise_dontneed instead of madvise_freein swapless system
Minchan Kim (7):
mm: support madvise(MADV_FREE)
x86: add pmd_[dirty|mkclean] for THP
sparc: add pmd_[dirty|mkclean] for THP
powerpc: add pmd_[dirty|mkclean] for THP
arm: add pmd_mkclean for THP
arm64: add pmd_[dirty|mkclean] for THP
mm: Don't split THP page when syscall is called
arch/arm/include/asm/pgtable-3level.h | 1 +
arch/arm64/include/asm/pgtable.h | 2 +
arch/powerpc/include/asm/pgtable-ppc64.h | 2 +
arch/sparc/include/asm/pgtable_64.h | 16 ++++
arch/x86/include/asm/pgtable.h | 10 ++
include/linux/huge_mm.h | 4 +
include/linux/rmap.h | 9 +-
include/linux/vm_event_item.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/huge_memory.c | 35 +++++++
mm/madvise.c | 159 +++++++++++++++++++++++++++++++
mm/rmap.c | 46 ++++++++-
mm/vmscan.c | 64 +++++++++----
mm/vmstat.c | 1 +
14 files changed, 331 insertions(+), 20 deletions(-)
--
2.0.0
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH] staging: android: binder: move to the "real" part of the kernel
From: Dan Carpenter @ 2014-10-20 9:20 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: devel, Anup Patel, linux-api, linux-kernel,
Arve Hjønnevåg, John Stultz, Rebecca Schultz Zavin,
Santosh Shilimkar, Sumit Semwal, Christoffer Dall
In-Reply-To: <20141019220547.GC3780@kroah.com>
On Mon, Oct 20, 2014 at 06:05:47AM +0800, Greg Kroah-Hartman wrote:
> On Fri, Oct 17, 2014 at 12:26:01PM +0300, Dan Carpenter wrote:
> > The code isn't very beautiful and there are lots of details wrong like
> > the error codes.
>
> Really, what is wrong with the existing error code usages?
>
I guess I was mostly looking at binder_ioctl(), the rest of the code
seems better.
I feel like we went directly from "This code is never going in so there
is no need to look at it." to "This code is going in in one week so
there is no time to look at it."
How often do people rely on proc_no_lock=1 to make this work? People
are saying on the internet that you don't need acurate information so
you should disable locking as a speed up.
http://www.programdevelop.com/1821706/. It seems like a bad idea to
provide provide the "run fast and crash" option.
Why is binder_set_nice needed? Some comments would help here.
434 static void binder_set_nice(long nice)
435 {
436 long min_nice;
437
438 if (can_nice(current, nice)) {
439 set_user_nice(current, nice);
440 return;
441 }
442 min_nice = rlimit_to_nice(current->signal->rlim[RLIMIT_NICE].rlim_cur);
443 binder_debug(BINDER_DEBUG_PRIORITY_CAP,
444 "%d: nice value %ld not allowed use %ld instead\n",
445 current->pid, nice, min_nice);
446 set_user_nice(current, min_nice);
447 if (min_nice <= MAX_NICE)
448 return;
It's harmless but wierd to check if min_nice is valid after we already
called set_user_nice().
449 binder_user_error("%d RLIMIT_NICE not set\n", current->pid);
450 }
Random comment:
709 has_page_addr =
710 (void *)(((uintptr_t)buffer->data + buffer_size) & PAGE_MASK);
The casting on "buffer->data" ugly and inconsistent. There should be
a function:
void *buffer_data(struct binder_buffer *buffer)
{
return buffer.data;
}
That way this becomes:
has_page_addr = (buffer_data(buffer) + buffer_size) & PAGE_MASK);
The "has_page_addr" variable name is misleading, because we're not
storing true/false. We're storing the last page address.
711 if (n == NULL) {
712 if (size + sizeof(struct binder_buffer) + 4 >= buffer_size)
713 buffer_size = size; /* no room for other buffers */
714 else
715 buffer_size = size + sizeof(struct binder_buffer);
716 }
717 end_page_addr =
718 (void *)PAGE_ALIGN((uintptr_t)buffer->data + buffer_size);
719 if (end_page_addr > has_page_addr)
720 end_page_addr = has_page_addr;
721 if (binder_update_page_range(proc, 1,
722 (void *)PAGE_ALIGN((uintptr_t)buffer->data), end_page_addr, NULL))
723 return NULL;
The alignment here is confusing because we don't align buffer->data
below.
724
725 rb_erase(best_fit, &proc->free_buffers);
726 buffer->free = 0;
727 binder_insert_allocated_buffer(proc, buffer);
728 if (buffer_size != size) {
729 struct binder_buffer *new_buffer = (void *)buffer->data + size;
^^^^^^^^^^^^^^^^^^^^
I don't really understand when buffer->data has to be page aligned and
when not. This code has very few comments.
730
731 list_add(&new_buffer->entry, &buffer->entry);
732 new_buffer->free = 1;
733 binder_insert_free_buffer(proc, new_buffer);
734 }
Does the stop_on_user_error option work? There should be some
documentation for this stuff.
regards,
dan carpenter
^ permalink raw reply
* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Eric W.Biederman @ 2014-10-20 4:55 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Serge E. Hallyn, Aditya Kali, Linux API, Linux Containers,
Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar
In-Reply-To: <CALCETrUC=yW72d2hDzjESmZAt85x1WcGz4L-DrtY5YXAQxbpMA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
><ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>>
>>> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
>wrote:
>>>> > Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>>> >> setns on a cgroup namespace is allowed only if
>>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>>> >> over the user-namespace associated with target cgroupns.
>>>> >> * task's current cgroup is descendent of the target
>cgroupns-root
>>>> >> cgroup.
>>>> >
>>>> > What is the point of this?
>>>> >
>>>> > If I'm a user logged into
>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>>> > a container which is in
>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>>> > then I will want to be able to enter the container's cgroup.
>>>> > The container's cgroup root is under my own (satisfying the
>>>> > below condition0 but my cgroup is not a descendent of the
>>>> > container's cgroup.
>>>> >
>>>> This condition is there because we don't want to do implicit cgroup
>>>> changes when a process attaches to another cgroupns. cgroupns tries
>to
>>>> preserve the invariant that at any point, your current cgroup is
>>>> always under the cgroupns-root of your cgroup namespace. But in
>your
>>>> example, if we allow a process in "session-c12.scope" container to
>>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>>> (without implicitly moving its cgroup), then this invariant won't
>>>> hold.
>>>
>>> Oh, I see. Guess that should be workable. Thanks.
>>
>> Which has me looking at what the rules are for moving through
>> the cgroup hierarchy.
>>
>> As long as we have write access to cgroup.procs and are allowed
>> to open the file for write, we can move any of our own tasks
>> into the cgroup. So the cgroup namespace rules don't seem
>> to be a problem.
>>
>> Andy can you please take a look at the permission checks in
>> __cgroup_procs_write.
>
>The actual requirements for calling that function haven't changed,
>right? IOW, what does this have to do with cgroupns?
Excluding user namespaces the requirements have not changed.
The immediate correlation is that to enter a cgroupns you must first put your process in one of it's cgroups.
So I was examining what it would take to enter the cgroup of cgroupns.
> Is the idea
>that you want a privileged user wrt a cgroupns's userns to be able to
>use this? If so:
>
>Yes, that current_cred() thing is bogus. (Actually, this is probably
>exploitable right now if any cgroup.procs inode anywhere on the system
>lets non-root write.) (Can we have some kernel debugging option that
>makes any use of current_cred() in write(2) warn?)
>
>We really need a weaker version of may_ptrace for this kind of stuff.
>Maybe the existing may_ptrace stuff is okay, actually. But this is
>completely missing group checks, cap checks, capabilities wrt the
>userns, etc.
>
>Also, I think that, if this version of the patchset allows non-init
>userns to unshare cgroupns, then the issue of what permission is
>needed to lock the cgroup hierarchy like that needs to be addressed,
>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>the calling task with no permission required. Bolting on a fix later
>will be a mess.
I imagine the pinning would be like the userns.
Ah but there is a potentially serious issue with the pinning.
With pinning we can make it impossible for root to move us to a different cgroup.
I am not certain how serious that is but it bears thinking about.
If we don't implement pinning we should be able to implent everything with just filesystem mount options, and no new namespace required.
Sigh.
I am too tired tonight to see the end game in this.
Eric
>> As I read the code I see 3 security gaffaws in the permssion check.
>> - Using current->cred instead of file->f_cred.
>> - Not checking tcred->euid.
>> - Checking GLOBAL_ROOT_UID instead of having a capable call.
>>
>> The file permission on cgroup.procs seem just sufficient to keep
>> to keep those bugs from being easily exploitable.
>>
>> Eric
^ permalink raw reply
* Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: Andy Lutomirski @ 2014-10-19 23:35 UTC (permalink / raw)
To: Al Viro
Cc: David Drysdale, Eric W. Biederman, Meredydd Luff,
linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Andrew Morton, Kees Cook, Arnd Bergmann, X86 ML,
linux-arch, Linux API
In-Reply-To: <20141019224251.GJ7996@ZenIV.linux.org.uk>
On Sun, Oct 19, 2014 at 3:42 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Oct 19, 2014 at 03:16:03PM -0700, Andy Lutomirski wrote:
>
>> Oh, you mean that #!/usr/bin/make -f would turn into /usr/bin/make
>> /dev/fd/3? That could be interesting, although I can imagine it
>> breaking things, especially if /dev/fd/3 isn't set up like that, e.g.
>> early in boot.
>
> Sigh... What I mean is that fexecve(fd, ...) would have to put _something_
> into argv when it execs the interpreter of #! file. Simply because the
> interpreter (which can be anything whatsoever) has no fscking idea what
> to do with some descriptor it has before execve(). Hell, it doesn't have
> any idea *which* descriptor had it been.
>
> You need to put some pathname that would yield your script upon open(2).
> If you bothered to read those patches, you'd see that they do supply
> one, generating it with d_path(). Which isn't particulary reliable.
>
> I'm not sure there's any point putting any of that in the kernel - if
> you *do* have that pathname, you can just pass it.
Hmm.
This issue certainly makes fexecve or execveat less attractive, at
least in cases where d_path won't work.
On the other hand, if you want to run a static binary on a cloexec fd
(or, for that matter, a dynamic binary if you trust the interpreter to
close the extra copy of the fd it gets) in a namespace or chroot where
the binary is invisible, then you need kernel help.
It's too bad that script interpreters don't have a mechanism to open
their scripts by fd.
>
>> Aside from the general scariness of allowing one process to actually
>> dup another process's fds, I feel like this is asking for trouble wrt
>> the various types of file locks.
>
> Who said anything about another process's fds? That, indeed, would be
> a recipe for serious trouble. It's a filesystem with one directory,
> not with one directory for each process...
>
This still has issues with locks if you pass an fd to a child process,
but I guess that you get what you ask for if you do that.
> FWIW, they (Plan 9) do have procfs and there they have /proc/<pid>/fd.
> Which is a regular file, with contents consisting of \n-terminated
> lines (one per descriptor in <pid>'s descriptor table>) in the same
> format as in *ctl (they put descriptor number as the first field in
> those).
>
> Unlike our solution, they do not allow to get to any process' files via
> procfs. They do allow /dev/stdin-style access to your own files via
> dupfs. And yes, for /dev/stdin and friends dup-style semantics is better -
> you get consistent behaviours for pipes and redirects from file that way.
> See the example I've posted upthread. Besides, for things like sockets
> our semantics simply fails - they really depend on having only one
> struct file for given socket; it's dup or nothing there. The same goes
> for a lot of things like eventfd, etc.
Fair enough.
--Andy
^ permalink raw reply
* Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: Al Viro @ 2014-10-19 22:42 UTC (permalink / raw)
To: Andy Lutomirski
Cc: David Drysdale, Eric W. Biederman, Meredydd Luff,
linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Andrew Morton, Kees Cook, Arnd Bergmann, X86 ML,
linux-arch, Linux API
In-Reply-To: <CALCETrVTiO6YKRYAAe-eBgQvpUqrtE0=KrKJj2T_XJ9iSRDtmA@mail.gmail.com>
On Sun, Oct 19, 2014 at 03:16:03PM -0700, Andy Lutomirski wrote:
> Oh, you mean that #!/usr/bin/make -f would turn into /usr/bin/make
> /dev/fd/3? That could be interesting, although I can imagine it
> breaking things, especially if /dev/fd/3 isn't set up like that, e.g.
> early in boot.
Sigh... What I mean is that fexecve(fd, ...) would have to put _something_
into argv when it execs the interpreter of #! file. Simply because the
interpreter (which can be anything whatsoever) has no fscking idea what
to do with some descriptor it has before execve(). Hell, it doesn't have
any idea *which* descriptor had it been.
You need to put some pathname that would yield your script upon open(2).
If you bothered to read those patches, you'd see that they do supply
one, generating it with d_path(). Which isn't particulary reliable.
I'm not sure there's any point putting any of that in the kernel - if
you *do* have that pathname, you can just pass it.
> Aside from the general scariness of allowing one process to actually
> dup another process's fds, I feel like this is asking for trouble wrt
> the various types of file locks.
Who said anything about another process's fds? That, indeed, would be
a recipe for serious trouble. It's a filesystem with one directory,
not with one directory for each process...
FWIW, they (Plan 9) do have procfs and there they have /proc/<pid>/fd.
Which is a regular file, with contents consisting of \n-terminated
lines (one per descriptor in <pid>'s descriptor table>) in the same
format as in *ctl (they put descriptor number as the first field in
those).
Unlike our solution, they do not allow to get to any process' files via
procfs. They do allow /dev/stdin-style access to your own files via
dupfs. And yes, for /dev/stdin and friends dup-style semantics is better -
you get consistent behaviours for pipes and redirects from file that way.
See the example I've posted upthread. Besides, for things like sockets
our semantics simply fails - they really depend on having only one
struct file for given socket; it's dup or nothing there. The same goes
for a lot of things like eventfd, etc.
^ permalink raw reply
* Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: Andy Lutomirski @ 2014-10-19 22:16 UTC (permalink / raw)
To: Al Viro
Cc: David Drysdale, Eric W. Biederman, Meredydd Luff,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API
In-Reply-To: <20141019212921.GI7996-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
On Sun, Oct 19, 2014 at 2:29 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> On Sun, Oct 19, 2014 at 01:37:54PM -0700, Andy Lutomirski wrote:
>> > The question I hadn't seen really answered through all of that was how to
>> > deal with #!... "Just use d_path()" isn't particulary appealing - if that
>> > file has a pathname reachable for you, you could bloody well use _that_
>> > from the very beginning.
>>
>> Does this matter for absolute paths after #! (or for absolute paths to
>> ELF interpreters)? Does anyone use relative paths there?
>
> It's not about what's after #!; it's what we *append* to what's after #!
> that is interesting. Recall how #! works - we turn execve() of something
> that starts with e.g. "#!/usr/bin/make -f\n" into execve of /usr/bin/make,
> with (/usr/bin/make, -f, name of that file, argv[1]..argv[argc]) as
> arguments list. With make(1) doing opening and reading the file, as it
> would for any makefile. Or /bin/sh opening and reading the script, etc.
>
> Pathname of interpreter is a non-issue (and ELF ones don't go anywhere
> near that path anyway).
>
Oh, you mean that #!/usr/bin/make -f would turn into /usr/bin/make
/dev/fd/3? That could be interesting, although I can imagine it
breaking things, especially if /dev/fd/3 isn't set up like that, e.g.
early in boot.
Why does this use case need the dup-style semantics? Merely opening
the same file should work, right?
[deleted my questions based on my misunderstanding of what you were
talking about]
>> I'm not convinced that these semantics are better than /proc/self/fd's
>> in many contexts. I don't really like the idea that catting some file
>> can *change* the position of one of my open file descriptors.
>
> Er... You do realize that if descriptor refers to the same file,
> cat <&<that descriptor> will change position of your other descriptor,
> right? BTW, on *BSD /dev/stdin *does* have a dup()-style semantics.
> They do it at the price of very convoluted code in their VFS, but that's
> how it works there. And in Plan 9, and (AFAIK) in Solaris...
But cat <&3 is shell magic; it isn't actually an invocation of
/bin/cat with some special argument that causes it to poke at some fd.
And bash already supports this kind of use case.
I still don't userstand why the BSD behavior is better here. Also,
this isn't just file *position* -- anything that opens dupfs/N and
uses fcntl with F_SETFL seems likely to cause failures, crashes, or
outright corruption. And there are file locks, which suddenly have
rather odd semantics. (Even with the new OFD locks or with BSD locks,
I wonder how many little programs open argv[1], lock the file, do
something, and then exit. Anything that does that is now leaking a
lock.)
Aside from the general scariness of allowing one process to actually
dup another process's fds, I feel like this is asking for trouble wrt
the various types of file locks.
--Andy
^ permalink raw reply
* Re: [PATCH v3 4/6] tpm: TPM 2.0 sysfs attributes
From: Andy Lutomirski @ 2014-10-19 22:07 UTC (permalink / raw)
To: Jarkko Sakkinen
Cc: Peter Huewe, Ashley Lai, Marcel Selhorst,
tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
josh.triplett-ral2JQCrhuEAvxtiuMwx3w,
christophe.ricard-Re5JQEeQqe8AvxtiuMwx3w,
will.c.arthur-ral2JQCrhuEAvxtiuMwx3w,
monty.wiseman-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <1413372916-12091-5-git-send-email-jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
On Wed, Oct 15, 2014 at 4:35 AM, Jarkko Sakkinen
<jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> wrote:
> Implemented sysfs attributes for TPM2 devices. TPM2 sysfs attributes
> are mounted in the actual device associated with the chip instead of
> platform device like with TPM1 devices.
>
> Documentation/ABI/stable/sysfs-class/tpm2 contains descriptions
> of these attributes.
>
> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
> +What: /sys/class/misc/tpmX/device/cancel
> +Date: October 2014
> +KernelVersion: 3.19
> +Contact: tpmdd-devel-TtF/mJH4Jtrk1uMJSBkQmQ@public.gmane.org
> +Description: The "cancel" property allows you to cancel the currently
> + pending TPM command. Writing any value to cancel will call the
> + TPM chip specific cancel operation.
This is weird. From the POV of a sysfs user, what operation gets
canceled? What if it's a kernel-internal operation?
Shouldn't this be an ioctl?
--Andy
^ permalink raw reply
* Re: [PATCH] staging: android: binder: move to the "real" part of the kernel
From: Greg Kroah-Hartman @ 2014-10-19 22:05 UTC (permalink / raw)
To: Dan Carpenter
Cc: devel, Anup Patel, linux-api, linux-kernel,
Arve Hjønnevåg, John Stultz, Rebecca Schultz Zavin,
Santosh Shilimkar, Sumit Semwal, Christoffer Dall
In-Reply-To: <20141017092601.GH23154@mwanda>
On Fri, Oct 17, 2014 at 12:26:01PM +0300, Dan Carpenter wrote:
> The code isn't very beautiful and there are lots of details wrong like
> the error codes.
Really, what is wrong with the existing error code usages?
> Al had some critical things to say about it but it looks like most of
> those issues have been fixed.
All of the ones that can be fixed, have, as far as I know.
We are stuck with some that we can't fix, which is ok as it is safe in
the Android user model.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH] staging: android: binder: move to the "real" part of the kernel
From: Greg Kroah-Hartman @ 2014-10-19 22:04 UTC (permalink / raw)
To: Christoph Hellwig
Cc: devel, Anup Patel, linux-api, linux-kernel, Arve Hj?nnev?g,
Santosh Shilimkar, Rebecca Schultz Zavin, John Stultz,
Sumit Semwal, Christoffer Dall
In-Reply-To: <20141017094329.GB18015@infradead.org>
On Fri, Oct 17, 2014 at 02:43:29AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 16, 2014 at 02:47:41PM +0200, Greg Kroah-Hartman wrote:
> > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >
> > The Android binder code has been "stable" for many years now. No matter
> > what comes in the future, we are going to have to support this API, so
> > might as well move it to the "real" part of the kernel as there's no
> > real work that needs to be done to the existing code.
>
> NAK. It's complete rubbish and does things to the FD code that it
> really shouldn't. Android needs to completely redo the interface, and
> there's been absolutely no work towards that.
There is now work to resolve the interface, it requires someone who has
the rights to push to Android userspace. But that is going to be a
"total rewrite", and until then, this code needs to be used, no matter
how much we hate this.
> This is exactly the sort of attitude I feared about when you started the
> whole staging concepts, and it sounds like a good reason to disband
> staging entirely, given that it's been mostly useless except for
> boasting peoples commit statistics.
I know you hate staging, which is fine, it's not there for you. It's
there for others, and so much good stuff has happened with it, that I'm
not going to give it up.
But this code is different. It's something like a random driver that a
distro has been carrying for 5+ years that it has to ship due to
userspace that relies on it, so the api can't be changed. We accept
drivers like that at times, and we don't drop really old and crappy
drivers either, as long as someone is using it, and it is
self-contained.
I'll step up to maintain this and handle the interactions between grumpy
kernel developers who look at the code, and the Android developers who
are stuck with this monstrosity.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCH] staging: android: binder: move to the "real" part of the kernel
From: Greg Kroah-Hartman @ 2014-10-19 22:01 UTC (permalink / raw)
To: One Thousand Gnomes
Cc: devel, Anup Patel, Linux API, lkml, Arve Hjønnevåg,
Santosh Shilimkar, Rebecca Schultz Zavin, John Stultz,
Sumit Semwal, Christoffer Dall
In-Reply-To: <20141018223630.497988fa@alan.etchedpixels.co.uk>
On Sat, Oct 18, 2014 at 10:36:30PM +0100, One Thousand Gnomes wrote:
> > Do we really need someone to do more work that has been done on it in
> > the past as an official "maintainer"? I'll be glad to do it, as I doubt
> > it will require any time at all.
>
> Well every time in the past that Al Viro looked in its direction he broke
> it so probably. Someone is going to have to clean up or fix the fact it
> pokes around in the depths of the low level fd I/O code and calls stuff
> like __fd_install and __alloc_fd directly, or mend it if it breaks.
As it is, it is ok, but bad things happen if you allow more than one
process to open the device node. In android systems, that doesn't
happen, so all should be acceptable.
> I'm curious what Al Viro thinks of it
His last comments were along the lines of "don't let anything open that
device node other than libbinder".
> > > Currently in the android space no one but libbinder should use the
> > > kernel interface.
> >
> > That is correct. If you do that, you deserve all of the pain and
> > suffering and rooted machines you will get.
>
> So what is the Android side model for its security. That probably also
> should be described so nobody goes off and uses it for something like
> systemd because "it looked neat".
The side model is "one owner that knows what they are doing as they have
root privileges". I don't know a way to codify that, and we all know no
one reads documentation...
> > But all of the changes will be in new code. Be it kdbus, or something
> > else if that doesn't work out. This existing binder.c file will not be
> > changing at all. This existing ABI, and codebase, is something that we
> > have to maintain forever for those millions of devices out there in the
> > real world today.
>
> 95% of those devices are locked down, most of them have non replaceable
> batteries that will dead and irreplacable (sanely anyway) in 3-5 years.
> "Forever" in the phone world is mercifully rather short.
I still see brand new devices with 2 year old Android userspace being
shipped today. With a total mis-mash of random kernel versions,
depending on what the SoC supported. If we can delete this in 2-5
years, I would be really happy.
thanks,
greg k-h
^ permalink raw reply
* Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: Al Viro @ 2014-10-19 21:29 UTC (permalink / raw)
To: Andy Lutomirski
Cc: David Drysdale, Eric W. Biederman, Meredydd Luff,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API
In-Reply-To: <CALCETrVZUW2iPtfFJtGnWd2RsYLwjGRGYuujrVqcOsO5oBB8Cg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Sun, Oct 19, 2014 at 01:37:54PM -0700, Andy Lutomirski wrote:
> > The question I hadn't seen really answered through all of that was how to
> > deal with #!... "Just use d_path()" isn't particulary appealing - if that
> > file has a pathname reachable for you, you could bloody well use _that_
> > from the very beginning.
>
> Does this matter for absolute paths after #! (or for absolute paths to
> ELF interpreters)? Does anyone use relative paths there?
It's not about what's after #!; it's what we *append* to what's after #!
that is interesting. Recall how #! works - we turn execve() of something
that starts with e.g. "#!/usr/bin/make -f\n" into execve of /usr/bin/make,
with (/usr/bin/make, -f, name of that file, argv[1]..argv[argc]) as
arguments list. With make(1) doing opening and reading the file, as it
would for any makefile. Or /bin/sh opening and reading the script, etc.
Pathname of interpreter is a non-issue (and ELF ones don't go anywhere
near that path anyway).
> Does execve("/proc/self/fd/N", ...) not work correctly now?
Yes, it does. And if procfs is there, this syscall is completely pointless.
The main argument in favour of adding it to the kernel (rather than doing
in userland) has been "but what of the people who don't want procfs mounted
in there?".
> Presumably relative paths should be relative to the execed program, or
> maybe there should be a flag to execveat that disallows that behavior
> entirely, or maybe it should never work, even through /proc. I don't
> really like the idea that an fd pointing at a *file* should allow
> access to its directory.
Huh? What are you talking about? And who the hell cares whether it's
absolute or relative?
> I'm not convinced that these semantics are better than /proc/self/fd's
> in many contexts. I don't really like the idea that catting some file
> can *change* the position of one of my open file descriptors.
Er... You do realize that if descriptor refers to the same file,
cat <&<that descriptor> will change position of your other descriptor,
right? BTW, on *BSD /dev/stdin *does* have a dup()-style semantics.
They do it at the price of very convoluted code in their VFS, but that's
how it works there. And in Plan 9, and (AFAIK) in Solaris...
^ permalink raw reply
* Re: [PATCH v3 4/6] tpm: TPM 2.0 sysfs attributes
From: Tomas Winkler @ 2014-10-19 21:09 UTC (permalink / raw)
To: Jarkko Sakkinen
Cc: Greg KH, Peter Huewe, Ashley Lai, Marcel Selhorst, tpmdd-devel,
linux-kernel@vger.kernel.org, linux-api, josh.triplett,
christophe.ricard, will.c.arthur, monty.wiseman
In-Reply-To: <20141016160358.GB12979@intel.com>
>
> Here the options are limited because misc_register already spawns
> KOBJ_ADD. Not much to do unless we would wipe the current use of
> misc driver completely.
I would strongly recommend to drop the misc layer here, as tpm itself
can provide the needed abstraction for the devices.
^ permalink raw reply
* Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: H. Peter Anvin @ 2014-10-19 20:53 UTC (permalink / raw)
To: Al Viro, Andy Lutomirski
Cc: David Drysdale, Eric W. Biederman, Meredydd Luff,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Thomas Gleixner, Ingo Molnar, Andrew Morton, Kees Cook,
Arnd Bergmann, X86 ML, linux-arch, Linux API
In-Reply-To: <20141019202034.GH7996-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
On 10/19/2014 01:20 PM, Al Viro wrote:
> On Fri, Oct 17, 2014 at 02:45:03PM -0700, Andy Lutomirski wrote:
>
>> For example, I want to be able to reliably do something like nsenter
>> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
>> it is more or less fully functional, so this should Just Work (tm),
>> except that the toybox binary might not exist in the namespace being
>> entered. If execveat were available, I could rig nsenter or a similar
>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>> execveat.
>
> The question I hadn't seen really answered through all of that was how to
> deal with #!... "Just use d_path()" isn't particulary appealing - if that
> file has a pathname reachable for you, you could bloody well use _that_
> from the very beginning.
>
> Frankly, I wonder if it would make sense to provide something like
> dupfs. We can't mount it by default on /dev/fd (more's the pity), but
> it might be a good thing to have.
>
> What it is, for those who are not familiar with Plan 9: a filesystem with
> one directory and a bunch of files in it. Directory contents depends on
> who's looking; for each opened descriptor in your descriptor table, you'll
> see two files there. One series is 0, 1, ... - opening one of those gives
> dup(). IOW, it's *not* giving you a new struct file; it gives you a new
> reference to existing one, complete with sharing IO position, etc. Another
> is 0ctl, 1ctl, ... - those are read-only and reading from them gives pretty
> much a combination of our /proc/self/fdinfo/n with readlink of /proc/self/fd/n.
>
> It's actually a better match for what one would expect at /dev/fd than what
> we do. Example:
>
Yes, it is really unfortunate that /proc/self/fd/* had the wrong
semantics for the start, due to then-existing implementation issues.
-hpa
^ permalink raw reply
* Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: Andy Lutomirski @ 2014-10-19 20:37 UTC (permalink / raw)
To: Al Viro
Cc: David Drysdale, Eric W. Biederman, Meredydd Luff,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Morton,
Kees Cook, Arnd Bergmann, X86 ML, linux-arch, Linux API
In-Reply-To: <20141019202034.GH7996-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>
On Sun, Oct 19, 2014 at 1:20 PM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> On Fri, Oct 17, 2014 at 02:45:03PM -0700, Andy Lutomirski wrote:
>
>> For example, I want to be able to reliably do something like nsenter
>> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
>> it is more or less fully functional, so this should Just Work (tm),
>> except that the toybox binary might not exist in the namespace being
>> entered. If execveat were available, I could rig nsenter or a similar
>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>> execveat.
>
> The question I hadn't seen really answered through all of that was how to
> deal with #!... "Just use d_path()" isn't particulary appealing - if that
> file has a pathname reachable for you, you could bloody well use _that_
> from the very beginning.
Does this matter for absolute paths after #! (or for absolute paths to
ELF interpreters)? Does anyone use relative paths there?
Does execve("/proc/self/fd/N", ...) not work correctly now?
Presumably relative paths should be relative to the execed program, or
maybe there should be a flag to execveat that disallows that behavior
entirely, or maybe it should never work, even through /proc. I don't
really like the idea that an fd pointing at a *file* should allow
access to its directory.
>
> Frankly, I wonder if it would make sense to provide something like
> dupfs. We can't mount it by default on /dev/fd (more's the pity), but
> it might be a good thing to have.
>
> What it is, for those who are not familiar with Plan 9: a filesystem with
> one directory and a bunch of files in it. Directory contents depends on
> who's looking; for each opened descriptor in your descriptor table, you'll
> see two files there. One series is 0, 1, ... - opening one of those gives
> dup(). IOW, it's *not* giving you a new struct file; it gives you a new
> reference to existing one, complete with sharing IO position, etc. Another
> is 0ctl, 1ctl, ... - those are read-only and reading from them gives pretty
> much a combination of our /proc/self/fdinfo/n with readlink of /proc/self/fd/n.
>
> It's actually a better match for what one would expect at /dev/fd than what
> we do. Example:
>
> ; echo 'read i; cat /dev/fd/0; echo "The first line was $i"' >a.sh
> ; (echo 'line 1';echo 'line 2') >a
> ; cat a|sh a.sh
> line 2
> The first line was line 1
> ; sh a.sh <a
> line 1
> line 2
> The first line was line 1
> ;
>
> See what's going on? Opening /dev/fd/0 (aka /dev/stdin) does a fresh open
> of whatever your stdin is; if it's a pipe - fine, you've just added yourself
> as additional reader. But if it's a regular file, you've got yourself
> a brand-new opened file, with IO position of its own. Sitting at the
> beginning of the file.
>
> Moreover, try that with stdin being a socket and you'll see cat(1) failing
> to open that sucker.
>
> We _can't_ blindly replace /dev/fd with it - it has to be a sysadmin choice;
> semantics is different. However, there's no reason why it can't be mounted
> in environments where you want to avoid procfs - it's certainly exposing less
> than procfs would.
>
> And these days we can implement relatively cheaply. It's a window that will
> close after a while, but right now we can change ->atomic_open() calling
> conventions. Instead of having it return 0 or error, let's switch to returning
> NULL, ERR_PTR(error) *or* an extra reference to preexisting struct file.
> Same as we did for ->lookup(), and for similar reason.
>
> Right now we have 8 instances of ->atomic_open() and one place calling that
> method. Changing the API like that would be trivial (and it's a trivial
> conversion - replace return ret; with return ERR_PTR(ret); through all
> instances, so any out-of-tree filesystems could follow easily). We certainly
> can't do anything of that sort with ->open() - there would be thousands
> instances to convert. ->atomic_open(), OTOH, is still new enough for
> that to be feasible.
>
> What we get from that conversion is an ability to do dup-style semantics
> easily.
> * give root directory an ->atomic_open() instance that would be
> handling opens.
> * make lookups in there fail with ENOENT if you don't have such a
> descriptor at the moment. Otherwise bind all of them to the same inode.
> The only method it needs is ->getattr(), and that would look into your
> descriptor table for descriptor with number derived from dentry (stashed
> in ->d_fsdata at lookup time) and do what fstat() would.
> * have those dentries always fail ->d_revalidate(), to force
> everything towards ->atomic_open().
> * for ...ctl names, ->atomic_open() would act in normal fashion;
> again, only one inode is needed. ->read() would pick descriptor number
> from ->d_fsdata and report on whatever you have with that number at the
> time.
>
> I'll try to put a prototype of that together; I think it's at least
> interesting to try. And that ought to be safe to mount even in very
> restricted environments, making arguments along the lines of "but we can't
> get the path by opened file without the big bad wol^Wprocfs and we can't
> have that in our environment" much weaker...
>
> Comments?
I'm not convinced that these semantics are better than /proc/self/fd's
in many contexts. I don't really like the idea that catting some file
can *change* the position of one of my open file descriptors. Also,
for execveat in particular, I want to be able to setns into a
completely unknown namespace and exec something, so a new fs won't
help if it's not mounted.
An alternative solution to proc-lite would be to have a heavily
stripped-down variant of /proc. It could self as a real directory
(not a symlink), with the normal semantics for /proc/self/fd, and it
could have very little else (possibly nothing at all; possibly exe,
root, and cwd).
--Andy
\
^ permalink raw reply
* Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: Al Viro @ 2014-10-19 20:20 UTC (permalink / raw)
To: Andy Lutomirski
Cc: David Drysdale, Eric W. Biederman, Meredydd Luff,
linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Andrew Morton, Kees Cook, Arnd Bergmann, X86 ML,
linux-arch, Linux API
In-Reply-To: <CALCETrVraoD+r4zxBoGd+BV5P275AXcRV_R00SSr8fjQzRHnUg@mail.gmail.com>
On Fri, Oct 17, 2014 at 02:45:03PM -0700, Andy Lutomirski wrote:
> For example, I want to be able to reliably do something like nsenter
> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
> it is more or less fully functional, so this should Just Work (tm),
> except that the toybox binary might not exist in the namespace being
> entered. If execveat were available, I could rig nsenter or a similar
> tool to open it with O_CLOEXEC, enter the namespace, and then call
> execveat.
The question I hadn't seen really answered through all of that was how to
deal with #!... "Just use d_path()" isn't particulary appealing - if that
file has a pathname reachable for you, you could bloody well use _that_
from the very beginning.
Frankly, I wonder if it would make sense to provide something like
dupfs. We can't mount it by default on /dev/fd (more's the pity), but
it might be a good thing to have.
What it is, for those who are not familiar with Plan 9: a filesystem with
one directory and a bunch of files in it. Directory contents depends on
who's looking; for each opened descriptor in your descriptor table, you'll
see two files there. One series is 0, 1, ... - opening one of those gives
dup(). IOW, it's *not* giving you a new struct file; it gives you a new
reference to existing one, complete with sharing IO position, etc. Another
is 0ctl, 1ctl, ... - those are read-only and reading from them gives pretty
much a combination of our /proc/self/fdinfo/n with readlink of /proc/self/fd/n.
It's actually a better match for what one would expect at /dev/fd than what
we do. Example:
; echo 'read i; cat /dev/fd/0; echo "The first line was $i"' >a.sh
; (echo 'line 1';echo 'line 2') >a
; cat a|sh a.sh
line 2
The first line was line 1
; sh a.sh <a
line 1
line 2
The first line was line 1
;
See what's going on? Opening /dev/fd/0 (aka /dev/stdin) does a fresh open
of whatever your stdin is; if it's a pipe - fine, you've just added yourself
as additional reader. But if it's a regular file, you've got yourself
a brand-new opened file, with IO position of its own. Sitting at the
beginning of the file.
Moreover, try that with stdin being a socket and you'll see cat(1) failing
to open that sucker.
We _can't_ blindly replace /dev/fd with it - it has to be a sysadmin choice;
semantics is different. However, there's no reason why it can't be mounted
in environments where you want to avoid procfs - it's certainly exposing less
than procfs would.
And these days we can implement relatively cheaply. It's a window that will
close after a while, but right now we can change ->atomic_open() calling
conventions. Instead of having it return 0 or error, let's switch to returning
NULL, ERR_PTR(error) *or* an extra reference to preexisting struct file.
Same as we did for ->lookup(), and for similar reason.
Right now we have 8 instances of ->atomic_open() and one place calling that
method. Changing the API like that would be trivial (and it's a trivial
conversion - replace return ret; with return ERR_PTR(ret); through all
instances, so any out-of-tree filesystems could follow easily). We certainly
can't do anything of that sort with ->open() - there would be thousands
instances to convert. ->atomic_open(), OTOH, is still new enough for
that to be feasible.
What we get from that conversion is an ability to do dup-style semantics
easily.
* give root directory an ->atomic_open() instance that would be
handling opens.
* make lookups in there fail with ENOENT if you don't have such a
descriptor at the moment. Otherwise bind all of them to the same inode.
The only method it needs is ->getattr(), and that would look into your
descriptor table for descriptor with number derived from dentry (stashed
in ->d_fsdata at lookup time) and do what fstat() would.
* have those dentries always fail ->d_revalidate(), to force
everything towards ->atomic_open().
* for ...ctl names, ->atomic_open() would act in normal fashion;
again, only one inode is needed. ->read() would pick descriptor number
from ->d_fsdata and report on whatever you have with that number at the
time.
I'll try to put a prototype of that together; I think it's at least
interesting to try. And that ought to be safe to mount even in very
restricted environments, making arguments along the lines of "but we can't
get the path by opened file without the big bad wol^Wprocfs and we can't
have that in our environment" much weaker...
Comments?
^ permalink raw reply
* Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: Andy Lutomirski @ 2014-10-19 19:11 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, X86 ML, Arnd Bergmann, Thomas Gleixner, Ingo Molnar,
linux-arch, linux-kernel@vger.kernel.org, H. Peter Anvin,
Andrew Morton, David Drysdale, Linux API, Kees Cook,
Meredydd Luff
In-Reply-To: <87zjcszz8y.fsf@x220.int.ebiederm.org>
On Oct 18, 2014 5:21 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>
> Andy Lutomirski <luto@amacapital.net> writes:
>
> > [Added Eric Biederman, since I think your tree might be a reasonable
> > route forward for these patches.]
> >
> > On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <drysdale@google.com> wrote:
> >> Resending, adding cc:linux-api.
> >>
> >> Also, it may help to add a little more background -- this patch is
> >> needed as a (small) part of implementing Capsicum in the Linux kernel.
> >>
> >> Capsicum is a security framework that has been present in FreeBSD since
> >> version 9.0 (Jan 2012), and is based on concepts from object-capability
> >> security [1].
> >>
> >> One of the features of Capsicum is capability mode, which locks down
> >> access to global namespaces such as the filesystem hierarchy. In
> >> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
> >> work -- hence the need for a kernel-space
> >
> > I just found myself wanting this syscall for another reason: injecting
> > programs into sandboxes or otherwise heavily locked-down namespaces.
> >
> > For example, I want to be able to reliably do something like nsenter
> > --namespace-flags-here toybox sh. Toybox's shell is unusual in that
> > it is more or less fully functional, so this should Just Work (tm),
> > except that the toybox binary might not exist in the namespace being
> > entered. If execveat were available, I could rig nsenter or a similar
> > tool to open it with O_CLOEXEC, enter the namespace, and then call
> > execveat.
> >
> > Is there any reason that these patches can't be merged more or less as
> > is for 3.19?
>
> Yes. There is a silliness in how it implements fexecve. The fexecve
> case should be use the empty string "" not a NULL pointer to indication
> that. That change will then harmonize execveat with the other ...at
> system calls and simplify the code and remove a special case. I believe
> using the empty string "" requires implementing the AT_EMPTY_PATH flag.
>
Sounds reasonable.
> For sandboxes execveat seems to make a great deal of sense. I can
> get the same functionality by passing in a directory file descriptor
> calling fchdir and execve so this should not introduce any new security
> holes. And using the final file descriptor removes a race.
The problem with that approach is that the execed program now has its
current directory outside the sandbox, which could be problematic if
you don't trust that program.
>
> AT_SYMLINK_NOFOLLOW seems to have some limited utility as well, although
> for exec I don't know what problems it can solve.
It can always be added later
>
> Until I am done moving I won't have time to pick this up, and the code
> clearly needs another revision but I will be happy to work to see that
> we get a sane execveat implemented.
Do you have an ETA? If it's likely to miss 3.19, but if you'll have
time to review before then, I can try to do it.
>
> Eric
>
> p.s. I don't believe there are any namespaces issues where doing
> something with execveat flags make sense.
>
OK, I'll bite. How feasible would it be to have a flag that activated
pid_ns_for_children? That would reduce a lot of the ugliness in tools
like nsenter that need to fork to enter a pid ns.
I always assume that the reason for the active vs. for_children
distinction was because a lot of userspace libraries, including glibc,
would malfunction if getpid(2) started returning a different value.
But, for exec, this doesn't matter.
--Andy
^ permalink raw reply
* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Andy Lutomirski @ 2014-10-19 18:26 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Serge E. Hallyn, Aditya Kali, Linux API, Linux Containers,
Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar
In-Reply-To: <87iojgmy3o.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
>
>> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>>> > Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>>> >> setns on a cgroup namespace is allowed only if
>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>> >> over the user-namespace associated with target cgroupns.
>>> >> * task's current cgroup is descendent of the target cgroupns-root
>>> >> cgroup.
>>> >
>>> > What is the point of this?
>>> >
>>> > If I'm a user logged into
>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>> > a container which is in
>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>> > then I will want to be able to enter the container's cgroup.
>>> > The container's cgroup root is under my own (satisfying the
>>> > below condition0 but my cgroup is not a descendent of the
>>> > container's cgroup.
>>> >
>>> This condition is there because we don't want to do implicit cgroup
>>> changes when a process attaches to another cgroupns. cgroupns tries to
>>> preserve the invariant that at any point, your current cgroup is
>>> always under the cgroupns-root of your cgroup namespace. But in your
>>> example, if we allow a process in "session-c12.scope" container to
>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>> (without implicitly moving its cgroup), then this invariant won't
>>> hold.
>>
>> Oh, I see. Guess that should be workable. Thanks.
>
> Which has me looking at what the rules are for moving through
> the cgroup hierarchy.
>
> As long as we have write access to cgroup.procs and are allowed
> to open the file for write, we can move any of our own tasks
> into the cgroup. So the cgroup namespace rules don't seem
> to be a problem.
>
> Andy can you please take a look at the permission checks in
> __cgroup_procs_write.
The actual requirements for calling that function haven't changed,
right? IOW, what does this have to do with cgroupns? Is the idea
that you want a privileged user wrt a cgroupns's userns to be able to
use this? If so:
Yes, that current_cred() thing is bogus. (Actually, this is probably
exploitable right now if any cgroup.procs inode anywhere on the system
lets non-root write.) (Can we have some kernel debugging option that
makes any use of current_cred() in write(2) warn?)
We really need a weaker version of may_ptrace for this kind of stuff.
Maybe the existing may_ptrace stuff is okay, actually. But this is
completely missing group checks, cap checks, capabilities wrt the
userns, etc.
Also, I think that, if this version of the patchset allows non-init
userns to unshare cgroupns, then the issue of what permission is
needed to lock the cgroup hierarchy like that needs to be addressed,
because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
the calling task with no permission required. Bolting on a fix later
will be a mess.
--Andy
>
> As I read the code I see 3 security gaffaws in the permssion check.
> - Using current->cred instead of file->f_cred.
> - Not checking tcred->euid.
> - Checking GLOBAL_ROOT_UID instead of having a capable call.
>
> The file permission on cgroup.procs seem just sufficient to keep
> to keep those bugs from being easily exploitable.
>
> Eric
--
Andy Lutomirski
AMA Capital Management, LLC
^ permalink raw reply
* Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support
From: Eric W. Biederman @ 2014-10-19 5:23 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Linux API, Linux Containers, Serge Hallyn,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Andy Lutomirski, Ingo Molnar, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141016214710.GA4759-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
"Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes:
> Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>> > Quoting Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> >> setns on a cgroup namespace is allowed only if
>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>> >> over the user-namespace associated with target cgroupns.
>> >> * task's current cgroup is descendent of the target cgroupns-root
>> >> cgroup.
>> >
>> > What is the point of this?
>> >
>> > If I'm a user logged into
>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>> > a container which is in
>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>> > then I will want to be able to enter the container's cgroup.
>> > The container's cgroup root is under my own (satisfying the
>> > below condition0 but my cgroup is not a descendent of the
>> > container's cgroup.
>> >
>> This condition is there because we don't want to do implicit cgroup
>> changes when a process attaches to another cgroupns. cgroupns tries to
>> preserve the invariant that at any point, your current cgroup is
>> always under the cgroupns-root of your cgroup namespace. But in your
>> example, if we allow a process in "session-c12.scope" container to
>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>> (without implicitly moving its cgroup), then this invariant won't
>> hold.
>
> Oh, I see. Guess that should be workable. Thanks.
Which has me looking at what the rules are for moving through
the cgroup hierarchy.
As long as we have write access to cgroup.procs and are allowed
to open the file for write, we can move any of our own tasks
into the cgroup. So the cgroup namespace rules don't seem
to be a problem.
Andy can you please take a look at the permission checks in
__cgroup_procs_write.
As I read the code I see 3 security gaffaws in the permssion check.
- Using current->cred instead of file->f_cred.
- Not checking tcred->euid.
- Checking GLOBAL_ROOT_UID instead of having a capable call.
The file permission on cgroup.procs seem just sufficient to keep
to keep those bugs from being easily exploitable.
Eric
^ permalink raw reply
* Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
From: Eric W. Biederman @ 2014-10-19 4:57 UTC (permalink / raw)
To: Aditya Kali
Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
mingo-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <1413235430-22944-7-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> writes:
> Restrict following operations within the calling tasks:
> * cgroup_mkdir & cgroup_rmdir
> * cgroup_attach_task
> * writes to cgroup files outside of task's cgroupns-root
>
> Also, read of /proc/<pid>/cgroup file is now restricted only
> to tasks under same cgroupns-root. If a task tries to look
> at cgroup of another task outside of its cgroupns-root, then
> it won't be able to see anything for the default hierarchy.
> This is same as if the cgroups are not mounted.
So I think this patch is out of order.
We should add the namespace infrastructre and the restrictions before
we allow creation of the namespace. Otherwise there is a bisection
point where cgroup namespaces are broken or at the very least have a
security hole. Since we can anticipate this let's see if we can figure
out how to avoid it.
Eric
> Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> ---
> kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f8099b4..2fc0dfa 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
> struct task_struct *task;
> int ret;
>
> + /* Only allow changing cgroups accessible within task's cgroup
> + * namespace. i.e. 'dst_cgrp' should be a descendant of task's
> + * cgroupns->root_cgrp. */
> + if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
> + return -EPERM;
> +
> /* look up all src csets */
> down_read(&css_set_rwsem);
> rcu_read_lock();
> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
> struct cgroup_subsys_state *css;
> int ret;
>
> + /* Reject writes to cgroup files outside of task's cgroupns-root. */
> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> + return -EINVAL;
> +
> if (cft->write)
> return cft->write(of, buf, nbytes, off);
>
> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> parent = cgroup_kn_lock_live(parent_kn);
> if (!parent)
> return -ENODEV;
> +
> + /* Allow mkdir only within process's cgroup namespace root. */
> + if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
> + ret = -EPERM;
> + goto out_unlock;
> + }
> +
> root = parent->root;
>
> /* allocate the cgroup and its ID, 0 is reserved for the root */
> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
> if (!cgrp)
> return 0;
>
> + /* Allow rmdir only within process's cgroup namespace root.
> + * The process can't delete its own root anyways. */
> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
> + cgroup_kn_unlock(kn);
> + return -EPERM;
> + }
> +
> ret = cgroup_destroy_locked(cgrp);
>
> cgroup_kn_unlock(kn);
> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
> if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
> continue;
>
> + cgrp = task_cgroup_from_root(tsk, root);
> +
> + /* The cgroup path on default hierarchy is shown only if it
> + * falls under current task's cgroupns-root.
> + */
> + if (root == &cgrp_dfl_root &&
> + !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> + continue;
> +
> seq_printf(m, "%d:", root->hierarchy_id);
> for_each_subsys(ss, ssid)
> if (root->subsys_mask & (1 << ssid))
> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
> seq_printf(m, "%sname=%s", count ? "," : "",
> root->name);
> seq_putc(m, ':');
> - cgrp = task_cgroup_from_root(tsk, root);
> path = cgroup_path(cgrp, buf, PATH_MAX);
> if (!path) {
> retval = -ENAMETOOLONG;
^ permalink raw reply
* Re: [PATCHv1 0/8] CGroup Namespaces
From: Eric W. Biederman @ 2014-10-19 4:54 UTC (permalink / raw)
To: Aditya Kali
Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, luto-kltTT9wpgjJwATOyAt5JVQ,
tj-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
mingo-H+wXaHxf7aLQT0dZR+AlfA
In-Reply-To: <1413235430-22944-1-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Aditya Kali <adityakali@google.com> writes:
> Second take at the Cgroup Namespace patch-set.
>
> Major changes form RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
> mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
> anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
> your cgroupns-root.
>
> More details in the writeup below.
This definitely looks like the right direction to go, and something that
in some form or another I had been asking for since cgroups were merged.
So I am very glad to see this work moving forward.
I had hoped that we might just be able to be clever with remounting
cgroupfs but 2 things stand in the way.
1) /proc/<pid>/cgroups (but proc could capture that).
2) providing a hard guarnatee that tasks stay within a subset of the
cgroup hierarchy.
So I think this clearly meets the requirements for a new namespace.
We need to have the discussion on chmod of files on cgroupfs. There is
a notion that has floated around that only systemd or only root (with
the appropriate capabilities) should be allowed to set resource limits
in cgroupfs. In a practical reality that is nonsense. If an atribute
is properly bound in it's hiearchy it should be safe to change.
Not all attributes are properly bound to hierarchy and some are or at
least were dangerous for anyone except root to set. So I suggest that a
CFTYPE flag perhaps CFTYPE_UNPRIV be added for attributes that are safe
to allow anyone to set, and require CFTYPE_UNPRIV be set before we chmod
a cgroup attribute from root.
That would be complimentary work, and not strictly tied the cgroup
namespaces but unprivileged cgroup namespaces don't make much sense
without that work.
Eric
> Background
> Cgroups and Namespaces are used together to create “virtual”
> containers that isolates the host environment from the processes
> running in container. But since cgroups themselves are not
> “virtualized”, the task is always able to see global cgroups view
> through cgroupfs mount and via /proc/self/cgroup file.
>
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> This exposure of cgroup names to the processes running inside a
> container results in some problems:
> (1) The container names are typically host-container-management-agent
> (systemd, docker/libcontainer, etc.) data and leaking its name (or
> leaking the hierarchy) reveals too much information about the host
> system.
> (2) It makes the container migration across machines (CRIU) more
> difficult as the container names need to be unique across the
> machines in the migration domain.
> (3) It makes it difficult to run container management tools (like
> docker/libcontainer, lmctfy, etc.) within virtual containers
> without adding dependency on some state/agent present outside the
> container.
>
> Note that the feature proposed here is completely different than the
> “ns cgroup” feature which existed in the linux kernel until recently.
> The ns cgroup also attempted to connect cgroups and namespaces by
> creating a new cgroup every time a new namespace was created. It did
> not solve any of the above mentioned problems and was later dropped
> from the kernel. Incidentally though, it used the same config option
> name CONFIG_CGROUP_NS as used in my prototype!
>
> Introducing CGroup Namespaces
> With unified cgroup hierarchy
> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
> have a much more coherent cgroup view and its easy to associate a
> container with a single cgroup. This also allows us to virtualize the
> cgroup view for tasks inside the container.
>
> The new CGroup Namespace allows a process to “unshare” its cgroup
> hierarchy starting from the cgroup its currently in.
> For Ex:
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> $ ls -l /proc/self/ns/cgroup
> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
> [ns]$ ls -l /proc/self/ns/cgroup
> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
> cgroup:[4026532183]
> # From within new cgroupns, process sees that its in the root cgroup
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
> # From global cgroupns:
> $ cat /proc/<pid>/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> # Unshare cgroupns along with userns and mountns
> # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
> # sets up uid/gid map and exec’s /bin/bash
> $ ~/unshare -c -u -m
>
> # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
> # hierarchy.
> [ns]$ mount -t cgroup cgroup /tmp/cgroup
> [ns]$ ls -l /tmp/cgroup
> total 0
> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
> -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
> -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
> The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
> filesystem root for the namespace specific cgroupfs mount.
>
> The virtualization of /proc/self/cgroup file combined with restricting
> the view of cgroup hierarchy by namespace-private cgroupfs mount
> should provide a completely isolated cgroup view inside the container.
>
> In its current form, the cgroup namespaces patcheset provides following
> behavior:
>
> (1) The “root” cgroup for a cgroup namespace is the cgroup in which
> the process calling unshare is running.
> For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
> For the init_cgroup_ns, this is the real root (“/”) cgroup
> (identified in code as cgrp_dfl_root.cgrp).
>
> (2) The cgroupns-root cgroup does not change even if the namespace
> creator process later moves to a different cgroup.
> $ ~/unshare -c # unshare cgroupns in some cgroup
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> [ns]$ mkdir sub_cgrp_1
> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
> (3) Each process gets its CGROUPNS specific view of
> /proc/<pid>/cgroup.
> (a) Processes running inside the cgroup namespace will be able to see
> cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> [ns]$ sleep 100000 & # From within unshared cgroupns
> [1] 7353
> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> [ns]$ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
> (b) From global cgroupns, the real cgroup path will be visible:
> $ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
> (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
> path will be visible:
> # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
> [ns2]$ cat /proc/7353/cgroup
> [ns2]$
> This is same as when cgroup hierarchy is not mounted at all.
> (In correct container setup though, it should not be possible to
> access PIDs in another container in the first place.)
>
> (4) Processes inside a cgroupns are not allowed to move out of the
> cgroupns-root. This is true even if a privileged process in global
> cgroupns tries to move the process out of its cgroupns-root.
>
> # From global cgroupns
> $ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> # cgroupns-root for 7353 is /batchjobs/c_job_id1
> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
> -bash: echo: write error: Operation not permitted
>
> (5) Setns to another cgroup namespace is allowed only when:
> (a) process has CAP_SYS_ADMIN in its current userns
> (b) process has CAP_SYS_ADMIN in the target cgroupns' userns
> (c) the process's current cgroup is a descendant cgroupns-root of the
> target namespace.
> (d) the target cgroupns-root is descendant of current cgroupns-root..
> The last check (d) prevents processes from escaping their cgroupns-root by
> attaching to parent cgroupns. Thus, setns is allowed only when the process
> is trying to restrict itself to a deeper cgroup hierarchy.
>
> (6) When some thread from a multi-threaded process unshares its
> cgroup-namespace, the new cgroupns gets applied to the entire
> process (all the threads). This should be OK since
> unified-hierarchy only allows process-level containerization. So
> all the threads in the process will have the same cgroup. And both
> - changing cgroups and unsharing namespaces - are protected under
> threadgroup_lock(task).
>
> (7) The cgroup namespace is alive as long as there is atleast 1
> process inside it. When the last process exits, the cgroup
> namespace is destroyed. The cgroupns-root and the actual cgroups
> remain though.
>
> (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
> the unified cgroup hierarchy with cgroupns-root as the filesystem root.
> The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
> container management tools to be run inside the containers transparently.
>
> Implementation
> The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
> branch). Its fairly non-intrusive and provides above mentioned
> features.
>
> Possible extensions of CGROUPNS:
> (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
> capabilities to restrict cgroups to administrative users. CGroup
> namespaces could be of help here. With cgroup namespaces, it might
> be possible to delegate administration of sub-cgroups under a
> cgroupns-root to the cgroupns owner.
> ---
> fs/kernfs/dir.c | 53 +++++++++---
> fs/kernfs/mount.c | 48 +++++++++++
> fs/proc/namespaces.c | 3 +
> include/linux/cgroup.h | 41 +++++++++-
> include/linux/cgroup_namespace.h | 62 +++++++++++++++
> include/linux/kernfs.h | 5 ++
> include/linux/nsproxy.h | 2 +
> include/linux/proc_ns.h | 4 +
> include/uapi/linux/sched.h | 3 +-
> init/Kconfig | 9 +++
> kernel/Makefile | 1 +
> kernel/cgroup.c | 139 ++++++++++++++++++++++++++------
> kernel/cgroup_namespace.c | 168 +++++++++++++++++++++++++++++++++++++++
> kernel/fork.c | 2 +-
> kernel/nsproxy.c | 19 ++++-
> 15 files changed, 518 insertions(+), 41 deletions(-)
> create mode 100644 include/linux/cgroup_namespace.h
> create mode 100644 kernel/cgroup_namespace.c
>
> [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
> [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCHv1 3/8] cgroup: add function to get task's cgroup on default
> [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
> [PATCHv1 5/8] cgroup: introduce cgroup namespaces
> [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
> [PATCHv1 7/8] cgroup: cgroup namespace setns support
> [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers
^ permalink raw reply
* Re: [PATCHv4 RESEND 0/3] syscalls,x86: Add execveat() system call
From: Eric W. Biederman @ 2014-10-19 0:20 UTC (permalink / raw)
To: Andy Lutomirski
Cc: David Drysdale, Alexander Viro, Meredydd Luff,
linux-kernel@vger.kernel.org, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, Andrew Morton, Kees Cook, Arnd Bergmann, X86 ML,
linux-arch, Linux API
In-Reply-To: <CALCETrVraoD+r4zxBoGd+BV5P275AXcRV_R00SSr8fjQzRHnUg@mail.gmail.com>
Andy Lutomirski <luto@amacapital.net> writes:
> [Added Eric Biederman, since I think your tree might be a reasonable
> route forward for these patches.]
>
> On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <drysdale@google.com> wrote:
>> Resending, adding cc:linux-api.
>>
>> Also, it may help to add a little more background -- this patch is
>> needed as a (small) part of implementing Capsicum in the Linux kernel.
>>
>> Capsicum is a security framework that has been present in FreeBSD since
>> version 9.0 (Jan 2012), and is based on concepts from object-capability
>> security [1].
>>
>> One of the features of Capsicum is capability mode, which locks down
>> access to global namespaces such as the filesystem hierarchy. In
>> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
>> work -- hence the need for a kernel-space
>
> I just found myself wanting this syscall for another reason: injecting
> programs into sandboxes or otherwise heavily locked-down namespaces.
>
> For example, I want to be able to reliably do something like nsenter
> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
> it is more or less fully functional, so this should Just Work (tm),
> except that the toybox binary might not exist in the namespace being
> entered. If execveat were available, I could rig nsenter or a similar
> tool to open it with O_CLOEXEC, enter the namespace, and then call
> execveat.
>
> Is there any reason that these patches can't be merged more or less as
> is for 3.19?
Yes. There is a silliness in how it implements fexecve. The fexecve
case should be use the empty string "" not a NULL pointer to indication
that. That change will then harmonize execveat with the other ...at
system calls and simplify the code and remove a special case. I believe
using the empty string "" requires implementing the AT_EMPTY_PATH flag.
For sandboxes execveat seems to make a great deal of sense. I can
get the same functionality by passing in a directory file descriptor
calling fchdir and execve so this should not introduce any new security
holes. And using the final file descriptor removes a race.
AT_SYMLINK_NOFOLLOW seems to have some limited utility as well, although
for exec I don't know what problems it can solve.
Until I am done moving I won't have time to pick this up, and the code
clearly needs another revision but I will be happy to work to see that
we get a sane execveat implemented.
Eric
p.s. I don't believe there are any namespaces issues where doing
something with execveat flags make sense.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox