Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH v3 2/2] arch: wire-up clone3() syscall
From: Arnd Bergmann @ 2019-06-04 18:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Kees Cook, Florian Weimer, Oleg Nesterov, David Howells,
	Andrew Morton, Adrian Reber, Linux API, linux-arch,
	the arch/x86 maintainers
In-Reply-To: <20190604160944.4058-2-christian@brauner.io>

On Tue, Jun 4, 2019 at 6:09 PM Christian Brauner <christian@brauner.io> wrote:
>
> Wire up the clone3() call on all arches that don't require hand-rolled
> assembly.
>
> Some of the arches look like they need special assembly massaging and it is
> probably smarter if the appropriate arch maintainers would do the actual
> wiring. Arches that are wired-up are:
> - x86{_32,64}
> - arm{64}
> - xtensa

The ones you did look good to me. I would hope that we can do all other
architectures the same way, even if they have special assembly wrappers
for the old clone(). The most interesting cases appear to be ia64, alpha,
m68k and sparc, so it would be good if their maintainers could take a
look.

What do you use for testing? Would it be possible to override the
internal clone() function in glibc with an LD_PRELOAD library
to quickly test one of the other architectures for regressions?

      Arnd

^ permalink raw reply

* Re: [RFC][PATCH 0/8] Mount, FS, Block and Keyrings notifications [ver #2]
From: Casey Schaufler @ 2019-06-04 20:31 UTC (permalink / raw)
  To: Andy Lutomirski, David Howells
  Cc: Al Viro, raven, Linux FS Devel, Linux API, linux-block, keyrings,
	LSM List, LKML, casey
In-Reply-To: <CALCETrWzDR=Ap8NQ5-YrVhXCEBgr+hwpjw9fBn0m2NkZzZ7XLQ@mail.gmail.com>

n 6/4/2019 10:43 AM, Andy Lutomirski wrote:
> On Tue, Jun 4, 2019 at 9:35 AM David Howells <dhowells@redhat.com> wrote:
>>
>> Hi Al,
>>
>> Here's a set of patches to add a general variable-length notification queue
>> concept and to add sources of events for:
> I asked before and didn't see a response, so I'll ask again.  Why are
> you paying any attention at all to the creds that generate an event?
> It seems like the resulting security model will be vary hard to
> understand and probably buggy.  Can't you define a sensible model in
> which only the listener creds matter?

We've spent the last 18 months reeling from the implications
of what can happen when one process has the ability to snoop
on another. Introducing yet another mechanism that is trivial
to exploit is a very bad idea.

I will try to explain the problem once again. If process A
sends a signal (writes information) to process B the kernel
checks that either process A has the same UID as process B
or that process A has privilege to override that policy.
Process B is passive in this access control decision, while
process A is active. In the event delivery case, process A
does something (e.g. modifies a keyring) that generates an
event, which is then sent to process B's event buffer. Again,
A is active and B is passive. Process A must have write access
(defined by some policy) to process B's event buffer. To
implement such a policy requires A's credential, and some
information about the object (passive entity) to which the
event is being delivered. You can't just use the credential
from Process B because it is not the active entity, it is the
passive entity.

>
>> LSM support is included:
>>
>>  (1) The creds of the process that did the fput() that reduced the refcount
>>      to zero are cached in the file struct.
>>
>>  (2) __fput() overrides the current creds with the creds from (1) whilst
>>      doing the cleanup, thereby making sure that the creds seen by the
>>      destruction notification generated by mntput() appears to come from
>>      the last fputter.
> That looks like duct tape that is, at best, likely to be very buggy.
>
>>  (3) security_post_notification() is called for each queue that we might
>>      want to post a notification into, thereby allowing the LSM to prevent
>>      covert communications.
> This seems like the wrong approach.  If an LSM wants to prevent covert
> communication from, say, mount actions, then it shouldn't allow the
> watch to be set up in the first place.

^ permalink raw reply

* Re: [PATCH v1 1/4] mm: introduce MADV_COLD
From: Joel Fernandes @ 2019-06-04 20:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, oleksandr, hdanton
In-Reply-To: <20190603053655.127730-2-minchan@kernel.org>

On Mon, Jun 03, 2019 at 02:36:52PM +0900, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use.  This could reduce
> workingset eviction so it ends up increasing performance.
> 
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
> 
> It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
> 
> 	active file page -> inactive file LRU
> 	active anon page -> inacdtive anon LRU
> 
> Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> files's head because MADV_COLD is a little bit different symantic.
> MADV_FREE means it's okay to discard when the memory pressure because
> the content of the page is *garbage* so freeing such pages is almost zero
> overhead since we don't need to swap out and access afterward causes just
> minor fault. Thus, it would make sense to put those freeable pages in
> inactive file LRU to compete other used-once pages. Even, it could
> give a bonus to make them be reclaimed on swapless system. However,
> MADV_COLD doesn't mean garbage so reclaiming them requires swap-out/in
> in the end. So it's better to move inactive anon's LRU list, not file LRU.
> Furthermore, it would help to avoid unnecessary scanning of cold anonymous
> if system doesn't have a swap device.
> 
> All of error rule is same with MADV_DONTNEED.
> 
> Note:
> This hint works with only private pages(IOW, page_mapcount(page) < 2)
> because shared page could have more chance to be accessed from other
> processes sharing the page although the caller reset the reference bits.
> It ends up preventing the reclaim of the page and wastes CPU cycle.
> 
> * RFCv2
>  * add more description - mhocko
> 
> * RFCv1
>  * renaming from MADV_COOL to MADV_COLD - hannes
> 
> * internal review
>  * use clear_page_youn in deactivate_page - joelaf
>  * Revise the description - surenb
>  * Renaming from MADV_WARM to MADV_COOL - surenb
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 ++++
>  include/linux/swap.h                   |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/internal.h                          |   2 +-
>  mm/madvise.c                           | 115 ++++++++++++++++++++++++-
>  mm/oom_kill.c                          |   2 +-
>  mm/swap.c                              |  43 +++++++++
>  8 files changed, 176 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 9f8712a4b1a5..58b06654c8dd 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
>  TESTPAGEFLAG(Young, young, PF_ANY)
>  SETPAGEFLAG(Young, young, PF_ANY)
>  TESTCLEARFLAG(Young, young, PF_ANY)
> +CLEARPAGEFLAG(Young, young, PF_ANY)
>  PAGEFLAG(Idle, idle, PF_ANY)
>  #endif
>  
> diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
> index 1e894d34bdce..f3f43b317150 100644
> --- a/include/linux/page_idle.h
> +++ b/include/linux/page_idle.h
> @@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
>  	SetPageYoung(page);
>  }
>  
> +static inline void clear_page_young(struct page *page)
> +{
> +	ClearPageYoung(page);
> +}
> +
>  static inline bool test_and_clear_page_young(struct page *page)
>  {
>  	return TestClearPageYoung(page);
> @@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
>  	set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
>  }
>  
> +static void clear_page_young(struct page *page)
> +{
> +	struct page_ext *page_ext = lookup_page_ext(page);
> +
> +	if (unlikely(!page_ext))
> +		return;
> +
> +	clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
> +}
> +
>  static inline bool test_and_clear_page_young(struct page *page)
>  {
>  	struct page_ext *page_ext = lookup_page_ext(page);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index de2c67a33b7e..0ce997edb8bb 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_all(void);
>  extern void rotate_reclaimable_page(struct page *page);
>  extern void deactivate_file_page(struct page *page);
> +extern void deactivate_page(struct page *page);
>  extern void mark_page_lazyfree(struct page *page);
>  extern void swap_setup(void);
>  
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index bea0278f65ab..1190f4e7f7b9 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -43,6 +43,7 @@
>  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
> +#define MADV_COLD	5		/* deactivatie these pages */
>  
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_FREE	8		/* free pages only if memory pressure */
> diff --git a/mm/internal.h b/mm/internal.h
> index 9eeaf2b95166..75a4f96ec0fb 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -43,7 +43,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  		unsigned long floor, unsigned long ceiling);
>  
> -static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
> +static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
>  {
>  	return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
>  }
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 628022e674a7..ab158766858a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_REMOVE:
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
> +	case MADV_COLD:
>  	case MADV_FREE:
>  		return 0;
>  	default:
> @@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +{
> +	pte_t *orig_pte, *pte, ptent;
> +	spinlock_t *ptl;
> +	struct page *page;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long next;
> +
> +	next = pmd_addr_end(addr, end);
> +	if (pmd_trans_huge(*pmd)) {
> +		ptl = pmd_trans_huge_lock(pmd, vma);
> +		if (!ptl)
> +			return 0;
> +
> +		if (is_huge_zero_pmd(*pmd))
> +			goto huge_unlock;
> +
> +		page = pmd_page(*pmd);
> +		if (page_mapcount(page) > 1)
> +			goto huge_unlock;
> +
> +		if (next - addr != HPAGE_PMD_SIZE) {
> +			int err;
> +
> +			get_page(page);
> +			spin_unlock(ptl);
> +			lock_page(page);
> +			err = split_huge_page(page);
> +			unlock_page(page);
> +			put_page(page);
> +			if (!err)
> +				goto regular_page;
> +			return 0;
> +		}
> +
> +		pmdp_test_and_clear_young(vma, addr, pmd);
> +		deactivate_page(page);
> +huge_unlock:
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +regular_page:
> +	orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +
> +		if (pte_none(ptent))
> +			continue;
> +
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (page_mapcount(page) > 1)
> +			continue;
> +
> +		ptep_test_and_clear_young(vma, addr, pte);

Wondering here how it interacts with idle page tracking. Here since young
flag is cleared by the cold hint, page_referenced_one() or
page_idle_clear_pte_refs_one() will not be able to clear the page-idle flag
if it was previously set since it does not know any more that a page was
actively referenced.

Should you also be clearing the page-idle flag here if the ptep young/accessed
bit was previously set, just so that page-idle tracking works smoothly when
this hint is concurrently applied?

> +		deactivate_page(page);
> +	}
> +
> +	pte_unmap_unlock(orig_pte, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static void madvise_cold_page_range(struct mmu_gather *tlb,
> +			     struct vm_area_struct *vma,
> +			     unsigned long addr, unsigned long end)
> +{
> +	struct mm_walk cool_walk = {
> +		.pmd_entry = madvise_cold_pte_range,
> +		.mm = vma->vm_mm,
> +	};
> +
> +	tlb_start_vma(tlb, vma);
> +	walk_page_range(addr, end, &cool_walk);
> +	tlb_end_vma(tlb, vma);
> +}
> +
> +static long madvise_cold(struct vm_area_struct *vma,
> +			struct vm_area_struct **prev,
> +			unsigned long start_addr, unsigned long end_addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_gather tlb;
> +
> +	*prev = vma;
> +	if (!can_madv_lru_vma(vma))
> +		return -EINVAL;
> +
> +	lru_add_drain();
> +	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> +	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
> +	tlb_finish_mmu(&tlb, start_addr, end_addr);
> +
> +	return 0;
> +}
> +
>  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  				unsigned long end, struct mm_walk *walk)
>  
> @@ -519,7 +627,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  				  int behavior)
>  {
>  	*prev = vma;
> -	if (!can_madv_dontneed_vma(vma))
> +	if (!can_madv_lru_vma(vma))
>  		return -EINVAL;
>  
>  	if (!userfaultfd_remove(vma, start, end)) {
> @@ -541,7 +649,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>  			 */
>  			return -ENOMEM;
>  		}
> -		if (!can_madv_dontneed_vma(vma))
> +		if (!can_madv_lru_vma(vma))
>  			return -EINVAL;
>  		if (end > vma->vm_end) {
>  			/*
> @@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		return madvise_remove(vma, prev, start, end);
>  	case MADV_WILLNEED:
>  		return madvise_willneed(vma, prev, start, end);
> +	case MADV_COLD:
> +		return madvise_cold(vma, prev, start, end);
>  	case MADV_FREE:
>  	case MADV_DONTNEED:
>  		return madvise_dontneed_free(vma, prev, start, end, behavior);
> @@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
>  	case MADV_FREE:
> +	case MADV_COLD:
>  #ifdef CONFIG_KSM
>  	case MADV_MERGEABLE:
>  	case MADV_UNMERGEABLE:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5a58778c91d4..f73d5f5145f0 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -515,7 +515,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>  
>  	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
> -		if (!can_madv_dontneed_vma(vma))
> +		if (!can_madv_lru_vma(vma))
>  			continue;
>  
>  		/*
> diff --git a/mm/swap.c b/mm/swap.c
> index 7b079976cbec..cebedab15aa2 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -47,6 +47,7 @@ int page_cluster;
>  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> @@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  	update_page_reclaim_stat(lruvec, file, 0);
>  }
>  
> +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> +			    void *arg)
> +{
> +	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +		int file = page_is_file_cache(page);
> +		int lru = page_lru_base_type(page);
> +
> +		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> +		ClearPageActive(page);
> +		ClearPageReferenced(page);
> +		clear_page_young(page);

I was curious, how does the above interact with clear_refs?

It appears that a range that is marked as cold will appear to be unreferenced
(since referenced flag is cleared) even though it may have been referenced in
the past. Very least, I believe this behavior should be documented somewhere.

I believe it is best for us to make the clear_refs technique behave similar
to the idle-page tracking for the purposes of figuring out which pages are
referenced, so that clear_refs is more immune to this hint.

thanks,

 - Joel

^ permalink raw reply

* Re: [RFC][PATCH 0/8] Mount, FS, Block and Keyrings notifications [ver #2]
From: David Howells @ 2019-06-04 20:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: dhowells, Al Viro, Casey Schaufler, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LSM List, LKML
In-Reply-To: <CALCETrWzDR=Ap8NQ5-YrVhXCEBgr+hwpjw9fBn0m2NkZzZ7XLQ@mail.gmail.com>

Andy Lutomirski <luto@kernel.org> wrote:

> > Here's a set of patches to add a general variable-length notification queue
> > concept and to add sources of events for:
> 
> I asked before and didn't see a response, so I'll ask again.  Why are you
> paying any attention at all to the creds that generate an event?

Casey responded to you.  It's one of his requirements.

I'm not sure of the need, and I particularly don't like trying to make
indirect destruction events (mount destruction keyed on fput, for instance)
carry the creds of the triggerer.  Indeed, the trigger can come from all sorts
of places - including af_unix queue destruction, someone poking around in
procfs, a variety of processes fputting simultaneously.  Only one of them can
win, and the LSM needs to handle *all* the possibilities.

However, the LSMs (or at least SELinux) ignore f_cred and use current_cred()
when checking permissions.  See selinux_revalidate_file_permission() for
example - it uses current_cred() not file->f_cred to re-evaluate the perms,
and the fd might be shared between a number of processes with different creds.

> This seems like the wrong approach.  If an LSM wants to prevent covert
> communication from, say, mount actions, then it shouldn't allow the
> watch to be set up in the first place.

Yeah, I can agree to that.  Casey?

David

^ permalink raw reply

* Re: [RFC][PATCH 0/8] Mount, FS, Block and Keyrings notifications [ver #2]
From: Andy Lutomirski @ 2019-06-04 20:57 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, Al Viro, Casey Schaufler, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LSM List, LKML
In-Reply-To: <1207.1559680778@warthog.procyon.org.uk>

On Tue, Jun 4, 2019 at 1:39 PM David Howells <dhowells@redhat.com> wrote:
>
> Andy Lutomirski <luto@kernel.org> wrote:
>
> > > Here's a set of patches to add a general variable-length notification queue
> > > concept and to add sources of events for:
> >
> > I asked before and didn't see a response, so I'll ask again.  Why are you
> > paying any attention at all to the creds that generate an event?
>
> Casey responded to you.  It's one of his requirements.
>

It being a "requirement" doesn't make it okay.

> However, the LSMs (or at least SELinux) ignore f_cred and use current_cred()
> when checking permissions.  See selinux_revalidate_file_permission() for
> example - it uses current_cred() not file->f_cred to re-evaluate the perms,
> and the fd might be shared between a number of processes with different creds.

That's a bug.  It's arguably a rather severe bug.  If I ever get
around to writing the patch I keep thinking of that will warn if we
use creds from invalid contexts, it will warn.

Let's please not repeat this.

^ permalink raw reply

* Re: [RFC][PATCH 0/8] Mount, FS, Block and Keyrings notifications [ver #2]
From: Andy Lutomirski @ 2019-06-04 21:05 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Andy Lutomirski, David Howells, Al Viro, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LSM List, LKML
In-Reply-To: <50c2ea19-6ae8-1f42-97ef-ba5c95e40475@schaufler-ca.com>

On Tue, Jun 4, 2019 at 1:31 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> n 6/4/2019 10:43 AM, Andy Lutomirski wrote:
> > On Tue, Jun 4, 2019 at 9:35 AM David Howells <dhowells@redhat.com> wrote:
> >>
> >> Hi Al,
> >>
> >> Here's a set of patches to add a general variable-length notification queue
> >> concept and to add sources of events for:
> > I asked before and didn't see a response, so I'll ask again.  Why are
> > you paying any attention at all to the creds that generate an event?
> > It seems like the resulting security model will be vary hard to
> > understand and probably buggy.  Can't you define a sensible model in
> > which only the listener creds matter?
>
> We've spent the last 18 months reeling from the implications
> of what can happen when one process has the ability to snoop
> on another. Introducing yet another mechanism that is trivial
> to exploit is a very bad idea.

If you're talking about Spectre, etc, this is IMO entirely irrelevant.
Among other things, setting these watches can and should require some
degree of privilege.

>
> I will try to explain the problem once again. If process A
> sends a signal (writes information) to process B the kernel
> checks that either process A has the same UID as process B
> or that process A has privilege to override that policy.
> Process B is passive in this access control decision, while
> process A is active.

Are you stating what you see to be a requirement?

> Process A must have write access
> (defined by some policy) to process B's event buffer.

No, stop right here.  Process B is monitoring some aspect of the
system.  Process A is doing something.  Process B should need
permission to monitor whatever it's monitoring, and process A should
have permission to do whatever it's doing.  I don't think it makes
sense to try to ascribe an identity to the actor doing some action to
decide to omit it from the watch -- this has all kinds of correctness
issues.

If you're writing a policy and you don't like letting process B spy on
processes doing various things, then disallow that type of spying.

> To
> implement such a policy requires A's credential,

You may not design a new mechanism that looks at the credential in a
context where looking at a credential is invalid unless you have some
very strong justification for why all of the known reasons that it's a
bad idea don't apply to what you're doing.

So, without a much stronger justification, NAK.

^ permalink raw reply

* Re: [RFC][PATCH 0/8] Mount, FS, Block and Keyrings notifications [ver #2]
From: Casey Schaufler @ 2019-06-04 21:11 UTC (permalink / raw)
  To: David Howells, Andy Lutomirski
  Cc: Al Viro, raven, Linux FS Devel, Linux API, linux-block, keyrings,
	LSM List, LKML, casey
In-Reply-To: <1207.1559680778@warthog.procyon.org.uk>

On 6/4/2019 1:39 PM, David Howells wrote:
> Andy Lutomirski <luto@kernel.org> wrote:
>
>>> Here's a set of patches to add a general variable-length notification queue
>>> concept and to add sources of events for:
>> I asked before and didn't see a response, so I'll ask again.  Why are you
>> paying any attention at all to the creds that generate an event?
> Casey responded to you.  It's one of his requirements.

Process A takes an action. As a result of that action,
an event is written to Process B's event buffer. This isn't
a covert channel, it's a direct access, just like sending
a signal. Process A is the subject and the event buffer,
which is part of Process B, is the object.


> I'm not sure of the need, and I particularly don't like trying to make
> indirect destruction events (mount destruction keyed on fput, for instance)
> carry the creds of the triggerer.  Indeed, the trigger can come from all sorts
> of places - including af_unix queue destruction, someone poking around in
> procfs, a variety of processes fputting simultaneously.  Only one of them can
> win, and the LSM needs to handle *all* the possibilities.

Yes, it's a hairy problem. It was a significant factor in the
demise of kdbus.

> However, the LSMs (or at least SELinux) ignore f_cred and use current_cred()
> when checking permissions.  See selinux_revalidate_file_permission() for
> example - it uses current_cred() not file->f_cred to re-evaluate the perms,
> and the fd might be shared between a number of processes with different creds.
>
>> This seems like the wrong approach.  If an LSM wants to prevent covert
>> communication from, say, mount actions, then it shouldn't allow the
>> watch to be set up in the first place.
> Yeah, I can agree to that.  Casey?

Back to your earlier point, you don't know where the
event is coming from when you create the event watch.
If you enforce a watch time, what are you going to check?
Isn't this going to be considered too restrictive?

^ permalink raw reply

* Re: [PATCH v3 2/2] arch: wire-up clone3() syscall
From: Christian Brauner @ 2019-06-04 21:29 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Al Viro, Linux Kernel Mailing List, Linus Torvalds, Jann Horn,
	Kees Cook, Florian Weimer, Oleg Nesterov, David Howells,
	Andrew Morton, Adrian Reber, Linux API, linux-arch,
	the arch/x86 maintainers
In-Reply-To: <CAK8P3a0OfBpx6y4m5uWX-DUg16NoFby5ik-3xCcD+yMrw0tbEw@mail.gmail.com>

On Tue, Jun 04, 2019 at 08:40:01PM +0200, Arnd Bergmann wrote:
> On Tue, Jun 4, 2019 at 6:09 PM Christian Brauner <christian@brauner.io> wrote:
> >
> > Wire up the clone3() call on all arches that don't require hand-rolled
> > assembly.
> >
> > Some of the arches look like they need special assembly massaging and it is
> > probably smarter if the appropriate arch maintainers would do the actual
> > wiring. Arches that are wired-up are:
> > - x86{_32,64}
> > - arm{64}
> > - xtensa
> 
> The ones you did look good to me. I would hope that we can do all other
> architectures the same way, even if they have special assembly wrappers
> for the old clone(). The most interesting cases appear to be ia64, alpha,
> m68k and sparc, so it would be good if their maintainers could take a
> look.

Yes, agreed. They can sort this out even after this lands.

> 
> What do you use for testing? Would it be possible to override the
> internal clone() function in glibc with an LD_PRELOAD library
> to quickly test one of the other architectures for regressions?

I have a test program that is rather horrendously ugly and I compiled
kernels for x86 and the arms and tested in qemu. The program basically
looks like [1].

Christian

[1]:
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/sched.h>
#include <linux/types.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t raw_clone(struct clone_args *args)
{
	return syscall(__NR_clone3, args, sizeof(struct clone_args));
}

static pid_t raw_clone_legacy(int *pidfd, unsigned int flags)
{
	return syscall(__NR_clone, flags, 0, pidfd, 0, 0);
}

static int wait_for_pid(pid_t pid)
{
	int status, ret;

again:
	ret = waitpid(pid, &status, 0);
	if (ret == -1) {
		if (errno == EINTR)
			goto again;

		return -1;
	}

	if (ret != pid)
		goto again;

	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	return 0;
}

#define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))
#define u64_to_ptr(n) ((uintptr_t)((__u64)(n)))

int main(int argc, char *argv[])
{
	int pidfd = -1;
	pid_t parent_tid = -1, pid = -1;
	struct clone_args args = {0};
	args.parent_tid = ptr_to_u64(&parent_tid);
	args.pidfd = ptr_to_u64(&pidfd);
	args.flags = CLONE_PIDFD | CLONE_PARENT_SETTID;
	args.exit_signal = SIGCHLD;

	pid = raw_clone(&args);
	if (pid < 0) {
		fprintf(stderr, "%s - Failed to create new process\n",
			strerror(errno));
		exit(EXIT_FAILURE);
	}

	if (pid == 0) {
		printf("I am the child with pid %d\n", getpid());
		exit(EXIT_SUCCESS);
	}

	printf("raw_clone: I am the parent. My child's pid is   %d\n", pid);
	printf("raw_clone: I am the parent. My child's pidfd is %d\n",
	       *(int *)args.pidfd);
	printf("raw_clone: I am the parent. My child's paren_tid value is %d\n",
	       *(pid_t *)args.parent_tid);

	if (wait_for_pid(pid))
		exit(EXIT_FAILURE);

	if (pid != *(pid_t *)args.parent_tid)
		exit(EXIT_FAILURE);

	close(pidfd);

	printf("\n\n");
	pidfd = -1;
	pid = raw_clone_legacy(&pidfd, CLONE_PIDFD | SIGCHLD);
	if (pid < 0) {
		fprintf(stderr, "%s - Failed to create new process\n",
			strerror(errno));
		exit(EXIT_FAILURE);
	}

	if (pid == 0) {
		printf("I am the child with pid %d\n", getpid());
		exit(EXIT_SUCCESS);
	}

	printf("raw_clone_legacy: I am the parent. My child's pid is   %d\n",
	       pid);
	printf("raw_clone_legacy: I am the parent. My child's pidfd is %d\n",
	       pidfd);

	if (wait_for_pid(pid))
		exit(EXIT_FAILURE);

	if (pid != *(pid_t *)args.parent_tid)
		exit(EXIT_FAILURE);

	return 0;
}

^ permalink raw reply

* Re: [PATCH v3 1/2] fork: add clone3
From: Christian Brauner @ 2019-06-04 21:54 UTC (permalink / raw)
  To: torvalds
  Cc: keescook, fweimer, oleg, arnd, dhowells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, linux-api, viro,
	linux-kernel, jannh
In-Reply-To: <20190604160944.4058-1-christian@brauner.io>

On Tue, Jun 04, 2019 at 06:09:43PM +0200, Christian Brauner wrote:
> This adds the clone3 system call.
> 
> As mentioned several times already (cf. [7], [8]) here's the promised
> patchset for clone3().
> 
> We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> free flag from clone().
> 
> Independent of the CLONE_PIDFD patchset a time namespace has been discussed
> at Linux Plumber Conference last year and has been sent out and reviewed
> (cf. [5]). It is expected that it will go upstream in the not too distant
> future. However, it relies on the addition of the CLONE_NEWTIME flag to
> clone(). The only other good candidate - CLONE_DETACHED - is currently not
> recyclable as we have identified at least two large or widely used
> codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
> CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
> blocked. clone3() has the advantage that it will unblock this patchset
> again. In general, clone3() is extensible and allows for the implementation
> of new features.
> 
> The idea is to keep clone3() very simple and close to the original clone(),
> specifically, to keep on supporting old clone()-based workloads.
> We know there have been various creative proposals how a new process
> creation syscall or even api is supposed to look like. Some people even
> going so far as to argue that the traditional fork()+exec() split should be
> abandoned in favor of an in-kernel version of spawn(). Independent of
> whether or not we personally think spawn() is a good idea this patchset has
> and does not want to have anything to do with this.
> One stance we take is that there's no real good alternative to
> clone()+exec() and we need and want to support this model going forward;
> independent of spawn().
> The following requirements guided clone3():
> - bump the number of available flags
> - move arguments that are currently passed as separate arguments
>   in clone() into a dedicated struct clone_args
>   - choose a struct layout that is easy to handle on 32 and on 64 bit
>   - choose a struct layout that is extensible
>   - give new flags that currently need to abuse another flag's dedicated
>     return argument in clone() their own dedicated return argument
>     (e.g. CLONE_PIDFD)
>   - use a separate kernel internal struct kernel_clone_args that is
>     properly typed according to current kernel conventions in fork.c and is
>     different from  the uapi struct clone_args
> - port _do_fork() to use kernel_clone_args so that all process creation
>   syscalls such as fork(), vfork(), clone(), and clone3() behave identical
>   (Arnd suggested, that we can probably also port do_fork() itself in a
>    separate patchset.)
> - ease of transition for userspace from clone() to clone3()
>   This very much means that we do *not* remove functionality that userspace
>   currently relies on as the latter is a good way of creating a syscall
>   that won't be adopted.
> - do not try to be clever or complex: keep clone3() as dumb as possible
> 
> In accordance with Linus suggestions (cf. [11]), clone3() has the following
> signature:
> 
> /* uapi */
> struct clone_args {
>         __aligned_u64 flags;
>         __aligned_u64 pidfd;
>         __aligned_u64 child_tid;
>         __aligned_u64 parent_tid;
>         __aligned_u64 exit_signal;
>         __aligned_u64 stack;
>         __aligned_u64 stack_size;
>         __aligned_u64 tls;
> };
> 
> /* kernel internal */
> struct kernel_clone_args {
>         u64 flags;
>         int __user *pidfd;
>         int __user *child_tid;
>         int __user *parent_tid;
>         int exit_signal;
>         unsigned long stack;
>         unsigned long stack_size;
>         unsigned long tls;
> };
> 
> long sys_clone3(struct clone_args __user *uargs, size_t size)
> 
> clone3() cleanly supports all of the supported flags from clone() and thus
> all legacy workloads.
> The advantage of sticking close to the old clone() is the low cost for
> userspace to switch to this new api. Quite a lot of userspace apis (e.g.
> pthreads) are based on the clone() syscall. With the new clone3() syscall
> supporting all of the old workloads and opening up the ability to add new
> features should make switching to it for userspace more appealing. In
> essence, glibc can just write a simple wrapper to switch from clone() to
> clone3().
> 
> There has been some interest in this patchset already. We have received a
> patch from the CRIU corner for clone3() that would set the PID/TID of a
> restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
> 
> /* User visible differences to legacy clone() */
> - CLONE_DETACHED will cause EINVAL with clone3()
> - CSIGNAL is deprecated
>   It is superseeded by a dedicated "exit_signal" argument in struct
>   clone_args freeing up space for additional flags.
>   This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
> 
> /* References */
> [1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
> [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
> [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
> [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
> [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
> [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
> [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
> [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
> [9]: https://lore.kernel.org/lkml/20190529222414.GA6492@gmail.com/
> [10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@mail.gmail.com/
> [11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@mail.gmail.com/
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Pavel Emelyanov <xemul@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Adrian Reber <adrian@lisas.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: linux-api@vger.kernel.org

Linus,

Would you in principle be fine receiving this for 5.3 through my tree
together with the pidfd_open() and pidfd polling patches or would you
prefer a separate PR for it, or have this go alltogether through someone
else's tree (all assuming no nack of course)?
(I'd let Al handle close_range() as this seems vfs territory.)

Thanks!
Christian

> ---
> v1:
> - Linus Torvalds <torvalds@linux-foundation.org>:
>   - redesign based on Linus proposal
>   - switch from arg-based to revision-based naming scheme: s/clone6/clone3/
> - Arnd Bergmann <arnd@arndb.de>:
>   - use a single copy_from_user() instead of multiple get_user() calls
>     since the latter have a constant overhead on some architectures
>   - a range of other tweaks and suggestions
> v2:
> - Linus Torvalds <torvalds@linux-foundation.org>,
>   Andrei Vagin <avagin@gmail.com>:
>   - replace CSIGNAL flag with dedicated exit_signal argument in struct
>     clone_args
> - Christian Brauner <christian@brauner.io>:
>   - improve naming for some struct clone_args members
> v3:
> - Arnd Bergmann <arnd@arndb.de>:
>   - replace memset with constructor for clarity and better object code
>   - call flag verification function clone3_flags_valid() on
>     kernel_clone_args instead of clone_args
>   - remove __ARCH_WANT_SYS_CLONE ifdefine around sys_clone3()
> - Christian Brauner <christian@brauner.io>:
>   - replace clone3_flags_valid() with clone3_args_valid() and call in
>     clone3() directly rather than in copy_clone_args_from_user()
>     This cleanly separates copying the args from userspace from the
>     verification whether those args are sane.
> - David Howells <dhowells@redhat.com>:
>   - align new struct member assignments with tabs
>   - replace CLONE_MAX by with a non-uapi exported CLONE_LEGACY_FLAGS and
>     define it as  0xffffffffULL for clarity
>   - make copy_clone_args_from_user() noinline
>   - avoid assigning to local variables from struct kernel_clone_args
>     members in cases where it makes sense
> ---
>  arch/x86/ia32/sys_ia32.c   |  12 ++-
>  include/linux/sched/task.h |  17 +++-
>  include/linux/syscalls.h   |   4 +
>  include/uapi/linux/sched.h |  16 +++
>  kernel/fork.c              | 201 ++++++++++++++++++++++++++++---------
>  5 files changed, 199 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
> index a43212036257..64a6c952091e 100644
> --- a/arch/x86/ia32/sys_ia32.c
> +++ b/arch/x86/ia32/sys_ia32.c
> @@ -237,6 +237,14 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
>  		       unsigned long, newsp, int __user *, parent_tidptr,
>  		       unsigned long, tls_val, int __user *, child_tidptr)
>  {
> -	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
> -			tls_val);
> +	struct kernel_clone_args args = {
> +		.flags		= (clone_flags & ~CSIGNAL),
> +		.child_tid	= child_tidptr,
> +		.parent_tid	= parent_tidptr,
> +		.exit_signal	= (clone_flags & CSIGNAL),
> +		.stack		= newsp,
> +		.tls		= tls_val,
> +	};
> +
> +	return _do_fork(&args);
>  }
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index f1227f2c38a4..109a0df5af39 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -8,11 +8,26 @@
>   */
>  
>  #include <linux/sched.h>
> +#include <linux/uaccess.h>
>  
>  struct task_struct;
>  struct rusage;
>  union thread_union;
>  
> +/* All the bits taken by the old clone syscall. */
> +#define CLONE_LEGACY_FLAGS 0xffffffffULL
> +
> +struct kernel_clone_args {
> +	u64 flags;
> +	int __user *pidfd;
> +	int __user *child_tid;
> +	int __user *parent_tid;
> +	int exit_signal;
> +	unsigned long stack;
> +	unsigned long stack_size;
> +	unsigned long tls;
> +};
> +
>  /*
>   * This serializes "schedule()" and also protects
>   * the run-queue from deletions/modifications (but
> @@ -73,7 +88,7 @@ extern void do_group_exit(int);
>  extern void exit_files(struct task_struct *);
>  extern void exit_itimers(struct signal_struct *);
>  
> -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> +extern long _do_fork(struct kernel_clone_args *kargs);
>  extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>  struct task_struct *fork_idle(int);
>  struct mm_struct *copy_init_mm(void);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..60a81f374ca3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -70,6 +70,7 @@ struct sigaltstack;
>  struct rseq;
>  union bpf_attr;
>  struct io_uring_params;
> +struct clone_args;
>  
>  #include <linux/types.h>
>  #include <linux/aio_abi.h>
> @@ -852,6 +853,9 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
>  	       int __user *, unsigned long);
>  #endif
>  #endif
> +
> +asmlinkage long sys_clone3(struct clone_args __user *uargs, size_t size);
> +
>  asmlinkage long sys_execve(const char __user *filename,
>  		const char __user *const __user *argv,
>  		const char __user *const __user *envp);
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index ed4ee170bee2..f5331dbdcaa2 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -2,6 +2,8 @@
>  #ifndef _UAPI_LINUX_SCHED_H
>  #define _UAPI_LINUX_SCHED_H
>  
> +#include <linux/types.h>
> +
>  /*
>   * cloning flags:
>   */
> @@ -31,6 +33,20 @@
>  #define CLONE_NEWNET		0x40000000	/* New network namespace */
>  #define CLONE_IO		0x80000000	/* Clone io context */
>  
> +/*
> + * Arguments for the clone3 syscall
> + */
> +struct clone_args {
> +	__aligned_u64 flags;
> +	__aligned_u64 pidfd;
> +	__aligned_u64 child_tid;
> +	__aligned_u64 parent_tid;
> +	__aligned_u64 exit_signal;
> +	__aligned_u64 stack;
> +	__aligned_u64 stack_size;
> +	__aligned_u64 tls;
> +};
> +
>  /*
>   * Scheduling policies
>   */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b4cba953040a..08ff131f26b4 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1760,19 +1760,15 @@ static __always_inline void delayed_free_task(struct task_struct *tsk)
>   * flags). The actual kick-off is left to the caller.
>   */
>  static __latent_entropy struct task_struct *copy_process(
> -					unsigned long clone_flags,
> -					unsigned long stack_start,
> -					unsigned long stack_size,
> -					int __user *parent_tidptr,
> -					int __user *child_tidptr,
>  					struct pid *pid,
>  					int trace,
> -					unsigned long tls,
> -					int node)
> +					int node,
> +					struct kernel_clone_args *args)
>  {
>  	int pidfd = -1, retval;
>  	struct task_struct *p;
>  	struct multiprocess_signals delayed;
> +	u64 clone_flags = args->flags;
>  
>  	/*
>  	 * Don't allow sharing the root directory with processes in a different
> @@ -1821,27 +1817,12 @@ static __latent_entropy struct task_struct *copy_process(
>  	}
>  
>  	if (clone_flags & CLONE_PIDFD) {
> -		int reserved;
> -
>  		/*
> -		 * - CLONE_PARENT_SETTID is useless for pidfds and also
> -		 *   parent_tidptr is used to return pidfds.
>  		 * - CLONE_DETACHED is blocked so that we can potentially
>  		 *   reuse it later for CLONE_PIDFD.
>  		 * - CLONE_THREAD is blocked until someone really needs it.
>  		 */
> -		if (clone_flags &
> -		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
> -			return ERR_PTR(-EINVAL);
> -
> -		/*
> -		 * Verify that parent_tidptr is sane so we can potentially
> -		 * reuse it later.
> -		 */
> -		if (get_user(reserved, parent_tidptr))
> -			return ERR_PTR(-EFAULT);
> -
> -		if (reserved != 0)
> +		if (clone_flags & (CLONE_DETACHED | CLONE_THREAD))
>  			return ERR_PTR(-EINVAL);
>  	}
>  
> @@ -1874,11 +1855,11 @@ static __latent_entropy struct task_struct *copy_process(
>  	 * p->set_child_tid which is (ab)used as a kthread's data pointer for
>  	 * kernel threads (PF_KTHREAD).
>  	 */
> -	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
> +	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? args->child_tid : NULL;
>  	/*
>  	 * Clear TID on mm_release()?
>  	 */
> -	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL;
> +	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? args->child_tid : NULL;
>  
>  	ftrace_graph_init_task(p);
>  
> @@ -2037,7 +2018,8 @@ static __latent_entropy struct task_struct *copy_process(
>  	retval = copy_io(clone_flags, p);
>  	if (retval)
>  		goto bad_fork_cleanup_namespaces;
> -	retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
> +	retval = copy_thread_tls(clone_flags, args->stack, args->stack_size, p,
> +				 args->tls);
>  	if (retval)
>  		goto bad_fork_cleanup_io;
>  
> @@ -2062,7 +2044,7 @@ static __latent_entropy struct task_struct *copy_process(
>  			goto bad_fork_free_pid;
>  
>  		pidfd = retval;
> -		retval = put_user(pidfd, parent_tidptr);
> +		retval = put_user(pidfd, args->pidfd);
>  		if (retval)
>  			goto bad_fork_put_pidfd;
>  	}
> @@ -2105,7 +2087,7 @@ static __latent_entropy struct task_struct *copy_process(
>  		if (clone_flags & CLONE_PARENT)
>  			p->exit_signal = current->group_leader->exit_signal;
>  		else
> -			p->exit_signal = (clone_flags & CSIGNAL);
> +			p->exit_signal = args->exit_signal;
>  		p->group_leader = p;
>  		p->tgid = p->pid;
>  	}
> @@ -2313,8 +2295,11 @@ static inline void init_idle_pids(struct task_struct *idle)
>  struct task_struct *fork_idle(int cpu)
>  {
>  	struct task_struct *task;
> -	task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
> -			    cpu_to_node(cpu));
> +	struct kernel_clone_args args = {
> +		.flags = CLONE_VM,
> +	};
> +
> +	task = copy_process(&init_struct_pid, 0, cpu_to_node(cpu), &args);
>  	if (!IS_ERR(task)) {
>  		init_idle_pids(task);
>  		init_idle(task, cpu);
> @@ -2334,13 +2319,9 @@ struct mm_struct *copy_init_mm(void)
>   * It copies the process, and if successful kick-starts
>   * it and waits for it to finish using the VM if required.
>   */
> -long _do_fork(unsigned long clone_flags,
> -	      unsigned long stack_start,
> -	      unsigned long stack_size,
> -	      int __user *parent_tidptr,
> -	      int __user *child_tidptr,
> -	      unsigned long tls)
> +long _do_fork(struct kernel_clone_args *args)
>  {
> +	u64 clone_flags = args->flags;
>  	struct completion vfork;
>  	struct pid *pid;
>  	struct task_struct *p;
> @@ -2356,7 +2337,7 @@ long _do_fork(unsigned long clone_flags,
>  	if (!(clone_flags & CLONE_UNTRACED)) {
>  		if (clone_flags & CLONE_VFORK)
>  			trace = PTRACE_EVENT_VFORK;
> -		else if ((clone_flags & CSIGNAL) != SIGCHLD)
> +		else if (args->exit_signal != SIGCHLD)
>  			trace = PTRACE_EVENT_CLONE;
>  		else
>  			trace = PTRACE_EVENT_FORK;
> @@ -2365,8 +2346,7 @@ long _do_fork(unsigned long clone_flags,
>  			trace = 0;
>  	}
>  
> -	p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
> -			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
> +	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
>  	add_latent_entropy();
>  
>  	if (IS_ERR(p))
> @@ -2382,7 +2362,7 @@ long _do_fork(unsigned long clone_flags,
>  	nr = pid_vnr(pid);
>  
>  	if (clone_flags & CLONE_PARENT_SETTID)
> -		put_user(nr, parent_tidptr);
> +		put_user(nr, args->parent_tid);
>  
>  	if (clone_flags & CLONE_VFORK) {
>  		p->vfork_done = &vfork;
> @@ -2414,8 +2394,16 @@ long do_fork(unsigned long clone_flags,
>  	      int __user *parent_tidptr,
>  	      int __user *child_tidptr)
>  {
> -	return _do_fork(clone_flags, stack_start, stack_size,
> -			parent_tidptr, child_tidptr, 0);
> +	struct kernel_clone_args args = {
> +		.flags		= (clone_flags & ~CSIGNAL),
> +		.child_tid	= child_tidptr,
> +		.parent_tid	= parent_tidptr,
> +		.exit_signal	= (clone_flags & CSIGNAL),
> +		.stack		= stack_start,
> +		.stack_size	= stack_size,
> +	};
> +
> +	return _do_fork(&args);
>  }
>  #endif
>  
> @@ -2424,15 +2412,25 @@ long do_fork(unsigned long clone_flags,
>   */
>  pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
>  {
> -	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
> -		(unsigned long)arg, NULL, NULL, 0);
> +	struct kernel_clone_args args = {
> +		.flags		= ((flags | CLONE_VM | CLONE_UNTRACED) & ~CSIGNAL),
> +		.exit_signal	= (flags & CSIGNAL),
> +		.stack		= (unsigned long)fn,
> +		.stack_size	= (unsigned long)arg,
> +	};
> +
> +	return _do_fork(&args);
>  }
>  
>  #ifdef __ARCH_WANT_SYS_FORK
>  SYSCALL_DEFINE0(fork)
>  {
>  #ifdef CONFIG_MMU
> -	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
> +	struct kernel_clone_args args = {
> +		.exit_signal = SIGCHLD,
> +	};
> +
> +	return _do_fork(&args);
>  #else
>  	/* can not support in nommu mode */
>  	return -EINVAL;
> @@ -2443,8 +2441,12 @@ SYSCALL_DEFINE0(fork)
>  #ifdef __ARCH_WANT_SYS_VFORK
>  SYSCALL_DEFINE0(vfork)
>  {
> -	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
> -			0, NULL, NULL, 0);
> +	struct kernel_clone_args args = {
> +		.flags		= CLONE_VFORK | CLONE_VM,
> +		.exit_signal	= SIGCHLD,
> +	};
> +
> +	return _do_fork(&args);
>  }
>  #endif
>  
> @@ -2472,7 +2474,110 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
>  		 unsigned long, tls)
>  #endif
>  {
> -	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
> +	struct kernel_clone_args args = {
> +		.flags		= (clone_flags & ~CSIGNAL),
> +		.pidfd		= parent_tidptr,
> +		.child_tid	= child_tidptr,
> +		.parent_tid	= parent_tidptr,
> +		.exit_signal	= (clone_flags & CSIGNAL),
> +		.stack		= newsp,
> +		.tls		= tls,
> +	};
> +
> +	/* clone(CLONE_PIDFD) uses parent_tidptr to return a pidfd */
> +	if ((clone_flags & CLONE_PIDFD) && (clone_flags & CLONE_PARENT_SETTID))
> +		return -EINVAL;
> +
> +	return _do_fork(&args);
> +}
> +
> +noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> +					      struct clone_args __user *uargs,
> +					      size_t size)
> +{
> +	struct clone_args args;
> +
> +	if (unlikely(size > PAGE_SIZE))
> +		return -E2BIG;
> +
> +	if (unlikely(size < sizeof(struct clone_args)))
> +		return -EINVAL;
> +
> +	if (unlikely(!access_ok(uargs, size)))
> +		return -EFAULT;
> +
> +	if (size > sizeof(struct clone_args)) {
> +		unsigned char __user *addr;
> +		unsigned char __user *end;
> +		unsigned char val;
> +
> +		addr = (void __user *)uargs + sizeof(struct clone_args);
> +		end = (void __user *)uargs + size;
> +
> +		for (; addr < end; addr++) {
> +			if (get_user(val, addr))
> +				return -EFAULT;
> +			if (val)
> +				return -E2BIG;
> +		}
> +
> +		size = sizeof(struct clone_args);
> +	}
> +
> +	if (copy_from_user(&args, uargs, size))
> +		return -EFAULT;
> +
> +	*kargs = (struct kernel_clone_args){
> +		.flags		= args.flags,
> +		.pidfd		= u64_to_user_ptr(args.pidfd),
> +		.child_tid	= u64_to_user_ptr(args.child_tid),
> +		.parent_tid	= u64_to_user_ptr(args.parent_tid),
> +		.exit_signal	= args.exit_signal,
> +		.stack		= args.stack,
> +		.stack_size	= args.stack_size,
> +		.tls		= args.tls,
> +	};
> +
> +	return 0;
> +}
> +
> +static bool clone3_args_valid(const struct kernel_clone_args *kargs)
> +{
> +	/*
> +	 * All lower bits of the flag word are taken.
> +	 * Verify that no other unknown flags are passed along.
> +	 */
> +	if (kargs->flags & ~CLONE_LEGACY_FLAGS)
> +		return false;
> +
> +	/*
> +	 * - make the CLONE_DETACHED bit reuseable for clone3
> +	 * - make the CSIGNAL bits reuseable for clone3
> +	 */
> +	if (kargs->flags & (CLONE_DETACHED | CSIGNAL))
> +		return false;
> +
> +	if ((kargs->flags & (CLONE_THREAD | CLONE_PARENT)) &&
> +	    kargs->exit_signal)
> +		return false;
> +
> +	return true;
> +}
> +
> +SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> +{
> +	int err;
> +
> +	struct kernel_clone_args kargs;
> +
> +	err = copy_clone_args_from_user(&kargs, uargs, size);
> +	if (err)
> +		return err;
> +
> +	if (!clone3_args_valid(&kargs))
> +		return -EINVAL;
> +
> +	return _do_fork(&kargs);
>  }
>  #endif
>  
> -- 
> 2.21.0
> 

^ permalink raw reply

* Re: [RFC][PATCH 0/8] Mount, FS, Block and Keyrings notifications [ver #2]
From: Casey Schaufler @ 2019-06-04 22:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Howells, Al Viro, raven, Linux FS Devel, Linux API,
	linux-block, keyrings, LSM List, LKML, casey
In-Reply-To: <CALCETrWFBA8H0RiZPikLtEi8xg-cqJLtQgnU2CGTuwByrHN7Dw@mail.gmail.com>

On 6/4/2019 2:05 PM, Andy Lutomirski wrote:
> On Tue, Jun 4, 2019 at 1:31 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> n 6/4/2019 10:43 AM, Andy Lutomirski wrote:
>>> On Tue, Jun 4, 2019 at 9:35 AM David Howells <dhowells@redhat.com> wrote:
>>>> Hi Al,
>>>>
>>>> Here's a set of patches to add a general variable-length notification queue
>>>> concept and to add sources of events for:
>>> I asked before and didn't see a response, so I'll ask again.  Why are
>>> you paying any attention at all to the creds that generate an event?
>>> It seems like the resulting security model will be vary hard to
>>> understand and probably buggy.  Can't you define a sensible model in
>>> which only the listener creds matter?
>> We've spent the last 18 months reeling from the implications
>> of what can happen when one process has the ability to snoop
>> on another. Introducing yet another mechanism that is trivial
>> to exploit is a very bad idea.
> If you're talking about Spectre, etc, this is IMO entirely irrelevant.

We're seeing significant interest in using obscure mechanisms
in system exploits. Mechanisms will be exploited.

> Among other things, setting these watches can and should require some
> degree of privilege.

Requiring privilege would address the concerns for most
situations, although I don't see that it would help for
SELinux. SELinux does not generally put much credence in
what others consider "privilege".

Extreme care would probably be required for namespaces, too.

>
>> I will try to explain the problem once again. If process A
>> sends a signal (writes information) to process B the kernel
>> checks that either process A has the same UID as process B
>> or that process A has privilege to override that policy.
>> Process B is passive in this access control decision, while
>> process A is active.
> Are you stating what you see to be a requirement?

Basic subject/object access control is the core of
the Linux security model. Yes, there are exceptions,
but mostly they're historical in origin.

>> Process A must have write access
>> (defined by some policy) to process B's event buffer.
> No, stop right here.

Listening ...

>   Process B is monitoring some aspect of the
> system.

Process B is not "monitoring". At some point in the past it
has registered a request for information should an event occur.
It is currently passive.

> Process A is doing something.

Yes. It is active.'

> Process B should need
> permission to monitor whatever it's monitoring,

OK, I'm good with that. But the only time you
can tell that is when the event is registered,
and at that time you can't tell who might be causing
the event. (Or can you?)

> and process A should
> have permission to do whatever it's doing.

So there needs to be some connection between what B
can request events for and what events A can cause.
Then you can deny B's requests because of A.

>   I don't think it makes
> sense to try to ascribe an identity to the actor doing some action to
> decide to omit it from the watch -- this has all kinds of correctness
> issues.

It works for signals and UDP, but in general I get the concern.

> If you're writing a policy and you don't like letting process B spy on
> processes doing various things, then disallow that type of spying.

That gets you into a situation where you can't do the legitimate
monitoring you want to do just because there's the off chance you
might see something you shouldn't. "I hate security! It's confusing,
and always gets in the way!"

>> To
>> implement such a policy requires A's credential,
> You may not design a new mechanism that looks at the credential in a
> context where looking at a credential is invalid unless you have some
> very strong justification for why all of the known reasons that it's a
> bad idea don't apply to what you're doing.

Point. But you also don't get to ignore basic security policy
just because someone's spiffy lazy memory free cache hashing
tree (or similar mechanism) throws away references to important
information while it's still needed.

> So, without a much stronger justification, NAK.

I try to be reasonable. Really. All I want is something
with a security model that can be explained coherently 
within the context of the basic Linux security model.
There are enough variations as it is.

^ permalink raw reply

* Re: [RFC][PATCH 0/8] Mount, FS, Block and Keyrings notifications [ver #2]
From: Andy Lutomirski @ 2019-06-05  4:19 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Andy Lutomirski, David Howells, Al Viro, Casey Schaufler, raven,
	Linux FS Devel, Linux API, linux-block, keyrings, LSM List, LKML
In-Reply-To: <CAB9W1A0AgMYOwGx9c-TmAt=1O6Bjsr2P3Nhd=2+QV39dgw0CrA@mail.gmail.com>

On Tue, Jun 4, 2019 at 6:18 PM Stephen Smalley
<stephen.smalley@gmail.com> wrote:
>
> On Tue, Jun 4, 2019 at 4:58 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Tue, Jun 4, 2019 at 1:39 PM David Howells <dhowells@redhat.com> wrote:
>> >
>> > Andy Lutomirski <luto@kernel.org> wrote:
>> >
>> > > > Here's a set of patches to add a general variable-length notification queue
>> > > > concept and to add sources of events for:
>> > >
>> > > I asked before and didn't see a response, so I'll ask again.  Why are you
>> > > paying any attention at all to the creds that generate an event?
>> >
>> > Casey responded to you.  It's one of his requirements.
>> >
>>
>> It being a "requirement" doesn't make it okay.
>>
>> > However, the LSMs (or at least SELinux) ignore f_cred and use current_cred()
>> > when checking permissions.  See selinux_revalidate_file_permission() for
>> > example - it uses current_cred() not file->f_cred to re-evaluate the perms,
>> > and the fd might be shared between a number of processes with different creds.
>>
>> That's a bug.  It's arguably a rather severe bug.  If I ever get
>> around to writing the patch I keep thinking of that will warn if we
>> use creds from invalid contexts, it will warn.
>
>
> No, not a bug.  Working as designed. Initial validation on open, but revalidation upon read/write if something has changed since open (process SID differs from opener, inode SID has changed, policy has changed). Current subject SID should be used for the revalidation. It's a MAC vs DAC difference.
>

Can you explain how the design is valid, then?  Consider nasty cases like this:

$ sudo -u lotsofgarbage 2>/dev/whatever

It is certainly the case that drivers, fs code, and other core code
MUST NOT look at current_cred() in the context of syscalls like
open().  Jann, I, and others have found quite a few rootable bugs of
this sort.  What makes MAC special here?

I would believe there are cases where auditing write() callers makes
some sense, but anyone reading those logs needs to understand that the
creds are dubious at best.

^ permalink raw reply

* Re: [RFC][PATCH 0/8] Mount, FS, Block and Keyrings notifications [ver #2]
From: David Howells @ 2019-06-05  8:41 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: dhowells, Andy Lutomirski, Al Viro, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LSM List, LKML
In-Reply-To: <50c2ea19-6ae8-1f42-97ef-ba5c95e40475@schaufler-ca.com>

Casey Schaufler <casey@schaufler-ca.com> wrote:

> I will try to explain the problem once again. If process A
> sends a signal (writes information) to process B the kernel
> checks that either process A has the same UID as process B
> or that process A has privilege to override that policy.
> Process B is passive in this access control decision, while
> process A is active. In the event delivery case, process A
> does something (e.g. modifies a keyring) that generates an
> event, which is then sent to process B's event buffer.

I think this might be the core sticking point here.  It looks like two
different situations:

 (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)

 (2) A implicitly and unknowingly sends event to B as a side effect of some
     other action (eg. B has a watch for the event A did).

The LSM treats them as the same: that is B must have MAC authorisation to send
a message to A.

But there are problems with not sending the event:

 (1) B's internal state is then corrupt (or, at least, unknowingly invalid).

 (2) B can potentially figure out that the event happened by other means.


I've implemented four event sources so far:

 (1) Keys/keyrings.  You can only get events on a key you have View permission
     on and the other process has to have write access to it, so I think this
     is good enough.

 (2) Block layer.  Currently this will only get you hardware error events,
     which is probably safe.  I'm not sure you can manipulate those without
     permission to directly access the device files.

 (3) Superblock.  This is trickier since it can see events that can be
     manufactured (R/W <-> R/O remounting, EDQUOT) as well as events that
     can't without hardware control (EIO, network link loss, RF kill).

 (4) Mount topology.  This is the trickiest since it allows you to see events
     beyond the point at which you placed your watch (in essence, you place a
     subtree watch).

     The question is what permission checking should I do?  Ideally, I'd
     emulate a pathwalk between the watchpoint and the eventing object to see
     if the owner of the watchpoint could reach it.

     I'd need to do a reverse walk, calling inode_permission(MAY_NOT_BLOCK)
     for each directory between the eventing object and the watchpoint to see
     if one rejects it - but some filesystems have a permission check that
     can't be called in this state.

     It would also be necessary to do this separately for each watchpoint in
     the parental chain.

     Further, each permissions check would generate an audit event and could
     generate FAN_ACCESS and/or FAN_ACCESS_PERM fanotify events - which could
     be a problem if fanotify is also trying to post those events to the same
     watch queue.

David

^ permalink raw reply

* Re: [PATCH 09/25] vfs: Allow mount information to be queried by fsinfo() [ver #13]
From: Alan Jenkins @ 2019-06-05 12:21 UTC (permalink / raw)
  To: David Howells, viro
  Cc: raven, linux-api, linux-fsdevel, linux-kernel, mszeredi
In-Reply-To: <155905633578.1662.8087594848892366318.stgit@warthog.procyon.org.uk>

On 28/05/2019 16:12, David Howells wrote:
> Allow mount information, including information about the topology tree to
> be queried with the fsinfo() system call.  Usage of AT_FSINFO_MOUNTID_PATH
> allows overlapping mounts to be queried.
>
> To this end, four fsinfo() attributes are provided:
>
>   (1) FSINFO_ATTR_MOUNT_INFO.
>
>       This is a structure providing information about a mount, including:
>
> 	- Mounted superblock ID.
> 	- Mount ID (as AT_FSINFO_MOUNTID_PATH).
> 	- Parent mount ID.
> 	- Mount attributes (eg. R/O, NOEXEC).
> 	- Number of change notifications generated.
>
>       Note that the parent mount ID is overridden to the ID of the queried
>       mount if the parent lies outside of the chroot or dfd tree.
>
>   (2) FSINFO_ATTR_MOUNT_DEVNAME.
>
>       This a string providing the device name associated with the mount.
>
>       Note that the device name may be a path that lies outside of the root.
>
>   (3) FSINFO_ATTR_MOUNT_CHILDREN.
>
>       This produces an array of structures, one for each child and capped
>       with one for the argument mount (checked after listing all the
>       children).  Each element contains the mount ID and the notification
>       counter of the respective mount object.
>
>   (4) FSINFO_ATTR_MOUNT_SUBMOUNT.
>
>       This is a 1D array of strings, indexed with struct fsinfo_params::Nth.
>       Each string is the relative pathname of the corresponding child
>       returned by FSINFO_ATTR_MOUNT_CHILD.

FSINFO_ATTR_MOUNT_CHILD -> FSINFO_ATTR_MOUNT_CHILDREN


>       Note that paths in the mount at the base of the tree (whether that be
>       dfd or chroot) are relative to the base of the tree, not the root
>       directory of that mount.
>
> Signed-off-by: David Howells<dhowells@redhat.com>
> ---
>
>   fs/d_path.c                 |    2
>   fs/fsinfo.c                 |    9 ++
>   fs/internal.h               |    9 ++
>   fs/namespace.c              |  175 +++++++++++++++++++++++++++++++++++++++++++
>   include/uapi/linux/fsinfo.h |   28 +++++++
>   samples/vfs/test-fsinfo.c   |   47 +++++++++++-
>   6 files changed, 266 insertions(+), 4 deletions(-)

> +/*
> + * Information struct for fsinfo(FSINFO_ATTR_MOUNT_INFO).
> + */
> +struct fsinfo_mount_info {
> +	__u64		f_sb_id;	/* Superblock ID */
> +	__u32		mnt_id;		/* Mount identifier (use with AT_FSINFO_MOUNTID_PATH) */
> +	__u32		parent_id;	/* Parent mount identifier */
> +	__u32		group_id;	/* Mount group ID */
> +	__u32		master_id;	/* Slave master group ID */
> +	__u32		from_id;	/* Slave propogated from ID */

propogated -> propagated

> +	__u32		attr;		/* MOUNT_ATTR_* flags */
> +	__u32		notify_counter;	/* Number of notifications generated. */
> +	__u32		__reserved[1];
> +};
> +
> +/*
> + * Information struct element for fsinfo(FSINFO_ATTR_MOUNT_CHILDREN).
> + * - An extra element is placed on the end representing the parent mount.
> + */
> +struct fsinfo_mount_child {
> +	__u32		mnt_id;		/* Mount identifier (use with AT_FSINFO_MOUNTID_PATH) */
> +	__u32		notify_counter;	/* Number of notifications generated on mount. */
> +};
> +
>   #endif /* _UAPI_LINUX_FSINFO_H */

^ permalink raw reply

* Re: [PATCH 25/25] fsinfo: Add API documentation [ver #13]
From: Alan Jenkins @ 2019-06-05 12:21 UTC (permalink / raw)
  To: David Howells
  Cc: viro, raven, linux-api, linux-fsdevel, linux-kernel, mszeredi
In-Reply-To: <155905647369.1662.10806818386998503329.stgit@warthog.procyon.org.uk>

On 28/05/2019 16:14, David Howells wrote:
> +Then there are attributes that convey information about the mount topology:
> +
> + *  ``FSINFO_ATTR_MOUNT_INFO``
> +
> +    This struct-type attribute conveys information about a mount topology node
> +    rather than a superblock.  This includes the ID of the superblock mounted
> +    there and the ID of the mount node, its parent, group, master and
> +    propagation source.  It also contains the attribute flags for the mount and
> +    a change notification counter so that it can be quickly determined if that
> +    node changed.
> +
> + *  ``FSINFO_ATTR_MOUNT_DEVNAME``
> +
> +    This string-type attribute returns the "device name" that was supplied when
> +    the mount object was created.
> +
> + *  ``FSINFO_ATTR_MOUNT_CHILDREN``
> +
> +    This is an array-type attribute that conveys a set of structs, each of
> +    which indicates the mount ID of a child and the change counter for that
> +    child.  The kernel also tags an extra element on the end that indicates the
> +    ID and change counter of the queried object.  This allows a conflicting
> +    change to be quickly detected by comparing the before and after counters.
> +
> + *  ``FSINFO_ATTR_MOUNT_SUBMOUNT``
> +
> +    This is a string-type attribute that conveys the pathname of the Nth
> +    mountpoint under the target mount, relative to the mount root or the
> +    chroot, whichever is closer.  These correspond on a 1:1 basis with the
> +    eleemnts in the FSINFO_ATTR_MOUNT_CHROOT list.

FSINFO_ATTR_MOUNT_CHROOT -> FSINFO_ATTR_MOUNT_CHILDREN

^ permalink raw reply

* Re: [PATCH 10/25] vfs: fsinfo sample: Mount listing program [ver #13]
From: Alan Jenkins @ 2019-06-05 12:22 UTC (permalink / raw)
  To: David Howells, viro
  Cc: raven, linux-api, linux-fsdevel, linux-kernel, mszeredi
In-Reply-To: <155905634517.1662.7843563955043166854.stgit@warthog.procyon.org.uk>

On 28/05/2019 16:12, David Howells wrote:
> Implement a program to demonstrate mount listing using the new fsinfo()
> syscall,

Great. I understand this is only for demonstration, but IMO the output 
would be clearer with a few tweaks:

> for example:
>
> # ./test-mntinfo

Can we have a header please?  E.g.

   MNT                                    ID       NOTIFS   TYPE DEVICE
   -------------------------------------- -------- -------- ---- ------


> ROOT                                          5d        c ext4 8:12
>   \_ sys                                       13        8 sysfs 0:13
>   |   \_ kernel/security                       16        0 securityfs 0:7

I assume ID is the same ID which is shown in /proc/mounts.  But it's 
shown as decimal there.  Can we match that?

And can we show NOTIFS in decimal as well? I find counters easier to 
read in decimal :-).

I.e. change "%8x" to "%10d" in both cases.

major:minor is also more often seen formatted as "%d:%d": in 
/proc/self/mountinfo, lsblk, and ls -l.  I don't feel as strongly about 
this one though.

Thanks
Alan

>   |   \_ fs/cgroup                             1a       10 tmpfs 0:17
>   |   |   \_ unified                           1b        0 cgroup2 0:18
>   |   |   \_ systemd                           1c        0 cgroup 0:19
>   |   |   \_ freezer                           20        0 cgroup 0:1d
>   |   |   \_ cpu,cpuacct                       21        0 cgroup 0:1e
>   |   |   \_ memory                            22        0 cgroup 0:1f
>   |   |   \_ cpuset                            23        0 cgroup 0:20
>   |   |   \_ hugetlb                           24        0 cgroup 0:21
>   |   |   \_ net_cls,net_prio                  25        0 cgroup 0:22
>   |   |   \_ blkio                             26        0 cgroup 0:23
>   |   |   \_ perf_event                        27        0 cgroup 0:24
>   |   |   \_ devices                           28        0 cgroup 0:25
>   |   |   \_ rdma                              29        0 cgroup 0:26
>   |   \_ fs/pstore                             1d        0 pstore 0:1a
>   |   \_ firmware/efi/efivars                  1e        0 efivarfs 0:1b
>   |   \_ fs/bpf                                1f        0 bpf 0:1c
>   |   \_ kernel/config                         5a        0 configfs 0:10
>   |   \_ fs/selinux                            2a        0 selinuxfs 0:12
>   |   \_ kernel/debug                          2e        0 debugfs 0:8
>   \_ dev                                       15        4 devtmpfs 0:6
>   |   \_ shm                                   17        0 tmpfs 0:14
>   |   \_ pts                                   18        0 devpts 0:15
>   |   \_ hugepages                             2b        0 hugetlbfs 0:27
>   |   \_ mqueue                                2c        0 mqueue 0:11
>   \_ run                                       19        1 tmpfs 0:16
>   |   \_ user/0                               1b4        0 tmpfs 0:2d
>   \_ proc                                      14        1 proc 0:4
>   |   \_ sys/fs/binfmt_misc                    2d        0 autofs 0:28
>   \_ tmp                                       2f      7d0 tmpfs 0:29
>   \_ var/cache/fscache                         71        0 tmpfs 0:2a
>   \_ boot                                      74        0 ext4 8:15
>   \_ home                                      74        0 ext4 8:15
>   \_ var/lib/nfs/rpc_pipefs                    bf        0 rpc_pipefs 0:2b
>   \_ mnt                                      15b        5 tmpfs 0:2c
>   |   \_ foo                                  164        0 tmpfs 0:2e
>   |   \_ foo1                                 16d        0 tmpfs 0:2f
>   |   \_ foo2                                 176        0 tmpfs 0:30
>   |   \_ foo3                                 17f        0 tmpfs 0:31
>   |   \_ foo4                                 188        1 tmpfs 0:32
>   |       \_ ""                               191        0 tmpfs 0:33
>   \_ afs                                      19a        2 afs 0:34
>       \_ procyon.org.uk                       1a3        0 afs 0:35
>       \_ grand.central.org                    1ac        0 afs 0:36
>
> Signed-off-by: David Howells<dhowells@redhat.com>
> ---
>
>   samples/vfs/Makefile       |    3 +
>   samples/vfs/test-mntinfo.c |  239 ++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 242 insertions(+)
>   create mode 100644 samples/vfs/test-mntinfo.c

> diff --git a/samples/vfs/test-mntinfo.c b/samples/vfs/test-mntinfo.c
> new file mode 100644
> index 000000000000..00fbefae98fa
> --- /dev/null
> +++ b/samples/vfs/test-mntinfo.c

> +/*
> + * Display a mount and then recurse through its children.
> + */
> +static void display_mount(unsigned int mnt_id, unsigned int depth, char *path)
> +{
> +	struct fsinfo_mount_child *children;
> +	struct fsinfo_mount_info info;
> +	struct fsinfo_ids ids;
> +	unsigned int d;
> +	size_t ch_size, p_size;
> +	int i, n, s;
> +
> +	get_attr(mnt_id, FSINFO_ATTR_MOUNT_INFO, &info, sizeof(info));
> +	get_attr(mnt_id, FSINFO_ATTR_IDS, &ids, sizeof(ids));
> +	if (depth > 0)
> +		printf("%s", tree_buf);
> +
> +	s = strlen(path);
> +	printf("%s", !s ? "\"\"" : path);
> +	if (!s)
> +		s += 2;
> +	s += depth;
> +	if (s < 40)
> +		s = 40 - s;
> +	else
> +		s = 1;
> +	printf("%*.*s", s, s, "");
> +
> +	printf("%8x %8x %s %x:%x",
> +	       info.mnt_id, info.notify_counter,
> +	       ids.f_fs_name, ids.f_dev_major, ids.f_dev_minor);
> +	putchar('\n');
> +
> +	children = get_attr_alloc(mnt_id, FSINFO_ATTR_MOUNT_CHILDREN, 0, &ch_size);
> +	n = ch_size / sizeof(children[0]) - 1;
> +
> +	bar_buf[depth + 1] = '|';
> +	if (depth > 0) {
> +		tree_buf[depth - 4 + 1] = bar_buf[depth - 4 + 1];
> +		tree_buf[depth - 4 + 2] = ' ';
> +	}
> +
> +	tree_buf[depth + 0] = ' ';
> +	tree_buf[depth + 1] = '\\';
> +	tree_buf[depth + 2] = '_';
> +	tree_buf[depth + 3] = ' ';
> +	tree_buf[depth + 4] = 0;
> +	d = depth + 4;
> +
> +	for (i = 0; i < n; i++) {
> +		if (i == n - 1)
> +			bar_buf[depth + 1] = ' ';
> +		path = get_attr_alloc(mnt_id, FSINFO_ATTR_MOUNT_SUBMOUNT, i, &p_size);
> +		display_mount(children[i].mnt_id, d, path + 1);
> +		free(path);
> +	}
> +
> +	free(children);
> +	if (depth > 0) {
> +		tree_buf[depth - 4 + 1] = '\\';
> +		tree_buf[depth - 4 + 2] = '_';
> +	}
> +	tree_buf[depth] = 0;
> +}
> +
> +/*
> + * Find the ID of whatever is at the nominated path.
> + */
> +static unsigned int lookup_mnt_by_path(const char *path)
> +{
> +	struct fsinfo_mount_info mnt;
> +	struct fsinfo_params params = {
> +		.request = FSINFO_ATTR_MOUNT_INFO,
> +	};
> +
> +	if (fsinfo(AT_FDCWD, path, &params, &mnt, sizeof(mnt)) == -1) {
> +		perror(path);
> +		exit(1);
> +	}
> +
> +	return mnt.mnt_id;
> +}
> +
> +/*
> + *
> + */
> +int main(int argc, char **argv)
> +{
> +	unsigned int mnt_id;
> +	char *path;
> +	bool use_mnt_id = false;
> +	int opt;
> +
> +	while ((opt = getopt(argc, argv, "M"))) {
> +		switch (opt) {
> +		case 'M':
> +			use_mnt_id = true;
> +			continue;
> +		}
> +		break;
> +	}
> +
> +	argc -= optind;
> +	argv += optind;
> +
> +	switch (argc) {
> +	case 0:
> +		mnt_id = lookup_mnt_by_path("/");
> +		path = "ROOT";
> +		break;
> +	case 1:
> +		path = argv[0];
> +		if (use_mnt_id) {
> +			mnt_id = strtoul(argv[0], NULL, 0);
> +			break;
> +		}
> +
> +		mnt_id = lookup_mnt_by_path(argv[0]);
> +		break;
> +	default:
> +		printf("Format: test-mntinfo\n");
> +		printf("Format: test-mntinfo <path>\n");
> +		printf("Format: test-mntinfo -M <mnt_id>\n");
> +		exit(2);
> +	}
> +
> +	display_mount(mnt_id, 0, path);
> +	return 0;
> +}
>
>

^ permalink raw reply

* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
From: Oleksandr Natalenko @ 2019-06-05 13:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
	Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
	Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
	oleg, christian, hdanton
In-Reply-To: <20190531232959.GC248371@google.com>

Hi.

On Sat, Jun 01, 2019 at 08:29:59AM +0900, Minchan Kim wrote:
> > > > > /* snip a lot */
> > > > >
> > > > >  #ifdef CONFIG_MEMORY_FAILURE
> > > > >  	if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > > > > -		return madvise_inject_error(behavior, start, start + len_in);
> > > > > +		return madvise_inject_error(behavior,
> > > > > +					start, start + len_in);
> > > > 
> > > > Not sure what this change is about except changing the line length.
> > > > Note, madvise_inject_error() still operates on "current" through
> > > > get_user_pages_fast() and gup_pgd_range(), but that was not changed
> > > > here. I Know you've filtered out this hint later, so technically this
> > > > is not an issue, but, maybe, this needs some attention too since we've
> > > > already spotted it?
> > > 
> > > It is leftover I had done. I actually modified it to handle remote
> > > task but changed my mind not to fix it because process_madvise
> > > will not support it at this moment. I'm not sure it's a good idea
> > > to change it for *might-be-done-in-future* at this moment even though
> > > we have spotted.
> > 
> > I'd expect to have at least some comments in code on why other hints
> > are disabled, so if we already know some shortcomings, this information
> > would not be lost.
> 
> Okay, I will add some comment but do not want to fix code piece until
> someone want to expose the poisoning to external process.

Fair enough.

> > > > >  	write = madvise_need_mmap_write(behavior);
> > > > >  	if (write) {
> > > > > -		if (down_write_killable(&current->mm->mmap_sem))
> > > > > +		if (down_write_killable(&mm->mmap_sem))
> > > > >  			return -EINTR;
> > > > 
> > > > Do you still need that trick with mmget_still_valid() here?
> > > > Something like:
> > > 
> > > Since MADV_COLD|PAGEOUT doesn't change address space layout or
> > > vma->vm_flags, technically, we don't need it if I understand
> > > correctly. Right?
> > 
> > I'd expect so, yes. But.
> > 
> > Since we want this interface to be universal and to be able to cover
> > various needs, and since my initial intention with working in this
> > direction involved KSM, I'd ask you to enable KSM hints too, and once
> > (and if) that happens, the work there is done under write lock, and
> > you'll need this trick to be applied.
> > 
> > Of course, I can do that myself later in a subsequent patch series once
> > (and, again, if) your series is merged, but, maybe, we can cover this
> > already especially given the fact that KSM hinting is a relatively easy
> > task in this pile. I did some preliminary tests with it, and so far no
> > dragons have started to roar.
> 
> Then, do you mind sending a patch based upon this series to expose
> MADV_MERGEABLE to process_madvise? It will have the right description
> why you want to have such feature which I couldn't provide since I don't
> have enough material to write the motivation. And the patch also could
> include the logic to prevent coredump race, which is more proper since
> finally we need to hold mmap_sem write-side lock, finally.
> I will pick it up and will rebase since then.

Sure, I can. Would you really like to have it being based on this exact
revision, or I should wait till you deal with MADV_COLD & Co and re-iterate
this part again?

Thanks.

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer

^ permalink raw reply

* Re: [RFC][PATCH 0/8] Mount, FS, Block and Keyrings notifications [ver #2]
From: Stephen Smalley @ 2019-06-05 13:47 UTC (permalink / raw)
  To: Andy Lutomirski, Stephen Smalley
  Cc: David Howells, Al Viro, Casey Schaufler, raven, Linux FS Devel,
	Linux API, linux-block, keyrings, LSM List, LKML, Paul Moore
In-Reply-To: <CALCETrU_5djawkwW-GRyHZXHwOUjaei1Cp7NEJaVFDm_bK6G3w@mail.gmail.com>

On 6/5/19 12:19 AM, Andy Lutomirski wrote:
> On Tue, Jun 4, 2019 at 6:18 PM Stephen Smalley
> <stephen.smalley@gmail.com> wrote:
>>
>> On Tue, Jun 4, 2019 at 4:58 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> On Tue, Jun 4, 2019 at 1:39 PM David Howells <dhowells@redhat.com> wrote:
>>>>
>>>> Andy Lutomirski <luto@kernel.org> wrote:
>>>>
>>>>>> Here's a set of patches to add a general variable-length notification queue
>>>>>> concept and to add sources of events for:
>>>>>
>>>>> I asked before and didn't see a response, so I'll ask again.  Why are you
>>>>> paying any attention at all to the creds that generate an event?
>>>>
>>>> Casey responded to you.  It's one of his requirements.
>>>>
>>>
>>> It being a "requirement" doesn't make it okay.
>>>
>>>> However, the LSMs (or at least SELinux) ignore f_cred and use current_cred()
>>>> when checking permissions.  See selinux_revalidate_file_permission() for
>>>> example - it uses current_cred() not file->f_cred to re-evaluate the perms,
>>>> and the fd might be shared between a number of processes with different creds.
>>>
>>> That's a bug.  It's arguably a rather severe bug.  If I ever get
>>> around to writing the patch I keep thinking of that will warn if we
>>> use creds from invalid contexts, it will warn.
>>
>>
>> No, not a bug.  Working as designed. Initial validation on open, but revalidation upon read/write if something has changed since open (process SID differs from opener, inode SID has changed, policy has changed). Current subject SID should be used for the revalidation. It's a MAC vs DAC difference.
>>
> 
> Can you explain how the design is valid, then?  Consider nasty cases like this:
> 
> $ sudo -u lotsofgarbage 2>/dev/whatever

(sorry for the previous html email; gmail or my inability to properly 
use it strikes again!)

Here we have four (or more) opportunities to say no:
1) Upon selinux_inode_permission(), when checking write access to 
/dev/whatever in the context of the shell process,
2) Upon selinux_file_open(), when checking and caching the open and 
write access for shell to /dev/whatever in the file security struct,
3) Upon selinux_bprm_committing_creds() -> flush_unauthorized_files(), 
when revalidating write access to /dev/whatever in the context of sudo,
4) Upon selinux_file_permission() -> 
selinux_revalidate_file_permission(), when revalidating write access to 
/dev/whatever in the context of sudo.

If any of those fail, then access is denied, so unless both the shell 
and sudo are authorized to write to /dev/whatever, it is a no-go.  NB 
Only the shell context requires open permission here; the sudo context 
only needs write.

> It is certainly the case that drivers, fs code, and other core code
> MUST NOT look at current_cred() in the context of syscalls like
> open().  Jann, I, and others have found quite a few rootable bugs of
> this sort.  What makes MAC special here?

Do you mean syscalls like write(), not open()?  I think your concern is 
that they apply some check only during write() and not open() and 
therefore are susceptible to confused deputy scenario above.  In 
contrast we are validating access at open, transfer/inherit, and use. If 
we use file->f_cred instead of current_cred() in 
selinux_revalidate_file_permission() and the current process SID differs 
from that of the opener, we'll never apply a check for the actual 
security context performing the write(), so information can flow in 
violation of the MAC policy.

> I would believe there are cases where auditing write() callers makes
> some sense, but anyone reading those logs needs to understand that the
> creds are dubious at best.

^ permalink raw reply

* Re: [PATCH v9 12/16] sched/core: uclamp: Extend CPU's cgroup controller
From: Tejun Heo @ 2019-06-05 14:03 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190603122725.GB19426@darkstar>

Hello,

On Mon, Jun 03, 2019 at 01:27:25PM +0100, Patrick Bellasi wrote:
> All the above, to me it means that:
>  - cgroups are always capped by system clamps
>  - cgroups can further restrict system clamps
> 
> Does that match with your view?

Yeah, as long as what's defined at system level clamps everything in
the system whether they're in cgroups or not, it's all good.

> > * Limits (high / max) default to max.  Protections (low / min) 0.  A
> >   new cgroup by default doesn't constrain itself further and doesn't
> >   have any protection.
> 
> Example 2
> ---------
> 
> Let say we have:
> 
>   /tg1:
>         util_min=200 (as a protection)
>         util_max=800 (as a limit)
> 
> the moment we create a subgroup /tg1/tg11, in v9 it is initialized
> with the same limits _and protections_ of its father:
> 
>   /tg1/tg11:
>         util_min=200 (protection inherited from /tg1)
>         util_max=800 (limit inherited from /tg1)
> 
> Do you mean that we should have instead:
> 
>   /tg1/tg11:
>         util_min=0   (no protection by default at creation time)
>         util_max=800 (limit inherited from /tg1)
> 
> 
> i.e. we need to reset the protection of a newly created subgroup?

The default value for limits should be max, protections 0.  Don't
inherit config values from the parent.  That gets confusing super fast
because when the parent config is set and each child is created plays
into the overall configuration.  Hierarchical confinements should
always be enforced and a new cgroup should always start afresh in
terms of its own configuration.

> > * A limit defines the upper ceiling for the subtree.  If an ancestor
> >   has a limit of X, none of its descendants can have more than X.
> 
> That's correct, however we distinguish between "requested" and
> "effective" values.

Sure, all property propagating controllers should.

> > Note that there's no way for an ancestor to enforce protection its
> > descendants.  It can only allow them to claim some.  This is
> > intentional as the other end of the spectrum is either descendants
> > losing the ability to further distribute protections as they see fit.
> 
> Ok, that means I need to update in v10 the initialization of subgroups
> min clamps to be none by default as discussed in the above Example 2,
> right?

Yeah and max to max.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH v9 12/16] sched/core: uclamp: Extend CPU's cgroup controller
From: Tejun Heo @ 2019-06-05 14:09 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Peter Zijlstra, linux-kernel, linux-pm, linux-api, Ingo Molnar,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190603122929.GC19426@darkstar>

Hello,

On Mon, Jun 03, 2019 at 01:29:29PM +0100, Patrick Bellasi wrote:
> On 31-May 08:35, Tejun Heo wrote:
> > Hello, Patrick.
> > 
> > On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> 
> [...]
> 
> > For proportions (as opposed to weights), we use percentage rational
> > numbers - e.g. 38.44 for 38.44%.  I have parser and doc update commits
> > pending.  I'll put them on cgroup/for-5.3.
> 
> That's a point worth discussing with Peter, we already changed one
> time from percentages to 1024 scale.

cgroup tries to uss uniform units for its interface files as much as
possible even when that deviates from non-cgroup interface.  We can
bikeshed the pros and cons for that design choice for sure but I don't
think it makes sense to deviate from that at this point unless there
are really strong reasons to do so.

> Utilization clamps are expressed as percentages by definition,
> they are just expressed in a convenient 1024 scale which should not be
> alien to people using those knobs.
> 
> If we wanna use a "more specific" name like uclamp.{min,max} then we
> should probably also accept to use a "more specific" metric, don't we?

Heh, this actually made me chuckle.  It's an interesting bargaining
take but I don't think that same word being in two different places
makes them tradable entities.  We can go into the weeds with the
semantics but how about us using an alternative adjective "misleading"
for the cpu.util.min/max names to short-circuit that?

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH v9 12/16] sched/core: uclamp: Extend CPU's cgroup controller
From: Patrick Bellasi @ 2019-06-05 14:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190605140324.GL374014@devbig004.ftw2.facebook.com>

On 05-Jun 07:03, Tejun Heo wrote:
> Hello,

Hi!

> On Mon, Jun 03, 2019 at 01:27:25PM +0100, Patrick Bellasi wrote:
> > All the above, to me it means that:
> >  - cgroups are always capped by system clamps
> >  - cgroups can further restrict system clamps
> > 
> > Does that match with your view?
> 
> Yeah, as long as what's defined at system level clamps everything in
> the system whether they're in cgroups or not, it's all good.

Right, then we are good with v9 on this point.

> > > * Limits (high / max) default to max.  Protections (low / min) 0.  A
> > >   new cgroup by default doesn't constrain itself further and doesn't
> > >   have any protection.
> > 
> > Example 2
> > ---------
> > 
> > Let say we have:
> > 
> >   /tg1:
> >         util_min=200 (as a protection)
> >         util_max=800 (as a limit)
> > 
> > the moment we create a subgroup /tg1/tg11, in v9 it is initialized
> > with the same limits _and protections_ of its father:
> > 
> >   /tg1/tg11:
> >         util_min=200 (protection inherited from /tg1)
> >         util_max=800 (limit inherited from /tg1)
> > 
> > Do you mean that we should have instead:
> > 
> >   /tg1/tg11:
> >         util_min=0   (no protection by default at creation time)
> >         util_max=800 (limit inherited from /tg1)
> > 
> > 
> > i.e. we need to reset the protection of a newly created subgroup?
> 
> The default value for limits should be max, protections 0.  Don't
> inherit config values from the parent.  That gets confusing super fast
> because when the parent config is set and each child is created plays
> into the overall configuration.  Hierarchical confinements should
> always be enforced and a new cgroup should always start afresh in
> terms of its own configuration.

Got it, so in the example above we will create:

   /tg1/tg11:
         util_min=0    (no requested protection by default at creation time)
         util_max=1024 (no requests limit by default at creation time)

That's it for the "requested" values side, while the "effective"
values are enforced by the hierarchical confinement rules since
creation time.
Which means we will enforce the effective values as:

   /tg1/tg11:

         util_min.effective=0
            i.e. keep the child protection since smaller than parent

         util_max.effective=800
            i.e. keep parent limit since stricter than child

Please shout if I got it wrong, otherwise I'll update v10 to
implement the above logic.

> > > * A limit defines the upper ceiling for the subtree.  If an ancestor
> > >   has a limit of X, none of its descendants can have more than X.
> > 
> > That's correct, however we distinguish between "requested" and
> > "effective" values.
> 
> Sure, all property propagating controllers should.

Right.

> > > Note that there's no way for an ancestor to enforce protection its
> > > descendants.  It can only allow them to claim some.  This is
> > > intentional as the other end of the spectrum is either descendants
> > > losing the ability to further distribute protections as they see fit.
> > 
> > Ok, that means I need to update in v10 the initialization of subgroups
> > min clamps to be none by default as discussed in the above Example 2,
> > right?
> 
> Yeah and max to max.

Right, I've got it now.


> Thanks.

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply

* Re: [PATCH v9 12/16] sched/core: uclamp: Extend CPU's cgroup controller
From: Tejun Heo @ 2019-06-05 14:44 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: linux-kernel, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190605143805.olk2ta5p2jnd4mjt@e110439-lin>

Hello,

On Wed, Jun 05, 2019 at 03:39:50PM +0100, Patrick Bellasi wrote:
> Which means we will enforce the effective values as:
> 
>    /tg1/tg11:
> 
>          util_min.effective=0
>             i.e. keep the child protection since smaller than parent
> 
>          util_max.effective=800
>             i.e. keep parent limit since stricter than child
>
> Please shout if I got it wrong, otherwise I'll update v10 to
> implement the above logic.

Everything sounds good to me.  Please note that cgroup interface files
actually use literal "max" for limit/protection max settings so that 0
and "max" mean the same things for all limit/protection knobs.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [RFC][PATCH 0/8] Mount, FS, Block and Keyrings notifications [ver #2]
From: Casey Schaufler @ 2019-06-05 14:50 UTC (permalink / raw)
  To: David Howells
  Cc: Andy Lutomirski, Al Viro, raven, Linux FS Devel, Linux API,
	linux-block, keyrings, LSM List, LKML, casey
In-Reply-To: <20192.1559724094@warthog.procyon.org.uk>


[-- Attachment #1.1: Type: text/plain, Size: 4255 bytes --]

On 6/5/2019 1:41 AM, David Howells wrote:
> Casey Schaufler <casey@schaufler-ca.com> wrote:
>
>> I will try to explain the problem once again. If process A
>> sends a signal (writes information) to process B the kernel
>> checks that either process A has the same UID as process B
>> or that process A has privilege to override that policy.
>> Process B is passive in this access control decision, while
>> process A is active. In the event delivery case, process A
>> does something (e.g. modifies a keyring) that generates an
>> event, which is then sent to process B's event buffer.
> I think this might be the core sticking point here.  It looks like two
> different situations:
>
>  (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)
>
>  (2) A implicitly and unknowingly sends event to B as a side effect of some
>      other action (eg. B has a watch for the event A did).
>
> The LSM treats them as the same: that is B must have MAC authorisation to send
> a message to A.

YES!

Threat is about what you can do, not what you intend to do.

And it would be really great if you put some thought into what
a rational model would be for UID based controls, too.

> But there are problems with not sending the event:
>
>  (1) B's internal state is then corrupt (or, at least, unknowingly invalid).

Then B is a badly written program.

>  (2) B can potentially figure out that the event happened by other means.

Then why does it need the event mechanism in the first place?

> I've implemented four event sources so far:
>
>  (1) Keys/keyrings.  You can only get events on a key you have View permission
>      on and the other process has to have write access to it, so I think this
>      is good enough.

Sounds fine.

>  (2) Block layer.  Currently this will only get you hardware error events,
>      which is probably safe.  I'm not sure you can manipulate those without
>      permission to directly access the device files.

There's an argument to be made that this should require CAP_SYS_ADMIN,
or that an LSM like SELinux might include hardware error events in
policy, but generally I agree that system generated events like this
are both harmless and pointless for the general public to watch.

>  (3) Superblock.  This is trickier since it can see events that can be
>      manufactured (R/W <-> R/O remounting, EDQUOT) as well as events that
>      can't without hardware control (EIO, network link loss, RF kill).

The events generated by processes (the 1st set) need controls
like keys. The events generated by the system (the 2nd set) may
need controls like the block layer.


>  (4) Mount topology.  This is the trickiest since it allows you to see events
>      beyond the point at which you placed your watch (in essence, you place a
>      subtree watch).

Like keys.

>      The question is what permission checking should I do?  Ideally, I'd
>      emulate a pathwalk between the watchpoint and the eventing object to see
>      if the owner of the watchpoint could reach it.

That will depend, as I've been saying, on what causes
the event to be generated. If it's from a process, the
question is "can the active process, the one that generated
the event, write to the passive, watching process?"
If it's the system on a hardware event, you may want the watcher
to have CAP_SYS_ADMIN.

>      I'd need to do a reverse walk, calling inode_permission(MAY_NOT_BLOCK)
>      for each directory between the eventing object and the watchpoint to see
>      if one rejects it - but some filesystems have a permission check that
>      can't be called in this state.

This is for setting the watch, right?

>      It would also be necessary to do this separately for each watchpoint in
>      the parental chain.
>
>      Further, each permissions check would generate an audit event and could
>      generate FAN_ACCESS and/or FAN_ACCESS_PERM fanotify events - which could
>      be a problem if fanotify is also trying to post those events to the same
>      watch queue.

If you required that the watching process open(dir) what
you want to watch you'd get this for free. Or did I miss
something obvious?

> David


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v9 12/16] sched/core: uclamp: Extend CPU's cgroup controller
From: Patrick Bellasi @ 2019-06-05 15:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, linux-kernel, linux-pm, linux-api, Ingo Molnar,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190605140943.GM374014@devbig004.ftw2.facebook.com>

On 05-Jun 07:09, Tejun Heo wrote:
> Hello,

Hi,

> On Mon, Jun 03, 2019 at 01:29:29PM +0100, Patrick Bellasi wrote:
> > On 31-May 08:35, Tejun Heo wrote:
> > > Hello, Patrick.
> > > 
> > > On Wed, May 15, 2019 at 10:44:55AM +0100, Patrick Bellasi wrote:
> > 
> > [...]
> > 
> > > For proportions (as opposed to weights), we use percentage rational
> > > numbers - e.g. 38.44 for 38.44%.  I have parser and doc update commits
> > > pending.  I'll put them on cgroup/for-5.3.
> > 
> > That's a point worth discussing with Peter, we already changed one
> > time from percentages to 1024 scale.
> 
> cgroup tries to uss uniform units for its interface files as much as
> possible even when that deviates from non-cgroup interface.  We can
> bikeshed the pros and cons for that design choice for sure but I don't
> think it makes sense to deviate from that at this point unless there
> are really strong reasons to do so.

that makes sense to me, having a uniform interface has certainly a
value.

The only additional point I can think about as a (slightly) stronger
reason is that I guess we would like to have the same API for cgroups
as well as for the task specific and the system wide settings.

The task specific values comes in via the sched_setattr() syscall:

   [PATCH v9 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
   https://lore.kernel.org/lkml/20190515094459.10317-7-patrick.bellasi@arm.com/

where we need to encode each clamp into a __u32 value.

System wide settings are expose similarly to these:

   grep '' /proc/sys/kernel/sched_*

where we have always integer numbers.

AFAIU your proposal will require to use a "scaled percentage" - e.g.
3844 for 38.44% which however it's still not quite the same as writing
the string "38.44".

Not sure that's a strong enough argument, is it?

> > Utilization clamps are expressed as percentages by definition,
> > they are just expressed in a convenient 1024 scale which should not be
> > alien to people using those knobs.
> > 
> > If we wanna use a "more specific" name like uclamp.{min,max} then we
> > should probably also accept to use a "more specific" metric, don't we?
> 
> Heh, this actually made me chuckle.

:)

> It's an interesting bargaining take but I don't think that same word
> being in two different places makes them tradable entities.

Sure, that was not my intention.

I was just trying to see if the need to be more specific could
be an argument for having also a more specific value.

> We can go into the weeds with the semantics but how about us using
> an alternative adjective "misleading" for the cpu.util.min/max names
> to short-circuit that?

Not quite sure to get what you mean here. Are you pointing out that
with clamps we don't strictly enforce a bandwidth but we just set a
bias?

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply

* Re: [PATCH v9 12/16] sched/core: uclamp: Extend CPU's cgroup controller
From: Tejun Heo @ 2019-06-05 15:27 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: Peter Zijlstra, linux-kernel, linux-pm, linux-api, Ingo Molnar,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190605150630.vh5pyfpd6y3mfcaa@e110439-lin>

Hello, Patrick.

On Wed, Jun 05, 2019 at 04:06:30PM +0100, Patrick Bellasi wrote:
> The only additional point I can think about as a (slightly) stronger
> reason is that I guess we would like to have the same API for cgroups
> as well as for the task specific and the system wide settings.
> 
> The task specific values comes in via the sched_setattr() syscall:
> 
>    [PATCH v9 06/16] sched/core: uclamp: Extend sched_setattr() to support utilization clamping
>    https://lore.kernel.org/lkml/20190515094459.10317-7-patrick.bellasi@arm.com/
> 
> where we need to encode each clamp into a __u32 value.
> 
> System wide settings are expose similarly to these:
> 
>    grep '' /proc/sys/kernel/sched_*
> 
> where we have always integer numbers.
> 
> AFAIU your proposal will require to use a "scaled percentage" - e.g.
> 3844 for 38.44% which however it's still not quite the same as writing
> the string "38.44".
> 
> Not sure that's a strong enough argument, is it?

It definitely is an argument but the thing is that the units we use in
kernel API are all over the place anyway.  Even for something as
simple as sizes, we use bytes, 512 byte sectors, kilobytes and pages
all over the place.  Some for good reasons (as you mentioned above)
and others for historical / random ones.

So, I'm generally not too concerned about units differing between
cgroup interface and, say, syscall interface.  That ship has sailed a
long while ago and we have to deal with it everywhere anyway (in many
cases there isn't even a good unit to pick for compatibility because
the existing interfaces are already mixing units heavily).  As long as
the translation is trivial, it isn't a big issue.  Note that some
translations are not trivial.  For example, the sched nice value
mapping to weight has a separate unit matching knob for that reason.

> > We can go into the weeds with the semantics but how about us using
> > an alternative adjective "misleading" for the cpu.util.min/max names
> > to short-circuit that?
> 
> Not quite sure to get what you mean here. Are you pointing out that
> with clamps we don't strictly enforce a bandwidth but we just set a
> bias?

It's just that "util" is already used a lot and cpu.util.max reads
like it should cap cpu utilization (wallclock based) to 80% and it's
likely that it'd read seem way to many other folks too.  A more
distinctive name signals that it isn't something that obvious.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH v9 12/16] sched/core: uclamp: Extend CPU's cgroup controller
From: Patrick Bellasi @ 2019-06-05 15:37 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, linux-pm, linux-api, Ingo Molnar, Peter Zijlstra,
	Rafael J . Wysocki, Vincent Guittot, Viresh Kumar, Paul Turner,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Juri Lelli,
	Todd Kjos, Joel Fernandes, Steve Muckle, Suren Baghdasaryan
In-Reply-To: <20190605144450.GN374014@devbig004.ftw2.facebook.com>

On 05-Jun 07:44, Tejun Heo wrote:
> Hello,

Hi,

> On Wed, Jun 05, 2019 at 03:39:50PM +0100, Patrick Bellasi wrote:
> > Which means we will enforce the effective values as:
> > 
> >    /tg1/tg11:
> > 
> >          util_min.effective=0
> >             i.e. keep the child protection since smaller than parent
> > 
> >          util_max.effective=800
> >             i.e. keep parent limit since stricter than child
> >
> > Please shout if I got it wrong, otherwise I'll update v10 to
> > implement the above logic.
> 
> Everything sounds good to me.  Please note that cgroup interface files
> actually use literal "max" for limit/protection max settings so that 0
> and "max" mean the same things for all limit/protection knobs.

Lemme see if I've got it right, do you mean that we can:

 1) write the _string_ "max" into a cgroup attribute to:

    - set    0 for util_max, since it's a protection
    - set 1024 for util_min, since it's a limit

 2) write the _string_ "0" into a cgroup attribute to:

    - set 1024 for util_max, since it's a protection
    - set    0 for util_min, since it's a limit

Is that correct or it's just me totally confused?


> Thanks.
> 
> --
> tejun

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox