Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v2 01/34] mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
From: Ira Weiny @ 2019-08-06 17:39 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, Christoph Hellwig, Dan Williams, Dave Chinner,
	Dave Hansen, Jan Kara, Jason Gunthorpe, Jérôme Glisse,
	LKML, amd-gfx, ceph-devel, devel, devel, dri-devel, intel-gfx,
	kvm, linux-arm-kernel, linux-block, linux-crypto, linux-fbdev,
	linux-fsdevel, linux-media, linux-mm, linux-nfs, linux-rdma,
	linux-rpi-kernel, linux-xfs, netdev, rds-devel, sparclinux, x86,
	xen-devel, John Hubbard, Christoph Hellwig, Matthew Wilcox
In-Reply-To: <20190804224915.28669-2-jhubbard@nvidia.com>

On Sun, Aug 04, 2019 at 03:48:42PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Provide a more capable variation of put_user_pages_dirty_lock(),
> and delete put_user_pages_dirty(). This is based on the
> following:
> 
> 1. Lots of call sites become simpler if a bool is passed
> into put_user_page*(), instead of making the call site
> choose which put_user_page*() variant to call.
> 
> 2. Christoph Hellwig's observation that set_page_dirty_lock()
> is usually correct, and set_page_dirty() is usually a
> bug, or at least questionable, within a put_user_page*()
> calling chain.
> 
> This leads to the following API choices:
> 
>     * put_user_pages_dirty_lock(page, npages, make_dirty)
> 
>     * There is no put_user_pages_dirty(). You have to
>       hand code that, in the rare case that it's
>       required.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  drivers/infiniband/core/umem.c             |   5 +-
>  drivers/infiniband/hw/hfi1/user_pages.c    |   5 +-
>  drivers/infiniband/hw/qib/qib_user_pages.c |  13 +--
>  drivers/infiniband/hw/usnic/usnic_uiom.c   |   5 +-
>  drivers/infiniband/sw/siw/siw_mem.c        |  19 +---
>  include/linux/mm.h                         |   5 +-
>  mm/gup.c                                   | 115 +++++++++------------
>  7 files changed, 61 insertions(+), 106 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 08da840ed7ee..965cf9dea71a 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -54,10 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
>  
>  	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
>  		page = sg_page_iter_page(&sg_iter);
> -		if (umem->writable && dirty)
> -			put_user_pages_dirty_lock(&page, 1);
> -		else
> -			put_user_page(page);
> +		put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
>  	}
>  
>  	sg_free_table(&umem->sg_head);
> diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
> index b89a9b9aef7a..469acb961fbd 100644
> --- a/drivers/infiniband/hw/hfi1/user_pages.c
> +++ b/drivers/infiniband/hw/hfi1/user_pages.c
> @@ -118,10 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
>  void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
>  			     size_t npages, bool dirty)
>  {
> -	if (dirty)
> -		put_user_pages_dirty_lock(p, npages);
> -	else
> -		put_user_pages(p, npages);
> +	put_user_pages_dirty_lock(p, npages, dirty);
>  
>  	if (mm) { /* during close after signal, mm can be NULL */
>  		atomic64_sub(npages, &mm->pinned_vm);
> diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
> index bfbfbb7e0ff4..26c1fb8d45cc 100644
> --- a/drivers/infiniband/hw/qib/qib_user_pages.c
> +++ b/drivers/infiniband/hw/qib/qib_user_pages.c
> @@ -37,15 +37,6 @@
>  
>  #include "qib.h"
>  
> -static void __qib_release_user_pages(struct page **p, size_t num_pages,
> -				     int dirty)
> -{
> -	if (dirty)
> -		put_user_pages_dirty_lock(p, num_pages);
> -	else
> -		put_user_pages(p, num_pages);
> -}
> -
>  /**
>   * qib_map_page - a safety wrapper around pci_map_page()
>   *
> @@ -124,7 +115,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
>  
>  	return 0;
>  bail_release:
> -	__qib_release_user_pages(p, got, 0);
> +	put_user_pages_dirty_lock(p, got, false);
>  bail:
>  	atomic64_sub(num_pages, &current->mm->pinned_vm);
>  	return ret;
> @@ -132,7 +123,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
>  
>  void qib_release_user_pages(struct page **p, size_t num_pages)
>  {
> -	__qib_release_user_pages(p, num_pages, 1);
> +	put_user_pages_dirty_lock(p, num_pages, true);
>  
>  	/* during close after signal, mm can be NULL */
>  	if (current->mm)
> diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
> index 0b0237d41613..62e6ffa9ad78 100644
> --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
> +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
> @@ -75,10 +75,7 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
>  		for_each_sg(chunk->page_list, sg, chunk->nents, i) {
>  			page = sg_page(sg);
>  			pa = sg_phys(sg);
> -			if (dirty)
> -				put_user_pages_dirty_lock(&page, 1);
> -			else
> -				put_user_page(page);
> +			put_user_pages_dirty_lock(&page, 1, dirty);
>  			usnic_dbg("pa: %pa\n", &pa);
>  		}
>  		kfree(chunk);
> diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
> index 67171c82b0c4..1e197753bf2f 100644
> --- a/drivers/infiniband/sw/siw/siw_mem.c
> +++ b/drivers/infiniband/sw/siw/siw_mem.c
> @@ -60,20 +60,6 @@ struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, int stag_index)
>  	return NULL;
>  }
>  
> -static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
> -			   bool dirty)
> -{
> -	struct page **p = chunk->plist;
> -
> -	while (num_pages--) {
> -		if (!PageDirty(*p) && dirty)
> -			put_user_pages_dirty_lock(p, 1);
> -		else
> -			put_user_page(*p);
> -		p++;
> -	}
> -}
> -
>  void siw_umem_release(struct siw_umem *umem, bool dirty)
>  {
>  	struct mm_struct *mm_s = umem->owning_mm;
> @@ -82,8 +68,9 @@ void siw_umem_release(struct siw_umem *umem, bool dirty)
>  	for (i = 0; num_pages; i++) {
>  		int to_free = min_t(int, PAGES_PER_CHUNK, num_pages);
>  
> -		siw_free_plist(&umem->page_chunk[i], to_free,
> -			       umem->writable && dirty);
> +		put_user_pages_dirty_lock(umem->page_chunk[i].plist,
> +					  to_free,
> +					  umem->writable && dirty);
>  		kfree(umem->page_chunk[i].plist);
>  		num_pages -= to_free;
>  	}
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0334ca97c584..9759b6a24420 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1057,8 +1057,9 @@ static inline void put_user_page(struct page *page)
>  	put_page(page);
>  }
>  
> -void put_user_pages_dirty(struct page **pages, unsigned long npages);
> -void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> +			       bool make_dirty);
> +
>  void put_user_pages(struct page **pages, unsigned long npages);
>  
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> diff --git a/mm/gup.c b/mm/gup.c
> index 98f13ab37bac..7fefd7ab02c4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,85 +29,70 @@ struct follow_page_context {
>  	unsigned int page_mask;
>  };
>  
> -typedef int (*set_dirty_func_t)(struct page *page);
> -
> -static void __put_user_pages_dirty(struct page **pages,
> -				   unsigned long npages,
> -				   set_dirty_func_t sdf)
> -{
> -	unsigned long index;
> -
> -	for (index = 0; index < npages; index++) {
> -		struct page *page = compound_head(pages[index]);
> -
> -		/*
> -		 * Checking PageDirty at this point may race with
> -		 * clear_page_dirty_for_io(), but that's OK. Two key cases:
> -		 *
> -		 * 1) This code sees the page as already dirty, so it skips
> -		 * the call to sdf(). That could happen because
> -		 * clear_page_dirty_for_io() called page_mkclean(),
> -		 * followed by set_page_dirty(). However, now the page is
> -		 * going to get written back, which meets the original
> -		 * intention of setting it dirty, so all is well:
> -		 * clear_page_dirty_for_io() goes on to call
> -		 * TestClearPageDirty(), and write the page back.
> -		 *
> -		 * 2) This code sees the page as clean, so it calls sdf().
> -		 * The page stays dirty, despite being written back, so it
> -		 * gets written back again in the next writeback cycle.
> -		 * This is harmless.
> -		 */
> -		if (!PageDirty(page))
> -			sdf(page);
> -
> -		put_user_page(page);
> -	}
> -}
> -
>  /**
> - * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
> - * @pages:  array of pages to be marked dirty and released.
> + * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
> + * @pages:  array of pages to be maybe marked dirty, and definitely released.

Better would be.

@pages:  array of pages to be put

>   * @npages: number of pages in the @pages array.
> + * @make_dirty: whether to mark the pages dirty
>   *
>   * "gup-pinned page" refers to a page that has had one of the get_user_pages()
>   * variants called on that page.
>   *
>   * For each page in the @pages array, make that page (or its head page, if a
> - * compound page) dirty, if it was previously listed as clean. Then, release
> - * the page using put_user_page().
> + * compound page) dirty, if @make_dirty is true, and if the page was previously
> + * listed as clean. In any case, releases all pages using put_user_page(),
> + * possibly via put_user_pages(), for the non-dirty case.

I don't think users of this interface need this level of detail.  I think
something like.

 * For each page in the @pages array, release the page.  If @make_dirty is
 * true, mark the page dirty prior to release.


>   *
>   * Please see the put_user_page() documentation for details.
>   *
> - * set_page_dirty(), which does not lock the page, is used here.
> - * Therefore, it is the caller's responsibility to ensure that this is
> - * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> + * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
> + * required, then the caller should a) verify that this is really correct,
> + * because _lock() is usually required, and b) hand code it:
> + * set_page_dirty_lock(), put_user_page().
>   *
>   */
> -void put_user_pages_dirty(struct page **pages, unsigned long npages)
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> +			       bool make_dirty)
>  {
> -	__put_user_pages_dirty(pages, npages, set_page_dirty);
> -}
> -EXPORT_SYMBOL(put_user_pages_dirty);
> +	unsigned long index;
>  
> -/**
> - * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
> - * @pages:  array of pages to be marked dirty and released.
> - * @npages: number of pages in the @pages array.
> - *
> - * For each page in the @pages array, make that page (or its head page, if a
> - * compound page) dirty, if it was previously listed as clean. Then, release
> - * the page using put_user_page().
> - *
> - * Please see the put_user_page() documentation for details.
> - *
> - * This is just like put_user_pages_dirty(), except that it invokes
> - * set_page_dirty_lock(), instead of set_page_dirty().
> - *
> - */
> -void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> -{
> -	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
> +	/*
> +	 * TODO: this can be optimized for huge pages: if a series of pages is
> +	 * physically contiguous and part of the same compound page, then a
> +	 * single operation to the head page should suffice.
> +	 */

I think this comment belongs to the for loop below...  or just something about
how to make this and put_user_pages() more efficient.  It is odd, that this is
the same comment as in put_user_pages()...

The code is good.  So... Other than the comments.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

Ira

> +
> +	if (!make_dirty) {
> +		put_user_pages(pages, npages);
> +		return;
> +	}
> +
> +	for (index = 0; index < npages; index++) {
> +		struct page *page = compound_head(pages[index]);
> +		/*
> +		 * Checking PageDirty at this point may race with
> +		 * clear_page_dirty_for_io(), but that's OK. Two key
> +		 * cases:
> +		 *
> +		 * 1) This code sees the page as already dirty, so it
> +		 * skips the call to set_page_dirty(). That could happen
> +		 * because clear_page_dirty_for_io() called
> +		 * page_mkclean(), followed by set_page_dirty().
> +		 * However, now the page is going to get written back,
> +		 * which meets the original intention of setting it
> +		 * dirty, so all is well: clear_page_dirty_for_io() goes
> +		 * on to call TestClearPageDirty(), and write the page
> +		 * back.
> +		 *
> +		 * 2) This code sees the page as clean, so it calls
> +		 * set_page_dirty(). The page stays dirty, despite being
> +		 * written back, so it gets written back again in the
> +		 * next writeback cycle. This is harmless.
> +		 */
> +		if (!PageDirty(page))
> +			set_page_dirty_lock(page);
> +		put_user_page(page);
> +	}
>  }
>  EXPORT_SYMBOL(put_user_pages_dirty_lock);
>  
> -- 
> 2.22.0
> 

^ permalink raw reply

* Re: [PATCH bpf-next v4 1/3] selftests/bpf: test_progs: switch to open_memstream
From: Stanislav Fomichev @ 2019-08-06 17:40 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Stanislav Fomichev, Networking, bpf, David S. Miller,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
In-Reply-To: <CAEf4BzYU6xfcPrHzz0p6dWL3_VM2mD9pKy3T-NfnuDUrd4RMDQ@mail.gmail.com>

On 08/06, Andrii Nakryiko wrote:
> On Tue, Aug 6, 2019 at 10:19 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > Use open_memstream to override stdout during test execution.
> > The copy of the original stdout is held in env.stdout and used
> > to print subtest info and dump failed log.
> >
> > test_{v,}printf are now simple wrappers around stdout and will be
> > removed in the next patch.
> >
> > v4:
> > * one field per line for stdout/stderr (Andrii Nakryiko)
> >
> > v3:
> > * don't do strlen over log_buf, log_cnt has it already (Andrii Nakryiko)
> >
> > v2:
> > * add ifdef __GLIBC__ around open_memstream (maybe pointless since
> >   we already depend on glibc for argp_parse)
> > * hijack stderr as well (Andrii Nakryiko)
> > * don't hijack for every test, do it once (Andrii Nakryiko)
> > * log_cap -> log_size (Andrii Nakryiko)
> > * do fseeko in a proper place (Andrii Nakryiko)
> > * check open_memstream returned value (Andrii Nakryiko)
> >
> > Cc: Andrii Nakryiko <andriin@fb.com>
> > Acked-by: Andrii Nakryiko <andriin@fb.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  tools/testing/selftests/bpf/test_progs.c | 115 ++++++++++++-----------
> >  tools/testing/selftests/bpf/test_progs.h |   3 +-
> >  2 files changed, 62 insertions(+), 56 deletions(-)
> >
> > diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> > index db00196c8315..9556439c607c 100644
> > --- a/tools/testing/selftests/bpf/test_progs.c
> > +++ b/tools/testing/selftests/bpf/test_progs.c
> > @@ -40,14 +40,20 @@ static bool should_run(struct test_selector *sel, int num, const char *name)
> >
> >  static void dump_test_log(const struct prog_test_def *test, bool failed)
> >  {
> > +       if (stdout == env.stdout)
> > +               return;
> > +
> > +       fflush(stdout); /* exports env.log_buf & env.log_cnt */
> > +
> >         if (env.verbose || test->force_log || failed) {
> >                 if (env.log_cnt) {
> > -                       fprintf(stdout, "%s", env.log_buf);
> > +                       fprintf(env.stdout, "%s", env.log_buf);
> >                         if (env.log_buf[env.log_cnt - 1] != '\n')
> > -                               fprintf(stdout, "\n");
> > +                               fprintf(env.stdout, "\n");
> >                 }
> >         }
> > -       env.log_cnt = 0;
> > +
> > +       fseeko(stdout, 0, SEEK_SET); /* rewind */
> >  }
> >
> >  void test__end_subtest()
> > @@ -62,7 +68,7 @@ void test__end_subtest()
> >
> >         dump_test_log(test, sub_error_cnt);
> >
> > -       printf("#%d/%d %s:%s\n",
> > +       fprintf(env.stdout, "#%d/%d %s:%s\n",
> >                test->test_num, test->subtest_num,
> >                test->subtest_name, sub_error_cnt ? "FAIL" : "OK");
> >  }
> > @@ -79,7 +85,8 @@ bool test__start_subtest(const char *name)
> >         test->subtest_num++;
> >
> >         if (!name || !name[0]) {
> > -               fprintf(stderr, "Subtest #%d didn't provide sub-test name!\n",
> > +               fprintf(env.stderr,
> > +                       "Subtest #%d didn't provide sub-test name!\n",
> >                         test->subtest_num);
> >                 return false;
> >         }
> > @@ -100,53 +107,7 @@ void test__force_log() {
> >
> >  void test__vprintf(const char *fmt, va_list args)
> >  {
> > -       size_t rem_sz;
> > -       int ret = 0;
> > -
> > -       if (env.verbose || (env.test && env.test->force_log)) {
> > -               vfprintf(stderr, fmt, args);
> > -               return;
> > -       }
> > -
> > -try_again:
> > -       rem_sz = env.log_cap - env.log_cnt;
> > -       if (rem_sz) {
> > -               va_list ap;
> > -
> > -               va_copy(ap, args);
> > -               /* we reserved extra byte for \0 at the end */
> > -               ret = vsnprintf(env.log_buf + env.log_cnt, rem_sz + 1, fmt, ap);
> > -               va_end(ap);
> > -
> > -               if (ret < 0) {
> > -                       env.log_buf[env.log_cnt] = '\0';
> > -                       fprintf(stderr, "failed to log w/ fmt '%s'\n", fmt);
> > -                       return;
> > -               }
> > -       }
> > -
> > -       if (!rem_sz || ret > rem_sz) {
> > -               size_t new_sz = env.log_cap * 3 / 2;
> > -               char *new_buf;
> > -
> > -               if (new_sz < 4096)
> > -                       new_sz = 4096;
> > -               if (new_sz < ret + env.log_cnt)
> > -                       new_sz = ret + env.log_cnt;
> > -
> > -               /* +1 for guaranteed space for terminating \0 */
> > -               new_buf = realloc(env.log_buf, new_sz + 1);
> > -               if (!new_buf) {
> > -                       fprintf(stderr, "failed to realloc log buffer: %d\n",
> > -                               errno);
> > -                       return;
> > -               }
> > -               env.log_buf = new_buf;
> > -               env.log_cap = new_sz;
> > -               goto try_again;
> > -       }
> > -
> > -       env.log_cnt += ret;
> > +       vprintf(fmt, args);
> >  }
> >
> >  void test__printf(const char *fmt, ...)
> > @@ -477,6 +438,48 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
> >         return 0;
> >  }
> >
> > +static void stdio_hijack(void)
> > +{
> > +#ifdef __GLIBC__
> > +       if (env.verbose || (env.test && env.test->force_log)) {
> 
> I just also realized that you don't need `(env.test &&
> env.test->force_log)` test. We hijack stdout/stderr before env.test is
> even set, so this does nothing anyways. Plus, force_log can be set in
> the middle of test/sub-test, yet we hijack stdout just once (or even
> if per-test), so it's still going to be "racy". Let's buffer output
> (unless it's env.verbose, which is important to not buffer because
> some tests will have huge output, when failing, so this allows to
> bypass using tons of memory for those, when debugging) and dump at the
> end.
Makes sense, will drop this test and resubmit along with a fix for '-v'
that Alexei discovered. Thanks!

> > +               /* nothing to do, output to stdout by default */
> > +               return;
> > +       }
> > +
> > +       /* stdout and stderr -> buffer */
> > +       fflush(stdout);
> > +
> > +       env.stdout = stdout;
> > +       env.stderr = stderr;
> > +
> > +       stdout = open_memstream(&env.log_buf, &env.log_cnt);
> > +       if (!stdout) {
> > +               stdout = env.stdout;
> > +               perror("open_memstream");
> > +               return;
> > +       }
> > +
> > +       stderr = stdout;
> > +#endif
> > +}
> > +
> > +static void stdio_restore(void)
> > +{
> > +#ifdef __GLIBC__
> > +       if (stdout == env.stdout)
> > +               return;
> > +
> > +       fclose(stdout);
> > +       free(env.log_buf);
> > +
> > +       env.log_buf = NULL;
> > +       env.log_cnt = 0;
> > +
> > +       stdout = env.stdout;
> > +       stderr = env.stderr;
> > +#endif
> > +}
> > +
> >  int main(int argc, char **argv)
> >  {
> >         static const struct argp argp = {
> > @@ -496,6 +499,7 @@ int main(int argc, char **argv)
> >
> >         env.jit_enabled = is_jit_enabled();
> >
> > +       stdio_hijack();
> >         for (i = 0; i < prog_test_cnt; i++) {
> >                 struct prog_test_def *test = &prog_test_defs[i];
> >                 int old_pass_cnt = pass_cnt;
> > @@ -523,13 +527,14 @@ int main(int argc, char **argv)
> >
> >                 dump_test_log(test, test->error_cnt);
> >
> > -               printf("#%d %s:%s\n", test->test_num, test->test_name,
> > -                      test->error_cnt ? "FAIL" : "OK");
> > +               fprintf(env.stdout, "#%d %s:%s\n",
> > +                       test->test_num, test->test_name,
> > +                       test->error_cnt ? "FAIL" : "OK");
> >         }
> > +       stdio_restore();
> >         printf("Summary: %d/%d PASSED, %d FAILED\n",
> >                env.succ_cnt, env.sub_succ_cnt, env.fail_cnt);
> >
> > -       free(env.log_buf);
> >         free(env.test_selector.num_set);
> >         free(env.subtest_selector.num_set);
> >
> > diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
> > index afd14962456f..541f9eab5eed 100644
> > --- a/tools/testing/selftests/bpf/test_progs.h
> > +++ b/tools/testing/selftests/bpf/test_progs.h
> > @@ -56,9 +56,10 @@ struct test_env {
> >         bool jit_enabled;
> >
> >         struct prog_test_def *test;
> > +       FILE *stdout;
> > +       FILE *stderr;
> >         char *log_buf;
> >         size_t log_cnt;
> > -       size_t log_cap;
> >
> >         int succ_cnt; /* successful tests */
> >         int sub_succ_cnt; /* successful sub-tests */
> > --
> > 2.22.0.770.g0f2c4a37fd-goog
> >

^ permalink raw reply

* Re: [PATCH 12/17] skge: no need to check return value of debugfs_create functions
From: Stephen Hemminger @ 2019-08-06 17:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: netdev, Mirko Lindner, David S. Miller
In-Reply-To: <20190806161128.31232-13-gregkh@linuxfoundation.org>

On Tue,  6 Aug 2019 18:11:23 +0200
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
> 
> Cc: Mirko Lindner <mlindner@marvell.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Acked-by: Stephen Hemminger <stephen@networkplumber.org>

Did not pull card out of dust bin to test it though

^ permalink raw reply

* Re: [PATCH v6 1/3] mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
From: Ira Weiny @ 2019-08-06 17:40 UTC (permalink / raw)
  To: john.hubbard
  Cc: Andrew Morton, Alexander Viro, Björn Töpel,
	Boaz Harrosh, Christoph Hellwig, Daniel Vetter, Dan Williams,
	Dave Chinner, David Airlie, David S . Miller, Ilya Dryomov,
	Jan Kara, Jason Gunthorpe, Jens Axboe, Jérôme Glisse,
	Johannes Thumshirn, Magnus Karlsson, Matthew Wilcox,
	Miklos Szeredi, Ming Lei, Sage Weil, Santosh Shilimkar, Yan Zheng,
	netdev, dri-devel, linux-mm, linux-rdma, bpf, LKML, John Hubbard
In-Reply-To: <20190804214042.4564-2-jhubbard@nvidia.com>

On Sun, Aug 04, 2019 at 02:40:40PM -0700, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Provide a more capable variation of put_user_pages_dirty_lock(),
> and delete put_user_pages_dirty(). This is based on the
> following:
> 
> 1. Lots of call sites become simpler if a bool is passed
> into put_user_page*(), instead of making the call site
> choose which put_user_page*() variant to call.
> 
> 2. Christoph Hellwig's observation that set_page_dirty_lock()
> is usually correct, and set_page_dirty() is usually a
> bug, or at least questionable, within a put_user_page*()
> calling chain.
> 
> This leads to the following API choices:
> 
>     * put_user_pages_dirty_lock(page, npages, make_dirty)
> 
>     * There is no put_user_pages_dirty(). You have to
>       hand code that, in the rare case that it's
>       required.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

I assume this is superseded by the patch in the large series?

Ira

> ---
>  drivers/infiniband/core/umem.c             |   5 +-
>  drivers/infiniband/hw/hfi1/user_pages.c    |   5 +-
>  drivers/infiniband/hw/qib/qib_user_pages.c |  13 +--
>  drivers/infiniband/hw/usnic/usnic_uiom.c   |   5 +-
>  drivers/infiniband/sw/siw/siw_mem.c        |  19 +---
>  include/linux/mm.h                         |   5 +-
>  mm/gup.c                                   | 115 +++++++++------------
>  7 files changed, 61 insertions(+), 106 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 08da840ed7ee..965cf9dea71a 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -54,10 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
>  
>  	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
>  		page = sg_page_iter_page(&sg_iter);
> -		if (umem->writable && dirty)
> -			put_user_pages_dirty_lock(&page, 1);
> -		else
> -			put_user_page(page);
> +		put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
>  	}
>  
>  	sg_free_table(&umem->sg_head);
> diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
> index b89a9b9aef7a..469acb961fbd 100644
> --- a/drivers/infiniband/hw/hfi1/user_pages.c
> +++ b/drivers/infiniband/hw/hfi1/user_pages.c
> @@ -118,10 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
>  void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
>  			     size_t npages, bool dirty)
>  {
> -	if (dirty)
> -		put_user_pages_dirty_lock(p, npages);
> -	else
> -		put_user_pages(p, npages);
> +	put_user_pages_dirty_lock(p, npages, dirty);
>  
>  	if (mm) { /* during close after signal, mm can be NULL */
>  		atomic64_sub(npages, &mm->pinned_vm);
> diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
> index bfbfbb7e0ff4..26c1fb8d45cc 100644
> --- a/drivers/infiniband/hw/qib/qib_user_pages.c
> +++ b/drivers/infiniband/hw/qib/qib_user_pages.c
> @@ -37,15 +37,6 @@
>  
>  #include "qib.h"
>  
> -static void __qib_release_user_pages(struct page **p, size_t num_pages,
> -				     int dirty)
> -{
> -	if (dirty)
> -		put_user_pages_dirty_lock(p, num_pages);
> -	else
> -		put_user_pages(p, num_pages);
> -}
> -
>  /**
>   * qib_map_page - a safety wrapper around pci_map_page()
>   *
> @@ -124,7 +115,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
>  
>  	return 0;
>  bail_release:
> -	__qib_release_user_pages(p, got, 0);
> +	put_user_pages_dirty_lock(p, got, false);
>  bail:
>  	atomic64_sub(num_pages, &current->mm->pinned_vm);
>  	return ret;
> @@ -132,7 +123,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
>  
>  void qib_release_user_pages(struct page **p, size_t num_pages)
>  {
> -	__qib_release_user_pages(p, num_pages, 1);
> +	put_user_pages_dirty_lock(p, num_pages, true);
>  
>  	/* during close after signal, mm can be NULL */
>  	if (current->mm)
> diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
> index 0b0237d41613..62e6ffa9ad78 100644
> --- a/drivers/infiniband/hw/usnic/usnic_uiom.c
> +++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
> @@ -75,10 +75,7 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
>  		for_each_sg(chunk->page_list, sg, chunk->nents, i) {
>  			page = sg_page(sg);
>  			pa = sg_phys(sg);
> -			if (dirty)
> -				put_user_pages_dirty_lock(&page, 1);
> -			else
> -				put_user_page(page);
> +			put_user_pages_dirty_lock(&page, 1, dirty);
>  			usnic_dbg("pa: %pa\n", &pa);
>  		}
>  		kfree(chunk);
> diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
> index 67171c82b0c4..1e197753bf2f 100644
> --- a/drivers/infiniband/sw/siw/siw_mem.c
> +++ b/drivers/infiniband/sw/siw/siw_mem.c
> @@ -60,20 +60,6 @@ struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, int stag_index)
>  	return NULL;
>  }
>  
> -static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
> -			   bool dirty)
> -{
> -	struct page **p = chunk->plist;
> -
> -	while (num_pages--) {
> -		if (!PageDirty(*p) && dirty)
> -			put_user_pages_dirty_lock(p, 1);
> -		else
> -			put_user_page(*p);
> -		p++;
> -	}
> -}
> -
>  void siw_umem_release(struct siw_umem *umem, bool dirty)
>  {
>  	struct mm_struct *mm_s = umem->owning_mm;
> @@ -82,8 +68,9 @@ void siw_umem_release(struct siw_umem *umem, bool dirty)
>  	for (i = 0; num_pages; i++) {
>  		int to_free = min_t(int, PAGES_PER_CHUNK, num_pages);
>  
> -		siw_free_plist(&umem->page_chunk[i], to_free,
> -			       umem->writable && dirty);
> +		put_user_pages_dirty_lock(umem->page_chunk[i].plist,
> +					  to_free,
> +					  umem->writable && dirty);
>  		kfree(umem->page_chunk[i].plist);
>  		num_pages -= to_free;
>  	}
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0334ca97c584..9759b6a24420 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1057,8 +1057,9 @@ static inline void put_user_page(struct page *page)
>  	put_page(page);
>  }
>  
> -void put_user_pages_dirty(struct page **pages, unsigned long npages);
> -void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> +			       bool make_dirty);
> +
>  void put_user_pages(struct page **pages, unsigned long npages);
>  
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> diff --git a/mm/gup.c b/mm/gup.c
> index 98f13ab37bac..7fefd7ab02c4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,85 +29,70 @@ struct follow_page_context {
>  	unsigned int page_mask;
>  };
>  
> -typedef int (*set_dirty_func_t)(struct page *page);
> -
> -static void __put_user_pages_dirty(struct page **pages,
> -				   unsigned long npages,
> -				   set_dirty_func_t sdf)
> -{
> -	unsigned long index;
> -
> -	for (index = 0; index < npages; index++) {
> -		struct page *page = compound_head(pages[index]);
> -
> -		/*
> -		 * Checking PageDirty at this point may race with
> -		 * clear_page_dirty_for_io(), but that's OK. Two key cases:
> -		 *
> -		 * 1) This code sees the page as already dirty, so it skips
> -		 * the call to sdf(). That could happen because
> -		 * clear_page_dirty_for_io() called page_mkclean(),
> -		 * followed by set_page_dirty(). However, now the page is
> -		 * going to get written back, which meets the original
> -		 * intention of setting it dirty, so all is well:
> -		 * clear_page_dirty_for_io() goes on to call
> -		 * TestClearPageDirty(), and write the page back.
> -		 *
> -		 * 2) This code sees the page as clean, so it calls sdf().
> -		 * The page stays dirty, despite being written back, so it
> -		 * gets written back again in the next writeback cycle.
> -		 * This is harmless.
> -		 */
> -		if (!PageDirty(page))
> -			sdf(page);
> -
> -		put_user_page(page);
> -	}
> -}
> -
>  /**
> - * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
> - * @pages:  array of pages to be marked dirty and released.
> + * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
> + * @pages:  array of pages to be maybe marked dirty, and definitely released.
>   * @npages: number of pages in the @pages array.
> + * @make_dirty: whether to mark the pages dirty
>   *
>   * "gup-pinned page" refers to a page that has had one of the get_user_pages()
>   * variants called on that page.
>   *
>   * For each page in the @pages array, make that page (or its head page, if a
> - * compound page) dirty, if it was previously listed as clean. Then, release
> - * the page using put_user_page().
> + * compound page) dirty, if @make_dirty is true, and if the page was previously
> + * listed as clean. In any case, releases all pages using put_user_page(),
> + * possibly via put_user_pages(), for the non-dirty case.
>   *
>   * Please see the put_user_page() documentation for details.
>   *
> - * set_page_dirty(), which does not lock the page, is used here.
> - * Therefore, it is the caller's responsibility to ensure that this is
> - * safe. If not, then put_user_pages_dirty_lock() should be called instead.
> + * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
> + * required, then the caller should a) verify that this is really correct,
> + * because _lock() is usually required, and b) hand code it:
> + * set_page_dirty_lock(), put_user_page().
>   *
>   */
> -void put_user_pages_dirty(struct page **pages, unsigned long npages)
> +void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> +			       bool make_dirty)
>  {
> -	__put_user_pages_dirty(pages, npages, set_page_dirty);
> -}
> -EXPORT_SYMBOL(put_user_pages_dirty);
> +	unsigned long index;
>  
> -/**
> - * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
> - * @pages:  array of pages to be marked dirty and released.
> - * @npages: number of pages in the @pages array.
> - *
> - * For each page in the @pages array, make that page (or its head page, if a
> - * compound page) dirty, if it was previously listed as clean. Then, release
> - * the page using put_user_page().
> - *
> - * Please see the put_user_page() documentation for details.
> - *
> - * This is just like put_user_pages_dirty(), except that it invokes
> - * set_page_dirty_lock(), instead of set_page_dirty().
> - *
> - */
> -void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
> -{
> -	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
> +	/*
> +	 * TODO: this can be optimized for huge pages: if a series of pages is
> +	 * physically contiguous and part of the same compound page, then a
> +	 * single operation to the head page should suffice.
> +	 */
> +
> +	if (!make_dirty) {
> +		put_user_pages(pages, npages);
> +		return;
> +	}
> +
> +	for (index = 0; index < npages; index++) {
> +		struct page *page = compound_head(pages[index]);
> +		/*
> +		 * Checking PageDirty at this point may race with
> +		 * clear_page_dirty_for_io(), but that's OK. Two key
> +		 * cases:
> +		 *
> +		 * 1) This code sees the page as already dirty, so it
> +		 * skips the call to set_page_dirty(). That could happen
> +		 * because clear_page_dirty_for_io() called
> +		 * page_mkclean(), followed by set_page_dirty().
> +		 * However, now the page is going to get written back,
> +		 * which meets the original intention of setting it
> +		 * dirty, so all is well: clear_page_dirty_for_io() goes
> +		 * on to call TestClearPageDirty(), and write the page
> +		 * back.
> +		 *
> +		 * 2) This code sees the page as clean, so it calls
> +		 * set_page_dirty(). The page stays dirty, despite being
> +		 * written back, so it gets written back again in the
> +		 * next writeback cycle. This is harmless.
> +		 */
> +		if (!PageDirty(page))
> +			set_page_dirty_lock(page);
> +		put_user_page(page);
> +	}
>  }
>  EXPORT_SYMBOL(put_user_pages_dirty_lock);
>  
> -- 
> 2.22.0
> 

^ permalink raw reply

* [PATCH bpf-next v5 0/3] selftests/bpf: switch test_progs back to stdio
From: Stanislav Fomichev @ 2019-08-06 17:45 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev, Andrii Nakryiko

I was looking into converting test_sockops* to test_progs framework
and that requires using cgroup_helpers.c which rely on stdio/stderr.
Let's use open_memstream to override stdout into buffer during
subtests instead of custom test_{v,}printf wrappers. That lets
us continue to use stdio in the subtests and dump it on failure
if required.

That would also fix bpf_find_map which currently uses printf to
signal failure (missed during test_printf conversion).

Cc: Andrii Nakryiko <andriin@fb.com>

Stanislav Fomichev (3):
  selftests/bpf: test_progs: switch to open_memstream
  selftests/bpf: test_progs: test__printf -> printf
  selftests/bpf: test_progs: drop extra trailing tab

 .../bpf/prog_tests/bpf_verif_scale.c          |   4 +-
 .../selftests/bpf/prog_tests/l4lb_all.c       |   2 +-
 .../selftests/bpf/prog_tests/map_lock.c       |  10 +-
 .../selftests/bpf/prog_tests/send_signal.c    |   4 +-
 .../selftests/bpf/prog_tests/spinlock.c       |   2 +-
 .../bpf/prog_tests/stacktrace_build_id.c      |   4 +-
 .../bpf/prog_tests/stacktrace_build_id_nmi.c  |   4 +-
 .../selftests/bpf/prog_tests/xdp_noinline.c   |   4 +-
 tools/testing/selftests/bpf/test_progs.c      | 131 ++++++++----------
 tools/testing/selftests/bpf/test_progs.h      |  13 +-
 10 files changed, 84 insertions(+), 94 deletions(-)

-- 
2.22.0.770.g0f2c4a37fd-goog

^ permalink raw reply

* [PATCH bpf-next v5 1/3] selftests/bpf: test_progs: switch to open_memstream
From: Stanislav Fomichev @ 2019-08-06 17:45 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev, Andrii Nakryiko
In-Reply-To: <20190806174529.8341-1-sdf@google.com>

Use open_memstream to override stdout during test execution.
The copy of the original stdout is held in env.stdout and used
to print subtest info and dump failed log.

test_{v,}printf are now simple wrappers around stdout and will be
removed in the next patch.

v5:
* fix -v crash by always setting env.std{in,err} (Alexei Starovoitov)
* drop force_log check from stdio_hijack (Andrii Nakryiko)

v4:
* one field per line for stdout/stderr (Andrii Nakryiko)

v3:
* don't do strlen over log_buf, log_cnt has it already (Andrii Nakryiko)

v2:
* add ifdef __GLIBC__ around open_memstream (maybe pointless since
  we already depend on glibc for argp_parse)
* hijack stderr as well (Andrii Nakryiko)
* don't hijack for every test, do it once (Andrii Nakryiko)
* log_cap -> log_size (Andrii Nakryiko)
* do fseeko in a proper place (Andrii Nakryiko)
* check open_memstream returned value (Andrii Nakryiko)

Cc: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/test_progs.c | 115 ++++++++++++-----------
 tools/testing/selftests/bpf/test_progs.h |   3 +-
 2 files changed, 62 insertions(+), 56 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index db00196c8315..6ea289ba307b 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -40,14 +40,20 @@ static bool should_run(struct test_selector *sel, int num, const char *name)
 
 static void dump_test_log(const struct prog_test_def *test, bool failed)
 {
+	if (stdout == env.stdout)
+		return;
+
+	fflush(stdout); /* exports env.log_buf & env.log_cnt */
+
 	if (env.verbose || test->force_log || failed) {
 		if (env.log_cnt) {
-			fprintf(stdout, "%s", env.log_buf);
+			fprintf(env.stdout, "%s", env.log_buf);
 			if (env.log_buf[env.log_cnt - 1] != '\n')
-				fprintf(stdout, "\n");
+				fprintf(env.stdout, "\n");
 		}
 	}
-	env.log_cnt = 0;
+
+	fseeko(stdout, 0, SEEK_SET); /* rewind */
 }
 
 void test__end_subtest()
@@ -62,7 +68,7 @@ void test__end_subtest()
 
 	dump_test_log(test, sub_error_cnt);
 
-	printf("#%d/%d %s:%s\n",
+	fprintf(env.stdout, "#%d/%d %s:%s\n",
 	       test->test_num, test->subtest_num,
 	       test->subtest_name, sub_error_cnt ? "FAIL" : "OK");
 }
@@ -79,7 +85,8 @@ bool test__start_subtest(const char *name)
 	test->subtest_num++;
 
 	if (!name || !name[0]) {
-		fprintf(stderr, "Subtest #%d didn't provide sub-test name!\n",
+		fprintf(env.stderr,
+			"Subtest #%d didn't provide sub-test name!\n",
 			test->subtest_num);
 		return false;
 	}
@@ -100,53 +107,7 @@ void test__force_log() {
 
 void test__vprintf(const char *fmt, va_list args)
 {
-	size_t rem_sz;
-	int ret = 0;
-
-	if (env.verbose || (env.test && env.test->force_log)) {
-		vfprintf(stderr, fmt, args);
-		return;
-	}
-
-try_again:
-	rem_sz = env.log_cap - env.log_cnt;
-	if (rem_sz) {
-		va_list ap;
-
-		va_copy(ap, args);
-		/* we reserved extra byte for \0 at the end */
-		ret = vsnprintf(env.log_buf + env.log_cnt, rem_sz + 1, fmt, ap);
-		va_end(ap);
-
-		if (ret < 0) {
-			env.log_buf[env.log_cnt] = '\0';
-			fprintf(stderr, "failed to log w/ fmt '%s'\n", fmt);
-			return;
-		}
-	}
-
-	if (!rem_sz || ret > rem_sz) {
-		size_t new_sz = env.log_cap * 3 / 2;
-		char *new_buf;
-
-		if (new_sz < 4096)
-			new_sz = 4096;
-		if (new_sz < ret + env.log_cnt)
-			new_sz = ret + env.log_cnt;
-
-		/* +1 for guaranteed space for terminating \0 */
-		new_buf = realloc(env.log_buf, new_sz + 1);
-		if (!new_buf) {
-			fprintf(stderr, "failed to realloc log buffer: %d\n",
-				errno);
-			return;
-		}
-		env.log_buf = new_buf;
-		env.log_cap = new_sz;
-		goto try_again;
-	}
-
-	env.log_cnt += ret;
+	vprintf(fmt, args);
 }
 
 void test__printf(const char *fmt, ...)
@@ -477,6 +438,48 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
 	return 0;
 }
 
+static void stdio_hijack(void)
+{
+#ifdef __GLIBC__
+	env.stdout = stdout;
+	env.stderr = stderr;
+
+	if (env.verbose) {
+		/* nothing to do, output to stdout by default */
+		return;
+	}
+
+	/* stdout and stderr -> buffer */
+	fflush(stdout);
+
+	stdout = open_memstream(&env.log_buf, &env.log_cnt);
+	if (!stdout) {
+		stdout = env.stdout;
+		perror("open_memstream");
+		return;
+	}
+
+	stderr = stdout;
+#endif
+}
+
+static void stdio_restore(void)
+{
+#ifdef __GLIBC__
+	if (stdout == env.stdout)
+		return;
+
+	fclose(stdout);
+	free(env.log_buf);
+
+	env.log_buf = NULL;
+	env.log_cnt = 0;
+
+	stdout = env.stdout;
+	stderr = env.stderr;
+#endif
+}
+
 int main(int argc, char **argv)
 {
 	static const struct argp argp = {
@@ -496,6 +499,7 @@ int main(int argc, char **argv)
 
 	env.jit_enabled = is_jit_enabled();
 
+	stdio_hijack();
 	for (i = 0; i < prog_test_cnt; i++) {
 		struct prog_test_def *test = &prog_test_defs[i];
 		int old_pass_cnt = pass_cnt;
@@ -523,13 +527,14 @@ int main(int argc, char **argv)
 
 		dump_test_log(test, test->error_cnt);
 
-		printf("#%d %s:%s\n", test->test_num, test->test_name,
-		       test->error_cnt ? "FAIL" : "OK");
+		fprintf(env.stdout, "#%d %s:%s\n",
+			test->test_num, test->test_name,
+			test->error_cnt ? "FAIL" : "OK");
 	}
+	stdio_restore();
 	printf("Summary: %d/%d PASSED, %d FAILED\n",
 	       env.succ_cnt, env.sub_succ_cnt, env.fail_cnt);
 
-	free(env.log_buf);
 	free(env.test_selector.num_set);
 	free(env.subtest_selector.num_set);
 
diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
index afd14962456f..541f9eab5eed 100644
--- a/tools/testing/selftests/bpf/test_progs.h
+++ b/tools/testing/selftests/bpf/test_progs.h
@@ -56,9 +56,10 @@ struct test_env {
 	bool jit_enabled;
 
 	struct prog_test_def *test;
+	FILE *stdout;
+	FILE *stderr;
 	char *log_buf;
 	size_t log_cnt;
-	size_t log_cap;
 
 	int succ_cnt; /* successful tests */
 	int sub_succ_cnt; /* successful sub-tests */
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH bpf-next v5 2/3] selftests/bpf: test_progs: test__printf -> printf
From: Stanislav Fomichev @ 2019-08-06 17:45 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev, Andrii Nakryiko
In-Reply-To: <20190806174529.8341-1-sdf@google.com>

Now that test__printf is a simple wraper around printf, let's drop it
(and test__vprintf as well).

Cc: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/bpf_verif_scale.c   |  4 ++--
 .../testing/selftests/bpf/prog_tests/l4lb_all.c  |  2 +-
 .../testing/selftests/bpf/prog_tests/map_lock.c  | 10 +++++-----
 .../selftests/bpf/prog_tests/send_signal.c       |  4 ++--
 .../testing/selftests/bpf/prog_tests/spinlock.c  |  2 +-
 .../bpf/prog_tests/stacktrace_build_id.c         |  4 ++--
 .../bpf/prog_tests/stacktrace_build_id_nmi.c     |  4 ++--
 .../selftests/bpf/prog_tests/xdp_noinline.c      |  4 ++--
 tools/testing/selftests/bpf/test_progs.c         | 16 +---------------
 tools/testing/selftests/bpf/test_progs.h         | 10 ++++------
 10 files changed, 22 insertions(+), 38 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
index b4be96162ff4..3548ba2f24a8 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_verif_scale.c
@@ -5,13 +5,13 @@ static int libbpf_debug_print(enum libbpf_print_level level,
 			      const char *format, va_list args)
 {
 	if (level != LIBBPF_DEBUG) {
-		test__vprintf(format, args);
+		vprintf(format, args);
 		return 0;
 	}
 
 	if (!strstr(format, "verifier log"))
 		return 0;
-	test__vprintf("%s", args);
+	vprintf("%s", args);
 	return 0;
 }
 
diff --git a/tools/testing/selftests/bpf/prog_tests/l4lb_all.c b/tools/testing/selftests/bpf/prog_tests/l4lb_all.c
index 5ce572c03a5f..20ddca830e68 100644
--- a/tools/testing/selftests/bpf/prog_tests/l4lb_all.c
+++ b/tools/testing/selftests/bpf/prog_tests/l4lb_all.c
@@ -74,7 +74,7 @@ static void test_l4lb(const char *file)
 	}
 	if (bytes != MAGIC_BYTES * NUM_ITER * 2 || pkts != NUM_ITER * 2) {
 		error_cnt++;
-		test__printf("test_l4lb:FAIL:stats %lld %lld\n", bytes, pkts);
+		printf("test_l4lb:FAIL:stats %lld %lld\n", bytes, pkts);
 	}
 out:
 	bpf_object__close(obj);
diff --git a/tools/testing/selftests/bpf/prog_tests/map_lock.c b/tools/testing/selftests/bpf/prog_tests/map_lock.c
index 2e78217ed3fd..ee99368c595c 100644
--- a/tools/testing/selftests/bpf/prog_tests/map_lock.c
+++ b/tools/testing/selftests/bpf/prog_tests/map_lock.c
@@ -9,12 +9,12 @@ static void *parallel_map_access(void *arg)
 	for (i = 0; i < 10000; i++) {
 		err = bpf_map_lookup_elem_flags(map_fd, &key, vars, BPF_F_LOCK);
 		if (err) {
-			test__printf("lookup failed\n");
+			printf("lookup failed\n");
 			error_cnt++;
 			goto out;
 		}
 		if (vars[0] != 0) {
-			test__printf("lookup #%d var[0]=%d\n", i, vars[0]);
+			printf("lookup #%d var[0]=%d\n", i, vars[0]);
 			error_cnt++;
 			goto out;
 		}
@@ -22,8 +22,8 @@ static void *parallel_map_access(void *arg)
 		for (j = 2; j < 17; j++) {
 			if (vars[j] == rnd)
 				continue;
-			test__printf("lookup #%d var[1]=%d var[%d]=%d\n",
-				     i, rnd, j, vars[j]);
+			printf("lookup #%d var[1]=%d var[%d]=%d\n",
+			       i, rnd, j, vars[j]);
 			error_cnt++;
 			goto out;
 		}
@@ -43,7 +43,7 @@ void test_map_lock(void)
 
 	err = bpf_prog_load(file, BPF_PROG_TYPE_CGROUP_SKB, &obj, &prog_fd);
 	if (err) {
-		test__printf("test_map_lock:bpf_prog_load errno %d\n", errno);
+		printf("test_map_lock:bpf_prog_load errno %d\n", errno);
 		goto close_prog;
 	}
 	map_fd[0] = bpf_find_map(__func__, obj, "hash_map");
diff --git a/tools/testing/selftests/bpf/prog_tests/send_signal.c b/tools/testing/selftests/bpf/prog_tests/send_signal.c
index 461b423d0584..1575f0a1f586 100644
--- a/tools/testing/selftests/bpf/prog_tests/send_signal.c
+++ b/tools/testing/selftests/bpf/prog_tests/send_signal.c
@@ -202,8 +202,8 @@ static int test_send_signal_nmi(void)
 			 -1 /* cpu */, -1 /* group_fd */, 0 /* flags */);
 	if (pmu_fd == -1) {
 		if (errno == ENOENT) {
-			test__printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n",
-				     __func__);
+			printf("%s:SKIP:no PERF_COUNT_HW_CPU_CYCLES\n",
+			       __func__);
 			return 0;
 		}
 		/* Let the test fail with a more informative message */
diff --git a/tools/testing/selftests/bpf/prog_tests/spinlock.c b/tools/testing/selftests/bpf/prog_tests/spinlock.c
index deb2db5b85b0..114ebe6a438e 100644
--- a/tools/testing/selftests/bpf/prog_tests/spinlock.c
+++ b/tools/testing/selftests/bpf/prog_tests/spinlock.c
@@ -12,7 +12,7 @@ void test_spinlock(void)
 
 	err = bpf_prog_load(file, BPF_PROG_TYPE_CGROUP_SKB, &obj, &prog_fd);
 	if (err) {
-		test__printf("test_spin_lock:bpf_prog_load errno %d\n", errno);
+		printf("test_spin_lock:bpf_prog_load errno %d\n", errno);
 		goto close_prog;
 	}
 	for (i = 0; i < 4; i++)
diff --git a/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id.c b/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id.c
index 356d2c017a9c..ac44fda84833 100644
--- a/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id.c
+++ b/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id.c
@@ -109,8 +109,8 @@ void test_stacktrace_build_id(void)
 	if (build_id_matches < 1 && retry--) {
 		bpf_link__destroy(link);
 		bpf_object__close(obj);
-		test__printf("%s:WARN:Didn't find expected build ID from the map, retrying\n",
-			     __func__);
+		printf("%s:WARN:Didn't find expected build ID from the map, retrying\n",
+		       __func__);
 		goto retry;
 	}
 
diff --git a/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id_nmi.c b/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id_nmi.c
index f44f2c159714..9557b7dfb782 100644
--- a/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id_nmi.c
+++ b/tools/testing/selftests/bpf/prog_tests/stacktrace_build_id_nmi.c
@@ -140,8 +140,8 @@ void test_stacktrace_build_id_nmi(void)
 	if (build_id_matches < 1 && retry--) {
 		bpf_link__destroy(link);
 		bpf_object__close(obj);
-		test__printf("%s:WARN:Didn't find expected build ID from the map, retrying\n",
-			     __func__);
+		printf("%s:WARN:Didn't find expected build ID from the map, retrying\n",
+		       __func__);
 		goto retry;
 	}
 
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_noinline.c b/tools/testing/selftests/bpf/prog_tests/xdp_noinline.c
index b5404494b8aa..15f7c272edb0 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_noinline.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_noinline.c
@@ -75,8 +75,8 @@ void test_xdp_noinline(void)
 	}
 	if (bytes != MAGIC_BYTES * NUM_ITER * 2 || pkts != NUM_ITER * 2) {
 		error_cnt++;
-		test__printf("test_xdp_noinline:FAIL:stats %lld %lld\n",
-			     bytes, pkts);
+		printf("test_xdp_noinline:FAIL:stats %lld %lld\n",
+		       bytes, pkts);
 	}
 out:
 	bpf_object__close(obj);
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 6ea289ba307b..6c11c796ea1f 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -105,20 +105,6 @@ void test__force_log() {
 	env.test->force_log = true;
 }
 
-void test__vprintf(const char *fmt, va_list args)
-{
-	vprintf(fmt, args);
-}
-
-void test__printf(const char *fmt, ...)
-{
-	va_list args;
-
-	va_start(args, fmt);
-	test__vprintf(fmt, args);
-	va_end(args);
-}
-
 struct ipv4_packet pkt_v4 = {
 	.eth.h_proto = __bpf_constant_htons(ETH_P_IP),
 	.iph.ihl = 5,
@@ -310,7 +296,7 @@ static int libbpf_print_fn(enum libbpf_print_level level,
 {
 	if (!env.very_verbose && level == LIBBPF_DEBUG)
 		return 0;
-	test__vprintf(format, args);
+	vprintf(format, args);
 	return 0;
 }
 
diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
index 541f9eab5eed..37d427f5a1e5 100644
--- a/tools/testing/selftests/bpf/test_progs.h
+++ b/tools/testing/selftests/bpf/test_progs.h
@@ -70,8 +70,6 @@ extern int error_cnt;
 extern int pass_cnt;
 extern struct test_env env;
 
-extern void test__printf(const char *fmt, ...);
-extern void test__vprintf(const char *fmt, va_list args);
 extern void test__force_log();
 extern bool test__start_subtest(const char *name);
 
@@ -97,12 +95,12 @@ extern struct ipv6_packet pkt_v6;
 	int __ret = !!(condition);					\
 	if (__ret) {							\
 		error_cnt++;						\
-		test__printf("%s:FAIL:%s ", __func__, tag);		\
-		test__printf(format);					\
+		printf("%s:FAIL:%s ", __func__, tag);			\
+		printf(format);						\
 	} else {							\
 		pass_cnt++;						\
-		test__printf("%s:PASS:%s %d nsec\n",			\
-			      __func__, tag, duration);			\
+		printf("%s:PASS:%s %d nsec\n",				\
+		       __func__, tag, duration);			\
 	}								\
 	__ret;								\
 })
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* [PATCH bpf-next v5 3/3] selftests/bpf: test_progs: drop extra trailing tab
From: Stanislav Fomichev @ 2019-08-06 17:45 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, Stanislav Fomichev, Andrii Nakryiko
In-Reply-To: <20190806174529.8341-1-sdf@google.com>

Small (un)related cleanup.

Cc: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/test_progs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 6c11c796ea1f..12895d03d58b 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -278,7 +278,7 @@ enum ARG_KEYS {
 	ARG_VERIFIER_STATS = 's',
 	ARG_VERBOSE = 'v',
 };
-	
+
 static const struct argp_option opts[] = {
 	{ "num", ARG_TEST_NUM, "NUM", 0,
 	  "Run test number NUM only " },
-- 
2.22.0.770.g0f2c4a37fd-goog


^ permalink raw reply related

* Re: [PATCH bpf-next v4 1/3] selftests/bpf: test_progs: switch to open_memstream
From: Andrii Nakryiko @ 2019-08-06 17:49 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, Networking, bpf, David S. Miller,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
In-Reply-To: <20190806174028.GB23939@mini-arch>

On Tue, Aug 6, 2019 at 10:40 AM Stanislav Fomichev <sdf@fomichev.me> wrote:
>
> On 08/06, Andrii Nakryiko wrote:
> > On Tue, Aug 6, 2019 at 10:19 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > Use open_memstream to override stdout during test execution.
> > > The copy of the original stdout is held in env.stdout and used
> > > to print subtest info and dump failed log.
> > >
> > > test_{v,}printf are now simple wrappers around stdout and will be
> > > removed in the next patch.
> > >
> > > v4:
> > > * one field per line for stdout/stderr (Andrii Nakryiko)
> > >
> > > v3:
> > > * don't do strlen over log_buf, log_cnt has it already (Andrii Nakryiko)
> > >
> > > v2:
> > > * add ifdef __GLIBC__ around open_memstream (maybe pointless since
> > >   we already depend on glibc for argp_parse)
> > > * hijack stderr as well (Andrii Nakryiko)
> > > * don't hijack for every test, do it once (Andrii Nakryiko)
> > > * log_cap -> log_size (Andrii Nakryiko)
> > > * do fseeko in a proper place (Andrii Nakryiko)
> > > * check open_memstream returned value (Andrii Nakryiko)
> > >
> > > Cc: Andrii Nakryiko <andriin@fb.com>
> > > Acked-by: Andrii Nakryiko <andriin@fb.com>
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > >  tools/testing/selftests/bpf/test_progs.c | 115 ++++++++++++-----------
> > >  tools/testing/selftests/bpf/test_progs.h |   3 +-
> > >  2 files changed, 62 insertions(+), 56 deletions(-)
> > >

[...]

> > >  void test__printf(const char *fmt, ...)
> > > @@ -477,6 +438,48 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
> > >         return 0;
> > >  }
> > >
> > > +static void stdio_hijack(void)
> > > +{
> > > +#ifdef __GLIBC__
> > > +       if (env.verbose || (env.test && env.test->force_log)) {
> >
> > I just also realized that you don't need `(env.test &&
> > env.test->force_log)` test. We hijack stdout/stderr before env.test is
> > even set, so this does nothing anyways. Plus, force_log can be set in
> > the middle of test/sub-test, yet we hijack stdout just once (or even
> > if per-test), so it's still going to be "racy". Let's buffer output
> > (unless it's env.verbose, which is important to not buffer because
> > some tests will have huge output, when failing, so this allows to
> > bypass using tons of memory for those, when debugging) and dump at the
> > end.
> Makes sense, will drop this test and resubmit along with a fix for '-v'
> that Alexei discovered. Thanks!

Oh, is it because env.stdout and env.stderr is not set in verbose
mode? I'd always set env.stdout/env.stderr at the beginning, next to
env.jit_enabled, to never care about "logging modes".

>
> > > +               /* nothing to do, output to stdout by default */
> > > +               return;
> > > +       }
> > > +

[...]

^ permalink raw reply

* Re: [PATCH net v2] net: dsa: Check existence of .port_mdb_add callback before calling it
From: Chen-Yu Tsai @ 2019-08-06 17:49 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: Chen-Yu Tsai, Andrew Lunn, Florian Fainelli, David S. Miller,
	netdev, linux-kernel
In-Reply-To: <20190806131513.GB2822@t480s.localdomain>

On Wed, Aug 7, 2019 at 1:15 AM Vivien Didelot <vivien.didelot@gmail.com> wrote:
>
> Hi Chen-Yu,
>
> On Tue,  6 Aug 2019 15:53:25 +0800, Chen-Yu Tsai <wens@kernel.org> wrote:
> > From: Chen-Yu Tsai <wens@csie.org>
> >
> > With the recent addition of commit 75dad2520fc3 ("net: dsa: b53: Disable
> > all ports on setup"), users of b53 (BCM53125 on Lamobo R1 in my case)
> > are forced to use the dsa subsystem to enable the switch, instead of
> > having it in the default transparent "forward-to-all" mode.
> >
> > The b53 driver does not support mdb bitmap functions. However the dsa
> > layer does not check for the existence of the .port_mdb_add callback
> > before actually using it. This results in a NULL pointer dereference,
> > as shown in the kernel oops below.
> >
> > The other functions seem to be properly guarded. Do the same for
> > .port_mdb_add in dsa_switch_mdb_add_bitmap() as well.
> >
> > b53 is not the only driver that doesn't support mdb bitmap functions.
> > Others include bcm_sf2, dsa_loop, lantiq_gswip, mt7530, mv88e6060,
> > qca8k, realtek-smi, and vitesse-vsc73xx.
>
> I don't know what you mean by that, there's no "mdb bitmap function"
> support for drivers, only the port_mdb_{prepare,add,del} callbacks...

The term was coined from commit e6db98db8a95 ("net: dsa: add switch mdb
bitmap functions"). But yeah, .port_mdb_* ops/callbacks would be more
appropriate.

> >     8<--- cut here ---
> >     Unable to handle kernel NULL pointer dereference at virtual address 00000000
> >     pgd = (ptrval)
> >     [00000000] *pgd=00000000
> >     Internal error: Oops: 80000005 [#1] SMP ARM
> >     Modules linked in: rtl8xxxu rtl8192cu rtl_usb rtl8192c_common rtlwifi mac80211 cfg80211
> >     CPU: 1 PID: 134 Comm: kworker/1:2 Not tainted 5.3.0-rc1-00247-gd3519030752a #1
> >     Hardware name: Allwinner sun7i (A20) Family
> >     Workqueue: events switchdev_deferred_process_work
> >     PC is at 0x0
> >     LR is at dsa_switch_event+0x570/0x620
> >     pc : [<00000000>]    lr : [<c08533ec>]    psr: 80070013
> >     sp : ee871db8  ip : 00000000  fp : ee98d0a4
> >     r10: 0000000c  r9 : 00000008  r8 : ee89f710
> >     r7 : ee98d040  r6 : ee98d088  r5 : c0f04c48  r4 : ee98d04c
> >     r3 : 00000000  r2 : ee89f710  r1 : 00000008  r0 : ee98d040
> >     Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> >     Control: 10c5387d  Table: 6deb406a  DAC: 00000051
> >     Process kworker/1:2 (pid: 134, stack limit = 0x(ptrval))
> >     Stack: (0xee871db8 to 0xee872000)
> >     1da0:                                                       ee871e14 103ace2d
> >     1dc0: 00000000 ffffffff 00000000 ee871e14 00000005 00000000 c08524a0 00000000
> >     1de0: ffffe000 c014bdfc c0f04c48 ee871e98 c0f04c48 ee9e5000 c0851120 c014bef0
> >     1e00: 00000000 b643aea2 ee9b4068 c08509a8 ee2bf940 ee89f710 ee871ecb 00000000
> >     1e20: 00000008 103ace2d 00000000 c087e248 ee29c868 103ace2d 00000001 ffffffff
> >     1e40: 00000000 ee871e98 00000006 00000000 c0fb2a50 c087e2d0 ffffffff c08523c4
> >     1e60: ffffffff c014bdfc 00000006 c0fad2d0 ee871e98 ee89f710 00000000 c014c500
> >     1e80: 00000000 ee89f3c0 c0f04c48 00000000 ee9e5000 c087dfb4 ee9e5000 00000000
> >     1ea0: ee89f710 ee871ecb 00000001 103ace2d 00000000 c0f04c48 00000000 c087e0a8
> >     1ec0: 00000000 efd9a3e0 0089f3c0 103ace2d ee89f700 ee89f710 ee9e5000 00000122
> >     1ee0: 00000100 c087e130 ee89f700 c0fad2c8 c1003ef0 c087de4c 2e928000 c0fad2ec
> >     1f00: c0fad2ec ee839580 ef7a62c0 ef7a9400 00000000 c087def8 c0fad2ec c01447dc
> >     1f20: ef315640 ef7a62c0 00000008 ee839580 ee839594 ef7a62c0 00000008 c0f03d00
> >     1f40: ef7a62d8 ef7a62c0 ffffe000 c0145b84 ffffe000 c0fb2420 c0bfaa8c 00000000
> >     1f60: ffffe000 ee84b600 ee84b5c0 00000000 ee870000 ee839580 c0145b40 ef0e5ea4
> >     1f80: ee84b61c c014a6f8 00000001 ee84b5c0 c014a5b0 00000000 00000000 00000000
> >     1fa0: 00000000 00000000 00000000 c01010e8 00000000 00000000 00000000 00000000
> >     1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> >     1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
> >     [<c08533ec>] (dsa_switch_event) from [<c014bdfc>] (notifier_call_chain+0x48/0x84)
> >     [<c014bdfc>] (notifier_call_chain) from [<c014bef0>] (raw_notifier_call_chain+0x18/0x20)
> >     [<c014bef0>] (raw_notifier_call_chain) from [<c08509a8>] (dsa_port_mdb_add+0x48/0x74)
> >     [<c08509a8>] (dsa_port_mdb_add) from [<c087e248>] (__switchdev_handle_port_obj_add+0x54/0xd4)
> >     [<c087e248>] (__switchdev_handle_port_obj_add) from [<c087e2d0>] (switchdev_handle_port_obj_add+0x8/0x14)
> >     [<c087e2d0>] (switchdev_handle_port_obj_add) from [<c08523c4>] (dsa_slave_switchdev_blocking_event+0x94/0xa4)
> >     [<c08523c4>] (dsa_slave_switchdev_blocking_event) from [<c014bdfc>] (notifier_call_chain+0x48/0x84)
> >     [<c014bdfc>] (notifier_call_chain) from [<c014c500>] (blocking_notifier_call_chain+0x50/0x68)
> >     [<c014c500>] (blocking_notifier_call_chain) from [<c087dfb4>] (switchdev_port_obj_notify+0x44/0xa8)
> >     [<c087dfb4>] (switchdev_port_obj_notify) from [<c087e0a8>] (switchdev_port_obj_add_now+0x90/0x104)
> >     [<c087e0a8>] (switchdev_port_obj_add_now) from [<c087e130>] (switchdev_port_obj_add_deferred+0x14/0x5c)
> >     [<c087e130>] (switchdev_port_obj_add_deferred) from [<c087de4c>] (switchdev_deferred_process+0x64/0x104)
> >     [<c087de4c>] (switchdev_deferred_process) from [<c087def8>] (switchdev_deferred_process_work+0xc/0x14)
> >     [<c087def8>] (switchdev_deferred_process_work) from [<c01447dc>] (process_one_work+0x218/0x50c)
> >     [<c01447dc>] (process_one_work) from [<c0145b84>] (worker_thread+0x44/0x5bc)
> >     [<c0145b84>] (worker_thread) from [<c014a6f8>] (kthread+0x148/0x150)
> >     [<c014a6f8>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
> >     Exception stack(0xee871fb0 to 0xee871ff8)
> >     1fa0:                                     00000000 00000000 00000000 00000000
> >     1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> >     1fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> >     Code: bad PC value
> >     ---[ end trace 1292c61abd17b130 ]---
> >
> >     [<c08533ec>] (dsa_switch_event) from [<c014bdfc>] (notifier_call_chain+0x48/0x84)
> >     corresponds to
> >
> >       $ arm-linux-gnueabihf-addr2line -C -i -e vmlinux c08533ec
> >
> >       linux/net/dsa/switch.c:156
> >       linux/net/dsa/switch.c:178
> >       linux/net/dsa/switch.c:328
> >
> > Fixes: e6db98db8a95 ("net: dsa: add switch mdb bitmap functions")
> > Signed-off-by: Chen-Yu Tsai <wens@csie.org>
> > ---
> > Changes since v1:
> >
> >   - Moved the check to the beginning of dsa_switch_mdb_add()
> >
> > Looks like we could also move the ops check out of
> > dsa_switch_mdb_prepare_bitmap(), though I suppose keeping the code the
> > way it is now is clearer.
> >
> > ---
> >  net/dsa/switch.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/net/dsa/switch.c b/net/dsa/switch.c
> > index 4ec5b7f85d51..231af5268656 100644
> > --- a/net/dsa/switch.c
> > +++ b/net/dsa/switch.c
> > @@ -164,6 +164,9 @@ static int dsa_switch_mdb_add(struct dsa_switch *ds,
> >       struct switchdev_trans *trans = info->trans;
> >       int port;
> >
> > +     if (!ds->ops->port_mdb_add)
> > +             return -EOPNOTSUPP;
> > +
> >       /* Build a mask of Multicast group members */
> >       bitmap_zero(ds->bitmap, ds->num_ports);
> >       if (ds->index == info->sw_index)
> > --
> > 2.20.1
> >
>
> I don't understand the crash here, nor the fix. dsa_switch_mdb_add()
> is supposed to be called through switchdev with a prepare phase,
> which checks for ds->ops->port_mdb_add. Do you mean that a switchdev
> MDB object is added somewhere without a prepare phase? If that's
> the case, this is what the commit message must say. Then the

I had pretty much zero understanding of how switchdev and dsa work.
The symptom is a NULL pointer reference, resulting from an unsupported
callback that was not checked before being called, as described above.
And that is what I mean. A NULL pointer reference happened when it
should not have.

Based on what you just mentioned, yes it does look like an object was
added without a prepare phase. Randomly looking through the net/dsa
code, it seems only dsa_port_vid_add() does a prepare phase, judging
by .ph_prepare being set. dsa_port_{vlan,mbd,fdb}_add directly call
the add phase, without the prepare phase. So I'm guessing "supposed
to be called with a prepare phase" is not quite accurate. This also
exceeds the scope of the simple fix I had in mind.

> ds->ops->port_mdb_add check must go where it is used, that is to say
> at the beginning of dsa_switch_mdb_add_bitmap() (similarly to what
> dsa_switch_mdb_prepare_bitmap() does), not in dsa_switch_mdb_add.

Andrew asked me to move it to where it is now. Please take a look at
v1 [2] if it's what you would like.

I'm ok either way.

ChenYu

[1] https://lkml.org/lkml/2019/7/25/706
[2] https://lkml.org/lkml/2019/7/23/1129

^ permalink raw reply

* [sashal@kernel.org: Re: Back-porting request]
From: Stephen Suryaputra @ 2019-08-06 17:51 UTC (permalink / raw)
  To: netdev

Hi,

How do I request back-porting of commit 5b18f1289808 ("ipv4: reset
rt_iif for recirculated mcast/bcast out pkts") quoted below? I think for
4.19, the following diff is needed in addition to the commit:

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index cd84f7f68032..120ef1f284fa 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1652,11 +1652,7 @@ struct rtable *rt_dst_clone(struct net_device *dev, struct rtable *rt)
 		new_rt->rt_iif = rt->rt_iif;
 		new_rt->rt_pmtu = rt->rt_pmtu;
 		new_rt->rt_mtu_locked = rt->rt_mtu_locked;
-		new_rt->rt_gw_family = rt->rt_gw_family;
-		if (rt->rt_gw_family == AF_INET)
-			new_rt->rt_gw4 = rt->rt_gw4;
-		else if (rt->rt_gw_family == AF_INET6)
-			new_rt->rt_gw6 = rt->rt_gw6;
+		new_rt->rt_gateway = rt->rt_gateway;
 		INIT_LIST_HEAD(&new_rt->rt_uncached);
 
 		new_rt->dst.flags |= DST_HOST;

Thanks.

----- Forwarded message from Sasha Levin <sashal@kernel.org> -----

On Mon, Jul 29, 2019 at 05:56:27PM +0200, Greg KH wrote:
> On Mon, Jul 29, 2019 at 11:42:34AM -0400, Stephen Suryaputra wrote:
> > Hello,
> > 
> > I'm requesting this commit to be back-ported to v4.14:
> > ---
> > commit 5b18f1289808fee5d04a7e6ecf200189f41a4db6
> > Author: Stephen Suryaputra <ssuryaextr@gmail.com>
> > Date:   Wed Jun 26 02:21:16 2019 -0400
> > 
> >     ipv4: reset rt_iif for recirculated mcast/bcast out pkts
> > 
> >     Multicast or broadcast egress packets have rt_iif set to the oif. These
> >     packets might be recirculated back as input and lookup to the raw
> >     sockets may fail because they are bound to the incoming interface
> >     (skb_iif). If rt_iif is not zero, during the lookup, inet_iif() function
> >     returns rt_iif instead of skb_iif. Hence, the lookup fails.
> > 
> >     v2: Make it non vrf specific (David Ahern). Reword the changelog to
> >         reflect it.
> >     Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
> >     Reviewed-by: David Ahern <dsahern@gmail.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > ---
> > 

[deleted]
> 
> For networking patches to be applied to the stable kernel tree(s),
> please read:
>    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.
> 
> There is a section for how to do this for networking patches as they are
> accepted a bit differently from other patches.

To clarify a bit more here: technically you're asking for a patch to be
included in 4.14, which isn't one of the "last two stable releases", so
that document will tell you to send that patch directly to us.

However, this patch isn't in 4.19 either, which is still Dave's domain,
and we can't take it in 4.14 if it's not in 4.19 (we don't want to
introduce regressions for people who are upgrading their kernels), so
this will still need to go through Dave.

--
Thanks,
Sasha

----- End forwarded message -----

^ permalink raw reply related

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-08-06 17:53 UTC (permalink / raw)
  To: David Ahern; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <7200bdbb-2a02-92c6-0251-1c59b159dde7@gmail.com>

Tue, Aug 06, 2019 at 07:34:59PM CEST, dsahern@gmail.com wrote:
>On 8/5/19 9:20 AM, Jiri Pirko wrote:
>> Mon, Aug 05, 2019 at 04:51:22PM CEST, dsahern@gmail.com wrote:
>>> On 8/5/19 8:49 AM, Jiri Pirko wrote:
>>>>> Your commit 5fc494225c1eb81309cc4c91f183cd30e4edb674 changed that from a
>>>>> per-namepace accounting to all namespaces managed by a single devlink
>>>>> instance in init_net - which is completely wrong.
>>>> No. Not "all namespaces". Only the one where the devlink is. And that is
>>>> always init_net, until this patchset.
>>>>
>>>>
>>>
>>> Jiri: your change to fib.c does not take into account namespace when
>>> doing rules and routes accounting. you broke it. fix it.
>> 
>> What do you mean by "account namespace"? It's a device resource, why to
>> tight it with namespace? What if you have 2 netdevsim-devlink instances
>> in one namespace? Why the setting should be per-namespace?
>> 
>
>Jiri:
>
>Here's an example of how your 5.2 change to netdevsim broke the resource
>controller:
>
>Create a netdevsim device:
>$ modprobe netdevsim
>$  echo "0 1" > /sys/bus/netdevsim/new_device
>
>Get the current number of IPv4 routes:
>$ n=$(ip -4 ro ls table all | wc -l)
>$ echo $n
>13
>
>Prevent any more from being added. This limit should apply solely to
>this namespace, init_net:
>
>$ devlink resource set netdevsim/netdevsim0 path /IPv4/fib size $n
>$ devlink dev reload netdevsim/netdevsim0
>Error: netdevsim: New size is less than current occupancy.
>devlink answers: Invalid argument
>
>So that is the first breakage: accounting is off - maybe. Note there are
>no other visible namespaces, but who knows what systemd or other
>processes are doing with namespaces. Perhaps this accounting is another
>example of your changes not properly handling namespaces:
>
>$ devlink resource show netdevsim/netdevsim0
>netdevsim/netdevsim0:
>  name IPv4 size unlimited unit entry size_min 0 size_max unlimited
>size_gran 1 dpipe_tables none
>    resources:
>      name fib size 13 occ 17 unit entry size_min 0 size_max unlimited
>size_gran 1 dpipe_tables none
>      name fib-rules size unlimited occ 6 unit entry size_min 0 size_max
>unlimited size_gran 1 dpipe_tables none
>  name IPv6 size unlimited unit entry size_min 0 size_max unlimited
>size_gran 1 dpipe_tables none
>    resources:
>      name fib size unlimited occ 10 unit entry size_min 0 size_max
>unlimited size_gran 1 dpipe_tables none
>      name fib-rules size unlimited occ 4 unit entry size_min 0 size_max
>unlimited size_gran 1 dpipe_tables none
>
>So the occupancy does not match the tables for init_net.
>
>Reset the max to 17, the current occupancy:
>$ devlink resource set netdevsim/netdevsim0 path /IPv4/fib size 17
>$ devlink dev reload netdevsim/netdevsim0
>$ devlink resource show netdevsim/netdevsim0
>netdevsim/netdevsim0:
>  name IPv4 size unlimited unit entry size_min 0 size_max unlimited
>size_gran 1 dpipe_tables none
>    resources:
>      name fib size 17 occ 17 unit entry size_min 0 size_max unlimited
>size_gran 1 dpipe_tables none
>      name fib-rules size unlimited occ 6 unit entry size_min 0 size_max
>unlimited size_gran 1 dpipe_tables none
>  name IPv6 size unlimited unit entry size_min 0 size_max unlimited
>size_gran 1 dpipe_tables none
>    resources:
>      name fib size unlimited occ 10 unit entry size_min 0 size_max
>unlimited size_gran 1 dpipe_tables none
>      name fib-rules size unlimited occ 4 unit entry size_min 0 size_max
>unlimited size_gran 1 dpipe_tables none
>
>Create a new namespace, bring up lo which attempts to add more route
>entries:
>$ unshare -n
>$ ip li set lo up
>
>If you list routes you see the lo routes failed to installed because of
>the limits, but it is a silent failure. Try to add a new route and you
>see the cross namespace accounting now:
>$ ip ro add 192.168.1.0/24 dev lo
>Error: netdevsim: Exceeded number of supported fib entries.
>
>
>Contrast that behavior with 5.1 and you see the new namespaces have no
>bearing on accounting in init_net and limits in init_net do not affect
>other namespaces.
>
>That behavior needs to be restored in 5.2 and 5.3.

Let's figure out the devlink-controlling-kernel-resources thread first.
What you describe here is exactly that.

^ permalink raw reply

* Re: [PATCH bpf-next v4 1/3] selftests/bpf: test_progs: switch to open_memstream
From: Stanislav Fomichev @ 2019-08-06 17:54 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Stanislav Fomichev, Networking, bpf, David S. Miller,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
In-Reply-To: <CAEf4Bzbt_6Y3bEpsPiHi59KnWoHsk9gQa3XpfRo+gG7-rKqN4w@mail.gmail.com>

On 08/06, Andrii Nakryiko wrote:
> On Tue, Aug 6, 2019 at 10:40 AM Stanislav Fomichev <sdf@fomichev.me> wrote:
> >
> > On 08/06, Andrii Nakryiko wrote:
> > > On Tue, Aug 6, 2019 at 10:19 AM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > Use open_memstream to override stdout during test execution.
> > > > The copy of the original stdout is held in env.stdout and used
> > > > to print subtest info and dump failed log.
> > > >
> > > > test_{v,}printf are now simple wrappers around stdout and will be
> > > > removed in the next patch.
> > > >
> > > > v4:
> > > > * one field per line for stdout/stderr (Andrii Nakryiko)
> > > >
> > > > v3:
> > > > * don't do strlen over log_buf, log_cnt has it already (Andrii Nakryiko)
> > > >
> > > > v2:
> > > > * add ifdef __GLIBC__ around open_memstream (maybe pointless since
> > > >   we already depend on glibc for argp_parse)
> > > > * hijack stderr as well (Andrii Nakryiko)
> > > > * don't hijack for every test, do it once (Andrii Nakryiko)
> > > > * log_cap -> log_size (Andrii Nakryiko)
> > > > * do fseeko in a proper place (Andrii Nakryiko)
> > > > * check open_memstream returned value (Andrii Nakryiko)
> > > >
> > > > Cc: Andrii Nakryiko <andriin@fb.com>
> > > > Acked-by: Andrii Nakryiko <andriin@fb.com>
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > ---
> > > >  tools/testing/selftests/bpf/test_progs.c | 115 ++++++++++++-----------
> > > >  tools/testing/selftests/bpf/test_progs.h |   3 +-
> > > >  2 files changed, 62 insertions(+), 56 deletions(-)
> > > >
> 
> [...]
> 
> > > >  void test__printf(const char *fmt, ...)
> > > > @@ -477,6 +438,48 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
> > > >         return 0;
> > > >  }
> > > >
> > > > +static void stdio_hijack(void)
> > > > +{
> > > > +#ifdef __GLIBC__
> > > > +       if (env.verbose || (env.test && env.test->force_log)) {
> > >
> > > I just also realized that you don't need `(env.test &&
> > > env.test->force_log)` test. We hijack stdout/stderr before env.test is
> > > even set, so this does nothing anyways. Plus, force_log can be set in
> > > the middle of test/sub-test, yet we hijack stdout just once (or even
> > > if per-test), so it's still going to be "racy". Let's buffer output
> > > (unless it's env.verbose, which is important to not buffer because
> > > some tests will have huge output, when failing, so this allows to
> > > bypass using tons of memory for those, when debugging) and dump at the
> > > end.
> > Makes sense, will drop this test and resubmit along with a fix for '-v'
> > that Alexei discovered. Thanks!
> 
> Oh, is it because env.stdout and env.stderr is not set in verbose
> mode? I'd always set env.stdout/env.stderr at the beginning, next to
> env.jit_enabled, to never care about "logging modes".
Yeah, I moved it to the beginning of stdio_hijack() to mirror the 'undo'
in stdio_restore(). Let me know if you feel strongly about it and want
it to be near env.jit_enabled instead.

> > > > +               /* nothing to do, output to stdout by default */
> > > > +               return;
> > > > +       }
> > > > +
> 
> [...]

^ permalink raw reply

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: David Ahern @ 2019-08-06 17:55 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <20190806175323.GB2332@nanopsycho.orion>

On 8/6/19 11:53 AM, Jiri Pirko wrote:
> Let's figure out the devlink-controlling-kernel-resources thread first.
> What you describe here is exactly that.

as I mentioned on the phone, any outcome of that thread will be in 5.4
at best. Meanwhile this breakage in 5.2 and 5.3 needs to be fixed.

^ permalink raw reply

* Re: [PATCH bpf-next v4 1/3] selftests/bpf: test_progs: switch to open_memstream
From: Andrii Nakryiko @ 2019-08-06 17:55 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, Networking, bpf, David S. Miller,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
In-Reply-To: <20190806175411.GC23939@mini-arch>

On Tue, Aug 6, 2019 at 10:54 AM Stanislav Fomichev <sdf@fomichev.me> wrote:
>
> On 08/06, Andrii Nakryiko wrote:
> > On Tue, Aug 6, 2019 at 10:40 AM Stanislav Fomichev <sdf@fomichev.me> wrote:
> > >
> > > On 08/06, Andrii Nakryiko wrote:
> > > > On Tue, Aug 6, 2019 at 10:19 AM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > > Use open_memstream to override stdout during test execution.
> > > > > The copy of the original stdout is held in env.stdout and used
> > > > > to print subtest info and dump failed log.
> > > > >
> > > > > test_{v,}printf are now simple wrappers around stdout and will be
> > > > > removed in the next patch.
> > > > >
> > > > > v4:
> > > > > * one field per line for stdout/stderr (Andrii Nakryiko)
> > > > >
> > > > > v3:
> > > > > * don't do strlen over log_buf, log_cnt has it already (Andrii Nakryiko)
> > > > >
> > > > > v2:
> > > > > * add ifdef __GLIBC__ around open_memstream (maybe pointless since
> > > > >   we already depend on glibc for argp_parse)
> > > > > * hijack stderr as well (Andrii Nakryiko)
> > > > > * don't hijack for every test, do it once (Andrii Nakryiko)
> > > > > * log_cap -> log_size (Andrii Nakryiko)
> > > > > * do fseeko in a proper place (Andrii Nakryiko)
> > > > > * check open_memstream returned value (Andrii Nakryiko)
> > > > >
> > > > > Cc: Andrii Nakryiko <andriin@fb.com>
> > > > > Acked-by: Andrii Nakryiko <andriin@fb.com>
> > > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > > ---
> > > > >  tools/testing/selftests/bpf/test_progs.c | 115 ++++++++++++-----------
> > > > >  tools/testing/selftests/bpf/test_progs.h |   3 +-
> > > > >  2 files changed, 62 insertions(+), 56 deletions(-)
> > > > >
> >
> > [...]
> >
> > > > >  void test__printf(const char *fmt, ...)
> > > > > @@ -477,6 +438,48 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
> > > > >         return 0;
> > > > >  }
> > > > >
> > > > > +static void stdio_hijack(void)
> > > > > +{
> > > > > +#ifdef __GLIBC__
> > > > > +       if (env.verbose || (env.test && env.test->force_log)) {
> > > >
> > > > I just also realized that you don't need `(env.test &&
> > > > env.test->force_log)` test. We hijack stdout/stderr before env.test is
> > > > even set, so this does nothing anyways. Plus, force_log can be set in
> > > > the middle of test/sub-test, yet we hijack stdout just once (or even
> > > > if per-test), so it's still going to be "racy". Let's buffer output
> > > > (unless it's env.verbose, which is important to not buffer because
> > > > some tests will have huge output, when failing, so this allows to
> > > > bypass using tons of memory for those, when debugging) and dump at the
> > > > end.
> > > Makes sense, will drop this test and resubmit along with a fix for '-v'
> > > that Alexei discovered. Thanks!
> >
> > Oh, is it because env.stdout and env.stderr is not set in verbose
> > mode? I'd always set env.stdout/env.stderr at the beginning, next to
> > env.jit_enabled, to never care about "logging modes".
> Yeah, I moved it to the beginning of stdio_hijack() to mirror the 'undo'
> in stdio_restore(). Let me know if you feel strongly about it and want
> it to be near env.jit_enabled instead.

I'm fine with it, no big deal, I wrote email before I saw your v5.

>
> > > > > +               /* nothing to do, output to stdout by default */
> > > > > +               return;
> > > > > +       }
> > > > > +
> >
> > [...]

^ permalink raw reply

* Re: [RFC] implicit per-namespace devlink instance to set kernel resource limitations
From: Andrew Lunn @ 2019-08-06 18:03 UTC (permalink / raw)
  To: David Ahern
  Cc: Jiri Pirko, netdev, davem, mlxsw, jakub.kicinski, f.fainelli,
	vivien.didelot, mkubecek, stephen, daniel, brouer, eric.dumazet
In-Reply-To: <c615dce5-9307-7640-2877-4e5c01e565c0@gmail.com>

On Tue, Aug 06, 2019 at 11:38:32AM -0600, David Ahern wrote:
> On 8/6/19 10:40 AM, Jiri Pirko wrote:
> > Hi all.
> > 
> > I just discussed this with DavidA and I would like to bring this to
> > broader audience. David wants to limit kernel resources in network
> > namespaces, for example fibs, fib rules, etc.
> > 
> > He claims that devlink api is rich enough to program this limitations
> > as it already does for mlxsw hw resources for example. If we have this
> > api for hardware, why don't to reuse it for the kernel and it's
> > resources too?
> 
> The analogy is that a kernel is 'programmed' just like hardware, it has
> resources just like hardware (e.g., memory) and those resources are
> limited as well. So the resources consumed by fib entries, rules,
> nexthops, etc should be controllable.

I expect one question that will come up is why not control
groups. That is often used by the rest of the kernel for resource
control.

But cgroups are mostly about limiting resources for a collection of
processes. I don't think that is true for networking resources. The
resources we are talking about are orthogonal to processes. Or are
there any resources which should be linked to processes? eBPF
resources?

> > So the proposal is to have some new device, say "kernelnet", that would
> > implicitly create per-namespace devlink instance.

Maybe kernelns, to make it clear we are talking about namespace
resources.

Going back to cgroups concept. They are generally hierarchical. Do we
need any sort of hierarchy here? Are there some resources we want to
set a global limit on, and then a per namespace limit on top of that?
We would then need two names, and kernelnet sounds more like the
global level?

       Andrew

^ permalink raw reply

* Re: [patch net-next v2 1/3] net: devlink: allow to change namespaces
From: Jiri Pirko @ 2019-08-06 18:07 UTC (permalink / raw)
  To: David Ahern; +Cc: Jakub Kicinski, netdev, davem, sthemmin, mlxsw
In-Reply-To: <fc6a7342-246c-2fe1-a7d1-c7be5bd0a3a3@gmail.com>

Tue, Aug 06, 2019 at 07:55:30PM CEST, dsahern@gmail.com wrote:
>On 8/6/19 11:53 AM, Jiri Pirko wrote:
>> Let's figure out the devlink-controlling-kernel-resources thread first.
>> What you describe here is exactly that.
>
>as I mentioned on the phone, any outcome of that thread will be in 5.4
>at best. Meanwhile this breakage in 5.2 and 5.3 needs to be fixed.

Why? netdevsim is a dummy device, the purpose of existence is to test
kernel api. No real harm breaking it.

^ permalink raw reply

* Re: [RFC] implicit per-namespace devlink instance to set kernel resource limitations
From: Jakub Kicinski @ 2019-08-06 18:27 UTC (permalink / raw)
  To: Jiri Pirko, dsahern
  Cc: netdev, davem, mlxsw, andrew, f.fainelli, vivien.didelot,
	mkubecek, stephen, daniel, brouer, eric.dumazet
In-Reply-To: <20190806164036.GA2332@nanopsycho.orion>

On Tue, 6 Aug 2019 18:40:36 +0200, Jiri Pirko wrote:
> Hi all.
> 
> I just discussed this with DavidA and I would like to bring this to
> broader audience. David wants to limit kernel resources in network
> namespaces, for example fibs, fib rules, etc.
> 
> He claims that devlink api is rich enough to program this limitations
> as it already does for mlxsw hw resources for example. 

TBH I don't see how you changed anything to do with FIB notifications,
so the fact that the accounting is off now is a bit confusing. I don't
understand how devlink, FIB and namespaces mix :(

> If we have this api for hardware, why don't to reuse it for the
> kernel and it's resources too?

IMHO the netdevsim use of this API is a slight abuse, to prove the
device can fail the FIB changes, nothing more..

> So the proposal is to have some new device, say "kernelnet", that
> would implicitly create per-namespace devlink instance. This devlink
> instance would be used to setup resource limits. Like:
> 
> devlink resource set kernelnet path /IPv4/fib size 96
> devlink -N ns1name resource set kernelnet path /IPv6/fib size 100
> devlink -N ns2name resource set kernelnet path /IPv4/fib-rules size 8
> 
> To me it sounds a bit odd for kernel namespace to act as a device, but
> thinking about it more, it makes sense. Probably better than to define
> a new api. User would use the same tool to work with kernel and hw.
> 
> Also we can implement other devlink functionality, like dpipe.
> User would then have visibility of network pipeline, tables,
> utilization, etc. It is related to the resources too.
> 
> What do you think?

I'm no expert here but seems counter intuitive that device tables would
be aware of namespaces in the first place. Are we not reinventing
cgroup controllers based on a device API? IMHO from a perspective of
someone unfamiliar with routing offload this seems backwards :)

^ permalink raw reply

* [PATCH net-next] mlx5: use correct counter
From: Jonathan Lemon @ 2019-08-06 18:28 UTC (permalink / raw)
  To: saeedm; +Cc: kernel-team, netdev

mlx5e_grp_q_update_stats seems to be using the wrong counter
for if_down_packets.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 6eee3c7d4b06..1d16e03a987d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -363,7 +363,7 @@ static void mlx5e_grp_q_update_stats(struct mlx5e_priv *priv)
 	    !mlx5_core_query_q_counter(priv->mdev, priv->drop_rq_q_counter, 0,
 				       out, sizeof(out)))
 		qcnt->rx_if_down_packets = MLX5_GET(query_q_counter_out, out,
-						    out_of_buffer);
+						    rx_if_down_packets);
 }
 
 #define VNIC_ENV_OFF(c) MLX5_BYTE_OFF(query_vnic_env_out, c)
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net] net: ethernet: sun4i-emac: Support phy-handle property for finding PHYs
From: David Miller @ 2019-08-06 18:29 UTC (permalink / raw)
  To: wens; +Cc: mripard, wens, netdev, linux-arm-kernel, linux-kernel
In-Reply-To: <20190806073539.32519-1-wens@kernel.org>

From: Chen-Yu Tsai <wens@kernel.org>
Date: Tue,  6 Aug 2019 15:35:39 +0800

> From: Chen-Yu Tsai <wens@csie.org>
> 
> The sun4i-emac uses the "phy" property to find the PHY it's supposed to
> use. This property was deprecated in favor of "phy-handle" in commit
> 8c5b09447625 ("dt-bindings: net: sun4i-emac: Convert the binding to a
> schemas").
> 
> Add support for this new property name, and fall back to the old one in
> case the device tree hasn't been updated.
> 
> Signed-off-by: Chen-Yu Tsai <wens@csie.org>

Applied.

^ permalink raw reply

* Re: [PATCH] net: cxgb3_main: Fix a resource leak in a error path in 'init_one()'
From: David Miller @ 2019-08-06 18:34 UTC (permalink / raw)
  To: christophe.jaillet; +Cc: vishal, netdev, linux-kernel, kernel-janitors
In-Reply-To: <20190806085512.11729-1-christophe.jaillet@wanadoo.fr>

From: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date: Tue,  6 Aug 2019 10:55:12 +0200

> A call to 'kfree_skb()' is missing in the error handling path of
> 'init_one()'.
> This is already present in 'remove_one()' but is missing here.
> 
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>

Looks good, applied, thanks.

^ permalink raw reply

* Re: [PATCH] net/mlx4_core: Use refcount_t for refcount
From: Jason Gunthorpe @ 2019-08-06 18:36 UTC (permalink / raw)
  To: Chuhong Yuan
  Cc: Saeed Mahameed, linux-kernel@vger.kernel.org, davem@davemloft.net,
	Tariq Toukan, linux-rdma@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <CANhBUQ1wZPinWicu2c_VZjpTtP_9+AxB=7zn+ymPyYVo_rsxZQ@mail.gmail.com>

On Sat, Aug 03, 2019 at 10:42:31AM +0800, Chuhong Yuan wrote:
> Saeed Mahameed <saeedm@mellanox.com> 于2019年8月3日周六 上午2:38写道:
> >
> > On Sat, 2019-08-03 at 00:10 +0800, Chuhong Yuan wrote:
> > > Chuhong Yuan <hslester96@gmail.com> 于2019年8月2日周五 下午8:10写道:
> > > > refcount_t is better for reference counters since its
> > > > implementation can prevent overflows.
> > > > So convert atomic_t ref counters to refcount_t.
> > > >
> > > > Also convert refcount from 0-based to 1-based.
> > > >
> > >
> > > It seems that directly converting refcount from 0-based
> > > to 1-based is infeasible.
> > > I am sorry for this mistake.
> >
> > Just curious, why not keep it 0 based and use refcout_t ?
> >
> > refcount API should have the same semantics as atomic_t API .. no ?
> 
> refcount API will warn when increase a 0 refcount.
> It regards this as a use-after-free.

If this causes failures then the code is not doing atomic as a
refcount properly anyhow.. 

There are some cases where the atomic refcount is just a imprecise
debugging aide.

Jason

^ permalink raw reply

* Re: [PATCH 1/1] ixgbe: sync the first fragment unconditionally
From: David Miller @ 2019-08-06 18:36 UTC (permalink / raw)
  To: firo.yang; +Cc: netdev, jeffrey.t.kirsher, intel-wired-lan, linux-kernel
In-Reply-To: <20190806092919.13211-1-firo.yang@suse.com>

From: Firo Yang <firo.yang@suse.com>
Date: Tue, 6 Aug 2019 09:29:51 +0000

> In Xen environment, if Xen-swiotlb is enabled, ixgbe driver
> could possibly allocate a page, DMA memory buffer, for the first
> fragment which is not suitable for Xen-swiotlb to do DMA operations.
> Xen-swiotlb will internally allocate another page for doing DMA
> operations. It requires syncing between those two pages. Otherwise,
> we may get an incomplete skb. To fix this problem, sync the first
> fragment no matter the first fargment is makred as "page_released"
> or not.
> 
> Signed-off-by: Firo Yang <firo.yang@suse.com>

I don't understand, an unmap operation implies a sync operation.

^ permalink raw reply

* Re: [PATCH v2] net/mlx5e: Use refcount_t for refcount
From: Jason Gunthorpe @ 2019-08-06 18:38 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Saeed Mahameed, hslester96@gmail.com, davem@davemloft.net,
	netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20190806065958.GQ4832@mtr-leonro.mtl.com>

On Tue, Aug 06, 2019 at 09:59:58AM +0300, Leon Romanovsky wrote:
> On Mon, Aug 05, 2019 at 08:06:36PM +0000, Saeed Mahameed wrote:
> > On Mon, 2019-08-05 at 14:55 +0800, Chuhong Yuan wrote:
> > > On Mon, Aug 5, 2019 at 2:13 PM Leon Romanovsky <leon@kernel.org>
> > > wrote:
> > > > On Sun, Aug 04, 2019 at 10:44:47PM +0800, Chuhong Yuan wrote:
> > > > > On Sun, Aug 4, 2019 at 8:59 PM Leon Romanovsky <leon@kernel.org>
> > > > > wrote:
> > > > > > On Sat, Aug 03, 2019 at 12:48:28AM +0800, Chuhong Yuan wrote:
> > > > > > > refcount_t is better for reference counters since its
> > > > > > > implementation can prevent overflows.
> > > > > > > So convert atomic_t ref counters to refcount_t.
> > > > > >
> > > > > > I'm not thrilled to see those automatic conversion patches,
> > > > > > especially
> > > > > > for flows which can't overflow. There is nothing wrong in using
> > > > > > atomic_t
> > > > > > type of variable, do you have in mind flow which will cause to
> > > > > > overflow?
> > > > > >
> > > > > > Thanks
> > > > >
> > > > > I have to say that these patches are not done automatically...
> > > > > Only the detection of problems is done by a script.
> > > > > All conversions are done manually.
> > > >
> > > > Even worse, you need to audit usage of atomic_t and replace there
> > > > it can overflow.
> > > >
> > > > > I am not sure whether the flow can cause an overflow.
> > > >
> > > > It can't.
> > > >
> > > > > But I think it is hard to ensure that a data path is impossible
> > > > > to have problems in any cases including being attacked.
> > > >
> > > > It is not data path, and I doubt that such conversion will be
> > > > allowed
> > > > in data paths without proving that no performance regression is
> > > > introduced.
> > > > > So I think it is better to do this minor revision to prevent
> > > > > potential risk, just like we have done in mlx5/core/cq.c.
> > > >
> > > > mlx5/core/cq.c is a different beast, refcount there means actual
> > > > users
> > > > of CQ which are limited in SW, so in theory, they have potential
> > > > to be overflown.
> > > >
> > > > It is not the case here, there your are adding new port.
> > > > There is nothing wrong with atomic_t.
> > > >
> > >
> > > Thanks for your explanation!
> > > I will pay attention to this point in similar cases.
> > > But it seems that the semantic of refcount is not always as clear as
> > > here...
> > >
> >
> > Semantically speaking, there is nothing wrong with moving to refcount_t
> > in the case of vxlan ports.. it also seems more accurate and will
> > provide the type protection, even if it is not necessary. Please let me
> > know what is the verdict here, i can apply this patch to net-next-mlx5.
> 
> There is no verdict here, it is up to you., if you like code churn, go
> for it.

IMHO CONFIG_REFCOUNT_FULL is a valuable enough reason to not open code
with an atomic even if overflow is not possible. 

Races resulting in incrs on 0 refcounts is a common enough mistake.

Jason

^ permalink raw reply

* Re: [RFC] implicit per-namespace devlink instance to set kernel resource limitations
From: Jiri Pirko @ 2019-08-06 18:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: dsahern, netdev, davem, mlxsw, andrew, f.fainelli, vivien.didelot,
	mkubecek, stephen, daniel, brouer, eric.dumazet
In-Reply-To: <20190806112717.3b070d07@cakuba.netronome.com>

Tue, Aug 06, 2019 at 08:27:17PM CEST, jakub.kicinski@netronome.com wrote:
>On Tue, 6 Aug 2019 18:40:36 +0200, Jiri Pirko wrote:
>> Hi all.
>> 
>> I just discussed this with DavidA and I would like to bring this to
>> broader audience. David wants to limit kernel resources in network
>> namespaces, for example fibs, fib rules, etc.
>> 
>> He claims that devlink api is rich enough to program this limitations
>> as it already does for mlxsw hw resources for example. 
>
>TBH I don't see how you changed anything to do with FIB notifications,
>so the fact that the accounting is off now is a bit confusing. I don't
>understand how devlink, FIB and namespaces mix :(
>
>> If we have this api for hardware, why don't to reuse it for the
>> kernel and it's resources too?
>
>IMHO the netdevsim use of this API is a slight abuse, to prove the
>device can fail the FIB changes, nothing more..

It's slightly bigger abuse :) But in this thread, we are not discussing
netdevsim, but separate "dev".


>
>> So the proposal is to have some new device, say "kernelnet", that
>> would implicitly create per-namespace devlink instance. This devlink
>> instance would be used to setup resource limits. Like:
>> 
>> devlink resource set kernelnet path /IPv4/fib size 96
>> devlink -N ns1name resource set kernelnet path /IPv6/fib size 100
>> devlink -N ns2name resource set kernelnet path /IPv4/fib-rules size 8
>> 
>> To me it sounds a bit odd for kernel namespace to act as a device, but
>> thinking about it more, it makes sense. Probably better than to define
>> a new api. User would use the same tool to work with kernel and hw.
>> 
>> Also we can implement other devlink functionality, like dpipe.
>> User would then have visibility of network pipeline, tables,
>> utilization, etc. It is related to the resources too.
>> 
>> What do you think?
>
>I'm no expert here but seems counter intuitive that device tables would
>be aware of namespaces in the first place. Are we not reinventing
>cgroup controllers based on a device API? IMHO from a perspective of
>someone unfamiliar with routing offload this seems backwards :)

Can we use cgroup for fib and other limitations instead?


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox