Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] ipvs: remove unnecessary space
From: Pablo Neira Ayuso @ 2019-07-15  7:57 UTC (permalink / raw)
  To: yangxingwu
  Cc: wensong, horms, ja, kadlec, fw, davem, netdev, lvs-devel,
	netfilter-devel, coreteam, linux-kernel, joe
In-Reply-To: <20190712130721.7168-1-xingwu.yang@gmail.com>

On Fri, Jul 12, 2019 at 09:07:21PM +0800, yangxingwu wrote:
> this patch removes the extra space and use bitmap_zalloc instead
> 
> Signed-off-by: yangxingwu <xingwu.yang@gmail.com>
> ---
>  net/netfilter/ipvs/ip_vs_mh.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_mh.c b/net/netfilter/ipvs/ip_vs_mh.c
> index 94d9d34..3229867 100644
> --- a/net/netfilter/ipvs/ip_vs_mh.c
> +++ b/net/netfilter/ipvs/ip_vs_mh.c
> @@ -174,8 +174,7 @@ static int ip_vs_mh_populate(struct ip_vs_mh_state *s,
>  		return 0;
>  	}
>  
> -	table =  kcalloc(BITS_TO_LONGS(IP_VS_MH_TAB_SIZE),
> -			 sizeof(unsigned long), GFP_KERNEL);
> +	table = bitmap_zalloc(IP_VS_MH_TAB_SIZE, GFP_KERNEL);

By doing:

        git grep "=  " ...

on the netfilter folders, I see more of these, it would be good if you
fix them at once or, probably, you want to use coccinelle for this.

Thanks.

^ permalink raw reply

* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Stefano Garzarella @ 2019-07-15  7:44 UTC (permalink / raw)
  To: Jason Wang; +Cc: Michael S. Tsirkin, Stefan Hajnoczi, virtualization, netdev
In-Reply-To: <a514d8a4-3a12-feeb-4467-af7a9fbf5183@redhat.com>

On Fri, Jul 12, 2019 at 06:14:39PM +0800, Jason Wang wrote:
> 
> On 2019/7/12 下午6:00, Stefano Garzarella wrote:
> > On Thu, Jul 11, 2019 at 03:52:21PM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 11, 2019 at 01:41:34PM +0200, Stefano Garzarella wrote:
> > > > On Thu, Jul 11, 2019 at 03:37:00PM +0800, Jason Wang wrote:
> > > > > On 2019/7/10 下午11:37, Stefano Garzarella wrote:
> > > > > > Hi,
> > > > > > as Jason suggested some months ago, I looked better at the virtio-net driver to
> > > > > > understand if we can reuse some parts also in the virtio-vsock driver, since we
> > > > > > have similar challenges (mergeable buffers, page allocation, small
> > > > > > packets, etc.).
> > > > > > 
> > > > > > Initially, I would add the skbuff in the virtio-vsock in order to re-use
> > > > > > receive_*() functions.
> > > > > 
> > > > > Yes, that will be a good step.
> > > > > 
> > > > Okay, I'll go on this way.
> > > > 
> > > > > > Then I would move receive_[small, big, mergeable]() and
> > > > > > add_recvbuf_[small, big, mergeable]() outside of virtio-net driver, in order to
> > > > > > call them also from virtio-vsock. I need to do some refactoring (e.g. leave the
> > > > > > XDP part on the virtio-net driver), but I think it is feasible.
> > > > > > 
> > > > > > The idea is to create a virtio-skb.[h,c] where put these functions and a new
> > > > > > object where stores some attributes needed (e.g. hdr_len ) and status (e.g.
> > > > > > some fields of struct receive_queue).
> > > > > 
> > > > > My understanding is we could be more ambitious here. Do you see any blocker
> > > > > for reusing virtio-net directly? It's better to reuse not only the functions
> > > > > but also the logic like NAPI to avoid re-inventing something buggy and
> > > > > duplicated.
> > > > > 
> > > > These are my concerns:
> > > > - virtio-vsock is not a "net_device", so a lot of code related to
> > > >    ethtool, net devices (MAC address, MTU, speed, VLAN, XDP, offloading) will be
> > > >    not used by virtio-vsock.
> 
> 
> Linux support device other than ethernet, so it should not be a problem.
> 
> 
> > > > 
> > > > - virtio-vsock has a different header. We can consider it as part of
> > > >    virtio_net payload, but it precludes the compatibility with old hosts. This
> > > >    was one of the major doubts that made me think about using only the
> > > >    send/recv skbuff functions, that it shouldn't break the compatibility.
> 
> 
> We can extend the current vnet header helper for it to work for vsock.

Okay, I'll do it.

> 
> 
> > > > 
> > > > > > This is an idea of virtio-skb.h that
> > > > > > I have in mind:
> > > > > >       struct virtskb;
> > > > > 
> > > > > What fields do you want to store in virtskb? It looks to be exist sk_buff is
> > > > > flexible enough to us?
> > > > My idea is to store queues information, like struct receive_queue or
> > > > struct send_queue, and some device attributes (e.g. hdr_len ).
> 
> 
> If you reuse skb or virtnet_info, there is not necessary.
> 

Okay.

> 
> > > > 
> > > > > 
> > > > > >       struct sk_buff *virtskb_receive_small(struct virtskb *vs, ...);
> > > > > >       struct sk_buff *virtskb_receive_big(struct virtskb *vs, ...);
> > > > > >       struct sk_buff *virtskb_receive_mergeable(struct virtskb *vs, ...);
> > > > > > 
> > > > > >       int virtskb_add_recvbuf_small(struct virtskb*vs, ...);
> > > > > >       int virtskb_add_recvbuf_big(struct virtskb *vs, ...);
> > > > > >       int virtskb_add_recvbuf_mergeable(struct virtskb *vs, ...);
> > > > > > 
> > > > > > For the Guest->Host path it should be easier, so maybe I can add a
> > > > > > "virtskb_send(struct virtskb *vs, struct sk_buff *skb)" with a part of the code
> > > > > > of xmit_skb().
> > > > > 
> > > > > I may miss something, but I don't see any thing that prevents us from using
> > > > > xmit_skb() directly.
> > > > > 
> > > > Yes, but my initial idea was to make it more parametric and not related to the
> > > > virtio_net_hdr, so the 'hdr_len' could be a parameter and the
> > > > 'num_buffers' should be handled by the caller.
> > > > 
> > > > > > Let me know if you have in mind better names or if I should put these function
> > > > > > in another place.
> > > > > > 
> > > > > > I would like to leave the control part completely separate, so, for example,
> > > > > > the two drivers will negotiate the features independently and they will call
> > > > > > the right virtskb_receive_*() function based on the negotiation.
> > > > > 
> > > > > If it's one the issue of negotiation, we can simply change the
> > > > > virtnet_probe() to deal with different devices.
> > > > > 
> > > > > 
> > > > > > I already started to work on it, but before to do more steps and send an RFC
> > > > > > patch, I would like to hear your opinion.
> > > > > > Do you think that makes sense?
> > > > > > Do you see any issue or a better solution?
> > > > > 
> > > > > I still think we need to seek a way of adding some codes on virtio-net.c
> > > > > directly if there's no huge different in the processing of TX/RX. That would
> > > > > save us a lot time.
> > > > After the reading of the buffers from the virtqueue I think the process
> > > > is slightly different, because virtio-net will interface with the network
> > > > stack, while virtio-vsock will interface with the vsock-core (socket).
> > > > So the virtio-vsock implements the following:
> > > > - control flow mechanism to avoid to loose packets, informing the peer
> > > >    about the amount of memory available in the receive queue using some
> > > >    fields in the virtio_vsock_hdr
> > > > - de-multiplexing parsing the virtio_vsock_hdr and choosing the right
> > > >    socket depending on the port
> > > > - socket state handling
> 
> 
> I think it's just a branch, for ethernet, go for networking stack. otherwise
> go for vsock core?
> 

Yes, that should work.

So, I should refactor the functions that can be called also from the vsock
core, in order to remove "struct net_device *dev" parameter.
Maybe creating some wrappers for the network stack.

Otherwise I should create a fake net_device for vsock_core.

What do you suggest?

> 
> > > > 
> > > > We can use the virtio-net as transport, but we should add a lot of
> > > > code to skip "net device" stuff when it is used by the virtio-vsock.
> 
> 
> This could be another choice, but consider it was not transparent to the
> admin and require new features, we may seek a transparent solution here.
> 
> 
> > > > This could break something in virtio-net, for this reason, I thought to reuse
> > > > only the send/recv functions starting from the idea to split the virtio-net
> > > > driver in two parts:
> > > > a. one with all stuff related to the network stack
> > > > b. one with the stuff needed to communicate with the host
> > > > 
> > > > And use skbuff to communicate between parts. In this way, virtio-vsock
> > > > can use only the b part.
> > > > 
> > > > Maybe we can do this split in a better way, but I'm not sure it is
> > > > simple.
> > > > 
> > > > Thanks,
> > > > Stefano
> > > Frankly, skb is a huge structure which adds a lot of
> > > overhead. I am not sure that using it is such a great idea
> > > if building a device that does not have to interface
> > > with the networking stack.
> 
> 
> I believe vsock is mainly used for stream performance not for PPS. So the
> impact should be minimal. We can use other metadata, just need branch in
> recv_xxx().
> 

Yes, I think stream performance is the case.

> 
> > Thanks for the advice!
> > 
> > > So I agree with Jason in theory. To clarify, he is basically saying
> > > current implementation is all wrong, it should be a protocol and we
> > > should teach networking stack that there are reliable net devices that
> > > handle just this protocol. We could add a flag in virtio net that
> > > will say it's such a device.
> > > 
> > > Whether it's doable, I don't know, and it's definitely not simple - in
> > > particular you will have to also re-implement existing devices in these
> > > terms, and not just virtio - vmware vsock too.
> 
> 
> Merging vsock protocol to exist networking stack could be a long term goal,
> I believe for the first phase, we can seek to use virtio-net first.
>

Yes, I agree.

> 
> > > 
> > > If you want to do a POC you can add a new address family,
> > > that's easier.
> > Very interesting!
> > I agree with you. In this way we can completely split the protocol
> > logic, from the device.
> > 
> > As you said, it will not simple to do, but can be an opportunity to learn
> > better the Linux networking stack!
> > I'll try to do a PoC with AF_VSOCK2 that will use the virtio-net.
> 
> 
> I suggest to do this step by step:
> 
> 1) use virtio-net but keep some protocol logic
> 
> 2) separate protocol logic and merge it to exist Linux networking stack

Make sense, thanks for the suggestions, I'll try to do these steps!

Thanks,
Stefano

^ permalink raw reply

* Re: [PATCH v2 3/9] rcu/sync: Remove custom check for reader-section
From: Oleg Nesterov @ 2019-07-15  7:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, linux-kernel, Alexey Kuznetsov, Bjorn Helgaas,
	Borislav Petkov, c0d1n61at3, David S. Miller, edumazet,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jonathan Corbet, Josh Triplett, keescook,
	kernel-hardening, kernel-team, Lai Jiangshan, Len Brown,
	linux-acpi, linux-doc, linux-pci, linux-pm, Mathieu Desnoyers,
	neilb, netdev, Pavel Machek, peterz, Rafael J. Wysocki,
	Rasmus Villemoes, rcu, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	will, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190713031008.GA248225@google.com>

On 07/12, Joel Fernandes wrote:
>
>  static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
>  {
> -	RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
> -			 !rcu_read_lock_bh_held() &&
> -			 !rcu_read_lock_sched_held(),
> +	RCU_LOCKDEP_WARN(!rcu_read_lock_any_held(),

Yes, this is what I meant.

Sorry for confusion, I should have mentioned that rcu_sync_is_idle()
was recently updated when I suggested to use the new helper.

Acked-by: Oleg Nesterov <oleg@redhat.com>


^ permalink raw reply

* Re: [PATCH] mm/gup: Use put_user_page*() instead of put_page*()
From: Bharath Vedartham @ 2019-07-15  6:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: akpm, ira.weiny, jhubbard, Mauro Carvalho Chehab,
	Dimitri Sivanich, Arnd Bergmann, Greg Kroah-Hartman,
	Alex Williamson, Cornelia Huck, Alexander Viro,
	Björn Töpel, Magnus Karlsson, David S. Miller,
	Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Enrico Weigelt,
	Thomas Gleixner, Alexios Zavras, Dan Carpenter, Max Filippov,
	Matt Sickler, Kirill A. Shutemov, Keith Busch, YueHaibing,
	linux-media, linux-kernel, devel, kvm, linux-block, linux-fsdevel,
	linux-mm, netdev, bpf, xdp-newbies
In-Reply-To: <018ee3d1-e2f0-ca12-9f63-945056c09985@kernel.dk>

On Sun, Jul 14, 2019 at 08:33:57PM -0600, Jens Axboe wrote:
> On 7/14/19 1:08 PM, Bharath Vedartham wrote:
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 4ef62a4..b4a4549 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -2694,10 +2694,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
> >   			 * if we did partial map, or found file backed vmas,
> >   			 * release any pages we did get
> >   			 */
> > -			if (pret > 0) {
> > -				for (j = 0; j < pret; j++)
> > -					put_page(pages[j]);
> > -			}
> > +			if (pret > 0)
> > +				put_user_pages(pages, pret);
> > +
> >   			if (ctx->account_mem)
> >   				io_unaccount_mem(ctx->user, nr_pages);
> >   			kvfree(imu->bvec);
> 
> You handled just the failure case of the buffer registration, but not
> the actual free in io_sqe_buffer_unregister().
> 
> -- 
> Jens Axboe
Yup got it! Thanks! I won't be sending a patch again as fs/io_uring.c
may have larger local changes for put_user_pages.

Thanks

^ permalink raw reply

* Re: [PATCH] mm/gup: Use put_user_page*() instead of put_page*()
From: Bharath Vedartham @ 2019-07-15  6:56 UTC (permalink / raw)
  To: John Hubbard
  Cc: akpm, ira.weiny, Mauro Carvalho Chehab, Dimitri Sivanich,
	Arnd Bergmann, Greg Kroah-Hartman, Alex Williamson, Cornelia Huck,
	Jens Axboe, Alexander Viro, Björn Töpel,
	Magnus Karlsson, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Enrico Weigelt, Thomas Gleixner, Alexios Zavras,
	Dan Carpenter, Max Filippov, Matt Sickler, Kirill A. Shutemov,
	Keith Busch, YueHaibing, linux-media, linux-kernel, devel, kvm,
	linux-block, linux-fsdevel, linux-mm, netdev, bpf, xdp-newbies,
	Jason Gunthorpe
In-Reply-To: <deea584f-2da2-8e1f-5a07-e97bf32c63bb@nvidia.com>

On Sun, Jul 14, 2019 at 04:33:42PM -0700, John Hubbard wrote:
> On 7/14/19 12:08 PM, Bharath Vedartham wrote:
> > This patch converts all call sites of get_user_pages
> > to use put_user_page*() instead of put_page*() functions to
> > release reference to gup pinned pages.
Hi John, 
> Hi Bharath,
> 
> Thanks for jumping in to help, and welcome to the party!
> 
> You've caught everyone in the middle of a merge window, btw.  As a
> result, I'm busy rebasing and reworking the get_user_pages call sites, 
> and gup tracking, in the wake of some semi-traumatic changes to bio 
> and gup and such. I plan to re-post right after 5.3-rc1 shows up, from 
> here:
> 
>     https://github.com/johnhubbard/linux/commits/gup_dma_core
> 
> ...which you'll find already covers the changes you've posted, except for:
> 
>     drivers/misc/sgi-gru/grufault.c
>     drivers/staging/kpc2000/kpc_dma/fileops.c
> 
> ...and this one, which is undergoing to larger local changes, due to
> bvec, so let's leave it out of the choices:
> 
>     fs/io_uring.c
> 
> Therefore, until -rc1, if you'd like to help, I'd recommend one or more
> of the following ideas:
> 
> 1. Pull down https://github.com/johnhubbard/linux/commits/gup_dma_core
> and find missing conversions: look for any additional missing 
> get_user_pages/put_page conversions. You've already found a couple missing 
> ones. I haven't re-run a search in a long time, so there's probably even more.
> 	a) And find more, after I rebase to 5.3-rc1: people probably are adding
> 	get_user_pages() calls as we speak. :)
Shouldn't this be documented then? I don't see any docs for using
put_user_page*() in v5.2.1 in the memory management API section?
> 2. Patches: Focus on just one subsystem at a time, and perfect the patch for
> it. For example, I think this the staging driver would be perfect to start with:
> 
>     drivers/staging/kpc2000/kpc_dma/fileops.c
> 
> 	a) verify that you've really, corrected converted the whole
> 	driver. (Hint: I think you might be overlooking a put_page call.)
Yup. I did see that! Will fix it!
> 	b) Attempt to test it if you can (I'm being hypocritical in
> 	the extreme here, but one of my problems is that testing
> 	has been light, so any help is very valuable). qemu...?
> 	OTOH, maybe even qemu cannot easily test a kpc2000, but
> 	perhaps `git blame` and talking to the authors would help
> 	figure out a way to validate the changes.
Great! I ll do that, I ll mail the patch authors and ask them for help
in testing. 
> 	Thinking about whether you can run a test that would prove or
> 	disprove my claim in (a), above, could be useful in coming up
> 	with tests to run.

> In other words, a few very high quality conversions (even just one) that
> we can really put our faith in, is what I value most here. Tested patches
> are awesome.
I understand that! 
> 3. Once I re-post, turn on the new CONFIG_DEBUG_GET_USER_PAGES_REFERENCES
> and run things such as xfstest/fstest. (Again, doing so would be going
> further than I have yet--very helpful). Help clarify what conversions have
> actually been tested and work, and which ones remain unvalidated.
> Other: Please note that this:
Yup will do that.
>     https://github.com/johnhubbard/linux/commits/gup_dma_core
> 
>     a) gets rebased often, and
> 
>     b) has a bunch of commits (iov_iter and related) that conflict
>        with the latest linux.git,
> 
>     c) has some bugs in the bio area, that I'm fixing, so I don't trust
>        that's it's safely runnable, for a few more days.
I assume your repo contains only work related to fixing gup issues and
not the main repo for gup development? i.e where gup changes are merged?
Also are release_pages and put_user_pages interchangable? 
> One note below, for the future:
> 
> > 
> > This is a bunch of trivial conversions which is a part of an effort
> > by John Hubbard to solve issues with gup pinned pages and 
> > filesystem writeback.
> > 
> > The issue is more clearly described in John Hubbard's patch[1] where
> > put_user_page*() functions are introduced.
> > 
> > Currently put_user_page*() simply does put_page but future implementations
> > look to change that once treewide change of put_page callsites to 
> > put_user_page*() is finished.
> > 
> > The lwn article describing the issue with gup pinned pages and filesystem 
> > writeback [2].
> > 
> > This patch has been tested by building and booting the kernel as I don't
> > have the required hardware to test the device drivers.
> > 
> > I did not modify gpu/drm drivers which use release_pages instead of
> > put_page() to release reference of gup pinned pages as I am not clear
> > whether release_pages and put_page are interchangable. 
> > 
> > [1] https://lkml.org/lkml/2019/3/26/1396
> 
> When referring to patches in a commit description, please use the 
> commit hash, not an external link. See Submitting Patches [1] for details.
> 
> Also, once you figure out the right maintainers and other involved people,
> putting Cc: in the commit description is common practice, too.
> 
> [1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html
Will work on that! Thanks!
> thanks,
> -- 
> John Hubbard
> NVIDIA
> 
> > 
> > [2] https://lwn.net/Articles/784574/
> > 
> > Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
> > ---
> >  drivers/media/v4l2-core/videobuf-dma-sg.c | 3 +--
> >  drivers/misc/sgi-gru/grufault.c           | 2 +-
> >  drivers/staging/kpc2000/kpc_dma/fileops.c | 4 +---
> >  drivers/vfio/vfio_iommu_type1.c           | 2 +-
> >  fs/io_uring.c                             | 7 +++----
> >  mm/gup_benchmark.c                        | 6 +-----
> >  net/xdp/xdp_umem.c                        | 7 +------
> >  7 files changed, 9 insertions(+), 22 deletions(-)
> > 
> > diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
> > index 66a6c6c..d6eeb43 100644
> > --- a/drivers/media/v4l2-core/videobuf-dma-sg.c
> > +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
> > @@ -349,8 +349,7 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
> >  	BUG_ON(dma->sglen);
> >  
> >  	if (dma->pages) {
> > -		for (i = 0; i < dma->nr_pages; i++)
> > -			put_page(dma->pages[i]);
> > +		put_user_pages(dma->pages, dma->nr_pages);
> >  		kfree(dma->pages);
> >  		dma->pages = NULL;
> >  	}
> > diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
> > index 4b713a8..61b3447 100644
> > --- a/drivers/misc/sgi-gru/grufault.c
> > +++ b/drivers/misc/sgi-gru/grufault.c
> > @@ -188,7 +188,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
> >  	if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0)
> >  		return -EFAULT;
> >  	*paddr = page_to_phys(page);
> > -	put_page(page);
> > +	put_user_page(page);
> >  	return 0;
> >  }
> >  
> > diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
> > index 6166587..26dceed 100644
> > --- a/drivers/staging/kpc2000/kpc_dma/fileops.c
> > +++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
> > @@ -198,9 +198,7 @@ int  kpc_dma_transfer(struct dev_private_data *priv, struct kiocb *kcb, unsigned
> >  	sg_free_table(&acd->sgt);
> >   err_dma_map_sg:
> >   err_alloc_sg_table:
> > -	for (i = 0 ; i < acd->page_count ; i++){
> > -		put_page(acd->user_pages[i]);
> > -	}
> > +	put_user_pages(acd->user_pages, acd->page_count);
> >   err_get_user_pages:
> >  	kfree(acd->user_pages);
> >   err_alloc_userpages:
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index add34ad..c491524 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -369,7 +369,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >  		 */
> >  		if (ret > 0 && vma_is_fsdax(vmas[0])) {
> >  			ret = -EOPNOTSUPP;
> > -			put_page(page[0]);
> > +			put_user_page(page[0]);
> >  		}
> >  	}
> >  	up_read(&mm->mmap_sem);
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 4ef62a4..b4a4549 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -2694,10 +2694,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
> >  			 * if we did partial map, or found file backed vmas,
> >  			 * release any pages we did get
> >  			 */
> > -			if (pret > 0) {
> > -				for (j = 0; j < pret; j++)
> > -					put_page(pages[j]);
> > -			}
> > +			if (pret > 0)
> > +				put_user_pages(pages, pret);
> > +
> >  			if (ctx->account_mem)
> >  				io_unaccount_mem(ctx->user, nr_pages);
> >  			kvfree(imu->bvec);
> > diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> > index 7dd602d..15fc7a2 100644
> > --- a/mm/gup_benchmark.c
> > +++ b/mm/gup_benchmark.c
> > @@ -76,11 +76,7 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
> >  	gup->size = addr - gup->addr;
> >  
> >  	start_time = ktime_get();
> > -	for (i = 0; i < nr_pages; i++) {
> > -		if (!pages[i])
> > -			break;
> > -		put_page(pages[i]);
> > -	}
> > +	put_user_pages(pages, nr_pages);
> >  	end_time = ktime_get();
> >  	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
> >  
> > diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> > index 9c6de4f..6103e19 100644
> > --- a/net/xdp/xdp_umem.c
> > +++ b/net/xdp/xdp_umem.c
> > @@ -173,12 +173,7 @@ static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> >  {
> >  	unsigned int i;
> >  
> > -	for (i = 0; i < umem->npgs; i++) {
> > -		struct page *page = umem->pgs[i];
> > -
> > -		set_page_dirty_lock(page);
> > -		put_page(page);
> > -	}
> > +	put_user_pages_dirty_lock(umem->pgs, umem->npgs);
> >  
> >  	kfree(umem->pgs);
> >  	umem->pgs = NULL;
> > 

^ permalink raw reply

* [PATCH] ethtool: igb: dump RR2DCDELAY register
From: Artem Bityutskiy @ 2019-07-15  6:52 UTC (permalink / raw)
  To: John W . Linville; +Cc: netdev

Decode 'RR2DCDELAY' register which Linux kernel provides starting from version
5.3. The corresponding commit in the Linux kernel is:
    cd502a7f7c9c igb: add RR2DCDELAY to ethtool registers dump

The RR2DCDELAY register is present in I210 and I211 Intel Gigabit Ethernet
chips and it stands for "Read Request To Data Completion Delay". Here is how
this register is described in the I210 datasheet:

"This field captures the maximum PCIe split time in 16 ns units, which is the
maximum delay between the read request to the first data completion. This is
giving an estimation of the PCIe round trip time."

In practice, this register can be used to measure the time it takes the NIC to
read data from the host memory.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
---
 igb.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/igb.c b/igb.c
index e0ccef9..ab0896f 100644
--- a/igb.c
+++ b/igb.c
@@ -859,6 +859,18 @@ igb_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs)
 		"0x03430: TDFPC       (Tx data FIFO packet count)      0x%08X\n",
 		regs_buff[550]);

+	/*
+	 * Starting from kernel version 5.3 the registers dump buffer grew from
+	 * 739 4-byte words to 740 words, and word 740 contains the RR2DCDLAY
+	 * register.
+	 */
+	if (regs->len < 740)
+		return 0;
+
+	fprintf(stdout,
+		"0x05BF4: RR2DCDELAY  (Max. DMA read delay)            0x%08X\n",
+		regs_buff[739]);
+
 	return 0;
 }

-- 
2.20.1

^ permalink raw reply related

* Re: [PATCH 8/8] docs: remove extra conf.py files
From: Markus Heiser @ 2019-07-15  6:16 UTC (permalink / raw)
  To: Mauro Carvalho Chehab, Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, linux-kernel, Jonathan Corbet, Herbert Xu,
	David S. Miller, David Airlie, Daniel Vetter, Maarten Lankhorst,
	Maxime Ripard, Sean Paul, Dmitry Torokhov, Yoshinori Sato,
	Rich Felker, Jaroslav Kysela, Takashi Iwai, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, x86, linux-crypto,
	dri-devel, linux-input, netdev, linux-sh, alsa-devel
In-Reply-To: <12a160afc9e70156f671010bd4ccff9311acdc5e.1563115732.git.mchehab+samsung@kernel.org>

Hi Mauro,

sorry, I havn't tested your patch, but one question ...

Am 14.07.19 um 17:10 schrieb Mauro Carvalho Chehab:
> Now that the latex_documents are handled automatically, we can
> remove those extra conf.py files.

We need these conf.py also for compiling books' into HTML.  For this
the tags.add("subproject") is needed.  Should we realy drop this feature?

-- Markus --

^ permalink raw reply

* Re: [PATCH v3 0/3] kernel/notifier.c: avoid duplicate registration
From: Vasily Averin @ 2019-07-15  5:38 UTC (permalink / raw)
  To: Xiaoming Ni, gregkh@linuxfoundation.org
  Cc: adobriyan@gmail.com, akpm@linux-foundation.org,
	anna.schumaker@netapp.com, arjan@linux.intel.com,
	bfields@fieldses.org, chuck.lever@oracle.com, davem@davemloft.net,
	jlayton@kernel.org, luto@kernel.org, mingo@kernel.org,
	Nadia.Derbey@bull.net, paulmck@linux.vnet.ibm.com,
	semen.protsenko@linaro.org, stable@kernel.org,
	stern@rowland.harvard.edu, tglx@linutronix.de,
	torvalds@linux-foundation.org, trond.myklebust@hammerspace.com,
	viresh.kumar@linaro.org, Huangjianhui (Alex), Dailei,
	linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <65f50cf2-3051-ab55-078f-30930fe0c9bc@huawei.com>

On 7/14/19 5:45 AM, Xiaoming Ni wrote:
> On 2019/7/12 22:07, gregkh@linuxfoundation.org wrote:
>> On Fri, Jul 12, 2019 at 09:11:57PM +0800, Xiaoming Ni wrote:
>>> On 2019/7/11 21:57, Vasily Averin wrote:
>>>> On 7/11/19 4:55 AM, Nixiaoming wrote:
>>>>> On Wed, July 10, 2019 1:49 PM Vasily Averin wrote:
>>>>>> On 7/10/19 6:09 AM, Xiaoming Ni wrote:
>>>>>>> Registering the same notifier to a hook repeatedly can cause the hook
>>>>>>> list to form a ring or lose other members of the list.
>>>>>>
>>>>>> I think is not enough to _prevent_ 2nd register attempt,
>>>>>> it's enough to detect just attempt and generate warning to mark host in bad state.
>>>>>>
>>>>>
>>>>> Duplicate registration is prevented in my patch, not just "mark host in bad state"
>>>>>
>>>>> Duplicate registration is checked and exited in notifier_chain_cond_register()
>>>>>
>>>>> Duplicate registration was checked in notifier_chain_register() but only 
>>>>> the alarm was triggered without exiting. added by commit 831246570d34692e 
>>>>> ("kernel/notifier.c: double register detection")
>>>>>
>>>>> My patch is like a combination of 831246570d34692e and notifier_chain_cond_register(),
>>>>>  which triggers an alarm and exits when a duplicate registration is detected.
>>>>>
>>>>>> Unexpected 2nd register of the same hook most likely will lead to 2nd unregister,
>>>>>> and it can lead to host crash in any time: 
>>>>>> you can unregister notifier on first attempt it can be too early, it can be still in use.
>>>>>> on the other hand you can never call 2nd unregister at all.
>>>>>
>>>>> Since the member was not added to the linked list at the time of the second registration, 
>>>>> no linked list ring was formed. 
>>>>> The member is released on the first unregistration and -ENOENT on the second unregistration.
>>>>> After patching, the fault has been alleviated
>>>>
>>>> You are wrong here.
>>>> 2nd notifier's registration is a pure bug, this should never happen.
>>>> If you know the way to reproduce this situation -- you need to fix it. 
>>>>
>>>> 2nd registration can happen in 2 cases:
>>>> 1) missed rollback, when someone forget to call unregister after successfull registration, 
>>>> and then tried to call register again. It can lead to crash for example when according module will be unloaded.
>>>> 2) some subsystem is registered twice, for example from  different namespaces.
>>>> in this case unregister called during sybsystem cleanup in first namespace will incorrectly remove notifier used 
>>>> in second namespace, it also can lead to unexpacted behaviour.
>>>>
>>> So in these two cases, is it more reasonable to trigger BUG() directly when checking for duplicate registration ?
>>> But why does current notifier_chain_register() just trigger WARN() without exiting ?
>>> notifier_chain_cond_register() direct exit without triggering WARN() ?
>>
>> It should recover from this, if it can be detected.  The main point is
>> that not all apis have to be this "robust" when used within the kernel
>> as we do allow for the callers to know what they are doing :)
>>
> In the notifier_chain_register(), the condition ( (*nl) == n) is the same registration of the same hook.
>  We can intercept this situation and avoid forming a linked list ring to make the API more rob

Once again -- yes, you CAN prevent list corruption, but you CANNOT recover the host and return it back to safe state.
If double register event was detected -- it means you have bug in kernel.

Yes, you can add BUG here and crash the host immediately, but I prefer to use warning in such situations.

>> If this does not cause any additional problems or slow downs, it's
>> probably fine to add.
>>
> Notifier_chain_register() is not a system hotspot function.
> At the same time, there is already a WARN_ONCE judgment. There is no new judgment in the new patch.
> It only changes the processing under the condition of (*nl) == n, which will not cause performance problems.
> At the same time, avoiding the formation of a link ring can make the system more robust.

I disagree, 
yes, node will have correct list, but anyway node will work wrong and can crash the host in any time.

>> thanks,
>>
>> greg k-h
>>
>> .
>>
> Thanks
> 
> Xiaoming Ni
> 
> 
> 

^ permalink raw reply

* [PATCH net] caif-hsi: fix possible deadlock in cfhsi_exit_module()
From: Taehee Yoo @ 2019-07-15  5:10 UTC (permalink / raw)
  To: davem, netdev; +Cc: ap420073

cfhsi_exit_module() calls unregister_netdev() under rtnl_lock().
but unregister_netdev() internally calls rtnl_lock().
So deadlock would occur.

Fixes: c41254006377 ("caif-hsi: Add rtnl support")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
I did only compile test because I don't have the testbed.
Could anyone test this patch?
 drivers/net/caif/caif_hsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/caif/caif_hsi.c b/drivers/net/caif/caif_hsi.c
index b2f10b6ad6e5..bbb2575d4728 100644
--- a/drivers/net/caif/caif_hsi.c
+++ b/drivers/net/caif/caif_hsi.c
@@ -1455,7 +1455,7 @@ static void __exit cfhsi_exit_module(void)
 	rtnl_lock();
 	list_for_each_safe(list_node, n, &cfhsi_list) {
 		cfhsi = list_entry(list_node, struct cfhsi, list);
-		unregister_netdev(cfhsi->ndev);
+		unregister_netdevice(cfhsi->ndev);
 	}
 	rtnl_unlock();
 }
-- 
2.17.1


^ permalink raw reply related

* [PATCH v3 19/24] vmxnet3: Remove call to memset after dma_alloc_coherent
From: Fuqian Huang @ 2019-07-15  3:21 UTC (permalink / raw)
  Cc: Ronak Doshi, VMware Inc ., David S . Miller, netdev, linux-kernel,
	Fuqian Huang

In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v3:
  - Use actual commit rather than the merge commit in the commit message

 drivers/net/vmxnet3/vmxnet3_drv.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 3f48f05dd2a6..2a1918f25e47 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -3430,7 +3430,6 @@ vmxnet3_probe_device(struct pci_dev *pdev,
 			err = -ENOMEM;
 			goto err_ver;
 		}
-		memset(adapter->coal_conf, 0, sizeof(*adapter->coal_conf));
 		adapter->coal_conf->coalMode = VMXNET3_COALESCE_DISABLED;
 		adapter->default_coal_mode = true;
 	}
-- 
2.11.0


^ permalink raw reply related

* [PATCH v3 20/24] wireless: Remove call to memset after dma_alloc_coherent
From: Fuqian Huang @ 2019-07-15  3:19 UTC (permalink / raw)
  Cc: Kalle Valo, David S . Miller, Arend van Spriel, Franky Lin,
	Hante Meuleman, Chi-Hsien Lin, Wright Feng, Igor Mitsyanko,
	Avinash Patil, Sergey Matyukevich, ath10k, linux-wireless, netdev,
	linux-kernel, brcm80211-dev-list.pdl, brcm80211-dev-list,
	Fuqian Huang

In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v3:
  - Use actual commit rather than the merge commit in the commit message

 drivers/net/wireless/ath/ath10k/ce.c                     | 5 -----
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c  | 2 --
 drivers/net/wireless/quantenna/qtnfmac/pcie/pearl_pcie.c | 2 --
 drivers/net/wireless/quantenna/qtnfmac/pcie/topaz_pcie.c | 2 --
 4 files changed, 11 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/ce.c b/drivers/net/wireless/ath/ath10k/ce.c
index eca87f7c5b6c..294fbc1e89ab 100644
--- a/drivers/net/wireless/ath/ath10k/ce.c
+++ b/drivers/net/wireless/ath/ath10k/ce.c
@@ -1704,9 +1704,6 @@ ath10k_ce_alloc_dest_ring_64(struct ath10k *ar, unsigned int ce_id,
 	/* Correctly initialize memory to 0 to prevent garbage
 	 * data crashing system when download firmware
 	 */
-	memset(dest_ring->base_addr_owner_space_unaligned, 0,
-	       nentries * sizeof(struct ce_desc_64) + CE_DESC_RING_ALIGN);
-
 	dest_ring->base_addr_owner_space =
 			PTR_ALIGN(dest_ring->base_addr_owner_space_unaligned,
 				  CE_DESC_RING_ALIGN);
@@ -2019,8 +2016,6 @@ void ath10k_ce_alloc_rri(struct ath10k *ar)
 		value |= ar->hw_ce_regs->upd->mask;
 		ath10k_ce_write32(ar, ce_base_addr + ctrl1_regs, value);
 	}
-
-	memset(ce->vaddr_rri, 0, CE_COUNT * sizeof(u32));
 }
 EXPORT_SYMBOL(ath10k_ce_alloc_rri);
 
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c
index 4ea5401c4d6b..5f0437a3fd82 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c
@@ -1023,8 +1023,6 @@ brcmf_pcie_init_dmabuffer_for_device(struct brcmf_pciedev_info *devinfo,
 			       address & 0xffffffff);
 	brcmf_pcie_write_tcm32(devinfo, tcm_dma_phys_addr + 4, address >> 32);
 
-	memset(ring, 0, size);
-
 	return (ring);
 }
 
diff --git a/drivers/net/wireless/quantenna/qtnfmac/pcie/pearl_pcie.c b/drivers/net/wireless/quantenna/qtnfmac/pcie/pearl_pcie.c
index 3aa3714d4dfd..5ec1c9bc1612 100644
--- a/drivers/net/wireless/quantenna/qtnfmac/pcie/pearl_pcie.c
+++ b/drivers/net/wireless/quantenna/qtnfmac/pcie/pearl_pcie.c
@@ -244,8 +244,6 @@ static int pearl_alloc_bd_table(struct qtnf_pcie_pearl_state *ps)
 
 	/* tx bd */
 
-	memset(vaddr, 0, len);
-
 	ps->bd_table_vaddr = vaddr;
 	ps->bd_table_paddr = paddr;
 	ps->bd_table_len = len;
diff --git a/drivers/net/wireless/quantenna/qtnfmac/pcie/topaz_pcie.c b/drivers/net/wireless/quantenna/qtnfmac/pcie/topaz_pcie.c
index 9a4380ed7f1b..1f91088e3dff 100644
--- a/drivers/net/wireless/quantenna/qtnfmac/pcie/topaz_pcie.c
+++ b/drivers/net/wireless/quantenna/qtnfmac/pcie/topaz_pcie.c
@@ -199,8 +199,6 @@ static int topaz_alloc_bd_table(struct qtnf_pcie_topaz_state *ts,
 	if (!vaddr)
 		return -ENOMEM;
 
-	memset(vaddr, 0, len);
-
 	/* tx bd */
 
 	ts->tx_bd_vbase = vaddr;
-- 
2.11.0


^ permalink raw reply related

* [PATCH v3 18/24] hippi: Remove call to memset after pci_alloc_consistent
From: Fuqian Huang @ 2019-07-15  3:19 UTC (permalink / raw)
  Cc: Jes Sorensen, David S . Miller, linux-hippi, netdev, linux-kernel,
	Fuqian Huang

pci_alloc_consistent calls dma_alloc_coherent directly.
In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v3:
  - Use actual commit rather than the merge commit in the commit message

 drivers/net/hippi/rrunner.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/hippi/rrunner.c b/drivers/net/hippi/rrunner.c
index 7b9350dbebdd..2a6ec5394966 100644
--- a/drivers/net/hippi/rrunner.c
+++ b/drivers/net/hippi/rrunner.c
@@ -1196,7 +1196,6 @@ static int rr_open(struct net_device *dev)
 		goto error;
 	}
 	rrpriv->rx_ctrl_dma = dma_addr;
-	memset(rrpriv->rx_ctrl, 0, 256*sizeof(struct ring_ctrl));
 
 	rrpriv->info = pci_alloc_consistent(pdev, sizeof(struct rr_info),
 					    &dma_addr);
@@ -1205,7 +1204,6 @@ static int rr_open(struct net_device *dev)
 		goto error;
 	}
 	rrpriv->info_dma = dma_addr;
-	memset(rrpriv->info, 0, sizeof(struct rr_info));
 	wmb();
 
 	spin_lock_irqsave(&rrpriv->lock, flags);
-- 
2.11.0


^ permalink raw reply related

* [PATCH v3 17/24] ethernet: remove redundant memset
From: Fuqian Huang @ 2019-07-15  3:19 UTC (permalink / raw)
  Cc: Jay Cliburn, Chris Snook, David S . Miller, Michael Chan,
	Vishal Kulkarni, Fugang Duan, Guo-Fu Tseng, Mirko Lindner,
	Stephen Hemminger, Tariq Toukan, Saeed Mahameed, Leon Romanovsky,
	Jiri Pirko, Ido Schimmel, Jon Mason, Manish Chopra, Rahul Verma,
	GR-Linux-NIC-Dev, Samuel Chessman, netdev, linux-kernel,
	linux-rdma, Fuqian Huang

kvzalloc already zeroes the memory during the allocation.
pci_alloc_consistent calls dma_alloc_coherent directly.
In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So the memset after these function is not needed.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v3:
  - Use actual commit rather than the merge commit in the commit message

 drivers/net/ethernet/atheros/atlx/atl1.c                   | 2 --
 drivers/net/ethernet/atheros/atlx/atl2.c                   | 1 -
 drivers/net/ethernet/broadcom/bnxt/bnxt.c                  | 2 --
 drivers/net/ethernet/chelsio/cxgb4/sched.c                 | 1 -
 drivers/net/ethernet/freescale/fec_main.c                  | 2 --
 drivers/net/ethernet/jme.c                                 | 5 -----
 drivers/net/ethernet/marvell/skge.c                        | 2 --
 drivers/net/ethernet/mellanox/mlx4/eq.c                    | 2 --
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c          | 1 -
 drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 3 ---
 drivers/net/ethernet/mellanox/mlxsw/pci.c                  | 1 -
 drivers/net/ethernet/neterion/s2io.c                       | 1 -
 drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c        | 3 ---
 drivers/net/ethernet/ti/tlan.c                             | 1 -
 14 files changed, 27 deletions(-)

diff --git a/drivers/net/ethernet/atheros/atlx/atl1.c b/drivers/net/ethernet/atheros/atlx/atl1.c
index 7c767ce9aafa..b5c6dc914720 100644
--- a/drivers/net/ethernet/atheros/atlx/atl1.c
+++ b/drivers/net/ethernet/atheros/atlx/atl1.c
@@ -1060,8 +1060,6 @@ static s32 atl1_setup_ring_resources(struct atl1_adapter *adapter)
 		goto err_nomem;
 	}
 
-	memset(ring_header->desc, 0, ring_header->size);
-
 	/* init TPD ring */
 	tpd_ring->dma = ring_header->dma;
 	offset = (tpd_ring->dma & 0x7) ? (8 - (ring_header->dma & 0x7)) : 0;
diff --git a/drivers/net/ethernet/atheros/atlx/atl2.c b/drivers/net/ethernet/atheros/atlx/atl2.c
index 3a3fb5ce0fee..3aba38322717 100644
--- a/drivers/net/ethernet/atheros/atlx/atl2.c
+++ b/drivers/net/ethernet/atheros/atlx/atl2.c
@@ -291,7 +291,6 @@ static s32 atl2_setup_ring_resources(struct atl2_adapter *adapter)
 		&adapter->ring_dma);
 	if (!adapter->ring_vir_addr)
 		return -ENOMEM;
-	memset(adapter->ring_vir_addr, 0, adapter->ring_size);
 
 	/* Init TXD Ring */
 	adapter->txd_dma = adapter->ring_dma ;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 3f632028eff0..1069eb01cc55 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2677,8 +2677,6 @@ static int bnxt_alloc_tx_rings(struct bnxt *bp)
 			mapping = txr->tx_push_mapping +
 				sizeof(struct tx_push_bd);
 			txr->data_mapping = cpu_to_le64(mapping);
-
-			memset(txr->tx_push, 0, sizeof(struct tx_push_bd));
 		}
 		qidx = bp->tc_to_qidx[j];
 		ring->queue_id = bp->q_info[qidx].queue_id;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.c b/drivers/net/ethernet/chelsio/cxgb4/sched.c
index ba6c153ee45c..60218dc676a8 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.c
@@ -207,7 +207,6 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
 		goto out_err;
 
 	/* Bind queue to specified class */
-	memset(qe, 0, sizeof(*qe));
 	qe->cntxt_id = qid;
 	memcpy(&qe->param, p, sizeof(qe->param));
 
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 9d459ccf251d..e5610a4da539 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3144,8 +3144,6 @@ static int fec_enet_init(struct net_device *ndev)
 		return -ENOMEM;
 	}
 
-	memset(cbd_base, 0, bd_size);
-
 	/* Get the Ethernet address */
 	fec_get_mac(ndev);
 	/* make sure MAC we just acquired is programmed into the hw */
diff --git a/drivers/net/ethernet/jme.c b/drivers/net/ethernet/jme.c
index 76b7b7b85e35..0b668357db4d 100644
--- a/drivers/net/ethernet/jme.c
+++ b/drivers/net/ethernet/jme.c
@@ -582,11 +582,6 @@ jme_setup_tx_resources(struct jme_adapter *jme)
 	if (unlikely(!(txring->bufinf)))
 		goto err_free_txring;
 
-	/*
-	 * Initialize Transmit Descriptors
-	 */
-	memset(txring->alloc, 0, TX_RING_ALLOC_SIZE(jme->tx_ring_size));
-
 	return 0;
 
 err_free_txring:
diff --git a/drivers/net/ethernet/marvell/skge.c b/drivers/net/ethernet/marvell/skge.c
index 35a92fd2cf39..9ac854c2b371 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -2558,8 +2558,6 @@ static int skge_up(struct net_device *dev)
 		goto free_pci_mem;
 	}
 
-	memset(skge->mem, 0, skge->mem_size);
-
 	err = skge_ring_alloc(&skge->rx_ring, skge->mem, skge->dma);
 	if (err)
 		goto free_pci_mem;
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index a5be27772b8e..c790a5fcea73 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -1013,8 +1013,6 @@ static int mlx4_create_eq(struct mlx4_dev *dev, int nent,
 
 		dma_list[i] = t;
 		eq->page_list[i].map = t;
-
-		memset(eq->page_list[i].buf, 0, PAGE_SIZE);
 	}
 
 	eq->eqn = mlx4_bitmap_alloc(&priv->eq_table.bitmap);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 3b04d8927fb1..1f3891fde2eb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -2450,7 +2450,6 @@ int mlx5_eswitch_get_vport_stats(struct mlx5_eswitch *esw,
 	MLX5_SET(query_vport_counter_in, in, vport_number, vport->vport);
 	MLX5_SET(query_vport_counter_in, in, other_vport, 1);
 
-	memset(out, 0, outlen);
 	err = mlx5_cmd_exec(esw->dev, in, sizeof(in), out, outlen);
 	if (err)
 		goto free_out;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 957d9b09dc3f..089ae4d48a82 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1134,7 +1134,6 @@ static int esw_create_offloads_fdb_tables(struct mlx5_eswitch *esw, int nvports)
 	}
 
 	/* create send-to-vport group */
-	memset(flow_group_in, 0, inlen);
 	MLX5_SET(create_flow_group_in, flow_group_in, match_criteria_enable,
 		 MLX5_MATCH_MISC_PARAMETERS);
 
@@ -1293,8 +1292,6 @@ static int esw_create_vport_rx_group(struct mlx5_eswitch *esw, int nvports)
 		return -ENOMEM;
 
 	/* create vport rx group */
-	memset(flow_group_in, 0, inlen);
-
 	esw_set_flow_group_source_port(esw, flow_group_in);
 
 	MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 0);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 051b19388a81..615455a21567 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -847,7 +847,6 @@ static int mlxsw_pci_queue_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
 					     &mem_item->mapaddr);
 	if (!mem_item->buf)
 		return -ENOMEM;
-	memset(mem_item->buf, 0, mem_item->size);
 
 	q->elem_info = kcalloc(q->count, sizeof(*q->elem_info), GFP_KERNEL);
 	if (!q->elem_info) {
diff --git a/drivers/net/ethernet/neterion/s2io.c b/drivers/net/ethernet/neterion/s2io.c
index 3b2ae1a21678..e0b2bf327905 100644
--- a/drivers/net/ethernet/neterion/s2io.c
+++ b/drivers/net/ethernet/neterion/s2io.c
@@ -747,7 +747,6 @@ static int init_shared_mem(struct s2io_nic *nic)
 				return -ENOMEM;
 			}
 			mem_allocated += size;
-			memset(tmp_v_addr, 0, size);
 
 			size = sizeof(struct rxd_info) *
 				rxd_count[nic->rxd_mode];
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
index 433052f734ed..5e9f8ee99800 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
@@ -442,10 +442,8 @@ nx_fw_cmd_create_tx_ctx(struct netxen_adapter *adapter)
 		goto out_free_rq;
 	}
 
-	memset(rq_addr, 0, rq_size);
 	prq = rq_addr;
 
-	memset(rsp_addr, 0, rsp_size);
 	prsp = rsp_addr;
 
 	prq->host_rsp_dma_addr = cpu_to_le64(rsp_phys_addr);
@@ -755,7 +753,6 @@ int netxen_alloc_hw_resources(struct netxen_adapter *adapter)
 		return -ENOMEM;
 	}
 
-	memset(addr, 0, sizeof(struct netxen_ring_ctx));
 	recv_ctx->hwctx = addr;
 	recv_ctx->hwctx->ctx_id = cpu_to_le32(port);
 	recv_ctx->hwctx->cmd_consumer_offset =
diff --git a/drivers/net/ethernet/ti/tlan.c b/drivers/net/ethernet/ti/tlan.c
index b4ab1a5f6cd0..78f0f2d59e22 100644
--- a/drivers/net/ethernet/ti/tlan.c
+++ b/drivers/net/ethernet/ti/tlan.c
@@ -855,7 +855,6 @@ static int tlan_init(struct net_device *dev)
 		       dev->name);
 		return -ENOMEM;
 	}
-	memset(priv->dma_storage, 0, dma_size);
 	priv->rx_list = (struct tlan_list *)
 		ALIGN((unsigned long)priv->dma_storage, 8);
 	priv->rx_list_dma = ALIGN(priv->dma_storage_dma, 8);
-- 
2.11.0


^ permalink raw reply related

* [PATCH v3 02/24] atm: idt77252: Remove call to memset after dma_alloc_coherent
From: Fuqian Huang @ 2019-07-15  3:17 UTC (permalink / raw)
  Cc: Chas Williams, linux-atm-general, netdev, linux-kernel,
	Fuqian Huang

In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.

Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v3:
  - Use actual commit rather than the merge commit in the commit message

 drivers/atm/idt77252.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/atm/idt77252.c b/drivers/atm/idt77252.c
index 43a14579e80e..df51680e8931 100644
--- a/drivers/atm/idt77252.c
+++ b/drivers/atm/idt77252.c
@@ -1379,7 +1379,6 @@ init_tsq(struct idt77252_dev *card)
 		printk("%s: can't allocate TSQ.\n", card->name);
 		return -1;
 	}
-	memset(card->tsq.base, 0, TSQSIZE);
 
 	card->tsq.last = card->tsq.base + TSQ_NUM_ENTRIES - 1;
 	card->tsq.next = card->tsq.last;
-- 
2.11.0


^ permalink raw reply related

* Re: INFO: rcu detected stall in ext4_write_checks
From: Paul E. McKenney @ 2019-07-15  3:10 UTC (permalink / raw)
  To: Theodore Ts'o, Dmitry Vyukov, syzbot, Andreas Dilger,
	David Miller, eladr, Ido Schimmel, Jiri Pirko, John Stultz,
	linux-ext4, LKML, netdev, syzkaller-bugs, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar
In-Reply-To: <20190714192951.GM26519@linux.ibm.com>

On Sun, Jul 14, 2019 at 12:29:51PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
> > On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > > But short term I don't see any other solution than stop testing
> > > sched_setattr because it does not check arguments enough to prevent
> > > system misbehavior. Which is a pity because syzkaller has found some
> > > bad misconfigurations that were oversight on checking side.
> > > Any other suggestions?
> > 
> > Or maybe syzkaller can put its own limitations on what parameters are
> > sent to sched_setattr?  In practice, there are any number of ways a
> > root user can shoot themselves in the foot when using sched_setattr or
> > sched_setaffinity, for that matter.  I imagine there must be some such
> > constraints already --- or else syzkaller might have set a kernel
> > thread to run with priority SCHED_BATCH, with similar catastrophic
> > effects --- or do similar configurations to make system threads
> > completely unschedulable.
> > 
> > Real time administrators who know what they are doing --- and who know
> > that their real-time threads are well behaved --- will always want to
> > be able to do things that will be catastrophic if the real-time thread
> > is *not* well behaved.  I don't it is possible to add safety checks
> > which would allow the kernel to automatically detect and reject unsafe
> > configurations.
> > 
> > An apt analogy might be civilian versus military aircraft.  Most
> > airplanes are designed to be "inherently stable"; that way, modulo
> > buggy/insane control systems like on the 737 Max, the airplane will
> > automatically return to straight and level flight.  On the other hand,
> > some military planes (for example, the F-16, F-22, F-36, the
> > Eurofighter, etc.) are sometimes designed to be unstable, since that
> > way they can be more maneuverable.
> > 
> > There are use cases for real-time Linux where this flexibility/power
> > vs. stability tradeoff is going to argue for giving root the
> > flexibility to crash the system.  Some of these systems might
> > literally involve using real-time Linux in military applications,
> > something for which Paul and I have had some experience.  :-)
> > 
> > Speaking of sched_setaffinity, one thing which we can do is have
> > syzkaller move all of the system threads to they run on the "system
> > CPU's", and then move the syzkaller processes which are testing the
> > kernel to be on the "system under test CPU's".  Then regardless of
> > what priority the syzkaller test programs try to run themselves at,
> > they can't crash the system.
> > 
> > Some real-time systems do actually run this way, and it's a
> > recommended configuration which is much safer than letting the
> > real-time threads take over the whole system:
> > 
> > http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application
> 
> Good point!  We might still have issues with some per-CPU kthreads,
> but perhaps use of nohz_full would help at least reduce these sorts
> of problems.  (There could still be issues on CPUs with more than
> one runnable threads.)

I looked at testing limitations in a bit more detail from an RCU
viewpoint, and came up with the following rough rule of thumb (which of
course might or might not survive actual testing experience, but should at
least be a good place to start).  I believe that the sched_setaffinity()
testing rule should be that the SCHED_DEADLINE cycle be no more than
two-thirds of the RCU CPU stall warning timeout, which defaults to 21
seconds in mainline and 60 seconds in many distro kernels.

That is, the SCHED_DEADLINE cycle should never exceed 14 seconds when
testing mainline on the one hand or 40 seconds when testing enterprise
distros on the other.

This assumes quite a bit, though:

o	The system has ample memory to spare, and isn't running a
	callback-hungry workload.  For example, if you "only" have 100MB
	of spare memory and you are also repeatedly and concurrently
	expanding (say) large source trees from tarballs and then deleting
	those source trees, the system might OOM.  The reason OOM might
	happen is that each close() of a file generates an RCU callback,
	and 40 seconds worth of waiting-for-a-grace-period structures
	takes up a surprisingly large amount of memory.

	So please be careful when combining tests.  ;-)

o	There are no aggressive real-time workloads on the system.
	The reason for this is that RCU is going to start sending IPIs
	halfway to the RCU CPU stall timeout, and, in certain situations
	on CONFIG_NO_HZ_FULL kernels, much earlier.  (These situations
	constitute abuse of CONFIG_NO_HZ_FULL, but then again carefully
	calibrated abuse is what stress testing is all about.)

o	The various RCU kthreads will get a chance to run at least once
	during the SCHED_DEADLINE cycle.  If in real life, they only
	get a chance to run once per two SCHED_DEADLINE cycles, then of
	course the 14 seconds becomes 7 and the 40 seconds becomes 20.

o	The current RCU CPU stall warning defaults remain in
	place.	These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT
	Kconfig parameter, which may in turn be overridden by the
	rcupdate.rcu_cpu_stall_timeout kernel boot parameter.

o	The current SCHED_DEADLINE default for providing spare cycles
	for other uses remains in place.

o	Other kthreads might have other constraints, but given that you
	were seeing RCU CPU stall warnings instead of other failures,
	the needs of RCU's kthreads seem to be a good place to start.

Again, the candidate rough rule of thumb is that the the SCHED_DEADLINE
cycle be no more than 14 seconds when testing mainline kernels on the one
hand and 40 seconds when testing enterprise distro kernels on the other.

Dmitry, does that help?

							Thanx, Paul

^ permalink raw reply

* Re: [PATCH] mm/gup: Use put_user_page*() instead of put_page*()
From: Jens Axboe @ 2019-07-15  2:33 UTC (permalink / raw)
  To: Bharath Vedartham, akpm, ira.weiny, jhubbard
  Cc: Mauro Carvalho Chehab, Dimitri Sivanich, Arnd Bergmann,
	Greg Kroah-Hartman, Alex Williamson, Cornelia Huck,
	Alexander Viro, Björn Töpel, Magnus Karlsson,
	David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Enrico Weigelt, Thomas Gleixner, Alexios Zavras, Dan Carpenter,
	Max Filippov, Matt Sickler, Kirill A. Shutemov, Keith Busch,
	YueHaibing, linux-media, linux-kernel, devel, kvm, linux-block,
	linux-fsdevel, linux-mm, netdev, bpf, xdp-newbies
In-Reply-To: <1563131456-11488-1-git-send-email-linux.bhar@gmail.com>

On 7/14/19 1:08 PM, Bharath Vedartham wrote:
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 4ef62a4..b4a4549 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2694,10 +2694,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
>   			 * if we did partial map, or found file backed vmas,
>   			 * release any pages we did get
>   			 */
> -			if (pret > 0) {
> -				for (j = 0; j < pret; j++)
> -					put_page(pages[j]);
> -			}
> +			if (pret > 0)
> +				put_user_pages(pages, pret);
> +
>   			if (ctx->account_mem)
>   				io_unaccount_mem(ctx->user, nr_pages);
>   			kvfree(imu->bvec);

You handled just the failure case of the buffer registration, but not
the actual free in io_sqe_buffer_unregister().

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH net-next 01/16] qlge: Remove irq_cnt
From: Benjamin Poirier @ 2019-07-15  1:40 UTC (permalink / raw)
  To: David Miller, Greg Kroah-Hartman; +Cc: Manish Chopra, GR-Linux-NIC-Dev, netdev
In-Reply-To: <20190617074858.32467-1-bpoirier@suse.com>

On 2019/06/17 16:48, Benjamin Poirier wrote:
> qlge uses an irq enable/disable refcounting scheme that is:
> * poorly implemented
> 	Uses a spin_lock to protect accesses to the irq_cnt atomic variable
> * buggy
> 	Breaks when there is not a 1:1 sequence of irq - napi_poll, such as
> 	when using SO_BUSY_POLL.
> * unnecessary
> 	The purpose or irq_cnt is to reduce irq control writes when
> 	multiple work items result from one irq: the irq is re-enabled
> 	after all work is done.
> 	Analysis of the irq handler shows that there is only one case where
> 	there might be two workers scheduled at once, and those have
> 	separate irq masking bits.
> 
> Therefore, remove irq_cnt.
> 
> Additionally, we get a performance improvement:
> perf stat -e cycles -a -r5 super_netperf 100 -H 192.168.33.1 -t TCP_RR
> 
> Before:
> 628560
> 628056
> 622103
> 622744
> 627202
> [...]
>    268,803,947,669      cycles                 ( +-  0.09% )
> 
> After:
> 636300
> 634106
> 634984
> 638555
> 634188
> [...]
>    259,237,291,449      cycles                 ( +-  0.19% )
> 
> Signed-off-by: Benjamin Poirier <bpoirier@suse.com>

David, Greg,

Before I send v2 of this patchset, how about moving this driver out to
staging? The hardware has been declared EOL by the vendor but what's
more relevant to the Linux kernel is that the quality of this driver is
not on par with many other mainline drivers, IMO. Over in staging, the
code might benefit from the attention of interested parties (drive-by
fixers) or fade away into obscurity.

Now that I check, the code has >500 basic checkpatch issues. While
working on the rx processing code for the current patchset, I noticed
the following more structural issues:
* commit 7c734359d350 ("qlge: Size RX buffers based on MTU.",
  v2.6.33-rc1) introduced dead code in the receive routines, which
  should be rewritten anyways by the admission of the author himself
  (see the comment above ql_build_rx_skb())
* truesize accounting is incorrect (ex: a 9000B frame has truesize 10280
  while containing two frags of order-1 allocations, ie. >16K)
* in some cases the driver allocates skbs with head room but only puts
  data in the frags
* there is an inordinate amount of disparate debugging code, most of
  which is of questionable value. In particular, qlge_dbg.c has hundreds
  of lines of code bitrotting away in ifdef land (doesn't compile since
  commit 18c49b91777c ("qlge: do vlan cleanup", v3.1-rc1), 8 years ago).
* triggering an ethtool regdump will instead hexdump a 176k struct to
  dmesg depending on some module parameters
* many structures are defined full of holes, as we found in this
  thread already; are used inefficiently (struct rx_ring is used for rx
  and tx completions, with some members relevant to one case only); are
  initialized redundantly (ex. memset 0 after alloc_etherdev). The
  driver also has a habit of using runtime checks where compile time
  checks are possible.
* in terms of namespace, the driver uses either qlge_, ql_ (used by
  other qlogic drivers, with clashes, ex: ql_sem_spinlock) or nothing (with
  clashes, ex: struct ob_mac_iocb_req)

I'm guessing other parts of the driver have more issues of the same
nature. The driver also has many smaller issues like deprecated apis,
duplicate or useless comments, weird code structure (some "while" loops
that could be rewritten with simple "for", ex. ql_wait_reg_rdy()) and
weird line wrapping (all over, ex. the ql_set_routing_reg() calls in
qlge_set_multicast_list()).

I'm aware of some precedents for code moving from mainline to staging:
1bb8155080c6 ncpfs: move net/ncpfs to drivers/staging/ncpfs (v4.16-rc1)
6c391ff758eb irda: move drivers/net/irda to drivers/staging/irda/drivers
(v4.14-rc1)

What do you think is the best course of action in this case?

Thanks,
-Benjamin

^ permalink raw reply

* Re: [PATCH] mm/gup: Use put_user_page*() instead of put_page*()
From: John Hubbard @ 2019-07-14 23:33 UTC (permalink / raw)
  To: Bharath Vedartham, akpm, ira.weiny
  Cc: Mauro Carvalho Chehab, Dimitri Sivanich, Arnd Bergmann,
	Greg Kroah-Hartman, Alex Williamson, Cornelia Huck, Jens Axboe,
	Alexander Viro, Björn Töpel, Magnus Karlsson,
	David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Enrico Weigelt, Thomas Gleixner, Alexios Zavras, Dan Carpenter,
	Max Filippov, Matt Sickler, Kirill A. Shutemov, Keith Busch,
	YueHaibing, linux-media, linux-kernel, devel, kvm, linux-block,
	linux-fsdevel, linux-mm, netdev, bpf, xdp-newbies,
	Jason Gunthorpe
In-Reply-To: <1563131456-11488-1-git-send-email-linux.bhar@gmail.com>

On 7/14/19 12:08 PM, Bharath Vedartham wrote:
> This patch converts all call sites of get_user_pages
> to use put_user_page*() instead of put_page*() functions to
> release reference to gup pinned pages.

Hi Bharath,

Thanks for jumping in to help, and welcome to the party!

You've caught everyone in the middle of a merge window, btw.  As a
result, I'm busy rebasing and reworking the get_user_pages call sites, 
and gup tracking, in the wake of some semi-traumatic changes to bio 
and gup and such. I plan to re-post right after 5.3-rc1 shows up, from 
here:

    https://github.com/johnhubbard/linux/commits/gup_dma_core

...which you'll find already covers the changes you've posted, except for:

    drivers/misc/sgi-gru/grufault.c
    drivers/staging/kpc2000/kpc_dma/fileops.c

...and this one, which is undergoing to larger local changes, due to
bvec, so let's leave it out of the choices:

    fs/io_uring.c

Therefore, until -rc1, if you'd like to help, I'd recommend one or more
of the following ideas:

1. Pull down https://github.com/johnhubbard/linux/commits/gup_dma_core
and find missing conversions: look for any additional missing 
get_user_pages/put_page conversions. You've already found a couple missing 
ones. I haven't re-run a search in a long time, so there's probably even more.

	a) And find more, after I rebase to 5.3-rc1: people probably are adding
	get_user_pages() calls as we speak. :)

2. Patches: Focus on just one subsystem at a time, and perfect the patch for
it. For example, I think this the staging driver would be perfect to start with:

    drivers/staging/kpc2000/kpc_dma/fileops.c

	a) verify that you've really, corrected converted the whole
	driver. (Hint: I think you might be overlooking a put_page call.)

	b) Attempt to test it if you can (I'm being hypocritical in
	the extreme here, but one of my problems is that testing
	has been light, so any help is very valuable). qemu...?
	OTOH, maybe even qemu cannot easily test a kpc2000, but
	perhaps `git blame` and talking to the authors would help
	figure out a way to validate the changes.

	Thinking about whether you can run a test that would prove or
	disprove my claim in (a), above, could be useful in coming up
	with tests to run.

In other words, a few very high quality conversions (even just one) that
we can really put our faith in, is what I value most here. Tested patches
are awesome.

3. Once I re-post, turn on the new CONFIG_DEBUG_GET_USER_PAGES_REFERENCES
and run things such as xfstest/fstest. (Again, doing so would be going
further than I have yet--very helpful). Help clarify what conversions have
actually been tested and work, and which ones remain unvalidated.

Other: Please note that this:

    https://github.com/johnhubbard/linux/commits/gup_dma_core

    a) gets rebased often, and

    b) has a bunch of commits (iov_iter and related) that conflict
       with the latest linux.git,

    c) has some bugs in the bio area, that I'm fixing, so I don't trust
       that's it's safely runnable, for a few more days.

One note below, for the future:

> 
> This is a bunch of trivial conversions which is a part of an effort
> by John Hubbard to solve issues with gup pinned pages and 
> filesystem writeback.
> 
> The issue is more clearly described in John Hubbard's patch[1] where
> put_user_page*() functions are introduced.
> 
> Currently put_user_page*() simply does put_page but future implementations
> look to change that once treewide change of put_page callsites to 
> put_user_page*() is finished.
> 
> The lwn article describing the issue with gup pinned pages and filesystem 
> writeback [2].
> 
> This patch has been tested by building and booting the kernel as I don't
> have the required hardware to test the device drivers.
> 
> I did not modify gpu/drm drivers which use release_pages instead of
> put_page() to release reference of gup pinned pages as I am not clear
> whether release_pages and put_page are interchangable. 
> 
> [1] https://lkml.org/lkml/2019/3/26/1396

When referring to patches in a commit description, please use the 
commit hash, not an external link. See Submitting Patches [1] for details.

Also, once you figure out the right maintainers and other involved people,
putting Cc: in the commit description is common practice, too.

[1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html

thanks,
-- 
John Hubbard
NVIDIA

> 
> [2] https://lwn.net/Articles/784574/
> 
> Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
> ---
>  drivers/media/v4l2-core/videobuf-dma-sg.c | 3 +--
>  drivers/misc/sgi-gru/grufault.c           | 2 +-
>  drivers/staging/kpc2000/kpc_dma/fileops.c | 4 +---
>  drivers/vfio/vfio_iommu_type1.c           | 2 +-
>  fs/io_uring.c                             | 7 +++----
>  mm/gup_benchmark.c                        | 6 +-----
>  net/xdp/xdp_umem.c                        | 7 +------
>  7 files changed, 9 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
> index 66a6c6c..d6eeb43 100644
> --- a/drivers/media/v4l2-core/videobuf-dma-sg.c
> +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
> @@ -349,8 +349,7 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
>  	BUG_ON(dma->sglen);
>  
>  	if (dma->pages) {
> -		for (i = 0; i < dma->nr_pages; i++)
> -			put_page(dma->pages[i]);
> +		put_user_pages(dma->pages, dma->nr_pages);
>  		kfree(dma->pages);
>  		dma->pages = NULL;
>  	}
> diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
> index 4b713a8..61b3447 100644
> --- a/drivers/misc/sgi-gru/grufault.c
> +++ b/drivers/misc/sgi-gru/grufault.c
> @@ -188,7 +188,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
>  	if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0)
>  		return -EFAULT;
>  	*paddr = page_to_phys(page);
> -	put_page(page);
> +	put_user_page(page);
>  	return 0;
>  }
>  
> diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
> index 6166587..26dceed 100644
> --- a/drivers/staging/kpc2000/kpc_dma/fileops.c
> +++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
> @@ -198,9 +198,7 @@ int  kpc_dma_transfer(struct dev_private_data *priv, struct kiocb *kcb, unsigned
>  	sg_free_table(&acd->sgt);
>   err_dma_map_sg:
>   err_alloc_sg_table:
> -	for (i = 0 ; i < acd->page_count ; i++){
> -		put_page(acd->user_pages[i]);
> -	}
> +	put_user_pages(acd->user_pages, acd->page_count);
>   err_get_user_pages:
>  	kfree(acd->user_pages);
>   err_alloc_userpages:
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index add34ad..c491524 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -369,7 +369,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  		 */
>  		if (ret > 0 && vma_is_fsdax(vmas[0])) {
>  			ret = -EOPNOTSUPP;
> -			put_page(page[0]);
> +			put_user_page(page[0]);
>  		}
>  	}
>  	up_read(&mm->mmap_sem);
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 4ef62a4..b4a4549 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2694,10 +2694,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
>  			 * if we did partial map, or found file backed vmas,
>  			 * release any pages we did get
>  			 */
> -			if (pret > 0) {
> -				for (j = 0; j < pret; j++)
> -					put_page(pages[j]);
> -			}
> +			if (pret > 0)
> +				put_user_pages(pages, pret);
> +
>  			if (ctx->account_mem)
>  				io_unaccount_mem(ctx->user, nr_pages);
>  			kvfree(imu->bvec);
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 7dd602d..15fc7a2 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -76,11 +76,7 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
>  	gup->size = addr - gup->addr;
>  
>  	start_time = ktime_get();
> -	for (i = 0; i < nr_pages; i++) {
> -		if (!pages[i])
> -			break;
> -		put_page(pages[i]);
> -	}
> +	put_user_pages(pages, nr_pages);
>  	end_time = ktime_get();
>  	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
>  
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 9c6de4f..6103e19 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -173,12 +173,7 @@ static void xdp_umem_unpin_pages(struct xdp_umem *umem)
>  {
>  	unsigned int i;
>  
> -	for (i = 0; i < umem->npgs; i++) {
> -		struct page *page = umem->pgs[i];
> -
> -		set_page_dirty_lock(page);
> -		put_page(page);
> -	}
> +	put_user_pages_dirty_lock(umem->pgs, umem->npgs);
>  
>  	kfree(umem->pgs);
>  	umem->pgs = NULL;
> 

^ permalink raw reply

* [PATCH net v3] net: neigh: fix multiple neigh timer scheduling
From: Lorenzo Bianconi @ 2019-07-14 21:36 UTC (permalink / raw)
  To: davem; +Cc: netdev, dsahern, marek

Neigh timer can be scheduled multiple times from userspace adding
multiple neigh entries and forcing the neigh timer scheduling passing
NTF_USE in the netlink requests.
This will result in a refcount leak and in the following dump stack:

[   32.465295] NEIGH: BUG, double timer add, state is 8
[   32.465308] CPU: 0 PID: 416 Comm: double_timer_ad Not tainted 5.2.0+ #65
[   32.465311] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[   32.465313] Call Trace:
[   32.465318]  dump_stack+0x7c/0xc0
[   32.465323]  __neigh_event_send+0x20c/0x880
[   32.465326]  ? ___neigh_create+0x846/0xfb0
[   32.465329]  ? neigh_lookup+0x2a9/0x410
[   32.465332]  ? neightbl_fill_info.constprop.0+0x800/0x800
[   32.465334]  neigh_add+0x4f8/0x5e0
[   32.465337]  ? neigh_xmit+0x620/0x620
[   32.465341]  ? find_held_lock+0x85/0xa0
[   32.465345]  rtnetlink_rcv_msg+0x204/0x570
[   32.465348]  ? rtnl_dellink+0x450/0x450
[   32.465351]  ? mark_held_locks+0x90/0x90
[   32.465354]  ? match_held_lock+0x1b/0x230
[   32.465357]  netlink_rcv_skb+0xc4/0x1d0
[   32.465360]  ? rtnl_dellink+0x450/0x450
[   32.465363]  ? netlink_ack+0x420/0x420
[   32.465366]  ? netlink_deliver_tap+0x115/0x560
[   32.465369]  ? __alloc_skb+0xc9/0x2f0
[   32.465372]  netlink_unicast+0x270/0x330
[   32.465375]  ? netlink_attachskb+0x2f0/0x2f0
[   32.465378]  netlink_sendmsg+0x34f/0x5a0
[   32.465381]  ? netlink_unicast+0x330/0x330
[   32.465385]  ? move_addr_to_kernel.part.0+0x20/0x20
[   32.465388]  ? netlink_unicast+0x330/0x330
[   32.465391]  sock_sendmsg+0x91/0xa0
[   32.465394]  ___sys_sendmsg+0x407/0x480
[   32.465397]  ? copy_msghdr_from_user+0x200/0x200
[   32.465401]  ? _raw_spin_unlock_irqrestore+0x37/0x40
[   32.465404]  ? lockdep_hardirqs_on+0x17d/0x250
[   32.465407]  ? __wake_up_common_lock+0xcb/0x110
[   32.465410]  ? __wake_up_common+0x230/0x230
[   32.465413]  ? netlink_bind+0x3e1/0x490
[   32.465416]  ? netlink_setsockopt+0x540/0x540
[   32.465420]  ? __fget_light+0x9c/0xf0
[   32.465423]  ? sockfd_lookup_light+0x8c/0xb0
[   32.465426]  __sys_sendmsg+0xa5/0x110
[   32.465429]  ? __ia32_sys_shutdown+0x30/0x30
[   32.465432]  ? __fd_install+0xe1/0x2c0
[   32.465435]  ? lockdep_hardirqs_off+0xb5/0x100
[   32.465438]  ? mark_held_locks+0x24/0x90
[   32.465441]  ? do_syscall_64+0xf/0x270
[   32.465444]  do_syscall_64+0x63/0x270
[   32.465448]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fix the issue unscheduling neigh_timer if selected entry is in 'IN_TIMER'
receiving a netlink request with NTF_USE flag set

Reported-by: Marek Majkowski <marek@cloudflare.com>
Fixes: 0c5c2d308906 ("neigh: Allow for user space users of the neighbour table")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
---
Changes since v2:
- remove check_timer flag and run neigh_del_timer directly
Changes since v1:
- fix compilation errors defining neigh_event_send_check_timer routine
---
 net/core/neighbour.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 742cea4ce72e..0dfc97bc8760 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1124,6 +1124,7 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 
 			atomic_set(&neigh->probes,
 				   NEIGH_VAR(neigh->parms, UCAST_PROBES));
+			neigh_del_timer(neigh);
 			neigh->nud_state     = NUD_INCOMPLETE;
 			neigh->updated = now;
 			next = now + max(NEIGH_VAR(neigh->parms, RETRANS_TIME),
@@ -1140,6 +1141,7 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 		}
 	} else if (neigh->nud_state & NUD_STALE) {
 		neigh_dbg(2, "neigh %p is delayed\n", neigh);
+		neigh_del_timer(neigh);
 		neigh->nud_state = NUD_DELAY;
 		neigh->updated = jiffies;
 		neigh_add_timer(neigh, jiffies +
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH] sky2: Disable MSI on P5W DH Deluxe
From: David Miller @ 2019-07-14 20:46 UTC (permalink / raw)
  To: tasos; +Cc: netdev, linux-kernel, stephen, mlindner
In-Reply-To: <14c872c3-09ac-7f51-dc3d-e68319459fcf@tasossah.com>

From: Tasos Sahanidis <tasos@tasossah.com>
Date: Sun, 14 Jul 2019 13:31:11 +0300

> The onboard sky2 NICs send IRQs after S3, resulting in ethernet not
> working after resume.
> Maskable MSI and MSI-X are also not supported, so fall back to INTx.
> 
> Signed-off-by: Tasos Sahanidis <tasos@tasossah.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net v2] net: neigh: fix multiple neigh timer scheduling
From: Lorenzo Bianconi @ 2019-07-14 20:04 UTC (permalink / raw)
  To: David Ahern; +Cc: davem, netdev, marek
In-Reply-To: <26f58e35-f1f8-9543-819f-ef7f52da1e49@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1136 bytes --]

> On 7/14/19 2:45 AM, Lorenzo Bianconi wrote:
> > @@ -1124,7 +1125,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
> >  
> >  			atomic_set(&neigh->probes,
> >  				   NEIGH_VAR(neigh->parms, UCAST_PROBES));
> > -			neigh->nud_state     = NUD_INCOMPLETE;
> > +			if (check_timer)
> > +				neigh_del_timer(neigh);
> 
> Why not just always call neigh_del_timer and avoid the check_timer flag?
> Let the NUD_IN_TIMER flag handle whether anything needs to be done.

ack, I have been too paranoid here. I will post a v3 fixing it.

Regards,
Lorenzo

> 
> > +			neigh->nud_state = NUD_INCOMPLETE;
> >  			neigh->updated = now;
> >  			next = now + max(NEIGH_VAR(neigh->parms, RETRANS_TIME),
> >  					 HZ/2);
> > @@ -1140,6 +1143,8 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
> >  		}
> >  	} else if (neigh->nud_state & NUD_STALE) {
> >  		neigh_dbg(2, "neigh %p is delayed\n", neigh);
> > +		if (check_timer)
> > +			neigh_del_timer(neigh);
> >  		neigh->nud_state = NUD_DELAY;
> >  		neigh->updated = jiffies;
> >  		neigh_add_timer(neigh, jiffies +
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: INFO: rcu detected stall in ext4_write_checks
From: Paul E. McKenney @ 2019-07-14 19:29 UTC (permalink / raw)
  To: Theodore Ts'o, Dmitry Vyukov, syzbot, Andreas Dilger,
	David Miller, eladr, Ido Schimmel, Jiri Pirko, John Stultz,
	linux-ext4, LKML, netdev, syzkaller-bugs, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar
In-Reply-To: <20190714190522.GA24049@mit.edu>

On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
> On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > But short term I don't see any other solution than stop testing
> > sched_setattr because it does not check arguments enough to prevent
> > system misbehavior. Which is a pity because syzkaller has found some
> > bad misconfigurations that were oversight on checking side.
> > Any other suggestions?
> 
> Or maybe syzkaller can put its own limitations on what parameters are
> sent to sched_setattr?  In practice, there are any number of ways a
> root user can shoot themselves in the foot when using sched_setattr or
> sched_setaffinity, for that matter.  I imagine there must be some such
> constraints already --- or else syzkaller might have set a kernel
> thread to run with priority SCHED_BATCH, with similar catastrophic
> effects --- or do similar configurations to make system threads
> completely unschedulable.
> 
> Real time administrators who know what they are doing --- and who know
> that their real-time threads are well behaved --- will always want to
> be able to do things that will be catastrophic if the real-time thread
> is *not* well behaved.  I don't it is possible to add safety checks
> which would allow the kernel to automatically detect and reject unsafe
> configurations.
> 
> An apt analogy might be civilian versus military aircraft.  Most
> airplanes are designed to be "inherently stable"; that way, modulo
> buggy/insane control systems like on the 737 Max, the airplane will
> automatically return to straight and level flight.  On the other hand,
> some military planes (for example, the F-16, F-22, F-36, the
> Eurofighter, etc.) are sometimes designed to be unstable, since that
> way they can be more maneuverable.
> 
> There are use cases for real-time Linux where this flexibility/power
> vs. stability tradeoff is going to argue for giving root the
> flexibility to crash the system.  Some of these systems might
> literally involve using real-time Linux in military applications,
> something for which Paul and I have had some experience.  :-)
> 
> Speaking of sched_setaffinity, one thing which we can do is have
> syzkaller move all of the system threads to they run on the "system
> CPU's", and then move the syzkaller processes which are testing the
> kernel to be on the "system under test CPU's".  Then regardless of
> what priority the syzkaller test programs try to run themselves at,
> they can't crash the system.
> 
> Some real-time systems do actually run this way, and it's a
> recommended configuration which is much safer than letting the
> real-time threads take over the whole system:
> 
> http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application

Good point!  We might still have issues with some per-CPU kthreads,
but perhaps use of nohz_full would help at least reduce these sorts
of problems.  (There could still be issues on CPUs with more than
one runnable threads.)

							Thanx, Paul


^ permalink raw reply

* Re: [PATCH] sis900: correct a few typos
From: David Miller @ 2019-07-14 19:22 UTC (permalink / raw)
  To: sergej.benilov; +Cc: venza, netdev, andrew
In-Reply-To: <20190714165627.32765-1-sergej.benilov@googlemail.com>

From: Sergej Benilov <sergej.benilov@googlemail.com>
Date: Sun, 14 Jul 2019 18:56:27 +0200

> Correct a few typos in comments and debug text.
> 
> Signed-off-by: Sergej Benilov <sergej.benilov@googlemail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net v2] net: neigh: fix multiple neigh timer scheduling
From: David Miller @ 2019-07-14 19:15 UTC (permalink / raw)
  To: dsahern; +Cc: lorenzo.bianconi, netdev, marek
In-Reply-To: <26f58e35-f1f8-9543-819f-ef7f52da1e49@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Sun, 14 Jul 2019 07:58:19 -0600

> On 7/14/19 2:45 AM, Lorenzo Bianconi wrote:
>> @@ -1124,7 +1125,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>>  
>>  			atomic_set(&neigh->probes,
>>  				   NEIGH_VAR(neigh->parms, UCAST_PROBES));
>> -			neigh->nud_state     = NUD_INCOMPLETE;
>> +			if (check_timer)
>> +				neigh_del_timer(neigh);
> 
> Why not just always call neigh_del_timer and avoid the check_timer flag?
> Let the NUD_IN_TIMER flag handle whether anything needs to be done.

Agreed.

^ permalink raw reply

* Re: [PATCH] sit: use dst_cache in ipip6_tunnel_xmit
From: David Miller @ 2019-07-14 19:15 UTC (permalink / raw)
  To: yanhaishuang; +Cc: kuznet, yoshfuji, netdev, linux-kernel
In-Reply-To: <1563111082-10721-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Sun, 14 Jul 2019 21:31:22 +0800

> Same as other ip tunnel, use dst_cache in xmit action to avoid
> unnecessary fib lookups.
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>

Looks good, applied, thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox