Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
From: Rusty Russell @ 2014-10-15  5:40 UTC (permalink / raw)
  To: Jason Wang, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm
In-Reply-To: <1413011806-3813-2-git-send-email-jasowang@redhat.com>

Jason Wang <jasowang@redhat.com> writes:
> Below should be useful for some experiments Jason is doing.
> I thought I'd send it out for early review/feedback.
>
> event idx feature allows us to defer interrupts until
> a specific # of descriptors were used.
> Sometimes it might be useful to get an interrupt after
> a specific descriptor, regardless.
> This adds a descriptor flag for this, and an API
> to create an urgent output descriptor.
> This is still an RFC:
> we'll need a feature bit for drivers to detect this,
> but we've run out of feature bits for virtio 0.X.
> For experimentation purposes, drivers can assume
> this is set, or add a driver-specific feature bit.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

The new VRING_DESC_F_URGENT bit is theoretically nicer, but for
networking (which tends to take packets in order) couldn't we just set
the event counter to give us a tx interrupt at the packet we want?

Cheers,
Rusty.

^ permalink raw reply

* RE: [PATCH] net: use hardware buffer pool to allocate skb
From: Jiafei.Pan @ 2014-10-15  5:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem@davemloft.net, jkosina@suse.cz, netdev@vger.kernel.org,
	LeoLi@freescale.com, linux-doc@vger.kernel.org,
	Jiafei.Pan@freescale.com
In-Reply-To: <1413346503.17109.23.camel@edumazet-glaptop2.roam.corp.google.com>



> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Wednesday, October 15, 2014 12:15 PM
> To: Pan Jiafei-B37022
> Cc: davem@davemloft.net; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-
> R58472; linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> On Wed, 2014-10-15 at 11:26 +0800, Pan Jiafei wrote:
> > In some platform, there are some hardware block provided
> > to manage buffers to improve performance. So in some case,
> > it is expected that the packets received by some generic
> > NIC should be put into such hardware managed buffers
> > directly, so that such buffer can be released by hardware
> > or by driver.
> 
> You repeat 'some' four times.
> 
> >
> > This patch provide such general APIs for generic NIC to
> > use hardware block managed buffers without any modification
> > for generic NIC drivers.
> 
> ...
> 
> > In this patch, the following fields are added to "net_device":
> >     void *hw_skb_priv;
> >     struct sk_buff *(*alloc_hw_skb)(void *hw_skb_priv, unsigned int length);
> >     void (*free_hw_skb)(struct sk_buff *skb);
> > so in order to let generic NIC driver to use hardware managed
> > buffers, the function "alloc_hw_skb" and "free_hw_skb"
> > provide implementation for allocate and free hardware managed
> > buffers. "hw_skb_priv" is provided to pass some private data for
> > these two functions.
> >
> > When the socket buffer is allocated by these APIs, "hw_skb_state"
> > is provided in struct "sk_buff". this argument can indicate
> > that the buffer is hardware managed buffer, this buffer
> > should freed by software or by hardware.
> >
> > Documentation on how to use this featue can be found at
> > <file:Documentation/networking/hw_skb.txt>.
> >
> > Signed-off-by: Pan Jiafei <Jiafei.Pan@freescale.com>
> 
> 
> I am giving a strong NACK, of course.
> 
> We are not going to grow sk_buff and add yet another conditional in fast
> path for a very obscure feature like that.
> 
> Memory management is not going to be done by drivers.
> 
> The way it should work is that if your hardware has specific needs, rx
> and tx paths (of the driver) need to make the needed adaptation.
> Not the other way.
> 
> We already have complex skb layouts, we do not need a new one.
> 
> Take a look at how drivers can 'lock' pages already, and build skb sith
> page frags. It is already there.
> 
For my case, there are some shortcoming to use page frags. Firstly, I have to
modify each Ethernet drivers to support it especially I don’t which vendor's
driver the customer will use. Secondly, it is not enough only 
build skb by 'lock' pages, the buffer address comes from hardware block and
should be aligned to hardware block.

^ permalink raw reply

* RE: [PATCH] net: use hardware buffer pool to allocate skb
From: Jiafei.Pan @ 2014-10-15  5:47 UTC (permalink / raw)
  To: Oliver Hartkopp, davem@davemloft.net, jkosina@suse.cz
  Cc: netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org, Jiafei.Pan@freescale.com
In-Reply-To: <543DFF2C.40306@hartkopp.net>



> -----Original Message-----
> From: Oliver Hartkopp [mailto:socketcan@hartkopp.net]
> Sent: Wednesday, October 15, 2014 12:59 PM
> To: Pan Jiafei-B37022; davem@davemloft.net; jkosina@suse.cz
> Cc: netdev@vger.kernel.org; Li Yang-Leo-R58472; linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> On 15.10.2014 05:26, Pan Jiafei wrote:
> > In some platform, there are some hardware block provided
> > to manage buffers to improve performance.
> 
> (..)
[Pan Jiafei] I want to build a general patch to build skb
by using hardware buffer manager block.
> 
> > diff --git a/net/Kconfig b/net/Kconfig
> > index d6b138e..346e021 100644
> > --- a/net/Kconfig
> > +++ b/net/Kconfig
> > @@ -291,6 +291,16 @@ config NET_FLOW_LIMIT
> >   	  with many clients some protection against DoS by a single (spoofed)
> >   	  flow that greatly exceeds average workload.
> >
> > +config USE_HW_SKB
> > +	bool "NIC use hardware managed buffer to build skb"
> > +	depends on INET
> 
> The feature seems to be valid for network devices in general.
> Why did you make it depending on INET ??
> 
> Regards,
> Oliver
> 
[Pan Jiafei] Yes, INET dependency should be removed, thanks.
> > +	---help---
> > +	  If select this, the third party drivers will use hardware managed
> > +	  buffers to allocate SKB without any modification for the driver.
> > +
> > +	  Documentation on how to use this featue can be found at
> > +	  <file:Documentation/networking/hw_skb.txt>.
> > +
> >   menu "Network testing"
> >
> >   config NET_PKTGEN


^ permalink raw reply

* Re: [PATCH v2] ipv4: dst_entry leak in ip_append_data()
From: Vasily Averin @ 2014-10-15  6:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy
In-Reply-To: <1413348385.12304.0.camel@edumazet-glaptop2.roam.corp.google.com>

On 15.10.2014 08:46, Eric Dumazet wrote:
> On Tue, 2014-10-14 at 08:57 +0400, Vasily Averin wrote:
>> v2: adjust the indentation of the arguments __ip_append_data() call
>>
>> Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
>>
>> If sk_write_queue is empty ip_append_data() executes ip_setup_cork()
>> that "steals" dst entry from rt to cork. Later it calls __ip_append_data()
>> that creates skb and adds it to sk_write_queue.
>>
>> If skb was added successfully following ip_push_pending_frames() call
>> reassign dst entries from cork to skb, and kfree_skb frees dst_entry.
>>
>> However nobody frees stolen dst_entry if skb was not added into sk_write_queue.
> 
> I thought this was done by ip_flush_pending_frames() ?

Take look at ip_send_unicast_reply():

ip_flush_pending_frames() is not called if skb was not added to sk_write_queue.
And ip_rt_put() does not work, because dst entry was stolen in ip_setup_cork().

Probably it can happen in raw_sendmsg() and udp_sendmsg() too.

^ permalink raw reply

* Re: [PATCH v4 03/25] virtio-pci: move freeze/restore to virtio core
From: Paul Bolle @ 2014-10-15  7:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Valentin Rothberg, linux-kernel, Rusty Russell, virtualization,
	linux-scsi, linux-s390, v9fs-developer, netdev, kvm, Amit Shah,
	Cornelia Huck, Christian Borntraeger, David S. Miller,
	Paolo Bonzini, Heinz Graalfs
In-Reply-To: <1413114332-626-4-git-send-email-mst-v4@redhat.com>

On Mon, 2014-10-13 at 10:48 +0300, Michael S. Tsirkin wrote:
> This is in preparation to extending config changed event handling
> in core.
> Wrapping these in an API also seems to make for a cleaner code.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>

This patch landed in today's linux-next (next-20141015).

>  include/linux/virtio.h      |  6 +++++
>  drivers/virtio/virtio.c     | 54 +++++++++++++++++++++++++++++++++++++++++++++
>  drivers/virtio/virtio_pci.c | 54 ++-------------------------------------------
>  3 files changed, 62 insertions(+), 52 deletions(-)
> 
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 3c19bd3..8df7ba8 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -78,6 +78,7 @@ bool virtqueue_is_broken(struct virtqueue *vq);
>  /**
>   * virtio_device - representation of a device using virtio
>   * @index: unique position on the virtio bus
> + * @failed: saved value for CONFIG_S_FAILED bit (for restore)

s/CONFIG_S_FAILED/VIRTIO_CONFIG_S_FAILED/?

>   * @dev: underlying device.
>   * @id: the device type identification (used to match it with a driver).
>   * @config: the configuration ops for this device.


Paul Bolle

^ permalink raw reply

* [RFC PATCH net-next 0/6] Always use tx interrupt for virtio-net
From: Jason Wang @ 2014-10-15  7:25 UTC (permalink / raw)
  To: rusty, mst, virtualization, linux-kernel, netdev; +Cc: davem, eric.dumazet

According to David, proper accounting and queueing (at all levels, not
just TCP sockets) is more important than trying to skim a bunch of
cycles by avoiding TX interrupts. Having an event to free the SKB is
absolutely essential for the stack to operate correctly.

This series tries to enable tx interrupt for virtio-net. The idea is
simple: enable tx interrupt and schedule a tx napi to free old xmit
skbs.

Several notes:
- Tx interrupt storm avoidance when queue is about to be full is
  kept. Since we may enable callbacks in both ndo_start_xmit() and tx
  napi, patch 1 adds a check to make sure used event never go
  back. This will let the napi not enable the callbacks wrongly after
  delayed callbacks was used.
- For bulk dequeuing, there's no need to enable tx interrupt for each
  packet. The last patch only enable tx interrupt for the final skb in
  the chain through xmit_more and a new helper to publish current avail
  idx as used event.

This series fixes several issues of original rfc pointed out by Michael.

Smoking test is done, and will do complete netperf test. Send the
seires for early review.

Thanks

Jason Wang (6):
  virtio: make sure used event never go backwards
  virtio: introduce virtio_enable_cb_avail()
  virtio-net: small optimization on free_old_xmit_skbs()
  virtio-net: return the number of packets sent in free_old_xmit_skbs()
  virtio-net: enable tx interrupt
  virtio-net: enable tx interrupt only for the final skb in the chain

 drivers/net/virtio_net.c     |  125 ++++++++++++++++++++++++++++++------------
 drivers/virtio/virtio_ring.c |   25 ++++++++-
 include/linux/virtio.h       |    2 +
 3 files changed, 115 insertions(+), 37 deletions(-)

^ permalink raw reply

* [RFC PATCH net-next 1/6] virtio: make sure used event never go backwards
From: Jason Wang @ 2014-10-15  7:25 UTC (permalink / raw)
  To: rusty, mst, virtualization, linux-kernel, netdev; +Cc: davem, eric.dumazet
In-Reply-To: <1413357930-45302-1-git-send-email-jasowang@redhat.com>

This patch checks the new event idx to make sure used event idx never
goes back. This is used to synchronize the calls between
virtqueue_enable_cb_delayed() and virtqueue_enable_cb().

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/virtio/virtio_ring.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 3b1f89b..1b3929f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -559,14 +559,17 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
 	u16 last_used_idx;
 
 	START_USE(vq);
-
+	last_used_idx = vq->last_used_idx;
 	/* We optimistically turn back on interrupts, then check if there was
 	 * more to do. */
 	/* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
 	 * either clear the flags bit or point the event index at the next
 	 * entry. Always do both to keep code simple. */
 	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
-	vring_used_event(&vq->vring) = last_used_idx = vq->last_used_idx;
+	/* Make sure used event never go backwards */
+	if (!vring_need_event(vring_used_event(&vq->vring),
+			      vq->vring.avail->idx, last_used_idx))
+		vring_used_event(&vq->vring) = last_used_idx;
 	END_USE(vq);
 	return last_used_idx;
 }
-- 
1.7.1

^ permalink raw reply related

* [RFC PATCH net-next 2/6] virtio: introduce virtio_enable_cb_avail()
From: Jason Wang @ 2014-10-15  7:25 UTC (permalink / raw)
  To: rusty, mst, virtualization, linux-kernel, netdev; +Cc: davem, eric.dumazet
In-Reply-To: <1413357930-45302-1-git-send-email-jasowang@redhat.com>

This patch introduces virtio_enable_cb_avail() to publish avail idx
and used event. This could be used by batched buffer submitting to
reduce the number of tx interrupts.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/virtio/virtio_ring.c |   22 ++++++++++++++++++++--
 include/linux/virtio.h       |    2 ++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1b3929f..d67fbf8 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -567,14 +567,32 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
 	 * entry. Always do both to keep code simple. */
 	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
 	/* Make sure used event never go backwards */
-	if (!vring_need_event(vring_used_event(&vq->vring),
-			      vq->vring.avail->idx, last_used_idx))
+	if (vq->vring.avail->idx != vring_used_event(&vq->vring) &&
+	    !vring_need_event(vring_used_event(&vq->vring),
+			      vq->vring.avail->idx, last_used_idx)) {
 		vring_used_event(&vq->vring) = last_used_idx;
+	}
 	END_USE(vq);
 	return last_used_idx;
 }
 EXPORT_SYMBOL_GPL(virtqueue_enable_cb_prepare);
 
+bool virtqueue_enable_cb_avail(struct virtqueue *_vq)
+{
+	struct vring_virtqueue *vq = to_vvq(_vq);
+	bool ret;
+
+	START_USE(vq);
+	vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
+	vring_used_event(&vq->vring) = vq->vring.avail->idx;
+	ret = vring_need_event(vq->vring.avail->idx,
+			       vq->last_used_idx, vq->vring.used->idx);
+	END_USE(vq);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(virtqueue_enable_cb_avail);
+
 /**
  * virtqueue_poll - query pending used buffers
  * @vq: the struct virtqueue we're talking about.
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index b46671e..bfaf058 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -65,6 +65,8 @@ bool virtqueue_enable_cb(struct virtqueue *vq);
 
 unsigned virtqueue_enable_cb_prepare(struct virtqueue *vq);
 
+bool virtqueue_enable_cb_avail(struct virtqueue *vq);
+
 bool virtqueue_poll(struct virtqueue *vq, unsigned);
 
 bool virtqueue_enable_cb_delayed(struct virtqueue *vq);
-- 
1.7.1

^ permalink raw reply related

* [RFC PATCH net-next 3/6] virtio-net: small optimization on free_old_xmit_skbs()
From: Jason Wang @ 2014-10-15  7:25 UTC (permalink / raw)
  To: rusty, mst, virtualization, linux-kernel, netdev; +Cc: davem, eric.dumazet
In-Reply-To: <1413357930-45302-1-git-send-email-jasowang@redhat.com>

Accumulate the sent packets and sent bytes in local variables and perform a
single u64_stats_update_begin/end() after.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3d0ce44..a4d56b8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -833,17 +833,21 @@ static void free_old_xmit_skbs(struct send_queue *sq)
 	unsigned int len;
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+	u64 tx_bytes = 0, tx_packets = 0;
 
 	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 
-		u64_stats_update_begin(&stats->tx_syncp);
-		stats->tx_bytes += skb->len;
-		stats->tx_packets++;
-		u64_stats_update_end(&stats->tx_syncp);
+		tx_bytes += skb->len;
+		tx_packets++;
 
 		dev_kfree_skb_any(skb);
 	}
+
+	u64_stats_update_begin(&stats->tx_syncp);
+	stats->tx_bytes += tx_bytes;
+	stats->tx_packets =+ tx_packets;
+	u64_stats_update_end(&stats->tx_syncp);
 }
 
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
-- 
1.7.1

^ permalink raw reply related

* [RFC PATCH net-next 4/6] virtio-net: return the number of packets sent in free_old_xmit_skbs()
From: Jason Wang @ 2014-10-15  7:25 UTC (permalink / raw)
  To: rusty, mst, virtualization, linux-kernel, netdev
  Cc: davem, eric.dumazet, Jason Wang
In-Reply-To: <1413357930-45302-1-git-send-email-jasowang@redhat.com>

This patch returns the number of packets sent in free_old_xmit_skbs(), this is
necessary for calling this function in napi.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a4d56b8..ccf98f9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -827,7 +827,7 @@ static int virtnet_open(struct net_device *dev)
 	return 0;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq)
+static int free_old_xmit_skbs(struct send_queue *sq)
 {
 	struct sk_buff *skb;
 	unsigned int len;
@@ -848,6 +848,8 @@ static void free_old_xmit_skbs(struct send_queue *sq)
 	stats->tx_bytes += tx_bytes;
 	stats->tx_packets =+ tx_packets;
 	u64_stats_update_end(&stats->tx_syncp);
+
+	return tx_packets;
 }
 
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
-- 
1.7.1

^ permalink raw reply related

* [RFC PATCH net-next 5/6] virtio-net: enable tx interrupt
From: Jason Wang @ 2014-10-15  7:25 UTC (permalink / raw)
  To: rusty, mst, virtualization, linux-kernel, netdev
  Cc: davem, eric.dumazet, Jason Wang
In-Reply-To: <1413357930-45302-1-git-send-email-jasowang@redhat.com>

Orphan skb in ndo_start_xmit() breaks socket accounting and packet
queuing. This in fact breaks lots of things such as pktgen and several
TCP optimizations. And also make BQL can't be implemented for
virtio-net.

This patch tries to solve this issue by enabling tx interrupt. To
avoid introducing extra spinlocks, a tx napi was scheduled to free
those packets.

More tx interrupt mitigation method could be used on top.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c |  125 +++++++++++++++++++++++++++++++---------------
 1 files changed, 85 insertions(+), 40 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ccf98f9..2afc2e2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -72,6 +72,8 @@ struct send_queue {
 
 	/* Name of the send queue: output.$index */
 	char name[40];
+
+	struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
 	return p;
 }
 
+static int free_old_xmit_skbs(struct send_queue *sq, int budget)
+{
+	struct sk_buff *skb;
+	unsigned int len;
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+	u64 tx_bytes = 0, tx_packets = 0;
+
+	while (tx_packets < budget &&
+	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+		pr_debug("Sent skb %p\n", skb);
+
+		tx_bytes += skb->len;
+		tx_packets++;
+
+		dev_kfree_skb_any(skb);
+	}
+
+	u64_stats_update_begin(&stats->tx_syncp);
+	stats->tx_bytes += tx_bytes;
+	stats->tx_packets =+ tx_packets;
+	u64_stats_update_end(&stats->tx_syncp);
+
+	return tx_packets;
+}
+
 static void skb_xmit_done(struct virtqueue *vq)
 {
 	struct virtnet_info *vi = vq->vdev->priv;
+	struct send_queue *sq = &vi->sq[vq2txq(vq)];
 
-	/* Suppress further interrupts. */
-	virtqueue_disable_cb(vq);
-
-	/* We were probably waiting for more output buffers. */
-	netif_wake_subqueue(vi->dev, vq2txq(vq));
+	if (napi_schedule_prep(&sq->napi)) {
+		__napi_schedule(&sq->napi);
+	}
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -774,7 +801,39 @@ again:
 	return received;
 }
 
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct send_queue *sq =
+		container_of(napi, struct send_queue, napi);
+	struct virtnet_info *vi = sq->vq->vdev->priv;
+	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+	unsigned int r, sent = 0;
+
+again:
+	__netif_tx_lock(txq, smp_processor_id());
+	virtqueue_disable_cb(sq->vq);
+	sent += free_old_xmit_skbs(sq, budget - sent);
+
+	if (sent < budget) {
+		r = virtqueue_enable_cb_prepare(sq->vq);
+		napi_complete(napi);
+		__netif_tx_unlock(txq);
+		if (unlikely(virtqueue_poll(sq->vq, r)) &&
+		    napi_schedule_prep(napi)) {
+			virtqueue_disable_cb(sq->vq);
+			__napi_schedule(napi);
+			goto again;
+		}
+	} else {
+		__netif_tx_unlock(txq);
+	}
+
+	netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
+	return sent;
+}
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
+
 /* must be called with local_bh_disable()d */
 static int virtnet_busy_poll(struct napi_struct *napi)
 {
@@ -822,36 +881,12 @@ static int virtnet_open(struct net_device *dev)
 			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 		virtnet_napi_enable(&vi->rq[i]);
+		napi_enable(&vi->sq[i].napi);
 	}
 
 	return 0;
 }
 
-static int free_old_xmit_skbs(struct send_queue *sq)
-{
-	struct sk_buff *skb;
-	unsigned int len;
-	struct virtnet_info *vi = sq->vq->vdev->priv;
-	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
-	u64 tx_bytes = 0, tx_packets = 0;
-
-	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		pr_debug("Sent skb %p\n", skb);
-
-		tx_bytes += skb->len;
-		tx_packets++;
-
-		dev_kfree_skb_any(skb);
-	}
-
-	u64_stats_update_begin(&stats->tx_syncp);
-	stats->tx_bytes += tx_bytes;
-	stats->tx_packets =+ tx_packets;
-	u64_stats_update_end(&stats->tx_syncp);
-
-	return tx_packets;
-}
-
 static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 {
 	struct skb_vnet_hdr *hdr;
@@ -917,6 +952,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 		sg_set_buf(sq->sg, hdr, hdr_len);
 		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
 	}
+
 	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
 }
 
@@ -925,10 +961,11 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int qnum = skb_get_queue_mapping(skb);
 	struct send_queue *sq = &vi->sq[qnum];
-	int err;
+	int err, qsize = virtqueue_get_vring_size(sq->vq);
 
+	virtqueue_disable_cb(sq->vq);
 	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(sq);
+	free_old_xmit_skbs(sq, qsize);
 
 	/* Try to transmit */
 	err = xmit_skb(sq, skb);
@@ -944,22 +981,20 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 		return NETDEV_TX_OK;
 	}
 
-	/* Don't wait up for transmitted skbs to be freed. */
-	skb_orphan(skb);
-	nf_reset(skb);
-
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
 		netif_stop_subqueue(dev, qnum);
 		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
 			/* More just got used, free them then recheck. */
-			free_old_xmit_skbs(sq);
+			free_old_xmit_skbs(sq, qsize);
 			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
 				netif_start_subqueue(dev, qnum);
 				virtqueue_disable_cb(sq->vq);
 			}
 		}
+	} else if (virtqueue_enable_cb(sq->vq)) {
+		free_old_xmit_skbs(sq, qsize);
 	}
 
 	if (__netif_subqueue_stopped(dev, qnum) || !skb->xmit_more)
@@ -1141,8 +1176,10 @@ static int virtnet_close(struct net_device *dev)
 	/* Make sure refill_work doesn't re-enable napi! */
 	cancel_delayed_work_sync(&vi->refill);
 
-	for (i = 0; i < vi->max_queue_pairs; i++)
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		napi_disable(&vi->rq[i].napi);
+		napi_disable(&vi->sq[i].napi);
+	}
 
 	return 0;
 }
@@ -1461,8 +1498,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
 {
 	int i;
 
-	for (i = 0; i < vi->max_queue_pairs; i++)
+	for (i = 0; i < vi->max_queue_pairs; i++) {
 		netif_napi_del(&vi->rq[i].napi);
+		netif_napi_del(&vi->sq[i].napi);
+	}
 
 	kfree(vi->rq);
 	kfree(vi->sq);
@@ -1616,6 +1655,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
 		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
 			       napi_weight);
 		napi_hash_add(&vi->rq[i].napi);
+		netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
+			       napi_weight);
 
 		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
 		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
@@ -1920,8 +1961,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
 	if (netif_running(vi->dev)) {
 		for (i = 0; i < vi->max_queue_pairs; i++) {
 			napi_disable(&vi->rq[i].napi);
+			napi_disable(&vi->sq[i].napi);
 			napi_hash_del(&vi->rq[i].napi);
 			netif_napi_del(&vi->rq[i].napi);
+			netif_napi_del(&vi->sq[i].napi);
 		}
 	}
 
@@ -1946,8 +1989,10 @@ static int virtnet_restore(struct virtio_device *vdev)
 			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 
-		for (i = 0; i < vi->max_queue_pairs; i++)
+		for (i = 0; i < vi->max_queue_pairs; i++) {
 			virtnet_napi_enable(&vi->rq[i]);
+			napi_enable(&vi->sq[i].napi);
+		}
 	}
 
 	netif_device_attach(vi->dev);
-- 
1.7.1

^ permalink raw reply related

* [RFC PATCH net-next 6/6] virtio-net: enable tx interrupt only for the final skb in the chain
From: Jason Wang @ 2014-10-15  7:25 UTC (permalink / raw)
  To: rusty, mst, virtualization, linux-kernel, netdev
  Cc: davem, eric.dumazet, Jason Wang
In-Reply-To: <1413357930-45302-1-git-send-email-jasowang@redhat.com>

With the help of xmit_more and virtqueue_enable_cb_avail(), this patch
enable tx interrupt only for the final skb in the chain if
needed. This will help to mitigate tx interrupts.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2afc2e2..5943f3f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -993,12 +993,16 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 				virtqueue_disable_cb(sq->vq);
 			}
 		}
-	} else if (virtqueue_enable_cb(sq->vq)) {
-		free_old_xmit_skbs(sq, qsize);
 	}
 
-	if (__netif_subqueue_stopped(dev, qnum) || !skb->xmit_more)
+	if (__netif_subqueue_stopped(dev, qnum) || !skb->xmit_more) {
 		virtqueue_kick(sq->vq);
+		if (sq->vq->num_free >= 2 +MAX_SKB_FRAGS &&
+		    unlikely(virtqueue_enable_cb_avail(sq->vq))) {
+			free_old_xmit_skbs(sq, qsize);
+			virtqueue_disable_cb(sq->vq);
+		}
+	}
 
 	return NETDEV_TX_OK;
 }
-- 
1.7.1

^ permalink raw reply related

* Re: [PATCH] net: fec: ptp: fix convergence issue to support LinuxPTP stack
From: Richard Cochran @ 2014-10-15  7:25 UTC (permalink / raw)
  To: fugang.duan@freescale.com
  Cc: Frank.Li@freescale.com, David Miller, netdev@vger.kernel.org,
	bhutchings@solarflare.com
In-Reply-To: <ca979a5eb7764367a6b8b944948cbd1a@BLUPR03MB373.namprd03.prod.outlook.com>

On Wed, Oct 15, 2014 at 01:30:41AM +0000, fugang.duan@freescale.com wrote:
> 
> Pls read the commit log in entire ?

I did read it. 

It is incomprehensible.

Can you please fix it?

Thanks,
Richard

^ permalink raw reply

* [PATCH net] cxgb4i : Fix -Wmaybe-uninitialized warning.
From: Anish Bhatt @ 2014-10-15  7:26 UTC (permalink / raw)
  To: netdev; +Cc: davem, kxie, Anish Bhatt

Identified by kbuild test robot. csk family is always set to be AF_INET or 
AF_INET6, so skb will always be initialized to some value but there is no harm 
in silencing the warning anyways.

Signed-off-by: Anish Bhatt <anish@chelsio.com>
Fixes : f42bb57c61fd ('cxgb4i : Fix -Wunused-function warning')
---
 drivers/scsi/cxgbi/cxgb4i/cxgb4i.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c b/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c
index 8c3003b..3e0a0d3 100644
--- a/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c
+++ b/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c
@@ -758,7 +758,7 @@ static int act_open_rpl_status_to_errno(int status)
 
 static void csk_act_open_retry_timer(unsigned long data)
 {
-	struct sk_buff *skb;
+	struct sk_buff *skb = NULL;
 	struct cxgbi_sock *csk = (struct cxgbi_sock *)data;
 	struct cxgb4_lld_info *lldi = cxgbi_cdev_priv(csk->cdev);
 	void (*send_act_open_func)(struct cxgbi_sock *, struct sk_buff *,
-- 
2.1.2

^ permalink raw reply related

* Re: [PATCH] nf_conntrack_proto_tcp: allow server to become a client in TW handling
From: Jozsef Kadlecsik @ 2014-10-15  7:27 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner; +Cc: Pablo Neira Ayuso, netfilter-devel, netdev
In-Reply-To: <a9618c5db692413c2970201b5206bcd357438399.1413216136.git.mleitner@redhat.com>

On Mon, 13 Oct 2014, Marcelo Ricardo Leitner wrote:

> When a port that was used to listen for inbound connections gets closed
> and reused for outgoing connections (like rsh ends up doing for stderr
> flow), current we may reject the SYN/ACK packet for the new connection
> because tcp_conntracks states forbirds a port to become a client while
> there is still a TIME_WAIT entry in there for it.
> 
> As TCP may expire the TIME_WAIT socket in 60s and conntrack's timeout
> for it is 120s, there is a ~60s window that the application can end up
> opening a port that conntrack will end up blocking.
> 
> This patch fixes this by simply allowing such state transition: if we
> see a SYN, in TIME_WAIT state, on REPLY direction, move it to sSS. Note
> that the rest of the code already handles this situation, more
> specificly in tcp_packet(), first switch clause.

In those code branch if there was a valid FIN in either direction, we 
destroy the old connection and a new will be created. That way the rules 
about NEW connections will be applied, so the policies are not bypassed. 
Otherwise we just ignore the SYN packet, so if it's invalid, we'll catch 
the RST from the other side and destroy the conntrack entry. The event 
flow looks OK to me.

> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>

Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

Best regards,
Jozsef

> ---
>  net/netfilter/nf_conntrack_proto_tcp.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> index 44d1ea32570a07338dc39f34624bd823b6f76916..d87b6423ffb21e0f8f9b6ef25ef51c1cb5f54ad6 100644
> --- a/net/netfilter/nf_conntrack_proto_tcp.c
> +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> @@ -213,7 +213,7 @@ static const u8 tcp_conntracks[2][6][TCP_CONNTRACK_MAX] = {
>  	{
>  /* REPLY */
>  /* 	     sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2	*/
> -/*syn*/	   { sIV, sS2, sIV, sIV, sIV, sIV, sIV, sIV, sIV, sS2 },
> +/*syn*/	   { sIV, sS2, sIV, sIV, sIV, sIV, sIV, sSS, sIV, sS2 },
>  /*
>   *	sNO -> sIV	Never reached.
>   *	sSS -> sS2	Simultaneous open
> @@ -223,7 +223,7 @@ static const u8 tcp_conntracks[2][6][TCP_CONNTRACK_MAX] = {
>   *	sFW -> sIV
>   *	sCW -> sIV
>   *	sLA -> sIV
> - *	sTW -> sIV	Reopened connection, but server may not do it.
> + *	sTW -> sSS	Reopened connection, but server may have switched role
>   *	sCL -> sIV
>   */
>  /* 	     sNO, sSS, sSR, sES, sFW, sCW, sLA, sTW, sCL, sS2	*/
> -- 
> 1.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply

* Re: [PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt
From: Jason Wang @ 2014-10-15  7:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Miller
  Cc: kvm, netdev, linux-kernel, virtualization, linux-api
In-Reply-To: <20141014230627.GA23715@redhat.com>

On 10/15/2014 07:06 AM, Michael S. Tsirkin wrote:
> On Tue, Oct 14, 2014 at 02:53:27PM -0400, David Miller wrote:
>> > From: Jason Wang <jasowang@redhat.com>
>> > Date: Sat, 11 Oct 2014 15:16:43 +0800
>> > 
>>> > > We free old transmitted packets in ndo_start_xmit() currently, so any
>>> > > packet must be orphaned also there. This was used to reduce the overhead of
>>> > > tx interrupt to achieve better performance. But this may not work for some
>>> > > protocols such as TCP stream. TCP depends on the value of sk_wmem_alloc to
>>> > > implement various optimization for small packets stream such as TCP small
>>> > > queue and auto corking. But orphaning packets early in ndo_start_xmit()
>>> > > disable such things more or less since sk_wmem_alloc was not accurate. This
>>> > > lead extra low throughput for TCP stream of small writes.
>>> > > 
>>> > > This series tries to solve this issue by enable tx interrupts for all TCP
>>> > > packets other than the ones with push bit or pure ACK. This is done through
>>> > > the support of urgent descriptor which can force an interrupt for a
>>> > > specified packet. If tx interrupt was enabled for a packet, there's no need
>>> > > to orphan it in ndo_start_xmit(), we can free it tx napi which is scheduled
>>> > > by tx interrupt. Then sk_wmem_alloc was more accurate than before and TCP
>>> > > can batch more for small write. More larger skb was produced by TCP in this
>>> > > case to improve both throughput and cpu utilization.
>>> > > 
>>> > > Test shows great improvements on small write tcp streams. For most of the
>>> > > other cases, the throughput and cpu utilization are the same in the
>>> > > past. Only few cases, more cpu utilization was noticed which needs more
>>> > > investigation.
>>> > > 
>>> > > Review and comments are welcomed.
>> > 
>> > I think proper accounting and queueing (at all levels, not just TCP
>> > sockets) is more important than trying to skim a bunch of cycles by
>> > avoiding TX interrupts.
>> > 
>> > Having an event to free the SKB is absolutely essential for the stack
>> > to operate correctly.
>> > 
>> > And with virtio-net you don't even have the excuse of "the HW
>> > unfortunately doesn't have an appropriate TX event."
>> > 
>> > So please don't play games, and instead use TX interrupts all the
>> > time.  You can mitigate them in various ways, but don't turn them on
>> > selectively based upon traffic type, that's terrible.
>> > 
>> > You can even use ->xmit_more to defer the TX interrupt indication to
>> > the final TX packet in the chain.
> I guess we can just defer the kick, interrupt will naturally be
> deferred as well.
> This should solve the problem for old hosts as well.

Interrupt were delayed but not reduced, to support this we need publish
avail idx as used event. This should reduce the tx interrupt in the case
of bulk dequeuing.

I will draft a new rfc series contain this.
>
> We'll also need to implement bql for this.
> Something like the below?
> Completely untested - posting here to see if I figured the
> API out correctly. Has to be applied on top of the previous patch.

Looks so. I believe better to have but not a must.

^ permalink raw reply

* RE: [PATCH] net: fec: ptp: fix convergence issue to support LinuxPTP stack
From: fugang.duan @ 2014-10-15  7:42 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Frank.Li@freescale.com, David Miller, netdev@vger.kernel.org,
	bhutchings@solarflare.com
In-Reply-To: <20141015072540.GA4172@localhost.localdomain>

From: Richard Cochran <richardcochran@gmail.com> Sent: Wednesday, October 15, 2014 3:26 PM
>To: Duan Fugang-B38611
>Cc: Li Frank-B20596; David Miller; netdev@vger.kernel.org;
>bhutchings@solarflare.com
>Subject: Re: [PATCH] net: fec: ptp: fix convergence issue to support
>LinuxPTP stack
>
>On Wed, Oct 15, 2014 at 01:30:41AM +0000, fugang.duan@freescale.com wrote:
>>
>> Pls read the commit log in entire ?
>
>I did read it.
>
>It is incomprehensible.
>
>Can you please fix it?
>
>Thanks,
>Richard

Ok, I will enhance the commit/comment log and then send out V2.

Thanks,
Andy

^ permalink raw reply

* [PATCH] atm: simplify lanai.c by using module_pci_driver
From: Michael Opdenacker @ 2014-10-15  7:45 UTC (permalink / raw)
  To: chas; +Cc: linux-atm-general, netdev, linux-kernel, Michael Opdenacker

This simplifies the lanai.c driver by using
the module_pci_driver() macro, at the expense
of losing only debugging messages.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
---
 drivers/atm/lanai.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/drivers/atm/lanai.c b/drivers/atm/lanai.c
index fa7d701933ba..93eaf8d94492 100644
--- a/drivers/atm/lanai.c
+++ b/drivers/atm/lanai.c
@@ -2614,27 +2614,7 @@ static struct pci_driver lanai_driver = {
 	.probe    = lanai_init_one,
 };
 
-static int __init lanai_module_init(void)
-{
-	int x;
-
-	x = pci_register_driver(&lanai_driver);
-	if (x != 0)
-		printk(KERN_ERR DEV_LABEL ": no adapter found\n");
-	return x;
-}
-
-static void __exit lanai_module_exit(void)
-{
-	/* We'll only get called when all the interfaces are already
-	 * gone, so there isn't much to do
-	 */
-	DPRINTK("cleanup_module()\n");
-	pci_unregister_driver(&lanai_driver);
-}
-
-module_init(lanai_module_init);
-module_exit(lanai_module_exit);
+module_pci_driver(lanai_driver);
 
 MODULE_AUTHOR("Mitchell Blank Jr <mitch@sfgoth.com>");
 MODULE_DESCRIPTION("Efficient Networks Speedstream 3010 driver");
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH v2] ipv4: dst_entry leak in ip_append_data()
From: Vasily Averin @ 2014-10-15  7:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, kuznet, jmorris, yoshfuji, kaber, eric.dumazet
In-Reply-To: <20141014.161225.1399177558139744041.davem@davemloft.net>

On 15.10.2014 00:12, David Miller wrote:
> From: Vasily Averin <vvs@parallels.com>
> Date: Tue, 14 Oct 2014 08:57:14 +0400
> 
>> v2: adjust the indentation of the arguments __ip_append_data() call
>>
>> Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
>>
>> If sk_write_queue is empty ip_append_data() executes ip_setup_cork()
>> that "steals" dst entry from rt to cork. Later it calls __ip_append_data()
>> that creates skb and adds it to sk_write_queue.
>>
>> If skb was added successfully following ip_push_pending_frames() call
>> reassign dst entries from cork to skb, and kfree_skb frees dst_entry.
>>
>> However nobody frees stolen dst_entry if skb was not added into sk_write_queue.
>>
>> Signed-off-by: Vasily Averin <vvs@parallels.com>
> 
> Why doesn't ip_make_skb() need the same fix?  It seems to do the same
> thing.

It seems for me ip_make_skb() works (almost) correctly,
but seems refcounting can be is incorrect if queue can be not empty 
(Please see details below).

If __ip_append_data() returns errors ip_make_skb() calls 
__ip_flush_pending_frames() that calls ip_cork_release() inside
and frees stolen dst_entry.

If __ip_append_data() returns success -- dst refcounter changes are not required.
In this case skb will be created and added to queue (and it will not be empty)
Later in __ip_make_skb() these skb will get dst reference,
and refcounter will be decremented during kfree_skb(). 

I do not like that there is such unclear dependency between functions,
but seems currently it works correctly.

However I afraid dst refcountng can work incorrectly if sk_write_queue
can be not empty at the moment of ip_append_data() call.
It was not happen in case ip_send_unicast_reply() but probably
can happen in other places.

Let's calculate dst refcounters changes in this case.

First packet:

dst_refcount increment was happen in ip_append_data() caller, taken during rt lookup
- ip_append_data():
--  sk_write_queue is empty, ip_setup_cork() steals dst entry
-- __ip_append_data() adds skb to queue, queue is not flushed, waiting for next packets.
ip_rt_put in ip_append_data() caller does not work, because dst reference was stolen.

dst refcount here +1

then we want to sent 2nd packet:
dst refcount increment was happen in ip_append_data() caller
- ip_append_data():
-- sk_write_queue is NOT empty, dst was not stolen
-- __ip_append_data() adds skb to queue
ip_rt_put in ip_append_data() caller decrements dst refcount, because it as not stolen

dst refcount here +1 

Then we handle new packets, all of them are added to queue

dst refcount is still +1

Then queue is flushed.
Each packet in queue get dst reference from cork,
Each kfree_skb decrements dst refcounter, and it may become negative.

Am I wrong probably?

^ permalink raw reply

* Re: [PATCH net 2/3] net: sctp: fix panic on duplicate ASCONF chunks
From: Vlad Yasevich @ 2014-10-15  7:51 UTC (permalink / raw)
  To: Daniel Borkmann, Neil Horman; +Cc: davem, linux-sctp, netdev
In-Reply-To: <543B0DD7.7000900@redhat.com>

On 10/13/2014 01:25 AM, Daniel Borkmann wrote:
> On 10/12/2014 03:42 AM, Neil Horman wrote:
>> On Sat, Oct 11, 2014 at 12:02:31AM +0200, Daniel Borkmann wrote:
>>> On 10/10/2014 05:39 PM, Neil Horman wrote:
>>> ...
>>>> Is it worth adding a WARN_ON, to indicate that two ASCONF chunks have been
>>>> received with duplicate serials?
>>>
>>> Don't think so, as this would be triggerable from outside.
>>>
>> WARN_ON_ONCE then, per serial number?
> 
> Sorry, but no. If someone seriously runs that in production and it
> triggers a WARN() from outside, admins will start sending us bug
> reports that apparently something with the kernel code is wrong.
> 
> WARN() should only be used if we have some *internal* unexpected bug,
> but can still fail gracefully. This would neither be an actual code bug
> nor would it be an internally triggered one, plus we add unnecessary
> complexity to the code. Similarly, for those reasons we don't WARN()
> and throw a stack trace when we receive, say, an skb of invalid length
> elsewhere.
> 
> I'd also like to avoid any additional pr_debug().

Why avoid additional pr_debugs()?  The seem cheap, and in this particular
case, they would be rather useful.

> 
> I don't think people enable them in production, and if they really do,
> it's too late anyway as we already have received this chunk. If anything,
> I'd rather like to see debugging code further removed as we have already
> different facilities in the kernel for runtime debugging that are much
> more powerful.

Such as?  pr_debug is part of dynamic debugging which is what we want here.
In prod environments, we don't want any output, but when someone needs to
figure out why something doesn't work, it is very useful to have.

-vlad

^ permalink raw reply

* [net-next.git 0/3] stmmac: review driver Koptions
From: Giuseppe Cavallaro @ 2014-10-15  8:25 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro

These patches are to rework some Koption used for configuring
the driver. Recently many options have been added and the main
goal behind this work is to guarantee that the driver built fine
on all the branches where it is present. All the Koption for
platform drivers can be safely removed (for example never added
for SPEAr). In that case, it will be possible to validate the
build for all the platforms like STi avoiding failures.
The patches have been tested on STi Boxes w/o issue at runtime.

Giuseppe Cavallaro (3):
  stmmac: remove specific SoC Koption from platform.
  stmmac: remove STMMAC_DEBUG_FS
  stmmac: remove BUS_MODE_DA

 drivers/net/ethernet/stmicro/stmmac/Kconfig        |   68 +------------------
 drivers/net/ethernet/stmicro/stmmac/Makefile       |    8 +--
 .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c    |    4 -
 drivers/net/ethernet/stmicro/stmmac/stmmac.h       |    9 +--
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |   14 ++--
 .../net/ethernet/stmicro/stmmac/stmmac_platform.c  |   10 +---
 6 files changed, 16 insertions(+), 97 deletions(-)

-- 
1.7.4.4

^ permalink raw reply

* [net-next.git 1/3] stmmac: remove specific SoC Koption from platform.
From: Giuseppe Cavallaro @ 2014-10-15  8:25 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1413361531-9182-1-git-send-email-peppe.cavallaro@st.com>

This patch removes all the Koptions added to build the glue-logic files
for all different architectures: DWMAC_MESON, DWMAC_SUNXI, DWMAC_STI ...
Nowadays the stmmac needs to be compiled on several platforms; in some
case it very convenient to guarantee that its build is always completed
with success on all the branches where the driver is present.
Note that there is no Koption for SPEAr.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/Kconfig        |   50 ++------------------
 drivers/net/ethernet/stmicro/stmmac/Makefile       |    8 +--
 drivers/net/ethernet/stmicro/stmmac/stmmac.h       |    9 +---
 .../net/ethernet/stmicro/stmmac/stmmac_platform.c  |   10 +----
 4 files changed, 9 insertions(+), 68 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/Kconfig b/drivers/net/ethernet/stmicro/stmmac/Kconfig
index b02d4a3..40f356d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Kconfig
+++ b/drivers/net/ethernet/stmicro/stmmac/Kconfig
@@ -18,56 +18,14 @@ config STMMAC_PLATFORM
 	depends on STMMAC_ETH
 	default y
 	---help---
-	  This selects the platform specific bus support for
-	  the stmmac device driver. This is the driver used
-	  on many embedded STM platforms based on ARM and SuperH
-	  processors.
+	  This selects the platform specific bus support for the stmmac driver.
+	  This is the driver used on several SoCs:
+	  STi, Allwinner, Amlogic Meson, Altera SOCFPGA.
+
 	  If you have a controller with this interface, say Y or M here.
 
 	  If unsure, say N.
 
-config DWMAC_MESON
-	bool "Amlogic Meson dwmac support"
-	depends on STMMAC_PLATFORM && ARCH_MESON
-	help
-	  Support for Ethernet controller on Amlogic Meson SoCs.
-
-	  This selects the Amlogic Meson SoC glue layer support for
-	  the stmmac device driver. This driver is used for Meson6 and
-	  Meson8 SoCs.
-
-config DWMAC_SOCFPGA
-	bool "SOCFPGA dwmac support"
-	depends on STMMAC_PLATFORM && MFD_SYSCON && (ARCH_SOCFPGA || COMPILE_TEST)
-	help
-	  Support for ethernet controller on Altera SOCFPGA
-
-	  This selects the Altera SOCFPGA SoC glue layer support
-	  for the stmmac device driver. This driver is used for
-	  arria5 and cyclone5 FPGA SoCs.
-
-config DWMAC_SUNXI
-	bool "Allwinner GMAC support"
-	depends on STMMAC_PLATFORM && ARCH_SUNXI
-	default y
-	---help---
-	  Support for Allwinner A20/A31 GMAC ethernet controllers.
-
-	  This selects Allwinner SoC glue layer support for the
-	  stmmac device driver. This driver is used for A20/A31
-	  GMAC 	  ethernet controller.
-
-config DWMAC_STI
-	bool "STi GMAC support"
-	depends on STMMAC_PLATFORM && ARCH_STI
-	default y
-	---help---
-	  Support for ethernet controller on STi SOCs.
-
-	  This selects STi SoC glue layer support for the stmmac
-	  device driver. This driver is used on for the STi series
-	  SOCs GMAC ethernet controller.
-
 config STMMAC_PCI
 	bool "STMMAC PCI bus support"
 	depends on STMMAC_ETH && PCI
diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile b/drivers/net/ethernet/stmicro/stmmac/Makefile
index 0533d0b..034da70 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Makefile
+++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
@@ -1,10 +1,8 @@
 obj-$(CONFIG_STMMAC_ETH) += stmmac.o
-stmmac-$(CONFIG_STMMAC_PLATFORM) += stmmac_platform.o
 stmmac-$(CONFIG_STMMAC_PCI) += stmmac_pci.o
-stmmac-$(CONFIG_DWMAC_MESON) += dwmac-meson.o
-stmmac-$(CONFIG_DWMAC_SUNXI) += dwmac-sunxi.o
-stmmac-$(CONFIG_DWMAC_STI) += dwmac-sti.o
-stmmac-$(CONFIG_DWMAC_SOCFPGA) += dwmac-socfpga.o
+stmmac-$(CONFIG_STMMAC_PLATFORM) +=	stmmac_platform.o dwmac-meson.o \
+					dwmac-sunxi.o dwmac-sti.o \
+					dwmac-socfpga.o
 stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o ring_mode.o	\
 	      chain_mode.o dwmac_lib.o dwmac1000_core.o  dwmac1000_dma.o \
 	      dwmac100_core.o dwmac100_dma.o enh_desc.o  norm_desc.o \
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 4452889..4df544e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -137,19 +137,12 @@ void stmmac_disable_eee_mode(struct stmmac_priv *priv);
 bool stmmac_eee_init(struct stmmac_priv *priv);
 
 #ifdef CONFIG_STMMAC_PLATFORM
-#ifdef CONFIG_DWMAC_MESON
 extern const struct stmmac_of_data meson6_dwmac_data;
-#endif
-#ifdef CONFIG_DWMAC_SUNXI
 extern const struct stmmac_of_data sun7i_gmac_data;
-#endif
-#ifdef CONFIG_DWMAC_STI
 extern const struct stmmac_of_data sti_gmac_data;
-#endif
-#ifdef CONFIG_DWMAC_SOCFPGA
 extern const struct stmmac_of_data socfpga_gmac_data;
-#endif
 extern struct platform_driver stmmac_pltfr_driver;
+
 static inline int stmmac_register_platform(void)
 {
 	int err;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index 6521717..2fe3c43 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -30,21 +30,13 @@
 #include "stmmac.h"
 
 static const struct of_device_id stmmac_dt_ids[] = {
-#ifdef CONFIG_DWMAC_MESON
+	/* SoC specific glue layers should come before generic bindings */
 	{ .compatible = "amlogic,meson6-dwmac", .data = &meson6_dwmac_data},
-#endif
-#ifdef CONFIG_DWMAC_SUNXI
 	{ .compatible = "allwinner,sun7i-a20-gmac", .data = &sun7i_gmac_data},
-#endif
-#ifdef CONFIG_DWMAC_STI
 	{ .compatible = "st,stih415-dwmac", .data = &sti_gmac_data},
 	{ .compatible = "st,stih416-dwmac", .data = &sti_gmac_data},
 	{ .compatible = "st,stid127-dwmac", .data = &sti_gmac_data},
-#endif
-#ifdef CONFIG_DWMAC_SOCFPGA
 	{ .compatible = "altr,socfpga-stmmac", .data = &socfpga_gmac_data },
-#endif
-	/* SoC specific glue layers should come before generic bindings */
 	{ .compatible = "st,spear600-gmac"},
 	{ .compatible = "snps,dwmac-3.610"},
 	{ .compatible = "snps,dwmac-3.70a"},
-- 
1.7.4.4

^ permalink raw reply related

* [net-next.git 3/3] stmmac: remove BUS_MODE_DA
From: Giuseppe Cavallaro @ 2014-10-15  8:25 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1413361531-9182-1-git-send-email-peppe.cavallaro@st.com>

This is a very old and often unused option to configure
a bit in a register inside the DMA. This support should
not stay under Koption and also it is obsolete and should
be extended for new chips. This will be do later maybe via
device-tree parameters. Also no performance impact when
remove this setting on STi platforms.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/Kconfig        |   10 ----------
 .../net/ethernet/stmicro/stmmac/dwmac1000_dma.c    |    4 ----
 2 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/Kconfig b/drivers/net/ethernet/stmicro/stmmac/Kconfig
index af5228a..33b85ba 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Kconfig
+++ b/drivers/net/ethernet/stmicro/stmmac/Kconfig
@@ -37,14 +37,4 @@ config STMMAC_PCI
 	  D1215994A VIRTEX FPGA board.
 
 	  If unsure, say N.
-
-config STMMAC_DA
-	bool "STMMAC DMA arbitration scheme"
-	default n
-	---help---
-	  Selecting this option, rx has priority over Tx (only for Giga
-	  Ethernet device).
-	  By default, the DMA arbitration scheme is based on Round-robin
-	  (rx:tx priority is 1:1).
-
 endif
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
index 0c2058a..59d92e8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c
@@ -70,10 +70,6 @@ static int dwmac1000_dma_init(void __iomem *ioaddr, int pbl, int fb, int mb,
 	if (mb)
 		value |= DMA_BUS_MODE_MB;
 
-#ifdef CONFIG_STMMAC_DA
-	value |= DMA_BUS_MODE_DA;	/* Rx has priority over tx */
-#endif
-
 	if (atds)
 		value |= DMA_BUS_MODE_ATDS;
 
-- 
1.7.4.4

^ permalink raw reply related

* [net-next.git 2/3] stmmac: remove STMMAC_DEBUG_FS
From: Giuseppe Cavallaro @ 2014-10-15  8:25 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro
In-Reply-To: <1413361531-9182-1-git-send-email-peppe.cavallaro@st.com>

the STMMAC_DEBUG_FS Koption is now removed from the
driver configuration and this support will be built
by default when DEBUG_FS is present. This can also be
useful on building driver verification.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/Kconfig       |    8 --------
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c |   14 +++++++-------
 2 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/Kconfig b/drivers/net/ethernet/stmicro/stmmac/Kconfig
index 40f356d..af5228a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Kconfig
+++ b/drivers/net/ethernet/stmicro/stmmac/Kconfig
@@ -38,14 +38,6 @@ config STMMAC_PCI
 
 	  If unsure, say N.
 
-config STMMAC_DEBUG_FS
-	bool "Enable monitoring via sysFS "
-	default n
-	depends on STMMAC_ETH && DEBUG_FS
-	---help---
-	  The stmmac entry in /sys reports DMA TX/RX rings
-	  or (if supported) the HW cap register.
-
 config STMMAC_DA
 	bool "STMMAC DMA arbitration scheme"
 	default n
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 6f77a46..a34754b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -44,10 +44,10 @@
 #include <linux/slab.h>
 #include <linux/prefetch.h>
 #include <linux/pinctrl/consumer.h>
-#ifdef CONFIG_STMMAC_DEBUG_FS
+#ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
-#endif /* CONFIG_STMMAC_DEBUG_FS */
+#endif /* CONFIG_DEBUG_FS */
 #include <linux/net_tstamp.h>
 #include "stmmac_ptp.h"
 #include "stmmac.h"
@@ -116,7 +116,7 @@ MODULE_PARM_DESC(chain_mode, "To use chain instead of ring mode");
 
 static irqreturn_t stmmac_interrupt(int irq, void *dev_id);
 
-#ifdef CONFIG_STMMAC_DEBUG_FS
+#ifdef CONFIG_DEBUG_FS
 static int stmmac_init_fs(struct net_device *dev);
 static void stmmac_exit_fs(void);
 #endif
@@ -1688,7 +1688,7 @@ static int stmmac_hw_setup(struct net_device *dev)
 	if (ret && ret != -EOPNOTSUPP)
 		pr_warn("%s: failed PTP initialisation\n", __func__);
 
-#ifdef CONFIG_STMMAC_DEBUG_FS
+#ifdef CONFIG_DEBUG_FS
 	ret = stmmac_init_fs(dev);
 	if (ret < 0)
 		pr_warn("%s: failed debugFS registration\n", __func__);
@@ -1866,7 +1866,7 @@ static int stmmac_release(struct net_device *dev)
 
 	netif_carrier_off(dev);
 
-#ifdef CONFIG_STMMAC_DEBUG_FS
+#ifdef CONFIG_DEBUG_FS
 	stmmac_exit_fs();
 #endif
 
@@ -2453,7 +2453,7 @@ static int stmmac_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
 	return ret;
 }
 
-#ifdef CONFIG_STMMAC_DEBUG_FS
+#ifdef CONFIG_DEBUG_FS
 static struct dentry *stmmac_fs_dir;
 static struct dentry *stmmac_rings_status;
 static struct dentry *stmmac_dma_cap;
@@ -2638,7 +2638,7 @@ static void stmmac_exit_fs(void)
 	debugfs_remove(stmmac_dma_cap);
 	debugfs_remove(stmmac_fs_dir);
 }
-#endif /* CONFIG_STMMAC_DEBUG_FS */
+#endif /* CONFIG_DEBUG_FS */
 
 static const struct net_device_ops stmmac_netdev_ops = {
 	.ndo_open = stmmac_open,
-- 
1.7.4.4

^ permalink raw reply related

* RE: [PATCH] net: use hardware buffer pool to allocate skb
From: David Laight @ 2014-10-15  8:57 UTC (permalink / raw)
  To: 'Pan Jiafei', davem@davemloft.net, jkosina@suse.cz
  Cc: netdev@vger.kernel.org, LeoLi@freescale.com,
	linux-doc@vger.kernel.org
In-Reply-To: <1413343571-33231-1-git-send-email-Jiafei.Pan@freescale.com>

From: Pan Jiafei
> In some platform, there are some hardware block provided
> to manage buffers to improve performance. So in some case,
> it is expected that the packets received by some generic
> NIC should be put into such hardware managed buffers
> directly, so that such buffer can be released by hardware
> or by driver.

This looks like some strange variant of 'buffer loaning'.
In general it just doesn't work due to the limited number
of such buffers - they soon all become queued waiting for
applications to read from sockets.

It also isn't at all clear how you expect a 'generic NIC'
to actually allocate buffers from your 'special area'.

	David

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox