* Re: [RFC PATCH net-next 1/6] virtio: make sure used event never go backwards
From: Jason Wang @ 2014-10-17 5:04 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: eric.dumazet, netdev, linux-kernel, virtualization, davem
In-Reply-To: <20141015113859.GA26918@redhat.com>
On 10/15/2014 07:38 PM, Michael S. Tsirkin wrote:
> On Wed, Oct 15, 2014 at 06:44:41PM +0800, Jason Wang wrote:
>> On 10/15/2014 06:32 PM, Michael S. Tsirkin wrote:
>>> On Wed, Oct 15, 2014 at 06:13:19PM +0800, Jason Wang wrote:
>>>> On 10/15/2014 05:34 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Oct 15, 2014 at 03:25:25PM +0800, Jason Wang wrote:
>>>>>> This patch checks the new event idx to make sure used event idx never
>>>>>> goes back. This is used to synchronize the calls between
>>>>>> virtqueue_enable_cb_delayed() and virtqueue_enable_cb().
>>>>>>
>>>>>> Cc: Rusty Russell <rusty@rustcorp.com.au>
>>>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>> the implication being that moving event idx back might cause some race
>>>>> condition?
>>>> This will cause race condition when tx interrupt is enabled. Consider
>>>> the following cases
>>>>
>>>> 1) tx napi was scheduled
>>>> 2) start_xmit() call virtqueue_enable_cb_delayed() and disable cb, [used
>>>> event is vq->last_used_idx + 3/4 pendg bufs]
>>>> 3) tx napi enable the callback by virtqueue_enable_cb() [ used event is
>>>> vq->last_used_idx ]
>>>>
>>>> After step 3, used event was moved back, unnecessary tx interrupt was
>>>> triggered.
>>> Well unnecessary interrupts are safe.
>> But it that is what we want to reduce.
> It's all about correctness. I don't think mixing enable_cb
> and enable_cb_delayed makes sense, let's just make
> virtio behave correctly if that happens, no need to
> optimize for that.
Then as you said, need document or add WARN_ON() or BUG() in case both
of the two are used.
>
>
>>> With your patch caller of virtqueue_enable_cb will not get an
>>> interrupt on the next buffer which is not safe.
>>>
>>> If you don't want an interrupt on the next buffer, don't
>>> call virtqueue_enable_cb.
>> So something like this patch should be done in virtio core somewhere
>> else. Virtio-net can not do this since it does not have the knowledge of
>> event index.
> Take a look at my patch - no calls to enable_cb, only
> enable_cb_delayed, so we should be fine.
>
>>>>> If yes but please describe the race explicitly.
>>>>> Is there a bug we need to fix on stable?
>>>> Looks not, current code does not have such race condition.
>>>>> Please also explicitly describe a configuration that causes event idx
>>>>> to go back.
>>>>>
>>>>> All this info should go in the commit log.
>>>> Will do this.
>>>>>> ---
>>>>>> drivers/virtio/virtio_ring.c | 7 +++++--
>>>>>> 1 files changed, 5 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>>>> index 3b1f89b..1b3929f 100644
>>>>>> --- a/drivers/virtio/virtio_ring.c
>>>>>> +++ b/drivers/virtio/virtio_ring.c
>>>>>> @@ -559,14 +559,17 @@ unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
>>>>>> u16 last_used_idx;
>>>>>>
>>>>>> START_USE(vq);
>>>>>> -
>>>>>> + last_used_idx = vq->last_used_idx;
>>>>>> /* We optimistically turn back on interrupts, then check if there was
>>>>>> * more to do. */
>>>>>> /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
>>>>>> * either clear the flags bit or point the event index at the next
>>>>>> * entry. Always do both to keep code simple. */
>>>>>> vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
>>>>>> - vring_used_event(&vq->vring) = last_used_idx = vq->last_used_idx;
>>>>>> + /* Make sure used event never go backwards */
>>>>> s/go/goes/
>>>>>
>>>>>> + if (!vring_need_event(vring_used_event(&vq->vring),
>>>>>> + vq->vring.avail->idx, last_used_idx))
>>>>>> + vring_used_event(&vq->vring) = last_used_idx;
>>>>> The result will be that driver will *not* get an interrupt
>>>>> on the next consumed buffer, which is likely not what driver
>>>>> intended when it called virtqueue_enable_cb.
>>>> This will only happen when we want to delay the interrupt for next few
>>>> consumed buffers (virtqueue_enable_cb_delayed() was called). For the
>>>> other case, vq->last_used_idx should be ahead of previous used event. Do
>>>> you see any other case?
>>> Call virtqueue_enable_cb_delayed, later call virtqueue_enable_cb. If
>>> event index is not updated in virtqueue_enable_cb, driver will not get
>>> an interrupt on the next buffer.
>> This is just what we want I think. The interrupt was not lost but fired
>> after 3/4 pending buffers were consumed. Do you see any real issue on this?
> Yes, this violates the API. For example device might never
> consume the rest of buffers.
Then it should be a bug of device which is out of the control of guest.
If not, device might never also consume 3/4 rest of buffers.
>
>>>>> Instead, how about we simply document the requirement that drivers either
>>>>> always call virtqueue_enable_cb_delayed or virtqueue_enable_cb
>>>>> but not both?
>>>> We need call them both when tx interrupt is enabled I believe.
>>> Can you pls reply to my patch and document issues you see?
>>>
>> In the previous reply you said you're using
>> virtuqueue_enable_cb_delayed(), so no race in your patch.
> OK so you think my patch is also correct, but that yours gives better
> efficiency?
>
Need some benchmark to see the difference I think.
^ permalink raw reply
* Re: [PATCH RFC v2 1/3] virtio_net: enable tx interrupt
From: Jason Wang @ 2014-10-17 5:11 UTC (permalink / raw)
To: Michael S. Tsirkin, linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1413383116-30977-2-git-send-email-mst@redhat.com>
On 10/15/2014 10:32 PM, Michael S. Tsirkin wrote:
> On newer hosts that support delayed tx interrupts,
> we probably don't have much to gain from orphaning
> packets early.
>
> Based on patch by Jason Wang.
>
> Note: this might degrade performance for
> hosts without event idx support.
> Should be addressed by the next patch.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> drivers/net/virtio_net.c | 137 ++++++++++++++++++++++++++++++++---------------
> 1 file changed, 94 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 13d0a8b..a9bf178 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -72,6 +72,8 @@ struct send_queue {
>
> /* Name of the send queue: output.$index */
> char name[40];
> +
> + struct napi_struct napi;
> };
>
> /* Internal representation of a receive virtqueue */
> @@ -217,15 +219,37 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
> return p;
> }
>
> +static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget)
> +{
> + struct sk_buff *skb;
> + unsigned int len;
> + struct virtnet_info *vi = sq->vq->vdev->priv;
> + struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> + unsigned int packets = 0;
> +
> + while (packets < budget &&
> + (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> + pr_debug("Sent skb %p\n", skb);
> +
> + u64_stats_update_begin(&stats->tx_syncp);
> + stats->tx_bytes += skb->len;
> + stats->tx_packets++;
> + u64_stats_update_end(&stats->tx_syncp);
> +
> + dev_kfree_skb_any(skb);
> + packets++;
> + }
> +
> + return packets;
> +}
> +
> static void skb_xmit_done(struct virtqueue *vq)
> {
> struct virtnet_info *vi = vq->vdev->priv;
> + struct send_queue *sq = &vi->sq[vq2txq(vq)];
>
> - /* Suppress further interrupts. */
> - virtqueue_disable_cb(vq);
> -
> - /* We were probably waiting for more output buffers. */
> - netif_wake_subqueue(vi->dev, vq2txq(vq));
> + if (napi_schedule_prep(&sq->napi))
> + __napi_schedule(&sq->napi);
> }
>
> static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
> @@ -774,6 +798,37 @@ again:
> return received;
> }
>
> +static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> +{
> + struct send_queue *sq =
> + container_of(napi, struct send_queue, napi);
> + struct virtnet_info *vi = sq->vq->vdev->priv;
> + struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
> + unsigned int sent = 0;
> + bool enable_done;
> +
> +again:
> + __netif_tx_lock(txq, smp_processor_id());
> + virtqueue_disable_cb(sq->vq);
> + sent += free_old_xmit_skbs(sq, budget - sent);
> +
> + if (sent < budget) {
> + enable_done = virtqueue_enable_cb_delayed(sq->vq);
> + napi_complete(napi);
> + __netif_tx_unlock(txq);
> + if (unlikely(enable_done) && napi_schedule_prep(napi)) {
I think you mean unlikely(!enable_done) here?
> + virtqueue_disable_cb(sq->vq);
> + __napi_schedule(napi);
> + goto again;
> + }
> + } else {
> + __netif_tx_unlock(txq);
> + }
> +
> + netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
> + return sent;
> +}
> +
> #ifdef CONFIG_NET_RX_BUSY_POLL
> /* must be called with local_bh_disable()d */
> static int virtnet_busy_poll(struct napi_struct *napi)
> @@ -822,30 +877,12 @@ static int virtnet_open(struct net_device *dev)
> if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
> schedule_delayed_work(&vi->refill, 0);
> virtnet_napi_enable(&vi->rq[i]);
> + napi_enable(&vi->sq[i].napi);
> }
>
> return 0;
> }
>
> -static void free_old_xmit_skbs(struct send_queue *sq)
> -{
> - struct sk_buff *skb;
> - unsigned int len;
> - struct virtnet_info *vi = sq->vq->vdev->priv;
> - struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> -
> - while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> - pr_debug("Sent skb %p\n", skb);
> -
> - u64_stats_update_begin(&stats->tx_syncp);
> - stats->tx_bytes += skb->len;
> - stats->tx_packets++;
> - u64_stats_update_end(&stats->tx_syncp);
> -
> - dev_kfree_skb_any(skb);
> - }
> -}
> -
> static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
> {
> struct skb_vnet_hdr *hdr;
> @@ -911,7 +948,9 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
> sg_set_buf(sq->sg, hdr, hdr_len);
> num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
> }
> - return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
> +
> + return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
> + GFP_ATOMIC);
> }
>
> static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> @@ -919,12 +958,16 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> struct virtnet_info *vi = netdev_priv(dev);
> int qnum = skb_get_queue_mapping(skb);
> struct send_queue *sq = &vi->sq[qnum];
> - int err;
> + int err, qsize = virtqueue_get_vring_size(sq->vq);
> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> bool kick = !skb->xmit_more;
> + bool stopped;
> +
> + virtqueue_disable_cb(sq->vq);
>
> - /* Free up any pending old buffers before queueing new ones. */
> - free_old_xmit_skbs(sq);
> + /* We are going to push one skb.
> + * Try to pop one off to free space for it. */
> + free_old_xmit_skbs(sq, 1);
Looks like qsize instead of 1 is better? The more skbs freed in
ndo_start_xmit() the less chance tx napi will be scheduled.
>
> /* Try to transmit */
> err = xmit_skb(sq, skb);
> @@ -940,27 +983,25 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> return NETDEV_TX_OK;
> }
>
> - /* Don't wait up for transmitted skbs to be freed. */
> - skb_orphan(skb);
> - nf_reset(skb);
> -
> /* Apparently nice girls don't return TX_BUSY; stop the queue
> * before it gets out of hand. Naturally, this wastes entries. */
> if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
> netif_stop_subqueue(dev, qnum);
> - if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> - /* More just got used, free them then recheck. */
> - free_old_xmit_skbs(sq);
> - if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
> - netif_start_subqueue(dev, qnum);
> - virtqueue_disable_cb(sq->vq);
> - }
> - }
> + stopped = true;
> + } else {
> + stopped = false;
> }
>
> if (kick || netif_xmit_stopped(txq))
> virtqueue_kick(sq->vq);
Looks like we'd better move this in the end in case the queue could be
restarted by following lines?
>
> + if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> + /* More just got used, free them then recheck. */
> + free_old_xmit_skbs(sq, qsize);
> + if (stopped && sq->vq->num_free >= 2+MAX_SKB_FRAGS)
> + netif_start_subqueue(dev, qnum);
Why drop virtqueue_disable_cb() here?
> + }
> +
> return NETDEV_TX_OK;
> }
>
> @@ -1137,8 +1178,10 @@ static int virtnet_close(struct net_device *dev)
> /* Make sure refill_work doesn't re-enable napi! */
> cancel_delayed_work_sync(&vi->refill);
>
> - for (i = 0; i < vi->max_queue_pairs; i++)
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> napi_disable(&vi->rq[i].napi);
> + napi_disable(&vi->sq[i].napi);
> + }
>
> return 0;
> }
> @@ -1457,8 +1500,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
> {
> int i;
>
> - for (i = 0; i < vi->max_queue_pairs; i++)
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> netif_napi_del(&vi->rq[i].napi);
> + netif_napi_del(&vi->sq[i].napi);
> + }
>
> kfree(vi->rq);
> kfree(vi->sq);
> @@ -1612,6 +1657,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
> netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
> napi_weight);
> napi_hash_add(&vi->rq[i].napi);
> + netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
> + napi_weight);
>
> sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
> ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
> @@ -1916,8 +1963,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
> if (netif_running(vi->dev)) {
> for (i = 0; i < vi->max_queue_pairs; i++) {
> napi_disable(&vi->rq[i].napi);
> + napi_disable(&vi->sq[i].napi);
> napi_hash_del(&vi->rq[i].napi);
> netif_napi_del(&vi->rq[i].napi);
> + netif_napi_del(&vi->sq[i].napi);
> }
> }
>
> @@ -1942,8 +1991,10 @@ static int virtnet_restore(struct virtio_device *vdev)
> if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
> schedule_delayed_work(&vi->refill, 0);
>
> - for (i = 0; i < vi->max_queue_pairs; i++)
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> virtnet_napi_enable(&vi->rq[i]);
> + napi_enable(&vi->sq[i].napi);
> + }
> }
>
> netif_device_attach(vi->dev);
^ permalink raw reply
* Re: [PATCH RFC v2 2/3] virtio_net: bql
From: Jason Wang @ 2014-10-17 5:16 UTC (permalink / raw)
To: Michael S. Tsirkin, linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1413383116-30977-3-git-send-email-mst@redhat.com>
On 10/15/2014 10:32 PM, Michael S. Tsirkin wrote:
> Improve tx batching using byte queue limits.
> Should be especially effective for MQ.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> drivers/net/virtio_net.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index a9bf178..8dea411 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -219,13 +219,15 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
> return p;
> }
>
> -static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget)
> +static unsigned int free_old_xmit_skbs(struct netdev_queue *txq,
> + struct send_queue *sq, int budget)
> {
> struct sk_buff *skb;
> unsigned int len;
> struct virtnet_info *vi = sq->vq->vdev->priv;
> struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> unsigned int packets = 0;
> + unsigned int bytes = 0;
>
> while (packets < budget &&
> (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> @@ -233,6 +235,7 @@ static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget)
>
> u64_stats_update_begin(&stats->tx_syncp);
> stats->tx_bytes += skb->len;
> + bytes += skb->len;
> stats->tx_packets++;
> u64_stats_update_end(&stats->tx_syncp);
>
> @@ -240,6 +243,8 @@ static unsigned int free_old_xmit_skbs(struct send_queue *sq, int budget)
> packets++;
> }
>
> + netdev_tx_completed_queue(txq, packets, bytes);
> +
> return packets;
> }
>
> @@ -810,7 +815,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> again:
> __netif_tx_lock(txq, smp_processor_id());
> virtqueue_disable_cb(sq->vq);
> - sent += free_old_xmit_skbs(sq, budget - sent);
> + sent += free_old_xmit_skbs(txq, sq, budget - sent);
>
> if (sent < budget) {
> enable_done = virtqueue_enable_cb_delayed(sq->vq);
> @@ -962,12 +967,13 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> bool kick = !skb->xmit_more;
> bool stopped;
> + unsigned int bytes = skb->len;
>
> virtqueue_disable_cb(sq->vq);
>
> /* We are going to push one skb.
> * Try to pop one off to free space for it. */
> - free_old_xmit_skbs(sq, 1);
> + free_old_xmit_skbs(txq, sq, 1);
>
> /* Try to transmit */
> err = xmit_skb(sq, skb);
> @@ -983,6 +989,12 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> return NETDEV_TX_OK;
> }
>
> + netdev_tx_sent_queue(txq, bytes);
> +
> + /* Kick early so device can process descriptors in parallel with us. */
> + if (kick)
> + virtqueue_kick(sq->vq);
Haven't figured out how this will help for bql, consider only a
netif_stop_subqueue() may be called during two possible kicks. And since
we don't add any buffer between the two kicks, the send kick is almost
useless.
> +
> /* Apparently nice girls don't return TX_BUSY; stop the queue
> * before it gets out of hand. Naturally, this wastes entries. */
> if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
> @@ -997,7 +1009,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>
> if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> /* More just got used, free them then recheck. */
> - free_old_xmit_skbs(sq, qsize);
> + free_old_xmit_skbs(txq, sq, qsize);
> if (stopped && sq->vq->num_free >= 2+MAX_SKB_FRAGS)
> netif_start_subqueue(dev, qnum);
> }
^ permalink raw reply
* Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
From: Jason Wang @ 2014-10-17 5:23 UTC (permalink / raw)
To: Rusty Russell, mst, virtualization, netdev, linux-kernel; +Cc: linux-api, kvm
In-Reply-To: <877g01c2ml.fsf@rustcorp.com.au>
On 10/15/2014 01:40 PM, Rusty Russell wrote:
> Jason Wang <jasowang@redhat.com> writes:
>> Below should be useful for some experiments Jason is doing.
>> I thought I'd send it out for early review/feedback.
>>
>> event idx feature allows us to defer interrupts until
>> a specific # of descriptors were used.
>> Sometimes it might be useful to get an interrupt after
>> a specific descriptor, regardless.
>> This adds a descriptor flag for this, and an API
>> to create an urgent output descriptor.
>> This is still an RFC:
>> we'll need a feature bit for drivers to detect this,
>> but we've run out of feature bits for virtio 0.X.
>> For experimentation purposes, drivers can assume
>> this is set, or add a driver-specific feature bit.
>>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> The new VRING_DESC_F_URGENT bit is theoretically nicer, but for
> networking (which tends to take packets in order) couldn't we just set
> the event counter to give us a tx interrupt at the packet we want?
>
> Cheers,
> Rusty.
Yes, we could. Recent RFC of enabling tx interrupt use this.
^ permalink raw reply
* Mail Administrator
From: Webmail Administrator @ 2014-10-17 5:58 UTC (permalink / raw)
Our records indicate that your E-mail® Account could not be automatically updated with our F-Secure R-HTK4S new(2014) version anti-spam/anti-virus/anti-spyware. Please click this link below to update manually
http://www.contactme.com/543903bb8c730e0002008ccf
We Are Sorry For Any Inconvenience.
Regards,
Technical Support Team
Copyright © 2014. All Rights Reserved
^ permalink raw reply
* [PATCH net] r8152: use cancel_delayed_work for runtime suspend
From: Hayes Wang @ 2014-10-17 5:55 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
It would cause dead lock for runtime suspend, when the workqueue
is running and runtime suspend occurs before the workqueue wakes
up the device. The rtl8152_suspend() waits the workqueue to finish
because of calling cancel_delayed_work_sync(). The workqueue waits
the suspend function to finish for waking up the device because of
calling usb_autopm_get_interface().
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
drivers/net/usb/r8152.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 864159e..7d4e55a 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -3200,12 +3200,13 @@ static int rtl8152_suspend(struct usb_interface *intf, pm_message_t message)
if (netif_running(tp->netdev)) {
clear_bit(WORK_ENABLE, &tp->flags);
usb_kill_urb(tp->intr_urb);
- cancel_delayed_work_sync(&tp->schedule);
tasklet_disable(&tp->tl);
if (test_bit(SELECTIVE_SUSPEND, &tp->flags)) {
+ cancel_delayed_work(&tp->schedule);
rtl_stop_rx(tp);
rtl_runtime_suspend_enable(tp, true);
} else {
+ cancel_delayed_work_sync(&tp->schedule);
tp->rtl_ops.down(tp);
}
tasklet_enable(&tp->tl);
--
1.9.3
^ permalink raw reply related
* [PATCH] openvswitch: fix a use after free
From: roy.qing.li @ 2014-10-17 6:03 UTC (permalink / raw)
To: netdev; +Cc: edumazet, jesse
From: Li RongQing <roy.qing.li@gmail.com>
pskb_may_pull() called by arphdr_ok can change skb->data, so put the arp
setting after arphdr_ok to avoid the use the freed memory
Fixes: 0714812134d7d ("openvswitch: Eliminate memset() from flow_extract.")
Cc: Jesse Gross <jesse@nicira.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
net/openvswitch/flow.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 62db02b..c5cfc72 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -557,10 +557,11 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
} else if (key->eth.type == htons(ETH_P_ARP) ||
key->eth.type == htons(ETH_P_RARP)) {
struct arp_eth_header *arp;
+ bool arp_available = arphdr_ok(skb);
arp = (struct arp_eth_header *)skb_network_header(skb);
- if (arphdr_ok(skb) &&
+ if (arp_available &&
arp->ar_hrd == htons(ARPHRD_ETHER) &&
arp->ar_pro == htons(ETH_P_IP) &&
arp->ar_hln == ETH_ALEN &&
--
1.7.10.4
^ permalink raw reply related
* [PATCH v2] vxlan: remove the dead codes
From: roy.qing.li @ 2014-10-17 6:04 UTC (permalink / raw)
To: netdev
From: Li RongQing <roy.qing.li@gmail.com>
remove the dead codes, no user uses vxlan_salt
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
drivers/net/vxlan.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 77ab844..855a81d 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -152,8 +152,6 @@ struct vxlan_dev {
struct hlist_head fdb_head[FDB_HASH_SIZE];
};
-/* salt for hash table */
-static u32 vxlan_salt __read_mostly;
static struct workqueue_struct *vxlan_wq;
static void vxlan_sock_work(struct work_struct *work);
@@ -2797,8 +2795,6 @@ static int __init vxlan_init_module(void)
if (!vxlan_wq)
return -ENOMEM;
- get_random_bytes(&vxlan_salt, sizeof(vxlan_salt));
-
rc = register_pernet_subsys(&vxlan_net_ops);
if (rc)
goto out1;
--
1.7.10.4
^ permalink raw reply related
* [PATCH v2] vxlan: fix a free after use
From: roy.qing.li @ 2014-10-17 6:06 UTC (permalink / raw)
To: netdev; +Cc: edumazet
From: Li RongQing <roy.qing.li@gmail.com>
pskb_may_pull maybe change skb->data and make eth pointer oboslete,
so eth needs to reload
Fixes: 91269e390d062 ("vxlan: using pskb_may_pull as early as possible")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
drivers/net/vxlan.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 855a81d..fabd514 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1885,6 +1885,7 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
msg->icmph.icmp6_type == NDISC_NEIGHBOUR_SOLICITATION)
return neigh_reduce(dev, skb);
}
+ eth = eth_hdr(skb);
#endif
}
--
1.7.10.4
^ permalink raw reply related
* Re: [PATCH net-next] ipvs: Avoid null-pointer deref in debug code
From: Simon Horman @ 2014-10-17 6:27 UTC (permalink / raw)
To: Julian Anastasov
Cc: Alex Gartrell, dan.carpenter, lvs-devel, netdev, kernel-team
In-Reply-To: <alpine.LFD.2.11.1410062159290.2258@ja.home.ssi.bg>
On Mon, Oct 06, 2014 at 10:01:08PM +0300, Julian Anastasov wrote:
>
> Hello,
>
> On Mon, 6 Oct 2014, Alex Gartrell wrote:
>
> > Use daddr instead of reaching into dest.
> >
> > Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
> > Signed-off-by: Alex Gartrell <agartrell@fb.com>
>
> Thanks!
>
> Acked-by: Julian Anastasov <ja@ssi.bg>
Thanks, I have applied this to the ipvs tree and will see about
both getting it included in v3.18 and v3.17-stable.
> > ---
> > net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> > index 91f17c1..437a366 100644
> > --- a/net/netfilter/ipvs/ip_vs_xmit.c
> > +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> > @@ -316,7 +316,7 @@ __ip_vs_get_out_rt(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
> > if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
> > local))) {
> > IP_VS_DBG_RL("We are crossing local and non-local addresses"
> > - " daddr=%pI4\n", &dest->addr.ip);
> > + " daddr=%pI4\n", &daddr);
> > goto err_put;
> > }
> >
> > @@ -458,7 +458,7 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
> > if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
> > local))) {
> > IP_VS_DBG_RL("We are crossing local and non-local addresses"
> > - " daddr=%pI6\n", &dest->addr.in6);
> > + " daddr=%pI6\n", daddr);
> > goto err_put;
> > }
> >
> > --
> > Alex Gartrell <agartrell@fb.com>
>
> Regards
>
> --
> Julian Anastasov <ja@ssi.bg>
>
^ permalink raw reply
* [PATCH net-next] openvswitch: Create right mask with disabled megaflows
From: Pravin B Shelar @ 2014-10-17 4:55 UTC (permalink / raw)
To: davem; +Cc: netdev, Pravin B Shelar, Daniele Di Proietto, Andy Zhou
If megaflows are disabled, the userspace does not send the netlink attribute
OVS_FLOW_ATTR_MASK, and the kernel must create an exact match mask.
sw_flow_mask_set() sets every bytes (in 'range') of the mask to 0xff, even the
bytes that represent padding for struct sw_flow, or the bytes that represent
fields that may not be set during ovs_flow_extract().
This is a problem, because when we extract a flow from a packet,
we do not memset() anymore the struct sw_flow to 0.
This commit gets rid of sw_flow_mask_set() and introduces mask_set_nlattr(),
which operates on the netlink attributes rather than on the mask key. Using
this approach we are sure that only the bytes that the user provided in the
flow are matched.
Also, if the parse_flow_mask_nlattrs() for the mask ENCAP attribute fails, we
now return with an error.
This bug is introduced by commit 0714812134d7dcadeb7ecfbfeb18788aa7e1eaac
("openvswitch: Eliminate memset() from flow_extract").
Reported-by: Alex Wang <alexw@nicira.com>
Signed-off-by: Daniele Di Proietto <ddiproietto@vmware.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
---
net/openvswitch/flow_netlink.c | 93 +++++++++++++++++++++++++++++++---------
1 files changed, 72 insertions(+), 21 deletions(-)
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 368f233..939bcb3 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -103,10 +103,19 @@ static void update_range__(struct sw_flow_match *match,
SW_FLOW_KEY_MEMCPY_OFFSET(match, offsetof(struct sw_flow_key, field), \
value_p, len, is_mask)
-static u16 range_n_bytes(const struct sw_flow_key_range *range)
-{
- return range->end - range->start;
-}
+#define SW_FLOW_KEY_MEMSET_FIELD(match, field, value, is_mask) \
+ do { \
+ update_range__(match, offsetof(struct sw_flow_key, field), \
+ sizeof((match)->key->field), is_mask); \
+ if (is_mask) { \
+ if ((match)->mask) \
+ memset((u8 *)&(match)->mask->key.field, value,\
+ sizeof((match)->mask->key.field)); \
+ } else { \
+ memset((u8 *)&(match)->key->field, value, \
+ sizeof((match)->key->field)); \
+ } \
+ } while (0)
static bool match_validate(const struct sw_flow_match *match,
u64 key_attrs, u64 mask_attrs)
@@ -809,13 +818,26 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs,
return 0;
}
-static void sw_flow_mask_set(struct sw_flow_mask *mask,
- struct sw_flow_key_range *range, u8 val)
+static void nlattr_set(struct nlattr *attr, u8 val, bool is_attr_mask_key)
{
- u8 *m = (u8 *)&mask->key + range->start;
+ struct nlattr *nla;
+ int rem;
+
+ /* The nlattr stream should already have been validated */
+ nla_for_each_nested(nla, attr, rem) {
+ /* We assume that ovs_key_lens[type] == -1 means that type is a
+ * nested attribute
+ */
+ if (is_attr_mask_key && ovs_key_lens[nla_type(nla)] == -1)
+ nlattr_set(nla, val, false);
+ else
+ memset(nla_data(nla), val, nla_len(nla));
+ }
+}
- mask->range = *range;
- memset(m, val, range_n_bytes(range));
+static void mask_set_nlattr(struct nlattr *attr, u8 val)
+{
+ nlattr_set(attr, val, true);
}
/**
@@ -836,6 +858,7 @@ int ovs_nla_get_match(struct sw_flow_match *match,
{
const struct nlattr *a[OVS_KEY_ATTR_MAX + 1];
const struct nlattr *encap;
+ struct nlattr *newmask = NULL;
u64 key_attrs = 0;
u64 mask_attrs = 0;
bool encap_valid = false;
@@ -882,18 +905,44 @@ int ovs_nla_get_match(struct sw_flow_match *match,
if (err)
return err;
+ if (match->mask && !mask) {
+ /* Create an exact match mask. We need to set to 0xff all the
+ * 'match->mask' fields that have been touched in 'match->key'.
+ * We cannot simply memset 'match->mask', because padding bytes
+ * and fields not specified in 'match->key' should be left to 0.
+ * Instead, we use a stream of netlink attributes, copied from
+ * 'key' and set to 0xff: ovs_key_from_nlattrs() will take care
+ * of filling 'match->mask' appropriately.
+ */
+ newmask = kmemdup(key, nla_total_size(nla_len(key)),
+ GFP_KERNEL);
+ if (!newmask)
+ return -ENOMEM;
+
+ mask_set_nlattr(newmask, 0xff);
+
+ /* The userspace does not send tunnel attributes that are 0,
+ * but we should not wildcard them nonetheless.
+ */
+ if (match->key->tun_key.ipv4_dst)
+ SW_FLOW_KEY_MEMSET_FIELD(match, tun_key, 0xff, true);
+
+ mask = newmask;
+ }
+
if (mask) {
err = parse_flow_mask_nlattrs(mask, a, &mask_attrs);
if (err)
- return err;
+ goto free_newmask;
- if (mask_attrs & 1 << OVS_KEY_ATTR_ENCAP) {
+ if (mask_attrs & 1 << OVS_KEY_ATTR_ENCAP) {
__be16 eth_type = 0;
__be16 tci = 0;
if (!encap_valid) {
OVS_NLERR("Encap mask attribute is set for non-VLAN frame.\n");
- return -EINVAL;
+ err = -EINVAL;
+ goto free_newmask;
}
mask_attrs &= ~(1 << OVS_KEY_ATTR_ENCAP);
@@ -904,10 +953,13 @@ int ovs_nla_get_match(struct sw_flow_match *match,
mask_attrs &= ~(1 << OVS_KEY_ATTR_ETHERTYPE);
encap = a[OVS_KEY_ATTR_ENCAP];
err = parse_flow_mask_nlattrs(encap, a, &mask_attrs);
+ if (err)
+ goto free_newmask;
} else {
OVS_NLERR("VLAN frames must have an exact match on the TPID (mask=%x).\n",
ntohs(eth_type));
- return -EINVAL;
+ err = -EINVAL;
+ goto free_newmask;
}
if (a[OVS_KEY_ATTR_VLAN])
@@ -915,23 +967,22 @@ int ovs_nla_get_match(struct sw_flow_match *match,
if (!(tci & htons(VLAN_TAG_PRESENT))) {
OVS_NLERR("VLAN tag present bit must have an exact match (tci_mask=%x).\n", ntohs(tci));
- return -EINVAL;
+ err = -EINVAL;
+ goto free_newmask;
}
}
err = ovs_key_from_nlattrs(match, mask_attrs, a, true);
if (err)
- return err;
- } else {
- /* Populate exact match flow's key mask. */
- if (match->mask)
- sw_flow_mask_set(match->mask, &match->range, 0xff);
+ goto free_newmask;
}
if (!match_validate(match, key_attrs, mask_attrs))
- return -EINVAL;
+ err = -EINVAL;
- return 0;
+free_newmask:
+ kfree(newmask);
+ return err;
}
/**
--
1.7.1
^ permalink raw reply related
* Re: [PATCH net] r8152: use cancel_delayed_work for runtime suspend
From: Bjørn Mork @ 2014-10-17 7:42 UTC (permalink / raw)
To: Hayes Wang; +Cc: netdev, nic_swsd, linux-kernel, linux-usb
In-Reply-To: <1394712342-15778-64-Taiwan-albertk@realtek.com>
Hayes Wang <hayeswang@realtek.com> writes:
> It would cause dead lock for runtime suspend, when the workqueue
> is running and runtime suspend occurs before the workqueue wakes
> up the device. The rtl8152_suspend() waits the workqueue to finish
> because of calling cancel_delayed_work_sync(). The workqueue waits
> the suspend function to finish for waking up the device because of
> calling usb_autopm_get_interface().
>
> Signed-off-by: Hayes Wang <hayeswang@realtek.com>
> ---
> drivers/net/usb/r8152.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
> index 864159e..7d4e55a 100644
> --- a/drivers/net/usb/r8152.c
> +++ b/drivers/net/usb/r8152.c
> @@ -3200,12 +3200,13 @@ static int rtl8152_suspend(struct usb_interface *intf, pm_message_t message)
> if (netif_running(tp->netdev)) {
> clear_bit(WORK_ENABLE, &tp->flags);
> usb_kill_urb(tp->intr_urb);
> - cancel_delayed_work_sync(&tp->schedule);
> tasklet_disable(&tp->tl);
> if (test_bit(SELECTIVE_SUSPEND, &tp->flags)) {
> + cancel_delayed_work(&tp->schedule);
> rtl_stop_rx(tp);
> rtl_runtime_suspend_enable(tp, true);
> } else {
> + cancel_delayed_work_sync(&tp->schedule);
> tp->rtl_ops.down(tp);
> }
> tasklet_enable(&tp->tl);
This looks strange to me. The delayed work will cause an immediate
resume due to the usb_autopm_get_interface() it starts with. Wouldn't
it be better to just prevent runtime suspending by returning -EBUSY if
there is any delayed work scheduled?
Bjørn
^ permalink raw reply
* Re: [RFC PATCH net-next] net: ipv6: Add a sysctl to make optimistic addresses useful candidates
From: Lorenzo Colitti @ 2014-10-17 8:31 UTC (permalink / raw)
To: Erik Kline; +Cc: netdev@vger.kernel.org, Hannes Frederic Sowa, Eric Dumazet
In-Reply-To: <1413520309-13814-1-git-send-email-ek@google.com>
On Fri, Oct 17, 2014 at 1:31 PM, Erik Kline <ek@google.com> wrote:
> Add a sysctl that causes an interface's optimistic addresses
> to be considered equivalent to other non-deprecated addresses
> for source address selection purposes. Preferred addresses
> will still take precendence over optimistic addresses, subject
> to other ranking in the source address selection algorithm.
> [...]
> Signed-off-by: Erik Kline <ek@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
^ permalink raw reply
* [PATCH] ipv4: fix a potential use after free
From: roy.qing.li @ 2014-10-17 8:53 UTC (permalink / raw)
To: netdev; +Cc: pshelar
From: Li RongQing <roy.qing.li@gmail.com>
pskb_may_pull() maybe change skb->data and make eth pointer oboslete,
so set eth after pskb_may_pull()
Fixes:3d7b46cd("ip_tunnel: push generic protocol handling to ip_tunnel module")
Cc: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
net/ipv4/ip_tunnel_core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index f4c987b..88c386c 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -91,11 +91,12 @@ int iptunnel_pull_header(struct sk_buff *skb, int hdr_len, __be16 inner_proto)
skb_pull_rcsum(skb, hdr_len);
if (inner_proto == htons(ETH_P_TEB)) {
- struct ethhdr *eh = (struct ethhdr *)skb->data;
+ struct ethhdr *eh;
if (unlikely(!pskb_may_pull(skb, ETH_HLEN)))
return -ENOMEM;
+ eh = (struct ethhdr *)skb->data;
if (likely(ntohs(eh->h_proto) >= ETH_P_802_3_MIN))
skb->protocol = eh->h_proto;
else
--
1.7.10.4
^ permalink raw reply related
* [PATCH] ipv4: fix a potential use after free in fou.c
From: roy.qing.li @ 2014-10-17 8:53 UTC (permalink / raw)
To: netdev; +Cc: therbert
From: Li RongQing <roy.qing.li@gmail.com>
pskb_may_pull() maybe change skb->data and make uh pointer oboslete,
so reload uh and guehdr
Fixes: 37dd0247 ("gue: Receive side for Generic UDP Encapsulation")
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
---
net/ipv4/fou.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index efa70ad..32e7892 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -87,6 +87,9 @@ static int gue_udp_recv(struct sock *sk, struct sk_buff *skb)
if (!pskb_may_pull(skb, len))
goto drop;
+ uh = udp_hdr(skb);
+ guehdr = (struct guehdr *)&uh[1];
+
if (guehdr->version != 0)
goto drop;
--
1.7.10.4
^ permalink raw reply related
* [PATCH net] r8152: return -EBUSY for runtime suspend
From: Hayes Wang @ 2014-10-17 8:55 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
In-Reply-To: <87zjcvgn1x.fsf@nemi.mork.no>
Remove calling cancel_delayed_work_sync() for runtime suspend,
because it would cause dead lock. Instead, return -EBUSY to
avoid the device enters suspending if the net is running and
the delayed work is pending or running. The delayed work would
try to wake up the device later, so the suspending is not
necessary.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
drivers/net/usb/r8152.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 864159e..e3d84c3 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -3189,31 +3189,39 @@ static void r8153_init(struct r8152 *tp)
static int rtl8152_suspend(struct usb_interface *intf, pm_message_t message)
{
struct r8152 *tp = usb_get_intfdata(intf);
+ struct net_device *netdev = tp->netdev;
+ int ret = 0;
mutex_lock(&tp->control);
- if (PMSG_IS_AUTO(message))
+ if (PMSG_IS_AUTO(message)) {
+ if (netif_running(netdev) && work_busy(&tp->schedule.work)) {
+ ret = -EBUSY;
+ goto out1;
+ }
+
set_bit(SELECTIVE_SUSPEND, &tp->flags);
- else
- netif_device_detach(tp->netdev);
+ } else {
+ netif_device_detach(netdev);
+ }
- if (netif_running(tp->netdev)) {
+ if (netif_running(netdev)) {
clear_bit(WORK_ENABLE, &tp->flags);
usb_kill_urb(tp->intr_urb);
- cancel_delayed_work_sync(&tp->schedule);
tasklet_disable(&tp->tl);
if (test_bit(SELECTIVE_SUSPEND, &tp->flags)) {
rtl_stop_rx(tp);
rtl_runtime_suspend_enable(tp, true);
} else {
+ cancel_delayed_work_sync(&tp->schedule);
tp->rtl_ops.down(tp);
}
tasklet_enable(&tp->tl);
}
-
+out1:
mutex_unlock(&tp->control);
- return 0;
+ return ret;
}
static int rtl8152_resume(struct usb_interface *intf)
--
1.9.3
^ permalink raw reply related
* RE: [PATCH] net: use hardware buffer pool to allocate skb
From: David Laight @ 2014-10-17 9:11 UTC (permalink / raw)
To: 'Alexander Duyck', Eric Dumazet
Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
netdev@vger.kernel.org, LeoLi@freescale.com,
linux-doc@vger.kernel.org
In-Reply-To: <54400C6C.7010405@redhat.com>
From: Alexander Duyck
...
> Actually the likelihood of anything holding onto the 4K page for very
> long doesn't seem to occur, at least from the drivers perspective. It
> is one of the reasons why I went for the page reuse approach rather than
> just partitioning a single large page. It allows us to avoid having to
> call IOMMU map/unmap for the pages since the entire page is usually back
> in the driver ownership before we need to reuse the portion given to the
> stack.
That is almost certainly true for most benchmarks, benchmark processes
consume receive data.
But what about real life situations?
There must be some 'normal' workloads where receive data doesn't get consumed.
David
^ permalink raw reply
* Re: [PATCH] virtio_net: fix use after free
From: Michael S. Tsirkin @ 2014-10-17 9:20 UTC (permalink / raw)
To: David Miller; +Cc: netdev, linux-kernel, virtualization
In-Reply-To: <20141015.164727.298073723527552510.davem@davemloft.net>
On Wed, Oct 15, 2014 at 04:47:27PM -0400, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Wed, 15 Oct 2014 16:23:28 +0300
>
> > You used __netif_subqueue_stopped but that seems to use
> > a slightly more expensive test_bit internally.
>
> More expensive in what sense? It should be roughly the same
> as "x & y" sans the volatile.
I really just meant volatile - this might prevent some compiler
optimizations. I have't actually checked the produced binary so
I don't know for sure.
> Anyways I'm ambivalent and I want to see this bug fixes, so I'll
> apply your patch.
>
> Thanks!
^ permalink raw reply
* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Eric Dumazet @ 2014-10-17 14:05 UTC (permalink / raw)
To: Jiafei.Pan@freescale.com
Cc: Alexander Duyck, David Miller, jkosina@suse.cz,
netdev@vger.kernel.org, LeoLi@freescale.com,
linux-doc@vger.kernel.org
In-Reply-To: <92e45ba6015c4c48b21baeade3f9c929@BLUPR03MB517.namprd03.prod.outlook.com>
On Fri, 2014-10-17 at 02:35 +0000, Jiafei.Pan@freescale.com wrote:
> [Pan Jiafei] Hi, Alex, thanks for your comments. I don’t confirm that
> you have catch my concerns. For example, I want to add igb net card
> into my bridge, and want to igb net driver allocate skb by using
> my specified memory address, but I don’t want to modify igb net driver
> directly, how to do this in my bridge drivers?
This is exactly our point : We do not want to modify all drivers so that
your bridge is happy with them.
You'll have to make your bridge using standard infra instead.
IGB has no way to know in advance that a particular frame should
eventually reach your bridge anyway.
^ permalink raw reply
* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-17 14:12 UTC (permalink / raw)
To: Eric Dumazet, Jiafei.Pan@freescale.com
Cc: Alexander Duyck, David Miller, jkosina@suse.cz,
netdev@vger.kernel.org, LeoLi@freescale.com,
linux-doc@vger.kernel.org
In-Reply-To: <1413554709.24953.5.camel@edumazet-glaptop2.roam.corp.google.com>
On 10/17/2014 07:05 AM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 02:35 +0000, Jiafei.Pan@freescale.com wrote:
>
>> [Pan Jiafei] Hi, Alex, thanks for your comments. I don’t confirm that
>> you have catch my concerns. For example, I want to add igb net card
>> into my bridge, and want to igb net driver allocate skb by using
>> my specified memory address, but I don’t want to modify igb net driver
>> directly, how to do this in my bridge drivers?
> This is exactly our point : We do not want to modify all drivers so that
> your bridge is happy with them.
>
> You'll have to make your bridge using standard infra instead.
>
> IGB has no way to know in advance that a particular frame should
> eventually reach your bridge anyway.
Also why is it igb should use your buffers, is there any reason why your
device cannot use the receive buffers that are handed off to the stack
from igb? It isn't as if there is a copy in the routing or bridging
path. The receive buffer is normally handed off to the stack from the
ingress device, a few headers might get tweaked, and then that buffer is
transmitted by the egress interface. Only in the case of a buffer being
handed to multiple egress devices or user space should it ever be copied.
Thanks,
Alex
^ permalink raw reply
* [PATCH net-next] sfc: add support for skb->xmit_more
From: Edward Cree @ 2014-10-17 14:32 UTC (permalink / raw)
To: Robert Stonehouse
Cc: Edward Cree, Daniel Borkmann, davem, nikolay, netdev,
Shradha Shah, Jon Cooper, linux-net-drivers
In-Reply-To: <543FF408.3000808@solarflare.com>
Don't ring the doorbell, and don't do PIO. This will also prevent
TX Push, because there will be more than one buffer waiting when
the doorbell is rung.
Signed-off-by: Edward Cree <ecree@solarflare.com>
---
drivers/net/ethernet/sfc/nic.h | 29 +++++++++++++++++++++++-----
drivers/net/ethernet/sfc/tx.c | 43 +++++++++++++++++++-----------------------
2 files changed, 43 insertions(+), 29 deletions(-)
diff --git a/drivers/net/ethernet/sfc/nic.h b/drivers/net/ethernet/sfc/nic.h
index 60f8514..f77cce0 100644
--- a/drivers/net/ethernet/sfc/nic.h
+++ b/drivers/net/ethernet/sfc/nic.h
@@ -71,9 +71,17 @@ efx_tx_desc(struct efx_tx_queue *tx_queue, unsigned int index)
return ((efx_qword_t *) (tx_queue->txd.buf.addr)) + index;
}
-/* Report whether the NIC considers this TX queue empty, given the
- * write_count used for the last doorbell push. May return false
- * negative.
+/* Get partner of a TX queue, seen as part of the same net core queue */
+static struct efx_tx_queue *efx_tx_queue_partner(struct efx_tx_queue *tx_queue)
+{
+ if (tx_queue->queue & EFX_TXQ_TYPE_OFFLOAD)
+ return tx_queue - EFX_TXQ_TYPE_OFFLOAD;
+ else
+ return tx_queue + EFX_TXQ_TYPE_OFFLOAD;
+}
+
+/* Report whether this TX queue would be empty for the given write_count.
+ * May return false negative.
*/
static inline bool __efx_nic_tx_is_empty(struct efx_tx_queue *tx_queue,
unsigned int write_count)
@@ -86,9 +94,18 @@ static inline bool __efx_nic_tx_is_empty(struct efx_tx_queue *tx_queue,
return ((empty_read_count ^ write_count) & ~EFX_EMPTY_COUNT_VALID) == 0;
}
-static inline bool efx_nic_tx_is_empty(struct efx_tx_queue *tx_queue)
+/* Decide whether we can use TX PIO, ie. write packet data directly into
+ * a buffer on the device. This can reduce latency at the expense of
+ * throughput, so we only do this if both hardware and software TX rings
+ * are empty. This also ensures that only one packet at a time can be
+ * using the PIO buffer.
+ */
+static inline bool efx_nic_may_tx_pio(struct efx_tx_queue *tx_queue)
{
- return __efx_nic_tx_is_empty(tx_queue, tx_queue->write_count);
+ struct efx_tx_queue *partner = efx_tx_queue_partner(tx_queue);
+ return tx_queue->piobuf &&
+ __efx_nic_tx_is_empty(tx_queue, tx_queue->insert_count) &&
+ __efx_nic_tx_is_empty(partner, partner->insert_count);
}
/* Decide whether to push a TX descriptor to the NIC vs merely writing
@@ -96,6 +113,8 @@ static inline bool efx_nic_tx_is_empty(struct efx_tx_queue *tx_queue)
* descriptor to an empty queue, but is otherwise pointless. Further,
* Falcon and Siena have hardware bugs (SF bug 33851) that may be
* triggered if we don't check this.
+ * We use the write_count used for the last doorbell push, to get the
+ * NIC's view of the tx queue.
*/
static inline bool efx_nic_may_push_tx_desc(struct efx_tx_queue *tx_queue,
unsigned int write_count)
diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
index 3206098..ee84a90 100644
--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -132,15 +132,6 @@ unsigned int efx_tx_max_skb_descs(struct efx_nic *efx)
return max_descs;
}
-/* Get partner of a TX queue, seen as part of the same net core queue */
-static struct efx_tx_queue *efx_tx_queue_partner(struct efx_tx_queue *tx_queue)
-{
- if (tx_queue->queue & EFX_TXQ_TYPE_OFFLOAD)
- return tx_queue - EFX_TXQ_TYPE_OFFLOAD;
- else
- return tx_queue + EFX_TXQ_TYPE_OFFLOAD;
-}
-
static void efx_tx_maybe_stop_queue(struct efx_tx_queue *txq1)
{
/* We need to consider both queues that the net core sees as one */
@@ -344,6 +335,7 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
struct efx_nic *efx = tx_queue->efx;
struct device *dma_dev = &efx->pci_dev->dev;
struct efx_tx_buffer *buffer;
+ unsigned int old_insert_count = tx_queue->insert_count;
skb_frag_t *fragment;
unsigned int len, unmap_len = 0;
dma_addr_t dma_addr, unmap_addr = 0;
@@ -351,7 +343,7 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
unsigned short dma_flags;
int i = 0;
- EFX_BUG_ON_PARANOID(tx_queue->write_count != tx_queue->insert_count);
+ EFX_BUG_ON_PARANOID(tx_queue->write_count > tx_queue->insert_count);
if (skb_shinfo(skb)->gso_size)
return efx_enqueue_skb_tso(tx_queue, skb);
@@ -369,9 +361,8 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
/* Consider using PIO for short packets */
#ifdef EFX_USE_PIO
- if (skb->len <= efx_piobuf_size && tx_queue->piobuf &&
- efx_nic_tx_is_empty(tx_queue) &&
- efx_nic_tx_is_empty(efx_tx_queue_partner(tx_queue))) {
+ if (skb->len <= efx_piobuf_size && !skb->xmit_more &&
+ efx_nic_may_tx_pio(tx_queue)) {
buffer = efx_enqueue_skb_pio(tx_queue, skb);
dma_flags = EFX_TX_BUF_OPTION;
goto finish_packet;
@@ -439,13 +430,14 @@ finish_packet:
netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
+ efx_tx_maybe_stop_queue(tx_queue);
+
/* Pass off to hardware */
- efx_nic_push_buffers(tx_queue);
+ if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
+ efx_nic_push_buffers(tx_queue);
tx_queue->tx_packets++;
- efx_tx_maybe_stop_queue(tx_queue);
-
return NETDEV_TX_OK;
dma_err:
@@ -458,7 +450,7 @@ finish_packet:
dev_kfree_skb_any(skb);
/* Work backwards until we hit the original insert pointer value */
- while (tx_queue->insert_count != tx_queue->write_count) {
+ while (tx_queue->insert_count != old_insert_count) {
unsigned int pkts_compl = 0, bytes_compl = 0;
--tx_queue->insert_count;
buffer = __efx_tx_queue_get_insert_buffer(tx_queue);
@@ -989,12 +981,13 @@ static int efx_tso_put_header(struct efx_tx_queue *tx_queue,
/* Remove buffers put into a tx_queue. None of the buffers must have
* an skb attached.
*/
-static void efx_enqueue_unwind(struct efx_tx_queue *tx_queue)
+static void efx_enqueue_unwind(struct efx_tx_queue *tx_queue,
+ unsigned int insert_count)
{
struct efx_tx_buffer *buffer;
/* Work backwards until we hit the original insert pointer value */
- while (tx_queue->insert_count != tx_queue->write_count) {
+ while (tx_queue->insert_count != insert_count) {
--tx_queue->insert_count;
buffer = __efx_tx_queue_get_insert_buffer(tx_queue);
efx_dequeue_buffer(tx_queue, buffer, NULL, NULL);
@@ -1258,13 +1251,14 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
struct sk_buff *skb)
{
struct efx_nic *efx = tx_queue->efx;
+ unsigned int old_insert_count = tx_queue->insert_count;
int frag_i, rc;
struct tso_state state;
/* Find the packet protocol and sanity-check it */
state.protocol = efx_tso_check_protocol(skb);
- EFX_BUG_ON_PARANOID(tx_queue->write_count != tx_queue->insert_count);
+ EFX_BUG_ON_PARANOID(tx_queue->write_count > tx_queue->insert_count);
rc = tso_start(&state, efx, skb);
if (rc)
@@ -1308,11 +1302,12 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
netdev_tx_sent_queue(tx_queue->core_txq, skb->len);
- /* Pass off to hardware */
- efx_nic_push_buffers(tx_queue);
-
efx_tx_maybe_stop_queue(tx_queue);
+ /* Pass off to hardware */
+ if (!skb->xmit_more || netif_xmit_stopped(tx_queue->core_txq))
+ efx_nic_push_buffers(tx_queue);
+
tx_queue->tso_bursts++;
return NETDEV_TX_OK;
@@ -1336,6 +1331,6 @@ static int efx_enqueue_skb_tso(struct efx_tx_queue *tx_queue,
dma_unmap_single(&efx->pci_dev->dev, state.header_dma_addr,
state.header_unmap_len, DMA_TO_DEVICE);
- efx_enqueue_unwind(tx_queue);
+ efx_enqueue_unwind(tx_queue, old_insert_count);
return NETDEV_TX_OK;
}
--
1.7.11.7
^ permalink raw reply related
* Re: [PATCH] net: use hardware buffer pool to allocate skb
From: Alexander Duyck @ 2014-10-17 14:40 UTC (permalink / raw)
To: David Laight, Eric Dumazet
Cc: Jiafei.Pan@freescale.com, David Miller, jkosina@suse.cz,
netdev@vger.kernel.org, LeoLi@freescale.com,
linux-doc@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1C9D895B@AcuExch.aculab.com>
On 10/17/2014 02:11 AM, David Laight wrote:
> From: Alexander Duyck
> ...
>> Actually the likelihood of anything holding onto the 4K page for very
>> long doesn't seem to occur, at least from the drivers perspective. It
>> is one of the reasons why I went for the page reuse approach rather than
>> just partitioning a single large page. It allows us to avoid having to
>> call IOMMU map/unmap for the pages since the entire page is usually back
>> in the driver ownership before we need to reuse the portion given to the
>> stack.
> That is almost certainly true for most benchmarks, benchmark processes
> consume receive data.
> But what about real life situations?
>
> There must be some 'normal' workloads where receive data doesn't get consumed.
>
> David
>
Yes, but for workloads where receive data doesn't get consumed it is
very unlikely that much receive data is generated. As such from the
device perspective the time the socket is holding the page is still not
all that long as the descriptor ring is not being looped through that
quickly. The page has almost always been freed by the time we have
processed our way through the full descriptor ring.
Thanks,
Alex
^ permalink raw reply
* Re: tcpdump's capture filter: "vlan" doesn't match
From: Daniel Borkmann @ 2014-10-17 15:48 UTC (permalink / raw)
To: Lukas Tribus
Cc: netdev@vger.kernel.org, John Fastabend, Michał Mirosław,
Jiri Pirko, Ben Hutchings, Atzm Watanabe, Patrick McHardy,
Jesse Gross, Michael Richardson, Ani Sinha
In-Reply-To: <DUB123-W25C5DA9BEDC92CD2524B68EDAB0@phx.gbl>
On 10/17/2014 01:25 AM, Lukas Tribus wrote:
>>> Isn't disabling rx-vlan-offloading supposed to remedy those problems?
>>
>> There were some discussions on this in the past e.g. [1]. We have
>> SKF_AD_VLAN_TAG and SKF_AD_VLAN_TAG_PRESENT for the BPF filter on
>> this, but libpcap is currently not making use of any of them.
>>
>> [1] http://thread.gmane.org/gmane.linux.network/247947
>
> Thanks for the link. I see the situation is unfortunate and although those
> new BPF filters in the kernel may fix the actual filtering problem, one
> thing seems to remain impossible: disabling all this kernel magic and
> passing the frame as-is to libpcap without interception (avoiding any
> kind of artificial header reconstruction).
>
> How is the situation with netsniff-ng anyway? Does it use vlan BPF filter
> in the kernel?
So in netsniff-ng we don't do obscure header reconstruction as it
hurts performance and it can result in incorrect reconstruction cases.
You however can define a bpf_asm program (e.g. tools/net/ in kernel
tree) and use instruction overloading for the vlan case from there.
Thus, you're not tied to the libpcap compiler which misses this.
For more details, I refer you to Documentation/networking/filter.txt .
^ permalink raw reply
* Re: [PATCH] atm: simplify lanai.c by using module_pci_driver
From: David Miller @ 2014-10-17 15:55 UTC (permalink / raw)
To: michael.opdenacker; +Cc: chas, linux-atm-general, netdev, linux-kernel
In-Reply-To: <1413359150-13552-1-git-send-email-michael.opdenacker@free-electrons.com>
From: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Date: Wed, 15 Oct 2014 09:45:50 +0200
> This simplifies the lanai.c driver by using
> the module_pci_driver() macro, at the expense
> of losing only debugging messages.
>
> Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH] ipv4: dst_entry leak in ip_send_unicast_reply()
From: David Miller @ 2014-10-17 15:59 UTC (permalink / raw)
To: vvs; +Cc: eric.dumazet, netdev, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <543E6762.9090807@parallels.com>
From: Vasily Averin <vvs@parallels.com>
Date: Wed, 15 Oct 2014 16:24:02 +0400
> Fixes: 2e77d89b2fa8 ("net: avoid a pair of dst_hold()/dst_release() in ip_append_data()")
>
> ip_setup_cork() called inside ip_append_data() steals dst entry from rt to cork
> and in case errors in __ip_append_data() nobody frees stolen dst entry
>
> Signed-off-by: Vasily Averin <vvs@parallels.com>
Please, in the future, put the Fixes: tag first in the list of signoffs.
Eric, please review this change, it looks good to me.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox