Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH RFC v3 0/3] virtio_net: enabling tx interrupts
From: Jason Wang @ 2014-12-02  3:36 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-kernel, netdev
In-Reply-To: <20141201104808.GA16661@redhat.com>



On Mon, Dec 1, 2014 at 6:48 PM, Michael S. Tsirkin <mst@redhat.com> 
wrote:
> On Mon, Dec 01, 2014 at 06:14:36PM +0800, Jason Wang wrote:
>>  
>>  
>>  On 10/20/2014 02:52 PM, Michael S. Tsirkin wrote:
>>  >RFC patches to enable tx interrupts.
>>  >This is to demonstrate how this can be done without
>>  >core virtio changes, and to make sure I understand
>>  >the new APIs correctly.
>>  >
>>  >Testing TBD, I was asked for a version for early testing.
>>  >
>>  >Applies on top of patch: "virtio_net: fix use after free"
>>  >that I recently sent.
>>  >
>>  >Changes from v3:
>>  >	clean up code, address issues raised by Jason
>>  >Changes from v1:
>>  >         address comments by Jason Wang, use delayed cb everywhere
>>  >         rebased Jason's patch on top of mine and include it (with 
>> some tweaks)
>>  >
>>  >Jason Wang (1):
>>  >   virtio-net: optimize free_old_xmit_skbs stats
>>  >
>>  >Michael S. Tsirkin (2):
>>  >   virtio_net: enable tx interrupt
>>  >   virtio_net: bql
>>  >
>>  >  drivers/net/virtio_net.c | 144 
>> +++++++++++++++++++++++++++++++++--------------
>>  >  1 file changed, 101 insertions(+), 43 deletions(-)
>>  >
>>  
>>  I've run a full tests on this series and see huge regression when 
>> zerocopy
>>  is disabled. Looks like the reason is zerocopy could coalescing tx
>>  completion which greatly reduce the number of tx interrupts.
> 
> I think you refer to this code:
> 
>         /*
>          * Trigger polling thread if guest stopped submitting new
>          * buffers:
>          * in this case, the refcount after decrement will eventually
>          * reach 1.
>          * We also trigger polling periodically after each 16 packets
>          * (the value 16 here is more or less arbitrary, it's tuned to
>          * trigger
>          * less than 10% of times).
>          */
>         if (cnt <= 1 || !(cnt % 16))
>                 vhost_poll_queue(&vq->poll);
> 
> ?
> This seems unrelated to interrupt coalescing.

Well, this in fact tries to coalesce 16 packets per tx irq.

More important, zerocopy depends on host nics tx completion.
This means, if host nic coalesces several packets per irq,
zerocopy will probably also do this. vhost_zerocopy_signal_used()
will try to coalesce the tx intrs.
> 
> We can easily enable something similar for all tx
> packets, without need for guest configuration.

We can, but we lose the an interface for user to tune for 
their applications.
> 
> If it's not clear how to do this, let me know, I'll try to put out a
> patch like this in a couple of days.

We can just do this through harding coding the tx-frames to 16 
through interrupt coalescing. I don't see obvious differences.
And in my test of RFCv4, I just use tx-frames 16 to get the result.

Without a timer for coalescing, I suspect how much this can help
for e.g 1 session of TCP_RR. It just has at most 1 packet pending
during the test. So in fact no tx completion could be coalesced 
in this case.
> 
> 
>>  I will post RFC V4 shortly with interrupt coalescing support. In 
>> this
>>  version I remove the tx packet cleanup in ndo_start_xmit() since it 
>> may
>>  reduce the effects of interrupt coalescing.
> 
> Maybe split this in a separate patch?

I can, but since just two minor changes compared to V3. Maybe just
document the differences is ok for you?

^ permalink raw reply

* Re: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts
From: Jason Wang @ 2014-12-02  3:15 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: pagupta, netdev, davem, linux-kernel, virtualization
In-Reply-To: <20141201104223.GB16108@redhat.com>



On Mon, Dec 1, 2014 at 6:42 PM, Michael S. Tsirkin <mst@redhat.com> 
wrote:
> On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
>>  Hello:
>>  
>>  We used to orphan packets before transmission for virtio-net. This 
>> breaks
>>  socket accounting and can lead serveral functions won't work, e.g:
>>  
>>  - Byte Queue Limit depends on tx completion nofication to work.
>>  - Packet Generator depends on tx completion nofication for the last
>>    transmitted packet to complete.
>>  - TCP Small Queue depends on proper accounting of sk_wmem_alloc to 
>> work.
>>  
>>  This series tries to solve the issue by enabling tx interrupts. To 
>> minize
>>  the performance impacts of this, several optimizations were used:
>>  
>>  - In guest side, virtqueue_enable_cb_delayed() was used to delay 
>> the tx
>>    interrupt untile 3/4 pending packets were sent.
>>  - In host side, interrupt coalescing were used to reduce tx 
>> interrupts.
>>  
>>  Performance test results[1] (tx-frames 16 tx-usecs 16) shows:
>>  
>>  - For guest receiving. No obvious regression on throughput were
>>    noticed. More cpu utilization were noticed in few cases.
>>  - For guest transmission. Very huge improvement on througput for 
>> small
>>    packet transmission were noticed. This is expected since TSQ and 
>> other
>>    optimization for small packet transmission work after tx 
>> interrupt. But
>>    will use more cpu for large packets.
>>  - For TCP_RR, regression (10% on transaction rate and cpu 
>> utilization) were
>>    found. Tx interrupt won't help but cause overhead in this case. 
>> Using
>>    more aggressive coalescing parameters may help to reduce the 
>> regression.
> 
> OK, you do have posted coalescing patches - does it help any?

Helps a lot.

For RX, it saves about 5% - 10% cpu. (reduce 60%-90% tx intrs)
For small packet TX, it increases 33% - 245% throughput. (reduce 
 about 60% inters)
For TCP_RR, it increase the 3%-10% trans.rate. (reduce 40%-80% tx intrs)

> 
> I'm not sure the regression is due to interrupts.
> It would make sense for CPU but why would it
> hurt transaction rate?

Anyway guest need to take some cycles to handle tx interrupts.
And transaction rate does increase if we coalesces more tx interurpts. 
> 
> 
> It's possible that we are deferring kicks too much due to BQL.
> 
> As an experiment: do we get any of it back if we do
> -        if (kick || netif_xmit_stopped(txq))
> -                virtqueue_kick(sq->vq);
> +        virtqueue_kick(sq->vq);
> ?


I will try, but during TCP_RR, at most 1 packets were pending,
I suspect if BQL can help in this case.
> 
> 
> If yes, we can just kick e.g. periodically, e.g. after queueing each
> X bytes.

Okay, let me try to see if this help.
> 
>>  Changes from RFC V3:
>>  - Don't free tx packets in ndo_start_xmit()
>>  - Add interrupt coalescing support for virtio-net
>>  Changes from RFC v2:
>>  - clean up code, address issues raised by Jason
>>  Changes from RFC v1:
>>  - address comments by Jason Wang, use delayed cb everywhere
>>  - rebased Jason's patch on top of mine and include it (with some 
>> tweaks)
>>  
>>  Please reivew. Comments were more than welcomed.
>>  
>>  [1] Performance Test result:
>>  
>>  Environment:
>>  - Two Intel(R) Xeon(R) CPU E5620 @ 2.40GHz machines connected back 
>> to back
>>    with 82599ES cards.
>>  - Both host and guest were net-next.git plus the patch
>>  - Coalescing parameters for the card:
>>    Adaptive RX: off  TX: off
>>    rx-usecs: 1
>>    rx-frames: 0
>>    tx-usecs: 0
>>    tx-frames: 0
>>  - Vhost_net was enabled and zerocopy was disabled
>>  - Tests was done by netperf-2.6
>>  - Guest has 2 vcpus with single queue virtio-net
>>  
>>  Results:
>>  - Numbers of square brackets are whose significance is grater than 
>> 95%
>>  
>>  Guest RX:
>>  
>>  size/sessions/+throughput/+cpu/+per_cpu_throughput/
>>  64/1/+2.0326/[+6.2807%]/-3.9970%/
>>  64/2/-0.2104%/[+3.2012%]/[-3.3058%]/
>>  64/4/+1.5956%/+2.2451%/-0.6353%/
>>  64/8/+1.1732%/+3.5123%/-2.2598%/
>>  256/1/+3.7619%/[+5.8117%]/-1.9372%/
>>  256/2/-0.0661%/[+3.2511%]/-3.2127%/
>>  256/4/+1.1435%/[-8.1842%]/[+10.1591%]/
>>  256/8/[+2.2447%]/[+6.2044%]/[-3.7283%]/
>>  1024/1/+9.1479%/[+12.0997%]/[-2.6332%]/
>>  1024/2/[-17.3341%]/[+0.0000%]/[-17.3341%]/
>>  1024/4/[-0.6284%]/-1.0376%/+0.4135%/
>>  1024/8/+1.1444%/-1.6069%/+2.7961%/
>>  4096/1/+0.0401%/-0.5993%/+0.6433%/
>>  4096/2/[-0.5894%]/-2.2071%/+1.6542%/
>>  4096/4/[-0.5560%]/-1.4969%/+0.9553%/
>>  4096/8/-0.3362%/+2.7086%/-2.9645%/
>>  16384/1/-0.0285%/+0.7247%/-0.7478%/
>>  16384/2/-0.5286%/+0.3287%/-0.8545%/
>>  16384/4/-0.3297%/-2.0543%/+1.7608%/
>>  16384/8/+1.0932%/+4.0253%/-2.8187%/
>>  65535/1/+0.0003%/-0.1502%/+0.1508%/
>>  65535/2/[-0.6065%]/+0.2309%/-0.8355%/
>>  65535/4/[-0.6861%]/[+3.9451%]/[-4.4554%]/
>>  65535/8/+1.8359%/+3.1590%/-1.2825%/
>>  
>>  Guest RX:
>>  size/sessions/+throughput/+cpu/+per_cpu_throughput/
>>  64/1/[+65.0961%]/[-8.6807%]/[+80.7900%]/
>>  64/2/[+6.0288%]/[-2.2823%]/[+8.5052%]/
>>  64/4/[+5.9038%]/[-2.1834%]/[+8.2677%]/
>>  64/8/[+5.4154%]/[-2.1804%]/[+7.7651%]/
>>  256/1/[+184.6462%]/[+4.8906%]/[+171.3742%]/
>>  256/2/[+46.0731%]/[-8.9626%]/[+60.4539%]/
>>  256/4/[+45.8547%]/[-8.3027%]/[+59.0612%]/
>>  256/8/[+45.3486%]/[-8.4024%]/[+58.6817%]/
>>  1024/1/[+432.5372%]/[+3.9566%]/[+412.2689%]/
>>  1024/2/[-1.4207%]/[-23.6426%]/[+29.1025%]/
>>  1024/4/-0.1003%/[-13.6416%]/[+15.6804%]/
>>  1024/8/[+0.2200%]/[+2.0634%]/[-1.8061%]/
>>  4096/1/[+18.4835%]/[-46.1508%]/[+120.0283%]/
>>  4096/2/+0.1770%/[-26.2780%]/[+35.8848%]/
>>  4096/4/-0.1012%/-0.7353%/+0.6388%/
>>  4096/8/-0.6091%/+1.4159%/-1.9968%/
>>  16384/1/-0.0424%/[+11.9373%]/[-10.7021%]/
>>  16384/2/+0.0482%/+2.4685%/-2.3620%/
>>  16384/4/+0.0840%/[+5.3587%]/[-5.0064%]/
>>  16384/8/+0.0048%/[+5.0176%]/[-4.7733%]/
>>  65535/1/-0.0095%/[+10.9408%]/[-9.8705%]/
>>  65535/2/+0.1515%/[+8.1709%]/[-7.4137%]/
>>  65535/4/+0.0203%/[+5.4316%]/[-5.1325%]/
>>  65535/8/+0.1427%/[+6.2753%]/[-5.7705%]/
>>  
>>  size/sessions/+trans.rate/+cpu/+per_cpu_trans.rate/
>>  64/1/+0.2346%/[+11.5080%]/[-10.1099%]/
>>  64/25/[-10.7893%]/-0.5791%/[-10.2697%]/
>>  64/50/[-11.5997%]/-0.3429%/[-11.2956%]/
>>  256/1/+0.7219%/[+13.2374%]/[-11.0524%]/
>>  256/25/-6.9567%/+0.8887%/[-7.7763%]/
>>  256/50/[-4.8814%]/-0.0338%/[-4.8492%]/
>>  4096/1/-1.6061%/-0.7561%/-0.8565%/
>>  4096/25/[+2.2120%]/[+1.0839%]/+1.1161%/
>>  4096/50/[+5.6180%]/[+3.2116%]/[+2.3315%]/
>>  
>>  Jason Wang (4):
>>    virtio_net: enable tx interrupt
>>    virtio-net: optimize free_old_xmit_skbs stats
>>    virtio-net: add basic interrupt coalescing support
>>    vhost_net: interrupt coalescing support
>>  
>>  Michael S. Tsirkin (1):
>>    virtio_net: bql
>>  
>>   drivers/net/virtio_net.c        | 211 
>> ++++++++++++++++++++++++++++++++--------
>>   drivers/vhost/net.c             | 200 
>> +++++++++++++++++++++++++++++++++++--
>>   include/uapi/linux/vhost.h      |  12 +++
>>   include/uapi/linux/virtio_net.h |  12 +++
>>   4 files changed, 383 insertions(+), 52 deletions(-)
>>  
>>  -- 
>>  1.8.3.1

^ permalink raw reply

* Re: [PATCH net-next] udp: Neaten and reduce size of compute_score functions
From: Joe Perches @ 2014-12-02  3:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, LKML
In-Reply-To: <1417489141.4442.24.camel@edumazet-glaptop2.roam.corp.google.com>

On Mon, 2014-12-01 at 18:59 -0800, Eric Dumazet wrote:
> On Mon, 2014-12-01 at 17:39 -0800, Joe Perches wrote:
> > The compute_score functions are a bit difficult to read.
> > 
> > Neaten them a bit to reduce object sizes and make them a
> > bit more intelligible.
> > 
> > Return early to avoid indentation and avoid unnecessary
> > initializations.
[]
> > +	if (!(net_eq(sock_net(sk), net) &&
> > +	      udp_sk(sk)->udp_port_hash == hnum &&
> > +	      !ipv6_only_sock(sk)))
> > +		return -1
> 
> Or even better :
> 
> 
> 	if (!net_eq(sock_net(sk), net) ||
> 	    udp_sk(sk)->udp_port_hash != hnum ||
> 	    ipv6_only_sock(sk))
> 		return -1;

Hi Eric.

Yeah, I thought about it but thought it
simpler to not change the logic.

Either way is fine with me.

David?

btw: the same thing can be done for the v6 block too:

+       if (!(net_eq(sock_net(sk), net) && !ipv6_only_sock(sk)))
+               return -1;
 
-               if (inet->inet_rcv_saddr != daddr)
+       inet = inet_sk(sk);
+
+       if (inet->inet_rcv_saddr != daddr)
+               return -1;
+       if (inet->inet_num != hnum)
+               return -1;

to:

	if (!net_eq(sock_net(sk, net) ||
	    ipv6_only_sock(sk))
		return -1;

	inet = inet_sk(sk);

	if (inet->inet_rcv_saddr != daddr ||
	    inet->inet_num != hnum)
		return -1;

^ permalink raw reply

* Re: [PATCH RFC v4 net-next 1/5] virtio_net: enable tx interrupt
From: Jason Wang @ 2014-12-02  3:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: pagupta, netdev, linux-kernel, virtualization, davem
In-Reply-To: <20141201103536.GA16108@redhat.com>



On Mon, Dec 1, 2014 at 6:35 PM, Michael S. Tsirkin <mst@redhat.com> 
wrote:
> On Mon, Dec 01, 2014 at 06:17:04PM +0800, Jason Wang wrote:
>>  On newer hosts that support delayed tx interrupts,
>>  we probably don't have much to gain from orphaning
>>  packets early.
>>  
>>  Note: this might degrade performance for
>>  hosts without event idx support.
>>  Should be addressed by the next patch.
>> 
>>  Cc: Rusty Russell <rusty@rustcorp.com.au>
>>  Cc: Michael S. Tsirkin <mst@redhat.com>
>>  Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>  Signed-off-by: Jason Wang <jasowang@redhat.com>
> 
> Could you document the changes from the RFC I sent please?

Okay, I will sent a V5 add document more.
> 
> Are there optimizations?

Two optimizations.

- Don't do tx packets free in ndo_start_xmit(), tests shows it reduce 
  the effect of coalescing.
- Let ethtool can change the number of packets freed in one tx napi run
  through tx-frames-irq. This is necessary, since user may coalesce more
  than 64 packets per irq.

> 
> If yes, it might be easier to review (at least for me), if you 
> refactor this,
> e.g. applying the straight-forward rfc patch and then optimizations if
> any on top. If it's taking a different approach, pls feel free to
> disregard this.
> 
> 
>>  ---
>>   drivers/net/virtio_net.c | 132 
>> +++++++++++++++++++++++++++++++----------------
>>   1 file changed, 88 insertions(+), 44 deletions(-)
>>  
>>  diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>  index ec2a8b4..f68114e 100644
>>  --- a/drivers/net/virtio_net.c
>>  +++ b/drivers/net/virtio_net.c
>>  @@ -72,6 +72,8 @@ struct send_queue {
>>   
>>   	/* Name of the send queue: output.$index */
>>   	char name[40];
>>  +
>>  +	struct napi_struct napi;
>>   };
>>   
>>   /* Internal representation of a receive virtqueue */
>>  @@ -137,6 +139,9 @@ struct virtnet_info {
>>   
>>   	/* CPU hot plug notifier */
>>   	struct notifier_block nb;
>>  +
>>  +	/* Budget for polling tx completion */
>>  +	u32 tx_work_limit;
>>   };
>>   
>>   struct skb_vnet_hdr {
>>  @@ -211,15 +216,41 @@ static struct page *get_a_page(struct 
>> receive_queue *rq, gfp_t gfp_mask)
>>   	return p;
>>   }
>>   
>>  +static unsigned int free_old_xmit_skbs(struct netdev_queue *txq,
>>  +				       struct send_queue *sq, int budget)
>>  +{
>>  +	struct sk_buff *skb;
>>  +	unsigned int len;
>>  +	struct virtnet_info *vi = sq->vq->vdev->priv;
>>  +	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>>  +	unsigned int packets = 0;
>>  +
>>  +	while (packets < budget &&
>>  +	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
>>  +		pr_debug("Sent skb %p\n", skb);
>>  +
>>  +		u64_stats_update_begin(&stats->tx_syncp);
>>  +		stats->tx_bytes += skb->len;
>>  +		stats->tx_packets++;
>>  +		u64_stats_update_end(&stats->tx_syncp);
>>  +
>>  +		dev_kfree_skb_any(skb);
>>  +		packets++;
>>  +	}
>>  +
>>  +	if (sq->vq->num_free >= 2+MAX_SKB_FRAGS)
>>  +		netif_tx_start_queue(txq);
>>  +
>>  +	return packets;
>>  +}
>>  +
>>   static void skb_xmit_done(struct virtqueue *vq)
>>   {
>>   	struct virtnet_info *vi = vq->vdev->priv;
>>  +	struct send_queue *sq = &vi->sq[vq2txq(vq)];
>>   
>>  -	/* Suppress further interrupts. */
>>  -	virtqueue_disable_cb(vq);
>>  -
>>  -	/* We were probably waiting for more output buffers. */
>>  -	netif_wake_subqueue(vi->dev, vq2txq(vq));
>>  +	virtqueue_disable_cb(sq->vq);
>>  +	napi_schedule(&sq->napi);
>>   }
>>   
>>   static unsigned int mergeable_ctx_to_buf_truesize(unsigned long 
>> mrg_ctx)
>>  @@ -777,6 +808,32 @@ again:
>>   	return received;
>>   }
>>   
>>  +static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>>  +{
>>  +	struct send_queue *sq =
>>  +		container_of(napi, struct send_queue, napi);
>>  +	struct virtnet_info *vi = sq->vq->vdev->priv;
>>  +	struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, 
>> vq2txq(sq->vq));
>>  +	u32 limit = vi->tx_work_limit;
>>  +	unsigned int sent;
>>  +
>>  +	__netif_tx_lock(txq, smp_processor_id());
>>  +	sent = free_old_xmit_skbs(txq, sq, limit);
>>  +	if (sent < limit) {
>>  +		napi_complete(napi);
>>  +		/* Note: we must enable cb *after* napi_complete, because
>>  +		 * napi_schedule calls from callbacks that trigger before
>>  +		 * napi_complete are ignored.
>>  +		 */
>>  +		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>>  +			virtqueue_disable_cb(sq->vq);
>>  +			napi_schedule(&sq->napi);
>>  +		}
>>  +	}
>>  +	__netif_tx_unlock(txq);
>>  +	return sent < limit ? 0 : budget;
>>  +}
>>  +
> 
> Unlike the patch I sent, this seems to ignore the budget,
> and always poll the full napi_weight.
> Seems strange.  What is the reason for this?
> 

The budget were in fact the tx_work_limit (by default 64).
This could be tuned by ethtool tx-frames-irq to control
how many packets at most could be processed by on tx napi.

Use may want to coalesce more than 64 packets per irq, so
something like this is necessary for user.
> 
> 
>>   #ifdef CONFIG_NET_RX_BUSY_POLL
>>   /* must be called with local_bh_disable()d */
>>   static int virtnet_busy_poll(struct napi_struct *napi)
>>  @@ -825,30 +882,12 @@ static int virtnet_open(struct net_device 
>> *dev)
>>   			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
>>   				schedule_delayed_work(&vi->refill, 0);
>>   		virtnet_napi_enable(&vi->rq[i]);
>>  +		napi_enable(&vi->sq[i].napi);
>>   	}
>>   
>>   	return 0;
>>   }
>>   
>>  -static void free_old_xmit_skbs(struct send_queue *sq)
>>  -{
>>  -	struct sk_buff *skb;
>>  -	unsigned int len;
>>  -	struct virtnet_info *vi = sq->vq->vdev->priv;
>>  -	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>>  -
>>  -	while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
>>  -		pr_debug("Sent skb %p\n", skb);
>>  -
>>  -		u64_stats_update_begin(&stats->tx_syncp);
>>  -		stats->tx_bytes += skb->len;
>>  -		stats->tx_packets++;
>>  -		u64_stats_update_end(&stats->tx_syncp);
>>  -
>>  -		dev_kfree_skb_any(skb);
>>  -	}
>>  -}
>>  -
>>   static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>>   {
>>   	struct skb_vnet_hdr *hdr;
>>  @@ -912,7 +951,9 @@ static int xmit_skb(struct send_queue *sq, 
>> struct sk_buff *skb)
>>   		sg_set_buf(sq->sg, hdr, hdr_len);
>>   		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
>>   	}
>>  -	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, 
>> GFP_ATOMIC);
>>  +
>>  +	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
>>  +				    GFP_ATOMIC);
>>   }
>>   
>>   static netdev_tx_t start_xmit(struct sk_buff *skb, struct 
>> net_device *dev)
>>  @@ -924,8 +965,7 @@ static netdev_tx_t start_xmit(struct sk_buff 
>> *skb, struct net_device *dev)
>>   	struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>>   	bool kick = !skb->xmit_more;
>>   
>>  -	/* Free up any pending old buffers before queueing new ones. */
>>  -	free_old_xmit_skbs(sq);
>>  +	virtqueue_disable_cb(sq->vq);
>>   
>>   	/* Try to transmit */
>>   	err = xmit_skb(sq, skb);
>>  @@ -941,27 +981,19 @@ static netdev_tx_t start_xmit(struct sk_buff 
>> *skb, struct net_device *dev)
>>   		return NETDEV_TX_OK;
>>   	}
>>   
>>  -	/* Don't wait up for transmitted skbs to be freed. */
>>  -	skb_orphan(skb);
>>  -	nf_reset(skb);
>>  -
>>   	/* Apparently nice girls don't return TX_BUSY; stop the queue
>>   	 * before it gets out of hand.  Naturally, this wastes entries. */
>>  -	if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
>>  +	if (sq->vq->num_free < 2+MAX_SKB_FRAGS)
>>   		netif_stop_subqueue(dev, qnum);
>>  -		if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>>  -			/* More just got used, free them then recheck. */
>>  -			free_old_xmit_skbs(sq);
>>  -			if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>>  -				netif_start_subqueue(dev, qnum);
>>  -				virtqueue_disable_cb(sq->vq);
>>  -			}
>>  -		}
>>  -	}
>>   
>>   	if (kick || netif_xmit_stopped(txq))
>>   		virtqueue_kick(sq->vq);
>>   
>>  +	if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>>  +		virtqueue_disable_cb(sq->vq);
>>  +		napi_schedule(&sq->napi);
>>  +	}
>>  +
>>   	return NETDEV_TX_OK;
>>   }
>>   
>>  @@ -1138,8 +1170,10 @@ static int virtnet_close(struct net_device 
>> *dev)
>>   	/* Make sure refill_work doesn't re-enable napi! */
>>   	cancel_delayed_work_sync(&vi->refill);
>>   
>>  -	for (i = 0; i < vi->max_queue_pairs; i++)
>>  +	for (i = 0; i < vi->max_queue_pairs; i++) {
>>   		napi_disable(&vi->rq[i].napi);
>>  +		napi_disable(&vi->sq[i].napi);
>>  +	}
>>   
>>   	return 0;
>>   }
>>  @@ -1452,8 +1486,10 @@ static void virtnet_free_queues(struct 
>> virtnet_info *vi)
>>   {
>>   	int i;
>>   
>>  -	for (i = 0; i < vi->max_queue_pairs; i++)
>>  +	for (i = 0; i < vi->max_queue_pairs; i++) {
>>   		netif_napi_del(&vi->rq[i].napi);
>>  +		netif_napi_del(&vi->sq[i].napi);
>>  +	}
>>   
>>   	kfree(vi->rq);
>>   	kfree(vi->sq);
>>  @@ -1607,6 +1643,8 @@ static int virtnet_alloc_queues(struct 
>> virtnet_info *vi)
>>   		netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
>>   			       napi_weight);
>>   		napi_hash_add(&vi->rq[i].napi);
>>  +		netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
>>  +			       napi_weight);
>>   
>>   		sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
>>   		ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
>>  @@ -1790,6 +1828,8 @@ static int virtnet_probe(struct virtio_device 
>> *vdev)
>>   	if (err)
>>   		goto free_stats;
>>   
>>  +	vi->tx_work_limit = napi_weight;
>>  +
>>   #ifdef CONFIG_SYSFS
>>   	if (vi->mergeable_rx_bufs)
>>   		dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
>>  @@ -1904,8 +1944,10 @@ static int virtnet_freeze(struct 
>> virtio_device *vdev)
>>   	if (netif_running(vi->dev)) {
>>   		for (i = 0; i < vi->max_queue_pairs; i++) {
>>   			napi_disable(&vi->rq[i].napi);
>>  +			napi_disable(&vi->sq[i].napi);
>>   			napi_hash_del(&vi->rq[i].napi);
>>   			netif_napi_del(&vi->rq[i].napi);
>>  +			netif_napi_del(&vi->sq[i].napi);
>>   		}
>>   	}
>>   
>>  @@ -1930,8 +1972,10 @@ static int virtnet_restore(struct 
>> virtio_device *vdev)
>>   			if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
>>   				schedule_delayed_work(&vi->refill, 0);
>>   
>>  -		for (i = 0; i < vi->max_queue_pairs; i++)
>>  +		for (i = 0; i < vi->max_queue_pairs; i++) {
>>   			virtnet_napi_enable(&vi->rq[i]);
>>  +			napi_enable(&vi->sq[i].napi);
>>  +		}
>>   	}
>>   
>>   	netif_device_attach(vi->dev);
>>  -- 
>>  1.8.3.1

^ permalink raw reply

* Re: [PATCH net-next] udp: Neaten and reduce size of compute_score functions
From: Eric Dumazet @ 2014-12-02  2:59 UTC (permalink / raw)
  To: Joe Perches; +Cc: netdev, LKML
In-Reply-To: <1417484341.4894.6.camel@perches.com>

On Mon, 2014-12-01 at 17:39 -0800, Joe Perches wrote:
> The compute_score functions are a bit difficult to read.
> 
> Neaten them a bit to reduce object sizes and make them a
> bit more intelligible.
> 
> Return early to avoid indentation and avoid unnecessary
> initializations.

> Signed-off-by: Joe Perches <joe@perches.com>
> ---
>  net/ipv4/udp.c | 111 +++++++++++++++++++++++++++++++-------------------------
>  net/ipv6/udp.c | 113 ++++++++++++++++++++++++++++++++-------------------------
>  2 files changed, 125 insertions(+), 99 deletions(-)
> 
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index b2d6068..227e6ad 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -336,38 +336,45 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
>  	return udp_lib_get_port(sk, snum, ipv4_rcv_saddr_equal, hash2_nulladdr);
>  }
>  
> -static inline int compute_score(struct sock *sk, struct net *net, __be32 saddr,
> -			 unsigned short hnum,
> -			 __be16 sport, __be32 daddr, __be16 dport, int dif)
> +static inline int compute_score(struct sock *sk, struct net *net,
> +				__be32 saddr, unsigned short hnum, __be16 sport,
> +				__be32 daddr, __be16 dport, int dif)
>  {
> -	int score = -1;
> +	int score;
> +	struct inet_sock *inet;
>  
> -	if (net_eq(sock_net(sk), net) && udp_sk(sk)->udp_port_hash == hnum &&
> -			!ipv6_only_sock(sk)) {
> -		struct inet_sock *inet = inet_sk(sk);
> +	if (!(net_eq(sock_net(sk), net) &&
> +	      udp_sk(sk)->udp_port_hash == hnum &&
> +	      !ipv6_only_sock(sk)))
> +		return -1

Or even better :


	if (!net_eq(sock_net(sk), net) ||
	    udp_sk(sk)->udp_port_hash != hnum ||
	    ipv6_only_sock(sk))
		return -1;

^ permalink raw reply

* [PATCH net-next] udp: Neaten and reduce size of compute_score functions
From: Joe Perches @ 2014-12-02  1:39 UTC (permalink / raw)
  To: netdev; +Cc: LKML

The compute_score functions are a bit difficult to read.

Neaten them a bit to reduce object sizes and make them a
bit more intelligible.

Return early to avoid indentation and avoid unnecessary
initializations.

(allyesconfig, but w/ -O2 and no profiling)

$ size net/ipv[46]/udp.o.*
   text	   data	    bss	    dec	    hex	filename
  28680	   1184	     25	  29889	   74c1	net/ipv4/udp.o.new
  28756	   1184	     25	  29965	   750d	net/ipv4/udp.o.old
  17600	   1010	      2	  18612	   48b4	net/ipv6/udp.o.new
  17632	   1010	      2	  18644	   48d4	net/ipv6/udp.o.old

Signed-off-by: Joe Perches <joe@perches.com>
---
 net/ipv4/udp.c | 111 +++++++++++++++++++++++++++++++-------------------------
 net/ipv6/udp.c | 113 ++++++++++++++++++++++++++++++++-------------------------
 2 files changed, 125 insertions(+), 99 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b2d6068..227e6ad 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -336,38 +336,45 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
 	return udp_lib_get_port(sk, snum, ipv4_rcv_saddr_equal, hash2_nulladdr);
 }
 
-static inline int compute_score(struct sock *sk, struct net *net, __be32 saddr,
-			 unsigned short hnum,
-			 __be16 sport, __be32 daddr, __be16 dport, int dif)
+static inline int compute_score(struct sock *sk, struct net *net,
+				__be32 saddr, unsigned short hnum, __be16 sport,
+				__be32 daddr, __be16 dport, int dif)
 {
-	int score = -1;
+	int score;
+	struct inet_sock *inet;
 
-	if (net_eq(sock_net(sk), net) && udp_sk(sk)->udp_port_hash == hnum &&
-			!ipv6_only_sock(sk)) {
-		struct inet_sock *inet = inet_sk(sk);
+	if (!(net_eq(sock_net(sk), net) &&
+	      udp_sk(sk)->udp_port_hash == hnum &&
+	      !ipv6_only_sock(sk)))
+		return -1;
 
-		score = (sk->sk_family == PF_INET ? 2 : 1);
-		if (inet->inet_rcv_saddr) {
-			if (inet->inet_rcv_saddr != daddr)
-				return -1;
-			score += 4;
-		}
-		if (inet->inet_daddr) {
-			if (inet->inet_daddr != saddr)
-				return -1;
-			score += 4;
-		}
-		if (inet->inet_dport) {
-			if (inet->inet_dport != sport)
-				return -1;
-			score += 4;
-		}
-		if (sk->sk_bound_dev_if) {
-			if (sk->sk_bound_dev_if != dif)
-				return -1;
-			score += 4;
-		}
+	score = (sk->sk_family == PF_INET) ? 2 : 1;
+	inet = inet_sk(sk);
+
+	if (inet->inet_rcv_saddr) {
+		if (inet->inet_rcv_saddr != daddr)
+			return -1;
+		score += 4;
+	}
+
+	if (inet->inet_daddr) {
+		if (inet->inet_daddr != saddr)
+			return -1;
+		score += 4;
+	}
+
+	if (inet->inet_dport) {
+		if (inet->inet_dport != sport)
+			return -1;
+		score += 4;
+	}
+
+	if (sk->sk_bound_dev_if) {
+		if (sk->sk_bound_dev_if != dif)
+			return -1;
+		score += 4;
 	}
+
 	return score;
 }
 
@@ -378,33 +385,39 @@ static inline int compute_score2(struct sock *sk, struct net *net,
 				 __be32 saddr, __be16 sport,
 				 __be32 daddr, unsigned int hnum, int dif)
 {
-	int score = -1;
+	int score;
+	struct inet_sock *inet;
 
-	if (net_eq(sock_net(sk), net) && !ipv6_only_sock(sk)) {
-		struct inet_sock *inet = inet_sk(sk);
+	if (!(net_eq(sock_net(sk), net) && !ipv6_only_sock(sk)))
+		return -1;
 
-		if (inet->inet_rcv_saddr != daddr)
+	inet = inet_sk(sk);
+
+	if (inet->inet_rcv_saddr != daddr)
+		return -1;
+	if (inet->inet_num != hnum)
+		return -1;
+
+	score = (sk->sk_family == PF_INET) ? 2 : 1;
+
+	if (inet->inet_daddr) {
+		if (inet->inet_daddr != saddr)
 			return -1;
-		if (inet->inet_num != hnum)
+		score += 4;
+	}
+
+	if (inet->inet_dport) {
+		if (inet->inet_dport != sport)
 			return -1;
+		score += 4;
+	}
 
-		score = (sk->sk_family == PF_INET ? 2 : 1);
-		if (inet->inet_daddr) {
-			if (inet->inet_daddr != saddr)
-				return -1;
-			score += 4;
-		}
-		if (inet->inet_dport) {
-			if (inet->inet_dport != sport)
-				return -1;
-			score += 4;
-		}
-		if (sk->sk_bound_dev_if) {
-			if (sk->sk_bound_dev_if != dif)
-				return -1;
-			score += 4;
-		}
+	if (sk->sk_bound_dev_if) {
+		if (sk->sk_bound_dev_if != dif)
+			return -1;
+		score += 4;
 	}
+
 	return score;
 }
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7cfb5d7..0b9644a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -148,72 +148,85 @@ static inline int compute_score(struct sock *sk, struct net *net,
 				const struct in6_addr *daddr, __be16 dport,
 				int dif)
 {
-	int score = -1;
+	int score;
+	struct inet_sock *inet;
 
-	if (net_eq(sock_net(sk), net) && udp_sk(sk)->udp_port_hash == hnum &&
-			sk->sk_family == PF_INET6) {
-		struct inet_sock *inet = inet_sk(sk);
+	if (!(net_eq(sock_net(sk), net) &&
+	      udp_sk(sk)->udp_port_hash == hnum &&
+	      sk->sk_family == PF_INET6))
+		return -1;
 
-		score = 0;
-		if (inet->inet_dport) {
-			if (inet->inet_dport != sport)
-				return -1;
-			score++;
-		}
-		if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
-			if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
-				return -1;
-			score++;
-		}
-		if (!ipv6_addr_any(&sk->sk_v6_daddr)) {
-			if (!ipv6_addr_equal(&sk->sk_v6_daddr, saddr))
-				return -1;
-			score++;
-		}
-		if (sk->sk_bound_dev_if) {
-			if (sk->sk_bound_dev_if != dif)
-				return -1;
-			score++;
-		}
+	score = 0;
+	inet = inet_sk(sk);
+
+	if (inet->inet_dport) {
+		if (inet->inet_dport != sport)
+			return -1;
+		score++;
 	}
+
+	if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
+		if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
+			return -1;
+		score++;
+	}
+
+	if (!ipv6_addr_any(&sk->sk_v6_daddr)) {
+		if (!ipv6_addr_equal(&sk->sk_v6_daddr, saddr))
+			return -1;
+		score++;
+	}
+
+	if (sk->sk_bound_dev_if) {
+		if (sk->sk_bound_dev_if != dif)
+			return -1;
+		score++;
+	}
+
 	return score;
 }
 
 #define SCORE2_MAX (1 + 1 + 1)
 static inline int compute_score2(struct sock *sk, struct net *net,
-				const struct in6_addr *saddr, __be16 sport,
-				const struct in6_addr *daddr, unsigned short hnum,
-				int dif)
+				 const struct in6_addr *saddr, __be16 sport,
+				 const struct in6_addr *daddr,
+				 unsigned short hnum, int dif)
 {
-	int score = -1;
+	int score;
+	struct inet_sock *inet;
 
-	if (net_eq(sock_net(sk), net) && udp_sk(sk)->udp_port_hash == hnum &&
-			sk->sk_family == PF_INET6) {
-		struct inet_sock *inet = inet_sk(sk);
+	if (!(net_eq(sock_net(sk), net) &&
+	      udp_sk(sk)->udp_port_hash == hnum &&
+	      sk->sk_family == PF_INET6))
+		return -1;
 
-		if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
+	if (!ipv6_addr_equal(&sk->sk_v6_rcv_saddr, daddr))
+		return -1;
+
+	score = 0;
+	inet = inet_sk(sk);
+
+	if (inet->inet_dport) {
+		if (inet->inet_dport != sport)
 			return -1;
-		score = 0;
-		if (inet->inet_dport) {
-			if (inet->inet_dport != sport)
-				return -1;
-			score++;
-		}
-		if (!ipv6_addr_any(&sk->sk_v6_daddr)) {
-			if (!ipv6_addr_equal(&sk->sk_v6_daddr, saddr))
-				return -1;
-			score++;
-		}
-		if (sk->sk_bound_dev_if) {
-			if (sk->sk_bound_dev_if != dif)
-				return -1;
-			score++;
-		}
+		score++;
 	}
+
+	if (!ipv6_addr_any(&sk->sk_v6_daddr)) {
+		if (!ipv6_addr_equal(&sk->sk_v6_daddr, saddr))
+			return -1;
+		score++;
+	}
+
+	if (sk->sk_bound_dev_if) {
+		if (sk->sk_bound_dev_if != dif)
+			return -1;
+		score++;
+	}
+
 	return score;
 }
 
-
 /* called with read_rcu_lock() */
 static struct sock *udp6_lib_lookup2(struct net *net,
 		const struct in6_addr *saddr, __be16 sport,

^ permalink raw reply related

* Re: [RFC PATCH] netfilter: conntrack: cache route for forwarded connections
From: Eric Dumazet @ 2014-12-02  1:36 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, brouer, netdev
In-Reply-To: <1417480114-3002-1-git-send-email-fw@strlen.de>

On Tue, 2014-12-02 at 01:28 +0100, Florian Westphal wrote:
> ... to avoid per-packet FIB lookup if possible.
> 
> The cached dst is re-used provided the input interface
> is the same as that of the previous packet in the same direction.
> 
> If not, the cached dst is invalidated.
> 
> This should speed up forwarding when conntrack is already in use
> anyway, especially when using reverse path filtering -- active RPF
> enforces two FIB lookups for each packet.
> 
> Before the routing cache removal this didn't matter since RPF
> was performed only when route cache didn't yield a result; but without
> route cache it comes at high price.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---

Seems a good idea (but you might need some IPv6 care, as ( dst =
dst_check(dst, 0); ) seems to handle IPv4 only)

Another idea would be to re-use TCP ehash so that regular IP early demux
can be used, with a single lookup for both local and forwarded sessions.

(That would probably require a bit more memory, as you would need to
insert into TCP ehash some kind of 'tiny sockets' )


^ permalink raw reply

* Re: [PATCH] ioc3: fix incorrect use of htons/ntohs
From: Lino Sanfilippo @ 2014-12-02  1:31 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: ralf, linux-mips, netdev, linux-kernel
In-Reply-To: <1417406976.7215.126.camel@decadent.org.uk>

On 01.12.2014 05:09, Ben Hutchings wrote:
> On Sun, 2014-11-30 at 11:40 +0100, Lino Sanfilippo wrote:
>> The protocol type in the ip header struct is a single byte variable. So there
>> is no need to swap bytes depending on host endianness.
>> 
>> Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
>> ---
>> 
>> Please note that I could not test this, since I dont have access to the
>> concerning hardware.
>> 
>>  drivers/net/ethernet/sgi/ioc3-eth.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/net/ethernet/sgi/ioc3-eth.c b/drivers/net/ethernet/sgi/ioc3-eth.c
>> index 7a254da..0bb303d 100644
>> --- a/drivers/net/ethernet/sgi/ioc3-eth.c
>> +++ b/drivers/net/ethernet/sgi/ioc3-eth.c
>> @@ -540,8 +540,7 @@ static void ioc3_tcpudp_checksum(struct sk_buff *skb, uint32_t hwsum, int len)
>>  
>>  	/* Same as tx - compute csum of pseudo header  */
>>  	csum = hwsum +
>> -	       (ih->tot_len - (ih->ihl << 2)) +
>> -	       htons((uint16_t)ih->protocol) +
>> +	       (ih->tot_len - (ih->ihl << 2)) + ih->protocol +
>>  	       (ih->saddr >> 16) + (ih->saddr & 0xffff) +
>>  	       (ih->daddr >> 16) + (ih->daddr & 0xffff);
>>
> 
> The pseudo-header is specified as:
> 
>                      +--------+--------+--------+--------+
>                      |           Source Address          |
>                      +--------+--------+--------+--------+
>                      |         Destination Address       |
>                      +--------+--------+--------+--------+
>                      |  zero  |  PTCL  |    TCP Length   |
>                      +--------+--------+--------+--------+
> 
> The current code zero-extends the protocol number to produce the 5th
> 16-bit word of the pseudo-header, then uses htons() to put it in
> big-endian order, consistent with the other fields.  (Yes, it's doing
> addition on big-endian words; this works even on little-endian machines
> due to the way the checksum is specified.)
> 
> The driver should not be doing this at all, though.  It should set
> skb->csum = hwsum; skb->ip_summed = CHECKSUM_COMPLETE; and let the
> network stack adjust the hardware checksum.
> 
>> @@ -1417,7 +1416,7 @@ static int ioc3_start_xmit(struct sk_buff *skb, struct net_device *dev)
>>  	 */
>>  	if (skb->ip_summed == CHECKSUM_PARTIAL) {
>>  		const struct iphdr *ih = ip_hdr(skb);
>> -		const int proto = ntohs(ih->protocol);
>> +		const int proto = ih->protocol;
>>  		unsigned int csoff;
>>  		uint32_t csum, ehsum;
>>  		uint16_t *eh;
> 
> This should logically be __be16 proto = htons(ih->protocol), but the
> current version should work in practice.
> 
> However, the driver should really use skb->csum_start and
> skb->csum_offset to work out where the checksum belongs and which bytes
> to cancel out.
> 
> Ben.
> 

Hi Ben,

youre right, the use of htons/ntohs is correct in this case. So thank
you for the explanation and please ignore that patch.

Regards,
Lino

^ permalink raw reply

* [RFC PATCH] netfilter: conntrack: cache route for forwarded connections
From: Florian Westphal @ 2014-12-02  0:28 UTC (permalink / raw)
  To: netfilter-devel; +Cc: brouer, netdev, Florian Westphal

... to avoid per-packet FIB lookup if possible.

The cached dst is re-used provided the input interface
is the same as that of the previous packet in the same direction.

If not, the cached dst is invalidated.

This should speed up forwarding when conntrack is already in use
anyway, especially when using reverse path filtering -- active RPF
enforces two FIB lookups for each packet.

Before the routing cache removal this didn't matter since RPF
was performed only when route cache didn't yield a result; but without
route cache it comes at high price.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Sending as RFC since I haven't tested this yet (aside from
 single-forwarded-flow), so no performance data either.

  - doesn't work when iif changes (it invalidates cached dst), don't
  think its a problem
  - auto-active when module is loaded (and always active when =y).
  - cache becomes used after conntrack enters ASSURED state.

  TODOs:
  - integrate with netfilter ipv4/ipv6 rpfilter match (already working on this)

  Things to consider:
  - perhaps extend it for local connections too; could cache
  socket to get sk from conntrack (and then fetchg dst from sk).

 include/net/netfilter/nf_conntrack_extend.h  |   4 +
 include/net/netfilter/nf_conntrack_rtcache.h |  28 +++
 net/netfilter/Kconfig                        |  11 +
 net/netfilter/Makefile                       |   1 +
 net/netfilter/nf_conntrack_rtcache.c         | 327 +++++++++++++++++++++++++++
 5 files changed, 371 insertions(+)
 create mode 100644 include/net/netfilter/nf_conntrack_rtcache.h
 create mode 100644 net/netfilter/nf_conntrack_rtcache.c

diff --git a/include/net/netfilter/nf_conntrack_extend.h b/include/net/netfilter/nf_conntrack_extend.h
index 55d1504..1b00d57 100644
--- a/include/net/netfilter/nf_conntrack_extend.h
+++ b/include/net/netfilter/nf_conntrack_extend.h
@@ -30,6 +30,9 @@ enum nf_ct_ext_id {
 #if IS_ENABLED(CONFIG_NETFILTER_SYNPROXY)
 	NF_CT_EXT_SYNPROXY,
 #endif
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_RTCACHE)
+	NF_CT_EXT_RTCACHE,
+#endif
 	NF_CT_EXT_NUM,
 };
 
@@ -43,6 +46,7 @@ enum nf_ct_ext_id {
 #define NF_CT_EXT_TIMEOUT_TYPE struct nf_conn_timeout
 #define NF_CT_EXT_LABELS_TYPE struct nf_conn_labels
 #define NF_CT_EXT_SYNPROXY_TYPE struct nf_conn_synproxy
+#define NF_CT_EXT_RTCACHE_TYPE struct nf_conn_rtcache
 
 /* Extensions: optional stuff which isn't permanently in struct. */
 struct nf_ct_ext {
diff --git a/include/net/netfilter/nf_conntrack_rtcache.h b/include/net/netfilter/nf_conntrack_rtcache.h
new file mode 100644
index 0000000..e8d215d
--- /dev/null
+++ b/include/net/netfilter/nf_conntrack_rtcache.h
@@ -0,0 +1,28 @@
+#include <linux/gfp.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_extend.h>
+
+struct dst_entry;
+union nf_conn_cache_ptr {
+	struct dst_entry *dst;
+};
+
+struct nf_conn_rtcache {
+	union nf_conn_cache_ptr ptr[IP_CT_DIR_MAX];
+	int iif[IP_CT_DIR_MAX];
+};
+
+static inline struct nf_conn_rtcache *nf_ct_rtcache_find(const struct nf_conn *ct)
+{
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_RTCACHE)
+	return nf_ct_ext_find(ct, NF_CT_EXT_RTCACHE);
+#else
+	return NULL;
+#endif
+}
+
+static inline int nf_conn_rtcache_iif_get(const struct nf_conn_rtcache *rtc,
+					  enum ip_conntrack_dir dir)
+{
+	return rtc->iif[dir];
+}
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index b02660f..4a8b7e7 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -106,6 +106,17 @@ config NF_CONNTRACK_EVENTS
 
 	  If unsure, say `N'.
 
+config NF_CONNTRACK_RTCACHE
+	tristate "Cache route entries in conntrack objects"
+	depends on NETFILTER_ADVANCED
+	depends on NF_CONNTRACK
+	help
+	  If this option is enabled, the connection tracking code will
+	  cache routing information for each connection that is being
+	  forwarded.
+
+	  If unsure, say `N'.
+
 config NF_CONNTRACK_TIMEOUT
 	bool  'Connection tracking timeout'
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 89f73a9..ac830c9 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -5,6 +5,7 @@ nf_conntrack-$(CONFIG_NF_CONNTRACK_TIMEOUT) += nf_conntrack_timeout.o
 nf_conntrack-$(CONFIG_NF_CONNTRACK_TIMESTAMP) += nf_conntrack_timestamp.o
 nf_conntrack-$(CONFIG_NF_CONNTRACK_EVENTS) += nf_conntrack_ecache.o
 nf_conntrack-$(CONFIG_NF_CONNTRACK_LABELS) += nf_conntrack_labels.o
+nf_conntrack-$(CONFIG_NF_CONNTRACK_RTCACHE) += nf_conntrack_rtcache.o
 
 obj-$(CONFIG_NETFILTER) = netfilter.o
 
diff --git a/net/netfilter/nf_conntrack_rtcache.c b/net/netfilter/nf_conntrack_rtcache.c
new file mode 100644
index 0000000..1013987
--- /dev/null
+++ b/net/netfilter/nf_conntrack_rtcache.c
@@ -0,0 +1,327 @@
+/* route cache for netfilter.
+ *
+ * (C) 2014 Red Hat GmbH
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/types.h>
+#include <linux/netfilter.h>
+#include <linux/skbuff.h>
+#include <linux/stddef.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/export.h>
+#include <linux/module.h>
+
+#include <net/dst.h>
+
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_extend.h>
+#include <net/netfilter/nf_conntrack_rtcache.h>
+
+static void __nf_conn_rtcache_destroy(struct nf_conn_rtcache *rtc, int dir)
+{
+	struct dst_entry *dst;
+
+	if (rtc->iif[dir] < 0)
+		return;
+
+	dst = rtc->ptr[dir].dst;
+	pr_debug("release dst %p, refcnt %d, dir %d\n", dst, atomic_read(&dst->__refcnt), dir);
+	dst_release(dst);
+}
+
+static void nf_conn_rtcache_destroy(struct nf_conn *ct)
+{
+	struct nf_conn_rtcache *rtc = nf_ct_rtcache_find(ct);
+
+	if (!rtc)
+		return;
+
+	__nf_conn_rtcache_destroy(rtc, IP_CT_DIR_ORIGINAL);
+	__nf_conn_rtcache_destroy(rtc, IP_CT_DIR_REPLY);
+}
+
+static void nf_ct_rtcache_ext_add(struct nf_conn *ct)
+{
+	struct nf_conn_rtcache *rtc;
+
+	rtc = nf_ct_ext_add(ct, NF_CT_EXT_RTCACHE, GFP_ATOMIC);
+	if (rtc) {
+		rtc->iif[IP_CT_DIR_ORIGINAL] = -1;
+		rtc->iif[IP_CT_DIR_REPLY] = -1;
+	}
+}
+
+static bool nf_rtcache_ignore_ct(struct nf_conn *ct)
+{
+	if (nf_ct_is_untracked(ct))
+		return true;
+
+	if (!test_bit(IPS_ASSURED_BIT, &ct->status))
+		return true;
+	return false;
+}
+
+static struct nf_conn_rtcache *nf_ct_rtcache_find_usable(struct nf_conn *ct)
+{
+	if (nf_rtcache_ignore_ct(ct))
+		return NULL;
+	return nf_ct_rtcache_find(ct);
+}
+
+static struct dst_entry *
+nf_conn_rtcache_dst_get(const struct nf_conn_rtcache *rtc,
+		        enum ip_conntrack_dir dir)
+{
+	return rtc->ptr[dir].dst;
+}
+
+static void nf_conn_rtcache_dst_set(struct nf_conn_rtcache *rtc,
+				    struct dst_entry *dst,
+				    enum ip_conntrack_dir dir, int iif)
+{
+	if (rtc->iif[dir] != iif)
+		rtc->iif[dir] = iif;
+
+	if (rtc->ptr[dir].dst != dst) {
+		struct dst_entry *old;
+
+		dst_hold(dst);
+
+		old = xchg(&rtc->ptr[dir].dst, dst);
+		dst_release(old);
+	}
+}
+
+static void nf_conn_rtcache_dst_obsolete(struct nf_conn_rtcache *rtc,
+					 enum ip_conntrack_dir dir)
+{
+	struct dst_entry *old;
+
+	pr_debug("Invalidate iif %d for dir %d on cache %p\n",
+		   rtc->iif[dir], dir, rtc);
+
+	old = xchg(&rtc->ptr[dir].dst, NULL);
+	dst_release(old);
+	rtc->iif[dir] = -1;
+}
+
+static unsigned int nf_rtcache_in(const struct nf_hook_ops *ops,
+				  struct sk_buff *skb,
+				  const struct net_device *in,
+				  const struct net_device *out,
+				  int (*okfn)(struct sk_buff *))
+{
+	struct nf_conn_rtcache *ct_rtcache;
+	enum ip_conntrack_info ctinfo;
+	enum ip_conntrack_dir dir;
+	struct dst_entry *dst;
+	struct nf_conn *ct;
+	int iif;
+
+	if (in == NULL || skb_dst(skb))
+		return NF_ACCEPT;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct)
+		return NF_ACCEPT;
+
+	ct_rtcache = nf_ct_rtcache_find_usable(ct);
+	if (!ct_rtcache)
+		return NF_ACCEPT;
+
+	/* if iif changes, don't use cache and let ip stack
+	 * do route lookup.
+	 *
+	 * If rp_filter is enabled it might toss skb, so
+	 * we don't want to avoid these checks.
+	 */
+	dir = CTINFO2DIR(ctinfo);
+	iif = nf_conn_rtcache_iif_get(ct_rtcache, dir);
+	if (in->ifindex != iif) {
+		pr_debug("iif %d did not match cached iif %d of ct %p\n\n", iif, in->ifindex, ct);
+		return NF_ACCEPT;
+	}
+	dst = nf_conn_rtcache_dst_get(ct_rtcache, dir);
+	if (dst == NULL)
+		return NF_ACCEPT;
+	dst = dst_check(dst, 0);
+	if (likely(dst)) {
+		pr_debug("using cached dst %p for skb %p\n", dst, skb);
+		skb_dst_set_noref_force(skb, dst);
+	} else {
+		nf_conn_rtcache_dst_obsolete(ct_rtcache, dir);
+	}
+	return NF_ACCEPT;
+}
+
+static unsigned int nf_rtcache_forward(const struct nf_hook_ops *ops,
+				       struct sk_buff *skb,
+				       const struct net_device *in,
+				       const struct net_device *out,
+				       int (*okfn)(struct sk_buff *))
+{
+	struct nf_conn_rtcache *ct_rtcache;
+	enum ip_conntrack_info ctinfo;
+	enum ip_conntrack_dir dir;
+	struct dst_entry *dst;
+	struct nf_conn *ct;
+	int iif;
+
+	dst = skb_dst(skb);
+	if (WARN_ON_ONCE(dst == NULL || in == NULL))
+		return NF_ACCEPT;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct)
+		return NF_ACCEPT;
+
+	if (!nf_ct_is_confirmed(ct)) {
+		pr_debug("new ct %p, skb %p, adding rtcache extension\n", ct, skb);
+		BUG_ON(nf_ct_rtcache_find(ct));
+		nf_ct_rtcache_ext_add(ct);
+		return NF_ACCEPT;
+	}
+
+	ct_rtcache = nf_ct_rtcache_find_usable(ct);
+	if (!ct_rtcache)
+		return NF_ACCEPT;
+
+	dir = CTINFO2DIR(ctinfo);
+	iif = nf_conn_rtcache_iif_get(ct_rtcache, dir);
+	pr_debug("ct %p, skb %p, dir %d, iif %d, cached iif %d\n", ct, skb, dir, iif, in->ifindex);
+	if (likely(in->ifindex == iif))
+		return NF_ACCEPT;
+
+	nf_conn_rtcache_dst_set(ct_rtcache, dst, dir, in->ifindex);
+	return NF_ACCEPT;
+}
+
+static struct nf_hook_ops rtcache_ops[] = {
+	{
+		.hook		= nf_rtcache_in,
+		.owner		= THIS_MODULE,
+		.pf		= NFPROTO_IPV4,
+		.hooknum	= NF_INET_PRE_ROUTING,
+		.priority       = NF_IP_PRI_LAST,
+	},
+	{
+		.hook           = nf_rtcache_forward,
+		.owner          = THIS_MODULE,
+		.pf             = NFPROTO_IPV4,
+		.hooknum        = NF_INET_FORWARD,
+		.priority       = NF_IP_PRI_LAST,
+	},
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV6)
+	{
+		.hook		= nf_rtcache_in,
+		.owner		= THIS_MODULE,
+		.pf		= NFPROTO_IPV6,
+		.hooknum	= NF_INET_PRE_ROUTING,
+		.priority       = NF_IP_PRI_LAST,
+	},
+	{
+		.hook           = nf_rtcache_forward,
+		.owner          = THIS_MODULE,
+		.pf             = NFPROTO_IPV6,
+		.hooknum        = NF_INET_FORWARD,
+		.priority       = NF_IP_PRI_LAST,
+	},
+#endif
+};
+
+static struct nf_ct_ext_type rtcache_extend __read_mostly = {
+	.len	= sizeof(struct nf_conn_rtcache),
+	.align	= __alignof__(struct nf_conn_rtcache),
+	.id	= NF_CT_EXT_RTCACHE,
+	.destroy = nf_conn_rtcache_destroy,
+};
+
+int __init nf_conntrack_rtcache_init(void)
+{
+	int ret = nf_ct_extend_register(&rtcache_extend);
+	if (ret < 0) {
+		pr_err("nf_conntrack_rtcache: Unable to register extension\n");
+		return ret;
+	}
+
+	ret = nf_register_hooks(rtcache_ops, ARRAY_SIZE(rtcache_ops));
+	if (ret < 0) {
+		nf_ct_extend_unregister(&rtcache_extend);
+		return ret;
+	}
+
+	return ret;
+}
+
+static int nf_rtcache_ext_remove(struct nf_conn *ct, void *data)
+{
+	struct nf_conn_rtcache *rtc = nf_ct_rtcache_find(ct);
+	return rtc != NULL;
+}
+
+void __exit nf_conntrack_rtcache_fini(void)
+{
+	struct net *net;
+	bool wait;
+	int cpu, count = 0;
+
+	/* first remove hooks so no new connections get rtcache extension */
+	nf_unregister_hooks(rtcache_ops, ARRAY_SIZE(rtcache_ops));
+
+	synchronize_net();
+
+	/* zap all conntracks with rtcache extension */
+	for_each_net(net)
+		nf_ct_iterate_cleanup(net, nf_rtcache_ext_remove, NULL, 0, 0);
+
+	/* again, so that affected conntracks on the unconfirmed list are now
+	 * free'd or on dying list */
+	synchronize_net();
+
+	/* ... and wait until dying list contains no affected items any more */
+	do {
+		wait = false;
+		for_each_possible_cpu(cpu) {
+			struct nf_conntrack_tuple_hash *h;
+			struct hlist_nulls_node *n;
+			struct nf_conn *ct;
+			struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
+
+			rcu_read_lock();
+			spin_lock_bh(&pcpu->lock);
+
+			hlist_nulls_for_each_entry(h, n, &pcpu->unconfirmed, hnnode) {
+				ct = nf_ct_tuplehash_to_ctrack(h);
+				if (nf_ct_rtcache_find(ct) != NULL) {
+					wait = true;
+					break;
+				}
+			}
+			spin_unlock_bh(&pcpu->lock);
+			rcu_read_unlock();
+
+			if (wait) {
+				/* conntrack on dying list, i.e. waiting for
+				 * some event, e.g. reinject or whatever */
+				msleep(200);
+				WARN_ONCE(++count > 25, "Waiting for all rtcache conntracks to go away\n");
+			}
+		}
+	} while (wait);
+
+	nf_ct_extend_unregister(&rtcache_extend);
+}
+module_init(nf_conntrack_rtcache_init);
+module_exit(nf_conntrack_rtcache_fini);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Florian Westphal <fw@strlen.de>");
+MODULE_DESCRIPTION("Conntrack route cache extension");
-- 
2.0.4


^ permalink raw reply related

* [PATCH v3 net-next] net: bcmgenet: enable driver to work without a device tree
From: Petri Gynther @ 2014-12-02  0:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, f.fainelli

Modify bcmgenet driver so that it can be used on Broadcom 7xxx
MIPS-based STB platforms without a device tree.

Signed-off-by: Petri Gynther <pgynther@google.com>
---
 drivers/net/ethernet/broadcom/Kconfig          |   1 -
 drivers/net/ethernet/broadcom/genet/bcmgenet.c |  31 ++++--
 drivers/net/ethernet/broadcom/genet/bcmmii.c   | 126 ++++++++++++++++++++-----
 include/linux/platform_data/bcmgenet.h         |  18 ++++
 4 files changed, 143 insertions(+), 33 deletions(-)
 create mode 100644 include/linux/platform_data/bcmgenet.h

diff --git a/drivers/net/ethernet/broadcom/Kconfig b/drivers/net/ethernet/broadcom/Kconfig
index c3e260c..888247a 100644
--- a/drivers/net/ethernet/broadcom/Kconfig
+++ b/drivers/net/ethernet/broadcom/Kconfig
@@ -62,7 +62,6 @@ config BCM63XX_ENET
 
 config BCMGENET
 	tristate "Broadcom GENET internal MAC support"
-	depends on OF
 	select MII
 	select PHYLIB
 	select FIXED_PHY if BCMGENET=y
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index f2fadb0..adfef5c 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -42,6 +42,7 @@
 #include <linux/ip.h>
 #include <linux/ipv6.h>
 #include <linux/phy.h>
+#include <linux/platform_data/bcmgenet.h>
 
 #include <asm/unaligned.h>
 
@@ -2586,8 +2587,9 @@ static const struct of_device_id bcmgenet_match[] = {
 
 static int bcmgenet_probe(struct platform_device *pdev)
 {
+	struct bcmgenet_platform_data *pd = pdev->dev.platform_data;
 	struct device_node *dn = pdev->dev.of_node;
-	const struct of_device_id *of_id;
+	const struct of_device_id *of_id = NULL;
 	struct bcmgenet_priv *priv;
 	struct net_device *dev;
 	const void *macaddr;
@@ -2601,9 +2603,11 @@ static int bcmgenet_probe(struct platform_device *pdev)
 		return -ENOMEM;
 	}
 
-	of_id = of_match_node(bcmgenet_match, dn);
-	if (!of_id)
-		return -EINVAL;
+	if (dn) {
+		of_id = of_match_node(bcmgenet_match, dn);
+		if (!of_id)
+			return -EINVAL;
+	}
 
 	priv = netdev_priv(dev);
 	priv->irq0 = platform_get_irq(pdev, 0);
@@ -2615,11 +2619,15 @@ static int bcmgenet_probe(struct platform_device *pdev)
 		goto err;
 	}
 
-	macaddr = of_get_mac_address(dn);
-	if (!macaddr) {
-		dev_err(&pdev->dev, "can't find MAC address\n");
-		err = -EINVAL;
-		goto err;
+	if (dn) {
+		macaddr = of_get_mac_address(dn);
+		if (!macaddr) {
+			dev_err(&pdev->dev, "can't find MAC address\n");
+			err = -EINVAL;
+			goto err;
+		}
+	} else {
+		macaddr = pd->mac_address;
 	}
 
 	r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
@@ -2659,7 +2667,10 @@ static int bcmgenet_probe(struct platform_device *pdev)
 
 	priv->dev = dev;
 	priv->pdev = pdev;
-	priv->version = (enum bcmgenet_version)of_id->data;
+	if (of_id)
+		priv->version = (enum bcmgenet_version)of_id->data;
+	else
+		priv->version = pd->genet_version;
 
 	priv->clk = devm_clk_get(&priv->pdev->dev, "enet");
 	if (IS_ERR(priv->clk))
diff --git a/drivers/net/ethernet/broadcom/genet/bcmmii.c b/drivers/net/ethernet/broadcom/genet/bcmmii.c
index 933cd7e..446889c 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmmii.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmmii.c
@@ -23,6 +23,7 @@
 #include <linux/of.h>
 #include <linux/of_net.h>
 #include <linux/of_mdio.h>
+#include <linux/platform_data/bcmgenet.h>
 
 #include "bcmgenet.h"
 
@@ -312,22 +313,6 @@ static int bcmgenet_mii_probe(struct net_device *dev)
 	u32 phy_flags;
 	int ret;
 
-	if (priv->phydev) {
-		pr_info("PHY already attached\n");
-		return 0;
-	}
-
-	/* In the case of a fixed PHY, the DT node associated
-	 * to the PHY is the Ethernet MAC DT node.
-	 */
-	if (!priv->phy_dn && of_phy_is_fixed_link(dn)) {
-		ret = of_phy_register_fixed_link(dn);
-		if (ret)
-			return ret;
-
-		priv->phy_dn = of_node_get(dn);
-	}
-
 	/* Communicate the integrated PHY revision */
 	phy_flags = priv->gphy_rev;
 
@@ -337,11 +322,39 @@ static int bcmgenet_mii_probe(struct net_device *dev)
 	priv->old_duplex = -1;
 	priv->old_pause = -1;
 
-	phydev = of_phy_connect(dev, priv->phy_dn, bcmgenet_mii_setup,
-				phy_flags, priv->phy_interface);
-	if (!phydev) {
-		pr_err("could not attach to PHY\n");
-		return -ENODEV;
+	if (dn) {
+		if (priv->phydev) {
+			pr_info("PHY already attached\n");
+			return 0;
+		}
+
+		/* In the case of a fixed PHY, the DT node associated
+		 * to the PHY is the Ethernet MAC DT node.
+		 */
+		if (!priv->phy_dn && of_phy_is_fixed_link(dn)) {
+			ret = of_phy_register_fixed_link(dn);
+			if (ret)
+				return ret;
+
+			priv->phy_dn = of_node_get(dn);
+		}
+
+		phydev = of_phy_connect(dev, priv->phy_dn, bcmgenet_mii_setup,
+					phy_flags, priv->phy_interface);
+		if (!phydev) {
+			pr_err("could not attach to PHY\n");
+			return -ENODEV;
+		}
+	} else {
+		phydev = priv->phydev;
+		phydev->dev_flags = phy_flags;
+
+		ret = phy_connect_direct(dev, phydev, bcmgenet_mii_setup,
+					 priv->phy_interface);
+		if (ret) {
+			pr_err("could not attach to PHY\n");
+			return -ENODEV;
+		}
 	}
 
 	priv->phydev = phydev;
@@ -438,6 +451,75 @@ static int bcmgenet_mii_of_init(struct bcmgenet_priv *priv)
 	return 0;
 }
 
+static int bcmgenet_mii_pd_init(struct bcmgenet_priv *priv)
+{
+	struct device *kdev = &priv->pdev->dev;
+	struct bcmgenet_platform_data *pd = kdev->platform_data;
+	struct mii_bus *mdio = priv->mii_bus;
+	struct phy_device *phydev;
+	int ret;
+
+	if (pd->phy_interface != PHY_INTERFACE_MODE_MOCA && pd->mdio_enabled) {
+		/*
+		 * Internal or external PHY with MDIO access
+		 */
+		if (pd->phy_address >= 0 && pd->phy_address < PHY_MAX_ADDR)
+			mdio->phy_mask = ~(1 << pd->phy_address);
+		else
+			mdio->phy_mask = 0;
+
+		ret = mdiobus_register(mdio);
+		if (ret) {
+			dev_err(kdev, "failed to register MDIO bus\n");
+			return ret;
+		}
+
+		if (pd->phy_address >= 0 && pd->phy_address < PHY_MAX_ADDR)
+			phydev = mdio->phy_map[pd->phy_address];
+		else
+			phydev = phy_find_first(mdio);
+
+		if (!phydev) {
+			dev_err(kdev, "failed to register PHY device\n");
+			mdiobus_unregister(mdio);
+			return -ENODEV;
+		}
+	} else {
+		/*
+		 * MoCA port or no MDIO access.
+		 * Use fixed PHY to represent the link layer.
+		 */
+		struct fixed_phy_status fphy_status = {
+			.link = 1,
+			.speed = pd->phy_speed,
+			.duplex = pd->phy_duplex,
+			.pause = 0,
+			.asym_pause = 0,
+		};
+
+		phydev = fixed_phy_register(PHY_POLL, &fphy_status, NULL);
+		if (!phydev || IS_ERR(phydev)) {
+			dev_err(kdev, "failed to register fixed PHY device\n");
+			return -ENODEV;
+		}
+	}
+
+	priv->phydev = phydev;
+	priv->phy_interface = pd->phy_interface;
+
+	return 0;
+}
+
+static int bcmgenet_mii_bus_init(struct bcmgenet_priv *priv)
+{
+	struct device_node *dn = priv->pdev->dev.of_node;
+
+	if (dn)
+		return bcmgenet_mii_of_init(priv);
+	else
+		return bcmgenet_mii_pd_init(priv);
+}
+
 int bcmgenet_mii_init(struct net_device *dev)
 {
 	struct bcmgenet_priv *priv = netdev_priv(dev);
@@ -447,7 +529,7 @@ int bcmgenet_mii_init(struct net_device *dev)
 	if (ret)
 		return ret;
 
-	ret = bcmgenet_mii_of_init(priv);
+	ret = bcmgenet_mii_bus_init(priv);
 	if (ret)
 		goto out_free;
 
diff --git a/include/linux/platform_data/bcmgenet.h b/include/linux/platform_data/bcmgenet.h
new file mode 100644
index 0000000..26af543
--- /dev/null
+++ b/include/linux/platform_data/bcmgenet.h
@@ -0,0 +1,18 @@
+#ifndef __LINUX_PLATFORM_DATA_BCMGENET_H__
+#define __LINUX_PLATFORM_DATA_BCMGENET_H__
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+#include <linux/phy.h>
+
+struct bcmgenet_platform_data {
+	bool		mdio_enabled;
+	phy_interface_t	phy_interface;
+	int		phy_address;
+	int		phy_speed;
+	int		phy_duplex;
+	u8		mac_address[ETH_ALEN];
+	int		genet_version;
+};
+
+#endif
-- 
2.2.0.rc0.207.ga3a616c

^ permalink raw reply related

* Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
From: Tom Herbert @ 2014-12-02  0:09 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Joe Stringer, netdev, Shannon Nelson, Brandeburg, Jesse,
	Jeff Kirsher, linux.nics, Linux Kernel Mailing List
In-Reply-To: <CAEP_g=979u0baWXGmCxo61a-KYEUwkruqBNof_zQpsvJxRFqSA@mail.gmail.com>

On Mon, Dec 1, 2014 at 3:53 PM, Jesse Gross <jesse@nicira.com> wrote:
> On Mon, Dec 1, 2014 at 3:47 PM, Tom Herbert <therbert@google.com> wrote:
>> On Mon, Dec 1, 2014 at 3:35 PM, Joe Stringer <joestringer@nicira.com> wrote:
>>> On 21 November 2014 at 09:59, Joe Stringer <joestringer@nicira.com> wrote:
>>>> On 20 November 2014 16:19, Jesse Gross <jesse@nicira.com> wrote:
>>>>> I don't know if we need to have the check at all for IPIP though -
>>>>> after all the driver doesn't expose support for it all (actually it
>>>>> doesn't expose GRE either). This raises kind of an interesting
>>>>> question about the checks though - it's pretty easy to add support to
>>>>> the driver for a new GSO type (and I imagine that people will be
>>>>> adding GRE soon) and forget to update the check.
>>>>
>>>> If the check is more conservative, then testing would show that it's
>>>> not working and lead people to figure out why (and update the check).
>>>
>>> More concretely, one suggestion would be something like following at
>>> the start of each gso_check():
>>>
>>> +       const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE |
>>> +                             SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL;
>>> +
>>> +       if (skb_shinfo(skb)->gso_type & ~supported)
>>> +               return false;
>>
>> This should already be handled by net_gso_ok.
>
> My original point wasn't so much that this isn't handled at the moment
> but that it's easy to add a supported GSO type but then forget to
> update this check - i.e. if a driver already supports UDP_TUNNEL and
> adds support for GRE with the same constraints. It seems not entirely
> ideal that this function is acting as a blacklist rather than a
> whitelist.

Agreed, it would be nice to have all the checking logic in one place.
If all the drivers end up implementing ndo_gso_check then we could
potentially get rid of the GSO types as features. This probably
wouldn't be a bad thing since we already know that the features
mechanism doesn't scale (for instance there's no way to indicate that
certain combinations of GSO types are supported by a device).

^ permalink raw reply

* Re: [PATCH 1/1] net-ipvlan: Deletion of an unnecessary check before the function call "free_percpu"
From: Mahesh Bandewar @ 2014-12-02  0:01 UTC (permalink / raw)
  To: SF Markus Elfring; +Cc: linux-netdev, LKML, kernel-janitors, Julia Lawall
In-Reply-To: <5479E693.3010200@users.sourceforge.net>

On Sat, Nov 29, 2014 at 7:30 AM, SF Markus Elfring
<elfring@users.sourceforge.net> wrote:
> From: Markus Elfring <elfring@users.sourceforge.net>
> Date: Sat, 29 Nov 2014 16:23:20 +0100
>
> The free_percpu() function tests whether its argument is NULL and then
> returns immediately. Thus the test around the call is not needed.
>
> This issue was detected by using the Coccinelle software.
>
> Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: Mahesh Bandewar <maheshb@google.com>
> ---
>  drivers/net/ipvlan/ipvlan_main.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index 96b71b0..feb1853 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -125,8 +125,7 @@ static void ipvlan_uninit(struct net_device *dev)
>         struct ipvl_dev *ipvlan = netdev_priv(dev);
>         struct ipvl_port *port = ipvlan->port;
>
> -       if (ipvlan->pcpu_stats)
> -               free_percpu(ipvlan->pcpu_stats);
> +       free_percpu(ipvlan->pcpu_stats);
>
>         port->count -= 1;
>         if (!port->count)
> --
> 2.1.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
From: Joe Stringer @ 2014-12-01 23:58 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Tom Herbert, netdev, Shannon Nelson, Brandeburg, Jesse,
	Jeff Kirsher, linux.nics, Linux Kernel Mailing List
In-Reply-To: <CAEP_g=979u0baWXGmCxo61a-KYEUwkruqBNof_zQpsvJxRFqSA@mail.gmail.com>

On 1 December 2014 at 15:53, Jesse Gross <jesse@nicira.com> wrote:
> On Mon, Dec 1, 2014 at 3:47 PM, Tom Herbert <therbert@google.com> wrote:
>> On Mon, Dec 1, 2014 at 3:35 PM, Joe Stringer <joestringer@nicira.com> wrote:
>>> On 21 November 2014 at 09:59, Joe Stringer <joestringer@nicira.com> wrote:
>>>> On 20 November 2014 16:19, Jesse Gross <jesse@nicira.com> wrote:
>>>>> I don't know if we need to have the check at all for IPIP though -
>>>>> after all the driver doesn't expose support for it all (actually it
>>>>> doesn't expose GRE either). This raises kind of an interesting
>>>>> question about the checks though - it's pretty easy to add support to
>>>>> the driver for a new GSO type (and I imagine that people will be
>>>>> adding GRE soon) and forget to update the check.
>>>>
>>>> If the check is more conservative, then testing would show that it's
>>>> not working and lead people to figure out why (and update the check).
>>>
>>> More concretely, one suggestion would be something like following at
>>> the start of each gso_check():
>>>
>>> +       const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE |
>>> +                             SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL;
>>> +
>>> +       if (skb_shinfo(skb)->gso_type & ~supported)
>>> +               return false;
>>
>> This should already be handled by net_gso_ok.
>
> My original point wasn't so much that this isn't handled at the moment
> but that it's easy to add a supported GSO type but then forget to
> update this check - i.e. if a driver already supports UDP_TUNNEL and
> adds support for GRE with the same constraints. It seems not entirely
> ideal that this function is acting as a blacklist rather than a
> whitelist.

How much less ideal is it to forget to update the check than to make
this a blacklist?

^ permalink raw reply

* Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
From: Jesse Gross @ 2014-12-01 23:53 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Joe Stringer, netdev, Shannon Nelson, Brandeburg, Jesse,
	Jeff Kirsher, linux.nics, Linux Kernel Mailing List
In-Reply-To: <CA+mtBx-Q=Ebrx_8NVB5mWURj-oWB3gdNs5fSAL=Frr9sOmyvHw@mail.gmail.com>

On Mon, Dec 1, 2014 at 3:47 PM, Tom Herbert <therbert@google.com> wrote:
> On Mon, Dec 1, 2014 at 3:35 PM, Joe Stringer <joestringer@nicira.com> wrote:
>> On 21 November 2014 at 09:59, Joe Stringer <joestringer@nicira.com> wrote:
>>> On 20 November 2014 16:19, Jesse Gross <jesse@nicira.com> wrote:
>>>> I don't know if we need to have the check at all for IPIP though -
>>>> after all the driver doesn't expose support for it all (actually it
>>>> doesn't expose GRE either). This raises kind of an interesting
>>>> question about the checks though - it's pretty easy to add support to
>>>> the driver for a new GSO type (and I imagine that people will be
>>>> adding GRE soon) and forget to update the check.
>>>
>>> If the check is more conservative, then testing would show that it's
>>> not working and lead people to figure out why (and update the check).
>>
>> More concretely, one suggestion would be something like following at
>> the start of each gso_check():
>>
>> +       const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE |
>> +                             SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL;
>> +
>> +       if (skb_shinfo(skb)->gso_type & ~supported)
>> +               return false;
>
> This should already be handled by net_gso_ok.

My original point wasn't so much that this isn't handled at the moment
but that it's easy to add a supported GSO type but then forget to
update this check - i.e. if a driver already supports UDP_TUNNEL and
adds support for GRE with the same constraints. It seems not entirely
ideal that this function is acting as a blacklist rather than a
whitelist.

^ permalink raw reply

* Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
From: Tom Herbert @ 2014-12-01 23:47 UTC (permalink / raw)
  To: Joe Stringer
  Cc: Jesse Gross, netdev, Shannon Nelson, Brandeburg, Jesse,
	Jeff Kirsher, linux.nics, Linux Kernel Mailing List
In-Reply-To: <CANr6G5xRbG=UnJEdBEN_5ZRjNikCjuUseb1VbQmjjfb36jDP0g@mail.gmail.com>

On Mon, Dec 1, 2014 at 3:35 PM, Joe Stringer <joestringer@nicira.com> wrote:
> On 21 November 2014 at 09:59, Joe Stringer <joestringer@nicira.com> wrote:
>> On 20 November 2014 16:19, Jesse Gross <jesse@nicira.com> wrote:
>>> I don't know if we need to have the check at all for IPIP though -
>>> after all the driver doesn't expose support for it all (actually it
>>> doesn't expose GRE either). This raises kind of an interesting
>>> question about the checks though - it's pretty easy to add support to
>>> the driver for a new GSO type (and I imagine that people will be
>>> adding GRE soon) and forget to update the check.
>>
>> If the check is more conservative, then testing would show that it's
>> not working and lead people to figure out why (and update the check).
>
> More concretely, one suggestion would be something like following at
> the start of each gso_check():
>
> +       const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE |
> +                             SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL;
> +
> +       if (skb_shinfo(skb)->gso_type & ~supported)
> +               return false;

This should already be handled by net_gso_ok.

>
> ..followed by checking the specifics for each. So far the patches have
> only been concerned with further checking on UDP tunnels.

^ permalink raw reply

* Re: [PATCH v2 1/6] GMAC: add driver for Rockchip RK3288 SoCs integrated GMAC
From: Heiko Stübner @ 2014-12-01 23:44 UTC (permalink / raw)
  To: Roger Chen
  Cc: peppe.cavallaro, netdev, linux-kernel, linux-rockchip, kever.yang,
	eddie.cai
In-Reply-To: <1417056728-3642-1-git-send-email-roger.chen@rock-chips.com>

Hi Roger,

the comments inline are a rough first review. I hope to get a clearer picture 
for the stuff I'm not sure about in v3 once the big issues are fixed.


Am Donnerstag, 27. November 2014, 10:52:08 schrieb Roger Chen:
> This driver is based on stmmac driver.
> 
> modification based on Giuseppe CAVALLARO's suggestion:
> 1. use BIT()
> 
> 	> +/*RK3288_GRF_SOC_CON3*/
> 	> +#define GMAC_TXCLK_DLY_ENABLE   ((0x4000 << 16) | (0x4000))
> 	> +#define GMAC_TXCLK_DLY_DISABLE  ((0x4000 << 16) | (0x0000))
> 
> 	...
> 
> 	why do not use BIT and BIT_MASK where possible?
> 
> 	===>after modification:
> 
> 	#define GRF_BIT(nr)     (BIT(nr) | BIT(nr+16))
> 	#define GRF_CLR_BIT(nr) (BIT(nr+16))
> 	#define GMAC_TXCLK_DLY_ENABLE   GRF_BIT(14)
> 	#define GMAC_TXCLK_DLY_DISABLE  GRF_CLR_BIT(14)
> 	...
> 2.
> 
> 	> +    regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1,
> 	> +             GMAC_PHY_INTF_SEL_RGMII);
> 	> +    regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1,
> 	> +             GMAC_RMII_MODE_CLR);
> 
> 	maybe you could perform just one write unless there is some HW
> 	constraint.
> 
> 	===>after modification:
> 
> 	regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1,
> 			 GMAC_PHY_INTF_SEL_RGMII | GMAC_RMII_MODE_CLR);
> 
> 3. use macros
> 
> 	> +    regmap_write(bsp_priv->grf, RK3288_GRF_GPIO3D_E, 0xFFFFFFFF);
> 	> +    regmap_write(bsp_priv->grf, RK3288_GRF_GPIO4B_E,
> 	> +             0x3<<2<<16 | 0x3<<2);
> 
> 	pls use macros, these shift sequence is really help to decode
> 
> 	===>after modification:
> 
> 	regmap_write(bsp_priv->grf, RK3288_GRF_GPIO4A_E, GPIO4A_12MA);
> 	regmap_write(bsp_priv->grf, RK3288_GRF_GPIO4B_E, GPIO4B_2_12MA);
> 
> 4. remove grf fail check in rk_gmac_setup()
> 
> 	> +    if (IS_ERR(bsp_priv->grf))
> 	> +        dev_err(&pdev->dev, "Missing rockchip,grf property\n");
> 
> 	I wonder if you can fail on here and save all the check in
> 	set_rgmii_speed etc.
> 	Maybe this can be considered a mandatory property for the glue-logic.
> 
> 5. remove .tx_coe=1
> 
> 	> +const struct stmmac_of_data rk_gmac_data = {
> 	> +    .has_gmac = 1,
> 	> +    .tx_coe = 1,
> 
> 	FYI, on new gmac there is the HW capability register to dinamically
> 	provide you if coe is supported.
> 
> 	IMO you should add the OF "compatible" string and in case of mac
> 	newer than the 3.50a you can remove coe.

changelogs like these, should be compact and also not be in the commit message 
itself, but in the "comment"-section below the "---" and before the diffstat.


> 
> Signed-off-by: Roger Chen <roger.chen@rock-chips.com>
> ---

changelog here ... the commonly used pattern is something like

changes since v2:
- ...
- ...

changes since v1:
- ...

>  drivers/net/ethernet/stmicro/stmmac/Makefile       |    2 +-
>  drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c     |  636
> ++++++++++++++++++++ .../net/ethernet/stmicro/stmmac/stmmac_platform.c  |  
>  3 +
>  .../net/ethernet/stmicro/stmmac/stmmac_platform.h  |    1 +
>  4 files changed, 641 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile
> b/drivers/net/ethernet/stmicro/stmmac/Makefile index ac4d562..73c2715
> 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/Makefile
> +++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
> @@ -6,7 +6,7 @@ stmmac-objs:= stmmac_main.o stmmac_ethtool.o stmmac_mdio.o
> ring_mode.o	\
> 
>  obj-$(CONFIG_STMMAC_PLATFORM) += stmmac-platform.o
>  stmmac-platform-objs:= stmmac_platform.o dwmac-meson.o dwmac-sunxi.o	\
> -		       dwmac-sti.o dwmac-socfpga.o
> +		       dwmac-sti.o dwmac-socfpga.o dwmac-rk.o
> 
>  obj-$(CONFIG_STMMAC_PCI) += stmmac-pci.o
>  stmmac-pci-objs:= stmmac_pci.o
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
> b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c new file mode 100644
> index 0000000..870563f
> --- /dev/null
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
> @@ -0,0 +1,636 @@
> +/**
> + * dwmac-rk.c - Rockchip RK3288 DWMAC specific glue layer
> + *
> + * Copyright (C) 2014 Chen-Zhi (Roger Chen)
> + *
> + * Chen-Zhi (Roger Chen)  <roger.chen@rock-chips.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/stmmac.h>
> +#include <linux/bitops.h>
> +#include <linux/clk.h>
> +#include <linux/phy.h>
> +#include <linux/of_net.h>
> +#include <linux/gpio.h>
> +#include <linux/of_gpio.h>
> +#include <linux/of_device.h>
> +#include <linux/regulator/consumer.h>
> +#include <linux/delay.h>
> +#include <linux/mfd/syscon.h>
> +#include <linux/regmap.h>
> +
> +struct rk_priv_data {
> +	struct platform_device *pdev;
> +	int phy_iface;
> +	bool power_ctrl_by_pmu;
> +	char pmu_regulator[32];
> +	int pmu_enable_level;
> +
> +	int power_io;
> +	int power_io_level;
> +	int reset_io;
> +	int reset_io_level;
> +	int phyirq_io;
> +	int phyirq_io_level;
> +
> +	bool clk_enabled;
> +	bool clock_input;
> +
> +	struct clk *clk_mac;
> +	struct clk *clk_mac_pll;
> +	struct clk *gmac_clkin;
> +	struct clk *mac_clk_rx;
> +	struct clk *mac_clk_tx;
> +	struct clk *clk_mac_ref;
> +	struct clk *clk_mac_refout;
> +	struct clk *aclk_mac;
> +	struct clk *pclk_mac;
> +
> +	int tx_delay;
> +	int rx_delay;
> +
> +	struct regmap *grf;
> +};
> +
> +#define RK3288_GRF_SOC_CON1 0x0248
> +#define RK3288_GRF_SOC_CON3 0x0250
> +#define RK3288_GRF_GPIO3D_E 0x01ec
> +#define RK3288_GRF_GPIO4A_E 0x01f0
> +#define RK3288_GRF_GPIO4B_E 0x01f4

here you're using a space instead of a tab, please select one pattern either 
tabs or space but do not mix them.


> +#define GPIO3D_2MA	0xFFFF0000
> +#define GPIO3D_4MA	0xFFFF5555
> +#define GPIO3D_8MA	0xFFFFAAAA
> +#define GPIO3D_12MA	0xFFFFFFFF
> +
> +#define GPIO4A_2MA	0xFFFF0000
> +#define GPIO4A_4MA	0xFFFF5555
> +#define GPIO4A_8MA	0xFFFFAAAA
> +#define GPIO4A_12MA	0xFFFFFFFF

see comment about pin settings below


> +
> +#define GRF_BIT(nr)	(BIT(nr) | BIT(nr+16))
> +#define GRF_CLR_BIT(nr)	(BIT(nr+16))
> +
> +#define GPIO4B_2_2MA	(GRF_CLR_BIT(2) | GRF_CLR_BIT(3))
> +#define GPIO4B_2_4MA	(GRF_BIT(2) | GRF_CLR_BIT(3))
> +#define GPIO4B_2_8MA	(GRF_CLR_BIT(2) | GRF_BIT(3))
> +#define GPIO4B_2_12MA	(GRF_BIT(2) | GRF_BIT(3))
> +
> +/*RK3288_GRF_SOC_CON1*/
> +#define GMAC_PHY_INTF_SEL_RGMII	(GRF_BIT(6) | GRF_CLR_BIT(7) |
> GRF_CLR_BIT(8)) 
> +#define GMAC_PHY_INTF_SEL_RMII  (GRF_CLR_BIT(6) |
> GRF_CLR_BIT(7) | GRF_BIT(8)) 
> +#define GMAC_FLOW_CTRL          GRF_BIT(9)
> +#define GMAC_FLOW_CTRL_CLR      GRF_CLR_BIT(9)
> +#define GMAC_SPEED_10M          GRF_CLR_BIT(10)
> +#define GMAC_SPEED_100M         GRF_BIT(10)
> +#define GMAC_RMII_CLK_25M       GRF_BIT(11)
> +#define GMAC_RMII_CLK_2_5M      GRF_CLR_BIT(11)
> +#define GMAC_CLK_125M           (GRF_CLR_BIT(12) | GRF_CLR_BIT(13))
> +#define GMAC_CLK_25M            (GRF_BIT(12) | GRF_BIT(13))
> +#define GMAC_CLK_2_5M           (GRF_CLR_BIT(12) | GRF_BIT(13))
> +#define GMAC_RMII_MODE          GRF_BIT(14)
> +#define GMAC_RMII_MODE_CLR      GRF_CLR_BIT(14)
> +
> +/*RK3288_GRF_SOC_CON3*/
> +#define GMAC_TXCLK_DLY_ENABLE   GRF_BIT(14)
> +#define GMAC_TXCLK_DLY_DISABLE  GRF_CLR_BIT(14)
> +#define GMAC_RXCLK_DLY_ENABLE   GRF_BIT(15)
> +#define GMAC_RXCLK_DLY_DISABLE  GRF_CLR_BIT(15)
> +#define GMAC_CLK_RX_DL_CFG(val) ((0x7F<<7<<16) | (val<<7))
> +#define GMAC_CLK_TX_DL_CFG(val) ((0x7F<<16) | (val))

again mixed tabs and spaces as delimiters.

Also the _CFG macros are not well abstracted. You could take a look at the 
HIWORD_UPDATE macro in drivers/clk/rockchip/clk.h:

#define GMAC_CLK_DL_MASK	0x7f
#define GMAC_CLK_RX_DL_CFG(val)  HIWORD_UPDATE(val, GMAC_CLK_DL_MASK, 7)  
#define GMAC_CLK_TX_DL_CFG(val)  HIWORD_UPDATE(val, GMAC_CLK_DL_MASK, 0)  


> +
> +static void set_to_rgmii(struct rk_priv_data *bsp_priv,
> +			 int tx_delay, int rx_delay)
> +{
> +	if (IS_ERR(bsp_priv->grf)) {
> +		pr_err("%s: Missing rockchip,grf property\n", __func__);
> +		return;
> +	}
> +
> +	regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1,
> +		     GMAC_PHY_INTF_SEL_RGMII | GMAC_RMII_MODE_CLR);
> +	regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON3,
> +		     GMAC_RXCLK_DLY_ENABLE | GMAC_TXCLK_DLY_ENABLE |
> +		     GMAC_CLK_RX_DL_CFG(rx_delay) |
> +		     GMAC_CLK_TX_DL_CFG(tx_delay));
> +	regmap_write(bsp_priv->grf, RK3288_GRF_GPIO3D_E, GPIO3D_12MA);
> +	regmap_write(bsp_priv->grf, RK3288_GRF_GPIO4A_E, GPIO4A_12MA);
> +	regmap_write(bsp_priv->grf, RK3288_GRF_GPIO4B_E, GPIO4B_2_12MA);

please don't write to parts controlled by other drivers - here the drive 
strength settings of pins is controlled by the pinctrl driver. Instead you can 
just set the drive-strength in the pinctrl settings.


> +
> +	pr_debug("%s: tx delay=0x%x; rx delay=0x%x;\n",
> +		 __func__, tx_delay, rx_delay);
> +}
> +
> +static void set_to_rmii(struct rk_priv_data *bsp_priv)
> +{
> +	if (IS_ERR(bsp_priv->grf)) {
> +		pr_err("%s: Missing rockchip,grf property\n", __func__);

you have a device-reference in rk_priv_data, so you could use dev_err here. 
Same for all other pr_err/pr_debug/pr_* calls in this file.


> +		return;
> +	}
> +
> +	regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1,
> +		     GMAC_PHY_INTF_SEL_RMII);
> +	regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1,
> +		     GMAC_RMII_MODE);

these two could be combined?


> +}
> +
> +static void set_rgmii_speed(struct rk_priv_data *bsp_priv, int speed)
> +{
> +	if (IS_ERR(bsp_priv->grf)) {
> +		pr_err("%s: Missing rockchip,grf property\n", __func__);
> +		return;
> +	}
> +
> +	if (speed == 10)
> +		regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1, GMAC_CLK_2_5M);
> +	else if (speed == 100)
> +		regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1, GMAC_CLK_25M);
> +	else if (speed == 1000)
> +		regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1, GMAC_CLK_125M);
> +	else
> +		pr_err("unknown speed value for RGMII! speed=%d", speed);
> +}
> +
> +static void set_rmii_speed(struct rk_priv_data *bsp_priv, int speed)
> +{
> +	if (IS_ERR(bsp_priv->grf)) {
> +		pr_err("%s: Missing rockchip,grf property\n", __func__);
> +		return;
> +	}
> +
> +	if (speed == 10) {
> +		regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1,
> +			     GMAC_RMII_CLK_2_5M);
> +		regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1,
> +			     GMAC_SPEED_10M);

combine into one write?


> +	} else if (speed == 100) {
> +		regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1,
> +			     GMAC_RMII_CLK_25M);
> +		regmap_write(bsp_priv->grf, RK3288_GRF_SOC_CON1,
> +			     GMAC_SPEED_100M);

combine into one write?


> +	} else {
> +		pr_err("unknown speed value for RMII! speed=%d", speed);
> +	}
> +}
> +
> +#define MAC_CLK_RX	"mac_clk_rx"
> +#define MAC_CLK_TX	"mac_clk_tx"
> +#define CLK_MAC_REF	"clk_mac_ref"
> +#define CLK_MAC_REF_OUT	"clk_mac_refout"
> +#define CLK_MAC_PLL	"clk_mac_pll"
> +#define ACLK_MAC	"aclk_mac"
> +#define PCLK_MAC	"pclk_mac"
> +#define MAC_CLKIN	"ext_gmac"
> +#define CLK_MAC		"stmmaceth"

why the need to extra constants for the clock names and not use the real names 
directly like most other drivers do?


> +
> +static int gmac_clk_init(struct rk_priv_data *bsp_priv)
> +{
> +	struct device *dev = &bsp_priv->pdev->dev;
> +
> +	bsp_priv->clk_enabled = false;
> +
> +	bsp_priv->mac_clk_rx = clk_get(dev, MAC_CLK_RX);
> +	if (IS_ERR(bsp_priv->mac_clk_rx))
> +		pr_warn("%s: warning: cannot get clock %s\n",
> +			__func__, MAC_CLK_RX);
> +
> +	bsp_priv->mac_clk_tx = clk_get(dev, MAC_CLK_TX);
> +	if (IS_ERR(bsp_priv->mac_clk_tx))
> +		pr_warn("%s: warning: cannot get clock %s\n",
> +			__func__, MAC_CLK_TX);
> +
> +	bsp_priv->clk_mac_ref = clk_get(dev, CLK_MAC_REF);
> +	if (IS_ERR(bsp_priv->clk_mac_ref))
> +		pr_warn("%s: warning: cannot get clock %s\n",
> +			__func__, CLK_MAC_REF);
> +
> +	bsp_priv->clk_mac_refout = clk_get(dev, CLK_MAC_REF_OUT);
> +	if (IS_ERR(bsp_priv->clk_mac_refout))
> +		pr_warn("%s: warning:cannot get clock %s\n",
> +			__func__, CLK_MAC_REF_OUT);
> +
> +	bsp_priv->aclk_mac = clk_get(dev, ACLK_MAC);
> +	if (IS_ERR(bsp_priv->aclk_mac))
> +		pr_warn("%s: warning: cannot get clock %s\n",
> +			__func__, ACLK_MAC);
> +
> +	bsp_priv->pclk_mac = clk_get(dev, PCLK_MAC);
> +	if (IS_ERR(bsp_priv->pclk_mac))
> +		pr_warn("%s: warning: cannot get clock %s\n",
> +			__func__, PCLK_MAC);
> +
> +	bsp_priv->clk_mac_pll = clk_get(dev, CLK_MAC_PLL);
> +	if (IS_ERR(bsp_priv->clk_mac_pll))
> +		pr_warn("%s: warning: cannot get clock %s\n",
> +			__func__, CLK_MAC_PLL);
> +
> +	bsp_priv->gmac_clkin = clk_get(dev, MAC_CLKIN);
> +	if (IS_ERR(bsp_priv->gmac_clkin))
> +		pr_warn("%s: warning: cannot get clock %s\n",
> +			__func__, MAC_CLKIN);
> +
> +	bsp_priv->clk_mac = clk_get(dev, CLK_MAC);
> +	if (IS_ERR(bsp_priv->clk_mac))
> +		pr_warn("%s: warning: cannot get clock %s\n",
> +			__func__, CLK_MAC);

there is not clk_put in the _remove case ... maybe you could simply use 
devm_clk_get here so that all clocks are put on device removal.

Also you're warning on every missing clock. Below it looks like you need a 
different set of them for rgmii and rmii, so maybe you should simply error out 
when core clocks for the selected phy-mode are missing.


> +
> +	if (bsp_priv->clock_input) {
> +		pr_info("%s: clock input from PHY\n", __func__);
> +	} else {
> +		if (bsp_priv->phy_iface == PHY_INTERFACE_MODE_RMII)
> +			clk_set_rate(bsp_priv->clk_mac_pll, 50000000);
> +
> +		clk_set_parent(bsp_priv->clk_mac, bsp_priv->clk_mac_pll);

why the explicit reparenting. The common clock-framework is intelligent enough 
to select the best suitable parent.

In general I'm thinking the clock-handling inside this driver should be 
simplyfied, as the common-clock framework can handle most cases itself. I.e. if 
a 125MHz external clock is present and so on. But haven't looked to deep yet.



> +	}
> +
> +	return 0;
> +}
> +
> +static int gmac_clk_enable(struct rk_priv_data *bsp_priv, bool enable)
> +{
> +	int phy_iface = phy_iface = bsp_priv->phy_iface;
> +
> +	if (enable) {
> +		if (!bsp_priv->clk_enabled) {
> +			if (phy_iface == PHY_INTERFACE_MODE_RMII) {
> +				if (!IS_ERR(bsp_priv->mac_clk_rx))
> +					clk_prepare_enable(
> +						bsp_priv->mac_clk_rx);
> +
> +				if (!IS_ERR(bsp_priv->clk_mac_ref))
> +					clk_prepare_enable(
> +						bsp_priv->clk_mac_ref);
> +
> +				if (!IS_ERR(bsp_priv->clk_mac_refout))
> +					clk_prepare_enable(
> +						bsp_priv->clk_mac_refout);
> +			}
> +
> +			if (!IS_ERR(bsp_priv->aclk_mac))
> +				clk_prepare_enable(bsp_priv->aclk_mac);
> +
> +			if (!IS_ERR(bsp_priv->pclk_mac))
> +				clk_prepare_enable(bsp_priv->pclk_mac);
> +
> +			if (!IS_ERR(bsp_priv->mac_clk_tx))
> +				clk_prepare_enable(bsp_priv->mac_clk_tx);
> +
> +			/**
> +			 * if (!IS_ERR(bsp_priv->clk_mac))
> +			 *	clk_prepare_enable(bsp_priv->clk_mac);
> +			 */
> +			mdelay(5);
> +			bsp_priv->clk_enabled = true;
> +		}
> +	} else {
> +		if (bsp_priv->clk_enabled) {
> +			if (phy_iface == PHY_INTERFACE_MODE_RMII) {
> +				if (!IS_ERR(bsp_priv->mac_clk_rx))
> +					clk_disable_unprepare(
> +						bsp_priv->mac_clk_rx);
> +
> +				if (!IS_ERR(bsp_priv->clk_mac_ref))
> +					clk_disable_unprepare(
> +						bsp_priv->clk_mac_ref);
> +
> +				if (!IS_ERR(bsp_priv->clk_mac_refout))
> +					clk_disable_unprepare(
> +						bsp_priv->clk_mac_refout);
> +			}
> +
> +			if (!IS_ERR(bsp_priv->aclk_mac))
> +				clk_disable_unprepare(bsp_priv->aclk_mac);
> +
> +			if (!IS_ERR(bsp_priv->pclk_mac))
> +				clk_disable_unprepare(bsp_priv->pclk_mac);
> +
> +			if (!IS_ERR(bsp_priv->mac_clk_tx))
> +				clk_disable_unprepare(bsp_priv->mac_clk_tx);
> +			/**
> +			 * if (!IS_ERR(bsp_priv->clk_mac))
> +			 *	clk_disable_unprepare(bsp_priv->clk_mac);
> +			 */
> +			bsp_priv->clk_enabled = false;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int power_on_by_pmu(struct rk_priv_data *bsp_priv, bool enable)
> +{
> +	struct regulator *ldo;
> +	char *ldostr = bsp_priv->pmu_regulator;
> +	int ret;
> +
> +	if (!ldostr) {
> +		pr_err("%s: no ldo found\n", __func__);
> +		return -1;
> +	}
> +
> +	ldo = regulator_get(NULL, ldostr);
> +	if (!ldo) {
> +		pr_err("\n%s get ldo %s failed\n", __func__, ldostr);
> +	} else {
> +		if (enable) {
> +			if (!regulator_is_enabled(ldo)) {
> +				regulator_set_voltage(ldo, 3300000, 3300000);
> +				ret = regulator_enable(ldo);
> +				if (ret != 0)
> +					pr_err("%s: fail to enable %s\n",
> +					       __func__, ldostr);
> +				else
> +					pr_info("turn on ldo done.\n");
> +			} else {
> +				pr_warn("%s is enabled before enable", ldostr);
> +			}
> +		} else {
> +			if (regulator_is_enabled(ldo)) {
> +				ret = regulator_disable(ldo);
> +				if (ret != 0)
> +					pr_err("%s: fail to disable %s\n",
> +					       __func__, ldostr);
> +				else
> +					pr_info("turn off ldo done.\n");
> +			} else {
> +				pr_warn("%s is disabled before disable",
> +					ldostr);
> +			}
> +		}
> +		regulator_put(ldo);
> +	}
> +
> +	return 0;
> +}
> +
> +static int power_on_by_gpio(struct rk_priv_data *bsp_priv, bool enable)
> +{
> +	if (enable) {
> +		/*power on*/
> +		if (gpio_is_valid(bsp_priv->power_io))
> +			gpio_direction_output(bsp_priv->power_io,
> +					      bsp_priv->power_io_level);
> +	} else {
> +		/*power off*/
> +		if (gpio_is_valid(bsp_priv->power_io))
> +			gpio_direction_output(bsp_priv->power_io,
> +					      !bsp_priv->power_io_level);
> +	}
> +
> +	return 0;
> +}
> +
> +static int phy_power_on(struct rk_priv_data *bsp_priv, bool enable)
> +{
> +	int ret = -1;
> +
> +	pr_info("Ethernet PHY power %s\n", enable == 1 ? "on" : "off");
> +
> +	if (bsp_priv->power_ctrl_by_pmu)
> +		ret = power_on_by_pmu(bsp_priv, enable);
> +	else
> +		ret =  power_on_by_gpio(bsp_priv, enable);

this looks wrong. This should always be a regulator. Even a regulator + switch 
controlled by a gpio can still be modelled as regulator, so that you don't 
need this switch and assorted special handling - so just use the regulator API


> +
> +	if (enable) {
> +		/*reset*/
> +		if (gpio_is_valid(bsp_priv->reset_io)) {
> +			gpio_direction_output(bsp_priv->reset_io,
> +					      bsp_priv->reset_io_level);
> +			mdelay(5);
> +			gpio_direction_output(bsp_priv->reset_io,
> +					      !bsp_priv->reset_io_level);
> +		}
> +		mdelay(30);
> +
> +	} else {
> +		/*pull down reset*/
> +		if (gpio_is_valid(bsp_priv->reset_io)) {
> +			gpio_direction_output(bsp_priv->reset_io,
> +					      bsp_priv->reset_io_level);
> +		}
> +	}

I'm not sure yet if it would be better to use the reset framework for this. 
While it says it is also meant for reset-gpios, there does not seem a driver 
for this to exist yet.



> +
> +	return ret;
> +}
> +
> +#define GPIO_PHY_POWER	"gmac_phy_power"
> +#define GPIO_PHY_RESET	"gmac_phy_reset"
> +#define GPIO_PHY_IRQ	"gmac_phy_irq"

again I don't understand why these constants are necessary

> +
> +static void *rk_gmac_setup(struct platform_device *pdev)
> +{
> +	struct rk_priv_data *bsp_priv;
> +	struct device *dev = &pdev->dev;
> +	enum of_gpio_flags flags;
> +	int ret;
> +	const char *strings = NULL;
> +	int value;
> +	int irq;
> +
> +	bsp_priv = devm_kzalloc(dev, sizeof(*bsp_priv), GFP_KERNEL);
> +	if (!bsp_priv)
> +		return ERR_PTR(-ENOMEM);
> +
> +	bsp_priv->phy_iface = of_get_phy_mode(dev->of_node);
> +
> +	ret = of_property_read_string(dev->of_node, "pmu_regulator", &strings);
> +	if (ret) {
> +		pr_err("%s: Can not read property: pmu_regulator.\n", __func__);
> +		bsp_priv->power_ctrl_by_pmu = false;
> +	} else {
> +		pr_info("%s: ethernet phy power controlled by pmu(%s).\n",
> +			__func__, strings);
> +		bsp_priv->power_ctrl_by_pmu = true;
> +		strcpy(bsp_priv->pmu_regulator, strings);
> +	}

There is a generic regulator-dt-binding for regulator-consumers available 
which you should of course use.


> +
> +	ret = of_property_read_u32(dev->of_node, "pmu_enable_level", &value);
> +	if (ret) {
> +		pr_err("%s: Can not read property: pmu_enable_level.\n",
> +		       __func__);
> +		bsp_priv->power_ctrl_by_pmu = false;
> +	} else {
> +		pr_info("%s: PHY power controlled by pmu(level = %s).\n",
> +			__func__, (value == 1) ? "HIGH" : "LOW");
> +		bsp_priv->power_ctrl_by_pmu = true;
> +		bsp_priv->pmu_enable_level = value;
> +	}

What is this used for? Enabling should of course be done via regulator_enable 
and disabling using regulator_disable.


> +
> +	ret = of_property_read_string(dev->of_node, "clock_in_out", &strings);
> +	if (ret) {
> +		pr_err("%s: Can not read property: clock_in_out.\n", __func__);
> +		bsp_priv->clock_input = true;
> +	} else {
> +		pr_info("%s: clock input or output? (%s).\n",
> +			__func__, strings);
> +		if (!strcmp(strings, "input"))
> +			bsp_priv->clock_input = true;
> +		else
> +			bsp_priv->clock_input = false;
> +	}
> +
> +	ret = of_property_read_u32(dev->of_node, "tx_delay", &value);
> +	if (ret) {
> +		bsp_priv->tx_delay = 0x30;
> +		pr_err("%s: Can not read property: tx_delay.", __func__);
> +		pr_err("%s: set tx_delay to 0x%x\n",
> +		       __func__, bsp_priv->tx_delay);
> +	} else {
> +		pr_info("%s: TX delay(0x%x).\n", __func__, value);
> +		bsp_priv->tx_delay = value;
> +	}
> +
> +	ret = of_property_read_u32(dev->of_node, "rx_delay", &value);
> +	if (ret) {
> +		bsp_priv->rx_delay = 0x10;
> +		pr_err("%s: Can not read property: rx_delay.", __func__);
> +		pr_err("%s: set rx_delay to 0x%x\n",
> +		       __func__, bsp_priv->rx_delay);
> +	} else {
> +		pr_info("%s: RX delay(0x%x).\n", __func__, value);
> +		bsp_priv->rx_delay = value;
> +	}
> +
> +	bsp_priv->grf = syscon_regmap_lookup_by_phandle(dev->of_node,
> +							"rockchip,grf");
> +	bsp_priv->phyirq_io =
> +		of_get_named_gpio_flags(dev->of_node,
> +					"phyirq-gpio", 0, &flags);
> +	bsp_priv->phyirq_io_level = (flags & OF_GPIO_ACTIVE_LOW) ? 0 : 1;
> +
> +	bsp_priv->reset_io =
> +		of_get_named_gpio_flags(dev->of_node,
> +					"reset-gpio", 0, &flags);
> +	bsp_priv->reset_io_level = (flags & OF_GPIO_ACTIVE_LOW) ? 0 : 1;
> +
> +	bsp_priv->power_io =
> +		of_get_named_gpio_flags(dev->of_node, "power-gpio", 0, &flags);
> +	bsp_priv->power_io_level = (flags & OF_GPIO_ACTIVE_LOW) ? 0 : 1;
> +
> +	/*power*/
> +	if (!gpio_is_valid(bsp_priv->power_io)) {
> +		pr_err("%s: Failed to get GPIO %s.\n",
> +		       __func__, "power-gpio");
> +	} else {
> +		ret = gpio_request(bsp_priv->power_io, GPIO_PHY_POWER);
> +		if (ret)
> +			pr_err("%s: ERROR: Failed to request GPIO %s.\n",
> +			       __func__, GPIO_PHY_POWER);
> +	}

When everything power-related is handled using the regulator api, you don't 
need this


> +
> +	if (!gpio_is_valid(bsp_priv->reset_io)) {
> +		pr_err("%s: ERROR: Get reset-gpio failed.\n", __func__);
> +	} else {
> +		ret = gpio_request(bsp_priv->reset_io, GPIO_PHY_RESET);
> +		if (ret)
> +			pr_err("%s: ERROR: Failed to request GPIO %s.\n",
> +			       __func__, GPIO_PHY_RESET);
> +	}
> +
> +	if (bsp_priv->phyirq_io > 0) {

This is more for my understanding: why does the mac driver need to handle the 
phy interrupt - but I might be overlooking something.


> +		struct plat_stmmacenet_data *plat_dat;
> +
> +		pr_info("PHY irq in use\n");
> +		ret = gpio_request(bsp_priv->phyirq_io, GPIO_PHY_IRQ);
> +		if (ret < 0) {
> +			pr_warn("%s: Failed to request GPIO %s\n",
> +				__func__, GPIO_PHY_IRQ);
> +			goto goon;
> +		}
> +
> +		ret = gpio_direction_input(bsp_priv->phyirq_io);
> +		if (ret < 0) {
> +			pr_err("%s, Failed to set input for GPIO %s\n",
> +			       __func__, GPIO_PHY_IRQ);
> +			gpio_free(bsp_priv->phyirq_io);
> +			goto goon;
> +		}
> +
> +		irq = gpio_to_irq(bsp_priv->phyirq_io);
> +		if (irq < 0) {
> +			ret = irq;
> +			pr_err("Failed to set irq for %s\n",
> +			       GPIO_PHY_IRQ);
> +			gpio_free(bsp_priv->phyirq_io);
> +			goto goon;
> +		}
> +
> +		plat_dat = dev_get_platdata(&pdev->dev);
> +		if (plat_dat)
> +			plat_dat->mdio_bus_data->probed_phy_irq = irq;
> +		else
> +			pr_err("%s: plat_data is NULL\n", __func__);
> +	}
> +
> +goon:
> +	/*rmii or rgmii*/
> +	if (bsp_priv->phy_iface == PHY_INTERFACE_MODE_RGMII) {
> +		pr_info("%s: init for RGMII\n", __func__);
> +		set_to_rgmii(bsp_priv, bsp_priv->tx_delay, bsp_priv->rx_delay);
> +	} else if (bsp_priv->phy_iface == PHY_INTERFACE_MODE_RMII) {
> +		pr_info("%s: init for RMII\n", __func__);
> +		set_to_rmii(bsp_priv);
> +	} else {
> +		pr_err("%s: ERROR: NO interface defined!\n", __func__);
> +	}
> +
> +	bsp_priv->pdev = pdev;
> +
> +	gmac_clk_init(bsp_priv);
> +
> +	return bsp_priv;
> +}
> +
> +static int rk_gmac_init(struct platform_device *pdev, void *priv)
> +{
> +	struct rk_priv_data *bsp_priv = priv;
> +	int ret;
> +
> +	ret = phy_power_on(bsp_priv, true);
> +	if (ret)
> +		return ret;
> +
> +	ret = gmac_clk_enable(bsp_priv, true);
> +	if (ret)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +static void rk_gmac_exit(struct platform_device *pdev, void *priv)
> +{
> +	struct rk_priv_data *gmac = priv;
> +
> +	phy_power_on(gmac, false);
> +	gmac_clk_enable(gmac, false);
> +}
> +
> +static void rk_fix_speed(void *priv, unsigned int speed)
> +{
> +	struct rk_priv_data *bsp_priv = priv;
> +
> +	if (bsp_priv->phy_iface == PHY_INTERFACE_MODE_RGMII)
> +		set_rgmii_speed(bsp_priv, speed);
> +	else if (bsp_priv->phy_iface == PHY_INTERFACE_MODE_RMII)
> +		set_rmii_speed(bsp_priv, speed);
> +	else
> +		pr_err("unsupported interface %d", bsp_priv->phy_iface);
> +}
> +
> +const struct stmmac_of_data rk_gmac_data = {
> +	.has_gmac = 1,
> +	.fix_mac_speed = rk_fix_speed,
> +	.setup = rk_gmac_setup,
> +	.init = rk_gmac_init,
> +	.exit = rk_gmac_exit,
> +};
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c index
> 15814b7..b4dee96 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
> @@ -33,6 +33,7 @@
> 
>  static const struct of_device_id stmmac_dt_ids[] = {
>  	/* SoC specific glue layers should come before generic bindings */
> +	{ .compatible = "rockchip,rk3288-gmac", .data = &rk_gmac_data},

please name that rk3288_gmac_data [of course the other occurences too] 
It makes it easier to see which soc it is meant for and it's also not save to 
assume that the next one will use the same register + bit positions in the 
grf.


>  	{ .compatible = "amlogic,meson6-dwmac", .data = &meson6_dwmac_data},
>  	{ .compatible = "allwinner,sun7i-a20-gmac", .data = &sun7i_gmac_data},
>  	{ .compatible = "st,stih415-dwmac", .data = &stih4xx_dwmac_data},
> @@ -291,6 +292,8 @@ static int stmmac_pltfr_probe(struct platform_device
> *pdev) return  -ENOMEM;
>  		}
> 
> +		pdev->dev.platform_data = plat_dat;
> +
>  		ret = stmmac_probe_config_dt(pdev, plat_dat, &mac);
>  		if (ret) {
>  			pr_err("%s: main dt probe failed", __func__);
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h
> b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h index
> 25dd1f7..32a0516 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h
> @@ -24,5 +24,6 @@ extern const struct stmmac_of_data sun7i_gmac_data;
>  extern const struct stmmac_of_data stih4xx_dwmac_data;
>  extern const struct stmmac_of_data stid127_dwmac_data;
>  extern const struct stmmac_of_data socfpga_gmac_data;
> +extern const struct stmmac_of_data rk_gmac_data;
> 
>  #endif /* __STMMAC_PLATFORM_H__ */

^ permalink raw reply

* Re: ovs datapath complains on unexpected mask for netlink commands
From: Jesse Gross @ 2014-12-01 23:39 UTC (permalink / raw)
  To: Or Gerlitz, Pravin Shelar; +Cc: netdev@vger.kernel.org
In-Reply-To: <54775D61.6090300@mellanox.com>

On Thu, Nov 27, 2014 at 9:20 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> Hi Jesse,
>
> I am running with net-next as of commit 2ad7bf36 "ipvlan: Initial check-in
> [...]"
> and user-space OVS v2.0.90 built from the upstream git.
>
> I suddenly realized that my kernel logs gets filled with the below
> prints which effectively take place for each ovs-vsctl command I run...
>
> openvswitch: netlink: Unexpected mask (mask=1205c, allowed=19805c)
> openvswitch: netlink: Unexpected mask (mask=1205c, allowed=19805c)
> openvswitch: netlink: Unexpected mask (mask=1205c, allowed=19805c)
> openvswitch: netlink: Unexpected mask (mask=1205c, allowed=19805c)
> openvswitch: netlink: Unexpected mask (mask=1205c, allowed=19805c)

Thanks I see the problem. It looks like Pravin just sent out a patch
to fix this issue last night - "[PATCH net] openvswitch: Fix flow mask
validation.".

^ permalink raw reply

* Re: [PATCHv2 net] i40e: Implement ndo_gso_check()
From: Joe Stringer @ 2014-12-01 23:35 UTC (permalink / raw)
  To: Jesse Gross
  Cc: netdev, Shannon Nelson, Brandeburg, Jesse, Jeff Kirsher,
	Tom Herbert, linux.nics, Linux Kernel Mailing List
In-Reply-To: <CANr6G5zrA6NUtcyhX1cwEO1Hiq9jnarB36P6m5eoHUFqWWc26Q@mail.gmail.com>

On 21 November 2014 at 09:59, Joe Stringer <joestringer@nicira.com> wrote:
> On 20 November 2014 16:19, Jesse Gross <jesse@nicira.com> wrote:
>> I don't know if we need to have the check at all for IPIP though -
>> after all the driver doesn't expose support for it all (actually it
>> doesn't expose GRE either). This raises kind of an interesting
>> question about the checks though - it's pretty easy to add support to
>> the driver for a new GSO type (and I imagine that people will be
>> adding GRE soon) and forget to update the check.
>
> If the check is more conservative, then testing would show that it's
> not working and lead people to figure out why (and update the check).

More concretely, one suggestion would be something like following at
the start of each gso_check():

+       const int supported = SKB_GSO_TCPV4 | SKB_GSO_TCPV6 | SKB_GSO_FCOE |
+                             SKB_GSO_UDP | SKB_GSO_UDP_TUNNEL;
+
+       if (skb_shinfo(skb)->gso_type & ~supported)
+               return false;

..followed by checking the specifics for each. So far the patches have
only been concerned with further checking on UDP tunnels.

^ permalink raw reply

* Re: [PATCH v2 net-next] net: bcmgenet: enable driver to work without a device tree
From: Florian Fainelli @ 2014-12-01 23:26 UTC (permalink / raw)
  To: Petri Gynther, netdev; +Cc: davem, Kevin Cernekee
In-Reply-To: <547CF87A.3010605@gmail.com>

On 01/12/14 15:23, Florian Fainelli wrote:
> On 01/12/14 14:08, Petri Gynther wrote:
>> Broadcom 7xxx MIPS-based STB platforms do not use device trees.
>> Modify bcmgenet driver so that it can be used on those platforms.
> 
> Well, that statement is technically not true anymore thanks to Kevin's
> recent efforts, but this is not exactly what we have been
> advertising/supporting so far.
> 
> Looks mostly good to me, some minor nits below:
> 
>>
>> Signed-off-by: Petri Gynther <pgynther@google.com>
>> ---
> [snip]
>> +	if (dn) {
>> +		of_id = of_match_node(bcmgenet_match, dn);
>> +		if (!of_id)
>> +			return -EINVAL;
>> +	} else {
>> +		of_id = NULL;
>> +	}
> 
> You could probably get way with this else condition by assigning of_id
> to NULL by default.
> 
> [snip]
> 
>> +		if (pd->phy_addr >= 0 && pd->phy_addr < PHY_MAX_ADDR) {
>> +			phydev = mdio->phy_map[pd->phy_addr];
>> +		} else {
>> +			phydev = NULL;
>> +			for (i = 0; i < PHY_MAX_ADDR; i++) {
>> +				if (mdio->phy_map[i]) {
>> +					phydev = mdio->phy_map[i];
>> +					break;
>> +				}
>> +			}
> 
> phy_find_first() might provide a shorter version of this.
> 
>> +		}
>> +
>> +		if (!phydev) {
>> +			dev_err(kdev, "failed to register PHY device\n");
>> +			mdiobus_unregister(mdio);
>> +			return -ENODEV;
>> +		}
>> +	} else {
>> +		/*
>> +		 * MoCA port or no MDIO access.
>> +		 * Use 1000/FD fixed PHY to represent the link layer.
>> +		 */
>> +		struct fixed_phy_status fphy_status = {
>> +			.link = 1,
>> +			.duplex = DUPLEX_FULL,
>> +			.speed = SPEED_1000,
>> +			.pause = 0,
>> +			.asym_pause = 0,
>> +		};
>> +
>> +		phydev = fixed_phy_register(PHY_POLL, &fphy_status, NULL);
>> +		if (!phydev || IS_ERR(phydev)) {
>> +			dev_err(kdev, "failed to register fixed PHY device\n");
>> +			return -ENODEV;
>> +		}
> 
> This is typically done by platform code (not necessarily for good
> reasons though) but I cannot seen any problems with doing this here as well.

Well actually there could be one issue, we don't want to assume
1Gbits/sec here, we would want to be communicated a speed information
(typically ETH0_SPEED) in CFE so we can program the link parameters
accordingly, just in case someone interfaces this GENET instance with
e.g: a non-MDIO managed external 10/100 switch.

> 
> [snip]
> 
>> diff --git a/include/linux/platform_data/bcmgenet.h b/include/linux/platform_data/bcmgenet.h
>> new file mode 100644
>> index 0000000..3660133
>> --- /dev/null
>> +++ b/include/linux/platform_data/bcmgenet.h
>> @@ -0,0 +1,18 @@
>> +#ifndef __LINUX_PLATFORM_DATA_BCMGENET_H__
>> +#define __LINUX_PLATFORM_DATA_BCMGENET_H__
>> +
>> +#include <linux/compiler.h>
> 
> That include does not look necessary, you might want linux/types.h for
> "u8" though.
> 
>> +#include <linux/if_ether.h>
>> +
>> +struct bcmgenet_platform_data {
>> +	void __iomem *base_reg;
>> +	int           irq0;
>> +	int           irq1;
> 
> These 3 members are unused and should be communicated to the driver as
> resources, not side parameters, can you get rid of them?
> 
>> +	bool          mdio_enabled;
>> +	int           phy_type;
> 
> This is essentially phy_interface_t, so let's use that type directly here.
> 
>> +	int           phy_addr;
>> +	u8            macaddr[ETH_ALEN];
>> +	int           genet_version;
> 
> That is also an enum, if that helps, you could move it from
> drivers/net/ethernet/broadcom/bcmgenet.h there as well.
> --
> Florian
> 

^ permalink raw reply

* Re: [PATCH v2 net-next] net: bcmgenet: enable driver to work without a device tree
From: Florian Fainelli @ 2014-12-01 23:23 UTC (permalink / raw)
  To: Petri Gynther, netdev; +Cc: davem, Kevin Cernekee
In-Reply-To: <20141201220828.942A7220726@puck.mtv.corp.google.com>

On 01/12/14 14:08, Petri Gynther wrote:
> Broadcom 7xxx MIPS-based STB platforms do not use device trees.
> Modify bcmgenet driver so that it can be used on those platforms.

Well, that statement is technically not true anymore thanks to Kevin's
recent efforts, but this is not exactly what we have been
advertising/supporting so far.

Looks mostly good to me, some minor nits below:

> 
> Signed-off-by: Petri Gynther <pgynther@google.com>
> ---
[snip]
> +	if (dn) {
> +		of_id = of_match_node(bcmgenet_match, dn);
> +		if (!of_id)
> +			return -EINVAL;
> +	} else {
> +		of_id = NULL;
> +	}

You could probably get way with this else condition by assigning of_id
to NULL by default.

[snip]

> +		if (pd->phy_addr >= 0 && pd->phy_addr < PHY_MAX_ADDR) {
> +			phydev = mdio->phy_map[pd->phy_addr];
> +		} else {
> +			phydev = NULL;
> +			for (i = 0; i < PHY_MAX_ADDR; i++) {
> +				if (mdio->phy_map[i]) {
> +					phydev = mdio->phy_map[i];
> +					break;
> +				}
> +			}

phy_find_first() might provide a shorter version of this.

> +		}
> +
> +		if (!phydev) {
> +			dev_err(kdev, "failed to register PHY device\n");
> +			mdiobus_unregister(mdio);
> +			return -ENODEV;
> +		}
> +	} else {
> +		/*
> +		 * MoCA port or no MDIO access.
> +		 * Use 1000/FD fixed PHY to represent the link layer.
> +		 */
> +		struct fixed_phy_status fphy_status = {
> +			.link = 1,
> +			.duplex = DUPLEX_FULL,
> +			.speed = SPEED_1000,
> +			.pause = 0,
> +			.asym_pause = 0,
> +		};
> +
> +		phydev = fixed_phy_register(PHY_POLL, &fphy_status, NULL);
> +		if (!phydev || IS_ERR(phydev)) {
> +			dev_err(kdev, "failed to register fixed PHY device\n");
> +			return -ENODEV;
> +		}

This is typically done by platform code (not necessarily for good
reasons though) but I cannot seen any problems with doing this here as well.

[snip]

> diff --git a/include/linux/platform_data/bcmgenet.h b/include/linux/platform_data/bcmgenet.h
> new file mode 100644
> index 0000000..3660133
> --- /dev/null
> +++ b/include/linux/platform_data/bcmgenet.h
> @@ -0,0 +1,18 @@
> +#ifndef __LINUX_PLATFORM_DATA_BCMGENET_H__
> +#define __LINUX_PLATFORM_DATA_BCMGENET_H__
> +
> +#include <linux/compiler.h>

That include does not look necessary, you might want linux/types.h for
"u8" though.

> +#include <linux/if_ether.h>
> +
> +struct bcmgenet_platform_data {
> +	void __iomem *base_reg;
> +	int           irq0;
> +	int           irq1;

These 3 members are unused and should be communicated to the driver as
resources, not side parameters, can you get rid of them?

> +	bool          mdio_enabled;
> +	int           phy_type;

This is essentially phy_interface_t, so let's use that type directly here.

> +	int           phy_addr;
> +	u8            macaddr[ETH_ALEN];
> +	int           genet_version;

That is also an enum, if that helps, you could move it from
drivers/net/ethernet/broadcom/bcmgenet.h there as well.
--
Florian

^ permalink raw reply

* [PATCH v2 net-next 6/6] samples: bpf: large eBPF program in C
From: Alexei Starovoitov @ 2014-12-01 23:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1417475199-15950-1-git-send-email-ast@plumgrid.com>

sockex2_kern.c is purposefully large eBPF program in C.
llvm compiles ~200 lines of C code into ~300 eBPF instructions.

It's similar to __skb_flow_dissect() to demonstrate that complex packet parsing
can be done by eBPF.
Then it uses (struct flow_keys)->dst IP address (or hash of ipv6 dst) to keep
stats of number of packets per IP.
User space loads eBPF program, attaches it to loopback interface and prints
dest_ip->#packets stats every second.

Usage:
$sudo samples/bpf/sockex2
ip 127.0.0.1 count 19
ip 127.0.0.1 count 178115
ip 127.0.0.1 count 369437
ip 127.0.0.1 count 559841
ip 127.0.0.1 count 750539

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
v2: no changes

 samples/bpf/Makefile       |    4 +
 samples/bpf/sockex2_kern.c |  215 ++++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/sockex2_user.c |   44 +++++++++
 3 files changed, 263 insertions(+)
 create mode 100644 samples/bpf/sockex2_kern.c
 create mode 100644 samples/bpf/sockex2_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 770d145186c3..b5b3600dcdf5 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -5,20 +5,24 @@ obj- := dummy.o
 hostprogs-y := test_verifier test_maps
 hostprogs-y += sock_example
 hostprogs-y += sockex1
+hostprogs-y += sockex2
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
 sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
+sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 always += sockex1_kern.o
+always += sockex2_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
 HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
 HOSTLOADLIBES_sockex1 += -lelf
+HOSTLOADLIBES_sockex2 += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/sockex2_kern.c b/samples/bpf/sockex2_kern.c
new file mode 100644
index 000000000000..6f0135f0f217
--- /dev/null
+++ b/samples/bpf/sockex2_kern.c
@@ -0,0 +1,215 @@
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+#include <uapi/linux/in.h>
+#include <uapi/linux/if.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/ip.h>
+#include <uapi/linux/ipv6.h>
+#include <uapi/linux/if_tunnel.h>
+#define IP_MF		0x2000
+#define IP_OFFSET	0x1FFF
+
+struct vlan_hdr {
+	__be16 h_vlan_TCI;
+	__be16 h_vlan_encapsulated_proto;
+};
+
+struct flow_keys {
+	__be32 src;
+	__be32 dst;
+	union {
+		__be32 ports;
+		__be16 port16[2];
+	};
+	__u16 thoff;
+	__u8 ip_proto;
+};
+
+static inline int proto_ports_offset(__u64 proto)
+{
+	switch (proto) {
+	case IPPROTO_TCP:
+	case IPPROTO_UDP:
+	case IPPROTO_DCCP:
+	case IPPROTO_ESP:
+	case IPPROTO_SCTP:
+	case IPPROTO_UDPLITE:
+		return 0;
+	case IPPROTO_AH:
+		return 4;
+	default:
+		return 0;
+	}
+}
+
+static inline int ip_is_fragment(struct sk_buff *ctx, __u64 nhoff)
+{
+	return load_half(ctx, nhoff + offsetof(struct iphdr, frag_off))
+		& (IP_MF | IP_OFFSET);
+}
+
+static inline __u32 ipv6_addr_hash(struct sk_buff *ctx, __u64 off)
+{
+	__u64 w0 = load_word(ctx, off);
+	__u64 w1 = load_word(ctx, off + 4);
+	__u64 w2 = load_word(ctx, off + 8);
+	__u64 w3 = load_word(ctx, off + 12);
+
+	return (__u32)(w0 ^ w1 ^ w2 ^ w3);
+}
+
+static inline __u64 parse_ip(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
+			     struct flow_keys *flow)
+{
+	__u64 verlen;
+
+	if (unlikely(ip_is_fragment(skb, nhoff)))
+		*ip_proto = 0;
+	else
+		*ip_proto = load_byte(skb, nhoff + offsetof(struct iphdr, protocol));
+
+	if (*ip_proto != IPPROTO_GRE) {
+		flow->src = load_word(skb, nhoff + offsetof(struct iphdr, saddr));
+		flow->dst = load_word(skb, nhoff + offsetof(struct iphdr, daddr));
+	}
+
+	verlen = load_byte(skb, nhoff + 0/*offsetof(struct iphdr, ihl)*/);
+	if (likely(verlen == 0x45))
+		nhoff += 20;
+	else
+		nhoff += (verlen & 0xF) << 2;
+
+	return nhoff;
+}
+
+static inline __u64 parse_ipv6(struct sk_buff *skb, __u64 nhoff, __u64 *ip_proto,
+			       struct flow_keys *flow)
+{
+	*ip_proto = load_byte(skb,
+			      nhoff + offsetof(struct ipv6hdr, nexthdr));
+	flow->src = ipv6_addr_hash(skb,
+				   nhoff + offsetof(struct ipv6hdr, saddr));
+	flow->dst = ipv6_addr_hash(skb,
+				   nhoff + offsetof(struct ipv6hdr, daddr));
+	nhoff += sizeof(struct ipv6hdr);
+
+	return nhoff;
+}
+
+static inline bool flow_dissector(struct sk_buff *skb, struct flow_keys *flow)
+{
+	__u64 nhoff = ETH_HLEN;
+	__u64 ip_proto;
+	__u64 proto = load_half(skb, 12);
+	int poff;
+
+	if (proto == ETH_P_8021AD) {
+		proto = load_half(skb, nhoff + offsetof(struct vlan_hdr,
+							h_vlan_encapsulated_proto));
+		nhoff += sizeof(struct vlan_hdr);
+	}
+
+	if (proto == ETH_P_8021Q) {
+		proto = load_half(skb, nhoff + offsetof(struct vlan_hdr,
+							h_vlan_encapsulated_proto));
+		nhoff += sizeof(struct vlan_hdr);
+	}
+
+	if (likely(proto == ETH_P_IP))
+		nhoff = parse_ip(skb, nhoff, &ip_proto, flow);
+	else if (proto == ETH_P_IPV6)
+		nhoff = parse_ipv6(skb, nhoff, &ip_proto, flow);
+	else
+		return false;
+
+	switch (ip_proto) {
+	case IPPROTO_GRE: {
+		struct gre_hdr {
+			__be16 flags;
+			__be16 proto;
+		};
+
+		__u64 gre_flags = load_half(skb,
+					    nhoff + offsetof(struct gre_hdr, flags));
+		__u64 gre_proto = load_half(skb,
+					    nhoff + offsetof(struct gre_hdr, proto));
+
+		if (gre_flags & (GRE_VERSION|GRE_ROUTING))
+			break;
+
+		proto = gre_proto;
+		nhoff += 4;
+		if (gre_flags & GRE_CSUM)
+			nhoff += 4;
+		if (gre_flags & GRE_KEY)
+			nhoff += 4;
+		if (gre_flags & GRE_SEQ)
+			nhoff += 4;
+
+		if (proto == ETH_P_8021Q) {
+			proto = load_half(skb,
+					  nhoff + offsetof(struct vlan_hdr,
+							   h_vlan_encapsulated_proto));
+			nhoff += sizeof(struct vlan_hdr);
+		}
+
+		if (proto == ETH_P_IP)
+			nhoff = parse_ip(skb, nhoff, &ip_proto, flow);
+		else if (proto == ETH_P_IPV6)
+			nhoff = parse_ipv6(skb, nhoff, &ip_proto, flow);
+		else
+			return false;
+		break;
+	}
+	case IPPROTO_IPIP:
+		nhoff = parse_ip(skb, nhoff, &ip_proto, flow);
+		break;
+	case IPPROTO_IPV6:
+		nhoff = parse_ipv6(skb, nhoff, &ip_proto, flow);
+		break;
+	default:
+		break;
+	}
+
+	flow->ip_proto = ip_proto;
+	poff = proto_ports_offset(ip_proto);
+	if (poff >= 0) {
+		nhoff += poff;
+		flow->ports = load_word(skb, nhoff);
+	}
+
+	flow->thoff = (__u16) nhoff;
+
+	return true;
+}
+
+struct bpf_map_def SEC("maps") hash_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(__be32),
+	.value_size = sizeof(long),
+	.max_entries = 1024,
+};
+
+SEC("socket2")
+int bpf_prog2(struct sk_buff *skb)
+{
+	struct flow_keys flow;
+	long *value;
+	u32 key;
+
+	if (!flow_dissector(skb, &flow))
+		return 0;
+
+	key = flow.dst;
+	value = bpf_map_lookup_elem(&hash_map, &key);
+	if (value) {
+		__sync_fetch_and_add(value, 1);
+	} else {
+		long val = 1;
+
+		bpf_map_update_elem(&hash_map, &key, &val, BPF_ANY);
+	}
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/sockex2_user.c b/samples/bpf/sockex2_user.c
new file mode 100644
index 000000000000..d2d5f5a790d3
--- /dev/null
+++ b/samples/bpf/sockex2_user.c
@@ -0,0 +1,44 @@
+#include <stdio.h>
+#include <assert.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include <unistd.h>
+#include <arpa/inet.h>
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	FILE *f;
+	int i, sock;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	sock = open_raw_sock("lo");
+
+	assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
+			  sizeof(prog_fd[0])) == 0);
+
+	f = popen("ping -c5 localhost", "r");
+	(void) f;
+
+	for (i = 0; i < 5; i++) {
+		int key = 0, next_key;
+		long long value;
+
+		while (bpf_get_next_key(map_fd[0], &key, &next_key) == 0) {
+			bpf_lookup_elem(map_fd[0], &next_key, &value);
+			printf("ip %s count %lld\n",
+			       inet_ntoa((struct in_addr){htonl(next_key)}),
+			       value);
+			key = next_key;
+		}
+		sleep(1);
+	}
+	return 0;
+}
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH v2 net-next 5/6] samples: bpf: trivial eBPF program in C
From: Alexei Starovoitov @ 2014-12-01 23:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1417475199-15950-1-git-send-email-ast@plumgrid.com>

this example does the same task as previous socket example
in assembler, but this one does it in C.

eBPF program in kernel does:
    /* assume that packet is IPv4, load one byte of IP->proto */
    int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
    long *value;

    value = bpf_map_lookup_elem(&my_map, &index);
    if (value)
        __sync_fetch_and_add(value, 1);

Corresponding user space reads map[tcp], map[udp], map[icmp]
and prints protocol stats every second

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
V1->V2:

replaced constants in:
load_byte(skb, 14 + 9)
with:
load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol))

 samples/bpf/Makefile       |   14 +++++++++++++
 samples/bpf/libbpf.h       |    2 +-
 samples/bpf/sockex1_kern.c |   25 ++++++++++++++++++++++
 samples/bpf/sockex1_user.c |   49 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 89 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/sockex1_kern.c
 create mode 100644 samples/bpf/sockex1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index f46d3492d032..770d145186c3 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -4,12 +4,26 @@ obj- := dummy.o
 # List of programs to build
 hostprogs-y := test_verifier test_maps
 hostprogs-y += sock_example
+hostprogs-y += sockex1
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
+sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
+always += sockex1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
+
+HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
+HOSTLOADLIBES_sockex1 += -lelf
+
+# point this to your LLVM backend with bpf support
+LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
+
+%.o: %.c
+	clang $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) \
+		-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
+		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf -filetype=obj -o $@
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index cc62ad4d95de..58c5fe1bdba1 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,7 +15,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 		  const struct bpf_insn *insns, int insn_len,
 		  const char *license);
 
-#define LOG_BUF_SIZE 8192
+#define LOG_BUF_SIZE 65536
 extern char bpf_log_buf[LOG_BUF_SIZE];
 
 /* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
diff --git a/samples/bpf/sockex1_kern.c b/samples/bpf/sockex1_kern.c
new file mode 100644
index 000000000000..066892662915
--- /dev/null
+++ b/samples/bpf/sockex1_kern.c
@@ -0,0 +1,25 @@
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/ip.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") my_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+SEC("socket1")
+int bpf_prog1(struct sk_buff *skb)
+{
+	int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
+	long *value;
+
+	value = bpf_map_lookup_elem(&my_map, &index);
+	if (value)
+		__sync_fetch_and_add(value, 1);
+
+	return 0;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/sockex1_user.c b/samples/bpf/sockex1_user.c
new file mode 100644
index 000000000000..34a443ff3831
--- /dev/null
+++ b/samples/bpf/sockex1_user.c
@@ -0,0 +1,49 @@
+#include <stdio.h>
+#include <assert.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include <unistd.h>
+#include <arpa/inet.h>
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	FILE *f;
+	int i, sock;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	sock = open_raw_sock("lo");
+
+	assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
+			  sizeof(prog_fd[0])) == 0);
+
+	f = popen("ping -c5 localhost", "r");
+	(void) f;
+
+	for (i = 0; i < 5; i++) {
+		long long tcp_cnt, udp_cnt, icmp_cnt;
+		int key;
+
+		key = IPPROTO_TCP;
+		assert(bpf_lookup_elem(map_fd[0], &key, &tcp_cnt) == 0);
+
+		key = IPPROTO_UDP;
+		assert(bpf_lookup_elem(map_fd[0], &key, &udp_cnt) == 0);
+
+		key = IPPROTO_ICMP;
+		assert(bpf_lookup_elem(map_fd[0], &key, &icmp_cnt) == 0);
+
+		printf("TCP %lld UDP %lld ICMP %lld packets\n",
+		       tcp_cnt, udp_cnt, icmp_cnt);
+		sleep(1);
+	}
+
+	return 0;
+}
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH v2 net-next 4/6] samples: bpf: elf_bpf file loader
From: Alexei Starovoitov @ 2014-12-01 23:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1417475199-15950-1-git-send-email-ast@plumgrid.com>

simple .o parser and loader using BPF syscall.
.o is a standard ELF generated by LLVM backend

It parses elf file compiled by llvm .c->.o
- parses 'maps' section and creates maps via BPF syscall
- parses 'license' section and passes it to syscall
- parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns
  by storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
- loads eBPF programs via BPF syscall

One ELF file can contain multiple BPF programs.

int load_bpf_file(char *path);
populates prog_fd[] and map_fd[] with FDs received from bpf syscall

bpf_helpers.h - helper functions available to eBPF programs written in C

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
v2: no changes

These helpers and loader are done as separate patch to make eBPF C examples
(that follow in the next patches) to focus on demonstrating programming
of eBPF in restricted C.

 samples/bpf/bpf_helpers.h |   40 +++++++++
 samples/bpf/bpf_load.c    |  203 +++++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/bpf_load.h    |   24 ++++++
 3 files changed, 267 insertions(+)
 create mode 100644 samples/bpf/bpf_helpers.h
 create mode 100644 samples/bpf/bpf_load.c
 create mode 100644 samples/bpf/bpf_load.h

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
new file mode 100644
index 000000000000..ca0333146006
--- /dev/null
+++ b/samples/bpf/bpf_helpers.h
@@ -0,0 +1,40 @@
+#ifndef __BPF_HELPERS_H
+#define __BPF_HELPERS_H
+
+/* helper macro to place programs, maps, license in
+ * different sections in elf_bpf file. Section names
+ * are interpreted by elf_bpf loader
+ */
+#define SEC(NAME) __attribute__((section(NAME), used))
+
+/* helper functions called from eBPF programs written in C */
+static void *(*bpf_map_lookup_elem)(void *map, void *key) =
+	(void *) BPF_FUNC_map_lookup_elem;
+static int (*bpf_map_update_elem)(void *map, void *key, void *value,
+				  unsigned long long flags) =
+	(void *) BPF_FUNC_map_update_elem;
+static int (*bpf_map_delete_elem)(void *map, void *key) =
+	(void *) BPF_FUNC_map_delete_elem;
+
+/* llvm builtin functions that eBPF C program may use to
+ * emit BPF_LD_ABS and BPF_LD_IND instructions
+ */
+struct sk_buff;
+unsigned long long load_byte(void *skb,
+			     unsigned long long off) asm("llvm.bpf.load.byte");
+unsigned long long load_half(void *skb,
+			     unsigned long long off) asm("llvm.bpf.load.half");
+unsigned long long load_word(void *skb,
+			     unsigned long long off) asm("llvm.bpf.load.word");
+
+/* a helper structure used by eBPF C program
+ * to describe map attributes to elf_bpf loader
+ */
+struct bpf_map_def {
+	unsigned int type;
+	unsigned int key_size;
+	unsigned int value_size;
+	unsigned int max_entries;
+};
+
+#endif
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
new file mode 100644
index 000000000000..1831d236382b
--- /dev/null
+++ b/samples/bpf/bpf_load.c
@@ -0,0 +1,203 @@
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <libelf.h>
+#include <gelf.h>
+#include <errno.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdbool.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include "libbpf.h"
+#include "bpf_helpers.h"
+#include "bpf_load.h"
+
+static char license[128];
+static bool processed_sec[128];
+int map_fd[MAX_MAPS];
+int prog_fd[MAX_PROGS];
+int prog_cnt;
+
+static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
+{
+	int fd;
+	bool is_socket = strncmp(event, "socket", 6) == 0;
+
+	if (!is_socket)
+		/* tracing events tbd */
+		return -1;
+
+	fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER,
+			   prog, size, license);
+
+	if (fd < 0) {
+		printf("bpf_prog_load() err=%d\n%s", errno, bpf_log_buf);
+		return -1;
+	}
+
+	prog_fd[prog_cnt++] = fd;
+
+	return 0;
+}
+
+static int load_maps(struct bpf_map_def *maps, int len)
+{
+	int i;
+
+	for (i = 0; i < len / sizeof(struct bpf_map_def); i++) {
+
+		map_fd[i] = bpf_create_map(maps[i].type,
+					   maps[i].key_size,
+					   maps[i].value_size,
+					   maps[i].max_entries);
+		if (map_fd[i] < 0)
+			return 1;
+	}
+	return 0;
+}
+
+static int get_sec(Elf *elf, int i, GElf_Ehdr *ehdr, char **shname,
+		   GElf_Shdr *shdr, Elf_Data **data)
+{
+	Elf_Scn *scn;
+
+	scn = elf_getscn(elf, i);
+	if (!scn)
+		return 1;
+
+	if (gelf_getshdr(scn, shdr) != shdr)
+		return 2;
+
+	*shname = elf_strptr(elf, ehdr->e_shstrndx, shdr->sh_name);
+	if (!*shname || !shdr->sh_size)
+		return 3;
+
+	*data = elf_getdata(scn, 0);
+	if (!*data || elf_getdata(scn, *data) != NULL)
+		return 4;
+
+	return 0;
+}
+
+static int parse_relo_and_apply(Elf_Data *data, Elf_Data *symbols,
+				GElf_Shdr *shdr, struct bpf_insn *insn)
+{
+	int i, nrels;
+
+	nrels = shdr->sh_size / shdr->sh_entsize;
+
+	for (i = 0; i < nrels; i++) {
+		GElf_Sym sym;
+		GElf_Rel rel;
+		unsigned int insn_idx;
+
+		gelf_getrel(data, i, &rel);
+
+		insn_idx = rel.r_offset / sizeof(struct bpf_insn);
+
+		gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym);
+
+		if (insn[insn_idx].code != (BPF_LD | BPF_IMM | BPF_DW)) {
+			printf("invalid relo for insn[%d].code 0x%x\n",
+			       insn_idx, insn[insn_idx].code);
+			return 1;
+		}
+		insn[insn_idx].src_reg = BPF_PSEUDO_MAP_FD;
+		insn[insn_idx].imm = map_fd[sym.st_value / sizeof(struct bpf_map_def)];
+	}
+
+	return 0;
+}
+
+int load_bpf_file(char *path)
+{
+	int fd, i;
+	Elf *elf;
+	GElf_Ehdr ehdr;
+	GElf_Shdr shdr, shdr_prog;
+	Elf_Data *data, *data_prog, *symbols = NULL;
+	char *shname, *shname_prog;
+
+	if (elf_version(EV_CURRENT) == EV_NONE)
+		return 1;
+
+	fd = open(path, O_RDONLY, 0);
+	if (fd < 0)
+		return 1;
+
+	elf = elf_begin(fd, ELF_C_READ, NULL);
+
+	if (!elf)
+		return 1;
+
+	if (gelf_getehdr(elf, &ehdr) != &ehdr)
+		return 1;
+
+	/* scan over all elf sections to get license and map info */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (0) /* helpful for llvm debugging */
+			printf("section %d:%s data %p size %zd link %d flags %d\n",
+			       i, shname, data->d_buf, data->d_size,
+			       shdr.sh_link, (int) shdr.sh_flags);
+
+		if (strcmp(shname, "license") == 0) {
+			processed_sec[i] = true;
+			memcpy(license, data->d_buf, data->d_size);
+		} else if (strcmp(shname, "maps") == 0) {
+			processed_sec[i] = true;
+			if (load_maps(data->d_buf, data->d_size))
+				return 1;
+		} else if (shdr.sh_type == SHT_SYMTAB) {
+			symbols = data;
+		}
+	}
+
+	/* load programs that need map fixup (relocations) */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+		if (shdr.sh_type == SHT_REL) {
+			struct bpf_insn *insns;
+
+			if (get_sec(elf, shdr.sh_info, &ehdr, &shname_prog,
+				    &shdr_prog, &data_prog))
+				continue;
+
+			insns = (struct bpf_insn *) data_prog->d_buf;
+
+			processed_sec[shdr.sh_info] = true;
+			processed_sec[i] = true;
+
+			if (parse_relo_and_apply(data, symbols, &shdr, insns))
+				continue;
+
+			if (memcmp(shname_prog, "events/", 7) == 0 ||
+			    memcmp(shname_prog, "socket", 6) == 0)
+				load_and_attach(shname_prog, insns, data_prog->d_size);
+		}
+	}
+
+	/* load programs that don't use maps */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (processed_sec[i])
+			continue;
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (memcmp(shname, "events/", 7) == 0 ||
+		    memcmp(shname, "socket", 6) == 0)
+			load_and_attach(shname, data->d_buf, data->d_size);
+	}
+
+	close(fd);
+	return 0;
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
new file mode 100644
index 000000000000..27789a34f5e6
--- /dev/null
+++ b/samples/bpf/bpf_load.h
@@ -0,0 +1,24 @@
+#ifndef __BPF_LOAD_H
+#define __BPF_LOAD_H
+
+#define MAX_MAPS 32
+#define MAX_PROGS 32
+
+extern int map_fd[MAX_MAPS];
+extern int prog_fd[MAX_PROGS];
+
+/* parses elf file compiled by llvm .c->.o
+ * . parses 'maps' section and creates maps via BPF syscall
+ * . parses 'license' section and passes it to syscall
+ * . parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns by
+ *   storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
+ * . loads eBPF programs via BPF syscall
+ *
+ * One ELF file can contain multiple BPF programs which will be loaded
+ * and their FDs stored stored in prog_fd array
+ *
+ * returns zero on success
+ */
+int load_bpf_file(char *path);
+
+#endif
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH v2 net-next 3/6] samples: bpf: example of stateful socket filtering
From: Alexei Starovoitov @ 2014-12-01 23:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1417475199-15950-1-git-send-email-ast@plumgrid.com>

this socket filter example does:
- creates arraymap in kernel with key 4 bytes and value 8 bytes

- loads eBPF program which assumes that packet is IPv4 and loads one byte of
  IP->proto from the packet and uses it as a key in a map

  r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)];
  *(u32*)(fp - 4) = r0;
  value = bpf_map_lookup_elem(map_fd, fp - 4);
  if (value)
       (*(u64*)value) += 1;

- attaches this program to raw socket

- every second user space reads map[IPPROTO_TCP], map[IPPROTO_UDP], map[IPPROTO_ICMP]
  to see how many packets of given protocol were seen on loopback interface

Usage:
$sudo samples/bpf/sock_example
TCP 0 UDP 0 ICMP 0 packets
TCP 187600 UDP 0 ICMP 4 packets
TCP 376504 UDP 0 ICMP 8 packets
TCP 563116 UDP 0 ICMP 12 packets
TCP 753144 UDP 0 ICMP 16 packets

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
V1->V2:

fixed comments and replaced constants in:
BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),
with:
BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),

 samples/bpf/Makefile       |    2 +
 samples/bpf/libbpf.c       |   28 ++++++++++++
 samples/bpf/libbpf.h       |   13 ++++++
 samples/bpf/sock_example.c |  101 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 144 insertions(+)
 create mode 100644 samples/bpf/sock_example.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 0718d9ce4619..f46d3492d032 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -3,9 +3,11 @@ obj- := dummy.o
 
 # List of programs to build
 hostprogs-y := test_verifier test_maps
+hostprogs-y += sock_example
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
+sock_example-objs := sock_example.o libbpf.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 17bb520eb57f..46d50b7ddf79 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -7,6 +7,10 @@
 #include <linux/netlink.h>
 #include <linux/bpf.h>
 #include <errno.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <linux/if_packet.h>
+#include <arpa/inet.h>
 #include "libbpf.h"
 
 static __u64 ptr_to_u64(void *ptr)
@@ -93,3 +97,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 
 	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
+
+int open_raw_sock(const char *name)
+{
+	struct sockaddr_ll sll;
+	int sock;
+
+	sock = socket(PF_PACKET, SOCK_RAW | SOCK_NONBLOCK | SOCK_CLOEXEC, htons(ETH_P_ALL));
+	if (sock < 0) {
+		printf("cannot create raw socket\n");
+		return -1;
+	}
+
+	memset(&sll, 0, sizeof(sll));
+	sll.sll_family = AF_PACKET;
+	sll.sll_ifindex = if_nametoindex(name);
+	sll.sll_protocol = htons(ETH_P_ALL);
+	if (bind(sock, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
+		printf("bind to %s: %s\n", name, strerror(errno));
+		close(sock);
+		return -1;
+	}
+
+	return sock;
+}
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index f8678e5f48bf..cc62ad4d95de 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -99,6 +99,16 @@ extern char bpf_log_buf[LOG_BUF_SIZE];
 	BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
 
 
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
 /* Memory load, dst_reg = *(uint *) (src_reg + off16) */
 
 #define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
@@ -169,4 +179,7 @@ extern char bpf_log_buf[LOG_BUF_SIZE];
 		.off   = 0,					\
 		.imm   = 0 })
 
+/* create RAW socket and bind to interface 'name' */
+int open_raw_sock(const char *name);
+
 #endif
diff --git a/samples/bpf/sock_example.c b/samples/bpf/sock_example.c
new file mode 100644
index 000000000000..c8ad0404416f
--- /dev/null
+++ b/samples/bpf/sock_example.c
@@ -0,0 +1,101 @@
+/* eBPF example program:
+ * - creates arraymap in kernel with key 4 bytes and value 8 bytes
+ *
+ * - loads eBPF program:
+ *   r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)];
+ *   *(u32*)(fp - 4) = r0;
+ *   // assuming packet is IPv4, lookup ip->proto in a map
+ *   value = bpf_map_lookup_elem(map_fd, fp - 4);
+ *   if (value)
+ *        (*(u64*)value) += 1;
+ *
+ * - attaches this program to eth0 raw socket
+ *
+ * - every second user space reads map[tcp], map[udp], map[icmp] to see
+ *   how many packets of given protocol were seen on eth0
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <assert.h>
+#include <linux/bpf.h>
+#include <string.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <stddef.h>
+#include "libbpf.h"
+
+static int test_sock(void)
+{
+	int sock = -1, map_fd, prog_fd, i, key;
+	long long value = 0, tcp_cnt, udp_cnt, icmp_cnt;
+
+	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value),
+				256);
+	if (map_fd < 0) {
+		printf("failed to create map '%s'\n", strerror(errno));
+		goto cleanup;
+	}
+
+	struct bpf_insn prog[] = {
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+		BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
+		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
+		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
+		BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+		BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+		BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
+		BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* xadd r0 += r1 */
+		BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
+		BPF_EXIT_INSN(),
+	};
+
+	prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, sizeof(prog),
+				"GPL");
+	if (prog_fd < 0) {
+		printf("failed to load prog '%s'\n", strerror(errno));
+		goto cleanup;
+	}
+
+	sock = open_raw_sock("lo");
+
+	if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
+		       sizeof(prog_fd)) < 0) {
+		printf("setsockopt %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (i = 0; i < 10; i++) {
+		key = IPPROTO_TCP;
+		assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0);
+
+		key = IPPROTO_UDP;
+		assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0);
+
+		key = IPPROTO_ICMP;
+		assert(bpf_lookup_elem(map_fd, &key, &icmp_cnt) == 0);
+
+		printf("TCP %lld UDP %lld ICMP %lld packets\n",
+		       tcp_cnt, udp_cnt, icmp_cnt);
+		sleep(1);
+	}
+
+cleanup:
+	/* maps, programs, raw sockets will auto cleanup on process exit */
+	return 0;
+}
+
+int main(void)
+{
+	FILE *f;
+
+	f = popen("ping -c5 localhost", "r");
+	(void)f;
+
+	return test_sock();
+}
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH v2 net-next 2/6] net: sock: allow eBPF programs to be attached to sockets
From: Alexei Starovoitov @ 2014-12-01 23:06 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1417475199-15950-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

introduce new setsockopt() command:

setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))

where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER

setsockopt() calls bpf_prog_get() which increments refcnt of the program,
so it doesn't get unloaded while socket is using the program.

The same eBPF program can be attached to multiple sockets.

User task exit automatically closes socket which calls sk_filter_uncharge()
which decrements refcnt of eBPF program

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
v2: no changes

Note, I'm not happy about 'ifdef', but 'select or depend BPF_SYSCALL' will
make tinification folks cringe, so use ifdef until native eBPF use cases
become widespread.

 arch/alpha/include/uapi/asm/socket.h   |    3 +
 arch/avr32/include/uapi/asm/socket.h   |    3 +
 arch/cris/include/uapi/asm/socket.h    |    3 +
 arch/frv/include/uapi/asm/socket.h     |    3 +
 arch/ia64/include/uapi/asm/socket.h    |    3 +
 arch/m32r/include/uapi/asm/socket.h    |    3 +
 arch/mips/include/uapi/asm/socket.h    |    3 +
 arch/mn10300/include/uapi/asm/socket.h |    3 +
 arch/parisc/include/uapi/asm/socket.h  |    3 +
 arch/powerpc/include/uapi/asm/socket.h |    3 +
 arch/s390/include/uapi/asm/socket.h    |    3 +
 arch/sparc/include/uapi/asm/socket.h   |    3 +
 arch/xtensa/include/uapi/asm/socket.h  |    3 +
 include/linux/bpf.h                    |    4 ++
 include/linux/filter.h                 |    1 +
 include/uapi/asm-generic/socket.h      |    3 +
 net/core/filter.c                      |   97 +++++++++++++++++++++++++++++++-
 net/core/sock.c                        |   13 +++++
 18 files changed, 155 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index e2fe0700b3b4..9a20821b111c 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -89,4 +89,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 92121b0f5b98..2b65ed6b277c 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -82,4 +82,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _UAPI__ASM_AVR32_SOCKET_H */
diff --git a/arch/cris/include/uapi/asm/socket.h b/arch/cris/include/uapi/asm/socket.h
index 60f60f5b9b35..e2503d9f1869 100644
--- a/arch/cris/include/uapi/asm/socket.h
+++ b/arch/cris/include/uapi/asm/socket.h
@@ -84,6 +84,9 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_SOCKET_H */
 
 
diff --git a/arch/frv/include/uapi/asm/socket.h b/arch/frv/include/uapi/asm/socket.h
index 2c6890209ea6..4823ad125578 100644
--- a/arch/frv/include/uapi/asm/socket.h
+++ b/arch/frv/include/uapi/asm/socket.h
@@ -82,5 +82,8 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_SOCKET_H */
 
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index 09a93fb566f6..59be3d87f86d 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -91,4 +91,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/m32r/include/uapi/asm/socket.h b/arch/m32r/include/uapi/asm/socket.h
index e8589819c274..7bc4cb273856 100644
--- a/arch/m32r/include/uapi/asm/socket.h
+++ b/arch/m32r/include/uapi/asm/socket.h
@@ -82,4 +82,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_M32R_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 2e9ee8c55a10..dec3c850f36b 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -100,4 +100,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/mn10300/include/uapi/asm/socket.h b/arch/mn10300/include/uapi/asm/socket.h
index f3492e8c9f70..cab7d6d50051 100644
--- a/arch/mn10300/include/uapi/asm/socket.h
+++ b/arch/mn10300/include/uapi/asm/socket.h
@@ -82,4 +82,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 7984a1cab3da..a5cd40cd8ee1 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -81,4 +81,7 @@
 
 #define SO_INCOMING_CPU		0x402A
 
+#define SO_ATTACH_BPF		0x402B
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index 3474e4ef166d..c046666038f8 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -89,4 +89,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif	/* _ASM_POWERPC_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 8457636c33e1..296942d56e6a 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -88,4 +88,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 4a8003a94163..e6a16c40be5f 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -78,6 +78,9 @@
 
 #define SO_INCOMING_CPU		0x0033
 
+#define SO_ATTACH_BPF		0x0034
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index c46f6a696849..4120af086160 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -93,4 +93,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 75e94eaa228b..bbfceb756452 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -128,7 +128,11 @@ struct bpf_prog_aux {
 	struct work_struct work;
 };
 
+#ifdef CONFIG_BPF_SYSCALL
 void bpf_prog_put(struct bpf_prog *prog);
+#else
+static inline void bpf_prog_put(struct bpf_prog *prog) {}
+#endif
 struct bpf_prog *bpf_prog_get(u32 ufd);
 /* verify correctness of eBPF program */
 int bpf_check(struct bpf_prog *fp, union bpf_attr *attr);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index ca95abd2bed1..caac2087a4d5 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -381,6 +381,7 @@ int bpf_prog_create(struct bpf_prog **pfp, struct sock_fprog_kern *fprog);
 void bpf_prog_destroy(struct bpf_prog *fp);
 
 int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
+int sk_attach_bpf(u32 ufd, struct sock *sk);
 int sk_detach_filter(struct sock *sk);
 
 int bpf_check_classic(const struct sock_filter *filter, unsigned int flen);
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index f541ccefd4ac..5c15c2a5c123 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -84,4 +84,7 @@
 
 #define SO_INCOMING_CPU		49
 
+#define SO_ATTACH_BPF		50
+#define SO_DETACH_BPF		SO_DETACH_FILTER
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index 647b12265e18..8cc3c03078b3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -44,6 +44,7 @@
 #include <linux/ratelimit.h>
 #include <linux/seccomp.h>
 #include <linux/if_vlan.h>
+#include <linux/bpf.h>
 
 /**
  *	sk_filter - run a packet through a socket filter
@@ -813,8 +814,12 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
 
 static void __bpf_prog_release(struct bpf_prog *prog)
 {
-	bpf_release_orig_filter(prog);
-	bpf_prog_free(prog);
+	if (prog->aux->prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
+		bpf_prog_put(prog);
+	} else {
+		bpf_release_orig_filter(prog);
+		bpf_prog_free(prog);
+	}
 }
 
 static void __sk_filter_release(struct sk_filter *fp)
@@ -1088,6 +1093,94 @@ int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(sk_attach_filter);
 
+#ifdef CONFIG_BPF_SYSCALL
+int sk_attach_bpf(u32 ufd, struct sock *sk)
+{
+	struct sk_filter *fp, *old_fp;
+	struct bpf_prog *prog;
+
+	if (sock_flag(sk, SOCK_FILTER_LOCKED))
+		return -EPERM;
+
+	prog = bpf_prog_get(ufd);
+	if (!prog)
+		return -EINVAL;
+
+	if (prog->aux->prog_type != BPF_PROG_TYPE_SOCKET_FILTER) {
+		/* valid fd, but invalid program type */
+		bpf_prog_put(prog);
+		return -EINVAL;
+	}
+
+	fp = kmalloc(sizeof(*fp), GFP_KERNEL);
+	if (!fp) {
+		bpf_prog_put(prog);
+		return -ENOMEM;
+	}
+	fp->prog = prog;
+
+	atomic_set(&fp->refcnt, 0);
+
+	if (!sk_filter_charge(sk, fp)) {
+		__sk_filter_release(fp);
+		return -ENOMEM;
+	}
+
+	old_fp = rcu_dereference_protected(sk->sk_filter,
+					   sock_owned_by_user(sk));
+	rcu_assign_pointer(sk->sk_filter, fp);
+
+	if (old_fp)
+		sk_filter_uncharge(sk, old_fp);
+
+	return 0;
+}
+
+/* allow socket filters to call
+ * bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
+ */
+static const struct bpf_func_proto *sock_filter_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_map_lookup_elem:
+		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_update_elem:
+		return &bpf_map_update_elem_proto;
+	case BPF_FUNC_map_delete_elem:
+		return &bpf_map_delete_elem_proto;
+	default:
+		return NULL;
+	}
+}
+
+static bool sock_filter_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+	/* skb fields cannot be accessed yet */
+	return false;
+}
+
+static struct bpf_verifier_ops sock_filter_ops = {
+	.get_func_proto = sock_filter_func_proto,
+	.is_valid_access = sock_filter_is_valid_access,
+};
+
+static struct bpf_prog_type_list tl = {
+	.ops = &sock_filter_ops,
+	.type = BPF_PROG_TYPE_SOCKET_FILTER,
+};
+
+static int __init register_sock_filter_ops(void)
+{
+	bpf_register_prog_type(&tl);
+	return 0;
+}
+late_initcall(register_sock_filter_ops);
+#else
+int sk_attach_bpf(u32 ufd, struct sock *sk)
+{
+	return -EOPNOTSUPP;
+}
+#endif
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;
diff --git a/net/core/sock.c b/net/core/sock.c
index 0725cf0cb685..9a56b2000c3f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -888,6 +888,19 @@ set_rcvbuf:
 		}
 		break;
 
+	case SO_ATTACH_BPF:
+		ret = -EINVAL;
+		if (optlen == sizeof(u32)) {
+			u32 ufd;
+
+			ret = -EFAULT;
+			if (copy_from_user(&ufd, optval, sizeof(ufd)))
+				break;
+
+			ret = sk_attach_bpf(ufd, sk);
+		}
+		break;
+
 	case SO_DETACH_FILTER:
 		ret = sk_detach_filter(sk);
 		break;
-- 
1.7.9.5

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox