* Re: [PATCH V2] vhost: fix check for # of outstanding buffers
From: Michael S. Tsirkin @ 2011-07-21 8:06 UTC (permalink / raw)
To: Shirley Ma; +Cc: David Miller, netdev, jasowang
In-Reply-To: <1311182592.8573.45.camel@localhost.localdomain>
On Wed, Jul 20, 2011 at 10:23:12AM -0700, Shirley Ma wrote:
> Fix the check for number of outstanding buffers returns incorrect
> results due to vq->pend_idx wrap around;
>
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
OK, the logic's right now, and it's not worse
than what we had, so I applied this after
fixing up the comment (it's upend_idx, and English
sentences don't need to end with a semicolon ;)
However, I would like to see the effect of the bug
noted in the log in the future.
And the reason I mention this here, is that
I think that the whole VHOST_MAX_PEND thing
does not work as advertised: this logic only
triggers when the ring is empty, so we will happily push
more than VHOST_MAX_PEND packets if the guest manages
to give them to us.
I'm not sure why we have the limit, either: the wmem
limit in the socket still applies and seems more
effective to prevent denial of service by a malicious guest.
> ---
>
> drivers/vhost/net.c | 12 +++++++++---
> 1 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 70ac604..946a71e 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -182,15 +182,21 @@ static void handle_tx(struct vhost_net *net)
> break;
> /* Nothing new? Wait for eventfd to tell us they refilled. */
> if (head == vq->num) {
> + int num_pends;
> +
> wmem = atomic_read(&sock->sk->sk_wmem_alloc);
> if (wmem >= sock->sk->sk_sndbuf * 3 / 4) {
> tx_poll_start(net, sock);
> set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> break;
> }
> - /* If more outstanding DMAs, queue the work */
> - if (unlikely(vq->upend_idx - vq->done_idx >
> - VHOST_MAX_PEND)) {
> + /* If more outstanding DMAs, queue the work
> + * handle upend_idx wrap around
> + */
> + num_pends = (vq->upend_idx >= vq->done_idx) ?
> + (vq->upend_idx - vq->done_idx) :
> + (vq->upend_idx + UIO_MAXIOV - vq->done_idx);
> + if (unlikely(num_pends > VHOST_MAX_PEND)) {
> tx_poll_start(net, sock);
> set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> break;
>
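
The wrap-around arithmetic in the patch above can be tried outside the
kernel. This is only a minimal sketch, assuming both indices always stay
in [0, ring size). RING_SIZE and num_pending() are illustrative names
standing in for UIO_MAXIOV and the inline expression; they are not kernel
symbols, and the value 1024 is chosen for the example.

```c
#include <assert.h>

/* Illustrative ring size; stands in for UIO_MAXIOV in the patch. */
#define RING_SIZE 1024u

/*
 * Number of entries between done_idx and upend_idx, where both indices
 * live in [0, RING_SIZE) and upend_idx may already have wrapped around.
 */
static unsigned int num_pending(unsigned int upend_idx, unsigned int done_idx)
{
	assert(upend_idx < RING_SIZE && done_idx < RING_SIZE);

	return (upend_idx >= done_idx) ?
		(upend_idx - done_idx) :
		(upend_idx + RING_SIZE - done_idx);
}
```

With upend_idx = 3 and done_idx = 1020 this gives 7 outstanding entries,
whereas a plain subtraction gives 3 - 1020, which cannot compare as
"greater than the limit" if the indices are signed.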
^ permalink raw reply
* Re: [PATCH net-next-2.6] ipv6: make fragment identifications less predictable
From: Fernando Gont @ 2011-07-21 1:32 UTC (permalink / raw)
To: Eric Dumazet; +Cc: David Miller, security, Eugene Teo, netdev, Matt Mackall
In-Reply-To: <1311157648.2338.22.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>
On 07/20/2011 07:27 AM, Eric Dumazet wrote:
> Le mercredi 20 juillet 2011 à 10:25 +0200, Eric Dumazet a écrit :
>
>> Please hold on, I'll make a different patch series to ease stable teams
>> job. It appears inetpeer & ipv6 are really not an option for old
>> kernels.
>>
>> Common patch for all kernels :
>> 1) Fix the problem without inetpeer help
>> ---
>> Patches for next kernels
>> 2) random split as suggested by Matt Mackall
>> 3) Use inetpeer cache to scale identification generation
>
> Here is the first patch, applicable on net-2.6 / linux-2.6 and stable
> kernels.
Does it make sense to go in this direction rather than simply randomize
the IPv6 Fragment Identification?
Keep in mind that IPv6 routers don't perform fragmentation, and that
the IPv6 Identification field is 32 bits long.
Thanks,
--
Fernando Gont
e-mail: fernando@gont.com.ar || fgont@acm.org
PGP Fingerprint: 7809 84F5 322E 45C7 F1C9 3945 96EE A9EF D076 FFF1
^ permalink raw reply
* Re: [patch net-next-2.6 37/47] igb: do vlan cleanup
From: Jiri Pirko @ 2011-07-21 6:57 UTC (permalink / raw)
To: Jesse Gross
Cc: netdev, davem, shemminger, eric.dumazet, greearb, mirqus,
jeffrey.t.kirsher, jesse.brandeburg, peter.p.waskiewicz.jr,
bruce.w.allan, carolyn.wyborny, donald.c.skidmore, gregory.v.rose,
alexander.h.duyck, john.ronciak, e1000-devel
In-Reply-To: <CAEP_g=9j3=s74_VQ6RQxVRGOOs-mVR94s31ETXJ1S5P7aQV4iQ@mail.gmail.com>
Thu, Jul 21, 2011 at 01:58:10AM CEST, jesse@nicira.com wrote:
>On Wed, Jul 20, 2011 at 12:10 PM, Jiri Pirko <jpirko@redhat.com> wrote:
>> Wed, Jul 20, 2011 at 07:35:33PM CEST, jesse@nicira.com wrote:
>>>On Wed, Jul 20, 2011 at 7:54 AM, Jiri Pirko <jpirko@redhat.com> wrote:
>>>> @@ -2943,7 +2944,7 @@ static void igb_rlpml_set(struct igb_adapter *adapter)
>>>> struct e1000_hw *hw = &adapter->hw;
>>>> u16 pf_id = adapter->vfs_allocated_count;
>>>>
>>>> - if (adapter->vlgrp)
>>>> + if (igb_vlan_used(adapter))
>>>> max_frame_size += VLAN_TAG_SIZE;
>>>
>>>There are similar issues here as with the VF driver. I think you're
>>>also confusing vlan acceleration with vlan filtering. If no vlan
>>>filters are in use but the card is in promiscuous mode, the buffer
>>>will be undersized and we lose tagged packets.
>>
>> I'm certainly not confusing vlan accel and filtering. Here the
>> intention is that the behaviour remains intact as well. I believe it's true.
>
>I believe the underlying issue for all three of these threads is the
>same, so I'll just respond to them all here.
>
>I agree that this doesn't change the behavior of the driver but I
>don't think that should be the goal. When I originally designed this
>new vlan model my intention was to eliminate a whole class of driver
>bugs that I was repeatedly hitting in various forms. In the example
>above, if you run tcpdump on this device without configuring a vlan
>group on it then you will see that MTU sized packets are missing
>because the receive buffer was undersized.
>
>The common theme for these problems is that they all occur in
>situations where vlans are not configured on the device and the driver
>does something different as a result of this. The solution was to
>prevent drivers from changing their behavior in such situations by
>completely removing the concept of a vlan group from them and letting
>the networking core tell them when to make the changes instead of
>doing it implicitly. That's why I don't see the fact that this change
>essentially emulates the knowledge of configuring a group to be a
>plus. By the way, plenty of your other patches change the behavior of
>the drivers - on any of the NICs that always enable stripping, try
>running tcpdump on the interface without configuring a vlan group.
>Before the change you will see that tags have disappeared and
>afterwards the tags are intact. So I think that changing the behavior
>of drivers in this regard is a positive thing.
>
>As an aside, thank you for taking the time to work on all of these
>drivers. The only reason why I'm complaining about these few drivers
>is because I'd like to close the door on this class of problems, which
>is finally in reach thanks to your work.
Okay, now it's clear to me. I tried to keep the code as similar as
possible to the unpatched version, but I see your arguments. I will review
and repost the patches which enable/disable vlan accel on add_vid/kill_vid
and convert them to set_features.
Thanks, Jesse.
Jirka
^ permalink raw reply
* Re: [PATCH 1/2] igb: Allow extra 4 bytes on RX for vlan tags.
From: Alexander Duyck @ 2011-07-21 6:35 UTC (permalink / raw)
To: jeffrey.t.kirsher
Cc: Ben Greear, Jesse Gross, netdev@vger.kernel.org,
Duyck, Alexander H
In-Reply-To: <1311211304.2401.9.camel@jtkirshe-mobl>
On Wed, Jul 20, 2011 at 6:21 PM, Jeff Kirsher
<jeffrey.t.kirsher@intel.com> wrote:
> On Wed, 2011-07-20 at 17:27 -0700, Ben Greear wrote:
>> On 07/20/2011 05:18 PM, Jesse Gross wrote:
>> > On Thu, Feb 17, 2011 at 9:28 AM, Ben Greear<greearb@candelatech.com> wrote:
>> >> On 02/17/2011 03:04 AM, Jeff Kirsher wrote:
>> >>>
>> >>> On Thu, Feb 10, 2011 at 13:59,<greearb@candelatech.com> wrote:
>> >>>>
>> >>>> From: Ben Greear<greearb@candelatech.com>
>> >>>>
>> >>>> This allows the NIC to receive 1518 byte (not counting
>> >>>> FCS) packets when MTU is 1500, thus allowing 1500 MTU
>> >>>> VLAN frames to be received. Please note that no VLANs
>> >>>> were actually configured on the NIC...it was just acting
>> >>>> as pass-through device.
>> >>>>
>> >>>> Signed-off-by: Ben Greear<greearb@candelatech.com>
>> >>>> ---
>> >>>> :100644 100644 58c665b... 30c9cc6... M drivers/net/igb/igb_main.c
>> >>>> drivers/net/igb/igb_main.c | 5 +++--
>> >>>> 1 files changed, 3 insertions(+), 2 deletions(-)
>> >>>>
>> >>>> diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
>> >>>> index 58c665b..30c9cc6 100644
>> >>>> --- a/drivers/net/igb/igb_main.c
>> >>>> +++ b/drivers/net/igb/igb_main.c
>> >>>> @@ -2281,7 +2281,8 @@ static int __devinit igb_sw_init(struct igb_adapter
>> >>>> *adapter)
>> >>>> adapter->rx_itr_setting = IGB_DEFAULT_ITR;
>> >>>> adapter->tx_itr_setting = IGB_DEFAULT_ITR;
>> >>>>
>> >>>> - adapter->max_frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN;
>> >>>> + adapter->max_frame_size = (netdev->mtu + ETH_HLEN + ETH_FCS_LEN
>> >>>> + + VLAN_HLEN);
>> >>>> adapter->min_frame_size = ETH_ZLEN + ETH_FCS_LEN;
>> >>>>
>> >>>> spin_lock_init(&adapter->stats64_lock);
>> >>>> @@ -4303,7 +4304,7 @@ static int igb_change_mtu(struct net_device
>> >>>> *netdev, int new_mtu)
>> >>>> {
>> >>>> struct igb_adapter *adapter = netdev_priv(netdev);
>> >>>> struct pci_dev *pdev = adapter->pdev;
>> >>>> - int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN;
>> >>>> + int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>> >>>> u32 rx_buffer_len, i;
>> >>>>
>> >>>> if ((new_mtu< 68) || (max_frame> MAX_JUMBO_FRAME_SIZE)) {
>> >>>
>> >>> While testing this patch, validation found that the patch reduces the
>> >>> maximum mtu size
>> >>> by 4 bytes (reduces it from 9216 to 9212). This is not a desired side
>> >>> effect of this patch.
>> >>
>> >> You could add handling for that case and have it act as it used to when
>> >> new_mtu is greater than 9212?
>> >>
>> >> I tested e1000e and it worked w/out hacking at 1500 MTU, so maybe
>> >> check how it does it?
>> >
>> > I just wanted to bring this up again to see if any progress had been
>> > made. We were looking at this driver and trying to figure out the
>> > best way to convert it to use the new vlan model but I'm not familiar
>>
>> I've been watching :)
>>
>> > enough with the hardware to know. It seems that all of the other
>> > Intel drivers unconditionally add space for the vlan tag to the
>> > receive buffer (and would therefore have similar effects as this
>> > patch), is there something different about this card?
>> >
>> > I believe that Alex was working on something in this area (in the
>> > context of one of my patches from a long time ago) but I'm not sure
>> > what came of that.
>>
>> Truth is, I don't really see why it's a problem to decrease the
>> maximum MTU slightly in order to make it work with VLANs.
>>
>> I'm not sure if there is some way to make it work with VLANs
>> and not decrease the maximum MTU.
>
> This was the reason this did not get accepted. I was looking into what
> could be done so that we did not decrease the maximum MTU, but I got
> side-tracked and have not done anything on it in several months.
>
I can take a look at fixing this most likely tomorrow. I have some
work planned for igb anyway over the next few days.
Odds are it is just a matter of where the VLAN_HLEN is added. As I
recall for our drivers the correct spot is in the setting of
rx_buffer_len since that is the area more concerned with maximum
receive frame size versus the mtu section which is more concerned with
the transmit side of things.
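
To make the idea concrete, here is a rough sketch of the arithmetic, not
the actual igb change. It assumes the MTU check keeps using the untagged
frame size while only the receive buffer length gets the extra VLAN_HLEN.
MAX_FRAME_LIMIT is a hypothetical constant derived from the 9216 byte
maximum MTU mentioned in this thread, and the helper names are made up
for illustration. Whether the hardware accepts the larger receive frame
is not something this sketch can confirm.

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HLEN	14	/* Ethernet header length */
#define ETH_FCS_LEN	4	/* frame check sequence */
#define VLAN_HLEN	4	/* one 802.1Q tag */

/* Hypothetical limit: chosen so that an MTU of 9216 is still accepted. */
#define MAX_FRAME_LIMIT	(9216 + ETH_HLEN + ETH_FCS_LEN)

/* Untagged frame size for a given MTU; this is what the MTU check sees. */
static int max_frame_for_mtu(int mtu)
{
	return mtu + ETH_HLEN + ETH_FCS_LEN;
}

/* MTU validation that does not count the VLAN tag. */
static bool mtu_is_valid(int mtu)
{
	return mtu >= 68 && max_frame_for_mtu(mtu) <= MAX_FRAME_LIMIT;
}

/* Receive buffer length: always leave room for one VLAN tag. */
static int rx_buffer_len_for_mtu(int mtu)
{
	return max_frame_for_mtu(mtu) + VLAN_HLEN;
}

/* Usage example with the numbers from this thread. */
static void check_thread_numbers(void)
{
	assert(mtu_is_valid(9216));			/* max MTU not reduced */
	assert(rx_buffer_len_for_mtu(1500) == 1522);	/* tagged frame + FCS */
}
```

Adding VLAN_HLEN inside max_frame_for_mtu() instead would reject MTU 9216
and cap the MTU at 9212, which is the side effect validation reported.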
Thanks,
Alex
^ permalink raw reply
* [PATCH] net: Kobj and queues_kset should be used when CONFIG_XPS is enabled
From: jhbird.choi @ 2011-07-21 6:33 UTC (permalink / raw)
To: netdev, linux-kernel; +Cc: David S. Miller, Choi, Jong-Hwan
From: Choi, Jong-Hwan <jhbird.choi@samsung.com>
Kobj and queues_kset are used with CONFIG_XPS=y.
Signed-off-by: Choi, Jong-Hwan <jhbird.choi@samsung.com>
---
include/linux/netdevice.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9e19477..8eb2f11 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -556,7 +556,7 @@ struct netdev_queue {
struct Qdisc *qdisc;
unsigned long state;
struct Qdisc *qdisc_sleeping;
-#ifdef CONFIG_RPS
+#if defined(CONFIG_RPS) || defined(CONFIG_XPS)
struct kobject kobj;
#endif
#if defined(CONFIG_XPS) && defined(CONFIG_NUMA)
@@ -1214,7 +1214,7 @@ struct net_device {
unsigned char broadcast[MAX_ADDR_LEN]; /* hw bcast add */
-#ifdef CONFIG_RPS
+#if defined(CONFIG_RPS) || defined(CONFIG_XPS)
struct kset *queues_kset;
struct netdev_rx_queue *_rx;
--
1.7.1
^ permalink raw reply related
* Re: ipvs oops in 3.0-rc7
From: Simon Horman @ 2011-07-21 5:40 UTC (permalink / raw)
To: Randy Dunlap; +Cc: netdev, lvs-devel, Wensong Zhang, Julian Anastasov
In-Reply-To: <20110720205019.9dfa30c3.rdunlap@xenotime.net>
On Wed, Jul 20, 2011 at 08:50:19PM -0700, Randy Dunlap wrote:
> I'm seeing the following Oops in 3.0-rc7 on x86_64, just loading and unloading
> modules. Any chance this is already fixed? I can test current git, but I
> wanted to ask first.
>
> Looks like it is on the second module load of ip_vs (i.e.,
> modprobe ip_vs; rmmod ip_vs; modprobe ip_vs).
Hi Randy,
I don't believe that this problem has been resolved (or observed before).
^ permalink raw reply
* Re: IPv6: autoconfiguration and suspend/resume or link down/up
From: Dan Williams @ 2011-07-21 5:30 UTC (permalink / raw)
To: Jiri Bohac; +Cc: netdev, Herbert Xu, David S. Miller, stephen hemminger
In-Reply-To: <20110720163656.GD11692@midget.suse.cz>
On Wed, 2011-07-20 at 18:36 +0200, Jiri Bohac wrote:
> On Wed, Jul 20, 2011 at 11:21:43AM -0500, Dan Williams wrote:
> > ... and in the resume handler use that value to age anything
> > that needs to know about time spent in suspend, and then do what needs
> > to be done with that. So something like that may work for IPv6
> > addrconf; on suspend save current time, and on resume check the current
> > time, subtract the time you saved on suspend, and magically add that to
> > the lifetime counts and then run any expiry stuff.
>
> IPv6 (by specification) does not send any RS when an IP address
> or route expires. So only subtracting the suspend time from the
> lifetimes and possibly expiring the routes/IP addresses won't fix
> the problem.
Well, the prefix option of the RA includes the Valid Lifetime (in
seconds, no less) so I'd assume the kernel starts a timer when it
receives the RA and updates any addresses configured as a result of
receiving that RA+prefix, such that when the timer expires, the
autoconfigured address is deleted. That timer can be used as a base for
the expiry mechanism that I've noted above, no? This fixes problem #1
from your first mail.
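
A rough sketch of the aging step I have in mind, in plain C rather than
kernel code. The struct and function names are hypothetical, times are
plain seconds, and the only protocol detail relied on is that an all-ones
Valid Lifetime means "infinite". How addrconf actually stores lifetimes
is not assumed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-address state: remaining valid lifetime in seconds. */
struct addr_lifetime {
	uint32_t valid_lft;	/* UINT32_MAX means infinite */
	bool expired;
};

/*
 * Charge the time spent in suspend against the remaining lifetime.
 * suspend_time was saved on suspend, resume_time is read on resume.
 */
static void age_after_resume(struct addr_lifetime *lt,
			     uint64_t suspend_time, uint64_t resume_time)
{
	uint64_t slept = 0;

	if (resume_time > suspend_time)
		slept = resume_time - suspend_time;

	if (lt->valid_lft == UINT32_MAX)
		return;			/* infinite lifetime never ages */

	if (slept >= lt->valid_lft) {
		lt->valid_lft = 0;
		lt->expired = true;	/* caller would delete the address */
	} else {
		lt->valid_lft -= (uint32_t)slept;
	}
}

/* Usage example: one hour left, ten minutes spent in suspend. */
static void check_aging_example(void)
{
	struct addr_lifetime lt = { 3600, false };

	age_after_resume(&lt, 100, 700);
	assert(lt.valid_lft == 3000 && !lt.expired);
}
```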
For problem #2, shouldn't a new RS be sent whenever the interface
changes its IFF_LOWER_UP bit? IFF_LOWER_UP indicates a carrier on/off
event and thus indicates possible disconnect/reconnect to a new network.
I don't specifically know how it works now, but if RS isn't triggered
from IFF_LOWER_UP, I'd imagine that either (a) something didn't get
updated when IFF_LOWER_UP became how carrier was indicated in 2.6.17
(commit b00055aacdb172c05067612278ba27265fcd05ce) or (b) there's a
reason IFF_LOWER_UP isn't used as the trigger for sending an RS and I'm
not qualified to say why.
Dan
> When I move to a new network, I need to restart the
> autoconfiguration. This does not currently happen - neither for
> an alive system where the ethernet link goes down/up, nor for a
> system that gets suspended, moved and then resumed.
>
^ permalink raw reply
* Re: [PATCH v3] net: filter: BPF 'JIT' compiler for PPC64
From: Eric Dumazet @ 2011-07-21 5:00 UTC (permalink / raw)
To: Matt Evans; +Cc: netdev, linuxppc-dev
In-Reply-To: <4E278604.5080605@ozlabs.org>
Le jeudi 21 juillet 2011 à 11:51 +1000, Matt Evans a écrit :
> An implementation of a code generator for BPF programs to speed up packet
> filtering on PPC64, inspired by Eric Dumazet's x86-64 version.
>
> Filter code is generated as an ABI-compliant function in module_alloc()'d mem
> with stackframe & prologue/epilogue generated if required (simple filters don't
> need anything more than an li/blr). The filter's local variables, M[], live in
> registers. Supports all BPF opcodes, although "complicated" loads from negative
> packet offsets (e.g. SKF_LL_OFF) are not yet supported.
>
> There are a couple of further optimisations left for future work; many-pass
> assembly with branch-reach reduction and a register allocator to push M[]
> variables into volatile registers would improve the code quality further.
>
> This currently supports big-endian 64-bit PowerPC only (but is fairly simple
> to port to PPC32 or LE!).
>
> Enabled in the same way as x86-64:
>
> echo 1 > /proc/sys/net/core/bpf_jit_enable
>
> Or, enabled with extra debug output:
>
> echo 2 > /proc/sys/net/core/bpf_jit_enable
>
> Signed-off-by: Matt Evans <matt@ozlabs.org>
> ---
>
> V3: Added BUILD_BUG_ON to assert PACA CPU ID is 16bits, made a comment (in
> LD_MSH) a bit clearer, ratelimited "Unknown opcode" error and moved
> bpf_jit.S to bpf_jit_64.S (it doesn't make sense to rename bpf_jit_comp.c as
> small portions will eventually get split out into _32/_64.c files when we do
> 32bit support).
>
> arch/powerpc/Kconfig | 1 +
> arch/powerpc/Makefile | 3 +-
> arch/powerpc/include/asm/ppc-opcode.h | 40 ++
> arch/powerpc/net/Makefile | 4 +
> arch/powerpc/net/bpf_jit.h | 227 +++++++++++
> arch/powerpc/net/bpf_jit_64.S | 138 +++++++
> arch/powerpc/net/bpf_jit_comp.c | 694 +++++++++++++++++++++++++++++++++
> 7 files changed, 1106 insertions(+), 1 deletions(-)
Nice work Matt ;)
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Thanks
^ permalink raw reply
* RE: Bridging behavior apparently changed around the Fedora 14 time
From: Greg Scott @ 2011-07-21 4:40 UTC (permalink / raw)
To: Greg Scott, David Lamparter
Cc: netdev, Lynn Hanson, Joe Whalen, Graham Parenteau
In-Reply-To: <925A849792280C4E80C5461017A4B8A2A040FB@mail733.InfraSupportEtc.com>
Aw nuts, nothing is ever straightforward.
When I do:
ip link set br0 promisc on
My internal users can see the internally hosted websites using the
public IP Addresses. The router on a stick rules I put in work just
fine. (In on br0/eth1, DNATed in PREROUTING, MASQUERADEd in
POSTROUTING, back out br0/eth1 to the correct internal host.)
However, I just learned tonight, this breaks both inbound and outbound
PPTP VPNs. And when I do:
ip link set br0 promisc off
now my PPTP VPNs work, but this breaks my above router on a stick rules.
My PPTP VPN stuff uses the GRE iptables conntrack modules,
ip_conntrack_pptp and ip_nat_pptp, and some PREROUTING and POSTROUTING
rules to DNAT TCP 1723 and all GRE packets to an internal Windows RRAS
server. But when I turn promisc on for br0, I see a storm of packets
looping over and over again, until the remote client finally times out
after what seems like an eternity.
I'll bet that bridge forwards my packets out the wrong physical ethnn
interface when it's in promisc mode and that's why my NATed PPTP VPN
breaks. Which makes me wonder if putting br0 in promisc mode breaks any
of my other NATed services.
- Greg Scott
-----Original Message-----
From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On Behalf Of Greg Scott
Sent: Tuesday, July 12, 2011 11:29 AM
To: David Lamparter
Cc: netdev@vger.kernel.org; Lynn Hanson; Joe Whalen
Subject: RE: Bridging behavior apparently changed around the Fedora 14 time
> P.S.: you blissfully ignored my "ip neigh add proxy 1.2.3.4" note :)
Sorry - didn't ignore it, just didn't reply back to it. I'll look into
it. What I've read about this before has all been kind of vague. Does
this mean I proxy ARP only for IP Address 1.2.3.4? So if somebody sends an
ARP who-has 1.2.3.4, I'll answer with 1.2.3.4 is-at {My MAC Address}? If
so, then I agree, not nearly as evil as just setting proxy_arp.
> Whoa. And here I was almost ashamed of running 2.6.38. I'm sorry, but I
> think you need to go bug RedHat.
Yeah, maybe. OK, probably. This was such a bizarre problem - I started
with Netfilter and those guys suggested I try here. At least now I
understand the problem lots better than before. And it's not like I can
just go and update dozens of kernels at dozens of sites all the time
when a new kernel comes out.
> You totally misunderstood me. I'm suggesting the separate VLAN for your
> servers which have private IPs but which have services exposed to the
> internet (and your clients) on public IPs through NAT.
Ahh - OK. The challenge with many small sites is, economic reality.
That same server that hosts the public ftp and websites also hosts all
the internal Windows file/print services. It's the only server at this
site, so it has several roles. I would love to build a real DMZ network
and put all the public facing stuff in there, but I don't have money for
multiple servers. This will become even more difficult to separate when
we go to virtual servers and clustered hosts.
> Your H323 stuff is totally unrelated.
Agreed. Wholeheartedly.
> Yes. Your problem seems to be between the private-IP clients in your
> network and your private-IP servers if I understand correctly.
Yes. Dead-bang, right on target.
> Yes. And because it is a router, it has an IP from the private subnet
> your clients are in. My question was: what device is that IP on?
Ahh - eth1 is the private LAN side, 192.168.10.1. All the NATed LAN
stuff and all the workstations are in the 192.168.10.0/24 subnet and
connected to eth1. Eth0 is the Internet side. The Internet side has
the firewall NIC, a cable, and the Internet router. That's it.
Everything is connected to the LAN side.
> No. You're jumping to conclusions. You're affecting the "top" bridge
> device's promiscuity. I would say that the effect you're seeing is in
> the IP stack above it, caused by it now promiscuously handling packets
> that are dropped otherwise.
Well they were sure dropped before I set it to PROMISC mode, that's for
sure. And it all worked with the earlier version. That's why this feels
like a layer 2 issue. If it was an IP issue, why didn't it break
several years ago when I first set it up?
Does bridging make everything a little more complex and delicate to set
up? Well, yeah. And some of the netfilter stuff has been a moving
target over the years.
I don't see how ICMP redirects matter. Comparing
/proc/sys/net/ipv4/conf/*/accept_redirects with this version and an
older one at another site - all identical. ../all/accept_redirects is
0, the rest are all 1. Shared media and ARP settings -
/proc/sys/net/ipv4/conf/*/shared_media - all 1 for all interfaces.
There are a zillion arp settings. Looking at
/proc/sys/net/ipv4/conf/*/*arp* - all are 0 in both the other older site
and this newer site.
Curiously - at one of my other older sites, apparently br0 is not in
promisc mode. But I don't think these guys do any of the stick routing
stuff. I wonder if these guys have the problem but we don't see it
because they never try it?
[root@NSSSS-fw1 ~]# more /sys/class/net/br0/flags
0x1003
[root@NSSSS-fw1 ~]#
[root@NSSSS-fw1 ~]# more /proc/version
Linux version 2.6.32.11-99.fc12.i686.PAE (mockbuild@x86-05.phx2.fedoraproject.org) (gcc version 4.4.3 20100127 (Red Hat 4.4.3-4) (GCC) ) #1 SMP Mon Apr 5 16:15:03 EDT 2010
[root@NSSSS-fw1 ~]#
[root@NSSSS-fw1 ~]# uname -a
Linux NSSSS-fw1 2.6.32.11-99.fc12.i686.PAE #1 SMP Mon Apr 5 16:15:03 EDT 2010 i686 i686 i386 GNU/Linux
[root@NSSSS-fw1 ~]#
Here is a much older bridged site based on Fedora 9 and I'm sure these
guys use my stick routing stuff. Look at the difference in ..br0/flags.
[root@lme-fw2 ~]# more /sys/class/net/br0/flags
0x1103
[root@lme-fw2 ~]#
[root@lme-fw2 ~]# more /proc/version
Linux version 2.6.25-14.fc9.i686 (mockbuild@) (gcc version 4.3.0 20080428 (Red Hat 4.3.0-8) (GCC) ) #1 SMP Thu May 1 06:28:41 EDT 2008
[root@lme-fw2 ~]#
[root@lme-fw2 ~]# uname -a
Linux lme-fw2 2.6.25-14.fc9.i686 #1 SMP Thu May 1 06:28:41 EDT 2008 i686 i686 i386 GNU/Linux
I can still get my hands on the old box at the site in question. I
guess it couldn't hurt to fire it up and look at its br0 flags.
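
For what it's worth, the two flags values differ in exactly one bit, and
that bit is the promiscuous flag. A small sketch that decodes them; the
FLAG_* names are made up for this example, with values copied from the
kernel's IFF_UP, IFF_BROADCAST, IFF_PROMISC and IFF_MULTICAST bits.

```c
#include <assert.h>
#include <stdbool.h>

/* Same bit values as IFF_UP, IFF_BROADCAST, IFF_PROMISC, IFF_MULTICAST. */
#define FLAG_UP		0x1u
#define FLAG_BROADCAST	0x2u
#define FLAG_PROMISC	0x100u
#define FLAG_MULTICAST	0x1000u

/* True if a /sys/class/net/<dev>/flags value has the promiscuous bit set. */
static bool is_promisc(unsigned int flags)
{
	return (flags & FLAG_PROMISC) != 0;
}

/* Bits that differ between two flags values. */
static unsigned int flags_diff(unsigned int a, unsigned int b)
{
	return a ^ b;
}

/* Usage example with the two values shown above. */
static void check_br0_flags(void)
{
	assert(!is_promisc(0x1003));	/* Fedora 12 site */
	assert(is_promisc(0x1103));	/* Fedora 9 site */
	assert(flags_diff(0x1003, 0x1103) == FLAG_PROMISC);
}
```

So 0x1003 is UP, BROADCAST and MULTICAST, and 0x1103 is the same set plus
PROMISC.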
- Greg
^ permalink raw reply
* RE: [Pv-drivers] [PATCH net-next] vmxnet3: set netdev parant device before calling netdev_info
From: Shreyas Bhatewara @ 2011-07-21 4:05 UTC (permalink / raw)
To: Joe Perches, Scott Goldman; +Cc: netdev@vger.kernel.org, pv-drivers@vmware.com
In-Reply-To: <1311220141.1663.24.camel@Joe-Laptop>
> -----Original Message-----
> From: Joe Perches [mailto:joe@perches.com]
> Sent: Wednesday, July 20, 2011 8:49 PM
> To: Scott Goldman
> Cc: Shreyas Bhatewara; netdev@vger.kernel.org; pv-drivers@vmware.com
> Subject: RE: [Pv-drivers] [PATCH net-next] vmxnet3: set netdev parant
> device before calling netdev_info
>
> On Wed, 2011-07-20 at 20:06 -0700, Scott Goldman wrote:
> > > Parent device for netdev should be set before netdev_info() can be called
> > > otherwise there is a NULL pointer dereference and probe() fails.
>
> I believe this is not true.
> I don't see any NULL pointer dereference here.
> functions and macros reordered top to bottom.
>
Thanks for looking Joe.
This happened in 2.6.36. I saw the panic in 2.6.34 and assumed that it would be the same
in the latest kernel.
It would not panic in 3.0, but it is still good to have the parent device set early.
^ permalink raw reply
* ipvs oops in 3.0-rc7
From: Randy Dunlap @ 2011-07-21 3:50 UTC (permalink / raw)
To: netdev, lvs-devel; +Cc: Simon Horman, Wensong Zhang, Julian Anastasov
I'm seeing the following Oops in 3.0-rc7 on x86_64, just loading and unloading
modules. Any chance this is already fixed? I can test current git, but I
wanted to ask first.
Looks like it is on the second module load of ip_vs (i.e.,
modprobe ip_vs; rmmod ip_vs; modprobe ip_vs).
Jul 20 17:15:05 chimera kernel: [ 3323.505527] IPVS: ipvs unloaded.
Jul 20 17:15:06 chimera kernel: [ 3324.554297] BUG: unable to handle kernel paging request at ffffffffa1543820
Jul 20 17:15:06 chimera kernel: [ 3324.554382] IP: [<ffffffff810a8d4f>] raw_notifier_chain_register+0x1f/0x4a
Jul 20 17:15:06 chimera kernel: [ 3324.554445] PGD 1872067 PUD 1876063 PMD b653f067 PTE 0
Jul 20 17:15:06 chimera kernel: [ 3324.554505] Oops: 0000 [#1] SMP
Jul 20 17:15:06 chimera kernel: [ 3324.554551] CPU 1
Jul 20 17:15:06 chimera kernel: [ 3324.554574] Modules linked in: ip_vs(+) nf_conntrack_sip nf_tproxy_core xt_RATEEST nf_conntrack_proto_gre nfnetlink_log nfnetlink nf_conntrack_broadcast l2tp_core can rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr atm kernelcapi fcrypt pcbc af_rxrpc xp gru macvtap tun isdnhdlc mISDNipac mISDN_core chipreg map_funcs macvlan ptp pps_core mdio_bitbang hdlcdrv ax25 mdio pppox gre inet_lro cycx_drv wanrouter hdlc lapb uio ppp_generic xenbus_probe_frontend configfs ecb rtl8192c_common ath9k_common ath9k_hw ath libertas atmel rt2x00pci rt2x00usb rt2x00lib rng_core orinoco wl12xx crc7 p54common arc4 hostap rndis_host eeprom_93cx6 libipw lib80211 mac80211 cfg80211 fddi crc32c libcrc32c dca com20020 arcnet psnap cdc_ether phonet usbnet sja1000 can_dev sir_dev ird
a crc_ccitt mtd zlib_deflate slhc virtio_ring virtio tr i2400m wimax mii usbserial leds_net5501 fuse af_packet ipt_MASQUERADE iptable_nat nf_nat nfsd lockd nfs_acl auth_rpcgss stp llc bnep bluetooth rfkill crc16 sunrpc ipt_REJEC
Jul 20 17:15:06 chimera kernel: T nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 cpufreq_ondemand acpi_cpufreq freq_table mperf binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod kvm_intel kvm uinput mousedev sr_mod cdrom ppdev snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ide_pci_generic snd_pcm_oss usbmouse ide_core snd_mixer_oss firewire_ohci usbhid snd_pcm firewire_core usb_storage hid usblp ata_generic i2c_i801 sg pcspkr snd_timer pata_acpi usb_libusual iTCO_wdt iTCO_vendor_support uas snd crc_itu_t soundcore pata_marvell snd_page_alloc parport_pc evdev process
or parport mac_hid rtc_cmos unix sd_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ssb mmc_core pcmcia pcmcia_core firmware_class ehci_hcd usbcore i915 drm_kms_helper intel_agp button intel_gtt video thermal_sys hwmon [last unlo
Jul 20 17:15:06 chimera kernel: aded: ip_vs]
Jul 20 17:15:06 chimera kernel: [ 3324.556037]
Jul 20 17:15:06 chimera kernel: [ 3324.556037] Pid: 20884, comm: modprobe Not tainted 3.0.0-rc7 #6 Gateway GT5636E/DG965OT
Jul 20 17:15:06 chimera kernel: [ 3324.556037] RIP: 0010:[<ffffffff810a8d4f>] [<ffffffff810a8d4f>] raw_notifier_chain_register+0x1f/0x4a
Jul 20 17:15:06 chimera kernel: [ 3324.556037] RSP: 0018:ffff8800b5169e88 EFLAGS: 00010202
Jul 20 17:15:06 chimera kernel: [ 3324.556037] RAX: ffffffffa1543810 RBX: ffffffffa18f3810 RCX: 0000000000000000
Jul 20 17:15:06 chimera kernel: [ 3324.556037] RDX: 0000000000000000 RSI: ffffffffa18f3810 RDI: ffffffffa125f9b8
Jul 20 17:15:06 chimera kernel: [ 3324.556037] RBP: ffff8800b5169e88 R08: ffffffff810aa3ee R09: 0000000000000000
Jul 20 17:15:06 chimera kernel: [ 3324.556037] R10: 0000000000000088 R11: ffffffff81b24258 R12: ffffffffa1908155
Jul 20 17:15:06 chimera kernel: [ 3324.556037] R13: 0000000000000000 R14: 000003060ede68d6 R15: 0000000000000000
Jul 20 17:15:06 chimera kernel: [ 3324.556037] FS: 00007f6e2d1856f0(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
Jul 20 17:15:06 chimera kernel: [ 3324.556037] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 20 17:15:06 chimera kernel: [ 3324.556037] CR2: ffffffffa1543820 CR3: 00000000b5241000 CR4: 00000000000006e0
Jul 20 17:15:06 chimera kernel: [ 3324.556037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 20 17:15:06 chimera kernel: [ 3324.556037] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jul 20 17:15:06 chimera kernel: [ 3324.556037] Process modprobe (pid: 20884, threadinfo ffff8800b5168000, task ffff8800b818b000)
Jul 20 17:15:06 chimera kernel: [ 3324.556037] Stack:
Jul 20 17:15:06 chimera kernel: [ 3324.556037] ffff8800b5169ed8 ffffffff814932e7 ffff8800b5169ed8 ffffffff814c5248
Jul 20 17:15:06 chimera kernel: [ 3324.556037] ffff8800b5169eb8 0000000000000000 ffffffffa1908155 0000000000000000
Jul 20 17:15:06 chimera kernel: [ 3324.556037] 000003060ede68d6 0000000000000000 ffff8800b5169ef8 ffffffffa190843f
Jul 20 17:15:06 chimera kernel: [ 3324.556037] Call Trace:
Jul 20 17:15:06 chimera kernel: [ 3324.556037] [<ffffffff814932e7>] register_netdevice_notifier+0x3b/0x24b
Jul 20 17:15:06 chimera kernel: [ 3324.556037] [<ffffffff814c5248>] ? genl_register_family_with_ops+0x50/0x9e
Jul 20 17:15:06 chimera kernel: [ 3324.556037] [<ffffffffa1908155>] ? ip_vs_conn_init+0x155/0x155 [ip_vs]
Jul 20 17:15:06 chimera kernel: [ 3324.556037] [<ffffffffa190843f>] ip_vs_control_init+0xeb/0x132 [ip_vs]
Jul 20 17:15:06 chimera kernel: [ 3324.556037] [<ffffffffa1908155>] ? ip_vs_conn_init+0x155/0x155 [ip_vs]
Jul 20 17:15:06 chimera kernel: [ 3324.556037] [<ffffffffa1908176>] ip_vs_init+0x21/0x1ff [ip_vs]
Jul 20 17:15:06 chimera kernel: [ 3324.556037] [<ffffffffa1908155>] ? ip_vs_conn_init+0x155/0x155 [ip_vs]
Jul 20 17:15:06 chimera kernel: [ 3324.556037] [<ffffffff81002094>] do_one_initcall+0x6c/0x1c5
Jul 20 17:15:06 chimera kernel: [ 3324.556037] [<ffffffff810cd572>] sys_init_module+0xe1/0x2b0
Jul 20 17:15:06 chimera kernel: [ 3324.556037] [<ffffffff8157da02>] system_call_fastpath+0x16/0x1b
Jul 20 17:15:06 chimera kernel: [ 3324.556037] Code: 89 e5 e8 85 e3 04 00 c9 c3 90 90 90 55 48 89 e5 66 66 66 66 90 48 ff 05 f8 5b 00 01 48 8b 07 eb 1e 48 ff 05 fc 5b 00 01 8b 56 10 <3b> 50 10 7f 14 48 ff 05 e5 5b 00 01 48 8d 78 08 48 8b 40 08 48
Jul 20 17:15:06 chimera kernel: [ 3324.556037] RIP [<ffffffff810a8d4f>] raw_notifier_chain_register+0x1f/0x4a
Jul 20 17:15:06 chimera kernel: [ 3324.556037] RSP <ffff8800b5169e88>
Jul 20 17:15:06 chimera kernel: [ 3324.556037] CR2: ffffffffa1543820
Jul 20 17:15:06 chimera kernel: [ 3324.583800] ---[ end trace 1df4eeece34268d5 ]---
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
^ permalink raw reply
* RE: [Pv-drivers] [PATCH net-next] vmxnet3: set netdev parant device before calling netdev_info
From: Joe Perches @ 2011-07-21 3:49 UTC (permalink / raw)
To: Scott Goldman
Cc: Shreyas Bhatewara, netdev@vger.kernel.org, pv-drivers@vmware.com
In-Reply-To: <03E840D17E263A48A5766AD576E0423A03C5836359@exch-mbx-111.vmware.com>
On Wed, 2011-07-20 at 20:06 -0700, Scott Goldman wrote:
> > Parent device for netdev should be set before netdev_info() can be called
> > otherwise there is a NULL pointer dereference and probe() fails.
I believe this is not true.
I don't see any NULL pointer dereference here.
The functions and macros below are reordered top to bottom.
define_netdev_printk_level(netdev_info, KERN_INFO);
#define define_netdev_printk_level(func, level) \
int func(const struct net_device *dev, const char *fmt, ...) \
{ \
int r; \
struct va_format vaf; \
va_list args; \
\
va_start(args, fmt); \
\
vaf.fmt = fmt; \
vaf.va = &args; \
\
r = __netdev_printk(level, dev, &vaf); \
va_end(args); \
\
return r; \
} \
EXPORT_SYMBOL(func);
static int __netdev_printk(const char *level, const struct net_device *dev,
struct va_format *vaf)
{
int r;
if (dev && dev->dev.parent)
r = dev_printk(level, dev->dev.parent, "%s: %pV",
netdev_name(dev), vaf);
else if (dev)
r = printk("%s%s: %pV", level, netdev_name(dev), vaf);
else
r = printk("%s(NULL net_device): %pV", level, vaf);
return r;
}
static inline const char *netdev_name(const struct net_device *dev)
{
if (dev->reg_state != NETREG_REGISTERED)
return "(unregistered net_device)";
return dev->name;
}
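The fallback order in __netdev_printk() quoted above can be modelled in
user space. This is only a sketch under stated assumptions: the model_*
types and names are hypothetical stand-ins (not kernel API), and the
parent prefix is simplified to a plain name string. It shows that a
missing parent selects a different branch rather than being dereferenced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for struct device and struct net_device. */
struct model_device {
	const char *name;
};

struct model_netdev {
	struct model_device *parent;	/* models dev->dev.parent */
	int registered;			/* models reg_state == NETREG_REGISTERED */
	const char *name;
};

enum model_path { MODEL_VIA_PARENT, MODEL_NAME_ONLY, MODEL_NULL_DEV };

static const char *model_netdev_name(const struct model_netdev *dev)
{
	if (!dev->registered)
		return "(unregistered net_device)";
	return dev->name;
}

/*
 * Writes the message prefix into buf and reports which branch was taken,
 * following the same if / else if / else order as the quoted function.
 */
static enum model_path model_netdev_prefix(char *buf, size_t len,
					   const struct model_netdev *dev)
{
	assert(buf && len > 0);

	if (dev && dev->parent) {
		snprintf(buf, len, "%s: %s: ", dev->parent->name,
			 model_netdev_name(dev));
		return MODEL_VIA_PARENT;
	} else if (dev) {
		/* No parent set yet: the parent pointer is never followed. */
		snprintf(buf, len, "%s: ", model_netdev_name(dev));
		return MODEL_NAME_ONLY;
	}
	snprintf(buf, len, "(NULL net_device): ");
	return MODEL_NULL_DEV;
}
```

In this model, calling model_netdev_prefix() before the parent is assigned
takes the MODEL_NAME_ONLY branch and prints "(unregistered net_device)".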
^ permalink raw reply
* [PATCH] Fix panic in virtnet_remove
From: Krishna Kumar @ 2011-07-20 7:43 UTC (permalink / raw)
To: mst; +Cc: netdev, shemminger, davem, Krishna Kumar
Fix a panic in virtnet_remove. unregister_netdev has already
freed up the netdev (and virtnet_info) due to dev->destructor
being set, while virtnet_info is still required. Remove
virtnet_free altogether, and move the freeing of the per-cpu
statistics from virtnet_free to virtnet_remove.
Tested patch below.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
---
drivers/net/virtio_net.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)
diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c 2011-07-18 09:14:02.000000000 +0530
+++ new/drivers/net/virtio_net.c 2011-07-18 09:16:35.000000000 +0530
@@ -705,14 +705,6 @@ static void virtnet_netpoll(struct net_d
}
#endif
-static void virtnet_free(struct net_device *dev)
-{
- struct virtnet_info *vi = netdev_priv(dev);
-
- free_percpu(vi->stats);
- free_netdev(dev);
-}
-
static int virtnet_open(struct net_device *dev)
{
struct virtnet_info *vi = netdev_priv(dev);
@@ -959,7 +951,6 @@ static int virtnet_probe(struct virtio_d
/* Set up network device as normal. */
dev->netdev_ops = &virtnet_netdev;
dev->features = NETIF_F_HIGHDMA;
- dev->destructor = virtnet_free;
SET_ETHTOOL_OPS(dev, &virtnet_ethtool_ops);
SET_NETDEV_DEV(dev, &vdev->dev);
@@ -1122,6 +1113,7 @@ static void __devexit virtnet_remove(str
while (vi->pages)
__free_pages(get_a_page(vi, GFP_KERNEL), 0);
+ free_percpu(vi->stats);
free_netdev(vi->dev);
}
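The ordering constraint described in the changelog can be sketched in
plain C. This is a user-space model under stated assumptions, not the
driver code: the model_* names are hypothetical, malloc/free stand in for
alloc_percpu/free_percpu and alloc_etherdev/free_netdev, and the private
data is assumed to live inside the device allocation (as netdev_priv()
data does). The point is that anything reachable only through the private
data has to be released before the device itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: private data is embedded in the device allocation. */
struct model_priv {
	long *stats;		/* models vi->stats */
};

struct model_netdev {
	struct model_priv priv;	/* models netdev_priv(dev) */
};

static int model_frees;		/* demo-only counter of released allocations */

static struct model_netdev *model_probe(void)
{
	struct model_netdev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->priv.stats = calloc(4, sizeof(long));
	if (!dev->priv.stats) {
		free(dev);
		return NULL;
	}
	return dev;
}

static void model_remove(struct model_netdev *dev)
{
	assert(dev);

	/*
	 * The unregister step would run here. In this model it must not
	 * free dev, because dev->priv is still used below.
	 */
	free(dev->priv.stats);	/* models free_percpu(vi->stats) */
	dev->priv.stats = NULL;
	model_frees++;

	free(dev);		/* models free_netdev(vi->dev); priv is gone now */
	model_frees++;
}
```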
^ permalink raw reply
* [PATCH net-next] vmxnet3: fix publicity of NETIF_F_HIGHDMA
From: Shreyas Bhatewara @ 2011-07-21 3:21 UTC (permalink / raw)
To: netdev; +Cc: pv-drivers
NETIF_F_HIGHDMA is being disabled even when dma64 is true. This patch fixes it.
CC: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Signed-off-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
--
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 009277e..b46d101 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -2647,7 +2647,7 @@ vmxnet3_declare_features(struct vmxnet3_adapter *adapter, bool dma64)
NETIF_F_HW_VLAN_RX | NETIF_F_TSO | NETIF_F_TSO6 |
NETIF_F_LRO;
if (dma64)
- netdev->features |= NETIF_F_HIGHDMA;
+ netdev->hw_features |= NETIF_F_HIGHDMA;
netdev->vlan_features = netdev->hw_features &
~(NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX);
netdev->features = netdev->hw_features | NETIF_F_HW_VLAN_FILTER;
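Why the bit was lost can be seen from the last context line of the hunk:
features is assigned from hw_features afterwards, so a bit OR-ed into
features beforehand is overwritten. A minimal user-space sketch of that
ordering follows. The MODEL_F_* bit values and the model_* names are
illustrative assumptions only (the real NETIF_F_* constants differ), and
the "patched" switch exists just to compare the two variants.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values; not the real NETIF_F_* constants. */
#define MODEL_F_SG		(1u << 0)
#define MODEL_F_HIGHDMA		(1u << 1)
#define MODEL_F_HW_VLAN_FILTER	(1u << 2)

struct model_netdev {
	uint32_t hw_features;
	uint32_t features;
};

/* patched == 0 models the old code, patched != 0 the code after the fix. */
static void model_declare_features(struct model_netdev *nd, int dma64,
				   int patched)
{
	assert(nd);

	nd->hw_features = MODEL_F_SG;
	if (dma64) {
		if (patched)
			nd->hw_features |= MODEL_F_HIGHDMA;
		else
			nd->features |= MODEL_F_HIGHDMA; /* overwritten below */
	}
	/* Plain assignment: anything set only in features before is dropped. */
	nd->features = nd->hw_features | MODEL_F_HW_VLAN_FILTER;
}
```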
^ permalink raw reply related
* RE: [Pv-drivers] [PATCH net-next] vmxnet3: set netdev parant device before calling netdev_info
From: Scott Goldman @ 2011-07-21 3:06 UTC (permalink / raw)
To: Shreyas Bhatewara, netdev@vger.kernel.org; +Cc: pv-drivers@vmware.com
In-Reply-To: <alpine.LRH.2.00.1107201834220.19334@sbhatewara-dev1.eng.vmware.com>
> Parent device for netdev should be set before netdev_info() can be called
> otherwise there is a NULL pointer dereference and probe() fails.
> Signed-off-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
Signed-off-by: Scott J. Goldman <scottjg@vmware.com>
^ permalink raw reply
* [PATCH net-2.6] jme: Fix unmap error (Causing system freeze)
From: cooldavid @ 2011-07-21 2:57 UTC (permalink / raw)
To: David Miller
Cc: Jason Lamb, linux-netdev, Guo-Fu Tseng, Jason Long, Marcus Becker,
Aries Lee, Devinchiu, Marc Schiffbauer, stable
From: Guo-Fu Tseng <cooldavid@cooldavid.org>
This patch adds the missing dma_unmap(), which solves the critical
issue of the system freezing under heavy load.
Michal Miroslaw's rejected patch:
[PATCH v2 10/46] net: jme: convert to generic DMA API
also pointed out the issue; thank you, Michal.
But the fix in that patch was incorrect: it would unmap a still-needed
address when memory is low.
Got lots of feedback from end users and the Gentoo Bugzilla.
https://bugs.gentoo.org/show_bug.cgi?id=373109
Thank you all. :)
Cc: stable@kernel.org
Signed-off-by: Guo-Fu Tseng <cooldavid@cooldavid.org>
---
drivers/net/jme.c | 20 ++++++++++++++------
1 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/drivers/net/jme.c b/drivers/net/jme.c
index b5b174a..1973814 100644
--- a/drivers/net/jme.c
+++ b/drivers/net/jme.c
@@ -753,20 +753,28 @@ jme_make_new_rx_buf(struct jme_adapter *jme, int i)
struct jme_ring *rxring = &(jme->rxring[0]);
struct jme_buffer_info *rxbi = rxring->bufinf + i;
struct sk_buff *skb;
+ dma_addr_t mapping;
skb = netdev_alloc_skb(jme->dev,
jme->dev->mtu + RX_EXTRA_LEN);
if (unlikely(!skb))
return -ENOMEM;
+ mapping = pci_map_page(jme->pdev, virt_to_page(skb->data),
+ offset_in_page(skb->data), skb_tailroom(skb),
+ PCI_DMA_FROMDEVICE);
+ if (unlikely(pci_dma_mapping_error(jme->pdev, mapping))) {
+ dev_kfree_skb(skb);
+ return -ENOMEM;
+ }
+
+ if (likely(rxbi->mapping))
+ pci_unmap_page(jme->pdev, rxbi->mapping,
+ rxbi->len, PCI_DMA_FROMDEVICE);
+
rxbi->skb = skb;
rxbi->len = skb_tailroom(skb);
- rxbi->mapping = pci_map_page(jme->pdev,
- virt_to_page(skb->data),
- offset_in_page(skb->data),
- rxbi->len,
- PCI_DMA_FROMDEVICE);
-
+ rxbi->mapping = mapping;
return 0;
}
--
1.7.3.4
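The ordering used in jme_make_new_rx_buf() above (obtain the replacement
first, release the old one only once that succeeded) can be sketched in
user space. This is a model under stated assumptions, not driver code:
malloc/free stand in for the map/unmap calls, the model_* names are
hypothetical, and simulate_failure is a demo-only switch that imitates a
low-memory failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of one RX ring slot. */
struct model_rx_slot {
	void *buf;	/* models rxbi->mapping */
	size_t len;	/* models rxbi->len */
};

/* Returns 0 on success, -1 if no replacement could be obtained. */
static int model_refill(struct model_rx_slot *slot, size_t len,
			int simulate_failure)
{
	void *fresh;

	assert(slot);

	fresh = simulate_failure ? NULL : malloc(len);
	if (!fresh)
		return -1;	/* old buffer stays in place and stays usable */

	/* Release the old buffer only after the new one is ready. */
	free(slot->buf);
	slot->buf = fresh;
	slot->len = len;
	return 0;
}
```

In this model a failed refill leaves the slot exactly as it was, which is
the property the changelog says the rejected fix did not preserve.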
^ permalink raw reply related
* [PATCH net-next] vmxnet3: set netdev parant device before calling netdev_info
From: Shreyas Bhatewara @ 2011-07-21 2:01 UTC (permalink / raw)
To: netdev; +Cc: pv-drivers
Parent device for netdev should be set before netdev_info() can be called
otherwise there is a NULL pointer dereference and probe() fails.
Signed-off-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
--
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 009277e..1a8cc5b 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -2993,6 +2993,7 @@ vmxnet3_probe_device(struct pci_dev *pdev,
goto err_ver;
}
+ SET_NETDEV_DEV(netdev, &pdev->dev);
vmxnet3_declare_features(adapter, dma64);
adapter->dev_number = atomic_read(&devices_found);
@@ -3038,7 +3039,6 @@ vmxnet3_probe_device(struct pci_dev *pdev,
netif_set_real_num_tx_queues(adapter->netdev, adapter->num_tx_queues);
netif_set_real_num_rx_queues(adapter->netdev, adapter->num_rx_queues);
- SET_NETDEV_DEV(netdev, &pdev->dev);
err = register_netdev(netdev);
if (err) {
^ permalink raw reply related
* [PATCH v3] net: filter: BPF 'JIT' compiler for PPC64
From: Matt Evans @ 2011-07-21 1:51 UTC (permalink / raw)
To: linuxppc-dev, netdev
In-Reply-To: <4E24E867.9050909@ozlabs.org>
An implementation of a code generator for BPF programs to speed up packet
filtering on PPC64, inspired by Eric Dumazet's x86-64 version.
Filter code is generated as an ABI-compliant function in module_alloc()'d mem
with stackframe & prologue/epilogue generated if required (simple filters don't
need anything more than an li/blr). The filter's local variables, M[], live in
registers. Supports all BPF opcodes, although "complicated" loads from negative
packet offsets (e.g. SKF_LL_OFF) are not yet supported.
There are a couple of further optimisations left for future work; many-pass
assembly with branch-reach reduction and a register allocator to push M[]
variables into volatile registers would improve the code quality further.
This currently supports big-endian 64-bit PowerPC only (but is fairly simple
to port to PPC32 or LE!).
Enabled in the same way as x86-64:
echo 1 > /proc/sys/net/core/bpf_jit_enable
Or, enabled with extra debug output:
echo 2 > /proc/sys/net/core/bpf_jit_enable
Signed-off-by: Matt Evans <matt@ozlabs.org>
---
V3: Added BUILD_BUG_ON to assert PACA CPU ID is 16bits, made a comment (in
LD_MSH) a bit clearer, ratelimited "Unknown opcode" error and moved
bpf_jit.S to bpf_jit_64.S (it doesn't make sense to rename bpf_jit_comp.c as
small portions will eventually get split out into _32/_64.c files when we do
32bit support).
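As a rough illustration of the "li/blr" case mentioned above, here is a
stand-alone sketch of the emit scheme used by PLANT_INSTR in bpf_jit.h:
with a NULL image only the instruction index advances (a sizing pass),
with a real image the words are written. The model_* names are
hypothetical; the two opcode values are copied from the ppc-opcode.h hunk
of this patch, and the register field is assumed to sit at bit 21 as in
__PPC_RT(). Only small constants (below 32768) are handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INST_ADDI	0x38000000u	/* PPC_INST_ADDI in the patch */
#define MODEL_INST_BLR	0x4e800020u	/* PPC_INST_BLR in the patch */
#define MODEL_RT(r)	(((uint32_t)(r) & 0x1f) << 21)

struct model_ctx {
	unsigned int idx;	/* next instruction slot */
};

/* Writes instr only when an image exists; always advances the index. */
static void model_emit(uint32_t *image, struct model_ctx *ctx, uint32_t instr)
{
	assert(ctx);
	if (image)
		image[ctx->idx] = instr;
	ctx->idx++;
}

/* "RET #k" for small k: li r3, k ; blr. Returns the instruction count. */
static unsigned int model_build_ret_k(uint32_t *image, uint16_t k)
{
	struct model_ctx ctx = { 0 };

	assert(k < 32768);	/* larger values would need an extra addis */
	model_emit(image, &ctx, MODEL_INST_ADDI | MODEL_RT(3) | k);
	model_emit(image, &ctx, MODEL_INST_BLR);
	return ctx.idx;
}
```

A caller would run model_build_ret_k(NULL, k) once to learn the size,
allocate that many words, and then call it again with the real buffer.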
arch/powerpc/Kconfig | 1 +
arch/powerpc/Makefile | 3 +-
arch/powerpc/include/asm/ppc-opcode.h | 40 ++
arch/powerpc/net/Makefile | 4 +
arch/powerpc/net/bpf_jit.h | 227 +++++++++++
arch/powerpc/net/bpf_jit_64.S | 138 +++++++
arch/powerpc/net/bpf_jit_comp.c | 694 +++++++++++++++++++++++++++++++++
7 files changed, 1106 insertions(+), 1 deletions(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2729c66..39860fc 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -134,6 +134,7 @@ config PPC
select GENERIC_IRQ_SHOW_LEVEL
select HAVE_RCU_TABLE_FREE if SMP
select HAVE_SYSCALL_TRACEPOINTS
+ select HAVE_BPF_JIT if PPC64
config EARLY_PRINTK
bool
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index b7212b6..b94740f 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -154,7 +154,8 @@ core-y += arch/powerpc/kernel/ \
arch/powerpc/lib/ \
arch/powerpc/sysdev/ \
arch/powerpc/platforms/ \
- arch/powerpc/math-emu/
+ arch/powerpc/math-emu/ \
+ arch/powerpc/net/
core-$(CONFIG_XMON) += arch/powerpc/xmon/
core-$(CONFIG_KVM) += arch/powerpc/kvm/
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index e472659..e980faa 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -71,6 +71,42 @@
#define PPC_INST_ERATSX 0x7c000126
#define PPC_INST_ERATSX_DOT 0x7c000127
+/* Misc instructions for BPF compiler */
+#define PPC_INST_LD 0xe8000000
+#define PPC_INST_LHZ 0xa0000000
+#define PPC_INST_LWZ 0x80000000
+#define PPC_INST_STD 0xf8000000
+#define PPC_INST_STDU 0xf8000001
+#define PPC_INST_MFLR 0x7c0802a6
+#define PPC_INST_MTLR 0x7c0803a6
+#define PPC_INST_CMPWI 0x2c000000
+#define PPC_INST_CMPDI 0x2c200000
+#define PPC_INST_CMPLW 0x7c000040
+#define PPC_INST_CMPLWI 0x28000000
+#define PPC_INST_ADDI 0x38000000
+#define PPC_INST_ADDIS 0x3c000000
+#define PPC_INST_ADD 0x7c000214
+#define PPC_INST_SUB 0x7c000050
+#define PPC_INST_BLR 0x4e800020
+#define PPC_INST_BLRL 0x4e800021
+#define PPC_INST_MULLW 0x7c0001d6
+#define PPC_INST_MULHWU 0x7c000016
+#define PPC_INST_MULLI 0x1c000000
+#define PPC_INST_DIVWU 0x7c0003d6
+#define PPC_INST_RLWINM 0x54000000
+#define PPC_INST_RLDICR 0x78000004
+#define PPC_INST_SLW 0x7c000030
+#define PPC_INST_SRW 0x7c000430
+#define PPC_INST_AND 0x7c000038
+#define PPC_INST_ANDDOT 0x7c000039
+#define PPC_INST_OR 0x7c000378
+#define PPC_INST_ANDI 0x70000000
+#define PPC_INST_ORI 0x60000000
+#define PPC_INST_ORIS 0x64000000
+#define PPC_INST_NEG 0x7c0000d0
+#define PPC_INST_BRANCH 0x48000000
+#define PPC_INST_BRANCH_COND 0x40800000
+
/* macros to insert fields into opcodes */
#define __PPC_RA(a) (((a) & 0x1f) << 16)
#define __PPC_RB(b) (((b) & 0x1f) << 11)
@@ -83,6 +119,10 @@
#define __PPC_T_TLB(t) (((t) & 0x3) << 21)
#define __PPC_WC(w) (((w) & 0x3) << 21)
#define __PPC_WS(w) (((w) & 0x1f) << 11)
+#define __PPC_SH(s) __PPC_WS(s)
+#define __PPC_MB(s) (((s) & 0x1f) << 6)
+#define __PPC_ME(s) (((s) & 0x1f) << 1)
+#define __PPC_BI(s) (((s) & 0x1f) << 16)
/*
* Only use the larx hint bit on 64bit CPUs. e500v1/v2 based CPUs will treat a
diff --git a/arch/powerpc/net/Makefile b/arch/powerpc/net/Makefile
new file mode 100644
index 0000000..266b395
--- /dev/null
+++ b/arch/powerpc/net/Makefile
@@ -0,0 +1,4 @@
+#
+# Arch-specific network modules
+#
+obj-$(CONFIG_BPF_JIT) += bpf_jit_64.o bpf_jit_comp.o
diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
new file mode 100644
index 0000000..af1ab5e
--- /dev/null
+++ b/arch/powerpc/net/bpf_jit.h
@@ -0,0 +1,227 @@
+/* bpf_jit.h: BPF JIT compiler for PPC64
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#ifndef _BPF_JIT_H
+#define _BPF_JIT_H
+
+#define BPF_PPC_STACK_LOCALS 32
+#define BPF_PPC_STACK_BASIC (48+64)
+#define BPF_PPC_STACK_SAVE (18*8)
+#define BPF_PPC_STACKFRAME (BPF_PPC_STACK_BASIC+BPF_PPC_STACK_LOCALS+ \
+ BPF_PPC_STACK_SAVE)
+#define BPF_PPC_SLOWPATH_FRAME (48+64)
+
+/*
+ * Generated code register usage:
+ *
+ * As normal PPC C ABI (e.g. r1=sp, r2=TOC), with:
+ *
+ * skb r3 (Entry parameter)
+ * A register r4
+ * X register r5
+ * addr param r6
+ * r7-r10 scratch
+ * skb->data r14
+ * skb headlen r15 (skb->len - skb->data_len)
+ * m[0] r16
+ * m[...] ...
+ * m[15] r31
+ */
+#define r_skb 3
+#define r_ret 3
+#define r_A 4
+#define r_X 5
+#define r_addr 6
+#define r_scratch1 7
+#define r_D 14
+#define r_HL 15
+#define r_M 16
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Assembly helpers from arch/powerpc/net/bpf_jit_64.S:
+ */
+extern u8 sk_load_word[], sk_load_half[], sk_load_byte[], sk_load_byte_msh[];
+
+#define FUNCTION_DESCR_SIZE 24
+
+/*
+ * 16-bit immediate helper macros: HA() is for use with sign-extending instrs
+ * (e.g. LD, ADDI). If the bottom 16 bits is "-ve", add another bit into the
+ * top half to negate the effect (i.e. 0xffff + 1 = 0x(1)0000).
+ */
+#define IMM_H(i) ((uintptr_t)(i)>>16)
+#define IMM_HA(i) (((uintptr_t)(i)>>16) + \
+ (((uintptr_t)(i) & 0x8000) >> 15))
+#define IMM_L(i) ((uintptr_t)(i) & 0xffff)
+
+#define PLANT_INSTR(d, idx, instr) \
+ do { if (d) { (d)[idx] = instr; } idx++; } while (0)
+#define EMIT(instr) PLANT_INSTR(image, ctx->idx, instr)
+
+#define PPC_NOP() EMIT(PPC_INST_NOP)
+#define PPC_BLR() EMIT(PPC_INST_BLR)
+#define PPC_BLRL() EMIT(PPC_INST_BLRL)
+#define PPC_MTLR(r) EMIT(PPC_INST_MTLR | __PPC_RT(r))
+#define PPC_ADDI(d, a, i) EMIT(PPC_INST_ADDI | __PPC_RT(d) | \
+ __PPC_RA(a) | IMM_L(i))
+#define PPC_MR(d, a) PPC_OR(d, a, a)
+#define PPC_LI(r, i) PPC_ADDI(r, 0, i)
+#define PPC_ADDIS(d, a, i) EMIT(PPC_INST_ADDIS | \
+ __PPC_RS(d) | __PPC_RA(a) | IMM_L(i))
+#define PPC_LIS(r, i) PPC_ADDIS(r, 0, i)
+#define PPC_STD(r, base, i) EMIT(PPC_INST_STD | __PPC_RS(r) | \
+ __PPC_RA(base) | ((i) & 0xfffc))
+
+#define PPC_LD(r, base, i) EMIT(PPC_INST_LD | __PPC_RT(r) | \
+ __PPC_RA(base) | IMM_L(i))
+#define PPC_LWZ(r, base, i) EMIT(PPC_INST_LWZ | __PPC_RT(r) | \
+ __PPC_RA(base) | IMM_L(i))
+#define PPC_LHZ(r, base, i) EMIT(PPC_INST_LHZ | __PPC_RT(r) | \
+ __PPC_RA(base) | IMM_L(i))
+/* Convenience helpers for the above with 'far' offsets: */
+#define PPC_LD_OFFS(r, base, i) do { if ((i) < 32768) PPC_LD(r, base, i); \
+ else { PPC_ADDIS(r, base, IMM_HA(i)); \
+ PPC_LD(r, r, IMM_L(i)); } } while(0)
+
+#define PPC_LWZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LWZ(r, base, i); \
+ else { PPC_ADDIS(r, base, IMM_HA(i)); \
+ PPC_LWZ(r, r, IMM_L(i)); } } while(0)
+
+#define PPC_LHZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LHZ(r, base, i); \
+ else { PPC_ADDIS(r, base, IMM_HA(i)); \
+ PPC_LHZ(r, r, IMM_L(i)); } } while(0)
+
+#define PPC_CMPWI(a, i) EMIT(PPC_INST_CMPWI | __PPC_RA(a) | IMM_L(i))
+#define PPC_CMPDI(a, i) EMIT(PPC_INST_CMPDI | __PPC_RA(a) | IMM_L(i))
+#define PPC_CMPLWI(a, i) EMIT(PPC_INST_CMPLWI | __PPC_RA(a) | IMM_L(i))
+#define PPC_CMPLW(a, b) EMIT(PPC_INST_CMPLW | __PPC_RA(a) | __PPC_RB(b))
+
+#define PPC_SUB(d, a, b) EMIT(PPC_INST_SUB | __PPC_RT(d) | \
+ __PPC_RB(a) | __PPC_RA(b))
+#define PPC_ADD(d, a, b) EMIT(PPC_INST_ADD | __PPC_RT(d) | \
+ __PPC_RA(a) | __PPC_RB(b))
+#define PPC_MUL(d, a, b) EMIT(PPC_INST_MULLW | __PPC_RT(d) | \
+ __PPC_RA(a) | __PPC_RB(b))
+#define PPC_MULHWU(d, a, b) EMIT(PPC_INST_MULHWU | __PPC_RT(d) | \
+ __PPC_RA(a) | __PPC_RB(b))
+#define PPC_MULI(d, a, i) EMIT(PPC_INST_MULLI | __PPC_RT(d) | \
+ __PPC_RA(a) | IMM_L(i))
+#define PPC_DIVWU(d, a, b) EMIT(PPC_INST_DIVWU | __PPC_RT(d) | \
+ __PPC_RA(a) | __PPC_RB(b))
+#define PPC_AND(d, a, b) EMIT(PPC_INST_AND | __PPC_RA(d) | \
+ __PPC_RS(a) | __PPC_RB(b))
+#define PPC_ANDI(d, a, i) EMIT(PPC_INST_ANDI | __PPC_RA(d) | \
+ __PPC_RS(a) | IMM_L(i))
+#define PPC_AND_DOT(d, a, b) EMIT(PPC_INST_ANDDOT | __PPC_RA(d) | \
+ __PPC_RS(a) | __PPC_RB(b))
+#define PPC_OR(d, a, b) EMIT(PPC_INST_OR | __PPC_RA(d) | \
+ __PPC_RS(a) | __PPC_RB(b))
+#define PPC_ORI(d, a, i) EMIT(PPC_INST_ORI | __PPC_RA(d) | \
+ __PPC_RS(a) | IMM_L(i))
+#define PPC_ORIS(d, a, i) EMIT(PPC_INST_ORIS | __PPC_RA(d) | \
+ __PPC_RS(a) | IMM_L(i))
+#define PPC_SLW(d, a, s) EMIT(PPC_INST_SLW | __PPC_RA(d) | \
+ __PPC_RS(a) | __PPC_RB(s))
+#define PPC_SRW(d, a, s) EMIT(PPC_INST_SRW | __PPC_RA(d) | \
+ __PPC_RS(a) | __PPC_RB(s))
+/* slwi = rlwinm Rx, Ry, n, 0, 31-n */
+#define PPC_SLWI(d, a, i) EMIT(PPC_INST_RLWINM | __PPC_RA(d) | \
+ __PPC_RS(a) | __PPC_SH(i) | \
+ __PPC_MB(0) | __PPC_ME(31-(i)))
+/* srwi = rlwinm Rx, Ry, 32-n, n, 31 */
+#define PPC_SRWI(d, a, i) EMIT(PPC_INST_RLWINM | __PPC_RA(d) | \
+ __PPC_RS(a) | __PPC_SH(32-(i)) | \
+ __PPC_MB(i) | __PPC_ME(31))
+/* sldi = rldicr Rx, Ry, n, 63-n */
+#define PPC_SLDI(d, a, i) EMIT(PPC_INST_RLDICR | __PPC_RA(d) | \
+ __PPC_RS(a) | __PPC_SH(i) | \
+ __PPC_MB(63-(i)) | (((i) & 0x20) >> 4))
+#define PPC_NEG(d, a) EMIT(PPC_INST_NEG | __PPC_RT(d) | __PPC_RA(a))
+
+/* Long jump; (unconditional 'branch') */
+#define PPC_JMP(dest) EMIT(PPC_INST_BRANCH | \
+ (((dest) - (ctx->idx * 4)) & 0x03fffffc))
+/* "cond" here covers BO:BI fields. */
+#define PPC_BCC_SHORT(cond, dest) EMIT(PPC_INST_BRANCH_COND | \
+ (((cond) & 0x3ff) << 16) | \
+ (((dest) - (ctx->idx * 4)) & \
+ 0xfffc))
+#define PPC_LI32(d, i) do { PPC_LI(d, IMM_L(i)); \
+ if ((u32)(uintptr_t)(i) >= 32768) { \
+ PPC_ADDIS(d, d, IMM_HA(i)); \
+ } } while(0)
+#define PPC_LI64(d, i) do { \
+ if (!((uintptr_t)(i) & 0xffffffff00000000ULL)) \
+ PPC_LI32(d, i); \
+ else { \
+ PPC_LIS(d, ((uintptr_t)(i) >> 48)); \
+ if ((uintptr_t)(i) & 0x0000ffff00000000ULL) \
+ PPC_ORI(d, d, \
+ ((uintptr_t)(i) >> 32) & 0xffff); \
+ PPC_SLDI(d, d, 32); \
+ if ((uintptr_t)(i) & 0x00000000ffff0000ULL) \
+ PPC_ORIS(d, d, \
+ ((uintptr_t)(i) >> 16) & 0xffff); \
+ if ((uintptr_t)(i) & 0x000000000000ffffULL) \
+ PPC_ORI(d, d, (uintptr_t)(i) & 0xffff); \
+ } } while (0)
+
+static inline bool is_nearbranch(int offset)
+{
+ return (offset < 32768) && (offset >= -32768);
+}
+
+/*
+ * The fly in the ointment of code size changing from pass to pass is
+ * avoided by padding the short branch case with a NOP. If code size differs
+ * with different branch reaches we will have the issue of code moving from
+ * one pass to the next and will need a few passes to converge on a stable
+ * state.
+ */
+#define PPC_BCC(cond, dest) do { \
+ if (is_nearbranch((dest) - (ctx->idx * 4))) { \
+ PPC_BCC_SHORT(cond, dest); \
+ PPC_NOP(); \
+ } else { \
+ /* Flip the 'T or F' bit to invert comparison */ \
+ PPC_BCC_SHORT(cond ^ COND_CMP_TRUE, (ctx->idx+2)*4); \
+ PPC_JMP(dest); \
+ } } while(0)
+
+/* To create a branch condition, select a bit of cr0... */
+#define CR0_LT 0
+#define CR0_GT 1
+#define CR0_EQ 2
+/* ...and modify BO[3] */
+#define COND_CMP_TRUE 0x100
+#define COND_CMP_FALSE 0x000
+/* Together, they make all required comparisons: */
+#define COND_GT (CR0_GT | COND_CMP_TRUE)
+#define COND_GE (CR0_LT | COND_CMP_FALSE)
+#define COND_EQ (CR0_EQ | COND_CMP_TRUE)
+#define COND_NE (CR0_EQ | COND_CMP_FALSE)
+#define COND_LT (CR0_LT | COND_CMP_TRUE)
+
+#define SEEN_DATAREF 0x10000 /* might call external helpers */
+#define SEEN_XREG 0x20000 /* X reg is used */
+#define SEEN_MEM 0x40000 /* SEEN_MEM+(1<<n) = use mem[n] for temporary
+ * storage */
+#define SEEN_MEM_MSK 0x0ffff
+
+struct codegen_context {
+ unsigned int seen;
+ unsigned int idx;
+ int pc_ret0; /* bpf index of first RET #0 instruction (if any) */
+};
+
+#endif
+
+#endif
diff --git a/arch/powerpc/net/bpf_jit_64.S b/arch/powerpc/net/bpf_jit_64.S
new file mode 100644
index 0000000..ff4506e
--- /dev/null
+++ b/arch/powerpc/net/bpf_jit_64.S
@@ -0,0 +1,138 @@
+/* bpf_jit_64.S: Packet/header access helper functions
+ * for PPC64 BPF compiler.
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <asm/ppc_asm.h>
+#include "bpf_jit.h"
+
+/*
+ * All of these routines are called directly from generated code,
+ * whose register usage is:
+ *
+ * r3 skb
+ * r4,r5 A,X
+ * r6 *** address parameter to helper ***
+ * r7-r10 scratch
+ * r14 skb->data
+ * r15 skb headlen
+ * r16-31 M[]
+ */
+
+/*
+ * To consider: These helpers are so small it could be better to just
+ * generate them inline. Inline code can do the simple headlen check
+ * then branch directly to slow_path_XXX if required. (In fact, could
+ * load a spare GPR with the address of slow_path_generic and pass size
+ * as an argument, making the call site a mtlr, li and blrl.)
+ *
+ * Technically, the "is addr < 0" check is unnecessary & slowing down
+ * the ABS path, as it's statically checked on generation.
+ */
+ .globl sk_load_word
+sk_load_word:
+ cmpdi r_addr, 0
+ blt bpf_error
+ /* Are we accessing past headlen? */
+ subi r_scratch1, r_HL, 4
+ cmpd r_scratch1, r_addr
+ blt bpf_slow_path_word
+ /* Nope, just hitting the header. cr0 here is eq or gt! */
+ lwzx r_A, r_D, r_addr
+ /* When big endian we don't need to byteswap. */
+ blr /* Return success, cr0 != LT */
+
+ .globl sk_load_half
+sk_load_half:
+ cmpdi r_addr, 0
+ blt bpf_error
+ subi r_scratch1, r_HL, 2
+ cmpd r_scratch1, r_addr
+ blt bpf_slow_path_half
+ lhzx r_A, r_D, r_addr
+ blr
+
+ .globl sk_load_byte
+sk_load_byte:
+ cmpdi r_addr, 0
+ blt bpf_error
+ cmpd r_HL, r_addr
+ ble bpf_slow_path_byte
+ lbzx r_A, r_D, r_addr
+ blr
+
+/*
+ * BPF_S_LDX_B_MSH: ldxb 4*([offset]&0xf)
+ * r_addr is the offset value, already known positive
+ */
+ .globl sk_load_byte_msh
+sk_load_byte_msh:
+ cmpd r_HL, r_addr
+ ble bpf_slow_path_byte_msh
+ lbzx r_X, r_D, r_addr
+ rlwinm r_X, r_X, 2, 32-4-2, 31-2
+ blr
+
+bpf_error:
+ /* Entered with cr0 = lt */
+ li r3, 0
+ /* Generated code will 'blt epilogue', returning 0. */
+ blr
+
+/* Call out to skb_copy_bits:
+ * We'll need to back up our volatile regs first; we have
+ * local variable space at r1+(BPF_PPC_STACK_BASIC).
+ * Allocate a new stack frame here to remain ABI-compliant in
+ * stashing LR.
+ */
+#define bpf_slow_path_common(SIZE) \
+ mflr r0; \
+ std r0, 16(r1); \
+ /* R3 goes in parameter space of caller's frame */ \
+ std r_skb, (BPF_PPC_STACKFRAME+48)(r1); \
+ std r_A, (BPF_PPC_STACK_BASIC+(0*8))(r1); \
+ std r_X, (BPF_PPC_STACK_BASIC+(1*8))(r1); \
+ addi r5, r1, BPF_PPC_STACK_BASIC+(2*8); \
+ stdu r1, -BPF_PPC_SLOWPATH_FRAME(r1); \
+ /* R3 = r_skb, as passed */ \
+ mr r4, r_addr; \
+ li r6, SIZE; \
+ bl skb_copy_bits; \
+ /* R3 = 0 on success */ \
+ addi r1, r1, BPF_PPC_SLOWPATH_FRAME; \
+ ld r0, 16(r1); \
+ ld r_A, (BPF_PPC_STACK_BASIC+(0*8))(r1); \
+ ld r_X, (BPF_PPC_STACK_BASIC+(1*8))(r1); \
+ mtlr r0; \
+ cmpdi r3, 0; \
+ blt bpf_error; /* cr0 = LT */ \
+ ld r_skb, (BPF_PPC_STACKFRAME+48)(r1); \
+ /* Great success! */
+
+bpf_slow_path_word:
+ bpf_slow_path_common(4)
+ /* Data value is on stack, and cr0 != LT */
+ lwz r_A, BPF_PPC_STACK_BASIC+(2*8)(r1)
+ blr
+
+bpf_slow_path_half:
+ bpf_slow_path_common(2)
+ lhz r_A, BPF_PPC_STACK_BASIC+(2*8)(r1)
+ blr
+
+bpf_slow_path_byte:
+ bpf_slow_path_common(1)
+ lbz r_A, BPF_PPC_STACK_BASIC+(2*8)(r1)
+ blr
+
+bpf_slow_path_byte_msh:
+ bpf_slow_path_common(1)
+ lbz r_X, BPF_PPC_STACK_BASIC+(2*8)(r1)
+ rlwinm r_X, r_X, 2, 32-4-2, 31-2
+ blr
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
new file mode 100644
index 0000000..73619d3
--- /dev/null
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -0,0 +1,694 @@
+/* bpf_jit_comp.c: BPF JIT compiler for PPC64
+ *
+ * Copyright 2011 Matt Evans <matt@ozlabs.org>, IBM Corporation
+ *
+ * Based on the x86 BPF compiler, by Eric Dumazet (eric.dumazet@gmail.com)
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+#include <linux/moduleloader.h>
+#include <asm/cacheflush.h>
+#include <linux/netdevice.h>
+#include <linux/filter.h>
+#include "bpf_jit.h"
+
+#ifndef __BIG_ENDIAN
+/* There are endianness assumptions herein. */
+#error "Little-endian PPC not supported in BPF compiler"
+#endif
+
+int bpf_jit_enable __read_mostly;
+
+
+static inline void bpf_flush_icache(void *start, void *end)
+{
+ smp_wmb();
+ flush_icache_range((unsigned long)start, (unsigned long)end);
+}
+
+static void bpf_jit_build_prologue(struct sk_filter *fp, u32 *image,
+ struct codegen_context *ctx)
+{
+ int i;
+ const struct sock_filter *filter = fp->insns;
+
+ if (ctx->seen & (SEEN_MEM | SEEN_DATAREF)) {
+ /* Make stackframe */
+ if (ctx->seen & SEEN_DATAREF) {
+ /* If we call any helpers (for loads), save LR */
+ EMIT(PPC_INST_MFLR | __PPC_RT(0));
+ PPC_STD(0, 1, 16);
+
+ /* Back up non-volatile regs. */
+ PPC_STD(r_D, 1, -(8*(32-r_D)));
+ PPC_STD(r_HL, 1, -(8*(32-r_HL)));
+ }
+ if (ctx->seen & SEEN_MEM) {
+ /*
+ * Conditionally save regs r16-r31 as some will be used
+ * for M[] data.
+ */
+ for (i = r_M; i < (r_M+16); i++) {
+ if (ctx->seen & (1 << (i-r_M)))
+ PPC_STD(i, 1, -(8*(32-i)));
+ }
+ }
+ EMIT(PPC_INST_STDU | __PPC_RS(1) | __PPC_RA(1) |
+ (-BPF_PPC_STACKFRAME & 0xfffc));
+ }
+
+ if (ctx->seen & SEEN_DATAREF) {
+ /*
+ * If this filter needs to access skb data,
+ * prepare r_D and r_HL:
+ * r_HL = skb->len - skb->data_len
+ * r_D = skb->data
+ */
+ PPC_LWZ_OFFS(r_scratch1, r_skb, offsetof(struct sk_buff,
+ data_len));
+ PPC_LWZ_OFFS(r_HL, r_skb, offsetof(struct sk_buff, len));
+ PPC_SUB(r_HL, r_HL, r_scratch1);
+ PPC_LD_OFFS(r_D, r_skb, offsetof(struct sk_buff, data));
+ }
+
+ if (ctx->seen & SEEN_XREG) {
+ /*
+ * TODO: Could also detect whether first instr. sets X and
+ * avoid this (as below, with A).
+ */
+ PPC_LI(r_X, 0);
+ }
+
+ switch (filter[0].code) {
+ case BPF_S_RET_K:
+ case BPF_S_LD_W_LEN:
+ case BPF_S_ANC_PROTOCOL:
+ case BPF_S_ANC_IFINDEX:
+ case BPF_S_ANC_MARK:
+ case BPF_S_ANC_RXHASH:
+ case BPF_S_ANC_CPU:
+ case BPF_S_ANC_QUEUE:
+ case BPF_S_LD_W_ABS:
+ case BPF_S_LD_H_ABS:
+ case BPF_S_LD_B_ABS:
+ /* first instruction sets A register (or is RET 'constant') */
+ break;
+ default:
+ /* make sure we don't leak kernel information to user */
+ PPC_LI(r_A, 0);
+ }
+}
+
+static void bpf_jit_build_epilogue(u32 *image, struct codegen_context *ctx)
+{
+ int i;
+
+ if (ctx->seen & (SEEN_MEM | SEEN_DATAREF)) {
+ PPC_ADDI(1, 1, BPF_PPC_STACKFRAME);
+ if (ctx->seen & SEEN_DATAREF) {
+ PPC_LD(0, 1, 16);
+ PPC_MTLR(0);
+ PPC_LD(r_D, 1, -(8*(32-r_D)));
+ PPC_LD(r_HL, 1, -(8*(32-r_HL)));
+ }
+ if (ctx->seen & SEEN_MEM) {
+ /* Restore any saved non-vol registers */
+ for (i = r_M; i < (r_M+16); i++) {
+ if (ctx->seen & (1 << (i-r_M)))
+ PPC_LD(i, 1, -(8*(32-i)));
+ }
+ }
+ }
+ /* The RETs have left a return value in R3. */
+
+ PPC_BLR();
+}
+
+/* Assemble the body code between the prologue & epilogue. */
+static int bpf_jit_build_body(struct sk_filter *fp, u32 *image,
+ struct codegen_context *ctx,
+ unsigned int *addrs)
+{
+ const struct sock_filter *filter = fp->insns;
+ int flen = fp->len;
+ u8 *func;
+ unsigned int true_cond;
+ int i;
+
+ /* Start of epilogue code */
+ unsigned int exit_addr = addrs[flen];
+
+ for (i = 0; i < flen; i++) {
+ unsigned int K = filter[i].k;
+
+ /*
+ * addrs[] maps a BPF bytecode address into a real offset from
+ * the start of the body code.
+ */
+ addrs[i] = ctx->idx * 4;
+
+ switch (filter[i].code) {
+ /*** ALU ops ***/
+ case BPF_S_ALU_ADD_X: /* A += X; */
+ ctx->seen |= SEEN_XREG;
+ PPC_ADD(r_A, r_A, r_X);
+ break;
+ case BPF_S_ALU_ADD_K: /* A += K; */
+ if (!K)
+ break;
+ PPC_ADDI(r_A, r_A, IMM_L(K));
+ if (K >= 32768)
+ PPC_ADDIS(r_A, r_A, IMM_HA(K));
+ break;
+ case BPF_S_ALU_SUB_X: /* A -= X; */
+ ctx->seen |= SEEN_XREG;
+ PPC_SUB(r_A, r_A, r_X);
+ break;
+ case BPF_S_ALU_SUB_K: /* A -= K */
+ if (!K)
+ break;
+ PPC_ADDI(r_A, r_A, IMM_L(-K));
+ if (K >= 32768)
+ PPC_ADDIS(r_A, r_A, IMM_HA(-K));
+ break;
+ case BPF_S_ALU_MUL_X: /* A *= X; */
+ ctx->seen |= SEEN_XREG;
+ PPC_MUL(r_A, r_A, r_X);
+ break;
+ case BPF_S_ALU_MUL_K: /* A *= K */
+ if (K < 32768)
+ PPC_MULI(r_A, r_A, K);
+ else {
+ PPC_LI32(r_scratch1, K);
+ PPC_MUL(r_A, r_A, r_scratch1);
+ }
+ break;
+ case BPF_S_ALU_DIV_X: /* A /= X; */
+ ctx->seen |= SEEN_XREG;
+ PPC_CMPWI(r_X, 0);
+ if (ctx->pc_ret0 != -1) {
+ PPC_BCC(COND_EQ, addrs[ctx->pc_ret0]);
+ } else {
+ /*
+ * Exit, returning 0; first pass hits here
+ * (longer worst-case code size).
+ */
+ PPC_BCC_SHORT(COND_NE, (ctx->idx*4)+12);
+ PPC_LI(r_ret, 0);
+ PPC_JMP(exit_addr);
+ }
+ PPC_DIVWU(r_A, r_A, r_X);
+ break;
+ case BPF_S_ALU_DIV_K: /* A = reciprocal_divide(A, K); */
+ PPC_LI32(r_scratch1, K);
+ /* Top 32 bits of 64bit result -> A */
+ PPC_MULHWU(r_A, r_A, r_scratch1);
+ break;
+ case BPF_S_ALU_AND_X:
+ ctx->seen |= SEEN_XREG;
+ PPC_AND(r_A, r_A, r_X);
+ break;
+ case BPF_S_ALU_AND_K:
+ if (!IMM_H(K))
+ PPC_ANDI(r_A, r_A, K);
+ else {
+ PPC_LI32(r_scratch1, K);
+ PPC_AND(r_A, r_A, r_scratch1);
+ }
+ break;
+ case BPF_S_ALU_OR_X:
+ ctx->seen |= SEEN_XREG;
+ PPC_OR(r_A, r_A, r_X);
+ break;
+ case BPF_S_ALU_OR_K:
+ if (IMM_L(K))
+ PPC_ORI(r_A, r_A, IMM_L(K));
+ if (K >= 65536)
+ PPC_ORIS(r_A, r_A, IMM_H(K));
+ break;
+ case BPF_S_ALU_LSH_X: /* A <<= X; */
+ ctx->seen |= SEEN_XREG;
+ PPC_SLW(r_A, r_A, r_X);
+ break;
+ case BPF_S_ALU_LSH_K:
+ if (K == 0)
+ break;
+ else
+ PPC_SLWI(r_A, r_A, K);
+ break;
+ case BPF_S_ALU_RSH_X: /* A >>= X; */
+ ctx->seen |= SEEN_XREG;
+ PPC_SRW(r_A, r_A, r_X);
+ break;
+ case BPF_S_ALU_RSH_K: /* A >>= K; */
+ if (K == 0)
+ break;
+ else
+ PPC_SRWI(r_A, r_A, K);
+ break;
+ case BPF_S_ALU_NEG:
+ PPC_NEG(r_A, r_A);
+ break;
+ case BPF_S_RET_K:
+ PPC_LI32(r_ret, K);
+ if (!K) {
+ if (ctx->pc_ret0 == -1)
+ ctx->pc_ret0 = i;
+ }
+ /*
+ * If this isn't the very last instruction, branch to
+ * the epilogue if we've stuff to clean up. Otherwise,
+ * if there's nothing to tidy, just return. If we /are/
+ * the last instruction, we're about to fall through to
+ * the epilogue to return.
+ */
+ if (i != flen - 1) {
+ /*
+ * Note: 'seen' is properly valid only on pass
+ * #2. Both parts of this conditional are the
+ * same instruction size though, meaning the
+ * first pass will still correctly determine the
+ * code size/addresses.
+ */
+ if (ctx->seen)
+ PPC_JMP(exit_addr);
+ else
+ PPC_BLR();
+ }
+ break;
+ case BPF_S_RET_A:
+ PPC_MR(r_ret, r_A);
+ if (i != flen - 1) {
+ if (ctx->seen)
+ PPC_JMP(exit_addr);
+ else
+ PPC_BLR();
+ }
+ break;
+ case BPF_S_MISC_TAX: /* X = A */
+ PPC_MR(r_X, r_A);
+ break;
+ case BPF_S_MISC_TXA: /* A = X */
+ ctx->seen |= SEEN_XREG;
+ PPC_MR(r_A, r_X);
+ break;
+
+ /*** Constant loads/M[] access ***/
+ case BPF_S_LD_IMM: /* A = K */
+ PPC_LI32(r_A, K);
+ break;
+ case BPF_S_LDX_IMM: /* X = K */
+ PPC_LI32(r_X, K);
+ break;
+ case BPF_S_LD_MEM: /* A = mem[K] */
+ PPC_MR(r_A, r_M + (K & 0xf));
+ ctx->seen |= SEEN_MEM | (1<<(K & 0xf));
+ break;
+ case BPF_S_LDX_MEM: /* X = mem[K] */
+ PPC_MR(r_X, r_M + (K & 0xf));
+ ctx->seen |= SEEN_MEM | (1<<(K & 0xf));
+ break;
+ case BPF_S_ST: /* mem[K] = A */
+ PPC_MR(r_M + (K & 0xf), r_A);
+ ctx->seen |= SEEN_MEM | (1<<(K & 0xf));
+ break;
+ case BPF_S_STX: /* mem[K] = X */
+ PPC_MR(r_M + (K & 0xf), r_X);
+ ctx->seen |= SEEN_XREG | SEEN_MEM | (1<<(K & 0xf));
+ break;
+ case BPF_S_LD_W_LEN: /* A = skb->len; */
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
+ PPC_LWZ_OFFS(r_A, r_skb, offsetof(struct sk_buff, len));
+ break;
+ case BPF_S_LDX_W_LEN: /* X = skb->len; */
+ PPC_LWZ_OFFS(r_X, r_skb, offsetof(struct sk_buff, len));
+ break;
+
+ /*** Ancillary info loads ***/
+
+ /* None of the BPF_S_ANC* codes appear to be passed by
+ * sk_chk_filter(). The interpreter and the x86 BPF
+ * compiler implement them so we do too -- they may be
+ * planted in future.
+ */
+ case BPF_S_ANC_PROTOCOL: /* A = ntohs(skb->protocol); */
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
+ protocol) != 2);
+ PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+ protocol));
+ /* ntohs is a NOP with BE loads. */
+ break;
+ case BPF_S_ANC_IFINDEX:
+ PPC_LD_OFFS(r_scratch1, r_skb, offsetof(struct sk_buff,
+ dev));
+ PPC_CMPDI(r_scratch1, 0);
+ if (ctx->pc_ret0 != -1) {
+ PPC_BCC(COND_EQ, addrs[ctx->pc_ret0]);
+ } else {
+ /* Exit, returning 0; first pass hits here. */
+ PPC_BCC_SHORT(COND_NE, (ctx->idx*4)+12);
+ PPC_LI(r_ret, 0);
+ PPC_JMP(exit_addr);
+ }
+ BUILD_BUG_ON(FIELD_SIZEOF(struct net_device,
+ ifindex) != 4);
+ PPC_LWZ_OFFS(r_A, r_scratch1,
+ offsetof(struct net_device, ifindex));
+ break;
+ case BPF_S_ANC_MARK:
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, mark) != 4);
+ PPC_LWZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+ mark));
+ break;
+ case BPF_S_ANC_RXHASH:
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, rxhash) != 4);
+ PPC_LWZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+ rxhash));
+ break;
+ case BPF_S_ANC_QUEUE:
+ BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
+ queue_mapping) != 2);
+ PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
+ queue_mapping));
+ break;
+ case BPF_S_ANC_CPU:
+#ifdef CONFIG_SMP
+ /*
+ * PACA ptr is r13:
+ * raw_smp_processor_id() = local_paca->paca_index
+ */
+ BUILD_BUG_ON(FIELD_SIZEOF(struct paca_struct,
+ paca_index) != 2);
+ PPC_LHZ_OFFS(r_A, 13,
+ offsetof(struct paca_struct, paca_index));
+#else
+ PPC_LI(r_A, 0);
+#endif
+ break;
+
+ /*** Absolute loads from packet header/data ***/
+ case BPF_S_LD_W_ABS:
+ func = sk_load_word;
+ goto common_load;
+ case BPF_S_LD_H_ABS:
+ func = sk_load_half;
+ goto common_load;
+ case BPF_S_LD_B_ABS:
+ func = sk_load_byte;
+ common_load:
+ /*
+ * Load from [K]. Reference with the (negative)
+ * SKF_NET_OFF/SKF_LL_OFF offsets is unsupported.
+ */
+ ctx->seen |= SEEN_DATAREF;
+ if ((int)K < 0)
+ return -ENOTSUPP;
+ PPC_LI64(r_scratch1, func);
+ PPC_MTLR(r_scratch1);
+ PPC_LI32(r_addr, K);
+ PPC_BLRL();
+ /*
+ * Helper returns 'lt' condition on error, and an
+ * appropriate return value in r3
+ */
+ PPC_BCC(COND_LT, exit_addr);
+ break;
+
+ /*** Indirect loads from packet header/data ***/
+ case BPF_S_LD_W_IND:
+ func = sk_load_word;
+ goto common_load_ind;
+ case BPF_S_LD_H_IND:
+ func = sk_load_half;
+ goto common_load_ind;
+ case BPF_S_LD_B_IND:
+ func = sk_load_byte;
+ common_load_ind:
+ /*
+ * Load from [X + K]. Negative offsets are tested for
+ * in the helper functions, and result in a 'ret 0'.
+ */
+ ctx->seen |= SEEN_DATAREF | SEEN_XREG;
+ PPC_LI64(r_scratch1, func);
+ PPC_MTLR(r_scratch1);
+ PPC_ADDI(r_addr, r_X, IMM_L(K));
+ if (K >= 32768)
+ PPC_ADDIS(r_addr, r_addr, IMM_HA(K));
+ PPC_BLRL();
+ /* If error, cr0.LT set */
+ PPC_BCC(COND_LT, exit_addr);
+ break;
+
+ case BPF_S_LDX_B_MSH:
+ /*
+ * x86 version drops packet (RET 0) when K<0, whereas
+ * interpreter does allow K<0 (__load_pointer, special
+ * ancillary data). common_load returns ENOTSUPP if K<0,
+ * so we fall back to interpreter & filter works.
+ */
+ func = sk_load_byte_msh;
+ goto common_load;
+ break;
+
+ /*** Jump and branches ***/
+ case BPF_S_JMP_JA:
+ if (K != 0)
+ PPC_JMP(addrs[i + 1 + K]);
+ break;
+
+ case BPF_S_JMP_JGT_K:
+ case BPF_S_JMP_JGT_X:
+ true_cond = COND_GT;
+ goto cond_branch;
+ case BPF_S_JMP_JGE_K:
+ case BPF_S_JMP_JGE_X:
+ true_cond = COND_GE;
+ goto cond_branch;
+ case BPF_S_JMP_JEQ_K:
+ case BPF_S_JMP_JEQ_X:
+ true_cond = COND_EQ;
+ goto cond_branch;
+ case BPF_S_JMP_JSET_K:
+ case BPF_S_JMP_JSET_X:
+ true_cond = COND_NE;
+ /* Fall through */
+ cond_branch:
+ /* same targets, can avoid doing the test :) */
+ if (filter[i].jt == filter[i].jf) {
+ if (filter[i].jt > 0)
+ PPC_JMP(addrs[i + 1 + filter[i].jt]);
+ break;
+ }
+
+ switch (filter[i].code) {
+ case BPF_S_JMP_JGT_X:
+ case BPF_S_JMP_JGE_X:
+ case BPF_S_JMP_JEQ_X:
+ ctx->seen |= SEEN_XREG;
+ PPC_CMPLW(r_A, r_X);
+ break;
+ case BPF_S_JMP_JSET_X:
+ ctx->seen |= SEEN_XREG;
+ PPC_AND_DOT(r_scratch1, r_A, r_X);
+ break;
+ case BPF_S_JMP_JEQ_K:
+ case BPF_S_JMP_JGT_K:
+ case BPF_S_JMP_JGE_K:
+ if (K < 32768)
+ PPC_CMPLWI(r_A, K);
+ else {
+ PPC_LI32(r_scratch1, K);
+ PPC_CMPLW(r_A, r_scratch1);
+ }
+ break;
+ case BPF_S_JMP_JSET_K:
+ if (K < 32768)
+ /* PPC_ANDI is /only/ dot-form */
+ PPC_ANDI(r_scratch1, r_A, K);
+ else {
+ PPC_LI32(r_scratch1, K);
+ PPC_AND_DOT(r_scratch1, r_A,
+ r_scratch1);
+ }
+ break;
+ }
+ /* Sometimes branches are constructed "backward", with
+ * the false path being the branch and true path being
+ * a fallthrough to the next instruction.
+ */
+ if (filter[i].jt == 0)
+ /* Swap the sense of the branch */
+ PPC_BCC(true_cond ^ COND_CMP_TRUE,
+ addrs[i + 1 + filter[i].jf]);
+ else {
+ PPC_BCC(true_cond, addrs[i + 1 + filter[i].jt]);
+ if (filter[i].jf != 0)
+ PPC_JMP(addrs[i + 1 + filter[i].jf]);
+ }
+ break;
+ default:
+ /* The filter contains something cruel & unusual.
+ * We don't handle it, but also there shouldn't be
+ * anything missing from our list.
+ */
+ if (printk_ratelimit())
+ pr_err("BPF filter opcode %04x (@%d) unsupported\n",
+ filter[i].code, i);
+ return -ENOTSUPP;
+ }
+
+ }
+ /* Set end-of-body-code address for exit. */
+ addrs[i] = ctx->idx * 4;
+
+ return 0;
+}
+
+void bpf_jit_compile(struct sk_filter *fp)
+{
+ unsigned int proglen;
+ unsigned int alloclen;
+ u32 *image = NULL;
+ u32 *code_base;
+ unsigned int *addrs;
+ struct codegen_context cgctx;
+ int pass;
+ int flen = fp->len;
+
+ if (!bpf_jit_enable)
+ return;
+
+ addrs = kzalloc((flen+1) * sizeof(*addrs), GFP_KERNEL);
+ if (addrs == NULL)
+ return;
+
+ /*
+ * There are multiple assembly passes as the generated code will change
+ * size as it settles down, figuring out the max branch offsets/exit
+ * paths required.
+ *
+ * The range of standard conditional branches is +/- 32Kbytes. Since
+ * BPF_MAXINSNS = 4096, we can only jump from (worst case) start to
+ * finish with 8 bytes/instruction. Not feasible, so long jumps are
+ * used, distinct from short branches.
+ *
+ * Current:
+ *
+ * For now, both branch types assemble to 2 words (short branches padded
+ * with a NOP); this is less efficient, but assembly will always complete
+ * after exactly 3 passes:
+ *
+ * First pass: No code buffer; Program is "faux-generated" -- no code
+ * emitted but maximum size of output determined (and addrs[] filled
+ * in). Also, we note whether we use M[], whether we use skb data, etc.
+ * All generation choices assumed to be 'worst-case', e.g. branches all
+ * far (2 instructions), return path code reduction not available, etc.
+ *
+ * Second pass: Code buffer allocated with size determined previously.
+ * Prologue generated to support features we have seen used. Exit paths
+ * determined and addrs[] is filled in again, as code may be slightly
+ * smaller as a result.
+ *
+ * Third pass: Code generated 'for real', and branch destinations
+ * determined from now-accurate addrs[] map.
+ *
+ * Ideal:
+ *
+ * If we optimise this, near branches will be shorter. On the
+ * first assembly pass, we should err on the side of caution and
+ * generate the biggest code. On subsequent passes, branches will be
+ * generated short or long and code size will reduce. With smaller
+ * code, more branches may fall into the short category, and code will
+ * reduce more.
+ *
+ * Finally, if we see one pass generate code the same size as the
+ * previous pass we have converged and should now generate code for
+ * real. Allocating at the end will also save the memory that would
+ * otherwise be wasted by the (small) current code shrinkage.
+ * Preferably, we should do a small number of passes (e.g. 5) and if we
+ * haven't converged by then, get impatient and force code to generate
+ * as-is, even if the odd branch would be left long. The chances of a
+ * long jump are tiny with all but the most enormous of BPF filter
+ * inputs, so we should usually converge on the third pass.
+ */
+
+ cgctx.idx = 0;
+ cgctx.seen = 0;
+ cgctx.pc_ret0 = -1;
+ /* Scouting faux-generate pass 0 */
+ if (bpf_jit_build_body(fp, 0, &cgctx, addrs))
+ /* We hit something illegal or unsupported. */
+ goto out;
+
+ /*
+ * Pretend to build prologue, given the features we've seen. This will
+ * update cgctx.idx as it pretends to output instructions, then we can
+ * calculate total size from idx.
+ */
+ bpf_jit_build_prologue(fp, 0, &cgctx);
+ bpf_jit_build_epilogue(0, &cgctx);
+
+ proglen = cgctx.idx * 4;
+ alloclen = proglen + FUNCTION_DESCR_SIZE;
+ image = module_alloc(max_t(unsigned int, alloclen,
+ sizeof(struct work_struct)));
+ if (!image)
+ goto out;
+
+ code_base = image + (FUNCTION_DESCR_SIZE/4);
+
+ /* Code generation passes 1-2 */
+ for (pass = 1; pass < 3; pass++) {
+ /* Now build the prologue, body code & epilogue for real. */
+ cgctx.idx = 0;
+ bpf_jit_build_prologue(fp, code_base, &cgctx);
+ bpf_jit_build_body(fp, code_base, &cgctx, addrs);
+ bpf_jit_build_epilogue(code_base, &cgctx);
+
+ if (bpf_jit_enable > 1)
+ pr_info("Pass %d: shrink = %d, seen = 0x%x\n", pass,
+ proglen - (cgctx.idx * 4), cgctx.seen);
+ }
+
+ if (bpf_jit_enable > 1)
+ pr_info("flen=%d proglen=%u pass=%d image=%p\n",
+ flen, proglen, pass, image);
+
+ if (image) {
+ if (bpf_jit_enable > 1)
+ print_hex_dump(KERN_ERR, "JIT code: ",
+ DUMP_PREFIX_ADDRESS,
+ 16, 1, code_base,
+ proglen, false);
+
+ bpf_flush_icache(code_base, code_base + (proglen/4));
+ /* Function descriptor nastiness: Address + TOC */
+ ((u64 *)image)[0] = (u64)code_base;
+ ((u64 *)image)[1] = local_paca->kernel_toc;
+ fp->bpf_func = (void *)image;
+ }
+out:
+ kfree(addrs);
+ return;
+}
+
+static void jit_free_defer(struct work_struct *arg)
+{
+ module_free(NULL, arg);
+}
+
+/* run from softirq, we must use a work_struct to call
+ * module_free() from process context
+ */
+void bpf_jit_free(struct sk_filter *fp)
+{
+ if (fp->bpf_func != sk_run_filter) {
+ struct work_struct *work = (struct work_struct *)fp->bpf_func;
+
+ INIT_WORK(work, jit_free_defer);
+ schedule_work(work);
+ }
+}
^ permalink raw reply related
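For readers following the BPF_S_ALU_DIV_K case above: the single MULHWU only divides if K already holds a precomputed reciprocal, which is what the "A = reciprocal_divide(A, K)" comment suggests. Below is a minimal sketch of that idea in plain C, not code from the patch. The names reciprocal_value_sketch and reciprocal_divide_sketch are made up for illustration, the sketch assumes k > 1 so the reciprocal fits in 32 bits, and the quotient is an approximation that is not guaranteed exact for every dividend.

```c
#include <stdint.h>

/* Precompute R = ceil(2^32 / k). Assumes k > 1 so that R fits in 32 bits. */
static uint32_t reciprocal_value_sketch(uint32_t k)
{
	uint64_t val = (1ULL << 32) + (k - 1);

	return (uint32_t)(val / k);
}

/*
 * Approximate a / k as the top 32 bits of the 64-bit product a * R.
 * This is the operation a "multiply high word unsigned" instruction
 * performs in one step.
 */
static uint32_t reciprocal_divide_sketch(uint32_t a, uint32_t r)
{
	return (uint32_t)(((uint64_t)a * r) >> 32);
}
```

With these helpers, reciprocal_divide_sketch(100, reciprocal_value_sketch(7)) yields 14, trading a divide instruction at filter run time for one multiply.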
* [PATCH 4/4 net-next] bnx2: Fix endian swapping on firmware version string
From: Michael Chan @ 2011-07-21 0:55 UTC (permalink / raw)
To: davem; +Cc: netdev
In-Reply-To: <1311209725-10414-3-git-send-email-mchan@broadcom.com>
so that ethtool -i will display it correctly on big endian systems.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
---
drivers/net/bnx2.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 3ad9b70..4b2b570 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -56,8 +56,8 @@
#include "bnx2_fw.h"
#define DRV_MODULE_NAME "bnx2"
-#define DRV_MODULE_VERSION "2.1.10"
-#define DRV_MODULE_RELDATE "July 12, 2011"
+#define DRV_MODULE_VERSION "2.1.11"
+#define DRV_MODULE_RELDATE "July 20, 2011"
#define FW_MIPS_FILE_06 "bnx2/bnx2-mips-06-6.2.1.fw"
#define FW_RV2P_FILE_06 "bnx2/bnx2-rv2p-06-6.0.15.fw"
#define FW_MIPS_FILE_09 "bnx2/bnx2-mips-09-6.2.1a.fw"
@@ -8097,7 +8097,7 @@ bnx2_init_board(struct pci_dev *pdev, struct net_device *dev)
bp->fw_version[j++] = ' ';
for (i = 0; i < 3 && j < 28; i++) {
reg = bnx2_reg_rd_ind(bp, addr + i * 4);
- reg = swab32(reg);
+ reg = be32_to_cpu(reg);
memcpy(&bp->fw_version[j], &reg, 4);
j += 4;
}
--
1.6.4.GIT
^ permalink raw reply related
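To illustrate why the bnx2 change above matters: an unconditional swap only gives the right byte order on little-endian hosts, while a conversion that depends on host endianness works on both. The following is a rough userspace sketch, not driver code. All function names are hypothetical stand-ins for swab32() and be32_to_cpu(), and it assumes each register word carries its first character in the most significant byte.

```c
#include <stdint.h>
#include <string.h>

/* Unconditional byte swap, analogous to swab32(). */
static uint32_t swab32_sketch(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
	uint16_t probe = 1;
	uint8_t first;

	memcpy(&first, &probe, 1);
	return first == 1;
}

/* Swap only on little-endian hosts, analogous to be32_to_cpu(). */
static uint32_t be32_to_cpu_sketch(uint32_t v)
{
	return host_is_little_endian() ? swab32_sketch(v) : v;
}

/*
 * Copy the four characters packed in reg into out, in string order.
 * Assumes the first character sits in the most significant byte.
 */
static void unpack_version_word(uint32_t reg, char out[4])
{
	uint32_t ordered = be32_to_cpu_sketch(reg);

	memcpy(out, &ordered, 4);
}
```

On a big-endian host the word is already in string order in memory, so swapping it unconditionally would reverse the characters; the conditional form leaves it alone.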
* [PATCH 3/4 net-next] cnic: Add VLAN ID as a parameter during netevent upcall
From: Michael Chan @ 2011-07-21 0:55 UTC (permalink / raw)
To: davem; +Cc: netdev
In-Reply-To: <1311209725-10414-2-git-send-email-mchan@broadcom.com>
The bnx2fc driver needs to handle netdev events on VLAN devices.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
---
drivers/net/cnic.c | 50 +++++++++++++++++++++++++-----------
drivers/net/cnic_if.h | 6 ++--
drivers/scsi/bnx2fc/bnx2fc_fcoe.c | 8 +++++-
drivers/scsi/bnx2i/bnx2i_hwi.c | 8 +++++-
4 files changed, 52 insertions(+), 20 deletions(-)
diff --git a/drivers/net/cnic.c b/drivers/net/cnic.c
index 9be0c26..94a2e54 100644
--- a/drivers/net/cnic.c
+++ b/drivers/net/cnic.c
@@ -5334,6 +5334,27 @@ static struct cnic_dev *is_cnic_dev(struct net_device *dev)
return cdev;
}
+static void cnic_rcv_netevent(struct cnic_local *cp, unsigned long event,
+ u16 vlan_id)
+{
+ int if_type;
+
+ rcu_read_lock();
+ for (if_type = 0; if_type < MAX_CNIC_ULP_TYPE; if_type++) {
+ struct cnic_ulp_ops *ulp_ops;
+ void *ctx;
+
+ ulp_ops = rcu_dereference(cp->ulp_ops[if_type]);
+ if (!ulp_ops || !ulp_ops->indicate_netevent)
+ continue;
+
+ ctx = cp->ulp_handle[if_type];
+
+ ulp_ops->indicate_netevent(ctx, event, vlan_id);
+ }
+ rcu_read_unlock();
+}
+
/**
* netdev event handler
*/
@@ -5342,7 +5363,6 @@ static int cnic_netdev_event(struct notifier_block *this, unsigned long event,
{
struct net_device *netdev = ptr;
struct cnic_dev *dev;
- int if_type;
int new_dev = 0;
dev = cnic_from_netdev(netdev);
@@ -5372,20 +5392,7 @@ static int cnic_netdev_event(struct notifier_block *this, unsigned long event,
cnic_ulp_start(dev);
}
- rcu_read_lock();
- for (if_type = 0; if_type < MAX_CNIC_ULP_TYPE; if_type++) {
- struct cnic_ulp_ops *ulp_ops;
- void *ctx;
-
- ulp_ops = rcu_dereference(cp->ulp_ops[if_type]);
- if (!ulp_ops || !ulp_ops->indicate_netevent)
- continue;
-
- ctx = cp->ulp_handle[if_type];
-
- ulp_ops->indicate_netevent(ctx, event);
- }
- rcu_read_unlock();
+ cnic_rcv_netevent(cp, event, 0);
if (event == NETDEV_GOING_DOWN) {
cnic_ulp_stop(dev);
@@ -5401,6 +5408,19 @@ static int cnic_netdev_event(struct notifier_block *this, unsigned long event,
goto done;
}
cnic_put(dev);
+ } else {
+ struct net_device *realdev;
+ u16 vid;
+
+ vid = cnic_get_vlan(netdev, &realdev);
+ if (realdev) {
+ dev = cnic_from_netdev(realdev);
+ if (dev) {
+ vid |= VLAN_TAG_PRESENT;
+ cnic_rcv_netevent(dev->cnic_priv, event, vid);
+ cnic_put(dev);
+ }
+ }
}
done:
return NOTIFY_DONE;
diff --git a/drivers/net/cnic_if.h b/drivers/net/cnic_if.h
index 642b9d5..fd54a6a 100644
--- a/drivers/net/cnic_if.h
+++ b/drivers/net/cnic_if.h
@@ -12,8 +12,8 @@
#ifndef CNIC_IF_H
#define CNIC_IF_H
-#define CNIC_MODULE_VERSION "2.5.6"
-#define CNIC_MODULE_RELDATE "July 12, 2011"
+#define CNIC_MODULE_VERSION "2.5.7"
+#define CNIC_MODULE_RELDATE "July 20, 2011"
#define CNIC_ULP_RDMA 0
#define CNIC_ULP_ISCSI 1
@@ -313,7 +313,7 @@ struct cnic_ulp_ops {
void (*cnic_stop)(void *ulp_ctx);
void (*indicate_kcqes)(void *ulp_ctx, struct kcqe *cqes[],
u32 num_cqes);
- void (*indicate_netevent)(void *ulp_ctx, unsigned long event);
+ void (*indicate_netevent)(void *ulp_ctx, unsigned long event, u16 vid);
void (*cm_connect_complete)(struct cnic_sock *);
void (*cm_close_complete)(struct cnic_sock *);
void (*cm_abort_complete)(struct cnic_sock *);
diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index 7a16ca1..9eebaeb 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -767,17 +767,23 @@ static void bnx2fc_destroy_timer(unsigned long data)
*
* @context: adapter structure pointer
* @event: event type
+ * @vlan_id: vlan id - associated vlan id with this event
*
* Handles NETDEV_UP, NETDEV_DOWN, NETDEV_GOING_DOWN,NETDEV_CHANGE and
* NETDEV_CHANGE_MTU events
*/
-static void bnx2fc_indicate_netevent(void *context, unsigned long event)
+static void bnx2fc_indicate_netevent(void *context, unsigned long event,
+ u16 vlan_id)
{
struct bnx2fc_hba *hba = (struct bnx2fc_hba *)context;
struct fc_lport *lport = hba->ctlr.lp;
struct fc_lport *vport;
u32 link_possible = 1;
+ /* Ignore vlans for now */
+ if (vlan_id != 0)
+ return;
+
if (!test_bit(BNX2FC_CREATE_DONE, &hba->init_done)) {
BNX2FC_MISC_DBG("driver not ready. event=%s %ld\n",
hba->netdev->name, event);
diff --git a/drivers/scsi/bnx2i/bnx2i_hwi.c b/drivers/scsi/bnx2i/bnx2i_hwi.c
index 5c54a2d..372d30c 100644
--- a/drivers/scsi/bnx2i/bnx2i_hwi.c
+++ b/drivers/scsi/bnx2i/bnx2i_hwi.c
@@ -2386,14 +2386,20 @@ static void bnx2i_indicate_kcqe(void *context, struct kcqe *kcqe[],
* bnx2i_indicate_netevent - Generic netdev event handler
* @context: adapter structure pointer
* @event: event type
+ * @vlan_id: vlan id - associated vlan id with this event
*
* Handles four netdev events, NETDEV_UP, NETDEV_DOWN,
* NETDEV_GOING_DOWN and NETDEV_CHANGE
*/
-static void bnx2i_indicate_netevent(void *context, unsigned long event)
+static void bnx2i_indicate_netevent(void *context, unsigned long event,
+ u16 vlan_id)
{
struct bnx2i_hba *hba = context;
+ /* Ignore all netevent coming from vlans */
+ if (vlan_id != 0)
+ return;
+
switch (event) {
case NETDEV_UP:
if (!test_bit(ADAPTER_STATE_UP, &hba->adapter_state))
--
1.6.4.GIT
^ permalink raw reply related
* [PATCH 1/4 net-next] cnic: Fix Context ID space calculation
From: Michael Chan @ 2011-07-21 0:55 UTC (permalink / raw)
To: davem; +Cc: netdev
Include FCoE CID space only for E2_PLUS devices. Remove old CID
offset adjustments that are no longer needed.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
---
drivers/net/cnic.c | 10 +---------
1 files changed, 1 insertions(+), 9 deletions(-)
diff --git a/drivers/net/cnic.c b/drivers/net/cnic.c
index ea75f65..455fd0d 100644
--- a/drivers/net/cnic.c
+++ b/drivers/net/cnic.c
@@ -1172,7 +1172,7 @@ static int cnic_alloc_bnx2x_resc(struct cnic_dev *dev)
cp->iro_arr = ethdev->iro_arr;
- cp->max_cid_space = MAX_ISCSI_TBL_SZ + BNX2X_FCOE_NUM_CONNECTIONS;
+ cp->max_cid_space = MAX_ISCSI_TBL_SZ;
cp->iscsi_start_cid = start_cid;
cp->fcoe_start_cid = start_cid + MAX_ISCSI_TBL_SZ;
@@ -1183,14 +1183,6 @@ static int cnic_alloc_bnx2x_resc(struct cnic_dev *dev)
cp->fcoe_init_cid = 0x10;
}
- if (start_cid < BNX2X_ISCSI_START_CID) {
- u32 delta = BNX2X_ISCSI_START_CID - start_cid;
-
- cp->iscsi_start_cid = BNX2X_ISCSI_START_CID;
- cp->fcoe_start_cid += delta;
- cp->max_cid_space += delta;
- }
-
cp->iscsi_tbl = kzalloc(sizeof(struct cnic_iscsi) * MAX_ISCSI_TBL_SZ,
GFP_KERNEL);
if (!cp->iscsi_tbl)
--
1.6.4.GIT
^ permalink raw reply related
* [PATCH 2/4 net-next] cnic: Wait for all Context IDs to be deleted before sending FCOE_DESTROY_FUNC
From: Michael Chan @ 2011-07-21 0:55 UTC (permalink / raw)
To: davem; +Cc: netdev
In-Reply-To: <1311209725-10414-1-git-send-email-mchan@broadcom.com>
Otherwise, the firmware will not respond and we'll have to wait for
timeout. Refactor the wait loop we already have into a separate
function for this purpose.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
---
drivers/net/cnic.c | 45 +++++++++++++++++++++++++++------------------
1 files changed, 27 insertions(+), 18 deletions(-)
diff --git a/drivers/net/cnic.c b/drivers/net/cnic.c
index 455fd0d..9be0c26 100644
--- a/drivers/net/cnic.c
+++ b/drivers/net/cnic.c
@@ -2448,6 +2448,30 @@ static int cnic_bnx2x_fcoe_destroy(struct cnic_dev *dev, struct kwqe *kwqe)
return ret;
}
+static void cnic_bnx2x_delete_wait(struct cnic_dev *dev, u32 start_cid)
+{
+ struct cnic_local *cp = dev->cnic_priv;
+ u32 i;
+
+ for (i = start_cid; i < cp->max_cid_space; i++) {
+ struct cnic_context *ctx = &cp->ctx_tbl[i];
+ int j;
+
+ while (test_bit(CTX_FL_DELETE_WAIT, &ctx->ctx_flags))
+ msleep(10);
+
+ for (j = 0; j < 5; j++) {
+ if (!test_bit(CTX_FL_OFFLD_START, &ctx->ctx_flags))
+ break;
+ msleep(20);
+ }
+
+ if (test_bit(CTX_FL_OFFLD_START, &ctx->ctx_flags))
+ netdev_warn(dev->netdev, "CID %x not deleted\n",
+ ctx->cid);
+ }
+}
+
static int cnic_bnx2x_fcoe_fw_destroy(struct cnic_dev *dev, struct kwqe *kwqe)
{
struct fcoe_kwqe_destroy *req;
@@ -2456,6 +2480,8 @@ static int cnic_bnx2x_fcoe_fw_destroy(struct cnic_dev *dev, struct kwqe *kwqe)
int ret;
u32 cid;
+ cnic_bnx2x_delete_wait(dev, MAX_ISCSI_TBL_SZ);
+
req = (struct fcoe_kwqe_destroy *) kwqe;
cid = BNX2X_HW_CID(cp, cp->fcoe_init_cid);
@@ -3930,7 +3956,6 @@ static void cnic_close_bnx2x_conn(struct cnic_sock *csk, u32 opcode)
static void cnic_cm_stop_bnx2x_hw(struct cnic_dev *dev)
{
struct cnic_local *cp = dev->cnic_priv;
- int i;
if (!cp->ctx_tbl)
return;
@@ -3938,23 +3963,7 @@ static void cnic_cm_stop_bnx2x_hw(struct cnic_dev *dev)
if (!netif_running(dev->netdev))
return;
- for (i = 0; i < cp->max_cid_space; i++) {
- struct cnic_context *ctx = &cp->ctx_tbl[i];
- int j;
-
- while (test_bit(CTX_FL_DELETE_WAIT, &ctx->ctx_flags))
- msleep(10);
-
- for (j = 0; j < 5; j++) {
- if (!test_bit(CTX_FL_OFFLD_START, &ctx->ctx_flags))
- break;
- msleep(20);
- }
-
- if (test_bit(CTX_FL_OFFLD_START, &ctx->ctx_flags))
- netdev_warn(dev->netdev, "CID %x not deleted\n",
- ctx->cid);
- }
+ cnic_bnx2x_delete_wait(dev, 0);
cancel_delayed_work(&cp->delete_task);
flush_workqueue(cnic_wq);
--
1.6.4.GIT
^ permalink raw reply related
* Re: [PATCH 1/2] igb: Allow extra 4 bytes on RX for vlan tags.
From: Jeff Kirsher @ 2011-07-21 1:21 UTC (permalink / raw)
To: Ben Greear; +Cc: Jesse Gross, netdev@vger.kernel.org, Duyck, Alexander H
In-Reply-To: <4E277267.8090702@candelatech.com>
[-- Attachment #1: Type: text/plain, Size: 3800 bytes --]
On Wed, 2011-07-20 at 17:27 -0700, Ben Greear wrote:
> On 07/20/2011 05:18 PM, Jesse Gross wrote:
> > On Thu, Feb 17, 2011 at 9:28 AM, Ben Greear<greearb@candelatech.com> wrote:
> >> On 02/17/2011 03:04 AM, Jeff Kirsher wrote:
> >>>
> >>> On Thu, Feb 10, 2011 at 13:59,<greearb@candelatech.com> wrote:
> >>>>
> >>>> From: Ben Greear<greearb@candelatech.com>
> >>>>
> >>>> This allows the NIC to receive 1518 byte (not counting
> >>>> FCS) packets when MTU is 1500, thus allowing 1500 MTU
> >>>> VLAN frames to be received. Please note that no VLANs
> >>>> were actually configured on the NIC...it was just acting
> >>>> as pass-through device.
> >>>>
> >>>> Signed-off-by: Ben Greear<greearb@candelatech.com>
> >>>> ---
> >>>> :100644 100644 58c665b... 30c9cc6... M drivers/net/igb/igb_main.c
> >>>> drivers/net/igb/igb_main.c | 5 +++--
> >>>> 1 files changed, 3 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
> >>>> index 58c665b..30c9cc6 100644
> >>>> --- a/drivers/net/igb/igb_main.c
> >>>> +++ b/drivers/net/igb/igb_main.c
> >>>> @@ -2281,7 +2281,8 @@ static int __devinit igb_sw_init(struct igb_adapter
> >>>> *adapter)
> >>>> adapter->rx_itr_setting = IGB_DEFAULT_ITR;
> >>>> adapter->tx_itr_setting = IGB_DEFAULT_ITR;
> >>>>
> >>>> - adapter->max_frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN;
> >>>> + adapter->max_frame_size = (netdev->mtu + ETH_HLEN + ETH_FCS_LEN
> >>>> + + VLAN_HLEN);
> >>>> adapter->min_frame_size = ETH_ZLEN + ETH_FCS_LEN;
> >>>>
> >>>> spin_lock_init(&adapter->stats64_lock);
> >>>> @@ -4303,7 +4304,7 @@ static int igb_change_mtu(struct net_device
> >>>> *netdev, int new_mtu)
> >>>> {
> >>>> struct igb_adapter *adapter = netdev_priv(netdev);
> >>>> struct pci_dev *pdev = adapter->pdev;
> >>>> - int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN;
> >>>> + int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
> >>>> u32 rx_buffer_len, i;
> >>>>
> >>>> if ((new_mtu< 68) || (max_frame> MAX_JUMBO_FRAME_SIZE)) {
> >>>
> >>> While testing this patch, validation found that the patch reduces the
> >>> maximum mtu size
> >>> by 4 bytes (reduces it from 9216 to 9212). This is not a desired side
> >>> effect of this patch.
> >>
> >> You could add handling for that case and have it act as it used to when
> >> new_mtu is greater than 9212?
> >>
> >> I tested e1000e and it worked w/out hacking at 1500 MTU, so maybe
> >> check how it does it?
> >
> > I just wanted to bring this up again to see if any progress had been
> > made. We were looking at this driver and trying to figure out the
> > best way to convert it to use the new vlan model but I'm not familiar
>
> I've been watching :)
>
> > enough with the hardware to know. It seems that all of the other
> > Intel drivers unconditionally add space for the vlan tag to the
> > receive buffer (and would therefore have similar effects as this
> > patch), is there something different about this card?
> >
> > I believe that Alex was working on something in this area (in the
> > context of one of my patches from a long time ago) but I'm not sure
> > what came of that.
>
> Truth is, I don't really see why it's a problem to decrease the
> maximum MTU slightly in order to make it work with VLANs.
>
> I'm not sure if there is some way to make it work with VLANs
> and not decrease the maximum MTU.
This was the reason this did not get accepted. I was looking into what
could be done so that we did not decrease the maximum MTU, but I got
side-tracked and have not done anything on it in several months.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 490 bytes --]
^ permalink raw reply
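The arithmetic behind the 9216 vs. 9212 discussion in this thread can be sketched as follows. This is only an illustration, not igb code. The header lengths are the usual Ethernet values (14-byte header, 4-byte FCS, 4-byte VLAN tag). The frame limit of 9234 is an assumption inferred from the numbers quoted in the thread (9216 + 14 + 4), and the helper names are hypothetical.

```c
#define ETH_HLEN_SKETCH      14   /* Ethernet header length */
#define ETH_FCS_LEN_SKETCH    4   /* frame check sequence */
#define VLAN_HLEN_SKETCH      4   /* one 802.1Q tag */
#define MAX_FRAME_SKETCH   9234   /* assumed limit: 9216 + 14 + 4 */

/* Frame size needed for a given MTU, optionally reserving room for a VLAN tag. */
static int max_frame_for_mtu(int mtu, int reserve_vlan)
{
	int frame = mtu + ETH_HLEN_SKETCH + ETH_FCS_LEN_SKETCH;

	if (reserve_vlan)
		frame += VLAN_HLEN_SKETCH;
	return frame;
}

/* Largest MTU whose frame still fits under the assumed limit. */
static int largest_mtu(int reserve_vlan)
{
	int overhead = max_frame_for_mtu(0, reserve_vlan);

	return MAX_FRAME_SKETCH - overhead;
}
```

Under these assumptions, reserving the tag unconditionally lets a 1500 MTU port accept 1522-byte frames (1518 not counting FCS), at the cost of lowering the largest configurable MTU by the same 4 bytes.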
* Re: Just one more byte, it is wafer thin...
From: Rick Jones @ 2011-07-21 0:52 UTC (permalink / raw)
To: netdev
In-Reply-To: <4E2764A0.90003@hp.com>
On 07/20/2011 04:28 PM, Rick Jones wrote:
> and got a somewhat unexpected result - I've no idea why then they both
> went up - perhaps it was sensing "high" occasionally even in the 4344
> byte request case.
That would seem to be the case? Back to defaults, ./configure'd netperf
with --enable-demo and have it print-out interim results every 250
milliseconds (or so)
root@use111814x:~/netperf-2.5.0# HDR="-P 1";for r in 4344 4345; do
netperf -D 0.25 -H mumble.3.21 -t TCP_RR $HDR -- -r ${r},1; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET
to mumble.3.21 (mumble.3.21) port 0 AF_INET : histogram : demo : first
burst 0
Interim result: 5332.90 Trans/s over 0.28 seconds ending at 1311209347.312
Interim result: 6867.20 Trans/s over 0.25 seconds ending at 1311209347.562
Interim result: 14475.52 Trans/s over 0.25 seconds ending at 1311209347.813
Interim result: 14513.50 Trans/s over 0.25 seconds ending at 1311209348.063
Interim result: 14528.00 Trans/s over 0.25 seconds ending at 1311209348.313
Interim result: 8245.53 Trans/s over 0.44 seconds ending at 1311209348.753
Interim result: 13523.73 Trans/s over 0.25 seconds ending at 1311209349.003
Interim result: 13310.17 Trans/s over 0.26 seconds ending at 1311209349.259
Interim result: 8303.74 Trans/s over 0.40 seconds ending at 1311209349.660
Interim result: 14202.24 Trans/s over 0.25 seconds ending at 1311209349.910
Interim result: 8124.76 Trans/s over 0.44 seconds ending at 1311209350.347
Interim result: 14495.59 Trans/s over 0.25 seconds ending at 1311209350.597
Interim result: 14505.91 Trans/s over 0.25 seconds ending at 1311209350.847
Interim result: 13338.19 Trans/s over 0.27 seconds ending at 1311209351.119
Interim result: 7280.44 Trans/s over 0.46 seconds ending at 1311209351.577
Interim result: 14002.71 Trans/s over 0.25 seconds ending at 1311209351.827
Interim result: 6661.47 Trans/s over 0.53 seconds ending at 1311209352.353
Interim result: 4069.30 Trans/s over 0.41 seconds ending at 1311209352.762
Interim result: 10444.77 Trans/s over 0.35 seconds ending at 1311209353.110
Interim result: 9013.21 Trans/s over 0.29 seconds ending at 1311209353.399
Interim result: 6480.59 Trans/s over 0.35 seconds ending at 1311209353.747
Interim result: 13245.09 Trans/s over 0.25 seconds ending at 1311209353.997
Interim result: 12205.48 Trans/s over 0.30 seconds ending at 1311209354.294
Interim result: 5592.64 Trans/s over 0.55 seconds ending at 1311209354.840
Interim result: 6142.67 Trans/s over 0.59 seconds ending at 1311209355.430
Interim result: 11084.00 Trans/s over 0.25 seconds ending at 1311209355.680
Interim result: 14511.18 Trans/s over 0.25 seconds ending at 1311209355.930
Interim result: 14475.35 Trans/s over 0.25 seconds ending at 1311209356.181
Interim result: 7893.58 Trans/s over 0.46 seconds ending at 1311209356.639
Interim result: 14176.00 Trans/s over 0.25 seconds ending at 1311209356.889
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec
16384 87380 4344 1 10.00 9907.27
16384 87380
now the 4345 byte request:
Interim result: 8712.99 Trans/s over 0.37 seconds ending at 1311209357.406
Interim result: 3344.24 Trans/s over 0.65 seconds ending at 1311209358.057
Interim result: 3495.28 Trans/s over 0.25 seconds ending at 1311209358.308
Interim result: 3457.05 Trans/s over 0.25 seconds ending at 1311209358.561
Interim result: 3315.55 Trans/s over 0.26 seconds ending at 1311209358.821
Interim result: 3340.47 Trans/s over 0.25 seconds ending at 1311209359.072
Interim result: 3343.81 Trans/s over 0.25 seconds ending at 1311209359.322
Interim result: 3373.45 Trans/s over 0.25 seconds ending at 1311209359.572
Interim result: 3292.31 Trans/s over 0.26 seconds ending at 1311209359.828
Interim result: 3328.17 Trans/s over 0.25 seconds ending at 1311209360.079
Interim result: 3373.07 Trans/s over 0.25 seconds ending at 1311209360.329
Interim result: 3431.75 Trans/s over 0.25 seconds ending at 1311209360.579
Interim result: 3324.45 Trans/s over 0.26 seconds ending at 1311209360.837
Interim result: 3347.82 Trans/s over 0.25 seconds ending at 1311209361.087
Interim result: 3327.10 Trans/s over 0.25 seconds ending at 1311209361.338
Interim result: 3337.22 Trans/s over 0.25 seconds ending at 1311209361.589
Interim result: 3444.56 Trans/s over 0.25 seconds ending at 1311209361.839
Interim result: 3336.91 Trans/s over 0.26 seconds ending at 1311209362.097
Interim result: 3323.07 Trans/s over 0.25 seconds ending at 1311209362.348
Interim result: 3422.15 Trans/s over 0.25 seconds ending at 1311209362.598
Interim result: 3327.81 Trans/s over 0.26 seconds ending at 1311209362.855
Interim result: 3312.43 Trans/s over 0.25 seconds ending at 1311209363.106
Interim result: 3346.22 Trans/s over 0.25 seconds ending at 1311209363.356
Interim result: 3426.75 Trans/s over 0.25 seconds ending at 1311209363.606
Interim result: 3304.44 Trans/s over 0.26 seconds ending at 1311209363.866
Interim result: 3466.26 Trans/s over 0.25 seconds ending at 1311209364.116
Interim result: 3299.97 Trans/s over 0.26 seconds ending at 1311209364.379
Interim result: 3360.99 Trans/s over 0.25 seconds ending at 1311209364.629
Interim result: 3402.76 Trans/s over 0.25 seconds ending at 1311209364.879
Interim result: 3389.28 Trans/s over 0.25 seconds ending at 1311209365.130
Interim result: 3360.94 Trans/s over 0.25 seconds ending at 1311209365.382
Interim result: 3319.58 Trans/s over 0.25 seconds ending at 1311209365.635
Interim result: 3440.41 Trans/s over 0.25 seconds ending at 1311209365.886
Interim result: 3386.75 Trans/s over 0.25 seconds ending at 1311209366.140
Interim result: 3337.23 Trans/s over 0.25 seconds ending at 1311209366.393
Interim result: 3329.40 Trans/s over 0.25 seconds ending at 1311209366.644
Interim result: 3328.29 Trans/s over 0.25 seconds ending at 1311209366.894
16384 87380 4345 1 10.00 3560.55
16384 87380
Still, is this suggesting that perhaps the adaptive
> bits are being a bit too aggressive about sensing high? Over what
> interval is that measurement supposed to be happening?
>
> rick jones
* Re: [PATCH 1/2] igb: Allow extra 4 bytes on RX for vlan tags.
From: Ben Greear @ 2011-07-21 0:27 UTC (permalink / raw)
To: Jesse Gross; +Cc: Jeff Kirsher, netdev, Duyck, Alexander H
In-Reply-To: <CAEP_g=_o7nP2FCnPufyT44JZZ+Z+jkcVpSa8Qj3mjPn+MwRhXA@mail.gmail.com>
On 07/20/2011 05:18 PM, Jesse Gross wrote:
> On Thu, Feb 17, 2011 at 9:28 AM, Ben Greear<greearb@candelatech.com> wrote:
>> On 02/17/2011 03:04 AM, Jeff Kirsher wrote:
>>>
>>> On Thu, Feb 10, 2011 at 13:59,<greearb@candelatech.com> wrote:
>>>>
>>>> From: Ben Greear<greearb@candelatech.com>
>>>>
>>>> This allows the NIC to receive 1518 byte (not counting
>>>> FCS) packets when MTU is 1500, thus allowing 1500 MTU
>>>> VLAN frames to be received. Please note that no VLANs
>>>> were actually configured on the NIC...it was just acting
>>>> as pass-through device.
>>>>
>>>> Signed-off-by: Ben Greear<greearb@candelatech.com>
>>>> ---
>>>> :100644 100644 58c665b... 30c9cc6... M drivers/net/igb/igb_main.c
>>>> drivers/net/igb/igb_main.c | 5 +++--
>>>> 1 files changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
>>>> index 58c665b..30c9cc6 100644
>>>> --- a/drivers/net/igb/igb_main.c
>>>> +++ b/drivers/net/igb/igb_main.c
>>>> @@ -2281,7 +2281,8 @@ static int __devinit igb_sw_init(struct igb_adapter
>>>> *adapter)
>>>> adapter->rx_itr_setting = IGB_DEFAULT_ITR;
>>>> adapter->tx_itr_setting = IGB_DEFAULT_ITR;
>>>>
>>>> - adapter->max_frame_size = netdev->mtu + ETH_HLEN + ETH_FCS_LEN;
>>>> + adapter->max_frame_size = (netdev->mtu + ETH_HLEN + ETH_FCS_LEN
>>>> + + VLAN_HLEN);
>>>> adapter->min_frame_size = ETH_ZLEN + ETH_FCS_LEN;
>>>>
>>>> spin_lock_init(&adapter->stats64_lock);
>>>> @@ -4303,7 +4304,7 @@ static int igb_change_mtu(struct net_device
>>>> *netdev, int new_mtu)
>>>> {
>>>> struct igb_adapter *adapter = netdev_priv(netdev);
>>>> struct pci_dev *pdev = adapter->pdev;
>>>> - int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN;
>>>> + int max_frame = new_mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
>>>> u32 rx_buffer_len, i;
>>>>
>>>> if ((new_mtu< 68) || (max_frame> MAX_JUMBO_FRAME_SIZE)) {
>>>
>>> While testing this patch, validation found that the patch reduces the
>>> maximum mtu size
>>> by 4 bytes (reduces it from 9216 to 9212). This is not a desired side
>>> effect of this patch.
>>
>> You could add handling for that case and have it act as it used to when
>> new_mtu is greater than 9212?
>>
>> I tested e1000e and it worked w/out hacking at 1500 MTU, so maybe
>> check how it does it?
>
> I just wanted to bring this up again to see if any progress had been
> made. We were looking at this driver and trying to figure out the
> best way to convert it to use the new vlan model but I'm not familiar
I've been watching :)
> enough with the hardware to know. It seems that all of the other
> Intel drivers unconditionally add space for the vlan tag to the
> receive buffer (and would therefore have similar effects as this
> patch), is there something different about this card?
>
> I believe that Alex was working on something in this area (in the
> context of one of my patches from a long time ago) but I'm not sure
> what came of that.
Truth is, I don't really see why it's a problem to decrease the
maximum MTU slightly in order to make it work with VLANs.
I'm not sure if there is some way to make it work with VLANs
and not decrease the maximum MTU.
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
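
The MTU trade-off discussed in this message can be checked with a little arithmetic. The sketch below is not driver code: the two helper functions mirror the old and patched max_frame expressions from the quoted diff, and FRAME_LIMIT is a hypothetical name for a frame-size ceiling that is only inferred from the 9216-byte maximum MTU reported by validation (9216 + 14 + 4 = 9234). The real constant in the igb driver may differ.

```c
/* Header sizes with the values used by the Linux kernel headers. */
#define ETH_HLEN    14  /* Ethernet header: dst MAC + src MAC + ethertype */
#define ETH_FCS_LEN  4  /* frame check sequence (CRC) */
#define VLAN_HLEN    4  /* one 802.1Q tag */

/* Assumption: ceiling inferred from the thread's 9216-byte maximum MTU,
 * not taken from the igb sources. */
#define FRAME_LIMIT 9234

/* max_frame as computed before the patch. */
int max_frame_old(int mtu)
{
	return mtu + ETH_HLEN + ETH_FCS_LEN;
}

/* max_frame as computed by the patch: room for one VLAN tag is added. */
int max_frame_new(int mtu)
{
	return mtu + ETH_HLEN + ETH_FCS_LEN + VLAN_HLEN;
}

/* Largest MTU that still passes a "max_frame > FRAME_LIMIT" rejection
 * test. frame_size(0) is the fixed per-frame overhead. */
int largest_mtu(int (*frame_size)(int))
{
	return FRAME_LIMIT - frame_size(0);
}
```

Under these assumptions, largest_mtu(max_frame_old) is 9216 and largest_mtu(max_frame_new) is 9212, which matches the 4-byte reduction validation reported. At MTU 1500 the patched expression gives 1522 bytes including FCS, i.e. the 1518 bytes "not counting FCS" from the patch description.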