Netdev List


* [PATCH V2 1/1] Allow cascading to work with 6131 chip
From: Barry Grussling @ 2011-06-21 14:55 UTC (permalink / raw)
  To: netdev; +Cc: buytenh, Barry Grussling
In-Reply-To: <cover.1308667895.git.barry@grussling.com>

---
 net/dsa/mv88e6131.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/dsa/mv88e6131.c b/net/dsa/mv88e6131.c
index 45f7411..dc43419 100644
--- a/net/dsa/mv88e6131.c
+++ b/net/dsa/mv88e6131.c
@@ -118,10 +118,14 @@ static int mv88e6131_setup_global(struct dsa_switch *ds)
 	REG_WRITE(REG_GLOBAL, 0x1a, (dsa_upstream_port(ds) * 0x1100) | 0x00f0);
 
 	/*
-	 * Disable cascade port functionality, and set the switch's
+	 * Disable cascade port functionality unless this device is
+	 * used in a cascade configuration, and set the switch's
 	 * DSA device number.
 	 */
-	REG_WRITE(REG_GLOBAL, 0x1c, 0xe000 | (ds->index & 0x1f));
+	if (ds->dst->pd->nr_chips > 1)
+		REG_WRITE(REG_GLOBAL, 0x1c, 0xf000 | (ds->index & 0x1f));
+	else
+		REG_WRITE(REG_GLOBAL, 0x1c, 0xe000 | (ds->index & 0x1f));
 
 	/*
 	 * Send all frames with destination addresses matching
-- 
1.7.0.4


^ permalink raw reply related

* [PATCH V2 0/1] DSA: Enable cascading for multiple 6131 chips
From: Barry Grussling @ 2011-06-21 14:55 UTC (permalink / raw)
  To: netdev; +Cc: buytenh, Barry Grussling

I found that the Cascade Port field of the 6131 was always set
to 0xe which results in from_cpu frames being discarded.  This
means cascading style multi chip DSA configuration didn't work
for me.  I am a little confused by this since we configure the
DSA routing table a little further down in the function.

It seems like we need to enable cascading by setting the
Cascade Port field to 0xf if we are in a multi-chip scenario.
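As a rough illustration of the value the patch writes to global register 0x1c, here is a small standalone sketch. The helper name is made up for this example, and the bit layout (Cascade Port field in the top nibble, device number in the low five bits) is inferred only from the constants 0xe000, 0xf000 and 0x1f used in the patch, not taken from a datasheet.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the value written to global register 0x1c.
 * Assumption: the top nibble holds the Cascade Port field and the low
 * five bits hold the DSA device number, as the patch constants suggest.
 */
uint16_t cascade_reg_value(unsigned int nr_chips, unsigned int index)
{
	/* 0xf for a multi-chip (cascaded) setup, 0xe otherwise. */
	uint16_t cascade = (nr_chips > 1) ? 0xf000 : 0xe000;

	assert(nr_chips >= 1);
	return cascade | (uint16_t)(index & 0x1f);
}
```

With a single chip this yields the old 0xe000-based value; with two or more chips the top nibble becomes 0xf.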

V2 changes are for whitespace to meet coding style.

Barry Grussling (1):
  Allow cascading to work with 6131 chip

 net/dsa/mv88e6131.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

^ permalink raw reply

* Re: [PATCH 1/2] udp:  add tracepoints for queueing skb to rcvbuf
From: Hagen Paul Pfeifer @ 2011-06-21 14:48 UTC (permalink / raw)
  To: Neil Horman; +Cc: Satoru Moriya, netdev, Seiji Aguchi, Steven Rostedt
In-Reply-To: <20110621135009.GD16311@hmsreliant.think-freely.org>

On Tue, 21 Jun 2011 09:50:09 -0400, Neil Horman wrote:

> I hadn't really thought about that much, but yes, I suppose I could migrate
> dropwatch to export kfree_skb data via perf.  Admittedly I don't know much
> about the perf api.  Do you have any pointers on its use (to save me time in
> figuring out how it all works)?  If so I'll start looking into it.

http://git.kernel.org/?p=status/powertop/powertop.git;a=tree;f=perf;hb=HEAD
is probably a good starting point, especially
perf_bundle.cpp:handle_trace_point(). But I am not sure if this is the most
clever way. The direct use of the perf api is somewhat dodgy (not sure if
the ABI will change). IIRC Steven Rostedt wrote about a user space library
(I CC'ed Steven). BUT: tracing via /sys/kernel/debug/tracing/* may be
enough; possibly there is no need for perf at all. Then trace-cmd may
provide some nice ideas how to wrap the /sys/kernel/debug/tracing interface
programmatically.
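
To give an idea of what wrapping the /sys/kernel/debug/tracing interface could look like, here is a minimal sketch. The function names are invented for this example; it assumes debugfs is mounted at /sys/kernel/debug, that the skb:kfree_skb event is available, and that the caller is allowed to write to the enable file (normally root only).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TRACING_DIR "/sys/kernel/debug/tracing"

/*
 * Build "<TRACING_DIR>/events/<subsystem>/<event>/enable" into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
int event_enable_path(char *buf, size_t len,
		      const char *subsystem, const char *event)
{
	int n;

	assert(buf && subsystem && event);
	n = snprintf(buf, len, "%s/events/%s/%s/enable",
		     TRACING_DIR, subsystem, event);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" or "0" to the event's enable file. Returns 0 on success. */
int event_set_enabled(const char *subsystem, const char *event, int on)
{
	char path[256];
	FILE *f;
	int ok;

	if (event_enable_path(path, sizeof(path), subsystem, event) < 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;	/* not root, or debugfs not mounted */
	ok = fputs(on ? "1" : "0", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

After event_set_enabled("skb", "kfree_skb", 1) succeeds, the events can be read as text lines from TRACING_DIR "/trace_pipe".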

The idea behind dropwatch is great! There is currently too much
unconsolidated information. It takes a genius to understand where and later
why packets are dropped. A userspace tool where no kernel patch is required
is a big plus! ;-)

Hagen

^ permalink raw reply

* Re: [PATCH] netconsole: fix build when CONFIG_NETCONSOLE_DYNAMIC is turned on
From: Randy Dunlap @ 2011-06-21 14:05 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Andrew Morton, davem, netdev, bugme-daemon, hilld
In-Reply-To: <1308660309.3093.108.camel@localhost>

On 06/21/11 05:45, Ben Hutchings wrote:
> On Mon, 2011-06-20 at 21:25 -0700, Randy Dunlap wrote:
>> From: Randy Dunlap <randy.dunlap@oracle.com>
>>
>> When NETCONSOLE_DYNAMIC=y and CONFIGFS_FS=m, there are build errors
>> in netconsole:
>>
>> drivers/built-in.o: In function `drop_netconsole_target':
>> netconsole.c:(.text+0x1a100f): undefined reference to `config_item_put'
>> drivers/built-in.o: In function `make_netconsole_target':
>> netconsole.c:(.text+0x1a10b9): undefined reference to `config_item_init_type_name'
>> drivers/built-in.o: In function `write_msg':
>> netconsole.c:(.text+0x1a11a4): undefined reference to `config_item_get'
>> netconsole.c:(.text+0x1a1211): undefined reference to `config_item_put'
>> drivers/built-in.o: In function `netconsole_netdev_event':
>> netconsole.c:(.text+0x1a12cc): undefined reference to `config_item_put'
>> netconsole.c:(.text+0x1a12ec): undefined reference to `config_item_get'
>> netconsole.c:(.text+0x1a1366): undefined reference to `config_item_put'
>> drivers/built-in.o: In function `init_netconsole':
>> netconsole.c:(.init.text+0x953a): undefined reference to `config_group_init'
>> netconsole.c:(.init.text+0x9560): undefined reference to `configfs_register_subsystem'
>> drivers/built-in.o: In function `dynamic_netconsole_exit':
>> netconsole.c:(.exit.text+0x809): undefined reference to `configfs_unregister_subsystem'
>>
>> so make NETCONSOLE_DYNAMIC require CONFIGFS_FS=y to fix the build errors.
> [...]
> 
> NETCONSOLE is tristate, and I think NETCONSOLE=m && NETCONSOLE_DYNAMIC=y
> && CONFIGFS_FS=m should be OK.
> 
> It seems like Kconfig should have a '>=' operator which behaves like a
> numeric comparison with n=0, m=1, y=2.  Then we could use a dependency
> of:
> 	NETCONSOLE && SYSFS && CONFIGFS_FS>=NETCONSOLE
> 
> But for now I think the correct dependency is:
> 	NETCONSOLE && SYSFS && CONFIGFS_FS && !(NETCONSOLE=y && CONFIGFS_FS=m)

or just have NETCONSOLE_DYNAMIC select CONFIGFS_FS instead of depend on it.
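
To double-check that the two dependency expressions quoted above allow the same configurations, here is a small standalone model. It is only a sketch: it treats each dependency as a plain yes/no question over n=0, m=1, y=2, which glosses over how Kconfig really evaluates tristate expressions, and the function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

/* Hypothetical ">=" form: NETCONSOLE && SYSFS && CONFIGFS_FS>=NETCONSOLE */
bool dynamic_ok_ge(enum tristate netconsole, enum tristate sysfs,
		   enum tristate configfs)
{
	return netconsole != TRI_N && sysfs != TRI_N &&
	       (int)configfs >= (int)netconsole;
}

/* Form using existing operators, excluding NETCONSOLE=y with CONFIGFS_FS=m */
bool dynamic_ok_explicit(enum tristate netconsole, enum tristate sysfs,
			 enum tristate configfs)
{
	return netconsole != TRI_N && sysfs != TRI_N && configfs != TRI_N &&
	       !(netconsole == TRI_Y && configfs == TRI_M);
}

/* Count the n/m/y combinations on which the two forms disagree. */
int count_disagreements(void)
{
	int nc, sy, cf, bad = 0;

	for (nc = TRI_N; nc <= TRI_Y; nc++)
		for (sy = TRI_N; sy <= TRI_Y; sy++)
			for (cf = TRI_N; cf <= TRI_Y; cf++)
				if (dynamic_ok_ge(nc, sy, cf) !=
				    dynamic_ok_explicit(nc, sy, cf))
					bad++;
	return bad;
}
```

In this model the two forms agree on all 27 combinations, and both reject NETCONSOLE=y with CONFIGFS_FS=m, which is the case that produced the link errors.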

-- 
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

^ permalink raw reply

* Re: [Xen-devel] [PATCH net-next 4/5] xen: convert to 64 bit stats interface
From: Ian Campbell @ 2011-06-21 14:05 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: davem@davemloft.net, Jeremy Fitzhardinge, netdev@vger.kernel.org,
	xen-devel@lists.xensource.com
In-Reply-To: <20110620203602.929964665@vyatta.com>

On Mon, 2011-06-20 at 21:35 +0100, Stephen Hemminger wrote:
> Convert xen driver to 64 bit statistics interface.
> Use stats_sync to ensure that 64 bit update is read atomically
> on 32 bit platform. Put hot statistics into per-cpu table.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Thanks Stephen.
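
For readers unfamiliar with the idea behind stats_sync, the following userspace sketch shows the general sequence-counter technique for reading a 64-bit counter consistently when it is stored as two 32-bit halves. It is not the kernel implementation; the struct and function names are invented, it assumes a single writer, and it uses C11 atomics in place of the kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit counter split into halves, guarded by a sequence count. */
struct stat64 {
	atomic_uint seq;	/* odd while an update is in progress */
	_Atomic uint32_t lo;
	_Atomic uint32_t hi;
};

/* Single writer: bump seq to odd, update both halves, bump seq to even. */
void stat64_add(struct stat64 *s, uint64_t delta)
{
	unsigned int seq;
	uint64_t v;

	assert(s);
	seq = atomic_load_explicit(&s->seq, memory_order_relaxed);
	v = ((uint64_t)atomic_load_explicit(&s->hi, memory_order_relaxed) << 32) |
	    atomic_load_explicit(&s->lo, memory_order_relaxed);
	v += delta;

	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->lo, (uint32_t)v, memory_order_relaxed);
	atomic_store_explicit(&s->hi, (uint32_t)(v >> 32), memory_order_relaxed);
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader: retry until both halves were read under the same even seq. */
uint64_t stat64_read(struct stat64 *s)
{
	assert(s);
	for (;;) {
		unsigned int start =
			atomic_load_explicit(&s->seq, memory_order_acquire);
		uint32_t lo, hi;

		if (start & 1)
			continue;	/* writer is mid-update */
		lo = atomic_load_explicit(&s->lo, memory_order_relaxed);
		hi = atomic_load_explicit(&s->hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&s->seq, memory_order_relaxed) == start)
			return ((uint64_t)hi << 32) | lo;
	}
}
```

The reader never takes a lock; it only retries when a writer was active during the read, which is why this pattern suits hot statistics paths.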

> @@ -867,12 +882,13 @@ static int handle_incoming_queue(struct
>  		if (checksum_setup(dev, skb)) {
>  			kfree_skb(skb);
>  			packets_dropped++;
> -			dev->stats.rx_errors++;

Why is this dropped? We should be counting these somehow, I think.

[...]

> >From shemminger@vyatta.com Mon Jun 20 13:36:03 2011
> Message-Id: <20110620203603.019928129@vyatta.com>
> User-Agent: quilt/0.48-1
> Date: Mon, 20 Jun 2011 13:35:11 -0700
> From: Stephen Hemminger <shemminger@vyatta.com>
> To: davem@davemloft.net
> Cc: netdev@vger.kernel.org
> Subject: [PATCH net-next 5/5] ifb: convert to 64 bit stats
> References: <20110620203506.363818794@vyatta.com>
> Content-Disposition: inline; filename=ifb-stats64.patch
> 
> Convert input functional block device to use 64 bit stats.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> ---
> v2 - add stats_sync
> 
> 
> --- a/drivers/net/ifb.c	2011-06-09 14:39:25.000000000 -0700
> +++ b/drivers/net/ifb.c	2011-06-20 13:30:30.135992612 -0700

This entire patch was appended in the mail I got -- something up with
your scripting?

Ian.



^ permalink raw reply

* Netconf 2011 notes
From: Ben Hutchings @ 2011-06-21 14:02 UTC (permalink / raw)
  To: David Miller; +Cc: seenutn, netdev
In-Reply-To: <1308259955.2925.16.camel@bwh-desktop>

On Thu, 2011-06-16 at 22:32 +0100, Ben Hutchings wrote:
> On Thu, 2011-06-16 at 17:11 -0400, David Miller wrote:
> > From: Srinivasa T N <seenutn@linux.vnet.ibm.com>
> > Date: Thu, 16 Jun 2011 15:39:08 +0530
> > 
> > > 	Were there some interesting topics which is useful for the community?
> > > 	(Few lines on each such topic would do).
> > 
> > There is a topic description for each presentation, plus the
> > slides themselves on the web site.
> > 
> > I can't think of anything more significant that could be
> > provided.
> 
> It would be nice to have some record of significant questions and
> answers; and any conclusions or consensus from discussions.
> 
> I can provide the notes I took regarding my own topics.

Here are my notes.  These are somewhat biased by my own areas of interest
and ignorance; others may wish to correct or fill in some gaps.

Stephen Hemminger: Crossing the next bridge
http://vger.kernel.org/netconf2011_slides/shemminger_Bridge2.pdf

Linux bridge driver is missing some features found in other software
(and hardware) bridges.
These include virtualisation features like VEPA, VEB and VN tag.
Should the bridge control plane remain entirely in the kernel, or should
the bridge call out to userspace (like Openflow)?  Benefits include
easier persistence of state, complex policies.  Performance can be
lower; is that significant?
Some discussion but no conclusions that I recall.

Jesse Brandeburg: Reducing Stack Latency
http://vger.kernel.org/netconf2011_slides/jesse_brandeburg_netconf2011.pdf

Jesse presented some graphs showing cycle counts spent in packet
processing in the network stack and driver (ixgbe) on several hardware
platforms, for a netperf UDP_RR test.  Some discussion of why certain
functions are expensive.  No conclusions but I expect that the numbers
will be useful.  Jesse said the ranges on the graphs show the variation
between different hardware platforms (not between packets), but I don't
think this is correct.

Jiri Pirko: LNST Project
http://vger.kernel.org/netconf2011_slides/Netconf2011_lnst.pdf
https://fedorahosted.org/lnst/

Jiri is working on LNST (Linux Network Stack Test), a test framework for
network topologies, currently concentrated on regression-testing various
software devices (bridge, bond, VLAN).
Currently at an early stage of development.
Written in Python; uses XML-RPC to control DUTs.
Configuration file specifies setup using Linux net devices and switch
ports, and commands to test with.

Jiri Pirko: Team driver
http://vger.kernel.org/netconf2011_slides/Netconf2011_team.pdf

Current bonding driver supports various different policies and protocols
implemented by different people.  It has become a mess and this is
probably not fixable due to backward compatibility concerns.  (All
agreed.)
Jiri proposes a simpler replacement for the current bonding driver, with
all policy defined by user-space.
General support for this, but 'show us the code'.
I questioned how load balancing would be done without built-in policies
for flow hashing.  Answer: user-space provides hash function as BPF code
or similar; we now have a JIT compiler for BPF so this should not be too
slow.

Herbert Xu: Scalability
http://vger.kernel.org/netconf2011_slides/herbert_xu_netconf2011.odp

XPS (transmit packet steering) may reorder packets in a flow when it
changes the TX queue used.  Protocol sets a flag to indicate whether
this is OK, and currently only TCP does that.  Should we set it for UDP,
by default or by socket option?
Conclusion: depends on applications; add the socket option but also a
sysctl for the default so users don't need to modify applications.

Enumerated some areas of networking that still involve global or
per-device locks or other mutable state, and network structures that are
not allocated in a NUMA-aware way.  Some discussion of what can be done
to improve this.

Herbert Xu: Hardware LRO

GRO + forwarding can result in moving segment boundaries.  Does anyone
mind?  Can we also let LRO implementations set gso_type like GRO does,
and not disable them when forwarding?

Stephen Hemminger: IRQ name/balancing
http://vger.kernel.org/netconf2011_slides/shemminger_IRQ.pdf

There is no information about IRQ/queue mapping in sysfs, and IRQs may
not even be visible while interface is down.
IRQs do appear in /proc/interrupts, but the name format for per-queue
IRQs is inconsistent between different drivers!
Conclusion: naming scheme has already been agreed but we need to fix
some multiqueue drivers; we should add a function to generate standard
names.

irqbalance: most agree that it doesn't work at the moment, but Intel is
happy that current version follows their hints.
Currently irqbalance usually does things wrong and everyone has to write
their own scripts.
Further discussion deferred to my slot.

Stephen Hemminger: Open vSwitch
http://openvswitch.org/

I didn't take any notes for this.  Apparently it's an interesting
project.

Stephen Hemminger: Virtualized Networking Performance
http://vger.kernel.org/netconf2011_slides/shemminger_VirtPerfSummary.pdf

Presented networking throughput measurements for hosts and routers.
Performance is terrible, although VMware does better than Xen or KVM.

Thomas Graf: Network Configuration Usability and World IPv6 Day
http://vger.kernel.org/netconf2011_slides/tgraf_netconf2011.pdf

Presented libnl 3.0, its Python bindings and the 'ncfg' tool as a
potential replacement for many of the current network configuration
tools.  (Slide 4 seems to show other tools building on top of ncfg, but
this is not actually what he meant.  They should use libnl too.)

Requesting dump of interface state through netlink can currently provide
too much information.  Should be a way for user-space to request partial
state, e.g. statistics.
Automatic dump retry: if I understood correctly, it is possible to get
inconsistent information when a dump uses multiple packets.  So there
should be some way for user-space to detect and handle this.
Some interface state only accessible through ethtool ioctl; should be
accessible through netlink too.  Problem with setting through netlink is
that each setting operation may fail and there is no way to commit or
rollback atomically (without changing most drivers).

World IPv6 Day seems to have mostly worked.  However there are still
some gaps and silly bugs in IPv6 support in both the Linux kernel (e.g.
netfilter can't track DHCPv6 properly) and user-space (e.g. ping6
doesn't restrict hostname lookup to IPv6 addresses).

Tom Herbert: Super Networking Performance
http://vger.kernel.org/netconf2011_slides/therbert_netconf2011.pdf

Gave reasons for wanting higher networking performance.
Presented results using Onload with simple benchmarks and a real
application (load balancer).  Attendees seemed generally impressed; some
questions to me about how Onload works.
Showed how kernel stack latency improves with greater use of polling and
avoiding user-space rescheduling.
Presented some performance goals and networking features that may help
to get there.

David S. Miller: Routing Cache: Just Say No
http://vger.kernel.org/netconf2011_slides/davem_netconf2011.pdf

David wants to get rid of the IPv4 routing cache.  Removing the cache
entirely seems to make route lookup take about 50% longer than it
currently does for a cache hit, and much less time than for a cache
miss.  It avoids some potential for denial of service (forced cache
misses) and generally simplifies routing.

This was a progress report on the refactoring required; none of this was
familiar to me so I didn't try to summarise.

Ben Hutchings: Managing multiple queues: affinity and other issues
http://vger.kernel.org/netconf2011_slides/bwh_netconf2011.pdf

I recapped the current situation of affinity settings and presented the
two options I see for improving and simplifying it.  The consensus was
to go with option 2: each queue will have irq (read-only) and affinity
(read-write) attributes exposed in sysfs, and the networking core will
generate IRQ affinity hints which irqbalance should normally follow.  I
think there's enough support for this that we won't have to do all the
work.

I recapped the way RX queues are currently selected and why this may not
be optimal, and proposed some kind of system policy that could be used
to control this.  This would provide a superset of the functionality to
the rss_cpus module parameter and IRQ affinity setting in our
out-of-tree driver.  I believe this was agreed to be a reasonable
feature, though I'm not sure everyone looked at the details I listed.

Some people wanted an ethtool interface to set per-queue interrupt
moderation.
Some would really like to be able to add and remove RX queues, or at
least set indirection table, based on demand.  This would save power.
Tom wants an interface to set steering + hashing; ideally automatic when
multiple threads listen on the same (host, port).

PJ Waskiewicz: iWarp portspace
http://vger.kernel.org/netconf2011_slides/pj_netconf2011.ppt

iWarp offload previously required kernel patch to reserve ports.  RHEL
stopped carrying the patch.  Port reservation will now be handled by a
user-space daemon holding sockets.

PJ Waskiewicz: Standard netdev module parms
http://vger.kernel.org/netconf2011_slides/pj_netdev_params.odp

Proposed some standardisation of options that may need to be established
before net device registration, e.g. interrupt mode or number of VFs to
enable.
Per-device parameters would be provided as list (as in Intel out-of-tree
drivers).  But this assumes enumeration order is stable, which it isn't
in general.
Not much support for module parameters.  Someone suggested that
per-device settings could be requested at probe time, similarly to
request_firmware().

PJ Waskiewicz: Advanced stats
http://vger.kernel.org/netconf2011_slides/pj_advanced_stats.odp

Complex devices with many VFs, bridge functionality, etc. can present
many more statistics.  ethtool API is unstructured and won't scale to
this.  Proposes to put them in sysfs.  The total number could be a
big problem, as each needs an inode in memory.

Eric Dumazet: JIT, UDP, Packet Schedulers
http://vger.kernel.org/netconf2011_slides/edumazet_netconf2011.pdf

Implemented JIT compiler for BPF on x86_64; porting should be easy.
Room for further optimisation.  Can we use a similar technique to speed
up iptables/ip6tables filters?

UDP multiqueue transmit perf is suffering from cache bouncing.
Kernel takes reference to dst information (for MTU etc.) before
copying from userspace.  Copying from userspace may sleep so we must
take counted reference not RCU.  For small packets, could copy onto
kernel stack first, then no need for refcounting.
How about an adaptive refcount that dynamically switches to percpu
counter if highly contended?
My suggestion: assuming we only need dst for MTU, in order to
fragment into skbs - why bother doing that here?  The output path
can already do fragmentation (GSO-UFO).

Smart packet schedulers needed for proper accounting of packets of
varying size and for software QoS.  However the smarter schedulers
don't currently work well with multiqueue (without hardware priority).
HTB is entirely single-queue so it can maintain per-device rate
limits.  Can we reduce locking by batching packet accounting?  (Reduce
precision of limiting but improve performance.)

Jeffrey T. Kirsher: drivers/net rearrangement
http://vger.kernel.org/netconf2011_slides/jkirsher_netconf2011.odp

As previously discussed, drivers/net and corresponding configuration
menus are a mess.  Almost finished the proposed rearrangement by link
layer type and other groupings.

Jamal Hadi Salim: Catching up With Herbert
http://vger.kernel.org/netconf2011_slides/jamal_netconf2011.pdf
http://vger.kernel.org/netconf2011_slides/netconf-2011-flash.tgz
(don't miss the animations)

History of TX locking:
1. Each sender enters and locks qdisc (sw queue) and hw queue in turn;
repeats for each packet until done.  Many senders can be spinning.
2. Add busy flag; sender sets when entering qdisc.  When not
previously set, the sender takes responsibility for draining sw queue
into hw queue.  Other senders only add to sw queue.  Draining sender
yields at the next clock tick or (some other condition).
3. Spinlock behaviour changed to the bakery algorithm (ticket locking).
Generally better but means the draining sender has to wait behind
other senders when re-locking the qdisc.  (Contention is not so
high for multiqueue devices, though.)
4. Busylock: extra lock for senders preparing to lock qdisc first
time, not taken by draining sender when re-entering.  Effectively gives
the draining sender higher priority.

Potential for great unfairness, as some senders take care of hw
queueing for others - for up to a tick (variable length of time!).
Proposes quota for draining instead of or as well as the current
limits.  Showed results suggesting that good quota is #CPU + 1.
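
To make the busy flag and the proposed draining quota a bit more concrete, here is a toy single-threaded model. It is only a sketch of step 2 above combined with a quota: there is no real locking, the struct and function names are invented, and the quota is simply a parameter (the notes suggest #CPU + 1 as a good value).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a qdisc with a busy flag. */
struct qdisc_model {
	bool busy;		/* set while some sender is draining */
	unsigned int sw_len;	/* packets waiting in the software queue */
	unsigned int hw_sent;	/* packets handed to the hardware queue */
};

/*
 * One sender submits one packet.  If nobody is draining, this sender
 * becomes the drainer and moves at most 'quota' packets to the hardware
 * queue before yielding.  Returns the number of packets it moved.
 */
unsigned int xmit_one(struct qdisc_model *q, unsigned int quota)
{
	unsigned int drained = 0;

	assert(q);
	q->sw_len++;		/* every sender adds to the sw queue */
	if (q->busy)
		return 0;	/* someone else is responsible for draining */

	q->busy = true;
	while (q->sw_len > 0 && drained < quota) {
		q->sw_len--;
		q->hw_sent++;
		drained++;
	}
	q->busy = false;	/* yield, possibly leaving a backlog */
	return drained;
}
```

The quota bounds how much work one sender does on behalf of the others, instead of letting it drain for up to a whole clock tick.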

Eric and Herbert objected that his experiments on the dummy device
may not be representative.

David S. Miller / Jamal Hadi Salim: Closing statements, future netconf planning

David open to proposals for netconf in Feb-Apr next year.  Wants to
invite wider range of people.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH 1/2] udp:  add tracepoints for queueing skb to rcvbuf
From: Neil Horman @ 2011-06-21 13:50 UTC (permalink / raw)
  To: Hagen Paul Pfeifer; +Cc: Satoru Moriya, netdev, Seiji Aguchi
In-Reply-To: <4123e3d3ce0192e63947178f249d3411@localhost>

On Tue, Jun 21, 2011 at 01:58:27PM +0200, Hagen Paul Pfeifer wrote:
> On Tue, 21 Jun 2011 06:47:43 -0400, Neil Horman wrote:
>
> > I was thinking you could just trace callers of __sk_mem_schedule, but
> > looking at it this works as well
> > Acked-by: Neil Horman <nhorman@tuxdriver.com>
>
> Hey Neil,
>
> since you acked the patch do you have any plans to migrate dropwatch to
> use perf infrastructure and skip the netlink transport? Should be
> practicable now. No kernel patch required to run dropwatch ;-)
>
I hadn't really thought about that much, but yes, I suppose I could migrate
dropwatch to export kfree_skb data via perf.  Admittedly I don't know much about
the perf api.  Do you have any pointers on its use (to save me time in figuring
out how it all works)?  If so I'll start looking into it.
Neil

> HGN

^ permalink raw reply

* Re: [PATCH 1/3] serial/imx: add device tree support
From: Shawn Guo @ 2011-06-21 13:55 UTC (permalink / raw)
  To: Grant Likely
  Cc: patches, netdev, devicetree-discuss, Jason Liu, linux-kernel,
	Jeremy Kerr, Sascha Hauer, linux-arm-kernel
In-Reply-To: <20110619073000.GA23171@S2100-06.ap.freescale.net>

Hi Grant,

I just tried using the aliases node to identify the device index.
Please take a look and let me know if it's what you expect.
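
As a quick userspace illustration of the lookup (not the kernel code in the patch below), the same search can be sketched over a plain table of alias names and node paths. The struct, the function name and the "/soc/..." paths are made up for this example; only the "serial" stem and the uart addresses come from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one property of the /aliases node. */
struct alias {
	const char *name;	/* e.g. "serial0" */
	const char *path;	/* node the alias points at */
};

/*
 * Try "<stem>0", "<stem>1", ... until one of them points at node_path.
 * Returns the index, or -1 as soon as an alias in the sequence is missing.
 */
int alias_index(const struct alias *aliases, size_t n,
		const char *stem, const char *node_path)
{
	char name[32];
	int index;

	assert(aliases && stem && node_path);
	for (index = 0; ; index++) {
		const struct alias *hit = NULL;
		size_t i;

		snprintf(name, sizeof(name), "%s%d", stem, index);
		for (i = 0; i < n; i++) {
			if (strcmp(aliases[i].name, name) == 0) {
				hit = &aliases[i];
				break;
			}
		}
		if (!hit)
			return -1;	/* ran out of aliases */
		if (strcmp(hit->path, node_path) == 0)
			return index;
	}
}
```

Like the loop in the patch, this gives up at the first gap in the numbering, so aliases must be numbered consecutively from 0.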

On Sun, Jun 19, 2011 at 03:30:02PM +0800, Shawn Guo wrote:
[...]
> > >  
> > > +#ifdef CONFIG_OF
> > > +static int serial_imx_probe_dt(struct imx_port *sport,
> > > +		struct platform_device *pdev)
> > > +{
> > > +	struct device_node *node = pdev->dev.of_node;
> > > +	const __be32 *line;
> > > +
> > > +	if (!node)
> > > +		return -ENODEV;
> > > +
> > > +	line = of_get_property(node, "id", NULL);
> > > +	if (!line)
> > > +		return -ENODEV;
> > > +
> > > +	sport->port.line = be32_to_cpup(line) - 1;
> > 
> > Hmmm, I really would like to be rid of this.  Instead, if uarts must
> > be enumerated, the driver should look for a /aliases/uart* property
> > that matches the of_node.  Doing it that way is already established in
> > the OpenFirmware documentation, and it ensures there are no overlaps
> > in the global namespace.
> > 
> 
> I just gave one more try to avoid using 'aliases', and you gave a
> 'no' again.  Now, I know how hard you are on this.  Okay, I start
> thinking about your suggestion seriously :)
> 
> > We do need some infrastructure to make that easier though.  Would you
> > have time to help put that together?
> > 
> Ok, I will give it a try.
> 

diff --git a/arch/arm/boot/dts/imx51-babbage.dts b/arch/arm/boot/dts/imx51-babbage.dts
index da0381a..f4a5c3c 100644
--- a/arch/arm/boot/dts/imx51-babbage.dts
+++ b/arch/arm/boot/dts/imx51-babbage.dts
@@ -18,6 +18,12 @@
 	compatible = "fsl,imx51-babbage", "fsl,imx51";
 	interrupt-parent = <&tzic>;
 
+	aliases {
+		serial0 = &uart0;
+		serial1 = &uart1;
+		serial2 = &uart2;
+	};
+
 	chosen {
 		bootargs = "console=ttymxc0,115200 root=/dev/mmcblk0p3 rootwait";
 	};
@@ -47,29 +53,29 @@
 			reg = <0x70000000 0x40000>;
 			ranges;
 
-			uart@7000c000 {
+			uart2: uart@7000c000 {
 				compatible = "fsl,imx51-uart", "fsl,imx21-uart";
 				reg = <0x7000c000 0x4000>;
 				interrupts = <33>;
 				id = <3>;
-				fsl,has-rts-cts;
+				fsl,uart-has-rtscts;
 			};
 		};
 
-		uart@73fbc000 {
+		uart0: uart@73fbc000 {
 			compatible = "fsl,imx51-uart", "fsl,imx21-uart";
 			reg = <0x73fbc000 0x4000>;
 			interrupts = <31>;
 			id = <1>;
-			fsl,has-rts-cts;
+			fsl,uart-has-rtscts;
 		};
 
-		uart@73fc0000 {
+		uart1: uart@73fc0000 {
 			compatible = "fsl,imx51-uart", "fsl,imx21-uart";
 			reg = <0x73fc0000 0x4000>;
 			interrupts = <32>;
 			id = <2>;
-			fsl,has-rts-cts;
+			fsl,uart-has-rtscts;
 		};
 	};
 
diff --git a/drivers/of/base.c b/drivers/of/base.c
index 632ebae..13df5d2 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -737,6 +737,37 @@ err0:
 EXPORT_SYMBOL(of_parse_phandles_with_args);
 
 /**
+ *	of_get_device_index - Get device index by looking up "aliases" node
+ *	@np:	Pointer to device node that asks for device index
+ *	@alias:	The device alias without index number
+ *
+ *	Returns the device index if found, else returns -ENODEV.
+ */
+int of_get_device_index(struct device_node *np, const char *alias)
+{
+	struct device_node *aliases = of_find_node_by_name(NULL, "aliases");
+	struct property *prop;
+	char name[32];
+	int index = 0;
+
+	if (!aliases)
+		return -ENODEV;
+
+	while (1) {
+		snprintf(name, sizeof(name), "%s%d", alias, index);
+		prop = of_find_property(aliases, name, NULL);
+		if (!prop)
+			return -ENODEV;
+		if (np == of_find_node_by_path(prop->value))
+			break;
+		index++;
+	}
+
+	return index;
+}
+EXPORT_SYMBOL(of_get_device_index);
+
+/**
  * prom_add_property - Add a property to a node
  */
 int prom_add_property(struct device_node *np, struct property *prop)
diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
index da436e0..852668f 100644
--- a/drivers/tty/serial/imx.c
+++ b/drivers/tty/serial/imx.c
@@ -1271,18 +1271,18 @@ static int serial_imx_probe_dt(struct imx_port *sport,
 	struct device_node *node = pdev->dev.of_node;
 	const struct of_device_id *of_id =
 			of_match_device(imx_uart_dt_ids, &pdev->dev);
-	const __be32 *line;
+	int line;
 
 	if (!node)
 		return -ENODEV;
 
-	line = of_get_property(node, "id", NULL);
-	if (!line)
+	line = of_get_device_index(node, "serial");
+	if (IS_ERR_VALUE(line))
 		return -ENODEV;
 
-	sport->port.line = be32_to_cpup(line) - 1;
+	sport->port.line = line;
 
-	if (of_get_property(node, "fsl,has-rts-cts", NULL))
+	if (of_get_property(node, "fsl,uart-has-rtscts", NULL))
 		sport->have_rtscts = 1;
 
 	if (of_get_property(node, "fsl,irda-mode", NULL))
diff --git a/include/linux/of.h b/include/linux/of.h
index bfc0ed1..3153752 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -213,6 +213,8 @@ extern int of_parse_phandles_with_args(struct device_node *np,
 	const char *list_name, const char *cells_name, int index,
 	struct device_node **out_node, const void **out_args);
 
+extern int of_get_device_index(struct device_node *np, const char *alias);
+
 extern int of_machine_is_compatible(const char *compat);
 
 extern int prom_add_property(struct device_node* np, struct property* prop);

-- 
Regards,
Shawn


^ permalink raw reply related

* Re: [RFC PATCH] packet: Add fanout support.
From: Victor Julien @ 2011-06-21 13:27 UTC (permalink / raw)
  To: Changli Gao; +Cc: David Miller, netdev
In-Reply-To: <BANLkTikgTqGY=S9UVik6wjSp5WE4WLmKtA@mail.gmail.com>

On 06/21/2011 03:05 PM, Changli Gao wrote:
> On Tue, Jun 21, 2011 at 6:46 PM, David Miller <davem@davemloft.net> wrote:
>> From: Victor Julien <victor@inliniac.net>
>> Date: Tue, 21 Jun 2011 12:39:11 +0200
>>
>>> The hash based on skb->rxhash, does that result in a "flow" based
>>> distribution over the listeners? So all packets sharing a tuple
>>> being sent to the same socket?
>>
>> Yes, that's exactly right.
> 
> But not for fragments, in additional.

From a Suricata IDS point of view, I would need to have the fragments of
a flow/tuple on the same socket.

-- 
---------------------------------------------
Victor Julien
http://www.inliniac.net/
PGP: http://www.inliniac.net/victorjulien.asc
---------------------------------------------


^ permalink raw reply

* [PATCH] MAINTAINERS: Remove Sven Eckelmann from BATMAN ADVANCED
From: Sven Eckelmann @ 2011-06-21 13:13 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA,
	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r

I cannot speak on behalf of the batman-adv developers due to conflicts
in the opinion about the ongoing development. The batman-adv module is
still maintained by Marek Lindner and Simon Wunderlich. Those are the
main persons behind the visions of batman-adv. Therefore, the state of
the module hasn't changed.

Signed-off-by: Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org>
Cc: b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r@public.gmane.org
---
Just as a bit of background information:
https://lists.open-mesh.org/pipermail/b.a.t.m.a.n/2011-June/005020.html
expresses the same problems I have with the changes, and I honestly cannot
speak on behalf of the developers when my inner mind is against the
changes and I would only be an additional, unneeded barrier.

This decision was made in the context of the changes which are on the
horizon. I personally don't know what is coming, only that many things
are changing in an incompatible way. I would rather leave now than
destroy some friendships.

 MAINTAINERS |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index dc2a7c8..9597832 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1390,7 +1390,6 @@ F:	include/linux/backlight.h
 BATMAN ADVANCED
 M:	Marek Lindner <lindner_marek-LWAfsSFWpa4@public.gmane.org>
 M:	Simon Wunderlich <siwu-MaAgPAbsBIVS8oHt8HbXEIQuADTiUCJX@public.gmane.org>
-M:	Sven Eckelmann <sven-KaDOiPu9UxWEi8DpZVb4nw@public.gmane.org>
 L:	b.a.t.m.a.n-ZwoEplunGu2X36UT3dwllkB+6BGkLq7r@public.gmane.org
 W:	http://www.open-mesh.org/
 S:	Maintained
-- 
1.7.5.3

^ permalink raw reply related

* [PATCH] rtnl: provide link dump consistency info
From: Thomas Graf @ 2011-06-21 13:11 UTC (permalink / raw)
  To: Johannes Berg
  Cc: netdev, linux-wireless, Samuel Ortiz, aloisio.almeida,
	John Linville, Thomas Graf
In-Reply-To: <1308570046.4322.5.camel@jlt3.sipsolutions.net>

This patch adds a change sequence counter to each net namespace
which is bumped whenever a netdevice is added or removed from
the list. If such a change occurred while a link dump took place,
the dump will have the NLM_F_DUMP_INTR flag set in the first
message which has been interrupted and in all subsequent messages
of the same dump.

Note that links may still be modified or renamed while a dump is
taking place, but we can guarantee that userspace receives a
complete list of links and does not miss any.
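
The behaviour described above can be modelled in a few lines of standalone C. This is only a sketch of the description, not the kernel code: the struct names and dump_message_interrupted() are invented, and the actual consistency check is not part of this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for struct net and the per-dump callback state. */
struct netns_model {
	unsigned int dev_base_seq;	/* never 0 once initialised */
};

struct dump_model {
	unsigned int start_seq;		/* 0 means the dump has not started */
};

/* Bump the counter, skipping 0 on wrap-around as the patch does. */
void dev_base_seq_inc(struct netns_model *net)
{
	while (++net->dev_base_seq == 0)
		;
}

/*
 * Called once per dump message.  Returns true if the device list changed
 * since the dump started, i.e. this message and all later ones would
 * carry the interrupted flag.
 */
bool dump_message_interrupted(struct dump_model *d,
			      const struct netns_model *net)
{
	assert(net->dev_base_seq != 0);
	if (d->start_seq == 0)
		d->start_seq = net->dev_base_seq;
	return d->start_seq != net->dev_base_seq;
}
```

Skipping 0 lets the dump state use 0 as "not started yet", which is presumably also why setup_net() initialises the counter to 1.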

Testing:
I have added 500 VLAN netdevices to make sure the dump is split
over multiple messages. Then while continuously dumping links in
one process I also continuously deleted and re-added a dummy
netdevice in another process. Multiple dumps per second
had the NLM_F_DUMP_INTR flag set.

I guess we can wait for Johannes' patch to hit net-next via the
wireless tree.  I just wanted to give this some testing right away.

Signed-off-by: Thomas Graf <tgraf@infradead.org>

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 2bf9ed9..ff5c680 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -62,6 +62,7 @@ struct net {
 	struct list_head 	dev_base_head;
 	struct hlist_head 	*dev_name_head;
 	struct hlist_head	*dev_index_head;
+	unsigned int		dev_base_seq;	/* protected by rtnl_mutex */
 
 	/* core fib_rules */
 	struct list_head	rules_ops;
diff --git a/net/core/dev.c b/net/core/dev.c
index 9c58c1e..97f30b4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -199,6 +199,11 @@ static struct list_head ptype_all __read_mostly;	/* Taps */
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
 
+static inline void dev_base_seq_inc(struct net *net)
+{
+	while (++net->dev_base_seq == 0);
+}
+
 static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
 {
 	unsigned hash = full_name_hash(name, strnlen(name, IFNAMSIZ));
@@ -237,6 +242,9 @@ static int list_netdevice(struct net_device *dev)
 	hlist_add_head_rcu(&dev->index_hlist,
 			   dev_index_hash(net, dev->ifindex));
 	write_unlock_bh(&dev_base_lock);
+
+	dev_base_seq_inc(net);
+
 	return 0;
 }
 
@@ -253,6 +261,8 @@ static void unlist_netdevice(struct net_device *dev)
 	hlist_del_rcu(&dev->name_hlist);
 	hlist_del_rcu(&dev->index_hlist);
 	write_unlock_bh(&dev_base_lock);
+
+	dev_base_seq_inc(dev_net(dev));
 }
 
 /*
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index e41e511..91f03c7 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -128,6 +128,7 @@ static __net_init int setup_net(struct net *net)
 	LIST_HEAD(net_exit_list);
 
 	atomic_set(&net->count, 1);
+	net->dev_base_seq = 1;
 
 #ifdef NETNS_REFCNT_DEBUG
 	atomic_set(&net->use_count, 0);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index abd936d..8d694b6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1009,6 +1009,8 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 	s_idx = cb->args[1];
 
 	rcu_read_lock();
+	cb->seq = net->dev_base_seq;
+
 	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
 		idx = 0;
 		head = &net->dev_index_head[h];
@@ -1020,6 +1022,8 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
 					     cb->nlh->nlmsg_seq, 0,
 					     NLM_F_MULTI) <= 0)
 				goto out;
+
+			nl_dump_check_consistent(cb, nlmsg_hdr(skb));
 cont:
 			idx++;
 		}

^ permalink raw reply related

* Re: [RFC PATCH] packet: Add fanout support.
From: Changli Gao @ 2011-06-21 13:05 UTC (permalink / raw)
  To: David Miller; +Cc: victor, netdev
In-Reply-To: <20110621.034627.30677905865798284.davem@davemloft.net>

On Tue, Jun 21, 2011 at 6:46 PM, David Miller <davem@davemloft.net> wrote:
> From: Victor Julien <victor@inliniac.net>
> Date: Tue, 21 Jun 2011 12:39:11 +0200
>
>> The hash based on skb->rxhash, does that result in a "flow" based
>> distribution over the listeners? So all packets sharing a tuple
>> being sent to the same socket?
>
> Yes, that's exactly right.

Except for fragments, though.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* [PATCH]: Add Network Sysrq Support
From: Prarit Bhargava @ 2011-06-21 13:00 UTC (permalink / raw)
  To: netdev, davem, agospoda, nhorman, lwoodman; +Cc: Prarit Bhargava

Add Network Sysrq Support

In some circumstances, a system can hang/lockup in such a way that the system
is completely unresponsive to keyboard or console input but is still
responsive to ping.  The config option, CONFIG_SYSRQ_PING, builds
net/ipv4/sysrq-ping.ko which allows a root user to configure the system for
a remote sysrq.

To use this do:

mount -t debugfs none /sys/kernel/debug/
echo 1 > /proc/sys/kernel/sysrq
echo <hex digit val> > /sys/kernel/debug/network_sysrq_magic
echo 1 > /sys/kernel/debug/network_sysrq_enable

Then on another system on the network you can do:

ping -c 1 -p <up to 30 hex digit val><hex val of sysrq> <target_system_name>

ex) sysrq-m, m is ascii 0x6d

ping -c 1 -p 1623a06f554d46d676d <target_system_name>

Note that the network sysrq automatically disables after the receipt of
the ping, ie) it is single-shot mode.  If you want to use this again, you
must complete the above four steps again.
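
As a sketch of how the ping pattern above is put together: the argument
to "ping -p" is the magic value followed by the two hex digits of the
sysrq key. The helper below is hypothetical (build_ping_pattern is not
part of this patch) and only does the string work in userspace; the
30-digit limit is taken from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: append the two hex digits
 * of the sysrq key to the magic string, giving the argument for
 * "ping -p". Returns 0 on success, -1 if the magic is longer than 30
 * hex digits or the result does not fit in out.
 */
static int build_ping_pattern(const char *magic_hex, char key,
			      char *out, size_t outlen)
{
	int n;

	if (strlen(magic_hex) > 30)
		return -1;

	n = snprintf(out, outlen, "%s%02x", magic_hex,
		     (unsigned int)(unsigned char)key);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return 0;
}
```

With the magic value 1623a06f554d46d67 and the key 'm' (0x6d) this
yields the pattern used in the example above.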

Signed-off-by: Prarit Bhargava <prarit@redhat.com>

diff --git a/Documentation/networking/sysrq-ping.txt b/Documentation/networking/sysrq-ping.txt
new file mode 100644
index 0000000..efa8be3
--- /dev/null
+++ b/Documentation/networking/sysrq-ping.txt
@@ -0,0 +1,26 @@
+In some circumstances, a system can hang/lockup in such a way that the system
+is completely unresponsive to keyboard or console input but is still
+responsive to ping.  The config option, CONFIG_SYSRQ_PING, builds
+net/ipv4/sysrq-ping.ko which allows a root user to configure the system for a
+remote sysrq.
+
+To use this do:
+
+mount -t debugfs none /sys/kernel/debug/
+echo 1 > /proc/sys/kernel/sysrq
+echo <hex digit val> > /sys/kernel/debug/network_sysrq_magic
+echo 1 > /sys/kernel/debug/network_sysrq_enable
+
+Then on another system you can do:
+
+ping -c 1 -p <hex digit val><hex val of sysrq> <target_system_name>
+
+ex) sysrq-m, m is ascii 0x6d
+
+    ping -c 1 -p 1623a06f554d46d676d <target_system_name>
+
+Note that the network sysrq automatically disables after the receipt of
+the ping, ie) it is single-shot mode.  If you want to use this again, you
+must complete the above four steps again.
+
+Hint: 'man ascii' ;)
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index cbb505b..03bb7b1 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -624,3 +624,11 @@ config TCP_MD5SIG
 	  on the Internet.
 
 	  If unsure, say N.
+
+config SYSRQ_PING
+	tristate
+	default m
+	help
+	  Allows execution of sysrq-X commands via ping over ipv4.  This is a
+	  known security hazard and should not be used in unsecure
+	  environments.
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index f2dc69c..c23c15e 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
 obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
 obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_SYSRQ_PING) += sysrq-ping.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 		      xfrm4_output.o
diff --git a/net/ipv4/sysrq-ping.c b/net/ipv4/sysrq-ping.c
new file mode 100644
index 0000000..67a6d0e
--- /dev/null
+++ b/net/ipv4/sysrq-ping.c
@@ -0,0 +1,207 @@
+/*
+ * network_sysrq.c - allow sysrq to be executed over a network via ping
+ *
+ * written by:  Prarit Bhargava <prarit@redhat.com>
+ *		Andy Gospodarek <agospoda@redhat.com>
+ *		Neil Horman <nhorman@redhat.com>
+ *
+ * based on work by:	Larry Woodman <lwoodman@redhat.com>
+ *
+ * To use this do:
+ *
+ *	mount -t debugfs none /sys/kernel/debug/
+ *	echo 1 > /proc/sys/kernel/sysrq
+ *	echo <hex digit val> > /sys/kernel/debug/network_sysrq_magic
+ *	echo 1 > /sys/kernel/debug/network_sysrq_enable
+ *
+ * Then on another system you can do:
+ *
+ *	ping -c 1 -p <hex digit val><hex val of sysrq> <target_system_name>
+ *
+ *	ex) sysrq-m, m is 0x6d
+ *
+ *	    ping -c 1 -p 1623a06f554d46d676d <target_system_name>
+ *
+ * Note that the network sysrq automatically disables after the receipt of
+ * *ANY* ping.  If you want to use this again, you must complete the
+ * above four steps again.
+ *
+ */
+
+#include <linux/debugfs.h>
+#include <linux/icmp.h>
+#include <linux/init.h>
+#include <linux/ip.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/sysrq.h>
+
+#include <net/xfrm.h>
+
+static u8 network_sysrq_enable; /* set in debugfs network_sysrq_enable */
+static u16 network_sysrq_magic[16]; /* 15 bytes leaves 1 feature byte */
+static int network_sysrq_magic_len;
+
+static int to_hex(int val)
+{
+	if ((val >= '0') && (val <= '9'))
+		return val - 0x30;
+
+	if ((val >= 'a') && (val <= 'f'))
+		return val - 0x57;	/* 'a' (0x61) -> 10 */
+
+	if ((val >= 'A') && (val <= 'F'))
+		return val - 0x37;	/* 'A' (0x41) -> 10 */
+
+	return -1;
+}
+
+static bool network_sysrq_armed(void)
+{
+	int i;
+
+	if (!network_sysrq_enable)
+		return false;
+	if (!network_sysrq_magic_len)
+		return false;
+	for (i = 0; i < 16; i++)
+		if (network_sysrq_magic[i] != 0)
+			return true;
+	return false;
+}
+
+static void network_sysrq_disable(void)
+{
+	network_sysrq_enable = 0;
+	memset(network_sysrq_magic, 0, 32);
+	network_sysrq_magic_len = 0;
+}
+
+static ssize_t network_sysrq_seq_write(struct file *file,
+				       const char __user *ubuf,
+				       size_t count, loff_t *ppos)
+{
+	int i, j, hi, lo;
+	char buf[33];
+	memset(buf, 0, sizeof(buf));
+
+	if (count >= 33)
+		return -EINVAL;
+
+	if (copy_from_user(&buf, ubuf, min_t(size_t, sizeof(buf) - 1, count)))
+		return -EFAULT;
+
+	for (i = 0, j = 0; i < count - 2 ; i += 2, j++) {
+		hi = to_hex(buf[i]);
+		lo = to_hex(buf[i+1]);	/* keep -1 so the check below works */
+		if ((hi == -1) || (lo == -1)) {
+			network_sysrq_disable();
+			return -EINVAL;
+		}
+		network_sysrq_magic[j] = (u16)(hi << 4) + lo;
+	}
+	network_sysrq_magic_len = j;
+
+	return count;
+}
+
+static int network_sysrq_seq_show(struct seq_file *m, void *p)
+{
+	int i;
+
+	for (i = 0; i < network_sysrq_magic_len; i++)
+		seq_printf(m, "%02x", network_sysrq_magic[i]);
+	seq_printf(m, "\n");
+	return 0;
+}
+
+static int network_sysrq_fops_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, network_sysrq_seq_show, inode->i_private);
+}
+
+static const struct file_operations xnetwork_sysrq_fops = {
+	.open		= network_sysrq_fops_open,
+	.write		= network_sysrq_seq_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+};
+
+static int network_sysrq_func(struct sk_buff *skb, struct net_device *dev,
+			      struct packet_type *pt,
+			      struct net_device *orig_dev)
+{
+	struct icmphdr *icmph;
+	char *found;
+
+	if (ip_hdr(skb)->protocol != IPPROTO_ICMP)
+		goto end;
+
+	if (!skb_pull(skb, sizeof(struct iphdr)))
+		goto end;
+
+	skb_reset_transport_header(skb);
+	icmph = icmp_hdr(skb);
+
+	if (!skb_pull(skb, sizeof(*icmph)))
+		goto end;
+
+	/* is this a ping? */
+	if (icmph->type != ICMP_ECHO)
+		goto end;
+
+	if (network_sysrq_armed()) {
+		found = strnstr(skb->data, (char *)network_sysrq_magic,
+				skb->len - skb->data_len);
+		if (found)
+			handle_sysrq(found[network_sysrq_magic_len]);
+		network_sysrq_disable();
+	}
+end:
+	kfree_skb(skb);
+	return 0;
+}
+
+static struct packet_type network_sysrq_type = {
+	.type = cpu_to_be16(ETH_P_IP),
+	.func = network_sysrq_func,
+};
+
+static struct dentry *network_sysrq_enable_dentry;
+static struct dentry *network_sysrq_magic_dentry;
+
+int __init init_network_sysrq(void)
+{
+	network_sysrq_enable_dentry = debugfs_create_u8("network_sysrq_enable",
+							S_IWUGO | S_IRUGO,
+							NULL,
+							&network_sysrq_enable);
+	if (!network_sysrq_enable_dentry)
+		return -EIO;
+
+	network_sysrq_magic_dentry = debugfs_create_file("network_sysrq_magic",
+							S_IWUGO | S_IRUGO,
+							NULL,
+							&network_sysrq_magic,
+							&xnetwork_sysrq_fops);
+	if (!network_sysrq_magic_dentry) {
+		debugfs_remove(network_sysrq_enable_dentry);
+		return -EIO;
+	}
+
+	dev_add_pack(&network_sysrq_type);
+	return 0;
+}
+
+void __exit cleanup_network_sysrq(void)
+{
+	dev_remove_pack(&network_sysrq_type);
+	debugfs_remove(network_sysrq_enable_dentry);
+	debugfs_remove(network_sysrq_magic_dentry);
+}
+
+module_init(init_network_sysrq);
+module_exit(cleanup_network_sysrq);
+
+MODULE_LICENSE("GPL");

^ permalink raw reply related

* Re: [PATCH] netconsole: fix build when CONFIG_NETCONSOLE_DYNAMIC is turned on
From: WANG Cong @ 2011-06-21 12:50 UTC (permalink / raw)
  To: netdev
In-Reply-To: <20110620212504.e639ad5c.randy.dunlap@oracle.com>

On Mon, 20 Jun 2011 21:25:04 -0700, Randy Dunlap wrote:

> From: Randy Dunlap <randy.dunlap@oracle.com>
> 
> When NETCONSOLE_DYNAMIC=y and CONFIGFS_FS=m, there are build errors in
> netconsole:
> 
> drivers/built-in.o: In function `drop_netconsole_target':
> netconsole.c:(.text+0x1a100f): undefined reference to `config_item_put'
> drivers/built-in.o: In function `make_netconsole_target':
> netconsole.c:(.text+0x1a10b9): undefined reference to `config_item_init_type_name'
> drivers/built-in.o: In function `write_msg':
> netconsole.c:(.text+0x1a11a4): undefined reference to `config_item_get'
> netconsole.c:(.text+0x1a1211): undefined reference to `config_item_put'
> drivers/built-in.o: In function `netconsole_netdev_event':
> netconsole.c:(.text+0x1a12cc): undefined reference to `config_item_put'
> netconsole.c:(.text+0x1a12ec): undefined reference to `config_item_get'
> netconsole.c:(.text+0x1a1366): undefined reference to `config_item_put'
> drivers/built-in.o: In function `init_netconsole':
> netconsole.c:(.init.text+0x953a): undefined reference to `config_group_init'
> netconsole.c:(.init.text+0x9560): undefined reference to `configfs_register_subsystem'
> drivers/built-in.o: In function `dynamic_netconsole_exit':
> netconsole.c:(.exit.text+0x809): undefined reference to `configfs_unregister_subsystem'
> 
> so make NETCONSOLE_DYNAMIC require CONFIGFS_FS=y to fix the build
> errors.
> 
> This is one possible fix.
> Fixes https://bugzilla.kernel.org/show_bug.cgi?id=37992
> 
> Reported-by: David Hill <hilld@binarystorm.net>
> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
> ---
>  drivers/net/Kconfig |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- lnx-30-rc3.orig/drivers/net/Kconfig
> +++ lnx-30-rc3/drivers/net/Kconfig
> @@ -3416,7 +3416,7 @@ config NETCONSOLE
>  
>  config NETCONSOLE_DYNAMIC
>  	bool "Dynamic reconfiguration of logging targets"
> -	depends on NETCONSOLE && SYSFS && CONFIGFS_FS
> +	depends on NETCONSOLE && SYSFS && CONFIGFS_FS=y

I recall someone already fixed this by adding "select CONFIGFS_FS",
so who removed it again... :-/


^ permalink raw reply

* Re: [PATCH] netconsole: fix build when CONFIG_NETCONSOLE_DYNAMIC is turned on
From: Ben Hutchings @ 2011-06-21 12:45 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: Andrew Morton, davem, netdev, bugme-daemon, hilld
In-Reply-To: <20110620212504.e639ad5c.randy.dunlap@oracle.com>

On Mon, 2011-06-20 at 21:25 -0700, Randy Dunlap wrote:
> From: Randy Dunlap <randy.dunlap@oracle.com>
> 
> When NETCONSOLE_DYNAMIC=y and CONFIGFS_FS=m, there are build errors
> in netconsole:
> 
> drivers/built-in.o: In function `drop_netconsole_target':
> netconsole.c:(.text+0x1a100f): undefined reference to `config_item_put'
> drivers/built-in.o: In function `make_netconsole_target':
> netconsole.c:(.text+0x1a10b9): undefined reference to `config_item_init_type_name'
> drivers/built-in.o: In function `write_msg':
> netconsole.c:(.text+0x1a11a4): undefined reference to `config_item_get'
> netconsole.c:(.text+0x1a1211): undefined reference to `config_item_put'
> drivers/built-in.o: In function `netconsole_netdev_event':
> netconsole.c:(.text+0x1a12cc): undefined reference to `config_item_put'
> netconsole.c:(.text+0x1a12ec): undefined reference to `config_item_get'
> netconsole.c:(.text+0x1a1366): undefined reference to `config_item_put'
> drivers/built-in.o: In function `init_netconsole':
> netconsole.c:(.init.text+0x953a): undefined reference to `config_group_init'
> netconsole.c:(.init.text+0x9560): undefined reference to `configfs_register_subsystem'
> drivers/built-in.o: In function `dynamic_netconsole_exit':
> netconsole.c:(.exit.text+0x809): undefined reference to `configfs_unregister_subsystem'
> 
> so make NETCONSOLE_DYNAMIC require CONFIGFS_FS=y to fix the build errors.
[...]

NETCONSOLE is tristate, and I think NETCONSOLE=m && NETCONSOLE_DYNAMIC=y
&& CONFIGFS_FS=m should be OK.

It seems like Kconfig should have a '>=' operator which behaves like a
numeric comparison with n=0, m=1, y=2.  Then we could use a dependency
of:
	NETCONSOLE && SYSFS && CONFIGFS_FS>=NETCONSOLE

But for now I think the correct dependency is:
	NETCONSOLE && SYSFS && CONFIGFS_FS && !(NETCONSOLE=y && CONFIGFS_FS=m)
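
The claim that the long form matches the wished-for ">=" form can be
checked by brute force. The sketch below uses a simplified model of
Kconfig tristate logic (n=0, m=1, y=2, "&&" as minimum, "!" as 2 - x,
"=" yielding y or n); the ">=" operator is the hypothetical one proposed
above, and all function names are made up for this example.

```c
#include <assert.h>

/* Simplified model of Kconfig tristate values: n=0, m=1, y=2. */
enum tri { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

/* In this model "&&" is the minimum and "!" maps n<->y and keeps m. */
static enum tri tri_and(enum tri a, enum tri b) { return a < b ? a : b; }
static enum tri tri_not(enum tri a) { return (enum tri)(2 - (int)a); }
static enum tri tri_eq(enum tri a, enum tri b) { return a == b ? TRI_Y : TRI_N; }

/* NETCONSOLE && SYSFS && CONFIGFS_FS && !(NETCONSOLE=y && CONFIGFS_FS=m) */
static enum tri dep_expanded(enum tri nc, enum tri sysfs, enum tri cfs)
{
	enum tri bad = tri_and(tri_eq(nc, TRI_Y), tri_eq(cfs, TRI_M));

	return tri_and(tri_and(nc, sysfs), tri_and(cfs, tri_not(bad)));
}

/* NETCONSOLE && SYSFS && CONFIGFS_FS>=NETCONSOLE (hypothetical ">=") */
static enum tri dep_with_ge(enum tri nc, enum tri sysfs, enum tri cfs)
{
	enum tri ge = cfs >= nc ? TRI_Y : TRI_N;

	return tri_and(tri_and(nc, sysfs), ge);
}

/* Returns 1 if both forms give the same value for every input. */
static int deps_agree(void)
{
	static const enum tri all[] = { TRI_N, TRI_M, TRI_Y };
	static const enum tri bools[] = { TRI_N, TRI_Y };	/* SYSFS is bool */
	int i, j, k;

	for (i = 0; i < 3; i++)
		for (j = 0; j < 2; j++)
			for (k = 0; k < 3; k++)
				if (dep_expanded(all[i], bools[j], all[k]) !=
				    dep_with_ge(all[i], bools[j], all[k]))
					return 0;
	return 1;
}
```

Under this model the only combination the extra term rules out is
NETCONSOLE=y with CONFIGFS_FS=m, which is the case that fails to link.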

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH 1/2] udp:  add tracepoints for queueing skb to rcvbuf
From: Hagen Paul Pfeifer @ 2011-06-21 11:58 UTC (permalink / raw)
  To: Neil Horman; +Cc: Satoru Moriya, netdev, Seiji Aguchi
In-Reply-To: <20110621104742.GA16311@hmsreliant.think-freely.org>


On Tue, 21 Jun 2011 06:47:43 -0400, Neil Horman wrote:

> I was thinking you could just trace callers of __sk_mem_schedule, but
> looking at
> it this works as well
> Acked-by: Neil Horman <nhorman@tuxdriver.com>

Hey Neil,

since you acked the patch do you have any plans to migrate dropwatch to
use perf infrastructure and skip the netlink transport? Should be
practicable now. No kernel patch required to run dropwatch ;-)

HGN

^ permalink raw reply

* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Ilpo Järvinen @ 2011-06-21 11:46 UTC (permalink / raw)
  To: Carsten Wolff
  Cc: Alexander Zimmermann, Dominik Kaspar, John Heffner, Eric Dumazet,
	Netdev, Lennart Schulte, Arnd Hannemann
In-Reply-To: <201106211334.17825.carsten@wolffcarsten.de>

On Tue, 21 Jun 2011, Carsten Wolff wrote:

> On Tuesday 21 June 2011, Ilpo Järvinen wrote:
> > On Wed, 27 Apr 2011, Alexander Zimmermann wrote:
> > > Am 27.04.2011 um 18:22 schrieb Dominik Kaspar:
> > > > Hi Carsten,
> > > > 
> > > > Thanks for your feedback. I made some new tests with the same setup of
> > > > packet-based forwarding over two emulated paths (600 KB/s, 10 ms) +
> > > > (400 KB/s, 100 ms). In the first experiments, which showed a step-wise
> > > > adaptation to reordering, SACK, DSACK, and Timestamps were all
> > > > enabled. In the experiments, I individually disabled these three
> > > > mechanisms and saw the following:
> > > > 
> > > > - Disabling timestamps causes TCP to never adjust to reordering at all.
> > > 
> > > Reordering detection with DSACK is broken in Linux. We will fix that in
> > > a couple of weeks...
> > > 
> > > > - Disabling SACK allows TCP to adapt very rapidly ("perfect"
> > > > aggregation!).
> > > 
> > > If you disable SACK, you will use the NewReno detection
> > 
> > Which probably has some reordering over-estimate bugs of its own...
> > (but I forgot the details of my suspicion a long time ago, so please
> > don't ask for them).
> 
> the NewReno detection is clever, but there's no exact information it could 
> utilize for a good metric, because it detects the event too late, when the 
> information is already gone. In my experiments it always under-estimated the 
> reordering extent, though. I also remember thinking that the metric of the
> Eifel-detection has an off-by-one bug.

That might be true for most cases, but IIRC I figured out a scenario
where it adds an extra RTT's worth to the reordering estimate (but I
never confirmed that in real tests, I only reasoned through it a bit).

-- 
 i.

^ permalink raw reply

* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Carsten Wolff @ 2011-06-21 11:34 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Alexander Zimmermann, Dominik Kaspar, John Heffner, Eric Dumazet,
	Netdev, Lennart Schulte, Arnd Hannemann
In-Reply-To: <alpine.DEB.2.00.1106211423400.17529@wel-95.cs.helsinki.fi>

Hi,

On Tuesday 21 June 2011, Ilpo Järvinen wrote:
> On Wed, 27 Apr 2011, Alexander Zimmermann wrote:
> > Am 27.04.2011 um 18:22 schrieb Dominik Kaspar:
> > > Hi Carsten,
> > > 
> > > Thanks for your feedback. I made some new tests with the same setup of
> > > packet-based forwarding over two emulated paths (600 KB/s, 10 ms) +
> > > (400 KB/s, 100 ms). In the first experiments, which showed a step-wise
> > > adaptation to reordering, SACK, DSACK, and Timestamps were all
> > > enabled. In the experiments, I individually disabled these three
> > > mechanisms and saw the following:
> > > 
> > > - Disabling timestamps causes TCP to never adjust to reordering at all.
> > 
> > Reordering detection with DSACK is broken in Linux. We will fix that in
> > a couple of weeks...
> > 
> > > - Disabling SACK allows TCP to adapt very rapidly ("perfect"
> > > aggregation!).
> > 
> > If you disable SACK, you will use the NewReno detection
> 
> Which probably has some reordering over-estimate bugs of its own...
> (but I forgot the details of my suspicion a long time ago, so please
> don't ask for them).

the NewReno detection is clever, but there's no exact information it could 
utilize for a good metric, because it detects the event too late, when the 
information is already gone. In my experiments it always under-estimated the 
reordering extent, though. I also remember thinking that the metric of the
Eifel-detection has an off-by-one bug.

Carsten

^ permalink raw reply

* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Ilpo Järvinen @ 2011-06-21 11:35 UTC (permalink / raw)
  To: Dominik Kaspar
  Cc: Alexander Zimmermann, Netdev, Yuchung Cheng, Carsten Wolff,
	John Heffner, Eric Dumazet, Lennart Schulte, Arnd Hannemann
In-Reply-To: <BANLkTin6fsc=4GUY+1UKsLEbgzeybx7FHg@mail.gmail.com>

On Mon, 20 Jun 2011, Dominik Kaspar wrote:

> > Where did you get this idea of reneging?!?
> 
> I observed that my scenario of a retransmitted packet overtaking the
> original somehow causes TCP to enter the "Loss" state although no RTO
> was caused. And since the Loss state seems to be only entered due to
> RTO timeout or SACK reneging, I got the idea that reneging must be
> occurring.
> 
> > Reneging has nothing to do with DSACKs,
> > instead it is only detected if the cumulative ACK stops at a
> > boundary where the _next_ segment is SACKed (i.e., for some reason
> > the receiver "didn't bother" to cumulatively ACK that one too). ...
> > That certainly does not happen (ever) for out of window DSACKs.
> 
> You are right. If I turn off DSACK, the same thing happens: TCP enters
> the Loss state without timeouts occurring. Isn't that a sign of
> reneging happening? What else can it be?

There's a MIB counter for reneging, from which you should be able to
confirm that it did(n't) happen...
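
A rough sketch of how such a counter could be read in userspace,
assuming it is exposed in the "TcpExt:" header/value line pair of
/proc/net/netstat under the name TCPSACKReneging. The helper name
netstat_counter is made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter in a header/value line pair in the style of
 * /proc/net/netstat ("TcpExt: NameA NameB ..." / "TcpExt: 1 2 ...").
 * Returns the counter value, or -1 if the name is not present.
 */
static long long netstat_counter(const char *names, const char *values,
				 const char *want)
{
	size_t wlen = strlen(want);

	while (*names && *values) {
		/* Length of the current token on each line. */
		size_t nlen = strcspn(names, " \n");
		size_t vlen = strcspn(values, " \n");

		if (nlen == wlen && strncmp(names, want, wlen) == 0)
			return strtoll(values, NULL, 10);

		/* Step both lines forward to the next token. */
		names += nlen;
		values += vlen;
		names += strspn(names, " \n");
		values += strspn(values, " \n");
	}
	return -1;
}
```

Comparing the value before and after a test run would show whether
reneging was counted during that run.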

Please note that tcpprobe is only run per ACK (not on timeouts), and 
FRTO (enabled by default) doesn't even cause CA_Loss entry immediately 
but slightly later on once it has figured out that the timeout doesn't 
seem to be spurious.

-- 
 i.

^ permalink raw reply

* Re: Linux TCP's Robustness to Multipath Packet Reordering
From: Ilpo Järvinen @ 2011-06-21 11:25 UTC (permalink / raw)
  To: Alexander Zimmermann
  Cc: Dominik Kaspar, Carsten Wolff, John Heffner, Eric Dumazet, Netdev,
	Lennart Schulte, Arnd Hannemann
In-Reply-To: <D0D2412D-2D30-4051-B346-32D20858BC92@nets.rwth-aachen.de>

On Wed, 27 Apr 2011, Alexander Zimmermann wrote:

> Hi,
> 
> Am 27.04.2011 um 18:22 schrieb Dominik Kaspar:
> 
> > Hi Carsten,
> > 
> > Thanks for your feedback. I made some new tests with the same setup of
> > packet-based forwarding over two emulated paths (600 KB/s, 10 ms) +
> > (400 KB/s, 100 ms). In the first experiments, which showed a step-wise
> > adaptation to reordering, SACK, DSACK, and Timestamps were all
> > enabled. In the experiments, I individually disabled these three
> > mechanisms and saw the following:
> > 
> > - Disabling timestamps causes TCP to never adjust to reordering at all.
> 
> Reordering detection with DSACK is broken in Linux. We will fix that in
> a couple of weeks...
> 
> > - Disabling SACK allows TCP to adapt very rapidly ("perfect" aggregation!).
> 
> If you disable SACK, you will use the NewReno detection

Which probably has some reordering over-estimate bugs of its own...
(but I forgot the details of my suspicion a long time ago, so please
don't ask for them).

-- 
 i.

^ permalink raw reply

* Re: [PATCH 2/2] core: add tracepoints for queueing skb to rcvbuf
From: Neil Horman @ 2011-06-21 10:48 UTC (permalink / raw)
  To: Satoru Moriya
  Cc: netdev@vger.kernel.org, davem@davemloft.net,
	dle-develop@lists.sourceforge.net, Seiji Aguchi
In-Reply-To: <65795E11DBF1E645A09CEC7EAEE94B9C402B96E7@USINDEVS02.corp.hds.com>

On Fri, Jun 17, 2011 at 06:00:03PM -0400, Satoru Moriya wrote:
> This patch adds 2 tracepoints that report the status of a socket receive
> queue and related parameters.
> 
> One tracepoint is added to sock_queue_rcv_skb. It records the rcvbuf size
> and its usage. The other tracepoint is added to __sk_mem_schedule and
> records the memory limits for sockets and their current usage.
> 
> By using these tracepoints we can find out in detail why the kernel
> dropped a packet.
> 
> Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
> ---
>  include/trace/events/sock.h |   68 +++++++++++++++++++++++++++++++++++++++++++
>  net/core/net-traces.c       |    1 +
>  net/core/sock.c             |    5 +++
>  3 files changed, 74 insertions(+), 0 deletions(-)
>  create mode 100644 include/trace/events/sock.h
> 
> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
> new file mode 100644
> index 0000000..779abb9
> --- /dev/null
> +++ b/include/trace/events/sock.h
> @@ -0,0 +1,68 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM sock
> +
> +#if !defined(_TRACE_SOCK_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_SOCK_H
> +
> +#include <net/sock.h>
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(sock_rcvqueue_full,
> +
> +	TP_PROTO(struct sock *sk, struct sk_buff *skb),
> +
> +	TP_ARGS(sk, skb),
> +
> +	TP_STRUCT__entry(
> +		__field(int, rmem_alloc)
> +		__field(unsigned int, truesize)
> +		__field(int, sk_rcvbuf)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
> +		__entry->truesize   = skb->truesize;
> +		__entry->sk_rcvbuf  = sk->sk_rcvbuf;
> +	),
> +
> +	TP_printk("rmem_alloc=%d truesize=%u sk_rcvbuf=%d",
> +		__entry->rmem_alloc, __entry->truesize, __entry->sk_rcvbuf)
> +);
> +
> +TRACE_EVENT(sock_exceed_buf_limit,
> +
> +	TP_PROTO(struct sock *sk, struct proto *prot, long allocated),
> +
> +	TP_ARGS(sk, prot, allocated),
> +
> +	TP_STRUCT__entry(
> +		__array(char, name, 32)
> +		__field(long *, sysctl_mem)
> +		__field(long, allocated)
> +		__field(int, sysctl_rmem)
> +		__field(int, rmem_alloc)
> +	),
> +
> +	TP_fast_assign(
> +		strncpy(__entry->name, prot->name, 32);
> +		__entry->sysctl_mem = prot->sysctl_mem;
> +		__entry->allocated = allocated;
> +		__entry->sysctl_rmem = prot->sysctl_rmem[0];
> +		__entry->rmem_alloc = atomic_read(&sk->sk_rmem_alloc);
> +	),
> +
> +	TP_printk("proto:%s sysctl_mem=%ld,%ld,%ld allocated=%ld "
> +		"sysctl_rmem=%d rmem_alloc=%d",
> +		__entry->name,
> +		__entry->sysctl_mem[0],
> +		__entry->sysctl_mem[1],
> +		__entry->sysctl_mem[2],
> +		__entry->allocated,
> +		__entry->sysctl_rmem,
> +		__entry->rmem_alloc)
> +);
> +
> +#endif /* _TRACE_SOCK_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/net/core/net-traces.c b/net/core/net-traces.c
> index 13aab64..52380b1 100644
> --- a/net/core/net-traces.c
> +++ b/net/core/net-traces.c
> @@ -28,6 +28,7 @@
>  #include <trace/events/skb.h>
>  #include <trace/events/net.h>
>  #include <trace/events/napi.h>
> +#include <trace/events/sock.h>
>  #include <trace/events/udp.h>
>  
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 6e81978..76c4031 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -128,6 +128,8 @@
>  
>  #include <linux/filter.h>
>  
> +#include <trace/events/sock.h>
> +
>  #ifdef CONFIG_INET
>  #include <net/tcp.h>
>  #endif
> @@ -292,6 +294,7 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  	if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
>  	    (unsigned)sk->sk_rcvbuf) {
>  		atomic_inc(&sk->sk_drops);
> +		trace_sock_rcvqueue_full(sk, skb);
>  		return -ENOMEM;
>  	}
>  
> @@ -1736,6 +1739,8 @@ suppress_allocation:
>  			return 1;
>  	}
>  
> +	trace_sock_exceed_buf_limit(sk, prot, allocated);
> +
>  	/* Alas. Undo changes. */
>  	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
>  	atomic_long_sub(amt, prot->memory_allocated);
> -- 
> 1.7.1
> 
> 
> 
Acked-by: Neil Horman <nhorman@tuxdriver.com>


^ permalink raw reply

* Re: [RFC PATCH] packet: Add fanout support.
From: Victor Julien @ 2011-06-21 10:39 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110621.025334.547463578193934724.davem@davemloft.net>

On 06/21/2011 11:53 AM, David Miller wrote:
> 
> This adds demuxing support for AF_PACKET sockets.  It's just to give
> people an idea, I've only build tested this patch.
> 
> Basically it allows to spread the AF_PACKET processing load amongst
> several AF_PACKET sockets.  The distribution can either be based upon
> hashing (PACKET_FANOUT_HASH) or round-robin based load-balancing
> (PACKET_FANOUT_LB).
> 
> The hash based fanout takes advantage of the precomputed skb->rxhash
> and only costs ~20 cpu cycles.
> 
> A restriction is that you must bind the AF_PACKET socket fully before
> you add it to a fanout.
> 
> The encoding of the PACKET_FANOUT socket option argument is:
> 
> 	(PACKET_FANOUT_{HASH,LB} << 16) | (ID & 0xffff)
> 
> All sockets adding themselves to the same fanout ID must all use
> the same PACKET_FANOUT_* type and also must be bound to the same
> device/protocol.
> 
> The implementation is agnostic to the type of AF_PACKET sockets in
> use.  You can use mmap based, and non-mmap based, AF_PACKET sockets.
> It simply doesn't care.
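
For illustration, the option argument encoding quoted above can be
sketched as plain C. The two PACKET_FANOUT_* values are copied from the
quoted patch; the helper names (fanout_arg, fanout_arg_type,
fanout_arg_id) are hypothetical and not part of it.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the quoted patch (include/linux/if_packet.h hunk). */
#ifndef PACKET_FANOUT_HASH
#define PACKET_FANOUT_HASH	0
#endif
#ifndef PACKET_FANOUT_LB
#define PACKET_FANOUT_LB	1
#endif

/* Pack the fanout type into the upper 16 bits, the ID into the lower 16. */
static uint32_t fanout_arg(unsigned int type, unsigned int id)
{
	return ((uint32_t)type << 16) | (id & 0xffff);
}

/* Inverse helpers: split an encoded argument back into its two fields. */
static unsigned int fanout_arg_type(uint32_t arg)
{
	return arg >> 16;
}

static unsigned int fanout_arg_id(uint32_t arg)
{
	return arg & 0xffff;
}
```

Every socket that joins the same group would pass the same encoded
value, since both the type and the ID have to match.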

Thanks David! Looks interesting. I'm not familiar with the kernel
internals, so just a quick question. The hash based on skb->rxhash, does
that result in a "flow" based distribution over the listeners? So all
packets sharing a tuple being sent to the same socket?
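
To make the question concrete, here is the index arithmetic from
fanout_demux_hash() in the quoted patch as a small userspace sketch. The
function name fanout_hash_index is made up, and whether two packets get
the same rxhash in the first place is outside what this shows.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same arithmetic as fanout_demux_hash() in the quoted patch: scale a
 * 32-bit hash into the range [0, num_members) without a modulo.
 */
static uint32_t fanout_hash_index(uint32_t rxhash, uint32_t num_members)
{
	return (uint32_t)(((uint64_t)rxhash * num_members) >> 32);
}
```

The index depends only on the hash value and the number of members, so
packets carrying the same rxhash pick the same slot as long as the
member count does not change.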

Cheers,
Victor

> Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/include/linux/if_packet.h b/include/linux/if_packet.h
> index 7b31863..1efa1cb 100644
> --- a/include/linux/if_packet.h
> +++ b/include/linux/if_packet.h
> @@ -49,6 +49,10 @@ struct sockaddr_ll {
>  #define PACKET_VNET_HDR			15
>  #define PACKET_TX_TIMESTAMP		16
>  #define PACKET_TIMESTAMP		17
> +#define PACKET_FANOUT			18
> +
> +#define PACKET_FANOUT_HASH		0
> +#define PACKET_FANOUT_LB		1
>  
>  struct tpacket_stats {
>  	unsigned int	tp_packets;
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 461b16f..e6af2eb 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -187,9 +187,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg);
>  
>  static void packet_flush_mclist(struct sock *sk);
>  
> +struct packet_fanout;
>  struct packet_sock {
>  	/* struct sock has to be the first member of packet_sock */
>  	struct sock		sk;
> +	struct packet_fanout	*fanout;
>  	struct tpacket_stats	stats;
>  	struct packet_ring_buffer	rx_ring;
>  	struct packet_ring_buffer	tx_ring;
> @@ -212,6 +214,22 @@ struct packet_sock {
>  	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
>  };
>  
> +#define PACKET_FANOUT_MAX	2048
> +
> +struct packet_fanout {
> +#ifdef CONFIG_NET_NS
> +	struct net		*net;
> +#endif
> +	int			num_members;
> +	u16			id;
> +	u8			type;
> +	u8			pad;
> +	atomic_t		rr_cur;
> +	struct list_head	list;
> +	struct sock		*arr[PACKET_FANOUT_MAX];
> +	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
> +};
> +
>  struct packet_skb_cb {
>  	unsigned int origlen;
>  	union {
> @@ -344,6 +362,164 @@ static void packet_sock_destruct(struct sock *sk)
>  	sk_refcnt_debug_dec(sk);
>  }
>  
> +static int fanout_rr_next(struct packet_fanout *f)
> +{
> +	int x = atomic_read(&f->rr_cur) + 1;
> +
> +	if (x >= f->num_members)
> +		x = 0;
> +
> +	return x;
> +}
> +
> +static struct sock *fanout_demux_hash(struct packet_fanout *f, struct sk_buff *skb)
> +{
> +	u32 idx = ((u64)skb->rxhash * f->num_members) >> 32;
> +
> +	return f->arr[idx];
> +}
> +
> +static struct sock *fanout_demux_lb(struct packet_fanout *f, struct sk_buff *skb)
> +{
> +	int cur, old;
> +
> +	cur = atomic_read(&f->rr_cur);
> +	while ((old = atomic_cmpxchg(&f->rr_cur, cur,
> +				     fanout_rr_next(f))) != cur)
> +		cur = old;
> +	return f->arr[cur];
> +}
> +
> +static int packet_rcv_fanout_hash(struct sk_buff *skb, struct net_device *dev,
> +				  struct packet_type *pt, struct net_device *orig_dev)
> +{
> +	struct packet_fanout *f = pt->af_packet_priv;
> +	struct packet_sock *po;
> +	struct sock *sk;
> +
> +	if (!net_eq(dev_net(dev), read_pnet(&f->net))) {
> +		kfree_skb(skb);
> +		return 0;
> +	}
> +
> +	sk = fanout_demux_hash(f, skb);
> +	po = pkt_sk(sk);
> +
> +	return po->prot_hook.func(skb, dev, &po->prot_hook, orig_dev);
> +}
> +
> +static int packet_rcv_fanout_lb(struct sk_buff *skb, struct net_device *dev,
> +				struct packet_type *pt, struct net_device *orig_dev)
> +{
> +	struct packet_fanout *f = pt->af_packet_priv;
> +	struct packet_sock *po;
> +	struct sock *sk;
> +
> +	if (!net_eq(dev_net(dev), read_pnet(&f->net))) {
> +		kfree_skb(skb);
> +		return 0;
> +	}
> +
> +	sk = fanout_demux_lb(f, skb);
> +	po = pkt_sk(sk);
> +
> +	return po->prot_hook.func(skb, dev, &po->prot_hook, orig_dev);
> +}
> +
> +static DEFINE_MUTEX(fanout_mutex);
> +static LIST_HEAD(fanout_list);
> +
> +static int fanout_add(struct sock *sk, u16 id, u8 type)
> +{
> +	struct packet_sock *po = pkt_sk(sk);
> +	struct packet_fanout *f, *match;
> +	int err;
> +
> +	switch (type) {
> +	case PACKET_FANOUT_HASH:
> +	case PACKET_FANOUT_LB:
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (!po->running)
> +		return -EINVAL;
> +
> +	mutex_lock(&fanout_mutex);
> +	match = NULL;
> +	list_for_each_entry(f, &fanout_list, list) {
> +		if (f->id == id) {
> +			match = f;
> +			break;
> +		}
> +	}
> +	if (!match) {
> +		match = kzalloc(sizeof(*match), GFP_KERNEL);
> +		if (match) {
> +			write_pnet(&match->net, sock_net(sk));
> +			match->id = id;
> +			match->type = type;
> +			atomic_set(&match->rr_cur, 0);
> +			INIT_LIST_HEAD(&match->list);
> +			match->prot_hook.type = po->prot_hook.type;
> +			match->prot_hook.dev = po->prot_hook.dev;
> +			switch (type) {
> +			case PACKET_FANOUT_HASH:
> +				match->prot_hook.func = packet_rcv_fanout_hash;
> +				break;
> +			case PACKET_FANOUT_LB:
> +				match->prot_hook.func = packet_rcv_fanout_lb;
> +				break;
> +			}
> +			match->prot_hook.af_packet_priv = match;
> +			dev_add_pack(&match->prot_hook);
> +		}
> +	}
> +	err = -ENOMEM;
> +	if (match) {
> +		err = -EINVAL;
> +		if (match->type == type) {
> +			err = -ENOSPC;
> +			if (match->num_members < PACKET_FANOUT_MAX) {
> +				__dev_remove_pack(&po->prot_hook);
> +				po->fanout = match;
> +				match->arr[match->num_members] = sk;
> +				smp_wmb();
> +				match->num_members++;
> +				err = 0;
> +			}
> +		}
> +	}
> +	mutex_unlock(&fanout_mutex);
> +	return err;
> +}
> +
> +static void fanout_del(struct sock *sk)
> +{
> +	struct packet_sock *po = pkt_sk(sk);
> +	struct packet_fanout *f;
> +	int i;
> +
> +	f = po->fanout;
> +	po->fanout = NULL;
> +
> +	mutex_lock(&fanout_mutex);
> +	for (i = 0; i < f->num_members; i++) {
> +		if (f->arr[i] == sk)
> +			break;
> +	}
> +	BUG_ON(i >= f->num_members);
> +	f->arr[i] = f->arr[f->num_members - 1];
> +	f->num_members--;
> +
> +	if (!f->num_members) {
> +		list_del(&f->list);
> +		dev_remove_pack(&f->prot_hook);
> +		kfree(f);
> +	}
> +	mutex_unlock(&fanout_mutex);
> +}
>  
>  static const struct proto_ops packet_ops;
>  
> @@ -1343,7 +1519,10 @@ static int packet_release(struct socket *sock)
>  		 */
>  		po->running = 0;
>  		po->num = 0;
> -		__dev_remove_pack(&po->prot_hook);
> +		if (po->fanout)
> +			fanout_del(sk);
> +		else
> +			__dev_remove_pack(&po->prot_hook);
>  		__sock_put(sk);
>  	}
>  	if (po->prot_hook.dev) {
> @@ -1396,9 +1575,11 @@ static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protoc
>  		__sock_put(sk);
>  		po->running = 0;
>  		po->num = 0;
> -		spin_unlock(&po->bind_lock);
> -		dev_remove_pack(&po->prot_hook);
> -		spin_lock(&po->bind_lock);
> +		if (!po->fanout) {
> +			spin_unlock(&po->bind_lock);
> +			dev_remove_pack(&po->prot_hook);
> +			spin_lock(&po->bind_lock);
> +		}
>  	}
>  
>  	po->num = protocol;
> @@ -1413,7 +1594,8 @@ static int packet_do_bind(struct sock *sk, struct net_device *dev, __be16 protoc
>  		goto out_unlock;
>  
>  	if (!dev || (dev->flags & IFF_UP)) {
> -		dev_add_pack(&po->prot_hook);
> +		if (!po->fanout)
> +			dev_add_pack(&po->prot_hook);
>  		sock_hold(sk);
>  		po->running = 1;
>  	} else {
> @@ -1542,7 +1724,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
>  
>  	if (proto) {
>  		po->prot_hook.type = proto;
> -		dev_add_pack(&po->prot_hook);
> +		if (!po->fanout)
> +			dev_add_pack(&po->prot_hook);
>  		sock_hold(sk);
>  		po->running = 1;
>  	}
> @@ -2109,6 +2292,17 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
>  		po->tp_tstamp = val;
>  		return 0;
>  	}
> +	case PACKET_FANOUT:
> +	{
> +		int val;
> +
> +		if (optlen != sizeof(val))
> +			return -EINVAL;
> +		if (copy_from_user(&val, optval, sizeof(val)))
> +			return -EFAULT;
> +
> +		return fanout_add(sk, val & 0xffff, val >> 16);
> +	}
>  	default:
>  		return -ENOPROTOOPT;
>  	}
> @@ -2207,6 +2401,15 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
>  		val = po->tp_tstamp;
>  		data = &val;
>  		break;
> +	case PACKET_FANOUT:
> +		if (len > sizeof(int))
> +			len = sizeof(int);
> +		val = (po->fanout ?
> +		       ((u32)po->fanout->id |
> +			((u32)po->fanout->type << 16)) :
> +		       0);
> +		data = &val;
> +		break;
>  	default:
>  		return -ENOPROTOOPT;
>  	}
> @@ -2260,7 +2463,8 @@ static int packet_notifier(struct notifier_block *this, unsigned long msg, void
>  			if (dev->ifindex == po->ifindex) {
>  				spin_lock(&po->bind_lock);
>  				if (po->num && !po->running) {
> -					dev_add_pack(&po->prot_hook);
> +					if (!po->fanout)
> +						dev_add_pack(&po->prot_hook);
>  					sock_hold(sk);
>  					po->running = 1;
>  				}
> @@ -2530,7 +2734,8 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
>  	was_running = po->running;
>  	num = po->num;
>  	if (was_running) {
> -		__dev_remove_pack(&po->prot_hook);
> +		if (!po->fanout)
> +			__dev_remove_pack(&po->prot_hook);
>  		po->num = 0;
>  		po->running = 0;
>  		__sock_put(sk);
> @@ -2568,7 +2773,8 @@ static int packet_set_ring(struct sock *sk, struct tpacket_req *req,
>  		sock_hold(sk);
>  		po->running = 1;
>  		po->num = num;
> -		dev_add_pack(&po->prot_hook);
> +		if (!po->fanout)
> +			dev_add_pack(&po->prot_hook);
>  	}
>  	spin_unlock(&po->bind_lock);
>  
> 


-- 
---------------------------------------------
Victor Julien
http://www.inliniac.net/
PGP: http://www.inliniac.net/victorjulien.asc
---------------------------------------------


^ permalink raw reply

* Re: [PATCH 1/2] udp:  add tracepoints for queueing skb to rcvbuf
From: Neil Horman @ 2011-06-21 10:47 UTC (permalink / raw)
  To: Satoru Moriya
  Cc: netdev@vger.kernel.org, davem@davemloft.net,
	dle-develop@lists.sourceforge.net, Seiji Aguchi
In-Reply-To: <65795E11DBF1E645A09CEC7EAEE94B9C402B96E5@USINDEVS02.corp.hds.com>

On Fri, Jun 17, 2011 at 05:58:39PM -0400, Satoru Moriya wrote:
> This patch adds a tracepoint to __udp_queue_rcv_skb to get the
> return value of ip_queue_rcv_skb. It indicates why the kernel drops
> a packet at this point.
> 
> ip_queue_rcv_skb returns the following values in the packet drop case:
> 
> rcvbuf is full                 : -ENOMEM
> sk_filter returns error        : -EINVAL, -EACCES, -ENOMEM, etc.
> __sk_mem_schedule returns error: -ENOBUFS
> 
> 
> Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
> ---
>  include/trace/events/udp.h |   32 ++++++++++++++++++++++++++++++++
>  net/core/net-traces.c      |    1 +
>  net/ipv4/udp.c             |    2 ++
>  3 files changed, 35 insertions(+), 0 deletions(-)
>  create mode 100644 include/trace/events/udp.h
> 
> diff --git a/include/trace/events/udp.h b/include/trace/events/udp.h
> new file mode 100644
> index 0000000..a664bb9
> --- /dev/null
> +++ b/include/trace/events/udp.h
> @@ -0,0 +1,32 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM udp
> +
> +#if !defined(_TRACE_UDP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_UDP_H
> +
> +#include <linux/udp.h>
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(udp_fail_queue_rcv_skb,
> +
> +	TP_PROTO(int rc, struct sock *sk),
> +
> +	TP_ARGS(rc, sk),
> +
> +	TP_STRUCT__entry(
> +		__field(int, rc)
> +		__field(__u16, lport)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->rc = rc;
> +		__entry->lport = inet_sk(sk)->inet_num;
> +	),
> +
> +	TP_printk("rc=%d port=%hu", __entry->rc, __entry->lport)
> +);
> +
> +#endif /* _TRACE_UDP_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/net/core/net-traces.c b/net/core/net-traces.c
> index 7f1bb2a..13aab64 100644
> --- a/net/core/net-traces.c
> +++ b/net/core/net-traces.c
> @@ -28,6 +28,7 @@
>  #include <trace/events/skb.h>
>  #include <trace/events/net.h>
>  #include <trace/events/napi.h>
> +#include <trace/events/udp.h>
>  
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);
>  
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index abca870..37aa9bf 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -105,6 +105,7 @@
>  #include <net/route.h>
>  #include <net/checksum.h>
>  #include <net/xfrm.h>
> +#include <trace/events/udp.h>
>  #include "udp_impl.h"
>  
>  struct udp_table udp_table __read_mostly;
> @@ -1363,6 +1364,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  					 is_udplite);
>  		UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
>  		kfree_skb(skb);
> +		trace_udp_fail_queue_rcv_skb(rc, sk);
>  		return -1;
>  	}
>  
> -- 
> 1.7.1
> 
I was thinking you could just trace callers of __sk_mem_schedule, but looking at
it, this works as well.
Acked-by: Neil Horman <nhorman@tuxdriver.com>
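
A minimal user-space sketch of how the tracepoint payload could be consumed, assuming it appears exactly as formatted by TP_printk("rc=%d port=%hu") in the quoted patch. The names udp_drop, parse_udp_drop and drop_reason are hypothetical and not part of the patch. The reason strings only restate the mapping given in the patch description, so -ENOMEM stays ambiguous between a full rcvbuf and an sk_filter error.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical holder for the two fields the tracepoint records. */
struct udp_drop {
	int rc;			/* return value of ip_queue_rcv_skb */
	unsigned short lport;	/* local port of the receiving socket */
};

/* Parse a payload such as "rc=-12 port=5353"; returns 0 on success, -1 otherwise. */
static int parse_udp_drop(const char *payload, struct udp_drop *out)
{
	assert(payload && out);
	return sscanf(payload, "rc=%d port=%hu", &out->rc, &out->lport) == 2 ? 0 : -1;
}

/* Map rc to the drop causes listed in the patch description. */
static const char *drop_reason(int rc)
{
	switch (rc) {
	case -ENOMEM:
		return "rcvbuf full (or sk_filter error)";
	case -ENOBUFS:
		return "__sk_mem_schedule failed";
	default:
		return "sk_filter or other error";
	}
}
```

Calling parse_udp_drop() on each trace line and passing the resulting rc to drop_reason() gives a per-port summary of why datagrams were dropped.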


^ permalink raw reply

* Re: [RFC PATCH] packet: Add fanout support.
From: David Miller @ 2011-06-21 10:46 UTC (permalink / raw)
  To: victor; +Cc: netdev
In-Reply-To: <4E0074CF.8070003@inliniac.net>

From: Victor Julien <victor@inliniac.net>
Date: Tue, 21 Jun 2011 12:39:11 +0200

> The hash based on skb->rxhash, does that result in a "flow" based
> distribution over the listeners? So all packets sharing a tuple
> being sent to the same socket?

Yes, that's exactly right.
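
A minimal sketch of why the hash mode is flow-based, assuming skb->rxhash is identical for all packets of one flow. It reproduces the index computation from fanout_demux_hash() and the option encoding decoded by packet_setsockopt() in the RFC patch. The helper names fanout_index and fanout_arg are hypothetical, and the constants are the values proposed in the patch, which may change before merging.

```c
#include <assert.h>
#include <stdint.h>

/* Values proposed by the RFC patch for include/linux/if_packet.h. */
#define PACKET_FANOUT_HASH	0
#define PACKET_FANOUT_LB	1

/*
 * Same scaling as fanout_demux_hash() in the patch: map a 32-bit hash
 * onto [0, num_members) with a multiply and shift instead of a modulo.
 * Equal hashes give the same index as long as num_members is unchanged.
 */
static uint32_t fanout_index(uint32_t rxhash, uint32_t num_members)
{
	assert(num_members > 0);
	return (uint32_t)(((uint64_t)rxhash * num_members) >> 32);
}

/*
 * Build the int passed as the PACKET_FANOUT option value: the patch
 * decodes it as id = val & 0xffff and type = val >> 16.
 */
static int fanout_arg(uint16_t id, uint8_t type)
{
	return (int)((uint32_t)id | ((uint32_t)type << 16));
}
```

For example, fanout_index(0x80000000u, 4) selects slot 2 for every packet carrying that hash, and fanout_arg(1, PACKET_FANOUT_LB) yields 0x10001. Note that the mapping shifts when sockets join or leave the group, since num_members changes.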

^ permalink raw reply

* Re: [net-next 00/16][pull request] Intel Wired LAN Driver Update
From: David Miller @ 2011-06-21 10:04 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, gospo
In-Reply-To: <1308645228-32444-1-git-send-email-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Tue, 21 Jun 2011 01:33:32 -0700

> The following series contains updates for e1000, igb, ixgbevf and ixgbe.
> 
> e1000 and igb: conversion to ndo_fix_features from Michal
> ixgbevf: fix function declarations and removal of unnecessary &'s
> ixgbe: Mainly cleanup of DCB code and the addition of Dell CEM support.
> 
> Dropped the problem patch from Vasu that was previously submitted in the last
> patch series.
> 
> The following are changes since commit 9f6ec8d697c08963d83880ccd35c13c5ace716ea:
>   Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
> and are available in the git repository at:
>   master.kernel.org:/pub/scm/linux/kernel/git/jkirsher/net-next-2.6 master

Pulled, thanks Jeff.

^ permalink raw reply
