Netdev List


* Re: [RFC] skb align patch
From: David Miller @ 2009-09-22  5:29 UTC (permalink / raw)
  To: eric.dumazet; +Cc: shemminger, jesse.brandeburg, hawk, netdev
In-Reply-To: <4AB84295.3050509@gmail.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 22 Sep 2009 05:20:53 +0200

> Oh I see, you want to optimize the rx path (the NIC has to do a DMA to
> write the packet into host memory, and this DMA could be a
> read/modify/write if the address is not aligned, instead of a pure
> write), while I tried to align the skb to optimize the pktgen tx path
> (the NIC has to do a DMA to read the packet from the host), and
> aligning the skb had no effect.

This is a problem with these kinds of changes.

This patch from Stephen came out of a presentation and discussion
at netconf where the Intel folks showed that if they did a combination
of things it improved NUMA forwarding numbers a lot.

So you couldn't just do NUMA spreading of RX queue memory, or just
do this ALIGN patch, or just eliminate the false sharing from
statistics updates.

You had to do all three to start seeing forwarding rates go up.

So don't worry, this is getting us to the point where improvement
shows, but no single change will trigger it on its own.

The alignment in this patch is a big deal for 64-byte forwarding
tests, where the entire packet is exactly one PCI-E cacheline, but
only if it is aligned properly.
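
The false sharing from statistics updates mentioned above can be
illustrated with a minimal C11 sketch. It is not taken from any driver:
the struct names, the 64-byte line size and the queue count are
assumptions made for the example. Counters for several queues packed
into one cache line make every update bounce that line between CPUs;
padding each queue's counters out to a full line keeps them apart.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u	/* assumed line size */
#define NR_QUEUES 4u	/* hypothetical number of RX queues */

/* Packed layout: four of these fit in one 64-byte line, so four
 * queues (and the CPUs servicing them) would share that line. */
struct packed_stats {
	uint64_t packets;
	uint64_t bytes;
};

/* Padded layout: aligning the first member raises the alignment of
 * the whole struct, so each array element gets its own line. */
struct padded_stats {
	alignas(CACHE_LINE) uint64_t packets;
	uint64_t bytes;
};

static struct padded_stats queue_stats[NR_QUEUES];

/* Account one received packet on the given queue. */
static void count_rx(unsigned int queue, size_t len)
{
	assert(queue < NR_QUEUES);
	queue_stats[queue].packets++;
	queue_stats[queue].bytes += len;
}
```

The cost of the padded layout is memory: each queue's two counters
occupy a full line instead of 16 bytes.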


* Re: [RFC] skb align patch
From: Stephen Hemminger @ 2009-09-22  5:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesse Brandeburg, Jesper Dangaard Brouer, netdev
In-Reply-To: <4AB84295.3050509@gmail.com>

On Tue, 22 Sep 2009 05:20:53 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Stephen Hemminger wrote:
> > On Mon, 21 Sep 2009 08:13:20 +0200
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> >> Stephen Hemminger wrote:
> >>> Based on the Intel suggestion that PCI-express overhead is
> >>> a significant cost.
> >>>
> >>> Would people doing performance work please measure the impact of
> >>> changing SKB alignment (64-bit only).
> >> I had this idea some time ago when I hit a limit on a bnx2 adapter
> >> (Gigabit link, BCM5708S), with small packets. pktgen was able
> >> to send ~500 Mbps 'only', or 700kps if I remember well.
> >> So I tried to align the pktgen-built packet to a cache line,
> >> it made no difference at all, but it was on a 32-bit kernel.
> >> (Thus my patch was for pktgen only, not a generic one like yours)
> >>
> >> Could you elaborate on why this change could be useful on 64-bit?
> >>
> > 
> > It is useful on all architectures where unaligned CPU access is
> > relatively cheap.
> > 
> > The issue is that an unaligned DMA requires a read/modify/write
> > cache line access versus just a write access. I am not a bus
> > expert, but writes are probably more pipelined as well.
> > 
> 
> Oh I see, you want to optimize the rx path (the NIC has to do a DMA
> to write the packet into host memory, and this DMA could be a
> read/modify/write if the address is not aligned, instead of a pure
> write), while I tried to align the skb to optimize the pktgen tx path
> (the NIC has to do a DMA to read the packet from the host), and
> aligning the skb had no effect.
> 
> Maybe we should separate the rx/tx, and try your idea only
> for skbs allocated for rx.
> 
> Also, or instead, we might try
> __builtin_prefetch (addr, 0, 0);
> to instruct the CPU to commit to memory the cache lines that are
> going to be modified by the NIC.

I don't think it matters whether the RX buffer has to be
read/modify/written from CPU cache or from memory on modern
cache-snooping architectures. The cost is the PCI traffic.

-- 


* Re: [RFC] skb align patch
From: Eric Dumazet @ 2009-09-22  3:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Jesse Brandeburg, Jesper Dangaard Brouer, netdev
In-Reply-To: <20090921213011.704e0594@nehalam>

Stephen Hemminger wrote:
> On Mon, 21 Sep 2009 08:13:20 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> Stephen Hemminger wrote:
>>> Based on the Intel suggestion that PCI-express overhead is
>>> a significant cost.
>>>
>>> Would people doing performance work please measure the impact of
>>> changing SKB alignment (64-bit only).
>> I had this idea some time ago when I hit a limit on a bnx2 adapter
>> (Gigabit link, BCM5708S), with small packets. pktgen was able
>> to send ~500 Mbps 'only', or 700kps if I remember well.
>> So I tried to align the pktgen-built packet to a cache line,
>> it made no difference at all, but it was on a 32-bit kernel.
>> (Thus my patch was for pktgen only, not a generic one like yours)
>>
>> Could you elaborate on why this change could be useful on 64-bit?
>>
> 
> It is useful on all architectures where unaligned CPU access is
> relatively cheap.
> 
> The issue is that an unaligned DMA requires a read/modify/write
> cache line access versus just a write access. I am not a bus
> expert, but writes are probably more pipelined as well.
> 

Oh I see, you want to optimize the rx path (the NIC has to do a DMA
to write the packet into host memory, and this DMA could be a
read/modify/write if the address is not aligned, instead of a pure
write), while I tried to align the skb to optimize the pktgen tx path
(the NIC has to do a DMA to read the packet from the host), and
aligning the skb had no effect.

Maybe we should separate the rx/tx, and try your idea only
for skbs allocated for rx.

Also, or instead, we might try
__builtin_prefetch (addr, 0, 0);
to instruct the CPU to commit to memory the cache lines that are
going to be modified by the NIC.
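
For reference, a minimal sketch of how the builtin suggested above might
be applied to a whole buffer. The helper name and the 64-byte stride are
assumptions for illustration, and it relies on the GCC/Clang builtin.
Note that a prefetch is only a hint to pull a line towards the cache; as
far as I know it does not by itself force dirty data to be written back
to memory, so whether it helps here would have to be measured.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u	/* assumed line size */

/*
 * Issue one read-prefetch hint per CACHE_LINE-sized stride of buf.
 * Second argument 0 = prefetch for read, third argument 0 = no
 * temporal locality, matching the call quoted above.
 * Assumes buf itself starts on a line boundary.
 * Returns the number of hints issued.
 */
static size_t prefetch_buffer(const void *buf, size_t len)
{
	const char *p = buf;
	size_t off, hints = 0;

	assert(buf != NULL || len == 0);
	for (off = 0; off < len; off += CACHE_LINE) {
		__builtin_prefetch(p + off, 0, 0);
		hints++;
	}
	return hints;
}
```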




* igb VF allocation with quirk_i82576_sriov
From: Chris Wright @ 2009-09-22  5:19 UTC (permalink / raw)
  To: John Ronciak; +Cc: netdev, e1000-devel

Is this known to work?  During recent virt testing for the upcoming Fedora 12,
a box without SR-IOV support in the BIOS was using the quirk to create VF BAR
space.  VF allocation worked well enough to assign a device to the guest, but
igbvf was not actually functioning properly in the guest.

Is it worth debugging this further, or is it already a known issue?

thanks,
-chris


* Re: bugfix: wireless bug causing working setups to loose net connectivity
From: John W. Linville @ 2009-09-22  4:24 UTC (permalink / raw)
  To: Arkadiusz Miskiewicz; +Cc: Johannes Berg, netdev
In-Reply-To: <200909212035.50592.a.miskiewicz@gmail.com>

On Mon, Sep 21, 2009 at 08:35:50PM +0200, Arkadiusz Miskiewicz wrote:
> 
> Could 
> http://marc.info/?l=linux-wireless&m=125323296617306&w=2 
> be merged without waiting for a separate wireless pull request?
> 
> Currently, previously working setups are no longer able to connect to the AP
> (in my case WPA2PSK via wpasupplicant).
> 
> AFAIK there was some kind of policy where fixes for bugs that break basic 
> functionality are supposed to be merged fast, so that people can actually use
> and test the git kernel.

I'll make sure to include this in the next pull request, probably
tomorrow.  Sorry, I'm too tired tonight.  FWIW, I'm traveling ATM...

Hth!

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [RFC] skb align patch
From: Stephen Hemminger @ 2009-09-22  4:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesse Brandeburg, Jesper Dangaard Brouer, netdev
In-Reply-To: <4AB71980.4020208@gmail.com>

On Mon, 21 Sep 2009 08:13:20 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Stephen Hemminger wrote:
> > Based on the Intel suggestion that PCI-express overhead is
> > a significant cost.
> > 
> > Would people doing performance work please measure the impact of
> > changing SKB alignment (64-bit only).
> 
> I had this idea some time ago when I hit a limit on a bnx2 adapter
> (Gigabit link, BCM5708S), with small packets. pktgen was able
> to send ~500 Mbps 'only', or 700kps if I remember well.
> So I tried to align the pktgen-built packet to a cache line,
> it made no difference at all, but it was on a 32-bit kernel.
> (Thus my patch was for pktgen only, not a generic one like yours)
> 
> Could you elaborate on why this change could be useful on 64-bit?
> 

It is useful on all architectures where unaligned CPU access is
relatively cheap.

The issue is that an unaligned DMA requires a read/modify/write
cache line access versus just a write access. I am not a bus
expert, but writes are probably more pipelined as well.
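
To make the read/modify/write point concrete, here is a small sketch
that counts how many cache lines a receive buffer overlaps and how many
of those are only partly covered. The helper names and the 64-byte line
size are assumptions for illustration; the premise that a partly covered
line is what forces the read/modify/write is the one stated above, not
something the code proves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u	/* assumed line size */

/* Number of cache lines overlapped by the range [addr, addr + len). */
static size_t lines_touched(uintptr_t addr, size_t len)
{
	assert(len > 0);
	return (size_t)((addr + len - 1) / CACHE_LINE - addr / CACHE_LINE + 1);
}

/*
 * Number of overlapped lines that the range covers only in part.
 * A write to a fully covered line can replace it outright; a partly
 * covered line has to be merged with its old contents.
 */
static size_t partial_lines(uintptr_t addr, size_t len)
{
	uintptr_t end = addr + len;
	/* index of the first line that starts at or after addr */
	uintptr_t first_full = (addr + CACHE_LINE - 1) / CACHE_LINE;
	/* index one past the last line that ends at or before end */
	uintptr_t last_full = end / CACHE_LINE;
	size_t full = last_full > first_full ? (size_t)(last_full - first_full) : 0;

	return lines_touched(addr, len) - full;
}
```

With these definitions a 64-byte packet at offset 0 touches one line and
none partially, while the same packet at offset 2 touches two lines and
both partially.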

-- 


* Re: [PATCH] sky2: Set SKY2_HW_RAM_BUFFER in sky2_init
From: Stephen Hemminger @ 2009-09-22  2:50 UTC (permalink / raw)
  To: Mike McCormack, David Miller; +Cc: netdev
In-Reply-To: <4AB788F4.90503@ring3k.org>

On Mon, 21 Sep 2009 23:08:52 +0900
Mike McCormack <mikem@ring3k.org> wrote:

> The SKY2_HW_RAM_BUFFER bit in hw->flags was checked in sky2_mac_init(),
>  before being set later in sky2_up().
> 
> Setting SKY2_HW_RAM_BUFFER in sky2_init() where other hw->flags are set
>  should avoid this problem recurring.
> 
> Signed-off-by: Mike McCormack <mikem@ring3k.org>
> ---
>  drivers/net/sky2.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
> index 4bb52e9..68d256b 100644
> --- a/drivers/net/sky2.c
> +++ b/drivers/net/sky2.c
> @@ -1497,7 +1497,6 @@ static int sky2_up(struct net_device *dev)
>  	if (ramsize > 0) {
>  		u32 rxspace;
>  
> -		hw->flags |= SKY2_HW_RAM_BUFFER;
>  		pr_debug(PFX "%s: ram buffer %dK\n", dev->name, ramsize);
>  		if (ramsize < 16)
>  			rxspace = ramsize / 2;
> @@ -2926,6 +2925,9 @@ static int __devinit sky2_init(struct sky2_hw *hw)
>  			++hw->ports;
>  	}
>  
> +	if (sky2_read8(hw, B2_E_0))
> +		hw->flags |= SKY2_HW_RAM_BUFFER;
> +
>  	return 0;
>  }
>  

This should go to stable tree as well.

Acked-by: Stephen Hemminger <shemminger@vyatta.com>

-- 


* Re: [PATCH][RESEND] IPv6: 6rd tunnel mode
From: Brian Haley @ 2009-09-22  2:39 UTC (permalink / raw)
  To: Alexandre Cassen; +Cc: netdev
In-Reply-To: <20090922003956.GA19947@lnxos.staff.proxad.net>

Hi Alexandre,

Alexandre Cassen wrote:
> This patch adds support for the 6rd tunnel mode, currently targeting
> the standards track at the IETF.
> 
> IPv6 rapid deployment (RFC5569) builds upon mechanisms of 6to4 (RFC3056)
> to enable a service provider to rapidly deploy IPv6 unicast service
> to IPv4 sites to which it provides customer premise equipment.  Like
> 6to4, it utilizes stateless IPv6 in IPv4 encapsulation in order to
> transit IPv4-only network infrastructure. Unlike 6to4, a 6rd service
> provider uses an IPv6 prefix of its own in place of the fixed 6to4
> prefix.

I couldn't find RFC 5569 (delayed due to IPR rights?), although I did find
the latest 6rd draft, -03.  It was showing as Informational, not Standards
track, is that right?  Just curious.

> +		case SIOCADD6RD:
> +		case SIOCCHG6RD:
> +			if (ip6rd.prefixlen >= 95) {
> +				err = -EINVAL;
> +				goto done;
> +			}
> +			t->ip6rd_prefix.addr = ip6rd.addr;

ipv6_addr_copy(&t->ip6rd_prefix.addr, &ip6rd.addr); is the preferred way to
copy the address.
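
A userspace sketch of the difference, for readers without the kernel
tree at hand. As far as I recall, the kernel helper is a thin memcpy
wrapper; addr_copy() below is a stand-in with that behaviour, not the
kernel function, and struct tunnel_6rd is a hypothetical mirror of the
struct in the patch.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Hypothetical userspace mirror of the patch's struct ip_tunnel_6rd. */
struct tunnel_6rd {
	struct in6_addr addr;
	unsigned char prefixlen;
};

/* Stand-in for the kernel helper: copy one IPv6 address. */
static void addr_copy(struct in6_addr *dst, const struct in6_addr *src)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, sizeof(*dst));
}

/* Store a requested prefix using the helper rather than a bare
 * struct assignment, which is the style asked for in the review. */
static void set_6rd_prefix(struct tunnel_6rd *t, const struct tunnel_6rd *req)
{
	addr_copy(&t->addr, &req->addr);
	t->prefixlen = req->prefixlen;
}
```

Both forms copy the same 16 bytes; the helper mainly keeps address
copies uniform and easy to grep for.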

-Brian


* [PATCH] cnic: Shutdown iSCSI ring during uio_close.
From: Michael Chan @ 2009-09-22  1:39 UTC (permalink / raw)
  To: davem; +Cc: netdev, michaelc, Michael Chan, Benjamin Li

The iSCSI ring should be shut down during uio_close instead of uio_open
for proper operation.  This fixes the problem of the ring getting
stuck intermittently.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: Benjamin Li <benli@broadcom.com>
---
 drivers/net/cnic.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/cnic.c b/drivers/net/cnic.c
index d45eacb..211c8e9 100644
--- a/drivers/net/cnic.c
+++ b/drivers/net/cnic.c
@@ -85,8 +85,6 @@ static int cnic_uio_open(struct uio_info *uinfo, struct inode *inode)
 
 	cp->uio_dev = iminor(inode);
 
-	cnic_shutdown_bnx2_rx_ring(dev);
-
 	cnic_init_bnx2_tx_ring(dev);
 	cnic_init_bnx2_rx_ring(dev);
 
@@ -98,6 +96,8 @@ static int cnic_uio_close(struct uio_info *uinfo, struct inode *inode)
 	struct cnic_dev *dev = uinfo->priv;
 	struct cnic_local *cp = dev->cnic_priv;
 
+	cnic_shutdown_bnx2_rx_ring(dev);
+
 	cp->uio_dev = -1;
 	return 0;
 }
-- 
1.6.4.GIT




* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  1:09 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	linuxppc-dev, bhutchings, prodyut hazarika, davem
In-Reply-To: <0CA0A16855646F4FA96D25A158E299D606FFE81A@SDCEXCHANGE01.ad.amcc.com>

On Mon, 2009-09-21 at 17:53 -0700, Prodyut Hazarika wrote:
> 
> In the newer revs of 460EX/GT and 405EX, we have Interrupt coalescing
> both on Tx and Rx per channel (physical not virtual), which can be
> enabled/disabled per channel via UIC. The Tx/Rx Coalesce mappings are
> defined in the dts file. But in the older revs, there is only a global
> EOP_Int_Enable in the MAL configuration register. There may be a
> way even for older SoCs if we use the MAL descriptor I bit and
> not the global EOP_Int_Enable. But to turn the channel on/off, we would
> have to go and set/clear the I bit in the whole MAL descriptor ring for
> that channel. That might be really inefficient.
> 
> What would you suggest?

I wouldn't bother with the old SoCs, we should keep the current
workaround we have today for them. For the new ones, I'll have a look
and see how we can get the driver upgraded to avoid the workaround.

Don't bother with this for now. I'll dig at some stage.

Cheers,
Ben.


* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Prodyut Hazarika @ 2009-09-22  0:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, prodyut hazarika
  Cc: netdev, Feng Kan, Loc Ho, Victor Gallardo, bhutchings,
	linuxppc-dev, davem, jwboyer, lada.podivin
In-Reply-To: <1253579943.7103.194.camel@pasglop>

Hi Ben,

> Well... the above is a HW limitation :-) IE. I was suggesting you fix
> the HW, but in the case where you already did and the current MAL in
> your SoC can indeed mask the interrupt per-channel, then that's great
> and we should definitely look into having the driver go back to a more
> standard NAPI model on MALs that have that capability.

In the newer revs of 460EX/GT and 405EX, we have Interrupt coalescing
both on Tx and Rx per channel (physical not virtual), which can be
enabled/disabled per channel via UIC. The Tx/Rx Coalesce mappings are
defined in the dts file. But in the older revs, there is only a global
EOP_Int_Enable in the MAL configuration register. There may be a
way even for older SoCs if we use the MAL descriptor I bit and
not the global EOP_Int_Enable. But to turn the channel on/off, we would
have to go and set/clear the I bit in the whole MAL descriptor ring for
that channel. That might be really inefficient.
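
A rough sketch of why the descriptor approach scales with ring size.
The descriptor layout, the bit position and the per-channel enable word
below are all hypothetical and not taken from the MAL documentation;
only the shape of the two operations is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_CTRL_INTR (1u << 0)	/* hypothetical "I" bit */

/* Hypothetical descriptor; the real MAL layout may differ. */
struct ring_desc {
	uint16_t ctrl;
	uint16_t len;
	uint32_t buf;
};

/* Masking through descriptors: one store per descriptor in the ring.
 * Returns the number of descriptors written. */
static size_t ring_set_irq(struct ring_desc *ring, size_t n, int enable)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (enable)
			ring[i].ctrl |= DESC_CTRL_INTR;
		else
			ring[i].ctrl &= (uint16_t)~DESC_CTRL_INTR;
	}
	return n;
}

/* Masking through a per-channel enable bit: a single write.
 * Returns the number of words written. */
static size_t chan_set_irq(uint32_t *enable_word, unsigned int chan, int enable)
{
	assert(chan < 32);
	if (enable)
		*enable_word |= 1u << chan;
	else
		*enable_word &= ~(1u << chan);
	return 1;
}
```

On a ring of 64 descriptors the first variant costs 64 stores per
mask/unmask, and that happens on every NAPI poll cycle.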

What would you suggest?

Thanks
Prodyut


* [PATCH] fec: Add FEC support for MX25 processor
From: Fabio Estevam @ 2009-09-22  0:41 UTC (permalink / raw)
  To: netdev; +Cc: s.hauer

Add FEC support for MX25 processor.

Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
---
 drivers/net/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index ed5741b..2bea67c 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -1875,7 +1875,7 @@ config 68360_ENET
 
 config FEC
 	bool "FEC ethernet controller (of ColdFire and some i.MX CPUs)"
-	depends on M523x || M527x || M5272 || M528x || M520x || M532x || MACH_MX27 || ARCH_MX35
+	depends on M523x || M527x || M5272 || M528x || M520x || M532x || MACH_MX27 || ARCH_MX35 || ARCH_MX25
 	help
 	  Say Y here if you want to use the built-in 10/100 Fast ethernet
 	  controller on some Motorola ColdFire and Freescale i.MX processors.
-- 
1.6.0.4


      


* [PATCH iproute2][RESEND] IPv6: 6rd iproute2 support
From: Alexandre Cassen @ 2009-09-22  0:41 UTC (permalink / raw)
  To: netdev

This patch provides iproute2 facilities to configure a 6rd tunnel. To
configure a 6rd tunnel, simply configure a sit tunnel and set the
6rd prefix as follows:

    ip tunnel add sit1 mode sit local a.b.c.d ttl 64
    ip tunnel 6rd dev sit1 set-6rd_prefix xxxx:yyyy::/z

Additionally, you can reset the 6rd_prefix:

    ip tunnel 6rd dev sit1 reset-6rd_prefix

Signed-off-by: Alexandre Cassen <acassen@freebox.fr>
---
 include/linux/if_tunnel.h |   10 ++++++++
 ip/iptunnel.c             |   53 ++++++++++++++++++++++++++++++++++++++++++++-
 ip/tunnel.c               |   17 +++++++++++++-
 ip/tunnel.h               |    2 +
 4 files changed, 80 insertions(+), 2 deletions(-)

diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 9229075..5ebe5a4 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -12,6 +12,10 @@
 #define SIOCADDPRL      (SIOCDEVPRIVATE + 5)
 #define SIOCDELPRL      (SIOCDEVPRIVATE + 6)
 #define SIOCCHGPRL      (SIOCDEVPRIVATE + 7)
+#define SIOCGET6RD      (SIOCDEVPRIVATE + 8)
+#define SIOCADD6RD      (SIOCDEVPRIVATE + 9)
+#define SIOCDEL6RD      (SIOCDEVPRIVATE + 10)
+#define SIOCCHG6RD      (SIOCDEVPRIVATE + 11)
 
 #define GRE_CSUM	__cpu_to_be16(0x8000)
 #define GRE_ROUTING	__cpu_to_be16(0x4000)
@@ -48,6 +52,12 @@ struct ip_tunnel_prl {
 /* PRL flags */
 #define	PRL_DEFAULT		0x0001
 
+/* 6RD parms */
+struct ip_tunnel_6rd {
+	struct in6_addr		addr;
+	__u8			prefixlen;
+};
+
 enum
 {
 	IFLA_GRE_UNSPEC,
diff --git a/ip/iptunnel.c b/ip/iptunnel.c
index 338d8bd..31843ad 100644
--- a/ip/iptunnel.c
+++ b/ip/iptunnel.c
@@ -38,10 +38,11 @@ static void usage(void) __attribute__((noreturn));
 
 static void usage(void)
 {
-	fprintf(stderr, "Usage: ip tunnel { add | change | del | show | prl } [ NAME ]\n");
+	fprintf(stderr, "Usage: ip tunnel { add | change | del | show | prl | 6rd } [ NAME ]\n");
 	fprintf(stderr, "          [ mode { ipip | gre | sit | isatap } ] [ remote ADDR ] [ local ADDR ]\n");
 	fprintf(stderr, "          [ [i|o]seq ] [ [i|o]key KEY ] [ [i|o]csum ]\n");
 	fprintf(stderr, "          [ prl-default ADDR ] [ prl-nodefault ADDR ] [ prl-delete ADDR ]\n");
+	fprintf(stderr, "          [ set-6rd_prefix ADDR ] [ reset-6rd_prefix ]\n");
 	fprintf(stderr, "          [ ttl TTL ] [ tos TOS ] [ [no]pmtudisc ] [ dev PHYS_DEV ]\n");
 	fprintf(stderr, "\n");
 	fprintf(stderr, "Where: NAME := STRING\n");
@@ -308,11 +309,13 @@ static int do_del(int argc, char **argv)
 
 static void print_tunnel(struct ip_tunnel_parm *p)
 {
+	struct ip_tunnel_6rd ip6rd;
 	char s1[1024];
 	char s2[1024];
 	char s3[64];
 	char s4[64];
 
+	memset(&ip6rd, 0, sizeof(ip6rd));
 	inet_ntop(AF_INET, &p->i_key, s3, sizeof(s3));
 	inet_ntop(AF_INET, &p->o_key, s4, sizeof(s4));
 
@@ -368,6 +371,13 @@ static void print_tunnel(struct ip_tunnel_parm *p)
 	if (!(p->iph.frag_off&htons(IP_DF)))
 		printf(" nopmtudisc");
 
+	if (!tnl_ioctl_get_6rd(p->name, &ip6rd) && ip6rd.prefixlen) {
+		char buf[128];
+		printf(" 6rd_prefix %s/%u ",
+		       inet_ntop(AF_INET6, &ip6rd.addr, buf, 128),
+		       ip6rd.prefixlen);
+	}
+
 	if ((p->i_flags&GRE_KEY) && (p->o_flags&GRE_KEY) && p->o_key == p->i_key)
 		printf(" key %s", s3);
 	else if ((p->i_flags|p->o_flags)&GRE_KEY) {
@@ -534,6 +544,45 @@ static int do_prl(int argc, char **argv)
 	return tnl_prl_ioctl(cmd, medium, &p);
 }
 
+static int do_6rd(int argc, char **argv)
+{
+	struct ip_tunnel_6rd ip6rd;
+	int devname = 0;
+	int cmd = 0;
+	char medium[IFNAMSIZ];
+
+	memset(&ip6rd, 0, sizeof(ip6rd));
+	memset(&medium, 0, sizeof(medium));
+
+	while (argc > 0) {
+		if (strcmp(*argv, "set-6rd_prefix") == 0) {
+			inet_prefix prefix;
+			NEXT_ARG();
+			if (get_prefix(&prefix, *argv, AF_INET6))
+				invarg("invalid 6rd_prefix\n", *argv);
+			cmd = SIOCADD6RD;
+			memcpy(&ip6rd.addr, prefix.data, 16);
+			ip6rd.prefixlen = prefix.bitlen;
+		} else if (strcmp(*argv, "reset-6rd_prefix") == 0) {
+			cmd = SIOCDEL6RD;
+		} else if (strcmp(*argv, "dev") == 0) {
+			NEXT_ARG();
+			strncpy(medium, *argv, IFNAMSIZ-1);
+			devname++;
+		} else {
+			fprintf(stderr,"%s: Invalid 6RD parameter.\n", *argv);
+			exit(-1);
+		}
+		argc--; argv++;
+	}
+	if (devname == 0) {
+		fprintf(stderr, "Must specify dev.\n");
+		exit(-1);
+	}
+
+	return tnl_6rd_ioctl(cmd, medium, &ip6rd);
+}
+
 int do_iptunnel(int argc, char **argv)
 {
 	switch (preferred_family) {
@@ -567,6 +616,8 @@ int do_iptunnel(int argc, char **argv)
 			return do_show(argc-1, argv+1);
 		if (matches(*argv, "prl") == 0)
 			return do_prl(argc-1, argv+1);
+		if (matches(*argv, "6rd") == 0)
+			return do_6rd(argc-1, argv+1);
 		if (matches(*argv, "help") == 0)
 			usage();
 	} else
diff --git a/ip/tunnel.c b/ip/tunnel.c
index d1296e6..d389e86 100644
--- a/ip/tunnel.c
+++ b/ip/tunnel.c
@@ -168,7 +168,7 @@ int tnl_del_ioctl(const char *basedev, const char *name, void *p)
 	return err;
 }
 
-int tnl_prl_ioctl(int cmd, const char *name, void *p)
+static int tnl_gen_ioctl(int cmd, const char *name, void *p)
 {
 	struct ifreq ifr;
 	int fd;
@@ -183,3 +183,18 @@ int tnl_prl_ioctl(int cmd, const char *name, void *p)
 	close(fd);
 	return err;
 }
+
+int tnl_prl_ioctl(int cmd, const char *name, void *p)
+{
+	return tnl_gen_ioctl(cmd, name, p);
+}
+
+int tnl_6rd_ioctl(int cmd, const char *name, void *p)
+{
+	return tnl_gen_ioctl(cmd, name, p);
+}
+
+int tnl_ioctl_get_6rd(const char *name, void *p)
+{
+	return tnl_gen_ioctl(SIOCGET6RD, name, p);
+}
diff --git a/ip/tunnel.h b/ip/tunnel.h
index 0661e27..ded226b 100644
--- a/ip/tunnel.h
+++ b/ip/tunnel.h
@@ -32,5 +32,7 @@ int tnl_get_ioctl(const char *basedev, void *p);
 int tnl_add_ioctl(int cmd, const char *basedev, const char *name, void *p);
 int tnl_del_ioctl(const char *basedev, const char *name, void *p);
 int tnl_prl_ioctl(int cmd, const char *name, void *p);
+int tnl_6rd_ioctl(int cmd, const char *name, void *p);
+int tnl_ioctl_get_6rd(const char *name, void *p);
 
 #endif
-- 
1.6.0.4
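
For readers who want to see what the set-6rd_prefix argument turns into,
here is a standalone sketch that parses an "ADDR/LEN" string into a
struct shaped like ip_tunnel_6rd. It is not iproute2's get_prefix();
parse_6rd_prefix() and struct tunnel_6rd are hypothetical names. The
upper bound mirrors the ">= 95" check in the kernel patch, while
rejecting a length of 0 is this sketch's own choice.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace mirror of struct ip_tunnel_6rd. */
struct tunnel_6rd {
	struct in6_addr addr;
	unsigned char prefixlen;
};

/* Parse "ADDR/LEN" into out. Returns 0 on success, -1 on bad input. */
static int parse_6rd_prefix(const char *text, struct tunnel_6rd *out)
{
	char buf[INET6_ADDRSTRLEN + 4];
	char *slash, *end;
	long len;

	assert(text != NULL && out != NULL);
	if (strlen(text) >= sizeof(buf))
		return -1;
	strcpy(buf, text);

	/* Split the string at the slash. */
	slash = strchr(buf, '/');
	if (!slash)
		return -1;
	*slash = '\0';

	/* The length must be a plain decimal number in 1..94. */
	len = strtol(slash + 1, &end, 10);
	if (end == slash + 1 || *end != '\0' || len < 1 || len >= 95)
		return -1;

	if (inet_pton(AF_INET6, buf, &out->addr) != 1)
		return -1;
	out->prefixlen = (unsigned char)len;
	return 0;
}
```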



* [PATCH][RESEND] IPv6: 6rd tunnel mode
From: Alexandre Cassen @ 2009-09-22  0:39 UTC (permalink / raw)
  To: netdev

This patch adds support for the 6rd tunnel mode, currently targeting
the standards track at the IETF.

IPv6 rapid deployment (RFC5569) builds upon mechanisms of 6to4 (RFC3056)
to enable a service provider to rapidly deploy IPv6 unicast service
to IPv4 sites to which it provides customer premise equipment.  Like
6to4, it utilizes stateless IPv6 in IPv4 encapsulation in order to
transit IPv4-only network infrastructure. Unlike 6to4, a 6rd service
provider uses an IPv6 prefix of its own in place of the fixed 6to4
prefix.

Signed-off-by: Alexandre Cassen <acassen@freebox.fr>
---
 include/linux/if_tunnel.h |   10 +++++
 include/net/ipip.h        |    2 +
 net/ipv6/Kconfig          |   13 +++++++
 net/ipv6/sit.c            |   84 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 109 insertions(+), 0 deletions(-)

diff --git a/include/linux/if_tunnel.h b/include/linux/if_tunnel.h
index 5eb9b0f..0d44376 100644
--- a/include/linux/if_tunnel.h
+++ b/include/linux/if_tunnel.h
@@ -15,6 +15,10 @@
 #define SIOCADDPRL      (SIOCDEVPRIVATE + 5)
 #define SIOCDELPRL      (SIOCDEVPRIVATE + 6)
 #define SIOCCHGPRL      (SIOCDEVPRIVATE + 7)
+#define SIOCGET6RD      (SIOCDEVPRIVATE + 8)
+#define SIOCADD6RD      (SIOCDEVPRIVATE + 9)
+#define SIOCDEL6RD      (SIOCDEVPRIVATE + 10)
+#define SIOCCHG6RD      (SIOCDEVPRIVATE + 11)
 
 #define GRE_CSUM	__cpu_to_be16(0x8000)
 #define GRE_ROUTING	__cpu_to_be16(0x4000)
@@ -51,6 +55,12 @@ struct ip_tunnel_prl {
 /* PRL flags */
 #define	PRL_DEFAULT		0x0001
 
+/* 6RD parms */
+struct ip_tunnel_6rd {
+	struct in6_addr		addr;
+	__u8			prefixlen;
+};
+
 enum
 {
 	IFLA_GRE_UNSPEC,
diff --git a/include/net/ipip.h b/include/net/ipip.h
index 5d3036f..fa92c41 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -26,6 +26,8 @@ struct ip_tunnel
 
 	struct ip_tunnel_prl_entry	*prl;		/* potential router list */
 	unsigned int			prl_count;	/* # of entries in PRL */
+
+	struct ip_tunnel_6rd	ip6rd_prefix;	/* 6RD SP prefix */
 };
 
 /* ISATAP: default interval between RS in secondy */
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index ead6c7a..78a565b 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -170,6 +170,19 @@ config IPV6_SIT
 
 	  Saying M here will produce a module called sit. If unsure, say Y.
 
+config IPV6_SIT_6RD
+	bool "IPv6: 6rd tunnel mode (EXPERIMENTAL)"
+	depends on IPV6_SIT && EXPERIMENTAL
+	default n
+	---help---
+	IPv6 rapid deployment (RFC5569) builds upon mechanisms of 6to4 (RFC3056)
+	to enable a service provider to rapidly deploy IPv6 unicast service
+	to IPv4 sites to which it provides customer premise equipment.  Like
+	6to4, it utilizes stateless IPv6 in IPv4 encapsulation in order to
+	transit IPv4-only network infrastructure. Unlike 6to4, a 6rd service
+	provider uses an IPv6 prefix of its own in place of the fixed 6to4
+	prefix.
+
 config IPV6_NDISC_NODETYPE
 	bool
 
diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
index 0ae4f64..ff62e97 100644
--- a/net/ipv6/sit.c
+++ b/net/ipv6/sit.c
@@ -604,6 +604,30 @@ static inline __be32 try_6to4(struct in6_addr *v6dst)
 	return dst;
 }
 
+#ifdef CONFIG_IPV6_SIT_6RD
+/* Returns the embedded IPv4 address if the IPv6 address comes from
+   6rd rule */
+
+static inline __be32 try_6rd(struct in6_addr *addr, u8 prefix_len, struct in6_addr *v6dst)
+{
+	__be32 dst = 0;
+
+	/* isolate addr according to mask */
+	if (ipv6_prefix_equal(v6dst, addr, prefix_len)) {
+		unsigned int d32_off, bits;
+
+		d32_off = prefix_len >> 5;
+		bits = (prefix_len & 0x1f);
+
+		dst = (ntohl(v6dst->s6_addr32[d32_off]) << bits);
+		if (bits)
+			dst |= ntohl(v6dst->s6_addr32[d32_off + 1]) >> (32 - bits);
+		dst = htonl(dst);
+	}
+	return dst;
+}
+#endif
+
 /*
  *	This function assumes it is being called from dev_queue_xmit()
  *	and that skb is filled properly by that function.
@@ -657,6 +681,13 @@ static netdev_tx_t ipip6_tunnel_xmit(struct sk_buff *skb,
 			goto tx_error;
 	}
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	if (!dst && tunnel->ip6rd_prefix.prefixlen)
+		dst = try_6rd(&tunnel->ip6rd_prefix.addr,
+			      tunnel->ip6rd_prefix.prefixlen,
+			      &iph6->daddr);
+	else
+#endif
 	if (!dst)
 		dst = try_6to4(&iph6->daddr);
 
@@ -848,6 +879,9 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 	int err = 0;
 	struct ip_tunnel_parm p;
 	struct ip_tunnel_prl prl;
+#ifdef CONFIG_IPV6_SIT_6RD
+	struct ip_tunnel_6rd ip6rd;
+#endif
 	struct ip_tunnel *t;
 	struct net *net = dev_net(dev);
 	struct sit_net *sitn = net_generic(net, sit_net_id);
@@ -987,6 +1021,56 @@ ipip6_tunnel_ioctl (struct net_device *dev, struct ifreq *ifr, int cmd)
 		netdev_state_change(dev);
 		break;
 
+#ifdef CONFIG_IPV6_SIT_6RD
+	case SIOCGET6RD:
+		err = -EINVAL;
+		if (dev == sitn->fb_tunnel_dev)
+			goto done;
+		err = -ENOENT;
+		if (!(t = netdev_priv(dev)))
+			goto done;
+		memcpy(&ip6rd, &t->ip6rd_prefix, sizeof(ip6rd));
+		if (copy_to_user(ifr->ifr_ifru.ifru_data, &ip6rd, sizeof(ip6rd)))
+			err = -EFAULT;
+		else
+			err = 0;
+		break;
+
+	case SIOCADD6RD:
+	case SIOCDEL6RD:
+	case SIOCCHG6RD:
+		err = -EPERM;
+		if (!capable(CAP_NET_ADMIN))
+			goto done;
+		err = -EINVAL;
+		if (dev == sitn->fb_tunnel_dev)
+			goto done;
+		err = -EFAULT;
+		if (copy_from_user(&ip6rd, ifr->ifr_ifru.ifru_data, sizeof(ip6rd)))
+			goto done;
+		err = -ENOENT;
+		if (!(t = netdev_priv(dev)))
+			goto done;
+
+		err = 0;
+		switch (cmd) {
+		case SIOCDEL6RD:
+			memset(&t->ip6rd_prefix, 0, sizeof(ip6rd));
+			break;
+		case SIOCADD6RD:
+		case SIOCCHG6RD:
+			if (ip6rd.prefixlen >= 95) {
+				err = -EINVAL;
+				goto done;
+			}
+			t->ip6rd_prefix.addr = ip6rd.addr;
+			t->ip6rd_prefix.prefixlen = ip6rd.prefixlen;
+			break;
+		}
+		netdev_state_change(dev);
+		break;
+#endif
+
 	default:
 		err = -EINVAL;
 	}
-- 
1.6.0.4
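
A userspace sketch of the address mapping that try_6rd() performs: if
the destination falls under the 6rd prefix, the 32 bits that follow the
prefix are taken as the IPv4 tunnel endpoint. This is not the patch's
code. It swaps the 32-bit word shifts for a byte-wise window, the
function names are made up for the example, and it accepts prefix
lengths up to 96 where the patch's ioctl stops at 94.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Read the 32 bits starting at bit offset `bit` (bit 0 is the most
 * significant bit of byte 0) from a 16-byte address. */
static uint32_t bits32_at(const uint8_t a[16], unsigned int bit)
{
	unsigned int byte = bit / 8, shift = bit % 8, i;
	uint64_t window = 0;

	assert(bit + 32 <= 128);
	/* Gather five bytes (40 bits) so a field that does not start on
	 * a byte boundary still fits; bytes past the end read as zero. */
	for (i = 0; i < 5; i++) {
		window <<= 8;
		if (byte + i < 16)
			window |= a[byte + i];
	}
	/* Skip `shift` leading bits, keep the next 32. */
	return (uint32_t)(window >> (8 - shift));
}

/* Returns the embedded IPv4 address in network byte order, or 0 if
 * dst is not under the prefix or the prefix length is unusable. */
static uint32_t extract_6rd_v4(const struct in6_addr *prefix,
			       unsigned int prefix_len,
			       const struct in6_addr *dst)
{
	unsigned int full = prefix_len / 8, rem = prefix_len % 8;

	if (prefix_len == 0 || prefix_len > 96)
		return 0;
	/* Compare the whole bytes of the prefix, then the leftover bits. */
	if (memcmp(prefix->s6_addr, dst->s6_addr, full))
		return 0;
	if (rem) {
		uint8_t mask = (uint8_t)(0xff << (8 - rem));

		if ((prefix->s6_addr[full] ^ dst->s6_addr[full]) & mask)
			return 0;
	}
	return htonl(bits32_at(dst->s6_addr, prefix_len));
}
```

For example, with the prefix 2001:db8::/32 the destination
2001:db8:c000:201::1 maps to the IPv4 endpoint 192.0.2.1.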



* Re: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  0:39 UTC (permalink / raw)
  To: prodyut hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, Prodyut Hazarika, linuxppc-dev, davem
In-Reply-To: <49c0ff980909211728s2d39e356p6900d047c6918826@mail.gmail.com>

On Mon, 2009-09-21 at 17:28 -0700, prodyut hazarika wrote:
> > BTW. If you guys are ever going to do another change to MAL, please
> > please please, add the -one- major missing feature that's causing all the
> > pain and complication in the current design: Add a per-channel interrupt
> > masking option.
> >
> > The lack of ability to mask the interrupt per MAL channel is what forces
> > us to create that fake netdev structure in order to share the napi
> > device instance between all the EMACs in the system. This is very
> > inefficient too. We would be able to make things run a lot smoother if
> > we could just have a napi instance per EMAC, but for that, we need
> > per-channel interrupt masking.
> >
> 
> I will add a patch for the above as soon as I am done incorporating
> your comments on the MAL coalescing support.
> 
Well... the above is a HW limitation :-) IE. I was suggesting you fix
the HW, but in the case where you already did and the current MAL in
your SoC can indeed mask the interrupt per-channel, then that's great
and we should definitely look into having the driver go back to a more
standard NAPI model on MALs that have that capability.

Cheers,
Ben.


* Re: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: prodyut hazarika @ 2009-09-22  0:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, Prodyut Hazarika, linuxppc-dev, davem
In-Reply-To: <1253578361.7103.180.camel@pasglop>

Hi Ben,

>
> BTW. If you guys are ever going to do another change to MAL, please
> please please, add the -one- major missing feature that's causing all the
> pain and complication in the current design: Add a per-channel interrupt
> masking option.
>
> The lack of ability to mask the interrupt per MAL channel is what forces
> us to create that fake netdev structure in order to share the napi
> device instance between all the EMACs in the system. This is very
> inefficient too. We would be able to make things run a lot smoother if
> we could just have a napi instance per EMAC, but for that, we need
> per-channel interrupt masking.
>

I will add a patch for the above as soon as I am done incorporating
your comments on the MAL coalescing support.

Thanks
Prodyut


* Re: fanotify as syscalls
From: Randy Dunlap @ 2009-09-22  0:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Jamie Lokier, Evgeniy Polyakov, Eric Paris,
	David Miller, linux-kernel, linux-fsdevel, netdev, viro, alan,
	hch
In-Reply-To: <m1r5tzq1gf.fsf@fess.ebiederm.org>

On Mon, 21 Sep 2009 17:15:28 -0700 Eric W. Biederman wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > Quite frankly, I have _never_ever_ seen a good reason for talking to the 
> > kernel with some idiotic packet interface. It's just a fancy way to do 
> > ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy 
> > packet interfaces suddenly much better?
> 
> For working with the networking stack there are a lot of advantages because
> netlink is the interface to everything in the network stack.
> 
> There are nice properties: for example, the packet that creates a new
> interface is the same packet the kernel sends everyone to report a new
> interface.
> 
> netlink also seems to get the structured data thing right.  You can
> parse the packet even if you don't understand everything.  Each tag is
> well defined like a syscall, taking exactly one kind of argument.
> Which avoids the worst failure of ioctl in that you can't even parse
> everything, and the argument may be a linked list in the calling
> process or something else atrocious.
> 
> All of that said, syscalls are good, and I would not recommend netlink
> for anything not in the network stack.

like CONFIG_SCSI_NETLINK and CONFIG_QUOTA_NETLINK_INTERFACE  :(


---
~Randy

^ permalink raw reply

* Re: fanotify as syscalls
From: Eric W. Biederman @ 2009-09-22  0:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jamie Lokier, Evgeniy Polyakov, Eric Paris, David Miller,
	linux-kernel, linux-fsdevel, netdev, viro, alan, hch
In-Reply-To: <alpine.LFD.2.01.0909170934450.4950@localhost.localdomain>

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Quite frankly, I have _never_ever_ seen a good reason for talking to the 
> kernel with some idiotic packet interface. It's just a fancy way to do 
> ioctl's, and everybody knows that ioctl's are bad and evil. Why are fancy 
> packet interfaces suddenly much better?

For working with the networking stack there are a lot of advantages because
netlink is the interface to everything in the network stack.

There are nice things like the packet to create a new interface is the same
packet the kernel sends everyone to report a new interface etc.

netlink also seems to get the structured data thing right.  You can
parse the packet even if you don't understand everything.  Each tag is
well defined like a syscall, taking exactly one kind of argument.
Which avoids the worst failure of ioctl in that you can't even parse
everything, and the argument may be a linked list in the calling
process or something else atrocious.
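
A minimal sketch of that property, assuming a simplified length/type
header modelled on the netlink attribute layout. The struct, the tag
names and the helpers below are illustrative only, not the kernel's API.
Because every attribute carries its own length, a parser can step over a
tag it does not know and still read the ones it does:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative header: len covers header + payload, padding excluded. */
struct tlv_hdr {
	uint16_t len;
	uint16_t type;
};

/* Attributes start on 4-byte boundaries. */
#define TLV_ALIGN(n) (((size_t)(n) + 3u) & ~(size_t)3u)

/* Hypothetical tags; each takes exactly one kind of argument (a u32). */
enum { TLV_IFINDEX = 1, TLV_MTU = 2 };

struct link_info {
	uint32_t ifindex;
	uint32_t mtu;
	unsigned int skipped;	/* attributes this parser did not understand */
};

/* Append one attribute at offset off; returns the offset after padding. */
static size_t tlv_put(unsigned char *buf, size_t off, uint16_t type,
		      const void *data, uint16_t dlen)
{
	struct tlv_hdr h;
	size_t padded;

	h.len = (uint16_t)(sizeof(h) + dlen);
	h.type = type;
	padded = TLV_ALIGN(h.len);
	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + sizeof(h), data, dlen);
	memset(buf + off + h.len, 0, padded - h.len);
	return off + padded;
}

static int tlv_get_u32(const unsigned char *payload, size_t plen,
		       uint32_t *dst)
{
	if (plen != sizeof(*dst))
		return -1;
	memcpy(dst, payload, sizeof(*dst));
	return 0;
}

/* Returns 0 on success, -1 if the buffer is malformed. */
static int tlv_parse(const unsigned char *buf, size_t buflen,
		     struct link_info *out)
{
	size_t off = 0;

	memset(out, 0, sizeof(*out));
	while (buflen - off >= sizeof(struct tlv_hdr)) {
		struct tlv_hdr h;
		const unsigned char *payload;
		size_t plen, step;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > buflen - off)
			return -1;
		payload = buf + off + sizeof(h);
		plen = h.len - sizeof(h);

		switch (h.type) {
		case TLV_IFINDEX:
			if (tlv_get_u32(payload, plen, &out->ifindex))
				return -1;
			break;
		case TLV_MTU:
			if (tlv_get_u32(payload, plen, &out->mtu))
				return -1;
			break;
		default:
			/* Unknown tag: the length alone is enough to skip it. */
			out->skipped++;
			break;
		}

		step = TLV_ALIGN(h.len);
		if (step > buflen - off)
			step = buflen - off;	/* last attribute may lack padding */
		off += step;
	}
	return off == buflen ? 0 : -1;
}
```

Building a buffer with tlv_put() that holds TLV_IFINDEX, some tag the
parser has never heard of, and TLV_MTU still parses cleanly; only the
skipped counter records that something was passed over.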

All of that said syscalls are good, and I would not recommend netlink
to anything not in the network stack.

Eric

^ permalink raw reply

* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  0:12 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, linuxppc-dev, davem
In-Reply-To: <0CA0A16855646F4FA96D25A158E299D606FFE802@SDCEXCHANGE01.ad.amcc.com>

On Mon, 2009-09-21 at 17:05 -0700, Prodyut Hazarika wrote:
> Hi Ben,
> Thanks again for your comments.
> 
> > Same goes with the SDR register definitions. Prefix them with the SOC
> > name but don't make them conditionally compiled.
> 
> I will add the base address in the Device tree, and make all register
> definitions based on offset from the base in the next version of this
> patch.

That's a good idea. In fact, you can also use the dcr_read/write
variants of the accessors rather than the low level mfdcri/mtdcri. This
wouldn't make much of a difference unless you ever release a SoC with
those same registers behind an MMIO mapping but it's cleaner.

> Thanks for this comment. I will hookup ethtool with the EMAC driver, but
> the MAL driver will come up with default coalesce options (as defined in
> the appropriate defconfig file). The user will be able to change these
> parameters as needed using ethtool.

That's ok. I don't have an objection in using Kconfig to set the
defaults.

> I will get all the changes in place in the next version of this patch.

Thanks !

BTW. If you guys are ever going to do another change to MAL, please
please please, add the -one- major missing feature that's causing all the
pain and complication in the current design: Add a per-channel interrupt
masking option.

The lack of ability to mask the interrupt per MAL channel is what forces
us to create that fake netdev structure in order to share the napi
device instance between all the EMACs in the system. This is very
inefficient too. We would be able to make things run a lot smoother if
we could just have a napi instance per EMAC, but for that, we need
per-channel interrupt masking.

Cheers,
Ben.

^ permalink raw reply

* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-22  0:07 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, linuxppc-dev, davem
In-Reply-To: <0CA0A16855646F4FA96D25A158E299D606FFE7FF@SDCEXCHANGE01.ad.amcc.com>

On Mon, 2009-09-21 at 16:49 -0700, Prodyut Hazarika wrote:
> Hi Ben,
> Thanks for your comments.
> 
> 
> > What happens if we build a kernel that is supposed to boot with two
> > different variants of 405 or 440 ?
> 
> We cannot build a kernel with H/W Interrupt coalescing other than in
> 405EX/460EX/GT.
> This is controlled via KConfig (config IBM_NEW_EMAC_INTR_COALESCE
> depends on IBM_NEW_EMAC && (460EX || 460GT || 405EX))
> Is this approach acceptable (via Kconfig)?

No. That's my point. All of this must be runtime options. The kernel
must be buildable for 460EX -and- 460GT - and an old 440EP if I want to
in a single image, and this -with- the coalescing option enabled. It
would obviously only be available when running on the cores that support
it, but it should -not- be a compile time decision.

IE. All your ifdef's should be turned into runtime checks. If you have
conflicting #define for register names and bits, then prefix them with
the SoC name.
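
Roughly something like the sketch below. It is a userspace mock, not
driver code: the compatible strings are made up, and fake_dcr_write()
merely stands in for mtdcri(). The per-SoC register counts follow the
quoted patch (one control register per direction on 405EX, two on 460EX,
four on 460GT). Every variant is compiled in and the choice is made by
table lookup at probe time:

```c
#include <stddef.h>
#include <string.h>

/* One entry per supported SoC; every entry is built into every kernel. */
struct mal_coal_desc {
	const char *compatible;	/* hypothetical compatible strings */
	unsigned int ctrl_regs;	/* coalescing control registers per direction */
};

static const struct mal_coal_desc mal_coal_table[] = {
	{ "example,mal-405ex", 1 },
	{ "example,mal-460ex", 2 },
	{ "example,mal-460gt", 4 },
};

/* Returns the descriptor for this SoC, or NULL when it has no coalescing. */
static const struct mal_coal_desc *mal_coal_lookup(const char *compatible)
{
	size_t i;

	for (i = 0; i < sizeof(mal_coal_table) / sizeof(mal_coal_table[0]); i++)
		if (strcmp(mal_coal_table[i].compatible, compatible) == 0)
			return &mal_coal_table[i];
	return NULL;
}

/* Stand-in for mtdcri(): only counts how many registers would be written. */
struct fake_dcr {
	unsigned int writes;
};

static void fake_dcr_write(struct fake_dcr *dcr)
{
	dcr->writes++;
}

/*
 * Returns 1 if coalescing was set up, 0 if the caller has to fall back
 * to the EOB IRQ. No #ifdef on the CPU type anywhere.
 */
static int mal_enable_coal(const char *compatible, struct fake_dcr *dcr)
{
	const struct mal_coal_desc *d = mal_coal_lookup(compatible);
	unsigned int i;

	if (!d)
		return 0;

	for (i = 0; i < d->ctrl_regs; i++) {
		fake_dcr_write(dcr);	/* flush TX counter i */
		fake_dcr_write(dcr);	/* flush RX counter i */
	}
	return 1;
}
```

A single image then handles a 460GT (eight flush writes), a 405EX (two)
and an old part with no entry in the table (none, fall back).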

The only acceptable compile-time option is to have the ability to not
compile the coalescing support at all, thus avoiding bloat when building
configs that are only targeted toward processors that don't have it or
setups that don't want it. 

> > There are existing mechanisms via ethtool to configure coalescing. You
> > should hookup onto these.
> 
> I will start looking at the ethtool options

Thanks.

Cheers,
Ben.

^ permalink raw reply

* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Prodyut Hazarika @ 2009-09-22  0:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: netdev, Feng Kan, Loc Ho, Victor Gallardo, bhutchings,
	linuxppc-dev, davem, jwboyer, lada.podivin
In-Reply-To: <1253576514.7103.165.camel@pasglop>

Hi Ben,
Thanks again for your comments.

> Same goes with the SDR register definitions. Prefix them with the SOC
> name but don't make them conditionally compiled.

I will add the base address in the Device tree, and make all register
definitions based on offset from the base in the next version of this
patch.

> Also, this coalescing option, while it makes sense to have a CONFIG
> option to compile in the support for it or not, the choice to use
> coalescing or not should be done at runtime. Same goes with the various
> thresholds which should be runtime configurable.

Thanks for this comment. I will hookup ethtool with the EMAC driver, but
the MAL driver will come up with default coalesce options (as defined in
the appropriate defconfig file). The user will be able to change these
parameters as needed using ethtool.

I will get all the changes in place in the next version of this patch.

Thanks
Prodyut

^ permalink raw reply

* Re: fanotify as syscalls
From: Jamie Lokier @ 2009-09-21 23:56 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Eric Paris, Linus Torvalds, Evgeniy Polyakov, David Miller,
	linux-kernel, linux-fsdevel, netdev, viro, alan, hch
In-Reply-To: <200909220109.05995.agruen@suse.de>

Andreas Gruenbacher wrote:
> If the antimalware vendors want to base their decisions on pathnames then 
> that's their decision, and they can check /proc/self/fd/N.

Race hazards and loopholes.  It doesn't work.

> Waiting for your code to demonstrate; an object based cache (e.g.,
> st_dev + st_ino) rather than a pathname based cache would seem more
> reasonable.

Nearly everything that people do with files involves paths.  The point
is to cache what people (or their programs) do.  Apache does not
consult inodes by number, and rsync does not write inodes by number :-)
Yes, to the code...

> > > but I see no need for access decisions on them.
> >
> > Please excuse me; I'm a bit confused.  Is fanotify intended just for
> > use by access decision programs, or is the plan now for it to also be
> > a replacement for inotify?  I'm getting conflicting signals about
> > that.
> 
> Inotify doesn't support access decisions. So where's the problem with 
> having "notify only" events for directory / mount / unmount events?

No problem here.

You seemed to be saying you want to add directory events to fanotify.
But if fanotify is only intended for access decisions?  Something I
must have misunderstood in that.

> > If it's just for access decision programs, and if those aren't going
> > to care about location, then there's no need to add directory events
> > to fanotify at all.  But then I'll be demanding subtree support in
> > inotify, please :-)
> >
> > > Even less so for mounts and unmounts.
> >
> >    (as root) mkdir foo; mount dodgy foo -oloop; mount --bind foo/cat /bin/cat
> 
> ... and then someone accesses /bin/cat, which triggers a fanotify access 
> decision.

That's fine as long as there was no location-awareness in the logic
which checked foo/innocent.txt and set that inode's "read-ok,cache-me" bit.

Mount only matters if you're sensitive to location.  If you think
location-independent checks make good anti-malware
I_have_a_bridge_to_sell^H^H^H^H^H^H^H^H^H^H^Hfine with me :-)

-- Jamie

^ permalink raw reply

* RE: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Prodyut Hazarika @ 2009-09-21 23:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Victor Gallardo, Feng Kan, netdev, lada.podivin, Loc Ho,
	bhutchings, linuxppc-dev, davem
In-Reply-To: <1253576514.7103.165.camel@pasglop>

Hi Ben,
Thanks for your comments.

> What happens if we build a kernel that is supposed to boot with two
> different variants of 405 or 440 ?

We cannot build a kernel with H/W Interrupt coalescing other than in
405EX/460EX/GT.
This is controlled via KConfig (config IBM_NEW_EMAC_INTR_COALESCE
depends on IBM_NEW_EMAC && (460EX || 460GT || 405EX))
Is this approach acceptable (via Kconfig)?

> There are existing mechanisms via ethtool to configure coalescing. You
> should hookup onto these.

I will start looking at the ethtool options

Thanks
Prodyut

^ permalink raw reply

* Re: [PATCH 1/2] ibm_newemac: Add Support for MAL Interrupt Coalescing
From: Benjamin Herrenschmidt @ 2009-09-21 23:41 UTC (permalink / raw)
  To: Prodyut Hazarika
  Cc: netdev, Feng Kan, Loc Ho, Victor Gallardo, bhutchings,
	linuxppc-dev, davem, jwboyer, lada.podivin
In-Reply-To: <1253573245-1867-1-git-send-email-phazarika@amcc.com>

On Mon, 2009-09-21 at 15:47 -0700, Prodyut Hazarika wrote:
> Support for Hardware Interrupt coalescing in MAL.
> Coalescing is supported on the newer revs of 460EX/GT and 405EX.
> The MAL driver falls back to EOB IRQ if coalescing not supported
> 
> Signed-off-by: Prodyut Hazarika <phazarika@amcc.com>
> Acked-by: Victor Gallardo <vgallardo@amcc.com>
> Acked-by: Feng Kan <fkan@amcc.com>

There's an awful lot of ifdef based on the CPU type in there. This is
not right.

What happens if we build a kernel that is supposed to boot with two
different variants of 405 or 440 ?

All of this should be runtime features.

ie:

> #ifdef CONFIG_IBM_NEW_EMAC_INTR_COALESCE
> +static inline void mal_enable_coal(struct mal_instance *mal)
> +{
> +	unsigned int val;
> +#if defined(CONFIG_405EX)
> +	/* Clear the counters */
> +	val = SDR0_ICC_FLUSH0 | SDR0_ICC_FLUSH1;
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX, val);
> +
> +	/* Set Tx/Rx Timer values */
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX0, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX1, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX0, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX1, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +
> +	/* Enable the Tx/Rx Coalescing interrupt */
> +	val = ((CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR0_SHIFT) |
> +		((CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR1_SHIFT);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX, val);
> +
> +	val = ((CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR0_SHIFT) |
> +		((CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR1_SHIFT);
> +
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX, val);
> +#elif defined(CONFIG_460EX) || defined(CONFIG_460GT)
> +	/* Clear the counters */
> +	val = SDR0_ICC_FLUSH;
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX0, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX1, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX0, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX1, val);
> +#if defined(CONFIG_460GT)
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX2, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX3, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX2, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX3, val);
> +#endif
> +
> +	/* Set Tx/Rx Timer values */
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX0, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX1, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX0, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX1, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +#if defined(CONFIG_460GT)
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX2, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRTX3, CONFIG_IBM_NEW_EMAC_TX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX2, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +	mtdcri(SDR0, DCRN_SDR0_ICCTRRX3, CONFIG_IBM_NEW_EMAC_RX_COAL_TIMER);
> +#endif
> +
> +	/* Enable the Tx/Rx Coalescing interrupt */
> +	val = (CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR_SHIFT;
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX0, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX1, val);
> +#if defined(CONFIG_460GT)
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX2, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRTX3, val);
> +#endif
> +
> +	val = (CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT & COAL_FRAME_MASK)
> +			<< SDR0_ICC_FTHR_SHIFT;
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX0, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX1, val);
> +#if defined(CONFIG_460GT)
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX2, val);
> +	mtdcri(SDR0, DCRN_SDR0_ICCRRX3, val);
> +#endif
> +#endif
> +	printk(KERN_INFO "MAL: Enabled Intr Coal TxCnt: %d RxCnt: %d\n",
> +		CONFIG_IBM_NEW_EMAC_TX_COAL_COUNT,
> +		CONFIG_IBM_NEW_EMAC_RX_COAL_COUNT);
> +}
> +#endif

This is all quite wrong. Either use MAL features or some other runtime
check, possibly via the "compatible" property.

Same goes with the SDR register definitions. Prefix them with the SOC
name but don't make them conditionally compiled. This is all back to the
same mess we had in arch/ppc and I'm not going to accept it.

Also, this coalescing option, while it makes sense to have a CONFIG
option to compile in the support for it or not, the choice to use
coalescing or not should be done at runtime. Same goes with the various
thresholds which should be runtime configurable.

There are existing mechanisms via ethtool to configure coalescing. You
should hookup onto these.
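
A rough sketch of that split, with the build-time configuration
supplying only the defaults and a runtime setter doing the rest. The
field names follow struct ethtool_coalesce, but the struct here is a
local mock, and the default values and the width of COAL_FRAME_MASK are
placeholders, not taken from the patch or the hardware:

```c
#include <stdint.h>

/* Field names mirror a subset of struct ethtool_coalesce; local mock. */
struct coal_params {
	uint32_t rx_coalesce_usecs;
	uint32_t rx_max_coalesced_frames;
	uint32_t tx_coalesce_usecs;
	uint32_t tx_max_coalesced_frames;
};

/* Placeholder values (assumptions), standing in for the Kconfig defaults. */
#define COAL_DEFAULT_USECS	100u
#define COAL_DEFAULT_FRAMES	16u
/* Placeholder width of the frame threshold field (assumption). */
#define COAL_FRAME_MASK		0x1ffu

/* What the driver comes up with before anyone touches the settings. */
static void coal_init_defaults(struct coal_params *p)
{
	p->rx_coalesce_usecs = COAL_DEFAULT_USECS;
	p->rx_max_coalesced_frames = COAL_DEFAULT_FRAMES;
	p->tx_coalesce_usecs = COAL_DEFAULT_USECS;
	p->tx_max_coalesced_frames = COAL_DEFAULT_FRAMES;
}

/*
 * Runtime update, the kind of thing a set_coalesce hook would do.
 * Rejects thresholds that do not fit the register field instead of
 * silently masking them. Returns 0 on success, -1 on a bad request.
 */
static int coal_set(struct coal_params *cur, const struct coal_params *req)
{
	if (req->rx_max_coalesced_frames > COAL_FRAME_MASK ||
	    req->tx_max_coalesced_frames > COAL_FRAME_MASK)
		return -1;

	*cur = *req;
	return 0;
}
```

The thresholds then live in per-device state that can be changed after
boot, and a rejected request leaves the current settings untouched.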


Cheers,
Ben.




^ permalink raw reply

* Re: fanotify as syscalls
From: Jamie Lokier @ 2009-09-21 23:12 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Andreas Gruenbacher, Eric Paris, Linus Torvalds, Evgeniy Polyakov,
	David Miller, Linux Kernel Mailing List, linux-fsdevel, netdev,
	viro, alan, hch
In-Reply-To: <alpine.DEB.2.00.0909211456180.1116@makko.or.mcafeemobile.com>

Davide Libenzi wrote:
> On Mon, 21 Sep 2009, Jamie Lokier wrote:
> 
> > I think so too, and that'd be a great all round solution.
> 
> If this is for anti-malware vendors

Personally I'm not interested in anti-malware, and am simply
interested in leveraging fsnotify improvements to accelerate userspace
caches of information which depends on files (indexes, templates,
compiler caches, stat caches etc.).  Basically make inotify better,
and sufficiently correct for that purpose.

My sticking my oar in lately is to ensure the fsnotify improvements
are going in the (imho) right direction.  There's a lot of interesting
apps waiting in the wings on this.  It doesn't have to be complicated,
just... sensible.

> to intercept userspace accesses 
> they're currently doing it by hacking the syscall table, why don't we 
> offer a way to monitor syscalls (kernel side) in a non racy way?
> Modules can [un]register themselves for syscall interception, and receive
> the syscall number and parameters. They'd be able to change parameters,
> return error codes, and so on.
> The cost of the check in the syscall path could even be under an
> alternative-like patching, if really needed.
> The Pros of this would be:
> 
> - The kernel code to implement this would be trivially small, with no 
>   I-need-this-feature-too growth potential

(Fwiw, the {fa,fs,i}notify thing looks to me like it's getting simpler
as we go.  Good design = decrease complexity + increase versatility.
E.g. see epoll.)

> - There won't be any externally visible API to maintain (and its kernel 
>   counter part) and expand
> 
> - Any system call can be intercepted, allowing it to be flexible while 
>   leaving the burden of the interception handling, and communication with 
>   userspace policy enforcers, to the anti-malware (or whoever really) 
>   companies modules
> 
> The anti-malware are already doing this (intercepting syscall), they 
> already have code for it, and they always did (writing kernel 
> modules/drivers, that is) for Windows.

I don't mind at all if fanotify is replaced by a general purpose "take
over the system call table" solution for anti-malware, and I still get
to keep the fsnotify improvements :-)

But I can't help noticing that we _already_ have quite well placed
hooks for intercepting system calls, called security_this and
security_that (SELinux etc), albeit they can't redirect things so much.

However, being a little kinder, I suspect even the anti-malware
vendors would rather not slow down everything with race-prone
complicated tracking of everything every process does...  which is why
fanotify allows its "interest set" to be reduced from everything to a
subset of files, and its results to be cached, and let the races be
handled in the normal way by VFS.

Once you have an "interest set" and focus on files, it looks somewhat
reasonable to use the fsnotify hooks.

...That is, if you believe monitoring files is the best approach to
anti-malware.  I can't help noticing that on (ahem) Windows, running
just a "virus checker" which generically scans every file independent
of its location looking for signatures and keeping up with patches is
no longer considered good enough.

-- Jamie

^ permalink raw reply
